A simple framework for corpus-based keyphrase extraction

This post outlines a simple framework for identifying and extracting keyphrases from the component texts of a corpus. We first consider some functional characteristics of descriptive keyphrases, as well as some more formal (i.e., regex-based) definitions.

We then demonstrate the use of corpuslingr for identifying potential keyphrases in an annotated corpus, and present an unsupervised (and well-established) methodology (tf-idf weights) for extracting descriptive keyphrases for each text.

The Slate Magazine corpus from the corpusdatr package is used here for demo purposes.

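library(corpusdatr) #devtools::install_github("jaytimm/corpusdatr")
library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
library(dplyr) # provides %>% and the data-manipulation verbs used below
library(ggplot2) # used for the pattern-frequency plot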

Defining potential keyphrases

Ideally, keyphrases are semantically rich noun phrases that shed light on the content of a particular text. For illustrative purposes, three noun phrases of varying degrees of complexity and informativeness are presented below:

  • flowers
  • pretty flowers
  • pretty flowers in suburban Poughkeepsie

The first consists solely of a plural noun; the second consists of a noun modified by an adjective; the third consists of a noun phrase modified by a prepositional phrase (which contains another noun phrase). By specifying both the type and the “where” of the flowers, the third would seem to have the most descriptive utility to someone perusing content (or some algorithm classifying texts) via keyphrases.

From a regex perspective, then, we want to create a search template that is as greedy as possible when it comes to noun phrase constituency; in other words, while we will settle for unmodified noun forms as keyphrases, we prefer highly modified ones. And we are not interested in pronominal forms.

So, we define a noun phrase as “zero or more adjectives followed by one or more nouns” and define potential keyphrases as follows:

  • Noun phrase + ( preposition + Noun phrase )

where the prepositional phrase is optional. This schema maps generically to the regex below (per Penn Treebank POS codes):

nounPhrase <- "(JJ[A-Z]* )*(NN[A-Z]* )+" 
prepPhrase <- paste0("((IN )",nounPhrase,")?")
keyPhrase <- paste0(nounPhrase,prepPhrase)
## [1] "(JJ[A-Z]* )*(NN[A-Z]* )+((IN )(JJ[A-Z]* )*(NN[A-Z]* )+)?"
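As a quick sanity check on the greediness of this template, the sketch below applies the regex to a few hand-built, space-delimited tag strings using base R. The tag strings are made up for illustration (they correspond to the three flower examples above, the last one embedded in a longer hypothetical sentence); how corpuslingr represents a text internally is not shown here.

```r
nounPhrase <- "(JJ[A-Z]* )*(NN[A-Z]* )+"
prepPhrase <- paste0("((IN )", nounPhrase, ")?")
keyPhrase  <- paste0(nounPhrase, prepPhrase)

# Hypothetical tag strings; every tag is followed by a single space
tagStrings <- c(
  flowers        = "NNS ",
  pretty_flowers = "JJ NNS ",
  in_context     = "DT JJ NNS IN JJ NNP VBD "
)

# First (leftmost) match of the keyphrase template in each tag string
firstMatch <- vapply(
  tagStrings,
  function(x) regmatches(x, regexpr(keyPhrase, x)),
  character(1)
)
print(firstMatch)
```

In the third string the match skips the determiner, runs through the prepositional phrase, and stops before the verb, which is the "as greedy as possible" behavior we are after.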

Using the simplifying CQL made available via corpuslingr, the above regex is written as:

keyPhrase <- "(ADJ )*(NOUNX )+((PREP )(ADJ )*(NOUNX )+)?"
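To make the correspondence between the two notations concrete, here is a minimal sketch that expands the shorthand back into the Penn Treebank regex by plain string substitution. The mapping table and the expandShorthand helper are hypothetical and written only for this illustration; this is not how corpuslingr itself translates CQL.

```r
# Assumed shorthand-to-regex mapping, inferred from the two patterns above
shorthand <- c(ADJ = "JJ[A-Z]*", NOUNX = "NN[A-Z]*", PREP = "IN")

# Hypothetical helper: replace each shorthand label with its regex fragment
expandShorthand <- function(pattern, map = shorthand) {
  for (label in names(map)) {
    pattern <- gsub(label, map[[label]], pattern, fixed = TRUE)
  }
  pattern
}

expanded <- expandShorthand("(ADJ )*(NOUNX )+((PREP )(ADJ )*(NOUNX )+)?")
print(expanded)
```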

Corpus search for potential keyphrases

Per this definition, the next step is to search the Slate Magazine corpus for potential keyphrases. So, we first set the corpus for search (within the corpuslingr framework) using the clr_set_corpus function:

slate <- corpusdatr::cdr_slate_ann %>%
  corpuslingr::clr_set_corpus()

Then we use the corpuslingr::clr_search_gramx function to extract all potential keyphrases from each text in the corpus:

kps <- corpuslingr::clr_search_gramx(search=keyPhrase, corp=slate) 

Example output returned by clr_search_gramx:

##    doc_id              token      tag            lemma
## 1:      1             rulers      NNS            ruler
## 2:      1              world       NN            world
## 3:      1 populous countries   JJ NNS populous country
## 4:      1      hold on power NN IN NN    hold on power
## 5:      1          Indonesia      NNP        Indonesia
## 6:      1            Suharto      NNP          Suharto
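For readers curious about what a search of this kind involves, the following base R sketch runs the keyphrase regex over the tag sequence of a single annotated sentence and maps each match back to its tokens. The sentence, the data frame layout, and all object names are assumptions made for illustration; this is a toy stand-in, not the clr_search_gramx implementation.

```r
keyPhrase <- "(JJ[A-Z]* )*(NN[A-Z]* )+((IN )(JJ[A-Z]* )*(NN[A-Z]* )+)?"

# Hypothetical annotated sentence, one row per token
ann <- data.frame(
  token = c("rulers", "of", "populous", "countries", "keep",
            "their", "hold", "on", "power"),
  tag   = c("NNS", "IN", "JJ", "NNS", "VBP", "PRP$", "NN", "IN", "NN"),
  stringsAsFactors = FALSE
)

# Space-delimited tag string, plus the character offset where each tag starts
tagString <- paste0(paste(ann$tag, collapse = " "), " ")
tagStart  <- cumsum(c(1, nchar(ann$tag[-nrow(ann)]) + 1))

# Locate every match (this toy example assumes at least one exists)
m <- gregexpr(keyPhrase, tagString)[[1]]
matchEnd <- m + attr(m, "match.length") - 1

# Recover the tokens whose tags start inside each matched span
toyKps <- do.call(rbind, lapply(seq_along(m), function(i) {
  idx <- which(tagStart >= m[i] & tagStart <= matchEnd[i])
  data.frame(
    token = paste(ann$token[idx], collapse = " "),
    tag   = paste(ann$tag[idx], collapse = " "),
    stringsAsFactors = FALSE
  )
}))
print(toyKps)
```

The result has one row per matched phrase, with token and tag columns analogous to those in the output above.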

The plot below illustrates the top fifteen instantiations of our keyphrase regex search in the Slate Magazine corpus. While the top two instantiations are unmodified noun phrases, multi-word noun phrases constitute a sizable portion of potential keyphrases as well.

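kps %>%
  corpuslingr::clr_get_freq(agg_var = 'tag') %>%
  # keep the fifteen most frequent tag patterns
  dplyr::slice_max(txtf, n = 15, with_ties = FALSE) %>%
  ggplot(aes(x=reorder(tag, txtf), y=txtf)) + 
    geom_col(width=.65, fill="cyan4") +  
    theme_bw() +
    labs(title = "Top 15 keyphrases by pattern type")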

Selecting descriptive keyphrases with the tf-idf statistic

The term frequency-inverse document frequency (tf-idf) statistic is a super simple, unsupervised approach to keyphrase extraction. As a metric, it is meant to capture (or weight) how frequent a given phrase is in a text (i.e., text frequency) relative to how dispersed the phrase is across the documents comprising the corpus (i.e., document frequency).

Phrases occurring more frequently in a given text than we would expect based on their document frequency receive higher weights; theoretically, such phrases shed light on the content of a given text (relative to the content of the corpus as a whole).

The tf-idf weight for a given keyphrase in a given document, then, is calculated as the product of text frequency, tf, and inverse document frequency, idf, where the latter is logarithmically transformed.
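The calculation can be sketched on a toy table in base R. The counts, document lengths, and phrases below are invented for illustration, and the column names simply echo the ones used in this post. Note one deliberate simplification: the sketch uses the plain ratio docsInCorpus/docf inside the logarithm, so that a phrase found in every document of this tiny corpus receives a weight of exactly zero; smoothed variants (for example, adding 1 to docf) are common as well.

```r
# Hypothetical counts: one row per (document, keyphrase) pair
counts <- data.frame(
  doc_id = c("1", "1", "2", "2", "3"),
  lemma  = c("hold on power", "world", "world", "tax cut", "world"),
  txtf   = c(3, 2, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Assumed document lengths (in tokens), keyed by doc_id
textLength   <- c("1" = 100, "2" = 80, "3" = 50)
docsInCorpus <- length(textLength)

# Document frequency: number of documents in which each keyphrase occurs
docf <- table(counts$lemma)
counts$docf <- as.integer(docf[counts$lemma])

# tf: frequency relative to document length; idf: log of inverse document frequency
counts$tf     <- counts$txtf / unname(textLength[counts$doc_id])
counts$idf    <- log(docsInCorpus / counts$docf)
counts$tf_idf <- counts$tf * counts$idf

print(counts[order(-counts$tf_idf), ])
```

Here "world" occurs in all three documents and so carries no weight, while "tax cut" and "hold on power", each confined to a single document, rise to the top.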

Here we compute text frequency and document frequency for each keyphrase, and join metadata from cdr_slate_meta, which includes text titles and descriptives. Frequencies are aggregated by keyphrase lemma constituents.

kpsAgg <- kps %>%

Based on these two (relative) frequencies, we compute tf-idf values for each keyphrase in each document, collapse the top five phrases into a single cell, and wrap the table up with DT.

  mutate(td_idf = (txtf/textLength)*log(docsInCorpus/(docf+1)))%>%
  summarise(key_phrases = paste(lemma, collapse=" | "))%>%
  DT::datatable(class = 'cell-border stripe', 
                rownames = FALSE,
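The ranking-and-collapsing step can be illustrated in isolation with base R. The scores below are invented, and topPhrases is a hypothetical helper written for this sketch (it is not part of corpuslingr); it keeps the n highest-weighted phrases per document and pastes them into one cell.

```r
# Hypothetical tf-idf scores, one row per (document, keyphrase) pair
scored <- data.frame(
  doc_id = c("1", "1", "1", "2", "2"),
  lemma  = c("hold on power", "ruler", "world", "tax cut", "world"),
  tf_idf = c(0.033, 0.011, 0, 0.055, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top n phrases per document, collapsed into one string
topPhrases <- function(df, n = 5) {
  df <- df[order(df$doc_id, -df$tf_idf), ]
  byDoc <- split(df$lemma, df$doc_id)
  data.frame(
    doc_id = names(byDoc),
    key_phrases = vapply(
      byDoc,
      function(x) paste(head(x, n), collapse = " | "),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

topTwo <- topPhrases(scored, n = 2)
print(topTwo)
```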

Scrolling through the table, we can see some fairly informative keyphrases, and keyphrases that would seem to align with text titles in semantically intuitive ways. So, while a comparatively simple approach to keyphrase extraction, tf-idf weights seem to perform quite well. An informal overview of alternative approaches to keyphrase identification/extraction is available here.

Postscript - State of the Union Addresses

I have wrapped this method into my corpuslingr package as clr_search_keyphrase. To demonstrate its usage, we extract keyphrases from State of the Union (SOTU) addresses delivered by US Presidents from 1790 to 2016.

Conveniently, folks at The Programming Historian make these addresses available in text format. For demonstration purposes, I have annotated the corpus of SOTU addresses (via spacyr) and included it in an R package called sotuAnn.

The user can specify the number of keyphrases to extract (n), how to aggregate keyphrases (agg_var), how to output keyphrases (flatten), and whether or not to use jitter to break ties among top n keyphrases (jitter).
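To see why jitter can matter when selecting the top n, consider several phrases with identical weights competing for the last slots. The sketch below shows the general idea with made-up scores: a tiny amount of random noise, much smaller than the gap between distinct scores, turns ties into a strict ordering. This illustrates the concept only and makes no claim about how the jitter option is implemented inside clr_search_keyphrase.

```r
set.seed(1)  # for a reproducible tie-break

# Hypothetical tf-idf weights; b, c and d are tied
scores <- c(a = 0.05, b = 0.02, c = 0.02, d = 0.02, e = 0.01)

# Noise far smaller than the smallest gap between distinct scores
jittered <- scores + runif(length(scores), min = 0, max = 1e-6)

# With ties broken, exactly three phrases can be selected
top3 <- names(sort(jittered, decreasing = TRUE))[1:3]
print(top3)
```

Without the noise, any cut-off at three would have to choose arbitrarily among b, c, and d, or return more than three phrases.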

library(sotuAnn) #devtools::install_github("jaytimm/sotuAnn")

sotuAnn::cdr_sotu_ann %>%
  corpuslingr::clr_set_corpus (meta = sotuAnn::cdr_sotu_meta) %>%
                                     key_var ='lemma', 
                                     remove_nums = TRUE, 
                                     include = c("year","president"))%>%
  DT::datatable(class = 'cell-border stripe', 
                rownames = FALSE,