corpus query and grammatical constructions

This post demonstrates the use of a simple collection of functions from my R-package corpuslingr. Functions streamline two sets of corpus linguistics tasks:

  • annotated corpus search of grammatical constructions and complex lexical patterns in context, and
  • detailed summary and aggregation of corpus search results.

While still in development, the package should be useful to linguists and digital humanists interested in having BYU corpora-like search & summary functionality when working with (moderately-sized) personal corpora, as well as researchers interested in performing finer-grained, more qualitative analyses of language use and variation in context. The package is available for download at my github site.

library(tidyverse)
library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
library(corpusdatr) #devtools::install_github("jaytimm/corpusdatr")

Search syntax

Under the hood, corpuslingr search is regex-based & tuple-based — akin to the RegexpParser function in Python’s Natural Language Toolkit (NLTK) — which facilitates search of grammatical and lexical patterns comprised of:

  • different types of elements (eg, form, lemma, or part-of-speech),
  • contiguous and/or non-contiguous elements,
  • positionally fixed and/or free (ie, optional) elements.

Regex character matching is streamlined with a simple “corpus querying language” modeled after the more intuitive and transparent syntax used in the online BYU suite of English corpora. This allows for convenient specification of search patterns comprised of form, lemma, & pos, with all of the functionality of regex metacharacters and repetition quantifiers.

Example searches & syntax are presented below, which load with the package as clr_ref_search_egs. A full list of part-of-speech codes can be viewed here, or via clr_ref_pos_codes.

example search syntax

Search summary

The clr_get_freq() function enables quick aggregation of search results. It calculates token and text frequency for search terms, and allows the user to specify how to aggregate counts with the agg_var parameter.

  • VERB up, eg “pass up”
search3 <- "VERB up"

The figure below illustrates the top 20 instantiations of the grammatical construction VERB up.

slate %>%
  corpuslingr::clr_search_gramx(search=search3)%>%
  corpuslingr::clr_get_freq(agg_var=c("lemma"),
                            toupper =TRUE) %>%
  slice(1:20)%>%
  ggplot(aes(x=reorder(lemma,txtf), y=txtf)) + 
    geom_col(width=.6, fill="steelblue") +  
    coord_flip()+
    labs(title="Top 20 instantiations of 'VERB up' by frequency")

Although search is quicker when searching for multiple search terms simultaneaously, in some cases it may be useful to treat multiple search terms distinctly using lapply():

search3a <- c("VERB across",
              "VERB through", 
              "VERB out", 
              "VERB down")
vb_prep <- lapply(1:length(search3a), function(y) {
    corpuslingr::clr_search_gramx(corp=slate, 
                                  search=search3a[y])%>%
    corpuslingr::clr_get_freq(agg_var=c("lemma"),
                              toupper =TRUE) %>%
    mutate(search = search3a[y])
    }) %>%  
  bind_rows()

Summary by search:

vb_prep %>%
  group_by(search) %>%
  summarize(gramx_freq = sum(txtf), 
            gramx_type = n())


Top 10 instantiations of each search pattern by search term:

KWIC & BOW

KEYWORD IN CONTEXT

clr_context_kwic() is a super simple function that rebuilds search contexts from the output of clr_search_context(). The include parameter allows the user to add details about the search pattern, eg. part-of-speech, to the table. It works nicely with DT tables.

  • VERB NOUNPHRASE into VERBING, eg “trick someone into believing”
search4 <- "VERB NPHR into VBG"  
slate %>%
  corpuslingr::clr_search_context(search=search4, 
                                  LW=10, 
                                  RW = 10) %>%
  corpuslingr::clr_context_kwic(include=c('doc_id')) %>%
  DT::datatable(class = 'cell-border stripe', 
                rownames = FALSE,width="100%", 
                escape=FALSE) %>%
  DT::formatStyle(c(1:2),fontSize = '85%')

BAG OF WORDS

The clr_context_bow() function returns a co-occurrence vector for each search term based on the context window-size specified in clr_search_context(). Again, how features are counted can be specified using the agg_var parameter. Additionally, features included in the vector can be filtered to content words using the content_only parameter.

  • Multiple search terms
search5 <- c("Clinton", "Lewinsky", 
             "Bradley", "McCain", 
             "Milosevic", "Starr",  
             "Microsoft", "Congress", 
             "China", "Russia")

Here we search for some prominent players of the late 90s (when articles in the cdr_slate_ann corpus were published), and plot the most frequent co-occurring features of each search term.

co_occur <- slate %>%
  corpuslingr::clr_search_context(search=search5, 
                                  LW=15, 
                                  RW = 15)%>%
  corpuslingr::clr_context_bow(content_only=TRUE,
                               agg_var=c('searchLemma','lemma','pos'))

Plotting facets in ggplot is problematic when within-facet categories contain some overlap. We add a couple of hacks to address this.

co_occur %>%
  filter(pos=="NOUN")%>%
  arrange(searchLemma,cofreq)%>%
  group_by(searchLemma)%>%
  top_n(n=10,wt=jitter(cofreq))%>%
  ungroup()%>%
  #Hack1 to sort order within facet
  mutate(order = row_number(), 
         lemma=factor(paste(order,lemma,sep="_"), 
                      levels = paste(order, lemma, sep = "_")))%>%
  ggplot(aes(x=lemma, 
             y=cofreq, 
             fill=searchLemma)) + 
    geom_col(show.legend = FALSE) +  
    facet_wrap(~searchLemma, scales = "free_y", ncol = 2) +
  #Hack2 to modify labels
    scale_x_discrete(labels = function(x) gsub("^.*_", "", x))+
    theme_fivethirtyeight()+ 
    scale_fill_stata() +
    theme(plot.title = element_text(size=13))+ 
    coord_flip()+
    labs(title="Co-occurrence frequencies for some late 20th century players")

Summary and shiny

So, a quick demo of some corpuslingr functions for annotated corpus search & summary of complex lexical-grammatical patterns in context.

I have built a Shiny app to search/explore the Slate Magazine corpus available here. Code for building the app is available here. Swapping out the Slate corpus for a personal one should be fairly straightforward, with the caveat that the annotated corpus needs to be set/“tuple-ized” using the clr_set_corpus function from corpuslingr.

Share