This post demonstrates some methods for building multi-lingual corpora from web-based news content using my R package
quicknews, as well as methods for annotating multi-lingual corpora using the
cleanNLP (Arnold 2017) and
udpipe (Wijffels 2018) packages. In the process, this post lays a bit of a foundation for analyzing text data, and serves as a reference for some basic corpus concepts and common text structures.
library(tidyverse)
library(DT)
library(cleanNLP)
library(corpuslingr) # devtools::install_github("jaytimm/corpuslingr")
library(quicknews)   # devtools::install_github("jaytimm/quicknews")
Simple web scraping
The quicknews package streamlines two basic tasks relevant to building multi-lingual web corpora:
- Retrieving metadata (including urls) for current articles posted on Google News (via RSS) based on user-specified search parameters, and
- Scraping article content from urls obtained from the RSS, and outputting content as a corpus data frame.
Both functions depend on a combination of functionality made available in existing R packages.
Here, we obtain metadata for the most current US-based, nation-related articles written in English.
us_nation_meta <- quicknews::qnews_get_meta(language = "en", country = "us", type = "topic", search = "nation")
Other language/country combinations (that I know work) include:
- Spanish/US (es/us),
- Spanish/Mexico (es/mx),
- French/France (fr/fr),
- German/Germany (de/de).
- When the type parameter is set to “topic” (as above), search parameter options include business, world, health, science, sports, technology, and some others.
- When the type parameter is set to “term”, the search parameter can be set to anything.
- When the type parameter is set to “topstories”, the search parameter is ignored.
The call to qnews_get_meta() returns metadata for the 20 most recent articles by search type. Metadata include:
##  "lang" "country" "search" "date" "source" "title" "link"
Article publication dates, sources, and titles returned from the call to Google News’ RSS feed are summarized below:
The qnews_scrape_web function returns a TIF-compliant corpus data frame, with each scraped text represented as a single row. Metadata from the output of qnews_get_meta are also included.
us_nation_corpus <- us_nation_meta %>% quicknews::qnews_scrape_web()
Example text from the corpus data frame:
##  "Pennsylvania 'I'll Stomp on Your Face': GOP Pennsylvania Governor Candidate Removes Combative Message to Opponent Scott Wagner, Republican and Tom Wolf, Incumbent and Democratic candidate for the seat of Governor of Pennsylvania attend a student forum in Philadelphia, PA, on October 10, 2018. Wagner gave a warning to Wolf that he would \"stomp on his face.\" NurPhoto-NurPhoto via Getty Images By MARC LEVY / AP 12:55 PM EDT (HARRISBURG, Pa.) - Pennsylvania's Republican candidate fo"
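For reference, TIF compliance asks very little of a corpus data frame: a doc_id column and a text column, one row per document. A minimal hand-built sketch (the example texts are invented):

```r
# A minimal TIF-style corpus data frame: one row per document,
# with a unique doc_id and the full text of that document.
tif_corpus <- data.frame(
  doc_id = c("1", "2"),
  text   = c("Pennsylvania's Republican candidate gave a warning.",
             "A second, shorter article."),
  stringsAsFactors = FALSE
)

# TIF requires doc_id and text as the first two columns.
names(tif_corpus)[1:2]  # "doc_id" "text"
```

Any additional metadata columns (source, date, title, etc.) simply ride along to the right of these two.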
Having built a simple corpus of the day’s news, the next step is annotation. Corpus annotation has become a fairly straightforward process with the open-source development of several natural language processing toolkits, including Stanford CoreNLP, spaCy, and udpipe.
As CoreNLP and spaCy are Java- and Python-based applications, respectively, dependency wrappers are required for their use in R.
udpipe, on the other hand, is a dependency-free R package.
cleanNLP & udpipe
The cleanNLP package provides wrappers (and a single interface) for all three of these annotators. Given this flexibility, we demo its usage here in annotating our web-based corpus, with a specific focus on the dependency-free annotator udpipe. (Note that udpipe can be installed & used independently of cleanNLP.)
We initialize the udpipe annotator via a call to cnlp_init_udpipe:
cleanNLP::cnlp_init_udpipe(model_name = "english", feature_flag = FALSE, parser = "none")
Then we annotate the text using the cnlp_annotate function. The resulting output is a list of data frames, in which different types of annotations (eg, named entities, coreferences, dependencies, etc.) are stored separately. We can access the standard annotation as the token element of the list.
us_nation_annotated <- cleanNLP::cnlp_annotate(us_nation_corpus$text, as_strings = TRUE) %>%
  .$token %>%
  filter(lemma != 'ROOT') %>%
  mutate(id = gsub('doc', '', id))
The overall composition of the corpus is presented below; as can be noted, not all of the 20 articles included in the RSS have been successfully scraped. Not all websites allow non-Google entities to scrape their pages.
corpuslingr::clr_desc_corpus(us_nation_annotated, doc = 'id', sent = 'sid', tok = 'word', upos = 'upos')$corpus
##    n_docs textLength textType textSent
## 1:      7       4662     1710      234
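The summary measures above (token count, type count, sentence count) fall straight out of a tokens data frame. A base-R sketch with a toy two-document example (column names follow the annotation used here):

```r
# Toy tokens data frame: one row per token, with document (id)
# and sentence (sid) identifiers.
tokens <- data.frame(
  id   = c(1, 1, 1, 1, 2, 2),
  sid  = c(1, 1, 2, 2, 1, 1),
  word = c("The", "news", "It", "broke", "More", "news"),
  stringsAsFactors = FALSE
)

textLength <- nrow(tokens)                            # total tokens: 6
textType   <- length(unique(tolower(tokens$word)))    # unique forms: 5
textSent   <- nrow(unique(tokens[, c("id", "sid")]))  # doc-sentence pairs: 3

c(textLength = textLength, textType = textType, textSent = textSent)
```

Note that "news" counts once toward textType despite occurring twice, and that sentence ids restart per document, hence counting distinct (id, sid) pairs.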
Characteristics of constituent news articles:
corpuslingr::clr_desc_corpus(us_nation_annotated, doc = 'id', sent = 'sid', tok = 'word', upos = 'upos')$text %>%
  inner_join(us_nation_corpus, by = c('id' = 'doc_id')) %>%
  ggplot(aes(x = textLength, y = textType, color = source, size = textSent)) +
  geom_point()
The table below illustrates the first 1,000 tokens of our annotated corpus. Along with sentence identification and lemmatization, the annotation includes both universal part-of-speech tags (upos) and English-specific part-of-speech tags (pos). Per text interchange formats, this particular data structure is referred to as a “tokens data frame object.”
us_nation_annotated %>%
  slice(1:1000) %>%
  DT::datatable(options = list(scrollX = TRUE), selection = "none",
                class = 'cell-border stripe', rownames = FALSE,
                width = "100%", escape = FALSE) %>%
  DT::formatStyle(c(1:9), fontSize = '85%')
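A tokens data frame in the TIF sense is simply one row per token: positional ids plus whatever annotation columns the annotator supplies. A hand-built sketch of the structure (column names follow the annotation used in this post; the tags are hand-assigned for illustration):

```r
# Hand-annotated sketch of a tokens data frame: one row per token.
tokens_df <- data.frame(
  id    = "1",                                 # document id
  sid   = 1,                                   # sentence id within document
  tid   = 1:4,                                 # token id within sentence
  word  = c("The", "news", "broke", "quickly"),
  lemma = c("the", "news", "break", "quickly"),
  upos  = c("DET", "NOUN", "VERB", "ADV"),     # universal POS tags
  pos   = c("DT", "NN", "VBD", "RB"),          # English-specific (Penn) tags
  stringsAsFactors = FALSE
)
```

Note how the lemma column collapses the past-tense form "broke" to "break", while the pos column (VBD) records the tense that lemmatization discards.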
Spanish web corpus
Next, we take a quick look at building an annotated corpus of web-scraped news articles written in Spanish.
For a more analytic language like English (ie, a language with limited inflectional morphology), units of meaning can roughly be demarcated by spaces in a character string; the inflectional morphemes that do exist can be collapsed into language-specific part-of-speech tags (eg, verb forms inflected for third-person singular in the present tense are coded as “VBZ”). As the table above attests, this can be done without proliferating all that many POS categories.
For a language richer in inflectional morphology like Spanish, on the other hand, demarcating units of meaning via spaces in a character string does not really work, as a single verb form, for example, may be inflected for tense, aspect, mood, person, and number. Mapping multiple features of inflectional meaning for a given token to a single POS code, then, becomes a bit more challenging, and annotations reflect this. We consider some of these differences as we go.
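To make the contrast concrete, a hand-coded sketch: the single Spanish form “hablaremos” (“we will speak”) carries tense, mood, person & number all at once, so its annotation naturally spreads across feature columns rather than one tag (feature values below follow Universal Dependencies conventions):

```r
# One Spanish verb token, with its inflectional meaning spread across
# feature columns rather than collapsed into a single POS code.
hablaremos <- data.frame(
  word   = "hablaremos",
  lemma  = "hablar",
  upos   = "VERB",
  tense  = "Fut",   # future
  mood   = "Ind",   # indicative
  person = "1",     # first person
  number = "Plur",  # plural
  stringsAsFactors = FALSE
)
```

An English-style single tag (a la "VBZ") would have to encode four feature values at once here, which is exactly the proliferation problem noted above.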
RETRIEVE METADATA & CONTENT
So, we first retrieve metadata for the most current Mexico-based, “world news” articles, and scrape text content from urls in a single pipe.
es_mx_world_corp <- quicknews::qnews_get_meta(language = "es", country = "mx", type = "topic", search = "world") %>%
  quicknews::qnews_scrape_web()
Metadata from our search results are summarized below:
Again, we use cleanNLP to annotate the Spanish language corpus, and initialize the annotator & Spanish model with a call to cnlp_init_udpipe:
cleanNLP::cnlp_init_udpipe(model_name = "spanish")
Then we annotate the corpus data frame:
full_udpipe_es_ann <- es_mx_world_corp %>%
  select(doc_id, text) %>%
  cleanNLP::cnlp_annotate(as_strings = TRUE) %>%
  .$token %>%
  filter(lemma != 'ROOT') %>%
  arrange(as.numeric(id))
The basic “token-lemma-pos” annotation for Spanish via udpipe is presented below:
full_udpipe_es_ann %>%
  select(id:upos) %>%
  slice(1:1000) %>%
  DT::datatable(options = list(scrollX = TRUE), selection = "none",
                class = 'cell-border stripe', rownames = FALSE,
                width = "100%", escape = FALSE) %>%
  DT::formatStyle(c(1:9), fontSize = '85%')
As can be noted, there is no language-specific part-of-speech column for the Spanish annotation; instead, the annotation includes a range of features (as columns) detailing the individual units of (inflectional) meaning comprising each token. The full annotation contains the following features:
##  "case"      "definite"  "degree"    "gender"    "mood"
##  "num_type"  "number"    "person"    "polarity"  "polite"
##  "poss"      "prep_case" "pron_type" "reflex"    "tense"
##  "verb_form"
The table below presents a detailed account of (some of the) features of nominal and verbal inflectional meaning included in the annotation:
It should be noted that different annotators, eg CoreNLP, structure feature output differently; namely, features of inflectional meaning are collapsed into a single column, generally a language-specific pos column. Depending on the goal, one annotation structure may be preferable to the other.
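A quick sketch of the difference between the two structures: keeping features as separate columns (as in the udpipe output above) versus collapsing them into a single tag string (the pipe-delimited format below is just one illustrative convention):

```r
# Features kept as separate columns (as in the udpipe-style output)...
tok <- data.frame(word = "hablaremos", tense = "Fut", mood = "Ind",
                  person = "1", number = "Plur", stringsAsFactors = FALSE)

# ...versus collapsed into a single string-valued column:
tok$feats <- paste(tok$tense, tok$mood, tok$person, tok$number, sep = "|")
tok$feats  # "Fut|Ind|1|Plur"
```

Separate columns make it easy to filter or cross-tabulate on a single feature (eg, all future-tense tokens); a single collapsed column is more compact and closer to a traditional tagset.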
We do not roll-call all differences among languages/annotators here; instead, the goal has been to present a simple & uniform workflow for building and annotating multi-lingual, web-based corpora.
Among other things, these packages facilitate the development of real-time, annotated corpora for tracking any number of cultural & linguistic changes in progress. Subsequent discussions on this site will consider different analyses that the corpus annotation enables, including fine-grained, complex search in context, network and topic analysis, and a host of distributional semantics models.
Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” arXiv Preprint arXiv:1703.09570.
Wijffels, Jan. 2018. udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. https://CRAN.R-project.org/package=udpipe.