This post demonstrates some methods for building multi-lingual corpora from web-based news content using my R package
quicknews, as well as methods for annotating multi-lingual corpora using the
cleanNLP (Arnold 2017) and
udpipe (Wijffels 2018) packages. In the process, this post lays a bit of a foundation for analyzing text data, and serves as a reference for some basic corpus concepts and common text structures.
library(tidyverse)
library(DT)
library(cleanNLP)
library(corpuslingr) # devtools::install_github("jaytimm/corpuslingr")
library(quicknews)   # devtools::install_github("jaytimm/quicknews")
Simple web scraping
The quicknews package streamlines two basic tasks relevant to building multi-lingual web corpora:
- Retrieving metadata (including urls) for current articles posted on GoogleNews (via RSS) based on user-specified search parameters, and
- Scraping article content from urls obtained from the RSS, and outputting content as a corpus data frame.
Both functions depend on a combination of functionality made available in other R packages.
Here, we obtain metadata for the most current US-based, nation-related articles written in English.
us_nation_meta <- quicknews::qnews_get_meta(language = "en", country = "us", type = "topic", search = "nation")
Language/country combinations (that I know work) include:
- Spanish/US (es/us),
- Spanish/Mexico (es/mx),
- French/France (fr/fr),
- German/Germany (de/de).
- When the type parameter is set to “topic” (as above), search parameter options include business, world, health, science, sports, technology, and some others.
- When the type parameter is set to “term”, the search parameter can be set to anything.
- When the type parameter is set to “topstories”, the search parameter is ignored.
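To illustrate, hypothetical calls covering each type setting might look like the following. These hit the Google News RSS feed live, so network access is assumed and results will vary; the empty search value in the last call is just a placeholder, since the parameter is ignored there.

```r
# type = "topic": search must name one of the supported topics
sports_meta <- quicknews::qnews_get_meta(language = "en", country = "us",
                                         type = "topic", search = "sports")

# type = "term": search can be set to anything
term_meta <- quicknews::qnews_get_meta(language = "fr", country = "fr",
                                       type = "term", search = "climat")

# type = "topstories": search is ignored
top_meta <- quicknews::qnews_get_meta(language = "es", country = "mx",
                                      type = "topstories", search = "")
```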
The call to qnews_get_meta() returns metadata for the 20 most recent articles by search type. Metadata include:
##  "lang" "country" "search" "date" "source" "title" "link"
Article publication dates, sources, and titles returned from the call to Google News’ RSS feed are summarized below:
The qnews_scrape_web function returns a TIF-compliant corpus data frame, with each scraped text represented as a single row. Metadata from the output of qnews_get_meta are also included.
us_nation_corpus <- us_nation_meta %>% quicknews::qnews_scrape_web()
Example text from the corpus data frame:
##  "By Katie Reilly 1:18 PM EDT The valedictorian of Holy Cross High School in Covington, Kentucky, planned on delivering a graduation speech about the power of young voices on Friday, but he was barred from doing so after church leaders said his speech was \"inconsistent with the teaching of the Catholic Church.\" So he went outside after the ceremony and delivered it by megaphone instead. \"I didn't think it was very polarizing. It was, like, about empowerment through youth,\" Bales told local TV stat"
Having built a simple corpus of the day’s news, the next step is annotation. Corpus annotation has become a fairly straightforward process with the open-source development of several natural language processing toolkits, including CoreNLP, spaCy, and udpipe.
As CoreNLP and spaCy are Java- and Python-based applications, respectively, dependency wrappers are required for their use in R.
udpipe, on the other hand, is a dependency-free R package.
cleanNLP & udpipe
The cleanNLP package provides wrappers (and a singular interface) for all three of these annotators. Given this flexibility, we demo its usage here in annotating our web-based corpus, with a specific focus on the dependency-free annotator udpipe. (Note that udpipe can be installed & used independently of cleanNLP.)
We initialize the udpipe annotator via a call to cnlp_init_udpipe():
cleanNLP::cnlp_init_udpipe(model_name = "english", feature_flag = FALSE, parser = "none")
Then we annotate the text using the cnlp_annotate function. Resulting output is a list of data frames, in which different features/types of annotations (eg, named entities, coreferences, dependencies, etc.) are stored separately. We can access the standard annotation as the token element of the list.
us_nation_annotated <- cleanNLP::cnlp_annotate(us_nation_corpus$text, as_strings = TRUE) %>%
  .$token %>%
  filter(lemma != 'ROOT') %>%
  mutate(id = gsub('doc', '', id))
The overall composition of the corpus is presented below; as can be noted, not all of the 20 articles included in the RSS have been successfully scraped. Not all websites allow non-Google entities to scrape their pages.
corpuslingr::clr_desc_corpus(us_nation_annotated, doc = 'id', sent = 'sid', tok = 'word', upos = 'upos')$corpus
##    n_docs textLength textType textSent
## 1:     16      13305     3496      686
Characteristics of constituent news articles:
corpuslingr::clr_desc_corpus(us_nation_annotated, doc = 'id', sent = 'sid', tok = 'word', upos = 'upos')$text %>%
  inner_join(us_nation_corpus, by = c('id' = 'doc_id')) %>%
  ggplot(aes(x = textLength, y = textType, color = source, size = textSent)) +
  geom_point()
The table below illustrates the first 1,000 tokens of our annotated corpus. Along with sentence identification and lemmatization, the annotation includes both universal part-of-speech tags (upos) and English-specific part-of-speech tags (pos). Per text interchange formats, this particular data structure is referred to as a “tokens data frame object.”
us_nation_annotated %>%
  slice(1:1000) %>%
  DT::datatable(options = list(scrollX = TRUE), selection = "none",
                class = 'cell-border stripe', rownames = FALSE,
                width = "100%", escape = FALSE) %>%
  DT::formatStyle(c(1:9), fontSize = '85%')
Spanish web corpus
Next, we take a quick look at building an annotated corpus of web-scraped news articles written in Spanish.
For a more analytic language like English (ie, a language with limited inflectional morphology), units of meaning can roughly be demarcated by spaces in a character string; the inflectional morphemes that do exist can be collapsed to language-specific part-of-speech tags (eg, verb forms inflected for third-person singular in the present tense are coded as “VBZ”). As the table above attests, this can be done without proliferating all that many POS categories.
For a language richer in inflectional morphology like Spanish, on the other hand, demarcating units of meaning via spaces in a character string does not really work, as a single verb form, eg, is inflected for tense, aspect, mood, person, and number. Mapping multiple features of inflectional meaning for a given token to a single “POS” code, then, becomes a bit more challenging, and annotations reflect this. We consider some of these differences as we go.
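The contrast can be sketched in base R. The Spanish example below is illustrative only; its feature labels follow the universal features scheme that udpipe reports, and the token and values are mine, not drawn from the corpus.

```r
# English: units of meaning can roughly be split on spaces, and the few
# inflected forms map to single POS tags (eg, "speaks" -> VBZ).
en_tokens <- unlist(strsplit("She speaks to the students", " "))
# five tokens: "She" "speaks" "to" "the" "students"

# Spanish: a single verb form bundles several inflectional features.
# "hablábamos" ("we were speaking") carries, at minimum:
hablabamos_feats <- c(mood = "Ind", number = "Plur", person = "1",
                      tense = "Imp", verb_form = "Fin")
```

No single “POS” code comfortably holds all five of those values, which is why the Spanish annotation spreads them across feature columns instead.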
Retrieve metadata & content
So, we first retrieve metadata for the most current Mexico-based, “world news” articles, and scrape text content from urls in a single pipe.
es_mx_world_corp <- quicknews::qnews_get_meta(language = "es", country = "mx", type = "topic", search = "world") %>%
  quicknews::qnews_scrape_web()
Metadata from our search results are summarized below:
Again, we use cleanNLP to annotate the Spanish language corpus, initializing the annotator & Spanish model with a call to:
cnlp_init_udpipe(model_name = "spanish")
Annotate corpus data frame:
full_udpipe_es_ann <- es_mx_world_corp %>%
  select(doc_id, text) %>%
  cleanNLP::cnlp_annotate(as_strings = TRUE) %>%
  .$token %>%
  filter(lemma != 'ROOT') %>%
  arrange(as.numeric(id))
The basic “token-lemma-pos” annotation for Spanish via udpipe is presented below:
full_udpipe_es_ann %>%
  select(id:upos) %>%
  slice(1:1000) %>%
  DT::datatable(options = list(scrollX = TRUE), selection = "none",
                class = 'cell-border stripe', rownames = FALSE,
                width = "100%", escape = FALSE) %>%
  DT::formatStyle(c(1:9), fontSize = '85%')
As can be noted, there is no language-specific part-of-speech column for the Spanish annotation; instead, the annotation includes a range of features (as columns) detailing the individual units of (inflectional) meaning comprising each token. The full annotation contains the following features:
##  "case"      "definite"  "degree"    "gender"    "mood"
##  "num_type"  "number"    "person"    "polarity"  "polite"
##  "poss"      "prep_case" "pron_type" "reflex"    "tense"
##  "verb_form"
The table below presents a detailed account of (some of the) features of nominal and verbal inflectional meaning included in the annotation:
It should be noted that different annotators, eg coreNLP, structure feature output differently; namely, features of inflectional meaning are collapsed into a single column, generally as a language-specific pos column. Depending on the goal/endgame, one annotation structure may be preferable to the other.
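As a minimal base-R sketch of that difference (with made-up feature values for a hypothetical verb token), collapsing udpipe-style feature columns into a single string, roughly the shape a single language-specific pos-style column takes, might look like:

```r
# One row of udpipe-style feature columns (hypothetical values):
feats <- data.frame(mood = "Ind", number = "Plur", person = "1",
                    tense = "Imp", verb_form = "Fin",
                    stringsAsFactors = FALSE)

# Collapse per-feature columns into a single "feature=value|..." string:
collapse_feats <- function(df) {
  unname(apply(df, 1, function(x) paste(names(x), x, sep = "=", collapse = "|")))
}

collapse_feats(feats)
#> [1] "mood=Ind|number=Plur|person=1|tense=Imp|verb_form=Fin"
```

Going the other direction (splitting a collapsed string back out into columns) is more brittle, which is one argument for the one-feature-per-column structure when downstream filtering is the goal.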
We do not catalog all differences among languages/annotators here; instead, the goal has been to present a simple & uniform workflow for building and annotating multi-lingual, web-based corpora.
Among other things, the packages demonstrated here facilitate the development of real-time, annotated corpora for tracking any number of cultural & linguistic changes in-progress. Subsequent discussions on this site will consider different analyses that the corpus annotation enables, including fine-grained, complex search in context, network and topic analysis, and a host of distributional semantics models.
Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” arXiv Preprint arXiv:1703.09570.
Wijffels, Jan. 2018. udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. https://CRAN.R-project.org/package=udpipe.