from web to annotated corpus: a multi-lingual demo

This post demonstrates some methods for building multi-lingual corpora from web-based news content using my R package quicknews, as well as methods for annotating multi-lingual corpora using the cleanNLP (Arnold 2017) and udpipe (Wijffels 2018) packages. In the process, this post lays a bit of a foundation for analyzing text data, and serves as a reference for some basic corpus concepts and common text structures.

library(dplyr) # provides %>%, filter(), mutate(), select(), inner_join() used below
library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
library(quicknews) #devtools::install_github("jaytimm/quicknews")

Simple web scraping

The quicknews package streamlines two basic tasks relevant to building multi-lingual web corpora:

  • Retrieving metadata (including urls) for current articles posted on GoogleNews (via RSS) based on user-specified search parameters, and
  • Scraping article content from urls obtained from the RSS, and outputting content as a corpus data frame.

Both functions depend on a combination of functionality made available in the boilerpipeR, xml2, and RCurl packages.


Here, we obtain metadata for the most current US-based, nation-related articles written in English.

us_nation_meta <- quicknews::qnews_get_meta (

Additional language & country searches (that I know work) include:

  • Spanish/US (es/us),
  • Spanish/Mexico (es/mx),
  • French/France (fr/fr),
  • German/Germany (de/de).

Specifics of type & search parameters:

  • when type = “topic” (as above), search parameter options include business, world, health, science, sports, technology, and some others.
  • when the type parameter is set to “term”, the search parameter can be set to anything.
  • when the type parameter is set to “topstories”, the search parameter is ignored.
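The interplay between the type and search parameters can be sketched roughly as follows. Note that check_search is a hypothetical helper written purely for illustration (it is not a quicknews function), and that the topic list is only the partial one given above, plus "nation", which we assume from the earlier example.

```r
# Hypothetical helper, NOT part of quicknews: mimics the rules described above.
check_search <- function(type = c("topic", "term", "topstories"), search = NULL) {
  type <- match.arg(type)

  # Partial list only: the post names these topics plus "some others".
  known_topics <- c("nation", "business", "world", "health",
                    "science", "sports", "technology")

  # "topstories": the search parameter is ignored.
  if (type == "topstories") {
    return(list(type = type, search = NA_character_))
  }

  if (is.null(search)) {
    stop("search is required when type is 'topic' or 'term'")
  }

  # "topic": warn (rather than fail) because our topic list is incomplete.
  if (type == "topic" && !tolower(search) %in% known_topics) {
    warning("topic not in the partial list used here: ", search)
  }

  # "term": anything goes.
  list(type = type, search = search)
}

check_search("topic", "world")
check_search("term", "graduation speech")
check_search("topstories", "this is ignored")
```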

The call to qnews_get_meta() returns metadata for the 20 most recent articles by search type. Metadata include:

## [1] "lang"    "country" "search"  "date"    "source"  "title"   "link"
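To make the shape of this metadata concrete, here is a minimal base R sketch that turns a toy RSS snippet into a data frame with the same seven columns. The feed content, the tag names (title, link, pubDate, source), and the get_tag helper are all assumptions for illustration; this is not how qnews_get_meta is implemented, and a real feed is better handled with a proper XML parser such as xml2.

```r
# Toy RSS snippet standing in for a feed response (invented content).
rss <- paste0(
  "<rss><channel>",
  "<item><title>Story one</title><link>https://example.com/1</link>",
  "<pubDate>Mon, 28 May 2018 12:00:00 GMT</pubDate>",
  "<source>Example Times</source></item>",
  "<item><title>Story two</title><link>https://example.com/2</link>",
  "<pubDate>Mon, 28 May 2018 13:30:00 GMT</pubDate>",
  "<source>Sample Post</source></item>",
  "</channel></rss>"
)

# Hypothetical helper: return the contents of every <tag>...</tag> pair.
get_tag <- function(x, tag) {
  pattern <- sprintf("<%s>(.*?)</%s>", tag, tag)
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  gsub(sprintf("</?%s>", tag), "", hits)
}

items <- get_tag(rss, "item")

# Extract one field from each item as a plain character vector.
field <- function(tag) {
  vapply(items, function(i) get_tag(i, tag), character(1), USE.NAMES = FALSE)
}

toy_meta <- data.frame(
  lang    = "en",       # search parameters are carried along as columns
  country = "us",
  search  = "nation",
  date    = field("pubDate"),
  source  = field("source"),
  title   = field("title"),
  link    = field("link"),
  stringsAsFactors = FALSE
)

names(toy_meta)
```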

Article publication dates, sources, and titles returned from the call to Google News’ RSS feed are summarized below:


The qnews_scrape_web function returns a TIF-compliant corpus data frame, with each scraped text represented as a single row. Metadata from the output of qnews_get_meta are also included.

us_nation_corpus <- us_nation_meta %>% 
  quicknews::qnews_scrape_web () 

Example text from the corpus data frame:

## [1] "By Katie Reilly 1:18 PM EDT The valedictorian of Holy Cross High School in Covington, Kentucky, planned on delivering a graduation speech about the power of young voices on Friday, but he was barred from doing so after church leaders said his speech was \"inconsistent with the teaching of the Catholic Church.\" So he went outside after the ceremony and delivered it by megaphone instead. \"I didn't think it was very polarizing. It was, like, about empowerment through youth,\" Bales told local TV stat"
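For reference, the Text Interchange Formats (TIF) describe a corpus data frame as a data frame whose first column is doc_id and whose second column is text, both character vectors, with unique document ids; any further columns are treated as metadata. A minimal sketch, using invented texts and a hypothetical checker is_corpus_df:

```r
# Invented two-document corpus; "source" is an extra metadata column.
toy_corpus <- data.frame(
  doc_id = c("1", "2"),
  text   = c("The valedictorian delivered his speech by megaphone.",
             "Officials met on Friday to discuss the budget."),
  source = c("Example Times", "Sample Post"),
  stringsAsFactors = FALSE
)

# Hypothetical checker for the corpus data frame rules summarized above.
is_corpus_df <- function(x) {
  is.data.frame(x) &&
    ncol(x) >= 2 &&
    identical(names(x)[1:2], c("doc_id", "text")) &&
    is.character(x$doc_id) &&
    is.character(x$text) &&
    !anyDuplicated(x$doc_id)
}

is_corpus_df(toy_corpus)

# Swapping the first two columns breaks the required column order.
is_corpus_df(toy_corpus[, c("text", "doc_id", "source")])
```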

Corpus annotation

Having built a simple corpus of the day’s news, the next step is annotation. Corpus annotation has become a fairly straightforward process with the open-source development of several natural language processing toolkits, including:

  • CoreNLP,
  • spaCy, and
  • udpipe.

As CoreNLP and spaCy are Java- and Python-based applications, respectively, dependency wrappers are required for their use in R. udpipe, on the other hand, is a dependency-free R package.

cleanNLP & udpipe

The cleanNLP package provides wrappers (and a single interface) for all three of these annotators. Given this flexibility, we demo its usage here in annotating our web-based corpus, with a specific focus on the dependency-free annotator udpipe. (Note that udpipe can be installed & used independently of cleanNLP.)

We initialize the annotator udpipe via a call to cnlp_init_udpipe:

  feature_flag = FALSE, 
  parser = "none") 

Then we annotate the text using the cnlp_annotate function. The resulting output is a list of data frames, in which different features/types of annotations (eg, named entities, coreferences, dependencies, etc.) are stored separately. We can access the standard annotation as the token element of the list.

us_nation_annotated <- cleanNLP::cnlp_annotate (us_nation_corpus$text, 
                                                as_strings = TRUE) %>%
  .$token %>%
  filter(lemma !='ROOT') %>%
  mutate(id = gsub('doc','', id))
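The two clean-up steps in the pipe above (dropping ROOT rows and stripping the 'doc' prefix from ids) can be illustrated without running an annotator. The sketch below uses an invented stand-in for the annotation list; the column names are assumed to mirror the token table, and the rows are made up.

```r
# Invented stand-in for the list returned by the annotator:
# a list of data frames, of which we only mock up the token element.
toy_annotation <- list(
  token = data.frame(
    id    = c("doc1", "doc1", "doc1", "doc1"),
    sid   = c(1, 1, 1, 1),
    tid   = c(0, 1, 2, 3),
    word  = c("ROOT", "Voices", "matter", "."),
    lemma = c("ROOT", "voice", "matter", "."),
    upos  = c("", "NOUN", "VERB", "PUNCT"),
    pos   = c("", "NNS", "VBP", "."),
    stringsAsFactors = FALSE
  )
)

# Base R equivalent of: .$token %>% filter(lemma != 'ROOT') %>% mutate(id = gsub(...))
tokens <- toy_annotation$token
tokens <- tokens[tokens$lemma != "ROOT", ]   # drop the placeholder ROOT rows
tokens$id <- gsub("doc", "", tokens$id)      # "doc1" becomes "1"

tokens
```

Stripping the prefix leaves ids that can be matched against the doc_id column of the corpus data frame.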


The overall composition of the corpus is presented below; as can be noted, not all of the 20 articles included in the RSS feed have been successfully scraped, as not all websites allow non-Google entities to scrape their pages.

##    n_docs textLength textType textSent
## 1:     16      13305     3496      686
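Summary counts of this kind can be derived directly from a tokens data frame. The sketch below is an approximation on invented data, and the definitions are assumptions on our part: textLength as the number of token rows (punctuation included), textType as the number of distinct lower-cased word forms, and textSent as the number of distinct document/sentence pairs. The function that produced the table above may count differently.

```r
# Invented tokens: two documents, three sentences in total.
toy_tokens <- data.frame(
  id   = c("1", "1", "1", "1", "1", "1", "2", "2", "2"),
  sid  = c(1, 1, 1, 2, 2, 2, 1, 1, 1),
  word = c("News", "breaks", ".", "News", "travels", ".", "It", "breaks", "."),
  stringsAsFactors = FALSE
)

corpus_summary <- data.frame(
  n_docs     = length(unique(toy_tokens$id)),               # distinct documents
  textLength = nrow(toy_tokens),                            # token count (assumed)
  textType   = length(unique(tolower(toy_tokens$word))),    # distinct forms (assumed)
  textSent   = nrow(unique(toy_tokens[, c("id", "sid")]))   # sentences per doc, summed
)

corpus_summary
```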

Characteristics of constituent news articles:

                             upos='upos')$text %>%
  inner_join(us_nation_corpus, by = c('id'='doc_id'))%>%

The table below illustrates the first 1,000 tokens of our annotated corpus. Along with sentence identification and lemmatization, the annotation includes both universal part-of-speech tags (upos) and English-specific part-of-speech tags (pos).

Per the Text Interchange Formats (TIF), this particular data structure is referred to as a “tokens data frame object.”

  DT::datatable(options = list(scrollX = TRUE),
                class = 'cell-border stripe', 
                rownames = FALSE,
  DT::formatStyle(c(1:9),fontSize = '85%')

English part-of-speech tags

For reference purposes, the table below presents English-specific part of speech tags (pos), including descriptions and some examples of each tag type. This tag set is (more or less) uniformly used across annotators, and is based on conventions established by the Penn Treebank Project.

In contrast, the BYU suite of corpora use a decidedly more detailed tagset.
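As a small illustration of how Penn Treebank tags attach to tokens, the sketch below builds a lookup for a handful of common tags and joins it to an invented token table. The selection of tags is ours and far from complete; the full tagset has several dozen categories.

```r
# A few Penn Treebank tags with descriptions and examples (partial list).
penn_tags <- data.frame(
  pos = c("DT", "IN", "JJ", "NN", "NNS", "RB", "VB", "VBD", "VBZ"),
  description = c("Determiner",
                  "Preposition or subordinating conjunction",
                  "Adjective",
                  "Noun, singular or mass",
                  "Noun, plural",
                  "Adverb",
                  "Verb, base form",
                  "Verb, past tense",
                  "Verb, 3rd person singular present"),
  example = c("the", "of", "young", "speech", "voices",
              "quickly", "deliver", "delivered", "delivers"),
  stringsAsFactors = FALSE
)

# Invented tagged tokens.
toy_tagged <- data.frame(
  word = c("The", "speech", "was", "short"),
  pos  = c("DT", "NN", "VBD", "JJ"),
  stringsAsFactors = FALSE
)

# match() keeps the original token order, unlike merge().
toy_tagged$description <- penn_tags$description[match(toy_tagged$pos, penn_tags$pos)]

toy_tagged
```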

Spanish web corpus

Next, we take a quick look at building an annotated corpus of web-scraped news articles written in Spanish.

For a more analytic language like English (ie, a language with limited inflectional morphology), units of meaning can roughly be demarcated by spaces in a character string; the inflectional morphemes that do exist can be collapsed into language-specific part-of-speech tags (eg, verb forms inflected for third-person singular in the present tense are coded as “VBZ”). And, as the table above attests, this can be done without proliferating all that many POS categories.

For a language richer in inflectional morphology like Spanish, on the other hand, demarcating units of meaning via spaces in a character string does not really work, as a single verb form, eg, is inflected for tense, aspect, mood, person, and number. Mapping multiple features of inflectional meaning for a given token to a single “POS” code, then, becomes a bit more challenging, and annotations reflect this. We consider some of these differences as we go.


So, we first retrieve metadata for the most current Mexico-based, “world news” articles, and scrape text content from urls in a single pipe.

es_mx_world_corp <- quicknews::qnews_get_meta(language="es",
  quicknews::qnews_scrape_web ()

Metadata from our search results are summarized below:


Again, we use udpipe via cleanNLP to annotate the Spanish language corpus, and initialize the annotator & Spanish model with a call to cnlp_init_udpipe.

cnlp_init_udpipe(model_name = "spanish")

Annotate corpus data frame:

full_udpipe_es_ann <- es_mx_world_corp%>%
  cleanNLP::cnlp_annotate(as_strings = TRUE) %>%
  filter(lemma !='ROOT')%>%

The basic “token-lemma-pos” annotation for Spanish via udpipe is presented below:

full_udpipe_es_ann %>%
  select(id:upos) %>%
  DT::datatable(options = list(scrollX = TRUE),
                class = 'cell-border stripe', 
                rownames = FALSE) %>%
  DT::formatStyle(c(1:9),fontSize = '85%')

As can be noted, there is no language-specific part-of-speech column for the Spanish annotation; instead, the annotation includes a range of features (as columns) detailing the individual units of (inflectional) meaning comprising each token. The full annotation contains the following features:

##  [1] "case"      "definite"  "degree"    "gender"    "mood"     
##  [6] "num_type"  "number"    "person"    "polarity"  "polite"   
## [11] "poss"      "prep_case" "pron_type" "reflex"    "tense"    
## [16] "verb_form"

The table below presents a detailed account of (some of the) features of nominal and verbal inflectional meaning included in the annotation:


It should be noted that different annotators, eg, spaCy and CoreNLP, structure feature output differently; namely, features of inflectional meaning are collapsed into a single column, generally as a language-specific pos column. Depending on the goal/endgame, one annotation structure may be preferable to the other.
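To illustrate the difference between the two structures, the sketch below collapses a few per-feature columns into a single feature string of the kind used in Universal Dependencies (Name=Value pairs joined by "|"). The tokens and their feature values are invented, collapse_feats is a hypothetical helper, and only six of the sixteen feature columns are included.

```r
# Invented Spanish tokens, one row per token, one column per feature.
toy_es <- data.frame(
  word      = c("habla", "casas"),
  upos      = c("VERB", "NOUN"),
  gender    = c(NA, "Fem"),
  mood      = c("Ind", NA),
  number    = c("Sing", "Plur"),
  person    = c("3", NA),
  tense     = c("Pres", NA),
  verb_form = c("Fin", NA),
  stringsAsFactors = FALSE
)

feature_cols <- c("gender", "mood", "number", "person", "tense", "verb_form")

# Map snake_case column names to UD-style feature names.
ud_names <- c(gender = "Gender", mood = "Mood", number = "Number",
              person = "Person", tense = "Tense", verb_form = "VerbForm")

# Hypothetical helper: join the non-missing features of one row.
collapse_feats <- function(row) {
  keep <- !is.na(row)
  if (!any(keep)) return("_")   # "_" marks a token with no features
  paste(paste0(ud_names[names(row)[keep]], "=", row[keep]), collapse = "|")
}

toy_es$feats <- apply(toy_es[, feature_cols], 1, collapse_feats)

toy_es[, c("word", "upos", "feats")]
# habla -> "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"
# casas -> "Gender=Fem|Number=Plur"
```

The wide layout makes it easy to filter on a single feature (eg, all subjunctive verb forms), while the collapsed layout keeps the table narrow and uniform across languages.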

Quick summary

We do not catalog all differences among languages/annotators here; instead, the goal has been to present a simple & uniform workflow for building and annotating multi-lingual, web-based corpora.

Among other things, the quicknews and cleanNLP/udpipe packages facilitate the development of real-time, annotated corpora for tracking any number of cultural & linguistic changes in progress. Subsequent discussions on this site will consider different analyses that the corpus annotation enables, including fine-grained, complex search in context, network and topic analysis, and a host of distributional semantics models.


Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” arXiv Preprint arXiv:1703.09570.

Wijffels, Jan. 2018. udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit.