linking entities and wikidata

A quick demo using the spacyfishing python library
nlp
ner
python
Published

June 26, 2022


A quick demo of the spacyfishing library, “a spaCy wrapper for Entity-Fishing, a tool for named entity recognition, linking and disambiguation against Wikidata.” The library disambiguates named entities in a text and links them to entries in the Wikidata knowledge base.

1 Reticulate & Python

# Shell: create a conda environment and install the Python dependencies
conda create -n fishing
source activate fishing
conda install numpy pip pandas
/home/jtimm/anaconda3/envs/fishing/bin/pip install spacyfishing
python -m spacy download en_core_web_sm
# R: point reticulate at the Python interpreter of that environment
Sys.setenv(RETICULATE_PYTHON = "/home/jtimm/anaconda3/envs/fishing/bin/python")
reticulate::use_condaenv(condaenv = "fishing",
                         conda = "/home/jtimm/anaconda3/bin/conda")
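Before loading the pipeline, it can help to confirm that the interpreter reticulate picked up actually sees the packages installed above. The following is a minimal sketch, not part of the original setup; the helper name `missing_modules` is hypothetical, and the module names are assumed to match the package names used in the install commands.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the active Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the install steps above.
required = ["numpy", "pandas", "spacy", "spacyfishing", "en_core_web_sm"]
print("Missing modules:", missing_modules(required))
```

An empty list suggests the environment is wired up as intended; anything listed probably means reticulate is bound to a different interpreter than the `fishing` environment.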

2 Build a simple news corpus

qn <- quicknews::qnews_build_rss('war in ukraine') |>
  quicknews::qnews_strip_rss()

qn[1:5,1:3] |> knitr::kable()

| date       | source              | title                                                                |
|:-----------|:--------------------|:---------------------------------------------------------------------|
| 2022-06-28 | NPR                 | Russia-Ukraine war: What happened today (June 28)                    |
| 2022-06-28 | The Washington Post | Latest Russia-Ukraine war news: Live updates                         |
| 2022-06-28 | VOA News            | Latest Developments in Ukraine: June 28                              |
| 2022-06-28 | The New York Times  | The West Seeks a More Effective Way to Tighten Sanctions on Russia   |
| 2022-06-28 | CNN                 | Russia’s war in Ukraine: Live updates                                |

arts <- quicknews::qnews_extract_article(qn$link[1:3], cores = 3)
text_en <- arts$text[1]

3 spaCy

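import spacy

# Load the small English model, then append the entity-fishing linker.
nlp = spacy.load("en_core_web_sm")
# extra_info requests the additional fields used below (normalized term, description).
nlp.add_pipe("entityfishing", config={"extra_info": True})
nlp.add_pipe('sentencizer')
# r.text_en is the R object text_en, exposed to Python by reticulate.
doc = nlp(r.text_en)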

3.1 Entities to Wikidata

import pandas as pd
entities = [(e.label_, 
             e.text, 
             e._.normal_term, 
             e._.kb_qid, 
             e._.url_wikidata, 
             e._.nerd_score,
             e._.description) for e in doc.ents]
             
df99 = pd.DataFrame(entities, 
                    columns=['type',
                             'entity', 
                             'normed', 
                             'qid', 
                             'url', 
                             'score', 
                             'description'])
# R: display the Python data frame, leaving out the long description column
reticulate::py$df99 |> 
  dplyr::select(-description) |> 
  DT::datatable(rownames = FALSE)
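Not every recognized entity is necessarily linked, and low-scoring links may be wrong. Below is a hedged sketch of one way to keep only linked entities above a score threshold before building the data frame. The helper `entity_rows` and the `min_score` parameter are hypothetical, and the sketch assumes that unlinked entities carry `None` in `kb_qid` and `nerd_score`. The stand-in objects only mimic the attributes of a spaCy span, and their QIDs and scores are made-up placeholders, not real output.

```python
from types import SimpleNamespace


def entity_rows(ents, min_score=0.0):
    """Collect one dict per linked entity, skipping unlinked or low-score ones."""
    rows = []
    for e in ents:
        # getattr with a default also covers spans without these extensions.
        qid = getattr(e._, "kb_qid", None)
        score = getattr(e._, "nerd_score", None)
        if qid is None or score is None or score < min_score:
            continue
        rows.append({"type": e.label_, "entity": e.text, "qid": qid, "score": score})
    return rows


# Stand-ins for spaCy spans; QIDs and scores are placeholders for illustration.
fake_ents = [
    SimpleNamespace(label_="GPE", text="Ukraine",
                    _=SimpleNamespace(kb_qid="Q000001", nerd_score=0.9)),
    SimpleNamespace(label_="PERSON", text="Amstor mall",
                    _=SimpleNamespace(kb_qid=None, nerd_score=None)),
    SimpleNamespace(label_="ORG", text="Kremenchuk",
                    _=SimpleNamespace(kb_qid="Q000002", nerd_score=0.3)),
]

rows = entity_rows(fake_ents, min_score=0.5)
print(rows)
```

With the real pipeline, the call would be `entity_rows(doc.ents, min_score=0.5)`, and the resulting list of dicts can be passed straight to `pd.DataFrame`.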

3.2 Wikipedia description

strwrap(reticulate::py$df99$description[[3]], width = 60)[1:10]
 [1] "'''Kremenchuk''' (, ;, [[Romanization of"                   
 [2] "Russian|translit.]] ''Kremenchug''), an important"          
 [3] "industrial [[city]] in central [[Ukraine]], stands on the"  
 [4] "banks of the [[Dnieper]] River. Kremenchuk is the [[Capital"
 [5] "city|administrative center]] of the [[Kremenchuk Raion]]"   
 [6] "([[Raion|district]]) in [[Poltava Oblast]]"                 
 [7] "([[Oblast|province]]). Kremenchuk is administratively"      
 [8] "incorporated as a [[City of regional significance"          
 [9] "(Ukraine)|city of oblast significance]] and does not belong"
[10] "to the raion. Population: Along with [[Svitlovodsk]] and"   
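The description comes back as raw wiki markup, with `'''bold'''` markers and `[[target|label]]` links. A rough sketch of a cleaner follows; the function name `strip_wikitext` is hypothetical, and the regular expressions only cover the markup visible in the output above (simple links and quote runs), not templates or references.

```python
import re


def strip_wikitext(text):
    """Reduce simple wiki markup to plain text."""
    # [[target|label]] becomes label; [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Drop bold/italic markers, i.e. runs of two or more apostrophes.
    text = re.sub(r"'{2,}", "", text)
    # Collapse any repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "'''Kremenchuk''' (, ;, [[Romanization of Russian|translit.]] "
    "''Kremenchug''), an important industrial [[city]] in central "
    "[[Ukraine]], stands on the banks of the [[Dnieper]] River."
)
cleaned = strip_wikitext(sample)
print(cleaned)
```

Applied to the data frame, this could be used as `df99["description"].map(strip_wikitext)`, assuming every description is a string.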

4 displaCy

from spacy import displacy
ss = list(doc.sents)

displacy.render(ss[:4], style="ent")
A photograph taken Tuesday [DATE] shows charred goods in a grocery store of the destroyed Amstor mall [PERSON] in Kremenchuk [LOC], central Ukraine [GPE], one day [DATE] after it was hit by a Russian [NORP] missile strike.
The death toll climbed to at least 20 [CARDINAL] after Monday [DATE]'s missile attack on a crowded mall in the central Ukrainian [GPE] city of Kremenchuk [ORG], which leaders at a Group of Seven [ORG] meeting called a "war crime."
On Tuesday [DATE], emergency responders ended a rescue search for survivors.
Russia [GPE]'s government denied hitting the shopping center, claiming it caught fire after Russia [GPE] struck a nearby weapons depot.
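displaCy produces HTML, which is not always convenient (in a console or a plain-text log, for instance). As a rough alternative, the sketch below inserts each label in brackets directly after its entity, working from character offsets. The helper `annotate` is hypothetical and assumes the spans do not overlap.

```python
def annotate(text, spans):
    """Insert ' [LABEL]' after each (start, end, label) character span."""
    out = []
    cursor = 0
    # Walk the spans left to right; overlapping spans are not supported.
    for start, end, label in sorted(spans):
        out.append(text[cursor:end])
        out.append(f" [{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


sentence = "On Tuesday, emergency responders ended a rescue search for survivors."
tagged = annotate(sentence, [(3, 10, "DATE")])
print(tagged)
```

With the pipeline above, the spans for a whole document could be built as `[(e.start_char, e.end_char, e.label_) for e in doc.ents]` and passed together with `doc.text`.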