Introduction
This post briefly describes methods for quantifying the political bias of online news media based on the media-sharing habits of US lawmakers on Twitter, a set of methods discussed in a previous post. The focus here is a more streamlined (and multi-threaded) approach to resolving shortened URLs via the quicknews package. We also present an unsupervised approach to visualizing media bias in two-dimensional space via tSNE, and compare results to Media Bias/Fact Check (MBFC), a manually curated fact- and bias-checking online resource, with some fairly nice results.
library(tidyverse)
localdir <- '/home/jtimm/jt_work/GitHub/data_sets'
## devtools::install_github("jaytimm/quicknews")
Tweet-set
The tweet-set used here was accessed via the GWU Library, and subsequently “hydrated” using the Hydrator desktop application. Tweets were generated by members of the 116th House from 3 Jan 2019 to 7 May 2020. Subsequent analyses are based on a sample of 500 tweets per lawmaker containing shared URLs.
setwd(localdir)
house_tweets <- readRDS('house116-sample-urls.rds') %>%
  filter(urls != '')
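For reference, a minimal sketch of how such a per-lawmaker sample might be drawn; the raw hydrated tweet table (raw_tweets below) and its exact column names are assumptions for illustration, not part of the original pipeline.

## Hypothetical sketch -- `raw_tweets` is assumed to be the full hydrated tweet table
house_tweets_sample <- raw_tweets %>%
  filter(urls != '') %>%                ## keep only tweets containing shared URLs
  group_by(user_screen_name) %>%
  slice_sample(n = 500) %>%             ## at most 500 tweets per lawmaker
  ungroup()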
Media bias data set
Media Bias/Fact Check is a fact-checking organization that classifies online news sources along two dimensions: (1) political bias and (2) factuality. These classifications (for ~850 sources) have been extracted by Baly et al. (2020), and made available in tabular format here.
setwd('/home/jtimm/jt_work/GitHub/packages/quicknews/data-raw')
## emnlp18 <- read.csv('emnlp18-corpus.tsv', sep = '\t')
acl2020 <- read.csv('acl2020-corpus.tsv', sep = '\t')
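For a quick sense of the data's shape (this summary is illustrative, not part of the original workflow), the sources can be cross-tabulated by factuality and bias classification:

## distribution of the ~850 sources across the two MB/FC dimensions
acl2020 %>% count(fact, bias)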
A sample of this data set is presented below.
set.seed(221)
acl2020 %>%
  group_by(fact, bias) %>%
  sample_n(1) %>%
  # ungroup() %>%
  select(source_url_normalized, fact, bias) %>%
  # spread(bias, source_url_normalized) %>%
  knitr::kable()
| source_url_normalized | fact  | bias   |
|-----------------------|-------|--------|
| wn.com                | high  | center |
| dailydot.com          | high  | left   |
| yellowhammernews.com  | high  | right  |
| freakoutnation.com    | low   | left   |
| christianaction.org   | low   | right  |
| wionews.com           | mixed | center |
| extranewsfeed.com     | mixed | left   |
| lifenews.com          | mixed | right  |
Resolving shortened URLs
The quicknews package is a collection of tools for navigating the online news landscape; here, we detail a simple workflow for multi-threaded URL un-shortening. It is a three-step process: (1) identify URLs that have been shortened via qnews_clean_urls, (2) split the vector of shortened URLs into batches via qnews_split_batches for distribution across multiple cores, and (3) resolve the shortened URLs via qnews_unshorten_urls.
## step 1: identify shortened URLs
shortened_urls <- quicknews::qnews_clean_urls(url = house_tweets$urls) %>%
  filter(is_short == 1)

## step 2: split URLs into batches for distribution across cores
batch_urls <- shortened_urls %>% quicknews::qnews_split_batches(n = 12)

## step 3: resolve shortened URLs in parallel, then recombine batches
unshortened_urls <- parallel::mclapply(lapply(batch_urls, "[[", 1),
                                       quicknews::qnews_unshorten_urls,
                                       seconds = 10,
                                       mc.cores = 12)

unshortened_urls1 <- data.table::rbindlist(unshortened_urls)
Media bias & tSNE
Build matrix
To aggregate these data, we build a simple domain-lawmaker matrix, in which each domain/news organization is represented by the number of times each lawmaker has shared one of its news stories.
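The matrix-building code below references a filtered tweet table (filt.tweets) and a domain-level share summary (share.summary) assembled earlier in the original workflow. A rough sketch of how those pieces might be constructed, assuming the resolved-URL table pairs original and unshortened URLs (the column names here are guesses) and using urltools::domain() for domain extraction:

## Hypothetical sketch -- column names in `unshortened_urls1` are assumptions
filt.tweets <- house_tweets %>%
  left_join(unshortened_urls1, by = c('urls' = 'url')) %>%
  mutate(full_url = ifelse(is.na(true_url), urls, true_url),
         source = gsub('^www\\.', '', urltools::domain(full_url)))

## keep domains shared by a minimum number of distinct lawmakers (threshold is arbitrary)
share.summary <- filt.tweets %>%
  distinct(source, user_screen_name) %>%
  count(source, name = 'n_lawmakers') %>%
  filter(n_lawmakers >= 5)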
ft1 <- filt.tweets %>%
  group_by(user_screen_name, source) %>%
  count() %>%
  filter(source %in% share.summary$source) %>%
  tidytext::cast_sparse(row = 'source',
                        column = 'user_screen_name',
                        value = n)

ft2 <- as.matrix(ft1) #%>% Rtsne::normalize_input()
Matrix top-left:
ft2[1:5, 1:5]
## AUSTINSCOTTGA08 BENNIEGTHOMPSON BETTYMCCOLLUM04 BILLPASCRELL
## abcnews.go.com 1 4 0 3
## airforcetimes.com 1 0 0 0
## ajc.com 6 0 0 0
## bloomberg.com 2 3 0 5
## c-span.org 2 1 4 3
## BOBBYSCOTT
## abcnews.go.com 0
## airforcetimes.com 0
## ajc.com 0
## bloomberg.com 2
## c-span.org 1
tSNE
set.seed(77) ## 9
tsne <- Rtsne::Rtsne(X = ft2, check_duplicates = FALSE)

tsne_clean <- data.frame(descriptor_name = rownames(ft1), tsne$Y) %>%
  # mutate(screen_name = toupper(descriptor_name)) %>%
  left_join(acl2020, by = c('descriptor_name' = 'source_url_normalized')) %>%
  replace(is.na(.), 'x')
Plot
Per the figure below, the first dimension of the tSNE solution does a fairly nice job of capturing differences in bias classification as presented by Media Bias/Fact Check, and results are generally intuitive. Factors underlying variation along the second dimension are less clear, and do not appear to capture factuality in this case. Note: news organizations indicated by orange Xs are not included in the MB/FC data set.
split_pal <- c('#3c811a', '#395f81', '#9e5055', '#e37e00')

tsne_clean %>%
  ggplot(aes(X1, X2)) +
  geom_point(aes(col = bias, shape = fact),
             size = 3) +
  geom_text(aes(label = descriptor_name,
                col = bias),  ## shape is not an aesthetic used by geom_text
            size = 3,
            check_overlap = TRUE) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = split_pal) +
  xlab('Dimension 1') + ylab('Dimension 2') +
  labs(title = "Measuring political bias")
Bias score distributions
tsne_clean %>%
  ggplot() +
  geom_density(aes(X1, fill = bias),
               alpha = .4) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_manual(values = split_pal) +
  ggtitle('Media bias scores by MB/FC bias classification')
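As a quick numeric complement to the density plot, a sketch summarizing the first tSNE dimension by MB/FC bias class (the 'x' level collects sources not included in the MB/FC data set):

## median position along dimension 1, per bias classification
tsne_clean %>%
  group_by(bias) %>%
  summarize(n = n(), median_dim1 = median(X1))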
Resources
Baly, Ramy, Georgi Karadzhov, Jisun An, Haewoon Kwak, Yoan Dinkov, Ahmed Ali, James Glass, and Preslav Nakov. 2020. “What Was Written Vs. Who Read It: News Media Profiling Using Text Analysis and Social Media Context.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ACL ’20.