twitter users, demographic inference & reticulate

Implementing M3 for demographic inference
python
twitter
demographics
Published

June 10, 2022

A simple code-through for using the Python library m3inference in R via reticulate. As described in Wang et al. (2019), the library facilitates demographic attribute inference for Twitter users (namely gender, age, and organizational status) based on profile images, screen names, names, and biographies.

1 Reticulate & Python

First, we build a conda environment (via the terminal) that includes pip and m3inference (and their respective dependencies).

## <TERMINAL>
conda create -n m3demo
source activate m3demo
conda install pip 
/home/jtimm/anaconda3/envs/m3demo/bin/pip install m3inference

Then we establish Python and conda environment paths.

## <R-console>
Sys.setenv(RETICULATE_PYTHON = "/home/jtimm/anaconda3/envs/m3demo/bin/python")

library(reticulate)
#reticulate::use_python("/home/jtimm/anaconda3/envs/m3demo/bin/python")
reticulate::use_condaenv(condaenv = "m3demo",
                         conda = "/home/jtimm/anaconda3/bin/conda")
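
Before going further, it can help to confirm that reticulate is pointed at the intended interpreter and that m3inference is visible from it. The sketch below is a minimal check, not part of the original workflow; it could be run from `reticulate::repl_python()` or directly in the environment's Python. The helper name `module_available` is our own.

```python
## <PYTHON-console>
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# With the setup above, this path should point into the m3demo environment.
print(sys.executable)

# Should report True if the pip install in the terminal step succeeded.
print("m3inference available:", module_available("m3inference"))
```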

2 Twitter data

For demonstration purposes, we identify/extract my Twitter followers (and some of their M3-relevant features) using the rtweet package.

## <R-console>
library(tidyverse)
fws  <- rtweet::get_followers(user = 'DrJayTimm') 

users <- rtweet::lookup_users(fws$user_id) %>%
  select(user_id, name, screen_name, description, profile_image_url)

Below is a simple hack to provide the M3 model with an actual image file for Twitter profiles that lack profile pics.

## <R-console>
jk <- 'http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'
jk0 <- 'https://twirpz.files.wordpress.com/2015/06/twitter-avi-gender-balanced-figure.png'

dir0 <- tempdir()

users2 <- users %>%
  mutate(profile_image_url = ifelse(profile_image_url == jk, jk0, profile_image_url)) %>%
  rename(id_str = user_id) 

3 Profile pics via M3

Write the Twitter user details to a local temp file in ndjson format (one JSON object per line).

## <R-console>
tmp1 <- tempfile()  # input for M3: raw user details (ndjson)
tmp2 <- tempfile()  # output from M3: restructured jsonl
jsonlite::stream_out(users2, file(tmp1), verbose = FALSE)
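
For readers unfamiliar with ndjson, the following sketch shows the format itself: one JSON object per line, no enclosing array. It is an illustration only; the two records are hypothetical (not real accounts), and `write_ndjson` / `read_ndjson` are helper names of our own rather than anything provided by jsonlite or m3inference.

```python
## <PYTHON-console>
import json
import os
import tempfile

# Hypothetical records, for illustration only.
records = [
    {"id_str": "1", "name": "Ada", "screen_name": "ada_example",
     "description": "Likes maps.", "profile_image_url": "http://example.com/a.png"},
    {"id_str": "2", "name": "Ben", "screen_name": "ben_example",
     "description": "Linguist.", "profile_image_url": "http://example.com/b.png"},
]


def write_ndjson(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Read a file of newline-delimited JSON objects into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


demo_path = os.path.join(tempfile.mkdtemp(), "users.ndjson")
write_ndjson(records, demo_path)

# Round trip: the records read back should equal the records written.
print(read_ndjson(demo_path) == records)
```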

In a Python console, we then import the M3Twitter class and set the directory in which Twitter profile pics will be stored. (Note that the directory established in the R chunk above is accessed below via the r. prefix.)

## <PYTHON-console>
from m3inference import M3Twitter
m3twitter = M3Twitter(cache_dir = r.dir0) 

Then, via the transform_jsonl method, we restructure the ndjson/jsonl file and download Twitter users’ profile pics to the temp directory. This method also identifies the language of each description. Note: While we can download profile images and identify description language in R, things tend to go much more smoothly (and more quickly) using the functionality included in m3inference.

## <PYTHON-console>
m3twitter.transform_jsonl(input_file = r.tmp1, 
                          output_file = r.tmp2, 
                          img_path_key = "profile_image_url")#, 
                          #lang_key = "lang")
/home/jtimm/anaconda3/envs/m3demo/lib/python3.10/site-packages/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
  warnings.warn(
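
Since downloads can fail silently for individual accounts, a quick check of the transformed file can be useful before running the model. The sketch below is an assumption-laden helper, not part of m3inference: the field names in `M3_KEYS` follow the input format described in the m3inference README as we understand it, and should be verified against the installed version. `check_m3_records` is a name of our own.

```python
## <PYTHON-console>
import json
import os

# Assumed field names (per the m3inference README); verify for your version.
M3_KEYS = ("id", "name", "screen_name", "description", "lang", "img_path")


def check_m3_records(path, required=M3_KEYS):
    """Return (line_number, problem) tuples for records that look unusable."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue
            rec = json.loads(line)
            missing = [k for k in required if k not in rec]
            if missing:
                problems.append((n, "missing keys: " + ", ".join(missing)))
            elif not os.path.isfile(rec["img_path"]):
                problems.append((n, "image not found: " + str(rec["img_path"])))
    return problems


# Inside the reticulate session, the transformed file could be checked with:
# problems = check_m3_records(r.tmp2)
```

An empty list would suggest every record has the expected fields and a local image file.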

4 Demographic inference via M3

Apply M3 classification model. Attribute classes:

  • Gender: male, female;

  • Age: <=18, 19-29, 30-39, >=40; and

  • Organization: non-org, is-org.

## <PYTHON-console>
from m3inference import M3Inference
m3 = M3Inference() 
pred = m3.infer(r.tmp2)
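
The reshaping done in R in this post treats pred as a nested mapping: user ID, then attribute, then class, then probability. Assuming that shape, the most probable class per attribute can also be pulled out directly in Python. The sketch below uses a made-up ID and made-up probabilities purely for illustration; `top_classes` is a helper name of our own.

```python
## <PYTHON-console>
# Illustrative structure only; the ID and probabilities are invented.
example_pred = {
    "0000000001": {
        "gender": {"male": 0.92, "female": 0.08},
        "age": {"<=18": 0.05, "19-29": 0.15, "30-39": 0.20, ">=40": 0.60},
        "org": {"non-org": 0.97, "is-org": 0.03},
    },
}


def top_classes(pred):
    """Map each user ID to its highest-probability class for every attribute."""
    return {
        user_id: {attr: max(probs, key=probs.get) for attr, probs in attrs.items()}
        for user_id, attrs in pred.items()
    }


print(top_classes(example_pred))
```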

4.1 Accessing classifications

Output/predictions from the Python-based M3 model can be moved into R via the (R-based) reticulate::py object.

## <R-console>
py_predictions <- reticulate::py$pred

The table below details age-gender-organization inferences by Twitter ID for a small subset of my followers.

## <R-console>
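df <- reshape2::melt(py_predictions)  # long format: L1 = ID, L2 = attribute, L3 = class
df0 <- data.table::setDT(df)[, .SD[which.max(value)], by = list(L1, L2)]  # top class per ID & attribute
df1 <- data.table::dcast(df0, L1  ~ L2, value.var = 'L3')
data.table::setnames(df1, 'L1', 'id')  # ID column name used in the table and the later join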
## <R-console>
df1 %>% sample_n(10) %>% knitr::kable()
id                   age    gender  org
1139150461599195137  19-29  male    non-org
394161473            >=40   male    non-org
164731384            30-39  male    non-org
700915202779279360   >=40   female  non-org
477371410            <=18   male    non-org
1075633925521846272  19-29  male    non-org
35538459             >=40   male    non-org
232957049            >=40   male    non-org
785346816            30-39  male    non-org
381070776            >=40   male    non-org

4.2 Demographic summary

4.2.1 By Organization

## <R-console>
table(df1$org)

 is-org non-org 
     16     159 

4.2.2 By Age & Gender

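Counts and percentages below cover only followers that have not been classified as organizations. Male percentages are negated so that the two genders fall on opposite sides of the pyramid plot.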

## <R-console>
df2 <- df1 %>%
  mutate(age = factor(age, levels = c('<=18', '19-29', '30-39', '>=40'))) %>%
  filter(org != 'is-org') %>%
  count(gender, age) %>%
  mutate(percent = round(n/sum(n)*100,1)) %>%
  mutate(percent = ifelse(gender == "male", percent*-1, percent))

df2 %>% knitr::kable()
gender  age    n   percent
female  19-29  12      7.5
female  30-39   5      3.1
female  >=40   23     14.5
male    <=18    6     -3.8
male    19-29  32    -20.1
male    30-39  25    -15.7
male    >=40   56    -35.2
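
As a cross-check on the arithmetic, the percent column can be recomputed from the counts alone: each count is divided by the total number of non-organization followers (159), rounded to one decimal, and negated for males. The sketch below is a plain-Python restatement of that calculation, with the counts copied from the table above; `signed_percents` is a helper name of our own.

```python
## <PYTHON-console>
# Counts copied from the table above (non-organization followers only).
counts = [
    ("female", "19-29", 12),
    ("female", "30-39", 5),
    ("female", ">=40", 23),
    ("male", "<=18", 6),
    ("male", "19-29", 32),
    ("male", "30-39", 25),
    ("male", ">=40", 56),
]


def signed_percents(rows):
    """Return (gender, age, n, percent) with male percentages negated."""
    total = sum(n for _, _, n in rows)
    out = []
    for gender, age, n in rows:
        pct = round(n / total * 100, 1)
        out.append((gender, age, n, -pct if gender == "male" else pct))
    return out


for row in signed_percents(counts):
    print(row)
```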

4.3 Age-Gender “pyramid”

## <R-console>
maxs <- max(abs(df2$percent))
df2 %>%
  ggplot(aes(x = age, y = percent, fill = gender)) +
  geom_col(alpha = .75) + 
  ylim(-maxs - 1, maxs + 1) +
  coord_flip() +
  ggthemes::scale_fill_stata() +
  # scale_y_continuous(breaks = c(-5, 0, 5),
  #                    labels = c("5%", "0%", "5%")) +
  labs(title="Inferred age-gender demographics of my followers")

5 Profile pics & demographic inference

## <R-console>
# Attach the resized (224x224) image paths saved in dir0. Sorting users by ID
# assumes the image files are listed in that same order.
users3 <- users2 %>%
  arrange(id_str) %>%
  mutate(paths = grep('224x224', dir(dir0, full.names = TRUE), value = TRUE)) %>%
  left_join(df1, by = c('id_str' = 'id'))
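
Matching images to users by sort order works only when every user has exactly one resized image and both sort the same way. A more defensive alternative is to key each path by the ID embedded in its file name. The sketch below assumes the resized images are named `<id>_224x224.<ext>`; that naming is our assumption about how m3inference writes its cache and should be confirmed by listing the directory. `paths_by_id` is a helper name of our own.

```python
## <PYTHON-console>
import os


def paths_by_id(image_dir, marker="_224x224"):
    """Map user ID -> resized image path, taking the ID from the file name.

    Assumes files are named '<id>_224x224.<ext>'; other files are ignored.
    """
    lookup = {}
    for fname in sorted(os.listdir(image_dir)):
        stem, _ext = os.path.splitext(fname)
        if stem.endswith(marker):
            lookup[stem[: -len(marker)]] = os.path.join(image_dir, fname)
    return lookup


# Inside the reticulate session, this could be called as paths_by_id(r.dir0).
```

The resulting mapping could then be joined to the user table by ID, so that a missing or extra image does not shift every subsequent row.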

Next, a simple function for modifying profile pics. It (1) “charcoals” photos for user privacy, and (2) labels photos with predicted age, gender, and organization classes.

## <R-console>
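modify_images <- function(paths){
  
  # Note: labels are read from the global users3, so paths should be
  # users3$paths (same row order)
  for(i in seq_along(paths)){
    y1 <- magick::image_read(paths[i])
    y2 <- magick::image_charcoal(y1)  # obscure the photo for privacy
    y3 <- magick::image_border(y2, 'white', '5x5')
    
    # Label: predicted organization, gender, and age classes
    ll <- paste0(users3$org[i], '\n',
                 users3$gender[i], '\n',
                 users3$age[i])
    
    y4 <- magick::image_annotate(y3, 
                                 text = ll, 
                                 color = "black", 
                                 size = 26,
                                 weight = 700,
                                 location = "+10+10")
        
    # Overwrite the cached image file with the modified version
    magick::image_write(y4, paths[i]) 
    }
}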

Apply function, and build a collage of profile pics with predicted demographics using the photomoe package.

## <R-console>
modify_images(paths = users3$paths)

# devtools::install_github("jaytimm/photomoe")
photomoe::img_build_collage(paths = users3$paths, 
                            dimx = 7, 
                            dimy = 12)

References

Wang, Zijian, Scott Hale, David Ifeoluwa Adelani, Przemyslaw Grabowicz, Timo Hartman, Fabian Flöck, and David Jurgens. 2019. “Demographic Inference and Representative Population Estimates from Multilingual Social Media Data.” In The World Wide Web Conference, 2056–67.