6 Get Started

In this chapter we briefly explore the twinetverse: we collect tweets, build and visualise our first graph. The graphs are built with graphTweets (part of the ’verse) which allows building several kinds of graphs, most of which we will explore, however, we will start with a graph that attempts at depicting how users communicate with each other by looking at who @tags who tweets.

If you follow this along in RStudio; the visualisations do not open in the viewer and instead open in your default browser.

6.1 Collect

See the prerequisites section if the line below confuses you.

TK <- readRDS("token.rds")

rtweet lets you do a lot of things, however within the context of the twinetverse we mainly use its search_tweets to get tweets.

tweets <- search_tweets("rstats", token = TK)
## Searching for tweets...
## Finished collecting tweets!

The search_tweets function takes a few arguments some of which we’ll eventually get into, above we run the simplest possible call; fetching tweets about “rstats”, a reference to the R (R Core Team 2019) Twitter #hashtag, by default the function returns 100 tweets. Note that we also pass our token to the function.

Each row a is a tweet, rtweet returns quite a lot of variables (88), we’ll only look at a select few.

names(tweets)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"

6.2 Build

Now we can use the second package part of the twinetverse, graphTweets. Again, we’ll leave all function’s arguments to default to get a simple graph. There’s a lot more to the package which we’ll uncover progressively as we move through the book.

There are two ways of building a network of Twitter users with graphTweets, the one we use in this book is preferable over the other as it is much more accurate.

  1. The more accurate gt_edges based on mentions_screen_name (names of users @tagged in tweets) provided by the Twitter API.
  2. The less accurate gt_edges_from_text which extracts the @tagged users from the tweets’ text, essentially the same as mentions_screen_name but likely less accurate.
net <- tweets %>% 
  gt_edges(source = screen_name, target = mentions_screen_name)

We called gt_edges on our tweets data.frame, passing a few bare column names. The source of the tweets (the user posting the tweets) will also be the source of our edges so we pass source = screen_name, then the target of these edges will be users @tagged in the tweets, which is given by the API as mentions_screen_name; this will be target of our edges.

The object returned is of an unfamiliar class.

class(net)
## [1] "graphTweets"

To extracts the results from graphTweets run gt_collect, this will work at any point in the chain of pipes (%>%).

net <- net %>% 
  gt_collect()

class(net)
## [1] "list"

Great but this returns a lists and R users much prefer data.frames. graphTweets actualy returns two data.frames that are encapsulated in a list. Indeed networks cannot be compressed into a single data.frame, we need 1) nodes and 2) edges.

names(net)
## [1] "edges" "nodes"

A network consists of nodes and edges: this is just what graphTweets returns.

Great, so it looks like we have both nodes and edges, not really. We only have edges, net$nodes is actually NULL.

lapply(net, class)
## $edges
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $nodes
## [1] "NULL"

In hindsights, we only ran gt_edges so it make sense to only have edges. Let’s scrap that and get both nodes and edges.

net <- tweets %>% 
  gt_edges(screen_name, mentions_screen_name) %>% # get edges
  gt_nodes() %>% # get nodes
  gt_collect() # collect

lapply(net, class)
## $edges
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $nodes
## [1] "tbl_df"     "tbl"        "data.frame"

Before we move on, something to note. graphTweets requires that you run the functions in the correct order, first gt_edges and second gt_nodes. This is because one can only know the nodes of a graph based on the edges and not vice versa.

Run graphTweets’ functions in the correct order, first get the edges then the nodes.

We have downloaded tweets used them to build a data.frame of nodes and another of edges. Now we’re ready to visualise the data.

6.3 Visualise

We can visualise the network with sigmajs. Then again, it’s very easy and follows the same idea as graphTweets; we pipe our nodes and edges through. Before we do so, for the sake of clarity, let’s unpack our network using the %<-% from the Zeallot package (Teetor 2018), imported by the twinetverse.

c(edges, nodes) %<-% net

You can always unpack the network with edges <- net$edges and nodes <- net$nodes if you are not comfortable with the above.

Let’s take a look at the edges.

head(edges)
source target n
alanfsoli rbloggers 1
amin_e22 neptanum 1
amosoketch gp_pulipaka 1
angelanovari themiserablephd 1
annahenschel dalejbarr 1
annahenschel debruine 1

Each edge simply consists of a source and a target. In this network, the source essentially corresponds to screen_name passed in gt_edges, it is the user who posted the tweet. In contrast, target includes the users that were tagged in the text of the tweet. The n variable indicates how many tweets connect the source to the target.

Now let’s take a look at the nodes:

head(nodes)
nodes type n
acotgreave user 1
alanfsoli user 1
amin_e22 user 1
amosoketch user 1
andrewheiss user 2
angelanovari user 1

In the nodes data frame, the column n is the number of times the node appears (whether as source or as target), while the nodes column are the Twitter handles of both the authors of the tweets (screen_name) and those who were @tagged in the tweets. All nodes are of type user in this network, but we will have examples later in the book where it may not be.

Below we rename a few columns, to meet sigmajs’ naming convention.

  1. We add ids to our nodes, this can be a string and thus simply corresponds to our nodes column.
  2. We essentially rename n to size as this is what sigmajs understands.
  3. We add ids to our edges as sigmajs requires each edge to have a unique id.
nodes$id <- as.factor(nodes$nodes) 
nodes$size <- nodes$n 
edges$id <- seq(1, nrow(edges)) 

sigmajs has a specific but sensible naming convention as well as basic minimal requirements.

  • Nodes must at least include id, and size.
  • Edges must at least include id, source, and target.

Actually, the twinetverse comes with helper functions to prepare the nodes and edges built by graphTweets for use in sigmajs (these are the only functions the ’verse provides).

nodes <- nodes2sg(nodes)
edges <- edges2sg(edges)

You need to respect sigmajs naming convention or the graph will not display.

Let’s visualise that, we must initialise every sigmajs graph with the sigmajs function, then we add our nodes with sg_nodes, passing the column names we mentioned previously, id, and size to meet sigmajs’ minimum requirements. In sigmajs, at the exception of the function called sigmajs, all start with sg_

sigmajs() %>% 
  sg_nodes(nodes, id, size) 

sigmajs actually allows you to build graphs using only nodes or edges, we’ll see why this is useful in a later chapter on temporal graphs. Let’s add the edges. Then again, to meet sigmajs’ requirements, we pass id, source and target.

sigmajs() %>% 
  sg_nodes(nodes, id, size) %>% 
  sg_edges(edges, id, source, target)

This graph does not look great. We’ll beautify that bit by bit as we move through the book: sigmajs is highly customisable.

Nevermind beauty, what’s on the graph exactly? Each disk/point on the graph is a twitter user, they are connected when one has tagged the other in the a tweet.

You may also notice that the graph contains surprisingly few nodes, given that we queried 100 tweets you would expect over 100 nodes on the graph. This is because our visualisation only includes tweets that mention other users and most tweets are not targeted (tagged) at other users. There is an easy remedy to this which we’ll look at in the next section.

Remember the workflow of the twinetverse:

  1. We collect the data
  2. We build the graph
  3. We visualise the network

6.4 Recap

Let’s recapitulate before moving on to the next section. The section may look long~ish but the code is not, here it is put together.

library(dplyr)

# COLLECT
tweets <- search_tweets("rstats", token = TK)

# BUILD
net <- tweets %>% 
  gt_edges(screen_name, mentions_screen_name) %>% 
  gt_nodes() %>% 
  gt_collect() 

c(edges, nodes) %<-% net # unpack

# prepare for sigmajs
nodes <- nodes2sg(nodes)
edges <- edges2sg(edges)

# VISUALISE
sigmajs() %>% 
  sg_nodes(nodes, id, size) %>% 
  sg_edges(edges, id, source, target)

6.5 Improve

Let’s collect more tweets this time, we’ll also optimise our Twitter query. This is very useful as the Twitter API (like the vast majority of APIs) limits the amount of data you can access by imposing a rate limit. You can always check where you stand with the various Twitter rate limits with rate_limit().

We set include_rts = FALSE as we don’t need the same tweet multiple times, it does not add information to our graph (currently but it could). We also pass a slightly more sophisticated query to the search tweet endpoint. This is too often overlooked, the Twitter API provides advanced operators: you are not limited to searching a single keyword every time.

Optimise your queries or you’ll be hit by frustrating waiting times.

We query 1,000 tweets that:

  • Include #rstats
  • Include a mention i.e.: @jdatap
  • Are original (not re-tweets)

Remember to load your token if you’re in a new environment.

# TK <- readRDS(file = "token.rds")
tweets <- search_tweets("#rstats filter:mentions", n = 1000, token = TK, include_rts = FALSE)
## Searching for tweets...
## Finished collecting tweets!

Let’s build the graph, just like we did before. There is more to graphTweets but we won’t look into that just yet.

net <- tweets %>% 
  gt_edges(screen_name, mentions_screen_name) %>% 
  gt_nodes() %>% 
  gt_collect()

Let’s make a slightly more interesting visualisation this time. First, we’ll prepare the data for sigmajs like we did in the get started chapter.

c(edges, nodes) %<-% net

nodes <- nodes2sg(nodes)
edges <- edges2sg(edges)

Now onto the visualisation.

  • We add labels that will display on hover by simply passing the label column to sg_nodes.
  • We color the nodes by cluster with sg_cluster.
  • We layout the graph appropriately using one of igraph’s (file. 2019) many layout algorithms with sg_layout.
  • We use sigmajs’ settings to change the edges color.
sigmajs("webgl") %>% 
  sg_nodes(nodes, id, label, size) %>% 
  sg_edges(edges, id, source, target) %>% 
  sg_layout(layout = igraph::layout_components) %>% 
  sg_cluster(
    colors = c(
      "#0075a0",
      "#0084b4",
      "#00aced",
      "#1dcaff",
      "#c0deed"
      )
  ) %>% 
  sg_settings(
    minNodeSize = 1,
    maxNodeSize = 2.5,
    edgeColor = "default",
    defaultEdgeColor = "#d3d3d3"
  )
## Found # 266 clusters

Already looking better.

References

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Teetor, Nathan. 2018. Zeallot: Multiple, Unpacking, and Destructuring Assignment. https://CRAN.R-project.org/package=zeallot.

file., See AUTHORS. 2019. Igraph: Network Analysis and Visualization. https://CRAN.R-project.org/package=igraph.