Chapter 6 Get Started
In this chapter we briefly explore the twinetverse: we collect tweets, then build and visualise our first graph. The graphs are built with graphTweets (part of the ’verse), which allows building several kinds of graphs, most of which we will explore. We will start, however, with a graph that attempts to depict how users communicate with each other by looking at who @tags whom in their tweets.
If you follow along in RStudio, note that the visualisations do not open in the viewer; they open in your default browser instead.
See the prerequisites section if the line below confuses you.
rtweet lets you do a lot of things; within the context of the twinetverse, however, we mainly use its search_tweets function to get tweets.
## Searching for tweets...
## Finished collecting tweets!
The search_tweets function takes a few arguments, some of which we’ll eventually get into. Above we run the simplest possible call, fetching tweets about “rstats”, a reference to the R (R Core Team 2017) Twitter #hashtag. By default the function returns 100 tweets. Note that we also pass our token to the function.
Each row is a tweet. rtweet returns quite a lot of variables (88); we’ll only look at a select few.
##  "user_id" "status_id"
##  "created_at" "screen_name"
##  "text" "source"
##  "display_text_width" "reply_to_status_id"
##  "reply_to_user_id" "reply_to_screen_name"
##  "is_quote" "is_retweet"
##  "favorite_count" "retweet_count"
##  "hashtags" "symbols"
##  "urls_url" "urls_t.co"
##  "urls_expanded_url" "media_url"
##  "media_t.co" "media_expanded_url"
##  "media_type" "ext_media_url"
##  "ext_media_t.co" "ext_media_expanded_url"
##  "ext_media_type" "mentions_user_id"
##  "mentions_screen_name" "lang"
##  "quoted_status_id" "quoted_text"
##  "quoted_created_at" "quoted_source"
##  "quoted_favorite_count" "quoted_retweet_count"
##  "quoted_user_id" "quoted_screen_name"
##  "quoted_name" "quoted_followers_count"
##  "quoted_friends_count" "quoted_statuses_count"
##  "quoted_location" "quoted_description"
##  "quoted_verified" "retweet_status_id"
##  "retweet_text" "retweet_created_at"
##  "retweet_source" "retweet_favorite_count"
##  "retweet_retweet_count" "retweet_user_id"
##  "retweet_screen_name" "retweet_name"
##  "retweet_followers_count" "retweet_friends_count"
##  "retweet_statuses_count" "retweet_location"
##  "retweet_description" "retweet_verified"
##  "place_url" "place_name"
##  "place_full_name" "place_type"
##  "country" "country_code"
##  "geo_coords" "coords_coords"
##  "bbox_coords" "status_url"
##  "name" "location"
##  "description" "url"
##  "protected" "followers_count"
##  "friends_count" "listed_count"
##  "statuses_count" "favourites_count"
##  "account_created_at" "verified"
##  "profile_url" "profile_expanded_url"
##  "account_lang" "profile_banner_url"
##  "profile_background_url" "profile_image_url"
Now we can use the second package of the twinetverse, graphTweets. Again, we’ll leave all of the function’s arguments at their defaults to get a simple graph. There’s a lot more to the package, which we’ll uncover progressively as we move through the book.
There are two ways of building a network of Twitter users with graphTweets; the one we use in this book is preferable as it is much more accurate.
- The more accurate: mentions_screen_name (names of users @tagged in tweets), provided by the Twitter API.
- The less accurate: gt_edges_from_text, which extracts the @tagged users from the tweets’ text; essentially the same as mentions_screen_name but likely less accurate.
We run gt_edges on our tweets data.frame, passing a few bare column names. The source of the tweets (the user posting the tweets) will also be the source of our edges, so we pass source = screen_name. The targets of these edges are the users @tagged in the tweets, which the API gives as mentions_screen_name; this will be the target of our edges.
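To make the source/target idea concrete, here is a minimal sketch in base R that builds the same kind of edge list by hand. It does not use graphTweets: the toy tweets table, its users and the counting step are made up for illustration, and only the column names screen_name and mentions_screen_name come from the text above.

```r
# Toy stand-in for rtweet output: hypothetical users, not real data.
tweets <- data.frame(
  screen_name = c("alice", "bob", "alice"),
  stringsAsFactors = FALSE
)
# One list element per tweet; NA means the tweet mentions nobody.
tweets$mentions_screen_name <- list(c("bob", "carol"), NA_character_, "bob")

# One row per (author, mentioned user) pair; tweets without mentions are dropped.
pairs <- Map(function(src, tgts) {
  tgts <- tgts[!is.na(tgts)]
  if (length(tgts) == 0) return(NULL)
  data.frame(source = src, target = tgts, stringsAsFactors = FALSE)
}, tweets$screen_name, tweets$mentions_screen_name)
pairs <- do.call(rbind, unname(Filter(Negate(is.null), pairs)))

# Count how many tweets connect each source to each target.
pairs$n <- 1L
edges <- aggregate(n ~ source + target, data = pairs, FUN = sum)
print(edges)
```

The tweet by "bob" mentions nobody, so it contributes no edge; this is the same reason a graph of mentions can hold fewer users than there are tweets.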
The object returned is of an unfamiliar class.
##  "graphTweets"
To extract the results from graphTweets, run gt_collect; this works at any point in the chain of pipes (%>%).
##  "list"
Great, but this returns a list and R users much prefer data.frames. graphTweets actually returns two data.frames encapsulated in a list. Indeed, networks cannot be compressed into a single data.frame; we need 1) nodes and 2) edges.
##  "edges" "nodes"
A network consists of nodes and edges: this is just what graphTweets returns.
Great, so it looks like we have both nodes and edges. Not really: we only have edges, net$nodes is actually NULL.
## $edges
##  "tbl_df" "tbl" "data.frame"
##
## $nodes
##  "NULL"
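As a small aside, the sketch below shows one way to spot an empty component in such a list with base R. The net object here is a hand-made stand-in (a hypothetical one-row edge table and a NULL nodes slot), not the object graphTweets returns.

```r
# Hand-made stand-in for a collected network: edges present, nodes missing.
net <- list(
  edges = data.frame(source = "alice", target = "bob", n = 1L),
  nodes = NULL
)

# Both names exist, even though one component is empty.
print(names(net))

# Which components are NULL?
missing_parts <- names(net)[vapply(net, is.null, logical(1))]
print(missing_parts)
```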
Well, we only ran gt_edges so it makes sense to only have edges. Let’s scrap that and get both nodes and edges.
## $edges
##  "tbl_df" "tbl" "data.frame"
##
## $nodes
##  "tbl_df" "tbl" "data.frame"
Before we move on, something to note: graphTweets requires that you run the functions in the correct order, first gt_edges and then gt_nodes. This is because one can only know the nodes of a graph based on its edges and not vice versa.
Run graphTweets’ functions in the correct order, first get the edges then the nodes.
Now we’re ready to visualise the data: we have downloaded tweets and used them to build a data.frame of nodes and another of edges.
We can visualise the network with sigmajs. Again, it’s very easy and follows the same idea as graphTweets: we pipe our nodes and edges through. Before we do so, for the sake of clarity, let’s unpack our network using the %<-% operator from the zeallot package (Teetor 2018), imported by the twinetverse.
You can always unpack the network with edges <- net$edges and nodes <- net$nodes if you are not comfortable with the above.
Let’s take a look at the edges.
Edges simply consist of source and target. As explained earlier on, source essentially corresponds to the screen_name passed in gt_edges; it is the user who posted the tweet. In contrast, target includes the users that were tagged in the text of the tweet. The n variable indicates how many tweets connect the source to the target.
Now let’s take a look at the nodes:
In the nodes data frame, the column n is the number of times the node appears (whether as source or as target), while the nodes column holds the Twitter handles of both the authors of the tweets (screen_name) and those who were @tagged in the tweets. All nodes are users of course, but we will see another graph later in the book where they may not be.
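The following base R sketch shows why nodes can be derived from edges: every node is simply a handle that appears as a source or a target. The edge table is hypothetical, and counting each edge row once is an assumption made for the example; graphTweets may count appearances differently.

```r
# Hypothetical edge table with the columns described above.
edges <- data.frame(
  source = c("alice", "alice", "bob"),
  target = c("bob", "carol", "alice"),
  n = c(2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Every handle seen as source or target is a node; count its appearances.
appearances <- table(c(edges$source, edges$target))
nodes <- data.frame(
  nodes = names(appearances),
  n = as.integer(appearances),
  stringsAsFactors = FALSE
)
print(nodes)
```

Going the other way is impossible: a list of handles alone says nothing about who tagged whom.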
Below we rename a few columns to meet sigmajs’ naming convention.
- We add ids to our nodes; an id can be a string and thus simply corresponds to our nodes column.
- We essentially rename n to size, as this is what sigmajs understands.
- We add ids to our edges, as sigmajs requires each edge to have a unique id.
sigmajs has a specific but sensible naming convention as well as basic minimal requirements.
- Nodes must at least include id and size.
- Edges must at least include id, source and target.
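A minimal sketch of that preparation in base R, on hypothetical data, might look as follows. It assumes the nodes table has columns nodes and n and the edges table has source, target and n, as described earlier; the names nodes_sg and edges_sg are made up for the example.

```r
# Hypothetical nodes and edges, shaped like the tables described earlier.
nodes <- data.frame(
  nodes = c("alice", "bob", "carol"),
  n = c(3L, 2L, 1L),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  source = c("alice", "alice", "bob"),
  target = c("bob", "carol", "alice"),
  n = c(2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Nodes: the handle doubles as the id, and the count becomes the size.
nodes_sg <- data.frame(
  id = nodes$nodes,
  size = nodes$n,
  stringsAsFactors = FALSE
)

# Edges: add a unique id per row; source and target must match node ids.
edges_sg <- data.frame(
  id = seq_len(nrow(edges)),
  source = edges$source,
  target = edges$target,
  stringsAsFactors = FALSE
)
```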
Well actually, the twinetverse comes with helper functions to prepare the nodes and edges built with graphTweets for use in sigmajs (these are the only functions the ’verse provides).
You need to respect sigmajs’ naming convention or the graph will not display.
Let’s visualise that. We must initialise every sigmajs graph with the sigmajs function, then we add our nodes with sg_nodes, passing the column names we mentioned previously, id and size, to meet sigmajs’ minimum requirements. In sigmajs, with the exception of the function called sigmajs, all functions start with sg_.
sigmajs actually allows you to build graphs using only nodes or only edges; we’ll see why this is useful in a later chapter on temporal graphs. Let’s add the edges. Again, to meet sigmajs’ requirements, we pass id, source and target.
This graph does not look great. We’ll beautify that bit by bit as we move through the book: sigmajs is highly customisable.
Never mind beauty, what’s on the graph exactly? Each disk/point on the graph is a Twitter user; two users are connected when one has tagged the other in a tweet.
You may also notice that the graph contains surprisingly few nodes: given that we queried 100 tweets, you would expect over 100 nodes on the graph. This is because our visualisation only includes tweets that mention other users, and most tweets are not targeted (tagged) at other users. There is an easy remedy to this, which we’ll look at in the next section.
Remember the workflow of the twinetverse:
- We collect the data
- We build the graph
- We visualise the network
Let’s recapitulate before moving on to the next section. The section may look longish but the code is not; here it is put together.
library(dplyr)

# COLLECT
tweets <- search_tweets("rstats", token = TK)

# BUILD
net <- tweets %>%
  gt_edges(screen_name, mentions_screen_name) %>%
  gt_nodes() %>%
  gt_collect()

c(edges, nodes) %<-% net # unpack

# prepare for sigmajs
nodes <- nodes2sg(nodes)
edges <- edges2sg(edges)

# VISUALISE
sigmajs() %>%
  sg_nodes(nodes, id, size) %>%
  sg_edges(edges, id, source, target)
Let’s collect more tweets this time; we’ll also optimise our Twitter query. This is very useful as the Twitter API (like the vast majority of APIs) limits the amount of data you can access by imposing a rate limit, and you can always check where you stand with the various Twitter rate limits.
We pass include_rts = FALSE as we don’t need the same tweet multiple times: it does not add information to our graph (currently, but it could). We also pass a slightly more sophisticated query to the search tweets endpoint. This is too often overlooked: the Twitter API provides advanced operators, so you are not limited to searching a single keyword every time.
Optimise your queries or you’ll be hit by frustrating waiting times.
We query 1,000 tweets that:
- Include a mention i.e.:
- Are original (not re-tweets)
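As a rough sketch of how such a query could be assembled, the snippet below builds the search string and the arguments in plain R without contacting Twitter. The "#rstats" keyword and the filter:mentions operator are assumptions chosen for illustration, as is the args list; the actual call is left commented out because it needs a valid token (TK) and network access.

```r
# Assumption: "filter:mentions" is used as a search operator that keeps
# only tweets mentioning another user; "#rstats" is the example keyword.
query <- paste("#rstats", "filter:mentions")

# Arguments we would hand to search_tweets(): 1,000 tweets, no retweets.
args <- list(q = query, n = 1000, include_rts = FALSE)
print(args)

# Not run here (requires a token and network access):
# tweets <- do.call(rtweet::search_tweets, c(args, list(token = TK)))
```

Filtering on the server side like this means fewer of the rate-limited requests are spent on tweets that would contribute no edge to the graph.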
Remember to load your token if you’re in a new environment.
## Searching for tweets...
## Finished collecting tweets!
Let’s build the graph, just like we did before. There is more to graphTweets but we won’t look into that just yet.
Let’s make a slightly more interesting visualisation this time. First, we’ll prepare the data for sigmajs like we did in the get started chapter.
Now onto the visualisation.
- We add labels that will display on hover by simply passing the label column to sg_nodes.
- We color the nodes by cluster with sg_cluster.
- We lay out the graph appropriately using one of igraph’s (file. 2018) many layout algorithms with sg_layout.
- We use sigmajs’ settings to change the edges’ color.
sigmajs("webgl") %>%
  sg_nodes(nodes, id, label, size) %>%
  sg_edges(edges, id, source, target) %>%
  sg_layout(layout = igraph::layout_components) %>%
  sg_cluster(
    colors = c(
      "#0075a0",
      "#0084b4",
      "#00aced",
      "#1dcaff",
      "#c0deed"
    )
  ) %>%
  sg_settings(
    minNodeSize = 1,
    maxNodeSize = 2.5,
    edgeColor = "default",
    defaultEdgeColor = "#d3d3d3"
  )
## Found # 327 clusters
Already looking better.
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Teetor, Nathan. 2018. Zeallot: Multiple, Unpacking, and Destructuring Assignment. https://CRAN.R-project.org/package=zeallot.
file., See AUTHORS. 2018. Igraph: Network Analysis and Visualization. https://CRAN.R-project.org/package=igraph.