Kansas State University


IT News

The NodeXL Series: Conducting a Twitter User Network Crawl (Part 6)

Per the prior entry, a hashtag search is very time-dependent and ephemeral, whereas the user accounts and relationships created around entities (people, organizations, companies, robots, and “cyborgs”) tend to be more stable.  While the research does not necessarily show that a follower / following reciprocal relationship means that all Tweets are read and engaged with, these relationships do show some initial commitment and a public declaration of a kind of relationship.  (Those interested in the research may find that there are surprises, such as that popularity and positive word-of-mouth do not necessarily translate to sales commitments.  Further, there is enough system gaming through ‘bot and other accounts that an accurate read of a user network requires more digging and critical analysis.)

First, it helps to pick a “target.”  A search on a search engine of an organization’s name “and Twitter” will often lead to the account information. For our purposes, we’ll go with the Centers for Disease Control and Prevention (CDC), in part because they have a clear social media strategy to engage their constituents.

A Limited Crawl of the CDCGov User Network on Twitter

The official Twitter account for CDCGov is https://twitter.com/CDCgov.  (Do read the fine print carefully to make sure that you haven’t landed on a fake site.  There are many pretenders, some not-so-subtle and others much harder to spot.)

A quick perusal of their main page shows an account with 5,872 Tweets, 217 following, and 151,034 followers.  The 5,872 Tweets refer to the number of microblogging messages (up to 140 characters each) that have emanated from this site.  The 217 following means that this account follows these 217 accounts and ostensibly keeps track of the contents coming out of these account feeds.  The 151,034 followers are those who follow the CDC feed.  These stats meet the classic definition of a popular account with many more followers than those it is following.  A quick perusal of recent Tweets shows this as a very active site, with the most recent Tweet at the time of this screen capture at 36 minutes prior.  Other Tweets were as recent as a few hours before that.  To the left are some links to Twitpics from the account along with links to videos.  Visitors to this site may read some of the Tweets to get the gist of the broadcast conversations.  (It is important to draw information from a range of sources in order to better understand a data extraction from a social media platform.  Context is important.)
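The “many more followers than following” observation can be put as a simple ratio.  The short sketch below uses the counts quoted above; the helper name follower_ratio is made up for this illustration and is not part of NodeXL or Twitter.

```python
# Counts read from the CDCgov profile page at the time of the screen capture.
followers = 151_034
following = 217


def follower_ratio(followers: int, following: int) -> float:
    """Hypothetical helper: followers per account followed."""
    if following == 0:
        return float("inf")  # an account that follows no one
    return followers / following


ratio = follower_ratio(followers, following)
print(f"{ratio:.0f} followers for every account followed")  # 696 followers ...
```

A ratio in the hundreds is one rough signal of a broadcast-style account, though it says nothing about how many of those followers are active or genuine.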

Parameters of the User Network Crawl  

To start the user network data extraction for CDCGov, please start the NodeXL Excel Template.  Click on the NodeXL tab.  To the far left of the ribbon, click on “Import.”  The “Import from Twitter User’s Network” window will appear with the default settings shown below.

Put “CDCGov” (without the quotation marks) in the Twitter user field.  The search is not usually case sensitive, but it helps to use correct capitalization just to be accurate.

For our purposes, we will pursue vertices only for the “Person followed by the user,” because the 217 following are an elite group.  In terms of edges, this crawl will include only the “followed / following relationship” to keep the crawl simple.  In terms of the levels to include, this crawl will only be done at the 1-degree level, or only at the level of the ego neighborhood for the focal node “CDCGov.”   We will also select “Add a Latest Tweet column to the Vertices worksheet.”

This crawl will first be done in an unlimited way (without any artificial limit) because the 217 following and the 1-degree crawl should themselves limit the capture.  (Very complex social networks, though, may be misleading in terms of size.  Any change in the parameters will add complexity and size to most social network graphs.)  It helps to have as few artificial limits on a crawl as possible.  (In terms of the research literature, it is hard to find consensus around how much of a social network is representative of the larger whole.  In fact, very little work has been done on the generalizability.  Many simply use high-level software and high-capacity computers and capture whole networks, even if they have millions of nodes or vertices.)

The completed parameters look like this.
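For note-keeping, the choices described above can also be written down as a plain data structure.  This is only a record of the settings named in the text; every key name here is invented for the sketch and is not a NodeXL setting or API.

```python
# Plain summary of the import choices described in the text.
# All key names are hypothetical; they are not NodeXL settings.
crawl_parameters = {
    "twitter_user": "CDCGov",
    "add_vertex_for": ["person followed by the user"],
    "add_edge_for": ["followed/following relationship"],
    "levels": 1,                      # 1-degree: the ego neighborhood only
    "add_latest_tweet_column": True,
    "limit_to_n_people": None,        # None = no artificial limit
}

for name, value in crawl_parameters.items():
    print(f"{name}: {value}")
```

Keeping a record like this alongside the saved workbook makes it easier to repeat or vary a crawl later.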

Click “OK” to start the data extraction.  When the capture is complete, a window about “Text Wrapping” (an option to speed up the import of the data into the workbook) will be shown.

At this juncture, until the data is dropped into the workbook, the data from the crawl may still be lost, particularly if the computer lacks the resources to complete this work.  (One way to make sure that the system is still working and hasn’t hung is to press Ctrl+Alt+Delete and choose “Start Task Manager.”  The Task Manager will show which processes are running on the machine.  Look at the Applications tab.)  Usually, the data shows up within a few seconds to a few minutes.  If NodeXL goes silent, you may have to recrawl to obtain the data.  In those cases, it’s better to shut down the software completely and restart, so that no residual information is left in the Excel workbook.

The Captured Data

Save your file.   For this workbook, it was named:  CDCGovUserNetworkonTwitter1DegFollowingNoLimits.

Graph Metrics

Again, post-capture data processing is needed.  In the NodeXL tab, click on Graph Metrics, then Select All, and then Calculate Metrics.  The following Graph Metrics table was created.

What this table shows is that 218 vertices were captured.  That matches the 217 following listed on the Twitter site, plus the CDCGov account itself.  In terms of edges, there are only 217 because these are the connections from CDCGov to the particular “alters” it follows in its ego neighborhood (per our crawl parameters).  In terms of a Groups or clusters extraction, there is only one group because of the simplicity of the crawl.
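The vertex and edge counts can be checked by hand with a small sketch of the same star-shaped network.  The alter names below are placeholders, not the real accounts CDCGov follows, and the density line uses the standard directed-graph formula, which is assumed here rather than taken from NodeXL’s own metrics table.

```python
# Placeholder names stand in for the 217 accounts CDCGov follows.
ego = "CDCGov"
alters = [f"account_{i}" for i in range(1, 218)]  # 217 followed accounts

# Directed edges point from the follower (the ego) to each account it follows.
edges = [(ego, alter) for alter in alters]
vertices = {ego, *alters}

n, m = len(vertices), len(edges)
density = m / (n * (n - 1))  # standard directed-graph density (assumption)

print(n, m)               # 218 217
print(round(density, 4))  # 0.0046
```

Every edge starts at CDCGov, which is why a 1-degree “following” crawl always yields one more vertex than edges.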

Graph Layouts

A Fruchterman-Reingold layout algorithm shows CDCGov in the middle and the various following accounts around it.

A Harel-Koren Fast Multiscale Layout Algorithm results in the following image.

A spiral layout looks like this.

Finally, a vertical sine wave looks like this.
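As a rough illustration of what the last two layouts do geometrically, the sketch below places 218 points along a spiral and along a vertical sine wave.  The function names, the spacing, and the number of turns and cycles are all invented for this example; this is not NodeXL’s layout code, which may place vertices differently.

```python
import math


def spiral_positions(count, spacing=1.0, turns=4.0):
    """Place `count` points along an Archimedean spiral (radius grows with angle)."""
    positions = []
    for i in range(count):
        t = i / max(count - 1, 1)          # 0.0 .. 1.0 along the curve
        angle = 2 * math.pi * turns * t
        radius = spacing * turns * t
        positions.append((radius * math.cos(angle), radius * math.sin(angle)))
    return positions


def vertical_sine_positions(count, amplitude=1.0, cycles=2.0):
    """Place `count` points from top to bottom along a sine curve."""
    positions = []
    for i in range(count):
        t = i / max(count - 1, 1)
        x = amplitude * math.sin(2 * math.pi * cycles * t)
        y = -t                             # each point sits lower than the last
        positions.append((x, y))
    return positions


spiral = spiral_positions(218)   # one point per captured vertex
wave = vertical_sine_positions(218)
print(spiral[0], wave[-1])
```

Neither curve uses the edges at all, which is why such layouts show the vertices clearly but say little about structure compared with a force-directed layout such as Fruchterman-Reingold.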

After sufficient experimentation with graph types, an individual can anticipate what a given layout algorithm will produce based on the extracted graph metrics.

The graphs themselves are ephemeral inside the NodeXL Excel Template, and any saved Excel Workbook will require a re-drawing to visualize the data.

There can be zoomed-in views…which show the directionality of the links.  Notice that the CDCGov in the middle is following the various vertices or nodes.

Interacting with the Graph inside the NodeXL Excel Template

Individuals may go to the Group Vertices to find out who the various account holders are, or they may interact with the graph visualization in the graph pane in order to see which node is represented at each point in the graph.  (The mouse rollover effect on the graphs does not always seem to work, but that may be a function of how I’ve set parameters.)

Be sure to save the workbook for possible future reference.  For all the glamour of the graphs, the real data is often in the various worksheets and the workbook.

Final Note:  NodeXL is a free and open-source tool that is available from Microsoft’s CodePlex site (which is a space for project hosting for open-source software), and it is sponsored by the Social Media Research Foundation.