Kansas State University


IT News

The NodeXL Series: Conducting a Data Crawl of a Facebook Fan Page (Part 8)

Facebook is currently the foremost social networking site in the Western world.  Many individuals and entities create fan pages on this social network to serve as their public-facing presence.  Extracting information from Facebook requires an authorized account.

To practice this data extraction, this article describes the extraction of the social network around the Hershey’s fan page on Facebook (https://www.facebook.com/HERSHEYS).  Because the page has 5.9 million likes, any crawl will have to be a limited one in order not to overwhelm NodeXL.

To begin the data extraction, start the NodeXL Excel Template.  Click on the NodeXL tab.  Go to the “Import” dropdown menu.  Select “From Facebook Fan Page Network (v. 1.6.1)…”.

A window will open. The next step will be to set up the parameters of the crawl.

Setting the Parameters of the Data Crawl

The name or identification of the fan page should be entered in the text box at the top left.  In this case, it will be HERSHEYS.  Next, a NodeXL user needs to check the attributes (data fields) of the accounts that he or she wants to extract into a worksheet.  The default settings select “Name, Picture, Sex, Profile, and Locale,” which are focused on an individual.  (These are important demographics for marketers and advertisers.)  Since this crawl will be about a company’s fan page, we will select “Locale,” “Time Zone,” and “Website.”  This keeps the crawl simple.  It’s also good practice generally not to collect information that is not going to be needed or practically used.  That completes the left column of the table.

Next, the network to be collected has to be defined.  A unimodal network focuses on one type of entity:  a User-User Network (relationships between user accounts on Facebook, based on co-likes, co-comments, or both) or a Post-Post Network (based on shared likes, comments, or both).  The collected network would be all of one type (whether user-to-user or post-to-post).  The other option is a mixed or bi-modal network:  a User-Post Network.  For this trial run, a bi-modal network was selected based on both likes and comments.
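The difference between the bi-modal and unimodal options can be sketched in a few lines of Python.  This is only an illustration of the idea, not NodeXL's own code:  the sample interactions and the function names bimodal_edges and project are hypothetical, and the assumption is that a unimodal network links two users who liked or commented on the same post (or two posts that share a user).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: (user, post, relation) triples.
interactions = [
    ("alice", "post1", "like"),
    ("bob", "post1", "comment"),
    ("alice", "post2", "like"),
    ("carol", "post2", "like"),
    ("bob", "post2", "comment"),
]


def bimodal_edges(rows):
    """User-Post network: one edge per distinct (user, post) pair."""
    return sorted({(user, post) for user, post, _ in rows})


def project(rows, onto="user"):
    """Unimodal network built from the same rows.

    onto="user": link two users who touched the same post.
    onto="post": link two posts touched by the same user.
    The edge weight counts how many neighbors the pair shares.
    """
    groups = defaultdict(set)
    for user, post, _ in rows:
        if onto == "user":
            groups[post].add(user)
        else:
            groups[user].add(post)
    weights = defaultdict(int)
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return dict(weights)


print(bimodal_edges(interactions))    # User-Post edges
print(project(interactions, "user"))  # User-User edges with weights
print(project(interactions, "post"))  # Post-Post edges with weights
```

In the bi-modal result every edge joins a user to a post, while in either unimodal result both ends of an edge are of the same type.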

Finally, at the bottom right of the table, a user has to decide whether to collect the most recent posts, what the start and stop dates of the data extraction will be, whether to include posts not made by the page owner (the target and focal node of the crawl), whether to capture status updates, and whether to get wall posts.  To simplify, this crawl will cover a one-week period from March 28 to April 4, 2013.
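The effect of the date range and of the option about posts not made by the page owner can be pictured with a small filter.  This is a sketch with hypothetical post records and a hypothetical select_posts function.  It assumes that both the start and stop dates are inclusive, which may not match how NodeXL treats the boundaries.

```python
from datetime import date

# The crawl window used in this walkthrough.
START = date(2013, 3, 28)
END = date(2013, 4, 4)

# Hypothetical post records for illustration only.
posts = [
    {"id": "p1", "date": date(2013, 3, 27), "by_owner": True},
    {"id": "p2", "date": date(2013, 3, 28), "by_owner": True},
    {"id": "p3", "date": date(2013, 4, 1), "by_owner": False},
    {"id": "p4", "date": date(2013, 4, 5), "by_owner": True},
]


def select_posts(rows, start=START, end=END, include_others=True):
    """Keep posts dated within [start, end] (assumed inclusive).

    When include_others is False, posts not made by the page owner
    are dropped as well.
    """
    selected = []
    for post in rows:
        if not (start <= post["date"] <= end):
            continue
        if not include_others and not post["by_owner"]:
            continue
        selected.append(post["id"])
    return selected


print(select_posts(posts))
print(select_posts(posts, include_others=False))
```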

The completed table looks like this.

The next step requires a login.  Click “Login.”  After putting in the Facebook credentials (email or phone and password), the “Download” button in the middle will activate.  Click “Download.”  (NodeXL remembers these credentials and will keep a user “logged in” unless he or she explicitly asks to be logged out.)

The crawl will begin.  The progress will be shown at the bottom of the “Import from Facebook Fan Page Network (v. 1.6.1)” window.

The progress bar gives a count of the step in the process as well as the progress on the extraction of the particular batch of data.  This is the only type of data extraction in NodeXL that gives a sense of the size of the network before the full crawl is complete.

Populating the NodeXL Excel Template Worksheet

When the extraction has completed, the Text Wrapping window appears.  Just click “Yes.”  The populated table looks like this.  Please save the file.   In this case, it was named “HersheysFanPageonFacebookMar28toApril42013.”

A name cannot capture all of the parameters of the data crawl.  If a NodeXL user needs to review the input parameters, he or she may go back to the NodeXL ribbon and click on Import, and go to the type of crawl done to see what the parameters were.
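One way to keep the parameters with the data is to write them to a small text file saved next to the workbook.  The sketch below is a suggestion and not a NodeXL feature.  The values come from this walkthrough, while the key names, the ".params.json" suffix, and the functions sidecar_name and to_sidecar_text are hypothetical.

```python
import json

# Values taken from this walkthrough; the key names are hypothetical.
crawl_parameters = {
    "importer": "Facebook Fan Page Network (v. 1.6.1)",
    "fan_page": "HERSHEYS",
    "attributes": ["Locale", "Time Zone", "Website"],
    "network": "User-Post (bi-modal)",
    "relationships": ["likes", "comments"],
    "date_from": "2013-03-28",
    "date_to": "2013-04-04",
}


def sidecar_name(workbook_name):
    """Name of the parameter file that sits next to the workbook."""
    return workbook_name + ".params.json"


def to_sidecar_text(parameters):
    """Serialize the parameters as readable, stable JSON text."""
    return json.dumps(parameters, indent=2, sort_keys=True)


workbook = "HersheysFanPageonFacebookMar28toApril42013"
print(sidecar_name(workbook))
print(to_sidecar_text(crawl_parameters))
```

The printed text can be pasted into a file with the printed name, so that the workbook and its crawl settings stay together.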

Data Processing:  Graph Metrics and Clusters  

As before, extract the Graph Metrics.  Extract Groups.  (This may take some time given the size of the network.)

The Graph Metrics table shows that this short-term data extraction has found participation by 24,663 vertices connected by 35,240 edges.  In terms of clustering, a cluster extraction found 10 clusters.
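The two counts above are enough to work out a couple of simple summary figures by hand.  The sketch below assumes the graph is treated as undirected and has no duplicate edges, which may not match the settings used in the workbook.  The function names are hypothetical.

```python
# Counts reported in the Graph Metrics table for this crawl.
VERTICES = 24663
EDGES = 35240


def average_degree(vertices, edges):
    """Mean number of edges per vertex (each edge has two ends)."""
    return 2 * edges / vertices


def density(vertices, edges):
    """Share of all possible undirected vertex pairs that are connected."""
    possible_pairs = vertices * (vertices - 1) / 2
    return edges / possible_pairs


print(round(average_degree(VERTICES, EDGES), 2))  # roughly 2.86
print(density(VERTICES, EDGES))                   # roughly 0.000116
```

An average of fewer than three connections per vertex and a very low density suggest a sparse network, which is what one would expect when most accounts interact with only a post or two during a single week.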

Graph Visualizations of the Hershey’s Fan Page on Facebook

This data was visualized using the Fruchterman-Reingold layout algorithm.  This does show a mix of text and visuals, which may not be what the developers of the software intended.
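For readers curious about what this layout does, the following is a simplified Python sketch of the general Fruchterman-Reingold idea:  all vertices push each other apart, connected vertices pull each other together, and the allowed movement shrinks with each pass.  It is not NodeXL's implementation, and the sample graph, the function name, and the parameter defaults are all assumptions made for illustration.

```python
import math
import random


def fruchterman_reingold(nodes, edges, width=1.0, height=1.0,
                         iterations=50, seed=0):
    """Return {node: (x, y)} positions inside a width-by-height frame."""
    nodes = list(nodes)
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    pos = {n: [rng.uniform(0, width), rng.uniform(0, height)] for n in nodes}
    k = math.sqrt(width * height / len(nodes))  # ideal vertex spacing
    temp = width / 10.0  # largest allowed move, reduced on every pass
    cooling = temp / (iterations + 1)

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion: every pair of vertices pushes apart with force k^2 / d.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force

        # Attraction: each edge pulls its ends together with force d^2 / k.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force

        # Move each vertex by at most `temp`, then keep it inside the frame.
        for n in nodes:
            dx, dy = disp[n]
            length = math.hypot(dx, dy) or 1e-9
            step = min(length, temp)
            pos[n][0] = min(width, max(0.0, pos[n][0] + dx / length * step))
            pos[n][1] = min(height, max(0.0, pos[n][1] + dy / length * step))

        temp -= cooling

    return {n: (p[0], p[1]) for n, p in pos.items()}


# Hypothetical miniature User-Post network.
sample_nodes = ["user 1", "user 2", "user 3", "post A", "post B"]
sample_edges = [("user 1", "post A"), ("user 2", "post A"),
                ("user 2", "post B"), ("user 3", "post B")]
layout = fruchterman_reingold(sample_nodes, sample_edges)
for name, (x, y) in layout.items():
    print(name, round(x, 3), round(y, 3))
```

Because every pair of vertices is compared on every pass, the work grows quickly with the size of the network, which is one reason a large crawl can take a while to lay out.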

A grid layout version of this data follows.

A version of this graph is also available online in the NodeXL Graph Gallery.

Save the NodeXL Excel workbook.

The initial finding:  Lots of people really adore chocolate!

Final Note:  NodeXL is a free and open-source tool that is available from Microsoft’s CodePlex site (a project hosting space for open-source software), and it is sponsored by the Social Media Research Foundation.