A Free Open Source Tool to Analyze Twitter

This article is a bit technical, but the end result will help you to search Twitter, analyze the tweets and then export them into a CSV file for loading into a spreadsheet application - all with a free open source tool. This article is just the beginning. We'll take a look at graphing, wordclouds and sentiment analysis in the next few weeks. All of these things are critical in identifying trends during disaster response and recovery.

Twitter is great for emergency management, emergency response, and trend analysis but getting the tweets into a program for manipulation can be expensive or require a significant knowledge of programming. We're going to explore the use of "R" and "RStudio" to perform a Twitter search. While it's not simple, it does allow you to (fairly) easily get data into other programs where things ARE much simpler.

"R" is an open source and free "language and environment for statistical computing and graphics". It is part of the "GNU Project", a collection of software including applications, libraries and developer tools. The GNU project has been around since 1984 and the project software is used, in some way, on almost every UNIX machine in existence.

In our case, however, we're going to use the Microsoft Windows version of R. Don't be scared off by the description of what the software does.

This guide may seem daunting, but it's not too complex. I'll bring you through the process step by step.

Installing "R" and "R Studio"

We're going to install both R, and R Studio. R is the component that actually executes our work, retrieves the data and manipulates it. R Studio will give us a graphical interface to use R.

"R For Windows" runs on Windows XP or later systems. To download it, visit http://cran.r-project.org/bin/windows/base/ and click on "Download R for Windows" at the top. The file is not small (at the time of this writing, it is a 54MB download). Once it has downloaded, run the file. You may be told that the file is from an "unidentified publisher" or "unknown publisher", but go ahead and install it. You can see more information and a few troubleshooting tips at http://cran.r-project.org/bin/windows/base/rw-FAQ.html#Does-R-run-under-Windows-Vista_003f.

Security Warning

You may also receive a warning box asking if you want to allow this program to make changes to your computer. Once again, go ahead and install the software. Moving forward, you'll be asked for your default language, and an installation location. When you are asked what components you would like to install, check all options and then click on Next.

 

Installing R

 

You can customize the startup options if you would like (I selected "MDI - One Big Window", HTML Help and Standard Internet, and to "Save the version number in the registry" and "Associate R with .RData files"). The program will then install on your system.

Once it's finished, we're going to go ahead and install R Studio. There are paid versions of R Studio available for those who want access to support, or for organizations that can't abide by the AGPL licensing, but we're going to focus on the free open source version of R Studio. Visit http://www.rstudio.com/products/rstudio/download/ and download the appropriate installer. I'm using the Windows XP/Vista/7/8 version for my Windows 7 desktop.

 

Downloading R Studio

 

Now, go ahead and run the installer. I went with all of the defaults for any questions asked.

We're all done with the application installs. Now we need to tell it how to access Twitter.

Configuring R to work with Twitter

Go ahead and launch R Studio. It should look something like this when first launched.

 

R Studio First Launched

We're going to add some enhancements that will allow it to work with Twitter. Go ahead and type the following commands into the console, pressing Enter after each one:

install.packages("devtools")
install.packages("rjson")
install.packages("bit64")
install.packages("httr")

After each command, you'll see activity, and when it's done your screen will look similar to this:

 

Installing R Support Tools
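
If you come back to this guide later and aren't sure which of these packages you already have, the same step can be sketched as a check-then-install. This is only a sketch: "missing_packages" is a made-up helper name for this example, not something built into R, and the download only runs when you are typing at the console yourself.

```r
# Hypothetical helper (not part of R): returns the names in `pkgs`
# that are not yet installed on this machine.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# The four support packages used in this guide.
support_pkgs <- c("devtools", "rjson", "bit64", "httr")
to_install <- missing_packages(support_pkgs)

# Only download when run by hand, and only what is missing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running it a second time should do nothing, because by then the list of missing packages is empty.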

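Now you're going to install the Twitter functions. Load devtools first so that the install_github command is available, and then install twitteR from GitHub. Type in the following commands:

library(devtools)
install_github("geoffjentry/twitteR")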

When done, your screen may look like this:

 

Installing Twitter for R

 

No need to quit out of the app for the next part. Just switch to a web browser.

Configuring Twitter

Now that we have R set up to access Twitter, you need to tell Twitter that you want to communicate with it and get the appropriate permissions set up. This is where people often get confused, but there's not a lot of magic to it. We're going to create an app on Twitter and set up a few "keys". These keys are what will identify R to Twitter.

The first step is to create the app. Go to https://apps.twitter.com/ and click on the "Create New App" button.

You'll be presented with a page that asks you a few questions. Use entries similar to the following, but modify them for yourself:

Name: Disaster.Com R App
Description: App for interfacing GNU R With Twitter
Website: https://www.disaster.com
Callback URL:

Change the Disaster.Com info to reflect your own information, but the page will look similar to this:

 

Create App on Twitter

 

Scroll down a little, read through the Developer Agreement and click the "Yes, I agree" checkbox. Then click on "Create your Twitter application".

You'll be presented with your app details page. Click on the "Keys and Access Tokens" tab at the top.

 

Twitter App Created

 

There are two parts to the Keys and Access Tokens page - the "Application Settings" at the top, and "Your Access Token" at the bottom. You should already have a bunch of information entered next to Consumer Key and Consumer Secret at the top. Scroll down and you'll see a button labeled "Create my access token" under the "Your Access Token" area. Click this button and information will be generated and presented on screen - specifically an "Access Token" and an "Access Token Secret". You now have all of the information you need to interface "R" with Twitter, so let's make it happen!

Interfacing R with Twitter

So now that we have all of the parts, let's put them together.

Switch back to your R Studio app and type in the following. You'll want to replace the brackets and text with the information you generated from Twitter above.

library(twitteR)
setup_twitter_oauth("<Consumer Key (API Key)>","<Consumer Secret (API Secret)>","<Access Token>","<Access Token Secret>")

You should receive a response:

[1] "Using direct authentication"

Use a local file to cache OAuth access credentials between R sessions?
1: Yes
2: No

Alternatively, you may see: "Error in check_twitter_oauth() : OAuth authentication error: This most likely means that you have incorrectly called setup_twitter_oauth()". Most likely, this means you have mistyped one of the keys from the Twitter app page.

Choose Yes if you would like this authorization reloaded each time you load the Twitter library.
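
If you'd rather not type your keys directly into the console (where they stay in your command history), one option is to keep them in environment variables and read them from there. The following is a minimal sketch, not the only way to do it: "read_twitter_keys" and the four TWITTER_... variable names are assumptions made up for this example.

```r
# Hypothetical helper: collects the four keys from environment variables.
# The variable names below are assumptions made for this sketch.
read_twitter_keys <- function() {
  keys <- c(
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )
  # Stop early with a readable message if anything was left unset.
  empty <- names(keys)[keys == ""]
  if (length(empty) > 0) {
    stop("Missing Twitter keys: ", paste(empty, collapse = ", "))
  }
  as.list(keys)
}

# Usage, once the variables are set (for example in your .Renviron file):
# library(twitteR)
# keys <- read_twitter_keys()
# setup_twitter_oauth(keys$consumer_key, keys$consumer_secret,
#                     keys$access_token, keys$access_secret)
```

The helper stops with a message naming the missing key, which is usually easier to track down than the generic OAuth error above.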

Now we're ready to get tweets!

Retrieving and Exporting Tweets

It's really easy to search for tweets. You're going to use the "searchTwitter" command to make it happen.

Let's search for all tweets containing the hashtag #ebola and return the latest 50 of them.

Type in the following

searchTwitter('#ebola', n=50)

You should see a list of tweets containing the hashtag #ebola.

The results may look like this:

 

Search Twitter for Ebola

Maybe you want to get a bit more complex in your search?

searchTwitter("#ebola", n=50, lang="en", since="2014-12-01", until="2014-12-03", geocode="6.313363,-10.809802,500km", resultType="recent", retryOnRateLimit=120)

This will search for the hashtag #ebola in English language tweets (you can get a complete list of language codes from the "639-1" column on this page: http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), posted on or after 12/1/2014 and before 12/3/2014, located within 500km of the latitude/longitude 6.313363,-10.809802 (Monrovia, Liberia), from the "recent" stream (as opposed to popular), and will retry pulling the information down up to 120 times if Twitter decides to restrict (rate limit) your access. At the completion of a search you may see a message like this: "50 tweets were requested but the API can only return 10". This usually means that fewer tweets matched your search than the number you requested.
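
The geocode value is just latitude, longitude and radius joined with commas, so it's easy to mistype. As a small sketch, you could build it from the separate numbers instead. "make_geocode" is a made-up helper name for this example, not a twitteR function.

```r
# Hypothetical helper: builds the "latitude,longitude,radius" string
# used for the geocode argument, with the radius in kilometers.
make_geocode <- function(lat, lon, radius_km) {
  # Basic sanity checks so a swapped latitude/longitude is caught early.
  stopifnot(lat >= -90, lat <= 90, lon >= -180, lon <= 180, radius_km > 0)
  paste0(lat, ",", lon, ",", radius_km, "km")
}

# Monrovia, Liberia, with a 500km radius.
monrovia <- make_geocode(6.313363, -10.809802, 500)
print(monrovia)

# Usage with the search from above (needs a working Twitter login):
# searchTwitter("#ebola", n = 50, geocode = monrovia)
```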

Or maybe you want to see all of the tweets done by a specific user:

userTimeline("DisasterDotCom", n=50, includeRts=TRUE, excludeReplies=FALSE)

This will show up to 50 of DisasterDotCom's tweets, including retweets and replies. Maybe you just want to see what tweets of your own have been retweeted:

retweetsOfMe(n=25)

This will show the last 25 posts of yours that have been retweeted.

Now, you might notice that not much information is given other than the actual tweet. We can actually find out a lot more by changing things up a little. This gets a little more complex, but you can still copy and paste from the example.

Let's take our above tweet request that looks for the hashtag #ebola between specific dates and geolocated around Monrovia, and let's see more information about each tweet. First, we're going to execute the search but we're going to store it in a container. Then, we're going to use another function to actually show the full information about the tweets:

mySearch <- searchTwitter("#ebola", n=50, lang="en", since="2014-12-01", until="2014-12-03", geocode="6.313363,-10.809802,500km", resultType="recent", retryOnRateLimit=120)
do.call("rbind",lapply(mySearch,as.data.frame))

You'll see a lot of data scroll across your screen. If you use the scroll bar to look back at it you'll see the first section includes the text of the tweets, each one with a number beside it.

 

text
1        #ebola infection rates going up again in Freetown...
2  Challenges of road transport in #Guinea as @PlanGlobal...
3  #ebola no one is educating men about infectious...

The next section contains several more data pieces:

   favorited favoriteCount replyToSN             created truncated replyToSID
1      FALSE             0        NA 2014-12-02 21:13:40     FALSE         NA
2      FALSE             1        NA 2014-12-02 21:09:56     FALSE         NA
3      FALSE             0        NA 2014-12-02 20:56:25     FALSE         NA

The first column gives you the tweet reference number, which matches the number beside the text above. This way, you can follow the tweet all the way through all of the tables.

"Favorited" tells you whether or not YOU favorited it. "favoriteCount" gives you the total number of times the tweet was favorited by anyone. If this was a reply, "replyToSN" gives you the screen name this tweet was a reply to. "Created" tells you when the tweet was made. It used to be that retweets longer than 140 characters were allowed, but they would end in ellipses ("..."). Now, long tweets are rejected, but if the tweet is old and was long, it might have been shortened. If "truncated" is true, this will tell you that the tweet was shortened. If this was a reply to a tweet, "replyToSID" will show you an internal Twitter ID of the tweet the reply was to.

The next section shows an "id" and "replyToUID":

                   id replyToUID
1  539890163892383747         NA
2  539889224179523585         NA
3  539885824016678914         NA

"id" is the internal Twitter identification number of the tweet. If this was a reply, "replyToUID" is the internal Twitter User ID number of the person the tweet was replying to. Moving to the next section:

statusSource
1  Twitter for Android Tablets
2             Twitter for iPhone
3  Twitter for Android Tablets

This table is showing you what client the user used to make the Tweet. Moving on:

      screenName retweetCount isRetweet retweeted    longitude   latitude
1     soniausmar            0     FALSE     FALSE   -0.3311195  5.5491169
2       queallyd            2     FALSE     FALSE -10.72144054 6.26466955
3     soniausmar            1     FALSE     FALSE   -0.3311192   5.549117

"screenName" is the name of the person or group who made the Tweet. "retweetCount" shows how many times the tweet was retweeted. "isRetweet" would be true if the tweet itself was a retweet of someone else's status. "retweeted" shows whether YOU retweeted this Tweet. And longitude/latitude shows the location the tweet was made from. Keep in mind that most tweets don't have location information, but we did a geolocated search for this example.
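
Once this table is stored in a container instead of just printed, the columns described above can be used to sort and filter it. Here is a rough sketch using a tiny stand-in table typed in from the sample output above (only three of the columns, to keep it short); with real data you would use the table built from your own search.

```r
# A tiny stand-in for the real table, typed in from the sample output above.
tdf <- data.frame(
  screenName   = c("soniausmar", "queallyd", "soniausmar"),
  retweetCount = c(0, 2, 1),
  isRetweet    = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep original tweets only, then list the most retweeted first.
originals <- tdf[!tdf$isRetweet, ]
ranked <- originals[order(originals$retweetCount, decreasing = TRUE), ]
print(ranked)

# Count how many tweets each account made.
per_user <- table(tdf$screenName)
print(per_user)
```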

Now, this is all great, but really? There's a lot you can do inside R, but more people know how to use Excel. How do we get this into a format Excel can understand?

Exporting Twitter Data to Excel

It's actually fairly simple, and just a modified version (yet again) of what we've done above.

First, we're going to store the Tweets in a container, and then we're going to write that container out to a file. Try the following, but replace <MyWindowsLoginName> with the ID you use to log in to Windows. Or, replace the entire string with a different file location. Note that you normally use backslashes in Windows (i.e. "\") but we're using forward slashes here (i.e. "/"). An easy way to figure out where to put the file is to open Windows Explorer, change to the directory that you want to save the file in, and then click on the empty space after the folders in the address bar. This will change the folders to an actual directory path which you can copy (hit Ctrl-c) and paste (Ctrl-v) into your command below. Don't forget to change the backslashes to forward slashes!

Change to Directory

 

mySearch <- searchTwitter('#ebola', n=1000)
tdf <- do.call("rbind",lapply(mySearch,as.data.frame))
write.table(tdf,"c:/users/<MyWindowsLoginName>/desktop/tweetsaboutebola.csv",sep="\t",col.names=NA)

The first line does a Twitter search for #ebola and stores 1,000 results in the temporary container "mySearch". It may take a little while to run, depending on the speed of your Internet connection.

The second line takes the full information from each Tweet and stores it in the temporary container "tdf".

The third line actually writes the file to your computer. In this case, it's writing it to my Desktop. The sep="\t" argument tells R to put a tab between fields. This allows your spreadsheet or database to understand where one field ends and the next one begins. The first column of the file holds the row numbers, which have no column name of their own, and the "col.names=NA" argument tells R to write a blank header above that column. If you don't include this, the column headers will be shifted one column to the left of their data. Here are a few alternatives to "write.table" - http://www.statmethods.net/input/exportingdata.html.
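
If you want to confirm the file came out the way you expect before opening it in a spreadsheet, you can read it straight back into R. This is just a sketch with a two-row stand-in table and a temporary file, so it doesn't touch your Desktop; the stand-in values are made up for the example.

```r
# Stand-in data; in practice tdf comes from your own search.
tdf <- data.frame(
  text = c("first tweet", "second tweet"),
  retweetCount = c(0, 2),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing on the Desktop is touched.
out_file <- tempfile(fileext = ".txt")
write.table(tdf, out_file, sep = "\t", col.names = NA)

# Read it back, treating the first column as the row numbers.
check <- read.delim(out_file, row.names = 1, stringsAsFactors = FALSE)
print(check)
```

If the headers and values line up in what gets printed, the file should import cleanly as a tab-delimited file.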

Now you can launch Excel or a similar spreadsheet program.

Go to File/Open and navigate to the file you have saved, and then open it. If you're using Excel, you'll be presented with a dialog box like the following:

 

Excel Text Import Step 1

So far, we're just using defaults. Click on "Next" and you'll see the second step:

Excel Text Import Step 2

Here we're going to make sure that "Delimiters" is set to "Tab". You also want to change "Text qualifier" to be the quote mark. This will strip the quotes from each field. Click on Next and you'll see the third and final screen:

Excel Text Import Step 3

 

If you don't want to include every column, click on the column under "Data preview" and then select "Do not import column (skip)" at the top. Otherwise, click on "Finish" and you'll have all of your data in Excel.

For my next post on "R" and Twitter, we'll take a look at graphing, plotting and maybe a little sentiment analysis, and definitely creating wordclouds in R.

 

#smem Wordcloud from R

Chris is the owner of Disaster.Com, along with being a business consultant and entrepreneur. In addition to working 80 hours a week on Disaster.Com, Chris is spending another 80 hours a week building a small business consulting company called Fair Winds Strategies. When he's not working, you can find Chris hanging out with his wife and kids, or on his sailboat (which he spent two years living on and cruising down to the Bahamas from New York, and then back).

Discussion
  1. Guest
    I have no prior knowledge in programming, and this article totally helped me to achieve what i wanted to do, and with only one main software : extracting data from twitter and ordering them into a proper csv file. I stumbled upon other guides but i must say this one is the most comprehensive!
    Guest
    Thanks Chris. You made this easy for use