
Sunday, February 23, 2014

twitteR now supports database persistence

For a long time now I've wanted to add the ability to store data from twitteR in an RDBMS. In the past I've done things by concatenating new results onto old results, which quickly becomes unwieldy. I know that many people have doctored up their own solutions for this, but it seemed useful to have it baked in. Unfortunately I never had the time or energy to do this, so the idea languished. But then dplyr happened - it provides some revolutionary tools for interacting with data stored in a database backend. I figured I'd kill two birds with one stone by finally implementing this project, which in turn would give me a lot of data to play with. This is all checked in to master on github.

This is still a work in progress, so please let me know if you have any comments, particularly regarding how to make it more seamless to use.

First, some basics:

  • While theoretically any DBI-based backend should work, currently only RMySQL and RSQLite are supported.
  • The only types of data that can be persisted are status (tweet) objects and user objects. Granted, this likely covers 95%+ of use cases.
  • Data can be retrieved as either a list of the appropriate objects or as a data.frame representing the table. Only the entire table will be retrieved - my expectation is that it will be simpler for users to interact with the data via things like dplyr.
To get started, you must register your database backend. You can either create a DBI connection from one of the supported packages or call one of the available convenience methods (which will return the connection as well as register it with twitteR).
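As a minimal sketch, registration looks something like this (using RSQLite; the file name is just a placeholder, and since this is all still in flux the exact helper names may differ from what's shown here):

    library(twitteR)
    library(RSQLite)

    # Option 1: create a DBI connection yourself and register it
    con <- dbConnect(SQLite(), dbname = "twitter.db")
    register_db_backend(con)

    # Option 2: let a convenience method create and register the connection for you
    con <- register_sqlite_backend("twitter.db")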


To continue, suppose we have a list of tweets we want to persist. Simply call store_tweets_db() with your list and they'll be persisted into your database. By default they will be persisted to the table tweets but you can change this with the table_name argument.
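A quick sketch of that (the search query is just an example):

    tweets <- searchTwitter("#rstats", n = 100)

    store_tweets_db(tweets)                         # persists to the "tweets" table
    store_tweets_db(tweets, table_name = "rstats")  # or to a table of your choosing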


Finally, to retrieve your tweets from the database the function is load_tweets_db(). By default this will return a list of the appropriate object, although by specifying as.data.frame=TRUE the result will be a data.frame mirroring the actual table. Much like store_tweets_db() there is a table_name argument.
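Continuing the sketch from above:

    from_db   <- load_tweets_db()                      # list of status objects
    tweets_df <- load_tweets_db(as.data.frame = TRUE)  # data.frame mirroring the table
    rstats_df <- load_tweets_db(as.data.frame = TRUE, table_name = "rstats")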


Note that for user data there is a mirror set of functions, store_users_db() and load_users_db(), and the default table name is users.
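For example (lookupUsers() is just one way to get a list of user objects, and the handles are placeholders):

    users <- lookupUsers(c("some_user", "another_user"))
    store_users_db(users)
    users_df <- load_users_db(as.data.frame = TRUE)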

Saturday, January 25, 2014

An updated look at the #code2013 language rankings

A few weeks ago I compared the #code2013 rankings from Twitter to TIOBE's rankings, although when I collected the #code2013 data people were still chiming in, albeit at a slowing pace. As I visually scanned the new tweets it seemed like there was a huge increase in Delphi & Object Pascal mentions compared to the data I had collected previously, and it made me curious whether this was a real effect or just coincidence. Luckily I had continued to collect #code2013 data after making that post, so I had an opportunity to find out: I now had 6028 tweets, 1404 more than last time.

At the same time, I commented in my original post that I was unhappy with the mechanism I used to strip manual retweets (i.e. manually adding RT instead of using a built-in retweet), as I had removed any tweet from the data which contained an RT. Because people often add commentary to the left of the RT, I created a new function which keeps anything to the left of the RT (as well as MT) and strips the rest, which should preserve more usable data. This code now appears in the github version of twitteR as the function strip_retweets(). Unfortunately, this didn't make much of a difference - applying the new function to the original data set only gave me 23 more tweets worth of data. Oh well, it was the thought that counted.
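Usage is straightforward; a minimal sketch, assuming tweets is a list of status objects from searchTwitter():

    library(twitteR)   # github version

    # as described above: keeps any commentary to the left of a manual RT (or MT)
    stripped <- strip_retweets(tweets)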

I processed the new dataset the same as the previous batch (all code included as a single gist below), and sure enough there was a large skew toward Delphi & Pascal in this batch. Note that I had tried to morph any usage of "object pascal" into a single "delphi/object pascal" entry, on the assumption that most people mentioning "pascal" mean Delphi:
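The merge itself was a simple substitution, roughly along these lines (not the exact code from the gist):

    # collapse the variants into one whitespace-free token before tokenizing
    texts <- gsub("object pascal|delphi", "delphi/objectpascal", tolower(texts))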


Despite the inclusion of about 30% more data, the results are very similar. So what happens if we compare the updated data against the TIOBE data, as I did the first time?

Sure enough - compared visually to the original, the pascal entries gained quite a lot (bouncing one of my favorites, Scala, down a tier). There were some other changes - most notably abap & c# gained while fortran lost - but only ABAP's gain was particularly noticeable.

What happens if we look at only the new tweets against the TIOBE rankings? How much of a skew would Delphi show now?


As expected, Delphi took a huge leap forward. Also as expected, some of the fringe languages fell off of this plot - which makes sense, as we have about a third of the data and thus fewer opportunities to make the grade. You can also see some languages like R (another favorite) and ObjC dropping while others like Haskell and Matlab gain ground.

So what happened? It seems reasonable to me to expect a fairly steady distribution over time, but clearly the social aspect of Twitter is affecting things, causing viral gains and losses over time.


Thursday, January 2, 2014

Comparing the #code2013 results with the current TIOBE rankings

The TIOBE language rankings have always been controversial, but in the absence of more meaningful metrics they tend to be viewed as holy writ. Over the last few days of 2013, Twitter user @deadprogram started a hashtag called #code2013. The idea of this hashtag was that users would tweet which languages they used over the last year. I felt this would be an interesting comparison to the TIOBE rankings - the latter is based on search engine popularity, while the #code2013 rankings would be based on what people are actually reporting.

To do this I used my R library twitteR to pull 4624 tweets with this hashtag and then started pulling them apart to see what I could see. I previously pulled the tweets using the searchTwitter() function and loaded them into my R session. From there my first step was to try to remove retweets. Removing the new-style Twitter retweets is simple, and after that I removed anything with RT in the text. The latter isn't perfect and is likely to throw out good data (e.g. "lang1 lang2 lang3 RT @deadprogram: What programming languages have you used this year? Tweet using #code2013 Please do it and also RT!") but it seemed unlikely to radically skew the results. The R code I used to do this was:
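The original gist isn't reproduced here, but the filtering was along these lines (a sketch, not the exact code):

    library(twitteR)

    # tweets <- searchTwitter("#code2013", n = 5000)
    texts <- sapply(tweets, function(x) x$text)

    # new-style retweets come back with a leading "RT @user:", and the blanket
    # rule described above then throws out anything else containing RT
    keep   <- !grepl("\\bRT\\b", texts)
    tweets <- tweets[keep]
    texts  <- texts[keep]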

This left 3745 tweets, so we lost about 1000 due to retweets. Considering the number of RTs thrown out here, one thought might be to redo this by removing everything to the right of the RT instead of blanket-removing anything with RT in the text.

The next step was to read in the TIOBE rankings (well, the top 50). Visually inspecting a sampling of the #code2013 tweets and looking at the TIOBE data made it clear that I would have to massage the language names a bit, as there were a few problems. The most notable issues were things like "Objective C" or "emacs lisp", since I was planning on tokenizing languages by whitespace. Similarly, TIOBE defines "delphi/object pascal" but people in #code2013 tended to say either "Delphi" or "object pascal". It would be an impossible task to perfectly clean up the #code2013 data, but I made a few adjustments to help things along:
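The adjustments looked something like this (illustrative only - the real list of substitutions was longer and lived in the gist, and the TIOBE names get the same treatment so the tokens line up):

    # collapse multi-word names into single tokens so whitespace tokenization works
    fix_langs <- function(x) {
      x <- gsub("objective[ -]?c", "objectivec", x, ignore.case = TRUE)
      x <- gsub("emacs lisp", "emacslisp", x, ignore.case = TRUE)
      x <- gsub("object pascal|delphi", "delphi/objectpascal", x, ignore.case = TRUE)
      x
    }
    texts <- fix_langs(texts)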

I wanted to normalize all of the text to be lowercase, but this presented an issue. A relatively small number of tweets (67, to be exact) were in a language encoding that tolower() wasn't fond of. Instead of fighting encoding issues I chose to throw these out as well. I looped through all of the statuses; if I was able to convert a status to lowercase I kept it, otherwise I threw it out:
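In sketch form:

    # keep only statuses whose text tolower() can handle
    lowered <- character(0)
    for (txt in texts) {
      low <- tryCatch(tolower(txt), error = function(e) NA_character_)
      if (!is.na(low)) {
        lowered <- c(lowered, low)
      }
    }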

Finally we're getting somewhere. I tokenized each status on any whitespace as well as . or , characters. From here I filtered each status to only contain words which exist in the TIOBE language list. The potential downside here is that we could have languages represented in #code2013 that don't exist in the top 50 TIOBE languages (and/or alternate spellings), but this seemed unlikely to affect the outcome of this exercise in a meaningful way, so it was a convenient way to normalize things. This resulted in 40 languages from the #code2013 data to consider. Once that was done I created a data.frame with columns for the language name, the frequency count and a tier code. The tier code will be used to color the final plot and covers the ranges 1-5, 6-10, 11-15, 16-25 and 26-40.
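Roughly, assuming the lowercased texts from above and a tiobe data.frame with a language column ordered by TIOBE rank (again a sketch rather than the exact gist):

    # tokenize on whitespace plus . and , then keep only TIOBE top 50 names
    tokens <- unlist(strsplit(lowered, "[[:space:].,]+"))
    tokens <- tokens[tokens %in% tiobe$language]

    # frequency counts plus the tier code used to color the plot
    counts <- sort(table(tokens), decreasing = TRUE)
    tier   <- cut(seq_along(counts), breaks = c(0, 5, 10, 15, 25, 40),
                  labels = c("1-5", "6-10", "11-15", "16-25", "26-40"))

    code2013 <- data.frame(language = names(counts),
                           count    = as.integer(counts),
                           tier     = tier,
                           stringsAsFactors = FALSE)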


Ok. Now we're cooking. What I wanted to see here was how the rankings differed, so I created a bar plot showing the frequency counts of the #code2013 hits, with the Y axis being the languages and the X axis being the counts. The languages were ordered by their position in the TIOBE rankings, and the bars were colored by the #code2013 tier I mentioned previously. This is what the results looked like:
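A sketch of that plot, assuming code2013 also carries a tiobe_rank column (not the exact code behind the figure):

    library(ggplot2)

    # order the languages by TIOBE rank so the flipped axis follows the TIOBE list
    code2013$language <- factor(code2013$language,
                                levels = code2013$language[order(code2013$tiobe_rank,
                                                                 decreasing = TRUE)])

    ggplot(code2013, aes(x = language, y = count, fill = tier)) +
      geom_bar(stat = "identity") +
      coord_flip() +
      labs(x = "Language (TIOBE order)", y = "#code2013 mentions", fill = "#code2013 tier")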



In general the top 10-ish are roughly the same, although the most stark trend is that the top 5 and the next chunk are largely reversed. The top #code2013 languages were javascript, ruby, python, java & php, while those are TIOBE numbers 9, 11, 8, 2 & 6 respectively. Similarly, 4 of the top 5 TIOBE languages are in the 6-10 tier, with the 10th place #code2013 language (scala) all the way down at TIOBE #33.

I might be way off base here, but looking at the rankings of the #code2013 languages tells me a couple of things. One is that, unsurprisingly, web development still rules the roost: javascript, ruby, python, java, php. The other is that data analysis & big data (I loathe the term, but c'est la vie) is coming on stronger than TIOBE recognizes, considering that some of the darlings of that world are doing better in #code2013 than in TIOBE, notable examples being Python, Scala, Haskell & R.

For the record, my tweet in this hashtag was: "Scala, java, R, python, matlab, C++ #code2013" so I have to say I'm pleasantly surprised to see some of my favorite languages (which would be the first four I mentioned, although not in that order) looking like a better combination than TIOBE would suggest.

Edit #1: Hadley Wickham suggested that I include a scatterplot of the data. Considering that one of the main motivations for this exercise was to force myself to figure out how his ggplot2 library worked, I figured I'd oblige:
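Something like this, again a sketch rather than the exact code behind the plot:

    # compare the rankings directly: TIOBE rank vs. #code2013 rank
    code2013$code2013_rank <- rank(-code2013$count, ties.method = "first")

    ggplot(code2013, aes(x = tiobe_rank, y = code2013_rank, label = language)) +
      geom_point() +
      geom_text(vjust = -0.7, size = 3) +
      labs(x = "TIOBE rank", y = "#code2013 rank")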



Tuesday, April 3, 2012

Some eensy teensy twitteR and ROAuth news

I've been completely swamped the last couple of months and am just now getting back to the twitteR thing. I've been building up a dataset for a while now and am hoping to combine the various techniques that others have already done and hopefully add a few others in order to provide a comprehensive look at what one can do over time with Twitter data.

I'm making a few minor changes tonight to some of the data structures (a quick sketch of accessing the new fields follows the list):
  • adding the retweetCount field to the status class, which will report how many times it has been retweeted.
  • adding the retweeted field to the status class, a logical which will report if it has been retweeted.
  • adding the listedCount field to the user class, which gives the number of public lists the user is part of.
  • adding the followRequestSent field to the user class, which will report TRUE if that user is someone that you've sent a follow request to (provided you've used OAuth).
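
For illustration, once you're on the github version these show up as additional fields on the existing classes; something along these lines (the handle is a placeholder):

    library(twitteR)

    tweets <- searchTwitter("#rstats", n = 10)
    tweets[[1]]$retweetCount     # how many times this status has been retweeted
    tweets[[1]]$retweeted        # has it been retweeted?

    u <- getUser("some_user")
    u$listedCount                # public lists the user is part of
    u$followRequestSent          # TRUE if you've sent a follow request (OAuth only)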

On the ROAuth front, Duncan Temple Lang will be taking over as maintainer. A combination of heavy interest on his part and lack of time on my part led to this - he has big plans to push it much further than it is now, integrating OAuth 2.0 and all sorts of other goodies. I believe some Google folks are in on this as well. I wish them luck :)