To do this I used my R library twitteR to pull 4624 tweets with this hashtag and then started pulling them apart to see what I could see. I had previously pulled the tweets using the searchTwitter() function and loaded them into my R session. From there my first step was to try to remove retweets. Removing the new-style Twitter retweets is simple, and after that I removed anything with RT in the text. The latter isn't perfect and is likely to throw out good data (e.g. "lang1 lang2 lang3 RT @deadprogram: What programming languages have you used this year? Tweet using #code2013 Please do it and also RT!") but it seemed unlikely to radically skew the results. The R code I used to do this was:
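The original listing isn't reproduced here, so what follows is only a rough sketch of that two-step filter. It assumes the tweets have already been flattened into a data frame; the `text` and `is_retweet` column names and the sample rows are made up for illustration and are not the twitteR objects I actually worked with.

```r
# Hypothetical stand-in for the downloaded tweets (column names are assumptions).
tweets <- data.frame(
  text = c(
    "Scala, java, R, python, matlab, C++ #code2013",
    "RT @someone: ruby python #code2013",
    "lang1 lang2 lang3 RT @deadprogram: What programming languages have you used this year?",
    "haskell #code2013"
  ),
  is_retweet = c(FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Step 1: drop new-style retweets, which are flagged rather than marked in the text.
no_native <- tweets[!tweets$is_retweet, ]

# Step 2: blanket removal of anything that still has "RT" in the text.
# This also throws away the third tweet, even though it names languages.
no_rt <- no_native[!grepl("RT", no_native$text, fixed = TRUE), ]

nrow(no_rt)
```

Only the first sample tweet survives, which shows the trade-off mentioned above: the blanket rule is simple but discards some good data.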
This left 3745 tweets, so we lost 879 to retweets. Considering the number of RTs thrown out here, one thought might be to redo this by removing only the RT and everything to its right, instead of a blanket removal of anything with RT in the text.
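I didn't do this for the post, but a minimal sketch of that alternative could look like the following. The helper name `strip_rt_tail` is hypothetical, and the regular expression is just one reasonable way to cut a status at the first standalone "RT".

```r
# Hypothetical helper: keep whatever precedes the first standalone "RT",
# then drop statuses that end up empty (i.e. pure retweets).
strip_rt_tail <- function(text) {
  # (?s) lets "." match newlines so multi-line tweets are cut completely.
  trimmed <- sub("(?s)\\s*\\bRT\\b.*", "", text, perl = TRUE)
  trimmed[nzchar(trimmed)]
}

statuses <- c(
  "lang1 lang2 lang3 RT @deadprogram: What programming languages have you used this year?",
  "RT @someone: ruby python #code2013",
  "haskell #code2013"
)

kept <- strip_rt_tail(statuses)
kept
```

Here the first status keeps its language list, the pure retweet disappears, and the untouched status passes through as-is.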
The next step was to read in the TIOBE rankings (well, the top 50). Visually inspecting a sample of the #code2013 tweets and looking at the TIOBE data made it clear that I would have to massage the language names a bit, as there were a few problems. The most notable issues were multi-word names like "Objective C" or "emacs lisp", as I was planning on tokenizing languages by whitespace. Similarly, TIOBE defined "delphi/object pascal" but people in #code2013 tended to say either "Delphi" or "object pascal". It would be an impossible task to perfectly clean up the #code2013 data but I made a few adjustments to help things along:
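The exact adjustments aren't shown here, so the following is only an illustration of the kind of cleanup involved. The `fixes` table and the `normalize_langs` helper are hypothetical; the idea is to collapse multi-word and combined names into single tokens that survive whitespace tokenization, applied to both the tweets and the TIOBE names.

```r
# Hypothetical pattern -> replacement table; the real list of adjustments may differ.
fixes <- c(
  "delphi/object pascal" = "delphi",
  "objective[ -]?c\\b"   = "objective-c",
  "emacs lisp"           = "emacs-lisp",
  "object pascal"        = "delphi"
)

# Apply each substitution in turn, ignoring case.
normalize_langs <- function(text) {
  for (pattern in names(fixes)) {
    text <- gsub(pattern, fixes[[pattern]], text, ignore.case = TRUE, perl = TRUE)
  }
  text
}

normalize_langs("Objective C and Emacs Lisp, plus Object Pascal")
normalize_langs("delphi/object pascal")
```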
I wanted to normalize all of the text to lowercase but this presented an issue. A relatively small number of tweets (67, to be exact) were in an encoding that tolower() wasn't fond of. Instead of fighting encoding issues I chose to throw these out as well. I looped through all of the statuses and if I was able to convert a status to lowercase I kept it, otherwise I threw it out:
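As a sketch of that loop (not the original listing), one way to do it is to wrap tolower() in tryCatch() and drop whatever fails. The helper name `safe_lower` is hypothetical, and whether a given string actually errors depends on the R version and locale.

```r
# Hypothetical helper: lowercase each status, discarding any that tolower() rejects.
safe_lower <- function(statuses) {
  lowered <- lapply(statuses, function(s) {
    # An encoding error becomes NULL instead of stopping the whole loop.
    tryCatch(tolower(s), error = function(e) NULL)
  })
  # Keep only the statuses that converted cleanly.
  unlist(Filter(Negate(is.null), lowered))
}

clean <- safe_lower(c("Scala, Java, R #code2013", "Python and C++"))
clean
```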
Finally we're getting somewhere. I tokenized each status on any whitespace as well as . or , characters. From here I filtered each status to only contain words which exist in the TIOBE language list. The potential downside here is that #code2013 could contain languages that don't exist in the top 50 TIOBE languages, and/or alternate spellings, but this seemed unlikely to affect the outcome of this exercise in a meaningful way, so this was a convenient way to normalize things. This resulted in 40 languages from the #code2013 data that we're considering. Once that was done I created a data.frame with columns for the language name, the frequency count and a tier code. The tier code will be used to color the final plot and covers ranges 1-5, 6-10, 11-15, 16-25 and 26-40.
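The steps above can be sketched roughly as follows. The `tiobe` vector and the sample statuses are tiny made-up stand-ins, and assigning tiers by position in the frequency ordering is my assumption about how the tier codes map onto the ranges.

```r
# Made-up stand-ins for the TIOBE top 50 and the cleaned, lowercased statuses.
tiobe <- c("java", "c", "python", "r", "scala")
statuses <- c(
  "scala, java, r, python #code2013",
  "java. c python",
  "python and haskell"
)

# Split on runs of whitespace, periods or commas.
tokens <- strsplit(statuses, "[[:space:].,]+")

# Keep only tokens that are TIOBE language names, then count them.
hits <- unlist(lapply(tokens, function(tok) tok[tok %in% tiobe]))
counts <- sort(table(hits), decreasing = TRUE)

langs <- data.frame(
  language = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)

# Tier code from the position in the frequency ordering (assumed scheme).
langs$tier <- cut(
  seq_len(nrow(langs)),
  breaks = c(0, 5, 10, 15, 25, 40),
  labels = c("1-5", "6-10", "11-15", "16-25", "26-40")
)

langs
```

Note that "haskell" is silently dropped here because it is not in the stand-in list, which is the same downside described above for languages outside the TIOBE top 50.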
Ok. Now we're cooking. What I wanted to see here was how the rankings differed, so I created a bar plot of the frequency counts of the #code2013 hits, with the languages on the Y axis and the counts on the X axis. The languages were ordered by their position in the TIOBE rankings, and the bars were colored by the #code2013 tier I mentioned previously. This is what the results looked like:
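The plot itself isn't reproduced here. As a minimal ggplot2 sketch of a plot with that layout, assuming a data frame shaped like the one described above: every number and tier label below is made up purely for illustration and is not taken from the real #code2013 or TIOBE data.

```r
library(ggplot2)

# Made-up illustration data; tiobe_rank, count and tier are NOT real values.
langs <- data.frame(
  language   = c("c", "java", "python", "r", "scala"),
  tiobe_rank = c(1, 2, 3, 4, 5),
  count      = c(40, 120, 150, 30, 25),
  tier       = c("6-10", "1-5", "1-5", "6-10", "11-15"),
  stringsAsFactors = FALSE
)

# Order the factor so the top TIOBE language ends up at the top after the flip.
langs$language <- factor(
  langs$language,
  levels = langs$language[order(langs$tiobe_rank, decreasing = TRUE)]
)

p <- ggplot(langs, aes(x = language, y = count, fill = tier)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # languages on the Y axis, counts on the X axis
  labs(x = "Language (TIOBE order)", y = "#code2013 mentions",
       fill = "#code2013 tier")

print(p)
```

The factor reordering is what ties the Y axis to TIOBE position rather than to alphabetical order, which is the ggplot2 default for character columns.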
For the record, my tweet in this hashtag was: "Scala, java, R, python, matlab, C++ #code2013" so I have to say I'm pleasantly surprised to see some of my favorite languages (which would be the first four I mentioned, although not in that order) looking like a better combination than TIOBE would suggest.
Edit #1: Hadley Wickham suggested that I include a scatterplot of the data. Considering that one of the main motivations for this exercise was to force myself to figure out how his ggplot2 library worked I figured I'd oblige: