Sunday, February 23, 2014

twitteR now supports database persistence

For a long time now I've wanted to add the ability for storing data from twitteR into a RDBMS. In the past I've done things by concatenating new results onto old results which simply becomes unwieldy. I know that many people have doctored up their own solutions for this but it seemed useful to have it baked in. Unfortunately I never had the time or energy to do this so the idea languished. But then dplyr happened - it provides some revolutionary tools for interacting with data stored in a database backend. I figured I'd kill two birds with one stone by finally implementing this project which in turn would give me a lot of data to play with. This is all checked in to master on github.

This is still a work in progress, so please let me know if you have any comments, particularly as regards making it more seamless to use.

 First, some basics:

  • While theoretically any DBI based backend will work, currently only RMySQL and RSQLite are supported.
  • The only types of data able to be persisted are tweets (status) objects and user objects. Granted, this likely covers 95%+ of use cases.
  • Data can be retrieved as either a list of the appropriate object or as a data.frame representing the table. Only the entire table will be retrieved - my expectation is that it will be simpler for users to interact with data via things like dplyr.
To get started, you must register your database backend. You can either create a DBI connection from one of the supported packages or call one of the available convenience methods (which will return the connection as well as register it with twitter.

register_db_backend(dbi_connection)
# or create a sqlite connection
register_sqlite_backend("/path/to/sqlite/file")
# or create a mysql connection
register_mysql_backend("my_database", "hostname", "username", "password")
view raw gistfile1.r hosted with ❤ by GitHub

To continue, suppose we have a list of tweets we want to persist. Simply call store_tweets_db() with your list and they'll be persisted into your database. By default they will be persisted to the table tweets but you can change this with the table_name argument.

tweets = searchTwitter("#rstats", n=500)
store_tweets_db(tweets)
view raw gistfile1.r hosted with ❤ by GitHub

Finally, to retrieve your tweets from the database the function is load_tweets_db(). By default this will return a list of the appropriate object, although by specifying as.data.frame=TRUE the result will be a data.frame mirroring the actual table. Much like store_tweets_db() there is a table_name argument.

from_db = load_tweets_db()
view raw gistfile1.r hosted with ❤ by GitHub

Note that for user data there is a mirror set of functions, store_users_db() and load_users_db(), and the default table name is users.

2 comments:

  1. Jeff,

    If I add tweets using store_tweets_db() when these tweets were already in the db I have them double.

    Is there a way to avoid storing them double? Is there a reason why I should want them to be double?

    Thanks

    ReplyDelete
  2. hello ,

    i am facing problem in this

    register_db_backend(dbi_connection)
    Error: could not find function "register_db_backend"

    how this can be solved..

    thanks..

    ReplyDelete