Sunday, February 23, 2014

twitteR now supports database persistence

For a long time now I've wanted to add the ability for storing data from twitteR into a RDBMS. In the past I've done things by concatenating new results onto old results which simply becomes unwieldy. I know that many people have doctored up their own solutions for this but it seemed useful to have it baked in. Unfortunately I never had the time or energy to do this so the idea languished. But then dplyr happened - it provides some revolutionary tools for interacting with data stored in a database backend. I figured I'd kill two birds with one stone by finally implementing this project which in turn would give me a lot of data to play with. This is all checked in to master on github.

This is still a work in progress, so please let me know if you have any comments, particularly as regards making it more seamless to use.

 First, some basics:

  • While theoretically any DBI based backend will work, currently only RMySQL and RSQLite are supported.
  • The only types of data able to be persisted are tweets (status) objects and user objects. Granted, this likely covers 95%+ of use cases.
  • Data can be retrieved as either a list of the appropriate object or as a data.frame representing the table. Only the entire table will be retrieved - my expectation is that it will be simpler for users to interact with data via things like dplyr.
To get started, you must register your database backend. You can either create a DBI connection from one of the supported packages or call one of the available convenience methods (which will return the connection as well as register it with twitter.


To continue, suppose we have a list of tweets we want to persist. Simply call store_tweets_db() with your list and they'll be persisted into your database. By default they will be persisted to the table tweets but you can change this with the table_name argument.


Finally, to retrieve your tweets from the database the function is load_tweets_db(). By default this will return a list of the appropriate object, although by specifying as.data.frame=TRUE the result will be a data.frame mirroring the actual table. Much like store_tweets_db() there is a table_name argument.


Note that for user data there is a mirror set of functions, store_users_db() and load_users_db(), and the default table name is users.

2 comments:

  1. Jeff,

    If I add tweets using store_tweets_db() when these tweets were already in the db I have them double.

    Is there a way to avoid storing them double? Is there a reason why I should want them to be double?

    Thanks

    ReplyDelete
  2. hello ,

    i am facing problem in this

    register_db_backend(dbi_connection)
    Error: could not find function "register_db_backend"

    how this can be solved..

    thanks..

    ReplyDelete