Connecting R to the Database

There are several ways to read data from RDBMS in R. One of the way is to do it via JDBC. This method is generic and it will be easy to switch to other database when needed. You don’t need to install any database specific clients — just copy a JDBC driver.

The following code in R connects to the database (Oracle in this example):

require(RJDBC)
driver <- JDBC("oracle.jdbc.driver.OracleDriver", "path/to/JDBC/Driver/ojdbc6.jar")
connection<-dbConnect(driver,"jdbc:oracle:thin:@ip_addres:1521:dbname","username","password")
d<-dbGetQuery(connection,"select * from user_tables")

I had problems when JAVA_HOME environment variable was pointing to JDK. In such case just change it to JRE directory. You can do it right from the R script with the following line:

Sys.setenv(JAVA_HOME="/path/to/java")

Remember that when setting this on Windows you will need to escape backslash: (i.e.: C:\\Program Files\\Java…)

Wikipedia Stats on the Net

If you are interested in Wikipedia statistics there are several pages worth visiting.

Last time I mentioned the dumps page which provides source data for all analysis (how often given page was requested). Basically, there are two kind of logs: the original, and in a compact form — maintained by Erik Zachte. The latter makes the analysis faster (logs are aggregated to daily results, however hour values are added as well in the last column).

Erik Zachte (Data Analyst at Wikipedia) presents a handful of Wikipedia visualisations and metrics. His profile page and blog are good starting point. You can find a lot of useful links there to check what other people have already done in this matter.

There is also http://stats.wikimedia.org, page that you should browse if you are looking for anything related to Wikipedia statistics. One of my favourite example from that page is http://www.wikipediatrends.com.