Last month I spent some time working with Elasticsearch and Kibana, trying to integrate them with other systems. For example, connecting Elasticsearch with Hive is very easy, but I wondered about reading and saving data from R.
Let’s use the same NYC data describing vehicle collisions, available here.
Saving data to Elasticsearch
The elastic package makes interaction with ES very easy.
install.packages("elastic")
require(elastic)
connect(es_base = "localhost", es_port = "9200")
Let’s say that collisions is a data frame holding the data we want to store in Elasticsearch. Saving it to an index takes a single function call:
docs_bulk(collisions, index = "nyc_collisions")
If the index already exists and a clean upload is necessary, you can remove it first:
index_delete(index = "nyc_collisions")
Mappings
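Unfortunately, by default all text fields are analyzed, that is, split into separate tokens. In our case this is not useful for the column describing the street, because a street name should be treated as a single value.
To prevent this analysis we need to specify a mapping: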
index_delete(index = "nyc_collisions")
index_create(index = "nyc_collisions")
mapping_create(index = "nyc_collisions", type = "nyc_collisions", body = '
{
  "nyc_collisions": {
    "properties": {
      "street": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}')
docs_bulk(collisions, index = "nyc_collisions")
Now we can fully enjoy street as a categorical variable.
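When several columns need the same treatment, writing the JSON by hand gets tedious. Below is a minimal sketch of building the same mapping body from an R list with jsonlite. The helper name not_analyzed_mapping is hypothetical (it is not part of the elastic package), and the "borough" field is added only to show more than one column.

```r
library(jsonlite)

# Hypothetical helper: builds a mapping that marks the given string
# fields as "not_analyzed" for one document type.
not_analyzed_mapping <- function(type_name, fields) {
  properties <- setNames(
    lapply(fields, function(f) list(type = "string", index = "not_analyzed")),
    fields
  )
  setNames(list(list(properties = properties)), type_name)
}

mapping <- not_analyzed_mapping("nyc_collisions", c("street", "borough"))

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays
mapping_json <- toJSON(mapping, auto_unbox = TRUE, pretty = TRUE)
cat(mapping_json)
```

The resulting string should be usable as the body argument of mapping_create in place of the hand-written JSON above, though it is worth printing it and checking it once before uploading.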
Reading from Elasticsearch
There is a Search function (with a capital "S") that runs searches on ES and fetches the results into R.
res <- Search(index = "nyc_collisions", q = "manhattan", size = 10, asdf = TRUE)
Here q is the query used to filter the index, size specifies the number of records that will be retrieved, and asdf means the data is fetched as a data frame.
Our final data frame is in res$hits$hits$'_source':
resdf <- res$hits$hits$'_source'
resdf$date <- as.Date(resdf$date)
summary(resdf)

      date              borough               zip             lat             lon            street              killed
 Min.   :2015-07-12   Length:10          Min.   :10006   Min.   :40.71   Min.   :-74.01   Length:10          Min.   :0
 1st Qu.:2015-07-14   Class :character   1st Qu.:10025   1st Qu.:40.80   1st Qu.:-73.96   Class :character   1st Qu.:0
 Median :2015-07-28   Mode  :character   Median :10026   Median :40.80   Median :-73.96   Mode  :character   Median :0
 Mean   :2015-07-26                      Mean   :10024   Mean   :40.79   Mean   :-73.96                      Mean   :0
 3rd Qu.:2015-08-05                      3rd Qu.:10027   3rd Qu.:40.81   3rd Qu.:-73.95                      3rd Qu.:0
 Max.   :2015-08-07                      Max.   :10031   Max.   :40.83   Max.   :-73.95                      Max.   :0
 pedestrians_injured
 Min.   :0
 1st Qu.:0
 Median :0
 Mean   :0
 3rd Qu.:0
 Max.   :0
Searching gets more complicated when we retrieve only specified fields instead of whole documents. The field values then come back as lists that we have to unlist manually, but this can be necessary to fit bigger datasets into memory.
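The manual unlisting could look roughly like the sketch below. It assumes the selected fields arrive as list-columns, with one short list or vector per hit. The fields_df object is a mock standing in for a real response, its values are invented for illustration, and flatten_column is a hypothetical helper.

```r
# Mock of the fields part of a search result: every column is a list
# with one element per hit. The values are made up for illustration.
fields_df <- data.frame(row.names = 1:3)
fields_df$borough <- list("MANHATTAN", "MANHATTAN", "MANHATTAN")
fields_df$zip <- list(10025, 10026, 10027)

# Hypothetical helper: turn one list-column into an atomic vector,
# taking the first value of each cell and NA where a cell is empty.
flatten_column <- function(col) {
  unlist(lapply(col, function(x) if (length(x) == 0) NA else x[[1]]))
}

flat_df <- as.data.frame(lapply(fields_df, flatten_column),
                         stringsAsFactors = FALSE)
str(flat_df)
```

Taking only the first value of each cell is a simplification: a field that really holds several values per document would need different handling.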