Elasticsearch and R

Last month I spent some time working with Elasticsearch and Kibana, trying to integrate them with other systems. Connecting Elasticsearch with Hive, for example, is very easy, but I wondered how to read and save data from R.

Let’s use the same NYC vehicle collision data, available here.

Saving data to Elasticsearch

The elastic package makes interaction with ES very easy.

install.packages("elastic")
library(elastic)

connect(es_base = "localhost", es_port = "9200")

Let’s say that collisions is a data frame holding the data we want to store in Elasticsearch. Saving it to an index takes a single function call:

docs_bulk(collisions, index = "nyc_collisions")
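For readers following along without the original file, here is a minimal sketch of what the collisions data frame might look like. The column names are taken from the summary output shown later in this post; the two rows are made-up placeholders, not real records.

```r
# Hypothetical stand-in for the real collisions data.
# Column names follow the summary shown later in the post;
# the values are illustrative placeholders only.
collisions <- data.frame(
  date = c("2015-07-12", "2015-08-07"),
  borough = c("MANHATTAN", "MANHATTAN"),
  zip = c(10025L, 10027L),
  lat = c(40.80, 40.81),
  lon = c(-73.96, -73.95),
  street = c("BROADWAY", "AMSTERDAM AVENUE"),
  killed = c(0L, 0L),
  pedestrians_injured = c(0L, 0L),
  stringsAsFactors = FALSE  # keep text columns as character
)

# Inspect the structure before sending it anywhere
str(collisions)
```

A data frame of this shape is what the docs_bulk call above would receive, one row per document.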

If the index already exists and you need a clean upload, delete it first:

index_delete(index = "nyc_collisions")

Mappings

Unfortunately, by default all text fields are analyzed (split into separate tokens), which is not useful for columns such as the street name:

[Figure: Aggregating by analyzed text field]
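To see why an analyzed field aggregates badly, the effect can be imitated in plain R. This is only a crude approximation of what the default analyzer does (lowercasing and splitting on non-alphanumeric characters), not its actual implementation, and the street names are made up for illustration.

```r
# Made-up street values, for illustration only
streets <- c("EAST 42 STREET", "WEST 42 STREET", "EAST 42 STREET")

# Rough imitation of analysis: lowercase, then split into tokens
tokens <- unlist(strsplit(tolower(streets), "[^a-z0-9]+"))

# Aggregating over tokens counts words, not streets
token_counts <- table(tokens)
print(token_counts)

# Aggregating over the untouched values counts whole street names
street_counts <- table(streets)
print(street_counts)
```

The token counts mix "east", "42" and "street" as separate buckets, which is roughly what a terms aggregation over an analyzed field shows.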

To prevent this, we need to specify a mapping that marks the field as not analyzed, and we need to do so before loading the data:

index_delete(index = "nyc_collisions")

index_create(index = "nyc_collisions")

mapping_create(index = "nyc_collisions", type = "nyc_collisions", body = '
{
  "nyc_collisions": {
    "properties": {
      "street": { "type": "string", "index": "not_analyzed" }
    }
  }
}')

docs_bulk(collisions, index = "nyc_collisions")

Now we can treat street as a proper categorical variable:

[Figure: Grouping by not analyzed text field]
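The mapping body used in this section is a hand-written JSON string. As an alternative sketch, the same structure can be built as a nested R list and serialized with jsonlite, which avoids quoting mistakes when more fields are added. Using jsonlite here is my own choice and not something the elastic package requires.

```r
library(jsonlite)

# Same mapping as the hand-written JSON, expressed as a nested list
mapping <- list(
  nyc_collisions = list(
    properties = list(
      street = list(type = "string", index = "not_analyzed")
    )
  )
)

# auto_unbox = TRUE writes length-one vectors as scalars, not arrays
body <- toJSON(mapping, auto_unbox = TRUE, pretty = TRUE)
print(body)
```

The resulting string, as.character(body), has the same content as the JSON literal and could be passed as the body argument in its place.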

Reading from Elasticsearch

The Search function (with a capital “S”) runs a search against ES and fetches the results into R.

res <- Search(index = "nyc_collisions",
              q = "manhattan",
              size = 10,
              asdf = TRUE)

Here q is the query used to filter the index, size specifies the number of records to retrieve, and asdf asks for the results to be returned as a data frame.

Our final data frame is in res$hits$hits$'_source':

resdf <- res$hits$hits$'_source'
resdf$date <- as.Date(resdf$date)
summary(resdf)

      date              borough               zip             lat             lon            street              killed 
 Min.   :2015-07-12   Length:10          Min.   :10006   Min.   :40.71   Min.   :-74.01   Length:10          Min.   :0  
 1st Qu.:2015-07-14   Class :character   1st Qu.:10025   1st Qu.:40.80   1st Qu.:-73.96   Class :character   1st Qu.:0  
 Median :2015-07-28   Mode  :character   Median :10026   Median :40.80   Median :-73.96   Mode  :character   Median :0  
 Mean   :2015-07-26                      Mean   :10024   Mean   :40.79   Mean   :-73.96                      Mean   :0  
 3rd Qu.:2015-08-05                      3rd Qu.:10027   3rd Qu.:40.81   3rd Qu.:-73.95                      3rd Qu.:0  
 Max.   :2015-08-07                      Max.   :10031   Max.   :40.83   Max.   :-73.95                      Max.   :0  
 pedestrians_injured
 Min.   :0          
 1st Qu.:0          
 Median :0          
 Mean   :0          
 3rd Qu.:0          
 Max.   :0    

Searching gets more complicated when we retrieve only specified fields rather than whole documents. In that case the field values come back as lists that we have to unlist manually, but restricting the fields can be necessary to fit bigger datasets into memory.
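A minimal sketch of that manual unlisting, using a mocked response instead of a live query. The shape of hits below (each hit carrying a fields element whose values are wrapped in lists) is an assumption about what a field-restricted search returns when it is not converted to a data frame; the values are made up.

```r
# Mocked hits, assumed shape: every field value arrives wrapped in a list
hits <- list(
  list(`_id` = "1", fields = list(borough = list("MANHATTAN"), zip = list(10025))),
  list(`_id` = "2", fields = list(borough = list("MANHATTAN"), zip = list(10027)))
)

# Unwrap each field to a scalar and build a one-row data frame per hit
rows <- lapply(hits, function(h) {
  flat <- lapply(h$fields, function(v) unlist(v)[1])
  as.data.frame(flat, stringsAsFactors = FALSE)
})

# Stack the rows into the final data frame
resdf <- do.call(rbind, rows)
print(resdf)
```

With a real response, the hits list would presumably come from res$hits$hits rather than being typed in by hand.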