Big Data Landscape

There are many projects that make big data possible. If you look at the popular Cloudera or Hortonworks distributions, you can see a number of tools and frameworks that are ready to fit into an existing corporate ecosystem and provide insight into processed data.

This big data ecosystem evolves all the time, and because the vast majority of it is open source software, everyone can participate in developing it. I took a quick look at the GitHub repositories of some Hadoop-related projects and generated statistics such as the number of commits and the number of added or removed lines. This gives some picture of each project and the effort involved in making a tool more mature.
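
For readers who want to reproduce this kind of statistic, here is a minimal sketch in Python. It is not the tooling behind the numbers in this post; it is just one way to do it, assuming a local clone of a repository and the standard `git log --numstat` output. The names `parse_numstat`, `repo_stats`, `RepoStats` and the `@@commit@@` marker are made up for this illustration.

```python
import subprocess
from collections import namedtuple

# Hypothetical container for the three numbers discussed above.
RepoStats = namedtuple("RepoStats", ["commits", "added", "removed"])

# Arbitrary marker (an assumption of this sketch) that makes commit
# header lines easy to tell apart from the per-file numstat lines.
COMMIT_MARKER = "@@commit@@"


def parse_numstat(log_text):
    """Count commits and added/removed lines in `git log --numstat` output."""
    commits = added = removed = 0
    for line in log_text.splitlines():
        if line.startswith(COMMIT_MARKER):
            commits += 1
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator lines and anything unexpected
        plus, minus, _path = parts
        # Binary files are reported as "-\t-\tpath"; skip them.
        if plus.isdigit() and minus.isdigit():
            added += int(plus)
            removed += int(minus)
    return RepoStats(commits, added, removed)


def repo_stats(repo_path):
    """Run git in a local clone at repo_path and summarize its history."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat",
         "--pretty=format:" + COMMIT_MARKER + "%H"],
        check=True, capture_output=True, text=True,
    )
    return parse_numstat(result.stdout)
```

Calling `repo_stats` on the path of a cloned repository returns the totals for the checked-out branch. Like the statistics in this post, the result only reflects what is recorded in that particular repository.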

Screenshot showing number of commits to big data projects

My choice of projects was rather arbitrary, and there are good reasons to go further and keep adding more repositories. I tried to select tools that are usually found in a Hadoop deployment or can optionally fit into existing big data environments. I divided the projects into several groups:

  • general tools
  • SQL processing tools (+Pig)
  • processing frameworks or libraries
  • big table implementations (HBase, Cassandra, Accumulo)
  • web notebooks (Hue, Zeppelin)
  • integration tools (online or batch)

Of course, you should keep in mind that this is based only on each project's current GitHub repository. Some of the projects were developed earlier in different repositories (for example, Hive has much earlier history records than Hadoop itself). Besides that, some of the tools were open sourced at some point in time, whereas others were open source from the very beginning. Nevertheless, the statistics give an overall feeling and make it easy to spot the projects with more intensive development.

Have a look at this page.