Bigtable, a NoSQL database technology, is gaining huge momentum in the Big Data world. The most important reason is its creators: it was built by Google engineers who, for obvious reasons, know how to manage humongous amounts of data. For the most part, Bigtable has a schema-less architecture and uses row keys to partition data and distribute it throughout the cluster.
If Bigtable sounds impressive to you, the Bigtable-inspired Cassandra and HBase are sure to earn your appreciation. While Cassandra descends from both Amazon Dynamo and Bigtable, HBase is an “open-source Bigtable implementation”. The two have a lot of similarities, but there are also some important differences that cannot be ignored. Let us look at both, so that you are better placed to pick one.
The Similarities
Cassandra and HBase are NoSQL databases that are not manipulated with standard SQL. Both are distributed databases, which gives you more freedom in how data is spread across machines when it is stored and accessed. They were created to manage very large data sets, typically millions or billions of rows. If you are working with anything smaller, an RDBMS is the better choice.
Their scalability is near linear, and users can simply multiply the number of nodes when they need to manage additional data. Both use replication to protect data against the failure of a cluster node. If a primary node fails, you can still fetch the data from the replica nodes.
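The replication idea described above can be illustrated with a small Python sketch. It is a toy model and not the actual code of either database: the class name TinyCluster, the node names, and the rule of placing copies on the next nodes in a ring are all assumptions made for illustration. The ring-style placement is loosely closer to Cassandra, whereas HBase relies on HDFS to replicate the underlying files.

```python
import hashlib


class TinyCluster:
    """Toy in-memory cluster: every row is copied to `replication_factor` nodes."""

    def __init__(self, node_names, replication_factor=3):
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication_factor must be between 1 and the node count")
        self.node_names = list(node_names)
        self.replication_factor = replication_factor
        self.storage = {name: {} for name in self.node_names}  # node -> {row_key: value}
        self.down = set()  # names of nodes that are currently unavailable

    def replicas_for(self, row_key):
        """Return the primary node followed by its replicas for this row key."""
        digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.node_names)
        return [
            self.node_names[(start + i) % len(self.node_names)]
            for i in range(self.replication_factor)
        ]

    def put(self, row_key, value):
        """Write the value to every replica that is currently up."""
        for node in self.replicas_for(row_key):
            if node not in self.down:
                self.storage[node][row_key] = value

    def get(self, row_key):
        """Read from the first reachable replica that holds the row."""
        for node in self.replicas_for(row_key):
            if node not in self.down and row_key in self.storage[node]:
                return self.storage[node][row_key]
        raise KeyError(row_key)


if __name__ == "__main__":
    cluster = TinyCluster(["node-a", "node-b", "node-c", "node-d"])
    cluster.put("user:42", "Ada")
    primary = cluster.replicas_for("user:42")[0]
    cluster.down.add(primary)        # simulate failure of the primary node
    print(cluster.get("user:42"))    # prints "Ada", served by a replica
```

Adding a node in this model only lengthens the node list, which hints at why capacity grows roughly in step with the number of nodes. A real system also has to move existing data when nodes join, which the sketch leaves out.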
The Dissimilarities
While Cassandra and HBase have a lot in common, as mentioned above, they also have some major differences.
HBase makes use of the Hadoop infrastructure (NameNode, ZooKeeper, HDFS), so organizations that already use Hadoop may be able to run HBase with their existing Hadoop knowledge. Cassandra did not start out in, or evolve around, Hadoop; as a result, its infrastructure and knowledge requirements are very different from Hadoop’s.
While Cassandra and HBase are both known for having symmetrical nodes, the symmetry is incomplete. In Cassandra, you are required to identify some seed nodes that work as concentration points for communication between the nodes of the cluster. In HBase, on the other hand, you need to designate the nodes that will work as master nodes and monitor the activities of the servers. Thus, while Cassandra allows multiple seed nodes to guarantee high availability, HBase offers multiple master nodes for the same purpose.
HBase exclusively supports ordered partitioning. This allows HBase to scale horizontally while also supporting rowkey range scans. In Cassandra, however, if the columns of a row are used to store the data for range scans, there is a practical row-size limit of tens of megabytes. Larger rows can result in problems with time and overhead.
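To make the link between ordered partitioning and range scans concrete, here is a minimal Python sketch. It is an illustration under stated assumptions and not HBase's implementation: the class OrderedTable, its split points, and the sample keys are all hypothetical. Rows are kept sorted by row key and divided into contiguous key ranges, so a scan only has to visit the ranges that overlap the requested interval.

```python
import bisect


class OrderedTable:
    """Toy table whose rows are range-partitioned by sorted row key."""

    def __init__(self, split_points):
        # split_points are hypothetical partition boundaries, e.g. ["g", "p"]
        # gives three partitions: key < "g", "g" <= key < "p", key >= "p".
        self.split_points = sorted(split_points)
        self.partitions = [
            {"keys": [], "rows": {}} for _ in range(len(self.split_points) + 1)
        ]

    def partition_index(self, row_key):
        """Find the partition whose key range contains row_key."""
        return bisect.bisect_right(self.split_points, row_key)

    def put(self, row_key, value):
        part = self.partitions[self.partition_index(row_key)]
        if row_key not in part["rows"]:
            bisect.insort(part["keys"], row_key)  # keep keys sorted within the partition
        part["rows"][row_key] = value

    def scan(self, start_key, stop_key):
        """Yield (key, value) pairs with start_key <= key < stop_key, in key order."""
        first = self.partition_index(start_key)
        last = self.partition_index(stop_key)
        # Only partitions overlapping the requested range are visited.
        for idx in range(first, last + 1):
            part = self.partitions[idx]
            lo = bisect.bisect_left(part["keys"], start_key)
            hi = bisect.bisect_left(part["keys"], stop_key)
            for key in part["keys"][lo:hi]:
                yield key, part["rows"][key]


if __name__ == "__main__":
    table = OrderedTable(["g", "p"])
    for fruit in ["plum", "apple", "melon", "cherry", "grape"]:
        table.put(fruit, len(fruit))
    print([key for key, _ in table.scan("b", "n")])
    # prints ['cherry', 'grape', 'melon']
```

With a hash-based partitioner, neighbouring keys land on unrelated nodes, so the same scan would have to ask every partition. That is the trade-off this paragraph points at.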
The Winner
There is no clear winner between the two. Both databases have a lot to offer, and the only real solution is to try both and find out which one works better for the target application.