Registration is open - Live, Instructor-led Online Classes - Elasticsearch in March - Solr in April - OpenSearch in May. See all classes


HBase Digest, February 2010

The first HBase Digest post received very good feedback from the community. We continue using HBase at Sematext and thus continue covering the status of HBase project with this post.

  • Added Performance Evaluation for IHBase. The PE does a good job of showing what IHBase is good and bad at.
  • Transactional contrib stuff is more geared to short-duration transactions, but it should be possible to share transaction states across machines with certain rules in mind. Thread…
  • Choosing between Thrift and REST connectors for communicating with HBase outside of Java is explained in this thread.
  • How to properly set up Zookeeper to be used by HBase (how many instances/resources should be dedicated to it, etc) is discussed in this thread. Some more info you can find is also in this one.
  • Yahoo Research has developed a benchmarking tool for “Cloud Serving Systems”. In their paper describing the tool which they intend to open source soon, they compare four “Cloud Serving Systems” and HBase is one of them. Please, also read the explanation from HBase dev team about the numbers inside this paper.
  • HBase trunk has been migrated to a Maven build system.
  • New branch opened for 0.20 updated version to run on Hadoop 0.21. This lets 0.20.3 or 0.20.2 clients operate against HBase running on HDFS 0.21 (with durable WAL, etc.) without any change to the client side. Thread…
  • Since Hadoop 0.21 isn’t going to be released soon and HBase team is waiting for applying critical changes (HDFS-265, HDFS-200, etc.) to make HBase user’s life easier, HBase trunk is likely to support both 0.21 and the patched 0.20 versions of Hadoop. There was a discussion about naming convention for HBase releases with regard to this fact which also touches plans for which features to include in the nearest releases.
  • Cloudera’s latest release now includes HBase-0.20.3.
  • Exploring possible solutions to “write only top N rows from reduce outcome”. Thread…
  • A new handy binary comparator was added that only compares up to the length of the supplied byte array.
  • These days, HBase developers are working hard on the very sweet “Multi data center replication” feature. It is aimed for 0.21 and will support federated deployment where someone might have terascale (or beyond) clusters in more than one geography and would want the system to handle replication between the clusters/regions.

We’d also like to introduce a small FAQ and FA (frequent advices) section to save some time for HBase dev team who is very supportive on the mailing lists.

  • How to move/copy all data from one HBase cluster to another? If you stop the source cluster then you can distcp the /hbase to the other cluster. Done. A perfect copy.
  • Is there a way to get the row count of the table? From Java API? There is no single-call method to do that. Actual row count info isn’t stored anywhere. You can use “count” command from HBase shell which iterates over all records and may take a lot of time to complete. It can be discovered by a table scan, or distributed count (MapReduce job usually).
  • I’m editing property X in hbase-default.xml to… You should edit hbase-site.xml, not hbase-default.xml.
  • Inserting row keys with an incremental ID is usually not a good idea since sequential writing is usually slower than random writing. If you can’t find a natural row key (which is good for scans), use a UUID.
  • Apply HBASE-2180 patch to increase random read performance in case of multiple concurrent clients.
  • How can I perform “select * from tableX where columnY=Z”-like query in HBase? You’ll need to use a Scan along with a SingleColumnValueFilter. But this isn’t quick, it’s like performing a SQL query on a column that isn’t indexed: the more data you have the longer it will take. There’s no support for secondary indexes in HBase core, you need you use one of the contribs (2 are available in 0.20.3: src/contrib/indexed and src/contrib/transactional). Another option is maintaining the indexes yourself.

Some other efforts that you may find interesting:

  • Clojure-HBase was introduced. It is a simple library for accessing HBase conveniently from Clojure.
  • DataNucleus Access Platform 2.0 now contains plugin for persistence to HBase (HADOOP) datastores.
  • HBase Indexing Library aids in building and querying indexes on top of HBase, in Google App Engine datastore-style.
  • HBase-dsl is meant to help reduce and simplify your HBase code.

Start Free Trial