HBase Digest, January 2010

Here at Sematext we are making more and more use of the Hadoop family of projects. We are expanding our digest post series with this HBase Digest and adding it to the existing Lucene and Solr coverage. For those of you who wants to be up to date with HBase community discussions and benefit from the knowledge-packed discussions that happen in that community, but can’t follow all those high volume Hadoop mailing lists, we also include a brief mailing lists coverage.

  • HBase 0.20.3 has just been released. Nice way to end the month.  It includes fixes of huge number of bugs, fixes of EC2-related issues and good amount of improvements. HBase 0.20.3 uses the latest 3.2.2 version of Zookeeper.  We should also note that another distributed and column-oriented database from Apache was released a few days ago, too – Cassandra 0.5.0.
  • An alternative indexed HBase implementation (HBASE-2037) was reported as completed (and included in 0.20.3). It speeds up scans by adding indexes to regions rather than secondary tables.
  • HBql was announced this month, too. It is an abstraction layer for HBase that introduces SQL dialect for HBase and JDBC-like bindings, i.e. more familiar API for HBase users. Thread…
  • Ways of integrating with HBase (instantiating HTable) on client-side: Template for HBase Data Acces (for integration with Spring framework), simple Java Beans mapping for HBase, HTablePool class.
  • HbaseExplorer – an open-source web application that helps with simple HBase data administration and monitoring.
  • There was a discussion about the possibilities of splitting the process of importing very large data volumes into HBase in separate steps. Thread…
  • To get any parallelization, you have to start multiple JVMs in the current Hadoop version. Thread…
  • Tips for increasing the HBase write speed: use random int keys to distribute loading between RegionServers; use multi-process client instead of multi-threaded client; set a higher heap space in conf/hbase-env.sh , give it a much as you can without swapping; consider lzo to hold the same amount of data in fewer regions per server. Thread…
  • Some advice on hardware configuration for the case of managing 40-50K records/sec write speed. Thread…
  • Secondary index can go out of sync with the base table in case of I/O exceptions during commit (when using transactional contrib). Handling such exceptions in transactional layer should be revised. Thread…
  • Configuration instance (as well as an instance of HBaseConfiguration) is not thread-safe, so do not change it when sharing between threads. Thread…
  • What are the minimal number of boxes for HBase deployment? Covered both HA and non-HA options, what deployments can share the same boxes, etc. Thread…
  • Optimizing random reads: using client-side multi-threading will not improve reads greatly according to some tests, but there is an open JIRA issue HBASE-1845 related to batch operations. Thread…
  • Exploring possibilities for server-side data filtering. Discussed classpath requirements for that and the variants for filters hot-deploy. Thread…
  • How-to: Configure table to keep only one version of data. Thread…
  • Recipes and hints for scanning more than one table in Map. Thread…
  • Managing timestamps for Put and Delete operations to avoid unwanted overlap between them. Thread…
  • The amount of replication should have no effect on the performance for reads using either scanner or random-access. Thread…

Did you really make it this far down?  :) If you are into search, see January 2010 Digests for Lucene and Solr, too.

Please, share your thoughts about the Digest posts as they come. We really like to know whether they are valuable.  Please tell us what format and style you prefer and, really, any other ideas you have would be very welcome.

3.2.2

6 Responses to HBase Digest, January 2010

  1. Stack says:

    Thanks for doing this you angels from heaven.

  2. stephen says:

    Hi, this is a very useful summary – please keep it up! :)

  3. Thanks for this digest !

    Lucene, Solr and HBase community activity is sometime hard to track so such digest is definitely valuable !

  4. Pingback: Mahout Digest, February 2010 « Sematext Blog

  5. Pingback: Hadoop Digest, February 2010 « Sematext Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,696 other followers

%d bloggers like this: