HBase Digest, January 2010

Here at Sematext we are making more and more use of the Hadoop family of projects. We are expanding our digest post series with this HBase Digest and adding it to the existing Lucene and Solr coverage. For those of you who wants to be up to date with HBase community discussions and benefit from the knowledge-packed discussions that happen in that community, but can’t follow all those high volume Hadoop mailing lists, we also include a brief mailing lists coverage.

  • HBase 0.20.3 has just been released. Nice way to end the month.  It includes fixes of huge number of bugs, fixes of EC2-related issues and good amount of improvements. HBase 0.20.3 uses the latest 3.2.2 version of Zookeeper.  We should also note that another distributed and column-oriented database from Apache was released a few days ago, too – Cassandra 0.5.0.
  • An alternative indexed HBase implementation (HBASE-2037) was reported as completed (and included in 0.20.3). It speeds up scans by adding indexes to regions rather than secondary tables.
  • HBql was announced this month, too. It is an abstraction layer for HBase that introduces SQL dialect for HBase and JDBC-like bindings, i.e. more familiar API for HBase users. Thread…
  • Ways of integrating with HBase (instantiating HTable) on client-side: Template for HBase Data Acces (for integration with Spring framework), simple Java Beans mapping for HBase, HTablePool class.
  • HbaseExplorer – an open-source web application that helps with simple HBase data administration and monitoring.
  • There was a discussion about the possibilities of splitting the process of importing very large data volumes into HBase in separate steps. Thread…
  • To get any parallelization, you have to start multiple JVMs in the current Hadoop version. Thread…
  • Tips for increasing the HBase write speed: use random int keys to distribute loading between RegionServers; use multi-process client instead of multi-threaded client; set a higher heap space in conf/hbase-env.sh , give it a much as you can without swapping; consider lzo to hold the same amount of data in fewer regions per server. Thread…
  • Some advice on hardware configuration for the case of managing 40-50K records/sec write speed. Thread…
  • Secondary index can go out of sync with the base table in case of I/O exceptions during commit (when using transactional contrib). Handling such exceptions in transactional layer should be revised. Thread…
  • Configuration instance (as well as an instance of HBaseConfiguration) is not thread-safe, so do not change it when sharing between threads. Thread…
  • What are the minimal number of boxes for HBase deployment? Covered both HA and non-HA options, what deployments can share the same boxes, etc. Thread…
  • Optimizing random reads: using client-side multi-threading will not improve reads greatly according to some tests, but there is an open JIRA issue HBASE-1845 related to batch operations. Thread…
  • Exploring possibilities for server-side data filtering. Discussed classpath requirements for that and the variants for filters hot-deploy. Thread…
  • How-to: Configure table to keep only one version of data. Thread…
  • Recipes and hints for scanning more than one table in Map. Thread…
  • Managing timestamps for Put and Delete operations to avoid unwanted overlap between them. Thread…
  • The amount of replication should have no effect on the performance for reads using either scanner or random-access. Thread…

Did you really make it this far down?  :) If you are into search, see January 2010 Digests for Lucene and Solr, too.

Please, share your thoughts about the Digest posts as they come. We really like to know whether they are valuable.  Please tell us what format and style you prefer and, really, any other ideas you have would be very welcome.

3.2.2

UIMA Talk at NY Search & Discovery Meetup

This coming February, the NY Search & Discovery meetup will be hosting Dr. Pablo Duboue from IBM Research and his talk about UIMA.

This following is an “excerpt from the blurb” about Pablo’s talk:

“… In this talk, I will briefly present UIMA basics before discussing full UIMA systems I have been involved in the past (including our Expert Search system in TREC Enterprise Track 2007). I will be talking about how UIMA supported the construction of our custom NLP tools. I will also sketch the new characteristics of the UIMA Asynchronous Scaleout (UIMA AS) subproject that enable UIMA to run Analysis Engines in thousands of machines… “

The talk/presentation will be on February 24, 2010.  Sign up (free) at http://www.meetup.com/NYC-Search-and-Discovery/calendar/12384559/ .

Solr Digest, January 2010

Similar to our Lucene Digest post , we’ll occasionally cover recent developments in the  Solr world. Although Solr 1.4 was released only two months ago, work on new features for 1.5 (located in the svn trunk) is in full gear:

1) GeoSpatial search features have been added to Solr. The work is based on Local Lucene project, donated to the Lucene community a while back and converted to a Lucene contrib module. The features to be incorporated into Solr will allow one to implement things like:

  • filter by a bounding box – find all documents that match some specific area
  • calculate the distance between two points
  • sort by distance
  • use the distance to boost the score of documents (this is different than sort and would be used in case you want some other factors to affect the ordering of documents)

The main JIRA issue is SOLR-773, and it is scheduled to be implemented in Solr 1.5.

2) Solr Autocomplete component - Autocomplete functionality is a very common requirement. Solr itself doesn’t have the component which would provide such functionality (unlike SpellcheckComponent which provides spellchecking, although in limited form; more on this in another post or, if you are curious, have a look at the DYM ReSearcher). There are a few common approaches to solving this problem, for instance, by using recently added TermsComponent (for more information, you can check this excellent showcase by Matt Weber).  These approaches, however, have some limitations. The first limitation that comes to mind is spellchecking and correction of misspelled words. You can see such feature in Firefox’ google-bar,  if you write “Washengton”. You’ll still get “Washington” offered, while TermsComponent (and some other) approaches fail here.

The aim of SOLR-1316 is to provide autocomplete component in Solr out-of-the-box. Since it is still in development, you can’t quite rely on it, but it is scheduled to be released with Solr 1.5. In the mean-time, you can check another Sematext product which offers AutoComplete functionality with few more advanced features. It is a constantly developing product whose features have been heavily absed on real-life/customer feedback.  It uses an approach unlike than of TermsComponent and is, therefore, faster. Also, one of the features currently in development (and soon to be released) is spell-correction of queries.

3) One very nice addition in Solr 1.5 is edismax or Extended Dismax. Dismax is very useful for situations where you want to let searchers just enter a few free-text keywords (think Google) , without field names, logical operators, etc.  The Extended Dismax is contributed thanks to Lucid Imagination.  Here are some of its features:

  • Supports full Lucene query syntax in the absence of syntax errors
  • supports “and”/”or” to mean “AND”/”OR” in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them… in this mode, fielded queries, +/-, and phrase queries are still supported.
  • Improved proximity boosting via word bi-grams… this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.
  • advanced stopword handling… stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
  • Supports the “boost” parameter.. like the dismax bf param, but multiplies the function query instead of adding it in
  • Supports pure negative nested queries… so a query like +foo (-foo) will match all documents

You can check the development in JIRA issue SOLR-1553.

4) Up until version 1.5, distributed Solr deployments depended not only on Solr features (like replication or sharding), but also on some external systems, like load-balancers. Now, Zookeeper is used to provide Solr specific naming service (check JIRA issue SOLR-1277 for the details). The features we’ll get look exciting:

  • Automatic failover (i.e. when a server fails, clients stop trying to index to or search it and uses a different server)
  • Centralized configuration management (i.e. new solrconfig.xml or schema.xml propagate to a live Solr cluster)
  • Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers).

We’ll cover some of these topics in more details in the future installments. Thanks for reading!

Lucene Digest, January 2010

In this debut Lucene Digest post we’ll cover some of the recent interesting happenings in Lucene-land.

  • For anyone who missed it, Lucene 3.0.0 is out (November 2009).  The difference between 3.0.0 and the previous release is that 3.0.0 has no deprecated code (big cleanup job) and that the support for Java 1.4 was finally dropped!
  • As of this writing, there is only 1 JIRA issue targeted for 3.0.1 and 104 targeted for 3.1 Lucene release.
  • Luke goes hand in hand with Lucene, and Andrzej quickly released Luke that uses the Lucene 3.0.0 jars.  There were only minor changes in the code (and none in functionality) related to the changes in API between Lucene 2.9.1 and 3.0.
  • As usual, the development is happening on Lucene’s trunk, but there is now also a “flex branch” in svn, where features related to Flexible Indexing (not quite up to date) are happening.  Lucene’s JIRA also has a Flex Branch “version”, and as of this writing there are 6 big issues being worked on there.
  • The new Lucene Connectors Framework (or LCF for short) subproject is in the works.  As you can guess from the name, LCF aims to provide a set of connectors to various content repositories, such as relational databases, SharePoint, ECM Documentum, File System, Windows File Shares, various Content Management Systems, even Feeds and Web sites.  LCF is based on code donation from MetaCarta and will be going through the ASF Incubator. We expect it to graduate from the Incubator into a regular Lucene TLP subproject later this year.
  • Spatial search is hot, or at least that’s what it looks like from inside Lucene/Solr.  Both projects’ developers and contributors are busy adding support for spatial/geo search.  Just in Lucene alone, there are currently 21 open JIRA issues aimed at spatial search.  Lucene’s contrib/spatial is where spatial search lives.  We’ll cover Solr’s support for spatial search in a future post.
  • Robert Muir and friends have added some final-state automata technology to Lucene and dramatically improved regular expression and wildcard query performance.  Fuzzy queries may benefit from this in the future, too.  The ever industrious Robert covered this in a post with more details.

These are just some highlights from Luceneland.  More Lucene goodness coming soon.  Please use comments to tell us whether you find posts like this one useful or whether you’d prefer a different angle, format, or something else.

Sponsoring NYC Search & Discovery Meetup

We were busy in 2009, and so was Otis, who organized the NYC Search & Discovery Meetup, and which we happily sponsor.  So far a few interesting talks/presentations have happened and more are scheduled for 2010 (presenters for the next few months are already known).  Also, Otis and Daniel gave Faceted Search presentation (slides) at the NY CTO Club in December 2009, and Otis presented Lucene (slides and video) at the Semantic Web Meetup in October 2009.  More to come in 2010!

Lucene TLP Digests

We promised we’d blog more in 2010.  One of the things you’ll see on this blog in 2010 are regularly irregular periodic Digests of interesting or significant development in various projects under the Lucene Top Level Project.  The thinking is that with the ever-increasing traffic on various Lucene TLP mailing lists, we are not the only ones having a hard time keeping up.  Since we have to closely follow nearly all Lucene TLP projects for our work anyway, we may as well try and help others by periodically publishing posts summarizing key development.  We’ll tag and categorize our Digest posts appropriately and consistently, so you’ll be able to subscribe to them directly if that’s all you are interested in.  We’ll try to get @sematext to tweet as well, so you can follow us there, too.

Blog Volume, 2010

Happy 2010, subscribers!

If you were to judge Sematext by its blogging frequency, you’d think we are out of business.  Luckily, the reason for low blogging volume in 2009 is quite the opposite – we’ve been busy developing several products and helping our customers.   Despite being busy, we’ll be blogging more in 2010.

Follow

Get every new post delivered to your Inbox.

Join 1,668 other followers