What’s New in Solr Since 1.4.0

Following up on yesterday’s What’s New & What’s Fixed in Solr 1.4.1, here is a quick summary of some of the most visible changes in Solr since the 1.4.0 release.  All of these changes live on Lucene/Solr’s trunk, which we don’t recommend using in serious production environments unless you really love the bleeding edge (see Which Lucene or Solr Branch to Use).

So, what’s new in Solr since 1.4.0?  A great many things!  According to @lucene, a total of 110 JIRA issues have been resolved since the Solr 1.4.0 release.  If you have extra time on your hands, you can see all 110 issues in CHANGES.txt.  If you don’t have lots of time, read our summary below:

  • Solr trunk uses an index format that has changed since Solr 1.4.0, so if you upgrade from 1.4.0 to trunk, you must reindex.  Moreover, you must upgrade your slaves first, followed by the master.
  • If you build Solr from the trunk today, you will see it’s at version 4.0 (4.0-dev to be more precise).  It uses Lucene 4.0-dev (from trunk, of course).
  • No more compressed fields – you have to compress field values on your own before you add them.  This change is inherited from Lucene.
  • If you use dismax and rely on its “mm” parameter, check CHANGES.txt for details – its behavior has changed.
  • A lot of work has been done on Spatial Search, as well as on various function queries and distance measures.
  • A new “edismax” (extended dismax) query parser was added.  It addresses some of the frequently needed functionality missing from the old dismax (e.g. full Lucene syntax support, fielded queries, etc.) – a small query sketch follows this list.
  • Spellchecker and Distributed Search now play well together.  And hey – if you care about Spellchecker/DYM and want to make sure your users never hit a “sorry, no matches, go fix your query” page, please see our DYM ReSearcher – you’ll love it!  We know we love having it over on http://search-lucene.com/!
  • If you use date facet ranges, you now have more control over inclusion/exclusion of upper/lower end points.
  • You can specify percentages for cache autowarmCount, not just absolute numbers.
  • If you prefer JSON over XML, you can now use JSON to add and delete documents, plus issue the commit command – see the example after this list.
  • Facets got more optimization love.
  • Bugfixes… lots of them.
  • Last but not least, Solr also benefits from fixes and optimizations in the new Lucene versions.  Lucene has a much nicer CHANGES.txt.
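
To make the edismax bullet concrete, here is a minimal SolrJ sketch of an edismax query.  The Solr URL, field names, and boosts are made up for illustration, and parameter details may shift on trunk, so treat this as an approximation rather than gospel:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class EdismaxExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical local Solr instance and schema, purely for illustration
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("lucene AND title:search"); // full Lucene syntax, fielded query
        query.set("defType", "edismax");    // pick the new extended dismax parser
        query.set("qf", "title^2.0 body");  // spread the query across fields, with boosts
        query.set("mm", "2<75%");           // minimum-should-match, as with the old dismax

        QueryResponse response = server.query(query);
        System.out.println("Hits: " + response.getResults().getNumFound());
      }
    }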
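
And here is a sketch of sending updates as JSON instead of XML.  We use plain HttpURLConnection so the example stands alone; the /solr/update/json path and the exact JSON update syntax are as described on the wiki at the time of writing, so double-check them against your build:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class JsonUpdateExample {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/update/json"); // hypothetical local Solr
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Add one document, delete another by id, and commit – all in a single JSON request
        String body = "{ \"add\": { \"doc\": { \"id\": \"1\", \"title\": \"Hello, JSON\" } },"
                    + "  \"delete\": { \"id\": \"42\" },"
                    + "  \"commit\": {} }";

        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        System.out.println("Solr responded with HTTP " + conn.getResponseCode());
      }
    }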

These are the main changes, but numerous other ones are described in CHANGES.txt, including one new feature contributed by one of the Sematext guys.

Solr 1.4.1 What’s New & What’s Fixed

What’s New and What’s Fixed in Solr 1.4.1?  Many little and not so little things!  If you have extra time on your hands, you can go read CHANGES.txt.  If you don’t have lots of time, read our summary below:

  • Upgraded Lucene to 2.9.3
  • StreamingUpdateSolrServer could hang; this has been fixed
  • A few (potential or edge-case) memory and resource leaks have been plugged
  • Various Tokenizers and TokenFilters have been fixed.  These changes may actually require you to reindex data if you upgrade to Solr 1.4.1.
  • That’s about it!

Now, if you don’t want to use Solr 1.4.1, but want to ride the trunk, go read What’s New in Solr Since 1.4.0.

Which Lucene or Solr Branch to Use

If you are still wondering which Lucene or Solr version to use and you want to live on the nearly bleeding edge, here is a snippet from an email one of our customers sent earlier today:

“…

first of all thanks so much for your suggestion about using
svn.apache.org/repos/asf/lucene/dev/branches/branch_3x

It really changed our life!
:-)
It’s working well and seems much more stable than trunk/dev.

…”

Is anyone still using Lucene/Solr trunk?  Are you?

Lucene Digest, May 2010

Last month we were busy with work and didn’t publish our monthly Lucene Digest.  To make up for it, this month’s Lucene Digest really covers all Lucene developments in May from A to Z.

  • Mark Harwood had a busy month.  In LUCENE-2454 he contributed production-tested and often-needed functionality for properly indexing parent-child entities (or, more generally, any form of hierarchical search).  He introduced his work in Adding another dimension to Lucene searches.  Joaquin Delgado has been talking about the merger of unstructured and structured search (not surprising, considering his old company, with its Lucene-based Federated Search product, was acquired by Oracle several years ago!), so he quickly related this work to the ability to perform XQuery + full-text searches.  MarkLogic, watch your back! ;)
  • Mark also contributed a match spotter for all query types in LUCENE-1999.  This patch makes it possible to figure out which field(s) a match for a particular hit occurred in, which is functionality people ask about on the Lucene and Solr mailing lists every so often.  One warning, though: spotting the match and encoding that information causes some loss of score precision.
  • While Lucene already has TimeLimitedCollector, it’s not perfect and leaves room for improvement.  Back in 2009, Mark came up with TimeLimitedIndexReader, as you can tell from his messages in the Improving TimeLimitedCollector thread, and created a patch with it in LUCENE-1720, which filled some of TimeLimitedCollector’s gaps (a minimal sketch of collector-based time limiting follows this list):
    • Any reader activity can be time-limited rather than just single searches, e.g. the document retrieval phase.
    • It times out faster (i.e. runaway queries such as fuzzy queries are detected quickly, before the last “collect” stage of query processing).
  • Robert Muir, who gave a well-received presentation on Finite State Queries in Lucene at New York Search & Discovery Meetup (see slides) back in April 2010, has been busy consolidating Lucene and Solr analyzers, tokenizers, token filters, character filters, etc. and moving them to their new home: modules/analysis, under Lucene root dir in svn.  The plan is to produce separate and standalone artifacts (read: jars) for this analysis module.  Here at Sematext we will make use of this module immediately for some of our products that currently list Lucene as a dependency, even though they really only need Lucene’s analyzers.  Solr, too, will be another customer for the new analysis module, as described by Robert in solr and analyzers module (yes, we’re showing off Search-Lucene.com’s in-document search-term highlighting, which we find very useful).
  • Robert also worked on and committed an ICU-based tokenizer for Unicode-based text segmentation in LUCENE-2414.  This translates to having the ability to properly tokenize text that doesn’t use spaces as token separators.  If you’ve ever had to deal with searching Chinese, for example, you’ll know that word segmentation is one of the initial challenges one has to deal with.
  • Speaking of splitting on spaces, another task Robert took upon himself was to stop the Lucene QueryParser from splitting queries on spaces: LUCENE-2458.  This problem of queries being tokenized on spaces comes up quite often, so this is going to be a very welcome improvement in Lucene.
  • One day Robert was super bored, so he decided to write a Lucene analyzer for Indonesian: LUCENE-2437.
  • Andrzej and Israel Ekpo (the author of one of the Solr PHP clients) both decided to add support for search-time bitwise operations on integer fields around the same time.  Israel’s work is in LUCENE-2460, with an accompanying SOLR-1913 issue, while Andrzej’s is in SOLR-1918 and has no pure Lucene patch.  The difference is that Israel’s patch offers only filtering, while Andrzej’s patch performs scoring, which allows finding the best matching inexact bit patterns.  This has applications in e.g. near-duplicate detection.
  • In one of our current engagements we are working with a large, household-name organization and a big U.S. government contractor.  Their index is heavily sharded and is well over 2 TB.  Working with such large indices is no joke (though I’m happy to say we were able to immediately improve their search performance by 40% in the first performance tuning iteration).  What if we could make their indices smaller?  Would that make their search even faster?  Of course!  In LUCENE-1812 (nice number), Andrzej implemented a static index pruning tool that removes posting data from indices for terms with in-document frequency lower than some threshold.  We haven’t used this tool, and it looks like we may not use it for a while, because IBM apparently holds a patent on the exact same algorithm used in this tool.
  • Phrase queries got a little performance boost in LUCENE-2410.  Every little bit counts!
  • Tom Burton-West created and contributed a handy tool that outputs total term frequency and document frequency from a Lucene index: LUCENE-2393.  This tool is useful for estimating the sizes of some of the Lucene index files, and thus getting a better grasp on disk IO needs (a tiny sketch of the idea follows this list).
  • On both the Lucene and Solr lists we often see people asking about updating individual Document fields instead of simply deleting and re-adding the whole Document.  The delete-and-re-add approach is not necessarily a problem for Lucene/Solr itself, but it can be for an external system from which the data for the complete new Document needs to be fetched.  Shai Erera, another recently added Lucene committer, proposed a new approach for incremental field updates that was well received.  Once implemented, this will be a big boon for Lucene and Solr!  If that thread or message is too long for you to read, let us at least highlight (pun intended) the two great use cases from this message.
  • Lucandra is a Cassandra backend for Lucene.  But no, it’s not a Lucene Directory implementation.  Lucandra has its own IndexReader and IndexWriter that read from Cassandra and write to it.  But in LUCENE-2456 we now have another option: a Cassandra-based Lucene Directory.  We hope to have a post on this in the near future!
  • The author of Cassandra-based Lucene Directory also opened LUCENE-2425 for Anti-Merging Multi-Directory Indexing Framework that splits an index (at run-time) into multiple sub-indices, based on a “split policy”, several of which have also been added to Lucene’s JIRA.  This is somewhat similar to Lucene’s ParallelWriter, but has some differences, as described in the issue.
  • Michael McCandless is working on prototyping a multi-stage pipeline sub-system that aims to further decouple analysis from indexing.  In this pipeline, indexing would be just one step, one stage in the pipeline.  Based on the work done so far, this may even bring some performance improvements.
  • LUCENE-2295 added a LimitTokenCountAnalyzer / LimitTokenCountFilter that wraps any other Analyzer and provides the same functionality as MaxFieldLength on IndexWriter (a one-liner example follows this list).
  • Shay Banon, the author of ElasticSearch, contributed LUCENE-2468 (can you complete this hard-to-figure-out numeric sequence?), which allows one to specify how new Document deletions should be handled in CachingWrapperFilter and CachingSpanFilter.  We recently did work for another large organization and a household name (in the U.S. at least) where we improved their Lucene-based search performance by over 30%.  One of the things we did was make good use of CachingWrapperFilter (basic usage is sketched after this list).
  • LUCENE-2480 removes support for pre-3.* indices from Lucene 4.*.  Thus, if you are still on Lucene 1.* or Lucene 2.*, we suggest moving to Lucene 3.* soon.  But due to radical changes in Lucene, even moving from Lucene 3.x to Lucene 4.0 won’t be as seamless as previous Lucene upgrades have been.  Lucene 4.0 will include a Lucene 3.x to Lucene 4.0 migration tool: LUCENE-2441.
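
As promised above, here is a minimal sketch of collector-based time limiting using TimeLimitingCollector (the successor to the HitCollector-based TimeLimitedCollector) against the Lucene 2.9/3.x-era API.  The index path, field, and 2-second budget are made up; TimeLimitedIndexReader from LUCENE-1720 goes further than this:

    import java.io.File;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TimeLimitingCollector;
    import org.apache.lucene.search.TopScoreDocCollector;
    import org.apache.lucene.store.FSDirectory;

    public class TimeLimitExample {
      public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File("/path/to/index")));
        Query query = new TermQuery(new Term("body", "lucene"));

        TopScoreDocCollector topDocs = TopScoreDocCollector.create(10, true);
        // Abort collection once the query has run for 2 seconds
        TimeLimitingCollector collector = new TimeLimitingCollector(topDocs, 2000);
        try {
          searcher.search(query, collector);
        } catch (TimeLimitingCollector.TimeExceededException e) {
          System.out.println("Query timed out; only partial results were collected");
        }
        System.out.println("Collected " + topDocs.getTotalHits() + " hits");
        searcher.close();
      }
    }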
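
To give a flavor of what Tom’s tool from LUCENE-2393 computes, here is a tiny sketch that walks a Lucene 3.x-era index and prints each term’s document frequency and total term frequency.  The index path is made up, and the real tool is more complete and faster:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class TermStats {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        TermEnum terms = reader.terms();
        TermDocs termDocs = reader.termDocs();
        while (terms.next()) {
          long totalTf = 0;
          termDocs.seek(terms.term());
          while (termDocs.next()) {
            totalTf += termDocs.freq(); // sum of within-document frequencies
          }
          // docFreq() = number of documents containing the term
          System.out.println(terms.term() + " docFreq=" + terms.docFreq() + " totalTf=" + totalTf);
        }
        termDocs.close();
        terms.close();
        reader.close();
      }
    }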
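
The LimitTokenCountAnalyzer mentioned above is just a wrapper, so using it is essentially a one-liner.  The analyzer choice, Version constant, and the 10,000-token limit below are arbitrary, and package locations may shift as the analysis-module consolidation proceeds, so check your build:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LimitTokenCountAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class LimitExample {
      public static void main(String[] args) {
        // Index at most the first 10,000 tokens per field,
        // like IndexWriter's old MaxFieldLength setting
        Analyzer analyzer =
            new LimitTokenCountAnalyzer(new StandardAnalyzer(Version.LUCENE_40), 10000);
        System.out.println("Wrapped analyzer: " + analyzer);
      }
    }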
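
Finally, since CachingWrapperFilter came up: the basic usage pattern looks like this.  Field names and the query are hypothetical, and we omit the new deletion-handling option from LUCENE-2468, which adds an extra constructor argument on trunk:

    import java.io.File;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class CachedFilterExample {
      public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File("/path/to/index")));

        // The wrapped filter's bit set is computed once per reader and reused across queries
        Filter byType = new CachingWrapperFilter(
            new QueryWrapperFilter(new TermQuery(new Term("type", "post"))));

        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), byType, 10);
        System.out.println("Hits: " + hits.totalHits);
        searcher.close();
      }
    }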

That’s it for this month.  Remember that you can also follow @sematext on Twitter.

HBase Digest, May 2010

Big news first:

  • HBase 0.20.4 is out!  This release includes critical bug fixes as well as some general and performance improvements.  HBase 0.20.4 EC2 AMIs are now available in all regions; the latest launch scripts can be found here.
  • HBase has become an Apache Top Level Project. Congratulations!

Good-to-know things shared by the community:

  • HBase got a code review board. Feel free to join!
  • The ACID guarantees HBase provides for each operation are stated here.
  • Writing a filter that compares values in different columns is explained in this thread.
  • It is OK to mix transactional IndexTable and regular HTables in the same cluster. One can access tables w/out the transactional semantics/overhead as normal, even when running a TransactionalRegionServer. More in this thread.
  • Gets and scans now never return partially updated rows (as of 0.20.4 release).
  • Try to avoid building code on top of lockRow/unlockRow, because this can lead to serious delays in a system’s operation and even deadlock. Thread…
  • Read about how HBase performs load-balancing in this thread.
  • Thinking about using HBase with an alternative (to HDFS) file system? Then this thread is a must-read for you.

Notable efforts:

  • HBase Indexing Library aids in building and querying indexes on top of HBase, in Google App Engine datastore-style. The library is complementary to the tableindexed contrib module of HBase.
  • HBasene is a scalable information retrieval engine, compatible with the Lucene library while using HBase as the store for the underlying TF-IDF representation.  This is much like Lucandra, which uses Lucene on top of Cassandra.  We will be covering HBasene in the near future here on Sematext Blog.

FAQ:

  1. Is there an easy way to remove/unset/clean a few columns in a column family for an HBase table?
    You can either delete an entire column family or delete all versions of a single family/qualifier pair. There is no ‘wild card’ deletion or other pattern matching – deleting a whole column family is the closest thing (see the sketch after this FAQ).
  2. How do I unsubscribe from the user mailing list?
    Send mail to user-unsubscribe@hbase.apache.org.
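
For the first question, here is roughly what targeted deletes look like with the HBase 0.20-era client API.  The table, family, and qualifier names are made up for illustration:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteColumnsExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");

        Delete delete = new Delete(Bytes.toBytes("row-1"));
        delete.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("email")); // all versions of one cell
        delete.deleteFamily(Bytes.toBytes("stats"));                         // entire family for this row

        table.delete(delete);
        System.out.println("Delete applied");
      }
    }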