Solr Digest, Spring-Summer 2011, Part 1

No, Solr Digests are not dead; we’ve just been crazily busy at Sematext (yes, we are hiring!). Since our last Solr Digest, not one but two new Solr releases have been made: 3.2 in June and 3.3 in July. Version 3.4 is imminent too – voting is already in progress, so you can expect a new release pretty soon. There were also a number of interesting developments on the trunk (future 3.x and 4.0 versions). Therefore, we will be publishing two Solr Digests this time. This first Digest covers general developments in the Solr world, while the sequel will focus on two features drawing a lot of attention: Solr Cloud and Near Real Time search.

Let’s get started with a short overview of what was announced in 3.2 and 3.3. First, 3.2 brought us:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned by field faceting or terms component
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations
  • Highlighting performance improvements
  • A test-framework jar for easy testing of Solr extensions
  • Bugfixes and improvements from Apache Lucene 3.2
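
To make the first two items more concrete, here is a minimal sketch of how such requests could be assembled. The host, port, update path, and the field names are assumptions for illustration only; the snippet just builds the URL, body, and filter string and does not contact a Solr server.

```python
import json
from urllib.parse import urlencode

# Hypothetical host, port, and handler path; adjust to your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json"


def build_json_update(doc, commit_within_ms=10000, overwrite=True):
    """Return (url, body) for a JSON update that passes commitWithin and
    overwrite as request parameters rather than inside the JSON body."""
    params = {
        "commitWithin": commit_within_ms,
        "overwrite": "true" if overwrite else "false",
    }
    url = SOLR_UPDATE_URL + "?" + urlencode(params)
    body = json.dumps({"add": {"doc": doc}})
    return url, body


def term_filter(field, value):
    """Build a filter query handled by the term query parser, e.g. from a
    value returned by field faceting. The field name is an example only."""
    return "{!term f=" + field + "}" + value


url, body = build_json_update({"id": "doc1", "title": "Solr Digest"})
print(url)
print(body)
print(term_filter("category", "electronics"))
```

The body would be sent as an HTTP POST with a JSON content type; that part is left out so the sketch stays runnable without a server.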

With 3.3 we got:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See Mike’s cool Lucene segment merging video
  • Important bugfixes, including extremely high RAM usage in spellchecking
  • Bugfixes and improvements from Apache Lucene 3.3
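
As a rough illustration of the grouping feature, the sketch below builds a select URL that asks Solr to group results by a field. The host, the query, and the field name manu_exact are assumptions, not something taken from the release notes; only the URL is constructed, no request is sent.

```python
from urllib.parse import urlencode

# Hypothetical host and handler path; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def grouped_query_url(query, group_field, docs_per_group=3):
    """Build a select URL that groups (collapses) results on one field."""
    params = [
        ("q", query),
        ("group", "true"),               # turn result grouping on
        ("group.field", group_field),    # field whose values define the groups
        ("group.limit", docs_per_group), # documents returned per group
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


grouped_url = grouped_query_url("ipod", "manu_exact")
print(grouped_url)
```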

Let’s now look at other interesting stuff. We’ll start with DataImportHandler and its bug fixes. As you’ll notice, there are quite a few of them (and we didn’t even list them all!) so we advise using all available patches.

Already committed features

  • A replication bug-fix – “replication reserves commit-point forever if using replicateAfter=startup”. SOLR-2469 brought the fix to version 3.2 and the future 4.0 (trunk). This problem caused an unnecessary (and huge) buildup in the number of index files on the slaves.
  • Another bug-fix for DataImportHandler – “DIH does not commit if only Deletes are processed”. When only the special commands $deleteDocById and/or $deleteDocByQuery were used and no documents were updated, DIH did not call commit. The fix is available in 3.4 and 4.0.
  • Also – “DataImportHandler multi-threaded option throws exception”. The problem occurred when the threads attribute was used. The fix is available in 3.4 and 4.0. A related fixed issue, “DIH multi threaded mode does not resolves attributes correctly”, is also available in 3.4 and 4.0.
  • The Join feature got committed to the trunk (future 4.0 version). It can also perform cross-core joins now, which can be very useful. However, this feature also initiated some heated discussions, which can be seen in SOLR-2272. The root cause was the fact that this feature was committed only to Solr while Lucene got none of it. Of course, it might get refactored and included in Lucene too in the future, but this discussion shows the divisions which still existed between the Solr and Lucene communities back then.
  • While we’re talking about the Join feature, it is worth mentioning the patch in SOLR-2604, which back-ports it to the 3.x version. Be careful though: it was created for version 3.2 more than two months ago, so a few more adjustments might be needed after applying it.
  • Function Queries got new if(), exists(), and(), or(), not(), xor() and def() functions. The change is committed to trunk, so you’ll be able to use them in 4.0.
  • As can be seen from the Solr 3.3 announcement, one of the longest-living Solr issues is finally closed for good :). SOLR-236 – Field Collapsing – along with SOLR-2524 finally brings field collapsing to 3_x and future 4.0 versions.
  • Since grouping/field collapsing was added to Solr, we should be able to use faceting in combination with it. Issue SOLR-2665 – Solr Post Group Faceting – brought exactly that to 3.4 and 4.0.
  • Ever wanted to have more control over what gets stored in the cache? SOLR-2429 will bring exactly that starting with the next Solr release – 3.4. It is simple to use: just add cache=false to your queries like this: fq={!frange l=10 u=100 cache=false}mul(popularity,price). Note that with this new functionality you can prevent either a filter or a query from being cached, while document caching still remains out of request-time control.
  • If you’re using JMX to observe the state of your Solr installation, you might have encountered a problem when reloading Solr cores – in past versions, JMX beans didn’t survive those reloads. A fix has been created and is available in future 3_x and trunk releases.
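
The cache=false example above can be generated programmatically. The following is only a sketch: the helper name frange_filter and the host are assumptions, and the code merely builds the request URL for the filter shown in the list without sending it.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical host and handler path; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def frange_filter(func, lower, upper, cache=True):
    """Build an frange filter query; cache=False adds the cache=false
    local parameter so that this filter is not stored in the filter cache."""
    local_params = f"!frange l={lower} u={upper}"
    if not cache:
        local_params += " cache=false"
    return "{" + local_params + "}" + func


fq = frange_filter("mul(popularity,price)", 10, 100, cache=False)
uncached_url = SOLR_SELECT_URL + "?" + urlencode({"q": "*:*", "fq": fq})
print(fq)
print(uncached_url)
```

Leaving cache at its default keeps the filter string free of the extra local parameter, so existing cached filters are unaffected.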

Interesting features in development

  • To achieve case-insensitive search with wildcard queries you could use a patch supplied under issue SOLR-2438. It has to be said that this isn’t committed to svn, and it is hard to say whether it ever will be, since there is a similar issue, SOLR-219, on which work started 4 years ago.
  • Multithreaded faceting might bring some performance improvements. An initial patch exists, but more work is needed and it still isn’t clear how big an improvement we could expect in real-world conditions. Still, it is worth keeping an eye on this issue.
  • We all know that Solr’s Spatial support has its limitations. One of them is that you cannot specify a bounding box which isn’t based on point distance, which effectively limits filtering to a circular shape. Under SOLR-2609 we might get support for exactly this.
  • For anyone interested in which direction Spatial support might evolve, we suggest checking Lucene Spatial Playground. It continues the great work done in SOLR-2155, which extended the initial GeoSpatial support in Solr by adding multivalued spatial fields. At some point, SOLR-2155 might get the goodness from LSP. Another thing to check would be the thread on Lucene Spatial Future.

Interesting new features

  • Support for Lucene’s Surround Parser was added to Solr in issue SOLR-2703. The patch is already committed to the trunk.
  • Solr will get the ability to use configuration like analyzer type="phrase". Lucene’s Query Parsers recently got a simpler way to use a different analyzer based on the query string. One example is the usage of double quotes: instead of their current meaning in the Lucene/Solr world (specifying a phrase to be searched for), one can decide that they should mean what they mean in Google’s search engine (find this exact wording). A patch for this exists and can be applied on the trunk (it depends on Lucene trunk).
  • SOLR-2593 aims to provide a new Solr core admin action – ‘split’ – for splitting an index. It would be used when a core gets too big, or in any other case where you find it necessary. Lucene already has a similar function.
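
For readers who have not used the Surround Parser, the sketch below shows roughly what a query routed to it could look like. It assumes the parser is selected with the {!surround} local parameter and that the prefix form of the distance operators (for example 3w for "within 3 positions, in order") is accepted. The helper name and the terms are illustrative, and the code only builds the query string.

```python
from urllib.parse import urlencode


def surround_query(first, second, distance, ordered=True):
    """Build a proximity query for the surround parser.

    Assumption: 'w' requires the terms in the given order and 'n' allows
    either order, with the number giving the maximum distance.
    """
    operator = f"{distance}{'w' if ordered else 'n'}"
    return f"{{!surround}}{operator}({first}, {second})"


q = surround_query("solr", "digest", 3)
print(q)
print(urlencode({"q": q}))
```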

Miscellaneous

  • Oracle released Java 7 about a month ago, but we advise against using it yet. JVM crashes and index corruption are issues likely to be encountered with it. For more information, visit this URL
  • As anticipated for some time, Java 5 support got axed from Lucene 4.0 (trunk). You can expect similar stuff for Solr too.
  • Solr’s build system has been reworked. Among other things, this changes the directory structure of the Solr project. For example, solr/src/ doesn’t exist any more and its old subdirs /java and /test are now in solr/core/. The changes are already applied to the trunk and to 3_x, which holds the next 3.4 version. For more details, see SOLR-2452.
  • A handy Solr architecture diagram can be found in this mailing list thread
  • Solr’s Admin UI is being refreshed with the work in JIRA issue SOLR-2399 (we already wrote about it) and its spin-off SOLR-2667. Some of this stuff is already committed (on the trunk), so you may want to inspect the changes. More details can be found in the wiki, where you can also get a sneak peek of the upcoming changes.

And that would be all for part one of the Solr Spring-Summer 2011 Digest edition from @sematext. Part two of the Spring-Summer Digest is coming in a few days – stay tuned!

3 Responses to Solr Digest, Spring-Summer 2011, Part 1

  1. Mike Schultz says:

    Does field collapsing work with an aggregator?
