Solr Digest, July 2010
August 5, 2010 Leave a comment
As usual, July is one of the slower months in Solr world, however, we managed to find a few interesting topics for our readers.
- Interesting feature might be added with SOLR-1979 – Create LanguageIdentifierUpdateProcessor. It would provide ability to differently handle the text in different languages (think about stemming in analysis, for instance) and to do it automatically. This issue was just created, so the work on it and any usable patches are coming some time in the future. However, if you need something working now, Sematext has a few products for similar multilingual functionality, for instance, Multilingual Indexer or its cousin Language Identifier.
- Another interesting feature might come with SOLR-1980 – Implement boundary match support. This will enable one to specify that query should match only at the start or at the end of the field (or be exact match), not somewhere in the middle, which could provide more relevant search results in some specific cases. This issue is also in its infancy and has no patches yet, so we’ll have to wait and see how it progresses.
- Ever wanted Solr to store as the value of some field something other than the raw input value (remember, when you search Solr, you search on analyzed and indexed values; when you fetch the content of some field, you get the raw input value added to that field, not its analyzed version)? Patch for that already exists in one rather fresh JIRA issue – SOLR-1997 – Store internal value instead of input one.
- Getting ready to start using Solr, but are unsure about which version you should use? Don’t worry, confusion about Solr’s version started this spring (see Solr May 2010 Digest), but things stabilized lately. The latest release is the fairly recent 1.4.1, which is basically 1.4 version with many bugfixes. The next release version is 3.1 which can be found on branch_3x branch. You can find its nightly build versions here. The trunk is still used for “unstable” development and the future 4.0 version. To get more information, check these recent threads on the Solr mailing list: here and here.
- Many will probably agree that Solr’s SpellCheckComponent isn’t very useful in real-life applications. One of the main problems is that it poorly handles multi-word queries, where it creates its suggestion as a collated version of best suggestion for each word of the query, so you often get suggestions which have 0 hits. Also, it doesn’t return important information about suggested query, like how many hits such query would generate and what results it would give. Some of these issues could be fixed some day with SOLR-2010 – Improvements to SpellCheckComponent Collate functionality. The first version of the patch is already provided. However, if you’d like to use such functionality in your Solr production today, you might consider one much more sophisticated and production-ready component developed by Sematext – DYM ReSearcher – you can see DYM ReSearcher in action on Search-Lucene.com, for example.
- One minor functionality is added to QueryElevationComponent – Add option to return only the specified results. It was added with JIRA issue SOLR-1966 and is already committed to 3.x and trunk.
We hope that this was enough to satisfy your Solr appetite. Hopefully, we’ll dig more interesting topics for you in August. Until then you can keep up with us via @sematext on Twitter.