Solr Digest, February-March 2011
April 5, 2011
We Sematexters have been very busy over the past few months, so we missed Solr’s February Digest. This one will therefore be a bit longer than usual. Let’s get started…
First, some major news: Solr 3.1 is officially released! The details of the announcement can be found here. We covered most of the new features in our digests already, so we’ll keep it short:
- Numeric range facets (similar to date faceting)
- New spatial search, including spatial filtering, boosting and sorting capabilities
- Example Velocity driven search UI at http://localhost:8983/solr/browse
- A new termvector-based highlighter
- Extended dismax (edismax) query parser, which addresses some missing features in the dismax query parser and adds some extensions
- Several more components now support distributed mode: TermsComponent, SpellCheckComponent
- A new Auto Suggest component
- Ability to sort by functions
- JSON document indexing
- CSV response format
- Apache UIMA integration for metadata extraction
- Leverages Lucene 3.1 and its inherent optimizations and bug fixes as well as new analysis capabilities
- Numerous improvements, bug fixes, and optimizations
You can start your download :).
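To give a feel for a few of these features, here is a minimal sketch that builds one request URL combining numeric range faceting, spatial filtering, sorting by a function and the CSV response format. The parameter names (`facet.range`, `{!geofilt}`, `sfield`, `pt`, `d`, `geodist()`, `wt=csv`) are the ones Solr 3.1 documents; the `/solr/select` path on the example port, the `price` and `store` field names and the coordinates are assumptions made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the example Jetty server on its default port.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query_url(base=SOLR_SELECT):
    """Build a Solr 3.1 query URL that exercises several new features."""
    params = [
        ("q", "*:*"),
        # Numeric range faceting on a hypothetical "price" field:
        # buckets of 100 between 0 and 1000.
        ("facet", "true"),
        ("facet.range", "price"),
        ("facet.range.start", "0"),
        ("facet.range.end", "1000"),
        ("facet.range.gap", "100"),
        # Spatial filtering: documents within 5 km of a point, using a
        # hypothetical "store" location field.
        ("fq", "{!geofilt}"),
        ("sfield", "store"),
        ("pt", "45.15,-93.85"),
        ("d", "5"),
        # Sorting by a function: nearest documents first.
        ("sort", "geodist() asc"),
        # CSV response format.
        ("wt", "csv"),
    ]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url())
```

The URL only returns something useful against an index whose schema actually contains fields like these, so treat it as a starting point rather than a recipe.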
Already committed features
- post.jar got improved: the JIRA issue “improve post.jar to handle non UTF-8 files” removed some of its very old limitations
- the Jetty server included in the Solr distribution didn’t support UTF-8. This is now solved, and the fresh 3.1 release already contains the fix
Interesting features in development
- as part of SolrCloud, distributed indexing is being implemented in JIRA issue SOLR-2358. You can already see the work in progress in the initial patch. You can also check SOLR-2341, which deals with shard distribution policies that will be available in Solr 4.0
- if you ever wanted to add custom fields (ones that don’t exist in the index) to Solr responses, you couldn’t do that from Solr components. There were other ways to achieve this (for instance, customizing the response writer class), but it looks like we’ll get this ability inside components, too, which would be much more natural. The issue “Allow components to add fields to outgoing documents” provides the umbrella for this new functionality. Although it is already closed, there are a few sub-issues in which the actual pieces of logic will be implemented.
- if you have problems with case-sensitive matching in wildcard queries, you might take a look at the patch provided in the JIRA issue “Case Insensitive Search for Wildcard Queries”
- although Solr got its first solid spatial implementation in version 3.1, many people have already run into its limitations. One of them is surely the case where documents have multivalued spatial fields. We already wrote about SOLR-2155 in our December digest, but work under that issue hasn’t stopped and keeps evolving. It is likely to become part of the standard Solr distribution, and Lucene could incorporate it, too. If you need spatial search, you may want to watch this issue.
Interesting new features
- one common problem when using Solr’s default spellchecker or auto-suggest is filtering suggestions based on what a given user is allowed to see (for instance, depending on the region in which the user resides). The JIRA issue “Doc Filters for Auto-suggest, spell checking, terms component, etc.” proposes a feature which would help here. No work has been done there yet, though we believe we’ll get to see some progress in the future. While we are at it, in case you need such a feature in Auto-suggest now, you might take a look at our in-house Search Auto-Complete solution, which you can see in action on search-lucene.com and search-hadoop.com.
- just like there are default components for SearchHandlers (which are used by default for every new search handler, unless overridden), update processors will get a similar feature. The JIRA issue “Let some UpdateProcessors be default without explicitly configuring them” will make sure that some important update processors are available by default in your UpdateRequestProcessorChain.
- one great new feature could be added to Solr: the ability not to cache a filter. JIRA issue SOLR-2429 will deal with this. Many Solr users will be happy to tune their cache performance once this feature is available.
- some interesting thoughts on the spellchecker can be found in the ML thread “My spellchecker experiment”, and much more on that topic in the related blog
- should you use ASCIIFoldingFilter or MappingCharFilter when dealing with accents? The interesting discussion in the thread “Should ASCIIFoldingFilter be deprecated?” could help you decide which one is right for you
- an interesting idea for Solr’s admin UI can be found in this ML thread. The community’s reception was very good, so we also got the JIRA issue “Solr Admin Interface, reworked” as the home for this new work.
- anyone using Solr’s UIMA (Unstructured Information Management Architecture) contrib might be interested to know that its wiki page got a major improvement – more docs to read!
- we might be a bit late on this, but there is still some time left: Google’s Summer of Code applications can be submitted until April 8. Check this ML thread for some details. And don’t forget that Sematext is sponsoring interns, too!
- new Solr/Lucene users should take a look at the Refcard provided by Erik Hatcher in ML thread [infomercial] Lucene Refcard at DZone
- some deep thoughts on Solr/Lucene’s release process by some of the key people can be found here: “Brainstorming on Improving the Release Process”. Related to that is the JIRA issue “Define Test Plan for 4.0”, which will contain the test plan for the 4.0 release. Also, check the TestPlans wiki page that’s in the making.
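Going back to the uncached-filter item (SOLR-2429) in the list above, this is roughly what it could look like from the client side. It is only a sketch under a loud assumption: it uses the `cache=false` local parameter associated with that issue, which is not part of Solr 3.1. The helper names and the `timestamp` field are hypothetical.

```python
from urllib.parse import urlencode


def uncached(filter_query):
    """Mark a plain filter query as "do not cache".

    Assumption: the `cache=false` local parameter from SOLR-2429.
    Filters that already carry local params are rejected, because merging
    into an existing {!...} prefix is out of scope for this sketch.
    """
    if filter_query.startswith("{!"):
        raise ValueError("filter already has local params: " + filter_query)
    return "{!cache=false}" + filter_query


def build_params(q, cached_filters=(), uncached_filters=()):
    """Return an encoded query string with both kinds of filters."""
    params = [("q", q)]
    params += [("fq", fq) for fq in cached_filters]
    params += [("fq", uncached(fq)) for fq in uncached_filters]
    return urlencode(params)


if __name__ == "__main__":
    # A reusable filter stays cached; a one-off range filter
    # (hypothetical "timestamp" field) is marked as uncached.
    print(build_params("solr",
                       cached_filters=["inStock:true"],
                       uncached_filters=["timestamp:[NOW-1HOUR TO NOW]"]))
```

The idea is that filters which are unlikely to be reused (very specific ranges, per-user restrictions) skip the filter cache, leaving room for the ones that are shared across many queries.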
Although there were some other interesting topics, we have to stop somewhere. Until next month, you’ll find us on Twitter.