Solr Digest, April 2010
April 27, 2010
Another month is almost over, so it is time for our regular monthly Solr Digest. This time we’ll focus on interesting JIRA issues, so let’s start:
- Issue SOLR-1860 aims to improve stopword list handling in Solr, building on the stopword lists recently added to all of Lucene’s language analyzers. Work hasn’t started yet (there are no patches to try), so we’ll need to be patient before we can actually use it.
- Ever had problems with HTTP authentication in a distributed Solr environment? Until now, it worked only when querying a single Solr server. JIRA issue SOLR-1861 addresses this by allowing credentials to be specified for each shard; when no credential info is given, it falls back to the default behavior (no credentials). A patch is already attached to the issue and can be used with Solr 1.4.
- If you have used Solr’s MoreLikeThisComponent, you may have noticed that its output gives no indication of why it recommended a given item. The patch in issue SOLR-860 deals with that and improves the MLT component by adding debug info, like this (copied from JIRA):
"rawMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"boostedMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"realMLTQuery":"+(features:2 features:0 features:lcd features:x features:3) -id:MA147LL/A"}},
This issue is marked to be included in Solr 3.1.
- If you ever got a requirement like "some users should be able to access these documents while being forbidden from accessing others", Solr wasn't able to help you much. Recently, document-level security has been the subject of two JIRA issues. In SOLR-1834 you can find a patch that is already running in a production environment, while another approach to the same problem (also with an attached patch) is presented in SOLR-1872 (the latter currently adds security only to select queries; delete is not supported yet).
- SolrCloud brings exciting new capabilities to Solr, some of them already mentioned in our Solr Digest posts (for instance, check Solr Digest January 2010). SolrCloud functionality is now being committed to trunk, and you can monitor the progress in SOLR-1873. This is big!
- When working with Solr, you have to explicitly configure it to lowercase indexed tokens and query strings so that uppercase versions of words match their lowercase versions (for instance, so that the query Sematext matches SEMATEXT, sematext and Sematext). However, there is an old JIRA issue, SOLR-219, designated to be fixed in Solr 1.5, which would make Solr smart enough to handle case-insensitive searches automatically.
- One common source of confusion for first-time Solr users is dismax and its relation to the default query operator defined in schema.xml. In reality, the default query operator has no effect on how dismax works. Also, with dismax you can't use the AND and OR operators directly, but you can achieve the same functionality with dismax's mm (minimum should match) parameter. Its default value is 100% (meaning that all clauses must match, which is equivalent to putting the AND operator between all clauses). If you want OR behavior, you would just set its value to 1 (meaning that one matching clause is enough). The confusion arises because, even when the default query operator in schema.xml is OR, dismax still behaves by default as if it were AND. Issue SOLR-1889 should deal with that by assigning the default mm value for dismax based on the default query operator from schema.xml, which will make Solr behave more consistently for new users.
- Another old JIRA issue, SOLR-571, got its first patch a few days ago. The patch allows autowarmCount values to be specified as a percentage of cache size (for instance, 50% would mean that only the top half of the cached queries is autowarmed) instead of as an absolute number.
- Solr 1.4 introduced the ClusteringComponent, which can cluster search results and documents. Through plugins, it allows any clustering engine to be implemented. One such engine, lsa4solr, was recently unveiled; it is based on Latent Semantic Analysis. This engine depends on the development version of Solr 1.3 and Clojure 1.2, so take a look if you're interested in clustering.
- And last, but not least, for all Solr enthusiasts, an interesting webinar is scheduled for April 29th: "Practical Search with Solr: Beyond just looking it up". You can find more about it here.
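To make the dismax item above more concrete, here is a minimal Python sketch of the mm defaulting idea described for SOLR-1889. It is only an illustration under stated assumptions, not code from the issue or from Solr: the helper names `default_mm` and `dismax_params` are hypothetical, and the mapping (AND gives 100%, OR gives 1) simply follows the values mentioned in this post. The request parameters `defType`, `q` and `mm` are regular Solr query parameters.

```python
from urllib.parse import urlencode


def default_mm(default_operator):
    """Hypothetical helper: pick an mm value from the default query operator.

    Assumption for illustration only: AND maps to "100%" (all clauses must
    match) and OR maps to "1" (one matching clause is enough).
    """
    op = default_operator.strip().upper()
    if op == "AND":
        return "100%"
    if op == "OR":
        return "1"
    raise ValueError("unknown default operator: %r" % default_operator)


def dismax_params(query, default_operator, mm=None):
    """Build dismax request parameters; an explicit mm always wins."""
    return {
        "defType": "dismax",
        "q": query,
        "mm": mm if mm is not None else default_mm(default_operator),
    }


if __name__ == "__main__":
    # Query string that could be appended to a Solr /select URL.
    print(urlencode(dismax_params("solr digest", "OR")))
```

Until something like this lands in Solr itself, setting mm explicitly on the request (or in the request handler defaults) is the way to get OR-like behavior from dismax.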
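The percentage-based autowarmCount from the SOLR-571 item above boils down to simple arithmetic, sketched below in Python. This is not the patch itself: the function name `resolve_autowarm_count` is hypothetical, and rounding down to a whole number of entries is an assumption made for the example.

```python
def resolve_autowarm_count(setting, cache_size):
    """Hypothetical helper: turn an autowarmCount setting into an entry count.

    A value such as "50%" is treated as a share of cache_size (rounded down,
    an assumption of this sketch); any other value is an absolute count.
    """
    text = str(setting).strip()
    if text.endswith("%"):
        percent = int(text[:-1])
        if not 0 <= percent <= 100:
            raise ValueError("percentage must be between 0 and 100")
        return cache_size * percent // 100
    return int(text)


if __name__ == "__main__":
    # Half of a 512-entry cache versus a fixed number of entries.
    print(resolve_autowarm_count("50%", 512))
    print(resolve_autowarm_count("128", 512))
```

The appeal of the percentage form is that the warmed portion scales with the cache, so resizing a cache does not require revisiting its autowarmCount.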
Remember, you can also follow us on Twitter: @sematext. Until next month!