Solr Digest, June 2010

We have already written about this month’s news in the Solr world here and here, so you already know that Solr 1.4.1 was released, based on Lucene 2.9.3. Still, one thread from the mailing lists gives some more information about the svn branches and how they relate to Solr versions.

Real-time indexing is again one of the hot topics. We already mentioned the Zoie plugin in the Solr March Digest, so this time we’ll point to an interesting discussion on the mailing lists. If you have followed this topic, you know that the Zoie Solr Plugin is a great plugin for Solr, but it still has some limitations. For instance, the master-slave architecture (which is the base of almost all big Solr deployments) isn’t well suited for Zoie. Version 2.9 of Lucene brought an interesting addition: Near Realtime Search (NRT) capabilities. As you probably already know, the Solr 1.4 release already ran on Lucene 2.9 (2.9.1, to be precise), but support for NRT wasn’t implemented. Solr’s next release might have it, since there is a JIRA issue dealing with NRT integration, but don’t hold your breath.

We’ll also mention some new features in Solr:

  • Added relevancy function queries – JIRA issue SOLR-1932 adds function queries for relevancy factors such as tf, idf, etc. This issue is already fixed and committed to trunk.
  • Improved Solr response indentation, added with issue SOLR-1933. Solr previously supported only 7 levels of indentation, and this issue removes that limit. The downside is a small increase in response size, since spaces are now used instead of tabs. The fix is already committed, both to trunk and to the 3_x branch.
  • Ever wanted to see index files without logging into your servers? This patch will make them visible from Solr admin pages or by using LukeRequestHandlers.
  • Another related issue also got a patch and is already committed to trunk: SOLR-1946, misc enhancements to SystemInfoHandler. Here is a brief list of additions: include CWD in directory info, include a raw-bytes version of memory stats, include a list of all system properties.
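
For readers who have not worked with the relevancy factors mentioned in the first item above, here is a minimal sketch of what tf and idf compute. It assumes the formulas of Lucene's default similarity from that era (square root of the term frequency, and a log-scaled inverse document frequency). The function names below are our own illustration and are not the Solr function query syntax.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: assumed to be sqrt(freq), as in Lucene's default similarity."""
    return math.sqrt(freq)


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: assumed to be ln(num_docs / (doc_freq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def tf_idf(freq: float, doc_freq: int, num_docs: int) -> float:
    """Product of the two factors for one term in one document (illustrative only)."""
    return tf(freq) * idf(doc_freq, num_docs)


if __name__ == "__main__":
    # Hypothetical numbers: a term occurring 4 times in a document,
    # present in 9 of the 1000 documents in the index.
    print(tf_idf(freq=4, doc_freq=9, num_docs=1000))
```

Rare terms get a larger idf than common ones, which is why exposing these factors to function queries is handy for custom ranking experiments.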

We’ll end with a short overview of interesting issues which are still in development:

  • Use Lucene’s Field Cache To Retrieve Stored Fields From Memory – the issue SOLR-1961 isn’t finished yet, although there is a patch. When it is finished, it might give a new boost to the performance of your Solr server, thanks to developers from Cisco.
  • If you want to track the performance improvements prepared for the 4.0 release, you can just follow JIRA issue SOLR-1965. Some items are already listed there, so you can check what is in store for future versions.
  • For anyone using PHP to talk to Solr, there is a new PHP Response Writer – currently, it is available as a Jar that has to be added to your Solr’s classpath. For more details check JIRA issue comments.
  • Field collapsing is one of the longest-standing unresolved issues in the Solr world. SOLR-236 (many people probably easily recognize this JIRA issue number :) ) was created more than 3 years ago and over time it has grown into a “monster”: a huge number of comments, patches, problems, parameters… you name it. Integrating it with your Solr version was never fun (we tried it!). New hope appeared on the field collapsing horizon with the opening of SOLR-1682 (that’s a new JIRA issue for you to commit to your memory!). Some work had already been done there in the past, but now Yonik has decided to dedicate some of his time to this issue, which means we might soon have a non-monster implementation that will be committed to Solr.

That’s all for this month. As you can see, the Solr May Digest made no mention of the new 1.4.1 release, but it happened, almost unexpectedly. So stay tuned (and follow @sematext) – you never know if something unexpected might happen this month too…

Solr Digest, May 2010

May’s Solr Digest brings another review of interesting Solr developments and a short look at the current state of Solr’s branches and versions. Confused about which versions to use and which to avoid? Don’t worry, many people are. We’ll try to clear it up in this Digest.

  • In April’s edition of Solr Digest, we mentioned two JIRA issues dealing with document level security (SOLR-1834 and SOLR-1872). Now another issue (SOLR-1895) deals with LCF security and aims to work with SOLR-1834 to provide a complete solution.
  • One ancient JIRA issue, SOLR-397, finally got resolved and its code is now committed. Solr now has the ability to control how date range end points in facets are dealt with. You can use this functionality by specifying the facet.date.include HTTP request parameter, which can have values “all”, “lower”, “upper”, “edge”, “between”, “before”, or “after”. More details about this can be found in SOLR-397.
  • Another issue related to date ranges was created. This one aims to add a Date Range QParser to Solr, which would simplify the definition of date range filters resulting from date faceting. This issue is still in its infancy and has no patches attached to it as of this writing, but it looks useful and promising. When we add date faceting to Search-Lucene.com and Search-Hadoop.com we’ll be looking to use this date range query parser.
  • Some errors in Solr will be much easier to trace after JIRA issue SOLR-1898 gets resolved. Everyone using Solr has probably encountered exceptions like: java.lang.NumberFormatException: For input string: “0.0”. The message itself lacks some crucial details, such as information about the document and field that triggered the exception. SOLR-1898 will solve that problem, and we are looking forward to this patch!
  • Have you recently been in the situation where you were unsure about which branch or version of Solr you should use on your projects? If yes, you’re certainly not alone! After the recent merge of Solr and Lucene (covered in Solr March Digest and Lucene March Digest), things became confusing, especially for casual observers of Lucene and Solr. Here are some facts about the current state of Solr:
  1. the latest stable release of Solr is still 1.4
  2. version 1.4 was released more than 6 months ago, so many new features, patches and bug fixes aren’t included in it
  3. however, it was a stable release, so if you’re going to production very soon, one low-risk choice would be to use 1.4 and apply on top of it the patches you find necessary for your deployment
  4. current development is ongoing on trunk (considered unstable and slated to become the future Solr 4.0) and on the branch named branch_3x. This branch is the most likely candidate for the next version of Solr (named 3.1) and is considered a (stable) development version which could be usable, though you have to be careful with your testing, as always.
  5. another choice could be some old 1.5 nightly build, but 1.5 is abandoned and, in our opinion, it makes more sense to use nightly builds from branch_3x

Here are a couple of threads where you can get more information:

  1. Lucene 3.x branch created
  2. Which Solr to use?
  • To show one of the dangers of unstable versions, we’ll immediately point to one recently opened JIRA issue related to a “file descriptor leak” while indexing.
  • Although at Sematext we’ve been using Java 6 for a very long time both for our products and with our clients, some people might still be stuck with Java 5. It appears that they will never be able to use Solr 4.0 once it is released, since Solr trunk version now requires Java 6 to compile.
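
To make the facet.date.include parameter from SOLR-397 (mentioned above) a bit more concrete, here is a minimal sketch that only builds a date faceting request URL and does not send it. The base URL, the field name "timestamp" and the helper function are assumptions made for illustration; the other facet.date.* parameters are the usual companions of date faceting in Solr.

```python
from urllib.parse import urlencode


def build_date_facet_url(base_url, field, start, end, gap, include):
    """Build a Solr select URL that facets on a date field.

    base_url and field are hypothetical; `include` is the value passed
    to facet.date.include (for example "lower" or "all").
    """
    params = [
        ("q", "*:*"),            # match all documents
        ("rows", "0"),           # we only care about the facet counts
        ("facet", "true"),
        ("facet.date", field),
        ("facet.date.start", start),
        ("facet.date.end", end),
        ("facet.date.gap", gap),
        ("facet.date.include", include),
    ]
    # urlencode escapes characters such as "/" and "+" used in date math
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_date_facet_url(
        "http://localhost:8983/solr/select",  # assumed local Solr instance
        "timestamp",                          # hypothetical date field
        "NOW/DAY-7DAYS",
        "NOW/DAY+1DAY",
        "+1DAY",
        "lower",
    )
    print(url)
```

With "lower", each bucket would count documents that fall exactly on its lower boundary, so a document sitting on a boundary is not counted in two adjacent buckets.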

We’ll finish this month’s Solr Digest with two new Solr features:

  • For anyone wanting to use the JSON format when sending documents to Solr, the JSON update handler is now committed to trunk.
  • On the other hand, if you need CSV as an output format from Solr, you might benefit from the work on the new CSV Response Writer. Currently, there are no patches for it, but you can watch the issue and see when one is added.
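
As a rough idea of what sending a document as JSON could look like, here is a minimal sketch that prepares (but does not send) an update request. The handler path /update/json, the "add"/"doc" message shape and the field names are assumptions based on how the JSON update handler is commonly described, so check the JIRA issue for the exact format supported by your build.

```python
import json


def build_json_update(solr_url, doc, commit=True):
    """Prepare the URL, headers and body for a JSON update request.

    solr_url is a hypothetical Solr base URL; the "/update/json" path and
    the {"add": {"doc": ...}} structure are assumptions, not verified
    against a particular Solr build.
    """
    message = {"add": {"doc": doc}}
    if commit:
        message["commit"] = {}  # ask Solr to commit after the add
    url = solr_url.rstrip("/") + "/update/json"
    headers = {"Content-Type": "application/json"}
    body = json.dumps(message).encode("utf-8")
    return url, headers, body


if __name__ == "__main__":
    url, headers, body = build_json_update(
        "http://localhost:8983/solr",                # assumed local instance
        {"id": "doc-1", "title": "Solr Digest"},     # hypothetical fields
    )
    print(url)
    print(body.decode("utf-8"))
```

The returned triple can be handed to any HTTP client, for instance urllib.request.Request(url, data=body, headers=headers).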

Thanks for reading another Solr Digest!  Help us spread the word, please Re-Tweet it, and follow @sematext on Twitter.
