Lucene and Solr: 2010 in Review
January 6, 2011
Lucene has been around for 10+ years and Solr for 4+ years. It’s amazing that, as mature as these tools are, they are still seeing very rapid development and improvement. We are not talking about polishing the APIs or minor tweaks here and there, but serious development in the heart of both of these tools. When you know this, it’s even more amazing to hear commercial search vendors spread FUD about tools like Lucene or Solr not being ready for serious business, large scale, high performance, etc. Those 5000-6000 daily downloads of Lucene/Solr/Nutch/etc. (see the graph, scroll down on the page) must be from people who simply don’t know better than to download this free stuff…
But let’s not go down that path. Below are some of the Lucene & Solr highlights from 2010.
The Lucene and Solr code bases were merged early in 2010. The development mailing lists merged, but the user lists remained separate, as did the release artifacts. The code repository went through a major reorganization, resulting in the addition of the “modules” section, which currently hosts only the analysis package. This package contains numerous analyzers, tokenizers, and stemmers: over 400 Java classes so far. Why is this good? Because tools like our Key Phrase Extractor can now use just the jar from the analysis package instead of the whole Lucene jar if all they really want is access to Lucene’s tokenizers, for example. In short, things are working out well after the merge.
In 2010 Lucene saw three 3.0.x releases (3.0.1, 3.0.2, and 3.0.3), as well as 2.9.2, 2.9.3, and 2.9.4. Solr 1.4.1 was released, too. The Subversion repository got some new branches, which essentially means parallel development at an increased pace, more experimentation, more freedom to change the code, etc. Ultimately it’s the users of Lucene and Solr who reap major benefits from this. In 2011 we’ll most likely see Lucene 4.0, as well as a SolrCloud version of Solr, both of which will bring speed improvements, a lower memory footprint, flexible indexing, and a bunch of other good stuff that we’ll write about in our Lucene Digests and Solr Digests on this blog in 2011.
Top Level Projects, Incubator, New Sub-Project
Three former Lucene sub-projects became Top Level Projects: Mahout, Nutch, and Tika. Mahout 0.3 and 0.4 were released. Nutch 1.1 and 1.2 were released and work is under way to get Nutch 2.0 out in 2011. This new Nutch 2.0 includes some major improvements, such as great use of HBase. After some semi-stagnation, it feels like Nutch is getting some more love from contributors and developers. Tika is developing and releasing rapidly, with releases 0.6, 0.7, and 0.8 happening in 2010 and 0.9 already being mentioned on the mailing list.
The Lucene ecosystem got a new sub-project in 2010: ManifoldCF (previously known as Lucene Connectors Framework). The code was donated by MetaCarta and it includes connectors for various enterprise data sources, such as Microsoft SharePoint or EMC Documentum, as well as the file system, Web, or RSS and Atom feeds. Importantly, ManifoldCF includes a Security Model and has the ability to index documents with Solr.
At the same time, Lucy (the Lucene C port) went to the Incubator. Lucene.Net is on its way to the Incubator, too. In short, both projects need to work on building a more active development community.
Lucene in Action 2nd edition was published by Manning and a book on Solr was published by Packt. Mahout in Action is nearly done, and Tika in Action is in the works. A book on Nutch is also getting started.
We built the Lucene/Solr-powered search-lucene.com and its sister site search-hadoop.com, where one can search all mailing list archives, JIRA issues, source code, javadoc, wiki, and web site for all (sub-) projects at once, facet on sub-projects, data sources, and authors, get short links, handy for sharing, for any mailing list message, view mailing list messages in threaded or non-threaded view, see search terms highlighted not only on the search results page, but also in the mailing list messages themselves (click on that “book on Nutch” link above for an example), etc.
If you’d like to keep up with Lucene and Solr news in 2011, as well as keep an eye on Nutch, Mahout, and Tika, you can follow @lucene on Twitter – a low volume source of key developments in these projects.