Lucene and Solr: 2010 in Review

Lucene has been around for 10+ years and Solr for 4+ years.  It’s amazing that, mature as these tools are, there is still very rapid development and improvement going on.  We are not talking about polishing of the APIs or minor tweaks here and there, but serious development in the heart of both of these tools.  Knowing this, it’s even more amazing to hear commercial search vendors spread FUD about tools like Lucene or Solr not being ready for serious business, large scale, high performance, etc.  Those 5,000-6,000 daily downloads of Lucene/Solr/Nutch/etc. (see the graph; scroll down on the page) must be from people who simply don’t know better than to download this free stuff…

But let’s not go down that path.  Below are some of the Lucene & Solr highlights from 2010.

The Merge

The Lucene and Solr code bases were merged early in 2010.   Development mailing lists merged, but user lists remained separate, as did release artifacts.  The code repository went through a major reorganization that added a “modules” section, which currently hosts only the analysis package (numerous analyzers, tokenizers, and stemmers – over 400 Java classes so far).  Why is this good?  Because tools like our Key Phrase Extractor can now use just the jar from the analysis package instead of the whole Lucene jar when all they really need is access to Lucene’s tokenizers, for example.  In short, things are working out well after the merge.
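
To make that concrete, here is a minimal sketch of driving a Lucene analyzer standalone – which is all a tool like a key phrase extractor really needs – assuming Lucene 3.1+/trunk APIs (the class name and sample text are ours, purely for illustration):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeExample {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    TokenStream ts = analyzer.tokenStream("text",
        new StringReader("Lucene and Solr merged their development in 2010"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // print each token the analyzer produces
    }
    ts.close();
  }
}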

Code, Releases…

In 2010 Lucene saw releases 3.0.1, 3.0.2, and 3.0.3, as well as 2.9.2, 2.9.3, and 2.9.4.  Solr 1.4.1 was released, too.  The Subversion repository got some new branches, which essentially means parallel development at an increased pace, more experimentation, more freedom to change the code, etc.  Ultimately it’s the users of Lucene and Solr who reap the major benefits from this.  In 2011 we’ll most likely see Lucene 4.0, as well as the SolrCloud version of Solr, both of which will bring speed improvements, a lower memory footprint, flexible indexing, and a bunch of other good stuff that we’ll write about in our Lucene Digests and Solr Digests on this blog in 2011.

Top Level Projects, Incubator, New Sub-Project

Three former Lucene sub-projects became Top Level Projects: Mahout, Nutch, and Tika.  Mahout 0.3 and 0.4 were released.  Nutch 1.1 and 1.2 were released, and work is under way to get Nutch 2.0 out in 2011.  Nutch 2.0 includes some major improvements, such as the use of HBase for storage.  After some semi-stagnation, it feels like Nutch is getting some more love from contributors and developers.  Tika is developing and releasing rapidly, with 0.6, 0.7, and 0.8 released in 2010 and 0.9 already being mentioned on the mailing list.

The Lucene ecosystem got a new sub-project in 2010: ManifoldCF (previously known as Lucene Connectors Framework). The code was donated by MetaCarta and includes connectors for various enterprise data sources, such as Microsoft SharePoint and EMC Documentum, as well as the file system, the Web, and RSS and Atom feeds.  Importantly, ManifoldCF includes a security model and can index documents with Solr.

At the same time, Lucy (the Lucene C port) went to the Incubator.  Lucene.Net is on its way to the Incubator, too.  In short, both projects need to work on building a more active development community.

Conferences

Lucene Eurocon, held last May in Praha, was the first Lucene-focused conference; it was followed by Lucene Revolution in Boston in October, where we presented how we built search-lucene.com and search-hadoop.com.

Books

Lucene in Action 2nd edition was published by Manning and a book on Solr was published by Packt.  Mahout in Action is nearly done, and Tika in Action is in the works.  A book on Nutch is also getting started.

Search-Lucene.com

We built the Lucene/Solr-powered search-lucene.com and its sister site search-hadoop.com, where one can search all mailing list archives, JIRA issues, source code, javadoc, wiki, and web site content for all (sub-)projects at once; facet on sub-projects, data sources, and authors; get short links to any mailing list message, handy for sharing; view mailing list messages in threaded or non-threaded form; see search terms highlighted not only on the search results page, but also in the mailing list messages themselves (click on that “book on Nutch” link above for an example); etc.

If you’d like to keep up with Lucene and Solr news in 2011, as well as keep an eye on Nutch, Mahout, and Tika, you can follow @lucene on Twitter – a low volume source of key developments in these projects.

Solr Digest, September 2010

It is a busy time of year here at Sematext – we have 3 different presentations to prepare for 3 different conferences (2 down, 1 more to go!), so we’re a bit late with our digests. Nevertheless, we managed to compile a list of interesting topics in the Solr world:

Already committed functionality

  • Solr was upgraded to use Tika 0.7 (SOLR-1819) – the fix was applied to the 1.4.2, 3.1, and 4.0 versions.  Of course, Tika 0.8 is going to happen in the not too distant future.
  • If you’re still using the old rsync-based replication and need to throttle the transfer rate, have a look at the patch contributed in JIRA issue SOLR-2099. Unfortunately, if you’re using the Java-based replication in 1.4, there is currently no way to throttle replication.
  • If you are using the new spatial capabilities in Solr, you might have noticed some incorrect calculations. One of them – Spatial filter is not accurate – is fixed on the 3.1 and 4.0 branches.
  • Another minor but useful addition – function queries can now be defined in terms of other request parameters. Check the JIRA issue “full parameter dereferencing for function queries”. It is already implemented in 3.1 and 4.0 and is ready to be used. Here is a short example from JIRA (check how the add function is defined and note the v1 and v2 request parameters):

http://localhost:8983/solr/select?defType=func&fl=id,score&q=add($v1,$v2)&v1=mul(2,3)&v2=10

Can we say, Solr Calculator, eh?

Interesting functionalities in development for some time

  • Ever wanted to add some custom fields to a response, even though they were not stored in your Solr index? You could always create a custom response writer that adds those fields (although it would probably be a “dirty” copy of an existing Solr response writer). However, we all know that doesn’t sound like the right way to do it.  One JIRA issue might deliver a proper way some day - Allow components to add fields to outgoing documents. We say “some day“ since this functionality has been in development for quite some time now and, although it has some patches (currently non-functional, it seems), it is probably not very near completion.  But it will be handy to have once it’s done.

Interesting new functionalities

  • The Highlighter could get one frequently requested improvement – Highlighter fragment/formatter for returning just the matching terms. We believe this will be a useful addition, although we don’t expect it very soon.
  • One potentially useful feature for all of you who use HDFS – DIH should be able to read data directly from HDFS for indexing.  The issue already contains some working code, although it is an open question whether the fix will become part of the standard Solr distribution.  Still, if you’re using Solr 1.4.1 and have data in HDFS that you want to index with Solr, have a look at this contribution (there is a small sketch of reading from HDFS after this list).
  • Another improvement related to replication is in SOLR-2117 – Allow slaves to replicate at different times. This should be useful to anyone who has long (and therefore heavy) warmup periods on their slaves after replication. This way, you can have your slaves replicate at different times and take the replicating slave offline during replication (to avoid degradation of response times). Be careful though, there is a downside: for some (limited) time your slaves will serve different data. A patch is available for the 4.0 version.
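
For reference, here is a minimal sketch of reading a file directly from HDFS with Hadoop’s FileSystem API – essentially what such a DIH data source would have to do under the hood. The namenode address and file path below are made up purely for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // hypothetical namenode address, for illustration only
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/to-index/docs.csv"))));
    String line;
    while ((line = reader.readLine()) != null) {
      // each line would be turned into a Solr document by the DIH data source
      System.out.println(line);
    }
    reader.close();
  }
}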

Miscellaneous

So, we had a little bit of everything from Solr this month. Until late October (or early November), when the next Solr Digest arrives, stay tuned to @sematext, where we tweet other interesting stuff on a wider set of topics from time to time.

Nutch Digest, April 2010

In the first part of this Nutch Digest we’ll go through new and useful features of the upcoming Nutch 1.1 release, while in the second part we’ll focus on developments and plans for the next big Nutch milestone, Nutch 2.0. But let’s start with a few informational items.

  • Nutch has been approved by the ASF board to become a Top Level Project (TLP) in the Apache Software Foundation.  The change of Nutch mailing lists, URLs, etc. will start soon.

Nutch 1.1 will be officially released any day now, so here is a walk-through of the Nutch 1.1 release features:

  • Nutch release 1.1 uses Tika 0.7 for parsing and MimeType detection
  • Hadoop 0.20.2 is used for job distribution (Map/Reduce) and distributed file system (HDFS)
  • On the indexing and search side, Nutch 1.1 uses either Lucene 3.0.1 with its own search application or Solr 1.4
  • Some of the new features included in release 1.1 were discussed in the previous Nutch Digest. For example, the alternative generator, which can generate several segments in one pass over the crawlDB, is included in release 1.1. We used a flavour of this patch in our most recent Nutch engagement, which involved a super-duper vertical crawl.  Also included in Nutch 1.1 is an improvement to SolrIndexer, which now commits only once, after all reducers have finished.
  • Some new and very useful features were not mentioned before. For example, Fetcher2 (now renamed to just Fetcher) was changed to implement Hadoop’s Tool interface. With this change it is possible to override parameters from configuration files, like nutch-site.xml or hadoop-site.xml, on the command line (see the sketch after this list).
  • If you’ve done some focused or vertical crawling you probably know that one or a few unresponsive hosts can slow down the entire fetch, so one very useful feature added to Nutch 1.1 is the ability to skip queues (which roughly translate to hosts) for URLs getting repeated exceptions.  We made good use of that here at Sematext in the Nutch project we just completed in April 2010.
  • Another improvement included in the 1.1 release, related to Nutch-Solr integration, comes in the form of an improved Solr schema that allows field mapping from Nutch to the Solr index.
  • One useful addition to Nutch’s injector is new functionality that allows the user to inject metadata into the CrawlDB. Sometimes you need additional data related to each URL to be stored. Such external knowledge can later be used (e.g. indexed) by a custom plug-in. If we can all agree that storing arbitrary data in the CrawlDB (with the URL as the primary key) can be very useful, then migration to database-oriented storage (like HBase) is only a logical step.  This makes a good segue to the second part of this Digest…
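
As promised above, here is a minimal sketch of what implementing Hadoop’s Tool interface looks like (the class name and printed property are ours, purely for illustration). Once a job follows this pattern, any configuration property can be overridden on the command line with -D, e.g. something like -Dfetcher.server.delay=2.0:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative skeleton of a Hadoop Tool; Nutch's Fetcher follows the same pattern.
public class MyFetchTool extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // any -Dname=value passed on the command line has already been applied to conf here
    float delay = conf.getFloat("fetcher.server.delay", 5.0f);
    System.out.println("fetcher.server.delay = " + delay);
    // ... set up and run the MapReduce job using conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyFetchTool(), args));
  }
}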

In the second half of this Digest we’ll focus on the future of Nutch, starting with Nutch 2.0.  Plans and ideas for the next Nutch release can be found on the mailing list under “Nutch 2.0 roadmap” and on the official wiki page.

Nutch is slowly replacing some of its home-grown functionality with best-of-breed products – it uses Tika for parsing, Solr for indexing/searching, and HBase for storing various types of data.  The migration to Tika is already included in the Nutch 1.1 release, and exclusive use of Solr as the (enterprise) search engine makes sense – for months we have been telling clients and friends we predict Nutch will deprecate its own Lucene-based search web application in favour of Solr, and that time has finally come.  Solr offers much more functionality, configurability, performance, and ease of integration than Nutch’s simple search web application.  We are happy Solr users ourselves – we use it to power search-lucene.com.

Storing data in HBase instead of directly in HDFS has all of the usual benefits of storing data in a database instead of a file system.  Structured (fetched and parsed) data is not split into segments (in file system directories), so data can be accessed easily and time-consuming segment merges can be avoided, among other things.  As a matter of fact, we are about to engage in a project that involves this exact functionality: the marriage of Nutch and HBase.  Naturally, we are hoping we can contribute this work back to Nutch, possibly through NUTCH-650.

Of course, when you add a persistence layer to an application there is always the question of whether it is acceptable for it to be tied to one back-end (database) or whether it is better to have an ORM layer on top of the datastore. Such an ORM layer would be an additional layer that allows different backends to be used to store data.  And guess what? Such an ORM, initially focused on HBase and Nutch, and then on Cassandra and other column-oriented databases, is in the works already!  Check the evaluation of ORM frameworks that support non-relational column-oriented datastores and RDBMSs, and the development of an ORM framework that, while initially using Nutch as the guinea pig, already lives its own decoupled life at http://github.com/enis/gora.

That’s all from us on Nutch’s present and future for this month – stay tuned for more Nutch news next month! And of course, as usual, feel free to leave any comments or questions – we appreciate any and all feedback.  You can also follow @sematext on Twitter.

Nutch Digest, March 2010

This is the first post in the Nutch Digest series, so a little introduction to Nutch seems in order. Nutch is a multi-threaded and, more importantly, distributed Web crawler with distributed content processing (parsing, filtering), a full-text indexer, and a search runtime. Nutch is at version 1.0 and the community is now working towards a 1.1 release. Nutch is a large-scale, flexible Web search engine whose operation involves several types of operations. In this post we’ll present new features and mailing list discussions as we describe each of these operations.

Crawling

Nutch starts crawling from a given “seed list” (a list of seed URLs) and iteratively follows useful/interesting outlinks, thus expanding its link database. When talking about Nutch as a crawler it is important to distinguish between two different approaches: focused or vertical crawling and whole Web or wide crawling. Each approach has a different set-up and issues which need to be addressed.  At Sematext we’ve done both vertical and wide crawling.

When using Nutch at large scale (whole Web crawling, dealing with e.g. billions of URLs), generating a fetchlist (a list of URLs to crawl) from the crawlDB (the link database) and updating the crawlDB with new URLs tends to take a lot of time. One solution is to limit such operations to a minimum by generating several fetchlists in one pass over the crawlDB and then updating the crawlDB only once for several segments (a segment being the set of data generated by a single fetch iteration). An implementation of a Generator that generates several fetchlists at once was created in NUTCH-762. To see whether this feature will be included in the 1.1 release and when version 1.1 will be released, check here.
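
For those new to Nutch, a single crawl iteration with the stock tools looks roughly like this (the paths and the segment timestamp are made up for illustration); the NUTCH-762 Generator aims to let the first step emit several fetchlists/segments in one pass, so the expensive generate and updatedb steps run less often:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
bin/nutch fetch crawl/segments/20100401123456
bin/nutch updatedb crawl/crawldb crawl/segments/20100401123456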

One more issue related to whole Web crawling that often pops up on the Nutch mailing list is crawling of authenticated websites, so here is the wiki page on this subject.

When using Nutch for vertical/focused crawls, one often ends up with very slow fetch performance at the end of each fetch iteration. An iteration typically starts with a high fetch speed, but it drops significantly over time and keeps dropping, and dropping, and dropping. This is a known problem.  It is caused by the fetch run containing a small number of sites, some of which may have many more pages than others and may be much slower than others.  Crawler politeness, which means the crawler politely waits before hitting the same domain again, combined with the fact that the number of distinct domains in the fetchlist often drops rapidly during one fetch, causes the fetcher to wait a lot. More on this and on overall Fetcher2 performance (Fetcher2 is the default fetcher in Nutch 1.0) can be found in NUTCH-721.

To be able to support whole Web crawling, Nutch also needs a scalable data processing mechanism. For this purpose Nutch uses Hadoop’s MapReduce for processing and HDFS for storage.

Content storing

Nutch uses Hadoop’s HDFS as fully distributed storage, which creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable and fast computation even on large data volumes. Currently Nutch uses Hadoop 0.20.1, but will upgrade to Hadoop 0.20.2 in version 1.1.

In NUTCH-650 you can find out more about the ongoing effort (and progress) to use HBase as Nutch’s storage backend. This should simplify Nutch storage and make URL/page processing more efficient thanks to HBase’s features (data is mutable and indexed by keys/columns/timestamps).
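
To get a feel for what that buys, here is a minimal sketch of the HBase 0.20-era client API; the table name, column family, and row key below are made up for illustration and are not Nutch’s actual schema. The point is that a page record can be updated and read back in place, by key, with no segment files to rewrite or merge:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePageStoreSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "webpage"); // hypothetical table name

    // update a single page record in place, keyed by URL
    Put put = new Put(Bytes.toBytes("http://example.com/"));
    put.add(Bytes.toBytes("page"), Bytes.toBytes("status"), Bytes.toBytes("fetched"));
    table.put(put);

    // read it back directly by key – no segment scan or merge needed
    Result result = table.get(new Get(Bytes.toBytes("http://example.com/")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("page"), Bytes.toBytes("status"))));
  }
}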

Content processing

As we noted before, Nutch does a lot of content processing, like parsing (parsing downloaded content to extract text) and filtering (keeping only URLs that match the filtering requirements). Nutch is moving away from its own processing tools and is delegating content parsing and MimeType detection to Tika via the Tika plugin; you can read more in NUTCH-766.
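
To show why delegating to Tika is attractive, here is a minimal standalone sketch of Tika-based parsing and MimeType detection (the file name is hypothetical); this is roughly the kind of work the Tika plugin does for Nutch behind the scenes:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaParseSketch {
  public static void main(String[] args) throws Exception {
    InputStream stream = new FileInputStream("downloaded-page.bin"); // hypothetical fetched content
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler();

    // Tika detects the MimeType and picks the appropriate parser automatically
    new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());

    System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
    System.out.println("Extracted text: " + handler.toString());
    stream.close();
  }
}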

Indexing and Searching

Nutch uses Lucene for indexing, currently version 2.9.1, but there is a push to upgrade Lucene to 3.0.1. Nutch 1.0 can also index directly to Solr, thanks to the Nutch-Solr integration. More on this, and on how using Solr with Nutch worked before the Solr-Nutch integration, can be found here. The integration is now being upgraded to Solr 1.4, because Solr 1.4 has StreamingUpdateSolrServer, which simplifies the way documents are buffered before being sent to the Solr instance. Another improvement in this integration was a change to SolrIndexer to commit only once, after all reducers have finished (NUTCH-799).
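
For context, here is a minimal SolrJ sketch of indexing through StreamingUpdateSolrServer; the URL, queue size, thread count, and field names are made up for illustration. Documents are buffered in a queue and streamed to Solr by background threads, which is the buffering simplification mentioned above:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexSketch {
  public static void main(String[] args) throws Exception {
    // buffer up to 100 docs and send them with 4 background threads (illustrative values)
    SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/");          // hypothetical field names
    doc.addField("title", "Example page");
    doc.addField("content", "Text extracted by the parser...");
    server.add(doc);

    // commit once at the very end, as the improved SolrIndexer now does
    server.commit();
  }
}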

Some of the patches discussed here, and a number of other high-quality patches, were contributed by Julien Nioche, who was added as a Nutch committer in December 2009.

One more thing, keep an eye on an interesting thread about Nutch becoming a Top Level Project (TLP) at Apache.

Thank you for reading, and if you have any questions or comments, leave them in the comments and we’ll respond promptly!
