Nutch Digest, April 2010

In the first part of this Nutch Digest we’ll go through new and useful features of the upcoming Nutch 1.1 release, while in the second part we’ll focus on developments and plans for next big Nutch milestone, Nutch 2.0. But, let’s start with few informational items.

Nutch has been approved by the ASF board to become Top Level Project (TLP) in the Apache Software Foundation. The changing of Nutch mailing lists, URL, etc. will start soon.

Nutch 1.1 will be officially released any day now and here is a Nutch 1.1 release features walk through:

Nutch release 1.1 uses Tika 0.7 for parsing and MimeType detection
Hadoop 0.20.2 is used for job distribution (Map/Reduce) and distributed file system (HDFS)
On the indexing and search side, Nutch 1.1 uses either Lucene 3.0.1.with its own search application or Solr 1.4

Some of the new features included in release 1.1 were discussed in previous Nutch Digest. For example, alternative generator which can generate several segments in one parse of the crawlDB is included in release 1.1. We used a flavour of this patch in our most recent Nutch engagement that involved super-duper vertical crawl. Also, improvement of SOLRIndexer, which now commits only once when all reducers have finished, is included in Nutch 1.1.

Some of the new and very useful features were not mentioned before. For example, Fetcher2 (now renamed to just Fetcher) was changed to implement Hadoop’s Tool interface. With this change it is possible to override parameters from configuration files, like nutch-site.xml or hadoop-site.xml, on the command line.
If you’ve done some focused or vertical crawling you probably know that one or few unresponsive host(s) can slow down entire fetch, so one very useful feature added to Nutch 1.1 is the ability to skip queues (which can be translated to hosts) for URLS getting repeated exceptions. We made good use of that here at Sematext, in the Nutch project we just completed in April 2010.
Another improvement included in 1.1 release related to Nutch-Solr integration comes in a form of improved Solr schema that allows field mapping from Nutch to Solr index.
One useful addition to Nutch’s injector is new functionality which allows user to inject metadata into the CrawlDB. Sometimes you need additional data, related to each URL, to be stored. Such external knowledge can later be used (e.g. indexed) by a custom plug-in. If we can all agree that storing arbitrary data in CrawlDb (with URL as a primary key) can be very useful, then migration to database oriented storage (like HBase) is only a logical step. This makes a good segue to the second part of this Digest…

In the second half of this Digest we’ll focus on the future of Nutch, starting with Nutch 2.0. Plans and ideas for the next Nutch release can be found on mailing list under Nutch 2.0 roadmap and on the official wiki page.

Nutch is slowly replacing some of its home-grown functionality with best of breed products — it uses Tika for parsing, Solr for indexing/searching and HBase for storing various types of data. Migration to Tika is already included in Nutch 1.1. release and exclusive use of Solr as (enterprise) search engine makes sense — for months we have been telling clients and friends we predict Nutch will deprecate its own Lucene-based search web application in favour of Solr, and that time has finally come. Solr offers much more functionality, configurability, performance and ease of integration than Nutch’s simple search web application. We are happy Solr users ourselves – we use it to power search-lucene.com.

Storing data in HBase instead of directly in HDFS has all of the usual benefits of storing data in database instead of a files system. Structured (fetched and parsed) data is not split into segments (in file system directories), so data can be accessed easily and time consuming segment merges can be avoided, among other things. As a matter of fact, we are about to engage in a project that involves this exact functionality: the marriage of Nutch and HBase. Naturally, we are hoping we can contribute this work back to Nutch, possibly through NUTCH-650.

Of course, when you add a persistence layer to an application there is always a question if whether it is acceptable for it to be tied to one back-end (database) or whether it is better to have an ORM layer on top of the datastore. Such an ORM layer would be an additional layer which would allow different backends to be used to store data. And guess what? Such an ORM, initially focused on HBase and Nutch, and then on Cassandra and other column-oriented databases is in the works already! Check the evaluation of ORM frameworks which support non-relational column-oriented datastores and RDBMs and development of an ORM framework that, while initially using Nutch as the guinea pig, already lives its own decoupled life over at https://github.com/enis/gora.

That’s all from us on Nutch’s present and future for this month, stay tuned for more Nutch news, next month! And of course, as usual, feel free to leave any comments or questions – we appreciate any and all feedback. You can also follow @sematext on Twitter.

Start Free Trial

Sematext

Menu

Featured

Latest

Categories

Nutch Digest, April 2010

You might also like

Share

You might also like