Nutch Digest, March 2010

This is the first post in the Nutch Digest series and a little introduction to Nutch seems in order. Nutch is a multi-threaded and, more importantly, a distributed Web crawler with distributed content processing (parsing, filtering), full text indexer and a search runtime. Nutch is at version 1.0 and community is now working towards a 1.1. release. Nutch is a large scale, flexible Web search engine, which includes several types of operations. In this post we’ll present new features and mailing list discussion as we describe each of these operations.

Crawling

Nutch starts crawling from a given “seed list” (a list of seed URLs) and iteratively follows useful/interesting outlinks, thus expanding its link database. When talking about Nutch as a crawler it is important to distinguish between two different approaches: focused or vertical crawling and whole Web or wide crawling. Each approach has a different set-up and issues which need to be addressed.  At Sematext we’ve done both vertical and wide crawling.

When using Nutch at large scale (whole Web crawling and dealing with e.g. billions of URLs), generating a fetchlist (a list of URLs to crawl) from crawlDB (the link database) and updating crawlDB with new URLs tends to take a lot of time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB and then update the crawlDb only once on several segments (set of data generated by a single fetch iteration). Implementation of a Generator that generates several fetchlists at was created in NUTCH-762. Whether this feature will be included in 1.1 release and when will version 1.1 be released, check here.

One more issue related to whole Web crawling which often pops up on the Nutch mailing list is an authenticated websites crawling, so here is wiki page on this subject.

When using Nutch for vertical/focused crawls, one often ends up with a very slow fetch performance at the end of each fetch iteration. An iteration typically starts with high fetch speed, but it drops significantly over time and keeps dropping, and dropping, and dropping. This is known problem.  It is caused by the fetch run having a small number of sites, some of which may have a lot more pages than others, and may be much slower than others.  Crawler politeness, which means it will politely wait before hitting the same domain again, combined with the fact that the number of distinct domains from fetchlist often drops rapidly during one fetch causes fetcher to wait a lot. More on this and overall Fetcher2 performance (which is a default fetcher in Nutch 1.0) you can find NUTCH-721.

To be able to support whole Web crawling, Nutch needs also needs to have scalable data processing mechanism. For this purpose Nutch uses Hadoop’s MapReduce processing and HDFS for storage.

Content storing

Nutch uses Hadoop’s HDFS as a fully distributed storage which creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and fast computations even on large data volumes. Currently, Nutch is using Hadoop 0.20.1, but will be upgrading to Hadoop 0.20.2. in version 1.1.

In this NUTCH-650 you can find out more about the ongoing effort (and progress) to use HBase as Nutch storage backend. This should simplify Nutch storage and make URL/page processing work more efficient due to the features of HBase (data is mutable and indexed by keys/columns/timestamps).

Content processing

As we noted before, Nutch does a lot of content processing, like parsing (parsing downloaded content to extract text) and filtering (extracting only URLs which match filtering requirements). Nutch is moving away from its own processing tools and delegating content parsing and MimeType detection to Tika by use of Tika plugin, you can read more NUTCH-766.

Indexing and Searching

Nutch uses Lucene for indexing, currently version 2.9.1, but pushing to upgrade Lucene to 3.0.1. Nutch 1.0 can also index directly to Solr, becuse of Nutch-Solr integration. More on this and how using Solr with Nutch worked before Solr-Nutch integration you can find here. This integration now upgrades to Solr 1.4, because Solr 1.4 has a StreamingUpdateSolrServer which simplifies the way docs are buffered before sending to the Solr instance. Another improvement in this integration was a change to SolrIndexer to commit only once after all reducers have finished, NUTCH-799.

Some of patches discussed here and a number of other high quality patches were contributed by Julien Nioche who was added as a Nutch committer in December 2009.

One more thing, keep an eye on an interesting thread about Nutch becoming a Top Level Project (TLP) at Apache.

Thank you for reading, and if you have any questions or comments leave them in comments and we’ll respond promptly!

5 Responses to Nutch Digest, March 2010

  1. bruce says:

    These latest technology updates are always interesting. The scope of growth in Nutch’s capability due to it’s integration with Solr should be regarded as an achievement . This should be a welcome step as it’ll surely enhance the power rendered to a user because of upgradation in indexing techniques. I read about Solr’s indexing techniques at http://www.lucidimagination.com/search/?q=indexing
    which infers that it should be an advantage to the user.

    • sematext says:

      Yes, integrating Nutch with Solr (and both of them with Tika) is an example of good (re)use of technology and indeed results in a better software package for the end user.

      You should note Sematext has a Lucene/Solr/Nutch/etc….. search service at http://search-lucene.com/ . It’s powered by a number of Lucene technologies, of course.

  2. Suren says:

    Is your whole web crawled data available for download? In that way, everyone who wants test/use Nutch search engine do not need to crawl repeatedly. Even if your crawled data is not available, is there any other place we can download? Thanks.

  3. Pingback: Nutch Digest, April 2010 « Sematext Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 143 other followers

%d bloggers like this: