Nutch Digest, May 2010

With May almost over, it’s time for our regular monthly Nutch Digest. As you know, Nutch became a top-level project, and the first related changes are already visible. The Nutch site was moved to its new home, and the mailing lists (user- and dev-) have been moved to new addresses. Old subscriptions were automatically moved to the new lists, so subscribers don’t have to do anything (apart from changing mail filters, perhaps).  Even though Nutch is not a Lucene sub-project any more, we will continue to index its content (mailing lists, JIRA, Wiki, web site) and keep it searchable.  We’ve helped enough organizations with their Nutch clusters that it makes sense for at least us at Sematext to have a handy place to find all things Nutch.

There is a vote on an updated Release Candidate (RC3) for the Apache Nutch 1.1 release.  The major differences between this release candidate and RC2 are several bug fixes (NUTCH-732, NUTCH-815, NUTCH-814, NUTCH-812) and one improvement (NUTCH-816).  Nutch 1.1 is expected shortly!

Among the changes related to Nutch development during May, it’s important to note that Nutch will be using Ivy in its builds and that there is one change regarding Nutch’s language identification: the code for Japanese changed from “jp” to “ja”.  The former is Japan’s country code and the latter is the language code for Japanese.
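The difference between the two codes can be checked against the JDK’s own locale constants. The snippet below is only a minimal sketch: the class name LanguageCodeDemo is ours and has nothing to do with Nutch’s language identifier; it uses only java.util.Locale.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Nutch.
final class LanguageCodeDemo {
    // ISO 639 language code carried by the JDK's Locale.JAPANESE constant.
    static String japaneseLanguageCode() {
        return Locale.JAPANESE.getLanguage();
    }

    // ISO 3166 country code carried by the JDK's Locale.JAPAN constant.
    static String japanCountryCode() {
        return Locale.JAPAN.getCountry();
    }

    public static void main(String[] args) {
        System.out.println("language code: " + japaneseLanguageCode()); // ja
        System.out.println("country code:  " + japanCountryCode());     // JP
    }
}
```

A language identifier should report the first of these, which is why “ja” is the right value for Japanese text.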

There’s been a lot of talk on the Nutch mailing list about the first book on Nutch which Dennis Kubes started writing.  We look forward to reading it, Dennis!

Nutch developers were busy with the TLP-related changes and with preparations for the Nutch 1.1 release this month, so this Digest is a bit thinner than usual.  Follow @sematext on Twitter.

Mahout Digest, April 2010

April was a busy month here at Sematext, so we are a few days late with our April Mahout Digest, but better late than never!

The Board has approved Mahout becoming a Top Level Project (TLP) at the Apache Software Foundation. Check the status thread on the change of Mahout mailing lists, svn layout, etc. Several discussions on the mailing list, all describing what needs to be done for Mahout to become a Top Level Project, resulted in a Mahout TLP to-do list.

As we reported in the previous digest, there was a lot of talk on the mailing list about Google Summer of Code (GSoC), and here is a follow-up on this subject. GSoC announcements are up and Mahout got 5 projects accepted this year. Check the full list of GSoC projects and the Mahout community’s reactions on the mailing list.

In the past we have reported about the idea of Mahout/Solr integration, and it seems this is now getting realized. Read more on features and implementation progress of this integration here.

At the beginning of April, there was a proposal to make collections releases independent of the rest of Mahout. After a transition period of loosening the coupling between mahout-collections and the rest of Mahout, mahout-collections was extracted as an independent component.  The vote on the first independent release of collections is in progress. Mahout-collections-1.0 differs from the version released with Mahout 0.3 only in that the dependency on slf4j was removed.

The parts of Mahout that use Lucene have been updated to the latest Lucene release, 3.0.1; the code changes for this migration can be found in this patch.

There was a question that generated a good discussion about cosine similarity between two items and how it is implemented in Mahout. More on cosine similarity between two items, which is implemented as PearsonCorrelationSimilarity (source) in the Mahout code, can be found in MAHOUT-387.
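The link between the two measures is easy to verify numerically: the Pearson correlation of two vectors is the cosine similarity of the same vectors after each has had its mean subtracted. Below is a minimal sketch of that identity in plain Java. The class SimilaritySketch and its method names are ours for illustration; this is not Mahout’s PearsonCorrelationSimilarity.

```java
// Hypothetical helper for illustration; not Mahout's PearsonCorrelationSimilarity.
final class SimilaritySketch {
    // Cosine of the angle between two equal-length vectors.
    static double cosine(double[] x, double[] y) {
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / Math.sqrt(normX * normY);
    }

    // Pearson correlation computed from the textbook sum formula.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
        }
        return (n * sxy - sx * sy)
                / Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    }

    // Returns a copy of v with its mean subtracted from every component.
    static double[] center(double[] v) {
        double mean = 0;
        for (double value : v) {
            mean += value;
        }
        mean /= v.length;
        double[] centered = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            centered[i] = v[i] - mean;
        }
        return centered;
    }
}
```

For preference vectors such as {1, 2, 3, 4, 5} and {3, 5, 7, 9, 11}, pearson(x, y) and cosine(center(x), center(y)) agree, while the plain cosine(x, y) on the uncentered data gives a different (lower) value.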

The goal of clustering is to determine the grouping in a set of data (e.g. a corpus of items or a (search) result set or …).  Often, the problem in clustering implementations is that the data to be clustered has a high number of dimensions, which tends to need reducing (to a smaller number of dimensions) in order to make clustering computationally more feasible (read: faster).  The simplest way of reducing those dimensions is to use some sort of ‘mapping function’ that takes data presented in n-dimensional space and transforms it to data presented in m-dimensional space, where m is lower than n, hence the reduction. Of course, the key here is to preserve the variance of important data features (dimensions) and flatten out unimportant ones. One simple and interesting approach to clustering is the use of several independent hash functions chosen so that similar items are more likely to collide. This approach, called MinHash-based clustering, was proposed back in March as part of GSoC (see the proposal).  You’ll find more on the theory behind it and on an implementation applicable to Mahout in MAHOUT-344.
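To make the hashing idea a bit more concrete, here is a minimal sketch of MinHash signatures in plain Java. It is not the MAHOUT-344 code: the class name MinHashSketch, the particular hash functions and the use of long item IDs are all our assumptions. Each hash function contributes the minimum hash value over a set, and the fraction of positions where two signatures agree estimates how similar the two sets are.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration of MinHash signatures; not the MAHOUT-344 code.
final class MinHashSketch {
    private final long[] multipliers; // odd, so every hash function is one-to-one on longs
    private final long[] offsets;

    MinHashSketch(int numHashes, long seed) {
        Random random = new Random(seed);
        multipliers = new long[numHashes];
        offsets = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            multipliers[i] = random.nextLong() | 1L;
            offsets[i] = random.nextLong();
        }
    }

    // The i-th hash function: an affine map followed by a bit-mixing step.
    private long hash(int i, long item) {
        long h = multipliers[i] * item + offsets[i];
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }

    // Signature: for each hash function, the minimum hash value over the set.
    long[] signature(Set<Long> items) {
        long[] sig = new long[multipliers.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (long item : items) {
            for (int i = 0; i < sig.length; i++) {
                sig[i] = Math.min(sig[i], hash(i, item));
            }
        }
        return sig;
    }

    // Fraction of positions where two signatures agree; an estimate of set similarity.
    static double estimatedSimilarity(long[] s1, long[] s2) {
        int equal = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                equal++;
            }
        }
        return (double) equal / s1.length;
    }
}
```

A clustering step could then group items whose signatures (or parts of them) match, which is where the reduction from many dimensions to a handful of hash values pays off.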

Those interested in Neural Networks should keep an eye on MAHOUT-383, where Neuroph (a lightweight neural net library) and Mahout will get integrated.

This concludes our short April Mahout Digest.  Once Mahout completes its transition to TLP, we expect the project to flourish even more.  We’ll be sure to report on the progress and new developments later this month.  If you are a Twitter user, you can follow @sematext on Twitter.

Nutch Digest, April 2010

In the first part of this Nutch Digest we’ll go through new and useful features of the upcoming Nutch 1.1 release, while in the second part we’ll focus on developments and plans for the next big Nutch milestone, Nutch 2.0. But let’s start with a few informational items.

  • Nutch has been approved by the ASF board to become a Top Level Project (TLP) in the Apache Software Foundation.  The changing of Nutch mailing lists, URL, etc. will start soon.

Nutch 1.1 will be officially released any day now, so here is a walk-through of the Nutch 1.1 release features:

  • Nutch release 1.1 uses Tika 0.7 for parsing and MimeType detection
  • Hadoop 0.20.2 is used for job distribution (Map/Reduce) and distributed file system (HDFS)
  • On the indexing and search side, Nutch 1.1 uses either Lucene 3.0.1 with its own search application or Solr 1.4
  • Some of the new features included in release 1.1 were discussed in the previous Nutch Digest. For example, the alternative generator, which can generate several segments in one pass over the crawlDB, is included in release 1.1. We used a flavour of this patch in our most recent Nutch engagement that involved a super-duper vertical crawl.  Also, the improvement to SolrIndexer, which now commits only once when all reducers have finished, is included in Nutch 1.1.
  • Some of the new and very useful features were not mentioned before. For example, Fetcher2 (now renamed to just Fetcher) was changed to implement Hadoop’s Tool interface. With this change it is possible to override parameters from configuration files, like nutch-site.xml or hadoop-site.xml, on the command line.
  • If you’ve done some focused or vertical crawling you probably know that one or a few unresponsive hosts can slow down the entire fetch, so one very useful feature added to Nutch 1.1 is the ability to skip queues (which can be translated to hosts) for URLs getting repeated exceptions.  We made good use of that here at Sematext, in the Nutch project we just completed in April 2010.
  • Another improvement included in the 1.1 release and related to Nutch-Solr integration comes in the form of an improved Solr schema that allows field mapping from Nutch to the Solr index.
  • One useful addition to Nutch’s injector is new functionality which allows users to inject metadata into the CrawlDB. Sometimes you need additional data, related to each URL, to be stored. Such external knowledge can later be used (e.g. indexed) by a custom plug-in. If we can all agree that storing arbitrary data in the CrawlDB (with URL as a primary key) can be very useful, then migration to database-oriented storage (like HBase) is only a logical step.  This makes a good segue to the second part of this Digest…

In the second half of this Digest we’ll focus on the future of Nutch, starting with Nutch 2.0.  Plans and ideas for the next Nutch release can be found on the mailing list under Nutch 2.0 roadmap and on the official wiki page.

Nutch is slowly replacing some of its home-grown functionality with best-of-breed products: it uses Tika for parsing, Solr for indexing/searching and HBase for storing various types of data.  Migration to Tika is already included in the Nutch 1.1 release, and exclusive use of Solr as the (enterprise) search engine makes sense. For months we have been telling clients and friends that we predict Nutch will deprecate its own Lucene-based search web application in favour of Solr, and that time has finally come.  Solr offers much more functionality, configurability, performance and ease of integration than Nutch’s simple search web application.  We are happy Solr users ourselves.

Storing data in HBase instead of directly in HDFS has all of the usual benefits of storing data in a database instead of a file system.  Structured (fetched and parsed) data is not split into segments (in file system directories), so data can be accessed easily and time-consuming segment merges can be avoided, among other things.  As a matter of fact, we are about to engage in a project that involves this exact functionality: the marriage of Nutch and HBase.  Naturally, we are hoping we can contribute this work back to Nutch, possibly through NUTCH-650.

Of course, when you add a persistence layer to an application there is always the question of whether it is acceptable for it to be tied to one back-end (database) or whether it is better to have an ORM layer on top of the datastore. Such an ORM layer would be an additional layer which would allow different back-ends to be used to store data.  And guess what? Such an ORM, initially focused on HBase and Nutch, and then on Cassandra and other column-oriented databases, is in the works already!  Check the evaluation of ORM frameworks which support non-relational column-oriented datastores and RDBMSs, and the development of an ORM framework that, while initially using Nutch as the guinea pig, already lives its own decoupled life.

That’s all from us on Nutch’s present and future for this month, stay tuned for more Nutch news, next month! And of course, as usual, feel free to leave any comments or questions – we appreciate any and all feedback.  You can also follow @sematext on Twitter.

HBase Digest, March 2010

We were waiting until the end of the month hoping to include coverage of the new HBase 0.20.4 version, but HBase developers are still working on it. This release will contain a lot of critical fixes and enhancements, so stay tuned.

Typically, our HBase Digest posts consist of three main parts: project status and summary of mailing lists’ most interesting discussions, other projects’ efforts & announcements related to HBase technology, and a FAQ section that aims to save time of the very responsive HBase developers answering the same questions again and again. Please feel free to provide feedback on how you like this coverage format in the post comments.

  • A must-read HBase & HDFS presentation from Todd Lipcon of Cloudera that was a part of “HUG9: March HBase User Group at Mozilla”. Links to other presentations are here. The meeting was followed by a nice discussion on Hadoop (and therefore HBase) reliability with regard to EC2. People shared a lot of useful information about hosting opportunities for one’s HBase setup.
  • A very interesting discussion covers the various use cases HBase is a good fit for.
  • Some notes on what settings to adjust when running HBase on a machine with low RAM in this thread.
  • Good questions from a person/team evaluating HBase, and good answers, are in this thread. Such discussions periodically appear on the mailing lists and, given the great responsiveness of HBase committers, are well worth reading for those who are thinking about using HBase or are already using it, like we are.
  • The warning we already shared with readers through our Hadoop Digest (March): avoid upgrading your clusters to Sun JVM 1.6.0u18 and stick to 1.6.0u16 for a while, since that update has proved to be very stable.
  • One more explanation of the difference between indexed (IHBase) and transactional (THBase) indices.
  • Deleting a row and putting another one with the same key at the same time (i.e. performing a “clean update”) can cause unexpected results if not done properly. There are currently several ways to make this process safer. In case you face this problem, please share your experience with HBase developers on the user mailing list; they will be happy to consider your case when developing a solution to the issue in the next release.
  • Making column names/keys shorter can result in ~20-30% RAM savings, and visible storage savings too. An even bigger advantage comes from defining the right schema and column families. More advice in this thread.
  • What are the options for connecting to HBase running on EC2 from outside the Amazon cloud using the Java library? Thread…

Most notable efforts:

  • Lucehbase: Lucene Index on HBase, ported from Lucandra. Please find more info on this topic in the comments to our super popular Lucandra: A Cassandra-based Lucene backend post.
  • Elephant Bird: Twitter’s library of LZO and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, HBase miscellanea, etc. The majority of these are in production at Twitter running over rather big data every day.

Small FAQ:

  1. How to back up HBase data?
    You can either do exports at the HBase API level (a la Export class), or you can force flush all your tables and do an HDFS level copy of the /hbase directory (using distcp for example).
  2. Is there a way to perform multiple get (put, delete)?
    There is work being done on that; please refer to HBASE-1845. The patch is available for version 0.20.3.

Thank you for reading us and follow us on Twitter – @sematext.

Mahout Digest, March 2010

In this Mahout Digest we’ll summarize what went on in the Mahout world since our last post back in February.

There has been some talk on the mailing list about Mahout becoming a top level project (TLP). Indeed, the decision to go TLP has been made (see Lucene March Digest to find out about other Lucene subprojects going for TLP) and this will probably happen soon, now that Mahout 0.3 is released. Check the discussion thread on Mahout as TLP and follow the discussion on what the PMC will look like. Also, Sean Owen is nominated as Mahout PMC Chair.  There’s been some logo tweaking.

There has been a lot of talk on the Mahout mailing list about Google Summer of Code and project ideas related to Mahout. Check the full list of Mahout’s GSoC project ideas or take up the invitation to write up your own GSoC idea for Mahout!

Since Sematext is such a big Solr shop, we find the proposal to integrate Mahout clustering or classification with Solr quite interesting.  Check more details in MAHOUT-343 JIRA issue.  One example of classification integrated with Solr or actually, classifier as a Solr component, is Sematext’s Multilingual Indexer.  Among other things, Multilingual Indexer uses our Language Identifier to classify documents and individual fields based on language.

When talking about classification we should point out a few more interesting developments. There is an interesting thread on implementation of new classification algorithms and overall classifier architecture. In the same context of classifier architecture, there is a MAHOUT-286 JIRA issue on how (and when)  Mahout’s Bayes classifier will be able to support classification of non-text (non-binary) data like numeric features. If you are interested in using decision forests to classify new data, check this wiki page and this JIRA and patch.

In the previous post we discussed application of n-grams in collocation identification and now there is a wiki page where you can read more on how Mahout handles collocations. Also, check memory management improvements in collocation identification here. Of course, if you think you need more features in key phrases identification and extraction, check Sematext’s Key Phrase Extractor demo – it does more than collocations, can be extended, etc.

Finally, two new committers, Drew Farris and Benson Margulies, have been added to the list of Mahout committers.

That’s all for now from the Mahout world.  Please feel free to leave comments or if you have any questions – just ask, we are listening!

Hadoop Digest, March 2010

Main news first: Hadoop 0.20.2 was released! The list of changes may be found in the release notes here. Related news:

To get the most fresh insight on the 0.21 version release plans, check out this thread and the continuation of it.

High availability is one of the hottest topics nowadays in Hadoop land. The umbrella JIRA issue HDFS-1064 has been created to track discussions/issues related to HDFS NameNode availability. While there are a lot of questions about eliminating the single point of failure, Hadoop developers are more concerned about minimizing downtime (including downtime for upgrades and restart time) than about getting rid of SPOFs, since long downtime is the real pain for those who manage the cluster. There is some work on adding a hot standby that might help with planned upgrades. Please find some thoughts and a bit of explanation on this topic in a thread that started with the “Why not to consider Zookeeper for the NameNode?” question. Next time we see “How Hadoop developers feel about SPOF?” come up on the mailing list, we’ll put it in a special FAQ section at the bottom of this digest. :)

We already reported in our latest Lucene Digest (March) about various Lucene projects starting discussions on their mailing lists about becoming Top Level Apache projects. This tendency (motivated by the Apache board’s warnings of Hadoop and Lucene becoming umbrella projects) raised discussions at HBase, Avro, Pig and Zookeeper as well.

Several other notable items from MLs:

  • Important note from Todd Lipcon we’d like to pass to our readers: avoid upgrading your clusters to Sun JVM 1.6.0u18 and stick to 1.6.0u16 for a while, since it has proved to be very stable. Please read the complete discussion around it here.
  • Storing custom Java objects in the Hadoop Distributed Cache is explained here.
  • Here is a bit of explanation of the fsck command output.
  • Several users shared their experience with issues running Hadoop on a Virtualized O/S vs. the Real O/S in this thread.
  • Those who are thinking about using Hadoop as a base for academic research work (both students and professors) might find a lot of useful links (public datasets, sources for problems, existing research) in this discussion.
  • Hadoop security features are in high demand among the users and community. Developers will be working hard on deploying authentication mechanisms this summer. You can monitor the progress via HADOOP-4487.

This time a very small FAQ section:

  1. How can I request a larger heap for Map tasks?
    By including -Xmx in
  2. How to configure and use LZO compression?
    Take a look at

Thank you for reading us! Please feel free to provide feedback on the format of the digests or anything else, really.

Apache ZooKeeper 3.3.0

Nutch Digest, March 2010

This is the first post in the Nutch Digest series, so a little introduction to Nutch seems in order. Nutch is a multi-threaded and, more importantly, distributed Web crawler with distributed content processing (parsing, filtering), a full-text indexer and a search runtime. Nutch is at version 1.0 and the community is now working towards a 1.1 release. Nutch is a large-scale, flexible Web search engine, which includes several types of operations. In this post we’ll present new features and mailing list discussions as we describe each of these operations.


Nutch starts crawling from a given “seed list” (a list of seed URLs) and iteratively follows useful/interesting outlinks, thus expanding its link database. When talking about Nutch as a crawler it is important to distinguish between two different approaches: focused or vertical crawling and whole Web or wide crawling. Each approach has a different set-up and issues which need to be addressed.  At Sematext we’ve done both vertical and wide crawling.

When using Nutch at large scale (whole Web crawling and dealing with e.g. billions of URLs), generating a fetchlist (a list of URLs to crawl) from the crawlDB (the link database) and updating the crawlDB with new URLs tends to take a lot of time. One solution is to keep such operations to a minimum by generating several fetchlists in one pass over the crawlDB and then updating the crawlDB only once for several segments (a segment is the set of data generated by a single fetch iteration). An implementation of a Generator that generates several fetchlists at once was created in NUTCH-762. To find out whether this feature will be included in the 1.1 release and when version 1.1 will be released, check here.

One more issue related to whole Web crawling which often pops up on the Nutch mailing list is crawling authenticated websites, so here is a wiki page on this subject.

When using Nutch for vertical/focused crawls, one often ends up with very slow fetch performance at the end of each fetch iteration. An iteration typically starts with high fetch speed, but it drops significantly over time and keeps dropping, and dropping, and dropping. This is a known problem.  It is caused by the fetch run having a small number of sites, some of which may have a lot more pages than others, and may be much slower than others.  Crawler politeness, which means the crawler politely waits before hitting the same domain again, combined with the fact that the number of distinct domains in the fetchlist often drops rapidly during one fetch, causes the fetcher to wait a lot. More on this and on overall Fetcher2 performance (Fetcher2 is the default fetcher in Nutch 1.0) can be found in NUTCH-721.
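The slowdown is easy to reproduce with a toy model. The sketch below is not Nutch code; the class PolitenessSketch and its single method are hypothetical, and the model assumes that politeness allows at most one page per host per politeness interval, ignoring thread limits and slow responses. It simply counts how many pages can be fetched in each interval.

```java
// Hypothetical toy model of crawler politeness; not Nutch's Fetcher.
final class PolitenessSketch {
    // Given the number of pages queued for each host, returns how many pages
    // are fetched in each politeness interval when every host may be hit
    // at most once per interval.
    static int[] pagesPerInterval(int[] pagesPerHost) {
        int intervals = 0;
        for (int pages : pagesPerHost) {
            intervals = Math.max(intervals, pages);
        }
        int[] fetched = new int[intervals];
        for (int pages : pagesPerHost) {
            // A host with `pages` pages contributes one fetch to each of the
            // first `pages` intervals and nothing afterwards.
            for (int t = 0; t < pages; t++) {
                fetched[t]++;
            }
        }
        return fetched;
    }
}
```

With three hosts holding 1, 2 and 5 pages, the model fetches 3 pages in the first interval, 2 in the second and then just 1 per interval until the largest host is exhausted: the total run time is dictated by the biggest host, no matter how many fetcher threads are idle.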

To be able to support whole Web crawling, Nutch also needs to have a scalable data processing mechanism. For this purpose Nutch uses Hadoop’s MapReduce for processing and HDFS for storage.

Content storing

Nutch uses Hadoop’s HDFS as fully distributed storage, which creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and fast computations even on large data volumes. Currently, Nutch is using Hadoop 0.20.1, but will be upgrading to Hadoop 0.20.2 in version 1.1.

In NUTCH-650 you can find out more about the ongoing effort (and progress) to use HBase as the Nutch storage back-end. This should simplify Nutch storage and make URL/page processing more efficient thanks to the features of HBase (data is mutable and indexed by keys/columns/timestamps).

Content processing

As we noted before, Nutch does a lot of content processing, like parsing (parsing downloaded content to extract text) and filtering (extracting only URLs which match filtering requirements). Nutch is moving away from its own processing tools and delegating content parsing and MimeType detection to Tika by means of a Tika plugin; you can read more in NUTCH-766.

Indexing and Searching

Nutch uses Lucene for indexing, currently version 2.9.1, but there is a push to upgrade Lucene to 3.0.1. Nutch 1.0 can also index directly to Solr, because of the Nutch-Solr integration. More on this, and on how using Solr with Nutch worked before the Solr-Nutch integration, can be found here. This integration is now being upgraded to Solr 1.4, because Solr 1.4 has a StreamingUpdateSolrServer which simplifies the way docs are buffered before being sent to the Solr instance. Another improvement in this integration was a change to SolrIndexer to commit only once after all reducers have finished (NUTCH-799).

Some of the patches discussed here and a number of other high-quality patches were contributed by Julien Nioche, who was added as a Nutch committer in December 2009.

One more thing, keep an eye on an interesting thread about Nutch becoming a Top Level Project (TLP) at Apache.

Thank you for reading, and if you have any questions or comments leave them in comments and we’ll respond promptly!

HBase Digest, February 2010

The first HBase Digest post received very good feedback from the community. We continue using HBase at Sematext and thus continue covering the status of HBase project with this post.

  • Added Performance Evaluation for IHBase. The PE does a good job of showing what IHBase is good and bad at.
  • Transactional contrib stuff is more geared to short-duration transactions, but it should be possible to share transaction states across machines with certain rules in mind. Thread…
  • Choosing between Thrift and REST connectors for communicating with HBase outside of Java is explained in this thread.
  • How to properly set up Zookeeper to be used by HBase (how many instances/resources should be dedicated to it, etc.) is discussed in this thread. You can find some more info in this one.
  • Yahoo Research has developed a benchmarking tool for “Cloud Serving Systems”. In their paper describing the tool which they intend to open source soon, they compare four “Cloud Serving Systems” and HBase is one of them. Please, also read the explanation from HBase dev team about the numbers inside this paper.
  • HBase trunk has been migrated to a Maven build system.
  • New branch opened for 0.20 updated version to run on Hadoop 0.21. This lets 0.20.3 or 0.20.2 clients operate against HBase running on HDFS 0.21 (with durable WAL, etc.) without any change to the client side. Thread…
  • Since Hadoop 0.21 isn’t going to be released soon and the HBase team is waiting for critical changes (HDFS-265, HDFS-200, etc.) to be applied to make HBase users’ lives easier, HBase trunk is likely to support both 0.21 and the patched 0.20 versions of Hadoop. There was a discussion about a naming convention for HBase releases in light of this, which also touches on plans for which features to include in the nearest releases.
  • Cloudera’s latest release now includes HBase-0.20.3.
  • Exploring possible solutions to “write only top N rows from reduce outcome”. Thread…
  • A new handy binary comparator was added that only compares up to the length of the supplied byte array.
  • These days, HBase developers are working hard on the very sweet “Multi data center replication” feature. It is aimed for 0.21 and will support federated deployment where someone might have terascale (or beyond) clusters in more than one geography and would want the system to handle replication between the clusters/regions.
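The comparator mentioned in the list above, which looks only at as many bytes as the supplied array contains, boils down to a prefix comparison. Here is a minimal sketch of that logic in plain Java; the class PrefixCompareSketch and its method are hypothetical names of ours and do not reproduce the HBase class.

```java
// Hypothetical sketch of a prefix-only byte comparison; not the HBase comparator class.
final class PrefixCompareSketch {
    // Compares `value` against `prefix`, looking only at the first
    // prefix.length bytes of `value`. Returns 0 when `value` starts with
    // `prefix`, a negative number when `prefix` sorts before those bytes,
    // and a positive number otherwise (including when `value` is too short).
    static int compareUpToPrefixLength(byte[] prefix, byte[] value) {
        int n = Math.min(prefix.length, value.length);
        for (int i = 0; i < n; i++) {
            int a = prefix[i] & 0xff; // compare as unsigned bytes
            int b = value[i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return value.length < prefix.length ? 1 : 0;
    }
}
```

Used as a row-key test, a result of 0 means “this key starts with the given prefix”, which is handy for picking out all rows that share a key prefix.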

We’d also like to introduce a small FAQ and FA (frequent advice) section to save some time for the HBase dev team, who are very supportive on the mailing lists.

  • How to move/copy all data from one HBase cluster to another? If you stop the source cluster then you can distcp the /hbase to the other cluster. Done. A perfect copy.
  • Is there a way to get the row count of the table? From the Java API? There is no single-call method to do that. Actual row count info isn’t stored anywhere. You can use the “count” command from the HBase shell, which iterates over all records and may take a lot of time to complete. Otherwise, the count can be obtained by a table scan or a distributed count (usually a MapReduce job).
  • I’m editing property X in hbase-default.xml to… You should edit hbase-site.xml, not hbase-default.xml.
  • Inserting row keys with an incremental ID is usually not a good idea, since sequential writing is usually slower than random writing (sequential keys all land in the same region). If you can’t find a natural row key (which is good for scans), use a UUID.
  • Apply HBASE-2180 patch to increase random read performance in case of multiple concurrent clients.
  • How can I perform “select * from tableX where columnY=Z”-like query in HBase? You’ll need to use a Scan along with a SingleColumnValueFilter. But this isn’t quick; it’s like performing a SQL query on a column that isn’t indexed: the more data you have, the longer it will take. There’s no support for secondary indexes in HBase core, so you need to use one of the contribs (2 are available in 0.20.3: src/contrib/indexed and src/contrib/transactional). Another option is maintaining the indexes yourself.
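The last answer above, maintaining the indexes yourself, can be sketched without any HBase dependency. In the toy model below a TreeMap stands in for a table sorted by row key and a second map plays the role of a hand-maintained index on one column; the class IndexedTableSketch and all of its methods are our invention for illustration, not HBase API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a table plus a self-maintained index.
final class IndexedTableSketch {
    private final SortedMap<String, Map<String, String>> rows = new TreeMap<>();
    private final Map<String, SortedSet<String>> index = new HashMap<>();
    private final String indexedColumn;

    IndexedTableSketch(String indexedColumn) {
        this.indexedColumn = indexedColumn;
    }

    // Writes a cell and keeps the index in step with the data.
    void put(String rowKey, String column, String value) {
        Map<String, String> row = rows.computeIfAbsent(rowKey, k -> new HashMap<>());
        String old = row.put(column, value);
        if (!column.equals(indexedColumn)) {
            return;
        }
        if (old != null) {
            SortedSet<String> oldKeys = index.get(old);
            if (oldKeys != null) {
                oldKeys.remove(rowKey); // drop the stale index entry
                if (oldKeys.isEmpty()) {
                    index.remove(old);
                }
            }
        }
        index.computeIfAbsent(value, k -> new TreeSet<>()).add(rowKey);
    }

    // Like a filtered scan: touches every row, so cost grows with table size.
    List<String> scanWhere(String column, String value) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            if (value.equals(e.getValue().get(column))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Index lookup: goes straight to the matching row keys.
    List<String> lookupByIndex(String value) {
        SortedSet<String> keys = index.get(value);
        if (keys == null) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(keys);
    }
}
```

The trade-off is the usual one: reads by value become cheap, but every write to the indexed column now has to update two structures, and in a real cluster those two updates are not atomic unless you take extra care.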

Hadoop Digest, February 2010

We published the HBase Digest last month, but this is our first ever Hadoop Digest, in which we cover the HDFS and MapReduce pieces of the Hadoop ecosystem.  Before we get started, let us point out that we recently published a guest post titled Introduction to Cloud MapReduce, which should be interesting to all users of Hadoop, as well as its developers.

As of this writing, there are 34 open issues in JIRA scheduled for the 0.21.0 release, with most of them considered “major” and 4 “critical” or “blockers”. There is quite a lot of work to do before 0.21.0 is out.  Hadoop developers are working hard, while at the same time providing tons of very helpful answers & advice on the mailing lists. Please find the summary of the most interesting discussions along with information on current Hadoop API usage below.

  • After several rejections, the USPTO granted a patent to Google for MapReduce. See the community reactions in this thread and in this one.
  • What are security mechanisms in HDFS and what should we expect in the near future? Presentation, Design Document, Thread…
  • An attempt was made to get Hadoop into the Debian Linux distribution. All relevant links and summary can be found in this thread.
  • Consider using LZO compression, which allows a compressed file to be split across map tasks. GZIP is not splittable.
  • Use Python-based scripts to utilize EBS for NameNode and DataNode storage to have persistent, restartable Hadoop clusters running on EC2. Old scripts (in src/contrib/ec2) will be deprecated.
  • Do not rely on the uniqueness of objects in the “values” parameter when implementing reduce(T key, Iterable<T> values, Context context); the same object instances can be reused. Thread…
  • To keep long-running tasks from being aborted, use the Reporter object to provide a “task heartbeat”.  If a map task takes longer than 600 seconds (the default) to complete an iteration, map/reduce assumes the task is stalled and axes it.
  • Setting up DNS lookup properly (caching DNS servers, reverse DNS setup) for a big cluster to avoid DNS requests traffic flood is discussed in this thread.
  • Programmatically setting an output compression codec other than the default is explained in this thread.
  • What are the version compatibility rules for Hadoop distributions? Read the hot discussion here.
  • Critical issue HDFS-101 (DFS write pipeline: DFSClient sometimes does not detect second DataNode failure) was reported and fixed; the fix is compatible with DFSClient 0.20.1 and will be included in 0.20.2.
  • Text type is meant for use when you have a UTF8-encoded string. Creating a Text object from a byte array that is not proper UTF-8 is likely to result in an exception or data mangling. BytesWritable should be used for this purpose.
  • How to make a particular “section of code” run in only one of the mappers? (Or how to share some flag state between jobs running on different machines.) Thread…
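The advice in the list above about not relying on the uniqueness of objects in “values” is easier to appreciate with a small simulation. The sketch below has no Hadoop dependency: IntHolder is a hypothetical stand-in for a mutable value object, and reusingIterable imitates a framework that hands out the same instance on every iteration, as the list item describes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical simulation of value-object reuse; no Hadoop classes involved.
final class ReusedValuesSketch {
    // Mutable holder standing in for a reusable value object.
    static final class IntHolder {
        int value;
    }

    // An Iterable that returns the same IntHolder instance on every next(),
    // overwriting its content each time.
    static Iterable<IntHolder> reusingIterable(int[] data) {
        final IntHolder shared = new IntHolder();
        return () -> new Iterator<IntHolder>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < data.length;
            }

            @Override
            public IntHolder next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = data[pos++];
                return shared;
            }
        };
    }

    // Buggy pattern: keeps references, so every kept element is the same object.
    static List<Integer> collectByReference(Iterable<IntHolder> values) {
        List<IntHolder> kept = new ArrayList<>();
        for (IntHolder v : values) {
            kept.add(v);
        }
        List<Integer> result = new ArrayList<>();
        for (IntHolder v : kept) {
            result.add(v.value);
        }
        return result;
    }

    // Safe pattern: copies the content out while iterating.
    static List<Integer> collectByCopy(Iterable<IntHolder> values) {
        List<Integer> result = new ArrayList<>();
        for (IntHolder v : values) {
            result.add(v.value);
        }
        return result;
    }
}
```

For input {1, 2, 3}, collectByReference ends up with three copies of the last value, while collectByCopy returns the values as they were seen. The same reasoning applies in a real reducer: copy whatever you need to keep before moving on to the next value.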

We would also like to add a small FAQ section here to highlight common user questions.

  1. MR. Is there a way to cancel/kill the job?
    Invoke command: hadoop job -kill jobID
  2. MR. How to get the name of the file that is being used for the map task?
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String sFileName = fileSplit.getPath().getName();
  3. MR. When the framework splits a file, can some part of a line fall in one split and the other part in some other split?
    In general, the file split may break the records, it is the responsibility of the record reader to present the record as a whole. If you use standard available InputFormats, the framework will make sure complete records are presented in <key,value>.
  4. HDFS. How to view text content of SequenceFile?
    A SequenceFile is not a text file, so you cannot see its content by invoking the UNIX command cat. Use the hadoop command: hadoop fs -text <src>
  5. HDFS. How to move file from one dir to another using Hadoop API?
    Use FileSystem#rename(Path, Path) to move files. The copy methods will leave you with two of the same file.
  6. Cluster setup. Some of my nodes are in the blacklist, and I want to reuse them again. How can I do that?
    Restarting the trackers removes them from the blacklist.
  7. General. What command should I use to…? How should I use command X?
    Please refer to the Commands Guide page.
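
To make the answer to question 3 more concrete, here is a simplified sketch of how a line-oriented record reader can hand out whole lines even when split boundaries fall mid-line. It is not Hadoop's implementation: the class SplitLines, the in-memory byte array and the exact ownership rule (a split owns every line whose first byte lies inside it) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a line-oriented record reader; not Hadoop code. */
final class SplitLines {

    /**
     * Returns the lines owned by the split [start, start + length).
     * Assumes 0 <= start <= data.length and '\n' as the only line terminator.
     */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;

        // A split that begins mid-line skips the partial line:
        // the reader of the previous split is responsible for it.
        if (start > 0 && data[start - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step over the newline itself
        }

        List<String> lines = new ArrayList<>();
        while (pos < end) {
            // Read to the end of the line, even if that runs past the split boundary.
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Three 8-byte splits; the boundaries at offsets 8 and 16 both fall mid-line.
        for (int start = 0; start < data.length; start += 8) {
            System.out.println("split@" + start + " -> " + readSplit(data, start, 8));
        }
    }
}
```

With this input the first split yields alpha and bravo, the second yields charlie, and the third yields nothing, so every line is delivered exactly once even though two boundaries cut through a line.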

There were also several efforts (be patient, some of them are still somewhat rough) that might be of interest:

  • JRuby on Hadoop is a thin wrapper for Hadoop Mapper / Reducer for JRuby, not to be confused with Hadoop Streaming.
  • Stream-to-hdfs is a simple utility for streaming stdin to a file in HDFS.
  • Crane manages Hadoop clusters using Clojure.
  • Piglet is a DSL for writing Pig Latin scripts in Ruby.

Thank you for reading! We highly appreciate all feedback on our Digests, so tell us what you like or dislike.

HBase Digest, January 2010

Here at Sematext we are making more and more use of the Hadoop family of projects. We are expanding our digest post series with this HBase Digest, adding it to the existing Lucene and Solr coverage. For those of you who want to stay up to date with HBase community discussions and benefit from the knowledge-packed threads there, but can’t follow all those high-volume Hadoop mailing lists, we also include brief coverage of the mailing lists.

  • HBase 0.20.3 has just been released. Nice way to end the month.  It includes a huge number of bug fixes, fixes for EC2-related issues, and a good number of improvements. HBase 0.20.3 uses the latest version of ZooKeeper, 3.2.2.  We should also note that another distributed, column-oriented database from Apache was released a few days ago, too: Cassandra 0.5.0.
  • An alternative indexed HBase implementation (HBASE-2037) was reported as completed (and included in 0.20.3). It speeds up scans by adding indexes to regions rather than secondary tables.
  • HBql was announced this month, too. It is an abstraction layer for HBase that introduces a SQL dialect for HBase and JDBC-like bindings, i.e., a more familiar API for HBase users. Thread…
  • Ways of integrating with HBase (instantiating HTable) on the client side: Template for HBase Data Access (for integration with the Spring framework), simple Java Beans mapping for HBase, and the HTablePool class.
  • HbaseExplorer – an open-source web application that helps with simple HBase data administration and monitoring.
  • There was a discussion about splitting the process of importing very large data volumes into HBase into separate steps. Thread…
  • To get any parallelization, you have to start multiple JVMs in the current Hadoop version. Thread…
  • Tips for increasing HBase write speed: use random int keys to distribute load between RegionServers; use a multi-process client instead of a multi-threaded client; set a higher heap size in conf/ (give it as much as you can without swapping); consider LZO to hold the same amount of data in fewer regions per server. Thread…
  • Some advice on hardware configuration for sustaining a write speed of 40-50K records/sec. Thread…
  • A secondary index can go out of sync with the base table if I/O exceptions occur during commit (when using transactional contrib). The handling of such exceptions in the transactional layer should be revised. Thread…
  • A Configuration instance (as well as an instance of HBaseConfiguration) is not thread-safe, so do not modify it while it is shared between threads. Thread…
  • What is the minimum number of boxes for an HBase deployment? The thread covers both HA and non-HA options, which deployments can share the same boxes, etc. Thread…
  • Optimizing random reads: using client-side multi-threading will not improve reads greatly according to some tests, but there is an open JIRA issue HBASE-1845 related to batch operations. Thread…
  • Exploring possibilities for server-side data filtering. The thread discusses the classpath requirements and the options for hot-deploying filters. Thread…
  • How-to: Configure a table to keep only one version of data. Thread…
  • Recipes and hints for scanning more than one table in Map. Thread…
  • Managing timestamps for Put and Delete operations to avoid unwanted overlap between them. Thread…
  • The amount of replication should have no effect on read performance, whether using a scanner or random access. Thread…
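
The write-speed tip about distributing load between RegionServers can be sketched in plain Java. The list above suggests random int keys. The sketch below swaps in a closely related technique, a hash-derived prefix (often called key salting), so that the key for a given record can be recomputed at read time. The class SaltedKeys, the two-digit prefix format and the bucket count are hypothetical choices for illustration and are not part of the HBase API.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the HBase API. */
final class SaltedKeys {

    /**
     * Prefixes the natural key with a bucket number in [0, buckets).
     * Assumes buckets <= 100 so that the prefix always has two digits.
     */
    static String saltedKey(String naturalKey, int buckets) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(naturalKey.hashCode(), buckets);
        return String.format(Locale.ROOT, "%02d-%s", bucket, naturalKey);
    }

    public static void main(String[] args) {
        // Sequential natural keys would otherwise sort next to each other.
        Set<String> prefixes = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            String key = saltedKey("event-" + i, 16);
            prefixes.add(key.substring(0, 2));
        }
        System.out.println("distinct prefixes used: " + prefixes.size());
    }
}
```

The trade-off is on the read side: a scan over a range of natural keys has to be issued once per bucket, so a scheme like this suits write-heavy workloads better than scan-heavy ones.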

Did you really make it this far down?  :) If you are into search, see January 2010 Digests for Lucene and Solr, too.

Please share your thoughts about the Digest posts as they come. We would really like to know whether they are valuable.  Please tell us what format and style you prefer; any other ideas you have would be very welcome.


