Hadoop Digest, February 2010

We published the HBase Digest last month, but this is our first ever Hadoop Digest, covering the HDFS and MapReduce pieces of the Hadoop ecosystem.  Before we get started, let us point out that we recently published a guest post titled Introduction to Cloud MapReduce, which should be interesting to Hadoop users and developers alike.

As of this writing, there are 34 open issues in JIRA scheduled for the 0.21.0 release, most of them marked “major” and 4 marked “critical” or “blocker”, so there is quite a lot of work to do before 0.21.0 is out.  At the same time, Hadoop developers are working hard and providing tons of very helpful answers and advice on the mailing lists. Below you will find a summary of the most interesting discussions, along with notes on current Hadoop API usage.

  • After several rejections, the USPTO granted Google a patent for MapReduce. Community reactions can be found in thread and in thread.
  • What are the security mechanisms in HDFS, and what should we expect in the near future? Presentation, Design Document, Thread…
  • An attempt was made to get Hadoop into the Debian Linux distribution. All relevant links and summary can be found in this thread.
  • Consider using LZO compression, which allows a compressed file to be split across map tasks; GZIP-compressed files are not splittable.
  • Use the Python-based scripts to put NameNode and DataNode storage on EBS, giving you persistent, restartable Hadoop clusters on EC2. The old scripts (in src/contrib/ec2) will be deprecated.
  • Do not rely on the uniqueness of objects in the “values” parameter when implementing reduce(K key, Iterable<V> values, Context context): the same object instances can be reused across iterations (a defensive-copy sketch follows this list). Thread…
  • To keep long-running tasks from being aborted, use the Reporter object to provide a “task heartbeat”: if a map task goes longer than 600 seconds (the default) without reporting progress, the framework assumes it is stalled and kills it (a progress-reporting sketch follows this list).
  • Setting up DNS lookups properly (caching DNS servers, reverse DNS) for a big cluster, to avoid a flood of DNS request traffic, is discussed in this thread.
  • Setting a non-default output compression codec programmatically is explained in this thread (a configuration sketch follows this list).
  • What are the version compatibility rules for Hadoop distributions? Read the hot discussion here.
  • The critical issue HDFS-101 (DFS write pipeline: DFSClient sometimes does not detect second DataNode failure) was reported and fixed (the fix is compatible with the 0.20.1 DFSClient) and will be included in 0.20.2.
  • The Text type is meant for UTF-8-encoded strings. Creating a Text object from a byte array that is not valid UTF-8 is likely to result in an exception or mangled data; use BytesWritable for arbitrary binary data instead.
  • How do you make a particular “section of code” run in only one of the mappers (or share some flag state between tasks running on different machines)? Thread…
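
To illustrate the value-reuse point above, here is a minimal sketch of a reducer that makes defensive copies (the class name is ours, not from the thread; it uses the new org.apache.hadoop.mapreduce API):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CopyingReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<Text> copies = new ArrayList<Text>();
        for (Text value : values) {
          // WRONG: copies.add(value) -- the framework reuses the same Text instance,
          // so every element would end up pointing at the last value seen.
          copies.add(new Text(value));  // make a defensive copy instead
        }
        for (Text v : copies) {
          context.write(key, v);
        }
      }
    }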
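
For the task-heartbeat advice, here is a minimal sketch using the old org.apache.hadoop.mapred API (the expensive-work method is a hypothetical placeholder):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LongRunningMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        for (int chunk = 0; chunk < 100; chunk++) {
          doExpensiveWork(value, chunk);                  // hypothetical long computation
          reporter.progress();                            // heartbeat: resets the 600-second timeout
          reporter.setStatus("processed chunk " + chunk); // optional: show progress in the web UI
        }
      }

      private void doExpensiveWork(Text value, int chunk) {
        // placeholder for the real work
      }
    }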
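
And for the output compression codec, a minimal configuration sketch (0.20-style JobConf; we use GzipCodec here, substitute your installed LZO codec class if you have one):

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class CompressedOutputConfig {
      public static void configure(JobConf conf) {
        FileOutputFormat.setCompressOutput(conf, true);                    // turn output compression on
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);  // pick the codec
        // For SequenceFile output, block compression usually gives the best ratio:
        SequenceFileOutputFormat.setOutputCompressionType(
            conf, SequenceFile.CompressionType.BLOCK);
      }
    }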

We would also like to add a small FAQ section here to cover common user questions.

  1. MR. Is there a way to cancel/kill the job?
    Invoke the command: hadoop job -kill <jobID>
  2. MR. How to get the name of the file that is being used for the map task?
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String sFileName = fileSplit.getPath().getName();
  3. MR. When the framework splits a file, can part of a line fall in one split and the rest in another split?
    In general, a file split may break records; it is the responsibility of the RecordReader to present each record as a whole. If you use the standard InputFormats, the framework will make sure complete records are presented as <key, value> pairs.
  4. HDFS. How do I view the text content of a SequenceFile?
    A SequenceFile is not a text file, so you cannot see its content with the UNIX cat command. Use the hadoop command: hadoop fs -text <src>
  5. HDFS. How do I move a file from one directory to another using the Hadoop API?
    Use FileSystem#rename(Path, Path) to move files; the copy methods will leave you with two copies of the same file (a short sketch follows this FAQ).
  6. Cluster setup. Some of my nodes are on the blacklist, and I want to use them again. How can I do that?
    Restarting the trackers removes them from the blacklist.
  7. General. What command should I use to…? How should I use command X?
    Please refer to the Commands Guide page.
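
For FAQ #5, here is a minimal sketch of the rename-based move (the class name and paths are just examples, not from the original question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMove {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        boolean moved = fs.rename(new Path("/user/alice/in/part-00000"),
                                  new Path("/user/alice/archive/part-00000"));
        if (!moved) {
          System.err.println("rename failed (e.g. destination exists or source is missing)");
        }
        fs.close();
      }
    }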

There were also several efforts (be patient, some of them are still somewhat rough) that might be of interest:

  • JRuby on Hadoop is a thin wrapper around the Hadoop Mapper/Reducer API for JRuby, not to be confused with Hadoop Streaming.
  • Stream-to-hdfs is a simple utility for streaming stdin to a file in HDFS.
  • Crane manages Hadoop clusters using Clojure.
  • Piglet is a DSL for writing Pig Latin scripts in Ruby.

Thank you for reading! We highly appreciate all feedback on our Digests, so tell us what you like or dislike.

7 Responses to Hadoop Digest, February 2010

  1. Todd Lipcon says:

    Hey Otis

    Thanks for writing this up! Will this be monthly? Should be a great resource for the community.

    -Todd

  2. Hi Todd. I didn’t personally put this one together. But yes, digest posts will be monthly. Thanks for the feedback!

  3. Jakob Homan says:

    Otis-
    This is an excellent resource; thank you very much for compiling this information. Very much appreciated.

    Thanks,
    Jakob Homan
    Hadoop @ Yahoo!

  4. yuanke says:

    How does the framework make sure complete records are presented in <key, value>? Could you give a more detailed description? Everything else you said I already know. Thanks.

    • sematext says:

      I’m not sure which part of the post you are referring to, but this sounds like a question for the mailing list.
      Alternatively, have you looked for the answer on http://www.search-hadoop.com/ ?

    • Alex Baranau says:

      I think I know what the question is about.

      If a split ends with only part of a record, the framework (this is actually up to the InputFormat) requests the additional bytes, which are at the beginning of the next split. The call is made over HTTP and usually transmits a very small amount of data, so given the size of each split (usually at least 64 MB) it doesn’t noticeably affect performance even if the next split is on another physical box.

      I hope I answered your question.
