Hadoop Digest, February 2010
February 16, 2010
We published the HBase Digest last month, but this is our first ever Hadoop Digest, in which we cover the HDFS and MapReduce pieces of the Hadoop ecosystem. Before we get started, let us point out that we recently published a guest post titled Introduction to Cloud MapReduce, which should be interesting to all users of Hadoop, as well as its developers.
As of this writing, there are 34 open issues in JIRA scheduled for the 0.21.0 release, most of them considered “major” and 4 of them “critical” or “blockers”. There is quite a lot of work to do before 0.21.0 is out. Hadoop developers are working hard while also providing tons of very helpful answers and advice on the mailing lists. Below is a summary of the most interesting discussions, along with information on current Hadoop API usage.
- After several rejections, the USPTO granted a patent to Google for MapReduce. Read the community reactions in this thread and in this thread.
- What are the security mechanisms in HDFS, and what should we expect in the near future? Presentation, Design Document, Thread…
- An attempt was made to get Hadoop into the Debian Linux distribution. All relevant links and a summary can be found in this thread.
- Consider using LZO compression, which allows a compressed file to be split across map tasks. GZIP is not splittable.
- Use Python-based scripts to utilize EBS for NameNode and DataNode storage to have persistent, restartable Hadoop clusters running on EC2. Old scripts (in src/contrib/ec2) will be deprecated.
- Do not rely on the uniqueness of objects in the “values” parameter when implementing reduce(T key, Iterable<T> values, Context context): the same object instances can be reused. Thread…
- To keep long-running tasks from being aborted, use the Reporter object to provide a “task heartbeat”. If a map task takes longer than 600 seconds (the default) to complete an iteration without reporting progress, MapReduce assumes the task has stalled and kills it.
- Setting up DNS lookup properly (caching DNS servers, reverse DNS setup) for a big cluster to avoid a flood of DNS requests is discussed in this thread.
- Setting a non-default output compression codec programmatically is explained in this thread.
- What are the version compatibility rules for Hadoop distributions? Read the hot discussion here.
- Critical issue HDFS-101 (DFS write pipeline: DFSClient sometimes does not detect second DataNode failure) was reported and fixed. The fix is compatible with DFSClient 0.20.1 and will be included in 0.20.2.
- The Text type is meant for UTF-8-encoded strings. Creating a Text object from a byte array that is not proper UTF-8 is likely to result in an exception or data mangling. Use BytesWritable for such data instead.
- How to make a particular “section of code” run in only one of the mappers (or how to share some flag state between jobs running on different machines)? Thread…
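To illustrate the tip above about object reuse in reduce(), here is a minimal sketch in plain Java. It is only a simulation: ReusedValue and ReusingIterable are hypothetical stand-ins for a Writable value and for the “values” iterable, and no Hadoop classes are involved. It assumes the iterable hands back one shared instance and overwrites its contents on every step, which is the behavior the tip warns about.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simulation only: these classes are hypothetical, not part of Hadoop.
class ValueReuseDemo {

    /** Hypothetical mutable value, similar in spirit to a Writable. */
    static final class ReusedValue {
        int payload;
        ReusedValue(int payload) { this.payload = payload; }
        ReusedValue copy() { return new ReusedValue(payload); }
    }

    /** Walks over the data but returns the same instance every time. */
    static final class ReusingIterable implements Iterable<ReusedValue> {
        private final int[] data;
        ReusingIterable(int... data) { this.data = data; }

        @Override
        public Iterator<ReusedValue> iterator() {
            return new Iterator<ReusedValue>() {
                private final ReusedValue shared = new ReusedValue(0);
                private int next = 0;

                @Override public boolean hasNext() { return next < data.length; }

                @Override public ReusedValue next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    shared.payload = data[next++]; // overwrite in place
                    return shared;
                }
            };
        }
    }

    /** Buggy: stores references, so every entry aliases the one shared object. */
    static List<Integer> collectByReference(Iterable<ReusedValue> values) {
        List<ReusedValue> kept = new ArrayList<>();
        for (ReusedValue v : values) kept.add(v);
        return payloads(kept);
    }

    /** Safe: copies each value before the iterator overwrites it. */
    static List<Integer> collectByCopy(Iterable<ReusedValue> values) {
        List<ReusedValue> kept = new ArrayList<>();
        for (ReusedValue v : values) kept.add(v.copy());
        return payloads(kept);
    }

    private static List<Integer> payloads(List<ReusedValue> values) {
        List<Integer> out = new ArrayList<>();
        for (ReusedValue v : values) out.add(v.payload);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference(new ReusingIterable(1, 2, 3))); // [3, 3, 3]
        System.out.println(collectByCopy(new ReusingIterable(1, 2, 3)));      // [1, 2, 3]
    }
}
```

The takeaway for a real reducer is the same: if you need to keep a value beyond the current loop iteration, copy it first rather than storing the reference.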
We would also like to add a small FAQ section here to highlight common user questions.
- MR. Is there a way to cancel/kill the job?
Invoke the command: hadoop job -kill jobID
- MR. How to get the name of the file that is being used for the map task?
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String sFileName = fileSplit.getPath().getName();
- MR. When the framework splits a file, can some part of a line fall in one split and the other part in some other split?
In general, a file split may break records; it is the responsibility of the record reader to present each record as a whole. If you use the standard InputFormats, the framework will make sure complete records are presented as <key,value> pairs.
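To make the answer above more concrete, here is a simplified simulation of how a line-oriented record reader can hand out whole lines even when split boundaries fall mid-line. This is a sketch under our own assumptions, not Hadoop's actual record reader code: SplitLineReaderDemo, readSplit and readAllSplits are hypothetical names, and the data is a plain in-memory byte array. The convention used is that a split other than the first skips the line in progress at its start, and every split keeps reading past its end to finish the last line it owns.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified simulation; hypothetical helper, not Hadoop code.
class SplitLineReaderDemo {

    /** Returns the complete lines owned by the byte range [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // The previous split finishes the line that reaches into this one,
            // so discard everything up to and including the first newline.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it starts at or before `end`;
        // it is read in full even if it runs past `end`.
        while (pos < data.length && pos <= end) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    /** Cuts the data into fixed-size splits and concatenates what each split reads. */
    static List<String> readAllSplits(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // With 10-byte splits the boundaries fall inside "bravo" and at the start of "delta".
        for (int start = 0; start < data.length; start += 10) {
            int end = Math.min(start + 10, data.length);
            System.out.println("[" + start + "," + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

Whatever the split size, each line is returned exactly once and never in pieces, which is the guarantee the standard InputFormats give you.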
- HDFS. How to view the text content of a SequenceFile?
A SequenceFile is not a text file, so you cannot view its content with the UNIX cat command. Use the hadoop command: hadoop fs -text <src>
- HDFS. How to move a file from one directory to another using the Hadoop API?
Use FileSystem#rename(Path, Path) to move files. The copy methods will leave you with two of the same file.
- Cluster setup. Some of my nodes are in the blacklist, and I want to use them again. How can I do that?
Restarting the trackers removes them from the blacklist.
- General. What command should I use to…? How should I use command X?
Please refer to the Commands Guide page.
There were also several efforts (be patient, some of them are still somewhat rough) that might be of interest:
- JRuby on Hadoop is a thin wrapper around the Hadoop Mapper / Reducer for JRuby, not to be confused with Hadoop Streaming.
- Stream-to-hdfs is a simple utility for streaming stdin to a file in HDFS.
- Crane manages Hadoop clusters using Clojure.
- Piglet is a DSL for writing Pig Latin scripts in Ruby.
Thank you for reading! We highly appreciate all feedback on our Digests, so tell us what you like or dislike.