Hadoop Digest, March 2010
March 29, 2010
Main news first: Hadoop 0.20.2 has been released! The list of changes can be found in the release notes. Related news:
- Maven artifacts have been pushed to repository.apache.org.
- This version has entered the Debian unstable repository.
- Cloudera officially announced the CDH2 release (as well as CDH3 Beta 1).
More news on releases:
- Pig 0.6.0 is out. This release includes performance and memory usage improvements, a new Accumulator interface for UDFs, and many bug fixes. Release notes available at http://hadoop.apache.org/pig/releases.html.
- ZooKeeper 3.3.0 is out. See the announcement for the release details.
High availability is one of the hottest topics in Hadoop land these days. The umbrella JIRA issue HDFS-1064 has been created to track discussions and issues related to HDFS NameNode availability. While there are a lot of questions about eliminating the single point of failure, Hadoop developers are more concerned about minimizing downtime (including downtime for upgrades and restart time) than about getting rid of SPOFs, since long downtime is the real pain for those who manage a cluster. There is some work on adding a hot standby that might help with planned upgrades. You can find some thoughts and a bit of explanation on this topic in the thread that started with the question “Why not to consider Zookeeper for the NameNode?”. Next time we see “How do Hadoop developers feel about SPOF?” come up on the mailing list, we’ll put it in a special FAQ section at the bottom of this digest.
We already reported in our latest Lucene Digest (March) that various Lucene projects have started discussions on their mailing lists about becoming top-level Apache projects. This trend (motivated by the Apache board’s warnings about Hadoop and Lucene becoming umbrella projects) has prompted similar discussions in the HBase, Avro, Pig and ZooKeeper communities as well.
Several other notable items from the mailing lists:
- An important note from Todd Lipcon that we’d like to pass on to our readers: avoid upgrading your clusters to Sun JVM 1.6.0u18, and stick with 1.6.0u16 for a while, since it has proved to be very stable. Please read the complete discussion around it here.
- Storing custom Java objects in the Hadoop Distributed Cache is explained here.
- Here is a bit of explanation of the fsck command output.
- Several users shared their experience with issues running Hadoop on a virtualized OS vs. a real OS in this thread.
- Those who are thinking about using Hadoop as a base for academic research (both students and professors) might find a lot of useful links (public datasets, sources of problems, existing research) in this discussion.
- Hadoop security features are in high demand among the users and community. Developers will be working hard on deploying authentication mechanisms this summer. You can monitor the progress via HADOOP-4487.
This time a very small FAQ section:
- How can I request a larger heap for Map tasks?
By including an -Xmx option (for example, -Xmx1024m) in the mapred.child.java.opts property.
- How do I configure and use LZO compression?
Take a look at http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/.
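To make the heap answer above a bit more concrete, here is a minimal sketch of how one might build the value for mapred.child.java.opts in Java. The class ChildJavaOpts and its withMaxHeap method are hypothetical helpers written for this illustration, not part of Hadoop; the sketch only assembles the option string, dropping any existing -Xmx token so the new one is the only heap setting.

```java
import java.util.ArrayList;
import java.util.List;

public class ChildJavaOpts {

    // Hypothetical helper (not a Hadoop API): returns the given JVM options
    // with any existing -Xmx token removed and "-Xmx<heap>" appended.
    public static String withMaxHeap(String opts, String heap) {
        List<String> kept = new ArrayList<>();
        if (opts != null) {
            for (String token : opts.trim().split("\\s+")) {
                // Skip empty tokens and any previous max-heap setting.
                if (!token.isEmpty() && !token.startsWith("-Xmx")) {
                    kept.add(token);
                }
            }
        }
        kept.add("-Xmx" + heap);
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Example input chosen for illustration only.
        System.out.println(withMaxHeap("-Xmx200m -server", "1024m"));
        // prints "-server -Xmx1024m"
    }
}
```

In a job driver, the resulting string would typically be passed to the job configuration, for example conf.set("mapred.child.java.opts", ChildJavaOpts.withMaxHeap(conf.get("mapred.child.java.opts", ""), "1024m")). Keep in mind that the value has to fit within the memory actually available on the task nodes.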
Thank you for reading! Please feel free to provide feedback on the format of the digests or anything else, really.