Hadoop Digest, March 2010

Main news first: Hadoop 0.20.2 was released! The list of changes may be found in the release notes here. Related news:

To get the most fresh insight on the 0.21 version release plans, check out this thread and the continuation of it.

More news on releases:

High availability is one of the hottest topics nowadays in Hadoop land. Umbrella HDFS-1064 JIRA issue has been created to track discussions/issues related to HDFS NameNode availability. While there are a lot of questions about eliminating single point of failure, Hadoop developers are more concerned about the minimizing the downtime (including downtime for upgrades, restart time) than getting rid of SPOFs, since high downtime is the real pain for those who manage the cluster. There is some work on adding hot standby that might help with planned upgrades. Please find some thoughts and a bit of explanation on this topic in a thread that started with “Why not to consider Zookeeper for the NameNode?” question. Next time we see “How Hadoop developers feel about SPOF?” come up on the mailing list, we’ll put it in a special FAQ section at the bottom of this digest. :)

We already reported in our latest Lucene Digest (March) about various Lucene projects starting discussions on their mailing lists about becoming Top Level Apache projects. This tendency (motivated by the Apache board’s warnings of Hadoop and Lucene becoming umbrella projects) raised discussions at HBase, Avro, Pig and Zookeeper as well.

Several other notable items from MLs:

  • Important note from Todd Lipcon we’d like to pass to our readers: avoid upgrading your clusters to Sun JVM 1.6.0u18, stick to 1.6.0u16 for a while which proved to be very stable. Please read the complete discussion around it here.
  • Storing Custom Java Objects in Hadoop Distibuted Cache is explained here.
  • Here is a bit of explanation of the fsck command output.
  • Several users shared their experience with issues running Hadoop on a Virtualized O/S vs. the Real O/S in this thread.
  • Those who think about using Hadoop as a base for academic research work (both students and professors) might find a lot of useful links (public datasets, sources for problems, existed researches) in this discussion.
  • Hadoop security features are in high demand among the users and community. Developers will be working hard on deploying authentication mechanisms this summer. You can monitor the progress via HADOOP-4487.

This time a very small FAQ section:

  1. How can I request a larger heap for Map tasks?
    By including -Xmx in mapred.child.java.opts
  2. How to configure and use LZO compression?
    Take a look at http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/.

Thank you for reading us! Please feel free to provide feedback on the format of the digests or anything else, really.

Apache ZooKeeper 3.3.0

3 Responses to Hadoop Digest, March 2010

  1. Ismael Juma says:

    Hi,

    “Important note from Todd Lipcon we’d like to pass to our readers: avoid upgrading your clusters to Sun JVM 1.6.0u18, stick to 1.6.0u16 for a while which proved to be very stable. Please read the complete discussion around it here.”

    Note that JDK6u19 has been released recently and it fixes the major problem with u18 (which incidentally was mentioned in the release notes with a workaround: -XX:-ReduceInitialCardMarks).

    Best,
    Ismael

  2. Pingback: Hadoop Digest, May 2010 « Sematext Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,716 other followers

%d bloggers like this: