Hadoop Digest, July 2010
August 2, 2010
The 0.21 Hadoop release is getting closer: 0.21 Release Candidate 0 was published and tested. A number of issues were identified, and they define the roadmap to the next candidate. Tom White has been hard at work as the release engineer for the 0.21 release.
Community trends and discussions:
- Hadoop Summit 2010 slides and videos are available here.
- If you’re at the design stage of a Hadoop cluster that will work with text-based and/or structured data, you should read the “Text files vs SequenceFiles” thread.
- Thinking of decreasing HDFS replication factor to 2? This thread might be useful to you.
- Managing workflows of Sqoop, Hive, Pig, and MapReduce jobs with Oozie (Hadoop workflow engine from Yahoo!) is explained in this post.
- The 2nd edition of “Hadoop: The Definitive Guide” is now in production. Again, Tom White in action.
- How do you efficiently process (large) XML documents in Hadoop MapReduce?
Take a look at Mahout’s XmlInputFormat if StreamXmlRecordReader doesn’t do a good job for you. Mahout’s XmlInputFormat has received a lot of positive feedback from the community.
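To make the idea concrete, here is a minimal Python sketch of the general technique such record readers rely on: scan the input for a literal start tag and end tag and emit everything in between as one record. This is an illustration only, not Mahout's code; the function name `xml_records`, the `chunk_size` parameter, and the `<page>` tags are assumptions made up for the example.

```python
import io


def xml_records(stream, start_tag, end_tag, chunk_size=4096):
    """Yield each byte span from start_tag through end_tag, in input order.

    Illustrative sketch: tags are matched as literal bytes, so a start tag
    that carries attributes would need a different start_tag value.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        # Emit every complete record currently held in the buffer.
        while True:
            start = buf.find(start_tag)
            if start == -1:
                # Keep only a tail that could be the beginning of a start tag.
                keep = len(start_tag) - 1
                buf = buf[-keep:] if keep > 0 else b""
                break
            end = buf.find(end_tag, start + len(start_tag))
            if end == -1:
                # Record is incomplete; drop leading junk and read more.
                buf = buf[start:]
                break
            stop = end + len(end_tag)
            yield buf[start:stop]
            buf = buf[stop:]
        if not chunk:
            return


if __name__ == "__main__":
    sample = io.BytesIO(b"<doc><page>a</page>junk<page>b</page></doc>")
    for record in xml_records(sample, b"<page>", b"</page>"):
        print(record.decode("utf-8"))
```

Reading in fixed-size chunks keeps memory bounded by the largest single record rather than by the size of the whole document, which is the property that matters for large XML files.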
- What are the ways of importing data to HDFS from remote locations? I need this process to be well-managed and automated.
Here are just some of the options. First, look at the available HDFS shell commands. For large inter/intra-cluster copying, distcp might work best for you. For moving data from an RDBMS, check Sqoop. To automate moving (constantly produced) data from many different locations, refer to Flume. You might also want to look at Chukwa (a data collection system for monitoring large distributed systems) and Scribe (a server for aggregating log data streamed in real time from a large number of servers).
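For the first two options, a scheduled script can assemble the commands and hand them to the `hadoop` launcher. The sketch below only builds the argument lists for `hadoop fs -put` and `hadoop distcp`; the helper names, paths, and NameNode addresses are hypothetical, and actually running the commands (for example with `subprocess.run`) assumes the `hadoop` client is installed and configured on the machine.

```python
import shlex


def hdfs_put_command(local_path, hdfs_path):
    """Arguments for copying one local file or directory into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def distcp_command(src_uri, dst_uri):
    """Arguments for a distcp copy between (or within) clusters."""
    return ["hadoop", "distcp", src_uri, dst_uri]


def render(args):
    """Shell-safe rendering of a command, handy for logging before execution."""
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Hypothetical paths and NameNode addresses, for illustration only.
    print(render(hdfs_put_command("/var/log/app/2010-07-31.log", "/logs/app/")))
    print(render(distcp_command("hdfs://nn1:8020/logs", "hdfs://nn2:8020/logs")))
```

Keeping command construction separate from execution makes it easy to log or dry-run each transfer before a scheduler runs it.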
Hey, follow @sematext if you are on Twitter and RT!