Introducing Cloud MapReduce
February 11, 2010 2 Comments
The following post is the introduction to Cloud MapReduce (CMR) written by Huan Liu, CMR’s main author and the Research Manager at Accenture Technology Labs.
MapReduce is a programming model (borrowed from functional programming languages) and its associated implementation, and it was first proposed by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production index system was converted to MapReduce. Since then, it is quickly proven to be applicable to a wide range of problems. For example, there are roughly 10,000 MapReduce programs written in Google by June 2007, and there are 2,217,000 MapReduce job runs in the month of September 2007. MapReduce also has found wide application outside of the Google environment.
Cloud MapReduce is another implementation of the MapReduce programming model. Back in late 2008, we saw the emergence of a cloud Operating System (OS) — a set of software managing a large cloud infrastructure rather than an individual PC. We asked ourselves the following questions: what if we build systems on top of a cloud OS instead of directly on bare metal? Can we dramatically simplify system design? We thought we will try implementing MapReduce as a proof of concept; thus, the Cloud MapReduce project was born. At the time, Amazon was the only one that has a “complete” cloud OS, so we built on top of it. In the course of the project, we encountered a lot of problems working with the Amazon cloud OS, most could be attributed to the weaker consistent model it presents. Fortunately, we were able to work through all issues and successfully built MapReduce on top of the Amazon cloud OS. The end result surprises us somewhat. We are not only able to have a simpler design, but we are also able to make it more scalable, more fault tolerant and faster than Hadoop.
Why Cloud MapReduce
There are already several MapReduce implementations, including Hadoop, which is already widely used. So the natural question to ask is: why another implementation? The answer is that all previous implementations essentially copy what Google have described in their original MapReduce paper, but we need to explore alternative implementation approaches for the following reasons:
1) Patent risk. MapReduce is a core technology in Google. By using MapReduce, Google engineers are able to focus on their core algorithms, rather than being bogged down by parallelization details. As a result, MapReduce greatly increases their productivity. It is no surprise that Google would patent such a technology to maintain its competitive advantage. The MapReduce patent covers the Google implementation as described in its paper. Since CMR only implements the programming model, but has a totally different architecture and implementation, it poses a minimal risk w.r.t. the MapReduce patent. This is particularly important for enterprise customers who are concerned about potential risks.
2) Architectural exploration. Google only described one implementation of the MapReduce programming model in its paper. Is it the best one? What are the tradeoffs if using a different one? CMR is the first to explore a completely different architecture. In the following, I will describe what is unique about CMR’s architecture.
Architectural principle and advantages
CMR advocates component decoupling: separate out common components as independent cloud services. If we can separate out a common component as a stand-alone cloud service, the component not only can be leveraged for other systems, but it can also evolve independently. As we have seen in other contexts (e.g., SOA, virtualization), decoupling enables faster innovation.
CMR currently uses existing components offered by Amazon, including Amazon S3, SimpleDB, and SQS. By leveraging the concept of component decoupling, CMR achieves a couple of key advantages.
- A fully distributed architecture. Since each component is a smaller project, it is easier to build it as a fully distributed system. Amazon has done it for all its services (S3, SimpleDB, SQS). Building on what Amazon has done, we are able to build a fully distributed MapReduce implementation with only 3,000 lines of code. A fully distributed architecture has several advantages over a master/slave architecture. First, it is more fault tolerant. Many enterprise customers are not willing to adopt something with a single point of failure, especially for their mission critical data. Second, it is more scalable. In one comparison study, we are able to stress the master node in Hadoop so much that CMR has a 60x performance advantage.
- More efficient data shuffling between Map and Reduce. CMR uses queue as the intermediate point between Map and Reduce, which enables Map to “push” data to Reduce (rather than Reduce “pulls” data from Map). This design is similar to what is used in parallel databases, so it inherits the benefits of efficient data transfer as a result of pipelining. However, unlike in parallel databases, by using tagging, filtering and a commit mechanism, CMR still maintains the fine grain fault tolerance property offered by other MapReduce implementations. The majority of CMR’s performance gain (aside from the 60x gain which is from stressing the master node) comes from this optimization.
CMR is particularly attractive in a cloud environment due to its native integration with the cloud. Hadoop, on the other hand, is designed for a static environment inside an enterprise. If run in a cloud, Hadoop introduces additional overhead. For example, after launching a cluster of virtual machines, Amazon’s Elastic MapReduce (pay-per-use Hadoop) has to configure and setup Hadoop on the cluster, then copy data from S3 to the Hadoop file system before it can start data processing. At the end, it also has to copy results back to S3. All these steps are additional overheads that CMR does not incur, because CMR starts processing right away on S3 data directly, no cluster configuration and data copying are necessary.
Although CMR currently only runs on Amazon components, we envision that it will support a wide range of components in the future, including other clouds, such as Microsoft Windows Azure. There are a number of very interesting open source projects already, such as jclouds, libcloud, deltacloud and Dasein, that are building a layer of abstraction on top of various cloud services to hide their differences. These middleware would make it much easier for CMR to support a large number of cloud components.
At the same time, we are also looking at how to build these components and deploy them locally inside an enterprise. Although several products, such as vCloud and Eucalyptus, provide cloud services inside an enterprise, their current version is limited to the compute capability. There are other cloud services, such as storage and queue, that an enterprise has to deploy to provide a full cloud capability to its internal customers. At Accenture Technology Labs, we are helping to address some pieces of the puzzle. For example, we have started a research project to design a fully distributed and scalable queue service, which is similar to SQS in functionality, but exploring a different tradeoff point.
Synergy with Hadoop
Although, on the surface, CMR may seem to compete with Hadoop, there are actually quite a bit of synergies between the two projects. First, they are both moving to the vision of component decoupling. In the recent 0.20.1 release of Hadoop, the HDFS file system is separated out as an independent component. This makes a lot of sense because HDFS is useful as a stand alone component to store large data sets, even if the users are not interested in MapReduce at all. Second, there are lessons to be learned from each project. For example, CMR points the way on how to “push” data from Map to Reduce to streamline data transfer without sacrificing fine-grain fault tolerance. Similarly, Hadoop supports rich data types beyond simple strings, which is something that CMR will for sure inherit in the near future.
Hopefully, by now, I have convinced you that CMR is something that is at least worth a look. In the next post (coming in two weeks), I will follow up with a practical post showing step by step how to write a CMR application and how to run it. The current thinking is that I will demonstrate how to perform certain data analytics with data from search-hadoop.com, but I am open to suggestions. If you have suggestions, I would appreciate if you could post them in comments below.