It's been nearly 10 years since Google published MapReduce: Simplified Data Processing on Large Clusters. The seminal paper, which describes first "mapping" or sorting the data and then "reducing" or summarizing it, altered the data processing landscape, helping pave the way for Apache Hadoop, the popular open source, distributed computing system.
"Google solved a problem I was having," Doug Cutting, chief architect at Cloudera and creator of Apache Hadoop, said recently. Cutting was working on Nutch at the time, an open source Internet search engine project. And the Google paper described "the same data flow I was already doing, the same way of partitioning and processing data, but with an automated framework to manage storage, sequencing computations, failure of computation -- so that you could pretty much run it hands off," he said.
Cutting's history lesson was delivered during a recent panel discussion hosted by The Hive, an incubator for big data startups in Silicon Valley. The panel was moderated by Gartner analyst Nick Heudecker, and the question before panelists was what lies beyond MapReduce. The short answer is perfection, or at least evolution.
Before MapReduce, a file system
First, the success of the next big thing after MapReduce hinges on the stability of another software layer, the file system. "It's one of the most critical parts of any distributed framework," said M.C. Srivas, CTO and co-founder at MapR.
Before data can be processed, it needs to be ingested -- a setup CIOs know well. In the Apache Hadoop ecosystem, that task is relegated to the Hadoop Distributed File System, or HDFS. Unlike a relational database, HDFS is indiscriminate about data type, a defining characteristic one panelist called "the power of Hadoop."
"Relational databases, by virtue of being rectangular, force you to stick to certain data types," said Shankar Venkataraman, chief architect for the IBM InfoSphere BigInsights platform. "[Hadoop] doesn't define a shape or format for the data."
Another benefit of HDFS? Data of varying types can be collected in the same place. Combined with cheap processing power, "you start to think about what you can do with all of it," Cutting said. "Falling out of that explosion is different kinds of processing frameworks."
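The idea the panelists describe, storing data without a predefined shape and interpreting it only when it is read, is often called schema on read. A minimal sketch of that idea follows, in plain Python rather than any Hadoop API; the sample records and the function name `read_with_schema` are hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw records of mixed shape, stored as-is with no table schema.
RAW_RECORDS = [
    '{"user": "ana", "clicks": 3}',
    'GET /index.html 200',
    '{"sensor": "t1", "temp_c": 21.5}',
]

def read_with_schema(raw_lines):
    """Interpret raw lines at read time; keep anything that is not JSON as text."""
    parsed = []
    for line in raw_lines:
        try:
            # Structured records are decoded only when someone reads them.
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            # Unstructured lines are still kept, just wrapped as raw text.
            parsed.append({"raw_text": line})
    return parsed

if __name__ == "__main__":
    for record in read_with_schema(RAW_RECORDS):
        print(record)
```

A relational table would have rejected the log line or forced all three records into the same columns; here the storage step accepts everything and the reader decides how to interpret it.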
No panelist was so bold as to say relational database management systems don't have a place in the big data world, but "there are a lot of things you can't do effectively in the database," Cutting said. Full text search, for example, "is done very poorly in relational databases," he said. Having a "wider palette" of tools opens different doors.
MapReduce and its imperfections
By itself, though, MapReduce is still an imperfect big data technology, especially as businesses delve into the Internet of Things or want to get their hands on real-time analytics. It essentially does two things: It massively processes data in parallel and then aggregates that processing, explained Arun Murthy, founder and architect at Hortonworks. "At its heart, MapReduce is a simple paradigm," he said. "In reality, any real-world data app needs more than these two simple steps."
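The two steps Murthy describes can be sketched in a few lines. The example below is a single-process word count in plain Python, not Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce calls in parallel across many machines and handle failures along the way.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: summarize all values seen for one key."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Anything that does not fit this map-then-aggregate shape, such as iterative or streaming work, has to be expressed as a chain of such jobs, which is the limitation the panelists go on to discuss.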
The software community appears to agree with Murthy. Last year, Apache Hadoop YARN (Yet Another Resource Negotiator) became a fundamental part of Hadoop 2.0. Most of the panelists characterized YARN, which sits between HDFS and things like MapReduce, as a game changer. Because YARN takes the job of resource management away from MapReduce, new kinds of processing engines, such as stream processing engines, can be introduced. That means users are no longer limited by MapReduce's known latency issues.
In the last 10 years, "even at Google, there's been an evolution of processing paradigms after MapReduce," said Matei Zaharia, CTO at Databricks. And other tech giants have tinkered with the tool. In 2007, Microsoft released Dryad, which Zaharia described as a general task execution model that extended MapReduce.
And Zaharia is taking a stab at it, too. When he was a doctoral student at the University of California at Berkeley's AMPLab, he developed an in-memory processing engine known as Spark. Spark, which became an Apache top-level project earlier this year, is generating a fair amount of buzz these days, with some suggesting it could be the system to supplant MapReduce. "The specific goal of Spark is to unify a lot of the workloads that require separate systems today like SQL, stream processing and batch processing and let you do them in one system," he said.
What is beyond MapReduce?
Advances like Spark and YARN suggest there could come a time when MapReduce plays a less prominent role in the Hadoop ecosystem -- and soon. No one on the panel explicitly said that, but the vibe from these leading Apache Hadoop lights was that MapReduce's hegemony may have run its course.
Likewise, any soothsaying was given in broad strokes. Milind Bhandarkar, chief scientist at Pivotal, for one, believes that as Hadoop evolves, so too will more established technologies such as the relational database. In fact, the way he sees it, both traditional and next-generation developers are climbing the same mountain, but they aren't climbing the same side. "What we are doing today is basically building tunnels within this mountain so that we can move people from this side to that," he said. He pointed to the push to bring SQL to Hadoop as an example. At the summit is "a perfect data system," he said.
Cutting disagrees. "The notion that there's one system that's going to rule them all, I don't buy," he said. The technology will continue to evolve "and new, better things will arrive," he said. Imperfection continually perfected sounds pretty good.