While soccer fans heatedly debate who will win the World Cup, a small Cambridge, Massachusetts, travel startup is closely following the action in order to answer a different question: How will game results affect where people travel?
"What happens when somebody scores a goal?" asked Patrick Surry, who heads the data science program at Hopper, an online travel search and recommendation site built on open source software. "Does that also cause an increase in demand for flights?" Or, what happens when a team qualifies for the next round? "Now lots of people need to move from one city to another inside Brazil," the World Cup's host country, Surry said.
To determine how one moment in time, compared with another, affects the price of travel, Surry and his data science team analyze mounds of flight and event data, and they do so in an ad hoc fashion -- that is, by analyzing the data as needed. True to analytics fashion, the analysis for one question can raise another and so on. That kind of iterative analytics can be time-consuming with MapReduce, a well-known open source data processing engine. To avoid latency, Surry and his team are experimenting with a relatively new kid on the block: Apache Spark, an in-memory, large-scale data processing engine that's beginning to garner attention from big data practitioners.
Spark vs. MapReduce
Developed at the University of California, Berkeley's AMPLab and released as an open source project in 2010, Spark isn't beholden to the read/write processes that define MapReduce, the engine that's become synonymous with Hadoop.
MapReduce batch-processes data in two steps: In very simple terms, the data is first "mapped," or transformed into key-value pairs that are sorted by key, and then the data is "reduced," or summarized. That two-step routine causes latency. "When you're processing data in MapReduce today, every time you go from a map task to a reduce task, data has to get written off to disk," Nick Heudecker, a Gartner analyst, said. "And that takes time."
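The two-step pattern can be pictured with a toy example. What follows is a minimal sketch in plain Python, not Hadoop code and not Hopper's pipeline: the search records, the airport codes and the function names (`map_phase`, `spill_to_disk`, `reduce_phase`, `run_job`) are all invented for illustration. It only shows the shape of the problem Heudecker describes, in which the output of the map step is written to disk before the reduce step can read it back.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical flight-search records: (origin airport, destination airport).
SEARCHES = [
    ("BOS", "GIG"),
    ("JFK", "GRU"),
    ("BOS", "GRU"),
    ("SFO", "GIG"),
    ("BOS", "GIG"),
]


def map_phase(records):
    """Map step: emit one (key, 1) pair per record, keyed by origin."""
    return [(origin, 1) for origin, _destination in records]


def spill_to_disk(pairs, path):
    """Write the intermediate pairs to disk, as happens between map and reduce."""
    with open(path, "w") as handle:
        json.dump(pairs, handle)


def load_from_disk(path):
    """Read the intermediate pairs back before reducing."""
    with open(path) as handle:
        return [tuple(pair) for pair in json.load(handle)]


def reduce_phase(pairs):
    """Reduce step: sort by key, then sum the values for each key."""
    totals = defaultdict(int)
    for key, value in sorted(pairs):
        totals[key] += value
    return dict(totals)


def run_job(records):
    """Run map, write to disk, read back, then reduce."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.json")
        spill_to_disk(map_phase(records), path)
        return reduce_phase(load_from_disk(path))


if __name__ == "__main__":
    print(run_job(SEARCHES))
```

In this sketch the disk round trip is trivial, but on billions of records, and repeated at every stage of a multi-step query, that write-then-read cycle is where the time goes.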
Surry has encountered that exact problem when using open source technologies like Hive, data warehouse software that brings a SQL-like interface to MapReduce queries. "Each step in the query might require read and write of intermediate stuff to disk, so you end up going through a lot of processing," he said. "The pipeline is long."
But that's not the only issue. For the kinds of questions Surry is asking, analysis doesn't run from beginning to end; the process tends to be iterative. Going back to the World Cup example, Surry said, "We're running these queries to say, 'What is the search demand from U.S. origins to Brazil?'" They gather up flight data -- as much as several billion flights at the outset -- over a designated time period and run an analysis.
If Surry and his team want to break the analysis down even further and pinpoint if, for example, some parts of the U.S. are more interested or less interested in particular teams, the source data for that analysis doesn't change. "It's reading all of the same flight history from the same time period. So the question is almost the same," Surry said.
The only thing that changes is the lens being applied to the data. But "in a Hive world, we have to go back, and it's going to start again from scratch with the original flight data, do all of those intermediate steps and then our final aggregation," he said.
Speed up with Spark?
That's where Spark could be a game changer. It still processes data in batch, but it caches those intermediate steps, essentially capturing them in random-access memory, or RAM, rather than on disk. "So if you run another MapReduce job, which is using the same input table, then you'll have the table already in memory," Surry said. He estimates that for these kinds of analyses, Spark could shave off 90% of the processing time.
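The idea of keeping an intermediate result in memory so that a follow-up question can reuse it can be sketched in a few lines. This is an illustration in plain Python rather than Spark's actual API, and it makes several assumptions: the `FlightStore` class, its `to_country` method, the record layout and the sample rows are all hypothetical. The `scans` counter stands in for the expensive pass over the full flight history.

```python
from collections import Counter

# Hypothetical flight-search rows: (origin region, destination country, team followed).
FLIGHTS = [
    ("Northeast", "BR", "USA"),
    ("Northeast", "BR", "Germany"),
    ("West", "BR", "USA"),
    ("South", "FR", "USA"),
    ("West", "BR", "Brazil"),
]


class FlightStore:
    """Toy data store that can keep a filtered subset in memory for reuse."""

    def __init__(self, rows):
        self._rows = rows
        self._cache = {}
        self.scans = 0  # counts full passes over the source data

    def to_country(self, country, use_cache=True):
        """Return rows bound for `country`, scanning the source only when needed."""
        if use_cache and country in self._cache:
            return self._cache[country]
        self.scans += 1  # the expensive step: reading everything again
        subset = [row for row in self._rows if row[1] == country]
        if use_cache:
            self._cache[country] = subset
        return subset


def demand_by_region(rows):
    """First lens: count searches per origin region."""
    return Counter(row[0] for row in rows)


def demand_by_team(rows):
    """Second lens: count searches per team, over the same subset."""
    return Counter(row[2] for row in rows)


if __name__ == "__main__":
    store = FlightStore(FLIGHTS)
    print(demand_by_region(store.to_country("BR")))
    print(demand_by_team(store.to_country("BR")))
    print("full scans with caching:", store.scans)
```

With caching on, the second question reuses the subset built for the first, so the source data is scanned once. Calling `to_country` with `use_cache=False` mimics the start-from-scratch behavior Surry describes, with one full scan per question.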
Spark's in-memory processing capability has some talking heads and vendors labeling it the next big thing. Mike Olson, chief strategy officer at Cloudera, called it "the successor to MapReduce" in a Dec. 2013 blog post. "Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads and to do so much faster than the older system," he wrote. He's not the only one. In the last two months, employees at Intel, NuoDB and LevelTrigger have mentioned the MapReduce migration to Spark at local conferences.
Along with the excitement for improved efficiency, Gartner's Heudecker also sounds a word of caution. "When MapReduce came out, it was the wonder drug. The same is being said for Spark," he said. "But Spark is very early at this point."
Heudecker has yet to take any inquiries from clients about Spark, which he believes still "needs to mature," especially for enterprise CIOs. Major topics such as governance and security "are things that, frankly, are not well-addressed, if addressed at all, in the Hadoop-verse today," Heudecker said. "Now those things have to be developed in the Spark ecosystem." And, while Spark might be a tool for the data scientist, he's unsure "if it's ready for a general-purpose developer to start diving into right now," he said.
Still, given time, he believes the technology has promise. And so do legacy vendors such as SAP and IBM, both of which have plans to support Spark.
Is adopting Spark worth it right now? CIOs need to look at what they already have and consider how something like Spark can be integrated into the ecosystem, especially when it's likely that the successor to Spark will hit shelves in a matter of years. "There's a continual starting over that businesses have to rationalize," Heudecker said. And they need to ask themselves, "What kind of competitive advantage can I get?"