This content is part of the Essential Guide: Enterprise data analytics strategy: A guide for CIOs

Is Apache Spark the next big thing in big data analytics?

At Spark Summit East, Databricks talked up a storm about its new Spark cloud offering in an effort to distinguish the data processing engine from MapReduce and the Hadoop stack: The Data Mill reports.

Spark Summit East in New York City proved businesses are applying Apache Spark, an open source big data processing engine, to problems that live on the cutting edge: How can you capture the beginnings of a distributed denial-of-service attack on the Bitcoin network? How can cars be connected to the Internet (of people and of things) in a meaningful way? How can you identify and map sophisticated money laundering rings?

The interest was not just from businesses born digital or vendor businesses building technology on top of Spark. Novartis, Comcast and Goldman Sachs were all present and singing Spark's praises. But the conference at times felt more like a show for Databricks, a company providing a commercial offering of the technology, than a Spark convention. While Databricks doesn't want Spark to be seen as a Hadoop "frenemy," it doesn't want it to be tethered to the Hadoop stack either. At the Spark Summit, Databricks made its case, which hinges, to a large extent, on the cloud.

A spark is born

Databricks was founded by the creators of Apache Spark, which began at the UC Berkeley AMPLab as a modern data processing engine. Since its inception, the technology has been compared and contrasted with MapReduce, the original data processing engine in the Hadoop stack. While MapReduce brought the processing of large data sets on a distributed computing framework into the spotlight, the technology has been plagued by its speed -- or lack thereof. MapReduce processes data in batches, which knocks it out of the running for stream processing (kind of a big deal for Internet of Things projects). And it doesn't process data in-memory, instead writing intermediate results to disk between steps, which (as big data scientists know well) makes asking iterative questions of a distributed data set time-consuming.
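To make the in-memory point concrete, here is a minimal sketch in plain Python. It is not Spark or MapReduce code, and ToyDataset, keep_in_memory, scan and ask_questions are hypothetical names invented for illustration. The sketch only counts how often a data set has to be reloaded when several questions are asked of it in a row.

```python
class ToyDataset:
    """Hypothetical stand-in for a data set on disk (not a Spark API)."""

    def __init__(self, records, keep_in_memory=False):
        self._on_disk = list(records)
        self._keep_in_memory = keep_in_memory
        self._in_memory = None   # filled on the first scan if caching is on
        self.disk_reads = 0      # how many times the data was reloaded

    def scan(self):
        # Serve from memory when a cached copy already exists.
        if self._in_memory is not None:
            return self._in_memory
        # Otherwise simulate a full read from disk.
        self.disk_reads += 1
        data = list(self._on_disk)
        if self._keep_in_memory:
            self._in_memory = data
        return data


def ask_questions(dataset, thresholds):
    """One question per threshold: how many records exceed it?"""
    return [sum(1 for x in dataset.scan() if x > t) for t in thresholds]


records = range(1, 11)       # toy data: the integers 1 through 10
thresholds = [2, 5, 8]       # three iterative questions

batch_style = ToyDataset(records)                        # reloads every time
cached_style = ToyDataset(records, keep_in_memory=True)  # loads once

batch_answers = ask_questions(batch_style, thresholds)
cached_answers = ask_questions(cached_style, thresholds)

print(batch_answers, batch_style.disk_reads)
print(cached_answers, cached_style.disk_reads)
```

Both runs return the same answers; the difference is that the first reloads the data for every question while the second loads it once. On a real cluster, that reload is the expensive part.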

MapReduce, in other words, is far from perfect, paving the way for a second-generation processing engine like Spark. "MapReduce is an implementation of a design that was created more than 15 years ago," Patrick Wendell, co-founder of Databricks, said at the summit. "Spark is a from-scratch reimagined or re-architecting of what you want out of an execution engine given today's hardware."

What's more, Databricks is convinced that Spark, while compatible in a Hadoop environment, is destined to play a much bigger role in the big data ecosystem. To wit: "I do think Spark has a role to play and a life that's outside of the Hadoop environment. I hope Spark transcends the label, and I think to a large extent, we've done that," Wendell said.

Last spring, Databricks formed a partnership with DataStax, a vendor that sells a commercial distribution for the NoSQL database Apache Cassandra. And in the fall, it rolled out Databricks Cloud, a kind of big-data-as-a-service offering that combines Spark with Amazon S3 (not Hadoop) and is still in limited availability. Rumor has it, the cloud offering will eventually work with Google Compute Engine and Microsoft Azure. While MapReduce requires a certain level of skill, Databricks is attempting to make Spark accessible to a broad user base by, in part, providing high-level as well as low-level APIs. (FYI, high-level APIs are meant for users who aren't well-versed in data science or distributed systems but could benefit from something like a complex machine learning algorithm. And really, who couldn't?)

A familiar plot

The upshot of all this Spark Summit East hoopla? For CIOs and, more likely, for the data analysts who are really keeping track of how big data technology is evolving, Spark will feel like the Hadoop sequel. Only this time it's Databricks touting the virtues of a big data processing engine that's going to change the face of enterprise analytics while Cloudera, the original Hadoop distributor, plays a supporting role. And this time it's Databricks reassuring the audience that even "normal" companies (to use Cloudera's language from an early Hadoop World conference) will benefit from this next big thing in big data processing: Novartis and Comcast were trotted out at Spark Summit East, as well as lesser-known names like Automatic and Shopify.

But the summit speak was also festooned with Spark testimonials like the one from Abhishek Mehta, founder and CEO at Tresata: "I think anything interesting going on in big data ecosystem is being done in Spark." And, from Goldman Sachs' Matt Glickman, Spark is "the future" and "the lingua franca of big data analytics." Or this one from George Mathew, COO at Alteryx, a vendor that's attempting to bring the programming language R and big data analytics to the masses. Alteryx tried scaling out R with MapReduce but ran into problems.

"They say there's no reference to hell in the Old Testament," Mathew said, "but when we went down this path with R and MapReduce -- that was an interesting experience for us to see how fundamentally unready MapReduce was to introduce other kinds of general purpose computing capabilities."

Harsh. Such is the life of a big data star, now being eclipsed by Spark -- at least in the minds of some big data enthusiasts.

Let us know what you think of the story; email Nicole Laskowski, senior news writer, or find her on Twitter @TT_Nicole.
