Apache Spark, an open source big data processing engine, is trying hard to become the new darling of big data. But is the technology enterprise-ready? The answer: It’s getting there.
Databricks, founded by the inventors of Apache Spark to provide a commercial offering of the technology, made it clear at Spark Summit East in New York City that enterprise-readiness will be a major focus for the company over the next six to 12 months. In fact, the company has already started down the enterprise-readiness road. Last fall, Databricks began providing a limited cloud offering of Spark on Amazon S3.
“Our vision with Databricks Cloud was to solve these problems, provide an integrated environment, security and so forth,” said Patrick Wendell, a Databricks co-founder. But, he added, there’s more to do.
During a summit panel discussion, Martin Van Ryswyk, executive vice president of engineering at DataStax, advised Wendell to think security. “When people want [an] enterprise [version], they want the kind of meat and potatoes features of a platform,” Van Ryswyk said. “The really cool groundbreaking functionality has got to be there, but they want it with a couple of basics: You need security.”
In more ways than one. Not only does the technology itself need to be secure, but the product also has to deliver, so that the enterprise customers who chose it don’t find their jobs in jeopardy. “They’re betting their company on you, and you can’t let them down,” Van Ryswyk said. “It’s got to be up, it’s got to be available, it’s got to be economical.”
Van Ryswyk knows what he’s talking about. He’s been helping make DataStax, a commercial provider of the open source distributed database system Apache Cassandra, enterprise-ready. “Over the last five years, we’ve taken Cassandra from a wild and wooly open source project to something that’s being used at some of the biggest companies in the world,” Van Ryswyk said. DataStax customers include Netflix, Thomson Reuters, eBay and ING.
One practical tip? Van Ryswyk said the company tests DataStax Enterprise on 1,000 nodes every day. “Under load, taking loads in and out of a cluster, injecting faults,” he said. “That’s the kind of things enterprises are going to do quickly.” He can back up the claim with his own experience with the technology. Three years ago, when he joined DataStax, customers used 30- to 40-node clusters on average. Today, it’s not unheard of for customers to use 1,000-node clusters. “That happens quickly as you become a chosen technology. You’ve got to be ready for it,” he said.