The first part of this two-part SearchCIO feature on big data obstacles looked at the challenges of moving big data from inside the enterprise out -- and vice versa. Here, the focus shifts to look at how some businesses are overcoming the challenges of moving big data across the enterprise.
Bottlenecks don't occur just when data is moved into or out of the enterprise; they can also happen when data is moved from one part of a corporate campus to another -- and that has some businesses going the route of the enterprise data hub.
In the case of Epsilon, a marketing services firm, a primary reason for embracing the enterprise data hub model had to do with the three Vs of big data: volume, velocity and variety.
"Standard database technologies don't have support for unstructured and semistructured data like we need, and didn't have the storage vehicle we wanted," Bob Zurek, senior vice president of products at the Wakefield, Mass.-based Epsilon, said. "In addition, it didn't seem to support our scaling requirements."
As businesses collect and keep more data, internal networks that connect, say, a database to an analytics platform can be easily overwhelmed by the volume. "Data growth for companies is in the 40%, 50%, 60% range per year typically," said Phil Shelley, president of big data consulting firm Newton Park Partners in Chicago, Ill., and former chief technology officer at Sears Holdings Corp. "That data is starting to become a big data challenge -- even if you're not doing anything crazy like [collecting data from] Twitter or Facebook."
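To put those growth rates in perspective, here is a rough compound-growth sketch. The starting volume and the function names are illustrative assumptions, not figures from Shelley; only the 40% to 60% annual range comes from the quote above.

```python
import math


def projected_volume(start_tb, annual_growth, years):
    """Compound a starting data volume (in TB) at a fixed annual growth rate."""
    return start_tb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth):
    """Years needed for data volume to double at a fixed annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # Hypothetical 100 TB starting point, using the growth range Shelley cites.
    for rate in (0.40, 0.50, 0.60):
        print(
            f"{rate:.0%} growth: {projected_volume(100, rate, 3):.0f} TB after 3 years, "
            f"doubles in about {doubling_time_years(rate):.1f} years"
        )
```

At any rate in that range, the volume an internal network has to carry doubles in roughly two years or less.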
That's partly because tools used to move data around the enterprise "have been growing out of control," Shelley said. When businesses began using transactional systems, for example, they realized the back-end databases collecting all of the real-time transactions needed to be kept small in order to remain responsive, he said. That's when businesses started using programming tools to perform extract, transform, load (ETL) functions -- where a subset of data is extracted from one system, transformed or converted into the desired format, and then loaded into another system.
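The extract, transform, load pattern Shelley describes can be sketched in a few lines. This is a minimal illustration, not any vendor's ETL tool: the table names, columns and the `min_cents` filter are hypothetical, and in-memory SQLite databases stand in for the transactional and downstream systems.

```python
import sqlite3


def make_source():
    """Build a small stand-in transactional database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, " alice ", 1250), (2, "bob", 300), (3, "carol ", 9900)],
    )
    conn.commit()
    return conn


def run_etl(source, target, min_cents):
    """Extract a subset of rows, convert their format, and load them elsewhere."""
    # Extract: pull only a subset of the data out of the source system.
    rows = source.execute(
        "SELECT id, customer, amount_cents FROM transactions WHERE amount_cents >= ?",
        (min_cents,),
    ).fetchall()

    # Transform: convert each row into the format the destination expects.
    transformed = [
        (row_id, customer.strip().upper(), cents / 100.0)
        for row_id, customer, cents in rows
    ]

    # Load: write the converted rows into the second system.
    target.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, customer TEXT, amount_dollars REAL)"
    )
    target.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)
    target.commit()
    return len(transformed)


if __name__ == "__main__":
    source = make_source()
    target = sqlite3.connect(":memory:")
    print(run_etl(source, target, 1000), "rows loaded")
```

Note that the destination holds a converted copy of a subset of the source, which is how chains of such jobs end up producing copies of copies.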
The result is a web of less-than-optimal data due to "complex chains of ETL jobs that are all tied together," Shelley said. Employees are, essentially, making copies of copies of data, which can lead to incomplete data sets. But these complex chains are also a drain on the network. "My ETL job may take minutes or even hours to run," he said. "The next one down the stream may also take minutes or hours and so on, creating data latency." More than that, ETL tools were designed for legacy systems and are buckling under the weight and pressure of big data.
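The latency Shelley describes adds up along the chain, because each job has to wait for the one upstream. A small sketch, assuming the jobs run strictly in sequence and using made-up run times:

```python
from itertools import accumulate


def chain_latency(job_minutes):
    """Return, for each job in a sequential chain, the minutes elapsed before its output is ready."""
    return list(accumulate(job_minutes))


if __name__ == "__main__":
    # Hypothetical run times, in minutes, for three chained ETL jobs.
    durations = [30, 90, 45]
    for position, ready_at in enumerate(chain_latency(durations), start=1):
        print(f"job {position} output is ready after {ready_at} minutes")
```

The last consumer in the chain sees data that is at least as old as the sum of every upstream job's run time.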
Vic Bhagat, executive vice president and CIO at Hopkinton, Mass.-based EMC Corp., explains that the problem is bigger than ETL. "Typical data movement techniques are single-threaded," he said. They use the operating system kernel networking stack, which manages input/output requests, and ship data around the enterprise in small packets. "The data may need to be transformed so it can be stored on the destination location in the 'right' way, in a structured database with certain formats and in files," he continued. "Consequently, moving terabytes or petabytes of data around can be slow and resource-intensive."
Smooth sailing on business data lake, or spoke on the wheel?
For today's businesses, where real-time decisions are playing an increasingly relevant role, draining the network by moving the data has become inefficient. That's led some businesses to embrace what Shelley referred to as an enterprise data hub and Bhagat calls the business data lake. Both models turn the ETL paradigm on its head. As with ETL, subsets of data are still extracted from systems streaming data into the enterprise, but rather than the data being transformed into the desired format, it's loaded into a data hub such as the Apache Hadoop platform. L comes before T, in other words.
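The "L comes before T" ordering can be sketched as follows. This is an illustration only: an in-memory SQLite table stands in for the data hub (the article's example is Hadoop), and the `raw_events` table, the source names and both function names are hypothetical.

```python
import json
import sqlite3


def load_raw(hub, source_name, records):
    """Load records into the hub exactly as they arrive, with no upfront transformation."""
    hub.execute("CREATE TABLE IF NOT EXISTS raw_events (source TEXT, payload TEXT)")
    hub.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [(source_name, json.dumps(record)) for record in records],
    )
    hub.commit()


def transform_in_hub(hub, source_name, field):
    """Transform later, inside the hub: pull one field out of the stored raw records."""
    cursor = hub.execute(
        "SELECT payload FROM raw_events WHERE source = ? ORDER BY rowid",
        (source_name,),
    )
    values = []
    for (payload,) in cursor:
        record = json.loads(payload)
        if field in record:  # raw records need not share a schema
            values.append(record[field])
    return values


if __name__ == "__main__":
    hub = sqlite3.connect(":memory:")
    load_raw(hub, "email", [{"user": "a", "opened": True}, {"user": "b"}])
    load_raw(hub, "social", [{"handle": "@c", "text": "hi"}])
    print(transform_in_hub(hub, "email", "user"))
```

Because the raw records are kept whole, a different transformation can be applied later without going back to the source system for another extract.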
Once the data is loaded into the hub, that's where it stays -- where it can be indexed, aggregated, tagged, segmented and maintained. While traditional data management models move the data to analytics or data mining tools, the enterprise data hub and the business data lake models are built so that the tools are brought to the data, freeing up the network and speeding up data accessibility.
Most businesses won't be celebrating just yet. The biggest of the big Web 2.0 companies might have found an answer to their big data in motion problems by building an enterprise data hub (Shelley held up Facebook as the exemplar), but getting to this laudable state could take years for traditional companies with critical legacy systems.
In a recent blog post, Merv Adrian, analyst at Stamford, Conn.-based Gartner, called the enterprise data hub aspirational marketing, adding, "It addresses the ambition its advocates have for how Hadoop will be used, when it realizes the vision of capabilities currently in early development. There's nothing wrong with this, but it does need to be kept in perspective. It's a long way off."
Adrian went on to write that the enterprise data warehouse is nowhere close to retirement and that the enterprise data hub will, at best, this year "begin to build a role as part of an enterprise data spoke in some shops."
Enterprise data hub in motion
When Epsilon set out to build its new marketing platform for its customers, it turned to Cloudera Inc., a Hadoop distributor.
Epsilon, which counts Ford, JPMorgan Chase and Dunkin' Donuts as customers, uses the enterprise data hub model to stream in customer data from email, mobile and social channels. Epsilon not only has to juggle terabytes of data, but it also has to deal with the fact that much of that data comes in the form of text. The technology underpinning its new platform needed to be able to ingest structured, unstructured and semistructured data quickly, a task at which Hadoop excels.
Epsilon also wanted its customers to have access to all of its data as soon as possible, according to product SVP Zurek. By pooling the data in one place, Epsilon customers can target a microdemographic on the fly, using segmentation tools built on top of the data hub.
"We found the capabilities provided by the enterprise data hub allow us to get at that, whereas if we were using legacy systems, we'd probably have to throw more CPU at that problem," Zurek said.