The ability to accurately interpret data is essential to deriving value from big data projects, but it's a rarefied skill. "In a big data world, the chance of creating false alarms and incidental moments is profound across all verticals," John Mattison, chief medical information officer at Kaiser Permanente, said during his presentation at the recent Big Data Innovation conference in Boston.
Despite the high margin for error, businesses are hungrier than ever to take on big data, according to David Corrigan, a director of marketing at IBM. But binging on data is leaving them feeling starved and full at the same time. Quoting results from a recent IBM survey, Corrigan said half of the respondents reported not having enough data to make decisions, but 60% of those same respondents reported having too much data in general. "It's almost like an all-you-can-eat buffet," said Corrigan, also a presenter at the Big Data Innovation conference. "[The business] wants all kinds of information; they're consuming and gorging on that information from anywhere, but they're not thinking about whether it's good for them or they're full or they can use it."
That all-you-can-eat approach could leave companies vulnerable to big data mistakes, particularly when combined with an architectural style some refer to as the data lake. For the uninitiated, a data lake collects enterprise data in one location (typically a Hadoop cluster), which means data is ingested quickly and kept in a fairly raw form, free from the constraints of a particular schema and available to everyone, according to Nick Heudecker and Andrew White.
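To make that "schema-on-read" idea concrete, here's a minimal Python sketch of the pattern; the record shapes, field names and helper functions are purely illustrative, not any particular product's API:

```python
import json

# A stand-in for raw lake storage (in practice, files on a Hadoop cluster).
lake = []

def ingest(raw_record: str) -> None:
    """Fast ingestion: store the record untouched, with no validation."""
    lake.append(raw_record)

def read_with_schema(fields):
    """A schema is imposed only at read time, by whoever queries the lake."""
    rows = []
    for raw in lake:
        record = json.loads(raw)
        # Missing fields simply come back empty -- nothing stopped them
        # from being absent at ingestion time.
        rows.append({f: record.get(f) for f in fields})
    return rows

ingest('{"store": "A", "sale": 12.5, "ts": "2014-11-01"}')
ingest('{"store": "B", "device_temp": 71}')  # a different shape, still accepted

print(read_with_schema(["store", "sale"]))
# → [{'store': 'A', 'sale': 12.5}, {'store': 'B', 'sale': None}]
```

Note what the sketch makes visible: nothing enforces consistency at write time, so the burden of making sense of the pooled data falls entirely on the reader -- which is exactly the governance worry raised below.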
Back in July, the two Gartner analysts struck a nerve when they expressed concern about data lakes in their research note "The Data Lake Fallacy: All Water and Little Substance." Soon enough, it felt like everyone was taking a side. As the analysts pointed out, data lakes often lack governance features (among other things), which means anyone diving into their pooled data should be adept at data manipulation and analysis.
In other words, the relatively uncharted waters of data lakes may not be for everyone in the enterprise. IBM's Corrigan said he's witnessing "a growing realization that data lakes are for super users, data scientists, people who know the difference between a good and false correlation."
It's that push-pull between access and governance that seems to be helping bubble up another marketing term du jour: the data refinery. In IBM's vision, the refinery lets businesses keep data in a close-to-raw format, refining it into a properly integrated, aggregated and governed state "automatically, on demand when the business user is asking for it," Corrigan said. (Sounds a little like an IT magic wand, right?)
It should be noted that the concept of a data refinery is not new. Hortonworks was on to the idea back in 2012, when Shaun Connolly, its vice president of corporate strategy, described the refinery as "a new system capable of storing, aggregating and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business."
IBM's refinery services, included in the Watson Analytics tool released just last month, are cloud-based. Corrigan said the services are so seamlessly integrated within an application that the business user -- or an application developer, for that matter -- isn't aware the refining is even happening.
Pentaho's streamlined data refinery
IBM wasn't the only vendor touting the benefits of a data refinery in the last couple of weeks. Chuck Yarbrough, director of big data product marketing at Pentaho, talked about the company's "streamlined data refinery" during a recent O'Reilly Media webcast.
"We view this as governed data delivery," he said during the webcast. "It's about blending, enriching vast, diverse data sets you have, providing a mechanism to orchestrate and apply governance to all of that data, and to deliver that data set in a way that it can easily be consumed by the user."
The most typical use cases for a streamlined data refinery -- and part of Pentaho's big data blueprint series -- are fraud detection, data on demand and forensic analysis. Yarbrough pointed to Paytronix Systems Inc., a software company specializing in customer loyalty programs, as an example.
"We're at the center of what's becoming the convergence of social, loyalty and mobile," Andrew Robbins, chief data scientist at Paytronix, said at last year's Strata Conference + Hadoop World in New York City. Paytronix works with thousands of restaurants, collecting swaths of data types, such as point of sale and even machine data. It's using the refinery for data processing and integration, which led to non-database administrators tackling ETL, Robbins said.
The data refinery down the road
In the future, Yarbrough sees the data refinery moving in the same direction that IBM's Corrigan talked about. Users will be able to make a data request that filters everything through a refinery, delivering a governed data set on demand.
Better still, Corrigan said, IT may be able to crowdsource business acumen. "Why couldn't business analysts leave a rating on data sets the same way all of you might go to TripAdvisor?" he said.
Time will tell if the data refinery is the fairy godmother some make it out to be, the sorcerer's apprentice run amok, or maybe just a rebranding of the same old same old.
Previously on The Data Mill
Got big data? You may need a data concierge
Malcolm Gladwell talks attitude
Stress test calls for IT-finance alliance