A common prescription for a big data headache is technology. "With volume, you can use Hadoop; with velocity, you can use Storm," Mark Schreiber, associate director of knowledge engineering at Novartis International AG, said.
But when it comes to variety, the "take two and call me in the morning" cure is bad medicine, potentially suppressing data exploration and creative thinking. Or that's how a panel of medical and pharmaceutical experts described it at last week's MIT Chief Data Officer and Information Quality Symposium.
Schreiber and his team discovered the particular challenge of data variety through trial and error. What was one of their biggest errors? Building a classic data warehouse. "It's a very seductive idea because it works so well in other industries," he said. "The problem is once you start building a data warehouse, you have to know what you're putting in it and what your question is up front."
This approach makes sense for industries such as retail, where "your warehouse's data models look like your business models," Schreiber said. "You're basically exploring your data model for your business." But scientific and medical research flips that paradigm on its head. "You're in the business of exploring your data model," he said, "to learn new things about the world."
By normalizing the data and bringing it into a data warehouse, researchers painted themselves into a corner, effectively constraining the kinds of questions they could ask as well as their "ability to innovate and learn new things," he said.
Another error? Schreiber and his team employed web forms with restrictive vocabularies to sidestep the immense variability of free-form text. But when text mining techniques were applied to a researcher's full text, "we extracted a lot more value than we'd get if we asked them to fill out a constrained form," he said.
Schreiber called it "a counterintuitive discovery," but as he explained, variety is a spectrum. The more structured the data, the less information it tends to carry. "Just as a rule of thumb, my observation has been that the level of information seems to go up as the structure goes down," Schreiber said. Highly structured data, such as sensor data, is "data at the lowest level," he said. For it to have any kind of value, it still "needs to be interpreted." Unstructured data, such as text used in PowerPoint slides, is "the result of an analysis and people's decisions around whether a particular drug should progress to the next phase of development," he said.
Parsing imprecise and biased language
Another fundamental challenge of big data variety is deciphering meaning. When Scotland developed a care management record, the two most common words to surface in the unstructured notes were "wife" and "alcohol."
"What's unclear is the relationship between those two words," said panelist John Halamka, CIO of Beth Israel Deaconess Medical Center and the Dean of Technology at Harvard Medical School. "Is it 'My wife is irritating, so I drink,' or is it 'I drink too much and my wife is anxious?'" (Turns out it's the latter.) But, Halamka continued, "It isn't sufficient to do a Google-like search or a frequency analysis of words. You actually have to understand subject-predicate relationships."
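Halamka's point can be illustrated with a toy example (the note text here is hypothetical, and this is a sketch, not any vendor's pipeline): a pure word-frequency count sees the same signal in two notes whose counts of "wife" and "alcohol" are identical but whose subject-predicate relationships are reversed.

```python
from collections import Counter

# Two hypothetical notes with opposite cause-and-effect but similar vocabulary.
note_a = "my wife is irritating so i drink alcohol"
note_b = "i drink too much alcohol and my wife is anxious"

freq_a = Counter(note_a.split())
freq_b = Counter(note_b.split())

# Frequency analysis finds the same counts for the signal words in both notes...
for word in ("wife", "alcohol"):
    print(word, freq_a[word], freq_b[word])

# ...so the two notes are indistinguishable on counts alone, even though
# who-is-doing-what-to-whom (the subject-predicate structure) is reversed.
assert freq_a["wife"] == freq_b["wife"] == 1
assert freq_a["alcohol"] == freq_b["alcohol"] == 1
```

Recovering the actual relationship requires parsing the sentence structure, which is exactly the step a "Google-like search" skips.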
Text analytics tools and natural language processing techniques have been around for decades. But the biomedical field is like any other profession: Language can be imprecise, and meaning can be biased by a particular moment, according to Schreiber. That leaves humans to do the deciphering. But as the volume and variety of data grow, having a human dig through text to connect the dots or explain the context simply doesn't work.
"If you imagine someone came to you years ago and said I've got a source of several billion documents -- they're unstructured, they contain spelling mistakes and different languages and I want you to come up with a way to index them all -- your old style reaction would be to say we're going to employ millions of librarians," he said. "That would be the right approach for a very small data set."
But that approach is not sufficient when it comes to big data, explained Shawn Murphy, director of Research Information Systems and Computing at Partners Healthcare. "Everything you think about with big data has to work at scale. That's the challenge -- to create solutions at scale. And we're not used to doing that."
Halamka, for one, sends unstructured notes to a third-party provider "through a secure cloud connection" to parse the text. Meaning is provided in the form of metadata, which "highlights what are considered diagnostic concepts," he said. The technology "looks at things like negation, verb tense," and it doesn't stumble over non-intuitive phrases such as "the patient whose mother had breast cancer is doing fine." "It classifies that concept as 'family history,'" Halamka said. The data analysis is then used by Beth Israel to enroll patients into a care management program.
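The kind of distinction Halamka describes can be sketched with a few hand-written rules. This is a deliberately toy, rule-based illustration (the cue lists and function name are invented for this example, not the vendor's actual NLP): before attributing a diagnostic concept to the patient, check for negation cues and family-member subjects.

```python
# Toy rule-based tagger: decide whether a mention of a diagnostic concept
# is negated, belongs to a family member, or applies to the patient.
NEGATION_CUES = ("no evidence of", "denies", "without")
FAMILY_CUES = ("mother", "father", "sister", "brother", "aunt", "uncle")

def classify_mention(sentence, concept):
    s = sentence.lower()
    if concept not in s:
        return "absent"
    if any(cue in s for cue in NEGATION_CUES):
        return "negated"
    if any(cue in s for cue in FAMILY_CUES):
        return "family history"
    return "patient diagnosis"

# The non-intuitive phrase from the article is tagged as family history,
# not as a diagnosis of the patient.
print(classify_mention(
    "The patient whose mother had breast cancer is doing fine.",
    "breast cancer"))  # family history
print(classify_mention(
    "No evidence of breast cancer on imaging.",
    "breast cancer"))  # negated
```

A production system would use trained models rather than cue lists, but the routing logic, concept first, then context, is the same idea the metadata encodes.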
Schreiber also advocates a technology-plus approach to big data variety. The real challenge, he said, is recognizing data that's meaningful. "A mistake is to just apply algorithms because you can only reach a certain level of accuracy," he said. "The other mistake is just to use expert curators because you can only achieve a certain volume with that." Today, Schreiber and his team are trying an approach "where we actually do both," he said.
The model includes a handful of "gatekeeper curators," who add human intelligence to the mix. "That seems to be the best approximation of how we can deal with this unstructured variability problem," he said. "Apply machine learning to the limit that you can and then bring in human intelligence in a scalable way."
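The "both" approach Schreiber describes can be sketched as confidence-based triage: accept machine labels above a threshold and queue the rest for a small pool of curators. This is a minimal sketch under stated assumptions; the scoring function, threshold, and labels here are stand-ins, not Novartis's actual system.

```python
# Hybrid sketch: machine learning to the limit it can reach, then
# human intelligence applied only where the machine is unsure.

def machine_label(document):
    # Stand-in for a real classifier: returns a label and a confidence.
    score = 0.9 if "structured" in document else 0.4
    return ("relevant", score)

def triage(documents, threshold=0.8):
    auto, review_queue = [], []
    for doc in documents:
        label, confidence = machine_label(doc)
        if confidence >= threshold:
            auto.append((doc, label))    # machine decision stands
        else:
            review_queue.append(doc)     # escalated to a gatekeeper curator
    return auto, review_queue

docs = ["structured sensor feed", "free-text research note"]
auto, queue = triage(docs)
print(len(auto), "auto-labeled;", len(queue), "sent to curators")
```

The design choice is the threshold: raise it and curators see more documents but accuracy improves; lower it and the system scales further on machine labels alone, which is the accuracy-versus-volume trade-off the panelists describe.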
At least, for now.