Solving big data mining challenges, it turns out, is not exactly rocket science -- and too bad it isn't. That's what I was thinking as I mulled over a recent seminar here in Boston on big data in the life sciences. Sponsored by the Mass Technology Leadership Council, the two-hour seminar was moderated by the global VP of Oracle's Health Sciences business unit, Kris Joshi, who happens also to have a doctorate in astrophysics. The panel was a trio of big-data bright lights: Michele Clamp, head of informatics and scientific applications at Harvard University; Matthew Trunnell, CIO at The Broad Institute (of human genome fame); and Peter Bergethon, head of neuroinformatics and computational neurology at Pfizer. Not a group that's easily stumped.
Yet, 15 minutes into the presentation, it was clear that Joshi et al. have met their match in big data. Oh, not every aspect of it. The infrastructure to handle the vaunted volume of big data -- the pipes, portals, repositories -- has been figured out. Harvard, for example, has a new data center in Holyoke, Mass., that will be able to handle 40,000 calls. It's the university's last, by the way, according to Clamp, because when this baby no longer computes, the next cluster will be in the cloud. In the big data puzzle, as Broad's Trunnell put it, infrastructure is "in many ways the easiest one to solve." Heck, the U.S. Department of Homeland Security is using Oracle's Exa-line products to help monitor a purported 50 billion transactions per day.
No, the hair shirt in using big data to actually get somewhere in the life sciences -- figuring out the molecular basis of Type 2 diabetes, or, say, how the brain works -- is neither the generation of data (it's coming in torrents) nor its storage, but the data mining tools to analyze it. "We as a field are relatively naïve in the way we represent and interact with the data," Trunnell said. The average lifespan of software in his field is maybe six months. "We don't have the luxury, or have not taken the effort to optimize algorithms for infrastructure," he said. Nor do researchers eager to publish their latest findings have much incentive to do so.
Hardware solutions to data mining challenges fall short
This being a panel that has no trouble thinking on its feet, Trunnell's complaint gave Oracle's Joshi an idea. Back in the 1990s, when physicists were working on tracking the gravitational interactions between celestial bodies to simulate the evolution of a galaxy, even the supercomputers couldn't keep up with every incremental gravitational pull. "Then some bright guys in Japan created a special-purpose computer, called a GRAPE [GRAvity PipE]," Joshi said. Realizing that a single equation, Newton's law of gravitation, was being computed over and over across a huge number of objects, they built a motherboard with chips designed from the ground up to work on that one calculation, with throughput "like nothing else." The N-body calculator could simulate the movements of thousands of stars -- not reality, to be sure, but still a huge step in the right direction, he said. "Why not have a special motherboard that does nothing but genome alignment?" he asked. Put the calculators close to the sequencing machine, so that as the gene sequences are spun out, they are aligned in real time!
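For readers curious what "one equation, computed over and over" looks like, here is a minimal software sketch of the pairwise gravity calculation at the heart of an N-body simulation. It is an illustration only, not the GRAPE design: the function name, the plain-list data layout and the optional softening parameter are all assumptions made for this example.

```python
import math

G = 6.674e-11  # Newton's gravitational constant, m^3 kg^-1 s^-2


def accelerations(positions, masses, softening=0.0):
    """Gravitational acceleration on each body, by direct summation.

    positions: list of (x, y, z) coordinates in meters
    masses:    list of masses in kilograms
    softening: optional length (an assumption of this sketch) that
               keeps the force finite when two bodies nearly coincide
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            # Vector pointing from body i to body j
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + softening ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            # The same equation every time: a_i += G * m_j * r_ij / |r_ij|^3
            for k in range(3):
                acc[i][k] += G * masses[j] * dx[k] * inv_r3
    return acc
```

The inner loop repeats the same arithmetic for every pair of bodies, so the work grows with the square of the number of objects. That repetition is what makes the problem a natural fit for hardware built to do only this one calculation.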
The astrophysicist's lightbulb moment didn't generate much enthusiasm. That would be a great solution, but for one and only one data type. "The problem is how fast the algorithms are changing … and if they are really changing in a way we need to pay attention to every six months," Harvard's Clamp said. The reality that N-body motherboards don't solve, it seems -- for the time being, anyway -- is that every data set requires its own custom analysis; and in some endeavors -- curing cancer or training Olympic swimmers, where human judgment is required -- these analytic approaches are not yet well-developed.
"In the financial business, if you count a penny or a piece of a penny, you know what it is," Pfizer's Bergethon explained. Not so in brain research, where getting to the data point itself is not as accurate as people might suppose. "One of the challenges in building big data sets, which are intrinsically noisy, is that you then have to make some assumptions. As soon as the technology gets better, the question gets deeper and then you discover more variability in the system," he said. No one cares what the gene is anyway. "It's about how they are feeling or thinking or driving. How do you measure that?"
Data visualization at the front end
The problem boils down to how to build powerful analytical tools that can be used to reveal what is in the data. That doesn't mean finding what's included in the data, but what's hidden in a data set that's variable and volatile. Rigid data models won't suffice. But the "fuzzier, more fluid data models" that big data experts find enticing aren't there yet. "What they are, I don't know," Broad's Trunnell said. (Even Oracle's Joshi marvels at companies that "pay millions of dollars" for systems "that are like handcuffs. Most data is sitting in models that are not flexible at all. You can't use the data!") And don't look to Google-ized engines for help. Some of the use cases with big data are search-oriented, Joshi noted, but search is not analysis. "That gives a broad but thin view," he said. Plus, searches require a hypothesis. Ideally, these data mining tools should be generating hypotheses rather than just testing yours.
One way to allow the data to reveal itself is to tap into the strongest human analytical sense -- our vision: Solve the data mining challenges by building front-end visualization interfaces that present the data in a model that humans can not only easily apprehend but also interact with. More successful so far in the movies (think Tom Cruise in Minority Report) than in real labs, this is easier said than done -- and not just because the software is a challenge. "We're still trying to convince people that visualization is better than looking at a graph or 25 windows on their desktop," Pfizer's Bergethon said. Researchers will play with the machines, à la Cruise, "and then go back to their desktops and work on a sketch pad."
Calling all CIOs and IT bright lights: As with so many aspects of IT services today, the challenge is not on the back end but on the front end -- that is, finding user-friendly tools to help humans make sense of big data. We can collect, we can store; but in order to capture those insights from information that we all pay lip service to, it's up to you to find a way to make the data presentable to us. Otherwise, the data will keep piling up, expertly corralled but yielding nothing.
Let us know what you think about the story; email Linda Tucci, Executive Editor.