Martin Leach knows what it's like to be up to his neck in data. Until recently, he was CIO at The Broad Institute of MIT and Harvard, the biomedical research organization in Cambridge, Mass., that played a major role in mapping the human genome. There, he was responsible for more than 13 petabytes of storage and for compute "farms" totaling more than 10,000 cores.
Before The Broad Institute, he led the IT team that supported the research group for discovery and pre-clinical sciences at Merck, the pharmaceutical giant. Now, in his new role as vice president of research and development IT at the biotech pioneer Biogen Idec, Leach leads the buildout of a new data sciences group. Formed to support Biogen Idec research, in time the group will assist business groups across the company on "big data" projects.
Just before he left the not-for-profit Broad Institute for Biogen Idec, SearchCIO.com met with Leach to talk about the hurdles CIOs face in developing the big data infrastructure and skill sets required to get big data projects off the ground. According to Leach, challenges range from the $2 million to $4 million investment required for setup, to the paucity of technology experts willing to work with open source tools. And not to be overlooked is the high-and-low search for those elite name-my-price data scientists who can actually help companies make something of the data.
You've consulted with CIOs on setting up a big data infrastructure. Where do you advise they start?
Martin Leach: The starting point is trying to figure out what you're trying to achieve. Where is the biggest pain point or need that necessitates this? That's the first question -- not what technology or setup to buy.
At the Broad Institute, what was the big pain point that necessitated a big data infrastructure?
Leach: It was the volume of internal data being generated and the ability to first absorb the data, then do stuff with it. And there was a race -- the human genome project race -- between the public efforts, like Broad, and private efforts. So the critical question was how we could go faster and faster, because there was a driver on the outside. We either had to slow down, or our infrastructure would fall over -- or we had to find a way to go faster.
This is the challenge I've also been hearing about from some of the biotechs now. They are outsourcing some of their experiments, having their data generated, and suddenly they have a couple hundred terabytes of data being delivered and they are like, 'Eek! What type of disks am I going to put them on, how am I going to ingest that data, where do I put it in a way that I can then compute on it, and then how the hell am I going to compute on it?' What I am seeing in life sciences organizations is that they are getting large ingestions of data and the first questions are, 'How do I ingest it and where do I put it?'
So, where do they put it?
Leach: Many organizations are bringing it in-house. Some, however, are keeping it outside as long as they can in the cloud. But these are still few and far between because of the risk. Most of the data in life sciences is associated with genetics and genomics or drugs or medicines or patient records. There is sensitivity as to storing it outside the firewall.
So, after identifying why you need this data, the starting point is how do I store stuff? And then after that, the question is usually how do I compute on that? Is it going to come from internal compute capability, or do we send it back outside again and put it on, say, Amazon, and do some compute there? Which then leads people to second-guess why they brought it internally in the first place.
Is ingesting data straightforward?
Leach: The actual ingestion process is not straightforward. Some organizations will deliver the data from the cloud, which requires a fast connection. Others will deliver the data on disk, sometimes encrypted. Then you have to solve the logistics -- say you receive the data in Boston, but your data center is in North Carolina: How do I plug a couple of hundred terabytes into the corporate network and get it onto a server so that I can actually do something with it?
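A back-of-the-envelope calculation shows why moving a couple of hundred terabytes over a network is a real problem, and why shipping physical disks is sometimes faster. The link speeds and utilization figure below are illustrative assumptions, not numbers from the interview:

```python
# Rough transfer-time estimate for bulk data delivery.
# Link speeds and the 70% effective-utilization figure are assumptions.

def transfer_days(terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move `terabytes` over a `link_gbps` link at the given utilization."""
    bits = terabytes * 1e12 * 8                      # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)  # effective throughput in bits/s
    return seconds / 86400

for gbps in (1, 10, 40):
    print(f"200 TB over {gbps:>2} Gbps: ~{transfer_days(200, gbps):.1f} days")
```

At 1 Gbps -- a typical corporate-network link of the era -- 200 TB takes close to a month, which is why "just drag it across the internal network" tends to go badly.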
How are companies handling the ingestion of the data?
Leach: In some cases, research has the data on a pile of disks and is trying to figure out who the hell to talk to in IT to actually get it onto a server. In some cases, they try dragging it across the internal network, which degrades internal networking, because they are moving a lot of data over a typical corporate network instead of just plugging it into the data center. Others try to work with IT.
Part of this comes down to how are you partnering with IT to get stuff to the right place? I think the rate limiter is how well the consuming organizations work with their own IT departments and how flexible IT is. This kind of stuff is not the standard IT infrastructure anymore. Try doing big data on an Oracle database. Oracle will tell you that you can, and you can buy some external hardware to do that, but you need database experts that don't just know conventional relational database technology but also know NoSQL, [Apache] CouchDB, MongoDB -- the document stores, the key-value stores, the column stores.
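To make the contrast concrete: in the document-store model Leach mentions (the family that includes CouchDB and MongoDB), records are schemaless documents keyed by an ID, so two records can carry different fields without a schema change. This is a toy sketch of the idea using a plain dictionary, not any particular product's API, and the sample records are hypothetical:

```python
# Toy document-store sketch: documents are keyed by ID and need not share
# a schema, unlike rows in a relational table. Data below is hypothetical.

store: dict[str, dict] = {}

def put(doc_id: str, doc: dict) -> None:
    """Insert or replace a document under the given ID."""
    store[doc_id] = doc

def get(doc_id: str) -> dict:
    """Fetch a document by ID."""
    return store[doc_id]

# Two documents with different fields -- no ALTER TABLE required.
put("sample-001", {"assay": "RNA-seq", "reads": 4.2e8, "tissue": "liver"})
put("sample-002", {"assay": "WGS", "coverage": 30, "notes": "rerun of batch 7"})

print(get("sample-002")["assay"])  # WGS
```

Key-value stores strip this down further (opaque values looked up by key), while column stores organize data by column families for scan-heavy workloads; the common thread is trading relational guarantees for flexibility and scale.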
It comes down to having top-trained people who can deal with the bevy of open source technologies at the moment -- like Hadoop, which has companies popping up to provide it as a service, or OpenStack, a cloud environment for building your own cloud. So, having people that can embrace the open source technologies and provide the right support model -- that's one of the key things at the moment. And the big thing I am hearing is, 'Where do we source that talent?'
Where do CIOs source that talent?
Leach: One great source I heard about from the CTO of eBay is economists. Economists love to wade in data, and they love trying to use the data to get to the bottom of questions. There are a whole bunch of economists who are starting to open their eyes and say, 'Wow, this data in science and other places -- we've never had our hands on this level of data before.'
So, you have to find people who are willing to wade into big data and are open to the open source tools.
Leach: I've seen groups of physicists working in the big data field. People who work on the [Large] Hadron Collider are wading through the petabytes of data a day that come off some of those machines. Physicists, economists, people who deal with derivatives -- the typical quants: They like data. I am going to try to dig into the economist world because that is a new source of data experts I hadn't thought about before.
What's the biggest misconception companies have about what's required to do big data and, in particular, to leverage it with some analytics that are going to tell you something?
Leach: I think the one thing they don't realize -- and this is everyone's problem -- is how you should take care of this data from the start. The less time you spend on curating, annotating and organizing the data up front, the harder it will be to use down the road. We've found, looking at statistics on our own work, that essentially five months after a project completes, no one really looks at the data anymore. So, what do you do with that data two years from now? Have you deleted it? Are you going to regenerate it? The cost of storing data is constantly going down, so we just keep everything.
So, what you are saying is that people are shortsighted when it comes to handling big data?
Leach: It is shortsighted from IT, but also from the investigators. IT supports the investigators going in, but we don't think about the long-term consequences from an IT point of view -- and neither do the investigators, because they are just focused on the moment, on doing something with the data they just received.
So, doesn't that defeat the purpose of big data, where you want to keep accumulating this stuff because the more you have, the more accurately you can predict and see patterns?
Leach: Yes, it's only big if you can actually combine it.