The name Michael Stonebraker is practically synonymous with the term database. Ingres, Postgres, Vertica, VoltDB -- that's a short list of the relational databases and database management systems he's had a hand in developing. In March, Stonebraker, a researcher at MIT's Computer Science and Artificial Intelligence Laboratory, was awarded the A.M. Turing Award for his pioneering work in database technology. The annual prize from the Association for Computing Machinery is often referred to as the Nobel Prize of computing.
For the past few years, Stonebraker has also been focused on alleviating one of the biggest headaches for CIOs: the data integration problem. Stonebraker is a co-founder of Tamr Inc., a data curation company born out of research at MIT whose tool relies on machine learning and domain expertise to take a bottom-up approach to data quality, rather than the top-down master data management approach businesses are used to.
SearchCIO caught up with Stonebraker ahead of his keynote talk at the MIT Chief Data Officer and Information Quality (MITCDOIQ) Symposium. The interview has been edited and condensed.
Your keynote talk at MITCDOIQ is titled "Tackling Data Curation." What is data curation? And how is it different from data integration?
Michael Stonebraker: Data integration sometimes just refers to the act of putting two independently constructed schemas together without the thought that you have to clean dirty data, without the thought that you have to transform dirty data. Data curation is the end-to-end process of creating good data.
The other thing is that data curation inevitably requires human interaction, whereas people sometimes talk about data integration as an automated, hands-off process.
Data integration has long been a CIO headache. What makes this so hard?
Stonebraker: Say you're the human resources person in Paris, and I'm the human resources person in New York. My company buys your company, or we're two divisions of the same company. You have employees; I have employees. Your employees have salaries; my employees have salaries. But let's say you call it "salaries," while I call it "wages." So we have to normalize the attribute names. Your salaries are in euros -- after taxes -- with a lunch allowance, because that's how it is in France. My salaries are gross in U.S. dollars and do not include a lunch allowance. So there's a fairly elaborate transformation from one of our salaries to another or from both of our salary representations -- there's a cleaning problem, a deduplication problem, and all of this stuff is hard.
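The transformation Stonebraker describes can be sketched in a few lines. This is a toy illustration only: the field names, exchange rate, tax ratio and allowance figures below are all hypothetical assumptions, not anything from the interview.

```python
# A minimal sketch of the salary-normalization problem described above.
# All rates, amounts and field names are hypothetical, for illustration only.

EUR_TO_USD = 1.10           # assumed exchange rate
NET_TO_GROSS = 1.0 / 0.75   # assumed: net pay is ~75% of gross
LUNCH_ALLOWANCE_EUR = 1200  # assumed annual lunch allowance, in euros

def normalize_paris_record(record):
    """Map a Paris HR record onto the New York schema:
    gross annual wages in U.S. dollars, no lunch allowance."""
    net_eur = record["salaries"] - LUNCH_ALLOWANCE_EUR  # strip the allowance
    gross_eur = net_eur * NET_TO_GROSS                  # undo the after-tax deduction
    # Normalize the attribute name "salaries" to the New York name "wages"
    return {"name": record["name"], "wages": round(gross_eur * EUR_TO_USD, 2)}

paris = {"name": "A. Dupont", "salaries": 46200}  # net euros, allowance included
print(normalize_paris_record(paris))  # {'name': 'A. Dupont', 'wages': 66000.0}
```

Even this toy version shows why the problem compounds: every pair of sources needs its own mapping of attribute names, units and business rules, before deduplication even starts.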
How has big data compounded the problem?
Stonebraker: If you go back 20 years to the mid-1990s, enterprises universally put together customer-facing data using ETL systems. This was led by the retail space, and it was wildly successful. But the systems were integrating fairly modest kinds of data and fairly modest data sources.
Around 2000, I got to visit Miller Beer [MillerCoors LLC] in Milwaukee, which had a traditional data warehouse composed of ETL-style sales data of beer -- by brand, by time period, by zip code [and] by distributor. It turned out to be a year when an El Niño was widely forecast; this was in November. The weather guys had figured out that El Niño screws up the U.S. weather during winter -- it's warmer than normal in New England and wetter than normal on the West Coast.
So I asked the Miller Beer guys, 'Are beer sales correlated with temperature or precipitation, because there's this El Niño thing coming?' The business analysts all said, 'Gee, I really wish we could answer that question.' But, of course, weather data and temperature data were not in the data warehouse.
The [business intelligence] guys have an insatiable demand for more. I think the driving difficulty of data integration is this insatiable demand for more. And it's not without merit because the return on investment from integrating more and more kinds of data is generally positive.
You've stated the traditional tools and practices for data integration aren't really working. Why?
Stonebraker: Let's look at traditional ETL. The basic idea is somebody creates a global schema up front -- think of it as master data management. For every data source, you send a programmer out into the field to find the business owner; you have him interview the business owner, figure out what the data looks like, what it is and how to extract it. He writes extract routines, he writes transformation and cleaning routines, and he then loads it into the data warehouse.
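The per-source pipeline Stonebraker outlines -- extract, transform and clean, then load into the global schema -- can be sketched as follows. The source data, field names and cleaning rules here are hypothetical stand-ins, not any real system's schema.

```python
# A toy sketch of the hand-written, per-source ETL pipeline described above.
# Every field name and cleaning rule is hypothetical, for illustration only.

def extract(source_rows):
    """One hand-written extract routine per data source:
    pull rows out, dropping records with no usable key."""
    return [r for r in source_rows if r.get("id") is not None]

def transform(row):
    """Hand-written transformation and cleaning: map the source's
    field names onto the global schema and tidy the values."""
    return {"employee_id": row["id"], "name": row["name"].strip().title()}

def load(warehouse, rows):
    """Load the cleaned rows into the global-schema warehouse table."""
    warehouse.extend(rows)

warehouse = []
source = [{"id": 1, "name": "  ada lovelace "}, {"id": None, "name": "??"}]
load(warehouse, [transform(r) for r in extract(source)])
print(warehouse)  # [{'employee_id': 1, 'name': 'Ada Lovelace'}]
```

The scaling problem is visible in the structure itself: each new data source needs its own `extract` and `transform` written by hand against a schema someone had to fix in advance.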
There are two big problems with this approach. The first big problem is the global schema: If you have 10 data sources, you can create one up front. If you have 500 data sources, no one knows what the data schema looks like in advance. Constructing a global schema won't scale.