You can't stuff the Arctic into a relational database. That's what David Gallaher, who studies snow and ice around the world, discovered while designing a system to help answer a fundamental question: How is global warming changing the North and South poles?
Gallaher started with Greenland's roughly 660,000-square-mile ice sheet. As it turned out, probing 30 years of Big Data on Greenland -- nearly a petabyte, gathered by thrice-daily satellite sweeps -- is better suited to a technology that has been overlooked in this heyday of conventional relational databases. That technology is an object-oriented database management system, which addresses data as objects -- in this case, an object database engine from Versant Corp.
"The data was too large for an Oracle, or a conventional relational database. It simply collapses under the load," said Gallaher, IT services manager for the National Snow and Ice Data Center (NSIDC) at the University of Colorado at Boulder. Relational databases, designed for reporting and analyzing the kind of consistent data that fits neatly into tables, could not untangle the historical web of changes that could shed light on the state of Greenland's ice.
A geologist by training, Gallaher is principal investigator in a $600,000 project funded by a grant from the National Science Foundation to build a system that can handle billions of bits of time series information (a sequence of data measurements made at uniform intervals), and to make it accessible over the Web to researchers worldwide. "We are trying to shift to a paradigm where it is easier to move the analysis to the data than to move the data to the analysis," he said.
The data is so vast that NSIDC (and NASA, its collecting partner) held only the metadata in a relational database. The data itself was stored in directory trees, and had to be extracted before researchers could ask the critical what, where and when questions -- let alone analyze the why. Given the size of the files, a researcher asking, for example, about the reflectivity, or reflective properties, of the ice -- how white or how dark it is, and how much or how fast that feature is changing -- might have to spend many weeks just to get the data. (Properties is the term used by the object-oriented community for persistent data.)
"Then they'd have to write something to figure out what they have. If they're lucky and ran an algorithm, they might get one or two passes before their grant ran out," Gallaher said. "We said, 'There must be another way to do this.'"
Not your daddy's object-oriented database
Object-oriented database technology frequently is mistakenly thought of -- even among the database community -- as a technology that was tried before and didn't work except for limited use cases, said Carl Olofson, research vice president for information management and data integration software at IDC. This may be because the work on database standards for collecting and reporting focused on relational databases, he said.
To get the most out of object databases, an object model must be created that reflects the structure of the persistent data. "To do this involves a level of abstract thinking," Olofson said. IT shops may feel "they don't have time for that level of analysis."
But perceptions are changing, Olofson said. The classes of data and complex structures that companies now want to track over time and space -- the person-to-person-to-person relationships contained in social media, for example -- are better rendered and retrieved with an object database engine. Such vendors as Versant, GemStone Systems (recently acquired by VMware Inc.) and Objectivity Inc. are getting more attention from businesses and programmers.
"The basic point is that an object database is really quite useful in bringing order to the Big Data world without losing nuance," Olofson said.
The new NoSQL technologies are relevant and provide a lot of benefit, but they lack the infrastructure and industry standards required for enterprise computing. Hadoop, for example, is good at the initial ingestion of data but falls short when it comes to producing structured output, Olofson said.
'Data rods' for time travel
A key to making an object database work is knowing the question you want to answer, Gallaher said. Another challenge is persuading database administrators accustomed to relational databases to stop thinking in terms of tables. Gallaher and his small team -- two graduate students and a professor (part-time) -- came up with a construct they call a data rod, which contains billions of pixels and looks at the entire time record of a fixed area.
"Think of it as a stack of quarters, with each quarter representing several hours, with the stack now 30 feet tall," Gallaher explained. Take reflectivity as an example: You could ask the system to "tell you where some of the 'quarters' are darker than others and what happened there. And if it finds something interesting, [you can ask it to] tell you about the objects adjacent to it as well," he said.
"The beauty of this is that we are saying, let's not look at this as an image, but as a rod through time. We think of this as a giant 3-D matrix," Gallaher said.
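To make the idea concrete, here is a minimal Python sketch of the data-rod concept -- pivoting a series of satellite images into one time series per fixed grid cell, then scanning a cell's "stack of quarters" for dark spells. All names and values are illustrative, not taken from the NSIDC system, which implements this on a Versant object database.

```python
# Hypothetical sketch of a "data rod": instead of storing one image per
# satellite pass, keep the full time record for each fixed grid cell.
# Toy data only; the real system holds billions of pixels over 30 years.

def build_rods(images):
    """Pivot a list of 2-D images (lists of rows) into per-pixel time series."""
    rows, cols = len(images[0]), len(images[0][0])
    return {
        (r, c): [img[r][c] for img in images]  # the "stack of quarters"
        for r in range(rows)
        for c in range(cols)
    }

def dark_epochs(rod, threshold):
    """Return the time steps where reflectivity drops below a threshold."""
    return [t for t, value in enumerate(rod) if value < threshold]

# Three toy 2x2 "reflectivity" images, one per satellite pass.
images = [
    [[0.9, 0.8], [0.7, 0.9]],
    [[0.9, 0.4], [0.7, 0.9]],   # pixel (0, 1) darkens at time step 1
    [[0.9, 0.8], [0.7, 0.3]],   # pixel (1, 1) darkens at time step 2
]

rods = build_rods(images)
print(dark_epochs(rods[(0, 1)], 0.5))  # [1]
print(dark_epochs(rods[(1, 1)], 0.5))  # [2]
```

Asking "where did the quarters go dark, and when?" becomes a lookup along one rod rather than a crawl through 30 years of image files.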
For efficiency's (and recoverability's) sake, the data rods for all of Greenland are cut off at five-year intervals and span multiple databases. "You can run a query on all the databases and make them behave as 'one rod,' if you will," Gallaher said. Using VQL, the Versant query language (which he notes looks to the outside user like SQL), filtering for what has changed over time becomes a fairly straightforward task.
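A rough Python sketch of that stitching step, with each five-year partition stood in by a plain dictionary -- purely illustrative, since the NSIDC system does this with VQL against separate Versant databases:

```python
# Illustrative only: each "partition" stands in for one five-year database,
# keyed by pixel coordinate; the real queries run in VQL, not Python.

def query_partition(partition, pixel):
    """Stand-in for a per-database query: return that pixel's time series."""
    return partition.get(pixel, [])

def query_rod(partitions, pixel):
    """Run the same query on every five-year partition, oldest to newest,
    and concatenate the results so the caller sees a single rod."""
    rod = []
    for partition in partitions:
        rod.extend(query_partition(partition, pixel))
    return rod

# Two toy "five-year" partitions for pixel (10, 20).
partitions = [
    {(10, 20): [0.9, 0.8, 0.9]},  # years 1-5
    {(10, 20): [0.7, 0.4]},       # years 6-10
]
print(query_rod(partitions, (10, 20)))  # [0.9, 0.8, 0.9, 0.7, 0.4]
```

The caller never sees the partition boundaries: the same question asked of one database or six comes back as a single record through time.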
"The best way for me to explain it to people is to think about it as a record of infinite length that you can ask what you like anywhere along that line," Gallaher said.
Gallaher said he looked extensively at Hadoop and similar technologies, and believes he could have gotten them to work. The Versant system, however, handled any size of data his team wanted. "We were asking questions of huge areas, enormous points of times, against massive amounts of variables and getting responses in seconds, cached," he added. "What we can do in a couple of hours would take six months conventionally, and that is no joke."
Let us know what you think about the story; email Linda Tucci, Senior News Writer.