From data gathering to competitive strategy: The evolution of big data
After he was labeled a person of interest in a murder investigation, antivirus software pioneer John McAfee fled his home in Belize, but he didn't disappear. For the next month, as he hid from police, the founder of McAfee Inc. remained visible to the public through blog posts, tweets and media reports. The Silicon Valley legend might have continued to elude police while maintaining a virtual presence if it weren't for a single, sizable electronic bread crumb: a photograph of McAfee and a tag-along journalist posted to the website of Vice, a New York magazine on arts and culture. The image revealed little, but the same couldn't be said for the information embedded within the image. There, for law enforcement officials or any reasonably tech-savvy person to see, were the coordinates documenting precisely where the photo was taken.
According to The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, a report published in December 2012 by IDC and sponsored by storage vendor EMC, metadata is one of the fastest-growing subsegments of that digital universe.
The problem is that while metadata is growing -- in volume, as well as in importance because of its role in making sense of the data landscape -- it's not keeping pace with the "big data" deluge. IDC refers to this as the "big data gap," and it will require CIOs to think about their data management strategy differently.
One such practice is the Resource Description Framework, or RDF. Used to represent information on the Web, it breaks data up into subjects, predicates and objects (each such three-part statement is called a "triple"), which are then linked into a graph. In this framework, the data and metadata are tied so closely together that the two are hard to distinguish.
"However, in order to use this approach, typical data stores have to be broken down into these triples that can be connected," said Gwen Thomas, founder and president of the Data Governance Institute. The tight coupling of data and metadata is almost certainly in everybody's future, she said -- "but not in this decade for most of us."
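To make the idea of triples concrete, here is a minimal sketch in plain Python, not a real RDF library. Every identifier and value in it (photo:1, takenAt, place:X and so on) is hypothetical and chosen only for illustration. It shows how a handful of facts, each stored as a subject, a predicate and an object, can be connected into a small graph and queried.

```python
from collections import defaultdict

# Each fact is one (subject, predicate, object) triple.
# All identifiers below are hypothetical, for illustration only.
triples = [
    ("photo:1", "depicts", "person:A"),
    ("photo:1", "takenAt", "place:X"),
    ("photo:1", "publishedBy", "site:Y"),
    ("person:A", "founded", "company:Z"),
]


def build_graph(triples):
    """Index triples by subject so they can be followed like graph edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [obj for pred, obj in graph.get(subject, []) if pred == predicate]


graph = build_graph(triples)
print(objects(graph, "photo:1", "takenAt"))
```

Note that the "metadata" (where the photo was taken) and the "data" (what the photo depicts) are stored in exactly the same form, which is why the two become hard to tell apart in this model.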
The information not only led to McAfee's capture and arrest, it also pushed metadata -- geocode, in this case -- into the limelight. Metadata, commonly referred to as "data about data," is rapidly making a name for itself -- and not just in relation to tracking suspected criminals. Companies investing in "big data" tools will need to think beyond storing and analyzing large data sets to consider the tags or labels that give the data context over time, according to experts. Without metadata, companies will forfeit some of the deep insights big data can yield, including the identification of important business trends by analyzing detailed data over time.
"The interesting thing about big data tools is that they let you store pretty much anything that you want very easily, and they 'persist' the data," said Phil Shelley, chief technology officer at Sears Holdings Corp. in Hoffman Estates, Ill., and CEO of a Sears Holdings subsidiary, data services provider MetaScale. "But the data is completely useless if it's not meaningful in a way that you can use it and retrieve it."
Gwen Thomas, founder and president of the Data Governance Institute, based in Orlando, Fla., agreed that the advent of big data is changing attitudes about metadata. "When you're talking about 'small data,' there's always the possibility of actually sampling the data itself to see what's in it," she said. "But you don't have that option with big data. It's like drinking from a fire hose: You're going to get knocked over."
Metadata and the rise of non-relational databases
Metadata's usefulness in lending context and location to data collected in the course of business has long been acknowledged, but it has not always been admired, Sears' Shelley said. "Documenting the metadata is a boring chore most people don't want to do," he said. "People want to get the data in, use it and produce value."
Traditionally, the business has been able to circumvent the dreaded metadata discussion because data repositories -- such as a relational database -- neatly organize the data into rows and columns, "and the metadata is heavily implied by the structure," Shelley said. New storage and analytics tools equipped to handle the volume and structure of big data, however, take a different approach to organizing data, one with advantages and disadvantages. "The advantage of big data tools is that they don't enforce a strict schema on your data when you put it away. They allow you to read the data and apply schemas onto the data at the time you read it," he said.
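The "schema on read" approach Shelley describes can be illustrated with a short sketch. This is plain Python standing in for a big data tool, not Hadoop itself, and the record contents, field names and the read_with_schema helper are all assumptions made for the example. Raw records are stored as they arrive, and a schema (field names, types and defaults) is applied only when the data is read.

```python
import json

# Raw records are kept exactly as they arrived; nothing is enforced on write.
# The records and field names are hypothetical.
raw_store = [
    '{"order_id": "1001", "amount": "19.99", "store": "A"}',
    '{"order_id": "1002", "amount": "5.00"}',
    '{"order_id": "1003", "amount": "12.50", "coupon": "SPRING"}',
]

# A schema chosen at read time: field name -> (converter, default value).
order_schema = {
    "order_id": (int, None),
    "amount": (float, 0.0),
    "store": (str, "unknown"),
}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project it onto the schema as it is read."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else default)
            for field, (convert, default) in schema.items()
        }


orders = list(read_with_schema(raw_store, order_schema))
total = sum(order["amount"] for order in orders)
print(orders)
print(round(total, 2))
```

The flexibility cuts both ways: the second record silently gets a default store, and the third record's coupon field is ignored, because nothing recorded what the stored data was supposed to contain.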
Digital data by the numbers
23%: The percentage of digital data in 2012 that would have been useful for "big data" if it had been tagged and analyzed
3%: The percentage of digital data tagged in 2012
< 3%: The percentage of digital data analyzed in 2012
40 trillion gigabytes: The projected size of digital data by 2020
1/3: The share of digital data that will have big data value by 2020 if it's tagged and analyzed
Source: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, a report published December 2012 by IDC and sponsored by storage vendor EMC
The promise of big data tool Hadoop, for example, is its ability to maintain years' worth of corporate data for analysis. "But the disadvantage is if you don't have a good handle on what the data is and on what the metadata is, you really don't know after a while what you have," Shelley said.
The big data gap
Metadata does more than provide a data legacy for businesses, said David Marco, president of consultancy EWSolutions, a systems integrator headquartered in Chicago. It can also help companies establish data consistency. Take companies' classic struggle to define the term "customer," he said. Depending on the business department, the term can be interpreted -- and therefore measured -- differently. But by using metadata, companies can craft a definition or business rule for that metric -- or for any major data concept -- and affix it to enterprise-wide data used for analytics. "When you bring analytical information into your engine, there's always going to be errors or something you didn't load," he added. "If you're the decision maker, the marketing officer or the [chief financial officer], wouldn't you want to know some of those numbers you're looking at don't include 2% of your data?" Metadata of this kind gives decision makers a more accurate picture of the data behind the numbers.
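As a rough illustration of Marco's point, the sketch below attaches two pieces of metadata to a loaded data set: a shared business definition and a count of how much of the source actually made it in. The class name, field names, definition text and row counts are all hypothetical; the only figure taken from the article is the 2% gap, which the example numbers are chosen to reproduce.

```python
from dataclasses import dataclass


@dataclass
class LoadMetadata:
    """Metadata kept alongside a loaded data set (hypothetical structure)."""
    dataset: str
    definition: str      # the agreed business rule for the key term
    rows_expected: int   # rows present in the source system
    rows_loaded: int     # rows that reached the analytics engine

    @property
    def missing_pct(self) -> float:
        """Percentage of source rows that were not loaded."""
        if self.rows_expected == 0:
            return 0.0
        return 100.0 * (self.rows_expected - self.rows_loaded) / self.rows_expected


def report_caveat(meta: LoadMetadata) -> str:
    """Build the note a decision maker would see next to the numbers."""
    return f"{meta.dataset}: {meta.missing_pct:.1f}% of source rows not loaded"


# Example values are made up for illustration.
meta = LoadMetadata(
    dataset="customer_orders",
    definition="customer = account with at least one paid order",
    rows_expected=1000,
    rows_loaded=980,
)
print(report_caveat(meta))
```

Keeping the definition and the load counts with the data, rather than in someone's head, is what lets a report state its own limits.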
Marco believes there's yet another reason metadata is enjoying a resurgence: reducing the IT footprint. When companies have multiple order entry systems, financial systems and copies of the same data, they're spending money to keep the redundant systems operational, and that can be incredibly costly, he said. "How do you get off of them? You have to know what data you have, what it means, where it's located -- and that's all metadata management."
Let us know what you think about the story; email Nicole Laskowski, Senior News Writer.