Structuring a big data strategy
In the aftermath of the 2012 presidential election, Nate Silver, the statistician who correctly predicted the race and achieved rock star status in political circles, has become the poster child for unlocking the potential of big data. Silver, however, is uncomfortable with the mantle. That's partly because his state-by-state election prediction wasn't "rocket science" (his words) and partly because he approaches big data analytics with a healthy dose of skepticism.
"We've had all types of problems with over-claims for what big data will do for us in the last 10 years or so," Silver said to more than 1,400 IT and business professionals at the recent Gartner Business Intelligence Summit in Dallas, Texas. "The big picture question here is, 'Why hasn't big data produced big progress?'"
While it's true that bigger, richer and more granular data sets offer the possibility of deeper insights and greater business value, they also come with challenges that make uncovering those insights more difficult than before, Silver said. Big data can mean more distraction, more false positives, more bias and more reliance on machine learning -- all of which can color analytical results and their interpretation and lead to potentially disastrous decisions.
Big data and the eye of the beholder
The promise of what big data can deliver -- that its sheer size will revolutionize whole industries -- has helped drive faulty thinking, according to Silver. In 2008, for example, a Wired magazine editor predicted that the data deluge would make the scientific method obsolete. The piece argued that businesses with inordinate amounts of data won't need data models, hypotheses or testing. "His idea was that if you have so much data, you mine the data, and the truth emerges from the cloud, so to speak: It rains down on you," Silver said.
"A turd in the punch bowl" and other Nate Silverisms
"The book I wrote over the last four years, The Signal and the Noise, is a little bit of a turd in the punch bowl about [the big data obsession]. I'm the guy associated with making predictions and forecasts, and there are a lot of ways to do it right and a lot of ways to do it wrong, as well."
"For a long time, I tried to program everything in Microsoft Excel, which was a big mistake. I was kind of the MacGyver of Microsoft Excel … but it wasn't really very efficient. Now I mostly use Stata, which is a language I learned a long time ago."
"There's still no substitute for a good business culture."
The big, big data problem, however, doesn't just center on more, faster or different kinds of data; it also includes the growing number of relationships between data points, according to Silver. These relationships can help uncover more dimensions to a business problem, but they can also uncover more false positives due to data redundancy in the system, and even bias.
"When we have more and more information -- more and more data -- it gives us the opportunity to cherry-pick the results we want to see," said Silver, who founded the popular blog FiveThirtyEight.com and is the author of The Signal and the Noise: Why So Many Predictions Fail but Some Don't. "We don't have the ability to perceive everything. We have to pick and choose our news diet -- our information diet. Unless we're careful, it produces more bias in the end."
Given the human predisposition toward bias, it might seem this is an argument for giving artificial intelligence a more prominent role. But artificial intelligence also has its limitations, Silver said. Computers can repeat tasks flawlessly and can even detect correlation within the data, but they can't detect causation.
"If you try and mine for relationships in a data set, you risk getting a lot of false positives instead, or a lot of bugs, which are mistaken for features potentially," Silver said. "I don't trust the idea of turning everything over to the machines."
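The false-positive risk Silver describes can be illustrated with a small simulation. The sketch below is not from Silver's talk; it assumes 40 made-up variables of pure random noise with 30 observations each, and a correlation cutoff of 0.361, which is roughly the conventional 5% significance level for that sample size. All function names and parameters are hypothetical.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious_pairs(n_vars=40, n_obs=30, threshold=0.361, seed=0):
    """Count variable pairs whose |r| clears the threshold by chance alone."""
    rng = random.Random(seed)
    # Every column is independent noise, so any "relationship" is spurious.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > threshold
    )
    return hits, len(pairs)


hits, pairs = count_spurious_pairs()
print(f"{hits} of {pairs} unrelated pairs look 'significant'")
```

Because the number of pairs grows with the square of the number of variables, adding more columns multiplies the opportunities for a chance correlation to pass the cutoff, even though nothing in the data is actually related.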
Enter Bayesian logic.
Three Bayesian principles
In the literal sense of the term, Bayes' Theorem is a mathematical formula; for Silver, the theorem provides three major principles for how he tackles big data and prediction: Know where you're coming from; think probabilistically; try, err and try again.
"A nice attribute of Bayes' method is … that over time you converge toward the correct results," Silver said. "People can begin with different beliefs, and if they abide by Bayes' Theorem, in the end they converge toward a consensus as more and more data is accumulated."
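The convergence Silver describes can be sketched with a simple Beta-binomial model, one common way to apply Bayes' Theorem to yes/no outcomes. This is an illustration, not Silver's own method: the two priors, the 60% success rate in the data and the function names are all assumptions made for the example.

```python
def posterior_mean(prior, successes, failures):
    """Mean of the Beta posterior after observing binary outcomes.

    `prior` is a (pseudo_successes, pseudo_failures) pair that encodes
    the belief held before any data arrives.
    """
    a = prior[0] + successes
    b = prior[1] + failures
    return a / (a + b)


def belief_gap(n_observations, skeptic=(2, 8), optimist=(8, 2)):
    """Distance between two observers' estimates after shared data."""
    # Hypothetical data: 60% of the observations are successes.
    successes = n_observations * 6 // 10
    failures = n_observations - successes
    low = posterior_mean(skeptic, successes, failures)
    high = posterior_mean(optimist, successes, failures)
    return high - low


for n in (0, 10, 100, 1000):
    print(f"after {n:>4} observations the two estimates differ by {belief_gap(n):.3f}")
```

The skeptic starts at an estimate of 0.2 and the optimist at 0.8. As both see the same growing body of evidence, the gap between them shrinks toward zero, which is the consensus effect described above.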
Acknowledging bias. Rather than going in with a belief that the data automatically will reveal the "truth," Silver believes businesses should recognize that they almost certainly approach big data problems with a prior belief -- a bias reinforced by many years of doing things the same way or an expectation of a desired outcome. Acknowledging bias up front can help businesses be more aware of their blind spots, Silver said, and be more open to novel results in the data, which often offer the best opportunities for competitive advantage. "Knowing that we come with a point of view and have a small subjective viewpoint in this big world out there -- and that sometimes objective knowledge clashes with our first impression -- is an important part of making progress," he said.
Probability. Data and analytics are powerful tools for making progress, but adding probability to the mix can help businesses prepare for (and even avoid) potential disasters.
"[The theorem] is always framing things in terms of probabilities, where the result it gives is somewhere between 0% and 100%," Silver said. "Usually it's not exactly zero or exactly 100. You increase or decrease your level of confidence along a continuous spectrum."
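A single application of Bayes' Theorem shows how confidence moves along that spectrum. In this minimal sketch the starting confidence of 50% and the two likelihoods (0.8 and 0.4) are invented for illustration, and `bayes_update` is a hypothetical helper name.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(hypothesis | evidence) using Bayes' Theorem."""
    numerator = prior * p_evidence_if_true
    evidence = numerator + (1 - prior) * p_evidence_if_false
    return numerator / evidence


# Assumed numbers: the evidence is twice as likely if the hypothesis
# is true (0.8) as it is if the hypothesis is false (0.4).
confidence = 0.5
history = [confidence]
for _ in range(3):
    confidence = bayes_update(confidence, 0.8, 0.4)
    history.append(confidence)

print([round(c, 3) for c in history])  # [0.5, 0.667, 0.8, 0.889]
```

Each supporting observation raises the confidence, but the value never reaches exactly 100%; contradicting evidence would lower it in the same gradual way.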
In 1997, for example, the city of Grand Forks, N.D., flooded. The weather service predicted the Red River would crest at 49 feet; instead, the river hit 53 feet, causing major flooding. It was later revealed that the weather service could have helped the city prepare for the flood: The margin of error on its prediction was plus or minus nine feet, though that information was never released to the public.
"They were afraid if they told the people in North Dakota what the error was, people wouldn't take the prediction seriously," said Silver. "They'd mistake the uncertainty for a lack of precision or a lack of science behind the prediction."
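To see why the margin of error mattered, the forecast can be turned into a rough probability. The sketch below takes the 49-foot forecast, the nine-foot margin and the 53-foot crest from the account above; everything else is an assumption made for illustration, namely that forecast errors follow a normal distribution and that "plus or minus nine feet" describes a 90% interval. The weather service's actual error model is not given in the article.

```python
from statistics import NormalDist


def exceedance_probability(forecast, margin, threshold, coverage=0.90):
    """Chance the true value exceeds `threshold` under a normal error model.

    `margin` is treated as the half-width of a central interval with the
    given `coverage` (an assumption; the article does not specify it).
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    sigma = margin / z  # convert the interval half-width to a standard deviation
    return 1 - NormalDist(mu=forecast, sigma=sigma).cdf(threshold)


p = exceedance_probability(forecast=49, margin=9, threshold=53)
print(f"Chance of a crest of 53 feet or more: {p:.0%}")
```

Under these assumptions the chance of a crest at or above the level actually reached comes out a bit under one in four, a risk that a single point forecast of 49 feet hides completely.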
Trial and error. As businesses build models to take on complex problems, such as predicting the weather or personalizing products for a specific customer, they'll encounter a learning curve. And in the first 20% of that curve, the learning is steep, according to Silver.
"A lot of this first 20%, by the way, is about describing the world accurately," Silver said. "If you are measuring things accurately, though, you can proceed to the fun part of things, which is the analytics -- the last 80%."
Still, once the model is built, don't expect perfection, he warned. That last 80% of the learning curve is a "two steps forward, one step back" kind of existence -- devoted to continually chipping away at the edges of how to improve the model's accuracy and performance. This involves painstaking effort, Silver said, but it should not be cast off or brushed aside because it's where the true edge over the competition lies.
"If you look at what Google does, for example, they're running literally thousands of experiments every year by tweaking their search products and by adding or dropping different products that they run," he said. "What they realize is that if you're in a data-rich environment, there's no substitute for testing your models, your hypothesis, on real customers, on real data."
Let us know what you think about the story; email Nicole Laskowski, Senior News Writer.
Image: JD Lasica/Socialmedia.biz