Big data is not about the data, it's about the analytics, according to Harvard University professor Gary King -- and, boy, are there some really bad analytics out there. One of his favorite recent examples concerns a big data project that set out to use Twitter feeds and other social media to predict the U.S. unemployment rate. The researchers devised a category of many words that pertained to unemployment, including jobs, unemployment and classifieds. They culled tweets and other social media posts that contained these words, then looked for correlations between the total number of words per month in this category and the monthly unemployment rate. This is known as sentiment analysis by word count, and it's a common analytics approach, King said.
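The word-count approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual code: the category holds only the three words named here, and the posts, month labels and unemployment figures are all invented for the example.

```python
import re

# Hypothetical category: just the three words named in the article.
CATEGORY = {"jobs", "unemployment", "classifieds"}

def count_category_words(posts):
    """Count tokens across `posts` that fall in CATEGORY (case-insensitive)."""
    total = 0
    for text in posts:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += sum(1 for token in tokens if token in CATEGORY)
    return total

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented sample data: posts grouped by month, plus made-up monthly rates.
posts_by_month = {
    "month 1": ["No jobs in my town", "Checking the classifieds again"],
    "month 2": ["Unemployment office was packed",
                "jobs jobs jobs, where are the jobs"],
    "month 3": ["Got hired today!"],
}
rates = [9.0, 9.4, 8.7]  # made-up unemployment rates, one per month

counts = [count_category_words(posts) for posts in posts_by_month.values()]
print(counts)                  # category word total for each month
print(pearson(counts, rates))  # correlation between counts and rates
```

Nothing in this pipeline checks what a post is actually about; it only checks which words the post contains.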
Money was raised. The work crept along, when all of a sudden there was a tremendous spike in the number of tweets containing the type of words that fell into this category. Maybe the researchers were really onto something. More money poured into the project. "What they hadn't noticed was Steve Jobs died," said King, the Albert J. Weatherhead III University Professor and director of the Institute for Quantitative Social Science at Harvard. Of course, tweets with "Jobs" spiked for a completely different reason.
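The collision King describes is easy to reproduce. In this minimal sketch (the word list and the sample posts are invented for illustration), a case-insensitive word counter scores a post about Steve Jobs exactly the same as a post about looking for work:

```python
import re

# Hypothetical category: the three words named in the article.
CATEGORY = {"jobs", "unemployment", "classifieds"}

def category_hits(text):
    """Number of tokens in `text` that belong to CATEGORY, ignoring case."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in CATEGORY)

# Invented posts: two about a person named Jobs, one about employment.
posts = [
    "Steve Jobs died today. RIP.",
    "Thank you, Steve Jobs",
    "Still no jobs around here",
]

hits = [category_hits(post) for post in posts]
print(hits)  # every post registers one category hit
```

Because the counter sees only the token "jobs", a surge of posts about the person shows up as a surge in the unemployment category.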
King, whose research focuses on developing and applying empirical methods to social science research, said such errors happen "all the time" in sentiment analysis by word count and other "off the shelf" analytics programs. That's because these approaches tend to conflate humans with systems that respond in completely predictable ways. That's bad analytics. "We're pretty good at being humans but pretty lame at being computers."
The keynote speaker at the recent Text and Social Analytics Summit 2013 in Cambridge, Mass., King made a point that no doubt many companies are discovering as they attempt to extract value from the ready supply of data they collect and generate minute by minute. (Fun fact: The volume of email produced every five minutes is equivalent to all the digital data in the Library of Congress.) The real value in big data is in the quality of the analytics, and that often requires mathematics customized for your particular business purpose, not some generic off-the-shelf program.
"We have tried to commoditize analytics, and there are computer programs out there that do a lot of it," King said. But commercial software that automates that "last mile" -- the stretch that separates a winning big data analysis project from an also-ran -- he believes is rare, if it exists at all.
Word counts versus computer-assisted reading
A common feature of bad analytics involves the lumping of many individual classifications to answer questions about the zeitgeist. The Twitter analysis project described above is an example. Sentiment analysis by a categorical word count works for a bit, "but if you do it enough it will fail catastrophically," King said.
One way to avoid misinterpretation is by reading through the posts -- King works on computer-assisted reading -- to make sure the post is actually about the subject. This requires semantics as opposed to simple word counts and is much harder to do.
Bad analytics is not confined to the enormously difficult task of analyzing unstructured social media feeds. Another big data project gone wrong described by King tried to discover causes of death in parts of the world where no death certificate is issued. One way to collect such data is to have researchers go household by household and do what's called a "verbal autopsy." What were the symptoms the deceased exhibited before dying -- bleeding out the nose, stomach pains?
This works great, he said, until you try to turn the verbal report into a diagnosis and find you won't necessarily get the same cause of death from one physician to the next. Sending a physician to Tanzania to do the verbal autopsy would seem to help, but this could be a dead end too. A physician trained in Boston, for example, without much experience in tropical diseases, might not immediately think malaria when he hears runny nose and, in turn, get the cause of death wrong. And sending the best physician in Tanzania out in the field to do "this little study," King said, might actually wind up killing people by depriving them of a scarce commodity, namely, a doctor. The fundamental problem is that the analyses focus on classifying individuals when the real goal is to learn how causes of death are distributed across the whole population.
"In public health, they don't care about you -- they care about what everybody died of," King said. The approach is ineffective across many fields. "Once we realized that, we realized we needed to come up with a different method for estimating the percent in the category that had nothing to do with an individual's classification."
Reading social media to predict China's next move
SearchCIO will be writing more about King's computer-assisted reading technology. In the meantime, here's an example of his work to whet the appetite.
Partnering with Boston-based startup Crimson Hexagon Inc., King and his colleagues set out to better understand what information the Chinese government censors by analyzing social media posts. Previous research on this topic had drawn the not-so-surprising conclusion that the government censored communications critical of the government itself.
"Totally not true," King said. Analyzing 11million Chinese social media posts -- collected before the Chinese censors were able to take them down-- King's study found that 13% of them were censored. But the censored posts did not conform to the prevailing theories. Included in the 87% of uncensored posts were communications highly critical of government leaders --"vitriolic, personal" attacks accusing local and national leaders of stealing money, having mistresses and a number of corrupt actions. The government was not censoring criticism of itself. "They were censoring any attempt at collective action, any attempt to move people by someone other than the government," King said.
The analysis of the censorship led to another potential insight. By analyzing social media censorship, it might actually be possible to predict political actions on important issues by Chinese officials. Case in point was the highly charged political atmosphere after oil was discovered in the South China Sea between China and Vietnam, King said. During the height of the tension between the two countries, with talk of war dominating the media coverage, the rates of censorship of social media "were soaring," King said. Then one day the rates plummeted. Less than a week later, China signed a treaty with Vietnam.
Big data promises new breakthroughs. But companies have had access to a lot of the data out there for years, King said. "Being able to do something in an innovative way, designed for a purpose, is what makes it valuable."
Let us know what you think about the story; email Linda Tucci, Executive Editor.