
Big data can mean bad analytics, says Harvard professor

The value of big data is in the analytics -- not any old analytics but software customized to your business purpose. Anything less is bad analytics.

Big data is not about the data, it's about the analytics, according to Harvard University professor Gary King -- and, boy, are there some really bad analytics out there. One of his favorite recent examples concerns a big data project that set out to use Twitter feeds and other social media to predict the U.S. unemployment rate. The researchers devised a category of many words that pertained to unemployment, including jobs, unemployment and classifieds. They culled tweets and other social media posts that contained these words, then looked for correlations between the total number of words per month in this category and the monthly unemployment rate. This is known as sentiment analysis by word count, and it's a common analytics approach, King said.

Money was raised. The work crept along, when all of a sudden there was a tremendous spike in the number of tweets containing the type of words that fell into this category. Maybe the researchers were really onto something. More money poured into the project. "What they hadn't noticed was Steve Jobs died," said King, the Albert J. Weatherhead III University Professor and director of the Institute for Quantitative Social Science at Harvard. Of course, tweets with "Jobs" spiked for a completely different reason.
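To make the failure concrete, here is a minimal Python sketch of sentiment analysis by word count as described above. The three category words come from the article; the sample posts, the month keys and the function name `category_count` are invented for illustration and are not the researchers' actual data or code.

```python
import re

# Category words named in the article; a real project would use many more.
CATEGORY = {"jobs", "unemployment", "classifieds"}


def category_count(posts):
    """Count how many words across all posts fall into the category."""
    total = 0
    for post in posts:
        # Lowercasing is what makes "Jobs" (the name) look like "jobs" (the noun).
        words = re.findall(r"[a-z]+", post.lower())
        total += sum(1 for word in words if word in CATEGORY)
    return total


# Hypothetical posts, grouped by month.
posts_by_month = {
    "2011-09": [
        "No jobs in the classifieds again",
        "Unemployment office was packed today",
    ],
    "2011-10": [
        "RIP Steve Jobs",
        "Steve Jobs changed everything",
        "Thank you, Jobs",
        "Jobs was a genius",
        "Still no jobs in the classifieds",
    ],
}

counts = {month: category_count(posts) for month, posts in posts_by_month.items()}
print(counts)
```

In this toy data the October count is double September's, yet only one of the October posts is about employment. A word count alone cannot tell the two meanings apart.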

King, whose research focuses on developing and applying empirical methods to social science research, said such errors happen "all the time" in sentiment analysis by word count and other "off the shelf" analytics programs. That's because these approaches tend to conflate humans with systems that respond in completely predictable ways. That's bad analytics. "We're pretty good at being humans but pretty lame at being computers."

The keynote speaker at the recent Text and Social Analytics Summit 2013 in Cambridge, Mass., King made a point that no doubt many companies are discovering as they attempt to extract value from the ready supply of data they collect and generate minute by minute. (Fun fact: The volume of email produced every five minutes is equivalent to all the digital data in the Library of Congress.) The real value in big data is in the quality of the analytics, and that often requires mathematics customized for your particular business purpose, not some generic off-the-shelf program.

"We have tried to commoditize analytics, and there are computer programs out there that do a lot of it," King said. But commercial software that automates that "last mile" -- the stretch that separates a winning big data analysis project from an also-ran -- he believes is rare, if it exists at all.

Word counts versus computer-assisted reading

A common feature of bad analytics involves the lumping of many individual classifications to answer questions about the zeitgeist. The Twitter analysis project described above is an example. Sentiment analysis by a categorical word count works for a bit, "but if you do it enough it will fail catastrophically," King said.

One way to avoid misinterpretation is by reading through the posts -- King works on computer-assisted reading -- to make sure the post is actually about the subject. This requires semantics as opposed to simple word counts and is much harder to do.

Bad analytics is not confined to the enormously difficult task of analyzing unstructured social media feeds. Another big data project gone wrong described by King tried to discover causes of death in parts of the world where no death certificate is issued. One way to collect such data is to have researchers go household by household and do what's called a "verbal autopsy." What were the symptoms the deceased exhibited before dying -- bleeding out the nose, stomach pains?

This works great, he said, until you try to turn the verbal report into a diagnosis and find you won't necessarily get the same cause of death from one physician to the next. Sending a physician to Tanzania to do the verbal autopsy would seem to help, but this could be a dead end too. A physician trained in Boston, for example, without much experience in tropical diseases, might not immediately think malaria when he hears "runny nose" and, in turn, get the cause of death wrong. And sending the best physician in Tanzania out in the field to do "this little study," King said, might actually wind up killing people by depriving them of a scarce commodity, namely, a doctor. The fundamental problem is that the analyses focus on classifying individuals when the real goal is to learn how causes of death are distributed across the whole population.
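The difference between classifying individuals and estimating a population's distribution can be illustrated with a small Python sketch. It is a deliberately simplified two-cause example, not King's actual method: it assumes that the rate of each symptom within each cause is already known from a reference sample, and that the observed rate of a symptom in the population is a mixture of those rates. All numbers, symptom names and the function name `estimate_share` are hypothetical.

```python
def estimate_share(rates_a, rates_b, observed_rates):
    """Least-squares estimate of the share of cause A in a two-cause population.

    For each symptom, the model assumes
        observed = share * rate_given_A + (1 - share) * rate_given_B.
    No individual death is ever assigned a cause.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, observed in zip(rates_a, rates_b, observed_rates):
        numerator += (observed - b) * (a - b)
        denominator += (a - b) ** 2
    if denominator == 0:
        raise ValueError("symptom rates are identical; the share cannot be identified")
    # Clamp to a valid proportion.
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical symptom rates for fever, cough and bleeding.
rates_malaria = [0.90, 0.30, 0.10]  # P(symptom | malaria), assumed known
rates_other = [0.20, 0.50, 0.15]    # P(symptom | any other cause), assumed known

# Simulate the symptom rates seen in household interviews when 40% of deaths are malaria.
true_share = 0.4
observed = [
    true_share * a + (1 - true_share) * b
    for a, b in zip(rates_malaria, rates_other)
]

estimated = estimate_share(rates_malaria, rates_other, observed)
print(round(estimated, 3))
```

The estimate uses only aggregate symptom frequencies, so no physician has to diagnose each case for the population-level answer to come out.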

"In public health, they don't care about you -- they care about what everybody died of," King said. The approach is ineffective across many fields. "Once we realized that, we realized we needed to come up with a different method for estimating the percent in the category that had nothing to do with an individual's classification."

Reading social media to predict China's next move

SearchCIO will be writing more about King's computer-assisted reading technology. Meantime, here's an example of his work to whet the appetite.

Partnering with Boston-based startup Crimson Hexagon Inc., King et al. set out to better understand what information the Chinese government censors by analyzing social media posts. The previous research on this topic had drawn the not-so-surprising conclusion that the government censored communications critical of the government itself.

"Totally not true," King said. Analyzing 11 million Chinese social media posts -- collected before the Chinese censors were able to take them down -- King's study found that 13% of them were censored. But the censored posts did not conform to the prevailing theories. Included in the 87% of uncensored posts were communications highly critical of government leaders -- "vitriolic, personal" attacks accusing local and national leaders of stealing money, having mistresses and a number of corrupt actions. The government was not censoring criticism of itself. "They were censoring any attempt at collective action, any attempt to move people by someone other than the government," King said.

The analysis of the censorship led to another potential insight. By analyzing social media censorship, it might actually be possible to predict political actions on important issues by Chinese officials. Case in point was the highly charged political atmosphere after oil was discovered in the South China Sea between China and Vietnam, King said. During the height of the tension between the two countries, with talk of war dominating the media coverage, the rates of censorship of social media "were soaring," King said. Then one day the rates plummeted. Less than a week later, China signed a treaty with Vietnam.
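As a rough illustration of how a sudden fall in censorship rates might be spotted, the Python sketch below computes a daily rate (posts later removed divided by posts collected) and flags days that fall well below the recent average. The article does not describe King's procedure, so the daily figures, the three-day window, the 50% threshold and both function names are assumptions.

```python
def censorship_rates(daily_counts):
    """Turn (posts_collected, posts_later_removed) pairs into daily rates."""
    return [removed / collected for collected, removed in daily_counts]


def sharp_drops(rates, window=3, ratio=0.5):
    """Return indexes of days whose rate is below `ratio` times
    the mean rate of the preceding `window` days."""
    flagged = []
    for i in range(window, len(rates)):
        baseline = sum(rates[i - window:i]) / window
        if rates[i] < ratio * baseline:
            flagged.append(i)
    return flagged


# Hypothetical daily counts: rates climb during the tension, then fall away.
daily = [
    (1000, 130),
    (1000, 300),
    (1000, 320),
    (1000, 310),
    (1000, 90),
    (1000, 100),
]

rates = censorship_rates(daily)
drops = sharp_drops(rates)
print(drops)
```

A flag of this kind is only a prompt for closer reading of the posts; on its own it says nothing about why the rate changed.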

Big data promises new breakthroughs. But companies have had access to a lot of the data out there for years, King said. "Being able to do something in an innovative way, designed for a purpose, is what makes it valuable."

Let us know what you think about the story; email Linda Tucci, Executive Editor.

Join the conversation


Let's assume your company is working on a big data project. Are you confident it has the right analytics to derive business value from it?
Always start with a pilot and end state in mind
I'm confident that most big data projects don't keep a close eye on the fourth V of big data (veracity, or quality). Having the right analytics is not the problem.
Ranjanasfdsa, what's the reasoning behind your decision? Are you worried the analytics won't be accurate?
Analytics has to deliver faster than it used to. It is necessary to balance decision makers' need for fast information against the low speed of 'perfect' models and tracking ...
Here in México, hiring of professionals with seniority in statistics and math is increasing very fast. This trend answers the question raised here: companies must hire statistics experts in order to deal with big data and social media data.
We had many successful analytics projects based on statistical analysis and data mining before going to big data. We then wanted to apply this to a massive amount of data. We didn't go for the hype.
We've taken the steps to assess our business intelligence needs, find a BI solution to fill those needs and implement the BI solution throughout the entire company. From the CEO to low-level employees, everyone has access to how their efforts fit into the company's goals and strategies. This is because we selected a BI solution with real-time data feeds, dashboards and collaboration. It's changed our company.
It is essential to have real-time analysis to get effective results.
At this moment, we are not confident at all. There is still a long way to go on big data analytics implementation.
Having supplied BI solutions since 1998, working with different software suppliers, I'd say the first requirement for success is that the company's application users really do want answers and use BI to get them, to have the situation under control and then react accordingly. While one has to begin somewhere, big data requires specific know-how and mature software, and I am not sure the latter exists.
Normally, data warehousing efforts are IT projects where there is more emphasis on infrastructure than on analysis. Many times this results in dead data.
Focus on mathematics and be ready to be surprised by the variety of data; results may be a bit different from what you thought the math equations would produce.
You are applying a metaphysical equation: multiplying the unknown by the uncertain!!
The off-the-shelf "Big Data" analytics technology is still in its infancy, but in-house solutions are not the best answer to this kind of project challenge. Based on some proofs of concept (POCs) I've carried out to test market solutions, HP Autonomy and IBM Content Analytics have the best answers to this challenge, but both require some customization (in my opinion, HP Autonomy is the "top" solution for Big Data).
For big data analytics, no single algorithm is a silver bullet! For example, the data CMS released on pharmacy drugs for patients/mHealth is complicated. After applying regression, I looked at the bias in the coefficients. The results pointed to dates of service on which the patient paid nothing or less than the total cost of the drugs. But we need to quantify "less." What about $10 less, or $20 less? Only then can patient behavior be interpreted, because Medicare patients who had to pay the least (or had the largest price differential) would be filling their medications often. Hence the need for segmentation. So apply mathematics, for example differential calculus and Taylor's power series, to get proper, actionable results.
Just came across a company called Argus Labs that seems to be on the cusp of something interesting. Worth looking into if you're interested in what they're calling "contextualized data" about users.
We are planning to work on big data, and first I wanted to do analysis for this.
I believe analytics being done by untrained statisticians is a cause of big data analysis failure.
Quite fascinating that big data can even play an important role in big cases like China! I'm wondering whether they are using simple algorithms for the detection or some really strong technology... It's weird that scientists support that idea, but God knows what they are forced to do :/