Gut instinct is out and data-driven decisions are in at smart companies, right? Not so fast. Even when relying on data to drive business decisions, organizations should proceed with caution and skepticism.
New York City's Columbia University hopes to instill this kind of thinking in its Master's in Data Science program, which begins this fall. Courses will delve into technologies like machine learning and data analytics. But according to Rachel Schutt, senior vice president of data science for New York City-based media conglomerate News Corp. and an adjunct professor of statistics at Columbia, students will also have to think hard about ethics and data manipulation, especially as organizations try to monetize the data they collect.
Take such data-based products as LinkedIn's People You May Know recommendation engine or, everyone's daily habit, Google's search engine. "The more a user uses the product, the more data is generated, which can improve the product," she said at the Harvard University Institute for Applied Computational Science's (IACS) annual symposium in Cambridge, Mass. Data products capture the behavior of users in the hopes of predicting their behavior. That concept is nothing new for "classical statistics problems," Schutt said, where models are often designed to understand causation or make predictions. But with this technology, data scientists must keep in mind that data products built on human behavior can in and of themselves change the behavior of people who use those products.
"In the context of building data products … the models and algorithms that you use to predict also have the ability to cause," she said. Think of it this way: When a statistician builds a model to predict the weather, the model will never cause the weather to change, she said. But when a statistician builds a model to predict who a user knows or what information a user is looking for, how search results and recommendations are selected and ranked could influence what the user clicks on.
"You have to be aware of the impact you're having on products going out into society," Schutt said.
More data is better
Having more data beats having better models. "Not all of the time, but often," said Diane Lambert, research scientist for Mountain View, Calif.-based Google Inc. Take search queries. Users who query Britney Spears and misspell her name still expect to see results for the Britney Spears. "If you have more data, you can start solving problems like that where people spell really badly."
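To make Lambert's point concrete, here is a minimal sketch of how a large query log can stand in for a clever spelling model. It is not how Google does it; the function name `build_corrector`, the toy log and the similarity cutoff are all assumptions made up for illustration. The idea is simply that, among past queries that look like what the user typed, the most frequently seen one is probably what the user meant.

```python
from collections import Counter
from difflib import get_close_matches


def build_corrector(query_log, cutoff=0.8):
    """Return a function that maps a query to its most popular look-alike.

    query_log is a list of past queries (hypothetical toy data here).
    cutoff is an assumed similarity threshold between 0 and 1.
    """
    counts = Counter(q.lower() for q in query_log)
    known = list(counts)

    def correct(query):
        query = query.lower()
        # Past queries that are textually close to what the user typed.
        candidates = get_close_matches(query, known, n=5, cutoff=cutoff)
        if not candidates:
            return query  # nothing similar was ever seen; leave it alone
        # More data means better counts: pick the most frequent candidate.
        return max(candidates, key=lambda c: counts[c])

    return correct


# Invented toy log: most people spell the name correctly, a few do not.
toy_log = (
    ["britney spears"] * 50
    + ["brittany spears"] * 3
    + ["britny spears"] * 2
    + ["boston events"] * 10
)
correct = build_corrector(toy_log)
suggestion = correct("Britny Spears")
```

With only a handful of logged queries, the counts would be too noisy to trust; the approach gets better only because the log gets bigger.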
And tapping into all of the data on the Web can produce sophisticated products. Just look at Google predictive text, which anticipates what the query is likely to be before the user finishes typing as well as what type of advertisements best match that query -- all at lightning speed. (See caveat above.) More data often beats better models, Lambert said, but more data isn't everything.
"You still need to experiment," she said at the IACS symposium. "Without experiments, you can't answer questions like, 'Is this [user interface] change better or is it going to confuse people?'"
Google runs thousands of experiments at the same time, Lambert noted -- with our help. "If you've ever put a query into Google, you've been in an experiment."
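What does answering "Is this change better?" look like in practice? One common textbook approach, offered here as a sketch and not as a description of Google's experiment system, is a two-proportion z-test: show the old interface to one group of users and the new one to another, then check whether the difference in click rates is bigger than chance would explain. The function name and the numbers below are assumptions for illustration.

```python
import math


def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Compare click rates of variant A (control) and variant B (change).

    Returns (z, p_value) for a two-sided test under the usual
    normal approximation, which assumes reasonably large groups.
    """
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled rate under the assumption that the change made no difference.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so no evidence either way
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented numbers: 10% of control users click, 15% of test users click.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the new interface really did change behavior; it does not say whether users were helped or merely confused into clicking, which is the kind of question Schutt's warning about cause and effect raises.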
Better than Google?
What's the next new thing in search? Cynthia Rudin is building a better search engine by "growing a list."
"The current generation of search engines tells you where to find information, but the next generation finds it for you," Rudin, associate professor of statistics at Cambridge, Mass.-based Massachusetts Institute of Technology, said at the IACS symposium.
Rudin's search engine starts with a "seed" and then aggregates information from expert online sources to find more information related to the seed. The algorithm competes with Google Sets, which launched about 10 years ago, and Boo!Wa!, an academic engine, and, according to Rudin, significantly outperforms both.
In a search for annual Boston events, Rudin's model came up with a good-sized list that included First Night Boston, Beantown Jazz Festival, Chinatown Main Street Festival and so on. Boo!Wa! came back with a mix of events (Boston Wine Expo), general topics (Boston Red Sox) and junk results (parking in Boston). And Google Sets? Its list included the Boston Massacre … and zero annual events.
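The article does not spell out Rudin's algorithm, so the following is only a rough sketch of the general "grow a list from a seed" idea, not her method. It assumes each online source has already been reduced to a plain list of items, and it scores a candidate by how often it appears alongside seed items. The function name `grow_list` and the toy sources are invented for illustration.

```python
from collections import Counter


def grow_list(seed, sources, top_k=5):
    """Rank items that co-occur with the seed items across sources.

    seed is a list of known examples; sources is a list of item lists,
    one per (hypothetical) web page. Returns up to top_k new items.
    """
    seed = set(seed)
    scores = Counter()
    for items in sources:
        items = set(items)
        overlap = len(items & seed)
        if overlap == 0:
            continue  # this source says nothing about the seed
        for item in items - seed:
            # A source sharing more seed items gets a bigger vote.
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_k]]


# Invented toy sources, reusing names mentioned in the article.
toy_sources = [
    ["First Night Boston", "Beantown Jazz Festival", "Chinatown Main Street Festival"],
    ["First Night Boston", "Beantown Jazz Festival"],
    ["Boston Red Sox", "parking in Boston"],
]
grown = grow_list(["First Night Boston"], toy_sources)
```

In this toy run, the third source never mentions the seed, so its items are ignored, which is one simple way junk results such as "parking in Boston" can be kept off the list.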
Tweet roundup from #datastorm14
The IACS symposium attracted the cream of the academic data expert crop and beyond, who naturally held forth on Twitter. Intelligentsia from the University of California at Berkeley, the University of Washington, Google, IBM and Dropbox, to name a handful, weighed in on the new (newly anointed?) field of data science. Read all the tweets by searching #datastorm14 on Twitter, or enjoy this slightly edited collection.
Harvard IACS (@Harvard_IACS), official handle for IACS: "Claudia Perlich: Even in this age of big data, 'You almost never have enough of the right data.'"
Steve Patton (@spttnnh), information security professional: "Panel: machine vs. scientist tension will remain. We must check results as we cope with data only machines can handle."
Harvard SEAS (@hseas), Harvard School of Engineering and Applied Sciences: "'Regardless of the amount of data we have … we only have two eyeballs and one brain.' -- Fernando Pérez."
Marta S. Rivera Monclova (@PhDeviate), founder and CEO of consultancy PhDeviate Inc.: "Data scientist personality: creative, curious, problem solving, can tell stories with and about data."