If the big data hype is starting to become a big data headache, don't throw in the towel just yet. Gideon Mann, head of Bloomberg LP's data science team in New York City, has put together six big data project "gotchas" to help CIOs and data science teams avoid painful mistakes.
Mann, who spent six years at Google as a research scientist before joining Bloomberg, works for the office of the CTO, where he pitches in on both internal and customer-facing big data projects. "In some cases, we function like a product management organization; in some cases we function like architects; in some cases we function like prototypers; and in some cases, we function like researchers," he said.
Whatever the case, a couple of things are apparent: Clear communication with collaborators is key, and flexible planning goes a long way.
Gotcha #1: Algorithms are magic
Data science and machine learning are powerful tools, but they don't perform magic. Seems obvious, but Mann said it's an important point to keep in mind when determining big data project scope. What does he mean by scope? "I mean define what a success is," he said.
If the criteria for success aren't something a human could reasonably do, don't expect an algorithm to succeed either. Simple as that -- with one stipulation: "You have to take out the context of time," Mann said. "If you look at the way a search engine works, a person could do that task, it would just take them looking through every document and deciding if it's relevant or irrelevant."
Gotcha #2: The quick fix is good enough
It's important to explain to the business that machine learning techniques may not be the fastest techniques out of the gate, but they often provide more sustainable solutions. Take sentiment analysis. "It's easy at the outset to write out a lot of these little rules," Mann said, "but that doesn't scale." Language continuously changes, and maintaining a list of all variations of a phrase will eventually become inefficient.
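To make the scaling problem concrete, here is a minimal Python sketch of the kind of hand-written rule list Mann describes. The phrase lists, function names and sample sentences are all hypothetical and chosen only for illustration; they are not drawn from Bloomberg's systems. The point is simply that exact-phrase rules can label the language they were written for and still miss newer phrasing entirely.

```python
# Hypothetical hand-written sentiment rules; the phrases are illustrative only.
POSITIVE_PHRASES = {"great", "love it", "works well"}
NEGATIVE_PHRASES = {"terrible", "hate it", "broken"}


def rule_based_sentiment(text: str) -> str:
    """Return 'positive', 'negative' or 'unknown' using substring lookups."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in POSITIVE_PHRASES):
        return "positive"
    if any(phrase in lowered for phrase in NEGATIVE_PHRASES):
        return "negative"
    return "unknown"


def coverage(texts: list[str]) -> float:
    """Fraction of texts the rules can label at all."""
    labeled = sum(1 for t in texts if rule_based_sentiment(t) != "unknown")
    return labeled / len(texts)


# Made-up examples: phrasing the rules anticipated vs. phrasing they did not.
older_reviews = ["I love it", "This is terrible", "Works well for me"]
newer_reviews = ["This slaps", "Absolute dumpster fire", "Lowkey obsessed"]

print(coverage(older_reviews))
print(coverage(newer_reviews))
```

Every new turn of phrase requires another rule, which is the maintenance burden a learned model, retrained on fresh labeled examples, is meant to reduce.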
Initially, the product team might be attracted to the quick fix because it yields a quick reward that performs as well as the slower-to-build machine learning algorithm. But once the team discovers how brittle the quick fix is, the appeal will likely turn into frustration. To avoid the misstep, Mann suggests "making sure the expectations of the team that's asking for [a product] are well aligned with who is delivering."
The good news? As product teams become savvier (read: burned by the quick fix), this gotcha goes away. "When you work with a company that's as technologically sophisticated as a Bloomberg or a Google, there's recognition from the product team itself. They intuitively understand what is and what is not going to work, and what kind of scaling you are or are not going to need," Mann said. Anything that doesn't meet the team's scale expectations won't be tolerated.
Gotcha #3: Set it and forget it
"Techniques that worked yesterday might not work tomorrow simply because the nature of what you're observing has changed," Mann said. The goal of Google AdWords, for example, is to display ads alongside search results that will generate the highest return on investment for the advertiser.
But consumer behavior changes all of the time, which means the underlying algorithm that's making the ad selection cannot be static. Instead, it needs to be continuously tested and reevaluated to accommodate the "domain drift." Mann said data science teams often use a combination of automated and manual testing procedures to assess the performance of the algorithm.
How often should an assessment be performed? "Depends on how quickly you expect the world to change," Mann said.
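One simple way to automate part of that assessment is to track a model's accuracy over a sliding window of recent predictions and flag it for review when accuracy falls too far below a known baseline. The Python sketch below assumes that approach; the DriftMonitor class, its parameters and the thresholds are hypothetical, not a description of how Google or Bloomberg test their algorithms.

```python
from collections import deque
from typing import Optional


class DriftMonitor:
    """Flag a model for review when recent accuracy drops below a baseline.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, baseline_accuracy: float, window_size: int = 100,
                 tolerance: float = 0.05) -> None:
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes: deque = deque(maxlen=window_size)

    def record(self, was_correct: bool) -> None:
        """Store whether the latest prediction turned out to be correct."""
        self.outcomes.append(was_correct)

    def recent_accuracy(self) -> Optional[float]:
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self) -> bool:
        """True once the window is full and accuracy is below the threshold."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.recent_accuracy() < self.baseline_accuracy - self.tolerance


monitor = DriftMonitor(baseline_accuracy=0.90, window_size=10, tolerance=0.05)
for correct in [True] * 9 + [False]:
    monitor.record(correct)
stable = monitor.needs_review()

# Simulate a shift in behavior: the same model is now right less often.
for correct in [True] * 6 + [False] * 4:
    monitor.record(correct)
drifted = monitor.needs_review()
```

The window size plays the role of Mann's answer: a faster-changing domain calls for a shorter window and more frequent checks.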
Gotcha #4: Failure is not an option
Expecting perfection from a machine learning or big data project on the first go is a recipe for disaster. A data science team is better off figuring out an acceptable failure rate -- the thin line between when a product provides enough value that customers will use it (thereby generating the data needed to build the next generation) versus when a product misses the mark.
Think of it as the difference between Facebook M, a virtual personal assistant, and Apple's Siri, a natural language processing system. "Siri works OK, but it doesn't achieve the accuracy rate that people need to use it for everything," Mann said. Instead, customers use the product selectively, which limits the scope of the data Apple can collect.
M, on the other hand, "integrates human action and machine action holistically so that you make sure as a user, it will solve whatever problem you have," Mann said. "As a result, Facebook's able to collect training data from a very wide perspective, which it can then use to train its model."
Gotcha #5: Privacy is dead
Customer privacy can be a tough nut to crack. Why did customers reject Facebook Beacon, which collected information on users' buying habits, but embrace Google Now, a virtual assistant that collects personal information to, say, provide traffic alerts?
"I suspect it comes down to how in your face the application is," Mann said.
Customers may have pooh-poohed Facebook Beacon, but they certainly don't feel the same way about the company itself, which uses customer data as currency. Perhaps when the bartering happens somewhere in the background, it's easier for customers to ignore, Mann said, and "what they do notice is the value they get out of using Facebook."
To find the line between customer value and Big Brother, Mann suggested trial and error, user focus groups and talking with people.
Gotcha #6: The business knows exactly what it wants
The time it takes to build a machine learning system and the time it takes for product requirements to change don't typically mesh, a reality CIOs and the data science team would be wise to expect. To avoid following the business down rabbit hole after rabbit hole, Mann suggested "building general," which allows room for flexibility, rather than building small and specific, "because the small problem might change," he said.
The business might start off asking for a product recommendation engine only to later realize it really needs a news recommendation engine. "Depending on how tightly you designed your machine learning set up, you may or may not be able to adjust," Mann said.
Adopting an Agile workflow can help CIOs and the data science team maintain flexibility and make adjustments when necessary.