Three years ago, forward-looking experts predicted Hadoop could become so integrated into IT systems that chatter about the technology itself would fade away. Last week at Strata + Hadoop World in New York City, Mike Olson, co-founder and chief strategy officer at Cloudera Inc., announced that the time has come. "This year, we're going to see Hadoop disappear," he said.
In fact, he said, the conversation is already shifting. He pointed to DigitalGlobe Inc., which produces high-resolution satellite images and is best known for its Google Earth work. In the wake of the Boko Haram kidnapping of 270 Nigerian schoolgirls, DigitalGlobe analyzed images, looking specifically at roads, clearings and towns. It built models to predict movement along these corridors and eventually recommended 14 locations to security and government officials, Olson reported.
"A short time afterwards, in nine of those locations, actual hostages were sighted," he said. "So the predictions were good."
And he pointed to Children's Healthcare of Atlanta (CHOA), a leading pediatric hospital. In the past, data collected in the neonatal unit from blood pressure and heart monitors was too big to keep, Olson said. Today, CHOA is using big data technology to figure out how environmental conditions -- lights and noises -- impact patient care.
"We're beginning to see a focus on the stories data enables rather than on the technology that enables the data collection," he said.
This is not unlike what's happened to the Oracles or the SAPs of the enterprise. "If you talk to Teradata or EMC customers, you'll hear business users getting enormous value from those platforms," Olson said. "But the users don't even know they're there." Hadoop is on its way to becoming part of the background, he said.
The data science of humor
How is writing a caption like solving a data science problem? That's a question for Bob Mankoff, cartoon editor at The New Yorker, who gave a talk at last week's Strata + Hadoop World. Every week he runs a caption contest for readers. The entries -- 5,372 on average per week -- are culled to a top three list, with readers selecting the winner.
Mankoff, a self-described technophile, said successful captions often contain bisociation, the combining of two seemingly unrelated ideas. "Arthur Koestler, who wrote the book The Act of Creation, said humor and science and art are the same, bringing together things you didn't think were going to come together," he said during his talk.
Figuring out how two dots are (or are not) connected is a data science trademark. Peter Guerra, who leads a team of data scientists at the consultancy Booz Allen Hamilton Inc. in Annapolis Junction, Md., describes it as "the ability to think differently about a problem and view it from multiple angles." It's a skill he and his Booz Allen cohorts specifically look for because, while data science skills such as math and programming are important, "the harder thing to find is creativity," he said.
Notes from a data scientist
As the leader of a team of 500 data scientists, Guerra, it's safe to say, knows what he's talking about. Here are three more notes from his data science playbook:
1. Data science is a team sport. "It's not a purple unicorn," he said, meaning CIOs aren't going to find a computer scientist with knowledge of distributed systems, a mathematician who can develop algorithms and a domain expert who knows what kinds of questions to ask of the data all wrapped up in one person. "We feel like we get the best bang out of our buck with a team of people," he said.
2. Correlation is not causation. This often-quoted phrase is a data scientist's mantra. Guerra gave the example of a life sciences client that couldn't figure out how to get a consistent yield for one of its vaccines. "In that world, you can't do correlation," he said. Instead, his data scientists had to test each correlation to see which ones held up.
3. Spark's sweet spot. Tools may not make the data scientist, but they sure help. And one tool that continues to gain attention in the data science world is Spark. Developed at the University of California, Berkeley's AMPLab, Spark is an in-memory data processing engine that Guerra describes as both scalable and fast. "For our clients, they want to be able to do more iterative-type queries," he said. "MapReduce is great for the heavy lifting, but if you need a quick answer, Spark is great for that."
"Data has grown; however, five to 10 years ago, we still had a big data problem or a big data challenge. We just didn't have the technology to address it." -- Peter Ferns, CTO of compliance technology, Goldman Sachs & Co.
"The question used to be, what data can we afford to keep? Now it's really, what can we afford to throw away?" -- Chris Wilson, senior vice president of direct channel, L.L. Bean Inc.
"An increasingly important conversation to have in this space is how do you donate data, how do you opt in for research studies and say, 'I don't want this data to go to advertising, but I do want it to go to research.' We don't even have the right policies around that yet." -- Rachel Kalmar, data scientist, Misfit Wearables
"Between cloud and fog [computing], I'm looking for atmospheric confusion." -- Roger Magoulas, research director, O'Reilly Media