With machine learning, data scientists have to perform a task called feature engineering. “People get the incoming data, and they prepare it, and they clean it, and they maybe manipulate it in a way that’s going to give them the relevant information,” said Edd Wilder-James, former vice president of technology strategy at Silicon Valley Data Science and now an open source strategist at Google’s TensorFlow, during a presentation at the Strata Data Conference.
Consider a machine learning model that determines whether a photograph was taken during the day or at night, with photographs serving as the training data. Before the model is released into production, and before it’s even trained, data scientists have to determine which features in the data will help the model learn. “Our feature engineering might be as simple as counting the number of dark pixels at a certain threshold: What percentage of the image is dark?” he said.
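To make the dark-pixel idea concrete, here is a minimal sketch in Python of what such a hand-built feature could look like. It is an illustration, not anything Wilder-James presented: the function names, the brightness cutoff of 64, the 50% rule for calling an image "night" and the two tiny sample images are all assumptions chosen for the example. In practice the cutoffs would be tuned against real photographs, or the dark-pixel fraction would be handed to a trained model as one input among several.

```python
from typing import List

# A grayscale image as rows of pixel values, 0 (black) to 255 (white).
Image = List[List[int]]


def dark_pixel_fraction(image: Image, dark_threshold: int = 64) -> float:
    """Return the share of pixels whose brightness is below dark_threshold.

    The default of 64 is an arbitrary, assumed cutoff for illustration.
    """
    total = sum(len(row) for row in image)
    if total == 0:
        raise ValueError("image has no pixels")
    dark = sum(1 for row in image for pixel in row if pixel < dark_threshold)
    return dark / total


def classify_day_or_night(
    image: Image, dark_threshold: int = 64, night_fraction: float = 0.5
) -> str:
    """Label an image 'night' when at least night_fraction of its pixels are dark.

    Both thresholds are hypothetical; choosing them well is the
    feature engineering work described in the article.
    """
    fraction = dark_pixel_fraction(image, dark_threshold)
    return "night" if fraction >= night_fraction else "day"


# Two made-up 2x4 "photographs", for illustration only.
night_photo = [[10, 20, 200, 15], [5, 30, 12, 40]]
day_photo = [[220, 180, 40, 250], [190, 210, 230, 175]]

print(dark_pixel_fraction(night_photo))    # 7 of the 8 pixels fall below 64
print(classify_day_or_night(night_photo))
print(classify_day_or_night(day_photo))
```

Even in this toy version, the result depends entirely on two numbers a person had to pick, which is why choosing features and thresholds takes knowledge of the data.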
Pinning down features and thresholds is a difficult but vital process that requires domain expertise and knowledge of the data, according to Wilder-James. “With this kind of machine learning, a lot of the effort goes into … figuring out what are the features and making the damn thing work,” he said.
With deep learning, data scientists can skip the feature engineering step. The model instead relies on enormous training data sets, and a great deal of time, to figure out which is which on its own.
“It’s slow. We’re talking days, weeks even, maybe a month to train a model,” Wilder-James said. “It requires a large amount of training data to get right. This is definitely a big data problem, in that sense.”
And be warned, deep learning models can also be fooled. Generative adversarial networks can trick a model into seeing something in the images that can’t be detected by the human eye. This creates big security implications, Wilder-James said.