Those CIOs starting to implement machine learning projects in their organizations should pay attention to the quality and source of their data, according to Ed Featherston, VP and principal architect at cloud computing consulting firm Cloud Technology Partners. Featherston spoke with SearchCIO at the recent Cloud Expo in New York. In this video, he offers pointers on how CIOs can prepare their organizations for machine learning projects and explains the need for figuring out the best way to transport their data into machine learning algorithms. He also sheds light on how internet-connected devices and crowdsourced data are changing data pedigree, and explains why candidates with data science skills are a top priority for CIOs hiring for machine learning projects.
Read excerpts of the interview below.
How should CIOs prepare the organization for machine learning projects?
Ed Featherston: A lot of that centers on the data and being able to answer questions like, "Where is my data? Who has my data? Who knows my data? Who understands it best?" One of the biggest challenges with leveraging machine learning as a service is understanding how you are going to get the data to that algorithm. If I'm using IBM Watson, for example, and I have 50 petabytes of data, sending that out over the internet is probably not going to be an optimal solution. I have to figure out not only what data I need, but also how I am going to get it to where it needs to be for the analysis, and I have to worry about whether the quality of my data is up to snuff.
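To put the 50-petabyte example in perspective, here is a rough back-of-the-envelope sketch of the transfer time. The 10 Gbps link speed is an assumption chosen for illustration, not a figure from the interview, and the calculation ignores protocol overhead unless an efficiency factor is supplied.

```python
PETABYTE = 10**15   # decimal petabyte, in bytes
GIGABIT = 10**9     # bits per second in one Gbps
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Estimate the days needed to move size_bytes over a network link.

    efficiency is the fraction of the nominal link rate actually achieved
    (1.0 means a perfectly saturated link, which is optimistic).
    """
    if link_bits_per_sec <= 0:
        raise ValueError("link_bits_per_sec must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed scenario: 50 PB over a dedicated 10 Gbps link.
    days = transfer_days(50 * PETABYTE, 10 * GIGABIT)
    print(f"About {days:.0f} days")  # About 463 days
```

Even under these optimistic assumptions the transfer takes well over a year, which is why where the data lives relative to the algorithm matters so much.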
There is nothing worse than feeding bad data into a machine learning algorithm, because then you're going to get some really bizarre results. Microsoft's Tay chatbot learned to become racist because the algorithm is only as good as the data you feed it. If you feed it bad data, you're going to get bad results.
The timeliness of the data is also important, like in the example I was giving with the railways; up-to-date data is important. If the data is three weeks old, that's not really going to tell you a whole lot. You have to take into account how you are going to get the fuel to the engine if they're not in the same place.
The last one is a special one, and it relates specifically to the IoT space and to crowdsourced data: the data pedigree. If all the data is your data, that's fine, you have control over that. But a lot of times people are getting data from devices all over the internet or they're getting crowdsourced data. Think of Waze, the GPS application. Waze is entirely crowdsourced as far as traffic information goes -- where backups are, where accidents are -- but they have to trust that people are providing good information. Somebody could just start saying, "There's an accident here." They have to have algorithms in place that analyze not just that somebody said there's an accident there, but whether other Waze drivers are moving slowly in that area and whether other Waze drivers are reporting it, so that they can qualify the pedigree of the information that they're getting.
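The kind of corroboration described here can be sketched in a few lines. This is not Waze's actual algorithm, which is not public in this interview; the `Report` type, the road-segment identifiers, and every threshold below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    reporter_id: str  # hypothetical user identifier
    segment: str      # hypothetical road-segment identifier


def is_corroborated(reports: list[Report], segment: str,
                    nearby_speeds_kmh: list[float],
                    min_reporters: int = 2,
                    slow_speed_kmh: float = 20.0,
                    min_slow_fraction: float = 0.5) -> bool:
    """Decide whether an accident report on a segment looks trustworthy.

    A report counts as corroborated if enough distinct users reported it,
    or if at least one user reported it and most nearby drivers are slow.
    """
    # Count distinct reporters so one user repeating a claim adds no weight.
    reporters = {r.reporter_id for r in reports if r.segment == segment}
    if not reporters:
        return False
    if len(reporters) >= min_reporters:
        return True
    if not nearby_speeds_kmh:
        return False  # a lone report with no traffic evidence stays unverified
    slow = sum(1 for speed in nearby_speeds_kmh if speed <= slow_speed_kmh)
    return slow / len(nearby_speeds_kmh) >= min_slow_fraction


if __name__ == "__main__":
    lone = [Report("user-1", "seg-42")]
    print(is_corroborated(lone, "seg-42", [95.0, 102.0, 88.0]))  # False
    print(is_corroborated(lone, "seg-42", [12.0, 8.0, 60.0]))    # True
```

The point of the design is that no single source is trusted on its own: a claim gains pedigree only when independent reporters or independent signals agree with it.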
Devices are the same way and I go back to the chatbots again. Just think of somebody hijacking your devices and sending bogus information into your analysis as a business attack. If you don't worry about where you're getting the data from and whether you can trust the data source, it could open you up to all kinds of possible dangers.
What skills should CIOs look for when hiring for machine learning projects?
Featherston: The biggest ones are all in the data, data science and data analysis space, because it's key that the people understand the data and understand how to move the data. Today it's more than just the structured database data that we're used to, because there's so much unstructured data out there that is key to a lot of this information. You need people with skill sets that cover how to manipulate that data and how to verify its quality, integrity and pedigree.
You need people who understand the mechanisms and the tools that are available to work with the data. For machine learning projects, really big companies can afford to hire scientists who know how to write those algorithms. But if you're using somebody else's algorithms, it's ultimately just a set of APIs that you're feeding information into. Your benefit comes from having people who actually understand the data: how to use it, how to manipulate it, how to transport it, how to cleanse it, and all of those things. The data skills are going to be critical in the machine learning space.