

CIOs need an AI infrastructure, but it won't come easy

From picking vendors to upskilling staff, folding an AI infrastructure into enterprise architectures isn't simple. Experts at the recent ReWork Deep Learning Summit zero in on the issues.

CIOs are starting to rethink the infrastructure stack required to support artificial intelligence technologies, according to experts at the ReWork Deep Learning Summit in San Francisco. In the past, enterprise architectures coalesced around efficient technology stacks for business processes, supported first by mainframes, then by minicomputers, client-server systems, the internet and now cloud computing. But every level of infrastructure is now up for grabs in the rush to take advantage of AI.

"There were well-defined winners that became the default stack around questions like how to run Oracle and what PDP was used for," said Ashmeet Sidana, founder and managing partner of Engineering Capital, referring to the Programmed Data Processor, an older model of minicomputer

"Now, for the first time, we are seeing that every layer of that stack is up for grabs, from the CPU and GPU all the way up to which frameworks should be used and where to get data from," said Sidana, who serves as chief engineer of the venture capital firm, based in Menlo Park, Calif.

The stakes are high for building an AI infrastructure -- startups, as well as legacy enterprises, could achieve huge advantages by innovating at every level of this emerging stack for AI, according to speakers at the conference.

But the job won't be easy for CIOs faced with a fast-evolving field where the vendor pecking order is not yet settled, and their technology decisions will have a dramatic impact on software development. An AI infrastructure requires a new development model, one built around a statistical rather than deterministic process. On the vendor front, Google's TensorFlow technology has emerged as an early winner, but it faces production and customization challenges. Making matters more complicated, CIOs also must decide whether to deploy AI infrastructure on private hardware or use the cloud.

New skills required for AI infrastructure

Traditional application development approaches build deterministic apps with well-defined best practices. But AI involves an inherently statistical process. "There is a discomfort in moving from one realm to the other," Sidana said. Acknowledging this shift and understanding its ramifications will be critical to bringing the enterprise into the machine learning and AI space, he said. 
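To make the contrast concrete, here is a minimal Python sketch, illustrative only and not from the conference, that compares a deterministic rule with a statistical model: the hard-coded rule always returns the same answer for the same input, while the learned model returns a probability that shifts with the training data.

```python
# Minimal sketch: deterministic rule vs. statistical model (illustrative only).
from sklearn.linear_model import LogisticRegression

# Deterministic: a hard-coded business rule always gives the same answer.
def approve_deterministic(credit_score: int) -> bool:
    return credit_score >= 650

# Statistical: a model learned from data returns a probability,
# and its answer changes as the training data changes.
X_train = [[580], [620], [660], [700], [740]]  # hypothetical credit scores
y_train = [0, 0, 1, 1, 1]                      # hypothetical outcomes
model = LogisticRegression().fit(X_train, y_train)

print(approve_deterministic(655))              # always True
print(model.predict_proba([[655]])[0][1])      # a learned probability
```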


The biggest ramification is also AI's dirty little secret: The types of AI that will prove most useful to the enterprise (machine learning and especially deep learning approaches) work great only with great data, in both quantity and quality. With algorithms becoming more commoditized, what used to be AI's major rate-limiting factor -- the complexity of developing the software algorithms -- is being supplanted by a new hurdle: the complexity of data preparation. "When we have perfect AI algorithms, all the software engineers will become data-preparation engineers," Sidana said.
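As a rough illustration of why data preparation tends to dominate the effort, the following sketch (with hypothetical file and column names, not drawn from the article) shows the kind of cleanup that typically has to happen before any data reaches a learning algorithm.

```python
# Illustrative data-preparation sketch; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")              # hypothetical raw export

# Typical cleanup steps that precede any model training:
df = df.drop_duplicates()                         # remove repeated records
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount", "customer_id"])  # discard unusable rows
df["country"] = df["country"].str.strip().str.upper()  # normalize labels

# Only after this does the modeling work begin.
features = pd.get_dummies(df[["amount", "country"]])
```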

Then, there are the all-important platform questions that need to be settled. In theory, CIOs can deploy AI workloads anywhere in the cloud, as providers like Amazon, Google and Microsoft, among others, can provide nearly bare-metal GPU machines for the most demanding problems. But conference speakers stressed that, in reality, CIOs must carefully analyze their needs and objectives before making a decision.

TensorFlow examined

There are a number of deep learning frameworks, but most are focused on academic research. Google's TensorFlow is perhaps the most mature framework from a production standpoint, but it still has limitations, AI experts noted at the conference.


Eli David, CTO of Deep Instinct, a startup based in Tel Aviv that applies deep learning to cybersecurity, said TensorFlow is a good choice when implementing specific kinds of well-defined workloads like image recognition or speech recognition.

But he cautioned it requires heavy customization for seemingly simple changes like using non-rectangular convolution operations on images. "You can do high-level things with the building blocks, but the moment you want to do something a bit different, you cannot do that easily," David said.
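David's point maps roughly onto the difference between using TensorFlow's stock layers and writing your own: a standard rectangular convolution is a one-liner, while anything off the beaten path means implementing a custom layer by hand. The sketch below is a generic illustration of that gap, not Deep Instinct's code.

```python
# Illustrative sketch: easy with stock building blocks, harder off the path.
import tensorflow as tf

# Standard case: a rectangular 3x3 convolution is a single stock layer.
standard = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation="relu")

# Non-standard case: anything unusual means writing a custom layer yourself.
class CustomConv2D(tf.keras.layers.Layer):
    """Hypothetical skeleton; non-rectangular logic would be hand-written in call()."""
    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.conv = tf.keras.layers.Conv2D(filters, kernel_size)

    def call(self, inputs):
        # Placeholder: the custom convolution math would have to be implemented here.
        return self.conv(inputs)
```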

The machine learning platform that Deep Instinct built to improve the detection of cyberthreats by analyzing infrastructure data, for example, was designed to ingest a number of data types that are not well-suited to TensorFlow or existing cloud AI services. As a result, the company built its own deep learning systems on private infrastructure, rather than running them in the cloud.

"I talk to many CIOs that do machine learning in a lab, but have problems in production, because of the inherent inefficiencies in TensorFlow," David said. He said his team also encountered production issues with implementing deep learning inference algorithms based on TensorFlow on devices with limited memory that require dependencies on external libraries. As more deep learning frameworks are designed for production, rather than just for research environments, he said he expects providers will address these issues.

Separate training from deployment

It is also important for CIOs to separate the training of deep learning algorithms from their deployment, said Evan Sparks, CEO of San Francisco-based Determined AI, a service for training and deploying deep learning models. The training side often benefits from the latest and fastest GPUs. Deployments are another matter. "I pushed back on the assumption that deep learning training has to happen in the cloud. A lot of people we talk to eventually realize that cloud GPUs are five to 10 times more expensive than buying them on premises," Sparks said.
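Sparks' cost argument is easy to sanity-check with back-of-the-envelope arithmetic. The numbers in the sketch below are hypothetical placeholders, not figures from the article, and real prices vary widely with hardware generation and utilization.

```python
# Back-of-the-envelope break-even sketch; all prices are hypothetical.
cloud_gpu_per_hour = 3.00          # hypothetical hourly rate for a cloud GPU
on_prem_gpu_cost = 9000.00         # hypothetical purchase price of a comparable GPU
training_hours_per_month = 400     # hypothetical training workload

monthly_cloud_cost = cloud_gpu_per_hour * training_hours_per_month
breakeven_months = on_prem_gpu_cost / monthly_cloud_cost

print(f"Cloud cost per month: ${monthly_cloud_cost:,.0f}")
print(f"On-prem purchase pays for itself after ~{breakeven_months:.1f} months")
```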

Deployment targets can include web services, mobile devices or autonomous cars. The latter can face critical power, processing efficiency and latency constraints, and may not be able to depend on a network connection. "I think when you see friction when moving from research to deployment, it is as much about the researchers not designing for deployment as limitations in the tools," Sparks said.
