ModelDB aims to keep track of machine learning modeling process

BOSTON – Sam Madden, professor of electrical engineering and computer science at MIT, is hoping to help advance the field of machine learning from dark art to principled science with an open source project. ModelDB, available on GitHub, is essentially a database system designed to help organize and manage machine learning models.

“These models are the engines of machine learning,” Madden said at the MassIntelligence conference, hosted by MassTLC and MIT’s Computer Science and Artificial Intelligence Laboratory. “They are the things that take the data and extract the insight out of it.”

When researchers build machine learning models, the process is highly iterative. Models are built using training data, and, if they’re supervised models, they are tested, evaluated and then tweaked (for example, by adding new features or parameters) to improve their performance. That process is repeated, sometimes hundreds of thousands of times according to Madden, until the models perform at an acceptable level.

But there is no standard way to manage the process. “You go through thousands of these models, you update the models all of the time, and there’s no sort of standardized way to track the history of the modeling process,” he said.

Madden likened it to the way people organize personal documents on their computers, which is to say not at all. “People are terrible at it,” he said. “And they don’t promote carefully organized data.”

ModelDB is a searchable database system that acts as a central repository for machine learning models, including every iteration, creating a system of record for researchers. “People can look and see what’s been done in the past and continue work that’s been partially completed,” Madden said.

Features include “experiment tracking,” so that models in the pipeline can be logged; “versioning,” or the ability to compare model performance; and “reproducibility,” so that any model can be rerun on any input data set.
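
To make those three features concrete, here is a minimal sketch in Python of what logging, comparing and re-identifying model iterations could look like. It is not ModelDB's actual API: the `ModelRegistry` and `ModelRun` names, their methods and the accuracy figures are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ModelRun:
    """One model iteration: its settings, a data fingerprint and its scores."""
    run_id: int
    name: str
    params: Dict[str, Any]
    data_fingerprint: str
    metrics: Dict[str, float]


class ModelRegistry:
    """Hypothetical in-memory stand-in for a central model repository."""

    def __init__(self) -> None:
        self._runs: List[ModelRun] = []

    @staticmethod
    def fingerprint(rows: List[Any]) -> str:
        # Hash the data so a later rerun can confirm it used the same input.
        payload = json.dumps(rows, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

    def log(self, name: str, params: Dict[str, Any], rows: List[Any],
            metrics: Dict[str, float]) -> ModelRun:
        # Experiment tracking: every iteration is recorded, never overwritten.
        run = ModelRun(len(self._runs) + 1, name, dict(params),
                       self.fingerprint(rows), dict(metrics))
        self._runs.append(run)
        return run

    def search(self, predicate: Callable[[ModelRun], bool]) -> List[ModelRun]:
        # Searchable history: return the runs that match a caller-supplied test.
        return [run for run in self._runs if predicate(run)]

    def best(self, metric: str) -> Optional[ModelRun]:
        # Versioning: compare iterations on one metric and return the top one.
        scored = [run for run in self._runs if metric in run.metrics]
        return max(scored, key=lambda run: run.metrics[metric]) if scored else None


# Illustrative usage; the data and accuracy values are made up.
registry = ModelRegistry()
data = [[0.1, 0.2, 1], [0.4, 0.3, 0]]
registry.log("logreg", {"C": 1.0}, data, {"accuracy": 0.81})
registry.log("logreg", {"C": 0.1}, data, {"accuracy": 0.84})
best = registry.best("accuracy")
```

Because each run stores its parameters alongside a fingerprint of the data it saw, a researcher could in principle rebuild the same model later or check whether two iterations were trained on identical input. A real system would persist this history in a database rather than in memory.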

“This isn’t a deep or radically complicated idea,” he said. “But it’s one of the things that I think is needed in order for us to go from where we are now, which is sort of this [dark] art, to a much more principled scientific approach.”