The University of Pennsylvania Health System had a problem many organizations experience: a plethora of valuable data that its computers couldn't analyze because the data was entered as free text and in other unstructured ways.
Searching for and analyzing unstructured data took a prohibitive amount of time, said David Birtwell, director of informatics of Biobank at University of Pennsylvania Health System (Penn Medicine). But leaving that data out of research programs meant researchers couldn't get as complete a picture of patients, medical conditions and treatment protocols as possible.
"Pulling information from these resources is absolutely required to do deep research," Birtwell said of Penn Medicine's unstructured data troves. "It was clear at the highest level that we needed to get this information to keep UPenn at the forefront of research."
Indeed, Birtwell said Penn Medicine officials have recognized the need to get to that unstructured data for years. But it wasn't until a few years ago that they believed natural-language processing (NLP) technology -- the type of artificial intelligence tool they considered key to accessing that data -- was mature enough to deliver the returns the health system expected.
"We want to enable our investigators to do groundbreaking research, and so they need to pull high-quality information from the nondiscrete part of millions of text records, and they need to do that quickly and efficiently while still respecting privacy in the process," Birtwell said.
To accomplish this, Penn Medicine implemented Linguamatics Health I2E, a natural-language processing platform designed for healthcare organizations, to build queries and automatically mine data from various sources containing unstructured information, such as doctors' notes recorded in electronic health records and specialist reports. Such documents contained different forms of unstructured data -- notably free text, text containing specialized medical language (such as pathology reports) and documents that combine discrete data points with free text.
Birtwell said the organization selected technology from Linguamatics, based in Marlborough, Mass., after surveying the marketplace and speaking to peer institutions that had implemented natural-language processing technology. After running a few projects as proofs of concept with the platform, Penn Medicine implemented it more widely in 2015.
Gateway to data-driven culture
Natural-language processing technology applies machine learning algorithms to analyzing and processing language in text and speech, and it's "increasingly becoming part of business intelligence and analytics offerings," said Krishna Roy, a senior analyst for the data platform and analytics team at 451 Research.
"Most companies are now data-driven or in the process of becoming so. If you want to run your business by data and metrics, the people in your organization need to have access to the data and be able to understand it and analyze it," she said.
Business intelligence (BI) programs are using natural-language processing technology to expand what information can be accessed for analytical queries as well as the types of workers who can search and analyze that information.
"If a BI product has NLP, a user can type in a question in English, rather than use SQL or another type of query language. The next step up is employing NLP to talk to a piece of analysis software and ask it a question [such as], 'What is my sales performance for February?' The answer will be depicted in a dashboard -- or if Amazon Echo or the like is integrated into the analytic app, it will speak the answer back to the user," Roy explained. "However, it is still very early days for the latter use case."
Requires multidisciplinary effort
Despite advances in natural-language processing technology, platforms like Linguamatics Health I2E still require a significant role from enterprise IT departments.
For instance, Roy said organizations must ensure the natural-language processing algorithms are learning effectively so they understand human questions accurately -- something that requires large data sets to ensure effective training and, thus, accurate learning.
Penn Medicine's most significant challenge in implementing natural-language processing involved feeding documents into the system and indexing them so that researchers can access and search the information, according to Birtwell.
Building bridges between Linguamatics Health and other systems also required a significant effort. "You have to find a way to export the text," he said, adding that the labor-intensive job is done iteratively and collaboratively.
"There's scripting involved, leveraging whatever part of the software that lets you export. And you have to find a place to put the information -- a secure infrastructure to hold all that data," Birtwell said.
Some organizations create an ephemeral data store, index it and then put that data into the NLP system, "but that's not what we do, because we have others using that data besides I2E," he said. Penn Medicine opted to use its own secure private servers to hold the data.
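For readers unfamiliar with what "indexing" exported text involves, the following is a minimal sketch in Python of a keyword index over a handful of exported notes. It is an illustration only, not a description of how Linguamatics Health I2E or Penn Medicine's scripts work; the sample notes, the document IDs and the function names `tokenize`, `build_index` and `search` are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical exported notes; a real export would come from the source systems.
notes = {
    "note-1": "Patient reports chest pain. ECG shows atrial fibrillation.",
    "note-2": "Follow-up visit. No chest pain. ECG normal.",
    "note-3": "Pathology report: benign tissue sample.",
}
index = build_index(notes)
```

Calling `search(index, "chest pain")` in this sketch returns both `note-1` and `note-2`, even though the second note says the patient has no chest pain. That gap between finding words and understanding them is what a full natural-language processing platform is meant to close.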
Birtwell said data, technology and research personnel all worked together to implement this technology, explaining that it's critical for organizations to bring in workers from the divisions using the technology because they know what work needs to get done. Senior investigators, who understand the needed queries, generate the requirements. Corporate IT owns the servers and understands how to implement and manage the systems as well as what's required to keep them secure.
Driver and bridge-builder
Birtwell, a computer programmer who has experience in bioinformatics, explained his role as being both a driver and a bridge between the different disciplines that need to come together to make this project successful.
"Getting the data into I2E requires scripting -- that's straight up programming. But incorporating ontologies and setting up indexes requires a degree of domain knowledge and it requires someone who has an understanding of classical natural-language processing," he said.
He noted, however, that finding such a combination of skills is challenging, so he had domain experts and technical staff work on the task together.
Birtwell's team had to follow regulatory requirements around privacy and security throughout this process, which added to the workload. Additionally, workers had to extract data from sources that didn't easily yield their information, such as PDF documents that had to be scraped to pull out the data they held.
As a result of that extra work, extracting data from some sources is on a back burner, he said.
Tangible results, early on
Despite such challenges, Birtwell said getting the needed data in place and indexed in the natural-language processing platform took just weeks, allowing users to start querying the text in short order.
Today, Penn Medicine researchers use the platform to access a wider spectrum of patient information and other critical data that helps them better understand how variations in individuals and health conditions can impact diseases and treatments.
In one proof-of-concept case, workers identified patients who had a particular cardiac condition that exhibited certain results on the electrocardiogram. Prior to the Linguamatics Health implementation, workers would have had to sort through volumes of patient information using billing codes or keyword searches -- neither of which was as effective or accurate as using natural-language processing technology. Senior investigators work with IT staffers to develop the queries needed to access data via the platform.
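A toy Python sketch can show why a plain keyword search may be less accurate than language-aware matching. It compares a substring search with a matcher that skips mentions preceded by a negation cue in the same sentence. This is a deliberately simplified illustration, not the method I2E uses; the cue list and the function names `keyword_match` and `negation_aware_match` are assumptions made for the example.

```python
import re

# Hypothetical, deliberately short list of negation cues.
NEGATION_CUES = ("no", "denies", "without", "negative for")


def keyword_match(note, term):
    """Plain keyword search: true if the term appears anywhere in the note."""
    return term.lower() in note.lower()


def negation_aware_match(note, term):
    """True if the term appears in some sentence with no negation cue before it."""
    term = term.lower()
    for sentence in re.split(r"[.;!?]", note.lower()):
        pos = sentence.find(term)
        if pos == -1:
            continue
        before = sentence[:pos]
        negated = any(
            re.search(r"\b" + re.escape(cue) + r"\b", before)
            for cue in NEGATION_CUES
        )
        if not negated:
            return True
    return False
```

For a note such as "Patient denies chest pain. ECG normal.", `keyword_match` flags "chest pain" while `negation_aware_match` does not. Production clinical NLP handles far more than negation, including synonyms, abbreviations and context that spans sentences.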
"We'd love a user interface so simple that doctors could point and click and get good results; I2E has an accessible interface, but the queries that get run are complicated and, thus, building those queries is a challenge and it takes a day or two, so that still falls to IT," Birtwell said.
But in the future, he said he expects investigators to be able to log on and create the searches themselves as the technology and the training on it mature.