Data is everywhere -- there is no shortage of it. But how much of it can be relied on? Where is the good data?
And, most importantly, what can be done to improve the quality of the data that is available?
In this third article in my data management series, we will look at how we evolved into the current situation and the possibility of using robotic process automation (RPA) bots to help us create a solid base of data.
The bulk of legacy applications were built to support small business functions, not larger end-to-end processes. These applications were designed largely to operate independently, with little thought given to sharing data or controlling its flow and its quality as it moved across applications that are "interfaced" with one another. That resulted in applications -- and application groups -- functioning as data islands.
These applications were viewed as "best of breed"; that is, the best of a given type of functional application. Companies made their decisions to buy or build these applications based on the offering's content and its fit with the company's business function.
Each system had its own data model, so sharing data required complex interfaces. Given the frequency of change, however, keeping these interfaces, and the data that flowed through them, up to date became problematic. In addition, the data edits varied from application to application, as did the transformations needed to conform data to each application's formats. Compounding the problem, many of these applications became so customized that their original documentation was rendered worthless. Data quality did not fare well through these changes.
While this was happening, new computer languages were created and new operating systems were released, and IT could not keep up with the pace of change. In response, hardware manufacturers created a new type of system software: emulation software, which let environments mimic earlier versions of an operating system so that old applications could continue to run. Complexity on top of complexity -- and data quality problems -- just kept piling up.
The situation was much the same with database management systems, which were being built and modified to handle the growing need for additional information and its storage. Companies bought multiple database products and designed databases to support single applications or small groups of applications. Taken together, this approach created ever more complex situations, leading to the data quality management problems that most companies are still dealing with today.
AI and RPA bots
So, what can be done to ease the burden of cleaning legacy data? While there is no single solution, my teams have begun to look into RPA bots and AI as a way to clean and reconcile data. Someone will still need to make decisions about matching the duplicate data from multiple sources to determine which is right. But the goal is to limit the need for manual intervention or review.
In particular, we are looking at using RPA bots and possibly AI to automate data comparison and identify the data element content with the highest probability of being correct. We are at the concept level, so I do not claim that this will work yet. But it looks like a promising data cleaning process.
RPA bots mimic what a person would do. All the logic a person would use to identify which version of the data is right is built into sets of rules the RPA bots execute. Rules would also be built to apply to any additional supporting data that may need to be pulled to help decide what version to consider correct.
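To make this concrete, here is a minimal sketch of what such a rule set might look like, assuming the bot has already pulled candidate values for one data element (say, a customer phone number) from several applications. The source names, field names, and the specific rules (prefer the system of record, then the most recent non-empty value) are illustrative assumptions, not a real implementation.

```python
from datetime import datetime

# Hypothetical candidate records for one data element, pulled from three
# applications. Source and field names are illustrative only.
candidates = [
    {"source": "crm",     "value": "555-0100", "last_changed": datetime(2023, 5, 1)},
    {"source": "billing", "value": "555-0199", "last_changed": datetime(2023, 9, 12)},
    {"source": "legacy",  "value": "",         "last_changed": datetime(2021, 1, 3)},
]

def prefer_system_of_record(cands):
    """Rule 1: if the designated system of record has a value, trust it."""
    for c in cands:
        if c["source"] == "crm" and c["value"]:
            return c
    return None

def prefer_most_recent(cands):
    """Rule 2: otherwise take the most recently changed non-empty value."""
    non_empty = [c for c in cands if c["value"]]
    return max(non_empty, key=lambda c: c["last_changed"]) if non_empty else None

# The bot applies the rules in priority order, exactly as a person would.
RULES = [prefer_system_of_record, prefer_most_recent]

def resolve(cands):
    for rule in RULES:
        winner = rule(cands)
        if winner is not None:
            return winner
    return None  # no rule fired: escalate for manual review

print(resolve(candidates)["value"])  # prints "555-0100" -- the CRM value wins under Rule 1
```

Because the rules are an ordered list of small functions, adding a newly learned rule later means appending one function rather than rewriting the bot.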
A bot can be built to read each database (regardless of the tool that was used to design and build the database), identify the format and other information about the data, and log the information for each application's databases. At the data-element level, this RPA bot input would include the data that was entered or changed. The bot would then compare the data and the time of its last change and point out certain types of discrepancies for review. As each type of discrepancy is dealt with, the way it is handled would become a rule or set of rules that is added to the bot's operation for future use. In this way, the issues related to data quality are used to teach the RPA bots to be better.
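The compare-and-flag step described above might be sketched as follows, assuming each application's records have already been read into a structure keyed by record ID. The record IDs, field names, and two-source setup are hypothetical; a real bot would work against the actual databases and many sources at once.

```python
from datetime import datetime

# Hypothetical snapshots of the same customer records as read from two
# application databases; keys and field names are illustrative only.
app_a = {
    "C-1001": {"email": "pat@example.com", "last_changed": datetime(2023, 9, 12)},
    "C-1002": {"email": "lee@example.com", "last_changed": datetime(2022, 3, 4)},
}
app_b = {
    "C-1001": {"email": "pat@example.com", "last_changed": datetime(2023, 9, 12)},
    "C-1002": {"email": "lee@sample.net",  "last_changed": datetime(2023, 1, 20)},
}

def find_discrepancies(a, b, field):
    """Compare one data element across two sources, noting which copy
    changed more recently, and log the mismatches for human review."""
    issues = []
    for key in sorted(set(a) & set(b)):
        if a[key][field] != b[key][field]:
            newer = "app_a" if a[key]["last_changed"] > b[key]["last_changed"] else "app_b"
            issues.append({
                "id": key,
                "field": field,
                "app_a": a[key][field],
                "app_b": b[key][field],
                "more_recent": newer,
            })
    return issues

for issue in find_discrepancies(app_a, app_b, "email"):
    print(issue)  # each flagged discrepancy goes to review; its resolution later becomes a rule
```

Here C-1001 agrees across both sources and is skipped, while C-1002 is flagged with a note that app_b holds the more recent change -- exactly the kind of evidence a reviewer needs to decide which version is correct.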
Training and testing RPA bots
The initial version of the bots will need to evolve: It is unlikely that all the rules needed to perform any activity will be identified in the first version. To evolve, the bots will need to look at several databases -- and improve their rules as the business managers and technical staff gain experience.
Initially, any data matches in these multiple databases will be identified by the bot and manually reviewed to make certain the rules are right and being applied correctly. The data issues that surface will be manually corrected and the corrections used to determine how the bot's rules will need to be modified. This approach allows the bot rules to evolve and address more complex situations over time.
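One simple way to turn reviewed corrections into new bot rules is to track, per field, which source the reviewers kept picking as correct, and to promote a preference rule once the pattern is strong enough. The review log, field names, and thresholds below are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical log of manual reviews: for each flagged discrepancy on a
# field, a reviewer recorded which source held the correct value.
review_log = [
    ("email", "billing"), ("email", "billing"), ("email", "billing"),
    ("email", "crm"),
    ("phone", "crm"), ("phone", "billing"),
]

def promote_rules(log, min_reviews=3, min_agreement=0.75):
    """Turn consistent review outcomes into automatic preference rules:
    once reviewers overwhelmingly pick one source for a field, the bot
    can resolve that field without manual intervention."""
    by_field = {}
    for field, winner in log:
        by_field.setdefault(field, Counter())[winner] += 1

    rules = {}
    for field, counts in by_field.items():
        total = sum(counts.values())
        source, wins = counts.most_common(1)[0]
        if total >= min_reviews and wins / total >= min_agreement:
            rules[field] = source  # enough evidence: automate this field
    return rules

print(promote_rules(review_log))  # prints {'email': 'billing'}; phone still needs review
```

With this design the bot only automates a decision once the human reviewers have, in effect, voted for it repeatedly, which keeps the manual-review burden shrinking over time without removing oversight too early.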
Once tested for a given type of database management system, the RPA bot can be turned loose. I am planning to apply this data cleanup process first in system groups that should have mostly common data elements. This will allow me to create a limited library to start with, which later can be used to compare disparate data elements across other libraries. The best part? A bot, being a bot, can be run constantly until all the data is reviewed and cleaned.
As I said, I am investigating this approach. And like most conceptual approaches, I expect this one will change as it is tried. The final solution may be fairly different from the one outlined above. But, given that I firmly believe the data in our databases must be cleaned if we are to be able to change our operations and processes fast enough to compete, we must work through the experiment: Our challenge is to find creative ways to address our data issues.