Automation is a key component of enterprises' push to transform their operations, and data science is a driver of that automation. However, misconceptions about data science, AI and machine learning persist, particularly around how data projects are run.
To help address some of these, here are four data science project best practices for organizations to follow.
1. Understand the business requirement
A common misconception about data scientists is that they simply grab data, run models and then produce results. While they do all these things, the most important part of the job is to first establish and understand the use case for a particular model. Put simply, what is the business problem that needs to be addressed?
For data scientists, this process is summed up by converting the business problem into a mathematical one. But to do that, they must intricately understand the pain point of the business or customer, as this will determine the data sets used to build the models.
Data scientists can only understand the business problem by fully understanding the market the business operates in. Data scientists must also work closely with business teams, such as product managers, to understand exactly how a customer views their problem.
2. Communicate effectively
Communicating with a business team is an important data science project best practice, but it has its difficulties. Data scientists typically have more technical backgrounds than product managers, so communicating complex mathematical solutions effectively -- i.e., in a way that can be understood and fed back to clients -- poses a challenge. They can't simply point to a set of formulas and say, "These meet the customer's requirements, so we're ready to get going."
Properly conveying how a model can answer a business problem is a soft skill that data scientists should develop. By doing so, the business team can help ask the right questions that will enable data scientists to identify the right datasets for the models.
"We need an efficient way to do X" is a simplistic but typical starting point for any data project. It is generally understood, however, that "X" is never clearly defined at the outset. This is where data scientists work with the business teams to eliminate ambiguities and refine the use case.
Never underestimate the power of "Why?" It's sometimes the case that a customer's demand doesn't address the problem. A data scientist may not have the datasets available to achieve the best model, so an alternative and workable answer may be needed. Adjusting the target to what's possible is essential in this instance and, again, requires effective communication with the business team so that the technical constraints can be relayed to the client.
3. Avoid junk in, junk out
Data scientists are faced with many inherent constraints when it comes to getting the information needed for the models, from gaining the right permissions to access certain datasets and regulatory issues around sensitive data, to the disparate locations and formats of the data required. Once they have this information in one place, they then manipulate the data to identify the features that will become the input for the models.
This process can take up to 90% of a data scientist's time, as they need to clean the data, locate anomalies and missing values, and merge datasets. By contrast, the tools and algorithms needed for a given use case often already exist in the open source ecosystem, for example in the Python language and in libraries such as TensorFlow and PyTorch. This is why feature engineering, due diligence and data manipulation are the most time-consuming parts of the job.
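As a rough illustration of the cleaning steps described above (filling missing values, flagging anomalies and merging datasets), here is a minimal sketch in plain Python. The field names (`customer_id`, `amount`, `segment`), the median-based fill and the cutoff of five median absolute deviations are all assumptions made for this example, not a method prescribed by the authors.

```python
from statistics import median


def clean_and_merge(transactions, customers, cutoff=5.0):
    """Fill missing amounts, flag outliers and join customer attributes.

    Both inputs are lists of dicts. All field names are hypothetical.
    Assumes at least one transaction has a non-missing amount.
    """
    # Median of the amounts that are present, used to fill the gaps.
    known = [t["amount"] for t in transactions if t["amount"] is not None]
    fill_value = median(known)

    # Median absolute deviation (MAD): a spread estimate that is not
    # distorted by the very outliers we are trying to find.
    mad = median([abs(a - fill_value) for a in known])

    # Index customers by id so the merge is a dictionary lookup.
    by_id = {c["customer_id"]: c for c in customers}

    cleaned = []
    for t in transactions:
        was_missing = t["amount"] is None
        amount = fill_value if was_missing else t["amount"]
        # Only observed values can be anomalies; filled values never are.
        is_anomaly = (
            not was_missing
            and mad > 0
            and abs(amount - fill_value) > cutoff * mad
        )
        customer = by_id.get(t["customer_id"], {})
        cleaned.append({
            "customer_id": t["customer_id"],
            "amount": amount,
            "was_missing": was_missing,
            "is_anomaly": is_anomaly,
            "segment": customer.get("segment"),  # None when no match exists
        })
    return cleaned


# Made-up sample data, for illustration only.
transactions = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": None},
    {"customer_id": 1, "amount": 110.0},
    {"customer_id": 3, "amount": 90.0},
    {"customer_id": 2, "amount": 5000.0},
]
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "corporate"},
]

rows = clean_and_merge(transactions, customers)
```

Keeping the `was_missing` and `is_anomaly` flags on each row, rather than silently dropping or overwriting records, leaves the decision about what to do with them to whoever understands the business requirement.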
The feature engineering process is, of course, informed by data scientists' knowledge of the business problem, which is why the first step -- understanding the business requirement -- is a vital data science project best practice to follow. The quality of the data that data scientists feed into an algorithm ultimately determines the success of the data science project. That quality depends not only on the accuracy of the data itself, but also on its relevance to the business requirement.
Data scientists are aware that data scarcity and inaccurate data are the norm whenever they embark on a project. Even when it comes to data recorded by advanced monitoring tools, it is a fundamental principle of physics that a measurement is never 100% accurate, and this must also be taken into consideration. Every model is "wrong" in some way, but models get data science teams close enough to the answers to business problems so that effective data-driven decisions can be made.
At some point, data scientists must make the decision that they have enough data to make a workable model. But data works like a currency -- you get the closest to what you want by using what you have.
4. Iterate and adapt to change
A characteristic of data-driven projects is that they cannot be built for perpetual use. There could be a shift in business priorities that will require a data scientist to rebuild a model.
A recent example is the shifting behaviors of organizations and customers in the wake of the COVID-19 pandemic. Statistical models that addressed certain problems before the crisis have either been rebuilt or adjusted to address the new reality. As organizations continue to adapt to the crisis, they need to re-engineer their models. The judgment as to when to do so depends on the models' performance, which must be closely monitored.
Monitoring the effectiveness of an algorithm requires setting thresholds for performance, which is a fairly simple exercise. Once performance drops below a set threshold -- i.e., the minimum required to deliver actionable insights -- it is time for a new iteration. From a business perspective, this is key to delivering monetizable data offerings, as shifting data requirements necessitate new models. To deliver new models, data scientists must, once again, understand the new business requirements -- and thus the cycle begins again.
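The threshold check described above can be sketched in a few lines of Python. The class name `PerformanceMonitor`, the rolling window of three evaluations, the 0.80 threshold and the scores themselves are all hypothetical choices for this example; the right metric and minimum level would depend on the business requirement.

```python
from collections import deque


class PerformanceMonitor:
    """Track recent model scores and signal when a new iteration is due."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        # Keep only the most recent `window` scores.
        self.scores = deque(maxlen=window)

    def record(self, score):
        self.scores.append(score)

    def needs_retraining(self):
        # Wait for a full window so one early reading cannot trigger a rebuild.
        if len(self.scores) < self.scores.maxlen:
            return False
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.threshold


# Made-up weekly accuracy scores that degrade over time.
monitor = PerformanceMonitor(threshold=0.80, window=3)
flags = []
for score in [0.91, 0.88, 0.85, 0.78, 0.74, 0.71]:
    monitor.record(score)
    flags.append(monitor.needs_retraining())
```

Averaging over a short window rather than reacting to a single score is one way to avoid rebuilding a model because of a single noisy evaluation, at the cost of reacting slightly later to a genuine shift.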
About the authors
Yujun Chen currently works as a senior data scientist for the Innovation Lab at Finastra. His daily responsibilities are focused on data analysis and modeling, using a variety of statistical models and machine learning methodologies to help clients solve their business problems in the areas of treasury, capital markets and retail banking. He also has a PhD in physics.
Dawn Li works as a data scientist at Finastra's Innovation Lab, where she and her team apply the newest technologies in machine learning to solve various problems in finance. Dawn graduated from the Georgia Institute of Technology with a background in mathematics and statistics. While attending school, Dawn developed an interest in data science and decided to pursue a career in this field because it is fast-growing and provides the opportunity to learn and challenge herself every day.