Data Quality and Anomaly Detection in the Era of Big Data

1 Dec, 2020

Data drift

Our world has become increasingly digitized. Data is the new currency. Business leaders who successfully tap into the insights provided by big data can establish a clear advantage over their competitors, which has the potential to persist over the long term.

In the blockbuster book Moneyball, author Michael Lewis describes the transformation of the Oakland A’s into a data-driven organization focused on building a winning baseball team around a purely analytical approach. The time-honored approach of recruiting and retaining players based on gut feel was to be supplanted by a more methodical quantitative process. It worked. Oakland A’s general manager Billy Beane has been hailed as a pioneer in the sport of baseball, and numerous other teams have adopted a similar data-driven approach.

Amazon, Netflix, and others have built sophisticated preference models to tailor product recommendations for individual customers. For statisticians, this kind of data-intensive understanding of the world certainly isn’t new, but the volume and velocity of data available for such analysis have increased dramatically in recent years. That trend is certain to accelerate.

More data means more opportunity to build insights, which translates into a greater capacity than ever before to generate value from all of that information. As the potential value of such insights increases, however, so too do the potential drawbacks of getting the data wrong.

Artificial intelligence and machine learning hold great promise, and are already delivering intriguing results for many organizations. Yet the old adage applies more than ever before: “garbage in, garbage out”. Small errors, compounded over the course of time, can add up to significant inaccuracies.

Managing Chaos

For most business leaders, data is inherently orderly, or at least it should be. We are accustomed to the kind of discrete, highly structured, relational, hierarchical data sets that are found in ERP systems, CRM databases, and other “business information systems”.

That world is dominated by master data (such as customers, vendors, and inventory items) and transactional information (such as sales quotations, orders, and service ticket requests). Even when the quality of that data falls short, it usually does so in relatively predictable ways. The data is always meaningful; otherwise there would be no reason to collect and manage it. In that world, every data point has a distinct purpose.

Now consider the new world of big data. This is where the statisticians live. In this world, we are no longer necessarily interested in the individual data points; instead, we want to step back and see the big picture created by all of that data. This world is messy. It’s full of anomalies, those statistical outliers that have the potential to wreak havoc on the validity of insights.

That calls for a different approach to data quality. Specifically, it means adding to our current understanding of data quality to account for “data drift”.

Consider the case of a 30-year-old professional male who likes fitness, gourmet cooking, and classic jazz. That person’s Amazon browsing history is likely to be filled with the kinds of products that fit those particular tastes. Then, all of a sudden, he starts browsing for princess-themed toys for young girls. What does that mean? In all likelihood, our hypothetical Amazon shopper is buying a gift. If it happens just once, it probably doesn’t have much of an impact on future product recommendations, but what if it happens every year around the same time? Is it a real pattern, or is it truly an anomaly?

As a data scientist, what would you want to do with that? If it happens once a year around the same time, then maybe our hypothetical shopper has a young niece to whom he sends a birthday gift every year. Or maybe it is truly an outlier: a one-time gift to the local toy drive, for example.
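One way to start answering that question is to check whether the off-profile activity recurs at roughly the same time each year. The sketch below is a minimal illustration in Python, not a description of how Amazon or any real recommendation engine works. The function names, the 14-day window, and the two-year minimum are all assumptions chosen for the example.

```python
from datetime import date


def day_gap(a, b, year_len=365):
    """Approximate circular distance between two day-of-year values."""
    diff = abs(a - b) % year_len
    return min(diff, year_len - diff)


def recurs_annually(event_dates, window_days=14, min_years=2):
    """Return True if off-profile events cluster around the same time of
    year in at least `min_years` distinct calendar years.

    `window_days` and `min_years` are illustrative defaults, not tuned values.
    """
    days = [(d.year, d.timetuple().tm_yday) for d in event_dates]
    for _, anchor in days:
        # Distinct years that contain an event close to this anchor day.
        years = {y for y, doy in days if day_gap(anchor, doy) <= window_days}
        if len(years) >= min_years:
            return True
    return False


# Hypothetical toy-browsing dates for the shopper in the example.
niece_birthday = [date(2018, 3, 10), date(2019, 3, 14), date(2020, 3, 8)]
toy_drive = [date(2019, 12, 5)]

print(recurs_annually(niece_birthday))  # True
print(recurs_annually(toy_drive))       # False
```

A recurring signal like the first one might be kept as a seasonal preference, while a one-off like the second might be down-weighted. Either choice is a modeling decision that this check only informs.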

Consider another case: let’s pretend that our hypothetical 30-year-old male also has a Netflix account, and enjoys watching Marvel movies, John Wick, and reruns of The Office. Netflix’s recommendation engine understands his viewing preferences and will make suggestions accordingly. Then, he appears to suddenly take an interest in animated Disney movies, and watches them for most of a long weekend.

What happened? In all likelihood, he had out-of-town visitors with young children. How should these sorts of anomalies be treated?

Dealing with Data Drift

Data drift can happen in virtually any domain, primarily because data models change and because, when large and diverse data sources are involved, there are many potential points of failure at which errors may be introduced.

Imagine that you have a demand planning application that uses machine learning to analyze historical sales data alongside external variables such as weather, economic trends, and competitive activity. Then consider what happens if the inventory master data is changed to accommodate a larger ID field, or if items are discontinued and replaced with slightly newer variations that appear in the inventory master table as entirely new items. In both cases, machine learning algorithms may see sales volumes drop to zero because an item with a new master record ID isn’t correctly correlated with its historical data.
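A common defense is to keep a crosswalk from retired or reformatted item IDs to their current equivalents, and to apply it before the sales history reaches the model. The following is a minimal sketch under that assumption. The `successor_of` mapping, the item IDs, and the row layout are hypothetical and would need to come from your own inventory master data.

```python
from collections import defaultdict


def canonical_id(item_id, successor_of):
    """Follow replacement links until reaching the item's current ID."""
    seen = set()
    while item_id in successor_of and item_id not in seen:
        seen.add(item_id)  # guard against accidental cycles in the mapping
        item_id = successor_of[item_id]
    return item_id


def stitch_history(sales_rows, successor_of):
    """Sum units per (period, current item ID) so that a replaced item
    keeps the sales history recorded under its old ID."""
    totals = defaultdict(int)
    for period, item_id, units in sales_rows:
        totals[(period, canonical_id(item_id, successor_of))] += units
    return dict(totals)


# Hypothetical crosswalk: one widened ID field, one replaced product.
successor_of = {"00042": "0000000042", "A100": "A100-V2"}

# Hypothetical sales rows: (period, item ID, units sold).
sales_rows = [
    ("2020-09", "A100", 120),
    ("2020-10", "A100", 40),
    ("2020-10", "A100-V2", 85),
    ("2020-10", "00042", 10),
]

history = stitch_history(sales_rows, successor_of)
print(history[("2020-10", "A100-V2")])  # 125
```

Without the crosswalk, the model would see item A100 fade out and A100-V2 appear with no past. With it, both map to one continuous series.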

Data drift may also occur when sensors return faulty information, or when bugs in upstream systems return incorrect values. Healthcare statistics, for example, could be skewed by faulty patient sensor data. If you don’t have a clear plan for dealing with anomalies, they have the potential to turn into a data quality issue. Data drifts further and further away from an accurate representation of the truth. When that data drift is compounded over a population of customers, and over the course of time, it distorts reality.
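As one illustration of what a plan for anomalies could include, the sketch below flags suspect sensor readings with a modified z-score based on the median and the median absolute deviation (MAD), which is less distorted by the outliers themselves than a mean-based score. The readings are invented for the example, and the 3.5 cutoff is a commonly quoted default rather than a value drawn from this article. Real clinical data would call for domain-specific limits.

```python
from statistics import median


def mad_outliers(readings, threshold=3.5):
    """Return the indices of readings whose modified z-score exceeds
    `threshold`, using the median and median absolute deviation (MAD)."""
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        # No spread to scale by; this sketch declines to flag anything.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the underlying data is roughly normal.
    return [
        i for i, x in enumerate(readings)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical heart-rate readings with a dropout (0) and a spike (250).
heart_rate = [72, 75, 71, 74, 73, 0, 76, 72, 250, 74]
print(mad_outliers(heart_rate))  # [5, 8]
```

Flagged readings can then be quarantined, corrected, or reviewed rather than flowing silently into downstream statistics.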

Data pipelines are processing information at an incredible pace. That requires moving and transforming raw data from systems of record into curated datasets in data lakes and data warehouses. The velocity and volume of information are increasing. That creates considerable challenges.

There are tremendous opportunities to create value out of all that data: analytics and AI/ML models are using it to enable new ways to operate, innovate, and automate. But when data drifts, AI models can produce poor predictions. Poor predictions lead to poor decisions and insights, which in turn lead to bad business outcomes.

Real-World Implications

As artificial intelligence and machine learning gain momentum, it’s critically important that businesses get ahead of the problem of data drift. Responsibility for low-level decisions has begun to shift to AI bots. Poor data quality biases those decision models, and that has negative implications in the real world.

The biggest challenge around data drift is that it happens slowly and is hard to detect. If it isn’t addressed proactively, it has the potential to snowball into a much bigger problem.
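Because slow drift is easy to miss by eye, one option is to routinely compare a recent window of data against a trusted baseline. The sketch below computes the Population Stability Index (PSI), one widely used drift score. It is a generic illustration and not a description of any particular product. The bin count, the `eps` floor, and the sample data are assumptions, and the often-quoted reading of PSI above 0.25 as a significant shift is a rule of thumb rather than a guarantee.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty numeric samples.

    Bins are equal-width over the baseline's range; values in `current`
    that fall outside that range are counted in the nearest edge bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = ((hi - lo) / bins) or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    base_shares = shares(baseline)
    curr_shares = shares(current)
    return sum(
        (c - b) * math.log(c / b)
        for b, c in zip(base_shares, curr_shares)
    )


# Hypothetical metric values: a baseline and a shifted recent window.
baseline = list(range(100))
recent = [v + 30 for v in baseline]

print(psi(baseline, baseline))       # 0 when the distributions match
print(psi(baseline, recent) > 0.25)  # True for this shifted sample
```

Run on a schedule, a score like this turns gradual drift into a number that can be tracked and alerted on before it snowballs.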

Data architects, data scientists, and product managers understand that they need to detect and handle data drift, but they can’t do it alone.

DvSum is a leader in data quality. We provide autonomous tools that drive quality for your big data initiatives. By connecting directly to your existing data sources, DvSum can instantly alert you when metrics drift unexpectedly. DvSum provides out-of-the-box connectivity to your data warehouses in AWS, Azure, and Snowflake, and to your data pipelines running on Spark.

With an intuitive cloud-based platform, and a rich catalog of pre-configured data quality checks and anomaly detection algorithms, setting up data tracking rules in DvSum takes just minutes. Our platform compares live data to learned profiles and takes action when data falls outside of expected parameters.

If you are embarking on a big data initiative centered around AI and machine learning, now is the time to get ahead of potential data drift issues. To learn more about how DvSum can help, contact us today to discuss your project.
