Data Discovery is a necessary step in Self-Service Analytics: finding, preparing and analyzing data to generate insights. However, it is slow and complex, so analysts cannot deliver timely insights and business teams cannot actually do self-service analytics. In a previous article, I discussed this as a key challenge identified by Analytics leaders. You can read that post here. In this article we talk about 4 steps to add Data Intelligence to your data using a centralized Data Catalog that will make Data Discovery faster and smarter.
Data Discovery Tale – Story of Mary
Mary is a business analyst, working at Chips Inc, a Consumer Goods company.
Chips has recently onboarded a new third-party logistics provider to ship its products. Chips expects its carrier to meet a target on-time delivery performance, but the actual performance is sub-par and is causing customer satisfaction issues. Mary has been tasked with analyzing shipment data to find out what might be causing this.
Mary has never worked on shipment data before, so she first needs to find it. After multiple back-and-forth requests with IT, she finally gets the data but receives 3 similar-looking datasets. She has to evaluate them to figure out which one she should use. She also doesn’t know what all the fields in the tables mean, so she has meetings with SMEs to understand the data. That understanding sends her back to IT for more data, which she must again evaluate and understand. Finally, when she starts analyzing the data, she realizes that some of it is missing and spends time preparing it before she can analyze it.
Finally, 4 weeks later, she completes her analysis and identifies a big delay between an order being processed for shipment and its approval for parcel pick-up. She has found the insight!
She goes into the presentation meeting excited to share her analysis. Unfortunately, as soon as she starts presenting, she is questioned on how she created her analysis, because her on-time delivery metric calculation doesn’t match the number that the Logistics team sees in its operational reports. Now she has to go back, talk to SMEs again, figure out the right definition of the metric and redo her analysis.
- Mary is frustrated because she put in all this effort but didn’t get the right information from the business.
- The Logistics team is frustrated because they didn’t get insights quickly, and the one they got was not trustworthy.
- The understanding of the data gained along the way continues to reside as tribal knowledge between the Logistics SME and Mary.
This is a common tale. A very common tale!
So, what is a better approach? Well:
- If the data were cataloged, tagged and searchable, Mary could have searched for and found the data herself.
- If the columns and metrics were documented, she would have understood the data quicker.
- If the data’s profile were readily available, Mary would have quickly identified the preparation steps required.
- If the discussions and this analysis of the data were stored, the next Data Discovery on this data would be faster.
These are the 4 steps to add Data Intelligence to your data, which will make Data Discovery faster and smarter and Business-led Self-Service Analytics easier. A centralized Data Catalog tool can make this a reality.
4 ways in which Data Catalogs help make Data Intelligent and Data Discovery faster and smarter.
Let’s break down each of these 4 points.
1. Cataloged and Tagged Data
Data can exist in multiple sources: in databases, cloud data warehouses like Snowflake, Data Lakes on AWS S3 and Microsoft Azure, or enterprise applications like SAP and Salesforce. Each system has its own internal Data Dictionary, but that is designed for system and database administrators. What you need is a centralized Data Catalog that brings all the data dictionaries into a central location and makes them searchable. The names of source tables, files, columns and fields are technical, and their meaning is not always intuitive. So the data should be tagged with business and logical names, so that it is easy to find when Analysts and Business users search for it.
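To make the idea concrete, here is a minimal sketch of a central catalog that stores the technical name alongside a business name and tags, and lets an analyst search across all three. The `CatalogEntry` schema, the `search` helper and the sample table and file names are assumptions made up for illustration; they are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One table or file registered in the central catalog (hypothetical schema)."""
    source: str           # system the asset lives in, e.g. "snowflake"
    technical_name: str   # name as it appears in the source system
    business_name: str    # human-friendly name assigned during tagging
    tags: set[str] = field(default_factory=set)


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return entries whose business name, technical name or tags contain the term."""
    needle = term.lower()
    return [
        e for e in catalog
        if needle in e.business_name.lower()
        or needle in e.technical_name.lower()
        or any(needle in t.lower() for t in e.tags)
    ]


# Illustrative entries only; the names below are invented for this example.
catalog = [
    CatalogEntry("snowflake", "LOG_SHP_HDR", "Shipment Header", {"shipments", "logistics"}),
    CatalogEntry("sap", "VBAK", "Sales Order Header", {"orders", "sales"}),
    CatalogEntry("s3", "carrier_evt_2023.parquet", "Carrier Tracking Events", {"shipments", "carrier"}),
]

hits = search(catalog, "shipment")
for entry in hits:
    print(f"{entry.source}: {entry.technical_name} ({entry.business_name})")
```

Note that the search finds the cryptic `LOG_SHP_HDR` table only because it carries a business name and tags; the technical name alone would not have matched.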
2. Documented descriptions, so the data can be understood
How many times do we see column names like “Utilization %”? What does it mean? Is it Utilization % based on weight or based on volume?
Key fields in a dataset, especially identifiers, categorical attributes and measures, should be documented so that their meaning can be understood and users can decide how to use that information.
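As a rough sketch of what documenting key fields can look like, the snippet below keeps a description next to each column and lists the fields that still lack one. The `ColumnDoc` structure, the `undocumented` helper and the column names are hypothetical, chosen only to illustrate the point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnDoc:
    """Documentation record for one column (hypothetical structure)."""
    name: str
    kind: str                          # "identifier", "categorical" or "measure"
    description: Optional[str] = None  # business meaning; None means undocumented


def undocumented(columns: list[ColumnDoc]) -> list[str]:
    """Return the names of columns that have no usable description."""
    return [c.name for c in columns if not (c.description or "").strip()]


# Invented example columns for a shipments dataset.
shipment_columns = [
    ColumnDoc("shipment_id", "identifier", "Unique ID assigned to each shipment"),
    ColumnDoc("carrier_code", "categorical", "Code of the carrier handling the shipment"),
    ColumnDoc("utilization_pct", "measure"),  # weight-based or volume-based? not recorded
]

gaps = undocumented(shipment_columns)
print("Fields still needing a description:", gaps)
```

A report like this gives data and business teams a short, concrete list of fields to document first.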
3. Data Profile is available
When we are trying to understand the usability of data, what is the first thing we do?
We first try to figure out whether it is complete, what the unique values are, what the ranges are, and whether there are outliers. This information is critical to decide whether the data is usable and, if not, what cleansing steps are required.
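The checks described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any specific catalog computes a profile: missing values are represented as `None`, and outliers are flagged with the common 1.5 × IQR rule, which is one convention among several. The `profile` function and the sample columns are made up for the example.

```python
import statistics


def profile(values: list) -> dict:
    """Summarize completeness, distinct values, range and outliers for one column."""
    present = [v for v in values if v is not None]
    result = {
        "completeness": len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
    }
    numeric = [v for v in present if isinstance(v, (int, float))]
    # Range and outliers only make sense for fully numeric columns with enough data.
    if len(numeric) >= 4 and len(numeric) == len(present):
        q1, _, q3 = statistics.quantiles(numeric, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        result["min"] = min(numeric)
        result["max"] = max(numeric)
        result["outliers"] = [v for v in numeric if v < low or v > high]
    return result


# Invented sample columns: one missing value and one suspicious transit time.
transit_days = [2, 3, 3, 4, None, 3, 2, 4, 3, 21]
carrier = ["FastShip", "FastShip", None, "fastship"]

transit_profile = profile(transit_days)
carrier_profile = profile(carrier)
print(transit_profile)
print(carrier_profile)
```

Even this small profile surfaces the preparation steps: a missing transit time to fill or drop, a 21-day shipment to investigate, and two spellings of the same carrier to standardize.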
4. Make it easy for Analysts to collaborate with Business SMEs
Take the same example of the “Utilization %” column. If, as part of this Discovery, it was established that Utilization % is based on weight, then this information should now be documented, so that the next time someone looks at this field, its meaning is readily understood. Additionally, business teams have a wealth of information about the data stored in PowerPoint decks and documents that might sit in SharePoint. The Catalog should allow linking this information in one place, so it is readily accessible. It doesn’t all have to be done in advance, but whenever such Discovery happens, it should be stored in a way that makes it easy to refer back to next time.
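One possible way to capture such findings is sketched below: each field keeps a running list of findings and links that grows whenever a Discovery happens. The `FieldKnowledge` class, its `record_finding` method and the example URL are hypothetical placeholders, not the interface of a real catalog tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldKnowledge:
    """Knowledge accumulated about one field over time (hypothetical structure)."""
    field_name: str
    description: str = ""
    findings: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)

    def record_finding(self, author: str, finding: str, link: Optional[str] = None) -> None:
        """Store what was learned and, optionally, where the supporting document lives."""
        self.findings.append(f"{author}: {finding}")
        if link and link not in self.links:
            self.links.append(link)


utilization = FieldKnowledge("Utilization %")
utilization.record_finding(
    "analyst",
    "Confirmed with the Logistics SME that the value is based on weight, not volume.",
    link="https://example.com/logistics-kpi-definitions.pptx",  # placeholder URL
)
utilization.description = "Utilization % based on weight"

print(utilization.description)
print(utilization.findings)
```

Because findings are appended as they occur, nothing has to be documented up front; the entry simply becomes richer each time someone works with the field.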
So how do you create your Intelligent Data Catalog?
Of course, capturing all of this information is not easy, and doing it manually is a non-starter for any company. You should look for a Smart (Active, Automated, AI-driven) Data Catalog that can automate the above steps and provide an easy interface, so that collaboration and enrichment by data and business teams is quick and easy. DvSum’s Data Catalog, for example, automatically catalogs, classifies and curates the data and makes it available as an actionable Data Catalog.
In summary, Data Catalogs add the missing Data Intelligence to your data. They make Data Discovery faster and smarter, which in turn results in faster time to insights.