What is data quality and why does it matter?
If data fuels your business strategy, poor-quality data can kill it.
Just as professional athletes don’t fuel their bodies with junk food, businesses cannot function properly on a diet of unhealthy data. With poor data, the results can be disastrous and cost millions.
Poor data quality leads to more than just poor decisions: it causes companies to waste resources, miss opportunities, and spend far too much time fixing that data. That’s time that could be better spent on other areas of the business. All of this translates into increased costs. With the huge overall growth in data, the cost of poor data quality will also grow exponentially if not addressed quickly. That’s why it’s crucial to spot and fix bad data in your organisation.
Diagnosing bad data
Bad data can come from every area of an organisation. However, there is a common framework to assess data quality. The five most critical dimensions are:
Completeness – Is the data sufficiently complete for its intended use?
Accuracy – Is the data correct, reliable and/or certified by some governance body? Data provenance and lineage — where data originates and how it has been used — may also fall in this dimension, as certain sources are deemed more accurate or trustworthy than others.
Timeliness – Is this the most recent data? Is it recent enough to be relevant for its intended use?
Consistency – Does the data maintain a consistent format throughout the dataset? Does it stay the same between updates and versions? Is it sufficiently consistent with other datasets to allow joins or enrichments?
Accessibility – Is the data easily retrievable by the people who need it?
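Several of these dimensions can be expressed as simple metrics. The sketch below is a minimal illustration, not a real tool: the records, field names, and thresholds are all hypothetical, and each function just scores the share of rows that pass a check.

```python
from datetime import date

# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "FR", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None,            "country": "fr", "updated": date(2019, 1, 15)},
]

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def consistency(rows, field, valid):
    """Share of rows whose value matches an agreed format (here: upper-case ISO codes)."""
    return sum(1 for r in rows if r.get(field) in valid) / len(rows)

def timeliness(rows, field, cutoff):
    """Share of rows updated on or after a freshness cutoff."""
    return sum(1 for r in rows if r.get(field) and r[field] >= cutoff) / len(rows)

print(completeness(records, "email"))           # 0.5 – one missing email
print(consistency(records, "country", {"FR"}))  # 0.5 – "fr" breaks the format
print(timeliness(records, "updated", date(2023, 1, 1)))  # 0.5 – one stale row
```

Scoring each dimension separately, rather than a single pass/fail flag, makes it easier to see which dimension is dragging a dataset down.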
Four myths about data quality
Results from the sixth annual Gartner Chief Data Officer (CDO) survey show that data quality initiatives are the top objective for data and analytics leaders. But the truth is that little has been done to solve the issue. Organisations have long perceived data quality as difficult to achieve: too lengthy and too complicated a process. That perception rests on a few common misconceptions.
Myth #1 – “Data quality is just for traditional data warehouses.”
Today, there are more data sources than ever, and data quality tools are evolving. They now handle any dataset, whatever its type, format, or source: on-premises or cloud data, data from traditional systems, and data from IoT systems.
Faced with data complexity and growing data volumes, modern data quality solutions can increase efficiency and reduce risks by fixing bad data at multiple points along the data journey, rather than only improving data stored in a traditional data warehouse. These solutions use machine learning and natural language processing capabilities to ease your work and separate the wheat from the chaff. And the earlier you can implement them, the better. Solving data quality downstream, at the edge of the information chain, is difficult and expensive. Research indicates it can be 10x cheaper to fix data quality issues at the beginning of the chain than at the end.
Myth #2 – “Once you solve your data quality, you’re done.”
Just like data does not come all at once to a company, improving data health is not a one-time operation. Data quality must be an always-on operation, a continuous and iterative process where you constantly control, validate, and enrich your data; smooth your data flows; and get better insights.
Myth #3 – “Data quality is IT’s responsibility.”
Gone is the time when maintaining healthy, trustworthy data was simply an IT function. Data is the whole organisation’s priority as well as a shared responsibility. No central organisation, whether it’s IT, compliance, or the office of a Chief Data Officer, can magically cleanse and qualify all organisational data. It’s better to delegate some data quality operations to business users because they’re the data owners. Business users can then become data stewards and play an active role in the whole data management process. It’s only by moving from an authoritative mode to a more collaborative one that you will succeed in your modern data strategy.
Myth #4 – “Data quality software is complicated.”
As companies are starting to rely on data citizens and data has become a shared responsibility, data quality tools have also evolved. Many data quality solutions are now designed as self-service applications so that anyone in an organisation can combat bad data. With an interface that is familiar to users who spend their time in well-known data programs like Excel, a non-technical user can easily manipulate big datasets while keeping the company’s raw data intact. Line-of-business users can enrich and cleanse data without requiring any help from IT. Connected with line-of-business applications, these solutions will dramatically improve daily productivity and data flows.
The enterprise challenge: eliminating bad data with pervasive data quality
It’s 10x more expensive to fix bad data at the end of the chain than it is to cleanse it when it enters your system. But the costs don’t stop there. If that data is acted upon to make decisions, sent out to your customers, or otherwise damages your company or its image, you could face far higher costs than simply dealing with it at the point of entry. The longer bad data sits in the system, the greater the cost.
A pervasive approach lets you assess, analyse, and monitor data quality from end to end, and check and measure it before the data gets into your systems. Accessing and monitoring data across internal, cloud, web, and mobile applications is a huge undertaking. The only way to scale that kind of monitoring across those types of systems is by embedding data quality processes and controls throughout the entire data journey.
With the right tools, you can create whistleblowers that detect and surface some of the root causes of poor data quality. Once a problem has been flagged, you need to be able to track the data involved across your landscape of applications and systems, and parse, standardise, and match the data in real time.
This is where data stewardship comes in. Many modern solutions feature point-and-click, Excel-like tools so business users can easily curate their data. These tools allow users to define common data models, semantics, and rules needed to cleanse and validate data, and then define user roles, workflows, and priorities, so that tasks can be delegated to the people who know the data best. Those users can curate the data by matching and merging it, resolving data errors, and certifying or arbitrating on content.
Modern data solutions like Qlik Talend can simplify these processes even further because data integration, quality, and stewardship capabilities are all part of the same unified platform. Quality, governance, and stewardship can be easily embedded into data integration flows, Master Data Management initiatives, and matching processes to manage and quickly resolve any data integrity issues.
Five steps for better data quality
With pervasive data quality embedded at every step of the data journey, organisations can close the gap on ensuring that trusted data is available everywhere in the enterprise. But what does the data quality process actually look like?
There are five key steps for delivering quality data. And while the specifics may vary when looking at different data sources and formats, the overall process remains remarkably consistent. In fact, that highlights another benefit of using a single, unified platform across your entire data infrastructure: you don’t have to build everything from the ground up every time you add a source or target. In this case, when quality rules are created, they can be reused across both on-premises and cloud implementations, with batch and real-time processing, and in the form of data services that can automate data quality processes. The five steps for better data quality are:
Step #1 – Profiling
The first step is to really understand what your data looks like. Profiling your data will help you discover data quality issues, risks, and overall trends. Analysing and reporting on your data in this way gives you a clear picture of where to focus your data quality improvement efforts. And as time goes on, continued profiling provides valuable insight into how and where your data quality is improving.
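At its simplest, profiling means computing summary statistics per column. A minimal sketch, assuming a small set of hypothetical rows (real profilers work against live sources and compute far richer statistics):

```python
from collections import Counter

# Hypothetical rows to profile; in practice these would come from a real source.
rows = [
    {"city": "Paris", "age": "34"},
    {"city": "paris", "age": ""},
    {"city": "Lyon",  "age": "abc"},
]

def profile(rows, column):
    """Basic profile: fill rate, distinct values, and a rough type guess."""
    values = [r.get(column) for r in rows]
    non_empty = [v for v in values if v not in (None, "")]
    return {
        "fill_rate": len(non_empty) / len(values),
        "distinct": len(set(non_empty)),
        "numeric_share": sum(v.isdigit() for v in non_empty) / max(len(non_empty), 1),
        "top_values": Counter(non_empty).most_common(2),
    }

print(profile(rows, "age"))
# A low fill_rate and a numeric_share below 1.0 flag the "" and "abc" issues.
```

Even this tiny profile surfaces two problems worth fixing before analysis: a missing value and a non-numeric value in a column expected to hold ages.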
Step #2 – Standardising and matching
Many of the data quality issues uncovered during the profiling process can be fixed through standardisation and matching processes. Standardising data is an essential step when getting data ready for analysis, so that all the data being examined is in the same format. Matching lets you associate different records within different systems, and you can even embed matching into real-time processing and make those associations on the fly.
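The two ideas combine naturally: standardise first, then match on the standardised keys. A minimal sketch with hypothetical CRM and billing records (real tools add fuzzy and phonetic matching on top of this exact-match core):

```python
import re

def standardise(name):
    """Normalise case, punctuation, and whitespace before matching."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

# Hypothetical records from two systems holding the same customer.
crm     = {"C-17": "ACME  Corp."}
billing = {"B-02": "acme corp"}

def match(left, right):
    """Associate records whose standardised keys agree."""
    index = {standardise(v): k for k, v in right.items()}
    return {k: index.get(standardise(v)) for k, v in left.items()}

print(match(crm, billing))  # {'C-17': 'B-02'}
```

Without the standardisation step, "ACME  Corp." and "acme corp" would never match, which is exactly why the two processes belong together.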
Step #3 – Enriching
At this stage, you really start to see the data you’ve been working with come together. This is where federation can come into play. For example, you may want to use an API to share a particular piece of information, but from your profiling and matching exercises you know that additional related data exists in other locations. Because you’ve standardised your data and know that it’s formatted correctly, you can confidently enrich and augment the data you want to share with the additional related data, so that data users can get a more complete understanding of the information they’re consuming.
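Once records are standardised and matched, enrichment is essentially a join on the shared key. A minimal sketch with hypothetical data (the field names and values are illustrative):

```python
# Hypothetical reference data keyed on a standardised customer name.
customers = {"acme corp": {"segment": "enterprise"}}
order = {"customer": "acme corp", "amount": 1200}

def enrich(record, reference, key):
    """Augment a record with reference attributes matched on a shared key."""
    extra = reference.get(record[key], {})
    return {**record, **extra}

print(enrich(order, customers, "customer"))
# {'customer': 'acme corp', 'amount': 1200, 'segment': 'enterprise'}
```

The confidence to merge like this comes from the earlier steps: because the key was standardised and matched, the join won't silently miss related data.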
Step #4 – Monitoring
Data quality is not a “one and done” operation. It is a continuous, ongoing practice because data is constantly transforming and shifting, and those changes need to be monitored to ensure quality is maintained. When any new quality issues are discovered, you can go back to the previous steps to standardise, match, and enrich that data to get it back on track.
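In practice, monitoring means re-running quality metrics on each new batch and flagging anything that falls below an agreed threshold. A minimal sketch, with illustrative metric names and thresholds:

```python
# Hypothetical thresholds agreed with data owners.
thresholds = {"email_fill_rate": 0.95, "country_validity": 0.99}

def check_batch(metrics, thresholds):
    """Return the rules a batch violates, so the failing data can be sent
    back through the standardise/match/enrich steps."""
    return {name: value
            for name, value in metrics.items()
            if value < thresholds.get(name, 0.0)}

batch_metrics = {"email_fill_rate": 0.91, "country_validity": 0.995}
print(check_batch(batch_metrics, thresholds))  # {'email_fill_rate': 0.91}
```

Running such checks on every batch, rather than in a one-off audit, is what turns data quality into the continuous practice the step describes.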
Step #5 – Operationalising
The final step is to operationalise data quality. This is where you really get to see data quality in action. Automating your checks and rules and embedding them in your data pipelines yields significant gains in efficiency, and drastically cuts the amount of bad data that requires manual intervention to fix.
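Operationalised quality can be as simple as validation rules embedded in the pipeline itself, so bad records are quarantined automatically instead of being fixed by hand downstream. A minimal sketch with hypothetical rules:

```python
# Hypothetical validation rules embedded in a pipeline.
rules = [
    ("has_email", lambda r: bool(r.get("email"))),
    ("positive_amount", lambda r: r.get("amount", 0) > 0),
]

def pipeline(records):
    """Split each batch into clean rows and quarantined rows tagged with
    the names of the rules they failed."""
    clean, quarantine = [], []
    for r in records:
        failed = [name for name, rule in rules if not rule(r)]
        if failed:
            quarantine.append({**r, "failed": failed})
        else:
            clean.append(r)
    return clean, quarantine

clean, bad = pipeline([
    {"email": "a@example.com", "amount": 10},
    {"email": "", "amount": -5},
])
print(len(clean), len(bad))  # 1 1
```

Tagging quarantined rows with the rules they failed means a data steward reviews only the exceptions, which is the efficiency gain this step is about.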