We live in a world of data: There’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with Data is your window into the ways data teams are tackling the challenges of this new world to help their companies and their customers thrive.
Uncomfortable truth incoming: Most people in your organization don’t think about the quality of their data from intake to production of insights. However, as a data team member, you know how important data integrity (and a whole host of other aspects of data management) is. Keeping data quality high ensures that the insights your end-users pull are aligned with reality and can help them (and the company at large) make smarter, data-driven decisions, as well as pipe quality information to customer-facing apps.
In this article, we’ll dig into the core aspects of data integrity, what processes ensure it, and how to deal with data that doesn’t meet your standards. Let’s start by defining it.
What is data integrity?
The first question most of us ask regarding data integrity is, “Is this data accurate?” While that’s an important part of the larger issue of data integrity, the concept goes deeper than that. Data integrity looks at the whole life cycle for your data and considers the processes around how it’s generated, stored, accessed, and applied to accomplish specific business tasks. Throughout that life cycle, a good data integrity program aims to ensure that data is available, complete and accurate.
Data integrity risks
Each of the data characteristics we just listed — available, complete, and accurate — exposes a specific weakness that you work to prevent with your data integrity efforts.
For starters, unavailability: When data is unavailable, the business is operating without visibility into a specific aspect of its behavior or history. You may have sales data or marketing data or financials, but without the right level of data integrity and availability, you can’t actually use it to make effective decisions. You simply don’t have the access you need to perform analyses.
Incomplete data poses a similar problem: Your company is operating as if the data it’s using for insights is complete but may be coming up with insights that don’t align with reality. Of course, many companies are out there performing complex analyses on incomplete datasets, then scratching their heads when their predictions come up wrong. The right data integrity program can help avoid outcomes like this.
Much like with incomplete data, when the dataset contains inaccuracies, the business makes the wrong decisions, thinking that they’re the right ones. In some cases, you may get lucky and the business realizes the data is wrong before making a decision, but this still compromises the trust you’ve built with the business. All these pitfalls are avoidable with the right data integrity policies in place.
Means of ensuring data integrity
Data integrity can be divided into two areas: physical and logical. Physical data integrity refers to how data is stored and accessed. This includes your physical, on-premises servers and any cloud databases you may be remotely connected to. How are your devices physically secured? What’s your disaster recovery plan? All this contributes to your overall data integrity profile. Logical data integrity is designed to guard against human error. We’ll explore this concept in detail in the testing section below.
Data integrity: A process and a state
There are two means for ensuring data integrity: process and testing. The former allows us to control the data before it is generated, and the latter allows us to identify if there is an issue with our data that would impact its availability, completeness, or accuracy.
Process-driven data integrity: Getting data generation right
Data transformations are a go-to solution for fixing bad data, but if we do not focus on the source of what’s making the data less than ideal or unusable by the business, then we’ll be doomed to forever chase our tails, cleaning up after bad data instead of addressing the root cause.
A strong data integrity program begins with understanding what data we are aiming to track, how that data enters our systems, how it is stored, and the length of time it is stored. Before we consider what data we have, we should know what we want to track.
With this question answered, we can consider how we will track this data: Will it be automated? Will it be entered manually? If the latter, we should consider what processes or controls we’d want to put into place to ensure that the data being manually entered is consistent. Once we know that, we can answer questions around how frequently the data will be loaded and where it will reside.
Armed with a strong understanding of the process and having done due diligence to ensure that there are no gaps in how we will record data, we can build expectations for how that data should look and validate it, which leads us to testing.
Ensuring data integrity in your database via testing
The primary mechanism for validating data is testing. This is often what comes to mind when people think of data integrity, but we should really consider this part of data quality. That being said, there are standard checks we can perform on our data to evaluate its quality within the database:
- Referential integrity: This evaluates that the existence of one value in one table/data store is consistent with the existence of that value in another table.
- NULL values: This can be used to check for the completeness of fields. If we know a field should not be blank (or null), we can test to make sure that is the case.
- Uniqueness: This test is used to avoid duplication or overcounting. This, along with the null value check, is common for fields we consider to be primary keys.
- Accepted values/range: If we know that a field should only contain specific or a range of values, then we can test it to ensure that’s the case.
- Recency: This checks to make sure the data is current, not stale.
Business logic tests are essential for avoiding inaccurate data in the event that we cannot control a process tightly enough to ensure the data is perfect entering the system. We also use them to evaluate that the logic we’ve applied to our data to transform it is outputting accurate insights.
Examples of business logic tests you might run on your company’s data include making sure that sales opportunities are assigned to account executives only. You also want to validate that leads make it through the proper steps in your marketing funnel. Another important piece of information is whether the sum of customers’ purchases matches the reported revenue.
Testing should be applied to the raw data both after it enters your data systems and throughout the data pipeline/data transformation life cycle to ensure consistent quality and integrity along the entire journey.
Cleaning up data that doesn’t meet data integrity standards
Now that we know what data integrity is, we should discuss what we do when we find data that hasn’t met our standards.
When data is unavailable, we need to choose systems that support continuous data availability. Understanding your data architecture and knowing the weak points will allow you to know what can be done to guarantee 100% availability.
Incomplete data, such as data that has been deleted or was never generated, can be difficult to handle. However, using backups will allow you to restore missing data. Access controls can prevent unwanted deletions, so make sure that all users have the right permissions set. Regular testing can also help ensure that data is consistently following your program and that there are no gaps.
Inaccurate data should be fixed in the source system if possible (or, failing that, wherever you have the access and control to fix the problem). In larger datasets, where some amount of error is expected, using an error rate and a threshold to provide some fault tolerance will allow you to continue to be agile with your data but be alerted to larger issues.
Distinguishing between data integrity and security
Data security is a neighbor of data integrity, as we strongly rely on it for avoiding unwanted interventions to our data that could alter its availability, completeness and accuracy, and thus the value to our business. Having a strong data security system in place across people, processes, and systems prevents data from being corrupted or accessed in harmful or unintended ways. If you’re overhauling your data integrity program, it can also be a good time to review your data security protocols, and vice versa. Whatever you’re building with your data, integrity and security will be important parts of it.
Chris Meier is a Manager of Analytics Engineering for Sisense.