Databases aren’t infallible, as anyone in modern business is well aware.
No matter what safeguards and rules are put in place at the start, something will inevitably produce bad data. That bad data then causes varying degrees of mayhem, as poorly informed decisions and inaccurate measurements of all sorts follow.
You may feel that abating these problems is a monstrously daunting task, and approached the wrong way, it is. Fortunately, most data quality experts agree there are three basic ways to handle these unavoidable issues once they arise.
These approaches are actually surprisingly simple. Once you get past the overwhelming complexity of a database by seeing it from a new perspective, they make perfect sense. Not only that, the optimal choice becomes rather obvious.
In light of this, let’s move away from that complexity and think of a database as something far more mundane and common to daily life. Let’s think of it as a restaurant.
A restaurant, when looked at in the simplest way, is a two-step process. Ingredients come into the kitchen, where they are made into various dishes by the cooks.
These dishes are then delivered to and eaten by the restaurant’s patrons.
Likewise, we can look at data the same way. Data goes into a database. Once there, it becomes part of the larger structure of data which is in turn used by businesses, their customers and so on.
A constant danger which good restaurants diligently work to avoid is that of bad ingredients or bad preparation. Bad food, once eaten by the patrons, will inevitably make them quite ill, and no restaurant wants to make its customers sick.
Similarly, bad data (either erroneous from the start, or somehow entered in a way that ruins it) taints the entire database. Data is rarely discrete: one piece of bad information, linked to other information, begins to pollute the whole, just as one contaminated ingredient poisons an entire dish. Once used, this data results in misinformation, choices based on bad input, and all kinds of other chaos, just as the bad food makes the diners ill.
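To make the contamination concrete, here is a minimal sketch in Python (the records and field names are invented for illustration): a single mistyped price silently skews every report derived from the table.

```python
# Hypothetical order records; one price was entered with a misplaced
# decimal point (1999.0 instead of 19.99).
orders = [
    {"order_id": 1, "price": 19.99},
    {"order_id": 2, "price": 21.50},
    {"order_id": 3, "price": 1999.0},  # the "contaminated ingredient"
]

# Any aggregate built on this table inherits the error.
average_price = sum(o["price"] for o in orders) / len(orders)
print(round(average_price, 2))  # about 680.16 instead of about 20.49
```

Nothing about the bad row looks structurally wrong, which is why the damage tends to surface downstream rather than at entry time.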
Like a restaurant trying to solve contaminated food issues, we have to look at the point where this contamination happens, and decide what to do about it. As said before, we have three choices.
Ignore the Problem
The first choice is to simply ignore the problem. A restaurant can theoretically continue to serve tainted food, and simply have medication readily available to treat the illnesses that result.
We can ignore the bad data, and simply put into place a lot of protocols to handle the turmoil that results from it. Essentially, this is permanent damage control.
There’s probably no need to explain at length why this is a terrible idea. It doesn’t actually stop any of the damage; it simply repairs as much of it as possible. It’s a huge drain on resources (both financially and in man-hours), and there will come a point when nobody involved wants anything to do with the data coming out of this database.
Who continues to eat at a restaurant that’s consistently sent them to the hospital?
Find and Fix
The second choice for a restaurant is to create a position in the kitchen whose sole responsibility is to oversee the preparation of every dish and test every ingredient before it’s allowed to be used. Of course, service would slow to a crawl, and a kitchen with an optimized workflow would grind to a halt.
With our database, we can task some personnel with looking over every entry made to the database and correcting errors as they arise. All too often, this is the approach businesses favor.
For short-term damage control, this isn’t a bad approach, but it should only be a stopgap while the third option is fully explored and implemented. Some automation can be put in place to spot and correct bad data as it comes through, but that has problems of its own.
Automated checks are prone to false positives, and they have no sense of context when making corrections. As a result, these automata are really more like trying to fix the kitchen by hiring a blind cook with no taste buds.
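A minimal sketch of the false-positive problem, with an invented cleaning rule and threshold: a context-free check that caps "suspicious" prices mangles a genuine typo and a legitimate big-ticket sale in exactly the same way.

```python
# A naive, context-free cleaning rule: treat any price above a fixed
# threshold as a data entry error and cap it. The threshold and field
# semantics are invented for illustration.
THRESHOLD = 500.0

def naive_fix(price: float) -> float:
    """Cap suspiciously large prices -- with no sense of context."""
    return min(price, THRESHOLD)

# A genuine typo gets "corrected" to a value that is still wrong...
print(naive_fix(1999.0))  # 500.0, though the true value was 19.99
# ...and a legitimate high-value order is silently mangled too.
print(naive_fix(1250.0))  # 500.0, a false positive
```

The rule can’t tell the two cases apart because the context (what was actually sold, to whom) never reaches it.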
Prevention at the Source
The best choice for a restaurant is to shut down for as brief a time as possible and examine both the ingredients and the cooking and sanitary practices in use. This will reveal where the contamination originates. Knowing the root of the problem, new ingredient suppliers and new kitchen practices can be put in place to stop it at its source.
When fixing our database’s growing contamination, we do the same basic thing: we look at where the data is coming from and how it’s being entered (both the data entry personnel and the data entry tools and forms they use).
Don’t be discouraged
With close inspection, we’ll usually find one of four problems. First, the data source may be providing outright bad data to begin with. This could stem from any number of problems on the source’s end, including but not limited to human error.
Second, the data entry personnel (if the data is being entered manually) may be poorly trained, simply unfit for the job, or outright careless.
Third, the data entry software itself may be at fault, with problems such as poorly designed forms.
Finally, in cases where the data is generated and entered automatically, the fault may lie with the web service or other software handling it; configuration or outright programming errors could be the culprit.
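Whichever of the four is at fault, the fix converges on the same practice: validate data at the point of entry, before it ever reaches the database. Here is a minimal sketch; the rules and field names are invented for illustration.

```python
# Validate a record at the point of entry. A record that fails
# validation is rejected before it can contaminate the database.
def validate_order(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is clean."""
    errors = []
    if not isinstance(record.get("order_id"), int):
        errors.append("order_id must be an integer")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        errors.append("price must be a positive number")
    return errors

clean = {"order_id": 1, "price": 19.99}
dirty = {"order_id": "one", "price": -5}

print(validate_order(clean))  # []
print(validate_order(dirty))  # two error messages
```

The same checks can live in the entry form, the ingestion service, or as database constraints; the point is that they run before the data is accepted, not after the damage is done.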
None of these, when you really look at them, are all that difficult to fix, just as a restaurant can address the analogous problems with relative ease.
Thus, while database pollution is inevitable, reducing and in many cases eliminating it is not only possible; it isn’t even that difficult once you get down to it.