A typical (suboptimal) timeline in many projects dealing with data integration (warning: provocative):
- What is the data we need? – Do workshops
- Where is the data? – Look at the documentation/Glossary — if there is one
- Challenges like: Oh, wow, we meant some other data… Oh, wow, why are these fields empty?
- All set, we are writing ETL/WS/… code to integrate the data
- No data could be loaded/integrated, lots of errors
- Repeat 3-4-5 in a trial-and-error loop as data quality is not good enough
- NO ERRORS FROM THE LOAD PROCESS 🙂 We are ready! (Fanfares)
- Oh, no, the business says the data does not make sense – data quality is not good enough
- Work an additional two months (the timeline can be anything from 5 days to 12 months) repeating steps 1-8 until data quality is good enough
- OK, now most of the data makes sense (Fanfares again)
Of course the above is exaggerated, but everyone who has been involved in data integration knows that it is not that far from reality.
There are many things you can do to make this loop much shorter (more on that in a later post), and it is also worth taking a historical view of how such situations arise.
In this post I will point out just two aspects:
- Do you see how many times bad data quality is mentioned? But what exactly is bad data quality? Can we measure data quality objectively?
- Sadly, point #7 is where projects far too often think that the integration work is finished. In fact, by that time you are not even halfway through. Why is this?
Before answering, I need to cover a bit of theory. The theory of relational databases, together with its common implementations, teaches us how to put together a database structure that guarantees certain aspects of data quality. The possibilities for ensuring good data include (stated without mathematical precision):
- You can define data types (text, number etc.) – but of course you can just store all numbers as text…
- You can set up keys (unique, non-empty values that identify e.g. a car or a person) – but most database systems allow you not to
- You can set up referential integrity constraints (e.g. there is no credit card without a cardholder) – but you are allowed not to
- You can define domains, e.g. human height is between 0 and 270 cm – but you are allowed not to
- You can define patterns your data must match, e.g. whether some text is a valid phone number – but you are allowed not to
- You can even define more complex rules/programs that allow you to check every aspect of the data you can possibly think of – but you are allowed not to
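To make these mechanisms concrete, here is a minimal sketch using Python's built-in sqlite3 module; all table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

# Types, keys, referential integrity, a domain check, and a (crude) pattern check
conn.executescript("""
CREATE TABLE cardholder (
    holder_id INTEGER PRIMARY KEY,                          -- key: unique, non-empty
    name      TEXT NOT NULL,
    height_cm INTEGER CHECK (height_cm BETWEEN 0 AND 270),  -- domain
    phone     TEXT CHECK (phone GLOB '+[0-9]*')             -- pattern
);
CREATE TABLE credit_card (
    card_no   TEXT PRIMARY KEY,
    holder_id INTEGER NOT NULL REFERENCES cardholder(holder_id)  -- no card without a holder
);
""")

conn.execute("INSERT INTO cardholder VALUES (1, 'Alice', 170, '+3612345678')")
conn.execute("INSERT INTO credit_card VALUES ('1111-2222', 1)")

# Each of these violates one of the rules above and is rejected:
for bad in [
    "INSERT INTO cardholder VALUES (2, 'Bob', 400, '+361111111')",  # height outside domain
    "INSERT INTO credit_card VALUES ('3333-4444', 99)",             # no such cardholder
]:
    try:
        conn.execute(bad)
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
```

Fittingly, SQLite itself demonstrates the first loophole: declared column types are not enforced by default, so nothing stops you from storing text in an INTEGER column.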
The reality is that these are possibilities to enhance the quality of your data that a project may or may not implement. The more of these you are willing to implement, the better you have to know your business rules. And in fact, the more of these you implement, the slower your system will be (though the last time this was a valid excuse was in the early 2000s).
So back to the above questions:
- Data quality can ultimately be measured through validation against business rules: the more the data complies with the business rules, the better its quality. Sometimes not all business rules are known in advance. Sometimes they change and the data management is not updated. Sometimes the business rules are so complicated that coding them is not worth the effort. Sometimes there is not enough time or money to implement all the rules. In short: data quality is usually not fully known before the data integration begins.
- When the data integration stream reaches point #7, only the subset of business rules that has actually been implemented is validated – usually only a fraction of the existing rules. Points #8 and #9 are nothing else than figuring out the business rules that are not stored in any systematic way and trying to clean up the data using the new knowledge.
Are there ways to do this better? Definitely: with a data strategy you can do a lot to avoid such trouble. I'll share some best practices in a future post.
Have you had a different experience? Do you have a different view?