Data Blog

Test data management

There is a significant data management aspect to testing: test data management. Let’s start with a quick intro and a glimpse of the project management reality of testing – then look at test data management in a bit more detail.

Software testing

How long do you have to test a Dacia to turn it into a Mercedes?

I heard the above quote at university from one of our professors. Although the answer is obvious, a surprisingly large number of software projects still somehow believe that poor design can be corrected during the testing phase.

In reality, the amount of code going into production without proper testing is surprisingly big (source of the image).






Sometimes we hear excuses for not doing testing properly. Here are some of them:

  • I have already heard a vendor claim that code-level unit tests were enough and no further testing needed to be done. That is simply not true.
  • If you are in the custom software development business OR you customise a piece of existing software, you must keep track of your business requirements and maintain their changes during the project until you hand over the software. How you do this depends a whole lot on your methodology, but there is no way around it. The granularity of the requirements should reach a level where they are testable: i.e. you can objectively decide whether a requirement was fulfilled or not. Please do not think this is expensive “gold plating”. If you do not have control over your requirements you will end up in a trial-and-error loop, just like the one we described in this blog entry.
  • Another excuse in the top 5 is that software supporting the organisation in testing is expensive. If you hear this excuse, just check out some open-source testing software like TestLink & co. (likewise, for bug tracking you could use Mantis).
  • If you hear someone say that the amount of labor needed to set up these tools is huge and that you should use Excel to manage the test cases and their execution, please ask this person to measure the time needed to manage the effort in the absence of a centralised tool.
  • If you hear that there are no resources to do regression testing (testing that new functionality does not break existing, properly working functionality), you might think about using robotics to automate at least part of the testing.
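To make the regression point concrete, here is a minimal sketch of an automated regression test; `discount_price` is a hypothetical stand-in for a real business function, and the expected values stand in for results captured from a last known-good release.

```python
# Minimal regression test sketch: discount_price is a hypothetical
# stand-in for real business logic under test.

def discount_price(price: float, percent: float) -> float:
    """Apply a percentage discount and round to cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_existing_behaviour_unchanged() -> None:
    # Expected values captured from the last known-good release:
    # if any of them changes, new code broke working functionality.
    known_good = [
        ((100.0, 0.0), 100.0),
        ((100.0, 25.0), 75.0),
        ((19.99, 10.0), 17.99),
    ]
    for args, expected in known_good:
        assert discount_price(*args) == expected

test_existing_behaviour_unchanged()
```

Running such a suite after every change (for example in a CI pipeline) is what makes regression testing cheap enough to actually happen.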

Test data management

Now imagine for a moment that in a software project a proper amount of resources went into the requirements assessment and there are proper test cases to execute.

What is often still missing is proper test data and an agreed way to manage it, e.g.:

  • Ensure how to put the system into an initial state “ready for testing” before a test run
  • Ensure your test data supports retesting in case of bugfixes
  • Ensure you have proper test data for multiple test runs (sequentially/parallel)
  • Know which data cannot be reused (e.g. certain identifiers) and how to generate new data in a systematic way
  • Ensure that the above is conducted as a routine task – optimally in an automated manner

Please note that if you run end-user trainings on a training system, you have the same challenges to solve.
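The reset-and-generate routine from the list above can be sketched in a few lines (schema and names are illustrative, using an in-memory SQLite database): the system is put into a defined initial state before each run, and non-reusable identifiers are generated systematically instead of being recycled.

```python
import sqlite3
import uuid

def reset_test_db(conn: sqlite3.Connection) -> None:
    """Put the system into the agreed 'ready for testing' initial state."""
    conn.executescript("""
        DROP TABLE IF EXISTS clients;
        CREATE TABLE clients (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    """)

def new_client(conn: sqlite3.Connection, name: str) -> str:
    """Generate identifiers systematically instead of reusing old ones."""
    client_id = str(uuid.uuid4())
    conn.execute("INSERT INTO clients VALUES (?, ?)", (client_id, name))
    return client_id

# Routine task: reset, then generate fresh data for this test run.
conn = sqlite3.connect(":memory:")
reset_test_db(conn)
ids = [new_client(conn, f"Test client {i}") for i in range(3)]
```

Hooking `reset_test_db` into the test runner (or a training system's nightly job) turns the reset into the automated routine the list asks for.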

Here are some recommendations on how to put together a proper test data set:

  • Tests should be reproducible. Optimally you should be able to restore the initial state of the system (before the testing takes place) easily, e.g. by:
    • Having a virtual image of the systems in place that you restore+patch+upgrade and store before each test round. You can think about putting such a solution on AWS or Azure (or the like) – if your architecture is modern enough.
    • Backing up the database and restoring it (usually not easily possible with multiple systems integrated)
    • Generating test data with robots or even manually before the test run and assigning the test data to the proper test cases
  • Systems are integrated. This means that during testing you have to live with some limitations.
    • Sometimes it is possible to have a fully separated test environment with all the integrated systems. If you are in this lucky situation you can usually treat it as a “single system”.
    • If this is not possible (the usual case) you should think about system-level data consistency rules. (And if you have a proper model of your data, that helps.)
  • Automate-automate-automate
    • Even if automating test data generation costs you 30% more initial effort, just invest. The more often you test, the bigger the return on your original effort will be.
  • Privacy
    • There are cases where test data must be very close to production data. Should this be the case, consider privacy rules.
    • Whenever possible, please make your lives easier and use non-productive data for testing. Some database providers offer cloning features with data masking/scrambling.
    • Especially in data & analytics projects you have algorithms (e.g. grouping, classification etc.) that must be trained with productive data. Please note that this is not testing, and a separate, restricted environment may be needed. As a result of such training runs a model is constructed, which is usually small in size. This model usually does not contain any sensitive information by itself and hence can be transported into any other system – including the test system. Please note, however, that a model working well with productive data can be useless when it meets test data.
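A minimal sketch of the masking/scrambling idea from the privacy bullets (the field names and the salted-hash scheme are illustrative): direct identifiers are replaced deterministically, so the same productive value always maps to the same masked value and joins across tables keep working.

```python
import hashlib

def mask_record(record: dict, fields=("name", "email"),
                salt: str = "test-env-salt") -> dict:
    """Replace direct identifiers with deterministic pseudonyms."""
    masked = dict(record)
    for field in fields:
        # Same input + same salt => same pseudonym, so references stay consistent.
        digest = hashlib.sha256((salt + str(record[field])).encode()).hexdigest()[:8]
        masked[field] = f"{field}_{digest}"
    return masked

prod = {"id": 42, "name": "Jane Doe", "email": "jane@example.com"}
test_copy = mask_record(prod)
print(test_copy["id"], test_copy["name"])  # id survives, name is scrambled
```

The salt should live only in the non-productive environment; without it the pseudonyms cannot be traced back by simple guessing of common names.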

Data migration

One of the hardest nuts to crack in a typical software project is data migration.

A typical situation for the migration team:

  • The data should be migrated (integrated etc.) from the old system to the new one
  • The data model of the old system is no longer known precisely (optionally: the old provider does not even exist any more, or its involvement requires a heavy investment)
  • Over time, certain fields on the screens were not used according to standards (e.g. storing a GLN number in an address field)
  • The data model used for mass data input/output of the new system is only partially known (sadly a reality with some software companies still today)

One of the possibilities is the use of software robotics (RPA – Robotic Process Automation). How?

  • The screens/reports of the old application are still used and known very well by the business specialists
  • The screens of the new system are also well known and usually designed together with the same business users who use the old system
  • With software robotics it is possible (and usually not very complex) to build an algorithm that opens the old screen and does copy-paste into the new system
  • This can be combined with some data transformation as well (e.g. cleansing of addresses etc.)

In our experience the above migration method is surprisingly fast in terms of business analysis, and thanks to the achievable degree of parallelism it can also be used to migrate multiple millions of records.
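The steps above can be sketched as follows; the read and write functions are stand-ins for the robot's screen interactions, and the address cleansing is an illustrative transformation. Records are read from the "old system", transformed on the fly, and written into the "new system" by several workers in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def read_old_system(record_ids):
    """Stand-in for a robot reading the old application's screens/reports."""
    return [{"id": i, "address": f"  Main   Street {i} "} for i in record_ids]

def cleanse(record: dict) -> dict:
    """Transformation combined with the migration
    (here: normalising whitespace in addresses)."""
    record = dict(record)
    record["address"] = " ".join(record["address"].split())
    return record

new_system = []  # stand-in for typing into the new system's screens

def migrate(record: dict) -> None:
    new_system.append(cleanse(record))

records = read_old_system(range(1000))
# Degree of parallelism: several "robots" work side by side.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(migrate, records))
```

In a real RPA setup each worker would drive its own session of the old and new applications; the structure – read, transform, write, in parallel – stays the same.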



A typical (suboptimal) timeline in many projects dealing with data integration (warning: provocative):

  1. What is the data we need? – Do workshops
  2. Where is the data? – Look at the documentation/Glossary — if there is one
  3. Challenges like: Oh, wow, we meant some other data… Oh, wow, why are these fields empty?
  4. All set, we are writing ETL/WS/… code to integrate the data
  5. No data could be loaded/integrated, lots of errors
  6. Repeat 3-4-5 in a trial-and-error loop as data quality is not good enough
  7. NO ERRORS FROM THE LOAD PROCESS 🙂 We are ready! (Fanfares)
  8. Oh, no, business says data does not make sense – data quality is not good enough
  9. Work an additional 2 months (the timeline can be anything from 5 days to 12 months) repeating steps 1-2-3-4-5-6-7 and 8 until the data quality is good enough
  10. OK, now most of the data makes sense (Fanfares again)

Of course the above is exaggerated, but everyone who has been involved in data integration knows that it is not that far from reality.

There are a lot of things you can do to make this loop much shorter (see a later post), and you can also take a historic view of how similar situations arise.

In this post I will point out just two aspects:

  • Do you see how many times bad data quality is mentioned? But what exactly is bad data quality? Can we measure data quality in an objective manner?
  • Sadly, point #7 is where some projects way too often think that the integration work is finished. In fact, by that time you are not even halfway through. Why is this?

Before answering I need to cover a bit of theory. The theory of relational databases, together with its common implementations, teaches us how to put together a database structure that guarantees certain aspects of data quality. The possibilities to ensure good data are (without mathematical precision):

  1. You can define data types (text, number etc.) – but of course you can just store all numbers as texts…
  2. You can set up keys (i.e. unique, non-empty values that identify e.g. a car, a person etc.) – but most database systems allow you not to
  3. You can set up referential integrity constraints (e.g. there is no credit card without a cardholder) – but you are allowed not to
  4. You can define domains, like the height of a human being something between 0-270 cm – but you are allowed not to
  5. You can define patterns your data must follow, like whether some text is a valid phone number – but you are allowed not to
  6. You can even define more complex rules/programs that allow you to check every aspect of the data you can possibly think of – but you are allowed not to
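Possibilities 2-5 can be sketched in a few lines of SQL (shown here via Python's built-in sqlite3; table and column names are illustrative). Each constraint is optional – exactly the point of the list above – but once declared, the database rejects non-compliant data by itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # referential integrity is opt-in here, too
conn.executescript("""
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,                                 -- (2) key
        height_cm INTEGER CHECK (height_cm BETWEEN 0 AND 270),  -- (4) domain
        phone TEXT CHECK (phone GLOB '+[0-9]*')                 -- (5) pattern
    );
    CREATE TABLE credit_card (
        number TEXT PRIMARY KEY,
        holder_id INTEGER NOT NULL REFERENCES person(id)  -- (3) no card without a cardholder
    );
""")
conn.execute("INSERT INTO person VALUES (1, 180, '+41791234567')")  # complies

rejected = False
try:
    conn.execute("INSERT INTO person VALUES (2, 400, '+41791234567')")  # violates the domain
except sqlite3.IntegrityError:
    rejected = True
```

Incidentally, SQLite itself illustrates the caveat of possibility 1: with its dynamic typing it happily stores text in a number column unless extra measures are taken.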

The reality is that these are possibilities to enhance the quality of your data that a project may or may not implement. The more of these you are willing to implement, the better you have to know your business rules. And the more of these you implement, the slower your system will be (although the last time this was a valid excuse was in the early 2000s).


So back to the above questions:

  • Data quality can ultimately be measured by validation against business rules. The more the data complies with the business rules, the better the quality is. Sometimes not all business rules are known in advance. Sometimes they change and the data management is not updated. Sometimes the business rules are so complicated that coding them is not worth the effort. Sometimes there is not enough time or money to implement all rules. In short: data quality is usually not fully known before the data integration begins.
  • When the data integration stream reaches point #7, only the subset of business rules that is somehow implemented has been validated. Usually this is only a fraction of the existing business rules. Points #8 and #9 are nothing else but figuring out the business rules that are not stored in any systematic way and trying to clean up using the new knowledge.
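The first bullet suggests a simple operationalisation: encode the business rules you do know as predicates and score the data by the share of passed checks. The rules and records below are illustrative.

```python
# Data quality as the share of rule checks that pass (illustrative rules/data).
rules = {
    "name present":    lambda r: bool(r.get("name")),
    "height in range": lambda r: r.get("height_cm") is not None
                                 and 0 < r["height_cm"] <= 270,
    "phone present":   lambda r: bool(r.get("phone")),
}

def quality_score(records) -> float:
    """Fraction of (record, rule) checks that pass, between 0.0 and 1.0."""
    checks = [rule(r) for r in records for rule in rules.values()]
    return sum(checks) / len(checks)

records = [
    {"name": "Anna", "height_cm": 168, "phone": "+41791234567"},
    {"name": "",     "height_cm": 400, "phone": None},  # violates every rule
]
print(f"quality: {quality_score(records):.0%}")
```

Note that the score only reflects the rules actually implemented – which is exactly why passing point #7 can overstate how ready the data really is.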

Are there ways to do this better? Definitely: with a data strategy you can do a lot to get out of such troubles. I’ll share some best practices with you in a future post.

Have you had a different experience? Do you have a different view?

Client data management

Reading the news on the train today I found an interesting article with data aspects – and I decided to share it with you in a short post.

A recurring topic in many discussions around data management is the current quality of the data and where peer companies stand.

Maybe it is easier to give an answer if you read this article on engadget. Summary: a couple living in Indiana ordered ca. 2’700 items from Amazon, worth USD 1.2 million. They told Amazon that they wanted to return the products and got their money back. In fact they never returned anything and re-sold the items, making USD 750k over 2 years.

How did they do this? According to engadget:

The Finans created hundreds of false identities and fake accounts in order to pull off their scheme.

We know what a great job Amazon does in terms of data quality and data management. And even then this fraud was possible by temporarily tricking Amazon’s customer deduplication algorithms (a somewhat older article but still a good summary).

Reaching good data quality is a continuous work and not a one-time effort, not just for your company but also for key players like Amazon.

GDPR Trends

I just took a look at the recent search trends on GDPR with Google Trends. Please take a look:

The regulation was approved on 14 April 2016. The enforcement date was last week – exactly where you see that big peak in the picture.
At first I thought that the source of the search queries was mostly not companies but people looking for more information on the new regulation. If you take a look at the map below you will see that in some of the cantons the interest is significantly lower than in others. Zug, Basel, Geneva and Ticino show the most search queries.

If I compare the geographical distribution of the GDPR search queries to that of the popular reality show Die Bachelorette, we see a distribution that is more typical for queries triggered by larger numbers of people.

Does this mean that most companies reacted just after the enforcement date of GDPR? What is your view?

Data = Labor?

Being a “data guy” I’m always fascinated by how data changes the world and how the world is becoming more data driven.

I did not know I was a dataist until I read the fantastic book Homo Deus by Yuval Noah Harari (please find a review and links to buy here).

We think about huge corporations all around the world as perfect data processing superbrains looking at every aspect of our lives. Well, in some cases this is surely the reality: just ask somebody in on-line advertising. (Just check out what dimensions Google offers you in their demographic module, or, check out what Oracle can tell about you by just visiting their website).
We as a society are aware of this: by introducing GDPR we made an important step to set limits what our data should be used for.

I know from experience that not all corporations are this advanced. We are also far from harnessing the full potential of data and embedding it in our decisions. (We are too much driven by our “System 1”, as you can read in another must-read.)

In spite of this it is obvious that data is the single central element of the coming years.

One exciting aspect is that just by conducting our lives in the digital world we generate data. This data is then used by the economy to generate money. Would it be fair to treat data as labor? Could we even get a salary for allowing corporations to use our data?

I’m starting to read the book Radical Markets from Eric Posner and Glen Weyl today to learn their opinions.

What is yours?


I’m very interested in what the execution of the GDPR will look like in real life.

What is GDPR?

Stronger rules on data protection from 25 May 2018 mean citizens have more control over their data and business benefits from a level playing field. One set of rules for all companies operating in the EU, wherever they are based.

This sounds very good, doesn’t it?

For companies this means some new challenges: the requirements of GDPR are complex to understand, even more complex to implement, and they build on an already implemented corporate data governance framework that in most cases is far from GDPR-ready. For small and medium-sized firms the investment needs are very high in relation to the company’s turnover.

This is why it is interesting how the EU members will execute the regulation. One example is the Hungarian legislation published yesterday. The Hungarian authorities admit that the regulation leaves only a few interpretation possibilities open. This, however, allows them to focus on notifying small and medium-sized companies about data breaches, whereas the large international players must pedantically execute the regulations.

Of course after a notification a smaller firm must also implement the missing controls – but a large fine is not the primary goal.

Will this be a pattern followed by most of the member states?


If you are working with data, from time to time you will be confronted with “multimaster” situations where the same kind of information (typically client, product, installed base etc. data) is managed in multiple systems in parallel.

Let me give you an example: one of my projects some years ago. Our goal was to cleanse and sustain high quality of master client data. The data was managed in a CRM system and other systems in parallel.

But why did our client have all these systems? Why didn’t they build just a single client management system? Didn’t they know that everything else was not good ab ovo?

Well, it is easy to make this statement now. But historically the story is a different one.

Some years back all business lines were happy with their own systems and processes. (Hands up how many of you have seen this…) Some of these business lines were even completely different companies. There was no valid business case justifying the integration of the client data.

Also, different systems stored different information about the same client. For one business unit a client meant a corporation, for another unit a private person.
The business requirements for the systems managing client data were captured in separate projects. One of the projects finished one year ago and one in the last century. Of course, one of the systems was designed primarily to serve the needs of e.g. the sales team, the other system’s design concentrated on the marketing needs, and so on.
How many times, do you think, did the sales team speak about the client during their project? I guess the client was the single central topic of most discussions. I would bet the same applies to the marketing project. Even if the two projects had run in parallel, the chance that the projects would share the same understanding of the client in terms of processes and data is (at least in real life) surprisingly small. Why? The respective project teams were responsible for their own scope and budget and had to decide which aspects of the client were in scope and which were not.
Imagine the second project was started 2-3 years after finishing the first: how many people would participate in both projects? (Attrition rate.) Is the first project documented in a way that allows the knowledge gained to be reused in the second project? Well, usually not.

After the systems were finally created, some teams wanted to manage more data than their systems could. An easy and cost-efficient solution was to create some spreadsheet tables to store some specific attributes for some special cases – the more so as this could work without involving any other business lines or IT. Anyhow, the business team was under pressure and needed a solution ASAP – no time for lengthy discussions and no energy to convince the whole corporation.

Later, times changed. Marketing and sales both wanted to have all client information stored and managed in a single, central place. The world had changed and there were dozens of viable business cases requiring this (up-selling, cross-selling, uniform communications, GDPR – just to name a few).
Of course, when designing the CRM the goal was to make it the single system managing client data. Sadly, the task of collecting, cleansing and syncing the already existing data was so huge that the integration was in fact never fully finished. Why? Because before the CRM implementation nobody had integrated the client data at this depth in a permanent read-write manner, and hence nobody had a realistic estimate of the complexity of this task.
When I joined the project, the same client data was partially synchronized automatically, partially manually and partially… well, not at all.
The business processes were modified to work around this situation – somehow. There was a person responsible for the CRM system – but there was no single person responsible for the client data.

If you realize that this could be your organization as well, you might be right. This situation is more common than participants of an analytics conference would admit.

It is important to emphasise that a multimaster situation is usually a consequence of a series of valid business decisions.

This is especially interesting now, when most companies want to exploit data in a more advanced way than before. Sadly, with questionable data quality no really valuable analytics can be done.

In our project IT experts were working on finding a solution. They were trying to figure out e.g. why client types for marketing are entirely different from client types for sales. They involved business experts but somehow there was no agreement in sight.

In my view the root cause of data challenges lies within the business processes. In this project it was also not different: IT tried to solve a business problem – and it did not work out.

In the next blog entry I will give you some insights into what we did to get out of this situation and how we managed to give the business quality data.

Stay tuned!

(Please note: the above client situation is based on actual client projects. Some details were changed to comply with contractual obligations.)

Welcome to the Data Blog

Thank you for visiting the first entry of our brand new data blog.

If you are looking for data related news, information, book and website recommendations – all connected to data, analytics and IT – then you are exactly right here.

If you are looking for some funny reading while taking a cup of coffee we will also post some anecdotes and funny stories from our data-centric-life.

Stay tuned!