Test data management

Testing has a significant data management aspect: test data management. Let’s start with a quick introduction and a glimpse of the project management reality of testing, then look at test data management in a bit more detail.

Software testing

How long do you have to test a Dacia to turn it into a Mercedes?

I heard the above quote at university from one of our professors. Although the answer is obvious, a relatively large number of software projects still somehow believe that poor design can be corrected during the testing phase.

In reality, the amount of code that goes into production without proper testing is surprisingly large (source of the image).

Sometimes we hear excuses for not testing properly. Here are some of them:

  • I have heard a vendor state that code-level unit tests were enough and that no further testing was needed. That is simply not true.
  • If you are in the custom software development business, or you customise a piece of existing software, you must keep track of your business requirements and maintain their changes throughout the project until you hand over the software. How you do this depends a great deal on your methodology, but there is no way around it. The requirements should be granular enough to be testable, that is, you can objectively decide whether a requirement was fulfilled or not. Please do not think of this as expensive “gold plating”. If you do not have control over your requirements, you will end up in a trial-and-error loop, just like the one we described in this blog entry.
  • Another excuse in the top 5 is that software supporting the organisation in testing is expensive. If you hear this excuse, just check out some open-source testing software like TestLink & co. (likewise, for bug tracking you could use Mantis).
  • If you hear someone say that the amount of labor needed to set up these tools is huge and that you should use Excel to manage the test cases and their execution, then please ask this person to measure the time needed to manage these efforts without a centralised tool.
  • If you hear that there are no resources for regression testing (testing that new functions do not break properly working functionality), you might think about using robots to automate at least part of the testing.

Test data management

Now imagine for a moment that a software project has invested a proper amount of resources in requirement assessment and that there are proper test cases to execute.

What is often still missing is proper test data and an agreed way to manage it, e.g.:

  • Define how to put the system into an initial “ready for testing” state before a test run
  • Ensure your test data supports retesting after bug fixes
  • Ensure you have proper test data for multiple test runs (sequential or parallel)
  • Know which data cannot be reused (e.g. certain identifiers) and how to generate new data in a systematic way
  • Ensure that the above is carried out as a routine task, ideally in an automated manner
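
To make the points above a bit more tangible, here is a minimal Python sketch of resetting a system to an agreed baseline and generating identifiers that are never reused across runs. It is only an illustration under assumptions: the in-memory SQLite database stands in for the system under test, and the `customer` table, the baseline rows and the `T-<run>-<sequence>` identifier format are all hypothetical.

```python
import sqlite3
import uuid

# Hypothetical baseline data; a real project would load an agreed seed set.
BASELINE_CUSTOMERS = [("C-0001", "Alice"), ("C-0002", "Bob")]


def reset_to_initial_state(conn: sqlite3.Connection) -> None:
    """Recreate the schema and load the baseline rows ("ready for testing")."""
    conn.execute("DROP TABLE IF EXISTS customer")
    conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO customer (id, name) VALUES (?, ?)", BASELINE_CUSTOMERS
    )
    conn.commit()


def new_run_id() -> str:
    """Short random tag that keeps data of repeated or parallel runs apart."""
    return uuid.uuid4().hex[:8]


def fresh_customer_id(run_id: str, sequence: int) -> str:
    """Build an identifier that is specific to one run and one sequence number."""
    return f"T-{run_id}-{sequence:04d}"


conn = sqlite3.connect(":memory:")
reset_to_initial_state(conn)

# Generate three new customers for this test run.
run_id = new_run_id()
for seq in range(1, 4):
    conn.execute(
        "INSERT INTO customer (id, name) VALUES (?, ?)",
        (fresh_customer_id(run_id, seq), f"Test customer {seq}"),
    )
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

# Resetting again brings the system back to the baseline for a retest.
reset_to_initial_state(conn)
baseline_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Because every run gets its own tag, two runs can create data side by side without colliding, and the reset function can be called from a scheduler or a test framework hook so that it becomes a routine task.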

Please note that if you run end-user training on a training system, you have the same challenges to solve.

Here are some recommendations on how to put together a proper test data set:

  • Tests should be reproducible. Ideally, you should be able to easily restore the initial state of the system (as it was before testing took place), e.g. by:
    • Having a virtual image of the systems in place that you restore, patch, upgrade and store before each test round. You can think about hosting such a solution on AWS or Azure (or the like) if your architecture is modern enough.
    • Backing up the database and restoring it (usually not easy when multiple systems are integrated)
    • Generating test data with robots, or even manually, before the test run and assigning the test data to the proper test cases
  • Systems are integrated. This means that you have to take some limitations into account during testing.
    • Sometimes it is possible to have a fully separated test environment with all the integrated systems. If you are in this lucky situation, you can usually treat it as a “single system”.
    • If this is not possible (the usual case), you should think about system-level data consistency rules. (A proper model of your data helps here.)
  • Automate-automate-automate
    • If automating test data generation costs 30% more initial effort, just invest. The more often you test, the bigger the return on your original effort will be.
  • Privacy
    • There are cases where test data must be very close to production data. If this is the case, consider privacy rules.
    • Whenever possible, please make your life easier and use non-production data for testing. Some database providers offer cloning features with data masking/scrambling.
    • Mostly in data & analytics projects you have algorithms (e.g. grouping, classification etc.) that must be trained with production data. Please note that this is not testing, and a separate, restricted environment may be needed. Such training runs produce a model that is usually small in size. This model usually does not contain any sensitive information by itself and hence can be transported into any other system, including the test system. Please note, however, that a model that works well with production data can be useless when it meets test data.
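
If your database does not offer masking out of the box, the idea can be sketched in a few lines of Python. This is a minimal illustration, not a complete privacy solution: the record layout, the `MASKING_KEY` value and the function names are assumptions made for the example. It uses a keyed hash (HMAC-SHA256 from the standard library), so the same input always yields the same token, which keeps masked records consistent across integrated systems.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it should be kept outside the test environment.
MASKING_KEY = b"replace-with-a-secret-key"


def pseudonymise(value: str, key: bytes = MASKING_KEY, length: int = 12) -> str:
    """Return a stable token for a sensitive value using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_customer(record: dict) -> dict:
    """Mask the personal fields of a (hypothetical) customer record."""
    masked = dict(record)  # copy, so the original record stays untouched
    masked["name"] = "Customer " + pseudonymise(record["name"], length=6)
    masked["email"] = pseudonymise(record["email"]) + "@example.invalid"
    return masked


production_row = {"id": 42, "name": "Jane Doe", "email": "jane.doe@example.com"}
masked_row = mask_customer(production_row)
```

Non-personal fields such as the `id` are passed through unchanged here. Which fields count as sensitive, and whether this kind of masking is sufficient, depends on the privacy rules that apply to your data.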