Learning about Data Quality from a Joke
Data can be good or bad in various ways - formally defining these ways can help when designing test data.
A tester walks into a bar.
Orders a beer.
Orders NULL beers.
Orders -1 beers.
Orders a unicorn.
At its core, this joke is about data quality. At a high level of abstraction, it contrasts "good" and "bad" data. These notions seem somewhat obvious. When actually ordering a unicorn in real life, we are, let's say, 99.x% sure that we will unfortunately not get what we asked for.
But is this really the case for the data we use in our daily lives? Are we always aware of all the ways in which data can be "bad"? Consequently, when testing a system, can we easily find all those helpfully wrong ways of ordering a beer?
Data quality literature to the rescue. The field names a multitude of dimensions along which data quality can be considered and measured. In this talk, we will introduce the most widely agreed-upon ones and relate them back to the bar joke specifically, and to test data in general. Thus, we establish a handy checklist for test data quality.
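The tester's orders map naturally onto such a checklist. As a minimal sketch (the `order_beer` function, its error handling, and the dimension labels in the comments are illustrative assumptions, not part of any real system):

```python
# Hypothetical ordering function for the bar in the joke.
def order_beer(quantity):
    """Accept an order only for a positive integer number of beers."""
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        # Validity: NULL beers and "a unicorn" are not integer quantities.
        raise TypeError("quantity must be an integer")
    if quantity < 1:
        # Plausibility: -1 beers is well-typed but makes no sense.
        raise ValueError("quantity must be positive")
    return f"{quantity} beer(s) coming up"

# The tester's orders as test data, each probing a quality dimension:
assert order_beer(1) == "1 beer(s) coming up"  # "good" data

for bad_order in (None, -1, "a unicorn"):
    try:
        order_beer(bad_order)
        raise AssertionError(f"order {bad_order!r} should have been rejected")
    except (TypeError, ValueError):
        pass  # rejected as expected
```

Each deliberately "bad" order checks that the system rejects one specific flavor of bad data, which is exactly what a test-data checklist should make systematic.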
Additionally, we will use a real-world machine learning (ML) project as a practical example. ML systems put special focus on data and all its aspects. When testing these systems, some data quality dimensions even take on new meanings.
Marco, software and test architect at Siemens Healthineers, and Gregor, data scientist at codemanufaktur, work together on a project that predicts test results ahead of time. They therefore deal with many different kinds of data when testing, training, or monitoring the ML system - Marco mainly from a domain and use-case perspective, Gregor mainly from the academic side. This talk therefore alternates between the theory of data quality and its impact on real-world projects.
(By the way: We've hidden at least three possible data quality issues in the abstract, incIuding in this sentence.)