Lisa Ehrlinger, "Automated Continuous Data Quality Measurement", 2021
Original title:
Automated Continuous Data Quality Measurement
Language of the title:
English
Original abstract:
Data is often the basis for decision making in modern organizations. Representative applications in which automated data-driven decisions play a major role are smart factories, where critical errors in the production process must be detected early, or healthcare information systems that monitor the condition of patients in an intensive care unit. A current application where human decisions rely on data is the COVID-19 dashboard, which provides epidemiological statistics to the public. To assess the trustworthiness of such data-driven decisions, it is essential to measure and know the quality of the data on which the decision-making process is based.
Research into data quality has been conducted since the 1980s, with Prof. Richard Wang from the Massachusetts Institute of Technology being a driving force in advancing the topic. Since then, many methods, algorithms, and tools have been developed to support data quality measurement. While existing methods agree that data quality measurement is a cyclic process that has to be carried out continuously, the majority of tools allow the evaluation of data sources only at specific points in time. Automation and scheduling of data quality measurement therefore remain the responsibility of the user. Apart from methods that promote the manual assessment of data quality, even tool-based methods usually make the following two assumptions: (1) the setup is done manually, either by defining business rules or context-specific data quality metrics, and (2) the measurement process is triggered by a human. Neither assumption holds for real-world applications, where data is continuously exchanged, transformed, and integrated between multiple heterogeneous information sources and has a high degree of volatility. Existing data quality tools lack an automated solution to continuously measure the development of data quality in such complex systems over time.
This dissertation contributes to data quality research with a new method for automated continuous data quality measurement, which has been implemented in the data quality tool DQ-MeeRKat to demonstrate its applicability. In contrast to existing methods (i.e., those using business rules or data quality metrics), the new approach in this dissertation automatically generates "reference data profiles", which represent a quasi-gold-standard for the data quality. Measuring changes relative to these reference data profiles enables continuous data quality monitoring. The automated initialization of reference data profiles is a major advantage over the arduous manual creation of business rules, which is standard in most existing data quality tools. In contrast to generally applicable data quality metrics, reference data profiles are derived and generated directly from the investigated data and therefore take into account the context in which data quality is measured.
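The idea of reference data profiles can be illustrated with a minimal Python sketch. It is not the implementation used in DQ-MeeRKat: the class `ReferenceProfile`, the functions `build_profile`, `conformance`, and `quality_rating`, the four chosen statistics, and the relative `tolerance` are all hypothetical and serve only to show how a profile might be derived from the data and later batches compared against it.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class ReferenceProfile:
    """Hypothetical profile of one numeric column (not DQ-MeeRKat's data model)."""
    minimum: float
    maximum: float
    average: float
    null_ratio: float


def build_profile(values: Sequence[Optional[float]]) -> ReferenceProfile:
    """Derive the profile from the data itself; no hand-written rules are involved."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot profile a column without any values")
    return ReferenceProfile(
        minimum=min(present),
        maximum=max(present),
        average=mean(present),
        null_ratio=(len(values) - len(present)) / len(values),
    )


def conformance(reference: ReferenceProfile,
                current: ReferenceProfile,
                tolerance: float = 0.1) -> Dict[str, bool]:
    """Check each statistic of the current batch against the reference.

    Assumption: a statistic conforms if it deviates by at most `tolerance`
    times the reference magnitude (with a floor of 1.0 for small values).
    """
    report = {}
    for name in ("minimum", "maximum", "average", "null_ratio"):
        ref = getattr(reference, name)
        cur = getattr(current, name)
        allowed = tolerance * max(abs(ref), 1.0)
        report[name] = abs(cur - ref) <= allowed
    return report


def quality_rating(report: Dict[str, bool]) -> float:
    """Share of statistics that still conform to the reference profile."""
    return sum(report.values()) / len(report)


# Initialization: profile a batch that is assumed to be of acceptable quality.
reference = build_profile([10.0, 11.0, 12.0, 11.0])

# Continuous measurement: profile a later batch and compare it to the reference.
later = build_profile([10.0, 11.0, None, 30.0])
report = conformance(reference, later)
rating = quality_rating(report)
```

In this sketch, scheduling the last three lines at regular intervals would yield a time series of ratings, which corresponds to the continuous monitoring described above.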
The importance of considering the context is also highlighted by the "fitness for use" definition of data quality. The ratings obtained from data quality measurement can then be used to assess the quality of data-driven decisions (in a specific context), which allows users to estimate the trustworthiness of such decisions.