SemGen - Towards a Semantic Data Generator for Benchmarking Duplicate Detectors
Title language:
English
Original book title:
Proceedings of the 4th International Workshop on Data Quality in Integration Systems in conjunction with DASFAA 2011
Original abstract:
Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of duplicate pairs, in addition to test data sets of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and variability cannot be configured sufficiently. In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we propose a definition of duplicate semantics for the domain of road traffic management. A discussion of lessons learned concludes the paper.