About the talk
Data Quality is defined as fitness for use. The principal challenge in improving data quality is that it cannot be measured directly (otherwise it would be quantity), so tracking progress is a hard problem. Over the last five years, DBpedia and its community have made excellent progress in this area, so that we are now ready, conceptually as well as technologically, to produce quantifiable measurements that allow us to better pinpoint and track data quality and thereby improve usability and community adoption.
Data Usage is meant here as a combination of data demand, data popularity and data impact. Some aspects of data usage are easy to measure, e.g. via download numbers or API calls (data popularity). Demand and impact, however, are more challenging to track. Demand can be assessed by the willingness to contribute to improving data quality, i.e. users who need a certain type of data will be willing to spend effort on it. Impact, on the other hand, requires proper mechanisms for provenance tracking: where does the data come from, and how can we trace its usage in the systems that consume it?
The two can be exploited synergistically to drive the success of data projects. Data Usage (demand, popularity) indicates where effort in improving data quality is most effectively spent. Generally speaking, improvements in data quality should drive usage up and increase data impact. In agricultural terms, data quality is concerned with the soil, and data usage with the produce growing on that soil (see the end of https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen ).
At DBpedia, we have worked on creating individual components to drive data development, especially with the help of the new SHACL standard (co-edited by the technical head of DBpedia, https://www.w3.org/TR/shacl/), which is used for testing metadata, data, linking, mapping and ontology quality, with plans to extend it to new measurements for text extraction.
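To make this concrete, here is a minimal sketch of such a SHACL-based quality test, run with the open-source pySHACL library in Python. The shape, the ex: namespace and the sample data are purely illustrative assumptions, not DBpedia's actual test definitions:

    from rdflib import Graph
    from pyshacl import validate

    # Hypothetical SHACL shape (illustrative only, not an actual DBpedia
    # test definition): every dbo:Person must have exactly one rdfs:label.
    SHAPES_TTL = """
    @prefix sh:   <http://www.w3.org/ns/shacl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix dbo:  <http://dbpedia.org/ontology/> .
    @prefix ex:   <http://example.org/> .

    ex:PersonLabelShape a sh:NodeShape ;
        sh:targetClass dbo:Person ;
        sh:property [
            sh:path rdfs:label ;
            sh:minCount 1 ;
            sh:maxCount 1 ;
        ] .
    """

    # Sample data: ex:Alice conforms, ex:Bob is missing its label.
    DATA_TTL = """
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix dbo:  <http://dbpedia.org/ontology/> .
    @prefix ex:   <http://example.org/> .

    ex:Alice a dbo:Person ; rdfs:label "Alice" .
    ex:Bob   a dbo:Person .
    """

    shapes = Graph().parse(data=SHAPES_TTL, format="turtle")
    data = Graph().parse(data=DATA_TTL, format="turtle")

    # pySHACL returns (conforms, report_graph, report_text); counting the
    # violations in the report is what makes the check quantifiable.
    conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
    print("Conforms:", conforms)  # False, because ex:Bob lacks rdfs:label
    print(report_text)

Aggregating violation counts from such validation reports across many shapes yields exactly the kind of quantifiable quality measurement described above.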
About the Speaker