September 27, 2017

Data Quality and Data Usage in a large-scale Multilingual Knowledge Graph

About the talk

Data Quality is defined as fitness for use. The principal challenge in improving data quality is that it cannot be measured directly (otherwise it would be a quantity), which makes tracking progress hard. Over the last five years, DBpedia and its community have made excellent progress in this area, so that we are now ready, conceptually as well as technologically, to produce quantifiable measurements that allow us to better pinpoint and track data quality, and thereby improve both usability and community adoption.

Data Usage is meant here as a combination of data demand, data popularity and data impact. Some aspects of data usage are easy to measure, e.g. data popularity via download numbers or API calls. Demand and impact, however, are more challenging to track. Demand can be assessed by the willingness to contribute to improving data quality: users who need a certain type of data will be willing to spend effort on it. Impact, on the other hand, requires proper mechanisms for provenance tracking, i.e. knowing where data comes from and how its usage can be traced in the systems that consume it.
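
As a toy illustration, a popularity indicator of this kind could be derived by simply aggregating access events per dataset. The sketch below is hypothetical: the log records, field names and dataset identifiers are invented for illustration and are not taken from DBpedia's infrastructure.

```python
# Hypothetical sketch of a data-popularity indicator: count access events
# per dataset. The log records, field names and dataset identifiers are
# invented for illustration; a real setup would read server or API logs.
from collections import Counter

access_log = [
    {"dataset": "dbpedia-2016-10", "event": "download"},
    {"dataset": "dbpedia-2016-10", "event": "api_call"},
    {"dataset": "dbpedia-2015-04", "event": "download"},
]

popularity = Counter(record["dataset"] for record in access_log)
for dataset, hits in popularity.most_common():
    print(f"{dataset}: {hits} accesses")
```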

Both can be exploited synergistically to drive the success of data projects. Data Usage (demand, popularity) indicates where effort spent on improving data quality pays off most. Conversely, improvements in data quality should drive usage up and increase data impact. In agricultural terms, data quality is concerned with the soil and data usage with the produce growing on that soil (see the end of https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen).

At DBpedia, we have worked on creating individual components to drive data development, especially with the help of the new SHACL standard (co-edited by the technical head of DBpedia, https://www.w3.org/TR/shacl/). SHACL is used to test the quality of metadata, data, links, mappings and the ontology, with plans to extend it to new measurements for text extraction.
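
As a rough illustration of SHACL-based quality testing, the sketch below validates a toy DBpedia-style graph against a hand-written shape. It assumes the Python libraries rdflib and pySHACL; the shape and data are invented for this example and are not part of DBpedia's actual test suite.

```python
# A minimal sketch of SHACL-based quality testing (assumes the pySHACL and
# rdflib libraries are installed). Data and shape are invented examples.
from rdflib import Graph
from pyshacl import validate

# Toy data graph: a DBpedia-style resource that is missing its rdfs:label.
data = Graph().parse(data="""
    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix dbr: <http://dbpedia.org/resource/> .
    dbr:Leipzig a dbo:City .
""", format="turtle")

# A SHACL shape requiring every dbo:City to carry at least one rdfs:label.
shapes = Graph().parse(data="""
    @prefix sh:   <http://www.w3.org/ns/shacl#> .
    @prefix ex:   <http://example.org/shapes#> .
    @prefix dbo:  <http://dbpedia.org/ontology/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:CityShape a sh:NodeShape ;
        sh:targetClass dbo:City ;
        sh:property [ sh:path rdfs:label ; sh:minCount 1 ] .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False: dbr:Leipzig has no rdfs:label
print(report)    # human-readable validation report
```

Running the sketch reports a violation because dbr:Leipzig lacks the required rdfs:label; the same principle, applied at much larger scale, underlies the tests over metadata, data, links, mappings and the ontology described above.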

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project.

About the Speaker

Sebastian Hellmann completed his PhD thesis under the guidance of Jens Lehmann and Sören Auer at the University of Leipzig in 2014, on the transformation of NLP tool output to RDF. Sebastian is a senior member of the Agile Knowledge Engineering and Semantic Web (AKSW) research center, which currently has 50 researchers (PhDs and senior researchers) focusing on semantic technology research, often in combination with other areas such as machine learning, databases, and natural language processing. Sebastian is head of the Knowledge Integration and Language Technologies (KILT) Competence Center at InfAI. He is also the executive director and a board member of the non-profit DBpedia Association.

Sebastian is a contributor to various open-source projects and communities such as DBpedia, NLP2RDF, DL-Learner and OWLG, and has written code in Java, PHP, JavaScript, Scala, C & C++, MATLAB, Prolog and Smodels, but now does everything in Bash and Zsh since he discovered the Ubuntu terminal. He is the author of over 80 peer-reviewed scientific publications (h-index of 21 and over 4300 citations, [Google Scholar](https://scholar.google.com/citations?user=caLrIhoAAAAJ)) and of a not-yet-deleted Wikipedia article about Knowledge Extraction. Currently, he is project manager at Leipzig University and InfAI for the EU H2020 projects ALIGNED and FREME and for the BMWi-funded project Smart Data Web. Before that, he was also involved in other funded projects such as FREME (EU H2020), LIDER (EU FP7), BIG and LOD2.

Sebastian was chair at the Open Knowledge Conference in 2011, the Workshop on Linked Data in Linguistics 2012, the Linked Data Cup 2012, the Multilingual Linked Data for Enterprises (MLODE) 2012 workshop, the NLP & DBpedia Workshop 2014, the SEMANTiCS conferences in 2014, 2015 and 2016, as well as the KEKI Workshop 2016. At MLODE 2012, he held a hackathon that bootstrapped an initial version of the Linguistic Linked Open Data cloud image, which led to the LIDER project and to linguistic-lod.org, which now publishes regular updates (thanks to John McCrae).