What does it take to build usable enterprise knowledge graphs? Let’s hear it from the experts
As the interest in Enterprise Knowledge Graphs is growing, there is also a growing need for sharing experience and best practices around them.
At Connected Data London, we love graphs. That’s easy to tell by looking at our program: graphs feature in all 4 of our tracks in one way or another. That’s also easy to explain, as graphs are key to modelling and solving a wide array of problems.
We have a special track on Enterprise Knowledge Graphs (EKGs), and we have the pleasure of hosting some excellent speakers with deep knowledge of the topic. As a teaser for their talks, we had a chat with Katariina Kari from Zalando and Panos Alexopoulos from Textkernel who were kind enough to share their insights.
Katariina Kari (née Nyberg) is a research engineer at the Zalando Tech-Hub in Helsinki. Katariina holds a Master in Science and a Master in Music and specializes in the semantic web and in guiding the art business into the digital age. At Zalando she is modelling the Fashion Knowledge Graph, a common vocabulary for fashion with which Zalando improves its customer experience. Katariina also advises art institutions on embracing the digital age in their business and seeing its opportunities.
Panos Alexopoulos has been working at the intersection of data, semantics, language and software for years, and is leading a team at Textkernel developing a large cross-lingual Knowledge Graph for HR and Recruitment. Alexopoulos holds a PhD in Knowledge Engineering and Management from the National Technical University of Athens, and has published 60 papers in international conferences, journals and books.
Building your first Enterprise Knowledge Graph
Kari and Alexopoulos have over 20 years of combined experience with EKGs. The first topic we covered was how they got started in the field.
Kari created her first ontology 10 years ago. It was about music, and it was in Finnish and Swedish. Kari says it was not put to any concrete use at the time, as it was part of a larger scheme to publish different cultural vocabularies in Finland’s official languages, perhaps for later use.
Kari subsequently evaluated her MUSO ontology using OntoClean as part of her early semantic web studies at Aalto University, and wrote a paper in Finnish on using OntoClean. However, she considers her first publication on using ontologies to improve machine learning (ML) models, in 2010, to be her official entry into the field.
Alexopoulos built his first ontology in the context of a European research project in 2006 to help with fraud detection in healthcare. The model covered the domains of drug prescriptions and public sector procurement and contained entities, relations and rules that could help detect patterns of potentially fraudulent behaviour in relevant organizations. Not much is known about how the model was actually used though, as is quite common for such projects.
Both Kari and Alexopoulos come from the semantic web world, which explains their use of the term “ontology”. Although the word may sound intimidating to some, for people with this kind of background an ontology is not an abstract philosophical concept. Rather, it is a graph domain model incorporating schema and rules, and the term is used interchangeably with “EKG”.
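To make “a graph domain model incorporating schema and rules” a little more concrete, here is a minimal Python sketch. The class names, the instance and the single inference rule are illustrative assumptions, not taken from either speaker’s work, and a real ontology would normally be written in a language such as RDF/OWL rather than as plain tuples.

```python
# Illustrative only: a toy ontology as (subject, predicate, object) triples.
# Every name below is a hypothetical example.
SCHEMA = {
    ("Symphony", "subClassOf", "Composition"),
    ("Composition", "subClassOf", "CreativeWork"),
}
DATA = {
    ("Symphony_No_5", "type", "Symphony"),
}


def infer_types(schema, data):
    """Rule: if X has type C and C is a subClassOf D, then X has type D.

    The rule is applied repeatedly until no new triples appear.
    """
    facts = set(data)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(facts):
            if p != "type":
                continue
            for (sub, rel, sup) in schema:
                if rel == "subClassOf" and sub == o:
                    new_fact = (s, "type", sup)
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts


inferred = infer_types(SCHEMA, DATA)
```

The schema says how classes relate, the data says what exists, and the rule derives facts nobody stated explicitly, such as the symphony also being a creative work.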
“When I got interested in the topic, the new brand for ontologies, Knowledge Graphs, was not yet known. It was all about adding a bit of structure to the world and not very use case driven yet. I think I onboarded the ontology train around the time of the Linked Open Data hype, so very many exciting opportunities lay in creating data-mesh applications that made use of two or more LOD datasets”, says Kari.
In any case, building an EKG is itself a knowledge-driven process, and when starting out one may feel overwhelmed. Alexopoulos, for example, finds it ironic that his first ontology was about fraud detection:
“Back then I also felt a bit of a fraud, in the sense that I had no formal education on ontologies and semantics. But then again, I wouldn’t have entered the field otherwise.
To develop the model I used languages, methodologies and tools coming from the Semantic Web community (such as RDF/OWL and Protege) but, most importantly, the whole process made me realize what it means to be a knowledge engineer. It’s not about knowing formal languages, methodologies and fancy tools for semantic data modeling; it’s about applying them appropriately.
And for that, you need to
- Understand, identify and make explicit the different semantic phenomena that you come across when you develop semantic models,
- Understand and use correctly the modeling elements you have available,
- Put yourself in the shoes of your model’s users and see if it’s really comprehensible by them and
- Question and uncover hidden/implicit assumptions behind modeling decisions.
So if I could go back, I would apply these hard-learned lessons to build a better model than the one I created then”.
Kari also emphasizes the applicability aspect of building EKGs, reusing existing models and making sure EKGs are used in the real world:
“Going back, I would have probably done further research on existing music ontologies, since MusicBrainz was already up at that time, and done some overlap and gap analysis. Since then I have become a firm believer in creating ontologies only if there is also a use case for them, so that I can evaluate the applicability of the ontology right away”.
Enterprise Knowledge Graphs opportunities and challenges
While many people are familiar with traditional domain modeling, for example for relational databases, not as many are proficient with EKGs. What are the differences, and why go from one to the other?
“Data modeling for traditional databases aims to serve the data representation needs of a particular application. An ontology, on the other hand, aims to serve the common understanding of one or more domains among different agents (systems and humans). So our job as ontologists is to achieve the semantic clarity that our domains, applications and users need, and that’s a tough, yet to me very enjoyable, task”, says Alexopoulos.
Kari says that “Intuitively, what excited me most was the opportunity to express human knowledge, the knowledge that the scientific field of humanities gathers, in machine-readable form. I actually like that many universities now call semantic web studies digital humanities studies. THIS is what it should be about:
Imagine if musicologists or historians would in addition to writing papers and books express the knowledge they have gathered and the synthesis they have created in triple format! What background knowledge this would give any ML model or a hybrid AI solution making use of both: what we humans believe and what the data shows”.
For Alexopoulos, the main challenge in building good semantic models is finding the right level of semantic expressiveness that will be beneficial to users and applications without too much maintenance cost:
“In my experience, software developers and data engineers tend to under-specify meaning when building data models. Ontologists, linguists and “experts” on the other hand tend to over-specify it and debate about semantic distinctions for which the model’s users may not care at all.
If you don’t make any effort to create and attach some proper semantic description to your data (call it schema, ontology, it doesn’t really matter) in order to shed some light on their meaning, you can’t expect systems and people to use these data consistently. Human language and perception are full of ambiguity, vagueness, imprecision and other similar phenomena.
Those are inevitably reflected into the data and affect their interpretability and usability. Not tackling these phenomena with proper semantic modeling will most likely lead to suboptimal or even harmful usage of the data and, in the end, the organization will get no value out of them”.
For Kari, the biggest challenge, or the challenge she is most interested in tackling, is to create and develop hybrid solutions with ML models. Kari wants to “once and for all bury this idea that graphs compete with ML models and ML models do things better.
The graph doesn’t do anything! It is data expressed in a networked form with structural and associative information about real world concepts as we humans understand it. Surely an ML algorithm could work with qualitative background data such as this to improve its performance?”
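One simple way to read this idea is that the graph supplies extra input features for a model. The sketch below is an assumption-laden illustration rather than a description of how Zalando or anyone else does it: the concepts, the links and the two helper functions are all hypothetical.

```python
# Illustrative only: a tiny knowledge graph as adjacency sets.
# Items, concepts and links are hypothetical.
GRAPH = {
    "sneaker": {"footwear", "sportswear"},
    "sandal": {"footwear", "summer"},
    "raincoat": {"outerwear"},
}

# A fixed, ordered list of concepts so every item gets a vector of equal length.
VOCAB = sorted({concept for linked in GRAPH.values() for concept in linked})


def graph_features(item, vocabulary=VOCAB, graph=GRAPH):
    """Encode an item's linked concepts as a binary vector over the vocabulary."""
    linked = graph.get(item, set())
    return [1 if concept in linked else 0 for concept in vocabulary]


def shared_concepts(a, b, graph=GRAPH):
    """Jaccard overlap of the concepts linked to two items (0.0 if both are empty)."""
    linked_a, linked_b = graph.get(a, set()), graph.get(b, set())
    union = linked_a | linked_b
    return len(linked_a & linked_b) / len(union) if union else 0.0


sneaker_vector = graph_features("sneaker")
```

Vectors like `sneaker_vector`, or a similarity score from `shared_concepts`, could then be appended to whatever features the ML model already learns from the data itself.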
The right tool for building Enterprise Knowledge Graphs
These days, many tools and vendors claim to be in the EKG space. That includes approaches without ontological underpinnings or rich schema capabilities. But is that really possible, given how much EKGs are about expressing semantic relationships?
For example, we’ve seen the term “Semantic Data Lake” emerge to describe EKGs built on top of Hadoop data lakes, but making this possible entails adding a semantic metadata layer on top of Hadoop. So what are the most suitable tools for building EKGs?
Kari notes that her team has some experience creating their own triple store and SPARQL query support on top of Datomic, but with the release of AWS Neptune, they dropped this: “There are still some out-of-the-box features in Datomic that we miss, but the convenience of having a database solution that supports RDF and SPARQL out-of-the-box outweighs this”, she says.
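For readers who have not worked with a triple store, the following sketch shows the core idea in plain Python: facts are stored as subject, predicate, object triples and queried by pattern. It is a toy under stated assumptions, not how Datomic, Neptune or SPARQL work internally, and the `TripleStore` class and the sample facts are hypothetical.

```python
class TripleStore:
    """A toy in-memory triple store with wildcard pattern matching."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
store.add("dress_1", "hasColor", "red")
store.add("dress_1", "hasMaterial", "cotton")
store.add("shirt_7", "hasColor", "red")

# "Which items are red?": leave the subject open, fix predicate and object.
red_items = [s for (s, _, _) in store.match(predicate="hasColor", obj="red")]
```

A SPARQL query expresses the same kind of pattern declaratively, and a production store adds indexing, persistence and standards compliance, which is the convenience Kari refers to.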
For Alexopoulos, the choice of technology matters less than getting the level of semantic expressiveness right: “Finding and keeping a balance is more difficult and more important than, say, deciding whether RDF is better than Neo4j.
After more than a decade in the field I am pretty immune to vendors claiming that they can build for you a knowledge graph overnight. Even in small organizations, complexity of data, processes and systems can be quite high, and expecting that all this complexity will be seamlessly taken care of by a software platform, no matter how sophisticated it is, is rather naive.
In a talk I gave at Data Innovation Summit last March, I highlighted the fact that a Knowledge Graph is not simply another data engineering, science or analytics project but rather a continuous effort of fueling the organization with up-to-date and useful data and knowledge to serve its business strategy. As such, it requires considering technical, strategic and organizational aspects, and technical-only focused approaches have a high risk of failing”.
Advice for the young at graph
So, what should you keep in mind when setting out to build your EKGs? Alexopoulos advises:
“Start with trying to understand what you are getting into, i.e., get a comprehensive, hype-free picture of what exactly knowledge graphs are, how they are used by other similar organizations, how they could benefit you and how easy or difficult it would be to implement them in your organization.
The latter is perhaps the most difficult to answer, as many factors are involved in such an estimation, all of them dependent on your organization’s particularities.
For example, are your products and services safety-critical, requiring your knowledge graph to have a very high level of precision? Then, most likely, you can’t trust fully automatic approaches for its construction but you will need to dedicate resources to its curation and quality control.
Or, from an organizational perspective, do you currently have many product teams, each working with its own data that it does not share with the rest? Then, integrating all these data under a single knowledge graph with agreed semantics will be much harder than if, for example, your organization already has an established data governance practice.
In a nutshell, treat a Knowledge Graph as a significant investment both in terms of time and resources, and make sure it’s compatible with your overall Data Strategy”.
Of course, it’s impossible to share insights gained in over 20 years of building EKGs in one blog post, but hopefully this should give you an idea.
This is what Connected Data London 2018 brought to the fore. Connected Data London 2019 is on! Secure your chance to learn from experts and innovators: get your ticket early. A limited number of early bird tickets are available.