DIY: DBpedia Movie Recommender – an exercise in linked open data engineering

Guest blogger and Connected Data London 2016 speaker Szymon Klarman unwraps some of the mysteries of Linked Open Data by trying to solve a problem many of us have encountered – not knowing what movie to watch.


Tired of tediously crawling through countless HTML pages of IMDb to find one more promising movie you’d like to watch? Fancy building your own movie recommendation system instead, one that could make the search swift and painless? Here’s a quick recipe:

  • Carve out a chunk of the DBpedia dataset describing movies, including properties such as director, starring actors, language, country and subject.
  • Blend it with a basic similarity measure that scores the extent to which the values of these properties overlap among different movies.
  • Top everything off with an eye-pleasing graph visualisation built with vis.js.
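The first step – carving out the movie data – boils down to a SPARQL query against DBpedia. Here is a minimal sketch in Python; the query is illustrative (the property names follow the DBpedia ontology and Dublin Core vocabularies, but the app’s exact query may differ):

```python
# Illustrative SPARQL query over DBpedia's movie data. The properties queried
# (director, starring, language, country, subject) mirror the recipe above;
# dbo: is the DBpedia ontology namespace, dct: is Dublin Core terms.

def build_movie_query() -> str:
    """Return a SPARQL SELECT pulling the properties we score movies on."""
    return """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?film ?director ?actor ?language ?country ?subject WHERE {
  ?film a dbo:Film ;
        dbo:director ?director ;
        dbo:starring ?actor .
  OPTIONAL { ?film dbo:language ?language . }
  OPTIONAL { ?film dbo:country  ?country . }
  OPTIONAL { ?film dct:subject  ?subject . }
}
"""
```

The query string would then be sent, URL-encoded, to the public endpoint at https://dbpedia.org/sparql (with `format=json` to get JSON results back).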

The result you can achieve in just a few hours is a simple yet functional movie recommender powered by the riches of the linked open data cloud. And the little engineering exercise itself might also offer a few interesting insights into the nature of the linked open data technology and initiative.
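As for the visualisation layer, vis.js’s Network component consumes the graph as two arrays, nodes and edges. A hedged sketch of preparing that structure from precomputed similarity scores – the function name and the 0.2 threshold are made up for illustration, not taken from the original app:

```python
import json

def to_visjs_graph(scores, threshold=0.2):
    """Turn {(film_a, film_b): similarity} into vis.js-style nodes/edges.
    A 'value' on an edge scales its drawn width in the vis.js Network view."""
    films = sorted({film for pair in scores for film in pair})
    nodes = [{"id": f, "label": f} for f in films]
    edges = [{"from": a, "to": b, "value": s}
             for (a, b), s in scores.items() if s >= threshold]  # drop weak links
    return {"nodes": nodes, "edges": edges}

# Hypothetical scores for three films:
graph = to_visjs_graph({("Pulp Fiction", "Reservoir Dogs"): 0.6,
                        ("Pulp Fiction", "Titanic"): 0.05})
print(json.dumps(graph))
```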

The concept of linked open data (LOD) comes with three hallmark promises: firstly – of making the data immediately accessible and machine processable by your applications; secondly, in contrast to alternative, non-graph data models – of making the connections between objects in the data and across the datasets explicit and easy to exploit; thirdly, by involving the sociological and technological frames of the Web architecture – of making the LOD cloud the richest global knowledge repository, free and open for reuse to your own benefit and liking. The observations made while building our movie recommender largely support these theses, even if in reality they have to be somewhat more carefully qualified – at least at the current level of maturity of the LOD technologies.

The machine accessibility of the LOD cloud is ensured by the use of the standard OWL/RDF(S) languages for representing data, and by the ability to make automated inferences over it, given a set of ontology axioms. That’s really the pragmatic gist of it. The far-fetched visions of machines being able to sufficiently decrypt the meaning of a query to autonomously reach out for the relevant data using ontologies and live SPARQL endpoints are so far, well… far-fetched. The findability of information on the LOD cloud and the availability of SPARQL endpoints are two as-yet-unresolved limitations, well known to LOD practitioners. It is still the data engineer who has to find the right portions of data and download them for local use. The ontologies and SPARQL endpoints are mostly there to facilitate the engineer’s task, not the machine’s, but in this respect they are invaluable, allowing one, for instance, to identify and extract a movie-related subset of DBpedia in mere minutes.
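Extracting such a subset in practice usually means paging, since public endpoints (DBpedia’s included) cap the number of rows returned per request. A small sketch of the usual LIMIT/OFFSET workaround – the page size and page count here are assumptions to be tuned to the endpoint’s actual limits:

```python
def paged_queries(select_query: str, page_size: int = 10000, pages: int = 5):
    """Yield LIMIT/OFFSET variants of a SELECT query, the standard workaround
    for endpoints that truncate large result sets. Each page would then be
    fetched over HTTP, e.g. GET https://dbpedia.org/sparql?query=...&format=json."""
    for page in range(pages):
        yield f"{select_query}\nLIMIT {page_size} OFFSET {page * page_size}"
```

Paging assumes a stable result order, so in practice an ORDER BY clause is added to the query being paged.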

Data linking is the key to understanding how objects in the application domain are related to each other, thus complementing the knowledge about what these objects are like, which has been the main focus of more traditional data models. Of course, provided that this is what we really need. For building our toy application it was essential to know the individual descriptions of the movies – a little less so to comprehend the relationships holding between them, so in this case an attribute-value table would probably suffice. However, any attempt at making our recommender smarter or broader in scope would have us immediately tap into the existing DBpedia links to other objects (actors, directors, subjects) and their categories, which are further organized in rich taxonomies allowing for insightful inferences. In fact, more intelligent similarity algorithms than the naïve one employed here exploit property paths between entities rather than plain attribute-value pairs. And for this, a graph-based data model like RDF, underpinning LOD, is critical.
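The naïve measure mentioned above can be read as a set-overlap (Jaccard) score over the movies’ property–value pairs – an assumption about the implementation rather than the app’s exact code, illustrated here with made-up data:

```python
def similarity(props_a: dict, props_b: dict) -> float:
    """Naive overlap score: the Jaccard index of two movies' (property, value)
    pairs, pooled across director, starring, language, country and subject."""
    set_a = {(p, v) for p, values in props_a.items() for v in values}
    set_b = {(p, v) for p, values in props_b.items() for v in values}
    if not (set_a and set_b):
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Toy descriptions (values abbreviated; real DBpedia values are URIs):
reservoir = {"director": {"Quentin Tarantino"}, "language": {"English"},
             "subject": {"Heist films", "Neo-noir"}}
pulp = {"director": {"Quentin Tarantino"}, "language": {"English"},
        "subject": {"Neo-noir"}}
print(similarity(reservoir, pulp))  # 3 shared pairs out of 4 distinct -> 0.75
```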

The richness and openness of knowledge in the LOD cloud is perhaps the most powerful prospect when combined with the previous two. If structured, reliable data about any domain could be freely and effortlessly accessed via the Web, the traditional knowledge acquisition bottleneck in building intelligent knowledge-based systems would simply cease to exist. Needless to say, we’re not exactly there yet. A quick run through our recommender might easily have you questioning your own familiarity with the movie domain (in what sense is ‘Star Wars’ so similar to ‘Titanic’? and why is ‘Pulp Fiction’ not among the top picks similar to ‘Reservoir Dogs’?). Well, your internal similarity algorithm is probably all right, and the one behind the recommender is not the major issue either. The real hindrance is the low quality of our movie dataset – the one extracted from DBpedia, the very heart of the LOD cloud. Are such problems symptomatic of the entire LOD cloud then? Definitely not. DBpedia inherits a lot of its messiness from its original source, i.e., Wikipedia, defined by its collaborative style of content publishing and validation – inevitably noisy and of highly varying quality. Many other data publishers who currently embrace the LOD practice, however, do accept responsibility for ensuring the quality of the data they contribute to the LOD cloud. It is the question of data provenance, not of publishing or representation formats, that will eventually matter. The vision of the LOD cloud evolving into a global hub of machine-understandable knowledge of good quality and wide coverage is therefore quite realistic.

Overall, the ease and speed with which prototypes like this one can be put into action is ultimately the strongest argument that linked open data works. This observation, in turn, is apparently common enough to be efficiently bootstrapping further growth and refinement of the LOD cloud and the accompanying technologies, as has been clearly witnessed over the past years. It can be expected, then, that as it steadily matures, the LOD cloud will become the natural backbone for many knowledge-driven applications far more sophisticated than a simple DBpedia movie recommender.