

June 22, 2016

DIY: DBpedia Movie Recommender – an exercise in linked open data engineering


Guest blogger and Connected Data London 2016 speaker Szymon Klarman unwraps some of the mysteries of Linked Open Data by trying to solve a problem many of us have encountered – not knowing what movie to watch.


Tired of tediously crawling through countless HTML pages of IMDb to find that one more promising movie you’d like to watch? Fancy building your own movie recommendation system instead, one that could make the search swift and painless? Here’s a quick recipe:

  • Carve out a chunk of the DBpedia dataset describing movies, including properties such as director, starring actors, language, country and subject.
  • Blend it with a basic similarity measure that scores the extent to which the values of these properties overlap among different movies.
  • Top it all off with an eye-pleasing graph visualisation built with vis.js.
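The scoring step of the recipe can be sketched in a few lines of Python. This is a minimal illustration only: it assumes each movie is represented as a dict mapping property names to sets of values, and uses a Jaccard-style overlap averaged over properties – the property names and the formula are assumptions, not the exact measure behind the original recommender.

```python
def similarity(movie_a, movie_b,
               properties=("director", "starring", "language", "country", "subject")):
    """Score how much the property values of two movies overlap.

    Each movie is a dict: property name -> set of values. For every property
    present in at least one movie, the Jaccard overlap of the value sets is
    computed; the final score is the average over those properties.
    """
    score, compared = 0.0, 0
    for prop in properties:
        a = movie_a.get(prop, set())
        b = movie_b.get(prop, set())
        if a or b:
            score += len(a & b) / len(a | b)
            compared += 1
    return score / compared if compared else 0.0
```

For example, two movies sharing a director but differing in language would score the average of a full overlap and an empty one.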

The result you can achieve in just a few hours is a simple yet functional movie recommender powered by the riches of the linked open data cloud. And the little engineering exercise itself might also offer a few interesting insights into the nature of linked open data as a technology and an initiative.

The concept of linked open data (LOD) comes with three hallmark promises: first, of making data immediately accessible and machine-processable by your applications; second, in contrast to alternative, non-graph data models, of making the connections between objects within and across datasets explicit and easy to exploit; third, by building on the social and technological frames of the Web architecture, of making the LOD cloud the richest global knowledge repository, free and open for reuse to your own benefit and liking. The observations made while building our movie recommender largely support these theses, even if in reality they have to be somewhat more carefully qualified – at least at the current level of maturity of LOD technologies.

The machine accessibility of the LOD cloud rests on the use of the standard OWL/RDF(S) languages for representing data, and on the ability to make automated inferences over it, given a set of ontology axioms. That’s really the pragmatic gist of it. The far-fetched visions of machines being able to decrypt the meaning of a query well enough to autonomously reach out for the relevant data using ontologies and live SPARQL endpoints are, so far, well… far-fetched. The findability of information on the LOD cloud and the availability of SPARQL endpoints are two as-yet-unresolved limitations, well known to LOD practitioners. It is still the data engineer who has to find the right portions of data and download them for local use. The ontologies and SPARQL endpoints are mostly there to facilitate the engineer’s task, not the machine’s, but in this respect they are invaluable, allowing one, for instance, to identify and extract a movie-related subset of DBpedia in mere minutes.
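To give a flavour, a movie-related subset can be carved out of DBpedia with a SPARQL query along these lines. The choice of properties, prefixes, and the LIMIT below are illustrative assumptions, not the exact query used for the recommender.

```python
# Illustrative SPARQL for extracting a movie subset of DBpedia.
# The selected properties and the LIMIT are assumptions, not the post's query.
MOVIE_QUERY = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?movie ?director ?actor ?language ?country ?subject WHERE {
  ?movie a dbo:Film ;
         dbo:director ?director ;
         dbo:starring ?actor ;
         dct:subject  ?subject .
  OPTIONAL { ?movie dbo:language ?language . }
  OPTIONAL { ?movie dbo:country  ?country . }
}
LIMIT 10000
"""

# The query can be sent to DBpedia's public endpoint at
# https://dbpedia.org/sparql, e.g. via the SPARQLWrapper library
# or a plain HTTP GET with the query as a URL parameter.
```

Running it against the live endpoint returns one row per (movie, director, actor, subject) combination, which can then be folded back into per-movie value sets locally.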

Data linking is the key to understanding how objects in the application domain relate to each other, complementing the knowledge about what these objects are like, which has been the main focus of more traditional data models. Provided, of course, that this is what we really need. For building our toy application it was essential to know the individual descriptions of the movies – a little less so to comprehend the relationships holding between them, so in this case an attribute-value table would probably suffice. However, any attempt to make our recommender smarter or broader in scope would have us immediately tapping into the existing DBpedia links to other objects (actors, directors, subjects) and their categories, which are further organised in rich taxonomies allowing for insightful inferences. In fact, similarity algorithms more intelligent than the naïve one employed here exploit property paths between entities rather than plain attribute-value pairs. And for this, a graph data model like RDF, which underpins LOD, is critical.
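As a toy illustration of the path-based idea (with invented triples, not actual DBpedia data): two movies that share no attribute value at all can still be related through a chain of intermediate entities, and the length of the shortest such chain is a crude proxy for relatedness.

```python
from collections import deque

# Invented toy triples; real DBpedia data would use full URIs.
TRIPLES = [
    ("ReservoirDogs", "director", "Tarantino"),
    ("PulpFiction",   "director", "Tarantino"),
    ("PulpFiction",   "starring", "Travolta"),
    ("FaceOff",       "starring", "Travolta"),
]

def path_distance(triples, start, goal):
    """Shortest number of hops between two entities, treating the triples
    as an undirected graph. Shorter paths suggest stronger relatedness;
    returns None when the entities are not connected at all."""
    adj = {}
    for s, _, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, dist = frontier.popleft()
        if node == goal:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None
```

Here ‘Reservoir Dogs’ and ‘Face/Off’ share no property value, yet a path through Tarantino and Travolta still connects them – exactly the kind of signal an attribute-value table cannot express.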

The richness and openness of knowledge in the LOD cloud is perhaps its most powerful prospect when combined with the previous two. If structured, reliable data about any domain could be freely and effortlessly accessed via the Web, the traditional knowledge acquisition bottleneck in building intelligent knowledge-based systems would simply cease to exist. Needless to say, we’re not exactly there yet. A quick run through our recommender might easily get you questioning your own familiarity with the movie domain (in what sense is ‘Star Wars’ so similar to ‘Titanic’? and why is ‘Pulp Fiction’ not among the top picks similar to ‘Reservoir Dogs’?). Well, your internal similarity algorithm is probably all right, and the one behind the recommender is not the major issue either. The real hindrance is the low quality of our movie dataset – the one extracted from DBpedia, the very heart of the LOD cloud. Are such problems symptomatic of the entire LOD cloud then? Definitely not. DBpedia inherits a lot of its messiness from its original source, Wikipedia, which is shaped by its collaborative style of content publishing and validation – inevitably noisy and of highly varying quality. Many other data publishers who currently embrace the LOD practice, however, do accept responsibility for ensuring the quality of the data they contribute to the LOD cloud. It is the question of data provenance, not of publishing or representation formats, that will eventually matter. The vision of the LOD cloud evolving into a global hub of machine-understandable knowledge of good quality and wide coverage is therefore quite realistic.

Overall, the ease and speed with which prototypes like this one can be put into action is ultimately the strongest argument that linked open data works. This observation, in turn, is apparently common enough to efficiently bootstrap further growth and refinement of the LOD cloud and its accompanying technologies, as has undoubtedly been witnessed over the past years. It can be expected, then, that as it steadily matures, the LOD cloud will become the natural backbone for many knowledge-driven applications far more sophisticated than a simple DBpedia movie recommender.

4 Comments on “DIY: DBpedia Movie Recommender – an exercise in linked open data engineering”

Jesús Barrasa
June 23, 2016 at 11:05 am

Hi Szymon, nice post.

It would be super interesting to know a bit more about how you used ontologies to infer and score the similarity between movies. Do you think it would be possible?

Thanks in advance and looking forward to meeting you at Connected Data 2016.


Szymon Klarman
June 24, 2016 at 1:10 am

Hi Jesus,

In short, the dc:subject property associated with movie objects ranges over SKOS concepts organised in a thesaurus in DBpedia, which originates in the Wikipedia categories. You can use it to infer implicit (broader) subject associations via the skos:broader property. For instance, there could be a movie with subject movie_about_cats and another one with movie_about_dogs, which are clearly two different values. But in the thesaurus both might have movie_about_house_pets as a skos:broader concept. If you apply this inference step you will obviously observe some degree of similarity between the two movies that you couldn’t see before. If you were to represent the thesaurus alongside the data as part of the same property graph, you could basically look for a number of different (minimal) paths connecting movies to capture this mechanism.
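Sketched in Python, this inference step amounts to taking the transitive closure of skos:broader and expanding each movie’s subject set before comparing. The concept names below are the invented ones from the example above, and the broader relation is hand-coded rather than loaded from DBpedia.

```python
# Hand-coded fragment of a skos:broader hierarchy (invented concept names).
BROADER = {
    "movie_about_cats":       {"movie_about_house_pets"},
    "movie_about_dogs":       {"movie_about_house_pets"},
    "movie_about_house_pets": {"movie_about_animals"},
}

def ancestors(concept, broader):
    """All concepts reachable from `concept` via skos:broader (transitive)."""
    seen, frontier = set(), {concept}
    while frontier:
        c = frontier.pop()
        for parent in broader.get(c, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.add(parent)
    return seen

def expanded_subjects(subjects, broader):
    """A movie's explicit subjects plus every inferred broader concept."""
    out = set(subjects)
    for s in subjects:
        out |= ancestors(s, broader)
    return out
```

With this expansion, the cat movie and the dog movie now share the inferred subjects movie_about_house_pets and movie_about_animals, so any overlap-based measure starts to see them as similar.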

The flip side of using such a thesaurus is naturally the increase in the size of the search space, which together with the not so great quality of this subset of DBpedia might compromise the potential gain quite a lot…

Hope we get to discuss more at the conference. See you there!

Max
July 28, 2016 at 12:43 pm

Hello Szymon,

Thank you for the post. Do you have an updated link to the movie recommender? The link seems to be broken.

Best wishes, Max

Hey Max,
Just checked the link; it seems to be working fine for me, it just takes time to load up. Can you tell me what error appears?
