Getting data from Wikipedia is no easy feat.
At first glance, the easiest route seems to be scraping the site with a crawler built around an HTML parser such as Beautiful Soup, but this is highly likely to get your client blocked. Not to mention the sheer number of pages you’d need to crawl if you are looking for something in particular.
Another option is Wikipedia’s Special:Export functionality, which lets you export a number of pages in XML format. However, this requires you to know in advance which pages you’d like to export, and you’ll still need to translate the XML if you want to save it as tabular data.
There is an easier way though: querying DBpedia using a query language called SPARQL. This blog will cover how to write a simple SPARQL query that identifies Wikipedia pages containing a certain link (in this case, a link that identifies the film’s genre) and returns each film along with additional attributes such as producer, director, tags and actors.
DBpedia
DBpedia is a project that extracts structured content from the information created in Wikipedia and preserves the links between all the pages. It can be accessed at dbpedia.org, and you can view any individual Wikipedia page and the data associated with it by using the URL ‘dbpedia.org/page/’ followed by the page name. For example, to view the data associated with the first film in The Hobbit trilogy you can go to http://dbpedia.org/page/The_Hobbit:_An_Unexpected_Journey.
SPARQL
SPARQL (pronounced ‘sparkle’) is a recursive acronym for SPARQL Protocol and RDF Query Language, and it is mostly used to query online knowledge graphs. It’s a SQL-like language that also shares similarities with the document-key-value syntax of NoSQL databases.
Understanding SPARQL
SPARQL is built around the concept of triples, often referred to as Subject, Predicate, Object. You may also hear them called Entity Identifier, Attribute Name, Attribute Value or Document, Key, Value.
Imagine the data in a table: the Subject would be the row, the Predicate the column and the Object the value. In DBpedia, the Subject is the page, the Predicate is a key and the Object is the value of that key for that page. It’s important to note that a Subject can have more than one Object associated with a key, something that tabular data doesn’t handle as easily.
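For example, the fact that Peter Jackson directed the first Hobbit film is stored as a single triple. Written out in full (the dbr: and dbo: prefixes you’ll see below are just shorthand for these IRIs), it looks roughly like this:
<http://dbpedia.org/resource/The_Hobbit:_An_Unexpected_Journey> <http://dbpedia.org/ontology/director> <http://dbpedia.org/resource/Peter_Jackson> .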
Writing SPARQL
To actually query DBpedia, you can go to dbpedia.org/sparql and submit your query to return the results in a number of formats such as HTML, JSON or CSV. You can also run SPARQL on DBpedia from within a Python application, but that will be covered in my next blog.
Let’s start with a really simple query to find all Wikipedia pages that contain a link to the Adventure_film page, in an attempt to find all adventure films. The SPARQL query for this looks like:
SELECT DISTINCT ?film
WHERE
{
?film dbo:wikiPageWikiLink dbr:Adventure_film .
}
In the WHERE clause you can see a perfect example of a triple: Subject, Predicate, Object. The Predicate dbo:wikiPageWikiLink is the key; it contains all the links in the Wikipedia page. The Object, dbr:Adventure_film, is the value we are searching for. Because we don’t know which film pages contain this link, we use the variable ?film as the Subject. By using this same variable in the SELECT clause, we are telling DBpedia this is the field we’d like returned.
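The same pattern also works in reverse: fix the Subject, leave the Object as a variable, and you get every link on a single page. As a quick sketch, this query (using the Hobbit page from earlier, written as a full IRI) lists everything that page links to:
SELECT DISTINCT ?link
WHERE
{
<http://dbpedia.org/resource/The_Hobbit:_An_Unexpected_Journey> dbo:wikiPageWikiLink ?link .
}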
Now we want to return more information: the pageID, the abstract and the film name as well:
SELECT DISTINCT ?film ?number ?abstract ?name
WHERE
{
?film dbo:wikiPageWikiLink dbr:Adventure_film .
?film dbo:wikiPageID ?number .
?film rdfs:comment ?abstract .
?film dbp:name ?name .
}
By using the same Subject ?film we can search the same pages, and by providing new variables as Objects we don’t need to know the information before it is returned. We also need to add the new variables to the SELECT clause in order for them to be returned by our query.
Slightly More Advanced SPARQL
Now we can start to get really specific with our query. Let’s say we only want films whose abstract is written in English:
FILTER ( LANG ( ?abstract ) = 'en' )
We can add this to the end of our WHERE clause like so:
SELECT DISTINCT ?film ?number ?abstract ?name
WHERE
{
?film dbo:wikiPageWikiLink dbr:Adventure_film .
?film dbo:wikiPageID ?number .
?film rdfs:comment ?abstract .
?film dbp:name ?name .
FILTER ( LANG ( ?abstract ) = 'en' )
}
Some pages might not contain all the fields we want, but we still want those pages returned whether they have them or not. To do this we can use the OPTIONAL keyword:
OPTIONAL { ?film dbo:cinematography ?cinematography } .
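Dropped into our running example (this just combines the pieces we’ve already seen), the query now returns the cinematographer wherever DBpedia has one, and still returns the film when it doesn’t:
SELECT DISTINCT ?film ?number ?abstract ?name ?cinematography
WHERE
{
?film dbo:wikiPageWikiLink dbr:Adventure_film .
?film dbo:wikiPageID ?number .
?film rdfs:comment ?abstract .
?film dbp:name ?name .
OPTIONAL { ?film dbo:cinematography ?cinematography } .
FILTER ( LANG ( ?abstract ) = 'en' )
}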
And finally… what do we do if there is more than one Object for the Predicate specified? Well, SPARQL will automatically return a new row per Object, but this results in a lot of duplication. We can, however, group all the Objects into one field by using the following in the SELECT clause:
(GROUP_CONCAT(DISTINCT ?starring; SEPARATOR="-") AS ?actors)
Note that the alias has to be a new variable name (hence ?actors rather than reusing ?starring). Putting it all together:
SELECT DISTINCT ?film ?number ?abstract ?name ?cinematography ?director ?gross ?language
(GROUP_CONCAT(DISTINCT ?starring; SEPARATOR="-") AS ?actors)
(GROUP_CONCAT(DISTINCT ?subject; SEPARATOR="-") AS ?subjects)
(GROUP_CONCAT(DISTINCT ?producer; SEPARATOR="-") AS ?producers)
WHERE
{
?film dbo:wikiPageWikiLink dbr:Romantic_comedy .
?film dbo:wikiPageID ?number .
?film dbp:starring ?starring .
?film rdfs:comment ?abstract .
?film dbp:name ?name .
OPTIONAL { ?film dct:subject ?subject } .
OPTIONAL { ?film dbo:cinematography ?cinematography } .
OPTIONAL { ?film dbo:director ?director } .
OPTIONAL { ?film dbo:gross ?gross } .
OPTIONAL { ?film dbo:producer ?producer } .
OPTIONAL { ?film dbp:language ?language } .
FILTER ( LANG ( ?abstract ) = 'en' )
}
GROUP BY ?film ?number ?abstract ?name ?cinematography ?director ?gross ?language
As you can see, querying DBpedia with SPARQL is a great way to get data out of Wikipedia quickly, and it will allow you to build up some really cool datasets.
Using SPARQL through the DBpedia query editor does have some drawbacks, however. First of all, you can only return a maximum of 1000 results. Secondly, some of the results are a bit messy because they contain links to documents here and there. Both of these can be resolved by wrapping your SPARQL in some Python that cleans the data and makes multiple requests in order to get past the 1000-result limit. I will be covering this in my next blog, so check it out if you are interested…
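The paging itself is plain SPARQL: ORDER BY makes the result order deterministic, and LIMIT and OFFSET then walk through it one batch at a time. A minimal sketch of the idea, assuming batches of 1000:
SELECT DISTINCT ?film
WHERE
{
?film dbo:wikiPageWikiLink dbr:Adventure_film .
}
ORDER BY ?film
LIMIT 1000
OFFSET 1000 # second batch; increase the offset by 1000 on each request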