April 9, 2011

DISCLAIMER: this is a bit of a hack, but it should get you started. I managed to get the core dataset of DBpedia into Neo4j, but this procedure should work with any Blueprints-ready database, such as OrientDB.

Ok, a little background first: we want to store DBpedia inside a graph database, instead of the typical triple store, and run SPARQL queries over it. DBpedia is a project that aims to extract structured content from Wikipedia: information such as what you find in the infoboxes, the links, the categorization, geo-coordinates etc. This information is extracted and exported as triples to form a graph, a network of properties and relationships between Wikipedia resources.

So we're going to store millions of triples like "Barack Obama -- president of --> United States of America", or "Rome -- capital of --> Italy" etc. Once we have these triples in the store, we can run queries over this graph with SPARQL, a language that is not so different from SQL.
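
To make the idea concrete, here is a minimal Python sketch of what a SPARQL-style pattern query does over a handful of triples. It is not how Neo4j or Sesame evaluate queries; the match helper, the "?" variable convention and the sample data are all made up for illustration.

```python
# Toy triple store: each triple is (subject, predicate, object).
TRIPLES = [
    ("Barack Obama", "president of", "United States of America"),
    ("Rome", "capital of", "Italy"),
    ("Paris", "capital of", "France"),
]


def match(pattern, triples=TRIPLES):
    """Return one binding dict per triple that matches the pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed to match, so keep this triple's bindings.
            results.append(bindings)
    return results


# Roughly: SELECT ?city ?country WHERE { ?city <capital of> ?country }
capitals = match(("?city", "capital of", "?country"))
```

A real SPARQL engine joins several such patterns together, which is where the graph traversal comes in.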

Why do this with something like Neo4j or OrientDB instead of a triple store? Well, graph traversal, the operation of exploring the graph to answer a query (well, most queries), is usually more efficient with a graph database. Check out my slides for NoSQLDay and this nice paper by Marko Rodriguez and Peter Neubauer if you're interested in the details.

To start, we first have to get the DBpedia dump, which is split by language and into multiple datasets (geocoordinates, personaldata, links, categories etc.): choose the ones you need. The dump is also released in multiple formats; we'll go for N-Triples (hence the .nt extension), a line-based format for exporting RDF triples.
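
Since the format is line-based, each line can be handled on its own. The following Python sketch shows the shape of a line; it is a deliberately simplified parser (the project itself relies on Sesame), and the parse_nt_line name, the regular expression and the sample line in the comment are my own assumptions. Blank nodes and escaped quotes inside literals are not handled.

```python
import re

# Simplified pattern: subject and predicate are URIs in angle brackets; the
# object is either a URI or a quoted literal with an optional @lang or ^^<type>.
NT_LINE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*"(?:@[\w-]+|\^\^<[^>]*>)?)\s*\.\s*$'
)


def parse_nt_line(line):
    """Return (subject, predicate, object), or None for comments, blank lines
    and anything this simplified pattern does not cover.

    The object keeps its brackets or quotes so URIs and literals stay distinguishable.
    """
    m = NT_LINE.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


# Illustrative line, in the style of the DBpedia dumps:
# <http://dbpedia.org/resource/Rome> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Italy> .
```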

Next, you'll have to clone my dbpedia4neo GitHub project, which is composed of two packages: org.acaro.dbpedia4neo.web and org.acaro.dbpedia4neo.inserter. The first one is a very small SPARQL web endpoint based on the nice "Simple webserver with WebSocket and REST using jetty and NO XML" (yes, that's the actual name of the project).
The second is a package with a very simple class that parses the nt files with the Sesame library and inserts into the graph through the Blueprints layer. The process is very simple: each line holds a triple, and for each triple an insert is issued to the database.

Two things to note here. First, some of the nt files are malformed: some URIs don't start with a scheme (i.e. http), and Sesame will simply refuse to go on parsing, failing with no way to intervene. So you'll have to filter those lines out before you insert the files. I've used this grep command to find them: grep -P '<(?!http(s)?:\/\/).*>' (add -v to keep only the lines that do not match). Second, by default the insertion is transactional, so each insert would start its own transaction just to add a single triple. You can imagine the performance issue here. For this reason the class uses a bulk strategy, but you will need to set the size of the bulk insert yourself, as it depends on the amount of available RAM.
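
Both points can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of the inserter class: the well_formed and batches helpers are hypothetical names, and the regular expression is just the grep pattern rewritten for Python's re module.

```python
import re
from itertools import islice

# Same idea as the grep pattern: a '<' that is not followed by http:// or https://.
BAD_URI = re.compile(r"<(?!https?://).*>")


def well_formed(lines):
    """Yield only the lines that the grep-style pattern does not match."""
    for line in lines:
        if not BAD_URI.search(line):
            yield line


def batches(items, size):
    """Group an iterable into lists of at most `size` items.

    Each list stands for the inserts that would share a single commit.
    """
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

You would then loop over batches(well_formed(open_file), size), insert every triple of a chunk and commit once per chunk, picking the size according to your RAM. Note that the pattern is blunt: a literal that happens to contain angle brackets would be dropped as well.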

What are the pieces that do the magic? From the inserter's perspective, we need a Sail interface to issue an addStatement() operation for each triple. Blueprints' GraphSail is made for this: it translates each addStatement(), the insertion of a triple into the store, into a call to the underlying IndexableGraph, which here is implemented by Neo4j.
From the perspective of the SPARQL endpoint, we still have the GraphSail with Neo4j inside. This time the GraphSail is wrapped in a SailGraph, which implements methods like executeSparql() on top of the GraphSail and so allows the queries to be executed.

That is all you need to get started. Be warned: it's going to take at least 24 hours to insert the whole thing. The process is mostly CPU bound; I believe the problem is due to Neo4j's Lucene indexing, but I haven't investigated further.

Let me know if and how you improved this workflow.

