Giraph talk for GraphDevRoom @ FOSDEM 2012
This weekend I attended the GraphDevRoom at FOSDEM 2012, where the community met to discuss current trends in graph processing. The talks ranged from query languages to ranking algorithms on graphs; some presented research results while others took a more business-oriented path.
The event was sponsored by Gephi, NuvolaBase and Belectric, and the first two gave talks of their own as well. I was particularly impressed by the product presented by the NuvolaBase guys, the same team behind OrientDB, who provide instances of their DB in the cloud following a "database as a service" architecture.
One interesting talk was given by Rene Pickhardt on the graph ranking algorithm Graphity, which allows for the retrieval of the top-k items associated with a vertex in a network (think of an operation such as "give me the last 5 tweets by Justin Bieber"). The presented results were promising thanks to a very smart approach.
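To make the top-k operation concrete, here is a minimal Python sketch. It is not Graphity itself, whose internals are not covered in this post; it is a naive baseline that assumes each vertex keeps its own items sorted newest first. The names `items_by_vertex`, `top_k` and `top_k_merged` and the sample data are all made up for illustration.

```python
import heapq
from itertools import islice

# Hypothetical data: each vertex keeps its items as (timestamp, text),
# already sorted newest first.
items_by_vertex = {
    "alice": [(30, "a3"), (20, "a2"), (10, "a1")],
    "bob": [(25, "b2"), (5, "b1")],
}

def top_k(vertex, k):
    """Return the k most recent items of a single vertex."""
    return items_by_vertex.get(vertex, [])[:k]

def top_k_merged(vertices, k):
    """Return the k most recent items across several vertices.

    heapq.merge lazily merges the already-sorted lists, so the inputs
    are only consumed as far as needed to produce k items.
    """
    streams = [items_by_vertex.get(v, []) for v in vertices]
    merged = heapq.merge(*streams, key=lambda item: item[0], reverse=True)
    return list(islice(merged, k))
```

Calling `top_k("alice", 2)` answers the "last 2 items by this vertex" question directly, while `top_k_merged(["alice", "bob"], 3)` shows the same idea applied to a small neighbourhood of vertices.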
Neo4j presented their graph query language Cypher, which has a very interesting syntax with concepts borrowed from SPARQL, SQL, Gremlin and regular expressions. I personally find it much cleaner than Gremlin.
There was also room for a couple of talks about graph visualization, with Processing and Gephi respectively.
I gave a talk about Giraph and you can find the slides embedded in this post. Good news here: people were eager to get to know more about the project and we're gaining quite some authority and trust.
The closing talk was mostly an opportunity to discuss the need for a general benchmarking approach to graph databases. The discussion moved towards a set of universal, primitive operations to be tested along with a group of established algorithms. My suggestion was to focus on analyzing the queries actually being run on the databases rather than following a top-down, a priori design.
Throughout the whole day a few themes came up repeatedly, one of them being the need for distribution strategies for graph databases. None of the vendors, except for InfiniteGraph, which isn't sharing its technology, supports a general way of distributing a graph across multiple machines (something that elsewhere would be called sharding). The current solution provided by the vendors is basically to pass the ball to the users, allowing them to define their own domain-specific way to project the graph onto the different nodes of the cluster.
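As a rough illustration of what "passing the ball to the users" can mean, the Python sketch below contrasts a generic hash placement with a domain-specific one. Nothing here is any vendor's API: `hash_partition`, `country_partition`, `cut_edges` and the `country_of` lookup are hypothetical names, and grouping by country is just one assumed example of domain knowledge.

```python
import zlib

def hash_partition(vertex_id, num_nodes):
    """Generic placement: a stable hash of the vertex id picks the machine."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_nodes

def country_partition(vertex_id, num_nodes, country_of):
    """Domain-specific placement: vertices of the same country share a machine."""
    country = country_of[vertex_id]
    return zlib.crc32(country.encode("utf-8")) % num_nodes

def cut_edges(edges, placement):
    """Count edges whose two endpoints live on different machines."""
    return sum(1 for u, v in edges if placement[u] != placement[v])
```

A user would compute a `placement` dictionary with either function and compare the `cut_edges` counts: if most edges connect vertices of the same country, the domain-specific placement keeps those edges local, whereas the hash placement knows nothing about the graph's structure.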
It was also quite clear that the community hasn't converged on the topic of transactions. Some think transactions should be ACID, some think they should be more relaxed, and others (me) think transactions should be avoided altogether in favor of a more fine-grained set of atomic operations (a little bit like the path taken by some NoSQL databases such as HBase).
Something that struck me was the widespread disappointment with the drawbacks of the Tinkerpop/Blueprints stack. It's generally accepted that a common graph API is necessary, but there's also general agreement that the performance costs of Blueprints outweigh its advantages. Talking about abstraction, a nice discussion followed on the effectiveness of the different graph data models. Needless to say, an agreement was not even close.
To wrap up, it was a very pleasant and well-organized event, showing a very tight-knit and enthusiastic community around these technologies.