In the past, I’ve written about Google Pregel. At the time, as it was quite obvious, there was no implementation of anything like Pregel out there of any kind, not to mention Open Source.
Now things have changed, so I’d like to give a quick list of the projects out there that might help you getting started with this technology, as I see that very often people ask what the difference is between all of them. I have direct experience only with the Java implementations, so I can talk about them a bit more extensively.
As you remember from my last post, Pregel is a framework for large-scale graph processing that builds on top of the BSP computational model. It allows the developer to write a vertex-centric algorithm for graph processing (meaning you write a function that receives messages from vertices and sends messages to other vertices) and forget about things as distribution and fault-tolerance. As I mentioned in the post, BSP isn’t directly connected to graph processing, but it maps quite smoothly to it.
The first project to mention is Apache Hama. “Apache Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms. It was inspired by Google's Pregel, but different in the sense that it's purely BSP and common model, not just for graph.”. What this means for you, a developer willing to work on graph processing, is that you have to build the Pregel API/layer on-top-of it. What Apache Hama provides are the BSP primitives, so messaging between tasks and a synchronization barrier. Be ware, it’s not just a matter of API, there’s a technological effort to implement all the graphy behavior that happens under the hood of Pregel. To have an idea about what that means, go and chekout the example implementation of the PageRank and SingleSourceShortestPath code, and see how they differ from the sample code in the Google Pregel paper. Also, last time I checked, the framework wasn’t supporting much of the fault-tolerance, the Combiners and Aggregators and some other features of Pregel.
More in this direction you can try GoldenOrb. “GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture.”. It tries to stick as much as possible to the original Pregel API and it builds on top of Hadoop. GoldenOrb differs from Apache Hama for the first aspect but shares the second one: they both build on top of Hadoop, meaning they re-use parts of it, as the RPC for messaging, they use Zookeeper for synchronization, and HDFS for IO. Although they build on top of Hadoop, they still require you to install and add to your infrastructure some of their code (so to say, GoldenOrb and Hama have their own JobTracker and TaskTracker). This can be a drawback, considering you might not have control over your Hadoop cluster to install new software or in general the code might suffer from usual problems related to young projects.
The third project, which attacks this last aspect in particular and was developed originally by Yahoo!, is called Giraph (great name by the way). Giraph is “a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.”. What this means is that Giraph is a Hadoop Map-only job, so you can run it on your existing Hadoop infrastructure, but it provides the API and the middleware of Pregel. It also does implement a BSP computation, so you’ll have the advantages that make Pregel often a better solution for iterative graph algorithms compared to Mapreduce.
So, to summarize, what Hame, GoldenOrb and Giraph have in common is: Java platform, Apache License (and incubation), BSP computation. What they differ for: Hama offers BSP primitives not graph processing API (so it sits at a lower level), GoldenOrb provides Pregel’s API but requires the deployment of additional software to your existing Hadoop infrastructure, Giraph provides Pregel’s API (and is kind of complete at the current state) and doesn’t require additional infrastructure.
Closing with Java, I’d just like to mention Phoebus, a Pregel implementation in Erlang. I haven't tried it, so I cannot say much about it, but if you want to comment on that, please drop me a line in the comments.
So, here it is, fire up your Hadoop pseudo-cluster and get back to me if you have something to add.
EDIT: I forgot to mention other two possibilities. They are not really Pregel "clones", but are for sure inspired by Pregel. The first is called Signal/Collect and is a framework particularly developed for the Semantic Web, although it can be used for other scenarios as well. It's written in Scala, with Java API also available, and is released with Apache License. Here's how the authors describe it: "Signal/Collect is a framework for synchronous and asynchronous parallel graph processing. It allows programmers to express many algorithms on graphs in a concise and elegant way.". It differs from Pregel for two main reasons: (1) the edges can have computations associated to them as well, so you basically write a compute() method for the vertices and another one for the edges (2) the synchronization barrier constraints are relaxed, so it's possible also to implement async algorithms that look more dataflow-ish. At the time the core system doesn't scale to multiple machine, but the next version is supposed to be fill this gap. Check this post's comments for more info.
The second is called HipG and it's Java GPL-licensed framework very similar to Pregel. Here's the description from the homepage: "HipG is a library for high-level parallel processing of large-scale graphs. HipG is implemented in Java and is designed for distributed-memory machine. Besides basic distributed graph algorithms it handles divide-and-conquer graph algorithms and algorithms that execute on graphs created on-the-fly. It is designed for clusters of machines, but can also be tested on desktops - all you need is a recent Java runtime environment.". The main contribution of this work to the Pregel model is again related to the way the synchronization barrier is handled: there's no more a global barrier but the algorithm can specify different barriers to which subsets of the vertices can synchronize. The idea is to allow a more fine-grained synchronization mechanism, to avoid many vertices to idle waiting for unrelated vertices to finish their computation.