The Giraph in Action book is now available

April 4, 2014

Giraph

After some effort, we are finally officially going live with the Manning Early Access Program for our Giraph in Action book. If you are interested in learning everything from taking your first steps with Giraph to developing full-fledged applications, the book can be your companion on this journey. Go and check it out on the Manning site. What follows is the description. Happy reading!

"Graphs are everywhere, and they are getting larger and larger. From the Internet to Social Networks, analysing large graphs on a single computer is not an option anymore. Graph mining applications have to be run in clusters, or in the cloud, by means of large-scale distributed computing.

Apache Giraph is a large-scale graph processing system that can be used to process Big Data. Giraph is part of the Hadoop ecosystem, and it is a loose open-source implementation of the Google Pregel system. Originally developed at Yahoo!, it is now a top-level project at the Apache Software Foundation, and it enlists contributors from companies such as Facebook, LinkedIn, and Twitter.

The book seeks to show how to write large-scale graph algorithms using Giraph. It covers the API and the programming model. It also presents explanations of common graph algorithms, and scenarios where modeling problems as graphs works best. Furthermore, it describes how to integrate Giraph with other tools in the Hadoop ecosystem, how to run Giraph in the cloud, and how to tune it to process very large graphs."

Apache Giraph slides at Hadoop Summit 2014

April 3, 2014

If you want to take advantage of the added value that comes from looking at your data as a graph and at your problem as a graph problem, you need tools built specifically for graphs, such as Giraph.

Graph analytics and algorithms have particular characteristics compared to more traditional analytics, and they pose a number of specific challenges to the tools that process them. These challenges are different from those faced by tools like MapReduce, Pig, and Hive.

The fact that Facebook, for example, is able to run graph analytics on a graph with a trillion relationships across hundreds of machines within minutes shows that Giraph is able to tackle those challenges. And this is what my talk at the Hadoop Summit was about.

Check out the slides.


Okapi: the new kid on the block of large-scale graph processing

March 1, 2014

There’s a new kid on the block of large-scale graph processing and graph analytics. While in the last few years we have seen new graph processing systems popping up quite regularly, with Apache Giraph and GraphLab being the most prominent and mature ones, less has happened with respect to the algorithms to run on these systems.
A lot has been written about the simplicity of the vertex-centric programming model these systems borrow from Pregel, but little code has been released. Currently, it is still mostly academia that is investigating design patterns for graph algorithms on these systems, and the practitioners who do tend to keep their algorithms private.

But things are changing! Okapi is a new open-source library developed at Telefonica for large-scale computations on graphs. I am a proud collaborator on this effort.
It is a very exciting project that tries to fill an important gap in the codebase of Giraph, and in the graph analytics scene in general: usable algorithms to analyse Big Graphs. To a certain extent, it follows the path of Apache Mahout, which provides a set of algorithms for machine learning and data analytics on top of Hadoop. Where Apache Mahout focuses on MapReduce, Okapi focuses on Giraph.

The codebase is growing fast, and it is currently organised in two main groups of algorithms:

  • Collaborative Filtering: CF is a class of techniques to build recommender systems based on a set of ratings that users give to items. According to CF, we can conceptually model the data as a matrix where rows represent users, columns represent items, and a cell contains the rating a user gave to an item. Algorithms analyse this matrix to compute predictions for the ratings that are not present in the matrix. Many of these algorithms are iterative and process the matrix in a number of steps. And this is where Giraph shines! Okapi includes a number of iterative algorithms that work on bipartite graphs, where users and items are connected by edges carrying the corresponding ratings. These algorithms include: Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD), Singular Value Decomposition (SVD++), Bayesian Personalised Ranking (BPR), Collaborative Less-is-more Filtering (CLiMF), TFMAP, Popularity Ranking, etc.

  • Graph Analytics: mining graphs to extract meaningful information has been used in particular in the analysis of Online Social Networks and the Web, to rank web pages, identify communities and popular users (so-called influencers), or detect fraud. Some of these algorithms borrow concepts from traditional graph theory, while other algorithms and heuristics have been introduced only recently. Okapi includes implementations of PageRank, Clustering Coefficients, Multi-source Shortest Paths, K-core Computation, Balanced k-way Partitioning (Spinner), SybilRank, and others. To give an idea of what vertex-centric code for one of these algorithms looks like, see the sketch right after this list.
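To make the programming model concrete, here is a minimal PageRank sketch written against Giraph's BasicComputation API. It is an illustrative example, not Okapi's actual implementation: the class name, the fixed number of supersteps, and the 0.85 damping factor are choices made just for this sketch.

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

/** Illustrative vertex-centric PageRank: each vertex keeps its rank as its value. */
public class PageRankSketch extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;  // arbitrary number of iterations for this sketch

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      // Sum the rank contributions received from the in-neighbours.
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.setValue(new DoubleWritable(0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Spread the current rank evenly over the outgoing edges.
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}
```

The same message-passing pattern, iterated over supersteps, is what the collaborative filtering algorithms above apply to the bipartite user-item graph.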

It’s not over. The Okapi team is also working on a modified version of the Giraph runtime to run real-time computations of graph algorithms on event-based workloads, without changing your existing code for offline analysis! The work is not ready for the public yet, but it has been under testing for a few months now. The team at Telefonica is looking for more testers and use cases, so if you have a real-world scenario where your graph is constantly changing and you want to keep your results up to date efficiently, drop them a line!

Come and visit us, there is a lot to be done, this is just the beginning!

Apache Giraph 1.0 released

May 7, 2013

The Apache Giraph team is proud to announce our first release out of incubation. The release version is 1.0.0 to reflect a lot of hard work that went into making the code stable enough for production use, memory efficient, and performant.

Apache Giraph is a scalable, distributed, iterative graph processing system inspired by BSP (bulk synchronous parallel) and Google's Pregel. Giraph distinguishes itself from those projects by being open source, running on Hadoop infrastructure, and going beyond the Pregel model with features such as master computation, sharded aggregators, out-of-core support, a no-single-point-of-failure design, and more.

Here are some highlights of the release:

  • Scales out easily to hundreds of machines and hundreds of billions of edges (memory permitting)
  • Efficient use of memory via fast byte-based serialization by default, with primitive-specific types available when better performance is required
  • Multithreaded input and computation to take advantage of multicore machines efficiently
  • Simplified vertex API
  • Vertex-based and/or edge-based input supported
  • Master compute API for handling application-wide logic (see the sketch after this list for how it pairs with aggregators)
  • Sharded aggregators for handling large (memory) aggregators
  • Easy access to/from Hive tables to integrate with your data warehouse
  • Out-of-core graph and messaging support
  • YARN support
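As a taste of the master compute and aggregator APIs, here is a small sketch of a master compute class that registers an aggregator and a computation that feeds it. It is written against the Giraph 1.x APIs as they appear in the current documentation; exact class names and packages shifted between releases, so treat it as an approximation rather than the canonical 1.0.0 example. The aggregator name and the stopping threshold are made up for illustration.

```java
import java.io.IOException;

import org.apache.giraph.aggregators.DoubleMaxAggregator;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

/** Sketch: application-wide logic lives in the master compute, which owns an aggregator. */
public class MaxValueMaster extends DefaultMasterCompute {
  static final String MAX_AGG = "max.value";  // hypothetical aggregator name

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator(MAX_AGG, DoubleMaxAggregator.class);
  }

  @Override
  public void compute() {
    DoubleWritable max = getAggregatedValue(MAX_AGG);
    if (getSuperstep() > 0 && max.get() >= 100.0) {  // made-up application-wide stopping condition
      haltComputation();
    }
  }
}

/** Every vertex contributes its value; the (sharded) aggregator combines the contributions. */
class MaxValueComputation extends
    BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    aggregate(MaxValueMaster.MAX_AGG, vertex.getValue());
    vertex.voteToHalt();
  }
}
```

In a real job, both classes would be registered in the Giraph configuration, the computation alongside the master compute.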

For release details and download access, please visit: http://giraph.apache.org/releases.html

Thanks so much to everyone for all their contributions. It is you who made this release possible! We've also been investing in updating our website as part of this release (http://giraph.apache.org), and more documentation and updates will be coming in the near future. We expect that releases will happen more frequently now that we are more familiar with the process.

Regards,

The Apache Giraph team

CFP: Graph Processing Devroom at FOSDEM 2013

December 23, 2012

We are organizing the Graph Processing Devroom at FOSDEM 2013. The Free and Open source Software Developers’ European Meeting (FOSDEM) is a two-day event organized by volunteers to promote the widespread use of Free and Open Source software. The Graph Processing DevRoom is now a two-year-old effort to hold a one-day workshop where people can discuss the topic of graph processing. If you have a cool project, or you think your use case is interesting for others, feel free to submit your proposal.

Important dates (all GMT+1)

Submission deadline: 28-12-2012
Notification of accepted speakers: 01-01-2013
Publication of the final schedule: 04-01-2013
Meetup: 2 February 2013, Brussels, Belgium. Co-located with FOSDEM 2013.

Important links

Submission link: https://graphdevroom.busyconf.com/proposals/new
Devroom site: http://fosdem.graph-database.org/
Last year's presentations: http://www.youtube.com/graphprocessing
FOSDEM site: https://fosdem.org/2013/

In case of problems, please contact the room team at: graph-devroom@graph-database.org

To be kept in the loop:

Follow @GraphDevRoom on twitter to be updated with the latest news of the event.

Organizations are silos of information

February 19, 2012

While at TIS, I was running a project related to Enterprise 2.0. As part of the process, we wrote a questionnaire to understand the paths through which information flows internally between teams and groups. At first, we asked all the employees things such as "if a customer asks you who's responsible, at your company, for a particular project or technology, how do you find out?" or "if you need a particular know-how internally, how do you search for it?". What we found out from these and other similar questions is that when employees need to solve this kind of problem, they mostly go to their boss. This upward delegation works on the assumption that the boss knows everything, knowledge that would supposedly be acquired during meetings with the leaders of the other teams. This upward delegation makes the boss a bottleneck and the teams silos of information. We tried to come up with a scenario that would easily describe what was going on and what was causing this situation. Here it goes.

When you're a startup, in the early days, you're all in the same room. Information spreads quickly and easily inside that room. When things go the right way you grow, and the team is split in two. Now the members of the teams have meetings with their leaders where they report the progress of their projects, the development of their know-how, etc. The team leaders later meet and exchange this information. The team leaders can now report information about team A to team B and vice versa. This mechanism basically works around the physical constraints of logistics, the fact that teams sit in different rooms. Sure, people still meet in the corridors or at the coffee machines, but as the company grows this gets more and more difficult, and soon you don't know what the members of the other teams are doing (and the bosses can't help there). So you hold more meetings, in the hope that more information can be exchanged, but we all know how people react to frequent and lengthy meetings. They are expensive: often they turn into a series of direct conversations between each member and the team leader, which means that the people waiting for their turn just idle while they could be doing something productive. Most probably they don't even listen to these direct conversations, maybe regretting it later (but can they be blamed?), when they end up working on a different project that depends on one of those conversations. To sum up, the content of meetings is unsearchable and not persistent, which makes it extremely costly.

Now, this is a simple example, but just imagine what this means if you take a big company and replicate this phenomenon across many levels, from the employees at the bottom of the pyramid to their boss, the boss of their boss, all the way up to the president. Enterprise 2.0 advocates a horizontal organization, at least for information management, where employees are provided with a platform similar to the well-known Web 2.0 tools (HA! guess where the 2.0 in Enterprise 2.0 comes from…) where they can post information about their projects and about what they do and like, ask and answer questions, upload documents, share links, and so on. All this information would be easily searchable; you can easily imagine what you could do, just take your favorite social network and imagine it being a tool used inside your company with a specific purpose. Of course it's not all roses and chocolates, so I'll get back to our previous questionnaire example. In the second part we asked questions such as "when you want to buy a new digital camera, would you ask for advice on Facebook?" or "would you use a wiki or Google Docs to help organize your holidays with friends living elsewhere?".
I guess you know where I'm going: what emerged is that they successfully acquire information, for which they don't have the right know-how, from their peers, not to mention that they already efficiently use search engines to locate information on the web. So, if they already happily use these tools in their free time, why shouldn't these tools and paradigms be used inside organizations to improve information flows and management? These are the basic principles of Enterprise 2.0: a means of breaking down the physical walls dividing the rooms and bringing the information flow back to the startup condition. Put more clearly, it is a way of breaking the bottlenecks and silo effects caused by upward delegation and dependency.

Explain Hadoop to me like I'm a... Linux power user

February 8, 2012

In November, Elia Bruni and I gave a hands-on introduction to Hadoop for the CIMEC researchers at the University of Trento. They are mostly NLP people or, more generally, cognitive-science researchers applying machine learning techniques to validate their models on real data. I personally think that explaining MapReduce to this kind of audience just through its API and its distributed architecture would be confusing and probably even useless. It's true that MapReduce provides a simple API that hides the complexity of fault tolerance, distributed IPC and parallelism, but there are so many ways to screw up and implement a suboptimal algorithm that I felt I needed a different approach.
Most of them are Linux power users: they work effectively with terminal-level commands, they run their tasks on a cluster managed by Sun/Oracle Grid Engine and, due to the nature of their work, they are used to working with algorithms that follow a scan-based pattern over files, or with pair-wise computations such as matrix operations. The experience of working with tools such as scp, rsync, ssh, grep, cat, etc. was there, along with an understanding of how to implement an algorithm based on scanning a file line by line.
For this reason I decided to present a story that they could relate to, presenting challenges for which they could come up with practical "Linux power-user level" solutions. Once this setup was complete, I would show that their solutions were analogous to what Hadoop does under the hood automatically. I thought this would give me two advantages: they would understand what MapReduce was designed for (and therefore what it's good at, the most important thing), and they would feel comfortable using it with the right paradigm, without being confused about what the "big distributed and complicated monster" was doing under their feet. I'm going to share this story/analogy here, hoping it can be useful for somebody else. Beware, I don't think it's anything magical or new; I've seen parts of this in different presentations (which I guess inspired me), but I believe the analogy has never been followed this directly and explicitly.

The initial scenario: it's 2002 and we're a startup running an e-commerce shop online; business is doing so well that we're starting to compete with the big guys. We soon realize that the clicks and searches our customers make on our site are more than just pages to be served from our database: they are actually data worth keeping and analyzing. After all, through their clicks and searches our customers are more or less explicitly telling us what they like and want. I show them a log line coming from an Apache web server: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)" and we note that with a simple regexp we can extract pairs such as "who clicked what", "when a certain page was clicked", "which page the user came from", etc. (a small sketch of such a regexp follows this paragraph). This is their bread and butter, so it's obvious to the whole audience how these sets of pairs can be fed into typical machine learning algorithms to extract similarity between products, a ranking algorithm for search, recommendations, plots for the marketing department, etc.
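For instance, a regular expression over the combined log format above could pull out those pairs. This snippet is purely illustrative and was not part of the original seminar material:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // host ident authuser [date] "request" status bytes "referer" "user-agent"
    private static final Pattern COMBINED_LOG = Pattern.compile(
        "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /index.html HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" \"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
        Matcher m = COMBINED_LOG.matcher(line);
        if (m.matches()) {
            String user = m.group(3);                  // who clicked: "frank"
            String page = m.group(5).split(" ")[1];    // what was clicked: "/index.html"
            String when = m.group(4);                  // when it was clicked
            String referrer = m.group(8);              // which page the user came from
            System.out.println(user + "\t" + page + "\t" + when + "\t" + referrer);
        }
    }
}
```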
I go on with the story: one possible way of doing these aggregations and computations would be to store the data in an RDBMS and run some smart SQL queries. Our problem, though, is that we're a startup and we can't afford an expensive and performant data warehouse; we can only use an open-source solution such as MySQL or Postgres, and they are pretty poor at handling our high write throughput.

So we give up on relational databases and decide to go for the practical solution of running our bash scripts directly on the web server logs on a separate machine. We have simple pipelines of commands that extract the tuples and run simple aggregations on them with perl. We extract the tuples to temporary files and pipeline sort-, cut- and wc-based scripts to extract numbers and do our calculations. The audience can easily imagine how to pipeline unix commands to extract, group and count tuples; they already do it on their laptops and their small datasets on a daily basis, mostly with perl and python. The story continues: everything runs quite smoothly until we realize that we're starting to fill our disks and can't run these computations on that separate machine anymore.
We decide to buy a second machine and we spread the log files between the two machines. We set up a cron job to scp the log files to the two machines on a daily basis and call our scripts, which we have uploaded on both of them, through ssh. The additional problem is that now the two external machines each have a partial view of the problem, so we decide to add a script that partitions the output of our filters into two different files through a simple modulo operation on the tuples (a tiny sketch of this partitioning follows). We assign each file to one of the machines and we let the two machines exchange these files accordingly through scp. Everything is good again and we use this technique effectively for a few more months, adding more machines. After all, we just need to update the list of machines on the web servers and in our partitioning script with the modulo operation.
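As a toy illustration of that partitioning rule (hypothetical code, the point is just the modulo on the key), which is essentially what Hadoop's HashPartitioner will do for us later in the story:

```java
/** Toy sketch: pick the machine responsible for a given key (a user, a page, ...). */
public class ModuloPartitioner {
    private final String[] machines;

    public ModuloPartitioner(String... machines) {
        this.machines = machines;
    }

    /** Every producer applies the same rule, so all tuples for a key end up on the same machine. */
    public String machineFor(String key) {
        return machines[(key.hashCode() & Integer.MAX_VALUE) % machines.length];
    }

    public static void main(String[] args) {
        ModuloPartitioner partitioner = new ModuloPartitioner("machine-a", "machine-b");
        System.out.println("frank       -> " + partitioner.machineFor("frank"));
        System.out.println("/index.html -> " + partitioner.machineFor("/index.html"));
    }
}
```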

One night the pager beeps and there's bad news. One of the machines has failed and we have lost our data. We have to restore the backups from the tapes, and it takes a lot of time and manual labour. We decide we can do something smarter. We buy more machines, one for each we already have (thereby doubling our storage), and we use rsync to keep copies of the same log files on them. We then install a simple script that pings the machine assigned to it and, in case of a ping timeout, takes over the other machine's IP address. With this smart trick we should be able to avoid downtime in our script pipeline without major hassle for the admins.
At this point the story goes on with us hiring more people to analyze the data, making it unfeasible to maintain all these special scripts and cron jobs. To solve the problem we come up with a high-level API to use in our scripts running on the servers. We notice that all our computations are more or less a sequence of these operations: (1) collect and iterate over many records, (2) filter and extract something from each, (3) shuffle & sort the data into groups, (4) run aggregations on the groups, (5) produce the final output set. We decide to let our system manage 1, 3 and 5, while the user defines 2 and 4 through the API (a minimal sketch in Hadoop terms follows).
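In Hadoop terms, steps 2 and 4 become the map and reduce functions. Here is a minimal sketch, written against the newer org.apache.hadoop.mapreduce API, that counts how many times each page was clicked; the class names and the crude request parsing are made up for illustration, and steps 1, 3 and 5 are left entirely to the framework:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageClickCount {

  /** Step 2: filter and extract something from each record (here: the requested page). */
  public static class ClickMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text logLine, Context context)
        throws IOException, InterruptedException {
      String[] fields = logLine.toString().split("\"");
      if (fields.length > 1) {
        String[] request = fields[1].split(" ");   // e.g. GET /index.html HTTP/1.0
        if (request.length > 1) {
          context.write(new Text(request[1]), ONE);
        }
      }
    }
  }

  /** Step 4: aggregate each group (all the counts for one page). */
  public static class ClickReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text page, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable count : counts) total += count.get();
      context.write(page, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "page click count");
    job.setJarByClass(PageClickCount.class);
    job.setMapperClass(ClickMapper.class);
    job.setReducerClass(ClickReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```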

The story is over, and I finally tell them that this could be the story of how Google itself came up with MapReduce. At this point it becomes quite easy to explain the MapReduce API, the underlying paradigm and typical concepts such as "moving computation to data", "commodity-hardware clusters", etc.
I must admit the seminar was a success. After two hours of seminar, one by me with this talk and one by Elia explaining the Hadoop Streaming API with the word-count example in Python (again with the typical analogy to the command-line grep | sort | count pipeline), they were asked to implement cosine similarity between co-occurrence vectors extracted from text (they were more than familiar with NLP). All of the 27 attendees finished the task practically autonomously half an hour before the deadline (so in about one hour), many of them willing to dig deeper into optimizations, combiners, etc. To be honest, I was expecting half of them not even to try.

I hope it helps, let me know what you think and, please, don't try it on your kids…

Giraph talk for GraphDevRoom @ FOSDEM 2012

February 6, 2012

This weekend I attended the GraphDevRoom at FOSDEM 2012, where the community met to discuss the current trends on the topic of graph processing. The talks spanned from query languages to ranking algorithms on graphs; some presented research results while others took a more business-oriented path.

The event was sponsored by Gephi, NuvolaBase and Belectric, and the first two had specific talks as well. I was particularly impressed by the product presented by the NuvolaBase guys, the same team behind OrientDB, who provide instances of their DB in the cloud following a "database as a service" model.

One interesting talk was given by Rene Pickhardt on Graphity, a graph ranking algorithm that allows for the retrieval of the top-k items associated with a vertex in a network (think of an operation such as "give me the last 5 tweets tweeted by Justin Bieber"). The presented results were promising thanks to a very smart approach.

Neo4j presented their graph query language Cypher, which has a very interesting syntax with concepts borrowed from SPARQL, SQL, Gremlin and regular expressions. I personally find it much cleaner than Gremlin.

There was also some space for a couple of talks about graph visualization, respectively with Processing and Gephi.

I gave a talk about Giraph and you can find the slides embedded in this post. Good news here: people were eager to get to know more about the project and we're gaining quite some authority and trust.

The closing talk was mostly an opportunity to discuss the need for a general benchmarking approach for graph databases. The discussion drove towards a set of universal, primitive operations to be tested along with a group of established algorithms. My suggestion was to focus more on an analysis of the queries actually being run on the databases than on a top-down, a-priori design.

Throughout the day a few aspects came up repeatedly, one of them being the need for distribution strategies for graph databases. None of the vendors, except for InfiniteGraph, which isn't sharing its technology, supports a general way of distributing a graph across multiple machines (something that elsewhere would be called sharding). The current solution provided by the vendors is basically to pass the ball to the users, letting them define their own domain-specific way to project the graph onto the different nodes of the cluster.

It was also quite clear that the community hasn't converged on the topic of transactions. Some think transactions should be ACID, some think transactions should be more relaxed, and others (me) think transactions should be avoided altogether in favor of a more fine-grained set of atomic operations (a little bit like the path taken by some NoSQL databases such as HBase).

Something that struck me is the quite widespread disappointment with the drawbacks of the Tinkerpop/Blueprints stack. It's generally accepted that a common graph API is necessary, but there's also general agreement that the performance issues due to Blueprints outweigh the advantages. Talking about abstraction, a nice discussion followed on the effectiveness of the different graph data models. Needless to say, an agreement was not even close.

To wrap up, it was a very pleasant and well organized event, showing a very tight and enthusiastic community around these technologies.

Adaptation in Embodied & Situated Agents

October 11, 2011

For my master thesis I did an internship at the Institute of Cognitive Sciences and Technologies at the National Research Council in Rome, under the supervision of Stefano Nolfi.

A few days ago I had to give a presentation about my master thesis work, so I decided to write a blog post about it, hoping it can get somebody interested in the topic and spread the word about the interesting work being done there.

Evolutionary robotics is a branch of robotics that uses evolutionary computation to develop controllers for autonomous robots. A candidate controller is a set of parameters influencing the behavior of the autonomous robot. Genetic algorithms operate on these candidates through operators such as crossover, mutation and selection, according to a fitness function. The parameters often describe a neural network (e.g. its weights or topology), avoiding any kind of imperative top-down programming.

Evolutionary robotics is often used for learning and adaptation in embodied and situated agents. Embodied and situated means that the behavior of the robot is the result of a continuous interaction between the controller, the robot's body and the environment in which it is located. The approach differs from traditional robotics in that the control system is not designed a priori and then uploaded into a robot which is later introduced into the environment. On the contrary, the candidates always learn and are evaluated inside the robot in the final environment, with real sensory information. The continuous interaction between the robot and the environment, with its noisy, incomplete sensory experience, pushes towards an emergent and robust adaptation of the control system to both the features of the robot (embodiment) and of the environment the robot is placed in (situatedness).

In fact, the fitness function does not describe the behavior that the designer desires; it only measures the goodness of the behavior being observed. This allows the system to autonomously develop an emergent behavior through (global) trial and error, by exploiting the characteristics of the world.

My contributions were:

(1) the development of an individual learning algorithm based on Simulated Annealing.

Individual learning can be implemented with SA through a sequence of mutations from an initial state (the candidate's random initial parameters). After each mutation, the new parameters are accepted if they improve the fitness. They can also be accepted with a probability p (also known as the temperature) even if they degrade the fitness. The probability p decreases as the number of learning cycles increases. The role of p is to avoid premature convergence to a local maximum.

In ER, each behavior (candidate) has to be evaluated multiple times with different initial conditions (i.e. by putting the robot in a different place inside the arena). The more times the robot is tested, the more robust the fitness evaluation will be (reducing the odds of a lucky test), but this comes at a computational cost. By decreasing the number of evaluations, the probability that an under-fitting candidate looks fitting increases, raising the probability of accepting a degrading mutation. Since this role, in SA, is already played by the temperature p, we can save computation by avoiding evaluations that give us an accurate fitness value we might then ignore with probability p anyway. In fact, we can obtain the same temperature-like behavior by evaluating the candidate fewer times at the beginning of the learning phase (thereby increasing the probability that an under-fitting candidate looks fitting) and more times at the end, when we want to accept mutations only on the basis of robust evaluations (using the number of evaluations you would have used throughout the whole process with the baseline algorithm). This leads to the same result as baseline SA but saves about 50% of the evaluations, and therefore of the computation. A minimal sketch of the idea follows.
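Here is a minimal, self-contained sketch of the idea, not the actual thesis code: the toy objective, the mutation operator and the evaluation budget are all hypothetical placeholders, kept only to show how the growing number of trials takes over the exploratory role of the temperature.

```java
import java.util.Random;

/** Toy sketch: SA-style individual learning where a growing evaluation budget
 *  plays the exploratory role of the temperature. */
public class AnnealedLearner {
    private static final Random RND = new Random(42);

    /** Hypothetical single trial: a noisy measure of how good the parameters are (higher is better). */
    static double evaluateOnce(double[] params) {
        double fitness = 0;
        for (double p : params) fitness -= p * p;   // toy objective: maximize -||params||^2
        return fitness + RND.nextGaussian() * 0.1;  // noise stands in for one robot trial
    }

    /** Average over `trials` evaluations, i.e. `trials` different initial conditions in the arena. */
    static double evaluate(double[] params, int trials) {
        double total = 0;
        for (int i = 0; i < trials; i++) total += evaluateOnce(params);
        return total / trials;
    }

    static double[] mutate(double[] params) {
        double[] child = params.clone();
        child[RND.nextInt(child.length)] += RND.nextGaussian() * 0.05;
        return child;
    }

    public static void main(String[] args) {
        int cycles = 1000;
        double[] current = new double[10];
        for (int i = 0; i < current.length; i++) current[i] = RND.nextGaussian();

        for (int t = 0; t < cycles; t++) {
            // Few trials early on: evaluations are noisy, so degrading mutations slip through,
            // just like a high temperature. More trials later: only robustly better mutations pass.
            int trials = 1 + (int) Math.round(9.0 * t / cycles);

            double[] candidate = mutate(current);
            if (evaluate(candidate, trials) >= evaluate(current, trials)) {
                current = candidate;
            }
        }
        System.out.println("final fitness ~ " + evaluate(current, 20));
    }
}
```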

(2) the development of an algorithm for social learning and cultural evolution.

Social learning refers to a process in which agents learn new skills by interacting with other agents. Social learning should allow the learning individuals to acquire a strategy or behavior in a shorter time than would be necessary with individual learning (avoiding reinventing the wheel every time). Social learning can also be defined by its uncertainty-reduction function: it allows agents to gather information about the environment without taking the risks and costs of trial & error, raising their chances of survival. Also, social learning would be an important actor in what is called "cumulative cultural evolution", a process in which strategies are transmitted generation after generation, accumulating variations and improvements, leading to strategies that would be too difficult to learn individually.

With the rising complexity of robot systems, it becomes more difficult to look for a suitable solution, through trial & error, inside a solution space that is constantly growing in size. Therefore, trying to acquire a behavior starting from an already adaptive one might allow us to drastically reduce the search area inside the solution space, as social learning would guide the search towards its promising regions.

Consider a control system defined by a set of free parameters. In computational terms, acquiring an adaptive behavior exhibited by another control system (the model) through social learning means searching for a set of free parameters that allows the learning system to exhibit a strategy similar to the expert's. Learning the strategy starting from an existing one, instead of searching through trial & error, would expedite the learning process, considering that the information provided by the expert control system is richer than the information provided by the environment and the fitness function. It would lead the search for free parameters into promising areas of the search space.

Different forms of social learning exist in nature, and many of them aren't true imitation: (1) social facilitation, something like "eat only if there's a conspecific around you, otherwise watch out for predators"; (2) contagious behavior, like "flee if your conspecifics flee"; (3) stimulus enhancement, i.e. "follow an adult and learn as much as possible during that time by interacting with the environment around you". We wanted to model social learning through a simple model of imitation or social influence, avoiding top-down approaches that require the student robot to understand the inner goal of the expert agent, match a particular behavior and translate it into its own perspective. So we decided to model imitation as the individual learning of the behavior of the expert agent, in a way similar to stimulus enhancement.

Instead of measuring the goodness of the strategy, the new fitness function would measure the average difference between the actuator commands of the student and those of the expert, given the same sensory input. It would be like putting the student on the expert's shoulders and measuring how different their decisions would be in each situation. Therefore the social learning algorithm would put pressure on the student to act like the expert, but be aware that no pressure was put on having similar parameters!

What we realized is that learning through imitation alone would produce an under-fitting agent. The student would basically produce a prototypical behavior, something that could roughly solve the problem but would get lost in the details. For this reason we introduced a hybrid form of social learning: the new fitness would be a weighted sum of the ability of the student to imitate the expert agent and of the ability of the student to solve the given problem. The weights of this hybrid fitness would change during the learning process, giving more weight to the first component at the beginning of the process and more weight to the second one towards the end (a small sketch of this schedule follows). So the student would acquire a prototypical strategy from the expert agent at the beginning, and optimize it through "its own experience" towards the end. This gave us quite good results: the student could acquire an adaptive behavior in 66% of the learning cycles needed by the individual learner, matching our initial hypothesis.
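As a sketch of the scheduled hybrid fitness described above (again not the thesis code: the two component functions are hypothetical placeholders, and a simple linear schedule is assumed):

```java
/** Toy sketch of the scheduled hybrid fitness: imitation-driven early, task-driven late. */
public class HybridFitness {

    /** Hypothetical: how closely the student's actuator commands match the expert's (higher is better). */
    static double imitationFitness(double[] studentParams) { return -Math.abs(studentParams[0] - 1.0); }

    /** Hypothetical: how well the student solves the actual task (higher is better). */
    static double taskFitness(double[] studentParams) { return -Math.abs(studentParams[0] - 1.2); }

    /**
     * progress runs from 0.0 (start of learning) to 1.0 (end of learning):
     * early on the imitation component dominates, later the task component takes over.
     */
    static double hybridFitness(double[] studentParams, double progress) {
        double wImitation = 1.0 - progress;
        double wTask = progress;
        return wImitation * imitationFitness(studentParams) + wTask * taskFitness(studentParams);
    }

    public static void main(String[] args) {
        double[] params = {0.9};
        for (double progress = 0.0; progress <= 1.0; progress += 0.25) {
            System.out.printf("progress=%.2f hybrid=%.3f%n", progress, hybridFitness(params, progress));
        }
    }
}
```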

As a possible interpretation of how social learning works in our models, consider that two similar behaviors can be implemented by two very different sets of free parameters. For this reason, acquiring a behavior similar to the expert's through social learning could push the student to find a solution that is distant, in parameter space, from the expert's. As the two solutions could be far from each other in parameter space, the shape of the student solution's neighborhood would also be different (i.e. without the areas of local minima that might characterize the region around the expert solution). Therefore the social learning process can be seen as a technique for selecting the initial conditions of a parameter optimization problem, or as a mechanism that allows the search to jump out of local minima.

Where to go from here?

Check Stefano's homepage, check his book, and check Boyd and Richerson's book on social learning and cultural evolution.

Google Pregel: the Rise of the Clones

September 13, 2011

In the past, I’ve written about Google Pregel. At the time, quite obviously, there was no implementation of anything like Pregel out there, let alone an open-source one.

Now things have changed, so I’d like to give a quick list of the projects out there that might help you get started with this technology, as I see that people very often ask what the difference is between all of them. I have direct experience only with the Java implementations, so I can talk about them a bit more extensively.

As you may remember from my last post, Pregel is a framework for large-scale graph processing that builds on top of the BSP computational model. It allows the developer to write a vertex-centric algorithm for graph processing (meaning you write a function that receives messages from other vertices and sends messages to other vertices; a small sketch follows) and forget about things such as distribution and fault tolerance. As I mentioned in that post, BSP isn’t directly connected to graph processing, but it maps quite smoothly onto it.
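To make the vertex-centric model concrete, here is a minimal single-source shortest paths compute function in the spirit of the example in the Pregel paper. It is written against Giraph's later BasicComputation API, which did not exist in this form when this post was written, so treat it as an illustrative sketch rather than code for any specific project below; the source vertex id is an arbitrary choice.

```java
import java.io.IOException;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPathsComputation extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1;  // hypothetical source vertex id

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Shortest distance proposed by the incoming messages (zero for the source itself).
    double minDist = vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // If we found a shorter path, record it and propagate it to the neighbours.
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    vertex.voteToHalt();  // woken up again only if a new message arrives
  }
}
```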

The first project to mention is Apache Hama. “Apache Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms. It was inspired by Google's Pregel, but different in the sense that it's purely BSP and common model, not just for graph.”. What this means for you, a developer willing to work on graph processing, is that you have to build the Pregel API/layer on top of it. What Apache Hama provides are the BSP primitives, so messaging between tasks and a synchronization barrier. Beware, it’s not just a matter of API: there’s a technological effort to implement all the graphy behavior that happens under the hood of Pregel. To get an idea of what that means, go and check out the example implementations of the PageRank and SingleSourceShortestPath code, and see how they differ from the sample code in the Google Pregel paper. Also, last time I checked, the framework didn’t support much of the fault-tolerance, the Combiners and Aggregators, and some other features of Pregel.

More in this direction, you can try GoldenOrb. “GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture.”. It tries to stick as closely as possible to the original Pregel API and it builds on top of Hadoop. GoldenOrb differs from Apache Hama in the first aspect but shares the second: they both build on top of Hadoop, meaning they re-use parts of it, such as the RPC for messaging, ZooKeeper for synchronization, and HDFS for IO. Although they build on top of Hadoop, they still require you to install and add some of their code to your infrastructure (so to say, GoldenOrb and Hama have their own JobTracker and TaskTracker). This can be a drawback, considering that you might not have enough control over your Hadoop cluster to install new software, or that, in general, the code might suffer from the usual problems of young projects.

The third project, which attacks this last aspect in particular and was originally developed at Yahoo!, is called Giraph (great name, by the way). Giraph is “a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.”. What this means is that Giraph is a Hadoop map-only job, so you can run it on your existing Hadoop infrastructure, and it provides the API and the middleware of Pregel. It also implements a BSP computation, so you’ll get the advantages that often make Pregel a better solution for iterative graph algorithms than MapReduce.

So, to summarize, what Hama, GoldenOrb and Giraph have in common: the Java platform, the Apache License (and incubation), and BSP computation. Where they differ: Hama offers BSP primitives rather than a graph processing API (so it sits at a lower level); GoldenOrb provides Pregel’s API but requires the deployment of additional software on your existing Hadoop infrastructure; Giraph provides Pregel’s API (and is fairly complete in its current state) and doesn’t require additional infrastructure.

Moving away from Java for a moment, I’d just like to mention Phoebus, a Pregel implementation in Erlang. I haven’t tried it, so I cannot say much about it, but if you want to comment on it, please drop me a line in the comments.

So, here it is, fire up your Hadoop pseudo-cluster and get back to me if you have something to add.

EDIT: I forgot to mention two other possibilities. They are not really Pregel "clones", but they are certainly inspired by Pregel. The first is called Signal/Collect and is a framework developed particularly for the Semantic Web, although it can be used for other scenarios as well. It's written in Scala, with a Java API also available, and is released under the Apache License. Here's how the authors describe it: "Signal/Collect is a framework for synchronous and asynchronous parallel graph processing. It allows programmers to express many algorithms on graphs in a concise and elegant way.". It differs from Pregel in two main ways: (1) edges can have computations associated with them as well, so you basically write a compute() method for the vertices and another one for the edges; (2) the synchronization barrier constraints are relaxed, so it's also possible to implement async algorithms that look more dataflow-ish. At the time of writing the core system doesn't scale to multiple machines, but the next version is supposed to fill this gap. Check this post's comments for more info.

The second is called HipG, and it's a Java, GPL-licensed framework very similar to Pregel. Here's the description from the homepage: "HipG is a library for high-level parallel processing of large-scale graphs. HipG is implemented in Java and is designed for distributed-memory machine. Besides basic distributed graph algorithms it handles divide-and-conquer graph algorithms and algorithms that execute on graphs created on-the-fly. It is designed for clusters of machines, but can also be tested on desktops - all you need is a recent Java runtime environment.". The main contribution of this work to the Pregel model is again related to the way the synchronization barrier is handled: there is no longer a single global barrier; the algorithm can specify different barriers on which subsets of the vertices synchronize. The idea is to allow a more fine-grained synchronization mechanism, avoiding many vertices idling while they wait for unrelated vertices to finish their computation.