Apache Giraph 1.0 released

May 7, 2013

The Apache Giraph team is proud to announce our first release out of incubation. The release version is 1.0.0 to reflect a lot of hard work that went into making the code stable enough for production use, memory efficient, and performant.

Apache Giraph is an scalable and distributed iterative graph processing system that is inspired by BSP (bulk synchronous parallel) and Google's Pregel. Giraph distinguishes itself from those projects by being open-source, running on Hadoop infrastructure, and going beyond the Pregel model with features such as master computation, sharded aggregators, out-of-core support, no single point of failure design, and more.

Here are some highlights of the release:

  • Scales out to hundreds of machines easily and hundreds of billions of edges (memory permitting)
  • Efficient use of memory via fast byte-based serialization by default and can use primitive specific types when better performance is required
  • Multithreaded input and computation can take advantage of multicore machines efficiently
  • Simplified vertex API
  • Vertex-based and/or edge-based input supported
  • Master compute API for handling application-wide logic
  • Sharded aggregators for handling large (memory) aggregators
  • Easy access to/from Hive tables to integrate with your data warehouse
  • Out-of-core graph and messaging support
  • YARN support

For release details and download access, please visit: http://giraph.apache.org/releases.html

Thanks so much to everyone for all their contributions. It is you who made this release possible! We've also been investing in updating our website as part of this release (http://giraph.apache.org), more documentation/updates will be coming in the near future. We expect that releases will happen more frequently in the future now that we are more familiar with the process.

Regards,

The Apache Giraph team

CFP: Graph Processing Devroom at FOSDEM 2013

December 23, 2012

We are organizing the Graph Processing Devroom at FOSDEM 2013. The Free and Open source Software Developers’ European Meeting (FOSDEM) is a two-day event organized by volunteers to promote the widespread use of Free and Open Source software. The Graph Processing DevRoom is now a two years old effort to have a one day workshop where people can discuss on the topic of Graph Processing. If you have a cool project, or you think your use case is interesting for others, feel free to submit your proposal.

Important dates (all GMT+1)

Submission deadline: 28-12-2012
Notification of accepted speakers: 01-01-2013
Publication of the final schedule: 4-01-2013
Meetup: 2 of February 2013, Brussels, Belgium. Co-located with in FOSDEM 2013.

Important links

Submission link: https://graphdevroom.busyconf.com/proposals/new
Devroom site: http://fosdem.graph-database.org/
Last years' presentations: http://www.youtube.com/graphprocessing
FOSDEM site: https://fosdem.org/2013/

In case of problems, please contact the room team at: graph-devroom@graph-database.org

To be keep on the loop:

Follow @GraphDevRoom on twitter to be updated with the latest news of the event.

Organizations are silos of information

February 19, 2012

While at TIS, I was running a project related to Enterprise 2.0. As part of the process, we wrote a questionnaire to understand the paths through which information flows internally between teams and groups. At first, we asked all the employees things such as "if a customer asks you who's' responsible, at your company, for a particular project or technology, how do you find out?" or "if you need a particular know-how internally, how do you search for it?". What we found out from these and other similar questions, is that when employees need to find a solution to this kind of problems, they mostly go to their boss. This upward delegation works on the assumption that the boss knows everything and this knowledge, if possible, would be acquired during meetings with the leaders of the other teams. This upward delegation makes the boss a bottleneck and the teams silos of information. We tried to look for a scenario that would describe easily what was going on and what was causing this situation. Here it goes.

When you're a startup, in the early days, you're all in the same room. Information spreads quickly and easily inside that room. When things go in the right way you grow, the team is split in two. Now the members of the teams have meetings with their leaders where they report the advances of their projects, the development of their know-how etc. The team leaders later meet and exchange this information. The team leaders can now report information about team A to team B and vice versa. This mechanism solves for the physical constraints of logistics, for the fact that teams are in different rooms, basically. Sure, people still meet in the corridors or at the coffee machines, but as the company grows this gets more and more difficult and soon you don't know what the members of the other teams are doing (and the bosses can't help there). So you do more meetings, with the hope that more information can be exchanged, but we all know how people react to frequent and lengthy meetings. They are expensive: often they turn into a set of direct conversations between each member and the team leader. Which means that the people waiting for their turn just idle while they could be doing something productive. Also, most probably they just don't listen to these direct conversations, maybe regretting it later (but can they be blamed?), when they'll be working on a different project that might depend on one of those conversations. To sum up, the content of meetings is unsearchable and not persistent, which makes it extremely costly.

Now, this is a simple example, but just imagine what this means if you take a big company and replicate this phenomena for many levels from the employees at the bottom of the pyramid to their boss, the boss of their boss all the way up to the president. Enterprise 2.0 advocates for an horizontal organization, at least for information management, where employees are provided with a platform, something similar to the well known Web 2.0 tools (HA! guess where the 2.0 in Enterprise 2.0 comes from…) where they can post information about their projects, about what they do and like, they can answer and pose questions etc, upload documents, share links etc. All this information would be easily searchable and what not, you can imagine easily what you could do, just take your favorite social network and imagine it being a tool used inside of your company with a specific purpose. Of course it's not all roses and chocolates, so I'll get back to our previous questionnaire example. In the second part we asked questions such as "when you want to buy a new digital camera, would you ask for advices on Facebook?", "would you use a wiki or google docs to help organizing your holidays with friends living elsewhere?".
I guess you know where I'm going: what emerged is that they successfully acquire information, for which they don't have the right know-how, from their peers not to mention they are already efficiently using search engines to locate information on the web. So, if they already happily use these tools in their free time, why shouldn't these tools and paradigms being used inside of the organizations to improve information flows and management? These are the basic principles of Enterprise 2.0, as means of breaking those physical walls dividing the rooms, bringing the information flow back to the startup condition. More clearly, it's a way of breaking the bottlenecks, the silos effects due to upward delegation and dependency.

Explain me Hadoop like I was a... linux power user

February 8, 2012

In november I gave with Elia Bruni a hands-on introduction to Hadoop for the CIMEC researchers at University of Trento. They are mostly NLP-ists or in general cognitive science-related researchers applying machine learning techniques to validate their models on real data. I personally think that explaining MapReduce to this kind of audience just through its API and its distributed architecture would be confusing and probably even useless. It's true that MapReduce provides a simple API that hides the complexity of fault-tolerance, distributed IPC and parallelism, but there are so many ways to screw up implementing a suboptimal algorithm that I felt I needed a different approach.
Most of them are linux power users, they work with terminal-level commands effectively, they run their tasks on a Sun/Oracle Grid Engine managed cluster and mostly, due to the nature of their work, they are used to work with algorithms following a scan-based pattern on files or pair-wise computations such as matrix operations. The experience of working with tools as scp, rsync, ssh, grep, cat etc. was there, along with an understanding of how to implement an algorithm based on scanning a file line by line.
For this reason I decided to present a story that they could relate to, presenting challenges for which they could come up with practical "linux power-user level" solutions. After this setup was complete, I would show that their solutions were analogous to what Hadoop does under the hood automatically. I thought this would give me two advantages: they would understand what MapReduce was designed for (and therefore what it's good at, the most important thing), but also they would feel comfortable using it with the right paradigm without being confused about what the "big distributed and complicated monster" was doing under their feet. I'm going to share this story/analogy here, hoping it can be useful for somebody else. Be ware, I don't think it's anything magical or new, I've seen parts of this in different presentations (which I guess inspired me), but I believe the analogy was never followed directly and explicitly.

The initial scenario: it's 2002 and we're a startup running an e-commerce shop online, business is doing so well that we're starting competing with the big guys. We soon realize that the clicks and searches our customers do on our site are more than just pages to be served from our database but are actually data worth keeping and analyzing. After all, through their clicks and searches our customers are more or less explicitly telling us what they like and want. I show them a log coming from an apache web server: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/ 1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)" and we consider that by a simple regexp we can extract pairs as "who clicked what", "when a certain page was clicked", "from which page the user came from" etc. This is their bread and butter so it's obvious for the whole audience how these sets of pairs can be fed into typical machine learning algorithms to extract similarity between products, a ranking algorithm for search, recommendations, plots for the marketing department etc.
I go on with the story: one possible way of doing this aggregations and computations would be to store the data in an RDBMS and run some smart SQL queries. Our problem though is that we're a startup and we can't afford an expensive and performant data warehouse, we can only use an open source solution as MySQL or Postgres and they are pretty poor at managing our high write throughput.

So we give up relational databases and decide to go for a practical solution of running our bash scripts directly on the web server logs on a separate machine. We have simple pipelines of commands that extract the tuples and run simple aggregations on them with perl. We extract the tuples to temporary files and pipeline sort, cut, wc based scripts to extract numbers and do our calculations. The audience can easily imagine how to pipeline unix commands to extract, group and count tuples, they already do it on their laptops and their small datasets on a daily basis, mostly with perl and python. The story continues: everything runs quite smoothly until we realize that we're starting to fill our disks and we can't run this computations on that separate machine anymore.
We decide to buy a second machine and we spread the log files among the two machines. We setup a cron job to scp the log files to two machines on a daily basis and call our scripts, which we have uploaded on both of them, through ssh. The additional problem is that now the two external machines have a partial view of the problem, so we decide to add a script that partitions the output of our filters in two different files through a simple modulo operation on the tuples. We assign each file to a one of the machines and wet let the two separate machines exchange these files accordingly through scp. Everything is good again and we use this technique effectively for a few more months adding more machines. After all we just need to update the list of machines on the web servers and on our partitioning script with the modulo operation.

One night the pager beeps and there's bad news. One of the machines has failed and we lost our data. We have to take the backups from the tapes and it takes a lot of time and manual labour. We decide we can make something smarter. We buy more machines, one for each we already have (therefore doubling our storage capacity) and we use rsync to keep copies of the same log files on them. We then install a simple script that pings the machine assigned to them and, in case of ping timeout, it takes over the other machine's IP address. With this smart trick we should be able to avoid downtime in our scripts pipeline without major hassle by the admins.
At this point the story goes on with us hiring more people to analyze the data making it unfeasible to have all these special scripts and cron jobs. To solve the problem we come up with a high level API to use in our scripts running on the servers. We notice that all our computations are more or less a sequence of these operations: (1) collect and iterate over many records (2) filter and extract something from each (3) shuffle & sort the data into groups (4) run aggregations on the groups (5) produce the final output set. We decide to let our system manage 1, 3 and 5 while the user can define 2, 4 through the API.

The story is over and I finally tell them this can be the story of how Google itself came up with MapReduce and it comes out now quite easy to explain the API of MapReduce, the underlying paradigm and typical concepts as "moving computation to data", "commodity hardware-based clusters" etc.
I must admit the seminar was a success. After two hours of seminar, one by me with this talk and one by Elia explaining the Hadoop Streaming API with the word-count example in python (also there with the typical analogy with the command line grep | sort | count) they were let to implement cosine-similarity between co-occurrence vectors extracted from text (they were more than familiar with NLP). All of the 27 attendants finished the task practically autonomously half an hour prior to the deadline (so in about 1h), many of them willing to dig deeper into optimizations, combiners etc. To be honest, I was expecting half of them not even trying.

I hope it helps, let me know what you think and, please, don't try it on your kids…

Giraph talk for GraphDevRoom @ FOSDEM 2012

February 6, 2012

This weekend I attended the GraphDevRoom at FOSDEM 2012 where the community met to discuss the current trends on the topic of graph processing. The talks spawned from query languages to ranking algorithms on graphs, some presented research results while others took a more business oriented path.

The event was sponsored by Gephi, NuvolaBase and Belectric and the first two had specific talks as well. I was particularly impressed by the product presented by the NuvolaBase guys, the same team behind OrientDB, who provides instances of their DB following a "database as a service" architecture in the cloud.

One interesting talk was given by Rene Pickhardt on the graph ranking algorithm Graphity that allows for the retrieval of top-k items associated to a vertex in a network (think of an operation such as "give me the last 5 tweets tweeted by "Justin Bieber"). The presented results were promising thanks to a very smart approach.

Neo4j presented their graph query language Cypher, with a very interesting syntax with concepts borrowed from SPARQL, SQL, gremlin and regexp. I personally find it much cleaner than gremlin.

There was also some space for a couple of talks about graph visualization, respectively with Processing and Gephi.

I gave a talk about Giraph and you can find the slides embedded in this post. Good news here: people were eager to get to know more about the project and we're gaining quite some authority and trust.

The closing talk was mostly an opportunity to discuss the necessity for a general benchmarking approach to graph databases. The discussion drove towards a set of universal and primitive operations to be tested a long with a group of established algorithms. My suggestion was to focus more on an analysis of the queries being run on the databases more than following a top-down a-priori design.

Throughout the whole day a few aspects came out repeatedly with one of them being the necessity for distribution strategies for graph databases. None of the vendors, except for InfiniteGraph which isn't sharing their technology, are supporting a general way of distributing a graph on multiple machines (something that elsewhere would be called sharding). The current solution provided by the vendors is basically to pass the ball to the users, allowing them to define their own domain-specific way to project the graph on the different nodes belonging to the cluster.

Also, it was also quite clear that the community hasn't converged on the topic of transactions. Some think transactions should be ACID, some think transactions should be more relaxed, others (me) think transactions should be avoided at all in favor of a more fine-grained set of atomic operations (a little bit like the path taken by some NoSQL databases like HBase).

Something I was impressed by is the quite diffuse disappointment about the drawbacks of the Tinkerpop/Blueprints stack. It's generally accepted that a common graph API is necessary but there's also a general agreement that the performance issues due to Blueprints are bigger than the advantages. Talking about abstraction, a nice discussion followed on the topic of the effectiveness of the different graph data models. Needless to say an agreement was not even close.

To wrap up, it was a very pleasant and well organized event, showing a very tight and enthusiastic community around these technologies.

Adaptation in Embodied & Situated Agents

October 11, 2011

For my master thesis I did an internship at the Institute of Cognitive Sciences and Technologies at the National Research Council in Rome, under the supervision of Stefano Nolfi.

A few days ago I had to give a presentation about my master thesis work, so I decided to write a blog post about it, hoping it can get somebody interested in the topic and to spread the word about the interesting work being done there.

Evolutionary robotics is a branch of Robotics that uses evolutionary computation to develop controllers for autonomous robots. A candidate controller is a set of parameters influencing the behavior of the autonomous robot. Genetic algorithms operate on this candidates through operators such as crossover, mutation and selection according to a fitness function. The parameters often describe a neural network (i.e. the weights or network topology) and avoid any kind of imperative top-down programming.

Evolutionary robotics is often used for learning and adaptation in embodied and situated agents. Embodied and situated means that the behavior of the robot is a result of a continuous interaction between the controller, the robot's body and the environment in which it's located. The approach differs from traditional robotics in the way that the control system is not designed a priori and then uploaded into a robot which is later introduced into the environment. On the contrary, the candidates always learn and are evaluated inside the robot in the final environment, with the real sensory information. The continuous interaction between the robot and the environment, with its noisy incomplete sensorial experience, pushes an emergent and robust adaptation of the control system to both the features of the robot (embodiment) and of the environment the robot is put in (situatedness).

In fact, the fitness function does not describe the behavior that the designer desires but it only measures the goodness of the behavior being observed. This allows the system to develop autonomously an emergent behavior through a (global) trial-and-error by exploiting the characteristics of the world.

My contributions were:

(1) the development of an individual learning algorithm based on the Simulated Annealing.

Individual learning can be implemented by SA through a sequence of mutations from an initial state (the candidate random initial parameters). After each mutation, the new parameters are accepted if they bring an improvement to the fitness function. They can also be accepted by a probability p (also known as temperature) even if they degrade the fitness. The probability p decreases with the increasing of the number of learning cycles. The role of p is to avoid premature convergence to a local maximum.

In ER, each behavior (candidate) has to be evaluated multiple times with different initial conditions (i.e. by putting the robot in a different place inside of the arena). The more times the robot is tested, the more robust the fitness evaluation will be (it will be reduced the odds of a lucky test). This requires a computational effort. By decreasing the number of evaluations, the probability that an under-fitting candidate looks fitting is increased, rising the probability to accept a degrading mutation. As this role, in SA, is already played by the temperature p, we can save computation by avoiding useless evaluations that bring us to a correct fitness function that we might ignore with probability p. In fact, we can obtain the same temperature behavior by evaluating the candidate fewer times at the beginning of the learning phase (so increasing the probability p of having an under-fitting candidate look fitting) and more times in the end, when we want to accept mutations only based on robust evaluations (the number of evaluations you would have used for the whole process with the baseline algorithm). This brings to the same result as baseline SA but saves about 50% evaluations and therefore computations.

(2) the development of an algorithm for social learning and cultural evolution.

Social learning refers to a process in which agents learn new skills by interacting with other agents. Social learning should allow the learning individuals to acquire a strategy or behavior in a shorter time than would be necessary by individual learning (avoiding reinventing the wheel every time). Social learning can also be defined by its uncertainty-reduction function: it allows agents to gather information about the environment without taking the risks and costs of trial & error, rising its chances of survival. Also, social learning would be an important actor in what is called "cumulative cultural evolution", a process in which strategies are transmitted generation after generation, accumulating variations and improvements, leading to strategies that would be too difficult to learn individually.

With the rising complexity of robot systems, it is more difficult to look for a suitable solution, through trial & error, inside a solution space that is constantly increasing in size. Therefore trying to acquire a behavior starting from an adaptive one, might allow to drastically reduce the search area inside the solution space, as social learning would guide the search towards its promising areas.

Consider a control system defined by a set of free parameters. In computational terms, acquiring an adaptive behavior, exhibited by another control system (the model), through social learning means searching a set of free parameters that allow the learning system to exhibit a strategy that is similar to the experts'. Learning the strategy starting from an existing one, instead of searching through trial & error, would expedite the learning process, considering that the information provided by the expert control system is richer than the information provided by the environment and the fitness function. It would lead the search of free parameters into promising areas of the search space.

Different forms of social learning exist in nature and many of them aren't true imitation: (1) social facilitation, something like "eat only if there's a conspecifics around you, otherwise watch out from predators" (2) contagious behavior, like "flee away of your conspecifics flee away" (3) stimulus enhancement, i.e. "follow an adult and learn as much as possible during that time by interacting with the environment around you". We wanted to model social learning through a simple model of imitation or social influence, so avoiding top-down approaches that require the student robot to understand the inner goal of the expert agent, match a particular behavior and translate it into its own perspective. So we decided to model imitation as the individual learning of the behavior of the expert agent, in a way similar to stimulus enhancement.

Instead of measuring the goodness of the strategy, the new fitness function would measure the average difference between the actuator commands of the student compared to those of the expert, given the same sensory input. It would be like putting the student on the expert's shoulders and measuring how different their decisions would be in each condition. Therefore the social learning algorithm would put pressure on the student to act like the expert solution, but be ware that no pressure was put on having similar parameters!

What we realized is that just learning through imitation would produce an under-fitting agent. The student would basically produce a prototypical behavior, so something that could "quite resolve" the problem, but it would get lost in the details. For this reason we introduced a hybrid form of social learning: the new fitness would be a weighted sum of the ability of the student to imitate the expert agent and of the ability of the student to solve the given problem. The weights of this new hybrid fitness would change during the learning process and would give more weight to the first component at the beginning of the process and more weight to the second one towards the end. So the student would be able to acquire a prototypical strategy from the expert agent at the beginning, and optimize it through "its own experience" towards the end. This gave us quite good results: the student could acquire an adaptive behavior in 66% learning cycles compared to the individual learner, matching our initial hypothesis.

As a possible interpretation of how social learning works in our models, consider that two similar behaviors can be implemented by two sets of very different free parameters. For this reason, acquiring a similar behavior to the experts', through social learning, could push the student to find a solution that might be distant, in the parameters space, from the experts'. As the two solutions could be far from each other in the parameters space, also the morphology of the student solution neighborhood would be different (i.e. without areas of local minima that could characterize the area around the expert solution). Therefore the social learning process could be considered as a technique for the selection of the initial conditions in a parameter optimization problem or as a function that would allow to jump out of local minima.

Where do you go from now?

Check Stefano's homepage, check his book and check Boyd and Richerson book on social learning and cultural evolution.

Google Pregel: the Rise of the Clones

September 13, 2011

In the past, I’ve written about Google Pregel. At the time, as it was quite obvious, there was no implementation of anything like Pregel out there of any kind, not to mention Open Source.

Now things have changed, so I’d like to give a quick list of the projects out there that might help you getting started with this technology, as I see that very often people ask what the difference is between all of them. I have direct experience only with the Java implementations, so I can talk about them a bit more extensively.

As you remember from my last post, Pregel is a framework for large-scale graph processing that builds on top of the BSP computational model. It allows the developer to write a vertex-centric algorithm for graph processing (meaning you write a function that receives messages from vertices and sends messages to other vertices) and forget about things as distribution and fault-tolerance. As I mentioned in the post, BSP isn’t directly connected to graph processing, but it maps quite smoothly to it.

The first project to mention is Apache Hama. “Apache Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms. It was inspired by Google's Pregel, but different in the sense that it's purely BSP and common model, not just for graph.”. What this means for you, a developer willing to work on graph processing, is that you have to build the Pregel API/layer on-top-of it. What Apache Hama provides are the BSP primitives, so messaging between tasks and a synchronization barrier. Be ware, it’s not just a matter of API, there’s a technological effort to implement all the graphy behavior that happens under the hood of Pregel. To have an idea about what that means, go and chekout the example implementation of the PageRank and SingleSourceShortestPath code, and see how they differ from the sample code in the Google Pregel paper. Also, last time I checked, the framework wasn’t supporting much of the fault-tolerance, the Combiners and Aggregators and some other features of Pregel.

More in this direction you can try GoldenOrb. “GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture.”. It tries to stick as much as possible to the original Pregel API and it builds on top of Hadoop. GoldenOrb differs from Apache Hama for the first aspect but shares the second one: they both build on top of Hadoop, meaning they re-use parts of it, as the RPC for messaging, they use Zookeeper for synchronization, and HDFS for IO. Although they build on top of Hadoop, they still require you to install and add to your infrastructure some of their code (so to say, GoldenOrb and Hama have their own JobTracker and TaskTracker). This can be a drawback, considering you might not have control over your Hadoop cluster to install new software or in general the code might suffer from usual problems related to young projects.

The third project, which attacks this last aspect in particular and was developed originally by Yahoo!, is called Giraph (great name by the way). Giraph is “a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.”. What this means is that Giraph is a Hadoop Map-only job, so you can run it on your existing Hadoop infrastructure, but it provides the API and the middleware of Pregel. It also does implement a BSP computation, so you’ll have the advantages that make Pregel often a better solution for iterative graph algorithms compared to Mapreduce.

So, to summarize, what Hame, GoldenOrb and Giraph have in common is: Java platform, Apache License (and incubation), BSP computation. What they differ for: Hama offers BSP primitives not graph processing API (so it sits at a lower level), GoldenOrb provides Pregel’s API but requires the deployment of additional software to your existing Hadoop infrastructure, Giraph provides Pregel’s API (and is kind of complete at the current state) and doesn’t require additional infrastructure.

Closing with Java, I’d just like to mention Phoebus, a Pregel implementation in Erlang. I haven't tried it, so I cannot say much about it, but if you want to comment on that, please drop me a line in the comments.

So, here it is, fire up your Hadoop pseudo-cluster and get back to me if you have something to add.

EDIT: I forgot to mention other two possibilities. They are not really Pregel "clones", but are for sure inspired by Pregel. The first is called Signal/Collect and is a framework particularly developed for the Semantic Web, although it can be used for other scenarios as well. It's written in Scala, with Java API also available, and is released with Apache License. Here's how the authors describe it: "Signal/Collect is a framework for synchronous and asynchronous parallel graph processing. It allows programmers to express many algorithms on graphs in a concise and elegant way.". It differs from Pregel for two main reasons: (1) the edges can have computations associated to them as well, so you basically write a compute() method for the vertices and another one for the edges (2) the synchronization barrier constraints are relaxed, so it's possible also to implement async algorithms that look more dataflow-ish. At the time the core system doesn't scale to multiple machine, but the next version is supposed to be fill this gap. Check this post's comments for more info.

The second is called HipG and it's Java GPL-licensed framework very similar to Pregel. Here's the description from the homepage: "HipG is a library for high-level parallel processing of large-scale graphs. HipG is implemented in Java and is designed for distributed-memory machine. Besides basic distributed graph algorithms it handles divide-and-conquer graph algorithms and algorithms that execute on graphs created on-the-fly. It is designed for clusters of machines, but can also be tested on desktops - all you need is a recent Java runtime environment.". The main contribution of this work to the Pregel model is again related to the way the synchronization barrier is handled: there's no more a global barrier but the algorithm can specify different barriers to which subsets of the vertices can synchronize. The idea is to allow a more fine-grained synchronization mechanism, to avoid many vertices to idle waiting for unrelated vertices to finish their computation.

the Therminal

September 10, 2011

Almost a year ago we organized the "Long night of research" where we opened up our labs, along with the University and other research centers, to show to people what we do. We wanted to talk about innovation and our audience was made of ordinary people and kids, so we had to talk simple and sexy, they are not necessarily somebody who gets excited by the concept of distributability and parallel computing.

For this reason I decided to create a little side project for the event, something that could relate to the concept of innovation as the result of a process. I wanted to show how you can take something interesting but old, re-implement it with new technology and get something more out if it as a result. So I thought about making a theremin, the first electronic music instrument, with devices we have at home, like a laptop and a WII controller. Ok, I know, it's been done before, the Web is full of many different attempts to make a theremin practically out of anything (my favorite is this attempt to make it out of three AM radios). My point though, was not only to implement a theremin out of a laptop+WII but, as I was using newer technology compared to 1920, to take advantage of what the technology I was using allowed me to do.

What I particularly like about the theremin is its human-machine interface, this idea of controlling the music with two hands without actually touching anything. That was the concept I wanted to keep. But the theremin controllers, two antennas, work on just one dimension each, while we move our hands on a three dimensional space. That was something I wanted to work on. Also, the theremin has just one sound while, as I wanted to digitalize the process, I could generate any sound. So, in the end, the idea was: why not take the original theremin interaction, enhance it by adding a 2D/3D experience and use it to command any kind of instrument. That was how the Therminal project was born.

Capturing the hands position is straight-forward with the built-in camera of the WII controller. It works pretty smoothly, it can capture up to four Infrared points in space and report their cartesian position, plus there's lots of code out there that allows you to collect this data with any platform (I used python and the beautiful python-cwiid lib). Once you have the position, you have to convert it to sound. This is interesting because with just one point you can already have the theremin interaction. With the theremin you have two controls: the pitch and the volume. You can map the volume to the Y axis and the pitch to the X axis and you're set. You move this Infrared emitter in this two-dimensional space, with one hand instead of two, and you generate notes. We're innovating: we don't HAVE TO copy the theremin, just get inspiration.

To generate sounds I used a pretty naive solution: a linear function to map a set from a dimension to another. Basically the position of the point on the X axis could go from 0 to 128 and I had 12 notes for 7 octaves (84 notes in total). In practice is like having a 7-octaves virtual piano in front of you: wherever you put your hand it will be on a key. Then I did the same with the volume. I mapped the position on the Y axis between 0 and 128 onto a 0-100 dimension, so that by moving my hand up and down I could control the volume (think of it from 0 to 100%). I tried different functions and in the end I chose a non-linear one which allowed me to have a more fine-grained control on the high volume range. These are two good starting points to generate a note and in particular a MIDI note. You can go deeper and map the velocity of the point to simulate the pressure on the key, for the Aftertouch. Once you have a MIDI note, you can map it to an instrument and you're done. Check out the video if my words aren't clear, it should explain everything.

So, to summarize, what the therminal code does is: poll the WII camera, collect each point position, map the point coordinates to MIDI notes, send the MIDI notes to a MIDI synthesizer (jack/alsa does this) and the software converts this note into sounds according to the instrument you've chosen. It's pretty easy and simple. You can play with it and extend the interaction. Another nice option, as I mentioned 3D experience, is to map the third dimension. You can collect 3D coordinates with two WII controllers and this software, the rest of the approach stays as-is. What to map this third dimension to is left to you :)

Here's the code on my github, please drop me a line here if you need any help or manage to do something out of it.

Somebody is going to hate me: NoSPARQL

April 13, 2011

If you have recently been attending one of those NoSQL/BigData conferences you will have noticed that graph databases are usually considered as the eccentric cousin that nobody understands. I believe there's a bit of confusion about the role of GraphDBs inside of the NoSQL ecosystem and I have tried to expose the idea during this presentation.

First, GraphDBs are far from new. As Leonid Libkin put it at one of his latest talks, "Meet the new data model, same as the old data model": "In the (very) old days, the world of databases was a big mess, dominated by the network (graph) and the hierarchical (tree) data models. Then Codd came, and the nice and clean relational model replaced all others. In addition to providing a steady employment to many logicians, it created a $20.000.000.000/year business. We didn't live in that paradise for too long though: less than 30 years later, the world came back to the hierarchical model (XML). And graph-structured data hasn't been dormant all those years, although it was much less visible than the relational and XML models. Alberto Mendelzon was the first to revisit the graph model back in the 80s, and we saw more activity over the past 10-15 years."

Second, I believe you need to look at modern GraphDBs within their own ecosystem to fully understand them. Just for sake of analogy, and definitely lacking originality, I like calling this ecosystem NoSPARQL. Before nowadays GraphDBs like Neo4J, OrientDB, DEX, InfiniteGraph etc., we already had, and still have, a usable technology to handle graphs: Triplestores. These databases are thought and built with an Algebra-of-Sets-based mindset that is very similar to the one used for traditional Relational Databases. We have sets of triples, we join them, we put indices over them and we write queries with a language that is a dialect (meaning that you can translate SPARQL to SQL queries keeping their semantics) of SQL: SPARQL. And they have the same drawbacks.

Now, if we define the current NoSQL ecosystem as group of technologies that avoid join-based operations and queries based on a descriptive language, you will see that NoSPARQL is not such a bad name after all (at least as an analogy!). GraphDBs don't avoid relations but they embrace them in a way that they are not a computational problem anymore, by making them explicit instead of implicit through joins. To re-phrase M.Rodriguez, Graph Databases are those databases that allow the access to related data efficiently (in constant time, compared to more expensive tree-based operations, like those on which relational databases' indices are based on). So: no joins at query-time, but direct links at storage level.
Also, they provide a different way of accessing the data through an API that is at the lowest level, something like get(key), put(key, value), delete(key) for the NoSQL datastores, instead of SPARQL. Ok, it's still possible to query GraphDBs through SPARQL, in the end it is a descriptive language to describe a graph traversal, the operation of graph exploration, but it is not the only way.

Tinkerpop is a community that is building and feeding this ecosystem, with:

  • Blueprints: a jdbc of graph databases.
  • Pipes: a dataflow framework using process graphs.
  • Gremlin: a graph-based programming language.
  • Rexter: a REST-full graph shell.
  • And others.

Go and check them out.

DBpedia4Neo

April 9, 2011

DISCLAIMER: this is a bit of a hack, but it should get you started. I managed to get the core dataset of DBpedia into Neo4J, but this procedure should actually be working for any Blueprints-ready vendor, like OrientDB.

Ok, a little background first: we want to store DBpedia inside of a GraphDB, instead of the typical TripleStore, and run SPARQL queries over it. DBpedia is a project aiming to extract structured content from Wikipedia, information such as the one you can find in the infoboxes, the links, the categorization infos, geo-coordinates etc. This information is extracted and exported as triples to form a graph, a network of properties and relationships between Wikipedia resources.

So we're going to store millions of triples like "Barack Obama -- president of --> United States of America", or "Rome -- capital of --> Italy" etc. and once we have these triples in the store, we can run queries over this graph with a language that is not so different from SQL.

Why doing this with something like Neo4J or OrientDB instead of a Triple Store? Well, doing graph traversals, the operation of exploring the graph to answer a query (well, most of them), is usually more efficient with a graph database. Check out my slides for NoSQLDay and this nice paper by Marko Rodriguez and Peter Neubauer if you're interested in the details.

To start, we first have to get the DBpedia dump which is divided in multiple languages and datasets: geocoordinates, personaldata, links, categories etc.: choose those you need. Also, the dump is released in multiple formats, we'll go for the N-Triples (therefore the .nt extension), a line-based format to export RDF triples.

Next, you'll have to clone my dbpedia4neo github project which is composed of two packages: org.acaro.dbpedia4neo.web and org.acaro.dbpedia4neo.inserter. The first one is a very little web SPARQL endpoint based on the nice "Simple webserver with WebSocket and REST using jetty and NO XML" (yes, that's the actual name of the project).
The second is a package with a very simple class that parses the nt files through the Sesame library and issues an insert into the graph through the Blueprints layer. The process is very simple: for each line we have a triple; for each triple an insert into the database is issued.

Two things to notice here: some of the nt files are malformed, meaning that some URIs don't start with the scheme, i.e. http, and Sesame will just refuse to go on parsing, failing without possible intervention. So you'll have to grep them out before you insert them. I've used this grep command: grep -P '<(?!http(s)?:\/\/).*>'. Second thing to note is that, by default, the insertion is transactional, so for each insert it would start a transaction and insert the triple. You understand that the performance issue here. For this reason the class uses a bulk strategy, but it will need you to setup the size of the bulk insert as it depends on the amount of available RAM.

What are the pieces that do the magic? From the inserter perspective it will need a Sail interface to issue the addStatement() operations for each triple. Blueprints' GraphSail is done for this as it translates each addStatement(), the insertion of a triple into the store, into a call to the underlying IndexableGraph, implemented by Neo4J.
From the perspective of the SPARQL endpoint, we still have the GraphSail with Neo4J inside. This time the GraphSail is encapsulated into a SailGraph, which implements methods like executeSparql() over the GraphSail interface, that allows the execution of the queries.

This is all you need, this should get you started. I warn you, it's going to take at least 24h to insert the whole thing. The process is mostly CPU bound, I believe the problem is due to the Lucene indexing of Neo4j, but I haven't investigated further.

Let me know if and how you improved this workflow.

Enjoy.