Here at Lab41, we’ve recently found ourselves interested in dynamic graphs and we’ve spent the last few months trying to understand what tools we can use to analyze them – we call this effort Project SkyLine. We’re writing this blog to explain why we think dynamic graphs are interesting, and what we’ve found out so far.
What’s so great about dynamic graphs?
We’ve said it before and we’ll say it again, “Graphs are a great way to model the world around us – from links on the Internet, to the wiring of our brains, to our friendships and relationships.” Graphs naturally represent connections, and connections are central to each of these things. That’s not the whole story though: the world is constantly changing, and so are those connections. Web pages are taken down and links are added every day; our brains constantly rewire themselves as we learn and experience the world.
To understand how the world is changing, we need to be able to analyze graphs that change over time – in other words, dynamic graphs. In a dynamic graph, new edges and vertices can be created at any time, old ones can be destroyed, and attributes (things like age or location) can be altered at any moment, updating the graph to reflect changes in the things and relationships it represents.
We can learn a great deal by applying graph analytic techniques to dynamic graphs. For instance, a dynamic connected-components algorithm might tell us when someone joins or leaves a particular group of friends; applying PageRank to the web over time can show us how web pages rise or fall in mindshare. In addition, we can watch for patterns in the graph (like a series of vertices and edges matching a particular query of interest), flag them as they emerge, and track them for as long as they endure. We call this functionality “triggering” because it lets us respond to specific kinds of changes to the graph by using them to “trigger” relevant actions, as in the example below:
Imagine we have a graph where the vertices are web pages and the edges are links connecting them. We might have a trigger such as, “Send a notification whenever a page with ‘dogs’ in the URL is connected to a page with ‘cats’ in the URL which is connected to a page with ‘parakeets’ in the URL.”
The initial graph below doesn’t contain that path and therefore wouldn’t fire that trigger at all:
However, a later update could add an edge between “example.com/cats” and “example.com/parakeets,” which should trigger a notification being sent to the user since it matches the path highlighted by the red arrows [“example.com/dogs”, “example.com/cats”, “example.com/parakeets”]:
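To make the trigger example concrete, here is a minimal Python sketch of how such a check could work. It is not how any of the packages we surveyed implement triggering; the `TriggerGraph` class and its method names are hypothetical, and we assume that a pattern is a chain of per-vertex predicates and that only simple paths (no repeated vertices) count as matches. Each time an edge is added, the sketch searches backward and forward from that edge for paths that the new edge completes.

```python
from collections import defaultdict


class TriggerGraph:
    """Hypothetical sketch: fire a callback when a new edge completes a chain of predicates."""

    def __init__(self, predicates, on_match):
        self.predicates = predicates   # one predicate per position in the chain
        self.on_match = on_match       # called with each newly completed path
        self.succ = defaultdict(set)   # vertex -> vertices it links to
        self.pred = defaultdict(set)   # vertex -> vertices that link to it

    def _extend_back(self, path, pos):
        """Yield every way to fill chain positions pos..0 in front of `path`."""
        if pos < 0:
            yield path
            return
        for u in self.pred[path[0]]:
            if u not in path and self.predicates[pos](u):
                yield from self._extend_back([u] + path, pos - 1)

    def _extend_forward(self, path, pos):
        """Yield every way to fill chain positions pos..end behind `path`."""
        if pos == len(self.predicates):
            yield path
            return
        for v in self.succ[path[-1]]:
            if v not in path and self.predicates[pos](v):
                yield from self._extend_forward(path + [v], pos + 1)

    def add_edge(self, u, v):
        if v in self.succ[u]:
            return  # edge already present, nothing new can match
        self.succ[u].add(v)
        self.pred[v].add(u)
        # The new edge may sit between any two adjacent positions in the chain.
        for i in range(len(self.predicates) - 1):
            if u != v and self.predicates[i](u) and self.predicates[i + 1](v):
                for prefix in self._extend_back([u, v], i - 1):
                    for full in self._extend_forward(prefix, i + 2):
                        self.on_match(full)


notifications = []
graph = TriggerGraph(
    [lambda url: "dogs" in url, lambda url: "cats" in url, lambda url: "parakeets" in url],
    notifications.append,
)
graph.add_edge("example.com/dogs", "example.com/cats")       # no full path yet
graph.add_edge("example.com/cats", "example.com/parakeets")  # completes the path
print(notifications)
```

Searching on every update keeps memory use low, but it repeats work whenever many edges touch the same neighborhood, which is part of why efficient triggering is hard.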
So, how many open source packages does it take to analyze a dynamic graph?
Unfortunately, graph analytics to date have dealt mostly with static graphs (graphs that don’t change over time), and most of the relevant software is designed for that use case. There are a few exceptions, but even these aren’t very well known, so we decided to figure out what’s possible today.
We started off by looking at all the open source graph analytics packages we could find. Our goal was to find out what functionality each one offers, what use cases are supported, and how these hold up to the stress of real-world dynamics (possibly changing hundreds of thousands of times per second). Below is a summarized version of what we found, and you can see the results for yourself in all their gory detail here.
In the table below, we’ve included a few of the categories that we felt were the important points when making a decision on which tool to use. Here’s what each one means:
ACID Compliance / Eventual Consistency: Each operation relies on the state of the underlying graph in some way. What guarantees does this platform provide that each operation will see all changes made by its predecessors / will not interrupt or conflict with another operation happening concurrently?
Supports Graphs Larger than Memory: Pretty self-explanatory: can this platform handle graphs bigger than the memory of the machine it’s running on?
Supports Edge/Vertex Labels: Can we attach additional information to each edge and vertex besides what vertices/edges it’s connected to?
Supports Dynamic Graphs/Streaming: Can this platform handle changes to the graph under consideration, without having to reload it altogether?
Supports Triggering: If the answer to the feature above is yes, is there a way to run some piece of code every time a particular type of change happens (for instance, every time a vertex is added with the “name” attribute set to “Bob”)?
Quality of Documentation: On a scale from “I can haz cheezburger?” to The Encyclopaedia Britannica, just how approachable and comprehensive is the documentation? And on a scale from Twilight to Shakespeare, how readable is it?
Summarized Points of Comparison (Full Survey: http://lab41.github.io/SkyLine)
Looking at this table, it becomes clear that only a few of the packages under consideration attempt to support all of dynamic graphs, streaming updates, and triggering:
Titan: a leading graph database, created and maintained by the team at Aurelius (now DataStax) and being used at places including Cisco Systems and Los Alamos National Labs.
Stinger: an open source project started by a team at Georgia Tech based on their work on efficient graph data structures. The goal of Stinger is to support high performance analytics on dynamic graphs!
Weaver: a new open source, distributed graph store by a team at Cornell’s Systems Group, which shards the graph over multiple servers, and supports highly efficient transactions and updates.
As soon as we saw Weaver, we fell in love with the vision behind it. It looks like a really solid idea with a very smart group of contributors working on it. Unfortunately, it’s very much in its infancy, and the FAQ makes it very clear that Weaver isn’t production-ready, so we’ve had to put a pin in this one for now. Nonetheless, we’ll be following it closely over the near future, and are excited to see what becomes of the project.
That leaves us with Titan and Stinger. Since our first concern is with the platform’s ability to handle updates to the graph efficiently, we decided to benchmark the speed with which each one could process a given stream of changes to a starting graph (actually, a starting collection of disconnected vertices).
We wrote a Python script to create graphs that are representative of interesting workloads for us: lots of nodes and edges, potentially long cycles, vertex attributes, etc. To do this efficiently, our script started off by generating a large number of trees, and then randomly added ancestors to each node from the set of nodes closer than it to the root. Our script then picked and joined random pairs of nodes, and selected sequences of nodes which it joined together to make cycles (all the requisite probabilities and limits were tunable, and the random number generator was given the constant seed of 0xcafebabe, for reproducibility). This random generation of graphs is slightly different from our previous work with stochastic Kronecker natural graphs which, for those who are interested, can be found here.
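The generation steps above can be sketched roughly as follows. This is not our original script: the function name, parameter names, and default values are placeholders chosen for illustration, vertex attributes are left out, and "closer to the root" is interpreted here as "at a smaller depth in the same tree". Only the overall recipe (trees, extra ancestors, random pairs, cycles) and the fixed seed come from the description above.

```python
import random


def generate_graph(num_trees=3, nodes_per_tree=20, extra_ancestor_prob=0.2,
                   num_random_pairs=5, num_cycles=2, cycle_length=4,
                   seed=0xCAFEBABE):
    """Return (nodes, edges) for a random directed graph built from trees."""
    rng = random.Random(seed)  # fixed seed so every run yields the same graph
    edges = set()
    depth = {}                 # node -> distance from the root of its tree
    next_id = 0

    for _ in range(num_trees):
        members = []
        # Step 1: grow a random tree by attaching each new node to an existing one.
        for _ in range(nodes_per_tree):
            node = next_id
            next_id += 1
            if members:
                parent = rng.choice(members)
                edges.add((parent, node))
                depth[node] = depth[parent] + 1
            else:
                depth[node] = 0  # first node of the tree is its root
            members.append(node)
        # Step 2: sometimes add an extra ancestor drawn from shallower nodes.
        for node in members:
            shallower = [m for m in members if depth[m] < depth[node]]
            if shallower and rng.random() < extra_ancestor_prob:
                edges.add((rng.choice(shallower), node))

    nodes = list(range(next_id))
    # Step 3: join random pairs of nodes, possibly across trees.
    for _ in range(num_random_pairs):
        u, v = rng.sample(nodes, 2)
        edges.add((u, v))
    # Step 4: join random sequences of nodes end to end to form cycles.
    for _ in range(num_cycles):
        ring = rng.sample(nodes, cycle_length)
        for a, b in zip(ring, ring[1:] + ring[:1]):
            edges.add((a, b))
    return nodes, sorted(edges)


nodes, edges = generate_graph()
print(len(nodes), "nodes,", len(edges), "edges")
```

Passing a seeded `random.Random` instance around, rather than using the module-level functions, keeps the output reproducible even if other code also draws random numbers.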
The next step was to turn this graph into a randomly ordered stream of updates that could be used to generate it. This was slightly more complex than it sounds, since we wanted to ensure that:
- No edge was created unless the vertices on each end existed.
- Every vertex created (after those in the initial set) would have an edge connecting it to an existing vertex no more than one step later in the stream.
In short, this meant that we had to make sure that at least one predecessor of any given vertex existed before it was itself created. To do this, we started with the standard Graph Traversal Algorithm:
Repeat:
- Select a path on the frontier. Let’s call the path selected P.
- Remove P from the frontier.
- For each neighbor of the node at the end of P, extend P to that neighbor and add the extended path to the frontier.
Until the frontier is empty. (Adapted from http://artint.info/tutorials/search/search_1.html)
Here, the frontier is the set of nodes or edges still to be explored. It is initially set to all nodes that are adjacent to the root (those at the other end of an edge from the root node) or, equivalently, all the edges emanating from the root.
We then tweaked this algorithm so that the frontier was randomly ordered. By ensuring no node would ever get into the frontier (and thus be added to the stream) before at least one of its predecessors, we mirrored the random ordering that real streams exhibit.
(After all, in the real world, we can easily predict that a father will exist before his son, but not which father will have a son first!)
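A rough sketch of the randomized traversal might look like the following. The function name and the `("add_vertex", v)` / `("add_edge", u, v)` tuple format are assumptions made for illustration, not the format our tools actually consumed. The frontier holds edges whose source vertex already exists, and we pull edges out of it in random order.

```python
import random
from collections import defaultdict


def to_update_stream(edges, initial_vertices, seed=0xCAFEBABE):
    """Turn a directed edge list into a randomly ordered stream of updates."""
    rng = random.Random(seed)
    out_edges = defaultdict(list)
    for u, v in edges:
        out_edges[u].append((u, v))

    existing = set(initial_vertices)
    # The frontier starts as every edge leaving the initial vertex set.
    frontier = [e for v in initial_vertices for e in out_edges[v]]
    stream = []
    while frontier:
        # Remove a random frontier edge (swap with the last entry, then pop).
        i = rng.randrange(len(frontier))
        frontier[i], frontier[-1] = frontier[-1], frontier[i]
        u, v = frontier.pop()
        if v not in existing:
            # Create the vertex, and immediately follow it with its first edge.
            existing.add(v)
            stream.append(("add_vertex", v))
            frontier.extend(out_edges[v])
        stream.append(("add_edge", u, v))
    return stream


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "b")]
stream = to_update_stream(edges, initial_vertices=["a"])
for op in stream:
    print(op)
```

Because an edge only enters the frontier once its source vertex exists, both ordering constraints hold by construction. One limitation of this sketch is that vertices unreachable from the initial set never appear in the stream.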
We then fed the resulting streams to both Titan (using the Berkeley DB backend) and Stinger and measured total time taken to process them. Below are our findings.
|Number of Nodes|Number of Edges|Titan Time|Stinger Time|
|---|---|---|---|
|23,236|37,391|8.9 sec|0.05 sec|
|33,510|65,759|9.0 sec|0.08 sec|
|52,203|100,712|11.5 sec|0.11 sec|
|74,724|114,785|11.5 sec|0.13 sec|
|97,490|150,234|15.2 sec|0.20 sec|
|109,709|168,898|21.8 sec|0.32 sec|
|185,705|274,919|19.7 sec|0.31 sec|
|190,476|292,376|29.2 sec|0.56 sec|
|376,126|557,933|43.0 sec|1.11 sec|
|675,017|982,804|58.9 sec|1.19 sec|
a. No data available.
b. This datapoint is provided only as a rough bound, as it was produced on a different, much more powerful machine than all the others.
As we can see, Stinger throws down with the best of them: in our tests it processed every stream far faster than Titan did. Unfortunately, it can only handle graphs of up to a predefined number of vertices (which is very small by default). Titan, on the other hand, while roughly one to two orders of magnitude slower in the table above (using default settings and transaction parameters), was able to handle graphs with no apparent limit on vertex or edge count. Several factors could explain the performance difference we observed on graphs of comparable size. One is that Stinger operates in memory while Titan is a disk-based transactional database, although there are opportunities to tune those transactions. Another is that Stinger is written in C, whereas Titan uses Java (which could add to the overall slowdown).
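For readers who want to put numbers on that gap, the short snippet below recomputes throughput and the Titan-to-Stinger time ratio from the table. The figures are copied straight from the table; the variable names are ours. The ratio works out to between roughly 39x and 178x across the rows, and keep in mind the footnote caveat that one datapoint came from a more powerful machine.

```python
# (nodes, edges, titan_seconds, stinger_seconds), copied from the table above
rows = [
    (23_236, 37_391, 8.9, 0.05),
    (33_510, 65_759, 9.0, 0.08),
    (52_203, 100_712, 11.5, 0.11),
    (74_724, 114_785, 11.5, 0.13),
    (97_490, 150_234, 15.2, 0.20),
    (109_709, 168_898, 21.8, 0.32),
    (185_705, 274_919, 19.7, 0.31),
    (190_476, 292_376, 29.2, 0.56),
    (376_126, 557_933, 43.0, 1.11),
    (675_017, 982_804, 58.9, 1.19),
]

# How many times longer Titan took than Stinger on each stream.
ratios = [titan / stinger for _, _, titan, stinger in rows]

for (nodes, edges, titan, stinger), ratio in zip(rows, ratios):
    print(f"{edges:>9,} edges: Titan {edges / titan:>9,.0f} edges/s, "
          f"Stinger {edges / stinger:>11,.0f} edges/s, ratio {ratio:5.1f}x")

print(f"ratio range: {min(ratios):.0f}x to {max(ratios):.0f}x")
```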
Regardless, we’ll probably end up choosing to build on Titan, for a variety of reasons apart from performance. First, it’s ACID compliant, whereas Stinger isn’t. Second, of the two projects it’s much more robust and production-ready. It’s also better documented and more actively supported and contributed to. Finally, it has much better support for ad hoc querying and integration with other analytical tools, such as the TinkerPop stack, and the powerful visualization platform Gephi.
So what’s next?
We’re excited about the “triggering” scenario we described above. The ability to spot patterns that emerge as the graph updates, and to take actions based on those patterns, holds a lot of promise for various application areas. Business rules engines, for instance, do exactly this, but with relationally structured data rather than graphs. Alternatively, imagine being able to annotate models of the human transcriptome with new findings and have a machine notify scientists when patterns they’re interested in emerge (e.g., “This gene that increases connectivity in the fusiform gyrus when knocked out has mutant variations that correlate with lower expression of these other genes that influence height.” Yes, I did just make that up.).
Thus far, we’ve only taken a preliminary look at this problem. We’ve simplified our patterns to be fixed-length “chains” where each link specifies a predicate that the corresponding node or edge in a potential matching path must satisfy. Even in this case, the problem is pretty tough. Simple indexing approaches run into problems like high memory requirements (if you’re caching partial paths as they occur) or the substantial time complexity of the dynamic all-pairs shortest paths problem (if you want to maintain information about how many hops you have to travel to get to the nearest node satisfying the next predicate in the chain). The first approach would be easy if you knew that the paths were always very frequent or very infrequent, but lacking such information we’re stuck with the worst case for now. Some approaches we’re looking into are smarter caching of subpaths, or employing zero-suppressed binary decision diagrams, which have previously been used to count paths in graphs. But it’s early days so far.
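To illustrate the partial-path caching idea and why it can eat memory, here is a small sketch. It is only one naive way to do it, not an engine we have built: the `ChainMatcher` class is hypothetical, predicates apply to vertices only (edge predicates are left out), only simple paths count, and every vertex is assumed to be added before any edge that touches it. Every partial path that satisfies a prefix of the chain is stored at the vertex where it ends, so a new edge only has to extend the partial paths cached at its source.

```python
from collections import defaultdict


class ChainMatcher:
    """Hypothetical sketch: match fixed-length predicate chains by caching partial paths."""

    def __init__(self, predicates):
        self.predicates = predicates
        self.succ = defaultdict(set)      # vertex -> vertices it links to
        # partials[v] holds every cached path (as a tuple) that ends at v and
        # satisfies the first len(path) predicates of the chain.
        self.partials = defaultdict(set)
        self.matches = []                 # completed paths, in discovery order

    def add_vertex(self, v):
        if self.predicates[0](v):
            self._record((v,))

    def add_edge(self, u, v):
        if v in self.succ[u]:
            return
        self.succ[u].add(v)
        # Only partial paths already ending at u can use the new edge.
        for path in list(self.partials[u]):
            self._try_extend(path, v)

    def _try_extend(self, path, v):
        k = len(path)  # index of the next predicate to satisfy
        if k < len(self.predicates) and v not in path and self.predicates[k](v):
            self._record(path + (v,))

    def _record(self, path):
        end = path[-1]
        if path in self.partials[end]:
            return
        if len(path) == len(self.predicates):
            self.matches.append(path)     # full chain matched
            return
        self.partials[end].add(path)
        # Edges that arrived earlier may already continue this new partial path.
        for w in list(self.succ[end]):
            self._try_extend(path, w)


matcher = ChainMatcher(
    [lambda url: "dogs" in url, lambda url: "cats" in url, lambda url: "parakeets" in url]
)
for url in ("example.com/dogs", "example.com/cats", "example.com/parakeets"):
    matcher.add_vertex(url)
matcher.add_edge("example.com/cats", "example.com/parakeets")  # nothing cached at cats yet
matcher.add_edge("example.com/dogs", "example.com/cats")       # completes the chain
print(matcher.matches)
print(sum(len(paths) for paths in matcher.partials.values()), "partial paths cached")
```

Each update is cheap when few partial paths exist, but the cache grows with the number of distinct prefixes in the graph, which in the worst case is exponential in the chain length.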
If we’re lucky, and have a successful outcome, we’re hoping to help one or more open source projects implement and adopt a really efficient engine for doing these sorts of analyses and ideally, dynamic graph analytics in general. We think this is going to be a huge development in analytics, and can’t wait to see what the community builds on top of the ability to see how the world is changing.
Thanks for tuning into another exciting episode of the Lab41 blog. See you next time!