Figure 1: Mesos resource sharing increases throughput and utilization, via Apache Mesos at Twitter
Before diving into ways to combine Spark and Mesos with Docker, it will help to give a little background on how Spark typically integrates with Mesos. The interplay between the two systems is important, so bear with me for this quick overview.
Out of the box, it is relatively straightforward for Mesos to distribute Spark tasks to slave nodes. Following the guidance of several overviews and tutorials, the integration begins by first building the Spark binary. Next, you place that binary in an HDFS location each Mesos slave can reach (or alternatively, within the same local directory on each slave). From that point on, when Mesos slaves accept Spark jobs, they retrieve the binary from HDFS (or point to their local path install) to do Spark magic. The following configuration snippet illustrates how to configure Mesos to pull the Spark binary from HDFS:
```bash
# Reconstructed sketch of the snippet Figure 2 describes; the library path,
# HDFS URI, and Zookeeper hosts are illustrative placeholders.
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://namenode/spark/spark-1.2.0-bin.tgz
export MASTER=mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos
```
Figure 2. Mesosphere Spark configuration: First, specify the location of the libmesos.so library. Second, define the URI for the Spark executor, which the Mesos slaves will run. Then define the URIs for the Zookeeper masters.
Once configured, the Spark client can use Mesos as its master, as the Spark Python programming guide explicitly states:
spark-submit supports launching Python applications on standalone, Mesos or YARN clusters, through its --master argument. However, it currently requires the Python driver program to run on the local machine, not the cluster.
This setup should work great for vanilla Spark, but what about our interest in using PySpark with non-standard Python modules like Pandas or NumPy? As was the impetus for our ipython-spark-docker project, that would require getting all the right, compatible Python packages to each slave. It looks like using Mesos puts us right back at the starting point, where we’d need to do one of the following to make sure each slave has all the right Python packages:
- Use spark-submit’s --py-files option to distribute Python packages via egg/zip files. When I attempted this before with Spark standalone mode, I encountered a few difficulties: 1) finding the right packaged module; 2) distributing the modules to all slaves when starting the Spark cluster. I’m sure that has a lot to do with my relative Spark inexperience and interest in having so many Python modules available. Either way, this isn’t a path I plan to revisit.

Making a long story short, things aren’t straightforward for getting our desired IPython-driven Spark setup working with Mesos out of the box. So what to do?
Just like before, this desire for portability and repeatability leads to Docker. Since there are several ways to combine these systems, I’ll next walk through three potential architectures for deployment.
Our first thought for Mesos-izing a Dockerized Spark setup was to just install Mesos on bare metal. Given a vanilla Mesos installation, the Mesos master should be able to accept Spark jobs and send them to the slave nodes, whose hosts would then run containerized Spark workers. Voila, distributed analysis! Easy, right?
Not so fast. As near as I can tell, it might not be that straightforward to run Mesos on bare metal as the master for our containerized Spark in cluster mode. Our current Spark worker containers are configured for Spark standalone mode, where the workers register with a Spark master node when launched. Using Mesos as the master would require Mesos to launch the Spark worker containers as task executors. Luckily, Mesos v0.20.1 added the following magic:
Spark can make use of a Mesos Docker containerizer by setting the property spark.mesos.executor.docker.image in your SparkConf. The Docker image used must have an appropriate version of Spark already part of the image, or you can have Mesos download Spark via the usual methods.
That suggests it might be possible to extend our Spark worker image to include Mesos libraries, point our IPython-Spark client container at Mesos, and configure Spark to launch worker containers to execute Spark tasks. If this option works, it could be a solid way to benefit from Mesos and use our desired Python ecosystem inside Spark worker containers. It could also avoid all the network routing we needed for standalone mode.
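To make that concrete, here is a minimal sketch of how those properties might be assembled into spark-submit arguments. The property names come from the Spark-on-Mesos documentation; the Docker image name, HDFS URI, and Zookeeper address are invented placeholders:

```python
# Sketch: render the Mesos-related Spark properties as spark-submit --conf
# arguments. Property names are from the Spark docs; the image name, URI,
# and master URL are illustrative placeholders.
properties = {
    "spark.mesos.executor.docker.image": "lab41/spark-worker:latest",  # hypothetical image
    "spark.executor.uri": "hdfs://namenode/spark/spark-1.2.0-bin.tgz",  # hypothetical URI
}

def to_submit_args(master, props):
    """Build the argument list for spark-submit from a property dict."""
    args = ["--master", master]
    for key, value in sorted(props.items()):
        args += ["--conf", "%s=%s" % (key, value)]
    return args

args = to_submit_args("mesos://zk://zk1:2181/mesos", properties)
print(" ".join(args))
```

The point is simply that a single image property is all Mesos needs in order to launch containerized executors; everything else stays ordinary Spark configuration.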
Just in case this option has unforeseen issues, or in case we want to consider an alternate architecture in the future, there seem to be two additional options for combining Spark, Mesos, and Docker.
As outlined above, putting Spark inside Docker provides the ability to quickly spin up and tear down a Spark cluster. To build on that key benefit, it seems like the next step would be to consider a Dockerized Mesos setup similar to our standalone version of Spark-in-Docker containers. But those Mesos containers, of course, would also need to include all the requisite Spark and Python libraries, which makes for a pretty beefy container.
Although overloading one Docker image makes me uneasy, I’m wondering if that would deliver a highly portable option for using Spark with Mesos on top of Docker. After all, it should be technically feasible to build a set of master and slave/worker containers that include both Spark and Mesos. Each container would then have all the necessary packages, configurations, and versions. The last step—shuttling communication between hosts and containers—would repeat the network routing work outlined in our last post, leading to a similar situation where each host runs one container and “transparently” routes traffic among the cluster hosts and containers.
If we did pursue the Path to Mordor (i.e. “one container to rule them all”), we could build on top of the heavy lifting others have already done to Dockerize Mesos. For example, this article and related GitHub repo look like a solid way to “launch a fault tolerant multi node cluster with about seven Docker commands.” Merging this with our existing repo would take some effort, but we should be able to leverage prior work by adding libraries and configuring host-container routing.
That path is not without its perils, however. Cramming several frameworks into one container could become a slippery slope. As our containerized Spark project demonstrated, network routing (i.e. container1 → host1 → host2 → container2) also isn’t the most straightforward undertaking. Adding Mesos into the mix only complicates matters.
Figure 3: Concept architecture for incorporating Mesos into the existing Docker image
Faced with the previous two options (everything in one container or just using Spark worker containers), our team brainstormed a possible third choice. As outlined above, Mesos can launch tasks that contain Docker images. Further, Marathon is a Mesos framework designed specifically to launch long-running applications using containers. So instead of putting Mesos+Spark into one container, and instead of deploying things on bare metal, could we try Running Docker Containers on Marathon? That is, instead of using the Mesos master to distribute Spark jobs to Mesos slaves, Marathon might be able to run the standalone ipython-spark-docker cluster as a service inside of Mesos. I haven’t seen anyone try this specific setup with Spark (and maybe for good reason), but it should be possible for Mesos to spawn Spark containers that would look, feel, and act like a standalone Spark cluster.
One downside of this approach is that we would probably lose some of the efficiencies gained by using Mesos as Spark’s master. Second, it would require the Mesos slaves to redirect a large portion of host ports to the Spark containers, which could break Mesos communication patterns or set off a domino effect of errors that could be hard to debug. On that latter point, the standard ports for Mesos (5050/5051), Zookeeper (2181, 2888, 3888, …) and Marathon (customizable) do not appear to overlap with Spark’s, giving me hope that network routing might actually be possible. At this point, the only way to know for sure might be to try and see what works and where things break.
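That port-overlap hunch is easy to sanity-check. The sketch below compares the defaults mentioned above with Spark’s well-known standalone ports (taken from Spark’s documentation); Marathon is omitted since its port is customizable, and neither list is exhaustive — the randomly assigned ephemeral ports are the real wildcard:

```python
# Sanity-check that the well-known default ports do not collide. Spark's
# defaults are from its standalone-mode docs; this is a sketch, not an
# exhaustive list, and random ephemeral ports are not covered.
mesos_ports = {5050, 5051}              # master / slave
zookeeper_ports = {2181, 2888, 3888}    # client / peer / leader election
spark_ports = {7077, 8081, 4040, 7337}  # master, worker UI, app UI, shuffle

overlap = (mesos_ports | zookeeper_ports) & spark_ports
print("overlapping ports:", sorted(overlap))
```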
Figure 4: Concept architecture for running Dockerized Spark in standalone mode within a Mesos cluster
The best path forward probably depends on specific needs for using Mesos with Spark with Docker. If Mesos can launch our Spark worker containers, keeping Mesos on metal would position it squarely as a piece of infrastructure and launch Spark jobs as an application (as intended). For those interested in maximizing benefits from Docker, the idea of containerizing Spark, Mesos, and all associated libraries should make it possible to quickly deploy (and/or rebuild) a cluster. On the other hand, despite the efficiencies gained through Mesos, adding several frameworks to one Docker container feels a bit messy. If that is too much, using Marathon to run the Standalone Spark containers as a service might be the option to consider.
Overall, it seems worthwhile to experiment and see where things fall over. We’d be interested in knowing whether anyone else has figured out a way to containerize both Mesos and Spark in a multi-node cluster. As of now, we plan to do the following:

Stay tuned for a follow-up post after I finish the above steps. Until then, thanks for reading!
Here at Lab41, we build open source prototypes using the latest big data, infrastructure, and data science technologies. In a perfect world, these tools would magically work together. The reality is that they usually require a lot of effort just to install and configure properly. And when someone else comes along to actually use them — especially if they are a newly-minted teammate or someone unfamiliar with the myriad command-line switches and gotchas — the experience can transform these tools into shark repellent for sysadmins and end users alike.
If the above sounds familiar, or if you’re interested in using IPython notebooks to perform non-trivial data analytics with Apache Spark, then please continue…

This effort started when I became interested in Apache Spark, which has quickly become the heir apparent to MapReduce’s big data throne. By most measures, this data processing engine is living up to claims of better performance and usable APIs for powerful algorithmic libraries. If you add in its support for interactive and iterative development, plus use of data-scientist- and developer-friendly languages like Python, it’s no surprise why so many have fallen for this relative newcomer.
Out of the box, Spark includes a number of powerful capabilities, including the ability to write SQL queries, perform streaming analytics, run machine learning algorithms, and even tackle graph-parallel computations. Those features enable Spark to compete with a number of tools from mature ecosystems like Hadoop, but what really stands out is its usability. In short, incorporating interactive shells (in both Scala and Python) presents an approachable way to kick the tires. In my book, that’s a huge win that should help pull in curious developers (like me). After going through Spark’s cut-and-paste examples, as well as a few more involved tutorials, I had seen enough to want to begin using this platform. Anticipating the rest of our team benefiting from its capabilities, I also became interested in enabling their data analysis needs.
Within our team, we have developers, data scientists, and analysts with varying skills and experiences. Providing a solution that everyone could use was a key goal, which led to the following objectives:
After giving it some thought, I realized IPython would address that short list nicely and would be a familiar interface for our team. I decided to try to build something that looks like:
The first step, deploying the Spark cluster, seemed trivial since Lab41 uses a CDH5 cluster and Cloudera includes Spark in their distribution. However, I also had to develop around the situation where end users won’t be able to log in/SSH directly to the Spark cluster for their analytics. Most of our partners are very security-conscious, so adding a client node that can remotely connect and drive analytics on the cluster became the next must-have. “Easy,” I thought. “I’ll just set up a remote node to drive Spark analysis within the cluster.” I assumed the steps would be straightforward (and probably already solved):
Starting the master and worker nodes in our CDH5 cluster via the Cloudera Manager was straightforward. Building a client node was also easy since the Spark team graciously provides source and pre-built packages for several recent releases. With a straightforward download and install, my client was ready to drive the cluster.
To initially test the client driver — considering the end goal was to use IPython — I decided to start with a pyspark shell connected to the master (I decided to hold off on IPython integration to isolate any potential misconfigurations). Based on tutorials, connecting the remote client to the cluster initially appeared as easy as specifying ./bin/pyspark --master spark://ip:port. However, I immediately ran into a couple of errors related to library mismatches.
After a few rounds of Googling, I found out these errors are caused by using incompatible Spark driver libraries. Understandably, the client driver node needs to use libraries compatible with our cluster nodes; the driver’s v1.2.1 was apparently incompatible with our cluster’s v1.2.0. With that quick reminder to always verify build versions, I downloaded and installed the correct version on the client. Problem fixed!
With those library mismatch errors behind me, I soon encountered another error.
These kinds of errors scare me more than most since they give just enough to identify the general cause (“registering” the client and/or workers), but not enough to figure out exactly where to look. After poking around the master and worker logs (/var/log/spark/spark-<master|worker>-<hostname>.log), it looked like the client successfully connected to the master, but something after that failed to complete the Spark initialization. Errors in those logs made it clear the problem had something to do with my network configuration.
Since “unreachable addresses” can of course be caused by several factors, I’ll save you the nitty-gritty and jump straight to the important point: connecting a remote client node requires several expected and non-obvious network settings. Among them, the client address advertised in the tcp://sparkDriver URI needed to be a fully-qualified domain name (FQDN) on the network.

Whereas I could easily open the potential range of random ports on master, worker, and client nodes, adding a network-addressable client to the cluster felt like a step too far for this initial test setup. At this point, I decided to stop using our primary Hadoop cluster and instead virtualize a test Spark cluster within our internal instance of OpenStack. As before, using the pre-built Spark packages made it easy to create master and worker nodes for a standalone Spark installation. Running the startup scripts ./sbin/start-<master|slaves|all>.sh fired up and registered the master and workers, providing me with a throwaway cluster I could use for experimentation more comfortably.
I now had spun up a small virtualized Spark cluster, added a client node on the network, ensured it was reachable with a FQDN, and opened all necessary OpenStack security rules and ports for each node. For good measure I ensured each node’s /etc/hosts contained entries for the cluster’s nodes (i.e. 10.1.2.3 spark-node1.internal-domain), leaving me confident all necessary traffic would reach its intended destination.
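As a small illustration of that bookkeeping, a helper can emit the /etc/hosts lines for a node map (all hostnames and addresses below are invented):

```python
# Sketch: generate /etc/hosts entries so every FQDN in a small Spark cluster
# resolves on every node. Names and addresses are illustrative placeholders.
nodes = {
    "spark-master.internal-domain": "10.1.2.2",
    "spark-node1.internal-domain": "10.1.2.3",
    "spark-node2.internal-domain": "10.1.2.4",
}

def hosts_entries(node_map):
    """Return /etc/hosts lines, one per node, sorted by address."""
    return ["%s %s" % (ip, fqdn)
            for fqdn, ip in sorted(node_map.items(), key=lambda kv: kv[1])]

for line in hosts_entries(nodes):
    print(line)
```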
With the network configurations behind me, the quest led me into another set of library mismatch errors.
“Hmmm, strange,” I thought. “All of my OpenStack images include Python…what’s the deal?” Well, when I provisioned the OpenStack instances, I used different host images for the cluster nodes and client driver as a way to better mimic that real-world possibility. It turns out the older worker nodes had python2.6, whereas the client (and Spark’s default options) explicitly specified python2.7. Updating the client environment to export PYSPARK_PYTHON=python propagated Spark’s configuration and let each node rely on its native python build. This situation clearly won’t work for a production deployment, but I was at the stage of wanting to move past errors and could later rebuild environments and configurations.
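In code, the fix is a one-line environment change on the driver side (a sketch; in practice this lived in the client’s shell environment/spark-env.sh rather than a Python script):

```python
import os
import sys

# Sketch: point PySpark at each node's native interpreter instead of a
# hard-coded python2.7. Spark propagates PYSPARK_PYTHON from the driver's
# environment to the workers, so "python" resolves per-node against PATH.
os.environ["PYSPARK_PYTHON"] = "python"

# In a real job you'd still want the driver and workers to agree on the
# major version; a lightweight driver-side guard might record it like so:
driver_version = "%d.%d" % sys.version_info[:2]
print("driver python:", driver_version,
      "| PYSPARK_PYTHON:", os.environ["PYSPARK_PYTHON"])
```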
Next, I ran into the strange situation where my client would accept the examples I had created, but when it submitted jobs to workers, they seemed to be missing things and would fail with missing-module and library errors.
Of course! In my previous rounds of yak-shaving fixes, I forgot one obvious requirement: all Spark nodes clearly need the same/compatible environment to effectively distribute analysis (aka “not fail”) across the cluster. It wasn’t sufficient to add things like numpy and GLIBC to my client; every node in the Spark cluster also needed those same modules and libraries.
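A cheap way to catch this class of failure early is to probe for the required modules on each node before submitting real work; the same function could even be shipped as a trivial Spark job. A sketch:

```python
import importlib

def missing_modules(required):
    """Return the subset of module names that fail to import on this node."""
    missing = []
    for name in required:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Example with a stdlib module that exists and one that never will:
print(missing_modules(["json", "definitely_not_installed_module"]))
```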
I made a crucial decision at this point. I did not like the idea of continuing to tweak and tune the configurations, environments, and libraries for each master, worker, and client hosts. While I was beginning to understand things, I knew nobody else would be able (or want) to replicate my work. If only there was a technology focused on transparent repeatability and portability of infrastructure…
Enter Docker!
Yes, the fantastic “build once, run anywhere” container not only enables development of portable apps, but also can be a godsend to sysadmins in this type of situation. From their website:
Sysadmins use Docker to provide standardized environments for their development, QA, and production teams, reducing “works on my machine” fingerpointing. By “Dockerizing” the app platform and its dependencies, sysadmins abstract away differences in OS distributions and underlying infrastructure.
Perfect! This benefit, I knew, would enable me to package all the configuration options within a common environment I could then deploy as master, worker, and client nodes. Caveat emptor, though. I have used Docker enough to know my intended IPython-ized Spark (or is it Spark-ified IPython?) setup would require a decent amount of customization, especially the network configuration pieces. But I also knew it was possible, and since a combination of Dockerfiles and scripting would lead to a repeatable build, I made the call to Dockerize the entire setup.
Since others already figured out how to run Spark in Docker, I first turned to those images and tutorials. After using them, I learned a few important things:
- You can specify hdfs://<hadoop-namenode>/path/to/hdfs/file to access our Hadoop cluster, but I’m lazy and wanted our hadoop-namenode to serve as the container’s default HDFS endpoint. To enable that default connectivity, I added our Hadoop configuration to the container. As an added measure for data locality, the ideal deployment would run these containers inside our Hadoop nodes and thereby avoid sending large amounts of data across the network. Keep in mind this setup means you’ll have to ensure library version compatibility between the containers and your Hadoop nodes.
- Broadcasting a container’s randomly generated hostname (e.g. d5d3225d06c4) would cause Spark workers to attempt sending traffic to that host, which of course wouldn’t exist on the network. By passing the host node’s hostname to the container at runtime, the container effectively “thinks” it is the host and can broadcast the appropriate destination address.

Refined concept for IPython-Spark-HDFS using Docker
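That hostname trick can be sketched as the construction of the docker run command; the image name, master URL, and script shape are illustrative, not the project’s exact scripts:

```python
import socket

# Sketch: build a `docker run` invocation that hands the host's hostname to
# the container, so Spark broadcasts a network-routable address instead of a
# random container ID. The image name and master URL are placeholders.
def spark_worker_cmd(image="lab41/spark-worker",
                     master_url="spark://spark-master.internal-domain:7077"):
    hostname = socket.getfqdn()
    return [
        "docker", "run", "-d",
        "--hostname", hostname,  # container "thinks" it is the host
        image, master_url,
    ]

cmd = spark_worker_cmd()
print(" ".join(cmd))
```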
Since I’ve used similar bits and pieces in other work, I knew where I wanted to start for building the foundation. I started with two existing Docker images that together cover the following stack of tools:
HDFS | Hbase | Hive | Oozie | Pig | Hue
Pandas | NLTK | NumPy | SciPy | SymPy | Scikit-Learn
Cython | Numba | Biopython | 0MQ | Pattern | Seaborn
Matplotlib | Statsmodels | Beautiful Soup | NetworkX | LLVM | MDP
Bokeh | Vincent
Borrowing from those two Docker images to build the common base, I layered a few important changes within the Dockerfile:
- Environment variables, including PYSPARK_PYTHON, SPARK_SSH_PORT, and SPARK_SSH_OPTS. The latter two force Spark to communicate on SSH via a non-standard port (I chose 2122). I made this change to the containers’ SSH daemons so I could still SSH in “normally” via port 22 on the host machines.

Building on that base image, I created Docker images for each of the master, worker, and client node types. Each image uses a bootstrap.sh script to start runit, leaving each node type to implement different startup services:
- Master: runs the spark-master process plus an SSH daemon. This setup violates Docker’s “one-process-per-container” philosophy, but is necessary since master and workers communicate via SSH (as noted before, via port 2122).
- Worker: runs the spark-worker process.

I wrote a few Bash scripts to start up each container type. If you plan to use these, keep in mind two important details:
- The worker script takes the master’s URL as an argument (i.e. ./3-run-spark-worker.sh spark://master-fqdn:port). If provisioning on bare metal and/or within your HDFS cluster, you could use something heavyweight like puppet, a lighter deployment tool like fabric, or even a simple series of ssh -e commands.
- I originally tried EXPOSE-ing ports in the Dockerfile and later publishing each port/range at runtime, but that method can cause iptables to run out of memory. Plus, it makes the container metadata (and docker ps output) unreadable with so many mapped ports. Instead, I made the host create a new iptables chain with custom PREROUTING rules. If you don’t want iptables as a dependency, or if you just want to handle networking The Docker Way, I would suggest explicitly setting the random ports identified in the Configuring Ports for Network Security guide (i.e. SPARK_WORKER_PORT and spark.driver.port).

Docker image and networking for ipython-spark-docker deployment
As with most big data platforms, setting up Apache Spark was not a simple “double-click installation” process. It required host and network configurations that sometimes were difficult to find and decipher. Adding my goal of driving analytics with a remote client revealed additional gotchas. I managed to troubleshoot these, but it was an effort I wouldn’t want others to have to reproduce. The extra desire to leverage IPython’s simpler interface, connect to our HDFS cluster, and ensure library compatibility between all nodes led me to Docker’s doorstep.
While the architecture is complex, Docker made it less complicated and more repeatable to develop, test, document, and iterate. Fast forward to today: we now have a working version of IPython-driven Spark analytics on our HDFS data, which is something others might be looking to use. And rather than say, “Email me for help,” or “Google ‘this’ and StackOverflow ‘that’,” I can point you to the ipython-spark-docker repo.
If you’ve read this far, thanks for your patience while I walked you through this endtoend journey. I came across so many questions online where people ran into similar problems that I wanted to document the entire process. Hopefully, this post will save others from wondering where things might have gone wrong.
If you decide to give our repo a try, let us know. The Lab is interested in knowing if it helps, and is happy to offer a helping hand if something needs a little more work.
Until our next post, thanks for reading!
Here at Lab41, we’ve recently found ourselves interested in dynamic graphs and we’ve spent the last few months trying to understand what tools we can use to analyze them – we call this effort Project SkyLine. We’re writing this blog to explain why we think dynamic graphs are interesting, and what we’ve found out so far.
We’ve said it before and we’ll say it again, “Graphs are a great way to model the world around us – from links on the Internet, to the wiring of our brains, to our friendships and relationships.” Graphs naturally represent connections, and connections are central to each of these things. That’s not the whole story though: the world is constantly changing, and so are those connections. Web pages are taken down and links are added every day; our brains constantly rewire themselves as we learn and experience the world.
To understand how the world is changing, we need to be able to analyze graphs that change over time – in other words, dynamic graphs. In a dynamic graph, new edges and vertices can be created at any time, old ones can be destroyed, and attributes (things like age or location) can be altered at any moment, updating the graph to reflect changes in the things and relationships it represents.
We can learn a great deal by applying graph analytic techniques to dynamic graphs. For instance, a dynamically connected components algorithm might tell us when someone joins or leaves a particular group of friends; applying PageRank to the web can show us how web pages rise or fall in mindshare. In addition, we can watch for patterns in the graph – like a series of vertices and edges matching a particular query of interest – flag them as they emerge, and track them for as long as they endure. We call this functionality “triggering” because it lets us respond to specific kinds of changes to the graph by using them to “trigger” relevant actions, as in the example below:
Imagine we have a graph where the vertices are web pages and the edges are links connecting them. We might have a trigger such as, “Send a notification whenever a page with ‘dogs’ in the URL is connected to a page with ‘cats’ in the URL which is connected to a page with ‘parakeets’ in the URL.”
The initial graph below doesn’t contain that path and therefore wouldn’t fire that trigger at all:
However, a later update could add an edge between “example.com/cats” and “example.com/parakeets,” which should trigger a notification to the user since it completes the path highlighted by the red arrows: [“example.com/dogs”, “example.com/cats”, “example.com/parakeets”].
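That trigger logic is easy to sketch against a plain adjacency structure: every edge insertion re-tests whether the dogs → cats → parakeets path now exists and, if so, fires a notification. A minimal illustration (a toy, not any of the platforms surveyed below):

```python
# Sketch: a tiny dynamic graph with a path trigger. Each edge insertion checks
# whether the dogs -> cats -> parakeets path exists and, if so, "fires" by
# recording a notification.
notifications = []

class TriggeredGraph(object):
    def __init__(self):
        self.adj = {}

    def add_edge(self, src, dst):
        self.adj.setdefault(src, set()).add(dst)
        self.check_trigger()

    def has_edge(self, src, dst):
        return dst in self.adj.get(src, set())

    def check_trigger(self):
        path = ["example.com/dogs", "example.com/cats", "example.com/parakeets"]
        if all(self.has_edge(a, b) for a, b in zip(path, path[1:])):
            notifications.append("path found: " + " -> ".join(path))

g = TriggeredGraph()
g.add_edge("example.com/dogs", "example.com/cats")       # no trigger yet
g.add_edge("example.com/cats", "example.com/parakeets")  # completes the path
print(notifications)
```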
Unfortunately, graph analytics to date have dealt mostly with static graphs – graphs that don’t change over time – and most of the relevant software is designed for that use case. There are a few exceptions, but even these aren’t very well known, so we decided to figure out what’s possible today.
We started off by looking at all the open source graph analytics packages we could find. Our goal was to find out what functionality each one offers, what use cases are supported, and how these hold up to the stress of real-world dynamics (possibly changing hundreds of thousands of times per second). Below is a summarized version of what we found, and you can see the results for yourself in all their gory detail here.
In the table below, we’ve included a few of the categories that we felt were the important points when making a decision on which tool to use. Here’s what each one means:
ACID Compliance / Eventual Consistency: Each operation relies on the state of the underlying graph in some way. What guarantees does this platform provide that each operation will see all changes made by its predecessors / will not interrupt or conflict with another operation happening concurrently?
Supports Graphs Larger than Memory: Pretty selfexplanatory, can this platform handle graphs bigger than the memory of the machine it’s running on?
Supports Edge/Vertex Labels: Can we attach additional information to each edge and vertex besides what vertices/edges it’s connected to?
Supports Dynamic Graphs/Streaming: Can this platform handle changes to the graph under consideration, without having to reload it altogether?
Supports Triggering: If the answer to the feature above is yes, is there a way to run some piece of code every time a particular type of change happens (for instance, every time a vertex is added with the “name” attribute set to “Bob”)?
Quality of Documentation: On a scale from “I can haz cheezburger?” to The Encyclopaedia Britannica, just how approachable and comprehensive is the documentation? And on a scale from Twilight to Shakespeare, how readable is it?
Summarized Points of Comparison (Full Survey: http://lab41.github.io/SkyLine)
Looking at this table, it becomes clear that only a few of the packages under consideration attempt to support all of dynamic graphs, streaming updates, and triggering:
Titan: a leading graph database, created and maintained by the team at Aurelius (now DataStax) and being used at places including Cisco Systems and Los Alamos National Labs.
Stinger: an open source project started by a team at Georgia Tech based on their work on efficient graph data structures. The goal of Stinger is to support high performance analytics on dynamic graphs!
Weaver: a new open source, distributed graph store by a team at Cornell’s Systems Group, which shards the graph over multiple servers, and supports highly efficient transactions and updates.
As soon as we saw Weaver, we fell in love with the vision behind it. It looks like a really solid idea with a very smart group of contributors working on it. Unfortunately, it’s very much in its infancy, and the FAQ makes it very clear that Weaver isn’t productionready, so we’ve had to put a pin in this one for now. Nonetheless, we’ll be following it closely over the near future, and are excited to see what becomes of the project.
That leaves us with Titan and Stinger. Since our first concern is with the platform’s ability to handle updates to the graph efficiently, we decided to benchmark the speed with which each one could process a given stream of changes to a starting graph (actually, a starting collection of disconnected vertices).
We wrote a Python script to create graphs that are representative of interesting workloads for us: lots of nodes and edges, potentially long cycles, vertex attributes, etc. In order to do this efficiently, our script started off by generating a large number of trees, and then randomly adding ancestors to each node from the set of nodes closer than it to the root. Our script then picked and joined random pairs of nodes, and selected sequences of nodes which it joined together to make cycles (all the requisite probabilities and limits were tunable, and the random number generator was given the constant seed of 0xcafebabe, for reproducibility). This random generation of graphs is slightly different than our previous work with stochastic Kronecker natural graphs which, for those who are interested, can be found here.
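Our actual generator is more involved, but its tree-first, seeded flavor can be sketched in a few lines (the parameters here are illustrative):

```python
import random

# Sketch of the generator's core idea: build a random tree (guaranteeing every
# vertex a predecessor), then sprinkle in extra random edges. A constant seed
# makes the output reproducible, as in our script.
def random_graph(num_vertices, extra_edges, seed=0xcafebabe):
    rng = random.Random(seed)
    edges = []
    # Tree edges: each new vertex attaches to a random earlier vertex.
    for v in range(1, num_vertices):
        edges.append((rng.randrange(v), v))
    # Extra edges between random pairs of distinct vertices.
    for _ in range(extra_edges):
        a, b = rng.sample(range(num_vertices), 2)
        edges.append((a, b))
    return edges

g1 = random_graph(100, 20)
g2 = random_graph(100, 20)
print(len(g1), g1 == g2)  # same seed, same graph
```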
The next step was to turn this graph into a randomly ordered stream of updates that could be used to generate it. This was slightly more complex than it sounds, since we wanted to ensure that:
In short, this meant that we had to make sure that at least one predecessor of any given vertex existed before it was itself created. To do this, we started with the standard Graph Traversal Algorithm:
Repeat: select and remove a node (or edge) from the frontier, process it, and add its unexplored neighbors to the frontier. Continue until the frontier is empty. (Adapted from http://artint.info/tutorials/search/search_1.html)
Where the frontier is defined as the set of nodes or edges to be explored and is initially set to all nodes that are adjacent to the root (are at the other end of an edge from the root node) or simply all the edges emanating from the root.
We then tweaked this algorithm so that the frontier was randomly ordered. By ensuring no node would ever get into the frontier (and thus be added to the stream) before at least one of its predecessors, we mirrored the random ordering that real streams exhibit.
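The tweaked traversal can be sketched as follows: keep a frontier, pop a random element each step, and emit vertices in that order, so every vertex appears only after at least one of its predecessors:

```python
import random

# Sketch: turn a graph (adjacency list, rooted at 0) into a randomly ordered
# vertex stream in which every vertex appears after at least one predecessor,
# by randomizing which frontier element is expanded next.
def random_stream(adj, root=0, seed=0xcafebabe):
    rng = random.Random(seed)
    stream = [root]
    seen = {root}
    frontier = list(adj.get(root, []))
    while frontier:
        v = frontier.pop(rng.randrange(len(frontier)))  # random frontier pick
        if v in seen:
            continue
        stream.append(v)
        seen.add(v)
        frontier.extend(w for w in adj.get(v, []) if w not in seen)
    return stream

adj = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5]}
stream = random_stream(adj)
print(stream)
```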
(After all, in the real world, we can easily predict that a father will exist before his son, but not which father will have a son first!) We then fed the resulting streams to both Titan (using the Berkeley DB backend) and Stinger and measured total time taken to process them. Below are our findings.
| Number of Nodes | Number of Edges | Titan Time | Stinger Time |
| --- | --- | --- | --- |
| 23,236 | 37,391 | 8.9 sec | 0.05 sec |
| 33,510 | 65,759 | 9.0 sec | 0.08 sec |
| 52,203 | 100,712 | 11.5 sec | 0.11 sec |
| 74,724 | 114,785 | 11.5 sec | 0.13 sec |
| 97,490 | 150,234 | 15.2 sec | 0.20 sec |
| 109,709 | 168,898 | 21.8 sec | 0.32 sec |
| 185,705 | 274,919 | 19.7 sec | 0.31 sec |
| 190,476 | 292,376 | 29.2 sec | 0.56 sec |
| 376,126 | 557,933 | 43.0 sec | 1.11 sec |
| 675,017 | 982,804 | 58.9 sec | 1.19 sec |
| 2,100,312 | 3,012,488 | 14.4 min | N/A^a |
| 22,727,509 | 40,173,501 | 12.1 hours | N/A^a,b |
^{a}No data available
^{b}This datapoint is provided only as a rough bound, as it was produced on a different, much more powerful machine than the one used for all the others.
As we can see, Stinger throws down with the best of them: in our tests it clearly performed far better. Unfortunately, it can only handle graphs up to a predefined number of vertices (which is very small by default). Titan, on the other hand, while around an order of magnitude slower (using default settings and transaction parameters), handled graphs with no apparent limit on vertex or edge count. Several factors could explain the performance difference we observed on graphs of comparable size. One is that Stinger operates in memory while Titan is a disk-based transactional database, although there are opportunities to tune those transactions. Another is that Stinger is written in C, whereas Titan uses Java (which could add to the overall slowdown).
Regardless, we’ll probably end up building on Titan, for a variety of reasons apart from performance. First, it’s ACID-compliant, whereas Stinger isn’t. Second, of the two projects it’s much more robust and production-ready. It’s also better documented and more actively supported and contributed to. Finally, it has much better support for ad hoc querying and integration with other analytical tools, such as the TinkerPop stack and the powerful visualization platform Gephi.
We’re excited about the “triggering” scenario we described above. The ability to spot patterns that emerge as the graph updates, and to take actions based on those patterns, holds a lot of promise for various application areas. Business rules engines, for instance, do exactly this, but with relationally structured data rather than graphs. Alternatively, imagine annotating models of the human transcriptome with new findings and having a machine notify scientists when patterns they’re interested in emerge (e.g., this gene that increases connectivity in the fusiform gyrus when knocked out has mutant variations that correlate with lower expression of other genes that influence height; yes, I did just make that up).
Thus far, we’ve only taken a preliminary look at this problem. We’ve simplified our patterns to fixed-length “chains,” where each link specifies a predicate that the corresponding node or edge in a potential matching path must satisfy. Even in this case, the problem is pretty tough. Simple indexing approaches run into problems like high memory requirements (if you’re caching partial paths as they occur) or the substantial time complexity of the dynamic all-points shortest-path problem (if you want to maintain how many hops away the nearest node satisfying the next predicate in the chain is). The first approach would be easy if we knew that paths were always very frequent or very infrequent, but lacking such information we’re stuck with the worst case for now. Some approaches we’re looking into are smarter caching of subpaths, and zero-suppressed binary decision diagrams, which have previously been used to count paths in graphs. But it’s early days so far.
If we’re lucky, and have a successful outcome, we’re hoping to help one or more open source projects implement and adopt a really efficient engine for doing these sorts of analyses and ideally, dynamic graph analytics in general. We think this is going to be a huge development in analytics, and can’t wait to see what the community builds on top of the ability to see how the world is changing.
Thanks for tuning into another exciting episode of the Lab41 blog. See you next time!
Lab41 recently released <a href="https://github.com/lab41/circulo" target="_blank">Circulo</a>, a Python framework that evaluates *community detection algorithms*. As background, a community detection algorithm partitions a *graph* containing vertices and edges into clusters or communities. Vertices can be entities such as people, places, or things, while the relationships among these entities are the edges. Behind the scenes, community detection algorithms use these relationships (edges), their strength (edge weights), and their direction (directedness) to conduct the partitioning. The partitioning is significant because it can often provide valuable insight into the underlying raw data, revealing information such as community organizational structure, important relationships, and influential entities.
Circulo becomes especially important in circumstances where community detection algorithms fail to present clear and consistent results. One of the more prominent examples of this is the case where algorithms executed against the same dataset produce variable results relative to membership, size, execution time, or number of communities. For example, ten different algorithms can produce ten different results, all at different rates. This level of variation puts a researcher into the difficult position of having to choose among the results, without much guidance as to which of the results most accurately applies to the circumstances. Can varying results combine to form a global best result? Does the type of input data affect which algorithm to use? If an algorithm takes too long to execute, is using a fast algorithm sufficient? Do different definitions of a community determine the algorithm to use? Is there such thing as a best result?
Circulo enables researchers to try and answer these questions by giving them an efficient platform to conduct data collection, analysis, and experimentation. The framework calculates a variety of quantitative metrics on each resulting community. It can validate a partitioning against some predefined ground truth or can compare two different partitions to each other. This data can be used to draw conclusions about algorithm performance and efficacy. And best of all, it is completely modular so that a researcher can add new algorithms, metrics, and experiments with ease.
To help explain the Circulo framework, we will use flight route data obtained from openflights.org. This dataset is one of 14 that we use for testing in Circulo. In this example, airports are nodes, and the routes between airports are the edges. The resulting graph is both directed (since flights travel from one airport to another) and multi-edged (since numerous routes may exist between a pair of airports), and contains 3,255 vertices and 67,614 edges. Reasons to employ community detection against this type of data range from an airline debating where to build its next hub, to identifying a new route to an underserved region, to developing a plan for rerouting regional airport traffic to a different hub. A clearer understanding of how flight routes divide airports into clusters could lead to better-informed decisions.
Circulo execution can be divided into a three-stage pipeline. Generally speaking, the inputs to the pipeline are a collection of algorithms and a collection of datasets; the outputs are JSON-encoded files containing numeric metric results. Each algorithm/dataset pair produces a partitioning, and each partitioning produces a set of metrics. The first stage is Data Ingestion, where raw data is extracted, transformed, and loaded (ETL) into a graph and serialized into a GraphML file. The second stage is Algorithm Execution, where one or more algorithms are executed against the graph. Finally, the third stage is Metric Execution, where the results of the previous stage are evaluated against a variety of metrics.
The primary purpose of the Data Ingestion stage is to provide the remaining two stages with a consistent, known graph input format; in many ways, this stage serves as a tool to convert any raw data format into the format the downstream stages expect. To accomplish this, a researcher subclasses the provided CirculoData base class. We have chosen igraph as the primary framework for representing a graph in memory, and GraphML as the primary serialization format for storing the graph to disk; both are mature, well documented, and widely supported.
For the flights data, the Data Ingestion stage begins by executing functionality provided by the FlightsData class. Each new dataset must subclass CirculoData, as FlightsData does; the base class provides the functionality to download the data, convert it into a graph, identify ground truth from labels when available, and then serialize the graph as GraphML. The raw data for flights comprises two CSV files:
- `flights.csv` (e.g. “Goroka”,“Goroka”,“Papua New Guinea”,“GKA”,“AYGA”,6.081689,145.391881,5282,10,“U”,“Pacific/Port_Moresby”)
- `routes.csv` (e.g. “AF,137,ATL,3682,ILM,3845,Y,0,CRJ 319”)
Below are illustrative vertex (node) and edge representations in GraphML of the previous CSV lines from flights and routes (the `key` identifiers shown are approximate; the exact keys are defined during ETL):

```xml
<node id="n0">
  <data key="name">Goroka</data>
  <data key="city">Goroka</data>
  <data key="country">Papua New Guinea</data>
  <data key="iata">GKA</data>
  <data key="icao">AYGA</data>
  <data key="latitude">6.081689</data>
  <data key="longitude">145.391881</data>
</node>
```

```xml
<edge source="ATL" target="ILM">
  <data key="airline">AF</data>
  <data key="equipment">CRJ 319</data>
</edge>
```
By default, Circulo includes 14 datasets to enable a researcher to quickly evaluate community detection algorithms out of the box. The project contains information about these default datasets, including how to add additional datasets.
| Dataset | Description |
| --- | --- |
| amazon | Co-purchasing data – http://snap.stanford.edu/data/bigdata/communities |
| house_voting | 2014 congress (house) voting data – https://www.govtrack.us/developers/data |
| flights | Flight route data – http://openflights.org/data.html |
| football | NCAA D1-A games – http://www-personal.umich.edu/~mejn/netdata |
| karate | Famous dataset of Zachary’s karate club – http://www-personal.umich.edu/~mejn/netdata/ |
| malaria | Amino acids in malaria parasite – http://danlarremore.com/bipartiteSBM |
| nba_schedule | Games played in the 2013–2014 NBA season – https://github.com/davewalk/2013-2014-nba-schedule |
| netscience | Graph of collaborators on papers about network science – http://www-personal.umich.edu/~mejn/netdata/ |
| pgp | Interactions in Pretty Good Privacy – http://deim.urv.cat/~alexandre.arenas/data/xarxes/ |
| revolution | Graph representing colonial American dissidents – https://github.com/kjhealy/revere.git |
| school | Face-to-face interactions in a primary school – http://www.sociopatterns.org/datasets/primary-school-cumulative-networks/ |
| scotus | Supreme Court case citation network – http://jhfowler.ucsd.edu/judicial.htm |
| senate_voting | 2014 congress (senate) voting data – https://www.govtrack.us/developers/data |
| southern_women | Bipartite graph of southern women social groups – http://nexus.igraph.org/api/dataset_info?id=23&format=html |
The second stage of the Circulo pipeline is running the community detection algorithms against each dataset. To run all algorithms against the flights dataset, a researcher invokes Circulo's algorithm-execution script with the dataset name (a one-line command; see the project documentation for the exact invocation).
Circulo will run each algorithm/dataset pair in parallel, serializing the job name, dataset name, iteration (an algorithm/dataset pair may run multiple times), VertexCover membership array, total elapsed execution time, and alterations to the filesystem in JSON, along the lines of the following sketch (field names are illustrative):

```json
{
  "job_name": "flights--infomap--0",
  "dataset": "flights",
  "iteration": 0,
  "elapsed": null,
  "membership": [[56, 1], [56], [28]]
}
```
The VertexCover membership array above indicates, through 0-based indexing, that vertex 0 belongs to communities 56 and 1, vertex 1 belongs to community 56, vertex 2 belongs to community 28, and so on.
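Inverting such a membership array into per-community vertex lists is a small helper worth having (a sketch; igraph's VertexCover already stores communities in this inverted form):

```python
from collections import defaultdict

def communities_from_membership(membership):
    """Map community id -> list of member vertex ids (0-based)."""
    communities = defaultdict(list)
    for vertex, comms in enumerate(membership):
        for c in comms:
            communities[c].append(vertex)
    return dict(communities)

cover = communities_from_membership([[56, 1], [56], [28]])
# vertex 0 is in communities 56 and 1; vertices 0 and 1 share community 56
```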
Though a researcher can add any algorithm for evaluation, the framework comes with 15 algorithms by default, with implementations from Lab41 (Conga, Congo, Radicchi Strong, Radicchi Weak), igraph (infomap, fastgreedy, Girvan-Newman, leading eigenvector, label propagation, walktrap, spinglass, multilevel), and SNAP, the Stanford Network Analysis Project (BigClam, Coda, Clauset-Newman-Moore). More information about each algorithm can be found in the Lab41 Community Detection Survey.
When an algorithm executes against a dataset, the Algorithm Execution stage first conforms the dataset as closely as possible to the algorithm's expected input. Given that algorithms vary in how they use weighted, multi-, and directed edges, this conforming step is necessary for proper execution and maximal efficacy. In this manner, Circulo operates with a "best chance" methodology: we give the algorithm the best circumstances so that it has the best chance of finding a solution. The transformation can be as simple as changing all directed edges to undirected edges; in other cases it is more involved. igraph provides the transformation functionality through the functions simplify and to_undirected, both of which can collapse edges, which is necessary when, for example, simplifying a multi-edge graph.
The flights dataset (directed, multigraph, unweighted) is an excellent example of how these transformations can occur. To illustrate, consider how the flights data changes when handed to the two algorithms we compare next: Label Propagation and Infomap.
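A pure-Python sketch of the kind of collapsing involved (igraph performs this via `simplify` and `to_undirected`; the weight-as-edge-count convention here is one common choice, not necessarily Circulo's):

```python
from collections import Counter

def to_undirected_simple(edges):
    """Collapse a directed multigraph edge list into undirected weighted edges,
    dropping self-loops and counting parallel edges as the weight."""
    counts = Counter(frozenset((u, v)) for u, v in edges if u != v)
    return {tuple(sorted(e)): n for e, n in counts.items()}

routes = [("ATL", "ILM"), ("ILM", "ATL"), ("ATL", "ILM"), ("ATL", "JFK")]
weights = to_undirected_simple(routes)
# three ATL<->ILM routes collapse into one undirected edge of weight 3
```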
Once the data is transformed and the algorithms have run, a researcher can start experimenting with the results. The following experiment highlights a distinct difference between Label Propagation and Infomap when applied to the flights dataset. We use the graph visualization tool Gephi to view the results overlaid onto a map using the geocoordinates of the airports. Each color in the figures below represents a community as determined by the respective algorithm (the colors vary between the two figures because Gephi randomizes them).
The visualization confirms that both algorithms present reasonable partitions, if one assumes that locality is a valid source of ground truth for flight data. Looking closely, however, the results vary in their degree of locality. For example, Label Propagation treats most of the US and Mexico as a single community, while Infomap treats them as separate communities. One could surmise that Infomap presents a more detailed view of communities and Label Propagation a more general one, information that is valuable when using the algorithms in the future, especially when ground truth such as geocoordinates is not available.
<a class="fancybox-effects-a" href=/images/post_9_circulo/label_propagation_flights.jpg><img src="/images/post_9_circulo/label_propagation_flights.jpg" title="Label Propagation Community Detection Results for Flight Data" ></a>
<p style="text-align:center"><small>_Label Propagation Results_</small></p>
<a class="fancybox-effects-a" href=/images/post_9_circulo/infomap_flights.jpg><img src="/images/post_9_circulo/infomap_flights.jpg" title="Infomap Community Detection Results for Flight Data" ></a>
<p style="text-align:center"><small>_Infomap Results_</small></p>
Before we proceed, it is important to identify the two major data points known at this stage of the pipeline: the vertex membership produced by the algorithm, and the algorithm's execution time. With this data alone, a researcher could come to various conclusions: (1) the effectiveness of algorithms, by comparing resulting memberships with ground truth memberships; (2) the variance of membership results, by comparing memberships produced by different algorithms; and (3) the accuracy of algorithms as a function of time, by including elapsed execution time. So why probe further into the data? Why not just accept that vertex membership and time are enough?
Generally speaking, if more data is available, there are more opportunities to come to better conclusions. When an algorithm draws a boundary, and a community is formed, the graph does actually change. Yes, it has the same vertices, and yes it has the same edges, but it now has a third element: the communities themselves. Communities interact with other communities. Communities have ecosystems within them. It is not just that a boundary sets apart vertices in a graph, it is also that it redefines how relationships among vertices can collectively be viewed. All of these facets of the communities can provide further insight beyond just the vertex membership.
One notable benefit of metrics is that we can now better define what it means to be a good community. For example, a good community might be one that is isolated from the rest of the graph, a property measured by the conductance metric. We can then identify algorithms that minimize conductance, or discover other metrics that correlate with it. We could even use individual metrics to distinguish communities among themselves: What is the most isolated community? What is the densest?
In Circulo, we have identified the following community metrics:
| | | |
| --- | --- | --- |
| Cut Ratio | TLU (Local Clustering Coefficient) – Max | TLU – Biased Kurtosis |
| Degree Statistics – Max | TLU – Median | Average Out Degree Fraction |
| Internal Number Edges | Transitivity Undirected (Global Clustering Coefficient) | Diameter |
| Conductance | Degree Statistics – Unbiased Variance | Separability |
| Triangle Participation Ratio | Cohesiveness | TLU – Unbiased Variance |
| Fraction over a Median Degree | TLU – Min | Degree Statistics – Median |
| Degree Statistics – Mean | Degree Statistics – Size | Degree Statistics – Biased Kurtosis |
| TLU – Size | Internal Number Nodes | Degree Statistics – Biased Skewness |
| Density | Flake Out Degree Fraction | Expansion |
| TLU – Mean | Maximum Out Degree Fraction | Normalized Cut |
| Degree Statistics – Min | TLU – Biased Skewness | |
Descriptions of each of these metrics can be found in the igraph documentation and in the paper “Evaluating Network Communities based on Ground-truth,” by Jaewon Yang and Jure Leskovec.
To run the metrics against a given vertex membership, one invokes Circulo's metrics run script, passing it the algorithm results directory and an output directory.
The algorithm_results_path is the directory containing the JSON-encoded results from stage 2. The metric_results_output_path is the path to the directory where the JSON-encoded metric results will be saved. For example, running the metrics suite against the infomap/flights result creates JSON shaped roughly like this (abridged, with approximate key names; the full output covers every metric and community):

```json
{
  "omega": null,
  "metrics_elapsed": null,
  "membership": [[56, 1], [56], [28]],
  "metrics": {
    "Conductance": {
      "results": [0.51, 0.67, 0.66],
      "aggregations": {"Size": 156, "Mean": 0.61, "Min": 0.00029}
    }
  }
}
```
The omega index measures how similar the partition is to some predefined ground truth, when available; in the case of the flights data, we used the countries of the airports. The metrics_elapsed field is the time taken for this stage to complete. The membership is carried over from the Algorithm Execution stage. The metrics section is divided into two subsections for each metric: (1) results, the actual score for each community indexed by community id, and (2) aggregations, summary statistics over those results.
Though each metric has the potential to provide a valuable perspective on the resulting communities, we will focus only on conductance and density here, for the sake of example. Conductance is the ratio of edges leaving the community to the community's total number of edges; one could think of it as a measure of how much a community conducts its energy to the rest of the graph. Density is the ratio of edges inside a community to the number of possible edges in the community; vertices belonging to dense communities have multiple edges connecting them to other vertices in the community.
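A small self-contained sketch of both metrics (exact conventions vary slightly across papers; here conductance is cut edges over the community's edge-endpoint volume, and density is internal edges over all possible internal edges):

```python
def conductance_and_density(edges, community):
    """Compute conductance and internal density for one community
    of an undirected graph given as an edge list."""
    community = set(community)
    internal = sum(1 for u, v in edges if u in community and v in community)
    boundary = sum(1 for u, v in edges if (u in community) != (v in community))
    # Conductance: fraction of the community's edge endpoints that leave it.
    conductance = boundary / (2 * internal + boundary)
    # Density: internal edges over all possible internal edges.
    n = len(community)
    density = internal / (n * (n - 1) / 2)
    return conductance, density
```

For a triangle with a single edge leading out, conductance is low and density is maximal, which matches the intuition of an isolated, tightly knit community.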
Using the flights example once again, a researcher for the airline industry might have concluded, from previous analysis of the algorithm against numerous datasets, that the infomap algorithm tends to find communities with low conductance and high density. Because the researcher is trying to find the airline new opportunities for expanding into underserved markets, a good community in this case is one that is isolated from other regions (low conductance) and has high internal traffic (high density). Applying the metrics to the results of infomap/flights bears this out.
We can see that the infomap algorithm discovered 156 communities, with an average density of ~0.45 and an average conductance of ~0.61. The individual scores are documented in the results section, where, for instance, the conductance of each of the first 3 communities is approximately 0.51, 0.67, and 0.66. Though the moderate average scores of both conductance and density suggest that, overall, many of the existing airports serve their regions well, there exists at least one community with a conductance of 0.00029 and at least one community with a density of 1.0. Given that in our example a researcher is trying to identify underserved regions, the results of infomap might still be worth a more detailed analysis.
The big question that remains is, “What’s Next?” Circulo provides the pipeline for efficiently gathering metrics based on community detection algorithm execution, but where is the value?
The value, we would argue, is hidden in the metrics data. Before, when we asked questions about which algorithms to use in which situations, we had no place to start aside from crude qualitative observations. Now researchers can leverage the Circulo framework to produce quantitative metrics that serve as the impetus for further experimentation. We have started this experimentation on our Circulo experiments page. From here, we hope to add more algorithms, include more metrics, and build a variety of experiments that will drive a better understanding of community detection and the numerous algorithms that encompass it. We also hope that you can help make this happen by contributing to Circulo in the future.
Conceptual illustration of network analysis using community and role detection
Network modeling is an extremely powerful analytic method employed across a massive variety of disciplines. As Robbie mentioned in his recent post, one of the most useful techniques in the network analysis domain is community detection, which breaks large networks down into smaller component parts. The earliest models of community detection treated these breakdowns as partitions of the network; as our collective understanding has matured, however, we have realized that communities make more sense as organic, unrestricted groupings of vertices that can range anywhere from complete overlap to complete exclusion.
This nuanced understanding of communities has several powerful applications in the real world. For example, we can find different clusters of protein interactions within cells, identify how fans of sports teams relate to each other, or understand the influence of different scientists in their collaborations. However, community detection algorithms produce groups of related nodes without distinguishing them relative to each other, leaving several meaningful realworld questions unanswered without further analysis. Fortunately, this field of research has advanced upon the idea that community structure is not the only construct that lends itself to graph analytics.
Going one step further, the roles of individual nodes can also be gleaned from the structure of the graph. Combined with community detection, role detection can add crucial insight, for example identifying which nodes act as hubs within a community, which bridge multiple communities, and which sit on the periphery. The analysis required to uncover this is complementary to community detection: it must identify common "roles" across many communities, finding nodes in the graph with similar structure and function. It is possible to perform these calculations by hand, or by examining the auxiliary structure provided by some community detection algorithms; however, those approaches are often ad hoc and do not scale.
We’d hope for a richer system to complement community detection: one that, given a graph, automatically extracts structural roles and assigns them to nodes. Lucky for us, Henderson et al. proposed such a system in their 2012 paper “RolX: Structural Role Extraction & Mining in Large Graphs.” We’ll discuss the paper’s key details below.
The RolX (pronounced “role-ex”) algorithm is simple in conception, but somewhat more complicated in presentation. The core idea of RolX is the observation that if we gather data about a graph in some linear form (such as a matrix), we can use matrix factorization methods to find structure in the data, and possibly use this to discover corresponding structure in the graph itself.
RolX starts by gathering a wide range of information associated with nodes in the graph. To gather details about the elements of the graph, the authors rely on the discussion of ReFeX (Recursive Feature eXtraction) from their earlier paper, “It’s Who You Know: Graph Mining Using Recursive Structural Features” (2011). ReFeX recursively collects a wide range of structural data about the graph, for example node-centric parameters such as a node’s degree. A key feature of the recursion is that it captures both local and global structure by looking at successively larger regions of the graph.
Quantifying a graph’s structure typically requires computing an ensemble of complicated structural metrics. Some focus on information local to the node, such as its degree, and some are influenced more by global structure, such as a node’s PageRank value. These calculations can be time-consuming for large graphs, as many global metrics require several passes over the graph before they converge or stabilize.
ReFeX proposes a new way to do this, using only three basic metrics per node: `[degree, ego-network-internal, ego-network-out]`.
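A sketch of those three base features per node, where the egonet is the node plus its immediate neighbors (our own minimal rendering; ReFeX's real feature set is richer):

```python
def base_features(adj):
    """adj: dict node -> set of neighbors (undirected).
    Returns node -> [degree, egonet-internal edges, egonet-outgoing edges]."""
    features = {}
    for v, nbrs in adj.items():
        ego = {v} | nbrs
        # Edges with both endpoints inside the egonet (each counted twice).
        internal = sum(1 for u in ego for w in adj[u] if w in ego) // 2
        # Edge endpoints leaving the egonet.
        outgoing = sum(1 for u in ego for w in adj[u] if w not in ego)
        features[v] = [len(nbrs), internal, outgoing]
    return features
```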
Astute minds may notice that this process is not guaranteed to terminate; indeed, it could go on forever, so we need a stopping condition. Fortunately, there is an obvious one: the process should terminate when the information ceases to yield new knowledge about the graph’s structure. This is accomplished by eliminating columns that closely resemble each other, or are in fact duplicates of one another. First, we construct an auxiliary graph in which each node represents a single column of the matrix. Next, we connect columns whose values are close for all nodes. We can then use connected-component-finding algorithms to “trim the fat” from the matrix. This process ensures that all remaining columns contribute unique information to our structural understanding of the graph.
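A sketch of that pruning step, treating columns as nodes and near-equality as edges, then keeping one representative per connected component (here via union-find; the tolerance is a tunable assumption):

```python
def prune_columns(matrix, tol=1e-9):
    """matrix: list of rows. Keep one column per group of near-duplicate columns."""
    cols = list(zip(*matrix))
    n = len(cols)
    close = lambda a, b: all(abs(x - y) <= tol for x, y in zip(a, b))
    # Union-find over columns; near-duplicate columns end up in one component.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if close(cols[i], cols[j]):
                parent[find(j)] = find(i)
    keep = sorted({find(i) for i in range(n)})
    return [[row[j] for j in keep] for row in matrix]
```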
Now that we have a giant matrix of data about the graph’s structure, we can begin mining it for insights. At this stage the matrix is large (more than \(2^{10}\) elements for an average-sized graph); its rows represent nodes of the graph, and its columns represent the values of the recursive structural metrics computed. Each cell associates a given node with a given metric’s value, which makes the matrix rich in information. The challenge is figuring out how to use this information to get a complete, but concise, description of the graph’s structure. It turns out we can, using a technique called nonnegative matrix factorization, or NMF.
NMF is a mathematical strategy that has proven popular in the field of unsupervised machine learning. Its goal is straightforward: given a large matrix, create an approximation of that matrix that is much smaller, but mostly equivalent. It is part of a broader class of algorithms that perform dimensionality reduction, which takes complex data and projects it into a smaller-dimensional space for easier analysis. Given that it can enable insight into very large datasets, NMF is useful in a wide variety of contexts, such as modeling document topics or building recommender systems.
In mathematical terms, an \(m\times n\) matrix \(V\) factors into an \(m\times r\) matrix \(W\) and an \(r\times n\) matrix \(H\), where \(r\) is much smaller than \(m\) or \(n\), so that \(WH \approx V.\) Because an exact solution to this problem is usually not possible, we must approximate it instead. This approximation requires use of several linear-algebra and numerical-analysis techniques.
The matrices generated by NMF
In this particular instance, NMF can allow us to break down the massive array of graph metrics into a smaller collection of “roles.” Using this, we may be able to extract meaning from the graph’s structure and use this to find common structural motifs in the graph.
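A minimal NMF sketch using Lee and Seung's multiplicative update rules (no regularization, fixed iteration count; real toolkits such as scikit-learn's `NMF` add much more):

```python
import numpy as np

def nmf(V, r, iters=300, seed=0):
    """Factor a nonnegative matrix V (m x n) as W @ H with W (m x r), H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 0.1   # strictly positive initialization
    H = rng.random((r, n)) + 0.1
    eps = 1e-9                     # guards against division by zero
    for _ in range(iters):
        # Multiplicative updates keep W and H nonnegative at every step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For a matrix that really is (near) low rank, the reconstruction error drops quickly, which is exactly the situation RolX exploits: the feature matrix has far fewer underlying roles than columns.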
Unfortunately, NMF outputs are not always clear to the naked eye; in fact, they are essentially just more matrices of numbers. Since the node statistics output by ReFeX were somewhat obscure, knowing which roles correspond to which combinations of those statistics does not by itself explain the meaning of the roles. To do that, we need to understand how each role corresponds to interpretable graph metrics, such as a node’s PageRank value or its degree. To figure this out, we essentially perform the inverse of the factorization: we have a matrix \(N\) associating nodes in the graph with graph metrics, and a matrix \(W\) associating nodes with roles, and we want a matrix \(G\) such that \(WG\approx N.\) Finding it is a relatively simple optimization problem.
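That inverse step is an ordinary least-squares problem. A numpy sketch (RolX additionally constrains \(G\) to be nonnegative, which we skip here):

```python
import numpy as np

def explain_roles(W, N):
    """Given W (nodes x roles) and N (nodes x interpretable metrics),
    solve W @ G ≈ N for G (roles x metrics) in the least-squares sense."""
    G, *_ = np.linalg.lstsq(W, N, rcond=None)
    return G
```

Each row of \(G\) then describes one role in terms of familiar quantities like degree or PageRank.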
The following example from Henderson’s 2012 paper shows the fundamental difference between community detection and role discovery. Both graphs represent the same community of scientists who have coauthored scholarly papers.
Figure 2: Henderson and Gallagher illustrate the differences between the Fast Modularity community detection algorithm, on the left, and RolX after applying each to a collaboration graph (RolX numbering added by Lab41 to help readers identify roles)
Whereas the graph on the left shows 22 communities, the one on the right shows four roles that cross-cut those communities (represented as diamonds, squares, triangles, and circles). Some scientists are central members of networks: they reside within tightly connected clusters representing a specific discipline and influence every researcher in that discipline. Others bridge two or more different communities of researchers; these scientists usually focus on interdisciplinary topics. Some scientists are members of cliques: small “stars” of researchers all connected to each other, and loosely connected to the rest of the graph. Finally, most scientists are connected to some other researchers in one specific field, but not tightly, and are not the central node in that field.
Hopefully, this whirlwind tour of RolX highlights how it can provide valuable insight from graphs. Combined with community detection algorithms, RolX can help you understand not only which groups are tightly connected, but also how certain nodes play key roles within the network. If you’re interested in learning more, be sure to read Henderson and Gallagher’s original paper, and take a look at our RolX implementation and our broader Circulo project. We also welcome contributors, so please check out our repo on GitHub and submit whatever issues, fixes, or code you think would help.
Background image “Self-organization” used under Creative Commons license
Here at Lab41, we don’t just see graphs. We’re also investigating the interesting and useful properties of these graphs. Recently, the lab has been evaluating the applicability of community detection algorithms to graphs of varying size and structure. These algorithms attempt to find groups of vertices that seem related to one another, and could therefore be considered communities in the larger network.
When a graph can be split into groups of nodes that are densely internally connected, we say that those groups are communities and that the graph has community structure. Natural graphs often have this property, and as we’ve mentioned before, natural graphs tend to be the most illuminating.
While there has been a fair amount of research focused on community detection since Michelle Girvan and Mark Newman published their seminal paper in 2002, we still lack a proper understanding of which algorithm(s) work best for a given graph structure and scale.
There are two classes of community detection algorithms. Some algorithms find overlapping communities, while others partition the graph into distinct, non-overlapping communities. The Girvan-Newman algorithm is the canonical example of the latter group. In this post, we’ll discuss an evolution of the Girvan-Newman algorithm into newer algorithms called CONGA and CONGO, and eventually try to find out whether the structure of a graph impacts CONGO’s performance.
It is a good idea to fully digest Girvan-Newman before delving into derivations such as CONGA and CONGO.
Girvan and Newman introduced an idea called edge betweenness centrality, or edge betweenness. The edge betweenness of an edge \(e\) is defined as the number of shortest paths that pass through \(e\), normalized by the number of shortest paths between each pair of vertices. Girvan and Newman argue that edges with high betweenness scores tend to be inter-community edges, because a high edge betweenness hints that the popular edge acts as a bottleneck between two communities. An examination of the inter-community edges in the figure below makes this intuition obvious.
If we accept that the edge with the highest betweenness tends to divide communities, we see that repeatedly removing that edge might be a good way to partition a network. To do so, the Girvan-Newman algorithm takes the following steps:
1. Calculate the betweenness of every edge in the graph.
2. Remove the edge with the highest betweenness.
3. Recalculate the betweennesses affected by the removal.
4. Repeat from step 2 until no edges remain.
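The procedure translates almost line-for-line into code. Here is a rough sketch of one pass (run until the graph first splits), using networkx and its built-in karate-club example graph rather than any particular production implementation:

```python
import networkx as nx

def girvan_newman_split(G):
    """Remove highest-betweenness edges until the graph splits into
    more components than it started with (one pass of Girvan-Newman)."""
    g = G.copy()
    start = nx.number_connected_components(g)
    while nx.number_connected_components(g) == start:
        # recalculate edge betweenness after every removal
        betweenness = nx.edge_betweenness_centrality(g)
        edge = max(betweenness, key=betweenness.get)
        g.remove_edge(*edge)
    return list(nx.connected_components(g))

communities = girvan_newman_split(nx.karate_club_graph())
print(len(communities))  # 2: removing one edge adds at most one component
```

Because a single edge removal can increase the component count by at most one, the first split always yields exactly one more community than the graph started with.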
Natural graphs can grow very large. It’s not uncommon to want to find insight about a graph with millions or even billions of nodes. Consequently, it’s important to be able to compare the expected performance of algorithms without looking at the exact number of machine instructions or even writing code. One way to do this is by leveraging asymptotic (also known as Big-O) notation. An easy way to think of Big-O notation is to imagine an expression once you’ve eliminated all constants and kept only the largest factor. For example, \(.01x^3 + 950x^2\log x + 3 = O(x^3)\), since even though \(x^3\)’s constant is the smallest, it is still the dominating factor as \(x\) increases.
If some function \(f(x)\) grows no faster than another function \(g(x)\), we say that \(f(x) \in O(g(x))\) or \(f(x) = O(g(x))\). A bit more formally, \(f(x) = O(g(x))\) if and only if there exist constants \(C\) and \(x_0\) such that \(f(x) \le C g(x)\) for all \(x > x_0\). In other words, no matter how much larger \(f(x)\) is for small values of \(x\), \(g(x)\) will eventually catch up. Big-O notation is an incredibly useful tool to quickly compare algorithms and find out how much performance depends on the size of the input.
On a graph with \(V\) vertices and \(E\) edges, it would seem that calculating all of the betweenness centralities would require \(O(EV^2)\) time, because shortest paths must be found between all \(V(V-1)/2 = O(V^2)\) pairs of vertices, each using a breadth-first search that costs \(O(E)\). Luckily, Newman and Brandes independently describe an \(O(EV)\) algorithm for betweenness centrality that requires only a single breadth-first search from each vertex. This shortcut accumulates path counts during each search, yielding the edge betweenness without explicitly enumerating the shortest paths.
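To make the one-search-per-vertex idea concrete, here is a from-scratch sketch of a Brandes-style edge-betweenness calculation on an unweighted, undirected graph. Scores here count each unordered pair of vertices once; this is an illustration of the technique, not the authors’ exact formulation:

```python
from collections import deque

def edge_betweenness(adj):
    """adj: {vertex: set of neighbors}. Returns {frozenset({u, v}): score},
    counting each unordered pair of vertices once."""
    scores = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:  # one breadth-first search per vertex: O(EV) overall
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors along those paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate path fractions from the leaves back toward s
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                contrib = sigma[v] / sigma[w] * (1 + delta[w])
                scores[frozenset((v, w))] += contrib
                delta[v] += contrib
    # every pair (s, t) was counted twice, once from each endpoint
    return {e: b / 2 for e, b in scores.items()}

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(edge_betweenness(path)[frozenset(("a", "b"))])  # 2.0 (paths a-b and a-c)
```

On the three-vertex path, edge \(\{a, b\}\) carries the shortest paths for pairs \((a, b)\) and \((a, c)\), hence the score of 2.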
An algorithm like Girvan-Newman’s that repeatedly divides the graph is known as a divisive algorithm. A divisive algorithm on a graph usually returns a dendrogram – a specialized type of tree. A dendrogram is a memory-efficient data structure that describes the history of the algorithm. It stores a list of the ways small communities merge to make larger ones, until the entire graph is one big community. We can even derive the historical list of divisions that the algorithm made by inspecting the list of merges. Furthermore, a dendrogram can be split at any level to find a single clustering (a set of clusters) that contains the desired number of communities. When the number of communities is known, the dendrogram can easily be split at the appropriate level. When the optimal number of communities is unknown, we use a metric like modularity to determine which clustering is the “best” one.
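For instance, networkx’s girvan_newman generator yields one clustering per dendrogram level (each level having one more community than the last), and its modularity function scores each clustering. A sketch, assuming the karate-club example graph and walking only a few levels down:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()

# each item from the generator is one level of the dendrogram:
# a tuple of communities, with one more community than the level before
levels = []
for communities in girvan_newman(G):
    levels.append(communities)
    if len(communities) >= 10:  # no need to walk all the way down
        break

# pick the level with the highest modularity
best = max(levels, key=lambda comms: modularity(G, comms))
print(len(best), round(modularity(G, best), 3))
```

The exact best level depends on the graph, but on this graph the modularity-maximizing clustering scores well above zero, confirming real community structure.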
Ostensibly, the Girvan-Newman algorithm runs in \(O(E^2V)\) time, since the betweennesses must be recalculated for each edge removal. However, since the betweenness only needs to be recalculated in the component in which an edge has just been removed, the algorithm is much more tractable on graphs with strong community structure that split quickly into several disconnected components.
An example of a graph with strong community structure is the graph of character interactions in Victor Hugo’s Les Miserables. The following figure shows how Girvan-Newman partitions that graph.
While Valjean’s cohorts in Les Mis seem to partition nicely into their own communities, actors in real networks are often members of multiple communities. Most of us belong to multiple social groups. In fact, almost any real-world network has at least some overlapping community structure.
Most existing algorithms partition networks into non-overlapping communities, but there has been a recent push to design an effective overlapping community detection algorithm.
Zachary’s Karate Club is a famous network representing feuding factions at a karate club. Non-overlapping community detection algorithms provide a great deal of insight, but at the cost of forcing each student into a single faction, even if he belongs in multiple. The figure on the left is a partitioning performed by a non-overlapping algorithm like Girvan-Newman, and the figure on the right is a clustering that allows for overlap. Of course, this is a toy example in which we’ve limited the number of communities to two, but it’s not hard to imagine a very complex network with many communities and vertices that belong in several.
For whatever reason, network scientists have been exclusively naming their overlapping algorithms using acronyms. CODA, CESNA, BIGCLAM, and CONGA are all algorithms that attempt to discover overlapping communities in a graph. Today, we’ll briefly explore Steve Gregory’s CONGA, or Cluster Overlap Newman Girvan Algorithm.
In CONGA, Gregory defines a concept called split betweenness. Imagine splitting a vertex into two parts, such that each part keeps some of the original edges. Then the split betweenness is the edge betweenness of an imaginary edge between the two new vertices (represented as a dashed line in the figure below).
Since the split betweenness of this imaginary edge can be calculated the same way as the edge betweenness of a real one, comparing the two values is an entirely legitimate operation. CONGA splits the vertex instead of removing an edge when the maximum split betweenness is greater than the maximum edge betweenness. Vertices can be split repeatedly, so a single vertex in the original graph can end up in an arbitrary number of communities. This property gives us the overlapping community structure that we were looking for.
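One way to make the idea concrete is to materialize the imaginary edge: replace \(v\) with two new vertices, divide its edges between them, and measure the edge betweenness of the edge that joins them. The sketch below does this with networkx on a made-up “bowtie” graph (two triangles sharing a vertex); the names v1 and v2 are arbitrary placeholders and assume no such nodes already exist:

```python
import networkx as nx

def split_betweenness(G, v, part):
    """Edge betweenness of the imaginary edge created by splitting vertex v,
    with the neighbors in `part` attached to one half and the rest to the other.
    Assumes G has no nodes named "v1" or "v2"."""
    H = G.copy()
    H.remove_node(v)
    H.add_edge("v1", "v2")  # the imaginary edge
    for u in set(G[v]):
        H.add_edge("v1" if u in part else "v2", u)
    eb = nx.edge_betweenness_centrality(H, normalized=False)
    return eb.get(("v1", "v2"), eb.get(("v2", "v1")))

# a "bowtie": two triangles sharing the vertex v
bowtie = nx.Graph([("a", "b"), ("a", "v"), ("b", "v"),
                   ("c", "d"), ("c", "v"), ("d", "v")])
print(split_betweenness(bowtie, "v", {"a", "b"}))  # 9.0
```

Splitting the bowtie’s center so that each half keeps one triangle puts every cross-triangle shortest path (plus the paths ending at the two halves) on the imaginary edge, giving the unnormalized score of 9.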
A naive version of CONGA would simply calculate the split betweenness for every possible split of every vertex. The algorithm would then look like this:
Each time the graph splits, vertices are assigned to one more community than before. Since we don’t know the optimal number of communities, we have to somehow store the historical community assignments before continuing the algorithm. Because CONGA is a divisive algorithm, we would hope to be able to use a dendrogram to store the results. However, the overlapping structure of the result set means that such a data structure wouldn’t make much sense. Instead, our version of CONGA returns a list of all of the community assignments that the algorithm generates.
This version of CONGA is simple, but it’s also intractable with more than a handful of nodes. To see why, assume that each vertex has \(m\) incident edges. Then we can split each vertex \(2^m\) different ways, since we can choose any subset of edges to be split away to the new vertex, and any set with \(m\) elements has \(2^m\) subsets. Since we have \(V\) vertices, we need to calculate \(V\times 2^m\) split betweenness scores. Calculating a split betweenness costs \(O(EV)\) operations, so each iteration of the algorithm takes \(O(EV^2 2^m)\) time. Finally, we have to recalculate all split betweennesses each time we remove an edge, yielding a total runtime of \(O(E^2V^2 2^m)\). In the worst case, on a connected graph, \(m = O(V)\) and \(E = O(V^2)\), so we have \(O(E^2V^2 2^m)=O(V^6 2^{V})\) operations. Even a single densely connected vertex takes this algorithm into exponential time.
Luckily, there are a few significant improvements that we can make. Split betweenness is clearly bounded above by vertex betweenness, the normalized number of paths through some vertex \(v\), because an imaginary edge within a vertex cannot possibly be involved in more shortest paths than the vertex as a whole. Furthermore, we can calculate vertex betweenness from edge betweenness, since any shortest paths going through an edge incident to some vertex must also contribute to the betweenness of the vertex itself.
We can calculate vertex betweenness from edge betweenness using the following equation, where \(E(v)\) is the set of edges incident to \(v\) and \(n\) is the number of vertices: \(c_B(v) = \frac{1}{2}\bigl(\sum_{e \in E(v)} c_B(e) - (n-1)\bigr)\).
In practice, filtering the vertices by vertex betweenness makes a big difference. However, calculating all split betweennesses for even a single vertex can still take exponential time. To fix this, Gregory introduces a greedy algorithm that makes use of yet another metric that he calls pair betweenness. While pair betweenness is not too useful by itself, it allows us to calculate split betweenness much faster. Pair betweenness is the normalized number of shortest paths that travel through some triplet \(u \rightarrow v \rightarrow w\). For every vertex \(v\) with \(m\) incident edges, there are \(\binom{m}{2}\) pair betweennesses that need to be calculated. If we use the original all-pairs shortest-paths algorithm, we can calculate the pair betweennesses at the same time as the edge betweennesses, in \(O(EV^2)\) time (though we’re trying to extend the optimized algorithm to do so in \(O(EV)\)).
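The edge-to-vertex relationship follows because every shortest path through \(v\) crosses two of its incident edges, while a path ending at \(v\) crosses one; with unnormalized scores on a connected graph (each unordered pair of vertices counted once), that gives \(c_B(v) = \frac{1}{2}\bigl(\sum_{e \in E(v)} c_B(e) - (n-1)\bigr)\). A quick sanity check of this identity with networkx:

```python
import networkx as nx

G = nx.karate_club_graph()  # a connected example graph
n = G.number_of_nodes()
edge_bc = nx.edge_betweenness_centrality(G, normalized=False)
vertex_bc = nx.betweenness_centrality(G, normalized=False)

for v in G:
    # sum the betweenness of every edge incident to v
    incident = sum(b for (s, t), b in edge_bc.items() if v in (s, t))
    derived = (incident - (n - 1)) / 2  # subtract paths that end at v
    assert abs(derived - vertex_bc[v]) < 1e-9
print("vertex betweenness matches the edge-betweenness formula")
```

This is a verification sketch, not part of CONGA itself; in the real algorithm the point is that the vertex scores come for free once the edge scores are known.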
We can represent the pair betweennesses of a single vertex \(v\) of degree \(k\) by constructing a \(k\)-clique where each vertex in the clique represents some neighbor of \(v\). The weight on each edge \(\{u, w\}\) is the pair betweenness of \(v\) for \(\{u, w\}\) (the normalized number of shortest paths through \(u\rightarrow v \rightarrow w\)).
Gregory explains the greedy algorithm using these four steps:
We are left with two vertices with one edge connecting them, where the edge weight is the split betweenness and the labels on each remaining vertex specify the split.
This procedure does not guarantee an optimal split, but Gregory asserts that it usually ends up close, and the greedy algorithm is much (much) more efficient. Our implementation is \(O(k^3)\), but a cleverer one that sorts the betweennesses could potentially use \(O(k^2\log k)\) operations. Compared with \(2^k\), we’re quite pleased.
We can modify CONGA to use the greedy algorithm and to filter by vertex betweenness as follows (steps taken from [3]):
Given these two optimizations, we now have an algorithm that runs in \(O(E^2V)\). In practice, runtime again depends heavily on the community structure of the graph, and how often vertices need to be split.
CONGA is a nice extension to Girvan-Newman, and it even has a cool name. But even with the optimizations, it is still painfully, brain-meltingly slow. What we really need is a significantly faster algorithm with a slightly cooler name. This brings us to Gregory’s next iteration of the algorithm, which fits into the CONGesque theme: CONGO, or Cluster-Overlap Newman Girvan Optimized.
CONGA spends almost all of its time calculating edge and pair betweennesses, because it has to calculate all shortest paths each time we want to recalculate a score. Since almost all contributions to betweenness tend to come from very short shortest paths, Gregory defines yet another class of betweenness: local betweenness.
We can calculate both edge betweenness and pair betweenness by only considering paths no longer than length \(h\). This is a much faster calculation, since any breadthfirst search needs only to traverse to \(h\) levels, rather than to the entire graph. This localization is the essence of CONGO.
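A sketch of the truncated calculation (an illustration, not Gregory’s exact code): the only change from the full breadth-first betweenness computation is that each search stops expanding once it reaches depth \(h\), so only paths of length at most \(h\) are counted.

```python
from collections import deque

def local_edge_betweenness(adj, h):
    """Edge betweenness counting only shortest paths of length <= h.
    adj: {vertex: set of neighbors}."""
    scores = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {s: 1}
        preds, order, queue = {}, [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            if dist[v] == h:  # the localization: stop expanding at depth h
                continue
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds.get(w, []):
                contrib = sigma[v] / sigma[w] * (1 + delta[w])
                scores[frozenset((v, w))] += contrib
                delta[v] += contrib
    return {e: b / 2 for e, b in scores.items()}

chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
         "d": {"c", "e"}, "e": {"d"}}
print(local_edge_betweenness(chain, 2)[frozenset(("b", "c"))])  # 3.0
```

On the five-vertex chain, the middle-left edge \(\{b, c\}\) lies on six shortest paths in total, but only three of them (\(b\!-\!c\), \(a\!-\!c\), \(b\!-\!d\)) have length at most 2, so the local score with \(h = 2\) is 3.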
When we’re only concerned with shortestpaths less than or equal to length \(h\), betweenness scores aren’t affected by an edge removal or a vertex split \(h + \epsilon\) away, where \(\epsilon\) is some small distance. This means that we only have to calculate all edge and pair betweenness scores once, then adjust the scores in the neighborhood of the modification every time a change is made.
To formalize that notion, Gregory defines the \(h\)-region of a modification to be the smallest possible subgraph containing all shortest paths of length no longer than \(h\) that pass through \(v\) (for a vertex split) or \(e\) (for an edge removal). The \(h\)-region of an edge \(e\) that traverses \(\{u, v\}\) is therefore the subgraph induced by the vertices lying on those short paths. When we remove an edge or split a vertex, we have to update the betweenness scores in its \(h\)-region. Before we modify the graph, we compute all shortest paths no longer than \(h\) that lie entirely within the region, and subtract those scores from the stored values. Then we can make our modification (remove an edge or split a vertex) and recalculate these local shortest paths and betweennesses. Finally, we add the recalculated scores back. This procedure updates the local betweenness scores without requiring a full traversal of the graph.
To experiment for yourself, I recommend trying out igraph and looking into some of the community detection algorithms that they’ve already implemented.
[1] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
[2] Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2), 163–177.
[3] Gregory, S. (2007). An algorithm to find overlapping community structure in networks. Knowledge discovery in databases: PKDD 2007, 91–102.
[4] Gregory, S. (2008). A fast algorithm to find overlapping communities in networks. Machine Learning and Knowledge Discovery in Databases, 408–423.
[5] Nicosia, V., Mangioni, G., Carchiolo, V., & Malgeri, M. (2009). Extending the definition of modularity to directed graphs with overlapping communities. Journal of Statistical Mechanics: Theory and Experiment, 2009(03), P03024.
[6] Zarei, M., Izadi, D., & Samani, K. A. (2009). Detecting overlapping community structure of networks based on vertex–vertex correlations. Journal of Statistical Mechanics: Theory and Experiment, 2009(11), P11013.
Oftentimes the hardest thing about building open source software is conveying what a project really does. The purpose of a project can easily be lost to any given audience due to a variety of factors, including mismatch of technical depth, misinterpreted jargon, or insufficient explanations. As in most fields, “seeing is believing” in the software world: actually using something is often the only way to solidify points made about the project. However, crafting a demo has remained difficult and time-consuming.
For too long, software developers have had few options to provide on-demand product demos, leaving many at the mercy of PowerPoint slides and vague discussions. By combining a few easy habits with open source technologies like Docker, individuals and enterprises alike can automate the creation of simple, intuitive, and reproducible software demos.
This week, our team launched try.lab41.org, which provides instances of our open source projects so users can kick the tires before committing to spinning up their own version. We encourage you to check out Try41 and let us know what you think. In this post, I’m going to walk through five steps we use at Lab41 to easily create repeatable, on-demand demonstrations of our open source projects.
The first step toward creating useful demonstrations is to craft multiple layers of documentation at the project level. There are many forms of documentation, from commenting on a single line of code to complete and verbose install instructions and everything in between. Two types we consistently use are markup-generated overviews of code, as well as README-style instructions for first-time installations.
One such example can be taken from Redwood, a framework we’ve been working on at Lab41 to identify anomalous files. Below is a breakdown of languages used in Redwood and how the number of lines of comments compares to the number of lines of code. As you can see, the bulk of the project was written in Python, and there is nearly a 1-to-1 ratio of comments to lines of code. Not bad.
Charts generated by Ohloh
While comments are great, it can often be tedious to hunt through the code and discern what a particular function does and how it is intended to behave. Most modern languages allow for a markup that can be used to generate beautiful, intuitive code documentation.
Looking at Hemlock, another project we’ve spent time on at the Lab, we can see how the markup works in practice.
"""
This module is the main core of Hemlock and interfaces with and controls the
majority of other modules in this package.
····
Created on 19 August 2013
@author: Charlie Lewis
"""
····
from clients.hemlock_base import Hemlock_Base
from clients.hemlock_debugger import Hemlock_Debugger
from clients.hemlock_runner import Hemlock_Runner
····
import hemlock_options_parser
····
import getpass
import json
import MySQLdb as mdb
import os
import requests
import sys
import texttable as tt
import time
import uuid
····
class Hemlock():
"""
This class is responsible for driving the API and the core functionality of
Hemlock.
"""
····
def __init__(self):
self.log = Hemlock_Debugger()
self.HELP_COUNTER = 0
····
def client_add_schedule(self, args, var_d):
"""
Adds a specific schedule to a specific client.
····
:param args: arguments to pass in from API
:param var_d: dictionary of key/values made from the arguments
:return: returns a list of the arguments supplied
"""
arg_d = [
'uuid',
'schedule_id'
]
return self.check_args(args, arg_d, var_d)
············
Snippet from Hemlock
The red comments enclosed in triple quotes are the markup lines for Python that can be used by tools like Sphinx to generate HTML documentation, as seen in the screenshot below.
Taken from Hemlock’s Documentation
That sort of documentation is great for fellow developers of the project, but what about the rest of us that just want to know how to install the project and get it up and running so that we can actually use the awesome tool? For those less familiar users, we at Lab41 ensure our projects always have a solid README to guide endtoend installation from an outsider’s perspective.
Having a well-thought-out README goes a long way and should not only explain the project’s intentions, but also include details like installation, dependencies, quick start, known issues, examples, and so on. Here we have the first page of the README for another project we’ve spent a fair amount of time on, called Dendrite, which provides a way to analyze and share graphs.
Taken from Dendrite’s README
There are many ways to document a project, and the more up-to-date and consistent the documentation is, the easier it will be to maintain in the future. More importantly, great documentation will help others get a sense of where the project stands and what it is expected to do.
We’ve often heard the saying, “It’s not a bug; it’s an undocumented feature!” The truth, however, is that if it’s not documented, it’s a bug. It may be hard at times to fit documentation into a schedule, but it can sometimes be just as valuable as the product itself, if not more so (user experience, etc.).
Teams often refactor code (restructure the program) to make it cleaner, less complex, and more intuitive as the project evolves. However, refactoring a project can potentially create unpredictable and unstable behavior.
To avoid unintended consequences during the process of refactoring, good test coverage of the code base can help give you peace of mind as you rework functions, syntax, formatting, or other general cleanup.
Testing is another one of those things, like documentation, that often gets left behind, forgotten, or deemed unimportant. To avoid this common target of neglect, we at Lab41 turn to a popular (and automated) testing framework. Beyond the obvious benefits of having tests that ensure a particular project’s code behaves as intended, we’ve found that tests are a great way to craft reproducible demonstrations that behave exactly as intended.
Below we can see the code coverage for several of our projects using Coveralls, which we have integrated with Travis CI (we will cover this tool in more depth in step 3) so that every time a build happens, we can not only ensure that the project builds, but also that the tests pass, automatically.
Taken from Coveralls
Here we see that specifically for HemlockREST, a RESTful server for the Hemlock project, the test coverage adjusts for most of the commits in the history, indicating that tests are being written alongside the code for the project.
Coverage for HemlockREST on Coveralls
As you can see, automated testing makes it, er … automatic to march forward with greater peace of mind and less effort. Another specific reason to write tests is to benefit others who want to contribute to a project. Basically, tests are a nice way to show others how the project is expected to operate, especially for those who haven’t contributed to the project yet or are not familiar with how everything is designed to work.
Unit tests are a great way to get started writing tests that will provide code coverage. Most languages have several different unit testing frameworks, including JUnit (Java), CUnit (C), and my personal favorite, py.test (Python). Combine these testing frameworks with tools like Cobertura, CodeCover, or Emma to generate reports on how well the unit tests covered the code in the project. Finally, feed those reports to Coveralls, and you’re left with automated code coverage tied to commit history as the project emerges.
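As a tiny illustration, here is what a py.test-style unit test might look like for a hypothetical argument-checking helper (both the helper and the file name are made up for this example, loosely modeled on the kind of function shown in the Hemlock snippet above):

```python
# test_check_args.py -- run with `py.test test_check_args.py`

def check_args(expected, var_d):
    """Hypothetical helper: return the expected argument values, in order,
    or None if any required argument is missing."""
    if any(name not in var_d for name in expected):
        return None
    return [var_d[name] for name in expected]

def test_returns_values_in_order():
    args = {"uuid": "u-123", "schedule_id": 7}
    assert check_args(["uuid", "schedule_id"], args) == ["u-123", 7]

def test_missing_argument_returns_none():
    assert check_args(["uuid", "schedule_id"], {"uuid": "u-123"}) is None
```

py.test discovers any function named `test_*` automatically, so adding coverage is as cheap as adding another small function like these.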
In concert with documentation, retaining and maintaining traceable testing for a project preps it nicely for the next step toward delivering demonstrations: building.
Project builds are important. Being able to build a project consistently, and furthermore, guarantee that it still builds in the expected manner as the project gets updated and evolves, is paramount to ensuring that the community has a positive experience getting the project up and running on their own.
One of the ways we ensure the project builds correctly with every change we make is by using a tool called Travis CI (“CI” refers to Continuous Integration). There are lots of CI solutions out there, but this one integrates nicely with GitHub and supports a large number of languages and services to build and test against.
Pull Request by @erickt for Dendrite showing green check mark of a passing build
Here we have a sample config file for Travis CI that tells Travis what it needs to build and test in order to verify that the new changes made don’t break any tests or intended build executions. We can set multiple targets; this one builds against both OpenJDK7 and OracleJDK7. We can specify which branches get built (or which ones don’t) as well as have before and after installation steps for things like dependencies and test reports.
language: java
jdk:
  - oraclejdk7
  - openjdk7
before_install:
  - source ./scripts/ci/dependencies.sh
install: mvn install
after_success:
  - mvn cobertura:cobertura coveralls:cobertura
branches:
  except:
    - gh-pages
notifications:
  email:
    - charliel@lab41.org
.travis.yml config file for Travis CI for Dendrite
That simple config file translates into a nice user interface that shows the progress, logs, and history of all builds for each specific project setup with Travis.
Travis CI status of Dendrite
Travis CI build history of Dendrite
Each PR (Pull Request) is built and tested against Travis before it gets merged, ensuring that no broken builds end up in Master, or whatever specific branch you’re intending the community to use to download and try your project. If the build breaks on the PR, it gives the contributor the opportunity to remedy the error before it gets pushed upstream, which keeps things clean and consistent for everyone.
However, jumping the gun before doing due diligence on steps 1, 2, and 3 can lead to unstable builds, irreproducible build errors, and next-to-impossible troubleshooting. For those of us working with open source projects, this can lead to general frustration for all. And there’s no quicker path to unused open source than when something doesn’t work due to lack of documentation, absence of testing, or unchecked build processes.
When you are ready to deploy, there are several great options that vary from generic to specific. Since our projects are all hosted on GitHub, we get one deployment path for free: tags.
GitHub tags for Hemlock
GitHub allows us to create tags associated at any particular point in the commit history to create a downloadable version of the project at that particular point in time. This can be great for prereleases, or even more official releases.
Release notes for prerelease 0.1.6 of Hemlock
Tags are very generic, letting one create downloadable source of anything at any given time, leaving the details of how to get it installed and running up to you.
Another more specific approach that can be used for deployment is PyPI, a Python-specific index for packages that can be automatically downloaded and installed via tools like pip and easy_install. There are many language-specific indices for packages, such as CPAN (Perl), RubyGems (Ruby), and Sonatype (Java).
Hemlock package hosted on PyPI
Sometimes project deployment requires many moving parts, multiple languages, and is more complex than just a single package. Docker, which we’ll go into in more detail in step 5, is a fantastic new technology for these complex cases. It provides developers with a simple way to create an environment, based on a simple configuration file, for running one or more processes inside a container. In addition to providing fine-grained resource utilization, this capability moves us faster towards the “build once, ship everywhere” Holy Grail for deploying across multiple machines. Using Docker, we are able to deploy trusted builds of each project that remain synced with GitHub as each project matures and evolves. All the end user has to do is issue a few easy commands to pull down the image and run it; the installation and setup is already baked into the container and ready to go.
Lab41 projects deployed on the Docker Index
Repeatability is the key to tying together demonstration and deployment. Pretty much every developer has run into the opposite (and unfortunate) situation: for example, having a demo that only runs on a specific laptop because of undocumented dependencies, untested hacks, an outdated operating system, or unspecific and mismatched build parameters. These all-too-common unreproducible factors really mean you have a weak prototype, not a demo. A demo should be something that can be shared and reproduced, not a Rube Goldberg machine:
Thanks to Docker, we can specify the exact environment(s), all of the required dependencies and their versions, and any other setup required for a given project. Through a simple configuration file, we can be assured that the next time someone builds that Docker container, it will do the exact same thing it did before, regardless of the state of the machine, and without any “gotchas” or undocumented hacks! Once that Docker specification (a Dockerfile) is created and deployed as a trusted build, repeatable demonstration of the project (Redwood in this case) becomes as simple as a single command:
docker run -d -P lab41/redwood
Below is an example of a Dockerfile for another project Lab41 has been working on called try41.
FROM ubuntu
MAINTAINER Charlie Lewis

ENV REFRESHED_AT 20140214
RUN sed 's/main$/main universe/' -i /etc/apt/sources.list
RUN apt-get update

# Keep upstart from complaining
RUN dpkg-divert --local --rename --add /sbin/initctl
RUN ln -s /bin/true /sbin/initctl

RUN apt-get install -y git
RUN apt-get install -y python-setuptools
RUN easy_install pip
ADD . /try41
RUN pip install -r /try41/requirements.txt
ADD patch/auth.py /usr/local/lib/python2.7/dist-packages/docker/auth/auth.py
ADD patch/client.py /usr/local/lib/python2.7/dist-packages/docker/client.py

EXPOSE 5000

WORKDIR /try41
CMD ["python", "api.py"]

Dockerfile for try41
You’re now ready to follow these five steps for demonstration and delivery:
So no more excuses. Get out there and deliver demos for your projects.
Recently, @schvin and I participated in Docker’s first 24-hour hackathon ahead of the first DockerCon. We were joined by 98 other tech junkies at the Docker offices, each teaming up to build their own awesome projects in the hopes of winning the highly coveted DockerCon tickets as well as a speaking slot at the conference. @schvin and I set off with an idea that spawned at the Monitorama conference we attended in May.
After seeing a presentation on logging by @torkelo around Grafana – which gives you all of the benefits one might get from Logstash and Kibana, but lets you use Graphite and StatsD to provide a more time-series-centric view of your logs – we decided that it would be cool if we could automatically ship logs from a Docker host, and the containers running on that Docker host, straight into Grafana without having to run extra services on the containers or enforce global changes on legacy containers.
Here is how we built it:
5:15 pm: We started the project at the kickoff of the hackathon with an initial commit to Dockerana.
Initial commit to Dockerana
For our idea to work, we had to overcome a number of hurdles related to three core aspects of Docker’s unique environment. First, we needed to ensure we could extract logs from both the host and its containers without having to modify the containers themselves. Second, in order to use Grafana, we had to incorporate several dependencies all within a Docker environment. Finally, we needed to come up with a standard format to wrap the Docker logs in so that the logs could be fed to Graphite and displayed with Grafana.
8:45 pm: Three and a half hours later, we had our initial working prototype: a single Docker container running Grafana and all of its dependencies, gleaning data from the Docker host, and shoving it into Graphite so that it was viewable via Grafana.
While it was a good start, there was still a lot of work to be done. The following three tasks required significantly more thought and tinkering:
For our Grafana setup, we required six processes to be running, which in Dockerland translates to six containers that all know how to properly communicate and share data with each other. Below are the six services, which together work as a cohesive Grafana system:
Here’s a roughly drawn diagram of how all of these technologies are wired up for Dockerana:
Here in Dockerland, we can see how to spin up some of the necessary containers and how they are interconnected both through data and communication.
# spin up carbon container with a volume
docker run -d \
  -p 2004:2004 \
  -p 7002:7002 \
  -v /opt/graphite \
  --name dockerana-carbon dockerana/carbon
# spin up a graphite container and connect the volume from carbon to it
docker run -d \
  --volumes-from dockerana-carbon \
  --name dockerana-graphite dockerana/graphite
# spin up an nginx container and link the networking exposed in graphite to it
docker run -d \
  -p 8080:80 \
  --link dockerana-graphite:dockerana-graphite-link \
  --name dockerana-nginx dockerana/nginx
An astute eye might notice that when the Graphite container is spun up, there does not appear to be any exposed networking specified for the Nginx container to link to. That is because we are not exposing any networking to the outside. Instead, we are using native Docker linking between containers through the Dockerfile. As you can see in the example below, the EXPOSE command allows those ports (in this case, 8000) to communicate between linked containers without being exposed to the outside world.
FROM ubuntu:trusty
MAINTAINER Charlie Lewis
RUN apt-get -y update
RUN apt-get -y install git \
    python-django \
    python-django-tagging \
    python-simplejson \
    python-memcache \
    python-ldap \
    python-cairo \
    python-twisted \
    python-pysqlite2 \
    python-support \
    python-pip
# graphite, carbon, and whisper
WORKDIR /usr/local/src
RUN git clone https://github.com/graphite-project/graphite-web.git
RUN git clone https://github.com/graphite-project/carbon.git
RUN git clone https://github.com/graphite-project/whisper.git
RUN cd whisper && git checkout master && python setup.py install
RUN cd carbon && git checkout 0.9.x && python setup.py install
RUN cd graphite-web && git checkout 0.9.x && python check-dependencies.py; python setup.py install
# make use of cache from dockerana/carbon
RUN apt-get -y install gunicorn
RUN mkdir -p /opt/graphite/webapp
WORKDIR /opt/graphite/webapp
ENV GRAPHITE_STORAGE_DIR /opt/graphite/storage
ENV GRAPHITE_CONF_DIR /opt/graphite/conf
ENV PYTHONPATH /opt/graphite/webapp
EXPOSE 8000
CMD ["/usr/bin/gunicorn_django", "-b0.0.0.0:8000", "-w2", "graphite/settings.py"]
Snippet from Graphite Dockerfile
If we then look at the Nginx configuration snippet below, we can see how it is using that link to proxy through the Graphite content to Grafana:
. . .
http {
    . . .
    server {
        listen 80 default_server;
        server_name _;
        open_log_file_cache max=1000 inactive=20s min_uses=2 valid=1m;
        location / {
            proxy_pass http://dockerana-graphite-link:8000;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
. . .
Snippet from Nginx configuration file
With all of those components working nicely, we just needed a process to collect and aggregate logs. That sounded like a good candidate for a container, so that's exactly what we did: we Dockerized that process into a simple container that runs a couple of scripts, which poll various parts of the Docker host to glean logs not only about the host but also about the containers running on it.
Here is our simple Dockerfile to build the container to do the log collection, where the primary driver is runner.sh:
FROM ubuntu:trusty
MAINTAINER George Lewis
RUN apt-get update
RUN apt-get install -y sysstat make
RUN perl -MCPAN -e 'install Net::Statsd'
ADD scripts/ingest.pl /usr/local/bin/
ADD scripts/loop.pl /usr/local/bin/
ADD scripts/periodicingest.sh /usr/local/bin/
ADD scripts/runner.sh /usr/local/bin/
CMD /usr/local/bin/runner.sh
Snippet from Main Dockerfile
Now for the fun part: displaying the data. We built a dashboard that is mostly centered around the events on the Docker host. In the future we hope to add automatic dashboards specific to containers, but there’s only so much two people can do in 24 hours.
Here are some screenshots of what the dashboard looks like, all dynamically configurable:
Note that here we can see the virtual network interfaces of each container as well as the host:
Finally, we wanted it to be easy to set up and repeatable by anyone running a Docker host. Below is a screencast showing just how simple it is to get Dockerana up and running:
DockerCon was a great event. We learned a lot, and got to see a lot of other interesting ideas around Docker that other groups worked on. You can find our presentation as well as those from the other groups that participated on the Docker Blog. We are definitely looking forward to the next hackathon.
We think graph analysis can add even more bang for the buck if tailored for a team environment, where colleagues could each experiment with techniques that could alter the structure of the graph. That way, workflows wouldn’t need to revolve around a single graph and colleagues could divide responsibilities, simultaneously test different theories, and follow intuition towards unknown outcomes. Basically, everyone would benefit from all the things that innovative teams do well.
But how could we prototype such a capability, especially since we’d need to link together multiple storage and analytics technologies? And how does it actually work under the hood to support multiple users, each of whom could be doing different or conflicting things?
Just as we can build off existing open source projects, it helps to build from a common workflow paradigm when thinking about collaborative graph experimentation. To work together, many analysts want to create different versions, track changes such as modifications or calculations, and selectively accept or reject those changes back into a shared version. Yes, what I’m describing is the same Track Changes feature everyone has come to know and love from Microsoft® Word®. Imagine if several people tried to edit a document without that feature: everyone trips over each other’s edits, making it painful to review and merge changes. That is exactly what happens with graphs, posing a huge problem for analysts.
To prototype collaboration around graphs, Dendrite borrows features from a technology that we developers use on a daily basis: distributed version control systems (such as Git). These systems enable teams of engineers to independently modify source code, collaboratively review updates, then selectively accept or reject changes. The paradigm fits so well that we even refer to this aspect of the project as Git for Graphs® (not really). The only problem is that such systems are not designed to handle Big Data-esque structures, so we actually pushed down the path of implementing custom Git-style features in something that could handle the scale and data type.
What scalable data type can inherently store relationships between projects, graphs, and versions? After several design and coding sessions, we developed something that we call a Metagraph. The concept is that Dendrite uses Titan, the scalable database behind its graphs, to store different versions and the associated metadata about each graph (let that sink in: Dendrite uses a graph to store data about graphs). Within the context of each project, users can create, modify, and clone different versions of a graph. They can even carve off a query-defined subset into a new graph for tailored analysis. In practice, these collaboration features support the essential actions of selectively incorporating data and experimenting with different hypotheses:
With this baseline of Metagraph services, Dendrite demonstrates how teams can use different graph versions—optionally configured with a Git-backed change log—for experimentation and a better workflow. Erick Tryzelaar, Lab41’s resident “Git whisperer” who designed and built the core of this component, rightly deserves credit for drawing these proven concepts into the graph space.
Like most prototypes, the collaboration features within Dendrite would benefit from a few performance optimizations, especially to decrease the storage footprint of multiple versions. Nevertheless, it is a good foundation upon which we will continue building capabilities for better collaboration in the space. If that sounds like a worthy pursuit, we welcome talented engineers to join (perhaps even by applying to work at Lab41) or simply drop us a line if you know of interesting work in this area.
Stay tuned in the coming weeks for additional posts that describe (perhaps even demo) additional facets of Dendrite and other Lab41 projects.
To understand how we got here, it helps to have a notion of how we work. At its core, Lab41 is a venue for collaboration among talented people from the private sector, academia, and government. Together, we develop prototypes for shared challenges in the Big Data space. Tackling points of overlap can be difficult, but in instances like this, it can be a uniquely effective way to advance capabilities.
Since our Lab is mission-driven, we started by examining the problem space of analysts using graph technologies. To put it mildly, some analysts have to answer very difficult questions. But looking at the problem space alone would have been insufficient. We also took time to consider workflow and communication needs, such as how colleagues can team up to tackle the ever-important Six Degrees of Kevin Bacon.
What we learned is that graphfocused analysts, like most teams in pretty much every industry, did not have a problem with technology alone. The market was already providing access to powerful graph databases, many elegant algorithms had been published through academia and open source, and the tailwinds behind Big Data had delivered robust analytic engines. But there was a core need to combine graph storage and analytic technologies into something a team could collaboratively use. Given the overlap among a variety of groups in different industries, we figured these goals were worth pursuing:
It’s easy to see that this combination of goals requires a full-stack approach. So you’re probably wondering, “What open source technologies did you use, already?” Glad you asked:
Graph Storage: The Aurelius team behind the Titan Distributed Graph Database built an impressive suite of capabilities that enables scale-free storage in either Berkeley DB for small datasets or HBase for horizontally scalable needs.
Graph Analytics: GraphLab is a powerful machine learning engine, which my fellow Lab41 contributor Charlie Lewis managed to execute on graphs from Titan by creatively leveraging its sister project Faunus. Being a Hadoop-based analytic engine, Faunus was also a natural fit for extra horsepower, which we rounded off with in-memory calculations using Java’s JUNG framework.
Information Retrieval: Developers of userfacing analytics must figure out a way to combine deep computational power, which takes time, with the interactivity we’ve come to expect from The Internets. We initially limited Elasticsearch to its standard search features, but now use it as the primary store for listing and visualizing both vertices and edges.
User Interface: A RESTful web server (using custom Spring MVC controllers paired with endpoints served through the Titan-compatible Rexster) followed principles of data-driven modularity. This design enabled us to build both AngularJS and command-line interfaces while also allowing others to swap in a different front end if desired.
I could go on into deep technical details, lessons learned, and future directions of the project, but my colleagues and I will save those topics for future posts, including one in the near future on the technical underpinnings of Dendrite’s collaboration features. For now, I’ll close out this overview with a few (hopefully) lasting impressions:
Initial feedback seems to validate our thoughts that graph technologies could gain wider adoption through co-integration and development of better workflow tools. We welcome contributors to join our project, but would also appreciate pointers to any work in this space.
If you really want to nerd out on graph technologies, consider attending GraphLab’s annual user conference in July. Our team is slated for an indepth talk on Dendrite.
There is still a lot of room for collaboration between the brainpower in academic research, talented commercial and open source engineers, and government partners with some very challenging problems. In our second year, Lab41 aims to continue cultivating our space as a venue for that type of participation. Contact us if you’re interested in learning more or getting involved.
At Lab41, we see graphs everywhere. Much of our work revolves around analyzing and generating natural graphs that have structural properties similar to those found in real-world settings. Such graphs could represent an arrangement of computers in a network, animals in a food chain, or neurons in your brain. Unlike randomly generated graphs, natural graphs have meaning. For example, characteristics of a system modeled by a graph can be deduced by calculating mathematical metrics such as its nodes’ degree (the number of edges connected to a node in a graph) or the number of triangles formed by its edges.
Working with natural graphs involves a number of challenges:
Obtaining natural graphs is hard. One must painstakingly collect a large dataset of real-world observations and connections, find a suitable way to interpret it as a graph, and then actually convert it into a graph – a process that can be tedious and time-consuming.
Datasets for natural graphs are scarce. There are only a small number of existing datasets representing natural graphs. In fact, at the recent GraphLab workshop, one speaker noted that he was getting tired of every presenter using the same dataset (articles and links between them on Wikipedia) for their analyses!
Synthetic graphs miss the mark. Graphs randomly generated according to standard models (as my colleague Charlie did in his previous post, and others have done using the Erdős–Rényi graph model) tend to look unnatural, no matter what parameters we use. We can’t just create natural graphs by taking a random number generator and going crazy. Instead, we need to find out what properties make a graph “natural,” and then find a way to effectively and efficiently generate graphs with those properties.
So, what makes a graph “natural”? While there is no hard-and-fast definition, nearly all natural graphs exhibit two simple properties:
Power-law degree distributions. A very small number of nodes have a very large number of connections (high degree), while a large number of nodes have a very small number of connections (low degree). Mathematically speaking, this means the degree of any vertex in the graph can be interpreted as a random variable that follows a power-law probability distribution.
Self-similarity. In natural graphs, the large-scale connections between parts of the graph reflect the small-scale connections within these different parts. Such a property also appears within fractals, such as the Mandelbrot or Julia sets.
An accurate mechanism for natural graph generation must preserve these properties. As it turns out, the stochastic Kronecker graph model does this. It has a few other advantages as well:
Parallelism. The model allows large graphs to be generated at scale via parallel computation.
Structural summarization. The model provides a very succinct, yet accurate, way to “summarize” the structural properties of natural graphs. Two Kronecker graphs generated with the same parameters will produce graphs with matching values for common structural metrics, such as degree distribution, diameter, hop number, scree value, and network value.
The remainder of this blog post will describe the basic Kronecker generation algorithm and how it can be modified to efficiently generate very large graphs via parallel computation, on top of MapReduce and Hadoop.
The core of the Kronecker generation model is a simple matrix operation called the Kronecker product, an operation on two matrices that “nests” many copies of the second within the first. Since graphs can be represented by adjacency matrices (Karthik’s post), this operation can be generalized to graphs.
Taking the Kronecker product of a graph with itself thus easily produces a new, self-similar graph, as does taking the more general “Kronecker power” of it. In fact, Kronecker powers will have further self-similarity. For example, below you can see an example of a simple three-node graph, its Kronecker cube, and its Kronecker fourth power, with the self-similarity evident in the adjacency matrix.
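As a quick sketch of this nesting behavior, here is the Kronecker power computed with NumPy (the 3-node matrix is illustrative, not necessarily the one shown in the figure):

```python
import numpy as np

# A small 3-node adjacency matrix (illustrative values)
G = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

# The Kronecker product "nests" copies of the second matrix inside the first:
# every 1 in G becomes a full copy of G, every 0 a block of zeros.
G2 = np.kron(G, G)    # Kronecker square: 9 x 9
G3 = np.kron(G2, G)   # Kronecker cube: 27 x 27, visibly self-similar
```

Because G[0,0] is 1, the top-left 3x3 block of the square is an exact copy of G – the self-similarity described above.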
Because the Kronecker power so easily generates selfsimilar graphs, it’s reasonable to consider that it might be similarly effective at generating random natural graphs. To do this, we simply start with an adjacency matrix, but allow probabilities to occupy the cells of the matrix rather than ones and zeros. This gives us the stochastic Kronecker graph model.
The simplest algorithm for generating Kronecker graphs is to use Kronecker powers to generate a stochastic adjacency matrix, and then step through each cell of the matrix, flipping a coin biased by the probability present in that matrix. In more detail, the algorithm is as follows:
We start with an \(n\) by \(n\) initiator matrix, \(\theta,\) and the number of iterations \(k\) for which we wish to run the algorithm. We compute the \(k\)th Kronecker power of the matrix \(\theta,\) giving us a large matrix of probabilities, which we call \(P.\) Each cell in this matrix corresponds to an edge between two nodes in the graph; the formula for the value at the \((u,v)\)th cell of \(P\) is: \[\prod_{i=0}^{k-1} \theta\left[\left\lfloor \frac{u}{n^i}\right\rfloor \bmod{n}, \left\lfloor \frac{v}{n^i}\right\rfloor \bmod{n} \right].\] (For convenience, we have assumed the matrix is zero-indexed, as is common in computer science.)
To generate the actual graph, we 1) step through each cell in the matrix, 2) take the probability in the cell, 3) flip a coin biased by that probability, and if the coin “comes up heads,” we 4) place the corresponding edge in the graph.
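The steps above can be sketched in a few lines of NumPy (the function name and the initiator values are ours, for illustration only):

```python
import numpy as np

def naive_kronecker_graph(theta, k, seed=0):
    """O(k * N^2) generation: build the full probability matrix,
    then flip one biased coin per cell."""
    rng = np.random.default_rng(seed)
    P = theta.copy()
    for _ in range(k - 1):
        P = np.kron(P, theta)        # N x N matrix of edge probabilities
    # Keep an edge wherever the biased coin "comes up heads"
    return np.argwhere(rng.random(P.shape) < P)

# A 2x2 initiator matrix (illustrative values)
theta = np.array([[0.9, 0.5],
                  [0.5, 0.1]])
edges = naive_kronecker_graph(theta, 4)  # graph on 2**4 = 16 nodes
```

Note that the full matrix P is materialized in memory here, which is exactly what makes this approach infeasible at billion-node scale.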
If the initiator matrix is an \(n\times n\) square matrix, and we perform \(k\) iterations of the Kronecker power operation, the generated matrix will have dimension \(N=n^k.\) We will need to take a product of \(k\) values to obtain each cell of the final matrix, and there will be \(N^2\) cells, so the runtime of this algorithm will be \(O(kN^2).\)
This means that if we want to generate a graph with approximately one billion nodes (a reasonable size for a large natural graph) from an initiator matrix of size 2, our runtime expression tells us we should expect to perform approximately \({(30)(10^9)^2 = 3.0\times 10^{19}}\) operations. That’s 30 quintillion operations. This leads us to wonder whether we could do this with fewer operations. Spoiler alert: it’s possible.
If we switch from a node-oriented approach to an edge-oriented approach, there is a faster algorithm for generating a Kronecker graph. Most natural graphs are sparse – \(E = O(N).\) Thus, if we can find a way to place each edge in the graph one at a time, rather than asking whether each pair of nodes has an edge between them, we can vastly reduce the average running time. To do this, we need to figure out how many edges are in the graph, and which nodes are associated with each edge.
It turns out that the expected number of edges in a stochastically generated Kronecker graph is encoded within the initiator matrix itself – it’s given by: \[E = \left(\sum_{i,j} \theta[i,j]\right)^k.\] In general, this works out to being on the order of the number of nodes.
Next, we need to find a procedure that starts from nothing, and in \(k\) iterations picks a new edge in the graph to add. Thankfully, this operation is already staring us in the face – in the formula presented in the previous section. Here it is again: \[\prod_{i=0}^{k-1} \theta\left[\left\lfloor \frac{u}{n^i}\right\rfloor \bmod{n}, \left\lfloor \frac{v}{n^i}\right\rfloor \bmod{n} \right].\] This formula can be understood in a different way – as a “recursive descent” into the adjacency matrix of the graph, picking smaller and smaller blocks of the matrix until we have finally narrowed our choice to a single cell, which we then “color in” to represent that an edge should be placed there.
Thus, to generate a stochastic Kronecker graph, all we need to do is set up a loop which runs \(E\) times, generating a new edge in the graph on each pass. (If we generate the same edge twice, we ignore it and repeat the pass as if nothing happened.) This runs in \(O(kE)\) time, which means that for sparse, real-world graphs, it runs in \(O(kN)\).
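A minimal single-machine sketch of this edge-by-edge algorithm (names and parameter values are ours; the real pipeline runs the descent inside MapReduce jobs, as described below):

```python
import numpy as np

def fast_kronecker_edges(theta, k, seed=0):
    """O(k * E) generation: place one edge per loop pass by recursively
    descending into blocks of the (never materialized) adjacency matrix."""
    rng = np.random.default_rng(seed)
    n = theta.shape[0]
    block_probs = (theta / theta.sum()).ravel()  # chance of entering each block
    num_edges = int(round(theta.sum() ** k))     # expected E = (sum of theta)^k
    edges = set()
    while len(edges) < num_edges:
        u = v = 0
        for _ in range(k):                       # descend k levels
            cell = rng.choice(n * n, p=block_probs)
            u = u * n + cell // n                # narrow to the chosen row block
            v = v * n + cell % n                 # ...and the chosen column block
        edges.add((u, v))                        # duplicates are simply retried
    return edges

theta = np.array([[0.9, 0.5],
                  [0.5, 0.1]])
edges = fast_kronecker_edges(theta, 8)           # graph on 2**8 = 256 nodes
```

Each pass costs only \(k\) block choices, and the full probability matrix is never built.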
This algorithm allows us to generate every edge in the graph independently of every other edge, allowing us to parallelize the graph’s generation. This means we can leverage the power of Hadoop to generate very large graphs.
The only twist is that this method allows for the creation of duplicate edges, and most of the graphs we’re interested in don’t contain such duplicates. Thus, we need to figure out how to identify and eliminate them. This is hard when generating the graph across multiple machines, because it’s very likely the duplicate edges will be generated on separate machines. Fortunately, with a bit of cleverness, we can leverage the nature of MapReduce to do our duplicate checking. Instead of one MapReduce job, we’ll have three – one to generate vertices, one to generate edges and eliminate duplicates, and one to combine the two together to form a single graph. This gives us the workflow below.
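The shuffle-based deduplication can be illustrated in miniature with plain Python, where a sort over hypothetical mapper output stands in for MapReduce's shuffle/sort phase:

```python
from itertools import groupby

# Hypothetical mapper output: directed edges as (u, v) keys, with duplicates
mapped = [(3, 7), (1, 2), (3, 7), (0, 5), (1, 2)]

# MapReduce's shuffle/sort phase places identical keys next to each other...
shuffled = sorted(mapped)

# ...so the reducer can emit each directed edge exactly once.
deduped = [edge for edge, _ in groupby(shuffled)]
print(deduped)  # [(0, 5), (1, 2), (3, 7)]
```

This is why duplicates generated on different machines are no problem: identical keys always land in the same reducer.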
The pipeline consists of three stages:
The first stage of our pipeline is vertex generation. This is the simplest stage – it is a map-only job, utilizing a custom input format representing a range of vertices to be generated. We use as the key a unique Long identifying the vertex, and a FaunusVertex object as the value, giving us a (Long, FaunusVertex) output sequence file.
The second stage of our pipeline is edge generation. As with vertex generation, it uses a custom input format representing a quota of edges to place into the graph. For each edge in this quota, we run the fast stochastic Kronecker placement algorithm, yielding a tuple of vertex IDs that represents a directed edge in the graph. This tuple is stored as a custom intermediate key type (called a NodeTuple), with the value as a NullWritable; this allows the shuffling and sorting logic of MapReduce to place identical tuples together, and consequently allows us to easily eliminate duplicate copies of the directed edges before the reduce step. Finally, in our reduce step, we emit a (Long, FaunusVertex) tuple. The FaunusVertex represents the edge’s source vertex and contains a FaunusEdge indicating its destination vertex. The Long key is the source vertex’s ID.
The third and final stage of our pipeline reads in the vertex objects generated by both the edge and vertex creators and combines them, creating a final list of FaunusVertexes that represents the graph.
A few details on the pipeline:
Faunus. This pipeline uses the same data types as the Faunus engine for graph analytics. Faunus provides objects representing edges (FaunusEdges) and vertices (FaunusVertexes) that can be serialized and utilized by MapReduce jobs but can also serve as a final representation of a graph. Conveniently, FaunusVertexes can store the edges coming off them as well, so we do not need to store edges separately from vertices in the final graph – we need only store the list of vertices with edges added to them.
SequenceFiles. This pipeline produces SequenceFiles (a native MapReduce serialization format) consisting of FaunusVertexes to serve as intermediate representations of the graph as we construct it.
Annotations. In the final stage, we annotate the vertices with several property values (a mixture of floatingpoints and strings) in order to mimic the data we are interested in.
We have written a version of this blog post up as an informal paper that can be viewed here. It contains a more indepth explanation of the mathematics behind Kronecker graphs.
A graph is a mathematical construct that describes things that are linked together. Graphs model networks – systems of “things” connected together in some way. For example, the Internet is a physical network of computers connected together by data connections; the Web is a logical network of web pages connected by hyperlinks; and human societies are networks of people connected together by various social relationships. In the language of mathematics, we say each of the “things” in a network is a node, and the nodes are connected by edges.
It turns out that you can think of much of the world, both physical and virtual, as a graph. As a mathematical construct, graphs have been around since Leonhard Euler tried to figure out the best way to get around Königsberg in 1735. Since then, graph theory has been embraced by a wide array of disciplines including sociology, economics, physics, computer science, biology, and statistics. An excellent resource for understanding how graphs map onto real world systems is the “Rosetta Stone for Connectionism,” which maps various real world systems onto graph concepts. Graphs really are everywhere.
While graphs are prevalent in many fields, the tools for working with graphs, especially large graphs, are still in their infancy. As graph technologies mature it should become easier to model many different problems, and easier to implement solutions. However, we are still figuring out the best ways to store, query, and compute on graphs. Right now, people use different data structures and technologies for different types of graphs and different use cases. Eventually, we need to figure out how to hide that complexity and let people treat graph data as graphs without thinking about what the right tools are for manipulating that data. Marko Rodriguez, a leading graph technologist, has a great summary of several different types of graphs and graph technologies in his recent blog post. Lab41 is actively using and working with many of the technologies that Marko describes, including the graph database Titan, which we load tested as noted in our previous blog entry.
The earliest tools for working with graphs were tools for manipulating matrices. In mathematics, graphs are frequently expressed as an adjacency matrix. In an adjacency matrix each row/column represents a node, and each entry in the matrix represents the presence of an edge between two nodes. The cool thing about the matrix form of a graph is that once you think of a graph as a matrix, you can apply concepts and methods from linear algebra to your graph analysis. Many common graph metrics and algorithms can easily be expressed in terms of standard matrix operations.
While an adjacency matrix is a mathematical abstraction, it’s also a data structure. In this post we are talking primarily about a matrix as a contiguous block of memory. In most programming languages this is an array of arrays:
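For example, in Python a small undirected graph stored as a list of lists might look like this (a reconstruction for illustration):

```python
# An adjacency matrix as an array of arrays -- here, a Python list of lists.
# adj[i][j] == 1 means there is an edge between node i and node j.
adj = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]

has_edge = adj[0][1] == 1  # O(1) lookup: are nodes 0 and 1 connected?
```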
From an engineering perspective, there are a number of advantages to storing a graph as a matrix, if the matrix representation of the graph fits in memory:
<table>
<tr>
<td> Operation </td>
<td> Adjacency Matrix </td>
<td> Adjacency List </td>
<td> Adjacency Tree </td>
</tr>
<tr>
<td> Insert </td>
<td> O(1) </td>
<td> O(1) </td>
<td> O(log(m/n))</td>
</tr>
<tr>
<td>Delete</td>
<td>O(1)</td>
<td>O(m/n)</td>
<td>O(log(m/n))</td>
</tr>
<tr>
<td>Find</td>
<td>O(1)</td>
<td>O(m/n)</td>
<td>O(log(m/n))</td>
</tr>
<tr>
<td>Enumerate</td>
<td>O(n)</td>
<td>O(m/n)</td>
<td>O(m/n)</td>
</tr>
</table>
Perhaps worst of all, adjacency matrices are very memory inefficient, taking O(n^2) memory. An adjacency matrix has an entry for each possible edge, which means each possible edge is using memory even if it does not exist. This is primarily a problem when working with sparse networks – networks where many of the edges in the network don’t exist. Unfortunately, most real world networks are sparse. For example, if you consider the graph of all people on earth, it is a sparse network because each person knows only a relatively small number of other people. Most of the possible relationships that could exist between people don’t exist. In some ways that is both an engineering problem and an existential problem.
To put the memory inefficiency of adjacency matrices into perspective, if each entry in the matrix is stored as a 32-bit (4-byte) integer, then the memory requirement for a graph is given by the following equation:
\[\text{memory in GB} = \frac{(\text{number of nodes})^2 \times 4}{1024^3}\]
Thus a graph of 50,000 nodes would take about 10GB of memory, which means you can’t store and manipulate large graphs like the graph of the Web, which is estimated to have 4.7 billion nodes, on most desktop computers.
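A quick back-of-the-envelope check of that formula in Python:

```python
nodes = 50_000
bytes_per_entry = 4                      # one 32-bit integer per possible edge
gb = nodes ** 2 * bytes_per_entry / 1024 ** 3
print(round(gb, 1))                      # roughly 9.3 GB -- "about 10GB"
```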
While technologies for dealing with small matrices – matrices that can fit in memory – are well developed, technologies for dealing with large matrices in a distributed manner are just emerging. One approach to dealing with extremely large matrices is to use some type of super computer, which has a lot of memory. Another approach is to swap portions of the matrix into and out of memory; there are algorithms that can do this relatively efficiently based on the unique properties of a matrix. A new, and extremely interesting, approach is to distribute a matrix computation across the memory of multiple computers. I think the Pegasus project is a particularly interesting example of the distributed matrix computation approach.
If you’re interested in learning more about networks and adjacency matrices, I would highly recommend taking a look at M.E.J. Newman’s Networks. He has an excellent discussion of the adjacency matrix as a mathematical concept in Chapter 6, and discussion of an adjacency matrix as a data structure in Chapter 9.
Also, keep an eye on this blog. I plan to address other data structures for storing graph data, and when they may (or may not be) appropriate in a future post.
This first post is intended to introduce the type of work we’ve started at Lab41, which is a unique partnership In-Q-Tel has started with academia, industry, and the U.S. Intelligence Community. We’re excited about this venture and look forward to sharing our progress towards collaboratively addressing big data challenges with new technologies.
Designing scalable systems for the real world requires careful consideration of data – namely, Big Data’s volume, variety, and velocity – to ensure the right pieces are engineered and valuable resources don’t miss the gotchas or edge cases that lead to insight. Basically, when tinkering with different architectures in the Big Data arena, having good data to test against is paramount. One of our projects involves assessing various architectures for working with largescale graphs, including how to incorporate data that tests the limits of storage, computation, and analytic workflows.
As you might expect, we wanted to use real-world data when designing our real-world system. However, getting real data that mimics production data is difficult and time-consuming. Oftentimes data only tells a story for that specific dataset, leading a developer to miss the more comprehensive view of the system’s strengths and weaknesses. For those reasons, we developed a method that generates large graphs with the “right” qualities of a system that can scale to one billion nodes.
Before comparing the leading projects for scaling graphs, we needed a good baseline for assessing the data requirements of the overall system. It quickly became apparent that current offerings such as Gephi, Network WorkBench, NetworkX, and Furnace all do a good job of following particular distributions and structural constraints. However, most of them are unable to generate graphs at large scale, produce the correct format, and build to completion in a reasonable amount of time.
The evaluation and assessment of graph data generators led us down the path of writing a fairly straightforward script. The code is very young in its development – and has room for a lot of improvement – but it has proven simple to use, moderately good at generating graphs large enough to test claims, and flexible enough to vary characteristics such as directedness, out-degrees, and of course numbers of edges, nodes, and attributes. We made sure to add a twist of randomness to avoid creating identical graphs.
The script takes command-line switches to configure the following graph characteristics:
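A minimal sketch of what such a command-line interface might look like (the option names here are hypothetical illustrations, not the script’s actual flags):

```python
import argparse

def build_parser():
    """Hypothetical CLI for a random attributed-graph generator."""
    p = argparse.ArgumentParser(description="Generate a random attributed graph")
    p.add_argument("--nodes", type=int, required=True, help="number of nodes")
    p.add_argument("--directed", action="store_true", help="emit a directed graph")
    p.add_argument("--min-degree", type=int, default=1)
    p.add_argument("--max-degree", type=int, default=10)
    p.add_argument("--min-node-attrs", type=int, default=2)
    p.add_argument("--max-node-attrs", type=int, default=2)
    p.add_argument("--min-edge-attrs", type=int, default=0)
    p.add_argument("--max-edge-attrs", type=int, default=0)
    p.add_argument("--format", choices=["graphml", "graphson"], default="graphml")
    p.add_argument("--strict", action="store_true",
                   help="enforce degree bounds on every node (slow)")
    return p

# Parse an explicit argument list rather than sys.argv, for demonstration.
args = build_parser().parse_args(["--nodes", "1000"])
print(args.nodes, args.max_degree)  # → 1000 10
```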
With our script in hand, we moved on to begin the requisite performance testing, but we first discovered an important consideration for anyone wishing to release our script into the wild.
The most important “practical” consideration proved to be enforcement (or not) of strict parameters, which forces the script to scan and verify characteristics of all nodes. By enforcing strict parameters, we mean that:
In order to guarantee this 100% of the time, each time an edge is added, all pre-existing edges must be checked to make sure that the randomly chosen vertices do not go outside the imposed limits. To put it in perspective, the script initially enforced strict parameters, which – as you can probably guess by now – simply becomes untenable for quickly producing large graph data sets. As the chart below shows, without strict enforcement we are able to generate a graph of 100 million nodes in roughly the same amount of time it took to generate a graph of only 100,000 nodes with strict parameter enforcement:
Since disabling strict enforcement led to a graph three orders of magnitude larger in the same amount of time, you might be asking how the absence of that check – the per-node degree limit – affected the number of edges. Below we show that the difference in edge counts between checking and not checking is negligible:
Since we are generating these graphs, it seems reasonable to bend the requirements slightly to treat the minima and maxima simply as guidelines that some nodes may not conform to. While there is still room for improvement, such as leveraging more than a single CPU core, the results are reasonable enough to use.
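To make the trade-off concrete, here is a toy sketch of the two modes (not Lab41’s actual script). One note: the original approach re-checked existing edges on every insertion; the sketch below keeps a running degree counter instead, which makes each bound check O(1) while still illustrating why strict enforcement rejects edges and relaxed mode does not:

```python
import random

def generate_edges(n_nodes, n_edges, max_degree=None, seed=0):
    """Toy edge generator. With max_degree set ("strict" mode), edges that
    would push either endpoint past the bound are rejected and re-drawn.
    Assumes n_edges is well below n_nodes * max_degree / 2, otherwise the
    rejection loop would never terminate."""
    rng = random.Random(seed)
    degree = [0] * n_nodes
    edges = []
    while len(edges) < n_edges:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u == v:
            continue  # no self-loops
        if max_degree is not None and (degree[u] >= max_degree or degree[v] >= max_degree):
            continue  # strict mode: reject and draw a new random pair
        degree[u] += 1
        degree[v] += 1
        edges.append((u, v))
    return edges

strict = generate_edges(100, 300, max_degree=10)   # bounded degrees
relaxed = generate_edges(100, 300)                 # degrees treated as guidelines
print(len(strict), len(relaxed))  # → 300 300
```

Both modes emit the same number of edges; the difference is only in where those edges land, which matches the negligible edge-count gap observed above.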
The most important point is that seemingly “simple” parameter changes – which represent actual differences in real-world networks – make huge differences to the resulting network and therefore our system design. We generated three different classes of graphs from a baseline of graph data sets to determine how varying parameters influences such important characteristics as: time to generate, number of edges created, storage footprint, number of node and edge attributes, and average degree of nodes.
Each graph was generated with an increasing value of nodes, while all other settings were static between generations, per graph type. Graph types A, B, and C – described below – will be used in the next couple of charts:
Graph Type:               A         B         C
Magnitude:                1K–1B     1K–100M   1K–100K
Format:                   graphml*  graphml*  graphml*
Directed:                 No        No        No
Minimum Degree:           1         1         1
Maximum Degree:           10        10        Same as number of nodes
Minimum Node Attributes:  2         50        2
Maximum Node Attributes:  2         100       2
Minimum Edge Attributes:  0         5         0
Maximum Edge Attributes:  0         25        0
*GraphML (http://graphml.graphdrawing.org/) is a convenient XML format that describes nodes in terms of names and types with labeled edges between nodes.
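For readers unfamiliar with the format, here is an illustrative sketch that builds a tiny two-node GraphML document with the Python standard library (node names and the attribute key are made up for the example):

```python
import xml.etree.ElementTree as ET

# Build a minimal GraphML document: one typed node attribute, two nodes,
# one undirected edge.
ns = "http://graphml.graphdrawing.org/xmlns"
ET.register_namespace("", ns)

root = ET.Element(f"{{{ns}}}graphml")
# Declare a string-valued node attribute named "label" with key id "d0".
ET.SubElement(root, f"{{{ns}}}key",
              {"id": "d0", "for": "node", "attr.name": "label", "attr.type": "string"})
graph = ET.SubElement(root, f"{{{ns}}}graph", {"id": "G", "edgedefault": "undirected"})

for node_id, label in [("n0", "alice"), ("n1", "bob")]:
    node = ET.SubElement(graph, f"{{{ns}}}node", {"id": node_id})
    data = ET.SubElement(node, f"{{{ns}}}data", {"key": "d0"})
    data.text = label

ET.SubElement(graph, f"{{{ns}}}edge", {"source": "n0", "target": "n1"})

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because every node and edge lives inside a single XML tree, a GraphML file must be parsed as one document, which is part of why parallel loading is awkward later on.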
The first chart illustrates how the number of nodes greatly influences all other characteristics. While Type A generated one billion nodes in approximately 24 hours, the same timeframe yielded graphs of Type B with only 10 Million nodes and Type C with a scant one million nodes:
As this is just a first cut, restricting the number of nodes on the graph types seems acceptable for now. The following illustrates that limiting Type C to only 100,000 nodes still produces almost the same number of edges as a one billion node version of graph A:
The following chart shows that a one billion node graph of Type B would require approximately 10TB of storage space, while Type C would require 300 Petabytes (!) to reach one billion nodes:
Sadly, I couldn’t justify buying 300 Petaracks just to generate the world’s most unrealistic graph. Not to mention it would have taken approximately 20,000 years to generate, but that’s beside the point.
When looking at attribute differences, Type B creates about 30–40 times more attributes than the fairly similar Types A and C:
Finally, this last chart shows how the degree of each node for Type C grows exponentially with the number of nodes, whereas the average degree for the other two graph types remains static:
Now that we generated our various graph datasets, we need to load them into a distributed graph data store. For a variety of reasons, we decided to use Titan with an HBase backend.
One of the nice things about Titan is that its Gremlin console shell enables graph interaction, traversals, and calculations. It also has functions for loading a graph file into the graph data store, which in this case is HBase on top of HDFS. Unfortunately, Gremlin through Titan does not leverage the awesomeness of MapReduce that generally goes hand-in-hand with HBase and its Hadoop counterparts. So running the import in parallel is currently impossible. In terms of data formats, Gremlin on Titan can load GraphML; however, the current ID scheme prevents federation of GraphML across multiple machines (or even multiple cores). So as you can see from the chart below, the load times are unspectacular:
*Note: ‘Minimal Degree and Attributes’ corresponds to ‘Graph A’ from above, similarly ‘Heavy Attributes’ and ‘Heavy Degree’ correspond to ‘Graph B’ and ‘Graph C’, respectively.
Fret not! Faunus to the rescue! Faunus uses a Gremlin shell, which is similar to Titan’s and one we can use for importing a slightly different data format to gain benefits via MapReduce. The next chart shows the benefits of moving from GraphML to loading the GraphSON* format using a MapReduce job:
*Note: the GraphSON variant used here, which is slightly modified from traditional GraphSON, is a one-record-per-line JSON-style format that describes nodes in terms of names, types, and labeled edges.
Our graph-generation code allows us to generate one-billion-node graphs with varying characteristics such as directedness, number of edges and nodes, and node degrees. Just as important, we determined how the different characteristics affect real-world considerations such as loading time and storage footprint, also finding an early optimization through MapReduce parallel processing. As we move to the next phase of designing around the data, we anticipate shortly being able to improve by at least a couple of orders of magnitude through fairly straightforward tweaks such as parallelizing load computations across a cluster. Of course this is only the first step, and there is a long, exciting road ahead of us.
As is usually the case, it should be noted that this is not necessarily representative of the technology’s overall performance characteristics, but rather our experience within a specific environment. We used the following environment for our tests:
Environment:
GRAPH GENERATION:
Run on a MacBook Air 2GHz Intel Core i7, 8GB 1600 MHz DDR3
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
LOADING GRAPHS:
Gremlin with Titan:
Virtual Machine, Ubuntu 12.04 LTS, 4 core processor, 8GB RAM
Started with the default Java options for Gremlin of:
JAVA_OPTIONS="-Xms32m -Xmx512m"
Then bumped that up as the graph file got larger to:
JAVA_OPTIONS="-Xms256m -Xmx4096m"
Gremlin with Faunus:
Virtual Machine, Ubuntu 12.04 LTS, 2 core processor, 4GB RAM
Hadoop/HBase Cluster:
12-node cluster: 8-core processors, 64GB RAM, CDH 4.2, heap set to 4GB, HBase 0.94.2