This first post is intended to introduce the type of work we’ve started at Lab41, which is a unique partnership In-Q-Tel has started with Academia, Industry, and the U.S. Intelligence Community. We’re excited about this venture and look forward to sharing our progress towards collaboratively addressing big data challenges with new technologies.
Designing scalable systems for the real world requires careful consideration of data – namely, Big Data’s volume, variety, and velocity – to ensure the right pieces are engineered and that valuable resources aren’t wasted on the gotchas and edge cases standing between you and insight. Basically, when tinkering with different architectures in the Big Data arena, having good data to test against is paramount. One of our projects involves assessing various architectures for working with large-scale graphs, including how to incorporate data that tests the limits of storage, computation, and analytic workflows.
As you might expect, we wanted to use real-world data when designing our real-world system. However, getting real data that mimics production data is difficult and time consuming. Oftentimes data only tells a story for that specific dataset, leading a developer to miss the more comprehensive view of the system’s strengths and weaknesses. For those reasons, we developed a method that generates large graphs with the “right” qualities for testing a system that can scale to one billion nodes.
Before comparing the leading projects for scaling graphs, we needed a good baseline for assessing the data requirements of the overall system. It quickly became apparent that current offerings such as Gephi, Network WorkBench, NetworkX, and Furnace all do a good job of following particular distributions and structural constraints. However, most of them cannot generate graphs at large scale, produce the correct output format, and build to completion in a reasonable amount of time.
The evaluation and assessment of graph data generators led us down the path of writing a fairly straightforward script. The code is very young in its development – and has room for a lot of improvement – but it has proven simple to use, moderately good at generating graphs large enough to test claims, and flexible enough to vary characteristics such as directed-ness, out-degrees, and of course numbers of edges, nodes, and attributes. We made sure to add a twist of randomness to avoid creating identical graphs.
The script takes command-line switches to configure the following graph characteristics:
- Number of nodes
- Degree of nodes
- In-degree of nodes
- Out-degree of nodes
- Number of node attributes
- Number of edge attributes
- Output type (GraphML and GraphSON so far)
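As a minimal sketch of what such a command-line interface might look like (the switch names here are illustrative, not necessarily our script's actual flags), Python's standard argparse covers every characteristic in the list:

```python
import argparse

def build_parser():
    # Flag names are illustrative; the actual script's switches may differ.
    p = argparse.ArgumentParser(
        description="Generate a random graph for scale testing")
    p.add_argument("--nodes", type=int, required=True,
                   help="number of nodes to generate")
    p.add_argument("--min-degree", type=int, default=1,
                   help="minimum degree per node")
    p.add_argument("--max-degree", type=int, default=10,
                   help="maximum degree per node")
    p.add_argument("--node-attrs", type=int, default=0,
                   help="number of attributes per node")
    p.add_argument("--edge-attrs", type=int, default=0,
                   help="number of attributes per edge")
    p.add_argument("--directed", action="store_true",
                   help="generate a directed graph")
    p.add_argument("--format", choices=["graphml", "graphson"],
                   default="graphml", help="output type")
    return p

args = build_parser().parse_args(["--nodes", "1000000", "--format", "graphson"])
```

With a parser like this, each graph characteristic above maps to exactly one switch, and the output type is constrained to the two formats supported so far.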
With our script in hand, we moved on to the requisite performance testing, but first we discovered an important consideration for anyone wishing to release our script into the wild.
The most important “practical” consideration proved to be enforcement (or not) of strict parameters, which forces the script to scan and verify characteristics of all nodes. By enforcing strict parameters, we mean that each node is guaranteed to have no more and no fewer edges attached to it than specified, from both an in-degree and an out-degree context.
In order to guarantee this 100% of the time, each time an edge is added, all preexisting edges must be checked to make sure that the randomly chosen vertices do not go outside the imposed limits. To put it in perspective, the script initially enforces strict parameters, which – as you can probably guess by now – simply becomes untenable for quickly producing large graph data sets. As the below chart shows, we are able to generate a graph of 100 million nodes in roughly the same amount of time it took to generate a graph of only 100,000 nodes using strict parameter enforcement:
Since disabling strict enforcement led to a graph three orders of magnitude larger in the same amount of time, you might be asking how the absence of that check – the degree of edges/node – affected the number of edges. Below we show that the difference of edges between checking and not checking is negligible:
Since we are generating these graphs, it seems reasonable to bend the requirements slightly to treat the minima and maxima simply as guidelines that some nodes may not conform to. While there is still room for improvement, such as leveraging more than a single CPU core, the results are reasonable enough to use.
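To make the trade-off concrete, here is a simplified sketch (not our actual script) of the two modes. A degree counter stands in for the full re-scan of preexisting edges; in strict mode, endpoints are re-drawn until both fall inside the degree limit, while relaxed mode accepts the first random pick:

```python
import random

def generate_edges(num_nodes, max_degree, num_edges, strict=True, seed=42):
    """Randomly wire up edges between nodes.

    A simplified sketch of the strict-vs-relaxed trade-off, not the
    actual Lab41 generator: a degree counter stands in for re-scanning
    all preexisting edges on every insertion.
    """
    rng = random.Random(seed)
    degree = [0] * num_nodes
    edges = []
    while len(edges) < num_edges:
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        if u == v:
            continue  # skip self-loops to keep the degree math simple
        if strict and (degree[u] >= max_degree or degree[v] >= max_degree):
            continue  # strict mode: re-draw until both endpoints fit the limits
        degree[u] += 1
        degree[v] += 1
        edges.append((u, v))
    return edges, degree

# Strict mode honors the cap exactly; relaxed mode treats it as a guideline.
strict_edges, strict_degree = generate_edges(100, 5, 150, strict=True)
relaxed_edges, relaxed_degree = generate_edges(100, 5, 150, strict=False)
```

In the real script the strict check scans all existing edges, so its cost grows with the edge count; dropping the verification makes each edge constant-time work, which is where the speedup above comes from.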
Generating the Baseline
The most important point is that seemingly “simple” parameter changes – which represent actual differences in real-world networks – make huge differences to the resulting network and therefore our system design. We generated three different classes of graphs from a baseline of graph data sets to determine how varying parameters influences such important characteristics as: time to generate, number of edges created, storage footprint, number of node and edge attributes, and average degree of nodes.
Each graph was generated with an increasing value of nodes, while all other settings were static between generations, per graph type. Graph types A, B, and C – described below – will be used in the next couple of charts:
| | Graph Type A | Graph Type B | Graph Type C |
|---|---|---|---|
| Magnitude | 1K - 1B | 1K - 100M | 1K - 100K |
| Format | graphml* | graphml* | graphml* |
| Directed | No | No | No |
| Minimum Degree | 1 | 1 | 1 |
| Maximum Degree | 10 | 10 | Same number as nodes |
| Minimum Node Attributes | 2 | 50 | 2 |
| Maximum Node Attributes | 2 | 100 | 2 |
| Minimum Edge Attributes | 0 | 5 | 0 |
| Maximum Edge Attributes | 0 | 25 | 0 |
*GraphML (http://graphml.graphdrawing.org/) is a convenient XML format that describes nodes in terms of names and types, with labeled edges between nodes.
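As a rough illustration of the format (not our generator’s actual output), a minimal GraphML document can be emitted with nothing but Python’s standard library:

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges):
    """Serialize a node list and an edge list into a minimal GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")
    for n in nodes:
        ET.SubElement(graph, "node", id=str(n))
    for i, (u, v) in enumerate(edges):
        ET.SubElement(graph, "edge", id="e%d" % i,
                      source=str(u), target=str(v))
    return ET.tostring(root, encoding="unicode")

xml_text = to_graphml([0, 1, 2], [(0, 1), (1, 2)])
```

A real generator at the billion-node scale would stream elements to disk rather than build the whole tree in memory, but the element structure is the same.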
Number of Nodes
The first chart illustrates how the number of nodes greatly influences all other characteristics. While Type A generated one billion nodes in approximately 24 hours, the same timeframe yielded graphs of Type B with only 10 Million nodes and Type C with a scant one million nodes:
As this is just a first cut, restricting the number of nodes on the graph types seems acceptable for now. The following illustrates that limiting Type C to only 100,000 nodes still produces almost the same number of edges as a one billion node version of graph A:
The following chart shows that a one billion node graph of Type B would require approximately 10TB of storage space, while Type C would require 300 Petabytes (!) to reach one billion nodes:
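A quick back-of-envelope check (assuming the chart’s round numbers) shows why Type C explodes. Dividing each projected total by one billion nodes gives the implied per-node footprint; Type C’s is dominated by edges, since every node’s degree scales with the node count:

```python
# Back-of-envelope check of the projected storage footprints, using the
# round totals quoted above for a one-billion-node graph.
TB = 10 ** 12
PB = 10 ** 15
NODES = 10 ** 9  # one billion nodes

type_b_total = 10 * TB    # ~10 TB projected for Type B
type_c_total = 300 * PB   # ~300 PB projected for Type C

per_node_b = type_b_total / NODES  # ~10 KB per node (heavy attributes)
per_node_c = type_c_total / NODES  # ~300 MB per node (degree grows with graph size)
```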
Sadly, I couldn’t justify buying 300 Petaracks just to generate the world’s most unrealistic graph. Not to mention it would have taken approximately 20,000 years to generate, but that’s beside the point.
When looking at attribute differences, Type B creates about 30-40 times more attributes than the fairly-similar Types A and C:
Finally, this last chart shows how the degree of each node for Type C grows with the number of nodes, whereas the average degree for the other two graph types remains static:
Loading the Data
One of the nice things about Titan is that its Gremlin console shell enables graph interaction, traversals, and calculations. It also has functions for loading a graph file into the graph data store, which in this case is HBase on top of HDFS. Unfortunately, Gremlin through Titan does not leverage the awesomeness of MapReduce that generally goes hand-in-hand with HBase and its Hadoop counterparts, so running the import in parallel is currently impossible. In terms of data formats, Gremlin on Titan can load GraphML; however, the current ID scheme prevents federation of GraphML across multiple machines (or even multiple cores). So as you can see from the chart below, the load times are unspectacular:
*Note: ‘Minimal Degree and Attributes’ corresponds to ‘Graph A’ from above, similarly ‘Heavy Attributes’ and ‘Heavy Degree’ correspond to ‘Graph B’ and ‘Graph C’, respectively.
Fret not! Faunus to the rescue! Faunus provides its own Gremlin shell, similar to Titan’s, which we can use to import a slightly different data format and gain the benefits of MapReduce. The next chart shows the benefits of moving from GraphML to loading the GraphSON format (a JSON-based graph format) using a MapReduce job:
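Part of why GraphSON parallelizes well is that it is line-oriented: one JSON object per vertex, so HDFS can split the file across mappers. A sketch of emitting it is below; the field names (`_id`, `_outE`, `_inV`) follow the Faunus adjacency-list convention and should be verified against the Faunus version in use:

```python
import json

def to_graphson_lines(adjacency):
    """Emit one JSON object per vertex, adjacency-list style.

    Field names (_id, _outE, _inV) follow the Faunus GraphSON
    convention; check them against the Faunus version in use.
    """
    lines = []
    for vertex_id, out_neighbors in sorted(adjacency.items()):
        record = {
            "_id": vertex_id,
            "_outE": [{"_inV": v, "_label": "linked"} for v in out_neighbors],
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines

lines = to_graphson_lines({0: [1, 2], 1: [2], 2: []})
```

Because each vertex is self-contained on its own line, a MapReduce input split never cuts a record in half, which is what lets the load fan out across the cluster.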
Our graph-generation code allows us to generate one-billion-node graphs with varying characteristics such as directed-ness, numbers of edges and nodes, and node degrees. Just as important, we determined how the different characteristics affect real-world considerations such as loading time and storage footprint, and we found an early optimization through MapReduce parallel processing. As we move to the next phase of designing around the data, we anticipate improving load times by at least a couple of orders of magnitude through fairly straightforward tweaks such as parallelizing load computations across a cluster. Of course this is only the first step, and there is a long, exciting road ahead of us.
As is usually the case, it should be noted that this is not necessarily representative of the technology’s overall performance characteristics, but rather our experience within a specific environment. We used the following environment for our tests:
GRAPH GENERATION:
- MacBook Air, 2GHz Intel Core i7, 8GB 1600 MHz DDR3
- Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin

LOADING GRAPHS:
- Gremlin with Titan: virtual machine, Ubuntu 12.04 LTS, 4-core processor, 8GB RAM
  - Started with the default Java options for Gremlin: JAVA_OPTIONS="-Xms32m -Xmx512m"
  - Then bumped that up as the graph file got larger: JAVA_OPTIONS="-Xms256m -Xmx4096m"
- Gremlin with Faunus: virtual machine, Ubuntu 12.04 LTS, 2-core processor, 4GB RAM
- Hadoop/HBase Cluster: 12-node cluster, 8-core processors, 64GB RAM, CDH4.2, heap set to 4GB, HBase 0.94.2