Lab41 recently released <a href="https://github.com/lab41/circulo" target="_blank">Circulo</a>, a python framework that evaluates *Community Detection Algorithms*. As background, a community detection algorithm partitions a *graph* containing vertices and edges into clusters or communities. Vertices can be entities such as people, places or things; while the relationships among these entities are the edges. Behind the scenes, community detection algorithms use these relationships (edges), their strength (edge weights), and their direction (directedness) to conduct the partitioning. The partitioning is significant because it often can provide valuable insight into the underlying raw data, revealing information such as community organizational structure, important relationships, and influential entities.
Circulo becomes especially important in circumstances where community detection algorithms fail to present clear and consistent results. One of the more prominent examples of this is the case where algorithms executed against the same dataset produce variable results relative to membership, size, execution time, or number of communities. For example, ten different algorithms can produce ten different results, all at different rates. This level of variation puts a researcher into the difficult position of having to choose among the results, without much guidance as to which of the results most accurately applies to the circumstances. Can varying results combine to form a global best result? Does the type of input data affect which algorithm to use? If an algorithm takes too long to execute, is using a fast algorithm sufficient? Do different definitions of a community determine the algorithm to use? Is there such thing as a best result?
Circulo enables researchers to try and answer these questions by giving them an efficient platform to conduct data collection, analysis, and experimentation. The framework calculates a variety of quantitative metrics on each resulting community. It can validate a partitioning against some predefined ground truth or can compare two different partitions to each other. This data can be used to draw conclusions about algorithm performance and efficacy. And best of all, it is completely modular so that a researcher can add new algorithms, metrics, and experiments with ease.
To help explain the Circulo framework, we will use the flight routes data obtained by openflights.org. This dataset is one of 14 that we use for testing in Circulo. In this example, the airports are nodes, and the routes between airports are the edges. The resulting graph is both directed (since flights travel in a direction from one airport to another) and multi-edged (since numerous routes may exist from one airport to another) and contains 3,255 vertices and 67,614 edges. The reasons to employ community detection against this type of data could range anywhere from an airline debating where to build its next hub, to trying to identify a new route to an underserved region, or to developing a plan to reroute regional airport traffic to a different hub. A clearer understanding of how flight routes can divide airports into clusters could lead to better informed decisions.
The Circulo Data Pipeline
Circulo execution can be divided into a three stage pipeline. Generally speaking, the inputs to the pipeline are a collection of algorithms and a collection of datasets. What comes out are JSON encoded files containing numeric metric results. Each algorithm/data pair produces a partitioning, and each partitioning produces a set of metrics. The first stage is the Data Ingestion stage, where raw data is extracted, transformed, and loaded (ETL) into a graph and serialized into a GraphML file. The second stage is the Algorithm Execution stage, where one or more algorithms are executed against the graph. And finally, the third stage is the Metric Execution stage, where the results of the previous stage are evaluated against a variety of metrics.
Stage 1: Data Ingestion
The primary purpose of the Data Ingestion stage is to provide the remaining two stages with a consistent, known graph input format. In many ways, this stage therefore serves as a tool to convert any raw data format into the expected input format of downstream stages. To accomplish this, a researcher needs to subclass the provided CirculoData base class. We have chosen igraph as the primary framework for representing a graph in memory, and graphML as the primary serialization format for storing the graph to disk. Both igraph and graphML were selected for the following reasons:
- igraph offered a few key benefits that made it an attractive choice. First and foremost, in addition to the standard vertex and edge manipulation typically provided in a graph framework, igraph also provides a suite of community detection algorithms. That means that not only can we leverage these algorithms out of the box, but also that igraph already has a notion of many of the key data structures and functions one would want when creating an evaluation framework. For example, igraph includes a Dendrogram for divisive results, a VertexClustering for non-overlapping results, and a VertexCover for overlapping results. At a lower level, a vertex membership array maintains which vertices belong to which communities. igraph also provides many evaluation functions such as cohesion, degree distribution, and triad_census to name a few. igraph, which is written in C, also has decent performance, is well-documented, and provides wrappers for both Python and R.
- graphML is a widely used graph serialization format, enabling Circulo to be compatible with numerous other graph related technologies. graphML is also written in XML which is a standard serialization format, making it reliable and easy to parse using any number of third party tools.
For the flights data, the Data Ingestion stage begins with the execution of functionality provided by the FlightsData class. Each new dataset must subclass the CirculoData class as we do with FlightsData, which provides the base functionality to download the data, convert it into a graph, identify ground truth from labels when available, then serialize the graph as graphML. The raw data for flights includes two CSV files:
flights.csv(e.g. “Goroka”,“Goroka”,“Papua New Guinea”,“GKA”,“AYGA”,-6.081689,145.391881,5282,10,“U”,“Pacific/Port_Moresby”)
routes.csv(e.g. “AF,137,ATL,3682,ILM,3845,Y,0,CRJ 319”)
Below are both the vertex (node) and edge representations of the previous CSV lines from flights and routes in graphML:
GraphML (Airport Example)
1 2 3 4 5 6 7 8 9 10 11 12 13
GraphML (Route Example)
1 2 3 4 5 6 7 8 9 10 11
By default, Circulo includes 14 datasets to enable a researcher to quickly evaluate community detection algorithms out of the box. The project contains information about these default datasets, including how to add additional datasets.
|amazon||Co-purchasing Data – http://snap.stanford.edu/data/bigdata/communities|
|house_voting||2014 congress (house) voting data – https://www.govtrack.us/developers/data|
|flights||Flight Route Data – http://openflights.org/data.html|
|football||NCAA D1A games - http://www-personal.umich.edu/~mejn/netdata|
|karate||Famous data set of Zachary’s karate club - http://www-personal.umich.edu/~mejn/netdata/|
|malaria||Amino acids in malaria parasite – http://danlarremore.com/bipartiteSBM|
|nba_schedule||Games played in the 2013-2014 NBA season – https://github.com/davewalk/2013-2014-nba-schedule|
|netscience||Graph of collaborators on papers about network science – http://www-personal.umich.edu/~mejn/netdata/|
|pgp||Interactions in pretty good privacy – http://deim.urv.cat/~alexandre.arenas/data/xarxes/|
|revolution||Graph representing colonial American dissidents – https://github.com/kjhealy/revere.git|
|school||Face-to-face interactions in a primary school – http://www.sociopatterns.org/datasets/primary-school-cumulative-networks/|
|scotus||Supreme court case citation network – http://jhfowler.ucsd.edu/judicial.htm|
|senate_voting||2014 congress (senate) voting data – https://www.govtrack.us/developers/data|
|southern_women||bipartite graph of southern women social groups – http://nexus.igraph.org/api/dataset_info?id=23&format=html|
Stage 2: Algorithm Execution
The second stage of the Circulo pipeline is running the community detection algorithms against each dataset. To run all algorithms against the flights dataset, a researcher would do the following:
Circulo will run each algorithm/dataset pair in parallel, serializing the job name, dataset name, iteration (running algorithm/dataset pair multiple times), VertexCover membership array, total elapsed time of execution, and alterations to the filesystem in JSON as shown in the following example:
1 2 3 4 5 6 7 8 9
The VertexCover membership array above indicates through 0-based indexing that vertex 0 belongs to communities 56 and 1, vertex 1 belongs to community 56, vertex 2 belongs to community 28, etc.
Though a researcher could add any algorithm for evaluation, the framework by default comes with 15 algorithms, with implementations from Lab41 (Conga, Congo, Radicchi Strong, Radicchi Weak), igraph (infomap, fastgreedy, Girvan Newman, leading eigenvector, label propagation, walktrap, spinglass, multilevel) and SNAP–Stanford Network Analysis Project (BigClam, Coda, Clauset Newman Moore). More information about each algorithm can be found in the Lab41 Community Detection Survey.
When an algorithm executes against a dataset, the Algorithm Execution stage first attempts to match that dataset as best as possible to the input parameters of that algorithm. Given that algorithms may vary in how they use weighted, multi, and directed edges, it is necessary to conform a dataset to an algorithm to enable proper execution and maximize algorithm efficacy. In this manner, Circulo operates with a Best Chance methodology when executing algorithms –we provide the algorithm with the best circumstances so that it can have the best chance at finding a solution. This transformation process can sometimes be as simple as changing all directed edges to undirected edges, however in other cases, it can be more difficult. igraph provides all the transformation functionality through the functions simplify and to_undirected, which both have the ability to collapse edges, which is necessary when, for example, simplifying a multi-edge graph.
The flights dataset (directed, multigraph, unweighted) is an excellent example of how these transformations can occur. To illustrate this, we highlight how the flights data will change when encountering the following two algorithms:
- The fastgreedy algorithm accepts an undirected, simple (opposite of multigraph), and weighted graph – basically the opposite of the flights data. To make the dataset compatible, Circulo will first convert all directed edges to undirected edges, collapsing those edge pairs between two nodes into a single edge when each goes in an opposite direction. Because we want to preserve the number of edges that are collapsed, Circulo will first add an edge weight of 1 to each edge in the graph. Therefore, two opposite directed edges between two nodes will collapse into a single undirected edge with a weight of 2. Then finally, Circulo will convert any remaining multi-edges into a single edge by collapsing those edges and summing the weights. The result is an undirected, simple, weighted graph.
- The bigclam algorithm accepts an undirected, simple, and unweighted graph. Unlike fastgreedy, bigclam cannot rely on edge weights to distinguish between collapsed edges. As a result, a decision must be made to prune or not prior to sending the collapsed graph to the algorithm. Because this logic largely depends on the data, we expect the researcher to provide that logic in the implementation of the CirculoData class. The prune function will be called in one of two circumstances: (1) the graph collapses and the algorithm does not support weights or (2) the graph collapses and becomes nearly complete. The default prune function does nothing. The FlightsData class will prune all edges less than the median + 0.0001. Another issue that arises with the bigclam algorithm is that it has a variety of optional parameters for fine-tuning. To deal with this case, Circulo can pass a data context for each dataset, where the data provider can specify parameters for specific algorithms that it might encounter.
Once data is transformed and executed against algorithms, a researcher can already start to experiment with some of the results. The following experiment highlights a distinct difference between Label Propagation and Infomap when applied against the flights dataset. We will use the graph visualization tool Gephi to view the results overlaid onto a map using the geo-coordinates of the airports.. Each color in the figures below represents a community as determined by the respective algorithm (the colors vary between the two figures because gephi randomizes the colors).
The visualization confirms that both algorithms are presenting accurate partitions if one assumes that locality is a valid source of ground truth for flight data. However, if one were to look closely, the results do vary in regards to the degree of locality. For example, Label Propagation treats most of the US and Mexico as a single community while Infomap treats them as separate communities. One could surmise that perhaps Infomap presents a more detailed view of communities whereas Label Propagation a more general one–information valuable when using the algorithms in the future, especially when ground truth such as geo-coordinates is not available.
<a class="fancybox-effects-a" href=/images/post_9_circulo/label_propagation_flights.jpg><img src="/images/post_9_circulo/label_propagation_flights.jpg" title="Label Propagation Community Detection Results for Flight Data" ></a> <p style="text-align:center"><small>_Label Propagation Results_</small></p>
<a class="fancybox-effects-a" href=/images/post_9_circulo/infomap_flights.jpg><img src="/images/post_9_circulo/infomap_flights.jpg" title="Infomap Community Detection Results for Flight Data" ></a> <p style="text-align:center"><small>_Infomap Results_</small></p>
Stage 3: Metric Execution
Before we proceed, it is important to identify the two major data points known at this point in the data pipeline: the vertex membership produced by the algorithm and the algorithm execution time. With this data alone, a researcher could likely come to various conclusions: (1) the effectiveness of algorithms by comparing resulting memberships with ground truth memberships, (2) the variance of membership results by comparing memberships produced from different algorithms, and (3) the accuracy of algorithms as a function of time by including elapsed execution time. So why probe further into the data? Why not just accept that vertex membership and time are enough?
Generally speaking, if more data is available, there are more opportunities to come to better conclusions. When an algorithm draws a boundary, and a community is formed, the graph does actually change. Yes, it has the same vertices, and yes it has the same edges, but it now has a third element: the communities themselves. Communities interact with other communities. Communities have ecosystems within them. It is not just that a boundary sets apart vertices in a graph, it is also that it redefines how relationships among vertices can collectively be viewed. All of these facets of the communities can provide further insight beyond just the vertex membership.
One notable benefit of metrics is that we can now better define what it means to be a good community. For example, a good community might be one that is isolated from the rest of the graph, a metric known as conductance. Now, we can identify those algorithms that minimize the conductance metric or discover other metrics that correlate with conductance. We could even use individual metrics to help distinguish individual communities amongst themselves. What is the most isolated community? What is the most dense community?
In Circulo, we have identified the following community metrics:
|Cut Ratio||TLU–Local Clustering CoefficientMax||TLU–Local Clustering CoefficientBiased Kurtosis|
|Degree StatisticsMax||TLU–Local Clustering CoefficientMedian||Average Out Degree Fraction|
|Internal Number Edges||Transitivity Undirected (Global Clustering Coefficient)||Diameter|
|Conductance||Degree StatisticsUnbiased Variance||Separability|
|Triangle Participation Ratio||Cohesiveness||TLU–Local Clustering CoefficientUnbiased Variance|
|Fraction over a Median Degree||TLU–Local Clustering CoefficientMin||Degree StatisticsMedian|
|Degree StatisticsMean||Degree StatisticsSize||Degree StatisticsBiased Kurtosis|
|TLU–Local Clustering CoefficientSize||Internal Number Nodes||Degree StatisticsBiased Skewness|
|Density||Flake Out Degree Fraction||Expansion|
|TLU–Local Clustering CoefficientMean||Maximum Out Degree Fraction||Normalized Cut|
|Degree StatisticsMin||TLU–Local Clustering CoefficientBiased Skewness|
Descriptions of each of these metrics can be found in the igraph documentation and the paper “Evaluating Network Communities based on Ground-truth,” by Jaewon Yang and Jure Leskovec.
To run the metrics against a given vertex membership, one would do the following:
The algorithm_results_path is the directory containing the JSON encoded results from stage 2. The metric_results_output_path is the path to the directory where the JSON encoded metrics results will be saved. For example, by running the metrics suite against the infomap/flights result, the following JSON would be created:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
The omega index is a measurement of how similar the partition is to some predefined ground truth, if available. In the case of the flights data, we used the countries of the airports. The metrics_elapsed is the time for the stage to complete. The membership is a carry over of the membership from the Algorithm Execution stage. The metrics is divided into two sub-sections for each metric: (1) results–the actual score for each community indexed by the community id, and (2) aggregations: aggregations of the results.
Though each metric has the potential to provide a valuable perspective on the resulting communities, we will only focus on conductance and density in detail here for the sake of example. Conductance is the ratio of edges leaving the community to the total number of edges. One could consider conductance as a measure of how much a community conducts its energy to the rest of the graph. Density is a measure of the ratio of edges inside a community to the number of possible edges in a community. Vertices belonging to dense communities will have multiple edges connecting them to other vertices in the community.
Using the flights example once again, a researcher for the airline industry might have concluded that the infomap algorithm tends to find communities with low conductance and high density based on previous analysis of the algorithm against numerous datasets. Because the researcher is trying to find the airline new opportunities for expanding into underserved markets, a good community in this case is one that is isolated from other regions (low conductance) and has high internal traffic (high density). When applying the metrics to the results of infomap/flights, we see the following results:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
We can see that the infomap algorithm discovered 156 communities, with an average density of ~0.45 and an average conductance of ~0.61. The individual scores are documented in the results section, where for instance, the conductance of each of the first 3 communities is approximately 0.51, 0.67, and 0.66. Though the moderate average scores of both conductance and density suggest that, overall, many of the existing airports optimally serve their regions; interestingly, there exists at least one community with a conductance of 0.00029 and at least one community with a density of 1.0. Given that in our example, a researcher is trying to identify underserved regions, the results of infomap might still be worth a more detailed analysis.
The big question that remains is, “What’s Next?” Circulo provides the pipeline for efficiently gathering metrics based on community detection algorithm execution, but where is the value?
The value, we would argue, is hidden in the metrics data. Before, when we asked the questions about determining which algorithms to use in which situations, we really had no place to start aside from crude qualitative observations. Now when researchers ask these questions, they can leverage the Circulo framework to produce quantitative metrics to serve as the impetus for further experimentation. We have started this experimentation on our Circulo experiments page. From here, we hope to add more algorithms, include more metrics, and build a variety of experiments that will drive a better understanding of community detection and the numerous algorithms that encompass it. We also hope that you can help make this happen by contributing to Circulo in the future.