
# Exploring the CONGO

Here at Lab41, we don’t just see graphs; we also investigate their interesting and useful properties. Recently, the lab has been evaluating how well community detection algorithms apply to graphs of varying size and structure. These algorithms attempt to find groups of vertices that seem related to one another, and could therefore be considered communities in the larger network.

When a graph can be split into groups of nodes that are densely internally connected, we say that those groups are communities and that the graph has community structure. Natural graphs often have this property, and as we’ve mentioned before, natural graphs tend to be the most illuminating.

While there has been a fair amount of research focused on community detection since Michelle Girvan and Mark Newman published their seminal paper in 2002, we still lack a proper understanding of which algorithm(s) work best for a given graph structure and scale.

There are two classes of community detection algorithms. Some algorithms find overlapping communities, while others partition the graph into distinct, non-overlapping communities. The Girvan-Newman algorithm is the canonical example of the latter group. In this post, we’ll discuss an evolution of the Girvan-Newman algorithm into newer algorithms called CONGA and CONGO, and eventually try to find out whether the structure of a graph impacts CONGO’s performance.

## Girvan-Newman Algorithm

It is a good idea to fully digest Girvan-Newman before delving into derivations such as CONGA and CONGO.

Girvan and Newman introduced an idea called edge betweenness centrality, or edge betweenness. The edge betweenness of an edge $e$ is defined as the number of shortest-paths that pass through $e$, normalized by the number of shortest-paths between each pair of vertices. Girvan and Newman argue that edges with high betweenness scores tend to be inter-community edges, because a high edge betweenness hints that the popular edge acts as a bottleneck between two communities. An examination of the inter-community edges in the figure below makes this intuition obvious.

If we accept that the edge with the highest betweenness tends to divide communities, we see that repeatedly removing that edge might be a good way to partition a network. To do so, the Girvan-Newman algorithm takes the following steps:

1. Calculate betweenness scores for all edges in the network.
2. Find the edge with the highest score and remove it from the network.
3. Recalculate betweenness for all remaining edges.
4. Repeat from step 2 until all edges have been removed.
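The steps above can be sketched in pure Python for small graphs. This is our own illustrative code, not a reference implementation: `edge_betweenness` uses the one-BFS-per-vertex accumulation trick, and the driver repeatedly deletes the highest-scoring edge.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness with one BFS per source and Brandes-style
    dependency accumulation. adj maps each vertex to a set of neighbours."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}   # shortest-path counts from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:   # w sits one BFS level below v
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):            # accumulate from the leaves back
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2 for e, b in bet.items()}  # each pair was counted twice

def girvan_newman_removals(adj):
    """Steps 1-4: repeatedly remove the edge with the highest betweenness,
    yielding the edges in removal order."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    while any(adj.values()):
        u, v = max(edge_betweenness(adj).items(), key=lambda kv: kv[1])[0]
        adj[u].discard(v)
        adj[v].discard(u)
        yield u, v
```

On a barbell-shaped graph (two triangles joined by one bridge), the bridge carries every cross-community shortest path, so it is the first edge removed, which immediately splits the graph into its two obvious communities.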

Natural graphs can grow very large. It’s not uncommon to want to find insight about a graph with millions or even billions of nodes. Consequently, it’s important to be able to compare the expected performance of algorithms without looking at the exact number of machine instructions or even writing code. One way to do this is by leveraging asymptotic (also known as Big-O) notation. An easy way to think of Big-O notation is to take an expression, discard the constant factors, and keep only the fastest-growing term. For example, $.01x^3 + 950x^2\log x + 3 = O(x^3)$, since even though $x^3$’s constant is the smallest, it is still the dominating factor as $x$ increases.

If some function $f(x)$ grows no faster than another function $g(x)$, we say that $f(x) \in O(g(x))$ or $f(x) = O(g(x))$. A bit more formally, $f(x) = O(g(x))$ if and only if there exist constants $C$ and $x_0$ such that $f(x) \le C g(x)$ for all $x > x_0$. In other words, no matter how much larger $f(x)$ is for small values of $x$, $g(x)$ will eventually catch up. Big-O notation is an incredibly useful tool to quickly compare algorithms and find out how much performance depends on the size of the input.
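As a quick numeric check of the earlier example (our own toy code), the ratio $f(x)/x^3$ stays bounded and settles toward the leading constant $0.01$ as $x$ grows:

```python
import math

def f(x):
    # the example expression from the text: 0.01 x^3 + 950 x^2 log x + 3
    return 0.01 * x**3 + 950 * x**2 * math.log(x) + 3

# f(x) / x^3 is bounded above (C = 1 already works for x >= 10**4),
# and it approaches the leading constant 0.01 from above as x grows
ratios = [f(10 ** k) / (10 ** k) ** 3 for k in (4, 6, 8)]
```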

On a graph with $|V|$ vertices and $|E|$ edges, it would seem that calculating all of the betweenness centralities would require $O(|E||V|^2)$ time, because shortest paths must be found between all $|V| \times (|V| - 1) / 2 = O(|V|^2)$ pairs of vertices, each using a breadth-first search that costs $O(|E|)$. Luckily, Newman and Brandes independently describe an $O(|E||V|)$ algorithm for betweenness centrality that requires a single breadth-first search from each vertex. This shortcut accumulates each edge’s contribution during the breadth-first searches themselves, yielding the edge betweennesses without explicitly enumerating every shortest path.

An algorithm like Girvan-Newman’s that repeatedly divides the graph is known as a divisive algorithm. A divisive algorithm on a graph usually returns a dendrogram, a specialized type of tree. A dendrogram is a memory-efficient data structure that describes the history of the algorithm: it stores a list of the ways small communities merge to make larger ones, until the entire graph is one big community. We can even derive the historical list of divisions that the algorithm made by inspecting the list of merges. Furthermore, a dendrogram can be split at any level to find a single clustering (a set of clusters) that contains the desired number of communities. When the number of communities is known, the dendrogram can easily be split at the appropriate level. When the optimal number of communities is unknown, we use a metric like modularity to determine which clustering is the “best” one.
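One hypothetical way to store such a merge history is a list of pairs, where each merge of clusters $a$ and $b$ creates a new cluster with the next unused id; cutting the dendrogram at a level is then a short loop (names and representation are our own):

```python
def cut_dendrogram(n, merges, k):
    """Split a dendrogram over n leaves into exactly k communities.
    merges lists (a, b) pairs of cluster ids; each merge creates a new
    cluster whose id is the next unused integer (n, n + 1, ...)."""
    clusters = {i: {i} for i in range(n)}  # every leaf starts alone
    next_id = n
    for a, b in merges:
        if len(clusters) == k:
            break                          # cut here: k communities remain
        clusters[next_id] = clusters.pop(a) | clusters.pop(b)
        next_id += 1
    return list(clusters.values())
```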

Ostensibly, the Girvan-Newman algorithm runs in $O(|E|^2|V|)$ time, since the betweennesses must be recalculated for each edge removal. However, since the betweenness only needs to be recalculated in the component in which an edge has just been removed, the algorithm is much more tractable on graphs with strong community structure that split quickly into several disconnected components.

An example of a graph with strong community structure is the graph of character interactions in Victor Hugo’s Les Miserables. The following figure shows how Girvan-Newman partitions that graph.

## Overlapping Communities

While Valjean’s cohorts in Les Mis seem to partition nicely into their own communities, actors in real networks are often members of multiple communities. Most of us belong to multiple social groups. In fact, almost any real-world network has at least some overlapping community structure.

Most existing algorithms partition networks into non-overlapping communities, but there has been a recent push to design an effective overlapping community detection algorithm.

Zachary’s Karate Club is a famous network representing feuding factions at a karate club. Non-overlapping community detection algorithms provide a great deal of insight, but at the cost of forcing each student into a single faction, even if they belong to multiple. The figure on the left is a partitioning performed by a non-overlapping algorithm like Girvan-Newman, and the figure on the right is a clustering that allows for overlap. Of course, this is a toy example in which we’ve limited the number of communities to two, but it’s not hard to imagine a very complex network with many communities and vertices that belong in several.

For whatever reason, network scientists have been exclusively naming their overlapping algorithms using acronyms. CODA, CESNA, BIGCLAM, and CONGA are all algorithms that attempt to discover overlapping communities in a graph. Today, we’ll briefly explore Steve Gregory’s CONGA, or Cluster-Overlap Newman Girvan Algorithm.

## CONGA

In CONGA, Gregory defines a concept called split betweenness. Imagine splitting a vertex into two parts, such that each part keeps some of the original edges. Then the split betweenness is the edge betweenness of an imaginary edge between the two new vertices (represented as a dashed line in the figure below).

Since the split betweenness of this imaginary edge is calculated the same way as the edge betweenness of a real one, comparing the two values is an entirely legitimate operation. CONGA splits the vertex instead of removing an edge when the maximum split betweenness is greater than the maximum edge betweenness. Vertices can be split repeatedly, so a single vertex in the original graph can end up in an arbitrary number of communities. This property gives us the overlapping community structure that we were looking for.

A naive version of CONGA would simply calculate the split betweenness for every possible split of every vertex. The algorithm would then look like this:

1. Calculate all edge and split betweenness scores.
2. If the maximum split betweenness is greater than the maximum edge betweenness, split that vertex at the optimal split.
3. Otherwise, delete the edge with the maximum edge betweenness.
4. Recalculate all edge and split betweenness scores.
5. Repeat from step 2 until all edges have been removed.

Each time the graph splits, vertices are assigned to one more community than before. Since we don’t know the optimal number of communities, we have to somehow store the historical community assignments before continuing the algorithm. Because CONGA is a divisive algorithm, we would hope to be able to use a dendrogram to store the results. However, the overlapping structure of the result set means that such a data structure wouldn’t make much sense. Instead, our version of CONGA returns a list of all of the community assignments that the algorithm generates.

This version of CONGA is simple, but it’s also intractable with more than a handful of nodes. To see why, assume that each vertex has $m$ incident edges. Then we can split each vertex $2^m$ different ways, since we can choose any subset of edges to be split away to the new vertex, and any set with $m$ elements has $2^m$ subsets. Since we have $|V|$ vertices, we need to calculate $|V|\times 2^m$ split betweenness scores. Calculating a split betweenness costs $O(|E||V|)$ operations, so each iteration of the algorithm takes $O(|E||V|^2 2^m)$ time. Finally, we have to recalculate all split betweennesses each time we remove an edge, yielding a total runtime of $O(|E|^2|V|^2 2^m)$. In the worst case, on a connected graph, $m = O(|V|)$ and $|E| = O(|V|^2)$, so we have $O(|E|^2|V|^2 2^m)=O(|V|^6 2^{|V|})$ operations. Even a single densely connected vertex takes this algorithm into exponential time.
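To make the blow-up concrete, here is a small helper (our own) that enumerates every split of a vertex’s neighbours into two non-empty, unordered groups; there are $2^{m-1}-1$ of them, still exponential in $m$:

```python
from itertools import combinations

def candidate_splits(neighbors):
    """Enumerate every division of a vertex's neighbours into two
    non-empty, unordered groups: 2**(m - 1) - 1 splits for m neighbours."""
    nbrs = list(neighbors)
    anchor, rest = nbrs[0], nbrs[1:]   # pin one neighbour to the left side
    for r in range(len(rest)):         # how many of the others join it
        for extra in combinations(rest, r):
            yield {anchor, *extra}, set(rest) - set(extra)
```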

Luckily, there are a few significant improvements that we can make. Split betweenness is clearly bounded above by vertex betweenness, the normalized number of paths through some vertex $v$, because the imaginary edge within a vertex cannot possibly be involved in more shortest-paths than the vertex as a whole. Furthermore, we can calculate vertex betweenness from edge betweenness, since any shortest-paths going through an edge incident to some vertex must also contribute to the betweenness of the vertex itself.

We can calculate vertex betweenness from edge betweenness using the following equation, where $n$ is the number of vertices in the graph:
$VertexBetweenness(v) = \frac{1}{2} \left(\sum\limits_{e \in incident(v)} EdgeBetweenness(e) - (n-1)\right)$
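A quick numeric check of this identity, using the hand-computed edge betweennesses of a 4-vertex path graph (the function name is ours):

```python
def vertex_betweenness(edge_bet, incident, n):
    """Gregory's identity: paths through v contribute to two incident edges,
    while the n - 1 paths that merely end at v contribute to one each."""
    return 0.5 * (sum(edge_bet[e] for e in incident) - (n - 1))

# path graph 0-1-2-3: edge betweennesses are 3, 4, 3 (worked out by hand)
eb = {frozenset((0, 1)): 3.0, frozenset((1, 2)): 4.0, frozenset((2, 3)): 3.0}
```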

In practice, filtering the vertices by vertex betweenness makes a big difference. However, calculating all split betweennesses for even a single vertex can still take exponential time. To fix this, Gregory introduces a greedy algorithm that makes use of yet another metric that he calls pair betweenness. While pair betweenness is not too useful by itself, it allows us to calculate split betweenness much faster. Pair betweenness is the normalized number of shortest-paths that travel through some triplet $u \rightarrow v \rightarrow w$. For every vertex $v$ with $m$ incident edges, there are $\binom{m}{2}$ pair betweennesses that need to be calculated. If we use the original all-pairs-shortest-paths algorithm, we can calculate the pair betweennesses at the same time as the edge betweennesses, in $O(|V|^2|E|)$ time (though we’re trying to extend the optimized algorithm to do so in $O(|V||E|)$).

We can represent the pair betweennesses of a single vertex $v$ of degree $k$ by constructing a $k$-clique where each vertex in the clique represents some neighbor of $v$. The weight on each edge $\{u, w\}$ is the pair betweenness of $v$ for $\{u, w\}$ (the normalized number of shortest paths through $u\rightarrow v \rightarrow w$).

Gregory explains the greedy algorithm using these four steps:

1. Choose edge $\{u,w\}$ with minimum score.
2. Coalesce $u$ and $w$ to a single vertex, $uw$.
3. For each vertex $x$ in the clique, replace edges $\{u,x\}$, score $b_1$, and $\{w,x\}$, score $b_2$, by a new edge $\{uw,x\}$ with score $b_1+b_2$.
4. Repeat from step 1 $k-2$ times (in total).

We are left with two vertices with one edge connecting them, where the edge weight is the split betweenness and the labels on each remaining vertex specify the split.
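Here is a sketch of this greedy coalescing (our own representation: we track groups of neighbour labels and recompute the summed pair betweenness between groups, rather than literally rewiring the clique):

```python
def greedy_split(pair_bet):
    """Coalesce the minimum-score pair of groups until two groups remain.
    pair_bet maps frozenset({u, w}) to the pair betweenness of u -> v -> w."""
    groups = [frozenset([x]) for x in sorted({x for e in pair_bet for x in e})]

    def score(a, b):
        # summed pair betweenness across two groups: the coalesced edge weight
        return sum(pair_bet.get(frozenset((u, w)), 0.0) for u in a for w in b)

    while len(groups) > 2:
        a, b = min(((a, b) for i, a in enumerate(groups) for b in groups[i + 1:]),
                   key=lambda ab: score(*ab))
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    # the surviving edge weight is the split betweenness of this split
    return groups[0], groups[1], score(groups[0], groups[1])
```

Merging the weakest pairs keeps the strongest pair betweennesses across the final cut, which is exactly the high-betweenness split CONGA is looking for.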

This procedure does not guarantee an optimal split, but Gregory asserts that it usually ends up close, and the greedy algorithm is much (much) more efficient. Our implementation is $O(k^3)$, but a cleverer one that sorts the betweennesses could potentially use $O(k^2\log k)$ operations. Compared with $2^k$, we’re quite pleased.

We can modify CONGA to use the greedy algorithm and to filter by vertex betweenness as follows (steps taken from [3]):

1. Calculate all edge betweenness scores.
2. Calculate all vertex betweenness scores, using the equation described above.
3. Find the set of vertices such that each member’s vertex betweenness is greater than the maximum edge betweenness. If the set is empty, remove the edge with the maximum betweenness and skip to step 7.
4. For each member of the set, calculate pair betweennesses of each vertex by examining shortest paths.
5. Calculate the maximum split betweenness and optimal split using the greedy algorithm outlined above.
6. Remove edge with maximum edge betweenness or split vertex with maximum split betweenness (if greater).
7. Recalculate edge betweenness for all remaining edges in same component(s) as removed edge or split vertex.
8. Repeat from step 2 until no edges remain.

Given these two optimizations, we now have an algorithm that runs in $O(|E|^2|V|)$ time. In practice, runtime again depends heavily on the community structure of the graph, and how often vertices need to be split.

CONGA is a nice extension to Girvan-Newman, and it even has a cool name. But even with the optimizations, it is still painfully, brain-meltingly slow. What we really need is a significantly faster algorithm with a slightly cooler name. This brings us to Gregory’s next iteration of the algorithm, which fits into the CONG-esque theme: CONGO, or Cluster-Overlap Newman Girvan Optimized.

## CONGO

CONGA spends almost all of its time calculating edge and pair betweennesses, because it has to calculate all shortest-paths each time we want to recalculate a score. Since almost all contributions to betweenness tend to come from very short shortest-paths, Gregory defines yet another class of betweenness: local betweenness.

We can calculate both edge betweenness and pair betweenness by only considering paths no longer than length $h$. This is a much faster calculation, since each breadth-first search needs to traverse only $h$ levels, rather than the entire graph. This localization is the essence of CONGO.

When we’re only concerned with shortest-paths of length at most $h$, betweenness scores aren’t affected by an edge removal or a vertex split more than $h$ away. This means that we only have to calculate all edge and pair betweenness scores once, then adjust the scores in the neighborhood of the modification every time a change is made.

To formalize that notion, Gregory defines the $h$-region of a modification to be the smallest possible subgraph containing all shortest-paths that pass through $v$ (for a vertex split) or $e$ (for an edge removal) of length no longer than $h$. The $h$-region of edge $e$ that traverses $\{u, v\}$ is the subgraph with the set of vertices:
$\{w : d(u, w) < h \vee d(v, w) < h\}$
where $d(a, b)$ is the length of the shortest path between $a$ and $b$. Similarly, the $h$-region of vertex $v$ is the subgraph with the set of vertices:
$\{w : d(v, w) \le h\}$
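Both region definitions are just truncated breadth-first searches; a minimal sketch (helper names are ours):

```python
from collections import deque

def within(adj, source, limit):
    """All vertices w with d(source, w) <= limit: a BFS cut off at depth limit."""
    dist = {source: 0}
    q = deque([source])
    while q:
        v = q.popleft()
        if dist[v] >= limit:
            continue                 # do not expand past the horizon
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return set(dist)

def vertex_region(adj, v, h):
    # {w : d(v, w) <= h}
    return within(adj, v, h)

def edge_region(adj, u, v, h):
    # {w : d(u, w) < h or d(v, w) < h}; strict, so truncate the search at h - 1
    return within(adj, u, h - 1) | within(adj, v, h - 1)
```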

When we remove an edge or split a vertex, we have to update the betweenness scores in its $h$-region. Before we modify the graph, we compute all shortest paths no longer than $h$ that lie entirely within the region, and subtract those scores from the stored values. Then we can make our modification (remove an edge or split a vertex) and recalculate these local shortest paths and betweennesses. Finally, we add these recalculated scores back. This procedure updates the local betweenness scores without requiring a full traversal of the graph.

• CONGA: Gregory’s original paper [3]
• CONGO: Gregory’s follow-up paper [4]
• To experiment for yourself, I recommend trying out igraph and looking into some of the community detection algorithms that they’ve already implemented.

## References

[1] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.

[2] Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2), 163-177.

[3] Gregory, S. (2007). An algorithm to find overlapping community structure in networks. Knowledge discovery in databases: PKDD 2007, 91-102.

[4] Gregory, S. (2008). A fast algorithm to find overlapping communities in networks. Machine Learning and Knowledge Discovery in Databases, 408-423.

[5] Nicosia, V., Mangioni, G., Carchiolo, V., & Malgeri, M. (2009). Extending the definition of modularity to directed graphs with overlapping communities. Journal of Statistical Mechanics: Theory and Experiment, 2009(03), P03024.

[6] Zarei, M., Izadi, D., & Samani, K. A. (2009). Detecting overlapping community structure of networks based on vertex–vertex correlations. Journal of Statistical Mechanics: Theory and Experiment, 2009(11), P11013.