# { blog }

## innovation through collaboration

### Beyond Community Detection - RolX

In the Circulo project, Lab41 has researched, compared, and implemented several powerful methods for community detection in large graphs. As part of analyzing tightly-connected nodes within a network, we have also evaluated ways to discover the structural roles of nodes. This goal, role detection, is complementary to community detection and can provide additional insight for several applications and industries. A method of particular interest to Lab41 is described by Henderson, Gallagher, et al. (2012) in their paper, “RolX: Structural Role Extraction & Mining in Large Graphs,” as it can reveal roles such as central connectors, bridges between groups, or members of a clique. The following post highlights the intuition and general mechanics of the algorithm, which we implemented in Python as part of our broader community detection effort.

Conceptual illustration of network analysis using community and role detection

### Introduction

Network modeling is an extremely powerful analytic method employed across a massive variety of disciplines. As Robbie mentioned in his recent post, one of the most useful techniques in the network analysis domain is community detection, which breaks down large networks into smaller component parts. The earliest models of community detection viewed these breakdowns as partitions of the network; however, as our collective understanding has matured, we have come to see communities as organic, unrestricted groupings of vertices that can range anywhere from complete overlap to complete exclusion.

This nuanced understanding of communities has several powerful applications in the real world. For example, we can find different clusters of protein interactions within cells, identify how fans of sports teams relate to each other, or understand the influence of different scientists in their collaborations. However, community detection algorithms produce groups of related nodes without distinguishing them relative to each other, leaving several meaningful real-world questions unanswered without further analysis. Fortunately, this field of research has advanced upon the idea that community structure is not the only construct that lends itself to graph analytics.

Going one step further, roles of individual nodes can also be gleaned from the structure of the graph. Combined with community detection, role detection can add crucial insight when looking for key information such as:
• Who are the most important, most connected figures - the “key influencers”?
• Which members are highly connected to multiple close-knit communities, forming “bridges” in the network?
• What distinguishes communities with almost no connections to the rest of the graph, from communities deeply embedded inside it?

The analysis required to answer these questions is complementary to community detection – it must identify common “roles” across many communities, finding nodes in the graph with similar structure and function. One could perform these calculations by hand, or by examining the auxiliary structure provided by some community detection algorithms; however, those approaches are often ad hoc and do not scale.

We’d hope for a richer system to complement community detection - a system that, given a graph, does the following:
1. Identifies nodes that play similar structural roles in the graph.
2. Gives the user a quantitative understanding of how similar these nodes are to each other.
3. Allows the user to “make sense” of the roles by correlating them with well-understood graph metrics (such as PageRank value, clustering coefficient, structural hole value, or eccentricity). Such information would make analysis much simpler and much easier to automate.

Lucky for us, Henderson et al. proposed such a system in their 2012 paper “RolX: Structural Role Extraction & Mining in Large Graphs.” We’ll discuss the paper’s key details below.

### Overview of the Algorithm

The RolX (pronounced “role-ex”) algorithm is simple in conception, but somewhat more complicated in presentation. The core idea of RolX is the observation that if we gather data about a graph in some linear form (such as a matrix), we can use matrix factorization methods to find structure in the data - and possibly use this to discover corresponding structure in the graph itself.

#### Step 1 - Getting the features

RolX starts by gathering a wide range of information associated with nodes in the graph. To gather details about the elements of the graph, the authors rely on the discussion of ReFeX (Recursive Feature eXtraction) from their earlier paper, “It’s Who You Know: Graph Mining Using Recursive Structural Features” (2011). ReFeX recursively collects a wide range of structural data about the graph, for example, node-centric parameters such as a node’s degree. A key feature of the recursion is that it captures both global and local structure by looking at successively larger regions of the graph.

Quantifying a graph’s structure typically requires computing an ensemble of complicated structural metrics - some focus on local-to-the-node information, such as a node’s degree, and some are more influenced by global structural parameters, such as a node’s PageRank value. These calculations can be time-consuming for large graphs, as many global metrics often require several passes over the graph structure before they converge or stabilize.

ReFeX proposes a new way to do this, using only three basic metrics per node:
1. Its degree
2. The number of edges connecting its neighbors (its “ego-network interconnectivity”)
3. The number of edges connecting its neighbors to other parts of the graph (its “ego-network outdegree”)
All three of these metrics are local and easy to measure, requiring no more than traversing the neighborhood of each member of the graph. To ascertain more global structural properties, the ReFeX paper proposes a recursive technique, proceeding as follows:
1. Initialize a list $L$ of metrics with [degree, ego-network-inter, ego-network-out].
2. For each node $v$ in the graph $G$ and each metric $m$ in $L,$ define a new metric $m'$ to be the sum of $m(u)$ for each neighbor $u$ of $v.$
3. Append these metrics to the list $L.$
4. Repeat.
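The recursion above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the ReFeX reference implementation; the function names (`base_features`, `recursive_features`) and the use of networkx are our own choices.

```python
import networkx as nx
import numpy as np

def base_features(G):
    """Three local metrics per node: degree, edges inside the
    ego network, and edges leaving the ego network."""
    feats = []
    for v in G.nodes():
        ego = set(G.neighbors(v)) | {v}
        internal = sum(1 for u, w in G.edges(ego) if u in ego and w in ego)
        external = sum(1 for u, w in G.edges(ego) if (u in ego) != (w in ego))
        feats.append([G.degree(v), internal, external])
    return np.array(feats, dtype=float)

def recursive_features(G, iterations=2):
    """For each existing metric, append a new metric that sums the
    metric's values over each node's neighbors; repeat `iterations` times."""
    nodes = list(G.nodes())
    index = {v: i for i, v in enumerate(nodes)}
    F = base_features(G)
    for _ in range(iterations):
        sums = np.zeros_like(F)
        for v in nodes:
            nbrs = [index[u] for u in G.neighbors(v)]
            if nbrs:
                sums[index[v]] = F[nbrs].sum(axis=0)
        F = np.hstack([F, sums])  # feature count doubles each round
    return F
```

Note that the feature count doubles on every round, which is exactly why the stopping condition discussed next matters.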

Astute minds may notice this process is not guaranteed to terminate, and indeed it could go on forever. We need a stopping condition. Fortunately, there is an obvious one: the process should terminate when the new features cease to yield more knowledge about the graph’s structure. This stopping condition is accomplished by eliminating columns that closely resemble each other, or are in fact duplicates of one another. First, we construct an auxiliary graph in which each node represents a single column of the matrix. Next, we connect pairs of columns whose values are close for all nodes. We can then use connected-component-finding algorithms to “trim the fat” from the matrix, ensuring that every remaining column contributes unique information to our structural understanding of the graph.
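One simple way to sketch this pruning step in Python (our own naming; a real implementation would use a similarity threshold tuned per dataset) is to build the auxiliary column graph and keep one representative column per connected component:

```python
import networkx as nx
import numpy as np

def prune_features(F, tol=1e-6):
    """Drop near-duplicate columns of the node-by-feature matrix F,
    keeping one representative per group of similar columns."""
    n_cols = F.shape[1]
    aux = nx.Graph()
    aux.add_nodes_from(range(n_cols))
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # Connect two columns if their values agree (within tol)
            # for every node in the graph.
            if np.max(np.abs(F[:, i] - F[:, j])) <= tol:
                aux.add_edge(i, j)
    # Each connected component is a group of redundant columns;
    # keep the lowest-indexed column from each.
    keep = sorted(min(comp) for comp in nx.connected_components(aux))
    return F[:, keep]
```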

#### Step 2 - Finding hidden structure

Now that we have gathered this data about the graph’s structure, we can begin mining it for insights. At this stage, we have a giant matrix – more than $2^{10}$ elements for an average-sized graph – whose rows represent the nodes of the graph, and whose columns represent the values of the recursive structural metrics we computed. Each cell associates a given node with a given metric’s value, which makes the matrix rich in information. The challenge is figuring out how to use this information to get a complete, but concise, description of the graph’s structure. It turns out we can, using a technique called nonnegative matrix factorization, or NMF.

NMF is a mathematical strategy that has proven popular in the field of unsupervised machine learning. Its goal is straightforward: given a large matrix, create an approximation of that matrix that is much smaller, but mostly equivalent. It belongs to a broader class of algorithms that perform dimensionality reduction, which takes complex data and projects it into a smaller-dimensional space for easier analysis. Because it can enable insight into very large datasets, NMF is useful in a wide variety of contexts, such as modeling document topics or building recommender systems.

In mathematical terms, an $m\times n$ matrix $V$ factors into an $m\times r$ matrix $W$ and an $r\times n$ matrix $H$, where $r$ is much smaller than $m$ or $n$, so that $WH \approx V.$ Because an exact factorization is usually not possible, we approximate it instead. This approximation requires several techniques from linear algebra and numerical analysis.
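As a concrete sketch (not the paper’s own code), scikit-learn’s `NMF` estimator performs exactly this factorization. Here the sizes (50 nodes, 12 features, 3 roles) and the random matrix are illustrative placeholders:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((50, 12))   # stand-in for a node-by-feature matrix

model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)  # shape (50, 3): node-to-role weights
H = model.components_       # shape (3, 12): role-to-feature weights

# W @ H approximates V; rows of W tell us how strongly each node
# participates in each discovered role.
approx = W @ H
```

In RolX terms, each of the $r$ components is a candidate “role,” and each row of $W$ describes a node as a mixture of those roles.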

The matrices generated by NMF

In this particular instance, NMF can allow us to break down the massive array of graph metrics into a smaller collection of “roles.” Using this, we may be able to extract meaning from the graph’s structure and use this to find common structural motifs in the graph.

#### Step 3 - Sensemaking

Unfortunately, NMF outputs are not always clear to the naked eye. In fact, they are essentially just more matrices of numbers. Since the node statistics output by ReFeX are somewhat obscure, knowing which roles correspond to which combinations of those statistics does not, by itself, help us understand what the roles mean. To do that, we need to understand how each role corresponds to actual graph metrics, such as a node’s PageRank value or its degree – which essentially requires performing the inverse of the factorization. Now we have a matrix $N$ associating nodes in the graph with graph metrics, and a matrix $W$ associating nodes with roles. We want to generate a matrix $G$ such that $W\times G\approx N.$ This is a relatively simple optimization problem.
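A minimal way to sketch this optimization, assuming nonnegative least squares as the solver (the function name `explain_roles` is our own), is to solve for $G$ one metric column at a time with SciPy:

```python
import numpy as np
from scipy.optimize import nnls

def explain_roles(W, N):
    """Given node-role weights W (n_nodes x n_roles) and node-metric
    values N (n_nodes x n_metrics), find a nonnegative G such that
    W @ G approximates N, one metric column at a time."""
    n_roles, n_metrics = W.shape[1], N.shape[1]
    G = np.zeros((n_roles, n_metrics))
    for j in range(n_metrics):
        # nnls minimizes ||W @ g - N[:, j]|| subject to g >= 0.
        G[:, j], _ = nnls(W, N[:, j])
    return G
```

Each row of the resulting $G$ then profiles one role in terms of familiar metrics, which is what lets an analyst label a role as, say, “high-PageRank hub” or “low-degree periphery.”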

### A Quick Example

The following example from Henderson’s 2012 paper shows the fundamental difference between community detection and role discovery. Both graphs represent the same community of scientists who have co-authored scholarly papers.

Figure 2: Henderson and Gallagher illustrate the differences between the Fast Modularity community detection algorithm, on the left, and RolX, on the right, after applying each to a collaboration graph (RolX numbering added by Lab41 to help readers identify roles)

Whereas the graph on the left shows 22 communities, the one on the right shows four roles that crosscut those communities (represented as diamonds, squares, triangles, and circles). Some scientists are central members of networks – they reside within tightly connected clusters representing a specific discipline and influence every researcher in that discipline. Others bridge two or more different communities of researchers – these scientists usually focus on interdisciplinary topics. Some scientists belong to cliques – small groups of researchers all connected to each other, but only loosely connected to the rest of the graph. Finally, most scientists are connected to some other researchers in one specific field, but not tightly, and are not the central node in that field.

### In Conclusion

Hopefully, this whirlwind tour of RolX highlights how it can provide valuable insight from graphs. Combined with community detection algorithms, RolX can help you understand not only which groups are tightly connected, but also how certain nodes play key roles within the network. If you’re interested in learning more, be sure to read Henderson and Gallagher’s original paper, as well as take a look at our RolX implementation and our broader Circulo project. We also welcome contributors, so please check out our repo on GitHub and submit whatever issues, fixes, or code you think would help.

Background image Self-organization used under Creative Commons license