Market Survey: Community Detection
Good Data Sets
 Data which may not have Ground Truth
 Stanford's InfoLab  https://snap.stanford.edu/data
 Data with Ground Truth
 Stanford's InfoLab  https://snap.stanford.edu/data
 Synthetic Data Generation
 Generated Graph by splitting into 4 groups [11]
 Create graph which replicates properties of graphsKronecker Method [Lescovec et al. 2010]
 Need to replicate realworld properties: heavy tails for in and outdegree distribution, heavy tails for eigenvalues and eigenvectors, small diameters and desnificaiton and shrinking diameters over time.
Methods of Community Detection

Node Based
 Given na individual node, identify the communities that the node belongs to

Component Based
 Given a component(community), discover overlapping subcommunities within it

Nework Based
 Given a community (nodes & edges), discover overlapping subcommunities within it
Use Cases
 Nodes belonging to a community are more than likely to have other properties in common [10].
 Enables data triage through node relationships
 May identify hierarchies (communities within communities)
 Finding partitions in nodes to promote efficiencies in distributing computation
 Enables centrality measures within a community
 Enables the identification of key edges used to link communities
Evaluation Metrics
 Overview of using ground truth [8]
 These metrics were collectively suggested in the paper [14]
 Good paper outlining potential issues with validating community detection, 2013 [22]
 Classes of metrics: Internal Connectivity, Combo of internal and external connectivity
 Goodness scores (as second order testing bc increases in community scores increases goodness scores): Separability, Density, Cohesiveness, Cluster Coefficient
 For overlapping communities, we will want to use only internal and external metric values. Combination metrics and modularity scores will result in misleading values that should be inconsistent.
Metric Name  Connectivity Type  Ground Truth Required?  Details 

Edge/Link Prediction  Comparative Scoring Funciton  No  Quality of division is based on ability to predict an edge after removing it 
Modularity [11]  Comparative Scoring Function  No  Tests a given division of a network against the random division. Scale below which modularity cannot identify communities bc random graphs have highmodularity subsets. Does not work well under nodeswap perturbation 
F1 Score  Comparative Scoring Function  Yes  Determination of precision and recall. Compares membership output of an algorithm to ground truth and returns the ratio of true and false assignments 
Omega Index  Comparative Scoring Function  Yes  Compares nondisjoing clustering results. The omega index considers the number of clusters in which a pair of objects appears together. 
Normalized Mutual Information [10]  Comparative Scoring Function  Yes  Used to quantify matching of "true" graphs particularly in synthetic graphs, it quantifies teh degree of overlap between graphs. 
Metrics
Metric Name  Connectivity Type  Details 

FOMD  Internal  Fraction over Median Degree determines the number of nodes that have an internal degree greater than the median degree of all nodes in a group 
Internal Density  Internal  Density is defined by the number of edges (m_{s}) in subset S divided by the total number of possible edges between all nodes (n_{s}(n_{s}1)/2). The "2" is there to cancle out duplicated edges. Internal Density = m_{s}/(n_{s}(n_{s}1)/2) 
Edges Inside  Internal  Somewhat useless by itself since it isn't related to any other attributes of subset S. The total number of edges (m_{s}) present in subset S. Edges Inside = m_{s} 
Average degree  Internal  The average interal degree across all nodes (n_{s}) in subset S. Average Degree = 2m_{s}/n_{s} 
Fraction over median degree (FOMD)  Internal  Determines the number of nodes that have an internal degree greater than the median degree of nodes in Subset S. 
Triangle Participation Ratio (TPR)  Internal  The best mesure for density, cohesiveness, and clustering within the goodness scales. Robust under random and expand perturbations. A measure of cohesion. The fraction of nodes in S that belong to a triad. TPR = (# of nodes belonging to a triad)/n 
Expansion (External Degree)  External  This measure of separability gives the average of number of external connections (c_{s}) per node (n_{s}) in subset S has with graph G. It can be thought of "External Degree". Expansion = c_{s}/(n_{s}(nn_{s})). 
Cut Ratio (External Density)  External  This metric is a measure of sperarability and can be thought of as "External Density". It is the fraction of external edges (c_{s}) of subset S out of the total number of possible edges in graph G. 
Conductance [6,8]  Combo (Int. & Ext.)  Ratio of edges inside the cluster to the number of edges leaving the cluster (captures surface area to volume). Discriminative Algorithm. Measures best in separability (goodness scale); measures wellseparated nonoverlapping communities. Robust under nodeswap and shrink perturbation. Community like sets of nodes have lower conductance. 
Normalized Cut  Combo (Int. & Ext.)  Represents how well subset S is separated from graph G. It sums up the fraction of external edges over all edges in subset S (conductance) with the fraction of external edges over all noncommunity edges. 
Maximum Out Degree Fraction (Maximum ODF)  Combo (Int. & Ext.)  This metric first finds the fraction of external conections to internal connections for each node (n_{s}) in S. It then returns the fraction with the highest value. For example, S contain 5 nodes all connected to themselves. There is one node in S that has 7 external connections. The Max ODF = 7/4. 
Average ODF  Combo (Int. & Ext.)  The sum of the individual fraction of edges outside of the community over the total connections of a node in subset S. It is then divided by the total number of nodes (n_{s}) in subset S. 
FlakeODF  Combo (Int. & Ext.)  This is a fraction of the number of nodes that have fewer internal connections than external connections (***) to the number of nodes (n_{s}) in subset S. 
Density  Internal  No 
Algorithms
Algorithm Classes[7]
 Divisive: Focus on edges and verticies that exist between communities. This class tends to be more repeatable, traditional and computationally expensive
 Model Based: Considers an underlying model of statistical nature that can generate the division of the network. Time complexity often is derived through testing, and is not explicity characterized.
 Quality Optimization: Usually maximizes the delta from some score such as modularity
 Vertex Clustering: Embeds the Graph into vector space in order to use conventional data clustering methods such as kmeans
 Cohesive Subgraph Discovery: This class finds communities of a particular predefined structure within the graph. For example, the structure could be a clique, nclique, kcore, LS set, lambda set
Algorithm Attributes
 R: Requires Number of Communities
 R+: Requires # of communities, but is able to guess if not provided
 D: Deterministic
 G: Global
 L: Local
 H: Heirarchical
 A: Agglomerative
 O: Overlapping
 Q: Distributable
 S: Requires community size
Graph Input Types
 w = "weighted"
 w = "unweighted"
 d = "directed"
 d = "nondirected"
Framework Descriptions
Name  Author(s)  Homepage  Comments 

Snap  http://snap.stanford.edu  
igraph  http://igraph.sourceforge.net/index.html  igraph is a free software package for creating and manipulating undirected and directed graphs. It includes implementations for classic graph theory problems. igraph has four implementations: R, C/C++, Python, and Ruby  
NetworkX  
MatLab 
Algorithm Table
Algorithm Name/Year  Class  Time/Space Complexity (n = # of vertices, m = # of edges)  Algo Attributes  Implementations  Input Graph Type  Comments 

GirvanNewman (Shortestpath betweenness), 2004 [11]  Divisive  Ο(nm^{2}): Each iteration uses a tree structure to calculate edge betweeness of a graph in O(mn). Do this m times, once for each edge  D,H,G  Mathematica  w,d  Iteratively removes links with the highest betweenness score (a score that measures the number of shortest paths that pass through an edge) 
GirvanNewman (randomwalk betweenness), 2004 [11][16]  Divisive  Ο((n+m)mn^{2})  Uses random traversal through a graph  
GirvanNewman (currentflow betweenness), 2004 [11][16]  Divisive  Ο((n+m)mn^{2})  Uses a resistor network concept  
Infomaps[13]  Model Based/Information Theory/ Random Walk  Ο((n^{2})Log(n))  snap, igraph  w,d  Attempts to find partition that yields the minimum description length of an infinite random walk on the network. It determines community and network structure by analyzing the flow of information, proxied through random walk calculations, among various groups of nodes.  
Fast Newman [12]  Quality Optimization  O((m + n)n)  A,H  Snap  w,d  
ClausetNewmanMoore [9]  Model Based, Quality Optimization  O(md log n) where d is depth of dendrogram  Snap, igraph  Improves upon Fast Newman, remains based oon the greedy optimization of modularity. A modularity of >0.3 is a good indicator of a significant community structure  
BIGCLAM (Cluster Affilitation Model for Big Networks)[14]  Model Based  Unclear  O,H,G,Q,R+  Snap  w,d  Partitioning that most likely produces the edges of a given graph, given the notion that nodes in the same community are more likely to share an edge. It captures the probability that a pair of nodes are connected as a funciton of that shared membership. http://www.youtube.com/watch?v=htWQWN1xAZQVideo Lecture 
Label Propogation, 2007 [17]  Vertex Clustering  A  Label Propagation uses the structure of the graph to cluster vertices. Nodes are iteratively labeled with the label that most of its neighbors holds. Densely connected groups form a consensus with a unique label.  
Fast Unfolding, 2008 [19]  Quality Optimization  H  igraph  w,d  An iterative algorithm that calculates modularity based on nearest neighbros, joining nodes if the values are positive, and then builds a new network whose nodes are now communities based on step one. This it repeated until maximum modularity is achieved.  
WalkTrap, 2005 [18]  Vertex Clustering, Quality Optimization  O(mn^{2})  A  igraph  w,d  Defines a similarity metric between nodes based on distance to all other nodes from a random walk. Merges nodes in an agglomarative fashion that minimizes distance from other nodes in the community. Then picks partition that has highest modularity. 
CESNA (Communities from Edge Structure and Node Attributes) [26]  Subgraph Discovery  1 million nodes in 10 hours  Q,R+  Snap  w,d  Detects overlapping communities in networks with node attributes. Assumes that nodes that belong ot the same communities are likely to be connected to each other, communities can overlap, the more common communities a set of nodes share the more likely they are to be connected, and nodes in the same community are likely to share common attributes 
Radicchi et al [15]  Model Based, Divisive  O(m^{2}) on large systems  None  d  A local algorithm that returns similar results to Girvan Newman betweenness but reduces computational cost by using a local quantitiythe number of loops of a given length that contains an edgeto choose which edge to remove.  
Leading Eigenvector, 2006 [20]  igraph  
Latent Space Models [21]  
CoDA (Communities through directed affiliations)[23]  Model Based  Unclear  O,H,G,Q,R+  snap  w,d  Extension of BIGCLAM that incorporated directedness to discover community type. It detects overlapping communities as wel as cohesively and 2mode communities, whether connected or hierarically nested. 
Clique Clique Percolation [27]  Cohesive Subgraph Discovery  G,D,S  CFinder  w,d  Begins by identifying all cliques of size k in a network. All cliques are collapsed into single vertex. Vertices are connected if cliques that they are representing share k1 members—connected components identify which cliques compose a community. Input value of k must be chosen and typically is between 3 and 6.  
CONGA (ClusterOverlapping Newman Girvan Algorithm), 2007 [28]  Divisive  O(m^{3})  D,G,O  None  w,d  CONGA, like Girvan Newman, relies on calculation of betweenness to determine splits, but splits on vertices, rather than edges. The algorithm specifies when and how to split vertices allowing for overlapping membership 
CONGO (CONGA Optimized), 2008 [32]  Divisive  O(n log n) Sparse Networks  D,G,O  None  w,d  Expands upon the CONGA algorithm by achieving a local form of betweenness 
Fuzzy Overlapping Group (FOG) [29]  Divisive  O(L^{3})  O  ORA  A combination of a stochastic model and maximum likelihood method group detection algorithm for inferring fuzzy overlapping groups. Groups are built from links which allows for multiple memberships varying levels of association from entities to groups.  
Bipartite Stochastic Block Model (biSBM), 2014 [30]  Model Based  O(N_{a}K_{a}(K_{a} + k)) + O(N_{b}K_{b}(K_{b} + k))  Matlab  w  The degree corrected Bipartite Stochastic Block Model uses probability to search for maximum likelihood partition of a network into communities. It models a network's observed degree sequence before finding community structure. It divdes verticies into groups of like types. These calculations are done iteratively to continually propose movements of vertices that increases the likelihood function. In ref to run time, k is avg degree of each node, N_a and N_b or number of vertices of type a and b, K_a and K_b are number of communities of type a and b  
Spectral Clustering [31]  Model Based  Determined by Eigenvalue Computation  L,Q,R  w,d  Spectral clustering algorithms rely on the calculation of eigenvalues of a similarity matrix to find optimal partitions, give a predetermined number of partitions. It relaxes the complex problem of minimizing cut ratio over every possible kpartition to find the ksmallest eigenvalues and related eigenvectors of the Laplacian of the graph.  
RolX  Role identification/discovery  Q  SNAP  w,d  Generates list of roles as "profiles" of local/global node metrics, and assigns nodes a linear combination of these roles. 
Notable Papers
Topic  Title  Comments 

Algorithm Comparisons  Community Detection Algorithms: A Comparative Analysis [3]  
General Overview  Communities in networks [4]  
General Overview  Community Detection in Graphs (2010) [5]  
General Overview  Community Detection in Social Media [7]  
Algorithm Comparisons  Empirical Comparison of Algorithms for Network Community Detection [6]  
Algorithm Evalutation  Defining and Evaluating Network Communities based on Groundtruth[8]  
Centrality Measures / Community Detection  Using Centrality Measures to Identify Key Members of an Innovation Collaboration Network [24]  
Overlapping Community Detection  Overlapping community detection in networks: the stateoftheart and comparative study [25] 
References
 Michelle Girvan and M. E. J. Newman, 2001 http://arxiv.org/pdf/condmat/0112110.pdf, Community structure in social and biological networks
 M E J Newman, 2001 href="http://wwwpersonal.umich.edu/~mejn/papers/016132.pdf, Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality
 Andrea Lancichinetti and Santo Fortunato, 2010 http://arxiv.org/pdf/0908.1062v2.pdfCommunity detection algorithms: a comparative analysis
 Mason A. Porter, JukkaPekka Onnela, and Peter J. Mucha, 2009 http://www.ams.org/notices/200909/rtx090901082p.pdfCommunities in Networks
 Santo Fortunato, http://arxiv.org/pdf/0906.0612v2.pdfCommunity detection in graphs
 Jure Leskovec, Kevin J. Lang, Michael W. Mahoney, http://cs.stanford.edu/people/jure/pubs/communitieswww10.pdfEmpirical Comparison of Algorithms for Network Community Detection
 Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali, Ploutarchos Spyridonos, http://static.springer.com/sgw/documents/1387631/application/pdf/103.pdfCommunity detection in Social Media Performance and application considerations
 Jaewon Yang and Jure Leskovec, http://cs.stanford.edu/people/jure/pubs/comscoreicdm12.pdfDefining and Evaluating Network Communities based on Groundtruth
 Aaron Clauset, M. E. J. Newman, Cristopher Moore, http://arxiv.org/pdf/condmat/0408187.pdfFinding community structure in very large networks
 Leon Danon, Albert D ́iazGuilera, Jordi Duch, and Alex Arenas http://arxiv.org/pdf/condmat/0505245.pdfComparing community structure identification
 M E J Newman, and M. Girvan, 2004, http://arxiv.org/pdf/condmat/0308217.pdfFinding and evaluating community structure in networks
 M E J Newman, http://arxiv.org/pdf/condmat/0309508.pdfFast algorithm for detecting community structure in networks
 Martin Rosvall and Carl T. Bergstromhttp://www.mapequation.org/assets/publications/RosvallBergstromPNAS2008Full.pdfMaps of random walks on complex networks reveal community structure
 Jaewon Yang and Jure Leskovec http://infolab.stanford.edu/~crucis/pubs/papernmfagm.pdfOverlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach
 Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico Parisi http://www.pnas.org/content/101/9/2658.full.pdf+htmlDefining and identifying communities in networks
 M.E.J. Newman, 2004, http://arxiv.org/pdf/condmat/0309045.pdfA measure of betweenness centrality based on random walks
 Usha Nandini Raghavan, Reka Albert, Soundar Kumara, 2007, http://arxiv.org/pdf/0709.2938v1.pdfNear linear time algorithm to detect community structures in largescale networks
 Pascal Pons and Matthieu Latapy, 2007 http://arxiv.org/pdf/physics/0512106v1.pdfComputing communities in large networks using random walks
 Vincent D. Blondel1; JeanLoup Guillaume, Renaud Lambiotte, and Etienne Lefebvre, http://arxiv.org/pdf/0803.0476v2.pdfFast unfolding of communities in large networks
 M E J Newman, 2006 http://arxiv.org/pdf/physics/0605087v3.pdfFinding community structure in networks using the eigenvectors of matrices
 Lei Tang, Huan Liu, Community Detetection and Mining in Social Media
 Conrad Lee and Pádraig Cunningham, 2013http://comnet.oxfordjournals.org/content/2/1/19.full.pdf+htmlCommunity detection: effective on large social networks
 Detecting Cohesive and 2mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.
 John Cardente, http://dsgeek.com/docs/jcardente_cs224w_projreport.pdfUsing Centrality Measures to Identify Key Members of an Innovation Collaboration Network
 Jierui Xie, Stephen Kelley, Boleslaw K. Szymanski, http://arxiv.org/abs/1110.5813Overlapping community detection in networks: the stateoftheart and comparative study
 Jaewon Yang, Julian McAuley, Jure Leskovec, Community Detection in Networks with Node Attributes
 Gergely Palla, Imre Derenyi, Illes Farkas, and Tamas Vicsek, Uncovering the overlapping community structure of complex networks in nature and society.
 Steve Gregory, An algorithm to Find Overlapping Community Structure in Networks.
 George David and Kathleen Carley, Clearing the FOG: Fuzzy, Overlapping Groups in Social Networks.
 Daniel Larremore, Aaron Clauset, and Abigail Jacobs, Efficiently inferring community structure in bipartite networks.<
 Ulrike von Luxburg, A Tutorial On Spectral Clustering.
 Steve Gregory, http://www.cs.bris.ac.uk/publications/Papers/2000885.pdfA Fast Algorithm to Find Overlapping Communities in Networks, 2008