
I See Graphs

At Lab41 we are obsessed with graphs. We see graphs everywhere we look, in everything we think about. One of our goals is to better understand and advance techniques for manipulating, storing, and analyzing graphs. In this post, we try to do three things: (1) explain what a graph is, (2) show that it’s an important concept, and (3) discuss a way of working with graphs. More specifically, we talk about how to work with graphs as matrices.

Why are graphs important?

A graph is a mathematical construct that describes things that are linked together. Graphs model networks: systems of “things” connected together in some way. For example, the Internet is a physical network of computers connected together by data connections; the Web is a logical network of web pages connected by hyperlinks; and human societies are networks of people connected together by various social relationships. In the language of mathematics, we say each of the “things” in a network is a “node,” and the connections between them are “edges.”

It turns out that you can think of much of the world, both physical and virtual, as a graph. As a mathematical construct, graphs have been around since Leonhard Euler tried to figure out the best way to get around Königsberg in 1735. Since then, graph theory has been embraced by a wide array of disciplines including sociology, economics, physics, computer science, biology, and statistics. An excellent resource for understanding how graphs map onto real-world systems is the “Rosetta Stone for Connectionism,” which maps various real-world systems onto graph concepts. Graphs really are everywhere.

While graphs are prevalent in many fields, the tools for working with graphs, especially large graphs, are still in their infancy. As graph technologies mature it should become easier to model many different problems, and easier to implement solutions. However, we are still figuring out the best ways to store, query, and compute on graphs. Right now, people use different data structures and technologies for different types of graphs and different use cases. Eventually, we need to figure out how to hide that complexity and let people treat graph data as graphs without thinking about what the right tools are for manipulating that data. Marko Rodriguez, a leading graph technologist, has a great summary of several different types of graphs and graph technologies in his recent blog post. Lab41 is actively using and working with many of the technologies that Marko describes, including the graph database Titan, which we load tested as noted in our previous blog entry.

Graphs as Matrices

The earliest tools for working with graphs were tools for manipulating matrices. In mathematics, graphs are frequently expressed as an adjacency matrix. In an adjacency matrix each row/column represents a node, and each entry in the matrix represents the presence of an edge between two nodes. The cool thing about the matrix form of a graph is that once you think of a graph as a matrix, you can apply concepts and methods from linear algebra to your graph analysis. Many common graph metrics and algorithms can easily be expressed in terms of standard matrix operations.
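To make the linear-algebra connection concrete, here is a minimal sketch (the class name and the toy 4-node graph are ours, not from the post): squaring the adjacency matrix A gives a matrix whose (i, j) entry counts the walks of length 2 from node i to node j, and whose diagonal entries give each node’s degree.

```java
// Sketch: a graph metric expressed as a standard matrix operation.
// Assumption: a small undirected 4-node cycle, nodes 0..3,
// edges 0-1, 1-2, 2-3, 3-0.
public class WalkCount {
    public static void main(String[] args) {
        int[][] a = {
            {0, 1, 0, 1},
            {1, 0, 1, 0},
            {0, 1, 0, 1},
            {1, 0, 1, 0}
        };
        int n = a.length;

        // Compute A^2 with a plain triple loop.
        int[][] a2 = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    a2[i][j] += a[i][k] * a[k][j];

        // a2[0][2] counts the length-2 walks from node 0 to node 2:
        // there are two (via node 1 and via node 3).
        System.out.println(a2[0][2]); // prints 2

        // The diagonal entry a2[i][i] is the degree of node i.
        System.out.println(a2[0][0]); // prints 2
    }
}
```

The same idea scales up: powers of A count longer walks, and many centrality measures (eigenvector centrality, PageRank) are likewise matrix computations.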

While an adjacency matrix is a mathematical abstraction, it’s also a data structure. In this post we are talking primarily about a matrix as a contiguous block of memory. In most programming languages this is an array of arrays:

//Java
int[][] matrix = {{1, 2, 3}, {1, 2, 3}};

From an engineering perspective, there are a number of advantages to storing a graph as a matrix, if the matrix representation of the graph fits in memory:

  • A number of common operations can be performed on matrices quite quickly in comparison to how long they would take on other data structures. (Table 1)
    <table>
      <tr>
        <td> Operation </td>
        <td> Adjacency Matrix </td>
        <td> Adjacency List </td>
        <td> Adjacency Tree </td>
      </tr>
      <tr>
        <td> Insert </td>
        <td> O(1) </td>
        <td> O(1) </td>
        <td> O(log(m/n))</td>
      </tr>
      <tr>
        <td>Delete</td>
        <td>O(1)</td>
        <td>O(m/n)</td>
        <td>O(log(m/n))</td>
      </tr>
      <tr>
        <td>Find</td>
        <td>O(1)</td>
        <td>O(m/n)</td>
        <td>O(log(m/n))</td>
      </tr>
      <tr>
        <td>Enumerate</td>
        <td>O(n)</td>
        <td>O(m/n)</td>
        <td>O(m/n)</td>
      </tr>
    </table>
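As a rough illustration of the O(1) column in Table 1 (the class name and toy sizes here are ours): with a dense adjacency matrix, inserting, finding, and deleting an edge are each a single array access.

```java
// Sketch: constant-time edge operations on a dense adjacency matrix.
// Assumption: node ids are dense integers 0..n-1.
public class EdgeOps {
    public static void main(String[] args) {
        int n = 5;
        int[][] adj = new int[n][n];

        adj[1][3] = 1;                  // insert edge 1->3: O(1)
        boolean found = adj[1][3] != 0; // find edge 1->3:   O(1)
        adj[1][3] = 0;                  // delete edge 1->3: O(1)

        System.out.println(found); // prints "true"
    }
}
```

An adjacency list, by contrast, would have to scan the neighbor list of node 1 for the find and delete, which is the O(m/n) (average degree) cost in the table.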
  • Nearly all programming languages have highly optimized libraries for storing and working with matrices. For example: Python has NumPy; Java has JAMA, Colt, and others; C++ has Armadillo, Eigen, and others.
However, matrices are a relatively limited representation of a graph. There are a number of operations that they perform very poorly, and they can be very memory inefficient:
  • Adjacency matrices only allow you to capture the structure of the graph. They don’t give you a chance to associate information with each node – in other words, you can’t represent property graphs with matrices. For example, if you were modeling the world wide web you might want to store the url, title, and text of each web page in the network, and you can’t do that with an adjacency matrix. Now, if you’re persnickety, you might argue that it’s possible to store edge properties in an adjacency matrix if you think of each entry in the matrix as something other than a numeric value.
  • You can’t model heterogeneous graphs consisting of different types of entities using an adjacency matrix. For example, if you are trying to implement collaborative filtering, you may want to model a network of users to products. Again, if you are persnickety, you might propose various ways of simulating heterogeneous graphs using a single matrix. However, all of those methods will require you to use information not actually stored in the matrix to differentiate nodes of one type from nodes of another.
  • Adjacency matrices are really slow at some critical graph operations, such as running through the neighbors of a particular node in a sparse graph: O(n), where n is the number of nodes in the graph, because you must scan the node’s entire row.
  • Perhaps worst of all, adjacency matrices are very memory inefficient taking O(n^2) memory. An adjacency matrix has an entry for each possible edge, which means each possible edge is using memory even if it does not exist. This is primarily a problem when working with sparse networks – networks where many of the edges in the network don’t exist. Unfortunately, most real world networks are sparse. For example, if you consider the graph of all people on earth, it is a sparse network because each person knows only a relatively small number of other people. Most of the possible relationships that could exist between people don’t exist. In some ways that is both an engineering problem and an existential problem.

    To put the memory inefficiency of adjacency matrices into perspective, if each entry in the matrix is stored as a 32-bit (4-byte) integer, then the memory requirement for a graph can be calculated by the following equation:

    \[\text{memory in GB} = \frac{(\text{number of nodes})^2 \times 4}{1024^3}\]

    Thus a graph of 50,000 nodes would take about 10GB of memory, which means you can’t store and manipulate large graphs like the graph of the Web, which is estimated to have 4.7 billion nodes, on most desktop computers.

  • While technologies for dealing with small matrices – matrices that can fit in memory – are well developed, technologies for dealing with large matrices in a distributed manner are just emerging. One approach to dealing with extremely large matrices is to use some type of supercomputer, which has a lot of memory. Another approach is to swap portions of the matrix into and out of memory; there are algorithms that can do this relatively efficiently based on the unique properties of a matrix. A new, and extremely interesting, approach is to distribute a matrix computation across the memory of multiple computers. I think the Pegasus project is a particularly interesting example of the distributed matrix computation approach.
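To sanity-check the back-of-the-envelope memory estimate above, here is a small sketch (the class and method names are ours), assuming 4 bytes per matrix entry:

```java
// Sketch: dense adjacency-matrix memory cost, assuming each entry
// is a 32-bit (4-byte) integer.
public class MatrixMemory {
    // memory in GB = nodes^2 * 4 bytes / 1024^3
    static double memoryGb(long nodes) {
        return (double) nodes * nodes * 4 / (1024L * 1024 * 1024);
    }

    public static void main(String[] args) {
        // 50,000 nodes: roughly 9.3 GB -- "about 10GB" as noted above.
        System.out.printf("%.1f%n", memoryGb(50_000L));

        // 4.7 billion nodes (a Web-scale graph): tens of billions of GB,
        // far beyond any single machine.
        System.out.printf("%.0f%n", memoryGb(4_700_000_000L));
    }
}
```

This quadratic growth is exactly why sparse representations, such as adjacency lists, dominate for real-world graphs.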

Next Steps

If you’re interested in learning more about networks and adjacency matrices, I would highly recommend taking a look at M.E.J. Newman’s Networks. He has an excellent discussion of the adjacency matrix as a mathematical concept in Chapter 6, and a discussion of the adjacency matrix as a data structure in Chapter 9.

Also, keep an eye on this blog. I plan to address other data structures for storing graph data, and when they may (or may not) be appropriate, in a future post.