Connected components in a graph with 100 million nodes

I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don't have. An alternative would be to use a graph database which has a connected components function but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.

SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.
Building a sparse adjacency matrix from a sequence of (i, j) pairs adj_list where i and j are (zero-based) indices of nodes can be done with
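import numpy as np
import scipy.sparse
i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix(
    (np.ones(len(adj_list)), (i_indices, j_indices)),  # one entry per edge
    shape=(number_of_nodes, number_of_nodes))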
You'll have to do some extra work for the undirected case.
This approach should be efficient if your graph is sparse enough.
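As a minimal sketch of the full pipeline, the snippet below builds the sparse matrix and passes it to scipy.sparse.csgraph.connected_components. The helper name component_labels and the tiny edge list are made up for illustration, and it assumes that passing directed=False is acceptable for an undirected graph, since SciPy then follows each stored edge in either direction.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def component_labels(adj_list, number_of_nodes):
    """Return (number of components, component label per node)."""
    # One row per edge; reshape keeps the shape (0, 2) for an empty edge list.
    edges = np.asarray(adj_list, dtype=np.int64).reshape(-1, 2)
    data = np.ones(len(edges))
    adj_matrix = coo_matrix(
        (data, (edges[:, 0], edges[:, 1])),
        shape=(number_of_nodes, number_of_nodes),
    ).tocsr()
    # directed=False lets the search follow a stored edge in either direction.
    return connected_components(adj_matrix, directed=False)


# Hypothetical toy graph: nodes 0-1-2 are linked, 3-4 are linked, 5 is isolated.
n_components, labels = component_labels([(0, 1), (1, 2), (3, 4)], 6)
```

Nodes with the same entry in labels belong to the same component, so grouping node indices by label gives the list of components.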

https://graph-tool.skewed.de/performance
As the performance page shows, graph-tool is very fast. It is written in C++, but the interface is in Python.
If this tool isn't good enough for you (though I think it will be), you can try Apache Giraph (http://giraph.apache.org/).

Related

Simulate graph model in networkx

I have a very specific graph problem in networkx:
My directed graph has two different types of nodes (I will call them I and T), and it is built with edges only between I-T and T-I (so a T node never connects to another T node, and the same goes for I).
Now I need to simulate a new graph with the same behavior: I have a certain number of I and T nodes, and an I-T edge exists with a certain probability (likewise for T-I, but with a different probability; let's call them p_i and p_o).
My problem is that I can't iterate with for loops over both I and T, because both are quite big (the data I'm analyzing right now has 5000 T's and 5000 I's, but they will probably increase up to 300000 each) and my PC can't handle that.
What is the best way to create a graph in this situation?

This is the solution I reached thanks to @ravenspoint's comment.
For T=5000 and I=5000 it works with one for loop over I and one over T, drawing np.random.binomial(1, pi, nI) and np.random.binomial(1, po, nO) with numpy, where nO and nI are the lengths of O and I in the real graph, and creating an edge wherever the drawn array is 1.
If po=pi (as happens in my example), @Stef's solution also works and you can use nx.bipartite.random_graph(countadd, countt, p, seed=None, directed=True)
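A rough sketch of the binomial approach described above, using only NumPy. The function name, the ("I", k) / ("T", k) node labels, and the reading that p_i applies to I -> T edges while p_o applies to T -> I edges are all assumptions made for illustration; the original post does not spell these out.

```python
import numpy as np


def sample_directed_bipartite_edges(n_i, n_t, p_i, p_o, seed=None):
    """Sample I -> T edges with probability p_i and T -> I edges with p_o."""
    rng = np.random.default_rng(seed)
    edges = []
    # I -> T: one vectorized Bernoulli draw per I node instead of a nested loop.
    for i in range(n_i):
        hits = np.flatnonzero(rng.binomial(1, p_i, n_t))
        edges.extend((("I", i), ("T", int(t))) for t in hits)
    # T -> I: same idea with the other probability.
    for t in range(n_t):
        hits = np.flatnonzero(rng.binomial(1, p_o, n_i))
        edges.extend((("T", t), ("I", int(i))) for i in hits)
    return edges


# Hypothetical sizes and probabilities, chosen only for the demonstration.
edges = sample_directed_bipartite_edges(50, 40, 0.1, 0.05, seed=0)
```

The resulting list can be handed to a networkx DiGraph with add_edges_from. Only one row of random draws is held in memory at a time, which is what keeps the memory footprint small as the node counts grow.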

Efficient representing sub-graphs (data structure) in Python

What is an efficient way of storing and comparing sub-graphs generated from a given input graph G in Python?
Some details:
1. The input graph G is a directed, simple graph with the number of vertices varying from n=100 to 10000. For the number of edges, it can be assumed that the maximum is 10% of the complete graph (usually less), which gives at most n*(n-1)/10 edges.
2. There is an algorithm that generates hundreds or thousands of sub-graphs from the input graph G, and some (time-consuming) computations are made for each sub-graph.
3. Each pair "sub-graph, computation results" must be stored for later use (a dynamic programming approach: if a given sub-graph has already been processed, we want to re-use its results).
Because of point (2.) it would be really nice to store the sub-graph/results pairs in a kind of dictionary where the sub-graph is the key. How can this be done efficiently? Maybe there are some ideas for efficiently calculating a hash value for a sub-graph?
Let's assume that memory is not a problem and that I can find a machine with enough memory to keep the data, so let's focus only on speed.
Of course, if there are already nice-to-use data structures that might be helpful for this problem (like the sparse matrices from scipy), they are very welcome.
I would just like to know your opinions about it, and maybe get some hints regarding an approach to this problem.
I know that there are nice graph/network libraries for Python like NetworkX, igraph, and graph-tool which have very efficient algorithms for processing a given graph. But there seems to be no efficient way (or I could not find one) to fulfill points (2. 3.)

The key point here is the data format of the graphs already generated by your algorithm. Does it construct a new graph by adding vertices and edges? Is it rewritable? Does it use a given format (matrix, adjacency list, vertex and edge sets, etc.)?
If you have the choice, however, because your subgraphs have a "low" cardinality and because space is not an issue, I would store subgraphs as arrays of bitmasks (the bitmask part is optional, but it is hashable and makes a compact set). A subgraph representation would then be:
L: a list of node references in your global graph G. It can also be a bitmask to be used as a hash.
A: an array of bitmasks (a matrix) where A[i][j] is the truth value of the edge L[i] -> L[j].
This takes advantage of the arbitrary size and low space overhead of Python integers. The space complexity is O(n*n), but you get efficient traversal and can easily hash your structure.
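One possible way to turn the L / A representation into a dictionary key is sketched below. It assumes the nodes of G are numbered with non-negative integers; the function name subgraph_key and the cache dictionary are hypothetical, not part of any library.

```python
def subgraph_key(nodes, edges):
    """Build a hashable key from a sub-graph's nodes and directed edges.

    nodes: iterable of integer node indices in the global graph G.
    edges: iterable of (u, v) pairs with both endpoints in nodes.
    """
    L = sorted(set(nodes))                      # node list in canonical order
    pos = {v: k for k, v in enumerate(L)}       # global index -> local index
    node_mask = 0
    for v in L:
        node_mask |= 1 << v                     # bit v set: node v is present
    rows = [0] * len(L)                         # A as one integer per row
    for u, v in edges:
        rows[pos[u]] |= 1 << pos[v]             # bit j of row i: edge L[i] -> L[j]
    return node_mask, tuple(rows)


# Memoize results by sub-graph; the input order does not change the key.
cache = {}
key = subgraph_key([2, 5, 7], [(2, 5), (5, 7)])
cache[key] = "expensive result"
same_key = subgraph_key([7, 5, 2], [(5, 7), (2, 5)])
```

Two sub-graphs get the same key exactly when they contain the same nodes of G and the same edges between them, so a lookup in cache replaces the expensive recomputation.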

Generating reachability matrix from a given adjacency matrix

What is the best algorithm for generating a reachability matrix from a given adjacency matrix? There is Warshall's algorithm, but it is not the best method. There are some other methods, but the procedures are more theoretical. Is there any module with which I can create a reachability matrix with ease? I am working with Python 2.7.

I don't think there is a way to do it faster than O(n³) in the general case for a directed graph.
That being said, you can try clever techniques to reduce the constant.
For example, you can do the following:
Convert your graph into a DAG by finding all strongly connected components and replacing each of them with a single vertex. This can be done in Θ(V + E) or Θ(V²).
On the new graph, run DFS to calculate reachability for all vertices, but when updating the reachability set for a vertex, do it in a fast vectorized way. This is technically Θ((V + E) * V) or Θ(V³), but the constant will be low (see below).
The proposed vectorized way is to represent the reachability set of every vertex as a bit vector residing on the GPU. This way, the union of two sets is computed in an extremely fast, parallelized manner on the GPU. You can use any GPU tensor library for this, for example tf.bitwise.
After you have calculated the reachability bit vectors for every vertex on the GPU, you can copy them into CPU memory in Θ(V²) time.
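The steps above can be sketched on the CPU as follows. This is not the GPU version from the answer: Python integers stand in for the bit vectors, SciPy's connected_components (with connection="strong") finds the strongly connected components, and a topological order from Kahn's algorithm replaces the DFS. The function name is made up, and "reachable" is taken to mean reachable by a path of at least one edge.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def reachability_matrix(adj):
    """Return R with R[u, v] True when a path of length >= 1 leads from u to v."""
    adj = np.asarray(adj, dtype=bool)
    n = adj.shape[0]

    # Step 1: collapse strongly connected components (SCCs) into single vertices.
    n_scc, label = connected_components(
        csr_matrix(adj, dtype=np.int8), directed=True, connection="strong"
    )
    label = label.tolist()
    succ = [set() for _ in range(n_scc)]
    for u, v in zip(*(idx.tolist() for idx in np.nonzero(adj))):
        if label[u] != label[v]:
            succ[label[u]].add(label[v])

    # members[c]: bit v is set when vertex v belongs to component c.
    # cyclic[c]: component c contains a cycle (several vertices or a self-loop).
    members, size, cyclic = [0] * n_scc, [0] * n_scc, [False] * n_scc
    for v in range(n):
        c = label[v]
        members[c] |= 1 << v
        size[c] += 1
        if size[c] > 1 or adj[v, v]:
            cyclic[c] = True

    # Step 2: topological order of the condensed DAG (Kahn's algorithm).
    indegree = [0] * n_scc
    for c in range(n_scc):
        for d in succ[c]:
            indegree[d] += 1
    order = [c for c in range(n_scc) if indegree[c] == 0]
    for c in order:  # the list grows while we iterate over it
        for d in succ[c]:
            indegree[d] -= 1
            if indegree[d] == 0:
                order.append(d)

    # Step 3: visit successors first; each set union is one big-integer OR.
    reach = [0] * n_scc
    for c in reversed(order):
        bits = members[c] if cyclic[c] else 0
        for d in succ[c]:
            bits |= members[d] | reach[d]
        reach[c] = bits

    # Unpack the bit vectors into a dense boolean matrix.
    R = np.zeros((n, n), dtype=bool)
    for u in range(n):
        bits = reach[label[u]]
        R[u] = [(bits >> v) & 1 for v in range(n)]
    return R


# Hypothetical example: a simple chain 0 -> 1 -> 2.
R = reachability_matrix([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
```

The asymptotic cost is unchanged, but the inner set union works on whole machine words at a time, which is the same constant-factor idea as the GPU bit vectors.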

MapReduce: Finding the triangles in a network graph using Mrjob

I've an application where I have a graph and I need to count the number of triangles in the graph using MrJob (MapReduce in Python). However, I'm having some trouble wrapping my head around the mapping and the reducing steps needed.
What is the best Map Reduce pipeline for computing the triangles of a network graph?

Well, it would help to have a bit more context to answer this. Do you have a single graph or a large number of graphs? Is it a tree? How many nodes are we talking about in your graph?
But in general, I would try to build a solution that uses the networkx package, specifically the triangles method at the core.
An issue you may face is filtering duplicates, as the triangles are reported relative to a node.
So a bit more context on the specifics would help narrow down the answer.
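To illustrate the duplicate issue on a single in-memory graph (this is plain networkx, not an mrjob pipeline), here is a small sketch. nx.triangles returns, for an undirected graph, the number of triangles each node takes part in, so every triangle is counted once per corner. The helper name and the toy graph are made up for the example.

```python
import networkx as nx


def count_triangles(G):
    """Count distinct triangles in an undirected graph."""
    per_node = nx.triangles(G)  # dict: node -> number of triangles through it
    # Each triangle is reported at all three of its corners.
    return sum(per_node.values()) // 3


# Hypothetical toy graph: one triangle (0, 1, 2) plus a dangling edge 2-3.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])
total = count_triangles(G)
```

Dividing the per-node sum by three is the simplest way to filter the duplicates when only the total count is needed.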

What scalability issues are associated with NetworkX?

I'm interested in network analysis on large networks with millions of nodes and tens of millions of edges. I want to be able to do things like parse networks from many formats, find connected components, detect communities, and run centrality measures like PageRank.
I am attracted to NetworkX because it has a nice API, good documentation, and has been under active development for years. Plus, because it is in Python, it should be quick to develop with.
In a recent presentation (the slides are available on github here), it was claimed that:
Unlike many other tools, NX is designed to handle data on a scale
relevant to modern problems...Most of the core algorithms in NX rely on extremely fast legacy code.
The presentation also states that the base algorithms of NetworkX are implemented in C/Fortran.
However, looking at the source code, it looks like NetworkX is mostly written in python. I am not too familiar with the source code, but I am aware of a couple of examples where NetworkX uses numpy to do heavy lifting (which in turn uses C/Fortran to do linear algebra). For example, the file networkx/networkx/algorithms/centrality/eigenvector.py uses numpy to calculate eigenvectors.
Does anyone know if this strategy of calling an optimized library like numpy is really prevalent throughout NetworkX, or if just a few algorithms do it? Also can anyone describe other scalability issues associated with NetworkX?
Reply from NetworkX Lead Programmer
I posed this question on the NetworkX mailing list, and Aric Hagberg replied:
The data structures used in NetworkX are appropriate for scaling to
large problems (e.g. the data structure is an adjacency list). The
algorithms have various scaling properties but some of the ones you
mention are usable (e.g. PageRank, connected components, are linear
complexity in the number of edges).
At this point NetworkX is pure Python code. The adjacency structure
is encoded with Python dictionaries which provides great flexibility
at the expense of memory and computational speed. Large graphs will
take a lot of memory and you will eventually run out.
NetworkX does use NumPy and SciPy for algorithms that are primarily
based on linear algebra. In that case the graph is represented
(copied) as an adjacency matrix using either NumPy matrices or SciPy
sparse matrices. Those algorithms can benefit from the legacy C and
FORTRAN code that is used under the hood in NumPy and SciPy.

This is an old question, but I think it is worth mentioning that graph-tool has very similar functionality to NetworkX, but it is implemented in C++ with templates (using the Boost Graph Library), and hence is much faster (up to two orders of magnitude) and uses much less memory.
Disclaimer: I'm the author of graph-tool.

Your big issue will be memory. Python simply cannot handle tens of millions of objects without jumping through hoops in your class implementation. The memory overhead of many objects is too high: you hit 2GB, and 32-bit code won't work. There are ways around it, such as using slots, arrays, or NumPy. It should be OK because networkx was written for performance, but if a few things don't work, I would check your memory usage.
As for scaling, algorithms are basically the only thing that matters with graphs. Graph algorithms tend to have really ugly scaling if they are done wrong, and they are just as likely to be done right in Python as any other language.

The fact that NetworkX is mostly written in Python does not mean that it is not scalable, nor does it claim perfection. There is always a trade-off. If you throw more money at your "machines", you'll have as much scalability as you want, plus the benefits of using a Pythonic graph library.
If not, there are other solutions (here and here) which may consume less memory (benchmark and see; I think igraph is fully C-backed, so it will), but you may miss the Pythonic feel of NX.
