What is the best algorithm for generating a reachability matrix from a given adjacency matrix? There is Warshall's algorithm, but it is not the best method. There are some other methods, but the procedures are more theoretical. Is there any module with which I can create a reachability matrix with ease? I am working on Python 2.7.
I don't think there is a way to do it faster than O(n³) in the general case for a directed graph.
That being said, you can try to use clever techniques to reduce the constant.
For example, you can do the following:
Convert your graph into a DAG by finding all strongly connected components and replacing each with a single vertex. This can be done in Θ(V + E) or Θ(V²).
On the new graph, run DFS to calculate reachability for all vertices, but, when updating the reachability set for a vertex, do it in a fast vectorized way. This is technically Θ((V + E) · V) or Θ(V³), but the constant will be low (see below).
The proposed vectorized way is to have the reachability set for every vertex represented as a bit vector residing on the GPU. This way, the union of two sets is computed in an extremely fast, parallel manner on the GPU. You can use any tensor library for the GPU, for example tf.bitwise.
After you have calculated the reachability bit vectors for every vertex on the GPU, you can extract them into CPU memory in Θ(V²) time.
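A rough CPU sketch of the same scheme, substituting numpy boolean rows for the GPU bit vectors and SciPy's strongly-connected-components routine for step 1 (function and variable names are mine):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def reachability_matrix(adj):
    # adj: dense 0/1 numpy adjacency matrix of a directed graph
    n = adj.shape[0]
    # step 1: condense strongly connected components into a DAG
    n_comp, comp = connected_components(csr_matrix(adj), directed=True,
                                        connection='strong')
    dag = [set() for _ in range(n_comp)]
    for i, j in zip(*adj.nonzero()):
        if comp[i] != comp[j]:
            dag[comp[i]].add(comp[j])
    # step 2: memoized DFS over components; the union of reachability sets
    # is a vectorized boolean OR (this is the part that would run on the GPU)
    reach = {}
    def dfs(c):
        if c not in reach:
            r = np.zeros(n_comp, dtype=bool)
            r[c] = True
            for d in dag[c]:
                r |= dfs(d)
            reach[c] = r
        return reach[c]
    for c in range(n_comp):
        dfs(c)
    # step 3: expand back to vertex level; note this is the reflexive closure,
    # i.e. every vertex is marked as reaching itself
    R = np.stack([reach[comp[i]] for i in range(n)])   # (n, n_comp)
    return R[:, comp]                                   # (n, n)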
I have to implement an energy function, termed Rigidity Energy, as in Eq 7 of this paper here.
The energy function takes as input two 3D object meshes and returns the energy between them. The first mesh is the source mesh, and the second mesh is the deformed version of the source mesh. In rough pseudo-code, the computation would go like this:
Iterate over all the vertices in the source mesh.
For every vertex, compute its covariance matrix with its neighboring vertices.
Perform SVD on the computed covariance matrix and find the rotation matrix of the vertex.
Use the computed rotation matrix, the point coordinates in the original mesh and the corresponding coordinates in the deformed mesh, to compute the energy deviation of the vertex.
Thus this energy function requires me to iterate over each point in the mesh, and the mesh could have more than 2k such points. In TensorFlow, there are two ways to do this. I can have two tensors of shape (N, 3), one representing the points of the source mesh and the other those of the deformed mesh.
Do it purely using TensorFlow tensors. That is, iterate over elements of the above tensors using tf.gather and perform the computation on each point using only existing TF operations. This method would be extremely slow. I've tried to define loss functions that iterate over thousands of points before, and the graph construction itself takes too much time to be practical.
Add a new TF op as explained in the TF documentation here. This involves writing the function in C++ (and CUDA, for GPU support), and registering the new op with TF.
The first method is easy to write, but impractically slow. The second method is a pain to write.
I've used TF for 3 years, and have never used PyTorch before, but at this point I'm considering switching to it, if it offers a better alternative for such cases.
Does PyTorch have a way of implementing such loss functions that is both easy to write and runs as fast as it would on the GPU? I.e., a Pythonic way of writing my own loss functions that run on the GPU, without any C or CUDA code on my part?
As far as I understand, you are essentially asking whether this operation can be vectorized. The answer is no, at least not fully, because the svd implementation in PyTorch is not vectorized.
If you showed the tensorflow implementation, it would help in understanding your starting point. I don't know what you mean by finding the rotation matrix of the vertex, but I would guess this can be vectorized. This would mean that svd is the only non-vectorized operation and you could perhaps get away with writing just a single custom OP, that is the vectorized svd - which is likely quite easy, because it would amount to calling some library routines in a loop in C++.
Two possible sources of problems I see are
if the neighborhoods N(i) in equation 7 can be of significantly different sizes (which would mean that the covariance matrices are of different sizes and vectorization would require some dirty tricks)
the general problem of dealing with meshes and neighborhoods could be difficult. This is an innate property of irregular meshes, but PyTorch has support for sparse matrices and a dedicated package, torch_geometric, which at least helps.
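To illustrate the "vectorize everything but the svd" idea: assuming every vertex has exactly k neighbors (a simplification, per the first caveat above), the per-vertex covariance matrices can be computed in one batched PyTorch expression, leaving only the per-vertex SVD as a loop or custom op. Names and shapes below are my assumptions, not your code:

import torch

# P, Q: (N, 3) source / deformed vertex positions
# nbr:  (N, k) long tensor of neighbor indices (assumes a fixed k per vertex)
def per_vertex_covariances(P, Q, nbr):
    e  = P.unsqueeze(1) - P[nbr]    # (N, k, 3) edge vectors in the source mesh
    ed = Q.unsqueeze(1) - Q[nbr]    # (N, k, 3) edge vectors in the deformed mesh
    # batched outer products, summed over the k neighbors -> (N, 3, 3)
    return torch.einsum('nki,nkj->nij', e, ed)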
What is an efficient way of keeping and comparing generated sub-graphs from a given input graph G in Python?
Some details:
Input graph G is a directed, simple graph with a number of vertices varying from n = 100 to 10000. Number of edges: it can be assumed that the maximum would be 10% of a complete graph (usually less), which gives at most n*(n-1)/10 edges.
There is an algorithm that can generate hundreds or thousands of sub-graphs from the input graph G, and for each sub-graph some (time-consuming) computations are made.
Pair "subgraph, computation results" must be stored for later use (dynamic programming approach - if given sub-graph were already processed we want to re-use its results).
Because of point (2.), it would be really nice to store sub-graph/results pairs in a kind of dictionary where the sub-graph is the key. How can this be done efficiently? Maybe some ideas for efficiently calculating a sub-graph hash value?
Let's assume that memory is not a problem and I can find a machine with enough memory to keep the data, so let's focus only on speed.
Of course, if there are already nice-to-use data structures that might be helpful for this problem (like sparse matrices from SciPy), they are very welcome.
I would just like to know your opinions about it and maybe some hints regarding an approach to this problem.
I know that there are nice graph/network libraries for Python like NetworkX, igraph, and graph-tool which have very efficient algorithms to process the provided graph. But there seems to be no efficient way (or I could not find one) to fulfill points (2.) and (3.).
The key point here is the data format of the graphs already generated by your algorithm. Does it construct a new graph by adding vertices and edges? Is it rewritable? Does it use a given format (matrix, adjacency list, vertex and node sets, etc.)?
If you have the choice, however, because your sub-graphs have a "low" cardinality and because space is not an issue, I would store sub-graphs as arrays of bitmasks (the bitmask part is optional, but it is hashable and makes a compact set). A sub-graph representation would then be:
L: a list of node references into your global graph G. It can also be a bitmask, to be used as a hash.
A: an array of bitmasks (a matrix) where A[i][j] is the truth value of the edge L[i] -> L[j].
This takes advantage of the arbitrary precision and low space requirements of Python integers. The space complexity is O(n*n), but you get efficient traversal and can easily hash your structure.
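A minimal sketch of that representation (function and variable names are mine; it assumes the nodes of G are numbered 0..n-1):

def subgraph_key(nodes, edges):
    # nodes: node indices of the sub-graph in G; edges: (u, v) pairs among them
    L = sorted(nodes)
    pos = {v: i for i, v in enumerate(L)}
    node_mask = 0
    for v in L:
        node_mask |= 1 << v                 # bitmask over G's nodes, hashable
    rows = [0] * len(L)
    for u, v in edges:
        rows[pos[u]] |= 1 << pos[v]         # one Python int per row of A
    return (node_mask, tuple(rows))

results = {}                                # sub-graph -> computation results
key = subgraph_key([2, 5, 7], [(2, 5), (5, 7)])
if key not in results:
    results[key] = "expensive computation here"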
I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is that the distance between GPS points needs to be calculated with the Haversine formula, and then that distance needs to be weighted appropriately so it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy, I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.
You asked about scipy, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process that distance matrix. The problem is that this needs O(n²) memory.
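For the distance-matrix route with SciPy, here is a rough sketch; the time weight is a placeholder you would have to tune, and the sample data is made up:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in metres
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * np.arcsin(np.sqrt(a))

SECONDS_TO_METERS = 1.0        # heuristic weight, entirely use-case dependent

def dist(x, y):                # x, y = (lat, lon, posix_time)
    return haversine_m(x[0], x[1], y[0], y[1]) + SECONDS_TO_METERS * abs(x[2] - y[2])

X = np.array([[52.52, 13.40, 0.0], [52.53, 13.41, 60.0], [48.86, 2.35, 5000.0]])
d = pdist(X, metric=dist)                       # condensed distance matrix, O(n^2) memory
labels = fcluster(linkage(d, method='average'), t=2, criterion='maxclust')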
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed, you need to weight the attributes appropriately, as 1 meter does not equal 1 second. Weights will be very much use-case dependent, and heuristic.
The reason I'm suggesting ELKI is that they have a nice tutorial on implementing custom distance functions that can then be used in most algorithms. They can't be used in every algorithm - some don't use distance at all, or are constrained to e.g. Minkowski metrics only. But a lot of algorithms can use arbitrary (even non-metric) distance functions.
There is also a follow-up tutorial on index-accelerated distance functions. For my geodata, indexes were tremendously useful, speeding things up by a factor of over 100x, and thus enabling me to process 10 times more data.
I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don't have. An alternative would be to use a graph database which has a connected components function but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.
SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.
Building a sparse adjacency matrix from a sequence of (i, j) pairs adj_list where i and j are (zero-based) indices of nodes can be done with
import numpy as np
import scipy.sparse

i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix((np.ones(len(i_indices)),  # one entry per edge
                                      (i_indices, j_indices)))
You'll have to do some extra work for the undirected case.
This approach should be efficient if your graph is sparse enough.
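Once the matrix is built, the actual call is a one-liner; for the undirected case you can either symmetrize the matrix yourself or simply pass directed=False:

from scipy.sparse.csgraph import connected_components

n_components, labels = connected_components(adj_matrix, directed=False)
# labels[i] is the component id of node i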
https://graph-tool.skewed.de/performance
As you can see from the performance page above, this tool is very fast. It's written in C++, but the interface is in Python.
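For the connected-components part specifically, graph-tool exposes label_components; a small example (the file name is just a placeholder):

from graph_tool import load_graph
from graph_tool.topology import label_components

g = load_graph("my_graph.graphml")      # or build the graph with add_edge_list
comp, hist = label_components(g)        # comp[v] = component label of vertex v
print(len(hist), "connected components")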
If this tool isn't good enough for you (which I think it will be), then you can try Apache Giraph (http://giraph.apache.org/).
I am trying to build the triadic closure of a graph-tool Graph.
graph_tool.topology contains transitive_closure, which is basically the "infinite-th" power of the adjacency matrix: what I need is the second or, in general, the k-th power.
Is there some better way than just... calculating the powers of the adjacency matrix?
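For reference, a rough sketch of the matrix-power baseline using graph_tool.spectral.adjacency (g is assumed to be the existing Graph; check which of rows/columns correspond to source/target in your graph-tool version):

import numpy as np
from graph_tool import Graph
from graph_tool.spectral import adjacency

A = adjacency(g)                 # sparse SciPy matrix of the existing graph g
A2 = A @ A                       # paths of length exactly 2
A2.setdiag(0)                    # drop self-loops, if unwanted
A2.eliminate_zeros()

g2 = Graph(directed=g.is_directed())
g2.add_vertex(g.num_vertices())
g2.add_edge_list(np.transpose(A2.nonzero()))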