What is an efficient way of storing and comparing sub-graphs generated from a given input graph G in Python?
Some details:
The input graph G is a directed, simple graph with the number of vertices varying from n = 100 to 10000. For the number of edges, it can be assumed that at most 10% of the complete graph is present (usually less), which in that case gives a maximum of n*(n-1)/10 edges.
There is an algorithm that generates hundreds or thousands of sub-graphs from the input graph G, and for each sub-graph some (time-consuming) computations are made.
Each pair (sub-graph, computation results) must be stored for later use (a dynamic-programming approach: if a given sub-graph has already been processed, we want to re-use its results).
Because of point (2.), it would be really nice to store the sub-graph/results pairs in some kind of dictionary with the sub-graph as key. How can this be done efficiently? Perhaps some ideas for efficiently computing a hash value of a sub-graph?
Let's assume that memory is not a problem and that I can find a machine with enough memory to keep the data, so let's focus only on speed.
Of course, if there are ready-to-use data structures that might be helpful for this problem (like sparse matrices from scipy), they are very welcome.
I would just like to know your opinions about it and maybe some hints regarding how to approach this problem.
I know that there are nice graph/network libraries for Python like NetworkX, igraph and graph-tool, which have very efficient algorithms for processing graphs. But there seems to be no efficient way (or I could not find one) to fulfil points (2.) and (3.).
The key point here is the data format of the graphs already generated by your algorithm. Does it construct a new graph by adding vertices and edges? Is it rewritable? Does it use a given format (matrix, adjacency list, vertex and node sets, etc.)?
If you have the choice, however, because your subgraphs have a "low" cardinality and because space is not an issue, I would store subgraphs as arrays of bitmasks (the bitmask part is optional, but it is hashable and makes a compact set). A subgraph representation would then be:
L: a list of node references in your global graph G. It can also be a bitmask, to be used as a hash.
A: an array of bitmasks (a matrix) where A[i][j] is the truth value of the edge L[i] -> L[j].
This takes advantage of the arbitrary precision and low space requirements of Python integers. The space complexity is O(n*n), but you get efficient traversal and can easily hash your structure.
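A minimal sketch of that representation, assuming the vertices of G are numbered 0..n-1 (the function name subgraph_key is mine, not from any library):

    def subgraph_key(nodes, edges):
        """Hashable key for a subgraph of G.

        nodes -- iterable of vertex indices of G in the subgraph
        edges -- iterable of (u, v) pairs with u, v taken from nodes
        """
        L = sorted(nodes)                        # fixed ordering of the selected nodes
        index = {v: i for i, v in enumerate(L)}
        L_mask = 0
        for v in L:
            L_mask |= 1 << v                     # node-set bitmask, usable as a hash on its own
        A = [0] * len(L)
        for u, v in edges:
            A[index[u]] |= 1 << index[v]         # A[i] has bit j set iff edge L[i] -> L[j]
        return (L_mask, tuple(A))

    results = {}                                 # cache: subgraph key -> computation results
    key = subgraph_key([2, 5, 7], [(2, 5), (5, 7)])
    if key not in results:
        results[key] = "expensive result goes here"   # replace with your actual computation

Since the key depends only on which nodes and edges are present, and not on the order in which they were supplied, an already-processed subgraph maps to the same dictionary entry.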
I've been learning a lot about graph theory and graphs in general. I have been looking for my first job as an engineer, and the technical interviews have definitely expanded my scope of knowledge.
When I read about adjacency lists, the literature always describes an array whose indexes represent vertices, with a linked list stored under each index representing the edges connected to that vertex.
Before reading this material I was making my graphs as hash tables, with each key being a vertex and each value being a set of edges.
My idea was fast lookup and deletion time. I was solving basic problems involving DFS with this setup and it worked fine.
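Roughly, my setup looks something like this (simplified):

    # graph as a hash table: vertex -> set of adjacent vertices
    graph = {
        'a': {'b', 'c'},
        'b': {'d'},
        'c': {'d'},
        'd': set(),
    }

    def dfs(graph, start):
        """Iterative depth-first traversal; returns the set of reachable vertices."""
        visited = set()
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in visited:
                visited.add(v)
                stack.extend(graph[v] - visited)
        return visited

    print(dfs(graph, 'a'))   # {'a', 'b', 'c', 'd'} (set order may vary)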
My question: is there any downside to representing a graph as a hash table (k = vertex, v = set of edges), or any real benefit to using arrays/linked lists to store edges that would justify switching my approach?
Thanks in advance for helping me grasp the basics!
I am studying the Quick Subgraph Isomorphism (QuickSI) algorithm and I am having a problem understanding the formulae for the inner support and average inner support described on page 6, (2) and (3). If "v" stands for vertex and "e" stands for edge, then what do f(v) and f(e) do? How can I obtain the values of Table 2 on page 6? Definition 4 on page 5 does not really do much to help me understand. By isomorphic mappings from the query graph to the data graph, I understand taking different components of the query graph and seeing if they can be found in the data graph. But the computation time for this does not seem feasible for large graphs.
Here you can find the original article:
http://www.cse.unsw.edu.au/~lxue/10papers/vldb08_haichuan.pdf
Thank you in advance!
The function f is described in Definition 1 - it's just the isomorphism function that preserves the labels (l).
The 'average inner-support' is the number of 'features' (for example, vertices) that have an isomorphism divided by the number of graphs that have an isomorphism. For example, if six such features have an isomorphism and they occur in three graphs of the dataset, the average inner-support would be 6/3 = 2. To get the values of the table, you would need to know the dataset of graphs (D) that was used. It doesn't seem to be referenced except in Example 4.
Really, taking a step back - do you need to implement this particular algorithm? There are plenty of simpler ones that might be slightly slower, but clearer. Furthermore, why not use someone else's implementation of a subgraph isomorphism algorithm?
Does there exist a nearest-neighbor data structure that supports delete and add operations along with exact nearest-neighbor queries? Ideally I'm looking for a Python implementation.
Attempts:
Found MANY implementations for approximate nearest neighbor queries in high dimensional spaces.
Found KD Trees and Ball Trees but they do not allow for dynamic rebalancing.
Thinking an algorithm could be possible with locality sensitive hashing.
Looking at OctTrees.
Context:
For each of 10,000 points, query for its nearest neighbor
Evaluate each pair of neighbors
Pick one and delete the pair of points and add a merged point.
Repeat for some number of iterations
Yes, there exists such a data structure; I invented one, since I had exactly this problem at hand. It makes KD-trees seem excessively complex: it consists of nothing but a sorted list of points for each dimension the points have.
Obviously you can add and remove an n-dimensional point to and from n lists sorted by their respective dimensions rather trivially. A few tricks then allow one to iterate these lists and mathematically prove that you have the shortest distance to a point. See my answer here for elaboration and code.
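A minimal sketch of the add/remove part of that idea in 2-D, using the standard bisect module (the actual nearest-neighbor query, with the proof that you can stop iterating early, is what the linked answer covers):

    import bisect

    class DimensionLists:
        """Points kept in one sorted list per dimension (2-D here)."""

        def __init__(self):
            self.by_dim = [[], []]                       # one sorted list of entries per dimension

        def add(self, point):
            for d, lst in enumerate(self.by_dim):
                bisect.insort(lst, (point[d], point))    # keep each list sorted by that coordinate

        def remove(self, point):
            for d, lst in enumerate(self.by_dim):
                i = bisect.bisect_left(lst, (point[d], point))
                del lst[i]                               # assumes the point is actually stored

    pts = DimensionLists()
    pts.add((1.0, 2.0))
    pts.add((0.5, 3.5))
    pts.remove((1.0, 2.0))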
I must note, though, that your context is slightly off: the closest point to A may be B, but it does not follow that B's closest point is A. You could rig a chain of points such that the distance between each consecutive pair is smaller than the one before it, so that every point's nearest neighbor is the next point in the chain and only one pair of points are each other's nearest neighbors. For example, points on a line at 0, 10, 16 and 19: the nearest neighbor of 0 is 10 and of 10 is 16, but only 16 and 19 are mutual nearest neighbors.
What is the best algorithm for generating a reachability matrix from a given adjacency matrix? There is Warshall's algorithm, but it is not the best method. There are some other methods, but the procedures are more theoretical. Is there any module with which I can create a reachability matrix with ease? I am working with Python 2.7.
I don't think there is a way to do it faster than O(n³) in the general case for a directed graph.
That being said, you can try to use clever techniques to reduce the constant.
For example, you can do the following:
Convert your graph into a DAG by finding all strongly connected components and replacing each of them with a single vertex. This can be done in Θ(V + E) or Θ(V²).
On the new graph, run DFS to calculate reachability for all vertices, but, when updating the reachability set of a vertex, do it in a fast vectorized way. This is technically Θ((V + E) * V) or Θ(V³), but the constant will be low (see below).
The proposed vectorized way is to represent the reachability set of every vertex as a bit vector residing on the GPU. This way, the union of two sets is computed in an extremely fast, parallelized manner on the GPU. You can use any tensor library for the GPU, for example tf.bitwise.
After you have calculated the reachability bit vectors for every vertex on the GPU, you can extract them into CPU memory in Θ(V²) time.
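A rough CPU-only sketch of steps 1 and 2, using SciPy's strongly-connected-components routine and plain Python integers as the bit vectors (names are mine; the GPU part is left out):

    from scipy.sparse.csgraph import connected_components

    def reachability_bitsets(adj):
        """adj: scipy sparse adjacency matrix (V x V) of the directed graph.

        Returns (labels, reach): vertex u can reach vertex v
        iff bit labels[v] is set in reach[labels[u]].
        """
        # Step 1: collapse strongly connected components into single vertices.
        n_comp, labels = connected_components(adj, directed=True, connection='strong')

        # Edges of the condensation DAG (edges inside a component are dropped).
        succ = [set() for _ in range(n_comp)]
        rows, cols = adj.nonzero()
        for i, j in zip(rows, cols):
            if labels[i] != labels[j]:
                succ[int(labels[i])].add(int(labels[j]))

        # Step 2: DFS over the DAG, merging reachability sets kept as integer bitsets.
        reach = [None] * n_comp
        def visit(c):                   # recursive for brevity; use an explicit stack for deep DAGs
            if reach[c] is None:
                bits = 1 << c
                for d in succ[c]:
                    bits |= visit(d)
                reach[c] = bits
            return reach[c]
        for c in range(n_comp):
            visit(c)
        return labels, reach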
I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don't have. An alternative would be to use a graph database which has a connected components function but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.
SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.
Building a sparse adjacency matrix from a sequence of (i, j) pairs adj_list, where i and j are (zero-based) indices of nodes, can be done with

    import numpy as np
    import scipy.sparse

    i_indices, j_indices = zip(*adj_list)
    adj_matrix = scipy.sparse.coo_matrix(
        (np.ones(len(adj_list)), (i_indices, j_indices)),   # one entry per edge
        shape=(number_of_nodes, number_of_nodes))
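The call itself could then look something like this (assuming the adj_matrix built above):

    from scipy.sparse.csgraph import connected_components

    n_components, labels = connected_components(adj_matrix, directed=True)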
You'll have to do some extra work for the undirected case.
This approach should be efficient if your graph is sparse enough.
https://graph-tool.skewed.de/performance
As you can see from the performance page above, this tool is very fast. It's written in C++, but the interface is in Python.
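For connected components with graph-tool, the call is roughly the following (assuming the graph g has already been loaded):

    from graph_tool.topology import label_components

    comp, hist = label_components(g)   # comp: per-vertex component label, hist: component sizes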
If this tool isn't good enough for you (which I think it will be), then you can try Apache Giraph (http://giraph.apache.org/).