QuickSI algorithm for finding subgraph isomorphisms - python

I am studying the Quick Subgraph Isomorphism (QuickSI) algorithm and I am having trouble understanding the formulae for the inner support and average inner support, (2) and (3) on page 6. If "v" stands for vertex and "e" stands for edge, what do f(v) and f(e) do? How can I obtain the values of Table 2 on page 6? Definition 4 on page 5 does not really help me understand. By isomorphic mappings from the query graph to the data graph I understand taking different components of the query graph and checking whether they can be found in the data graph, but the computation time for that does not seem feasible for large graphs.
Here you can find the original article:
http://www.cse.unsw.edu.au/~lxue/10papers/vldb08_haichuan.pdf
Thank you in advance!

The function f is described in Definition 1 - it's just the isomorphism function that preserves the labels (l).
The 'average inner-support' is the number of 'features' (for example, vertices) that have an isomorphism divided by the number of graphs that have an isomorphism. To get the values of the table, you would need to know the dataset of graphs (D) that was used; it doesn't seem to be referenced except in Example 4.
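As a toy illustration of that ratio (the numbers here are invented, not taken from Table 2): if a feature's vertices are mapped 75 times in total across the graphs of D, and 30 of those graphs admit at least one such mapping, then under this reading the average inner-support would be 75 / 30 = 2.5.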
Really, taking a step back - do you need to implement this particular algorithm? There are plenty of simpler ones that might be slightly slower, but clearer. Furthermore, why not use someone else's implementation of a subgraph isomorphism algorithm?

Related

Efficiently representing sub-graphs (data structure) in Python

What is an efficient way of storing and comparing sub-graphs generated from a given input graph G in Python?
Some details:
The input graph G is a directed, simple graph with the number of vertices varying from n = 100 to 10000. For the number of edges, it can be assumed that at most 10% of the complete graph's edges are present (usually fewer), which gives a maximum of n*(n-1)/10 edges.
There is an algorithm that can generate hundreds or thousands of sub-graphs from the input graph G, and for each sub-graph some (time consuming) computations are made.
Each pair "subgraph, computation results" must be stored for later use (dynamic programming approach - if a given sub-graph was already processed, we want to re-use its results).
Because of point (2.) it would be really nice to store the sub-graph/results pairs in a kind of dictionary where the sub-graph is the key. How can this be done efficiently? Maybe some ideas for efficiently computing a sub-graph hash value?
Let's assume that memory is not a problem and I can find a machine with enough memory to keep the data - so let's focus only on speed.
Of course, if there are already convenient data structures that might be helpful for this problem (like sparse matrices from scipy), they are very welcome.
I would just like to know your opinions about it and maybe some hints on how to approach this problem.
I know that there are nice graph/network libraries for Python like NetworkX, igraph, and graph-tool, which have very efficient algorithms for processing a given graph, but there does not seem to be (or I could not find) an efficient way to fulfill points (2.) and (3.).
The key point here is the data format of the graphs already generated by your algorithm. Does it construct a new graph by adding vertices and edges? Is it rewritable? Does it use a given format (matrix, adjacency list, vertex and edge sets, etc.)?
If you have the choice, however, because your subgraphs have a "low" cardinality and because space is not an issue, I would store subgraphs as arrays of bitmasks (the bitmask part is optional, but it is hashable and makes a compact set). A subgraph representation would then be:
L: a list of node references into your global graph G. It can also be a bitmask, to be used as a hash.
A: an array of bitmasks (a matrix) where A[i][j] is the truth value of the edge L[i] -> L[j].
This takes advantage of the arbitrary precision and compact storage of Python integers. The space complexity is O(n*n), but you get efficient traversal and can easily hash your structure.
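A minimal sketch of that idea (the function and variable names are illustrative, not from any library): the node membership L and the local adjacency A are packed into plain Python integers, so the pair is hashable and can be used directly as a dictionary key.

def encode_subgraph(node_indices, local_edges):
    # node_indices: indices of the subgraph's nodes in the global graph G
    # local_edges: (u, v) pairs given as positions within node_indices
    L = 0
    for i in node_indices:
        L |= 1 << i                  # membership bitmask over G's nodes
    n = len(node_indices)
    A = 0
    for u, v in local_edges:
        A |= 1 << (u * n + v)        # row-major bit for the edge L[u] -> L[v]
    return (L, A)                    # hashable, usable as a dict key

# usage: cache previously computed results keyed by the subgraph encoding
cache = {}
key = encode_subgraph([0, 2, 5], [(0, 1), (1, 2)])
if key not in cache:
    cache[key] = "expensive computation result"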

Networkx shortest tree algorithm

I have an undirected weighted graph G with a set of nodes and weighted edges.
I want to know if there is a method implemented in networkx to find a minimum spanning tree in a graph between given nodes (e.g. nx.steiner_tree(G, ['Berlin', 'Kiel', 'Munster', 'Nurnberg'])) (apparently there is none?)
I don't have enough reputation points to post images. A link to a similar image could be:
Map
(A3, C1, C5, E4)
What I'm thinking:
check Dijkstra's shortest paths between all destination nodes;
put all the nodes (intermediate and destination) and edges into a new graph V;
compute the MST on V (to remove cycles by breaking the longest edges);
Maybe there are better ways (correctness- and computation-wise)? This approach does pretty badly with three destination nodes and gets better with more nodes. (A minimal sketch of these steps is shown after the P.S. below.)
P.S. My graph is planar (it can be drawn on paper so that edges do not intersect). So maybe some kind of spring/force algorithm (like in d3 visualisations) could help?
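A rough sketch of those three steps with networkx, assuming G is a weighted nx.Graph (the function name is made up, and the heuristic is not guaranteed to be optimal):

import itertools
import networkx as nx

def approx_steiner_tree(G, terminals, weight="weight"):
    # 1. shortest paths between every pair of destination nodes
    # 2. collect their nodes and edges into a new graph V
    # 3. take the minimum spanning tree of V to break cycles
    V = nx.Graph()
    for s, t in itertools.combinations(terminals, 2):
        path = nx.dijkstra_path(G, s, t, weight=weight)
        for u, v in zip(path, path[1:]):
            V.add_edge(u, v, **G[u][v])
    return nx.minimum_spanning_tree(V, weight=weight)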
As I understand your question, you're trying to find the lowest-weight connected subgraph that contains a set of nodes. This is the Steiner tree problem in graphs. It is NP-complete. You're probably best off taking some sort of heuristic based on the specific case you are studying.
For two nodes, the approach is Dijkstra's algorithm - it's fastest if you expand around both nodes until the two shells intersect. For three nodes I suspect a version of Dijkstra's algorithm where you expand around each of the nodes will give some good results, but I don't see how to make sure you're getting the best one.
I've found another question about this, which has several decent answers (the posted question had an unweighted graph, so it's different, but the algorithms given in the answers are appropriate for weighted graphs). There are some good ones beyond just the accepted answer.
In networkx there's a standard Kruskal algorithm implemented that takes an undirected weighted graph as input. The function is called "minimum_spanning_tree".
I propose you build a subgraph that contains the nodes you need and then let the Kruskal algorithm run on it.
import networkx as nx
H = G.subgraph(['Berlin', 'Kiel', 'Konstanz'])
MST = nx.minimum_spanning_tree(H)
As pointed out already, this is the Steiner tree problem in graphs.
There is a Steiner tree algorithm in networkx:
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.approximation.steinertree.steiner_tree.html
However, it only gives you an approximate solution, and it is also rather slow. For state-of-the-art solvers, see the section "External Links" under:
https://en.wikipedia.org/wiki/Steiner_tree_problem
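For completeness, a small example of calling that routine (the graph and the city names below are made-up placeholders, and a reasonably recent networkx version is assumed):

import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
G.add_weighted_edges_from([
    ("Berlin", "Hamburg", 3), ("Hamburg", "Kiel", 1),
    ("Berlin", "Nurnberg", 4), ("Nurnberg", "Munster", 5),
    ("Hamburg", "Munster", 3),
])
# terminal nodes to connect; the result is an approximate Steiner tree
T = steiner_tree(G, ["Berlin", "Kiel", "Munster", "Nurnberg"], weight="weight")
print(sorted(T.edges(data="weight")))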

"Agglomerative" clustering of a graph based on node weight in network X?

I have a very large connected graph (millions of nodes). Each edge has a weight -- identifying the proximity of the connected nodes. I want to find "clusters" in the graph (sets of nodes that are very close together). For instance, if the nodes were every city in the US and the edges were distance between the cities -- the clusters might be {Dallas, Houston, Fort Worth} and {New York, Bridgeport, Jersey City, Trenton}.
The clusters don't have to be the same size, and not all nodes have to be in a cluster. Instead, each cluster needs to have some minimum average weight W, equal to (sum of edge weights in the cluster) / (number of edges in the cluster).
I am most comfortable in Python, and NetworkX seems to be the standard tool for this (see also: What is the most efficient graph data structure in Python?).
It seems like this would not be too hard to program, although not particularly efficiently. Is there a name for the algorithm I am describing? Is there an implementation in NetworkX already?
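The average-weight criterion itself is only a couple of lines in NetworkX (a hedged sketch; the function name is illustrative):

import networkx as nx

def average_cluster_weight(G, nodes, weight="weight"):
    # average edge weight inside the cluster induced by `nodes`
    H = G.subgraph(nodes)
    if H.number_of_edges() == 0:
        return 0.0
    return H.size(weight=weight) / H.number_of_edges()

# a candidate cluster would be acceptable if average_cluster_weight(G, cluster) >= W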
I know some graph partitioning algorithms whose goal is to make all parts approximately the same size with as small an edge cut as possible, but as you described it, you do not need such an algorithm. Anyway, I think your problem is NP-complete, like many other graph partitioning problems.
Maybe there are algorithms that work well specifically for your problem (and I think there are, but I do not know them), but I think you can still find good and acceptable solutions by slightly changing some of the algorithms that were originally designed for finding a minimum edge cut with equally sized components.
For example, see this. I think you can use multilevel k-way partitioning with some changes.
For example, in the coarsening phase, you can use Light Edge Matching.
Consider a situation in the coarsening phase where you've matched A and B into one group and C and D into another group. The weight of the edge between these two groups is the sum of the edge weights between their members, e.g. W = Wac + Wad + Wbc + Wbd, where W is the group-to-group edge weight, Wac is the edge weight between A and C, and so on. I also think that taking the average of Wac, Wad, Wbc and Wbd instead of their sum is worth a try.
In my experience this algorithm is very fast, but I am not sure you will be able to find a pre-coded library in Python that you can modify.
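A rough sketch of one such coarsening pass with NetworkX, illustrating the group-weight bookkeeping described above (the matching rule and the names are assumptions, not the METIS implementation):

import networkx as nx

def coarsen_once(G, weight="weight"):
    # Match each unmatched node with the unmatched neighbour joined by the
    # lightest edge, then collapse matched pairs into super-nodes; the weight
    # of a coarse edge is the sum of the original weights between the groups
    # (e.g. W = Wac + Wad + Wbc + Wbd).
    matched, mapping, next_id = set(), {}, 0
    for u in G.nodes():
        if u in matched:
            continue
        candidates = [(d.get(weight, 1), v) for v, d in G[u].items()
                      if v not in matched]
        if candidates:
            _, v = min(candidates, key=lambda c: c[0])
            matched.update({u, v})
            mapping[u] = mapping[v] = next_id
        else:
            matched.add(u)
            mapping[u] = next_id
        next_id += 1

    coarse = nx.Graph()
    coarse.add_nodes_from(set(mapping.values()))
    for u, v, d in G.edges(data=True):
        cu, cv = mapping[u], mapping[v]
        if cu == cv:
            continue                      # edge inside a super-node disappears
        w = d.get(weight, 1)
        if coarse.has_edge(cu, cv):
            coarse[cu][cv][weight] += w   # accumulate group-to-group weight
        else:
            coarse.add_edge(cu, cv, **{weight: w})
    return coarse, mapping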

Solve multi-objective optimization of a graph in Python

I'm trying to carry out what seems to be a complicated and time-consuming multi-objective optimization on a large-ish graph.
Here's the problem: I want to find a graph of n vertices (n is constant at, say, 100) and m edges (m can change) where a set of metrics is optimized:
Metric A needs to be as high as possible
Metric B needs to be as low as possible
Metric C needs to be as high as possible
Metric D needs to be as low as possible
My best guess is to go with a GA. I am not very familiar with genetic algorithms, but I can spend a little time learning the basics. From what I've read so far, I would proceed as follows:
Generate a population of graphs of n nodes randomly connected to each other by m = random[1,2000] (for instance) edges
Run the metrics A, B, C, D on each graph
Is an optimal solution found (as defined in the problem)?
If yes, perfect. If not:
Select the best graphs
Crossover
Mutate (add or remove edges randomly?)
Go to 3.
Now, I usually use Python for my little experiments. Could DEAP (https://code.google.com/p/deap/) help me with this problem?
If so, I have many more questions (especially on the crossover and mutate steps), but in short: are the steps (in Python, using DEAP) easy enough to be explained or summarized here?
I can try and elaborate if needed. Cheers.
Disclaimer: I am one of DEAP's lead developers.
Your individual could be represented by a binary string. Each bit would indicate whether there is an edge between two vertices. Therefore, your individuals would be composed of n * (n - 1) / 2 bits, where n is the number of vertices. To evaluate your individual, you would simply need to build an adjacency matrix from the individual's genotype. For an example evaluation function, see the following gist https://gist.github.com/cmd-ntrf/7816665.
Your fitness would be composed of 4 objectives, and based on what you said regarding minimization and maximization of each objective, the fitness class would be created like this:
creator.create("Fitness", base.Fitness, weights=(1.0, -1.0, 1.0, -1.0))
The crossover and mutation operators could be the same as in the OneMax example.
http://deap.gel.ulaval.ca/doc/default/examples/ga_onemax_short.html
However, since you want to do multi-objective, you would need a multi-objective selection operator, either NSGA2 or SPEA2. Finally, the algorithm would have to be mu + lambda. For both multi-objective selection and mu + lambda algorithm usage, see the GA Knapsack example.
http://deap.gel.ulaval.ca/doc/default/examples/ga_knapsack.html
So essentially, to get up and running, you only have to merge a part of the onemax example with the knapsack while using the proposed evaluation function.
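Putting those pieces together, a rough sketch of the setup might look like the following (the metric functions are placeholders to replace with the real A, B, C, D computations; a recent DEAP and networkx are assumed):

import random
import numpy as np
import networkx as nx
from deap import algorithms, base, creator, tools

N = 100                                  # number of vertices
N_BITS = N * (N - 1) // 2                # one bit per possible undirected edge

creator.create("Fitness", base.Fitness, weights=(1.0, -1.0, 1.0, -1.0))
creator.create("Individual", list, fitness=creator.Fitness)

toolbox = base.Toolbox()
toolbox.register("attr_bit", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_bit, N_BITS)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def to_graph(individual):
    # rebuild the symmetric adjacency matrix from the flat bit string
    adj = np.zeros((N, N), dtype=int)
    adj[np.triu_indices(N, k=1)] = individual
    return nx.from_numpy_array(adj + adj.T)

def evaluate(individual):
    g = to_graph(individual)
    # placeholder metrics -- substitute the real A, B, C, D computations
    a = nx.density(g)                              # metric A, maximise
    b = g.number_of_edges()                        # metric B, minimise
    c = sum(nx.triangles(g).values()) / 3          # metric C, maximise
    d = max(dict(g.degree()).values())             # metric D, minimise
    return a, b, c, d

toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.01)   # flips edges on/off
toolbox.register("select", tools.selNSGA2)

pop = toolbox.population(n=100)
pop, _ = algorithms.eaMuPlusLambda(pop, toolbox, mu=100, lambda_=200,
                                   cxpb=0.6, mutpb=0.3, ngen=50)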
I suggest the excellent pyevolve library https://github.com/perone/Pyevolve. It will do most of the work for you; you will only have to define the fitness function and your representation (nodes/functions). You can specify the crossover and mutation rates as well.

What algorithms can I use to make inferences from a graph?

I have edited the question to make it a bit more specific.
I am not trying to base it on the content of nodes, but solely on the structure of the directed graph.
For example, PageRank (at first) solely used the link structure (a directed graph) to make inferences about what was more relevant. I'm not totally sure, but I think Elo (chess ranking) does something similar to rank players (although it also adds scores).
I'm using python's networkx package but right now I just want to understand any algorithms that accomplish this.
Thanks!
Eigenvector centrality is a network metric that can be used to model the probability that a node will be encountered in a random walk. It factors in not only the number of edges a node has, but also the number of edges of the nodes it connects to, and so on outward through the network. It can be implemented with a random walk, which is how Google's PageRank algorithm works.
That said, the field of network analysis is broad and continues to develop with new and interesting research. The way you ask the question implies that you might have a different impression. Perhaps start by looking over the three links I included here and see if that gets you started, then follow up with more specific questions.
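As a small hedged example of structure-only ranking with networkx (the toy directed graph below is made up):

import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")])

# PageRank: stationary distribution of a random walk with teleportation
pr = nx.pagerank(G, alpha=0.85)
print(sorted(pr, key=pr.get, reverse=True))      # nodes ordered by rank

# eigenvector centrality on the undirected view (power iteration converges
# reliably there for small examples like this one)
ec = nx.eigenvector_centrality(G.to_undirected())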
You should probably take a look at Markov Random Fields and Conditional Random Fields. Perhaps the closest thing to what you're describing is a Bayesian Network.
