python, igraph coping with vertex renumbering

I am implementing an algorithm for finding a dense subgraph in a directed graph using python+igraph. The main loop maintains two subgraphs S and T, which are initially identical, and removes nodes (and incident edges) according to a count of the indegree (or outdegree) of those nodes with respect to the other graph. The problem I have is that igraph renumbers the vertices, so when I delete some from T, the remaining nodes no longer correspond to the same ones in S.
Here is the key part of the main loop.
def directed(S):
    T = S.copy()
    c = 2
    while(S.vcount() > 0 and T.vcount() > 0):
        if (S.vcount()/T.vcount() > c):
            AS = S.vs.select(lambda vertex: T.outdegree(vertex) < 1.01*E(S,T)/S.vcount())
            S.delete_vertices(AS)
        else:
            BT = T.vs.select(lambda vertex: S.indegree(vertex) < 1.01*E(S,T)/T.vcount())
            T.delete_vertices(BT)
This doesn't work because of the effect of deleting vertices on the vertex ids. Is there a standard workaround for this problem?

One possibility is to assign unique names to the vertices in the name vertex attribute. These are kept intact when vertices are removed (unlike vertex IDs), and you can use them to refer to vertices in functions like indegree or outdegree. E.g.:
>>> g = Graph.Ring(4)
>>> g.vs["name"] = ["A", "B", "C", "D"]
>>> g.degree("C")
2
>>> g.delete_vertices(["B"])
>>> g.degree("C")
1
Note that I have removed vertex B so vertex C also gained a new ID, but the name is still the same.
In your case, the row with the select condition could probably be re-written like this:
AS = S.vs.select(lambda vertex: T.outdegree(vertex["name"]) < 1.01 * E(S,T)/S.vcount())
Of course this assumes that initially the vertex names are the same in S and T.

Related

Scale-free network using preferential attachment algorithm

I'm having trouble understanding what this piece of code does. Please could someone step by step go through the code and explain how it works and what it's doing?
def scale_free(n,m):
    if m < 1 or m >=n:
        raise nx.NetworkXError("Preferential attactment algorithm must have m >= 1"
                               " and m < n, m = %d, n = %d" % (m, n))
    # Add m initial nodes (m0 in barabasi-speak)
    G=nx.empty_graph(m)
    # Target nodes for new edges
    targets=list(range(m))
    # List of existing nodes, with nodes repeated once for each adjacent edge
    repeated_nodes=[]
    # Start adding the other n-m nodes. The first node is m.
    source=m
    while source<n:
        # Add edges to m nodes from the source.
        G.add_edges_from(zip([source]*m,targets))
        # Add one node to the list for each new edge just created.
        repeated_nodes.extend(targets)
        # And the new node "source" has m edges to add to the list.
        repeated_nodes.extend([source]*m)
        # Now choose m unique nodes from the existing nodes
        # Pick uniformly from repeated_nodes (preferential attachement)
        targets = _random_subset(repeated_nodes,m)
        source += 1
    return G

So the first part of this makes sure that m is at least 1 and n>m.
def scale_free(n,m):
    if m < 1 or m >=n:
        raise nx.NetworkXError("Preferential attactment algorithm must have m >= 1"
                               " and m < n, m = %d, n = %d" % (m, n))
Then it creates a graph with no edges and the first m nodes 0, 1, ..., m-1.
This looks a bit different from the standard Barabási-Albert graph, which starts from a connected version rather than a version without any edges.
    # Add m initial nodes (m0 in barabasi-speak)
    G=nx.empty_graph(m)
Now it's going to start adding new nodes one at a time and connecting them to existing nodes based on various rules. It first creates a list of "targets" that holds all of the nodes in the edge-less graph.
    # Target nodes for new edges
    targets=list(range(m))
    # List of existing nodes, with nodes repeated once for each adjacent edge
    repeated_nodes=[]
    # Start adding the other n-m nodes. The first node is m.
    source=m
Now it's going to add each node one at a time. When it does that, it will add the new node with edges to m of the previously existing nodes. Those m previous nodes have been stored in a list called targets.
    while source<n:
Here it creates those edges.
        # Add edges to m nodes from the source.
        G.add_edges_from(zip([source]*m,targets))
Now it's going to decide who will get those edges when the next node is added. It's supposed to choose them with probability proportional to their degree. The way it does that is by keeping a list repeated_nodes in which each node appears once per incident edge. It then chooses a random set of m nodes from that list to be the new targets. Depending on how _random_subset is defined, it might or might not be able to choose the same node several times as a target in the same step.
        # Add one node to the list for each new edge just created.
        repeated_nodes.extend(targets)
        # And the new node "source" has m edges to add to the list.
        repeated_nodes.extend([source]*m)
        # Now choose m unique nodes from the existing nodes
        # Pick uniformly from repeated_nodes (preferential attachement)
        targets = _random_subset(repeated_nodes,m)
        source += 1
    return G

Kosaraju’s algorithm for scc

Can anyone explain to me the logic behind Kosaraju's algorithm for finding strongly connected components?
I have read the description, but I can't understand how the DFS on the reversed graph can detect the number of strongly connected components.
import sys

def dfs(visited, stack, adj, x):
    visited[x] = 1
    for neighbor in adj[x]:
        if (visited[neighbor]==0):
            dfs(visited, stack, adj, neighbor)
    stack.insert(0, x)
    return stack, visited

def reverse_dfs(visited, adj, x, cc):
    visited[x] = 1
    for neighbor in adj[x]:
        if (visited[neighbor]==0):
            cc += 1
            reverse_dfs(visited, adj, neighbor,cc)
    print(x)
    return cc

def reverse_graph(adj):
    vertex_num = len(adj)
    new_adj = [ [] for _ in range(vertex_num)]
    for i in range(vertex_num):
        for j in adj[i]:
            new_adj[j].append(i)
    return new_adj

def find_post_order(adj):
    vertex_num = len(adj)
    visited = [0] * vertex_num
    stack = []
    for vertex in range(vertex_num):
        if visited[vertex] == 0:
            stack, visited = dfs(visited, stack, adj, vertex)
    return stack

def number_of_strongly_connected_components(adj):
    vertex_num = len(adj)
    new_adj = reverse_graph(adj)
    stack = find_post_order(adj)
    visited = [0] * vertex_num
    cc_num = 0
    while (stack):
        vertex = stack.pop(0)
        print(vertex)
        print('------')
        if visited[vertex] == 0:
            cc_num = reverse_dfs(visited, new_adj, vertex, cc_num)
    return cc_num

if __name__ == '__main__':
    input = sys.stdin.read()
    data = list(map(int, input.split()))
    n, m = data[0:2]
    data = data[2:]
    edges = list(zip(data[0:(2 * m):2], data[1:(2 * m):2]))
    adj = [[] for _ in range(n)]
    for (a, b) in edges:
        adj[a - 1].append(b - 1)
    print(number_of_strongly_connected_components(adj))
Above you can find my implementation. I guess that initial DFS and reverse operation are done correctly, but I can't get how to perform the second DFS properly.
Thanks in advance.

The first thing that you should notice is that the set of strongly connected components is the same for a graph and its reverse. In fact, the algorithm actually finds the set of strongly connected components in the reversed graph, not the original (but that's alright, because both graphs have the same SCCs).
The first DFS execution results in a stack that holds the vertices in a particular order, such that when the second DFS is executed on the vertices in this order (on the reversed graph), it computes the SCCs correctly. So the whole purpose of running the first DFS is to find an ordering of the graph vertices that serves the execution of the DFS algorithm on the reversed graph. Now I'll explain which property the order of vertices in the stack has and how it serves the execution of DFS on the reversed graph.
So what is the property of the stack? Imagine S1 and S2 are two strongly connected components in the graph, and that in the reversed graph, S2 is reachable from S1. Obviously, S1 cannot be reachable from S2 as well, because if that were the case, S1 and S2 would collapse into a single component. Let x be the top vertex among the vertices in S1 (that is, all other vertices in S1 are below x in the stack). Similarly, let y be the top vertex among the vertices in S2 (again, all other vertices in S2 are below y in the stack). The property is that y (which belongs to S2) is above x (which belongs to S1) in the stack.
Why is this property helpful? When we execute DFS on the reversed graph, we do it according to the stack order. In particular, we explore y before we explore x. When exploring y we explore every vertex of S2, and since no vertex of S1 is reachable from S2 we explore all the vertices of S2 before we explore any vertex of S1. But this holds for any pair of connected components such that one is reachable from the other. In particular, the vertex at the top of the stack belongs to a sink component, and after we're done exploring that sink component, the top vertex again belongs to a sink relative to the graph induced by the not-yet-explored vertices.
Therefore, the algorithm correctly computes all the strongly connected components of the reversed graph, which, as aforesaid, are the same as in the original graph.
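The two passes described above can be sketched as follows. This is only an illustration with names of my own choosing (`count_sccs`, `order`, `mark`), it uses plain recursion and so suits small graphs only, and the point to notice is that the counter goes up once per DFS root in the second pass, not once per visited neighbour.

```python
def count_sccs(adj):
    """Count strongly connected components of a graph given as adjacency lists."""
    n = len(adj)

    # Pass 1: DFS on the original graph, recording vertices by finish time.
    visited = [False] * n
    order = []  # vertices in increasing finish time

    def dfs(x):
        visited[x] = True
        for y in adj[x]:
            if not visited[y]:
                dfs(y)
        order.append(x)

    for v in range(n):
        if not visited[v]:
            dfs(v)

    # Build the reversed graph.
    radj = [[] for _ in range(n)]
    for u in range(n):
        for w in adj[u]:
            radj[w].append(u)

    # Pass 2: DFS on the reversed graph, latest finisher first.
    visited = [False] * n

    def mark(x):
        visited[x] = True
        for y in radj[x]:
            if not visited[y]:
                mark(y)

    count = 0
    for v in reversed(order):
        if not visited[v]:
            count += 1  # a fresh root means a new component
            mark(v)
    return count

# 0 -> 1 -> 2 -> 0 is a cycle; vertex 3 hangs off vertex 2.
print(count_sccs([[1], [2], [0, 3], []]))  # 2
```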

Check if two nodes are on the same path in constant time for a DAG

I have a DAG (directed acyclic graph) which is represented by an edge list e.g.
edges = [('a','b'),('a','c'),('b','d')]
will give the graph
a -> b -> d
|
v
c
I'm doing many operations where I have to check if two nodes are on the same path (b and d are on the same path whereas b and c are not from example above) which in turn is slowing down my program. I was hoping I could somehow traverse the graph once and save all paths so I can check this in constant time.
Is this speed-up possible and how would I go about to implement this in python?
Edit:
Note that I need to check for both directions, so if I have a pair of nodes (a,b) I need to check for both a to b and b to a.

You actually want to find the transitive closure of the graph.
In computer science, the concept of transitive closure can be thought
of as constructing a data structure that makes it possible to answer
reachability questions. That is, can one get from node a to node d in
one or more hops?
There are multiple ways of finding the transitive closure of a graph. The simplest is the Floyd-Warshall algorithm, which runs in O(|V|^3) time.
Another way is to perform a DFS from each node and mark all the nodes you visit as reachable from the current node.
There is also a method that works only on DAGs. First perform a topological sort of your DAG. Then work backward through the topologically sorted list and OR together the transitive closures of the children of the current node to get the transitive closure of the current node.
Below is the Python implementation of the DFS based method:
def create_adj_dict_from_edges(edges):
    adj_dict = {}
    for e in edges:
        for u in e:
            if u not in adj_dict:
                adj_dict[u] = []
    for e in edges:
        adj_dict[e[0]].append(e[1])
    return adj_dict

def transitive_closure_from_adj_dict(adj_dict):
    def dfs(node, visited):
        if node not in adj_dict:
            return
        for nxt in adj_dict[node]:
            if nxt in visited:
                continue
            visited.add(nxt)
            dfs(nxt, visited)
    reachable = {}
    for node in adj_dict:
        # set([node]), not set(node): the latter would split a
        # multi-character node name into single characters
        visited = set([node])
        dfs(node, visited)
        reachable[node] = visited
    return reachable

def main():
    edges = [('a','b'),('a','c'),('b','d')]
    adj_dict = create_adj_dict_from_edges(edges)
    tc = transitive_closure_from_adj_dict(adj_dict)
    print(tc)
    # is there a path from (b to d) or (d to b)
    print('d' in tc['b'] or 'b' in tc['d'])
    # is there a path from (b to c) or (c to b)
    print('c' in tc['b'] or 'b' in tc['c'])

if __name__ == "__main__":
    main()
Output (Python 2):
{'a': set(['a', 'c', 'b', 'd']), 'c': set(['c']), 'b': set(['b', 'd']), 'd': set(['d'])}
True
False
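The DAG-only method mentioned above (topological order, then merging the closures of the children) can be sketched like this. It is a rough illustration rather than a tuned implementation: `dag_transitive_closure` and `on_same_path` are names I made up, each node is counted as reachable from itself to match the output above, and the input is assumed to be acyclic.

```python
from collections import defaultdict

def dag_transitive_closure(edges):
    """Map each node of a DAG to the set of nodes reachable from it (itself included)."""
    children = defaultdict(list)
    nodes = set()
    for u, v in edges:
        children[u].append(v)
        nodes.update((u, v))

    # Post-order DFS: a node is appended only after all its descendants,
    # so `order` is already a reverse topological order.
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in children[u]:
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in seen:
            visit(u)

    reachable = {}
    for u in order:  # children are always processed before their parents
        closure = {u}
        for v in children[u]:
            closure |= reachable[v]
        reachable[u] = closure
    return reachable

def on_same_path(tc, a, b):
    """True if a reaches b or b reaches a; two set lookups once tc is built."""
    return b in tc[a] or a in tc[b]

tc = dag_transitive_closure([('a', 'b'), ('a', 'c'), ('b', 'd')])
print(on_same_path(tc, 'b', 'd'))  # True
print(on_same_path(tc, 'b', 'c'))  # False
```

Each node's closure is built once from its children's finished closures, so no subtree is walked again for every ancestor.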

Iterate over two lists, execute function and return values

I am trying to iterate over two lists of the same length, and for the pair of entries per index, execute a function. The function aims to cluster the entries
according to some requirement X on the value the function returns.
The lists in questions are:
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]
and the function takes 4 entries: the e_list and p_list values for every pair of indices.
The function is defined as
def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp
I have so far been able to loop over the lists simultaneously as:
for index, (eta, phi) in enumerate(zip(e_list, p_list)):
    for index2, (eta2, phi2) in enumerate(zip(e_list, p_list)):
        if index == index2: continue # to avoid same indices
        if deltaR(eta, phi, eta2, phi2) < X:
            print (index, index2) , deltaR(eta, phi, eta2, phi2)
This loop executes the function on every combination, except those with equal indices, i.e. 0,0 or 1,1 etc.
The output of the code returns:
(0, 1) 0.659449892453
(1, 0) 0.659449892453
(2, 3) 0.657024790285
(2, 4) 0.642297230697
(3, 2) 0.657024790285
(3, 4) 0.109675332432
(4, 2) 0.642297230697
(4, 3) 0.109675332432
I am trying to return the number of indices that are all matched following the condition above. In other words, to rearrange the output to:
output = [No. matched entries]
i.e.
output = [2, 3]
2 coming from the fact that indices 0 and 1 are matched
3 coming from the fact that indices 2, 3, and 4 are all matched
A possible way I have thought of is to append to a list, all the indices used such that I return
output_list = [0, 1, 1, 0, 2, 3, 4, 3, 2, 4, 4, 2, 3]
Then, I use defaultdict to count the occurrences:
for index in output_list:
    hits[index] += 1
From the dict I can manipulate it to return [2,3] but is there a more pythonic way of achieving this?

This is finding connected components of a graph, which is very easy and well documented, once you revisit the problem from that view.
The data being in two lists is a distraction. I am going to consider the data to be zip(e_list, p_list). Consider this as a graph, which in this case has 5 nodes (but could have many more on a different data set). Construct the graph using these nodes, and connect them with an edge if they pass your distance test.
From there, you only need to determine the connected components of an undirected graph, which is covered on many many places. Here is a basic depth first search on this site: Find connected components in a graph
You loop through the nodes once, performing a DFS to find all connected nodes. Once you look at a node, mark it visited, so it does not get counted again. To get the answer in the format you want, simply count the number of unvisited nodes found from each unvisited starting point, and append that to a list.
------------------------ graph theory ----------------------
You have data points that you want to break down into related groups. This is a topic in both mathematics and computer science known as graph theory. see: https://en.wikipedia.org/wiki/Graph_theory
You have data points. Imagine drawing them in eta phi space as rectangular coordinates, and then draw lines between the points that are close to each other. You now have a "graph" with vertices and edges.
Working out which of these dots are joined by lines, directly or through other dots, is finding connected components. Obviously it's easy to see by eye, but if you have thousands of points, and you want a computer to find the connected components quickly, you use graph theory.
Suppose I make a list of all the eta phi points with zip(e_list, p_list), and each entry in the list is a vertex. If you store the graph in "adjacency list" format, then each vertex will also have a list of the outgoing edges which connect it to another vertex.
Finding a connected component is literally as easy as looking at each vertex, putting a checkmark by it, and then following every line to the next vertex and putting a checkmark there, until you can't find anything else connected. Now find the next vertex without a checkmark, and repeat for the next connected component.
As a programmer, you know that writing your own data structures for common problems is a bad idea when you can use published and reviewed code to handle the task. Google "python graph module". One example mentioned in comments is "pip install networkx". If you build the graph in networkx, you can get the connected components as sets of nodes, then take the len of each to get the format you want: [len(_) for _ in nx.connected_components(G)]
---------------- code -------------------
But if you don't understand the math, then you might not understand a module for graphs, nor a base python implementation, but it's pretty easy if you just look at some of those links. Basically dots and lines, but pretty useful when you apply the concepts, as you can see with your problem being nothing but a very simple graph theory problem in disguise.
My graph is a basic list here, so the vertices don't actually have names. They are identified by their list index.
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]

def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp

X = 1 # you never actually said, but this works

def these_two_particles_are_going_the_same_direction(p1, p2):
    return deltaR(p1.eta, p1.phi, p2.eta, p2.phi) < X

class Vertex(object):
    def __init__(self, eta, phi):
        self.eta = eta
        self.phi = phi
        self.connected = []
        self.visited = False

class Graph(object):
    def __init__(self, e_list, p_list):
        self.vertices = []
        for eta, phi in zip(e_list, p_list):
            self.add_node(eta, phi)

    def add_node(self, eta, phi):
        # add this data point at the next available index
        n = len(self.vertices)
        a = Vertex(eta, phi)
        for i, b in enumerate(self.vertices):
            if these_two_particles_are_going_the_same_direction(a,b):
                b.connected.append(n)
                a.connected.append(i)
        self.vertices.append(a)

    def reset_visited(self):
        # the list is called self.vertices; there is no self.nodes
        for v in self.vertices:
            v.visited = False

    def DFS(self, n):
        #perform depth first search from node n, return count of connected vertices
        count = 0
        v = self.vertices[n]
        if not v.visited:
            v.visited = True
            count += 1
            for i in v.connected:
                count += self.DFS(i)
        return count

    def connected_components(self):
        self.reset_visited()
        components = []
        for i, v in enumerate(self.vertices):
            if not v.visited:
                components.append(self.DFS(i))
        return components

g = Graph(e_list, p_list)
print(g.connected_components())
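If you would rather not maintain a Graph class, the same component sizes can probably be obtained with a small union-find. This is only a sketch: `component_sizes` is a hypothetical helper, X = 1 is the same guessed threshold as above, and the double unpacking in the call to deltaR needs Python 3.5 or later.

```python
from collections import Counter
from itertools import combinations

e_list = [-0.619489, -0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836, -1.14238, 1.73884, 1.94904, 1.84]
X = 1  # assumed threshold, the question never states it

def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp

def component_sizes(points, threshold):
    """Sizes of the groups of points linked by deltaR < threshold."""
    parent = list(range(len(points)))

    def find(i):
        # walk up to the representative, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if deltaR(*points[i], *points[j]) < threshold:
            parent[find(i)] = find(j)  # merge the two groups

    sizes = Counter(find(i) for i in range(len(points)))
    return sorted(sizes.values())

print(component_sizes(list(zip(e_list, p_list)), X))  # [2, 3]
```

Each unordered pair is tested once, which also avoids the mirrored (0, 1) / (1, 0) results of the original double loop.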

How to traverse tree with specific properties

I have a tree as shown below.
Red means it has a certain property, unfilled means it doesn't have it. I want to minimise the Red checks.
1. If a node is Red, then all its Ancestors are also Red (and should not be checked again).
2. If a node is Not Red, then all its Descendants are Not Red.
The depth of the tree is d.
The width of the tree is n.
Note that children nodes have value larger than the parent.
Example: In the tree below,
Node '0' has children [1, 2, 3],
Node '1' has children [2, 3],
Node '2' has children [3] and
Node '4' has children [] (No children).
Thus children can be constructed as:
if vertex.depth > 0:
    vertex.children = [Vertex(parent=vertex, val=child_val, depth=vertex.depth-1, n=n) for child_val in xrange(self.val+1, n)]
else:
    vertex.children = []
Here is an example tree:
I am trying to count the number of Red nodes. Both the depth and the width of the tree will be large. So I want to do a sort of Depth-First-Search and additionally use the properties 1 and 2 from above.
How can I design an algorithm to traverse that tree?
PS: I tagged this [python] but any outline of an algorithm would do.
Update & Background
I want to minimise the property checks.
The property check is checking the connectedness of a bipartite graph constructed from my tree's path.
Example:
The bottom-left node in the example tree has path = [0, 1].
Let the bipartite graph have sets R and C with size r and c. (Note, that the width of the tree is n=r*c).
From the path I get to the edges of the graph by starting with a full graph and removing edges (x, y) for all values in the path as such: x, y = divmod(value, c).
The two rules for the property check come from the connectedness of the graph:
- If the graph is connected with edges [a, b, c] removed, then it must also be connected with [a, b] removed (rule 1).
- If the graph is disconnected with edges [a, b, c] removed, then it must also be disconnected with additional edge d removed [a, b, c, d] (rule 2).
Update 2
So what I really want to do is check all combinations of picking d elements out of [0..n]. The tree structure somewhat helps but even if I got an optimal tree traversal algorithm, I still would be checking too many combinations. (I noticed that just now.)
Let me explain. Assume I have checked [4, 5] (so 4 and 5 are removed from the bipartite graph as explained above, but that is irrelevant here). If this comes out as "Red", my tree will prevent me from checking [4] only. That is good. However, I should also mark off [5] from checking.
How can I change the structure of my tree (to a graph, maybe?) to further minimise my number of checks?

Use a variant of the deletion–contraction algorithm for evaluating the Tutte polynomial (evaluated at (1,2), gives the total number of spanning subgraphs) on the complete bipartite graph K_{r,c}.
In a sentence, the idea is to order the edges arbitrarily, enumerate spanning trees, and count, for each spanning tree, how many spanning subgraphs of size r + c + k have that minimum spanning tree. The enumeration of spanning trees is performed recursively. If the graph G has exactly one vertex, the number of associated spanning subgraphs is the number of self-loops on that vertex choose k. Otherwise, find the minimum edge that isn't a self-loop in G and make two recursive calls. The first is on the graph G/e where e is contracted. The second is on the graph G-e where e is deleted, but only if G-e is connected.

Python is close enough to pseudocode.
class counter(object):
    def __init__(self, ival = 0):
        self.count = ival
    def count_up(self):
        self.count += 1
        return self.count

def old_walk_fun(ilist, func=None):
    def old_walk_fun_helper(ilist, func=None, count=0):
        tlist = []
        if(isinstance(ilist, list) and ilist):
            for q in ilist:
                tlist += old_walk_fun_helper(q, func, count+1)
        else:
            tlist = func(ilist)
        return [tlist] if(count != 0) else tlist
    if(func != None and hasattr(func, '__call__')):
        return old_walk_fun_helper(ilist, func)
    else:
        return []

def walk_fun(ilist, func=None):
    def walk_fun_helper(ilist, func=None, count=0):
        tlist = []
        if(isinstance(ilist, list) and ilist):
            if(ilist[0] == "Red"): # Only evaluate sub-branches if current level is Red
                for q in ilist:
                    tlist += walk_fun_helper(q, func, count+1)
        else:
            tlist = func(ilist)
        return [tlist] if(count != 0) else tlist
    if(func != None and hasattr(func, '__call__')):
        return walk_fun_helper(ilist, func)
    else:
        return []

# Crude tree structure, first element is always its colour; following elements are its children
tree_list = \
["Red",
    ["Red",
        ["Red",
            []
        ],
        ["White",
            []
        ],
        ["White",
            []
        ]
    ],
    ["White",
        ["White",
            []
        ],
        ["White",
            []
        ]
    ],
    ["Red",
        []
    ]
]

red_counter = counter()
eval_counter = counter()
old_walk_fun(tree_list, lambda x: (red_counter.count_up(), eval_counter.count_up()) if(x == "Red") else eval_counter.count_up())
print("Unconditionally walking")
print("Reds found: %d" % red_counter.count)
print("Evaluations made: %d" % eval_counter.count)
print("")

red_counter = counter()
eval_counter = counter()
walk_fun(tree_list, lambda x: (red_counter.count_up(), eval_counter.count_up()) if(x == "Red") else eval_counter.count_up())
print("Selectively walking")
print("Reds found: %d" % red_counter.count)
print("Evaluations made: %d" % eval_counter.count)
print("")

How hard are you working on making the test for connectedness fast?
To test a graph for connectedness I would pick edges in a random order and use union-find to merge vertices when I see an edge that connects them. I could terminate early if the graph was connected, and I have a sort of certificate of connectedness - the edges which connected two previously unconnected sets of vertices.
As you work down the tree/follow a path on the bipartite graph, you are removing edges from the graph. If the edge you remove is not in the certificate of connectedness, then the graph must still be connected - this looks like a quick check to me. If it is in the certificate of connectedness you could back up to the state of union/find as of just before that edge was added and then try adding new edges, rather than repeating the complete connectedness test.
Depending on exactly how you define a path, you may be able to say that extensions of that path will never include edges using a subset of vertices - such as vertices which are in the interior of the path so far. If edges originating from those untouchable vertices are sufficient to make the graph connected, then no extension of the path can ever make it unconnected. Then at the very least you just have to count the number of distinct paths. If the original graph is regular I would hope to find some dynamic programming recursion that lets you count them without explicitly enumerating them.
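To make the certificate idea concrete, here is a rough sketch of the union-find test. The function name, its return convention (a list of merging edges, or None when disconnected) and the `rng` parameter are all my own assumptions for illustration. Vertices are assumed to be numbered 0 to num_vertices - 1.

```python
import random

def connectivity_certificate(num_vertices, edges, rng=random):
    """Return the edges that merged two components if the graph is connected, else None."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # shorten the path as we go
            v = parent[v]
        return v

    order = list(edges)
    rng.shuffle(order)  # look at the edges in random order
    certificate = []
    for u, v in order:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            certificate.append((u, v))
            if len(certificate) == num_vertices - 1:
                return certificate  # early exit: everything is merged
    return certificate if num_vertices <= 1 else None

# K_{2,2} with R = {0, 1} and C = {2, 3}
edges = [(0, 2), (0, 3), (1, 2), (1, 3)]
cert = connectivity_certificate(4, edges)
print(len(cert))  # 3

# Removing an edge that is not in the certificate cannot disconnect the graph.
spare = [e for e in edges if e not in cert]
remaining = [e for e in edges if e != spare[0]]
print(connectivity_certificate(4, remaining) is not None)  # True
```

Only when a removed edge is one of the certificate edges does any real work have to be redone.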
