How to find trio in list of lists? [duplicate] - python
I am working with complex networks. I want to find groups of nodes that form a cycle of 3 nodes (a triangle) in a given graph. As my graph contains about a million edges, a simple iterative solution (multiple nested "for" loops) is not very efficient.
I am using Python; if there are any built-in modules for handling this problem, please let me know.
If someone knows an algorithm that can be used for finding triangles in graphs, kindly reply.
Assuming it's an undirected graph, the answer lies in the networkx library for Python.
If you just need to count triangles, use:
import networkx as nx
tri = nx.triangles(g)  # dict: node -> number of triangles that include that node
But if you need to know which nodes are in a triangle (triadic) relationship, use:
all_cliques = nx.enumerate_all_cliques(g)
This will give you all cliques of every size (k = 1, 2, 3, ... up to the largest clique in the graph).
So, to filter just the triangles, i.e. k = 3:
triad_cliques = [x for x in all_cliques if len(x) == 3]
triad_cliques is then a list of 3-node lists, one per triangle.
A million edges is quite small. Unless you are doing it thousands of times, just use a naive implementation.
I'll assume that you have a dictionary of node_ids, which point to a sequence of their neighbors, and that the graph is directed.
For example:
nodes = {}
nodes[0] = 1,2
nodes[1] = tuple() # empty tuple
nodes[2] = (1,)  # one-element tuple; a bare 1 would not be iterable
My solution:
def generate_triangles(nodes):
    """Generate triangles. Weed out duplicates."""
    visited_ids = set()  # remember the nodes that we have tested already
    for node_a_id in nodes:
        for node_b_id in nodes[node_a_id]:
            if node_b_id == node_a_id:
                raise ValueError  # nodes shouldn't point to themselves
            if node_b_id in visited_ids:
                continue  # we should have already found b->a->??->b
            for node_c_id in nodes[node_b_id]:
                if node_c_id in visited_ids:
                    continue  # we should have already found c->a->b->c
                if node_a_id in nodes[node_c_id]:
                    yield (node_a_id, node_b_id, node_c_id)
        visited_ids.add(node_a_id)  # don't search a - we already have all those cycles
Checking performance:
from random import randint
n = 1000000
node_list = range(n)
nodes = {}
for node_id in node_list:
    node = tuple()
    for i in range(randint(0, 10)):  # add up to 10 neighbors
        try:
            neighbor_id = node_list[node_id + randint(-5, 5)]  # pick a nearby node
        except IndexError:
            continue  # ran off the end of the list
        # skip self-loops (generate_triangles rejects them) and repeated neighbors
        if neighbor_id != node_id and neighbor_id not in node:
            node = node + (neighbor_id,)
    nodes[node_id] = node
cycles = list(generate_triangles(nodes))
print(len(cycles))
When I tried it, it took longer to build the random graph than to count the cycles.
You might want to test it though ;) I won't guarantee that it's correct.
You could also look into networkx, which is the big python graph library.
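A pretty easy and clear way to do this is to use Networkx:
With Networkx you can get loops of an undirected graph with nx.cycle_basis(G) and then select the ones with 3 nodes. Note that cycle_basis returns only a basis of independent cycles, not every cycle, so this can miss triangles.
cycls_3 = [c for c in nx.cycle_basis(G) if len(c) == 3]
Or you can find the cliques with nx.find_cliques(G) and then select the ones you want (with 3 nodes). Cliques are sections of the graph where all the nodes are connected to each other, which is exactly what happens in cycles/loops with 3 nodes. Be aware that find_cliques returns only maximal cliques, so a triangle that sits inside a larger clique is not listed on its own.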
Even though it isn't efficient, you may want to implement a solution, so use the loops. Write a test so you can get an idea as to how long it takes.
Then, as you try new approaches you can do two things:
1) Make certain that the answer remains the same.
2) See what the improvement is.
Having a faster algorithm that misses something is probably going to be worse than having a slower one.
Once you have the slow test, you can see if you can do this in parallel and see what the performance increase is.
Then, you can see if you can mark (and skip) all nodes that have fewer than 2 neighbors, since they cannot be part of a triangle.
Ideally, you may want to shrink it down to just 100 or so first, so you can draw it, and see what is happening graphically.
Sometimes your brain will see a pattern that isn't as obvious when looking at algorithms.
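The advice above (keep a slow version as the reference and compare every faster attempt against it) could be sketched roughly as follows. This is only an illustration: the function names brute_force_triangles and neighbor_pair_triangles and the dict-of-neighbor-sets graph layout are assumptions, not something taken from the posts here.

```python
from itertools import combinations

def brute_force_triangles(adj):
    """Slow reference: test every group of 3 nodes (undirected graph)."""
    return {
        frozenset((a, b, c))
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    }

def neighbor_pair_triangles(adj):
    """Faster candidate: for each node, test every pair of its neighbors."""
    return {
        frozenset((n, u, v))
        for n in adj
        for u, v in combinations(adj[n], 2)
        if v in adj[u]
    }

# Small graph that is easy to draw: square 0-1-2-3 with diagonal 0-2, plus a tail 3-4.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2, 4},
    4: {3},
}

reference = brute_force_triangles(adj)
candidate = neighbor_pair_triangles(adj)
assert candidate == reference  # 1) the answer must remain the same
print(sorted(sorted(t) for t in reference))
```

Once the two agree on small graphs, timing each function on larger inputs shows what the improvement is.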
I don't want to sound harsh, but have you tried to Google it? The first link is a pretty quick algorithm to do that:
http://www.mail-archive.com/algogeeks@googlegroups.com/msg05642.html
And then there is this article on ACM (which you may have access to):
http://portal.acm.org/citation.cfm?id=244866
(and if you don't have access, I am sure if you kindly ask the lady who wrote it, you will get a copy.)
Also, I can imagine a triangle enumeration method based on clique-decomposition, but I don't know if it was described somewhere.
I am working on the same problem of counting the number of triangles in an undirected graph, and wisty's solution works really well in my case. I have modified it a bit so that only undirected triangles are counted.
#### function for counting undirected cycles
def generate_triangles(nodes):
    visited_ids = set()  # mark visited node
    for node_a_id in nodes:
        temp_visited = set()  # to get undirected triangles
        for node_b_id in nodes[node_a_id]:
            if node_b_id == node_a_id:
                raise ValueError  # to prevent self-loops, if your graph allows self-loops then you don't need this condition
            if node_b_id in visited_ids:
                continue
            for node_c_id in nodes[node_b_id]:
                if node_c_id in visited_ids:
                    continue
                if node_c_id in temp_visited:
                    continue
                if node_a_id in nodes[node_c_id]:
                    yield (node_a_id, node_b_id, node_c_id)
                else:
                    continue
            temp_visited.add(node_b_id)
        visited_ids.add(node_a_id)
Of course, you need to use a dictionary, for example:
#### Test cycles ####
nodes = {}
nodes[0] = [1, 2, 3]
nodes[1] = [0, 2]
nodes[2] = [0, 1, 3]
nodes[3] = [1]
cycles = list(generate_triangles(nodes))
print(cycles)
Using the code of Wisty, the triangles found will be
[(0, 1, 2), (0, 2, 1), (0, 3, 1), (1, 2, 3)]
which counted the triangle (0, 1, 2) and (0, 2, 1) as two different triangles. With the code I modified, these are counted as only one triangle.
I used this with a relatively small dictionary of under 100 keys and each key has on average 50 values.
Surprised to see no mention of the Networkx triangles function. I know it doesn't necessarily return the groups of nodes that form a triangle, but it should be pretty relevant to many who find themselves on this page.
nx.triangles(G)  # dict of how many triangles each node is part of
sum(nx.triangles(G).values()) // 3  # total number of triangles (each one is counted once per corner)
An alternative way to return clumps of nodes would be something like...
# adj_m is assumed to be a SciPy sparse adjacency matrix of G, with nodes labeled 0..n-1
for u, v, d in G.edges(data=True):
    u_array = adj_m.getrow(u).nonzero()[1]  # get lists of all adjacent nodes
    v_array = adj_m.getrow(v).nonzero()[1]
    # find the intersection of the two sets - these are the third node of the triangle
    np.intersect1d(v_array, u_array)
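The same edge-by-edge idea can be tried without a sparse matrix. The following is a minimal sketch, assuming the graph is stored as a plain dict mapping each node to a set of neighbors and that node labels can be compared with <; the function name triangles_by_edge_intersection is made up for this example.

```python
def triangles_by_edge_intersection(adj):
    """Yield each triangle once as a tuple (u, v, w) with u < v < w."""
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge only once
                # nodes adjacent to both u and v close a triangle on the edge (u, v)
                for w in adj[u] & adj[v]:
                    if v < w:  # report the triangle only from its smallest edge
                        yield (u, v, w)

# Square 0-1-2-3 with the diagonal 0-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
triangles = sorted(triangles_by_edge_intersection(adj))
print(triangles)
```

The ordering checks are what keep (0, 1, 2) from also showing up as (1, 2, 0) or (0, 2, 1).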
If you don't care about multiple copies of the same triangle in different order then a list of 3-tuples works:
from itertools import combinations as combos
[(n,nbr,nbr2) for n in G for nbr, nbr2 in combos(G[n],2) if nbr in G[nbr2]]
The logic here is to check each pair of neighbors of every node to see if they are connected. G[n] is a fast way to iterate over or look up neighbors.
If you want to get rid of reorderings, turn each triple into a frozenset and make a set of the frozensets:
set(frozenset([n,nbr,nbr2]) for n in G for nbr, nbr2 in combos(G[n],2) if nbr in G[nbr2])
If you don't like frozenset and want a list of sets then:
triple_iter = ((n, nbr, nbr2) for n in G for nbr, nbr2 in combos(G[n],2) if nbr in G[nbr2])
triangles = set(frozenset(tri) for tri in triple_iter)
nice_triangles = [set(tri) for tri in triangles]
Do you need to find 'all' of the 'triangles', or just 'some'/'any'?
Or perhaps you just need to test whether a particular node is part of a triangle?
The test is simple - given a node A, are there any two nodes B & C connected to A that are also directly connected to each other?
If you need to find all of the triangles - specifically, all groups of 3 nodes in which each node is joined to the other two - then you need to check every possible group in a very long running 'for each' loop.
The only optimisation is ensuring that you don't check the same 'group' twice, e.g. if you have already tested that B & C aren't in a group with A, then don't check whether A & C are in a group with B.
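The single-node test described above might look like this. It is a sketch under assumptions: the graph is an undirected dict of neighbor sets, and node_in_triangle is a hypothetical helper name.

```python
from itertools import combinations

def node_in_triangle(adj, a):
    """Return True if two neighbors of a are directly connected to each other."""
    return any(c in adj[b] for b, c in combinations(adj[a], 2))

# Triangle 0-1-2 with a tail 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(node_in_triangle(adj, 0))
print(node_in_triangle(adj, 3))
```

Because any() stops at the first hit, this answers the "some/any" question without enumerating every triangle.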
This is a more efficient version of Ajay M's answer (I would have commented on it, but I don't have enough reputation).
Indeed, the enumerate_all_cliques method of networkx will return all cliques in the graph, irrespective of their length; hence looping over it may take a lot of time (especially with very dense graphs).
Moreover, once defined for triangles, it's just a matter of parametrization to generalize the method to any clique length, so here's a function:
import networkx as nx

def get_cliques_by_length(G, length_clique):
    """Return the list of all cliques in an undirected graph G with length
    equal to length_clique."""
    cliques = []
    # enumerate_all_cliques yields cliques in order of increasing size,
    # so we can stop as soon as a larger one shows up
    for c in nx.enumerate_all_cliques(G):
        if len(c) <= length_clique:
            if len(c) == length_clique:
                cliques.append(c)
        else:
            return cliques
    # return empty list if nothing is found
    return cliques
To get triangles just use get_cliques_by_length(G, 3).
Caveat: this method works only for undirected graphs. Algorithms for cliques in directed graphs are not provided in networkx.
I just found that nx.edge_disjoint_paths can be used to count the triangles that contain a given edge. It is faster than nx.enumerate_all_cliques and nx.cycle_basis.
It returns the edge-disjoint paths between source and target. Edge-disjoint paths are paths that do not share any edge.
The result minus 1 (for the direct edge itself) is the number of triangles that contain that edge, i.e. that run between the source node and the target node. Note that edge-disjoint paths may be longer than two edges, so this overcounts whenever the two endpoints are also joined by longer detours.
edge_triangle_dict = {}
for i in g.edges:
    edge_triangle_dict[i] = len(list(nx.edge_disjoint_paths(g, i[0], i[1]))) - 1
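For comparison, the number of triangles through an edge (u, v) equals the number of neighbors that u and v share. Here is a minimal sketch of that count, assuming a plain dict of neighbor sets with comparable node labels rather than the networkx graph g used above; triangles_per_edge is a made-up name.

```python
def triangles_per_edge(adj):
    """Map each undirected edge (u, v), u < v, to the number of triangles through it."""
    counts = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                counts[(u, v)] = len(adj[u] & adj[v])  # shared neighbors
    return counts

# A 4-cycle has two edge-disjoint routes between adjacent nodes but no triangle.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(triangles_per_edge(square))
print(triangles_per_edge(triangle))
```

The square is a useful test case for any per-edge counting method, since every edge there should get a count of zero.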
Related
Checking if a graph contain a given induced subgraph
I'm trying to detect some minimal patterns with properties in random digraphs. Namely, I have a list called patterns of adjacency matrices of various sizes. For instance, I have [0] (a sink), but also [0100, 0001, 1000, 0010] (a cycle of size 4), [0100, 0010, 0001, 0000] (a path of length 3), etc.
When I generate a digraph, I compute all sets that may be new patterns. However, in most cases it is something that I don't care about: for instance, if the potential new pattern is a cycle of size 5, it does not teach me anything because it has a cycle of length 3 as an induced subgraph.
I suppose one way to do it would look like this:
# D is the adjacency matrix of a possible new pattern
new_pattern = True
for pi in patterns:
    k = len(pi)
    induced_subgraphs = all_induced_subgraphs(D, k)
    for s in induced_subgraphs:
        if isomorphic(s, pi):
            new_pattern = False
            break
where all_induced_subgraphs(D, k) gives all possible induced subgraphs of D of size k, and isomorphic(s, pi) determines if s and pi are isomorphic digraphs.
However, checking all induced subgraphs of a digraph seems absolutely horrible to do. Is there a clever thing to do there?
Thanks to @Stef I learned that this problem has a name and can be solved with networkx using a function described on this page. Personally I use igraph on my project so I will use this.
Prune nodes not in networkx simple path?
I have a DiGraph and I want to prune any node that's not contained in one of the simple paths between two of the nodes that I specify. (Another way to think of it is any node that can't reach both the start and end points should be trimmed). The best way I've found to do this is to get all_simple_paths, then to rebuild a new graph using those, but I'm hoping for a more elegant and less error-prone solution. For example, is there a way to determine what's NOT on a simple path, and to then delete those nodes?
You can use the method all_simple_paths which returns a generator but you only need the first path. Then you can use G.subgraph(nbunch) to return the induced graph from your path.
EDIT: to return the subgraph induced by all simple paths, just concatenate the unique nodes returned by all_simple_paths.
import networkx as nx
import itertools
G = nx.complete_graph(10)  # or DiGraph, MultiGraph, MultiDiGraph, etc
# Concatenate all the paths and keep unique nodes (in one line)
all_path_nodes = set(itertools.chain(*list(nx.all_simple_paths(G, source=0, target=3))))
# Extract the induced subgraph from a given list of nodes
H = G.subgraph(all_path_nodes)
print(nx.info(H))  # nx.info exists only in networkx < 3.0
Output:
Name: complete_graph(10)
Type: Graph
Number of nodes: 10
Number of edges: 45
Average degree: 9.0000
I did make some progress on this while @kikohs was working to understand my question and provide his answer, so I'm posting this as an alternative solution to the problem. I do think his answer is superior though!
def _trim_branches(self, g, start, end):
    """Find all the paths from start to finish, and nuke any nodes that aren't in those paths."""
    good_nodes = set()
    for path in networkx.all_simple_paths(g, source=start, target=end):
        good_nodes.update(path)
    for node in list(g.nodes):  # copy first: the graph must not change size while being iterated
        if node not in good_nodes:
            g.remove_node(node)
    return g
Using subgraph to do the second loop is clearly better, as is his one-liner using itertools.chain. Great stuff around these parts today!
Fetch connected nodes in a NetworkX graph
Straightforward question: I would like to retrieve all the nodes connected to a given node within a NetworkX graph in order to create a subgraph. In the example shown below, I just want to extract all the nodes inside the circle, given the name of any one of them.
I've tried the following recursive function, but hit Python's recursion limit, even though there are only 91 nodes in this network.
Regardless of whether or not the below code is buggy, what is the best way to do what I'm trying to achieve? I will be running this code on graphs of various sizes, and will not know beforehand what the maximum recursion depth will be.
def fetch_connected_nodes(node, neighbors_list):
    for neighbor in assembly.neighbors(node):
        print(neighbor)
        if len(assembly.neighbors(neighbor)) == 1:
            neighbors_list.append(neighbor)
            return neighbors_list
        else:
            neighbors_list.append(neighbor)
            fetch_connected_nodes(neighbor, neighbors_list)

neighbors = []
starting_node = 'NODE_1_length_6578_cov_450.665_ID_16281'
connected_nodes = fetch_connected_nodes(starting_node, neighbors)
Assuming the graph is undirected, there is a built-in networkx command for this:
node_connected_component(G, n)
The documentation is here. It returns all nodes in the connected component of G containing n.
It's not recursive, but I don't think you actually need or even want that.
Comments on your code: You've got a bug that will often result in infinite recursion. If u and v are neighbors, both with degree at least 2, then it will start with u, put v in the list and, when processing v, put u in the list and keep repeating. It needs to change to only process neighbors that are not in neighbors_list. It's expensive to check that, so instead use a set. There's also a small problem if the starting node has degree 1. Your test for degree 1 doesn't do what you're after. If the initial node has degree 1, but its neighbor has higher degree, it won't find the neighbor's neighbors.
Here's a modification of your code:
def fetch_connected_nodes(G, node, seen=None):
    if seen is None:
        seen = set([node])
    for neighbor in G.neighbors(node):
        print(neighbor)
        if neighbor not in seen:
            seen.add(neighbor)
            fetch_connected_nodes(G, neighbor, seen)
    return seen
You call this like fetch_connected_nodes(assembly, starting_node).
You can simply use a Breadth-first search starting from your given node or any node. In Networkx you can have the tree-graph from your starting node using the function: bfs_tree(G, source, reverse=False) Here is a link to the doc: Network bfs_tree.
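To illustrate the breadth-first idea without recursion (and therefore without any recursion limit), here is a minimal sketch. It assumes the graph is available as a plain dict of neighbor sets; connected_nodes is a hypothetical name, not a networkx function.

```python
from collections import deque

def connected_nodes(adj, start):
    """Return the set of nodes reachable from start, using an explicit queue."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in seen:  # never revisit a node, so no endless back-and-forth
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# A path 0-1-2-...-4999 (long enough to exhaust naive recursion) plus a separate pair.
n = 5000
adj = {i: set() for i in range(n)}
for i in range(n - 1):
    adj[i].add(i + 1)
    adj[i + 1].add(i)
adj["x"] = {"y"}
adj["y"] = {"x"}

component = connected_nodes(adj, 0)
print(len(component))
```

With a networkx graph, the resulting node set could then be passed to G.subgraph(...) to build the subgraph.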
Here is a recursive algorithm to get all nodes connected to an input node.
# written for networkx 1.x; in 2.x use G.neighbors(...) and nx.add_path(sub_G, [...])
def create_subgraph(G, sub_G, start_node):
    sub_G.add_node(start_node)
    for n in G.neighbors_iter(start_node):
        if n not in sub_G.neighbors(start_node):
            sub_G.add_path([start_node, n])
            create_subgraph(G, sub_G, n)
I believe the key thing here to prevent infinite recursive calls is the condition to check that a node which is a neighbor in the original graph is not already connected in the sub_G that is being created. Otherwise, you will always be going back and forth along edges between nodes that already have edges.
I tested it as follows:
G = nx.erdos_renyi_graph(20, 0.08)
nx.draw(G, with_labels=True)
plt.show()
sub_G = nx.Graph()
create_subgraph(G, sub_G, 17)
nx.draw(sub_G, with_labels=True)
plt.show()
You will find in the attached image the full graph and the sub_graph that contains node 17.
Deleting nodes with tie=1 in a large NetworkX graph
I have made a large graph with NetworkX with about 20,000 nodes. I would like to delete nodes with only one tie (or zero ties) to try to reduce the clutter. Since it is a very large graph I do not know the names or IDs of the nodes that have tie=1 or 0. Does anyone know how to delete these nodes without specifying the node ID or name?
Iterating on a Graph g yields all of g's nodes, one at a time -- I believe you can't alter g during the iteration itself, but you can selectively make a list of nodes to be deleted, then remove them all:
to_del = [n for n in g if g.degree(n) <= 1]
g.remove_nodes_from(to_del)
I think you're after this one-liner:
G = nx.k_core(G, k=2)
You should be aware that if you delete some nodes, you'll have new nodes whose degree is just 1 or 0. If you want to repeat this process until no such nodes exist, you're generating the "k-core" with k=2. That is, you're generating the largest network for which all nodes have degree at least 2.
This is a built-in function:
import networkx as nx
G = nx.fast_gnp_random_graph(10000, 0.0004)  # erdos renyi graph, average degree = 4
G = nx.k_core(G, k=2)
You could instead do:
for node in G.nodes():
    if G.degree(node) < 2:
        G.remove_node(node)
but this would yield a different result from the 2-core I described above, and a different result from A Martelli's as well, since some of the later nodes in the list may originally have degree 2 but be reduced to 1 before you reach them. And it wouldn't be as clean because it creates the list G.nodes() rather than using a nicer iterator (if you're in networkx v1.x and you aren't altering a graph in the loop, it's usually better to loop through the nodes with G.nodes_iter() rather than G.nodes()).
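To make the "repeat until no such nodes exist" idea concrete, here is a rough sketch of the pruning on a plain dict of neighbor sets. It is not how nx.k_core is implemented; the name prune_low_degree and the assumption of an undirected graph without self-loops are mine.

```python
def prune_low_degree(adj, k=2):
    """Return a copy of adj in which nodes of degree < k are removed repeatedly."""
    adj = {n: set(nbrs) for n, nbrs in adj.items()}  # work on a copy
    stack = [n for n in adj if len(adj[n]) < k]
    while stack:
        n = stack.pop()
        if n not in adj:
            continue  # already removed via an earlier stack entry
        for nbr in adj.pop(n):
            adj[nbr].discard(n)
            if len(adj[nbr]) < k:  # removing n may push a neighbor below k
                stack.append(nbr)
    return adj

# Triangle 0-1-2 with a tail 2-3-4 and an isolated node 5.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}, 5: set()}
core = prune_low_degree(adj)
print(sorted(core))
```

Node 3 starts with degree 2 and is only removed after node 4 goes, which is the cascading effect described above.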