Algorithm to construct DeBruijn graph gives wrong results - python

I am trying to write code in Python to construct a De Bruijn graph from a set of kmers (k-letter strings, DNA sequencing reads), output as a collection of edges joining each node to its neighbours.
When I run my code on sample input:
['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
I get:
CAG -> AGG
GAG -> AGG
Instead of:
AGG -> GGG
CAG -> AGG,AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
Any hint of what I am doing wrong?
Here is the code:
import itertools
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
y=[a[1:] for a in inp]
z=[b[:len(b)-1] for b in inp]
y.extend(z)
edjes=list(set(y))
w=[c[1:] for c in edjes]
v=[d[:len(d)-1] for d in edjes]
w.extend(v)
nodes=list(set(w))
graph={}
new=itertools.product(edjes,edjes)
for node in nodes:
    for edj in new:
        edje1, edje2 = edj[0], edj[1]
        if edje1[1:] == node and edje2[:len(edje2)-1] == node:
            if edje1 in graph:
                graph[edje1].append(edje2)
            else:
                graph[edje1] = [edje2]
for val in graph.values():
    val.sort()
for k, v in sorted(graph.items()):
    if len(v) < 1:
        continue
    else:
        line = k + ' -> ' + ','.join(v) + '\n'
        print(line)

I think you are making the algorithm much too complicated: you can simply perform a uniqueness filter on the input first:
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
edges=list(set(inp))
And then iterate over this list of "edges". For each edge, the first three characters are the from node, and the last three characters are the to node:
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    # ...
Now you simply need to add this to your graph:
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    if frm in graph:
        graph[frm].append(to)
    else:
        graph[frm] = [to]
And finally perform a sorting and printing like you did yourself:
for val in graph.values():
    val.sort()
for k, v in sorted(graph.items()):
    print(k + ' -> ' + ','.join(v))
This results in:
AGG -> GGG
CAG -> AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
As you can see, there is a small difference on the second line: your expected output contains AGG twice there, which does not make much sense once the duplicate CAGG has been filtered out by the uniqueness filter.
So the full algorithm is something like:
inp = ['GAGG', 'CAGG', 'GGGG', 'GGGA', 'CAGG', 'AGGG', 'GGAG']
edges = list(set(inp))
graph = {}
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    if frm in graph:
        graph[frm].append(to)
    else:
        graph[frm] = [to]
for val in graph.values():
    val.sort()
for k, v in sorted(graph.items()):
    print(k + ' -> ' + ','.join(v))
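If, on the other hand, each occurrence of a kmer is meant to contribute its own edge (the duplicate CAGG would then explain the repeated AGG in your expected output), a minimal variant that simply skips the set() filter would be:

```python
inp = ['GAGG', 'CAGG', 'GGGG', 'GGGA', 'CAGG', 'AGGG', 'GGAG']

graph = {}
# iterate over the raw input, so a kmer occurring twice adds two edges
for edge in inp:
    frm, to = edge[:-1], edge[1:]
    graph.setdefault(frm, []).append(to)

for val in graph.values():
    val.sort()
for k, v in sorted(graph.items()):
    print(k + ' -> ' + ','.join(v))
```

This prints CAG -> AGG,AGG, matching the expected output with the duplicate kept.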
Your algorithm
A problem, I think, is that you consider three-letter sequences to be "edjes" (probably edges). The edges are the four-character sequences; by performing this conversion, information is lost. Next you construct a set of two-character items ("nodes" which are not nodes at all). They seem to be used to glue the pieces together, but at that stage you no longer know how the pieces were glued together in the first place. There is also a mechanical bug: itertools.product returns an iterator, which the first pass through your inner loop exhausts, so every node after the first sees no pairs at all; that is why most of the expected output is missing.

Related

How to find the count of (1->2->3, 1->3), (1->2->3->1), (1->2<->3) triads in a directed graph in NetworkX?

I have a DiGraph G with no self loops. I need to count the three different kinds of triads (1->2->3, 1->3), (1->2->3->1), (1->2<->3) in my graph. I can't use any of the networkx.algorithms.triads functions even though triadic_census gives me exactly the result that I'm looking for.
For the second type 1->2->3->1, 2->3->1->2, 3->2->1->3 should be counted only once.
I started with the second type but the count I'm getting is way more than the triadic_census['030C'] count.
t2 = []
visited_ids = set()
for node_a_id in G_dict.keys():
    for node_b_id in G_dict[node_a_id]:
        for node_c_id in G_dict[node_b_id]:
            if node_c_id in G_dict[node_a_id]:
                cyc = [(node_a_id, node_b_id), (node_b_id, node_c_id), (node_c_id, node_a_id)]
                s = {frozenset(i) for i in cyc}  # I used this in order to not count the same cycle multiple times
                if s not in t2:
                    t2.append(s)
                visited_ids.add(node_a_id)
print(len(t2))
Here G_dict is of the form:
G_dict = {'1':['2','3','4'...], '2':['3','4','8'...]...}
Where the keys are the node ids and values are the nodes the key node has an outgoing edge to. Is there a way to write a function which gives the same counts as triadic_census for these three types of triads without excessively copying its source code?
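A sketch of one way to count each directed 3-cycle exactly once: note that for a cycle a->b->c->a the closing edge is c->a (i.e. node_a_id in G_dict[node_c_id], not node_c_id in G_dict[node_a_id], which tests a->c and matches transitive triads instead), and dedup by rotating each cycle so its smallest node id comes first, storing canonical tuples in a set rather than scanning a list. This counts directed 3-cycles; to match 030C exactly you would still need to exclude triples that have additional edges among them:

```python
def count_directed_triangles(G_dict):
    """Count directed 3-cycles a->b->c->a, each counted once.

    Each cycle is canonicalised by rotating it so the smallest
    node id comes first, then stored in a set for O(1) dedup.
    Note: this does not exclude triads with extra edges between
    the three nodes, which triadic_census's 030C class excludes.
    """
    seen = set()
    for a in G_dict:
        for b in G_dict.get(a, []):
            for c in G_dict.get(b, []):
                if a in G_dict.get(c, []):   # closing edge c->a
                    cyc = (a, b, c)
                    i = cyc.index(min(cyc))
                    seen.add(cyc[i:] + cyc[:i])
    return len(seen)
```

For example, count_directed_triangles({'1': ['2'], '2': ['3'], '3': ['1']}) returns 1, even though the cycle is discovered three times (once per starting node).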

Create "short cut" aware graph in Python

Assume we have these sequences:
A->X->Y->Z
B->Y->Z
C->Y->Z
D->X->Z
I would like to create a graph like:
C
|
A-X-Y-Z
| |
D B
In the sequence D-X-Z there is a short cut. My goal is to create a directed acyclic graph by eliminating these short-cuts and vice versa, expand existing edges when encountering expanded paths (e.g.: X-Z with X-Y-Z).
My approach so far was to create a directed graph with Networkx, but this did not solve the problem because I could not find a way to eliminate the short cuts (it is a big graph with hundreds of thousands of nodes).
Any hints would be appreciated.
You can set up the graph:
import networkx as nx
text = '''
A-X-Y-Z
B-Y-Z
C-Y-Z
D-X-Z
'''
G = nx.Graph()
for s in text.strip().split('\n'):
    l = s.split('-')
    G.add_edges_from(zip(l, l[1:]))
Then use find_cycle and remove_edge repeatedly to identify and remove edges that form cycles:
while True:
    try:
        c = nx.find_cycle(G)
        print(f'found cycle: {c}')
        G.remove_edge(*c[0])
    except nx.NetworkXNoCycle:
        break
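If the goal is specifically to drop "short cut" edges such as X-Z when a longer path X-Y-Z exists, and the sequences can be read as a directed acyclic graph, another option is networkx's built-in transitive reduction, which keeps the fewest edges that preserve reachability (a sketch, with the sequences from the question written as directed edges):

```python
import networkx as nx

# directed version of the sequences: each consecutive pair is an edge
G = nx.DiGraph()
for seq in ['AXYZ', 'BYZ', 'CYZ', 'DXZ']:
    G.add_edges_from(zip(seq, seq[1:]))

# transitive_reduction keeps the minimal edge set with the same
# reachability, so the X->Z short cut is removed (X->Y->Z remains)
R = nx.transitive_reduction(G)
print(sorted(R.edges()))
```

Note that transitive_reduction raises an error if the graph is not a DAG, so any genuine cycles would still have to be broken first (e.g. with the find_cycle loop above).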

How to find the shortest path with multiple edges between two nodes?

I am trying to generate the shortest path, but I have several edges from Istanbul to Ankara, so the normal method fails since the model considers those parallel edges as one edge; I thought I might need to generate dummy nodes to work around this.
My nodes are shown in the first two columns of the Excel sheet (Node1 and Node2). I wanted to generate the shortest path using Node1_reference and Node2_reference, but I am unsure how to go about it, or whether I should create dummy nodes, as I am unable to call the cities without the suffixes.
What you want here is to use a nx.MultiDiGraph, a directed multigraph which can hold several parallel edges between two nodes.
# Add nodes
g = nx.MultiDiGraph()
g.add_node('Istanbul')
g.add_node('Ankara')
g.add_node('Muscat')
# Add edges
g.add_edge('Istanbul', 'Ankara', data=dict(time=1, route=1))
g.add_edge('Istanbul', 'Ankara', data=dict(time=2, route=2))
g.add_edge('Istanbul', 'Ankara', data=dict(time=10, route=3))
g.add_edge('Istanbul', 'Muscat', data=dict(time=20, route=1))
g.add_edge('Istanbul', 'Muscat', data=dict(time=20, route=2))
g.add_edge('Ankara', 'Muscat', data=dict(time=2, route=1))
Now we have multiple edges from one city to another. The trick is to specify how the weight function has to behave when querying the graph for shortest paths: https://networkx.org/documentation/stable/reference/algorithms/shortest_paths.html
def weight_func(u, v, d):
    for attrs in d.values():
        if attrs['data']['route'] == 1:
            return attrs['data']['time']
    return None

result = nx.shortest_path(g, source=None, target=None, weight=weight_func)
print(result)
Basically, when the algorithm evaluates the parallel edges between a pair of nodes (u, v), the weight function receives the dict of data properties for those edges; we loop over them and filter, so we can return the weight we want.
You can wrap this function in another higher-level function to make it more convenient like so:
def filter_route_id(route):
    def weight_func(u, v, d):
        for attrs in d.values():
            if attrs['data']['route'] == route:
                return attrs['data']['time']
        return None
    return weight_func

result = nx.shortest_path(g, source=None, target=None, weight=filter_route_id(1))
print(result)
{'Istanbul': {'Istanbul': ['Istanbul'], 'Ankara': ['Istanbul', 'Ankara'], 'Muscat': ['Istanbul', 'Ankara', 'Muscat']}, 'Ankara': {'Ankara': ['Ankara'], 'Muscat': ['Ankara', 'Muscat']}, 'Muscat': {'Muscat': ['Muscat']}}

Improving BFS performance with some kind of memoization

I'm trying to build an algorithm which will find the distances from one vertex to the others in a graph.
As a really simple example, let's say my network looks like this:
network = [[0,1,2],[2,3,4],[4,5,6],[6,7]]
I wrote a BFS routine which is supposed to find the length of the paths from a specified source to the other vertices of the graph:
from itertools import chain
import numpy as np

n = 8
graph = {}
for i in range(0, n):
    graph[i] = []
for communes in network:
    for vertice in communes:
        work = communes.copy()
        work.remove(vertice)
        graph[vertice].append(work)
for k, v in graph.items():
    graph[k] = list(chain(*v))
def bsf3(graph, s):
    matrix = np.zeros([n, n])
    dist = {}
    visited = []
    queue = [s]
    dist[s] = 0
    visited.append(s)
    matrix[s][s] = 0
    while queue:
        v = queue.pop(0)
        for neighbour in graph[v]:
            if neighbour in visited:
                pass
            else:
                matrix[s][neighbour] = matrix[s][v] + 1
                queue.append(neighbour)
                visited.append(neighbour)
    return matrix

bsf3(graph, 2)
First I'm creating the graph (a dictionary) and then use the function to find distances.
What I'm concerned about is that this approach doesn't scale to larger networks (say with 1000 people). What I'm thinking about is some kind of memoization (that's actually why I made a matrix instead of a list): when the algorithm calculates the path from, say, 0 to 3 (which it already does), it should also record the other routes it discovers along the way, so that matrix[1][3] = 1 etc.
Then when I call bsf3(graph, 1) it would not have to calculate everything from scratch, but could reuse some values already in the matrix.
Thanks in advance!
I know this doesn't fully answer your question, but here is another approach you can try.
In networks, each node has a routing table: you simply store, for every node in the network, which node to go to next. For example, the routing table of node D:
A -> B
B -> B
C -> E
D -> D
E -> E
You need to run BFS from each node to build all the routing tables, which takes O(|V|·(|V|+|E|)) time. The space complexity is quadratic, but you have to cover all possible paths anyway.
Once you have built all this information, you can simply start from a node, look up your destination node in its table, and find the next node to go to. This gives a much better query-time complexity (if you use the right data structure for the table).
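As a sketch of that idea (assuming the same adjacency-dict representation as in the question), you can run one BFS per node and record, for every reachable destination, the first hop to forward to:

```python
from collections import deque

def build_routing_tables(graph):
    """For each node, map every reachable destination to the
    neighbour to forward to (the first hop on a shortest path)."""
    tables = {}
    for src in graph:
        table = {src: src}
        # seed the queue with src's neighbours, each its own first hop
        queue = deque((nb, nb) for nb in graph[src])
        while queue:
            node, first_hop = queue.popleft()
            if node in table:        # already reached by a shorter path
                continue
            table[node] = first_hop
            for nb in graph[node]:
                if nb not in table:
                    queue.append((nb, first_hop))
        tables[src] = table
    return tables

# adjacency dict for network = [[0,1,2],[2,3,4],[4,5,6],[6,7]]
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4],
     4: [2, 3, 5, 6], 5: [4, 6], 6: [4, 5, 7], 7: [6]}
tables = build_routing_tables(g)
print(tables[0][7])   # first hop from node 0 towards node 7 -> 2
```

Each BFS propagates the first hop along with the frontier, so a lookup afterwards is a single dict access per step of the route.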

Discover All Paths in Single Source, Multi-Terminal (possibly cyclic) Directed Graph

I have a graph G = (V,E), where
V is a subset of {0, 1, 2, 3, …}
E is a subset of VxV
There are no unconnected components in G
The graph may contain cycles
There is a known node v in V, which is the source; i.e. there is no u in V such that (u,v) is an edge
There is at least one sink/terminal node v in V; i.e. there is no u in V such that (v,u) is an edge. The identities of the terminal nodes are not known - they must be discovered through traversal
What I need to do is to compute a set of paths P such that every possible path from the source node to any terminal node is in P. Now, if the graph contains cycles, it is possible that by this definition P becomes an infinite set. This is not what I need. Rather, what I need is for P to contain a path that doesn't explore the loop and at least one path that does explore the loop.
I say "at least one path that does explore the loop", as the loop may contain branches internally, in which case all of those branches will need to be explored as well. Thus, if the loop contains two internal branches, each with a branching factor of 2, then I need a total of four paths in P that explore the loop.
For example, an algorithm run on the following graph:
+-------+
| |
v |
1->2->3->4->5->6 |
| | | |
v | v |
9 +->7-+
|
v
8
which can be represented as:
1:{2}
2:{3}
3:{4}
4:{5,9}
5:{6,7}
6:{7}
7:{4,8}
8:{}
9:{}
Should produce the set of paths:
1,2,3,4,9
1,2,3,4,5,6,7,8
1,2,3,4,5,6,7,4,9
1,2,3,4,5,7,8
1,2,3,4,5,7,4,9
1,2,3,4,5,7,4,5,6,7,8
1,2,3,4,5,7,4,5,7,8
Thus far, I have the following algorithm (in python) that works in some simple cases:
def extractPaths(G, s=None, explored=None, path=None):
    _V, E = G
    if s is None: s = 0
    if explored is None: explored = set()
    if path is None: path = [s]
    explored.add(s)
    if not len(set(E[s]) - explored):
        print(path)
    for v in set(E[s]) - explored:
        if len(E[v]) > 1:
            path.append(v)
            for vv in set(E[v]) - explored:
                extractPaths(G, vv, explored - set(n for n in path if len(E[n]) > 1), path + [vv])
        else:
            extractPaths(G, v, explored, path + [v])
but it fails horribly in the more complex cases.
I'd appreciate any help as this is a tool to validate an algorithm that I have developed for my Master's thesis.
Thank you in advance
I've thought about this for a couple of hours and have come up with this algorithm. It doesn't quite give the result you're asking for, but it's similar (and might be equivalent).
Observation: if we try to go to a node that has been seen before, the stretch from its most recent visit up to the current node can be considered a loop. If we have already seen that loop, we cannot go to that node.
def extractPaths(current_node, path, loops_seen):
    path.append(current_node)
    # if node has outgoing edges
    if nodes[current_node] != None:
        for thatnode in nodes[current_node]:
            valid = True
            # if the node we are going to has been
            # visited before, we are completing a loop
            if thatnode-1 in path:
                i = len(path)-1
                # find the last time we visited that node
                while path[i] != thatnode-1:
                    i -= 1
                # from that visit to this one is a single loop
                new_loop = path[i:len(path)]
                # if we haven't seen this loop, go to the node
                # and note that we have seen this loop;
                # else don't go to the node
                if new_loop in loops_seen:
                    valid = False
                else:
                    loops_seen.append(new_loop)
            if valid:
                extractPaths(thatnode-1, path, loops_seen)
    # this is the end of the path
    else:
        newpath = list()
        # increment all the values for printing
        for i in path:
            newpath.append(i+1)
        found_paths.append(newpath)
    # backtrack
    path.pop()

# graph defined by lists of outgoing edges
nodes = [[2], [3], [4], [5, 9], [6, 7], [7], [4, 8], None, None]
found_paths = list()
extractPaths(0, list(), list())
for i in found_paths:
    print(i)
