Python - CSV to networkx format

I would like to create a network with the networkx package from data stored in a CSV file. The data consist of two columns, as in the example below. All nodes within the same edge group connect with each other (e.g. the E1 group has 3 elements, giving ABC -> BCD, BCD -> DEF, ABC -> DEF).
What would be the best approach for transforming such data in Python into input for the networkx package?
Edges Nodes
E1 ABC
E1 BCD
E1 DEF
E2 ABC
E2 BCD
E3 ABC
E3 BCD
E3 CDE
E3 DEF

Your format seems rather awkward as a graph specification. Edges are dyadic in nature, i.e. they connect two nodes, but your definition has potential for an edge to be related to a single node.
Your example also includes more than one edge between the same pair of nodes (e.g. ABC -> BCD via E1 and E2). This implies a MultiGraph.
If this is really what your format should define, here is a way to get it into a networkx graph. There are very likely cleaner ways to read the data.
import networkx as nx
import itertools
# read file into a dictionary, in 3 stages.
with open("graph.txt") as f:
    lines = f.readlines()
edges = []
for line in lines:
    if line.startswith('Edges'):
        continue
    parts = line.strip().split()
    if len(parts) != 2:  # skip blank or malformed lines
        continue
    edges.append(parts)
D = {}
for e, n in edges:
    if e not in D:
        D[e] = []
    D[e].append(n)
G = nx.MultiGraph()
# for each edge group, add edges between all pairs of nodes
for e, nodes in D.items():
    print(e, nodes)
    for (u, v) in itertools.combinations(nodes, 2):
        G.add_edge(u, v, label=e)
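The grouping step can also be tried without a file or networkx installed. The sketch below is only an illustration under that assumption: it reads the sample rows from the question out of a string and produces one (u, v, label) triple per node pair in each edge group. The names `rows`, `groups` and `labelled_edges` are made up for this example.

```python
import itertools
from collections import defaultdict

# Sample rows copied from the question; in practice they would come from the file.
rows = """Edges Nodes
E1 ABC
E1 BCD
E1 DEF
E2 ABC
E2 BCD
E3 ABC
E3 BCD
E3 CDE
E3 DEF"""

groups = defaultdict(list)
for line in rows.splitlines()[1:]:  # skip the header row
    label, node = line.split()
    groups[label].append(node)

# One (u, v, label) triple per pair of nodes inside each group.
labelled_edges = [
    (u, v, label)
    for label, nodes in groups.items()
    for u, v in itertools.combinations(nodes, 2)
]

for edge in labelled_edges:
    print(edge)
```

Each triple could then be handed to G.add_edge(u, v, label=label) on a MultiGraph, as in the answer's code.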

Related

Create "short cut" aware graph in Python

Assume we have these sequences:
A->X->Y->Z
B->Y->Z
C->Y->Z
D->X->Z
I would like to create a graph like:
    C
    |
A-X-Y-Z
  | |
  D B
In the sequence D-X-Z there is a short cut. My goal is to create a directed acyclic graph by eliminating these short cuts and, conversely, expanding existing edges when an expanded path is encountered (e.g. replacing X-Z with X-Y-Z).
My approach so far was to create a directed graph with Networkx, but this does not solve the problem because I could not find a way to eliminate the short cuts (it is a big graph with hundreds of thousands of nodes).
Any hints would be appreciated.
You can set up the graph:
import networkx as nx
text = '''
A-X-Y-Z
B-Y-Z
C-Y-Z
D-X-Z
'''
G = nx.Graph()
for s in text.strip().split('\n'):
    l = s.split('-')
    G.add_edges_from(zip(l, l[1:]))
Then use find_cycle and remove_edge repeatedly to identify and remove edges that form cycles:
while True:
    try:
        c = nx.find_cycle(G)
        print(f'found cycle: {c}')
        G.remove_edge(*c[0])
    except nx.NetworkXNoCycle:
        break
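Note that the loop removes whichever edge find_cycle happens to report first, which is not necessarily the short cut itself. As a hedged alternative, the sketch below keeps the sequences directed and drops an edge u -> v whenever v is still reachable from u through some other path. On a directed acyclic graph this amounts to the transitive reduction, which networkx also provides as nx.transitive_reduction. The helper `reachable_without` and the `adj` map are made up for illustration, and this naive search is meant for small inputs only, not for a graph with hundreds of thousands of nodes.

```python
from collections import defaultdict

sequences = ["A-X-Y-Z", "B-Y-Z", "C-Y-Z", "D-X-Z"]

# Build a directed adjacency map from consecutive pairs in each sequence.
adj = defaultdict(set)
for seq in sequences:
    nodes = seq.split("-")
    for u, v in zip(nodes, nodes[1:]):
        adj[u].add(v)


def reachable_without(adj, src, dst, skipped_edge):
    """Depth-first search from src to dst that never uses skipped_edge."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if (node, nxt) == skipped_edge or nxt in seen:
                continue
            if nxt == dst:
                return True
            seen.add(nxt)
            stack.append(nxt)
    return False


# An edge is a short cut if its endpoints stay connected without it.
shortcuts = [
    (u, v)
    for u in list(adj)
    for v in sorted(adj[u])
    if reachable_without(adj, u, v, (u, v))
]
for u, v in shortcuts:
    adj[u].discard(v)

remaining = sorted((u, v) for u, targets in adj.items() for v in targets)
print("removed:", shortcuts)
print("remaining:", remaining)
```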

Writing graph object to dimacs file format

I have created a graph object using the networkx library with the following code.
import networkx as nx
#Convert snap dataset to graph object
g = nx.read_edgelist('com-amazon.ungraph.txt',create_using=nx.Graph(),nodetype = int)
print(nx.info(g))
However I need to write the graph object to a dimacs file format which I believe networkx's functions do not include. Is there a way to do so?
The specification described on http://prolland.free.fr/works/research/dsat/dimacs.html is pretty simple, so you can just do something like this:
import networkx as nx
g = nx.house_x_graph()  # stand-in graph since we don't have your data
dimacs_filename = "mygraph.dimacs"
with open(dimacs_filename, "w") as f:
    # write the header
    f.write("p EDGE {} {}\n".format(g.number_of_nodes(), g.number_of_edges()))
    # now write all edges
    for u, v in g.edges():
        f.write("e {} {}\n".format(u, v))
This generates the file "mygraph.dimacs":
p EDGE 5 8
e 0 1
e 0 2
e 0 3
e 1 2
e 1 3
e 2 3
e 2 4
e 3 4
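One caveat, stated as an assumption rather than a fact about your tools: DIMACS edge files are usually described with vertices numbered from 1 to n, whereas the stand-in graph is numbered from 0 and node ids in a downloaded dataset may be arbitrary integers. A minimal sketch of renumbering before writing, using a plain edge list and a hypothetical helper `to_dimacs_lines` that mirrors the header line used in the answer:

```python
def to_dimacs_lines(edges):
    """Return DIMACS-style lines with nodes renumbered 1..n in sorted order."""
    nodes = sorted({node for edge in edges for node in edge})
    new_id = {node: i for i, node in enumerate(nodes, start=1)}
    lines = ["p EDGE {} {}".format(len(nodes), len(edges))]
    for u, v in edges:
        lines.append("e {} {}".format(new_id[u], new_id[v]))
    return lines


# Edges of the stand-in graph from the answer, numbered from 0.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
dimacs_lines = to_dimacs_lines(edges)
print("\n".join(dimacs_lines))
```

With networkx, list(g.edges()) could supply the edge list. Isolated nodes do not appear in an edge list, so they would need separate handling.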

Create a Dendrogram from Genome

I wanted to play around with genomic data:
Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
I wanted to create a dendrogram based on how close these organisms are related to each other given the genome sequence above. What I did first was to count the number of a's, c's, t's and g's of each species then I created an array, then plotted a dendrogram:
gen_size1 = len(Species_A)
a1 = float(Species_A.count('a'))/float(gen_size1)
c1 = float(Species_A.count('c'))/float(gen_size1)
g1 = float(Species_A.count('g'))/float(gen_size1)
t1 = float(Species_A.count('t'))/float(gen_size1)
.
.
.
gen_size5 = len(Species_E)
a5 = float(Species_E.count('a'))/float(gen_size5)
c5 = float(Species_E.count('c'))/float(gen_size5)
g5 = float(Species_E.count('g'))/float(gen_size5)
t5 = float(Species_E.count('t'))/float(gen_size5)
my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]])
plt.subplot(1,2,1)
plt.title("Mononucleotide")
linkage_matrix = linkage(my_genes, "single")
print(linkage_matrix)
dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True)
plt.show()
Species A and B are variants of the same organism, and I expect both to diverge from a common clade from the root. The same goes for Species C and D, which should diverge from another common clade from the root, with Species E diverging from the main root because it is not related to Species A to D. Unfortunately the dendrogram came out mixed up, with Species A and E diverging from a common clade and Species C, D and B in another clade.
I have read about hierarchical clustering for genome sequences, but as far as I can tell it only accommodates two-dimensional data, whereas I have four dimensions (a, c, t and g). Is there another strategy for this? Thanks for the help!
This is a fairly common problem in bioinformatics, so you should use a bioinformatics library like BioPython that has this functionality built in.
First you create a multi FASTA file with your sequences:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag',
             'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag',
             'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag',
             'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg',
             'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag']
# Seq is built from the plain string; Bio.Alphabet was removed in Biopython 1.78
my_records = [SeqRecord(Seq(sequence),
                        id='Species_{}'.format(letter),
                        description='Species_{}'.format(letter))
              for sequence, letter in zip(sequences, 'ABCDE')]
root_dir = r"C:\Users\BioGeek\Documents\temp"
filename = 'my_sequences'
fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename))
SeqIO.write(my_records, fasta_path, "fasta")
This creates the file C:\Users\BioGeek\Documents\temp\my_sequences.fasta that looks like this:
>Species_A
ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
>Species_B
ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
>Species_C
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
>Species_D
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
>Species_E
ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
Next, use the command line tool ClustalW to do a multiple sequence alignment:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe"
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path)
stdout, stderr = clustalw_cline()
print(stdout)
This prints:
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: Species_A 50 bp
Sequence 2: Species_B 50 bp
Sequence 3: Species_C 50 bp
Sequence 4: Species_D 50 bp
Sequence 5: Species_E 50 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 90
Sequences (1:3) Aligned. Score: 94
Sequences (1:4) Aligned. Score: 88
Sequences (1:5) Aligned. Score: 84
Sequences (2:3) Aligned. Score: 90
Sequences (2:4) Aligned. Score: 84
Sequences (2:5) Aligned. Score: 78
Sequences (3:4) Aligned. Score: 94
Sequences (3:5) Aligned. Score: 82
Sequences (4:5) Aligned. Score: 82
Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd]
There are 4 groups
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:912
Group 2: Sequences: 2 Score:921
Group 3: Sequences: 4 Score:865
Group 4: Sequences: 5 Score:855
Alignment Score 2975
CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]
The my_sequences.dnd file that ClustalW creates is a standard Newick tree file, and Bio.Phylo can parse these:
from Bio import Phylo
newick_path = os.path.join(root_dir, '{}.dnd'.format(filename))
tree = Phylo.read(newick_path, "newick")
Phylo.draw_ascii(tree)
Which prints:
       ____________ Species_A
  ____|
 |    |_____________________________________ Species_B
 |
_|          ____ Species_C
 |_________|
 |         |_________________________ Species_D
 |
 |__________________________________________________________________ Species_E
Or, if you have matplotlib or pylab installed, you can create a graphic using the draw function:
tree.rooted = True
Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
which produces a graphical rendering of the same tree.
This dendrogram clearly illustrates what you observed: that species A and B are variants of the same organism and both diverge from a common clade from the root. Same goes with Species C and D, both diverge from another common clade from the root. Finally, Species E diverges from the main root because it is not related to Species A to D.
Well, using SciPy you could use a custom distance (my bet is on Needleman-Wunsch or Smith-Waterman as a start). Here is an example of how to play with your input data. You should also check how to define a custom distance in SciPy. Once you have it set, you can use a more advanced alignment approach like MAFFT. You could extract the relationships between genomes and use them when you create your dendrogram.
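As a rough illustration of the custom-distance idea, the sketch below uses a plain position-by-position mismatch count (Hamming distance) as a stand-in for an alignment-based score such as Needleman-Wunsch. This shortcut is an assumption that only works here because all five sequences have the same length. The ambiguous base n is simply treated as a mismatch, and the function name `mismatches` is made up for this example.

```python
from itertools import combinations

sequences = {
    "Species_A": "ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag",
    "Species_B": "ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag",
    "Species_C": "ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag",
    "Species_D": "ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg",
    "Species_E": "ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag",
}


def mismatches(s1, s2):
    """Count positions where two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


names = list(sequences)
pairs = list(combinations(names, 2))
# Condensed distance list: one entry per unordered pair, in combinations() order.
condensed = [mismatches(sequences[a], sequences[b]) for a, b in pairs]

for (a, b), d in zip(pairs, condensed):
    print(a, b, d)
```

The `condensed` list follows the pair ordering of a condensed distance matrix, so it should be usable as the first argument of scipy.cluster.hierarchy.linkage in place of the nucleotide-frequency array.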

Algorithm to construct DeBruijn graph gives wrong results

I am trying to write some code to construct a DeBruijn graph from a set of kmers (k letter long strings, DNA sequencing reads) in Python, output as a collection of edges, joining the same node to others.
When I run my code on sample input:
['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
I get:
CAG -> AGG
GAG -> AGG
Instead of:
AGG -> GGG
CAG -> AGG,AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
Any hint of what I am doing wrong?
Here is the code:
import itertools
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
y=[a[1:] for a in inp]
z=[b[:len(b)-1] for b in inp]
y.extend(z)
edjes=list(set(y))
w=[c[1:] for c in edjes]
v=[d[:len(d)-1] for d in edjes]
w.extend(v)
nodes=list(set(w))
graph={}
new=itertools.product(edjes,edjes)
for node in nodes:
    for edj in new:
        edje1,edje2=edj[0],edj[1]
        if edje1[1:]==node and edje2[:len(edje2)-1]==node:
            if edje1 in graph:
                graph[edje1].append(edje2)
            else:
                graph[edje1]=[edje2]
for val in graph.values():
    val.sort()
for k,v in sorted(graph.items()):
    if len(v)<1:
        continue
    else:
        line=k+' -> '+','.join(v)+'\n'
        print (line)
I think you make the algorithm much too complicated: you can simply first perform a uniqueness filter on the input:
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
edges=list(set(inp))
And then iterate over this list of "edges". For each edge, the first three characters are the from node and the last three characters are the to node:
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    #...
Now you simply need to add this to your graph:
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    if frm in graph:
        graph[frm].append(to)
    else:
        graph[frm]=[to]
And finally perform a sorting and printing like you did yourself:
for val in graph.values():
    val.sort()
for k,v in sorted(graph.items()):
    print(k+' -> '+','.join(v))
This results in:
AGG -> GGG
CAG -> AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
As you can see, there is a small difference on the second line: your expected output contains AGG twice there, which does not make much sense.
So the full algorithm is something like:
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
edges=list(set(inp))
graph={}
for edge in edges:
    frm = edge[:len(edge)-1]
    to = edge[1:]
    if frm in graph:
        graph[frm].append(to)
    else:
        graph[frm]=[to]
for val in graph.values():
    val.sort()
for k,v in sorted(graph.items()):
    print(k+' -> '+','.join(v))
Your algorithm
One problem, I think, is that you treat the three-letter sequences as "edjes" (probably edges). The edges are the four-character sequences. By performing this conversion, information is lost. Next you construct a set of two-character items (nodes that are not nodes at all). They seem to be used to "glue" the nodes together, but at that stage you no longer know how the pieces fit together anyway.
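If the doubled AGG in the expected output is intentional, one plausible reading is that the read CAGG occurs twice in the input and each occurrence should contribute its own edge. Under that assumption, a sketch that simply skips the set step keeps the duplicates (the variable name `lines` is only for illustration):

```python
from collections import defaultdict

inp = ['GAGG', 'CAGG', 'GGGG', 'GGGA', 'CAGG', 'AGGG', 'GGAG']

graph = defaultdict(list)
for kmer in inp:  # no set(): repeated k-mers produce repeated edges
    graph[kmer[:-1]].append(kmer[1:])

lines = [
    '{} -> {}'.format(frm, ','.join(sorted(targets)))
    for frm, targets in sorted(graph.items())
]
print('\n'.join(lines))
```

Whether duplicate edges are wanted depends on how the graph is used afterwards, so treat this as one option rather than the correct answer.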

Iterate variable for every node | Node Connectivity in Python Graph

I would like to find node connectivity between node 1 and rest of the nodes in a graph. The input text file format is as follows:
1 2 1
1 35 1
8 37 1
and so on for 167 lines. First column represents source node, second column represents destination node while the last column represents weight of the edge.
I'm trying to read the source, destination nodes from input file and forming an edge between them. I need to then find out if it is a connected network (only one component of graph and no sub-components). Here is the code
from numpy import *
import networkx as nx
G = nx.empty_graph()
for row in open('out40.txt'):
    row = row.split()
    src = row[0]
    dest = row[1]
    #print src
    G.add_edge(src, dest)
    print(src, dest)
for i in range(2, 41):
    if nx.bidirectional_dijkstra(G, 1, i): print("path exists from 1 to ", i)
Manually adding the edges using
G.add_edge(1, 2)
works, but it is tedious and not suitable for large input files such as mine. The if condition in the loop works when I add edges manually, but the code above throws the following error:
in neighbors_iter
raise NetworkXError("The node %s is not in the graph."%(n,))
networkx.exception.NetworkXError: The node 2 is not in the graph.
Any help will be much appreciated!
In your code, you're adding nodes "1" and "2" et cetera (since reading from a file is going to give you strings unless you explicitly convert them).
However, you're then trying to refer to nodes 1 and 2. I'm guessing that networkx does not think that 2 == "2".
Try changing this...
G.add_edge(src, dest)
to this:
G.add_edge(int(src), int(dest))
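To see why the lookup fails, here is a small sketch in plain Python (no networkx needed): values split from a text line are strings, and a string never compares equal to an integer, so a container keyed by "2" has no entry for 2. The dictionary `adjacency` is only a stand-in for a graph, and all names are made up for illustration.

```python
line = "1 2 1"  # first row of the input file from the question
src, dest, weight = line.split()

adjacency = {}  # stand-in for a graph: node -> set of neighbours
adjacency.setdefault(src, set()).add(dest)
adjacency.setdefault(dest, set()).add(src)

string_lookup = "2" in adjacency    # node stored as a string
int_lookup = 2 in adjacency         # what the loop in the question asks for
converted_lookup = int(dest) == 2   # converting first makes them match

print(type(src).__name__, string_lookup, int_lookup, converted_lookup)
```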
Not sure if that is an option for you, but are you aware of networkx's built-in support for multiple graph text formats?
The edge list format seems to apply pretty well to your case. Specifically, the following method will read your input files without the need for custom code:
G = nx.read_weighted_edgelist(filename)
If you want to remove the weights (because you don't need them), you could subsequently do the following:
for e in G.edges(data=True):
    e[2].clear()  # [2] is the 3rd element of the tuple, which
                  # contains the dictionary with edge attributes
From Networkx documentation:
for row in open('out40.txt'):
    row = row.split()
    src = row[0]
    dest = row[1]
    G.add_nodes_from([src, dest])
    #print src
    G.add_edge(src, dest)
    print(src, dest)
The error message says that the graph G doesn't have the nodes you are trying to create an edge between.
You can also use "is_connected()" to make this a little simpler. e.g.
$ cat disconnected.edgelist
1 2 1
2 3 1
4 5 1
$ cat connected.edgelist
1 2 1
2 3 1
3 4 1
$ ipython
In [1]: import networkx as nx
In [2]: print(nx.is_connected(nx.read_weighted_edgelist('disconnected.edgelist')))
False
In [3]: print(nx.is_connected(nx.read_weighted_edgelist('connected.edgelist')))
True
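For intuition about what such a check involves, here is a sketch in plain Python. It is not networkx's implementation: it builds an adjacency map from the same two edge lists and tests whether a breadth-first search from one node reaches every other node. The function name `is_connected_edgelist` is hypothetical.

```python
from collections import defaultdict, deque


def is_connected_edgelist(text):
    """Return True if the undirected graph given as 'src dest weight' rows is connected."""
    adjacency = defaultdict(set)
    for row in text.strip().splitlines():
        src, dest = row.split()[:2]
        adjacency[src].add(dest)
        adjacency[dest].add(src)
    if not adjacency:
        return False  # treat an empty edge list as not connected
    start = next(iter(adjacency))
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return len(seen) == len(adjacency)


disconnected = "1 2 1\n2 3 1\n4 5 1"
connected = "1 2 1\n2 3 1\n3 4 1"
result_disconnected = is_connected_edgelist(disconnected)
result_connected = is_connected_edgelist(connected)
print(result_disconnected, result_connected)
```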
Another option is to load the file as a pandas dataframe and then use iterrows to iterate:
import pandas as pd
import networkx as nx
cols = ["src", "des", "wei"]
df = pd.read_csv('out40.txt', sep=" ", header=None, names=cols)
G = nx.empty_graph()
for index, row in df.iterrows():
    G.add_edge(row["src"], row["des"])
