I want to read an SDF file (containing many molecules) and return the weighted adjacency matrix of each molecule. Atoms should be treated as vertices and bonds as edges. If vertices i and j are connected by a single, double, or triple bond, the corresponding entry in the adjacency matrix should be 1, 2, or 3 respectively. I also need to obtain a distance vector for each vertex, which lists the number of vertices at each distance from it.
Is there a Python package available to do this?
I would recommend Pybel for reading and manipulating SDF files in Python. To get the bonding information, you will probably need to also use the more full-featured but less pythonic openbabel module, which can be used in concert with Pybel (as pybel.ob).
To start with, you would write something like this:
import pybel  # with Open Babel 3.x: from openbabel import pybel

for mol in pybel.readfile('sdf', 'many_molecules.sdf'):
    for atom in mol:
        coords = atom.coords
        # Iterate over the atoms bonded to this one.
        for neighbor in pybel.ob.OBAtomAtomIter(atom.OBAtom):
            neighbor_coords = pybel.Atom(neighbor).coords
See
http://code.google.com/p/cinfony/
However for your exact problem you will need to consult the documentation.
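Once the bonds are known, the two quantities asked for in the question can be computed without any chemistry library. The following is a minimal sketch, not part of Pybel: the function names are hypothetical, "distance" is assumed to mean topological distance (number of bonds on the shortest path), and the bonds are assumed to have already been collected as (i, j, order) triples with 0-based indices. With Open Babel these triples could presumably be gathered by iterating pybel.ob.OBMolBondIter(mol.OBMol) and reading GetBeginAtomIdx(), GetEndAtomIdx() and GetBondOrder() from each bond (those indices are 1-based), but check that against the Open Babel documentation.

```python
from collections import deque


def weighted_adjacency(n_atoms, bonds):
    """Build a symmetric matrix whose entries are bond orders (0 = no bond).

    bonds: iterable of (i, j, order) triples with 0-based atom indices.
    """
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i][j] = matrix[j][i] = order
    return matrix


def distance_vector(matrix, start):
    """Return counts where counts[d - 1] is the number of atoms that are
    exactly d bonds away from atom `start` (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v, order in enumerate(matrix[u]):
            if order and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = [0] * max(dist.values())
    for d in dist.values():
        if d:
            counts[d - 1] += 1
    return counts


# Hypothetical three-atom chain: atom 0 = atom 1 (double), atom 1 - atom 2 (single).
bonds = [(0, 1, 2), (1, 2, 1)]
matrix = weighted_adjacency(3, bonds)
vectors = [distance_vector(matrix, atom) for atom in range(3)]
```

Atoms that cannot be reached from the starting atom are simply not counted, so a molecule file containing disconnected fragments would need separate handling.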
Related
Before I describe my problem I'll summarise what I think I'm looking for. I think I need a method for nearest-neighbor searches in Python that are restricted by node type, so that only the nearest neighbors of a given type are returned. (In my case a node represents an atom and the node type is the atom's element.) Maybe I am wording my problem incorrectly. I haven't been able to find any existing methods for this.
I am writing some ring statistics code to find different types of rings for molecular dynamics simulation data. The input data structure is a big array of atom id, atom type, and XYZ positions.
For example.
At the moment I only consider single-element systems (for example graphene, so only Carbon atoms are present). So each node is considered the same type when finding its nearest neighbors and calculating the adjacency matrix.
For this, I am using KDTree from scipy.spatial to find all atoms within the bond length, r, of any given atom. If an atom is within radius r of a given atom, I consider the two connected and populate an adjacency dictionary accordingly.
def create_adjacency_dict(data, r, leaf_size=5, box_size=None):
    from scipy.spatial import KDTree

    tree = KDTree(data, leafsize=leaf_size, boxsize=box_size)
    # Find, for every point, the indices of all points within radius r.
    all_nn_indices = tree.query_ball_point(data, r, workers=5)
    adj_dict = {}
    for count, item in enumerate(all_nn_indices):
        adj_dict[count] = item  # Populate adjacency dictionary
    for node, nodes in adj_dict.items():
        if node in nodes:
            nodes.remove(node)  # Each point finds itself; drop the self-reference
    adj_dict = {k: set(v) for k, v in adj_dict.items()}
    return adj_dict
I would like to expand the code to deal with multi-species systems. For example AB2, AB2C4 etc (Where A,B and C represent different atomic species). However, I am struggling to figure out a nice way to do this.
  A
 / \
B   B
The obvious method would be to just do a brute force Euclidean approach. My idea is to input the bond types for a molecule, so for AB2 (shown above), you would input something like AB to indicate the different types of bonds to consider, and then the respective bond lengths. Then loop over each atom finding the distance to all other atoms and, for this example of AB2, if an atom of type A is within the bond length of an atom B, consider them connected and populate the adjacency matrix. However, I'd like to be able to use the code on large datasets of 50,000+ atoms, so this method seems wasteful.
I suppose I could still use my current method, but just search for say the 10 nearest neighbors of a given atom and then do a Euclidean search for each atom pair, following the same approach as above. Still seems like a better method would already exist though.
Do better methods already exist for this type of problem? Finding nearest neighbors restricted by node type? Or maybe someone knows a more correct wording of my problem, which is I think one of my issues here.
"Then search the data."
This sounds like that old cartoon where someone points to a humorously complex diagram with a tiny label in the middle that says "Here a miracle happens".
Seriously, I am guessing that this searching is what you need to optimize (you do not exactly say).
In turn, this suggests that you are doing a linear search through every atom and calculating the distance to each. Could it be so?
There is a standard answer for this problem, called an octree.
https://en.wikipedia.org/wiki/Octree
The Netflix TV miniseries 'The Billion Dollar Code' dramatizes the advantages of this approach: https://www.netflix.com/title/81074012
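To make the idea of a spatial index combined with a type restriction concrete, here is a minimal sketch. It does not implement an octree; it uses a uniform grid (a "cell list"), a simpler relative that is common in molecular dynamics codes. The function name, the cutoffs dictionary layout and the example data are all assumptions made for illustration, and periodic boundaries are not handled. With SciPy, a comparable result could probably be obtained by building one KDTree per species and pairing them with query_ball_tree.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def typed_adjacency(positions, species, cutoffs):
    """Return {atom index: set of bonded atom indices}.

    positions: list of (x, y, z) tuples
    species:   list of element labels, same length as positions
    cutoffs:   {frozenset of the two species: maximum bond length};
               species pairs that are absent are never bonded
    """
    cell = max(cutoffs.values())  # cell edge length = largest cutoff
    grid = defaultdict(list)      # integer cell coordinates -> atom indices
    for i, p in enumerate(positions):
        grid[tuple(floor(c / cell) for c in p)].append(i)

    adj = {i: set() for i in range(len(positions))}
    for (cx, cy, cz), members in grid.items():
        # Only the 27 surrounding cells can hold atoms within the cutoff.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i >= j:
                        continue  # visit each unordered pair once
                    r = cutoffs.get(frozenset((species[i], species[j])))
                    if r is not None and dist(positions[i], positions[j]) <= r:
                        adj[i].add(j)
                        adj[j].add(i)
    return adj


# Hypothetical AB2 unit plus one distant B atom; only A-B bonds are allowed.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
species = ["A", "B", "B", "B"]
cutoffs = {frozenset({"A", "B"}): 1.2}
adj = typed_adjacency(positions, species, cutoffs)
```

Because each atom is only compared with atoms in neighbouring cells, the cost grows roughly linearly with the number of atoms for a fairly uniform density, rather than quadratically as in the brute-force approach.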
I have a .vtu file representing a mesh which I read through vtkXMLUnstructuredGridReader. Then I create a numpy array (nbOfPoints x 3) in which I store the mesh vertex coordinates, which I'll call meshArray.
I also have a column array (nOfPoints x 1), which I'll call brightnessArray, which represents a certain property I want to assign to the vertexes of the meshArray; so to each vertex corresponds a scalar value. For example: to the element meshArray[0] will correspond brightnessArray[0] and so on.
How can I do this?
Is it then possible to interpolate the values at the vertices of the mesh to obtain a smooth variation of the property I have set, in order to visualize it in ParaView?
Thank you.
Simon
Here is what you need to do:
1. Write a Python Programmable Source to read your numpy data as a vtkUnstructuredGrid. Here are a few examples of programmable sources:
   https://www.paraview.org/Wiki/ParaView/Simple_ParaView_3_Python_Filters
   https://www.paraview.org/Wiki/Python_Programmable_Filter
2. Read your .vtu dataset.
3. Apply a "Resample With Dataset" filter to the output of your Python programmable source and select your dataset as the "source".
And you're done.
The hardest part is writing the programmable source script.
I would like to calculate a distance matrix (using a genetic distance function) on a data set using http://biopython.org/DIST/docs/api/Bio.Cluster.Record-class.html#distancematrix, but I keep getting errors, typically telling me that the rank is not 2. I'm not sure what it wants as input, since the documentation never says and there are no examples online.
Say I read in some aligned gene sequences:
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
which would be done by
data = Align.read("dataset.fasta","fasta")
But the distancematrix method of the Cluster.Record class does not accept this. How can I get it to work? That is:
dist_mtx = distancematrix(data)
The short answer: You don't.
From the documentation:
A Record stores the gene expression data and related information
The Cluster object is used for gene expression data and not for MSA.
I would recommend using an external tool like MSARC which runs in Python as well.
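For illustration only, a pairwise distance matrix over aligned sequences can also be computed directly. The sketch below uses the simple p-distance (the fraction of mismatching columns, ignoring columns where either sequence has a gap), which is an assumption about what "genetic distance" should mean here and not a substitution-model distance. The function names and the toy sequences are hypothetical. Biopython also ships Bio.Phylo.TreeConstruction.DistanceCalculator, which may be a better fit for real alignments.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of compared columns that differ; gapped columns are skipped."""
    compared = mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        mismatches += a != b
    return mismatches / compared if compared else 0.0


def distance_matrix(sequences):
    """Symmetric matrix of pairwise p-distances for equal-length aligned sequences."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = p_distance(sequences[i], sequences[j])
    return matrix


# Toy aligned sequences, made up for this example.
toy = ["ACGT-A", "ACGTTA", "AGGT-C"]
dm = distance_matrix(toy)
```

With an alignment object, the sequence strings would be taken from each record (for example str(record.seq)) before being passed in.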
I have the following txt file representing a network in edgelist format.
The first two columns represent the usual: which node is connected to which other nodes
The third column represents weights, representing the number of times each node has contacted the other.
I have searched the igraph documentation but there's no mention of how to include an argument for weight when importing standard file formats like txt.
The file can be accessed from here and this is the code I've been using:
read.graph("Irvine/OClinks_w.txt", format="edgelist")
This code treats the third column as something other than weight.
Does anyone know the solution?
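Does the following cause too much annoyance?
# Name the third column "weight" so that it becomes E(g)$weight
g <- read.table("Irvine/OClinks_w.txt", col.names = c("from", "to", "weight"))
g <- graph.data.frame(g)
If it does, you can read directly from the file using the "ncol" format, which treats a third column as the edge weight:
g <- read.graph("Irvine/OClinks_w.txt", format="ncol")
E(g)$weight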
If you are using Python and igraph, the following code imports the weights and vertex names:
from igraph import Graph
g1w = Graph.Read_Ncol("g1_ncol_format_weighted.txt", names=True)
Note: you must tell igraph to read the names attribute with names=True; otherwise only vertex numbers will be imported.
Where g1_ncol_format_weighted.txt looks something like:
A B 2
B C 3
To make sure the import worked properly, use the following lines:
print(g1w.get_edgelist())
print(g1w.es["weight"])
print(g1w.vs["name"])
I'd like to know the best way to read a disconnected undirected graph using igraph for Python. For instance, take the simple graph in which 0 is linked to 1 and 2 is a node not connected to any other. I couldn't get igraph to read it from an edge list format (Graph.Read_Edgelist(...)), because every line must be an edge, so the following is not allowed:
0 1
2
I've been wondering whether an adjacency matrix is my only/best option in this case (I could get it to work with this representation). I'd rather use a format in which I can understand the data by looking at it, which is really hard with a matrix format.
Thanks in advance!
There's the LGL format which allows isolated vertices (see Graph.Read_LGL). The format looks like this:
# nodeID
nodeID2
nodeID3
# nodeID2
nodeID4
nodeID5
nodeID
# isolatedNode
# nodeID5
I think you get the basic idea; lines starting with a hash mark indicate that a new node is being defined. After this, the lines specify the neighbors of the node that has just been defined. If you need an isolated node, you just specify the node ID prepended by a hash mark in the line, then continue with the next node.
More information about the LGL format is to be found here.
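To make the semantics described above concrete, here is a small pure-Python sketch that turns LGL-style text into an undirected adjacency dictionary. It is not igraph's parser: the function name is hypothetical, and it assumes that any optional edge weight following a neighbor ID can be ignored.

```python
def parse_lgl(text):
    """Parse LGL-style text into {node: set of neighbors} (undirected)."""
    adjacency = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A hash mark starts the definition of a new node.
            current = line[1:].strip()
            adjacency.setdefault(current, set())
        else:
            if current is None:
                raise ValueError("neighbor line found before any '# node' line")
            neighbor = line.split()[0]  # ignore an optional weight, if present
            adjacency[current].add(neighbor)
            adjacency.setdefault(neighbor, set()).add(current)
    return adjacency


# The graph from the question: 0 is linked to 1, and 2 is isolated.
graph = parse_lgl("# 0\n1\n# 2\n")
```

The isolated vertex 2 appears as a key with an empty neighbor set, which is exactly what a plain edge list cannot express.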
Another fairly readable format that you might want to examine is the GML format which igraph also supports.