SMILES from graph - python

Is there a method or package that converts a graph (or adjacency matrix) into a SMILES string?
For instance, I know the atoms are [6 6 7 6 6 6 6 8] ([C C N C C C C O]), and the adjacency matrix is
[[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 2., 0., 0., 0., 0., 1.],
[ 0., 2., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 1., 1.],
[ 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0., 1., 0., 0.]]
I need some function to output 'CC1=NCCC(C)O1'.
It would also work if some function could output the corresponding "mol" object. RDKit has a 'MolFromSmiles' function; I wonder if there is something like 'MolFromGraphs'.

Here is a simple solution; to my knowledge, there is no built-in function for this in RDKit.
from rdkit import Chem

def MolFromGraphs(node_list, adjacency_matrix):
    # create empty editable mol object
    mol = Chem.RWMol()

    # add atoms to mol and keep track of index
    node_to_idx = {}
    for i in range(len(node_list)):
        a = Chem.Atom(node_list[i])
        molIdx = mol.AddAtom(a)
        node_to_idx[i] = molIdx

    # add bonds between adjacent atoms
    for ix, row in enumerate(adjacency_matrix):
        for iy, bond in enumerate(row):
            # only traverse half the matrix
            if iy <= ix:
                continue
            # add relevant bond type (there are many more of these)
            if bond == 0:
                continue
            elif bond == 1:
                bond_type = Chem.rdchem.BondType.SINGLE
                mol.AddBond(node_to_idx[ix], node_to_idx[iy], bond_type)
            elif bond == 2:
                bond_type = Chem.rdchem.BondType.DOUBLE
                mol.AddBond(node_to_idx[ix], node_to_idx[iy], bond_type)

    # Convert RWMol to Mol object
    mol = mol.GetMol()
    return mol
Chem.MolToSmiles(MolFromGraphs(nodes, a))
Out:
'CC1=NCCC(C)O1'
This solution is a simplified version of https://github.com/dakoner/keras-molecules/blob/dbbb790e74e406faa70b13e8be8104d9e938eba2/convert_rdkit_to_networkx.py
There are many other atom properties (such as Chirality or Protonation state) and bond types (Triple, Dative...) that may need to be set. It is better to keep track of these explicitly in your graph if possible (as in the link above), but this function can also be extended to incorporate these if required.
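One possible way to prepare for more bond types is to first turn the matrix into an explicit bond list and validate it before touching RDKit. The sketch below uses only numpy; the function name bonds_from_adjacency and the BOND_NAMES table are hypothetical, and the names in the table were chosen to match members of Chem.rdchem.BondType so that getattr(Chem.rdchem.BondType, name) could be passed to mol.AddBond.

```python
import numpy as np

# Hypothetical lookup table: matrix entry -> bond type name.
BOND_NAMES = {1: "SINGLE", 2: "DOUBLE", 3: "TRIPLE"}

def bonds_from_adjacency(adjacency_matrix):
    """Return a list of (i, j, bond_name) with i < j, one entry per bond."""
    adjacency = np.asarray(adjacency_matrix)
    if adjacency.ndim != 2 or adjacency.shape[0] != adjacency.shape[1]:
        raise ValueError("adjacency matrix must be square")
    if not np.array_equal(adjacency, adjacency.T):
        raise ValueError("adjacency matrix must be symmetric")

    bonds = []
    # Visit only the upper triangle so each bond is reported once.
    for i, j in zip(*np.nonzero(np.triu(adjacency, k=1))):
        order = int(adjacency[i, j])
        if order != adjacency[i, j] or order not in BOND_NAMES:
            raise ValueError(f"unsupported bond order {adjacency[i, j]}")
        bonds.append((int(i), int(j), BOND_NAMES[order]))
    return bonds
```

Failing early on an asymmetric matrix or an unknown bond order avoids silently dropping bonds, which the if/elif chain above would do for any value other than 0, 1 or 2.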

Related

Finding the start/stop positions and length of the longest and shortest sequence of 1s or 0s in a numpy matrix

I have a numpy matrix that looks like:
matrix = [[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
How would I get the length of the longest sequence of 1s or 0s? Also how would I get their start and stop positions?
Is there an easier numpy-way to get this done?
Output format is flexible as long as it denotes the inner list index, the length value, and value's list indices.
Example:
LONGEST ONES: 1, 16, 2, 17 (index of inner list, length, longest 1s sequence index start, longest 1s sequence end pos.).
or [1, 16, 2, 17]/(1, 16, 2, 17)
LONGEST ZEROS: 2, 45, 0, 45
Not a duplicate of these questions as this concerns a matrix:
find the start position of the longest sequence of 1's
The result(longest) should be considered among all lists.
A sequence count does not continue when it reaches the end of an inner list.
Using Divakar's base answer, you can adapt it with np.vectorize by setting the signature argument and doing some simple arithmetic to get what you're looking for.
Take, for instance,
import numpy as np

m = np.array(matrix)

def get_longest_ones_matrix(b):
    # (start, exclusive stop) index pairs of every run of 1s in the row b
    idx_pairs = np.where(np.diff(np.hstack(([False], b == 1, [False]))))[0].reshape(-1, 2)
    if not idx_pairs.size:
        return np.array([0, 0, 0])
    d = np.diff(idx_pairs, axis=1).argmax()
    start_longest_seq = idx_pairs[d, 0]
    end_longest_seq = idx_pairs[d, 1]
    l = end_longest_seq - start_longest_seq  # length of the longest run
    p = start_longest_seq                    # start index within the row
    e = end_longest_seq - 1                  # inclusive end index
    return np.array([l, p, e])

# apply the function to each row (last axis) of the matrix
v = np.vectorize(get_longest_ones_matrix, signature='(n)->(k)')
x = v(m)
Which yields
[[ 3 26 28]
[16 2 17]
[ 0 0 0]]
Then,
a = x[:,0].argmax()
print(a,x[a])
1 [16 2 17]
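The question also asks for the longest run of 0s, which the code above does not cover. A minimal sketch of one way to handle either value with the same edge-detection idea is shown here; the function name longest_run is hypothetical, and the stop index is inclusive, as in the [16 2 17] result above.

```python
import numpy as np

def longest_run(matrix, value):
    """Return (row, length, start, stop) of the longest run of `value`.

    Runs do not continue across rows; stop is inclusive.
    Returns None if `value` does not occur at all.
    """
    best = None
    for r, row in enumerate(np.asarray(matrix)):
        # Pad with False so runs touching the row ends still produce two edges.
        mask = np.concatenate(([False], row == value, [False]))
        edges = np.flatnonzero(np.diff(mask.astype(np.int8)))
        if edges.size == 0:
            continue
        starts, stops = edges[::2], edges[1::2]  # stops are exclusive here
        lengths = stops - starts
        k = lengths.argmax()
        if best is None or lengths[k] > best[1]:
            best = (r, int(lengths[k]), int(starts[k]), int(stops[k] - 1))
    return best
```

Calling longest_run(m, 1) and longest_run(m, 0) then gives one tuple for each of the two cases.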

Representing Graph with Numpy Array

I am receiving data in the following format:
tail head
P01106 Q09472
P01106 Q13309
P62136 Q13616
P11831 P18146
P13569 P20823
P20823 P01100
...
Is there a good way to format this data as a graph with a numpy array? I am hoping to compute PageRank using this graph.
So far I have
import numpy as np
data = np.genfromtxt('wnt_edges.txt', skip_header=1, dtype=str)
I was thinking about using the graph data structure from Representing graphs (data structure) in Python but it didn't seem to make sense in this case since I'll be doing matrix multiplication.
To avoid reinventing the wheel you should use networkx as suggested in comments and other answers.
If, for educational purposes, you want to reinvent the wheel you can create an adjacency matrix. The PageRank can be computed from that matrix:
The PageRank values are the entries of the dominant right eigenvector of the modified adjacency matrix.
Since each row/column of the adjacency matrix represents a node, you will need to enumerate the nodes so each node is represented by a unique number starting from 0.
import numpy as np

data = np.array([['P01106', 'Q09472'],
                 ['P01106', 'Q13309'],
                 ['P62136', 'Q13616'],
                 ['P11831', 'P18146'],
                 ['P13569', 'P20823'],
                 ['P20823', 'P01100']])

nodes = np.unique(data)                      # mapping node index --> name
noidx = {n: i for i, n in enumerate(nodes)}  # mapping node name --> index
n = nodes.size                               # number of nodes

numdata = np.vectorize(noidx.get)(data)      # replace node name by node index

A = np.zeros((n, n))
for tail, head in numdata:
    A[tail, head] = 1
    #A[head, tail] = 1  # add this line for undirected graph
This results in the following graph representation A:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
A 1 in row 5, column 0 for example means that there is an edge from node 5 to node 0, which corresponds to 'P20823' --> 'P01100'. Use the nodes array to look up node names from indices:
print(nodes)
['P01100' 'P01106' 'P11831' 'P13569' 'P18146' 'P20823' 'P62136' 'Q09472'
'Q13309' 'Q13616']
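The eigenvector statement above can be sketched with a simple power iteration on A. This is only an illustration under stated assumptions: the function name pagerank_dense is hypothetical, the damping factor of 0.85 is an assumed (commonly used) default, and nodes without outgoing edges are assumed to jump uniformly to every node.

```python
import numpy as np

def pagerank_dense(A, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on a dense adjacency matrix; A[i, j] = 1 means edge i -> j."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1, keepdims=True)

    # Row-normalise so each row sums to 1. Rows without outgoing edges
    # (dangling nodes) are replaced by a uniform jump to every node.
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)
    P = np.where(out_degree > 0, A / safe_degree, 1.0 / n)

    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        previous = rank
        rank = (1.0 - damping) / n + damping * (P.T @ previous)
        if np.abs(rank - previous).sum() < tol:
            break
    return rank
```

With the matrix built above, pagerank_dense(A) returns one score per node in the same order as the nodes array, so dict(zip(nodes, pagerank_dense(A))) maps names to scores.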
If there are many nodes and few connections, it's better to use a sparse matrix for A. But first try to stay with a dense matrix, and only bother switching to sparse if you have memory or performance issues.
I strongly suggest networkx:
import networkx as nx
# make the graph (nx.Graph is undirected; use nx.DiGraph to keep the tail -> head direction)
G = nx.Graph([e for e in data])
#compute the pagerank
nx.pagerank(G)
# output:
# {'P01100': 0.0770275315329843, 'P01106': 0.14594493693403143,
# 'P11831': 0.1, 'P13569': 0.0770275315329843, 'P18146': 0.1,
# 'P20823': 0.1459449369340315, 'P62136': 0.1, 'Q09472':
# 0.07702753153298428, 'Q13309': 0.07702753153298428, 'Q13616': 0.1}
That's all it takes. See the networkx documentation for pagerank.

Detecting Diagonals in a matrix

I have outputs from a process that produce a data trend as seen below:
The data output seems to have a trend along the diagonals, but I am unsure how to track it. Ultimately, I know the first 15 numbers in each 16-number sample and want to predict the 16th. It seems like this should be possible with some type of approximation that involves matrix math or possibly a phase shift in a Fourier series. Is there a method that could achieve this? A solution that can be used from Python would be preferred.
You can use my diagonal detection matrix, which was developed for a similar issue and is sometimes referred to as the Omran Matrix. All you need to do is multiply the image (your matrix) by my matrix and sum the first row of the output, which gives you the number of diagonals in the image. The matrix is also very flexible and can be a vertical rectangular matrix; I used some tricks based on its physical meaning to invert it. I developed it in 2010 in Zurich, while doing my PhD, to detect diagonal lines or overtones in sweeps in visual sound images. The matrix is published in "Detecting diagonal activity to quantify harmonic structure preservation with cochlear implant mapping". The PhD thesis is "Mechanism of music perception using cochlear implants", University of Zurich, 2011, by Sherif Omran. If you write a paper, please cite me, and good luck.
Here are similar images with overtones; I used my matrix to detect these diagonal activities, which look very close to yours.
Here is an example of how to check whether opposite diagonals contain only 1s, like in your case:
In [51]: import numpy as np
In [52]: from scipy.sparse import eye
Let's create a matrix with an opposite diagonal:
In [53]: a = np.fliplr(eye(5, 8, k=1).toarray())
In [54]: a
Out[54]:
array([[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.]])
Flip array in the left/right direction
In [55]: f = np.fliplr(a)
In [56]: f
Out[56]:
array([[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]])
A similar result can be obtained by reversing the rows instead:
In [71]: a[::-1,:]
Out[71]:
array([[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0.]])
get given diagonal
In [57]: np.diag(f, k=1)
Out[57]: array([ 1., 1., 1., 1., 1.])
In [58]: np.diag(f, k=-1)
Out[58]: array([ 0., 0., 0., 0.])
In [111]: a[::-1].diagonal(2)
Out[111]: array([ 1., 1., 1., 1., 1.])
check whether the whole diagonal contains 1s
In [61]: np.all(np.diag(f, k=1) == 1)
Out[61]: True
or
In [64]: (np.diag(f, k=1) == 1).all()
Out[64]: True
In [65]: (np.diag(f, k=0) == 1).all()
Out[65]: False
This answer will help you to find all diagonals.
PS: I'm a newbie in numpy, so I'm pretty sure there must be faster and more elegant solutions.
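The single-diagonal checks above can be extended to every opposite diagonal by looping over all offsets. This is a minimal sketch, not a tuned solution; the function name full_antidiagonals is hypothetical, and the returned offsets are the k values that np.diag uses on the left/right-flipped array.

```python
import numpy as np

def full_antidiagonals(a):
    """Return the offsets k of all opposite diagonals that contain only 1s.

    The offsets refer to np.diag(np.fliplr(a), k=k).
    """
    flipped = np.fliplr(np.asarray(a))
    rows, cols = flipped.shape
    hits = []
    # Valid offsets run from the bottom-left corner to the top-right corner.
    for k in range(-rows + 1, cols):
        if np.all(np.diag(flipped, k=k) == 1):
            hits.append(k)
    return hits
```

For the 5x8 example matrix a from this answer, only the offset k=1 qualifies, which matches the np.diag(f, k=1) check above.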

Sparse Construct: Repeating Identity

Say I have the following two matrices, with ij being large (e.g. 5000):
E = np.identity((ij))
oneVector = np.ones((1, ij))
and I need to compute
np.kron(E, oneVector)
This is quite slow and inefficient. Basically, the Kronecker product of identity and a row vector of ones is repeating the identity matrix horizontally oneVector.size times.
I believe that creating a sparse product would make more sense. scipy.sparse.kron would allow me to create that product if I had both A, B as sparse. But I don't know how to create the vector of ones as a "sparse type" matrix.
Is there a simple way to generate the sparse equivalent of np.ones() or is there another way I should proceed?
The arguments to scipy.sparse.kron do not have to be sparse.
In [31]: import numpy as np
In [32]: import scipy.sparse as sp
In [33]: ij = 4
In [34]: E = sp.identity(ij) # Sparse identity matrix
In [35]: oneVector = np.ones((1, ij)) # Dense
In [36]: m = sp.kron(E, oneVector) # m is sparse.
In [37]: m
Out[37]:
<4x16 sparse matrix of type '<type 'numpy.float64'>'
with 16 stored elements (blocksize = 1x4) in Block Sparse Row format>
In [38]: m.toarray()
Out[38]:
array([[ 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1.]])
P.S. Based on this comment:
Basically, the Kronecker product of identity and a row vector of ones is repeating the identity matrix horizontally oneVector.size times.
I wonder if you meant kron(oneVector, E):
In [39]: m = sp.kron(oneVector, E)
In [40]: m.toarray()
Out[40]:
array([[ 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1.]])
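Because both products have exactly one nonzero per column, their nonzero positions can also be written down directly without calling kron at all. The sketch below computes those positions with numpy only; the function names are hypothetical, and the resulting index arrays are in the (rows, cols) form that scipy.sparse.coo_matrix((data, (rows, cols)), shape=...) accepts, with data being an array of ones.

```python
import numpy as np

def kron_identity_ones_indices(n, m):
    """Nonzero (rows, cols) of kron(identity(n), ones((1, m)))."""
    rows = np.repeat(np.arange(n), m)  # each row index appears m times in a row
    cols = np.arange(n * m)            # exactly one nonzero per column
    return rows, cols

def kron_ones_identity_indices(n, m):
    """Nonzero (rows, cols) of kron(ones((1, m)), identity(n))."""
    rows = np.tile(np.arange(n), m)    # the identity pattern repeated m times
    cols = np.arange(n * m)
    return rows, cols

def to_dense(rows, cols, shape):
    """Small helper for checking the index arrays on tiny examples."""
    dense = np.zeros(shape)
    dense[rows, cols] = 1.0
    return dense
```

The first function reproduces the pattern of Out[38], the second that of Out[40]; to_dense is only meant for verifying small cases.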

multiple condition in fancy indexing

I am new to Python and am trying to do some simple classification on a raster image.
Basically, I am reading a TIF image as a 2D array and do some calculating and manipulation on it. For classification part, I am trying to create 3 empty arrays for land, water, and clouds. These classes will be assigned a value of 1 under multiple conditions, and eventually assigning these classes as landclass=1, waterclass=2, cloudclass=3 respectively.
Apparently I can assign all values in an array to 1 under one condition, like this:
crop = gdal.Open(crop,GA_ReadOnly)
crop = crop.ReadAsArray()
rows,cols = crop.shape
mode = int(stats.mode(crop, axis=None)[0])
water = np.empty(shape=(rows, cols), dtype=np.float32)
land = water
clouds = water
Then I have something like this (output):
>>> land
array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
>>> land[water==0]=1
>>> land
array([[ 0., 0., 0., ..., 1., 1., 1.],
[ 0., 0., 0., ..., 1., 1., 1.],
[ 0., 0., 0., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 0., 0., 0.],
[ 1., 1., 1., ..., 0., 0., 0.],
[ 1., 1., 1., ..., 0., 0., 0.]], dtype=float32)
>>> land[crop>mode]=1
>>> land
array([[ 0., 0., 0., ..., 1., 1., 1.],
[ 0., 0., 0., ..., 1., 1., 1.],
[ 0., 0., 0., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 0., 0., 0.],
[ 1., 1., 1., ..., 0., 0., 0.],
[ 1., 1., 1., ..., 0., 0., 0.]], dtype=float32)
But how can I have the values in "land" equal to 1 under a couple of conditions without altering the shape of the array?
I tried to do this
land[water==0,crop>mode]=1
and I got ValueError. And I tried this
land[water==0 and crop>mode]=1
and Python asks me to use a.any() or a.all()....
For only one condition, the result is exactly what I want, and I have to do it in order to get the result. eg (this is what I have in my actual code):
water[band6 < b6_threshold]=1
water[band7 < b7_threshold_1]=1
water[band6 > b6_threshold]=1
water[band7 < b7_threshold_2]=1
land[band6 > b6_threshold]=1
land[band7 > b7_threshold_2]=1
land[clouds == 1]=1
land[water == 1]=1
land[b1b4 < 0.5]=1
land[band3 < 0.1]=1
clouds[land == 0]=1
clouds[water == 0]=1
clouds[band6 < (b6_mode-4)]=1
I found this is a bit confusing and I would like to combine all conditions within one statement... Any suggestion on that?
Thank you very much!
You can multiply the boolean arrays for something like "and":
>>> import numpy as np
>>> a = np.array([1,2,3,4])
>>> a[(a > 1) * (a < 3)] = 99
>>> a
array([ 1, 99, 3, 4])
And you can add them for something like "or":
>>> a[(a > 1) + (a < 3)] = 123
>>> a
array([123, 123, 123, 123])
Alternatively, if you prefer to think of boolean logic rather than True and False being 0 and 1, you can also use the operators & and | to the same effect.
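The & and | variant mentioned above can be sketched as follows. The array and the class codes are made up for illustration; the one point to watch is that each comparison needs its own parentheses, because & and | bind more tightly than < and >.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Parentheses are required around each comparison.
both = (a > 1) & (a < 3)     # element-wise "and"
either = (a < 2) | (a > 3)   # element-wise "or"

# Use the combined masks to assign class codes in one statement each.
classes = np.zeros(a.shape, dtype=np.uint8)
classes[both] = 1
classes[either] = 2
```

The same pattern applies to 2D arrays of equal shape, for example land[(water == 0) & (crop > mode)] = 1 for the case in the question.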
