I am receiving data in the following format:
tail head
P01106 Q09472
P01106 Q13309
P62136 Q13616
P11831 P18146
P13569 P20823
P20823 P01100
...
Is there a good way to format this data as a graph with a numpy array? I am hoping to compute PageRank using this graph.
So far I have
import numpy as np
data = np.genfromtxt('wnt_edges.txt', skip_header=1, dtype=str)
I was thinking about using the graph data structure from Representing graphs (data structure) in Python but it didn't seem to make sense in this case since I'll be doing matrix multiplication.
To avoid reinventing the wheel you should use networkx as suggested in comments and other answers.
If, for educational purposes, you want to reinvent the wheel you can create an adjacency matrix. The PageRank can be computed from that matrix:
The PageRank values are the entries of the dominant right eigenvector of the modified adjacency matrix.
Since each row/column of the adjacency matrix represents a node, you will need to enumerate the nodes so each node is represented by a unique number starting from 0.
import numpy as np
data = np.array([['P01106', 'Q09472'],
['P01106', 'Q13309'],
['P62136', 'Q13616'],
['P11831', 'P18146'],
['P13569', 'P20823'],
['P20823', 'P01100']])
nodes = np.unique(data)  # mapping node index --> name
noidx = {n: i for i, n in enumerate(nodes)}  # mapping node name --> index
n = nodes.size  # number of nodes
numdata = np.vectorize(noidx.get)(data)  # replace node id by node index
A = np.zeros((n, n))
for tail, head in numdata:
    A[tail, head] = 1
    #A[head, tail] = 1  # add this line for undirected graph
This results in the following graph representation A:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
A 1 in row 5, column 0 for example means that there is an edge from node 5 to node 0, which corresponds to 'P20823' --> 'P01100'. Use the nodes array to look up node names from indices:
print(nodes)
['P01100' 'P01106' 'P11831' 'P13569' 'P18146' 'P20823' 'P62136' 'Q09472'
'Q13309' 'Q13616']
If there are many nodes and few connections, it's better to use a sparse matrix for A. But first try to stay with a dense matrix, and only bother to switch to sparse if you have memory or performance issues.
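As a rough illustration of the eigenvector statement above, here is a minimal power-iteration sketch. The function name pagerank_dense, the damping factor of 0.85, and the uniform handling of nodes without outgoing edges are assumptions of this sketch, not something fixed by the answer. With A[i, j] = 1 meaning an edge from i to j, the rank vector is the dominant left eigenvector of the row-stochastic matrix G built below, which is the same as the dominant right eigenvector of G.T.

```python
import numpy as np

def pagerank_dense(A, damping=0.85, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank; A[i, j] = 1 means an edge from node i to node j."""
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    # Row-normalise A; nodes without outgoing edges jump to every node uniformly.
    P = np.where(out_degree[:, None] > 0,
                 A / np.maximum(out_degree, 1)[:, None],
                 1.0 / n)
    # Follow a link with probability `damping`, otherwise teleport uniformly.
    G = damping * P + (1.0 - damping) / n
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = rank @ G
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Rebuild the adjacency matrix from the sample edges.
data = np.array([['P01106', 'Q09472'], ['P01106', 'Q13309'],
                 ['P62136', 'Q13616'], ['P11831', 'P18146'],
                 ['P13569', 'P20823'], ['P20823', 'P01100']])
nodes = np.unique(data)
noidx = {name: i for i, name in enumerate(nodes)}
A = np.zeros((nodes.size, nodes.size))
for tail, head in data:
    A[noidx[tail], noidx[head]] = 1

ranks = pagerank_dense(A)
for name, value in zip(nodes, ranks):
    print(name, value)
```

The values will generally differ from an undirected computation, because this sketch respects the tail --> head direction of each edge.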
I strongly suggest networkx:
import networkx as nx
# make the graph (nx.Graph is undirected; use nx.DiGraph to keep the tail --> head direction)
G = nx.Graph([e for e in data])
# compute the pagerank
nx.pagerank(G)
# output:
# {'P01100': 0.0770275315329843, 'P01106': 0.14594493693403143,
# 'P11831': 0.1, 'P13569': 0.0770275315329843, 'P18146': 0.1,
# 'P20823': 0.1459449369340315, 'P62136': 0.1, 'Q09472':
# 0.07702753153298428, 'Q13309': 0.07702753153298428, 'Q13616': 0.1}
That's all it takes. pagerank documentation is here.
I am trying to define a function and plot it. It is a convolution of a Gaussian (i) and exponential decay (f). It works fine for some x values but doesn't work for others. Can someone please help me find out why?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
plt.rcParams["font.size"] = 16
plt.rcParams['figure.figsize'] = (10, 5)
#d is FWHM of the Gaussian (determines the width of convolution)
#but the initial height and shape is affected by k value
def one_exp_fit_function(t1, d, u, k1):
    d_bar = d/(2*np.sqrt(np.log(2)))
    i = ((d_bar*np.sqrt(2*np.pi))**(-1))*np.exp(-np.log(2)*(((2*(t1-u))/d)**2))
    f = np.exp(-(k1)*t1)
    k = np.convolve(i, f, mode='same')
    return k
plt.plot(one_exp_fit_function(np.linspace(-5, 5, 100), 0.1, 3, 0.5), label='d=0.1, u=3, k=0.5')
plt.legend()
Plot
But when I enter the following list as x values it returns a list of zeros.
one_exp_fit_function(x, 5, 908, 0.5)
output
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
dtype=float32)
This is my list of x values
x
907.6, 907.61, 907.62, 907.63, 907.64, 907.65, 907.66, 907.67, 907.68, 907.69, 907.7, 907.71, 907.72, 907.73, 907.74, 907.75, 907.76, 907.77, 907.78, 907.79, 907.8, 907.81, 907.82, 907.83, 907.84, 907.85, 907.86, 907.87, 907.88, 907.89, 907.9, 907.91, 907.92, 907.93, 907.94, 907.95, 907.96, 907.97, 907.98, 907.99, 908.0, 908.01, 908.02, 908.03, 908.04, 908.05, 908.06, 908.07, 908.08, 908.09, 908.1, 908.11, 908.12, 908.13, 908.14, 908.15, 908.16, 908.17, 908.18, 908.19, 908.2, 908.21, 908.22, 908.23, 908.24
Can someone please help me as to what the problem might be?
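One possible explanation, assuming x is a float32 array (the dtype=float32 in the output suggests so): for t1 near 908 and k1 = 0.5, the decay term np.exp(-k1*t1) is about exp(-454), which is smaller than anything float32 can represent, so it underflows to exactly 0 and the convolution is all zeros. The sketch below only reproduces that single term; the names t32 and t64 are made up for the illustration.

```python
import numpy as np

# 65 points matching the range of the x values in the question.
t32 = np.linspace(907.6, 908.24, 65, dtype=np.float32)
t64 = t32.astype(np.float64)

f32 = np.exp(-0.5 * t32)  # underflows: exp(-454) is below float32's smallest positive value
f64 = np.exp(-0.5 * t64)  # tiny, but still representable in float64

print(f32.dtype, f32.max())
print(f64.dtype, f64.max())
```

If this is the cause, casting x to float64, or measuring time relative to the start of the window so that the exponent stays small, should avoid the exact zeros.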
Is there a method or package that converts a graph (or adjacency matrix) into a SMILES string?
For instance, I know the atoms are [6 6 7 6 6 6 6 8] ([C C N C C C C O]), and the adjacency matrix is
[[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 2., 0., 0., 0., 0., 1.],
[ 0., 2., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 1., 1.],
[ 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0., 1., 0., 0.]]
I need some function to output 'CC1=NCCC(C)O1'.
It would also work if a function could output the corresponding "mol" object. RDKit has a 'MolFromSmiles' function; I wonder if there is something like 'MolFromGraphs'.
Here is a simple solution; to my knowledge, there is no built-in function for this in RDKit.
from rdkit import Chem

def MolFromGraphs(node_list, adjacency_matrix):
    # create empty editable mol object
    mol = Chem.RWMol()

    # add atoms to mol and keep track of index
    node_to_idx = {}
    for i in range(len(node_list)):
        a = Chem.Atom(node_list[i])
        molIdx = mol.AddAtom(a)
        node_to_idx[i] = molIdx

    # add bonds between adjacent atoms
    for ix, row in enumerate(adjacency_matrix):
        for iy, bond in enumerate(row):
            # only traverse half the matrix
            if iy <= ix:
                continue
            # add relevant bond type (there are many more of these)
            if bond == 0:
                continue
            elif bond == 1:
                bond_type = Chem.rdchem.BondType.SINGLE
                mol.AddBond(node_to_idx[ix], node_to_idx[iy], bond_type)
            elif bond == 2:
                bond_type = Chem.rdchem.BondType.DOUBLE
                mol.AddBond(node_to_idx[ix], node_to_idx[iy], bond_type)

    # Convert RWMol to Mol object
    mol = mol.GetMol()
    return mol

# nodes: the atom list from the question, a: its adjacency matrix
Chem.MolToSmiles(MolFromGraphs(nodes, a))
Out:
'CC1=NCCC(C)O1'
This solution is a simplified version of https://github.com/dakoner/keras-molecules/blob/dbbb790e74e406faa70b13e8be8104d9e938eba2/convert_rdkit_to_networkx.py
There are many other atom properties (such as Chirality or Protonation state) and bond types (Triple, Dative...) that may need to be set. It is better to keep track of these explicitly in your graph if possible (as in the link above), but this function can also be extended to incorporate these if required.
I have two arrays:
np.array(y_pred_list).shape
# returns (5, 47151, 10)
np.array(y_val_lst).shape
# returns (5, 47151, 10)
np.array(y_pred_list)[:, 2, :]
# returns
array([[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
np.array(y_val_lst)[:, 2, :]
# returns
array([[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
I would like to go through all 47151 examples and calculate the "accuracy", meaning the number of examples in y_pred_list that match y_val_lst, divided by 47151. What's the comparison function for this?
You can find a lot of useful classification scores in sklearn.metrics, particularly accuracy_score(). See the doc here. You would use it as:
from sklearn.metrics import accuracy_score
acc = accuracy_score(np.array(y_val_lst)[:, 2, :],
                     np.array(y_pred_list)[:, 2, :])
Sounds like you want something like this (after converting the lists to arrays, since == on plain Python lists does not compare elementwise):
y_pred, y_val = np.array(y_pred_list), np.array(y_val_lst)
accuracy = (y_pred == y_val).all(axis=(0,2)).mean()
...though since your arrays are clearly floating-point arrays, you might want to allow for numerical-precision errors rather than insisting on exact equality:
accuracy = (np.abs(y_pred - y_val) < tolerance).all(axis=(0,2)).mean()
(where, for example, tolerance = 1e-10)
The .all(axis=(0,2)) call records cases in which everything in its input is True (i.e. everything matches) when working along the dimension 0 (i.e. the one that has extent 5) and dimension 2 (the one that has extent 10). It outputs a one-dimensional array of length 47151. The .mean() call then gives you the proportion of matches in that sequence, which is my best guess as to what you mean by "over 47151".
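To make the reduction concrete, here is a small sketch on toy arrays of shape (2, 4, 3) standing in for the real (5, 47151, 10) ones. The array contents are invented for the illustration, and np.isclose is used in place of the manual tolerance check.

```python
import numpy as np

# Toy stand-ins: 2 outputs, 4 examples, 3 classes.
y_val = np.zeros((2, 4, 3), dtype=np.float32)
y_val[0, 0, 1] = 1
y_val[1, 1, 2] = 1
y_val[0, 3, 0] = 1

# Predictions agree everywhere except for example 1.
y_pred = y_val.astype(np.float64)
y_pred[1, 1, 2] = 0
y_pred[1, 1, 0] = 1

# An example counts as correct only if every entry along axes 0 and 2 matches.
matches = np.isclose(y_pred, y_val, atol=1e-10).all(axis=(0, 2))
accuracy = matches.mean()

print(matches)   # [ True False  True  True]
print(accuracy)  # 0.75
```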
I have outputs from a process that produce a data trend as seen below:
The data output seems to have a trend along the diagonals; however, I am unsure how I can track this. Ultimately, I know the first 15 numbers in each 16-number sample and want to predict the 16th. It seems like you should be able to do this with some type of approximation that involves matrix math or possibly a phase shift in a Fourier series. Is there a method that could achieve this? A solution that can be used from Python would be preferred.
You can use my diagonal detection matrix, which was developed for a similar problem; it is sometimes referred to as the Omran matrix. All you need to do is multiply the image (your matrix) by my matrix and sum the first row of the output, which gives you the number of diagonals in the image. The matrix is also very flexible and can be a vertical rectangular matrix; I used some tricks in the physical meaning to invert it. I developed it in 2010 in Zurich, while doing my PhD, to detect diagonal lines or overtones in sweeps in visual sound images. The matrix is published in Detecting diagonal activity to quantify harmonic structure preservation with cochlear implant mapping or formal link. The PhD thesis is called Mechanism of music perception using cochlear implants, University of Zurich, 2011, by Sherif Omran. If you write a paper, please cite me, and good luck.
Here are similar images with overtones; I used my matrix to detect these diagonal activities, which look very similar to yours.
Here is an example of how to check whether opposite diagonals contain only 1s, like in your case:
In [52]: from scipy.sparse import eye
Let's create a matrix with an opposite diagonal:
In [53]: a = np.fliplr(eye(5, 8, k=1).toarray())
In [54]: a
Out[54]:
array([[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.]])
Flip array in the left/right direction
In [55]: f = np.fliplr(a)
In [56]: f
Out[56]:
array([[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]])
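Alternatively, flip the array in the up/down direction (this also turns the opposite diagonal into an ordinary one, but at a different offset):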
In [71]: a[::-1,:]
Out[71]:
array([[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0.]])
get given diagonal
In [57]: np.diag(f, k=1)
Out[57]: array([ 1., 1., 1., 1., 1.])
In [58]: np.diag(f, k=-1)
Out[58]: array([ 0., 0., 0., 0.])
In [111]: a[::-1].diagonal(2)
Out[111]: array([ 1., 1., 1., 1., 1.])
check whether the whole diagonal contains 1s
In [61]: np.all(np.diag(f, k=1) == 1)
Out[61]: True
or
In [64]: (np.diag(f, k=1) == 1).all()
Out[64]: True
In [65]: (np.diag(f, k=0) == 1).all()
Out[65]: False
This answer will help you to find all diagonals
PS: I'm a newbie in numpy, so I'm pretty sure there must be faster and more elegant solutions.
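Building on the checks above, a minimal sketch for scanning every diagonal offset at once could look like this. It uses np.eye instead of scipy.sparse.eye so that it only needs NumPy, and the name full_diagonals is made up for the example.

```python
import numpy as np

a = np.fliplr(np.eye(5, 8, k=1))  # same opposite-diagonal matrix as above
f = np.fliplr(a)                  # opposite diagonals become ordinary diagonals

rows, cols = f.shape
# Collect every offset k whose diagonal consists only of 1s.
full_diagonals = [
    k for k in range(-(rows - 1), cols)
    if np.all(np.diag(f, k=k) == 1)
]
print(full_diagonals)  # [1]
```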
I wish to be able to extract a row or a column from a 2D array in Python such that it preserves the 2D shape and can be used for matrix multiplication. However, I cannot find in the documentation how this can best be done. For example, I can use
a = np.zeros(shape=(6,6))
to create an array, but a[:,0] will have the shape of (6,), and I cannot multiply this by a matrix of shape (6,1). Do I need to reshape a row or a column of an array into a matrix for every matrix multiplication, or are there other ways to do matrix multiplication?
You could use np.matrix directly (note that the NumPy documentation no longer recommends this class for new code):
>>> a = np.zeros(shape=(6,6))
>>> ma = np.matrix(a)
>>> ma
matrix([[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.]])
>>> ma[0,:]
matrix([[ 0., 0., 0., 0., 0., 0.]])
or you could add the dimension with np.newaxis
>>> a[0,:][np.newaxis, :]
array([[ 0., 0., 0., 0., 0., 0.]])
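Another option that stays with plain arrays is to index with a length-1 slice or a list, both of which keep the result two-dimensional. The sketch below uses an arange-filled array rather than zeros purely so the products are non-trivial; the variable names are made up for the example.

```python
import numpy as np

a = np.arange(36, dtype=float).reshape(6, 6)

col = a[:, 0:1]        # slice of length 1 keeps 2D: shape (6, 1), a view
col_copy = a[:, [0]]   # list index also keeps 2D: shape (6, 1), a copy
row = a[0:1, :]        # shape (1, 6)

inner = row @ col      # (1, 6) @ (6, 1) -> shape (1, 1)
outer = col @ row      # (6, 1) @ (1, 6) -> shape (6, 6)

print(col.shape, row.shape, inner.shape, outer.shape)
```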