Convert binomial tree to paths - python

I have a binomial tree stored as an upper triangular matrix:
array([[400., 500., 625.],
[ 0., 320., 400.],
[ 0., 0., 256.]])
and I am trying to convert it to a matrix with all possible paths, like:
array([[400., 500., 625.],
[400., 500., 400.],
[400., 320., 400.],
[400., 320., 256.]])
I've written a snippet that does the job when there are only 2 steps:
def unstack_tree(tree):
    output_map = []
    for i in range(tree.shape[0] - 1):
        for j in range(tree.shape[1] - 1):
            output_map.append([tree[0,0], tree[i, 1], tree[i+j, 2]])
    return np.array(output_map)
But I am struggling to generalize it to N steps so that it handles, say, a 3-step tree:
array([[400. , 500. , 625. , 781.25],
[ 0. , 320. , 400. , 500. ],
[ 0. , 0. , 256. , 320. ],
[ 0. , 0. , 0. , 204.8 ]])
I think I need more loops, but I cannot work out how to formulate them.

Each path can be represented by a binary code with one bit per step: the first
path is (0, 0), the second (0, 1), the third (1, 0), and so on. The actual row
index into the array at each step is the cumulative sum of that binary
representation.
import numpy as np
from itertools import product
n = 2
b = np.array([[400., 500., 625.],
[ 0., 320., 400.],
[ 0., 0., 256.]])
a = np.array(list(product((0, 1), repeat=n)))
a = np.c_[[0] * 2 ** n, a]
print(a)
# [[0 0 0]
# [0 0 1]
# [0 1 0]
# [0 1 1]]
a = a.cumsum(axis=1)
print(a)
# [[0 0 0]
# [0 0 1]
# [0 1 1]
# [0 1 2]]
print(np.choose(a, b))
# [[400. 500. 625.]
# [400. 500. 400.]
# [400. 320. 400.]
# [400. 320. 256.]]
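The same idea can be written as a function for any number of steps. The following is a minimal sketch, not part of the answer above: it swaps np.choose for integer array indexing, on the assumption that the tree is a square upper-triangular array with at least one step, because np.choose only accepts a limited number of choice arrays.

```python
import numpy as np
from itertools import product

def unstack_tree(tree):
    """Return every path through a binomial tree stored as an upper-triangular matrix."""
    n_steps = tree.shape[1] - 1  # assumption: at least one step
    # One row per path; each entry is 0 (stay in the same row) or 1 (move one row down).
    moves = np.array(list(product((0, 1), repeat=n_steps)))
    # Row index at each step = running count of down-moves, starting from row 0.
    start = np.zeros((moves.shape[0], 1), dtype=int)
    rows = np.hstack([start, moves.cumsum(axis=1)])
    cols = np.arange(n_steps + 1)
    # rows has shape (2**n_steps, n_steps + 1); cols broadcasts against it.
    return tree[rows, cols]

tree = np.array([[400., 500., 625., 781.25],
                 [0., 320., 400., 500.],
                 [0., 0., 256., 320.],
                 [0., 0., 0., 204.8]])
paths = unstack_tree(tree)
print(paths.shape)  # (8, 4)
print(paths)
```

The number of paths doubles with every step, so the result has 2**N rows for an N-step tree.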

Related

How to convert List of Lists of Tuples- pairs (index,value) into 2D numpy array

There is list of list of tuples:
[[(0, 0.5), (1, 0.6)], [(4, 0.01), (5, 0.005), (6, 0.002)], [(1,0.7)]]
I need to get matrix X x Y:
x = num of sublists
y = max among the first elements throughout all pairs, plus 1
elem[x,y] = second elem for x sublist if first elem==Y
    0      1      2      3      4      5      6
  0.5    0.6      0      0      0      0      0
    0      0      0      0   0.01  0.005  0.002
    0    0.7      0      0      0      0      0
You can work out the array's dimensions as follows. The Y dimension is the number of sublists:
>>> data = [[(0, 0.5), (1, 0.6)], [(4, 0.01), (5, 0.005), (6, 0.002)], [(1,0.7)]]
>>> dim_y = len(data)
>>> dim_y
3
The X dimension is the largest [0] index of all of the tuples, plus 1.
>>> dim_x = max(max(i for i,j in sub) for sub in data) + 1
>>> dim_x
7
So then initialize an array of all zeros with this size
>>> import numpy as np
>>> arr = np.zeros((dim_x, dim_y))
>>> arr
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
Now to fill it, enumerate over your sublists to keep track of the y index. Then for each sublist use the [0] for the x index and the [1] for the value itself
for y, sub in enumerate(data):
    for x, value in sub:
        arr[x,y] = value
Then the resulting array should be populated (might want to transpose to look like your desired dimensions).
>>> arr.T
array([[0.5 , 0.6 , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.01 , 0.005, 0.002],
[0. , 0.7 , 0. , 0. , 0. , 0. , 0. ]])
As I commented on the accepted answer, the data is 'ragged' and can't be made directly into an array.
Now if the data had a more regular form, a no-loop solution is possible. But conversion to such a form requires the same double looping!
In [814]: [(i,j,v) for i,row in enumerate(data) for j,v in row]
Out[814]:
[(0, 0, 0.5),
(0, 1, 0.6),
(1, 4, 0.01),
(1, 5, 0.005),
(1, 6, 0.002),
(2, 1, 0.7)]
'transpose' and separate into 3 variables:
In [815]: I,J,V=zip(*_)
In [816]: I,J,V
Out[816]: ((0, 0, 1, 1, 1, 2), (0, 1, 4, 5, 6, 1), (0.5, 0.6, 0.01, 0.005, 0.002, 0.7))
I stuck with the list transpose here so as to not convert the integer indices to floats. It may also be faster, since making an array from a list isn't a time-trivial task.
Now we can assign values via numpy magic:
In [819]: arr = np.zeros((3,7))
In [820]: arr[I,J]=V
In [821]: arr
Out[821]:
array([[0.5 , 0.6 , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.01 , 0.005, 0.002],
[0. , 0.7 , 0. , 0. , 0. , 0. , 0. ]])
I,J,V could also be used as input to a scipy.sparse.coo_matrix call, making a sparse matrix.
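As a rough sketch of that coo_matrix call: the triple is passed as (values, (rows, cols)). The shape argument here is an assumption taken from the data itself (number of sublists by largest column index plus one).

```python
import numpy as np
from scipy import sparse

data = [[(0, 0.5), (1, 0.6)], [(4, 0.01), (5, 0.005), (6, 0.002)], [(1, 0.7)]]

# Flatten to (row, col, value) triples, then split into three tuples.
I, J, V = zip(*[(i, j, v) for i, row in enumerate(data) for j, v in row])

# coo_matrix takes (values, (row_indices, col_indices)).
M = sparse.coo_matrix((V, (I, J)), shape=(len(data), max(J) + 1))
dense = M.toarray()
print(dense)
```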
Speaking of a sparse matrix, here's what a sparse version of arr looks like:
In list-of-lists format:
In [822]: from scipy import sparse
In [823]: M = sparse.lil_matrix(arr)
In [824]: M
Out[824]:
<3x7 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in List of Lists format>
In [825]: M.A
Out[825]:
array([[0.5 , 0.6 , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.01 , 0.005, 0.002],
[0. , 0.7 , 0. , 0. , 0. , 0. , 0. ]])
In [826]: M.rows
Out[826]: array([list([0, 1]), list([4, 5, 6]), list([1])], dtype=object)
In [827]: M.data
Out[827]:
array([list([0.5, 0.6]), list([0.01, 0.005, 0.002]), list([0.7])],
dtype=object)
and the more common coo format:
In [828]: Mc=M.tocoo()
In [829]: Mc.row
Out[829]: array([0, 0, 1, 1, 1, 2], dtype=int32)
In [830]: Mc.col
Out[830]: array([0, 1, 4, 5, 6, 1], dtype=int32)
In [831]: Mc.data
Out[831]: array([0.5 , 0.6 , 0.01 , 0.005, 0.002, 0.7 ])
and the csr used for most calculations:
In [832]: Mr=M.tocsr()
In [833]: Mr.data
Out[833]: array([0.5 , 0.6 , 0.01 , 0.005, 0.002, 0.7 ])
In [834]: Mr.indices
Out[834]: array([0, 1, 4, 5, 6, 1], dtype=int32)
In [835]: Mr.indptr
Out[835]: array([0, 2, 5, 6], dtype=int32)

Numpy insert rows to match number of max row count

I have the following array:
a = [[40.5, 23.4],
[175.9, 20.2],
[21.4, 24.0],
[130.3, 18.4],
[6.3, 25.7],
[73.4, 21.5],
[16.6, 25.7],
[125.9, 19.1],
[41.4, 24.7],
[180.6, 16.4],
[13.6, 24.4],
[103.2, 19.0],
[3.2, 24.7],
[55.9, 23.1],
[208.8, 20.4]]
I need to add rows with zeroes to have the final array as:
b = [[40.5, 23.4],
[175.9, 20.2],
[0., 0.],
[21.4, 24.0],
[130.3, 18.4],
[0., 0.],
[6.3, 25.7],
[73.4, 21.5],
[0., 0.],
[16.6, 25.7],
[125.9, 19.1],
[0., 0.],
[41.4, 24.7],
[0., 0.],
[0., 0.],
[180.6, 16.4],
[0., 0.],
[0., 0.],
[13.6, 24.4],
[103.2, 19.0],
[0., 0.],
[3.2, 24.7],
[55.9, 23.1],
[208.8, 20.4]]
In summary, I need to add rows at specific indices, but the number of rows to add is not constant. In this example, I need to add as many rows as it takes for each key to match the maximum number of keys. I don't care about the keys in my code, but I need to somehow "normalise" the array so that it has the same number of rows for each key.
Here's a sample of the list of indices: [[ 2 1] [ 4 1] [ 6 1] [ 8 2] [ 9 2] [10 1] [12 0]]
I've tried np.insert, np.concatenate, advanced indexing, etc but could not come up with a solution.
Any Ideas how to solve this?
import numpy as np
# The indices you provided don't seem to match the data, so this uses
# a made-up example list of [index, count] pairs. Substitute your own.
# First, flatten the list into a form that can be fed to np.insert()
indices = [[0, 2], [2, 3]]
tmp = [[pair[0]] * pair[1] for pair in indices]
indices_flat = []
for elem in tmp:
    for item in elem:
        indices_flat.append(item)
print(indices_flat)
# substitute with your array
a = np.array([[1, 2], [3, 4], [5, 6]])
# the main insertion
a_inserted = np.insert(a, indices_flat, [0, 0], axis=0)
print(a_inserted)
prints:
[0, 0, 2, 2, 2]
[[0 0]
[0 0]
[1 2]
[3 4]
[0 0]
[0 0]
[0 0]
[5 6]]
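The flattening step can probably be shortened with np.repeat, which repeats each index by its count. This is a sketch under the same assumption as above, namely that each pair is [row index in the original array, number of zero rows to insert before it]; the name pairs is only illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
pairs = np.array([[0, 2], [2, 3]])  # hypothetical [index, count] pairs

# Repeat each insertion index as many times as rows should be inserted there.
flat = np.repeat(pairs[:, 0], pairs[:, 1])
print(flat)  # [0 0 2 2 2]

# Insert one zero row per entry of flat; indices refer to the original array.
a_inserted = np.insert(a, flat, 0, axis=0)
print(a_inserted)
```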
Here's a NumPy based approach
def insert_n_zeros_at(a, i):
    # which zeros have more than 1 row inserted?
    m = i[:,1]>1
    # empty nan array, filled on cols >1 with replicated values
    ix = np.full((i.shape[0], i[:,1].max()),np.nan)
    ix[:,:1] = i[:,:1]
    ix.ravel()[np.stack((m,m)).ravel('F')] = np.repeat(i[m,0], i[m,1])
    # columns' values cumulative sum (they are the real indices)
    ix += np.arange(ix.shape[1])
    # accounts for the index increase when prior rows are added
    cs = i[:,1].cumsum()
    ix[1:] += cs[:-1,None]
    # flattens to 1d of actual indices
    ix = ix[~np.isnan(ix)]
    # amount of zeros to insert. Used to define out
    out = np.zeros((a.shape[0]+cs[-1], a.shape[1]))
    r = np.arange(out.shape[0])
    # assign a where we don't have indices of 0s
    out[r[~np.isin(r, ix)]] = a
    return out
For the shared example, we get:
i = np.array([[ 2, 1], [ 4, 1], [ 6, 1], [ 8, 2], [ 9, 2], [10, 1]])
insert_n_zeros_at(a, i)
array([[ 40.5, 23.4], # 0
[175.9, 20.2], # 1
[ 0. , 0. ], # <- 1 zero at 2
[ 21.4, 24. ], # 2
[130.3, 18.4], # 3
[ 0. , 0. ], # <- 1 zero at 4
[ 6.3, 25.7], # 4
[ 73.4, 21.5], # 5
[ 0. , 0. ], # <- 1 zero at 6
[ 16.6, 25.7], # 6
[125.9, 19.1], # 7
[ 0. , 0. ], # <- 2 zeros at 8
[ 0. , 0. ],
[ 41.4, 24.7], # 8
[ 0. , 0. ], # <- 2 zeros at 9
[ 0. , 0. ],
[180.6, 16.4], # 9
[ 0. , 0. ], # <- 1 zero at 10
[ 13.6, 24.4], # 10
[103.2, 19. ], # 11
[ 3.2, 24.7], # 12
[ 55.9, 23.1], # 13
[208.8, 20.4]])

Is there a short way in networkx(Python) to calculate the reachability matrix?

Suppose I am given a directed graph and I want a NumPy reachability matrix R, where R(i,j)=1 if and only if there is a path from i to j.
networkx has the function has_path(G, source, target), but it only works for one specific pair of source and target nodes, so this is what I have been doing so far:
import networkx as nx
import numpy as np
R=np.zeros((d,d))
for i in range(d):
    for j in range(d):
        if nx.has_path(G, i, j):
            R[i,j]=1
Is there a nicer way to achieve this?
Here is a minimal example with real numbers:
import networkx as nx
import numpy as np
c=np.random.rand(4,4)
G=nx.DiGraph(c)
A=nx.minimum_spanning_arborescence(G)
adj=nx.to_numpy_matrix(A)
This gives the adjacency matrix, not the reachability matrix. With my example numbers I get
adj=
matrix([[0. , 0. , 0. , 0. ],
[0. , 0. , 0.47971056, 0. ],
[0. , 0. , 0. , 0. ],
[0.16101491, 0.04779295, 0. , 0. ]])
So there is a path from 4 to 2 (adj(4,2)>0) and from 2 to 3 (adj(2,3)>0), which means there is also a path from 4 to 3, yet adj(4,3)=0.
You could use all_pairs_shortest_path_length:
import networkx as nx
import numpy as np
np.random.seed(42)
c = np.random.rand(4, 4)
G = nx.DiGraph(c)
length = dict(nx.all_pairs_shortest_path_length(G))
# R[i, j] = 1 if a path of positive length leads from node i to node j
R = np.array([[length.get(m, {}).get(n, 0) > 0 for n in G.nodes] for m in G.nodes], dtype=np.int32)
print(R)
Output
[[0 1 1 1]
 [1 0 1 1]
 [1 1 0 1]
 [1 1 1 0]]
One approach could be to find all descendants of each node, and set the corresponding rows that are reachable to 1:
a = np.zeros((len(A.nodes()),)*2)
for node in A.nodes():
    s = list(nx.descendants(A, node))
    a[s, node] = 1
print(a)
array([[0., 0., 1., 0.],
[1., 0., 1., 0.],
[0., 0., 0., 0.],
[1., 1., 1., 0.]])
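If the adjacency matrix is already available as an array, the reachability matrix can also be sketched without any graph traversal. The snippet below swaps in a different technique from the answers above: a Warshall-style transitive closure in plain NumPy. It follows the question's convention that R[i, j] = 1 when j can be reached from i, and it does not count a node as reaching itself unless a cycle exists. The function name is only illustrative.

```python
import numpy as np

def reachability_from_adjacency(adj):
    """Transitive closure of a directed graph given by its adjacency matrix."""
    reach = np.asarray(adj) != 0  # True wherever a direct edge exists
    for k in range(reach.shape[0]):
        # i reaches j if it already did, or if i reaches k and k reaches j.
        reach |= reach[:, [k]] & reach[[k], :]
    return reach.astype(int)

# Adjacency matrix from the question (0-based: edges 1->2, 3->0, 3->1).
adj = np.array([[0., 0., 0., 0.],
                [0., 0., 0.47971056, 0.],
                [0., 0., 0., 0.],
                [0.16101491, 0.04779295, 0., 0.]])
R = reachability_from_adjacency(adj)
print(R)
# [[0 0 0 0]
#  [0 0 1 0]
#  [0 0 0 0]
#  [1 1 1 0]]
```

The extra 1 at R[3, 2] is the indirect path the question describes (4 to 3 in its 1-based numbering).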

Scikit-image and central moments: what is the meaning?

Looking for examples of how to use image processing tools to "describe" images and shapes of any sort, I have stumbled upon the Scikit-image skimage.measure.moments_central(image, cr, cc, order=3) function.
They give an example of how to use this function:
from skimage import measure #Package name in Enthought Canopy
import numpy as np
image = np.zeros((20, 20), dtype=np.double) #Square image of zeros
image[13:17, 13:17] = 1 #Adding a square of 1s
m = measure.moments(image)
cr = m[0, 1] / m[0, 0] #Row of the centroid (x coordinate)
cc = m[1, 0] / m[0, 0] #Column of the centroid (y coordinate)
In[1]: measure.moments_central(image, cr, cc)
Out[1]:
array([[ 16., 0., 20., 0.],
[ 0., 0., 0., 0.],
[ 20., 0., 25., 0.],
[ 0., 0., 0., 0.]])
1) What does each of the values represent? Since the (0,0) element is 16, I gather that it corresponds to the area of the square of 1s, and is therefore mu_00. But what about the others?
2) Is this always a symmetric matrix?
3) What are the values associated with the famous second central moments?
The array returned by measure.moments_central corresponds to the formula at https://en.wikipedia.org/wiki/Image_moment (section "Central moments"). mu_00 does indeed correspond to the area of the object.
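To make the formula concrete, here is a minimal NumPy sketch of mu_pq = sum over pixels of (r - r_mean)^p * (c - c_mean)^q * I(r, c). It is not scikit-image's implementation, and the index convention is an assumption: in this sketch the first index p is the power of the row offset and the second index q is the power of the column offset.

```python
import numpy as np

def central_moments(image, order=3):
    """Central moments mu[p, q] of a 2-D image, up to the given order."""
    rows, cols = np.indices(image.shape)
    m00 = image.sum()
    # Offsets of every pixel from the intensity-weighted centroid.
    dr = rows - (rows * image).sum() / m00
    dc = cols - (cols * image).sum() / m00
    mu = np.zeros((order + 1, order + 1))
    for p in range(order + 1):
        for q in range(order + 1):
            mu[p, q] = (dr**p * dc**q * image).sum()
    return mu

square = np.zeros((20, 20))
square[13:17, 13:17] = 1
mu = central_moments(square)
print(mu[0, 0], mu[2, 0], mu[0, 2], mu[2, 2])  # 16.0 20.0 20.0 25.0
```

For the 4x4 square this reproduces the 16, 20 and 25 from the question. The first-order entries mu[1, 0] and mu[0, 1] vanish because the offsets are measured from the centroid.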
The inertia matrix is not always symmetric, as shown by this example where the object is a rectangle instead of a square.
>>> image = np.zeros((20, 20), dtype=np.double) #Square image of zeros
>>> image[14:16, 13:17] = 1
>>> m = measure.moments(image)
>>> cr = m[0, 1] / m[0, 0]
>>> cc = m[1, 0] / m[0, 0]
>>> measure.moments_central(image, cr, cc)
array([[ 8. , 0. , 2. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 10. , 0. , 2.5, 0. ],
[ 0. , 0. , 0. , 0. ]])
As for the second-order moments, they are mu_02, mu_11, and mu_20 (the coefficients on the anti-diagonal i + j = 2). The same Wikipedia page https://en.wikipedia.org/wiki/Image_moment explains how to use second-order moments to compute the orientation of objects.
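As an illustration of that use, the sketch below applies the orientation formula from the Wikipedia page, theta = 0.5 * arctan(2 * mu_11 / (mu_20 - mu_02)), with arctan2 so that the case mu_20 = mu_02 does not divide by zero. The choice of x as the column axis and y as the row axis is an assumption of this sketch, and the function name is illustrative.

```python
import numpy as np

def orientation(image):
    """Angle (radians) of the major axis relative to the column (x) axis."""
    rows, cols = np.indices(image.shape)
    m00 = image.sum()
    x = cols - (cols * image).sum() / m00  # column offsets from the centroid
    y = rows - (rows * image).sum() / m00  # row offsets from the centroid
    mu20 = (x**2 * image).sum()
    mu02 = (y**2 * image).sum()
    mu11 = (x * y * image).sum()
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

wide = np.zeros((20, 20))
wide[14:16, 13:17] = 1          # 2 rows by 4 columns, as in the answer above
tall = wide.T                   # same rectangle rotated by 90 degrees

theta_wide = orientation(wide)
theta_tall = orientation(tall)
print(theta_wide, theta_tall)
```

The wide rectangle comes out aligned with the x axis (angle of magnitude zero), and the tall one at a quarter turn from it.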

How to generate one hot encoding for DNA sequences?

I would like to generate a one-hot encoding for a set of DNA sequences. For example, the sequence ACGTCCA can be represented as shown below, in transposed form. The code below generates the encoding horizontally, whereas I would prefer the vertical form. Can anyone help me?
ACGTCCA
1000001 - A
0100110 - C
0010000 - G
0001000 - T
Example code:
from sklearn.preprocessing import OneHotEncoder
import itertools
# two example sequences
seqs = ["ACGTCCA","CGGATTG"]
# split sequences to tokens
tokens_seqs = [seq.split("\\") for seq in seqs]
# convert list of of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {A: 1, B: 2} here
all_tokens = itertools.chain.from_iterable(tokens_seqs)
word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}
# convert token lists to token-id lists, e.g. [[1, 2], [2, 2]] here
token_ids = [[word_to_id[token] for token in tokens_seq] for tokens_seq in tokens_seqs]
# convert list of token-id lists to one-hot representation
vec = OneHotEncoder(n_values=len(word_to_id))
X = vec.fit_transform(token_ids)
print X.toarray()
However, the code gives me output:
[[ 0. 1.]
[ 1. 0.]]
Expected output:
[[1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.]]
import numpy as np
def one_hot_encode(seq):
    mapping = dict(zip("ACGT", range(4)))
    seq2 = [mapping[i] for i in seq]
    return np.eye(4)[seq2]
one_hot_encode("AACGT")
## Output:
array([[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.]])
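To get the vertical layout shown in the question (one row per base, one column per position), the same lookup can simply be transposed. This is a small sketch building on the function above; the name one_hot_vertical and the alphabet parameter are illustrative.

```python
import numpy as np

def one_hot_vertical(seq, alphabet="ACGT"):
    """One row per base in `alphabet`, one column per position in `seq`."""
    index = {base: k for k, base in enumerate(alphabet)}
    codes = [index[base] for base in seq]
    # Row k of the identity matrix is the one-hot vector for base k;
    # transposing puts bases on rows and positions on columns.
    return np.eye(len(alphabet), dtype=int)[codes].T

encoded = one_hot_vertical("ACGTCCA")
print(encoded)
# [[1 0 0 0 0 0 1]
#  [0 1 0 0 1 1 0]
#  [0 0 1 0 0 0 0]
#  [0 0 0 1 0 0 0]]
```

Calling encoded.ravel() concatenates the four rows, which gives the flat 28-element vector listed as the expected output for ACGTCCA.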
I suggest doing it a slightly more manual way:
import numpy as np
seqs = ["ACGTCCA","CGGATTG"]
CHARS = 'ACGT'
CHARS_COUNT = len(CHARS)
maxlen = max(map(len, seqs))
res = np.zeros((len(seqs), CHARS_COUNT * maxlen), dtype=np.uint8)
for si, seq in enumerate(seqs):
    seqlen = len(seq)
    # one character per element, so it can be compared against each base
    arr = np.array(list(seq))
    for ii, char in enumerate(CHARS):
        res[si][ii*seqlen:(ii+1)*seqlen][arr == char] = 1
print(res)
This gives you your desired result:
[[1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0]]
from keras.utils import to_categorical
def one_hot_encoding(seq):
    mp = dict(zip('ACGT', range(4)))
    seq_2_number = [mp[nucleotide] for nucleotide in seq]
    return to_categorical(seq_2_number, num_classes=4, dtype='int32').flatten()
