How to generate one hot encoding for DNA sequences? - python

I would like to generate a one-hot encoding for a set of DNA sequences. For example, the sequence ACGTCCA can be represented as below, in a transposed manner. But the code below generates the one-hot encoding in a horizontal way, whereas I would prefer the vertical form. Can anyone help me?
ACGTCCA
1000001 - A
0100110 - C
0010000 - G
0001000 - T
Example code:
from sklearn.preprocessing import OneHotEncoder
import itertools
# two example sequences
seqs = ["ACGTCCA","CGGATTG"]
# split sequences to tokens
tokens_seqs = [seq.split("\\") for seq in seqs]
# convert list of of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {A: 1, B: 2} here
all_tokens = itertools.chain.from_iterable(tokens_seqs)
word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}
# convert token lists to token-id lists, e.g. [[1, 2], [2, 2]] here
token_ids = [[word_to_id[token] for token in tokens_seq] for tokens_seq in tokens_seqs]
# convert list of token-id lists to one-hot representation
vec = OneHotEncoder(n_values=len(word_to_id))
X = vec.fit_transform(token_ids)
print(X.toarray())
However, the code gives me output:
[[ 0. 1.]
[ 1. 0.]]
Expected output:
[[1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.]]

import numpy as np

def one_hot_encode(seq):
    mapping = dict(zip("ACGT", range(4)))
    seq2 = [mapping[i] for i in seq]
    return np.eye(4)[seq2]

one_hot_encode("AACGT")
## Output:
array([[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.]])
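
This result has one row per position and one column per base. If you prefer the 4 x L "vertical" layout sketched in the question, just transpose it:
print(one_hot_encode("ACGTCCA").T)
## Output:
[[1. 0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]]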

I suggest doing it a slightly more manual way:
import numpy as np

seqs = ["ACGTCCA", "CGGATTG"]

CHARS = 'ACGT'
CHARS_COUNT = len(CHARS)

maxlen = max(map(len, seqs))
res = np.zeros((len(seqs), CHARS_COUNT * maxlen), dtype=np.uint8)

for si, seq in enumerate(seqs):
    seqlen = len(seq)
    # one character per element (np.chararray(..., buffer=seq) only works on Python 2 strings)
    arr = np.array(list(seq))
    for ii, char in enumerate(CHARS):
        res[si][ii*seqlen:(ii+1)*seqlen][arr == char] = 1

print(res)
This gives you your desired result:
[[1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0]]
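If you would rather have each sequence as its own 4 x maxlen matrix (the vertical layout from the question) instead of one flat row, you can simply reshape the result:
print(res.reshape(len(seqs), CHARS_COUNT, maxlen))
[[[1 0 0 0 0 0 1]
  [0 1 0 0 1 1 0]
  [0 0 1 0 0 0 0]
  [0 0 0 1 0 0 0]]

 [[0 0 0 1 0 0 0]
  [1 0 0 0 0 0 0]
  [0 1 1 0 0 0 1]
  [0 0 0 0 1 1 0]]]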

from keras.utils import to_categorical
def one_hot_encoding(seq):
    mp = dict(zip('ACGT', range(4)))
    seq_2_number = [mp[nucleotide] for nucleotide in seq]
    return to_categorical(seq_2_number, num_classes=4, dtype='int32').flatten()
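A quick usage note: the flattened vector from this function is position-major, i.e. one 4-element block per nucleotide. If you want the base-major layout from the expected output above (all A positions, then all C positions, and so on), transpose before flattening; one_hot_encoding_base_major below is just an illustrative sketch of that variant:
print(one_hot_encoding("ACGT"))
# [1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1]

def one_hot_encoding_base_major(seq):
    mp = dict(zip('ACGT', range(4)))
    seq_2_number = [mp[nucleotide] for nucleotide in seq]
    # rows become A, C, G, T; flattening then concatenates the four rows
    return to_categorical(seq_2_number, num_classes=4, dtype='int32').T.flatten()

print(one_hot_encoding_base_major("ACGTCCA"))
# [1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]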


Calculating the sum of array-based terms in SymPy [ValueError: Invalid limits given]

I have been trying to obtain numerical outputs from the sum below. It should normally produce numerical values, but I receive an error related to the limits.
Mathematically, the sum has a lower bound of i = 0 and an upper bound of i = k-1. I took (i, 0, k) because I thought the last term is excluded by SymPy's Sum. Besides, I need to get the result within this nested for loop. Could there be a mismatch between the for loops and the sum? Even so, I cannot change the for loops for k and Nt; here, k depends on Nt.
The code:
import numpy as np
from sympy import *
from sympy import Sum
Nx = 31
Nt = 17
tau = .85 / Nt
u = np.ones((Nt, Nx)) * np.sin(np.pi)
Sigma = np.zeros((Nt, Nx))
for k in range(1, Nt):
    for i in range(1, k):
        for j in range(1, Nx-1):
            # define sum
            Sigma[i, j] = Sum( ((u[i+1][j] - u[i][j]) / tau * 97.1), (i, 0, k)).doit()
            print(Sigma[i, j])
Error:
---> 16 Sigma[i, j] = Sum( ((u[i+1][j] - u[i][j]) / tau * 97.1), (i, 0, k) ).doit()
ValueError: Invalid limits given: ((1, 0, 2),)
Also, I am confused about sum, Sum, and summation. None of these gave me a numerical result, or most likely I am not using these methods correctly. How can I get numerical outputs? I've also tried np.sum() below.
for k in range(1, Nt):
    for i in range(1, k):
        for j in range(1, Nx-1):
            # define sum
            Sigma[i, j] = np.sum( ((u[i+1][j] - u[i][j]) / tau * 97.1), 0, k-1)
            print(Sigma[i, j])
Output:
0.0
0.0
0.0...
I think I cannot properly write the limits of the sum in np.sum(). How can I correct this? How can I avoid 0 result?
EDIT:
I used sum():
for k in range(1, Nt):
    for i in range(1, k):
        for j in range(1, Nx-1):
            # define sum
            Sigma[i, j] = sum(u[i+1][j] - u[i][j])
            print(Sigma[i, j])
Error:
TypeError: 'numpy.float64' object is not iterable
Thank you for any help!
Is this what you are looking for?
for k in range(1, Nt):
    for i in range(1, k):
        for j in range(1, Nx-1):
            # define sum
            Sigma[i, j] = sum(u[_+1][j] - u[_][j] for _ in range(0, k))
            print(Sigma[i, j])
Although you could write Sum(u[_+1][j] - u[_][j], (_, 0, k)).doit(), the built-in sum is really what you are trying to do: an elementwise summation of literal values, not a summation of symbolic terms like Sum(1/x, (x, 0, 5)). There does not need to be an x array for SymPy to figure out that sum, since the limits indicate what the values of x are going to be. In your case you already have the values in an array.
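To make the distinction concrete, here is a minimal sketch contrasting the two (the 1/x sum is only an illustration, not taken from the question):
from sympy import Sum, symbols

x = symbols('x')
# symbolic summation: SymPy substitutes x = 1, 2, ..., 5 itself
print(Sum(1/x, (x, 1, 5)).doit())   # 137/60

# plain Python summation over values that already exist in a list/array
vals = [1, 2, 3, 4, 5]
print(sum(1/v for v in vals))       # 2.2833...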
Reread the sympy docs. I think they specify that Sum should be used as:
Sum(expr, (var, a, b))
(I probably shouldn't try to work from memory here, but you can check.)
In your:
Sum( ((u[i+1][j] - u[i][j]) / tau * 97.1), (i, 0, k))
((u[i+1][j] - u[i][j]) / tau * 97.1) is a number, derived from the numpy array u; it isn't a sympy expression.
And i is a number, not a sympy symbol. The error tells us that: "Invalid limits given: ((1, 0, 2),)".
For someone who is new to Python, trying to use sympy will be difficult.
The problem with
np.sum( ((u[i+1][j] - u[i][j]) / tau * 97.1), 0, k-1)
is that np.sum does not take limits like the sympy Sum. Don't assume the documentation for one function applies to a similarly named one in another package. np.sum, if you read its docs, takes an array, with optional parameters like axis.
As for your last attempt:
sum(u[i+1][j] - u[i][j])
the python sum takes an "iterable", something like a list. u[i+1][j] - u[i][j] is a single number.
u is a 2d numpy array. u[i] is a 1d array, a "row"; u[i,j] is a single element of the array.
What exactly are you trying to sum?
I suspect you have some sort of mathematical summation in mind, and have tried to express that with sympy algebra. But your u is a 2d numpy array. So u[i+1,j]-u[i,j] is a single number, the difference between two elements. u[1:,j]-u[:-1,j] takes the difference between all such pairs of rows.
I haven't tried to figure out what your nested loops are doing, especially since i is a subset of possible rows.
edit
Let's simplify your example a bit - smaller dimensions, and removing the constants that don't change the behavior:
In [5]: Nx = 4
   ...: Nt = 3
   ...: u = np.ones((Nt, Nx))
   ...: Sigma = np.zeros((Nt, Nx))
   ...:
   ...: for k in range(1, Nt):
   ...:     print('k',k)
   ...:     for i in range(1, k):
   ...:         print('i',i)
   ...:         for j in range(1, Nx-1):
   ...:             print('j',j)
   ...:             Sigma[i,j] = u[i+1,j] - u[i,j]
   ...:     print(Sigma)
...:
k 1
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
k 2
i 1
j 1
j 2
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
When k is 1, there's no i iteration since range(1,1) is empty. So Sigma is still the original 0s.
For k of 2, i ranges over range(1,2), i.e. it runs once, and j iterates over range(1,3), i.e. 1 and 2. But Sigma is still 0: u is all ones, so the paired differences are 0. #smichr already pointed this out (I missed it on earlier reads).
In [3]: u
Out[3]:
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
In [4]: u[1:]-u[:-1]
Out[4]:
array([[0., 0., 0., 0.],
[0., 0., 0., 0.]])
I'm not sure it's worth pursuing this further. You need a realistic example where the u differences matter. But keep it small (like this (4,3) case), so you can actually specify what values you seek.
If I define a random u:
In [13]: u
Out[13]:
array([[14, 1, 1, 11],
[ 2, 4, 17, 4],
[11, 2, 6, 19]])
In [14]: u[1:]-u[:-1]
Out[14]:
array([[-12, 3, 16, -7],
[ 9, -2, -11, 15]])
For k 1 sigma is still 0, but for k 2:
k 2
i 1
j 1
j 2
[[ 0. 0. 0. 0.]
[ 0. -2. -11. 0.]
[ 0. 0. 0. 0.]]
The code set the Sigma[1,1] and Sigma[1,2] values from the difference array.
Here's a sample run for a case with more rows:
In [16]: Nt,Nx = 5,4
In [17]: u = np.random.randint(0,20,(Nt,Nx))
In [18]: u
Out[18]:
array([[18, 15, 17, 3],
[ 9, 2, 5, 16],
[11, 19, 5, 2],
[13, 0, 4, 5],
[ 8, 10, 10, 0]])
In [19]: u[1:]-u[:-1]
Out[19]:
array([[ -9, -13, -12, 13],
[ 2, 17, 0, -14],
[ 2, -19, -1, 3],
[ -5, 10, 6, -5]])
In [20]: Sigma = np.zeros((Nt, Nx))
    ...: for k in range(1, Nt):
    ...:     print('k',k)
    ...:     for i in range(1, k):
    ...:         print('i',i)
    ...:         for j in range(1, Nx-1):
    ...:             print('j',j)
    ...:             Sigma[i,j] = u[i+1,j] - u[i,j]
    ...:     print(Sigma)
...:
k 1
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
k 2
i 1
j 1
j 2
[[ 0. 0. 0. 0.]
[ 0. 17. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
k 3
i 1
j 1
j 2
i 2
j 1
j 2
[[ 0. 0. 0. 0.]
[ 0. 17. 0. 0.]
[ 0. -19. -1. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
k 4
i 1
j 1
j 2
i 2
j 1
j 2
i 3
j 1
j 2
[[ 0. 0. 0. 0.]
[ 0. 17. 0. 0.]
[ 0. -19. -1. 0.]
[ 0. 10. 6. 0.]
[ 0. 0. 0. 0.]]
Let's try a case where successive k values are added to the original, rather than simply overwriting.
In [21]: Sigma = np.zeros((Nt, Nx))
    ...: for k in range(1, Nt):
    ...:     print('k',k)
    ...:     for i in range(1, k):
    ...:         print('i',i)
    ...:         for j in range(1, Nx-1):
    ...:             print('j',j)
    ...:             Sigma[i,j] += u[i+1,j] - u[i,j]  # <== change here
    ...:     print(Sigma)
...:
k 1
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
k 2
i 1
j 1
j 2
[[ 0. 0. 0. 0.]
[ 0. 17. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
k 3
i 1
j 1
j 2
i 2
j 1
j 2
[[ 0. 0. 0. 0.]
[ 0. 34. 0. 0.]
[ 0. -19. -1. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
k 4
i 1
j 1
j 2
i 2
j 1
j 2
i 3
j 1
j 2
[[ 0. 0. 0. 0.]
[ 0. 51. 0. 0.] # 3*17
[ 0. -38. -2. 0.] # 2* (-19 -1)
[ 0. 10. 6. 0.]
[ 0. 0. 0. 0.]]
Not knowing what you are aiming at, I can't say whether that makes any more sense.
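If that accumulation is in fact the goal, here is a minimal vectorized sketch of what the += loop computes, assuming the u, Nt, Nx (and numpy as np) from the run above: row i of the pairwise difference array u[1:] - u[:-1] gets added once for every k in range(i+1, Nt), i.e. Nt-1-i times.
d = u[1:] - u[:-1]                        # all row-pair differences
counts = Nt - 1 - np.arange(1, Nt - 1)    # how many times row i = 1..Nt-2 is accumulated
Sigma = np.zeros((Nt, Nx))
Sigma[1:Nt-1, 1:Nx-1] = counts[:, None] * d[1:Nt-1, 1:Nx-1]
# reproduces the final matrix above: 51 = 3*17, -38 = 2*(-19), -2 = 2*(-1), etc.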

Convert binomial tree to paths

I have a binomial tree stored as an upper triangular matrix:
array([[400., 500., 625.],
[ 0., 320., 400.],
[ 0., 0., 256.]])
and I am trying to convert it to a matrix with all possible paths, like:
array([[400., 500., 625.],
[400., 500., 400.],
[400., 320., 400.],
[400., 320., 256.]])
I've written a snippet that does the job when there are only 2 steps:
import numpy as np

def unstack_tree(tree):
    output_map = []
    for i in range(tree.shape[0] - 1):
        for j in range(tree.shape[1] - 1):
            output_map.append([tree[0,0], tree[i, 1], tree[i+j, 2]])
    return np.array(output_map)
But I am struggling with how to generalize it to N steps to handle, say, a 3-step tree:
array([[400. , 500. , 625. , 781.25],
[ 0. , 320. , 400. , 500. ],
[ 0. , 0. , 256. , 320. ],
[ 0. , 0. , 0. , 204.8 ]])
I think I need more loops but cannot formulate it.
Each path can be represented by a binary code: the first path is (0, 0), the second (0, 1), the third (1, 0), and so on. The actual row indices into the array are then given by the cumulative sum of that binary representation.
import numpy as np
from itertools import product
n = 2
b = np.array([[400., 500., 625.],
[ 0., 320., 400.],
[ 0., 0., 256.]])
a = np.array(list(product((0, 1), repeat=n)))
a = np.c_[[0] * 2 ** n, a]
print(a)
# [[0 0 0]
# [0 0 1]
# [0 1 0]
# [0 1 1]]
a = a.cumsum(axis=1)
print(a)
# [[0 0 0]
# [0 0 1]
# [0 1 1]
# [0 1 2]]
print(np.choose(a, b))
# [[400. 500. 625.]
# [400. 500. 400.]
# [400. 320. 400.]
# [400. 320. 256.]]
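The same idea generalizes directly: set n to the number of steps and pass the larger tree. A sketch for the 3-step tree from the question, reusing the imports above (indexing b directly instead of np.choose, which is limited to 32 choices, gives the same result):
n = 3
b = np.array([[400. , 500. , 625. , 781.25],
              [  0. , 320. , 400. , 500.  ],
              [  0. ,   0. , 256. , 320.  ],
              [  0. ,   0. ,   0. , 204.8 ]])
a = np.array(list(product((0, 1), repeat=n)))
a = np.c_[[0] * 2 ** n, a].cumsum(axis=1)   # row index of each node along each path
print(b[a, np.arange(n + 1)])               # equivalent to np.choose(a, b)
# [[400.   500.   625.   781.25]
#  [400.   500.   625.   500.  ]
#  [400.   500.   400.   500.  ]
#  [400.   500.   400.   320.  ]
#  [400.   320.   400.   500.  ]
#  [400.   320.   400.   320.  ]
#  [400.   320.   256.   320.  ]
#  [400.   320.   256.   204.8 ]]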

Assigning values to a numpy array to zeros array

I've been trying to assign values from one array to another, specifically from an array with values to an array of zeros. The position of these values in the zeros array is also essential. This is a small piece of a bigger code; the bigger picture is to be able to import values from an Excel spreadsheet into a zeros matrix. This is my problem:
import numpy as np
x = np.zeros((2,3))
P= np.asarray ([1,2,3,4,5,6])
for i in range(0,2):
    for j in range(0,3):
        x[i,j] = P[(i-1)*3+j] # 3 is the counter in x direction, nx
x
With this code, the output is (which is what I want):
array([[4., 5., 6.],
[1., 2., 3.]])
However if I try to the expand the array, as such:
import numpy as np
x = np.zeros((3,3))
P= np.asarray ([1,2,3,4,5,6,7,8,9])
for i in range(0,3):
    for j in range(0,3):
        x[i,j] = P[(i-1)*3+j] # 3 is the counter in x direction, nx
x
The output is:
array([[7., 8., 9.],
[1., 2., 3.],
[4., 5., 6.]])
I expect the output to be:
array([[7., 8., 9.],
[4., 5., 6.],
[1., 2., 3.]])
Is there a reason why the output changes when the array is expanded?
You don't need to iterate:
In [323]: P=np.arange(1,10).reshape(3,3)[::-1,:]
In [324]: P
Out[324]:
array([[7, 8, 9],
[4, 5, 6],
[1, 2, 3]])
As for your loop, look at the i, j pairs and the index (i-1)*3+j they produce; for i=0 the index is negative, and negative indices count from the end of P:
In [325]: for i in range(3):
     ...:     for j in range(3):
     ...:         print(i,j,(i-1)*3+j)
     ...:
0 0 -3
0 1 -2
0 2 -1
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
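If you do want to keep an explicit loop, here is a sketch of the corrected index (for hypothetical m rows and n columns), which reverses the rows the way your 2x3 case happened to:
import numpy as np

m, n = 3, 3
P = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9])
x = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        x[i, j] = P[(m - 1 - i) * n + j]   # last chunk of P fills the first row
print(x)
# [[7. 8. 9.]
#  [4. 5. 6.]
#  [1. 2. 3.]]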
You don't need a loop, just use flip() with reshape().
import numpy as np
m = 3 # number of rows you want
n = 3 # number of column you want
P = np.asarray ([1,2,3,4,5,6,7,8,9])
P = np.flip(P.reshape(m,n), axis=0)
print(P)
[[7 8 9]
[4 5 6]
[1 2 3]]
If you want to assign it to a zero matrix, you can just iterate through the indices.
For example, let's say you have a much bigger zero matrix, you want to fill row x, y, z with the current matrix generated.
zero = np.zeros((10, 3))
print(zero.shape)
zero[[2, 5, 7], : ] = P # randomly assigning P to index 2, 5, 7th row of zero matrix
print(zero)
(10, 3)
[[0. 0. 0.]
[0. 0. 0.]
[7. 8. 9.]
[0. 0. 0.]
[0. 0. 0.]
[4. 5. 6.]
[0. 0. 0.]
[1. 2. 3.]
[0. 0. 0.]
[0. 0. 0.]]
You can also loop through:
for i in range(3):
    zero[i,:] = P[i,:]

Is there a short way in networkx(Python) to calculate the reachability matrix?

Imagine I have a directed graph and I want a numpy reachability matrix indicating whether a path exists, so R(i,j) = 1 if and only if there is a path from i to j.
networkx has the function has_path(G, source, target), but it only works for a specific source and target node. Therefore, I've so far been doing this:
import networkx as nx
import numpy as np

R = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        if nx.has_path(G, i, j):
            R[i, j] = 1
Is there a nicer way to achieve this?
Here is a minimal example with actual numbers:
import networkx as nx
import numpy as np
c=np.random.rand(4,4)
G=nx.DiGraph(c)
A=nx.minimum_spanning_arborescence(G)
adj=nx.to_numpy_matrix(A)
Here we can see that this is the adjacency matrix but not the reachability matrix; with my example numbers I get
adj=
matrix([[0. , 0. , 0. , 0. ],
[0. , 0. , 0.47971056, 0. ],
[0. , 0. , 0. , 0. ],
[0.16101491, 0.04779295, 0. , 0. ]])
So there is a path from 4 to 2 (adj(4,2)>0) and from 2 to 3 (adj(2,3)>0), so there should also be a path from 4 to 3, but adj(4,3)=0.
You could use all_pairs_shortest_path_length:
import networkx as nx
import numpy as np
np.random.seed(42)
c = np.random.rand(4, 4)
G = nx.DiGraph(c)
length = dict(nx.all_pairs_shortest_path_length(G))
R = np.array([[length.get(m, {}).get(n, 0) > 0 for m in G.nodes] for n in G.nodes], dtype=np.int32)
print(R)
Output
[[1 1 1 1 1]
[0 1 1 1 1]
[0 0 1 1 1]
[0 0 0 1 1]
[0 0 0 0 1]]
One approach could be to find all descendants of each node, and set the corresponding rows that are reachable to 1:
a = np.zeros((len(A.nodes()),)*2)
for node in A.nodes():
    s = list(nx.descendants(A, node))
    a[s, node] = 1
print(a)
array([[0., 0., 1., 0.],
[1., 0., 1., 0.],
[0., 0., 0., 0.],
[1., 1., 1., 0.]])
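Another option worth mentioning, assuming a reasonably recent networkx release that provides nx.transitive_closure: the transitive closure of a digraph has an edge (i, j) exactly when j is reachable from i, so its adjacency matrix is the reachability matrix (the diagonal stays 0 unless you pass reflexive=True). A sketch using the arborescence A from the question, whose edges are 3→0, 3→1, 1→2 in 0-based labels:
TC = nx.transitive_closure(A)
# weight=None counts every edge as 1, so original edge weights don't leak into R
R = nx.to_numpy_array(TC, nodelist=sorted(A.nodes()), weight=None, dtype=int)
print(R)
# with those edges this should give:
# [[0 0 0 0]
#  [0 0 1 0]
#  [0 0 0 0]
#  [1 1 1 0]]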

Insert zero rows and columns at the same time at specific indices instead of at the end

I have a 2D array (a confusion matrix), for example of shape (3,3). The numbers in the array refer to indices into a set of labels.
I know that this array should actually be (5,5) instead of (3,3), with 5 row and column labels. I can find the labels that have been "hit":
import numpy as np
x = np.array([[3, 0, 3],
[0, 2, 0],
[2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)  # array([1, 4])
I know that the rows and columns for the missing indices are all zeros, so the output I want is this:
y = np.array([[3, 0, 0, 3, 0],
[0, 0, 0, 0, 0], # <- Inserted row at index 1 all zeros
[0, 0, 2, 0, 0],
[2, 0, 3, 3, 0],
[0, 0, 0, 0, 0]]) # <- Inserted row at index 4 all zeros
# ^ ^
# | |
# Inserted columns at index 1 and 4 all zeros
I can do that with multiple calls to np.insert in a loop over all missing indices:
def insert_rows_columns_at_slow(arr, indices):
    result = arr.copy()
    for idx in indices:
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)
    return result
However, my real array is much bigger, and there may be many more missing indices. Since np.insert re-allocates every time, this is not very efficient.
How can I achieve the same result, but in a more efficient, vectorized way? Bonus points if it works in more than 2 dimensions.
Just another option:
Instead of using the missing indices, use the non-missing ones:
non_missing_idxs = np.intersect1d(np.arange(len(labels)), x)  # array([0, 2, 3])
y = np.zeros((5,5))
y[non_missing_idxs[:,None], non_missing_idxs] = x
output:
array([[3., 0., 0., 3., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 2., 0., 0.],
[2., 0., 3., 3., 0.],
[0., 0., 0., 0., 0.]])
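The index pair non_missing_idxs[:,None], non_missing_idxs broadcasts to the outer product of row and column indices, so only the intersection of the kept rows and columns is written. The same selection can be written with np.ix_ (which the answer below also uses):
y[np.ix_(non_missing_idxs, non_missing_idxs)] = x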
You can do this by pre-allocating the full result array and filling its rows and columns from the old array. This even works in multiple dimensions, and the sizes of the dimensions don't have to match:
def insert_at(arr, output_size, indices):
    """
    Insert zeros at specific indices over whole dimensions, e.g. rows and/or columns and/or channels.
    You need to specify indices for each dimension, or leave a dimension untouched by specifying
    `...` for it. The following assertion should hold:
        `assert len(output_size) == len(indices) == len(arr.shape)`

    :param arr: The array to insert zeros into
    :param output_size: The size of the array after insertion is completed
    :param indices: The indices where zeros should be inserted, per dimension. For each dimension, you can
        specify: - an int
                 - a tuple of ints
                 - a generator yielding ints (such as `range`)
                 - Ellipsis (=...)
    :return: An array of shape `output_size` with the content of arr and zeros inserted at the given indices.
    """
    # assert len(output_size) == len(indices) == len(arr.shape)
    result = np.zeros(output_size)
    existing_indices = [np.setdiff1d(np.arange(axis_size), axis_indices, assume_unique=True)
                        for axis_size, axis_indices in zip(output_size, indices)]
    result[np.ix_(*existing_indices)] = arr
    return result
For your use-case, you can use it like this:
def fill_by_label(arr, labels):
    # If this is your only use-case, you can make it more efficient
    # by not computing the missing indices first, just to compute
    # the existing indices again
    missing_idxs = np.setdiff1d(np.arange(len(labels)), arr)
    return insert_at(arr, output_size=(len(labels), len(labels)),
                     indices=(missing_idxs, missing_idxs))
x = np.array([[3, 0, 3],
[0, 2, 0],
[2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)
print(fill_by_label(x, labels))
>> [[3. 0. 0. 3. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[2. 0. 3. 3. 0.]
[0. 0. 0. 0. 0.]]
But this is very flexible. You can use it for zero padding:
def zero_pad(arr):
    out_size = np.array(arr.shape) + 2
    indices = (0, out_size[0] - 1), (0, out_size[1] - 1)
    return insert_at(arr, output_size=out_size,
                     indices=indices)
print(zero_pad(x))
>> [[0. 0. 0. 0. 0.]
[0. 3. 0. 3. 0.]
[0. 0. 2. 0. 0.]
[0. 2. 3. 3. 0.]
[0. 0. 0. 0. 0.]]
It also works with non-quadratic inputs and outputs:
x = np.ones((3, 4))
print(insert_at(x, (4, 5), (2, 3)))
>>[[1. 1. 1. 0. 1.]
[1. 1. 1. 0. 1.]
[0. 0. 0. 0. 0.]
[1. 1. 1. 0. 1.]]
With a different number of insertions per dimension:
x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, (2, 4))))
>> [[1. 1. 0. 1. 0. 1.]
[0. 0. 0. 0. 0. 0.]
[1. 1. 0. 1. 0. 1.]
[1. 1. 0. 1. 0. 1.]]
You can use range (or other generators) instead of enumerating every index:
x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, range(2, 4))))
>>[[1. 1. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0.]
[1. 1. 0. 0. 1. 1.]
[1. 1. 0. 0. 1. 1.]]
It works with arbitrary dimensions (as long as you specify indices for every dimension)1:
x = np.ones((2, 2, 2))
print(insert_at(x, (3, 3, 3), (0, 0, 0)))
>>>[[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
[[0. 0. 0.]
[0. 1. 1.]
[0. 1. 1.]]
[[0. 0. 0.]
[0. 1. 1.]
[0. 1. 1.]]]
You can use Ellipsis (=...) to indicate that you don't want to change a dimension1,2:
x = np.ones((2, 2))
print(insert_at(x, (2, 4), (..., (0, 1))))
>>[[0. 0. 1. 1.]
[0. 0. 1. 1.]]
1: You could automatically detect this based on arr.shape and output_size, and fill it with ... as needed, but I'll leave that up to you if you need it. If you wanted to, you could probably get rid of the output_size parameter instead, but then it gets trickier with passing in generators.
2: This is somewhat different to the normal numpy ... semantics, as you need to specify ... for every dimension that you want to keep, i.e. the following does NOT work:
x = np.ones((2, 2, 2))
print(insert_at(x, (2, 2, 3), (..., 0)))
For timing, I ran the insertion of 10 rows and columns into a 90x90 array 100000 times; this is the result:
x = np.random.random(size=(90, 90))
indices = np.arange(10) * 10
def measure_time_fast():
    insert_at(x, (100, 100), (indices, indices))

def measure_time_slow():
    insert_rows_columns_at_slow(x, indices)

if __name__ == '__main__':
    import timeit
    for speed in ("fast", "slow"):
        times = timeit.repeat(f"measure_time_{speed}()", setup=f"from __main__ import measure_time_{speed}", repeat=10, number=10000)
        print(f"Min: {np.min(times) / 10000}, Max: {np.max(times) / 10000}, Mean: {np.mean(times) / 10000} seconds per call")
For the fast version:
Min: 7.336409069976071e-05, Max: 7.7440657400075e-05, Mean:
7.520040466995852e-05 seconds per call
That is about 75 microseconds.
For your slow version:
Min: 0.00028272533010022016, Max: 0.0002923079213000165, Mean:
0.00028581595062998535 seconds per call
That is about 300 microseconds.
The difference will be greater, the bigger the arrays get. E.g. for inserting 100 rows and columns into a 900x900 array, these are the results (ran only 1000 times):
Fast version:
Min: 0.00022916630539984907, Max: 0.0022916630539984908, Mean:
0.0022916630539984908 seconds per call
Slow version:
Min: 0.013766934227399906, Max: 0.13766934227399907, Mean:
0.13766934227399907 seconds per call
