Get one array from vectors - python

I have been struggling with this problem for some time and I can't find a solution. I am trying to plot a heatmap and have read some related questions, but their problems are different.
I have 3 Vectors:
x = [ -1, 0, 1, -1, 0, 1, 0, -1, 1 ]
y = [ -1, -1, -1, 0, 0, 0, 1, 1, 1 ]
E = [ 3, 1, 4, 1, 5, 9, 6, 2, 5 ]
What I want is a matrix like the one below for the actual plotting:
E_xy = [[3, 1, 4],
        [1, 5, 9],
        [2, 6, 5]]
x[0] belongs to y[0] and E[0], and so on.
What is the best/easiest way to do this?
Please note: the ordering of the input cannot simply be reused for reshaping (see E[7] and E[8] and the resulting E_xy[2,0] and E_xy[2,1]).

Assuming a square NxN matrix:
import numpy as np

x = np.array([-1, 0, 1, -1, 0, 1, 0, -1, 1])
y = np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1])
E = np.array([3, 1, 4, 1, 5, 9, 6, 2, 5])

N = int(np.sqrt(E.size))
# Sort into row-major order: y selects the row, x the column within it.
sorting = np.argsort(x + y * N)
E_xy = E[sorting].reshape(N, N)
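If the coordinates are not guaranteed to form the dense -1..1 grid used above, a slightly more defensive variant (a sketch, using np.unique to map arbitrary coordinate values to row and column indices) could look like this:
import numpy as np

x = np.array([-1, 0, 1, -1, 0, 1, 0, -1, 1])
y = np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1])
E = np.array([3, 1, 4, 1, 5, 9, 6, 2, 5])

# Map each distinct coordinate value to a 0-based index.
xs, col = np.unique(x, return_inverse=True)
ys, row = np.unique(y, return_inverse=True)

E_xy = np.full((ys.size, xs.size), np.nan)  # NaN marks cells with no data
E_xy[row, col] = E
print(E_xy)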

Interpreting your x and y lists as positional offsets relative to the matrix's center element, you can first create tuples containing your positions and then fill everything into a newly created matrix of the desired dimensions:
x = [ -1, 0, 1, -1, 0, 1, 0, -1, 1 ]
y = [ -1, -1, -1, 0, 0, 0, 1, 1, 1 ]
E = [ 3, 1, 4, 1, 5, 9, 6, 2, 5 ]

# Shift the offsets (-1..1) to valid indices (0..2).
positions = [
    (x[i] + 1, y[i] + 1, E[i])
    for i in range(len(x))
]
result = [
    [None] * 3 for _ in range(3)
]
# Use fresh names here so the x and y lists are not shadowed.
for col, row, value in positions:
    result[row][col] = value
print(result)
Note that this is not a general solution; it assumes the matrix is always 3x3 and requires changes to work for arbitrarily shaped matrices (see the sketch below).
The above code prints:
>>> [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
Additionally, the above code can surely be written more efficiently, but it would lose readability; it's up to you whether to trade that off to save memory, if your matrices grow large.
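For reference, a generalized variant of the above (a sketch that infers the matrix shape from the coordinate ranges instead of hard-coding 3x3):
x = [-1, 0, 1, -1, 0, 1, 0, -1, 1]
y = [-1, -1, -1, 0, 0, 0, 1, 1, 1]
E = [3, 1, 4, 1, 5, 9, 6, 2, 5]

# Shift coordinates so the smallest value maps to index 0.
x_off, y_off = min(x), min(y)
n_cols = max(x) - x_off + 1
n_rows = max(y) - y_off + 1

result = [[None] * n_cols for _ in range(n_rows)]
for xi, yi, value in zip(x, y, E):
    result[yi - y_off][xi - x_off] = value

print(result)  # [[3, 1, 4], [1, 5, 9], [2, 6, 5]]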

Related

setting the values of sliding windows of an array in numpy

Suppose I have a 2D array with shape (3, 3), call it a, and an array of zeros with shape (7, 7, 5, 5), call it b. I want to modify b in the following way:
for p in range(5):
    for q in range(5):
        b[p:p + 3, q:q + 3, p, q] = a
Given:
a = np.array([[4, 2, 2],
              [9, 0, 5],
              [9, 9, 4]])
b = np.zeros((7, 7, 5, 5), dtype=int)
b would end up something like:
>>> b[:, :, 0, 0]
array([[4, 2, 2, 0, 0, 0, 0],
       [9, 0, 5, 0, 0, 0, 0],
       [9, 9, 4, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])
>>> b[:, :, 0, 1]
array([[0, 4, 2, 2, 0, 0, 0],
       [0, 9, 0, 5, 0, 0, 0],
       [0, 9, 9, 4, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])
One way to think about this is to make a sliding window view of b (6D), slice out the parts you want (3D or 4D), and assign a to them.
However, there is a simpler way to do this altogether. The way a sliding window view works is by creating a dimension that steps along less than the full size of the dimension you are viewing. For example:
>>> x = np.array([1, 2, 3, 4])
>>> window = np.lib.stride_tricks.as_strided(
...     x, shape=(x.shape[0] - 2, 3),
...     strides=x.strides * 2)
>>> window
array([[1, 2, 3],
       [2, 3, 4]])
I'm deliberately using np.lib.stride_tricks.as_strided rather than np.lib.stride_tricks.sliding_window_view here because it has a certain flexibility that you need.
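For comparison, a read-only version of the same window (assuming NumPy >= 1.20, where sliding_window_view is available) would be:
import numpy as np

x = np.array([1, 2, 3, 4])
# Same 2x3 window as above, but as a safe, read-only view.
window = np.lib.stride_tricks.sliding_window_view(x, 3)
print(window)
# [[1 2 3]
#  [2 3 4]]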
You can have a stride that is larger than the axis you are viewing, as long as you are careful. Contiguous arrays are more forgiving in this case, but they are by no means a requirement. An example of this is np.diag, which you can implement something like this:
>>> x = np.arange(12).reshape(3, 4)
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> diag = np.lib.stride_tricks.as_strided(
...     x, shape=(min(x.shape),),
...     strides=(sum(x.strides),))
>>> diag
array([ 0,  5, 10])
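A quick sanity check of that view against the built-in (a sketch):
import numpy as np

x = np.arange(12).reshape(3, 4)
diag = np.lib.stride_tricks.as_strided(
    x, shape=(min(x.shape),), strides=(sum(x.strides),))

# The strided view matches numpy's own diagonal extraction.
assert (diag == np.diag(x)).all()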
The trick is to make a view of only the parts of b you care about in a way that makes the assignment easy. Because of broadcasting rules, you will want the last two dimensions of the view to be a.shape, and the strides to be b.strides[:2], since that's where you want to place a.
The first two dimensions of the view will be responsible for making the copies of a. You want 25 copies, so the shape will be (5, 5). The strides are the trickier part. Let's take a look at a 2D case, just because that's easier to visualize, and then attempt to generalize:
>>> a0 = np.array([1, 2])
>>> b0 = np.zeros((4, 3), dtype=int)
>>> b0[0:2, 0] = b0[1:3, 1] = b0[2:4, 2] = a0
The goal is to make a view that strides along the diagonal of b0 in the first axis. So:
>>> np.lib.stride_tricks.as_strided(
...     b0, shape=(b0.shape[0] - a0.shape[0] + 1, a0.shape[0]),
...     strides=(sum(b0.strides), b0.strides[0]))[:] = a0
>>> b0
array([[1, 0, 0],
       [2, 1, 0],
       [0, 2, 1],
       [0, 0, 2]])
So that's what you do for b, but adding the stride of each leading dimension to the stride of the corresponding trailing dimension:
a = np.array([[4, 2, 2],
              [9, 0, 5],
              [9, 9, 4]])
b = np.zeros((7, 7, 5, 5), dtype=int)
vshape = (*np.subtract(b.shape[:a.ndim], a.shape) + 1,
          *a.shape)
vstrides = (*np.add(b.strides[:a.ndim], b.strides[a.ndim:]),
            *b.strides[:a.ndim])
np.lib.stride_tricks.as_strided(b, shape=vshape, strides=vstrides)[:] = a
TL;DR
def emplace_window(a, b):
    vshape = (*np.subtract(b.shape[:a.ndim], a.shape) + 1, *a.shape)
    vstrides = (*np.add(b.strides[:a.ndim], b.strides[a.ndim:]), *b.strides[:a.ndim])
    np.lib.stride_tricks.as_strided(b, shape=vshape, strides=vstrides)[:] = a
I've phrased it this way because now you can apply it to any number of dimensions. The only expectations are that 2 * a.ndim == b.ndim and that b.shape[a.ndim:] == b.shape[:a.ndim] - a.shape + 1.
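As a usage check, here is a sketch comparing emplace_window (as defined above) against the explicit loop from the question:
import numpy as np

a = np.array([[4, 2, 2],
              [9, 0, 5],
              [9, 9, 4]])

b1 = np.zeros((7, 7, 5, 5), dtype=int)
emplace_window(a, b1)

# Reference: the explicit double loop from the question.
b2 = np.zeros((7, 7, 5, 5), dtype=int)
for p in range(5):
    for q in range(5):
        b2[p:p + 3, q:q + 3, p, q] = a

assert (b1 == b2).all()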

Randomly pick a zero in a 2d numpy array

I have a 2d numpy array like this:
board = numpy.array([[ 0,  0,  2,  2],
                     [ 4,  0,  2,  0],
                     [ 2,  2,  2,  2],
                     [ 0,  0,  0, 16]])
I want to choose one of the zeros, and replace it with something else. I came up with a solution myself, but I'm looking for a better way; perhaps using a numpy function like choice but for a 2d array.
zeros = np.where(board == 0)
r = np.random.randint(len(zeros[0]))
z1 = zeros[0][r]
z2 = zeros[1][r]
board[z1, z2] = 2
You can extract the indices where board == 0, convert them to linear indices so that you can use np.random.choice (this method only accepts 1-D arrays), and then convert the randomly chosen linear index back to the corresponding 2D index to make the replacement.
import numpy as np
board = np.array([[ 0,  0,  2,  2],
                  [ 4,  0,  2,  0],
                  [ 2,  2,  2,  2],
                  [ 0,  0,  0, 16]])
zeros = np.argwhere(board == 0) # Indices where board == 0
indices = np.ravel_multi_index([zeros[:, 0], zeros[:, 1]], board.shape) # Linear indices
ind = np.random.choice(indices) # Randomly select your index to replace
board[np.unravel_index(ind, board.shape)] = 100 # Perform the replacement
>>> board
array([[  0,   0,   2,   2],
       [  4,   0,   2, 100],
       [  2,   2,   2,   2],
       [  0,   0,   0,  16]])
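As an aside, with the newer Generator API (assuming NumPy >= 1.17), Generator.choice can sample rows of a 2-D array directly, so the linear-index round-trip can be skipped entirely (a sketch):
import numpy as np

board = np.array([[ 0,  0,  2,  2],
                  [ 4,  0,  2,  0],
                  [ 2,  2,  2,  2],
                  [ 0,  0,  0, 16]])

rng = np.random.default_rng()
zeros = np.argwhere(board == 0)  # one (row, col) pair per zero
row, col = rng.choice(zeros)     # pick one pair at random (axis=0 by default)
board[row, col] = 100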
The method you're using seems fine. It could be sped up and streamlined a bit.
Python's random.choice() works fine with the arrays of zeros zipped:
zeros
# is:
(array([0, 0, 1, 1, 3, 3, 3], dtype=int64),
 array([0, 1, 1, 3, 0, 1, 2], dtype=int64))
list(zip(*zeros))
# output:
# [(0, 0), (0, 1), (1, 1), (1, 3), (3, 0), (3, 1), (3, 2)]
import random
random.choice(list(zip(*zeros)))
# (3, 1)
That returns a 2-element tuple (one index per axis), which can be used directly for [] assignment:
board[random.choice(list(zip(*zeros)))] = 100
board
# output:
array([[  0,   0,   2,   2],
       [  4,   0,   2,   0],
       [  2,   2,   2,   2],
       [  0,   0, 100,  16]])

How to do indexing of a NumPy 3D-array based on 2D-array in Python?

Let's say I have a NumPy array A of shape (66, 5) and B of shape (100, 66, 5).
The elements of A index the first dimension (axis=0) of B; the values run from 0 to 99 (i.e. the first dimension of B has size 100).
A = array([[1, 0, 0, 1, 0],
           [0, 2, 0, 2, 4],
           [1, 7, 0, 5, 5],
           [2, 1, 0, 1, 7],
           [0, 7, 0, 1, 4],
           [0, 0, 3, 6, 0],
           ....])
For example, A[4,1] will take index 7 of the first dimension of B, index 4 of the second dimension of B, and index 1 of the third dimension of B.
What I want is to produce an array C of shape (66, 5) that contains the elements of B selected according to the elements in A.
You can use np.take_along_axis to do that:
import numpy as np
np.random.seed(0)
a = np.random.randint(100, size=(66, 5))
b = np.random.random(size=(100, 66, 5))
c = np.take_along_axis(b, a[np.newaxis], axis=0)[0]
# Test some element
print(c[25, 3] == b[a[25, 3], 25, 3])
# True
If I understand correctly, you are looking for advanced indexing of the first dimension of B. You can use np.indices to create the indices required for the other two dimensions of B and use advanced indexing:
idx = np.indices(A.shape)
C = B[A,idx[0],idx[1]]
Example:
B = np.random.rand(10, 20, 30)
A = np.array([[1, 0, 0, 1, 0],
              [0, 2, 0, 2, 4],
              [1, 7, 0, 5, 5],
              [2, 1, 0, 1, 7],
              [0, 7, 0, 1, 4],
              [0, 0, 3, 6, 0]])
idx = np.indices(A.shape)
C = B[A, idx[0], idx[1]]
print(C[4, 1] == B[7, 4, 1])
# True
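An equivalent way to build the trailing indices is with open grids, which broadcast instead of materializing full index arrays (a sketch of the same advanced-indexing idea using np.ogrid):
import numpy as np

B = np.random.rand(10, 20, 30)
A = np.random.randint(10, size=(6, 5))

# j has shape (6, 1) and k has shape (1, 5); they broadcast against A.
j, k = np.ogrid[:A.shape[0], :A.shape[1]]
C = B[A, j, k]

assert C[4, 1] == B[A[4, 1], 4, 1]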
The following approach, using axis swaps and flat indexing in NumPy, also works:
print(A)
# array([[2, 0],
# [1, 1],
# [2, 0]])
print(B)
# array([[[ 5, 7],
# [ 0, 0],
# [ 0, 0]],
# [[ 1, 8],
# [ 1, 9],
# [10, 1]],
# [[12, 22],
# [ 2, 2],
# [ 2, 2]]])
# The cumsum term is just np.arange(A.size) * B.shape[0]; the trailing -3 is -B.shape[0].
temp = A.reshape(-1) + np.cumsum(np.ones([A.reshape(-1).shape[0]])*B.shape[0], dtype = 'int') - 3
# Move axis 0 to the end so that, in the flattened view, each block of B.shape[0]
# consecutive elements holds the candidates B[:, j, k] for one position (j, k) of A.
C = B.swapaxes(0, 1).swapaxes(2, 1).reshape(-1)[temp].reshape(A.shape)
print(C)
# array([[12, 7],
# [ 1, 9],
# [ 2, 0]])

Is there a way to find the largest change in a pandas dataframe column?

I'm trying to find the largest difference x[i] - x[j] in a series, where i must come before j. Is there an efficient way to do this in pandas?
x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
largest_change = 0
for i in range(len(x)):
    for j in range(i + 1, len(x)):
        change = x[i] - x[j]
        print(x[i], x[j], change)
        if change > largest_change:
            largest_change = change
The output would just be the value, in this case 4 (from 5 down to 1).
Try numpy broadcasting with np.triu and max:
arr = np.array(x)
np.triu(arr[:, None] - arr)
array([[ 0, -1, -4, -3, -1, -3, -1,  0, -6],
       [ 0,  0, -3, -2,  0, -2,  0,  1, -5],
       [ 0,  0,  0,  1,  3,  1,  3,  4, -2],
       [ 0,  0,  0,  0,  2,  0,  2,  3, -3],
       [ 0,  0,  0,  0,  0, -2,  0,  1, -5],
       [ 0,  0,  0,  0,  0,  0,  2,  3, -3],
       [ 0,  0,  0,  0,  0,  0,  0,  1, -5],
       [ 0,  0,  0,  0,  0,  0,  0,  0, -6],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0]])
np.triu(arr[:, None] - arr).max()
Out[758]: 4
Besides Andy's smart method, here is another one that propagates the minimum value backward. Its advantage is linear time complexity instead of quadratic, in case you handle a large amount of data.
a = np.flipud(np.array(x))
largest_change = (a - np.minimum.accumulate(a)).max()
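The same linear-time idea phrased with pandas, in case the data already lives in a Series (a sketch using cummin on the reversed order):
import pandas as pd

x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
s = pd.Series(x)

# future_min[i] is the minimum of all values at or after position i.
future_min = s[::-1].cummin()[::-1]
largest_change = (s - future_min).max()
print(largest_change)  # 4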
How about this?
x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
largest_change = 0
position = 0
for i in range(len(x) - 1):
    change = x[i] - min(x[i + 1:])
    if change > largest_change:
        largest_change = change
        position = i
print(x[position], min(x[position + 1:]), largest_change)
Why don't you just take the diff then the max of that?
x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
s = pd.Series(x)
z = abs(s.diff())
idx_max_val = z[z==z.max()].index[0]
print(f'Max difference in value ({z.max()}) occurs at the indices of {idx_max_val-1}:{idx_max_val}')
I would suggest a rolling window:
import pandas
df = pandas.DataFrame({'col1': [1, 2, 5, 4, 2, 4, 2, 1, 7]})
df["diff"] = df['col1'].rolling(window=2).apply(lambda x: x[1] - x[0])
print(df["diff"].max())
Output: 6.0
Or did I misunderstand you and you just want the largest difference between any two values?
This would be:
import pandas
df = pandas.DataFrame({'col1': [1, 2, 5, 4, 2, 4, 2, 1, 7]})
max_diff = df["col1"].max() - df["col1"].min()
print("Min:", df["col1"].min(), "Max:", df["col1"].max(), "Diff:", max_diff)
Output:
Min: 1 Max: 7 Diff: 6

Python numpy: perform function on each pair of columns in a numpy 2-D array?

I'm trying to apply a function to each pair of columns in a numpy array (each column is an individual's genotype).
For example:
In [48]: g[0:10, 0:10]
array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
       [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=int8)
My goal is to produce a distance matrix d so that each element of d is the pairwise distance between a pair of columns in g.
d[0,1] = func(g[:,0], g[:,1])
Any ideas would be fantastic! Thank you!
You can simply define the function as:
def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)
And then call it, using as indices an array generated with itertools.combinations, in order to get all possible column combinations:
import itertools

combinations = np.array(list(itertools.combinations(range(g.shape[1]), 2)))
dist = count_snp_diffs(g[:, combinations[:, 0]], g[:, combinations[:, 1]])
In addition, if the output must be stored in a matrix (which for large g is not recommended, because only the upper triangle will be filled and the rest is wasted space), this can be achieved with the same trick:
d = np.zeros((g.shape[1],g.shape[1]))
combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
d[combinations[:,0],combinations[:,1]] = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])
Now, d[i,j] returns the distance between columns i and j (whereas d[j,i] is zero). This approach relies on the fact that arrays can be indexed with lists or arrays containing repeated indices:
a = np.arange(3)+4
a[[0,1,1,1,0,2,1,1]]
# Out
# [4, 5, 5, 5, 4, 6, 5, 5]
Here is a step-by-step explanation of what is happening.
Calling g[:,combinations[:,0]] accesses all the columns in the first column of combinations, generating a new array, which is compared column by column with the array generated by g[:,combinations[:,1]]. Thus, a boolean array diff is generated. If g had 3 columns it would look like this, where each column is the comparison of columns 0,1, 0,2 and 1,2:
[[ True False False]
 [False  True False]
 [ True  True False]
 [False False False]
 [False  True False]
 [False False False]]
And finally, the values for each column are added:
np.count_nonzero(diff,axis=0)
# Out
# [2 3 0]
In addition, the boolean class in Python inherits from the integer class (roughly, False == 0 and True == 1; see this answer to "Is False == 0 and True == 1 in Python an implementation detail or is it guaranteed by the language?" for more info). np.count_nonzero adds 1 for each True position, which is the same result obtained with np.sum:
np.sum(diff,axis=0)
# Out
# [2 3 0]
Notes on performance and memory
For large arrays, working with the whole array at once can require too much memory and you may get a MemoryError; however, for small or medium arrays it tends to be the fastest approach. In some cases it can be useful to work in chunks:
combinations = np.array(list(itertools.combinations(range(g.shape[1]), 2)))
n = len(combinations)
dist = np.empty(n)
# B = np.zeros((g.shape[1], g.shape[1]))
chunk = 200
# xrange matches the Python 2.7 used for the timings below; use range on Python 3.
for i in xrange(chunk, n, chunk):
    dist[i - chunk:i] = count_snp_diffs(g[:, combinations[i - chunk:i, 0]], g[:, combinations[i - chunk:i, 1]])
    # B[combinations[i - chunk:i, 0], combinations[i - chunk:i, 1]] = count_snp_diffs(g[:, combinations[i - chunk:i, 0]], g[:, combinations[i - chunk:i, 1]])
dist[i:] = count_snp_diffs(g[:, combinations[i:, 0]], g[:, combinations[i:, 1]])
# B[combinations[i:, 0], combinations[i:, 1]] = count_snp_diffs(g[:, combinations[i:, 0]], g[:, combinations[i:, 1]])
For g.shape=(300,N), the execution times reported by %%timeit on my computer with Python 2.7, numpy 1.14.2 and allel 1.1.10 are:
10 columns
numpy + matrix storage: 107 µs
numpy + 1D storage : 101 µs
allel : 247 µs
100 columns
numpy + matrix storage: 15.7 ms
numpy + 1D storage : 16 ms
allel : 22.6 ms
1000 columns
numpy + matrix storage: 1.54 s
numpy + 1D storage : 1.53 s
allel : 2.28 s
With these array dimensions, pure numpy is a little faster than the allel module, but the computation time should be checked for the dimensions in your problem.
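If SciPy is available, the same pairwise computation can also be phrased with scipy.spatial.distance.pdist over the transposed matrix (a sketch; the per-pair Python callback makes it slower than the vectorized version above, but it avoids the index bookkeeping):
import numpy as np
from scipy.spatial.distance import pdist, squareform

def count_snp_diffs(x, y):
    # Count positions where the columns differ and neither value is missing (< 0).
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0))

g = np.array([[ 1,  1, -1],
              [ 1, -1,  0],
              [ 0,  1,  1]], dtype=np.int8)

# pdist treats rows as observations, so transpose to compare columns.
d = squareform(pdist(g.T, metric=count_snp_diffs))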
You can create your expected pairs using np.dstack and then apply the function on the third axis using np.apply_along_axis (as written, this pairs each column with the next one, not every combination of columns):
new = np.dstack((arr[:,:-1], arr[:, 1:]))
np.apply_along_axis(np.sum, 2, new)
Example:
In [86]: arr = np.array([[ 1, 1, 1, 1, 1, 1, 1, 1, 1, -1],
...: [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
...: [ 1, 1, 1, 1, 1, 1, -1, 1, 1, 1],
...: [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
...: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
...: [ 1, 1, 1, 1, 1, 1, 1, 1, 1, -1],
...: [-1, -1, 0, -1, -1, -1, -1, -1, -1, 0],
...: [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
...: [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
...: [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=np.int8)
...:
...:
In [87]: new = np.dstack((arr[:,:-1], arr[:, 1:]))
In [88]: new
Out[88]:
array([[[ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1, -1]],
...
In [89]: np.apply_along_axis(np.sum, 2, new)
Out[89]:
array([[ 2,  2,  2,  2,  2,  2,  2,  2,  0],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  0,  0,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  0],
       [-2, -1, -1, -2, -2, -2, -2, -2, -1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2]])
Thank you for the suggestions!
I've just been told that this is possible with scikit-allel: https://scikit-allel.readthedocs.io/en/latest/ - you can supply your own distance function to be applied to pairwise combinations of columns in a 2-D numpy array:
dist = allel.pairwise_distance(g, metric=count_snp_diffs)
Thank you for your help!
http://alimanfoo.github.io/2016/06/10/scikit-allel-tour.html
