Conditional average with numpy - python

Given a 3x2 array, I want to calculate the average on axis=0, but only considering values that are larger than 0.
So given the array
[ [1,0],
  [0,0],
  [1,0] ]
I want the output to be
# first column: 1, 0, 1 filtered for > 0 gives 1, 1 -> average = (1+1)/2 = 1
# second column: 0, 0, 0 filtered for > 0 gives no values -> average = 0
[1 0]
My current code is
import numpy as np
frame = np.array([ [1,0],
                   [0,0],
                   [1,0] ])
weights=np.array(frame)>0
print("weights:")
print(weights)
print("average without weights:")
print((np.average(frame, axis=0)))
print("average with weights:")
print((np.average(frame, axis=0, weights=weights)))
This gives me
weights:
[[ True False]
 [False False]
 [ True False]]
average without weights:
[ 0.66666667 0. ]
average with weights:
Traceback (most recent call last):
File "C:\Users\myuser\project\test.py", line 123, in <module>
print((np.average(frame, axis=0, weights=weights)))
File "C:\Users\myuser\Miniconda3\envs\myenv\lib\site-packages\numpy\lib\function_base.py", line 1140, in average
"Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized
I don't understand this error. What am I doing wrong and how can I get the average for all values greater than zero along axis=0? Thanks!

You can get the mask of greater-than-zero values and use it to do elementwise multiplication and sum-reduction along the first axis. Finally, divide by the count of masked elements along the first axis to get the average values.
Thus, one solution would be -
mask = a > 0 # Input array : a
out = np.einsum('i...,i...->...',a,mask)/mask.sum(0)
Sample run -
In [52]: a
Out[52]:
array([[ 3, -3,  3],
       [ 2,  2,  0],
       [ 0, -3,  1],
       [ 0,  1,  1]])
In [53]: mask = a > 0
In [56]: np.einsum('i...,i...->...',a,mask) # summations of > 0s
Out[56]: array([5, 3, 5])
In [57]: np.einsum('i...,i...->...',a,mask)/mask.sum(0) # avg values of >0s
Out[57]: array([ 2.5 , 1.5 , 1.66666667])
To account for all-zero columns, where we expect 0 as the result, we can use np.where to do the choosing, like so -
In [61]: a[:,-1] = 0
In [62]: a
Out[62]:
array([[ 3, -3,  0],
       [ 2,  2,  0],
       [ 0, -3,  0],
       [ 0,  1,  0]])
In [63]: mask = a > 0
In [65]: np.where( mask.any(0), np.einsum('i...,i...->...',a,mask)/mask.sum(0), 0)
__main__:1: RuntimeWarning: invalid value encountered in true_divide
Out[65]: array([ 2.5, 1.5, 0. ])
Just ignore the warning there.
If you feel paranoid about warnings, use masking -
mask = a > 0
vm = mask.any(0) # valid mask
out = np.zeros(a.shape[1])
out[vm] = np.einsum('ij,ij->j',a[:,vm],mask[:,vm])/mask.sum(0)[vm]
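Alternatively, numpy's errstate context manager can silence the division warnings locally. A minimal sketch, applied to the 3x2 frame from the original question:
import numpy as np

frame = np.array([[1, 0],
                  [0, 0],
                  [1, 0]])
mask = frame > 0
# 0/0 in the all-zero column triggers the warning; np.where substitutes 0 there.
with np.errstate(invalid='ignore', divide='ignore'):
    out = np.where(mask.any(0),
                   np.einsum('i...,i...->...', frame, mask) / mask.sum(0),
                   0)
print(out)  # [1. 0.]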


Assign zeros to minimum values in numpy 3d array

I have a numpy array of shape (100, 100, 20) (in python 3)
I want to find for each 'pixel' the 15 channels with minimum values, and make them zeros (meaning: make the array sparse, keep only the 5 highest values).
Example:
input: array = [[1,2,3], [7,6,9], [12,71,3]], num_channels_to_zero = 2
output: [[0,0,3], [0,0,9], [0,71,0]]
How can I do it?
what I have for now:
array = numpy.random.rand(100, 100, 20)
inds = numpy.argsort(array, axis=-1) # also shape (100, 100, 20)
I want to do something like
array[..., inds[..., :15]] = 0
but it doesn't give me what I want
np.argsort outputs indices suitable for the [...]_along_axis functions of numpy. This includes np.put_along_axis:
import numpy as np
array = np.random.rand(100, 100, 20)
print(array[0,0])
#[0.44116124 0.94656705 0.20833932 0.29239585 0.33001399 0.82396784
# 0.35841905 0.20670957 0.41473762 0.01568006 0.1435386 0.75231818
# 0.5532527 0.69366173 0.17247832 0.28939985 0.95098187 0.63648877
# 0.90629116 0.35841627]
inds = np.argsort(array, axis=-1)
np.put_along_axis(array, inds[..., :15], 0, axis=-1)
print(array[0,0])
#[0. 0.94656705 0. 0. 0. 0.82396784
# 0. 0. 0. 0. 0. 0.75231818
# 0. 0. 0. 0. 0.95098187 0.
# 0.90629116 0. ]
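For reference, here is the same put_along_axis recipe run on the small example from the question (a sketch, with num_channels_to_zero = 2 as in the question):
import numpy as np

a = np.array([[ 1,  2,  3],
              [ 7,  6,  9],
              [12, 71,  3]])
num_channels_to_zero = 2
inds = np.argsort(a, axis=-1)
# Zero out the num_channels_to_zero smallest values in each row.
np.put_along_axis(a, inds[..., :num_channels_to_zero], 0, axis=-1)
print(a)
# [[ 0  0  3]
#  [ 0  0  9]
#  [ 0 71  0]]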
As mentioned in the numpy documentation:
From each row, a specific element should be selected. The row index is just [0, 1, 2] and the column index specifies the element to choose for the corresponding row, here [0, 1, 0]. Using both together the task can be solved using advanced indexing:
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> x[[0, 1, 2], [0, 1, 0]]
array([1, 4, 5])
So, for your example:
a = np.array([[1,2,3], [7,6,9], [12,71,3]])
amax = a.argmax(axis=-1)
a[np.arange(a.shape[0]), amax] = 0
a
array([[ 1,  2,  0],
       [ 7,  6,  0],
       [12,  0,  3]])
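The same advanced-indexing idea also covers the question's actual task (zeroing the k smallest entries per row) if you feed it argsort indices instead of argmax; a sketch with k = 2:
import numpy as np

a = np.array([[ 1,  2,  3],
              [ 7,  6,  9],
              [12, 71,  3]])
k = 2
rows = np.arange(a.shape[0])[:, None]  # column of row indices, shape (3, 1)
cols = np.argsort(a, axis=-1)[:, :k]   # indices of the k smallest per row
a[rows, cols] = 0                      # broadcasts to pick (row, col) pairs
print(a)
# [[ 0  0  3]
#  [ 0  0  9]
#  [ 0 71  0]]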

How to use slice notation in this case?

I am solving a problem in which I am using two large matrices A and B. Matrix A is formed by ones and zeros and matrix B is formed by integers in the range [0,...,10].
At some point, I have to update the components of A such that, if a component is 1, it stays at 1; if it is 0, it stays at 0 with probability p or changes to 1 with probability 1-p. The parameter p depends on the corresponding component of matrix B: I have a list probabilities such that, when I update A[i,j], p equals probabilities[B[i,j]].
I can do the updating with the following code:
import numpy as np
for i in range(n):
    for j in range(n):
        if A[i,j] == 0:
            A[i,j] = np.random.choice([0,1], p=[probabilities[B[i,j]], 1-probabilities[B[i,j]]])
I think that there should be a faster way to update matrix A using slice notation. Any advice?
Note that this problem is equivalent to the following: given a vector a and a matrix B of non-negative integer entries, obtain a matrix C in which C[i,j] = a[B[i,j]].
The general answer
You can achieve this by generating a random value from a continuous distribution, and then comparing the generated value against the inverse cumulative distribution function of that distribution.
To be practical
Let's use a uniform random variable X in the range [0, 1); then the probability that X < a is a. Assume that A and B have the same shape.
You can use numpy.where for this (make sure that your probabilities variable is a numpy array)
A[:,:] = np.where(A == 0, np.where(np.random.rand(*A.shape) < probabilities[B], 1, 0), A)
If you want to avoid computing the random values for positions where A is non-zero, the indexing is a bit more involved:
A[A == 0] = np.where(np.random.rand(*A[A == 0].shape) < probabilities[B[A == 0]], 1, 0)
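A quick usage sketch of the first form, with small made-up arrays (the seed, shapes and the probabilities mapping are illustrative assumptions, not from the question):
import numpy as np

np.random.seed(0)                          # illustrative seed
A = np.array([[1, 0],
              [0, 1]])
B = np.array([[0, 3],
              [7, 2]])                     # integer values in [0, 10]
probabilities = np.linspace(0.9, 0.1, 11)  # made-up mapping from B values to p

A = np.where(A == 0,
             np.where(np.random.rand(*A.shape) < probabilities[B], 1, 0),
             A)
print(A)  # ones are untouched; each zero became 1 where rand < probabilities[B]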
Make many p in the 0-1 range:
In [36]: p=np.arange(1000)/1000
Your choice-based version, summed to get an overall statistic:
In [37]: sum([np.random.choice([0,1],p=[p[i],1-p[i]]) for i in range(p.shape[0])])
Out[37]: 485
and a statistically similar random array without the comprehension:
In [38]: (np.random.rand(p.shape[0])>p).astype(int).sum()
Out[38]: 496
You are welcome to perform other tests to verify their equivalence.
If probabilities is a function that takes the whole B or B[mask], we should be able to do:
mask = A==0
n = mask.ravel().sum() # the number of true elements
A[mask] = (np.random.rand(n)>probabilities(B[mask])).astype(int)
To test this:
In [39]: A = np.random.randint(0,2, (4,5))
In [40]: A
Out[40]:
array([[1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 1, 0]])
In [41]: mask = A==0
In [42]: A[mask]
Out[42]: array([0, 0, 0, 0, 0, 0])
In [43]: B = np.arange(1,21).reshape(4,5)
In [44]: def foo(B):  # scale the B values to probabilities
    ...:     return B/20
    ...:
In [45]: foo(B)
Out[45]:
array([[0.05, 0.1 , 0.15, 0.2 , 0.25],
       [0.3 , 0.35, 0.4 , 0.45, 0.5 ],
       [0.55, 0.6 , 0.65, 0.7 , 0.75],
       [0.8 , 0.85, 0.9 , 0.95, 1.  ]])
In [46]: n = mask.ravel().sum()
In [47]: n
Out[47]: 6
In [51]: (np.random.rand(n)>foo(B[mask])).astype(int)
Out[51]: array([1, 1, 0, 1, 1, 0])
In [52]: A[mask] = (np.random.rand(n)>foo(B[mask])).astype(int)
In [53]: A
Out[53]:
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 1, 1, 0]])
In [54]: foo(B[mask])
Out[54]: array([0.15, 0.2 , 0.25, 0.65, 0.7 , 1. ])
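Putting those pieces together as a single function-style update (a sketch; update_zeros is a made-up name and the lambda reuses foo's toy scaling from above):
import numpy as np

def update_zeros(A, B, probabilities):
    # Flip each zero of A to 1 with probability 1 - probabilities(B[i,j]).
    mask = A == 0
    n = mask.sum()  # number of zeros to update
    A[mask] = (np.random.rand(n) > probabilities(B[mask])).astype(int)
    return A

A = np.random.randint(0, 2, (4, 5))
B = np.arange(1, 21).reshape(4, 5)
update_zeros(A, B, lambda b: b/20)  # same scaling as foo above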

add mask to a 2*2 nparray for value ranges

Here's a sample nparray:
array([[ 0.70582116,  0.29417881],
       [ 0.65219176,  0.34780821],
       [ 0.82653958,  0.17346044],
       ...,
       [ 0.76903266,  0.23096734],
       [ 0.65070963,  0.3492904 ],
       [ 0.63485813,  0.36514184]], dtype=float32)
I want to apply a mask to the first column: if a value is greater than 0.7, it becomes 1, else 0 (and vice versa for the second column). So in the end the nparray should look something like this:
array([[ 1, 0],
       [ 0, 1],
       [ 1, 0],
       ...,
       [ 1, 0],
       [ 0, 1],
       [ 0, 1]], dtype=float32)
How could I do it with numpy in a Pythonic way? Thanks!
IIUC, a little broadcasted logical comparison and conversion to int:
(x > 0.7).astype(int)
array([[1, 0],
       [0, 0],
       [1, 0],
       [1, 0],
       [0, 0],
       [0, 0]])
It's rather simple:
arr > 0.7
That gives a boolean result (dtype np.bool_). To convert to np.float32:
(arr > 0.7).astype(dtype=np.float32)
You could use numpy.column_stack:
x = numpy.array([[ 0.70582116, 0.29417881],
                 [ 0.65219176, 0.34780821],
                 [ 0.82653958, 0.17346044]])
col = x[:,0] > 0.7
final = numpy.column_stack([col, ~col]).astype(int)
Since col consists of booleans, ~col is the inverse of col.
Assuming your rows sum to 1, another way would be to compare it to numpy.array([0.7, 0.3]):
final = (x > numpy.array([0.7, 0.3])).astype(int)
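A quick check of both approaches on the three fully shown sample rows (values copied from the question):
import numpy

x = numpy.array([[ 0.70582116, 0.29417881],
                 [ 0.65219176, 0.34780821],
                 [ 0.82653958, 0.17346044]], dtype=numpy.float32)

col = x[:, 0] > 0.7
print(numpy.column_stack([col, ~col]).astype(int))
# [[1 0]
#  [0 1]
#  [1 0]]
print((x > numpy.array([0.7, 0.3])).astype(int))  # same result here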

Testing similarity of several datasets by producing a cross-correlation matrix

I am trying to compare several datasets and basically test whether they show the same feature, although this feature might be shifted, reversed or attenuated.
A very simple example below:
A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
C = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
D = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])
x = np.arange(0,len(A),1)
I thought the best way to do it would be to normalize these signals and take their absolute values (the attenuation is not important to me at this stage, I am interested in the position... but I might be wrong, so I welcome thoughts about this concept too) and calculate the area where they overlap. I am following up on this answer - the solution looked very elegant and simple, but I may be implementing it wrongly.
def normalize(sig):
    #ns = sig/max(np.abs(sig))
    ns = sig/sum(sig)
    return ns
a = normalize(A)
b = normalize(B)
c = normalize(C)
d = normalize(D)
which then look like this: [plot of the normalized signals a, b, c, d omitted]
But then, when I try to implement the solution from the answer, I run into problems.
OLD
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        w1 = np.abs(w1)
        w2 = np.abs(w2)
        M[c1,c2] = integrate.trapz(min(np.abs(w2).any(),np.abs(w1).any()))

print M
Produces TypeError: 'numpy.bool_' object is not iterable or IndexError: list assignment index out of range. But I only included the .any() because without them, I was getting the ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
EDIT - NEW
(thanks @Kody King)
The new code is now:
M = np.zeros([4,4])
SH = np.zeros([4,4])
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        crossCorrelation = np.correlate(w1,w2, 'full')
        bestShift = np.argmax(crossCorrelation)
        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]
        M[c1,c2] = similarity
        SH[c1,c2] = actualShift
M = M/M.max()
print M, '\n', SH
And the output:
[[ 1.          1.          0.95454545  0.63636364]
 [ 1.          1.          0.95454545  0.63636364]
 [ 0.95454545  0.95454545  0.95454545  0.63636364]
 [ 0.63636364  0.63636364  0.63636364  0.54545455]]
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]
The matrix of shifts looks OK now, but the actual correlation matrix does not. I am really puzzled by the fact that the lowest correlation value is for correlating d with itself. What I would like to achieve now is that:
1. The correlation value is highest when correlating a signal with itself (i.e. the highest values sit on the main diagonal).
2. The correlation values are in the range between 0 and 1, so that, as a result, I would have 1s on the main diagonal and other numbers (0.x) elsewhere.
I was hoping the M = M/M.max() would do the job, but only if condition no. 1 is fulfilled, which it currently isn't.
EDIT - UPDATE
Following the advice, I used the recommended normalization formula (dividing each signal by its sum), but the problem wasn't solved, just reversed. Now the correlation of d with d is 1, but none of the other signals correlate with themselves.
New output:
[[ 0.45833333  0.45833333  0.5         0.58333333]
 [ 0.45833333  0.45833333  0.5         0.58333333]
 [ 0.5         0.5         0.57142857  0.66666667]
 [ 0.58333333  0.58333333  0.66666667  1.        ]]
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]
As ssm said, numpy's correlate function works well for this problem. You mentioned that you are interested in the position. The correlate function can also help you tell how far one sequence is shifted from another.
import numpy as np

def compare(a, b):
    # 'full' pads the sequences with 0's so they are correlated
    # with as little as 1 actual element overlapping.
    crossCorrelation = np.correlate(a, b, 'full')
    bestShift = np.argmax(crossCorrelation)
    # This reverses the effect of the padding.
    actualShift = bestShift - len(b) + 1
    similarity = crossCorrelation[bestShift]
    print('Shift: ' + str(actualShift))
    print('Similarity: ' + str(similarity))
    return {'shift': actualShift, 'similarity': similarity}

print('\nExpected shift: 0')
compare([0,0,1,0,0], [0,0,1,0,0])
print('\nExpected shift: 2')
compare([0,0,1,0,0], [1,0,0,0,0])
print('\nExpected shift: -2')
compare([1,0,0,0,0], [0,0,1,0,0])
Edit:
You need to normalize each sequence before correlating them, or the larger sequences will have a very high correlation with all the other sequences.
A property of cross-correlation is that, for non-negative sequences, the peak value is bounded by the product of the sums: max(correlate(f, g)) <= sum(f) * sum(g).
So if you normalize by dividing each sequence by its sum, the similarity will always be between 0 and 1.
I recommend you don't take the absolute value of a sequence. That changes the shape, not just the scale. For instance np.abs([1, -2]) == [1, 2]. Normalizing will already ensure that the sequence is mostly positive and adds up to 1.
Second Edit:
I had a realization. Think of the signals as vectors. Normalized vectors always have a max dot product with themselves. Cross-Correlation is just a dot product calculated at various shifts. If you normalize the signals like you would a vector (divide s by sqrt(s dot s)), the self correlations will always be maximal and 1.
import numpy as np

def normalize(s):
    magSquared = np.correlate(s, s)  # s dot itself
    return s / np.sqrt(magSquared)

a = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
b = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
c = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
d = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])

a = normalize(a)
b = normalize(b)
c = normalize(c)
d = normalize(d)

M = np.zeros([4,4])
SH = np.zeros([4,4])
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        # Taking the absolute value catches signals which are flipped.
        crossCorrelation = np.abs(np.correlate(w1, w2, 'full'))
        bestShift = np.argmax(crossCorrelation)
        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]
        M[c1,c2] = similarity
        SH[c1,c2] = actualShift

print(M, '\n', SH)
Outputs:
[[ 1.          1.          0.97700842  0.86164044]
 [ 1.          1.          0.97700842  0.86164044]
 [ 0.97700842  0.97700842  1.          0.8819171 ]
 [ 0.86164044  0.86164044  0.8819171   1.        ]]
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]
You want to use a cross-correlation between the vectors:
https://en.wikipedia.org/wiki/Cross-correlation
https://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html
For example:
>>> np.correlate(A,B)
array([ 31.])
>>> np.correlate(A,C)
array([ 19.])
>>> np.correlate(A,D)
array([-28.])
If you don't care about the sign, you can simply take the absolute value ...
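For instance, assembling those single calls into a full sign-insensitive similarity matrix could look like this (a sketch; the signal list reuses A, B, C, D from the question):
import numpy as np

A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
C = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
D = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])

signals = [A, B, C, D]
# Default 'valid' mode on equal-length inputs gives a single dot product.
M = np.array([[abs(np.correlate(s, t)[0]) for t in signals]
              for s in signals])
print(M)  # e.g. M[0, 1] == 31.0 and M[0, 3] == 28.0, matching the calls above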

Python - replace masked data in arrays

I would like to replace all masked values in a 2D array with zero.
I saw that with np.copyto it is apparently possible to do that:
test = np.copyto(array, 0, where=mask)
But I get an error message... 'module' object has no attribute 'copyto'. Is there an equivalent way to do that?
Try numpy.ma.filled()
I think this is exactly what you need
In [29]: a
Out[29]: array([ 1, 0, 25, 0, 1, 4, 0, 2, 3, 0])
In [30]: am = n.ma.MaskedArray(n.ma.log(a),fill_value=0)
In [31]: am
Out[31]:
masked_array(data = [0.0 -- 3.2188758248682006 -- 0.0 1.3862943611198906 -- 0.6931471805599453 1.0986122886681098 --],
             mask = [False  True False  True False False  True False False  True],
             fill_value = 0.0)
In [32]: am.filled()
Out[32]:
array([ 0.        ,  0.        ,  3.21887582,  0.        ,  0.        ,
        1.38629436,  0.        ,  0.69314718,  1.09861229,  0.        ])
test = np.copyto(array, 0, where=mask) is equivalent to:
array = np.where(mask, 0, array)
test = None
(I'm not sure why you would want to assign a value to the return value of np.copyto; it always returns None if no Exception is raised.)
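For completeness, a minimal in-place example of the call (np.copyto was added in numpy 1.7, which is probably why an older installation raises the AttributeError):
import numpy as np

array = np.array([1., 2., 3., 4.])
mask = np.array([True, False, True, False])
np.copyto(array, 0, where=mask)  # modifies array in place; returns None
print(array)  # [0. 2. 0. 4.]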
Why not use array[mask] = 0?
Indeed, that would work (and has nicer syntax) if mask is a boolean array with the same shape as array. If mask doesn't have the same shape then array[mask] = 0 and np.copyto(array, 0, where=mask) may behave differently:
np.copyto (is documented to) and np.where (appears to) broadcast the shape of the mask to match array.
In contrast, array[mask] = 0 does not broadcast mask. This leads to a big difference in behavior when the mask does not have the same shape as array:
In [60]: array = np.arange(12).reshape(3,4)
In [61]: mask = np.array([True, False, False, False], dtype=bool)
In [62]: np.where(mask, 0, array)
Out[62]:
array([[ 0,  1,  2,  3],
       [ 0,  5,  6,  7],
       [ 0,  9, 10, 11]])
In [63]: array[mask] = 0
In [64]: array
Out[64]:
array([[ 0,  0,  0,  0],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
When array is 2-dimensional and mask is a 1-dimensional boolean array,
array[mask] is selecting rows of array (where mask is True) and
array[mask] = 0 sets those rows to zero.
Surprisingly, array[mask] does not raise an IndexError even though the mask has 4 elements and array only has 3 rows. No IndexError is raised when the fourth value is False, but an IndexError is raised if the fourth value is True:
In [91]: array[np.array([True, False, False, False])]
Out[91]: array([[0, 1, 2, 3]])
In [92]: array[np.array([True, False, False, True])]
IndexError: index 3 is out of bounds for axis 0 with size 3
