Add a mask to an N×2 nparray for value ranges - Python

Here's a sample nparray:
array([[ 0.70582116,  0.29417881],
       [ 0.65219176,  0.34780821],
       [ 0.82653958,  0.17346044],
       ...,
       [ 0.76903266,  0.23096734],
       [ 0.65070963,  0.3492904 ],
       [ 0.63485813,  0.36514184]], dtype=float32)
I intend to mask the first column so that if a value is greater than 0.7 it becomes 1, otherwise 0 (and vice versa for the second column). So in the end the nparray should look something like this:
array([[ 1, 0],
       [ 0, 1],
       [ 1, 0],
       ...,
       [ 1, 0],
       [ 0, 1],
       [ 0, 1]], dtype=float32)
How could I do it via numpy in a Pythonic way? Thanks!

IIUC, a little broadcasted logical comparison and conversion to int:
(x > 0.7).astype(int)

array([[1, 0],
       [0, 0],
       [1, 0],
       [1, 0],
       [0, 0],
       [0, 0]])

It's rather simple:
arr > 0.7
That gives a boolean array. To convert it to np.float32:
(arr > 0.7).astype(dtype=np.float32)

You could use numpy.column_stack:
x = numpy.array([[ 0.70582116, 0.29417881],
                 [ 0.65219176, 0.34780821],
                 [ 0.82653958, 0.17346044]])
col = x[:,0] > 0.7
final = numpy.column_stack([col, ~col]).astype(int)
Since col consists of booleans, ~col is the inverse of col.
Assuming your rows sum to 1, another way would be to compare it to numpy.array([0.7, 0.3]):
final = (x > numpy.array([0.7, 0.3])).astype(int)
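Putting the pieces together, here is a minimal runnable sketch (the variable names and sample rows are illustrative) that reproduces the asker's desired layout, assuming the two columns are complementary:
import numpy as np

# Illustrative (N, 2) sample, shaped like the asker's array.
x = np.array([[0.70582116, 0.29417881],
              [0.65219176, 0.34780821],
              [0.82653958, 0.17346044]], dtype=np.float32)

# Column 0 becomes 1 where it exceeds 0.7; column 1 is its complement.
col = x[:, 0] > 0.7
final = np.column_stack([col, ~col]).astype(np.float32)
print(final)
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]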


How to use slice notation in this case?

I am solving a problem in which I am using two large matrices, A and B. Matrix A contains ones and zeros, and matrix B contains integers in the range [0, ..., 10].
At some point, I have to update the components of A such that, if a component is 1, it stays at 1. If a component is 0, it stays at 0 with probability p or changes to 1 with probability 1-p. The parameter p is a function of the corresponding component of B: I have a list probabilities such that, when I update A[i,j], p equals probabilities[B[i,j]].
I can do the updating with the following code:
import numpy as np

for i in range(n):
    for j in range(n):
        if A[i, j] == 0:
            A[i, j] = np.random.choice(
                [0, 1],
                p=[probabilities[B[i, j]], 1 - probabilities[B[i, j]]],
            )
I think that there should be a faster way to update matrix A using slice notation. Any advice?
Note that this problem is equivalent to the following: given a vector a and a matrix B with positive integer entries, obtain a matrix C in which C[i,j] = a[B[i,j]].
The general answer
You can achieve this by generating a random value from a continuous distribution, and then comparing the generated value against the inverse cumulative probability function of that distribution.
To be practical
Let's use a uniform random variable X in the range [0, 1); then the probability that X < a is a. Assuming that A and B have the same shape, you can use numpy.where for this (make sure that your probabilities variable is a numpy array):
A[:, :] = np.where(A == 0, np.where(np.random.rand(*A.shape) < probabilities[B], 1, 0), A)
If you want to avoid computing the random values for positions where A is non-zero, you need more complex indexing:
A[A == 0] = np.where(np.random.rand(*A[A == 0].shape) < probabilities[B[A == 0]], 1, 0)
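As a self-contained sketch of the masked variant, with made-up toy data, assuming probabilities is a NumPy array indexed directly by the values in B and reading probabilities[B[i,j]] as the chance that a 0 flips to 1 (swap the comparison if it is the chance of staying 0):
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: A holds 0/1, B holds integers 0..10.
A = rng.integers(0, 2, size=(4, 4))
B = rng.integers(0, 11, size=(4, 4))
probabilities = np.linspace(0.0, 1.0, 11)   # indexed by the values of B

zero = A == 0                    # only the zeros may change
draws = rng.random(zero.sum())   # one uniform draw per zero entry
A[zero] = np.where(draws < probabilities[B[zero]], 1, 0)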
Make many p values in the 0-1 range:
In [36]: p=np.arange(1000)/1000
Your choice call, with sum to get an overall statistic:
In [37]: sum([np.random.choice([0,1],p=[p[i],1-p[i]]) for i in range(p.shape[0])])
Out[37]: 485
and a statistically similar random array without the comprehension:
In [38]: (np.random.rand(p.shape[0])>p).astype(int).sum()
Out[38]: 496
You are welcome to perform other tests to verify their equivalence.
If probabilities is a function that takes the whole B or B[mask], we should be able to do:
mask = A==0
n = mask.ravel().sum() # the number of true elements
A[mask] = (np.random.rand(n)>probabilities(B[mask])).astype(int)
To test this:
In [39]: A = np.random.randint(0,2, (4,5))
In [40]: A
Out[40]:
array([[1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 1, 0]])
In [41]: mask = A==0
In [42]: A[mask]
Out[42]: array([0, 0, 0, 0, 0, 0])
In [43]: B = np.arange(1,21).reshape(4,5)
In [44]: def foo(B): # scale the B values to probabilites
...: return B/20
...:
In [45]: foo(B)
Out[45]:
array([[0.05, 0.1 , 0.15, 0.2 , 0.25],
       [0.3 , 0.35, 0.4 , 0.45, 0.5 ],
       [0.55, 0.6 , 0.65, 0.7 , 0.75],
       [0.8 , 0.85, 0.9 , 0.95, 1.  ]])
In [46]: n = mask.ravel().sum()
In [47]: n
Out[47]: 6
In [51]: (np.random.rand(n)>foo(B[mask])).astype(int)
Out[51]: array([1, 1, 0, 1, 1, 0])
In [52]: A[mask] = (np.random.rand(n)>foo(B[mask])).astype(int)
In [53]: A
Out[53]:
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 1, 1, 0]])
In [54]: foo(B[mask])
Out[54]: array([0.15, 0.2 , 0.25, 0.65, 0.7 , 1. ])

How to get an array of indices of minimum values for a multidimensional array?

I have a multidimensional array with a shape of (2, 2, 3) like this:
array([[[  0.64,   0.49,   2.56],
        [  7.84,  13.69,  21.16]],

       [[ 33.64,  44.89,  57.76],
        [ 77.44,  94.09, 112.36]]])
I would like to find the indices of the minimum of each row. So for this example there are 4 minimums, which are 0.49, 7.84, 33.64 and 77.44.
To get the indices of those minimums I thought this would work:
idx_arr = np.unravel_index(np.argmin(my_array,axis=2),my_array.shape)
This yields the following array of indices :
(array([[0, 0],
        [0, 0]]),
 array([[0, 0],
        [0, 0]]),
 array([[1, 0],
        [0, 0]]))
However, the minimums are not correctly retrieved, as one can see:
my_array[idx_arr]
array([[0.49, 0.64],
       [0.64, 0.64]])
What am I missing there ?
The argmin is actually calculating the indices correctly, but you misunderstand what np.unravel_index is expecting.
From docs:
Converts a flat index or array of flat indices into a tuple of
coordinate arrays.
To see what kind of input it would accept to give the desired output here, we need to focus on the main point: it converts flat indices into the correct coordinate arrays for particular locations in non-flat terms. Essentially, what it expects are the coordinates of your desired points as if your input array were flattened.
import numpy as np
inp = np.array([[[  0.64,   0.49,   2.56],
                 [  7.84,  13.69,  21.16]],

                [[ 33.64,  44.89,  57.76],
                 [ 77.44,  94.09, 112.36]]])
idx = inp.argmin(axis=-1)
#Output:
array([[1, 0],
       [0, 0]], dtype=int64)
Note that you cannot pass this idx directly, because it does not represent correct coordinates for a flattened version of the inp array.
That would look more like the following:
flat_idx = np.arange(0, idx.size*inp.shape[-1], inp.shape[-1]) + idx.flatten()
#Output:
array([1, 3, 6, 9], dtype=int64)
And we can see unravel_index accepts it happily.
temp = np.unravel_index(flat_idx, inp.shape)
#Output:
(array([0, 0, 1, 1], dtype=int64),
 array([0, 1, 0, 1], dtype=int64),
 array([1, 0, 0, 0], dtype=int64))
inp[temp]
#Output:
array([ 0.49, 7.84, 33.64, 77.44])
Also, looking at the output tuple, we can see that it is not too difficult to recreate it ourselves. Notice that the last array corresponds to a flattened form of idx, while the first two arrays essentially enable indexing through the first two axes of inp.
And to prepare that, we can actually use the unravel_index function in a rather nifty way, as follows:
real_idx = (*np.unravel_index(np.arange(idx.size), idx.shape), idx.flatten())
inp[real_idx]
#Output:
array([ 0.49, 7.84, 33.64, 77.44])
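If only the minimum values per row are needed (rather than the full index tuple), a shorter route, not part of the answer above and offered only as a sketch, is to pair argmin with np.take_along_axis:
import numpy as np

inp = np.array([[[  0.64,   0.49,   2.56],
                 [  7.84,  13.69,  21.16]],
                [[ 33.64,  44.89,  57.76],
                 [ 77.44,  94.09, 112.36]]])

idx = inp.argmin(axis=-1)                                   # shape (2, 2)
# take_along_axis pairs each index with its own row along the last axis.
mins = np.take_along_axis(inp, idx[..., None], axis=-1).squeeze(-1)
print(mins)
# [[ 0.49  7.84]
#  [33.64 77.44]]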

Updating numpy array values based on multiple conditions

I have an array P as shown below:
P
array([[ 0.49530662,  0.32619367,  0.54593724, -0.0224462 ],
       [-0.10503237,  0.48607405,  0.28572714,  0.15175049],
       [ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
       [ 0.14353725, -0.35624814,  0.25655861, -0.09241335]])
and a vector y:
y
array([0, 0, 1, 0], dtype=int16)
I want to modify another matrix Z, which has the same dimensions as P, such that Z_ij = y_j wherever P_ij < 0.
In the above example, my Z matrix should be
Z = array([[-, -, -, 0],
           [0, -, -, -],
           [-, 0, 1, 0],
           [-, 0, -, 0]])
Where '-' indicates the original Z values. What I thought of is a very straightforward implementation that basically iterates through each row of Z, comparing the column values against the corresponding y and P. Do you know a better pythonic/numpy approach?
What you need is np.where. This is how to use it:
import numpy as np
z = np.array([[ 0.49530662,  0.32619367,  0.54593724, -0.0224462 ],
              [-0.10503237,  0.48607405,  0.28572714,  0.15175049],
              [ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
              [ 0.14353725, -0.35624814,  0.25655861, -0.09241335]])
y = [0, 0, 1, 0]
result = np.where(z<0,y,z)
#Where z<0, replace it by y
Result
>>> print(result)
[[0.49530662 0.32619367 0.54593724 0.        ]
 [0.         0.48607405 0.28572714 0.15175049]
 [0.0286128  0.         1.         0.        ]
 [0.14353725 0.         0.25655861 0.        ]]
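An equivalent sketch with boolean indexing instead of np.where, using a hypothetical Z distinct from P and the signs of P as the mask:
import numpy as np

P = np.array([[ 0.49530662,  0.32619367,  0.54593724, -0.0224462 ],
              [-0.10503237,  0.48607405,  0.28572714,  0.15175049],
              [ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
              [ 0.14353725, -0.35624814,  0.25655861, -0.09241335]])
y = np.array([0, 0, 1, 0])
Z = np.ones_like(P)        # hypothetical Z with the same shape as P

neg = P < 0
# Broadcast y across the rows of Z and overwrite only the flagged entries.
Z[neg] = np.broadcast_to(y, Z.shape)[neg]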

Conditional average with numpy

Given a 3x2 array, I want to calculate the average along axis=0, but only considering values that are larger than 0.
So given the array
[ [1,0],
  [0,0],
  [1,0] ]
I want the output to be
# 1, 0, 1 filtered for > 0 gives 1, 1, average = (1+1)/2 = 1
# 0, 0, 0 filtered for > 0 gives nothing, so the average should fall back to 0
[1 0]
My current code is
import numpy as np
frame = np.array([ [1,0],
                   [0,0],
                   [1,0] ])
weights = np.array(frame) > 0
print("weights:")
print(weights)
print("average without weights:")
print((np.average(frame, axis=0)))
print("average with weights:")
print((np.average(frame, axis=0, weights=weights)))
This gives me
weights:
[[ True False]
[False False]
[ True False]]
average without weights:
[ 0.66666667 0. ]
average with weights:
Traceback (most recent call last):
File "C:\Users\myuser\project\test.py", line 123, in <module>
print((np.average(frame, axis=0, weights=weights)))
File "C:\Users\myuser\Miniconda3\envs\myenv\lib\site-packages\numpy\lib\function_base.py", line 1140, in average
"Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized
I don't understand this error. What am I doing wrong and how can I get the average for all values greater than zero along axis=0? Thanks!
You can get the mask of elements greater than zero and use it to do elementwise multiplication and sum-reduction along the first axis. Finally, divide by the number of masked elements along the first axis to get the average values.
Thus, one solution would be -
mask = a > 0 # Input array : a
out = np.einsum('i...,i...->...',a,mask)/mask.sum(0)
Sample run -
In [52]: a
Out[52]:
array([[ 3, -3,  3],
       [ 2,  2,  0],
       [ 0, -3,  1],
       [ 0,  1,  1]])
In [53]: mask = a > 0
In [56]: np.einsum('i...,i...->...',a,mask) # summations of > 0s
Out[56]: array([5, 3, 5])
In [57]: np.einsum('i...,i...->...',a,mask)/mask.sum(0) # avg values of >0s
Out[57]: array([ 2.5 , 1.5 , 1.66666667])
To account for all zero columns, it seems we are expecting 0 as the result. So, we can use np.where to do the choosing, like so -
In [61]: a[:,-1] = 0
In [62]: a
Out[62]:
array([[ 3, -3,  0],
       [ 2,  2,  0],
       [ 0, -3,  0],
       [ 0,  1,  0]])
In [63]: mask = a > 0
In [65]: np.where( mask.any(0), np.einsum('i...,i...->...',a,mask)/mask.sum(0), 0)
__main__:1: RuntimeWarning: invalid value encountered in true_divide
Out[65]: array([ 2.5, 1.5, 0. ])
Just ignore the warning there.
If you feel paranoid about warnings, use masking -
mask = a > 0
vm = mask.any(0) # valid mask
out = np.zeros(a.shape[1])
out[vm] = np.einsum('ij,ij->j',a[:,vm],mask[:,vm])/mask.sum(0)[vm]
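A masked-array alternative, offered only as a sketch of the same idea, sidesteps the warning entirely by letting columns with no positive entries fall back to 0:
import numpy as np

frame = np.array([[1, 0],
                  [0, 0],
                  [1, 0]])

# Mask out everything <= 0, take the column means, and fill
# fully-masked columns (no positive entries) with 0.
masked = np.ma.masked_less_equal(frame, 0)
out = masked.mean(axis=0).filled(0)
print(out)   # [1. 0.]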

How to create a similarity matrix in numpy (Python)?

I have data in a file in the following form:
user_id, item_id, rating
1, abc, 5
1, abcd, 3
2, abc, 3
2, fgh, 5
So, the matrix I want to form for the above data is the following:
# item_ids
# abc abcd fgh
[[5, 3, 0],   # user_id 1
 [3, 0, 5]]   # user_id 2
where missing data is replaced by 0.
From this I want to create both a user-to-user similarity matrix and an item-to-item similarity matrix.
How do I do that?
Technically, this is not a programming problem but a math problem. But I think you are better off using a variance-covariance matrix, or a correlation matrix if the scales of the values are very different, say, instead of having:
>>> x
array([[5, 3, 0],
       [3, 0, 5],
       [5, 5, 0],
       [1, 1, 7]])
You have:
>>> x
array([[  5, 300,   0],
       [  3,   0,   5],
       [  5, 500,   0],
       [  1, 100,   7]])
To get a variance-covariance matrix:
>>> np.cov(x)
array([[  6.33333333,  -3.16666667,   6.66666667,  -8.        ],
       [ -3.16666667,   6.33333333,  -5.83333333,   7.        ],
       [  6.66666667,  -5.83333333,   8.33333333, -10.        ],
       [ -8.        ,   7.        , -10.        ,  12.        ]])
Or the correlation matrix:
>>> np.corrcoef(x)
array([[ 1.        , -0.5       ,  0.91766294, -0.91766294],
       [-0.5       ,  1.        , -0.80295507,  0.80295507],
       [ 0.91766294, -0.80295507,  1.        , -1.        ],
       [-0.91766294,  0.80295507, -1.        ,  1.        ]])
This is the way to look at it: the diagonal cell, i.e., the (0,0) cell, is the correlation of your 1st vector in x with itself, so it is 1. The other cells, e.g., the (0,1) cell, give the correlation between the 1st and 2nd vectors in x; they are negatively correlated. Similarly, the 1st and 3rd vectors are positively correlated.
A covariance matrix or correlation matrix avoids the zero problem pointed out by @Akavall.
See this question: What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
Having:
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

A = np.array(
    [[0, 1, 0, 0, 1],
     [0, 0, 1, 1, 1],
     [1, 1, 0, 1, 0]])
dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out
Results in:
array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])
But that works for a dense matrix. For a sparse matrix you have to develop your own solution.
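As a pure-NumPy sketch (the names and the dense sample matrix are illustrative), this builds both similarity matrices from the question's ratings via cosine similarity:
import numpy as np

# Dense user-item matrix from the question (rows: users 1, 2;
# columns: abc, abcd, fgh; missing ratings as 0).
R = np.array([[5., 3., 0.],
              [3., 0., 5.]])

def cosine_sim(M):
    # Cosine similarity between the rows of M.
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_sim(R)      # 2x2 user-to-user similarity
item_sim = cosine_sim(R.T)    # 3x3 item-to-item similarity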
