How to remove numpy columns based on condition?

I have a NumPy array containing the correlation between each column and a label column:
[0.5 -0.02 0.2]
And another NumPy array containing the columns themselves:
[[0.42 0.35 0.6]
[0.3 0.34 0.2]]
Can I use a function to determine which columns to keep?
Such as
abs(cors) > 0.05
which yields
[True False True]
so that the resulting NumPy array becomes
[[0.42 0.6]
[0.3 0.2]]
May I know how to achieve this?

You can do boolean indexing along values with something like this:
a = np.array([
[1, 2, 3],
[4, 5, 6]
])
b = np.array([
[True, False, True],
[False, True, False]
])
new_a = a[b]
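Or, to do boolean indexing along rows and columns at the same time, combine the two 1-D masks with np.ix_:
a = np.array([
[1, 2, 3],
[4, 5, 6]
])
b = np.array([True, False, True]) # columns to keep
c = np.array([False, True]) # rows to keep
# a[c, b] would pair the selected row and column indices element-wise,
# which only works when their counts happen to broadcast.
new_a = a[np.ix_(c, b)] # array([[4, 6]])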
So, for your example you could do:
a = np.array([
[0.42, 0.35, 0.6],
[0.3, 0.34, 0.2]
])
cors = np.array([0.5, -0.02, 0.2])
new_a = a[:, abs(cors) > 0.05]
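
As a minimal sketch of the same idea wrapped in a reusable function: the name keep_columns and its threshold parameter are hypothetical, introduced here only for illustration.

```python
import numpy as np

def keep_columns(data, cors, threshold=0.05):
    """Keep the columns of `data` whose |correlation| exceeds `threshold`.

    Hypothetical helper, not part of NumPy.
    """
    mask = np.abs(cors) > threshold   # one boolean per column
    return data[:, mask], mask

a = np.array([[0.42, 0.35, 0.6],
              [0.3, 0.34, 0.2]])
cors = np.array([0.5, -0.02, 0.2])

new_a, mask = keep_columns(a, cors)
print(mask.tolist())    # [True, False, True]
print(new_a.tolist())   # [[0.42, 0.6], [0.3, 0.2]]
```

Returning the mask alongside the filtered array makes it easy to apply the same selection to other arrays with matching columns, such as a list of column names.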

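Alternatively, you can delete the unwanted columns with np.delete (note the absolute value, so that strongly negative correlations are kept):
new_array = np.delete(array, np.where(np.abs(cors) <= 0.05)[0], axis=1)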

Related

Any numpy/torch style to set value given an index ndarray and a flag ndarray?

I'm working with PyTorch and have run into a problem that I don't know how to solve in a torch/numpy style. For example, suppose I have three PyTorch tensors:
import torch
import numpy as np
indices = torch.from_numpy(np.array([[2, 1, 3, 0], [1, 0, 3, 2]]))
flags = torch.from_numpy(np.array([[False, False, False, True], [False, False, True, True]]))
tensor = torch.from_numpy(np.array([[2.8, 0.5, 1.2, 0.9], [3.1, 2.8, 1.3, 2.5]]))
Here flags is a boolean tensor marking which elements of indices should be extracted. Given the extracted indices, I want to set the corresponding elements of tensor to a given constant (say 1e-30). Based on the example above, I want
>>> sub_indices = indices.op1(flags)
>>> sub_indices
tensor([[0], [3, 2]])
>>> tensor.op2(sub_indices, 1e-30)
>>> tensor
tensor([[1e-30, 0.5, 1.2, 0.9], [3.1, 2.8, 1e-30, 1e-30]])
Could anyone suggest a solution? I'm currently using a list comprehension, which I find a bit ugly. I tried indices[flags], but it only returns the 1-D array [0, 3, 2], so applying it would change columns 0, 2, 3 in every row.
Some additional remarks:
The number of True values in each row of flags is not known in advance.
Each row of indices is guaranteed to be a permutation of the sequence 0 ... N - 1.
Below is a NumPy version of the example code, for convenient copy-pasting. I wonder whether this can be done in pure NumPy.
import numpy as np
indices = np.array([[2, 1, 3, 0], [1, 0, 3, 2]])
flags = np.array([[False, False, False, True], [False, False, True, True]])
tensor = np.array([[2.8, 0.5, 1.2, 0.9], [3.1, 2.8, 1.3, 2.5]])

You can reorder flags according to indices to create a mask, then use the mask to select between the original values and the constant. Here is some example code:
import numpy as np
indices = np.array([[2, 1, 3, 0], [1, 0, 3, 2]])
flags = np.array([[False, False, False, True], [False, False, True, True]])
tensor = np.array([[2.8, 0.5, 1.2, 0.9], [3.1, 2.8, 1.3, 2.5]])
indices_sorted = indices.argsort(axis=1)
mask = np.take_along_axis(flags, indices_sorted, axis=1)
result = tensor * (1 - mask) + 1e-30 * mask
I'm not very familiar with PyTorch, but I suspect that gathering a ragged tensor is not a good idea. Even in the worst case, though, you can convert to and from NumPy arrays.
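
A possible variation, sketched here under the assumption that NumPy 1.15 or later is available: instead of inverting the permutation with argsort, the flags can be scattered straight into a mask with np.put_along_axis, and np.where can do the final selection. This is an alternative to the code above, not a restatement of it.

```python
import numpy as np

indices = np.array([[2, 1, 3, 0], [1, 0, 3, 2]])
flags = np.array([[False, False, False, True], [False, False, True, True]])
tensor = np.array([[2.8, 0.5, 1.2, 0.9], [3.1, 2.8, 1.3, 2.5]])

# Scatter each flag to the column named by the matching entry of `indices`:
# mask[i, indices[i, j]] = flags[i, j]
mask = np.zeros_like(flags)
np.put_along_axis(mask, indices, flags, axis=1)

# Pick the constant where the mask is set, the original value elsewhere.
result = np.where(mask, 1e-30, tensor)
print(result.tolist())
# [[1e-30, 0.5, 1.2, 0.9], [3.1, 2.8, 1e-30, 1e-30]]
```

Because each row of indices is a permutation, every mask entry is written exactly once, so the scatter has no conflicting writes.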

The PyTorch version of @soloice's solution. In PyTorch, torch.gather plays the role of np.take_along_axis.
import torch
indices = torch.tensor([[2, 1, 3, 0], [1, 0, 3, 2]])
flags = torch.tensor([[False, False, False, True], [False, False, True, True]])
tensor = torch.tensor([[2.8, 0.5, 1.2, 0.9], [3.1, 2.8, 1.3, 2.5]])
indices_sorted = indices.argsort(dim=1)
mask = torch.gather(flags, 1, indices_sorted).float()
result = tensor * (1 - mask) + 1e-30 * mask

Tensorflow: Masking an array based on duplicated elements of another array

I have an array, x=[2, 3, 4, 3, 2], which contains the states of a model, and another array which gives the corresponding probabilities of these states, prob=[.2, .1, .4, .1, .2]. Some states are duplicated, and I need to sum their corresponding probabilities. So my desired outputs are: unique_elems=[2, 3, 4] and reduced_prob=[.2+.2, .1+.1, .4]. Here is my approach:
x = tf.constant([2, 3, 4, 3, 2])
prob = tf.constant([.2, .1, .4, .1, .2])
unique_elems, _ = tf.unique(x) # [2, 3, 4]
unique_elems = tf.expand_dims(unique_elems, axis=1) # [[2], [3], [4]]
tiled_prob = tf.tile(tf.expand_dims(prob, axis=0), [3, 1])
# [[0.2, 0.1, 0.4, 0.1, 0.2],
# [0.2, 0.1, 0.4, 0.1, 0.2],
# [0.2, 0.1, 0.4, 0.1, 0.2]]
equal = tf.equal(x, unique_elems)
# [[ True, False, False, False, True],
# [False, True, False, True, False],
# [False, False, True, False, False]]
reduced_prob = tf.multiply(tiled_prob, tf.cast(equal, tf.float32))
# [[0.2, 0. , 0. , 0. , 0.2],
# [0. , 0.1, 0. , 0.1, 0. ],
# [0. , 0. , 0.4, 0. , 0. ]]
reduced_prob = tf.reduce_sum(reduced_prob, axis=1)
# [0.4, 0.2, 0.4]
This works, but I wonder whether there is a more efficient way to do it. In particular, I am using the tile operation, which I suspect is not very efficient for large arrays.

It can be done in two lines with tf.unsorted_segment_sum (available as tf.math.unsorted_segment_sum in TensorFlow 2.x):
unique_elems, idx = tf.unique(x) # [2, 3, 4]
reduced_prob = tf.unsorted_segment_sum(prob, idx, tf.size(unique_elems))
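
For checking results outside TensorFlow, the same unique-then-segment-sum idea can be sketched in NumPy. This is an analogue rather than the TensorFlow API: np.unique returns the unique values sorted, whereas tf.unique keeps them in order of first appearance (the two coincide for this input).

```python
import numpy as np

x = np.array([2, 3, 4, 3, 2])
prob = np.array([0.2, 0.1, 0.4, 0.1, 0.2])

# inverse[i] is the position of x[i] inside unique_elems,
# so it plays the role of the segment ids.
unique_elems, inverse = np.unique(x, return_inverse=True)

# Sum the probabilities that share a segment id.
reduced_prob = np.bincount(inverse, weights=prob)

print(unique_elems.tolist())   # [2, 3, 4]
print(reduced_prob)            # approximately [0.4, 0.2, 0.4]
```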

Get top N values from each sub-array from 2D numpy array [duplicate]

I think this is an easy question for experienced numpy users.
I have a score matrix. The row index corresponds to samples and the column index corresponds to items. For example,
score_matrix =
[[ 1. , 0.3, 0.4],
[ 0.2, 0.6, 0.8],
[ 0.1, 0.3, 0.5]]
I want to get the top-M item indices for each sample. I also want to get the top-M scores. For example,
top2_ind =
[[0, 2],
[2, 1],
[2, 1]]
top2_score =
[[1. , 0.4],
[0.8, 0.6],
[0.5, 0.3]]
What is the best way to do this using numpy?

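Here's an approach using np.argpartition -
idx = np.argpartition(-a,range(M))[:,:M] # topM_ind
out = a[np.arange(a.shape[0])[:,None],idx] # topM_score
Negating a turns the M largest values into the M smallest, and passing range(M) as kth places them, already ordered, in the first M columns. (Partitioning a itself on range(M) and slicing from the end only works when a has at most M+1 columns, because the remaining columns are left unordered.)
Sample run -
In [343]: a
Out[343]:
array([[ 1. , 0.3, 0.4],
[ 0.2, 0.6, 0.8],
[ 0.1, 0.3, 0.5]])
In [344]: M = 2
In [345]: idx = np.argpartition(-a,range(M))[:,:M]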
In [346]: idx
Out[346]:
array([[0, 2],
[2, 1],
[2, 1]])
In [347]: a[np.arange(a.shape[0])[:,None],idx]
Out[347]:
array([[ 1. , 0.4],
[ 0.8, 0.6],
[ 0.5, 0.3]])
Alternatively, np.argsort gives shorter, though possibly slower, code for idx -
idx = a.argsort(1)[:,:-M-1:-1]
Here's a post containing some runtime tests that compare np.argsort and np.argpartition on a similar problem.
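
To see the partition-based idea on a matrix with more columns than M+1, here is a minimal sketch. The helper name top_m is hypothetical, and the example assumes 1 <= M <= number of columns and NumPy 1.15+ for np.take_along_axis.

```python
import numpy as np

def top_m(a, M):
    """Return (indices, scores) of the M largest entries per row, largest first.

    Hypothetical helper for illustration only.
    """
    # Partition so the M largest of each row land (unordered) in the first M columns.
    part = np.argpartition(-a, M - 1, axis=1)[:, :M]
    part_scores = np.take_along_axis(a, part, axis=1)
    # Order just those M candidates, descending.
    order = np.argsort(-part_scores, axis=1)
    idx = np.take_along_axis(part, order, axis=1)
    return idx, np.take_along_axis(a, idx, axis=1)

a = np.array([[1.0, 0.3, 0.4, 0.9, 0.2],
              [0.2, 0.6, 0.8, 0.1, 0.7],
              [0.1, 0.3, 0.5, 0.4, 0.0]])
idx, scores = top_m(a, 2)
print(idx.tolist())      # [[0, 3], [2, 4], [2, 3]]
print(scores.tolist())   # [[1.0, 0.9], [0.8, 0.7], [0.5, 0.4]]
```

Partitioning on a single kth and then sorting only the M candidates keeps the sort cost proportional to M rather than to the row length.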

I'd use argsort():
top2_ind = score_matrix.argsort()[:,::-1][:,:2]
That is, produce an array which contains the indices which would sort score_matrix:
array([[1, 2, 0],
[0, 1, 2],
[0, 1, 2]])
Then reverse the columns with ::-1, then take the first two columns with :2:
array([[0, 2],
[2, 1],
[2, 1]])
Then similar but with regular np.sort() to get the values:
top2_score = np.sort(score_matrix)[:,::-1][:,:2]
Which, following the same mechanics as above, gives you:
array([[ 1. , 0.4],
[ 0.8, 0.6],
[ 0.5, 0.3]])

In case someone is interested in both the values and their corresponding indices without tampering with the original order, the following simple approach may help. It can be computationally expensive for large data, since it loops in Python and stores (value, index) tuples in a list.
import numpy as np
values = np.array([0.01, 0.6, 0.4, 0.0, 0.1, 0.7, 0.12]) # a simple array
remaining = values.copy() # working copy, so values itself is left untouched
values_indices = [] # list to store (value, index) tuples
while np.isfinite(remaining).any():
    i = int(remaining.argmax())
    values_indices.append((float(values[i]), i))
    # mask out the maximum instead of deleting it, so later indices still refer to values:
    remaining[i] = -np.inf
The final output as a list of tuples:
values_indices
[(0.7, 5), (0.6, 1), (0.4, 2), (0.12, 6), (0.1, 4), (0.01, 0), (0.0, 3)]

Easy way would be:
To get top-2 indices
np.argsort(-score_matrix)[:, :2]
To get top-2 values
-np.sort(-score_matrix)[:, :2]

How do I sort the rows of a 2d numpy array based on indices given by another 2d numpy array

Example:
arr = np.array([[.5, .25, .19, .05, .01],[.25, .5, .19, .05, .01],[.5, .25, .19, .05, .01]])
print(arr)
[[ 0.5 0.25 0.19 0.05 0.01]
[ 0.25 0.5 0.19 0.05 0.01]
[ 0.5 0.25 0.19 0.05 0.01]]
idxs = np.argsort(arr)
print(idxs)
[[4 3 2 1 0]
[4 3 2 0 1]
[4 3 2 1 0]]
How can I use idxs to index arr? I want to do something like arr[idxs], but this does not work.

It's not the prettiest, but I think something like
>>> arr[np.arange(len(arr))[:,None], idxs]
array([[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ]])
should work. The first term gives the x coordinates we want (using broadcasting over the last singleton axis):
>>> np.arange(len(arr))[:,None]
array([[0],
[1],
[2]])
with idxs providing the y coordinates. Note that if we had used unravel_index, the x coordinates to use would always have been 0 instead:
>>> np.unravel_index(idxs, arr.shape)[0]
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
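
On NumPy 1.15 and later, np.take_along_axis expresses the same row-wise lookup without building the row coordinates by hand. A short sketch comparing the two on the question's data:

```python
import numpy as np

arr = np.array([[.5, .25, .19, .05, .01],
                [.25, .5, .19, .05, .01],
                [.5, .25, .19, .05, .01]])
idxs = np.argsort(arr)

# For each row i, pick arr[i, idxs[i, j]]; the row index is implied by `axis`.
by_helper = np.take_along_axis(arr, idxs, axis=1)

# The explicit fancy-indexing form from the answer above.
by_fancy = arr[np.arange(len(arr))[:, None], idxs]

print(np.array_equal(by_helper, by_fancy))   # True
```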

How about something like this?
I changed the variable names to make the example clearer, but you basically need to index with two 2-D arrays.
In [102]: a = np.array([[1,2,3], [4,5,6]])
In [103]: b = np.array([[0,2,1], [2,1,0]])
In [104]: temp = np.repeat(np.arange(a.shape[0]), a.shape[1]).reshape(a.shape).T
# temp is just [[0,1], [0,1], [0,1]]
# probably can be done more elegantly
In [105]: a[temp, b.T].T
Out[105]:
array([[1, 3, 2],
[6, 5, 4]])

Numpy Arrays: Slice y-values array based on threshold, then slice the x-values array correspondingly

Very quick question, can't find an answer with these keywords. What is a better way of doing the following?
t = linspace(0,1000,300)
x0 = generic_function(t)
x1 = x0[x0>0.8]
t1 = t[t>t[len(x0)-len(x1)-1]]
The operation I'm using for t1 strikes me as very un-Pythonic and inefficient. Any pointers?

IIUC, you can simply reuse the cut array. For example:
>>> from numpy import arange, sin
>>> t = arange(5)
>>> t
array([0, 1, 2, 3, 4])
>>> y = sin(t)
>>> y
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
As you've already done, you can make a bool array:
>>> y > 0.8
array([False, True, True, False, False], dtype=bool)
and then you can use this to filter both t and y:
>>> t[y > 0.8]
array([1, 2])
>>> y[y > 0.8]
array([ 0.84147098, 0.90929743])
No use of len or assumptions about monotonicity involved.
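
Applied to the question's variables, this might look like the sketch below. The body of generic_function is a made-up stand-in, since the original function is not shown.

```python
import numpy as np

def generic_function(t):
    # Hypothetical stand-in for the question's function.
    return np.sin(t / 50.0)

t = np.linspace(0, 1000, 300)
x0 = generic_function(t)

mask = x0 > 0.8   # compute the condition once
x1 = x0[mask]     # values above the threshold
t1 = t[mask]      # same mask, so t1[i] is the t at which x1[i] occurred
```

Storing the mask in a variable avoids evaluating the comparison twice and keeps x1 and t1 aligned element for element, even when the values above the threshold are not contiguous.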
