Keeping the indexes of deleted column - python

I want to remove features with low variance in my array of data. By using scikit-learn, the code will look like below.
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = VarianceThreshold()
>>> selector.fit_transform(X)
array([[2, 0],
[1, 4],
[1, 1]])
My question is: how can I get the indexes of the columns that were deleted? Let's say I want to use them to delete the same columns from another array (the 0th and 3rd columns in the example above).
Any idea?

selector.get_support() returns a boolean array that shows which columns are kept and which are removed. In the above case:
selector.get_support()
will return
array([False, True, True, False], dtype=bool)
which means the first and last columns of the original input (X) were removed.
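As a minimal sketch of how that mask could be reused, the snippet below hard-codes the boolean array from the get_support() output above instead of calling scikit-learn. The names support, removed and other are illustrative assumptions, and other stands for any second array with the same number of columns as X.

```python
import numpy as np

# Mask copied from the selector.get_support() output shown above.
support = np.array([False, True, True, False])

# Indexes of the columns that were removed (where the mask is False).
removed = np.flatnonzero(~support)

# Hypothetical second array with the same number of columns as X.
other = np.array([[10, 11, 12, 13],
                  [20, 21, 22, 23]])

# Keep only the columns that the selector kept.
other_kept = other[:, support]

print(removed)
print(other_kept)
```

Indexing with the boolean mask avoids deleting by index at all; np.delete(other, removed, axis=1) should give the same result if the integer indexes are preferred.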

Related

Delete numpy axis 1 based on condition

I need to remove values from a np axis based on a condition.
For example, I want to remove [:,2] (the third value along axis 1) if the first value == 0; otherwise I want to remove [:,3].
Input:
[[0,1,2,3],[0,2,3,4],[1,3,4,5]]
Output:
[[0,1,3],[0,2,4],[1,3,4]]
So now my output has one less value on the 1st axis, depending on if it met the condition or not.
I know I can isolate and manipulate this based on
array[np.where(array[:,0] == 0)] but then I would have to deal with each condition separately, and it's very important for me to preserve the order of this array.
I am dealing with 3D arrays & am hoping to be able to calculate all this simultaneously while preserving the order.
Any help is much appreciated!
A possible solution:
import numpy as np
a = np.array([[0,1,2,3],[0,2,3,4],[1,3,4,5]])
b = np.arange(a.shape[1])
np.apply_along_axis(
    lambda x: x[np.where(x[0] == 0, np.delete(b,2), np.delete(b,3))], 1, a)
Output:
array([[0, 1, 3],
[0, 2, 4],
[1, 3, 4]])
Since you are starting and ending with a list, a straightforward iteration is a good solution:
In [261]: alist =[[0,1,2,3],[0,2,3,4],[1,3,4,5]]
In [262]: for row in alist:
     ...:     if row[0]==0: row.pop(2)
     ...:     else: row.pop(3)
     ...:
In [263]: alist
Out[263]: [[0, 1, 3], [0, 2, 4], [1, 3, 4]]
A possible array approach:
In [273]: arr = np.array([[0,1,2,3],[0,2,3,4],[1,3,4,5]])
In [274]: mask = np.ones(arr.shape, bool)
In [275]: mask[np.arange(3),np.where(arr[:,0]==0,2,3)]=False
In [276]: mask
Out[276]:
array([[ True, True, False, True],
[ True, True, False, True],
[ True, True, True, False]])
arr[mask] will be 1d, but since we are deleting the same number of elements each row, we can reshape it:
In [277]: arr[mask].reshape(arr.shape[0],-1)
Out[277]:
array([[0, 1, 3],
[0, 2, 4],
[1, 3, 4]])
I expect the list approach will be faster for small cases, but the array approach should scale better. I don't know where the trade-off point is.
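The mask approach can be wrapped in a small helper so that the column to drop is computed once per row. This is only a sketch for the 2-D case; drop_one_per_row is a hypothetical name, and it assumes exactly one element is removed from every row so that the reshape is valid.

```python
import numpy as np

def drop_one_per_row(arr, cols):
    """Remove element cols[i] from row i of a 2-D array (hypothetical helper)."""
    mask = np.ones(arr.shape, dtype=bool)
    # Pair each row index with the column to drop in that row.
    mask[np.arange(arr.shape[0]), cols] = False
    # Boolean indexing flattens in row order, so the rows can be rebuilt.
    return arr[mask].reshape(arr.shape[0], -1)

arr = np.array([[0, 1, 2, 3], [0, 2, 3, 4], [1, 3, 4, 5]])
# Drop column 2 where the first value is 0, otherwise column 3.
cols = np.where(arr[:, 0] == 0, 2, 3)
result = drop_one_per_row(arr, cols)
print(result)
```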

using integer as index for multidimensional numpy array

I have boolean array of shape (n_samples, n_items) which represents a set: my_set[i, j] tells if sample i contains item j.
To populate it, the array is initialized with zeros, and I receive another array of integers with shape (n_samples, 3) that gives, for each sample, three elements that belong to it, for instance:
my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1,3,4], [0,1,2]], dtype=np.int64)
So, I need to fill my_set with ones in row 0, columns 1, 3, 4, and in row 1, columns 0, 1, 2.
init_values contains valid values in the appropriate range (that is, in [0, n_items)), and no row contains duplicated items.
Some failed approaches:
I know that a list of integers (or array) can be used as index, so I tried to use init_values as index straightforward, but it failed:
my_set[init_values] = 1
File "<ipython-input-9-9b2c4d19f4f6>", line 1, in <cell line: 1>
my_set[init_values] = 1
IndexError: index 3 is out of bounds for axis 0 with size 2
I don't know why the 3 is indexing over the first axis, so I tried a second approach: "pick up all rows and index only the desired columns", using a mix of slicing and integer indexing. It didn't throw an error, but it didn't work as expected either. Check out the shape: I expected it to be (2, 3), however...
my_set[:, init_values].shape
Out[11]: (2, 2, 3)
Not sure why it didn't work, but at least the first axis looks correct, so I tried to pick up only the first column, which is a list of integers and therefore "more natural"... once again, it didn't work:
my_set[:, init_values[:,0]].shape
Out[12]: (2, 2)
I expected this shape to be (2, 1) since I wanted all rows with a single column on each, corresponding to the indexes given in init_values.
I decided to go back to integer index approach for the first axis.... and it worked:
my_set[np.arange(len(my_set)), init_values[:,0]].shape
Out[13]: (2,)
However, it only works for one column, so I need to iterate over the columns to make it really work, but it looks like a good initial workaround.
Current solution
So, to solve my original problem, I wrote this:
for c in range(init_values.shape[1]):
    my_set[np.arange(len(my_set)), init_values[:,c]] = 1
# now lets check my_set is properly filled
print(my_set)
Out[14]: [[False True False True True]
[ True True True False False]]
which is exactly what I need.
Question(s):
That said, here goes my main question:
Is there a more efficient way to do this? I see it quite inefficient as the number of elements grows (for this example I used 3 but I actually need larger values).
In addition, I'd like to understand why using np.arange on the first index behaves differently from slicing it with ':'. I didn't expect this behavior.
Any other comments that help explain why the previous approaches failed are also welcome.
You only have column indices, so you also need to create their corresponding row indices:
>>> my_set[np.arange(len(my_set))[:, None], init_values] = 1
>>> my_set
array([[False, True, False, True, True],
[ True, True, True, False, False]])
[:, None] is used to turn the 1-D array of row indices into a column vector, so that the row and column indices have compatible shapes for broadcasting:
>>> np.arange(len(my_set))[:, None]
array([[0],
[1]])
>>> np.broadcast_arrays(np.arange(len(my_set))[:, None], init_values)
[array([[0, 0, 0],
[1, 1, 1]]),
array([[1, 3, 4],
[0, 1, 2]], dtype=int64)]
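As an alternative sketch (not the approach used in this answer), np.put_along_axis writes values at per-row column indices and builds the matching row indices internally. The arrays are the ones from the question.

```python
import numpy as np

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)

# For each row i, set my_set[i, init_values[i, k]] = True for every k.
np.put_along_axis(my_set, init_values, True, axis=1)

print(my_set)
```

This modifies my_set in place and should give the same result as indexing with np.arange(len(my_set))[:, None].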
In essence, a slice combines every index in its range with the indices given for the other dimensions. Here is a simple test. The matrix to be indexed is as follows:
>>> ar = np.arange(4).reshape(2, 2)
>>> ar
array([[0, 1],
[2, 3]])
If you want to get the elements with indices 0 and 1 in row 0, and the elements with indices 1 and 0 in row 1, but you combine the column indices [[0, 1], [1, 0]] with a slice, you will get:
>>> ar[:, [[0, 1], [1, 0]]]
array([[[0, 1],
[1, 0]],
[[2, 3],
[3, 2]]])
This is equivalent to combining the row index from 0 to 1 with the column indices respectively:
>>> ar[0, [[0, 1], [1, 0]]]
array([[0, 1],
[1, 0]])
>>> ar[1, [[0, 1], [1, 0]]]
array([[2, 3],
[3, 2]])
In fact, broadcasting is applied implicitly here. The actual indices are:
>>> np.broadcast_arrays(0, [[0, 1], [1, 0]])
[array([[0, 0],
[0, 0]]),
array([[0, 1],
[1, 0]])]
>>> np.broadcast_arrays(1, [[0, 1], [1, 0]])
[array([[1, 1],
[1, 1]]),
array([[0, 1],
[1, 0]])]
This is not the same as the indices you actually need. Therefore, you need to manually generate the correct row indices for broadcasting:
>>> ar[[[0], [1]], [[0, 1], [1, 0]]]
array([[0, 1],
[3, 2]])
>>> np.broadcast_arrays([[0], [1]], [[0, 1], [1, 0]])
[array([[0, 0],
[1, 1]]),
array([[0, 1],
[1, 0]])]
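To tie this back to the shapes reported in the question, here is a short sketch that compares the three indexing forms on the original arrays. The variable names rows, shape_slice, shape_1d and shape_bcast are illustrative only.

```python
import numpy as np

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1, 3, 4], [0, 1, 2]])
rows = np.arange(len(my_set))

# Slice + index array: every row is combined with the whole index array.
shape_slice = my_set[:, init_values].shape

# Two 1-D index arrays: paired element by element.
shape_1d = my_set[rows, init_values[:, 0]].shape

# Column vector of rows broadcast against the (2, 3) column indices.
shape_bcast = my_set[rows[:, None], init_values].shape

print(shape_slice, shape_1d, shape_bcast)
```

Only the last form pairs row i with the columns listed in init_values[i], which is why it yields the (2, 3) shape the question expected.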

How to remove/select rows in matrix given external condition from another array with numpy?

I have a matrix and a truth table array for this matrix like so :
matrix = np.array([[1, 2, 2], [2, 3, 4], [4, 3, 5]])
truth_table = np.array([0, 1, 0])
The goal is to keep only the rows in the matrix where the truth table is equal to one, in this case only [[2, 3, 4]].
The matrix has as many row as the truth table has elements.
In any other language I would do this :
results = []
for i in range(truth_table.size):
    if truth_table[i] == 1:
        results.append(matrix[i])
results = np.array(results)
The problem is that the matrix can be enormous and for loops are not optimized in Python for this sort of problem and thus can take a really long time to execute.
I am sure there is a better way to do this using numpy but I can't seem to find the solution.
Make sure your truth table has dtype=bool; then you can simply do matrix[truth_table]:
import numpy as np
matrix = np.array([[1, 2, 2], [2, 3, 4], [4, 3, 5]])
truth_table = np.array([0, 1, 0], dtype=bool)
# or truth_table = np.array([False, True, False])
print(matrix[truth_table])
# prints [[2 3 4]]
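The dtype matters because an integer array is treated as a list of row numbers rather than as a mask. The sketch below, using the arrays from the question, shows the difference and one way to convert an existing 0/1 table; the names mask, selected and as_ints are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 2], [2, 3, 4], [4, 3, 5]])
truth_table = np.array([0, 1, 0])  # integer dtype, as in the question

# Convert the 0/1 table to a boolean mask before indexing.
mask = truth_table.astype(bool)
selected = matrix[mask]

# Pitfall: indexing with the integer array picks rows 0, 1 and 0 instead.
as_ints = matrix[truth_table]

print(selected)
print(as_ints)
```

Writing matrix[truth_table == 1] should work equally well and makes the condition explicit.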

numpy.where for row index which that row is not all zero

I have a large matrix in which some rows are all zero. I want to get the indexes of the rows that are not all zero. I tried
idx = np.where(mymatrix[~np.all(mymatrix != 0, axis=1)])
and got
(array([ 21, 21, 21, ..., 1853, 3191, 3191], dtype=int64),
array([3847, 3851, 3852, ..., 4148, 6920, 6921], dtype=int64))
Is the first array the row index? Is there a more straightforward way to get only the row index?
There is a straightforward way:
np.where(np.any(arr != 0, axis=1))
You are actually quite close to the solution yourself. You need to think a bit about what you pass to np.where().
Take this matrix as an example:
array([[1, 1, 1, 1],
[2, 2, 2, 2],
[0, 0, 0, 0],
[3, 3, 3, 3]])
# This will give you back a boolean array of whether your
# statement is true or false per row
np.all(mymatrix != 0, axis=1)
array([ True, True, False, True], dtype=bool)
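Now if you pass that to np.where(), it returns the row indices. Note that np.all(mymatrix != 0, axis=1) marks rows that contain no zeros at all, which matches "not all zero" only when every row is either entirely zero or entirely nonzero, as in this example: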
np.where(np.all(mymatrix != 0, axis=1))
(array([0, 1, 3]),)
What you did wrong was to index the matrix with the boolean array before calling np.where().
# This will give you the rows without zeros.
mymatrix[np.all(mymatrix != 0, axis=1)]
array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]])
# While this will give you the rows that contain at least one zero
mymatrix[~np.all(mymatrix != 0, axis=1)]
Given a sub-matrix like this, np.where() returns the positions of its nonzero elements inside that sub-matrix, not the row indices of the original matrix.
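The two conditions used in these answers differ when a row is only partly zero. A small sketch with a made-up matrix m (not the asker's data) shows the difference:

```python
import numpy as np

# Hypothetical matrix: row 0 is partly zero, row 1 is all zero.
m = np.array([[1, 0, 2],
              [0, 0, 0],
              [3, 3, 3]])

# Rows that are not all zero: at least one nonzero entry.
not_all_zero = np.flatnonzero(np.any(m != 0, axis=1))

# Rows that contain no zeros at all: every entry nonzero.
no_zeros = np.flatnonzero(np.all(m != 0, axis=1))

print(not_all_zero)
print(no_zeros)
```

np.flatnonzero returns a plain 1-D index array, which avoids unpacking the one-element tuple that np.where produces.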

select first occurrence of minimum index from numpy array

I am trying to find out the index of the minimum value in each row and I am using below code.
#code
import numpy as np
C = np.array([[1,2,4],[2,2,5],[4,3,3]])
ind = np.where(C == C.min(axis=1).reshape(len(C),1))
ind
#output
(array([0, 1, 1, 2, 2], dtype=int64), array([0, 0, 1, 1, 2], dtype=int64))
The problem is that it returns the indices of all minimum values in each row, but I want only the first occurrence of the minimum in each row, like
(array([0, 1, 2], dtype=int64), array([0, 0, 1], dtype=int64))
If you want to compare against the minimum value, use np.min with keepdims=True so that the result broadcasts against C and yields a boolean mask. To select the first occurrence, use argmax along each row of the mask, since argmax returns the position of the first True.
Thus, the implementation to get the corresponding column indices would be -
(C==C.min(1, keepdims=True)).argmax(1)
Sample step-by-step run -
In [114]: C # Input array
Out[114]:
array([[1, 2, 4],
[2, 2, 5],
[4, 3, 3]])
In [115]: C==C.min(1, keepdims=1) # boolean array of min values
Out[115]:
array([[ True, False, False],
[ True, True, False],
[False, True, True]], dtype=bool)
In [116]: (C==C.min(1, keepdims=True)).argmax(1) # argmax to get first occurrences
Out[116]: array([0, 0, 1])
The first output of row indices would simply be a range array -
np.arange(C.shape[0])
To get the same column indices of the first occurrence of the minimum values, a more direct way is to use np.argmin -
C.argmin(axis=1)
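Putting the two pieces together, a minimal sketch that builds the (rows, cols) pair in the same form as the np.where output from the question. The names rows, cols, ind and mins are illustrative.

```python
import numpy as np

C = np.array([[1, 2, 4], [2, 2, 5], [4, 3, 3]])

rows = np.arange(C.shape[0])   # one entry per row
cols = C.argmin(axis=1)        # first occurrence of each row minimum
ind = (rows, cols)

# The pair can be used directly to pull out the minimum values.
mins = C[ind]

print(ind)
print(mins)
```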
