Selecting numpy columns based on values in a row - python

Suppose I have a numpy array with 2 rows and 10 columns. I want to select the columns with even values in the first row. The outcome I want can be obtained as follows:
import numpy as np
a = list(range(10))
b = list(reversed(range(10)))
c = np.concatenate([a, b]).reshape(2, 10).T
c[c[:, 0] % 2 == 0].T
However, this method transposes twice, which I don't suppose is very pythonic. Is there a cleaner way to do the same job?

Numpy allows you to select along each dimension separately. You pass in a tuple of indices whose length is the number of dimensions.
Say your array is
a = np.random.randint(10, size=(2, 10))
The even elements in the first row are given by the mask
m = (a[0, :] % 2 == 0)
You can use a[0] to get the first row instead of a[0, :] because missing indices are synonymous with the slice : (take everything).
Now you can apply the mask to just the second dimension:
result = a[:, m]
You can also convert the mask to indices first. There are subtle differences between the two approaches, which you won't see in this simple case. The biggest difference is usually that linear indices are a little faster, especially if applied more than once:
i = np.flatnonzero(m)
result = a[:, i]
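For completeness, the same idea applied directly to the array from the question, with no transposes, might look like this (an illustrative sketch, not from the original answer):
import numpy as np

a = np.array([list(range(10)), list(reversed(range(10)))])
m = (a[0] % 2 == 0)  # columns whose first-row entry is even
result = a[:, m]
print(result)
# [[0 2 4 6 8]
#  [9 7 5 3 1]]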

Related

numpy write with two masks

Suppose I have an array a with 100 elements that I want to update conditionally. The first mask m selects the elements of a that are candidates for updating. Out of a[m] (say, 50 elements), I want to update some but leave the others alone. So the second mask m2 has 50 == m.sum() elements, only some of which are True.
For completeness, a minimal example:
a = np.random.random(size=100)
m = a > 0.5
m2 = np.random.random(size=m.sum()) < 0.5
newvalues = -np.random.randint(10, size=m2.sum())
Then if I were to do
a[m][m2] = newvalues
This does not change the values of a, because the fancy indexing a[m] makes a copy here, so the assignment writes into that copy. Using integer indices (via np.where) has the same behaviour.
Instead, this works:
m12 = m.copy()
m12[m] = m2  # m12 is True exactly where both m and m2 select
a[m12] = newvalues
However, this is verbose and difficult to read.
Is there a more elegant way to update a subset of a subset of an array?
You can first compute the "final" indices of interest and then use those to update. One way to achieve this in a more "numpy" way is to mask the first index array, which is itself computed from the first mask:
final_mask = np.where(m)[0][m2]
a[final_mask] = newvalues
First compute the indices of the elements to update:
indices = np.arange(100)
indices = indices[m][m2]
then use those indices to update the array a:
a[indices] = newvalues
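A small end-to-end sketch of both approaches, with tiny hand-picked masks (illustrative values, not the question's) so the effect is visible:
import numpy as np

a = np.arange(10, dtype=float)
m = a > 4                                        # candidates: indices 5..9
m2 = np.array([True, False, True, False, True])  # update every other candidate
newvalues = np.array([-1., -2., -3.])
# approach 1: compose the two masks into one full-size mask
a1 = a.copy()
m12 = m.copy()
m12[m] = m2
a1[m12] = newvalues
# approach 2: convert to integer indices and take a subset of those
a2 = a.copy()
a2[np.where(m)[0][m2]] = newvalues
print(a1)                      # [ 0.  1.  2.  3.  4. -1.  6. -2.  8. -3.]
print(np.array_equal(a1, a2))  # True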

Is there an efficient way to get the position of the max element except for a specific column in a NumPy matrix?

For example, there is a 2D numpy matrix M:
[[ 1, 10,  3],
 [ 4, 15,  6]]
The max element excluding those in the column M[:, 1] is 6, and its position is (1, 2). So the answer is (1, 2).
Thank you very much for any help!
One way:
col = 1
skip_col = np.delete(x, col, axis=1)
row, column = np.unravel_index(skip_col.argmax(), skip_col.shape)
if column >= col:
    column += 1
Translated:
Remove the column.
Find the argmax (argmax returns a flattened index; unravel_index maps it back to a position in the 2D array).
If the resulting column index is greater than or equal to the skipped one, add one.
Following Dunes' comment, I like the suggestion. It is nearly identical in number of lines, but does not require a copy (as np.delete does). So if you are memory bound (as with really big data):
col = 1
# max over the columns to the left of col
row, column = np.unravel_index(x[:, :col].argmax(), x[:, :col].shape)
# max over the columns to the right of col
right_max = np.unravel_index(x[:, col+1:].argmax(), x[:, col+1:].shape)
if x[:, col+1:][right_max] > x[row, column]:
    row, column = right_max
    column += col + 1  # shift the column index back into the full array
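As a quick sanity check (my own run, not from the answer): with x = np.array([[1, 10, 3], [4, 15, 6]]) and col = 1, the left max is 4 at (1, 0), the right max is 6, so the snippet ends with (row, column) == (1, 2).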
Here's a solution taking advantage of the set of nan functions:
In [180]: arr = np.array([[1,10,3],[4,15,6]])
In [181]: arr1 = arr.astype(float)
In [182]: arr1[:,1]=np.nan
In [183]: arr1
Out[183]:
array([[ 1., nan,  3.],
       [ 4., nan,  6.]])
In [184]: np.nanargmax(arr1)
Out[184]: 5
In [185]: np.unravel_index(np.nanargmax(arr1),arr.shape)
Out[185]: (1, 2)
It might not be optimal time-wise, but it is probably easier to debug than the alternatives.
Looking at the code for np.nanargmax, I see that it just replaces the np.nan values with -np.inf. So we can do something similar by replacing the excluded column's values with an integer small enough that they cannot be the max.
In [188]: arr1=arr.copy()
In [189]: arr1[:,1] = np.min(arr1)-1
In [190]: arr1
Out[190]:
array([[1, 0, 3],
       [4, 0, 6]])
In [191]: np.argmax(arr1)
Out[191]: 5
In [192]: np.unravel_index(np.argmax(arr1),arr.shape)
Out[192]: (1, 2)
I can also imagine a solution using np.ma.masked_array, but that tends to be more of a convenience than a speed tool.
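For what it's worth, a minimal sketch of that masked-array variant (my own illustration, untimed):
import numpy as np

arr = np.array([[1, 10, 3], [4, 15, 6]])
mask = np.zeros(arr.shape, dtype=bool)
mask[:, 1] = True  # mask out the excluded column
marr = np.ma.masked_array(arr, mask=mask)
# MaskedArray.argmax ignores masked entries
print(np.unravel_index(marr.argmax(), arr.shape))  # (1, 2)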
Agreeing with the comment by Dunes:
With small arrays, like your example, it's probably just quicker to make a copy of the matrix without the given column and then take the max. With a larger array it may be quicker to take the max on either side of the column and then take the max of the left and the right sides.
Here is an implementation of each of these cases, and a dispatcher function. (A value for THRESHOLD_SIZE needs to be added based on experimentation.)
Small array case
Creates an array with the specified column removed, calculates the overall maximum and the location where it occurs, and adds one to the column index if it lies to the right of the excluded column.
Large array case
It creates temporary 1d arrays containing the column maxima. These will typically (although not in every case) be significantly smaller than the 2-dimensional array. First it identifies which side of the excluded column contains the maximum, then which column that is, and finally which row. This avoids the need to examine every element twice. The code also avoids creating any 2-dimensional slice of the array at any point.
THRESHOLD_SIZE = .....  # to be chosen based on experimentation

def get_max_position(m, exclude_column):
    return (get_max_position_largearray if m.size > THRESHOLD_SIZE
            else get_max_position_smallarray)(m, exclude_column)

def get_max_position_smallarray(m, exclude_column):
    mnew = np.delete(m, exclude_column, axis=1)
    row, col = np.argwhere(mnew == np.max(mnew))[0]
    # uses: int(True) = 1 and int(False) = 0
    return (row, col + (col >= exclude_column))

def get_max_position_largearray(m, exclude_column):
    column_maxima = np.max(m, axis=0)
    l_col_maxima = column_maxima[:exclude_column]
    r_col_maxima = column_maxima[exclude_column + 1:]
    l_max = np.max(l_col_maxima) if l_col_maxima.size else None
    r_max = np.max(r_col_maxima) if r_col_maxima.size else None
    use_left = (True if r_max is None else
                False if l_max is None else
                (l_max > r_max))
    if use_left:
        themax = l_max
        col = np.argwhere(l_col_maxima == themax)[0][0]
    else:
        themax = r_max
        col = exclude_column + 1 + np.argwhere(r_col_maxima == themax)[0][0]
    row = np.argwhere(m[:, col] == themax)[0][0]
    return (row, col)
Here is the example in the question, by both methods:
m = np.array([[1, 10, 3],
              [4, 15, 6]])
exclude_column = 1
print(get_max_position_largearray(m, exclude_column))
print(get_max_position_smallarray(m, exclude_column))
Output:
(1, 2)
(1, 2)
Here is what you can do:
m = [[1, 10, 3],
     [4, 15, 6]]
c = 1  # choose the column to exclude
a = max([[n, (k, b)] for k, i in enumerate(m) for b, n in enumerate(i) if b != c])[1]
print(a)
Output:
(1, 2)
Another way without a copy, indexing the columns with a list:
import numpy as np
m = np.array([[1, 10, 3], [4, 15, 6]])
exclude_col = 1
# assign nicer names to the shape
rows, cols = m.shape
# generate indices for slicing
inds = list(range(cols))
inds.remove(exclude_col)
# find the maximum in the sliced array
max_ind = np.unravel_index(np.argmax(m[:, inds]), (rows, cols - 1))
# fix the found column index if we exceeded exclude_col
max_ind = (max_ind[0], max_ind[1] if max_ind[1] < exclude_col else max_ind[1] + 1)
The last line is a good candidate for a Python3.8 assignment expression, so in Python3.8+ you could write:
max_ind = (max_ind[0], v if (v := max_ind[1]) < exclude_col else v + 1)
EDIT: Indexing like that probably also creates a copy; I have not tested it, but the elements are not contiguous in memory.
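One quick way to test that claim is np.shares_memory (my own check):
import numpy as np

m = np.array([[1, 10, 3], [4, 15, 6]])
sub = m[:, [0, 2]]  # fancy indexing with a list
print(np.shares_memory(m, sub))  # False: it is indeed a copy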

Efficient 2D array processing

Let A be a 2D matrix of size NxL. Every row A[i] should be processed independently such that all entries in consecutive chunks of length C in every row are replaced by the average value of the entries in the chunk. Specifically, I am looking for an efficient way to replace every k-th chunk in every i-th row A[i][kC:(k+1)C] by mean(A[i][kC:(k+1)C]) * ones(length=C).
Example
A=[[1,3,5,7], [7,5,3,1]] should be transformed to A=[[2,2,6,6],[6,6,2,2]] if C=2.
You can reshape the data into chunks, take the mean, and use broadcasting to assign the means back into the array:
B = A.reshape(-1, C)
B[...] = B.mean(-1)[:, None]
Afterwards A contains the desired result, since B is a view of A, not a copy.
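A quick demonstration on the example from the question (assuming A is a numpy array whose row length is divisible by C):
import numpy as np

A = np.array([[1., 3., 5., 7.],
              [7., 5., 3., 1.]])
C = 2
B = A.reshape(-1, C)          # a view: rows of length C
B[...] = B.mean(-1)[:, None]  # broadcast each chunk mean across its chunk
print(A)
# [[2. 2. 6. 6.]
#  [6. 6. 2. 2.]]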
Simply do the following. If A.shape[1] % c == 0:
res = np.concatenate(
    [np.repeat(a.mean(axis=1, keepdims=True), c, axis=1)
     for a in np.split(A, A.shape[1] // c, axis=1)],
    axis=1)
Otherwise:
res = np.concatenate(
    [np.repeat(a.mean(axis=1, keepdims=True), a.shape[1], axis=1)
     for a in np.split(A, list(range(c, A.shape[1], c)), axis=1)],
    axis=1)
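A short run of the general branch on a hypothetical 5-column array (my own example, where the last chunk is shorter than c):
import numpy as np

A = np.array([[1., 3., 5., 7., 9.],
              [9., 7., 5., 3., 1.]])
c = 2
res = np.concatenate(
    [np.repeat(a.mean(axis=1, keepdims=True), a.shape[1], axis=1)
     for a in np.split(A, list(range(c, A.shape[1], c)), axis=1)],
    axis=1)
print(res)
# [[2. 2. 6. 6. 9.]
#  [8. 8. 4. 4. 1.]]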

iterate through number of columns, with variable columns

For example, let's consider this toy code
import numpy as np
import numpy.random as rnd
a = rnd.randint(0,10,(10,10))
k = (1,2)
b = a[:,k]
for col in np.arange(np.size(b, 1)):
    b[:, col] = b[:, col] + col * 100
This code works when k contains more than one index. However, when k is a single scalar index, the sub-array extracted from a becomes a 1-D vector, and the indexing inside the for loop throws an error.
Of course, I could fix this by checking the dimension of b and reshaping:
if np.ndim(b) == 1:
    b = np.reshape(b, (np.size(b), 1))
in order to obtain a column vector, but this is expensive.
So, the question is: what is the best way to handle this situation?
This seems like something that would arise quite often and I wonder what is the best strategy to deal with it.
If you index with a list or tuple, the 2d shape is preserved:
In [638]: a=np.random.randint(0,10,(10,10))
In [639]: a[:,(1,2)].shape
Out[639]: (10, 2)
In [640]: a[:,(1,)].shape
Out[640]: (10, 1)
And I think b iteration can be simplified to:
a[:,k] += np.arange(len(k))*100
This sort of calculation will also be easier if k is always a list or tuple, and never a scalar (a scalar has no len).
np.column_stack ensures its inputs are 2d (and expands at the end if not) with:
if arr.ndim < 2:
    arr = array(arr, copy=False, subok=True, ndmin=2).T
np.atleast_2d does
elif len(ary.shape) == 1:
    result = ary[newaxis, :]
which of course could be changed in this case to
if b.ndim == 1:
    b = b[:, None]
Anyway, I think it is better to ensure that k is a tuple rather than adjust b's shape afterwards. But keep both options in your toolbox.
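A small sketch of the shape difference and of the vectorized update (my own illustration):
import numpy as np

a = np.random.randint(0, 10, (10, 10))
print(a[:, 1].shape)     # (10,)   a scalar index drops the dimension
print(a[:, (1,)].shape)  # (10, 1) a tuple index keeps it 2-D
k = (1, 2)
a[:, k] += np.arange(len(k)) * 100  # adds 0 to column 1, 100 to column 2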

Delete a column in a multi-dimensional array if all elements in that column satisfy a condition

I have a multi-dimensional array such as;
a = [[1, 1, 5, 12, 0, 4, 0],
     [0, 1, 2, 11, 0, 4, 2],
     [0, 4, 3, 17, 0, 4, 9],
     [1, 3, 5, 74, 0, 8, 16]]
How can I delete a column if all entries within that column are equal to zero? In the array a that would mean deleting the 5th column (index 4), resulting in:
a = [[1, 1, 5, 12, 4, 0],
     [0, 1, 2, 11, 4, 2],
     [0, 4, 3, 17, 4, 9],
     [1, 3, 5, 74, 8, 16]]
N.B. I've written a as a nested list only to make it clear. I also don't know a priori where the zero column will be in the array.
My attempt so far only finds the index of the column in which all elements are equal to zero:
a = np.array([[1,1,5,12,0,4,0],[0,1,2,11,0,4,2],[0,4,3,17,0,4,9],[1,3,5,74,0,8,16]])
b = np.vstack(a)
ind = []
for n, m in zip(b.T, range(len(b.T))):
    if sum(n) == 0:
        ind.append(m)
Is there any way to achieve this?
With the code you already have, you can just do the following (deleting from the right so earlier deletions don't shift the later indices, assuming a is still the nested list):
for place in sorted(ind, reverse=True):
    for sublist in a:
        del sublist[place]
Which gets the job done but is not very satisfactory...
Edit: numpy is strong
import numpy as np
a = np.array(a)
a = a[:, np.sum(a, axis=0) != 0]
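Note that summing only detects all-zero columns when the entries cannot cancel out, which is fine here since all values are non-negative. If negative values are possible, a mask built with np.all is safer (my addition):
import numpy as np

a = np.array([[1, 1, 5, 12, 0, 4, 0],
              [0, 1, 2, 11, 0, 4, 2],
              [0, 4, 3, 17, 0, 4, 9],
              [1, 3, 5, 74, 0, 8, 16]])
a = a[:, ~np.all(a == 0, axis=0)]  # keep columns that are not entirely zero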
