Efficient 2D array processing

Efficient 2D array processing - python

Let A be a 2D matrix of size NxL. Every row A[i] should be processed independently such that all entries in consecutive chunks of length C in every row are replaced by the average value of the entries in the chunk. Specifically, I am looking for an efficient way to replace every k-th chunk in every i-th row A[i][kC:(k+1)C] by mean(A[i][kC:(k+1)C]) * ones(length=C).
Example
A=[[1,3,5,7], [7,5,3,1]] should be transformed to A=[[2,2,6,6],[6,6,2,2]] if C=2.

You can reshape the data into chunks, take the mean and use broadcasting to assign the data back into the array
B = A.reshape(-1, C)
B[...] = B.mean(-1)[:, None]
Afterwards A contains the desired result as B is not a copy but a view.

Simply do
If a.shape[1]%c==0
res = np.concatenate([np.repeat(a.mean(axis=1,keepdims=True),c, axis=1) for a in np.split(A, c, axis=1)], axis=1)
Else
res = np.concatenate(
[np.repeat(a.mean(axis=1,keepdims=True),a.shape[1], axis=1) for a in np.split(A, list(range(0,A.shape[1],c)), axis=1)], axis=1)

Related

numpy write with two masks

If I have an array a with 100 elements that I want to conditionally update. I have the first mask m which selects elements of a that will be tried to update. Out of a[m] (say, 50 elements), I want to update a subset some elements, but leaves others. So the second mask m2 has 50=m.sum() elements, only some of which are True.
For completeness, a minimal example:
a = np.random.random(size=100)
m = a > 0
m2 = np.random.random(size=m.sum()) < 0
newvalues = -np.random.randint(size=m2.sum())
Then if I were to do
a[m][m2] = newvalues
This does not change the values of a, because fancy indexing a[m] makes a copy here. using indices (with where) has the same behaviour.
Instead, this works:
m12 = m.copy()
m12[m] = m2
a[m12] = newvalues
However, this is verbose and difficult to read.
Is there a more elegant way to update a subset of a subset of an array?

You can potentially first compute the "final index" of interest and then use those indexes to update. One way to achieve this in a more "numpy" way is to mask the first index array, which is computed based on the first mask array.
final_mask = np.where(m)[0][m2]
a[final_mask] = newvalues

First compute the indices of elements to update:
indices = np.array(range(100))
indices = indices[m1][m2]
then use indices to update array a:
a[indices] = newvalue

Selecting numpy columns based on values in a row

Suppose I have a numpy array with 2 rows and 10 columns. I want to select columns with even values in the first row. The outcome I want can be obtained is as follows:
a = list(range(10))
b = list(reversed(range(10)))
c = np.concatenate([a, b]).reshape(2, 10).T
c[c[:, 0] % 2 == 0].T
However, this method transposes twice and I don't suppose it's very pythonic. Is there a way to do the same job cleaner?

Numpy allows you to select along each dimension separately. You pass in a tuple of indices whose length is the number of dimensions.
Say your array is
a = np.random.randint(10, size=(2, 10))
The even elements in the first row are given by the mask
m = (a[0, :] % 2 == 0)
You can use a[0] to get the first row instead of a[0, :] because missing indices are synonymous with the slice : (take everything).
Now you can apply the mask to just the second dimension:
result = a[:, m]
You can also convert the mask to indices first. There are subtle differences between the two approaches, which you won't see in this simple case. The biggest difference is usually that linear indices are a little faster, especially if applied more than once:
i = np.flatnonzero(m)
result = a[:, i]

Is there an efficient way to get the position of the max element except for a specific column in a NumPy matrix?

For example, There is a 2d Numpy matrix M:
[[1,10,3],
[4,15,6]]
The max element except for those in M[:][1] is 6, and its position is (1,2). So the answer is (1,2).
Thank you very much for any help!

One way:
col = 1
skip_col = np.delete(x, col, axis=1)
row, column = np.unravel_index(skip_col.argmax(), skip_col.shape)
if column >= col:
column += 1
Translated:
Remove the column
find the maximum argument (argmax gives a flattened result, unravel_index gives back the placement in the 2d array)
If the column is greater or equal to the skipped one, add one
Following Dunes comment, I like the suggestion. It's nearly identical in amount of lines, but does not require a copy (as in np.delete). So if you are memory bound (as in really big data):
col = 1
row, column = np.unravel_index(x[:, :col].argmax(), x[:, :col].shape) # left max, saving a line assuming it's the global max, but less readable
right_max = np.unravel_index(x[:, col+1:].argmax(), x[:, col+1:].shape)
if x[right_max] > x[row, column]:
row, column = right_max
column += col

Here's a solution taking advantage of the set of nan functions:
In [180]: arr = np.array([[1,10,3],[4,15,6]])
In [181]: arr1 = arr.astype(float)
In [182]: arr1[:,1]=np.nan
In [183]: arr1
Out[183]:
array([[ 1., nan, 3.],
[ 4., nan, 6.]])
In [184]: np.nanargmax(arr1)
Out[184]: 5
In [185]: np.unravel_index(np.nanargmax(arr1),arr.shape)
Out[185]: (1, 2)
It might not be optimal timewise, but is probably easier to debug that alternatives.
Looking at the np.nanargmax I see that it just replaces the np.nan with -np.inf. So we do something similar by just replacing the exclude column values with a small enough integer so they won't be the max.
In [188]: arr1=arr.copy()
In [189]: arr1[:,1] = np.min(arr1)-1
In [190]: arr1
Out[190]:
array([[1, 0, 3],
[4, 0, 6]])
In [191]: np.argmax(arr1)
Out[191]: 5
In [192]: np.unravel_index(np.argmax(arr1),arr.shape)
Out[192]: (1, 2)
I can also imagine a solution using np.ma.masked_array, but that tends to be more of a convenience than speed tool.

Agreeing with the comment by Dunes:
With small arrays, like your example, it's probably just quicker to
make a copy of the matrix, without the given column, and then take the
max. With a larger array it may be quicker to take the max either side
of the column and take the max of the left and the right sides of the
column.
Here is an implementation of each of these cases, and a dispatcher function. (A value for THRESHOLD_SIZE needs to be added based on experimentation.)
Small array case
Creates array with the specified column removed. Calculates the overall maximum and then the location where it occurs. Adds one to the column if it is on the right side.
Large array case
It creates temporary 1d arrays containing the column maxima. These will typically (although not in every case) be significantly smaller than the 2-dimensional array. First, it is identified which side of the excluded column contains the maximum, then it is identified which column it is, and finally which row it is. This avoids the need to examine every element twice. The code also avoids creating any 2-dimensional slice of the array at any point.
THRESHOLD_SIZE = .....
def get_max_position(m, exclude_column):
return (get_max_position_largearray if m.size > THRESHOLD_SIZE
else get_max_position_smallarray)(m, exclude_column)
def get_max_position_smallarray(m, exclude_column):
mnew = np.delete(m, exclude_column, axis=1)
row, col = np.argwhere(mnew == np.max(mnew))[0]
# uses: int(True)=1 and int(False)=0
return (row, col + (col >= exclude_column))
def get_max_position_largearray(m, exclude_column):
column_maxima = np.max(m, axis=0)
l_col_maxima = column_maxima[:exclude_column]
r_col_maxima = column_maxima[exclude_column + 1:]
l_max = np.max(l_col_maxima) if l_col_maxima.size else None
r_max = np.max(r_col_maxima) if r_col_maxima.size else None
use_left = (True if r_max == None else
False if l_max == None else
(l_max > r_max))
if use_left:
themax = l_max
col = np.argwhere(l_col_maxima == themax)[0][0]
else:
themax = r_max
col = exclude_column + 1 + np.argwhere(r_col_maxima == themax)[0][0]
row = np.argwhere(m[:,col] == themax)[0][0]
return (row, col)
Here is the example in the question, by both methods:
m = np.array([[1,10,3],
[4,15,6]])
exclude_column = 1
print(get_max_position_largearray(m, exclude_column))
print(get_max_position_smallarray(m, exclude_column))
Output:
(1, 2)
(1, 2)

Here is what you can do:
m = [[1,10,3],
[4,15,6]]
c = 1 # Choose the column to exclude
a = max([[n,(k,b)] for k,i in enumerate(m) for b,n in enumerate(i) if b!=c])[1]
print(a)
Output:
(1, 2)

Another way without a copy, indexing the columns with a list:
import numpy as np
m = np.array([[1, 10, 3], [4, 15, 6]])
exclude_col = 1
# assign nicer names to the shape
rows, cols = m.shape
# generate indices for slicing
inds = list(range(cols))
inds.remove(exclude_col)
# find the maximum in the sliced array
max_ind = np.unravel_index(np.argmax(m[:, inds]), (rows, cols - 1))
# fix the found column index if we exceeded exclude_col
max_ind = (max_ind[0], max_ind[1] if max_ind[1] < exclude_col else max_ind[1] + 1)
The last line is a good candidate for a Python3.8 assignment expression, so in Python3.8+ you could write:
max_ind = (max_ind[0], v if (v := max_ind[1]) < exclude_col else v + 1)
EDIT: Indexing like that probably also creates a copy, I have not tested it, but the elements are not contiguous in memory.

How can I check if each row in a matrix is equal to an array and return a Boolean array containing the result?

How can I check if each row in a matrix is equal to an array and return a Boolean array containing the result using NumPy? e.g.
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([4,5,6])
# Expected Result: [False,True,False]

The neatest way I've found of doing this, is:
result = np.all(a==b, axis=1)

Return elements in a location corresponding to the minimum values of another array

I have two arrays with the same shape in the first two dimensions and I'm looking to record the minimum value in each row of the first array. However I would also like to record the elements in the corresponding position in the third dimension of the second array. I can do it like this:
A = np.random.random((5000, 100))
B = np.random.random((5000, 100, 3))
A_mins = np.ndarray((5000, 4))
for i, row in enumerate(A):
current_min = min(row)
A_mins[i, 0] = current_min
A_mins[i, 1:] = B[i, row == current_min]
I'm new to programming (so correct me if I'm wrong) but I understand that with Numpy doing calculations on whole arrays is faster than iterating over them. With this in mind is there a faster way of doing this? I can't see a way to get rid of the row == current_min bit even though the location of the minimum point must have been 'known' to the computer when it was calculating the min().
Any tips/suggestions appreciated! Thanks.

Something along what #lib talked about:
index = np.argmin(A, axis=1)
A_mins[:,0] = A[np.arange(len(A)), index]
A_mins[:,1:] = B[np.arange(len(A)), index]
It is much faster than using a for loop.

For getting the index of the minimum value, use amin instead of min + comparison
The amin function (and many other functions in numpy) also takes the argument axis, that you can use to get the minimum of each row or each column.
See http://docs.scipy.org/doc/numpy/reference/generated/numpy.amin.html

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Efficient 2D array processing - python

You can reshape the data into chunks, take the mean and use broadcasting to assign the data back into the array B = A.reshape(-1, C) B[...] = B.mean(-1)[:, None] Afterwards A contains the desired result as B is not a copy but a view.

Simply do If a.shape[1]%c==0 res = np.concatenate([np.repeat(a.mean(axis=1,keepdims=True),c, axis=1) for a in np.split(A, c, axis=1)], axis=1) Else res = np.concatenate( [np.repeat(a.mean(axis=1,keepdims=True),a.shape[1], axis=1) for a in np.split(A, list(range(0,A.shape[1],c)), axis=1)], axis=1)

Related

numpy write with two masks

Selecting numpy columns based on values in a row

Is there an efficient way to get the position of the max element except for a specific column in a NumPy matrix?

How can I check if each row in a matrix is equal to an array and return a Boolean array containing the result?

Return elements in a location corresponding to the minimum values of another array

Categories

Resources