How to threshold based on the average value of a row? - python

I have a 2d array. I want to set all the values in each row that are greater than the mean value of that row to 0.
Some code that does this naively is:
new_arr = arr.copy()
for i, row in enumerate(arr):
    avg = np.mean(row)
    for j, pixel in enumerate(row):
        if pixel > avg:
            new_arr[i, j] = 0
        else:
            new_arr[i, j] = 1
This is pretty slow, and I want to know if there's a way to do this using NumPy indexing?
If it were the average value of the whole matrix, I could simply do:
mask = arr > np.mean(arr)
arr[mask] = 0
arr[np.logical_not(mask)] = 1
Is there some way to do this with the per-row average, using a one-dimensional array of averages or something similar?
EDIT:
The proposed solution:
avg = np.mean(arr, axis=0)
mask = avg >= arr
new_arr = np.zeros(arr.shape)
new_arr[mask] = 1
was actually using the column-wise average, which might be useful to some people as well. It was equivalent to:
new_arr = arr.copy()
for i, row in enumerate(arr.T):
    avg = np.mean(row)
    for j, pixel in enumerate(row):
        if pixel > avg:
            new_arr[j, i] = 0
        else:
            new_arr[j, i] = 1

Setup
a = np.arange(25).reshape((5,5))
You can use keepdims with mean:
a[a > a.mean(1, keepdims=True)] = 0
array([[ 0,  1,  2,  0,  0],
       [ 5,  6,  7,  0,  0],
       [10, 11, 12,  0,  0],
       [15, 16, 17,  0,  0],
       [20, 21, 22,  0,  0]])
Using keepdims=True gives the following result for the mean:
array([[ 2.],
       [ 7.],
       [12.],
       [17.],
       [22.]])
The benefit to this is stated in the docs:
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
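The question also asks for the else branch (values at or below the row mean become 1). A minimal sketch of that variant, using np.where on the same broadcasted comparison:
import numpy as np

a = np.arange(25).reshape((5, 5))
# 0 where a value exceeds its row mean, 1 everywhere else
new_arr = np.where(a > a.mean(1, keepdims=True), 0, 1)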

You can use np.mean(a, axis=1) to get the mean of each row, broadcast that to the shape of a, and set all values where a > broadcasted_mean_array to 0:
Example:
a = np.arange(25).reshape((5,5))
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
a[a > np.broadcast_to(np.mean(a, axis=1), a.shape).T] = 0
>>> a
array([[ 0,  1,  2,  0,  0],
       [ 5,  6,  7,  0,  0],
       [10, 11, 12,  0,  0],
       [15, 16, 17,  0,  0],
       [20, 21, 22,  0,  0]])

Use the axis keyword for your mean:
avg = np.mean(arr, axis=0)
Then use this to create your mask and assign the values you want:
mask = avg >= arr
new_arr = np.zeros(arr.shape)
new_arr[mask] = 1
Of course, you can directly create the new array from the mask without the two-step approach.
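As the question's edit points out, axis=0 averages over columns. For the per-row threshold asked about, the same idea works with axis=1 and keepdims; a minimal sketch, equivalent to the keepdims answer above:
avg = np.mean(arr, axis=1, keepdims=True)  # per-row means, shape (n, 1)
new_arr = (arr <= avg).astype(arr.dtype)   # 1 where at or below the row mean, else 0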

Related

Index numpy 3d-array with 1d array of indices

I have a 3D numpy array of shape (i, j, k). I have an array of length i which contains indices into the last axis (k). I would like to index the array with it to get an array of shape (i, j).
Here is an example of what I am trying to achieve:
import numpy as np
arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)
# array([[[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]],
#
#        [[12, 13, 14, 15],
#         [16, 17, 18, 19],
#         [20, 21, 22, 23]]])
indices = np.array([1, 3])
# I want to mask `arr` using `indices`
# Desired output is equivalent to
# np.stack((arr[0, :, 1], arr[1, :, 3]))
# array([[ 1,  5,  9],
#        [15, 19, 23]])
I tried reshaping the indices array to be able to broadcast with arr but this raises an IndexError.
arr[indices[np.newaxis, np.newaxis, :]]
# IndexError: index 3 is out of bounds for axis 0 with size 2
I also tried creating a 3D mask and applying it to arr. This seems closer to the correct answer to me but I still end up with an IndexError.
mask = np.stack((np.arange(arr.shape[0]), indices), axis=1)
arr[mask.reshape(2, 1, 2)]
# IndexError: index 3 is out of bounds for axis 0 with size 2
From what I understand of your example, you can simply pass indices as the index for the last axis, together with a range of matching length for the first axis, like this:
import numpy as np
arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)
indices = np.array([1, 3])
print(arr[range(len(indices)), :, indices])
# [[ 1  5  9]
#  [15 19 23]]
This works:
sub = arr[[0,1], :, [1,3]]
Output:
>>> sub
array([[ 1,  5,  9],
       [15, 19, 23]])
A more dynamic version by @Psidom:
>>> sub = arr[np.arange(len(arr)), :, [1, 3]]
>>> sub
array([[ 1,  5,  9],
       [15, 19, 23]])
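If you prefer to keep the lookup explicit rather than rely on advanced-indexing broadcasting, np.take_along_axis is another option; a sketch with the same arr and indices as above:
import numpy as np

arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)
indices = np.array([1, 3])

# Reshape indices to (2, 1, 1) so it broadcasts over the middle axis,
# gather along the last axis, then drop the now size-one last axis.
out = np.take_along_axis(arr, indices[:, None, None], axis=2).squeeze(2)
# array([[ 1,  5,  9],
#        [15, 19, 23]])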

What's a good way of setting most elements of an ndarray to zero?

I've got an ndarray with, say, 10,000 rows and 75 columns, and another one with the same number of rows and, say, 3 columns. The second one has integer values.
I want to end up with an array of 10,000 rows and 75 columns with all the elements set to zero except the elements in each row indexed by the values in the corresponding row of the second array.
So starting with z_array and i_array, I want to end up with a_array
>>> z_array
array([[10, 11, 12, 13, 14, 15],
       [10, 11, 12, 13, 14, 15],
       [10, 11, 12, 13, 14, 15],
       [10, 11, 12, 13, 14, 15]])
>>> i_array
array([[0, 2],
       [3, 1],
       [1, 4],
       [2, 3]])
>>> a_array
array([[10,  0, 12,  0,  0,  0],
       [ 0, 11,  0, 13,  0,  0],
       [ 0, 11,  0,  0, 14,  0],
       [ 0,  0, 12, 13,  0,  0]])
I can see two ways of approaching this: either start with an array full of zeros and copy across the relevant elements from z_array; or start with z_array and set all the irrelevant elements to zero. Note that the number of irrelevant elements is typically much, much larger than the number of relevant elements.
Either way, is there a good way of doing the multiple assignments, or do I simply have to loop through them? Or is there a third approach?
I'm wondering if I can use numpy.ufunc.at somehow?
I can see how to get a list of indexes for the relevant elements, for example
>>> index_list = [[i, val] for (i, x) in enumerate(i_array) for val in x]
>>> index_list
[[0, 0], [0, 2], [1, 3], [1, 1], [2, 1], [2, 4], [3, 2], [3, 3]]
And there's a slightly more complex way to get them for the irrelevant elements. But these lists would be big!!
It seems like you are looking for something similar to np.put_along_axis
Taking the example you have there, if you run np.put_along_axis(z_array, i_array, 0, axis=1), you get:
z_array = [[ 0 11  0 13 14 15]
           [10  0 12  0 14 15]
           [10  0 12 13  0 15]
           [10 11  0  0 14 15]]
The output is the opposite of what you want.
To get what you want, first take a copy of z_array as a_array, zero the relevant positions in z_array with put_along_axis, and then zero out a_array wherever z_array is still non-zero.
a_array = z_array.copy()
np.put_along_axis(z_array, i_array, 0, axis=1)
a_array[z_array != 0] = 0
This gives the output you expected:
a_array = [[10  0 12  0  0  0]
           [ 0 11  0 13  0  0]
           [ 0 11  0  0 14  0]
           [ 0  0 12 13  0  0]]
See the np.put_along_axis documentation.
See this answer for more options for combining matrices (np.where).
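A variant of the same idea that avoids mutating z_array is to start from zeros and copy only the relevant elements across, pairing put_along_axis with take_along_axis; a sketch using the same names:
import numpy as np

a_array = np.zeros_like(z_array)
# Write the relevant elements of z_array into the zero array; the rest stays 0.
np.put_along_axis(a_array, i_array,
                  np.take_along_axis(z_array, i_array, axis=1), axis=1)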
You could use masked arrays
import numpy as np

def mask_array(z_array, i_array):
    ROWS, COLS = np.shape(z_array)
    # Fill in the mask
    mask = np.zeros((ROWS, COLS))
    for i in range(ROWS):
        np.add.at(mask[i, :], i_array[i], 1)
    mask = mask > 0
    mask = ~mask
    m_array = np.ma.array(z_array, mask=mask, fill_value=0)
    return np.ma.filled(m_array)

a_array = mask_array(z_array, i_array)
To tell the elements of z_array apart, I defined it as:
array([[ 10,  11,  12,  13,  14,  15],
       [110, 111, 112, 113, 114, 115],
       [210, 211, 212, 213, 214, 215],
       [310, 311, 312, 313, 314, 315]])
Then one possible solution is to create an array filled with zeros using zeros_like, and then run a loop based on the ndenumerate method:
result = np.zeros_like(z_array)
for (r, c), x in np.ndenumerate(i_array):
    result[r, x] = z_array[r, x]
For my (changed) source data, the result is:
array([[ 10,   0,  12,   0,   0,   0],
       [  0, 111,   0, 113,   0,   0],
       [  0, 211,   0,   0, 214,   0],
       [  0,   0, 312, 313,   0,   0]])
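The same copy can also be done without an explicit Python loop, by broadcasting a column of row numbers against i_array; a sketch with the original z_array/i_array names:
import numpy as np

rows = np.arange(z_array.shape[0])[:, None]  # shape (n, 1), broadcasts against i_array
a_array = np.zeros_like(z_array)
# Copy only the (row, i_array[row]) elements; everything else stays zero.
a_array[rows, i_array] = z_array[rows, i_array]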

Numpy slicing function: Dynamically create slice indices np.r_[a:b, c:d, ...] from array shaped (X, 2) for selection in array

The situation
I have 2D array representing dual-channel audio. I want to create a function that returns slices of this array at arbitrary locations (e.g. speech only parts). I already know how to do it when I explicitly write the values into np.r_:
Sample data
arr = np.arange(0, 24).reshape((2, -1))
# array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
#        [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
Input
An X-length array of width 2, e.g.
selector = np.array([[0, 2], [6, 9]])
# array([[0, 2],
#        [6, 9]])
Desired output
# create an index array
selection_indices = np.r_[0:2, 6:9]
# array([0, 1, 6, 7, 8])
# use the indices to select from the 2D array
arr[:, selection_indices]
# array([[ 0,  1,  6,  7,  8],
#        [12, 13, 18, 19, 20]])
Goal
A function that takes an X-length array of width 2 (shape (X, 2)), each row representing the start and end of a slice, and uses it to return a selection of an array. Effectively np.r_[0:2, 6:9], but built from an argument.
arr = np.arange(0,24).reshape((2, -1))
def slice_returner(arr, selector):
    # something like this (broken); should be like: np.r_[0:2, 6:9]
    selection_indices = np.r_[[row[0]:row[1]] for row in selector]
    # return 2D slice
    return arr[:, selection_indices]
selector = np.array([[0, 2], [6, 9]])
sliced_arr = slice_returner(arr, selector)
How do I turn the input into selection slices? Preferably with minimal array creation / copying.
Boolean indexing could be one efficient way here. Hence, we can create a mask over the columns and then index into the columns to get our output -
# Generate mask for cols
mask = np.zeros(arr.shape[1], dtype=bool)
for (i, j) in selector:
    mask[i:j] = True

# Boolean index into cols for final o/p
out = arr[:, mask]
The memory-overhead is just the mask, which being a boolean array should be minimal and the final output, which is required anyway.
Vectorized mask creation
If there are many entries in selector, there's a broadcasting-based vectorized way to create the mask for cols, like so -
r = np.arange(arr.shape[1])
mask = ((selector[:,0,None]<=r) & (selector[:,1,None]>r)).any(0)
You can just create an indexing array from individual aranges
slices = [[0, 2], [6, 9]]
np.concatenate([np.arange(*i) for i in slices])
# array([0, 1, 6, 7, 8])
and use it to extract the data
arr[:, np.concatenate([np.arange(*i) for i in slices])]
# array([[ 0,  1,  6,  7,  8],
#        [12, 13, 18, 19, 20]])
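Putting either approach into the requested function shape, a sketch of slice_returner, assuming each selector row is a half-open [start, stop) pair as in the examples above:
import numpy as np

def slice_returner(arr, selector):
    # Build the np.r_-style index array from the (start, stop) rows ...
    selection_indices = np.concatenate([np.arange(start, stop) for start, stop in selector])
    # ... and use it to select columns of the 2D array.
    return arr[:, selection_indices]

arr = np.arange(0, 24).reshape((2, -1))
selector = np.array([[0, 2], [6, 9]])
sliced_arr = slice_returner(arr, selector)
# array([[ 0,  1,  6,  7,  8],
#        [12, 13, 18, 19, 20]])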

Concatenate range arrays given start, stop numbers in a vectorized way - NumPy

I have two matrices of interest, the first is a "bag of words" matrix, with two columns: the document ID and the term ID. For example:
bow[0:10]
Out[1]:
array([[ 0, 10],
       [ 0, 12],
       [ 0, 19],
       [ 0, 20],
       [ 1,  9],
       [ 1, 24],
       [ 2, 33],
       [ 2, 34],
       [ 2, 35],
       [ 3,  2]])
In addition, I have an "index" matrix, where every row in the matrix contains the index of the first and last row for a given document ID in the bag of words matrix. Ex: row 0 is the first and last index for doc id 0. For example:
index[0:4]
Out[2]:
array([[ 0,  4],
       [ 4,  6],
       [ 6,  9],
       [ 9, 10]])
What I'd like to do is take a random sample of document ID's and get all of the bag of word rows for those document ID's. The bag of words matrix is roughly 150M rows (~1.5Gb), so using numpy.in1d() is too slow. We need to return these rapidly for feeding into a downstream task.
The naive solution I have come up with is as follows:
def get_rows(ids):
    indices = np.concatenate([np.arange(x1, x2) for x1, x2 in index[ids]])
    return bow[indices]

get_rows([4, 10, 3, 5])
Generic sample
A generic sample to put forth the problem would be something like this -
indices = np.array([[ 4,  7],
                    [10, 16],
                    [11, 18]])
The expected output would be -
array([ 4, 5, 6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])
Think I have cracked it finally with a cumsum trick for a vectorized solution - the idea is to fill an array with ones, write at each range boundary the jump needed to land on the next range's start, and let cumsum reproduce the concatenated ranges -
def create_ranges(a):
    l = a[:,1] - a[:,0]                       # length of each range
    clens = l.cumsum()                        # end offset of each range in the output
    ids = np.ones(clens[-1], dtype=int)       # steps of 1 within each range
    ids[0] = a[0,0]                           # start value of the first range
    ids[clens[:-1]] = a[1:,0] - a[:-1,1] + 1  # jump from one range's last value to the next start
    out = ids.cumsum()                        # cumsum reproduces the concatenated ranges
    return out
Sample runs -
In [416]: a = np.array([[4,7],[10,16],[11,18]])
In [417]: create_ranges(a)
Out[417]: array([ 4, 5, 6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])
In [425]: a = np.array([[-2,4],[-5,2],[11,12]])
In [426]: create_ranges(a)
Out[426]: array([-2, -1, 0, 1, 2, 3, -5, -4, -3, -2, -1, 0, 1, 11])
If we are given starts and stops as two 1D arrays, we just need to use those in place of the first and second columns. For completeness, here's the complete code -
def create_ranges(starts, ends):
    l = ends - starts
    clens = l.cumsum()
    ids = np.ones(clens[-1], dtype=int)
    ids[0] = starts[0]
    ids[clens[:-1]] = starts[1:] - ends[:-1] + 1
    out = ids.cumsum()
    return out
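Plugged back into the question's lookup, a sketch of get_rows built on create_ranges (bow and index as defined above):
def get_rows(ids):
    # Vectorized replacement for the concatenate-of-aranges version.
    starts, ends = index[ids, 0], index[ids, 1]
    return bow[create_ranges(starts, ends)]

get_rows([4, 10, 3, 5])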

Always insert all values before each entry in obj in numpy.insert

I have a 1D array a, and every time there is a zero value in the array I would like to insert a complete array b before that entry, for example:
b = np.ones(30)
I got the location of the needed entries with:
c = np.nonzero(a == 0)
In doing so:
len(c) > len(b)
If I then use
np.insert(arr, obj, values, axis=None)
with my arrays, i.e.
np.insert(a, c, b)
np.insert will always copy only one value of b before each position specified in c.
Question
How do I have to modify the np.insert call so that it will always copy all entries in values before each entry in obj?
A list approach, with b = [10, 11, 12] and c = [0, 1, 3, 6] (the indices of the zeros, visited in reverse so earlier indices stay valid):
In [197]: a = [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
In [198]: for i in c[::-1]:
     ...:     a[i:i+1] = b + [a[i]]
     ...:
In [199]: a
Out[199]: [10, 11, 12, 0, 10, 11, 12, 0, 1, 10, 11, 12, 0, 1, 2, 10, 11, 12, 0, 1, 2, 3]
insert can insert a single value before each position:
In [200]: arr
Out[200]: array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
In [201]: np.insert(arr, c, np.arange(10, 14))
Out[201]: array([10,  0, 11,  0,  1, 12,  0,  1,  2, 13,  0,  1,  2,  3])
insert is written in Python, but is a bit complex, especially since it can handle multidimensional inserts (axis). But essentially it does:
res = np.empty(result_size)
idx1 = <where original a values go>
# e.g. [ 3, 7, 8, 12, 13, 14, 18, 19, 20, 21]
res[idx1] = a
idx2 = <where insert values go>
# e.g. [0,1,2, 4,5,6, 9,10,11, 15,16,17]
res[idx2] = values
idx1 can be derived from idx2 with
mask = np.ones(res.shape, bool)
mask[idx2] = False
idx1 = np.where(mask)
So you only have to construct one or the other, which ever is easier.
Expanding that logic to your case could get messy. I suspect @Divakar can quickly come up with something using cumsum or np.maximum.accumulate.
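For what it's worth, here is a sketch of that logic expanded to this case - a vectorized insert-of-many, assuming arr, b and c as in the examples above; searchsorted computes how far each original element is shifted by the inserts that land before it, reproducing the idx1/idx2 example:
import numpy as np

arr = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
b = np.array([10, 11, 12])
c = np.nonzero(arr == 0)[0]                   # insert points: [0, 1, 3, 6]

res = np.empty(arr.size + b.size * c.size, dtype=arr.dtype)
# Each original element moves right by len(b) for every insert point at or before it.
shifts = np.searchsorted(c, np.arange(arr.size), side='right')
idx1 = np.arange(arr.size) + b.size * shifts  # where the original values go
mask = np.ones(res.size, dtype=bool)
mask[idx1] = False                            # the remaining slots take the inserts
res[idx1] = arr
res[mask] = np.tile(b, c.size)                # b repeated once per insert point
# res: [10, 11, 12, 0, 10, 11, 12, 0, 1, 10, 11, 12, 0, 1, 2, 10, 11, 12, 0, 1, 2, 3]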
A different iterative approach - add values of b one by one
In [234]: arr1 = arr.copy()
In [236]: for i in b:
     ...:     idx = np.where(arr1 == 0)[0]
     ...:     print(idx)
     ...:     arr1 = np.insert(arr1, idx, i)
     ...:
[0 1 3 6]
[ 1  3  6 10]
[ 2  5  9 14]
In [237]: arr1
Out[237]:
array([10, 11, 12,  0, 10, 11, 12,  0,  1, 10, 11, 12,  0,  1,  2, 10, 11, 12,  0,  1,  2,  3])
Given the overhead of creating arrays (and I'm sure insert has array overhead), I suspect this will be slower than the pure list iteration - unless the length of c (the insert points) is much larger than the length of b. This also depends on being able to identify the original insert points at each iteration (here with arr1==0).
A possible solution would be to duplicate the obj array as often as len(values) using np.tile(obj, len(values)), and re-sort the indices afterwards using np.sort(). The new array should then be used instead of the initial obj array in the np.insert call.
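A sketch of that idea as runnable code, assuming the arr, b and c names used above; the values must also be tiled to match the repeated indices, since np.insert places values in the given order at equal indices:
import numpy as np

arr = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
b = np.array([10, 11, 12])
c = np.nonzero(arr == 0)[0]

# Repeat each insert point len(b) times and tile b once per insert point.
out = np.insert(arr, np.sort(np.tile(c, len(b))), np.tile(b, len(c)))
# array([10, 11, 12,  0, 10, 11, 12,  0,  1, 10, 11, 12,  0,  1,  2, 10, 11, 12,  0,  1,  2,  3])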
