Selective deletion by value in numpy array - python

EDITED: Refined problem statement
I am still figuring out the fancy options offered by the numpy library. The following topic landed on my desk:
Purpose:
In a multi-dimensional array I select one column. This slicing works fine. But after that, values stored in another list need to be filtered out of the column values.
Current status:
import numpy as np
array1 = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
print(array1)
array1woZero = np.nonzero(array1)
print(array1woZero)
toBeRemoved = []
toBeRemoved.append(1)
print(toBeRemoved)
column = array1[:,1]
result = np.delete(column,toBeRemoved)
The above code does not produce the expected result. np.delete() removes the element at index 1, but I need the value 1 itself to be filtered out instead. What I also do not understand is the shape change when applying nonzero to array1: while array1 has shape (3,3), array1woZero turns out to be a tuple of two arrays with 6 values each.
(Variable inspector view of array1woZero: a tuple of two int64 arrays, each of shape (6,):)
(array([0, 0, 1, 1, 2, 2]), array([1, 2, 0, 2, 0, 1]))
My feeling is that I would require something like slicing with an exclusion operator. Do you have any hints for me to solve that? Is it necessary to use different data structures?

In [18]: arr = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
In [19]: arr
Out[19]:
array([[0, 1, 2],
       [1, 0, 3],
       [2, 3, 0]])
nonzero gives the indices of all non-zero elements of its argument (arr):
In [20]: idx = np.nonzero(arr)
In [21]: idx
Out[21]: (array([0, 0, 1, 1, 2, 2]), array([1, 2, 0, 2, 0, 1]))
This is a tuple of arrays, one per dimension. That output can be confusing, but it is easily used to return all of those non-zero elements:
In [22]: arr[idx]
Out[22]: array([1, 2, 1, 3, 2, 3])
Indexing like this, with a pair of arrays, produces a 1d array. In your example there is just one 0 per row, but in general that's not the case.
This is the same indexing - with 2 lists of the same length:
In [24]: arr[[0,0,1,1,2,2], [1,2,0,2,0,1]]
Out[24]: array([1, 2, 1, 3, 2, 3])
idx[0] just selects one array of that tuple, the row indices. That probably isn't what you want. And I doubt if you want to apply np.delete to that tuple.
It's hard to tell from the description, and code, what you want. Maybe that's because you don't understand what nonzero is producing.
We can also select the nonzero elements with boolean masking:
In [25]: arr>0
Out[25]:
array([[False,  True,  True],
       [ True, False,  True],
       [ True,  True, False]])
In [26]: arr[ arr>0 ]
Out[26]: array([1, 2, 1, 3, 2, 3])
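To get the "slicing with an exclusion operator" the question asks for, the same masking idea can be combined with np.isin (a minimal sketch using the question's toBeRemoved list; np.isin requires numpy >= 1.13):
In [27]: toBeRemoved = [1]
In [28]: column = arr[:,1]
In [29]: column[~np.isin(column, toBeRemoved)]
Out[29]: array([0, 3])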

The hint with the boolean masking was very good and helped me to develop my own solution. The symbolic names in the following code snippets are different, but the idea should become clear anyway.
At the beginning, I have my overall searchSpace.
searchSpace = relativeDistances[currentNode,:]
Assume that its shape is (5,). My filter is defined on the indexes, i.e. the range 0..4. Then I define another numpy array "filter" of the same shape, filled with 1, and set the values to be filtered out to 0.
filter = np.full(shape=nodeCount, fill_value=1, dtype=np.int32)
filter[0] = 0
filter[3] = 0
searchSpace = searchSpace * filter
minValue = searchSpace[searchSpace > 0].min()
neighborNode = np.where(searchSpace==minValue)
The filter array gives me the flexibility to adjust the filter later on as part of a loop. Using the element-wise multiplication with 0 and subsequent boolean masking, I can create my reduced searchSpace for the minimum search. Compared to a separate array or list, I keep the original shape, which is required to get the correct index in the where statement.
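A variant that avoids the multiplication trick (a sketch, assuming the same relativeDistances, currentNode and nodeCount as above): keep a boolean mask and push excluded entries to infinity, so genuine zero distances are not accidentally discarded and the original shape is still preserved for np.where.
keep = np.ones(nodeCount, dtype=bool)          # True = still a candidate
keep[0] = False                                # filter out index 0
keep[3] = False                                # filter out index 3
masked = np.where(keep, searchSpace, np.inf)   # same shape as searchSpace
minValue = masked.min()
neighborNode = np.where(masked == minValue)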


How to filter with numpy on 2D array using np.where

I read the numpy doc, and np.where called with 1 argument returns the row indices where the condition matches:
numpy.where(condition, [x, y, ]/)
In the context of a multi-dimensional array, I want to find and replace values when the condition matches; this is doable with the other params from the doc, where [x, y] are the replacement values.
Here is my data structure :
my_2d_array = np.array([[1,2],[3,4]])
Here is how I filter a column in python: my_2d_array[:,1]
Here is how I filter find/replace with numpy :
indices = np.where( my_2d_array[:,1] == 4, my_2d_array[:,1] , my_2d_array[:,1] )
(when the second column value matches 4, swap the value in column two with column one)
So it's hard for me to understand why the same syntax my_2d_array[:,1] is used to filter a whole column in python, and to designate a single row of my 2D array for numpy where the condition is matched.
Your array:
In [9]: arr = np.array([[1,2],[3,4]])
In [10]: arr
Out[10]:
array([[1, 2],
       [3, 4]])
Testing for some value:
In [11]: arr==4
Out[11]:
array([[False, False],
       [False,  True]])
testing one column:
In [12]: arr[:,1]
Out[12]: array([2, 4])
In [13]: arr[:,1]==4
Out[13]: array([False, True])
As documented, np.where with just one argument is just a call to nonzero, which finds the indices of the True values:
So for the 2d array in [11] we get two arrays:
In [15]: np.nonzero(arr==4)
Out[15]: (array([1], dtype=int64), array([1], dtype=int64))
and for the 1d boolean in [13], one array:
In [16]: np.nonzero(arr[:,1]==4)
Out[16]: (array([1], dtype=int64),)
That array can be used to select a row from arr:
In [17]: arr[_,1]
Out[17]: array([[4]])
If used in the three argument where, it selects elements between the 2nd and 3rd arguments. For example, using arguments that have nothing to do with arr:
In [18]: np.where(arr[:,1]==4, ['a','b'],['c','d'])
Out[18]: array(['c', 'b'], dtype='<U1')
The selection gets more complicated if the arguments differ in shape; then the rules of broadcasting apply.
So the basic point with np.where is that all 3 arguments are first evaluated, and passed (in true python function fashion) to the where function. It then selects elements based on the cond, returning a new array.
That where is functionally the same as this list comprehension (or an equivalent for loop):
In [19]: [i if cond else j for cond,i,j in zip(arr[:,1]==4, ['a','b'],['c','d'])]
Out[19]: ['c', 'b']
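Applied back to the question's stated goal (take the value from column one where column two equals 4), a sketch would simply pass different second and third arguments:
np.where(my_2d_array[:,1] == 4, my_2d_array[:,0], my_2d_array[:,1])
# -> array([2, 3]): the unmatched row keeps its column-two value 2,
#    the matched row takes its column-one value 3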

Python Fancy Indexing Assignments: cannot assign 3 input values to the 6 output values where the mask is true

I am trying to make a zeroed array with the same shape of a source array. Then modify every value in the second array that corresponds to a specific value in the first array.
This would be simple enough if I was just replacing one value. Here is a toy example:
import numpy as np
arr1 = np.array([[1,2,3],[3,4,5],[1,2,3]])
arr2 = np.array([[0,0,0],[0,0,0],[0,0,0]])
arr2[arr1==1] = -1
This will work as expected and arr2 would be:
[[-1, 0, 0],
 [ 0, 0, 0],
 [-1, 0, 0]]
But I would like to replace an entire row. Something like this replacing the last line of the sample code above:
arr2[arr1==[3,4,5]] = [-1,-1,-1]
When I do this, it also works as expected and arr2 would be:
[[ 0,  0,  0],
 [-1, -1, -1],
 [ 0,  0,  0]]
But when I tried to replace the last line of sample code with something like:
arr2[arr1==[1,2,3]] = [-1,-1,-1]
I expected to get something like the last output, but with the 0th and 2nd rows being changed. But instead I got the following error.
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 6
output values where the mask is true
I assume this is because, unlike the other example, it was going to have to replace more than one row. Though this seems odd to me, since it worked fine replacing more than one value in the simple single value example.
I'm just wondering if anyone can explain this behavior to me, because it is a little confusing; I am not that experienced with the inner workings of numpy operations. Also, I'd welcome any recommendations for accomplishing this in an efficient manner.
In my real world implementation, I am working with a very large three dimensional array (an image with 3 color channels), and I want to make a new array that stores a specific value into the three color channels wherever the source image has a specific set of three color values in the corresponding pixel (and remains [0,0,0] where it doesn't match our pixel_rgb_of_interest). I could go through in linear time and check every single pixel, but this could get slow with a lot of images, and I was wondering if there is a better way.
Thank you!
This would be a good application for numpy.where
>>> import numpy as np
>>> arr1 = np.array([[1,2,3],[3,4,5],[1,2,3]])
>>> arr2 = np.array([[0,0,0],[0,0,0],[0,0,0]])
>>> np.where(arr1 == [1,2,3], [-1,-1,-1], arr1)
array([[-1, -1, -1],
       [ 3,  4,  5],
       [-1, -1, -1]])
This basically works as "wherever the condition is true, use the x argument; otherwise use the y argument".
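As a sketch of how this extends to the three-channel image case described in the question (the image contents and the pixel_rgb_of_interest value here are made-up examples), the comparison has to be collapsed over the channel axis with np.all so the mask selects whole pixels:
import numpy as np

image = np.array([[[10, 20, 30], [0, 0, 0]],
                  [[10, 20, 30], [5, 5, 5]]])             # toy (2, 2, 3) image
pixel_rgb_of_interest = np.array([10, 20, 30])
value_to_store = np.array([-1, -1, -1])

out = np.zeros_like(image)
match = np.all(image == pixel_rgb_of_interest, axis=-1)   # (2, 2) mask, one entry per pixel
out[match] = value_to_store                               # (3,) broadcasts across matching pixels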
Let's add an "index" array:
In [56]: arr1 = np.array([[1,2,3],[3,4,5],[1,2,3]])
...: arr2 = np.array([[0,0,0],[0,0,0],[0,0,0]])
...: arr3 = np.arange(9).reshape(3,3)
The test against 1 value:
In [57]: arr1==1
Out[57]:
array([[ True, False, False],
       [False, False, False],
       [ True, False, False]])
that has 2 true values:
In [58]: arr3[arr1==1]
Out[58]: array([0, 6])
We could assign one value as you do, or 2.
Test with a list, which is converted to array first:
In [59]: arr1==[3,4,5]
Out[59]:
array([[False, False, False],
       [ True,  True,  True],
       [False, False, False]])
That has 3 True:
In [60]: arr3[arr1==[3,4,5]]
Out[60]: array([3, 4, 5])
so it works to assign a list of 3 values as you do. Or a scalar.
In [61]: arr1==[1,2,3]
Out[61]:
array([[ True,  True,  True],
       [False, False, False],
       [ True,  True,  True]])
Here the test has 6 True.
In [62]: arr3[arr1==[1,2,3]]
Out[62]: array([0, 1, 2, 6, 7, 8])
So we can assign 6 values or a scalar. But you tried to assign 3 values.
Or we could apply all to find the rows that match [1,2,3]:
In [63]: np.all(arr1==[1,2,3], axis=1)
Out[63]: array([ True, False, True])
In [64]: arr3[np.all(arr1==[1,2,3], axis=1)]
Out[64]:
array([[0, 1, 2],
       [6, 7, 8]])
To this we could assign a (2,3) array, a scalar, a (3,) array, or a (2,1) (as per broadcasting rules):
In [65]: arr2[np.all(arr1==[1,2,3], axis=1)]=np.array([100,200])[:,None]
In [66]: arr2
Out[66]:
array([[100, 100, 100],
       [  0,   0,   0],
       [200, 200, 200]])

iterating a filtered Numpy array whilst maintaining index information

I am attempting to pass filtered values from a Numpy array into a function.
I need to pass values only above a certain value, and their index position with the Numpy array.
I am attempting to avoid iterating over the entire array within python by using numpy's own filtering; the arrays I am dealing with have 20k values in them, with potentially only very few being relevant.
import numpy as np
somearray = np.array([1,2,3,4,5,6])
arrayindex = np.nonzero(somearray > 4)
for i in arrayindex:
    somefunction(arrayindex[0], somearray[arrayindex[0]])
This threw up errors about the logic not being able to handle multiple values, which led me to test it with print statements to see what was going on.
for cell in arrayindex:
    print(f"index {cell}")
    print(f"data {somearray[cell]}")
I expected an output of
index 4
data 5
index 5
data 6
But instead I get
index [4 5]
data [5 6]
I have looked through different methods to iterate through numpy arrays, such as nditer, but none seem to allow me to do the filtering of values outside of the for loop.
Is there a solution to my quandary?
Oh, I am aware that it is generally frowned upon to loop through a numpy array; however, the functions I am passing these values to are complex, triggering certain events and uploading data to a database depending on the data's location within the array.
Thanks.
import numpy as np
somearray = np.array([1,2,3,4,5,6])
arrayindex = [idx for idx, val in enumerate(somearray) if val > 4]
for i in range(0, len(arrayindex)):
    somefunction(arrayindex[i], somearray[arrayindex[i]])
for i in range(0, len(arrayindex)):
    print("index", arrayindex[i])
    print("data", somearray[arrayindex[i]])
You need to have a clear idea of what nonzero produces, and pay attention to the difference between indexing with a list and with a tuple.
===
In [110]: somearray = np.array([1,2,3,4,5,6])
...: arrayindex = np.nonzero(somearray > 4)
nonzero produces a tuple of arrays, one per dimension (this becomes more obvious with 2d arrays):
In [111]: arrayindex
Out[111]: (array([4, 5]),)
It can be used directly as an index:
In [113]: somearray[arrayindex]
Out[113]: array([5, 6])
In this 1d case you could take the array out of the tuple, and iterate on it:
In [114]: for i in arrayindex[0]:print(i, somearray[i])
4 5
5 6
argwhere does a 'transpose', which could also be used for iteration
In [115]: idxs = np.argwhere(somearray>4)
In [116]: idxs
Out[116]:
array([[4],
       [5]])
In [117]: for i in idxs: print(i,somearray[i])
[4] [5]
[5] [6]
idxs is (2,1) shape, so i is a (1,) shape array, resulting in the brackets in the display. Occasionally it's useful, but nonzero is used more (often by its other name, np.where).
2d
argwhere has a 2d example:
In [119]: x=np.arange(6).reshape(2,3)
In [120]: np.argwhere(x>1)
Out[120]:
array([[0, 2],
       [1, 0],
       [1, 1],
       [1, 2]])
In [121]: np.nonzero(x>1)
Out[121]: (array([0, 1, 1, 1]), array([2, 0, 1, 2]))
In [122]: x[np.nonzero(x>1)]
Out[122]: array([2, 3, 4, 5])
While nonzero can be used to index the array, argwhere elements can't.
In [123]: for ij in np.argwhere(x>1):
...: print(ij,x[ij])
...:
...
IndexError: index 2 is out of bounds for axis 0 with size 2
The problem is that ij is a list, which is used to index one dimension. numpy distinguishes between lists and tuples when indexing. (Earlier versions fudged the difference, but current versions take a more rigorous approach.)
So we need to change the list into a tuple. One way is to unpack it:
In [124]: for i,j in np.argwhere(x>1):
...: print(i,j,x[i,j])
...:
...:
0 2 2
1 0 3
1 1 4
1 2 5
I could have used print(ij, x[tuple(ij)]) in [123]. And I should have used unpacking in the [117] iteration:
In [125]: for i, in idxs: print(i,somearray[i])
4 5
5 6
or somearray[tuple(i)]
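For the 1d case in the question itself, np.flatnonzero returns a plain index array rather than a tuple, which sidesteps the unpacking issue entirely; a minimal sketch (somefunction as in the question):
for i in np.flatnonzero(somearray > 4):
    somefunction(i, somearray[i])    # called with (4, 5) and then (5, 6)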

Expand a numpy array based on a value in another array

I have the following numpy array a = np.array([1,1,2,1,3]) that should be transformed into the following array b = np.array([1,1,1,1,1,1,1,1]).
All the non-1 values in the a array should be expanded in the b array to their multiple defined in a. Put more simply, the 2 should become two ones, and the 3 should become three ones.
Frankly, I couldn't find a numpy function that does this, but I'm sure one exists. Any advice would be very welcome! Thank you!
We can simply do -
np.ones(a.sum(),dtype=int)
This will accommodate all numbers, 1s and non-1s, because of the summing, and hence give us the desired output.
In [71]: np.ones(len(a),int).repeat(a)
Out[71]: array([1, 1, 1, 1, 1, 1, 1, 1])
For this small example it is faster than np.ones(a.sum(),int), but it doesn't scale quite as well. But overall both are fast.
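A quick sanity check that both expressions build the same all-ones array (a sketch with a random a; for timings you could wrap each expression in IPython's %timeit):
a = np.random.randint(1, 10, 10000)
assert (np.ones(len(a), int).repeat(a) == np.ones(a.sum(), int)).all()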
Here's one possible way, based on which number you want repeated:
In [12]: a = np.array([1,1,2,1,3])
In [13]: mask = a != 1
In [14]: np.concatenate((a[~mask], np.repeat(1, np.sum(a[mask]))))
Out[14]: array([1, 1, 1, 1, 1, 1, 1, 1])

Product of array elements by group in numpy (Python)

I'm trying to build a function that returns the products of subsets of array elements. Basically I want to build a prod_by_group function that does this:
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
Vprods = prod_by_group(values, groups)
And the resulting Vprods should be:
Vprods
array([6, 4, 30])
There's a great answer here for sums of elements that I think it should be similar to:
https://stackoverflow.com/a/4387453/1085691
I tried taking the log first, then sum_by_group, then exp, but ran into numerical issues.
There are some other similar answers here for min and max of elements by group:
https://stackoverflow.com/a/8623168/1085691
Edit: Thanks for the quick answers! I'm trying them out. I should add that I want it to be as fast as possible (that's the reason I'm trying to get it in numpy in some vectorized way, like the examples I gave).
Edit: I evaluated all the answers given so far, and the best one is given by #seberg below. Here's the full function that I ended up using:
def prod_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
    return np.multiply.reduceat(values, group_changes)
If your groups are already sorted, you can do this using the reduceat functionality of ufuncs (if they are not sorted, you would have to sort them first with np.argsort to do it efficiently):
# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)
Or use mgilson's answer if you have few groups. But if you have many groups, this is much more efficient, since it avoids boolean indexing over every element of the original array for every group, and avoids slicing in a python loop thanks to reduceat.
Of course pandas does these operations conveniently.
Edit: Sorry, I had prod in there; the ufunc is multiply. You can use this method for any binary ufunc. This means it works for basically all numpy functions that can work element-wise on two input arrays (i.e. multiply normally multiplies two arrays elementwise, add adds them, maximum/minimum, etc.).
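For instance, a sketch reusing the sorted values and group_changes from above with other ufuncs:
np.add.reduceat(values, group_changes)       # per-group sums   -> array([ 6,  4, 11])
np.maximum.reduceat(values, group_changes)   # per-group maxima -> array([3, 4, 6])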
First set up a mask for the groups, such that you expand the groups in another dimension:
mask = (groups == np.unique(groups).reshape(-1,1))
mask
array([[ True,  True,  True, False, False, False],
       [False, False, False,  True, False, False],
       [False, False, False, False,  True,  True]], dtype=bool)
now we multiply with values:
mask*values
array([[1, 2, 3, 0, 0, 0],
       [0, 0, 0, 4, 0, 0],
       [0, 0, 0, 0, 5, 6]])
now you can already take the product along axis 1, except for those zeros, which are easy to fix:
np.prod(np.where(mask*values, mask*values, 1), axis=1)
array([ 6,  4, 30])
As suggested in the comments, you can also use the Pandas module. Using the groupby() function, this task becomes a one-liner:
import numpy as np
import pandas as pd
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
df = pd.DataFrame({'values': values, 'groups': groups})
So df then looks as follows:
   groups  values
0       1       1
1       1       2
2       1       3
3       2       4
4       3       5
5       3       6
Now you can groupby() the groups column and apply numpy's prod() function to each of the groups, like this:
df.groupby('groups')['values'].apply(np.prod)
which gives you the desired output:
1     6
2     4
3    30
Well, I doubt this is a great answer, but it's the best I can come up with:
np.array([np.prod(values[np.flatnonzero(groups == x)]) for x in np.unique(groups)])
It's not a numpy solution, but it's fairly readable (I find that sometimes numpy solutions aren't!):
from functools import reduce
from operator import itemgetter, mul
from itertools import groupby

grouped = groupby(zip(groups, values), itemgetter(0))
products = [reduce(mul, map(itemgetter(1), vals), 1) for key, vals in grouped]
print(products)
# [6, 4, 30]
