Determine sum of numpy array while excluding certain values - python

I would like to determine the sum of a two dimensional numpy array. However, elements with a certain value I want to exclude from this summation. What is the most efficient way to do this?
For example, here I initialize a two dimensional numpy array of 1s and replace several of them by 2:
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
How can I sum over the elements in my two dimensional array while excluding all of the 2s? Note that with the 10 by 10 array the correct answer should be 97 as I replaced three elements with the value 2.
I know I can do this with nested for loops. For example:
elements = []
for idx_x in range(data_set.shape[0]):
for idx_y in range(data_set.shape[1]):
if data_set[idx_x][idx_y] != 2:
elements.append(data_set[idx_x][idx_y])
data_set_sum = numpy.sum(elements)
However on my actual data (which is very large) this is too slow. What is the correct way of doing this?

Use numpy's capability of indexing with boolean arrays. In the below example data_set!=2 evaluates to a boolean array which is True whenever the element is not 2 (and has the correct shape). So data_set[data_set!=2] is a fast and convenient way to get an array which doesn't contain a certain value. Of course, the boolean expression can be more complex.
In [1]: import numpy as np
In [2]: data_set = np.ones((10, 10))
In [4]: data_set[4,4] = 2
In [5]: data_set[5,5] = 2
In [6]: data_set[6,6] = 2
In [7]: data_set[data_set != 2].sum()
Out[7]: 97.0
In [8]: data_set != 2
Out[8]:
array([[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
...
[ True, True, True, True, True, True, True, True, True,
True]], dtype=bool)

Without numpy, the solution is not much more complex:
x = [1,2,3,4,5,6,7]
sum(y for y in x if y != 7)
# 21
Works for a list of excluded values too:
# set is faster for resolving `in`
exl = set([1,2,3])
sum(y for y in x if y not in exl)
# 22

Using np.sums where= argument, we avoid the need for array copying which would otherwise be triggered from using advanced array indexing:
>>> import numpy as np
>>> data_set = np.ones((10,10))
>>> data_set[(4,5,6),(4,5,6)] = 2
>>> np.sum(data_set, where=data_set != 2)
97.0
>>> data_set.sum(where=data_set != 2)
97.0
https://numpy.org/doc/stable/reference/generated/numpy.sum.html
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing

How about this way that makes use of numpy's boolean capabilities.
We simply set all the values that meet the specification to zero before taking the sum, that way we don't change the shape of the array as we would if we were to filter them from the array.
The other benefit of this is that it means we can sum along axis after the filter is applied.
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
print "Sum", data_set.sum()
another_set = numpy.array(data_set) # Take a copy, we'll need that later
data_set[data_set == 2] = 0 # Set all the values that are 2 to zero
print "Filtered sum", data_set.sum()
print "Along axis", data_set.sum(0), data_set.sum(1)
Equally we could use any other boolean to set the data we wish to exclude from the sum.
another_set[(another_set > 1) & (another_set < 3)] = 0
print "Another filtered sum", another_set.sum()

Related

Is there a 2-D "where" in numpy?

This might seem an odd question, but it boils down to quite a simple operation that I can't find a numpy equivalent for. I've looked at np.where as well as many other operations but can't find anything that does this:
a = np.array([1,2,3])
b = np.array([1,2,3,4])
c = np.array([i<b for i in a])
The output is a 2-D array (3,4), of booleans comparing each value.
If you're asking how to get c without loop, try this
# make "a" a column vector
# > broadcasts to produce a len(a) x len(b) array
c = b > a[:, None]
c
array([[False, True, True, True],
[False, False, True, True],
[False, False, False, True]])
You can extend the approach in the other answer to get the values of a and b. Given a mask of
c = b > a[:, None]
You can extract the indices for each dimension using np.where or np.nonzero:
row, col = np.nonzero(c)
And use the indices to get the corresponding values:
ag = a[row]
bg = b[col]
Elements of a and b may be repeated in the result.

How to count groups of certain value in a numpy 1d array?

I have a numpy 1d arrays with boolean values, that looks like
array_in = np.array([False, True, True, True, False, False, True, True, False])
This arrays have different length. As you can see, there are parts, where True values located next to each other, so we have groups of Trues and groups of Falses. I want to count the number of True groups. For our case, we have
N = 2
I tried to do some loops with conditions, but it got really messy and confusing.
You can use np.diff to determine changes between groups. By attaching False to the start and the end of this difference calculation we make sure that True groups at the start and end are properly counted.
import numpy as np
array_in = np.array([False, True, True, True, False, False, True, True, False, True, False, True])
true_groups = np.sum(np.diff(array_in, prepend=False, append=False))//2
#print(true_groups)
>>>4
If you don't want to write loops and conditions, you could take a shortcut by looking at this like a connected components problem.
import numpy as np
from skimage import measure
array_in = np.array([False, True, True, True, False, False, True, True, False])
N = max(measure.label(array_in))
When an array is passed into the measure.label() function, it treats the 0 values in that array as the "background". Then it looks at all the non-zero values, finds connected regions, and numbers them.
For example, the label output on the above array will be [0, 1, 1, 1, 0, 0, 2, 2, 0]. Naturally, then doing a simple max on the output gives you the largest group number (here it's 2) -- which is also the same as the number of True groups.
A straightforward way of finding the number of groups of True is by counting the number of False, True sequences in the array. With a list comprehension, that will look like:
sum([1 for i in range(1, len(x)) if x[i] and not x[i-1]])
Alternatively you can convert the initial array into a string of '0's and '1's and count the number of '01' occurences:
''.join([str(int(k)) for k in x]).count('01')
In native numpy, a vectorised solution can be done by checking how many times F changes to T sequentially.
For example,
np.bitwise_and(array_in[1:], ~array_in[:-1]).sum() + array_in[0]
I am computing a bitwise and of every element of the array with it's previous element after negating the previous element. However, in doing so, the first element is ignored, so I am adding that in manually.

Count all values of an array that are between 0 and 1

Is there an effective way to count all the values in a numpy array which are between 0 and 1?
I know this is easily countable with a for loop, but that seems pretty inefficient to me. I tried to play around with the count_nonzero() function but I couldn't make it work the way I wanted.
Greetings
One quick and easy method is to use the logical_and() function, which returns a boolean mask array. Then simply use the .sum() function to sum the True values.
Example:
import numpy as np
a = np.array([0, .1, .2, .3, 1, 2])
np.logical_and(a>0, a<1).sum()
Output:
>>> 3
Example 2:
Or, if you'd prefer a more 'low-level' (non-helper function) approach, the & logical operator can be used:
((a > 0) & (a < 1)).sum()
This might be one way. You can easily replace <= and >= with strict inequalities as per your wish.
>>> import numpy as np
>>> a = np.random.randn(3,3)
>>> a
array([[-2.17470114, 0.59575531, 0.06795138],
[-0.57380035, 0.05663369, 1.12636801],
[ 0.55363332, -0.04039947, 1.14837819]])
>>> inds1 = a >= 0
>>> inds2 = a <= 1
>>> inds = inds1 * inds2
>>> inds
array([[False, True, True],
[False, True, False],
[ True, False, False]])
>>> inds.sum()
4

numpy multiple boolean index arrays

I have an array which I want to use boolean indexing on, with multiple index arrays, each producing a different array. Example:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
Should return something along the lines of:
[[2,3], [1]]
I assume that since the number of cells containing True can vary between masks, I cannot expect the result to reside in a 2d numpy array, but I'm still hoping for something more elegant than iterating over the masks the appending the result of indexing w by the i-th b mask to it.
Am I missing a better option?
Edit: The next step I want to do afterwards is to sum each of the arrays returned by w[b], returning a list of scalars. If that somehow makes the problem easier, I'd love to know as well.
Assuming you want a list of numpy arrays you can simply use a comprehension:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
[w[bool] for bool in b]
# [array([2, 3]), array([1])]
If your goal is just a sum of the masked values you use:
np.sum(w*b) # 6
or
np.sum(w*b, axis=1) # array([5, 1])
# or b # w
…since False times you number will be 0 and therefor won't effect the sum.
Try this:
[w[x] for x in b]
Hope this helps.

Replace values in specific columns of a numpy array

I have a N x M numpy array (matrix). Here is an example with a 3 x 5 array:
x = numpy.array([[0,1,2,3,4,5],[0,-1,2,3,-4,-5],[0,-1,-2,-3,4,5]])
I'd like to scan all the columns of x and replace the values of each column if they are equal to a specific value.
This code for example aims to replace all the negative values (where the value is equal to the column number) to 100:
for i in range(1,6):
x[:,i == -(i)] = 100
This code obtains this warning:
DeprecationWarning: using a boolean instead of an integer will result in an error in the future
I'm using numpy 1.8.2. How can I avoid this warning without downgrade numpy?
I don't follow what your code is trying to do:
the i == -(i)
will evaluate to something like this:
x[:, True]
x[:, False]
I don't think this is what you want. You should try something like this:
for i in range(1, 6):
mask = x[:, i] == -i
x[:, i][mask] = 100
Create a mask over the whole column, and use that to change the values.
Even without the warning, the code you have there will not do what you want. i is the loop index and will equal minus itself only if i == 0, which is never. Your test will always return false, which is cast to 0. In other words your code will replace the first element of each row with 100.
To get this to work I would do
for i in range(1, 6):
col = x[:,i]
col[col == -i] = 100
Notice that you use the name of the array for the masking and that you need to separate the conventional indexing from the masking
If you are worried about the warning spewing out text, then ignore it as a Warning/Exception:
import numpy
import warnings
warnings.simplefilter('default') # this enables DeprecationWarnings to be thrown
x = numpy.array([[0,1,2,3,4,5],[0,-1,2,3,-4,-5],[0,-1,-2,-3,4,5]])
with warnings.catch_warnings():
warnings.simplefilter("ignore") # and this ignores them
for i in range(1,6):
x[:,i == -(i)] = 100
print(x) # just to show that you are actually changing the content
As you can see in the comments, some people are not getting DeprecationWarning. That is probably because python suppresses developer-only warnings since 2.7
As others have said, your loop isn't doing what you think it is doing. I would propose you change your code to use numpy's fancy indexing.
# First, create the "test values" (column index):
>>> test_values = numpy.arange(6)
# test_values is array([0, 1, 2, 3, 4, 5])
#
# Now, we want to check which columns have value == -test_values:
#
>>> mask = (x == -test_values) & (x < 0)
# mask is True wherever a value in the i-th column of x is negative i
>>> mask
array([[False, False, False, False, False, False],
[False, True, False, False, True, True],
[False, True, True, True, False, False]], dtype=bool)
#
# Now, set those values to 100
>>> x[mask] = 100
>>> x
array([[ 0, 1, 2, 3, 4, 5],
[ 0, 100, 2, 3, 100, 100],
[ 0, 100, 100, 100, 4, 5]])

Categories

Resources