numpy multiple boolean index arrays - python

I have an array which I want to use boolean indexing on, with multiple index arrays, each producing a different array. Example:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
Should return something along the lines of:
[[2,3], [1]]
I assume that since the number of cells containing True can vary between masks, I cannot expect the result to reside in a 2d numpy array, but I'm still hoping for something more elegant than iterating over the masks and appending the result of indexing w by the i-th mask in b to a list.
Am I missing a better option?
Edit: The next step I want to do afterwards is to sum each of the arrays returned by w[b], returning a list of scalars. If that somehow makes the problem easier, I'd love to know as well.

Assuming you want a list of numpy arrays you can simply use a comprehension:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
[w[mask] for mask in b]  # "mask" avoids shadowing the built-in bool
# [array([2, 3]), array([1])]
If your goal is just the sum of the masked values, you can use:
np.sum(w*b) # 6
or
np.sum(w*b, axis=1) # array([5, 1])
# or b @ w
…since False times your number will be 0 and therefore won't affect the sum.
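To check that the comprehension and the vectorized sums agree, here is a minimal sketch. The names per_mask, sums_loop, sums_mul and sums_dot are only illustrative and do not come from the answer above.

```python
import numpy as np

w = np.array([1, 2, 3])
b = np.array([[False, True, True], [True, False, False]])

# One array per mask; the lengths can differ, so the result is a plain list.
per_mask = [w[mask] for mask in b]

# Per-mask sums computed three ways.
sums_loop = np.array([part.sum() for part in per_mask])
sums_mul = np.sum(w * b, axis=1)  # False counts as 0, True as 1
sums_dot = b @ w                  # matrix-vector product over the mask rows

print(sums_loop, sums_mul, sums_dot)
```

All three give one sum per mask. The list form is only needed if you want the selected elements themselves.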

Try this:
[w[x] for x in b]
Hope this helps.

Related

Is there a 2-D "where" in numpy?

This might seem an odd question, but it boils down to quite a simple operation that I can't find a numpy equivalent for. I've looked at np.where as well as many other operations but can't find anything that does this:
a = np.array([1,2,3])
b = np.array([1,2,3,4])
c = np.array([i<b for i in a])
The output is a 2-D array (3,4), of booleans comparing each value.
If you're asking how to get c without a loop, try this:
# make "a" a column vector
# > broadcasts to produce a len(a) x len(b) array
c = b > a[:, None]
c
array([[False,  True,  True,  True],
       [False, False,  True,  True],
       [False, False, False,  True]])
You can extend the approach in the other answer to get the values of a and b. Given a mask of
c = b > a[:, None]
You can extract the indices for each dimension using np.where or np.nonzero:
row, col = np.nonzero(c)
And use the indices to get the corresponding values:
ag = a[row]
bg = b[col]
Elements of a and b may be repeated in the result.
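Put together, the steps above could look like the sketch below. The pairs list is only an illustrative way to display the matched values.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3, 4])

c = b > a[:, None]        # boolean mask of shape (3, 4)
row, col = np.nonzero(c)  # indices of the True entries, in row-major order

# Pair up the values of a and b for which the comparison holds.
pairs = list(zip(a[row].tolist(), b[col].tolist()))
print(pairs)
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```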

Dynamically index a specific element in an n-dimensional numpy array

I have an n-dimensional bool numpy array. I want to invert the bool value of a random item in the array. Inverting is easy, but I'm not sure how to best index a random element. I can generate a list of random positions along the n dimensions
indices = [np.random.randint(n) for n in array.shape]
However, how do I use that to index the corresponding element? array[indices] fetches elements at indices along the first dimension. array[*indices] does not work. I could do something like
a = array[indices[0]]
for index in indices[1:]:
    a = a[index]
but I'd like to avoid loops. Is there a more elegant solution?
I know this sounds too simple, but use a tuple instead of a list for the index. There's some documentation about why this is in the NumPy docs, but (at least for me) the significance of tuples isn't immediately obvious (even though it was quite clear and there's a big warning box about this).
import numpy as np
np.random.seed(201)
A = np.random.choice(a=[False, True], size=(2,2))
# make it a Tuple
index = tuple(np.random.randint(n) for n in A.shape)
print(A)
# [[False True]
# [False True]]
print(index)
# (0, 1)
print(A[index])
# True
A[index] = ~A[index]
print(A)
# [[False False]
# [False True]]
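A possible alternative, not part of the answer above: draw a single flat position and convert it to a tuple index with np.unravel_index. This is only a sketch; the array shape and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
A = np.zeros((2, 3, 4), dtype=bool)

flat = rng.integers(A.size)              # one random position in the flattened array
index = np.unravel_index(flat, A.shape)  # tuple with one index per axis

A[index] = ~A[index]  # flip exactly that element
print(index, A.sum())
```

Because np.unravel_index already returns a tuple, it selects a single element for any number of dimensions.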

Numpy compare array to multiple scalars at once

Suppose I have an array
a = np.array([1,2,3])
and I want to compare it to some scalar; this works fine like
a == 2 # [False, True, False]
Is there a way I can do such a comparison but with multiple scalars at once? The default behavior when comparing two arrays is to do an elementwise comparison, but instead I want each element of one array to be compared elementwise with the entire other array, like this:
scalars = np.array([1, 2])
some_function(a, scalars)
[[True, False, False],
[False, True, False]]
Obviously I can do this, e.g., with a for loop and then stacking, but is there any vectorized way to achieve the same result?
Outer product, except it's equality instead of product:
numpy.equal.outer(scalars, a)
or adjust the dimensions and perform a broadcasted comparison:
scalars[:, None] == a
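A short sketch showing that both forms give the same result for the arrays from the question (by_outer and by_broadcast are illustrative names):

```python
import numpy as np

a = np.array([1, 2, 3])
scalars = np.array([1, 2])

by_outer = np.equal.outer(scalars, a)
by_broadcast = scalars[:, None] == a  # shape (2, 1) against (3,) broadcasts to (2, 3)

print(by_broadcast)
# [[ True False False]
#  [False  True False]]
print(np.array_equal(by_outer, by_broadcast))
# True
```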

Numpy: Tuple-Indexing of rows in ndarray (for later selection of subsets)

I'm fairly new to NumPy, and also not the most experienced Python programmer,
so please excuse me if this seems trivial to you ;)
I am writing a script to extract specific data out of several molecular-dynamics simulations.
Therefore I read data out of some files and modify and truncate them to a uniform length
and add everything together row-wise, to form a 2D-array for each simulation run.
These arrays are appended to each other, so that I ultimately get a 3D-Array, where each slice along the z-Axis would represent a dataset of a specific simulation run.
The goal is to later on do easy manipulation, e.g. averaging over all simulation runs.
This is just to give you the basic idea of what is done:
import numpy as np
A = np.zeros((2000), dtype = bool)
A = A.reshape((1, 2000))
# Appending different rows to form a '2D-Matrix',
# this is the actual data per simulation run
for i in range(1, 103):
    B = np.zeros((2000), dtype = bool)
    B = B.reshape((1, 2000))
    A = np.concatenate((A, B), axis=0)
print(A.shape)
# >>> (103, 2000)
C = np.expand_dims(A, axis=2)
A = np.expand_dims(A, axis=2)
print(A.shape)
# >>> (103, 2000, 1)
# Appending different '2D-Matrices' to form a 3D array,
# each slice along the z-Axis representing one simulation run
for i in range(1, 50):
    A = np.concatenate((A, C), axis=2)
print(A.shape)
# >>> (103, 2000, 50)
So far so good, now to the actual question:
In one 2D-array, each row represents a different set of interacting atom-pairs.
I later on want to create subsets of the array, depending on different criteria - e.g. 'show me all pairs, where the distance x is 10 < x <= 20'.
So when I first add the rows together in for i in range(1, 103): ..., I want to include indexing of the rows with a set of ints for each row.
The data of atom pairs is there anyway, at the moment I'm just not including it in the ndarray.
I was thinking of a tuple, so that my 2D-Array would look like
[ [('int' a,'int' b), [False,True,False,...]],
[('int' a,'int' d), [True, False, True...]],
...
]
Or something like that
[ [['int' a], ['int' b], [False,True,False,...]],
[['int' a], ['int' d], [True, False, True...]],
...
]
Can you think of another or easier approach for this kind of filtering?
I'm not quite sure if I'm on the right track here and it doesn't seem to be very straight-forward to have different datatypes in an array like that.
Also notice, that all indexes are ordered in the same way in each 2D-array, because I sort them (atm based on a String) and add np.zeros() rows for those that only occur on other simulation runs.
Maybe a Lookup-table is the right approach?
Thanks a lot!
Update/Answer:
Sorry, I know the question was a little bit too specific and bloated with
code that wasn't relevant to the question.
I answered the question myself, and for the sake of documentation you can find it below. It is specific, but maybe it helps someone to handle his indexing in numpy.
Short, general answer:
I basically just created a look-up-table as a python list and did a very simple numpy slicing operation for selection with a mask, containing indices:
A = [[[1, 2],
      [3, 4],
      [5, 6]],
     [[7, 8],
      [9, 10],
      [11, 12]]]
A = np.asarray(A)
# selects only rows 1 and 2 from each 2D array
mask = [1,2]
B = A[ : , mask, : ]
Which gives for B:
[[[ 3  4]
  [ 5  6]]
 [[ 9 10]
  [11 12]]]
Complete answer, specific for my question above:
This is my 2D array:
A =[[True, False, False, False, False],
[False, True, False, False, False],
[False, False, True, False, False]]
A = np.asarray(A)
Indexing of the rows as tuples, this is due to my specific problem
e.g.:
lut = [(1,2),(3,4),(3,5)]
Append other 2D array to form a 3D array:
C = np.expand_dims(A, axis=0)
A = np.expand_dims(A, axis=0)
A = np.concatenate((A, C), axis=0)
This is the 3D Array A:
>[[[ True False False False False]
[False True False False False]
[False False True False False]]
[[ True False False False False]
[False True False False False]
[False False True False False]]]
Selecting rows, which contain "3" in the Look-up-Table
mask = [i for i, v in enumerate(lut) if 3 in v]
> [1, 2]
Applying mask to the 3D-array:
B = A[ : , mask, : ]
Now B is the 3D array A after selection:
[[[False True False False False]
[False False True False False]]
[[False True False False False]
[False False True False False]]]
To keep track of the new indices of B:
create a new Look-up-Table for further computation:
newLut = [v for i, v in enumerate(lut) if i in mask]
>[(3, 4), (3, 5)]
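As a variation on the list-based look-up table, the table could also be kept as a NumPy array and filtered with a boolean mask. This is only a sketch under that assumption; keep and new_lut are illustrative names, and np.eye is used just to rebuild the example data.

```python
import numpy as np

# Look-up table as an array: one (a, b) pair per row of each 2D slice.
lut = np.array([(1, 2), (3, 4), (3, 5)])

# Two identical 2D slices stacked into a 3D array of shape (2, 3, 5).
A = np.stack([np.eye(3, 5, dtype=bool)] * 2)

keep = (lut == 3).any(axis=1)  # True for pairs that contain a 3
B = A[:, keep, :]              # select the matching rows in every slice
new_lut = lut[keep]            # look-up table that matches the rows of B

print(keep, B.shape, new_lut.tolist())
```

The same mask filters both the data and the look-up table, so the two stay aligned without a second comprehension.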

Determine sum of numpy array while excluding certain values

I would like to determine the sum of a two dimensional numpy array. However, elements with a certain value I want to exclude from this summation. What is the most efficient way to do this?
For example, here I initialize a two dimensional numpy array of 1s and replace several of them by 2:
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
How can I sum over the elements in my two dimensional array while excluding all of the 2s? Note that with the 10 by 10 array the correct answer should be 97 as I replaced three elements with the value 2.
I know I can do this with nested for loops. For example:
elements = []
for idx_x in range(data_set.shape[0]):
    for idx_y in range(data_set.shape[1]):
        if data_set[idx_x][idx_y] != 2:
            elements.append(data_set[idx_x][idx_y])
data_set_sum = numpy.sum(elements)
However on my actual data (which is very large) this is too slow. What is the correct way of doing this?
Use numpy's capability of indexing with boolean arrays. In the below example data_set!=2 evaluates to a boolean array which is True whenever the element is not 2 (and has the correct shape). So data_set[data_set!=2] is a fast and convenient way to get an array which doesn't contain a certain value. Of course, the boolean expression can be more complex.
In [1]: import numpy as np
In [2]: data_set = np.ones((10, 10))
In [4]: data_set[4,4] = 2
In [5]: data_set[5,5] = 2
In [6]: data_set[6,6] = 2
In [7]: data_set[data_set != 2].sum()
Out[7]: 97.0
In [8]: data_set != 2
Out[8]:
array([[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
...
[ True, True, True, True, True, True, True, True, True,
True]], dtype=bool)
Without numpy, the solution is not much more complex:
x = [1,2,3,4,5,6,7]
sum(y for y in x if y != 7)
# 21
Works for a list of excluded values too:
# set is faster for resolving `in`
exl = set([1,2,3])
sum(y for y in x if y not in exl)
# 22
Using np.sum's where= argument, we avoid the array copy that advanced (boolean) indexing would otherwise create:
>>> import numpy as np
>>> data_set = np.ones((10,10))
>>> data_set[(4,5,6),(4,5,6)] = 2
>>> np.sum(data_set, where=data_set != 2)
97.0
>>> data_set.sum(where=data_set != 2)
97.0
https://numpy.org/doc/stable/reference/generated/numpy.sum.html
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
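The copy can be made visible with a small sketch (selected and total are illustrative names). It assumes a NumPy version whose np.sum accepts the where= argument.

```python
import numpy as np

data_set = np.ones((10, 10))
data_set[(4, 5, 6), (4, 5, 6)] = 2

# Boolean (advanced) indexing builds a new 1-D array holding the selected values.
selected = data_set[data_set != 2]
print(selected.shape)                        # (97,)
print(np.shares_memory(selected, data_set))  # False

# where= sums in place over the original array, skipping the masked-out entries.
total = np.sum(data_set, where=data_set != 2)
print(total)                                 # 97.0
```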
How about this approach, which makes use of numpy's boolean capabilities?
We simply set all the values that meet the specification to zero before taking the sum, that way we don't change the shape of the array as we would if we were to filter them from the array.
The other benefit of this is that it means we can sum along an axis after the filter is applied.
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
print("Sum", data_set.sum())
another_set = numpy.array(data_set) # Take a copy, we'll need that later
data_set[data_set == 2] = 0 # Set all the values that are 2 to zero
print("Filtered sum", data_set.sum())
print("Along axis", data_set.sum(0), data_set.sum(1))
Equally we could use any other boolean expression to select the data we wish to exclude from the sum.
another_set[(another_set > 1) & (another_set < 3)] = 0
print("Another filtered sum", another_set.sum())
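If the original array should stay untouched, a variation on the same idea is to build the zeroed array with np.where instead of assigning in place. This is a sketch rather than part of the answer above; zeroed is an illustrative name.

```python
import numpy as np

data_set = np.ones((10, 10))
data_set[(4, 5, 6), (4, 5, 6)] = 2

# New array with the excluded values replaced by 0; data_set itself is unchanged.
zeroed = np.where(data_set == 2, 0, data_set)

print(zeroed.sum())        # 97.0
print(zeroed.sum(axis=0))  # per-column sums, shape (10,)
print(data_set.sum())      # 103.0, the original still contains the 2s
```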
