Is there a 2-D "where" in numpy? - python

This might seem an odd question, but it boils down to quite a simple operation that I can't find a numpy equivalent for. I've looked at np.where as well as many other operations but can't find anything that does this:
a = np.array([1,2,3])
b = np.array([1,2,3,4])
c = np.array([i<b for i in a])
The output is a 2-D array (3,4), of booleans comparing each value.

If you're asking how to get c without a loop, try this:
# make "a" a column vector
# > broadcasts to produce a len(a) x len(b) array
c = b > a[:, None]
c
array([[False,  True,  True,  True],
       [False, False,  True,  True],
       [False, False, False,  True]])

You can extend the approach in the other answer to get the values of a and b. Given a mask of
c = b > a[:, None]
You can extract the indices for each dimension using np.where or np.nonzero:
row, col = np.nonzero(c)
And use the indices to get the corresponding values:
ag = a[row]
bg = b[col]
Elements of a and b may be repeated in the result.
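Putting both steps together, a minimal runnable sketch (variable names as above):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3, 4])

c = b > a[:, None]        # broadcasts to a (3, 4) boolean array
row, col = np.nonzero(c)  # indices where the comparison holds
ag, bg = a[row], b[col]   # paired values; repeats are expected
print(ag)  # [1 1 1 2 2 3]
print(bg)  # [2 3 4 3 4 4]
```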

Related

numpy multiple boolean index arrays

I have an array which I want to use boolean indexing on, with multiple index arrays, each producing a different array. Example:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
Should return something along the lines of:
[[2,3], [1]]
I assume that since the number of cells containing True can vary between masks, I cannot expect the result to reside in a 2-D numpy array, but I'm still hoping for something more elegant than iterating over the masks and appending the result of indexing w by the i-th mask.
Am I missing a better option?
Edit: The next step I want to do afterwards is to sum each of the arrays returned by w[b], returning a list of scalars. If that somehow makes the problem easier, I'd love to know as well.
Assuming you want a list of numpy arrays you can simply use a comprehension:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
[w[mask] for mask in b]  # avoid shadowing the built-in name "bool"
# [array([2, 3]), array([1])]
If your goal is just a sum of the masked values, you can use:
np.sum(w*b)          # 6
or
np.sum(w*b, axis=1)  # array([5, 1])
# or equivalently: b @ w
…since False times your number is 0 and therefore won't affect the sum.
Try this:
[w[x] for x in b]
Hope this helps.
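Both answers combine into one short sketch: a comprehension for the ragged per-mask arrays, and a matrix product for the per-mask sums:

```python
import numpy as np

w = np.array([1, 2, 3])
b = np.array([[False, True, True],
              [True, False, False]])

per_mask = [w[m] for m in b]  # list of ragged arrays: [array([2, 3]), array([1])]
sums = b @ w                  # per-mask sums in one step: array([5, 1])
```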

ndarray row-wise index of values greater than array

I have one array of shape (X, 5):
M = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 1],
     [2, 5, 7, 8, 3],
     ...]
and one array of shape (X, 1):
n = [[3],
     [7],
     [100],
     ...]
Now I need to get the first index of M >= n for each row, or nan if there is no such index.
For example:
np.where([1,2,3,4,5] >= 3)[0][0] # Returns 2
np.searchsorted(np.array([1,2,3,4,5]), 3) # Returns 2
These examples are applied to each row individually (I could loop X times as both arrays have the length X).
I wonder, is there a way to do it in a multidimensional way to get an output of all indices at once?
Something like:
np.where(M>=n)
Thank you
Edit: Values in M are unsorted, I'm still looking for the first index/occurrence fitting M >= n (so probably not searchsorted)
You could start by checking which entries of each row of M are greater than or equal to n, and use argmax to get the first True per row. For rows where every column is False, we can use np.where to substitute np.nan:
M = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 1],
              [2, 5, 7, 8, 3]])
n = np.array([[3], [7], [100]])

le = n <= M
# array([[False, False,  True,  True,  True],
#        [False,  True,  True,  True, False],
#        [False, False, False, False, False]])

lea = le.argmax(1)                     # first True per row (0 if none)
has_any = le[np.arange(len(le)), lea]  # True where the row has any hit
np.where(has_any, lea, np.nan)
# array([ 2.,  1., nan])
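Wrapped up as a small helper (the function name is my own), the whole recipe is:

```python
import numpy as np

def first_index_ge(M, n):
    """Per-row index of the first column where M >= n, or NaN if none."""
    hit = M >= n                 # boolean mask; n broadcasts over columns
    first = hit.argmax(axis=1)   # index of first True (0 if all False)
    any_hit = hit.any(axis=1)    # distinguish "first is 0" from "no hit"
    return np.where(any_hit, first, np.nan)

M = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 1],
              [2, 5, 7, 8, 3]])
n = np.array([[3], [7], [100]])
print(first_index_ge(M, n))  # [ 2.  1. nan]
```

Note the result is a float array, since np.nan has no integer representation.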

Creating a "bitmask" from several boolean numpy arrays

I'm trying to convert several masks (boolean arrays) to a bitmask with numpy, while that in theory works I feel that I'm doing too many operations.
For example to create the bitmask I use:
import numpy as np
flags = [
    np.array([True, False, False]),
    np.array([False, True, False]),
    np.array([False, True, False]),
]
flag_bits = np.zeros(3, dtype=np.int8)
for idx, flag in enumerate(flags):
    flag_bits += flag.astype(np.int8) << idx  # equivalent to flag * 2**idx
Which gives me the expected "bitmask":
>>> flag_bits
array([1, 6, 0], dtype=int8)
>>> [np.binary_repr(bit, width=7) for bit in flag_bits]
['0000001', '0000110', '0000000']
However I feel that especially the casting to int8 and the addition with the flag_bits array is too complicated. Therefore I wanted to ask if there is any NumPy functionality that I missed that could be used to create such an "bitmask" array?
Note: I'm calling an external function that expects such a bitmask, otherwise I would stick with the boolean arrays.
>>> x = 2 ** np.arange(len(flags))  # weights [1, 2, 4]
>>> np.dot(x, flags)
array([1, 6, 0])
How it works: in a bitmask, every bit is effectively an original array element multiplied by a power of 2 according to its position, e.g. 6 = False * 1 + True * 2 + True * 4. This can be expressed as a matrix product, which is very efficient in numpy.
So the first line creates the weights x = [1, 2, 4, 8, ..., 2^(n-1)].
Then each row of flags (one flag array per row) is scaled by its weight and the rows are summed column-wise, which yields the bitmask for every element at once.
How about this (the booleans are cast to int8 first so the bit-shift is well-defined; keep the final astype if an int8 result is desired):
flag_bits = (np.transpose(flags).astype(np.int8) << np.arange(len(flags))).sum(axis=1).astype(np.int8)
# array([1, 6, 0], dtype=int8)
Here's an approach to directly get to the string bitmask with boolean-indexing -
out = np.repeat('0000000',3).astype('S7')
out.view('S1').reshape(-1,7)[:,-3:] = np.asarray(flags).astype(int)[::-1].T
Sample run -
In [41]: flags
Out[41]:
[array([ True, False, False], dtype=bool),
 array([False,  True, False], dtype=bool),
 array([False,  True, False], dtype=bool)]
In [42]: out = np.repeat('0000000',3).astype('S7')
In [43]: out.view('S1').reshape(-1,7)[:,-3:] = np.asarray(flags).astype(int)[::-1].T
In [44]: out
Out[44]:
array([b'0000001', b'0000110', b'0000000'],
dtype='|S7')
Using the same matrix-multiplication strategy as discussed in detail in @Marat's solution, but with a vectorized scaling array that gives us flag_bits directly -
np.dot(2**np.arange(3), flags)
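As yet another option, numpy's own np.packbits can build the integer bitmask directly; a sketch, assuming numpy >= 1.17 for the bitorder argument:

```python
import numpy as np

flags = [np.array([True, False, False]),
         np.array([False, True, False]),
         np.array([False, True, False])]

# Transpose so each row holds one element's bits; bitorder="little"
# puts flags[0] in bit 0, flags[1] in bit 1, and so on.
flag_bits = np.packbits(np.asarray(flags).T, axis=1, bitorder="little").ravel()
print(flag_bits)  # [1 6 0] (dtype uint8)
```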

Numpy: Tuple-Indexing of rows in ndarray (for later selection of subsets)

I'm fairly new to NumPy, and also not the most expierenced Python programmer,
so please excuse me if this seems trivial to you ;)
I am writing a script to extract specific data out of several molecular-dynamics simulations.
To that end I read data out of some files, modify and truncate it to a uniform length,
and stack everything row-wise to form a 2D array for each simulation run.
These arrays are appended to each other, so that I ultimately get a 3D-Array, where each slice along the z-Axis would represent a dataset of a specific simulation run.
The goal is to later on do easy manipulation, e.g. averaging over all simulation runs.
This is just to give you the basic idea of what is done:
import numpy as np

A = np.zeros((1, 2000), dtype=bool)
# Appending different rows to form a '2D-Matrix',
# this is the actual data per simulation run
for i in range(1, 103):
    B = np.zeros((1, 2000), dtype=bool)
    A = np.concatenate((A, B), axis=0)
print(A.shape)
# >>> (103, 2000)
C = np.expand_dims(A, axis=2)
A = np.expand_dims(A, axis=2)
print(A.shape)
# >>> (103, 2000, 1)
# Appending different '2D-Matrices' to form a 3D array,
# each slice along the z-axis representing one simulation run
for i in range(1, 50):
    A = np.concatenate((A, C), axis=2)
print(A.shape)
# >>> (103, 2000, 50)
So far so good, now to the actual question:
In one 2D-array, each row represents a different set of interacting atom-pairs.
I later on want to create subsets of the array, depending on different criteria - e.g. 'show me all pairs where the distance x is 10 < x <= 20'.
So when I first add the rows together in for i in xrange(1,103): ..., I want to include indexing of the rows with a set of ints for each row.
The data of atom pairs is there anyway, at the moment I'm just not including it in the ndarray.
I was thinking of a tuple, so that my 2D-Array would look like
[ [('int' a,'int' b), [False,True,False,...]],
[('int' a,'int' d), [True, False, True...]],
...
]
Or something like that
[ [['int' a], ['int' b], [False,True,False,...]],
[['int' a], ['int' d], [True, False, True...]],
...
]
Can you think of another or easier approach for this kind of filtering?
I'm not quite sure if I'm on the right track here and it doesn't seem to be very straight-forward to have different datatypes in an array like that.
Also note that all indices are ordered the same way in each 2D array, because I sort them (at the moment based on a string) and add np.zeros() rows for those that only occur in other simulation runs.
Maybe a Lookup-table is the right approach?
Thanks a lot!
Update/Answer:
Sorry, I know the question was a little bit too specific and bloated with
code that wasn't relevant to the question.
I answered the question myself, and for the sake of documentation you can find it below. It is specific, but maybe it helps someone handle their indexing in numpy.
Short, general answer:
I basically just created a look-up-table as a python list and did a very simple numpy slicing operation for selection with a mask, containing indices:
A = [[[ 1,  2],
      [ 3,  4],
      [ 5,  6]],

     [[ 7,  8],
      [ 9, 10],
      [11, 12]]]
A = np.asarray(A)

# selects only rows 1 and 2 from each 2D array
mask = [1, 2]
B = A[:, mask, :]
Which gives for B:
[[[ 3  4]
  [ 5  6]]

 [[ 9 10]
  [11 12]]]
Complete answer, specific for my question above:
This is my 2D array:
A = [[True, False, False, False, False],
     [False, True, False, False, False],
     [False, False, True, False, False]]
A = np.asarray(A)
The rows are indexed as tuples (due to my specific problem), e.g.:
lut = [(1,2),(3,4),(3,5)]
Append other 2D array to form a 3D array:
C = np.expand_dims(A, axis=0)
A = np.expand_dims(A, axis=0)
A = np.concatenate((A, C), axis=0)
This is the 3D Array A:
> [[[ True False False False False]
   [False  True False False False]
   [False False  True False False]]

  [[ True False False False False]
   [False  True False False False]
   [False False  True False False]]]
Selecting rows, which contain "3" in the Look-up-Table
mask = [i for i, v in enumerate(lut) if 3 in v]
> [1, 2]
Applying mask to the 3D-array:
B = A[ : , mask, : ]
Now B is the 3D array A after selection:
[[[False  True False False False]
  [False False  True False False]]

 [[False  True False False False]
  [False False  True False False]]]
To keep track of the new indices of B:
create a new Look-up-Table for further computation:
newLut = [v for i, v in enumerate(lut) if i in mask]
>[(3, 4), (3, 5)]
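For larger look-up tables, the membership test itself can be vectorized by turning lut into an array; a small sketch of that variant:

```python
import numpy as np

lut = np.array([(1, 2), (3, 4), (3, 5)])

# rows of the look-up table that contain atom index 3
mask = np.flatnonzero((lut == 3).any(axis=1))
print(mask)       # [1 2]
print(lut[mask])  # the matching (pair) entries, [[3 4], [3 5]]
```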

Determine sum of numpy array while excluding certain values

I would like to determine the sum of a two dimensional numpy array. However, elements with a certain value I want to exclude from this summation. What is the most efficient way to do this?
For example, here I initialize a two dimensional numpy array of 1s and replace several of them by 2:
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
How can I sum over the elements in my two dimensional array while excluding all of the 2s? Note that with the 10 by 10 array the correct answer should be 97 as I replaced three elements with the value 2.
I know I can do this with nested for loops. For example:
elements = []
for idx_x in range(data_set.shape[0]):
    for idx_y in range(data_set.shape[1]):
        if data_set[idx_x][idx_y] != 2:
            elements.append(data_set[idx_x][idx_y])
data_set_sum = numpy.sum(elements)
However on my actual data (which is very large) this is too slow. What is the correct way of doing this?
Use numpy's capability of indexing with boolean arrays. In the below example data_set!=2 evaluates to a boolean array which is True whenever the element is not 2 (and has the correct shape). So data_set[data_set!=2] is a fast and convenient way to get an array which doesn't contain a certain value. Of course, the boolean expression can be more complex.
In [1]: import numpy as np
In [2]: data_set = np.ones((10, 10))
In [4]: data_set[4,4] = 2
In [5]: data_set[5,5] = 2
In [6]: data_set[6,6] = 2
In [7]: data_set[data_set != 2].sum()
Out[7]: 97.0
In [8]: data_set != 2
Out[8]:
array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
       ...
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True]], dtype=bool)
Without numpy, the solution is not much more complex:
x = [1,2,3,4,5,6,7]
sum(y for y in x if y != 7)
# 21
Works for a list of excluded values too:
# set is faster for resolving `in`
exl = set([1,2,3])
sum(y for y in x if y not in exl)
# 22
Using np.sum's where= argument, we avoid the array copy that would otherwise be triggered by advanced indexing:
>>> import numpy as np
>>> data_set = np.ones((10,10))
>>> data_set[(4,5,6),(4,5,6)] = 2
>>> np.sum(data_set, where=data_set != 2)
97.0
>>> data_set.sum(where=data_set != 2)
97.0
https://numpy.org/doc/stable/reference/generated/numpy.sum.html
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
How about this way, which makes use of numpy's boolean indexing.
We simply set all the values that meet the specification to zero before taking the sum; that way we don't change the shape of the array, as we would if we filtered them out.
The other benefit is that we can still sum along an axis after the filter is applied.
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
print("Sum", data_set.sum())
another_set = numpy.array(data_set)  # Take a copy, we'll need it later
data_set[data_set == 2] = 0          # Set all the values that are 2 to zero
print("Filtered sum", data_set.sum())
print("Along axis", data_set.sum(0), data_set.sum(1))
Equally we could use any other boolean condition to zero out the data we wish to exclude from the sum:
another_set[(another_set > 1) & (another_set < 3)] = 0
print("Another filtered sum", another_set.sum())
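Masked arrays offer yet another option; a sketch using np.ma, which keeps the exclusion reusable across several reductions without modifying the data:

```python
import numpy as np

data_set = np.ones((10, 10))
data_set[(4, 5, 6), (4, 5, 6)] = 2

masked = np.ma.masked_equal(data_set, 2)  # hide every 2 from reductions
print(masked.sum())        # 97.0
print(masked.sum(axis=1))  # row sums with the 2s left out
```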
