Numpy Chain Indexing - python

I am trying to gain a better understanding of numpy and have come across something I can't quite understand when it comes to indexing.
Let's say we have this first array of random booleans
bools = np.random.choice([True, False],(7),p=[0.5,0.5])
array([False, True, False, False, True, False, False], dtype=bool)
Then let's also say we have this second array of random numbers selected from a normal distribution
data = np.random.randn(7,3)
array([[ 2.24116809, -0.41761776, -0.69026077],
[-0.85450123, 0.98218741, 0.0233551 ],
[-1.3157436 , -0.79753471, 1.77393444],
[-0.26672724, -0.9532758 , 0.67114247],
[-1.34177843, 1.220083 , -0.35341168],
[ 0.49629327, 1.73943962, 0.59050431],
[ 0.01609382, 0.91396293, 0.3754827 ]])
Using the numpy chain indexing I can do this
data[bools, 2:]
array([[ 0.0233551 ],
[-0.35341168]])
Now let's say I want to simply grab the first element, I can do this
data[bools, 2:][0]
array([ 0.0233551])
But why does this, data[bools, 2:, 0] not work?

But why does this, data[bools, 2:, 0] not work?
Because the input is a 2D array and as such you don't have three dimensions there to use something like : [bools, 2:, 0].
To achieve what you want you are trying to do, you could store the indices corresponding to the True ones in the mask bools and then use it as whole or one element from it for indexing.
A sample run to make things clear -
Inputs :
In [40]: data
Out[40]:
array([[ 1.02429045, 1.74104271, -0.54634826],
[-0.48451969, 0.83455196, 1.94444857],
[ 0.66504345, 0.41821317, 2.52517305],
[ 2.11428982, -0.05769528, 0.84432614],
[ 0.9251009 , -0.74646199, -0.93573164],
[ 0.07321257, -0.10708067, 1.78107884],
[-0.12961046, -0.5787856 , 0.2189466 ]])
In [41]: bools
Out[41]: array([ True, True, False, False, False, False, True], dtype=bool)
Store the valid indices :
In [42]: idx = np.flatnonzero(bools)
In [43]: idx
Out[43]: array([0, 1, 6])
Use as a whole or its first element :
In [44]: data[idx, 2:] # Same as data[bools, 2:]
Out[44]:
array([[-0.54634826],
[ 1.94444857],
[ 0.2189466 ]])
In [45]: data[idx[0], 2:]
Out[45]: array([-0.54634826])

I haven't seen 2d numpy indexing called 'chaining'
data is 2d, and thus can be indexed with a 2 element tuple
data[bools, 2:]
data([bools, slice(2,None,None))]
That can also be expressed as
data[bools,:][:,2:]
where it first selects from rows, and then from columns.
Notice that your indexing produces a (2,1) array; 2 from the number of True in bool, and 1 from the length of the 2: slice.
Your 2nd indexing with [0] is really a row selection:
data[bools, 2:][0]
data[bools, 2:][0,:]
The result is a (1,) array, the size of the 2nd dimension of the intermediate array.

Related

Get boolean array indicating which elements in array which belong to a list

This seems to be a simple question but I am struggling with errors from quite some time.
Imagine an array
a = np.array([2,3,4,5,6])
I want to test which elements in the array belong to another list
[2,3,6]
If I do
a in [2,3,6]
Python raises "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
In return, i would like to get a boolean array-like
array([ True, True, False, False, True], dtype=bool)
Use np.isin to create a boolean mask then use np.argwhere on this mask to find the indices of array elements that are non-zero:
m = np.isin(a, lst)
indices = np.argwhere(m)
# print(m)
array([ True, True, False, False, True])
# print(indices)
array([[0], [1], [4]])
import numpy as np
arr1 = np.array([2,3,4,5,6])
arr2 = np.array([2,3,6])
arr_result = [bool(a1 in arr2) for a1 in arr1]
print(arr_result)
I have used simple list-comprehension logic to do this.
Output:
[True,True,False,False,True]

Compare two ndarrays with different dimensions

I have two ndarrays. First ndarray has string in one column and float values in another column. Second ndarray contains only a column of string values.
For eg:
Array1 Array2
"abc" 1.000 "abc"
"fsfds" -5.000 "qw"
"svs" 2.094 "svs"
"dfdsge" 3.348 "dd"
My question is, how can I compare matching string values from Array1 and Array2 then return corresponding float values from Array1?
I tried set(Array1) & set(Array2) to find unique elements but don't know how to extract float values. Is there a function in numpy?
Thank you.
The easiest way to turn your example into arrays is to copy-n-paste it as a multiline string and use genfromtxt to parse it:
In [344]: txt=b'''"abc" 1.000 "abc"
...: "fsfds" -5.000 "qw"
...: "svs" 2.094 "svs"
...: "dfdsge" 3.348 "dd" '''
In [346]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[346]:
array([(b'"abc"', 1. , b'"abc"'), (b'"fsfds"', -5. , b'"qw"'),
(b'"svs"', 2.094, b'"svs"'), (b'"dfdsge"', 3.348, b'"dd"')],
dtype=[('f0', 'S8'), ('f1', '<f8'), ('f2', 'S5')])
With dtype=None it deduces column dtype, and creates a structured array. I can split that into 2 arrays, one with 2 fields, the other with 1. These are all 1d.
In [347]: arr1, arr2 = _[['f0','f1']], _['f2']
In [348]: arr1
Out[348]:
array([(b'"abc"', 1. ), (b'"fsfds"', -5. ), (b'"svs"', 2.094),
(b'"dfdsge"', 3.348)],
dtype=[('f0', 'S8'), ('f1', '<f8')])
In [349]: arr2
Out[349]:
array([b'"abc"', b'"qw"', b'"svs"', b'"dd"'],
dtype='|S5')
You are little unclear about how you want to compare the text columns. An easy one that looks reasonable with this data is just element by element, the simple ==.
In [350]: arr1['f0']==arr2
Out[350]: array([ True, False, True, False], dtype=bool)
With this boolean mask I can easily select the elements of arr1:
In [351]: arr1[_]
Out[351]:
array([(b'"abc"', 1. ), (b'"svs"', 2.094)],
dtype=[('f0', 'S8'), ('f1', '<f8')])
Lets see if I can turn these into object arrays.
In [372]: array1 = np.array(arr1.tolist(),dtype=object)
In [373]: array2 = np.array(arr2.tolist(),dtype=object)
In [374]: array1
Out[374]:
array([[b'"abc"', 1.0],
[b'"fsfds"', -5.0],
[b'"svs"', 2.094],
[b'"dfdsge"', 3.348]], dtype=object)
In [375]: array2
Out[375]: array([b'"abc"', b'"qw"', b'"svs"', b'"dd"'], dtype=object)
We can get the same mask:
In [376]: array1[:,0]==array2
Out[376]: array([ True, False, True, False], dtype=bool)
In [377]: array1[_,:]
Out[377]:
array([[b'"abc"', 1.0],
[b'"svs"', 2.094]], dtype=object)
Another way to get a mask:
In [378]: np.in1d(array2,array1[:,0])
Out[378]: array([ True, False, True, False], dtype=bool)
In this case it produces the same thing
Actually to get the rows of array1 that are in array2 (in any order), we need to switch the order:
In [389]: np.in1d(array1[:,0],array2[[1,0,3,2]])
Out[389]: array([ True, False, True, False], dtype=bool)
Look at in1d and the related array set functions for more ideas and details.
In any case, use field or column selection to get the 1d array of strings that can be compared to the strings in the other array.
You can use array comparison as your index for the first dimension to select the rows you want. I'm not sure exactly how you have an ndarray containing both strings and floats, but here's an example where we set it so the first and last rows have the same value in the first column.
import numpy as np
array_1 = np.random.randn(4, 2)
array_2 = np.random.randn(4)
array_2[3] = array_1[3, 0]
array_2[0] = array_1[0, 0]
print(array_1, array_2)
print(array_1[array_1[:, 0] == array_2, 1])
This gives
[[ 0.76170733 -1.40708366]
[-1.42535617 -1.03982291]
[ 0.67999753 -0.92733875]
[ 0.96474552 -1.95639871]]
[ 0.76170733 0.95046454 0.1548689 0.96474552]
[-1.40708366 -1.95639871]
I think that list comprehension can do the trick here:
Output=[i[1] for i in Array1 if i[0] in Array2]

What is going on behind this numpy selection behavior?

Answering this question, some others and I were actually wrong by considering that the following would work:
Say one has
test = [ [ [0], 1 ],
[ [1], 1 ]
]
import numpy as np
nptest = np.array(test)
What is the reason behind
>>> nptest[:,0]==[1]
array([False, False], dtype=bool)
while one has
>>> nptest[0,0]==[1],nptest[1,0]==[1]
(False, True)
or
>>> nptest==[1]
array([[False, True],
[False, True]], dtype=bool)
or
>>> nptest==1
array([[False, True],
[False, True]], dtype=bool)
Is it the degeneracy in term of dimensions which causes this.
nptest is a 2D array of object dtype, and the first element of each row is a list.
nptest[:, 0] is a 1D array of object dtype, each of whose elements are lists.
When you do nptest[:,0]==[1], NumPy does not perform an elementwise comparison of each element of nptest[:,0] against the list [1]. It creates as high-dimensional an array as it can from [1], producing the 1D array np.array([1]), and then broadcasts the comparison, comparing each element of nptest[:,0] against the integer 1.
Since no list in nptest[:, 0] is equal to 1, all elements of the result are False.

Creating a "bitmask" from several boolean numpy arrays

I'm trying to convert several masks (boolean arrays) to a bitmask with numpy, while that in theory works I feel that I'm doing too many operations.
For example to create the bitmask I use:
import numpy as np
flags = [
np.array([True, False, False]),
np.array([False, True, False]),
np.array([False, True, False])
]
flag_bits = np.zeros(3, dtype=np.int8)
for idx, flag in enumerate(flags):
flag_bits += flag.astype(np.int8) << idx # equivalent to flag * 2 ** idx
Which gives me the expected "bitmask":
>>> flag_bits
array([1, 6, 0], dtype=int8)
>>> [np.binary_repr(bit, width=7) for bit in flag_bits]
['0000001', '0000110', '0000000']
However I feel that especially the casting to int8 and the addition with the flag_bits array is too complicated. Therefore I wanted to ask if there is any NumPy functionality that I missed that could be used to create such an "bitmask" array?
Note: I'm calling an external function that expects such a bitmask, otherwise I would stick with the boolean arrays.
>>> x = np.array(2**i for i in range(1, np.shape(flags)[1]+1))
>>> np.dot(flags, x)
array([1, 2, 2])
How it works: in a bit mask, every bit is effectively an original array element multiplied by a degree of 2 according to its position, e.g. 4 = False * 1 + True * 2 + False * 4. Effectively this can be represented as matrix multiplication, which is really efficient in numpy.
So, first line is a list comprehension to create these weights: x = [1, 2, 4, 8, ... 2^(n+1)].
Then, each line in flags is multiplied by the corresponding element in x and everything is summed up (this is how matrix multiplication works). At the end, we get the bitmask
How about this (added conversion to int8, if desired):
flag_bits = (np.transpose(flags) << np.arange(len(flags))).sum(axis=1)\
.astype(np.int8)
#array([1, 6, 0], dtype=int8)
Here's an approach to directly get to the string bitmask with boolean-indexing -
out = np.repeat('0000000',3).astype('S7')
out.view('S1').reshape(-1,7)[:,-3:] = np.asarray(flags).astype(int)[::-1].T
Sample run -
In [41]: flags
Out[41]:
[array([ True, False, False], dtype=bool),
array([False, True, False], dtype=bool),
array([False, True, False], dtype=bool)]
In [42]: out = np.repeat('0000000',3).astype('S7')
In [43]: out.view('S1').reshape(-1,7)[:,-3:] = np.asarray(flags).astype(int)[::-1].T
In [44]: out
Out[44]:
array([b'0000001', b'0000110', b'0000000'],
dtype='|S7')
Using the same matrix-multiplication strategy as dicussed in detail in #Marat's solution, but using a vectorized scaling array that gives us flag_bits -
np.dot(2**np.arange(3),flags)

Intersection of two numpy arrays of different dimensions by column

I have two different numpy arrays given. First one is two-dimensional array which looks like (first ten points):
[[ 0. 0. ]
[ 12.54901961 18.03921569]
[ 13.7254902 17.64705882]
[ 14.11764706 17.25490196]
[ 14.90196078 17.25490196]
[ 14.50980392 17.64705882]
[ 14.11764706 17.64705882]
[ 14.50980392 17.25490196]
[ 17.64705882 18.03921569]
[ 21.17647059 34.11764706]]
the second array is just one-dimensional which looks like (first ten points):
[ 18.03921569 17.64705882 17.25490196 17.25490196 17.64705882
17.64705882 17.25490196 17.64705882 21.17647059 22.35294118]
Values from the second (one-dimension) array could occur in first (two-dimension) one in the first column. F.e. 17.64705882
I want to get an array from the two-dimension one where values of the first column match values in the second (one-dimension) array. How to do that?
You can use np.in1d(array1, array2) to search in array1 each value of array2. In your case you just have to take the first column of the first array:
mask = np.in1d(a[:, 0], b)
#array([False, False, False, False, False, False, False, False, True, True], dtype=bool)
You can use this mask to obtain the encountered values:
a[:, 0][mask]
#array([ 17.64705882, 21.17647059])

Categories

Resources