Intersection of two numpy arrays of different dimensions by column - python

I have two different numpy arrays. The first is a two-dimensional array which looks like this (first ten points):
[[ 0. 0. ]
[ 12.54901961 18.03921569]
[ 13.7254902 17.64705882]
[ 14.11764706 17.25490196]
[ 14.90196078 17.25490196]
[ 14.50980392 17.64705882]
[ 14.11764706 17.64705882]
[ 14.50980392 17.25490196]
[ 17.64705882 18.03921569]
[ 21.17647059 34.11764706]]
The second array is one-dimensional and looks like this (first ten points):
[ 18.03921569 17.64705882 17.25490196 17.25490196 17.64705882
17.64705882 17.25490196 17.64705882 21.17647059 22.35294118]
Values from the second (one-dimensional) array can occur in the first column of the first (two-dimensional) array, e.g. 17.64705882.
I want to get an array from the two-dimensional one containing the rows whose first-column values appear in the second (one-dimensional) array. How can I do that?

You can use np.in1d(array1, array2) to test whether each value of array1 appears in array2. In your case you just have to take the first column of the first array:
mask = np.in1d(a[:, 0], b)
#array([False, False, False, False, False, False, False, False, True, True], dtype=bool)
You can use this mask to obtain the encountered values:
a[:, 0][mask]
#array([ 17.64705882, 21.17647059])
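As a minimal, self-contained sketch (using a few of the sample points above), the same mask can also select whole rows, which is what the question ultimately asks for. Note that this relies on exact floating-point equality; for near matches you would need a tolerance-based test such as np.isclose.

```python
import numpy as np

# A few points from the question's 2D array
a = np.array([[14.50980392, 17.25490196],
              [17.64705882, 18.03921569],
              [21.17647059, 34.11764706]])
# Values to look for in the first column
b = np.array([18.03921569, 17.64705882, 21.17647059])

mask = np.in1d(a[:, 0], b)  # True where a[:, 0] appears in b
rows = a[mask]              # full rows whose first column matched
print(mask)                 # [False  True  True]
print(rows)
```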

Numpy remove a row from a multidimensional array

I have an array like this:
k = np.array([[ 1. , -120.8, 39.5],
[ 0. , -120.5, 39.5],
[ 1. , -120.4, 39.5],
[ 1. , -120.3, 39.5]])
I am trying to remove the following row, which is at index 1:
b=np.array([ 0. , -120.5, 39.5])
I have tried the traditional methods like the following:
k == b  # trying to get all True at index 1, but instead got this:
array([[False, False, False],
[ True, False, False],
[False, False, False],
[False, False, False]])
Other thing I tried:
k[~(k[:,0]==0.) & (k[:,1]==-120.5) & (k[:,1]==39.5)]
Got the result like this:
array([], shape=(0, 3), dtype=float64)
I am really surprised that the above methods are not working. By the way, in the first method I am just trying to get the index so that I can use np.delete later. Also, for this problem I am assuming I don't know the index.
Both k and b contain floats, so equality comparisons are subject to floating-point inaccuracies. Use np.isclose instead:
k[~np.isclose(k, b).all(axis=1)]
# array([[ 1. , -120.8, 39.5],
# [ 1. , -120.4, 39.5],
# [ 1. , -120.3, 39.5]])
Where
np.isclose(k, b).all(axis=1)
# array([False, True, False, False])
tells you which row of k matches b.
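Putting the answer together as a runnable sketch, including the np.delete route the question mentions (both give the same result):

```python
import numpy as np

k = np.array([[1., -120.8, 39.5],
              [0., -120.5, 39.5],
              [1., -120.4, 39.5],
              [1., -120.3, 39.5]])
b = np.array([0., -120.5, 39.5])

# Keep only rows that do NOT match b (within floating-point tolerance)
result = k[~np.isclose(k, b).all(axis=1)]

# Equivalently, recover the matching index first and use np.delete
idx = np.flatnonzero(np.isclose(k, b).all(axis=1))
result2 = np.delete(k, idx, axis=0)
print(result)
```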

What is going on behind this numpy selection behavior?

While answering this question, several others and I wrongly assumed that the following would work:
Say one has
test = [[[0], 1],
        [[1], 1]]
import numpy as np
nptest = np.array(test)
What is the reason behind
>>> nptest[:,0]==[1]
array([False, False], dtype=bool)
while one has
>>> nptest[0,0]==[1],nptest[1,0]==[1]
(False, True)
or
>>> nptest==[1]
array([[False, True],
[False, True]], dtype=bool)
or
>>> nptest==1
array([[False, True],
[False, True]], dtype=bool)
Is it the degeneracy in terms of dimensions that causes this?
nptest is a 2D array of object dtype, and the first element of each row is a list.
nptest[:, 0] is a 1D array of object dtype, each of whose elements are lists.
When you do nptest[:,0]==[1], NumPy does not perform an elementwise comparison of each element of nptest[:,0] against the list [1]. It creates as high-dimensional an array as it can from [1], producing the 1D array np.array([1]), and then broadcasts the comparison, comparing each element of nptest[:,0] against the integer 1.
Since no list in nptest[:, 0] is equal to 1, all elements of the result are False.
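A runnable sketch of the behavior described above. Note that dtype=object is spelled out explicitly here, which recent NumPy versions require when building arrays from ragged nested sequences like this one:

```python
import numpy as np

nptest = np.array([[[0], 1],
                   [[1], 1]], dtype=object)

col = nptest[:, 0]  # 1D object array whose elements are lists

# [1] is first converted to np.array([1]) and broadcast, so each list
# is compared against the *integer* 1, never against the list [1]
print(col == [1])               # [False False]

# A plain Python loop really does compare list against list
print([x == [1] for x in col])  # [False, True]
```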

Numpy Chain Indexing

I am trying to gain a better understanding of numpy and have come across something I can't quite understand when it comes to indexing.
Let's say we have this first array of random booleans
bools = np.random.choice([True, False],(7),p=[0.5,0.5])
array([False, True, False, False, True, False, False], dtype=bool)
Then let's also say we have this second array of random numbers selected from a normal distribution
data = np.random.randn(7,3)
array([[ 2.24116809, -0.41761776, -0.69026077],
[-0.85450123, 0.98218741, 0.0233551 ],
[-1.3157436 , -0.79753471, 1.77393444],
[-0.26672724, -0.9532758 , 0.67114247],
[-1.34177843, 1.220083 , -0.35341168],
[ 0.49629327, 1.73943962, 0.59050431],
[ 0.01609382, 0.91396293, 0.3754827 ]])
Using the numpy chain indexing I can do this
data[bools, 2:]
array([[ 0.0233551 ],
[-0.35341168]])
Now let's say I want to simply grab the first element, I can do this
data[bools, 2:][0]
array([ 0.0233551])
But why does this, data[bools, 2:, 0] not work?
Because the input is a 2D array, you don't have a third dimension to index with something like [bools, 2:, 0].
To achieve what you are trying to do, you could store the indices corresponding to the True entries in the mask bools, and then index with either the whole index array or a single element of it.
A sample run to make things clear -
Inputs :
In [40]: data
Out[40]:
array([[ 1.02429045, 1.74104271, -0.54634826],
[-0.48451969, 0.83455196, 1.94444857],
[ 0.66504345, 0.41821317, 2.52517305],
[ 2.11428982, -0.05769528, 0.84432614],
[ 0.9251009 , -0.74646199, -0.93573164],
[ 0.07321257, -0.10708067, 1.78107884],
[-0.12961046, -0.5787856 , 0.2189466 ]])
In [41]: bools
Out[41]: array([ True, True, False, False, False, False, True], dtype=bool)
Store the valid indices :
In [42]: idx = np.flatnonzero(bools)
In [43]: idx
Out[43]: array([0, 1, 6])
Use as a whole or its first element :
In [44]: data[idx, 2:] # Same as data[bools, 2:]
Out[44]:
array([[-0.54634826],
[ 1.94444857],
[ 0.2189466 ]])
In [45]: data[idx[0], 2:]
Out[45]: array([-0.54634826])
I haven't seen 2D numpy indexing called 'chaining'.
data is 2D, and thus can be indexed with a two-element tuple:
data[bools, 2:]
data[(bools, slice(2, None, None))]
That can also be expressed as
data[bools,:][:,2:]
where it first selects from rows, and then from columns.
Notice that your indexing produces a (2,1) array; 2 from the number of True values in bools, and 1 from the length of the 2: slice.
Your 2nd indexing with [0] is really a row selection:
data[bools, 2:][0]
data[bools, 2:][0,:]
The result is a (1,) array, the size of the 2nd dimension of the intermediate array.
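The equivalences above can be checked directly. This sketch uses seeded random data, so the values differ from the answer's sample run, but the shapes and identities are the same:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((7, 3))
bools = np.array([False, True, False, False, True, False, False])

step = data[bools, :][:, 2:]   # select rows, then columns
combined = data[bools, 2:]     # the same thing as one tuple index
assert np.array_equal(step, combined)
assert combined.shape == (2, 1)

# data[bools, 2:, 0] fails because data is 2D: the index tuple
# may have at most two entries. Go through indices instead:
idx = np.flatnonzero(bools)
first = data[idx[0], 2:]       # first matching row, shape (1,)
assert np.array_equal(first, combined[0])
```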

Trying to Remove for-Loops from Python code, Performing Operations with a Look-up Table On Matrices

I feel like this is a similar problem to the one I asked before, but I can't figure it out. How can I convert these two lines of code into one line with no for-loop?
for i in xrange(X.shape[0]):
dW[:,y[i]] -= X[i]
In English, every row in matrix X should be subtracted from a corresponding column in matrix dW given by the vector y.
I should mention dW is D×C and X is N×D, so the transpose of X does not have the same shape as dW; otherwise I could re-order the rows of X and take the transpose directly. However, it is possible for a column in dW to have multiple corresponding rows which need to be subtracted.
I feel like I do not have a firm grasp of how indexing in python is supposed to work, which makes it difficult to remove unnecessary for-loops, or even to know what for-loops are possible to remove.
The straightforward way to vectorize would be:
dW[:,y] -= X.T
Except that, though this is not very obvious or well-documented, it gives problems with repeated indices in y. For these situations there is the ufunc.at method (elementwise operations in numpy are implemented as "ufunc's" or "universal functions"). Quote from the docs:
ufunc.at(a, indices, b=None)
Performs unbuffered in place operation on operand ‘a’ for elements specified by ‘indices’. For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once. For example, a[[0,0]] += 1 will only increment the first element once because of buffering, whereas add.at(a, [0,0], 1) will increment the first element twice.
So in your case:
np.subtract.at(dW.T, y, X)
Unfortunately, ufunc.at is relatively inefficient as far as vectorization techniques go, so the speedup compared to the loop might not be that impressive.
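To see that np.subtract.at really handles the repeated indices, the loop and the ufunc version can be compared on a small example (hypothetical shapes, with N=3, D=4, C=5 and a repeated index in y):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((3, 4))        # N x D
y = np.array([1, 2, 2])       # note the repeated index 2
dW_loop = np.zeros((4, 5))    # D x C
for i in range(X.shape[0]):
    dW_loop[:, y[i]] -= X[i]

# Unbuffered version: repeated indices in y accumulate correctly.
# dW_at.T is a view, so the subtraction writes through to dW_at.
dW_at = np.zeros((4, 5))
np.subtract.at(dW_at.T, y, X)
assert np.allclose(dW_loop, dW_at)
```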
Approach #1 Here's a one-liner vectorized approach with matrix-multiplication using np.dot and NumPy broadcasting -
dWout -= (np.arange(dW.shape[1])[:,None] == y).dot(X).T
Explanation : Take a small example to understand what's going on -
Inputs :
In [259]: X
Out[259]:
array([[ 0.80195208, 0.40566743, 0.62585574, 0.53571781],
[ 0.56643339, 0.4635662 , 0.4290103 , 0.14457036],
[ 0.31823491, 0.12329964, 0.41682841, 0.09544716]])
In [260]: y
Out[260]: array([1, 2, 2])
First off, we create the 2D mask of y indices spread across the length of dW's second axis.
Let dW be a 4 x 5 shaped array. So, the mask would be :
In [261]: mask = (np.arange(dW.shape[1])[:,None] == y)
In [262]: mask
Out[262]:
array([[False, False, False],
[ True, False, False],
[False, True, True],
[False, False, False],
[False, False, False]], dtype=bool)
This uses NumPy broadcasting to create a 2D mask.
Next up, we use matrix-multiplication to sum-aggregate the same indices from y -
In [264]: mask.dot(X)
Out[264]:
array([[ 0. , 0. , 0. , 0. ],
[ 0.80195208, 0.40566743, 0.62585574, 0.53571781],
[ 0.8846683 , 0.58686584, 0.84583872, 0.24001752],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
Thus, corresponding to the third row of the mask that has True values at second and third columns, we would sum up the second and third rows from X with that matrix-multiplication. This would be put as the third row in the multiplication output.
Since in the original loopy code we are updating dW across columns, we need to transpose the multiplication result before updating.
Approach #2 Here's another vectorized way, though not a one-liner, using np.add.reduceat -
sidx = y.argsort()
unq,shift_idx = np.unique(y[sidx],return_index=True)
dWout[:,unq] -= np.add.reduceat(X[sidx],shift_idx,axis=0).T
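Both approaches can be verified against the original loop on a small example (hypothetical shapes: N=3, D=4, C=5):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((3, 4))
y = np.array([1, 2, 2])
dW_ref = np.zeros((4, 5))
for i in range(X.shape[0]):
    dW_ref[:, y[i]] -= X[i]

# Approach #1: matrix multiplication with a broadcasted mask
dW1 = np.zeros((4, 5))
dW1 -= (np.arange(dW1.shape[1])[:, None] == y).dot(X).T

# Approach #2: sort y, then sum runs of equal indices with reduceat
dW2 = np.zeros((4, 5))
sidx = y.argsort()
unq, shift_idx = np.unique(y[sidx], return_index=True)
dW2[:, unq] -= np.add.reduceat(X[sidx], shift_idx, axis=0).T

assert np.allclose(dW_ref, dW1) and np.allclose(dW_ref, dW2)
```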

numpy slice an array without copying it

I have a large amount of data in matrix x and I need to analyze some of its submatrices.
I am using the following code to select the submatrix:
>>> import numpy as np
>>> x = np.random.normal(0,1,(20,2))
>>> x
array([[-1.03266826, 0.04646684],
[ 0.05898304, 0.31834926],
[-0.1916809 , -0.97929025],
[-0.48837085, -0.62295003],
[-0.50731017, 0.50305894],
[ 0.06457385, -0.10670002],
[-0.72573604, 1.10026385],
[-0.90893845, 0.99827162],
[ 0.20714399, -0.56965615],
[ 0.8041371 , 0.21910274],
[-0.65882317, 0.2657183 ],
[-1.1214074 , -0.39886425],
[ 0.0784783 , -0.21630006],
[-0.91802557, -0.20178683],
[ 0.88268539, -0.66470235],
[-0.03652459, 1.49798484],
[ 1.76329838, -0.26554555],
[-0.97546845, -2.41823586],
[ 0.32335103, -1.35091711],
[-0.12981597, 0.27591674]])
>>> index = x[:,1] > 0
>>> index
array([ True, True, False, False, True, False, True, True, False,
True, True, False, False, False, False, True, False, False,
False, True], dtype=bool)
>>> x1 = x[index, :] #x1 is a copy of the submatrix
>>> x1
array([[-1.03266826, 0.04646684],
[ 0.05898304, 0.31834926],
[-0.50731017, 0.50305894],
[-0.72573604, 1.10026385],
[-0.90893845, 0.99827162],
[ 0.8041371 , 0.21910274],
[-0.65882317, 0.2657183 ],
[-0.03652459, 1.49798484],
[-0.12981597, 0.27591674]])
>>> x1[0,0] = 1000
>>> x1
array([[ 1.00000000e+03, 4.64668400e-02],
[ 5.89830401e-02, 3.18349259e-01],
[ -5.07310170e-01, 5.03058935e-01],
[ -7.25736045e-01, 1.10026385e+00],
[ -9.08938455e-01, 9.98271624e-01],
[ 8.04137104e-01, 2.19102741e-01],
[ -6.58823174e-01, 2.65718300e-01],
[ -3.65245877e-02, 1.49798484e+00],
[ -1.29815968e-01, 2.75916735e-01]])
>>> x
array([[-1.03266826, 0.04646684],
[ 0.05898304, 0.31834926],
[-0.1916809 , -0.97929025],
[-0.48837085, -0.62295003],
[-0.50731017, 0.50305894],
[ 0.06457385, -0.10670002],
[-0.72573604, 1.10026385],
[-0.90893845, 0.99827162],
[ 0.20714399, -0.56965615],
[ 0.8041371 , 0.21910274],
[-0.65882317, 0.2657183 ],
[-1.1214074 , -0.39886425],
[ 0.0784783 , -0.21630006],
[-0.91802557, -0.20178683],
[ 0.88268539, -0.66470235],
[-0.03652459, 1.49798484],
[ 1.76329838, -0.26554555],
[-0.97546845, -2.41823586],
[ 0.32335103, -1.35091711],
[-0.12981597, 0.27591674]])
>>>
but I would like x1 to be only a pointer or something like that. Copying the data every time I need a submatrix is too expensive for me.
How can I do that?
EDIT:
Apparently there is no solution with the numpy array. Are pandas data frames better from this point of view?
The information for your array x is summarized in the .__array_interface__ property
In [433]: x.__array_interface__
Out[433]:
{'descr': [('', '<f8')],
'strides': None,
'data': (171396104, False),
'typestr': '<f8',
'version': 3,
'shape': (20, 2)}
It has the array shape, strides (default here), and pointer to the data buffer. A view can point to the same data buffer (possibly further along), and have its own shape and strides.
But indexing with your boolean can't be summarized in those few numbers. Either it has to carry the index array all the way through, or copy the selected items from the x data buffer. numpy chooses to copy. You have a choice of when to apply the index, now or further down the calling stack.
Since index is an array of type bool, you are doing advanced indexing. And the docs say: "Advanced indexing always returns a copy of the data."
This makes a lot of sense. Compared to basic indexing, where you only need to know the start, stop, and step, advanced indexing can pick any value from the original array without such a simple rule. Supporting a view here would mean carrying a lot of extra metadata about where the referenced indices point, which might use more memory than a copy.
If you can manage with a traditional slice such as
x1 = x[3:8]
Then it will be just a pointer.
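np.shares_memory makes the view-vs-copy distinction easy to check; a small sketch with made-up data:

```python
import numpy as np

x = np.arange(20.0).reshape(10, 2)

view = x[3:8]                   # basic slicing: a view, no copy
fancy = x[np.array([3, 4, 5])]  # advanced indexing: always a copy
assert np.shares_memory(x, view)
assert not np.shares_memory(x, fancy)

view[0, 0] = 1000.0             # writes through to x ...
assert x[3, 0] == 1000.0
assert fancy[0, 0] == 6.0       # ... but the earlier copy is untouched
```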
Have you looked at using masked arrays? You might be able to do exactly what you want.
x = np.array([[0.12, 0.23],
              [1.23, 3.32],
              ...
              [0.75, 1.23]])
data = np.array([[False, False],
[True, True],
...
[True, True]])
x1 = np.ma.array(x, mask=data)
## x1 can be worked on and only includes elements of x where data==False
