Numpy remove a row from a multidimensional array - python

I have an array like this
k = np.array([[ 1. , -120.8, 39.5],
[ 0. , -120.5, 39.5],
[ 1. , -120.4, 39.5],
[ 1. , -120.3, 39.5]])
I am trying to remove the following row, which is at index 1.
b=np.array([ 0. , -120.5, 39.5])
I have tried the traditional methods like the following:
k==b #try to get all True values at index 1 but instead got this
array([[False, False, False],
[ True, False, False],
[False, False, False],
[False, False, False]])
Other thing I tried:
k[~(k[:,0]==0.) & (k[:,1]==-120.5) & (k[:,1]==39.5)]
Got the result like this:
array([], shape=(0, 3), dtype=float64)
I am really surprised that the above methods do not work. By the way, in the first method I am just trying to get the index so that I can use np.delete later. Also, for this problem, I am assuming I don't know the index.

Both k and b are floats, so equality comparisons are subject to floating point inaccuracies. Use np.isclose instead:
k[~np.isclose(k, b).all(axis=1)]
# array([[ 1. , -120.8, 39.5],
# [ 1. , -120.4, 39.5],
# [ 1. , -120.3, 39.5]])
Where
np.isclose(k, b).all(axis=1)
# array([False, True, False, False])
Tells you which row of k matches b.
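Since the question asks for the index so that np.delete can be used afterwards, here is a minimal sketch of that route, using the arrays as they are displayed in the question. The names row_matches, idx and trimmed are illustrative and do not come from the original answer.

```python
import numpy as np

k = np.array([[1.0, -120.8, 39.5],
              [0.0, -120.5, 39.5],
              [1.0, -120.4, 39.5],
              [1.0, -120.3, 39.5]])
b = np.array([0.0, -120.5, 39.5])

# One boolean per row: True where every column is close to b.
row_matches = np.isclose(k, b).all(axis=1)

# Integer positions of the matching rows, usable with np.delete.
idx = np.flatnonzero(row_matches)

# np.delete returns a new array; k itself is left unchanged.
trimmed = np.delete(k, idx, axis=0)
print(idx)
print(trimmed)
```

np.isclose uses its default tolerances here (rtol=1e-05, atol=1e-08), which may need adjusting for data on other scales. As an aside, the boolean-mask attempt in the question negates only its first term and tests column 1 twice; that variant would need the whole conjunction wrapped in ~( ... ) and k[:, 2] in the last term.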

Related

ValueError with ReLU function in python

I declared ReLU function like this:
def relu(x):
return (x if x > 0 else 0)
and a ValueError occurred, with this traceback message:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
But if I change ReLU function with numpy, it works:
def relu_np(x):
return np.maximum(0, x)
Why doesn't this function (relu(x)) work? I cannot understand it...
================================
Used code:
>>> x = np.arange(-5.0, 5.0, 0.1)
>>> y = relu(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "filename", line, in relu
return (x if x > 0 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
TLDR; Your first function is not using vectorized methods which means it expects a single float/int value as input, while your second function takes advantage of Numpy's vectorization.
Vectorization in NumPy
Your second function uses numpy functions which are vectorized and run on each individual element of the array.
import numpy as np
arr = np.arange(-5.0, 5.0, 0.5)
def relu_np(x):
return np.maximum(0, x)
relu_np(arr)
# array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 1. ,
# 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
Your first function, however, uses a conditional expression (x if x > 0 else 0), which expects a single value as input and returns a single value. This is why it works when you pass a single element but fails on an array: it cannot apply the condition to each element independently.
def relu(x):
return (x if x > 0 else 0)
relu(-8)
# 0
relu(arr)
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Note: This error comes from the conditional expression (x if x > 0 else 0). For a single integer/float value, the condition x > 0 is either True or False. When you pass an array, x > 0 is an array of booleans, and you would need something like any() or all() to reduce it to a single value before an if/else can use it.
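The ambiguity described in the note can be reproduced in isolation. This is a small sketch with made-up values; the flag name ambiguous is only for illustration.

```python
import numpy as np

arr = np.array([-1.0, 0.5, 2.0])
mask = arr > 0          # elementwise comparison gives a boolean array

try:
    if mask:            # Python asks the array for a single truth value here
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

print(ambiguous)        # whether the `if` raised ValueError
print(mask.any())       # True if at least one element is positive
print(mask.all())       # True only if every element is positive
```

Note that any() and all() answer a different question from ReLU: they collapse the whole array into one boolean, so they remove the error but do not give an elementwise result.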
Solutions -
There are a few ways you can make this work -
1. Using np.vectorize (not recommended, lower performance than pure numpy approach)
import numpy as np
arr = np.arange(-5.0, 5.0, 0.5)
def relu(x):
return (x if x > 0.0 else 0.0)
relu_vec = np.vectorize(relu)
relu_vec(arr)
# array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 1. ,
# 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
2. Iteration over the array with list comprehension
import numpy as np
arr = np.arange(-5.0, 5.0, 0.5)
def relu(x):
return (x if x > 0 else 0)
arr = np.array(arr)
np.array([relu(i) for i in arr])
# array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 1. ,
# 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
Keep in mind that x > 0 is an array of booleans, a mask if you like:
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True])
So it does not make sense to do if x>0, since x contains several elements, which can be True or False. This is the source of your error.
Your second implementation, with numpy, is good! Another implementation (maybe clearer?) might be:
def relu(x):
return x * (x > 0)
In this implementation, we do an elementwise multiplication of x, which is a range of values along the x axis, by 0 where the element of x is not greater than 0, and by 1 where it is.
Disclaimer: please someone correct me if I'm wrong, I'm not 100% sure about how numpy does things.
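To check the multiplication idea against np.maximum, here is a short sketch. The function names relu_mul and relu_where are made up for this comparison; np.where is an additional vectorized option that the answers above do not mention.

```python
import numpy as np

def relu_mul(x):
    # Multiply by the boolean mask (False acts as 0, True as 1).
    return x * (x > 0)

def relu_where(x):
    # Pick x where the condition holds, 0.0 elsewhere.
    return np.where(x > 0, x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
reference = np.maximum(0, x)

same_as_maximum = (np.array_equal(relu_mul(x), reference)
                   and np.array_equal(relu_where(x), reference))
print(same_as_maximum)
```

One small difference: the multiplication form yields -0.0 for negative inputs, which compares equal to 0.0 but prints differently, and it gives nan for -inf because -inf * 0 is undefined.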
Your function relu expects a single numerical value and compares it to 0 and returns whatever is larger. x if x > 0 else 0 would be equal to max(x, 0) where max is a builtin Python function.
relu_np on the other hand uses the numpy function maximum which accepts 2 numbers or arrays or iterables. That means you can pass your numpy array x and it applies the maximum function to every single item automatically. I believe this is called 'vectorized'.
To use your relu function as it is written, you need to call it differently: you have to apply it to every element yourself. You could do something like y = np.array(list(map(relu, x))).

What is going on behind this numpy selection behavior?

Answering this question, some others and I were actually wrong by considering that the following would work:
Say one has
test = [ [ [0], 1 ],
[ [1], 1 ]
]
import numpy as np
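nptest = np.array(test, dtype=object)  # dtype=object must be explicit on NumPy >= 1.24, where ragged input otherwise raises ValueError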
What is the reason behind
>>> nptest[:,0]==[1]
array([False, False], dtype=bool)
while one has
>>> nptest[0,0]==[1],nptest[1,0]==[1]
(False, True)
or
>>> nptest==[1]
array([[False, True],
[False, True]], dtype=bool)
or
>>> nptest==1
array([[False, True],
[False, True]], dtype=bool)
Is it the degeneracy in terms of dimensions which causes this?
nptest is a 2D array of object dtype, and the first element of each row is a list.
nptest[:, 0] is a 1D array of object dtype, whose elements are lists.
When you do nptest[:,0]==[1], NumPy does not perform an elementwise comparison of each element of nptest[:,0] against the list [1]. It creates as high-dimensional an array as it can from [1], producing the 1D array np.array([1]), and then broadcasts the comparison, comparing each element of nptest[:,0] against the integer 1.
Since no list in nptest[:, 0] is equal to 1, all elements of the result are False.
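The difference between the broadcast comparison and a true per-element comparison can be seen side by side. This is a sketch: dtype=object is passed explicitly because recent NumPy versions refuse ragged input otherwise, and the names broadcast_cmp and per_element are illustrative.

```python
import numpy as np

test = [[[0], 1],
        [[1], 1]]
nptest = np.array(test, dtype=object)
first_col = nptest[:, 0]          # 1D object array holding the lists [0] and [1]

# Broadcast comparison: [1] becomes np.array([1]), so each list is
# compared against the integer 1, which never matches a list.
broadcast_cmp = first_col == [1]

# Explicit per-element comparison of each stored list against the list [1].
per_element = np.array([item == [1] for item in first_col])

print(broadcast_cmp)
print(per_element)
```

The list comprehension is the straightforward way to get the result the question expected, at the cost of a Python-level loop.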

Numpy Chain Indexing

I am trying to gain a better understanding of numpy and have come across something I can't quite understand when it comes to indexing.
Let's say we have this first array of random booleans
bools = np.random.choice([True, False],(7),p=[0.5,0.5])
array([False, True, False, False, True, False, False], dtype=bool)
Then let's also say we have this second array of random numbers selected from a normal distribution
data = np.random.randn(7,3)
array([[ 2.24116809, -0.41761776, -0.69026077],
[-0.85450123, 0.98218741, 0.0233551 ],
[-1.3157436 , -0.79753471, 1.77393444],
[-0.26672724, -0.9532758 , 0.67114247],
[-1.34177843, 1.220083 , -0.35341168],
[ 0.49629327, 1.73943962, 0.59050431],
[ 0.01609382, 0.91396293, 0.3754827 ]])
Using the numpy chain indexing I can do this
data[bools, 2:]
array([[ 0.0233551 ],
[-0.35341168]])
Now let's say I want to simply grab the first element, I can do this
data[bools, 2:][0]
array([ 0.0233551])
But why does this, data[bools, 2:, 0] not work?
Because the input is a 2D array, so there are not three dimensions to index with something like [bools, 2:, 0].
To achieve what you are trying to do, you could store the indices corresponding to the True entries of the mask bools, and then index with all of them or with a single element.
A sample run to make things clear -
Inputs :
In [40]: data
Out[40]:
array([[ 1.02429045, 1.74104271, -0.54634826],
[-0.48451969, 0.83455196, 1.94444857],
[ 0.66504345, 0.41821317, 2.52517305],
[ 2.11428982, -0.05769528, 0.84432614],
[ 0.9251009 , -0.74646199, -0.93573164],
[ 0.07321257, -0.10708067, 1.78107884],
[-0.12961046, -0.5787856 , 0.2189466 ]])
In [41]: bools
Out[41]: array([ True, True, False, False, False, False, True], dtype=bool)
Store the valid indices :
In [42]: idx = np.flatnonzero(bools)
In [43]: idx
Out[43]: array([0, 1, 6])
Use as a whole or its first element :
In [44]: data[idx, 2:] # Same as data[bools, 2:]
Out[44]:
array([[-0.54634826],
[ 1.94444857],
[ 0.2189466 ]])
In [45]: data[idx[0], 2:]
Out[45]: array([-0.54634826])
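The same steps can be run end to end on deterministic data. This sketch uses np.arange as a stand-in for the random values, and the names too_many and first_match are made up for illustration.

```python
import numpy as np

data = np.arange(21.0).reshape(7, 3)   # stand-in for the random 7x3 data
bools = np.array([False, True, False, False, True, False, False])

try:
    data[bools, 2:, 0]                 # three indices for a two-dimensional array
    too_many = False
except IndexError:
    too_many = True

idx = np.flatnonzero(bools)            # integer positions of the True entries
first_match = data[idx[0], 2:]         # last column of the first selected row
print(too_many, idx, first_match)
```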
I haven't seen 2D numpy indexing called 'chaining'.
data is 2D, and thus can be indexed with a 2-element tuple:
data[bools, 2:]
data[(bools, slice(2, None, None))]
That can also be expressed as
data[bools,:][:,2:]
where it first selects from rows, and then from columns.
Notice that your indexing produces a (2,1) array: 2 from the number of True values in bools, and 1 from the length of the 2: slice.
Your 2nd indexing with [0] is really a row selection:
data[bools, 2:][0]
data[bools, 2:][0,:]
The result is a (1,) array, the size of the 2nd dimension of the intermediate array.
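The shapes discussed above can be checked directly. This is a sketch on deterministic stand-in data, with illustrative names sub and two_step.

```python
import numpy as np

data = np.arange(21.0).reshape(7, 3)
bools = np.array([False, True, False, False, True, False, False])

sub = data[bools, 2:]                  # rows where bools is True, columns from 2 onward
two_step = data[bools, :][:, 2:]       # same selection done rows first, then columns

print(sub.shape)                       # (rows selected, columns selected)
print(sub[0].shape)                    # [0] picks the first row of the intermediate array
print(sub[0, 0])                       # a single scalar needs both indices
```

If only one column is wanted, indexing with the integer 2 instead of the slice 2: drops that dimension, so data[bools, 2][0] returns a scalar directly.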

Intersection of two numpy arrays of different dimensions by column

I have two different numpy arrays. The first is a two-dimensional array which looks like this (first ten points):
[[ 0. 0. ]
[ 12.54901961 18.03921569]
[ 13.7254902 17.64705882]
[ 14.11764706 17.25490196]
[ 14.90196078 17.25490196]
[ 14.50980392 17.64705882]
[ 14.11764706 17.64705882]
[ 14.50980392 17.25490196]
[ 17.64705882 18.03921569]
[ 21.17647059 34.11764706]]
The second array is one-dimensional and looks like this (first ten points):
[ 18.03921569 17.64705882 17.25490196 17.25490196 17.64705882
17.64705882 17.25490196 17.64705882 21.17647059 22.35294118]
Values from the second (one-dimensional) array can occur in the first column of the first (two-dimensional) array, for example 17.64705882.
I want to get the part of the two-dimensional array where the values in the first column match values in the one-dimensional array. How can I do that?
You can use np.in1d(array1, array2) to test, for each element of array1, whether it occurs in array2. In your case you just have to pass the first column of the first array:
mask = np.in1d(a[:, 0], b)
#array([False, False, False, False, False, False, False, False, True, True], dtype=bool)
You can use this mask to obtain the matching values:
a[:, 0][mask]
#array([ 17.64705882, 21.17647059])
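If whole rows are wanted rather than just the first-column values, the same mask can index the 2D array directly. The sketch below uses small made-up values instead of the asker's data, and np.isin, the spelling that recent NumPy releases recommend over np.in1d.

```python
import numpy as np

# Illustrative values only, not the data from the question.
a = np.array([[ 0.0,  0.0],
              [12.5, 18.0],
              [17.5, 18.0],
              [21.0, 34.0]])
b = np.array([18.0, 17.5, 21.0, 22.0])

# True for each row whose first-column value appears somewhere in b.
mask = np.isin(a[:, 0], b)

# Indexing the 2D array with the mask keeps the complete matching rows.
matching_rows = a[mask]
print(mask)
print(matching_rows)
```

Both functions test exact equality, so floats that come from different computations may fail to match; in that case a tolerance-based comparison such as np.isclose may be needed.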

numpy slice an array without copying it

I have a large data set in matrix x and I need to analyze some submatrices.
I am using the following code to select the submatrix:
>>> import numpy as np
>>> x = np.random.normal(0,1,(20,2))
>>> x
array([[-1.03266826, 0.04646684],
[ 0.05898304, 0.31834926],
[-0.1916809 , -0.97929025],
[-0.48837085, -0.62295003],
[-0.50731017, 0.50305894],
[ 0.06457385, -0.10670002],
[-0.72573604, 1.10026385],
[-0.90893845, 0.99827162],
[ 0.20714399, -0.56965615],
[ 0.8041371 , 0.21910274],
[-0.65882317, 0.2657183 ],
[-1.1214074 , -0.39886425],
[ 0.0784783 , -0.21630006],
[-0.91802557, -0.20178683],
[ 0.88268539, -0.66470235],
[-0.03652459, 1.49798484],
[ 1.76329838, -0.26554555],
[-0.97546845, -2.41823586],
[ 0.32335103, -1.35091711],
[-0.12981597, 0.27591674]])
>>> index = x[:,1] > 0
>>> index
array([ True, True, False, False, True, False, True, True, False,
True, True, False, False, False, False, True, False, False,
False, True], dtype=bool)
>>> x1 = x[index, :] #x1 is a copy of the submatrix
>>> x1
array([[-1.03266826, 0.04646684],
[ 0.05898304, 0.31834926],
[-0.50731017, 0.50305894],
[-0.72573604, 1.10026385],
[-0.90893845, 0.99827162],
[ 0.8041371 , 0.21910274],
[-0.65882317, 0.2657183 ],
[-0.03652459, 1.49798484],
[-0.12981597, 0.27591674]])
>>> x1[0,0] = 1000
>>> x1
array([[ 1.00000000e+03, 4.64668400e-02],
[ 5.89830401e-02, 3.18349259e-01],
[ -5.07310170e-01, 5.03058935e-01],
[ -7.25736045e-01, 1.10026385e+00],
[ -9.08938455e-01, 9.98271624e-01],
[ 8.04137104e-01, 2.19102741e-01],
[ -6.58823174e-01, 2.65718300e-01],
[ -3.65245877e-02, 1.49798484e+00],
[ -1.29815968e-01, 2.75916735e-01]])
>>> x
array([[-1.03266826, 0.04646684],
[ 0.05898304, 0.31834926],
[-0.1916809 , -0.97929025],
[-0.48837085, -0.62295003],
[-0.50731017, 0.50305894],
[ 0.06457385, -0.10670002],
[-0.72573604, 1.10026385],
[-0.90893845, 0.99827162],
[ 0.20714399, -0.56965615],
[ 0.8041371 , 0.21910274],
[-0.65882317, 0.2657183 ],
[-1.1214074 , -0.39886425],
[ 0.0784783 , -0.21630006],
[-0.91802557, -0.20178683],
[ 0.88268539, -0.66470235],
[-0.03652459, 1.49798484],
[ 1.76329838, -0.26554555],
[-0.97546845, -2.41823586],
[ 0.32335103, -1.35091711],
[-0.12981597, 0.27591674]])
>>>
but I would like x1 to be only a pointer or something like that. Copying the data every time I need a submatrix is too expensive for me.
How can I do that?
EDIT:
Apparently there is no solution with numpy arrays. Are pandas data frames better from this point of view?
The information for your array x is summarized in the .__array_interface__ property
In [433]: x.__array_interface__
Out[433]:
{'descr': [('', '<f8')],
'strides': None,
'data': (171396104, False),
'typestr': '<f8',
'version': 3,
'shape': (20, 2)}
It has the array shape, strides (default here), and pointer to the data buffer. A view can point to the same data buffer (possibly further along), and have its own shape and strides.
But indexing with your boolean can't be summarized in those few numbers. Either it has to carry the index array all the way through, or copy the selected items from the x data buffer. numpy chooses to copy. You have a choice of when to apply the index: now, or further down the calling stack.
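To make the "same buffer, different offset and strides" idea concrete, here is a small sketch that inspects two basic slices. It assumes a float64, C-contiguous array like the one in the question; the names rows, col and base_ptr are illustrative.

```python
import numpy as np

x = np.zeros((20, 2))                  # float64, C-contiguous: 16 bytes per row
rows = x[3:8]                          # basic slice of rows
col = x[:, 1]                          # basic slice of one column

base_ptr = x.__array_interface__['data'][0]
rows_offset = rows.__array_interface__['data'][0] - base_ptr
col_offset = col.__array_interface__['data'][0] - base_ptr

print(rows_offset, rows.strides)       # byte offset into x's buffer, and byte strides
print(col_offset, col.strides)
```

Each view is fully described by a start address, a shape and strides, which is exactly what an arbitrary boolean selection cannot be reduced to.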
Since index is an array of type bool, you are doing advanced indexing. And the docs say: "Advanced indexing always returns a copy of the data."
This makes a lot of sense. Normal (basic) indexing only needs a start, stop and step, whereas advanced indexing can pick any values from the original array without such a simple rule. A view would therefore have to carry extra metadata recording which indices it refers to, and that might use more memory than a copy.
If you can manage with a traditional slice such as
x1 = x[3:8]
then it will be just a view (a pointer into x's data) rather than a copy.
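A quick way to check which kind of indexing copies is np.shares_memory. The sketch below uses a small deterministic array; view and picked are illustrative names. It also shows that assigning through the mask modifies x in place, which may be enough when the goal is to update the selected rows rather than to hold a separate submatrix.

```python
import numpy as np

x = np.arange(10.0).reshape(5, 2)
index = x[:, 1] > 4                    # boolean mask over rows

view = x[1:4]                          # basic slice
picked = x[index]                      # boolean (advanced) indexing

print(np.shares_memory(x, view))       # does the slice reuse x's buffer?
print(np.shares_memory(x, picked))     # does the masked selection?

view[0, 0] = 1000.0                    # writes through to x[1, 0]
x[index, 0] = -1.0                     # masked assignment changes x itself
print(x)
```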
Have you looked at using masked arrays? You might be able to do exactly what you want.
x = np.array([[0.12, 0.23],
[1.23, 3.32],
...
[0.75, 1.23]])
data = np.array([[False, False],
[True, True],
...
[True, True]])
x1 = np.ma.array(x, mask=data)
## x1 can be worked on and only includes elements of x where data==False
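A complete, runnable version of the masked-array idea might look like the sketch below. The three rows are illustrative, since the snippet above elides its middle rows, and the name hide is made up here.

```python
import numpy as np

x = np.array([[0.12, 0.23],
              [1.23, 3.32],
              [0.75, 1.23]])
# True marks an element as masked (hidden); here the last two rows are hidden.
hide = np.array([[False, False],
                 [True,  True],
                 [True,  True]])

x1 = np.ma.array(x, mask=hide)
print(x1.count())                      # number of unmasked elements
print(x1.sum())                        # sum over unmasked elements only
print(np.shares_memory(x, x1.data))    # whether x1 reuses x's buffer
```

The masked array keeps the full shape of x and carries a boolean mask of the same shape alongside it, so it trades the cost of copying the selected rows for the cost of storing and checking the mask.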
