Numpy: Combine several arrays based on an indices array - python

I have 2 arrays of different sizes m and n, for instance:
x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
I also have an integer array of size m+n, for instance:
indices = np.asarray([1, 1, 0, 1, 0])
I'd like to combine x and y into an array z of size m+n, in this case:
expected_z = np.asarray([300, 400, 100, 500, 200])
In detail:
The 1st value of indices is 1, so the 1st value of z should come from y. Therefore 300.
The 2nd value of indices is 1, so the 2nd value of z should also come from y. Therefore 400.
The 3rd value of indices is 0, so the 3rd value of z should this time come from x. Therefore 100.
...
How could I do that efficiently in NumPy?
Thanks in advance!

Make an output array and use boolean indexing to assign x and y into the correct slots of the output:
z = np.empty(len(x) + len(y), dtype=x.dtype)
z[indices==0] = x
z[indices==1] = y
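With the arrays from the question, a quick end-to-end check of this approach:

```python
import numpy as np

x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
indices = np.asarray([1, 1, 0, 1, 0])

# Allocate the output, then scatter x into the 0-flagged slots
# and y into the 1-flagged slots
z = np.empty(len(x) + len(y), dtype=x.dtype)
z[indices == 0] = x
z[indices == 1] = y
print(z)  # [300 400 100 500 200]
```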

out will be your desired output:
out = indices.copy()
out[np.where(indices==0)[0]] = x
out[np.where(indices==1)[0]] = y
or as the above answer suggested, simply do:
out = indices.copy()
out[indices==0] = x
out[indices==1] = y

I hope this helps:
x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
indices = np.asarray([1, 1, 0, 1, 0])
expected_z = np.asarray([])
x_indice = 0
y_indice = 0
for i in range(len(indices)):
    if indices[i] == 0:
        expected_z = np.insert(expected_z, i, x[x_indice])
        x_indice += 1
    else:
        expected_z = np.insert(expected_z, i, y[y_indice])
        y_indice += 1
expected_z
and the output is:
array([300., 400., 100., 500., 200.])
P.S. Always make sure that len(indices) == len(x) + len(y), that the number of indices equal to 1 is len(y), and that the number equal to 0 is len(x).

Related

Value at a given index in a NumPy array depends on values at higher indexes in another NumPy array

I have two 1D NumPy arrays x = [x[0], x[1], ..., x[n-1]] and y = [y[0], y[1], ..., y[n-1]]. The array x is known, and I need to determine the values for array y. For every index in np.arange(n), the value of y[index] depends on x[index] and on x[index + 1: ]. My code is this:
import numpy as np
n = 5
q = 0.5
x = np.array([1, 2, 0, 1, 0])
y = np.empty(n, dtype=int)
for index in np.arange(n):
    if (x[index] != 0) and (np.any(x[index + 1:] == 0)):
        y[index] = np.random.choice([0, 1], 1, p=(1-q, q))
    else:
        y[index] = 0
print(y)
The problem with the for loop is that the size of n in my experiment can become very large. Is there any vectorized way to do this?
Randomly generate the array y with the full shape.
Generate a bool array indicating where to set zeros.
Use np.where to set zeros.
Try this,
import numpy as np
n = 5
q = 0.5
x = np.array([1, 2, 0, 1, 0])
y = np.random.choice([0, 1], n, p=(1-q, q))
condition = (x != 0) & (x[::-1].cumprod() == 0)[::-1] # equivalent to the posted one
y = np.where(condition, y, 0)
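To see why the cumprod trick is equivalent, a small sketch comparing the two conditions on a few hand-checkable inputs:

```python
import numpy as np

def loop_cond(x):
    # the original loop's condition, per index
    return np.array([(x[i] != 0) and bool(np.any(x[i + 1:] == 0))
                     for i in range(len(x))])

def vec_cond(x):
    # The reversed cumulative product is 0 from the first zero onward, so the
    # re-reversed test marks "x[i:] contains a 0". Combined with x != 0 this
    # matches the loop (x[i] itself is nonzero, so x[i:] has a 0 iff x[i+1:] does).
    return (x != 0) & (x[::-1].cumprod() == 0)[::-1]

for x in ([1, 2, 0, 1, 0], [0, 1, 2], [2, 2, 2], [1, 0]):
    x = np.asarray(x)
    assert np.array_equal(loop_cond(x), vec_cond(x))
print("conditions agree")
```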

kNN feature should be passed through as a list

my data is like:
sample1 = [[1, 0, 3, 5, 0, 9], 0, 1.5, 0]
sample2 = [[0, 4, 0, 6, 2, 0], 2, 1.9, 1]
sample3 = [[9, 7, 6, 0, 0, 0], 0, 1.3, 1]
paul = pd.DataFrame(data=[sample1, sample2, sample3], columns=['list', 'cat', 'metr', 'target'])
On this data, a scikit-learn kNN regression with a specific distance function should be run.
The distance function is:
def my_distance(X, Y, **kwargs):
    if len(X) > 1:
        x = X
        y = Y
        all_minima = []
        for k in range(len(x)):
            one_minimum = min(x[k], y[k])
            all_minima.append(one_minimum)
        sum_all_minima = sum(all_minima)
        distance = (sum(x) + sum(y) - sum_all_minima) * kwargs["Para_list"]
    elif X.dtype == 'int64':
        x = X
        y = Y
        if x == y and x != -1:
            distance = 0
        elif x == -1 or y == -1 or x is None or y is None:
            distance = kwargs["Para_minus1"] * 1
        else:
            distance = kwargs["Para_nominal"] * 1
    else:
        x = X
        y = Y
        if x == y:
            distance = 0
        elif x == -1 or y == -1 or x is None or y is None:
            distance = kwargs["Para_minus1"] * 1
        else:
            distance = abs(x - y) * kwargs["Para_metrisch"]
    return distance
and should be made into a valid distance function by
DistanceMetric.get_metric('pyfunc', func=my_distance)
If I'm right, the scikit-learn code should look like this:
train, test = train_test_split(paul, test_size=0.3)
# x_train should contain only the independent variables; the rest are dropped:
x_train = train.drop('target', axis=1)
y_train = train['target']
x_test = test.drop('target', axis=1)
y_test = test['target']
knn = KNeighborsRegressor(n_neighbors=2,
                          algorithm='ball_tree',
                          metric=my_distance,
                          metric_params={"Para_list": 2,
                                         "Para_minus1": 3,
                                         "Para_metrisch": 2,
                                         "Para_nominal": 4})
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
I get
ValueError: setting an array element with a sequence.
I guess scikit can not handle a single feature item as list? Is there a way to make that happen?
I guess scikit can not handle a single feature item as list? Is there a way to make that happen?
No, there is no way I know of to make this happen. You need to convert this feature into a 2D matrix and concatenate it with the other 1D features to form the data appropriately. This is standard sklearn behavior.
Unless you have some very narrow use case, making a 2D array from a list feature is totally fine. I assume all the lists have the same length.
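A minimal sketch of that conversion, assuming equal-length lists (the column names list_0 … list_5 are made up for illustration):

```python
import numpy as np
import pandas as pd

sample1 = [[1, 0, 3, 5, 0, 9], 0, 1.5, 0]
sample2 = [[0, 4, 0, 6, 2, 0], 2, 1.9, 1]
sample3 = [[9, 7, 6, 0, 0, 0], 0, 1.3, 1]
paul = pd.DataFrame(data=[sample1, sample2, sample3],
                    columns=['list', 'cat', 'metr', 'target'])

# Stack the equal-length lists into a 2D block, one column per element
list_block = pd.DataFrame(np.vstack(paul['list'].to_numpy()),
                          columns=[f'list_{i}' for i in range(6)],
                          index=paul.index)

# Concatenate with the remaining 1D features
flat = pd.concat([list_block, paul.drop(columns='list')], axis=1)
print(flat.shape)  # (3, 9)
```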

Comparing a column in an array to a set of values and returning another set of values from the second column with the corresponding indices

So I have an array with 2 columns (x, y).
I need to find values in the y column matching some other set of numbers, say [0.05, 0.5, 0.99], and return the values from the x column with the same indices into a new variable.
x=np.linspace(50,70,20)
y=np.linspace(0,1,20)
c=np.zeros((2,len(x)))
x=np.around(x,3)
y=np.around(y,3)
for ii, (left, right) in enumerate(zip(x[1:], y[1:])):
    print(left, right)
    c[0, ii] = left
    c[1, ii] = right
q=[0.05,0.5,0.99]
So I need to compare c[1,:] to q and then return the values from c[0,:] with the corresponding indices.
I tried for and enumerate but I can't figure out whether I need to use iterator once or twice (for c and q).
Thanks!
You could use np.nonzero to find the values of q in y.
The question is what the expected behaviour should be if a value is not present in y.
Right now, the values for that case are -1.
import numpy as np
n = 100
x = np.linspace(50, 70, n)
y = np.linspace(0, 1, n)
x = np.around(x, 2)
y = np.around(y, 2)
q = [0.05, 0.50, 0.99]
res = np.full((len(q), 2), -1)
for i, qq in enumerate(q):
    j = np.nonzero(y == qq)[0]
    if np.size(j) == 1:
        res[i] = (j, x[j])
res
# index, value
# array([[ 5, 51],
# [-1, -1],
# [98, 69]])
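If the loop over q ever becomes a bottleneck, the same exact-match lookup can be broadcast in one shot (a sketch with the same data and the same -1 convention as above):

```python
import numpy as np

n = 100
x = np.around(np.linspace(50, 70, n), 2)
y = np.around(np.linspace(0, 1, n), 2)
q = np.array([0.05, 0.50, 0.99])

# matches[i, j] is True where q[i] == y[j]; argmax picks the first hit
matches = y == q[:, None]
found = matches.any(axis=1)
j = matches.argmax(axis=1)

# Fill (index, value) rows where a match exists, -1 everywhere else
res = np.where(found[:, None], np.column_stack((j, x[j])), -1).astype(int)
res
# array([[ 5, 51],
#        [-1, -1],
#        [98, 69]])
```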

How to remove columns in 2D numpy array if one element is smaller or larger than a certain value

Right now I have a 2-D numpy array that represents the pixel coordinates of an image:
points = [[-1,-2,0,1,2,3,5,8], [-3,-4,0,-3,5,9,2,1]]
Each column represents a coordinate in the image, e.g:
points[:, 0] = [-1, -3] means x = -1 and y = -3
Right now, I want to remove any column whose x is less than 0 or greater than 5, or whose y is less than 0 or greater than 5.
I know how to remove elements of a certain value
#remove x that is less than 0 and more than 5
x = points[0,:]
x = x[np.logical_and(x>=0, x<=5)]
#remove y that is less than 0 and more than 5
y = points[1,:]
y = y[np.logical_and(y>=0,y<=5)]
Is there a way to remove the y that shares the same index with the x that is deleted?(in other words, remove columns when either the condition for x deletion or y deletion is satisfied)
You can convert the list to an ndarray, then create a boolean mask and reassign x and y. The nested logical_and builds a mask of x>=0 and x<=5 and y>=0 and y<=5, and applying the same mask to both rows ensures that whenever x[i] is deleted, y[i] is deleted as well.
points = np.array([[-1,-2,0,1,2,3,5,8], [-3,-4,0,-3,5,9,2,1]])
x = points[0, :]
y = points[1, :]
mask = np.logical_and(np.logical_and(x>=0, x<=5), np.logical_and(y>=0, y<=5))
# mask = array([False, False, True, False, True, False, True, False])
x = x[mask] # x = array([0, 2, 5])
y = y[mask] # y = array([0, 5, 2])
You can use np.compress along the axis=1 to get the points you need:
np.compress((x>=0) * (x<=5) * (y>=0) * (y<=5), points, axis=1)
array([[0, 2, 5],
[0, 5, 2]])
where I have assumed that x, y and points are numpy arrays.
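Equivalently, a single boolean mask over whole columns keeps x and y in lockstep without splitting them first (assuming points is already an ndarray):

```python
import numpy as np

points = np.array([[-1, -2, 0, 1, 2, 3, 5, 8],
                   [-3, -4, 0, -3, 5, 9, 2, 1]])

# A column survives only if both of its coordinates lie in [0, 5]
mask = ((points >= 0) & (points <= 5)).all(axis=0)
kept = points[:, mask]
print(kept)
# [[0 2 5]
#  [0 5 2]]
```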

Efficient repeated numpy.where

I have code in which I want to check whether pairs of coordinates fall into certain rectangles. However, there are many rectangles and I am not sure how to generalize the following code to many rectangles. I can only do it using eval in a loop, but that is quite ugly.
Here is code which checks to which of the rectangles each entry of a DataFrame of coordinates belongs. It assigns 0 if it belongs to the first, 1 for the second, and nan otherwise. I want code that would produce the analogous result for a large list of Rectangle objects, without applying eval or loops in the last row. Thanks a lot.
from matplotlib.patches import Rectangle
rec1 = Rectangle((0,0), 100, 100)
rec2 = Rectangle((100,0), 100, 100)
x = np.random.poisson(100, size=200)
y = np.random.poisson(80, size=200)
xy = pd.DataFrame({"x" : x, "y" : y}).values
e1 = np.asarray(rec1.get_extents())
e2 = np.asarray(rec2.get_extents())
r1m1, r1m2 = np.min(e1), np.max(e1)
r2m1, r2m2 = np.min(e2), np.max(e2)
out = np.where(((xy >= r1m1) & (xy <= r1m2)).all(axis=1), 0,
               np.where(((xy >= r2m1) & (xy <= r2m2)).all(axis=1), 1, np.nan))
EDIT Here is a version with 3 rectangles
rec1 = Rectangle((0,0), 100, 100)
rec2 = Rectangle((0,100), 100, 100)
rec3 = Rectangle((100,100), 100, 100)
x = np.random.poisson(100, size=200)
y = np.random.poisson(100, size=200)
xy = pd.DataFrame({"x" : x, "y" : y}).values
e1 = np.asarray(rec1.get_extents())
e2 = np.asarray(rec2.get_extents())
e3 = np.asarray(rec3.get_extents())
r1m1, r1m2 = np.min(e1), np.max(e1)
r2m1, r2m2 = np.min(e2), np.max(e2)
r3m1, r3m2 = np.min(e3), np.max(e3)
out = np.where(((xy >= r1m1) & (xy <= r1m2)).all(axis=1), 0,
               np.where(((xy >= r2m1) & (xy <= r2m2)).all(axis=1), 1,
                        np.where(((xy >= r3m1) & (xy <= r3m2)).all(axis=1), 2, np.nan)))
What I'd like to get are values of 0, 1, 2, or np.nan, but the output consists only of 0 and 1.
Here's a vectorized approach making use of NumPy broadcasting -
# Store extents in a 3D array
e = np.dstack((e1,e2,e3))
# Get a valid mask for the X's and Y's and then the combined one
x_valid_mask = (xy[:,0] >= e[0,0,:,None]) & (xy[:,0] <= e[1,0,:,None])
y_valid_mask = (xy[:,1] >= e[0,1,:,None]) & (xy[:,1] <= e[1,1,:,None])
valid_mask = x_valid_mask & y_valid_mask
# Finally use argmax() to choose the rectangle each point belongs to. argmax
# picks the first matching rectangle, which works here because the
# rectangles are mutually exclusive
out = np.where(valid_mask.any(0), valid_mask.argmax(0), np.nan)
Let's have a sample run to verify things here -
1) Setup random inputs :
In [315]: rec1 = Rectangle((0,0), 100, 100)
...: rec2 = Rectangle((0,100), 100, 100)
...: rec3 = Rectangle((100,100), 100, 100)
...:
In [316]: e1 = np.asarray(rec1.get_extents())
...: e2 = np.asarray(rec2.get_extents())
...: e3 = np.asarray(rec3.get_extents())
...:
2) Taking a look at the extents for rec3:
In [317]: e3
Out[317]:
array([[ 100., 100.],
[ 200., 200.]])
3) Get 5 random points for xy:
In [319]: x = np.random.poisson(100, size=5)
...: y = np.random.poisson(100, size=5)
...: xy = pd.DataFrame({"x" : x, "y" : y}).values
...:
4) Let's set up pt[1] such that it's inside rec3. So, the o/p for this pt should be 2.
In [320]: xy[1] = [150,175]
5) Let's set up pt[3] such that it's outside all of the rectangles. So, the corresponding o/p should be a NaN.
In [321]: xy[3] = [400,400]
6) Run the posted code and print the output:
In [323]: out
Out[323]: array([ nan, 2., 2., nan, 2.])
As seen out[1] is 2 and out[3] is NaN, which were anticipated earlier.
matplotlib has a built-in routine contains_point for checking if a point is contained in a polygon object which is quite fast.
from matplotlib.patches import Rectangle
rec1 = Rectangle((0, 0), 100, 100)
rec1.contains_point((1, 1))
# True
rec1.contains_point((101, 101))
# False
Nested wheres like this are hard to read and extend:
where(cond1, 0, where(cond2, 1, where(cond3, 2, ..)))
You'll see from other questions that where is used most often to generate indices, that is the I,J=np.where(cond) version instead of the np.where(cond, 0, x) version.
So I'd be tempted, just for clarity, to write your code as
res = xy.copy() # or np.zeros_like(xy)
for i in range(n):
    ij = np.where(cond[i])
    res[ij] = i
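A runnable version of that loop idea over an arbitrary list of rectangles. The extents below mirror the three rectangles from the question (they are what np.asarray(rec.get_extents()) would yield), and the sample xy points are chosen by hand so the result is easy to check:

```python
import numpy as np

# [[xmin, ymin], [xmax, ymax]] per rectangle, matching
# Rectangle((0,0),100,100), Rectangle((0,100),100,100), Rectangle((100,100),100,100)
extents = [np.array([[0, 0], [100, 100]]),
           np.array([[0, 100], [100, 200]]),
           np.array([[100, 100], [200, 200]])]

xy = np.array([[50, 50], [150, 175], [400, 400], [50, 150]])

res = np.full(len(xy), np.nan)
for i, ((x0, y0), (x1, y1)) in enumerate(extents):
    inside = (xy[:, 0] >= x0) & (xy[:, 0] <= x1) \
             & (xy[:, 1] >= y0) & (xy[:, 1] <= y1)
    # first match wins, mirroring the argmax choice above
    res[inside & np.isnan(res)] = i
print(res)  # [ 0.  2. nan  1.]
```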
