Efficient repeated numpy.where

I have code that checks whether pairs of coordinates fall into certain rectangles. However, there are many rectangles, and I am not sure how to generalize the following code to handle them all. The only way I can do it is with eval in a loop, which is quite ugly.
Here is code that checks which rectangle each row of a DataFrame of coordinates falls into. It assigns 0 if the point belongs to the first rectangle, 1 if it belongs to the second, and NaN otherwise. I would like code that produces the analogous result for a large list of Rectangle objects, without using eval or loops in the last statement. Thanks a lot.
import numpy as np
import pandas as pd
from matplotlib.patches import Rectangle
rec1 = Rectangle((0,0), 100, 100)
rec2 = Rectangle((100,0), 100, 100)
x = np.random.poisson(100, size=200)
y = np.random.poisson(80, size=200)
xy = pd.DataFrame({"x" : x, "y" : y}).values
e1 = np.asarray(rec1.get_extents())
e2 = np.asarray(rec2.get_extents())
r1m1, r1m2 = np.min(e1), np.max(e1)
r2m1, r2m2 = np.min(e2), np.max(e2)
out = np.where(((xy >= r1m1) & (xy <= r1m2)).all(axis=1), 0,
               np.where(((xy >= r2m1) & (xy <= r2m2)).all(axis=1), 1, np.nan))
EDIT: Here is a version with 3 rectangles:
rec1 = Rectangle((0,0), 100, 100)
rec2 = Rectangle((0,100), 100, 100)
rec3 = Rectangle((100,100), 100, 100)
x = np.random.poisson(100, size=200)
y = np.random.poisson(100, size=200)
xy = pd.DataFrame({"x" : x, "y" : y}).values
e1 = np.asarray(rec1.get_extents())
e2 = np.asarray(rec2.get_extents())
e3 = np.asarray(rec3.get_extents())
r1m1, r1m2 = np.min(e1), np.max(e1)
r2m1, r2m2 = np.min(e2), np.max(e2)
r3m1, r3m2 = np.min(e3), np.max(e3)
out = np.where(((xy >= r1m1) & (xy <= r1m2)).all(axis=1), 0,
               np.where(((xy >= r2m1) & (xy <= r2m2)).all(axis=1), 1,
                        np.where(((xy >= r3m1) & (xy <= r3m2)).all(axis=1), 2, np.nan)))
What I would like to get are values of 0, 1, 2 or np.nan, but the output consists only of 0 and 1.

Here's a vectorized approach making use of NumPy broadcasting -
# Store extents in a 3D array
e = np.dstack((e1,e2,e3))
# Get a valid mask for the X's and Y's and then the combined one
x_valid_mask = (xy[:,0] >= e[0,0,:,None]) & (xy[:,0] <= e[1,0,:,None])
y_valid_mask = (xy[:,1] >= e[0,1,:,None]) & (xy[:,1] <= e[1,1,:,None])
valid_mask = x_valid_mask & y_valid_mask
# Finally, use argmax() to choose the rectangle each point belongs to. argmax
# returns the first matching rectangle, which works here because the
# rectangles are assumed to be mutually exclusive
out = np.where(valid_mask.any(0), valid_mask.argmax(0), np.nan)
Let's have a sample run to verify things here -
1) Setup random inputs :
In [315]: rec1 = Rectangle((0,0), 100, 100)
...: rec2 = Rectangle((0,100), 100, 100)
...: rec3 = Rectangle((100,100), 100, 100)
...:
In [316]: e1 = np.asarray(rec1.get_extents())
...: e2 = np.asarray(rec2.get_extents())
...: e3 = np.asarray(rec3.get_extents())
...:
2) Take a look at the extents for rec3 :
In [317]: e3
Out[317]:
array([[ 100., 100.],
[ 200., 200.]])
3) Get random 5 pts for xy :
In [319]: x = np.random.poisson(100, size=5)
...: y = np.random.poisson(100, size=5)
...: xy = pd.DataFrame({"x" : x, "y" : y}).values
...:
4) Let's set up pt[1] such that it's inside rec3. So, the o/p for this pt should be 2.
In [320]: xy[1] = [150,175]
5) Let's set up pt[3] such that it's outside all of the rectangles. So, the corresponding o/p should be a NaN.
In [321]: xy[3] = [400,400]
6) Run the posted code and print the output :
In [323]: out
Out[323]: array([ nan, 2., 2., nan, 2.])
As seen, out[1] is 2 and out[3] is NaN, as anticipated earlier.
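The same broadcasting idea extends to any number of rectangles. The following is only a sketch under stated assumptions: the rectangles are given as plain (x0, y0, width, height) rows rather than Rectangle objects, and the names rects, lo, hi and inside are illustrative, not from the answer above.

```python
import numpy as np

# Assumed input format (hypothetical): one row per rectangle, (x0, y0, width, height)
rects = np.array([(0, 0, 100, 100),
                  (0, 100, 100, 100),
                  (100, 100, 100, 100)], dtype=float)
lo = rects[:, :2]         # lower-left corners, shape (R, 2)
hi = lo + rects[:, 2:]    # upper-right corners, shape (R, 2)

# Sample points, shape (N, 2)
xy = np.array([[50, 50], [50, 150], [150, 175], [400, 400], [150, 50]], dtype=float)

# inside[r, n] is True when point n lies within rectangle r (bounds inclusive)
inside = ((xy[None, :, :] >= lo[:, None, :]) &
          (xy[None, :, :] <= hi[:, None, :])).all(axis=2)

# Index of the first matching rectangle per point, NaN when none matches
out = np.where(inside.any(axis=0), inside.argmax(axis=0), np.nan)
print(out)
```

Because the bounds are inclusive, a point lying exactly on a shared edge matches more than one rectangle, and argmax then reports the one with the lowest index.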

matplotlib has a built-in method, contains_point, for checking whether a point is contained in a patch such as a polygon or rectangle, and it is quite fast.
from matplotlib.patches import Rectangle
rec1 = Rectangle((0, 0), 100, 100)
rec1.contains_point((1, 1))
# True
rec1.contains_point((101, 101))
# False

Nested wheres like this are hard to read and extend:
where(cond1, 0, where(cond2, 1, where(cond3, 2, ..)))
You'll see from other questions that where is used most often to generate indices, that is the I,J=np.where(cond) version instead of the np.where(cond, 0, x) version.
So I'd be tempted, just for clarity, to write your code as
res = np.full(len(xy), np.nan)  # one label per point, NaN where nothing matches
for i in range(n):
    ij = np.where(cond[i])  # cond[i] is the boolean mask for rectangle i
    res[ij] = i
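A flat alternative to a chain of nested where calls is np.select, which takes a list of conditions and a matching list of choices and uses the first condition that holds. This is a sketch rather than part of the answer above; the bounds list and the sample points are made up for illustration.

```python
import numpy as np

# Hypothetical sample points, shape (N, 2)
xy = np.array([[50, 50], [50, 150], [150, 175], [400, 400], [100, 150]], dtype=float)

# Hypothetical rectangle bounds as (lower-left, upper-right) corner pairs
bounds = [((0, 0), (100, 100)),
          ((0, 100), (100, 200)),
          ((100, 100), (200, 200))]

# One boolean mask per rectangle: True where both coordinates are within bounds
conds = [((xy >= lo) & (xy <= hi)).all(axis=1) for lo, hi in bounds]

# The first matching condition wins; points matching nothing get the default
out = np.select(conds, list(range(len(conds))), default=np.nan)
print(out)
```

The last sample point sits on the edge shared by the second and third rectangles, so it is labelled with the lower index, which is the same first-match behaviour as the nested where version.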

Related

Value at a given index in a NumPy array depends on values at higher indexes in another NumPy array

I have two 1D NumPy arrays x = [x[0], x[1], ..., x[n-1]] and y = [y[0], y[1], ..., y[n-1]]. The array x is known, and I need to determine the values for array y. For every index in np.arange(n), the value of y[index] depends on x[index] and on x[index + 1: ]. My code is this:
import numpy as np
n = 5
q = 0.5
x = np.array([1, 2, 0, 1, 0])
y = np.empty(n, dtype=int)
for index in np.arange(n):
    if (x[index] != 0) and (np.any(x[index + 1:] == 0)):
        # draw a single scalar: 1 with probability q, 0 otherwise
        y[index] = np.random.choice([0, 1], p=(1-q, q))
    else:
        y[index] = 0
print(y)
The problem with the for loop is that the size of n in my experiment can become very large. Is there any vectorized way to do this?

Randomly generate the array y with the full shape.
Generate a boolean array indicating where the random values should be kept.
Use np.where to set the remaining entries to zero.
Try this:
import numpy as np
n = 5
q = 0.5
x = np.array([1, 2, 0, 1, 0])
y = np.random.choice([0, 1], n, p=(1-q, q))
condition = (x != 0) & (x[::-1].cumprod() == 0)[::-1] # equivalent to the posted one
y = np.where(condition, y, 0)
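The reversed cumprod works for small values, but a product of many integers can overflow and wrap around, possibly to 0. A variant that stays in booleans is sketched below, using np.logical_or.accumulate as a cumulative "any". The names is_zero, zero_at_or_after and loop_condition are illustrative only.

```python
import numpy as np

x = np.array([1, 2, 0, 1, 0])
is_zero = x == 0

# zero_at_or_after[i] is True when any element of x[i:] is zero
zero_at_or_after = np.logical_or.accumulate(is_zero[::-1])[::-1]

# x[i] is non-zero and some zero occurs at or after i, hence strictly after i
condition = (x != 0) & zero_at_or_after

# Reference: the condition exactly as written in the question's loop
loop_condition = np.array([(x[i] != 0) and bool(np.any(x[i + 1:] == 0))
                           for i in range(len(x))])
print(condition, loop_condition)
```

The resulting condition can be passed to np.where in the same way as in the answer above.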

Multiple plots, every plot with same X but different Y values. I want the line for each plot to start when the Y values have a positive value

I have to draw many plots on the same graph. The x values are the same array for all of them, running from 0 to N. The Y values for each plot are arrays that start with 0 and begin to have positive values at different x, depending on the plot.
EXAMPLE:
x = np.arange(100)
y1 = [0, 0, 10, 12 , 53, ... , n]
y2 = [0, 0, 0, 12 , 53, ... , n]
y3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 67, 53, ... , n]
When I plot, there is a vertical line that goes from the bottom to the first positive value of Y. In the case of y1, there is a line from (1, 0) to (2, 10); that is the line I want to avoid, so the plot should start at (2, 10).
I know I can create new arrays for x and y to match the conditions I want, but I really need to know if there is another way.
There is an image with one example of my current plot.
CODE:
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
# This is an Excel file where a user types a number; this number will be the number of months.
workbook = xlrd.open_workbook('INPUT.xlsx')
sheet1 = workbook.sheet_by_name('ASSUMPTIONS')
Num_Meses = np.array([i for i in range(int(sheet1.cell(5, 5).value) + 1)])
# Then I create a dictionary from which I take the arrays (YPP, Y1P, Y2P are of type 'numpy.ndarray')
filt = df['WELL TYPE'] == 'PP'
YPP = df.loc[filt, 'OIL PRODUCTION'][0]
filt = df['WELL TYPE'] == '1P'
Y1P = df.loc[filt, 'OIL PRODUCTION'][0] + YPP
filt = df['WELL TYPE'] == '2P'
Y2P = df.loc[filt, 'OIL PRODUCTION'][0] + Y1P
filt = df['WELL TYPE'] == '3P'
Y3P = df.loc[filt, 'OIL PRODUCTION'][0] + Y2P
plt.plot(Num_Meses, Y3P, label='3P')
plt.plot(Num_Meses, Y2P, label='2P')
plt.plot(Num_Meses, Y1P, label='1P')
plt.plot(Num_Meses, YPP, label='PP', color='k')
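One possible approach, not taken from the thread, is to replace the non-positive values with NaN using np.where before plotting, since Matplotlib does not draw line segments through NaN values. The sketch below uses a short made-up array in place of the real production data, and it assumes that every zero should be hidden, not only the leading ones.

```python
import numpy as np

x = np.arange(6)
# Hypothetical stand-in for one of the Y arrays in the question
y1 = np.array([0, 0, 10, 12, 53, 60], dtype=float)

# Keep positive values and turn everything else into NaN
y1_masked = np.where(y1 > 0, y1, np.nan)
print(y1_masked)
```

Passing x and y1_masked to plt.plot would then start the line at (2, 10), with no segment rising from zero.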

Protection against "index 0 is out of bounds for axis 0 with size 0" error in Python

I have code that produces a specific distribution of points on the graph of the function tan(), bounded from below and above by straight lines:
import matplotlib.pyplot as plt
import numpy as np
import sys
import itertools
import multiprocessing
import tqdm
ic = range(1,10)
jc = range(1,10)
paramlist = list(itertools.product(ic,jc))
def func(params):
    ic = params[0]
    jc = params[1]
    fig = plt.figure(1, figsize=(10,6))
    x_all = np.linspace(0, 10*np.pi, 10000, endpoint=False)
    x_above = x_all[ (-0.01)*ic*x_all < np.tan(x_all) ]
    x = x_above[ np.tan(x_above) < 0.01*jc*x_above ]
    y = np.tan(x)
    y2 = 0.01*jc*x
    y3 = (-0.01)*ic*x
    y_up = np.diff(y) > 0
    y_diff = np.where( y_up, np.diff(y), 0 )
    x_diff = np.where( y_up, np.diff(x), 0 )
    diffs = np.sqrt( x_diff**2 + y_diff**2 )
    length = diffs.sum()
    numbers = [2,4,6,8,10,12,14,16,18,20]
    p2 = []
    for d in range(len(numbers)):
        cumlenth = np.cumsum(diffs)
        s = np.abs(np.diff(np.sign(cumlenth-numbers[d]))).astype(bool)
        c = np.argwhere(s)[0][0]
        p = x[c], y[c]
        p2.append(p)
    p3 = sorted(p2, key=lambda x: x[0])
    x_max = p3[len(p3)-1][0]
    p4 = sorted(p2, key=lambda x: x[1])
    y_min = p4[0][1]
    y_max = p4[len(p3)-1][1]
    for b in range(len(p2)):
        plt.scatter( p2[b][0], p2[b][1], color="crimson", s=8)
    plt.plot(x, np.tan(x))
    plt.plot(x, y2)
    plt.plot(x, y3)
    ax = plt.gca()
    ax.set_xlim([0, x_max+0.5])
    ax.set_ylim([y_min-0.5, y_max+0.5])
    plt.savefig('C:\\Users\\tkp\\Desktop\\wykresy_4\\i='+str(ic)+'_j='+str(jc)+'.png', bbox_inches='tight')
    plt.show()
if __name__ == '__main__':
    p = multiprocessing.Pool(4)
    for params in tqdm.tqdm(p.imap_unordered(func, paramlist), total=len(paramlist)):
        #pass
        sys.stdout.write('\r'+ str(params))
        sys.stdout.flush()
    p.close()
    p.join()
For example, this produces a plot.
The problem is that if I set the range in x_all = np.linspace(0, 10*np.pi, 10000, endpoint=False) too small, I get the error index 0 is out of bounds for axis 0 with size 0. How can I protect myself against this? Or could I perhaps use a variable range in the linspace call in this case?

Where does this error occur? That's a fundamental piece of information - for us, but especially for you!
@edison says it's in the argwhere expression. I'll try to recreate that step, starting with a guess as to what diffs looks like:
In [8]: x = np.ones(5)*.1
In [9]: x
Out[9]: array([0.1, 0.1, 0.1, 0.1, 0.1])
In [10]: s = np.cumsum(x)
In [11]: s
Out[11]: array([0.1, 0.2, 0.3, 0.4, 0.5])
In [12]: s-1
Out[12]: array([-0.9, -0.8, -0.7, -0.6, -0.5])
In [13]: np.sign(s-1)
Out[13]: array([-1., -1., -1., -1., -1.])
In [14]: np.diff(np.sign(s-1))
Out[14]: array([0., 0., 0., 0.])
In [15]: np.abs(np.diff(np.sign(s-1)))
Out[15]: array([0., 0., 0., 0.])
In [16]: np.abs(np.diff(np.sign(s-1))).astype(bool)
Out[16]: array([False, False, False, False])
Regardless of the details to this point, it's a good guess that s is an array with just False. where finds the True elements in that array; there are none.
In [17]: np.where(_)
Out[17]: (array([], dtype=int64),)
argwhere is the transpose of this - one column for each dimension, and one row for each found item.
In [18]: np.argwhere(_)
Out[18]: array([], shape=(0, 2), dtype=int64)
In [19]: _[0]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-aa79beb95eae> in <module>
----> 1 _[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
So your first line of defense is to check the shape of the returned array:
c = np.argwhere(s)
if c.shape[0] > 0:
    c = c[0, 0]
    p = x[c], y[c]
else:
    # what do you want to do if none of `s` are true?
    pass
You can work backwards from there, taking care to ensure that the diffs or numbers are correct, and always find a valid c. But regardless, when using where or argwhere, be careful about assuming it has found a given number of items.
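The shape check can be wrapped in a small helper that returns None when no crossing exists, leaving the caller to decide what to do. This is a minimal sketch: the helper name first_crossing_index is hypothetical, and the sample data mirrors the guessed diffs from the session above.

```python
import numpy as np

def first_crossing_index(cumulative, target):
    """Index of the first sign change of (cumulative - target), or None if there is none."""
    s = np.abs(np.diff(np.sign(cumulative - target))).astype(bool)
    hits = np.flatnonzero(s)  # 1-D array of positions where s is True
    return int(hits[0]) if hits.size else None

cum = np.cumsum(np.full(5, 0.1))            # roughly [0.1, 0.2, 0.3, 0.4, 0.5]
found = first_crossing_index(cum, 0.25)     # the sum crosses 0.25 between index 1 and 2
missing = first_crossing_index(cum, 1.0)    # the sum never reaches 1.0
print(found, missing)
```

In the question's loop, a None result could be used to skip that entry of numbers instead of indexing into an empty array.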

Numpy: Combine several arrays based on an indices array

I have 2 arrays of different sizes m and n, for instance:
x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
I also have an integer array of size m+n, for instance:
indices = np.asarray([1, 1, 0, 1 , 0])
I'd like to combine x and y into an array z of size m+n, in this case:
expected_z = np.asarray([300, 400, 100, 500, 200])
In detail:
The 1st value of indices is 1, so the 1st value of z should come from y. Therefore 300.
The 2nd value of indices is 1, so the 2nd value of z should also come from y. Therefore 400.
The 3rd value of indices is 0, so this time the 3rd value of z should come from x. Therefore 100.
...
How could I do that efficiently in NumPy?
Thanks in advance!

Make an output array and use boolean indexing to assign x and y into the correct slots of the output:
z = np.empty(len(x)+len(y), dtype=x.dtype)
z[indices==0] = x
z[indices==1] = y

out will be your desired output:
out = indices.copy()
out[np.where(indices==0)[0]] = x
out[np.where(indices==1)[0]] = y
or as the above answer suggested, simply do:
out = indices.copy()
out[indices==0] = x
out[indices==1] = y

I hope this helps:
x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
indices = np.asarray([1, 1, 0, 1 , 0])
expected_z = np.asarray([])
x_indice = 0
y_indice = 0
for i in range(0,len(indices)):
    if indices[i] == 0:
        expected_z = np.insert(expected_z,i,x[x_indice])
        x_indice += 1
    else:
        expected_z = np.insert(expected_z,i,y[y_indice])
        y_indice += 1
expected_z
and the output is:
output : array([300., 400., 100., 500., 200.])
P.S. Always make sure that len(indices) == len(x) + len(y), that the number of values taken from y (the 1s in indices) equals len(y), and that the number of values taken from x (the 0s in indices) equals len(x).
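The size checks from the P.S. can be combined with the boolean-indexing approach from the first answer. The sketch below assumes indices contains only 0s and 1s; the mask name and the use of np.result_type to pick a common dtype are illustrative choices, not part of the answers above.

```python
import numpy as np

x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
indices = np.asarray([1, 1, 0, 1, 0])

mask = indices.astype(bool)  # True where the value should come from y

# Sanity checks: each source array must exactly fill its slots
if mask.sum() != len(y) or (~mask).sum() != len(x):
    raise ValueError("indices does not match the sizes of x and y")

z = np.empty(indices.size, dtype=np.result_type(x, y))
z[~mask] = x  # fill the 0-slots with x, in order
z[mask] = y   # fill the 1-slots with y, in order
print(z)
```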

Looking up index of value in numpy 3D arrays

import numpy as np
# The 3D arrays have the axis: Z, X, Y
arr_keys = np.random.rand(20, 5, 5)
arr_vals = np.random.rand(20, 5, 5)
arr_idx = np.random.rand(5, 5)
For each grid cell in arr_idx, I want to look up the Z-position of the value closest to it in arr_keys (at the same X, Y location) and return the value at the corresponding position in the arr_vals array. Is there a way to do this without using nested for loops?
So, if the value at X=0, Y=0 in arr_idx is 0.5, I want to find the number closest to it in arr_keys at X=0, Y=0, with Z ranging over the whole axis, and then use the Z position of that number (let's call it Z_prime) to find the value in arr_vals at (Z_prime, X=0, Y=0).

This is the type of problem for which np.take_along_axis was created:
# shape (20, 5, 5)
diff = np.abs(arr_idx - arr_keys)
# np.argmin gained keepdims=True in NumPy 1.22; expand_dims emulates it on older versions
# shape (1, 5, 5)
inds = np.expand_dims(np.argmin(diff, axis=0), axis=0)
# shape (1, 5, 5)
res = np.take_along_axis(arr_vals, inds, axis=0)
# shape (5, 5)
res = res.squeeze(axis=0)
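To gain some confidence in the vectorized lookup, it can be compared against the nested loops it replaces. This is only a verification sketch: the seeded random generator and the name expected are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the check reproducible
arr_keys = rng.random((20, 5, 5))
arr_vals = rng.random((20, 5, 5))
arr_idx = rng.random((5, 5))

# Vectorized: Z index of the closest key per (X, Y) cell, then gather along Z
inds = np.argmin(np.abs(arr_idx - arr_keys), axis=0)             # shape (5, 5)
res = np.take_along_axis(arr_vals, inds[None, ...], axis=0)[0]   # shape (5, 5)

# Reference: explicit nested loops over X and Y
expected = np.empty((5, 5))
for i in range(5):
    for j in range(5):
        z = np.argmin(np.abs(arr_keys[:, i, j] - arr_idx[i, j]))
        expected[i, j] = arr_vals[z, i, j]

print(np.array_equal(res, expected))
```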

I think @xnx's answer is pretty good. Mine is longer but I'll post it anyway ;).
Also, a note: NumPy is made to handle large multi-dimensional arrays efficiently by vectorizing the operations. So I'd suggest avoiding for loops as much as possible. Whatever the task you're looking for, there is (usually) a way to do it while avoiding loops.
arr_keys = np.split(arr_keys, 20)
arr_keys = np.stack(arr_keys, axis=-1)[0]
arr_vals = np.split(arr_vals, 20)
arr_vals = np.stack(arr_vals, axis=-1)[0]
arr_idx = np.expand_dims(arr_idx, axis=-1)
difference = np.abs(arr_keys - arr_idx)
minimum = np.argmin(difference, axis=-1)
result = np.take_along_axis(arr_vals, np.expand_dims(minimum, axis=-1), axis=-1)
result = np.squeeze(result, axis=-1)

I think this might work: roll the axes into the correct orientation, find the index of the value of the (absolute) minimum for each of the 5x5 X,Y values and take the corresponding Z-values from arr_vals:
idx = np.argmin(np.abs(np.rollaxis(arr_keys,0,3) - arr_idx[:,:,None]), axis=2)
i,j = np.ogrid[:5,:5]
arr_vals[idx[i,j],i,j]
To test this, try the (3,2,2) case:
In [15]: arr_keys
Out[15]:
array([[[ 0.19681533, 0.26897784],
[ 0.60469711, 0.09273087]],
[[ 0.04961604, 0.3460404 ],
[ 0.88406912, 0.41284309]],
[[ 0.46298201, 0.33809574],
[ 0.99604152, 0.4836324 ]]])
In [16]: arr_vals
Out[16]:
array([[[ 0.88865681, 0.88287688],
[ 0.3128103 , 0.24188022]],
[[ 0.23947227, 0.57913325],
[ 0.85768064, 0.91701097]],
[[ 0.78105669, 0.84144339],
[ 0.81071981, 0.69217687]]])
In [17]: arr_idx
Out[17]:
array([[[ 0.31352609],
[ 0.75462329]],
[[ 0.44445286],
[ 0.97086161]]])
gives:
array([[ 0.88865681, 0.57913325],
[ 0.3128103 , 0.69217687]])

A little more verbose than the already posted solutions, but easier to understand.
import numpy as np
# The 3D arrays have the axis: Z, X, Y
arr_keys = np.random.rand(20, 5, 5)
arr_vals = np.random.rand(20, 5, 5)
arr_idx = np.random.rand(5, 5)
arr_idx = arr_idx[np.newaxis, :, :]
dist = np.abs(arr_idx - arr_keys)
dist_ind = np.argmin(dist, axis=0)
x = np.arange(0, 5, 1)
y = np.arange(0, 5, 1)
xx, yy = np.meshgrid(x, y)
res = arr_vals[dist_ind, yy, xx]
