Logical indexing in Python

I am working on improving the speed of logical indexing in Python. Currently, I have to plot some heatmaps, for which I divide the input data into a specified number of x and y bins and then, through the function return_val, use logical indexing to compute the mean value in each bin.
This works well when the number of bins is small, but when I increase it to, let's say, 100x100, the program slows down quite a lot.
I know the speed could be increased by using the stats.binned_statistic_2d function from SciPy. However, I would like to understand how I can optimize my current code to make the averaging go faster.
import numpy as np

arr_len = 932826
x = np.random.uniform(low=0, high=4496, size=arr_len)
y = np.random.uniform(low=-74, high=492, size=arr_len)
z = np.random.uniform(low=-30, high=97, size=arr_len)

# Check points
bin_x = 10
bin_y = 10
x1 = np.linspace(x.min(), x.max(), bin_x)
y1 = np.linspace(y.min(), y.max(), bin_y)

def return_val(x, y, z, x1, y1, i, j):
    idx = np.logical_and(np.logical_and(x > x1[i - 1], x < x1[i]),
                         np.logical_and(y > y1[j - 1], y < y1[j]))
    if np.count_nonzero(idx) == 0:
        return np.nan
    else:
        return np.mean(z[idx])

z1 = np.zeros((len(x1), len(y1)))
for i in range(1, len(x1)):
    for j in range(1, len(y1)):
        z1[i - 1, j - 1] = return_val(x, y, z, x1, y1, i, j)
z1 = z1.transpose()

Half the time is spent implicitly allocating temporary arrays (due to logical_and and the comparison operators), and the other half is spent in the slow nested loops calling a function through the slow CPython interpreter. One way to overcome these issues is simply to use Numba's JIT, with branchless operations, no temporary arrays, and parallelism. Here is an example:
import numpy as np
import numba as nb

arr_len = 932826
x = np.random.uniform(low=0, high=4496, size=arr_len)
y = np.random.uniform(low=-74, high=492, size=arr_len)
z = np.random.uniform(low=-30, high=97, size=arr_len)

# Check points
bin_x = 10
bin_y = 10
x1 = np.linspace(x.min(), x.max(), bin_x)
y1 = np.linspace(y.min(), y.max(), bin_y)

@nb.njit('float64(float64[::1], float64[::1], float64[::1], float64[::1], float64[::1], int32, int32)')
def return_val(x, y, z, x1, y1, i, j):
    count = 0
    s = 0.0
    # Branchless mean
    for k in range(len(x)):
        valid = (x[k] > x1[i - 1]) & (x[k] < x1[i]) & (y[k] > y1[j - 1]) & (y[k] < y1[j])
        s += z[k] * valid
        count += valid
    if count == 0:
        return np.nan
    else:
        return s / count

@nb.njit('float64[:,:](float64[::1], float64[::1], float64[::1], float64[::1], float64[::1])', parallel=True)
def compute(x, y, z, x1, y1):
    z1 = np.zeros((len(x1), len(y1)))
    for i in nb.prange(1, len(x1)):
        for j in range(1, len(y1)):
            z1[i - 1, j - 1] = return_val(x, y, z, x1, y1, i, j)
    return z1

z1 = compute(x, y, z, x1, y1)
The above code is 11 times faster on my machine. It can be improved further by restructuring the loops so that the computation is more cache-friendly.
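A different route, which I haven't benchmarked here, is to stay in pure NumPy but make a single pass over the data: compute each point's bin index once and accumulate per-bin sums and counts with np.bincount, which removes the rescanning of all points for every bin. A minimal sketch of that idea (note it uses half-open bins via np.digitize, so edge handling differs slightly from the strict inequalities above, and it returns the (len(x1)-1) x (len(y1)-1) grid of actual bins rather than a padded array):

def binned_mean(x, y, z, x_edges, y_edges):
    # Assign every point to a bin in one pass.
    ix = np.digitize(x, x_edges) - 1              # 0-based bin index along x
    iy = np.digitize(y, y_edges) - 1              # 0-based bin index along y
    nx, ny = len(x_edges) - 1, len(y_edges) - 1
    # Keep only points that fall inside the outer edges.
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    flat = ix[ok] * ny + iy[ok]                   # flattened 2D bin index
    sums = np.bincount(flat, weights=z[ok], minlength=nx * ny)
    counts = np.bincount(flat, minlength=nx * ny)
    with np.errstate(invalid='ignore'):
        means = sums / counts                     # NaN where a bin is empty
    return means.reshape(nx, ny).T                # transposed like z1 above

z1_fast = binned_mean(x, y, z, x1, y1)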

Related

Remove the intersection between two curves

I have a curve (a parabola) from 0 to 1 on both axes, as follows:
I generate another curve by moving the original curve along the x-axis and combine both to get the following graph:
How can I remove the intersecting section so that only the double-bottom pattern remains, like this:
The code I use for the graph:
import numpy as np
import matplotlib.pyplot as plt

def get_parabol(start=-1, end=1, steps=100, normalized=True):
    x = np.linspace(start, end, steps)
    y = x**2
    if normalized:
        x = np.array(x)
        x = (x - x.min())/(x.max() - x.min())
        y = np.array(y)
        y = (y - y.min())/(y.max() - y.min())
    return x, y

def curve_after(x, y, x_ratio=1/3, y_ratio=1/2, normalized=False):
    x = x*x_ratio + x.max() - x[0]*x_ratio
    y = y*y_ratio + y.max() - y.max()*y_ratio
    if normalized:
        x = np.array(x)
        x = (x - x.min())/(x.max() - x.min())
        y = np.array(y)
        y = (y - y.min())/(y.max() - y.min())
    return x, y

def concat_arrays(*arr, axis=0, normalized=True):
    arr = np.concatenate([*arr], axis=axis).tolist()
    if normalized:
        arr = np.array(arr)
        arr = (arr - arr.min())/(arr.max() - arr.min())
    return arr

x, y = get_parabol()
new_x, new_y = curve_after(x, y, x_ratio=1, y_ratio=1, normalized=False)
new_x = np.add(x, 0.5)
# new_y = np.add(y, 0.2)

xx = concat_arrays(x, new_x, normalized=True)
yy = concat_arrays(y, new_y, normalized=True)

# plt.plot(x, y, '-')
plt.plot(xx, yy, '--')
I'm doing research on pattern analysis that requires me to generate patterns with mathematical functions.
Could you show me a way to achieve this? Thank you!
First off, I would have two different parabola functions such that:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-1, 1, 100)
y1 = np.add(x, 0.3)**2 # Parabola centered at -0.3
y2 = np.add(x, -0.3)**2 # Parabola centered at 0.3
You can choose your own offsets for y1 and y2 depending on your needs.
And then it's simply a matter of taking the minimum of the two arrays:
y_final = np.minimum(y1, y2)
plt.plot(x, y_final, '--')
This involves curve fitting. You need to find the intersection before you drop the values. Since the values of x and y have been normalized, we have to determine exactly where the two datasets meet. We can see that the split between the two datasets is where x[i] > x[i+1]. Using your combined xx and yy from the data provided, we can therefore do the following:
data_intersect = int(np.where(np.r_[0, np.diff(xx)] < 0)[0])

x1 = xx[:data_intersect]
x2 = xx[data_intersect:]
y1 = yy[:data_intersect]
y2 = yy[data_intersect:]

difference = np.polyfit(x1, y1, 2) - np.polyfit(x2, y2, 2)
meet = np.roots(difference)                        # all points where the two curves meet
meet = meet[(meet < max(x1)) & (meet > min(x1))]   # the only point where the curves meet

xxx = np.r_[x1[x1 < meet], x2[x2 > meet]]
yyy = np.r_[y1[x1 < meet], y2[x2 > meet]]
plt.plot(xxx, yyy, '--')

trying to use a meshgrid instead of a double for loop

I was wondering: does using a meshgrid instead of a double for loop make the code run faster? If so, how do I do it?
import numpy as np

def f(x, y):
    return np.sin(x)*np.cos(y/5)

print(f(1, 2))

def midpoint_I(D, nx, ny):
    hx = (D[1] - D[0])/float(nx)
    hy = (D[2] - D[3])/float(ny)
    I = 0
    for i in range(nx):
        for j in range(ny):
            xi = D[0] + hx/2 + i*hx
            yj = D[2] + hy/2 + j*hy
            I += hx*hy*f(xi, yj)
    return I

D = [0, 5, 0, 5]
N = 100
M = 100
print(np.absolute(midpoint_I(D, N, M)))
I tried the double loop and it worked, but it took a bit too long; I'm wondering whether a meshgrid would work faster.
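In this case, yes: f is written with NumPy ufuncs, so you can evaluate it on the whole grid of midpoints at once and sum the result, which removes the Python-level double loop. A sketch of that idea, using the same f and D as above (I've written hy as (D[3] - D[2])/ny, the conventional orientation, which is why no absolute value is needed at the end):

import numpy as np

def f(x, y):
    return np.sin(x)*np.cos(y/5)

def midpoint_I_vec(D, nx, ny):
    hx = (D[1] - D[0])/float(nx)
    hy = (D[3] - D[2])/float(ny)
    # Midpoints of every cell along each axis.
    xi = D[0] + hx/2 + np.arange(nx)*hx
    yj = D[2] + hy/2 + np.arange(ny)*hy
    # Evaluate f on the full grid of midpoints in one vectorized call.
    X, Y = np.meshgrid(xi, yj, indexing='ij')
    return hx*hy*np.sum(f(X, Y))

D = [0, 5, 0, 5]
print(midpoint_I_vec(D, 100, 100))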

Using solve_ivp instead of odeint to solve initial problem value

Currently, I solve the following ODE system of equations using odeint
dx/dt = (-x + u)/2.0
dy/dt = (-y + x)/5.0
initial conditions: x = 0, y = 0
However, I would like to use solve_ivp, which seems to be the recommended option for this type of problem, but honestly I don't know how to adapt the code...
Here is the code I'm using with odeint:
import numpy as np
from scipy.integrate import odeint, solve_ivp
import matplotlib.pyplot as plt

def model(z, t, u):
    x = z[0]
    y = z[1]
    dxdt = (-x + u)/2.0
    dydt = (-y + x)/5.0
    dzdt = [dxdt, dydt]
    return dzdt

def main():
    # initial condition
    z0 = [0, 0]
    # number of time points
    n = 401
    # time points
    t = np.linspace(0, 40, n)
    # step input
    u = np.zeros(n)
    # change to 2.0 at time = 5.0
    u[51:] = 2.0
    # store solution
    x = np.empty_like(t)
    y = np.empty_like(t)
    # record initial conditions
    x[0] = z0[0]
    y[0] = z0[1]

    # solve ODE
    for i in range(1, n):
        # span for next time step
        tspan = [t[i-1], t[i]]
        # solve for next step
        z = odeint(model, z0, tspan, args=(u[i],))
        # store solution for plotting
        x[i] = z[1][0]
        y[i] = z[1][1]
        # next initial condition
        z0 = z[1]

    # plot results
    plt.plot(t, u, 'g:', label='u(t)')
    plt.plot(t, x, 'b-', label='x(t)')
    plt.plot(t, y, 'r--', label='y(t)')
    plt.ylabel('values')
    plt.xlabel('time')
    plt.legend(loc='best')
    plt.show()

main()
It's important to note that solve_ivp expects f(t, z) as the right-hand side of the ODE (the argument order is swapped compared to odeint's f(z, t)). If you don't want to change your ODE function and also want to pass your parameter u, I recommend defining a wrapper function:
def model(z, t, u):
    x = z[0]
    y = z[1]
    dxdt = (-x + u)/2.0
    dydt = (-y + x)/5.0
    dzdt = [dxdt, dydt]
    return dzdt

def odefun(t, z):
    if t < 5:
        return model(z, t, 0)
    else:
        return model(z, t, 2)
Now it's easy to call solve_ivp:
def main():
    # initial condition
    z0 = [0, 0]
    # number of time points
    n = 401
    # time points
    t = np.linspace(0, 40, n)
    # step input
    u = np.zeros(n)
    # change to 2.0 at time = 5.0
    u[51:] = 2.0

    res = solve_ivp(fun=odefun, t_span=[0, 40], y0=z0, t_eval=t)
    x = res.y[0, :]
    y = res.y[1, :]

    # plot results
    plt.plot(t, u, 'g:', label='u(t)')
    plt.plot(t, x, 'b-', label='x(t)')
    plt.plot(t, y, 'r--', label='y(t)')
    plt.ylabel('values')
    plt.xlabel('time')
    plt.legend(loc='best')
    plt.show()

main()
Note that without passing t_eval=t, the solver automatically chooses the time points inside t_span at which the solution is stored.
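As an aside, if you are on SciPy 1.4 or newer (an assumption here), solve_ivp also accepts an args tuple, so you can keep a parameterized model and skip the wrapper entirely. A sketch, reusing z0 and t from main() and solving the two constant-u segments separately so the step at t = 5 stays sharp:

def model(t, z, u):
    x, y = z
    return [(-x + u)/2.0, (-y + x)/5.0]

# u = 0 before t = 5, u = 2 afterwards; chain the two solves together.
res1 = solve_ivp(model, t_span=[0, 5], y0=z0, t_eval=t[t <= 5], args=(0.0,))
res2 = solve_ivp(model, t_span=[5, 40], y0=res1.y[:, -1], t_eval=t[t > 5], args=(2.0,))
x = np.concatenate([res1.y[0, :], res2.y[0, :]])
y = np.concatenate([res1.y[1, :], res2.y[1, :]])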

starting Summation value at i=2

I am trying to plot the error of this algorithm against h, and I have run into a problem: the error calculation can't use the first value, as it divides 0 by 0. How do I go about ignoring the first value, where x = 0? I basically need to start the summation at i = 2 on line 46 (the absolute error one). Any help is much appreciated.
import numpy
import matplotlib.pyplot as pyplot
from scipy.optimize import fsolve
from matplotlib import rcParams

rcParams['font.family'] = 'serif'
rcParams['font.size'] = 16
rcParams['figure.figsize'] = (12, 6)
printing = False

def rk3(A, bvector, y0, interval, N):
    h = (interval[1] - interval[0]) / N
    x = numpy.linspace(interval[0], interval[1], N+1)
    y = numpy.zeros((len(y0), N+1))
    y[:, 0] = y0
    b = bvector
    for i in range(N):
        y_1 = y[:, i] + h*(numpy.dot(A, y[:, i]) + b(x[i]))
        y_2 = (3/4)*y[:, i] + 0.25*y_1 + 0.25*h*(numpy.dot(A, y_1) + b(x[i] + h))
        y[:, i+1] = (1/3)*y[:, i] + (2/3)*y_2 + (2/3)*h*(numpy.dot(A, y_2) + b(x[i] + h))
    return x, y

def exact(interval, N):
    w = numpy.linspace(interval[0], interval[1], N+1)
    z = numpy.array([numpy.exp(-1000*w), (1000/999)*(numpy.exp(-w) - numpy.exp(-1000*w))])
    return w, z

A = numpy.array([[-1000, 0], [1000, -1]])

def bvector(x):
    return numpy.zeros(2)

y0 = numpy.array([1, 0])
interval = numpy.array([0, 0.1])
N = numpy.arange(40, 401, 40)
h = numpy.zeros(len(N))
abs_err = numpy.zeros(len(N))

for i in range(len(N)):
    interval = numpy.array([0, 0.1])
    h[i] = (interval[1] - interval[0]) / N[i]
    x, y = rk3(A, bvector, y0, interval, N[i])
    w, z = exact(interval, N[i])
    abs_err[i] = h[i]*numpy.sum(numpy.abs((y[1, :] - z[1, :])/z[1, :]))

p = numpy.polyfit(numpy.log(h), numpy.log(abs_err), 1)

fig = pyplot.figure(figsize=(12, 8), dpi=50)
pyplot.loglog(h, abs_err, 'kx')
pyplot.loglog(h, numpy.exp(p[1]) * h**(p[0]), 'b-')
pyplot.xlabel('$h$', size=16)
pyplot.ylabel('$|$Error$|$', size=16)
pyplot.show()
Simply add an if for the value which is zero. So, for example, if the dividing variable is x:

if x > 0:
    # code here for the calculation

The above will use all positive, non-zero values. To skip only zero, use this instead:

if x != 0:

You can also use the three arguments of a for loop:

for a in range(start_value, end_value, increment):

so this means

for a in range(2, 10, 2):
    print(a)

will give you the result below:

2
4
6
8
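Applied to the error line in your loop, the simplest fix is probably to slice the arrays so the first entry (where x = 0 makes z[1, 0] zero) never enters the sum; a sketch, assuming y and z are laid out as in your rk3/exact code:

# Skip index 0, where z[1, 0] = 0 would give 0/0.
abs_err[i] = h[i]*numpy.sum(numpy.abs((y[1, 1:] - z[1, 1:])/z[1, 1:]))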

Not getting correct contour plot of coefficients from my Logistic Regression implementation?

I implemented logistic regression and used it on a data set. (This is an exercise from Week #3 of Coursera's ML course, which normally uses Matlab and Octave, done here in Python, so this isn't cheating.)
I started with the implementation in sklearn to classify the data set used in week three of this course (http://pastie.org/10872959). Here is a small, reproducible example for anyone to try out what I used (it relies only on numpy and sklearn):
It takes the data set, splits it into the feature matrix and the output matrix, and then constructs 26 more features from the original 2 (i.e. the monomial terms x^(i-j) * y^j up to degree 6, as built by constructVariations below). I then use logistic regression in sklearn, but this does not give the desired contour plot (please see below).
from sklearn.linear_model import LogisticRegression as expit
import numpy as np

def thetaFunc(y, theta, x):
    deg = 6
    spot = 0
    sum = 0
    for i in range(1, deg + 1):
        for j in range(i + 1):
            sum += theta[spot] * x**(i - j) * y**(j)
            spot += 1
    return sum

def constructVariations(X, deg):
    features = np.zeros((len(X), 27))
    spot = 0
    for i in range(1, deg + 1):
        for j in range(i + 1):
            features[:, spot] = X[:, 0]**(i - j) * X[:, 1]**(j)
            spot += 1
    return features

if __name__ == '__main__':
    data = np.loadtxt("ex2points.txt", delimiter=",")
    X, Y = np.split(data, [len(data[0, :]) - 1], 1)
    X = constructVariations(X, 6)
    oneArray = np.ones((len(X), 1))
    X = np.hstack((oneArray, X))
    trial = expit(solver='sag')
    trial = trial.fit(X=X, y=np.ravel(Y))
    print(trial.coef_)

    # everything below has been edited in
    from matplotlib import pyplot as plt
    txt = open("RegLogTheta", "r").read()
    txt = txt.split()
    theta = np.array(txt, float)

    x = np.linspace(-1, 1.5, 100)
    y = np.linspace(-1, 1.5, 100)
    z = np.empty((100, 100))
    xx, yy = np.meshgrid(x, y)
    for i in range(len(x)):
        for j in range(len(y)):
            z[i][j] = thetaFunc(yy[i][j], theta, xx[i][j])
    plt.contour(xx, yy, z, levels=[0])
    plt.show()
Here are the coefficients of the generic feature terms: http://pastie.org/10872957 (i.e. the coefficients of the terms constructed above), and the contour it generates:
One potential source of error is that I'm misinterpreting the 7 x 4 coefficient matrix stored in trial._coeff. I believe that these 28 values are the coefficients of the 28 "variations" above, and I've mapped the coefficients to the variations both column-wise and row-wise. By column-wise, I mean that [:][0] gets mapped to the first 7 variations, [:][1] to the next 7, and so on, and my function constructVariations explains how the variations are systematically created. Now the API maintains that an array of shape (n_classes, n_features) is stored in trial._coeff, so should I infer that fit classified the data into four classes? Or have I run through this problem poorly in another way?
Update
My interpretation (and/or use) of the weights must be at fault:
Instead of relying on the prediction built into sklearn, I tried to calculate myself the values (x, y) that set the sigmoid 1/(1 + e^(-theta . features(x, y))) to 1/2, i.e. the decision boundary.
The values of theta are those found by printing trial._coeff, and x and y are scalars. Those (x, y) are then plotted to give the contour.
The code I used (but did not originally add in) attempts to do this. What is wrong with the math behind it?
One potential source of error is that I'm misinterpreting the 7 X 4 matrix coefficient matrix stored in trial._coeff
This matrix is not 7x4, it is 1x28 (check print(trial.coef_.shape)). One coefficient for each of your 28 features (27 returned by constructVariations and 1 added manually).
so should I infer that fit classified the data into four classes?
No, you misinterpreted the array; it has a single row (for binary classification there is no point in having two).
Or have I run through this problem poorly in another way?
The code is fine, but the interpretation is not. In particular, see the actual decision boundary from your model (plotted by calling "predict" and plotting the contour):
from sklearn.linear_model import LogisticRegression as expit
import numpy as np

def constructVariations(X, deg):
    features = np.zeros((len(X), 27))
    spot = 0
    for i in range(1, deg + 1):
        for j in range(i + 1):
            features[:, spot] = X[:, 0]**(i - j) * X[:, 1]**(j)
            spot += 1
    return features

if __name__ == '__main__':
    data = np.loadtxt("ex2points.txt", delimiter=",")
    X, Y = np.split(data, [len(data[0, :]) - 1], 1)
    rawX = np.copy(X)
    X = constructVariations(X, 6)
    oneArray = np.ones((len(X), 1))
    X = np.hstack((oneArray, X))
    trial = expit(solver='sag')
    trial = trial.fit(X=X, y=np.ravel(Y))
    print(trial.coef_)

    from matplotlib import pyplot as plt
    h = 0.01
    x_min, x_max = rawX[:, 0].min() - 1, rawX[:, 0].max() + 1
    y_min, y_max = rawX[:, 1].min() - 1, rawX[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    data = constructVariations(np.c_[xx.ravel(), yy.ravel()], 6)
    oneArray = np.ones((len(data), 1))
    data = np.hstack((oneArray, data))
    Z = trial.predict(data)
    Z = Z.reshape(xx.shape)

    plt.figure()
    plt.scatter(rawX[:, 0], rawX[:, 1], c=Y, linewidth=0, s=50)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.show()
Update
In the code provided you forgot (in the visualization) that you added a column of "1"s to your data representation, so your thetas are off by one: theta[0] is the bias, theta[1] relates to your 0th variable, and so on.
def thetaFunc(y, theta, x):
    deg = 6
    spot = 0
    sum = theta[spot]
    spot += 1
    for i in range(1, deg + 1):
        for j in range(i + 1):
            sum += theta[spot] * x**(i - j) * y**(j)
            spot += 1
    return sum
You also forgot about the intercept term from LogisticRegression itself, thus:
xx, yy = np.meshgrid(x, y)
for i in range(len(x)):
    for j in range(len(y)):
        z[i][j] = thetaFunc(yy[i][j], theta, xx[i][j])
z -= trial.intercept_
(Image generated using the fixed version of your code below.)
import numpy as np
from sklearn.linear_model import LogisticRegression as expit

def thetaFunc(y, theta, x):
    deg = 6
    spot = 0
    sum = theta[spot]
    spot += 1
    for i in range(1, deg + 1):
        for j in range(i + 1):
            sum += theta[spot] * x**(i - j) * y**(j)
            spot += 1
    return np.exp(-sum)

def constructVariations(X, deg):
    features = np.zeros((len(X), 27))
    spot = 0
    for i in range(1, deg + 1):
        for j in range(i + 1):
            features[:, spot] = X[:, 0]**(i - j) * X[:, 1]**(j)
            spot += 1
    return features

if __name__ == '__main__':
    data = np.loadtxt("ex2points.txt", delimiter=",")
    X, Y = np.split(data, [len(data[0, :]) - 1], 1)
    X = constructVariations(X, 6)
    rawX = np.copy(X)
    oneArray = np.ones((len(X), 1))
    X = np.hstack((oneArray, X))
    trial = expit(solver='sag')
    trial = trial.fit(X=X, y=np.ravel(Y))

    from matplotlib import pyplot as plt
    theta = trial.coef_.ravel()
    x = np.linspace(-1, 1.5, 100)
    y = np.linspace(-1, 1.5, 100)
    z = np.empty((100, 100))
    xx, yy = np.meshgrid(x, y)
    for i in range(len(x)):
        for j in range(len(y)):
            z[i][j] = thetaFunc(yy[i][j], theta, xx[i][j])
    z -= trial.intercept_
    plt.contour(xx, yy, z > 1, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(rawX[:, 0], rawX[:, 1], c=Y, linewidth=0, s=50)
    plt.show()
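If reconstructing the polynomial by hand feels error-prone, a sketch of an alternative (reusing trial, constructVariations, rawX and Y from the code above) is to let the fitted model do the bookkeeping: decision_function returns theta . x plus the intercept, so its zero contour is the decision boundary:

from matplotlib import pyplot as plt

# Evaluate the fitted model's decision function on a grid and contour it at 0.
gx = np.linspace(-1, 1.5, 100)
gy = np.linspace(-1, 1.5, 100)
xx, yy = np.meshgrid(gx, gy)
grid = constructVariations(np.c_[xx.ravel(), yy.ravel()], 6)
grid = np.hstack((np.ones((len(grid), 1)), grid))   # same manually added column of ones
zz = trial.decision_function(grid).reshape(xx.shape)

plt.contour(xx, yy, zz, levels=[0])
plt.scatter(rawX[:, 0], rawX[:, 1], c=Y, linewidth=0, s=50)
plt.show()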
