Non Linear Decision boundary SVM - python

I need help finding a non-linear decision boundary. I have 2 features with numerical data, and I made a simple linear decision boundary (see picture below).
Now I would like my red line to look like the black line:
The 'equation' I used for plotting the red line is:
# x and y not related to the dataset, those variables are only for plotting the red line
# mdl is my model
x = np.linspace(0.6, -0.6)
y = -(mdl.coef_[0][0] / mdl.coef_[0][1]) * x - mdl.intercept_ / mdl.coef_[0][1]
The model is an SVM. I performed a GridSearchCV and kept the best estimator. I used a linear kernel so that I could get the model's coefficients and intercept.
I can add a third dimension to the equation if needed. I have plenty of columns available in my df; I only kept the 2 most important ones (which I got by looking at the model's feature importance).
Ideally, I would have one equation for plotting the decision boundary and one that I could include in my dataset as a 'sanction': if the result of the sanction equation is above 0, the sample's target is 1; otherwise it is 0.
Like this (something I made with another model, but for a 3D plot):
# Equation sanction that goes into my dataset
df['sanction'] = df.widthS63R04 * model.coef_[0][0] + df.chordS67R04 * model.coef_[0][1] + df.chordS71R04 * model.coef_[0][2]
#Equation for 3D Hyperplane
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
z = lambda x,y: (-mdl.intercept_[0]-mdl.coef_[0][0]*x -mdl.coef_[0][1]*y) / mdl.coef_[0][2]
# lambda usage for 3d surface hyperplane
ax.plot_surface(x, y, z(x, y))
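For a linear kernel, the 'sanction' score can also be read directly from the fitted model. The following is a minimal sketch, assuming a binary `SVC(kernel="linear")` and synthetic data from `make_blobs` (the data and the names `sanction_manual` and `sanction_model` are illustrative, not taken from the original dataset). Unlike the equation above, it includes the intercept term:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data standing in for the real features (assumption).
X, y = make_blobs(n_samples=200, centers=2, n_features=3, random_state=0)
mdl = SVC(kernel="linear").fit(X, y)

# Manual score: w . x + b, including the intercept.
sanction_manual = X @ mdl.coef_[0] + mdl.intercept_[0]
# The same score as computed by scikit-learn.
sanction_model = mdl.decision_function(X)

# A positive score maps to the second class in mdl.classes_.
pred_from_sanction = np.where(sanction_manual > 0, mdl.classes_[1], mdl.classes_[0])
```

With a DataFrame, the same score could be stored as a column by passing the relevant feature columns to `decision_function`.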

Support-Vector-Machines-SVM-/Kernel Trick SVM.ipynb
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

zero_one_colourmap = ListedColormap(('blue', 'red'))

def plot_decision_boundary(X, y, clf):
    X_set, y_set = X, y
    # Grid covering the data range with a margin of 1 on each side
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                                   stop = X_set[:, 0].max() + 1,
                                   step = 0.01),
                         np.arange(start = X_set[:, 1].min() - 1,
                                   stop = X_set[:, 1].max() + 1,
                                   step = 0.01))
    # Colour each grid point by its predicted class
    plt.contourf(X1, X2, clf.predict(np.array([X1.ravel(),
                                               X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.75,
                 cmap = zero_one_colourmap)
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    # Overlay the samples, one colour per class
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c = (zero_one_colourmap)(i), label = j)
    plt.title('SVM Decision Boundary')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.legend()
    return plt.show()
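A curved boundary like the black line generally needs a non-linear kernel, in which case `coef_` is no longer available and the boundary has no simple closed-form equation. The sketch below assumes an RBF kernel and synthetic `make_circles` data (both are illustrative choices, not the original setup). It evaluates `decision_function` on a grid, so the boundary is the zero level set, and it uses the same signed score as the per-sample 'sanction':

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data that is not linearly separable (assumption: stands in for the two features).
X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)
clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)

# Evaluate the signed score on a grid; the boundary is where the score equals 0.
g1, g2 = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
scores = clf.decision_function(np.c_[g1.ravel(), g2.ravel()]).reshape(g1.shape)

# "Sanction" for each sample: positive score -> class 1, otherwise class 0.
sanction = clf.decision_function(X)
target_from_sanction = (sanction > 0).astype(int)

# To draw the curve (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.contour(g1, g2, scores, levels=[0], colors="k")
# plt.scatter(X[:, 0], X[:, 1], c=y)
# plt.show()
```

The kernel and its parameters would normally be chosen by the same GridSearchCV step that the question already uses.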

Related

why does my own implementation of logistic regression differ from sklearn?

I am trying to implement logistic regression for a binary classification problem from scratch in Python. My results do not match those provided by the implementation of sklearn, as you can see in this example. Note that the lines look "similar", but they are clearly not the same.
I took care of what is mentioned in this answer: both sklearn and I (i) fit the intercept term and (ii) do not apply regularization (penalty='none'). Also, while sklearn applies 100 iterations to train the algorithm (by default), I am applying 10000 with a rather small learning rate of 0.01. I tried different combinations of values, but the problem does not seem to depend on this.
At the same time, I do notice that, even before comparing the results with sklearn, the ones I obtain with my implementation seem to be wrong: the decision regions are clearly off in some cases. You can see an example in this image.
The last point seems to indicate that the problem is all my own fault. Here is my code (it actually generates new datasets at each run and plots the results):
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
def create_training_set():
    X0, y = make_blobs(n_samples=[100, 100],
                       centers=None,
                       n_features=2,
                       cluster_std=1)
    y = y.reshape(-1, 1) # make y a column vector
    return np.hstack([np.ones((X0.shape[0], 1)), X0]), X0, y

def create_test_set(X0):
    xx, yy = np.meshgrid(np.arange(X0[:, 0].min() - 1, X0[:, 0].max() + 1, 0.1),
                         np.arange(X0[:, 1].min() - 1, X0[:, 1].max() + 1, 0.1))
    X_test = np.c_[xx.ravel(), yy.ravel()]
    X_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
    return xx, yy, X_test

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def apply_gradient_descent(theta, X, y, max_iter=1000, alpha=0.1):
    m = X.shape[0]
    cost_iter = []
    for _ in range(max_iter):
        p_hat = sigmoid(np.dot(X, theta))
        cost_J = -1/float(m) * (np.dot(y.T, np.log(p_hat)) + np.dot((1 - y).T, np.log(1 - p_hat)))
        grad_J = 1/float(m) * np.dot(X.T, p_hat - y)
        theta -= alpha * grad_J
        cost_iter.append(float(cost_J))
    return theta, cost_iter
fig, ax = plt.subplots(10, 2, figsize = (10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
all_cost_history = []
for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)
    theta, cost_evolution = apply_gradient_descent(np.zeros((X_train.shape[1], 1)), X_train, y, max_iter, alpha)
    all_cost_history.append(cost_evolution)
    y_pred = np.where(sigmoid(np.dot(X_test, theta)) > 0.5, 1, 0)
    y_pred = y_pred.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, y_pred, cmap = cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y.ravel(), cmap=cmap_bold, alpha = 1, edgecolor="black")
    y = y.reshape(X_train.shape[0], )
    clf = LogisticRegression().fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap = cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha = 1, edgecolor="black")
plt.show()
There is a real difference between your implementation and sklearn's: you are not using the same optimization algorithm (called the solver in sklearn), and I think the difference you observe comes from there. You are using gradient descent, while sklearn's implementation uses by default the "liblinear" solver (the default in versions before 0.22; later versions default to "lbfgs"), which works differently.
Indeed, different optimization algorithms can yield different results depending on, for example:
The convergence speed: as we are limiting the number of iterations, an algorithm that converges more slowly stops at a different point and thus yields different decision regions.
Whether an algorithm is deterministic or not: non-deterministic algorithms (such as stochastic gradient descent) can end at a different point given the same dataset. With non-deterministic algorithms you could observe different results with the exact same dataset and algorithm.
Hyperparameters: changing a hyperparameter (for example the learning rate of a gradient descent algorithm) changes the behavior of the optimization algorithm too, thus leading to different results.
In your case, there are good reasons for not always getting the same results: the logistic regression loss is convex, but the gradient descent algorithm you use can stop before reaching the minimum (because of an insufficient number of iterations, a non-optimal learning rate...), at a point different from the one reached by the liblinear solver.
You can observe the same kind of discrepancies if you compare sklearn's implementation with different solvers (reusing your code):
fig, ax = plt.subplots(10, 2, figsize=(10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
solver_algo_1 = 'liblinear'
solver_algo_2 = 'sag'
for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)
    y = y.reshape(X_train.shape[0], )
    clf = LogisticRegression(solver=solver_algo_1, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")
    clf = LogisticRegression(solver=solver_algo_2, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")
plt.show()
As an example, with "liblinear" (left) and "newton-cg" (right), you can get this:
Though the logistic regression model is the same, the difference in the optimization algorithms leads to different results. In short, the difference between your implementation and scikit-learn's is the optimization algorithm.
Now if the quality of the decision boundary you get is not satisfactory, you can try tuning the hyperparameters of your gradient descent algorithm or changing the optimization algorithm!
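To put a number on such discrepancies rather than judging them by eye, the fitted parameters of two solvers can be compared directly. This is a minimal sketch, assuming a fixed random seed, hand-picked cluster centers, and scikit-learn's default L2 regularization (the helper `fit_params` is hypothetical, introduced only for this illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Fixed seed and centers so that the comparison is reproducible (assumption).
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.5, random_state=0)

def fit_params(solver):
    """Fit with the given solver and return the model and [intercept, w1, w2]."""
    clf = LogisticRegression(solver=solver, max_iter=10000).fit(X, y)
    return clf, np.r_[clf.intercept_, clf.coef_.ravel()]

clf_a, params_a = fit_params("liblinear")
clf_b, params_b = fit_params("lbfgs")

# Fraction of training points on which the two fitted models agree.
agreement = np.mean(clf_a.predict(X) == clf_b.predict(X))
print(params_a, params_b, agreement)
```

The same comparison could include the `theta` vector returned by the gradient descent code, since it uses the same [intercept, w1, w2] ordering.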

Problems using curve_fit for multivariate gaussian fit

I am so stuck trying to fit 3D gaussians, and I am hoping someone can see some silly mistake I am making, because I have spent hours debugging to no avail.
I have a 3d image stored in an array called "data", where data[x, y, z] gives the grayscale intensity at the point (x, y, z). I know that this 3d image follows a 3D Gaussian distribution, with a peak near the center of the image, but I am interested in the amplitude and spread. I am trying to fit this 3d array to a gaussian of the form $f(x, y, z) = \text{offset} + A \exp\left(-\frac{(x - x_0)^2}{2\sigma_x^2} - \frac{(y - y_0)^2}{2\sigma_y^2} - \frac{(z - z_0)^2}{2\sigma_z^2}\right)$. My function in Python is:
def gaussian_3d(X, A, x0, y0, z0, sigx, sigy, sigz, offset):
    x, y, z = X
    return offset + A*np.exp(-(x-x0)**2/(2*sigx**2) - \
                             (y-y0)**2/(2*sigy**2) - (z-z0)**2/(2*sigz**2))
And the way I am doing this is as follows: if my image is of size 3 x 4 x 5, then I create a meshgrid (0...2) x (0...3) x (0...4), and then try to fit the intensity values to the function above.
My code looks like this:
import numpy as np
import scipy.optimize as opt

def fit_gauss_3d(data):
    dim = data.shape
    # Step 1: set up meshgrid
    x, y, z = np.arange(0, dim[0]), np.arange(0, dim[1]), np.arange(0, dim[2])
    X, Y, Z = np.meshgrid(x, y, z)
    data_in = np.vstack((X.ravel(),Y.ravel(),Z.ravel()))
    data_out = data.ravel()
    # Step 2: make good guess of the center "peak" point of the gaussian (x0, y0, z0)
    # by using slices along the middle and finding the position of the maxes
    mid1, mid2, mid3 = dim[0]//2, dim[1]//2, dim[2]//2
    x0, y0, z0 = np.argmax(data[:, mid2, mid3]), np.argmax(data[mid1, :, mid3]), np.argmax(data[mid1, mid2, :])
    # Step 3: Set lower/upper bounds for parameter search
    delta = 0.5 # I am saying that the fit peak must be within +/- 0.5 of the initial guess
    data_max = data.max() # assumed definition (peak intensity); data_max was not defined in the snippet
    p0 = (data_max + 0.05, x0, y0, z0, 0.9, 0.9, 0.9, 0.05)
    # Note: I know that sigmas are between 0.7 and 2.5, and offset is between 0 and 5
    lower_bound = [data_max * 0.9, x0 - delta, y0 - delta, z0 - delta, 0.7, 0.7, 0.7, 0]
    upper_bound = [data_max*1.1 + 0.1, x0 + delta, y0 + delta, z0 + delta, 2.5, 2.5, 2.5, 5]
    # Step 4: Fit
    p_param, p_cov = opt.curve_fit(gaussian_3d, data_in, data_out, p0=p0, maxfev=50000, bounds=(lower_bound, upper_bound))
    return p_param
def predict_gauss_3d(params, dims):
    x = np.arange(0, dims[0])
    y = np.arange(0, dims[1])
    z = np.arange(0, dims[2])
    XX, YY, ZZ = np.meshgrid(x, y, z)
    X = np.vstack((XX.ravel(),YY.ravel(),ZZ.ravel()))
    return gaussian_3d(X, *params).reshape(dims)

def plot_results(orig, sec):
    ''' Plot original and second fitted image'''
    dim = orig.shape # dim was undefined inside this function in the snippet
    mid1, mid2, mid3 = dim[0]//2, dim[1]//2, dim[2]//2
    fig = plt.figure()
    ax1 = fig.add_subplot(3, 1, 1)
    ax1.plot(orig[:, mid2, mid3], label='orig')
    ax1.plot(sec[:, mid2, mid3], label='fitted')
    ax1.legend(loc="upper left")
    ax2 = fig.add_subplot(3, 1, 2)
    ax2.plot(orig[mid1, :, mid3], label='orig')
    ax2.plot(sec[mid1, :, mid3], label='fitted')
    ax2.legend(loc="upper left")
    ax3 = fig.add_subplot(3, 1, 3)
    ax3.plot(orig[mid1, mid2], label='orig')
    ax3.plot(sec[mid1, mid2], label='fitted')
    ax3.legend(loc="upper left")
    plt.tight_layout()
    plt.show()
I plotted the projection of the fits along the middle axes. The first pic is varying x and keeping y, z at their midpoints, the second is varying y and keeping x, z at their midpoints, and so forth.
Some of my fits are reasonable, something like this:
While most are insanely bad, and not even Gaussian looking! For the below image, it chose the following parameters: . Clearly, I am either plotting wrong or fitting wrong. Can someone help me out? Is my meshgridding messed up somehow?
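One likely cause is the meshgrid itself: `np.meshgrid` defaults to `indexing="xy"`, which swaps the first two axes of the output grids, so the flattened coordinates no longer line up with `data.ravel()` unless the first two dimensions happen to be equal. The sketch below uses a synthetic, noise-free image with made-up parameters (the dimensions, `true_params`, and `p0` are all illustrative assumptions). It builds the grid with `indexing="ij"` and checks that the fit recovers the known parameters:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_3d(X, A, x0, y0, z0, sigx, sigy, sigz, offset):
    x, y, z = X
    return offset + A * np.exp(-(x - x0)**2 / (2 * sigx**2)
                               - (y - y0)**2 / (2 * sigy**2)
                               - (z - z0)**2 / (2 * sigz**2))

dims = (7, 8, 9)
axes = [np.arange(n) for n in dims]

# Default indexing="xy" swaps the first two axes of the output grids.
shape_xy = np.meshgrid(*axes)[0].shape
# indexing="ij" keeps the axis order of data[x, y, z].
grids = np.meshgrid(*axes, indexing="ij")
coords = np.vstack([g.ravel() for g in grids])

# Synthetic noise-free image with known parameters (illustrative values).
true_params = (10.0, 3.2, 3.6, 4.1, 1.2, 1.5, 1.1, 0.5)
data = gaussian_3d(coords, *true_params).reshape(dims)

# Fit from a nearby initial guess, with coordinates that match data.ravel().
p0 = (9.0, 3.0, 4.0, 4.0, 1.0, 1.0, 1.0, 0.0)
fitted, _ = curve_fit(gaussian_3d, coords, data.ravel(), p0=p0)
```

The same `indexing="ij"` argument would be needed in the prediction helper, since it reshapes the result back to the image dimensions.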

LinAlgError: not positive definite, even with jitter

I am trying to use Gaussian process regression on a cancer dataset using GPy, but when I fit a combination of 3 or 4 kernels the fit fails with the error LinAlgError: not positive definite, even with jitter. It does produce some output when I use a combination of two kernels. Here is the main code; the dataset I am trying to predict (year on the x-axis, tumor count on the y-axis) is shown in the image below:
k_rbf = GPy.kern.RBF(1, lengthscale=50,name = "rbf")
k_exp = GPy.kern.Exponential(1,lengthscale=6)
k_lin = GPy.kern.Linear(1)
k_per = GPy.kern.StdPeriodic(1, period = 5)
k = k_rbf * k_per + k_lin + k_exp
m = GPy.models.GPRegression(X, Y, k)
m.optimize()
def plot_gp(X, m, C, training_points=None):
    """ Plotting utility to plot a GP fit with 95% confidence interval """
    # Plot 95% confidence interval
    plt.fill_between(X[:,0],
                     m[:,0] - 1.96*np.sqrt(np.diag(C)),
                     m[:,0] + 1.96*np.sqrt(np.diag(C)),
                     alpha=0.5)
    # Plot GP mean and initial training points
    plt.plot(X, m, "-")
    plt.legend(labels=["GP fit"])
    plt.xlabel("x"), plt.ylabel("f")
    # Plot training points if included
    if training_points is not None:
        X_, Y_ = training_points
        plt.plot(X_, Y_, "kx", mew=2)
        plt.legend(labels=["GP fit", "sample points"])
X_ = np.linspace(X.min(), X.max() + 30, 1000)[:, np.newaxis]
mean, Cov = m.predict(X_, full_cov=True)
plt.figure(figsize=(20, 10))
plot_gp(X_, mean, Cov)
plt.gca().set_xlim([1990,2060]), plt.gca().set_ylim([35000, 150000])
plt.plot(X, Y, "b.");
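For context on the error message, "jitter" refers to a small multiple of the identity added to the covariance matrix so that its Cholesky factorization succeeds. The following is a generic NumPy sketch of that idea, not GPy's internal routine, which may differ (the function name `cholesky_with_jitter`, the starting jitter, and the retry count are assumptions made for illustration):

```python
import numpy as np

def cholesky_with_jitter(K, max_tries=6):
    """Return (L, jitter) such that L @ L.T is close to K + jitter * I."""
    try:
        return np.linalg.cholesky(K), 0.0
    except np.linalg.LinAlgError:
        pass
    # Start from a small fraction of the mean diagonal and grow tenfold.
    jitter = 1e-6 * np.mean(np.diag(K))
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * np.eye(K.shape[0])), jitter
        except np.linalg.LinAlgError:
            jitter *= 10.0
    raise np.linalg.LinAlgError("not positive definite, even with jitter")

# A rank-one covariance, such as a linear kernel on 1-D inputs produces.
x = np.array([[1.0], [2.0], [3.0]])
K = x @ x.T
L, jitter = cholesky_with_jitter(K)
```

When every retry fails, the covariance matrix is too ill-conditioned for the amount of jitter tried, which is the situation the error message reports.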

Linear regression minimizing errors only above the linear

I have a dataset that resembles the data created in the MWE below:
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
plt.grid()
That is, the errors around the line I want are mostly negative for x>0 and mostly positive for x<0. What I'm looking for is a way to estimate the coefficient K (which in this case is -2).
Really I think the way to do it would be to minimize the error only of the points that fall above the line for x<0 and below the line for x>0, but I'm not sure how to go about it effectively in Python, since everything I can think of involves iterative processes which are slow in Python.
Basically you want to include something that can account for the mean variable in your data generating model. You can do this by modeling a discontinuity at the point x=0 by including a variable in your model that is 0 where x < 0 and 1 where x > 0.
We can even just include the "mean" variable itself and get the same model (with a different interpretation for the second coefficient). Here is a linear model that recovers the correct value for the slope of this discontinuous line. Note that this assumes the slope is the same on the right side of 0 as the left side.
from sklearn.linear_model import LinearRegression
X = np.array([x, mean]).T
reg = LinearRegression().fit(X, y)
print(reg.coef_)
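The 0/1 indicator variant described above can be sketched in the same way. This example assumes a fixed random seed and more samples than the MWE, so that the estimates are stable (the names `step`, `slope`, and `jump` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)   # fixed seed for reproducibility (assumption)
n = 2000                         # more points than the MWE, to reduce variance
x = np.linspace(-1, 1, n)
K = -2
y = K * x - np.sign(x) + rng.standard_normal(n)

# Indicator column: 0 where x < 0, 1 where x > 0.
step = (x > 0).astype(float)
X = np.column_stack([x, step])

reg = LinearRegression().fit(X, y)
slope, jump = reg.coef_
print(slope, jump)   # slope estimates K; jump estimates the offset change at x = 0
```

Here the second coefficient is the size of the discontinuity at x = 0, while with the "mean" variable it is a multiplier on that variable.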
Here is my attempt where I A) fit all data to a straight line, and then B) separate data depending on two criteria: whether x is greater than or less than zero and whether predicted Y is above or below that straight line, and finally C) fit the separated data. The slope is here -2.417 and will vary from run to run depending on the random data.
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
###############################
# new section for calculating new line
allDataFirstOrderParameters = np.polyfit(x, y, 1)
allDataFirstOrderErrors = y - np.polyval(allDataFirstOrderParameters, x)
newX = []
newY = []
for i in range(len(x)):
    if x[i] < 0 and allDataFirstOrderErrors[i] < 0:
        newX.append(x[i])
        newY.append(y[i])
    if x[i] > 0 and allDataFirstOrderErrors[i] > 0:
        newX.append(x[i])
        newY.append(y[i])
newX = np.array(newX)
newY = np.array(newY)
newFirstOrderParameters = np.polyfit(newX, newY, 1)
print("New Parameters", newFirstOrderParameters)
plotNewX = np.linspace(min(x), max(x))
plotNewY = np.polyval(newFirstOrderParameters, plotNewX)
plt.plot(plotNewX, plotNewY, label="New line")
plt.legend()
plt.show()
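Since the question mentions that explicit Python loops are slow, the selection step can also be written with a boolean mask. This is a sketch of the same selection rule, assuming a fixed random seed (the names `errors`, `mask`, and `params_new` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (assumption)
x = np.linspace(-1, 1, 100)
y = -2 * x - np.sign(x) + rng.standard_normal(x.shape)

# Residuals from the straight-line fit to all data.
params_all = np.polyfit(x, y, 1)
errors = y - np.polyval(params_all, x)

# Same selection as the loop: (x < 0 and error < 0) or (x > 0 and error > 0).
mask = ((x < 0) & (errors < 0)) | ((x > 0) & (errors > 0))
params_new = np.polyfit(x[mask], y[mask], 1)
print("New parameters", params_new)
```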

Plotting horizontal hyperbola/circle using fsolve, numpy, and matplotlib

I was recently trying to plot a nonlinear decision boundary, and the function ended up being a partially horizontal hyperbola, where there were multiple y-values for a given x. Although I got it to work, I know there has to be a more pythonic or numpythonic way of plotting this line.
Background: The problem was a perceptron classifier on a set of inputs that were not linearly separable. In order to find this, the inputs were mapped to a general hyperbola function to increase the dimensionality to 5, and have these separable by a hyperplane. The equation for the decision boundary that will be plotted is
$d(x) = w_0 + w_1 x^2 + w_2 y^2 + w_3 xy + w_4 x + w_5 y$
Through the course of the perceptron's gradient descent, the values for w0-w5 are found, and the boundary is the x,y value when d(x)=0.
Current implementation: I got it to work, but I think it is hacky. I first have to create an array of the given size so that I can append these values, and I have to delete the initialized value the first time I append a found value. I then sweep through the space on my graph and find a y-value, first by guessing high, then by guessing low, in order to find both possible y-values. I put these found values at the front and back of D, in order to plot this using matplotlib.
D = np.array([[0], [0]])
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
a_iter, b_iter = 0, 0 # used as initial guess for numeric solver
for xx in range(x_min, x_max):
    # used to print top and bottom sides of hyperbola
    yya = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, max(a_iter, 7))
    yyb = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, b_iter)
    a_iter = yya
    b_iter = yyb
    # add these points to a single matrix for printing
    dda = np.array([[xx],[yya]])
    ddb = np.array([[xx],[yyb]])
    D = np.concatenate((dda, D), axis=1)
    if xx == x_min: # delete initial [0; 0]
        D = dda
    D = np.concatenate((D, ddb), axis=1)
I know there has to be a better way to do this. Any insight is appreciated.
Edit: Apologies, I realize that without an image this is really difficult to understand. The main issue of finding multiple roots and populating a numpy array are a bit generic. I don't have enough rep to post images, but the link is below
nonlinear classifier
If you want to plot the curve of an implicit equation, you can use pyplot.contour(). Here is an example:
import numpy as np
import matplotlib.pyplot as pl
np.random.seed(1)
w = np.random.randn(6)
def f(x, y, w):
    return w[0] + w[1]*x**2 + w[2]*y**2 + w[3]*x*y + w[4]*x + w[5]*y
X, Y = np.mgrid[-2:2:100j, -2:2:100j]
# Draw only the level set f(x, y) = 0, i.e. the decision boundary
pl.contour(X, Y, f(X, Y, w), levels=[0])
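Because the boundary equation is quadratic in y for any fixed x, both branches can also be computed in closed form without a numerical solver or repeated concatenation. This is a minimal sketch, assuming the weight on y^2 is non-zero (the function name `boundary_branches` and the example weights are illustrative, not taken from the question):

```python
import numpy as np

def boundary_branches(w, xs):
    """Solve w0 + w1*x^2 + w2*y^2 + w3*x*y + w4*x + w5*y = 0 for y at each x.

    Returns (y_upper, y_lower); entries are NaN where no real root exists.
    Assumes w[2] != 0 so that the equation is quadratic in y.
    """
    xs = np.asarray(xs, dtype=float)
    a = w[2]
    b = w[3] * xs + w[5]
    c = w[0] + w[1] * xs**2 + w[4] * xs
    disc = b**2 - 4 * a * c
    # Negative discriminants become NaN instead of raising a warning.
    root = np.sqrt(np.where(disc >= 0, disc, np.nan))
    y1 = (-b + root) / (2 * a)
    y2 = (-b - root) / (2 * a)
    return np.maximum(y1, y2), np.minimum(y1, y2)

# Illustrative weights (not from the question): y^2 - x^2 - 1 = 0.
w = np.array([-1.0, -1.0, 1.0, 0.0, 0.0, 0.0])
xs = np.linspace(-3, 3, 61)
y_upper, y_lower = boundary_branches(w, xs)
```

Each branch can then be passed to `plot` separately; matplotlib leaves gaps where the values are NaN.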
There are parameterized options too, for example a trigonometric one, with branches centered at 0 and pi:
t = np.linspace(-np.pi/3, np.pi/3, 200) # 0 centered branch
y = 1/np.cos(t)
x = 1*np.tan(t)
plt.plot(x, y) # (default blue)
t = np.linspace(np.pi-np.pi/3, np.pi+np.pi/3, 200) # pi centered branch
y = 1/np.cos(t)
x = 1*np.tan(t)
plt.plot(x, y) # (default orange)
sympy ought to be up to finding the full denormalized, rotated, offset parameterized hyperbola coefficients from the bivariate polynomial's weights
(or continue the hackage with a fit)
