why does my own implementation of logistic regression differ from sklearn?


I am trying to implement logistic regression for a binary classification problem from scratch in Python. My results do not match those of the sklearn implementation, as you can see in this example. Note that the lines look "similar", but they are clearly not the same.
I took care of what is mentioned in this answer: both sklearn and my implementation (i) fit the intercept term and (ii) do not apply regularization (penalty='none'). Also, while sklearn runs 100 iterations to train the algorithm (by default), I am running 10000 with a rather small learning rate of 0.01. I tried different combinations of values, but the problem does not seem to depend on this.
At the same time, I do notice that, even before comparing the results with sklearn, the ones I obtain with my implementation seem to be wrong: the decision regions are clearly off in some cases. You can see an example in this image.
The last point seems to indicate that the problem is all my own fault. Here is my code (it actually generates new datasets at each run and plots the results):
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
def create_training_set():
    X0, y = make_blobs(n_samples=[100, 100],
                       centers=None,
                       n_features=2,
                       cluster_std=1)
    y = y.reshape(-1, 1) # make y a column vector
    return np.hstack([np.ones((X0.shape[0], 1)), X0]), X0, y
def create_test_set(X0):
    xx, yy = np.meshgrid(np.arange(X0[:, 0].min() - 1, X0[:, 0].max() + 1, 0.1),
                         np.arange(X0[:, 1].min() - 1, X0[:, 1].max() + 1, 0.1))
    X_test = np.c_[xx.ravel(), yy.ravel()]
    X_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
    return xx, yy, X_test
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def apply_gradient_descent(theta, X, y, max_iter=1000, alpha=0.1):
    m = X.shape[0]
    cost_iter = []
    for _ in range(max_iter):
        p_hat = sigmoid(np.dot(X, theta))
        cost_J = -1/float(m) * (np.dot(y.T, np.log(p_hat)) + np.dot((1 - y).T, np.log(1 - p_hat)))
        grad_J = 1/float(m) * np.dot(X.T, p_hat - y)
        theta -= alpha * grad_J
        cost_iter.append(float(cost_J))
    return theta, cost_iter
fig, ax = plt.subplots(10, 2, figsize = (10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
all_cost_history = []
for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)
    theta, cost_evolution = apply_gradient_descent(np.zeros((X_train.shape[1], 1)), X_train, y, max_iter, alpha)
    all_cost_history.append(cost_evolution)
    y_pred = np.where(sigmoid(np.dot(X_test, theta)) > 0.5, 1, 0)
    y_pred = y_pred.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, y_pred, cmap = cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y.ravel(), cmap=cmap_bold, alpha = 1, edgecolor="black")
    y = y.reshape(X_train.shape[0], )
    clf = LogisticRegression().fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap = cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha = 1, edgecolor="black")
plt.show()

There is actually a difference between your implementation and sklearn's: you are not using the same optimization algorithm (called the solver in sklearn), and I think the difference you observe comes from there. You are using gradient descent, while sklearn's implementation uses by default the "liblinear" solver, which is different (note that since scikit-learn 0.22 the default solver is "lbfgs").
Indeed, different optimization algorithms can yield different results depending on, for example:
The convergence speed: as we are limiting the number of iterations, an algorithm that converges more slowly would stop at a different point and thus yield different decision regions.
Whether an algorithm is deterministic or not: non-deterministic algorithms (such as stochastic gradient descent) can converge to a different local minimum given the same dataset. With non-deterministic algorithms you could observe different results with the exact same dataset and algorithm.
Hyperparameters: changing a hyperparameter (for example the learning rate of a gradient descent algorithm) changes the behavior of the optimization algorithm too, thus leading to different results.
In your case, there are good reasons for not always getting the same results: the gradient descent algorithm you use can stop short of the minimum (because of an insufficient number of iterations, a non-optimal learning rate...), at a point that differs from the one reached by the liblinear solver.
You can observe the same kind of discrepancies if you compare sklearn's implementation with different solvers (reusing your code):
fig, ax = plt.subplots(10, 2, figsize=(10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
solver_algo_1 = 'liblinear'
solver_algo_2 = 'sag'
for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)
    y = y.reshape(X_train.shape[0], )
    clf = LogisticRegression(solver=solver_algo_1, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")
    clf = LogisticRegression(solver=solver_algo_2, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")
plt.show()
As an example, with "liblinear" (left) and "newton-cg" (right), you can get this:
Though the logistic regression implementation is the same, the difference in the optimization algorithms leads to different results. So in a few words, the difference between your implementation and scikit-learn's is the optimization algorithm.
Now, if the quality of the decision boundary you get is not satisfying, you can try tuning the hyperparameters of your gradient descent algorithm or try changing the optimization algorithm!
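To make the point about convergence concrete, here is a minimal NumPy-only sketch. It is not what sklearn runs: Newton's method merely stands in for "a different solver", and the data generator make_overlapping_blobs, the seed and the step sizes are assumptions chosen for illustration. It fits the same unregularized cross-entropy loss with gradient descent and with Newton's method, once with enough iterations and once stopped early.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_overlapping_blobs(n_per_class=100, seed=0):
    """Two overlapping 2-D Gaussian blobs (hypothetical stand-in for make_blobs)."""
    rng = np.random.default_rng(seed)
    X0 = np.vstack([rng.normal(0.0, 1.0, size=(n_per_class, 2)),
                    rng.normal(2.0, 1.0, size=(n_per_class, 2))])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    X = np.hstack([np.ones((X0.shape[0], 1)), X0])  # prepend the intercept column
    return X, y

def fit_gradient_descent(X, y, n_iter=20000, alpha=0.5):
    """Plain batch gradient descent on the mean cross-entropy loss."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta

def fit_newton(X, y, n_iter=25):
    """Newton's method on the same loss (used here only as a second solver)."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / m
        hessian = (X.T * (p * (1.0 - p))) @ X / m
        theta -= np.linalg.solve(hessian, grad)
    return theta

X, y = make_overlapping_blobs()
theta_newton = fit_newton(X, y)
theta_gd = fit_gradient_descent(X, y)                                # run long enough
theta_gd_short = fit_gradient_descent(X, y, n_iter=100, alpha=0.01)  # stopped early

print("Newton:            ", theta_newton)
print("GD (converged):    ", theta_gd)
print("GD (stopped early):", theta_gd_short)
```

On overlapping classes like these, the two solvers should end up at essentially the same coefficients once both have converged, while the early-stopped run gives a visibly different boundary. When comparing against sklearn, keep in mind that LogisticRegression() with default arguments also applies an L2 penalty (C=1.0), which this sketch does not model.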

Related

Non Linear Decision boundary SVM

I need your help finding a nonlinear decision boundary. I have 2 features with numerical data, and I made a simple linear decision boundary (see picture below).
Now the thing is that I would like my red line to look like the black line:
the 'equation' I used for plotting the red line is:
# x and y not related to the dataset, those variables are only for plotting the red line
# mdl is my model
x = np.linspace(0.6, -0.6)
y = -(mdl.coef_[0][0] / mdl.coef_[0][1]) * x - mdl.intercept_ / mdl.coef_[0][1]
The model is an SVM. I performed a GridSearchCV and got the best estimator. I used a linear kernel to be able to get the model's coefs and intercept.
I can add a third dimension to the equation if needed. I have plenty of columns available in my df. I only kept the 2 most important ones (I got those by looking at the model's feature importance).
The best thing would be if I could have an equation for plotting the decision boundary and one that I could include in my dataset, and that would be used as a 'sanction', like if the result of the equation sanction is above 0, sample's target is 1, else it's 0.
Like this (something I made with another model but for 3D plot):
# Equation sanction that goes into my dataset
df['sanction'] = df.widthS63R04 * model.coef_[0][0] + df.chordS67R04 * model.coef_[0][1] + df.chordS71R04 * model.coef_[0][2]
#Equation for 3D Hyperplane
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
z = lambda x,y: (-mdl.intercept_[0]-mdl.coef_[0][0]*x -mdl.coef_[0][1]*y) / mdl.coef_[0][2]
# lambda usage for 3d surface hyperplane
ax.plot_surface(x, y, z(x, y))

Support-Vector-Machines-SVM-/Kernel Trick SVM.ipynb
zero_one_colourmap = ListedColormap(('blue', 'red'))
def plot_decision_boundary(X, y, clf):
    X_set, y_set = X, y
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                                   stop = X_set[:, 0].max() + 1,
                                   step = 0.01),
                         np.arange(start = X_set[:, 1].min() - 1,
                                   stop = X_set[:, 1].max() + 1,
                                   step = 0.01))
    plt.contourf(X1, X2, clf.predict(np.array([X1.ravel(),
                                               X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.75,
                 cmap = zero_one_colourmap)
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c = (zero_one_colourmap)(i), label = j)
    plt.title('SVM Decision Boundary')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.legend()
    return plt.show()
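The plotting helper above works with any fitted classifier, so one way to get a curved boundary is to swap the linear kernel for a nonlinear one. The sketch below is only an illustration under assumptions: the make_circles toy data stands in for the two real features, and the RBF kernel and its settings are not taken from the question. With a nonlinear kernel there is no coef_ to build a closed-form line from, but decision_function returns a signed score that can play the role of the "sanction" column, and the boundary is where that score equals zero.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data with a circular class boundary (assumption: stands in for the two features).
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# RBF kernel instead of a linear one, so the boundary can bend.
clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)

# Signed score per sample: above 0 means class 1, otherwise class 0.
sanction = clf.decision_function(X)
predicted = (sanction > 0).astype(int)

# Evaluate the score on a grid; the decision boundary is its zero-level contour.
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# With matplotlib, plt.contour(xx, yy, zz, levels=[0]) would draw the curved boundary.

print("training accuracy:", (predicted == y).mean())
```

The same grid evaluation can be passed to plot_decision_boundary via clf.predict if filled regions are preferred over a single contour line.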

LinAlgError: not positive definite, even with jitter

I am trying to use Gaussian process regression on a cancer dataset using GPy, but when I fit a combination of 3 or 4 kernels the fit fails with the error LinAlgError: not positive definite, even with jitter. It does produce some output when I use a combination of two kernels. Here is the main code; the dataset I am trying to predict (year on the x-axis, tumor count on the y-axis) is shown in the image below:
k_rbf = GPy.kern.RBF(1, lengthscale=50,name = "rbf")
k_exp = GPy.kern.Exponential(1,lengthscale=6)
k_lin = GPy.kern.Linear(1)
k_per = GPy.kern.StdPeriodic(1, period = 5)
k = k_rbf * k_per + k_lin + k_exp
m = GPy.models.GPRegression(X, Y, k)
m.optimize()
def plot_gp(X, m, C, training_points=None):
    """ Plotting utility to plot a GP fit with 95% confidence interval """
    # Plot 95% confidence interval
    plt.fill_between(X[:,0],
                     m[:,0] - 1.96*np.sqrt(np.diag(C)),
                     m[:,0] + 1.96*np.sqrt(np.diag(C)),
                     alpha=0.5)
    # Plot GP mean and initial training points
    plt.plot(X, m, "-")
    plt.legend(labels=["GP fit"])
    plt.xlabel("x"), plt.ylabel("f")
    # Plot training points if included
    if training_points is not None:
        X_, Y_ = training_points
        plt.plot(X_, Y_, "kx", mew=2)
        plt.legend(labels=["GP fit", "sample points"])
X_ = np.linspace(X.min(), X.max() + 30, 1000)[:, np.newaxis]
mean, Cov = m.predict(X_, full_cov=True)
plt.figure(figsize=(20, 10))
plot_gp(X_, mean, Cov)
plt.gca().set_xlim([1990,2060]), plt.gca().set_ylim([35000, 150000])
plt.plot(X, Y, "b.");

Fitting two peaks with gauss in python

curve_fit does not fit properly. I'm trying to fit experimental data with curve_fit. The data is imported from a .txt file into an array:
d = np.loadtxt("data.txt")
data_x = np.array(d[:, 0])
data_y = np.array(d[:, 2])
data_y_err = np.array(d[:, 3])
Since I know there must be two peaks, my model is a sum of two Gaussian curves:
def model_dGauss(x, xc, A, y0, w, dx):
    P = A/(w*np.sqrt(2*np.pi))
    mu1 = (x - (xc - dx/3))/(2*w**2)
    mu2 = (x - (xc + 2*dx/3))/(2*w**2)
    return y0 + P * ( np.exp(-mu1**2) + 0.5 * np.exp(-mu2**2))
The fit is very sensitive to my initial guess values. What is the point of fitting data if only nearly perfect starting parameters give a result? Or am I doing something completely wrong?
t = np.linspace(8.4, 10, 300)
guess_dG = [32, 1, 10, 0.1, 0.2]
popt, pcov = curve_fit(model_dGauss, data_x, data_y, p0=guess_dG, sigma=data_y_err, absolute_sigma=True)
A, xc, y0, w, dx = popt
Plotting the data
plt.scatter(data_x, data_y)
plt.plot(t, model_dGauss(t1,*popt))
plt.errorbar(data_x, data_y, yerr=data_y_err)
yields:
Plot result
The result is just a straight line at the bottom of my graph while the evaluated parameters are not that bad. How can that be?

A complete example of code is always appreciated (and, ahem, usually expected here on SO). To remove much of the confusion about using curve_fit here, allow me to suggest that you will have an easier time using lmfit (https://lmfit.github.io/lmfit-py) and especially its builtin model functions and its use of named parameters. With lmfit, your code for two Gaussians plus a constant offset might look like this:
from lmfit.models import GaussianModel, ConstantModel
# start with 1 Gaussian + Constant offset:
model = GaussianModel(prefix='p1_') + ConstantModel()
# this model will have parameters named:
# p1_amplitude, p1_center, p1_sigma, and c.
# here we give initial values to these parameters
params = model.make_params(p1_amplitude=10, p1_center=32, p1_sigma=0.5, c=10)
# optionally place bounds on parameters (probably not needed here):
params['p1_amplitude'].min = 0.
## params['p1_center'].vary = False # fix a parameter from varying in fit
# now do the fit (including weighting residual by 1/y_err):
result = model.fit(data_y, params, x=data_x, weights=1.0/data_y_err)
# print out param values, uncertainties, and fit statistics, or get best-fit
# parameters from `result.params`
print(result.fit_report())
# plot results
plt.errorbar(data_x, data_y, yerr=data_y_err, label='data')
plt.plot(data_x, result.best_fit, label='best fit')
plt.legend()
plt.show()
To add a second Gaussian, you could just do
model = GaussianModel(prefix='p1_') + GaussianModel(prefix='p2_') + ConstantModel()
# and then:
params = model.make_params(p1_amplitude=10, p1_center=32, p1_sigma=0.5, c=10,
                           p2_amplitude=2, p2_center=31.75, p2_sigma=0.5)
and so on.
Your model has the two Gaussians sharing, or at least having "linked", values: the sigma values should be the same for the two peaks and the amplitude of the 2nd is half that of the 1st. As defined so far, the 2-Gaussian model has all the parameters being independent. But lmfit has a mechanism for setting constraints on any parameter by giving an algebraic expression in terms of other parameters. So, for example, you could say
params['p2_sigma'].expr = 'p1_sigma'
params['p2_amplitude'].expr = 'p1_amplitude / 2.0'
Now, p2_amplitude and p2_sigma will not be independently varied in the fit but will be constrained to have those values.
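For comparison, the same linkage (shared width, second peak at half the amplitude) can also be written directly into a plain NumPy model function, independent of lmfit. This is only a sketch: the function name two_gaussians and the parameter values are illustrative assumptions, not fitted values. Note that it uses the standard Gaussian exponent -(x - c)**2 / (2 * w**2), which differs from the exponent in the question's model_dGauss, where (x - c) / (2 * w**2) is squared as a whole.

```python
import numpy as np

def two_gaussians(x, xc, A, y0, w, dx):
    """Sum of two Gaussians with a shared width w and a fixed 2:1 area ratio.

    Peak 1 is centred at xc - dx/3 and peak 2 at xc + 2*dx/3, as in the question.
    """
    norm = A / (w * np.sqrt(2.0 * np.pi))
    g1 = np.exp(-(x - (xc - dx / 3.0)) ** 2 / (2.0 * w ** 2))
    g2 = np.exp(-(x - (xc + 2.0 * dx / 3.0)) ** 2 / (2.0 * w ** 2))
    return y0 + norm * (g1 + 0.5 * g2)

# Illustrative (made-up) parameter values on the x-range used in the question.
x = np.linspace(8.4, 10.0, 3001)
y = two_gaussians(x, xc=9.2, A=1.0, y0=10.0, w=0.05, dx=0.3)

# With this normalisation the area above the baseline is about 1.5 * A.
area = np.sum(y - 10.0) * (x[1] - x[0])
print("area above baseline:", area)
print("position of the tallest peak:", x[np.argmax(y)])
```

Because the parameters keep their usual meaning in this form (w is the standard deviation, A the area of the first peak), initial guesses can be read off a plot of the data more easily.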

Linear regression minimizing errors only above the linear

I have a dataset that resembles the data created in the MWE below:
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
plt.grid()
That is, the errors around the line I want are mostly negative for x>0 and mostly positive for x<0. What I'm looking for is a way to estimate the coefficient K (which in this case is -2).
Really I think the way to do it would be to minimize the error only of the points that fall above the line for x<0 and below the line for x>0, but I'm not sure how to go about it effectively in Python, since everything I can think of involves iterative processes which are slow in Python.

Basically you want to include something that can account for the mean variable in your data generating model. You can do this by modeling a discontinuity at the point x=0 by including a variable in your model that is 0 where x < 0 and 1 where x > 0.
We can even just include the "mean" variable itself and get the same model (with a different interpretation for the second coefficient). Here is a linear model that recovers the correct value for the slope of this discontinuous line. Note that this assumes the slope is the same on the right side of 0 as the left side.
from sklearn.linear_model import LinearRegression
X = np.array([x, mean]).T
reg = LinearRegression().fit(X, y)
print(reg.coef_)
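The effect of adding the step variable can be checked without sklearn. The following is a hedged NumPy sketch of the same idea using np.linalg.lstsq; the seed and the larger sample size (2000 instead of 100) are assumptions made so that the estimates are stable enough to compare. It contrasts a fit on x alone with a fit that also includes the mean column.

```python
import numpy as np

rng = np.random.default_rng(0)
sz = 2000  # assumption: larger than the question's 100 to reduce run-to-run scatter
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
K = -2
y = K * x + mean + rng.standard_normal(sz)

# Fit on x alone: the step leaks into the slope and biases it.
naive_slope = np.polyfit(x, y, 1)[0]

# Fit on x, the step variable and an intercept column.
design = np.column_stack([x, mean, np.ones(sz)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
slope, step, intercept = coef

print("slope without step variable:", naive_slope)
print("slope with step variable:   ", slope)
print("coefficient on the step:    ", step)
```

With the step included, the slope should land near the true K, while the fit on x alone gives a noticeably steeper line.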

Here is my attempt where I A) fit all data to a straight line, and then B) separate data depending on two criteria: whether x is greater than or less than zero and whether predicted Y is above or below that straight line, and finally C) fit the separated data. The slope is here -2.417 and will vary from run to run depending on the random data.
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
###############################
# new section for calculating new line
allDataFirstOrderParameters = np.polyfit(x, y, 1)
allDataFirstOrderErrors = y - np.polyval(allDataFirstOrderParameters, x)
newX = []
newY = []
for i in range(len(x)):
    if x[i] < 0 and allDataFirstOrderErrors[i] < 0:
        newX.append(x[i])
        newY.append(y[i])
    if x[i] > 0 and allDataFirstOrderErrors[i] > 0:
        newX.append(x[i])
        newY.append(y[i])
newX = np.array(newX)
newY = np.array(newY)
newFirstOrderParameters = np.polyfit(newX, newY, 1)
print("New Parameters", newFirstOrderParameters)
plotNewX = np.linspace(min(x), max(x))
plotNewY = np.polyval(newFirstOrderParameters, plotNewX)
plt.plot(plotNewX, plotNewY, label="New line")
plt.legend()
plt.show()
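Since the question was concerned about slow Python loops, the selection step above can also be expressed with boolean masks. This is a sketch of the same rule, not a different method; the seed is an assumption added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for reproducibility
x = np.linspace(-1, 1, 100)
y = -2 * x - np.sign(x) + rng.standard_normal(x.shape)

# Step A: fit all data to a straight line and compute the residuals.
params = np.polyfit(x, y, 1)
errors = y - np.polyval(params, x)

# Step B: same selection rule as the loop above, written as one boolean mask.
keep = ((x < 0) & (errors < 0)) | ((x > 0) & (errors > 0))

# Step C: refit using only the selected points.
new_params = np.polyfit(x[keep], y[keep], 1)
print("New Parameters", new_params)
```

The mask keeps the points in their original order, so x[keep] and y[keep] match the newX and newY arrays built by the loop.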

Gaussian fit for python with confidence interval

I'd like to make a Gaussian fit for some data that has a roughly Gaussian shape. I'd like the data peak (A), center position (mu), and standard deviation (sigma), along with 95% confidence intervals for these values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.stats import norm
# gaussian function
def gaussian_func(x, A, mu, sigma):
    return A * np.exp( - (x - mu)**2 / (2 * sigma**2))
# generate toy data
x = np.arange(50)
y = [ 97.04421053, 96.53052632, 96.85684211, 96.33894737, 96.85052632,
96.30526316, 96.87789474, 96.75157895, 97.05052632, 96.73473684,
96.46736842, 96.23368421, 96.22526316, 96.11789474, 96.41263158,
96.32631579, 96.33684211, 96.44421053, 96.48421053, 96.49894737,
97.30105263, 98.58315789, 100.07368421, 101.43578947, 101.92210526,
102.26736842, 101.80421053, 101.91157895, 102.07368421, 102.02105263,
101.35578947, 99.83578947, 98.28, 96.98315789, 96.61473684,
96.82947368, 97.09263158, 96.82105263, 96.24210526, 95.95578947,
95.84210526, 95.67157895, 95.83157895, 95.37894737, 95.25473684,
95.32842105, 95.45684211, 95.31578947, 95.42526316, 95.30526316]
plt.scatter(x,y)
# initial_guess_of_parameters
# obtain these values with a solver or similar.
parameter_initial = np.array([652, 2.9, 1.3])
# estimate optimal parameter & parameter covariance
popt, pcov = curve_fit(gaussian_func, x, y, p0=parameter_initial)
# plot result
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_func(xd, popt[0], popt[1], popt[2])
plt.plot(xd, estimated_curve, label="Estimated curve", color="r")
plt.legend()
plt.savefig("gaussian_fitting.png")
plt.show()
# estimate standard Error
StdE = np.sqrt(np.diag(pcov))
# estimate 95% confidence interval
alpha=0.025
lwCI = popt + norm.ppf(q=alpha)*StdE
upCI = popt + norm.ppf(q=1-alpha)*StdE
# print result
mat = np.vstack((popt,StdE, lwCI, upCI)).T
df=pd.DataFrame(mat,index=("A", "mu", "sigma"),
columns=("Estimate", "Std. Error", "lwCI", "upCI"))
print(df)
Data Plot with Fitted Curve
The data peak and center position seem correct, but the standard deviation is off. Any input is greatly appreciated.

Your scatter indeed looks similar to a Gaussian distribution, but it is not centered around zero. Given the specifics of the Gaussian function, it will therefore be hard to nicely fit a Gaussian distribution to the data the way you gave it to us. I would therefore propose starting by demeaning the x series:
x = np.arange(0, 50) - 24.5
Next I would add one additional parameter to your gaussian function, the offset. Since the regular Gaussian function will always have its tails close to zero it is impossible to otherwise nicely fit your scatterplot:
def gaussian_function(x, A, mu, sigma, offset):
    return A * np.exp(-np.power((x - mu)/sigma, 2.)/2.) + offset
Next you should define an error_loss_function to minimise:
def error_loss_function(params):
    gaussian = gaussian_function(x, params[0], params[1], params[2], params[3])
    errors = gaussian - y
    return sum(np.power(errors, 2)) # You can also pick a different error loss function!
All that remains is fitting our curve now:
import scipy.optimize # needed for scipy.optimize.minimize
fit = scipy.optimize.minimize(fun=error_loss_function, x0=[2, 0, 0.2, 97])
params = fit.x # A: 6.57592661, mu: 1.95248855, sigma: 3.93230503, offset: 96.12570778
xd = np.arange(x.min(), x.max(), 0.01)
estimated_curve = gaussian_function(xd, params[0], params[1], params[2], params[3])
plt.plot(xd, estimated_curve, label="Estimated curve", color="b")
plt.legend()
plt.show(block=False)
Hopefully this helps. Looks like a fun project, let me know if my answer is not clear.
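The question also asked for 95% confidence intervals, which the minimize-based fit does not return directly. One possible way to get them, sketched here under assumptions, is to keep the offset idea but fit with curve_fit and derive the intervals from its covariance matrix. The initial guess p0 and the use of a Student-t quantile (instead of the normal quantile in the question) are choices made for this illustration, and x is left un-shifted, so mu is reported on the original axis.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t as student_t

def gaussian_with_offset(x, A, mu, sigma, offset):
    return A * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) + offset

# Data copied from the question.
x = np.arange(50)
y = np.array([
    97.04421053, 96.53052632, 96.85684211, 96.33894737, 96.85052632,
    96.30526316, 96.87789474, 96.75157895, 97.05052632, 96.73473684,
    96.46736842, 96.23368421, 96.22526316, 96.11789474, 96.41263158,
    96.32631579, 96.33684211, 96.44421053, 96.48421053, 96.49894737,
    97.30105263, 98.58315789, 100.07368421, 101.43578947, 101.92210526,
    102.26736842, 101.80421053, 101.91157895, 102.07368421, 102.02105263,
    101.35578947, 99.83578947, 98.28, 96.98315789, 96.61473684,
    96.82947368, 97.09263158, 96.82105263, 96.24210526, 95.95578947,
    95.84210526, 95.67157895, 95.83157895, 95.37894737, 95.25473684,
    95.32842105, 95.45684211, 95.31578947, 95.42526316, 95.30526316])

# Initial guess read off the scatter plot (assumption): height, centre, width, baseline.
p0 = [6.0, 25.0, 4.0, 96.0]
popt, pcov = curve_fit(gaussian_with_offset, x, y, p0=p0)

# Standard errors from the covariance matrix, then two-sided 95% intervals.
std_err = np.sqrt(np.diag(pcov))
dof = len(x) - len(popt)
t_crit = student_t.ppf(0.975, dof)
lower = popt - t_crit * std_err
upper = popt + t_crit * std_err

for name, est, lo, hi in zip(("A", "mu", "sigma", "offset"), popt, lower, upper):
    print(f"{name}: {est:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")
```

These intervals rely on the usual linear approximation around the best fit, so they are only as trustworthy as the model is a good description of the data.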
