Python Logistic Regression Produces Wrong Coefficients

Python Logistic Regression Produces Wrong Coefficients - python

I am trying to use LogisticRegression model fro scikit-learn to solve Excercise 2 from Machine Learning course by Andrew Ng on Coursera. But the result, which I get is wrong:
1) Outcome coefficients doen't match with answers:
What I get with a model
What I should get according to answers
[-25.16, 0.21, 0.20]
you can see on the plot (wrong graph), that decision boundary seems to be a little bit below the decision boundary by intuition.
2) Graph outcome seems wrong
As you can see, decision boundary is below
LogisticRegression
Answer
MY CODE:
% matplotlib notebook
# IMPORT DATA
ex2_folder = 'machine-learning-ex2/ex2'
input_1 = pd.read_csv(folder + ex2_folder +'/ex2data1.txt', header = None)
X = input_1[[0,1]]
y = input_1[2]
# IMPORT AND FIT MODEL
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept = True)
model.fit(X,y)
print('Intercept (Theta 0: {}). Coefficients: {}'.format(model.intercept_, model.coef_))
# CALCULATE GRID
n = 5
xx1, xx2 = np.mgrid[25:101:n, 25:101:n]
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_proba(grid)[:, 1]
probs = probs.reshape(xx1.shape)
# PLOTTING
f = plt.figure()
ax = plt.gca()
for outcome in [0,1]:
xo = 'yo' if outcome == 0 else 'k+'
selection = y == outcome
plt.plot(X.loc[selection, 0],X.loc[selection,1],xo, mec = 'k')
plt.xlim([25,100])
plt.ylim([25,100])
plt.xlabel('Exam 1 Score')
plt.ylabel('Exam 2 Score')
plt.title('Exam 1 & 2 and admission outcome')
contour = ax.contourf(xx1,xx2, probs, 100, cmap="RdBu",
vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='b', alpha = 0.3);
plt.plot(xx1[probs > 0.5], xx2[probs > 0.5],'.b', alpha = 0.3)
LINKS
DataFile in txt
PDF Tasks and Solutions in Octave

To get identical results you need to create identical testing conditions.
One obvious difference at a glance is the iteration count. Sklearn LogisticRegression classifier default iteration count is 100, while Andrew NG's sample code runs for 400 iterations. That will certainly give you a different result from Nguyen's course.
I am not sure anymore which cost function Nguyen is using for the exercise, but I am pretty sure it's cross entropy and not the L2 that is default function for LogisticRecression classifier in scikit learn.
And the last note, before you implement higher-level solutions (scikitlearn/tensorflow/keras), you should first try to implement them in pure python to understand how they work. It will be easier (and more fun) to try and make higher-level packages to work for you.

Related

Should features that correlate be deleted from ML models?

I've seen that it's common practice to delete input features that demonstrate co-linearity (and leave only one of them).
However, I've just completed a course on how a linear regression model will give different weights to different features, and I've thought that maybe the model will do better than us giving a low weight to less necessary features instead of completely deleting them.
To try to solve this doubt myself, I've created a small dataset resembling a x_squared function and applied two linear regression models using Python:
A model that keeps only the x_squared feature
A model that keeps both the x and x_squared features
The results suggest that we shouldn't delete features, and let the model decide the best weights instead. However, I would like to ask the community if the rationale of my exercise is right, and whether you've found this doubt in other places.
Here's my code to generate the dataset:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate the data
all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
all_X_2 = np.square(all_X)
# Store the data into a dictionary
data_dic = {"x": all_X, "x_2": all_X_2, "y": all_Y}
# Generate a dataframe
df = pd.DataFrame(data=data_dic)
# Display the dataframe
display(df)
which produces this:
and this is the code to generate the ML models:
# Create the lists to iterate over
ids = [1, 2]
features = [["x_2"], ["x", "x_2"]]
titles = ["$x^{2}$", "$x$ and $x^{2}$"]
colors = ["blue", "green"]
# Initiate figure
fig = plt.figure(figsize=(15,5))
# Iterate over the necessary lists to plot results
for i, model, title, color in zip(ids, features, titles, colors):
# Initiate model, fit and make predictions
lr = LinearRegression()
lr.fit(df[model], df["y"])
predicted = lr.predict(df[model])
# Calculate mean squared error of the model
mse = mean_squared_error(all_Y, predicted)
# Create a subplot for each model
plt.subplot(1, 2, i)
plt.plot(df["x"], predicted, c=color, label="f(" + title + ")")
plt.scatter(df["x"], df["y"], c="red", label="y")
plt.title("Linear regression using " + title + " --- MSE: " + str(round(mse, 3)))
plt.legend()
# Display results
plt.show()
which generate this:
What do you think about this issue? This difference in the Mean Squared Error can be of high importance on certain contexts.

Because x and x^2 are not linear anymore, that is why deleting one of them is not helping the model. The general notion for regression is to delete those features which are highly co-linear (which is also highly correlated)

So x2 and y are highly correlated and you are trying to predict y with x2? A high correlation between predictor variable and response variable is usually a good thing - and since x and y are practically uncorrelated you are likely to "dilute" your model and with that get worse model performance.
(Multi-)Colinearity between the predicor variables themselves would be more problematic.

Python: scikit-learn isomap results seem random, but no possibility to set random_state

I am using Isomap from scikit-learn manifold learning. I reduce to two dimension, and observe that with every run of the algorthm on the same data set without any changes the resulting vectors change. I assume there are some random numbers used in the algorithm, but there is no way to set a seed. Random_state is not a variable to pass in Isomap. Am I missing something?

The random you've seen is about the sign of your result. The sign is not (in my opinion) 100% random. Signs within each component are consistent so that the relative relation is consistent in your result. Signs between components are random. In other words, which component got multiplied by -1 or 1 are random. This behavior comes from the KernelPCA function used by Isomap when the arpack kernel is used.
To give you a solution first, you can use eigen_solver='dense' when using Isomap. That may slow down your algorithm but should remove this randomness. I know this explanation above might be confusing. Let me give more details and show this by plot.
First, what is a visualized consequence of the "sign randomness"? Using the following code (modified from this official example) with eigen_solver = 'arpack', you can see two fit_transform using the same Isomap class may (or may not) give you different results. However, as you can see in the plot, the relative location maintains. It's just the whole plot getting flipped. If you use eigen_solver='dense' and run the code multiple times, you won't see this randomness:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
discriminant_analysis, random_projection)
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
def plot_embedding(X, ax, title=None):
x_min, x_max = np.min(X, 0), np.max(X, 0)
X = (X - x_min) / (x_max - x_min)
for i in range(X.shape[0]):
ax.text(X[i, 0], X[i, 1], str(digits.target[i]),
color=plt.cm.Set1(y[i] / 10.),
fontdict={'weight': 'bold', 'size': 9})
eigen_solver = 'arpack'
#eigen_solver = 'dense'
iso = manifold.Isomap(n_neighbors, n_components=2, eigen_solver=eigen_solver)
X_iso1 = iso.fit_transform(X)
X_iso2 = iso.fit_transform(X)
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121)
plot_embedding(X_iso1, ax1)
ax2 = fig.add_subplot(122)
plot_embedding(X_iso2, ax2)
plt.show()
Secondly, is there a way to set a seed to "stabilize" the random state? No, there is currently no way to set a seed for KernelPCA from Isomap. With KernelPCA, however, there is a kwarg random_state which is "A pseudo random number generator used for the initialization of the residuals when eigen_solver == ‘arpack’". Play with the following code (modified from this official test code) and you can see this randomness is gone (blue dots cover red dots) even with eigen_solver = 'arpack':
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import KernelPCA
X_fit = np.random.rand(100, 4)
X = np.dot(X_fit, X_fit.T)
eigen_solver = 'arpack'
#eigen_solver = 'dense'
#random_state = None
random_state = 0
kpca = KernelPCA(n_components=2, kernel='precomputed',
eigen_solver=eigen_solver, random_state=random_state)
X_kpca1 = kpca.fit_transform(X)
X_kpca2 = kpca.fit_transform(X)
plt.plot(X_kpca1[:,0], X_kpca1[:,1], 'ro')
plt.plot(X_kpca2[:,0], X_kpca2[:,1], 'bo')
plt.show()

How to overplot fit results for discrete values in pymc3?

I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!

Hi a simple way to do it is as follows:
from pymc3 import *
from numpy import ones, array
# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])
def invlogit(x):
return np.exp(x) / (1 + np.exp(x))
with Model() as model:
# Logit-linear model parameters
alpha = Normal('alpha', 0, 0.01)
beta = Normal('beta', 0, 0.01)
# Calculate probabilities of death
theta = Deterministic('theta', invlogit(alpha + beta * dose))
# Data likelihood
deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start, progressbar=True)
import matplotlib.pyplot as plt
death_fit = np.percentile(trace.theta,50,axis=0)
plt.plot(dose, death_fit,'g', marker='.', lw='1.25', ls='-', ms=5, mew=1)
plt.show()

If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
instead if you want to plot dose vs pred_death, where pred_death is computed taking into account the uncertainty in posterior for alpha and beta. Then probably the easiest way is to use the function sample_ppc:
May be something like
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using Posterior Predictive Checks (ppc) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.

unexpected poor performance of GMM from sklearn

I'm trying to model some simulated data using the DPGMM classifier from scikitlearn, but I'm getting poor performance. Here is the example I'm using:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
clf = mixture.DPGMM(n_components=5, init_params='wc')
s = 0.1
a = np.random.normal(loc=1, scale=s, size=(1000,))
b = np.random.normal(loc=2, scale=s, size=(1000,))
c = np.random.normal(loc=3, scale=s, size=(1000,))
d = np.random.normal(loc=4, scale=s, size=(1000,))
e = np.random.normal(loc=7, scale=s*2, size=(5000,))
noise = np.random.random(500)*8
data = np.hstack([a,b,c,d,e,noise]).reshape((-1,1))
clf.means_ = np.array([1,2,3,4,7]).reshape((-1,1))
clf.fit(data)
labels = clf.predict(data)
plt.scatter(data.T, np.random.random(len(data)), c=labels, lw=0, alpha=0.2)
plt.show()
I would think that this would be exactly the kind of problem that gaussian mixture models would work for. I've tried playing around with alpha, using gmm instead of dpgmm, changing the number of starting components, etc. I can't seem to get a reliable and accurate classification. Is there something I'm just missing? Is there another model that would be more appropriate?

Because you didn't iterate long enough for it to converge.
Check the value of
clf.converged_
and try increasing n_iter to 1000.
Note that, however, the DPGMM still fails miserably IMHO on this data set, decreasing the number of clusters to just 2 eventually.

Time series forecasting with support vector regression

I'm trying to perform a simple time series prediction using support vector regression.
I am trying to understand the answer provided here.
I adapted Tom's code to reflect the answer provided:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.svm import SVR
X = np.arange(0,100)
Y = np.sin(X)
a = 0
b = 10
x = []
y = []
while b <= 100:
x.append(Y[a:b])
a += 1
b += 1
b = 10
while b <= 90:
y.append(Y[b])
b += 1
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=1e5)
y_rbf = svr_rbf.fit(x[:81], y).predict(x)
figure = plt.figure()
tick_plot = figure.add_subplot(1, 1, 1)
tick_plot.plot(X, Y, label='data', color='green', linestyle='-')
tick_plot.axvline(x=X[-10], alpha=0.2, color='gray')
tick_plot.plot(X[10:], y_rbf[:-1], label='data', color='blue', linestyle='--')
plt.show()
However, I still get the same behavior -- the prediction just returns the value from the last known step. Strangely, if I set the kernel to linear the result is much better. Why doesn't the rbf kernel prediction work as intended?
Thank you.

I understand this is an old question, but I will answer it as other people might benefit from the answer.
The values you are using for C and gamma are most likely the issue if your example works with a linear kernel and not with rbf.
C and gamma are SVM parameters used for nonlinear kernel. For a goodexplanation of what C and gamma are intuitively, have a look here: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html?
In order to predict the values of a sinusoid, try C = 1 and gamma = 0.1. It performs much better than with the values you have.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Logistic Regression Produces Wrong Coefficients - python

Related

Should features that correlate be deleted from ML models?

Python: scikit-learn isomap results seem random, but no possibility to set random_state

How to overplot fit results for discrete values in pymc3?

unexpected poor performance of GMM from sklearn

Time series forecasting with support vector regression

Categories

Resources