I'm trying to use machine learning to model an addition. But the model always predicts the same. Here is my code:
import numpy as np
import random
from sklearn.naive_bayes import GaussianNB
X=np.array([[0,1],[1,1],[2,1],[2,2],[2,3],[3,3],[3,4],[4,4],[4,5]])
Y=np.array([1,2,3,4,5,6,7,8,9])
clf = GaussianNB()
clf.fit(X,Y)
x=random.random()
y=random.random()
d=1
e=10000
accuracy=0
while d<e:
    d+=1
    if (clf.predict([[x, y]])) == x+y:
        accuracy+=1
    if d==e:
        print(accuracy)
Out of 10000 predictions, not one predicted Y equals the sum of the two random variables in X. What went wrong?
First of all, as pointed out in the comments, this is a regression problem, not a classification one, and GaussianNB is a classifier. Secondly, your code has a bug: you're predicting on the same input every time, since the random values are generated once, outside the loop, and never regenerated.
Here's how you could go about solving this problem. You're trying to model a linear relation between the features and the target variable, so you want your model to learn a mapping f(X)->y with a linear function, in this case a simple addition. Hence you need a linear model.
So here we could use LinearRegression. To train the regressor you could do:
from sklearn.linear_model import LinearRegression
X = np.random.randint(0,1000, (20000, 2))
y = X.sum(1)
lr = LinearRegression()
lr.fit(X,y)
And then similarly generate a test set with unseen combinations, which hopefully the regressor should be able to accurately predict on:
X_test = np.random.randint(0,1000, (2000, 2))
y_test = X_test.sum(1)
If we predict with the trained model and compare the predicted values with the original ones, we see that the model indeed maps the addition function perfectly, as we expected:
import pandas as pd
y_pred = lr.predict(X_test)
pd.DataFrame(np.vstack([y_test, y_pred]).T, columns=['Test', 'Pred']).head(10)
Test Pred
0 1110.0 1110.0
1 557.0 557.0
2 92.0 92.0
3 1210.0 1210.0
4 1176.0 1176.0
5 1542.0 1542.0
By checking the model's coef_, we can see that the model has learnt the following optimal coefficients:
lr.coef_
# array([1., 1.])
And:
lr.intercept_
# 4.547473508864641e-13 -> 0
Which basically turns a linear regression into an addition, for instance:
X_test[0]
# array([127, 846])
So we'd have that y_pred = 0 + 1*127 + 1*846
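As a minimal sketch of how the evaluation loop from the question could look once both issues are addressed (a regressor instead of a classifier, and fresh inputs on every iteration), assuming the same setup as above. The names rng, n_trials and hits are illustrative, and the tolerance is an arbitrary choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Train on integer pairs and their sums, as in the answer above.
X_train = rng.integers(0, 1000, size=(20000, 2))
y_train = X_train.sum(axis=1)
model = LinearRegression().fit(X_train, y_train)

n_trials = 1000
hits = 0
for _ in range(n_trials):
    # Draw a fresh pair on every iteration instead of reusing one (x, y).
    x, y = rng.random(2)
    pred = model.predict([[x, y]])[0]
    # Compare floats with a tolerance rather than with ==.
    if np.isclose(pred, x + y, atol=1e-6):
        hits += 1

print(hits, "of", n_trials, "predictions match x + y")
```

Comparing with np.isclose instead of == matters here because the learnt intercept is only approximately zero.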
random.random() generates a random real number between 0 and 1, so x+y is a non-integer value between 0 and 2, while the model can only predict one of the integer labels it was trained on. The two are therefore never equal.
You can use random.randint(a, b) instead:
x=random.randint(0,4)
y=random.randint(1,5)
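To see why the equality test in the question cannot succeed, here is a small sketch that reuses the question's training data and collects the distinct values GaussianNB returns for inputs drawn with random.random(). The loop size and seed are arbitrary choices:

```python
import random
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0,1],[1,1],[2,1],[2,2],[2,3],[3,3],[3,4],[4,4],[4,5]])
Y = np.array([1,2,3,4,5,6,7,8,9])
clf = GaussianNB().fit(X, Y)

random.seed(0)
preds = set()
for _ in range(200):
    # Fresh real-valued inputs in [0, 1) on each iteration.
    x, y = random.random(), random.random()
    preds.add(int(clf.predict([[x, y]])[0]))

# Every prediction is one of the training labels, never a value like 0.73.
print(sorted(preds))
print(preds <= set(Y.tolist()))
```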
# import sklearn and necessary libraries
from sklearn.linear_model import LogisticRegression
# Apply sklearn logistic regression on the given data X and labels Y
X_skl = np.vstack((df1,df2)) # 10000 x 2 array
Y_skl = Y # 10000 x 1 array
LogR = LogisticRegression()
LogR.fit(X_skl,Y_skl)
Y_skl_hat = LogR.predict(X_skl)
# Calculate the accuracy
# Check the number of points where Y_skl is not equal to Y_skl_hat
error_count_skl = 0 # Count the number of error points
for i in range(N):
    if Y_skl[i] == Y_skl_hat[i]:
        error_count_skl = error_count_skl
    else:
        error_count_skl = error_count_skl + 1
# Calculate the accuracy
Accuracy = 100*(N - error_count_skl)/N
print("Accuracy(%):")
print(Accuracy)
Output:
Accuracy(%):
99.48
Hello,
I'm trying to apply logistic regression model on array X (with size of 10000 x 2) and label Y (10000 x 1)
using sklearn library in Python. I'm completely lost cause I've never used this library before. Can anyone help me with the coding?
Edited:
Sorry for the vague question, the goal is to find the training accuracy using the entire dataset of X. Above is what I came up with, can anyone take a look and see if it makes sense?
To calculate accuracy you can simply use this sklearn method.
sklearn.metrics.accuracy_score(y_true, y_pred)
In your case
sklearn.metrics.accuracy_score(Y_skl, Y_skl_hat)
For details, see the sklearn documentation for accuracy_score.
You should also train your model on one part of the data and test it on another, to check whether the model generalizes and to detect overfitting.
To split your data into train and test sets you could use:
sklearn.model_selection.train_test_split
For details, see the sklearn documentation for train_test_split.
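A short sketch combining both suggestions. The data here is synthetic (generated with make_classification as a stand-in for the 10000 x 2 array in the question), and the 80/20 split is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_skl / Y_skl from the question.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# accuracy_score replaces the manual error-counting loop.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print("Train accuracy(%):", 100 * train_acc)
print("Test accuracy(%):", 100 * test_acc)
```

A large gap between the two numbers would be a hint that the model is overfitting.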
I have a dataset for regression: (X_train_scaled, y_train) and (X_val_scaled, y_val) for training and validation respectively. The inputs were scaled using StandardScaler.
I create a linear regression model using sklearn.linear_model.LinearRegression as follows:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
linear_reg = LinearRegression()
linear_reg.fit(X_train_scaled, y_train)
y_pred_train = linear_reg.predict(X_train_scaled)
y_pred_val = linear_reg.predict(X_val_scaled)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
After that I do the same but use polynomial features with degree = 1 (which are just the same as the original features but with an additional feature of ones, i.e. x^0, which I ignore).
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(1)
X_train_poly = pf.fit_transform(X_train_scaled)[:, 1:] # ignore first col
X_val_poly = pf.transform(X_val_scaled)[:, 1:] # ignore first col
linear_reg = LinearRegression()
linear_reg.fit(X_train_poly, y_train)
y_pred_train = linear_reg.predict(X_train_poly)
y_pred_val = linear_reg.predict(X_val_poly)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
However, I get different results. The first code gives me the following outputs:
r2_train 0.7409525513417043
r2_val 0.7239859358973735
whereas the second code gives this output:
r2_train 0.7410093370149977
r2_val 0.7241725658840452
Why are the outputs different although the dataset and model are the same?
To prove the datasets are the same, I tried the following code:
print(X_train_scaled.shape, X_train_poly.shape)
print(X_val_scaled.shape, X_val_poly.shape)
print((X_train_poly != X_train_scaled).sum())
print((X_val_poly != X_val_scaled).sum())
which has the output:
(802, 9) (802, 9)
(268, 9) (268, 9)
0
0
which indicates that the two datasets are identical.
Also, I use LinearRegression in both cases, which uses the OLS algorithm and has no random operations at all. So it's supposed to do the same calculations on the same data. However, I get different results.
Does anyone have an idea about the reason?
Sklearn LinearRegression uses ordinary least squares optimization to fit train data into a linear model while it is not clear what Sklearn PolynomialFeatures use. But based on its transform() function:
Prefer CSR over CSC for sparse input (for speed), but CSC is required
if the degree is 4 or higher. If the degree is less than 4 and the
input format is CSC, it will be converted to CSR, have its polynomial
features generated, then converted back to CSC.
(see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Assuming PolynomialFeatures uses ordinary least squares optimization, you would still have same results but with slight difference (just like yours) because Compressed Sparse Row (CSR) method would compromise float values (in other words, truncation/approximation error).
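A few further checks that could help narrow this down, as a hedged sketch on placeholder data (random numbers with the same shape as in the question). It compares values, dtype and memory layout of the two arrays and prints the condition number of the design matrix. Whether any of these explains the small R² gap is an assumption, not something this thread confirms:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(802, 9))  # placeholder for the real training data

X_scaled = StandardScaler().fit_transform(X_raw)
# Degree-1 polynomial features, dropping the leading column of ones.
X_poly = PolynomialFeatures(1).fit_transform(X_scaled)[:, 1:]

# Element-wise equality of the values.
print("same values:", np.array_equal(X_scaled, X_poly))
# Same floating-point type?
print("dtypes:", X_scaled.dtype, X_poly.dtype)
# The column slice is a view and may have a different memory layout.
print("C-contiguous:", X_scaled.flags['C_CONTIGUOUS'], X_poly.flags['C_CONTIGUOUS'])
# A large condition number makes least squares sensitive to tiny numerical differences.
print("condition number:", np.linalg.cond(X_scaled))
```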
I am trying to train a logistic regression model with data as follows:
Categorical Variable: either 0 or 1
Numerical Variables: Continuous number between 8 and 20
I have 20 numerical variables and I want to only use one at a time for the predicting model, and see which is the best feature to use.
The code I'm using is:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
for variable in numerical_variable:
    X = data[[variable]]
    y = data[categorical_variable]
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=0)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    y_pred=logreg.predict(X_test)
    print(y_pred)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
The categorical variable is biased towards 1, there are about 800 1s to 200 0s. So I think this is why it always predicts one, regardless of the test samples (if I don't set random_state=0) and regardless of the numerical variable.
(using python 3)
Any thoughts on how to fix this?
Thanks
Use the joblib library to save your model:
import joblib
your_model = LogisticRegression()
your_model.fit(X_train, y_train)
filename = 'finalized_model.sav'
joblib.dump(your_model, filename)
This code will save your model as 'finalized_model.sav'. The extension doesn't matter; you can even leave it out.
You can then load that exact, fixed model with the following code and get the same predictions every time:
your_loaded_model = joblib.load('finalized_model.sav')
As a prediction example:
your_loaded_model.predict(X_test)
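A self-contained sketch of the save/load round trip, using synthetic data from make_classification and a temporary directory (both are assumptions for illustration, not part of the original question):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'finalized_model.sav')
    joblib.dump(model, path)     # persist the fitted model
    loaded = joblib.load(path)   # restore it from disk

# The restored model reproduces the original predictions.
same = np.array_equal(model.predict(X), loaded.predict(X))
print(same)
```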
I'm trying to reproduce the following R results in Python. In this particular case the R predictive skill is lower than the Python skill, but this is usually not the case in my experience (hence the reason for wanting to reproduce the results in Python), so please ignore that detail here.
The aim is to predict the flower species ('versicolor' 0 or 'virginica' 1). We have 100 labelled samples, each consisting of 4 flower characteristics: sepal length, sepal width, petal length, petal width. I've split the data into training (60% of data) and test sets (40% of data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (the parameter that is optimized is "C" in scikit-learn).
I'm using glmnet in R with alpha set to 1 (for the LASSO penalty), and for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation). The scoring metrics used in the cross-validation are the same between both languages. However somehow the model results are different (the intercepts and coefficients found for each feature vary quite a bit).
R Code
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2
n_sample = NROW(X)
w = .6
X_train = X[0:(w * n_sample),] # (60, 4)
y_train = y[0:(w * n_sample)] # (60,)
X_test = X[((w * n_sample)+1):n_sample,] # (40, 4)
y_test = y[((w * n_sample)+1):n_sample] # (40,)
# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
nfolds = 10, alpha=1, family="binomial", type.measure="class")
best_s <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))
# best lambda
print(best_s)
# 0.04136537
# fraction correct
print(sum(y_test==pred)/NROW(pred))
# 0.75
# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept) -14.680479
#Sepal.Length 0
#Sepal.Width 0
#Petal.Length 1.181747
#Petal.Width 4.592025
Python Code
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)] # (60, 4)
y_train = y[:int(w * n_sample)] # (60,)
X_test = X[int(w * n_sample):] # (40, 4)
y_test = y[int(w * n_sample):] # (40,)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = 'accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(clf.score(X_train_fit.transform(X_test), y_test)) # score is 0.775
print(clf.intercept_) #-1.83569557
print(clf.coef_) # [ 0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print(clf.C_) # optimal lambda: 0.35938137
There are a few things that are different in the examples above:
Scale of the coefficients
glmnet (https://cran.r-project.org/web/packages/glmnet/glmnet.pdf) standardizes the data and "The coefficients are always returned on the original scale". Hence you did not scale your data before calling glmnet.
The Python code standardizes the data, then fits to that standardized data. The coefs in this case are in the standardized scale, not the original scale. This makes the coefs between the examples non-comparable.
LogisticRegressionCV by default uses stratified folds. glmnet uses k-fold.
They are fitting different equations. Notice that scikit-learn's logistic regression (http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) puts the regularization parameter C on the logistic loss term, whereas glmnet puts lambda on the penalty term.
Choosing the regularization strengths to try - glmnet defaults to 100 lambdas to try. scikit LogisticRegressionCV defaults to 10. Due to the equation scikit solves, the range is between 1e-4 and 1e4 (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV).
Tolerance is different. In some problems I have had, tightening the tolerance significantly changed the coefs.
glmnet defaults thresh to 1e-7
LogisticRegressionCV defaults tol to 1e-4
Even after making them the same, they may not measure the same thing. I do not know what liblinear measures. glmnet - "Each inner coordinate-descent loop continues until the maximum change in the objective after any coefficient update is less than thresh times the null deviance."
You may want to try printing the regularization paths to see if they are very similar, just stopping on a different strength. Then you can research why.
Even after changing what you can change, which is not all of the above, you may not get the same coefs or results. Though you are solving the same problem in different software, how the software solves the problem may be different. We see different scales, different equations, different defaults, different solvers, etc.
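The "different equations" point can be made concrete. As a sketch based on the documented objectives (the notation is my own; scikit-learn codes the labels as y_i in {-1, 1}, glmnet as y_i in {0, 1}, and N is the number of training samples), the L1-penalized problems are:

```latex
% scikit-learn (penalty='l1'): C scales the logistic loss
\min_{w,\,c}\; \lVert w \rVert_1
  + C \sum_{i=1}^{N} \log\!\left(1 + e^{-y_i (x_i^\top w + c)}\right)

% glmnet (alpha = 1, family = "binomial"): lambda scales the penalty
\min_{\beta_0,\,\beta}\; -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i (\beta_0 + x_i^\top \beta)
         - \log\!\left(1 + e^{\beta_0 + x_i^\top \beta}\right) \right]
  + \lambda \lVert \beta \rVert_1

% Dividing the first objective by N C shows the correspondence
\lambda \approx \frac{1}{N\,C}
```

The correspondence is only approximate in practice, since it ignores the other differences listed above (standardization, intercept handling, tolerances and the grid of values searched).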
The problem that you've got here is the ordering of the datasets (note I haven't checked the R code, but I'm certain this is the problem). If I run your code and then run this
print(np.bincount(y_train)) # [50 10]
print(np.bincount(y_test)) # [ 0 40]
You can see the training set is not representative of the test set. However if I make a couple of changes to your Python code then I get a test accuracy of 0.9.
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
test_size=0.4,
random_state=42,
stratify=y)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = 'accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(clf.score(X_train_fit.transform(X_test), y_test)) # score is 0.9
print(clf.intercept_) #0.
print(clf.coef_) # [ 0., 0. ,0., 0.30066888] (sepal length, sepal width, petal length, petal width)
print(clf.C_) # [ 0.04641589]
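To make the effect of the split visible on its own, here is a small sketch that compares the class counts produced by the ordered 60/40 slice from the question with those from a stratified train_test_split. The variable names are illustrative:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target
X, y = X[y != 0], y[y != 0] - 1  # keep 'versicolor' (0) and 'virginica' (1)

# Ordered split as in the question: first 60 rows train, last 40 rows test.
n_train = int(0.6 * len(X))
ordered_train = np.bincount(y[:n_train], minlength=2)
ordered_test = np.bincount(y[n_train:], minlength=2)
print(ordered_train, ordered_test)  # [50 10] [ 0 40]

# Stratified split: both classes keep their 50/50 proportion in each part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
strat_train = np.bincount(y_tr, minlength=2)
strat_test = np.bincount(y_te, minlength=2)
print(strat_train, strat_test)  # [30 30] [20 20]
```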
I have to take issue with a couple of things here.
Firstly, "for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation)". That is just patently false, unless you meant to qualify that in some more definitive way. Just take a look at the descriptions of the sklearn.linear_model classes and you will see a handful that specifically mention L1. I am sure that others allow you to implement it as well, but I don't really feel like counting them.
Secondly, your method for splitting the data is less than ideal. Take a look at your input and output after the split and you will find that all of the test samples have a target value of 1, while the target of 1 only accounts for 1/6 of your training sample. This imbalance, which is not representative of the distribution of the targets, will cause your model to be poorly fit. For example, just using sklearn.model_selection.train_test_split out of the box and then refitting the LogisticRegressionCV classifier exactly as you had results in an accuracy of 0.92.
Now all that being said there is a glmnet package for python and you can replicate your results using this package. There is a blog by the authors of this project that discusses some of the limitations in trying to recreate glmnet results with sklearn. Specifically:
"Scikit-Learn has a few solvers that are similar to glmnet, ElasticNetCV and LogisticRegressionCV, but they have some limitations. The first one only works for linear regression and the latter does not handle the elastic net penalty." - Bill Lattner GLMNET FOR PYTHON
I have a precomputed kernel of size NxN. I am using GridSearchCV to tune C parameter of SVM with kernel='precomputed' as follows:
C_range = 10. ** np.arange(-2, 9)
param_grid = dict(C=C_range)
grid = GridSearchCV(SVC(kernel='precomputed'), param_grid=param_grid, cv=StratifiedKFold(y=data_label, n_folds=10))
grid.fit(kernel, data_label)
print(grid.best_score_)
This works fine; however, since I use the full data for prediction (with grid.predict(kernel)), it overfits (I get precision/recall = 1.0 most of the time).
So I would like to first split my data to 10 chunks (9 for training, 1 for testing) with cross-validation, and in each fold, I want to run GridSearch to tune the C value on the training set, and test on the testing set.
In order to do this, I sliced the kernel matrix into 100x100 and 50x50 submatrices where I run grid.fit() on one of them and grid.predict() on the other.
But I get the following error:
ValueError: X.shape[1] = 50 should be equal to 100, the number of features at training time
I guess the training kernel should have the same dimensions as the testing kernel, but I don't understand why, because I simply compute np.dot(X, X.T) for the 100x100 part and for the 50x50 part, hence the final kernels have different dimensions.
The scikit learn doc says:
Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. At the moment, the kernel values between all training vectors and the test vectors must be provided.
So I guess that it's not possible to do (simple) cross-validation with precomputed kernels.
Custom grid search is fairly straightforward to hack together, though to the best of my knowledge six years later there's still no built-in way of doing it in sklearn. Here's a simple snippet that worked for me to tune the C parameter:
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC
def precomputed_kernel_GridSearchCV(K, y, Cs, n_splits=5, test_size=0.2, random_state=42):
    """A version of grid search CV,
    but adapted for SVM with a precomputed kernel
    K (np.ndarray) : precomputed kernel
    y (np.array) : labels
    Cs (iterable) : list of values of C to try
    return: optimal value of C
    """
    n = K.shape[0]
    assert len(K.shape) == 2
    assert K.shape[1] == n
    assert len(y) == n
    best_score = float('-inf')
    best_C = None
    indices = np.arange(n)
    for C in Cs:
        # for each value of parameter, do K-fold
        # The performance measure reported by k-fold cross-validation
        # is the average of the values computed in the loop
        scores = []
        ss = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=random_state)
        for train_index, test_index in ss.split(indices):
            K_train = K[np.ix_(train_index,train_index)]
            K_test = K[np.ix_(test_index, train_index)]
            y_train = y[train_index]
            y_test = y[test_index]
            svc = SVC(kernel='precomputed', C=C)
            svc.fit(K_train, y_train)
            scores.append(svc.score(K_test, y_test))
        if np.mean(scores) > best_score:
            best_score = np.mean(scores)
            best_C = C
    return best_C
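The ValueError in the question comes from the shape of the test kernel: with kernel='precomputed', the matrix passed to predict or score needs one row per test sample and one column per training sample, so a 50x50 block computed from the test data alone does not fit. A minimal sketch of the slicing, using a linear kernel on random placeholder data with a fixed 100/50 split (all assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))          # placeholder features
y = (X[:, 0] > 0).astype(int)          # placeholder labels

train_idx = np.arange(100)
test_idx = np.arange(100, 150)

K = X @ X.T                                    # full 150x150 linear kernel
K_train = K[np.ix_(train_idx, train_idx)]      # (100, 100): train vs. train
K_test = K[np.ix_(test_idx, train_idx)]        # (50, 100): test rows vs. train columns

svc = SVC(kernel='precomputed', C=1.0)
svc.fit(K_train, y[train_idx])
score = svc.score(K_test, y[test_idx])
print(K_train.shape, K_test.shape, score)
```

This is the same indexing with np.ix_ that the function above applies inside each split.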