When I run this leave-one-out cross validation, it does nothing: no output, not even an error message. I can't figure out what I am missing. I'm using the CSV from Kaggle - https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression/downloads/heart-disease-prediction-using-logistic-regression.zip/1
import csv
from sklearn.model_selection import LeaveOneOut
from sklearn import svm
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
from pandas import read_csv
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
#replace missing values with mean
dataset=read_csv("//Users/crystalfortress/Desktop/CompGenetics
/Final_Project_Comp/framingham.csv")
dataset.fillna(dataset.mean(), inplace=True)
print(dataset.isnull().sum())
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 15].values
model = svm.SVC(kernel='linear', C=10, gamma = 0.1)
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print('Accuracy after cross validation:', scores.mean())
predictions = cross_val_predict(model, X, y, cv=loo)
accuracy = metrics.r2_score(y, predictions)
print('Prediction accuracy:', accuracy)
x = metrics.classification_report(y, predictions)
print(x)
cf = metrics.confusion_matrix(y, predictions)
print(cf)
I tried running this on my machine, and it looks like your code works fine (though I did comment out a lot of unnecessary imports). Leave One Out takes a really long time to train: it builds n training sets out of n data points, so the model is fit n times. You will simply have to wait for it to finish before you see any results. Changing cv= to a number (the default is 3, I believe) will train a model more quickly and will most likely have less variance. Also, adding n_jobs=-1 to your cross_val_score call will let Python use all of your processor cores. E.g.: scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1)
You can also set the verbose parameter in cross_val_score to watch the progress (though, heads up, it still isn't fast). I believe the highest meaningful value is 3 (the online docs don't say), but using a higher value is fine. So the final cross_val_score call would look like:
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1, verbose=10)
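For comparison, a rough sketch using 10-fold cross validation instead of leave-one-out (the number of folds is an arbitrary choice here; model, X, and y are as defined in your script):
scores_kfold = cross_val_score(model, X, y, cv=10, scoring='accuracy', n_jobs=-1, verbose=1)  # 10 fits instead of n
print('Mean 10-fold accuracy:', scores_kfold.mean())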
I am getting 100% accuracy with multiple linear regression. I am following a tutorial from last year; the author does not get 100% accuracy on the same model, but I am getting it now. That seems weird to me. Here's my code. Am I doing it right, or is there something wrong in my code?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('M_Regression.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, :1].values
from sklearn.model_selection import train_test_split
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
#regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train,Y_train)
#Prediction
y_pred = reg.predict(x_test)
print(str(y_test) + " - " + str(y_pred))
With linear regression on simple numeric data it is common to get 100% accuracy, even on a large dataset, so try other datasets as well. I tried your code and also got an accuracy of 1.0.
To check the accuracy of your model, you could try printing the R2 score of your test sample. Something along the lines of:
from sklearn.metrics import r2_score
print(r2_score(y_test,y_pred))
If you still have issues with the score, you could try removing random_state=0 to check whether you still get 100% accuracy with several different train/test splits.
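For example, a quick sketch (reusing X, Y, and train_test_split from the question) that repeats the split several times without a fixed random_state and prints the R2 score for each split:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# Each iteration uses a different random split, so the printed scores show the spread.
for _ in range(5):
    X_tr, x_te, Y_tr, y_te = train_test_split(X, Y, test_size=0.3)
    reg = LinearRegression().fit(X_tr, Y_tr)
    print(r2_score(y_te, reg.predict(x_te)))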
I am separating the features into X and y, then I preprocess my train and test data after splitting it with k-fold cross validation. After that I fit the train data to my Random Forest Regressor model and calculate the confidence score. Why do I preprocess after splitting? Because people tell me that it's more correct to do it that way, and I'm keeping that principle for the sake of my model's performance.
This is my first time using k-fold cross validation, because my model's score overfits and I thought I could fix it with cross validation. I'm still confused about how to use it; I have read the documentation and some articles, but I do not really get how to apply it to my model. I tried anyway, and my model still overfits: whether I use train_test_split or cross validation, my model's score is still 0.999. I do not know what my mistake is, since I'm very new to this method, but I think maybe I did it wrong and that's why it does not fix the overfitting. Please tell me what's wrong with my code and how to fix it.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns = {'4046':'small PLU sold',
'4225':'large PLU sold',
'4770':'xlarge PLU sold'},
inplace= True)
avo_sales.columns = avo_sales.columns.str.replace(' ','')
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)
# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]
    impC = SimpleImputer(strategy='most_frequent')
    X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
    X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()
    imp = SimpleImputer(strategy='median')
    X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
    X_test[:,1:8] = imp.transform(X_test[:,1:8])
    le = LabelEncoder()
    X_train[:,8] = le.fit_transform(X_train[:,8])
    X_test[:,8] = le.transform(X_test[:,8])
    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)
    confidence = rfr.score(X_test, y_test)
    print(confidence)
The reason you're overfitting is that a non-regularized tree-based model will adjust to the data until all training samples are fit correctly. See for example this image:
As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly because it will basically just learn the noise in the training data. There are many ways to regularize trees in sklearn; you can find them here. For instance:
max_features
min_samples_leaf
max_depth
With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model, for instance:
To regularize your model, instantiate the RandomForestRegressor class like this:
rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)
These argument values are arbitrary; it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV.
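For example, a minimal GridSearchCV sketch (the grid values below are arbitrary placeholders; X_train and y_train are assumed to come from your existing split):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Small, arbitrary grid over the regularization parameters mentioned above.
param_grid = {
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [2, 4, 8],
    'max_features': [0.5, 0.8],
}
search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)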
Other than that, imputing the mean and the median might bring a lot of noise into your data. I would advise against it unless you have no other choice.
While @NicolasGervais' answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding with regard to cross-validation in the original question; you seem to think that:
Cross-validation is a method that improves the performance of a machine learning model.
But this is not the case.
Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy.
In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
Example:
Let's look at a dataset with 10 points, and fit a line through it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
X = np.random.randint(0,10,10)
Y = np.random.randint(0,10,10)
fig = plt.figure(figsize=(1,10))
def line(x, slope, intercept):
    return slope * x + intercept

for i in range(5):
    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    test_indices = np.random.choice(np.arange(10), 2)
    train_indices = list(set(range(10)) - set(test_indices))
    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]
    # training set has one feature and multiple entries
    # so, reshape(-1,1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1,1), Y_train.reshape(-1,1), X_test.reshape(-1,1), Y_test.reshape(-1,1)
    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)
    # extract coefficients from model:
    slope, intercept = reg.coef_[0], reg.intercept_[0]
    print(score_test)
    # show train and test sets
    plt.subplot(5,1,i+1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')
    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
    plt.ylim(0,10)
    plt.xlim(0,10)
    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))
You can see that the scores on training and test set are vastly different. You can also see that the estimated parameters vary a lot with the change of train and test set.
That does not make your linear model any better at all.
But now you know exactly how bad it is :)
I have been playing around with lasso regression on polynomial functions using the code below. The question I have is whether I should be doing feature scaling as part of the lasso regression (when attempting to fit a polynomial function). The R^2 results and plot produced by the code I have pasted below suggest not. I'd appreciate any advice on why scaling is not needed here, or whether I have fundamentally stuffed something up. Thanks in advance for any advice.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.preprocessing import MinMaxScaler
    import matplotlib.pyplot as plt
    scaler = MinMaxScaler()
    global X_train, X_test, y_train, y_test
    degrees = 12
    poly = PolynomialFeatures(degree=degrees)
    X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
    X_test_poly = poly.fit_transform(X_test.reshape(-1,1))
    #Lasso Regression Model
    X_train_scaled = scaler.fit_transform(X_train_poly)
    X_test_scaled = scaler.transform(X_test_poly)
    #No feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_poly, y_train)
    y_test_lassopredict = linlasso.predict(X_test_poly)
    Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
    #With feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_scaled, y_train)
    y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
    Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_test, y_test, label='Test data')
    plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
    plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
    return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)

answer_regression()
Your X range is around [0, 10], so the polynomial features have a much wider range of values. Without scaling, their fitted weights are already small (because the feature values are large), so Lasso does not need to shrink them to zero. If you scale the features to [0, 1], the weights needed to fit the same trend become much larger, and Lasso will set most of them to zero. That's why the prediction is poor in the scaled case (those features are needed to capture the true trend of y).
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than an optimal value and should be tuned. If you decrease alpha, you will get similar results for both cases.
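A rough sketch of both checks, assuming the arrays from inside answer_regression() (X_train_poly, X_train_scaled, y_train) are available, and using an arbitrary grid of alphas:
import numpy as np
from sklearn.linear_model import Lasso
# Count how many coefficients Lasso zeroes out with and without scaling.
for name, Xtr in [('no scaling', X_train_poly), ('scaled', X_train_scaled)]:
    fit = Lasso(alpha=0.01, max_iter=10000).fit(Xtr, y_train)
    print(name, 'non-zero coefficients:', np.sum(fit.coef_ != 0))
# Smaller alphas keep more of the scaled features; LassoCV can also pick alpha automatically.
for alpha in [1e-5, 1e-4, 1e-3, 1e-2]:
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X_train_scaled, y_train)
    print('alpha =', alpha, 'non-zero coefficients:', np.sum(fit.coef_ != 0))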
I am trying to make predictions on the iris dataset. I have decided to use SVMs for this purpose. But it gives me an accuracy of 1.0. Is this a case of overfitting, or is it because the model is very good? Here is my code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1,gamma='auto')
svm_model.fit(X_train,y_train)
predictions = svm_model.predict(X_test)
accuracy_score(predictions, y_test)
Here, accuracy_score returns a value of 1. Please help me. I am a beginner in machine learning.
You can try cross validation:
Example:
from sklearn.model_selection import LeaveOneOut
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
#load iris data
iris = datasets.load_iris()
X = iris.data
Y = iris.target
#build the model
svm_model = SVC(kernel='linear', C=1, gamma='auto', random_state=0)
#create the Cross validation object
loo = LeaveOneOut()
#calculate cross validated (leave one out) accuracy score
scores = cross_val_score(svm_model, X,Y, cv = loo, scoring='accuracy')
print( scores.mean() )
Result (the mean accuracy of the 150 folds since we used leave-one-out):
0.97999999999999998
Bottom line:
Cross validation (especially LeaveOneOut) is a good way to avoid overfitting and to get robust results.
The iris dataset is not a particularly difficult one from which to get good results. However, you are right not to trust a model with 100% classification accuracy. In your example, the problem is that the 30 test points are all correctly classified. But that doesn't mean that your model is able to generalise well to all new data instances. Just try changing the test_size to 0.3 and the results are no longer 100% (it goes down to 97.78%).
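For instance, a minimal check with test_size=0.3 (reusing X, y, svm, train_test_split, and accuracy_score exactly as imported in your code):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
svm_model.fit(X_train, y_train)
print(accuracy_score(y_test, svm_model.predict(X_test)))  # about 97.78% rather than 100%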
The best way to guarantee robustness and avoid overfitting is to use cross validation. An example of how to do this easily, starting from your code:
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X = iris.data[:, :4]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm_model = svm.SVC(kernel='linear', C=1, gamma='auto')
scores = cross_val_score(svm_model, iris.data, iris.target, cv=10) #10 fold cross validation
Here cross_val_score iteratively uses different parts of the dataset as test data (cross validation) while keeping all your previous parameters. If you check scores you will see that the 10 accuracies calculated now range from 87.87% to 100%. To report the final model performance you can, for example, use the mean of the scored values.
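For example, using the scores array from the snippet above:
print(scores)                        # the 10 individual fold accuracies
print(scores.mean(), scores.std())   # mean (and spread) to report as the final performance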
Hope this helps and good luck! :)
I ran into a problem using Python 3.6 with scikit-learn 0.18.
I used a random forest to do a regression and the regression was pretty good, but when I tried to compute the cross-validation, the scores I got from cross_val_score and train_test_split were really different. The score from train_test_split is 0.9, but the mean score from cross_val_score is about 0.3.
Could you tell me the reason, or point out anything wrong in my code?
The code is:
import numpy as np
import cv2
import itertools
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score,cross_val_predict,ShuffleSplit,KFold
from sklearn.model_selection import train_test_split
train= np.loadtxt('.txt')
traindata=train[0:,38:]
traintarget=train[0:,j]
rf=RandomForestRegressor(n_estimators=20)
rf.fit(traindata,traintarget)
X=traindata
Y=traintarget
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
print (rf.score(X_test, Y_test))
score3 = cross_val_score(rf, X, Y, scoring= 'r2', cv=ShuffleSplit(n=len(X),test_size=0.3,train_size=0.6))
score4 = cross_val_score(rf, X, Y, scoring= 'neg_mean_absolute_error', cv=ShuffleSplit(n=len(X),test_size=0.3,train_size=0.6))
score5 = cross_val_score(rf, X, Y, scoring= 'neg_mean_squared_error', cv=ShuffleSplit(n=len(X),test_size=0.3,train_size=0.6))
print (score3)
print (score4)
print (score5)
When you use train_test_split, you split the dataset into train and test sets randomly.
ShuffleSplit is not too different. The issue may be that the class distribution is highly uneven. (By the way, in my experience Leave One Out cross validation tends to give poor results.) Instead, use 5-fold cross validation.
You can also use stratified cross validation if the class distribution is uneven. It preserves the percentage of samples in each class. Take a look at http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html for example.
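As a rough sketch of plain 5-fold cross validation on your regressor (reusing rf, X, and Y as defined in your code; the imports below use sklearn.model_selection, which is available in 0.18):
from sklearn.model_selection import KFold, cross_val_score
# 5 folds: each part of the data is used as the test set exactly once.
cv5 = KFold(n_splits=5, shuffle=True, random_state=0)
scores_5fold = cross_val_score(rf, X, Y, scoring='r2', cv=cv5)
print(scores_5fold, scores_5fold.mean())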