Error while calculating accuracy using sklearn [duplicate] - python

I want to run several regression types (Lasso, Ridge, ElasticNet and SVR) on a dataset with around 5,000 rows and 6 features, and use GridSearchCV for cross-validation. The code is extensive, but here are some critical parts:
def splitTrainTestAdv(df):
    y = df.iloc[:, -5:]  # last 5 columns
    X = df.iloc[:, :-5]  # except for last 5 columns
    # Scaling and sampling
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
    return X_train, X_test, y_train, y_test

def performSVR(x_train, y_train, X_test, parameter):
    C = parameter[0]
    epsilon = parameter[1]
    kernel = parameter[2]
    model = svm.SVR(C=C, epsilon=epsilon, kernel=kernel)
    model.fit(x_train, y_train)
    return model.predict(X_test)  # prediction for the test set

def performRidge(X_train, y_train, X_test, parameter):
    alpha = parameter[0]
    model = linear_model.Ridge(alpha=alpha, normalize=True)
    model.fit(X_train, y_train)
    return model.predict(X_test)  # prediction for the test set

MODELS = {
    'lasso': (
        linear_model.Lasso(),
        {'alpha': [0.95]}
    ),
    'ridge': (
        linear_model.Ridge(),
        {'alpha': [0.01]}
    ),
}
def performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train):
    print("# Tuning hyper-parameters for %s" % feature)
    print()
    model, param_grid = MODELS[model_name]
    gs = GridSearchCV(model, param_grid, n_jobs=1, cv=5, verbose=1, scoring='%s_weighted' % feature)
    gs.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print(gs.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in gs.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    y_true, y_pred = y_test, gs.predict(X_test)
    print(classification_report(y_true, y_pred))
soil = pd.read_csv('C:/training.csv', index_col=0)
soil = getDummiedSoilDepth(soil)
np.random.seed(2015)
soil = shuffleData(soil)
soil = soil.drop('Depth', 1)
X_train, X_test, y_train, y_test = splitTrainTestAdv(soil)
scores = ['precision', 'recall']
for score in scores:
    for model in MODELS.keys():
        print '####################'
        print model, score
        print '####################'
        performParameterSelection(model, score, X_test, y_test, X_train, y_train)
You can assume that all required imports are done
I am getting this error and do not know why:
ValueError Traceback (most recent call last)
in ()
18 print model, score
19 print '####################'
---> 20 performParameterSelection(model, score, X_test, y_test, X_train, y_train)
21
<ipython-input-27-304555776e21> in performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train)
12 # cv=5 - constant; verbose - keep writing
13
---> 14 gs.fit(X_train, y_train) # Will get grid scores with outputs from ALL models described above
15
16 #pprint(sorted(gs.grid_scores_, key=lambda x: -x.mean_validation_score))
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\metrics\classification.pyc in _check_targets(y_true, y_pred)
90 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
91 "multilabel-sequences"]):
---> 92 raise ValueError("{0} is not supported".format(y_type))
93
94 if y_type in ["binary", "multiclass"]:
ValueError: continuous-multioutput is not supported
I am still very new to Python and this error puzzles me. It should not be caused by having 6 features, of course. I tried to follow the standard built-in functions.
Please, help

First let's replicate the problem.
First import the libraries needed:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
from sklearn.grid_search import GridSearchCV
Then create some data:
df = pd.DataFrame(np.random.rand(5000,11))
y = df.iloc[:,-5:] # last 5 columns
X = df.iloc[:,:-5] # Except for last 5 columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
Now we can replicate the error and also see options which do not replicate the error:
This runs OK
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1)
print gs.fit(X_train, y_train)
This does not
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1, scoring='recall')
gs.fit(X_train, y_train)
and indeed the error is exactly as you have above: 'continuous-multioutput is not supported'.
If you think about the recall measure, it is to do with binary or categorical data - about which we can define things like false positives and so on. At least in my replication of your data, I used continuous data and recall simply is not defined. If you use the default score it works, as you can see above.
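So you probably need to look at your targets and predictions and understand why they are continuous. Either treat this as a classification problem (use a classifier instead of a regressor), or keep the regression and use a score that is defined for continuous data.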
As an aside, if you run the regression with only one set of (column of) y values, you still get an error. This time it says more simply 'continuous output is not supported', i.e. the issue is using recall (or precision) on continuous data (whether or not it is multi-output).
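As a rough illustration of the "use a different score" option, here is a minimal sketch with a regression metric. It assumes a recent scikit-learn (where GridSearchCV lives in sklearn.model_selection) and uses synthetic data; the choice of Ridge, the alpha grid and the 'neg_mean_squared_error' scorer are assumptions for the example, not something taken from the question.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data: 6 features, 5 continuous targets.
rng = np.random.RandomState(0)
X = rng.rand(500, 6)
W = rng.rand(6, 5)
y = X @ W + 0.01 * rng.randn(500, 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A regression scorer is defined for continuous (multi-output) targets,
# unlike precision or recall.
alphas = [0.01, 0.1, 1.0]
gs = GridSearchCV(Ridge(), {'alpha': alphas}, cv=5,
                  scoring='neg_mean_squared_error')
gs.fit(X_train, y_train)

best_alpha = gs.best_params_['alpha']
# With scoring set, gs.score applies the same scorer: negative MSE here.
test_neg_mse = gs.score(X_test, y_test)
print("best alpha:", best_alpha)
print("test negative MSE: %.5f" % test_neg_mse)
```

Other regression scorers such as 'r2' or 'neg_mean_absolute_error' can be passed in the same way.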

If the end goal is to evaluate the performance of the model, you can use the model.evaluate method. Note that this is the Keras model API; scikit-learn estimators do not have an evaluate method.
_, accuracy = model.evaluate(our_data_feat, new_label2, verbose=0)
print('Accuracy: %.2f' % (accuracy * 100))
This will give you the accuracy value.

Make sure you have a single series for the dependent variable, and split your data properly with train_test_split.

Related

Repeated holdout method

How can I implement a "repeated" holdout method? I implemented the holdout method and got an accuracy, but I need to repeat the holdout 30 times.
Here is my code for the holdout method:
[IN]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
[OUT]
Accuracy: 49.62%
I have seen many code examples for repeated methods, but only for k-fold cross-validation, nothing for the holdout method.
So to use a repeated holdout you could use the ShuffleSplit method from sklearn. A minimum working example (following the name conventions that you used) might be as follows:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create some artificial data to train on; this can be replaced by your own data
X, Y = make_classification()
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
    X_train, Y_train = X[train_index], Y[train_index]
    X_test, Y_test = X[test_index], Y[test_index]
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result * 100.0))
n_splits determines how many times you would like to repeat the holdout. test_size determines the fraction of samples that is held out as the test set. In this case 75% is sampled as the train set, whereas 25% is sampled as the test set. For reproducible results you can set the random_state (any number suffices, as long as you use the same number consistently).
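If only the summary over the 30 repetitions is needed, the same ShuffleSplit object can be passed as the cv argument of cross_val_score. The sketch below assumes synthetic data from make_classification; n_samples=200 and max_iter=1000 are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Artificial data; replace with your own X and Y.
X, Y = make_classification(n_samples=200, random_state=0)

# 30 independent random 75/25 train/test splits (repeated holdout).
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)

# cross_val_score fits a fresh clone of the model on every split
# and returns one accuracy value per split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, Y, cv=rs)

print("Mean accuracy: %.2f%% (std %.2f%%)"
      % (scores.mean() * 100.0, scores.std() * 100.0))
```

Reporting the mean together with the standard deviation gives a sense of how much the holdout accuracy varies from split to split.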

Python: I want to perform 5 fold cross validation for logistic regression and report scores. Do I use LogisticRegressionCV() or cross_val_score()?

cross_val_score gives different results than LogisticRegressionCV, and I can't figure out why.
Here is my code:
seed = 42
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X),Y, test_size=test_size, random_state=seed)
#Below is my model that I use throughout the program.
model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.
scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)
coef=np.round(model.coef_,2)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is this.
Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%
Why is the maximum accuracy from cross_val_score higher than the accuracy used by LogisticRegressionCV?
Also, I recognize that cross_val_score does not return a model, which is why I want to use LogisticRegressionCV, but I am struggling to understand why it is not performing as well. Likewise, I am not sure how to get the standard deviations of the predictors from LogisticRegressionCV.
For me, there might be some points to take into consideration:
Cross validation is generally used whenever you should simulate a validation set (for instance when the training set is not that big to be divided into training, validation and test sets) and only uses training data. In your case you're computing accuracy of model on test data, making it impossible to exactly compare results.
According to the docs:
Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.
If you look at this snippet, you'll see that's what happens indeed:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
data = load_breast_cancer()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
    'C': np.power(10.0, np.arange(-10, 10)),
}
gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_) # 0.953846153846154
lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
                            cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max()) # 0.953846153846154
I would suggest to have a look here, too, so as to get the details of lrcv.scores_[1].mean(axis=0).max().
Finally, to get the same result with cross_val_score, you could write:
score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
score.mean() # 0.953846153846154
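For the spread of the cross-validated accuracy (as opposed to a single held-out test accuracy), one option is to read it from GridSearchCV's cv_results_. This is only a sketch: the small C grid and random_state=0 for the split are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

gs = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0]},
    scoring='accuracy', cv=5)
gs.fit(X_train, y_train)

# Mean and standard deviation of the 5 fold accuracies for the best C.
i = gs.best_index_
cv_mean = gs.cv_results_['mean_test_score'][i]
cv_std = gs.cv_results_['std_test_score'][i]

# Accuracy on the held-out test set: a different quantity from the CV
# scores, so the two numbers are not expected to match exactly.
test_acc = gs.score(X_test, y_test)

print("CV accuracy: %.4f (+/- %.4f)" % (cv_mean, cv_std))
print("Test accuracy: %.4f" % test_acc)
```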

Catboost: Cannot calc metric which requires logits for absolute values

I'm using catboost for my model, and the following code raises the error shown below:
from catboost import Pool, CatBoostClassifier, cv
#Split data
train = data[:split]
test = data[split:]
# Get variables for a model
x = train.drop(["Survived"], axis=1)
y = train["Survived"]
#Do train data splitting
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
cat_features = np.where(x.dtypes != float)[0]
cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy')
cat.fit(X_train, y_train, cat_features = cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)
pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))
...which raises:
---> 23 cv_scores = cv(pool, cat.get_params(), fold_count=10)
CatBoostError: catboost/libs/metrics/metric.cpp:5069: Cannot calc
metric which requires logits for absolute values.
Help would be appreciated. Thanks.
You need to add the parameter loss_function='Logloss' inside the CatBoostClassifier(). This issue is described here and was supposedly fixed but now it appeared again. I will reopen this issue because this is clearly a bug.

Scikit: calculate precision and recall using cross_val_score function

I'm using scikit to perform a logistic regression on spam/ham data.
X_train is my training data and y_train the labels('spam' or 'ham') and I trained my LogisticRegression this way:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
If I want to get the accuracies for a 10 fold cross validation, I just write:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
I thought it was possible to calculate also the precisions and recalls by simply adding one parameter this way:
precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')
But it results in a ValueError:
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')
Is it related to the data (should I binarize the labels ?) or do they change the cross_val_score function ?
Thank you in advance !
To compute the recall and precision, the data has to be indeed binarized, this way:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(y_train)
To go further, I was surprised that I didn't have to binarize the data when I wanted to calculate the accuracy:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
It's just because the accuracy formula doesn't really need information about which class is considered as positive or negative: (TP + TN) / (TP + TN + FN + FP). We can indeed see that TP and TN are exchangeable, it's not the case for recall, precision and f1.
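A small hand-made example may make this concrete. The six labels below are invented for illustration: accuracy is a single number regardless of which class is treated as positive, while precision depends on the chosen pos_label.

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy labels, invented for this example.
y_true = ['spam', 'ham', 'ham', 'spam', 'ham', 'ham']
y_pred = ['spam', 'spam', 'ham', 'ham', 'ham', 'ham']

# Accuracy only counts matches, so no positive class is needed.
acc = accuracy_score(y_true, y_pred)

# Precision needs to know which label counts as "positive".
prec_spam = precision_score(y_true, y_pred, pos_label='spam')
prec_ham = precision_score(y_true, y_pred, pos_label='ham')

print("accuracy:", acc)
print("precision (spam positive):", prec_spam)
print("precision (ham positive):", prec_ham)
```

Here 4 of 6 predictions are correct, 1 of the 2 predicted 'spam' is truly spam, and 3 of the 4 predicted 'ham' are truly ham.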
I encountered the same problem here, and I solved it with
# precision, recall and F1
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
y_train = np.array([number[0] for number in lb.fit_transform(y_train)])
recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recall), recall)
precision = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precision), precision)
f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1', np.mean(f1), f1)
The syntax you showed above is correct. Looks like a problem with the data you're using. The labels don't need to be binarized, as long as they're not continuous numbers.
You can prove out the same syntax with a different dataset:
iris = sklearn.datasets.load_iris()
X_train = iris['data']
y_train = iris['target']
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
print cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')
You could use cross-validation like this to get the f1-score and recall :
print('10-fold cross validation:\n')
start_time = time()
scores = cross_validation.cross_val_score(clf, X,y, cv=10, scoring ='f1')
recall_score=cross_validation.cross_val_score(clf, X,y, cv=10, scoring ='recall')
print(label+" f1: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'DecisionTreeClassifier'))
print("---Classifier %s use %s seconds ---" %('DecisionTreeClassifier', (time() - start_time)))
for more scoring-parameter just see the page
You should specify which of the two labels is the positive one (it could be ham):
from sklearn.metrics import make_scorer, precision_score
precision = make_scorer(precision_score, pos_label="ham")
precision_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring=precision)
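A self-contained sketch of the same idea, assuming a recent scikit-learn and synthetic data relabelled as 'spam'/'ham' (the data, the choice of 'spam' as the positive class and max_iter=1000 are assumptions for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_score

# Synthetic binary data with string labels, mimicking spam/ham.
X, y_num = make_classification(n_samples=200, random_state=0)
y = np.where(y_num == 1, 'spam', 'ham')

classifier = LogisticRegression(max_iter=1000)

# Scorers that state explicitly which label is the positive class.
precision_scorer = make_scorer(precision_score, pos_label='spam')
recall_scorer = make_scorer(recall_score, pos_label='spam')

precision = cross_val_score(classifier, X, y, cv=10, scoring=precision_scorer)
recall = cross_val_score(classifier, X, y, cv=10, scoring=recall_scorer)

print("Precision: %.3f" % precision.mean())
print("Recall: %.3f" % recall.mean())
```

With this approach the labels can stay as strings; no binarization step is needed.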
