I am trying to calculate the ROC AUC score for my data, but it is resulting in nan.
The code:
scoring = 'roc_auc'
kfold= KFold(n_splits=10, random_state=42, shuffle=True)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, df_n, y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))
df_n is an array of the normalised values; I also tried it with just the X values from the dataset.
y is an array of binary values.
df_n shape: (150, 4)
y shape: (150,)
I am stumped, it should work!
The problem is that, for multi-class classification, roc_auc_score expects class probabilities rather than class predictions. With that code, however, the scorer is given the output of predict instead of predict_proba.
Use a new scorer:
from sklearn.metrics import roc_auc_score, make_scorer
multi_roc_scorer = make_scorer(lambda y_in, y_p_in: roc_auc_score(y_in, y_p_in, multi_class='ovr'), needs_proba=True)
from sklearn.model_selection import cross_validate
scores = cross_validate(model, df_n, y, scoring=multi_roc_scorer, cv=kfold, error_score="raise")
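Alternatively, recent scikit-learn versions ship built-in multi-class AUC scorer strings, so a minimal sketch (assuming the model exposes predict_proba, which LinearDiscriminantAnalysis does) is simply:
results = cross_val_score(model, df_n, y, cv=kfold, scoring='roc_auc_ovr')  # 'roc_auc_ovo' is another option
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))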
I was able to output my predicted numerical values to a file, but I was wondering whether it is possible to export the actual string label instead of the numerical value.
Currently it looks like this:
Column 1               Answer Key    Predicted
Something Something    Cars          3
Instead of it returning 3, I would like 3 to be replaced with the actual string it is associated with (for example, Truck).
Here is my code so far. I know I have to change the exporting part of the code, but I cannot seem to figure this out.
texts = df['without_Tags'].astype('str')
vector = TfidfVectorizer(ngram_range=(1, 2), min_df = 2, max_df = .95)
X = vector.fit_transform(texts) #features
LE = LabelEncoder()
df['tower_values'] = LE.fit_transform(df['Tower'])
y = df['tower_values'].values
print(X.shape)
print(y.shape)
lsa = TruncatedSVD(n_components=100, n_iter=10, random_state=3)
X = lsa.fit_transform(X)
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, shuffle = True, stratify = y, random_state = 3)
SG = SGDClassifier(random_state=3, loss='log')
SG.fit(X_train, y_train)
y_pred = SG.predict(X_test)
print("SG model accuracy:", accuracy_score(y_test, y_pred))
print("SG model Recall:", recall_score(y_test, y_pred, average="macro"))
print("SG model Precision:", precision_score(y_test, y_pred, average="macro"))
print("SG model F1 Score:", f1_score(y_test, y_pred, average="macro"))
y_pred = pd.DataFrame(y_pred, columns=['predictions']).to_csv('prediction.csv')
final = pd.read_csv('prediction.csv')
final['pre'] = y_pred
df.to_csv('prediction.csv')
Try using the inverse_transform method of LabelEncoder:
y_pred = LE.inverse_transform(y_pred)
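A minimal sketch of the export step, assuming the fitted LE and the integer y_pred returned by SG.predict (do this before y_pred is overwritten, since to_csv returns None):
labels = LE.inverse_transform(y_pred)   # map the encoded integers back to the original Tower strings
pd.DataFrame({'predictions': labels}).to_csv('prediction.csv', index=False)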
cross_val_score gives different results than LogisticRegressionCV, and I can't figure out why.
Here is my code:
seed = 42
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X),Y, test_size=test_size, random_state=seed)
#Below is my model that I use throughout the program.
model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.
scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)
coef=np.round(model.coef_,2)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is this:
Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%
Why is the maximum accuracy from cross_val_score higher than the accuracy obtained with LogisticRegressionCV?
Also, I recognize that cross_val_score does not return a model, which is why I want to use LogisticRegressionCV, but I am struggling to understand why it is not performing as well. Likewise, I am not sure how to get the standard deviations of the predictors from LogisticRegressionCV.
There are a few points to take into consideration:
Cross-validation is generally used to simulate a validation set (for instance, when the training set is not big enough to be split into training, validation, and test sets), and it only uses the training data. In your case you are comparing cross-validation accuracy on the training data with the model's accuracy on the test data, which makes an exact comparison impossible.
According to the docs:
Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.
If you look at the following snippet, you'll see that this is indeed what happens:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
data = load_breast_cancer()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
'C': np.power(10.0, np.arange(-10, 10)),
}
gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_) # 0.953846153846154
lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max()) # 0.953846153846154
I would also suggest having a look at the documentation of LogisticRegressionCV so as to understand the details of lrcv.scores_[1].mean(axis=0).max().
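In short, lrcv.scores_ is a dict keyed by class label; for a binary problem the entry for class 1 holds the per-fold accuracies for every candidate C, so the expression above averages over folds and then picks the best C. A minimal sketch using the objects defined above:
fold_scores = lrcv.scores_[1]            # shape (n_folds, n_Cs) = (5, 20) here
mean_per_C = fold_scores.mean(axis=0)    # mean accuracy across folds for each candidate C
best_mean = mean_per_C.max()             # best mean accuracy, matching gs.best_score_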
Finally, to get the same results with cross_val_score, you should write:
score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
score.mean() # 0.953846153846154
In the code below (last line), X_test and y_test are used, which according to the docs:
Returns the mean accuracy on the given test data and label
The question is what exactly is calculated, since X_test contains the test data and y_test contains the labels for that data.
It would make sense to check the predicted labels against the actual labels.
Can you tell me how this works in the last line?
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'],
random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
If you check the docs:
Parameters:
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns:
score : float
Mean accuracy of self.predict(X) wrt. y.
What happens behind the scenes?
It predicts with knn.predict(X_test) and then uses those predictions to compute the mean accuracy between knn.predict(X_test) and y_test.
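Roughly, it is equivalent to the following sketch (score_sketch is a hypothetical name used purely for illustration; the real method also supports sample weights):
import numpy as np

def score_sketch(estimator, X, y):
    y_pred = estimator.predict(X)    # predict labels for the given samples
    return np.mean(y_pred == y)      # fraction of predictions matching the true labels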
You should get the same output with sklearn.metrics.accuracy_score:
from sklearn.metrics import accuracy_score
y_predict = knn.predict(X_test)
accuracy_score(y_test, y_predict)
I'm trying to calculate the score of a DecisionTreeRegressor with the following code:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn import tree
# some features, like HouseStyle, might be better handled with LabelEncoder, but the chance that they
# affect the target LotFrontage is small, so we just one-hot encode and drop the unwanted columns later
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood',
'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
encoded_df = encoded_df[['LotFrontage', 'LotArea', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3',
'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside']]
# impute LotFrontage with the mean value (we saw a low ratio of outliers, so we use the mean)
encoded_df['LotFrontage'].fillna(encoded_df['LotFrontage'].mean(), inplace=True)
X = encoded_df.drop('LotFrontage', axis=1)
y = encoded_df['LotFrontage'].astype('int32')
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = DecisionTreeRegressor()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_test = y_test.values.reshape(-1, 1)
classifier.score(y_test, y_pred)
print("Accuracy is: ", accuracy_score(y_test, y_pred) * 100)
When it gets to calculating the score of the model, I get the following error:
ValueError: Number of features of the model must match the input. Model n_features is 9 and input n_features is 1
I am not sure why this happens because, according to the sklearn docs, the test samples should have the shape (n_samples, n_features),
and y_test is indeed in this shape:
y_test.shape # (365, 1)
and the true labels should have the shape (n_samples) or (n_samples, n_outputs), and y_pred is indeed in this shape:
y_pred.shape # (365,)
The dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
The first argument of the score function should not be the target values of the test set; it should be the input features of the test set, so you should do:
classifier.score(X_test, y_test)
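As a side note, a minimal sketch using the variables defined above: for a regressor, score returns the R-squared coefficient of determination, so accuracy_score is not a meaningful metric here; r2_score or mean_squared_error are the ones to report:
from sklearn.metrics import r2_score, mean_squared_error

r2 = classifier.score(X_test, y_test)                         # equivalent to r2_score(y_test, classifier.predict(X_test))
mse = mean_squared_error(y_test, classifier.predict(X_test))
print("R^2: %.3f, MSE: %.3f" % (r2, mse))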
For the code below, my R-squared score is coming out negative, but my accuracy score using k-fold cross-validation is coming out to be 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The link to the dataset is given below:
https://www.kaggle.com/ludobenistant/hr-analytics
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values ##Independent variable
y = dataset.iloc[:,9].values ##Dependent variable
##Encoding the categorical variables
le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x1.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()
##splitting the dataset in training and testing data
from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
There are several issues with your question...
For starters, you are making a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting, where the actual metric used underneath is the mean squared error (MSE).
Accuracy is a metric used in classification, and it has to do with the percentage of the correctly classified examples - check the Wikipedia entry for more details.
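A quick illustration with hypothetical toy arrays: accuracy counts exact label matches, which is meaningless for continuous regression targets, where an error metric such as MSE applies instead:
from sklearn.metrics import accuracy_score, mean_squared_error

accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])      # 0.75 -> 75% of the labels correctly classified
mean_squared_error([1.2, 3.4], [1.1, 3.6])      # 0.025 -> an error value, not a percentage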
The metric used internally in your chosen regressor (Random Forest) is shown in the printed representation returned by your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
verbose=0, warm_start=False)
MSE is a positive continuous quantity and is not upper-bounded by 1, i.e. if you get a value of 0.92, it means... well, 0.92, and not 92%. (Note also that, with no scoring argument, cross_val_score falls back to the regressor's own score method, which for RandomForestRegressor is R-squared rather than accuracy, so the 0.92 is not a percentage of correct predictions in any case.)
Knowing that, it is good practice to explicitly include the MSE as the scoring function of your cross-validation:
cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:
train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0
But, as always, the moment of truth comes when you apply your model to the test set; your second mistake here is that, since you trained your regressor on a scaled y_train, you should also scale y_test before evaluating (strictly speaking, you should reuse the scaler fitted on the training targets, i.e. sc_y.transform, rather than refitting it):
y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215
and you get a very nice R-squared in the test set (close to 1).
What about the MSE?
from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051
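As a sanity check tying the two metrics together (a small sketch using the arrays above): R-squared = 1 - MSE / Var(y_test), so a tiny MSE relative to the target variance yields an R-squared close to 1.
import numpy as np
1 - mse_test / np.var(y_test)   # ~0.9998, matching the r2_score result above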