I want to calculate and print roc_auc_score to evaluate the performance of my model (a random forest). I am doing NLP, so the data in y_test and y_pred are basically lists of words, which I vectorise with pipe_vect.transform. When I launch the training I get the following error:
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
I think the error appears because y_test and y_pred don't have the same dimensions.
My code:
x_test_vect = pipe_vect.transform(x_test)
y_pred = model.predict_proba(x_test_vect)
auc_score = roc_auc_score(y_test, y_pred)
print('Performance model :', auc_score)
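The message itself points at y_true, so one thing worth checking is whether y_test really contains both classes. Separately, for a binary problem roc_auc_score expects one score per sample, while predict_proba returns one column per class. Here is a minimal sketch on synthetic data; make_classification and the forest settings are stand-ins I made up, not the original pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorised text features (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# ROC AUC needs both classes in y_test; stratify=y above keeps both in the split.
classes_in_test = np.unique(y_test)

# predict_proba returns one column per class; keep the positive-class column.
y_score = model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_score)
print('Classes in y_test:', classes_in_test)
print('Performance model :', auc_score)
```

If classes_in_test shows a single value on the real data, the split (or the label column itself) is the place to look, not the array shapes.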
I have a DataFrame X with a column called target that holds 10 different labels: [0,1,2,3,4,5,6,7,8,9]. I have a machine learning model, say model=AdaBoostClassifier(), that I would like to fit to the data and then use to predict the labels again, training it through cross-validation. I use two metrics for the cross-validation, accuracy and neg_mean_squared_error, to evaluate the performance and compute the ratio neg_mean_squared_error/accuracy. The lines look like this:
model.seed = 42
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring=('accuracy', 'neg_mean_squared_error')
scores = cross_validate(model, X.drop(target,axis=1), X[target], cv=outer_cv, n_jobs=-1, scoring=scoring)
scores = abs(np.sqrt(np.mean(scores['test_neg_mean_squared_error'])*-1))/np.mean(scores['test_accuracy'])
score_description = [model,'{model}'.format(model=model.__class__.__name__),"%0.5f" % scores]
However, whenever I run this, I get the following error message:
ValueError: Samplewise metrics are not available outside of multilabel classification.
How can I solve this issue with the metrics and perform the corresponding classification? Which metrics could I use to evaluate the performance of the model in the multi-label case?
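As far as I know, scikit-learn raises this message when a metric is averaged per sample (average='samples', for example the 'f1_samples' scorer) on a target that is not multilabel. A single column with labels 0 to 9 is a multiclass target rather than a multilabel one. Below is a minimal sketch with scorers that are defined for multiclass targets. load_digits is only a stand-in for the DataFrame, and passing random_state to the constructor is my assumption about what model.seed was meant to do.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Ten-class toy data standing in for the DataFrame (assumption).
X_digits, y_digits = load_digits(return_X_y=True)

model = AdaBoostClassifier(n_estimators=20, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Scorers that are defined for single-label multiclass targets.
scoring = ('accuracy', 'f1_macro', 'balanced_accuracy')
scores = cross_validate(model, X_digits, y_digits, cv=outer_cv, scoring=scoring)

# cross_validate stores each scorer's fold results under 'test_<name>'.
mean_scores = {name: np.mean(scores['test_' + name]) for name in scoring}
print(mean_scores)
```

Note that neg_mean_squared_error treats the class labels as numbers, so predicting 9 instead of 0 counts as a much larger error than predicting 1 instead of 0, which is rarely meaningful for unordered classes.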
I am trying to train a logistic regression model with data as follows:
Categorical Variable: either 0 or 1
Numerical Variables: Continuous number between 8 and 20
I have 20 numerical variables and I want to use only one at a time in the predictive model, to see which is the best feature to use.
The code I'm using is:
for variable in numerical_variable:
    X = data[[variable]]
    y = data[categorical_variable]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)
    print(y_pred)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
The categorical variable is biased towards 1, there are about 800 1s to 200 0s. So I think this is why it always predicts one, regardless of the test samples (if I don't set random_state=0) and regardless of the numerical variable.
(using python 3)
Any thoughts on how to fix this?
Thanks
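One thing that may help with the imbalance is to stratify the split and let the model reweight the classes. This is only a sketch on made-up data: the 800/200 ratio and the 8 to 20 range come from the question, but the two normal distributions are my assumption so that the single feature carries some signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 800 ones and 200 zeros, one feature in [8, 20] (assumption).
y = np.array([1] * 800 + [0] * 200)
x = np.where(y == 1, rng.normal(15, 2, y.size), rng.normal(12, 2, y.size))
X = np.clip(x, 8, 20).reshape(-1, 1)

# stratify=y keeps the 80/20 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# class_weight='balanced' weights samples inversely to their class frequency.
logreg = LogisticRegression(class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("Recall for class 0:", recall_score(y_test, y_pred, pos_label=0))
```

With an 80/20 split, always predicting 1 already gives about 80% accuracy, so the confusion matrix and the recall of class 0 say more than accuracy alone.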
Use the joblib library to save your model:
import joblib
your_model = LogisticRegression()
your_model.fit(X_train, y_train)
filename = 'finalized_model.sav'
joblib.dump(your_model, filename)
This code saves your model as 'finalized_model.sav'. The extension doesn't matter; you can even leave it out.
Then you can load that exact fitted model with the following code, so it gives the same predictions every time:
your_loaded_model = joblib.load('finalized_model.sav')
As a prediction example:
your_loaded_model.predict(X_test)
I'm trying to build a linear model in sklearn, and I want to test the model I have implemented using some error functions.
First I chose the features for my X and y.
#Predict the average parking rates per month
X = df[['Number of weekly riders', 'Price per week',
'Population of city', 'Monthly income of riders']]
y = df['Average parking rates per month']
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#only 20% test size because we are working with a small dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)
After fitting the model, I tried to use some of the error functions from sklearn's metrics package,
but apparently I can't use any of them, because the test and train data don't have the same number of samples:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_train))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_train))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_train)))
ValueError: Found input variables with inconsistent numbers of samples: [6, 21]
Is it really true that you need train and test data of the same size in order to run the error functions?
When you use a train/test split, you want to separate the training data from the test data.
The idea is that you train your algorithm on the training data and then test it on unseen data. So none of the metrics make sense when applied to y_train and y_test. What you want to compare is the prediction and y_test, which works like this:
y_pred_test = lm.predict(X_test)
metrics.mean_absolute_error(y_test, y_pred_test)
It is also possible to get an idea on the training scores; you can do that by predicting on the training data:
y_pred_train = lm.predict(X_train)
metrics.mean_absolute_error(y_train, y_pred_train)
You want to compare y_test with the predictions your regressor produces for X_test.
The following will raise an inconsistent numbers of samples error.
metrics.mean_absolute_error(y_test, y_train)
The reason is that the training set and the test set have different numbers of rows.
In the rare case of them having the same number of rows, the above statement still doesn't make sense: there's no use of comparing the test set labels to training set labels.
Instead, you should obtain the predictions for your test features (X_test) by passing X_test to lm:
y_hat = lm.predict(X_test) # y_hat: predictions
Then, these metrics would make sense:
metrics.mean_absolute_error(y_test, y_hat)
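Putting the pieces together, here is a self-contained sketch of the whole flow. The data is random and only a stand-in (the real df is not available); I picked 27 rows so that the split reproduces the 21/6 sizes from the error message.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(101)

# Synthetic stand-in for the four features and the target (assumption): 27 rows.
X = rng.normal(size=(27, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=27)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)

# Predictions have one value per test row, so they line up with y_test.
y_pred_test = lm.predict(X_test)
print(len(y_train), len(y_test), len(y_pred_test))

mae = metrics.mean_absolute_error(y_test, y_pred_test)
mse = metrics.mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
```

The metric functions only require that their two arguments have the same length, which is always true for y_test and the predictions on X_test, whatever the train/test ratio is.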
I am doing isolation forest clustering on the mulcross database, which has 2 classes. I divide my data into training and test sets and try to calculate the accuracy score, the roc_auc_score and the confusion_matrix on my test set. But there are two problems. The first is that in a clustering method I should not use the labels in the training phase, meaning that "y_train" should not be mentioned, but I did not find another way to evaluate my model. Moreover, the results I get are wrong.
My question is how to evaluate a clustering model like isolation forest.
Here is my code:
df = pd.read_csv('db.csv')
y_true=df['Target']
df_data=df.drop('Target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_data, y_true, test_size=0.3, random_state=42)
alg=IsolationForest(n_estimators=100, max_samples= 256 , contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0, behaviour="new")
model = alg.fit(X_train, y_train)
preds = alg.predict(X_test)
print("#############################\n#############################")
print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, preds))
cm = confusion_matrix(y_test, preds)
print(cm)
print("#############################\n#############################")
I do not understand why you are clustering and also dividing the data into training/testing sets. It seems to me that you are mixing up classification and clustering, or something like that. If you have labels, try a supervised method. Easy wins are xgboost, random forest, GLM, logistic regression, etc.
If you want to evaluate clustering methods, you can investigate the inter- and intra-cluster distances. At the end of the day, you want to have small and well-separated clusters. You can look at a metric called silhouette too.
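You can also try the following, which reports the fraction of test points predicted as inliers (label 1); it equals the accuracy only if every test point is truly normal: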
print("Accuracy:", list(y_pred_test).count(1)/y_pred_test.shape[0])
also, look here for some more details.
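If labels are available for the test set, one way to score an isolation forest is to fit it without labels and then translate its output into the label encoding before calling the metrics. IsolationForest.predict returns +1 for inliers and -1 for outliers, so comparing it directly with 0/1 labels gives misleading numbers. The sketch below uses made-up data and assumes that 1 marks an anomaly in Target; both are my assumptions, not facts about the mulcross file.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in (assumption): 900 normal points, 100 scattered anomalies.
X = np.vstack([rng.normal(0, 1, size=(900, 2)), rng.uniform(-8, 8, size=(100, 2))])
y_true = np.array([0] * 900 + [1] * 100)  # 1 = anomaly (assumed encoding)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_true, test_size=0.3, stratify=y_true, random_state=42
)

alg = IsolationForest(n_estimators=100, max_samples=256, contamination=0.1, random_state=42)
alg.fit(X_train)  # unsupervised: labels are not passed

# predict returns +1 for inliers and -1 for outliers; map to the 0/1 encoding.
preds = np.where(alg.predict(X_test) == -1, 1, 0)

# score_samples is lower for more abnormal points, so negate it for ROC AUC.
anomaly_score = -alg.score_samples(X_test)
auc = roc_auc_score(y_test, anomaly_score)

print(accuracy_score(y_test, preds))
print(auc)
print(confusion_matrix(y_test, preds))
```

Using the continuous score for ROC AUC instead of the hard -1/+1 predictions lets the curve sweep over thresholds rather than evaluating the single cut implied by contamination.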
While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought that I was calculating R^2 for the entire data set (instead of comparing the training and original data). However, I learned that you must fit the model before you score it, therefore I'm not sure if I'm scoring the original data (as inputted in R2) or the data that I used to fit the model (X_train, and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result than when I scored X and y. So my question is: are the inputs to .score evaluated against the model that was fitted (so that lm.fit(X,y); lm.score(X,y) gives the R^2 for the original data, and lm.fit(X_train, y_train); lm.score(X,y) gives the R^2 for the original data based on the model created in .fit), or is something else entirely happening?
fit() only fits the model to the data, which is synonymous with training: fitting the data means training on it.
score() is more like testing: it predicts on the data you pass in and evaluates those predictions.
So one should use different datasets for training the classifier and for testing its accuracy.
One can do it like this:
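from sklearn import model_selection, neighbors
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)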
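For the linear regression case in the question, the same logic applies: score(A, b) uses whatever model the last fit produced, predicts on A, and returns R^2 against b. A small sketch on synthetic data (the data and coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data standing in for X and y (assumption).
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lm = LinearRegression()
lm.fit(X_train, y_train)  # coefficients come from the training rows only

# score(A, b) predicts on A with the fitted model and returns R^2 against b.
r2_train = lm.score(X_train, y_train)
r2_test = lm.score(X_test, y_test)
r2_all = lm.score(X, y)  # training and test rows mixed together

print(r2_train, r2_test, r2_all)
print(np.isclose(r2_test, r2_score(y_test, lm.predict(X_test))))
```

So lm.fit(X_train, y_train); lm.score(X, y) is the R^2 of the training-set model evaluated on all rows, which mixes seen and unseen data; lm.score(X_test, y_test) is usually the number to report.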