Calculating RMSE of test set using SVD recommendation - python

I'm trying to calculate the RMSE of a truncated SVD recommender on a train/test split of the MovieLens dataset. After model_selection.train_test_split on the full 610 x 9724 matrix (610 users x 9724 movies, ratings 1-5), the training matrix has shape 610 x 9379 and the validation matrix has shape 603 x 3653. After running SVD on the training data I can compute the RMSE between the training matrix and the reconstructed matrix A = U S V^T, but how do I compute the RMSE between the predicted ratings and the test data, which has different dimensions and covers a different set of movies than the training data?
Code is below. I can compute the RMSE between the predicted ratings and the training set, but I want the RMSE between the predicted ratings and the test set.
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.metrics import mean_squared_error
from sklearn.utils.extmath import randomized_svd

df = pd.read_csv('data/ml-latest-small/ratings.csv')
num_users = df['userId'].nunique()
num_movies = df['movieId'].nunique()

df_train, df_valid = model_selection.train_test_split(
    df, test_size=0.1, random_state=42
)

train_data = df_train.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
test_data = df_valid.pivot_table(index='userId', columns='movieId', values='rating')

U, Sigma, VT = randomized_svd(np.array(train_data),
                              n_components=15,
                              n_iter=5,
                              random_state=None)
predicted = U @ np.diag(Sigma) @ VT

print("full matrix", num_users, num_movies)  # full matrix 610 9724
print("train data", train_data.shape)        # train data (610, 9379)
print("predicted", predicted.shape)          # predicted (610, 9379)
print("validation", test_data.shape)         # validation (603, 3653)

rmse = np.sqrt(mean_squared_error(test_data, predicted))  # fails: shapes don't match
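One common way to evaluate on the held-out ratings (not shown in the post; a sketch using the variables above) is to keep the validation set in long (userId, movieId, rating) form and only score the pairs whose user and movie both appear in the training matrix:
# Wrap the reconstruction in a DataFrame so we can look up (userId, movieId) pairs by label
predicted_df = pd.DataFrame(U @ np.diag(Sigma) @ VT,
                            index=train_data.index,
                            columns=train_data.columns)

# Keep only validation ratings whose user AND movie exist in the training matrix
known = df_valid[df_valid['userId'].isin(train_data.index)
                 & df_valid['movieId'].isin(train_data.columns)]

# Look up the predicted rating for each held-out (user, movie) pair
preds = [predicted_df.at[u, m] for u, m in zip(known['userId'], known['movieId'])]

rmse_test = np.sqrt(mean_squared_error(known['rating'], preds))
print("test RMSE", rmse_test)
Ratings for movies (or users) that never appear in the training split cannot be predicted by this factorization at all, so they are simply skipped here.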

Related

How to convert XGBoost model SHAP values from log odds to probabilities?

I trained an XGBoost classifier and am trying to generate SHAP contributions as probabilities. I understand that the output of shap.TreeExplainer for XGBoost models is in log-odds. I expected the explainer's expected_value to be equal or close to the average predicted probability on the dataset. However, I get an expected_value of -2.7776 (explainer.expected_value), which corresponds to a probability of 0.0585 (expit(-2.7776)). This is significantly lower than the average predicted score of 0.21. Is there a step I am missing when converting the expected value to a probability?
# Import libraries
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import shap
from scipy.special import expit
# Generate data
X, Y = make_classification(n_samples=10000,
n_features=20,
n_redundant=0,
n_classes=2,
random_state=17,
weights = [0.8, 0.2])
# Split into train and test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=7)
# Data check
print('Target rate: {:.0%}'.format(sum(Y)/len(Y)))
print('Target rate in train dataset: {:.0%}'.format(sum(y_train)/len(y_train)))
print('Target rate in test dataset: {:.0%}'.format(sum(y_test)/len(y_test)))
print('Total observations: {:.0f}'.format(len(X)))
print('Train observations: {:.0f}'.format(len(x_train)))
print('Test observations: {:.0f}'.format(len(x_test)))
# Train gradient boosting model (sklearn GradientBoostingClassifier)
model = GradientBoostingClassifier(
    n_estimators=50,
    max_depth=3,
    random_state=17
)
model.fit(x_train, y_train)
# Get class predictions and predicted probabilities for train and test datasets
y_pred_class_train = model.predict(x_train)
y_pred_prob_train = model.predict_proba(x_train)
y_pred_class_test = model.predict(x_test)
y_pred_prob_test = model.predict_proba(x_test)
# Get accuracy score and confusion matrix for train and test datasets
# There doesn't seem to be an issue with model performance; it is pretty close for train and test datasets
acc_train = model.score(x_train, y_train)
acc_test = model.score(x_test, y_test)
cm_train = confusion_matrix(y_train, y_pred_class_train, normalize='true')
cl_report_train = classification_report(y_train, y_pred_class_train)
cm_test = confusion_matrix(y_test, y_pred_class_test, normalize='true')
cl_report_test = classification_report(y_test, y_pred_class_test)
# Print results
print('MODEL ACCURACY:\n \
    training data: {:.2%}\n \
    test data: {:.2%}'.format(acc_train, acc_test))
print('\nCONFUSION MATRIX (train data):\n {}'.format(cm_train.round(3)))
print('\nCLASSIFICATION REPORT (train data):\n {}'.format(cl_report_train))
print('\nCONFUSION MATRIX (test data):\n {}'.format(cm_test.round(3)))
print('\nCLASSIFICATION REPORT (test data):\n {}'.format(cl_report_test))
# Check average predicted score
print('Train: Average predicted score: {:.2%}'.format(np.mean(y_pred_prob_train[:, 1])))
print('Test: Average predicted score: {:.2%}'.format(np.mean(y_pred_prob_test[:, 1])))
# Get SHAP values in log odds for test dataset
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(x_test)
# Check SHAP expected value
print('SHAP expected value: {:.4f}'.format(explainer.expected_value[0]))
print('SHAP expected value transformed: {:.4f}'.format(expit(explainer.expected_value[0])))
print('Average predicted value: {:.4f}'.format(np.mean(y_pred_prob_test[:,1])))
# Average predicted value is ~ 0.21 while shap expected value only ~ 0.06.
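The gap is expected: expected_value is an average in log-odds space, and because expit is non-linear, expit(mean(log-odds)) is generally not equal to mean(expit(log-odds)). Below is a quick check of that, plus one possible way to get SHAP values directly in probability space; this is a sketch that assumes a shap version supporting model_output="probability" with an interventional explainer, not the only route:
# Raw margins (log-odds) on the test set
margins_test = model.decision_function(x_test)
print('expit of the mean margin: {:.4f}'.format(expit(margins_test.mean())))  # comparable to expit(explainer.expected_value)
print('mean of expit(margins):   {:.4f}'.format(expit(margins_test).mean()))  # comparable to the average predicted probability

# SHAP values in probability space: interventional explainer with background data
background = x_train[:100]  # a small background sample keeps this tractable
explainer_prob = shap.TreeExplainer(model, data=background,
                                    feature_perturbation="interventional",
                                    model_output="probability")
shap_values_prob = explainer_prob.shap_values(x_test)
print('probability-space expected value:', explainer_prob.expected_value)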

Evaluate classification model's ability to discriminate between different ranges of the outcome label

I would like to evaluate my model's ability to discriminate between people with prediabetes (HbA1c 5.7-6.4%) and type 2 diabetes (HbA1c > 6.4%).
My outcome label (y_test) is HbA1c > 5.7%, defining unhealthy people with undiagnosed diabetes or a prediabetic condition.
How do I separate the two ranges, compare the predicted values with the actual values, and calculate the sensitivity for each?
The example below uses a logistic regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
X_train = X_train[feat_cols]
X_test = X_test[feat_cols]

# Building LGR and evaluating on training data
LGR = LogisticRegression(max_iter=100, random_state=1)
LGR.fit(X_train, y_train)

def evaluate_model(LGR, X_test, y_test):
    # Predict test data
    y_pred = LGR.predict(X_test)
    # Calculate accuracy, precision, sensitivity and specificity
    acc = metrics.accuracy_score(y_test, y_pred)
    prec = metrics.precision_score(y_test, y_pred)
    sen = metrics.recall_score(y_test, y_pred, pos_label=1)
    spe = metrics.recall_score(y_test, y_pred, pos_label=0)
    # Calculate area under the ROC curve (AUC)
    y_pred_proba = LGR.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
    auc = metrics.roc_auc_score(y_test, y_pred_proba)
    # Confusion matrix
    cm = metrics.confusion_matrix(y_test, y_pred)
    return {'acc': acc, 'prec': prec, 'sen': sen, 'spe': spe,
            'fpr': fpr, 'tpr': tpr, 'auc': auc, 'cm': cm}

LGR_eval = evaluate_model(LGR, X_test, y_test)

# Print result
print('Accuracy:', LGR_eval['acc'])
print('Precision:', LGR_eval['prec'])
print('Sensitivity:', LGR_eval['sen'])
print('Specificity:', LGR_eval['spe'])
print('Area Under Curve:', LGR_eval['auc'])
print('Confusion Matrix:\n', LGR_eval['cm'])
Accuracy: 0.7315175097276264
Precision: 0.711340206185567
Sensitivity: 0.7439353099730458
Specificity: 0.72
Area Under Curve: 0.8036994609164421
Confusion Matrix:
[[288 112]
[ 95 276]]
(Answering comment above)
Because you are using logistic regression (a binary classifier) rather than a model with a continuous output, it will be difficult to achieve what you are asking for without some changes.
One option is to switch to a linear regression model, predict the HbA1c value directly, and then bin the prediction into the < 5.7, 5.7-6.4, or > 6.4 range. That way you can keep using the metrics you used above.
The other option depends on your Y data: does it contain labels for the different conditions, or is it just labelled healthy / unhealthy? If you add another label corresponding to those ranges (much like above), you can turn this into a multi-class prediction problem, still use logistic regression, and then inspect your metrics for the classes you care about.
Edit in response to comment:
Below is the function from comments.
def IsAtRisk(x):
    if x < 5.7:
        return 0
    return 1

df['IsAtRisk'] = df['LBXGH'].map(IsAtRisk)
print(df)
print(f"{len(df[df['IsAtRisk'] == True])} of {len(df)} people are at risk")
If you instead encode the range you are asking about as an additional class, you will have labels for the different classes and can measure how the model performs on each of them.
def IsAtRisk(x):
    if x < 5.7:
        return 0
    elif x <= 6.4:
        return 1
    return 2
But for this to work you will most likely need to format the labels differently, depending on your model structure. If you share your model structure and output layer, it would help.
Most likely you will want to restructure your Y labels into something like
y_sample = [1, 0, 0]  # probability 1 for class 0, which may be the healthy individuals in your dataset
# With this in mind, you can change the return values of the function above to return arrays of labels instead of ints.
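If you stay within scikit-learn rather than a neural network, integer class labels are enough and no one-hot encoding is needed. A minimal sketch of the multi-class route (df, X, and the LBXGH column come from the snippets above; everything else is illustrative):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def risk_class(hba1c):
    # 0 = healthy (< 5.7), 1 = prediabetes (5.7-6.4), 2 = type 2 diabetes (> 6.4)
    if hba1c < 5.7:
        return 0
    elif hba1c <= 6.4:
        return 1
    return 2

y_multi = df['LBXGH'].map(risk_class)
X_train, X_test, y_train, y_test = train_test_split(X, y_multi, test_size=0.20, random_state=1)

clf = LogisticRegression(max_iter=1000)  # handles the three classes natively
clf.fit(X_train, y_train)

# The per-class recall in this report is the sensitivity for each range separately
print(classification_report(y_test, clf.predict(X_test),
                            target_names=['healthy', 'prediabetes', 'diabetes']))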

How to Compare Random Forest (without scaling) and LSTM (with scaling) using RMSE and MAE Performance Metrics

I am new to machine learning and am trying my hand at Bitcoin price prediction using several models: random forest, simple linear regression, and a neural network (LSTM).
As far as I have read, random forest and linear regression don't require input feature scaling, whereas an LSTM does need the input features to be scaled.
If I compare the MAE and RMSE across the algorithms (some computed on scaled data, some on unscaled data), the numbers are on different scales, so I can't tell which model performs better.
How should I compare the performance of these models now?
Update - Adding my code
Data
bitcoinData = pd.DataFrame(
    [['2013-04-01 00:07:00', 93.25, 93.30, 93.30, 93.25, 93.300000],
     ['2013-04-01 00:08:00', 100.00, 100.00, 100.00, 100.00, 93.300000],
     ['2013-04-01 00:09:00', 93.30, 93.30, 93.30, 93.30, 33.676862]],
    columns=['time', 'open', 'close', 'high', 'low', 'volume'])
bitcoinData.time = pd.to_datetime(bitcoinData.time)
bitcoinData = bitcoinData.set_index(['time'])
x_train = train_data[['high','low','open','volume']]
y_train = train_data[['close']]
x_test = test_data[['high','low','open','volume']]
y_test = test_data[['close']]
Min-Max Scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaler1 = MinMaxScaler(feature_range=(0, 1))
x_train = scaler.fit_transform(x_train)
y_train = scaler1.fit_transform(y_train)
x_test = scaler.transform(x_test)
y_test = scaler1.transform(y_test)
RMSE / MAE Calculation
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
print("Root Mean Squared Error (RMSE): ", sqrt(mean_squared_error(y_test, preds)))
print("Mean Absolute Error (MAE): ", mean_absolute_error(y_test, preds))
r2 = r2_score(y_test, preds)
print("R Squared (R2): ", r2)
You scale your input data, not the output.
The input data is irrelevant to your error calculation.
If you really want to scale your LSTM's output data, just scale it the same way for the other models.
EDIT:
From your comment:
I only scaled my input data in LSTM
No, you don't. You do transform your output data, and from what I read I assume you only transform it for the neural network.
So your y data for the LSTM is around 100 times smaller; since the error is squared, you get a factor of 100 * 100 = 10,000, which is roughly the factor by which your neural net appears to perform "better" than the random forest.
Option 1:
Remove these three lines:
scaler1 = MinMaxScaler(feature_range=(0, 1))
y_train = scaler1.fit_transform(y_train)
y_test = scaler1.transform(y_test)
Don't forget to use a final layer that can output unbounded values (for regression this usually means a linear activation).
Option 2:
Scale the data for your other classifiers as well and compare the scaled values.
Option 3:
Use the inverse_transform method of your MinMaxScaler on your predictions, and calculate your errors using the inverse-transformed predictions and the untransformed y_test data.
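A minimal sketch of option 3, assuming preds holds the LSTM predictions on the scaled inputs and that an unscaled copy of the targets (called y_test_orig here, not in the original code) was kept aside before calling scaler1.transform:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Undo the target scaling so the errors are in actual price units
preds_unscaled = scaler1.inverse_transform(np.asarray(preds).reshape(-1, 1))

# y_test_orig is the unscaled close price kept before scaler1.transform was applied
print("RMSE (price units):", np.sqrt(mean_squared_error(y_test_orig, preds_unscaled)))
print("MAE  (price units):", mean_absolute_error(y_test_orig, preds_unscaled))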

Oversampling with Leave One Out Cross Validation

I am working with an extremely imbalanced dataset with a total of 44 samples for my research project. It is a binary classification problem with 3 of the 44 samples in the minority class, and I am using Leave One Out Cross Validation (LOOCV). If I perform SMOTE oversampling on the entire dataset before the LOOCV loop, both the prediction accuracy and the ROC AUC are close to 90% and 0.9, respectively. However, if I oversample only the training set inside the LOOCV loop, which is the more logical approach, the ROC AUC falls as low as 0.3.
I also tried precision-recall curves and stratified k-fold cross validation, and saw a similar gap between oversampling outside and inside the loop.
Please suggest the right place to oversample and, if possible, explain the difference.
Oversampling inside the loop:
i = 0
acc_dec = 0
y_test_dec = []  # Store y_test for every split
y_pred_dec = []  # Store probability of the positive label for every split

for train, test in loo.split(X):  # Leave One Out Cross Validation
    # Create training and test sets for the split indices
    X_train = X.loc[train]
    y_train = Y.loc[train]
    X_test = X.loc[test]
    y_test = Y.loc[test]
    # Oversample the minority class in the training fold using SMOTE
    sm = SMOTE(sampling_strategy='minority', k_neighbors=1)
    X_res, y_res = sm.fit_resample(X_train, y_train)
    # KNN
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    acc_dec = acc_dec + metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1

# Compute ROC curve and ROC area
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print(str(acc_dec / i * 100) + "%")
AUC: 0.25
Accuracy: 68.1%
Oversampling outside the loop:
acc_dec = 0  # accumulated accuracy across splits
y_test_dec = []  # Store y_test for every split
y_pred_dec = []  # Store probability of the positive label for every split
i = 0

# Oversampling before the loop
sm = SMOTE(k_neighbors=1)
X, Y = sm.fit_resample(X, Y)
X = pd.DataFrame(X)
Y = pd.DataFrame(Y)

for train, test in loo.split(X):  # Leave One Out Cross Validation
    # Create training and test sets for the split indices
    X_train = X.loc[train]
    y_train = Y.loc[train]
    X_test = X.loc[test]
    y_test = Y.loc[test]
    # KNN
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc_dec = acc_dec + metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1

# Compute ROC curve and ROC area
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print(str(acc_dec / i * 100) + "%")
AUC: 0.99
Accuracy: 90.24%
How can these two approaches lead to such different results? Which one should I follow?
Doing upsampling (like SMOTE) before you split your data means the synthetic samples that end up in your test set are interpolated from samples that are also in your training set, so the test set is no longer independent of the training data. This is sometimes called "leakage". Your first setup (oversampling inside the loop) is, unfortunately, the correct one.
Here's a post walking through this problem.
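If you want to keep the resampling strictly inside each training fold without writing the loop by hand, an imbalanced-learn pipeline does that for you, since the sampler is only applied during fit and never to the held-out sample. A sketch, assuming imbalanced-learn is installed and X, Y are the frames from the question:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn import metrics

pipe = Pipeline([
    ('smote', SMOTE(sampling_strategy='minority', k_neighbors=1)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

y = Y.values.ravel()
# The pipeline is refit on every training fold; SMOTE never sees the left-out sample
probs = cross_val_predict(pipe, X, y, cv=LeaveOneOut(), method='predict_proba')[:, 1]

fpr, tpr, _ = metrics.roc_curve(y, probs, pos_label=1)
print("LOOCV AUC:", metrics.auc(fpr, tpr))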

Find data points whose predictions are way off

I am using pandas and sklearn to build a price prediction model. I split the dataset into train and test sets, then fit the model and predict.
X and y are pandas DataFrames.
X_train, X_test, y_train, y_test = train_test_split(X, y)
y_pred = model.predict(X_test)
difference = np.abs(np.subtract(y_pred, y_test))
# define my own percentage-based accuracy measure instead of MAE
accuracy = np.divide(np.abs(np.subtract(y_pred, y_test)), y_test)
But how can I filter the rows with the worst accuracy so I can explore the badly-predicted data in pandas?
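One way to do this (a sketch, assuming y_test is a single-column target as above) is to put the actuals, predictions, and relative errors into one DataFrame alongside the features and sort by the error:
# Collect everything in one frame so the features of badly-predicted rows are easy to inspect
results = X_test.copy()
results['actual'] = np.asarray(y_test).ravel()
results['predicted'] = np.asarray(y_pred).ravel()
results['rel_error'] = (results['predicted'] - results['actual']).abs() / results['actual']

# The 20 rows with the largest relative error, i.e. the worst predictions
worst = results.sort_values('rel_error', ascending=False).head(20)
print(worst)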
