Residual plot for residual vs predicted value in Python

I have run a KNN model. Now I want to plot residuals against predicted values. Every example I find on different websites first runs a linear regression model, and I couldn't understand how to do this for KNN. Can anyone help? Thanks in advance.
Here is my model:
import numpy as np
from sklearn import neighbors

# 60/20/20 train/validation/test split of the shuffled DataFrame
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
x_train = train.iloc[:, [2, 5]].values
y_train = train.iloc[:, 4].values
x_validate = validate.iloc[:, [2, 5]].values
y_validate = validate.iloc[:, 4].values
x_test = test.iloc[:, [2, 5]].values
y_test = test.iloc[:, 4].values

clf = neighbors.KNeighborsRegressor(n_neighbors=6)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_validate)

Residuals are nothing but how much your predicted values differ from the actual values, so they are calculated as actual values - predicted values. In your case y_pred comes from x_validate, so residuals = y_validate - y_pred. Now for the plot, just use this:
import matplotlib.pyplot as plt
residuals = y_validate - y_pred
plt.scatter(y_pred, residuals)   # predicted values on the x-axis, residuals on the y-axis
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()

What exactly is the question? The residuals are simply the actual values minus the predictions; with your code that is y_validate - y_pred, since y_pred was computed on the validation set. Then use seaborn's regplot to plot them against the predictions.
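A minimal sketch of that, assuming the variable names from the question above (y_pred was computed on the validation set, so the residuals use y_validate):
import seaborn as sns
import matplotlib.pyplot as plt

# Residuals of the KNN model on the validation set
residuals = y_validate - y_pred

# regplot draws the scatter of residuals against predictions plus a fitted trend line;
# a roughly flat line around zero means the errors show no obvious pattern
sns.regplot(x=y_pred, y=residuals)
plt.axhline(0, color='grey', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()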

Related

How can I explain a part of my predictions as a whole with SHAP values? (and not every single prediction)

Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by these probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier's 500 highest-probability predictions taken as a whole, not for each individual prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all of the predictions, only the predictions with the highest probability. Is that code suitable to answer this question? If not, what would suitable code look like?

Feature importance using gridsearchcv for logistic regression

I've trained a logistic regression model like this:
reg = LogisticRegression(random_state=40)
cvreg = GridSearchCV(reg,
                     param_grid={'C': [0.05, 0.1, 0.5],
                                 'penalty': ['none', 'l1', 'l2'],
                                 'solver': ['saga']},
                     cv=5)
cvreg.fit(X_train, y_train)
Now to show the feature's importance I've tried this code, but I don't get the names of the coefficients in the plot:
from matplotlib import pyplot
importance = cvreg.best_estimator_.coef_[0]
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Obviously, the plot isn't very informative. How do I add the names of the coefficients to the x-axis?
The coefficients of the best estimator are:
cvreg.best_estimator_.coef_
array([[1.10303023e+00, 7.48816905e-01, 4.27705027e-04, 6.01404570e-01]])
The coefficients correspond to the columns of X_train, so pass in the X_train names instead of range(len(importance)).
Assuming X_train is a pandas dataframe:
import matplotlib.pyplot as plt
features = X_train.columns
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()
Note that if X_train is just a numpy array without column names, you will have to define the features list based on your own data dictionary.
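For example, a minimal sketch assuming X_train is a plain NumPy array with four columns (the names below are placeholders, not the question's real column names):
import matplotlib.pyplot as plt

# Replace these placeholder names with the real column order of your array
features = ['feature_1', 'feature_2', 'feature_3', 'feature_4']
importance = cvreg.best_estimator_.coef_[0]

plt.bar(features, importance)
plt.xticks(rotation=45)
plt.show()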

The intercept are half of the real values in logistic regression

For a scientific study, I need to analyze a traditional logistic regression using Python and scikit-learn. After fitting my regression model with penalty="none", I get the correct coefficients but the intercept is half of the real value. My code is mostly as follows:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_excel("data.xlsx")
train, test = train_test_split(df, train_size=0.8, random_state=42)
train = train.drop(["Unnamed: 0"], axis=1)
test = test.drop(["Unnamed: 0"], axis=1)

x_train = train.drop(["GRUP"], axis=1)
x_train = sm.add_constant(x_train)
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis=1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]

# statsmodels
model = sm.Logit(y_train, x_train).fit()
model.summary()

# scikit-learn
log = LogisticRegression(penalty="none")
log.fit(x_train, y_train)
log.intercept_
With statsmodels I get the intercept (constant) 28.7140, but with scikit-learn I get 14.35698738. The other coefficients are the same. I verified it with SPSS, and the first value is the correct one. I don't want to use statsmodels only for the logistic regression. Could you please help?
PS: Without the intercept the model works fine.
The issue here is that in the code you posted you add a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train). You then pass that same x_train to sklearn's LogisticRegression(), where the default value of fit_intercept is True. At that stage the model contains two constant terms, and the overall intercept gets split between them, which is why the reported intercept is roughly half of the true value.
So you should either set fit_intercept=False in the sklearn code, or keep fit_intercept=True but fit on the x_train array without the added constant term.
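A short sketch of both options, using the variable names from the question and penalty="none" as in your code (newer scikit-learn versions spell this penalty=None); sm.add_constant names the added column 'const':
from sklearn.linear_model import LogisticRegression

# Option 1: keep the constant column and turn off sklearn's own intercept
log = LogisticRegression(penalty="none", fit_intercept=False)
log.fit(x_train, y_train)
# the intercept is now the coefficient of the 'const' column (the first entry of log.coef_[0])

# Option 2: drop the constant column and let sklearn fit the intercept itself
x_train_noconst = x_train.drop(columns="const")
log = LogisticRegression(penalty="none")   # fit_intercept=True is the default
log.fit(x_train_noconst, y_train)
log.intercept_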

Why "prediction space" is needed?

It's an old exercise about prediction using regression, exploring the Gapminder data. The exercise uses a "prediction space" to compute predictions.
Q1. Why should I create a "prediction space"? What is the use of it?
Q2. How is computing predictions over the "prediction space" related to the fitted model?
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')
The data looks like this:
Country,Year,life,population,income,region
Afghanistan,1800,28.211,3280000,603.0,South Asia
Slovak Republic,1960,70.47800000000001,4137224,8693.0,Europe & Central Asia
# Create arrays for the feature and target variable
y = df.life.values
X_fertility = df.fertility.values

# Reshape X_fertility and y into column vectors
y = y.reshape(-1, 1)
X_fertility = X_fertility.reshape(-1, 1)

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space: 50 evenly spaced fertility values covering the observed range
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1, 1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
I believe that you are taking a course from DataCamp.
I stumbled over this too, and the answer is that prediction_space and y_pred are used to construct the fitted straight line in the graph.
NOTE: for those who are reading this and don't understand what I'm talking about, the code snippet above is actually missing the graph-plotting code:
# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
The y_pred computed over the prediction space just traces that fitted line; residuals and the R^2 value are then calculated against the observed data, not against the prediction space.
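For completeness, a small sketch (same variable names as above) that overlays the observed data on the fitted line and prints the R^2 computed on the observed data:
import matplotlib.pyplot as plt

# Observed data points plus the fitted regression line
plt.scatter(X_fertility, y, color='blue', alpha=0.4)
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.xlabel('fertility')
plt.ylabel('life expectancy')
plt.show()

# R^2 of the fit, evaluated on the observed data
print(reg.score(X_fertility, y))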

Predictions using Support Vector Regression

In my problem there are four features (X): a, b, c, d, and two dependent variables (Y): e and f. I have a dataset containing values for all of these variables. How can I predict the values of e and f through Support Vector Regression using scikit-learn in Python, when new a, b, c, d values are given?
I'm very new to ML and I would really appreciate some guidance, since I found it very difficult to follow the scikit-learn documentation on SVR.
This is what I have done so far with the help of an example in the sklearn documentation.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR

train = pd.read_csv('/Desktop/test.csv')
X = train.iloc[:, 4]
y = train.iloc[:, 4:5]
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
y_rbf = svr_rbf.fit(X, y).predict(X)
lw = 2
plt.scatter(X, y, color='darkorange', label='data')
plt.plot(X, y_rbf, color='navy', lw=lw, label='RBF model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
This gives the error,
ValueError: Expected 2D array, got 1D array instead:
:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I'm assuming that your target variables need to be independently predicted here, so correct me if I'm wrong. I've slightly modified the sklearn doc example to illustrate what you need to do. Please do consider scaling your data before performing the regression.
import numpy as np
from sklearn import svm
import matplotlib.pyplot as plt
n_samples, n_features = 10, 4 # your four features a,b,c,d are the n_features
np.random.seed(0)
y_e = np.random.randn(n_samples)
y_f = np.random.randn(n_samples)
# your input array should be formatted like this.
X = np.random.randn(n_samples, n_features)
#dummy parameters - use grid search etc to find best params
svr_rbf = svm.SVR(kernel='rbf', C=1e3, gamma=0.1)
# Fit and predict for one target, do the same for the other
y_pred_e = svr_rbf.fit(X, y_e).predict(X)
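To finish the sketch, the second target is fitted the same way; and if you want to follow the scaling suggestion, a pipeline keeps it tidy (this part is an illustration, not from the original answer):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same features, second target
y_pred_f = svr_rbf.fit(X, y_f).predict(X)

# Optional: standardise the features first -- RBF kernels are sensitive to feature scale
svr_scaled = make_pipeline(StandardScaler(), svm.SVR(kernel='rbf', C=1e3, gamma=0.1))
y_pred_e_scaled = svr_scaled.fit(X, y_e).predict(X)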
Assuming your data file has 6 columns, with the feature values in the first 4 columns and the targets (which you call 'dependents') in the final 2 columns, then I think you need to do this instead (note that .iloc slicing excludes the end index):
train = pd.read_csv('/Desktop/test.csv')
X = train.iloc[:, 0:4]   # the four feature columns a, b, c, d
y = train.iloc[:, 4:6]   # the two target columns e, f
