I'm trying to plot a scatter plot of the actual sales values (y) against the predicted sales values (ŷ).
I have imported the CSV file, and the code I currently have for the linear regression model is:
result = smf.ols('sales ~ discount + holiday + product', data=data).fit()
print(result.summary())
Since I only have the actual sales values, how do I find the predicted sales (ŷ) values to plot the scatter plot? I have researched and found both lm.predict() (where lm = LinearRegression()) and result.predict(). Is there a difference?
Thank you in advance!
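For reference: result.predict() belongs to the statsmodels result above; called without arguments it returns the in-sample fitted values (equivalent to result.fittedvalues), which are exactly the ŷ values needed for the scatter plot, while lm.predict() is the method of scikit-learn's LinearRegression, a separate model that would first have to be fitted on a numeric feature matrix. A minimal sketch, assuming the result object and data frame above plus matplotlib:
import matplotlib.pyplot as plt

# In-sample predictions from the statsmodels fit; predict() with no
# arguments reuses the data the model was fitted on
y_hat = result.predict()  # equivalently: result.fittedvalues

plt.scatter(data['sales'], y_hat)
plt.xlabel('Actual sales (y)')
plt.ylabel('Predicted sales (ŷ)')
plt.show()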
Without the data it is hard to help, but I assume you have X and y from your dataset, since you want to perform linear regression. You can split the data into training and test sets using scikit-learn:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3)
Then you need to fit a linear regression to the training set:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
and afterwards predict the test set results:
y_pred = regressor.predict(X_test)
Finally, you can plot your test or training results:
# Visualising the training set results
import matplotlib.pyplot as plt

plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Discount vs Sales (Training set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
# Visualising the test set results (the regression line itself is the
# same one fitted on the training data, hence X_train below)
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Discount vs Sales (Test set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
(In this scenario we want to predict how high sales will be if we set a specific value of, e.g., the discount percentage.) If you have more than one X variable, things are more complicated: categorical predictors have to be encoded as dummy variables, and you should do some statistical analysis of the predictors first.
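For instance, the product column from the original formula is categorical; a minimal sketch of one-hot encoding it with pd.get_dummies, assuming data as in the question:
import pandas as pd

# statsmodels' formula API encodes categorical terms automatically, but
# scikit-learn expects a purely numeric feature matrix, so one-hot encode
# the categorical column first (drop_first avoids the dummy-variable trap)
X = pd.get_dummies(data[['discount', 'holiday', 'product']],
                   columns=['product'], drop_first=True)
y = data['sales']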
Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by the probabilities in descending order.
Question
I want to understand which features were important (and to what extent) for the classifier's 500 highest-probability predictions taken as a whole, not for each individual prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all of its predictions, only the predictions with the highest probability. Is the code above suitable to answer this question? If not, what would suitable code look like?
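For what it's worth, when an Explanation object holds several samples, shap.plots.bar displays the mean absolute SHAP value per feature across those samples, so the plot already aggregates over the selected subset. That aggregation can be reproduced by hand as a sanity check (a sketch using the objects from the example above):
# Mean |SHAP| per feature over the selected top-probability samples;
# this is the quantity the bar plot displays for a multi-sample Explanation
mean_abs_shap = np.abs(shap_values.values).mean(axis=0)
for feature, value in sorted(zip(X.columns, mean_abs_shap), key=lambda t: -t[1]):
    print(f"{feature}: {value:.4f}")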
I'm trying to train a linear regression model with scikit-learn. However, the resulting regression does not seem to fit the data very well. I would have expected the regression line to have a slope of roughly 30°, whereas this regression shows no correlation at all (a horizontal line). Does anyone have an idea how I can modify my model to get a more appropriate prediction?
Data Plot
This is the corresponding code:
# Define x & y:
x = pd.DataFrame({'patientweight': df['patientweight']})
y = pd.DataFrame({'rate': df['rate']})
# Split Data into Train-/Test-Set:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=0)
# Train the Linear Regression Model:
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
# Predict on the training set (note: this uses x_train, not x_test):
y_pred = lin_reg.predict(x_train)
# Plot the data:
plt.figure(figsize=(15,10))
plt.xlabel('Patientweight')
plt.ylabel('Rate')
plt.title('Cohort 1: Relation between Patientweight and Rate')
plt.xlim(0,200)
plt.ylim(0,4000)
plt.scatter(df['patientweight'], df['rate'], alpha=0.25)
plt.scatter(x_train, y_pred)
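One quick check (a sketch, assuming the df and fitted lin_reg from above): inspect the learned slope and the raw correlation between the two columns. If both are near zero, the flat line reflects the data rather than a bug in the model.
import numpy as np

# The slope and intercept the model actually learned
print('slope:', lin_reg.coef_, 'intercept:', lin_reg.intercept_)

# Pearson correlation between weight and rate in the raw data
print('correlation:', np.corrcoef(df['patientweight'], df['rate'])[0, 1])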
I'm working through a machine learning intro. Following the class, I built a regression model for salaries based on years of experience. By just looking at the regression plot, I can eyeball how much someone with, say, 4 years of experience makes, but I'm wondering if there is a function or piece of code that can return the actual predicted salary.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
#Training Set
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
#Test Results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The code above is the model. Below is my code for making a prediction, but it doesn't work. Any idea how I can adapt the function to return a prediction based on the model?
def PredictSalary(experience):
    experience = np.arange(6).reshape((3, 2))
    regressor.predict(experience)
PredictSalary(6)
My code above is trying to turn a 1d array into a 2d array, and then use the model to make a prediction on the salary for 6 years.
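For reference, scikit-learn's predict expects a 2-D array of shape (n_samples, n_features). For a model trained on a single feature, one hedged fix would be the following sketch (hypothetical predict_salary, assuming the fitted regressor above):
import numpy as np

def predict_salary(experience):
    # A single sample with a single feature must have shape (1, 1)
    return regressor.predict(np.array([[experience]]))[0]

print(predict_salary(6))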
TL;DR: I'm trying to understand the meaning of the train_score_ attribute of a GradientBoostingClassifier, and specifically why it doesn't match my following attempt to calculate it directly:
my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
More details: I'm interested in the loss scores for both the test and the train data during the different fit stages of the classifier. I can use staged_predict and loss_ to calculate the loss scores for the test data:
test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
I'm okay with that. My problem is with the train loss scores. The documentation suggests using clf.train_score_:
The i-th score train_score_[i] is the deviance (= loss) of the model
at iteration i on the in-bag sample. If subsample == 1 this is the
deviance on the training data.
yet these clf.train_score_ values do not match my attempt to calculate them directly in my_train_scores above. What am I missing here?
The code I used:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = GradientBoostingClassifier(n_estimators=5, loss='deviance')
clf.fit(X_train, y_train)
test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
print(test_scores)
print(clf.train_score_)
my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
print(my_train_scores, '<= NOT the same values as in the previous line. Why?')
Producing e.g. this output...
[0.71319004170311229, 0.74985670836977902, 0.79319004170311214, 0.55385670836977885, 0.32652337503644546]
[ 1.369166 1.35366377 1.33780865 1.32352935 1.30866325]
[0.65541226392533436, 0.67430115281422309, 0.70807893059200089, 0.51096781948088987, 0.3078567083697788] <= NOT the same values as in the previous line. Why?
...where the last two rows do not match.
The attribute train_score_ can be recreated in the following way:
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the loss on the staged raw scores (decision function values)
test_dev = []
for pred in clf.staged_decision_function(X_test):
    test_dev.append(clf.loss_(y_test, pred))

ax = plt.gca()
ax.plot(np.arange(clf.n_estimators) + 1, test_dev, color='#d7191c', label='Test', linewidth=2, alpha=0.7)
ax.plot(np.arange(clf.n_estimators) + 1, clf.train_score_, color='#2c7bb6', label='Train', linewidth=2, alpha=0.7, linestyle='--')
ax.set_xlabel('n_estimators')
plt.legend()
plt.show()
See the result below. Note that the curves are on top of each other as the training and test data are the same data.
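The crucial difference from the question's attempt is staged_decision_function: train_score_ is computed from the raw decision-function scores at each stage, while staged_predict returns hard class labels, which is why the two sets of numbers disagree. Applied to the training data (a sketch, assuming the same clf and the default subsample=1.0):
# Evaluate the loss on raw scores rather than class labels; with
# subsample == 1 these values should match clf.train_score_
my_train_scores = [clf.loss_(y_train, scores)
                   for scores in clf.staged_decision_function(X_train)]
print(my_train_scores)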
I am trying to plot the train_test_split results while maintaining the indices; here is my code.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

#df.insert(0, 'x', range(0, 0 + len(df)))
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
regressor = RandomForestClassifier()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
y_pred_train = regressor.predict(X_train)  # predictions on the training split
plt.plot(X_train, y_pred_train, 'bo')
plt.show()
It seems like the y_pred values are plotted against the incorrect x-axis values, as there is a huge gap in the middle of the data and some overlap.
How can I make the corresponding x values of y_pred and y_pred_train appear in their original positions from the data frame?
You will need to include the index in the plot; usually y is represented as the colour of each point. Here is how to do so:
plt.scatter(X_test.index, X_test.values, c=y_pred)
plt.show()
Here is a random example.
The points coloured yellow belong to class 0 and the points coloured purple belong to class 1.
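To place both the training and test predictions at their original row positions, the same idea extends to two scatter calls (a sketch, assuming the regressor, splits, and y_pred from the question above):
y_pred_train = regressor.predict(X_train)

# Plot each split at its original DataFrame index, coloured by prediction
plt.scatter(X_train.index, X_train.values, c=y_pred_train, label='train')
plt.scatter(X_test.index, X_test.values, c=y_pred, marker='x', label='test')
plt.legend()
plt.show()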