How to make a prediction from a regression model? - python

I'm working through a machine learning intro. Currently, I followed the class to build a regression model for salaries based on years of experience. By just looking at the regression model, I can eyeball how much someone with, say, 4 years of experience makes, but I'm wondering if there is a function or piece of code that can return the actual predicted salary.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
#Training Set
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
#Test Results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The code above is the model. Below is my code for making a prediction, but it doesn't work. Any idea how I can adapt the function to return a prediction based on the model?
def PredictSalary(experience):
    experience = np.arange(6).reshape((3,2))
    regressor.predict(experience)
PredictSalary(6)
My code above is trying to turn a 1d array into a 2d array, and then use the model to make a prediction on the salary for 6 years.
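One possible fix, sketched under assumptions: scikit-learn's predict expects a 2D array of shape (n_samples, n_features), so a single value for a one-feature model can be wrapped as [[value]], and the function has to return the result. The training data below is made up (salary = 30000 + 9000 * years) only so that the snippet runs on its own, and predict_salary is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data, only for illustration: salary = 30000 + 9000 * years
X_train = np.array([[1.0], [2.0], [3.0], [5.0], [8.0], [10.0]])
y_train = 30000 + 9000 * X_train.ravel()

regressor = LinearRegression()
regressor.fit(X_train, y_train)

def predict_salary(experience):
    """Return the predicted salary for a single number of years of experience."""
    # predict() wants shape (n_samples, n_features): one sample, one feature
    X_new = np.array([[experience]], dtype=float)
    return float(regressor.predict(X_new)[0])

salary_6_years = predict_salary(6)
print(salary_6_years)
```

With the real model from the class, only the function is needed; the already fitted regressor can be reused as is.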

Related

My training and testing graph remains constant, can anyone help me interpret it or explain where I have gone wrong?

I'm doing a simple machine learning project. My initial model was overfitting, as I understood after googling and learning what overfitting is and how to detect it. I then used SMOTE to reduce the overfitting and tried to check whether the model still overfits. I'm getting a graph that I'm unable to interpret; I tried several links to understand what is happening but failed.
Can anyone please tell me whether this graph is okay or whether there is something wrong with it? (The picture and code are given below.)
def EF_final(x_train, y_train, x_test, y_test):
    train_scores, test_scores = [], []
    values = [i for i in range(1, 21)]
    # evaluate a decision tree for each depth
    for i in values:
        # configure the model
        model_ef = ExtraTreesClassifier(n_estimators = 80, random_state=42, min_samples_split = 2, min_samples_leaf= 1, max_features = 'sqrt', max_depth= 24, bootstrap=False)
        # fit model on the training dataset
        model_ef.fit(x_train, y_train)
        # evaluate on the train dataset
        train_yhat = model_ef.predict(x_train)
        train_acc = accuracy_score(y_train, train_yhat)
        train_scores.append(train_acc)
        # evaluate on the test dataset
        test_yhat = model_ef.predict(x_test)
        test_acc = accuracy_score(y_test, test_yhat)
        test_scores.append(test_acc)
        # summarize progress
        print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))
    # plot of train and test scores vs tree depth
    plt.plot(values, train_scores, '-o', label='Train')
    plt.plot(values, test_scores, '-o', label='Test')
    plt.legend()
    plt.show()
I can't comment on your model's predictions without seeing the data, but I can answer the question in your title.
You create and configure the same model on every iteration of the loop without using the variable i to change the tree depth. The random_state is constant as well, so every iteration produces the same result.
Consider changing the model configuration line to
model_ef = ExtraTreesClassifier(n_estimators = 80, min_samples_split = 2, min_samples_leaf= 1, max_features = 'sqrt', max_depth = i, bootstrap=False)
The graph will then change with depth, which helps you choose a better model.
I can't say anything about the accuracy itself, however, without knowing what kind of data is being passed.
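As a rough sketch of the loop with the depth actually varying: the data here is synthetic (make_classification is only a stand-in for the asker's x_train, y_train, x_test, y_test), and n_estimators is reduced to keep it quick. Note that random_state can stay fixed; once max_depth follows the loop variable, the models differ anyway.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; replace with your own split)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

depths = list(range(1, 11))
train_scores, test_scores = [], []
for depth in depths:
    # max_depth now follows the loop variable, so each model is different
    model = ExtraTreesClassifier(n_estimators=20, max_depth=depth,
                                 random_state=42)
    model.fit(x_train, y_train)
    train_scores.append(accuracy_score(y_train, model.predict(x_train)))
    test_scores.append(accuracy_score(y_test, model.predict(x_test)))
    print('>%d, train: %.3f, test: %.3f'
          % (depth, train_scores[-1], test_scores[-1]))
```

Plotting depths against train_scores and test_scores, as in the question, should then give curves that change with depth instead of two flat lines.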

Sklearn Linear Regression seems to not fit the Data

I'm trying to train a linear regression model with sklearn. However, the resulting regression does not seem to fit the data very well. I would have expected the regression line to have a slope of (approximately) 30°, whereas this regression shows no correlation at all (horizontal line). Does anyone have an idea how I can modify my model to get a more appropriate prediction?
Data Plot
This is the corresponding code:
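import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Define x & y: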
x = pd.DataFrame({'patientweight': df['patientweight']})
y = pd.DataFrame({'rate': df['rate']})
# Split Data into Train-/Test-Set:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=0)
# Train the Linear Regression Model:
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
# Predict on the training set (x_train is what gets plotted below):
y_pred = lin_reg.predict(x_train)
# Plot the data:
plt.figure(figsize=(15,10))
plt.xlabel('Patientweight')
plt.ylabel('Rate')
plt.title('Cohort 1: Relation between Patientweight and Rate')
plt.xlim(0,200)
plt.ylim(0,4000)
plt.scatter(df['patientweight'], df['rate'], alpha=0.25)
plt.scatter(x_train, y_pred)
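No data is attached to the question, so the following is only a sketch on invented numbers. It shows one way to read the fitted slope, intercept and R² directly instead of judging the angle from the plot: with xlim(0, 200) and ylim(0, 4000) the two axes have very different scales, so the visual angle of a line says little about its slope in data units. The variable names echo the question, but the values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data for illustration only: rate = 5 * patientweight + noise
rng = np.random.default_rng(0)
patientweight = rng.uniform(40, 150, size=200)
rate = 5.0 * patientweight + rng.normal(0, 50, size=200)

x = patientweight.reshape(-1, 1)   # 2D feature array: (n_samples, 1)
y = rate                           # 1D target

lin_reg = LinearRegression().fit(x, y)
slope = lin_reg.coef_[0]           # change in rate per unit of weight
intercept = lin_reg.intercept_
r2 = lin_reg.score(x, y)           # R^2 on the data used for fitting
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```

A slope close to zero together with a low R² would confirm that the model really found no linear relation; a clearly non-zero slope would suggest the flat look comes from the axis limits.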

Residual plot for residual vs predicted value in Python

I have run a KNN model. Now I want to plot residuals vs. predicted values. Every example I find on different websites shows that I first have to run a linear regression model, but I couldn't understand how to do this. Can anyone help? Thanks in advance.
Here is my model-
import numpy as np
from sklearn import neighbors
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
x_train = train.iloc[:,[2,5]].values
y_train = train.iloc[:,4].values
x_validate = validate.iloc[:,[2,5]].values
y_validate = validate.iloc[:,4].values
x_test = test.iloc[:,[2,5]].values
y_test = test.iloc[:,4].values
clf=neighbors.KNeighborsRegressor(n_neighbors = 6)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_validate)
Residuals are simply the differences between the actual and the predicted values, calculated as actual values minus predicted values. Your y_pred comes from x_validate, so in your case it's residuals = y_validate - y_pred. For the plot, put the predicted values on the x-axis and the residuals on the y-axis:
import matplotlib.pyplot as plt
residuals = y_validate - y_pred
plt.scatter(y_pred, residuals)
plt.show()
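What is the question? The residuals are simply y_validate - y_pred (actual minus predicted, on the same split). Now use seaborn's regplot with y_pred on the x-axis and the residuals on the y-axis.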
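The point of both answers is that no linear regression is needed; any regressor's predictions will do. A minimal sketch with a KNN regressor follows. The data is synthetic and the split is done by position, both assumptions made only so the snippet runs on its own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only (two features, one target)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, size=200)

# Simple split by position (the rows are already in random order)
x_train, y_train = X[:120], y[:120]
x_validate, y_validate = X[120:160], y[120:160]

clf = KNeighborsRegressor(n_neighbors=6)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_validate)
# Residual = actual - predicted, computed on the same split as y_pred
residuals = y_validate - y_pred

# These pairs are the points of a residual-vs-predicted scatter plot
# (y_pred on the x-axis, residuals on the y-axis)
for pred, res in list(zip(y_pred, residuals))[:5]:
    print(f"predicted={pred:.2f}, residual={res:.2f}")
```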

Linear Regression - Predict ŷ

I'm trying to plot a scatter plot of the values of actual sales (y) and predicted sales (ŷ).
I have imported the CSV file, and the code I currently have for the linear regression model is:
result = smf.ols('sales ~ discount + holiday + product', data=data).fit()
print(result.summary())
Since I only have the actual sales values, how do I find the predicted sales (ŷ) values to plot the scatter plot? I have researched this and found lm.predict() (where lm = LinearRegression()) and result.predict(). Is there a difference?
Thank you in advance!
Without the data it is hard to help, but I assume you have X and y from your dataset, since you want to perform linear regression. You can split the data into a training and a test set using scikit-learn:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3)
Then you need to fit linear regression to the training set:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
and afterwards predict test set results:
y_pred = regressor.predict(X_test)
Finally, you can plot your test or training results:
# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Discount vs Sales (Training set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Discount vs Sales (Test set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
(In this scenario we want to predict what Sales will be for a specific value of, e.g., Discount percentage.) If you have more than one X variable, things get more complicated: you will need to encode categorical variables as dummy variables, perform statistical analysis, etc.
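For the original goal, a scatter plot of actual against predicted sales, ŷ is just the model's prediction for each observed row. The sketch below uses scikit-learn on invented numeric data; the two columns standing in for discount and holiday are assumptions, and the categorical product column is left out. In statsmodels, the fitted result exposes the same in-sample predictions as result.fittedvalues.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: columns stand in for discount (%) and holiday (0/1)
rng = np.random.default_rng(1)
discount = rng.uniform(0, 30, size=100)
holiday = rng.integers(0, 2, size=100)
X = np.column_stack([discount, holiday])
y = 200 + 4.0 * discount + 50.0 * holiday + rng.normal(0, 10, size=100)

lm = LinearRegression().fit(X, y)
y_hat = lm.predict(X)   # predicted sales for every observed row

# (y, y_hat) pairs are the points of an actual-vs-predicted scatter plot,
# e.g. plt.scatter(y, y_hat) with matplotlib
print(np.column_stack([y, y_hat])[:5].round(1))
```

If the model fits well, the points of such a plot lie close to the diagonal y = ŷ.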

Learning curve (high bias / high variance) why the testing learning curve gets flat

I implemented a model using gradient boosting decision tree as classifier and I plotted learning curves for both training and test sets to decide what to do next in order to improve my model.
The result is as the image:
(The y-axis is accuracy (percentage of correct predictions), while the x-axis is the number of samples I use to train the model.)
I understand that the gap between training and testing score is probably due to high variance (overfitting). But the image also shows that the test score (the green line) increases very little while the number of samples grows from 2000 to 3000. The test score curve is flattening out: the model is not getting better even with more samples.
My understanding is that a flat learning curve usually indicates high bias (underfitting). Is it possible that both underfitting and overfitting are happening in this model? Or is there another explanation for the flat curve?
Any help would be appreciated. Thanks in advance.
=====================================
The code I use is as follows. Basically, I use the same code as the example in the sklearn documentation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

title = "Learning Curves (GBDT)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = GradientBoostingClassifier(n_estimators=450)
X, y = features, target  # features and target are already loaded
plot_learning_curve(estimator, title, X, y, ylim=(0.6, 1.01), cv=cv, n_jobs=4)
plt.show()
I would say you are overfitting. Considering you are using cross validation, the gap between the training and the cross-validation score is probably too big. Without cross validation or random splitting, it could be that your train and test data differ in some way.
There are a couple of ways you could try to mitigate this:
Add more data (the training score will probably still go down a little bit more)
Reduce the number of estimators, or even better, use early stopping
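Increase gamma for pruning (gamma is an XGBoost parameter; in sklearn's GradientBoostingClassifier, min_impurity_decrease serves a similar purpose)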
Use subsampling (by tree, by column...)
There are lots of parameters that you can play with, so have some fun! :-D
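Two of the suggestions above, early stopping and row subsampling, can be sketched with scikit-learn's GradientBoostingClassifier as follows. The data is synthetic (make_classification stands in for the asker's features and target), and the specific values 0.8, 0.2 and 10 are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features/target from the question
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

estimator = GradientBoostingClassifier(
    n_estimators=450,         # upper bound, as in the question
    subsample=0.8,            # each tree sees a random 80% of the rows
    validation_fraction=0.2,  # held out internally to monitor the score
    n_iter_no_change=10,      # stop once 10 rounds bring no improvement
    random_state=0,
)
estimator.fit(X_train, y_train)

n_used = estimator.n_estimators_   # number of trees actually fitted
train_acc = estimator.score(X_train, y_train)
test_acc = estimator.score(X_test, y_test)
print(n_used, round(train_acc, 3), round(test_acc, 3))
```

Comparing n_used with the 450 requested gives a rough idea of how many boosting rounds were useful before the validation score stopped improving.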
First of all, your training accuracy goes down quite a bit when you add more examples. So this could still be high variance. However, I doubt that this is the only explanation as the gap seems to be too big.
One reason for a gap between the training accuracy and the test accuracy could be that the training samples and the test samples come from different distributions. However, with cross-validation this should not happen (are you doing k-fold cross-validation where you retrain for each of the k folds?).
You should pay more attention to your training accuracy. If it goes down during the training, you did something terribly wrong. Check again the correctness of your data (are your labels correct?) and your model.
Normally, both train and test accuracies should increase, but test accuracy is behind.
