Why "prediction space" is needed? - python

This is an old exercise about prediction with linear regression, exploring the Gapminder data. The exercise uses a "prediction space" to compute predictions.
Q1. Why should I be creating a "prediction space"? What is it used for?
Q2. How is computing predictions over the "prediction space" related to the fitted model?
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')
The data looks like this:
Country,Year,life,population,income,region
Afghanistan,1800,28.211,3280000,603.0,South Asia
Slovak Republic,1960,70.47800000000001,4137224,8693.0,Europe & Central Asia
# Create arrays for the feature and target variable
y = df.life.values
X_fertility = df.fertility.values

# Reshape into 2D arrays of shape (n_samples, 1), as scikit-learn expects
y = y.reshape(-1, 1)
X_fertility = X_fertility.reshape(-1, 1)

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space: 50 evenly spaced fertility values spanning the observed range
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1, 1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

I believe that you are taking a course from DataCamp.
I stumbled upon this too, and the answer is that prediction_space and y_pred are used to draw the straight regression line in the graph.
NOTE: for those reading this who don't follow what I'm referring to, the code snippet above is actually missing the plotting code:
# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

The predictions from the fitted model also give you a baseline for calculating residuals and, from those, the R^2 value.
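To make the role of the prediction space concrete, here is a minimal end-to-end sketch (assuming a gapminder.csv with the fertility and life columns used in the question). The scatter shows the raw data, the line is drawn over the evenly spaced prediction_space, and R^2 is computed on the observed data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('gapminder.csv')  # assumes fertility and life columns
y = df.life.values.reshape(-1, 1)
X_fertility = df.fertility.values.reshape(-1, 1)

reg = LinearRegression()
reg.fit(X_fertility, y)

# 50 evenly spaced points covering the observed fertility range
prediction_space = np.linspace(X_fertility.min(), X_fertility.max()).reshape(-1, 1)
y_pred = reg.predict(prediction_space)

plt.scatter(X_fertility, y, alpha=0.5)                          # raw data
plt.plot(prediction_space, y_pred, color='black', linewidth=3)  # regression line
plt.xlabel('fertility')
plt.ylabel('life expectancy')
plt.show()

print(reg.score(X_fertility, y))  # R^2 on the observed data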

Related

showing the predicted data with scikit-learn in python

I am doing a project and trying to show some BASIC elements of scikit-learn in Python. My goal is to create roughly three simple examples and show how the model learns and predicts. I am applying a simple sine-wave-type pattern and have been working from a good example online:
https://mclguide.readthedocs.io/en/latest/sklearn/regression.html
My problem is that, since I am new to this library and to ML in general, I don't understand what I have in front of me or how to transform it into the output I am going for. The two problems I am struggling with are a linear regression on a sine wave and a Gaussian regression on a more complicated wave. The output I am getting, per the article, is the accuracy, and that works as intended, but what I am trying to figure out is how to plot the predicted output on top of (or as an extension of) the training data to visually show how it did. I think the data is in here; I am either using the wrong methods to return the appropriate information, or I am not understanding how to extract it from what is already being returned.
Here are some additional questions:
I do not completely understand the "features = x[:, np.newaxis]" line.
When plotting, what do '-*' and '-o' do? I looked through the documentation and they appear to be format strings, but I couldn't find these two exactly.
What do I need to do to get access to the 20% predicted values so that I can plot them against the original data?
Is there a simple way to reuse most of this code for both the simple and the Gaussian examples?
Here is the skeleton code. Most of the scikit-learn code from the article is unchanged.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import random
from operator import add
N = 200 # number of samples
randomlist = []
x = np.linspace(0, 12, N)
sine_wave = np.sin(1*x)
# plot the source data
plt.figure(figsize=(20,5))
plt.plot(x, sine_wave, 'o')
plt.show()
# convert features in 2D format i.e. list of list
# print('Before: ', x.shape)
features = x[:, np.newaxis]
# print('After: ', features.shape)
# save sine wave in variable 'targets'
targets = sine_wave
# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
features, targets,
train_size=0.8,
test_size=0.2,
# random but same for all run, also accuracy depends on the
# selection of data e.g. if we put 10 then accuracy will be 1.0
# in this example
random_state=23,
# keep same proportion of 'target' in test and target data
# stratify=targets # can not used for single feature
)
# training using 'training data'
regressor = LinearRegression()
regressor.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'training data'
prediction_training_targets = regressor.predict(train_features)
# note that 'score' uses 'feature and target (not predict_target)'
# for scoring in Regression
# whereas 'accuracy_score' uses 'features and predict_targets'
# for scoring in Classification
self_accuracy = regressor.score(train_features, train_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)
# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
# plot the predicted and actual target for test data
plt.figure(figsize=(20,5))
plt.plot(test_targets, color = "red")
plt.show()
plt.plot(prediction_test_targets, '-*', color = "red")
plt.plot(test_targets, '-o' )
plt.show()
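As a possible way to get the plot asked about above: plot the actual and predicted test targets against the test features themselves, rather than against the sample index. A minimal sketch reusing the variables defined above (test_features, test_targets, prediction_test_targets), under the assumption that sorting by feature value is wanted so the lines follow the x-axis:
# sort by feature value so the line/marker plots follow the x-axis
order = np.argsort(test_features[:, 0])
x_sorted = test_features[order, 0]

plt.figure(figsize=(20, 5))
plt.plot(x, sine_wave, color='lightgray', label='full sine wave')        # all source data
plt.plot(x_sorted, test_targets[order], '-o', label='actual (test 20%)')
plt.plot(x_sorted, prediction_test_targets[order], '-*', color='red', label='predicted (test 20%)')
plt.legend()
plt.show()
Here '-o' and '-*' are matplotlib format strings: a solid line with circle markers and a solid line with star markers, respectively.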

how to correctly plot regression output with right datetime index on x-axis in matplotlib?

I have air pollution time series data for which I need to make a forward-period estimate. To do so, I used the random forest regressor from scikit-learn to make predictions, and I want to visualize the prediction output, but I am having trouble getting the x-axis to show the right time index. Can anyone suggest how I could get a better visualization for my regression approach below? Is there a better way to make this happen? Any ideas?
my attempt
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url, parse_dates=['date'])
df.date = pd.DatetimeIndex(df.date)
# df.sort_values(by='date').reset_index(drop=True)
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
resultsDict={}
predictionsDict={}
split_date ='2017-12-01'
df_training = df.loc[df.date <= split_date]
df_test = df.loc[df.date > split_date]
## exclude pollution_index columns from training and testing data
df_tr = df_training.drop(['pollution_index'],axis=1)
df_te = df_test.drop(['pollution_index'],axis=1)
## scaling features
scaler = StandardScaler()
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']
X_train_df = pd.DataFrame(X_train,columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test,columns=df_te.columns)
reg = RandomForestRegressor(max_depth=2, random_state=0)
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
resultsDict['Randomforest'] = evaluate(df_test['eyci'], yhat)
predictionsDict['Randomforest'] = yhat
## print out prediction from RandomForest
print(predictionsDict['Randomforest'])
plt.plot(df_test['pollution_index'].values , label='Original')
plt.plot(yhat,color='red',label='predicted')
plt.legend()
output of current attempt
here is the output of the above attempt:
In this attempt, I ran the regression with the random forest regressor and intended to make a simple plot, but the plot doesn't show time on the x-axis. Why? Does anyone know how to make this right? Any thoughts? Thanks
desired plot
Ideally, after training the model, I want to make a forward-period estimate, and this is the kind of plot I want to produce from my attempt above:
Can anyone suggest a way to make the right visualization of the regression output? Any thoughts?
You will need to provide the dates explicitly to matplotlib.pyplot.plot().
plt.plot(df_test['date'],df_test['pollution_index'].values , label='Original')
plt.plot(df_test['date'],yhat,color='red',label='predicted')
You can also use the matplotlib-based plotting function from pandas:
df_test['yhat'] = yhat
df_test.plot(x='date',y=['pollution_index','yhat'])
It automatically labels the x-axis with the date column and adds a legend.
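An equivalent sketch that avoids assigning a new column to df_test (which is a slice of df and may raise a SettingWithCopyWarning): build a small frame indexed by date and let pandas handle the time axis. This assumes df_test and yhat from the question.
import matplotlib.pyplot as plt

plot_df = df_test[['date', 'pollution_index']].copy()
plot_df['predicted'] = yhat                      # align predictions with the test rows
ax = plot_df.set_index('date').plot(figsize=(12, 4))
ax.set_ylabel('pollution_index')
plt.show()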

After training a Linear Regression model using scikit-learn, how do I make predictions for new data points that are not in the original data set?

I am learning linear regression. I wrote this linear regression code using scikit-learn; after making the in-sample predictions, how do I make a prediction for a new data point that is not in my original data set?
In this data set you are given the salaries of people according to their work experience.
For example, the predicted salary for a person with 15 years of work experience should be [167005.32889087].
Here is an image of the data set.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
data = pd.read_csv('project_1_dataset.csv')
X = data.iloc[:,0].values.reshape(-1,1)
Y = data.iloc[:,1].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X,Y)
Y_pred = linear_regressor.predict(X)
plt.scatter(X,Y)
plt.plot(X, Y_pred, color = 'red')
plt.show()
After fitting and training your model with your existing dataset (i.e. after linear_regressor.fit(X, Y)), you can make predictions for new instances in the same way:
new_prediction = linear_regressor.predict(new_data)
print(new_prediction)
where new_data is your new data point.
If you want to make predictions for particular new data points, the above is enough. If your new data points belong to another DataFrame, you can replace new_data with that DataFrame (containing the same feature columns) holding the new instances to be predicted.
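For the concrete example from the question (15 years of experience), a minimal sketch; the new point has to be a 2D array with one row and one feature column, matching the shape of X used in fit:
import numpy as np

new_data = np.array([[15]])                        # one sample, one feature: 15 years of experience
new_prediction = linear_regressor.predict(new_data)
print(new_prediction)                              # expected to be roughly [[167005.33]] per the question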

Residual plot for residual vs predicted value in Python

I have run a KNN model. Now I want to plot residuals vs. predicted values. Every example from different websites shows that I have to first run a linear regression model, but I couldn't understand how to do this. Can anyone help? Thanks in advance.
Here is my model:
import numpy as np
from sklearn import neighbors

# df is the DataFrame holding the data
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
x_train = train.iloc[:,[2,5]].values
y_train = train.iloc[:,4].values
x_validate = validate.iloc[:,[2,5]].values
y_validate = validate.iloc[:,4].values
x_test = test.iloc[:,[2,5]].values
y_test = test.iloc[:,4].values
clf=neighbors.KNeighborsRegressor(n_neighbors = 6)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_validate)
Residuals are simply how much your predicted values differ from the actual values, so they are calculated as actual values minus predicted values. In your case, y_pred was computed from x_validate, so it's residuals = y_validate - y_pred. Now for the plot, just use this:
import matplotlib.pyplot as plt
residuals = y_validate - y_pred
plt.scatter(y_pred, residuals)   # predicted values on the x-axis, residuals on the y-axis
plt.show()
What is the question? The residuals are simply the actual values minus the predicted values (here, y_validate - y_pred). Now use seaborn's regplot.
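A minimal sketch of the seaborn route, assuming y_pred and y_validate from the question; regplot scatters the residuals against the predictions and overlays a fitted trend line, and a horizontal reference at zero makes any pattern easier to spot:
import seaborn as sns
import matplotlib.pyplot as plt

residuals = y_validate - y_pred

sns.regplot(x=y_pred, y=residuals)
plt.axhline(0, color='gray', linestyle='--')   # a well-behaved model scatters evenly around zero
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()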

How we can compute intercept and slope in statsmodels OLS?

Here I asked how to compute the AIC in a linear model. If I replace the LinearRegression() method with the OLS method from statsmodels to get the AIC, then how can I compute the slope and intercept for the OLS linear model?
import statsmodels.api as sm
regr = sm.OLS(y, X, hasconst=True).fit()
In your example, you can use the params attribute of regr, which will display the coefficients and intercept. The key is that you first need to add a column vector of 1.0s to your X data. Why? The intercept term is technically just the coefficient of a column vector of 1s. That is, the intercept is a coefficient which, when multiplied by an X "term" of 1.0, produces itself; adding this to the summed product of the other coefficients and features gives your n x 1 array of predicted values.
Below is an example.
# Pull some data to use in the regression
from pandas_datareader.data import DataReader
import statsmodels.api as sm
syms = {'TWEXBMTH': 'usd',
        'T10Y2YM': 'term_spread',
        'PCOPPUSDM': 'copper'}
data = (DataReader(syms.keys(), 'fred', start='2000-01-01')
        .pct_change()
        .dropna())
data = data.rename(columns = syms)
# Here's where we assign a column of 1.0s to the X data
# This is required by statsmodels
# You can check that the resulting coefficients are correct by exporting
# to Excel with data.to_clipboard() and running Data Analysis > Regression there
data = data.assign(intercept = 1.)
Now actually running the regression and getting coefficients takes just 1 line in addition to what you have now.
y = data.usd
X = data.loc[:, 'term_spread':]
regr = sm.OLS(y, X, hasconst=True).fit()
print(regr.params)
term_spread -0.00065
copper -0.09483
intercept 0.00105
dtype: float64
So regarding your question on AIC, you'll want to make sure the X data has a constant column as well before you call .fit.
Note: when you call .fit, you create a regression results wrapper and can access any of the attributes listed here.
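As a side note, statsmodels also ships a helper, sm.add_constant, that does the same job as manually assigning a column of 1.0s. A minimal sketch, assuming X holds only the raw feature columns (e.g. term_spread and copper) and y the target:
import statsmodels.api as sm

X_const = sm.add_constant(X)    # prepends a 'const' column of 1.0s
regr = sm.OLS(y, X_const).fit()
print(regr.params)              # 'const' is the intercept; the rest are the slopes
print(regr.aic)                 # the AIC is available on the fitted results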
For anyone searching for how to get the slope and intercept of a LinearRegression in scikit-learn: it has coef_ and intercept_ attributes which provide these.
import numpy as np
from sklearn import linear_model

(x, y) = np.random.randn(10, 2).T
lr = linear_model.LinearRegression()
lr.fit(x.reshape(len(x), 1), y)
lr.coef_ # array([ 0.29387004])
lr.intercept_ # -0.17378418547919167
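In other words, for a single feature the fitted line is y = coef_[0] * x + intercept_; a quick check, assuming lr from the snippet above:
slope, intercept = lr.coef_[0], lr.intercept_
x_new = 0.5
print(slope * x_new + intercept)   # manual prediction from the line equation
print(lr.predict([[x_new]])[0])    # same value from predict()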
