Using linear regression to predict missing values in Python

I have data from 2012-2014 with some missing months in 2014. I would like to predict those months using a linear regression model trained on the 2012/2013 data.
2014 is missing June through August; the missing entries have '' as their value, so I clean them up with the code below. I also trim 2012 and 2013 to the same shape by slicing them down to 300 rows:
data2014NaN = data2014['mob'].replace(' ', np.nan)
data2014CleanNaN = data2014NaN[data2014NaN.notnull()]
data2012[0:300]
data2013[0:300]
Then I train a linear regression model using both years as a training set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = pd.concat([data2012[0:300], data2013[0:300]], axis=1, join='inner')
y = data2014CleanNaN.values
y = y.reshape(-1, 1)

# Split into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    random_state=4)

lm = LinearRegression()
lm.fit(X_train, y_train)
score = lm.score(X_test, y_test)
print("The prediction score on the test data is {:.2f}%".format(score*100))
However, the result I got is an abysmal 4.65%, and I'm not sure how to approach this problem. I assume I did something wrong when I cut down the data for 2012 and 2013.
Here I attached the data (this is just dummy data):
2014:
date value
29/01/2014 10
30/01/2014 20
31/01/2014 15
1/02/2014 ' '
2012:
date value
29/01/2012 15
30/01/2012 18
31/01/2012 19
1/02/2012 50
I'm only using the value column; I'm not sure if I'm heading in the right direction.
Best regards

It seems that your R^2 is not very good.
Cubic spline interpolation might perform better than linear regression in this case.
In Python you can use:
import scipy.interpolate as st
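For example, a minimal sketch of interpolating the missing stretch; the day numbers and values below are made-up placeholders, not your data:
import numpy as np
from scipy.interpolate import CubicSpline

# Days of the year with known 2014 values (placeholder numbers)
known_days = np.array([29, 30, 31, 244, 245, 246])
known_vals = np.array([10.0, 20.0, 15.0, 33.0, 28.0, 30.0])

# Days to fill in (roughly June-August)
missing_days = np.arange(152, 244)

spline = CubicSpline(known_days, known_vals)
filled_vals = spline(missing_days)   # interpolated estimates for the gap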
Also, if x is a timestamp and y is a value, you can try time-series methods like AR or ARMA, or neural-network methods like an RNN or LSTM.
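If you go the AR/ARMA route, a rough sketch with statsmodels might look like the following; here series is assumed to be a pandas Series of your daily values with a DatetimeIndex, and the order (2, 0, 1) is only a placeholder you would normally choose from ACF/PACF plots or an information criterion:
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(series, order=(2, 0, 1))   # placeholder (p, d, q)
fit = model.fit()
forecast = fit.forecast(steps=92)        # roughly the length of June-August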
An LSTM sample built with Keras:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(5, activation='tanh', input_shape=dataX[0].shape, return_sequences=False))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mae', metrics=['mse'])
model.fit(dataX, dataY, epochs=times, batch_size=1, verbose=2, shuffle=False)
y_pred = model.predict(dataX)

Related

How do I calculate the MSE during walk-forward optimization?

I am trying to predict the variable "ec" using features with a time lag of 1 period with different models. In order to see which model (I am comparing OLS, Ridge, Lasso and ARIMAX) fits the data best, I use a walk-forward approach (expanding window) and want to calculate the mean squared error for each of the models. (I am providing the code for my OLS model as an example.) Although my code seems to be working, I am not sure whether my MSE calculation is correct: as can be seen in the code below, I save the MSE of each loop (each combination of training and test set) in a list (ols_mse_list) and then calculate the "overall" MSE by taking the average of the list. Is that the correct way? I am slightly confused, as I couldn't find proper instructions on how to calculate the MSE during the optimization process.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Separate the predictors and label
X_bss = data_bss[data_bss.columns[~data_bss.columns.isin(
    ["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"])]]
y = data_bss["ec"]

# Custom expanding-window splitter
tscv = expanding_window(initial=350, horizon=12, period=1)
for train_index, test_index in tscv.split(X_bss):
    print("Train:", train_index)
    print("Test :", test_index)

ols = LinearRegression()
ols_mse_list = []
ols_mean_mse = []

# Loop through the splits. Run a linear regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(data_bss)):
    X_train = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[train_index]
    y_train = data_bss[["ec"]].iloc[train_index]
    X_test = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[test_index]
    y_test = data_bss[["ec"]].iloc[test_index]
    ols.fit(X_train, y_train)
    ols_mse = mean_squared_error(y_test, ols.predict(X_test))
    ols_mse_list.append(ols_mse)
    ols_mean_mse.append(np.mean(ols_mse_list))
print("OLS MSE:", ols_mean_mse)

X has 8 features, but RandomForestRegressor is expecting 67 features as input

I want to build a House Price Prediction app. The app has fields where the user can enter their inputs; a predictive model then predicts the price and displays it to the user. I am using a dataset from Kaggle for the prediction. When I run the code, it shows an error message that says:
X has 8 features, but RandomForestRegressor is expecting 67 features as input.
Below is the code. Xy contains the data from Kaggle and df is the user input. Xy is the training set and df is the test set. Xy has 8 variables including the target; df will only have 7 variables, because no target variable is received from the user.
# Assign to X for input features and Y for target
X = Xy.drop('Price', axis=1)
Y = Xy['Price'].values
# Build Regression Model
model = RandomForestRegressor()
model.fit(X, Y)
df = pd.get_dummies(df, columns=['Location', 'Furnishing', 'Property_Type_Supergroup', 'Size_Type'])
# Apply Model to Make Prediction
prediction = model.predict(df)
I tried searching for solutions online, but nothing works for my code. I hope someone can help.
It's a little difficult to tell without seeing the data that you're fitting the model on. Between the error and your code, though, it seems that you're fitting the model on a data frame with 67 features. The data frame that you call fit on needs to have the same features as the data frame you call predict on.
Sorry if this answer is redundant; it is difficult to tell without seeing the data and the exact error.
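One common way to make the two frames line up, sketched under the assumption that X (the frame passed to fit) is already dummy-encoded, is to encode the user input the same way and then reindex it to the training columns, filling any dummy columns the single input is missing with 0:
import pandas as pd

# Encode the user input with the same categorical columns used for training
df_encoded = pd.get_dummies(df, columns=['Location', 'Furnishing',
                                         'Property_Type_Supergroup', 'Size_Type'])

# Align to the training feature set: missing dummy columns become 0, extras are dropped
df_encoded = df_encoded.reindex(columns=X.columns, fill_value=0)

prediction = model.predict(df_encoded)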
"X has 8 features, but RandomForestRegressor is expecting 67 features as input."
I assume this is the standard dataset you used; after unzipping and loading, it contains the following files:
sample_submission.csv
test.csv
data_description.txt
train.csv
If you check the shapes of train.csv and test.csv:
train = pd.read_csv('./house_prices/train.csv')
test = pd.read_csv('./house_prices/test.csv')
print(f'Train shape : {train.shape}')
print(f'Test shape : {test.shape}')
#Train shape : (1460, 81)
#Test shape : (1459, 80)
That suggests you deleted or dropped some columns/features and reduced them from 81 to 67, which is fine so far. The problem is that after converting the categorical variables into numeric ones with pd.get_dummies() in the pre-processing stage, you need to split that same encoded dataframe into x_train and y_train, fit() the model on it, and then predict on x_test via y_pred = model.predict(x_test). Otherwise the shape of df does not match X (one has 8 columns, the other 67 in your case).
So I suggest that df should be split first:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# Choosing features for predicting the target variable
x = df
# y should hold the target column (e.g. the 'Price' values)
# Data split on df
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Apply RandomForestRegressor
model = RandomForestRegressor(n_estimators=300, max_depth=13, random_state=0)
model.fit(x_train,y_train)
# Predicting the data using the model
y_pred = model.predict(x_test)
# Evaluating the model
print(metrics.r2_score(y_test,y_pred))

Why are the outputs of my MLP neural network between 0 and 1?

I made a neural network to predict future prices of a stock using historical data for training. The data set is arranged in hourly intervals so the goal is to forecast the price at the end of the next hour of trading.
I used 2 hidden layers of (20, 16) format, 26 inputs and one output, which should be the price. The activation function is 'relu' and the solver is 'adam'.
When the training is done and I try to test it, all the outputs are values between 0 and 1, which I don't understand.
Here's my code:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Defining the target column and predictors
target_column = ['close']
predictors = list(set(list(df.columns)) - set(target_column))

# Standardizing the DataFrame
df[predictors] = df[predictors] / df[predictors].max()
df.describe().transpose()

# Creating the sets and splitting data
X = df[predictors].values
y = df[target_column].values
y = y.astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
df.reset_index()

# Fitting the training data
mlp = MLPClassifier(hidden_layer_sizes=(20, 16), activation='relu', solver='adam', max_iter=1000)
mlp.fit(X_train, y_train.ravel())

# Predicting training and test data
predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)
After I execute the code, I also get this warning:
UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
Shouldn't the prediction at least be a continuous numerical variable, not one limited to between 0 and 1?
Price is a continuous variable, so this is a regression problem and not a classification one; here you are trying to apply a classifier to a regression problem, which is wrong.
Such classifiers produce a probabilistic output, which, unsurprisingly, lies in [0, 1].
You should use an MLPRegressor model instead of MLPClassifier.
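A minimal sketch of that change, reusing the X and y built in the question (with y left as a float rather than cast to int):
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

mlp = MLPRegressor(hidden_layer_sizes=(20, 16), activation='relu',
                   solver='adam', max_iter=1000)
mlp.fit(X_train, y_train.ravel())

predict_test = mlp.predict(X_test)   # continuous price predictions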

What is the difference between these two ways of specifying training/testing data for sklearn GPR

This is somewhat of a follow up to my previous question about evaluating my scikit Gaussian process regressor. I am very new to GPRs and I think that I may be making a methodological mistake in how I am using training vs testing data.
Essentially I'm wondering what the difference is between specifying training data by splitting the input between test and training data like this:
import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
versus using the full data set for training, like this:
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X, Y)
score = gp.score(X, Y)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
Is one of these options going to result in incorrect predictions?
You split the training data off from the test data to evaluate your model, because otherwise you have no idea whether you are overfitting the data. For example, just put the data in Excel and plot it with a smooth line. Technically, that spline function from Excel fits the data perfectly, but it is useless for predicting new values.
In your example, your predictions are over a uniform space so you can visualize what your model thinks the underlying function is, but that tells you nothing about how well the model generalizes. Sometimes you can get very high accuracy (> 95%) on training data and worse than chance on testing data, which means the model is overfitting.
In addition to plotting a uniform prediction space to visualize the model, you should also predict values from the test set and then compare the accuracy metrics for the training and test data.
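A small sketch of that last point, reusing the fitted gp and the split from the first snippet: score both sets and compare; a large gap between the two suggests overfitting.
# R^2 on data the model has seen vs data it has not
train_score = gp.score(X_train, y_train)
test_score = gp.score(X_test, y_test)
print("train R^2 = {:.3f}, test R^2 = {:.3f}".format(train_score, test_score))

# Predictions on the held-out points, with the predictive std as an uncertainty estimate
y_test_pred, y_test_std = gp.predict(X_test, return_std=True)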

How to train a classification algorithm with a normalized data set using scikit-learn (Python)

I have a dataset in CSV format, with 6 columns and 1877 rows. The full dataset can be viewed on ShareCSV.
The first five columns are features and the final column is a binary result. I want to create a classification model that predicts the result from the five inputs, as seen in the CSV above.
I use the following code to normalize the data with pandas.
from sklearn import preprocessing
import pandas as pd
df = pd.read_csv(r"D:\path\data.csv", sep=",")
df = (df - df.min()) / (df.max() - df.min())
I now need to pass this data to scikit-learn and select a classification algorithm; however, this is where I am unsure what would be optimal. If anyone could recommend the best algorithm for my data and a rough implementation, that would be great.
I would say that you can split the data with train_test_split and then train a classification algorithm, evaluating it with an appropriate metric.
Below is something I used (for a regression problem, though) that you can adapt to your needs:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, StackingRegressor
from sklearn.svm import SVR

X = TRAIN_DS[["season", "holiday", "workingday", "weather", "weekday",
              "month", "year", "hour", 'humidity', 'temperature']]
y = TRAIN_DS['count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [('randf', RandomForestRegressor(max_depth=50, n_estimators=1500)), ('gradb', GradientBoostingRegressor(max_depth=5, n_estimators=400)), ('gradb2', GradientBoostingRegressor(n_estimators=4000)), ('svr', SVR(kernel='rbf', gamma='auto')), ('ext', ExtraTreesRegressor(n_estimators=4000))]
voting = StackingRegressor(estimators)
voting.fit(X=X_train, y=np.log1p(y_train))
To pick the best model I would suggest using an appropriate metric. Here is an RMSLE and R2 function that you may find useful:
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

def calc_plot(y_test, y_pred, name):
    '''Calculating RMSLE score and r2 score'''
    # Removing negative values
    for i, y in enumerate(y_pred):
        if y_pred[i] < 0:
            y_pred[i] = 0
    # Printing scoring
    print('RMSLE for ' + name + ':', np.sqrt(mean_squared_log_error(y_test, y_pred)))
    print('R2 for ' + name + ':', r2_score(y_test, y_pred))
You may also use a VotingClassifier or StackingClassifier to combine multiple models for your predictions, as in the sketch below.
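A minimal classification sketch along those lines, applied to your five features; the column name 'result' is a placeholder for whatever your binary target column is actually called:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop(columns=['result'])   # the five normalized feature columns
y = df['result']                  # the binary target ('result' is a placeholder name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=300)),
                ('gb', GradientBoostingClassifier())],
    final_estimator=LogisticRegression())
clf.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))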
Finally, you can use GridSearchCV to try different values for the parameters of whatever classification algorithm you use. An example for a regression problem would be:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut

gr = SGDRegressor()
parameters = {'loss': ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],
              'penalty': ['l2', 'l1', 'elasticnet'], 'fit_intercept': [True, False],
              'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'], 'alpha': [0.0001, 0.005, 0.001],
              'l1_ratio': [0.15, 0.5, 0.25], 'max_iter': [500, 1000, 2000], 'epsilon': [0.1, 0.4],
              'eta0': [0.01, 0.05, 0.1], 'power_t': [0.25, 0.1, 0.5], 'early_stopping': [True, False],
              'warm_start': [True, False], 'average': [True, False], 'n_iter_no_change': [3, 5, 10, 15]}
lModel = GridSearchCV(gr, parameters, cv=LeaveOneOut(), scoring='neg_mean_absolute_error')
Hope it helps!
