I am trying to use the latest (2nd) 0.3 version of the Prophet package for Python.
My model should include an exogenous regressor, but I receive a ValueError stating that the indeed existing regressor is missing from dataframe. Is this a bug or what am I doing wrong?
#Random Dataset Preparation
import random
random.seed(a=1)
df = pandas.DataFrame(data = None, columns = ['ds', 'y', 'ex'], index = range(50))
datelist = pandas.date_range(pandas.datetime.today(), periods = 50).tolist()
y = numpy.random.normal(0, 1, 50)
ex = numpy.random.normal(0, 2, 50)
df['ds'] = datelist
df['y'] = y
df['ex'] = ex
#Model
prophet_model = Prophet(seasonality_prior_scale = 0.1)
Prophet.add_regressor(prophet_model, 'ex')
prophet_model.fit(df)
prophet_forecast_step = prophet_model.make_future_dataframe(periods=1)
#Result-df
prophet_x_df = pandas.DataFrame(data=None, columns=['Date_x', 'Res'], index = range(int(len(y))))
#Error
prophet_x_df.iloc[0,1] = prophet_model.predict(prophet_forecast_step).iloc[0,0]
You need to first create a column with the regressor value which need to be present in both the fitting and prediction dataframes.
Refer prophet docs
make_future_dataframe generates a dataframe with ds column only.
You need to add 'ex' column to prophet_forecast_step dataframe in order to use it as a regressor
Related
from sklearn.linear_model import LinearRegression
df = average_sales.to_frame()
time = np.arange(len(df.index)) # time dummy
df['time'] = time
X = df.loc[:, ['time']] # features
y = df.loc[:, 'sales'] # target
model = LinearRegression()
model.fit(X, y)
y_pred = pd.Series(model.predict(X), index=X.index)
Could anyone tell me why in X, 'time' has a bracket, but in y 'sales' does not need it? They are just the name for different columns. Why different?
I have this data:
and I am trying to do a simple linear regression model on it.
Here is my code:
from sklearn.linear_model import LinearRegression
X = df[['Date']]
y = df['ACP Cleaning']
model = LinearRegression()
model.fit(X, y)
X_predict = [['2021-1-1']]
y_predict = model.predict(X_predict)
and this is my error:
ValueError: Unable to convert array of bytes/strings into decimal
numbers with dtype='numeric'
Linear Regression works with numbers, not strings.
You must pre-process your data in order to match the input of the model.
One way to do it is to parse the string and convert it to timestamp:
import datetime
def process_date(date_str):
d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
return d.timestamp()
X = df[['Date']].apply(process_date)
The same must be done to the data you want to predict.
Update: If your dataset's datatype is correct, then the problem is with the data you are trying to use for prediction (you cannot predict a string).
The following is a complete working example. Pay close attention to the processing done to the X_predict variable.
import datetime
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})
print(df.head())
X = np.array(df['Date']).reshape(-1,1)
y = df['Val']
model = LinearRegression()
model.fit(X, y)
def process_date(date_str):
d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
# return array
return [d.timestamp()]
X_predict = ['2021-1-1']
X_predict = list(map(process_date, X_predict))
y_predict = model.predict(X_predict)
y_predict
Returns:
Date Val
0 2015-12-31 -0.110503
1 2018-12-31 -0.621394
2 2021-12-31 -1.030068
3 2024-12-31 1.221146
4 2027-12-31 -0.327685
array([-2.6149628])
Update: I used your data to create a csv file:
Date,Val
1-1-2020, 90404.71
2-1-2020, 69904.71
...
And then I loaded with pandas. Everything looks good to me:
def process_date(date_str):
# the date format is month-day-year
d = datetime.datetime.strptime(date_str, '%m-%d-%Y')
return d.timestamp()
df = pd.read_csv("test.csv")
df['Date'] = df['Date'].apply(process_date)
df.head()
Output:
Date Val
0 1.577848e+09 90404.710
1 1.580526e+09 69904.710
2 1.583032e+09 98934.112
3 1.585710e+09 77084.430
4 1.588302e+09 35877.420
Extracting features:
# must reshape 'cause we have only one feature
X = df['Date'].to_numpy().reshape(-1,1)
y = df['Val'].to_numpy()
model = LinearRegression()
model.fit(X, y)
Predicting:
X_predict = ['1-1-2021', '2-1-2021']
X_predict = np.array(list(map(process_date, X_predict)))
X_predict = X_predict.reshape(-1, 1)
y_predict = model.predict(X_predict)
y_predict
Output:
array([55492.2660361 , 53516.12292932])
This is a good prediction. You can use matplotlib to plot your data and convince yourself:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(df['Date'], df['Val'])
plt.show()
Linear Regression needs your arrays to be of numeric type, since you have dates that are stored as strings in your X array, Linear Regression won't work as you expect.
You can convert the X array to numeric type by counting the number of days since the beginning date. You can try something like this in your DataFrame:
df.Date = (df.Date - df.Date[0]).days
And then you can continue as you were doing.
I have assumed that the dates in your Date column are in the datetime format, else you would need to convert it first.
I have been trying to use RF regression from scikit-learn, but I’m getting an error with my standard (from docs and tutorials) model. Here is the code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
db = pd.read_excel('/home/artyom/myprojects//valuevo/field2019/report/segs_inventar_dataframe/excel_var/invcents.xlsx')
age = df[['AGE_1', 'AGE_2', 'AGE_3', 'AGE_4', 'AGE_5']]
hight = df [['HIGHT_','HIGHT_1', 'HIGHT_2', 'HIGHT_3', 'HIGHT_4', 'HIGHT_5']]
diam = df[['DIAM_', 'DIAM_1', 'DIAM_2', 'DIAM_3', 'DIAM_4', 'DIAM_5']]
za = df[['ZAPSYR_', 'ZAPSYR_1', 'ZAPSYR_2', 'ZAPSYR_3', 'ZAPSYR_4', 'ZAPSYR_5']]
tova = df[['TOVARN_', 'TOVARN_1', 'TOVARN_2', 'TOVARN_3', 'TOVARN_4', 'TOVARN_5']]
#df['average'] = df.mean(numeric_only=True, axis=1)
df['meanage'] = age.mean(numeric_only=True, axis=1)
df['meanhight'] = hight.mean(numeric_only=True, axis=1)
df['mediandiam'] = diam.mean(numeric_only=True, axis=1)
df['medianza'] = za.mean(numeric_only=True, axis=1)
df['mediantova'] = tova.mean(numeric_only=True, axis=1)
unite = df[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media','meanhight']].dropna()
from sklearn.model_selection import train_test_split as ttsplit
df_copy = unite.copy()
trainXset = df_copy[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media']]
trainYset = df_copy [['meanhight']]
trainXset_train, trainXset_test, trainYset_train, trainYset_test = ttsplit(trainXset, trainYset, test_size=0.3) # 70% training and 30% test
rf = RandomForestRegressor(n_estimators = 100, random_state = 40)
rf.fit(trainXset_train, trainYset_train)
predictions = rf.predict(trainXset_test)
errors = abs(predictions - trainYset_test)
mape = 100 * (errors / trainYset_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
But output doesn’t look ok:
---> 24 errors = abs(predictions - trainYset_test)
25 # Calculate mean absolute percentage error (MAPE)
26 mape = 100 * (errors / trainYset_test)
..... somemore track
ValueError: Unable to coerce to Series, length must be 1: given 780
How can I fix it? 780 is the shape of trainYset_test. I’m not asking for a solution (i.e. write code for me), but for advice on why this error happened. I followed everything as in tutorials.
by seeing in error it is cleared that, the array has to have the shape of one ,
so use reshape to make it in correct shape,
predictions=predictions.reshape(780,1)
I solved this by making sure the predictions were the same data type as the actual data. In my case, it was:
MSE = (sum((y_test-predictions)**2))/(len(newX)-len(newX.columns))
I resolved this by casting y_test to be a numpy array:
MSE = (sum((np.array(y_test)-predictions)**2))/(len(newX)-len(newX.columns))
I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided some sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
np.random.seed(rSeed)
date = pd.to_datetime("1st of Dec, 1999")
cols = OrderedDict()
for col in colNames:
cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
df = pd.DataFrame(cols, index = dates)
return(df)
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least represents the structure of your output too, meaning that you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for some other examples using the same function online and in the statsmodels docs, but I was unable to find specific examples that actually worked. What I did find were a few discussions talking about how this functionality was deprecated a while ago. So then I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
"""
RegressionRoll takes a dataframe, makes a subset of the data if you like,
and runs a series of regressions with a specified window length, and
returns a dataframe with BETA or R^2 for each window split of the data.
Parameters:
===========
df: pandas dataframe
subset: integer - has to be smaller than the size of the df
dependent: string that specifies name of denpendent variable
inependent: LIST of strings that specifies name of indenpendent variables
const: boolean - whether or not to include a constant term
win: integer - window length of each model
parameters: string that specifies which model parameters to return:
BETA or R^2
Example:
========
RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
const = True, parameters = 'beta', win = 30)
"""
# Data subset
if subset != 0:
df = df.tail(subset)
else:
df = df
# Loopinfo
end = df.shape[0]
win = win
rng = np.arange(start = win, stop = end, step = 1)
# Subset and store dataframes
frames = {}
n = 1
for i in rng:
df_temp = df.iloc[:i].tail(win)
newname = 'df' + str(n)
frames.update({newname: df_temp})
n += 1
# Analysis on subsets
df_results = pd.DataFrame()
for frame in frames:
#print(frames[frame])
# Rolling data frames
dfr = frames[frame]
y = dependent
x = independent
if const == True:
x = sm.add_constant(dfr[x])
model = sm.OLS(dfr[y], x).fit()
else:
model = sm.OLS(dfr[y], dfr[x]).fit()
if parameters == 'beta':
theParams = model.params[0:]
coefs = theParams.to_frame()
df_temp = pd.DataFrame(coefs.T)
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
if parameters == 'R2':
theParams = model.rsquared
df_temp = pd.DataFrame([theParams])
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_temp.columns = [', '.join(independent)]
df_results = pd.concat([df_results, df_temp], axis = 0)
return(df_results)
df_rolling = RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'], const = True, parameters = 'beta',
win = 30)
Output: A dataframe with beta estimates for OLS of X2 on X1 for each 30 period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304
I'd like to find the standard deviation and confidence intervals for an out-of-sample prediction from an OLS model.
This question is similar to Confidence intervals for model prediction, but with an explicit focus on using out-of-sample data.
The idea would be for a function along the lines of wls_prediction_std(lm, data_to_use_for_prediction=out_of_sample_df), that returns the prstd, iv_l, iv_u for that out of sample dataframe.
For instance:
import pandas as pd
import random
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
df = pd.DataFrame({"y":[x for x in range(10)],
"x1":[(x*5 + random.random() * 2) for x in range(10)],
"x2":[(x*2.1 + random.random()) for x in range(10)]})
out_of_sample_df = pd.DataFrame({"x1":[(x*3 + random.random() * 2) for x in range(10)],
"x2":[(x + random.random()) for x in range(10)]})
formula_string = "y ~ x1 + x2"
lm = smf.ols(formula=formula_string, data=df).fit()
# Prediction works fine:
print(lm.predict(out_of_sample_df))
# I can also get std and CI for in-sample data:
prstd, iv_l, iv_u = wls_prediction_std(lm)
print(prstd)
# I cannot figure out how to get std and CI for out-of-sample data:
try:
print(wls_prediction_std(lm, exog= out_of_sample_df))
except ValueError as e:
print(str(e))
#returns "ValueError: wrong shape of exog"
# trying to concatenate the DFs:
df_both = pd.concat([df, out_of_sample_df],
ignore_index = True)
# Only returns results for the data from df, not from out_of_sample_df
lm2 = smf.ols(formula=formula_string, data=df_both).fit()
prstd2, iv_l2, iv_u2 = wls_prediction_std(lm2)
print(prstd2)
It looks like the problem is in the format of the exog parameter. This method is 100% stolen from this workaround by github user thatneat. It is necessary because of this bug.
def transform_exog_to_model(fit, exog):
transform=True
self=fit
# The following is lifted straight from statsmodels.base.model.Results.predict()
if transform and hasattr(self.model, 'formula') and exog is not None:
from patsy import dmatrix
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
exog)
if exog is not None:
exog = np.asarray(exog)
if exog.ndim == 1 and (self.model.exog.ndim == 1 or
self.model.exog.shape[1] == 1):
exog = exog[:, None]
exog = np.atleast_2d(exog) # needed in count model shape[1]
# end lifted code
return exog
transformed_exog = transform_exog_to_model(lm, out_of_sample_df)
print(transformed_exog)
prstd2, iv_l2, iv_u2 = wls_prediction_std(lm, transformed_exog, weights=[1])
print(prstd2)
Additionally you can try to use the get_prediction method.
predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)
This returns the confidence and prediction interval. I found the summary_frame() method buried here and you can find the get_prediction() method here. You can change the significance level of the confidence interval and prediction interval by modifying the "alpha" parameter.