How do you convert day numbers (1, 2, 3 ... 728, 729, 730) to dates in Python? I can assign an arbitrary year to start the date count, as the year doesn't matter to me.
I am working on learning time series analysis (ARIMA, SARIMA, etc.) using Python. I have a CSV dataset with two columns, 'Day' and 'Revenue'. The Day column contains the numbers 1-731, and Revenue contains numbers 0-18.154... I have had success building the model, running statistical tests, building visualizations, etc., but when it comes to forecasting using Prophet I am hitting a wall.
Here are what I feel are the relevant parts of the code related to the question:
# Load the CSV with pandas; index_col=0 makes the "Day" column the index.
from pandas import read_csv

df = read_csv("telco_time_series.csv", index_col=0, parse_dates=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 731 entries, 1 to 731
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Revenue 731 non-null float64
dtypes: float64(1)
memory usage: 11.4 KB
df.head()
Revenue
Day
1 0.000000
2 0.000793
3 0.825542
4 0.320332
5 1.082554
# Instantiate the model
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(df, order=(4,1,0))
# Fit the model
results = model.fit()
# Print summary
print(results.summary())
# line plot of residuals
import matplotlib.pyplot as plt

residuals = results.resid
residuals.plot()
plt.show()
# density plot of residuals
residuals.plot(kind='kde')
plt.show()
# summary stats of residuals
print(residuals.describe())
SARIMAX Results
==============================================================================
Dep. Variable: Revenue No. Observations: 731
Model: ARIMA(4, 1, 0) Log Likelihood -489.105
Date: Tue, 03 Aug 2021 AIC 988.210
Time: 07:29:55 BIC 1011.175
Sample: 0 HQIC 997.070
- 731
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.4642 0.037 -12.460 0.000 -0.537 -0.391
ar.L2 0.0295 0.040 0.746 0.456 -0.048 0.107
ar.L3 0.0618 0.041 1.509 0.131 -0.018 0.142
ar.L4 0.0366 0.039 0.946 0.344 -0.039 0.112
sigma2 0.2235 0.013 17.629 0.000 0.199 0.248
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 2.52
Prob(Q): 0.90 Prob(JB): 0.28
Heteroskedasticity (H): 1.01 Skew: -0.05
Prob(H) (two-sided): 0.91 Kurtosis: 2.73
===================================================================================
df.columns=['ds','y']
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
m = Prophet()
m.fit(df)
ValueError: Dataframe must have columns "ds" and "y" with the dates and values
respectively.
I've had success with the forecast using Prophet if I fill in the values in the CSV with dates, but I would like to convert the Day numbers within the code using pandas.
Any ideas?
You can use datetime.timedelta for this task. Select any date you wish as day 0, then add datetime.timedelta(days=x), where x is your day number. For example:
import datetime
day0 = datetime.date(2000,1,1)
day120 = day0 + datetime.timedelta(days=120)
print(day120)
output
2000-04-30
Wrap it in a function and use .apply if you have a pandas DataFrame, like so:
import datetime
import pandas as pd
def convert_to_date(x):
    return datetime.date(2000, 1, 1) + datetime.timedelta(days=x)
df = pd.DataFrame({'day_n':[1,2,3,4,5]})
df['day_date'] = df['day_n'].apply(convert_to_date)
print(df)
output
day_n day_date
0 1 2000-01-02
1 2 2000-01-03
2 3 2000-01-04
3 4 2000-01-05
4 5 2000-01-06
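Applied to the DataFrame from the question, here is a minimal sketch of building the ds/y columns Prophet expects. It assumes df is indexed by Day as shown, maps day 1 to the arbitrary start date 2000-01-01, and imports Prophet from the prophet package (older installs use fbprophet):
import datetime
from prophet import Prophet

# move the Day index back into a regular column
prophet_df = df.reset_index()

# day n maps to start + (n - 1) days
start = datetime.date(2000, 1, 1)
prophet_df['ds'] = prophet_df['Day'].apply(
    lambda d: start + datetime.timedelta(days=d - 1))
prophet_df = prophet_df.rename(columns={'Revenue': 'y'})[['ds', 'y']]

m = Prophet()
m.fit(prophet_df)
The original df.columns = ['ds', 'y'] failed because Day is the index, so the frame has only one column; reset_index is what makes the second column available to rename.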
Related
I have the following code. I am running a linear model on the dataframe 'x', with gender and highest education level achieved as categorical variables.
The aim is to assess how well age, gender and highest level of education achieved can predict 'weighteddistance'.
import statsmodels.formula.api as smf

resultmodeldistancevariation2sleep = smf.ols(formula='weighteddistance ~ age + C(gender) + C(highest_education_level_acheived)', data=x).fit()
summarymodel = resultmodeldistancevariation2sleep.summary()
print(summarymodel)
This gives me the output:
0 1 2 3 4 5 6
0 coef std err t P>|t| [0.025 0.975]
1 Intercept 6.3693 1.391 4.580 0.000 3.638 9.100
2 C(gender)[T.2.0] 0.2301 0.155 1.489 0.137 -0.073 0.534
3 C(gender)[T.3.0] 0.0302 0.429 0.070 0.944 -0.812 0.872
4 C(highest_education_level_acheived)[T.3] 1.1292 0.501 2.252 0.025 0.145 2.114
5 C(highest_education_level_acheived)[T.4] 1.0876 0.513 2.118 0.035 0.079 2.096
6 C(highest_education_level_acheived)[T.5] 1.0692 0.498 2.145 0.032 0.090 2.048
7 C(highest_education_level_acheived)[T.6] 1.2995 0.525 2.476 0.014 0.269 2.330
8 C(highest_education_level_acheived)[T.7] 1.7391 0.605 2.873 0.004 0.550 2.928
However, I want to calculate the main effect of each categorical variable on distance, which is not shown in the model output above, and so I passed the model fit to an ANOVA using anova_lm.
import statsmodels.api as sm

anovaoutput = sm.stats.anova_lm(resultmodeldistancevariation2sleep)
anovaoutput['PR(>F)'] = anovaoutput['PR(>F)'].round(4)
This gives me the output below, which, as I wanted, shows the main effect of each categorical variable (gender and highest education level achieved) rather than the separate groups within each variable, meaning there is no gender[2.0] or gender[3.0] in the output below.
df sum_sq mean_sq F PR(>F)
C(gender) 2.0 4.227966 2.113983 5.681874 0.0036
C(highest_education_level_acheived) 5.0 11.425706 2.285141 6.141906 0.0000
age 1.0 8.274317 8.274317 22.239357 0.0000
Residual 647.0 240.721120 0.372057 NaN NaN
However, this output no longer shows me the confidence intervals or the coefficients for each variable.
In other words, I would like the bottom ANOVA table to have 'coef' and '[0.025 0.975]' columns like the first table has.
How can I achieve this?
I would be so grateful for a helping hand!
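One caveat: anova_lm collapses each categorical factor into a single F-test, while coefficients and confidence intervals exist per dummy level, so the two tables do not share rows one-to-one. A minimal sketch, assuming the fitted result above, that at least pulls the per-level coefficients and their 95% intervals into one frame to read alongside the ANOVA table:
import pandas as pd

# per-level coefficients with their 95% confidence intervals
fit = resultmodeldistancevariation2sleep
coef_table = pd.concat(
    [fit.params.rename('coef'),
     fit.conf_int().rename(columns={0: '[0.025', 1: '0.975]'})],
    axis=1)
print(coef_table)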
Given the multi-index, multi-column dataframe below, I want to apply LinearRegression to each block of this dataframe; for example, for each Station_Number I would like to run a regression between LST and Value. The df looks like this:
Latitude Longitude LST Elevation Value
Station_Number Date
RSM00025356 2019-01-01 66.3797 173.33 -31.655008 78.0 -28.733333
2019-02-02 66.3797 173.33 -17.215009 78.0 -17.900000
2019-02-10 66.3797 173.33 -31.180006 78.0 -19.500000
2019-02-26 66.3797 173.33 -19.275007 78.0 -6.266667
2019-04-23 66.3797 173.33 -12.905004 78.0 -4.916667
There are plenty more stations after these. Ideally the output would just be the regression results per station number.
You can use groupby to split the DataFrame, then run each regression within the group. You can store the results in a dictionary where the keys are the 'Station_Number'. I'll use statsmodels for the regression, but there are many possible libraries, depending on how much you care about the standard errors and inference.
import statsmodels.formula.api as smf

d = {}
for station, gp in df.groupby('Station_Number'):
    # fit LST ~ Value separately on each station's rows
    mod = smf.ols(formula='LST ~ Value', data=gp)
    d[station] = mod.fit()
Regression Results:
d['RSM00025356'].params
#Intercept -11.676331
#Value 0.696465
#dtype: float64
d['RSM00025356'].summary()
OLS Regression Results
==============================================================================
Dep. Variable: LST R-squared: 0.660
Model: OLS Adj. R-squared: 0.547
Method: Least Squares F-statistic: 5.831
Date: Fri, 28 May 2021 Prob (F-statistic): 0.0946
Time: 11:17:51 Log-Likelihood: -14.543
No. Observations: 5 AIC: 33.09
Df Residuals: 3 BIC: 32.30
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -11.6763 5.143 -2.270 0.108 -28.043 4.690
Value 0.6965 0.288 2.415 0.095 -0.221 1.614
==============================================================================
Omnibus: nan Durbin-Watson: 2.536
Prob(Omnibus): nan Jarque-Bera (JB): 0.299
Skew: 0.233 Prob(JB): 0.861
Kurtosis: 1.895 Cond. No. 35.9
==============================================================================
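If you would rather have one tidy table than a dict of results objects, a small sketch (reusing the d built above) that collects the intercept, slope, and R-squared per station:
import pandas as pd

# one row per station: intercept, slope on Value, and R-squared
rows = {station: {'intercept': res.params['Intercept'],
                  'slope': res.params['Value'],
                  'rsquared': res.rsquared}
        for station, res in d.items()}
print(pd.DataFrame.from_dict(rows, orient='index'))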
My first Stack Overflow post: I am studying part-time for a data science qualification and I'm stuck with statsmodels SARIMAX predictions.
My time series data looks as follows:
ts_log.head()
Calendar Week
2016-02-22 8.168486
2016-02-29 8.252707
2016-03-07 8.324821
2016-03-14 8.371474
2016-03-21 8.766238
Name: Sales Quantity, dtype: float64
ts_log.tail()
Calendar Week
2020-07-20 8.326759
2020-07-27 8.273847
2020-08-03 8.286521
2020-08-10 8.222822
2020-08-17 8.011687
Name: Sales Quantity, dtype: float64
I run the following:
from statsmodels.tsa.statespace.sarimax import SARIMAX

train = ts_log[:'2019-07-01'].dropna()
test = ts_log['2020-08-24':].dropna()
model = SARIMAX(train, order=(2,1,2), seasonal_order=(0,1,0,52),
                enforce_stationarity=False, enforce_invertibility=False)
results = model.fit()
The summary shows:
results.summary()
Dep. Variable: Sales Quantity No. Observations: 175
Model: SARIMAX(2, 1, 2)x(0, 1, 0, 52) Log Likelihood 16.441
Date: Mon, 21 Sep 2020 AIC -22.883
Time: 22:32:28 BIC -8.987
Sample: 0 HQIC -17.240
- 175
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 1.3171 0.288 4.578 0.000 0.753 1.881
ar.L2 -0.5158 0.252 -2.045 0.041 -1.010 -0.022
ma.L1 -1.5829 0.519 -3.048 0.002 -2.601 -0.565
ma.L2 0.5093 0.502 1.016 0.310 -0.474 1.492
sigma2 0.0345 0.011 3.195 0.001 0.013 0.056
Ljung-Box (Q): 30.08 Jarque-Bera (JB): 2.55
Prob(Q): 0.87 Prob(JB): 0.28
Heteroskedasticity (H): 0.54 Skew: -0.02
Prob(H) (two-sided): 0.05 Kurtosis: 3.72
However, when I try to predict I get a KeyError suggesting my start date is incorrect, but I can't see what is wrong with it:
pred = results.predict(start='2019-06-10',end='2020-08-17')[1:]
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
I can see both of these dates are valid:
ts_log['2019-06-10']
8.95686647085414
ts_log['2020-08-17']
8.011686729127847
If, instead, I run it with integer positions, it works fine:
pred = results.predict(start=175,end=200)[1:]
I'd like to use dates so I can plot the predictions on my time series graph with other dates.
EmmaT,
you seem to have the same date for start and end:
start='2019-06-10', end='2019-06-10'
Please double-check that this is what you want. Also check that '2019-06-10' is present in the index of the data the model was fitted on (train, not ts_log).
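A quick way to check both points, as a sketch assuming the train series from the question: confirm the date exists in the index the model was actually fitted on, and that the index carries a frequency. After slicing and dropna() a DatetimeIndex can lose its freq, in which case statsmodels falls back to integer positions and date strings raise exactly this KeyError:
import pandas as pd

# the model was fitted on train, not ts_log, so check train's index
print(pd.Timestamp('2019-06-10') in train.index)

# if this prints None, date-based start/end cannot be resolved
print(train.index.freq)

# restoring an explicit weekly frequency (the data is weekly, Mondays)
# lets predict() accept date strings again
train = train.asfreq('W-MON')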
I'm trying to figure out how to incorporate lagged dependent variables into statsmodels or scikit-learn to forecast time series with AR terms, but cannot seem to find a solution.
The general linear equation looks something like this:
y = B1*y(t-1) + B2*x1(t) + B3*x2(t-3) + e
I know I can use pd.Series.shift(t) to create lagged variables and then add them to the model to generate parameters, but how can I get a prediction when the code does not know which variable is a lagged dependent variable?
In SAS's Proc Autoreg, you can designate which variable is a lagged dependent variable and will forecast accordingly, but it seems like there are no options like that in Python.
Any help would be greatly appreciated and thank you in advance.
Since you've already mentioned statsmodels in your tags, you may want to take a look at statsmodels' ARIMA, i.e.:
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(endog=t, order=(2, 0, 0))  # t is your endogenous series; p=2, d=0, q=0 for AR(2)
fit = model.fit()
fit.summary()
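The point of ARIMA here is that the model itself tracks which lags of the dependent variable it needs, so out-of-sample forecasting is built in, e.g. (assuming the fit above):
# the results object handles the AR recursion internally
fit.forecast(steps=5)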
But, as you mentioned, you could also create the lagged variables manually (I used some random data):
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['random_variable'] = np.random.randint(0, 10, len(df))
df['y'] = np.random.rand(len(df))
df.index = df['date']
df = df[['y', 'value', 'random_variable']]
df.columns = ['y', 'x1', 'x2']
# add lags 1 through 3 of every column as new ' AR(t)' columns
shifts = 3
for variable in df.columns.values:
    for t in range(1, shifts + 1):
        df[f'{variable} AR({t})'] = df.shift(t)[variable]
df = df.dropna()
>>> df.head()
y x1 x2 ... x2 AR(1) x2 AR(2) x2 AR(3)
date ...
1991-10-01 0.715115 3.611003 7 ... 5.0 7.0 7.0
1991-11-01 0.202662 3.565869 3 ... 7.0 5.0 7.0
1991-12-01 0.121624 4.306371 7 ... 3.0 7.0 5.0
1992-01-01 0.043412 5.088335 6 ... 7.0 3.0 7.0
1992-02-01 0.853334 2.814520 2 ... 6.0 7.0 3.0
[5 rows x 12 columns]
I'm using the model you describe in your post:
model = sm.OLS(df['y'], df[['y AR(1)', 'x1', 'x2 AR(3)']])
fit = model.fit()
>>> fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.696
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 150.8
Date: Tue, 08 Oct 2019 Prob (F-statistic): 6.93e-51
Time: 17:51:20 Log-Likelihood: -53.357
No. Observations: 201 AIC: 112.7
Df Residuals: 198 BIC: 122.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
y AR(1) 0.2972 0.072 4.142 0.000 0.156 0.439
x1 0.0211 0.003 6.261 0.000 0.014 0.028
x2 AR(3) 0.0161 0.007 2.264 0.025 0.002 0.030
==============================================================================
Omnibus: 2.115 Durbin-Watson: 2.277
Prob(Omnibus): 0.347 Jarque-Bera (JB): 1.712
Skew: 0.064 Prob(JB): 0.425
Kurtosis: 2.567 Cond. No. 41.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Hope this helps you get started.
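For forecasting ahead with a hand-built model like this, the fitted OLS indeed has no notion of which regressor is the lagged dependent variable, so you have to feed predictions back in yourself. A minimal sketch of recursive one-step-ahead forecasting, assuming the fit above and some made-up future x1 values:
# coefficients are indexed by regressor name: 'y AR(1)', 'x1', 'x2 AR(3)'
b = fit.params

last_y = df['y'].iloc[-1]
x2_lags = df['x2'].iloc[-3:].tolist()  # supplies x2(t-3) for 3 steps ahead
future_x1 = [4.0, 4.2, 4.1]            # assumed future exogenous values

forecasts = []
for i, x1_t in enumerate(future_x1):
    y_hat = (b['y AR(1)'] * last_y
             + b['x1'] * x1_t
             + b['x2 AR(3)'] * x2_lags[i])
    forecasts.append(y_hat)
    last_y = y_hat  # the prediction becomes next step's y AR(1)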
I have a set of 5000 data points like (x, y, z), e.g. (0, 1, 50), meaning x=0, y=1, z=50. With the help of these 5000 entries, I have to get an equation such that, given x and y, the equation can produce the value of z.
You can use statsmodels OLS. Some sample data, assuming you can create a pd.DataFrame from your (x, y, z) data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, size=(150, 3)), columns=list('XYZ'))
df.info()
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
X 150 non-null int64
Y 150 non-null int64
Z 150 non-null int64
Now estimate the linear regression parameters:
import statsmodels.api as sm

# note: no constant is added, so this fits a model through the origin
model = sm.OLS(df['Z'], df[['X', 'Y']])
results = model.fit()
to get:
results.summary()
OLS Regression Results
==============================================================================
Dep. Variable: Z R-squared: 0.652
Model: OLS Adj. R-squared: 0.647
Method: Least Squares F-statistic: 138.6
Date: Fri, 17 Jun 2016 Prob (F-statistic): 1.21e-34
Time: 13:48:38 Log-Likelihood: -741.94
No. Observations: 150 AIC: 1488.
Df Residuals: 148 BIC: 1494.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
X 0.5224 0.076 6.874 0.000 0.372 0.673
Y 0.3531 0.076 4.667 0.000 0.204 0.503
==============================================================================
Omnibus: 5.869 Durbin-Watson: 1.921
Prob(Omnibus): 0.053 Jarque-Bera (JB): 2.990
Skew: -0.000 Prob(JB): 0.224
Kurtosis: 2.308 Cond. No. 2.70
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
to predict, use:
params = results.params
df['predictions'] = model.predict(params)
which yields:
X Y Z predictions
0 31 85 75 54.701830
1 36 46 43 34.828605
2 77 42 8 43.795386
3 78 84 65 66.932761
4 27 54 50 36.737606
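To get z for new (x, y) pairs rather than scoring the training rows, a short sketch (the new values here are made up) passing fresh exog to predict:
import pandas as pd

# hypothetical new observations to score
new_xy = pd.DataFrame({'X': [10, 55], 'Y': [20, 40]})
print(results.predict(new_xy))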