Trying to Understand FB Prophet Cross Validation


I have a dataset with 84 Monthly Sales (from 01/2013 to 12/2019) - just months, not days.
Month 01 | Sale 1
Month 02 | Sale 2
Month 03 | Sale 3
.... | ...
Month 84 | Sale 84
Visually, the model seems to fit very well, but I need to check it.
What I understood is that cross-validation does not support months, so I converted everything to days (although there is no day information in my original df).
I wanted to train my model on the first five years (60 months) and hold out the remaining 2 years (24 months) to see how well the model predicts.
So I did something like:
cv_results = cross_validation(model=prophet, initial='1825 days', period='30 days', horizon='60 days')
Does this make sense?
I did not get the concept of cutoff dates and forecast periods.

I struggled with this for a while as well. But here is how it works. The initial model will be trained on the first 1,825 days of data. It will forecast the next 60 days of data (because horizon is set to 60 days). The model will then train on the initial period + the period (1,825 + 30 days in this case) and forecast the next 60 days. It will continue like this, adding another 30 days to the training data and then forecasting the next 60, until there is no longer enough data to do this.
In summary, period is how much data to add to the training data set in every iteration of cross-validation, and horizon is how far out it will forecast.
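The schedule described above can be sketched in plain Python, using day numbers instead of dates. This only illustrates the description in this answer: expanding_folds is a hypothetical helper, not part of Prophet, the 2,556-day total is an assumed length for January 2013 to December 2019, and the library itself may place its cutoffs differently (for example by counting back from the end of the data), so treat the exact day numbers as approximate.

```python
def expanding_folds(total_days, initial, period, horizon):
    """Return (train_start, train_end, test_start, test_end) tuples, 1-based days."""
    folds = []
    train_end = initial  # the cutoff: last day used for training
    # Keep adding folds while a full horizon still fits inside the data.
    while train_end + horizon <= total_days:
        folds.append((1, train_end, train_end + 1, train_end + horizon))
        train_end += period  # the training window grows by one period
    return folds


# Assumed total: about 7 years of data expressed in days.
folds = expanding_folds(total_days=2556, initial=1825, period=30, horizon=60)
print(len(folds))
print(folds[0])
print(folds[-1])
```

Every fold starts training at day 1, so the training window grows rather than slides. In the data frame returned by cross_validation, the cutoff column plays the role of train_end here.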

Here is the setup:
The total number of data points is 700 days
Initial is 365 days
The period is 10 days
The horizon is 20 days
On the 1st iteration, it will train on days 1-365 and will forecast on days 366 to 385.
On the 2nd iteration, it will train on days 1-375 (the training window grows, it does not slide) and will forecast on days 376 to 395, etc.

Related

Calculating rolling on a mixed date using groupby

I have to calculate the number of days in the last 30 days on which the temperature was above 32 degrees C.
I am trying to use a rolling window. The issue is that the number of days in a month varies.
weather_2['highTemp_days'] = weather_2.groupby(['date','station'])['over32'].apply(lambda x: x.rolling(len('month')).sum())
weather_2 has 66 stations
date varies from 1950 to 2020
over32 is boolean data. If temp on that date is > 32 then 1 otherwise zero.
month is taken from the date data which is weather_2['month'] = weather_2['date'].dt.month

I used this
weather_2['highTemp_days'] = weather_2.groupby(['year','station'])['over32'].apply(lambda x: x.rolling(30).sum())
The issue was that I was grouping by date. That is why the answer was wrong.
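As an alternative sketch, a time-based rolling window avoids counting rows altogether, which also sidesteps the varying month lengths. The toy data below is hypothetical (two stations, five days each), and the 3-day window is chosen only to keep the example readable; "30D" would be the window for the real question. It assumes each station has one row per date and that the frame is sorted by station and date before the result is assigned back.

```python
import pandas as pd

# Hypothetical toy data: over32 is 1 when the temperature exceeded 32 C.
weather = pd.DataFrame({
    "station": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2020-01-01", periods=5, freq="D")) * 2,
    "over32": [1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
})

# Sort so that the grouped result lines up row by row with the frame.
weather = weather.sort_values(["station", "date"]).reset_index(drop=True)

rolled = (
    weather.set_index("date")
    .groupby("station")["over32"]
    .rolling("3D")  # window defined in calendar days, not in rows
    .sum()
)
weather["highTemp_days"] = rolled.to_numpy()
print(weather)
```

Unlike grouping by year, this window does not reset on 1 January, so a date in early January still looks back into December.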

How to get week number from given date in range of 1 to 5 only, not in 1 to 52

I'm working on a machine learning regression problem where I predict a sales value based on input features. The date is one of the features, and I want to extract the month and week number from it. The month comes out as 1 to 12, which is fine. For weeks I get 1 to 52, which is also correct, but I'm trying to get the week number in the range 1 to 5, since some months have 4 weeks and some have 5.
I have tried the available methods for getting the week number, but they only give values in the range 1 to 52. I cannot simply divide this by 4, otherwise no month will have 5 weeks.
This code gives output in the range 1 to 52, and I have also tried several other methods.
df['Week'] = df['Date'].dt.week
It should work like this: if a particular date belongs to the fifth week of the month, then it should give week number 5.

Week number refers to week of year in most contexts. Week of month is not a standard notion and is thus not implemented in pandas. You can implement it yourself. See e.g. this question on Stack Overflow.
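Since there is no standard definition, any implementation has to pick one. The sketch below assumes the simplest convention: days 1-7 are week 1, days 8-14 are week 2, and so on, which always yields a value from 1 to 5. The helper name week_of_month is made up for this example; a calendar-aligned definition (weeks starting on Monday) would need different arithmetic and can produce a sixth week.

```python
from datetime import date


def week_of_month(d):
    """Week 1 covers days 1-7, week 2 days 8-14, ..., week 5 days 29-31."""
    return (d.day - 1) // 7 + 1


for d in (date(2021, 1, 1), date(2021, 1, 8), date(2021, 1, 29), date(2021, 2, 28)):
    print(d, week_of_month(d))
```

On a pandas datetime column the same arithmetic can be written as (df['Date'].dt.day - 1) // 7 + 1.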

Linear Regression in Pandas Groupby with freq='W-MON'

I have data over the timespan of over a year. I am interested in grouping the data by week, and getting the slope of two variables by week. Here is what the data looks like:
Date | Total_Sales| Products
2015-12-30 07:42:50| 2900 | 24
2015-12-30 09:10:10| 3400 | 20
2016-02-07 07:07:07| 5400 | 25
2016-02-07 07:08:08| 1000 | 64
So ideally I would like to perform a linear regression on total_sales and products on each week of this data and record the slope. This works when each week is represented in the data, but I have problems when there are some weeks skipped in the data. I know I could do this with turning the date into the week number but I feel like the result will be skewed because there is over a year's worth of data.
Here is the code I have so far:
df['Date']=pd.to_datetime(vals['EventDate']) - pd.to_timedelta(7,unit='d')
df.groupby(pd.Grouper(key='Week', freq='W-MON')).apply(lambda v: linregress(v.Total_Sales, v.Products)[0]).reset_index()
However, I get the following error:
ValueError: Inputs must not be empty.
I expect the output to look like this:
Date | Slope
2015-12-28 | -0.008
2016-02-01 | -0.008

I assume this happens because the groupby cannot use the Date column properly as a key, since its values carry varying timestamps.
Try the following code. It worked for me:
from datetime import timedelta
from scipy import stats
df['Date'] = pd.to_datetime(df['Date'])  #### Converts the Date column to datetimes
df['daysoffset'] = df['Date'].apply(lambda x: x.weekday())
#### Day of the week as an integer, where Monday is 0 and Sunday is 6.
df['week_start'] = df.apply(lambda x: x['Date'].date() - timedelta(days=int(x['daysoffset'])), axis=1)
#### x['Date'].date() drops the time of day and keeps only the date;
#### the line assigns the Monday that starts each row's week to column 'week_start'.
df.groupby('week_start').apply(lambda v: stats.linregress(v.Total_Sales, v.Products)[0]).reset_index()

Python Scikit - Learn: Cross Validation with multi-index

Hi I want to use one of the scikit learn's functions for cross validation. What I want is that the splitting of the folds is determined by one of the indexes. For example lets say I have this data with "month" and "day" being the indexes:
Month Day Feature_1
January 1 10
2 20
February 1 30
2 40
March 1 50
2 60
3 70
April 1 80
2 90
Lets say I want to have a 1/4 of the data as test set for each validation. I want this fold seperation to be done by the first index which is the month. In this case the test set will be one of the months and the remaining 3 months will be the training set. As an example one of the train and test split will look like this:
TEST SET:
Month Day Feature_1
January 1 10
2 20
TRAINING SET:
Month Day Feature_1
February 1 30
2 40
March 1 50
2 60
3 70
April 1 80
2 90
How can I do this. Thank you.

This is called splitting by a group. Check out the user-guide in scikit-learn here to understand more about it:
...
To measure this, we need to ensure that all the samples in the
validation fold come from groups that are not represented at all in
the paired training fold.
...
You can use GroupKFold or other strategies that have Group in the name. A sample:
# I am not sure about this exact command,
# but after this, you should have individual columns for each index
df = df.reset_index()
print(df)
Month Day Feature_1
January 1 10
January 2 20
February 1 30
February 2 40
March 1 50
March 2 60
March 3 70
groups = df['Month']
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=3)
# X and y are your feature matrix and target, row-aligned with df
for train, test in gkf.split(X, y, groups=groups):
    # Here "train", "test" are indices of location,
    # you need to use "iloc" to get actual values
    print("%s %s" % (train, test))
    print(df.iloc[train, :])
    print(df.iloc[test, :])
Update: For passing this into cross-validation methods, just pass the months data to groups param in those. Like below:
gkf = GroupKFold(n_splits=3)
y_pred = cross_val_predict(estimator, X_train, y_train, cv=gkf, groups=df['Month'])

Use -
import numpy as np
indices = df.index.levels[0]
train_indices = np.random.choice(indices,size=int(len(indices)*0.75), replace=False)
test_indices = np.setdiff1d(indices, train_indices)
# np.isin is the current name for the older np.in1d
train = df[np.isin(df.index.get_level_values(0), train_indices)]
test = df[np.isin(df.index.get_level_values(0), test_indices)]
Output
Train
Feature_1
Month Day
January 1 10
2 20
February 1 30
2 40
March 1 50
2 60
3 70
Test
Feature_1
Month Day
April 1 80
2 90
Explanation
indices = df.index.levels[0] takes all the unique values from the level=0 index - Index(['April', 'February', 'January', 'March'], dtype='object', name='Month')
train_indices = np.random.choice(indices,size=int(len(indices)*0.75), replace=False) samples 75% of the indices chosen in previous step
Next we obtain the remaining indices to be test_indices
Finally we split train and test accordingly
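To make the idea of splitting by a group concrete, here is a plain-Python sketch that holds out one month per fold, as the question asks. The helper leave_one_group_out is written for this example only. scikit-learn offers the same strategy as LeaveOneGroupOut in sklearn.model_selection, which is normally what you would use; this version just shows what such a splitter returns.

```python
# The question's data as (Month, Day, Feature_1) rows.
rows = [
    ("January", 1, 10), ("January", 2, 20),
    ("February", 1, 30), ("February", 2, 40),
    ("March", 1, 50), ("March", 2, 60), ("March", 3, 70),
    ("April", 1, 80), ("April", 2, 90),
]


def leave_one_group_out(groups):
    """Yield (held_out_group, train_positions, test_positions), one fold per group."""
    seen = list(dict.fromkeys(groups))  # unique groups, in first-seen order
    for held_out in seen:
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


months = [month for month, _, _ in rows]
splits = list(leave_one_group_out(months))
for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
```

The positions are row numbers, so they can be used with df.iloc after df.reset_index(), in the same way as the indices returned by GroupKFold.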

Tricky groupby/moving average by date calculation

I am having trouble illustrating my problem with the form the data is in without complicating things. So bear with me: the following screenshot is for explaining the problem only (i.e. the data is not in this form):
I would like to identify the past 14 days with a number > 0 across all bins (i.e. the total row has a value greater than 0). This would include all days except for days 5 and 12 (highlighted in red). I would then like to sum across bins horizontally for those 14 days (i.e. sum all days except for 5 and 12, by bin), with the goal of ultimately calculating a 14-day average by Bin number.
Note the example above would be for one “Lane”, where my data has > 10,000. The example also only illustrates today being day 16. But I would like to apply this logic to every day in the data set. I.e. on day 20 (along with any other date), it would look at the last 14 days with a value across all bins, then use that data range to aggregate across Bin. This is a screenshot sample of how the data looks:
A simple example using the data as it is structured, with only 3 Bins, 1 Lane, and a 3 data point/date look back:
Lane Date Bin KG
AMS-ORD 2018-08-26 3 10
AMS-ORD 2018-08-29 1 25
AMS-ORD 2018-08-30 2 30
AMS-ORD 2018-09-03 2 20
AMS-ORD 2018-09-04 1 40
Note KG here is a sum. Again this is for one day (aka today), but I would like every date in my data set to follow the same logic. The output would look like the following:
Lane Date Bin KG Average
AMS-ORD 2018-09-04 1 40 13.33
AMS-ORD 2018-09-04 2 50 16.67
AMS-ORD 2018-09-04 3 0 -
I have messed around with .rolling(14).mean(), .tail(), and some others. The problem I have is specifying the correct date range for the correct Bin aggregation.
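The logic described above can be written out in plain Python for the small example (3 bins, 1 lane, a look-back of 3 dates). This is a sketch of the stated rule, not a tuned solution for 10,000+ lanes: lookback_average is a hypothetical helper, it treats a date as counting when its total KG across bins is above 0, and it reports an average of 0.0 for a bin with no data where the expected output shows a dash.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question: (Lane, Date, Bin, KG).
rows = [
    ("AMS-ORD", date(2018, 8, 26), 3, 10),
    ("AMS-ORD", date(2018, 8, 29), 1, 25),
    ("AMS-ORD", date(2018, 8, 30), 2, 30),
    ("AMS-ORD", date(2018, 9, 3), 2, 20),
    ("AMS-ORD", date(2018, 9, 4), 1, 40),
]


def lookback_average(rows, lane, as_of, n_dates, bins):
    """Sum KG per bin over the last n_dates dates with data, then divide by n_dates."""
    # Total KG per date for this lane, up to and including as_of.
    daily = defaultdict(float)
    for ln, d, _, kg in rows:
        if ln == lane and d <= as_of:
            daily[d] += kg
    # Keep only dates whose total is above 0, then take the most recent n_dates.
    active = sorted(d for d, total in daily.items() if total > 0)
    window = set(active[-n_dates:])
    # Sum KG per bin inside that window.
    totals = defaultdict(float)
    for ln, d, b, kg in rows:
        if ln == lane and d in window:
            totals[b] += kg
    return {b: (totals[b], totals[b] / n_dates) for b in bins}


result = lookback_average(rows, "AMS-ORD", date(2018, 9, 4), n_dates=3, bins=[1, 2, 3])
for b, (kg, avg) in result.items():
    print(b, kg, round(avg, 2))
```

Calling the function once per date and lane applies the same rule to every day in the data set; for large data a vectorised pandas version would be preferable.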
