Typically when we have a data frame we split it into train and test. For example, imagine my data frame is something like this:
> df.head()
Date y wind temperature
1 2019-10-03 00:00:00 33 12 15
2 2019-10-03 01:00:00 10 5 6
3 2019-10-03 02:00:00 39 6 5
4 2019-10-03 03:00:00 60 13 4
5 2019-10-03 04:00:00 21 3 7
I want to predict y based on the wind and temperature. We then do a split something like this:
df_train = df.loc[df.index <= split_date].copy()
df_test = df.loc[df.index > split_date].copy()

X_train = df_train[['wind', 'temperature']]
y_train = df_train['y']
X_test = df_test[['wind', 'temperature']]
y_test = df_test['y']

model.fit(X_train, y_train)
And we then predict on our test data. However, this uses the wind and temperature features from the test data frame. If I want to predict tomorrow's (unknown) y without knowing tomorrow's hourly temperature and wind, does the method no longer work? (For LSTM or XGBoost, for example)
The way you train your model, each row is considered an independent sample, regardless of the order, i.e. regardless of which values are observed earlier or later. If you have reason to believe that the chronological order is relevant to predicting y from wind speed and temperature, you will need to change your model.
You could try, e.g., to add another column with the values for wind speed and temperature one hour before (shift by one row), or, if you believe that y might depend on the weekday, compute the weekday from the date and add that as an input feature; see the sketch below.
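A minimal sketch of both ideas (assuming the Date column has been set as the hourly DatetimeIndex of df; the one-row lag and the new column names are illustrative):
df['wind_lag1'] = df['wind'].shift(1)                # wind one hour earlier
df['temperature_lag1'] = df['temperature'].shift(1)  # temperature one hour earlier
df['weekday'] = df.index.weekday                     # 0 = Monday ... 6 = Sunday
df = df.dropna()                                     # the first row has no lagged values
At prediction time the lagged columns only need values you have already observed, which sidesteps the problem of unknown future wind and temperature.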
I'm pretty new to time series.
This is the dataset I'm working on:
Date Price Location
0 2012-01-01 1771.0 Marche
1 2012-01-01 1039.0 Calabria
2 2012-01-01 2193.0 Campania
3 2012-01-01 2015.0 Emilia-Romagna
4 2012-01-01 1483.0 Friuli-Venezia Giulia
... ... ... ...
2475 2022-04-01 1963.0 Lazio
2476 2022-04-01 1362.0 Friuli-Venezia Giulia
2477 2022-04-01 1674.0 Emilia-Romagna
2478 2022-04-01 1388.0 Marche
2479 2022-04-01 1103.0 Abruzzo
I'm trying to build an LSTM for price prediction, but I don't know how to manage the Location categorical feature: do I have to use one-hot encoding or a groupby?
What I want to predict is the price based on the location.
How can I achieve that? A Python solution is particularly appreciated.
Thanks in advance.
Suppose my dataset (df) is analogous to yours:
Date Price Location
0 2021-01-01 791.076890 Campania
1 2021-01-01 705.702464 Lombardia
2 2021-01-01 719.991382 Sicilia
3 2021-02-01 825.760917 Lombardia
4 2021-02-01 747.734309 Sicilia
... ... ... ...
31 2021-11-01 886.874348 Lombardia
32 2021-11-01 935.040583 Campania
33 2021-12-01 771.165378 Sicilia
34 2021-12-01 952.255227 Campania
35 2021-12-01 939.754515 Lombardia
In my case I have a Price record for 3 regions (Campania, Lombardia, Sicilia) every month. My idea is to treat the different regions as different features, so I would transform df as:
df = df.set_index(["Date", "Location"]).Price.unstack()
Now my dataset is like:
Location Campania Lombardia Sicilia
Date
2021-01-01 791.076890 705.702464 719.991382
2021-02-01 758.872755 825.760917 747.734309
2021-03-01 880.038005 803.165998 837.738419
... ... ... ...
2021-10-01 908.402345 805.081193 792.369610
2021-11-01 935.040583 886.874348 736.862025
2021-12-01 952.255227 939.754515 771.165378
After this, please make sure there are no NaN values (df.isna().sum()).
Now you can pass this data to a multi-feature RNN (or LSTM), as in this example, or to a multi-channel 1D-CNN (choosing an appropriate kernel size). The only problem in both cases could be the small size of the dataset, so try not to over-parameterize the model (for example, reduce the number of neurons and layers), otherwise overfitting will be unavoidable. To check for this, you can test the model on the last 20% of your time series:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, shuffle=False, test_size=.2)
The last part is to build matching (X, Y) pairs for supervised learning, but this depends on what model you are using and what your prediction task is; see the sketch below. Another example here.
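As an illustration only (the 6-month window and the one-step-ahead target are arbitrary choices, not part of the original answer), a sliding-window construction could look like this:
import numpy as np

def make_windows(values, window=6):
    # X: `window` consecutive months of all regions
    # Y: the prices of the following month
    X, Y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        Y.append(values[i + window])
    return np.array(X), np.array(Y)

# df_train is the unstacked frame from above, one column per region
X, Y = make_windows(df_train.to_numpy(), window=6)
# X.shape == (n_samples, 6, 3): ready for an LSTM with 3 input features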
I am training an LSTM to predict requests. The input data is split by day, like so:
import datetime as dt
import numpy as np
import pandas as pd
start = dt.date(day=1, month=1, year=2022)
end = dt.date(day=8, month=1, year=2022)
training_data = pd.DataFrame([0, 0, 1, 0, 0, 2, 3], index=np.arange(start, end))
training_data
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
2022-01-06 2
2022-01-07 3
However, for my use case, I only need to predict weekly number of requests, not daily. So in this case, I would only need to predict 6 requests for the week starting on 2022-01-01.
I understand that I could resample the training data to also be on a weekly basis (a sketch of this follows below). But would it be possible to train on data split up by day and predict on weeks? My hypothesis is that the daily timesteps might help the model train, but my worry is that the alternate timescale here might mess up the training of the LSTM. I am also not sure exactly how normalization would work.
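For reference, the weekly resampling mentioned above is a one-liner in pandas; the W-FRI anchor is an assumption chosen so that the week starting Saturday 2022-01-01 lands in a single bin ending Friday 2022-01-07:
# weekly totals: 'W-FRI' ends each resampling bin on a Friday
weekly = training_data.resample('W-FRI').sum()
print(weekly)
#             0
# 2022-01-07  6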
I have a dataset with 84 Monthly Sales (from 01/2013 to 12/2019) - just months, not days.
Month 01 | Sale 1
Month 02 | Sale 2
Month 03 | Sale 3
.... | ...
Month 84 | Sale 84
From the visualization it looks like the model fits very well... but I need to check it...
What I understood is that cross_validation does not support months, so I converted it to use days (although there is no day information in my original df)...
I wanted to try my model with the first five years (60 months) and leave the remaining 2 years (24 months) to see how well the model is predicting...
So I did something like:
cv_results = cross_validation(model=prophet, initial='1825 days', period='30 days', horizon='60 days')
Does this make sense?
I did not get the concept of cutoff dates and forecast periods
I struggled with this for a while as well. But here is how it works. The initial model will be trained on the first 1,825 days of data. It will forecast the next 60 days of data (because horizon is set to 60). The model will then train on the initial period + the period (1,825 + 30 days in this case) and forecast the next 60 days. It will continue like this, adding another 30 days to the training data and then forecasting for the next 60, until there is no longer enough data to do this.
In summary, period is how much data to add to the training data set in every iteration of cross-validation, and horizon is how far out it will forecast.
Here is the setup:
The total number of data points is 700 days
Initial is 365 days
The period is 10 days
The horizon is 20 days
On the 1st iteration, it will train on days 1-365 and will forecast on days 366 to 385.
On the 2nd iteration, it will train on days 1-375 (the training window expands; it always starts from day 1) and will forecast on days 376 to 395, etc.
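A minimal sketch of that setup (assuming the newer prophet package; older releases import from fbprophet, and df must have the usual ds/y columns):
from prophet import Prophet
from prophet.diagnostics import cross_validation

m = Prophet()
m.fit(df)  # df needs columns 'ds' (dates) and 'y' (values)

# expanding-window CV: first cutoff after 365 days, a new cutoff
# every 10 days, forecasting 20 days past each cutoff
cv_results = cross_validation(m, initial='365 days',
                              period='10 days', horizon='20 days')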
I am currently using the Statsmodels library for forecasting. I am trying to run triple exponential smoothing with seasonality, trend, and a smoothing factor, but keep getting errors. Here is a sample pandas series with the frequency already set:
2018-01-01 25
2018-02-01 30
2018-03-01 40
2018-04-01 38
2018-05-01 33
2018-06-01 36
2018-07-01 34
2018-08-01 35
2018-09-01 37
2018-10-01 41
2018-11-01 36
2018-12-01 32
2019-01-01 31
2019-02-01 29
2019-03-01 28
2019-04-01 29
2019-05-01 30
Freq: MS, dtype: float64
Here is my code:
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing

def triple_expo(input_data_set, output_list1):
    data = input_data_set
    model1 = ExponentialSmoothing(data, trend='add', seasonal='mul', damped=False,
                                  seasonal_periods=12, freq='M').fit(
        smoothing_level=0.1, smoothing_slope=0.1, smoothing_seasonal=0.1, optimized=False)
    fcast1 = model1.forecast(12)
    fcast_list1 = list(fcast1)
    output_list1.append(fcast_list1)

for product in unique_products:
    product_slice = sorted_product_df["Product"] == product
    unique_slice = sorted_product_df[product_slice]
    amount = unique_slice["Amount"].tolist()
    index = unique_slice["Date"].tolist()
    unique_series = pd.Series(amount, index)
    unique_series.index = pd.DatetimeIndex(unique_series.index, freq=pd.infer_freq(unique_series.index))
    triple_expo(unique_series, triple_out_one)
I originally did not have a frequency argument at all because I was following the example on the statsmodels website which is here: http://www.statsmodels.org/stable/examples/notebooks/generated/exponential_smoothing.html
They did not pass a frequency argument at all, as they inferred it using the DatetimeIndex in pandas. When I have no frequency argument, my error is "operands could not be broadcast together with shapes <5,> <12,>". I have 17 months of data and pandas recognizes the frequency as 'MS'. The 5 and the 12 refer to the 17 months of data (12 + 5). I then pass freq='M' like I have in the code sample, and I get "The given frequency argument is incompatible with the given index". I then tried setting it to len(data) and len(data)-1 and always get errors. I tried len(data)-1 because I was originally referencing this Stack Overflow post: seasonal_decompose: operands could not be broadcast together with shapes on a series
In that post, he said it would work if you set the frequency to one less than the length of the data set. It does not work for me though. Any help would be appreciated. Thanks!
EDIT: The comment below suggested I include more than one year's worth of data, and it worked when I did that (with seasonal_periods=12, the seasonal model needs at least two full seasonal cycles, i.e. 24 monthly observations).
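For reference, a minimal sketch of a setup that runs (the data here is invented; the point is only that the series spans two full 12-month cycles; note that newer statsmodels versions rename smoothing_slope to smoothing_trend):
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# 24 months of invented data: two full seasonal cycles
idx = pd.date_range('2018-01-01', periods=24, freq='MS')
data = pd.Series(30 + 5 * np.sin(np.arange(24) * 2 * np.pi / 12), index=idx)

# no freq argument needed: the DatetimeIndex already carries freq='MS'
model = ExponentialSmoothing(data, trend='add', seasonal='mul',
                             seasonal_periods=12).fit(
    smoothing_level=0.1, smoothing_trend=0.1,  # smoothing_slope in older versions
    smoothing_seasonal=0.1, optimized=False)
print(model.forecast(12))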
Hi, I want to use one of scikit-learn's functions for cross-validation. What I want is for the splitting of the folds to be determined by one of the indexes. For example, let's say I have this data with "month" and "day" being the indexes:
Month Day Feature_1
January 1 10
2 20
February 1 30
2 40
March 1 50
2 60
3 70
April 1 80
2 90
Let's say I want 1/4 of the data as the test set for each validation. I want this fold separation to be done by the first index, which is the month. In this case the test set will be one of the months and the remaining 3 months will be the training set. As an example, one of the train/test splits would look like this:
TEST SET:
Month Day Feature_1
January 1 10
2 20
TRAINING SET:
Month Day Feature_1
February 1 30
2 40
March 1 50
2 60
3 70
April 1 80
2 90
How can I do this? Thank you.
This is called splitting by a group. Check out the scikit-learn user guide here to understand more about it:
...
To measure this, we need to ensure that all the samples in the
validation fold come from groups that are not represented at all in
the paired training fold.
...
You can use GroupKFold or other strategies that have "Group" in the name. A sample could be:
# I am not sure about this exact command,
# but after this, you should have individual columns for each index
df = df.reset_index()
print(df)
Month Day Feature_1
January 1 10
January 2 20
February 1 30
February 2 40
March 1 50
March 2 60
March 3 70
groups = df['Month']

from sklearn.model_selection import GroupKFold

# X and y were not defined in the original snippet; for illustration,
# use Feature_1 as the only feature and a placeholder target:
X = df[['Feature_1']]
y = df['Feature_1']  # replace with your actual target

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    # Here "train" and "test" are positional indices,
    # so use "iloc" to get the actual rows
    print("%s %s" % (train, test))
    print(df.iloc[train, :])
    print(df.iloc[test, :])
Update: To use this with the cross-validation methods, just pass the months to the groups param. Like below:
from sklearn.model_selection import GroupKFold, cross_val_predict

gkf = GroupKFold(n_splits=3)
y_pred = cross_val_predict(estimator, X_train, y_train, cv=gkf, groups=df['Month'])
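If instead you only want a single held-out month as a 1/4 test split (as in the question), one of the other "Group" strategies works too; a sketch with GroupShuffleSplit, reusing the names from above:
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=df['Month']))
df_train, df_test = df.iloc[train_idx], df.iloc[test_idx]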
Use -
import numpy as np

indices = df.index.levels[0]
train_indices = np.random.choice(indices, size=int(len(indices) * 0.75), replace=False)
test_indices = np.setdiff1d(indices, train_indices)

train = df[np.in1d(df.index.get_level_values(0), train_indices)]
test = df[np.in1d(df.index.get_level_values(0), test_indices)]
Output
Train
Feature_1
Month Day
January 1 10
2 20
February 1 30
2 40
March 1 50
2 60
3 70
Test
Feature_1
Month Day
April 1 80
2 90
Explanation
indices = df.index.levels[0] takes all the unique values from the level=0 index - Index(['April', 'February', 'January', 'March'], dtype='object', name='Month')
train_indices = np.random.choice(indices, size=int(len(indices)*0.75), replace=False) samples 75% of the indices chosen in the previous step
Next we obtain the remaining indices to be test_indices
Finally we split train and test accordingly
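Note that np.random.choice makes the month selection non-deterministic; if you need a reproducible split, fix the seed first:
np.random.seed(0)  # any fixed seed makes the same months land in train/test each run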