Understanding the TimeSeriesDataSet in pytorch forecasting - python

Here is a code sample taken from one of the pytorch-forecasting tutorials:
# create dataset and dataloaders
max_encoder_length = 60
max_prediction_length = 20
training_cutoff = data["time_idx"].max() - max_prediction_length
context_length = max_encoder_length
prediction_length = max_prediction_length
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="value",
    categorical_encoders={"series": NaNLabelEncoder().fit(data.series)},
    group_ids=["series"],
    # only unknown variable is "value" - and N-Beats can also not take any additional variables
    time_varying_unknown_reals=["value"],
    max_encoder_length=context_length,
    max_prediction_length=prediction_length,
)
validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)
batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)
I don't really understand how the validation dataset is constructed with respect to the time index. I also don't understand why there is no test dataset in the tutorial. Is there a specific reason for that?

Concerning the validation dataset:
The training dataset is all the data except the last max_prediction_length data points of each time series (each time series corresponds to the data points sharing the same group_ids).
Those last data points are filtered out by the training cutoff (the cutoff is the same for every time series because they all have the same length).
The validation dataset uses the last max_prediction_length data points of each time series as targets, which means each validation sample spans the last max_encoder_length + max_prediction_length points of its time series.
This is done with the parameter min_prediction_idx=training_cutoff + 1, which makes the dataset build only samples whose first predicted time_idx is at least training_cutoff + 1 (the minimal decoder index is always >= min_prediction_idx).
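A minimal sketch of how the cutoff and min_prediction_idx interact, using toy numbers (a single series with time_idx 0..99 and the tutorial's lengths), not taken from the tutorial itself:
import pandas as pd

# toy data: one series with time_idx 0..99
data = pd.DataFrame({"series": 0, "time_idx": range(100), "value": 0.0})

max_encoder_length = 60
max_prediction_length = 20
training_cutoff = data["time_idx"].max() - max_prediction_length  # 99 - 20 = 79

# rows available to build training samples: time_idx 0..79
train_rows = data[data.time_idx <= training_cutoff]

# validation samples must start predicting at time_idx >= training_cutoff + 1 = 80,
# so the last validation window uses encoder steps 20..79 and decoder steps 80..99
print(len(train_rows), training_cutoff + 1)  # 80 80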

Related

How to add in a Keras time series prediction model the current value of an exogenous variable (besides the past values of other variables)?

I'm following a Keras example which tries to predict a multivariate time series with an LSTM model: the Temperature given past values of temperature and other physical variables.
https://keras.io/examples/timeseries/timeseries_weather_forecasting/
https://github.com/keras-team/keras-io/blob/master/examples/timeseries/timeseries_weather_forecasting.py
The model is fed with past values (rolling window) and predicts future values.
I have tried to extend it with additional variables generated by me: introducing the hour and the month into the model.
df['HourX']=np.sin(df.DateTime.dt.hour*2*np.pi/24)
df['HourY']=np.cos(df.DateTime.dt.hour*2*np.pi/24)
df['MonthX']=np.sin(df.DateTime.dt.month*2*np.pi/12)
df['MonthY']=np.cos(df.DateTime.dt.month*2*np.pi/12)
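(The sine/cosine pairs are the usual cyclic encoding: they keep hours that are close in time close in feature space, even across the midnight wrap-around. A quick standalone check, not part of the example:)
import numpy as np

def hour_xy(h):
    # map an hour 0..23 onto the unit circle
    return np.sin(h * 2 * np.pi / 24), np.cos(h * 2 * np.pi / 24)

# hour 23 and hour 0 end up close together, unlike the raw values 23 and 0
print(np.hypot(*np.subtract(hour_xy(23), hour_xy(0))))   # ~0.26
print(np.hypot(*np.subtract(hour_xy(12), hour_xy(0))))   # 2.0 (opposite side of the circle)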
Now I have these columns in my dataset:
Temperature, Pressure, Humidty, HourX, HourY, MonthX, MonthY
This is the original code used to create the model and fit it:
start = past + future
end = start + train_split
x_train = train_data[[i for i in range(7)]].values
y_train = features.iloc[start:end][[1]]
sequence_length = int(past / step)
dataset_train = keras.preprocessing.timeseries_dataset_from_array(
    x_train, y_train, sequence_length=sequence_length,
    batch_size=batch_size,
)
x_end = len(val_data) - past - future
label_start = train_split + past + future
x_val = val_data.iloc[:x_end][[i for i in range(7)]].values
y_val = features.iloc[label_start:][[1]]
dataset_val = keras.preprocessing.timeseries_dataset_from_array(
    x_val, y_val, sequence_length=sequence_length, batch_size=batch_size,
)
inputs = keras.layers.Input(shape=(inputs.shape[1], inputs.shape[2]))
lstm_out = keras.layers.LSTM(32)(inputs)
outputs = keras.layers.Dense(1)(lstm_out)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
loss="mse")
model.fit(dataset_train, epochs=epochs, validation_data=dataset_val)
My question is...
This example only uses past values of all the variables.
But I would also like to use the current value of the newly created columns HourX, HourY, MonthX, MonthY. (This would not be cheating, since we always know them in advance.)
How can I tell Keras to keep using the old variables as we did (past values until a given time), but also use all the values of the new columns up to the current prediction time?
This image shows how it's done when you feed the original model:
Only use values until a certain time.
And this is what I want to achieve:
In the second set of variables (the ones I've created) I want to use all values until the predicted value.
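One possible way to achieve this (a hedged sketch, not from the original example) is a two-input functional model: one input carries the rolling window of past values as before, and a second input carries the values of HourX, HourY, MonthX, MonthY at the prediction time, concatenated with the LSTM output before the final Dense layer:
from tensorflow import keras

sequence_length = 120   # length of the past window (assumed value)
n_past_features = 7     # Temperature, Pressure, Humidity, HourX, HourY, MonthX, MonthY
n_known_features = 4    # HourX, HourY, MonthX, MonthY at the prediction time

past_inputs = keras.layers.Input(shape=(sequence_length, n_past_features), name="past")
known_inputs = keras.layers.Input(shape=(n_known_features,), name="known_future")

lstm_out = keras.layers.LSTM(32)(past_inputs)
merged = keras.layers.Concatenate()([lstm_out, known_inputs])
outputs = keras.layers.Dense(1)(merged)

model = keras.Model(inputs=[past_inputs, known_inputs], outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# The training dataset then has to yield ((past_window, known_values_at_target_time), target)
# tuples, e.g. by zipping a windowed dataset of past features with a dataset of the
# hour/month columns aligned to each target row.
This keeps the LSTM focused on the history while the known-in-advance calendar features are injected directly at the prediction step.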

Trouble setting up LSTM model for multivariate time series forecasting

I am having issues with how many LSTM units to use for multivariate forecasting.
The dataset I'm working on:
Weather Dataset (kaggle)
df_final = df.loc['2011-01-01':'2014-10-31']
df_final['Temperature (C)'].plot(figsize=(28,6))
In the end I want to forecast temperature (other parameters too, but mainly temperature).
The data has hourly readings.
# How many rows per month?
rows_per_months=24*30
test_months = 12 #number of months we want to predict in the future.
test_indices = test_months*720
test_indices
# train and test split:
train = df_final.iloc[:-test_indices]
# Choose the variable/parameter you want to predict
test = df_final.iloc[-test_indices:]
len(train)
#[op]: 24960
scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(train)
scaled_test = scaler.transform(test)
#define generator:
length = rows_per_months  # length of the output sequences (in number of timesteps)
batch_size = 30  # number of timeseries samples in each batch
generator = tf.keras.preprocessing.sequence.TimeseriesGenerator(scaled_train,scaled_train,length=length,batch_size=batch_size)
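As a quick sanity check (illustrative only, assuming the scaled training data has 4 feature columns), indexing the generator shows the shapes it feeds to the model: inputs of shape (batch_size, length, n_features) and targets of shape (batch_size, n_features):
X_batch, y_batch = generator[0]
print(X_batch.shape)  # e.g. (30, 720, 4)
print(y_batch.shape)  # e.g. (30, 4)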
Defining Model
# define model
model = Sequential()
model.add(tf.keras.layers.LSTM(50, input_shape=(length,scaled_train.shape[1])))
#NOTE: Do not specify the activation function for LSTM layers, this is because it will not run on GPU.
model.add(Dense(scaled_train.shape[1]))
model.compile(optimizer='adam', loss='mse')
After training, I evaluate the model on the test data:
first_eval_batch = scaled_train[-length:]
first_eval_batch.shape
first_eval_batch = first_eval_batch.reshape((1,length,scaled_train.shape[1]))
n_features = scaled_test.shape[1]  # = scaled_train.shape[1] = 4: all parameters are predicted for the next time stamp
# (ultimately only the temperature column is the one of interest)
test_predictions = []
first_eval_batch = scaled_train[-length:]
current_batch = first_eval_batch.reshape((1, length, n_features))
print(current_batch.shape)
#output:(1, 720, 4)
for i in range(len(test)):
    # get the prediction one time stamp ahead
    current_pred = model.predict(current_batch)[0]
    # store the prediction
    test_predictions.append(current_pred)
    # update the current batch to include the prediction and drop the oldest value
    current_batch = np.append(current_batch[:, 1:, :], [[current_pred]], axis=1)
true_predictions = scaler.inverse_transform(test_predictions)
true_predictions = pd.DataFrame(data=true_predictions, columns=test.columns, index=test.index)
The resultant Dataframe:
result_df = pd.concat([test['Temperature (C)'], true_predictions['Temperature (C)']],axis=1)
result_df.plot(figsize=(28,8))
Since the model does not predict the correct value of test data, I cannot proceed further with the forecasting.
How do I fix this?
PS:
I have tried playing with the number of layers and the number of LSTM units per layer, but nothing seems to work. I have also tried different activation functions such as relu, but then one epoch takes more than 20 hours, since the model will not train on the GPU unless the LSTMs use their default activation functions (i.e. tanh).

How can I slice multi-feature time-series data to get a continuous plot containing [train + test + prediction]?

I have a formatted dataset that looks like a matrix [NxM], where N = 40 is the total number of cycles (time stamps) and M = 1440 pixels. For every cycle I have 1440 pixel values, one per pixel. I've used different models to predict the pixel values of future cycles based on the past 10 cycles.
The problem is that I couldn't achieve a proper continuous plot after training the NN, most probably because of the data split technique I've used via train_test_split (I never tried TimeSeriesSplit), as follows:
trainX, testX, trainY, testY = train_test_split(trainX,trainY, test_size=0.2 , shuffle=False)
The first problem: since I've used shuffle=False, I expected the last 20% of the data to be treated as test data and to be able to plot it correctly, but I couldn't, because:
1) I lose 10 cycles because of the history function create_dataset(), which looks at the past 10 cycles to forecast the future one, as you can see here:
def create_dataset(dataset, data_train, look_back=1):
    dataX, dataY = [], []
    print("Len:", len(dataset) - look_back - 1)
    for i in range(len(dataset) - look_back - 1):
        a = dataset[i:(i + look_back), :]
        dataX.append(a)
        dataY.append(data_train[i + look_back, :])
    return np.array(dataX), np.array(dataY)
look_back = 10
trainX,trainY = create_dataset(data_train,Y_train, look_back=look_back)
2) I also can't tell which cycle indices end up in the test data, so when I plot the predictions they start from 0 instead of continuing from the last cycle of the training data.
My expected result after slicing the data correctly is something like the following continuous plot, which I made manually in Paint on Windows 7:
Any advice would be greatly appreciated.
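A minimal sketch of one way to keep the x-axis continuous, assuming shuffle=False and the arrays produced by create_dataset above (data and predictions are placeholder names for the full [40 x 1440] matrix and the model's test-set predictions):
import numpy as np
import matplotlib.pyplot as plt

look_back = 10
# window i produced by create_dataset predicts cycle i + look_back,
# and with shuffle=False the test windows are the last ones,
# so the first test target corresponds to cycle len(trainY) + look_back
first_test_cycle = len(trainY) + look_back
test_cycles = np.arange(first_test_cycle, first_test_cycle + len(testY))

pixel = 0  # plot a single pixel column as an example
plt.plot(np.arange(len(data)), data[:, pixel], label="actual")
plt.plot(test_cycles, predictions[:, pixel], label="prediction")
plt.legend()
plt.show()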

TensorFlow DNNClassifier getting bad results

I am trying to make a classifier that learns whether a movie review was positive or negative from its contents. I am using a couple of relevant files: a file with the total vocabulary (one word per line) across every document, two CSVs (one for the training set, one for the test set) containing the score each document got, in a specific order, and two CSVs (same split as above) where each line holds the indices of the words that appear in that review, treating the vocabulary as a list. So for a review like "I liked this movie" I have something like a score line of 1 (0: dislike, 1: like) and a word line of [2, 13, 64, 33]. I use the DNNClassifier and currently use one feature, which is an embedding column wrapped around a categorical_column_with_identity. My code runs but it gets absolutely terrible results and I'm not sure why. Perhaps someone with more knowledge about TensorFlow could help me out. I don't post here much, but I honestly tried and couldn't find a post that directly helps me.
import tensorflow as tf
import pandas as pd
import numpy as np
import os
embedding_d = 18
label_name = ['Label']
col_name = ["Words"]
hidden_unit = [10]*5
BATCH = 50
STEP = 5000
#Ignore some warning messages but an optional compiler
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
##Function to feed into training
def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    # Return the dataset.
    return dataset
## Original eval function. Untouched so far; it will most likely need to be changed.
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    # Batch the examples.
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)
    # Return the dataset.
    return dataset
## Produces dataframe for labels and features(words) using pandas
def loadData():
    train_label = pd.read_csv("aclImdb/train/yaynay.csv", names=label_name)
    test_label = pd.read_csv("aclImdb/test/yaynay.csv", names=label_name)
    train_feat = pd.read_csv("aclImdb/train/set.csv", names=col_name)
    test_feat = pd.read_csv("aclImdb/test/set.csv", names=col_name)
    train_feat[col_name] = train_feat[col_name].astype(np.int64)
    test_feat[col_name] = test_feat[col_name].astype(np.int64)
    return (train_feat, train_label), (test_feat, test_label)
## Stuff that I believe is somewhat working
# Get labels for test and training data
(train_x,train_y), (test_x,test_y) = loadData()
## Get the features for each document
train_feature = []
#Currently only one key but this could change in the future
for key in train_x.keys():
    # Create a categorical column
    idCol = tf.feature_column.categorical_column_with_identity(
        key=key,
        num_buckets=89528)
    embedding_column = tf.feature_column.embedding_column(
        categorical_column=idCol,
        dimension=embedding_d)
    train_feature.append(embedding_column)
##Create the neural network
classifier = tf.estimator.DNNClassifier(
    feature_columns=train_feature,
    # Specifies the number of layers and the number of neurons in each layer
    hidden_units=hidden_unit,
    # Number of output classes (2 here: negative or positive)
    n_classes=2)
# Train the model.
# BATCH is the batch size, STEP is the total number of steps to take.
classifier.train(input_fn=lambda: train_input_fn(train_x, train_y, BATCH), steps=STEP)
# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_x, test_y, BATCH), steps=STEP)
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
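As a quick sanity check (a debugging sketch only, assuming TF 2.x eager execution), it may help to peek at one batch produced by the input function and confirm that each example carries the full list of word indices rather than a single integer per review:
# hypothetical debugging snippet, not part of the original pipeline
ds = train_input_fn(train_x, train_y, BATCH)
for feats, labels in ds.take(1):
    print({k: (v.dtype, v.shape) for k, v in feats.items()})
    print(labels.shape)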

How to forecast unseen future Time-Series data using an ARIMA model on Python?

I have built an ARIMA model using time-series data. The data has been split into train and test sets. The code below builds the ARIMA model, scores the test dataset (15 observations) and calculates the MSE error over those 15 observations.
My question is, how can I use this ARIMA model in order to predict any given time in the future ? For example, I would like to predict the time series value for Date "2018-01-01" using this ARIMA model that has been built and tested.
At the moment my dataset is in this form (with 10794 instances ranging from years 1975-2017):
Date Values
2017-06-04 -0.116141
2017-06-11 -0.116669
2017-06-18 -0.114020
2017-06-25 -0.109614
My code for building ARIMA and calculating the MSE error for Test set validation
size = int(len(ts_week_log) - 15)
train, test = ts_week_log[0:size], ts_week_log[size:len(ts_week_log)]
history = [x for x in train]
predictions = list()
print('Printing Predicted vs Expected Values...')
print('\n')
for t in range(len(test)):
    model = ARIMA(history, order=(2,1,1))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(float(yhat))
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (np.exp(yhat), np.exp(obs)))
error = mean_squared_error(test, predictions)
print('\n')
print('Printing Mean Squared Error of Predictions...')
print('Test MSE: %.6f' % error)
predictions_series = pd.Series(predictions, index = test.index)
I suspect that I need to use the model_fit.forecast function, but I would like to know whether this is indeed the correct one to use.
Further, I would like to know whether Python offers any functionality for generating future data points to predict on, such as model.make_future_dataframe(periods=24, freq='m') in the Prophet forecasting library.
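A minimal sketch of forecasting beyond the observed data (assuming the older statsmodels ARIMA API used in the code above, weekly data, and that ts_week_log holds the log-transformed series with a DatetimeIndex):
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA  # older API, matching fit(disp=0) above

# refit on the full (log-transformed) series, then forecast n steps past its end
model = ARIMA(ts_week_log, order=(2, 1, 1))
model_fit = model.fit(disp=0)

n_steps = 30  # e.g. enough weekly steps to reach a date such as 2018-01-01
forecast, stderr, conf_int = model_fit.forecast(steps=n_steps)

# build a future weekly date index, similar in spirit to Prophet's make_future_dataframe
future_index = pd.date_range(start=ts_week_log.index[-1], periods=n_steps + 1, freq='W')[1:]
forecast_series = pd.Series(np.exp(forecast), index=future_index)
print(forecast_series.head())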
