I have a very long time series I want to feed into an LSTM for classification per-frame.
My data is labeled per frame, and I know some rare events happen that influence the classification heavily ever since they occur.
Thus, I have to feed the entire sequence to get meaningful predictions.
It is known that just feeding very long sequences into LSTM is sub-optimal, since the gradients vanish or explode just like normal RNNs.
I wanted to use a simple technique of cutting the sequence to shorter (say, 100-long) sequences, and run the LSTM on each, then pass the final LSTM hidden and cell states as the start hidden and cell state of the next forward pass.
Here is an example I found of someone who did just that. There it is called "Truncated Back propagation through time". I was not able to make the same work for me.
My attempt in Pytorch lightning (stripped of irrelevant parts):
def __init__(self, config, n_classes, datamodule):
...
self._criterion = nn.CrossEntropyLoss(
reduction='mean',
)
num_layers = 1
hidden_size = 50
batch_size=1
self._lstm1 = nn.LSTM(input_size=len(self._in_features), hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
self._log_probs = nn.Linear(hidden_size, self._n_predicted_classes)
self._last_h_n = torch.zeros((num_layers, batch_size, hidden_size), device='cuda', dtype=torch.double, requires_grad=False)
self._last_c_n = torch.zeros((num_layers, batch_size, hidden_size), device='cuda', dtype=torch.double, requires_grad=False)
def training_step(self, batch, batch_index):
orig_batch, label_batch = batch
n_labels_in_batch = np.prod(label_batch.shape)
lstm_out, (self._last_h_n, self._last_c_n) = self._lstm1(orig_batch, (self._last_h_n, self._last_c_n))
log_probs = self._log_probs(lstm_out)
loss = self._criterion(log_probs.view(n_labels_in_batch, -1), label_batch.view(n_labels_in_batch))
return loss
Running this code gives the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
The same happens if I add
def on_after_backward(self) -> None:
self._last_h_n.detach()
self._last_c_n.detach()
The error does not happen if I use
lstm_out, (self._last_h_n, self._last_c_n) = self._lstm1(orig_batch,)
But obviously this is useless, as the output from the current frame-batch is not forwarded to the next one.
What is causing this error? I thought detaching the output h_n and c_n should be enough.
How do I pass the output of a previous frame-batch to the next one and have torch back propagate each frame batch separately?
Apparently, I missed the trailing _ for detach():
Using
def on_after_backward(self) -> None:
self._last_h_n.detach_()
self._last_c_n.detach_()
works.
The problem was self._last_h_n.detach() does not update the reference to the new memory allocated by detach(), thus the graph is still de-referencing the old variable which backprop went through.
The reference answer solved that by H = H.detach().
Cleaner (and probably faster) is self._last_h_n.detach_() which does the operation in place.
TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
def __init__(self, n_features, latent_size):
super(SeqEncoderLSTM, self).__init__()
self.lstm = nn.LSTM(
n_features,
latent_size,
batch_first=True)
def forward(self, x):
_, hs = self.lstm(x)
return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
def __init__(self, emb_size, n_features):
super(SeqDecoderLSTM, self).__init__()
self.cell = nn.LSTMCell(n_features, emb_size)
self.dense = nn.Linear(emb_size, n_features)
def forward(self, hs_0, seq_len):
x = torch.tensor([])
# Final hidden and cell state from encoder
hs_i, cs_i = hs_0
# reconstruct first element with encoder output
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
# reconstruct remaining elements
for i in range(1, seq_len):
hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
x_i = self.dense(hs_i)
x = torch.cat([x, x_i])
return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
def __init__(self, n_features, emb_size):
super(LSTMEncoderDecoder, self).__init__()
self.n_features = n_features
self.hidden_size = emb_size
self.encoder = SeqEncoderLSTM(n_features, emb_size)
self.decoder = SeqDecoderLSTM(emb_size, n_features)
def forward(self, x):
seq_len = x.shape[1]
hs = self.encoder(x)
hs = tuple([h.squeeze(0) for h in hs])
out = self.decoder(hs, seq_len)
return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Training model on {device}')
model = model.to(device)
opt = optimizer(model.parameters(), lr)
train_loss = []
valid_loss = []
for e in tqdm(range(epochs)):
running_tl = 0
running_vl = 0
for x in trainload:
x = x.to(device).float()
opt.zero_grad()
x_hat = model(x)
if reverse:
x = torch.flip(x, [1])
loss = criterion(x_hat, x)
loss.backward()
opt.step()
running_tl += loss.item()
if testload is not None:
model.eval()
with torch.no_grad():
for x in testload:
x = x.to(device).float()
loss = criterion(model(x), x)
running_vl += loss.item()
valid_loss.append(running_vl / len(testload))
model.train()
train_loss.append(running_tl / len(trainload))
return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
def __init__(self, data, window, n_features, overlap=0):
super().__init__()
if isinstance(data, (np.ndarray)):
data = torch.tensor(data)
elif isinstance(data, (pd.Series, pd.DataFrame)):
data = torch.tensor(data.copy().to_numpy())
else:
raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
self.n_features = n_features
self.seqs = torch.split(data, window)
def __len__(self):
return len(self.seqs)
def __getitem__(self, idx):
try:
return self.seqs[idx].view(-1, self.n_features)
except TypeError:
raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or now long I train it.
Predicted/Reconstruction:
Actual:
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using the flipud on the original input (after training but before calculating loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
TLDR
You try to predict next timestep value instead of difference between current timestep and the previous one
Your hidden_features number is too small making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt
def get_data(subtract: bool = False):
# (1, 14, 5)
input_tensor = torch.tensor(
[
[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
]
).unsqueeze(0)
if subtract:
initial_values = input_tensor[:, 0, :]
input_tensor -= torch.roll(input_tensor, 1, 1)
input_tensor[:, 0, :] = initial_values
return input_tensor
if __name__ == "__main__":
torch.manual_seed(0)
HIDDEN_SIZE = 10
SUBTRACT = False
input_tensor = get_data(SUBTRACT)
model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
for i in range(1000):
outputs = model(input_tensor)
loss = criterion(outputs, input_tensor)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"{i}: {loss}")
if loss < 1e-4:
break
# Plotting
sns.lineplot(data=outputs.detach().numpy().squeeze())
sns.lineplot(data=input_tensor.detach().numpy().squeeze())
plt.show()
What it does:
get_data either works on the data your provided if subtract=False or (if subtract=True) it subtracts value of the previous timestep from the current timestep
Rest of the code optimizes the model until 1e-4 loss reached (so we can compare how model's capacity and it's increase helps and what happens when we use the difference of timesteps instead of timesteps)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line. Model is unable to fit and grasp the phenomena presented in the data (hence flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
Targets are now far from flat lines, but model is unable to fit due to too small capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better and our target was hit after 942 steps. No more flat lines, model capacity seems quite fine (for this single example!)
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to desired loss after only 215 iterations.
Finally
Usually use difference of timesteps instead of timesteps (or some other transformation, see here for more info about that). In other cases, neural network will try to simply... copy output from the previous step (as that's the easiest thing to do). Some minima will be found this way and going out of it will require more capacity.
When you use the difference between timesteps there is no way to "extrapolate" the trend from previous timestep; neural network has to learn how the function actually varies
Use larger model (for the whole dataset you should try something like 300 I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs, in this way you can get info from forward and backward pass of LSTM (not to confuse with backprop!). This also should boost your score
Questions
Okay, question 1: You are saying that for variable x in the time
series, I should train the model to learn x[i] - x[i-1] rather than
the value of x[i]? Am I correctly interpreting?
Yes, exactly. Difference removes the urge of the neural network to base it's predictions on the past timestep too much (by simply getting last value and maybe changing it a little)
Question 2: You said my calculations for zero bottleneck were
incorrect. But, for example, let's say I'm using a simple dense
network as an auto encoder. Getting the right bottleneck indeed
depends on the data. But if you make the bottleneck the same size as
the input, you get the identity function.
Yes, assuming that there is no non-linearity involved which makes the thing harder (see here for similar case). In case of LSTMs there are non-linearites, that's one point.
Another one is that we are accumulating timesteps into single encoder state. So essentially we would have to accumulate timesteps identities into a single hidden and cell states which is highly unlikely.
One last point, depending on the length of sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), hence even more unlikely.
Is num_features * num_timesteps not a bottle neck of the same size as
the input, and therefore shouldn't it facilitate the model learning
the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case, might be here. About the identity and why it is hard to do with non-linearities for the network it was answered above.
One last point, about identity functions; if they were actually easy to learn, ResNets architectures would be unlikely to succeed. Network could converge to identity and make "small fixes" to the output without it, which is not the case.
I'm curious about the statement : "always use difference of timesteps
instead of timesteps" It seem to have some normalizing effect by
bringing all the features closer together but I don't understand why
this is key ? Having a larger model seemed to be the solution and the
substract is just helping.
Key here was, indeed, increasing model capacity. Subtraction trick depends on the data really. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 10000
Other timestep values vary by 1 at most
What the neural network would do (what is the easiest here)? It would, probably, discard this 1 or smaller change as noise and just predict 1000 for all of them (especially if some regularization is in place), as being off by 1/1000 is not much.
What if we subtract? Whole neural network loss is in the [0, 1] margin for each timestep instead of [0, 1001], hence it is more severe to be wrong.
And yes, it is connected to normalization in some sense come to think about it.
The canned estimators in 1.0 (LinearClassifier, DNNClassifier, etc..) use the Trainable interface which defines:
fit(
x=None,
y=None,
input_fn=None,
steps=None,
batch_size=None,
monitors=None,
max_steps=None
)
and describes steps as
Number of steps for which to train model. If None, train forever. 'steps' works incrementally. If you call two times fit(steps=10) then training occurs in total 20 steps. If you don't want to have incremental behaviour please set max_steps instead. If set, max_steps must be None.
I'm at a loss for what this means exactly.
m = LinearClassifier(
feature_columns=[sparse_column_a, sparse_feature_a_x_sparse_feature_b],
optimizer=tf.train.FtrlOptimizer(
learning_rate=0.1,
l1_regularization_strength=0.001
))
m.fit(input_fn=train_input_fn, steps=???)
Using a LinearClassifier, how do we train on a single pass of train_input_fn? Should steps be the number of samples in train_input_fn?
What if we want to train on each sample in train_input_fn 3 times?
Answer 1 (was removed?): "Steps is number of times input_fn is called for data generation"
I think a lot of the logic for this question is in Estimator's _train_model function
It executes the following:
all_hooks = []
self._graph = ops.Graph()
with self._graph.as_default() as g, g.device(self._device_fn):
random_seed.set_random_seed(self._config.tf_random_seed)
global_step = contrib_framework.create_global_step(g)
features, labels = input_fn()
.......
.......
with monitored_session.MonitoredTrainingSession(
...
hooks=all_hooks + model_fn_ops.training_hooks,
chief_only_hooks=chief_hooks + model_fn_ops.training_chief_hooks,
...
) as mon_sess:
loss = None
while not mon_sess.should_stop():
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
input_fn is called only once, and then for each step,
mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss]) is run
This suggests input_fn is not called for each step. Also, empirically, I tried an input function like
def train_input_fn():
current_log = TRAIN_FILES.pop()
with open('./logs/' + str(random.random()) + "__" + str(uuid.uuid4()) + "__" + str(time.time()) + ".run", "wb") as fh:
fh.write(("Ran log file %s" % (current_log)).encode('utf-8'))
and for steps > 1 there is still only one log file written.
By looking at the documentation for Trainable (which you also linked), and specifically comparing steps to max_steps, it looks like the difference is that:
Two calls to fit(steps=100) means 200 training iterations. On the
other hand, two calls to fit(max_steps=100) means that the second call
will not do any iteration since first call did all 100 steps.
So I think you would use steps if you are running fit inside of a loop, every iteration of which you are doing something that changes the input and target data going to your fit function, e.g. loading new data from files on disk. In pseudo-code:
def input_fn():
"""Get features and labels in batches."""
new_features = get_new_features()
new_labels = get_new_labels()
return new_features, new_labels
for _ in range(num_iterations):
m.fit(input_fn=input_fn, steps=num_steps) # runs num_steps times on current data set
This would generate new training data with every outer loop iteration, and run on that training dataset for num_steps. Ultimately you would probably set this up to work its way iteratively through the whole dataset once and then maybe wrap the loop here in another loop to run multiple epochs. If you are giving it all the training data at once, however, then you would use max_steps instead. Again, in pseudocode.
def input_fn():
"""Get ALL features and labels."""
all_features = get_all_features()
all_labels = get_all_labels()
return all_features, all_labels
for _ in range(num_epochs):
m.fit(input_fn=input_fn, max_steps=num_max_steps) # runs num_max_steps times on full dataset
Here in a single iteration of the loop you would work your way through the whole dataset (probably using the batch_size argument of fit to control batch sizes), and then each iteration of your loop is another epoch through the whole dataset.
Either way, it makes sense for input_fn to only be called once for a single call to fit, as input_fn is just providing the data on which fit will operate for this iteration of whatever loop it is called from.
Using a LinearClassifier, how do we train on a single pass of
train_input_fn? Should steps be the number of samples in
train_input_fn?
input_fn should return a tuple of: features (tensor or dictionary of tensors) and labels (tensor or dictionary of tensors). These features and labels will construct a mini-batch to run computations on.
Simplest input_fn could look like:
def input_fn():
value_a = tf.constant([0.0])
features = {'a': value_a}
label = tf.constant([1.0])
return features, labels
When m.fit() is called, it calls _input_fn_ to construct feeding part of the graph. Then m.fit() creates a session, runs initialisers, starts queue_runners, etc. and calls session.run(...) steps times or until certain exception is thrown.
So if you want to train on a single pass of input_fn output you can just do:
m.fit(input_fn=input_fn, steps=1)
If you want to change number of samples in the mini-batch, you need to change input_fn to return more samples and labels for them.
What if we want to train on each sample in train_input_fn 3 times?
There are couple of options:
If you use input_fn with tf.constants, you can do m.fit() multiple times
If you use readers and queues (https://www.tensorflow.org/programmers_guide/reading_data), you can specify _num_epochs_, which is number of times the data will be fed.
In tensorflow 1.1, there are a helper function (https://www.tensorflow.org/api_docs/python/tf/estimator/inputs) to create input_fn, for feeding numpy arrays and pandas data frames to the graph. For both of them you can specify _num_epochs_.