On training LSTMs efficiently but well, parallelism vs training regime

On training LSTMs efficiently but well, parallelism vs training regime - python

For a model that I intend to spontaneously generate sequences I find that training it sample by sample and keeping state in between feels most natural. I've managed to construct this in Keras after reading many helpful resources. (SO: Q and two fantastic answers, Macine Learning Mastery 1, 2, 3)
First a sequence is constructed (in my case one-hot encoded too). X and Y are procuded from this sequence by shifting Y forward one time step. Training is done in batches of one sample and one time step.
For Keras this looks something like this:
data = get_some_data() # Shape (samples, features)
Y = data[1:, :] # Shape (samples-1, features)
X = data[:-1, :].reshape((-1, 1, data.shape[-1])) # Shape (samples-1, 1, features)
model = Sequential()
model.add(LSTM(256, batch_input_shape=(1, 1, X.shape[-1]), stateful=True))
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
for epoch in range(10):
model.fit(X, Y, batch_size=1, shuffle=False)
model.reset_states()
It does work. However after consulting my Task Manager, it seems it's only using ~10 % of my GPU resources, which are already quite limited. I'd like to improve this to speed up training. Increasing batch size would allow for parallel computations.
The network in its current state presumably "remembers" things even from the start of the training sequence. For training in batches one would need to first set up the sequence and then predict one value - and do this for multiple values. To train on the full sequence one would need to generate data in the shape (samples-steps, steps, features). I imagine it wouldn't be uncommon to have a sequence spanning at least a couple of hundred time steps. So that would mean a huge increase in the data amount.
Between framing the problem a bit differently and requiring more data to be stored in memory, and utilising only a small amount of processing resources, I must ask:
Is my intuition of the natural way of training and statefulness correct?
Are there other downsides to this training with one sample per batch?
Could the utilisation issues be resolved any other way?
Finally, is there an accepted way of performing this kind of training to generate long sequences?
Any help is greatly appreciated, I'm fairly new with LSTMs.

I do not know your specific application, however sending only one timestep of data in is surely not a good idea. You should instead, give the LSTM the entire sequence of previously given one-hot vectors (presumably words), and pre-pad (with zeros) if necessary as it appears you are working on sequences of varying length. Consider also using an embedding layer before your LSTM if these are indeed words. Read the documentation carefully.
The utilization of your GPU being low is not a problem. You simply do not have enough data to fully utilize all resources in each batch. Training with batches is a sequential process, there is no real way to parallelize this process, at least a way that is introductory and beneficial to what your goals appear to be. If you do give the LSTM more data per timestep, however, this surely will increase your utilization.
statefull in an LSTM does not do what you think it does. An LSTM always remembers the sequence it is iterating over as it updates it's internal hidden states, h and c. Furthermore, the weight transformations that "build" those internal states are learned during training. What stateful does is preserve the previous hidden state from the last batch index. Meaning, the final hidden state at the third element in the batch is sent as the initial hidden state in the third element of the next batch and so on. I do not believe this is useful for your applications.
There are downsides to training the LSTM with one sample per batch. In general, training with min-batches increases stability. However, you appear to not be training with one sample per batch but instead one timestep per sample.
Edit (from comments)
If you use stateful and send the next 'character' of your sequence in the same index of the previous batch this would be analogous to sending the full sequence timesteps per sample. I would still recommend the initial approach described above in order to improve the speed of the application and to be in more line with other LSTM applications. I see no disadvantages to the approach of sending the full sequence per sample instead of doing it along every batch. However, the advantage of speed, being able to shuffle your input data per batch, and being more readable/consistent would be worth the change IMO.

Related

How to arrange multiple multivariate time series of different length before passing it to Keras LSTM layer

I have a number of multivariate time series that are produced by the same kind of process but:
are of significantly different lengths;
each time series is an independent instance, and the measurements are taken at different, quite random timestamps;
each time series is related at every timestamp to two targets.
In other words:
each time series has a shape of (n_timestamps, n_features)
each target series has a shape of (n_timestamps, 2).
To give an example, this could be treated as stocks of different companies, that are described by few various features and the target at a given timestamp are probabilities that the final price at the end of the year will be higher than x, except we learn them directly from magically given ground-truth probabilities (instead of observed 0/1 responses).
I want to be able to predict the target at each time point and I wanted to give RNNs a try. However, I'm having issues with figuring out how I should arrange the data before passing it to Keras LSTM layers. The main things I'm wondering about are:
I want my RNN to use data starting from the beginning of the series to make prediction at time t, not only last k timestamps. I can't really use the whole history directly without exploding the gradient (it's too long), therefore I need a way to "remember" previously learned weights even though in reality my RNN will loop over last k timestamps.
Each time series has different length, so I'm unsure how to make things compatible with each other. I'm aware of padding as an option, but since the difference in length of examples can be as significant as 1000 vs 3000 this will results in many training examples that constitutes only of padding value.
Since measurements are taken at different timestamps, I believe it may affect my network in a sense that it can't really learn that e.g. last 10 timestamps are the most important. Or even if it can, these last 10 timestamps will have different lengths in reality for each input time-series... How big problem is this? Should I start with resampling all examples to the same time points (e.g. by interpolating)?
My current thinking is that:
I can pad each of my example sequences to the same length (max(n_timestamps))
Create batches of short sequences of length k, where k represents the length of the loop of RNN layer. In consequence, assuming I have 200 example sequences with the longest one has 3000 timestamps and my selected k is 50, it would result in 3000/50=60 batches of (200, 50) shape. Or should I make 3000-1 batches where one batch differs from the next one only by one timestamp (i.e. while the fist batch has timestamps from 1 to 50, the next batch has timestamps from 2 to 51 etc.)?
Since padding was used, I would need to use Masking layer. Some (quite many) of the rows in prepared batches would constitute of inputs that should be ignored completely (as they would only have padding value for all 50 elements).
Is this the correct way to prepare the data for my problem? Can it be done better to not introduce bottlenecks such as learning using examples of only padding value (that should be ignored with masking layer). Or how can I prepare that data to address points 1., 2. and 3. described above?

each time series has a shape of (n_timestamps, n_features)
each target series has a shape of (n_timestamps, 2).
Okay, this is pretty standard so far.
I want my RNN to use data starting from the beginning of the series to make prediction at time t, not only last k timestamps. I can't really use the whole history directly without exploding the gradient (it's too long), therefore I need a way to "remember" previously learned weights even though in reality my RNN will loop over last k timestamps.
Check and make sure you actually need this. An RNN (or a Transformer) could use any of/all of the history that you give it. But that's assuming that the history is useful for the predictions you're making.
I'd try training on standard-sized random-clips of the data (like in this tutorial). I'd retrain it a few times with longer and longer clips and see if the model performance plateaus before I run out of memory.
But in Keras it is relatively simple to do exactly the thing you're asking.
Keras RNNs (LSTM, GRU) have these this argument return_states. It allows you to allows you to run the model over part of a sequence, pause, execute a training step, and then continue running exactly where you left off.
(and stateful argument is another mechanism to provide that effect)
The code ends up looking something like this:
class MyModel(keras.Model):
...
def train_step(self, args):
inputs, labels = args
state = self.get_initial_state()
while tf.shape(inputs)[1] != 0:
in_slice, inputs = inputs[:,:100], inputs[:,100:]
label_slice, labels = labels[:, :100], labels[:,100:]
with tf.GradientTape() as tape:
result, state = self(in_slice, state)
loss = self.loss(label_slice, result)
vars = self.trainable_variables
grads = tape.gradient(loss, vars)
self.optimizer.apply_gradients(zip(grads, vars))
It may also be possible to use ForwardAccumulator to collect the gradients. In that case you don't need to cut the sequences into chunks because the memory used by forward accumulator doesn't grow with sequence length. I've never tried before so I don't have example code.
Each time series has different length, so I'm unsure how to make things compatible with each other. I'm aware of padding as an option, but since the difference in length of examples can be as significant as 1000 vs 3000 this will results in many training examples that constitutes only of padding value.
That might be okay, just inefficient. You can make batches of similar sequence lengths using: Dataset.bucket_by_sequence_length
Since measurements are taken at different timestamps, I believe it may affect my network in a sense that it can't really learn that e.g. last 10 timestamps are the most important. Or even if it can, these last 10 timestamps will have different lengths in reality for each input time-series... How big problem is this? Should I start with resampling all examples to the same time points (e.g. by interpolating)?
Interpolating to a fixed rate might be a resonable thing to try if it doesn't make your data too much longer. Just think carefully about making predictions on interpolated values: There's some data leaking back in time from a future measurement.
Another approach would be to make the size of the time-step a feature. If each input is tagged with how long it's been since the last input the model can learn how to handle small or large steps.
I can pad each of my example sequences to the same length (max(n_timestamps))
Yes. Pad, or make clips of a fixed size.
Create batches of short sequences of length k, where k represents the length of the loop of RNN layer. In consequence, assuming I have 200 example sequences with the longest one has 3000 timestamps and my selected k is 50, it would result in 3000/50=60 batches of (200, 50) shape.
That would line up with the code example I gave.
Or should I make 3000-1 batches where one batch differs from the next one only by one timestamp
Either way is fine. But if you want to carry the state over from batch to batch (I'm skeptical that you actually need the carry over) then you need to do them chunk by chunk, not by single-stepping your window.
Since padding was used, I would need to use Masking layer. Some (quite many) of the rows in prepared batches would constitute of inputs that should be ignored completely (as they would only have padding value for all 50 elements).
Yeah, that'll be wasted computation, but it won't hurt anything.

Can LSTMs handle incredibly dense time-series data?

I have 50 time series, each having at least 500 data points (some series have as much as 2000+ data points). All the time series go from a value of 1.089 to 0.886, so you can see that the resolution for each dataset comes close to 10e-4, i.e. the data is something like:
1.079299, 1.078809, 1.078479, 1.078389, 1.078362,... and so on in a decreasing fashion from 1.089 to 0.886 for all 50 time-series.
My questions hence, are:
Can LSTMs handle such dense data?
In order to avoid overfitting, what can be the suggested number of epochs, timesteps per batch, batches, hidden layers and neurons per layer?
I have been struggling with this for more than a week, and no other source that I could find talks about this specific case, so it could also help others.

A good question and I can understand why you did not find a lot of explanations because there are many tutorials which cover some basic concepts and aspects, not necessarily custom problems.
You have 50 time series. However, the frequency of your data is not the same for each time series. You have to interpolate in order to reach the same number of samples for each time series if you want to properly construct the dataset.
LSTMs can handle such dense data. It can be both a classification and a regression problem, neural networks can adapt to such situations.
In order to avoid overfitting(LSTMs are very prone to it), the first major aspect to take into consideration is the hidden layers and the number of units per layer. Normally people tend to use 256-512 by default since in Natural Language Processing where you process huge datasets they are suitable. In my experience, for simpler regression/classification problems you do not need such a big number, it will only lead to overfitting in smaller problems.
Therefore, taking into consideration (1) and (2), start with an LSTM/GRU with 32 units and then the output layer. If you see that you do not have good results, add another layer (64 first 32 second) and then the output layer.
Admittedly, timesteps per batch is of crucial importance. This cannot be determined here, you have to manually iterate through values of it and see what yields you the best results. I assume you create your dataset via sliding window manner; consider this(window size) also a hyperparameter to alter before arriving at the batch and epochs ones.

How to structure and size Y-labels for multivariate sequence prediction using Keras LSTMs

I am working on a sequence prediction problem where my inputs are of size (numOfSamples, numOfTimeSteps, features) where each sample is independent, number of time steps is uniform for each sample (after pre-padding the length with 0's using keras.pad_sequences), and my number of features is 2. To summarize my question(s), I am wondering how to structure my Y-label dataset to feed the model and want to gain some insight on how to properly structure my model to output what I want.
My first feature is a categorical variable encoded to a unique int and my second is numerical. I want to be able to predict the next categorical variable as well as an associated feature2 value, and then use this to feed back into the network to predict a sequence until the EOS category is output.
This is a main source I've been referencing to try and understand how to create a generator for use with keras.fit_generator.
[1]
There is no confusion with how the mini-batch for "X" data is grabbed, but for the "Y" data, I am not sure about the proper format for what I am trying to do. Since I am trying to predict a category, I figured a one-hot vector representation of the t+1 timestep would be the proper way to encode the first feature, I guess resulting in a 4? Dimensional numpy matrix?? But I'm kinda lost with how to deal with the second numerical feature.
Now, this leads me to questions concerning architecture and how to structure a model to do what I am wanting. Does the following architecture make sense? I believe there is something missing that I am not understanding.
Proposed architecture (parameters loosely filled in, nothing set yet):
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit_generator(...) #ill figure this out
So, at the end, a softmax activation can predict the next categorical value for feature1. How do I also output a value for feature2 so that I can feed the new prediction for both features back as the next time-step? Do I need some sort of parallel architecture with two LSTMs that are combined somehow?
This is my first attempt at doing anything with neural networks or Keras, and I would not say I'm "great" at python, I can get by though. However, I feel I have a decent grasp at the fundamental theoretical concepts, but lack the practice.
This question is sorta open ended, with encouragement to pick apart my current strategy.
Once again, the overall goal is to predict both features (categorical, numeric) in order to predict "full sequences" from intermediate length sequences.
Ex. I train on these padded max-len sequences, but in production I want to use this to predict the remaining part of the currently unseen time-steps, which would be variable length.

Okay, so If I understand you properly (correct me if I'm wrong) you would like to predict next features based on the current ones.
When it comes to categorical variables, you are on point, your Dense layer should output N-1 vector containing probability of each class (while we are at it, if you, by any chance, use pandas.get_dummies remember to specify argument drop_first=True, similiar approach should be employed whatever you are using for one-hot encoding).
Except those N-1 output vector for each sample, it should output one more number for numerical value.
Remember to output logits (no activation, don't use softmax at the end like you currently do). Afterwards network output should be separated into N-1 part (your categorical feature) and passed to loss function able to handle logits (e.g. in Tensorflow it is tf.nn.softmax_cross_entropy_with_logits_v2 which applies numerically stable softmax for you).
Now, your N-th element of network output should be passed to different loss, probably Mean Squared Error.
Based on loss value of those two losses (you could take a mean of both to obtain one loss value), you backpropagate through the network and it might do just fine.
Unfortunately I'm not skilled enough in Keras in order to help you with the code, but I think you will figure it out yourself. While we're at it, I would like to suggest PyTorch for more custom neural networks (I think yours fits this description), though it's definitely doable in Keras as well, your choice.
Additional 'maybe helpful' thought: you may check Teacher Forcing for your kind of task. More on the topic and theory behind it can be found in the outstanding Deep Learning Book and code example (though in PyTorch once again), can be found in their docs here.
BTW interesting idea, mind if I use it in connection with my current research trajectory (with kudos going to you of course)? Comment on this answer if so we can talk it out in chat.

Basically every answer I was looking for was exampled and explained in this tutorial. Absolutely great resource for trying to understand how to model multi-output networks. This one goes through a lengthy walkthrough of a multi-output CNN architecture. It only took me about three weeks to stumble upon, however.
https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/

Multilabel classification using LSTM on variable length signal using Keras

I have recently started working on ECG signal classification in to various classes. It is basically multi label classification task (Total 4 classes). I am new to Deep Learning, LSTM and Keras that why i am confused in few things.
I am thinking about giving normalized original signal as input to the network, is this a good approach?
I also need to understand training input shape for LSTM as ECG signals are of variable length (9000 to 18000 samples) and usually classifier need fixed variable input. How can i handle such type of input in case of LSTM.
Finally what should be structure of deep LSTM network for such lengthy input and how many layers should i use.
Thanks for your time.
Regards

I am thinking about giving normalized original signal as input to the network, is this a good approach?
Yes this is a good approach. It is actually quite standard for Deep Learning algorithms to give them your input normalized or rescaled.
This usually helps your model converge faster, as now you are inside smaller range (i.e.: [-1, 1]) instead of greater un-normalized ranges from your original input (say [0, 1000]). It also helps you get better, more precise results, as it helps solve problems like the vanishing gradient as well as adapting better to modern activation and optimizer functions.
I also need to understand training input shape for LSTM as ECG signals are of variable length (9000 to 18000 samples) and usually classifier need fixed variable input. How can i handle such type of input in case of LSTM.
This part is really important. You are correct, LSTM expects to receive inputs with a fixed shape, one that you know beforehand (in fact, any Deep Learning layer expects fixed shape inputs). This is also explained in the keras docs on Recurrent Layers where they say:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
As we can see, it expects your data to have a number of timesteps as well as a dimension on each one of those timesteps (batch size is usually 1). To exemplify, suppose your input data consists of elements like: [[1,4],[2,3],[3,2],[4,1]]. Then, using a batch_size of 1, the shape of your data would be (1,4,2). As you have 4 timesteps, each with 2 features.
So bottom line, you have to make sure that you pre-process you data so it has a fixed shape you can then pass to your LSTM layers. This one you will have to find out by yourself, as you know your data and problem better than we do.
Maybe you can fix the samples you obtain from your signal, discarding some and keeping others so every signal is of the same length (if you say your signals are between 9k and 18k choosing 9000 could be the logical choice, discarding samples from the others you get). You could even do some other conversion to your data in a way that you can map from inputs of 9000-18000 to a fixed size.
Finally what should be structure of deep LSTM network for such lengthy input and how many layers should i use.
This one is really quite broad and doesn't have a unique answer. It would depend on the nature of your problem, and determining those parameters a priori is not so straightforward.
What I recommend you do is to start with a simple model first, and then add layers and blocks (neurons) incrementally until you are satisfied with the results.
Try just one hidden layer first, train and test your model and check your performance. You can then add more blocks and see if your performance improved. You can also add more layers and check for the same until you are satisfied.
This is a good way to create Deep Learning models, as you will arrive to the results you want while keeping your Network as lean as possible, which in turn helps your execution time and complexity. Good luck with your coding, hope you find this useful.

Does size of training data for an epoch matter in tensorflow?

Assuming we have 500k items worth of training data, does it matter if we train the model one item at a time or 'n' items at a time or all at once?
Considering inputTrainingData and outputTrainingData to be [[]] and train_step to be any generic tensorflow training step.
Option 1 Train one item at a time -
for i in range(len(inputTrainingData)):
train_step.run(feed_dict={x: [inputTrainingData[i]], y: [outputTrainingData[i]], keep_prob: .60}, session= sess)
Option 2 Train on all at once -
train_step.run(feed_dict={x: inputTrainingData, y: outputTrainingData, keep_prob: .60}, session= sess)
Is there any difference between options 1 and 2 above as far as the quality of training is concerned?

Yes, there is a difference. Option 1 is much less memory consuming but is also much less accurate. Option 2 could eat up all of your RAM but should prove more accurate. However, if you use all your training set at once, be sure to limit the number of steps to avoid over-fitting.
Ideally, use data in batches (typically between 16 and 256).
Most optimization techniques are 'stochastic', i.e. they rely on a statistical sample of examples to estimate a model update.
To sum up:
- More data => more accuracy (but more memory) => higher risk of over-fitting (so limit the amount of training steps)

There is a different between this options. Normally you have to use a batchsize to train for example 128 iterations of data.
You also could use a batchsize of one, like the first of you examples.
The advantage of this method is you can output the training efficient of the neural network.
If you are learning all data at ones, you will bi a little bit faster, but you will know only at the end if you efficient is good.
Best way is to make a batchsize and learn by stack. So you can output you efficient after every stack and control your efficient.

Mathematically these two methods are different. One is called stochastic gradient descent and the other is called batch gradient descent. You are missing the most commonly used one - mini batch gradient descent. There has been a lot of research on this topic but basically different batch sizes have different convergence properties. Generally people use batch sizes that are greater than one but not the full dataset. This is usually necessary since most datasets cannot fit into memory all at once. Also if your model uses batch normalization then a batch size of one won't converge. This paper discusses the effects of batch size (among other things) on performance. The takeaway is that larger batch sizes do not generalize as well. (They actually argue it isn't the batch size itself the but the fact that you have fewer updates when the batch is larger. I would recommend batch sizes of 32 to start and experiment to see how batch size effects performance.
Here is a graph of the effects of batch size on training and validation performance from the paper I linked.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.