LSTM data preparation - Python

I'm preparing a time series for an LSTM network (using Python and Keras) and it goes something like this:
Samples = []
for i in range(0, len(TrainingData) - Time_Step, 1):
    Samples.append(TrainingData[i:i + Time_Step])
Since it's a for loop it's really slow - is there a faster way of doing this?

What you're trying to do should be a simple reshape:
import numpy as np

# if training data is not numpy, make it numpy
TrainingData = np.array(TrainingData)
originalShape = TrainingData.shape

if len(originalShape) == 1:
    Samples = TrainingData.reshape((Time_Step,))  # or (Time_Step, 1)
else:
    Samples = TrainingData.reshape((Time_Step,) + originalShape[1:])
Warning: data for an LSTM should have a shape like (batch_size_or_samples, time_steps_or_length, features). The sequences should not be divided into windows (unless you strictly need this for very special reasons - and then you have to take some special actions too), and the time dimension should be the second, not the first.
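If you strictly do need the overlapping windows from your loop, a vectorized alternative is NumPy's sliding_window_view (a sketch, assuming NumPy >= 1.20; note it yields one more window than the loop above, so slice with [:-1] to match it exactly):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

TrainingData = np.asarray(TrainingData)

# windows of length Time_Step along the time axis (axis 0)
windows = sliding_window_view(TrainingData, Time_Step, axis=0)

if TrainingData.ndim > 1:
    # for 2-D input (time, features) the window axis is appended last,
    # so move it next to the sample axis to get (samples, Time_Step, features)
    Samples = np.moveaxis(windows, -1, 1)
else:
    Samples = windows  # already (samples, Time_Step)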

Related

Creating a TimeseriesGenerator with multiple inputs

I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks. Due to memory limits I cannot hold everything in memory after converting the data to sequences for the model.
This leads me to using a generator instead, like the TimeseriesGenerator from Keras / TensorFlow. The problem is that if I use the generator on all of my data stacked together, it creates sequences that mix stocks - see the example below with a sequence length of 5, where Sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2".
Instead what I would want is similar to this:
Slightly similar question: Merge or append multiple Keras TimeseriesGenerator objects into one
I explored the option of combining the generators like this SO answer suggests: How do I combine two keras generator functions - however, this is not ideal in the case of ~4000 generators.
I hope my question makes sense.
What I've ended up doing is all the preprocessing manually, saving an .npy file for each stock containing the preprocessed sequences; then, using a manually created generator, I make batches like this:
import numpy as np
import tensorflow as tf

class seq_generator():

    def __init__(self, list_of_filepaths):
        self.usedDict = dict()
        for path in list_of_filepaths:
            self.usedDict[path] = []

    def generate(self):
        while True:
            path = np.random.choice(list(self.usedDict.keys()))
            stock_array = np.load(path)
            random_sequence = np.random.randint(stock_array.shape[0])
            if random_sequence not in self.usedDict[path]:
                self.usedDict[path].append(random_sequence)
                yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

# pass the bound method of the instance (not the class) as the callable,
# and declare a single float output since the generator yields one array per call
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))
train_dataset = train_dataset.batch(batch_size)
Where list_of_filepaths is simply a list of paths to preprocessed .npy data.
This will:
Load a random stock's preprocessed .npy data
Pick a sequence at random
Check if the index of the sequence has already been used in usedDict
If not:
Append the index of that sequence to usedDict to keep track, so as not to feed the same data twice to the model
Yield the sequence
This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods of TensorFlow's Dataset type.
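As a small follow-up sketch (my own addition; the step count is an illustrative placeholder and AUTOTUNE assumes TF 1.13+ / 2.x): since the generator loops indefinitely, training on the resulting dataset needs an explicit number of steps per epoch, and prefetching lets the .npy loading overlap with training:

train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)  # overlap file loading with training
model.fit(train_dataset, steps_per_epoch=500, epochs=10)  # steps_per_epoch is a placeholder value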

How to use Multivariate time-series prediction with Keras, when multiple samples are used

As the title states, I am doing multivariate time-series prediction. I have some experience with this situation and was able to successfully setup and train a working model in TF Keras.
However, I did not know the 'proper' way to handle having multiple unrelated time-series samples. I have about 8000 unique sample 'blocks' with anywhere from 800 time steps to 30,000 time steps per sample. Of course I couldn't concatenate them all into one single time series because the first points of sample 2 are not related in time with the last points of sample 1.
Thus my solution was to fit each sample individually in a loop (at great inefficiency).
My new idea is: can/should I pad the start of each sample with empty time-steps equal to the amount of look-back for the RNN, and then concatenate the padded samples into one time series? This would mean that the first time-steps have look-back data of mostly 0's, which sounds like another 'hack' for my problem and not the right way to do it.
The main challenge is the 800 vs. 30,000 timesteps, but it's nothing you can't do.
Model design: group sequences into chunks - for example, 30 sequences of 800-to-900 timesteps, padded, then 60 sequences of 900-to-1000, etc. - the chunks don't have to be contiguous (i.e. the next can be 1200-to-1500); a bucketing sketch follows the example code below
Input shape: (samples, timesteps, channels) - or equivalently, (sequences, timesteps, features)
Layers: Conv1D and/or RNNs - e.g. GRU, LSTM. Each can handle variable timesteps
Concatenation: don't do it. If each of your sequences is independent, then each must be fed along dimension 0 in Keras - the batch or samples dimension. If they are dependent, e.g. multivariate timeseries, like many channels in a signal - then feed them along the channels dimension (dim 2). But never concatenate along the timeseries dimension, as it implies causal continuity where none exists.
Stateful RNNs: can help in processing long sequences - info on how they work here
RNN capability: is limited w.r.t. long sequences, and 800 is already in danger zone even for LSTMs; I'd suggest dimensionality reduction via either autoencoders or CNNs w/ strides > 1 at input, then feeding their outputs to RNNs.
RNN training: is difficult. Long train times, hyperparameter sensitivity, vanishing gradients - but, with proper regularization, they can be powerful. More info here
Zero-padding: before/after/both - debatable; you can read about it, but probably steer clear of "both", as learning to ignore paddings is easier with a single locality; I personally use "before"
RNN variant: use CuDNNLSTM or CuDNNGRU whenever possible, as they are 10x faster
Note: "samples" above, and in machine learning, refers to independent examples / observations, rather than measured signal datapoints (which would be referred to as timesteps).
Below is minimal code for what a timeseries-suited model could look like:
from tensorflow.keras.layers import Input, Conv1D, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import numpy as np

def make_data(batch_shape):  # dummy data
    return (np.random.randn(*batch_shape),
            np.random.randint(0, 2, (batch_shape[0], 1)))

def make_model(batch_shape):  # example model
    ipt = Input(batch_shape=batch_shape)
    x   = Conv1D(filters=16, kernel_size=10, strides=2, padding='valid')(ipt)
    x   = LSTM(units=16)(x)
    out = Dense(1, activation='sigmoid')(x)  # assuming binary classification

    model = Model(ipt, out)
    model.compile(Adam(lr=1e-3), 'binary_crossentropy')
    return model

batch_shape = (32, 100, 16)  # 32 samples, 100 timesteps, 16 channels
x, y = make_data(batch_shape)

model = make_model(batch_shape)
model.train_on_batch(x, y)
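To illustrate the grouping and pre-padding suggestions above, here is a rough sketch (my own illustration, not from the original answer; the bucket width of 100 timesteps is an arbitrary placeholder) that buckets variable-length sequences by length and zero-pads each bucket at the front:

from tensorflow.keras.preprocessing.sequence import pad_sequences

def make_buckets(sequences, bucket_width=100):
    # group variable-length sequences (each an array of shape (timesteps, features))
    # into buckets of similar length
    buckets = {}
    for seq in sequences:
        buckets.setdefault(len(seq) // bucket_width, []).append(seq)

    # pre-pad each bucket to its longest member ("before" padding, as above)
    padded = []
    for key in sorted(buckets):
        bucket = buckets[key]
        maxlen = max(len(s) for s in bucket)
        padded.append(pad_sequences(bucket, maxlen=maxlen, dtype='float32', padding='pre'))
    return padded  # list of arrays, each (n_sequences_in_bucket, bucket_maxlen, features)

Each returned array can then be fed to the model as its own batch (or split further), so no sequence is ever concatenated with another along the time dimension.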

Reason for huge slowdown using TF Dataset API

I'm trying to generate batches for triplet loss where there are always pairs in the batch. The code below achieves this but it's very, very slow. In particular the choose_from_datasets method seems to be the source of the slowness.
Is there something wrong with my code that's creating the slowdown? Or is there a smarter way to do this?
I tried switching to sample_from_datasets instead, but this didn't help.
def batch_pairs3(dataset, num_classes, shuffle=True, num_classes_per_batch=10, num_images_per_class=2):
    # Isolate each class into its own dataset
    datasets = []
    for cl in range(num_classes):
        this_dataset = dataset.filter(lambda xx, yy: tf.equal(tf.reshape(yy, []), cl))
        if shuffle:
            this_dataset = this_dataset.shuffle(100)
        datasets += [this_dataset]

    # if shuffle:
    #     random.shuffle(datasets)

    selector = tf.contrib.data.Counter().map(
        lambda x: generator3(x, num_classes, num_classes_per_batch, num_images_per_class))
    selector = selector.apply(tf.contrib.data.unbatch())
    dataset = tf.contrib.data.choose_from_datasets(datasets, selector)

    # Batch
    batch_size = num_classes_per_batch * num_images_per_class
    return dataset.batch(batch_size)
The tf.data pipeline does not handle this kind of application very well - where you process your data on the fly by iterating through it - unless you can independently map every data point to do such processing. For what you are doing, you may be better off pre-processing and storing your data, e.g. in tfrecord format, and then using the data pipeline to read it in an optimized way.
Refer to this official example, which works on a similar problem involving triplet loss: Time Contrastive Networks, the data provider
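As a rough illustration of the pre-process-and-store suggestion (my own sketch, not the linked example's code; it assumes the tf.io namespace of TF 1.13+ / 2.x and that the class/pair grouping has already been done offline), you could serialize the prepared examples to TFRecord and read them back with a plain, well-optimized pipeline:

import tensorflow as tf

def write_records(images, labels, path):
    # write one pre-built example (raw image bytes + integer label) per record
    with tf.io.TFRecordWriter(path) as writer:
        for img, lab in zip(images, labels):
            feats = {
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(lab)])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feats))
            writer.write(example.SerializeToString())

def parse(record):
    # parsing is a pure per-element map, which tf.data parallelizes well
    feats = tf.io.parse_single_example(record, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    })
    return feats['image'], feats['label']

dataset = tf.data.TFRecordDataset(['pairs.tfrecord']).map(parse).batch(20)  # 20 = 10 classes x 2 images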

LSTM - time series predictions

I am following this LSTM tutorial and I wonder how to apply it to multiple time series as input. I have a dataset of several time series and I want to predict the future for each of them. I don't know how to scale an LSTM to several time series.
The aim is to avoid making a model for each time series, as I have 40k time series.
Thank you
Process one by one
Just do exactly the same in a loop like this:
for epoch in range(numberOfEpochs):
    for sequence in yourSequences:
        model.reset_states()
        # 1 - do the entire training for this sequence (1 epoch only)
        #     you may use "model.train_on_batch" to avoid some overhead in "fit"
        # or 2 - do the entire predictions for this sequence
Process all together
Just pack the series in the first dimension of the input. No change is necessary in the model
When defining the input shape, use batch_shape=(number_of_time_series, length, features) in an Input layer, or batch_input_shape=(number_of_time_series, length, features) in the first recurrent layer. (You may need a smaller batch size, because 40K is too much.)
Make sure to use shuffle=False in every training command.
If your batch is not all 40k, make sure to process the entire length (the entire training or prediction) of each batch before you call model.reset_states() and start a new group of sequences.
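For illustration, a model definition for this approach could look like the sketch below (my own example; layer and placeholder sizes are arbitrary, and stateful=True is used so that model.reset_states() in the loop below has an effect):

from keras.models import Sequential
from keras.layers import LSTM, Dense

number_of_time_series, length, features = 32, 500, 1  # placeholder values

model = Sequential()
model.add(LSTM(32, stateful=True, return_sequences=True,
               batch_input_shape=(number_of_time_series, length, features)))
model.add(Dense(features))
model.compile(optimizer='adam', loss='mse')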
batch_size = ....
for epoch in range(numberOfEpochs):
    firstSeq = 0
    lastSeq = firstSeq + batch_size
    while lastSeq <= len(sequences):
        model.reset_states()
        batch = sequences[firstSeq:lastSeq]

        # train the entire batch (one epoch only)
        # or predict for the entire batch

        firstSeq += batch_size
        lastSeq += batch_size
Since you are using separate time series, I don't think keeping stateful=True is a good idea.
Actually, your problem is closer to the 'generic' use of LSTMs.
Try to stack your series into a 2D array where each row corresponds to one series, then reshape your data to (number_of_series, timesteps (the length of a single series), 1) and feed it to your network.
Depending on the length of your series, you may need to read this: https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/
The real potential of LSTM models for time series forecasting can be exploited by building a global model using all the time series, instead of a univariate model, which ignores any cross-series information available across your time series.
We implement the use case you are referring to by introducing a 'Moving Window Approach' strategy that involves modelling a multiple-input, multiple-output mapping, where you can pool time series that have different lengths. A more detailed discussion of this strategy is given in section 3.4 of our paper [1]. Here, you basically produce multiple input and output tuples for the given set of time series and then pool them together for LSTM training. This works even if your time series have different lengths.
[1] https://arxiv.org/pdf/1710.03222.pdf
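A rough sketch of that pooling idea (my own illustration of the strategy, not code from the paper; window lengths are placeholders): slide a fixed input/output window over every series and stack all the resulting tuples into one training set for a single global LSTM:

import numpy as np

def pooled_windows(series_list, input_len=24, output_len=6):
    # series_list: iterable of 1-D arrays, possibly of different lengths
    X, y = [], []
    for s in series_list:
        s = np.asarray(s, dtype='float32')
        for start in range(len(s) - input_len - output_len + 1):
            X.append(s[start:start + input_len])
            y.append(s[start + input_len:start + input_len + output_len])
    X = np.array(X)[..., None]  # (samples, input_len, 1), ready for an LSTM
    y = np.array(y)             # (samples, output_len)
    return X, y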

understanding Keras LSTM ( lstm_text_generation.py ) - RAM memory issues

I'm diving into LSTM RNNs with Keras and the Theano backend. While trying to use the LSTM example from the Keras repo (the whole code of lstm_text_generation.py on GitHub), there's one thing that isn't quite clear to me: the way it vectorizes the input data (text characters):
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# np - means numpy
print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
Here, as you can see, they generate arrays of zeros with NumPy and then put a '1' at a particular position of each array, determined by the character encoding of the input sequences.
The question is: why did they use that algorithm? Is it possible to optimize it somehow? Maybe it's possible to encode the input data in some other way, without huge lists of lists? The problem is that it puts severe limits on the input data: generating such vectors for >10 MB of text causes a MemoryError in Python (dozens of GB of RAM would be needed to process it!).
Thanks in advance, guys.
There are at least two optimizations in Keras which you could use in order to decrease the amount of memory needed in this case:
An Embedding layer, which makes it possible to accept a single integer instead of a full one-hot vector. Moreover, this layer could be pretrained before the final stage of network training - so you could inject some prior knowledge into your model (and even fine-tune it during the network fitting).
A fit_generator method, which makes it possible to train a network using a predefined generator that produces the (x, y) pairs needed for network fitting. You could e.g. save the whole dataset to disk and read it part by part using a generator interface.
Of course, both of these methods could be mixed. I think simplicity was the reason behind this kind of implementation in the example you provided. A brief sketch combining the two follows below.
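A minimal sketch combining both suggestions (my own illustration; it assumes text, chars and char_indices exist as built in the example, and layer sizes / step counts are placeholders): feed integer character indices into an Embedding layer and stream batches from a generator instead of building the full one-hot array:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

def batch_generator(text, char_indices, maxlen=40, step=3, batch_size=128):
    encoded = np.array([char_indices[c] for c in text], dtype='int32')
    starts = np.arange(0, len(encoded) - maxlen, step)
    while True:
        np.random.shuffle(starts)
        for i in range(0, len(starts) - batch_size, batch_size):
            idx = starts[i:i + batch_size]
            X = np.stack([encoded[j:j + maxlen] for j in idx])  # integer indices, not one-hot
            y = encoded[idx + maxlen]                           # index of the next character
            yield X, y

num_chars = len(chars)
model = Sequential([
    Embedding(input_dim=num_chars, output_dim=32, input_length=40),  # replaces the one-hot encoding
    LSTM(128),
    Dense(num_chars, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit_generator(batch_generator(text, char_indices), steps_per_epoch=1000, epochs=10)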
