Understanding Keras LSTM (lstm_text_generation.py) - RAM memory issues - Python

I'm diving into LSTM RNNs with Keras and the Theano backend. While working through the LSTM example from the Keras repo (the full code of lstm_text_generation.py on GitHub), one thing isn't entirely clear to me: the way it vectorizes the input data (text characters):
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# np - means numpy
print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
Here, as you can see, they allocate arrays of zeros with NumPy and then set a '1' at the position determined by each character, one-hot encoding the input sequences that way.
The question is: why did they use that approach? Is it possible to optimize it somehow? Maybe the input data could be encoded in some other way that doesn't require these huge arrays? The problem is that this puts severe limits on the input data: generating such vectors for more than ~10 MB of text causes a Python MemoryError (dozens of GB of RAM would be needed to process it!).
Thanks in advance, guys.

There are at least two optimizations in Keras which you could use in order to decrease the amount of memory needed in this case:
An Embedding layer, which makes it possible to feed in a single integer per character instead of a full one-hot vector. Moreover, this layer could be pretrained before the final stage of network training, so you could inject some prior knowledge into your model (and even fine-tune it during fitting).
The fit_generator method, which makes it possible to train a network using a predefined generator that yields the (x, y) pairs needed for fitting. You could e.g. keep the whole dataset on disk and read it part by part through a generator interface.
Of course, both of these methods can be combined; a minimal sketch follows below. I think simplicity was the reason behind this kind of implementation in the example you provided.
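A minimal sketch combining both ideas, assuming the Keras 2 style fit_generator signature; the 32-dimensional embedding, 128 LSTM units and batch size of 128 are illustrative choices, not the example's settings:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# stand-in corpus; in the real example `text` is the loaded Nietzsche text
text = "an example corpus for character-level language modelling. " * 200
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}

maxlen, step, batch_size = 40, 3, 128

model = Sequential([
    Embedding(len(chars), 32, input_length=maxlen),  # integer indices in, dense vectors out
    LSTM(128),
    Dense(len(chars), activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

def batch_generator():
    """Yield (x, y) batches on the fly instead of materializing one giant array."""
    starts = range(0, len(text) - maxlen, step)
    while True:
        for b in range(0, len(starts), batch_size):
            chunk = starts[b:b + batch_size]
            x = np.array([[char_indices[c] for c in text[i:i + maxlen]] for i in chunk])
            y = np.array([char_indices[text[i + maxlen]] for i in chunk])
            yield x, y

steps = max(1, (len(text) - maxlen) // (step * batch_size))
model.fit_generator(batch_generator(), steps_per_epoch=steps, epochs=5)
With this setup only one batch of integer indices lives in memory at a time, so the footprint no longer grows with the size of the text.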

Related

Finding patterns in time series with PyTorch

I started with PyTorch doing image recognition. Now I want to test (very basically) with pure NumPy arrays. I'm struggling to get the setup to work. Basically, I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500, and I want to detect e.g. "high values at the beginning", "sine-wave-like function", "convex", "concave" etc. - just the shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first, here is my very basic net:
class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(Net, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
        self.fc = nn.Linear(self.hidden_dim, output_dim)

    def forward(self, x):
        h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
        out, h0 = self.rnn(x, h0.detach())
        out = out[:, -1, :]
        out = self.fc(out)
        return out

network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
    network.train()
    network.float()
    batching = True
    index = 0
    # monitor the cumulative loss for an epoch
    cummloss = []
    # start batching some curves
    while batching:
        optimizer.zero_grad()
        # here I start clustering some curves into a batch and normalize the curves
        _input = []
        batch_size = min(len(data)-1, index+batch_size_train) - index
        for d in data[index:min(len(data)-1, index+batch_size_train)]:
            y = np.array(d['data']['y'], dtype='d')
            y = np.multiply(y, y.max())
            y = y[0:1500]
            y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
            if len(_input) == 0:
                _input = y
            else:
                _input = np.vstack((_input, y))
        input = torch.from_numpy(_input).float()
        input = torch.reshape(input, (1, batch_size, len(y)))
        target = np.zeros((1, 7))
        # the correct classes have indices, so I create a vector with 1 at the correct locations
        for _index in np.array(d['classifier']):
            target[0, _index-1] = 1
        target = torch.from_numpy(target)
        # get the result from the network
        output = network(input)
        # is this a good loss function?
        loss = F.l1_loss(output, target)
        loss.backward()
        cummloss.append(loss.item())
        optimizer.step()
        index = index + batch_size_train
        if index > len(data):
            print(np.mean(cummloss))
            batching = False

for e in range(1, n_epochs):
    print('Epoch: ' + str(e))
    train(0)
The problem I'm facing right now is that the loss barely changes, even after hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure PNG/JPG image recognition. When I convert the curves to PNG I have little trouble training a net; I used DenseNet and it worked just fine, but it seems like complete overkill for this simple task.
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory, which model you choose does not matter as much as how you formulate your problem.
But in your case, the most obvious limitation you're going to face is your sequence length: 1500. RNNs store information across steps and typically run into trouble on long sequences, with vanishing or exploding gradients.
LSTM networks were developed to circumvent this limitation with a memory cell, but even then, on very long sequences they are still limited by the amount of information stored in the cell.
You could also try using a CNN and treating the curve as an image (see the sketch below).
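A minimal sketch of that CNN idea, treating each length-1500 curve as a one-channel 1D signal; the layer sizes are illustrative assumptions, not tuned values:
import torch
import torch.nn as nn

class CurveCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),   # collapse the remaining time axis
            nn.Flatten(),
            nn.Linear(32, n_classes),  # raw logits; pair with e.g. BCEWithLogitsLoss
        )

    def forward(self, x):              # x: (batch, 1, 1500)
        return self.head(self.features(x))

logits = CurveCNN()(torch.randn(8, 1, 1500))
print(logits.shape)  # torch.Size([8, 7])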
Are there existing examples of this kind of problem?
I don't know of any, but I might have some suggestions: if I understood your problem correctly, you're going from a (1500, 1) input to a (7, 1) output, where 6 of the 7 positions are 0 and only the corresponding class is 1.
I don't see any activation function. Usually, when dealing with multiple classes, you don't compute the loss directly on the output of the dense layer; you apply a normalizing function like softmax first and then compute the loss.
From your description of features in the form of sine-like structures, the closest thing that comes to mind is the frequency domain. So, given an input signal, just transform it to the frequency domain with a Fourier transform and use that as your feature input.
It might be best to look for such projects on the internet; one such project you might want to look at is the research paper or video from this group (they have some Jupyter notebooks for you to try), or any similar works. They use Fourier features that go through a multi-layer perceptron (MLP).
I am not sure what exactly you want to do, but it seems like a classification task. You would use an RNN if you want your neural network to work with a sequence; to me it seems like the 1500 dimensions are independent, and as such can just be treated as a plain input vector.
Regarding the last layer: for a classification problem it is usually a probability distribution obtained by applying softmax (if the classes are mutually exclusive, i.e. the probabilities sum to 1), where, given an input, the net gives the probability of it belonging to each class. If we are predicting multiple classes at once, we use a sigmoid as the last layer of the neural network.
Regarding your loss, there are many losses you can try to see if they work better. Once again, for different features you have to know what exactly the measure of distance is (i.e. how different two things are). Check out this website, or any loss-function explanations on the net.
So you should try a simple MLP on top of Fourier features as a starting point, assuming that is your feature vector (see the sketch below).
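A minimal sketch of that starting point, assuming FFT magnitudes as the feature vector and a sigmoid-style multi-label output (the 64 coefficients and hidden size are illustrative assumptions):
import numpy as np
import torch
import torch.nn as nn

def fourier_features(curve, n_coeffs=64):
    """Magnitudes of the first n_coeffs FFT coefficients of a length-1500 curve."""
    spectrum = np.abs(np.fft.rfft(curve))
    return torch.tensor(spectrum[:n_coeffs], dtype=torch.float32)

mlp = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 7),            # 7 logits, one per class
)
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

x = torch.stack([fourier_features(np.sin(np.linspace(0, 10, 1500))) for _ in range(4)])
target = torch.zeros(4, 7)
target[:, 2] = 1                  # dummy multi-label targets
loss = loss_fn(mlp(x), target)
loss.backward()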
Image recognition is different from time-series data. In the imaging domain, your dataset has more in common with problems like activity recognition and video recognition, which have a temporal component, so I'd recommend looking into models for those.
As for the current model, I'd recommend using an LSTM instead of a plain RNN. Also, for classification you need an activation function in your final layer: softmax with a cross-entropy-based loss, or sigmoid with an MSE loss (a sketch follows after the links below).
Keras has a TimeDistributed wrapper which makes it easy to handle time components. You can use a similar approach in PyTorch by applying linear layers followed by an LSTM.
Look into these for better understanding:
Activity Recognition: https://www.narayanacharya.com/vision/2019-12-30-Action-Recognition-Using-LSTM
https://discuss.pytorch.org/t/any-pytorch-function-can-work-as-keras-timedistributed/1346
How to implement time-distributed dense (TDD) layer in PyTorch
Activation Function:
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
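A minimal sketch of that recommendation in PyTorch, feeding the curve as a sequence of 1500 one-dimensional steps; since the 7 classes are not exclusive here, the final sigmoid is paired with binary cross-entropy (BCEWithLogitsLoss) rather than MSE, and the sizes are illustrative assumptions:
import torch
import torch.nn as nn

class CurveLSTM(nn.Module):
    def __init__(self, hidden_dim=64, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):              # x: (batch, 1500, 1)
        out, _ = self.lstm(x)          # out: (batch, 1500, hidden_dim)
        return self.fc(out[:, -1, :])  # logits from the last time step

model = CurveLSTM()
loss_fn = nn.BCEWithLogitsLoss()       # sigmoid + binary cross-entropy in one op

x = torch.randn(8, 1500, 1)
target = torch.randint(0, 2, (8, 7)).float()
loss = loss_fn(model(x), target)
loss.backward()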

A potentially dangerous place in PyTorch code that leads to producing NaNs

I am trying to implement a bidirectional LSTM in PyTorch, and the implementation of the forward method looks as follows:
def forward(self, inputs):
    word_embeddings = self.embedding.forward(inputs)
    output = torch.zeros((inputs.shape[0], inputs.shape[1], self.tagset_size))

    # init hidden states for LSTM
    hidden_state_L = []
    hidden_state_R = []
    hidden_state_L.append(torch.zeros(inputs.shape[1], self.lstm_hidden_dim))
    hidden_state_R.append(torch.zeros(inputs.shape[1], self.lstm_hidden_dim))

    if next(self.parameters()).is_cuda:
        output = output.cuda()
        hidden_state_L[0] = hidden_state_L[0].cuda()
        hidden_state_R[0] = hidden_state_R[0].cuda()

    for i, word in enumerate(word_embeddings):
        hidden_state_L.append(self.tanh.forward(self.fc_xh_L.forward(word) + self.fc_hh_L.forward(hidden_state_L[-1])))
    for i, word in enumerate(reversed(word_embeddings)):
        hidden_state_R.append(self.tanh.forward(self.fc_xh_R.forward(word) + self.fc_hh_R.forward(hidden_state_R[-1])))

    hidden_state_L.pop()
    hidden_state_R = hidden_state_R[1:][::-1]

    for idx, (h_L, h_R) in enumerate(zip(hidden_state_L, hidden_state_R)):
        output[idx - 1] = self.fc_hy.forward(torch.cat([h_L, h_R], dim=1))
    return output
In my implementation the hidden states are stored in lists, and the activations are then taken from those lists. Training runs for about an epoch and a half, reaching an accuracy of approximately 80%, and then the loss becomes NaN and training fails. Where is the potentially risky place, or can anyone suggest a more sensible approach than these lists? Or are there caveats with the computations on the device?
I would be grateful for any suggestions.

tf.data.Iterator with a multi-GPU setup

I have looked at the CIFAR-10 multi-GPU implementation to draw inspiration for parallelizing my own GPU-trained model.
My model consumes data from TFRecords, which are iterated over via the tf.data.Iterator class. So, given 2 GPUs, what I am trying to do is call iterator.get_next() on the CPU once per GPU (twice in this example), do some preprocessing, embedding lookup and other CPU-related work, and then feed the two batches to the GPUs.
Pseudo code:
with tf.device('/cpu:0'):
    batches = []
    for gpu in multiple_gpus:
        single_gpu_batch = cpu_function(iterator.get_next())
        batches.append(single_gpu_batch)
....................
for gpu, batch in zip(multiple_gpus, batches):
    with tf.device('/device:GPU:{}'.format(gpu.id)):
        single_gpu_loss = inference_and_loss(batch)
        tower_losses.append(single_gpu_loss)
...........
...........
total_loss = average_loss(tower_losses)
The problem is that if there is only 1 example (or fewer) left to be drawn from the data and I call iterator.get_next() twice, a tf.errors.OutOfRangeError is raised, and the data from the first iterator.get_next() call (which didn't actually fail, only the second one did) never gets passed through the GPU.
I thought about drawing the data in a single iterator.get_next() call and splitting it later, but tf.split fails if the batch size is not divisible by the number of GPUs.
What is the right way to consume from an iterator in a multi-GPU setup?
I think the second suggestion is the easiest way to go. To avoid the splitting problem on the last batch, you can use the drop_remainder option of dataset.batch (a short sketch follows after the code below); or, if you need to see all the data, one possible solution is to set the split sizes explicitly based on the size of the drawn batch, so that the splitting operation never fails:
dataset = dataset.batch(batch_size * multiple_gpus)
iterator = dataset.make_one_shot_iterator()
batches = iterator.get_next()

split_dims = [0] * multiple_gpus
drawn_batch_size = tf.shape(batches)[0]
Either in a greedy manner, i.e., fit batch_size samples on each device until you run out:
#### Solution 1 [Greedy]:
for i in range(multiple_gpus):
    split_dims[i] = tf.maximum(0, tf.minimum(batch_size, drawn_batch_size))
    drawn_batch_size -= batch_size
or in a more spread-out manner, ensuring that each device gets at least one sample (assuming multiple_gpus < drawn_batch_size):
### Solution 2 [Spread]
# reserve one sample per device, then distribute the remainder greedily
drawn_batch_size -= multiple_gpus
for i in range(multiple_gpus):
    split_dims[i] = tf.maximum(0, tf.minimum(batch_size - 1, drawn_batch_size)) + 1
    drawn_batch_size -= batch_size - 1
## Split batches
batches = tf.split(batches, split_dims)
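For completeness, a short sketch of the drop_remainder alternative mentioned above, reusing the dataset, batch_size and multiple_gpus names from the snippet; the final short batch is simply discarded, so tf.split always sees a size divisible by the number of GPUs:
# drop the last partial batch so every drawn batch has exactly
# batch_size * multiple_gpus examples and splits evenly
dataset = dataset.batch(batch_size * multiple_gpus, drop_remainder=True)
iterator = dataset.make_one_shot_iterator()
batches = iterator.get_next()
per_gpu_batches = tf.split(batches, multiple_gpus)  # one equal slice per GPU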

Keras model.predict function giving input shape error

I have implemented the Universal Sentence Encoder in TensorFlow and now I am trying to predict the class probabilities for a sentence. I am converting the string to an array as well.
Code:
if model.model_type == "universal_classifier_basic":
    class_probs = model.predict(np.array(['this is a random sentence'], dtype=object))
Error Message:
InvalidArgumentError (see above for traceback): input must be a vector, got shape: []
[[Node: lambda_1/module_apply_default/tokenize/StringSplit = StringSplit[skip_empty=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](lambda_1/module_apply_default/RegexReplace_1, lambda_1/module_apply_default/tokenize/Const)]]
Any leads, suggestions or explanations are welcomed and highly appreciated.
Thank You :)
It is not as easy as you would like. Usually a model expects a vector of integers as input, where each integer represents the index of the corresponding word in a vocabulary. For example:
vocab = {"hello":0, "world":1}
and you want to feed the sentence "hello world" to the network, then you should build the vector as follows:
net_input = [vocab.get(word) for word in "hello world".split(" ")]
Note also that if you trained the network with mini-batches, you will need to add an extra leading dimension to the vector you feed to the network. You can easily do this with NumPy:
import numpy as np
net_input = np.expand_dims(net_input, 0)
This way your net_input has shape [1, 2] and you can feed it into the network.
There is still one problem that could stop you from feeding the network such a vector. At training time you probably defined a placeholder for the input with a fixed length (e.g. 30 or 40 tokens). At test time you need to match that size, padding your sentence if it doesn't fill the whole length, or cutting it if it is longer.
You can truncate or add padding as follows:
net_input = [old_in[:max_len] + [vocab.get("PAD")] * (max_len - len(old_in[:max_len])) for old_in in net_input]
This line of code truncates the input to the maximum possible length if necessary (old_in[:max_len]; note that Python does nothing if the length is already below max_len) and fills the remaining max_len - len(old_in[:max_len]) slots with padding tokens (+ [vocab.get("PAD")]).
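Putting those pieces together, a small end-to-end sketch; the vocabulary, PAD token and max_len of 5 are made-up values for illustration:
import numpy as np

vocab = {"PAD": 0, "hello": 1, "world": 2}
max_len = 5

sentence = "hello world"
net_input = [vocab.get(word, vocab["PAD"]) for word in sentence.split(" ")]
# truncate to max_len, then pad up to max_len with the PAD index
net_input = net_input[:max_len] + [vocab["PAD"]] * (max_len - len(net_input[:max_len]))
net_input = np.expand_dims(net_input, 0)  # add the batch dimension
print(net_input.shape)                    # (1, 5)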
Hope this helps.
If this is not the situation you are in, just leave a comment on the answer and I'll try to figure out other solutions.

How to use LSTM with sequence data of varying length with keras without embedding?

I have input data where each example is a varying number of vectors of length k. In total I have n examples, so the dimensions of the input are n * ? * k, where the question mark symbolizes the varying length.
I want to input it to an LSTM layer in Keras, if possible, without using embedding (it isn't your ordinary words dataset).
Could someone write a short example of how to do this?
The data is currently a double nested python array, e.g.
example1 = [[1,0,1], [1,1,1]]
example2 = [[1,1,1]]
my_data = []
my_data.append(example1)
my_data.append(example2)
I think you could use pad_sequences. This should get all of your inputs to the same length.
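A minimal sketch of that, assuming the my_data list of lists from the question and k = 3; padding='post' is just one reasonable choice:
from keras.preprocessing.sequence import pad_sequences

example1 = [[1, 0, 1], [1, 1, 1]]
example2 = [[1, 1, 1]]
my_data = [example1, example2]

# pad every example with zero vectors up to the longest sequence length
padded = pad_sequences(my_data, padding='post', dtype='float32')
print(padded.shape)  # (2, 2, 3) -> (n examples, max timesteps, k)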
You can use padding (pad_sequences) and a Masking layer.
You can also train batches of different lengths in a manual training loop:
for e in range(epochs):
    for batch_x, batch_y in list_of_batches:  # provided you separated the batches by length
        model.train_on_batch(batch_x, batch_y)
The key point in all of this is that your input_shape=(None, k) (see the sketch below).
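A minimal sketch of a model using Masking and input_shape=(None, k), assuming k = 3 and a padded array like the one above; the layer sizes and the single sigmoid output are illustrative assumptions:
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

k = 3
model = Sequential([
    Masking(mask_value=0.0, input_shape=(None, k)),  # padded all-zero timesteps are skipped
    LSTM(32),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam')

# padded has shape (n examples, max timesteps, k), labels has shape (n examples,)
# model.fit(padded, labels, epochs=10)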
