Training tensorflow RNN with large datasets - python

I'm training an RNN in tensorflow. The function used is "rnn" from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn.py.
outputs, states = rnn.rnn(cell, inputs, initial_state=initial_state, sequence_length=seq_length)
The reason I use this function is that my data sequences are of variable lengths. This function expects all data to be loaded at once. Since my data doesn't fit into memory all at once, I need to load it piece by piece. Any pointers on how this can be done would be highly appreciated.
Thanks

The standard practice here is to break your data up into chunks and work on it a chunk at a time. For example, if you are working with text, you might break your data up into sentences, and pass mini-batches of 10s-100s of sentences to the training process one at a time.
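For instance, here is a minimal sketch of that idea (load_sentences, the file list and the placeholder names are made up for illustration): read one chunk from disk at a time and yield padded mini-batches that you feed to the graph through feed_dict.
import numpy as np

def minibatches(filenames, batch_size=32):
    # read one chunk (file) at a time and yield padded mini-batches
    for filename in filenames:
        sequences = load_sentences(filename)  # hypothetical loader returning lists of word ids
        for start in range(0, len(sequences), batch_size):
            batch = sequences[start:start + batch_size]
            lengths = np.array([len(s) for s in batch])
            padded = np.zeros((len(batch), lengths.max()), dtype=np.int32)
            for i, seq in enumerate(batch):
                padded[i, :len(seq)] = seq
            yield padded, lengths

# training loop: feed each mini-batch and its true lengths to the graph
# for padded, lengths in minibatches(train_files):
#     sess.run(train_op, feed_dict={inputs_placeholder: padded,
#                                   seq_length_placeholder: lengths})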
For an example of how to do this, take a look at this RNN tutorial.
https://www.tensorflow.org/versions/r0.9/tutorials/recurrent/index.html
The tutorial text itself doesn't describe chunking in detail, but take a look at the associated code in github and see how it loads its input data and batches it for training.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/rnn/ptb
Hope that helps!

Related

How to normalize data when using Keras fit_generator

I have a very large data set and am using Keras' fit_generator to train a Keras model (tensorflow backend). My data needs to be normalized across the entire data set; however, when using fit_generator I only have access to relatively small batches of data, and normalizing within such a small batch is not representative of normalizing across the entire data set. The impact is quite large (I tested it and the model accuracy is significantly degraded).
My question is this: what is the correct practice for normalizing data across the entire data set when using Keras' fit_generator? One last point: my data is a mix of text and numeric data, not images, so I cannot use some of the capabilities of Keras' provided image generator, which may address some of these issues for image data.
I have looked at normalizing the full data set prior to training ("brute-force" approach, I suppose) but I am wondering if there is a more elegant way of doing this.
The generator does allow you to do on-the-fly processing of data, but pre-processing the data prior to training is the preferred approach:
Pre-processing and saving the data once avoids re-processing it on every epoch; in the generator you should only do small operations that can be applied per batch. One-hot encoding, for example, is a common per-batch operation, while tokenising sentences etc. can be done offline.
You will probably tweak and fine-tune your model several times. You don't want the overhead of normalising the data on every run, and you want to ensure that every model trains on the same normalised data.
So, pre-process once offline prior to training and save it as your training data. When predicting you can process on-the-fly.
You would do this by pre-processing your data into a matrix. One-hot encode your text data:
from keras.preprocessing.text import Tokenizer
# X is a list of text elements
t = Tokenizer()
t.fit_on_texts(X)
X_one_hot = t.texts_to_matrix(X)
and normalize your numeric data via:
import numpy as np

for i in range(len(matrix)):
    # min-max normalise each row; the small constant avoids division by zero
    matrix[i] = (matrix[i] - np.min(matrix[i], 0)) / (np.max(matrix[i], 0) + 0.0001)
If you concatenate your two matrices you should have properly preprocessed your data. I could just imagine that the text will always influence the outcome of your model too much. So it would make sense to train separate models for text and numeric data.
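For the concatenation step itself, a one-line sketch (assuming X_one_hot and the normalised numeric matrix from above have the same number of rows):
import numpy as np

# column-wise stack of the one-hot text features and the normalised numeric features
X_train = np.hstack([X_one_hot, matrix])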

TensorFlow train batches for multiple epochs?

I don't understand how to run the result of tf.train.batch for multiple epochs. It runs out after one pass, of course, and I don't know how to restart it.
Maybe I can repeat it using tile, which is complicated but described in full here.
If I can redraw a batch each time that would be fine -- I would need batch_size random integers between 0 and num_examples. (My examples all sit in local RAM). I haven't found an easy way to get these random draws at once.
Ideally there is a reshuffle too when the batch is repeated, but it makes more sense to me to run an epoch, then reshuffle, and so on, instead of joining the training set to itself num_epochs times and then shuffling.
I think this is confusing because I'm not really building an input pipeline, since my input fits in memory, yet I still need batching, shuffling and multiple epochs, which possibly requires more knowledge of input pipelines.
tf.train.batch simply groups upstream samples into batches, and nothing more. It is meant to be used at the end of an input pipeline. Data and epochs are dealt with upstream.
For example, if your training data fits into a tensor, you could use tf.train.slice_input_producer to produce samples. This function has arguments for shuffling and epochs.
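A rough sketch of such a pipeline with the TF 1.x queue API (the toy arrays, batch size and epoch count below are placeholders, not from the question):
import numpy as np
import tensorflow as tf

# toy in-memory data standing in for your real training tensors
features = np.random.rand(1000, 10).astype(np.float32)
labels = np.random.randint(0, 2, size=1000).astype(np.int32)

# slice_input_producer shuffles and counts epochs for you
x, y = tf.train.slice_input_producer([features, labels], num_epochs=5, shuffle=True)
batch_x, batch_y = tf.train.batch([x, y], batch_size=32)

with tf.Session() as sess:
    # num_epochs creates a local counter variable, so initialise local variables too
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            xb, yb = sess.run([batch_x, batch_y])
            # ... run your training step on xb, yb ...
    except tf.errors.OutOfRangeError:
        pass  # raised once num_epochs passes over the data have been consumed
    finally:
        coord.request_stop()
        coord.join(threads)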

How to use model.fit_generator in keras

When and how should I use fit_generator?
What is the difference between fit and fit_generator?
If you have prepared your data and labels in all necessary aspects and can simply assign them to arrays x and y, then use model.fit(x, y).
If you need to preprocess and/or augment your data while training, then you can take advantage of the generators that Keras provides.
You could, for example, augment images by applying random transforms (very helpful if you only have little data to train with), pad sequences, tokenize text, let Keras automagically read your data from a folder and assign appropriate classes (flow_from_directory), and much more.
See here for examples and boilerplate code for image preprocessing: https://keras.io/preprocessing/image/
or here for text preprocessing:
https://keras.io/preprocessing/text/
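For instance, a minimal image-augmentation sketch using Keras' built-in generator (the folder path, sizes and epoch count are placeholders):
from keras.preprocessing.image import ImageDataGenerator

# random augmentations applied on the fly while reading images from a folder
datagen = ImageDataGenerator(rescale=1. / 255,
                             rotation_range=20,
                             horizontal_flip=True)
train_gen = datagen.flow_from_directory('data/train',  # hypothetical folder: one subfolder per class
                                        target_size=(128, 128),
                                        batch_size=32,
                                        class_mode='categorical')
# model.fit_generator(train_gen, steps_per_epoch=num_train_images // 32, epochs=10)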
fit_generator will also help you to train in a more memory-efficient way, since you load data only when needed. The generator function yields (i.e. "delivers") data to your model batch by batch, on demand, so to speak.
They are useful for on-the-fly augmentations, as the previous answer mentioned. This, however, is not necessarily restricted to generators, because you can fit for one epoch, then augment your data, and fit again.
What does not work with fit, though, is using more data per epoch than fits in memory. If you have a dataset of 1 TB and only 8 GB of RAM, you can use the generator to load the data on the fly and only hold a couple of batches in memory. This helps tremendously with scaling to huge datasets.
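For example, a minimal sketch of a custom batch generator for data stored across many files (file names, shapes and batch size are made up here):
import numpy as np

def batch_generator(file_paths, batch_size=32):
    # fit_generator expects an endless generator, hence the while True
    while True:
        for path in file_paths:
            data = np.load(path)  # hypothetical .npy chunk with features plus a label column
            x, y = data[:, :-1], data[:, -1]
            for start in range(0, len(x), batch_size):
                yield x[start:start + batch_size], y[start:start + batch_size]

# model.fit_generator(batch_generator(train_files),
#                     steps_per_epoch=total_samples // 32,
#                     epochs=10)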

Keras: Tokenizer with fit_generator() on text data

I am creating a neural net on a very large text dataset using Keras. To build the model and make sure everything was working, I read a fraction of the data into memory and used the built-in Keras 'Tokenizer' to do the necessary preprocessing, including mapping each word to a token. Then I used model.fit().
Now I want to extend to the full dataset and don't have the space to read all the data into memory. So I'd like to make a generator function to sequentially read data from disk and use model.fit_generator(). However, if I do this, then I separately fit a Tokenizer object on each batch of data, producing a different word-to-token mapping for each batch. Is there any way around this? Is there any way I can continuously build a token dictionary using Keras?
So basically you could define a text generator and feed it to the fit_on_texts method in the following manner:
Assuming that you have texts_generator, which reads your data from disk in parts and returns an iterable collection of texts, you may define:
def text_generator(texts_generator):
    for texts in texts_generator:
        for text in texts:
            yield text
Please take care that this generator should stop after it has read all of the data from disk, which could possibly require you to change the original generator you want to use in model.fit_generator.
Once you have this generator you may simply apply the tokenizer.fit_on_texts method:
tokenizer.fit_on_texts(text_generator(texts_generator))
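Once the tokenizer has been fitted on the whole corpus this way, a rough sketch of using it inside the generator you pass to model.fit_generator (chunk_generator, the labels and maxlen are assumptions, not from the question):
from keras.preprocessing.sequence import pad_sequences

def training_batches(chunk_generator, batch_size=32, maxlen=100):
    # chunk_generator is a hypothetical function yielding (texts, labels) chunks from disk
    while True:  # fit_generator expects an endless generator
        for texts, labels in chunk_generator():
            x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=maxlen)
            for start in range(0, len(x), batch_size):
                yield x[start:start + batch_size], labels[start:start + batch_size]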

Tensorflow: run time test metrics and data queues

I want to compute and display accuracy on the test set while the network is training.
In the MNIST tutorial that uses feeds, one can see that it can be done easily by feeding test data rather than train data. Simple solution to a simple problem.
However I am not able to find such an easy example when using queues for batching. AFAICS, the documentation proposes two solutions:
Offline testing with saved states. I don't want offline.
Making a second 'test' network that share weights with the network being trained. That doesn't sound simple and I have not seen an example of that.
Is there a third, easy way to compute test metrics at run time? Or is there an example somewhere of the second, test network with shared weights that proves me wrong by being super simple to implement?
If I understand your question correctly, you want to validate your model during training while using queue inputs rather than feed_dict?
See my program that does this.
Here is a short explanation:
First, you need to convert your data into train and validation files such as 'train.tfrecords' and 'valid.tfrecords'.
Second, in your training program start two queues that parse these two files, and use variable sharing to get the two logits for training and validation.
In my program this is done by
with tf.variable_scope("inference") as scope:
    logits = mnist.inference(images)
    scope.reuse_variables()
    validation_logits = mnist.inference(validation_images)
Then use logits to compute the training loss and minimize it, and use validation_logits to compute the validation accuracy.
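For completeness, a rough sketch of the two input queues that feed the snippet above (the parse function and feature shapes are assumptions, not taken from the linked program):
import tensorflow as tf

def input_pipeline(filename, batch_size=64):
    # queue over one TFRecord file; cycles over it indefinitely by default
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    example = tf.parse_single_example(serialized, features={
        'image': tf.FixedLenFeature([784], tf.float32),  # assumed feature spec
        'label': tf.FixedLenFeature([], tf.int64),
    })
    return tf.train.shuffle_batch([example['image'], example['label']],
                                  batch_size=batch_size,
                                  capacity=10000, min_after_dequeue=1000)

images, labels = input_pipeline('train.tfrecords')
validation_images, validation_labels = input_pipeline('valid.tfrecords')
# these two batches then feed the shared-variable inference calls shown above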
