When and how should I use fit_generator?
What is the difference between fit and fit_generator?
If you have prepared your data and labels in all necessary aspects and simply can assign these to an array x and y, than use model.fit(x, y).
If you need to preprocess and/or augment your data while training, than you can take advantage of the generators that Keras provides.
You could for example augment images by applying random transforms (very helpful if you only have little data to train with), pad sequences, tokenize text, let Keras automagically read your data from a folder and assign appropiate classes (flow_from_directory) and much more.
See here for examples and boilerplate code for image preprocessing: https://keras.io/preprocessing/image/
or here for text preprocessing:
https://keras.io/preprocessing/text/
fit_generator also will help you to train in a more memory efficient way since you load data only when needed. The generator function yields (aka "delivers") data to your model batch by batch on demand, so to say.
They are useful for on-the-fly augmentations, which the previous poster mentioned. This however is not neccessarily restricted to generators, because you can fit for one epoch and then augment your data and fit again.
What does not work with fit is using too much data per epoch though. This means that if you have a dataset of 1 TB and only 8 GB of RAM you can use the generator to load the data on the fly and only hold a couple of batches in memory. This helps tremendously on scaling to huge datasets.
Related
I am looking at the recurrent neural net walkthrough here. In the tutorial they have a line item that does:
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
However, if you're doing a sequential build, is that still necessary? Looking at the sequential documentation, a shuffle is performed automatically? If not, why is it done here? Is there a simple numerical example of the effect?
The tf.keras.models.Sequential can also batch and shuffle the data, similar to what tf.data.Dataset does. These preprocessing features are provided in Sequential because it can take up data in several types like NumPy arrays, tf.data.Dataset, dict object as well as a tf.keras.utils.Sequence.
The tf.data.Dataset API provides these features because the API is consistent with other TensorFlow APIs ( in which Keras is not involved ).
I don't think the shuffling and batching need to be done twice. You may remove the if you wish, it will not affect the model's training. I think the author wanted to use tf.data.Dataset to get the data into a Keras model. dataset.shuffle( ... ).batch( ... ) have been colloquial with the Dataset.
TensorFlow has a long-standing limitation of 2 Gb on a single tensor. It means that you can't train your model on more than 2 Gb of data at one time without jumping through hoops. See Initializing tensorflow Variable with an array larger than 2GB ; Use large dataset in Tensorflow
The standard solution referenced in those posts is to use a placeholder and to pass it to the "session" through feed_dict:
my_graph = tf.Graph()
sess = tf.Session(graph=my_graph)
X_init = tf.placeholder(tf.float32, shape=(m_input, n_input))
X = tf.Variable(X_init)
sess.run(tf.global_variables_initializer(), feed_dict={X_init: data_for_X})
However, this only works when I use the "old" API (tf.Session(), etc.) The recommended approach nowadays is to use Keras (all the tutorials on tensorflow.org use it). And, with Keras, there's no tf.Graph(), no tf.Session(), and no run() (at least none that are readily visible to the user.)
How do I adapt the above code to work with Keras?
In Keras, you'd not load your entire dataset in a tensor. You load it in numpy arrays.
If the entire data can be in a single numpy array:
Thanks to #sebrockm's comment.
The most trivial usage of Keras is simply loading your dataset in a numpy array (not a tf tensor) and call model.fit(arrayWithInputs, arrayWithoutputs, ...)
If the entire data doesn't fit a numpy array:
You'd create a generator or a keras.utils.Sequence to load batches one by one and then train the model with model.fit_generator(generatorOrSequence, ...)
The limitation becomes the batch size, but you'd hardly ever hit 2GB in a single batch.
So, go for it:
keras.utils.Sequence
model.fit_generator
Keras doesn't have a 2GB limitation for datasets, I've trained much larger datasets with Keras with no issues.
The limitation could come from TensorFlow constants, which do have a 2GB limit, but in any case you should NOT store datasets as constants, as these are saved as part of the graph, and that is not the idea of storing a model.
Keras has the model.fit_generator function that you can use to pass a generator function which loads data on the fly, and makes batches. This allows you to load a large dataset on the fly, and you usually adjust the batch size so you maximize performance with acceptable RAM usage. TensorFlow doesn't have a similar API, you have to implement it manually as you say with feed_dict.
I have an image classification task where I've created multiple crops of each image as well as flipped/flopped versions to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
lbl: int,
crop_0: np.ndarray,
crop_1: np.ndarray,
crop_0_flipped: np.ndarray,
crop_1_flipped: np.ndarray
}
Basically 4 images / entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N images becomes 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each image, only use the crop_0 and crop_1 images and average the softmax outputs for classification.
My question is - what is the best and most efficient way of training such a dataset? I'm willing to change my approach if this will make training more inefficient, and it seems that the simplest thing to do would have been to have separate tfrecords files for each version (crop & flip/flop images) and interleave the files into one dataset, but I do not want to have a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach that you'll come to loath later (I did it this way originally and loath that code now). The better way is to keep your original dataset on disk as-is, don't write your preprocessing steps to disk. Do that kind of preprocessing in the CPU while you train. The tensorflow Dataset preprocessing pipeline makes this easy, modular, and provides the parallelization you need to take advantage of multiple cores at not extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data to tensorflow using feed_dict, instead, tensorflow will just invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply some set of transformations and use the property num_parallel_calls to distribute the operations across multiple cores. If your preprocessing can be done in tensorflow code, great, if not you'll need to use tf.py_func to use python preprocessing code.
The guide I linked to above describes all of this very well. You will want to us a feedable iterator described in the section called "Creating an iterator". This will allow you to get a string ID from each of the 2 datasets (train and test) and pass that string to tensorflow via feed_dict indicating which of the two datasets tensorflow should pull samples from.
I have a very large data set and am using Keras' fit_generator to train a Keras model (tensorflow backend). My data needs to be normalized across the entire data set however when using fit_generator, I have access to relatively small batches of data and normalization of the data in this small batch is not representative of normalizing the data across the entire data set. The impact is quite large (I tested it and the model accuracy is significantly degraded).
My question is this: What is the correct practice of normalizing data across entire data set when using Keras' fit_generator? One last point: my data is a mix of text and numeric data and not images, and hence I am not able to use some of the capabilities in Keras' provided image generator which may address some of the issues for image data.
I have looked at normalizing the full data set prior to training ("brute-force" approach, I suppose) but I am wondering if there is a more elegant way of doing this.
The generator does allow you to do on-the-fly processing of data but pre-processing the data prior to training is the preferred approach:
Pre-process and saving avoids processing the data for every epoch, you should really just do small operations that can be applied to batches. One-hot encoding for example is a common one while tokenising sentences etc can be done offline.
You probably will tweak, fine-tune your model. You don't want to have the overhead of normalising the data and ensure every model trains on the same normalised data.
So, pre-process once offline prior to training and save it as your training data. When predicting you can process on-the-fly.
You would do this via pre-processing your data to a matrix. One hot encode your text data:
from keras.preprocessing.text import Tokenizer
# X is a list of text elements
t = Tokenizer()
t.fit_on_texts(X)
X_one_hot = t.texts_to_matrix(X)
and normalize your numeric data via:
for i in range(len(matrix)):
refactored_array = (matrix[i]- np.min(matrix[i], 0)) / (np.max(matrix[i], 0) + 0.0001)
If you concatenate your two matrices you should have properly preprocessed your data. I just could imagine that the text will always be influencing the outcome of your model too much. So it would make sence to train seperate models for text and numeric data.
I'm training an RNN in tensorflow. The function used is "rnn" from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn.py.
outputs, states = rnn.rnn(cell, inputs, initial_state=initial_state, sequence_length=seq_length)
The reason I use this function is because my data sequences are of variable lengths. This function expects all data to be loaded at once. Since my data doesn't fit in to memory all at once, I need to load data piece by piece. Any pointers on how it can be done would be highly appreciated.
Thanks
The standard practice here is to break your data up into chunks and work on it a chunk at a time. For example, if you are working with text, you might break your data up into sentences, and pass mini-batches of 10s-100s of sentences to the training process one at a time.
For an example of how to do this, take a look at this RNN tutorial.
https://www.tensorflow.org/versions/r0.9/tutorials/recurrent/index.html
The tutorial text itself doesn't describe chunking in detail, but take a look at the associated code in github and see how it loads its input data and batches it for training.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/rnn/ptb
Hope that helps!