I am looking at the recurrent neural net walkthrough here. In the tutorial they have a line that does:
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
However, if you're doing a sequential build, is that still necessary? Looking at the Sequential documentation, it seems a shuffle is performed automatically? If not, why is it done here? Is there a simple numerical example of the effect?
The tf.keras.models.Sequential model can also batch and shuffle the data, similar to what tf.data.Dataset does. These preprocessing features are provided in Sequential (through its fit method) because it can accept data in several forms, such as NumPy arrays, a tf.data.Dataset, a dict object, or a tf.keras.utils.Sequence.
The tf.data.Dataset API provides these features because the API is consistent with other TensorFlow APIs (in which Keras is not involved).
The shuffling and batching don't need to be done twice. When you pass NumPy arrays to model.fit, Keras can shuffle and batch them for you; when the input is already a tf.data.Dataset, it is idiomatic to do both in the pipeline, which is why the tutorial writes dataset.shuffle( ... ).batch( ... ). I think the author simply wanted to use tf.data.Dataset to get the data into the Keras model.
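As for a simple numerical example of the effect, here is a minimal sketch using a toy dataset of the integers 0 through 9 (rather than the tutorial's text data):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# batch(3, drop_remainder=True) groups elements in order and drops the
# leftover element 9 that can't fill a complete batch of 3.
for batch in dataset.batch(3, drop_remainder=True):
    print(batch.numpy())   # [0 1 2], then [3 4 5], then [6 7 8]

# shuffle(10) fills a 10-element buffer and draws elements from it at
# random, so each epoch yields differently ordered batches, e.g. [7 2 5].
for batch in dataset.shuffle(10).batch(3, drop_remainder=True):
    print(batch.numpy())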
Related
So this may be a silly question, but how exactly do the preprocessing layers in Keras work, especially when they are part of the model itself, compared to applying the preprocessing outside the model and then feeding the results in for training?
I'm trying to understand running data augmentation in Keras models. Let's say I have 1000 images for training. Outside the model, I can apply augmentation 10x and get 10000 resulting images for training.
But I don't understand what happens when you use a preprocessing layer for augmentation. Does this layer (or these layers, if you use many) take each image and apply the transformations before training? Does this mean the total number of images used for training (and validation, I assume) is the number of epochs times the original number of images?
Is one option better than the other? Does that depend on the number of images one originally has before augmentation?
The benefit of preprocessing layers is that the model is truly end-to-end, i.e. raw data comes in and a prediction comes out. It makes your model portable since the preprocessing procedure is included in the SavedModel.
However, the preprocessing then runs on the GPU along with the rest of the model. Usually it makes sense to load and preprocess the data with CPU worker(s) in the background while the GPU optimizes the model.
Alternatively, you could use a preprocessing layer outside of the model, inside a tf.data.Dataset pipeline. The benefit of that is that you can still easily create an inference-only model that includes the layers, which gives you the portability at inference time while keeping the speedup during training.
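As a rough sketch of the two placements (the augmentation layers are the standard Keras ones in recent TF versions; the model architecture and train_ds are illustrative assumptions):

import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

# Option 1: inside the model. The augmentation travels with the SavedModel
# and is automatically inactive at inference time.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    augment,
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

# Option 2: inside the tf.data pipeline, so augmentation runs on CPU
# workers in parallel with GPU training. train_ds is assumed to be a
# batched dataset of (image, label) pairs.
train_ds = train_ds.map(
    lambda x, y: (augment(x, training=True), y),
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)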
For more information, see the Keras guide.
I created two instances of a data generator class, extended from the Keras Sequence class, one for the training data and one for the validation data. However, at the level of my source code I can only see the validation generator re-iterating between epochs; I can't see the training generator do so. As a result, I can't verify that the augmentation of the training data is what I intended. In these snippets of the code, aug is a dict of parameters passed to a Keras ImageDataGenerator instance inside my myDataGen extension of Sequence. I wouldn't normally augment the validation data, but that's how I stumbled upon this conundrum:
aug = dict(fill_mode='nearest',
           rotation_range=10,
           zoom_range=0.3,
           width_shift_range=0.1,
           height_shift_range=0.1)

training_datagen = myDataGen(Xdata_train, ydata_train, **aug)
validation_datagen = myDataGen(Xdata_test, ydata_test, **aug)

history = model.fit(training_datagen,
                    validation_data=validation_datagen,
                    validation_batch_size=16,
                    epochs=50,
                    shuffle=False)
Everything works and I get great results, but I just wanted to be sure about the augmentation. From skimming through various functions in Keras, I gather that the data generator I wrote populates a lower-level TensorFlow Dataset, which then iterates per epoch. I just can't see how that TensorFlow Dataset is getting augmented per epoch.
Now, I've also accidentally discovered that although the fit method doesn't officially support generators for the validation data, it does work, and with the interesting behavior I would like to have for the training generator: re-reading the data from disk so that it re-augments at the level of my own source code.
Bottom line: I can see hints that the TensorFlow Dataset.cache() method is presumably storing my training dataset in memory after the first epoch. Can I somehow uncache() it to force a re-read and re-augmentation, or can someone point me to how a TensorFlow Dataset calls augmentation methods when it iterates?
Hmm. This thread, TF Dataset API for Image augmentation, makes clear that writing augmentation methods directly with the TensorFlow Dataset API is easy, but a contributor writes in the comments that you can't use keras.ImageDataGenerator on a tf.data.Dataset. Yet I can clearly see in the Keras modules that my Keras dataset is being 'adapted' into an underlying tf.data.Dataset. If that comment is true, it would explain why I don't seem to be able to break on the ImageDataGenerator augmentation of my training data. But how could it possibly be true?
My confusion came from the beginner mistake of overlooking the fact that, of course, one cannot set a breakpoint in the Keras source code after it's been compiled into a graph that runs on the GPU. But the interesting thing that came out of this confusion is that you CAN write a Keras generator for the validation data AND break on it each epoch, since it is apparently NOT compiled onto the GPU... precisely because Keras doesn't support generators for the validation data! The generator is nevertheless handled gracefully, with no runtime error. An obscure find, but I hope it helps somebody.
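As for the earlier sub-question of how a tf.data.Dataset calls augmentation methods when it iterates: a map() transformation is re-executed on every pass over the dataset, so the augmentation naturally differs per epoch, unless a cache() is placed after the map, which freezes the first epoch's results. A minimal sketch, with a random op standing in for real image augmentation:

import tensorflow as tf

ds = tf.data.Dataset.range(3)

def augment(x):
    # stand-in for image augmentation: add a random offset
    return x + tf.random.uniform([], maxval=100, dtype=tf.int64)

ds = ds.map(augment)            # re-run on every iteration
# ds = ds.map(augment).cache()  # would freeze the first epoch's results

for epoch in range(2):
    print([int(x) for x in ds])  # different random values each epoch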
I wonder if it is possible to include standardization of input features directly in a Keras model, in such a way that it is automatically included when loading the model with models.load_model. This would avoid the need to carry around the normalization transformation from the training set when applying the model elsewhere.
I understand a possible solution is to include the Keras model in a scikit-learn pipeline (see for instance How to insert Keras model into scikit-learn pipeline?). However, I would prefer not to set up a pipeline and ideally just use models.load_model. Are there possible solutions that do not involve using anything other than Keras or TensorFlow?
Another possible solution is simply using a BatchNormalization layer as the first layer in the network. This builds normalization into the network, but during training the initial normalization then depends on the statistics of the (small) batches rather than on the entire training set, and would thus vary in an unnecessary way between training batches.
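For what it's worth, a minimal sketch of that BatchNormalization-as-first-layer idea (layer sizes are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    # normalizes inputs using batch statistics during training and the
    # accumulated moving mean/variance at inference; saved with the model
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

In more recent TF versions there is also a tf.keras.layers.Normalization layer whose adapt() method computes fixed statistics over the whole training set, avoiding the batch-to-batch variation described above; it too is saved and restored by models.load_model.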
(A similar question to this was asked earlier (How to include normalization of features in Keras regression model?), but neither the question nor the answer seemed very clear.)
TensorFlow has a long-standing limitation of 2 GB on a single tensor. It means that you can't train your model on more than 2 GB of data at one time without jumping through hoops. See Initializing tensorflow Variable with an array larger than 2GB; Use large dataset in Tensorflow.
The standard solution referenced in those posts is to use a placeholder and to pass it to the "session" through feed_dict:
import tensorflow as tf  # TF 1.x API
my_graph = tf.Graph()
with my_graph.as_default():
    # m_input, n_input, data_for_X are assumed defined elsewhere
    X_init = tf.placeholder(tf.float32, shape=(m_input, n_input))
    X = tf.Variable(X_init)
    init_op = tf.global_variables_initializer()
sess = tf.Session(graph=my_graph)
# feed the data at init time instead of baking it into the graph
sess.run(init_op, feed_dict={X_init: data_for_X})
However, this only works when I use the "old" API (tf.Session(), etc.). The recommended approach nowadays is to use Keras (all the tutorials on tensorflow.org use it). And with Keras, there's no tf.Graph(), no tf.Session(), and no run() (at least none that are readily visible to the user).
How do I adapt the above code to work with Keras?
In Keras, you wouldn't load your entire dataset in a tensor. You'd load it in NumPy arrays.
If the entire data fits in a single NumPy array:
Thanks to @sebrockm's comment.
The most trivial usage of Keras is simply loading your dataset into a NumPy array (not a tf tensor) and calling model.fit(arrayWithInputs, arrayWithOutputs, ...).
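A minimal sketch of that trivial case, with toy random data standing in for a real dataset:

import numpy as np
from tensorflow import keras

# toy data; shapes and sizes are illustrative
arrayWithInputs = np.random.rand(1000, 20).astype("float32")
arrayWithOutputs = np.random.randint(0, 2, size=(1000, 1))

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(arrayWithInputs, arrayWithOutputs, batch_size=32, epochs=3)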
If the entire data doesn't fit in a NumPy array:
You'd create a generator or a keras.utils.Sequence to load batches one by one and then train the model with model.fit_generator(generatorOrSequence, ...); in recent versions, model.fit also accepts generators and Sequences directly.
The limitation becomes the batch size, but you'd hardly ever hit 2GB in a single batch.
So, go for it:
keras.utils.Sequence
model.fit_generator
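A minimal sketch of such a Sequence, assuming the data lives in two hypothetical .npy files on disk:

import numpy as np
from tensorflow import keras

class NpySequence(keras.utils.Sequence):
    def __init__(self, x_path, y_path, batch_size=32):
        # mmap_mode='r' reads slices from disk instead of loading everything
        self.x = np.load(x_path, mmap_mode="r")
        self.y = np.load(y_path, mmap_mode="r")
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # return one batch as regular in-memory arrays
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return np.asarray(self.x[sl]), np.asarray(self.y[sl])

# model.fit_generator(NpySequence("x.npy", "y.npy"), epochs=10)
# (recent versions: model.fit(NpySequence("x.npy", "y.npy"), epochs=10))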
Keras doesn't have a 2GB limitation for datasets; I've trained much larger datasets with Keras with no issues.
The limitation could come from TensorFlow constants, which do have a 2GB limit, but in any case you should NOT store datasets as constants, as these are saved as part of the graph, and that is not the idea of storing a model.
Keras has the model.fit_generator function that you can use to pass a generator function which loads data on the fly and makes batches. This allows you to load a large dataset incrementally, and you usually adjust the batch size to maximize performance with acceptable RAM usage. TensorFlow doesn't have a similar API; you have to implement it manually, as you say, with feed_dict.
I am using the Keras functionality ImageDataGenerator() to generate training and validation datasets. I am trying to understand what this function does internally. What preprocessing steps does this function perform?
Where can I find the source code of this function?
You can find all of the Keras source code at:
https://github.com/keras-team/keras
Here is the ImageDataGenerator:
https://github.com/keras-team/keras/blob/master/keras/preprocessing/image.py#L374
The keras documentation page also has links that lead you there:
https://keras.io/preprocessing/image/
Internally, the ImageDataGenerator applies a series of data augmentation transformations to the images you provide, and also prepares a Python generator for you to use when fitting your model.
There are several data augmentation options to choose from; you can get an idea of what they are on the documentation page above.
Generators are used to create batches in a loop, in this case one batch of images at a time.
Instead of using model.fit(), you will use model.fit_generator() with either ImageDataGenerator.flow() or ImageDataGenerator.flow_from_directory().
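A minimal usage sketch, assuming a compiled model and images arranged under hypothetical data/train/<class>/ subdirectories:

from keras.preprocessing.image import ImageDataGenerator

# a few illustrative augmentation options; see the docs for the full list
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    horizontal_flip=True,
)

train_gen = datagen.flow_from_directory(
    "data/train",            # one subfolder per class
    target_size=(150, 150),
    batch_size=32,
    class_mode="binary",
)

model.fit_generator(train_gen, steps_per_epoch=len(train_gen), epochs=10)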