I am using the Keras ImageDataGenerator() functionality to generate training and validation datasets. I am trying to understand what this function does internally. What preprocessing steps does this function perform?
Where can I find the source code of this function?
You can find the Keras source code at:
https://github.com/keras-team/keras
Here is the ImageDataGenerator:
https://github.com/keras-team/keras/blob/master/keras/preprocessing/image.py/#L374
The keras documentation page also has links that lead you there:
https://keras.io/preprocessing/image/
Internally, the ImageDataGenerator applies a series of data augmentation procedures to the images you provide, and also prepares a Python generator for you to use when fitting your models.
There are several data augmentation methods available; you can get an idea of what they do from the documentation page above.
Generators are used to create batches in a loop, in this case one batch of images at a time.
Instead of using model.fit(), you will use model.fit_generator() with either ImageDataGenerator.flow() or ImageDataGenerator.flow_from_directory().
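A minimal sketch of that pattern (the directory path, target size, augmentation settings, and model are placeholders, not from the question):
from keras.preprocessing.image import ImageDataGenerator

# Example augmentation settings; see the documentation for the full list.
datagen = ImageDataGenerator(rescale=1./255,
                             rotation_range=20,
                             horizontal_flip=True,
                             validation_split=0.2)

train_gen = datagen.flow_from_directory('data/images', target_size=(150, 150),
                                        batch_size=32, subset='training')
val_gen = datagen.flow_from_directory('data/images', target_size=(150, 150),
                                      batch_size=32, subset='validation')

# model is assumed to be an already-compiled Keras model.
model.fit_generator(train_gen,
                    steps_per_epoch=len(train_gen),
                    validation_data=val_gen,
                    validation_steps=len(val_gen),
                    epochs=10)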
Related
So this may be a silly question, but how exactly do the preprocessing layers in Keras work, especially when they are part of the model itself, compared to applying the preprocessing outside the model and then feeding the results in for training?
I'm trying to understand running data augmentation in Keras models. Let's say I have 1000 images for training. Outside the model, I can apply augmentation 10x and get 10,000 resulting images for training.
But I don't understand what happens when you use a preprocessing layer for augmentation. Does this layer (or these layers, if you use several) take each image and apply the transformations before training? Does this mean the total number of images used for training (and validation, I assume) is the number of epochs times the original number of images?
Is one option better than the other? Does that depend on the number of images one originally has before augmentation?
The benefit of preprocessing layers is that the model is truly end-to-end, i.e. raw data comes in and a prediction comes out. It makes your model portable since the preprocessing procedure is included in the SavedModel.
However, it will run everything on the GPU. Usually it makes sense to load the data using CPU worker(s) in the background while the GPU optimizes the model.
Alternatively, you could use a preprocessing layer outside of the model and inside a Dataset. The benefit of that is that you can easily create an inference-only model including the layers, which then gives you the portability at inference time but still the speedup during training.
For more information, see the Keras guide.
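A minimal sketch of that second pattern (train_ds and base_model are placeholders, and the layer names assume TF >= 2.6; older versions keep them under tf.keras.layers.experimental.preprocessing):
import tensorflow as tf

preprocess = tf.keras.layers.Rescaling(1. / 255)
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

# Training: both steps run inside the tf.data pipeline on CPU workers,
# overlapping with GPU training.
train_ds_prepped = (train_ds
                    .map(lambda x, y: (augment(preprocess(x), training=True), y),
                         num_parallel_calls=tf.data.AUTOTUNE)
                    .prefetch(tf.data.AUTOTUNE))
base_model.fit(train_ds_prepped, epochs=10)

# Inference/export: prepend the deterministic preprocessing so the SavedModel
# accepts raw images end to end (the random augmentation is left out).
inference_model = tf.keras.Sequential([preprocess, base_model])
inference_model.save("end_to_end_model")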
I created two instances of a data generator class, extended from the Keras Sequence class, one for the training data and one for the validation data. However, at the level of my own source code I can only see the validation generator re-iterating between epochs; I can't see the training generator. As a result, I can't verify that the augmentation of the training data is what I intended. In the snippets below, aug is a dict of parameters passed to a Keras ImageDataGenerator instance inside my myDataGen extension of Sequence. I wouldn't normally augment the validation data, but that's how I stumbled upon this conundrum:
aug = dict(fill_mode='nearest',
           rotation_range=10,
           zoom_range=0.3,
           width_shift_range=0.1,
           height_shift_range=0.1
           )
training_datagen = myDataGen(Xdata_train, ydata_train, **aug)
validation_datagen = myDataGen(Xdata_test, ydata_test, **aug)
history = model.fit(training_datagen,
                    validation_data=validation_datagen,
                    validation_batch_size=16,
                    epochs=50,
                    shuffle=False,
                    )
Everything works and I get great results, but I just wanted to be sure about the augmentation. From skimming through various functions in keras, I can gather that the data generator I wrote populates a lower-level tensorflow Dataset, which then iterates per epoch. I just can't see how that tensorflow Dataset is getting augmented per epoch.
Now, I've also accidentally discovered that although the fit method doesn't support generators for the validation data, it does work, and with an interesting feature that I would like to have for the training generator: it re-reads the data from disk, so it re-augments at the level of my own source code.
Bottom line, I can see hints that the tensorflow Dataset.cache() method is presumably storing my training dataset in memory after the first epoch. Can I somehow uncache() it to force a re-read and re-augmentation, or can someone point me to how a tensorflow Dataset calls augmentation methods when it iterates?
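For reference, in a plain tf.data pipeline (not the Sequence path above; file_paths, decode_fn and augment_fn are placeholders), it is the position of .cache() relative to the augmentation map that decides whether augmentation is re-run each epoch:
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(file_paths)

# Cache before the augmentation map: only decoded images are cached,
# so the random augmentation is re-applied every epoch.
ds_fresh = ds.map(decode_fn).cache().map(augment_fn).batch(16)

# Cache after the augmentation map: the augmented tensors themselves are
# cached, so every epoch sees the same augmented images.
ds_frozen = ds.map(decode_fn).map(augment_fn).cache().batch(16)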
Hmm. This thread, TF Dataset API for Image augmentation, makes it clear that writing augmentation methods directly against the tensorflow Dataset API is easy, but a contributor writes in the comments that you can't use keras.ImageDataGenerator on a tf.data.Dataset. Yet I can clearly see in the keras modules that my keras dataset is being 'adapted' into an underlying tf.data.Dataset. If that comment is true, it would explain why I don't seem to be able to break on the ImageDataGenerator augmentation of my training data. But how could it possibly be true?
My confusion came from the beginner mistake of overlooking the fact that, of course, one cannot set breakpoints at the level of the keras source code after it has been compiled onto the gpu. But the interesting thing that came out of this confusion is that you CAN write a keras generator for the validation data, AND break on it each epoch, since it is apparently NOT compiled onto the gpu... because keras doesn't support generators for the validation data! The generator is nevertheless handled nicely, with no runtime error. An obscure find, but I hope it helps somebody.
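If anyone else wants to double-check the training augmentation without relying on breakpoints inside fit(), one option (assuming myDataGen implements the standard Sequence __getitem__/on_epoch_end interface) is to pull batches from it directly:
import matplotlib.pyplot as plt

# Pull a batch straight from the Sequence, outside of model.fit(), so the
# augmentation runs in plain Python where it can be inspected.
x_batch, y_batch = training_datagen[0]
training_datagen.on_epoch_end()        # simulate the epoch boundary
x_batch_2, _ = training_datagen[0]     # should be freshly augmented

plt.subplot(1, 2, 1)
plt.imshow(x_batch[0])
plt.subplot(1, 2, 2)
plt.imshow(x_batch_2[0])
plt.show()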
I am looking at the recurrent neural net walkthrough here. In the tutorial they have a line that does:
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
However, if you're doing a sequential build, is that still necessary? Looking at the Sequential documentation, it seems a shuffle is performed automatically. If not, why is it done here? Is there a simple numerical example of the effect?
tf.keras.models.Sequential can also batch and shuffle the data, similar to what tf.data.Dataset does. These preprocessing features are provided in Sequential because it can take data in several forms, such as NumPy arrays, a tf.data.Dataset, a dict object, or a tf.keras.utils.Sequence.
The tf.data.Dataset API provides these features because the API is consistent with other TensorFlow APIs (in which Keras is not involved).
I don't think the shuffling and batching need to be done twice. You may remove them if you wish; it will not affect the model's training. I think the author wanted to use tf.data.Dataset to get the data into a Keras model, and dataset.shuffle( ... ).batch( ... ) is idiomatic when working with a Dataset.
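A small sketch of the two placements (toy data and model, just to illustrate; either way the training loop sees shuffled 64-sample batches):
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 10).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Option 1: let fit() batch and shuffle the NumPy arrays itself.
model.fit(x, y, batch_size=64, shuffle=True, epochs=1)

# Option 2: batch and shuffle in tf.data; fit() then just iterates the dataset.
ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1000).batch(64, drop_remainder=True)
model.fit(ds, epochs=1)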
I am training a CNN using Keras with TensorFlow backend, using imgaug for image augmentation.
I am also using Tensorboard to visualize training progress and results.
Since imgaug is applying (random) transformations to the input images, I would like to send (some of) the augmented images over to TensorBoard, so that I can visualize them and verify that everything is correct (e.g. to check whether I am applying overly large translations, or blurring the images too much).
For this I created a custom Keras callback and am trying to put my logic in the on_batch_end method. I can send images to TensorBoard fine, but I can't find where to access the augmented input images. Any tips on how to achieve this?
Thanks in advance
Better to do that outside training by simply getting images from your generator.
If it's a regular generator:
for i in range(numberOfBatches):
    x, y = next(generator)
    # plot, print, etc. with the batches
If it's a keras.utils.Sequence:
for i in range(len(generator)):
    x, y = generator[i]
    # plot, print, etc. with the batches
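If you do still want the batches in TensorBoard rather than plotted inline, here is a sketch using the TF 2.x summary API (the log directory is a placeholder; tf.summary.image expects a rank-4 batch of floats in [0, 1] or uint8):
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/augmented_images")
with writer.as_default():
    for i in range(numberOfBatches):
        x, y = next(generator)   # or x, y = generator[i] for a Sequence
        tf.summary.image("augmented_batch", x, step=i, max_outputs=4)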
Directory structure:
Data
-Cats
--<images>.jpg
-Dogs
--<images>.jpg
I'm training an (n-ary) classification model. I want to create an input_fn for serving these images for training.
Image dimensions are (200, 200, 3). I have a (Keras) generator for them, if it can be used somehow.
I've been looking for a while but haven't found an easy way to do this. I thought this would be a standard use case; e.g. Keras provides flow_from_directory to feed Keras models. I need to use a tf.estimator for AWS SageMaker, so I'm stuck with it.
By using the tf.data module you can feed your data directly into your estimator. You basically have three ways to integrate this into your API:
1. Convert your images into TFRecords and use TFRecordDataset.
2. Use tf.data.Dataset.from_generator to wrap your existing (Keras) generator.
3. Build the decoding functions directly into your input pipeline, as in the sketch below.
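For option 3, a sketch of an input_fn built directly on tf.data (the class names, batch size, and 'image' feature key are assumptions based on the directory structure above; names follow TF 2.x style):
import glob
import os
import tensorflow as tf

CLASS_NAMES = ['Cats', 'Dogs']   # taken from the directory names above

def _load(path, label):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [200, 200])
    return {'image': img}, label

def train_input_fn(data_dir='Data', batch_size=32):
    # Collect file paths and integer labels from the directory layout.
    paths, labels = [], []
    for idx, name in enumerate(CLASS_NAMES):
        for p in glob.glob(os.path.join(data_dir, name, '*.jpg')):
            paths.append(p)
            labels.append(idx)
    ds = tf.data.Dataset.from_tensor_slices((paths, labels))
    ds = ds.shuffle(len(paths)).map(_load).batch(batch_size).repeat()
    return ds

# estimator.train(input_fn=lambda: train_input_fn())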