I'm trying to use generators in my CNN training, but for some reason I'm getting inconsistent results.
When I run model.predict_generator(), each time I execute it (I'm working in a Jupyter Notebook) it gives different results! Same data (stored in a folder), same model (I just rerun the same cell).
This block works fine; every time I rerun it, it gives the same metrics:
test_generator = test_datagen.flow_from_directory(
    'keras_data/test',
    batch_size=1,
    class_mode='categorical')

loss, acc = model.evaluate(test_generator, verbose=1)
print(loss, acc)
However, when I run this cell, it gives different results every time
import numpy as np
from sklearn.metrics import confusion_matrix

ytest = test_generator.classes
yhat = np.argmax(model.predict_generator(test_generator), axis=1)
m = confusion_matrix(ytest, yhat)
print(m)
It doesn't make any sense! Any ideas on what's happening?
EDIT: here is how I create the generators, just in case the problem is here
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_generator = train_datagen.flow_from_directory(
    'keras_data/train',
    batch_size=1,
    class_mode='categorical')

validation_generator = test_datagen.flow_from_directory(
    'keras_data/val',
    batch_size=1,
    class_mode='categorical')
The culprit is the shuffle argument of flow_from_directory. From the Keras docs:
shuffle: whether to shuffle the data (default: True). If set to False, sorts the data in alphanumeric order.
Because shuffle defaults to True, the generator yields the images in a different random order on every pass, so the predictions no longer line up with test_generator.classes. model.evaluate is unaffected because each batch still carries its own labels.
From comments
Setting shuffle=False on the test generator resolved the issue (paraphrased from Frightera).
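A minimal sketch of the fix (my wording and code, not from the original answer), reusing test_datagen and model from the question:

test_generator = test_datagen.flow_from_directory(
    'keras_data/test',
    batch_size=1,
    class_mode='categorical',
    shuffle=False)  # fixed file order, so predictions line up with .classes

ytest = test_generator.classes
yhat = np.argmax(model.predict_generator(test_generator), axis=1)
print(confusion_matrix(ytest, yhat))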
Related
I'm using this code to train my CNN:
history = model.fit(
x = train_gen1,
epochs = epochs,
validation_data = valid_gen,
callbacks = noaug_callbacks,
class_weight = class_weight
).history
where train_gen1 is made using ImageDataGenerator. But what if I want to use two different generators (let's call them train_gen1 and train_gen2), both feeding my training phase? How can I change my code to do so?
This is the way I generate:
aug_train_data_gen = ImageDataGenerator(rotation_range=0,
                                        height_shift_range=40,
                                        width_shift_range=40,
                                        zoom_range=0,
                                        horizontal_flip=True,
                                        vertical_flip=True,
                                        fill_mode='reflect',
                                        preprocessing_function=preprocess_input
                                        )

train_gen1 = aug_train_data_gen.flow_from_directory(directory=training_dir,
                                                     target_size=(96, 96),
                                                     color_mode='rgb',
                                                     classes=None,  # can be set to labels
                                                     class_mode='categorical',
                                                     batch_size=512,
                                                     shuffle=False  # set to False if you need to compare images
                                                     )
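One possible approach (my sketch, not from the original post): wrap the two iterators in a plain Python generator that alternates their batches and pass that to model.fit with an explicit steps_per_epoch. Here train_gen2 is assumed to be built the same way as train_gen1, just from a second ImageDataGenerator.

def interleave_generators(gen_a, gen_b):
    # yield one batch from each generator in turn, forever
    while True:
        yield next(gen_a)
        yield next(gen_b)

combined_gen = interleave_generators(train_gen1, train_gen2)
steps = len(train_gen1) + len(train_gen2)  # one pass over both iterators per epoch

history = model.fit(
    x=combined_gen,
    steps_per_epoch=steps,
    epochs=epochs,
    validation_data=valid_gen,
    callbacks=noaug_callbacks,
    class_weight=class_weight
).history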
The validation_split parameter allows ImageDataGenerator to split the dataset read from a folder into two disjoint sets. Is there any way to use it to create three sets: training, validation, and evaluation?
I am thinking about splitting the dataset into two sets, then splitting the second set into another two.
datagen = ImageDataGenerator(validation_split=0.5, rescale=1./255)
train_generator = datagen.flow_from_directory(
TRAIN_DIR,
subset='training'
)
val_generator = datagen.flow_from_directory(
TRAIN_DIR,
subset='validation'
)
Here I am thinking about splitting the validation dataset into two sets using val_generator, one for validation and the other for evaluation. How should I do it?
I like working with the flow_from_dataframe() method of ImageDataGenerator, where I interact with a simple Pandas DataFrame (perhaps containing other features) rather than with the directory. But you can easily adapt my code if you insist on flow_from_directory().
So this is my go-to function, e.g. for a regression task, where we try to predict a continuous y:
def get_generators(train_samp, test_samp, validation_split = 0.1):
    train_datagen = ImageDataGenerator(validation_split=validation_split, rescale = 1. / 255)
    test_datagen = ImageDataGenerator(rescale = 1. / 255)

    train_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = True,
        subset = 'training',
        validate_filenames = False
    )

    valid_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        subset = 'validation',
        validate_filenames = False
    )

    test_generator = test_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(test_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        validate_filenames = False
    )

    return train_generator, valid_generator, test_generator
Things to notice:
I use two ImageDataGenerator objects: one (with a validation split) for the training and validation generators, and one for the test generator.
The inputs to the function are the train/test indices (such as those returned by Sklearn's train_test_split), which are used to filter the DataFrame index; see the usage sketch after these notes.
The function also takes a validation_split parameter for the training generator.
images_df is a DataFrame somewhere in global memory with the proper columns, such as img_file and y.
There is no need to shuffle the validation and test generators.
This can be further generalized for multiple outputs, classification, or whatever you need.
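A minimal usage sketch (my addition, assuming images_df, images_dir, IMG_HEIGHT, IMG_WIDTH and batch_size are already defined as described above):

from sklearn.model_selection import train_test_split

# split the DataFrame index into train/test id sets, then build the three generators
train_samp, test_samp = train_test_split(images_df.index, test_size=0.2, random_state=42)
train_generator, valid_generator, test_generator = get_generators(train_samp, test_samp, validation_split=0.1)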
I have mostly been splitting data 80/10/10 for training, validation, and test, respectively.
When working with Keras I favor the tf.data API, as it provides a good abstraction for complex input pipelines.
It does not provide a simple tf.data.Dataset.split functionality, though.
I have this function (which I found in someone else's code; my source is missing) that I use consistently:
def get_dataset_partitions_tf(ds: tf.data.Dataset, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=10000):
    assert (train_split + test_split + val_split) == 1

    if shuffle:
        # Specify seed to always have the same split distribution between runs
        ds = ds.shuffle(shuffle_size, seed=12)

    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)

    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size).skip(val_size)

    return train_ds, val_ds, test_ds
First read your dataset and get its size (with the cardinality method), then pass both into the function and you're good to go; a usage sketch follows after these notes.
This function can be given a flag to shuffle the original dataset before creating the splits; this is useful for more realistic validation and test metrics.
The seed for shuffling is fixed so that running the function again produces the same splits, which we want for consistent results.
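A minimal usage sketch (my addition; the directory path and image size are placeholders):

import tensorflow as tf

# build a dataset, measure its size, then split it with the function above
ds = tf.keras.utils.image_dataset_from_directory('data/images', image_size=(256, 256), batch_size=32)
ds_size = tf.data.experimental.cardinality(ds).numpy()  # counted in batches, since ds is batched
train_ds, val_ds, test_ds = get_dataset_partitions_tf(ds, ds_size)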
I have this code:
epochs = 50
batch_size = 5
validation_split = 0.2

datagen = tf.keras.preprocessing.image.ImageDataGenerator(validation_split=validation_split)

train_generator = datagen.flow(
    X_train_noisy, y_train_denoisy, batch_size=batch_size,
    subset='training'
)

val_generator = datagen.flow(
    X_train_noisy, y_train_denoisy, batch_size=batch_size,
    subset='validation'
)

history = model.fit(train_generator,
                    steps_per_epoch=(len(X_train_noisy) * (1 - validation_split)) // batch_size,
                    epochs=epochs,
                    validation_data=val_generator,
                    validation_steps=(len(X_train_noisy) * validation_split) // batch_size)
X_train_noisy and y_train_denoisy are ndarrays (of shape [20, 512, 512, 1], for example). But I get this error:
training and validation subsets have different number of classes after the split
How can I solve that?
thanks!
Probably what happened is that when the data was split into training and validation, the set of files selected for validation did not include any files from one or more of the classes. This can happen when your dataset is small. Try increasing validation_split to a larger value, say 0.5, and see if the problem goes away; it should. Then reduce the size of the validation split until the error reoccurs. That will determine the minimum split value you can use. Remember the split is randomized, so set the split value somewhat above that minimum.
Another (better) alternative is to split the data using sklearn's train_test_split. This function has a stratify parameter that splits the data while ensuring that all classes are included in both components. See the code below.
from sklearn.model_selection import train_test_split
X_train_noisy, X_valid_noisy, y_train_denoisy, y_valid_denoisy = train_test_split(
    X_train_noisy, y_train_denoisy, test_size=validation_split,
    shuffle=True, random_state=123, stratify=y_train_denoisy)
Now use these split variables in model.fit; a sketch follows below.
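A minimal sketch of that last step (my assumption, not part of the original answer): with the data already split by train_test_split, the generator no longer needs validation_split, and each split gets its own flow():

datagen = tf.keras.preprocessing.image.ImageDataGenerator()

train_generator = datagen.flow(X_train_noisy, y_train_denoisy, batch_size=batch_size)
val_generator = datagen.flow(X_valid_noisy, y_valid_denoisy, batch_size=batch_size)

history = model.fit(train_generator,
                    epochs=epochs,
                    validation_data=val_generator)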
I want to use the Keras ImageDataGenerator for data augmentation.
To do so, I have to call the .fit() method on the instantiated ImageDataGenerator object, passing my training data as a parameter, as shown below.
image_datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)
image_datagen.fit(X_train, augment=True)
train_generator = image_datagen.flow_from_directory('data/images')
model.fit_generator(train_generator, steps_per_epoch=2000, epochs=50)
However, my training data set is too large to fit into memory when loaded up at once.
Consequently, I would like to fit the generator in several steps using subsets of my training data.
Is there a way to do this?
One potential solution that came to my mind is to load batches of my training data using a custom generator function and fit the image generator multiple times in a loop. However, I am not sure whether the fit function of ImageDataGenerator can be used this way, as it might reset its statistics on each call.
As an example of how it might work:
def custom_train_generator():
    # Code loading training data subsets X_batch
    yield X_batch

image_datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)

gen = custom_train_generator()
for batch in gen:
    image_datagen.fit(batch, augment=True)

train_generator = image_datagen.flow_from_directory('data/images')
model.fit_generator(train_generator, steps_per_epoch=2000, epochs=50)
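One common workaround (my sketch, not from the question or the answers below): estimate the featurewise statistics once on a representative sample that does fit in memory, then stream the full dataset from disk as usual.

# draw one large random batch from disk to estimate the featurewise statistics
# (the sample size of 1000 is an arbitrary choice)
sample_gen = ImageDataGenerator().flow_from_directory('data/images', batch_size=1000, shuffle=True)
X_sample, _ = next(sample_gen)

image_datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)
image_datagen.fit(X_sample, augment=True)

train_generator = image_datagen.flow_from_directory('data/images')
model.fit_generator(train_generator, steps_per_epoch=2000, epochs=50)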
NEWER TF VERSIONS (>=2.5):
ImageDataGenerator() has been deprecated in favour of:
tf.keras.utils.image_dataset_from_directory
An example usage from the documentation:
tf.keras.utils.image_dataset_from_directory(
directory,
labels='inferred',
label_mode='int',
class_names=None,
color_mode='rgb',
batch_size=32,
image_size=(256, 256),
shuffle=True,
seed=None,
validation_split=None,
subset=None,
interpolation='bilinear',
follow_links=False,
crop_to_aspect_ratio=False,
**kwargs
)
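A minimal usage sketch (my addition; the path, seed, sizes and epoch count are placeholders):

train_ds = tf.keras.utils.image_dataset_from_directory(
    'data/images', validation_split=0.2, subset='training', seed=123,
    image_size=(256, 256), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    'data/images', validation_split=0.2, subset='validation', seed=123,
    image_size=(256, 256), batch_size=32)

model.fit(train_ds, validation_data=val_ds, epochs=50)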
OLDER TF VERSIONS (<2.5)
ImageDataGenerator() already gives you the possibility of loading the data in batches: you pass batch_size to flow() (or flow_from_directory()) and feed the resulting iterator to fit_generator(); there is no need (except as a good exercise, if you want) to write a generator from scratch.
IMPORTANT NOTE:
Starting from TensorFlow 2.1, .fit_generator() has been deprecated and you should use .fit()
Example taken from Keras official documentation:
datagen = ImageDataGenerator(
featurewise_center=True,
featurewise_std_normalization=True,
rotation_range=20,
width_shift_range=0.2,
height_shift_range=0.2,
horizontal_flip=True)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)
# TF <= 2.0
# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
steps_per_epoch=len(x_train) // 32, epochs=epochs)
#TF >= 2.1
model.fit(datagen.flow(x_train, y_train, batch_size=32),
steps_per_epoch=len(x_train) // 32, epochs=epochs)
I would suggest reading this excellent article about ImageDataGenenerator and Augmentation: https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
The solution to your problem lies in this line of code (either simple flow or flow_from_directory):
# prepare iterator
it = datagen.flow(samples, batch_size=1)
For creating your own DataGenerator, one should have a look at this link (for a starting point): https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
IMPORTANT NOTE (2):
If you use Keras from TensorFlow (Keras inside TensorFlow), then for both the code presented and the tutorials you consult, ensure that you replace the import/neural network creation snippets:
from keras.x.y.z import A
WITH
from tensorflow.keras.x.y.z import A
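For example (my illustration of the pattern above):

# before
from keras.preprocessing.image import ImageDataGenerator
# after
from tensorflow.keras.preprocessing.image import ImageDataGenerator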
I have tried searching for possible similar questions, but I have not been able to find any so far. My problem is:
I want to do cross-validation using non-overlapping subsets of data with KFold. What I did was create the subsets using KFold and fix the outcome by setting random_state to a certain integer. When I print out the subsets multiple times, the results look OK. However, the problem is that when I use the same subsets on a model multiple times with model.predict (meaning running my code multiple times), I get different results. Naturally, I suspect there is something wrong with my implementation of the training model, but I cannot figure out what it is. I would very much appreciate a hint. Here is my code:
random.seed(42)

# define K-fold cross validation test harness
kf = KFold(n_splits=3, random_state=42, shuffle=True)

for train_index, test_index in kf.split(data):
    print('Train', train_index, '\nTest ', test_index)

    # create model
    testX = data[test_index]
    trainX = data[train_index]
    testYcheck = labels[test_index]
    testP = Path[test_index]

    # convert the labels from integers to vectors
    trainY = to_categorical(labels[train_index], num_classes=2)
    testY = to_categorical(labels[test_index], num_classes=2)

    # construct the image generator for data augmentation
    aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                             height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
                             horizontal_flip=True, fill_mode="nearest")

    # train the network
    print("[INFO] training network...")
    model.fit_generator(aug.flow(trainX, trainY, batch_size=BS),
                        validation_data=(testX, testY),
                        steps_per_epoch=len(trainX) // BS, epochs=EPOCHS, verbose=1)

    # predict the test data
    y_pred = model.predict(testX)
    predYl = []
    for element in range(len(y_pred)):
        if y_pred[element, 1] > y_pred[element, 0]:
            predYl.append(1)
        else:
            predYl.append(0)
    pred_Y = np.array(predYl)

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(testYcheck, pred_Y)
    np.set_printoptions(precision=2)
    print(cnf_matrix)
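One thing worth checking (my note, not part of the original post): Keras training is not deterministic unless all the relevant seeds are fixed, and the model keeps accumulating training across folds unless it is rebuilt and recompiled inside the loop. A minimal sketch of both ideas, assuming a hypothetical build_model() helper that returns a freshly compiled model:

import random
import numpy as np
import tensorflow as tf

def set_seeds(seed=42):
    # fix the Python, NumPy and TensorFlow seeds for (more) reproducible runs
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

for train_index, test_index in kf.split(data):
    set_seeds(42)
    model = build_model()  # hypothetical helper: returns a new, compiled model for each fold
    # ... train and evaluate as in the loop above ...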