I'm using this code to train my CNN:
history = model.fit(
    x=train_gen1,
    epochs=epochs,
    validation_data=valid_gen,
    callbacks=noaug_callbacks,
    class_weight=class_weight
).history
where train_gen1 is made using ImageDataGenerator. But what if I want to use 2 different types of generator (let's call them train_gen1 and train_gen2) both feeding my training phase? How can I change my code in order to do so?
This is the way I generate:
aug_train_data_gen = ImageDataGenerator(
    rotation_range=0,
    height_shift_range=40,
    width_shift_range=40,
    zoom_range=0,
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode='reflect',
    preprocessing_function=preprocess_input
)
train_gen1 = aug_train_data_gen.flow_from_directory(
    directory=training_dir,
    target_size=(96, 96),
    color_mode='rgb',
    classes=None,  # can be set to labels
    class_mode='categorical',
    batch_size=512,
    shuffle=False  # set to False if you need to compare images
)
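One possible approach, sketched here under assumptions rather than taken from the original post, is to wrap both generators in a plain Python generator that hands out their batches in turn. The helper name alternate_batches and the stand-in generator fake_gen are hypothetical, and the sketch assumes both inputs never run out (Keras directory iterators loop indefinitely).

```python
from itertools import cycle


def alternate_batches(*generators):
    """Yield one batch from each generator in turn, forever.

    Assumes every generator is endless, as Keras directory iterators are.
    """
    for gen in cycle(generators):
        yield next(gen)


def fake_gen(tag):
    """Hypothetical stand-in for train_gen1 / train_gen2: endless (tag, step) batches."""
    step = 0
    while True:
        yield (tag, step)
        step += 1


mixed = alternate_batches(fake_gen("gen1"), fake_gen("gen2"))
first_four = [next(mixed) for _ in range(4)]
print(first_four)
```

With real generators this would be passed as x=alternate_batches(train_gen1, train_gen2). Keras cannot infer the length of a plain Python generator, so steps_per_epoch would also have to be set, for example to len(train_gen1) + len(train_gen2).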
The validation_split parameter lets ImageDataGenerator split the data read from a folder into 2 disjoint sets. Is there any way to use it to create 3 sets: training, validation, and evaluation?
I am thinking about splitting the dataset into 2 datasets, then splitting the 2nd one into another 2 datasets:
datagen = ImageDataGenerator(validation_split=0.5, rescale=1./255)
train_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='training'
)
val_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='validation'
)
Here I am thinking about splitting the validation data from val_generator into 2 sets, one for validation and the other for evaluation. How should I do it?
I like working with the flow_from_dataframe() method of ImageDataGenerator, where I interact with a simple Pandas DataFrame (perhaps containing other features) rather than with the directory. But you can easily change my code if you insist on flow_from_directory().
This is my go-to function, e.g. for a regression task where we try to predict a continuous y:
def get_generators(train_samp, test_samp, validation_split=0.1):
    train_datagen = ImageDataGenerator(validation_split=validation_split, rescale=1. / 255)
    test_datagen = ImageDataGenerator(rescale=1. / 255)
    train_generator = train_datagen.flow_from_dataframe(
        dataframe=images_df[images_df.index.isin(train_samp)],
        directory=images_dir,
        x_col='img_file',
        y_col='y',
        target_size=(IMG_HEIGHT, IMG_WIDTH),
        class_mode='raw',
        batch_size=batch_size,
        shuffle=True,
        subset='training',
        validate_filenames=False
    )
    valid_generator = train_datagen.flow_from_dataframe(
        dataframe=images_df[images_df.index.isin(train_samp)],
        directory=images_dir,
        x_col='img_file',
        y_col='y',
        target_size=(IMG_HEIGHT, IMG_WIDTH),
        class_mode='raw',
        batch_size=batch_size,
        shuffle=False,
        subset='validation',
        validate_filenames=False
    )
    test_generator = test_datagen.flow_from_dataframe(
        dataframe=images_df[images_df.index.isin(test_samp)],
        directory=images_dir,
        x_col='img_file',
        y_col='y',
        target_size=(IMG_HEIGHT, IMG_WIDTH),
        class_mode='raw',
        batch_size=batch_size,
        shuffle=False,
        validate_filenames=False
    )
    return train_generator, valid_generator, test_generator
Things to notice:
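I use two ImageDataGenerator objects: one with validation_split for the training and validation generators, and a plain one for the test generator.
The inputs to the function are the train/test indices (such as those returned by Sklearn's train_test_split), which are used to filter the DataFrame index.
The function also takes a validation_split parameter for the training generator.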
images_df is a DataFrame somewhere in global memory with proper columns like img_file and y.
No need to shuffle validation and test generators
This can be further generalized for multiple outputs, classification, what have you.
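As a rough illustration of where train_samp and test_samp might come from, here is a minimal sketch that partitions index labels with NumPy only. The helper split_indices and its parameters are hypothetical and are not part of the answer above; Sklearn's train_test_split would serve the same purpose.

```python
import numpy as np


def split_indices(index, test_fraction=0.1, seed=0):
    """Randomly partition index labels into disjoint train and test arrays."""
    index = np.asarray(index)
    # Seeded generator so the same split is produced on every run
    shuffled = np.random.default_rng(seed).permutation(index)
    n_test = int(round(test_fraction * len(index)))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for images_df.index: 100 integer labels
train_samp, test_samp = split_indices(range(100), test_fraction=0.1)
print(len(train_samp), len(test_samp))
```

The two arrays could then be passed as get_generators(train_samp, test_samp), assuming images_df is indexed by the same labels.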
I have mostly been splitting data 80/10/10 for training, validation and test respectively.
When working with Keras I favor the tf.data API, as it provides a good abstraction for complex input pipelines.
It does not provide a simple tf.data.Dataset.split function, though.
I have this function (found in someone's code; I have lost the source), which I use consistently:
import math
import tensorflow as tf

def get_dataset_partitions_tf(ds: tf.data.Dataset, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=10000):
    # Compare with a tolerance: float fractions do not always sum to exactly 1
    assert math.isclose(train_split + val_split + test_split, 1.0)
    if shuffle:
        # Fix the seed so the split is the same between runs, and disable
        # per-iteration reshuffling so the three subsets stay disjoint
        ds = ds.shuffle(shuffle_size, seed=12, reshuffle_each_iteration=False)
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size).skip(val_size)
    return train_ds, val_ds, test_ds
First read your dataset and get its size (for example with its cardinality() method), then pass both into the function and you're good to go!
The function takes a flag to shuffle the original dataset before creating the splits, which is useful for getting more realistic validation and test metrics.
The seed for shuffling is fixed so that we can run the same function and the splits remain the same, which we want for consistent results.
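To make the size arithmetic concrete, here is a small sketch that mirrors the take/skip logic with plain list slicing instead of tf.data. The helper partition_sizes and the example size 1003 are assumptions chosen for illustration.

```python
def partition_sizes(ds_size, train_split=0.8, val_split=0.1):
    """Mirror the size arithmetic of the take/skip split."""
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    # Whatever is left after take(train) and take(val) ends up in the test split
    test_size = ds_size - train_size - val_size
    return train_size, val_size, test_size


items = list(range(1003))  # stand-in for a dataset with 1003 elements
train_size, val_size, test_size = partition_sizes(len(items))

train = items[:train_size]                      # like ds.take(train_size)
val = items[train_size:train_size + val_size]   # like ds.skip(train_size).take(val_size)
test = items[train_size + val_size:]            # like ds.skip(train_size).skip(val_size)
print(len(train), len(val), len(test))
```

Because int() truncates, the train and validation sizes are rounded down (802 and 100 here) and the test split absorbs the remainder (101), so no element is dropped.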
I'm using generators to train and evaluate my CNN.
However, each time I run model.predict_generator() (I'm working in a Jupyter Notebook), it gives different results! Same data (stored in a folder), same model (I just rerun the same cell).
This block works fine; every time I rerun it, it gives the same metrics:
test_generator = test_datagen.flow_from_directory(
    'keras_data/test',
    batch_size=1,
    class_mode='categorical')
loss, acc = model.evaluate(test_generator, verbose=1)
print(loss, acc)
However, when I run this cell, it gives different results every time
ytest = test_generator.classes
yhat = np.argmax(model.predict_generator(test_generator),axis=1)
from sklearn.metrics import confusion_matrix
m = confusion_matrix(ytest,yhat)
print(m)
It doesn't make any sense! Any ideas on what's happening?
EDIT: here is how I create the generators, just in case the problem is here
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    horizontal_flip=True)
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_generator = train_datagen.flow_from_directory(
    'keras_data/train',
    batch_size=1,
    class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
    'keras_data/val',
    batch_size=1,
    class_mode='categorical')
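From the flow_from_directory documentation: shuffle: Whether to shuffle the data (default: True). If set to False, sorts the data in alphanumeric order.
With shuffling on, the predictions come back in a different random order on each run, while test_generator.classes is always in directory order, so the labels and predictions no longer line up. model.evaluate is unaffected because every batch carries its own labels.
From the comments: setting shuffle=False for the test generator resolved the issue (paraphrased from Frightera).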
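A toy NumPy illustration of the effect, not using Keras at all: the array classes stands in for test_generator.classes, and perfect_model is a hypothetical model that always predicts the true label of whatever sample it is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for test_generator.classes: labels in directory order
classes = np.array([0] * 5 + [1] * 5)


def perfect_model(labels_in_feed_order):
    """Hypothetical model that predicts every sample it sees correctly."""
    return labels_in_feed_order.copy()


# shuffle=True: the generator feeds the samples in a random order
order = rng.permutation(len(classes))
yhat_shuffled = perfect_model(classes[order])

# shuffle=False: the generator feeds the samples in directory order
yhat_ordered = perfect_model(classes)

# Comparing against directory-order labels is only valid in the second case
acc_shuffled = (yhat_shuffled == classes).mean()
acc_ordered = (yhat_ordered == classes).mean()
print(acc_shuffled, acc_ordered)

# Undoing the permutation shows the shuffled predictions were correct all along
restored = yhat_shuffled[np.argsort(order)]
```

Even though the model is perfect, the shuffled predictions only match the directory-order labels by chance, which is why the confusion matrix changes from run to run.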
I want to append every other predicted image and the corresponding real one in my CNN code, but I am not sure how to implement it.
The code is below:
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False,
                         num_workers=num_workers, pin_memory=True)
test_pred = []
test_real = []
model.eval()
with torch.no_grad():
    for data in test_loader:
        x_test_batch = data[0].to(device, dtype=torch.float)
        y_test_batch = data[1].to(device, dtype=torch.float)
        y_test_pred = model(x_test_batch)
        mse_val_loss = criterion(y_test_batch, y_test_pred, x_test_batch, mse)
        mae_val_loss = criterion(y_test_batch, y_test_pred, x_test_batch, l1loss)
        mse_val_losses.append(mse_val_loss.item())
        mae_val_losses.append(mae_val_loss.item())
        N_test.append(len(x_test_batch))
        test_pred.append(y_test_pred[::2])
        test_real.append(y_test_batch[::2])
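One thing that may be worth checking: slicing with [::2] inside the loop keeps every other sample within each batch, so with an odd batch size the pattern restarts at every batch boundary. The sketch below uses NumPy arrays as stand-ins for the tensors (an assumption made so that it runs without PyTorch); with real tensors, torch.cat would play the role of np.concatenate.

```python
import numpy as np

# Seven samples delivered in batches of size 3, 3 and 1
batches = np.array_split(np.arange(7), [3, 6])

# Slicing inside the loop: parity restarts at each batch boundary
per_batch = np.concatenate([batch[::2] for batch in batches])

# Collecting everything first, then slicing once over the whole test set
whole_set = np.concatenate(batches)[::2]

print(per_batch.tolist())
print(whole_set.tolist())
```

If the goal is every other sample of the whole test set, appending the full batches and slicing once after the loop avoids the boundary effect. With an even batch size the two approaches give the same result.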
I am actually working on a mini-project based on the cifar10 dataset. I have loaded the data with tfds.load(...) and am practicing image augmentation techniques.
Since my dataset is a tf.data.Dataset object, real-time data augmentation seems hard to achieve, so I want to pass all the features into tf.keras.preprocessing.image.ImageDataGenerator.flow(...) to get real-time augmentation.
But the flow(...) method accepts NumPy arrays, which are in no way related to a tf.data.Dataset object.
Can somebody guide me in this regard (or suggest an alternative) and tell me how to proceed?
Are tf.image transformations real-time? If not, what is the best approach other than ImageDataGenerator.flow(...)?
My code:
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.image import ImageDataGenerator
splitting = tfds.Split.ALL.subsplit(weighted=(70, 20, 10))
dataset_cifar10, dataset_info = tfds.load(name='cifar10',
                                          split=splitting,
                                          as_supervised=True,
                                          with_info=True)
train_dataset, valid_dataset, test_dataset = dataset_cifar10
BATCH_SIZE = 32
train_dataset = train_dataset.batch(batch_size=BATCH_SIZE)
train_dataset = train_dataset.prefetch(buffer_size=1)
image_generator = ImageDataGenerator(rotation_range=45,
                                     width_shift_range=0.15,
                                     height_shift_range=0.15,
                                     zoom_range=0.2,
                                     horizontal_flip=True,
                                     vertical_flip=True,
                                     rescale=1./255)
train_dataset_generator = image_generator.flow(...)
...
Right after splitting the train and test datasets, you can iterate over the dataset and append the samples to lists, which you can then use with ImageDataGenerator. A complete use case is below:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

cifar10_data, cifar10_info = tfds.load("cifar10", with_info=True, as_supervised=True)
train_data, test_data = cifar10_data['train'], cifar10_data['test']
NUM_CLASSES = 10
train_x = []
train_y = []
for sample in train_data:
    train_x.append(sample[0].numpy())
    train_y.append(tf.keras.utils.to_categorical(sample[1].numpy(), num_classes=NUM_CLASSES))
train_x = np.asarray(train_x)
train_y = np.asarray(train_y)
# DataGenerator
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    horizontal_flip=True)
# Fitting train_x data
datagen.fit(train_x)
# Testing
EPOCHS = 1
BATCH_SIZE = 16
for e in range(EPOCHS):
    for step, (batch_x, batch_y) in enumerate(datagen.flow(train_x, train_y, batch_size=BATCH_SIZE)):
        print(batch_x, batch_y)
        # flow() loops forever, so break manually after one pass over the data
        if step + 1 >= len(train_x) // BATCH_SIZE:
            break
import tensorflow as tf
import tensorflow_datasets as tfds
tfds.disable_progress_bar()
from tensorflow.keras.preprocessing.image import ImageDataGenerator
splits = ['train[:70%]', 'train[70%:90%]', 'train[90%:]']
BATCH_SIZE = 64
dataset_cifar10, dataset_info = tfds.load(name='cifar10',
                                          split=splits,
                                          as_supervised=True,
                                          with_info=True,
                                          batch_size=BATCH_SIZE)
train_dataset, valid_dataset, test_dataset = dataset_cifar10
image_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.15,
    height_shift_range=0.15,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True,
    rescale=1./255)
# custom function to wrap the image data generator around the raw dataset
def tfds_imgen(ds, imgen, batch_size, num_batches):
    # ds is an iterator over already-batched (images, labels) NumPy arrays,
    # so it is consumed directly rather than batched again
    for images, labels in ds:
        flow = imgen.flow(images, labels, batch_size=batch_size)
        for _ in range(num_batches):
            yield next(flow)
# call the custom function to get the augmented data generator
train_dataset_generator = tfds_imgen(
    train_dataset.as_numpy_iterator(),
    image_generator,
    batch_size=32,
    num_batches=BATCH_SIZE // 32
)
I have tried searching for similar questions, but I have not been able to find any so far. My problem is:
I want to do cross-validation on non-overlapping subsets of data using KFold. I created the subsets with KFold and fixed the outcome by setting random_state to a certain integer. When I print out the subsets multiple times, the results look OK. However, when I use the same subsets on a model multiple times with model.predict (meaning I run my code multiple times), I get different results. Naturally, I suspect there is something wrong with how I train the model, but I cannot figure out what it is. I would very much appreciate a hint. Here is my code:
random.seed(42)
# define K-fold cross validation test harness
kf = KFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in kf.split(data):
    print('Train', train_index, '\nTest ', test_index)
    # create model
    testX = data[test_index]
    trainX = data[train_index]
    testYcheck = labels[test_index]
    testP = Path[test_index]
    # convert the labels from integers to vectors
    trainY = to_categorical(labels[train_index], num_classes=2)
    testY = to_categorical(labels[test_index], num_classes=2)
    # construct the image generator for data augmentation
    aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                             height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
                             horizontal_flip=True, fill_mode="nearest")
    # train the network
    print("[INFO] training network...")
    model.fit_generator(aug.flow(trainX, trainY, batch_size=BS),
                        validation_data=(testX, testY),
                        steps_per_epoch=len(trainX) // BS, epochs=EPOCHS, verbose=1)
    # predict the test data
    y_pred = model.predict(testX)
    predYl = []
    for element in range(len(y_pred)):
        if y_pred[element, 1] > y_pred[element, 0]:
            predYl.append(1)
        else:
            predYl.append(0)
    pred_Y = np.array(predYl)
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(testYcheck, pred_Y)
    np.set_printoptions(precision=2)
    print(cnf_matrix)
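One possible contributor, offered as a guess rather than a confirmed diagnosis: random.seed(42) only seeds Python's built-in random module, whereas the random augmentation parameters are, as far as I know, drawn from NumPy's global generator. The sketch below shows that difference in isolation; sample_augmentation_params is a hypothetical stand-in for the augmentation step and does not call Keras.

```python
import random

import numpy as np


def sample_augmentation_params(seed_numpy):
    """Draw three 'rotation angles' the way a NumPy-based augmenter might."""
    random.seed(42)              # seeds Python's random module only
    if seed_numpy:
        np.random.seed(42)       # seeds NumPy's global generator as well
    return np.random.uniform(-30, 30, size=3)


# Only Python's RNG seeded: consecutive runs draw different values
unseeded_a = sample_augmentation_params(seed_numpy=False)
unseeded_b = sample_augmentation_params(seed_numpy=False)

# NumPy seeded too: consecutive runs draw identical values
seeded_a = sample_augmentation_params(seed_numpy=True)
seeded_b = sample_augmentation_params(seed_numpy=True)

print(np.array_equal(unseeded_a, unseeded_b), np.array_equal(seeded_a, seeded_b))
```

Weight initialisation uses TensorFlow's own generator, which tf.random.set_seed controls. Note also that the loop in the question keeps training the same model object across folds, so each fold starts from the weights left by the previous one.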