Tensorflow does not apply data augmentation properly

Tensorflow does not apply data augmentation properly - python

I'm trying to apply the process of data augmentation to a database. I use the following code:
train_generator = keras.utils.image_dataset_from_directory(
directory= train_dir,
subset = "training",
image_size = (50,50),
batch_size = 32,
validation_split = 0.3,
seed = 1337,
labels = "inferred",
label_mode = 'binary'
)
validation_generator = keras.utils.image_dataset_from_directory(
subset="validation",
directory=validation_dir,
image_size=(50,50),
batch_size =40,
seed=1337,
validation_split = 0.3,
labels = "inferred",
label_mode ='binary'
)
data_augmentation = keras.Sequential([
keras.layers.RandomFlip("horizontal"),
keras.layers.RandomRotation(0.1),
keras.layers.RandomZoom(0.1),
])
train_dataset = train_generator.map(lambda x, y: (data_augmentation(x, training=True), y))
But when I try to run the training processe using this method, I get a "insuficient data" warning:
6/100 [>.............................] - ETA: 21s - loss: 0.7602 - accuracy: 0.5200WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 10 batches). You may need to use the repeat() function when building your dataset.
Yes, the original dataset is insuficient, but the data augmentation should provide more than enough data for the training.
Does anyone know what's going on ?
EDIT:
fit call:
history = model.fit(
train_dataset,
epochs = 20,
steps_per_epoch = 100,
validation_data = validation_generator,
validation_steps = 10,
callbacks=callbacks_list)
This is the version I have using DataImageGenerator:
train_datagen = keras.preprocessing.image.ImageDataGenerator(rescale =1/255,rotation_range = 40,width_shift_range = 0.2,height_shift_range = 0.2,shear_range = 0.2,zoom_range = 0.2,horizontal_flip = True)
train_generator = train_datagen.flow_from_directory(directory= train_dir,target_size = (50,50),batch_size = 32,class_mode = 'binary')
val_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1/255)
validation_generator = val_datagen.flow_from_directory(directory=validation_dir,target_size=(50,50),batch_size =40,class_mode ='binary')
This specific code (with this same number of epochs, steps_per_epoch and batchsize) was taken from the book deeplearning with python, by François Chollet, it's an example on page 141 of a data augmentation system. As you may have guessed, this produces the same results as the other method displayed.

When we state that data augmentation increases the number of instances, we usually understand that an altered version of a sample would be created for the model to process. It's just image preprocessing with randomness.
If you closely inspect your training log, you will get your solution, shown below. The main issue with your approach is simply discussed in this post.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.
So, to solve this, we can use .repeat() function. To understand what it does, you can check this answer. Here is the sample code that should work for you.
train_ds= keras.utils.image_dataset_from_directory(
...
)
train_ds = train_ds.map(
lambda x, y: (data_augmentation(x, training=True), y)
)
val_ds = keras.utils.image_dataset_from_directory(
...
)
# using .repeat function
train_ds = train_ds.repeat().shuffle(8 * batch_size)
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.repeat()
val_ds = val_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
# specify step per epoch
history = model.fit(
train_ds,
validation_data=val_ds,
epochs=..,
steps_per_epoch = train_ds.cardinality().numpy(),
validation_steps = val_ds.cardinality().numpy(),
)

Related

Getting NaN value from loss function for k-fold validation

I am trying to implement MNIST using PyTorch Lightning. Here, I wanted to use k-fold cross-validation.
The problem is I am getting the NaN value from the loss function (for at least 1 fold). From below 3rd time, I was getting NaN values from the loss function.
Epoch 19: 100%|█████████████████████████████████| 110/110 [00:03<00:00, 29.24it/s, loss=0.963, v_num=287]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 39.94it/s]
Epoch 19: 100%|█████████████████████████████████| 110/110 [00:04<00:00, 25.69it/s, loss=0.825, v_num=288]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 41.19it/s]
Epoch 19: 100%|███████████████████████████████████| 110/110 [00:03<00:00, 30.19it/s, loss=nan, v_num=289]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 42.15it/s
Or very big loss value (terminated before completing full epocs)
Epoch 0: 44%|█████████████▉ | 48/110 [00:02<00:02, 22.87it/s, loss=2.08e+23, v_num=295]
The code I have used for data preparation, k-fold, and trainer is given below
def prepare_data():
transform=transforms.Compose([transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))])
mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transform)
dataset = ConcatDataset([mnist_train, mnist_test])
return dataset
k_folds=5
epochs=20
kfold=KFold(n_splits=k_folds,shuffle=True)
dataset = prepare_data()
model = LightningMNIST(lr_rate=0.01)
for fold, (train_idx, val_idx) in enumerate(kfold.split(dataset)):
train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)
train_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=train_subsampler)
val_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=val_subsampler)
model.apply(reset_weights) # reset model for every fold
early_stopping = EarlyStopping('train_loss', mode='min', patience=5)
model_checkpoint = ModelCheckpoint(dirpath=model_path+'mnist_{epoch}-{train_loss:.2f}',
monitor='train_loss', mode='min', save_top_k=3)
trainer = pl.Trainer(max_epochs=epochs, profiler=False, callbacks = [model_checkpoint],default_root_dir=model_path)
trainer.fit(model, train_dataloader=train_loader)
trainer.test(test_dataloaders=val_loader, ckpt_path=None)
The training step is given below
def training_step(self, train_batch, batch_idx):
x, y = train_batch
logits = self.forward(x)
loss = self.error_loss(logits.squeeze(-1), y.float())
self.log('train_loss', loss)
return {'loss': loss}
I assume, maybe I am doing something wrong in the k-fold data preparation or in the training step. Otherwise getting NaN or very big value is not expected for this simple problem and simple model.
I have gone through several posts like this, this, and that. Some of them suggested that it could happen because the dataset might contain NaN (but I think MNIST does not contain NaN as directly downloading from the module), model's learning rate is 0.01 (not too big not too small). Moreover, I believe that this post is not duplicated (because here, trying to use k-fold thought the error seems the same).
Any suggestions?

Tensorflow Data API: repeat()

The following code is a snippet from "Hands on machine learning with scimitar-learn, Keras and tensorflow".
I understand everything in the following code, except the chaining the .repeat(repeat) function in second line.
I know that repeat is repeating the dataset elements (i.e., in this case the file paths) and if the argument is set to None or left empty the repetition will continues forever until the function which using it decides when to stop.
As you can see in the code bellow, the author is setting the repeat() argument to None;
1 - basically I want to know why the author decided to do such?
2 - or is it because the code is trying to simulate a situation which the dataset is not fitting in the memory, if this is the case then in real situation we should avoid repeat(), am I correct?
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
n_read_threads=None, shuffle_buffer_size=10000,
n_parse_threads=5, batch_size=32):
dataset = tf.data.Dataset.list_files(filepaths, seed = 42).repeat(repeat)
dataset = dataset.interleave(
lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
cycle_length = n_readers, num_parallel_calls = n_read_threads)
dataset = dataset.shuffle(shuffle_buffer_size)
dataset = dataset.map(preprocess, num_parallel_calls = n_parse_threads)
dataset = dataset.batch(batch_size)
return dataset.prefetch(1)
train_set = csv_reader_dataset(train_filepaths, repeat = None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
keras.layers.InputLayer(input_shape = X_train.shape[-1: ]),
keras.layers.Dense(30, activation = 'relu'),
keras.layers.Dense(1)
])
m_loss = keras.losses.mean_squared_error
m_optimizer = keras.optimizers.SGD(lr = 1e-3)
batch_size = 32
model.compile(loss = m_loss, optimizer = m_optimizer, metrics = ['accuracy'])
model.fit(train_set, steps_per_epoch = len(X_train) // batch_size, epochs = 10, validation_data = valid_set)

For your questions, I think:
tf.data API won't easily lead to out-of-memory easily as it loads data given the file paths or tfrecrods (compressed mode). Hence, repeat() does not thing with memory here; instead, it is used for data-transforming.
I have to use repeat(#) when setting steps_per_epoch to #. Say your batch_num = 32, and steps_per_epoch = 100//32 = 3 -> require 3 * 32 = 96 samples per epoch but your data has 80 samples only. Then, I have to use data.repeat(2) to have totally 160 samples that 80 samples in repeat_1 and the first 16 samples in repeat_2 would be used within 1 epoch. This is to prevent the error Input run out of data.

I had another copy of the same question on book's author git repo.
The issue is clarified; it was due to a bug in Keras 2.0.
Read more on: https://github.com/ageron/handson-ml2/issues/407

tf.estimator.Estimator gives different test accuracy when trianed epoch by epoch vs over all epochs

I have defined a straightforward CNN as my model_fn for a tf.estimator.Estimator and feed it with this input_fn:
def input_fn(features, labels, batch_size, epochs):
dataset = tf.data.Dataset.from_tensor_slices((features))
dataset = dataset.map(lambda x: tf.cond(tf.random_uniform([], 0, 1) > 0.5, lambda: dataset_augment(x), lambda: x),
num_parallel_calls=16).cache()
dataset_labels = tf.data.Dataset.from_tensor_slices((labels))
dataset = dataset.zip((dataset, dataset_labels))
dataset = dataset.shuffle(30000)
dataset = dataset.repeat(epochs)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(-1)
return dataset
when I train the estimator this way, I get 43% test accuracy after 10 epochs:
steps_per_epoch = data_train.shape[0] // batch_size
for epoch in range(1, epochs + 1):
cifar100_classifier.train(lambda: input_fn(data_train, labels_train, batch_size, epochs=1), steps=steps_per_epoch)
But when I train it this way I get 32% test accuracy after 10 epochs:
steps_per_epoch = data_train.shape[0] // batch_size
max_steps = epochs * steps_per_epoch
cifar100_classifier.train(steps=max_steps,
input_fn=lambda: input_fn(data_train, labels_train, batch_size, epochs=epochs))
I just cannot understand why these two methods produce different results. Can anyone please explain?

Since you are calling the input_fn multiple times in the first example it seems like you would be generating more augmented data through dataset_augment(x) as you're doing an augmentation coin-toss for every x every epoch.
In the second example you only do these coin-tosses once and then train multiple epochs on that same data. So here your train set is effectively ``smaller''.
The .cache() doesn't really save you from this in the first example.

are your model's weights initialized randomly? this may be a case.

How to deal with thousands of images for CNN training Keras

I have ~10000k images that cannot fit in memory. So for now I can only read 1000 images and train on it...
My code is here :
img_dir = "TrainingSet" # Enter Directory of all images
image_path = os.path.join(img_dir+"/images",'*.bmp')
files = glob.glob(image_path)
images = []
masks = []
contours = []
indexes = []
files_names = []
for f1 in np.sort(files):
img = cv2.imread(f1)
result = re.search('original_cropped_(.*).bmp', str(f1))
idx = result.group(1)
mask_path = img_dir+"/masks/mask_cropped_"+str(idx)+".bmp"
mask = cv2.imread(mask_path,0)
contour_path = img_dir+"/contours/contour_cropped_"+str(idx)+".bmp"
contour = cv2.imread(contour_path,0)
indexes.append(idx)
images.append(img)
masks.append(mask)
contours.append(contour)
train_df = pd.DataFrame({"id":indexes,"masks": masks, "images": images,"contours": contours })
train_df.sort_values(by="id",ascending=True,inplace=True)
print(train_df.shape)
img_size_target = (256,256)
ids_train, ids_valid, x_train, x_valid, y_train, y_valid, c_train, c_valid = train_test_split(
train_df.index.values,
np.array(train_df.images.apply(lambda x: cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],3))),
np.array(train_df.masks.apply(lambda x: cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],1))),
np.array(train_df.contours.apply(lambda x: cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],1))),
test_size=0.2, random_state=1337)
#Here we define the model architecture...
#.....
#End of model definition
# Training
optimizer = Adam(lr=1e-3,decay=1e-10)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
early_stopping = EarlyStopping(patience=10, verbose=1)
model_checkpoint = ModelCheckpoint("./keras.model", save_best_only=True, verbose=1)
reduce_lr = ReduceLROnPlateau(factor=0.5, patience=5, min_lr=0.00001, verbose=1)
epochs = 200
batch_size = 32
history = model.fit(x_train, y_train,
validation_data=[x_valid, y_valid],
epochs=epochs,
batch_size=batch_size,
callbacks=[early_stopping, model_checkpoint, reduce_lr])
What I would like to know is how can I modify my code in order to do batches of a small set of images without loading all the other 10000 into memory ? which means that the algorithm will read X images each epoch from directory and train on it and after that goes for the next X until the last one.
X here would be a reasonable amount of images that can fit into memory.

use fit_generator instead of fit
def generate_batch_data(num):
#load X images here
return images
model.fit_generator(generate_batch_data(X),
samples_per_epoch=10000, nb_epoch=10)
Alternative you could use train_on_batch instead of fit
Discussion on GitHub about this topic: https://github.com/keras-team/keras/issues/2708

np.array(train_df.images.apply(lambda x:cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],3)))
You can first apply this filter (and the 2 others) to each individual file and save them to a special folder (images_prepoc, masks_preproc,etc.. ) in a separate script, then load them back already ready for use in the current script.
Assuming that the actual images dimensions are greater than 256x256, you will have a faster algorithm, using less memory at the cost of a single preparation phase.

predict_generator and class labels

I am using ImageDataGenerator to generate new augmented images and extract bottleneck features from pretrained model but most of the tutorial I see on keras
samples same no of training samples as number of images in directory.
train_generator = train_datagen.flow_from_directory(
train_path,
target_size=image_size,
shuffle = "false",
class_mode='categorical',
batch_size=1)
bottleneck_features_train = model.predict_generator(
train_generator, 2* nb_train_samples // batch_size)
Suppose I want 2 times more images from the above code, how I can get the desired class labels for the features extracted from bottleneck layer which are stored in tuple train_generator.
shouldnt the code in training_generator.py at line 422
x, _ = generator_output
do something like this
=> x, y = generator_output
and return tuple [np.concatenate(out) for out in all_outs],y from predict_generator
i.e return the corresponding class labels along with the predicted features all_outs since there is no way to get the corresponding labels without running generator twice.

If you're using predict, normally you simply don't want Y, because Y will be the result of the prediction. (You're not training, so you don't need the true labels)
But you can do it yourself:
bottleneck = []
labels = []
for i in range(2 * nb_train_samples // batch_size):
x, y = next(train_generator)
bottleneck.append(model.predict(x))
labels.append(y)
bottleneck = np.concatenate(bottleneck)
labels = np.concatenate(labels)
If you want it with indexing (if your generator supports that):
#...
for epoch in range(2):
for i in range(nb_train_samples // batch_size):
x,y = train_generator[i]
#...

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow does not apply data augmentation properly - python

Related

Getting NaN value from loss function for k-fold validation

Tensorflow Data API: repeat()

tf.estimator.Estimator gives different test accuracy when trianed epoch by epoch vs over all epochs

How to deal with thousands of images for CNN training Keras

predict_generator and class labels

Categories

Resources