I have written a custom keras callback to check the augmented data from a generator. (See this answer for the full code.) However, when I tried to use the same callback for a tf.data.Dataset, it gave me an error:
File "/path/to/tensorflow_image_callback.py", line 16, in on_batch_end
imgs = self.train[batch][images_or_labels]
TypeError: 'PrefetchDataset' object is not subscriptable
Do Keras callbacks in general only work with generators, or is it something about the way I've written mine? Is there a way to modify either my callback or the dataset to make it work?
I think there are three pieces to this puzzle. I'm open to changes to any and all of them. Firstly, the init function in the custom callback class:
class TensorBoardImage(tf.keras.callbacks.Callback):
def __init__(self, logdir, train, validation=None):
super(TensorBoardImage, self).__init__()
self.logdir = logdir
self.file_writer = tf.summary.create_file_writer(logdir)
self.train = train
self.validation = validation
Secondly, the on_batch_end function within that same class:
def on_batch_end(self, batch, logs):
images_or_labels = 0 #0=images, 1=labels
imgs = self.train[batch][images_or_labels]
Thirdly, instantiating the callback:
import tensorflow_image_callback
tensorboard_image_callback = tensorflow_image_callback.TensorBoardImage(logdir=tensorboard_log_dir, train=train_dataset, validation=valid_dataset)
model.fit(train_dataset,
epochs=n_epochs,
validation_data=valid_dataset,
callbacks=[
tensorboard_callback,
tensorboard_image_callback
])
Some related threads which haven't led me to an answer yet:
Accessing validation data within a custom callback
Create keras callback to save model predictions and targets for each batch during training
What ended up working for me was the following, using tfds:
the __init__ function:
def __init__(self, logdir, train, validation=None):
super(TensorBoardImage, self).__init__()
self.logdir = logdir
self.file_writer = tf.summary.create_file_writer(logdir)
# #from keras generator
# self.train = train
# self.validation = validation
#from tf.Data
my_data = tfds.as_numpy(train)
imgs = my_data['image']
then on_batch_end:
def on_batch_end(self, batch, logs):
images_or_labels = 0 #0=images, 1=labels
imgs = self.train[batch][images_or_labels]
#calculate epoch
n_batches_per_epoch = self.train.samples / self.train.batch_size
epoch = math.floor(self.train.total_batches_seen / n_batches_per_epoch)
#since the training data is shuffled each epoch, we need to use the index_array to find something which uniquely
#identifies the image and is constant throughout training
first_index_in_batch = batch * self.train.batch_size
last_index_in_batch = first_index_in_batch + self.train.batch_size
last_index_in_batch = min(last_index_in_batch, len(self.train.index_array))
img_indices = self.train.index_array[first_index_in_batch : last_index_in_batch]
with self.file_writer.as_default():
for ix,img in enumerate(imgs):
#only post 1 out of every 1000 images to tensorboard
if (img_indices[ix] % 1000) == 0:
#instead of img_filename, I could just use str(img_indices[ix]) as a unique identifier
#but this way makes it easier to find the unaugmented image
img_filename = self.train.filenames[img_indices[ix]]
#convert float to uint8, shift range to 0-255
img -= tf.reduce_min(img)
img *= 255 / tf.reduce_max(img)
img = tf.cast(img, tf.uint8)
img_tensor = tf.expand_dims(img, 0) #tf.summary needs a 4D tensor
tf.summary.image(img_filename, img_tensor, step=epoch)
I didn't need to make any changes to the instantiation.
I recommend using it only for debugging; otherwise it saves every nth image in your dataset to TensorBoard every epoch, which can end up using a lot of disk space.
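For reference, here is a minimal sketch (not part of the answer above) of how the same idea could be adapted to work directly on a tf.data.Dataset, assuming the dataset yields (images, labels) batches; the class name and attribute names are mine:
import tensorflow as tf

class TensorBoardImageFromDataset(tf.keras.callbacks.Callback):
    def __init__(self, logdir, train_ds, n_images=4):
        super().__init__()
        self.file_writer = tf.summary.create_file_writer(logdir)
        #take() plus iteration works where subscripting a PrefetchDataset does not
        self.sample_images, _ = next(iter(train_ds.take(1)))
        self.n_images = n_images
    def on_epoch_end(self, epoch, logs=None):
        #shift/scale to 0-255 and cast to uint8, same idea as the conversion above
        imgs = self.sample_images - tf.reduce_min(self.sample_images)
        imgs = imgs * (255.0 / tf.reduce_max(imgs))
        imgs = tf.cast(imgs, tf.uint8)
        with self.file_writer.as_default():
            tf.summary.image("augmented_batch", imgs, max_outputs=self.n_images, step=epoch)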
Related
I am trying to implement a Custom Loss function that uses multiple predictions/forward propagations of images for an image classification model.
The general concept of this loss function is to evaluate the model's consistency between non-augmented and augmented images. That is to say, the model is given two images: the original image and its augmented counterpart. Then both images are forward propagated through the model. The more different the two outputs are from each other, the higher the loss.
This meant a fairly low-level change, and the most apparent way of solving it, to me, was model subclassing. I created a subclass of the keras.Model class and changed its train_step() method to include a small algorithm for locating the augmented counterpart of each original image (not relevant to the issue at all) and, more significantly, a line that computes a prediction on the augmented counterpart:
with tf.GradientTape() as tape:
y_pred = self(x, training=True)
y_aug = self(self.augmented_data[aug_index:aug_index+self.batch_size], training=True)
loss = self.comparative_loss(y, y_pred, y_aug)
The whole self.augmented_data[aug_index:aug_index+self.batch_size] expression isn't relevant here; it can be thought of simply as the augmented data input. The intent was for the "comparative_loss" method to take the two predictions and then perform the loss calculation described above on them.
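For what it's worth, the consistency idea described above boils down to a squared Euclidean distance between the two prediction vectors, averaged over the batch. A minimal vectorized sketch (the function name is mine, not the comparative_loss defined later):
import tensorflow as tf

def consistency_loss(y_pred, y_aug):
    #squared Euclidean distance between the two softmax outputs, per sample
    per_sample = tf.reduce_sum(tf.square(y_pred - y_aug), axis=-1)
    #averaged over the batch
    return tf.reduce_mean(per_sample)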
The issue came when I tried to compile the model: compile() requires a loss argument, but it refused to accept my custom loss method because that method takes three parameters instead of the usual (y_true, y_pred). I couldn't go with the standard fix of wrapping the function in a closure like this:
def new_loss(extra_parameter):
    def loss(y_true, y_pred):
        #loss_value would be computed here from y_true, y_pred and extra_parameter
        return loss_value
    return loss
since my "extra_parameter" was not just a standard output of the model; it was a completely separate forward pass through the model that relied on my custom train_step() method.
TL;DR:
What I'm most confused about is why model.compile() even requires a loss function if my train_step method doesn't use it. The train_step method in my custom subclass has the loss built in, so is there a way to override compile()'s loss parameter and have it work without my having to give it a method? If not, what other solutions are there?
The full code is below, though I sincerely apologize to anyone that reads it, as it's not quite finished:
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 18 11:37:08 2022
Custom Loss Function
Description:
For each element of y_true, compare the y_predict of
the original image and the complemented one, then return
a loss accordingly using the Euclidian distance
between the predictions for the original images and the complements.
y_predict are labels for the images, these labels can
come in any form: CIFAR labels, species labels, or labels of which
individual a given image is.
y_predict will be in the shape (batch_size, number_of_classes), using the
#author: hudso
"""
import tensorflow as tf
import keras
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, BatchNormalization
import ssl
import numpy as np
import cv2 as cv
class CustomModel(keras.Model):
def __init__(self, classes):
super().__init__() #call parent constructor
self.conv_1 = Conv2D(32,(3,3),activation='relu',padding='same')
self.batch_1 = BatchNormalization()
self.conv_2 = Conv2D(32,(3,3),activation='relu',padding='same')
self.batch_2 = BatchNormalization()
self.pool_1 = MaxPooling2D((2,2))
self.conv_3 = Conv2D(64,(3,3),activation='relu',padding='same')
self.batch_3 = BatchNormalization()
self.conv_4 = Conv2D(64,(3,3),activation='relu',padding='same')
self.batch_4 = BatchNormalization()
self.pool_2 = MaxPooling2D((2,2))
self.conv_5 = Conv2D(128,(3,3),activation='relu',padding='same')
self.batch_5 = BatchNormalization()
self.conv_6 = Conv2D(128,(3,3),activation='relu',padding='same')
self.batch_6 = BatchNormalization()
self.flatten = Flatten()
self.layer_1 = keras.layers.Dropout(0.2)
self.layer_2 = Dense(256,activation='relu')
self.dropout = keras.layers.Dropout(0.2)
self.outputs = Dense(classes, activation='softmax') #no. of classes
self.classes = classes #Initializes the number of classes variable
#essentially the Functional API forward-pass call-structure shenanigans
#called each forward propagation (calculating loss, training, etc.)
def call(self, inputs):
#print("INPUTS: " + str(inputs))
x = self.conv_1(inputs)
x = self.batch_1(x)
x = self.conv_2(x)
x = self.batch_2(x)
x = self.pool_1(x)
x = self.conv_3(x)
x = self.batch_3(x)
x = self.conv_4(x)
x = self.batch_4(x)
x = self.pool_2(x)
x = self.conv_5(x)
x = self.batch_5(x)
x = self.conv_6(x)
x = self.batch_6(x)
x = self.flatten(x)
x = self.layer_1(x)
x = self.layer_2(x)
x = self.dropout(x)
x = self.outputs(x)
return x #returns the constructed model
#Imports necessary data (It's hard to gain access of the values handed to .fit())
def data_import(self, augmented_data, x_all, batch_size):
self.augmented_data = augmented_data
self.x_all = np.asarray(x_all, dtype=np.float32)
self.batch_size = batch_size
#Very useful advice: https://stackoverflow.com/questions/65889381/going-from-a-tensorarray-to-a-tensor
def comparative_loss(self, y_true, y_pred, y_aug):
output_loss = tf.TensorArray(tf.float32, size=self.classes)
batch_loss = tf.TensorArray(tf.float32, size=self.batch_size)
for n in range(self.batch_size):
for i in range(self.classes):
output_loss = output_loss.write(i, tf.square(tf.abs(tf.subtract(y_pred[n][i], y_aug[n][i])))) #finds Euclidean Distance for each prediction, then averages the loss across all iterations in the batch
indexes = tf.keras.backend.arange(0, self.classes, step=1, dtype='int32')
output_loss_tensor = output_loss.gather(indexes)
batch_loss = batch_loss.write(n, tf.math.reduce_sum(output_loss_tensor))
indexes = tf.keras.backend.arange(0, self.batch_size, step=1, dtype='int32')
batch_loss_tensor = batch_loss.gather(indexes)
total_loss = tf.math.reduce_sum(batch_loss_tensor)
total_loss = tf.math.divide(total_loss, self.batch_size)
print("TOTAL LOSS: " + str(total_loss))
return total_loss
def train_step(self, data):
x, y = data #Current batch
#Finds the range of indexes for the complements of the current batch of images
#A lower level implementation could make this significantly more efficient by avoiding searching each time
        aug_index = 0
        found = False #initialized here so the "not found" check below works even when no match turns up
        x_arr = x.numpy() #Turns the input data iterable Tensor into a numpy array; Eager Execution must be enabled for this to work
for i in range(np.size(self.x_all, axis = 0)):
difference = cv.subtract(self.x_all[i], x_arr[0])
if np.count_nonzero(difference) == 0: #In the .fit() line for this CustomModel, shuffle = False for this to work
aug_index = i #Lower bound of the batch of images
found = True
if found == False:
print("Yikes mate the x_arr wasn't found in x_all... probably a rounding error")
print("\nCurrent Index: " + str(aug_index))
#Forward pass/predictions + loss calculation
with tf.GradientTape() as tape:
y_pred = self(x, training=True)
y_aug = self(self.augmented_data[aug_index:aug_index+self.batch_size], training=True)
loss = self.comparative_loss(y, y_pred, y_aug) #Computes the actual loss value
#I didn't touch any of this code
trainable_vars = self.trainable_variables
gradients = tape.gradient(loss, trainable_vars)
self.optimizer.apply_gradients(zip(gradients, trainable_vars))
self.compiled_metrics.update_state(y, y_pred)
return {m.name: m.result() for m in self.metrics}
#Essentially emulates the environment that the model would normally be running in
#E.g. Creates the dataset, does Image Augmentation, etc.
#In the actual implementation, only the "CustomModel" class will be used, this is purely for testing purposes
class shrek_is_love:
def __init__(self):
self.complements = []
self.create_dataset()
#automatically runs
def create_dataset(self):
ssl._create_default_https_context = ssl._create_unverified_context
(images, labels), (_, _) = keras.datasets.cifar10.load_data() #only uses the training sets and then splits it again later since that'll be what we'll be dealing with in the happywhale dataset anyways
self.labels = labels
self.images = images
self.data_aug()
#NOT MY CODE this is liam's image data generator (thx liam ur cool)
#automatically runs
def data_aug(self):
imageGen = keras.preprocessing.image.ImageDataGenerator(width_shift_range=.3, height_shift_range=.3, horizontal_flip=True, zoom_range=.3)
imagees = np.zeros(shape=(1, 32, 32, 3))
for l in range(np.size(self.images, 0)):
# adjust the tuple inside of cv.resize to adjust resolution
temp = cv.resize(self.images[l], (32, 32))
imagees[0] = (cv.cvtColor(temp, cv.COLOR_BGR2RGB))
it = imageGen.flow(imagees)
im = it.next()
im = im[0].astype('float32')
im = im / 255.0
self.complements.append(im)
        self.complements = np.asarray(self.complements, dtype=np.float32) #np.float was removed in newer NumPy versions; float32 also matches x_all in data_import
        self.images = self.images.astype(np.float32)
self.images = self.images / 255.0
self.preprocessor()
def preprocessor(self):
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse=False)
self.labels = onehot_encoder.fit_transform(np.reshape(self.labels, (-1, 1)))
from sklearn.model_selection import train_test_split
shared_seed = 5 #the indexes of complements_train and image_train have to line up, so that labels_train can apply to both
self.complements_train, self.complements_test = train_test_split(self.complements, test_size=0.25, random_state=shared_seed)
self.images_train, self.images_test, self.labels_train, self.labels_test = train_test_split(self.images, self.labels, test_size=0.25, random_state=shared_seed)
#The following code will be all that is necessary to run the CustomModel class
batch_size = 32
shrek_is_life = shrek_is_love()
model = CustomModel(10) #10 classes
model.data_import(shrek_is_life.complements_train, shrek_is_life.images_train, batch_size) #the model will not be training on aug_data, essentially turning it into a secondary test set
model.compile(optimizer='adam', loss=None, metrics=['accuracy'], run_eagerly=True) #loss=None brings up an error, but I have no idea what else to put in there
model.fit(x = shrek_is_life.images_train, y = shrek_is_life.labels_train, shuffle = False, batch_size = batch_size, epochs = 1)
EDIT:
Running it without a .compile line yields this error:
Traceback (most recent call last):
File "D:\Downloads\untitled0.py", line 191, in <module>
model.fit(x = shrek_is_life.images_train, y = shrek_is_life.labels_train, shuffle = False, batch_size = batch_size, epochs = 1)
File "C:\Users\hudso\anaconda3\envs\mlTens\lib\site-packages\keras\engine\training.py", line 1150, in fit
x, y, sample_weights = self._standardize_user_data(
File "C:\Users\hudso\anaconda3\envs\mlTens\lib\site-packages\keras\engine\training.py", line 508, in _standardize_user_data
raise RuntimeError('You must compile a model before '
RuntimeError: You must compile a model before training/testing. Use `model.compile(optimizer, loss)`.
Running .compile without the loss argument or with loss=None yields:
File "C:\Users\hudso\anaconda3\envs\mlTens\lib\site-packages\keras\engine\training.py", line 706, in _prepare_total_loss
raise ValueError('The model cannot be compiled '
ValueError: The model cannot be compiled because it has no loss to optimize.
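For what it's worth, with tensorflow.keras (TF 2.2+) a Model subclass that computes its own loss inside train_step can usually be compiled without any loss argument, as long as train_step never uses self.compiled_loss. A sketch under that assumption (i.e. CustomModel subclassing tensorflow.keras.Model rather than standalone keras.Model):
#assumes the imports at the top are switched to tensorflow.keras
model = CustomModel(10)
model.data_import(shrek_is_life.complements_train, shrek_is_life.images_train, batch_size)
model.compile(optimizer='adam', metrics=['accuracy'], run_eagerly=True) #no loss argument at all
model.fit(x=shrek_is_life.images_train, y=shrek_is_life.labels_train, shuffle=False, batch_size=batch_size, epochs=1)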
I want to use torch.save() to save a trained model for inference. However, with either load_state_dict() or torch.load(), I can't recover the saved model: the loss computed by the loaded model is just different from the loss computed by the saved model.
The relevant Libraries:
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F
The model:
class nn_block(nn.Module):
def __init__(self, feats_dim):
super(nn_block, self).__init__()
self.linear = nn.Linear(feats_dim, feats_dim)
self.bn = nn.BatchNorm1d(feats_dim)
self.softplus1 = nn.Softplus()
self.softplus2 = nn.Softplus()
def forward(self, rep_mat):
transformed_mat = self.linear(rep_mat)
transformed_mat = self.bn(transformed_mat)
transformed_mat = self.softplus1(transformed_mat)
transformed_mat = self.softplus2(transformed_mat + rep_mat)
return transformed_mat
class test_nn(nn.Module):
def __init__(self, in_feats, feats_dim, num_conv, num_classes):
super(test_nn, self).__init__()
self.linear1 = nn.Linear(in_feats, feats_dim)
self.convs = [nn_block(feats_dim) for _ in range(num_conv)]
self.linear2 = nn.Linear(feats_dim, num_classes)
self.softmax = nn.Softmax()
def forward(self, rep_mat):
h = self.linear1(rep_mat)
for conv_func in self.convs:
h = conv_func(h)
h = self.linear2(h)
h = self.softmax(h)
return h
Train, save, and reload a model:
# fake a classification task
num_classes = 2; input_dim = 8
one = np.random.multivariate_normal(np.zeros(input_dim),np.eye(input_dim),20)
two = np.random.multivariate_normal(np.ones(input_dim),np.eye(input_dim),20)
inputs = np.concatenate([one, two], axis=0)
labels = np.concatenate([np.zeros(20), np.ones(20)])
inputs = Variable(torch.Tensor(inputs))
labels = torch.LongTensor(labels)
# build a model
net = test_nn(input_dim, 5, 2, num_classes)
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
net.train()
losses = []
best_score = 1e10
for epoch in range(25):
preds = net(inputs)
loss = F.cross_entropy(preds, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
state_dict = {'state_dict': net.state_dict()}
if loss.item()-best_score<-1e-4:
# save only parameters
torch.save(state_dict, 'model_params.torch')
# save the whole model
torch.save(net, 'whole_model.torch')
best_score = np.min([best_score, loss.item()])
losses.append(loss.item())
net_params = test_nn(input_dim, 5, 2, num_classes)
net_params.load_state_dict(torch.load('model_params.torch')['state_dict'])
net_params.eval()
preds_params = net_params(inputs)
loss_params = F.cross_entropy(preds_params, labels)
print('reloaded params %.4f %.4f' % (loss_params.item(), np.min(losses)))
net_whole = torch.load('whole_model.torch')
net_whole.eval()
preds_whole = net_whole(inputs)
loss_whole = F.cross_entropy(preds_whole, labels)
print('reloaded whole %.4f %.4f' % (loss_whole.item(), np.min(losses)))
As you can see by running the code, the losses computed by the two loaded models are different, even though the two loaded models should be exactly the same. Not only are the two losses different from each other, they are also different from the loss computed by the best model that was saved in the first place.
Why this can happen?
The state dict contains every parameter (nn.Parameter) and buffer (similar to a parameter, but one that should not be trained/optimised) that has been registered on the module and all of its submodules. Everything else will not be included in that state dict.
Your test_nn module uses a list for convs, therefore it is not included in the state dict:
self.convs = [nn_block(feats_dim) for _ in range(num_conv)]
Not only are they not contained in the state dict, they are also not visible to net.parameters(), which means they are not trained/optimised at all.
To register the modules from the list you can wrap it in nn.ModuleList, which is a module that acts like a list, while correctly registering the modules it contains:
self.convs = nn.ModuleList([nn_block(feats_dim) for _ in range(num_conv)])
With that change both models produce the same result.
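A quick way to convince yourself (a sketch reusing the test_nn class from the question, with the nn.ModuleList change applied, and the small constructor arguments from the fake classification task):
net = test_nn(8, 5, 2, 2)
#with nn.ModuleList the nn_block parameters are registered, so they show up in
#net.parameters() (and therefore get optimised) and in the state dict
print(sum(p.numel() for p in net.parameters()))
print([k for k in net.state_dict().keys() if k.startswith('convs')][:4])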
Since you are calling the convs modules sequentially in the for-loop (the output of one module is the input of the next), you may consider using nn.Sequential, which you can call directly instead of having to use the for-loop. nn.Sequential is used a lot and it just makes things a little simpler; for example, if you want to replace the sequence of modules with a single module, you don't need to change anything in the forward method.
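A sketch of that suggestion, reusing the nn_block class from the question (the class name here is mine):
import torch.nn as nn

class test_nn_sequential(nn.Module):
    def __init__(self, in_feats, feats_dim, num_conv, num_classes):
        super().__init__()
        self.linear1 = nn.Linear(in_feats, feats_dim)
        #nn.Sequential registers the blocks and applies them in order
        self.convs = nn.Sequential(*[nn_block(feats_dim) for _ in range(num_conv)])
        self.linear2 = nn.Linear(feats_dim, num_classes)
        self.softmax = nn.Softmax(dim=-1)
    def forward(self, rep_mat):
        h = self.linear1(rep_mat)
        h = self.convs(h) #no explicit for-loop needed
        h = self.linear2(h)
        return self.softmax(h)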
Not just the two losses are different, they are also different from the loss computed by the best model that was saved in the first place.
When you are training, you calculate the loss for the current input (batch) and then you optimise the parameters based on that input. This means your parameters differ from the ones used to calculate the loss. Because you are saving the model after that, it will also have a different loss (the one that would occur in the next iteration).
preds = net(inputs)
# Calculating the loss of the current model
loss = F.cross_entropy(preds, labels)
optimizer.zero_grad()
loss.backward()
# Updating the model's parameters based on the loss
optimizer.step()
# State of the model after it has been updated
state_dict = {'state_dict': net.state_dict()}
# Comparing the loss from BEFORE the update
# But saving the model from AFTER the update
if loss.item()-best_score<-1e-4:
# save only parameters
torch.save(state_dict, 'model_params.torch')
# save the whole model
torch.save(net, 'whole_model.torch')
It's important to evaluate the model after the updates have been made. For this reason a validation set should be used, which is run after each epoch to assess the model's accuracy.
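One hedged way to make the saved checkpoint and the reported loss agree is to re-evaluate after the optimizer step and only then decide whether to save; a sketch built on the training loop from the question:
with torch.no_grad():
    net.eval()
    post_update_loss = F.cross_entropy(net(inputs), labels).item() #loss of the weights you are about to save
    net.train()
if post_update_loss - best_score < -1e-4:
    torch.save({'state_dict': net.state_dict()}, 'model_params.torch')
    torch.save(net, 'whole_model.torch')
    best_score = min(best_score, post_update_loss)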
I have a very simple question. I have a Keras model (TF backend) defined for classification. I want to dump the training images fed into my model during training for debugging purposes. I am trying to create a custom callback that writes Tensorboard image summaries for this.
But how can I obtain the real training data inside the callback?
Currently I am trying this:
class TensorboardKeras(Callback):
def __init__(self, model, log_dir, write_graph=True):
self.model = model
self.log_dir = log_dir
self.session = K.get_session()
tf.summary.image('input_image', self.model.input)
self.merged = tf.summary.merge_all()
if write_graph:
self.writer = tf.summary.FileWriter(self.log_dir, K.get_session().graph)
else:
self.writer = tf.summary.FileWriter(self.log_dir)
def on_batch_end(self, batch, logs=None):
summary = self.session.run(self.merged, feed_dict={})
self.writer.add_summary(summary, batch)
self.writer.flush()
But I am getting the error:
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input_1' with dtype float and shape [?,224,224,3]
There must be a way to see what models get as input, right?
Or maybe I should try another way to debug it?
You don't need callbacks for this. All you need to do is implement a function that yields an image batch and its labels as a tuple. The flow_from_directory function has a parameter called save_to_dir which could satisfy all of your needs; in case it doesn't, here is what you can do:
def trainGenerator(batch_size, train_path, image_size):
#preprocessing see https://keras.io/preprocessing/image/ for details
image_datagen = ImageDataGenerator(horizontal_flip=True)
#create image generator see https://keras.io/preprocessing/image/#flow_from_directory for details
train_generator = image_datagen.flow_from_directory(
train_path,
class_mode = "categorical",
target_size = image_size,
batch_size = batch_size,
save_prefix = "augmented_train",
        seed = seed) #note: seed is not defined anywhere in this snippet; set it beforehand, e.g. seed = 1
for (batch_imgs, batch_labels) in train_generator:
#do other stuff such as dumping images or further augmenting images
yield (batch_imgs,batch_labels)
t_generator = trainGenerator(32, "./train_data", (224, 224)) #target_size expects (height, width) without the channel dimension
model.fit_generator(t_generator,steps_per_epoch=10,epochs=1)
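And if the save_to_dir route mentioned at the top is enough on its own, a minimal sketch reusing the same ImageDataGenerator (the dump directory name is a placeholder and must already exist):
image_datagen = ImageDataGenerator(horizontal_flip=True)
train_generator = image_datagen.flow_from_directory(
    "./train_data",
    class_mode = "categorical",
    target_size = (224, 224),
    batch_size = 32,
    save_to_dir = "./augmented_dump", #Keras writes every augmented image of each yielded batch here
    save_prefix = "augmented_train",
    seed = 1)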
I am using keras to build a model that inputs 720x1280 images and outputs a value.
I am having a problem with keras.models.Sequential.predict_generator when using the keras.utils.Sequence class to obtain the values corresponding to images on the validation/training sets. The values returned are shuffled, so I don't know which output corresponds to which image.
This is how my generators are defined
from skimage.io import ImageCollection, imread
from keras.utils import Sequence
def load_images(f):
return imread(f).astype(np.float64)
class DataSetImageKeras(Sequence):
def __init__(self, image_collection, values, batch_size):
self.images = image_collection
self.hf = values
self.batch_size = batch_size
self.n = len(self.images)
self.x_scale = 250
self.y_scale = 1e4
def __len__(self):
return int(np.ceil(len(self.images) / float(self.batch_size)))
def __getitem__(self, idx):
# batch_x is a numpy.ndarray
batch_x = (
self.images[idx:min(idx + self.batch_size, self.n)]
.concatenate()
.reshape(self.batch_size, 720, 1280, 1)
)
batch_y = self.hf[idx:min(idx + self.batch_size, self.n)]
return batch_x/self.x_scale, batch_y/self.y_scale
images_train = ImageCollection(images_paths_train, load_func=load_images)
images_val = ImageCollection(images_paths_test, load_func=load_images)
data_train = DataSetImageKeras(images_train, values_train, n_batch)
data_val = DataSetImageKeras(images_val, values_val, n_batch)
from keras.models import load_model
model = load_model('model001') #this model is already trained
If I use the following code:
val_result = []
val_hf =[]
for (batch_x, batch_y) in data_val:
val_result.append(model.predict_on_batch(batch_x))
val_hf.append(batch_y)
val_result = np.concatenate(val_result)
val_hf = np.concatenate(val_hf)
plt.plot(val_hf,
val_result,
marker='.',
linestyle='')
The correct result is obtained (as seen on this image where x is the desired value and y is the predicted value)
However if I use the predict_generator function, as below:
val_result = model.predict_generator(data_val, verbose=1,
workers=1,
max_queue_size=50,
use_multiprocessing=False)
The output is shuffled as can be seen here.
My problem is similar to
#5048 and
#6745,
which should be solved by
#6891 API, but I am using keras version 2.1.6 and it is still shuffling my predictions, even when using workers=1.
It is also similar to this, but I didn't find anything that could reset the generators and this problem is still present if I define a new generator and try to run the predict_generator.
I also found something stating that it could have something to do with the number of batches not dividing exactly the number of samples, but this problem is still present if I use n_batch=1
As a side note, it might be that predict_generator is not shuffling data, but only returning it with an index offset, since the input data on values and images_paths are already shuffled.
predict_generator was not shuffling my predictions, after all. The problem was with the __getitem__ method. For instance, using n_batch=32, the method would yield values from 1 to 32, then from 2 to 33 and so forth, instead of from 1 to 32, then 33 to 64, etc.
Changing the method as follows solves the problem
def __getitem__(self, idx):
# batch_x is a numpy.ndarray
idx_min = idx*self.batch_size
idx_max = min(idx_min + self.batch_size, self.n)
batch_x = (
self.images[idx_min:idx_max]
.concatenate()
.reshape(self.batch_size, 720, 1280, 1)
)
        batch_y = self.hf[idx_min:idx_max]
        return batch_x/self.x_scale, batch_y/self.y_scale #same scaling and return as the original __getitem__ above
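A quick sanity check of the fixed indexing (the batch size and sample count here are made up):
import numpy as np

batch_size, n = 32, 100
for idx in range(int(np.ceil(n / float(batch_size)))):
    idx_min = idx * batch_size
    idx_max = min(idx_min + batch_size, n)
    print(idx, idx_min, idx_max) #0 0 32, 1 32 64, 2 64 96, 3 96 100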
I am trying to feed a large dataset to a keras model.
The dataset does not fit into memory.
It is currently stored as a series of HDF5 files.
I want to train my model using
model.fit_generator(my_gen, steps_per_epoch=30, epochs=10, verbose=1)
However, in all the examples I could find online, my_gen was used only to perform data augmentation on a already loaded dataset. For example
def generator(features, labels, batch_size):
# Create empty arrays to contain batch of features and labels#
batch_features = np.zeros((batch_size, 64, 64, 3))
batch_labels = np.zeros((batch_size,1))
while True:
for i in range(batch_size):
# choose random index in features
            index = np.random.choice(len(features), 1)[0] #np.random.choice; the stdlib random.choice does not accept these arguments
batch_features[i] = some_processing(features[index])
batch_labels[i] = labels[index]
yield batch_features, batch_labels
In my case, it needs to be something like
def generator(features, labels, batch_size):
while True:
for i in range(batch_size):
# choose random index in features
index= # SELECT THE NEXT FILE
batch_features[i] = some_processing(features[files[index]])
batch_labels[i] = labels[file[index]]
yield batch_features, batch_labels
How do I keep track of the files which were already read in previous batch?
From the keras doc
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. [...]
This means you can write a class inheriting from keras.utils.Sequence:
class ProductSequence(keras.utils.Sequence):
def __init__(self):
pass
def __len__(self):
pass
def __getitem__(self, idx):
pass
__init__ is there to initialize the class.
__len__ should return the number of batches per epoch. Keras will use this to know which indices can be passed to __getitem__. __getitem__ will then return the batch data for the given index.
A simple example can be found here
With this approach you can simply keep internal state in the class that records which files have already been read.
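For example, a minimal sketch for the HDF5-files case (the file paths and the 'x'/'y' dataset names inside each file are assumptions):
import h5py
import numpy as np
import keras

class HDF5Sequence(keras.utils.Sequence):
    def __init__(self, file_paths):
        self.file_paths = file_paths #one file per batch keeps the bookkeeping trivial
    def __len__(self):
        return len(self.file_paths) #number of batches per epoch
    def __getitem__(self, idx):
        #Keras passes idx in 0..len(self)-1, so no manual tracking of already-read files is needed
        with h5py.File(self.file_paths[idx], 'r') as f:
            batch_features = np.array(f['x']) #assumed dataset key
            batch_labels = np.array(f['y']) #assumed dataset key
        return batch_features, batch_labels

#my_gen = HDF5Sequence(['part_000.h5', 'part_001.h5'])
#model.fit_generator(my_gen, steps_per_epoch=len(my_gen), epochs=10, verbose=1)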
Let us suppose that your data are images. If you have many images you probably won't be able to load all of them in memory and you would like to read from disk in batches.
Keras flow_from_directory is very fast at doing that, as it also reads in a multi-threaded way, but it needs the images to be stored in separate folders according to their class. If we have all the images in the same folder and their classes in a separate file, we could use the generator below to load our x, y data.
import pandas as pd
import numpy as np
import cv2
from keras.utils import to_categorical #needed for the one-hot labels in the generator below
#df_train: data frame with class of every image
#dpath: path of images
classes=list(np.unique(df_train.label))
def batch_generator(ids):
while True:
for start in range(0, len(ids), batch_size):
x_batch = []
y_batch = []
end = min(start + batch_size, len(ids))
ids_batch = ids[start:end]
for id in ids_batch:
img = cv2.imread(dpath+'train/{}.png'.format(id)) #open cv read as BGR
#img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) #BGR to RGB
#img = cv2.resize(img, (224, 224), interpolation = cv2.INTER_CUBIC)
#img = pre_process(img)
                labelname=df_train.label.loc[df_train.id==id].values[0] #.values returns an array; take the scalar label so classes.index() works
                labelnum=classes.index(labelname)
x_batch.append(img)
y_batch.append(labelnum)
x_batch = np.array(x_batch)
y_batch = to_categorical(y_batch,10)
yield x_batch, y_batch
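A hedged usage sketch for the generator above (df_train, dpath and the model are assumed to exist as described in the comments):
batch_size = 32
train_ids = list(df_train.id)
model.fit_generator(batch_generator(train_ids),
                    steps_per_epoch=int(np.ceil(len(train_ids) / float(batch_size))),
                    epochs=10)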