I have ~10,000 images that cannot fit in memory, so for now I can only read 1,000 images and train on those...
My code is here :
img_dir = "TrainingSet" # Enter Directory of all images
image_path = os.path.join(img_dir+"/images",'*.bmp')
files = glob.glob(image_path)
images = []
masks = []
contours = []
indexes = []
files_names = []
for f1 in np.sort(files):
img = cv2.imread(f1)
result = re.search('original_cropped_(.*).bmp', str(f1))
idx = result.group(1)
mask_path = img_dir+"/masks/mask_cropped_"+str(idx)+".bmp"
mask = cv2.imread(mask_path,0)
contour_path = img_dir+"/contours/contour_cropped_"+str(idx)+".bmp"
contour = cv2.imread(contour_path,0)
indexes.append(idx)
images.append(img)
masks.append(mask)
contours.append(contour)
train_df = pd.DataFrame({"id":indexes,"masks": masks, "images": images,"contours": contours })
train_df.sort_values(by="id",ascending=True,inplace=True)
print(train_df.shape)
img_size_target = (256, 256)

ids_train, ids_valid, x_train, x_valid, y_train, y_valid, c_train, c_valid = train_test_split(
    train_df.index.values,
    np.array(train_df.images.apply(lambda x: cv2.resize(x, img_size_target).reshape(img_size_target[0], img_size_target[1], 3))),
    np.array(train_df.masks.apply(lambda x: cv2.resize(x, img_size_target).reshape(img_size_target[0], img_size_target[1], 1))),
    np.array(train_df.contours.apply(lambda x: cv2.resize(x, img_size_target).reshape(img_size_target[0], img_size_target[1], 1))),
    test_size=0.2, random_state=1337)
#Here we define the model architecture...
#.....
#End of model definition
# Training
optimizer = Adam(lr=1e-3,decay=1e-10)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
early_stopping = EarlyStopping(patience=10, verbose=1)
model_checkpoint = ModelCheckpoint("./keras.model", save_best_only=True, verbose=1)
reduce_lr = ReduceLROnPlateau(factor=0.5, patience=5, min_lr=0.00001, verbose=1)
epochs = 200
batch_size = 32
history = model.fit(x_train, y_train,
                    validation_data=[x_valid, y_valid],
                    epochs=epochs,
                    batch_size=batch_size,
                    callbacks=[early_stopping, model_checkpoint, reduce_lr])
What I would like to know is: how can I modify my code so that it trains on small batches of images without loading all the other 10,000 into memory? In other words, the algorithm should read X images from the directory each epoch, train on them, and then move on to the next X until the last one.
X here would be a reasonable number of images that fits into memory.
Use fit_generator instead of fit:
def generate_batch_data(num):
    while True:
        # load the next num images (and their targets) from disk here
        yield images, labels

model.fit_generator(generate_batch_data(X),
                    samples_per_epoch=10000, nb_epoch=10)
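For the data in the question, a rough sketch of such a generator could look like the following; the chunk size, the resize to 256x256, and the choice to yield (image, mask) pairs are assumptions, and it uses the Keras 2 argument names steps_per_epoch/epochs:

import glob
import re
import numpy as np
import cv2

def image_generator(files, img_dir, batch_size=32, img_size=(256, 256)):
    while True:  # fit_generator expects an endless stream of batches
        for start in range(0, len(files), batch_size):
            batch_images, batch_masks = [], []
            for f1 in files[start:start + batch_size]:
                idx = re.search('original_cropped_(.*).bmp', str(f1)).group(1)
                img = cv2.resize(cv2.imread(f1), img_size)
                mask = cv2.resize(cv2.imread(img_dir + "/masks/mask_cropped_" + idx + ".bmp", 0), img_size)
                batch_images.append(img)
                batch_masks.append(mask.reshape(img_size[0], img_size[1], 1))
            yield np.array(batch_images), np.array(batch_masks)

files = np.sort(glob.glob(img_dir + "/images/*.bmp"))
model.fit_generator(image_generator(files, img_dir, batch_size=32),
                    steps_per_epoch=len(files) // 32,
                    epochs=epochs,
                    callbacks=[early_stopping, model_checkpoint, reduce_lr])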
Alternatively, you could use train_on_batch instead of fit.
Discussion on GitHub about this topic: https://github.com/keras-team/keras/issues/2708
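A minimal sketch of the train_on_batch alternative, assuming a hypothetical load_images(start, count) helper that returns a chunk of images and labels as numpy arrays:

chunk_size = 1000   # number of images that fits comfortably in memory
num_images = 10000

for epoch in range(epochs):
    for start in range(0, num_images, chunk_size):
        # load_images is a hypothetical helper; replace it with your own loading code
        x_chunk, y_chunk = load_images(start, chunk_size)
        for i in range(0, len(x_chunk), batch_size):
            loss = model.train_on_batch(x_chunk[i:i + batch_size],
                                        y_chunk[i:i + batch_size])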
np.array(train_df.images.apply(lambda x:cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],3)))
You can first apply this filter (and the two others) to each individual file and save the results to dedicated folders (images_preproc, masks_preproc, etc.) in a separate script, then load them back already ready for use in the current script.
Assuming the actual image dimensions are greater than 256x256, you will have a faster algorithm that uses less memory, at the cost of a single preparation phase.
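A minimal sketch of such a one-off preparation script, assuming the same folder layout as in the question (the images_preproc output folder name is just an example):

import glob
import os
import cv2

img_size_target = (256, 256)
out_dir = "TrainingSet/images_preproc"   # assumed output folder
os.makedirs(out_dir, exist_ok=True)

for f1 in glob.glob("TrainingSet/images/*.bmp"):
    img = cv2.resize(cv2.imread(f1), img_size_target)
    # write the already-resized image; repeat the same idea for masks and contours
    cv2.imwrite(os.path.join(out_dir, os.path.basename(f1)), img)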
Related
I am in the process of training LSTM neural networks that are meant to predict quintiles of stock price distributions. Since I would like to train the model not just on one stock but on a sample of 500, I wrote the training loop below, which fits the model to each stock, saves the model parameters, and loads the parameters again when training on the next stock. My question is whether I can write the code in a for loop like below, or whether I could just as well use a complete dataset containing all 500 stocks, with the data concatenated along the 0 axis.
The idea is that the model iterates over each stock, the best model is then saved by the checkpoint callback, and it is reloaded again for fitting the next stock.
This is the training loop I would like to use:
def compile_and_fit(model_type, model, checkpoint_path, config, stock_data, macro_data, factor_data, patience, batch_size,
                    num_epochs, train_set_ratio, val_set_ratio, Y_name):
    """
    model = NN model,
    data = stock data, factor data, macro data,
    batch_size = timesteps per batch
    alpha adam = learning rate optimizer
    data set ratios = train_set_ratio, val_set_ratio (eg. 0.5)
    """
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor='loss',  #'loss'
        patience=patience,
        mode='min')

    cp_callback = tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path,
        monitor='loss',
        verbose=True,
        save_best_only=True,
        save_freq=batch_size,
        mode='min')

    permno_list = stock_data.permno.unique()
    test_data = pd.DataFrame()
    counter = 0

    for p in permno_list:
        # checkpoints
        if counter == 0:
            trained_model = model
            cp_callback = cp_callback
        else:
            trained_model = tf.keras.models.load_model(checkpoint_path)
            cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='loss', verbose=True, save_best_only=True, save_freq=batch_size, mode='min')

        stock_data_length = len(stock_data.loc[stock_data.permno==p])
        train_data_stocks = stock_data.loc[stock_data.permno==p][0:int(stock_data_length*train_set_ratio)]
        val_data_stocks = stock_data.loc[stock_data.permno==p][int(stock_data_length*train_set_ratio):int(stock_data_length*(val_set_ratio+train_set_ratio))]
        test_data_stocks = stock_data.loc[stock_data.permno==p][int(stock_data_length*(val_set_ratio+train_set_ratio)):]
        test_data = pd.concat([test_data, test_data_stocks], axis=0)

        train_date_index = train_data_stocks.index.values.tolist()
        val_date_index = val_data_stocks.index.values.tolist()

        train_data_factors = factor_data.loc[factor_data.index.isin(train_date_index)]
        train_data_macro = macro_factors.loc[macro_factors.index.isin(train_date_index)]
        train_data_macro_norm = train_data_macro.copy(deep=True)
        for c in train_data_macro_norm.columns:
            train_data_macro_norm[c] = MinMaxScaler([-1,1]).fit_transform(pd.DataFrame(train_data_macro_norm[c]))
        train_data_merged = pd.concat([train_data_factors, train_data_macro_norm], axis=1)

        val_data_factors = factor_data.loc[factor_data.index.isin(val_date_index)]
        val_data_macro = macro_factors.loc[macro_factors.index.isin(val_date_index)]
        val_data_macro_norm = val_data_macro.copy(deep=True)
        for c in val_data_macro_norm.columns:
            val_data_macro_norm[c] = MinMaxScaler([-1,1]).fit_transform(pd.DataFrame(val_data_macro_norm[c]))
        val_data_merged = pd.concat([val_data_factors, val_data_macro_norm], axis=1)

        if model_type=='combined':
            x_train_factors = []
            x_train_macro = []
            y_train = []
            for i in range(batch_size, len(train_data_factors)):
                x_train_factors.append(train_data_factors.values[i-batch_size:i,:])
                x_train_macro.append(train_data_macro_norm.values[i-batch_size:i,:])
                y_train.append(train_data_stocks[Y_name].values[i])
            x_train_factors, x_train_macro, y_train = np.array(x_train_factors), np.array(x_train_macro), np.array(y_train)

            x_val_factors = []
            x_val_macro = []
            y_val = []
            for i in range(batch_size, len(val_data_factors)):
                x_val_factors.append(val_data_factors.values[i-batch_size:i,:])
                x_val_macro.append(val_data_macro_norm.values[i-batch_size:i,:])
                y_val.append(val_data_stocks[Y_name].values[i])
            x_val_factors, x_val_macro, y_val = np.array(x_val_factors), np.array(x_val_macro), np.array(y_val)

            score = trained_model.evaluate([x_train_macro, x_train_factors], y_train, batch_size=batch_size)
            score = list(score)
            score.sort(reverse=True)
            score = score[-2]
            cp_callback.best = score

            trained_model.fit(x=[x_train_macro, x_train_factors], y=y_train, batch_size=batch_size, epochs=num_epochs,
                              validation_data=[[x_val_macro, x_val_factors], y_val], callbacks=[early_stopping, cp_callback])

        if model_type=='merged':
            x_train_merged = []
            y_train = []
            for i in range(batch_size, len(train_data_merged)):
                x_train_merged.append(train_data_merged.values[i-batch_size:i,:])
                y_train.append(train_data_stocks[Y_name].values[i])
            x_train_merged, y_train = np.array(x_train_merged), np.array(y_train)

            x_val_merged = []
            y_val = []
            for i in range(batch_size, len(val_data_merged)):
                x_val_merged.append(val_data_merged.values[i-batch_size:i,:])
                y_val.append(val_data_stocks[Y_name].values[i])
            x_val_merged, y_val = np.array(x_val_merged), np.array(y_val)

            score = trained_model.evaluate(x_train_merged, y_train, batch_size=batch_size)
            score = list(score)
            score.sort(reverse=True)
            score = score[-2]
            cp_callback.best = score

            trained_model.fit(x=x_train_merged, y=y_train, batch_size=batch_size, epochs=num_epochs,
                              validation_data=[x_val_merged, y_val], callbacks=[early_stopping, cp_callback])

    return trained_model, test_data
If someone has an idea whether this works or not, I would be incredibly grateful!
In my testing I could see the MSE constantly decreasing; however, when the loop continues with the next stock, the MSE starts from a very high value again.
According to this answer
How can I use multiple datasets with one model in Keras?
you can repeatedly fit the same model on more datasets.
If you want to save the model and load it at each iteration, that should also work, with the caveat that you lose the optimizer state (see Loading a trained Keras model and continue training).
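As a rough sketch of the first option (one model object fitted repeatedly, without saving and loading in between, so the optimizer state is kept), assuming a hypothetical prepare_data(p) helper that builds the arrays for one stock:

for p in permno_list:
    # prepare_data is a hypothetical helper that returns the arrays for one stock
    x_train, y_train, x_val, y_val = prepare_data(p)
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=num_epochs,
              validation_data=(x_val, y_val),
              callbacks=[early_stopping, cp_callback])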
I'm trying to apply data augmentation to a dataset. I use the following code:
train_generator = keras.utils.image_dataset_from_directory(
    directory=train_dir,
    subset="training",
    image_size=(50, 50),
    batch_size=32,
    validation_split=0.3,
    seed=1337,
    labels="inferred",
    label_mode='binary'
)

validation_generator = keras.utils.image_dataset_from_directory(
    subset="validation",
    directory=validation_dir,
    image_size=(50, 50),
    batch_size=40,
    seed=1337,
    validation_split=0.3,
    labels="inferred",
    label_mode='binary'
)

data_augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomZoom(0.1),
])
train_dataset = train_generator.map(lambda x, y: (data_augmentation(x, training=True), y))
But when I try to run the training process using this method, I get an "insufficient data" warning:
6/100 [>.............................] - ETA: 21s - loss: 0.7602 - accuracy: 0.5200WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 10 batches). You may need to use the repeat() function when building your dataset.
Yes, the original dataset is insufficient, but the data augmentation should provide more than enough data for training.
Does anyone know what's going on?
EDIT:
fit call:
history = model.fit(
    train_dataset,
    epochs=20,
    steps_per_epoch=100,
    validation_data=validation_generator,
    validation_steps=10,
    callbacks=callbacks_list)
This is the version I have using ImageDataGenerator:
train_datagen = keras.preprocessing.image.ImageDataGenerator(rescale =1/255,rotation_range = 40,width_shift_range = 0.2,height_shift_range = 0.2,shear_range = 0.2,zoom_range = 0.2,horizontal_flip = True)
train_generator = train_datagen.flow_from_directory(directory= train_dir,target_size = (50,50),batch_size = 32,class_mode = 'binary')
val_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1/255)
validation_generator = val_datagen.flow_from_directory(directory=validation_dir,target_size=(50,50),batch_size =40,class_mode ='binary')
This specific code (with the same number of epochs, steps_per_epoch, and batch size) was taken from the book Deep Learning with Python by François Chollet; it's an example of a data augmentation setup on page 141. As you may have guessed, it produces the same result as the other method shown above.
When we state that data augmentation increases the number of instances, we usually understand that an altered version of a sample would be created for the model to process. It's just image preprocessing with randomness.
If you look closely at your training log, you will find the cause, shown below. The main issue with your approach is discussed in this post.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.
So, to solve this, we can use the .repeat() function. To understand what it does, you can check this answer. Here is sample code that should work for you.
train_ds = keras.utils.image_dataset_from_directory(
    ...
)
train_ds = train_ds.map(
    lambda x, y: (data_augmentation(x, training=True), y)
)
val_ds = keras.utils.image_dataset_from_directory(
    ...
)
# note: grab the number of batches before calling repeat(),
# because repeat() makes the dataset cardinality infinite
train_steps = train_ds.cardinality().numpy()
val_steps = val_ds.cardinality().numpy()

# using the .repeat function
train_ds = train_ds.repeat().shuffle(8 * batch_size)
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.repeat()
val_ds = val_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# specify steps per epoch
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=..,
    steps_per_epoch=train_steps,
    validation_steps=val_steps,
)
I have a custom file containing the paths to all my images and their labels which I load in a dataframe using:
MyIndex=pd.read_table('./MySet.txt')
MyIndex has two columns of interest ImagePath and ClassName
Next I do a train/test split and encode the output labels as follows:
images = []
for index, row in MyIndex.iterrows():
    img_path = basePath + row['ImageName']
    img = image.load_img(img_path, target_size=(299, 299))
    img_path = None
    img_data = image.img_to_array(img)
    img = None
    images.append(img_data)
    img_data = None
images[0].shape
Classes=Sample['ClassName']
OutputClasses=Classes.unique().tolist()
labels=Sample['ClassName']
images=np.array(images, dtype="float") / 255.0
(trainX, testX, trainY, testY) = train_test_split(images,labels, test_size=0.10, random_state=42)
trainX, valX, trainY, valY = train_test_split(trainX, trainY, test_size=0.10, random_state=41)
images=None
labels=None
encoder = LabelEncoder()
encoder=encoder.fit(OutputClasses)
encoded_Y = encoder.transform(trainY)
# convert integers to dummy variables (i.e. one hot encoded)
trainY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(valY)
# convert integers to dummy variables (i.e. one hot encoded)
valY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(testY)
# convert integers to dummy variables (i.e. one hot encoded)
testY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
datagen=ImageDataGenerator(rotation_range=90,horizontal_flip=True,vertical_flip=True,width_shift_range=0.25,height_shift_range=0.25)
datagen.fit(trainX,augment=True)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
batch_size=128
model.fit_generator(datagen.flow(trainX, trainY, batch_size=batch_size), epochs=500,
                    steps_per_epoch=trainX.shape[0]//batch_size, validation_data=(valX, valY))
The problem I face is that the data loaded in one go is too large to fit in the current machine's memory, so I am unable to work with the complete dataset.
I have tried to work with the data generator, but I do not want to follow the directory conventions it expects, and I also cannot drop the augmentation part.
The question is: is there a way to load batches from disk while satisfying the two stated conditions?
I believe you should have a look at this post
What you are looking for is Keras flow_from_dataframe, which lets you load batches from disk by providing the names of your files and their labels in a dataframe, along with a top-level directory path that contains all your images.
Making a few modifications to your code and borrowing some from the link shared:
MyIndex=pd.read_table('./MySet.txt')
Classes=MyIndex['ClassName']
OutputClasses=Classes.unique().tolist()
trainDf=MyIndex[['ImageName','ClassName']]
train, test = train_test_split(trainDf, test_size=0.10, random_state=1)
#creating a data generator to load the files on runtime
traindatagen = ImageDataGenerator(rotation_range=90, horizontal_flip=True, vertical_flip=True, width_shift_range=0.25, height_shift_range=0.25,
                                  validation_split=0.1)

train_generator = traindatagen.flow_from_dataframe(
    dataframe=train,
    directory=basePath,  # the directory containing all your images
    x_col='ImageName',
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=batch_size,
    subset='training'
)

#Also a generator for the validation data
val_generator = traindatagen.flow_from_dataframe(
    dataframe=train,
    directory=basePath,  # the directory containing all your images
    x_col='ImageName',
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=batch_size,
    subset='validation'
)
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=val_generator.n//val_generator.batch_size
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(generator=train_generator, steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=val_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=500)
Also note that you no longer need to encode the labels as in your original code, and you can omit the image-loading loop.
I have not tried this code myself, so fix any bugs you may encounter; the primary focus was to give you the basic idea.
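If you do still need the class-to-index mapping later (for example, to interpret predictions), the generator exposes it; a small sketch of recovering it might look like this:

# mapping from class name to the index used in the one-hot labels
print(train_generator.class_indices)

# invert it to map predicted indices back to class names
idx_to_class = {v: k for k, v in train_generator.class_indices.items()}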
In response to your comment:
If you have the files in different directories, then one solution would be to have your ImageName column store the relative path including the intermediate directory, something like './Dir/File.jpg', then move all the directories into one folder, use that folder as the base path, and everything else stays the same.
Also, looking at the code segment that loads the files, it looks like you already have file paths stored in the ImageName column, so the suggested approach should work for you.
images = []
for index, row in MyIndex.iterrows():
    img_path = basePath + row['ImageName']
    img = image.load_img(img_path, target_size=(299, 299))
    img_path = None
    img_data = image.img_to_array(img)
    img = None
    images.append(img_data)
    img_data = None
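For the multiple-directories case mentioned above, a small sketch of the idea (the folder names here are made up):

# assume the ImageName column already stores paths relative to basePath,
# e.g. 'Dir1/File1.jpg', 'Dir2/File2.jpg', ...
train_generator = traindatagen.flow_from_dataframe(
    dataframe=train,
    directory=basePath,   # top-level folder containing Dir1, Dir2, ...
    x_col='ImageName',    # relative paths are resolved against `directory`
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=batch_size)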
If any ambiguity remains, feel free to ask again.
I think the simplest way to do this would be to just load part of your images for each generator and repeatedly call .fit_generator() with that smaller set.
An earlier version of this example used random.random() to choose which images to load; we can just as well use a start index and page size, as in the revised version below, to loop over the list of images forever.
import itertools

def load_images(start_index, page_size):
    images = []
    for index in range(page_size):
        # Generate the index using modulo to loop over the list forever
        index = (start_index + index) % len(MyIndex)
        row = MyIndex.iloc[index]
        img_path = basePath + row["ImageName"]
        img = image.load_img(img_path, target_size=(299, 299))
        img_data = image.img_to_array(img)
        images.append(img_data)
    return images
def generate_datagen(batch_size, start_index, page_size):
    images = load_images(start_index, page_size)

    # ... everything else you need to get from images to trainX and trainY, etc. here ...

    datagen = ImageDataGenerator(
        rotation_range=90,
        horizontal_flip=True,
        vertical_flip=True,
        width_shift_range=0.25,
        height_shift_range=0.25,
    )
    datagen.fit(trainX, augment=True)

    return (
        trainX,
        trainY,
        valX,
        valY,
        datagen.flow(trainX, trainY, batch_size=batch_size),
    )
model.compile(
    loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

page_size = 500  # load 500 images at a time; change this as suitable for your memory condition

for page in itertools.count():  # Count from zero to forever.
    batch_size = 128
    trainX, trainY, valX, valY, generator = generate_datagen(
        128, page * page_size, page_size
    )
    model.fit_generator(
        generator,
        epochs=5,
        steps_per_epoch=trainX.shape[0] // batch_size,
        validation_data=(valX, valY),
    )
    # TODO: add a `break` clause with a suitable condition
If you want to load from disk, it is convenient to do so with the ImageDataGenerator that you already used.
There are two ways to do it: by pointing flow_from_directory at the directory of the data, or alternatively by using flow_from_dataframe with a Pandas dataframe (see the sketch below).
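A minimal sketch of both options, assuming a directory layout with one subfolder per class and a dataframe with ImageName/ClassName columns (the paths and column names are just placeholders):

datagen = ImageDataGenerator(rescale=1/255, horizontal_flip=True)

# option 1: one subfolder per class under train_dir
gen_from_dir = datagen.flow_from_directory(
    directory=train_dir,
    target_size=(299, 299),
    class_mode='categorical',
    batch_size=32)

# option 2: file names and labels come from a dataframe
gen_from_df = datagen.flow_from_dataframe(
    dataframe=MyIndex,
    directory=basePath,
    x_col='ImageName',
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=32)

model.fit_generator(gen_from_dir, steps_per_epoch=gen_from_dir.n // 32, epochs=10)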
If you have a list of paths instead, you can use a custom generator that yields batches of images. Here is a stub:
def load_image_from_path(path):
    "Loading and preprocessing"
    ...

def my_generator():
    length = df.shape[0]
    for i in range(0, length, batch_size):
        batch = df.loc[i:min(i + batch_size, length - 1)]
        x, y = map(load_image_from_path, batch['ImageName']), batch['ClassName']
        yield x, y
Note: fit_generator also has an additional argument, validation_data, which can itself be a generator for, well, you guessed it, validation.
One option is to pass each generator the indices it should draw from in order to split train and test (assuming the data is shuffled; if not, check this out); see the sketch below.
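A rough sketch of that index-based split, assuming the my_generator stub above is adapted to take a list of row indices (the 90/10 split is arbitrary):

import numpy as np

indices = np.random.permutation(df.shape[0])   # shuffle the row indices once
split = int(0.9 * len(indices))
train_idx, val_idx = indices[:split], indices[split:]

def my_generator(row_indices):
    while True:  # loop forever, as fit_generator expects
        for i in range(0, len(row_indices), batch_size):
            batch = df.iloc[row_indices[i:i + batch_size]]
            x = np.array([load_image_from_path(p) for p in batch['ImageName']])
            y = batch['ClassName'].values
            yield x, y

model.fit_generator(my_generator(train_idx),
                    steps_per_epoch=len(train_idx) // batch_size,
                    validation_data=my_generator(val_idx),
                    validation_steps=len(val_idx) // batch_size,
                    epochs=10)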
I am using ImageDataGenerator to generate new augmented images and extract bottleneck features from a pretrained model, but most of the tutorials I see on Keras sample the same number of training samples as there are images in the directory.
train_generator = train_datagen.flow_from_directory(
    train_path,
    target_size=image_size,
    shuffle="false",
    class_mode='categorical',
    batch_size=1)

bottleneck_features_train = model.predict_generator(
    train_generator, 2 * nb_train_samples // batch_size)
Suppose I want 2 times more images from the above code; how can I get the corresponding class labels for the features extracted from the bottleneck layer, given that the images and labels are yielded as tuples by train_generator?
Shouldn't the code in training_generator.py at line 422,
x, _ = generator_output
do something like this:
x, y = generator_output
and return the tuple ([np.concatenate(out) for out in all_outs], y) from predict_generator, i.e. return the corresponding class labels along with the predicted features all_outs, since there is no way to get the corresponding labels without running the generator twice?
If you're using predict, normally you simply don't want Y, because Y will be the result of the prediction. (You're not training, so you don't need the true labels)
But you can do it yourself:
bottleneck = []
labels = []

for i in range(2 * nb_train_samples // batch_size):
    x, y = next(train_generator)
    bottleneck.append(model.predict(x))
    labels.append(y)

bottleneck = np.concatenate(bottleneck)
labels = np.concatenate(labels)
If you want it with indexing (if your generator supports that):
#...
for epoch in range(2):
    for i in range(nb_train_samples // batch_size):
        x, y = train_generator[i]
        #...
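Generators created with flow_from_directory support this kind of indexing because they implement keras.utils.Sequence. As a rough sketch of what a custom indexable "generator" could look like (the x/y arrays here are placeholders):

import numpy as np
from keras.utils import Sequence

class BottleneckSequence(Sequence):
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, i):
        # enables generator[i]-style indexing
        s = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x[s], self.y[s]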
I'm new to Keras. I built an autoencoder and trained it on part of the Diabetes dataset. Then I use the Keras checkpointer to save the weights so that I can load them later in order to perform some operations on the encoded data vectors (calculating the mean of the encoded data to extract a class representation).
The problem
When I load the weights and then compute the encoded data, I get different results each time I run the code. After training the autoencoder, I comment out the compile and fit statements so that I do not repeat the training process each time I run the code.
Here is the code :
checkpointer = ModelCheckpoint(filepath="weights.best.h5",
                               verbose=0,
                               save_best_only=True,
                               save_weights_only=True)

tensorboard = TensorBoard(log_dir='/tmp/autoencoder',
                          histogram_freq=0,
                          write_graph=True,
                          write_images=True)
input_enc = Input(shape=(input_size,))
hidden_1 = Dense(hidden_size1, activation='relu')(input_enc)
hidden_11 = Dense(hidden_size2, activation='relu')(hidden_1)
code = Dense(code_size, activation='relu')(hidden_11)
hidden_22 = Dense(hidden_size2, activation='relu')(code)
hidden_2 = Dense(hidden_size1, activation='relu')(hidden_22)
output_enc = Dense(input_size, activation='tanh')(hidden_2)
autoencoder_yes = Model(input_enc, output_enc)
autoencoder_yes.compile(optimizer='adam',
loss='mean_squared_error',
metrics=['accuracy'])
history_yes = autoencoder_yes.fit(df_noyau_norm_y, df_noyau_norm_y,
                                  epochs=200,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  validation_data=(df_test_norm_y, df_test_norm_y),
                                  verbose=1,
                                  callbacks=[checkpointer, tensorboard]).history
autoencoder_yes.save_weights("weights.best.h5")
autoencoder_yes.load_weights("weights.best.h5")
encoder_yes = Model (inputs = input_enc,outputs = code)
encoded_input = Input(shape=(code_size, ))
encoded_data_yes = encoder_yes.predict(df_noyau_norm_y)
print(encoded_data_yes.tolist())
X_YES= sum(encoded_data_yes) / 7412
print (X_YES)
Can anybody help me find out the reason and how to resolve this issue?
Thanks