Memory Issues Using Keras Convolutional Network - python

I am very new to ML with big data. I have played with the generic Keras convolutional examples for dog/cat classification before; however, when applying a similar approach to my set of images, I run into memory issues.
My dataset consists of very long images that are 10048 x 1687 pixels in size. To circumvent the memory issues, I am using a batch size of 1, feeding one image at a time to the model.
The model has two convolutional layers, each followed by max-pooling, which together leave the flattened layer with roughly 290,000 inputs right before the fully-connected layer.
Immediately after running, however, memory usage chokes at its limit (8 GB).
So my questions are the following:
1) What is the best approach to process computations of this size in Python locally (no cloud utilization)? Are there additional Python libraries that I need to utilize?

Check out what yield does in Python and the idea of generators. You do not need to load all of your data at the beginning. You should make your batch_size just small enough that you do not get memory errors.
Your generator can look like this:
def generator(fileobj, labels, batch_size, memory_one_pic=1024):
    # amount_of_datasets must be defined elsewhere: the total number of samples
    start = 0
    end = start + batch_size
    while True:
        X_batch = fileobj.read(memory_one_pic * batch_size)
        y_batch = labels[start:end]
        start += batch_size
        end += batch_size
        if not X_batch:
            break
        if start >= amount_of_datasets:
            start = 0
            end = batch_size
        yield (X_batch, y_batch)
...later when you already have your architecture ready...
train_generator = generator(open('traindata.csv', 'rb'), labels, batch_size)
train_steps = amount_of_datasets // batch_size + 1

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs)
You should also read about batch normalization, which generally helps the model learn faster and reach better accuracy.
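A minimal sketch of where it would go, assuming a toy conv block rather than your actual architecture (layer sizes here are placeholders):

from keras.models import Model
from keras.layers import Input, Conv2D, BatchNormalization, Activation, MaxPooling2D

inp = Input(shape=(1687, 10048, 3))   # height x width x channels of one of your long images
x = Conv2D(32, (3, 3))(inp)
x = BatchNormalization()(x)           # normalize activations between the conv and the ReLU
x = Activation('relu')(x)
x = MaxPooling2D((2, 2))(x)
model = Model(inp, x)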

While using fit_generator(), you should also set the max_q_size parameter. It is set to 10 by default, which means you are loading 10 batches while using only 1 (since fit_generator() was designed to stream data from outside sources that can be delayed, like a network, not to save memory). I'd recommend setting max_q_size=1 for your purposes.
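For example (a sketch reusing train_generator and train_steps from the answer above; in newer Keras versions the argument is named max_queue_size):

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs,
                    max_q_size=1)   # keep only one batch waiting in the queue to limit memory use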

Related

GPU's not active during validation in Keras

During training I can see that the GPUs are both active and running through the data. Then once training is complete, I see the GPU activity drop to 0 and the CPU usage drops a little too. Could this have something to do with the way I am generating training data?
I have custom data generators feeding the model:
train_df, val_df = np.split(dataframe, [int(.8 * len(dataframe.index))])
trainbatches = math.ceil(len(train_df.index) / batchsize)
valbatches = math.ceil(len(val_df.index) / 1024)

train_gen1 = MixedGenerator(train_df, batchsize, scaler)  # tensor Sequence
enqueued_train_gen = queue_generator(train_gen1)
val_gen1 = MixedGenerator(val_df, 1024, scaler)
enqueued_val_gen = queue_generator(val_gen1)

strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
print('numdata shape', (numdata.shape[1],))

with strategy.scope():
    model = build_mixedinput_model(ir_imgs_shape=(120, 160, 1),
                                   num_data_shape=(numdata.shape[1],),
                                   opt=optimizer,
                                   lossmet=lossmet,
                                   dropout=dropout)

history = model.fit(enqueued_train_gen,
                    steps_per_epoch=trainbatches,
                    validation_data=enqueued_val_gen,
                    validation_steps=valbatches,
                    verbose=1,
                    epochs=epochs,
                    callbacks=clbks)
The training generator feeds batches of a predetermined size, but the validation generator feeds batches of size 1024. This is because I am training on small batches but want the validation to go faster. My understanding is that this should work, since batch size shouldn't matter for validation.
Is this normal behavior, and is there a better practice I could be using? The validation still takes place, though it takes some time and, as I said, does not make use of the GPUs.
Thanks in advance
UPDATE:
This only happens in the first epoch of each run; the subsequent epochs run validation extremely quickly. I am still curious why the GPUs and CPU are not engaged during the validation process, though.
The first epoch takes more time because the code is being compiled into a graph. This includes the model function, the loss function, validation metrics, and others. The compilation process can take some time, but it is only done once. That is why you see faster results after the first epoch.
Here are the lines in the tensorflow.keras source that compile the training step and the validation step. Those lines essentially call tf.function on the train and validation steps.
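A small self-contained illustration of that one-time tracing cost (a toy tf.function, not the questioner's model):

import time
import tensorflow as tf

@tf.function
def step(x):
    return tf.reduce_sum(tf.square(x))   # stand-in for a train/validation step

x = tf.random.normal((2048, 2048))

t0 = time.time(); step(x); print("first call (tracing + graph build):", time.time() - t0)
t0 = time.time(); step(x); print("second call (cached graph):", time.time() - t0)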

Overfitting - huge difference between training and validation accuracy

I have a dataset of 180k images on which I try to recognize the characters (license plate recognition). All of these license plates contain seven characters, and 35 characters are possible, so the output vector y is of shape (7, 35). I therefore one-hot encoded every license plate label.
I applied the bottom of the EfficientNet-B0 model (https://keras.io/api/applications/efficientnet/#efficientnetb0-function) together with a customized top, which is divided into 7 branches (because of the seven characters per license plate). I used the ImageNet weights and froze the bottom layers of efnB0_model:
def create_model(input_shape=(224, 224, 3)):
    input_img = Input(shape=input_shape)
    model = efnB0_model(input_img)
    model = GlobalAveragePooling2D(name='avg_pool')(model)
    model = Dropout(0.2)(model)
    backbone = model

    branches = []
    for i in range(7):
        branches.append(backbone)
        branches[i] = Dense(360, name="branch_"+str(i)+"_Dense_16000")(branches[i])
        branches[i] = BatchNormalization()(branches[i])
        branches[i] = Activation("relu")(branches[i])
        branches[i] = Dropout(0.2)(branches[i])
        branches[i] = Dense(35, activation="softmax", name="branch_"+str(i)+"_output")(branches[i])

    output = Concatenate(axis=1)(branches)
    output = Reshape((7, 35))(output)
    model = Model(input_img, output)
    return model

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
For training and validating the model I only use 10,000 training images and 3,000 validation images, because of the big size of my model and the huge amount of data, which would make my training very, very slow.
I use this DataGenerator to feed batches to my model:
class DataGenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1./255
        batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_y = np.array(batch_y)
        return batch_x, batch_y
I fit the model using this code:
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 32,
                    validation_steps=num_val_samples // 32,
                    epochs=10, workers=6, use_multiprocessing=True)
Now, after several epochs of training, I observed big differences between training accuracy and validation accuracy. I think one reason for that is the small amount of data. Which other factors influence this overfitting in my model? Do you think there is something completely wrong with my code/model? Do you think the model is too big and complex, or is it maybe due to the preprocessing of the data?
Note: I already experimented with data augmentation and tried the model without transfer learning. That leads to poor results on training AND validation data. So, is there anything I could do additionally?
First a disclaimer
Are you sure that this is the correct approach to follow? EfficientNet is a model created for image recognition, while your task demands correct localization of 7 characters in one image, recognition of each of them, and also keeping the order of the characters. Maybe an approach of detection + segmentation followed by recognition, like in this medium post, is more efficient (pun intended). Even though I think this is very likely your real problem, I will try to answer your original question.
Now some general tips regarding overfitting
There's a very good guide here in Keras documentation on how to use EfficientNet for transfer learning. I will try to summarize some tips here.
From your question it seems that you are not even doing the fine-tuning step, which is essential for the network to learn the task better (see the sketch after these tips).
Now, after several epochs of training, I observed big differences regarding training accuracy and validation accuracy.
By several, do you mean how many epochs? From the image you put in the question, I think the second complete epoch is too soon to infer that your model is overfitting. Also, judging from the code (10 epochs) and from the image you posted (20 epochs), I would train for more epochs, like 40.
Increase the dropout. Try some configurations like 30%, 40%, 50%.
Data augmentation in practice will increase the number of samples that you have. However, you have 180K images and are only using 10K of them; data augmentation is good, but when you have more images available, try using them first. From the guide I mentioned, it seems feasible to train with more images using this model and Google Colab, so try to increase the training set size. Still on the topic of augmentation, some transformations may be harmful to your task, like too much rotation or reflection, since you are trying to recognize numbers and letters.
Reducing the batch size to 16 may provide more regularization, which helps to fight overfitting. Speaking of regularization, try applying regularization to the dense layers that you are adding.
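A rough sketch of the fine-tuning step mentioned in the first tip, assuming efnB0_model is the frozen EfficientNetB0 base and model is the compiled multi-branch model from the question (the number of unfrozen layers and the learning rate are just starting points):

from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import Adam

def unfreeze_top_layers(model, efnB0_model, num_layers=20):
    # Unfreeze the top layers of the pretrained base, but keep the
    # BatchNormalization layers frozen so their statistics stay stable.
    for layer in efnB0_model.layers[-num_layers:]:
        if not isinstance(layer, BatchNormalization):
            layer.trainable = True
    # Recompile with a much lower learning rate so the pretrained weights are only nudged.
    model.compile(optimizer=Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])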
EDIT:
After quickly reading the paper you linked, I reaffirm my point about the epochs, since in the paper the results are shown for 100 epochs. Also, from the charts in the paper, we cannot confirm that the author did not have overfitting either. Additionally, the changes to the Xception network are not clear at all. Changing the input layer has an impact on the dimensions of all other layers because of the way the convolution operation works, and this is not discussed in the paper. The operations performed to achieve that output dimension are not clear either. Besides what you did, I would suggest using a pooling layer to get the output dimensions that you want. Finally, the paper doesn't explain how the positioning of the plate is guaranteed. I would try to get more details about this paper you are trying to reproduce, to be sure that you are not missing anything in your model.
I have been working on a character detection + recognition problem for an industrial application. From my experience, using only a deep CNN and dense layers to predict character classes is not the best approach to this problem. There are good research papers on the scene text recognition problem; one common way to design a character recognition pipeline is:
Any deep CNN model, like VGG, ResNet, or EfficientNet, to extract the image features.
Then add some RNN layers on top of the CNN backbone to get a character sequence from the extracted features. This is a great plus if you want to predict a variable number of characters.
After getting the character sequence from the RNN layers, the next step is to decode it. For this you can use either a CTC-based method or an attention mechanism. Both methods have their own pros and cons: CTC-based methods are fast but their performance is a bit worse, while attention-based models give good results but are very slow. So the choice of method depends on your requirements.
The image below, from the well-known text recognition paper CRNN, gives a general idea of the above steps.
[image: CRNN pipeline overview from the paper]
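A very rough Keras sketch of steps 1-2 plus a CTC-ready output layer (input shape, layer sizes, and the alphabet size are illustrative assumptions; the CTC loss itself, e.g. via tf.keras.backend.ctc_batch_cost, still has to be wired up for training):

from tensorflow.keras import layers, Model

def build_crnn(num_chars=35, input_shape=(64, 256, 1)):
    inp = layers.Input(shape=input_shape)

    # 1) CNN backbone extracts a feature map from the plate image
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # Treat the width axis as the time axis: (H/4, W/4, 128) -> (W/4, H/4 * 128)
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((input_shape[1] // 4, (input_shape[0] // 4) * 128))(x)

    # 2) RNN layers model the character sequence along the width of the image
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # 3) Per-timestep character probabilities; the extra class is the CTC "blank"
    out = layers.Dense(num_chars + 1, activation="softmax")(x)
    return Model(inp, out)

model = build_crnn()
model.summary()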
For training the model, @Hemerson has given good suggestions. Try to build and train this type of model with multiple stages and I am sure you will get better results. :)
Best regards!

How to adapt the gpu batch size during training?

I found it surprising that I could not find any resources online on how to dynamically adapt the GPU batch size without halting training.
The idea is the following:
1) Have a training script that is (almost) agnostic to the GPU in use. The batch size will adjust dynamically without user interference or the need for tuning.
2) Still be able to specify the desired training batch size, even if it is too big to fit in the biggest known GPU.
For instance, let's say I want to train a model using a batch size of 4096 images, each image 1024x1024. Let's also say that I have access to a server with different NVIDIA GPUs, but I don't know which one will be assigned to me in advance. (Or that everybody wants to use the biggest GPU and I am left waiting a long time before it is my turn.)
I want my training script to find the max batch size (let's say it is 32 images per GPU batch), and only update the optimizer when all 4096 images have been processed (one training batch = 128 GPU batches).
There are different ways of solving this problem. But if specifying a GPU that can do the job, or using multiple GPUs, are not options, then it is handy to dynamically adapt the GPU batch size.
I prepared this repo with an illustrative training example in pytorch (it should work similarly in TensorFlow)
In the code below, the try/except is used to try different GPU batch sizes without halting training. When the batch becomes too large, it is downsized and the adaptation is turned off. Please check the repo for the implementation details and possible bug fixes.
The code also implements a technique called batch spoofing (gradient accumulation), which performs a number of forward/backward passes before updating the weights. In PyTorch it only requires moving optimizer.zero_grad() and optimizer.step() so they run once per accumulated batch.
import torch
import torchvision
import torch.optim as optim
import torch.nn as nn

# Example of how to use it with Pytorch
if __name__ == "__main__":

    # #############################################################
    # 1) Initialize the dataset, model, optimizer and loss as usual.

    # Initialize a fake dataset
    trainset = torchvision.datasets.FakeData(size=1_000_000,
                                             image_size=(3, 224, 224),
                                             num_classes=1000)

    # initialize the model, loss and SGD-based optimizer
    resnet = torchvision.models.resnet152(pretrained=True,
                                          progress=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(resnet.parameters(), lr=0.01)

    continue_training = True  # criteria to stop the training

    # #############################################################
    # 2) Set parameters for the adaptive batch size
    adapt = True  # while this is true, the algorithm will perform batch adaptation
    gpu_batch_size = 2  # initial gpu batch_size, it can be super small
    train_batch_size = 2048  # the desired training batch size

    # Modified training loop to allow for adaptive batch size
    while continue_training:

        # #############################################################
        # 3) Initialize dataloader and batch spoofing parameter
        # Dataloader has to be reinitialized for each new batch size.
        trainloader = torch.utils.data.DataLoader(trainset,
                                                  batch_size=int(gpu_batch_size),
                                                  shuffle=True)
        # Number of repetitions for batch spoofing
        repeat = max(1, int(train_batch_size / gpu_batch_size))

        try:  # This makes sure that training is not halted when the batch size is too large

            # #############################################################
            # 4) Epoch loop with batch spoofing
            optimizer.zero_grad()  # done before training because of batch spoofing.

            for i, (x, y) in enumerate(trainloader):
                y_pred = resnet(x)
                loss = criterion(y_pred, y)
                loss.backward()

                # batch spoofing
                if not i % repeat:
                    optimizer.step()
                    optimizer.zero_grad()

                # #############################################################
                # 5) Adapt batch size while no RuntimeError is raised.
                # Increase batch size and get out of the loop
                if adapt:
                    gpu_batch_size *= 2
                    break

                # Stopping criteria for training
                if i > 100:
                    continue_training = False

        # #############################################################
        # 6) After the largest batch size is found, the training progresses with the fixed batch size.
        # CUDA out of memory is a RuntimeError; we hit it the moment our batch size is too large.
        except RuntimeError as run_error:
            gpu_batch_size /= 2  # fall back to the biggest batch size that fits in memory
            adapt = False  # turn off the batch adaptation

            # Number of repetitions for batch spoofing
            repeat = max(1, int(train_batch_size / gpu_batch_size))

            # Manually check whether the RuntimeError was caused by CUDA or something else.
            print(f"---\nRuntimeError: \n{run_error}\n---\n Is it a cuda error?")
If you have code that can do something similar in TensorFlow, Caffe, or others, please share!
how to dynamically adapt the GPU batch size without halting training
There is a very similar question that uses random sampler for the job.
I will just add another option: DataLoader has a collate_fn that you could use for altering the batch size.
collate_fn (callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
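For reference, a minimal custom collate_fn (this just reproduces the standard merging step to show where you would hook in; it is a sketch, not a full dynamic-batch-size solution):

import torch
from torch.utils.data import DataLoader, TensorDataset

def my_collate(samples):
    # samples is a list of (x, y) pairs produced by the dataset
    xs = torch.stack([x for x, _ in samples])
    ys = torch.stack([y for _, y in samples])
    return xs, ys

dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))
loader = DataLoader(dataset, batch_size=8, collate_fn=my_collate)

for xs, ys in loader:
    print(xs.shape, ys.shape)   # torch.Size([8, 3, 32, 32]) torch.Size([8])
    break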

Memory efficient way of converting raw image data into features for a neural net

I'm working on a machine learning problem and one of the first steps in my pipeline is to convert the raw data into features. Since I'm working with very large datasets I constantly run into memory issues. These are the steps I follow - I'd like to know if there are some things that are fundamentally wrong with the approach. For context, I'm working with 10,000s of images on a Google Cloud machine with 64GB ram.
1 - Create array to store features
Create a numpy array to store the features. The example below is for a feature array that will hold 14,000 image features, each of which has a height/width of 288/512 and 3 color channels.
x = np.zeros((14000, 288, 512, 3)) # 29316
2 - Read in raw images sequentially, process them, and put them into x
for idx, name in enumerate(raw_data_paths):
    image = functions.read_png(name)
    features = get_feature(image)
    x[idx] = features
3 - train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_fraction, random_state=42)
Questions
Am I approaching this completely incorrectly by using numpy arrays when there are more efficient storage mechanisms? I need to later use the data on a Keras neural net so working with numpy arrays has been convenient.
I tend to get issues with step (1) and step (3) above. For step 1, I sometimes cannot execute that line because I run out of memory. Interestingly, I have no issues on my slow local computer (which I'm guessing is using virtual memory), but I do get issues on my Linux Google Compute instance which has 64GB memory. How can I fix this issue?
For step (3) I sometimes run out of memory, and I imagine it's because when that line is executed I double memory needs (x_train, y_train, x_test, y_test together I would imagine require as much memory as x and y). Is there a way to do this step without doubling memory requirements?
1 - In Keras, you can either use a Python generator or a Keras Sequence for training. You then define the size of the batches.
You will train your model using fit_generator, passing the generator or the Sequence. Adjust the parameter max_queue_size to at most 1 (the queue is filled in parallel while the model works on a batch).
2 - Do you really need to work with 14000 at once? Can't you make smaller batches?
You may use np.empty instead of np.zeros.
3 - Splitting train and test data is just as easy as:
trainData = originalData[:someSize]
testData = originalData[someSize:]
Using generators or sequences
These are options for you to load your data in parts, and you can define these parts any way you want.
You can indeed save your data in smaller files to load each file per step.
Or you can also do the entire image preprocessing inside the generator, in small batches.
See this answer for a simple example of a generator: Training a Keras model on multiple feature files that are read in sequentially to save memory
You can create a generator from a list of image files, divide the list into batches of files, and at each step do the preprocessing:
def loadInBatches(batchSize, dataPaths):
    while True:
        for step in range(0, len(dataPaths), batchSize):
            x = np.empty((batchSize, 288, 512, 3))
            y = np.empty(???)

            for idx, name in enumerate(dataPaths[step:step+batchSize]):
                image = functions.read_png(name)
                features = get_feature(image)
                x[idx] = features
                y[idx] = ???

            yield (x, y)
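Training would then look something like this (a sketch; the step count and number of epochs are placeholders):

batchSize = 32
model.fit_generator(loadInBatches(batchSize, raw_data_paths),
                    steps_per_epoch=len(raw_data_paths) // batchSize,
                    epochs=10,
                    max_queue_size=1)   # avoid pre-loading many batches into memory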
I think a good solution, which can address all 3 questions (more or less), is to use TensorFlow. It gives you the possibility to create an input queue; you can find more information in Threading and Queues. This is an easy-to-use way to scale your training.
Since you want to use a neural net later on, I suggest you spend some time learning TF and its queues, since they are a very powerful tool.
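Note that in current TensorFlow the queue-based input pipelines have been superseded by the tf.data API, which serves the same purpose; a minimal sketch (the file pattern and parsing details are placeholders):

import tensorflow as tf

def parse(path):
    raw = tf.io.read_file(path)
    img = tf.io.decode_png(raw, channels=3)
    img = tf.image.resize(img, (288, 512)) / 255.0   # same target size as in the question
    return img

paths = tf.data.Dataset.list_files("images/*.png")   # placeholder pattern
dataset = (paths
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(1))   # stream batches instead of holding everything in memory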

The time to predict a single image using cifar10 network (of tensorflow) increase over time [duplicate]

I am training a CNN with TensorFlow for a medical imaging application.
As I don't have a lot of data, I am trying to apply random modifications to my training batch during the training loop to artificially increase my training dataset. I made the following function in a different script and call it on my training batch:
def randomly_modify_training_batch(images_train_batch, batch_size):
    for i in range(batch_size):
        image = images_train_batch[i]
        image_tensor = tf.convert_to_tensor(image)

        distorted_image = tf.image.random_flip_left_right(image_tensor)
        distorted_image = tf.image.random_flip_up_down(distorted_image)
        distorted_image = tf.image.random_brightness(distorted_image, max_delta=60)
        distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)

        with tf.Session():
            images_train_batch[i] = distorted_image.eval()  # .eval() converts the image from a Tensor back to an ndarray

    return images_train_batch
The code works well for applying modifications to my images.
The problem is :
After each iteration of my training loop (feedforward + backpropagation), applying this same function to my next training batch steadily takes 5 seconds longer than the last time.
It starts at around 1 second of processing and reaches over a minute of processing after a bit more than 10 iterations.
What causes this slowing?
How can I prevent it?
(I suspect something with distorted_image.eval(), but I'm not quite sure. Am I opening a new session each time? Isn't TensorFlow supposed to close the session automatically, since I use it in a "with tf.Session()" block?)
You call that code in each iteration, so each iteration you add these operations to the graph. You don't want to do that. You want to build the graph at the start and, in the training loop, only execute it. Also, why do you need to convert back to an ndarray afterwards, instead of putting things into your TF graph once and just using tensors all the way through?
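A sketch of that restructuring for the questioner's function, using the same TF1-style Session API as the question (the single long-lived session is an assumption about how the rest of the training script is organized):

import tensorflow as tf

# Build the augmentation ops ONCE, outside the training loop
image_ph = tf.placeholder(tf.float32, shape=[None, None, 3])
distorted = tf.image.random_flip_left_right(image_ph)
distorted = tf.image.random_flip_up_down(distorted)
distorted = tf.image.random_brightness(distorted, max_delta=60)
distorted = tf.image.random_contrast(distorted, lower=0.2, upper=1.8)

sess = tf.Session()   # one long-lived session, reused for every batch

def randomly_modify_training_batch(images_train_batch, batch_size):
    for i in range(batch_size):
        # Only feed data through the existing graph; no new ops are created here
        images_train_batch[i] = sess.run(distorted, feed_dict={image_ph: images_train_batch[i]})
    return images_train_batch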
