I am trying to train a multi-scale CNN, somewhat like YOLOv2, in TensorFlow: the idea is to randomly resize the batch of inputs every few epochs.
I am not very familiar with TensorFlow; the following is how I get batches of images and labels:
data_provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
image, label = data_provider.get(['image', 'label'])
inputs, labels = tf.train.shuffle_batch([image, label],
                                        batch_size=128,
                                        num_threads=4,
                                        capacity=1000,
                                        min_after_dequeue=616)
Then I hope to resize the batch of inputs and feed it into the network:
rand_size = int(np.random.uniform(0.15, 1) * 720)
resize_output = tf.image.resize_bilinear(preprocessed_inputs, [rand_size, rand_size], align_corners=True)
Unfortunately, it does not work: it only resizes the batch once at the beginning, and then applies that same resize to all subsequent inputs.
Anyone have suggestions for what I should do?
Thanks a lot
You want rand_size to be based on a tf.random_uniform op rather than numpy/int; otherwise it will have the same value for every run of your session.
rand_size = tf.random_uniform(
    minval=int(0.15 * 720), maxval=720, dtype=tf.int32, shape=())
This will still resize each element of the batch by the same amount.
I'm not familiar with how slim does preprocessing, but there should be something in there that allows you to do the above before batching (in which case you'd get a different random value for each image). Alternatively, look into using the more recently released tf.data.Dataset. This post might help you there.
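For example, here is a rough sketch of per-batch random resizing with tf.data (an illustration only; it assumes a tf.data.Dataset named dataset that yields (image, label) pairs with 720x720 images):

def random_resize(images, labels):
    # a new random size is drawn every time the op runs, so each batch gets its own scale
    rand_size = tf.random_uniform(
        minval=int(0.15 * 720), maxval=720, dtype=tf.int32, shape=())
    images = tf.image.resize_bilinear(
        images, tf.stack([rand_size, rand_size]), align_corners=True)
    return images, labels

dataset = dataset.shuffle(1000).batch(128).map(random_resize)
images, labels = dataset.make_one_shot_iterator().get_next()

Because the map is applied after batching, every batch is resized to the same (random) size internally, which is what multi-scale training usually wants.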
I have a dataset of 180k images on which I try to recognize the characters in the images (license plate recognition). All of these license plates contain seven characters, and 35 characters are possible, so the output vector y has shape (7, 35). I therefore one-hot encoded every license plate label.
I used the bottom of the EfficientNet-B0 model (https://keras.io/api/applications/efficientnet/#efficientnetb0-function) together with a customized top, which is divided into 7 branches (because of the seven characters per license plate). I used the ImageNet weights and froze the bottom layers of efnB0_model:
def create_model(input_shape=(224, 224, 3)):
    input_img = Input(shape=input_shape)
    model = efnB0_model(input_img)
    model = GlobalAveragePooling2D(name='avg_pool')(model)
    model = Dropout(0.2)(model)
    backbone = model

    branches = []
    for i in range(7):
        branches.append(backbone)
        branches[i] = Dense(360, name="branch_"+str(i)+"_Dense_16000")(branches[i])
        branches[i] = BatchNormalization()(branches[i])
        branches[i] = Activation("relu")(branches[i])
        branches[i] = Dropout(0.2)(branches[i])
        branches[i] = Dense(35, activation="softmax", name="branch_"+str(i)+"_output")(branches[i])

    output = Concatenate(axis=1)(branches)
    output = Reshape((7, 35))(output)
    model = Model(input_img, output)
    return model

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
For training and validating the model I only use 10,000 training images and 3,000 validation images, because the large size of my model and the huge amount of data would make training very, very slow.
I use this DataGenerator to feed batches to my model:
import math
import numpy as np
# imports assumed: imread/resize from scikit-image, Sequence from tf.keras
from skimage.io import imread
from skimage.transform import resize
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1./255
        batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_y = np.array(batch_y)
        return batch_x, batch_y
I fit the model using this code:
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 32,
                    validation_steps=num_val_samples // 32,
                    epochs=10, workers=6, use_multiprocessing=True)
Now, after several epochs of training, I observed big differences between training accuracy and validation accuracy. I think one reason for that is the small amount of data. Which other factors influence this overfitting in my model? Do you think there is something completely wrong with my code/model? Do you think the model is too big and complex, or is it maybe due to the preprocessing of the data?
Note: I already experimented with data augmentation and tried the model without transfer learning. That leads to poor results on training AND validation data. So, is there anything else I could do?
First a disclaimer
Are you sure that this is the correct approach to follow? EfficientNet is a model created for image recognition, while your task demands correct localization of 7 characters in one image, recognition of each one of them, and also requires keeping the order of the characters. Maybe an approach of detection + segmentation followed by recognition, like in this medium post, is more efficient (pun intended). Even though I think that this is very likely your real problem, I will try to answer your original question.
Now some general tips regarding overfitting
There's a very good guide here in Keras documentation on how to use EfficientNet for transfer learning. I will try to summarize some tips here.
From your question it seems that you are not even doing the fine-tuning step, which is essential for the network to learn the task better.
Now, after several epochs of training, I observed big differences regarding training accuracy and validation accuracy.
By several, how many epochs do you mean? From the image you put in the question, I think the second complete epoch is too soon to infer that your model is overfitting. Also, given the code (10 epochs) and the image you posted (20 epochs), I would say to train for more epochs, like 40.
Increase the dropout. Try some configurations like 30%, 40%, 50%.
Data augmentation in practice will increase the number of samples that you have. However, you have 180K images and are only using 10K of them; data augmentation is good, but when you have more images available, try using them first. From the guide I mentioned, it seems feasible to train this model on more images using Google Colab. So, try to increase the training set size. Still on the topic of augmentation, some transformations may be harmful to your task, like too much rotation or reflection, since you are trying to recognize numbers and letters.
Reducing the batch size to 16 may provide more regularization, which helps to fight overfitting. Speaking of regularization, try to apply regularization to the dense layers that you are adding (see the sketch below).
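A rough sketch of what the fine-tuning step and the dense-layer regularization could look like (assuming efnB0_model is your frozen EfficientNet-B0 backbone and model is the multi-branch model returned by create_model; the number of layers to unfreeze and the learning rate are just example values):

from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam

# 1) Fine-tuning: unfreeze only the top of the backbone and recompile with a low learning rate
efnB0_model.trainable = True
for layer in efnB0_model.layers[:-20]:   # keep everything except the last ~20 layers frozen
    layer.trainable = False
model.compile(optimizer=Adam(1e-4), loss="categorical_crossentropy", metrics=["accuracy"])

# 2) Regularization on the dense layers you add (this would replace the Dense(360, ...) call inside create_model)
# Dense(360, kernel_regularizer=regularizers.l2(1e-4), name="branch_" + str(i) + "_Dense_16000")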
EDIT:
After quickly reading the paper you linked, I reaffirm my point about the epochs, since in the paper the results are shown for 100 epochs. Also, from the charts in the paper, it is not possible to confirm that the author did not have overfitting too. Additionally, the changes to the Xception network are not clear at all. Changing the input layer has an impact on the dimensions of all other layers because of the way the convolution operation works, and this is not discussed in the paper. The operations performed to achieve that output dimension are not clear either. Besides what you did, I would suggest using a pooling layer to get the output dimensions that you want (see below). Finally, the paper doesn't explain how the positioning of the plate is guaranteed. I would try to get more details about the paper you are trying to reproduce, to be sure that you are not missing anything in your model.
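For instance, a global pooling layer maps a feature map of any spatial size onto a fixed-length vector (a tiny illustration; backbone_output is a hypothetical placeholder for the last convolutional feature map):

from tensorflow.keras.layers import GlobalAveragePooling2D

# (batch, H, W, channels) -> (batch, channels), independent of H and W
pooled = GlobalAveragePooling2D()(backbone_output)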
I have been working on a character detection + recognition problem for an industrial application. From my experience, using only a deep CNN and dense layers to predict character classes is not the best approach to this problem. There are good research papers on the scene text recognition problem; one common approach to character recognition is to have:
Any deep CNN model, like VGG, ResNet or EfficientNet, to extract the image features.
Then add some RNN layers on top of the CNN backbone to get a character sequence from the extracted features. This is a great plus if you want to predict a variable number of characters.
After getting the character sequence from the RNN layers, the next step is to decode it. For this you can use either a CTC-based method or an attention mechanism. Both methods have their own pros and cons: CTC-based methods are fast but the performance is a bit worse, while attention-based models give good results but are very slow. So the choice of method depends on your requirements.
The figure in the very famous text recognition paper CRNN gives a general idea of the above steps (image not reproduced here).
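As a rough illustration of those three stages, here is a minimal CRNN-style sketch in Keras (my own simplified version, not the paper's exact architecture; the input size, layer widths, and character count are assumptions, and in practice the output would be trained and decoded with a CTC loss):

from tensorflow.keras import layers, Model

num_classes = 36                           # e.g. 35 characters + 1 CTC "blank" (assumption)
inputs = layers.Input(shape=(64, 256, 1))  # assumed plate crop: height 64, width 256, grayscale

# 1) CNN backbone extracts a feature map
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)         # 32 x 128
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)         # 16 x 64
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 1))(x)         # 8 x 64: keep the width, it becomes the sequence axis

# 2) Collapse the height so each horizontal position is one timestep, then run RNN layers
x = layers.Permute((2, 1, 3))(x)           # (width, height, channels)
x = layers.Reshape((64, 8 * 128))(x)       # (timesteps, features)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# 3) Per-timestep character probabilities, to be decoded with CTC or attention
outputs = layers.Dense(num_classes, activation="softmax")(x)
crnn = Model(inputs, outputs)
crnn.summary()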
For training the model, @Hemerson has given good suggestions. Try to build and train this type of model in multiple stages, and I am sure you will get better results. :)
Best regards!
I'm trying to perform digit recognition using PyTorch. I have implemented a convolutional version of a sliding window of size 32x32, which lets me identify digits of roughly that size in a picture.
But now let's imagine I have an image of size 300x300 with a digit that occupies the whole image. I will never be able to identify it...
I have seen people saying that the image needs to be rescaled and resized. Meaning that I need to create various scaled versions of my initial image and then to feed my network with those "new" images.
Does anyone have any idea how I can perform that?
Here is a part of my code, if it can help..
# loading dataset
size = 200
height = 200
width = 300
transformer_svhn_test = transforms.Compose([
    transforms.Grayscale(3),
    transforms.Resize((height, width)),
    transforms.CenterCrop((size, size)),
    transforms.ToTensor(),
    transforms.Normalize([.5, .5, .5], [.5, .5, .5])
])
SVHN_test = SVHN_(train=False, transform=transformer_svhn_test)
SVHN_test_loader = DataLoader(SVHN_test, batch_size=batch_size, shuffle=False, num_workers=3)
#loading network
model = Network()
model.to(device)
model.load_state_dict(torch.load("digit_classifier_gray_scale_weighted.pth"))
# loading one image and feeding the model with it
image = next(iter(SVHN_test_loader))[0]
image_tensor = image.unsqueeze(0) # creating a single-image batch
image_tensor = image_tensor.to(device)
model.eval()
output = model(image_tensor)
Please correct me if I've misunderstood your question:
Your network takes images of size 300x300 as input, does a 32x32 sliding-window operation within the model, and outputs the locations of any digits in the input images? In this setup, you are framing the problem as an object detection task.
I imagine the digits in your training data have sizes similar to 32x32, and you want to use multi-scale evaluation so that the digits in your test images also end up at sizes similar to those in your training data. For an object detection network, the input size is not fixed.
So the thing you need is actually called multi-scale evaluation/testing, and you will find it very common in computer vision tasks.
A good starting point would be HERE
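For illustration, a minimal sketch of multi-scale evaluation (it reuses model and image_tensor from the code above; the scale values are just examples): resize the same test image to several scales, run the sliding-window model on each, and merge the detections afterwards.

import torch
import torch.nn.functional as F

scales = [0.25, 0.5, 1.0]   # example scales, so a full-image digit shrinks toward ~32x32

model.eval()
all_outputs = []
with torch.no_grad():
    for s in scales:
        # image_tensor is assumed to be a (1, C, H, W) batch already on `device`
        scaled = F.interpolate(image_tensor, scale_factor=s,
                               mode="bilinear", align_corners=False)
        all_outputs.append(model(scaled))

# detections from the different scales can then be merged,
# e.g. with non-maximum suppression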
I have built a model in Keras. Normally I train and validate it using ImageDataGenerator, and the performance of the network is quite good. However, when I tried to load the same image into the model in this manner:
img = image.load_img(file, target_size=(img_height, img_width))
img_orig = image.img_to_array(img)
preds = model.predict(preprocess_input(np.expand_dims(img_orig.copy(), axis=0)))
the prediction is very poor. I tried this on all sets (train, test, validation), and they all resulted in poor predictions.
I think the difference comes from the fact that in ImageDataGenerator I use rescale=1./255, but I don't apply that rescaling when loading a single image, and I don't know how to apply it in this case.
Is my idea correct about this issue? And how can I fix it?
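For reference, this is roughly what I imagine the manual rescaling would look like (just a sketch of my idea, not verified; whether preprocess_input should also be kept depends on how the generator was configured):

img = image.load_img(file, target_size=(img_height, img_width))
img_arr = image.img_to_array(img) / 255.   # same scaling as rescale=1./255
preds = model.predict(np.expand_dims(img_arr, axis=0))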
I have a dataset of three images. When I create an autoencoder to train on those three images, the output I get is the exact same for each image, and it looks like a blend of all three images.
My result looks like this (the input/output image pairs for all three images are omitted here):
So you can see that the output is giving the exact same thing for each of the inputs, and while it matches each relatively well, it's not perfect.
This is a three image dataset - it should be perfect (or at least different for each of the images).
I'm concerned about this three-image dataset because when I train on the 500-image dataset, all I get back is a blank white screen, because that's the best average of all the images.
I'm using Keras, and the code is really simple.
from keras.models import Sequential
from keras.layers import Dense, Flatten, Reshape
import numpy as np
# returns a numpy array with shape (3, 24, 32, 1)
# there are 3 images that are each 24x32 and are black and white (1 color channel)
x_train = get_data()
# this is the size of our encoded representations
# encode down to two numbers (I have tested using 3; I still have the same issue)
encoding_dim = 2
# the shape without the batch amount
input_shape = x_train.shape[1:]
# how many output neurons we need to create an image
input_dim = np.prod(input_shape)
# simple feedforward network
# I've also tried convolutional layers; same issue
autoencoder = Sequential([
    Flatten(),              # flatten
    Dense(encoding_dim),    # encode
    Dense(input_dim),       # decode
    Reshape(input_shape)    # reshape decoding
])
# adadelta optimizer works better than adam, same issue with both
autoencoder.compile(optimizer='adadelta', loss='mse')
# train it to output the same thing it gets as input
# I've tried epochs up to 30000 with no improvement;
# still predicts the same image for all three inputs
autoencoder.fit(x_train, x_train,
epochs=10,
batch_size=1,
verbose=1)
out = autoencoder.predict(x_train)
I then take the outputs (out[0], out[1], out[2]) and convert them back into images. You can see the output images above.
I'm worried because this shows that the autoencoder isn't preserving any information about the input image, which is not how an encoder should perform.
How can I get the encoder to show differences in outputs based on the input images?
EDIT:
One of my coworkers suggested not even using an autoencoder, but a one-layer feedforward neural network. I tried this, and the same thing happened, until I set the batch size to 1 and trained for 1,400 epochs; then it worked perfectly. This leads me to think that more epochs would solve this issue, but I'm not sure yet.
EDIT:
Training for 10,000 epochs (with batch size 3) made the second image look different from the first and third in the autoencoder, which is exactly what happened with the non-autoencoder version after around 400 epochs (also with batch size 3), providing further evidence that training for more epochs may be the solution.
Going to test using batch size 1, and see if that helps even more, and then try training for very many epochs and see if that completely solves the issue.
My encoding dimension was way too small. Trying to encode 24x32 images into 2 numbers (or 3 numbers) is just too much for the autoencoder to handle.
By raising encoding_dim to 32, the issue was pretty much solved. I was able to use the default learning rate with the Adadelta optimizer. My data didn't even need any special normalization (just dividing all of the pixels by 255 worked).
The "binary_crossentropy" loss function seemed to work a bit faster/better than "mse", although "mse" (mean-squared-error) worked just fine.
In the first few hundred epochs, though, it does look like it's blending the images. However, the longer it trains, the more it starts to separate them.
I also made the activation of the encoding layer relu and the activation of the decoding layer sigmoid. I'm not sure how much of an effect that had on the output; I haven't tested it.
This page helped a ton in understanding what I did wrong. I just copy/pasted the code and found out it worked on my dataset, so the rest was figuring out what I did wrong.
Here are some results of their simple autoencoder architecture working on my dataset, which was my first sign of hope (output images after 500 and 2000 epochs, omitted here).
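For reference, here is a sketch of the adjusted architecture described above (reconstructed from this description, not my exact final code; it reuses input_shape, input_dim and x_train from the code in the question, with x_train already divided by 255, and the epoch count is just an example):

encoding_dim = 32

autoencoder = Sequential([
    Flatten(),
    Dense(encoding_dim, activation='relu'),    # encode
    Dense(input_dim, activation='sigmoid'),    # decode
    Reshape(input_shape)
])
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train, epochs=2000, batch_size=1, verbose=1)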
I am very new to ML on big data. I have played with Keras' generic convolutional examples for dog/cat classification before; however, when applying a similar approach to my set of images, I run into memory issues.
My dataset consists of very long images that are 10048 x 1687 pixels in size. To circumvent the memory issues, I am using a batch size of 1, feeding one image at a time to the model.
The model has two convolutional layers, each followed by max-pooling which together make the flattened layer roughly 290,000 inputs right before the fully-connected layer.
Immediately after running, however, memory usage chokes at its limit (8 GB).
So my questions are the following:
1) What is the best approach to process computations of such size in Python locally (no Cloud utilization)? Are there additional python libraries that I need to utilize?
Check out what yield does in Python and the idea of generators. You do not need to load all of your data at the beginning. You should make your batch_size just small enough that you do not get memory errors.
Your generator can look like this:
def generator(fileobj, labels, batch_size, memory_one_pic=1024):
    # batch_size must come before the argument with a default value,
    # otherwise the def line is a SyntaxError
    start = 0
    end = start + batch_size
    while True:
        X_batch = fileobj.read(memory_one_pic * batch_size)
        y_batch = labels[start:end]
        start += batch_size
        end += batch_size
        if not X_batch:
            break
        if start >= amount_of_datasets:
            start = 0
            end = batch_size
        yield (X_batch, y_batch)
...later when you already have your architecture ready...
train_generator = generator(open('traindata.csv','rb'), labels, batch_size)
train_steps = amount_of_datasets//batch_size + 1
model.fit_generator(generator=train_generator,
steps_per_epoch=train_steps,
epochs=epochs)
You should also read about batch_normalization, which basically helps to learn faster and with better accuracy.
When using fit_generator(), you should also set the max_q_size parameter. It's set to 10 by default, which means you're loading 10 batches ahead while using only 1 (since fit_generator() was designed to stream data from outside sources that can be delayed, like a network, not to save memory). I'd recommend setting max_q_size=1 for your purposes.
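For example (a sketch; in newer Keras versions this parameter was renamed to max_queue_size):

model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_steps,
                    epochs=epochs,
                    max_q_size=1)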