How should image preprocessing and data augmentation be done for semantic segmentation? - python

I have a small, imbalanced dataset of 4116 aerial images of size 224x224x3 (RGB). Since the dataset is not big enough, it's very likely that I will run into overfitting. Image preprocessing and data augmentation help to tackle this problem, as explained below.
"Overfitting is caused by having too few samples to learn from, rendering you unable to train a model that can generalize to new data. Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit. Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images."
Deep Learning with Python by François Chollet, pages 138-139, section 5.2.5, Using data augmentation.
I've read Medium - Image Data Preprocessing for Neural Networks and examined Stanford's CS230 - Data Preprocessing and
CS231 - Data Preprocessing course notes. The point is highlighted once more in this SO question, and I understand that there is no "one fits all" solution. Here is what prompted me to ask this question:
"No translation augmentation was used since we want to achieve high spatial resolution."
Reference: Researchgate - Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks
I know that I will use Keras - ImageDataGenerator Class, but don't know which techniques and what parameters to use for the semantic segmentation on small objects task. Could someone enlighten me? Thanks in advance. :)
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # a value in degrees (0-180)
    width_shift_range=0.2,    # a range within which to randomly translate pictures horizontally
    height_shift_range=0.2,   # a range within which to randomly translate pictures vertically
    shear_range=0.2,          # for randomly applying shearing transformations
    zoom_range=0.2,           # for randomly zooming inside pictures
    horizontal_flip=True,     # for randomly flipping half the images horizontally
    fill_mode='nearest',      # strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift
    featurewise_center=True,
    featurewise_std_normalization=True)
datagen.fit(X_train)

The augmentation and preprocessing phases always depend on the problem you have. You have to think of all the possible augmentations which can enlarge your dataset. But the most important thing is that you should not perform extreme augmentations that create training samples which could never occur in real examples. If you do not expect the real examples to be horizontally flipped, do not perform a horizontal flip, since this will give your model false information. Think of all the possible changes that can happen in your input images and try to artificially produce new images from your existing ones. You can use a lot of built-in functions from Keras, but for each of them you should make sure it does not produce examples that are unlikely to appear at the input of your model.
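One point that is specific to segmentation and worth sketching: whatever geometric transform is applied to an image must be applied identically to its mask. With ImageDataGenerator this is usually done by building two generators with the same parameters and the same random seed. A minimal sketch, assuming X_train (images) and Y_train (masks) are NumPy arrays; the parameter values are only illustrative:

from keras.preprocessing.image import ImageDataGenerator

# Identical geometric parameters for images and masks; note that
# intensity/normalisation options must NOT be applied to the masks.
data_gen_args = dict(rotation_range=20,
                     width_shift_range=0.2,
                     height_shift_range=0.2,
                     zoom_range=0.2,
                     horizontal_flip=True,
                     fill_mode='nearest')

image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)

seed = 1  # same seed => same random transform for each image/mask pair
image_generator = image_datagen.flow(X_train, batch_size=16, seed=seed)
mask_generator = mask_datagen.flow(Y_train, batch_size=16, seed=seed)

# zip the two streams so each batch yields (augmented_images, augmented_masks)
train_generator = zip(image_generator, mask_generator)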
As you said, there is no "one fits all" solution, because everything is dependent on the data. Analyse the data and build everything with respect to it.
About the small objects: one direction you should check is loss functions that emphasise the impact of the target volumes relative to the background. Look at the Dice loss or the generalised Dice loss.
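As a rough illustration of that idea, a soft Dice loss can be written directly against the Keras backend. This is only a minimal sketch for binary masks (assuming y_true and y_pred have the same shape with values in [0, 1]), not the exact formulation from the papers mentioned above:

from keras import backend as K

def soft_dice_loss(y_true, y_pred, smooth=1.0):
    # Flatten to vectors and measure the overlap between prediction and target.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    dice = (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return 1.0 - dice

# model.compile(optimizer='adam', loss=soft_dice_loss)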

Related

How can I train Super-Resolution Generative Adversarial Network (SRGAN) with high-frequency grayscale images?

This question is almost a duplicate of this post on Cross Validated, but no one has replied to it, so I hope it is okay that I ask almost the same question here.
I have been reading and looking at implementations of SRGAN, from Photo-realistic Single Image Super Resolution with Generative Adversarial Networks. I used the PyTorch implementation of SRGAN for 3-channel images and it produces some decent super-resolution images. However, when I try it with 1-channel images it fails to generate plausible images.
The grayscale images I use are from the public MSTAR release of high-resolution Synthetic Aperture Radar (SAR) data from SDMS. The dataset contains 2774 images across 10 classes. Some samples are shown below:
Since SRGAN uses VGG as one of its networks, I had to convert my grayscale images to RGB. I copied the first channel to the other two channels and created a 3-channel image with the following approach:
import numpy as np

h, w = hr_image.shape                      # grayscale image, shape (height, width)
ret = np.empty((h, w, 3), dtype=np.uint8)  # allocate a 3-channel image
ret[:, :, :] = hr_image[:, :, np.newaxis]  # broadcast the single channel into all three
hr_image = ret
The following are the outputs after 34 epochs (training continues), shown as low resolution, high resolution, super resolution.
What I notice is that the discriminator's loss quickly goes to 0, the generator's loss to 0.08, the generator's score to 0 and the discriminator's score to 1. I assume this means that it is too easy for the discriminator to distinguish between the real and fake images. This presumably causes the generator to not learn anything new, so it just stops learning.
I tried to isolate one class in the MSTAR dataset, but that did not change anything. I noticed that others (Super-resolution SAR Image Reconstruction via Generative Adversarial Network) did use SRGAN for SAR images and it seems to work, but their paper does not explain how they implemented it.
I am wondering if I am using the wrong approach and need to change the loss functions. SRGAN uses MSE, TV loss and perceptual loss. MSE is by itself not the best loss function, which is explained really well here, but it is probably good for keeping the images inside the MSE hypersphere. Still, I ask myself whether it makes sense to use a network that was trained on low-frequency images on high-frequency images. As I understand it, the loss functions are designed to work really well on low-frequency images (something our eyes like to look at) and not so much on high-frequency images. My questions are therefore:
Should I change the loss functions? Should I pre-train my own network on high-frequency grayscale images so that the network is more suitable for these images?
How come the generator crashes after a few epochs? Why is it so hard for the generator to make plausible transformations? Are the images too "noisy"?
Should I change something in the structure of the generator and discriminator? And should I try to use another pretrained network that is more suitable for grayscale images, and if so, which one?
Should I use a pretrained network that is trained on high frequency images?
UPDATED TEXT 11 JUNE 2020 BECAUSE OF NEW RESULTS
Instead of converting all my images to 3 channels beforehand, I now only do this right before the VGG network (in line 22) with the following commands:
out_images = out_images.repeat(1, 3, 1, 1)        # (N, 1, H, W) -> (N, 3, H, W) so VGG accepts the input
target_images = target_images.repeat(1, 3, 1, 1)
I also made some minor changes in my code, and this changed the output of the model after 250 epochs:
However, as can be seen, it did not improve the super-resolution images. The discriminator quickly learns to distinguish between real and fake images, as shown in the loss plot below:
Does anyone have any suggestions on how I can make the generator stronger? I tried adding more layers to the generator and removing some from the discriminator, but with no success.

How do I have to process an image to test it in a CNN?

I have trained my CNN in TensorFlow using the MNIST data set; when I tested it, it worked very well on the test data. To prove my model in a better way, I even made another set by randomly taking images from the train and test sets, deleting them from those sets at the same time so they were never shown to my model during training. It worked very well on that set too, but with an image downloaded from Google it doesn't classify well. So my question is: should I apply any filter to that image before giving it to the prediction part?
I resized the image and converted it to grayscale beforehand.
MNIST is an easy dataset. Your model (CNN) structure may do quite well for MNIST, but there is no guarantee that it will do well for more complex images too. You can add some more layers and try different activation functions (like ReLU, ELU, etc.). Normalizing your image pixel values to small values, e.g. between -1 and 1, may help too.
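As a rough sketch of that preprocessing step (the helper name, inversion threshold and input shape below are assumptions, not taken from the question's code): a downloaded digit image usually has to be converted to grayscale, resized to 28x28, possibly inverted so the digit is bright on a dark background as in MNIST, and scaled to the range the model was trained on.

import numpy as np
from PIL import Image

def prepare_for_mnist_model(path):
    """Turn an arbitrary image into an MNIST-like 28x28 input (hypothetical helper)."""
    img = Image.open(path).convert('L')   # grayscale
    img = img.resize((28, 28))            # MNIST resolution
    arr = np.asarray(img).astype('float32')
    if arr.mean() > 127:                  # MNIST digits are white on black;
        arr = 255.0 - arr                 # invert a dark-on-light photo
    arr = (arr / 127.5) - 1.0             # scale to [-1, 1] as suggested above
    return arr.reshape(1, 28, 28, 1)      # batch/channel dims, assuming a channels-last CNN

# prediction = model.predict(prepare_for_mnist_model('digit.png'))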

Training model on custom data

I am trying out an object detection model on a custom data set. I want it to recognize a specific piece of metal from my garage. I took about 32 photos and labelled them. The training goes well down to about 10% loss; after that it progresses very slowly, so I need to stop it. After that, I deployed the model on a camera, but it has no accuracy. Could it be because I have only 32 images of the object? I have tried YoloV2 and Faster RCNN.
It is unlikely that your model deployed to a camera has no accuracy only because you have 32 images.
Besides, you got down to about 10% loss (which seems to be about 90% accuracy), so it should work; I think the problem is not the amount of images.
After training your model you need to save the trained weights (coefficients).
Make sure that you deploy the trained model, and not a freshly initialised model from scratch.
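As an illustration of that last point, here is a generic Keras-style sketch (not tied to the YoloV2/Faster-RCNN code used in the question; build_model is a hypothetical function that recreates your architecture):

# After training: persist the learned weights.
model.save_weights('detector_weights.h5')

# In the camera script: rebuild the same architecture, then load the weights
# before running inference; otherwise you are predicting with random weights.
model = build_model()                      # hypothetical: recreates the trained architecture
model.load_weights('detector_weights.h5')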
Just labeling will not help in object detection. What you are doing is image classification but expecting results of object detection.
Object detection requires bounding box annotations and changes in the loss function which is to be fed to the model during each backpropagation step.
You need a tool to do the data annotation first, then adapt your Yolov2/Fast-RCNN code along with the loss function. Train it well and use image augmentation to generate some more images, because 32 images are few. Even then you might fall into the pitfall of getting high training accuracy but low test accuracy: training models on few images sometimes leads to unexpected overfitting.
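For detection, the augmentation has to transform the bounding boxes together with the image. A minimal sketch using the albumentations library (the box format and transform choices here are illustrative assumptions):

import albumentations as A

# Geometric/photometric transforms that also update the bounding boxes.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.3),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=10, p=0.5),
    ],
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']),
)

augmented = transform(image=image,                 # HxWx3 numpy array
                      bboxes=bboxes,               # [[x_min, y_min, x_max, y_max], ...]
                      class_labels=class_labels)   # one label per box
aug_image, aug_bboxes = augmented['image'], augmented['bboxes']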
Only then should you try to deploy it on the camera.

Does the accuracy of a deep learning model drop if I do not use the default input shape of the pretrained model?

As the title says, I want to know whether input shape affects the accuracy of the deep learning model.
Also, can pre-trained models (like Xception) be used on grayscale images?
P.S. : I recently started learning deep learning so if possible please explain in simple terms.
Usually, with convolutional neural networks, differences in the image shape (the width/height of an image) will not matter. However, differences in the number of channels in the image (equivalently, the depth of the image) will affect the performance. In fact, you will usually get dimension-mismatch errors if the model was trained for grayscale/colour input and you feed it the other type.
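To the second part of the question: a pretrained RGB model such as Xception can still be fed grayscale images if the single channel is repeated three times so the input shape matches what the network expects. A minimal sketch, assuming grayscale arrays of shape (N, H, W); the input size and pooling choice are only illustrative:

import numpy as np
from keras.applications.xception import Xception, preprocess_input

# Pretrained backbone without the ImageNet classification head.
base_model = Xception(weights='imagenet', include_top=False, pooling='avg',
                      input_shape=(224, 224, 3))

def grayscale_to_rgb_batch(gray_batch):
    """Repeat the single channel so a (N, H, W) grayscale batch becomes (N, H, W, 3)."""
    return np.repeat(gray_batch[..., np.newaxis], 3, axis=-1)

x = grayscale_to_rgb_batch(gray_images.astype('float32'))
x = preprocess_input(x)            # Xception's expected input scaling
features = base_model.predict(x)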
Generally, input scale matters, and changing to grayscale matters for sure. The details depend on the training data: if the training data contains the object at the same scale you use, it might not make a big difference; if not, it makes a difference. Deep learning is mostly not invariant to any changes in the data. CNNs show some invariance to translation, but that is about it. Rotation, scaling, colour distortion, brightness, etc. all impact performance negatively if these conditions have not been part of the training.
The paper https://arxiv.org/abs/2106.06057, published at IJCNN 2022, investigates a classifier on rotated and scaled images of simple datasets like MNIST (digits) and shows that performance deteriorates a lot. Other papers have shown the same thing.

Keras image augmentation: How to choose "steps per epoch" parameter and include specific augmentations during training?

I am training an image classification CNN using Keras.
Using the ImageDataGenerator function, I apply some random transformations to the training images (e.g. rotation, shearing, zooming).
My understanding is that these transformations are applied randomly to each image before it is passed to the model.
But some things are not clear to me:
1) How can I make sure that specific rotations of an image (e.g. 90°, 180°, 270°) are ALL included during training?
2) The steps_per_epoch parameter of model.fit_generator should be set to the number of unique samples of the dataset divided by the batch size defined in the flow_from_directory method. Does this still apply when using the above-mentioned image augmentation methods, since they increase the number of training images?
Thanks,
Mario
Some time ago I asked myself the same questions, and I think a possible explanation is the following.
Consider this example:
aug = ImageDataGenerator(rotation_range=90, width_shift_range=0.1,
                         height_shift_range=0.1, shear_range=0.2,
                         zoom_range=0.2, horizontal_flip=True,
                         fill_mode="nearest")
For question 1): I specify rotation_range=90, which means that while you flow (retrieve) the data, the generator will randomly rotate each image by a degree between 0 and 90. You cannot specify an exact angle, because that is what ImageDataGenerator does: it generates the rotation randomly. This is also very important for your second question.
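If you really need the exact multiples of 90° to appear, one possible workaround (not part of the original answer, just a sketch) is to pass a custom preprocessing_function to ImageDataGenerator that applies np.rot90 with a random number of quarter turns:

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

def random_90_rotation(image):
    """Rotate by exactly 0, 90, 180 or 270 degrees, chosen at random.
    Assumes square images so the output shape stays the same."""
    k = np.random.randint(4)
    return np.rot90(image, k)

aug = ImageDataGenerator(preprocessing_function=random_90_rotation,
                         width_shift_range=0.1,
                         height_shift_range=0.1,
                         horizontal_flip=True,
                         fill_mode="nearest")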
For question 2): Yes, it still applies when using data augmentation. I was also confused in the beginning. The reason is that, since the images are generated randomly, in each epoch the network sees images that are all different from those in the previous epoch. That is why the data is "augmented": the augmentation does not happen within an epoch, but throughout the entire training process. However, I have seen other people specifying 2x the original steps_per_epoch value.
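In code, the usual choice looks like the sketch below (the directory name, image size and batch size are placeholders, not from the question):

batch_size = 32
train_generator = aug.flow_from_directory('data/train',
                                          target_size=(224, 224),
                                          batch_size=batch_size,
                                          class_mode='categorical')

# One "epoch" = one pass over the number of unique images,
# even though each image arrives with a fresh random transformation.
steps_per_epoch = train_generator.samples // batch_size
model.fit_generator(train_generator,
                    steps_per_epoch=steps_per_epoch,
                    epochs=50)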
Hope this helps
