Image colorization using autoencoder - Maximum compression point - python

I am building a model for autoencoder. I have a dataset of images(256x256) in LAB color space.
But i dont know, what is the right maximum compression point. I found example, when i have 176 x 176 x 1 (~30976), then the point is 22 x 22 x 512 (~247808).
But how is that calculated?
My model:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(256, 256, 1)))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(512, (3,3), activation='relu', padding='same'))
#Decoder
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(2, (3, 3), activation='tanh', padding='same'))
model.add(UpSampling2D((2, 2)))
model.compile(optimizer='adam', loss='mse' , metrics=['accuracy'])
model.summary()

Figuring out these aspects of a network is more art than mathematics. As such, we cannot define a constant compression point without properly analyzing the data, which is the reason why neural nets are used in the first place.
We can however intuitively consider what happens at every layer. For example, in an image colorization problem, it is better not to use too many pooling layers since that discards a huge amount of information. A max pooling layer of size 2x2 with a stride of 2 discards 75% of its input data. This is much more useful in classification to eliminate improbable classes. Similarly, ReLU discards all negative data, and may not be the best function choice for the problem at hand.
Here are a few tips that may help with your specific problem:
Reduce the number of pooling layers. Before pooling, try increasing the number of trainable layers so that the model (intuitively) learns to aggregate important information to avoid pooling it out.
Change the activation to elu, LeakyReLU or such, that do not eliminate negative values, especially since the output requires negative values as well.
Maybe try BiLinear or BiCubic upsampling to maintain structure? I'd also suggest taking a look at the so called "magic" kernel here. Personally, I've had good results with it, though it takes time to implement it efficiently.
If you have enough GPU space, increase the number of channels. This particular point does not have much to consider, except overfitting in some cases.
Preferably, use a Conv2D layer as the final layer to compensate for artifacts while upsampling.
Keep in mind that these points are for general use cases. Models in research papers are a different case, and are not as simple as your architecture. All these points may or may not apply to a specific paper.

Related

why does cnn accuracy show only 0.3~0.4

I wrote very simple cnn code with spectrogram images, but
the accuracy is only 0.3~0.4
What do i have to add the other option to improve accuracy?
model.add(Conv2D(32, (3, 3), input_shape=X_train.shape[1:], padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64))
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(14))
model.add(Activation('softmax'))```
With the information you provide, there is zero chance to help you with the problem. The definition of your model looks correct (but you missed an activation function after the first dense layer if this is by accident). So here are some considerations:
Do you train long enough? Your model is quite big and therefore needs a long time to converge AND a large dataset to train with.
Is your dataset large enough and contains enough variance? When your dataset doesn't represent your problem well, you can't train.
Take a look at the Loss curves of your validation AND training set. Are you overfitting/underfitting?
Do you correctly normalize and preprocess your dataset? Try to transform the values of the images to a range of -1 to 1 or 0 to 1 with a float datatype.
Is your dataset balanced? As you are softmaxing 14 classes, you need a balanced dataset in order to train every single class.
Hoped this helped a little, if you need further help please provide some detailed descriptions of your problem and what are you doing in your whole process.

Sigmoid activation output layer produce Many near-1 value

:)
I have a Datset of ~16,000 .wav recording from 70 bird species.
I'm training a model using tensorflow to classify the mel-spectrogram of these recordings using Convolution based architectures.
One of the architectures used is simple multi-layer convolutional described below.
The pre-processing phase include:
extract mel-spectrograms and convert to dB Scale
segment audio to 1-second segment (pad with zero Or gaussian noise if residual is longer than 250ms, discard otherwise)
z-score normalization of training data - reduce mean and divide result by std
pre-processing while inference:
same as described above
z-score normalization BY training data - reduce mean (of training) and divide result by std (of training data)
I understand that the output layer's probabilities with sigmoid activation is not suppose to accumulate to 1, But I get many (8-10) very high prediction (~0.999) probabilities. and some is exactly 0.5.
The current test set correct classification rate is ~84%, tested with 10-fold cross validation, So it seems that the the network mostly operates well.
notes:
1.I understand there are similar features in the vocalization of different birds species, but the recieved probabilities doesn't seem to reflect them correctly
2. probabilities for example - a recording of natural noise:
Natural noise: 0.999
Mallard - 0.981
I'm trying to understand the reason for these results, if it's related the the data etc extensive mislabeling (probably not) or from another source.
Any help will be much appreciated! :)
EDIT: I use sigmoid because the probabilities of all classes are necessary, and I don't need them to accumulate to 1.
def convnet1(input_shape, numClasses, activation='softmax'):
# Define the network
model = tf.keras.Sequential()
model.add(InputLayer(input_shape=input_shape))
# model.add(Augmentations1(p=0.5, freq_type='mel', max_aug=2))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 1)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 1)))
model.add(Conv2D(128, (5, 5), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (5, 5), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(Flatten())
# model.add(Dense(numClasses, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(numClasses, activation='sigmoid'))
model.compile(
loss='categorical_crossentropy',
metrics=['accuracy'],
optimizer=optimizers.Adam(learning_rate=0.001),
run_eagerly=False) # this parameter allows to debug and use regular functions inside layers: print(), save() etc..
return model
For future searches - this problem was solved, and the reason was found(!).
The initial batch size that was used was 256 or 512. reducing the batch size to 16 or 32 SOLVED THE PROBLEM, and now the difference in probabilities are as expected for training AND test set samples - very high for the correct label and very low for other classes.

What does the filter parameter mean in Conv2d layer?

I am getting confused with the filter paramater, which is the first parameter in the Conv2D() layer function in keras. As I understand the filters are supposed to do things like edge detection or sharpening the image or blurring the image, but when I am defining the model as
input_shape = (32, 32, 3)
model = Sequential()
model.add( Conv2D(64, kernel_size=(5, 5), activation='relu', input_shape=input_shape, strides=(1,1), padding='same') )
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2,2)))
model.add(Conv2D(64, kernel_size=(5, 5), activation='relu', input_shape=input_shape, strides=(1,1), padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2,2)))
model.add(Conv2D(128, kernel_size=(5, 5), activation='relu', input_shape=input_shape, strides=(1,1), padding='same'))
model.add(Flatten())
model.add(Dense(3072, activation='relu'))
model.add(Dense(2048, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
I am not mentioning the the edge detection or blurring or sharpening anywhere in the Conv2D function. The input images are 32 by 32 RGB images.
So my question is, when I define the Convolution layer as Conv2D(64, ...), does this 64 means 64 different types of filters, such as vertical edge, horizontal edge, etc, which are chosen by keras at random? if so then is the output of the convolution layer (with 64 filters and 5x5 kernel and 1x1 stride) on a 32x32 1-channel image is 64 images of 28x28 size each. How are these 64 images combined to form a single image for further layers?
The filters argument sets the number of convolutional filters in that layer. These filters are initialized to small, random values, using the method specified by the kernel_initializer argument. During network training, the filters are updated in a way that minimizes the loss. So over the course of training, the filters will learn to detect certain features, like edges and textures, and they might become something like the image below (from here).
It is very important to realize that one does not hand-craft filters. These are learned automatically during training -- that's the beauty of deep learning.
I would highly recommend going through some deep learning resources, particularly https://cs231n.github.io/convolutional-networks/ and https://www.youtube.com/watch?v=r5nXYc2wYvI&list=PLypiXJdtIca5sxV7aE3-PS9fYX3vUdIOX&index=3&t=3122s.
Just wanted to clarify what the output shape was.
Although jakub's answer was good, I don't think it addressed the "single image for further layers" part of the question.
I did a model.summary() to find out more.
I found that the shape returned from a Conv2D is (None, img_width, img_height, num_filters)
So when you pass the output of the Conv2D to MaxPooling you are passing that shape which means it is basically passing each entire convoluted image.
The other layers handle this gracefully. MaxPooling2D(2,2) returns the same shape but half the image size (None, img_width / 2, img_height / 2, num_filters).
Side note: I wish the filters was named num_filters because filters seems to imply you're passing in a list of filters in which to convolute the image.

Accuracy in a CNN model never goes high for training and validation set

I am training a CNN model on KTH dataset to detect 6 classes of human actions.
Data Processing
Dataset consists of 599 videos, each action has 99-100 videos performed by 25 different persons. I divided the data to 300 videos for train, 98 videos for validation and 200 videos for test set.
I reduced the resolution to 50x50 pixels, so I don't run out of memory while processing.
I exracted 200 frames from the middle of each video.
it normalized the pixels from 0-255 to 0,1.
Finally I one hot encoded to class labels.
Model architecture
This is my model architecture.
And this is the code of the NN layers.
model = Sequential()
model.add(Conv3D(filters=64,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu',
input_shape=X_train.shape[1:]))
model.add(MaxPooling3D(pool_size=2,
strides=(2, 2, 2),
padding='same'))
model.add(Conv3D(filters=128,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(MaxPooling3D(pool_size=2,
strides=(2, 2, 2),
padding='same'))
model.add(Conv3D(filters=256,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(Conv3D(filters=256,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(MaxPooling3D(pool_size=2,
strides=(2, 2, 2),
padding='same'))
model.add(Conv3D(filters=512,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
#model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(6, activation='softmax'))
model.summary()
Training
My problem is both training and validation accuracy do not change, and they basically froze from the first epoch. These are the training step.
These are the first 6 epochs and here the last 6 epochs.
The Loss looks like this.
Training loss is very high, and the loss for validation doesn't change.
and the training looks like this.
I am confused, is the model underfitting or overfitting?
How I am gonna fix this problem? will dropout help, since I can't do data augmentation on videos (I assumed that)?
I greatly appreciate any suggestion.
You are using 0-1 values of frames and are using relu. In dying relu problem model is frozen and doesn't learn at all because relu gets maximum values b/w 0 or the weight*input if bias is not added. You can do 2 things to ensure that model does work properly altough I am not sure whether you will get good accuracy or not but can try this to avoid this dying relu problem:-
Use leaky relu with alpha>=0.2.
Do not normalize the frames, instead just convert to grayscale to reduce extensive training.
Don't take 200 frames from middle, divide all videos in equal amount of frame chunks and take 2,3 consecutive frames from each chunk. also try adding more dense layers as they help in classification.
I worked on almost same problem and what I did was to use Conv2d after merging frames together i.e. if you have 10 frames of size 64,64,3 each instead of doing conv3d, I did conv2d on 640,64,3 dataset and resulted in 86% accuracy on 16 classes for videos.
It depends on how you use the 200frames of video as training data to classify an action. Your training data is having too much bias.
Since its a sequential data to be classified, you have to go for memory based architecture or concatenation model.

Keras CNN Coordinate Regression Only Gives Mean Value

I have been trying to implement a CNN that, given a (75x75) image of a detector, will output the (x,y) coordinate of the source for that image. Each image has features that make it possible to determine the source. The issue I am coming up with is that the regressor, which outputs the x, y coordinate, always returns the mean pixel value for all the training data. I've tried to implement the regressor outlined in this paper: https://arxiv.org/abs/1803.10698 but it still gives the mean pixel value as all the predictions. A short version of my model is here:
model = Sequential()
# Base Conv layer
model.add(Conv2D(32, kernel_size=patch_size, strides=(1, 1),
activation='relu', padding='same',
input_shape=(75, 75, 1)))
model.add(Conv2D(64, patch_size, strides=(1, 1), activation='relu', padding=optimizer))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
for i in range(num_conv):
model.add(Conv2D(64, patch_size, strides=(1, 1), activation='relu', padding='same))
model.add(Conv2D(64, patch_size, strides=(1, 1), activation='relu', padding='same))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Flatten())
for i in range(num_dense):
model.add(Dense(512, activation='linear'))
model.add(Dropout(dropout_layer))
# Final Dense layer
# 2 so have one for x and one for y
model.add(Dense(2, activation='linear'))
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
I don't have the rep to post images of the sources, but here is truth and outputs when it is trained on 42000 images:
As you can see, its horribly off, all the estimations are in that one dot at the mean value of all the x,y coordinates. while the actual values create a circle covering the whole detector. I am not sure how to get it to actually output correct values. I get that the mean squared error as the loss probably should make the estimation the center value, as it is the minimum squared distance from all the sources, but I am not sure how else to try it, since I want to minimize the distance between each prediction and its label individually.

Categories

Resources