Keras CNN Coordinate Regression Only Gives Mean Value - python

I have been trying to implement a CNN that, given a (75x75) image of a detector, will output the (x,y) coordinate of the source for that image. Each image has features that make it possible to determine the source. The issue I am coming up with is that the regressor, which outputs the x, y coordinate, always returns the mean pixel value for all the training data. I've tried to implement the regressor outlined in this paper: https://arxiv.org/abs/1803.10698 but it still gives the mean pixel value as all the predictions. A short version of my model is here:
model = Sequential()
# Base Conv layer
model.add(Conv2D(32, kernel_size=patch_size, strides=(1, 1),
activation='relu', padding='same',
input_shape=(75, 75, 1)))
model.add(Conv2D(64, patch_size, strides=(1, 1), activation='relu', padding=optimizer))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
for i in range(num_conv):
model.add(Conv2D(64, patch_size, strides=(1, 1), activation='relu', padding='same))
model.add(Conv2D(64, patch_size, strides=(1, 1), activation='relu', padding='same))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Flatten())
for i in range(num_dense):
model.add(Dense(512, activation='linear'))
model.add(Dropout(dropout_layer))
# Final Dense layer
# 2 so have one for x and one for y
model.add(Dense(2, activation='linear'))
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
I don't have the rep to post images of the sources, but here is truth and outputs when it is trained on 42000 images:
As you can see, its horribly off, all the estimations are in that one dot at the mean value of all the x,y coordinates. while the actual values create a circle covering the whole detector. I am not sure how to get it to actually output correct values. I get that the mean squared error as the loss probably should make the estimation the center value, as it is the minimum squared distance from all the sources, but I am not sure how else to try it, since I want to minimize the distance between each prediction and its label individually.

Related

why does cnn accuracy show only 0.3~0.4

I wrote very simple cnn code with spectrogram images, but
the accuracy is only 0.3~0.4
What do i have to add the other option to improve accuracy?
model.add(Conv2D(32, (3, 3), input_shape=X_train.shape[1:], padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64))
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(14))
model.add(Activation('softmax'))```
With the information you provide, there is zero chance to help you with the problem. The definition of your model looks correct (but you missed an activation function after the first dense layer if this is by accident). So here are some considerations:
Do you train long enough? Your model is quite big and therefore needs a long time to converge AND a large dataset to train with.
Is your dataset large enough and contains enough variance? When your dataset doesn't represent your problem well, you can't train.
Take a look at the Loss curves of your validation AND training set. Are you overfitting/underfitting?
Do you correctly normalize and preprocess your dataset? Try to transform the values of the images to a range of -1 to 1 or 0 to 1 with a float datatype.
Is your dataset balanced? As you are softmaxing 14 classes, you need a balanced dataset in order to train every single class.
Hoped this helped a little, if you need further help please provide some detailed descriptions of your problem and what are you doing in your whole process.

Sigmoid activation output layer produce Many near-1 value

:)
I have a Datset of ~16,000 .wav recording from 70 bird species.
I'm training a model using tensorflow to classify the mel-spectrogram of these recordings using Convolution based architectures.
One of the architectures used is simple multi-layer convolutional described below.
The pre-processing phase include:
extract mel-spectrograms and convert to dB Scale
segment audio to 1-second segment (pad with zero Or gaussian noise if residual is longer than 250ms, discard otherwise)
z-score normalization of training data - reduce mean and divide result by std
pre-processing while inference:
same as described above
z-score normalization BY training data - reduce mean (of training) and divide result by std (of training data)
I understand that the output layer's probabilities with sigmoid activation is not suppose to accumulate to 1, But I get many (8-10) very high prediction (~0.999) probabilities. and some is exactly 0.5.
The current test set correct classification rate is ~84%, tested with 10-fold cross validation, So it seems that the the network mostly operates well.
notes:
1.I understand there are similar features in the vocalization of different birds species, but the recieved probabilities doesn't seem to reflect them correctly
2. probabilities for example - a recording of natural noise:
Natural noise: 0.999
Mallard - 0.981
I'm trying to understand the reason for these results, if it's related the the data etc extensive mislabeling (probably not) or from another source.
Any help will be much appreciated! :)
EDIT: I use sigmoid because the probabilities of all classes are necessary, and I don't need them to accumulate to 1.
def convnet1(input_shape, numClasses, activation='softmax'):
# Define the network
model = tf.keras.Sequential()
model.add(InputLayer(input_shape=input_shape))
# model.add(Augmentations1(p=0.5, freq_type='mel', max_aug=2))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 1)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 1)))
model.add(Conv2D(128, (5, 5), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (5, 5), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(Flatten())
# model.add(Dense(numClasses, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(numClasses, activation='sigmoid'))
model.compile(
loss='categorical_crossentropy',
metrics=['accuracy'],
optimizer=optimizers.Adam(learning_rate=0.001),
run_eagerly=False) # this parameter allows to debug and use regular functions inside layers: print(), save() etc..
return model
For future searches - this problem was solved, and the reason was found(!).
The initial batch size that was used was 256 or 512. reducing the batch size to 16 or 32 SOLVED THE PROBLEM, and now the difference in probabilities are as expected for training AND test set samples - very high for the correct label and very low for other classes.

Incompatibility between input and final Dense Layer (Value Error)

I'm following this tutorial from Nabeel Ahmed to create your own emotion detector using Keras (I'm a noob) and I've found a strange behaviour that I'd like to understand. The input data is a bunch of 48x48 images, each one with an integer value between 0 and 6 (each number stands for an emotion label), which represents the emotion present in the image.
train_X.shape -> (28709, 2304) // training-data, 28709 images of 48x48
train_Y.shape -> (28709,) //The emotion present in each image as an integer, 1 = happiness, 2 = sadness, etc.
val_X.shape -> (3589, 2304)
val_Y.shape -> (3589, )
In order to feed the data into the model, train_X and val_X are reshaped (as the tutorial explains)
train_X.shape -> (28709, 48, 48, 1)
val_X.shape -> (3589, 48, 48, 1)
The model, as it is in the tutorial, is this one:
model = Sequential()
input_shape = (48,48,1)
#1st convolution layer
model.add(Conv2D(64, (5, 5), input_shape=input_shape,activation='relu', padding='same'))
model.add(Conv2D(64, (5, 5), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
#2nd convolution layer
model.add(Conv2D(128, (5, 5),activation='relu',padding='same'))
model.add(Conv2D(128, (5, 5),activation='relu',padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
#3rd convolution layer
model.add(Conv2D(256, (3, 3),activation='relu',padding='same'))
model.add(Conv2D(256, (3, 3),activation='relu',padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(128))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
################################################################
model.add(Dense(7)) # <- problematic line
################################################################
model.add(Activation('softmax'))
my_optimiser = tf.keras.optimizers.Adam(
learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
name='Adam')
model.compile(loss='categorical_crossentropy', metrics=['accuracy'],optimizer=my_optimiser)
However, when I try to use it, using the tutorial snippet, I get an error in the line of the validation_data like this
history = model.fit(train_X,
train_Y,
batch_size=64,
epochs=80,
verbose=1,
validation_data=(val_X, val_Y),
shuffle=True)
ValueError: Shapes (None, 1) and (None, 7) are incompatible
After reviewing the code and the documentation about the fit method, my only idea was to change the 7 in the last Dense layer of the model to 1, which mysteriously works. I'd like to know what is happening here if anyone could give me a hint.
You seem to be working with sparse integer labels, where each sample belongs to one of seven classes {0, 1, 2, 3, 4, 5, 6}, so I would recommend using SparseCategoricalCrossentropy instead of CategoricalCrossentropy as your loss function. Just change this parameter and your model should work fine. If you want to use CategoricalCrossentropy, you will have to one-hot encode your labels, for example with:
train_Y = tf.keras.utils.to_categorical(train_Y, num_classes=7)

Image colorization using autoencoder - Maximum compression point

I am building a model for autoencoder. I have a dataset of images(256x256) in LAB color space.
But i dont know, what is the right maximum compression point. I found example, when i have 176 x 176 x 1 (~30976), then the point is 22 x 22 x 512 (~247808).
But how is that calculated?
My model:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(256, 256, 1)))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(512, (3,3), activation='relu', padding='same'))
#Decoder
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(2, (3, 3), activation='tanh', padding='same'))
model.add(UpSampling2D((2, 2)))
model.compile(optimizer='adam', loss='mse' , metrics=['accuracy'])
model.summary()
Figuring out these aspects of a network is more art than mathematics. As such, we cannot define a constant compression point without properly analyzing the data, which is the reason why neural nets are used in the first place.
We can however intuitively consider what happens at every layer. For example, in an image colorization problem, it is better not to use too many pooling layers since that discards a huge amount of information. A max pooling layer of size 2x2 with a stride of 2 discards 75% of its input data. This is much more useful in classification to eliminate improbable classes. Similarly, ReLU discards all negative data, and may not be the best function choice for the problem at hand.
Here are a few tips that may help with your specific problem:
Reduce the number of pooling layers. Before pooling, try increasing the number of trainable layers so that the model (intuitively) learns to aggregate important information to avoid pooling it out.
Change the activation to elu, LeakyReLU or such, that do not eliminate negative values, especially since the output requires negative values as well.
Maybe try BiLinear or BiCubic upsampling to maintain structure? I'd also suggest taking a look at the so called "magic" kernel here. Personally, I've had good results with it, though it takes time to implement it efficiently.
If you have enough GPU space, increase the number of channels. This particular point does not have much to consider, except overfitting in some cases.
Preferably, use a Conv2D layer as the final layer to compensate for artifacts while upsampling.
Keep in mind that these points are for general use cases. Models in research papers are a different case, and are not as simple as your architecture. All these points may or may not apply to a specific paper.

Does target structure affect neural network performance?

I'm building a CNN to control a vehicle within a video game. The network takes a screenshot as input and uses controller values as targets. For now I'm just using two controller values as targets: steering which is a value between -1 and 1, as well as throttle, which is between 0 and 1. I have rounded the values of steering into 7 values, and throttle into 4 values giving me 28 distinct classes which I will balance (reason for rounding was difficulty balancing un-binned classes).
My question is whether I should train the network with a single value target 1-27 (one for each case), or whether I should use the two rounded controller values as targets (an array: [steering, throttle])? I understand that both create 28 target classes, but does the structure of the target output affect the performance of the network? Is one of these options noticeably better than the other?
The model for preliminary testing:
'''
model = Sequential()
model.add(Conv2D(24, kernel_size=(5, 5), strides=(2, 2), activation='relu', input_shape= INPUT_SHAPE))
model.add(Conv2D(36, kernel_size=(5, 5), strides=(2, 2), activation='relu'))
model.add(Conv2D(48, kernel_size=(5, 5), strides=(2, 2), activation='relu'))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(1164, activation='relu'))
drop_out = 1 - keep_prob
model.add(Dropout(drop_out))
model.add(Dense(100, activation='relu'))
model.add(Dropout(drop_out))
model.add(Dense(50, activation='relu'))
model.add(Dropout(drop_out))
model.add(Dense(10, activation='relu'))
model.add(Dropout(drop_out))
model.add(Dense(OUT_SHAPE, activation='softsign'))
'''

Categories

Resources