:)
I have a Datset of ~16,000 .wav recording from 70 bird species.
I'm training a model using tensorflow to classify the mel-spectrogram of these recordings using Convolution based architectures.
One of the architectures used is simple multi-layer convolutional described below.
The pre-processing phase include:
extract mel-spectrograms and convert to dB Scale
segment audio to 1-second segment (pad with zero Or gaussian noise if residual is longer than 250ms, discard otherwise)
z-score normalization of training data - reduce mean and divide result by std
pre-processing while inference:
same as described above
z-score normalization BY training data - reduce mean (of training) and divide result by std (of training data)
I understand that the output layer's probabilities with sigmoid activation is not suppose to accumulate to 1, But I get many (8-10) very high prediction (~0.999) probabilities. and some is exactly 0.5.
The current test set correct classification rate is ~84%, tested with 10-fold cross validation, So it seems that the the network mostly operates well.
notes:
1.I understand there are similar features in the vocalization of different birds species, but the recieved probabilities doesn't seem to reflect them correctly
2. probabilities for example - a recording of natural noise:
Natural noise: 0.999
Mallard - 0.981
I'm trying to understand the reason for these results, if it's related the the data etc extensive mislabeling (probably not) or from another source.
Any help will be much appreciated! :)
EDIT: I use sigmoid because the probabilities of all classes are necessary, and I don't need them to accumulate to 1.
def convnet1(input_shape, numClasses, activation='softmax'):
# Define the network
model = tf.keras.Sequential()
model.add(InputLayer(input_shape=input_shape))
# model.add(Augmentations1(p=0.5, freq_type='mel', max_aug=2))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 1)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 1)))
model.add(Conv2D(128, (5, 5), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (5, 5), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(Flatten())
# model.add(Dense(numClasses, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(numClasses, activation='sigmoid'))
model.compile(
loss='categorical_crossentropy',
metrics=['accuracy'],
optimizer=optimizers.Adam(learning_rate=0.001),
run_eagerly=False) # this parameter allows to debug and use regular functions inside layers: print(), save() etc..
return model
For future searches - this problem was solved, and the reason was found(!).
The initial batch size that was used was 256 or 512. reducing the batch size to 16 or 32 SOLVED THE PROBLEM, and now the difference in probabilities are as expected for training AND test set samples - very high for the correct label and very low for other classes.
Related
I wrote very simple cnn code with spectrogram images, but
the accuracy is only 0.3~0.4
What do i have to add the other option to improve accuracy?
model.add(Conv2D(32, (3, 3), input_shape=X_train.shape[1:], padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64))
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(14))
model.add(Activation('softmax'))```
With the information you provide, there is zero chance to help you with the problem. The definition of your model looks correct (but you missed an activation function after the first dense layer if this is by accident). So here are some considerations:
Do you train long enough? Your model is quite big and therefore needs a long time to converge AND a large dataset to train with.
Is your dataset large enough and contains enough variance? When your dataset doesn't represent your problem well, you can't train.
Take a look at the Loss curves of your validation AND training set. Are you overfitting/underfitting?
Do you correctly normalize and preprocess your dataset? Try to transform the values of the images to a range of -1 to 1 or 0 to 1 with a float datatype.
Is your dataset balanced? As you are softmaxing 14 classes, you need a balanced dataset in order to train every single class.
Hoped this helped a little, if you need further help please provide some detailed descriptions of your problem and what are you doing in your whole process.
I have a neural network constructed in Tensorflow, and am using sentiment dimensions to try construct a predictive model. A few dimensions are Anger, Sad, Joy, Surprise, Positive, Negative, etc.. My aim is to try to combine different dimensions to see if I can find a non-linear relationship between them (I'm using a self-organizing fuzzy neural network) with what I am trying to predict. (e.g. 'Anger Surpise', 'Anger, Sad', 'Sad, Joy, Surprise', etc.)
What I have tried:
I have got all of the different combinations using the 'itertools' library. I then created a function that takes in the columns I would like to try, and then splits my pandas dataframe into training and testing, trains the model, and returns an output.
I have tried calling this function using .map on a pandas dataframe with one column consisting of the combinations, I've also tried using a threadpool, and also just a simple loop over the list of combinations and calling my function, however the time taken gets exponentially slower, and I think this is to do with the garbage collector not doing its job (my RAM usage also becomes very high). I then attempted to del all of the training, testing dataframes and model after every function call however it did not help.
tl;dr Is there a good way to try different combinations of inputs on a tensorflow model?
Your problems seems complex - there is a lot of factors to determine.
Firstly - I suggest to construct data structure for that type of operation - and make some research in existing codebases on github.
For strictly operating with images & face recognition - cv2 and openCV.
That type of model basically work on face recognitioin & positions of points in (x,y) and euclidean distance based on this.
In the process of declaring and adjustinig layer - it's possibility to mix those layers -
(Conv64, Pooling, Conv128, Conv258, Conv512)
eg.
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(48,48,1)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='softmax'))
(Reference: https://github.com/atulapra/Emotion-detection/blob/master/src/emotions.py)
And to define the previius mentioned factors eg.
(0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral)
So after categorizing emotions:
There is a need for reading position of points - point by point, something like:
for(x, y, w, h) in images
I am building a model for autoencoder. I have a dataset of images(256x256) in LAB color space.
But i dont know, what is the right maximum compression point. I found example, when i have 176 x 176 x 1 (~30976), then the point is 22 x 22 x 512 (~247808).
But how is that calculated?
My model:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(256, 256, 1)))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D((2,2)))
model.add(Conv2D(512, (3,3), activation='relu', padding='same'))
#Decoder
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Conv2D(2, (3, 3), activation='tanh', padding='same'))
model.add(UpSampling2D((2, 2)))
model.compile(optimizer='adam', loss='mse' , metrics=['accuracy'])
model.summary()
Figuring out these aspects of a network is more art than mathematics. As such, we cannot define a constant compression point without properly analyzing the data, which is the reason why neural nets are used in the first place.
We can however intuitively consider what happens at every layer. For example, in an image colorization problem, it is better not to use too many pooling layers since that discards a huge amount of information. A max pooling layer of size 2x2 with a stride of 2 discards 75% of its input data. This is much more useful in classification to eliminate improbable classes. Similarly, ReLU discards all negative data, and may not be the best function choice for the problem at hand.
Here are a few tips that may help with your specific problem:
Reduce the number of pooling layers. Before pooling, try increasing the number of trainable layers so that the model (intuitively) learns to aggregate important information to avoid pooling it out.
Change the activation to elu, LeakyReLU or such, that do not eliminate negative values, especially since the output requires negative values as well.
Maybe try BiLinear or BiCubic upsampling to maintain structure? I'd also suggest taking a look at the so called "magic" kernel here. Personally, I've had good results with it, though it takes time to implement it efficiently.
If you have enough GPU space, increase the number of channels. This particular point does not have much to consider, except overfitting in some cases.
Preferably, use a Conv2D layer as the final layer to compensate for artifacts while upsampling.
Keep in mind that these points are for general use cases. Models in research papers are a different case, and are not as simple as your architecture. All these points may or may not apply to a specific paper.
I'm researching the possibility of implementing a CNN in order to classify images as "good" or "bad" but am having no luck with my current architecture.
Characteristics that denote a "bad" image:
Overexposure
Oversaturation
Incorrect white balance
Blurriness
Would it be feasible to implement a neural network to classify images based on these characteristics or is it best left to a traditional algorithm that simply looks at the variance in brightness/contrast throughout an image and classifies it that way?
I have attempted training a CNN using the VGGNet architecture but I always seem to get a biased and unreliable model, regardless of the number of epochs or number of steps.
Examples:
My current model's architecture is very simple (as I am new to the whole machine learning world) but seemed to work fine with other classification problems, and I have modified it slightly to work better with this binary classification problem:
# CONV => RELU => POOL layer set
# define convolutional layers, use "ReLU" activation function
# and reduce the spatial size (width and height) with pool layers
model.add(Conv2D(32, (3, 3), padding="same", input_shape=input_shape)) # 32 3x3 filters (height, width, depth)
model.add(Activation("relu"))
model.add(BatchNormalization(axis=channel_dimension))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25)) # helps prevent overfitting (25% of neurons disconnected randomly)
# (CONV => RELU) * 2 => POOL layer set (increasing number of layers as you go deeper into CNN)
model.add(Conv2D(64, (3, 3), padding="same", input_shape=input_shape)) # 64 3x3 filters
model.add(Activation("relu"))
model.add(BatchNormalization(axis=channel_dimension))
model.add(Conv2D(64, (3, 3), padding="same", input_shape=input_shape)) # 64 3x3 filters
model.add(Activation("relu"))
model.add(BatchNormalization(axis=channel_dimension))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25)) # helps prevent overfitting (25% of neurons disconnected randomly)
# (CONV => RELU) * 3 => POOL layer set (input volume size becoming smaller and smaller)
model.add(Conv2D(128, (3, 3), padding="same", input_shape=input_shape)) # 128 3x3 filters
model.add(Activation("relu"))
model.add(BatchNormalization(axis=channel_dimension))
model.add(Conv2D(128, (3, 3), padding="same", input_shape=input_shape)) # 128 3x3 filters
model.add(Activation("relu"))
model.add(BatchNormalization(axis=channel_dimension))
model.add(Conv2D(128, (3, 3), padding="same", input_shape=input_shape)) # 128 3x3 filters
model.add(Activation("relu"))
model.add(BatchNormalization(axis=channel_dimension))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25)) # helps prevent overfitting (25% of neurons disconnected randomly)
# only set of FC => RELU layers
model.add(Flatten())
model.add(Dense(512))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))
# sigmoid classifier (output layer)
model.add(Dense(classes))
model.add(Activation("sigmoid"))
Is there any glaring omissions or mistakes with this model or can I simply not solve this problem using deep learning (with my current GPU, a GTX 970)?
Thanks for your time and experience,
Josh
EDIT:
Here is my code for compiling/training the model:
# initialise the model and optimiser
print("[INFO] Training network...")
opt = SGD(lr=initial_lr, decay=initial_lr / epochs)
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
# set up checkpoints
model_name = "output/50_epochs_{epoch:02d}_{val_acc:.2f}.model"
checkpoint = ModelCheckpoint(model_name, monitor='val_acc', verbose=1,
save_best_only=True, mode='max')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
patience=5, min_lr=0.001)
tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
callbacks_list = [checkpoint, reduce_lr, tensorboard]
# train the network
H = model.fit_generator(training_set, steps_per_epoch=500, epochs=50, validation_data=test_set, validation_steps=150, callbacks=callbacks_list)
Independently of any other advice (including the answer already provided), and assuming classes=2 (which you don't clarify - there is a reason we ask for a MCVE here), you seem to perform a fundamental mistake in your final layer, i.e.:
# sigmoid classifier (output layer)
model.add(Dense(classes))
model.add(Activation("sigmoid"))
A sigmoid activation is suitable only if your final layer consists of a single node; if classes=2, as I suspect, based also on your puzzling statement in the comments that
with three different images, my results are 0.987 bad and 0.999 good
and
I was giving you the predictions from the model previously
you should use a softmax activation, i.e.
model.add(Dense(classes))
model.add(Activation("softmax"))
Alternatively, you could use sigmoid, but your final layer should consist of a single node, i.e.
model.add(Dense(1))
model.add(Activation("sigmoid"))
The latter is usually preferred in binary classification settings, but the results should be the same in principle.
UPDATE (after updating the question):
sparse_categorical_crossentropy is not the correct loss here, either.
All in all, try the following changes:
model.compile(loss="binary_crossentropy", optimizer=Adam(), metrics=["accuracy"])
# final layer:
model.add(Dense(1))
model.add(Activation("sigmoid"))
with Adam optimizer (needs import). Also, dropout should not be used by default - see this thread; start without it and only add if necessary (i.e. if you see signs of overfitting).
I suggest you go for transfer learning instead of training the whole network.
use the weights trained on a huge Dataset like ImageNet
you can easily do this using Keras you just need to import model with weights like xception and remove last layer which represents 1000 classes of imagenet dataset to 2 node dense layer cause you have only 2 classes and set trainable=False for the base layer and trainable=True for custom added layers like dense layer having node = 2.
and you can train the model as usual way.
Demo code -
from keras.applications import *
from keras.models import Model
base_model = Xception(input_shape=(img_width, img_height, 3), weights='imagenet', include_top=False
x = base_model.output
x = GlobalAveragePooling2D()(x)
predictions = Dense(2, activation='softmax')(x)
model = Model(base_model.input, predictions)
# freezing the base layer weights
for layer in base_model.layers:
layer.trainable = False
I am training a CNN model on KTH dataset to detect 6 classes of human actions.
Data Processing
Dataset consists of 599 videos, each action has 99-100 videos performed by 25 different persons. I divided the data to 300 videos for train, 98 videos for validation and 200 videos for test set.
I reduced the resolution to 50x50 pixels, so I don't run out of memory while processing.
I exracted 200 frames from the middle of each video.
it normalized the pixels from 0-255 to 0,1.
Finally I one hot encoded to class labels.
Model architecture
This is my model architecture.
And this is the code of the NN layers.
model = Sequential()
model.add(Conv3D(filters=64,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu',
input_shape=X_train.shape[1:]))
model.add(MaxPooling3D(pool_size=2,
strides=(2, 2, 2),
padding='same'))
model.add(Conv3D(filters=128,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(MaxPooling3D(pool_size=2,
strides=(2, 2, 2),
padding='same'))
model.add(Conv3D(filters=256,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(Conv3D(filters=256,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(MaxPooling3D(pool_size=2,
strides=(2, 2, 2),
padding='same'))
model.add(Conv3D(filters=512,
kernel_size=(3, 3, 3),
strides=(1, 1, 1),
padding='valid',
activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
#model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(6, activation='softmax'))
model.summary()
Training
My problem is both training and validation accuracy do not change, and they basically froze from the first epoch. These are the training step.
These are the first 6 epochs and here the last 6 epochs.
The Loss looks like this.
Training loss is very high, and the loss for validation doesn't change.
and the training looks like this.
I am confused, is the model underfitting or overfitting?
How I am gonna fix this problem? will dropout help, since I can't do data augmentation on videos (I assumed that)?
I greatly appreciate any suggestion.
You are using 0-1 values of frames and are using relu. In dying relu problem model is frozen and doesn't learn at all because relu gets maximum values b/w 0 or the weight*input if bias is not added. You can do 2 things to ensure that model does work properly altough I am not sure whether you will get good accuracy or not but can try this to avoid this dying relu problem:-
Use leaky relu with alpha>=0.2.
Do not normalize the frames, instead just convert to grayscale to reduce extensive training.
Don't take 200 frames from middle, divide all videos in equal amount of frame chunks and take 2,3 consecutive frames from each chunk. also try adding more dense layers as they help in classification.
I worked on almost same problem and what I did was to use Conv2d after merging frames together i.e. if you have 10 frames of size 64,64,3 each instead of doing conv3d, I did conv2d on 640,64,3 dataset and resulted in 86% accuracy on 16 classes for videos.
It depends on how you use the 200frames of video as training data to classify an action. Your training data is having too much bias.
Since its a sequential data to be classified, you have to go for memory based architecture or concatenation model.