Keras/TF: Time Distributed CNN+LSTM for visual recognition

I am trying to implement the model from the article (https://arxiv.org/abs/1411.4389), which basically consists of time-distributed CNNs followed by a sequence of LSTMs, using Keras with TF.
However, I am having trouble figuring out whether I should include the TimeDistributed function just for my convolutional and pooling layers or also for the LSTMs.
Is there a way to run the CNN layers in parallel (based on the number of frames in the sequence that I want to process and on the number of cores that I have)?
Finally, suppose that each entry is composed of "n" frames (in sequence), where n varies from one data entry to the next. What is the most suitable input dimension? Would "n" be the batch size? Is there a way to limit the number of CNNs in parallel to, for example, 4 (so that you get an output Y after 4 frames are processed)?
P.S.: The inputs are small videos (i.e. sequences of frames).
P.S.: The output dimension is irrelevant to my question, so it is not discussed here.
Thank you

[Edited]
Sorry, the link-only answer was bad, so I will try to answer the questions one by one.
if I should include the TimeDistributed function just for my Convolutional & Pooling Layers or also for the LSTMs?
Use the TimeDistributed function only for the Conv and Pooling layers; there is no need for it on the LSTMs.
Is there a way to run the CNN Layers in parallel?
Not if you use a CPU. It is possible if you use GPUs.
Transparent Multi-GPU Training on TensorFlow with Keras
what is the best suitable input dimension?
Five: (batch, time, width, height, channel).
Is there a way to limit the number of CNNs in parallel to for example 4
You can do this in preprocessing, by manually grouping frames into clips of a fixed length, not in the network. In other words, the "time" dimension should be 4 if you want an output after every 4 frames are processed.
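from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Dense, Dropout, Flatten, LSTM,
                                     MaxPooling2D, TimeDistributed)

# `data` is assumed to be defined elsewhere and to expose
# num_frames, width, height and num_classes.
model = Sequential()
# the same CNN is applied to every frame of the sequence
model.add(
    TimeDistributed(
        Conv2D(64, (3, 3), activation='relu'),
        input_shape=(data.num_frames, data.width, data.height, 1)
    )
)
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(1, 1))))
model.add(TimeDistributed(Conv2D(128, (4, 4), activation='relu')))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(2, 2))))
model.add(TimeDistributed(Conv2D(256, (4, 4), activation='relu')))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(2, 2))))
# extract features and dropout
model.add(TimeDistributed(Flatten()))
model.add(Dropout(0.5))
# input to LSTM (no TimeDistributed wrapper: the LSTM consumes the whole sequence)
model.add(LSTM(256, return_sequences=False, dropout=0.5))
# classifier with sigmoid activation for multilabel
model.add(Dense(data.num_classes, activation='sigmoid'))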
Reference:
PRI-MATRIX FACTORIZATION - BENCHMARK
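The preprocessing step suggested above (grouping frames so that the "time" dimension is always 4) could be sketched roughly as follows. This is a minimal illustration in plain Python, not code from the answer: split_into_clips and pad_frame are hypothetical names, and integers stand in for frames that would in practice be arrays of shape (width, height, channel).

```python
def split_into_clips(frames, clip_len=4, pad_frame=None):
    """Group frames into consecutive clips of exactly clip_len frames.

    The last clip is padded with pad_frame when the number of frames
    is not a multiple of clip_len.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    clips = []
    for start in range(0, len(frames), clip_len):
        clip = list(frames[start:start + clip_len])
        # pad the final, shorter clip up to clip_len
        clip += [pad_frame] * (clip_len - len(clip))
        clips.append(clip)
    return clips


# Hypothetical video with 10 frames; integers stand in for image arrays.
video = list(range(10))
clips = split_into_clips(video, clip_len=4, pad_frame=-1)
print(clips)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, -1, -1]]
```

With real frames, stacking the resulting clips would give an array of shape (num_clips, 4, width, height, channel), which matches the five-dimensional input described above, with the clips along the batch axis.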

Related

Sigmoid activation output layer produce Many near-1 value

:)
I have a dataset of ~16,000 .wav recordings from 70 bird species.
I'm training a model with TensorFlow to classify the mel-spectrograms of these recordings using convolution-based architectures.
One of the architectures used is the simple multi-layer convolutional network described below.
The pre-processing phase includes:
extract mel-spectrograms and convert to dB scale
segment the audio into 1-second segments (pad with zeros or Gaussian noise if the residual is longer than 250 ms, discard it otherwise)
z-score normalization of the training data: subtract the mean and divide the result by the std
Pre-processing at inference:
same as described above
z-score normalization using the training data: subtract the mean (of the training data) and divide the result by the std (of the training data)
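The normalization scheme described above, where the statistics are computed on the training data only and then reused at inference, can be sketched as follows. This is an illustrative example with made-up dB values; fit_zscore and apply_zscore are hypothetical helper names, and real spectrograms would be arrays rather than short lists.

```python
from statistics import fmean, pstdev


def fit_zscore(train_values):
    """Compute mean and standard deviation from the training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    if std == 0:
        raise ValueError("training data has zero variance")
    return mean, std


def apply_zscore(values, mean, std):
    """Normalize values with previously fitted training statistics."""
    return [(v - mean) / std for v in values]


# Made-up dB values standing in for spectrogram bins.
train_db = [-80.0, -60.0, -40.0, -20.0]
test_db = [-50.0, -30.0]

mean, std = fit_zscore(train_db)
train_norm = apply_zscore(train_db, mean, std)
# At inference, reuse the training mean/std instead of refitting.
test_norm = apply_zscore(test_db, mean, std)
print(mean, test_norm[0])  # -50.0 0.0
```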
I understand that the output layer's probabilities with sigmoid activation are not supposed to sum to 1, but I get many (8-10) very high predicted probabilities (~0.999), and some are exactly 0.5.
The current correct classification rate on the test set is ~84%, tested with 10-fold cross validation, so it seems that the network mostly works well.
Notes:
1. I understand there are similar features in the vocalizations of different bird species, but the resulting probabilities don't seem to reflect them correctly.
2. Example probabilities for a recording of natural noise:
Natural noise: 0.999
Mallard: 0.981
I'm trying to understand the reason for these results: whether it is related to the data (e.g. extensive mislabeling, probably not) or comes from another source.
Any help will be much appreciated! :)
EDIT: I use sigmoid because the probabilities of all classes are necessary, and I don't need them to sum to 1.
import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, InputLayer,
                                     MaxPooling2D)


def convnet1(input_shape, numClasses, activation='softmax'):
    # Define the network
    # (note: the `activation` argument is not used; the output layer is always sigmoid)
    model = tf.keras.Sequential()
    model.add(InputLayer(input_shape=input_shape))
    # model.add(Augmentations1(p=0.5, freq_type='mel', max_aug=2))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 1)))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 1)))
    model.add(Conv2D(128, (5, 5), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(256, (5, 5), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(Flatten())
    # model.add(Dense(numClasses, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(numClasses, activation='sigmoid'))
    model.compile(
        loss='categorical_crossentropy',
        metrics=['accuracy'],
        optimizer=optimizers.Adam(learning_rate=0.001),
        run_eagerly=False)  # set to True to debug and use regular functions inside layers: print(), save() etc.
    return model
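To make the point about sigmoid outputs concrete, here is a small sketch comparing sigmoid and softmax on the same made-up logits. The logit values are hypothetical and chosen only for illustration; they are not taken from the model in the question.

```python
import math


def sigmoid(logits):
    """Element-wise sigmoid: each class is scored independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Softmax: scores compete and are normalized to sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [7.0, 4.0, 0.0, -3.0]  # hypothetical pre-activation values
sig = sigmoid(logits)
soft = softmax(logits)

print([round(p, 3) for p in sig])
print(round(sum(sig), 3), round(sum(soft), 3))
```

Because each sigmoid output depends only on its own logit, several classes can be close to 1 at the same time, and an output of exactly 0.5 corresponds to a logit of exactly 0.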

For future searches: this problem was solved, and the reason was found(!).
The initial batch size was 256 or 512. Reducing the batch size to 16 or 32 solved the problem, and now the differences in probabilities are as expected for both training and test set samples: very high for the correct label and very low for the other classes.

Training a neural network without having training samples for all possible target values

Let's say I have a dataset containing many time series from a leg-worn accelerometer sensor. Each time series has peaks, and the peaks correspond to jumps made by the person wearing the sensor. I now want to use the data to train a convolutional neural network to predict how many jumps were performed in a given time series.
The problem I am having is that the CNN should work for any number of peaks/jumps. Obviously, it is impossible to generate a training dataset that provides training samples for every possible number of jumps/peaks, since the dataset would then have to be infinitely large. However, as far as I know, in multiclass classification the final layer of a CNN must contain as many nodes as there are possible outcomes.
How should the final layer be designed in order to predict any possible number of peaks between 0 and infinity? Is this even possible?
As an example, find my very basic CNN setup here:
model = keras.Sequential()
model.add(Conv1D(filters=32, kernel_size=2, activation = 'relu', strides = 1, padding = 'same', input_shape=(3200, 1)))
model.add(Dropout(0.3))
model.add(Conv1D(filters=64, kernel_size=2, activation = 'relu', strides = 1, padding = 'same'))
model.add(Flatten())
model.add(Dense(32, activation = 'relu'))
model.add(Dense(?, activation='softmax')) # ? represents the infinite number of output units I am asking about in the question

What does the filter parameter mean in Conv2d layer?

I am confused by the filters parameter, which is the first parameter of the Conv2D() layer function in Keras. As I understand it, filters are supposed to do things like edge detection, sharpening or blurring the image, but when I define the model as
input_shape = (32, 32, 3)
model = Sequential()
model.add( Conv2D(64, kernel_size=(5, 5), activation='relu', input_shape=input_shape, strides=(1,1), padding='same') )
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2,2)))
model.add(Conv2D(64, kernel_size=(5, 5), activation='relu', input_shape=input_shape, strides=(1,1), padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2,2)))
model.add(Conv2D(128, kernel_size=(5, 5), activation='relu', input_shape=input_shape, strides=(1,1), padding='same'))
model.add(Flatten())
model.add(Dense(3072, activation='relu'))
model.add(Dense(2048, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
I am not mentioning edge detection, blurring or sharpening anywhere in the Conv2D function. The input images are 32 by 32 RGB images.
So my question is: when I define the convolution layer as Conv2D(64, ...), does this 64 mean 64 different types of filters, such as vertical edge, horizontal edge, etc., which are chosen by Keras at random? If so, is the output of the convolution layer (with 64 filters, a 5x5 kernel and 1x1 stride) on a 32x32 1-channel image 64 images of 28x28 size each? How are these 64 images combined to form a single image for further layers?

The filters argument sets the number of convolutional filters in that layer. These filters are initialized to small, random values, using the method specified by the kernel_initializer argument. During network training, the filters are updated in a way that minimizes the loss. So over the course of training, the filters will learn to detect certain features, like edges and textures.
It is very important to realize that one does not hand-craft filters. These are learned automatically during training -- that's the beauty of deep learning.
I would highly recommend going through some deep learning resources, particularly https://cs231n.github.io/convolutional-networks/ and https://www.youtube.com/watch?v=r5nXYc2wYvI&list=PLypiXJdtIca5sxV7aE3-PS9fYX3vUdIOX&index=3&t=3122s.
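As a rough illustration of what the filters argument controls, the sketch below counts the learnable parameters of a convolution layer by hand. It assumes the standard Keras kernel layout (kernel_height, kernel_width, in_channels, filters) plus one bias per filter; conv2d_param_count is a hypothetical helper, not a Keras function.

```python
def conv2d_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Number of learnable values in a 2D convolution layer."""
    kh, kw = kernel_size
    weights = kh * kw * in_channels * filters  # one (kh, kw, in_channels) filter per output channel
    biases = filters if use_bias else 0
    return weights + biases


# First layer of the question's model: Conv2D(64, (5, 5)) on a 32x32 RGB image.
print(conv2d_param_count((5, 5), in_channels=3, filters=64))   # 4864
# Second Conv2D(64, (5, 5)) layer: its input now has 64 channels.
print(conv2d_param_count((5, 5), in_channels=64, filters=64))  # 102464
```

Each filter spans all input channels, so the 64 outputs of one layer are not merged into a single image; they become the 64 input channels of the next layer.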

Just wanted to clarify what the output shape was.
Although jakub's answer was good, I don't think it addressed the "single image for further layers" part of the question.
I did a model.summary() to find out more.
I found that the shape returned from a Conv2D is (None, img_width, img_height, num_filters)
So when you pass the output of the Conv2D to MaxPooling, you are passing that shape, which means it is basically passing each entire convolved image.
The other layers handle this gracefully. MaxPooling2D(2,2) returns the same shape but with half the image size: (None, img_width / 2, img_height / 2, num_filters).
Side note: I wish the filters argument was named num_filters, because filters seems to imply you're passing in a list of filters with which to convolve the image.
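The shapes reported by model.summary() can also be worked out by hand. The sketch below is a minimal illustration assuming the usual output-size rules for 'valid' and 'same' padding; conv_output_size and pool_output_size are hypothetical helpers, not part of Keras.

```python
import math


def conv_output_size(size, kernel, stride=1, padding="valid"):
    """Output size along one spatial axis of a convolution."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1


def pool_output_size(size, pool=2, stride=2):
    """Output size along one spatial axis of max pooling ('valid' padding)."""
    return (size - pool) // stride + 1


# A 5x5 kernel with stride 1 on a 32x32 input:
print(conv_output_size(32, 5, padding="valid"))  # 28
print(conv_output_size(32, 5, padding="same"))   # 32

# Trace the question's model (padding='same' everywhere, two 2x2 poolings):
size = 32
size = pool_output_size(conv_output_size(size, 5, padding="same"))  # 16, with 64 channels
size = pool_output_size(conv_output_size(size, 5, padding="same"))  # 8, with 64 channels
size = conv_output_size(size, 5, padding="same")                    # 8, with 128 channels
print(size, size * size * 128)  # 8 8192
```

So the 28x28 size mentioned in the question would apply with padding='valid'; with padding='same', as in the posted code, the spatial size stays 32x32 and only the channel count changes to 64.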

Validation Accuracy of CNN not increasing

I have around 8200 images for a face detection task. 4800 of them contain human faces. The other 3400 contain 3D human face masks (made of rubber/latex), human cartoon faces, and monkey faces. I want to detect whether a given image contains a real human face or not.
I have trained numerous networks while changing hyperparameters, but every time my training accuracy shoots up to over 98% while validation accuracy stays at around 60-70%. I have tried networks containing 3-5 Conv layers and one FC layer. I used L2 regularization, batch norm, data augmentation and dropout to reduce overfitting. I then tried reducing the learning rate of the Adam optimizer as training progressed. I trained the network for more than 100 epochs, and sometimes up to 200 epochs. However, the best validation accuracy (on 20% of the dataset) I could achieve was 71%. Is there any way to improve the validation accuracy above 85%?
I used the following architecture with an input image size of 256*256*3 and trained it with a batch size of 16.
regularizer = tf.keras.regularizers.l2(l=0.005)
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (5, 5), strides=(2, 2), activation='relu', input_shape=(256, 256, 3), kernel_regularizer=regularizer),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(96, (5, 5), padding='same', activation='relu', kernel_regularizer=None),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), padding='same', activation='relu', kernel_regularizer=None),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(256, (3, 3), padding='same', activation='relu', kernel_regularizer=None),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    #tf.keras.layers.Dense(2048, activation='relu', kernel_regularizer=regularizer),
    tf.keras.layers.Dense(4096, activation='relu', kernel_regularizer=None),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(1, activation='sigmoid', kernel_regularizer=regularizer)
])

Make sure you are using all available forms of data augmentation (scaling, rotation, translation, flips, etc.).
Use kernel regularizers on all layers.
Add SpatialDropout2D after all Conv layers.
Add BatchNormalization after all Conv and Dense layers (except for the last Dense/sigmoid one, obviously).
Reduce the size of your network (fewer layers and/or fewer filters/units per layer); you want the smallest possible network that can still learn the training data.
If all of those combined are not enough to get good validation accuracy, then you probably just don't have enough data.
A few tips that probably won't reduce overfitting, but tend to be helpful in general:
Prefer sequences of 3x3 kernel conv layers rather than single conv layers with 5x5 or larger kernels.
Replace the Flatten layer with a GlobalAveragePooling layer, and probably remove all Dense layers except the last one.
Use either stride=2 or MaxPooling, not both.
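To give a feel for the size-related tips above, the following sketch counts by hand the parameters in the classifier head of the question's architecture, once with Flatten + Dense(4096) and once with global average pooling (GlobalAveragePooling2D in Keras) followed directly by Dense(1). The helper names are hypothetical, and the arithmetic assumes the standard Keras shape rules ('valid' padding by default, floor division for pooling).

```python
import math


def conv_out(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a Conv2D layer along one axis."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Spatial output size of MaxPooling2D(pool, pool) with 'valid' padding."""
    return size // pool


def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return inputs * units + units


# Trace the spatial size through the question's four Conv/MaxPooling stages.
size = pool_out(conv_out(256, 5, stride=2))  # Conv2D(64, 5x5, strides=2) + pooling
for _ in range(3):                           # three 'same'-padded convs, each followed by pooling
    size = pool_out(conv_out(size, 3, padding="same"))

channels = 256
flat_features = size * size * channels

# Head as posted: Flatten -> Dense(4096) -> Dense(1)
flatten_head = dense_params(flat_features, 4096) + dense_params(4096, 1)
# Suggested alternative: global average pooling -> Dense(1)
gap_head = dense_params(channels, 1)

print(size, flat_features)  # 7 12544
print(flatten_head)         # 51388417
print(gap_head)             # 257
```

Under these assumptions, almost all of the model's parameters sit in the Dense(4096) layer, which is consistent with the advice to shrink the network when only about 8200 images are available.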

Accuracy in a CNN model never goes high for training and validation set

I am training a CNN model on the KTH dataset to detect 6 classes of human actions.
Data Processing
The dataset consists of 599 videos; each action has 99-100 videos performed by 25 different persons. I divided the data into 300 videos for training, 98 videos for validation and 200 videos for testing.
I reduced the resolution to 50x50 pixels, so I don't run out of memory while processing.
I extracted 200 frames from the middle of each video.
I normalized the pixels from 0-255 to 0-1.
Finally, I one-hot encoded the class labels.
Model architecture
This is my model architecture.
And this is the code of the NN layers.
model = Sequential()
model.add(Conv3D(filters=64,
                 kernel_size=(3, 3, 3),
                 strides=(1, 1, 1),
                 padding='valid',
                 activation='relu',
                 input_shape=X_train.shape[1:]))
model.add(MaxPooling3D(pool_size=2,
                       strides=(2, 2, 2),
                       padding='same'))
model.add(Conv3D(filters=128,
                 kernel_size=(3, 3, 3),
                 strides=(1, 1, 1),
                 padding='valid',
                 activation='relu'))
model.add(MaxPooling3D(pool_size=2,
                       strides=(2, 2, 2),
                       padding='same'))
model.add(Conv3D(filters=256,
                 kernel_size=(3, 3, 3),
                 strides=(1, 1, 1),
                 padding='valid',
                 activation='relu'))
model.add(Conv3D(filters=256,
                 kernel_size=(3, 3, 3),
                 strides=(1, 1, 1),
                 padding='valid',
                 activation='relu'))
model.add(MaxPooling3D(pool_size=2,
                       strides=(2, 2, 2),
                       padding='same'))
model.add(Conv3D(filters=512,
                 kernel_size=(3, 3, 3),
                 strides=(1, 1, 1),
                 padding='valid',
                 activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
#model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(6, activation='softmax'))
model.summary()
Training
My problem is that both training and validation accuracy do not change; they have basically been frozen since the first epoch. These are the training steps.
These are the first 6 epochs and here are the last 6 epochs.
The loss looks like this.
Training loss is very high, and the validation loss doesn't change.
And the training looks like this.
I am confused: is the model underfitting or overfitting?
How can I fix this problem? Will dropout help, since I can't do data augmentation on videos (I assume)?
I would greatly appreciate any suggestions.

You are using frame values in the 0-1 range together with ReLU. In the dying ReLU problem the model freezes and does not learn at all, because ReLU outputs the maximum of 0 and weight*input (if no bias is added). You can do a few things to make sure the model works properly; I am not sure whether you will get good accuracy, but you can try these to avoid the dying ReLU problem:
Use leaky ReLU with alpha>=0.2.
Do not normalize the frames; instead, just convert them to grayscale to reduce the training cost.
Don't take 200 frames from the middle; divide every video into the same number of frame chunks and take 2 or 3 consecutive frames from each chunk. Also try adding more dense layers, as they help with classification.
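The first and third suggestions above can be sketched in plain Python. This is only an illustration under assumptions: sample_chunk_indices is a hypothetical helper, taking the first frames of each chunk is one arbitrary choice, and the chunk and frame counts are made up.

```python
def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.2):
    """Leaky ReLU: negative inputs keep a small slope alpha instead of becoming zero."""
    return x if x > 0 else alpha * x


def sample_chunk_indices(num_frames, num_chunks, frames_per_chunk=2):
    """Split the video into num_chunks equal chunks and take the first
    frames_per_chunk consecutive frame indices from each chunk."""
    chunk_len = num_frames // num_chunks
    if chunk_len < frames_per_chunk:
        raise ValueError("chunks are too short for the requested number of frames")
    indices = []
    for c in range(num_chunks):
        start = c * chunk_len
        indices.extend(range(start, start + frames_per_chunk))
    return indices


print(relu(-2.0), leaky_relu(-2.0))  # 0.0 -0.4
print(sample_chunk_indices(12, 3, frames_per_chunk=2))  # [0, 1, 4, 5, 8, 9]
```

Sampling this way covers the whole video with a fixed number of frames, instead of looking only at a window in the middle.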
I worked on almost the same problem, and what I did was use Conv2D after merging frames together, i.e. if you have 10 frames of size 64,64,3 each, instead of doing Conv3D I did Conv2D on the 640,64,3 data, which resulted in 86% accuracy on 16 classes for videos.

It depends on how you use the 200 frames of each video as training data to classify an action. Your training data has too much bias.
Since it is sequential data that has to be classified, you should go for a memory-based architecture or a concatenation model.
