TimeDistributed(Dense()) vs Dense() after lstm

TimeDistributed(Dense()) vs Dense() after lstm - python

input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=50, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
#out = Dense(num_tags, activation="softmax")(model)
model = Model(input_word, out)
model.summary()
I get the same result when I use just Dense layer or with TimeDistributed. In which case should I use TimeDistributed?

TimeDistributed is only necessary for certain layers that cannot handle additional dimensions in their implementation. E.g. MaxPool2D only works on 2D tensors (shape batch x width x height x channels) and will crash if you, say, add a time dimension:
tfkl = tf.keras.layers
a = tf.random.normal((16, 32, 32, 3))
tfkl.MaxPool2D()(a) # this works
a = tf.random.normal((16, 5, 32, 32, 3)) # added a 5th dimension
tfkl.MaxPool2D()(a) # this will crash
Here, adding TimeDistributed will fix it:
tfkl.TimeDistributed(tfkl.MaxPool2D())(a) # works with a being 5d!
However, many layers already support arbitrary input shapes and will automatically distribute the computations over those dimensions. One of these is Dense -- it is always applied to the last axis in your input and distributed over all others, so TimeDistributed isn't necessary. In fact, as you noted, it changes nothing about the output.
Still, it may change how exactly the computation is done. I'm not sure about this, but I would wager that not using TimeDistributed and relying on the Dense implementation itself may be more efficient.

According to the book Zero to Deep Learning by Francesco Mosconi in chapter 7:
If we want the model return an output sequence to be compared with the
sequence of values in the labels, we will use the TimeDistributed
layer wrapper around our output Dense layer. This method of training
is called Teacher Forcing. If we didn’t create output sequences we
wouldn't need Teacher Forcing(i.e. wouldn't need TimeDistributed wrapper).

Related

What does applying a layer on a model do?

I'm working with the tensorflow.keras API, and I've encountered a syntax which I'm unfamiliar with, i.e., applying a layer on a sub-models' output, as shown in the following example from this tutorial:
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import resnet
target_shape = (200, 200)
base_cnn = resnet.ResNet50(
weights="imagenet", input_shape=target_shape + (3,), include_top=False
)
flatten = layers.Flatten()(base_cnn.output)
dense1 = layers.Dense(512, activation="relu")(flatten)
dense1 = layers.BatchNormalization()(dense1)
dense2 = layers.Dense(256, activation="relu")(dense1)
dense2 = layers.BatchNormalization()(dense2)
output = layers.Dense(256)(dense2)
embedding = Model(base_cnn.input, output, name="Embedding")
In the official reference of layers.Flatten for example, I couldn't find the explanation of what does applying it on a layer actually do. In the keras.Layer reference I've encountered this explanation:
call(self, inputs, *args, **kwargs): Called in call after making sure build() has been called. call() performs the logic of applying the layer to the input tensors (which should be passed in as argument).
So my question is:
What does flatten = layers.Flatten()(base_cnn.output) do?

You are creating a model based on a pre-trained model. This pre-trained model will not be actively trained with the rest of your layers unless you explicitly set trainable=True. That is, you are only interested in extracting its useful features. A flattening operation is usually used to convert a multidimensional output into a one-dimensional tensor, and that is exactly what is happening in this line: flatten = layers.Flatten()(base_cnn.output). A one-dimensional tensor is often a desirable end result of a model, especially in supervised learning. The output of the pre-trained resnet model is (None, 7, 7, 2048) and you want to generate 1D feature vectors for each input and compare them, so you flatten that output, resulting in a tensor with the shape (None, 100352) or (None, 7 * 7 * 2048).
Alternatives to Flatten would be GlobalMaxPooling2D and GlobalAveragePooling2D, which downsample an input by taking the max or average value along the spatial dimensions. For more information on this topic check out this post.

Remove intermediate/middle Layer in Keras (for Models with merge-layers)

I am trying to get a fine-tuned MobileNetV3Small running in JavaScript. Unfortunately tfjs does not support the Rescaling layer yet. That shouldn't matter too much though, since I can rescale the image beforehand. Now I would like to get rid of the Rescaling layer in the model, but am failing to do so.
tf.keras' model.layers.pop seems to be not working (see e.g. here).
So I tried to disassemble the layers, like here, skip the rescaling layer and assemble them to a model again. Problem is, that MobileNetV3 has some skip-layers which are concatenated by Add layers with several Inputs, so I end up with:
ValueError: A merge layer should be called on a list of inputs.
Any ideas on how to solve it? Every help would be greatly appreciated!
Here's the code I used for (dis)assembling:
#Creating the Model with the undesired layer
base = tf.keras.applications.MobileNetV3Small(input_shape=(224,224,3), include_top=False, weights='imagenet', minimalistic=True)
model=keras.Model(inputs=base.input, outputs=predictions)
# Dissasemble
layers = [l for l in model.layers]
new_in = keras.Input((224,224,3))
x = new_in
#Assemble again, but Skip Layer-No1, the Rescaling Layer
for idx,l in enumerate(layers[2:]):
l.trainable = False
x = l(x)
results = tf.keras.Model(inputs=new_in, outputs=x)
# Results in mentioned Error

Layers to be used after using a pretrained model: When to add GlobalAveragePooling2D()

I am using pretrained models to classify image. My question is what kind of layers do I have to add after using the pretrained model structure in my model, resp. why these two implementations differ. To be specific:
Consider two examples, one using the cats and dogs dataset:
One implementation can be found here. The crucial point is that the base model:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
weights='imagenet')
base_model.trainable = False
is frozen and a GlobalAveragePooling2D() is added, before a final tf.keras.layers.Dense(1) is added. So the model structure looks like:
model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])
which is equivalent to:
model = tf.keras.Sequential([
base_model,
tf.keras.layers.GlobalAveragePooling2D()
tf.keras.layers.Dense(1)
])
So they added not only a final dense(1) layer, but also a GlobalAveragePooling2D() layer before.
The other using the tf flowers dataset:
In this implementation it is different. A GlobalAveragePooling2D() is not added.
feature_extractor_url = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
input_shape=(224,224,3))
feature_extractor_layer.trainable = False
model = tf.keras.Sequential([
feature_extractor_layer,
layers.Dense(image_data.num_classes)
])
Where image_data.num_classes is 5 representing the different flower classification. So in this example a GlobalAveragePooling2D() layer is not added.
I do not understand this. Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
I am not sure if the reason is that in one case the dataset cats and dogs is binary classification and in the other it is a multiclass classifcation problem. Or the difference is that in one case tf.keras.applications.MobileNetV2 was used to load MobileNetV2 and in the other implementation hub.KerasLayer was used to get the feature_extractor. When I check the model in the first implementation:
I can see that the last layer is a relu activation layer.
When I check the feature_extractor:
model = tf.keras.Sequential([
feature_extractor,
tf.keras.layers.Dense(1)
])
model.summary()
I get the output:
So maybe reason is also that I do not understand the difference between tf.keras.applications.MobileNetV2 vs hub.KerasLayer. The hub.KerasLayer just gives me the feature extractor. I know this, but still I think I did not get the difference between these two methods.
I cannot check the layers of the feature_extractor itself. So feature_extractor.summary() or feature_extractor.layers does not work. How can I inspect the layers here? And how can I know I should add GlobalAveragePooling2D or not?

Summary
Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
The first case it outputs 4 dimensional tensors that are raw outputs of the last convolutional layer. So, you need to flatten them somehow, and in this example you are using GlobalAveragePooling2D (but you could use any other strategy). I can't tell which is better: it depends on your problem, and depending on how hub.KerasLayer version implemented the flatten, they could be exactly the same. That said, I'd just pickup one of them and go on: I don't see huge differences among them,
Long answer: understanding Keras implementation
The difference is in the output of both base models: in your keras examples, outputs are of shape (bz, hh, ww, nf) where bz is batch size, hh and ww are height and weight of the last convolutional layer in the model and nf is the number of filters (or convolutions) applied in this last layer.
So: this outputs the output of the last convolutions (or filters) of the base model.
Hence, you need to convert those outputs (which you can think them as images) to vectors of shape (bz, n_feats), where n_feats is the number of features the base model is computing. Once this conversion is done, you can stack your classification layer (or as many layers as you want) because at this point you have vectors.
How to compute this conversion? Some common alternatives are taking the average or maximum among the convolutional output (which reduces the size), or you could just reshape them as a single row, or add more convolutional layers until you get a vector as an output (I strongly suggest to follow usual practices like average or maximum).
In your first example, when calling tf.keras.applications.MobileNetV2, you are using the default police with respect to this last year, and hence, the last convolutional layer is let "as is": some convolutions. You can modify this behavior with the param pooling, as documented here:
pooling: Optional pooling mode for feature extraction when include_top is False.
None (default) means that the output of the model will be the 4D tensor output of the last convolutional block.
avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor.
max means that global max pooling will be applied.
In summary, in your first example, you are building the base model without telling explicitly what to do with the last layer, the model keeps returning 4 dimensional tensors that you immediately convert to vectors with the usage of average pooling, so you can avoid this explicit average pooling if you tell Keras to do it:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
pooling='avg', # Tell keras to average last layer
weights='imagenet')
base_model.trainable = False
model = tf.keras.Sequential([
base_model,
# global_average_layer, -> not needed any more
prediction_layer
])
TFHub implementation
Finally, when you use the TensorFlow Hub implementation, as you picked up the feature_vector version of the model, it already implements some kind of pooling (which I didn't found yet how) to make sure the model outputs vectors rather than 4 dimensional tensors. So, you don't need to add explicitly the layer to convert them because it is already done.
In my opinion, I prefer Keras implementation since it gives you more freedom to pick the strategy you want (in fact you could keep stacking whatever you want).

Lets say there is a model taking [1, 208, 208, 3] images and has 6 pooling layers with kernels [2, 2, 2, 2, 2, 7] which would result in a feature column for image [1, 1, 1, 2048] for 2048 filters in the last conv layer. Note, how the last pooling layer accepts [1, 7, 7, 2048] inputs
If we relax the constrains for the input image (which is typically the case for object deteciton models) than after same set of pooling layers image of size [1, 104, 208, 3] would produce pre-last-pooling output of [1, 4, 7, 2024] and [1, 256, 408, 3] would yeild [1, 8, 13, 2048]. This maps would have about the same amount information as original [1, 7, 7, 2048] but the original pooling layer would not produce a feature column wiht [1, 1, 1, N]. That is why we switch to global pooling layer.
In short, global pooling layer is important if we don't have strict restriction on the input image size (and don't resize the image as the first op in the model).

I think difference in output of models
"https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2" has output is 1d vector * batch_size, you just can't apply Conv2D to it.
Output of tf.keras.applications.MobileNetV2 probably more complex, thus you have more capability to transform one.

How to handle variable sized input in CNN with Keras?

I am trying to perform the usual classification on the MNIST database but with randomly cropped digits.
Images are cropped the following way : removed randomly first/last and/or row/column.
I would like to use a Convolutional Neural Network using Keras (and Tensorflow backend) to perform convolution and then the usual classification.
Inputs are of variable size and i can't manage to get it to work.
Here is how I cropped digits
import numpy as np
from keras.utils import to_categorical
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.images
X = np.expand_dims(X, axis=3)
X_crop = list()
for index in range(len(X)):
X_crop.append(X[index, np.random.randint(0,2):np.random.randint(7,9), np.random.randint(0,2):np.random.randint(7,9), :])
X_crop = np.array(X_crop)
y = to_categorical(digits.target)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_crop, y, train_size=0.8, test_size=0.2)
And here is the architecture of the model I want to use
from keras.layers import Dense, Dropout
from keras.layers.convolutional import Conv2D
from keras.models import Sequential
model = Sequential()
model.add(Conv2D(filters=10,
kernel_size=(3,3),
input_shape=(None, None, 1),
data_format='channels_last'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, epochs=100, batch_size=16, validation_data=(X_test, y_test))
Does someone have an idea on how to handle variable sized input in my neural network?
And how to perform classification?

TL/DR - go to point 4
So - before we get to the point - let's fix some problems with your network:
Your network will not work because of activation: with categorical_crossentropy you need to have a softmax activation:
model.add(Dense(10, activation='softmax'))
Vectorize spatial tensors: as Daniel mentioned - you need to, at some stage, switch your vectors from spatial (images) to vectorized (vectors). Currently - applying Dense to output from a Conv2D is equivalent to (1, 1) convolution. So basically - output from your network is spatial - not vectorized what causes dimensionality mismatch (you can check that by running your network or checking the model.summary(). In order to change that you need to use either GlobalMaxPooling2D or GlobalAveragePooling2D. E.g.:
model.add(Conv2D(filters=10,
kernel_size=(3, 3),
input_shape=(None, None, 1),
padding="same",
data_format='channels_last'))
model.add(GlobalMaxPooling2D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
Concatenated numpy arrays need to have the same shape: if you check the shape of X_crop you'll see that it's not a spatial matrix. It's because you concatenated matrices with different shapes. Sadly - it's impossible to overcome this issue as numpy.array need to have a fixed shape.
How to make your network train on examples of different shape: The most important thing in doing this is to understand two things. First - is that in a single batch every image should have the same size. Second - is that calling fit multiple times is a bad idea - as you reset inner model states. So here is what needs to be done:
a. Write a function which crops a single batch - e.g. a get_cropped_batches_generator which given a matrix cuts a batch out of it and crops it randomly.
b. Use train_on_batch method. Here is an example code:
from six import next
batches_generator = get_cropped_batches_generator(X, batch_size=16)
losses = list()
for epoch_nb in range(nb_of_epochs):
epoch_losses = list()
for batch_nb in range(nb_of_batches):
# cropped_x has a different shape for different batches (in general)
cropped_x, cropped_y = next(batches_generator)
current_loss = model.train_on_batch(cropped_x, cropped_y)
epoch_losses.append(current_loss)
losses.append(epoch_losses.sum() / (1.0 * len(epoch_losses))
final_loss = losses.sum() / (1.0 * len(losses))
So - a few comments to code above: First, train_on_batch doesn't use nice keras progress bar. It returns a single loss value (for a given batch) - that's why I added logic to compute loss. You could use Progbar callback for that also. Second - you need to implement get_cropped_batches_generator - I haven't written a code to keep my answer a little bit more clear. You could ask another question on how to implement it. Last thing - I use six to keep compatibility between Python 2 and Python 3.

Usually, a model containing Dense layers cannot have variable size inputs, unless the outputs are also variable. But see the workaround and also the other answer using GlobalMaxPooling2D - The workaround is equivalent to GlobalAveragePooling2D. These are layers that can eliminiate the variable size before a Dense layer and suppress the spatial dimensions.
For an image classification case, you may want to resize the images outside the model.
When my images are in numpy format, I resize them like this:
from PIL import Image
im = Image.fromarray(imgNumpy)
im = im.resize(newSize,Image.LANCZOS) #you can use options other than LANCZOS as well
imgNumpy = np.asarray(im)
Why?
A convolutional layer has its weights as filters. There is a static filter size, and the same filter is applied to the image over and over.
But a dense layer has its weights based on the input. If there is 1 input, there is a set of weights. If there are 2 inputs, you've got twice as much weights. But weights must be trained, and changing the amount of weights will definitely change the result of the model.
As #Marcin commented, what I've said is true when your input shape for Dense layers has two dimensions: (batchSize,inputFeatures).
But actually keras dense layers can accept inputs with more dimensions. These additional dimensions (which come out of the convolutional layers) can vary in size. But this would make the output of these dense layers also variable in size.
Nonetheless, at the end you will need a fixed size for classification: 10 classes and that's it. For reducing the dimensions, people often use Flatten layers, and the error will appear here.
A possible fishy workaround (not tested):
At the end of the convolutional part of the model, use a lambda layer to condense all the values in a fixed size tensor, probably taking a mean of the side dimensions and keeping the channels (channels are not variable)
Suppose the last convolutional layer is:
model.add(Conv2D(filters,kernel_size,...))
#so its output shape is (None,None,None,filters) = (batchSize,side1,side2,filters)
Let's add a lambda layer to condense the spatial dimensions and keep only the filters dimension:
import keras.backend as K
def collapseSides(x):
axis=1 #if you're using the channels_last format (default)
axis=-1 #if you're using the channels_first format
#x has shape (batchSize, side1, side2, filters)
step1 = K.mean(x,axis=axis) #mean of side1
return K.mean(step1,axis=axis) #mean of side2
#this will result in a tensor shape of (batchSize,filters)
Since the amount of filters is fixed (you have kicked out the None dimensions), the dense layers should probably work:
model.add(Lambda(collapseSides,output_shape=(filters,)))
model.add(Dense.......)
.....
In order for this to possibly work, I suggest that the number of filters in the last convolutional layer be at least 10.
With this, you can make input_shape=(None,None,1)
If you're doing this, remember that you can only pass input data with a fixed size per batch. So you have to separate your entire data in smaller batches, each batch having images all of the same size. See here: Keras misinterprets training data shape

TimeDistributed layer and return sequences etc for LSTM in Keras

Sorry I am new to RNN. I have read this post on TimeDistributed layer.
I have reshaped my data in to Keras requried [samples, time_steps, features]: [140*50*19], which means I have 140 data points, each has 50 time steps, and 19 features. My output is shaped [140*50*1]. I care more about the last data point's accuracy. This is a regression problem.
My current code is :
x = Input((None, X_train.shape[-1]) , name='input')
lstm_kwargs = { 'dropout_W': 0.25, 'return_sequences': True, 'consume_less': 'gpu'}
lstm1 = LSTM(64, name='lstm1', **lstm_kwargs)(x)
output = Dense(1, activation='relu', name='output')(lstm1)
model = Model(input=x, output=output)
sgd = SGD(lr=0.00006, momentum=0.8, decay=0, nesterov=False)
optimizer = sgd
model.compile(optimizer=optimizer, loss='mean_squared_error')
My questions are:
My case is many-to-many, so I need to use return_sequences=True? How about if I only need the last time step's prediction, it would be many-to-one. So I need to my output to be [140*1*1] and return_sequences=False?
Is there anyway to enhance my last time points accuracy if I use many-to-many? I care more about it than the other points accuracy.
I have tried to use TimeDistributed layer as
output = TimeDistributed(Dense(1, activation='relu'), name='output')(lstm1)
the performance seems to be worse than without using TimeDistributed layer. Why is this so?
I tried to use optimizer=RMSprop(lr=0.001). I thought RMSprop is supposed to stabilize the NN. But I was never able to get good result using RMSprop.
How do I choose a good lr and momentum for SGD? I have been testing on different combinations manually. Is there a cross validation method in keras?

So:
Yes - return_sequences=False makes your network to output only a last element of sequence prediction.
You could define the output slicing using the Lambda layer. Here you could find an example on how to do this. Having the output sliced you can provide the additional output where you'll feed the values of the last timestep.
From the computational point of view these two approaches are equivalent. Maybe the problem lies in randomness introduced by weight sampling.
Actually - using RMSProp as a first choice for RNNs is a rule of thumb - not a general proved law. Moreover - it is strongly adviced not to change it's parameters. So this might cause the problems. Another thing is that LSTM needs a lot of time to stabalize. Maybe you need to leave it for more epochs. Last thing - is that maybe your data could favour another activation function.
You could use a keras.sklearnWrapper.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.