What does applying a layer on a model do?

I'm working with the tensorflow.keras API, and I've encountered a syntax which I'm unfamiliar with, i.e., applying a layer to a sub-model's output, as shown in the following example from this tutorial:
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import resnet
target_shape = (200, 200)
base_cnn = resnet.ResNet50(
weights="imagenet", input_shape=target_shape + (3,), include_top=False
)
flatten = layers.Flatten()(base_cnn.output)
dense1 = layers.Dense(512, activation="relu")(flatten)
dense1 = layers.BatchNormalization()(dense1)
dense2 = layers.Dense(256, activation="relu")(dense1)
dense2 = layers.BatchNormalization()(dense2)
output = layers.Dense(256)(dense2)
embedding = Model(base_cnn.input, output, name="Embedding")
In the official reference for layers.Flatten, for example, I couldn't find an explanation of what applying it to a layer's output actually does. In the keras.Layer reference I found this explanation:
call(self, inputs, *args, **kwargs): Called in __call__ after making sure build() has been called. call() performs the logic of applying the layer to the input tensors (which should be passed in as argument).
So my question is:
What does flatten = layers.Flatten()(base_cnn.output) do?

You are creating a model based on a pre-trained model. If you are only interested in extracting its useful features, you would typically freeze it (for example with base_cnn.trainable = False); note that Keras layers are trainable by default, so without freezing, the pre-trained weights are updated together with the rest of your layers. A flattening operation is usually used to convert a multidimensional output into a one-dimensional tensor per sample, and that is exactly what is happening in this line: flatten = layers.Flatten()(base_cnn.output). A one-dimensional tensor is often a desirable end result of a model, especially in supervised learning. The output of the pre-trained resnet model has shape (None, 7, 7, 2048) and you want to generate 1D feature vectors for each input and compare them, so you flatten that output, resulting in a tensor with the shape (None, 100352), i.e. (None, 7 * 7 * 2048).
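As for the syntax itself: layers.Flatten()(base_cnn.output) is two steps written on one line. layers.Flatten() constructs a layer object, and the trailing (base_cnn.output) calls that object on a tensor and returns a new tensor. The following is a minimal plain-Python sketch of that pattern, in line with the call()/build() description quoted in the question. ToyLayer and ToyFlatten are hypothetical stand-ins, not Keras classes, and nested lists stand in for tensors.

```python
class ToyLayer:
    """Hypothetical stand-in for a Keras layer (not the real API)."""

    def __init__(self):
        self.built = False

    def build(self, inputs):
        # A real layer would create its weights here, once.
        self.built = True

    def call(self, inputs):
        raise NotImplementedError

    def __call__(self, inputs):
        # Calling the object runs build() on first use, then call().
        if not self.built:
            self.build(inputs)
        return self.call(inputs)


class ToyFlatten(ToyLayer):
    def call(self, inputs):
        # Keep the batch axis; collapse everything else into one axis.
        return [self._flat(sample) for sample in inputs]

    def _flat(self, x):
        if not isinstance(x, list):
            return [x]
        out = []
        for item in x:
            out.extend(self._flat(item))
        return out


batch = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # shape (2, 2, 2)

layer = ToyFlatten()          # step 1: construct the layer
flat = layer(batch)           # step 2: apply it to an input
same = ToyFlatten()(batch)    # both steps on one line

print(flat)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The one-line form simply skips naming the layer object, which is fine when the layer is used only once.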
Alternatives to Flatten would be GlobalMaxPooling2D and GlobalAveragePooling2D, which downsample an input by taking the max or average value along the spatial dimensions. For more information on this topic check out this post.
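To compare the output sizes of the two approaches, here is a small framework-free sketch of the shape arithmetic only. The helper names flatten_shape and global_pool_shape are made up for illustration and are not part of Keras.

```python
from math import prod


def flatten_shape(shape):
    """Shape after Flatten: keep the batch axis, multiply the rest."""
    batch, *rest = shape
    return (batch, prod(rest))


def global_pool_shape(shape):
    """Shape after global max/average pooling on (batch, h, w, channels)."""
    batch, _h, _w, channels = shape
    return (batch, channels)


resnet_out = (None, 7, 7, 2048)       # shape quoted in the answer above
print(flatten_shape(resnet_out))      # (None, 100352)
print(global_pool_shape(resnet_out))  # (None, 2048)
```

Flatten keeps every activation, so the next Dense layer sees 100352 inputs; global pooling keeps one value per channel, so it sees 2048.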

Related

TimeDistributed(Dense()) vs Dense() after lstm

input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=50, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
#out = Dense(num_tags, activation="softmax")(model)
model = Model(input_word, out)
model.summary()
I get the same result when I use just Dense layer or with TimeDistributed. In which case should I use TimeDistributed?

TimeDistributed is only necessary for certain layers that cannot handle additional dimensions in their implementation. E.g. MaxPool2D only works on 4D tensors (shape batch x height x width x channels) and will crash if you, say, add a time dimension:
import tensorflow as tf
tfkl = tf.keras.layers
a = tf.random.normal((16, 32, 32, 3))
tfkl.MaxPool2D()(a) # this works
a = tf.random.normal((16, 5, 32, 32, 3)) # added a 5th dimension
tfkl.MaxPool2D()(a) # this will crash
Here, adding TimeDistributed will fix it:
tfkl.TimeDistributed(tfkl.MaxPool2D())(a) # works with a being 5d!
However, many layers already support arbitrary input shapes and will automatically distribute the computations over those dimensions. One of these is Dense -- it is always applied to the last axis in your input and distributed over all others, so TimeDistributed isn't necessary. In fact, as you noted, it changes nothing about the output.
Still, it may change how exactly the computation is done. I'm not sure about this, but I would wager that not using TimeDistributed and relying on the Dense implementation itself may be more efficient.
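To illustrate why the two variants give the same output, here is a plain-Python sketch with nested lists in place of tensors. The functions dense, dense_last_axis and time_distributed are simplified stand-ins written for this example, not the Keras implementations.

```python
def dense(vec, weights, bias):
    """Dense on one vector: out[j] = sum_i vec[i] * weights[i][j] + bias[j]."""
    return [
        sum(v * w_row[j] for v, w_row in zip(vec, weights)) + bias[j]
        for j in range(len(bias))
    ]


def dense_last_axis(x, weights, bias):
    """Apply `dense` to the last axis of a nested list of any depth."""
    if not isinstance(x[0], list):
        return dense(x, weights, bias)
    return [dense_last_axis(item, weights, bias) for item in x]


def time_distributed(fn, x):
    """Apply `fn` to every timestep of a (batch, time, ...) nested list."""
    return [[fn(step) for step in sample] for sample in x]


weights = [[1, 0], [0, 1], [1, 1]]   # 3 input features -> 2 units
bias = [0, 10]
x = [[[1, 2, 3], [4, 5, 6]]]         # shape (batch=1, time=2, features=3)

direct = dense_last_axis(x, weights, bias)
wrapped = time_distributed(lambda step: dense(step, weights, bias), x)
print(direct)             # [[[4, 15], [10, 21]]]
print(direct == wrapped)  # True
```

Both paths apply the same weights to each timestep independently, which is why wrapping Dense changes nothing about the result.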

According to the book Zero to Deep Learning by Francesco Mosconi in chapter 7:
If we want the model return an output sequence to be compared with the
sequence of values in the labels, we will use the TimeDistributed
layer wrapper around our output Dense layer. This method of training
is called Teacher Forcing. If we didn’t create output sequences we
wouldn't need Teacher Forcing(i.e. wouldn't need TimeDistributed wrapper).

How to increase the rank (ndim) of input of BERT keras hub layer for learning-to-rank

I am trying to implement a learning-to-rank model using a pre-trained BERT available on tensorflow hub. I am using a variation of ListNet loss function, which requires each training instance to be a list of several ranked documents in relation to a query. I need the model to be able to accept data in a shape (batch_size, list_size, sentence_length), where the model loops over the 'list_size' axis in each training instance, returns the ranks and passes them to the loss function. In a simple model that only consists of dense layers, this is easily done by augmenting the dimensions of the input layer. For example:
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Model
input = Input([6,10])
x = Dense(20,activation='relu')(input)
output = Dense(1, activation='sigmoid')(x)
model = Model(inputs=input, outputs=output)
...now the model will perform 6 forward passes over vectors of length 10 before calculating the loss and updating gradients.
I am trying to do the same with the BERT model and its preprocessing layer:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
bert_preprocess_model = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
bert_model = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1')
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
processed_input = bert_preprocess_model(text_input)
output = bert_model(processed_input)
model = tf.keras.Model(text_input, output)
But when I try to change the shape of 'text_input' to, say, (6,), or meddle with it in any way really, it always results in the same type of error:
ValueError: Could not find matching function to call loaded from the SavedModel. Got:
Positional arguments (3 total):
* Tensor("inputs:0", shape=(None, 6), dtype=string)
* False
* None
Keyword arguments: {}
Expected these arguments to match one of the following 4 option(s):
Option 1:
Positional arguments (3 total):
* TensorSpec(shape=(None,), dtype=tf.string, name='sentences')
* False
* None
Keyword arguments: {}
....
As per https://www.tensorflow.org/hub/api_docs/python/hub/KerasLayer, it seems like you can configure the input shape of hub.KerasLayer via tf.keras.layers.InputSpec. In my case, I guess it would be something like this:
bert_preprocess_model.input_spec = tf.keras.layers.InputSpec(ndim=2)
bert_model.input_spec = tf.keras.layers.InputSpec(ndim=2)
When I run the above code, the attributes indeed get changed, but when trying to build the model, the same exact error appears.
Is there any way to easily resolve this without the necessity to create a custom training loop?

Suppose you have a batch of B examples, each with exactly N text strings, which makes a 2-dimensional Tensor of shape [B, N]. With tf.reshape(), you can turn that into a 1-dimensional tensor of shape [B*N], send it through BERT (which preserves the order of inputs) and then reshape it back to [B,N]. (There's also tf.keras.layers.Reshape, but that hides the batch dimension from you.)
If it's not exactly N text strings each time, you'll have to do some bookkeeping on the side (e.g., store inputs in a tf.RaggedTensor, run BERT on its .values, and construct a new RaggedTensor with the same .row_splits from the result.)
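The bookkeeping for the fixed-N case can be sketched in plain Python, with lists in place of tensors. fake_encoder is a hypothetical stand-in for the BERT layers (it only needs to return one output per input, in order); with real tensors the two steps would be tf.reshape calls, for example to shape [-1] and back to [-1, N].

```python
def flatten_batch(nested):
    """[B][N] -> flat list of B*N items, row by row."""
    return [item for row in nested for item in row]


def regroup(flat, n):
    """Flat list of B*N items -> [B][N], undoing flatten_batch."""
    return [flat[i:i + n] for i in range(0, len(flat), n)]


def fake_encoder(texts):
    """Hypothetical stand-in for BERT: one output per input, order kept."""
    return [len(t) for t in texts]


batch = [["a", "bb", "ccc"], ["dddd", "ee", "f"]]   # B=2, N=3
n = len(batch[0])

flat = flatten_batch(batch)          # 6 strings, shape [B*N]
scores = fake_encoder(flat)          # 6 outputs, same order
per_list = regroup(scores, n)        # back to [B][N]
print(per_list)  # [[1, 2, 3], [4, 2, 1]]
```

Because the encoder keeps the input order, item j of list i ends up at position [i][j] again after regrouping.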

Layers to be used after using a pretrained model: When to add GlobalAveragePooling2D()

I am using pretrained models to classify images. My question is what kind of layers I have to add after the pretrained model structure in my model, and why these two implementations differ. To be specific:
Consider two examples, one using the cats and dogs dataset:
One implementation can be found here. The crucial point is that the base model:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
weights='imagenet')
base_model.trainable = False
is frozen and a GlobalAveragePooling2D() is added, before a final tf.keras.layers.Dense(1) is added. So the model structure looks like:
model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])
which is equivalent to:
model = tf.keras.Sequential([
base_model,
tf.keras.layers.GlobalAveragePooling2D(),
tf.keras.layers.Dense(1)
])
So they added not only a final dense(1) layer, but also a GlobalAveragePooling2D() layer before.
The other using the tf flowers dataset:
In this implementation it is different. A GlobalAveragePooling2D() is not added.
feature_extractor_url = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
input_shape=(224,224,3))
feature_extractor_layer.trainable = False
model = tf.keras.Sequential([
feature_extractor_layer,
layers.Dense(image_data.num_classes)
])
where image_data.num_classes is 5, the number of flower classes. So in this example a GlobalAveragePooling2D() layer is not added.
I do not understand this. Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
I am not sure whether the reason is that in one case (cats and dogs) the task is binary classification and in the other it is a multiclass classification problem, or whether the difference is that in one case tf.keras.applications.MobileNetV2 was used to load MobileNetV2 and in the other implementation hub.KerasLayer was used to get the feature_extractor. When I check the model in the first implementation:
I can see that the last layer is a relu activation layer.
When I check the feature_extractor:
model = tf.keras.Sequential([
feature_extractor,
tf.keras.layers.Dense(1)
])
model.summary()
I get the output:
So maybe the reason is also that I do not understand the difference between tf.keras.applications.MobileNetV2 and hub.KerasLayer. hub.KerasLayer just gives me the feature extractor. I know this, but I still think I don't get the difference between these two methods.
I cannot check the layers of the feature_extractor itself. So feature_extractor.summary() or feature_extractor.layers does not work. How can I inspect the layers here? And how can I know I should add GlobalAveragePooling2D or not?

Summary
Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
In the first case, the model outputs 4-dimensional tensors that are the raw outputs of the last convolutional layer. So you need to flatten them somehow, and in this example you are using GlobalAveragePooling2D (but you could use any other strategy). I can't tell which is better: it depends on your problem, and depending on how the hub.KerasLayer version implements the flattening, they could be exactly the same. That said, I'd just pick one of them and move on: I don't see huge differences between them.
Long answer: understanding Keras implementation
The difference is in the output of the two base models: in your Keras example, outputs are of shape (bz, hh, ww, nf), where bz is the batch size, hh and ww are the height and width of the output of the last convolutional layer in the model, and nf is the number of filters (or convolutions) applied in this last layer.
So: this returns the output of the last convolutions (or filters) of the base model.
Hence, you need to convert those outputs (which you can think of as images) to vectors of shape (bz, n_feats), where n_feats is the number of features the base model is computing. Once this conversion is done, you can stack your classification layer (or as many layers as you want) because at this point you have vectors.
How to compute this conversion? Some common alternatives are taking the average or maximum over the convolutional output (which reduces the size), or you could just reshape it into a single row, or add more convolutional layers until you get a vector as output (I strongly suggest following usual practices like average or maximum).
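As a concrete picture of the "average" option, here is a plain-Python sketch of global average pooling on nested lists shaped (bz, hh, ww, nf). The function name global_average_pool is made up for this example; it only mirrors what the Keras layer computes, not how it is implemented.

```python
def global_average_pool(feature_maps):
    """(batch, h, w, channels) nested lists -> (batch, channels)."""
    pooled = []
    for sample in feature_maps:
        pixels = [pixel for row in sample for pixel in row]   # h*w pixels
        n_channels = len(pixels[0])
        pooled.append([
            sum(p[c] for p in pixels) / len(pixels)
            for c in range(n_channels)
        ])
    return pooled


# One sample, 2x2 spatial grid, 3 channels per pixel.
x = [[
    [[1.0, 10.0, 0.0], [3.0, 20.0, 0.0]],
    [[5.0, 30.0, 0.0], [7.0, 40.0, 4.0]],
]]
print(global_average_pool(x))  # [[4.0, 25.0, 1.0]]
```

Each channel is reduced to a single number, so the result has one feature per filter regardless of hh and ww.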
In your first example, when calling tf.keras.applications.MobileNetV2, you are using the default policy with respect to this last layer, and hence the output of the last convolutional layer is left "as is": some convolutions. You can modify this behavior with the pooling parameter, as documented here:
pooling: Optional pooling mode for feature extraction when include_top is False.
None (default) means that the output of the model will be the 4D tensor output of the last convolutional block.
avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor.
max means that global max pooling will be applied.
In summary, in your first example you build the base model without telling it explicitly what to do with the last layer, so the model keeps returning 4-dimensional tensors, which you immediately convert to vectors with average pooling. You can avoid this explicit average pooling if you tell Keras to do it:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
pooling='avg', # Tell keras to average last layer
weights='imagenet')
base_model.trainable = False
model = tf.keras.Sequential([
base_model,
# global_average_layer, -> not needed any more
prediction_layer
])
TFHub implementation
Finally, when you use the TensorFlow Hub implementation, since you picked the feature_vector version of the model, it already implements some kind of pooling (I haven't found out how yet) to make sure the model outputs vectors rather than 4-dimensional tensors. So you don't need to explicitly add a layer to convert them, because it is already done.
Personally, I prefer the Keras implementation since it gives you more freedom to pick the strategy you want (in fact you could keep stacking whatever you want).

Let's say there is a model that takes [1, 208, 208, 3] images and has 6 pooling layers with kernels [2, 2, 2, 2, 2, 7], which results in a feature column of shape [1, 1, 1, 2048] for the image, given 2048 filters in the last conv layer. Note how the last pooling layer accepts [1, 7, 7, 2048] inputs.
If we relax the constraints on the input image (which is typically the case for object detection models), then after the same set of pooling layers an image of size [1, 104, 208, 3] would produce a pre-last-pooling output of [1, 4, 7, 2048], and [1, 256, 408, 3] would yield [1, 8, 13, 2048]. These maps would carry about the same amount of information as the original [1, 7, 7, 2048], but the original pooling layer would not produce a feature column of shape [1, 1, 1, N]. That is why we switch to a global pooling layer.
In short, a global pooling layer is important if we don't have a strict restriction on the input image size (and don't resize the image as the first op in the model).
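The size arithmetic above can be checked with a short plain-Python sketch. It assumes each pooling layer uses a stride equal to its kernel and rounds sizes up (as with 'same' padding), which is an assumption chosen because it reproduces the numbers quoted above; the helper names are made up for this example.

```python
from math import ceil


def after_pools(size, kernels):
    """Spatial size after pooling layers whose stride equals the kernel.

    Assumes sizes round up at each step (as with 'same' padding).
    """
    h, w = size
    for k in kernels:
        h, w = ceil(h / k), ceil(w / k)
    return (h, w)


def after_global_pool(size):
    """Global pooling always reduces the spatial grid to 1x1."""
    return (1, 1)


for image in [(208, 208), (256, 408)]:
    before_last = after_pools(image, [2, 2, 2, 2, 2])
    fixed = after_pools(before_last, [7])
    print(image, before_last, fixed, after_global_pool(before_last))
# (208, 208) (7, 7) (1, 1) (1, 1)
# (256, 408) (8, 13) (2, 2) (1, 1)
```

Under these assumptions the fixed 7x7 pool leaves a 2x2 grid for the larger image, while the global pool gives a single feature column for any input size.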

I think the difference is in the output of the models.
"https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2" outputs a 1D vector per batch element, so you just can't apply Conv2D to it.
The output of tf.keras.applications.MobileNetV2 is probably more complex, so you have more options for transforming it.

How to correctly create a multi input neural network

I'm building a NN that takes two car images as input and classifies whether they are the same make and model. My problem is in Keras's fit method, which raises this error:
ValueError: Error when checking target: expected dense_3 to have shape (1,) but got array with shape (2,)
The network architecture is the following:
input1=Input((150,200,3))
model1=InceptionV3(include_top=False, weights='imagenet', input_tensor=input1)
model1.layers.pop()
input2=Input((150,200,3))
model2=InceptionV3(include_top=False, weights='imagenet', input_tensor=input2)
model2.layers.pop()
for layer in model2.layers:
    layer.name = "custom_layer_"+ layer.name
concat = concatenate([model1.layers[-1].output,model2.layers[-1].output])
flat = Flatten()(concat)
dense1=Dense(100, activation='relu')(flat)
do1=Dropout(0.25)(dense1)
dense2=Dense(50, activation='relu')(do1)
do2=Dropout(0.25)(dense2)
dense3=Dense(1, activation='softmax')(do2)
model = Model(inputs=[model1.input,model2.input],outputs=dense3)
My guess is that the error is due to the to_categorical method that I called on the array which stores, as 0 or 1, whether the two cars have the same make and model. Any suggestions?

Since you are doing binary classification with one-hot encoded labels, then you should change this line:
dense3=Dense(1, activation='softmax')(do2)
To:
dense3=Dense(2, activation='softmax')(do2)
Softmax with a single neuron makes no sense; two neurons should be used for binary classification with softmax activation.
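To see why a single softmax unit cannot work, here is a small sketch using only the standard library. The softmax function below is written for illustration and is not the Keras implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One unit: the output is 1.0 whatever the logit is.
print(softmax([-3.2]))  # [1.0]
print(softmax([0.0]))   # [1.0]
print(softmax([42.0]))  # [1.0]

# Two units: the outputs can express either class.
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

Softmax normalizes across the units of the layer, so with one unit there is nothing to normalize against and the prediction is constant. The shape error itself comes from the labels: after to_categorical each label has two columns, which matches a two-unit output but not a one-unit one.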

Can I use loops inside a model using functional API?

I have a trained keras model which takes inputs of size (batchSize,2). This works well and gives good results.
My main problem is to build a model which takes as input a tensor of size (batchSize,2,16), slices it inside the model into 16 vectors of size (batchSize,2), and concatenates the outputs together.
I have used this code for this
y = layers.Input(shape=(2,16,))
model_x= load_model('saved_model')
for i in range(16):
    x_input = Lambda(lambda x: x[:, :, i])(y)
    if i == 0:
        x_output = model_x(x_input)
    else:
        x_output = layers.concatenate([x_output,
                                       model_x(x_input)])
x_output = Lambda(lambda x: x[:, :tf.cast(N, tf.int32)])(x_output)
final_model = Model(y, x_output)
Although the saved model gives me good performance, this code does not train well and doesn't give the intended performance.
What can I do to get better results?

I can't say anything about the bad performance of your final model because it might be due to various reasons and this is not readily evident from the content of your question. But to answer your original question: yes, you can use for loops that way, because you are essentially creating layers/tensors and connecting them to each other (i.e. building the graph of the model). So it's a valid thing to do. The problem might be somewhere else, e.g. a wrong indexing, a wrong loss function, etc.
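One indexing issue that may be worth ruling out in the loop from the question (this is a guess, since the cause is not evident from the question): Python lambdas look up i when they run, not when they are defined. If the function wrapped by Lambda is executed again after the loop has finished, which may happen depending on the Keras version, every slice would use the last value of i. A plain-Python sketch of the effect, and of the usual fix of binding i as a default argument:

```python
row = [10, 20, 30]

# Late binding: each lambda reads `i` when it is called, after the loop.
late = [lambda x: x[i] for i in range(3)]
print([f(row) for f in late])   # [30, 30, 30]

# Binding `i` as a default argument freezes its value per iteration.
bound = [lambda x, i=i: x[i] for i in range(3)]
print([f(row) for f in bound])  # [10, 20, 30]
```

Applied to the question's code, that would mean writing the slicing function as lambda x, i=i: x[:, :, i].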
Further, you can build your final model in a much simpler way. You already have a trained model which gets inputs of shape (batch_size, 2) and gives outputs of shape (batch_size, 8). Now you want to build a model which takes inputs of shape (batch_size, 2, 16), applies the already trained model to each of the 16 (batch_size, 2) segments and then concatenates the results. You can easily do that with a TimeDistributed wrapper:
# load your already trained model
model_x = load_model('saved_model')
inp = layers.Input(shape=(2,16))
# this makes the input shape as `(16,2)`
x = layers.Permute((2,1))(inp)
# this would apply `model_x` on each of the 16 segments; the output shape would be (None, 16, 8)
x = layers.TimeDistributed(model_x)(x)
# flatten to make it have a shape of (None, 128)
out = layers.Flatten()(x)
final_model = Model(inp, out)
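The data flow of that model (permute, apply the sub-model per segment, flatten) can be traced with plain Python lists. toy_model_x is a hypothetical stand-in for the trained model_x, chosen only so that the shapes match (2 inputs, 8 outputs); the other helper names are also made up for this sketch.

```python
def permute_last_two(sample):
    """(2, 16) -> (16, 2): transpose one sample."""
    return [list(col) for col in zip(*sample)]


def toy_model_x(vec):
    """Hypothetical stand-in for model_x: (2,) -> (8,)."""
    a, b = vec
    return [a + b] * 8


def toy_final_model(batch):
    out = []
    for sample in batch:                                      # (2, 16)
        segments = permute_last_two(sample)                   # (16, 2)
        per_segment = [toy_model_x(s) for s in segments]      # (16, 8)
        out.append([v for seg in per_segment for v in seg])   # (128,)
    return out


batch = [[list(range(16)), list(range(100, 116))]]  # shape (1, 2, 16)
result = toy_final_model(batch)
print(len(result), len(result[0]))  # 1 128
print(result[0][:8])                # [100, 100, 100, 100, 100, 100, 100, 100]
```

The first 8 outputs come from segment 0, the next 8 from segment 1, and so on, which is the same ordering the loop with concatenate would produce.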
