To learn TensorFlow, I ran the official TensorFlow MNIST script (cnn_mnist.py) and displayed the graph with TensorBoard.
The following is part of the code.
The network contains two conv layers and two dense layers.
conv1 = tf.layers.conv2d(inputs=input_layer, filters=32, kernel_size=[5, 5],
                         padding="same", activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
conv2 = tf.layers.conv2d(inputs=pool1, filters=64, kernel_size=[5, 5],
                         padding="same", activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
dropout = tf.layers.dropout(
    inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)
logits = tf.layers.dense(inputs=dropout, units=10)
However, looking at the graph generated by TensorBoard, there are three conv layers and three dense layers.
I did not expect conv2d_1 and dense_1 to be generated.
Why were conv2d_1 and dense_1 generated?
This is a good question, because it sheds some light on the inner structure of the tf.layers wrappers. Let's run two experiments:
Run the model exactly as in the question.
Add explicit names to the layers via the name argument and run again.
The graph without layers' names
That's the same graph as yours, but I expanded and zoomed in on the logits dense layer. Note that dense_1 contains the layer variables (kernel and bias) and dense_2 contains the ops (matrix multiplication and addition).
This means it is still one layer, but with two naming scopes, dense_1 and dense_2. This happens because it is the second dense layer, and the first one already took the naming scope dense. Variable creation is separated from the actual layer logic (there are build and call methods), and both try to get a unique name for their scope. This leads to dense_1 holding the variables and dense_2 holding the ops.
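If you want to check this yourself, one quick way (a sketch, assuming the TF 1.x graph-mode API used in the question) is to list the operation names after building the model:

# List all ops whose names mention "dense" (TF 1.x graph mode):
for op in tf.get_default_graph().get_operations():
    if 'dense' in op.name:
        print(op.name)
# Without explicit names, the variable ops show up under dense_1/...
# and the compute ops (MatMul, BiasAdd) under dense_2/...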
The graph with names specified
Now let's add name='logits' to the same layer and run again:
logits = tf.layers.dense(inputs=dropout, units=10, name='logits')
You can see there are still 2 variables and 2 ops, but the layer managed to grab one unique name for the scope (logits) and put everything inside it.
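To confirm, you can list the global variables (again a TF 1.x sketch); with the explicit name, both variables live under a single logits/ scope:

for v in tf.global_variables():
    if v.name.startswith('logits'):
        print(v.name)
# logits/kernel:0
# logits/bias:0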
Conclusion
This is a good example of why explicit naming in TensorFlow is beneficial, whether for tensors directly or for higher-level layers. There is much less confusion when the model uses meaningful names instead of automatically generated ones.
Those are just the variable creations and other hidden operations that happen in tf.layers.conv2d besides the convolution operation itself (tf.nn.conv2d) and the activation (the same goes for the dense layer). There are only 2 convolutions happening: as you can see, if you follow your data through the graph, it never goes through conv2d_1 or dense_1; it's just that the result of these ops (basically the variables needed for the convolution) is also given as input to the convolution operation itself. I'm actually more surprised not to see the same thing for conv2d, but I really wouldn't worry about that!
Related
I'm working with the tensorflow.keras API, and I've encountered a syntax I'm unfamiliar with, i.e., applying a layer to a sub-model's output, as shown in the following example from this tutorial:
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import resnet
target_shape = (200, 200)
base_cnn = resnet.ResNet50(
    weights="imagenet", input_shape=target_shape + (3,), include_top=False
)
flatten = layers.Flatten()(base_cnn.output)
dense1 = layers.Dense(512, activation="relu")(flatten)
dense1 = layers.BatchNormalization()(dense1)
dense2 = layers.Dense(256, activation="relu")(dense1)
dense2 = layers.BatchNormalization()(dense2)
output = layers.Dense(256)(dense2)
embedding = Model(base_cnn.input, output, name="Embedding")
In the official reference of layers.Flatten, for example, I couldn't find an explanation of what applying it to a layer's output actually does. In the keras.Layer reference I encountered this explanation:
call(self, inputs, *args, **kwargs): Called in __call__ after making sure build() has been called. call() performs the logic of applying the layer to the input tensors (which should be passed in as argument).
So my question is:
What does flatten = layers.Flatten()(base_cnn.output) do?
You are creating a model based on a pre-trained model. The pre-trained model will not be actively trained with the rest of your layers unless you explicitly set trainable=True; that is, you are only interested in extracting its useful features. A flattening operation is usually used to convert a multidimensional output into a one-dimensional tensor, and that is exactly what is happening in the line flatten = layers.Flatten()(base_cnn.output). A one-dimensional tensor is often the desired end result of a model, especially in supervised learning. The output of the pre-trained ResNet model is (None, 7, 7, 2048), and you want to generate a 1D feature vector for each input and compare them, so you flatten that output, resulting in a tensor of shape (None, 100352), that is, (None, 7 * 7 * 2048).
Alternatives to Flatten would be GlobalMaxPooling2D and GlobalAveragePooling2D, which downsample an input by taking the maximum or average value along the spatial dimensions. For more information on this topic, check out this post.
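To make the shape differences concrete, here is a minimal sketch (weights=None is used only to skip the ImageNet download; the shapes are the same either way):

from tensorflow.keras import layers
from tensorflow.keras.applications import resnet

base_cnn = resnet.ResNet50(weights=None, input_shape=(200, 200, 3), include_top=False)
print(base_cnn.output.shape)                                   # (None, 7, 7, 2048)
print(layers.Flatten()(base_cnn.output).shape)                 # (None, 100352)
print(layers.GlobalAveragePooling2D()(base_cnn.output).shape)  # (None, 2048)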
I am using pretrained models to classify images. My question is what kind of layers I have to add after using the pretrained model structure in my model, or rather, why these two implementations differ. To be specific:
Consider two examples, one using the cats and dogs dataset:
One implementation can be found here. The crucial point is that the base model:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
weights='imagenet')
base_model.trainable = False
is frozen and a GlobalAveragePooling2D() is added before a final tf.keras.layers.Dense(1). So the model structure looks like:
model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])
which is equivalent to:
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1)
])
So they added not only a final Dense(1) layer, but also a GlobalAveragePooling2D() layer before it.
The other using the tf flowers dataset:
In this implementation it is different: a GlobalAveragePooling2D() is not added.
feature_extractor_url = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
input_shape=(224,224,3))
feature_extractor_layer.trainable = False
model = tf.keras.Sequential([
feature_extractor_layer,
layers.Dense(image_data.num_classes)
])
where image_data.num_classes is 5, representing the different flower classes. So in this example a GlobalAveragePooling2D() layer is not added.
I do not understand this. Why is this different? When should I add a GlobalAveragePooling2D() and when not? And what is better / what should I do?
I am not sure if the reason is that in one case the cats-and-dogs dataset is a binary classification task and in the other it is a multiclass classification problem. Or maybe the difference is that in one case tf.keras.applications.MobileNetV2 was used to load MobileNetV2 and in the other implementation hub.KerasLayer was used to get the feature_extractor. When I check the model in the first implementation:
I can see that the last layer is a ReLU activation layer.
When I check the feature_extractor:
model = tf.keras.Sequential([
feature_extractor,
tf.keras.layers.Dense(1)
])
model.summary()
I get the output:
So maybe the reason is also that I do not understand the difference between tf.keras.applications.MobileNetV2 and hub.KerasLayer. The hub.KerasLayer just gives me the feature extractor. I know this, but I still think I have not grasped the difference between these two methods.
I cannot check the layers of the feature_extractor itself, so feature_extractor.summary() or feature_extractor.layers does not work. How can I inspect the layers here? And how can I know whether I should add GlobalAveragePooling2D or not?
Summary
Why is this different? When should I add a GlobalAveragePooling2D() and when not? And what is better / what should I do?
In the first case, the base model outputs 4-dimensional tensors that are the raw outputs of the last convolutional layer, so you need to flatten them somehow; in this example you are using GlobalAveragePooling2D (but you could use any other strategy). I can't tell which is better: it depends on your problem, and depending on how the hub.KerasLayer version implemented the flattening, they could be exactly the same. That said, I'd just pick one of them and move on: I don't see huge differences between them.
Long answer: understanding the Keras implementation
The difference is in the output of the two base models: in your Keras example, outputs are of shape (bz, hh, ww, nf), where bz is the batch size, hh and ww are the height and width of the last convolutional layer in the model, and nf is the number of filters (or convolutions) applied in this last layer.
So: this returns the raw output of the last convolutions (or filters) of the base model.
Hence, you need to convert those outputs (which you can think of as images) into vectors of shape (bz, n_feats), where n_feats is the number of features the base model computes. Once this conversion is done, you can stack your classification layer (or as many layers as you want) on top, because at this point you have vectors.
How do you compute this conversion? Some common alternatives are taking the average or maximum over the convolutional output (which reduces the size), or just reshaping it into a single row, or adding more convolutional layers until you get a vector as output (I strongly suggest following the usual practices, like average or maximum); a minimal sketch of the averaging option follows.
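For intuition, global average pooling is just a mean over the spatial axes, producing one value per filter (the shapes below are illustrative):

import tensorflow as tf

x = tf.random.normal([8, 7, 7, 1280])               # (batch, height, width, filters)
pooled = tf.reduce_mean(x, axis=[1, 2])             # -> (8, 1280), one value per filter
same = tf.keras.layers.GlobalAveragePooling2D()(x)  # same result as the reduce_mean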
In your first example, when calling tf.keras.applications.MobileNetV2, you are using the default policy with respect to this last layer, and hence the last convolutional layer is left as is: some convolutions. You can modify this behavior with the pooling parameter, as documented here:
pooling: Optional pooling mode for feature extraction when include_top is False.
None (default) means that the output of the model will be the 4D tensor output of the last convolutional block.
avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor.
max means that global max pooling will be applied.
In summary: in your first example, you are building the base model without telling it explicitly what to do with the last layer, so the model keeps returning 4-dimensional tensors that you immediately convert into vectors with average pooling. You can avoid this explicit average pooling if you tell Keras to do it:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
pooling='avg', # Tell keras to average last layer
weights='imagenet')
base_model.trainable = False
model = tf.keras.Sequential([
base_model,
# global_average_layer, -> not needed any more
prediction_layer
])
TFHub implementation
Finally, when you use the TensorFlow Hub implementation, since you picked the feature_vector version of the model, it already implements some kind of pooling (I haven't yet found out exactly which) to make sure the model outputs vectors rather than 4-dimensional tensors. So you don't need to explicitly add a layer to convert them, because that is already done.
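You can check this empirically; a quick sketch (the 1280 below is what this MobileNetV2 feature-vector module is expected to output, assuming the URL from the question):

import tensorflow as tf
import tensorflow_hub as hub

feature_extractor_layer = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2",
    input_shape=(224, 224, 3))
images = tf.zeros([1, 224, 224, 3])
print(feature_extractor_layer(images).shape)  # (1, 1280): already a vector, no pooling needed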
In my opinion, I prefer the Keras implementation, since it gives you more freedom to pick the strategy you want (in fact, you can keep stacking whatever you want on top).
Let's say there is a model taking [1, 208, 208, 3] images that has 6 pooling layers with kernels [2, 2, 2, 2, 2, 7], which would result in a feature column of [1, 1, 1, 2048] for the 2048 filters in the last conv layer. Note how the last pooling layer accepts [1, 7, 7, 2048] inputs.
If we relax the constraints on the input image (which is typically the case for object detection models), then after the same set of pooling layers, an image of size [1, 104, 208, 3] would produce a pre-last-pooling output of [1, 4, 7, 2048], and [1, 256, 408, 3] would yield [1, 8, 13, 2048]. These maps carry about the same amount of information as the original [1, 7, 7, 2048], but the original pooling layer would not produce a feature column of [1, 1, 1, N]. That is why we switch to a global pooling layer.
In short, a global pooling layer is important if we don't have a strict restriction on the input image size (and don't resize the image as the first op in the model), as the sketch below illustrates.
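A minimal sketch of this idea: with a fully convolutional stack plus global pooling, the spatial dimensions can stay unspecified (the layer sizes here are arbitrary):

import tensorflow as tf

inputs = tf.keras.Input(shape=(None, None, 3))   # height and width left unspecified
x = tf.keras.layers.Conv2D(32, 3, activation='relu')(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)  # always -> (batch, 32)
model = tf.keras.Model(inputs, x)
print(model(tf.zeros([1, 104, 208, 3])).shape)   # (1, 32)
print(model(tf.zeros([1, 256, 408, 3])).shape)   # (1, 32)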
I think the difference is in the output of the models.
"https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2" outputs a 1D vector per batch element, so you just can't apply Conv2D to it.
The output of tf.keras.applications.MobileNetV2 is probably more complex, thus you have more capability to transform it.
I am trying to implement the network architecture of the paper Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks, by Ruiqing Yin, Herve Bredin, Claude Barras, which is described as follows:
The model is composed of two Bi-LSTMs (Bi-LSTM 1 and 2) and a multi-layer perceptron (MLP) whose weights are shared across the sequence. Bi-LSTM1 has 64 outputs (32 forward and 32 backward). Bi-LSTM2 has 40 (20 each). The fully connected layers are 40-, 10- and 1-dimensional respectively. The outputs of both forward and backward LSTMs are concatenated and fed forward to the next layer. The shared MLP is made of three fully connected feedforward layers, using a tanh activation function for the first two layers, and a sigmoid activation function for the last layer, in order to output a score between 0 and 1.
I have taken reference from various sources and come up with the following code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

model = Sequential()
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(40, return_sequences=True)))
model.add(TimeDistributed(Dense(40, activation='tanh')))
model.add(TimeDistributed(Dense(10, activation='tanh')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.build(input_shape=(None, 200, 35))
model.summary()
I am confused by the TimeDistributed layer and how it can simulate an MLP, and also how the weights are being shared. Can you at least point out whether I am doing this right or not?
As the architecture in the paper suggests, you basically want to push each of the hidden states (which are themselves time distributed) into separate dense layers (thus forming an MLP at each time step).
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional (Bidirectional (None, 200, 128) 51200
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200, 80) 54080
_________________________________________________________________
time_distributed (TimeDistri (None, 200, 40) 3240
_________________________________________________________________
time_distributed_1 (TimeDist (None, 200, 10) 410
_________________________________________________________________
time_distributed_2 (TimeDist (None, 200, 1) 11
=================================================================
Total params: 108,941
Trainable params: 108,941
Non-trainable params: 0
The Bi-LSTM here is set to return_sequences=True, so it returns the hidden-state sequence to the subsequent layer. Pushing this sequence into a plain Dense layer wouldn't make sense, since you are going to return a 3D tensor (batch, time, feature). If you want to form a Dense network at each time step, you need it to be time distributed.
As the output shape suggests, this layer creates a 40-node layer at each of the 200 time steps output by the preceding Bi-LSTM (the hidden states). Each of these is then stacked with a 10-node layer as well (None, 200, 10). Similarly, the logic follows.
If your doubt is what TimeDistributed layers are, as per the official documentation:
This wrapper allows applying a layer to every temporal slice of an input.
The final goal is speaker change detection, meaning that you want to predict the speaker, or the probability of a speaker, at each of the 200 time steps. Therefore the output layer returns 200 logits (None, 200, 1).
Hope that solves your confusion.
Another intuitive way of looking at it:
Your Bi-LSTM is set to return sequences instead of just the final features. Each time step in the returned sequence needs a Dense network of its own. A TimeDistributed Dense is basically a layer that takes an input sequence and feeds it into separate dense nodes at each time step. So, instead of having 40 nodes like a standard Dense layer, it has 200 x 40 nodes, where the input to, say, the 3rd set of 40 nodes is the 3rd time step from the Bi-LSTM. This simulates a time-distributed MLP over the Bi-LSTM sequences.
A good visual intuition that I prefer when working with LSTMs:
If you DON'T return sequences, the output of the LSTM is just a single value ht (LHS of the image below).
If you return sequences, the output is a sequence (h0 to ht) (RHS of the image below).
Adding a Dense layer in the first case will only take ht as input. In the second case, you will need a TimeDistributed Dense, which will "stack" on top of each of h0 to ht.
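As a side note, in recent tf.keras versions a Dense layer applied to a 3D tensor already operates on the last axis only, so it behaves like a TimeDistributed Dense; a quick shape check (a sketch, with arbitrary dimensions):

import tensorflow as tf

seq = tf.zeros([4, 200, 80])  # (batch, time, features), e.g. a Bi-LSTM output
print(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(40))(seq).shape)  # (4, 200, 40)
print(tf.keras.layers.Dense(40)(seq).shape)                                   # (4, 200, 40)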
I am new to RNNs/LSTMs in Keras and need advice on whether and how to use them for my problem, which is many-to-many classification.
I have a number of time series: Approximately 1500 "runs" which each last for about 100-300 time steps and have multiple channels. I understand that I need to zero-pad my data to the maximum number of time steps, so my data looks like this:
[nb_samples, timesteps, input_dim]: [1500, 300, 10]
Since getting the label for a single time step is impossible without knowing the past, even for a human, I could do feature engineering and train a classical classification algorithm; however, I think LSTMs would be a good fit here. This answer tells me that for many-to-many classification in Keras, I need to set return_sequences to True. However, I do not quite understand how to proceed from there: do I use the returned sequence as input for another, normal layer? How do I connect this to my output layer?
Any help, hints or links to tutorials are greatly appreciated - I found a lot of stuff for many-to-one classification, but nothing good on many-to-many.
There can be many approaches to this; I am describing one that can be a good fit for your problem.
If you want to stack two LSTM layers, then returning sequences from the first can feed the second LSTM layer, as shown in the following example.
from keras.layers import Dense, LSTM
from keras import Input, Model

seq_length = 15
input_dims = 10
output_dims = 8  # number of classes
n_hidden = 10

model1_inputs = Input(shape=(seq_length, input_dims,))

net1 = LSTM(n_hidden, return_sequences=True)(model1_inputs)  # (batch, 15, 10)
net1 = LSTM(n_hidden, return_sequences=False)(net1)          # (batch, 10)
net1 = Dense(output_dims, activation='relu')(net1)           # (batch, 8)
model1_outputs = net1

model1 = Model(inputs=model1_inputs, outputs=model1_outputs, name='model1')

## Fit the model
model1.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 15, 10) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 15, 10) 840
_________________________________________________________________
lstm_2 (LSTM) (None, 10) 840
_________________________________________________________________
dense_3 (Dense) (None, 8) 88
_________________________________________________________________
Another option is to use the complete returned sequence as the features for the next layer. In that case, flatten it and add a simple Dense layer whose input will be [batch, seq_len * lstm_output_dims], as in the following sketch.
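A minimal sketch of that option (same hypothetical dimensions as above; softmax is used here as a generic classification head):

from keras.layers import Dense, Flatten, LSTM
from keras import Input, Model

model2_inputs = Input(shape=(15, 10))
net2 = LSTM(10, return_sequences=True)(model2_inputs)  # (batch, 15, 10)
net2 = Flatten()(net2)                                 # (batch, 150) = seq_len * lstm_output_dims
net2 = Dense(8, activation='softmax')(net2)
model2 = Model(inputs=model2_inputs, outputs=net2, name='model2')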
Note: These features can be useful for a classification task, but mostly we use stacked LSTM layers and take the output of the last one (without the complete sequence) as features for the classification layer.
This answer may be helpful for understanding other approaches to LSTM architectures for different purposes.
I use the slim framework for TensorFlow because of its simplicity.
But I want to have a convolutional layer with both biases and batch normalization.
In vanilla TensorFlow, I have:
def conv2d(input_, output_dim, k_h=5, k_w=5, d_h=2, d_w=2, name="conv2d"):
    with tf.variable_scope(name):
        w = tf.get_variable('w', [k_h, k_w, input_.get_shape()[-1], output_dim],
                            initializer=tf.contrib.layers.xavier_initializer(uniform=False))
        conv = tf.nn.conv2d(input_, w, strides=[1, d_h, d_w, 1], padding='SAME')
        biases = tf.get_variable('biases', [output_dim], initializer=tf.constant_initializer(0.0))
        conv = tf.reshape(tf.nn.bias_add(conv, biases), conv.get_shape())
        tf.summary.histogram("weights", w)
        tf.summary.histogram("biases", biases)
        return conv

d_bn1 = BatchNorm(name='d_bn1')
h1 = lrelu(d_bn1(conv2d(h0, df_dim + y_dim, name='d_h1_conv')))
and I rewrote it for slim like this:
h1 = slim.conv2d(h0,
                 num_outputs=self.df_dim + self.y_dim,
                 scope='d_h1_conv',
                 kernel_size=[5, 5],
                 stride=[2, 2],
                 activation_fn=lrelu,
                 normalizer_fn=layers.batch_norm,
                 normalizer_params=batch_norm_params,
                 weights_initializer=layers.xavier_initializer(uniform=False),
                 biases_initializer=tf.constant_initializer(0.0))
But this code does not add a bias to the conv layer.
That is because of https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L1025, where we have
layer = layer_class(filters=num_outputs,
kernel_size=kernel_size,
strides=stride,
padding=padding,
data_format=df,
dilation_rate=rate,
activation=None,
use_bias=not normalizer_fn and biases_initializer,
kernel_initializer=weights_initializer,
bias_initializer=biases_initializer,
kernel_regularizer=weights_regularizer,
bias_regularizer=biases_regularizer,
activity_regularizer=None,
trainable=trainable,
name=sc.name,
dtype=inputs.dtype.base_dtype,
_scope=sc,
_reuse=reuse)
outputs = layer.apply(inputs)
in the construction of the layer, which results in not having a bias when using batch normalization.
Does that mean that I cannot have both biases and batch normalization using the slim and layers libraries? Or is there another way to achieve having both a bias and batch normalization in a layer when using slim?
Batch normalization already includes the addition of a bias term. Recall that BatchNorm is already:
gamma * normalized(x) + bias
So there is no need (and it makes no sense) to add another bias term in the convolution layer. Simply speaking, BatchNorm shifts the activations by their mean values, hence any constant offset will be canceled out.
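You can see this cancellation directly; a tiny numeric sketch (plain NumPy, just the normalization step without gamma/beta):

import numpy as np

x = np.random.randn(100)
bias = 5.0
bn = lambda v: (v - v.mean()) / v.std()  # the normalization step of BatchNorm
print(np.allclose(bn(x), bn(x + bias)))  # True: the constant bias is subtracted out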
If you still want to do this, you would need to remove the normalizer_fn argument and add BatchNorm as a separate layer. Like I said, this makes no sense.
But the solution would be something like
net = slim.conv2d(net, normalizer_fn=None, ...)
net = slim.batch_norm(net)  # note: tf.nn.batch_normalization is lower-level and needs explicit mean/variance arguments
Note that BatchNorm relies on non-gradient updates, so you either need to use an optimizer that is compatible with the UPDATE_OPS collection, or you need to add tf.control_dependencies manually; the usual pattern is sketched below.
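The standard TF 1.x pattern for this (a sketch; loss and optimizer stand for whatever you already have in your training code):

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)  # BatchNorm moving-average updates
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)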
Long story short: even if you implement ConvWithBias+BatchNorm, it will behave like ConvWithoutBias+BatchNorm, just as multiple fully connected layers without activation functions behave like a single one.
The reason there is no bias for our convolutional layers is that we have batch normalization applied to their outputs. The goal of batch normalization is to get outputs with:
mean = 0
standard deviation = 1
Since we want the mean to be 0, we do not want to add an offset (bias) that would deviate it from 0. We want the outputs of our convolutional layer to depend only on the coefficient weights.