I do not understand why the channel dimension is not included in the output dimension of a conv2D layer in Keras.
I have the following model
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

def create_model():
    image = Input(shape=(128, 128, 3))
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_1')(image)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_2')(x)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_3')(x)
    flatten = Flatten(name='flatten')(x)
    output = Dense(1, activation='relu', name='output')(flatten)
    model = Model(inputs=image, outputs=output)
    return model
model = create_model()
model.summary()
The model summary is given in the figure at the end of my question. The input layer takes RGB images with width = 128 and height = 128. The first Conv2D layer tells me the output dimension is (None, 61, 61, 24). I have used a kernel size of (8, 8), a stride of (2, 2), and no padding. The values 61 = floor((128 - 8 + 2 * 0)/2 + 1) and 24 (the number of kernels/filters) make sense. But why isn't the dimension for the different channels included in the output dimension? As far as I can see, the parameters for the 24 filters on each of the channels are included in the number of parameters. So I would expect the output dimension to be (None, 61, 61, 24, 3) or (None, 61, 61, 24 * 3). Is this just a strange notation in Keras, or am I confused about something else?
This question is asked in various forms all over the internet and has a simple answer which is often missed or confused:
SIMPLE ANSWER:
The Keras Conv2D layer, given a multi-channel input (e.g. a color image), will apply the filter across ALL the color channels and sum the results, producing the equivalent of a monochrome convolved output image.
An example, from a CIFAR-10 CNN example:
(1) You're training with the CIFAR image dataset, which is made up of 32x32 color images, i.e. each image is shape (32,32,3) (RGB = 3 channels)
(2) Your first layer of your network is a Conv2D Layer with 32 filters, each specified as 3x3, so:
Conv2D(32, (3,3), padding='same', input_shape=(32,32,3))
(3) Counter-intuitively, Keras will configure each filter as (3,3,3), i.e. a 3D volume covering the 3x3 pixels PLUS all the color channels. As a minor detail each filter has an additional weight for a BIAS value, as per normal neural network layer arithmetic.
(4) Convolution proceeds absolutely as normal, except a 3x3x3 VOLUME from the input image is convolved at each step with the 3x3x3 filter, and a single (monochrome) output value (i.e. like a pixel) is produced at each step.
(5) The result is a Keras Conv2D convolution of a specified (3,3) filter on a (32,32,3) image produces a (32,32) result because the actual filter used is (3,3,3).
(6) In this example, we have also specified 32 filters in the Conv2D layer, so the actual output is (32,32,32) for each input image (i.e. you might think of this as 32 images, one for each filter, each 32x32 monochrome pixels).
As a check, you can look at the count of weights (Param #) for the layer produced by model.summary():
Layer (type)         Output Shape          Param #
conv2d_1 (Conv2D)    (None, 32, 32, 32)    896
There are 32 filters, each 3x3x3 (i.e. 27 weights) plus 1 for the bias (i.e. total 28 weights each). And 32 filters x 28 weights each = 896 Parameters.
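As a quick check, here is a minimal sketch (assuming a tf.keras installation; the layer name in the summary may differ) that reproduces the 896-parameter count:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential([Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3))])
model.summary()  # Output shape (None, 32, 32, 32), Param # = 32 * (3*3*3 + 1) = 896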
Each of the convolutional filters (8 x 8) is connected to an (8 x 8) receptive field across all the channels of the image. That is why we have (61, 61, 24) as the output of the first Conv2D layer. The different channels are encoded implicitly into the weights of the 24 filters. This means that each filter does not have 8 x 8 = 64 weights but instead 8 x 8 x number of channels = 8 x 8 x 3 = 192 weights.
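The same check works for the model in the question; a minimal sketch (tf.keras assumed) of its first layer alone:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D

image = Input(shape=(128, 128, 3))
x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_1')(image)
Model(image, x).summary()
# conv_1: output (None, 61, 61, 24), Param # = 24 * (8*8*3 + 1) = 4632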
See this quote from CS231
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image),
and an example volume of neurons in the first Convolutional layer.
Each neuron in the convolutional layer is connected only to a local
region in the input volume spatially, but to the full depth (i.e. all
color channels). Note, there are multiple neurons (5 in this example)
along the depth, all looking at the same region in the input - see
discussion of depth columns in the text below. Right: The neurons from the
Neural Network chapter remain unchanged: They still compute a dot
product of their weights with the input followed by a non-linearity,
but their connectivity is now restricted to be local spatially.
My guess is that you're misunderstanding how convolutional layers are defined.
My notation for the shape of the convolutional layer is (out_channels, in_channels, k, k), where k is the size of the kernel. The out_channels is the number of filters (i.e. convolutional neurons). Consider the following image:
The 3D convolutional kernel weights in the picture slide across different data windows of A_{i-1} (i.e. the input image). Patches of 3D data of that image, of shape (in_channels, k, k), are paired with individual 3D convolutional kernels of matching dimensionality. How many such 3D kernels are there? As many as the number of output channels, out_channels. The depth dimension that each kernel adopts is the in_channels of A_{i-1}. Therefore, the in_channels dimension of A_{i-1} is contracted away by the depth-wise dot product that builds up the output tensor with out_channels channels. The precise way in which the sliding windows are constructed is defined by the sampling tuple (kernel_size, stride, padding) and results in an output tensor with spatial dimensions determined by the formula that you correctly applied.
If you want to understand more, including backpropagation and implementation, take a look at this paper.
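To make the depth-wise contraction concrete, here is a hedged NumPy sketch of a naive convolution loop in the (out_channels, in_channels, k, k) notation used above (no padding, stride 1, bias omitted); it is an illustration, not an efficient implementation:
import numpy as np

def naive_conv(x, w):                       # x: (in_c, H, W), w: (out_c, in_c, k, k)
    out_c, in_c, k, _ = w.shape
    rows, cols = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((out_c, rows, cols))
    for o in range(out_c):
        for i in range(rows):
            for j in range(cols):
                # depth-wise dot product: sum over in_channels and the k x k window
                out[o, i, j] = np.sum(x[:, i:i+k, j:j+k] * w[o])
    return out

x = np.random.rand(3, 8, 8)                 # a 3-channel 8x8 input
w = np.random.rand(24, 3, 2, 2)             # 24 filters, each 3 channels deep
print(naive_conv(x, w).shape)               # (24, 7, 7) -- in_channels is contracted away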
The formula you're using is correct. It may be a little confusing because many popular tutorials use a number of filters equal to the number of channels in the image. The TensorFlow/Keras implementation effectively computes num_input_channels * num_output_channels intermediate feature maps of size (kernel_size[0], kernel_size[1]): for each output channel, the num_input_channels intermediate maps are summed, and the resulting maps are stacked along the last axis to give an output of shape (new_rows, new_cols, num_output_channels). Hope this clarifies Vlad's detailed answer.
Related
I am having trouble understanding the way 2 or more convolutional layers (each followed by a pooling layer) work in a CNN.
Consider the input to be a 3-channel 300x300 image. If the first convolutional layer has 32 filters and the second layer has 64 filters, then the first layer creates 32 feature maps. But how many feature maps does the second layer create? Does every one of the 64 filters act on each of the previously generated 32 feature maps, thus creating 32*64 = 2048 feature maps in total? Or does something else take place?
A simple code relating the question is:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D(2, 2)])
The number of channels of the input matrix and the number of channels in each filter must match in order to be able to perform element-wise multiplication.
So the main difference between the first and second convolutions is that the number of channels in the input matrix of the first convolution is 3, so we will use 32 filters where each filter has 3 channels (the depth of the kernel matrix).
For the second convolution, the input matrix has 32 channels (feature maps), so each filter for this convolution must have 32 channels as well. For example: each of the 64 filters will have shape 3x3x32 (32 channels of 3x3).
The result of a convolution step for a single 3x3x32 filter will be a single channel of WxH (width, height) shape. After applying all 64 filters (each with shape 3x3x32) we will get 64 channels, where each channel is the result of the convolution with a single filter.
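A minimal sketch (tf.keras assumed; default layer names may differ) whose parameter counts confirm these filter depths:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D(2, 2)])
model.summary()
# conv2d:   32 filters, each 3x3x3  -> 32 * (3*3*3  + 1) = 896 params
# conv2d_1: 64 filters, each 3x3x32 -> 64 * (3*3*32 + 1) = 18496 params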
The first convolution layer has 32 filters, and each filter spans all THREE channels of the image and sums across them, so the first Conv2D produces 32 feature maps (not 32x3 = 96). Each of the 64 filters of the second Conv2D then spans all 32 of those feature maps and sums across them, so after the 2nd Conv2D there are 64 feature maps.
That is why Keras shows (..., 32) and (..., 64) as the channel dimensions. You can use model.summary() to check.
The Keras layer documentation specifies the input and output sizes for convolutional layers:
https://keras.io/layers/convolutional/
Input shape: (samples, channels, rows, cols)
Output shape: (samples, filters, new_rows, new_cols)
And the kernel size is a spatial parameter, i.e. it determines only width and height.
So an input with c channels will yield an output with filters channels regardless of the value of c. It must therefore apply 2D convolution with a spatial height x width filter and then aggregate the results somehow for each learned filter.
What is this aggregation operator? Is it a summation across channels? Can I control it? I couldn't find any information in the Keras documentation.
Note that in TensorFlow the filters are specified in the depth channel as well:
https://www.tensorflow.org/api_guides/python/nn#Convolution,
So the depth operation is clear.
Thanks.
It might be confusing that it is called Conv2D layer (it was to me, which is why I came looking for this answer), because as Nilesh Birari commented:
I guess you are missing it's 3D kernel [width, height, depth]. So the result is summation across channels.
Perhaps the 2D stems from the fact that the kernel only slides along two dimensions; the third dimension is fixed and determined by the number of input channels (the input depth).
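If you want to keep one convolved map per input channel instead of the summed result, a depthwise convolution is the usual tool. A minimal sketch (tf.keras assumed) contrasting the two:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D, DepthwiseConv2D

x = Input(shape=(32, 32, 3))
print(Model(x, Conv2D(1, (3, 3))(x)).output_shape)        # (None, 30, 30, 1): channels summed
print(Model(x, DepthwiseConv2D((3, 3))(x)).output_shape)  # (None, 30, 30, 3): one map per channel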
For a more elaborate explanation, read https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
I plucked an illustrative image from there:
I was also wondering this, and found another answer here, where it is stated (emphasis mine):
Maybe the most tangible example of a multi-channel input is when you have a color image which has 3 RGB channels. Let's get it to a convolution layer with 3 input channels and 1 output channel. (...) What it does is that it calculates the convolution of each filter with its corresponding input channel (...). The stride of all channels are the same, so they output matrices with the same size. Now, it sums up all matrices and output a single matrix which is the only channel at the output of the convolution layer.
Illustration:
Notice that the weights of the convolution kernels for each channel are different, and they are then iteratively adjusted in the back-propagation steps by e.g. gradient descent based algorithms such as stochastic gradient descent (SGD).
Here is a more technical answer from TensorFlow API.
I also needed to convince myself so I ran a simple example with a 3×3 RGB image.
# red       # green            # blue
1 1 1       100 100 100        10000 10000 10000
1 1 1       100 100 100        10000 10000 10000
1 1 1       100 100 100        10000 10000 10000
The filter is initialised to ones:
1 1
1 1
I have also set the convolution to have these properties:
no padding
strides = 1
relu activation function
bias initialised to 0
We would expect the (aggregated) output to be:
40404 40404
40404 40404
Also, from the picture above, the no. of parameters is
3 filter channels (one for each input channel) × 4 weights + 1 bias (not shown) = 13 parameters
Here's the code.
Import modules:
import numpy as np
from keras.layers import Input, Conv2D
from keras.models import Model
Create the red, green and blue channels:
red = np.array([1]*9).reshape((3,3))
green = np.array([100]*9).reshape((3,3))
blue = np.array([10000]*9).reshape((3,3))
Stack the channels to form an RGB image:
img = np.stack([red, green, blue], axis=-1)
img = np.expand_dims(img, axis=0)
Create a model that just does a Conv2D convolution:
inputs = Input((3,3,3))
conv = Conv2D(filters=1,
              strides=1,
              padding='valid',
              activation='relu',
              kernel_size=2,
              kernel_initializer='ones',
              bias_initializer='zeros')(inputs)
model = Model(inputs, conv)
Feed the image to the model:
model.predict(img)
# array([[[[40404.],
#          [40404.]],
#         [[40404.],
#          [40404.]]]], dtype=float32)
Run a summary to get the number of params:
model.summary()
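The Param # reported for the conv layer should agree with the arithmetic above (layer names may differ between runs):
# conv2d (Conv2D)   (None, 2, 2, 1)   Param # 13   <- 3 channels * (2*2) weights + 1 bias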
I'm following the "Deep MNIST for Experts" tutorial for TensorFlow: https://www.tensorflow.org/tutorials/mnist/pros/
The second convolutional layer has the shape [5, 5, 32, 64]; that is, it has 32 input channels, whereas the first convolutional layer had 1 input channel (that input being, I understand, the grayscale values of the original image).
What does it mean that the second convolutional layer has 32 input channels? Does it mean the 64 filters that are learned in the second layer will all be applied (shifted around) to a "virtual" image having 32 points per pixel (this "virtual" image being composed of the original image to which each filter learned in the first step has been applied)? How do you apply a 2D 5x5 filter to an image having 32 points/values per pixel if what I said previously is correct?
The first convolution layer has the following weights:
W_conv1 = weight_variable([5, 5, 1, 32])
Here 5x5 is the patch size, 1 is the number of input channels, and 32 is the number of output channels. So after the first convolution the output has 32 channels, hence the shape of the weight matrix for the second convolution layer has 32 input channels.
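A hedged TensorFlow 2-style sketch of the shape relationship described above (random values, just to check shapes):
import tensorflow as tf

image = tf.random.normal([1, 28, 28, 1])    # (batch, rows, cols, in_channels)
W_conv1 = tf.random.normal([5, 5, 1, 32])   # (patch, patch, in_channels, out_channels)
out = tf.nn.conv2d(image, W_conv1, strides=1, padding='SAME')
print(out.shape)                            # (1, 28, 28, 32): 32 channels feed the next layer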
This question exists as a GitHub issue, too.
I would like to build a neural network in Keras which contains both 2D convolutions and an LSTM layer.
The network should classify MNIST.
The training data in MNIST are 60000 grey-scale images of handwritten digits from 0 to 9. Each image is 28x28 pixels.
I've split the images into four parts (left/right, up/down) and rearranged them in four orders to get sequences for the LSTM.
|     |       |1 | 2|
|image|  ->   -------  ->  4 sequences: |1|2|3|4|, |4|3|2|1|, |1|3|2|4|, |4|2|3|1|
|     |       |3 | 4|
Each of the small sub-images has the dimensions 14 x 14. The four sequences are stacked together along the width (it shouldn't matter whether width or height).
This creates an array with the shape [60000, 4, 1, 56, 14] where:
60000 is the number of samples
4 is the number of elements in a sequence (# of timesteps)
1 is the depth of colors (greyscale)
56 and 14 are width and height
Now this should be given to a Keras model.
The problem is to change the input dimensions between the CNN and the LSTM.
I searched online and found this question: Python keras how to change the size of input after convolution layer into lstm layer
The solution seems to be a Reshape layer which flattens the image but retains the timesteps (as opposed to a Flatten layer which would collapse everything but the batch_size).
Here's my code so far:
from keras.models import Sequential
from keras.layers import (Convolution2D, Activation, MaxPooling2D,
                          Reshape, Dropout, LSTM, Dense)

nb_filters = 32
kernel_size = (3, 3)
pool_size = (2, 2)
nb_classes = 10
batch_size = 64

model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode="valid", input_shape=[1, 56, 14]))
model.add(Activation("relu"))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Reshape((56*14,)))
model.add(Dropout(0.25))
model.add(LSTM(5))
model.add(Dense(50))
model.add(Dense(nb_classes))
model.add(Activation("softmax"))
This code creates an error message:
ValueError: total size of new array must be unchanged
Apparently the input to the Reshape layer is incorrect. As an alternative, I tried to pass the timesteps to the Reshape layer, too:
model.add(Reshape((4,56*14)))
This doesn't feel right and in any case, the error stays the same.
Am I doing this the right way?
Is a Reshape layer the proper tool to connect a CNN and an LSTM?
There are rather complex approaches to this problem.
Such as this:
https://github.com/fchollet/keras/pull/1456
A TimeDistributed Layer which seems to hide the timestep dimension from following layers.
Or this: https://github.com/anayebi/keras-extra
A set of special layers for combining CNNs and LSTMs.
Why are there such complicated solutions (at least they seem complicated to me), if a simple Reshape does the trick?
UPDATE:
Embarrassingly, I forgot that the dimensions will be changed by the pooling and (for lack of padding) the convolutions, too.
kgrm advised me to use model.summary() to check the dimensions.
The output of the layer before the Reshape layer is (None, 32, 26, 5), so I changed the reshape to model.add(Reshape((32*26*5,))).
Now the ValueError is gone, instead the LSTM complains:
Exception: Input 0 is incompatible with layer lstm_5: expected ndim=3, found ndim=2
It seems like I need to pass the timestep dimension through the entire network. How can I do that? If I add it to the input_shape of the convolution, it complains, too: Convolution2D(nb_filters, kernel_size[0], kernel_size[1], border_mode="valid", input_shape=[4, 1, 56, 14])
Exception: Input 0 is incompatible with layer convolution2d_44: expected ndim=4, found ndim=5
According to the Convolution2D definition, your input must be 4-dimensional with dimensions (samples, channels, rows, cols). This is the direct reason why you are getting an error.
To resolve this you must use the TimeDistributed wrapper. This allows you to apply static (non-recurrent) layers across time, as in the sketch below.
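A minimal sketch of that idea, written with the Keras 2-style API and a channels-last input of shape (timesteps, rows, cols, channels); the layer sizes are only illustrative:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TimeDistributed, Conv2D, MaxPooling2D, Flatten, LSTM, Dense

model = Sequential()
model.add(TimeDistributed(Conv2D(32, (3, 3), activation='relu'),
                          input_shape=(4, 56, 14, 1)))        # 4 timesteps of 56x14x1 images
model.add(TimeDistributed(Conv2D(32, (3, 3), activation='relu')))
model.add(TimeDistributed(MaxPooling2D((2, 2))))
model.add(TimeDistributed(Flatten()))                         # -> (batch, 4, features)
model.add(LSTM(5))
model.add(Dense(10, activation='softmax'))
model.summary()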
I am currently following TensorFlow's Multilayer Convolutional Network tutorial.
In various layers, weights are initialised as follows:
First Convolutional Layer:
W_conv1 = weight_variable([5, 5, 1, 32])
Second Convolutional Layer:
W_conv2 = weight_variable([5, 5, 32, 64])
Densely Connected Layer:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Readout Layer:
W_fc2 = weight_variable([1024, 10])
So I am having doubts about how the shapes of the above weight variables are known to us.
Is there any math used to find the shapes for them?
The answer is explained on the same page:
The convolution will compute 32 features for each 5x5 patch. Its
weight tensor will have a shape of [5, 5, 1, 32]
There is no involved math per se, but these terms need explanation:
The size of the convolution kernel is 5x5. That means there is a 5x5 matrix that is convolved with the input image by moving it around the image. Check this link for an explanation of how a small 5x5 matrix moves over a 28x28 image and multiplies different cells of the image matrix with it. This gives us the first two dimensions of [5, 5, 1, 32].
The number of input channels is 1. These are BW images, hence one input channel. Most colored images have 3 channels, so expect a 3 in other convolutional networks working on images. Indeed, for the second layer, W_conv2, the number of input channels is 32, the same as the number of output channels of layer 1.
The last dimension of the weight matrix is perhaps the hardest to visualize. Imagine your 5x5 matrix, and replicate it 32 times! Each of these 32 copies is called a channel. To complete the discussion, each of these 32 5x5 matrices is initialized with random weights and trained independently during forward/back propagation of the network. More channels learn different aspects of the image and hence give extra power to your network.
If you summarize these 3 points, you get the desired dimensions of layer 1. Subsequent layers are an extension: the first two dimensions are the kernel size (5x5 in this case). The third dimension is equal to the number of input channels, which equals the number of output channels of the previous layer (32, since we declared 32 output channels for layer 1). The final dimension is the number of output channels of the current layer (64, even larger for the second layer! Again, keeping a large number of independent 5x5 kernels helps).
Finally, the last two layers: the final dense layer is the only thing that involves some calculation:
For each convolution layer, final size = initial size (the tutorial uses 'SAME' padding)
For a pooling layer of size k x k, final size = initial size / k
So,
For conv1, size remains 28 X 28
pool1 reduces size to 14 X 14
For conv2, size remains 14 X 14
pool2 reduces size to 7 X 7
And of course, we have 64 channels due to conv2; pooling doesn't affect them. Hence, we get a final dense input of 7 x 7 x 64 = 3136. We then create a fully connected layer with 1024 hidden units and add 10 output classes for the 10 digits.
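A tiny sketch of that shape arithmetic ('SAME' padding keeps the convolution sizes fixed):
size = 28
size //= 2               # pool1: 28 -> 14 (conv1 keeps 28 with 'SAME' padding)
size //= 2               # pool2: 14 -> 7  (conv2 keeps 14)
print(size * size * 64)  # 3136 = 7 * 7 * 64, the input width of W_fc1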