Confusion regarding dimensions in CNN - python

The picture above, generated using Matlab's Deep Learning Toolbox, shows the architecture of a CNN created for a toy example. The input image is of size 25*20*7, there are 15 filters, each of size 5*5, and padding is 'same'. The output of the first convolution, conv1, is 25*20*15, which goes into the maxpooling1 operation of size 2*2 with stride 1 and padding 'same'.
Based on my understanding, the role of maxpooling is to perform dimension reduction. However, since the padding in my code is set to 'same', I understand that the output of maxpooling will preserve the spatial dimensions of its input, which is 25*20*15. That is why the output of maxpooling1 and of the remaining maxpooling layers has the same dimensions as its input, and there is no change in dimension in the remaining layers. As an example, with kernel size 2, stride 1 and total padding 1, the first dimension of the maxpooling output is (25 - 2 + 1)/1 + 1 = 25. Similarly, the second dimension is (20 - 2 + 1)/1 + 1 = 20. Thus, the output of maxpooling should be 25*20*15.
This implies that maxpooling is not doing any dimension reduction. Therefore, should I remove maxpooling if the padding option is set to 'same'?
Please let me know why the dimensions stay the same after maxpooling and, if they do, whether I should remove this operation. Or did I make a mistake somewhere?

The role of padding is different for convolutional and maxpooling layers. If padding='same' in a convolutional layer, it means that the output size (primarily height and width) remains the same as the input.
On the other hand, padding in pooling layers works differently. The purpose of pooling layers is to reduce the spatial dimensions (height and width), and that reduction comes from the stride, not from the padding. With padding='same', a pooling layer pads the input just enough for the kernel to fit when the input size and stride do not divide evenly, so the output size is ceil(input_size / stride). In your network the pooling stride is 1, so the output size equals the input size; with a stride of 2 the spatial dimensions would be halved even with padding='same'.
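A quick check of this (my addition, assuming TensorFlow 2.x Keras; the question itself uses Matlab, but the arithmetic is the same):
import tensorflow as tf

x = tf.random.normal((1, 25, 20, 15))

# 2x2 pool, stride 1, padding 'same': spatial size is preserved
y1 = tf.keras.layers.MaxPooling2D(pool_size=2, strides=1, padding='same')(x)
print(y1.shape)   # (1, 25, 20, 15)

# 2x2 pool, stride 2, padding 'same': spatial size is halved (ceil(25/2), ceil(20/2))
y2 = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='same')(x)
print(y2.shape)   # (1, 13, 10, 15)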

tl;dr: If you want to reduce the image size at every step, use padding='valid'; it is the default option.
Max pooling is generally used for downsampling and for reducing overfitting.
If you use padding='same' with a stride of 1, the input is padded so that the output keeps the input size, causing no drop in size.
In the example below, the input size is 4*4, the pool size is 2*2 and the (default) stride is 2*2, so the output is 2*2.
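A minimal reconstruction of that example (my own sketch, assuming TensorFlow 2.x Keras):
import tensorflow as tf

x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))   # a 4x4 input with one channel
y = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)          # default strides = pool_size
print(y.shape)   # (1, 2, 2, 1)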
Find more examples on keras' official site

Related

What is a linear pooling layer?

What is a linear pooling layer?
What can be the maximum size of a linear pooling kernel?
Do you use dense layers after linear layers?
Same as a normal pooling layer, but along one dimension, i.e. instead of selecting the max response from an n x n window, select it from a 1 x n window. It probably makes the most sense when the previous output is one-dimensional.
Size of the previous output along the desired dimension
Nothing prevents you from doing so. Just do whatever makes sense.

Understanding the output shape of conv2d layer in keras

I do not understand why the channel dimension is not included in the output dimension of a conv2D layer in Keras.
I have the following model
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

def create_model():
    image = Input(shape=(128, 128, 3))
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_1')(image)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_2')(x)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_3')(x)
    flatten = Flatten(name='flatten')(x)
    output = Dense(1, activation='relu', name='output')(flatten)
    model = Model(inputs=image, outputs=output)
    return model

model = create_model()
model.summary()
The model summary is shown at the end of my question. The input layer takes RGB images with width = 128 and height = 128. The first Conv2D layer tells me the output dimension is (None, 61, 61, 24). I have used a kernel size of (8, 8), a stride of (2, 2) and no padding. The values 61 = floor((128 - 8 + 2 * 0)/2 + 1) and 24 (the number of kernels/filters) make sense. But why isn't the dimension for the different channels included in the output shape? As far as I can see, the parameters for the 24 filters on each of the channels are included in the number of parameters. So I would expect the output dimension to be (None, 61, 61, 24, 3) or (None, 61, 61, 24 * 3). Is this just a strange notation in Keras or am I confused about something else?
This question is asked in various forms all over the internet and has a simple answer which is often missed or confused:
SIMPLE ANSWER:
The Keras Conv2D layer, given a multi-channel input (e.g. a color image), will apply the filter across ALL the color channels and sum the results, producing the equivalent of a monochrome convolved output image.
An example, from a CIFAR-10 CNN:
(1) You're training with the CIFAR image dataset, which is made up of 32x32 color images, i.e. each image is shape (32,32,3) (RGB = 3 channels)
(2) Your first layer of your network is a Conv2D Layer with 32 filters, each specified as 3x3, so:
Conv2D(32, (3,3), padding='same', input_shape=(32,32,3))
(3) Counter-intuitively, Keras will configure each filter as (3,3,3), i.e. a 3D volume covering the 3x3 pixels PLUS all the color channels. As a minor detail each filter has an additional weight for a BIAS value, as per normal neural network layer arithmetic.
(4) Convolution proceeds absolutely as normal, except a 3x3x3 VOLUME from the input image is convolved at each step with the 3x3x3 filter, and a single (monochrome) output value (i.e. like a pixel) is produced at each step.
(5) The result is that a Keras Conv2D convolution with a specified (3,3) filter on a (32,32,3) image produces a (32,32) result, because the actual filter used is (3,3,3).
(6) In this example, we have also specified 32 filters in the Conv2D layer, so the actual output is (32,32,32) for each input image (i.e. you might think of this as 32 images, one for each filter, each 32x32 monochrome pixels).
As a check, you can look at the count of weights (Param #) for the layer produced by model.summary():
Layer (type)         Output Shape          Param #
conv2d_1 (Conv2D)    (None, 32, 32, 32)    896
There are 32 filters, each 3x3x3 (i.e. 27 weights) plus 1 for the bias (i.e. total 28 weights each). And 32 filters x 28 weights each = 896 Parameters.
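You can verify this yourself (a minimal sketch of my own, assuming TensorFlow 2.x Keras rather than standalone Keras):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), padding='same'),
])
model.summary()
# conv2d (Conv2D)  (None, 32, 32, 32)  896
# 32 filters * (3*3*3 weights + 1 bias) = 32 * 28 = 896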
Each of the convolutional filters (8 x 8) is connected to an (8 x 8) receptive field across all the channels of the image. That is why we have (61, 61, 24) as the output of the first Conv2D layer. The different channels are encoded implicitly into the weights of the 24 filters. This means that each filter does not have 8 x 8 = 64 weights but instead 8 x 8 x number of channels = 8 x 8 x 3 = 192 weights.
See this quote from CS231n:
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image),
and an example volume of neurons in the first Convolutional layer.
Each neuron in the convolutional layer is connected only to a local
region in the input volume spatially, but to the full depth (i.e. all
color channels). Note, there are multiple neurons (5 in this example)
along the depth, all looking at the same region in the input - see
discussion of depth columns in the text below. Right: The neurons from the
Neural Network chapter remain unchanged: they still compute a dot
product of their weights with the input followed by a non-linearity,
but their connectivity is now restricted to be local spatially.
My guess is that you're misunderstanding how convolutional layers are defined.
My notation for the shape of the convolutional layer is (out_channels, in_channels, k, k), where k is the size of the kernel. The out_channels is the number of filters (i.e. convolutional neurons). Consider the following image:
The 3D convolutional kernel weights in the picture slide across different data windows of A_{i-1} (i.e. the input image). Patches of 3D data of that image, of shape (in_channels, k, k), are paired with individual 3D convolutional kernels of matching dimensionality. How many such 3D kernels are there? As many as the number of output channels, out_channels. The depth dimension that each kernel adopts is the in_channels of A_{i-1}. Therefore, the in_channels dimension of A_{i-1} is contracted away by the depth-wise dot product that builds up the output tensor with out_channels channels. The precise way in which the sliding windows are constructed is defined by the sampling tuple (kernel_size, stride, padding) and results in an output tensor with spatial dimensions determined by the formula that you've correctly applied.
If you want to understand more, including backpropagation and implementation details, take a look at this paper.
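To make the depth-wise contraction concrete, here is a small numpy sketch of my own (it computes one output pixel for every filter, using the (out_channels, in_channels, k, k) notation above):
import numpy as np

in_channels, out_channels, k = 3, 24, 8
kernels = np.random.randn(out_channels, in_channels, k, k)   # one 3D kernel per output channel
patch = np.random.randn(in_channels, k, k)                   # one (in_channels, k, k) window of A_{i-1}

# one value per output channel: the in_channels dimension is summed (contracted) away
out_pixel = np.array([(kernels[c] * patch).sum() for c in range(out_channels)])
print(out_pixel.shape)   # (24,)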
The formula you're using is correct. It may be a little confusing because many popular tutorials use a number of filters equal to the number of channels in the image. Conceptually, the TensorFlow/Keras implementation computes num_input_channels * num_output_channels intermediate 2-D feature maps, one per (input channel, filter) pair. For each filter, the maps from the different input channels are then summed (and a bias added) into a single output channel, which gives an output of shape (output_height, output_width, num_output_channels). Hope this clarifies Vlad's detailed answer.
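As a sanity check on the question's conv_1 layer (my addition, assuming TensorFlow 2.x Keras): the kernel has shape (kernel_h, kernel_w, in_channels, out_channels), and the parameter count matches the summary.
import tensorflow as tf

conv_1 = tf.keras.layers.Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu')
y = conv_1(tf.zeros((1, 128, 128, 3)))   # build the layer on an RGB input

print(conv_1.kernel.shape)    # (8, 8, 3, 24): the 3 input channels live inside each filter
print(y.shape)                # (1, 61, 61, 24): no separate channel dimension in the output
print(conv_1.count_params())  # 8*8*3*24 + 24 = 4632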

Input shape and Conv1d in Keras

The first layer of my neural network is like this:
model.add(Conv1D(filters=40,
                 kernel_size=25,
                 input_shape=x_train.shape[1:],
                 activation='relu',
                 kernel_regularizer=regularizers.l2(5e-6),
                 strides=1))
If my input shape is (600, 10),
I get (None, 576, 40) as the output shape.
If my input shape is (6000, 1),
I get (None, 5976, 40) as the output shape.
So my question is: what exactly is happening here? Is the first example simply ignoring 90% of the input?
It is not "ignoring" a 90% of the input, the problem is simply that if you perform a 1-dimensional convolution with a kernel of size K over an input of size X the result of the convolution will have size X - K + 1. If you want the output to have the same size as the input, then you need to extend or "pad" your data. There are several strategies for that, such as add zeros, replicate the value at the ends or wrap around. Keras' Convolution1D has a padding parameter that you can set to "valid" (the default, no padding), "same" (add zeros at both sides of the input to obtain the same output size as the input) and "causal" (padding with zeros at one end only, idea taken from WaveNet).
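A minimal comparison of the padding options (my addition, assuming TensorFlow 2.x Keras):
import tensorflow as tf

x = tf.random.normal((1, 600, 10))   # batch of 1, length 600, 10 channels
for pad in ('valid', 'same', 'causal'):
    y = tf.keras.layers.Conv1D(filters=40, kernel_size=25, padding=pad)(x)
    print(pad, y.shape)
# valid  (1, 576, 40)   -> 600 - 25 + 1
# same   (1, 600, 40)   -> zeros added at both ends
# causal (1, 600, 40)   -> zeros added at the start only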
Update
About the questions in your comments. So you say your input is (600, 10). That, I assume, is the size of one example, and you have a batch of examples with size (N, 600, 10). From the point of view of the convolution operation, this means you have N examples, each with a length of 600 (this "length" may be time or whatever else, it's just the dimension across which the convolution works) and, at each of these 600 points, you have a vector of size 10. Each of these vectors is considered an atomic sample with 10 features (e.g. price, height, size, whatever), or, as it is sometimes called in the context of convolution, 10 "channels" (from the RGB channels used in 2D image convolution).
The point is, the convolution has a kernel size and a number of output channels, which is the filters parameter in Keras. In your example, what the convolution does is take every possible slice of 25 contiguous 10-vectors and produce a single 40-vector for each (for every example in the batch, of course). So you go from having 10 features or channels in your input to having 40 after the convolution. It's not that it's using only one of the 10 elements in the last dimension, it's using all of them to produce the output.
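To see that all 10 input channels feed each of the 40 output channels, you can check the kernel shape and parameter count (my addition, assuming TensorFlow 2.x Keras):
import tensorflow as tf

conv = tf.keras.layers.Conv1D(filters=40, kernel_size=25)
y = conv(tf.zeros((1, 600, 10)))

print(conv.kernel.shape)    # (25, 10, 40): every filter spans all 10 input channels
print(y.shape)              # (1, 576, 40)
print(conv.count_params())  # 25*10*40 + 40 = 10040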
If the meaning of the dimensions in your input is not what the convolution is interpreting, or if the operation it is performing is not what you were expecting, you may need to either reshape your input or use a different kind of layer.

How to implement 1-D deconvolutional layer with a stride larger than one by tensorflow?

As this guide [A guide to convolution arithmetic for deep learning] explains, a deconvolutional layer can be transformed into an equivalent convolutional layer.
However, when the original convolution has a stride larger than one, the corresponding equivalent convolution of the deconvolution should take a stretched input, obtained by adding s-1 zeros between each pair of input units, where s is the stride of the original convolution.
Here is an example:
[The transpose of convolving a 3×3 kernel over a 5×5 input padded with a 1×1 border of zeros using 2×2 strides]
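For illustration (my addition, not from the original post), the stretched 1-D input can be built explicitly with numpy by inserting s-1 zeros between consecutive units:
import numpy as np

def stretch_1d(x, s):
    # insert s-1 zeros between consecutive input units
    out = np.zeros(len(x) + (len(x) - 1) * (s - 1), dtype=x.dtype)
    out[::s] = x
    return out

print(stretch_1d(np.array([1., 2., 3., 4.]), s=2))   # [1. 0. 2. 0. 3. 0. 4.]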
Here is the problem: because tensorflow only provides a 2-D version of the deconvolutional layer, if I want to implement a 1-D deconvolutional layer for an original convolutional layer with a stride larger than one, how can I add zeros between each input unit?
Thanks very much
I just found that the convolutional layer in keras has a parameter called dilation_rate, and it can cover my requirement.

Weights in Convolutional network?

I am currently following the TensorFlow's Multilayer Convolutional Network tutorial.
In various layers the weights are initialised as follows:
First Convolutional Layer:
W_conv1 = weight_variable([5, 5, 1, 32])
Second Convolutional Layer:
W_conv2 = weight_variable([5, 5, 32, 64])
Densely Connected Layer:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Readout Layer:
W_fc2 = weight_variable([1024, 10])
So my doubt is: how are the shapes of the above weight variables known to us?
Is there any math used to find their shapes?
The answer is explained on the same page:
The convolution will compute 32 features for each 5x5 patch. Its
weight tensor will have a shape of [5, 5, 1, 32]
There is no involved math per se, but these terms need explanation:
The size of the convolution kernel is 5x5. That means there is a 5x5 matrix that is convolved with the input image by moving it around the image. Check this link for an explanation of how a small 5x5 matrix moves over a 28x28 image and multiplies the image cells it covers by its own weights. This gives us the first two dimensions of [5, 5, 1, 32].
The number of input channels is 1. These are black-and-white images, hence one input channel. Most color images have 3 channels, so expect a 3 in other convolutional networks working on images. Indeed, for the second layer, W_conv2, the number of input channels is 32, the same as the number of output channels of layer 1.
The last dimension of the weight tensor is perhaps the hardest to visualize. Imagine your 5x5 matrix and replicate it 32 times. Each of these 32 copies is called an output channel (or filter). To complete the discussion, each of these 32 5x5 matrices is initialized with random weights and trained independently of the others during forward/back propagation of the network. More channels learn different aspects of the image and hence give extra power to your network.
If you put these three points together, you get the desired dimensions of layer 1. Subsequent layers are an extension: the first two dimensions are the kernel size (5x5 in this case). The third dimension equals the number of input channels, which equals the number of output channels of the previous layer (32, since we declared 32 output channels in layer 1). The final dimension is the number of output channels of the current layer (64, even larger for the second layer; again, keeping a large number of independent 5x5 kernels helps).
Finally, the last two layers: the final dense layer is the only one that involves some calculation:
For each convolution layer (which uses 'SAME' padding in this tutorial), final size = initial size
For a pooling layer of size k x k, final size = initial size / k
So,
For conv1, the size remains 28 x 28
pool1 reduces the size to 14 x 14
For conv2, the size remains 14 x 14
pool2 reduces the size to 7 x 7
And of course, we have 64 channels due to conv2; pooling doesn't affect them. Hence, we get a final dense input of 7 x 7 x 64 = 3136. We then create a fully connected layer with 1024 hidden units and add 10 output classes for the 10 digits.
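A minimal sketch that reproduces these shapes (my addition; it uses Keras layers rather than the tutorial's raw TensorFlow variables):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, (5, 5), padding='same', activation='relu'),  # 28 x 28 x 32
    tf.keras.layers.MaxPooling2D((2, 2)),                                   # 14 x 14 x 32
    tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu'),  # 14 x 14 x 64
    tf.keras.layers.MaxPooling2D((2, 2)),                                   # 7 x 7 x 64
    tf.keras.layers.Flatten(),                                              # 3136
    tf.keras.layers.Dense(1024, activation='relu'),
    tf.keras.layers.Dense(10),
])
model.summary()   # the Dense(1024) layer sees 7*7*64 = 3136 inputs, matching W_fc1's first dimension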
