The Keras layer documentation specifies the input and output sizes for convolutional layers:
https://keras.io/layers/convolutional/
Input shape: (samples, channels, rows, cols)
Output shape: (samples, filters, new_rows, new_cols)
And the kernel size is a spatial parameter, i.e. it determines only width and height.
So an input with c channels will yield an output with filters channels regardless of the value of c. It must therefore apply 2D convolution with a spatial height x width filter and then aggregate the results somehow for each learned filter.
What is this aggregation operator? Is it a summation across channels? Can I control it? I couldn't find any information in the Keras documentation.
Note that in TensorFlow the filters are specified in the depth channel as well:
https://www.tensorflow.org/api_guides/python/nn#Convolution
So the depth operation is clear.
Thanks.
It might be confusing that it is called Conv2D layer (it was to me, which is why I came looking for this answer), because as Nilesh Birari commented:
I guess you are missing it's 3D kernel [width, height, depth]. So the result is summation across channels.
Perhaps the 2D stems from the fact that the kernel only slides along two dimensions, the third dimension is fixed and determined by the number of input channels (the input depth).
For a more elaborate explanation, read https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
I plucked an illustrative image from there:
I was also wondering this, and found another answer here, where it is stated (emphasis mine):
Maybe the most tangible example of a multi-channel input is when you have a color image which has 3 RGB channels. Let's get it to a convolution layer with 3 input channels and 1 output channel. (...) What it does is that it calculates the convolution of each filter with its corresponding input channel (...). The stride of all channels are the same, so they output matrices with the same size. Now, it sums up all matrices and output a single matrix which is the only channel at the output of the convolution layer.
Illustration:
Notice that the weights of the convolution kernels are different for each channel. They are then iteratively adjusted in the back-propagation steps by gradient-descent-based algorithms such as stochastic gradient descent (SGD).
Here is a more technical answer from TensorFlow API.
I also needed to convince myself so I ran a simple example with a 3×3 RGB image.
# red # green # blue
1 1 1 100 100 100 10000 10000 10000
1 1 1 100 100 100 10000 10000 10000
1 1 1 100 100 100 10000 10000 10000
The filter is initialised to ones:
1 1
1 1
I have also set the convolution to have these properties:
no padding
strides = 1
relu activation function
bias initialised to 0
We would expect the (aggregated) output to be:
40404 40404
40404 40404
Also, from the picture above, the no. of parameters is
3 separate filters (one for each channel) × 4 weights + 1 (bias, not shown) = 13 parameters
Here's the code.
Import modules:
import numpy as np
from keras.layers import Input, Conv2D
from keras.models import Model
Create the red, green and blue channels:
red = np.array([1]*9).reshape((3,3))
green = np.array([100]*9).reshape((3,3))
blue = np.array([10000]*9).reshape((3,3))
Stack the channels to form an RGB image:
img = np.stack([red, green, blue], axis=-1)
img = np.expand_dims(img, axis=0)
Create a model that just does a Conv2D convolution:
inputs = Input((3,3,3))
conv = Conv2D(filters=1,
              strides=1,
              padding='valid',
              activation='relu',
              kernel_size=2,
              kernel_initializer='ones',
              bias_initializer='zeros')(inputs)
model = Model(inputs,conv)
Input the image in the model:
model.predict(img)
# array([[[[40404.],
# [40404.]],
# [[40404.],
# [40404.]]]], dtype=float32)
Run a summary to get the number of params:
model.summary()
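As a cross-check that does not depend on Keras, the same toy example can be reproduced in plain NumPy. This is only a sketch under the settings listed above (one filter of ones, no padding, stride 1, ReLU, zero bias); the loop is an illustration and not how Keras implements the layer internally.

```python
import numpy as np

# Same toy image as above: three constant 3x3 channels, stacked channels-last.
red = np.full((3, 3), 1.0)
green = np.full((3, 3), 100.0)
blue = np.full((3, 3), 10000.0)
img = np.stack([red, green, blue], axis=-1)   # shape (3, 3, 3)

# One filter of ones covering a 2x2 window and all 3 input channels.
kernel = np.ones((2, 2, 3))
bias = 0.0

# 'valid' padding with stride 1 gives a (3 - 2 + 1) x (3 - 2 + 1) output.
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        patch = img[i:i + 2, j:j + 2, :]      # 2x2 window, all channels
        # Multiply element-wise, sum over height, width AND channels, then ReLU.
        out[i, j] = max(np.sum(patch * kernel) + bias, 0.0)

print(out)              # every entry is 4*1 + 4*100 + 4*10000 = 40404
print(kernel.size + 1)  # 12 weights + 1 bias = 13 parameters
```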
How can I apply Group Normalization after a full-connection layer? Say the output of the full-connection layer is 1024. And the group normalization layer is using 16 groups.
self.gn1 = nn.GroupNorm(16, hidden_size)
h1 = F.relu(self.gn1(self.fc1(x)))
Am I right? How should we understand the group normalization if it is applied to the output of a full-connection layer?
Your code is correct, but let's see what happens in a small example.
The output of a fully-connected layer is usually a 2D tensor with shape (batch_size, hidden_size), so I will focus on this kind of input, but remember that GroupNorm supports tensors with an arbitrary number of dimensions, as long as the shape is (N, C, *). GroupNorm always splits the channel dimension (dimension 1) into groups; for a 2D input this is also the last dimension.
GroupNorm treats all the samples in the batch as independent and creates n_groups groups from the channel dimension of the tensor, as you can see from the image.
When the input tensor is 2D, the cube in the image becomes a square because there is no third vertical dimension, so in practice the normalization is performed on fixed-size consecutive pieces of the rows of the input matrix.
Let's see an example with some code.
import torch
import torch.nn as nn
batch_size = 2
hidden_size = 32
n_groups = 8
group_size = hidden_size // n_groups # = 4
# Input tensor that can be the result of a fully-connected layer
x = torch.rand(batch_size, hidden_size)
# GroupNorm with affine disabled to simplify the inspection of results
gn1 = nn.GroupNorm(n_groups, hidden_size, affine=False)
r = gn1(x)
# The rows are split into n_groups (8) groups of size group_size (4)
# and the normalization is applied to these pieces of rows.
# We can check it for the first group x[0, :group_size] with the following code
first_group = x[0, :group_size]
normalized_first_group = (first_group - first_group.mean())/torch.sqrt(first_group.var(unbiased=False) + gn1.eps)
print(r[0, :4])
print(normalized_first_group)
if torch.allclose(r[0, :4], normalized_first_group):
    print('The result on the first group is the expected one')
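To check every group at once rather than only the first one, the same computation can be written with a reshape. The following is a NumPy sketch, not PyTorch's implementation: the helper name `group_norm_2d` is made up for illustration, it assumes a 2D input with affine parameters disabled, and `eps=1e-5` mirrors the default of `nn.GroupNorm`.

```python
import numpy as np

def group_norm_2d(x, n_groups, eps=1e-5):
    """Group normalization for a 2D array of shape (batch_size, hidden_size)."""
    batch_size, hidden_size = x.shape
    assert hidden_size % n_groups == 0, "hidden_size must be divisible by n_groups"
    # Split every row into n_groups consecutive pieces.
    grouped = x.reshape(batch_size, n_groups, hidden_size // n_groups)
    mean = grouped.mean(axis=-1, keepdims=True)
    var = grouped.var(axis=-1, keepdims=True)   # biased variance (ddof=0)
    normalized = (grouped - mean) / np.sqrt(var + eps)
    return normalized.reshape(batch_size, hidden_size)

rng = np.random.default_rng(0)
x = rng.random((2, 32))                # stand-in for the output of a dense layer
r = group_norm_2d(x, n_groups=8)

per_group = r.reshape(2, 8, 4)
print(per_group.mean(axis=-1))         # each group's mean is approximately 0
```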
The above picture generated using Matlab's deep learning toolbox shows the architecture of a CNN created for a toy example. The input image is of size 25*20*7, number of filters are 15 each of size 5*5 and padding is same. The output of the first convolution conv1 is 25*20*15 which goes into maxpooling 1 operation of size 2*2 with stride 1 and padding same.
Based on my understanding, the role of maxpooling is to perform dimension reduction. However, in my code, since the padding is set to same, I understand that the output of maxpooling will preserve the spatial dimensions of its input, which is 25*20*15. That is why the output of maxpooling1 and of the remaining maxpooling layers has the same dimensions as its input, and there is no change in dimension in the remaining layers. As an example, with a pool size of 2, a stride of 1 and same padding (one padded row in total), the first dimension of the maxpooling output is (25 - 2 + 1)/1 + 1 = 25. Similarly, the second dimension is (20 - 2 + 1)/1 + 1 = 20. Thus, the output of maxpooling should be 25*20*15.
This implies that maxpooling is not doing dimension reduction. Therefore, should I remove maxpooling if the padding option is set to same?
Please let me know how the dimensions are same after doing maxpooling and if same dimension then should I remove this operation? Or did I do some mistake?
The role of padding is different for convolutional and maxpooling layer. If padding=same in convolutional layer, it means that the output size (primarily height and width) remains the same as the input.
On the other hand, padding in pooling layers has a different functionality. The purpose of pooling layers is to reduce the spatial dimensions (height and width). In pooling layers, padding = same does not mean that the spatial dimensions do not change. Padding in pooling is required to make up for overlaps when the input size and kernel size do not perfectly fit.
tl;dr: If you want to reduce the image size at every step, use padding='valid'. It is the default option.
Max pooling is generally used for downsampling and for reducing overfitting.
If you use padding='same' with a stride of 1, the input is padded so that the output has the same size as the input, so there is no reduction in size.
In the example below, input size is 4 * 4, pool is 2*2 and (default) stride is 2x2, so output is 2 * 2
Find more examples on keras' official site
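The effect of padding and stride on the pooled size can be sketched with a small helper. It assumes the TensorFlow/Keras conventions (output of ceil(n / stride) for 'same', floor((n - pool) / stride) + 1 for 'valid'); the function name `pool_output_size` is made up for illustration, and other frameworks may round differently.

```python
import math

def pool_output_size(n, pool, stride, padding):
    """Spatial output size of a pooling layer along one dimension."""
    if padding == "same":
        # The input is padded just enough that only the stride shrinks it.
        return math.ceil(n / stride)
    if padding == "valid":
        # No padding: only windows that fit entirely inside the input count.
        return (n - pool) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

# Stride 1 with 'same' padding: no reduction, as observed in the question.
print(pool_output_size(25, pool=2, stride=1, padding="same"))   # 25
print(pool_output_size(20, pool=2, stride=1, padding="same"))   # 20

# Stride 2: the size is roughly halved with either padding mode.
print(pool_output_size(25, pool=2, stride=2, padding="same"))   # 13
print(pool_output_size(25, pool=2, stride=2, padding="valid"))  # 12
print(pool_output_size(4, pool=2, stride=2, padding="valid"))   # 2
```

In other words, the stride does the downsampling; the padding mode only decides what happens at the borders.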
In my problem, I want to convolve two tensors in my neural network model.
The shape of two tensors is [None, 2, 1], [None, 3, 1] respectively. The axis with dimension None means the batch size of the input tensor. For each sample in batch, I want to convolve the two tensors with shape [2, 1] and [3, 1].
However, tf.nn.conv1d in TensorFlow can only convolve the input with a fixed kernel. Is there a function that convolves two tensors sample by sample along the batch axis, similar to tf.multiply, which multiplies two tensors elementwise for each sample?
The code I ran can be simplified as follows:
input_signal = Input(shape=(L, M), name='input_signal')
input_h = Input(shape=(N), name='input_h')
faded= Lambda(lambda x: tf.nn.conv1d(input, x))(input_h)
What I want is for each sample of input_signal to be convolved with the sample of input_h that has the same index. However, the code above only illustrates the idea and does not run. How can I modify it so that one input tensor is convolved with another input tensor for every sample in the batch?
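To make the intended semantics concrete, here is a NumPy sketch of a per-sample convolution. It is only an illustration of the desired result, not a TensorFlow layer: the helper name `batched_convolve` is made up, the trailing channel axis of size 1 is dropped, and a "full" convolution is assumed. Note that np.convolve flips the kernel, whereas tf.nn.conv1d computes a cross-correlation.

```python
import numpy as np

def batched_convolve(signals, kernels):
    """Convolve signals[i] with kernels[i] for every sample i in the batch.

    signals: array of shape (batch, L)
    kernels: array of shape (batch, N)
    returns: array of shape (batch, L + N - 1)
    """
    return np.stack([
        np.convolve(s, h, mode="full")   # each sample gets its own kernel
        for s, h in zip(signals, kernels)
    ])

signals = np.array([[1.0, 2.0, 3.0],
                    [0.0, 1.0, 0.0]])
kernels = np.array([[1.0, 1.0],
                    [2.0, -1.0]])

out = batched_convolve(signals, kernels)
print(out.shape)   # (2, 4)
print(out)         # [[1, 3, 5, 3], [0, 2, -1, 0]]
```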
According to the description of the kernel_size argument for the Conv1D layer (or any other convolutional layer) in the documentation, you cannot add multiple filters with different kernel sizes or strides to a single layer.
Also, convolutions with kernels of different sizes will produce outputs of different sizes.
The general formula for output size assuming a symmetric kernel is given by
(X−K+2P)/S+1
Where X is the input Height / Width
K is the Kernel size
P is the zero-padding
S is the stride length
So, assuming you keep the zero padding and the stride the same, you cannot have multiple kernels with different sizes in a single Conv1D layer.
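The formula above can be written as a one-line helper to see how the output length changes with the kernel size. This is a sketch: the function name `conv_output_size` and the input length of 100 are made up for illustration, and integer (floor) division is assumed when the stride does not divide evenly.

```python
def conv_output_size(x, k, p=0, s=1):
    """Output size (X - K + 2P) / S + 1 along one dimension, rounded down."""
    return (x - k + 2 * p) // s + 1

n_timesteps = 100  # hypothetical input length

# Same input, same padding and stride, different kernel sizes:
print(conv_output_size(n_timesteps, k=2))   # 99
print(conv_output_size(n_timesteps, k=3))   # 98
```

The two lengths differ by one, which is why the outputs need padding, cropping or pooling before they can be concatenated along the channel axis.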
You can, however, use the tf.keras.Model API to create Conv1D multiple times on the same input OR multiple Conv1D Layer for different inputs and kernel size respectively in your case and then either maxpool, crop or use zero paddings to match the dimensions of the different outputs before stacking them.
Example:
inputs = tf.keras.Input(shape=(n_timesteps,n_features))
x1 = tf.keras.layers.Conv1D(filters=32, kernel_size=2)(inputs)
x2 = tf.keras.layers.Conv1D(filters=16, kernel_size=3)(inputs)
#match dimensions (height and width) of x1 or x2 here
x3 = tf.keras.layers.Concatenate(axis=-1)([x1, x2])
You can use ZeroPadding1D, Cropping1D, or MaxPooling1D to match the dimensions.
I do not understand why the channel dimension is not included in the output dimension of a conv2D layer in Keras.
I have the following model
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

def create_model():
    image = Input(shape=(128,128,3))
    x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_1')(image)
    x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_2')(x)
    x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_3')(x)
    flatten = Flatten(name='flatten')(x)
    output = Dense(1, activation='relu', name='output')(flatten)
    model = Model(inputs=image, outputs=output)
    return model

model = create_model()
model.summary()
The model summary is given in the figure at the end of my question. The input layer takes RGB images with width = 128 and height = 128. The first Conv2D layer tells me the output dimension is (None, 61, 61, 24). I have used a kernel size of (8, 8), a stride of (2, 2) and no padding. The values 61 = floor((128 - 8 + 2 * 0)/2 + 1) and 24 (number of kernels/filters) make sense. But why isn't the dimension for the different channels included in the output dimension? As far as I can see, the parameters for the 24 filters on each of the channels are included in the number of parameters. So I would expect the output dimension to be (None, 61, 61, 24, 3) or (None, 61, 61, 24 * 3). Is this just a strange notation in Keras, or am I confused about something else?
This question is asked in various forms all over the internet and has a simple answer which is often missed or confused:
SIMPLE ANSWER:
The Keras Conv2D layer, given a multi-channel input (e.g. a color image), will apply the filter across ALL the color channels and sum the results, producing the equivalent of a monochrome convolved output image.
An example, from a CIFAR-10 CNN example:
(1) You're training with the CIFAR image dataset, which is made up of 32x32 color images, i.e. each image is shape (32,32,3) (RGB = 3 channels)
(2) Your first layer of your network is a Conv2D Layer with 32 filters, each specified as 3x3, so:
Conv2D(32, (3,3), padding='same', input_shape=(32,32,3))
(3) Counter-intuitively, Keras will configure each filter as (3,3,3), i.e. a 3D volume covering the 3x3 pixels PLUS all the color channels. As a minor detail each filter has an additional weight for a BIAS value, as per normal neural network layer arithmetic.
(4) Convolution proceeds absolutely as normal, except a 3x3x3 VOLUME from the input image is convolved at each step with the 3x3x3 filter, and a single (monochrome) output value (i.e. like a pixel) is produced at each step.
(5) The result is a Keras Conv2D convolution of a specified (3,3) filter on a (32,32,3) image produces a (32,32) result because the actual filter used is (3,3,3).
(6) In this example, we have also specified 32 filters in the Conv2D layer, so the actual output is (32,32,32) for each input image (i.e. you might think of this as 32 images, one for each filter, each 32x32 monochrome pixels).
As a check, you can look at the count of weights (Param #) for the layer produced by model.summary():
Layer (type) Output shape Param#
conv2d_1 (Conv2D) (None, 32, 32, 32) 896
There are 32 filters, each 3x3x3 (i.e. 27 weights) plus 1 for the bias (i.e. total 28 weights each). And 32 filters x 28 weights each = 896 Parameters.
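The parameter count above can be reproduced with a small helper. This is a sketch of the arithmetic only; the function name `conv2d_param_count` is made up for illustration.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, filters, use_bias=True):
    """Number of trainable parameters of a standard 2D convolution layer."""
    # Each filter spans the kernel window AND every input channel.
    weights_per_filter = kernel_h * kernel_w * in_channels
    bias_per_filter = 1 if use_bias else 0
    return filters * (weights_per_filter + bias_per_filter)

# CIFAR-10 example: 32 filters of 3x3 on a 3-channel image.
print(conv2d_param_count(3, 3, in_channels=3, filters=32))   # 896

# First layer of the model in the question: 24 filters of 8x8 on RGB input.
print(conv2d_param_count(8, 8, in_channels=3, filters=24))   # 24 * (192 + 1) = 4632
```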
Each of the convolutional filters (8 x 8) is connected to a (8 x 8) receptive field for all the channels of the image. That is why we have (61, 61, 24) as the output of the second layer. The different channels are encoded implicitly into the weights of the 24 filters. This means, that each filter does not have 8 x 8 = 64 weights but instead 8 x 8 x Number of channels = 8 x 8 x 3 = 192 weights.
See this quote from CS231n:
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image),
and an example volume of neurons in the first Convolutional layer.
Each neuron in the convolutional layer is connected only to a local
region in the input volume spatially, but to the full depth (i.e. all
color channels). Note, there are multiple neurons (5 in this example)
along the depth, all looking at the same region in the input - see
discussion of depth columns in the text below. Right: The neurons from the
Neural Network chapter remains unchanged: They still compute a dot
product of their weights with the input followed by a non-linearity,
but their connectivity is now restricted to be local spatially.
My guess is that you're misunderstanding how convolutional layers are defined.
My notation for the shape of the convolutional layer is (out_channels, in_channels, k, k), where k is the size of the kernel. The out_channels is the number of filters (i.e. convolutional neurons). Consider the following image:
The 3D convolutional kernel weights in the picture slide across different data windows of A_{i-1} (i.e. the input image). Patches of 3D data of that image, of shape (in_channels, k, k), are paired with individual 3D convolutional kernels of matching dimensionality. How many such 3D kernels are there? As many as the number of output channels, out_channels. The depth of each kernel equals the in_channels of A_{i-1}. Therefore, the in_channels dimension of A_{i-1} is contracted away by the depth-wise dot product that builds up the output tensor with out_channels channels. The precise way in which the sliding windows are constructed is defined by the sampling tuple (kernel_size, stride, padding), and results in an output tensor with spatial dimensions determined by the formula that you've correctly applied.
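The contraction over in_channels can be made explicit with a naive NumPy loop. This is a sketch for illustration only: it uses the (out_channels, in_channels, k, k) weight layout described above, a channels-first input, no padding and no bias, and the function name `conv2d_naive` is made up. Real frameworks use far more efficient implementations.

```python
import numpy as np

def conv2d_naive(a_prev, weights, stride=1):
    """a_prev: (in_channels, H, W); weights: (out_channels, in_channels, k, k)."""
    in_channels, height, width = a_prev.shape
    out_channels, w_in_channels, k, _ = weights.shape
    assert w_in_channels == in_channels, "kernel depth must match input depth"
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    out = np.zeros((out_channels, out_h, out_w))
    for o in range(out_channels):
        for i in range(out_h):
            for j in range(out_w):
                r, c = i * stride, j * stride
                patch = a_prev[:, r:r + k, c:c + k]   # (in_channels, k, k)
                # Dot product over channels, rows and columns: one scalar
                # per output position, so in_channels disappears here.
                out[o, i, j] = np.sum(patch * weights[o])
    return out

rng = np.random.default_rng(0)
a_prev = rng.random((3, 128, 128))     # RGB image, channels first
weights = rng.random((24, 3, 8, 8))    # 24 filters, each 3 x 8 x 8

out = conv2d_naive(a_prev, weights, stride=2)
print(out.shape)   # (24, 61, 61): 24 output channels, no trace of the 3
```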
If you want to understand more, including backpropagation and implementation take a look at this paper.
The formula you're using is correct. It may be a little confusing because many popular tutorials use a number of filters equal to the number of channels in the image. Conceptually, the TensorFlow/Keras implementation holds num_input_channels * num_output_channels 2D kernels of size (kernel_size[0], kernel_size[1]). For each output channel, every input channel is convolved with its own 2D kernel, and the resulting num_input_channels intermediate maps are summed into one feature map. The num_output_channels feature maps are then stacked to create an output of shape (new_rows, new_cols, num_output_channels). Hope this clarifies Vlad's detailed answer.
I am currently following the TensorFlow's Multilayer Convolutional Network tutorial.
In the various layers, the weights are initialised as follows:
First Convolutional Layer:
W_conv1 = weight_variable([5, 5, 1, 32])
Second Convolutional Layer:
W_conv2 = weight_variable([5, 5, 32, 64])
Densely Connected Layer:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Readout Layer:
W_fc2 = weight_variable([1024, 10])
So my question is: how do we know the shapes of the above weight variables?
Is there any math used to find these shapes?
The answer is explained on the same page:
The convolution will compute 32 features for each 5x5 patch. Its
weight tensor will have a shape of [5, 5, 1, 32]
There is no involved math per se, but these terms need explanation.
The size of the convolution kernel is 5X5. That means there is a 5X5 matrix that is convolved with an input image by moving it around the image. Check this link for an explanation of how a small 5X5 matrix moves over a 28X28 image and multiplies different cells of the image matrix with itself. This gives us the first two dimensions of [5, 5, 1, 32].
The number of input channels is 1. These are black-and-white images, hence one input channel. Most colored images have 3 channels, so expect a 3 in some other convolutional networks working on images. Indeed, for the second layer, W_conv2, the number of input channels is 32, the same as the number of output channels of layer 1.
The last dimension of the weight matrix is perhaps the hardest to visualize. Imagine your 5X5 matrix, and replicate it 32 times! Each of these 32 copies is called a channel. To complete the discussion, each of these 32 5X5 matrices is initialized with random weights and trained independently during forward/back propagation of the network. More channels learn different aspects of the image and hence give extra power to your network.
If you summarize these 3 points, you get the desired dimensions of layer 1. Subsequent layers are an extension: the first two dimensions are the kernel sizes (5X5 in this case). The third dimension is equal to the number of input channels, which is equal to the number of output channels of the previous layer (32, since we declared 32 output channels for layer 1). The final dimension is the number of output channels of the current layer (64, even larger for the second layer! Again, keeping a large number of independent 5X5 kernels helps!).
Finally, the last two layers. The final dense layer is the only thing that involves some calculation:
For each convolution layer, final size = initial size (the tutorial's convolutions use a stride of 1 and 'SAME' padding)
For a pooling layer of size kXk, final size = initial size / k
So,
For conv1, size remains 28 X 28
pool1 reduces size to 14 X 14
For conv2, size remains 14 X 14
pool2 reduces size to 7 X 7
And of course, we have 64 channels due to conv2; pooling doesn't affect them. Hence, we get a final dense input of 7X7X64. We then create a fully connected layer with 1024 hidden units and add 10 output classes for the 10 digits.
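The bookkeeping above can be traced in a few lines of Python. This is a sketch of the arithmetic only, assuming convolutions that preserve the spatial size and 2X2 pooling that halves it, as described above; the variable names are made up for illustration.

```python
# Spatial size of the feature maps, starting from a 28 x 28 input image.
size = 28
# (layer name, factor by which the layer divides the spatial size)
layers = [("conv1", 1), ("pool1", 2), ("conv2", 1), ("pool2", 2)]
for name, factor in layers:
    size //= factor
    print(name, size, "x", size)

channels = 64                        # output channels of conv2
dense_inputs = size * size * channels
print(dense_inputs)                  # 7 * 7 * 64 = 3136

# Weight shapes as [kernel_h, kernel_w, in_channels, out_channels] for the
# convolutions and [inputs, outputs] for the dense layers.
weight_shapes = {
    "W_conv1": [5, 5, 1, 32],
    "W_conv2": [5, 5, 32, 64],
    "W_fc1": [dense_inputs, 1024],
    "W_fc2": [1024, 10],
}
print(weight_shapes)
```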