I want to predict a spatio-temporal system with dimensions (x, y, t) using a convolutional neural network. My time dimension is orders of magnitude smaller than both space dimensions, i.e. t=16, x=y=1024.
I am currently using 3D convolutional layers with channel dimension 1. Say I choose a kernel size of 5 for all dimensions; I end up with a kernel of size (5, 5, 5, 1). Since 3D convolutions are computationally expensive, the model training tends to crash a lot. One workaround I thought of is to use 2D convolutional layers and treat the time dimension as channels. My resulting kernel would be of size (5, 5, 16) with t=16 time steps.
In summary, by using 2D convolutions I get rid of one dimension to stride along. On the other hand, the 2D layer has more weights / trainable parameters than the 3D layer: 2D: 5x5x16 = 400 vs. 3D: 5x5x5x1 = 125 (plus a bias each). How does this translate into the computational cost of Conv3D versus Conv2D layers?
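For a rough feel of the difference, below is a minimal sketch in tf.keras (my own toy setup: a single filter, 'same' padding, stride 1) that compares the parameter counts and counts multiply-accumulates by hand; the real cost also depends on the number of filters, the backend, and memory traffic.

# Rough cost comparison: Conv3D over (t, x, y) vs Conv2D treating t as channels.
import tensorflow as tf

t, x, y = 16, 1024, 1024

conv3d = tf.keras.Sequential([
    tf.keras.Input(shape=(t, x, y, 1)),       # (time, height, width, channels=1)
    tf.keras.layers.Conv3D(1, kernel_size=5, padding='same'),
])
conv2d = tf.keras.Sequential([
    tf.keras.Input(shape=(x, y, t)),          # time steps folded into the channel axis
    tf.keras.layers.Conv2D(1, kernel_size=5, padding='same'),
])

print(conv3d.count_params())   # 5*5*5*1*1 + 1 = 126 weights
print(conv2d.count_params())   # 5*5*16*1  + 1 = 401 weights

# Multiply-accumulates per sample (padding='same', stride 1, one filter):
macs_3d = (t * x * y) * (5 * 5 * 5 * 1)   # ~2.1e9: the kernel also strides along t
macs_2d = (x * y) * (5 * 5 * t)           # ~4.2e8: one output map, no striding along t
print(macs_3d / macs_2d)                  # ~5x more multiply-adds for the 3D version here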
The picture above, generated using MATLAB's Deep Learning Toolbox, shows the architecture of a CNN created for a toy example. The input image is of size 25*20*7, the number of filters is 15, each of size 5*5, and the padding is 'same'. The output of the first convolution, conv1, is 25*20*15, which goes into the maxpooling1 operation of size 2*2 with stride 1 and padding 'same'.
Based on my understanding, the role of maxpooling is to perform dimension reduction. However, in my code, since the padding is set to 'same', I understand that the output of maxpooling preserves the spatial dimensions of its input, which is 25*20*15. That is why the output of maxpooling1 and of the remaining maxpooling layers has the same dimensions as its input, and there is no change in dimensions in the remaining layers. As an example, with 'same' padding and stride 1 the total padding is kernel_size - 1 = 1, so the output of maxpooling should be: floor((25 - 2 + 1)/1) + 1 = 25. Similarly, for the second dimension maxpooling yields: floor((20 - 2 + 1)/1) + 1 = 20. Thus, the output of maxpooling should be 25*20*15.
This implies that maxpooling is not doing dimension reduction. Therefore, should I remove maxpooling if the padding option is set to same?
Please let me know why the dimensions are the same after maxpooling, and if they are the same, whether I should remove this operation. Or did I make a mistake somewhere?
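For reference, a quick shape check of the setup described above, written here in Keras rather than MATLAB (my assumption for the sake of a runnable sketch; the 'same'-padding arithmetic is the same in both toolboxes):

# Conv with 15 filters of 5x5 and 'same' padding, then 2x2 max pooling with stride 1 and 'same' padding.
import tensorflow as tf

inp = tf.keras.Input(shape=(25, 20, 7))
x = tf.keras.layers.Conv2D(15, (5, 5), padding='same')(inp)              # -> (None, 25, 20, 15)
x = tf.keras.layers.MaxPooling2D((2, 2), strides=1, padding='same')(x)   # -> (None, 25, 20, 15)
print(tf.keras.Model(inp, x).output_shape)   # (None, 25, 20, 15): stride 1 + 'same' keeps the size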
The role of padding is different for convolutional and maxpooling layers. If padding='same' in a convolutional layer with stride 1, the output size (primarily height and width) remains the same as the input.
On the other hand, padding in pooling layers plays a different role. The purpose of pooling layers is to reduce the spatial dimensions (height and width). In pooling layers, padding='same' does not by itself mean that the spatial dimensions stay unchanged: it only pads the input so that every pooling window fits, giving an output size of ceil(input_size / stride). With the usual stride equal to the pool size the size is still reduced; in your case it is the stride of 1 that keeps the output the same size as the input.
tldr: If you want to reduce the image size at every step, use padding='valid'; it is the default option.
Max pooling is generally used for downsampling and for reducing overfitting.
If you use padding='same' with stride 1, as in your model, the input is padded so that the output keeps the input size, i.e. there is no drop in size.
In the example below, the input size is 4*4, the pool size is 2*2 and the (default) stride is 2*2, so the output is 2*2.
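The example referred to is not shown here, so the following is a minimal reconstruction under the stated settings (my own toy values, TF2 Keras assumed):

# 4x4 input, 2x2 pool, default stride 2, default padding 'valid' -> 2x2 output.
import numpy as np
import tensorflow as tf

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)    # batch of one 4x4 single-channel image
pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x) # default strides = pool_size, padding='valid'
print(pooled.shape)            # (1, 2, 2, 1)
print(pooled[0, :, :, 0])      # the max of each 2x2 block: 5, 7, 13, 15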
Find more examples on keras' official site
I am having trouble understanding the way 2 or more convolutional layers (each followed by a pooling layer) work in a CNN.
Consider the input to be a 3-channel 300x300 image. If the first convolutional layer has 32 filters and the second layer has 64 filters, then the first layer creates 32 feature maps. But how many feature maps does the second layer create? Does every one of the 64 filters act on each of the previously generated 32 feature maps, thus creating 32*64 = 2048 feature maps in total? Or does something else take place?
A simple piece of code relating to the question is:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D(2, 2)])
The number of channels of the input matrix and the number of channels in each filter must match in order to be able to perform element-wise multiplication.
So the main difference between the first and second convolutions is that the number of channels in the input matrix for the first convolution is 3, so we use 32 filters where each filter has 3 channels (the depth of the kernel matrix).
For the second convolution, the input matrix has 32 channels (feature maps), so each filter for this convolution must have 32 channels as well. For example, each of the 64 filters will have shape 3x3x32.
The result of a convolution step for a single filter of shape 3x3x32 is a single channel of shape WxH (width, height). After applying all 64 filters (each of shape 3x3x32) we get 64 channels, where each channel is the result of the convolution with a single filter.
The first convolution layer has 32 filters, and each filter spans all THREE channels of the image; the per-channel results are summed, so the output of the first Conv2D is 32 feature maps (not 32x3 = 96). Likewise, each of the 64 filters of the second Conv2D spans all 32 of those feature maps, so after the 2nd Conv2D there are 64 feature maps (not 32x64).
That is why Keras shows (..., 32) and (..., 64). You can use model.summary() to check, as below.
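For instance, a sketch of that model.summary() check on the model from the question (TF2-style Keras assumed; exact layer names may differ):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D(2, 2)])
model.summary()
# conv2d    -> output (None, 298, 298, 32), params 32*(3*3*3  + 1) =   896
# conv2d_1  -> output (None, 147, 147, 64), params 64*(3*3*32 + 1) = 18496
# i.e. the second layer outputs 64 feature maps, and each of its 64 filters spans all 32 input channels.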
I do not understand why the channel dimension is not included in the output dimension of a conv2D layer in Keras.
I have the following model
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense
from tensorflow.keras.models import Model

def create_model():
    image = Input(shape=(128, 128, 3))
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_1')(image)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_2')(x)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_3')(x)
    flatten = Flatten(name='flatten')(x)
    output = Dense(1, activation='relu', name='output')(flatten)
    model = Model(inputs=image, outputs=output)
    return model
model = create_model()
model.summary()
The model summary is given in the figure at the end of my question. The input layer takes RGB images with width = 128 and height = 128. The first Conv2D layer tells me the output dimension is (None, 61, 61, 24). I have used a kernel size of (8, 8), a stride of (2, 2) and no padding. The values 61 = floor((128 - 8 + 2*0)/2) + 1 and 24 (the number of kernels/filters) make sense. But why isn't the dimension for the different channels included in the output dimension? As far as I can see, the parameters for the 24 filters on each of the channels are included in the number of parameters. So I would expect the output dimension to be (None, 61, 61, 24, 3) or (None, 61, 61, 24 * 3). Is this just a strange notation in Keras or am I confused about something else?
This question is asked in various forms all over the internet and has a simple answer which is often missed or confused:
SIMPLE ANSWER:
The Keras Conv2D layer, given a multi-channel input (e.g. a color image), will apply the filter across ALL the color channels and sum the results, producing the equivalent of a monochrome convolved output image.
An example, from a CIFAR-10 CNN example:
(1) You're training with the CIFAR image dataset, which is made up of 32x32 color images, i.e. each image is shape (32,32,3) (RGB = 3 channels)
(2) Your first layer of your network is a Conv2D Layer with 32 filters, each specified as 3x3, so:
Conv2D(32, (3,3), padding='same', input_shape=(32,32,3))
(3) Counter-intuitively, Keras will configure each filter as (3,3,3), i.e. a 3D volume covering the 3x3 pixels PLUS all the color channels. As a minor detail, each filter has an additional weight for a BIAS value, as per normal neural network layer arithmetic.
(4) Convolution proceeds absolutely as normal, except a 3x3x3 VOLUME from the input image is convolved at each step with the 3x3x3 filter, and a single (monochrome) output value (i.e. like a pixel) is produced at each step.
(5) The result is that a Keras Conv2D convolution with a specified (3,3) filter on a (32,32,3) image produces a (32,32) result, because the actual filter used is (3,3,3).
(6) In this example, we have also specified 32 filters in the Conv2D layer, so the actual output is (32,32,32) for each input image (i.e. you might think of this as 32 images, one for each filter, each 32x32 monochrome pixels).
As a check, you can look at the count of weights (Param #) for the layer produced by model.summary():
Layer (type)         Output shape          Param #
conv2d_1 (Conv2D)    (None, 32, 32, 32)    896
There are 32 filters, each 3x3x3 (i.e. 27 weights) plus 1 for the bias (i.e. total 28 weights each). And 32 filters x 28 weights each = 896 Parameters.
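The same check can be reproduced in code; a minimal sketch (assuming TF2 Keras; the layer name in the summary may differ):

from tensorflow import keras

layer = keras.layers.Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3))
model = keras.models.Sequential([layer])
print(model.output_shape)    # (None, 32, 32, 32): one 32x32 monochrome map per filter
print(layer.count_params())  # 896 = 32 filters * (3*3*3 weights + 1 bias)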
Each of the (8 x 8) convolutional filters is connected to an (8 x 8) receptive field across all the channels of the image. That is why we have (61, 61, 24) as the output of the first convolutional layer (conv_1). The different channels are encoded implicitly into the weights of the 24 filters. This means that each filter does not have 8 x 8 = 64 weights but instead 8 x 8 x (number of channels) = 8 x 8 x 3 = 192 weights.
See this quote from CS231
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image),
and an example volume of neurons in the first Convolutional layer.
Each neuron in the convolutional layer is connected only to a local
region in the input volume spatially, but to the full depth (i.e. all
color channels). Note, there are multiple neurons (5 in this example)
along the depth, all looking at the same region in the input - see
discussion of depth columns in the text below. Right: The neurons from the
Neural Network chapter remain unchanged: They still compute a dot
product of their weights with the input followed by a non-linearity,
but their connectivity is now restricted to be local spatially.
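As a quick numerical check of the weight count claimed above (192 weights per filter plus a bias), here is a sketch that rebuilds only the first convolutional layer of the model from the question:

from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model

image = Input(shape=(128, 128, 3))
x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_1')(image)
model = Model(inputs=image, outputs=x)
print(model.output_shape)                        # (None, 61, 61, 24): the channel dimension is summed away
print(model.get_layer('conv_1').count_params())  # 4632 = 24 * (8*8*3 + 1)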
My guess is that you're misunderstanding how convolutional layers are defined.
My notation for the shape of the convolutional layer is (out_channels, in_channels, k, k), where k is the size of the kernel. out_channels is the number of filters (i.e. convolutional neurons). Consider the following image:
The 3D convolutional kernel weights in the picture slide across different data windows of A_{i-1} (i.e. the input image). Patches of 3D data of that image, of shape (in_channels, k, k), are paired with individual 3D convolutional kernels of matching dimensionality. How many such 3D kernels are there? As many as the number of output channels, out_channels. The depth dimension that each kernel adopts is the in_channels of A_{i-1}. Therefore, the in_channels dimension of A_{i-1} is contracted away by the depth-wise dot product that builds up the output tensor with out_channels channels. The precise way in which the sliding windows are constructed is defined by the sampling tuple (kernel_size, stride, padding) and results in an output tensor whose spatial dimensions are determined by the formula that you have correctly applied.
If you want to understand more, including backpropagation and implementation take a look at this paper.
The formula you're using is correct. It may be a little confusing because many popular tutorials use a number of filters equal to the number of channels in the image. Conceptually, the TensorFlow/Keras implementation produces its output by computing num_input_channels * num_output_channels intermediate feature maps, one per (input channel, filter) pair. For each output channel, the per-input-channel maps are summed, and the num_output_channels results are stacked to create an output of shape (output_height, output_width, num_output_channels). Hope this clarifies Vlad's detailed answer.
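To make the "sum over input channels" view concrete, here is a small numerical sketch (my own toy shapes, TF2, 'VALID' padding, stride 1) showing that a multi-channel convolution equals the sum of per-input-channel 2D convolutions:

import numpy as np
import tensorflow as tf

x = np.random.rand(1, 10, 10, 3).astype(np.float32)    # one 10x10 image, 3 input channels
w = np.random.rand(3, 3, 3, 4).astype(np.float32)      # kernel: (kh, kw, in_channels=3, out_channels=4)

full = tf.nn.conv2d(x, w, strides=1, padding='VALID')   # the usual multi-channel convolution

# Same thing built channel by channel: convolve each input channel separately, then sum.
per_channel = sum(
    tf.nn.conv2d(x[..., c:c + 1], w[:, :, c:c + 1, :], strides=1, padding='VALID')
    for c in range(3))

print(np.allclose(full.numpy(), per_channel.numpy(), atol=1e-5))   # True
print(full.shape)                                                  # (1, 8, 8, 4)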
I'm following the "Deep MNIST for Experts" tutorial for TensorFlow: https://www.tensorflow.org/tutorials/mnist/pros/
The second convolutional layer has the shape [5, 5, 32, 64]; that is, it has 32 input channels, whereas the first convolutional layer had 1 input channel (that input being, as I understand it, the grayscale values of the original image).
What does it mean that the second convolutional layer has 32 input channels? Does it mean the 64 filters that are learned in the second layer will all be applied (shifted around) to a "virtual" image having 32 points per pixel (this "virtual" image being composed of the original image to which each filter learned in the first step has been applied)? How do you apply a 2D 5x5 filter to an image having 32 points/values per pixel if what I said previously is correct?
The first convolution layer has the following weights:
W_conv1 = weight_variable([5, 5, 1, 32])
Here, 5x5 is the patch size, 1 is the number of input channels, and 32 is the number of output channels. So after the first convolution the output has 32 channels; hence the weight matrix of the second convolution layer has 32 input channels.
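For completeness, the same two layers written with the Keras API rather than the tutorial's low-level TF1 calls (a sketch; the 28x28 grayscale MNIST input shape is assumed):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Conv2D(32, (5, 5), padding='same', input_shape=(28, 28, 1)),  # kernel shape (5, 5, 1, 32)
    keras.layers.MaxPooling2D(2, 2),
    keras.layers.Conv2D(64, (5, 5), padding='same'),                           # kernel shape (5, 5, 32, 64)
])
print(model.layers[0].kernel.shape)   # (5, 5, 1, 32)
print(model.layers[2].kernel.shape)   # (5, 5, 32, 64): each of the 64 filters spans all 32 input channels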