Trying to implement a paper and running into some brick-walls due to some dimensionality problems. My input is mono audio data where 128 frames of 50ms of 16kHz sampled audio is fed into the network. So my input shape is:
[128,0.005*16000, 1]
Here's the layer details -
1.) conv-bank block : Conv1d-bank-8, LeReLU, IN (instance normalization)
I achieve this using :
bank_width = 8
conv_bank_outputs = tf.concat([ tf.layers.conv1d(input,1,k,activation=tf.nn.leaky_relu,padding="same") for k in range(1, bank_width + 1)], axes = -1)
2.) conv-block: C-512-5, LReLu --> C-512-5,stride=2, LReLu, IN, RES (Residual)
This is where I get stuck, the shapes of the output of second convolution and input to the (2) layer is mismatched. I can't get my head around it.
I achieve this using:
block_1 = tf.layers.conv1d(input,filters=512,kernel_size=5,activation=tf.nn.leaky_relu,padding="same")
block_2 = tf.layers.conv1d(block_1,filters=512,kernel_size=5,strides=2,activation=tf.nn.leaky_relu,padding="same")
IN = tf.contrib.layers.instance_norm(block_2)
RES = IN + input
Error: ValueError: Dimensions must be equal, but are 400 and 800 for 'add' (op: 'Add') with input shapes: [128,400,512], [128,800,1024].
When you run conv1d on block1 with stride = 2 , input data is halved as conv1d effectively samples only alternate numbers and also you have changed number of channels. This is usually worked around by downsampling input by 1x1 conv with stride 2 and filters 512, though I can be more specific if you can share the paper.
Related
I want to design, multi channel CNN.
I got a error message in first Conv2d step. (in figure, first layer to second layer)
My code is as bellows
_concat_embeded = keras.layers.concatenate([_embeding1, _embeding2], axis= -1)
_biCH_embeded = keras.layers.Reshape((2, self.lexicalMaxLength, charWeights.shape[1]))(_concat_embeded)
_1stConv = keras.layers.Conv2D(filters=512, kernel_size=(5, charWeights.shape[1]),
activation=tf.nn.relu)(_biCH_embeded)
Shape at _biCH_embeded is [? 2, 131 ,131] (my embeddings have 131 dimension = charWeights.shape[1])
I want to generate 512 filters, which has (5, 131) shape.
Then, I've got a message, "Negative dimension size caused by subtracting 5 from 2 for 'conv2d_1/convolution' (op: 'Conv2D') with input shapes: [?,2,33,131], [5,131,131,512]"
Where is problem?
I find the issue.
I reshaped my tensor with "channel_first" rule (2, 133, 133)
But my Keras config is set by "channel_last"
I change the reshape rule to "channel_last" (133,133,2)and training is running now.
(If you want change the Keras config, look at "~/.keras/keras.json")
I do not understand why the channel dimension is not included in the output dimension of a conv2D layer in Keras.
I have the following model
def create_model():
image = Input(shape=(128,128,3))
x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_1')(image)
x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_2')(x)
x = Conv2D(24, kernel_size=(8,8), strides=(2,2), activation='relu', name='conv_3')(x)
flatten = Flatten(name='flatten')(x)
output = Dense(1, activation='relu', name='output')(flatten)
model = Model(input=image, output=output)
return model
model = create_model()
model.summary()
The model summary is given the figure at the end of my question. The input layer takes RGB images with width = 128 and height = 128. The first conv2D layer tells me the output dimension is (None, 61, 61, 24). I have used the kernel size of (8, 8), a stride of (2, 2) no padding. The values 61 = floor( (128 - 8 + 2 * 0)/2 + 1) and 24 (number of kernels/filters) makes sense. But why isn't the dimension for the different channels included in the dimension? As far as I can see the parameters for the 24 filters on each of the channels is included in the number of the parameters. So I would expect the output dimension to be (None, 61, 61, 24, 3) or (None, 61, 61, 24 * 3). Is this just a strange notation in Keras or am I confused about something else?
This question is asked in various forms all over the internet and has a simple answer which is often missed or confused:
SIMPLE ANSWER:
The Keras Conv2D layer, given a multi-channel input (e.g. a color image), will apply the filter across ALL the color channels and sum the results, producing the equivalent of a monochrome convolved output image.
An example, from a CIFAR-10 CNN example:
(1) You're training with the CIFAR image dataset, which is made up of 32x32 color images, i.e. each image is shape (32,32,3) (RGB = 3 channels)
(2) Your first layer of your network is a Conv2D Layer with 32 filters, each specified as 3x3, so:
Conv2D(32, (3,3), padding='same', input_shape=(32,32,3))
(3) Counter-intuitively, Keras will configure each filter as (3,3,3), i.e. a 3D volume covering the 3x3 pixels PLUS all the color channels. As a minor detail each filter has an additional weight for a BIAS value, as per normal neural network layer arithmetic.
(4) Convolution proceeds absolutely as normal, except a 3x3x3 VOLUME from the input image is convolved at each step with the 3x3x3 filter, and a single (monochrome) output value (i.e. like a pixel) is produced at each step.
(5) The result is a Keras Conv2D convolution of a specified (3,3) filter on a (32,32,3) image produces a (32,32) result because the actual filter used is (3,3,3).
(6) In this example, we have also specified 32 filters in the Conv2D layer, so the actual output is (32,32,32) for each input image (i.e. you might think of this as 32 images, one for each filter, each 32x32 monochrome pixels).
As a check, you can look at the count of weights (Param #) for the layer produced by model.summary():
Layer (type) Output shape Param#
conv2d_1 (Conv2D) (None, 32, 32, 32) 896
There are 32 filters, each 3x3x3 (i.e. 27 weights) plus 1 for the bias (i.e. total 28 weights each). And 32 filters x 28 weights each = 896 Parameters.
Each of the convolutional filters (8 x 8) is connected to a (8 x 8) receptive field for all the channels of the image. That is why we have (61, 61, 24) as the output of the second layer. The different channels are encoded implicitly into the weights of the 24 filters. This means, that each filter does not have 8 x 8 = 64 weights but instead 8 x 8 x Number of channels = 8 x 8 x 3 = 192 weights.
See this quote from CS231
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image),
and an example volume of neurons in the first Convolutional layer.
Each neuron in the convolutional layer is connected only to a local
region in the input volume spatially, but to the full depth (i.e. all
color channels). Note, there are multiple neurons (5 in this example)
along the depth, all looking at the same region in the input - see
discussion of depth columns in the text below. Right: The neurons from the
Neural Network chapter remains unchanged: They still compute a dot
product of their weights with the input followed by a non-linearity,
but their connectivity is now restricted to be local spatially.
My guess is that you're misunderstanding how convolutional layers defined.
My notation for the shape of the convolutional layer is (out_channels, in_channels, k, k) where k is a the size of the kernel. The out_channels is the number of filters (i.e. convolutional neurons). Consider following image:
The 3d convolutional kernel weights in the picture slide across different data windows of A_{i-1}(i.e. input image). Patches of 3D data of that image of shape (in_channels, k, k) are paired with individual 3d convolutional kernels of matching dimensionality. How many such 3d kernels are there? As the number of output channels out_channels. The depth dimension that kernel adopts is the in_channels of A_{i-1}. Therefore, the dimension in_channels of A_{i-1} is contracted away by the depth-wise dot product that builds up the output tensor with out_channels channels. The precise way in which the sliding windows are constructed is defined by the sampling tuple (kernel_size, stride, padding) and results in output tensor with spatial dimensions determined by the formula that you're correctly applied.
If you want to understand more, including backpropagation and implementation take a look at this paper.
The formula you're using is correct. It may be little confusing because many popular tutorial use number of filters equal to number of channels in the image. TensorFlow/Keras implementation produces its output by computing num_input_channels * num_output_channels intermediate feature maps of size (kernel_size[0], kernel_size[1]). So for each input channel it produces num_output_channels feature maps which then get multiplied and concatenated together to create output shape of (kernel_size[0], kernel_size[1], num_output_channels) Hope this clarifies Vlad's detailed answer
I'm building a 1D model with TensorFlow for audio but I have a problem with the input shape during the second MaxPool1D in the model.
The problem is here, after this Pooling:
x = Convolution1D(32, 3, activation=relu, padding='valid')(x)
x = MaxPool1D(4)(x)
I get this error:
ValueError: Negative dimension size caused by subtracting 4 from 1 for 'max_pooling1d_5/MaxPool' (op: 'MaxPool') with input shapes: [?,1,1,32].
I tried to reshape x (which is a tensor) but I don't think I'm going in the right way.
In this same model, before that, I have a couple convolutional layers and a maxpooling that are working proporly.
Anyone have suggestions?
Thanks
The number of steps in the input to the MaxPool1D layer is smaller than the pool size.
In the error, it says ...input shapes: [?,1,1,32], which means the output from the Convolution1D layer has shape [1,32]. It needs to be at least 4 steps to be used as input to the MaxPool1D(4) layer, so have a minimum size of [4,32].
You can continue walking this back. For example, the Convolution1D layer will decrease the step size by kernel_size-1=2. This means the input to the Convolution1D layer needs to have at least 4+2=6 steps, meaning a shape of at least [6,?]. Continuing up to the input layer, you'll find the input size is too small.
You'll need to change the architecture to allow the input size, or, if applicable, change the input size.
I have the following idea to implement:
Input -> CNN-> LSTM -> Dense -> Output
The Input has 100 time steps, each step has a 64-dimensional feature vector
A Conv1D layer will extract features at each time step. The CNN layer contains 64 filters, each has length 16 taps. Then, a maxpooling layer will extract the single maximum value of each convolutional output, so a total of 64 features will be extracted at each time step.
Then, the output of the CNN layer will be fed into an LSTM layer with 64 neurons. Number of recurrence is the same as time step of input, which is 100 time steps. The LSTM layer should return a sequence of 64-dimensional output (the length of sequence == number of time steps == 100, so there should be 100*64=6400 numbers).
input = Input(shape=(100,64), dtype='float', name='mfcc_input')
CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)
CNN_out = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(CNN_out)
CNN_out = TimeDistributed(MaxPooling1D(pool_size=(64-16+1), strides=None, padding='valid'))(CNN_out)
LSTM_out = LSTM(64,return_sequences=True)(CNN_out)
... (more code) ...
But this doesn't work. The second line reports "list index out of range" and I don't understand what's going on.
I'm new to Keras, so I appreciate sincerely if anyone could help me with it.
This picture explains how CNN should be applied to EACH TIME STEP
The problem is with your input. Your input is of shape (100, 64) in which the first dimension is the timesteps. So ignoring that, your input is of shape (64) to a Conv1D.
Now, refer to the Keras Conv1D documentation, which states that the input should be a 3D tensor (batch_size, steps, input_dim). Ignoring the batch_size, your input should be a 2D tensor (steps, input_dim).
So, you are providing 1D tensor input, where the expected size of the input is a 2D tensor. For example, if you are providing Natural Language input to the Conv1D in form of words, then there are 64 words in your sentence and supposing each word is encoded with a vector of length 50, your input should be (64, 50).
Also, make sure that you are feeding the right input to LSTM as given in the code below.
So, the correct code should be
embedding_size = 50 # Set this accordingingly
mfcc_input = Input(shape=(100, 64, embedding_size), dtype='float', name='mfcc_input')
CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)
CNN_out = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(CNN_out)
CNN_out = TimeDistributed(MaxPooling1D(pool_size=(64-16+1), strides=None, padding='valid'))(CNN_out)
# Directly feeding CNN_out to LSTM will also raise Error, since the 3rd dimension is 1, you need to purge it as
CNN_out = Reshape((int(CNN_out.shape[1]), int(CNN_out.shape[3])))(CNN_out)
LSTM_out = LSTM(64,return_sequences=True)(CNN_out)
... (more code) ...
The Keras layer documentation specifies the input and output sizes for convolutional layers:
https://keras.io/layers/convolutional/
Input shape: (samples, channels, rows, cols)
Output shape: (samples, filters, new_rows, new_cols)
And the kernel size is a spatial parameter, i.e. detemines only width and height.
So an input with c channels will yield an output with filters channels regardless of the value of c. It must therefore apply 2D convolution with a spatial height x width filter and then aggregate the results somehow for each learned filter.
What is this aggregation operator? is it a summation across channels? can I control it? I couldn't find any information on the Keras documentation.
Note that in TensorFlow the filters are specified in the depth channel as well:
https://www.tensorflow.org/api_guides/python/nn#Convolution,
So the depth operation is clear.
Thanks.
It might be confusing that it is called Conv2D layer (it was to me, which is why I came looking for this answer), because as Nilesh Birari commented:
I guess you are missing it's 3D kernel [width, height, depth]. So the result is summation across channels.
Perhaps the 2D stems from the fact that the kernel only slides along two dimensions, the third dimension is fixed and determined by the number of input channels (the input depth).
For a more elaborate explanation, read https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
I plucked an illustrative image from there:
I was also wondering this, and found another answer here, where it is stated (emphasis mine):
Maybe the most tangible example of a multi-channel input is when you have a color image which has 3 RGB channels. Let's get it to a convolution layer with 3 input channels and 1 output channel. (...) What it does is that it calculates the convolution of each filter with its corresponding input channel (...). The stride of all channels are the same, so they output matrices with the same size. Now, it sums up all matrices and output a single matrix which is the only channel at the output of the convolution layer.
Illustration:
Notice that the weights of the convolution kernels for each channel are different, which are then iteratively adjusted in the back-propagation steps by e.g. gradient decent based algorithms such as stochastic gradient descent (SDG).
Here is a more technical answer from TensorFlow API.
I also needed to convince myself so I ran a simple example with a 3×3 RGB image.
# red # green # blue
1 1 1 100 100 100 10000 10000 10000
1 1 1 100 100 100 10000 10000 10000
1 1 1 100 100 100 10000 10000 10000
The filter is initialised to ones:
1 1
1 1
I have also set the convolution to have these properties:
no padding
strides = 1
relu activation function
bias initialised to 0
We would expect the (aggregated) output to be:
40404 40404
40404 40404
Also, from the picture above, the no. of parameters is
3 separate filters (one for each channel) × 4 weights + 1 (bias, not shown) = 13 parameters
Here's the code.
Import modules:
import numpy as np
from keras.layers import Input, Conv2D
from keras.models import Model
Create the red, green and blue channels:
red = np.array([1]*9).reshape((3,3))
green = np.array([100]*9).reshape((3,3))
blue = np.array([10000]*9).reshape((3,3))
Stack the channels to form an RGB image:
img = np.stack([red, green, blue], axis=-1)
img = np.expand_dims(img, axis=0)
Create a model that just does a Conv2D convolution:
inputs = Input((3,3,3))
conv = Conv2D(filters=1,
strides=1,
padding='valid',
activation='relu',
kernel_size=2,
kernel_initializer='ones',
bias_initializer='zeros', )(inputs)
model = Model(inputs,conv)
Input the image in the model:
model.predict(img)
# array([[[[40404.],
# [40404.]],
# [[40404.],
# [40404.]]]], dtype=float32)
Run a summary to get the number of params:
model.summary()