This question also exists as a GitHub issue.
I would like to build a neural network in Keras which contains both 2D convolutions and an LSTM layer.
The network should classify MNIST.
The training data in MNIST are 60000 grey-scale images of handwritten digits from 0 to 9. Each image is 28x28 pixels.
I've split the images into four quadrants (left/right, top/bottom) and rearranged them in four different orders to get sequences for the LSTM.
|       |    | 1 | 2 |
| image | -> |-------| -> 4 sequences: |1|2|3|4|, |4|3|2|1|, |1|3|2|4|, |4|2|3|1|
|       |    | 3 | 4 |
Each small sub-image has dimensions 14 x 14. The four sequences are stacked together along the width (it shouldn't matter whether width or height).
This creates a vector with the shape [60000, 4, 1, 56, 14] where:
60000 is the number of samples
4 is the number of elements in a sequence (# of timesteps)
1 is the depth of colors (greyscale)
56 and 14 are width and height
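For reference, here is a rough numpy sketch of that preprocessing (the helper name and the exact split/stacking conventions are my own assumptions; only the resulting shape matters):
import numpy as np

def to_sequences(images):
    # images: (n, 28, 28) greyscale digits -> (n, 4, 1, 56, 14) sequence tensor
    n = images.shape[0]
    # split each 28x28 image into four 14x14 quadrants
    q1 = images[:, :14, :14]
    q2 = images[:, :14, 14:]
    q3 = images[:, 14:, :14]
    q4 = images[:, 14:, 14:]
    quads = [q1, q2, q3, q4]
    # the four orderings described above, as quadrant indices
    orders = [(0, 1, 2, 3), (3, 2, 1, 0), (0, 2, 1, 3), (3, 1, 2, 0)]
    out = np.zeros((n, 4, 1, 56, 14), dtype=images.dtype)
    for t in range(4):
        # stack the t-th element of all four sequences along one spatial axis
        step = np.concatenate([quads[o[t]] for o in orders], axis=1)  # (n, 56, 14)
        out[:, t, 0] = step
    return out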
Now this should be given to a Keras model.
The problem is to change the input dimensions between the CNN and the LSTM.
I searched online and found this question: Python keras how to change the size of input after convolution layer into lstm layer
The solution seems to be a Reshape layer which flattens the image but retains the timesteps (as opposed to a Flatten layer which would collapse everything but the batch_size).
Here's my code so far:
from keras.models import Sequential
from keras.layers import Convolution2D, Activation, MaxPooling2D, Reshape, Dropout, LSTM, Dense

nb_filters = 32
kernel_size = (3, 3)
pool_size = (2, 2)
nb_classes = 10
batch_size = 64

model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode="valid", input_shape=[1, 56, 14]))
model.add(Activation("relu"))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Reshape((56*14,)))
model.add(Dropout(0.25))
model.add(LSTM(5))
model.add(Dense(50))
model.add(Dense(nb_classes))
model.add(Activation("softmax"))
This code creates an error message:
ValueError: total size of new array must be unchanged
Apparently the input to the Reshape layer is incorrect. As an alternative, I tried to pass the timesteps to the Reshape layer, too:
model.add(Reshape((4,56*14)))
This doesn't feel right and in any case, the error stays the same.
Am I doing this the right way?
Is a Reshape layer the proper tool to connect CNN and LSTM?
There are rather complex approaches to this problem.
Such as this:
https://github.com/fchollet/keras/pull/1456
A TimeDistributed Layer which seems to hide the timestep dimension from following layers.
Or this: https://github.com/anayebi/keras-extra
A set of special layers for combining CNNs and LSTMs.
Why are there such complicated solutions (at least they seem complicated to me) if a simple Reshape does the trick?
UPDATE:
Embarrassingly, I forgot that the dimensions will be changed by the pooling and (for lack of padding) the convolutions, too.
kgrm advised me to use model.summary() to check the dimensions.
The output of the layer before the Reshape layer is (None, 32, 26, 5), so I changed the Reshape to: model.add(Reshape((32*26*5,))).
Now the ValueError is gone, instead the LSTM complains:
Exception: Input 0 is incompatible with layer lstm_5: expected ndim=3, found ndim=2
It seems like I need to pass the timestep dimension through the entire network. How can I do that? If I add it to the input_shape of the convolution, it complains, too:
Convolution2D(nb_filters, kernel_size[0], kernel_size[1], border_mode="valid", input_shape=[4, 1, 56, 14])
Exception: Input 0 is incompatible with layer convolution2d_44: expected ndim=4, found ndim=5
According to the Convolution2D definition, your input must be 4-dimensional with dimensions (samples, channels, rows, cols). This is the direct reason why you are getting an error.
To resolve that, you must use the TimeDistributed wrapper. This allows you to apply static (non-recurrent) layers across time.
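As an illustration, here is a minimal sketch of that idea written against the newer Keras API (Conv2D instead of Convolution2D; the channels-last input shape is an assumption), with every non-recurrent layer wrapped in TimeDistributed so the timestep dimension survives down to the LSTM:
from keras.models import Sequential
from keras.layers import TimeDistributed, Conv2D, MaxPooling2D, Flatten, LSTM, Dense

model = Sequential()
# input: (timesteps, height, width, channels) = (4, 56, 14, 1)
model.add(TimeDistributed(Conv2D(32, (3, 3), activation='relu'),
                          input_shape=(4, 56, 14, 1)))
model.add(TimeDistributed(Conv2D(32, (3, 3), activation='relu')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
# flatten each timestep's feature maps so the LSTM sees (timesteps, features)
model.add(TimeDistributed(Flatten()))
model.add(LSTM(5))
model.add(Dense(50, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()
TimeDistributed applies the wrapped layer to each of the 4 timesteps independently, which is exactly the "hiding the timestep dimension from following layers" behaviour mentioned above.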
If there are 10 features and 1 output class (sigmoid activation) with a regression objective:
If I use only 5 neurons in my first dense hidden layer: will the first error be calculated solely based on half of the training feature set? Isn't it imperative to match the # of features with the neurons in hidden layer #1 so that the model can see all the features at once? Otherwise it's not getting the whole picture? The first fwd propagation iteration would use 5 out of 10 features, and get the error value (and train during backprop, assume batch grad descent). Then the 2nd fwd propagation iteration would see the remaining 5 out of 10 features with updated weights and hopefully arrive at a smaller error. BUT it's only seeing half the features at a time!
Conversely, if I have a convolutional 2D layer of 64 neurons, and my training shape is (100, 28, 28, 1) (pictures of cats and dogs in greyscale), will each of the 64 neurons see a different 28x28 vector? No, right? Because it can only send one example through the forward propagation at a time? So then only a single picture (cat or dog) should be spanned across the 64 neurons? Why would you want that, since each neuron in that layer has the same filter, stride, padding and activation function? When you define a Conv2D layer...the parameters of each neuron are the same. So is only a part of the training example going into each neuron? Why have 64 neurons, for example? Just have one neuron, use a filter on it and pass it along to a second hidden layer with another filter with different parameters!
Please explain the flaws in my logic. Thanks so much.
EDIT: I just realized that for Conv2D, you flatten the training data sets so it becomes a 1D vector, and so a 28x28 image would mean having an input conv2d layer of 784 neurons. But I am still confused about the dense neural network (paragraph #1 above).
What is your "first" layer?
Normally you have an input layer as the first layer, which does not contain any weights.
The shape of the input layer must match the shape of your feature data.
So basically, when you train a model with 10 features but only have an input layer of shape (None, 5) (where None stands for the batch_size), TensorFlow will raise an exception, because it needs data for all inputs in the correct shape.
So what you described is just not going to happen. If you only have 5 input neurons, the remaining 5 features won't be fed into the net in the next iteration; instead, the next sample will be sent to the model (let's say no exception is thrown), and of that next sample, again only the first 5 features would be used.
What you can do instead is use an input layer as the first layer with the correct shape for your features. Then, as the second layer, you can use any size you like: 1, 10, or 100 dense neurons, it's up to you (and what works well, of course). The shape of the output layer, in turn, must match the shape of your label data.
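For instance, a minimal sketch for the 10-feature, 1-output case from the question (the hidden layer size of 5 is just a placeholder choice):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# the input shape must match the number of features (10 here);
# the hidden layer size (5) is a free design choice
model.add(Dense(5, activation='relu', input_shape=(10,)))
# the output layer must match the shape of the labels (1 value per sample)
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='mse')
model.summary()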
I hope this makes it more clear
I am trying to understand a concept of ANN architecture in Keras. The number of input neurons in any NN should be equal to the number of features/attributes/columns. So, in the case of having a matrix of shape (20000, 100), my input shape should have 100 neurons. In an example on the Keras page, I saw this code:
model = Sequential([Dense(32, input_shape=(784,))])
which pretty much means that the input shape has 784 columns and 32 is the dimensionality of the output space, which in turn means that the second layer will have an input of 32. My understanding is that such a significant drop happens because some of the units are not activated due to an activation function. Is my understanding correct?
At the same time, another piece of code shows that the number of input neurons is higher than the number of features:
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
This example is not clear to me. How can it be that the number of units is larger than the number of input dimensions?
The total number of neurons in a Dense layer is a topic that is still not agreed upon within the machine learning and data science community. There are many heuristics that are used to define this and I refer you to this post on Cross Validated that provides some more details: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw.
In summary, the number of hidden units in both examples you have shown most likely originated from repeated experimentation and trial and error to achieve the best accuracy.
However, for more context, the answer to this, as I mentioned, is found through experimentation. The 784 input neurons most likely come from the MNIST dataset, whose images are 28 x 28 = 784 pixels. I've seen implementations of neural networks where 32 neurons for the hidden layer is good. Think of each layer as a dimensionality transformation. Even if you go down to 32 dimensions, that doesn't necessarily mean that it will lose accuracy. Also, going from a lower-dimensional space to a higher-dimensional space is common if you are trying to map your points to a new space that may be easier to classify in.
Finally, in Keras, that number specifies how many neurons are for the current layer. Under the hood, it figures out the weight matrix to satisfy the forward propagation going from the previous layer to the current layer. It would be 785 x 32 in that case with 1 extra neuron for the bias unit.
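You can check that count directly; a minimal sketch:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(32, input_shape=(784,))])
# the layer holds a 784 x 32 weight matrix plus 32 biases:
# 784 * 32 + 32 = 785 * 32 = 25,120 parameters
model.summary()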
Neural networks are basically matrix multiplications; the drop you are talking about in the first part is not due to an activation function, it happens simply because of the nature of matrix multiplication:
The calculation here is: input * weights = output
so -> [BATCHSIZE, 784] * [784, 32] = [BATCHSIZE, 32] -> output dimension
With the same logic we can easily explain how the input shape can be much smaller than the number of units; the calculation becomes:
-> [BATCHSIZE, 20] * [20, 64] = [BATCHSIZE, 64] -> output dimension
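The same shape arithmetic can be checked with a quick numpy sketch (the array contents are random; only the shapes matter):
import numpy as np

x = np.random.rand(8, 20)    # batch of 8 samples, 20 features each
w = np.random.rand(20, 64)   # weight matrix of a Dense(64) layer
b = np.random.rand(64)       # bias vector

output = x @ w + b
print(output.shape)          # (8, 64)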
Hope that helped you!
To learn more :
https://en.wikipedia.org/wiki/Matrix_multiplication
I do not understand why the channel dimension is not included in the output dimension of a conv2D layer in Keras.
I have the following model
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

def create_model():
    image = Input(shape=(128, 128, 3))
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_1')(image)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_2')(x)
    x = Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu', name='conv_3')(x)
    flatten = Flatten(name='flatten')(x)
    output = Dense(1, activation='relu', name='output')(flatten)
    model = Model(input=image, output=output)
    return model
model = create_model()
model.summary()
The model summary is given in the figure at the end of my question. The input layer takes RGB images with width = 128 and height = 128. The first Conv2D layer tells me the output dimension is (None, 61, 61, 24). I have used a kernel size of (8, 8), a stride of (2, 2) and no padding. The values 61 = floor((128 - 8 + 2 * 0)/2 + 1) and 24 (the number of kernels/filters) make sense. But why isn't the dimension for the different channels included in the output dimension? As far as I can see, the parameters for the 24 filters on each of the channels are included in the number of parameters. So I would expect the output dimension to be (None, 61, 61, 24, 3) or (None, 61, 61, 24 * 3). Is this just a strange notation in Keras or am I confused about something else?
This question is asked in various forms all over the internet and has a simple answer which is often missed or confused:
SIMPLE ANSWER:
The Keras Conv2D layer, given a multi-channel input (e.g. a color image), will apply the filter across ALL the color channels and sum the results, producing the equivalent of a monochrome convolved output image.
An example, based on a CIFAR-10 CNN:
(1) You're training with the CIFAR image dataset, which is made up of 32x32 color images, i.e. each image is shape (32,32,3) (RGB = 3 channels)
(2) Your first layer of your network is a Conv2D Layer with 32 filters, each specified as 3x3, so:
Conv2D(32, (3,3), padding='same', input_shape=(32,32,3))
(3) Counter-intuitively, Keras will configure each filter as (3,3,3), i.e. a 3D volume covering the 3x3 pixels PLUS all the color channels. As a minor detail each filter has an additional weight for a BIAS value, as per normal neural network layer arithmetic.
(4) Convolution proceeds absolutely as normal, except a 3x3x3 VOLUME from the input image is convolved at each step with the 3x3x3 filter, and a single (monochrome) output value (i.e. like a pixel) is produced at each step.
(5) The result is that a Keras Conv2D convolution with a specified (3,3) filter on a (32,32,3) image produces a (32,32) result, because the actual filter used is (3,3,3).
(6) In this example, we have also specified 32 filters in the Conv2D layer, so the actual output is (32,32,32) for each input image (i.e. you might think of this as 32 images, one for each filter, each 32x32 monochrome pixels).
As a check, you can look at the count of weights (Param #) for the layer produced by model.summary():
Layer (type)         Output Shape          Param #
conv2d_1 (Conv2D)    (None, 32, 32, 32)    896
There are 32 filters, each 3x3x3 (i.e. 27 weights) plus 1 for the bias (i.e. total 28 weights each). And 32 filters x 28 weights each = 896 Parameters.
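A minimal sketch to reproduce that check (assuming a channels-last CIFAR-10 input):
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3)))
# each filter: 3 * 3 * 3 = 27 weights + 1 bias = 28 parameters
# 32 filters * 28 = 896 parameters, output shape (None, 32, 32, 32)
model.summary()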
Each of the convolutional filters (8 x 8) is connected to an (8 x 8) receptive field across all the channels of the image. That is why we have (61, 61, 24) as the output of the first Conv2D layer. The different channels are encoded implicitly into the weights of the 24 filters. This means that each filter does not have 8 x 8 = 64 weights but instead 8 x 8 x number of channels = 8 x 8 x 3 = 192 weights.
See this quote from CS231n:
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in the text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.
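As a sanity check for the model in the question, a minimal sketch that reproduces both the output shape and the parameter count of the first Conv2D layer:
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(24, kernel_size=(8, 8), strides=(2, 2), activation='relu',
                 input_shape=(128, 128, 3)))
# output shape: (None, 61, 61, 24), since floor((128 - 8) / 2) + 1 = 61
# parameters: 24 filters * (8 * 8 * 3 weights + 1 bias) = 24 * 193 = 4632
model.summary()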
My guess is that you're misunderstanding how convolutional layers are defined.
My notation for the shape of the convolutional layer is (out_channels, in_channels, k, k), where k is the size of the kernel. The out_channels is the number of filters (i.e. convolutional neurons). Consider the following image:
The 3D convolutional kernel weights in the picture slide across different data windows of A_{i-1} (i.e. the input image). Patches of 3D data of that image, of shape (in_channels, k, k), are paired with individual 3D convolutional kernels of matching dimensionality. How many such 3D kernels are there? As many as the number of output channels, out_channels. The depth dimension that each kernel adopts is the in_channels of A_{i-1}. Therefore, the in_channels dimension of A_{i-1} is contracted away by the depth-wise dot product that builds up the output tensor with out_channels channels. The precise way in which the sliding windows are constructed is defined by the sampling tuple (kernel_size, stride, padding) and results in an output tensor with spatial dimensions determined by the formula that you've correctly applied.
If you want to understand more, including backpropagation and implementation details, take a look at this paper.
The formula you're using is correct. It may be a little confusing because many popular tutorials use a number of filters equal to the number of channels in the image. Conceptually, the TensorFlow/Keras implementation produces its output by computing num_input_channels * num_output_channels intermediate feature maps, one per (input channel, output channel) pair. For each output channel, the feature maps coming from all input channels are then summed together (plus a bias), which yields an output of shape (output_height, output_width, num_output_channels). Hope this clarifies Vlad's detailed answer.
I'm building a 1D model with TensorFlow for audio but I have a problem with the input shape during the second MaxPool1D in the model.
The problem is here, after this Pooling:
x = Convolution1D(32, 3, activation=relu, padding='valid')(x)
x = MaxPool1D(4)(x)
I get this error:
ValueError: Negative dimension size caused by subtracting 4 from 1 for 'max_pooling1d_5/MaxPool' (op: 'MaxPool') with input shapes: [?,1,1,32].
I tried to reshape x (which is a tensor) but I don't think I'm going about it the right way.
In this same model, before that, I have a couple of convolutional layers and a max-pooling layer that are working properly.
Anyone have suggestions?
Thanks
The number of steps in the input to the MaxPool1D layer is smaller than the pool size.
In the error, it says ...input shapes: [?,1,1,32], which means the output from the Convolution1D layer has shape [1, 32]. It needs at least 4 steps to be used as input to the MaxPool1D(4) layer, i.e. a minimum shape of [4, 32].
You can continue walking this back. For example, the Convolution1D layer will decrease the step size by kernel_size-1=2. This means the input to the Convolution1D layer needs to have at least 4+2=6 steps, meaning a shape of at least [6,?]. Continuing up to the input layer, you'll find the input size is too small.
You'll need to change the architecture to allow the input size, or, if applicable, change the input size.
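To see where the steps run out, print model.summary() after each layer is added; a minimal sketch of the kind of stack being discussed (the input length of 16 and the filter counts are placeholder assumptions):
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D

model = Sequential()
# 16 steps -> Conv1D(kernel_size=3): 14 steps -> MaxPooling1D(4): 3 steps
# -> Conv1D(kernel_size=3): 1 step, so a second MaxPooling1D(4) must fail
model.add(Conv1D(32, 3, activation='relu', input_shape=(16, 1)))
model.add(MaxPooling1D(4))
model.add(Conv1D(32, 3, activation='relu'))
# model.add(MaxPooling1D(4))  # would raise: only 1 step left, pool size is 4
model.summary()
Making the input longer, or using a smaller pool size or padding='same', removes the negative dimension.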
I'm following this tutorial on using Keras to train a basic conv-net. I find a couple of things confusing though, and the Keras documentation doesn't go into much detail either.
Let's look at the first few layers of the network:
model = Sequential()
model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1,28,28)))
model.add(Convolution2D(32, 3, 3, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
My Questions:
The tutorial describes the first layer as the "input layer". However, the first line includes a Convolution2D function, with a input_shape. Am I correct in assuming that this is actually the first hidden layer (a convolution layer), rather than just the input layer? Reason being that we don't need a separate model.add() statement just for the input?
In the Convolution2D() function, we're using 32 filters, each filter being 3x3 pixels. In my understanding, a filter is a small block of pixels which "scans" across the image. So for a 28x28 image, wouldn't we need 676 filters (26*26, since each filter is 3x3)? What does the 32 here mean?
The last line is a Dropout layer. From my understanding, Dropout is a regularization technique, and it's applied to the whole network. So does the Dropout(0.25) here apply a 25% dropout only to the previous layer? Or does it apply to all layers preceding it?
Thanks.
Whenever you call model.fit(), you are passing the 28 x 28 images. So that is the input to the model.
On top of that input, we then are doing convolution to generate feature maps.
Convolution here boils down to repeated dot products between the 3x3 filter and image patches. So the output of a single filter in the first layer is a 26 x 26 matrix, and we have 32 such matrices, one per filter. That is what the 32 means. Check this for further explanation: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
The Dropout(0.25) applies dropout with probability 0.25 to the output of the layer immediately preceding it, not to all the layers before it.
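A minimal sketch of those first layers in the newer Keras API (Conv2D(32, (3, 3)) instead of Convolution2D(32, 3, 3); the channels-last input shape is an assumption), so the shapes can be confirmed with model.summary():
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout

model = Sequential()
# first hidden layer: 32 filters of size 3x3 -> output (None, 26, 26, 32)
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# second convolution -> (None, 24, 24, 32)
model.add(Conv2D(32, (3, 3), activation='relu'))
# pooling halves the spatial size -> (None, 12, 12, 32)
model.add(MaxPooling2D(pool_size=(2, 2)))
# dropout randomly zeroes 25% of the previous layer's outputs during training
model.add(Dropout(0.25))
model.summary()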