I am a little confused about how an LSTM handles its input.
As we all know, the input of an LSTM model in Keras has the shape (batch_size, timesteps, input_dim).
My data is a time series, where each sequence of n time steps is fed in to predict the value at time step n+1. How does the LSTM access the input? Does it process the time steps of the sequence one at a time, or does it have access to all of them at once?
I checked the number of parameters of each LSTM layer: it is 4*d*(n+d), where n is the input dimension and d is the number of memory cells.
In my case d=10 and the number of parameters is 440 (without bias), which means n=1 here, so it seems the input has dimension 1*1.
Then it would seem they have access to all of the time steps simultaneously.
Does anyone have some ideas about this?
First, think of a convolutional layer (it's easier).
It has parameters that depend only on the "filter size", "input channels" and "number of filters". But never on the "size of the image".
That happens because it's somewhat a "walking operation". The same group of filters is applied throughout the image. The total operations increase with the size of the image, but the parameters, which only define the filters, are independent from the image size. (Imagine a filter to detect a circle, this filter doesn't need to change to detect circles in different parts of the image, although it's applied for each step in the entire image).
So:
Parameters: number of filters * (size of filters)^2 * input channels
Calculation steps: size of image (considering strides, padding, etc.)
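Here is a rough pure-Python sketch of that "walking operation", using a 1D signal instead of an image to keep it short. The function name `slide_filter` and the example kernel are just illustrative assumptions, not part of any library.

```python
def slide_filter(signal, kernel):
    """Apply the same kernel at every valid position of a 1D signal."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1.0, 0.0, -1.0]  # 3 parameters, whatever the input length is

short_out = slide_filter([1, 2, 3, 4, 5], kernel)
long_out = slide_filter(list(range(1, 11)), kernel)

# Same number of parameters, different number of calculation steps.
print(len(kernel), len(short_out), len(long_out))
```

The kernel keeps its 3 values in both calls; only the number of positions it visits changes with the input length.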
With LSTM layers, a similar thing happens. The parameters are related to what they call "gates". (Take a look here)
There is a "state", and "gates" that are applied in each time iteration to determine how the state will change.
The gates are not time dependent, though. The calculations are time iterations indeed, but every iteration uses the same group of gates.
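To make the "same gates at every iteration" point concrete, here is a minimal sketch of an LSTM with a single cell and a scalar input, written in plain Python. The weight names (`wi`, `ui`, ...) and their values are arbitrary assumptions for illustration; Keras stores its weights differently, but the idea of one shared set of gate weights is the same.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One time step of an LSTM with a single cell and a scalar input."""
    i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate state
    c = f * c + i * g          # update the cell state
    h = o * math.tanh(c)       # new hidden state / output
    return h, c

def run_lstm(sequence, p):
    h, c = 0.0, 0.0
    for x in sequence:         # time steps are processed one after another
        h, c = lstm_step(x, h, c, p)
    return h

# Arbitrary illustrative weights: 12 numbers, shared by every time step.
params = {name: 0.5 for name in
          ("wi", "ui", "bi", "wf", "uf", "bf",
           "wo", "uo", "bo", "wg", "ug", "bg")}

print(run_lstm([0.1, 0.2, 0.3], params))
print(run_lstm([0.1] * 50, params))   # longer sequence, same 12 parameters
```

The sequence length only changes how many times the loop runs, never the number of weights.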
Comparing to the convolutional layers:
Parameters: number of cells, data dimension
Calculation steps: time steps
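The parameter count from the question can be checked with a few lines of Python. This is a sketch of the usual formula (4 gates, each with an input kernel, a recurrent kernel and optionally a bias); `lstm_param_count` is a hypothetical helper name, not a Keras function.

```python
def lstm_param_count(units, input_dim, use_bias=True):
    """Trainable parameters of one LSTM layer.

    Each of the 4 gates has an input kernel (input_dim x units),
    a recurrent kernel (units x units) and, optionally, a bias (units).
    """
    per_gate = units * (input_dim + units) + (units if use_bias else 0)
    return 4 * per_gate

print(lstm_param_count(10, 1, use_bias=False))  # 440, as in the question
print(lstm_param_count(10, 1))                  # 480 with biases
```

Note that the number of time steps does not appear anywhere in the formula.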
I wanted to understand the architectural intuition behind the difference between:
tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1))
and
tf.keras.layers.Conv2D(32, (7,7), activation='relu', input_shape=(28, 28, 1))
Assuming,
As kernel size increases, more complex feature-pattern matching can be performed in the convolution step.
As feature size increases, a larger variance of smaller features can define a particular layer.
How and when (if possible kindly give scenarios) do we justify the tradeoff at an abstract level?
This can be answered from 3 different views.
Parameters:
Since you are comparing two Conv2D layers with different sizes, it's important to look at the number of trainable parameters each one needs, N*(K*K*D)+N (N filters of size K x K over D input channels, plus N biases), since this determines how complex your model is and how easy or difficult it is to train.
Here, the number of trainable parameters is 2.5 times higher when using the second configuration for conv2d:
first conv2d layer: 64*(3*3*1)+64 = 640
second conv2d layer: 32*(7*7*1)+32 = 1600
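The two counts above can be reproduced with a small helper. This is only a sketch of the formula; `conv2d_param_count` is a made-up name, and it assumes square kernels.

```python
def conv2d_param_count(filters, kernel_size, in_channels, use_bias=True):
    """Weights of `filters` square kernels plus one bias per filter."""
    weights = filters * kernel_size * kernel_size * in_channels
    return weights + (filters if use_bias else 0)

first = conv2d_param_count(64, 3, 1)   # 64 filters of 3x3 on 1 channel
second = conv2d_param_count(32, 7, 1)  # 32 filters of 7x7 on 1 channel

print(first, second, second / first)   # 640 1600 2.5
```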
Input:
Another way of deciding what filter size to use, and why, is to analyze the input data in the first place. Since the goal of the first conv2d layer (over the input) is to capture the most basic patterns in the image, ask yourself whether the MOST basic patterns in the image really need a larger filter to be learned.
If you think that a large amount of pixels is necessary for the network to recognize the object you will use large filters (as 11x11 or 9x9). If you think what differentiates objects are some small and local features you should use small filters (3x3 or 5x5)
Usually, a better practice is to stack conv2d layers to capture bigger patterns in the image since they are made of a combination of smaller patterns that are more easily captured by smaller filters.
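As a quick back-of-the-envelope check on stacking: three stride-1 3x3 layers see the same 7x7 window as a single 7x7 layer, with fewer weights. The helper names and the choice of 32 channels below are assumptions made only for this illustration.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field of a stack of stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def stack_weight_count(num_layers, kernel_size, channels):
    """Weights (no biases) when every layer maps `channels` -> `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels

channels = 32  # hypothetical channel count

print(receptive_field(3, 3), stack_weight_count(3, 3, channels))  # 7 27648
print(receptive_field(1, 7), stack_weight_count(1, 7, channels))  # 7 50176
```

The stacked version also gets a non-linearity between layers, which a single large filter does not.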
End goal:
Usually the goal of a conv network is to compress the image's height and width into a large number of channels which here are made of filters.
This process of down sampling image into its representative features allows us to finally add a few dense layers at the end to do our classification tasks.
With the default settings (stride 1, no padding) on a 28x28 input, the first conv2d shrinks the image only a little (to 26x26) and generates a large number of channels (64), while the second conv2d shrinks it more (to 22x22, because the larger filter fits in fewer positions) and has fewer filters (32).
But the act of downsampling to a smaller image with fewer channels (filters) immediately causes a loss of information. It is therefore recommended to do it gradually, to retain as much information as possible from the original image.
Then it can be stacked with other conv2d to get to a near vector representation of the image before classification.
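The output sizes of the two layers in the question can be worked out with the standard formula. This sketch assumes the Keras defaults of stride 1 and no padding; `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

size_a = conv_output_size(28, 3)   # 3x3 kernel on a 28x28 image
size_b = conv_output_size(28, 7)   # 7x7 kernel on a 28x28 image

# (height, width, channels) and total number of activations
print((size_a, size_a, 64), size_a * size_a * 64)  # (26, 26, 64) 43264
print((size_b, size_b, 32), size_b * size_b * 32)  # (22, 22, 32) 15488
```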
Summary:
The second conv2d will be able to capture larger more complex patterns at once as compared to the first conv2d at that step.
The second conv2d will have a higher loss of information from the original image as it would skip features that are from much smaller and simpler patterns. The first conv2d will be able to capture more basic patterns in the image and use the combinations of those (in stacked Conv layers) to build a more robust set of features for your end task.
Second conv2d needs a higher number of parameters to learn the structure of the image as compared to the first conv2d.
In practice, it is recommended to have a stack of Conv layers with smaller filters to better detect larger more complex patterns in the image.
I'm trying to set up a non-conventional neural network using keras, and am having trouble efficiently setting this up.
The first few layers are standard convolutional layers, and the output of these have d channels, which each have image shapes of n x n.
What I want to do is use a single dense layer to map this d x n x n tensor onto a single image of size n x n. I want to define a single dense layer, with input size d, and output size 1, and apply this function to each "pixel" on the input (with the inputs taken depthwise across channels).
So far, I have not found an efficient solution to this. I have tried first defining a fully connected layer, then looping over each "pixel" in the input, but this takes many hours to initialize the model, and I am worried that it will slow down backprop, as the computations are likely not properly parallelized.
Is there an efficient way to do this?
What you're describing is a 1x1 convolution with output depth 1. You can implement it just as you implement the rest of the convolution layers. You might want to apply tf.squeeze afterwards to remove the depth, which should have size 1.
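To see why a 1x1 convolution is the same thing as a dense layer applied per pixel, here is a small pure-Python sketch. It assumes a channels-last layout (height, width, channels), and the image and weights are made up for the example.

```python
def pointwise_conv(image, weights, bias=0.0):
    """1x1 convolution with one output channel.

    image:   nested list of shape (height, width, channels)
    weights: list of length channels, shared by every pixel
    returns: nested list of shape (height, width)
    """
    return [
        [sum(w * v for w, v in zip(weights, pixel)) + bias for pixel in row]
        for row in image
    ]

# Hypothetical 2x2 image with d = 3 channels.
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 1, 0]],
]
weights = [1.0, 0.0, -1.0]   # the d weights of the "dense layer"

result = pointwise_conv(image, weights)
print(result)   # [[-2.0, -2.0], [-2.0, 0.0]]
```

In Keras, this corresponds to a Conv2D layer with 1 filter and a (1, 1) kernel, which runs the whole thing as one parallel operation instead of a Python loop.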
To implement a specific function, I need "input_channels" number of kernels in my layer, each having only a single channel depth, and not depth = "input_channels".
I need to convolve one kernel with one channel of the input, thus the output of the layer would have "input_channels" number of kernels.
Which python/numpy/tensorflow convolution function can allow such a convolution where the number of channels in kernel must not always be equal to "input_channels" and can be 1 instead?
Thanks in advance for any help.
(if anyone wishes to know what all i have tried yet,
In the conv2d function of tensorflow, if I specify number of kernels = 1 to do this, then it will sum over all input_channels and number of output_channels will be 1, since it always initialises kernel depth = "input_channels".
Another option is to specify number of kernels = input_channels in the conv2d function, but this would create "input_channels" kernels, each of depth "input_channels", thus adding a lot of complexity and giving an incorrect implementation of my layer.
Yet another thing I tried was to initialise a kernel of volume (kernel_height, kernel_width, input_channels) and loop over the third dimension to convolve only a single input channel with a single kernel. But the tensorflow conv2d function requires a rank 4 kernel to work and gives the following error -
ValueError: Shape must be rank 4 but is rank 3 for 'generic_act_func_4/Conv2D' (op: 'Conv2D') with input shapes: [?,28,28], [28,28]. )
As I see it, you're trying to learn a separate model for each dimension in the input. Thus you will need 2D convolution filters with a filter depth of 1.
I believe there should be an easier way, but most logical to me would be to create a model consisting of a number of submodels equal to the depth of your input (32). Thus 32 models containing a single convolutional filter, receiving only one dimension of your input. Stacking the output from all models would then give the results as you require.
Another solution which would be interesting (but I'm not sure whether it will work, have not tried it myself) would be to do separable convolutions on the input.
A link to an article describing these operations:
https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728
You essentially want to perform only the first part of the separable convolution operation, which is exactly what the DepthwiseConv2D layer in keras/tensorflow does. So I would have a look at that if I were you. I would be interested to know whether this works out for you!
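For intuition, this is roughly what a depthwise convolution computes, sketched in plain Python. It assumes a channels-first list of 2D channels, stride 1 and no padding; the helper names and the data are invented for the example.

```python
def correlate2d_valid(channel, kernel):
    """Slide one 2D kernel over one 2D channel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(channel) - kh + 1
    out_w = len(channel[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * channel[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def depthwise_conv(channels, kernels):
    """Convolve channel c with kernel c only; channels are never summed."""
    return [correlate2d_valid(ch, k) for ch, k in zip(channels, kernels)]

# Two hypothetical 3x3 channels, each with its own 2x2 kernel.
channels = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],   # sums the main diagonal of each window
    [[0, 1], [1, 0]],   # sums the anti-diagonal of each window
]

out = depthwise_conv(channels, kernels)
print(out)   # [[[6, 8], [12, 14]], [[0, 2], [2, 0]]]
```

The output has as many channels as the input, and each kernel has depth 1, which is what the question asks for.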
I am wondering whether it is possible to add something similar to a Flatten layer for images of variable size.
Say we have an input layer for our CNN as:
input_shape=(1, None, None)
After performing your typical series of convolution/maxpooling layers, can we create a flattened layer, such that the shape is:
output_shape=(None,...)
If not, would someone be able to explain why not?
You can add GlobalMaxPooling2D and GlobalAveragePooling2D.
These will eliminate the spatial dimensions and keep only the channels dimension. Max will take the maximum values, Average will get the mean value.
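A small sketch of what the global pooling layers compute, in plain Python. It assumes a channels-first list of 2D feature maps; the function names and the data are only illustrative.

```python
def global_max_pool(feature_maps):
    """One value per channel: the maximum over all spatial positions."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]

def global_avg_pool(feature_maps):
    """One value per channel: the mean over all spatial positions."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

small = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]              # 2 channels, 2x2
large = [[[1, 2, 3], [4, 5, 6]], [[9, 0, 0], [0, 0, 0]]]  # 2 channels, 2x3

print(global_max_pool(small), global_avg_pool(small))  # [4, 8] [2.5, 2.0]
print(global_max_pool(large), global_avg_pool(large))  # [6, 9] [3.5, 1.5]
```

Whatever the spatial size, the result always has one value per channel, so the layers after it see a fixed-size input.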
I don't really know why you can't use a Flatten layer, but in fact you can't with variable dimensions.
I understand why a Dense layer wouldn't work: it would have a variable number of parameters, which is totally infeasible for backpropagation, weight updates and things like that. (PS: Dense layers act only on the last dimension, so that is the only one that needs to be fixed.)
Examples:
A Dense layer requires the last dimension fixed
A Conv layer can have variable spatial dimensions, but needs fixed channels (otherwise the number of parameters will vary)
A recurrent layer can have variable time steps, but needs fixed features and so on
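To illustrate the Dense case with numbers: if you could flatten a variable-size feature map, the Dense layer after it would need a different number of weights for every image size. The channel and unit counts below are hypothetical, and `dense_param_count` is just a helper for this sketch.

```python
def dense_param_count(input_size, units):
    """Weights plus biases of a fully connected layer."""
    return input_size * units + units

channels, units = 16, 10   # hypothetical values

counts = {}
for height, width in [(8, 8), (16, 16)]:
    flattened = height * width * channels
    counts[(height, width)] = dense_param_count(flattened, units)

print(counts)   # {(8, 8): 10250, (16, 16): 40970}

# After global pooling only the channel dimension is left.
pooled = dense_param_count(channels, units)
print(pooled)   # 170
```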
Also, notice that:
For classification models, you'd need a fixed dimension output, so, how to flatten and still guarantee the correct number of elements in each dimension? It's impossible.
For models with variable output, why would you want to have a fixed dimension in the middle of the model anyway?
If you're going totally custom, you can always use K.reshape() inside a Lambda layer and work with the tensor shapes:
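import keras.backend as K

def myReshape(x):
    # Dynamic shape of the incoming tensor (only known at run time)
    shape = K.shape(x)
    # Keep the batch dimension as a length-1 tensor
    batchSize = shape[:1]
    # -1 lets reshape infer the flattened size; a constant avoids creating a new variable
    newShape = K.constant([-1], dtype='int32')
    newShape = K.concatenate([batchSize, newShape])
    return K.reshape(x, newShape)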
The layer: Lambda(myReshape)
I don't think you can, because the compile step uses those dimensions to allocate fixed memory when your model is instantiated for training or prediction. Some dimensions need to be known ahead of time so that the matrices can be allocated.
I understand why you want variable-sized image input, the world is not (226, 226, 3). It depends on your specific goals, but for me, scaling up or windowing to a region of interest using say Single Shot Detection as a preprocessing step may be helpful. You could just start with Keras's ImageDataGenerator to scale all images to a fixed size - then you see how much of a performance gain you get from conditional input sizing or windowing preprocessing.
@mikkola, I have found Flatten to be very helpful for TimeDistributed models. You can add Flatten after the convolution steps using:
your_model.add(Flatten())
The documentation for the Embedding layer is here:
https://keras.io/layers/embeddings/
and the documentation for the Masking layer is here:
https://keras.io/layers/recurrent/
I can't find a difference there. Should one of the layers be preferred in certain situations?
I feel like Masking() is more masking of time steps; while Embedding(mask_zero=True) is more of a data filter.
Masking:
If all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers
With an arbitrary mask_value. Thus, you can decide to skip time steps in which there is no input, or some other condition you can think of, based on your data.
For Embedding, you overlay a mask on your input, skipping calculations for data for which the input = 0. This way, you can, in a single time step, propagate the full data, part of the data, or no data through the network. This is not a masking of time step #3 or something like that; it is a masking of input data #i. Also, only the absence of input (input = 0) can be masked.
Thus, there are certainly cases where the two are completely equivalent (for example, when an input being 0 means it is 0 for all features of that time step), but they operate at different resolutions.
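As a rough sketch of how the two masks are derived, based on my understanding of the Keras behaviour (treat it as an approximation, not the library's code): Masking looks at float features per time step, while Embedding(mask_zero=True) looks at integer indices. The function names and the data below are made up.

```python
def masking_layer_mask(batch, mask_value=0.0):
    """Masking-style mask: a time step is kept (True) unless ALL of its
    features equal mask_value. batch: (samples, timesteps, features)."""
    return [
        [any(v != mask_value for v in step) for step in sample]
        for sample in batch
    ]

def embedding_mask(indices):
    """Embedding(mask_zero=True)-style mask: index 0 is treated as padding.
    indices: (samples, timesteps) of integer token ids."""
    return [[idx != 0 for idx in sample] for sample in indices]

features = [[[1.0, 0.0], [0.0, 0.0], [0.0, 3.0]]]  # 1 sample, 3 steps, 2 features
tokens = [[5, 2, 0, 0]]                            # 1 sample, 4 steps

print(masking_layer_mask(features))  # [[True, False, True]]
print(embedding_mask(tokens))        # [[True, True, False, False]]
```

So Masking fits sequences of feature vectors with an arbitrary mask_value, while the Embedding option only applies to integer-encoded input where 0 is reserved for padding.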