Computational cost of Conv2D vs Conv3D Layer with identical data - python

I want to predict a spatio-temporal system with dimensions (x, y, t) using a convolutional neural network. My time dimension is orders of magnitude smaller than both space dimensions, i.e. t=16, x=y=1024.
I am currently using 3D convolutional layers with channel dimension 1. Say I choose a kernel size of 5 for all dimensions; I end up with a kernel of size (5, 5, 5, 1). Since 3D convolutions are computationally expensive, the model training tends to crash a lot. One workaround I thought of is to use 2D convolutional layers and treat the time dimension as channels. My resulting kernel would be of size (5, 5, 16) with t=16 time steps.
In summary, by using 2D convolutions I get rid of one dimension to stride along. On the other hand, the 2D layer has more weights / trainable parameters than the 3D layer; 2D: 5x5x16 = 400 vs. 3D: 5x5x5x1 = 125 (+ bias, respectively). How does this translate to the computational cost of Conv3D vs. Conv2D layers?
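A rough way to compare the two options is to count trainable parameters and multiply-accumulate operations directly. Below is a back-of-envelope sketch with tf.keras, assuming 'same' padding, stride 1 and a single output filter (these settings are assumptions for illustration, not taken from the question):

import tensorflow as tf

# Shapes from the question: t=16, x=y=1024, one channel
conv3d = tf.keras.layers.Conv3D(filters=1, kernel_size=5, padding='same')
conv2d = tf.keras.layers.Conv2D(filters=1, kernel_size=5, padding='same')

conv3d.build((None, 16, 1024, 1024, 1))  # (batch, t, x, y, channels) for Conv3D
conv2d.build((None, 1024, 1024, 16))     # (batch, x, y, t-as-channels) for Conv2D

print(conv3d.count_params())  # 5*5*5*1*1 + 1 = 126
print(conv2d.count_params())  # 5*5*16*1  + 1 = 401

# Rough multiply-accumulate counts (output positions x kernel weights):
# Conv3D: 16*1024*1024 positions * 125 weights ~ 2.1e9 MACs
# Conv2D:    1024*1024 positions * 400 weights ~ 4.2e8 MACs

Under these assumptions the Conv2D variant needs roughly 5x fewer multiply-accumulates despite having more parameters, because it produces no output positions along the time axis.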

Related

Confusion regarding dimensions in CNN

The above picture, generated using Matlab's Deep Learning Toolbox, shows the architecture of a CNN created for a toy example. The input image is of size 25*20*7, there are 15 filters, each of size 5*5, and padding is 'same'. The output of the first convolution conv1 is 25*20*15, which goes into the maxpooling1 operation of size 2*2 with stride 1 and padding 'same'.
Based on my understanding, the role of maxpooling is to perform dimension reduction. However, in my code, since the padding is set to 'same', I understand that the output of maxpooling will preserve the spatial dimensions of its input, which are 25*20*15. That is why the output of maxpooling1 and of the remaining maxpooling layers has the same dimensions as its input, and there is no change in the dimensions in the remaining layers. As an example, along the first dimension the output of maxpooling should be (25 - 2 + 1)/1 + 1 = 25, where 'same' padding adds a total of 1. Similarly, for the second dimension maxpooling would yield (20 - 2 + 1)/1 + 1 = 20. Thus, the output of maxpooling should be 25*20*15.
This implies that maxpooling is not doing dimension reduction. Therefore, should I remove maxpooling if the padding option is set to 'same'?
Please let me know why the dimensions are the same after maxpooling, and if they are the same, whether I should remove this operation. Or did I make a mistake somewhere?
The role of padding is different for convolutional and maxpooling layers. If padding='same' in a convolutional layer, it means that the output size (primarily height and width) remains the same as the input for stride 1.
Padding in pooling layers has a different purpose. The job of pooling layers is to reduce the spatial dimensions (height and width). In pooling layers, padding='same' does not mean that the spatial dimensions never change; it is there to cover the leftover border positions when the input size, kernel size and stride do not fit together perfectly.
tl;dr: If you want to reduce the image size at every step, use padding='valid'; it is the default option.
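To make this concrete, here is a small sketch (assuming tf.keras; the 5x5 input is just an illustrative choice) showing that when the pool does not fit the input exactly, padding='valid' drops the leftover border while padding='same' pads it:

import tensorflow as tf

x = tf.zeros((1, 5, 5, 1))  # a 5x5 input does not fit a 2x2 pool with stride 2 exactly

# padding='valid' (default): floor((5 - 2) / 2) + 1 = 2, the last row/column is dropped
print(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid')(x).shape)  # (1, 2, 2, 1)

# padding='same': ceil(5 / 2) = 3, the input is padded so nothing is dropped
print(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='same')(x).shape)   # (1, 3, 3, 1)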
Max pooling is generally used for downsampling and reducing overfitting.
If you use padding='same', the input is padded so that every position is pooled; with stride 1 this means the output keeps the input size and there is no reduction.
In the example below, the input size is 4*4, the pool is 2*2 and the (default) stride is 2x2, so the output is 2*2.
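A minimal sketch of that example, assuming tf.keras (the stride-1 call also illustrates the "no drop in size" case mentioned above):

import tensorflow as tf

x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))  # 4x4 single-channel input

# pool 2x2 with the default stride of 2x2: output halves to 2x2
print(tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='same')(x).shape)             # (1, 2, 2, 1)

# pool 2x2 with stride 1 and padding='same': output stays 4x4
print(tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=1, padding='same')(x).shape)  # (1, 4, 4, 1)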
Find more examples on keras' official site

Can we use 1D convolution for image classification?

I have images with shape (100, 100, 3), and I want to use keras 1D convolution to classify the images.
I want to know if this is possible, and what is the shape of the input I need to use.
PS: I use tf.data.Dataset, and my dataset is batched (20, 100, 100, 3).
I assume you mean 1x1 convolutions, which convolve the image across its channels. In your case the layer code would be:
tf.keras.layers.Conv2D(filters=NUM_FILTERS, kernel_size=1, strides=1)
Conv1D is indeed for 1D data processing (like sound), as @MatusDubrava pointed out.
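Applied to the batched shape from the question, a small sketch of what that layer does (filters=16 is an arbitrary example value):

import tensorflow as tf

x = tf.random.normal((20, 100, 100, 3))  # the batched images from the question
y = tf.keras.layers.Conv2D(filters=16, kernel_size=1, strides=1)(x)
print(y.shape)  # (20, 100, 100, 16): spatial size unchanged, channels mixed per pixel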
Should we use 1D convolution for image classification?
TLDR; Not by itself, but maybe if composed.
The correlation between pixels in an image (be it 2D or 3D due to multiple channels) is of spatial nature: the value of a given pixel is highly influenced by the neighboring pixels both vertically and horizontally. The advantage of 2D/3D Convolution (Conv2D or Conv3D) is that they manage to capture this influence in both spatial directions: vertical and horizontal.
In comparison, 1D convolution or Conv1D only captures one of the two correlations (either vertical or horizontal), thus yielding much more limited information. By itself, a single Conv1D will leave out substantial information.
Nonetheless, since a Conv2D can be 'decomposed' into two Conv1D blocks (this is similar to the pointwise & depthwise convolutions in the MobileNet architecture), concatenating a vertical Conv1D and a horizontal Conv1D captures the spatial correlation along both axes. This is a valid approach to image classification as an alternative to Conv2D.
Can we use 1D convolution for image classification? How?
Yes, we can.
You should not reshape the data to reduce dimensions: if you do, you would be taping together one end of the image (say the top, if the Conv1D is applied vertically) with the other end (say, the bottom), which breaks spatial coherence.
This is a possible example of how (implementing the concatenation explained above):
import tensorflow as tf

x = tf.random.normal(shape=(20, 100, 100, 3))  # your input batch

# Horizontal Conv1D: convolves along the width axis
y_h = tf.keras.layers.Conv1D(
    filters=32, kernel_size=3, padding='same', activation='relu')(x)

# Vertical Conv1D: swap rows and columns, convolve, then swap back
y_v = tf.transpose(x, perm=[0, 2, 1, 3])  # image rows to columns
y_v = tf.keras.layers.Conv1D(
    filters=32, kernel_size=3, padding='same', activation='relu')(y_v)
y_v = tf.transpose(y_v, perm=[0, 2, 1, 3])  # undo transpose so spatial axes match y_h

# Concatenate results on the feature-map axis
y = tf.keras.layers.Concatenate(axis=3)([y_h, y_v])
Note that you require multiple operations to obtain a result (convolution over vertical and horizontal axes) which would be easier and faster to get by applying Conv2D directly.
When should we use this?
If your image data is particularly uninformative along one axis while being particularly interesting along the other spatial axis, it might be an idea worth exploring. Otherwise it is better to resort to the standard Conv2D (most cases out there, including almost all public image datasets).

Can 2D convolutional neural network be converted into 1D convolutional neural network?

I have designed a neural network using 2D convolutional layers and max-pooling layers, where the input is one-hot encoded sequences stored as a 2D array. The array is reshaped before being fed into the model:
import numpy as np
import tensorflow as tf

data = np.zeros((100, 21 * 1000), dtype=np.float32)
# reshape
x_data = tf.reshape(data, [-1, 1, 1000, 21])
However, I also trained on the same dataset using 1D convolutional layers, changing the model and the input array, without reshaping since the data is already 1D:
data = np.zeros((100, 1000, 21), dtype=np.float32)
Finally, the 1D convolutional model performed well with 96% accuracy, while the 2D CNN gave 93%. Can someone explain to me what actually happens there to increase the accuracy?
Can someone explain to me what actually happens there to increase the accuracy?
That's hard to tell and depends on your specific dataset, network, hyperparameters etc.
Generally, in a Conv2D layer the filter shifts both horizontally and vertically over the input. In a Conv1D layer the filter shifts along only one axis during the convolution.
So which one is best? That depends on your problem. For time series, Conv1D could be better; for images, Conv2D is usually the better choice.
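As a side note, the 2D reshape in the question, (batch, 1, 1000, 21), only leaves room for the filter to shift along the length-1000 axis, so a Conv2D with kernel height 1 ends up with exactly the same weights as a Conv1D on (batch, 1000, 21). A small sketch (the filter count 64 and kernel length 9 are arbitrary example values):

import tensorflow as tf

conv2d = tf.keras.layers.Conv2D(filters=64, kernel_size=(1, 9))
conv1d = tf.keras.layers.Conv1D(filters=64, kernel_size=9)

conv2d.build((None, 1, 1000, 21))
conv1d.build((None, 1000, 21))

print(conv2d.count_params())  # 1*9*21*64 + 64 = 12160
print(conv1d.count_params())  # 9*21*64   + 64 = 12160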

How to implement 1-D deconvolutional layer with a stride larger than one by tensorflow?

As this guide explains [A guide to convolution arithmetic for deep learning], a deconvolutional layer can be transformed into an equivalent convolutional layer.
However, when the original convolution has a stride larger than one, the corresponding equivalent convolution of the deconvolution should take a stretched input, obtained by adding s−1 zeros between each pair of input units, where s is the stride of the original convolution.
Here is an example:
[The transpose of convolving a 3×3 kernel over a 5×5 input padded with a 1×1 border of zeros using 2×2 strides]
Here is the problem: because TensorFlow only provides a 2-D deconvolutional layer, if I want to implement a 1-D deconvolutional layer for an original convolutional layer with a stride larger than one, how can I add zeros between each input unit?
Thanks very much.
I just found that the convolutional layer in keras has a parameter called dilation_rate, and it can cover my requirement.
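For completeness, the "stretched input" described in the question (inserting s−1 zeros between consecutive input units) can also be built by hand with plain TensorFlow ops and then fed to an ordinary Conv1D; newer TensorFlow versions additionally provide tf.keras.layers.Conv1DTranspose directly. A minimal sketch of the zero-insertion step (the helper name stretch_1d is made up for illustration):

import tensorflow as tf

def stretch_1d(x, stride):
    # x has shape (batch, length, channels); insert (stride - 1) zeros
    # between consecutive time steps.
    batch, length, channels = x.shape
    zeros = tf.zeros((batch, length, stride - 1, channels), dtype=x.dtype)
    x = tf.concat([tf.expand_dims(x, axis=2), zeros], axis=2)  # (batch, length, stride, channels)
    x = tf.reshape(x, (batch, length * stride, channels))      # values interleaved with zeros
    return x[:, : length * stride - (stride - 1), :]           # drop the trailing zeros

x = tf.random.normal((2, 5, 3))
print(stretch_1d(x, stride=2).shape)  # (2, 9, 3)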

Weights in Convolutional network?

I am currently following the TensorFlow's Multilayer Convolutional Network tutorial.
In various layers, weights are initialised as follows:
First Convolutional Layer:
W_conv1 = weight_variable([5, 5, 1, 32])
Second Convolutional Layer:
W_conv2 = weight_variable([5, 5, 32, 64])
Densely Connected Layer:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Readout Layer:
W_fc2 = weight_variable([1024, 10])
So I am having doubts about how the shapes of the above weight variables are known to us.
Is there any math used to find their shapes?
The answer is explained on the same page:
The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]
There is no math involved per se, but these terms need explanation:
The size of the convolution kernel is 5x5. That means there is a 5x5 matrix that is convolved with the input image by moving it around the image. Check this link for an explanation of how a small 5x5 matrix moves over a 28x28 image and multiplies different cells of the image matrix with itself. This gives us the first two dimensions of [5, 5, 1, 32].
The number of input channels is 1. These are BW (grayscale) images, hence one input channel. Most colored images have 3 channels, so expect a 3 in other convolution networks working on images. Indeed, for the second layer, W_conv2, the number of input channels is 32, the same as the number of output channels of layer 1.
The last dimension of the weight matrix is perhaps the hardest to visualize. Imagine your 5x5 matrix, and replicate it 32 times! Each of these 32 copies is called a channel (or filter). To complete the discussion, each of these 32 5x5 matrices is initialized with random weights and trained independently during forward/back propagation of the network. More channels learn different aspects of the image and hence give extra power to your network.
If you put these 3 points together, you get the desired dimensions of layer 1. Subsequent layers are an extension: the first two dimensions are the kernel size (5x5 in this case). The third dimension is equal to the number of input channels, which equals the number of output channels of the previous layer (32, since we declared 32 output channels for layer 1). The final dimension is the number of output channels of the current layer (64, even larger for the second layer! Again, keeping a large number of independent 5x5 kernels helps).
Finally, the last two layers. The final dense layer is the only thing that involves some calculation:
For each convolution layer (with 'same' padding), final size = initial size
For a pooling layer of size k x k, final size = initial size / k
So,
For conv1, the size remains 28 x 28
pool1 reduces the size to 14 x 14
For conv2, the size remains 14 x 14
pool2 reduces the size to 7 x 7
And of course, we have 64 channels due to conv2; pooling doesn't affect them. Hence, we get a final dense input of 7x7x64 = 3136. We then add a fully connected layer with 1024 hidden units and a readout layer with 10 outputs for the 10 digit classes.
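To double-check the arithmetic, here is a minimal tf.keras sketch (not the tutorial's original low-level code) that reproduces the same shapes:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu'),  # -> 28x28x32
    tf.keras.layers.MaxPooling2D(2),                                   # -> 14x14x32
    tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu'),  # -> 14x14x64
    tf.keras.layers.MaxPooling2D(2),                                   # -> 7x7x64
    tf.keras.layers.Flatten(),                                         # -> 7*7*64 = 3136
    tf.keras.layers.Dense(1024, activation='relu'),
    tf.keras.layers.Dense(10),
])
model.summary()  # kernel shapes: (5, 5, 1, 32), (5, 5, 32, 64), (3136, 1024), (1024, 10)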
