I have a network that's pretty much UNet. However, the model crashes when I feed in an input of size 3x1x1 (channels=3, height=1, width=1), since the first max pooling (kernel size=2, stride=2) reduces the spatial dimensions to 3x0x0.
How do I modify the UNet model so that it can take my 3x1x1 input and handle an arbitrary number of poolings? Any help is appreciated!
You need to normalize the image sizes during preprocessing; see torchvision.transforms.functional.resize.
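For example, a minimal sketch (the 64x64 target size is hypothetical; pick whatever size your number of poolings requires):

```python
import torch
import torchvision.transforms.functional as TF

x = torch.randn(3, 1, 1)      # the degenerate 3x1x1 input from the question
x = TF.resize(x, [64, 64])    # upsample so repeated 2x2 poolings stay valid: 64 -> 32 -> 16 -> ...
x = x.unsqueeze(0)            # add a batch dimension: 1x3x64x64
# output = model(x)           # 'model' is your UNet, not defined here
```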
The picture above, generated using MATLAB's Deep Learning Toolbox, shows the architecture of a CNN created for a toy example. The input image is of size 25*20*7, there are 15 filters, each of size 5*5, and the padding is 'same'. The output of the first convolution, conv1, is 25*20*15, which goes into the maxpooling1 operation of size 2*2 with stride 1 and padding 'same'.
Based on my understanding, the role of maxpooling is to perform dimension reduction. However, since the padding in my code is set to 'same', I understand that the output of maxpooling will preserve the spatial dimensions of its input, which is 25*20*15. That is why the output of maxpooling1, and of the remaining maxpooling layers, has the same dimensions as its input, and there is no change in the dimensions in the later layers. As an example, using the output-size formula floor((W - K + P)/S) + 1 with total padding P = 1 (which is what 'same' padding amounts to for a 2*2 kernel at stride 1), the first dimension of the maxpooling output should be: floor((25 - 2 + 1)/1) + 1 = 25. Similarly, for the second dimension maxpooling would yield: floor((20 - 2 + 1)/1) + 1 = 20. Thus, the output of maxpooling should be 25*20*15.
This implies that maxpooling is not doing dimension reduction. Therefore, should I remove maxpooling if the padding option is set to same?
Please let me know why the dimensions stay the same after maxpooling, and if they do, whether I should remove this operation. Or did I make a mistake somewhere?
The role of padding is different for convolutional and maxpooling layers. If padding='same' in a convolutional layer, the input is padded so that (at stride 1) the output height and width remain the same as the input's.
Padding in pooling layers works differently. The purpose of pooling layers is to reduce the spatial dimensions (height and width). In pooling layers, padding='same' does not guarantee that the spatial dimensions stay unchanged; the padding is there to complete the last pooling window when the input size and kernel size do not fit evenly.
tl;dr: If you want to reduce the spatial size at every step, use padding='valid'; it's the default option.
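As a quick sanity check, here is a minimal Keras sketch (not the asker's MATLAB code) comparing the padding and stride options on the 25*20*15 tensor from the question:

```python
import tensorflow as tf

x = tf.random.normal([1, 25, 20, 15])  # conv1 output from the question: 25*20*15

# stride 1 + 'same' padding: spatial size is preserved
print(tf.keras.layers.MaxPooling2D(pool_size=2, strides=1, padding='same')(x).shape)   # (1, 25, 20, 15)

# stride 1 + 'valid' padding: only a slight reduction (25 -> 24, 20 -> 19)
print(tf.keras.layers.MaxPooling2D(pool_size=2, strides=1, padding='valid')(x).shape)  # (1, 24, 19, 15)

# the usual downsampling setup: stride 2 (padding then only affects rounding)
print(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='same')(x).shape)   # (1, 13, 10, 15)
```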
Max pooling is generally used for downsampling and for reducing overfitting.
If you use padding='same' with stride 1, the input is padded so that the output keeps the same spatial size; with the default stride (equal to the pool size), the output is still downsampled.
In the example below, the input size is 4*4, the pool is 2*2 and the (default) stride is 2x2, so the output is 2*2.
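Since the original example isn't reproduced here, a minimal Keras sketch of the same setup:

```python
import numpy as np
import tensorflow as tf

x = np.arange(16, dtype='float32').reshape(1, 4, 4, 1)  # a 4*4 input with one channel

pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))   # default strides = pool_size, padding='valid'
print(pool(x).shape)  # (1, 2, 2, 1): each non-overlapping 2*2 window is reduced to its maximum
```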
Find more examples on keras' official site
I trained a CNN model with an input shape of (5x128x128x3)
and I got trained weights for that (5x128x128x3) input.
Now I want to use these weights to train on input data of size (7x128x128x3).
So, this is my question:
Should I only use the same input shape?
I wonder if I can use another input size (in this case, 7x128x128x3) for transfer learning.
ValueError: Error when checking input: expected input_1 to have shape (5, 128, 128, 3) but got array with shape (7, 128, 128, 3)
Let's break down the dimensions (5x128x128x3):
The first dimension is the batch size (which was 5 when the original model was trained). This is irrelevant, and you can set it to None, as pointed out in the comments, to feed arbitrarily sized batches to the model.
The second and third dimensions (128x128) are the width and height of the image, and you may be able to change these, but it's hard to say for sure without knowing the model architecture and which layer output you're using for transfer learning. The reason you can change them is that 2D convolutional filters are repeated across the two spatial dimensions (width and height) of the image, so they remain valid for different widths and heights (assuming compatible padding). But if you change the spatial dimensions too much, it is possible that the receptive fields of the layers change in a way that hurts transfer learning performance. E.g., if the 7th conv layer in the network for 128x128 input can see the entire input image in each activation (a receptive field of 128x128), then after doubling the width and height it no longer can, and the layer may not recognize certain global features.
The fourth dimension is the number of channels in the input images and you can't change this, as the filters in the first layer will have 3 weights across the depth dimension.
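As an illustration of the batch-size point, here is a hypothetical sketch (the Conv2D/Dense architecture below is made up, since the real model isn't shown): rebuild the same architecture with the batch dimension left as None and copy the trained weights over, since the weights themselves don't depend on the batch size.

```python
import tensorflow as tf

def build_model(batch_size=None):
    # Hypothetical stand-in architecture; replace with the real one.
    inp = tf.keras.Input(shape=(128, 128, 3), batch_size=batch_size)
    x = tf.keras.layers.Conv2D(32, 3, activation='relu')(inp)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    out = tf.keras.layers.Dense(10, activation='softmax')(x)
    return tf.keras.Model(inp, out)

old_model = build_model(batch_size=5)      # fixed batch of 5, like the originally trained model
new_model = build_model(batch_size=None)   # accepts any batch size, e.g. 7
new_model.set_weights(old_model.get_weights())

new_model.predict(tf.random.normal([7, 128, 128, 3]))  # no shape error anymore
```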
I've been trying to set up an LSTM model but I'm a bit confused about batch_size. I'm using the Keras module in Tensorflow.
I have 50,000 samples, each has 200 time steps and each time step has three features. So I've shaped my training data as (50000, 200, 3).
I set up my model with four LSTM layers, each having 100 units. For the first layer I specified the input shape as (200, 3). The first three layers have return_sequences=True, the last one doesn't. Then I do some softmax classification.
When I call model.fit with batch_size=some_number, does TensorFlow/Keras take care of feeding the model with batches of the specified size? Do I have to reshape my data somehow in advance? What happens if the number of samples is not evenly divisible by some_number?
Thanks for your help!
If you provide your data as numpy arrays to model.fit(), then yes, Keras will take care of feeding the model with the batch size you specified. If your dataset size is not divisible by the batch size, Keras will simply make the final batch smaller, with dataset_size mod batch_size samples.
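For example, a minimal sketch of the setup described in the question (four LSTM layers of 100 units with a softmax on top; the 4 output classes and the random data are assumed only for illustration):

```python
import numpy as np
import tensorflow as tf

x_train = np.random.rand(50000, 200, 3).astype('float32')  # (samples, time steps, features)
y_train = np.random.randint(0, 4, size=(50000,))            # 4 classes, assumed for illustration

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(100, return_sequences=True, input_shape=(200, 3)),
    tf.keras.layers.LSTM(100, return_sequences=True),
    tf.keras.layers.LSTM(100, return_sequences=True),
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(4, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Keras slices the arrays into batches of 64 by itself; no manual reshaping is needed.
# 50000 % 64 = 16, so the last batch of every epoch simply contains 16 samples.
model.fit(x_train, y_train, batch_size=64, epochs=1)
```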
TensorFlow's API describes the function tf.nn.conv2d(), which takes in an argument for the filter size: [filter_height, filter_width, in_channels, out_channels]. So if I used the MNIST dataset and ran the network on an image displaying the number "5", would the filter be trained on the lower, circular bowl of the 5? Or would it just train on multiple parts of the image? How and what do the filters in conv2d train on?
You should read the basic principles of convolutional layers:
Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).
During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.
Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.
So, in essence, each [filter_height, filter_width] filter is going to match all patches of the same size in the input and produce a single number for each patch. Some patches can be skipped or added, depending on the stride and padding settings. In the backward pass, the filter will be updated for all of them, i.e., it is trained on the whole input.
E.g., here's a stride=1, padding=2 convolution:
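A minimal numeric sketch of that configuration with tf.nn.conv2d (the random data and the choice of 12 filters are assumptions for illustration only):

```python
import tensorflow as tf

image = tf.random.normal([1, 28, 28, 1])    # [batch, height, width, in_channels], e.g. one MNIST-sized "5"
filters = tf.random.normal([5, 5, 1, 12])   # [filter_height, filter_width, in_channels, out_channels]

# Each 5x5 filter slides over every spatial position (stride 1) with 2 pixels of
# zero padding per border, so it is applied to every part of the digit, not just
# one region like the lower bowl; in the backward pass it is updated from all of them.
out = tf.nn.conv2d(image, filters,
                   strides=1,
                   padding=[[0, 0], [2, 2], [2, 2], [0, 0]])
print(out.shape)  # (1, 28, 28, 12): one activation map per filter
```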
The first layer of my neural network is like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D
from tensorflow.keras import regularizers

model = Sequential()
model.add(Conv1D(filters=40,                               # number of output channels
                 kernel_size=25,                           # width of the 1D convolution window
                 input_shape=x_train.shape[1:],            # e.g. (600, 10) or (6000, 1)
                 activation='relu',
                 kernel_regularizer=regularizers.l2(5e-6),
                 strides=1))
If my input shape is (600, 10),
I get (None, 576, 40) as the output shape.
If my input shape is (6000, 1),
I get (None, 5976, 40) as the output shape.
So my question is: what exactly is happening here? Is the first example simply ignoring 90% of the input?
It is not "ignoring" a 90% of the input, the problem is simply that if you perform a 1-dimensional convolution with a kernel of size K over an input of size X the result of the convolution will have size X - K + 1. If you want the output to have the same size as the input, then you need to extend or "pad" your data. There are several strategies for that, such as add zeros, replicate the value at the ends or wrap around. Keras' Convolution1D has a padding parameter that you can set to "valid" (the default, no padding), "same" (add zeros at both sides of the input to obtain the same output size as the input) and "causal" (padding with zeros at one end only, idea taken from WaveNet).
Update
About the questions in your comments. So you say your input is (600, 10). That, I assume, is the size of one example, and you have a batch of examples with size (N, 600, 10). From the point of view of the convolution operation, this means you have N examples, each with a length of at most 600 (this "length" may be time or whatever else, it's just the dimension across which the convolution works) and, at each of these 600 points, you have vectors of size 10. Each of these vectors is considered an atomic sample with 10 features (e.g. price, height, size, whatever), or, as it is sometimes called in the context of convolution, "channels" (from the RGB channels used in 2D image convolution).
The point is, the convolution has a kernel size and a number of output channels, which is the filters parameter in Keras. In your example, what the convolution does is take every possible slice of 25 contiguous 10-vectors and produce a single 40-vector for each (that is, for every example in the batch, of course). So you pass from having 10 features or channels in your input to having 40 after the convolution. It's not that it's using only one of the 10 elements in the last dimension; it's using all of them to produce the output.
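A quick way to see that all 10 input channels participate is to look at the layer's kernel shape (a minimal sketch):

```python
from tensorflow.keras.layers import Conv1D, Input

inp = Input(shape=(600, 10))
conv = Conv1D(filters=40, kernel_size=25)
out = conv(inp)

print(conv.kernel.shape)  # (25, 10, 40): every output channel mixes all 25 positions and all 10 input channels
print(out.shape)          # (None, 576, 40)
```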
If the meaning of the dimensions in your input is not what the convolution is interpreting, or if the operation it is performing is not what you were expecting, you may need to either reshape your input or use a different kind of layer.