TensorFlow's API describes the function tf.nn.conv2d() which takes in an argument of filter size: [filter_height, filter_width, in_channel, out_channel]. So if I used the mnist dataset and ran the network on an image displaying the number "5," would the filter be trained on the lower, circular bowl of the 5? Or would it just train on multiple parts of the image? How and what would the filters in the conv2d train on?
You should read the basic principles of convolutional layers:
Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).
During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.
Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.
So, in essence, each [filter_height, filter_width] filter is going to match all patches of the same size in the input and produce a single number for each patch. Some patches can be skipped or added, depending on the stride and padding settings. In the backward pass, the filter will be updated for all of them, i.e., it is trained on the whole input.
E.g., here's stride=1 and padding=2 convolution:
Related
I have a network that's pretty much UNet. However, the model crashed when I feed in input size of 3x1x1 (channel =3, height =1, width=1) since the first max pooling (with kernel size =2 and stride =2) will reduce the dimension into 3x0x0.
How do I modify Unet model such that it can take my 3x1x1 input and handle arbitrary number of poolings? Any help is appreciated!
One must normalize sizes of images with preprocessing, see torchvision.transforms.functional.resize.
I'm following this tutorial on towardsdatascience.com because I wanted to try the MNIST dataset using Pytorch since I've already done it using keras.
So in Step 2, knowing the dataset better, they print the trainloader's shape and it returns torch.Size([64, 1, 28, 28]). I understand that 64 is the number of images in that loader and that each one is a 28x28 image but what does the 1 mean exactly?
It simply defines an image of size 28x28 has 1 channel, which means it's a grayscale image. If it was a colored image then instead of 1 there would be 3 as the colored image has 3 channels such as RGB.
It's the number of channels in the input. In the MNIST data set the images are gray scale thus the shape of the image is [28, 28, 1]. Notice that pytorch set the first dimension to the channel dimension.
Of course once loaded as batches the total input shape is the one you are getting.
refer to the MNIST dataset link, where it states:
The original black and white (bilevel) images from NIST were size
normalized to fit in a 20x20 pixel box while preserving their aspect
ratio. The resulting images contain grey levels as a result of the
anti-aliasing technique used by the normalization algorithm. the
images were centered in a 28x28 image by computing the center of mass
of the pixels, and translating the image so as to position this point
at the center of the 28x28 field.
In short ,
Its just the number of channels your 28x28 image has
This would suggest the number of batches present in the dataset. Think of it as groups, so we have 1 batch of 64 images, or you could change that, and say, have 2 batches of 32 images each. The batch size can usually influence the computational complexity for the model.
And, of course, depending on the used library (especially in the training/testing loop), the code would look slightly different if you would use just 1 batch, or X number of batches.
For example (the number of epochs/iterations = 50): imagine you are training a dataset of batch size = 1, in the training loop you would just write train the model epoch times. However, for batch size = x, you would have to loop for each epoch as well as for each batch/group.
I trained a CNN model with (5x128x128x3) size of input shape
and I got trained weight of (5x128x128x3)
by the way, I wanna use this weight for training (7x128x128x3) size of input data
So, this is my question
should I use only same shape of input?
I wonder if I can use another size (in this case, 7x128x128x3) of input for transfer learning
ValueError: Error when checking input: expected input_1 to have shape (5, 128, 128, 3) but got arry with shape (7, 128, 128, 3)```
Let's break down the dimensions (5x128x128x3):
The first dimension is the batch size (which was 5 when the original model was trained). This is irrelevant and you can set it to None as pointed out in the comments to feed arbitrary sized batches to the model.
The second to third dimensions (128x128) are the width and height of the image and you may be able to change these, but it's hard to say for sure without knowing the model architecture and which layer output you're using for transfer learning. The reason you can change these is that 2d convolutional filters are repeated across the 2d dimensions (width and height) of the image, so they will remain valid for different widths and heights (assuming compatible padding). But if you change the 2d dimensions too much, it is possible that the receptive fields of the layers are changed in a way that hurts transfer learning performance. Eg. if the 7th conv layer in the network for 128x128 input can see the entire input image in each activation (a receptive field of 128x128), then if you double the width and height, it won't anymore and the layer may not recognize certain global features.
The fourth dimension is the number of channels in the input images and you can't change this, as the filters in the first layer will have 3 weights across the depth dimension.
So far, I've been practicing neural networks on numerical datasets in pandas, but now I need to create a model that will take an image as input and output a binary mask of that image.
I have my training data as numpy arrays of shape (602, 2048, 2048, 1). 602 images of dimensions 2048x2048 with one channel. The array of output masks have the same dimensions.
What I can't figure out is how to define the first layer or how to correctly feed the data into the model. I would greatly appreciate your help on this issue
Well, this is not a "rule", but probably you will be using mostly 2D conv and related layers.
You feed everything as numpy arrays, as usual, maybe normalizing the values. Common options are:
Between 0 and 1 (just divide by 255.)
Between -1 and 1 (divide by 255., multiply by 2, subtract 1)
Caffe style: subtract from each channel a specific value to "center" the values based on their usual mean without rescaling them.
Your model should start with something like:
inputTensor = Input((2048,2048,1))
output = Conv2D(filters, kernel_size, .....)(inputTensor)
Or, in sequential models: model.add(Conv2D(...., input_shape=(2048,2048,1))
Later, it's up to you to decide which layers to use.
Conv2D
MaxPooling2D
Upsampling2D
Whether you're going to create a linear model or if you're going to divide branches, join branches, etc. is also your call.
Models in a U-Net style should be a good start for you.
What you can't do:
Don't use Flatten layers (actually you can, if you later reshape the output for having image dimensions... but why?)
Don't use Global Pooling layers (you don't want to sacrifice your spatial dimensions)
The first layer of my neural network is like this:
model.add(Conv1D(filters=40,
kernel_size=25,
input_shape=x_train.shape[1:],
activation='relu',
kernel_regularizer=regularizers.l2(5e-6),
strides=1))
if my input shape is (600,10)
i get (None, 576, 40) as output shape
if my input shape is (6000,1)
i get (None, 5976, 40) as output shape
so my question is what exactly is happening here? is the first example simply ignoring 90% of the input?
It is not "ignoring" a 90% of the input, the problem is simply that if you perform a 1-dimensional convolution with a kernel of size K over an input of size X the result of the convolution will have size X - K + 1. If you want the output to have the same size as the input, then you need to extend or "pad" your data. There are several strategies for that, such as add zeros, replicate the value at the ends or wrap around. Keras' Convolution1D has a padding parameter that you can set to "valid" (the default, no padding), "same" (add zeros at both sides of the input to obtain the same output size as the input) and "causal" (padding with zeros at one end only, idea taken from WaveNet).
Update
About the questions in your comments. So you say your input is (600, 10). That, I assume, is the size of one example, and you have a batch of examples with size (N, 600, 10). From the point of view of the convolution operation, this means you have N examples, each of with a length of at most 600 (this "length" may be time or whatever else, it's just the dimension across which the convolution works) and, at each of these 600 points, you have vectors of size 10. Each of these vectors is considered an atomic sample with 10 features (e.g. price, heigh, size, whatever), or, as is sometimes called in the context of convolution, "channels" (from the RGB channels used in 2D image convolution).
The point is, the convolution has a kernel size and a number of output channels, which is the filters parameter in Keras. In your example, what the convolution does is take every possible slice of 25 contiguous 10-vectors and produce a single 40-vector for each (that, for every example in the batch, of course). So you pass from having 10 features or channels in your input to having 40 after the convolution. It's not that it's using only one of the 10 elements in the last dimension, it's using all of them to produce the output.
If the meaning of the dimensions in your input is not what the convolution is interpreting, or if the operation it is performing is not what you were expecting, you may need to either reshape your input or use a different kind of layer.