I know that in convolution layers the kernel size needs to be a multiple of the stride, or else it produces artefacts in the gradient calculations such as the checkerboard problem.
Does the same apply to pooling layers? I read somewhere that max pooling can also cause problems like that. Take this line in the discriminator, for example:
self.downsample = nn.AvgPool2d(3, stride=2, padding=1, count_include_pad=False)
I have a model (MUNIT) with it, and this is the image it produced:
It looks like the checkerboard problem, or at least a gradient problem, but I checked my convolution layers and didn't find the error described above. They are all of kernel size 4 with stride 2, or of an odd size with stride 1.
Honestly, this doesn't look like a checkerboard artifact. Also, I don't think the discriminator would be the problem; checkerboard issues usually come from the image-restoration part (the generator or decoder).
I took a quick look at MUNIT, and what they use in the Decoder is torch.nn.Upsample with nearest-neighbor upsampling (exact code line here).
You could try torch.nn.Conv2d followed by torch.nn.PixelShuffle, something like this:
import torch

in_channels = 32
upscale_factor = 2
out_channels = 16

upsampling = torch.nn.Sequential(
    torch.nn.Conv2d(
        in_channels,
        out_channels * upscale_factor * upscale_factor,
        kernel_size=3,
        padding=1,
    ),
    torch.nn.PixelShuffle(upscale_factor),
)

image = torch.randn(1, 32, 16, 16)
upsampling(image).shape  # [1, 16, 32, 32]
This allows the network to learn how to upsample the image, instead of merely using torch.nn.Upsample, which the network has no control over (and with the trick below it should also be free of checkerboard artifacts).
Additionally, ICNR initialization for the Conv2d should help (possible implementations here or here). This scheme initializes the weights so that the layer acts like nearest-neighbor upsampling at the beginning of training (research paper here).
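A minimal sketch of one possible ICNR implementation for the Conv2d above (the helper name icnr_ is just illustrative, not taken from the linked implementations):

import torch

def icnr_(weight, upscale_factor=2, init=torch.nn.init.kaiming_normal_):
    # weight has shape (out_channels * upscale_factor**2, in_channels, k, k)
    out_channels = weight.shape[0] // (upscale_factor ** 2)
    sub_kernel = torch.empty(out_channels, *weight.shape[1:])
    init(sub_kernel)
    # Repeat each filter upscale_factor**2 times so that, right after
    # PixelShuffle, every pixel of an upscaled block shares the same value,
    # i.e. the layer starts out behaving like nearest-neighbor upsampling.
    sub_kernel = sub_kernel.repeat_interleave(upscale_factor ** 2, dim=0)
    with torch.no_grad():
        weight.copy_(sub_kernel)

icnr_(upsampling[0].weight, upscale_factor)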
Related
I'm trying to write a net using PyTorch and I'm facing some problems. I tried to debug some of the errors, and I still get one:
File "/FCRN_B.py", line 39, in forward
x=torch.nn.functional.max_pool2d(F.relu(self.conv4(x)),(5,5))
The network is the following: Click here to see the image
My code is the following:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 2x2 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.conv3 = nn.Conv2d(64, 128, 3)
        self.conv4 = nn.Conv2d(128, 256, 5)
        self.conv4 = nn.Conv2d(256, 256, 5)
        self.conv4 = nn.Conv2d(256, 256, 5)
        self.upsample1 = nn.Upsample(scale_factor=1, mode='nearest')
        self.upsample2 = nn.Upsample(scale_factor=1, mode='nearest')
        self.upsample3 = nn.Upsample(scale_factor=8, mode='nearest')

    def forward(self, x):
        # Max pooling over a (2,2) window
        x = torch.squeeze(x, 1)
        x = F.relu(self.conv1(x))
        x = torch.nn.functional.max_pool2d(F.relu(self.conv2(x)), (3, 3))
        x = F.relu(self.conv3(x))
        x = torch.nn.functional.max_pool2d(F.relu(self.conv4(x)), (5, 5))
        x = F.relu(F.conv2d(self.upsample1(x)))
        x = F.relu(F.conv2d(self.upsample2(x)))
        x = F.relu(F.conv2d(self.upsample1(x)))
        return x

net = Net()
I think you just have several layers that are named the same. Rename them and it should work:
self.conv4=nn.Conv2d(128,256,5)
self.conv5=nn.Conv2d(256,256,5)
self.conv6=nn.Conv2d(256,256,5)
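Note that forward would then also need to call the renamed layers; a rough sketch of the relevant lines (replacing the single self.conv4 call, everything else unchanged) might be:

x = F.relu(self.conv4(x))
x = F.relu(self.conv5(x))
x = torch.nn.functional.max_pool2d(F.relu(self.conv6(x)), (5, 5))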
You have reused the name conv4 to define several different layers, which in the end causes input-mismatch issues. I would fix it like this:
self.conv4 = nn.Sequential(
    nn.Conv2d(128, 256, 5),
    nn.Conv2d(256, 256, 5),
    nn.Conv2d(256, 256, 5)
)
This assumes you were using those conv4 layers to denote a logical group of convolutions that together form a unit of operation. For the sake of readability, this is a good choice.
I am using pretrained models to classify images. My question is what kind of layers I have to add after the pretrained model structure in my own model, and why the two implementations below differ. To be specific:
Consider two examples, one using the cats and dogs dataset:
One implementation can be found here. The crucial point is that the base model:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
base_model.trainable = False
is frozen, and a GlobalAveragePooling2D() is added before a final tf.keras.layers.Dense(1). So the model structure looks like:
model = tf.keras.Sequential([
    base_model,
    global_average_layer,
    prediction_layer
])
which is equivalent to:
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1)
])
So they added not only a final Dense(1) layer, but also a GlobalAveragePooling2D() layer before it.
The other example uses the tf flowers dataset:
This implementation is different: a GlobalAveragePooling2D() is not added.
feature_extractor_url = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
                                         input_shape=(224, 224, 3))
feature_extractor_layer.trainable = False

model = tf.keras.Sequential([
    feature_extractor_layer,
    layers.Dense(image_data.num_classes)
])
Here image_data.num_classes is 5, representing the different flower classes. So in this example a GlobalAveragePooling2D() layer is not added.
I do not understand this. Why is it different? When should a GlobalAveragePooling2D() be added, and when not? And which is better / what should I do?
I am not sure whether the reason is that cats vs. dogs is a binary classification task while the flowers dataset is a multiclass classification problem, or whether the difference is that tf.keras.applications.MobileNetV2 was used to load MobileNetV2 in one case and hub.KerasLayer was used to get the feature_extractor in the other. When I check the model in the first implementation:
I can see that the last layer is a relu activation layer.
When I check the feature_extractor:
model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(1)
])
model.summary()
I get the output:
So maybe the reason is also that I do not understand the difference between tf.keras.applications.MobileNetV2 and hub.KerasLayer. I know that hub.KerasLayer just gives me the feature extractor, but I still don't think I understand the difference between these two approaches.
I cannot check the layers of the feature_extractor itself: feature_extractor.summary() and feature_extractor.layers do not work. How can I inspect the layers here? And how do I know whether I should add a GlobalAveragePooling2D() or not?
Summary
Why is it different? When should a GlobalAveragePooling2D() be added, and when not? And which is better / what should I do?
In the first case, the base model outputs 4-dimensional tensors that are the raw outputs of the last convolutional layer. So you need to flatten them somehow, and in that example GlobalAveragePooling2D is used (but you could use any other strategy). I can't tell which is better: it depends on your problem, and depending on how the hub.KerasLayer version implements the flattening, they could be exactly the same. That said, I'd just pick one of them and move on: I don't see huge differences between them.
Long answer: understanding the Keras implementation
The difference is in the output of the two base models: in your Keras example, the outputs have shape (bz, hh, ww, nf), where bz is the batch size, hh and ww are the height and width of the last convolutional layer's output, and nf is the number of filters (convolutions) applied in that last layer.
So: this is the raw output of the last convolutions (or filters) of the base model.
Hence, you need to convert those outputs (which you can think of as images) into vectors of shape (bz, n_feats), where n_feats is the number of features the base model computes. Once this conversion is done, you can stack your classification layer (or as many layers as you want) on top, because at this point you have vectors.
How to compute this conversion? Some common alternatives are taking the average or the maximum over the spatial dimensions of the convolutional output (which reduces the size), reshaping it into a single row, or adding more convolutional layers until you get a vector as output (I strongly suggest following the usual practices of average or maximum).
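As a small illustration of these options (the shapes here are just an example of what a base model with include_top=False might return):

import tensorflow as tf

# A (bz, hh, ww, nf) feature map, like the one a base model returns
features = tf.random.normal((8, 7, 7, 1280))

pooled = tf.keras.layers.GlobalAveragePooling2D()(features)  # shape (8, 1280)
flattened = tf.keras.layers.Flatten()(features)              # shape (8, 62720), i.e. 7 * 7 * 1280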
In your first example, when calling tf.keras.applications.MobileNetV2, you are using the default policy with respect to this last layer, and hence the last convolutional output is left "as is": a stack of convolution maps. You can modify this behavior with the pooling parameter, as documented here:
pooling: Optional pooling mode for feature extraction when include_top is False.
None (default) means that the output of the model will be the 4D tensor output of the last convolutional block.
avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor.
max means that global max pooling will be applied.
In summary, in your first example you build the base model without explicitly telling it what to do with the last layer, so the model keeps returning 4-dimensional tensors that you immediately convert to vectors with average pooling. You can avoid this explicit average pooling if you tell Keras to do it for you:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               pooling='avg',  # Tell Keras to average the last layer
                                               weights='imagenet')
base_model.trainable = False

model = tf.keras.Sequential([
    base_model,
    # global_average_layer, -> not needed any more
    prediction_layer
])
TFHub implementation
Finally, when you use the TensorFlow Hub implementation, since you picked the feature_vector version of the model, it already applies some kind of pooling (I haven't tracked down exactly which) to make sure the model outputs vectors rather than 4-dimensional tensors. So you don't need to explicitly add a layer to convert them, because it is already done.
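One quick way to check this is to call the hub layer on a dummy batch and look at the shape of what comes out (a sketch; the exact feature length is whatever the hub module defines, around 1280 for MobileNetV2):

import tensorflow as tf
import tensorflow_hub as hub

feature_extractor_layer = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2",
    input_shape=(224, 224, 3))

dummy = tf.zeros((1, 224, 224, 3))
# Already a (batch, features) matrix, so no extra pooling/flattening is needed
print(feature_extractor_layer(dummy).shape)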
Personally, I prefer the Keras implementation, since it gives you more freedom to pick the strategy you want (in fact, you could keep stacking whatever you want on top).
Let's say there is a model that takes [1, 208, 208, 3] images and has 6 pooling layers with kernels [2, 2, 2, 2, 2, 7], which results in a [1, 1, 1, 2048] feature column per image, given 2048 filters in the last conv layer. Note how the last pooling layer accepts [1, 7, 7, 2048] inputs.
If we relax the constraints on the input image (which is typically the case for object detection models), then after the same set of pooling layers an image of size [1, 104, 208, 3] would produce a pre-last-pooling output of [1, 4, 7, 2048], and [1, 256, 408, 3] would yield [1, 8, 13, 2048]. These maps carry about the same amount of information as the original [1, 7, 7, 2048], but the original fixed-size pooling layer would no longer produce a [1, 1, 1, N] feature column. That is why we switch to a global pooling layer.
In short, a global pooling layer is important if we don't have a strict restriction on the input image size (and don't resize the image as the first op in the model).
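A tiny sketch of why global pooling works here while a fixed 7x7 pooling kernel would not:

import tensorflow as tf

gap = tf.keras.layers.GlobalAveragePooling2D()
# Different spatial sizes with the same channel count all collapse to
# a fixed-length feature vector:
print(gap(tf.zeros((1, 7, 7, 2048))).shape)   # (1, 2048)
print(gap(tf.zeros((1, 4, 7, 2048))).shape)   # (1, 2048)
print(gap(tf.zeros((1, 8, 13, 2048))).shape)  # (1, 2048)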
I think the difference lies in the outputs of the two models:
"https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2" outputs a 1-D feature vector per batch element, so you simply can't apply a Conv2D to it.
The output of tf.keras.applications.MobileNetV2 is probably more complex (a 4-D tensor), so you have more capability to transform it.
I've been trying to convert a TensorFlow model that contains conv2d_transpose layers (used to upscale an image) to Caffe. For the TF layer, I constructed the kernel shape as (3, 3, X, X), where X is the number of channels from the previous layer (the deconv doesn't change the channel count), and specified pad='same', stride=[1,2,2,1], output_shape=(N, 2 * input_shape[1], 2 * input_shape[2], X), where input_shape was the NHWC-format output of a previous conv layer.
The conversion I attempted followed the pattern I've seen/used successfully for converting a Caffe Convolution layer before:
layer = caffe.layers.Deconvolution(prev_layer, name=node.name,
                                   convolution_param=dict(num_output=X, kernel_size=var.shape[0],
                                                          stride=2, pad=0))
... construct net ...
net.params[layer][0].data[:] = tf_weights.transpose((3, 2, 0, 1))
net.params[layer][1].data[:] = tf_biases
The problem I'm seeing is that the output is not the correct size. As is, the code and network produce an output that is too large by 3 pixels in each dimension (I have two conv2d_transpose/Deconvolution layers). Changing pad=0 to pad=1 makes the output too small by 3 instead. Otherwise the output looks more or less like it does in TensorFlow, but the boundaries appear messed up, which I assume results from this padding issue.
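For what it's worth, a back-of-the-envelope check of those numbers, assuming Caffe's usual Deconvolution size formula out = stride * (in - 1) + kernel_size - 2 * pad and TF's 'SAME' rule out = in * stride (the input size below is hypothetical):

def caffe_deconv_out(in_size, kernel_size=3, stride=2, pad=0):
    return stride * (in_size - 1) + kernel_size - 2 * pad

in_size = 13
print(in_size * 2)                       # 26: what TF's 'SAME' transposed conv produces
print(caffe_deconv_out(in_size, pad=0))  # 27: one pixel too large per layer
print(caffe_deconv_out(in_size, pad=1))  # 25: one pixel too small per layer
# Stacking two such layers compounds the difference to +/- 3 pixels,
# which matches the behaviour described above.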
I'm not sure whether this conversion is even possible, as I've read that deconvolution does not necessarily describe the same operation as a transposed convolution. Let a dude know if it's possible / where this goes wrong. Thanks.
P.S. TF 1.5, and freshly installed caffe (as of commit 87e151281d)
I have some background in machine learning and Python, but I am just learning TensorFlow. I am going through the tutorial on deep convolutional neural nets to teach myself how to use it for image classification. Along the way there is an exercise which I am having trouble completing.
EXERCISE: The model architecture in inference() differs slightly from the CIFAR-10 model specified in cuda-convnet. In particular, the top layers of Alex's original model are locally connected and not fully connected. Try editing the architecture to exactly reproduce the locally connected architecture in the top layer.
The exercise refers to the inference() function in the cifar10.py model. The 2nd to last layer (called local4) has a shape=[384, 192], and the top layer has a shape=[192, NUM_CLASSES], where NUM_CLASSES=10 of course. I think the code that we are asked to edit is somewhere in the code defining the top layer:
with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
But I don't see any code that determines the probability of connecting between layers, so I don't know how we can change the model from fully connected to locally connected. Does somebody know how to do this?
I'm also working on this exercise. I'll try and explain my approach properly, rather than just give the solution. It's worth looking back at the mathematics of a fully connected layer (https://www.tensorflow.org/get_started/mnist/beginners).
So the linear algebra for a fully connected layer is:
y = W * x + b
where x is the n-dimensional input vector, b is an n-dimensional vector of biases, and W is an n-by-n matrix of weights. The i-th element of y is the sum of the i-th row of W multiplied element-wise with x.
So, if you only want y[i] connected to x[i-1], x[i], and x[i+1], you simply set all values in the i-th row of W to zero apart from the (i-1)-th, i-th, and (i+1)-th columns of that row. Therefore, to create a locally connected layer, you simply constrain W to be a banded matrix (https://en.wikipedia.org/wiki/Band_matrix), where the width of the band equals the size of the locally connected neighbourhoods you want. TensorFlow has a function for keeping only the band of a matrix (tf.batch_matrix_band_part(input, num_lower, num_upper, name=None)).
This seems to me to be the simplest mathematical solution to the exercise.
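A rough sketch of that masking idea (the sizes here are illustrative; tf.linalg.band_part is the modern name for the function above):

import tensorflow as tf

n = 192   # layer width, matching local4's output size in the tutorial
band = 1  # neighbourhood radius: y[i] sees x[i-1], x[i], x[i+1]

W = tf.Variable(tf.random.normal((n, n), stddev=0.04))
b = tf.Variable(tf.zeros(n))

# Zero everything outside the band around the diagonal, which enforces
# the local connectivity described above.
W_banded = tf.linalg.band_part(W, band, band)

x = tf.random.normal((n,))
y = tf.linalg.matvec(W_banded, x) + b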
I'll try to answer your question, although I'm not 100% sure I got it right either.
Looking at the cuda-convnet architecture, we can see that the TensorFlow and cuda-convnet implementations start to differ after the second pooling layer.
The TensorFlow implementation uses two fully connected layers and a softmax classifier.
cuda-convnet uses two locally connected layers, one fully connected layer, and a softmax classifier.
The code snippet you included refers only to the softmax classifier and is in fact shared between the two implementations. To reproduce the cuda-convnet implementation in TensorFlow, we have to replace the existing fully connected layers with two locally connected layers and one fully connected layer.
Since TensorFlow doesn't have locally connected layers as part of the SDK, we have to figure out a way to implement them using the existing tools. Here is my attempt at implementing the first locally connected layer:
with tf.variable_scope('local3') as scope:
    shape = pool2.get_shape()
    h = shape[1].value
    w = shape[2].value
    sz_local = 3  # kernel size
    sz_patch = (sz_local**2) * shape[3].value
    n_channels = 64

    # Extract 3x3 tensor patches
    patches = tf.extract_image_patches(pool2, [1, sz_local, sz_local, 1], [1, 1, 1, 1], [1, 1, 1, 1], 'SAME')
    weights = _variable_with_weight_decay('weights', shape=[1, h, w, sz_patch, n_channels], stddev=5e-2, wd=0.0)
    biases = _variable_on_cpu('biases', [h, w, n_channels], tf.constant_initializer(0.1))

    # "Filter" each patch with its own kernel
    mul = tf.multiply(tf.expand_dims(patches, axis=-1), weights)
    ssum = tf.reduce_sum(mul, axis=3)
    pre_activation = tf.add(ssum, biases)
    local3 = tf.nn.relu(pre_activation, name=scope.name)
I'm following this tutorial on using Keras to train a basic conv-net. I find a couple of things confusing though, and the Keras documentation doesn't go into much detail either.
Let's look at the first few layers of the network:
model = Sequential()
model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1,28,28)))
model.add(Convolution2D(32, 3, 3, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
My Questions:
The tutorial describes the first layer as the "input layer". However, the first line includes a Convolution2D function with an input_shape. Am I correct in assuming that this is actually the first hidden layer (a convolution layer) rather than just the input layer, the reason being that we don't need a separate model.add() statement just for the input?
In the Convolution2D() function, we're using 32 filters, each filter being 3x3 pixels. In my understanding, a filter is a small block of pixels which "scans" across the image. So for a 28x28 image, wouldn't we need 676 filters (26*26, since each filter is 3x3)? What does the 32 here mean?
The last line is a Dropout layer. From my understanding, Dropout is a regularization technique, and it's applied to the whole network. So does the Dropout(0.25) here apply a 25% dropout only to the previous layer? Or does it apply to all layers preceding it?
Thanks.
Whenever you call model.fit(), you pass in the 28x28 images, so that is the input to the model; there is no separate input layer to add.
On top of that input, we then perform convolutions to generate feature maps.
A convolution slides each 3x3 filter across the image and takes a dot product with every local patch. So the output of a single filter in the first layer is a 26x26 feature map, and we have 32 such maps, one per filter; that is what the 32 means (you do not need one filter per position). See this for further explanation: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
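A quick way to see those shapes with the current Keras API (channels-last ordering here, whereas the tutorial's snippet above uses the older Convolution2D / channels-first style):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
])
model.summary()  # output shape (None, 26, 26, 32): 32 feature maps, each 26x26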
The Dropout(0.25) applies dropout with probability 0.25 only to the output of the layer immediately preceding it, not to every layer before it.