Convert Tensorflow conv2d_transpose to Caffe Deconvolution - python

I've been trying to convert a tensorflow model to caffe which contains conv2d_transpose layers to upscale an image. For the TF layer, I constructed the kernel shape as (3, 3, X, X) where X is some number of channels from the previous layer (didn't change channels with deconv) and specified parameters pad='same', stride=[1,2,2,1], output_shape=(N, 2 * input_shape[1], 2 * input_shape[2], X) where input_shape was the NHWC format output a previous conv layer.
The conversion I attempted followed from the patttern I've seen/used successfully for converting a caffe Convolution layer before:
layer = caffe.layers.Deconvolution(prev_layer, name=node.name,
convolution_param=dict(num_output=X, kernel_size=var.shape[0],
stride=2, pad=0))
... construct net ...
net.params[layer][0].data[:] = tf_weights.transpose((3,2,0,1))
net.params[layer][1].data[:] = tf_biases
The problem I'm seeing is that the output is not the correct size. As is, the code and network produce an output that is too large by 3 pixels in each dimension (I have two conv2d_transpose/Deconvolution layers). By changing pad=0 to 1 the output becomes similarly too small by 3. Otherwise the output looks more or less like it does in tensorflow, but the boundaries appear messed up which I assume results from this padding issue.
I'm not sure if this conversion is even possible as I've read that deconvolution does not necessarily describe the same operation as a transposed convolution. Let a dude know if it's possible/where this goes wrong. Thanks.
P.S. TF 1.5, and freshly installed caffe (as of commit 87e151281d)

Related

Softmax Output Layer. Which dimension?

I am having a question regarding Neuronal Nets used for image segmentation. I am using a 3D Implementation of Deeplab that can be found here
I am using softmax, so the output layer is the following:
elif self.last_activation.lower() == 'softmax':
output = nn.Softmax()(output)
No dimension is defined, so I want to define it manually. But I am not sure which dimension I need tó set. The dimension of the output tensor is the following:
[batch_size, num_classes, width, height, depth]
So I would think that dim=1 would be correct. Is that correct?
Thanks!
Indeed it should be 1 as you want this axis to be summed to 1.
Be careful if you need to train your network with a crossentropyloss as this latter already include a softmax.

Average Pooling layer in Deep Learning and gradient artifacts

I know that in Convolution layers the kernel size needs to be a multiplication of stride or else it will produce artefacts in gradient calculations like the checkerboard problem.
Now does it also work like that in Pooling layers? I read somewhere that max pooling can also cause problems like that. Take this line in the discriminator for example:
self.downsample = nn.AvgPool2d(3, stride=2, padding=1, count_include_pad=False)
I have a model (MUNIT) with it, and this is the image it produced:
It looks like the checkerboard problem, or at least a gradient problem but I checked my Convolution layers and didn't found the error described above. They all are of size 4 with stride 2 or an uneven size with stride of 1.
This doesn't look like a checkerboard artifact honestly. Also I don't think discriminator would be the problem, it's usually about image restoration (generator or decoder).
Took a quick look at the MUNIT and what they use in Decoder is torch.nn.Upsample with nearest neighbor upsampling (exact code line here).
You may try to use torch.nn.Conv2d followed by torch.nn.PixelShuffle, something like this:
import torch
in_channels = 32
upscale_factor = 2
out_channels = 16
upsampling = torch.nn.Sequential(
torch.nn.Conv2d(
in_channels,
out_channels * upscale_factor * upscale_factor,
kernel_size=3,
padding=1,
),
torch.nn.PixelShuffle(upscale_factor),
)
image = torch.randn(1, 32, 16, 16)
upsampling(image).shape # [1, 16, 32, 32]
This allows neural network to learn how to upsample the image instead of merely using torch.nn.Upsample which the network has no control over (and using below trick it should also be free of checkerboard artifacts).
Additionally, ICNR initialization for Conv2d should also help (possible implementation here or here). This init scheme initializes weights to act similar to nearest neighbor upsampling at the beginning (research paper here).

2D convolution with reparameterization using just tf.nn.convolution

I want to do something like what tfp.layers.Conv2DReparameterization does but simpler - no priors etc.
Given an augmented input x of shape [num_particles, batch, in_height, in_width, in_channels] and a filter of mean f_mean and standard deviation f_std shape [filter_height, filter_width, in_channels, out_channels] which are trainable variables, I use the reparameterization trick to get filter samples:
filter_samples = f_mean + f_std * tf.random_normal(([num_particles] + f_mean.shape))
Thus, filter_samples is of shape [num_particles, filter_height, filter_width, in_channels, out_channels].
Then, I want to do:
output = tf.nn.conv2d(x, filter_samples, padding='SAME') # or VALID
where output should be of shape [num_particles] + standard convolution output shape.
For dense layers, it works to just do a tf.matmul(x, filter_samples), but for conv2d I'm not sure about the results and I can't find the implementation code to check it. Implementing it myself would end up slower than TF code, so I want to avoid it.
For SAME padding, the resulting shape seems okay, for VALID the batch dim is changed making me believe it doesn't work as I expect.
Just to make it clear, I need the output to have the num_particles dim. Code is TF1.x
Any ideas on how to get that?
I think there is some code to do similar in tfp.experimental.nn. We can follow up in the github issues you filed/responded to.

Layers to be used after using a pretrained model: When to add GlobalAveragePooling2D()

I am using pretrained models to classify image. My question is what kind of layers do I have to add after using the pretrained model structure in my model, resp. why these two implementations differ. To be specific:
Consider two examples, one using the cats and dogs dataset:
One implementation can be found here. The crucial point is that the base model:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
weights='imagenet')
base_model.trainable = False
is frozen and a GlobalAveragePooling2D() is added, before a final tf.keras.layers.Dense(1) is added. So the model structure looks like:
model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])
which is equivalent to:
model = tf.keras.Sequential([
base_model,
tf.keras.layers.GlobalAveragePooling2D()
tf.keras.layers.Dense(1)
])
So they added not only a final dense(1) layer, but also a GlobalAveragePooling2D() layer before.
The other using the tf flowers dataset:
In this implementation it is different. A GlobalAveragePooling2D() is not added.
feature_extractor_url = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
input_shape=(224,224,3))
feature_extractor_layer.trainable = False
model = tf.keras.Sequential([
feature_extractor_layer,
layers.Dense(image_data.num_classes)
])
Where image_data.num_classes is 5 representing the different flower classification. So in this example a GlobalAveragePooling2D() layer is not added.
I do not understand this. Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
I am not sure if the reason is that in one case the dataset cats and dogs is binary classification and in the other it is a multiclass classifcation problem. Or the difference is that in one case tf.keras.applications.MobileNetV2 was used to load MobileNetV2 and in the other implementation hub.KerasLayer was used to get the feature_extractor. When I check the model in the first implementation:
I can see that the last layer is a relu activation layer.
When I check the feature_extractor:
model = tf.keras.Sequential([
feature_extractor,
tf.keras.layers.Dense(1)
])
model.summary()
I get the output:
So maybe reason is also that I do not understand the difference between tf.keras.applications.MobileNetV2 vs hub.KerasLayer. The hub.KerasLayer just gives me the feature extractor. I know this, but still I think I did not get the difference between these two methods.
I cannot check the layers of the feature_extractor itself. So feature_extractor.summary() or feature_extractor.layers does not work. How can I inspect the layers here? And how can I know I should add GlobalAveragePooling2D or not?
Summary
Why is this different? When to add a GlobalAveragePooling2D() or not? And what is better / should I do?
The first case it outputs 4 dimensional tensors that are raw outputs of the last convolutional layer. So, you need to flatten them somehow, and in this example you are using GlobalAveragePooling2D (but you could use any other strategy). I can't tell which is better: it depends on your problem, and depending on how hub.KerasLayer version implemented the flatten, they could be exactly the same. That said, I'd just pickup one of them and go on: I don't see huge differences among them,
Long answer: understanding Keras implementation
The difference is in the output of both base models: in your keras examples, outputs are of shape (bz, hh, ww, nf) where bz is batch size, hh and ww are height and weight of the last convolutional layer in the model and nf is the number of filters (or convolutions) applied in this last layer.
So: this outputs the output of the last convolutions (or filters) of the base model.
Hence, you need to convert those outputs (which you can think them as images) to vectors of shape (bz, n_feats), where n_feats is the number of features the base model is computing. Once this conversion is done, you can stack your classification layer (or as many layers as you want) because at this point you have vectors.
How to compute this conversion? Some common alternatives are taking the average or maximum among the convolutional output (which reduces the size), or you could just reshape them as a single row, or add more convolutional layers until you get a vector as an output (I strongly suggest to follow usual practices like average or maximum).
In your first example, when calling tf.keras.applications.MobileNetV2, you are using the default police with respect to this last year, and hence, the last convolutional layer is let "as is": some convolutions. You can modify this behavior with the param pooling, as documented here:
pooling: Optional pooling mode for feature extraction when include_top is False.
None (default) means that the output of the model will be the 4D tensor output of the last convolutional block.
avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor.
max means that global max pooling will be applied.
In summary, in your first example, you are building the base model without telling explicitly what to do with the last layer, the model keeps returning 4 dimensional tensors that you immediately convert to vectors with the usage of average pooling, so you can avoid this explicit average pooling if you tell Keras to do it:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
pooling='avg', # Tell keras to average last layer
weights='imagenet')
base_model.trainable = False
model = tf.keras.Sequential([
base_model,
# global_average_layer, -> not needed any more
prediction_layer
])
TFHub implementation
Finally, when you use the TensorFlow Hub implementation, as you picked up the feature_vector version of the model, it already implements some kind of pooling (which I didn't found yet how) to make sure the model outputs vectors rather than 4 dimensional tensors. So, you don't need to add explicitly the layer to convert them because it is already done.
In my opinion, I prefer Keras implementation since it gives you more freedom to pick the strategy you want (in fact you could keep stacking whatever you want).
Lets say there is a model taking [1, 208, 208, 3] images and has 6 pooling layers with kernels [2, 2, 2, 2, 2, 7] which would result in a feature column for image [1, 1, 1, 2048] for 2048 filters in the last conv layer. Note, how the last pooling layer accepts [1, 7, 7, 2048] inputs
If we relax the constrains for the input image (which is typically the case for object deteciton models) than after same set of pooling layers image of size [1, 104, 208, 3] would produce pre-last-pooling output of [1, 4, 7, 2024] and [1, 256, 408, 3] would yeild [1, 8, 13, 2048]. This maps would have about the same amount information as original [1, 7, 7, 2048] but the original pooling layer would not produce a feature column wiht [1, 1, 1, N]. That is why we switch to global pooling layer.
In short, global pooling layer is important if we don't have strict restriction on the input image size (and don't resize the image as the first op in the model).
I think difference in output of models
"https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/2" has output is 1d vector * batch_size, you just can't apply Conv2D to it.
Output of tf.keras.applications.MobileNetV2 probably more complex, thus you have more capability to transform one.

Keras: feed images into CNN and get image output

So far, I've been practicing neural networks on numerical datasets in pandas, but now I need to create a model that will take an image as input and output a binary mask of that image.
I have my training data as numpy arrays of shape (602, 2048, 2048, 1). 602 images of dimensions 2048x2048 with one channel. The array of output masks have the same dimensions.
What I can't figure out is how to define the first layer or how to correctly feed the data into the model. I would greatly appreciate your help on this issue
Well, this is not a "rule", but probably you will be using mostly 2D conv and related layers.
You feed everything as numpy arrays, as usual, maybe normalizing the values. Common options are:
Between 0 and 1 (just divide by 255.)
Between -1 and 1 (divide by 255., multiply by 2, subtract 1)
Caffe style: subtract from each channel a specific value to "center" the values based on their usual mean without rescaling them.
Your model should start with something like:
inputTensor = Input((2048,2048,1))
output = Conv2D(filters, kernel_size, .....)(inputTensor)
Or, in sequential models: model.add(Conv2D(...., input_shape=(2048,2048,1))
Later, it's up to you to decide which layers to use.
Conv2D
MaxPooling2D
Upsampling2D
Whether you're going to create a linear model or if you're going to divide branches, join branches, etc. is also your call.
Models in a U-Net style should be a good start for you.
What you can't do:
Don't use Flatten layers (actually you can, if you later reshape the output for having image dimensions... but why?)
Don't use Global Pooling layers (you don't want to sacrifice your spatial dimensions)

Categories

Resources