Zero-padding for ResNet shortcut connections when the channel number increases - python

I would like to implement ResNet in Keras with shortcut connections that add zero entries when the feature/channel dimensions mismatch, as described in the original paper:
When the dimensions increase (dotted line shortcuts in Fig. 3), we
consider two options: (A) The shortcut still performs identity
mapping, with extra zero entries padded for increasing dimensions ...
http://arxiv.org/pdf/1512.03385v1.pdf
However, I wasn't able to implement it, and I can't seem to find an answer on the web or in the source code. All the implementations that I found use the 1x1 convolution trick for shortcut connections when dimensions mismatch.
The layer I would like to implement would basically concatenate the input tensor with an all-zeros tensor to compensate for the dimension mismatch.
The idea would be something like this, but I could not get it working:
def zero_pad(x, shape):
    return K.concatenate([x, K.zeros(shape)], axis=1)
Does anyone have an idea of how to implement such a layer?
Thanks a lot.

The question was answered on github:
https://github.com/fchollet/keras/issues/2608
It would be something like this:
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Lambda
from keras import backend as K
def zeropad(x):
    # Concatenate an all-zeros copy of x along the channel axis (channels-first),
    # doubling the number of channels.
    y = K.zeros_like(x)
    return K.concatenate([x, y], axis=1)

def zeropad_output_shape(input_shape):
    shape = list(input_shape)
    assert len(shape) == 4
    shape[1] *= 2
    return tuple(shape)

def shortcut(input_layer, nb_filters, output_shape, zeros_upsample=True):
    # TODO: Figure out why zeros_upsample doesn't work in Theano
    if zeros_upsample:
        # Halve the spatial resolution, then zero-pad the channels.
        x = MaxPooling2D(pool_size=(1, 1),
                         strides=(2, 2),
                         border_mode='same')(input_layer)
        x = Lambda(zeropad, output_shape=zeropad_output_shape)(x)
        return x
    else:
        # Options B, C in ResNet paper...
        ...

This works for me, even in lazy (non-eager) evaluation mode, and does not require access to another tensor that already has the correct shape (as the zeros_like approach does). Here D is the desired number of channels and nn is the tensor we're trying to pad.
import tensorflow as tf

def pad_depth(nn, D):
    # Pad the last (channel) axis of `nn` with zeros until it has D channels.
    deltaD = D - nn.shape[-1]
    paddings = [[0, 0]] * len(nn.shape.as_list())  # no padding on any axis...
    paddings[-1] = [0, deltaD]                     # ...except the channel axis
    nn = tf.pad(nn, paddings)
    return nn
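As a quick sanity check (my own example, assuming TF 2.x and channels-last tensors), padding a 16-channel feature map up to 64 channels would look like this:
import tensorflow as tf

x = tf.zeros((1, 8, 8, 16))   # a feature map with 16 channels
y = pad_depth(x, 64)          # zero-pad the channel axis up to 64
print(y.shape)                # (1, 8, 8, 64)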

If you are still looking for it, I implemented it in my GitHub repository. Please take a look at https://github.com/nellopai/deepLearningModels.
None of the solutions that I found online really worked or were coherent with the ResNet paper. You can find more details in the repo under code/networks/resNet50. The correct method to implement is:
import tensorflow as tf

def pad_depth(x, desired_channels):
    # Append zeros_like(x) copies along the channel axis until the desired
    # channel count is reached.
    new_channels = desired_channels - x.shape.as_list()[-1]
    output = tf.identity(x)
    repetitions = new_channels / x.shape.as_list()[-1]
    for _ in range(int(repetitions)):
        zeroTensors = tf.zeros_like(x, name='pad_depth1')
        output = tf.keras.backend.concatenate([output, zeroTensors])
    return output
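For context, here is a hedged sketch (my own illustration, not code from the repo) of how such a pad_depth helper could be wired into an option-(A) shortcut. It assumes tf.keras with channels-last tensors, where x is the block input and out is the block output with twice the channels and half the spatial size:
import tensorflow as tf
from tensorflow.keras import layers

# Downsample the identity spatially, zero-pad its channels, then add it
# to the residual branch output (option A in the ResNet paper).
identity = layers.MaxPooling2D(pool_size=1, strides=2, padding='same')(x)
identity = layers.Lambda(lambda t: pad_depth(t, out.shape[-1]))(identity)
y = layers.Add()([out, identity])
Note that this pad_depth appends whole zeros_like copies of its input, so it assumes the desired channel count is a multiple of the current one, which holds for the standard ResNet transitions (64→128→256→512).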

Related

Pytorch submodules output shape

How is the output shape of submodules in PyTorch determined? Why is the output shape of a certain submodule modified in the code below?
When I separate the head of a classical classifier from its backbone in the following way:
import torch, torchvision
from torchsummary import summary
effnet = torchvision.models.efficientnet_b0(num_classes = 2)
backbone = torch.nn.Sequential(*(list(effnet.children())[0]))
adaptive_pool = list(effnet.children())[1]
head = list(effnet.children())[2]
model = torch.nn.Sequential(*[backbone, adaptive_pool, head])
summary(model, (3,256,256), device = 'cpu') # <== Error
I get the following error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2560x1 and 1280x2)
This error is due to the modified output shape of the submodule adaptive_pool. To work around this problem, a flatten module can be used as follows:
class flatten(torch.nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

model = torch.nn.Sequential(*[backbone, adaptive_pool, flatten(), head])
summary(model, (3,256,256), device = 'cpu')
Why is the output shape of the submodule adaptive_pool modified?
The output of an nn.AdaptiveAvgPool2d is 4D even if the average is computed globally, i.e. with output_size=1. In other words, the output shape of your global pooling layer is (N, C, 1, 1). This means you indeed need to flatten it before the fully connected layer.
In the original EfficientNet classification network, the flattening operation is done directly in the forward logic without a dedicated layer. See this line.
Instead of implementing your own flattening layer, you can use the built-in nn.Flatten. More details about this module can be found here.
>>> model = nn.Sequential(backbone, adaptive_pool, nn.Flatten(1), head)
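To make the shape issue concrete, here is a small check (the 1280-channel, 8x8 feature map is just an illustrative stand-in for the EfficientNet-B0 backbone output):
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(output_size=1)
feat = torch.randn(2, 1280, 8, 8)        # hypothetical backbone output for a batch of 2
print(pool(feat).shape)                  # torch.Size([2, 1280, 1, 1]) -- still 4D
print(nn.Flatten(1)(pool(feat)).shape)   # torch.Size([2, 1280]) -- ready for the linear head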

Where does the additional dimension of the Input in a keras.Model come from?

When I define a model like:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
input_shape = (20,20)
input = tf.keras.Input(shape=input_shape)
nn = layers.Flatten()(input)
nn = layers.Dense(10)(nn)
output = layers.Activation('sigmoid')(nn)
model = tf.keras.Model(inputs=input, outputs=output)
Why do I need to add another dimension to my actual input:
actual_input = np.ones((1,20,20))
prediction = model.predict(actual_input)
why can't I just do actual_input = np.ones((20,20))?
Edit:
In the docs it says something about batch size... Is this batch size somehow related to my question? If so, why would I need it when I want to predict with my model?
Thanks for any help.
In Keras (TensorFlow), you cannot predict on an input that lacks a batch dimension. Therefore, even if you have a single example, you need to add a batch axis to it.
Practically, in this situation you have a batch size of 1, hence the batch axis.
This is how TensorFlow and Keras are built: even for a single prediction you need the batch axis (a batch size of 1 == 1 single example).
You can use np.expand_dims(input, axis=0) or tf.expand_dims(input, axis=0) to transform your input into a suitable format for prediction.
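For example, with the model defined above, the (20, 20) input just needs a leading batch axis:
import numpy as np

actual_input = np.ones((20, 20))                # a single example without a batch axis
batched = np.expand_dims(actual_input, axis=0)  # shape (1, 20, 20): a batch of one
prediction = model.predict(batched)             # prediction has shape (1, 10)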

Jacobian of decoder in VAE

I am new to both ML and this forum, so please be kind.
What I would like to compute is the Jacobian of the decoder part of the VAE w.r.t. the latent vectors. I have found the batch_jacobian function in tensorflow.python.ops.parallel_for.gradients that in principle could do the job for me. However, I could not get this function to work.
More specifically, I tried:
xin = tf.constant(np.array(latent_vector[2]),dtype=tf.float32)
f_of_xin = tf.constant(decoder.predict(tf.Session().run(xin), batch_size = 1000000, verbose=1),dtype=tf.float32)
jac = batch_jacobian(f_of_xin,xin)
This does not work (returns a None shape), while:
f_of_xin = tf.sin(tf.sin(xin))
jac = batch_jacobian(f_of_xin,xin)
works (returns a 3x3 matrix (# of input & output dim = 3) with sensible numbers).
I tried:
f_of_xin = tf.sin(tf.sin(tf.constant(tf.Session().run(xin))))
as well, in which case batch_jacobian does not work anymore (meaning it returns a None shape). I guess it has to do with the data conversion, but I don't know how to feed a tf structure to the model.predict function. By the way, my model here (decoder) is just a simple neural network with 1 hidden layer. Could you help me?
Kind regards,
Melissa
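(This is not an answer from the original thread, just a hedged sketch of the usual fix: batch_jacobian can only differentiate through operations that stay connected in the graph, whereas tf.constant(decoder.predict(...)) bakes the decoder output into a constant with no link back to xin. Assuming decoder is a Keras model and latent_vector is the array from the question, a TF1-style graph-mode version might look like this.)
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.parallel_for.gradients import batch_jacobian

# Keep a batch dimension so batch_jacobian returns shape (batch, out_dim, in_dim).
xin = tf.constant(np.array(latent_vector[2:3]), dtype=tf.float32)
f_of_xin = decoder(xin)              # symbolic call: keeps the graph connection to xin
jac = batch_jacobian(f_of_xin, xin)

sess = tf.keras.backend.get_session()  # reuse the session holding the decoder's weights
print(sess.run(jac))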

Computing cosine similarity between two tensors in Keras

I have been following a tutorial that shows how to make a word2vec model.
This tutorial uses this piece of code:
similarity = merge([target, context], mode='cos', dot_axes=0) (no other info was given, but I suppose this comes from keras.layers)
Now, I've researched a bit on the merge method but I couldn't find much about it.
From what I understand, it has been replaced by a number of layers such as layers.Add(), layers.Concatenate(), and so on.
What should I use? There's layers.Dot(), which has an axes parameter (which seems to be correct) but no mode parameter.
What can I use in this case?
The Dot layer in Keras now supports built-in cosine similarity using the normalize=True parameter.
From the Keras Docs:
keras.layers.Dot(axes, normalize=True)
normalize: Whether to L2-normalize samples along the dot product axis before taking the dot product. If set to True, then the output of the dot product is the cosine proximity between the two samples.
Source
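So in the tutorial's setting, the old merge call could be replaced along these lines (hedged: target and context are the embedding tensors from the tutorial, and the axes value has to point at whichever axis holds the embedding dimension in your tensors; for plain (batch, embedding_dim) tensors that is axis 1):
from keras.layers import Dot

# L2-normalizes both inputs along `axes`, then takes their dot product,
# which yields the cosine similarity.
similarity = Dot(axes=1, normalize=True)([target, context])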
There are a few things that are unclear from the Keras documentation that I think are crucial to understanding:
For each Merge function in the Keras documentation, there is a lowercase and an uppercase version defined, e.g. add() and Add().
On Github, farizrahman4u outlines the differences:
Merge is a layer.
Merge takes layers as input
Merge is usually used with Sequential models
merge is a function.
merge takes tensors as input.
merge is a wrapper around Merge.
merge is used in Functional API
Using Merge:
left = Sequential()
left.add(...)
left.add(...)
right = Sequential()
right.add(...)
right.add(...)
model = Sequential()
model.add(Merge([left, right]))
model.add(...)
using merge:
a = Input((10,))
b = Dense(10)(a)
c = Dense(10)(a)
d = merge([b, c])
model = Model(a, d)
To answer your question, since Merge has been deprecated, we have to define and build a layer ourselves for the cosine similarity. In general this will involve using those lowercase functions, which we wrap within a Lambda to create a layer that we can use within a model.
I found a solution here:
from keras import backend as K
from keras.layers import Lambda

def cosine_distance(vests):
    # L2-normalize both inputs, then return the negative mean of their
    # elementwise product (proportional to negative cosine similarity).
    x, y = vests
    x = K.l2_normalize(x, axis=-1)
    y = K.l2_normalize(y, axis=-1)
    return -K.mean(x * y, axis=-1, keepdims=True)

def cos_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([processed_a, processed_b])
Depending on your data, you may want to remove the L2 normalization. What is important to note about the solution is that it is built using Keras backend functions, e.g. K.mean() - I think this is necessary when defining custom layers or even loss functions.
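For completeness, here is a self-contained toy wiring of that Lambda; the input shapes and the two Dense branches are placeholders of my own, not part of the original answer:
from keras.layers import Input, Dense, Lambda
from keras.models import Model

input_a = Input(shape=(32,))
input_b = Input(shape=(32,))
processed_a = Dense(16)(input_a)
processed_b = Dense(16)(input_b)

# Each output entry is -mean(x * y) of the two L2-normalized 16-d vectors,
# i.e. proportional to the negative cosine similarity.
distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([processed_a, processed_b])
model = Model(inputs=[input_a, input_b], outputs=distance)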
Hope I was clear, this was my first SO answer!
Maybe this will help you.
(I spent a lot of time making sure that these really are the same thing.)
import tensorflow as tf

with tf.device('/CPU:' + str(0)):
    print(tf.losses.CosineSimilarity()([1.0, 1.0, 1.0, -1.0], [4.0, 4.0, 4.0, 5.0]))
    print(tf.keras.layers.dot([tf.Variable([[1.0, 1.0, 1.0, -1.0]]),
                               tf.Variable([[4.0, 4.0, 4.0, 5.0]])],
                              axes=1, normalize=True))
Output (Pay attention to the sign):
tf.Tensor(-0.40964404, shape=(), dtype=float32)
tf.Tensor([[0.40964404]], shape=(1, 1), dtype=float32)
If you alter the last code block of the tutorial as follows, you can see that the (average) loss is decreasing nicely with the Dot solution suggested by SantoshGuptaz7 (comment in the question above):
display_after_epoch = 10000
display_after_epoch_2 = 10 * display_after_epoch
loss_sum = 0

for cnt in range(epochs):
    idx = np.random.randint(0, len(labels)-1)
    arr_1[0,] = word_target[idx]
    arr_2[0,] = word_context[idx]
    arr_3[0,] = labels[idx]
    loss = model.train_on_batch([arr_1, arr_2], arr_3)
    loss_sum += loss
    if cnt % display_after_epoch == 0 and cnt != 0:
        print("\nIteration {}, loss={}".format(cnt, loss_sum / cnt))
        loss_sum = 0
    if cnt % display_after_epoch_2 == 0:
        sim_cb.run_sim()

Machine learning size of input and output

At the moment I'm playing around with machine learning in Python based on this website (part two is about image recognition). I would like to train a network to recognize 4 specific points in an image, but my problem is:
The neural network is created by simply multiplying matrices together, calculating the delta between the given output and the recognized output, and recalculating the weights in the matrix. Now let's say I have a 600x800 pixel image as input. If I multiply this with my layer matrices I can't get a 4x2 matrix as output (x, y for each point).
My second problem is: how many hidden layers should I have for this problem? Are more layers always better, just slower to compute? Can we guess how many hidden layers we need, or should we test some values and use the best one?
My current neural network code:
from os.path import isfile
import numpy as np

class NeuralNetwork:
    def __init__(self):
        np.random.seed(1)
        self.syn0 = 2 * np.random.random((480000, 8)) - 1

    @staticmethod
    def relu(x, deriv=False):
        if deriv:
            res = np.maximum(x, 0)
            return np.minimum(res, 1)
        return np.maximum(x, 0)

    def train(self, imgIn, out):
        l1 = NeuralNetwork.relu(np.dot(imgIn, self.syn0))
        l1_error = out - l1
        exp = NeuralNetwork.relu(l1, True)
        l1_delta = l1_error * exp
        self.syn0 += np.dot(imgIn.T, l1_delta)
        return l1  # np.abs(out - l1)

    def identify(self, img):
        return NeuralNetwork.relu(np.dot(img, self.syn0))
Problem 1. Input data.
You must flatten (serialize) the input. For example, if you have one 600*800 pixel image, the input must be 1*480000 (rows, cols).
Rows correspond to the number of samples and columns to the dimension of each sample.
Problem 2. Classification.
If you want to classify 4 different classes, you should use a (1,4) vector as output. For example, if there are 4 classes ('Fish', 'Cat', 'Tiger', 'Car'), then the vector (1,0,0,0) means Fish.
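A minimal sketch of both points (a hypothetical 600x800 grayscale image and a 4-class one-hot label):
import numpy as np

img = np.random.rand(600, 800)    # one 600x800 image
x = img.reshape(1, 600 * 800)     # shape (1, 480000): one sample, 480000 features
y = np.array([[1, 0, 0, 0]])      # shape (1, 4): one-hot label for class 'Fish'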
Problem 3. Fully connected network.
I think the example on that homepage uses a fully connected network. It uses the whole image for classification at once. If you want to classify using subsets of the image, you should use a convolutional neural network or another approach. I don't know this area well.
Problem 4. Hyperparameters
It depends on the data. You must test with various hyperparameters, then choose the best ones.
