I'm trying to write a network using PyTorch and I'm facing some problems. I debugged some of the errors, but I still get this one:
File "/FCRN_B.py", line 39, in forward
x=torch.nn.functional.max_pool2d(F.relu(self.conv4(x)),(5,5))
The network architecture was given as a linked image.
My code is the following:
class Net(nn.Module):
    def __init__(self):
        super(Net,self).__init__()
        #1 input image channel,6 output channels,2x2 square convolution
        #kernel
        self.conv1=nn.Conv2d(1,32,3)
        self.conv2=nn.Conv2d(32,64,3)
        self.conv3=nn.Conv2d(64,128,3)
        self.conv4=nn.Conv2d(128,256,5)
        self.conv4=nn.Conv2d(256,256,5)
        self.conv4=nn.Conv2d(256,256,5)
        self.upsample1=nn.Upsample(scale_factor=1, mode='nearest')
        self.upsample2=nn.Upsample(scale_factor=1, mode='nearest')
        self.upsample3=nn.Upsample(scale_factor=8, mode='nearest')
    def forward(self,x):
        #Max pooling over a (2,2) window
        x = torch.squeeze(x,1)
        x=F.relu(self.conv1(x))
        x=torch.nn.functional.max_pool2d(F.relu(self.conv2(x)),(3,3))
        x=F.relu(self.conv3(x))
        x=torch.nn.functional.max_pool2d(F.relu(self.conv4(x)),(5,5))
        x=F.relu(F.conv2d(self.upsample1(x)))
        x=F.relu(F.conv2d(self.upsample2(x)))
        x=F.relu(F.conv2d(self.upsample1(x)))
        return x
net = Net()
I think you just have three layers that share the same name. Give each one its own name and this error should go away.
self.conv4=nn.Conv2d(128,256,5)
self.conv5=nn.Conv2d(256,256,5)
self.conv6=nn.Conv2d(256,256,5)
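You have assigned conv4 three times to define different layers, so each assignment overwrote the previous one and only the last Conv2d(256,256,5) remained. conv3 outputs 128 channels, which is why you ended up with an input mismatch. I would fix it like this: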
self.conv4=nn.Sequential(
    nn.Conv2d(128,256,5),
    nn.Conv2d(256,256,5),
    nn.Conv2d(256,256,5)
)
This assumes you were using conv4 to denote a logical group of convolutions that together form a unit of operation. For the sake of readability, this is a good choice.
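To make the overwrite concrete, here is a minimal sketch (the class name Overwritten and the 16x16 input size are made up for illustration). It shows that only the last conv4 assignment is kept, and that the nn.Sequential version accepts the 128-channel tensor coming out of conv3:

```python
import torch
import torch.nn as nn

class Overwritten(nn.Module):
    """Hypothetical module that repeats the triple assignment from the question."""
    def __init__(self):
        super().__init__()
        self.conv4 = nn.Conv2d(128, 256, 5)
        self.conv4 = nn.Conv2d(256, 256, 5)  # replaces the layer above
        self.conv4 = nn.Conv2d(256, 256, 5)  # replaces it again

m = Overwritten()
# Only one layer is registered, and it expects 256 input channels, not 128.
print(len(list(m.children())), m.conv4.in_channels)

# Grouping the three convolutions keeps all of them.
grouped = nn.Sequential(
    nn.Conv2d(128, 256, 5),
    nn.Conv2d(256, 256, 5),
    nn.Conv2d(256, 256, 5),
)
x = torch.randn(1, 128, 16, 16)   # stand-in for the output of conv3
out = grouped(x)
print(out.shape)                  # each 5x5 convolution without padding removes 4 pixels per side
```

Note that the grouped version applies no activation between the three convolutions; if you want ReLUs in between, nn.ReLU() modules can be placed inside the same nn.Sequential.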
I know that in convolution layers the kernel size needs to be a multiple of the stride, or else it will produce artefacts in the gradient calculations, like the checkerboard problem.
Now does it also work like that in Pooling layers? I read somewhere that max pooling can also cause problems like that. Take this line in the discriminator for example:
self.downsample = nn.AvgPool2d(3, stride=2, padding=1, count_include_pad=False)
I have a model (MUNIT) with it, and this is the image it produced:
It looks like the checkerboard problem, or at least a gradient problem, but I checked my convolution layers and didn't find the error described above. They are all either of size 4 with stride 2, or of an odd size with stride 1.
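For reference, the kernel-size/stride effect on pooling gradients can be inspected directly. The sketch below (the helper input_gradient and the 8x8 input size are made up for illustration) backpropagates a sum through the AvgPool2d layer from the question and through a 2x2, stride-2 pooling layer, and prints how much gradient each input pixel receives. It only shows that the overlap is uneven when the kernel size is not a multiple of the stride; whether that is enough to cause visible artifacts in a given model is a separate question.

```python
import torch
import torch.nn as nn

def input_gradient(pool, size=8):
    """Gradient of the summed pooled output with respect to each input pixel."""
    x = torch.ones(1, 1, size, size, requires_grad=True)
    pool(x).sum().backward()
    return x.grad[0, 0]

# Kernel 3, stride 2: neighbouring windows overlap by one pixel.
g3 = input_gradient(nn.AvgPool2d(3, stride=2, padding=1, count_include_pad=False))
# Kernel 2, stride 2: windows tile the input without overlap.
g2 = input_gradient(nn.AvgPool2d(2, stride=2))

print(g3[2:6, 2:6])  # interior values alternate between 1/9, 2/9 and 4/9
print(g2[2:6, 2:6])  # every pixel receives 1/4
```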
Honestly, this doesn't look like a checkerboard artifact. Also, I don't think the discriminator would be the problem; checkerboard artifacts usually come from the part that reconstructs the image (generator or decoder).
I took a quick look at MUNIT, and what they use in the Decoder is torch.nn.Upsample with nearest neighbor upsampling (exact code line here).
You may try to use torch.nn.Conv2d followed by torch.nn.PixelShuffle, something like this:
import torch

in_channels = 32
upscale_factor = 2
out_channels = 16

upsampling = torch.nn.Sequential(
    torch.nn.Conv2d(
        in_channels,
        out_channels * upscale_factor * upscale_factor,
        kernel_size=3,
        padding=1,
    ),
    torch.nn.PixelShuffle(upscale_factor),
)

image = torch.randn(1, 32, 16, 16)
upsampling(image).shape # [1, 16, 32, 32]
This allows the neural network to learn how to upsample the image instead of merely using torch.nn.Upsample, which the network has no control over (and with the trick below it should also be free of checkerboard artifacts).
Additionally, ICNR initialization for Conv2d should also help (possible implementation here or here). This init scheme initializes weights to act similar to nearest neighbor upsampling at the beginning (research paper here).
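A minimal sketch of the ICNR idea, in case the linked implementations are unavailable. The function name icnr_ and its arguments are my own, not a PyTorch API: it initializes one sub-kernel per output channel and copies it upscale_factor**2 times, so that every pixel of each upsampled block starts out with the same value, as in nearest neighbor upsampling. The bias is zeroed here so that it does not break that equality.

```python
import torch
import torch.nn as nn

def icnr_(weight, upscale_factor=2, init=nn.init.kaiming_normal_):
    """Hypothetical helper: ICNR-style in-place init for a conv feeding PixelShuffle."""
    out_channels, in_channels, kh, kw = weight.shape
    r2 = upscale_factor ** 2
    assert out_channels % r2 == 0, "out_channels must be divisible by upscale_factor**2"
    sub_kernel = torch.empty(out_channels // r2, in_channels, kh, kw)
    init(sub_kernel)
    # PixelShuffle builds output channel c from input channels c*r2 ... c*r2 + r2 - 1,
    # so those r2 consecutive filters are made identical.
    with torch.no_grad():
        weight.copy_(sub_kernel.repeat_interleave(r2, dim=0))
    return weight

conv = nn.Conv2d(32, 16 * 2 * 2, kernel_size=3, padding=1)
icnr_(conv.weight, upscale_factor=2)
nn.init.zeros_(conv.bias)

upsampling = nn.Sequential(conv, nn.PixelShuffle(2))
out = upsampling(torch.randn(1, 32, 16, 16))
# Right after initialization, each 2x2 output block holds a single repeated value.
print(out.shape, torch.allclose(out[..., 0::2, 0::2], out[..., 1::2, 1::2], atol=1e-5))
```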
I've been trying to teach myself the basics of RNNs with a personal project in PyTorch. I want to produce a simple network that is able to predict the next character in a sequence (the idea comes mainly from this article, http://karpathy.github.io/2015/05/21/rnn-effectiveness/, but I wanted to do most of the work myself).
My idea is this: I take a batch of B input sequences of size n (NumPy arrays of n integers), one-hot encode them and pass them through my network, which is composed of several LSTM layers, one fully connected layer and one softmax unit.
I then compare the output to the target sequences which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following :
class Model(nn.Module):
    def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
        super().__init__()
        self.num_units = num_units  # stored because forward_pass uses it for the reshape
        self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers, batch_first = True, dropout = dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)
        self.fully_connected = nn.Linear(num_units, one_hot_length)
        self.softmax = nn.Softmax(dim = 1)
        # dim = 1 as the tensor is of shape (batch_size*seq_length, one_hot_length) when entering the softmax unit
    def forward_pass(self, input_seq, hc_states):
        output, hc_states = self.LSTM(input_seq, hc_states)
        output = output.view(-1, self.num_units)
        output = self.fully_connected(output)
        # I simply comment out the next line when I run the network without the softmax layer
        output = self.softmax(output)
        return output, hc_states
one_hot_length is the size of my character dictionary (~200, also the size of a one-hot encoded vector).
num_units is the number of hidden units in a LSTM cell, num_layers the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows :
input, target = next_batches(data, batch_pointer)
input = nn.functional.one_hot(input, num_classes = one_hot_length).float()
for state in hc_states:
    state.detach_()
optimizer.zero_grad()
output, states = net.forward_pass(input, hc_states)
# nn.CrossEntropyLoss is a module: instantiate it, then call it on (B*n, one_hot_length) outputs and (B*n,) targets
loss = nn.CrossEntropyLoss()(output, target.view(-1))
loss.backward()
nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
optimizer.step()
Here hc_states is a tuple holding the hidden-state tensor and the cell-state tensor, input is a tensor of size (B, n, one_hot_length), and target is of size (B, n).
I'm training on a really small dataset (sentences in a .txt file of ~400 KB) just to tune my code. I did 4 different runs with different parameters, and each time the outcome was the same: the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without it.
I don't think it is an issue with tensors shapes as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual approach is to put a softmax unit at the end to get the "probability" of each character appearing, but clearly this isn't right.
Any ideas to help me?
I'm also fairly new to PyTorch and RNNs, so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me, and thanks in advance.
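One thing worth checking here, offered as a likely cause rather than a confirmed diagnosis: nn.CrossEntropyLoss applies log-softmax internally, so it expects the raw scores of the fully connected layer. If the scores have already been passed through nn.Softmax, the loss receives values between 0 and 1 and can only move within a narrow range, which leaves very little signal to learn from. The sketch below uses random tensors (8 samples, 200 classes, chosen only to mirror the sizes in the question) to show both cases:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 200
logits = torch.randn(8, num_classes) * 5          # stand-in for the fully connected output
target = torch.randint(0, num_classes, (8,))

criterion = nn.CrossEntropyLoss()

# Intended use: raw scores go straight into the loss.
loss_from_logits = criterion(logits, target)

# Same value written out by hand: log-softmax followed by negative log-likelihood.
loss_by_hand = nn.NLLLoss()(torch.log_softmax(logits, dim=1), target)

# Softmax applied twice: the inputs to the loss are squashed into [0, 1].
loss_double_softmax = criterion(torch.softmax(logits, dim=1), target)

# With inputs in [0, 1] that sum to 1, the loss cannot leave this interval.
lower_bound = math.log(num_classes) - 1
upper_bound = math.log(num_classes - 1 + math.e)
print(loss_from_logits.item(), loss_by_hand.item())
print(lower_bound, loss_double_softmax.item(), upper_bound)
```

Under this reading, the softmax would be kept out of the training path and applied only when sampling characters from the trained network.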
I am trying to modify the code found at the following link so that the Transformer model it implements, from the paper "Attention Is All You Need", keeps only its Encoder part. Furthermore, I would like to change the input of the network: instead of a sequence of text, it should be a sequence of images (or, better, features extracted from images) coming from a video. In a sense, I would like to figure out which frames of my input are related to each other and encode that information in an output embedding, in the same way the Transformer model does.
The project in the link mainly performs sequence-to-sequence transformation: the input is text in one language and the output is text in another language. The model is built in lines 386-463, where it is initialized and compiled. I would like to do something like:
#414-416
self.encoder = SelfAttention(d_model, d_inner_hid, n_head, layers, dropout)
#self.decoder = Decoder(d_model, d_inner_hid, n_head, layers, dropout)
#self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))
#434-436
enc_output = self.encoder(src_emb, src_seq, active_layers=active_layers)
#dec_output = self.decoder(tgt_emb, tgt_seq, src_seq, enc_output, active_layers=active_layers)
#final_output = self.target_layer(dec_output)
Furthermore, I would like to combine the output of the Encoder (that is, the output of MultiHeadAttention and PositionwiseFeedForward) using an LSTM and a Dense layer, so that the whole encoding procedure is tuned through a classification objective. Therefore, when I define my model I add the following layers:
self.lstm = LSTM(units = 256, input_shape = (None, 256), return_sequences = False, dropout = 0.5)
self.fc1 = Dense(64, activation='relu', name = "dense_one")
self.fc2 = Dense(6, activation='sigmoid', name = "dense_two")
and then pass the output of the encoder through them at line 434, using the following code:
enc_output = self.lstm(enc_output)
enc_output = self.fc1(enc_output)
enc_output = self.fc2(enc_output)
The video data that I would like to use in place of the text data provided with the GitHub code has dimensionality Nx10x256, where N is the number of samples, 10 is the number of frames and 256 is the number of features per frame. I have some difficulty understanding parts of the code well enough to modify it for my needs. I guess the Embedding layer is no longer necessary for me, since it is specific to text and NLP.
Furthermore, I need to modify the inputs at lines 419-420 to be something like:
src_seq_input = Input(shape=(None, 256,), dtype='float32') # source input related to video
tgt_seq_input = Input(shape=(6,), dtype='int32') # the target classification size (since I have 6 classes)
What other parts of the code do I need to skip or modify? What is the purpose of the PosEncodingLayer that is used in the following line:
self.pos_emb = PosEncodingLayer(len_limit, d_emb) if self.src_loc_info else None
Is it needed in my case? Can I skip it?
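For context, here is a minimal NumPy sketch of the sinusoidal positional encoding described in the paper. I am assuming PosEncodingLayer plays this role in the linked code, which is not shown here, and the function name sinusoidal_positions is made up. Self-attention by itself has no notion of frame order, so this kind of encoding is what lets the encoder tell frame 1 from frame 10:

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(n_positions)[:, None]      # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the values 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# 4 hypothetical videos, 10 frames each, 256 features per frame.
frames = np.random.rand(4, 10, 256).astype(np.float32)
positional = sinusoidal_positions(10, 256)
frames_with_position = frames + positional           # broadcast over the batch axis
print(frames_with_position.shape)
```

Since the encoder output is fed to an LSTM afterwards, which does process frames in order, whether the encoding helps in this setup is something to test rather than assume.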
After my modifications, I noticed that when I run the code I can monitor the loss from def get_loss(y_pred, y_true). However, in my case it is crucial to have a loss for the classification task that also reports the accuracy. How can I do so with the provided code?
Edit:
I have to add that I treat my input as the output of the Embedding layer from the initial NLP code. Therefore, in the version of the code that worked for me:
src_seq_input = Input(shape=(None, 256,), dtype='float32')
tgt_seq_input = Input(shape=(6,), dtype='int32')
src_seq = src_seq_input
#src_emb_ = self.i_word_emb(src_seq)
src_emb = src_seq
enc_output = self.encoder(src_emb, src_emb, active_layers=active_layers)
I treat src_emb as my input and completely ignore src_seq.
Edit:
The way that the loss is calculated is using the following code:
def get_loss(y_pred, y_true):
    y_true = tf.cast(y_true, 'int32')
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
    loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
    loss = K.mean(loss)
    return loss
loss = get_loss(enc_output, tgt_seq_input)
self.ppl = K.exp(loss)
Edit:
As it stands, the loss function (sparse_softmax_cross_entropy_with_logits) returns only a loss score, even though the whole procedure is about classification. How can I further tune my system so that it also returns the accuracy?
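As a sketch of what the accuracy computation amounts to, assuming y_pred holds one row of 6 logits per video and y_true one integer label per video (the function name sparse_accuracy and the numbers are made up): the predicted class is the argmax of the logits, and accuracy is the fraction of predictions that match the labels. It is written in NumPy here; the same two steps exist in TensorFlow as tf.argmax and tf.equal.

```python
import numpy as np

def sparse_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the integer label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))

logits = np.array([
    [2.0, 0.1, 0.3, 0.0, 0.0, 0.0],   # predicts class 0
    [0.2, 0.1, 3.0, 0.0, 0.5, 0.0],   # predicts class 2
    [0.0, 0.3, 0.1, 0.2, 1.5, 0.9],   # predicts class 4
    [0.1, 2.2, 0.0, 0.4, 0.0, 0.3],   # predicts class 1
])
labels = np.array([0, 2, 5, 1])

accuracy = sparse_accuracy(logits, labels)

# The mask used in get_loss ignores every sample whose label is 0.
mask = labels != 0
masked_accuracy = sparse_accuracy(logits[mask], labels[mask])
print(accuracy, masked_accuracy)
```

The masked variant is included because the tf.not_equal(y_true, 0) mask in get_loss looks like a leftover from skipping padding tokens in text. With 6 real classes, it would silently drop every sample of class 0 from the loss, so it is worth checking whether you still want it.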
I'm afraid this approach is not going to work.
Video data has massive dependence between adjacent frames, with each frame very similar to the last. There is also a weaker dependence on prior frames, because objects tend to continue to move relative to other objects in similar ways. Modern video formats use this redundancy to achieve high compression rates by modelling the motions.
This means that your network will attend extremely strongly to the previous image. As you suggest, you could subsample frames several seconds apart to destroy much of the dependence on the previous frame, but if you did so, I really wonder whether you would find any structure at all in the result. Even if you feed it hand-coded features optimised for the purpose, there are few general rules about which features will be in motion and which will not, so what structure can your attention network learn?
The problem of handling video is just radically different from handling sentences. Video has very complex elements (pictures) that are largely static over time and have locally predictable motions over a few frames in very simple ways. Text has simple elements (words) in a complex sentence structure with complex dependence extending over many words. These differences mean they require fundamentally different approaches.
I have some background in machine learning and Python, but I am just learning TensorFlow. I am going through the tutorial on deep convolutional neural nets to teach myself how to use it for image classification. Along the way there is an exercise, which I am having trouble completing.
EXERCISE: The model architecture in inference() differs slightly from the CIFAR-10 model specified in cuda-convnet. In particular, the top layers of Alex's original model are locally connected and not fully connected. Try editing the architecture to exactly reproduce the locally connected architecture in the top layer.
The exercise refers to the inference() function in the cifar10.py model. The 2nd to last layer (called local4) has a shape=[384, 192], and the top layer has a shape=[192, NUM_CLASSES], where NUM_CLASSES=10 of course. I think the code that we are asked to edit is somewhere in the code defining the top layer:
with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
But I don't see any code that determines the probability of connecting between layers, so I don't know how we can change the model from fully connected to locally connected. Does somebody know how to do this?
I'm also working on this exercise. I'll try and explain my approach properly, rather than just give the solution. It's worth looking back at the mathematics of a fully connected layer (https://www.tensorflow.org/get_started/mnist/beginners).
So the linear algebra for a fully connected layer is:
y = W * x + b
where x is the n-dimensional input vector, b is an n-dimensional vector of biases, and W is an n-by-n matrix of weights (taking the layer to be square for simplicity). The i-th element of y is the sum of the i-th row of W multiplied element-wise with x, plus b[i].
So if you only want y[i] connected to x[i-1], x[i], and x[i+1], you set all values in the i-th row of W to zero, apart from the (i-1)-th, i-th and (i+1)-th columns of that row. Therefore, to create a locally connected layer, you force W to be a banded matrix (https://en.wikipedia.org/wiki/Band_matrix), where the width of the band is equal to the size of the locally connected neighbourhoods you want. TensorFlow has a function for setting a matrix to be banded: tf.batch_matrix_band_part(input, num_lower, num_upper, name=None), which was later renamed tf.matrix_band_part and is tf.linalg.band_part in current releases.
This seems to me to be the simplest mathematical solution to the exercise.
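A small NumPy sketch of the banded idea, with made-up sizes (n = 8, neighbourhood of 3): the mask keeps only the entries with |i - j| <= 1, and perturbing a single input then changes only the three outputs connected to it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(n, n))
b = rng.normal(size=n)

# 1 where |row - column| <= 1, 0 elsewhere: a band of width 3 around the diagonal.
rows, cols = np.indices((n, n))
band_mask = (np.abs(rows - cols) <= 1).astype(W.dtype)

def locally_connected(x):
    """y = (W masked to its band) @ x + b, so y[i] only sees x[i-1], x[i], x[i+1]."""
    return (W * band_mask) @ x + b

x = rng.normal(size=n)
x_perturbed = x.copy()
x_perturbed[4] += 1.0

changed = np.nonzero(~np.isclose(locally_connected(x), locally_connected(x_perturbed)))[0]
print(changed)  # [3 4 5]
```

When training, the mask has to be applied on every forward pass (as above), otherwise gradient updates would make the out-of-band weights non-zero again.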
I'll try to answer your question, although I'm not 100% sure I got it right either.
Looking at the cuda-convnet architecture we can see that the TensorFlow and cuda-convnet implementations start to differ after the second pooling layer.
The TensorFlow implementation uses two fully connected layers and a softmax classifier.
The cuda-convnet implementation uses two locally connected layers, one fully connected layer and a softmax classifier.
The code snippet you included refers only to the softmax classifier and is in fact shared between the two implementations. To reproduce the cuda-convnet implementation using TensorFlow we have to replace the existing fully connected layers with two locally connected layers and a fully connected one.
Since TensorFlow doesn't have locally connected layers as part of the SDK, we have to figure out a way to implement them using the existing tools. Here is my attempt to implement the first locally connected layer:
with tf.variable_scope('local3') as scope:
    shape = pool2.get_shape()
    h = shape[1].value
    w = shape[2].value
    sz_local = 3 # kernel size
    sz_patch = (sz_local**2)*shape[3].value
    n_channels = 64
    # Extract 3x3 tensor patches
    patches = tf.extract_image_patches(pool2, [1,sz_local,sz_local,1], [1,1,1,1], [1,1,1,1], 'SAME')
    weights = _variable_with_weight_decay('weights', shape=[1,h,w,sz_patch, n_channels], stddev=5e-2, wd=0.0)
    biases = _variable_on_cpu('biases', [h,w,n_channels], tf.constant_initializer(0.1))
    # "Filter" each patch with its own kernel
    mul = tf.multiply(tf.expand_dims(patches, axis=-1), weights)
    ssum = tf.reduce_sum(mul, axis=3)
    pre_activation = tf.add(ssum, biases)
    local3 = tf.nn.relu(pre_activation, name=scope.name)
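The "filter each patch with its own kernel" step can be checked outside TensorFlow. The NumPy sketch below uses small made-up sizes and random arrays in place of the real patches and variables. It reproduces the multiply-then-reduce from the snippet above and compares it with a single einsum contraction, which gives the same result without materializing the large (batch, h, w, sz_patch, n_channels) intermediate. tf.einsum accepts the same subscript notation, so the same contraction could be tried in the layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, h, w, sz_patch, n_channels = 2, 6, 6, 27, 4   # illustrative sizes only

patches = rng.normal(size=(batch, h, w, sz_patch))           # one flattened patch per position
weights = rng.normal(size=(1, h, w, sz_patch, n_channels))   # one kernel per position
biases = rng.normal(size=(h, w, n_channels))

# Same steps as the TensorFlow code: broadcast multiply, then sum over the patch axis.
mul = patches[..., None] * weights
broadcast_result = mul.sum(axis=3) + biases

# Equivalent contraction over the patch axis p, done position by position.
einsum_result = np.einsum('bhwp,hwpc->bhwc', patches, weights[0]) + biases

print(broadcast_result.shape, np.allclose(broadcast_result, einsum_result))
```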
I would like to implement the ResNet network in Keras with the shortcut connections that add zero entries when features/channels dimensions mismatch according to the original paper:
When the dimensions increase (dotted line shortcuts in Fig. 3), we
consider two options: (A) The shortcut still performs identity
mapping, with extra zero entries padded for increasing dimensions ...
http://arxiv.org/pdf/1512.03385v1.pdf
However, I wasn't able to implement it, and I can't seem to find an answer on the web or in the source code. All the implementations that I found use the 1x1 convolution trick for shortcut connections when dimensions mismatch.
The layer I would like to implement would basically concatenate the input tensor with an all-zeros tensor to compensate for the dimension mismatch.
The idea would be something like this, but I could not get it working:
def zero_pad(x, shape):
    return K.concatenate([x, K.zeros(shape)], axis=1)
Does anyone have an idea of how to implement such a layer?
Thanks a lot
The question was answered on github:
https://github.com/fchollet/keras/issues/2608
It would be something like this:
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Lambda
from keras import backend as K

def zeropad(x):
    y = K.zeros_like(x)
    return K.concatenate([x, y], axis=1)

def zeropad_output_shape(input_shape):
    shape = list(input_shape)
    assert len(shape) == 4
    shape[1] *= 2
    return tuple(shape)

def shortcut(input_layer, nb_filters, output_shape, zeros_upsample=True):
    # TODO: Figure out why zeros_upsample doesn't work in Theano
    if zeros_upsample:
        x = MaxPooling2D(pool_size=(1,1),
                         strides=(2,2),
                         border_mode='same')(input_layer)
        x = Lambda(zeropad, output_shape=zeropad_output_shape)(x)
    else:
        # Options B, C in ResNet paper...
This works for me, even in lazy (non-eager) evaluation mode, and does not require access to another tensor with correct padding (such as with zeros_like). D is the desired number of channels, nn is the tensor we're trying to pad.
def pad_depth(nn,D):
    import tensorflow as tf
    deltaD= D- nn.shape[-1]
    paddings= [[0,0]]* len(nn.shape.as_list())
    paddings[-1]= [0,deltaD]
    nn= tf.pad(nn,paddings)
    return nn
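To check what this kind of padding produces without building a graph, here is the same idea in NumPy. np.pad takes the same per-axis [before, after] layout as tf.pad; the function name pad_depth_np and the tensor sizes are made up for illustration, and the channels are assumed to be on the last axis.

```python
import numpy as np

def pad_depth_np(x, desired_channels):
    """Append zero channels on the last axis until it has desired_channels entries."""
    extra = desired_channels - x.shape[-1]
    if extra < 0:
        raise ValueError("desired_channels is smaller than the current channel count")
    # No padding on any axis except the last one, which gets `extra` zeros at the end.
    paddings = [(0, 0)] * (x.ndim - 1) + [(0, extra)]
    return np.pad(x, paddings, mode="constant", constant_values=0)

x = np.random.rand(2, 8, 8, 64)
padded = pad_depth_np(x, 128)
print(padded.shape)                                   # (2, 8, 8, 128)
print(np.array_equal(padded[..., :64], x), np.count_nonzero(padded[..., 64:]))
```

Unlike the concatenate-with-zeros_like variants, this form does not require the target channel count to be a multiple of the input's channel count.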
If you are still looking for it, I implemented it in my GitHub repository. Please take a look at https://github.com/nellopai/deepLearningModels.
None of the solutions I found online really worked or were consistent with the ResNet paper. You can find more details in the repo under code/networks/resNet50. The method I use is the following; note that it only pads to the desired size when the number of missing channels is a multiple of the input's channel count:
import tensorflow as tf

def pad_depth(x, desired_channels):
    new_channels = desired_channels - x.shape.as_list()[-1]
    output = tf.identity(x)
    # each repetition appends as many zero channels as x has
    repetitions = new_channels/x.shape.as_list()[-1]
    for _ in range(int(repetitions)):
        zeroTensors = tf.zeros_like(x, name='pad_depth1')
        output = tf.keras.backend.concatenate([output, zeroTensors])
    return output