In the PyTorch tutorial "Deep Learning with PyTorch: A 60 Minute Blitz > Neural Networks",
I have a question: what does params[1] mean in the network?
The reason I ask is that max pooling does not have any weight values.
For example, if you write code like this:
def __init__(self):
    self.conv1 = nn.Conv2d(1, 6, 5)
this means the input has 1 channel, there are 6 output channels, and the convolution kernel is 5x5.
So I understood that params[0] holds 6 channels of 5x5 matrices with random values set at initialization.
For the same reason, params[2] has the same form, but with 16 channels. I understood this too.
But what does params[1] mean?
Maybe it is just a placeholder representing the max pooling layer.
But at the end of the tutorial, in the "Update the weights" step,
it appears to be updated by the code below:
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)
This is the code that constructs the network:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
params = list(net.parameters())
print(params[1])
Parameter containing:
tensor([-0.0614, -0.0778, 0.0968, -0.0420, 0.1779, -0.0843],
requires_grad=True)
Please see the PyTorch tutorial page:
https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py
Summary
I have one question: why does the max pooling layer appear to have weights that can be updated?
I think it shouldn't have any weights, right?
Am I wrong?
Please help me. (I'm Korean, so please excuse my English.)
You are wrong about that: it has nothing to do with max pooling.
As the linked tutorial explains, an nn.Parameter tensor is automatically registered as a parameter when it gets assigned as an attribute of a Module.
In your case, that basically means every layer defined in __init__ is a submodule, and each of those submodules contributes its own parameters.
As for what the values inside the parameters mean: they are the parameters your model needs to compute its outputs. To picture it:
params[0] -> self.conv1.weight
params[1] -> self.conv1.bias
params[2] -> self.conv2.weight
params[3] -> self.conv2.bias
params[4] -> self.fc1.weight
params[5] -> self.fc1.bias
and so on until you reach params[9], which is the end of your whole parameter list.
EDIT: I forgot to mention the weights themselves.
These values are what your net has learned.
You therefore have the ability to alter these values in order to fine-tune your net to your needs.
And if you ask why there are two entries for each layer: every Conv2d and Linear module has both a weight tensor and a bias tensor, and both are updated during backpropagation.
The six values you printed from params[1] are the bias of conv1, one per output channel.
Hope things are a little clearer now.
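A quick way to verify this is to print the parameter names and shapes registered by the Net from the code above; the shapes in the comments are what that architecture should produce:

for i, (name, p) in enumerate(net.named_parameters()):
    print(i, name, tuple(p.shape))

# 0 conv1.weight (6, 1, 5, 5)
# 1 conv1.bias   (6,)
# 2 conv2.weight (16, 6, 5, 5)
# 3 conv2.bias   (16,)
# 4 fc1.weight   (120, 400)
# 5 fc1.bias     (120,)
# 6 fc2.weight   (84, 120)
# 7 fc2.bias     (84,)
# 8 fc3.weight   (10, 84)
# 9 fc3.bias     (10,)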
Related
I'm trying to understand exactly how the calculations are performed in the PyTorch GRU class. I'm having some trouble reading the PyTorch GRU documentation and the LSTM TorchScript documentation along with its code implementation.
The GRU documentation states:
In a multilayer GRU, the input x_t^(l) of the l-th layer (l >= 2) is the hidden state h_t^(l-1) of the previous layer multiplied by dropout delta_t^(l-1), where each delta_t^(l-1) is a Bernoulli random variable which is 0 with probability dropout.
So essentially, given a sequence, each time step should be passed through all the layers within each loop iteration, like this implementation.
Meanwhile, the LSTM code implementation is:
def script_lstm(input_size, hidden_size, num_layers, bias=True,
                batch_first=False, dropout=False, bidirectional=False):
    '''Returns a ScriptModule that mimics a PyTorch native LSTM.'''
    # The following are not implemented.
    assert bias
    assert not batch_first

    if bidirectional:
        stack_type = StackedLSTM2
        layer_type = BidirLSTMLayer
        dirs = 2
    elif dropout:
        stack_type = StackedLSTMWithDropout
        layer_type = LSTMLayer
        dirs = 1
    else:
        stack_type = StackedLSTM
        layer_type = LSTMLayer
        dirs = 1
    return stack_type(num_layers, layer_type,
                      first_layer_args=[LSTMCell, input_size, hidden_size],
                      other_layer_args=[LSTMCell, hidden_size * dirs,
                                        hidden_size])


class LSTMCell(jit.ScriptModule):
    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.weight_ih = Parameter(torch.randn(4 * hidden_size, input_size))
        self.weight_hh = Parameter(torch.randn(4 * hidden_size, hidden_size))
        self.bias_ih = Parameter(torch.randn(4 * hidden_size))
        self.bias_hh = Parameter(torch.randn(4 * hidden_size))

    @jit.script_method
    def forward(self, input: Tensor, state: Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tuple[Tensor, Tensor]]:
        hx, cx = state
        gates = (torch.mm(input, self.weight_ih.t()) + self.bias_ih +
                 torch.mm(hx, self.weight_hh.t()) + self.bias_hh)
        ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)

        ingate = torch.sigmoid(ingate)
        forgetgate = torch.sigmoid(forgetgate)
        cellgate = torch.tanh(cellgate)
        outgate = torch.sigmoid(outgate)

        cy = (forgetgate * cx) + (ingate * cellgate)
        hy = outgate * torch.tanh(cy)

        return hy, (hy, cy)


class LSTMLayer(jit.ScriptModule):
    def __init__(self, cell, *cell_args):
        super(LSTMLayer, self).__init__()
        self.cell = cell(*cell_args)

    @jit.script_method
    def forward(self, input: Tensor, state: Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tuple[Tensor, Tensor]]:
        inputs = input.unbind(0)
        outputs = torch.jit.annotate(List[Tensor], [])
        for i in range(len(inputs)):
            out, state = self.cell(inputs[i], state)
            outputs += [out]
        return torch.stack(outputs), state


def init_stacked_lstm(num_layers, layer, first_layer_args, other_layer_args):
    layers = [layer(*first_layer_args)] + [layer(*other_layer_args)
                                           for _ in range(num_layers - 1)]
    return nn.ModuleList(layers)


class StackedLSTM(jit.ScriptModule):
    __constants__ = ['layers']  # Necessary for iterating through self.layers

    def __init__(self, num_layers, layer, first_layer_args, other_layer_args):
        super(StackedLSTM, self).__init__()
        self.layers = init_stacked_lstm(num_layers, layer, first_layer_args,
                                        other_layer_args)

    @jit.script_method
    def forward(self, input: Tensor, states: List[Tuple[Tensor, Tensor]]) -> Tuple[Tensor, List[Tuple[Tensor, Tensor]]]:
        # List[LSTMState]: One state per layer
        output_states = jit.annotate(List[Tuple[Tensor, Tensor]], [])
        output = input
        # XXX: enumerate https://github.com/pytorch/pytorch/issues/14471
        i = 0
        for rnn_layer in self.layers:
            state = states[i]
            output, out_state = rnn_layer(output, state)
            output_states += [out_state]
            i += 1
        return output, output_states
So in this case each layer runs its own for loop over the sequence and passes the resulting sequence tensor to the next layer.
So my question is: which is the correct way to implement a multi-layer GRU?
I think you are misunderstanding the definition. The approach you see in the LSTM code, where each layer passes an entire sequence on to the next, is the standard approach for stacked RNNs, at least for sequence-to-sequence models. It's equivalent to RNN(RNN(input)).
It's also what the PyTorch GRU definition is saying, albeit in a somewhat roundabout way. The definition says that for the N-th layer GRU, the input is the hidden state h (read: output) of the (N-1)-th layer GRU. Now, in theory, we could run all the inputs one at a time through all the layers and collect the outputs. Or we can do the entire sequence for each layer and only keep the last output sequence. This second approach should be faster, because it allows the calculations to be vectorized more efficiently.
Further, if you look at the link you sent with the two different GRU models, you'll see that the results are equivalent whether you run the inputs through each layer one at a time using GRUCells or use full GRU layers.
In the PyTorch GRU documentation you will find that nn.GRU has an argument named num_layers which lets you specify the number of stacked GRU layers.
Hopefully this answers your question as to how GRU layers are applied in practice:
>>> rnn = nn.GRU(input_size = 10, hidden_size = 20, num_layers = 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
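To make the equivalence between the two views concrete, here is a small sketch of my own (not from the linked code): it builds a 2-layer nn.GRU and reproduces its output by running two single-layer GRUs one after the other, copying the weights layer by layer.

import torch
import torch.nn as nn

torch.manual_seed(0)

# One stacked 2-layer GRU (dropout defaults to 0).
rnn = nn.GRU(input_size=10, hidden_size=20, num_layers=2)

# Two single-layer GRUs applied sequentially: RNN(RNN(input)).
layer0 = nn.GRU(input_size=10, hidden_size=20, num_layers=1)
layer1 = nn.GRU(input_size=20, hidden_size=20, num_layers=1)

with torch.no_grad():
    layer0.weight_ih_l0.copy_(rnn.weight_ih_l0)
    layer0.weight_hh_l0.copy_(rnn.weight_hh_l0)
    layer0.bias_ih_l0.copy_(rnn.bias_ih_l0)
    layer0.bias_hh_l0.copy_(rnn.bias_hh_l0)
    layer1.weight_ih_l0.copy_(rnn.weight_ih_l1)
    layer1.weight_hh_l0.copy_(rnn.weight_hh_l1)
    layer1.bias_ih_l0.copy_(rnn.bias_ih_l1)
    layer1.bias_hh_l0.copy_(rnn.bias_hh_l1)

x = torch.randn(5, 3, 10)   # (seq_len, batch, input_size)
h0 = torch.randn(2, 3, 20)  # one initial hidden state per layer

out_stacked, _ = rnn(x, h0)

out0, _ = layer0(x, h0[0:1])     # layer 1 consumes the whole input sequence
out1, _ = layer1(out0, h0[1:2])  # layer 2 consumes layer 1's output sequence

print(torch.allclose(out_stacked, out1, atol=1e-6))  # True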
I have a nn.Module whose forward function takes in two inputs. Inside the function, I multiply one of the inputs x1 by a set of trainable parameters, and then concatenate them with the other input x2.
class ConcatMe(nn.Module):
    def __init__(self, pad_len, emb_size):
        super(ConcatMe, self).__init__()
        self.W = nn.Parameter(torch.randn(pad_len, emb_size).to(DEVICE), requires_grad=True)
        self.emb_size = emb_size

    def forward(self, x1: Tensor, x2: Tensor):
        cat = self.W * torch.reshape(x2, (1, -1, 1))
        return torch.cat((x1, cat), dim=-1)
From my understanding, one is supposed to be able to write operations in PyTorch's nn.Modules like we would for inputs with a batch size of 1. For some reason, this is not the case. I'm getting an error that indicates that PyTorch is still accounting for batch_size.
x1 = torch.randn(100,2,512)
x2 = torch.randint(10, (2,1))
concat = ConcatMe(100, 512)
concat(x1, x2)
-----------------------------------------------------------------------------------
File "/home/my/file/path.py", line 0, in forward
cat = self.W * torch.reshape(x2, (1, -1, 1))
RuntimeError: The size of tensor a (100) must match the size of tensor b (2) at non-singleton dimension 1
I made a for loop to patch the issue as shown below:
class ConcatMe(nn.Module):
    def __init__(self, pad_len, emb_size):
        super(ConcatMe, self).__init__()
        self.W = nn.Parameter(torch.randn(pad_len, emb_size).to(DEVICE), requires_grad=True)
        self.emb_size = emb_size

    def forward(self, x1: Tensor, x2: Tensor):
        batch_size = x2.shape[0]
        cat = torch.ones(x1.shape).to(DEVICE)
        for i in range(batch_size):
            cat[:, i, :] = self.W * x2[i]
        return torch.cat((x1, cat), dim=-1)
but I feel like there's a more elegant solution. Does it have something to do with the fact that I'm creating parameters inside nn.Module? If so, what solution can I implement that doesn't require a for loop?
From my understanding, one is supposed to be able to write operations in PyTorch's nn.Modules like we would for inputs with a batch size of 1.
I'm not sure where you got this assumption; it is definitely not true. On the contrary: you always need to write your operations in a way that they can handle the general case of an arbitrary batch dimension.
Judging from your second implementation, it seems like you're trying to multiply two tensors with incompatible dimensions. In order to fix that, you'd have to define
self.W = torch.nn.Parameter(torch.randn(pad_len, 1, emb_size), requires_grad=True)
To understand things like that better it would help to learn about broadcasting.
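Putting the broadcasting fix together, a vectorized version of the module could look like the sketch below. DEVICE is a stand-in for the constant in the question, and x2 is cast to float before the multiplication; both are my assumptions.

import torch
import torch.nn as nn
from torch import Tensor

DEVICE = "cpu"  # stand-in for the DEVICE constant used in the question

class ConcatMe(nn.Module):
    def __init__(self, pad_len: int, emb_size: int):
        super().__init__()
        # The extra singleton dimension lets W broadcast across the batch dimension.
        self.W = nn.Parameter(torch.randn(pad_len, 1, emb_size).to(DEVICE), requires_grad=True)
        self.emb_size = emb_size

    def forward(self, x1: Tensor, x2: Tensor) -> Tensor:
        # W: (pad_len, 1, emb_size) multiplied by x2 reshaped to (1, batch, 1)
        # broadcasts to (pad_len, batch, emb_size), matching x1.
        cat = self.W * torch.reshape(x2.float(), (1, -1, 1))
        return torch.cat((x1, cat), dim=-1)

x1 = torch.randn(100, 2, 512)
x2 = torch.randint(10, (2, 1))
concat = ConcatMe(100, 512)
print(concat(x1, x2).shape)  # torch.Size([100, 2, 1024])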
I have implemented the code for the UNET paper (the architecture diagram is the one from the paper).
The problem is that at level 4 the Encoder has features of shape 512 x 64 x 64, while the Decoder part has features of a different shape, 512 x 56 x 56. So then I looked closely to find the gray "copy and crop" arrows. The paper does not mention how this is done; there are just two references to cropping:
Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution.
Could someone please explain how I could make them compatible? The code works perfectly up to the BottleNeck, but since I am stuck at the cropping part, I could not test the logic and code flow of the Decoder.
'''
The whole UNet is divided into 3 parts: Encoder -> BottleNeck -> Decoder. There are skip connections between the Nth level of the Encoder and the Nth level of the Decoder.
There is one basic entity called a "Convolution Block" which has 3x3 Convolution (or Transposed Convolution during upsampling) -> ReLU -> BatchNorm.
Then there is max pooling.
'''
import torch
import torch.nn as nn

# ------------------- TESTING CODE (run after the class definitions below) -------------------
image = torch.randn(1, 1, 572, 572)  # Batch of 1 gray-scale image, as described in the paper, to test
enc = Encoder(1)
feat, skip = enc(image)

bot = BottleNeck()
feat = bot(feat)

# dec = Decoder()
# feats = dec(feat, skip)  # Error starts here in the Decoder block
# -----------------------------
class ConvolutionBlock(nn.Module):
    '''
    The basic Convolution Block which will have Convolution -> ReLU -> Convolution -> ReLU
    '''
    def __init__(self, input_features, out_features):
        '''
        args:
            batch_norm was introduced after UNET so they did not know if it existed. Might be useful
        '''
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(input_features, out_features, kernel_size=3, padding=0),  # padding is 0 by default; 1 would mean output width, height == input width, height
            nn.ReLU(),
            nn.Conv2d(out_features, out_features, kernel_size=3, padding=0),
            nn.ReLU(),
        )

    def forward(self, feature_map_x):
        '''
        feature_map_x could be the image itself or the feature map from the previous block
        '''
        return self.network(feature_map_x)
class Encoder(nn.Module):
    '''
    '''
    def __init__(self, image_channels: int = 3, blockwise_features=[64, 128, 256, 512]):
        '''
        In UNET, the features start at 64 and keep doubling relative to the previous block until the BottleNeck is reached.
        args:
            image_channels: Channels in the input image. Typically 1 or 3 (rarely 4)
            blockwise_features: Each block has its own input and output features. The first Conv block outputs 64 features, the second 128, and so on
        '''
        super().__init__()
        repeat = len(blockwise_features)  # how many layers we need to add; len of blockwise_features == len of out_features
        self.layers = nn.ModuleList()
        for i in range(repeat):
            if i == 0:
                in_filters = image_channels
                out_filters = blockwise_features[0]
            else:
                in_filters = blockwise_features[i - 1]
                out_filters = blockwise_features[i]
            self.layers.append(ConvolutionBlock(in_filters, out_filters))

        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)  # Since there is no gradient for max pooling, you can instantiate a single layer for the whole operation
        # https://datascience.stackexchange.com/questions/11699/backprop-through-max-pooling-layers

    def forward(self, feature_map_x):
        skip_connections = []  # the i-th level of features from the Encoder will be concatenated with the i-th level of the Decoder before applying the Conv block
        for layer in self.layers:
            feature_map_x = layer(feature_map_x)
            skip_connections.append(feature_map_x)
            feature_map_x = self.maxpool(feature_map_x)  # Use max pooling AFTER storing the skip connections
        return feature_map_x, skip_connections
class BottleNeck(nn.Module):
    '''
    ConvolutionBlock without Max Pooling
    '''
    def __init__(self, input_features=512, output_features=1024):
        super().__init__()
        self.layer = ConvolutionBlock(input_features, output_features)

    def forward(self, feature_map_x):
        return self.layer(feature_map_x)
class Decoder(nn.Module):
    '''
    '''
    def __init__(self, blockwise_features=[512, 256, 128, 64]):
        '''
        Do exactly the opposite of the Encoder
        '''
        super().__init__()
        self.upsample_layers = nn.ModuleList()
        self.conv_layers = nn.ModuleList()
        for i, feature in enumerate(blockwise_features):
            self.upsample_layers.append(nn.ConvTranspose2d(in_channels=feature * 2, out_channels=feature, kernel_size=2, stride=2))  # Takes 1024 -> 512, 512 -> 256, ...
            self.conv_layers.append(ConvolutionBlock(feature * 2, feature))  # After concatenating (512 + 512 -> 1024), use the double Conv block

    def forward(self, feature_map_x, skip_connections):
        '''
        Steps go as:
        1. Upsample
        2. Concat skip connection
        3. Apply ConvolutionBlock
        '''
        for i, layer in enumerate(self.conv_layers):  # 4 levels, 4 skip connections, 4 upsampling layers, 4 double Conv blocks
            feature_map_x = self.upsample_layers[i](feature_map_x)  # step 1
            feature_map_x = torch.cat((skip_connections[-i - 1], feature_map_x), dim=1)  # step 2
            feature_map_x = self.conv_layers[i](feature_map_x)  # step 3
        return feature_map_x
I think what is meant is a center crop.
The code should be something like this:
def crop_image(image, new_shape):
    '''image: tensor of shape (batch_size, num_channels, height, width)
    new_shape: torch.Size object with the new size of the image,
    for example an original image of size (30, 3, 106, 106) and a new shape of
    (30, 20, 100, 100)
    '''
    h_crop_size = (image.shape[2] - new_shape[2]) // 2
    w_crop_size = (image.shape[3] - new_shape[3]) // 2
    h_start = h_crop_size
    h_end = h_start + new_shape[2]
    w_start = w_crop_size
    w_end = w_start + new_shape[3]
    return image[:, :, h_start:h_end, w_start:w_end]
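To connect this to the Decoder in the question: center-crop each skip connection down to the spatial size of the upsampled feature map before concatenating. A small sketch using the crop_image function above and the level-4 shapes from the question (the commented Decoder.forward rewrite is my suggestion, not tested against the full model):

import torch

skip = torch.randn(1, 512, 64, 64)       # encoder feature map (skip connection)
upsampled = torch.randn(1, 512, 56, 56)  # decoder feature map after ConvTranspose2d

skip_cropped = crop_image(skip, upsampled.shape)      # center-crop 64x64 -> 56x56
merged = torch.cat((skip_cropped, upsampled), dim=1)  # channels: 512 + 512 = 1024
print(merged.shape)  # torch.Size([1, 1024, 56, 56])

# Inside Decoder.forward this would become:
#   feature_map_x = self.upsample_layers[i](feature_map_x)                # step 1
#   skip = crop_image(skip_connections[-i - 1], feature_map_x.shape)      # crop
#   feature_map_x = torch.cat((skip, feature_map_x), dim=1)               # step 2
#   feature_map_x = self.conv_layers[i](feature_map_x)                    # step 3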
Say I have the following network
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(1, 2)
        self.fc2 = nn.Linear(2, 3)
        self.fc3 = nn.Linear(3, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Model()
I know that I can access the weights (i.e. the edges) at each layer:
net.fc1.weight
However, I'm trying to create a function that randomly selects a neuron from the entire network and outputs its in-connections (i.e. the edges/weights attached to it from the previous layer) and its out-connections (i.e. the edges/weights going out of it to the next layer).
pseudocode:
def get_neuron_in_out_edges(list_of_neurons):
    shuffled_list_of_neurons = shuffle(list_of_neurons)
    in_connections_list = []
    out_connections_list = []
    for neuron in shuffled_list_of_neurons:
        in_connections = get_in_connections(neuron)    # a list of connections
        out_connections = get_out_connections(neuron)  # a list of connections
        in_connections_list.append([neuron, in_connections])
        out_connections_list.append([neuron, out_connections])
    return in_connections_list, out_connections_list
The idea is that I can then access these values and, say, if they're smaller than 10, change them to 10 in the network. This is for a networks class where we're working on plotting different networks, so this doesn't have to make much sense from a machine learning perspective.
Let's ignore biases for this discussion.
A linear layer computes the output y given weights w and inputs x as:
y_i = sum_j w_ij x_j
So, for neuron i all the incoming edges are the weights w_ij - that is the i-th row of the weight matrix W.
Similarly, for input neuron j it affects all y_i according to the j-th column of the weight matrix W.
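As a concrete sketch for the net = Model() defined above (the helper name get_neuron_edges is mine, not a PyTorch API; neuron 1 of fc2 is an arbitrary example):

import torch

def get_neuron_edges(layer, next_layer, neuron_idx):
    # In-connections: the neuron_idx-th row of this layer's weight matrix.
    in_edges = layer.weight[neuron_idx, :]
    # Out-connections: the neuron_idx-th column of the next layer's weight matrix.
    out_edges = next_layer.weight[:, neuron_idx]
    return in_edges, out_edges

in_edges, out_edges = get_neuron_edges(net.fc2, net.fc3, neuron_idx=1)
print(in_edges.shape)   # torch.Size([2]) - one edge from each fc1 output
print(out_edges.shape)  # torch.Size([1]) - one edge to each fc3 output

# Modifying the edges in place (e.g. raise any weight below 10 to 10):
with torch.no_grad():
    net.fc2.weight[1, :].clamp_(min=10.0)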
I was looking for an implementation of an LSTM cell in PyTorch that I could extend, and I found an implementation of it in the accepted answer here. I will post it here because I'd like to refer to it. There are quite a few implementation details that I do not understand, and I was wondering if someone could clarify.
import math
import torch as th
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.i2h = nn.Linear(input_size, 4 * hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, x, hidden):
        h, c = hidden
        h = h.view(h.size(1), -1)
        c = c.view(c.size(1), -1)
        x = x.view(x.size(1), -1)

        # Linear mappings
        preact = self.i2h(x) + self.h2h(h)

        # activations
        gates = preact[:, :3 * self.hidden_size].sigmoid()
        g_t = preact[:, 3 * self.hidden_size:].tanh()
        i_t = gates[:, :self.hidden_size]
        f_t = gates[:, self.hidden_size:2 * self.hidden_size]
        o_t = gates[:, -self.hidden_size:]

        c_t = th.mul(c, f_t) + th.mul(i_t, g_t)
        h_t = th.mul(o_t, c_t.tanh())

        h_t = h_t.view(1, h_t.size(0), -1)
        c_t = c_t.view(1, c_t.size(0), -1)
        return h_t, (h_t, c_t)
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the __init__ method)?
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
3- Why do we use view for h, c, and x in the forward method?
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the __init__ method)?
In the equations you have included, the input x and the hidden state h are used in four calculations, each of which is a matrix multiplication with a weight. Whether you do four separate matrix multiplications, or concatenate the weights and do one bigger matrix multiplication and separate the results afterwards, the result is the same.
input_size = 5
hidden_size = 10
input = torch.randn((2, input_size))
# Two different weights
w_c = torch.randn((hidden_size, input_size))
w_i = torch.randn((hidden_size, input_size))
# Concatenated weights into one tensor
# with size:[2 * hidden_size, input_size]
w_combined = torch.cat((w_c, w_i), dim=0)
# Output calculated by using separate matrix multiplications
out_c = torch.matmul(w_c, input.transpose(0, 1))
out_i = torch.matmul(w_i, input.transpose(0, 1))
# One bigger matrix multiplication with the combined weights
out_combined = torch.matmul(w_combined, input.transpose(0, 1))
# The first hidden_size number of rows belong to w_c
out_combined_c = out_combined[:hidden_size]
# The second hidden_size number of rows belong to w_i
out_combined_i = out_combined[hidden_size:]
# Using torch.allclose because they are equal besides floating point errors.
torch.allclose(out_c, out_combined_c) # => True
torch.allclose(out_i, out_combined_i) # => True
By setting the output size of the linear layer to 4 * hidden_size there are four weight blocks of size hidden_size, so only one layer is needed instead of four. There is not really an advantage to doing this, except maybe a minor performance improvement, mostly for smaller inputs that don't fully exhaust the parallelisation capabilities when done individually.
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
That's where the outputs are separated to correspond to the output of the four individual calculations. The output is the concatenation of [i_t; f_t; o_t; g_t] (not including tanh and sigmoid respectively).
You can get the same separation by splitting the output into four chunks with torch.chunk:
i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
But after the separation you would have to apply torch.sigmoid to i_t, f_t and o_t, and torch.tanh to g_t.
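For completeness, a self-contained sketch of that split plus the activations; preact here is just a stand-in for self.i2h(x) + self.h2h(h):

import torch

hidden_size = 10
preact = torch.randn(2, 4 * hidden_size)  # stand-in for self.i2h(x) + self.h2h(h)

i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
g_t = torch.tanh(g_t)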
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:
The W parameters are the weights in the linear layer self.i2h and the U parameters are the weights in the linear layer self.h2h, concatenated along the first dimension.
W_i, W_f, W_o, W_c = torch.chunk(self.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(self.h2h.weight, 4, dim=0)
3- Why do we use view for h, c, and x in the forward method?
Based on h_t = h_t.view(1, h_t.size(0), -1) towards the end, the hidden states have the size [1, batch_size, hidden_size]. With h = h.view(h.size(1), -1) that gets rid of the first singleton dimension to get size [batch_size, hidden_size]. The same could be achieved with h.squeeze(0).
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
Parameter initialisation can have a big impact on the model's learning capability. The general rule for the initialisation is to have values close to zero without being too small. A common initialisation is to draw from a normal distribution with mean 0 and variance of 1 / n, where n is the number of neurons, which in turn means a standard deviation of 1 / sqrt(n).
In this case it uses a uniform distribution instead of a normal distribution, but the general idea is similar: determine the minimum/maximum value based on the number of neurons while avoiding making them too small. If the minimum/maximum value were 1 / n, the values would get very small, so 1 / sqrt(n) is more appropriate, e.g. with 256 neurons: 1 / 256 = 0.0039 whereas 1 / sqrt(256) = 0.0625.
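As a small numeric illustration (my own sketch, mirroring reset_parameters for a single layer with 256 hidden units):

import math
import torch
import torch.nn as nn

hidden_size = 256
layer = nn.Linear(hidden_size, 4 * hidden_size)

std = 1.0 / math.sqrt(hidden_size)  # 1 / sqrt(256) = 0.0625
with torch.no_grad():
    for w in layer.parameters():
        w.uniform_(-std, std)       # all weights and biases drawn from [-0.0625, 0.0625]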
Initializing neural networks provides some explanations of different initialisations with interactive visualisations.