PyTorch nn.Module won't unbatch operations - python

I have an nn.Module whose forward function takes in two inputs. Inside the function, I multiply one of the inputs (x2) by a set of trainable parameters and then concatenate the result with the other input (x1).
class ConcatMe(nn.Module):
    def __init__(self, pad_len, emb_size):
        super(ConcatMe, self).__init__()
        self.W = nn.Parameter(torch.randn(pad_len, emb_size).to(DEVICE), requires_grad=True)
        self.emb_size = emb_size

    def forward(self, x1: Tensor, x2: Tensor):
        cat = self.W * torch.reshape(x2, (1, -1, 1))
        return torch.cat((x1, cat), dim=-1)
From my understanding, one is supposed to be able to write operations in PyTorch's nn.Modules like we would for inputs with a batch size of 1. For some reason, this is not the case. I'm getting an error that indicates that PyTorch is still accounting for batch_size.
x1 = torch.randn(100,2,512)
x2 = torch.randint(10, (2,1))
concat = ConcatMe(100, 512)
concat(x1, x2)
-----------------------------------------------------------------------------------
File "/home/my/file/path.py, line 0, in forward
cat = self.W * torch.reshape(x2, (1, -1, 1))
RuntimeError: The size of tensor a (100) must match the size of tensor b (2) at non-singleton dimension 1
I made a for loop to patch the issue as shown below:
class ConcatMe(nn.Module):
    def __init__(self, pad_len, emb_size):
        super(ConcatMe, self).__init__()
        self.W = nn.Parameter(torch.randn(pad_len, emb_size).to(DEVICE), requires_grad=True)
        self.emb_size = emb_size

    def forward(self, x1: Tensor, x2: Tensor):
        batch_size = x2.shape[0]
        cat = torch.ones(x1.shape).to(DEVICE)
        for i in range(batch_size):
            cat[:, i, :] = self.W * x2[i]
        return torch.cat((x1, cat), dim=-1)
but I feel like there's a more elegant solution. Does it have something to do with the fact that I'm creating parameters inside an nn.Module? If so, what solution can I implement that doesn't require a for loop?

From my understanding, one is supposed to be able to write operations in PyTorch's nn.Modules like we would for inputs with a batch size of 1.
I'm not sure where you got this assumption; it is definitely not true. On the contrary, you always need to write operations so that they can handle the general case of an arbitrary batch dimension.
Judging from your second implementation, it seems like you're trying to multiply two tensors with incompatible dimensions, so to fix that you'd have to define
self.W = torch.nn.Parameter(torch.randn(pad_len, 1, emb_size), requires_grad=True)
To understand things like this better, it would help to learn about broadcasting.
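Putting it together, a minimal sketch of the fixed module (assuming, as in your example, that x1 has shape (pad_len, batch_size, emb_size) and x2 has shape (batch_size, 1); the DEVICE handling is left out here):
import torch
from torch import nn, Tensor

class ConcatMe(nn.Module):
    def __init__(self, pad_len, emb_size):
        super().__init__()
        # extra singleton dimension so W broadcasts over the batch dimension
        self.W = nn.Parameter(torch.randn(pad_len, 1, emb_size))
        self.emb_size = emb_size

    def forward(self, x1: Tensor, x2: Tensor) -> Tensor:
        # (pad_len, 1, emb_size) * (1, batch_size, 1) -> (pad_len, batch_size, emb_size)
        cat = self.W * torch.reshape(x2, (1, -1, 1))
        return torch.cat((x1, cat), dim=-1)

x1 = torch.randn(100, 2, 512)
x2 = torch.randint(10, (2, 1))
concat = ConcatMe(100, 512)
print(concat(x1, x2).shape)  # torch.Size([100, 2, 1024])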

Related

Looking for a more efficient way to implement scalar multiplication for word2vec

I would like to implement word2vec myself in PyTorch, and the forward method has the following signature:
def forward(self, input_index_batch, output_indices_batch):
    # input_index_batch - Tensor of ints, shape: (batch_size, )
    # output_indices_batch - Tensor of ints, shape: (batch_size, num_negative_samples + 1)
In the word2vec approach, one calculates the dot product between the input's row of the embedding matrix U and each of the num_negative_samples + 1 outputs, which correspond to the context words; the first one is the positive context word, and all remaining ones are negative. I have tackled the problem in the following way:
U_batch = self.input.weight.T[input_index_batch]
for i, output_index in enumerate(output_indices_batch.T):
    V_batch = self.output.weight[output_index]
    dot_prod = torch.einsum('bi,bj->b', U_batch, V_batch)
    if i == 0:
        predictions[:, i] = torch.sigmoid(dot_prod)
    else:
        predictions[:, i] = 1 - torch.sigmoid(dot_prod)
return predictions
This works correctly, but rather inefficiently and slowly. What would be a more efficient way to implement this?
In the __init__ method, self.input and self.output are defined as follows (10 is the word2vec dimension):
self.input = nn.Linear(num_tokens, 10, bias=False)
self.output = nn.Linear(10, num_tokens, bias=False)
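For comparison, a possible vectorized sketch of the forward body (an assumption on my part, not from the original post; it presumes the intent is a per-example dot product and uses the same self.input / self.output layers defined above):
U_batch = self.input.weight.T[input_index_batch]    # (batch_size, 10)
V_batch = self.output.weight[output_indices_batch]  # (batch_size, num_negative_samples + 1, 10)
dot_prods = torch.einsum('bd,bkd->bk', U_batch, V_batch)
probs = torch.sigmoid(dot_prods)
# first column: positive context word; remaining columns: negative samples
predictions = torch.cat((probs[:, :1], 1 - probs[:, 1:]), dim=1)
return predictions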

LSTM cell implementation in Pytorch design choices

I was looking for an implementation of an LSTM cell in Pytorch that I could extend, and I found an implementation of it in the accepted answer here. I will post it here because I'd like to refer to it. There are quite a few implementation details that I do not understand, and I was wondering if someone could clarify.
import math
import torch as th
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.i2h = nn.Linear(input_size, 4 * hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, x, hidden):
        h, c = hidden
        h = h.view(h.size(1), -1)
        c = c.view(c.size(1), -1)
        x = x.view(x.size(1), -1)
        # Linear mappings
        preact = self.i2h(x) + self.h2h(h)
        # activations
        gates = preact[:, :3 * self.hidden_size].sigmoid()
        g_t = preact[:, 3 * self.hidden_size:].tanh()
        i_t = gates[:, :self.hidden_size]
        f_t = gates[:, self.hidden_size:2 * self.hidden_size]
        o_t = gates[:, -self.hidden_size:]
        c_t = th.mul(c, f_t) + th.mul(i_t, g_t)
        h_t = th.mul(o_t, c_t.tanh())
        h_t = h_t.view(1, h_t.size(0), -1)
        c_t = c_t.view(1, c_t.size(0), -1)
        return h_t, (h_t, c_t)
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)?
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
3- Why do we use view for h, c, and x in the forward method?
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws in the LSTM gate equations.
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)?
In the equations you have included, the input x and the hidden state h are used for four calculations, each of which is a matrix multiplication with its own weight. Whether you do four separate matrix multiplications, or concatenate the weights, do one bigger matrix multiplication, and separate the results afterwards, the result is the same.
input_size = 5
hidden_size = 10
input = torch.randn((2, input_size))
# Two different weights
w_c = torch.randn((hidden_size, input_size))
w_i = torch.randn((hidden_size, input_size))
# Concatenated weights into one tensor
# with size:[2 * hidden_size, input_size]
w_combined = torch.cat((w_c, w_i), dim=0)
# Output calculated by using separate matrix multiplications
out_c = torch.matmul(w_c, input.transpose(0, 1))
out_i = torch.matmul(w_i, input.transpose(0, 1))
# One bigger matrix multiplication with the combined weights
out_combined = torch.matmul(w_combined, input.transpose(0, 1))
# The first hidden_size number of rows belong to w_c
out_combined_c = out_combined[:hidden_size]
# The second hidden_size number of rows belong to w_i
out_combined_i = out_combined[hidden_size:]
# Using torch.allclose because they are equal besides floating point errors.
torch.allclose(out_c, out_combined_c) # => True
torch.allclose(out_i, out_combined_i) # => True
By setting the output size of the linear layer to 4 * hidden_size there are four weight blocks of size hidden_size, so only one layer is needed instead of four. There is not really an advantage to doing this, except maybe a minor performance improvement, mostly for smaller inputs that don't fully exhaust the parallelisation capabilities when done individually.
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
That's where the outputs are separated to correspond to the output of the four individual calculations. The output is the concatenation of [i_t; f_t; o_t; g_t] (not including tanh and sigmoid respectively).
You can get the same separation by splitting the output into four chunks with torch.chunk:
i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
But after the separation you would have to apply torch.sigmoid to i_t, f_t and o_t, and torch.tanh to g_t.
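As a small self-contained sketch of that post-processing (the shapes here are made up for illustration; hidden_size = 10 and a batch of 2):
import torch

hidden_size = 10
preact = torch.randn(2, 4 * hidden_size)  # stand-in for self.i2h(x) + self.h2h(h)
i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
g_t = torch.tanh(g_t)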
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws in the LSTM gate equations.
The W parameters are the weights in the linear layer self.i2h, and the U parameters are the weights in the linear layer self.h2h, concatenated.
W_i, W_f, W_o, W_c = torch.chunk(self.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(self.h2h.weight, 4, dim=0)
3- Why do we use view for h, c, and x in the forward method?
Based on h_t = h_t.view(1, h_t.size(0), -1) towards the end, the hidden states have the size [1, batch_size, hidden_size]. With h = h.view(h.size(1), -1) the first singleton dimension is removed, giving size [batch_size, hidden_size]. The same could be achieved with h.squeeze(0).
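A tiny illustration (batch_size and hidden_size chosen arbitrarily here):
import torch

h = torch.zeros(1, 8, 16)  # [1, batch_size, hidden_size]
print(h.view(h.size(1), -1).shape)  # torch.Size([8, 16])
print(h.squeeze(0).shape)           # torch.Size([8, 16])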
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
Parameter initialisation can have a big impact on the model's learning capability. The general rule for the initialisation is to have values close to zero without being too small. A common initialisation is to draw from a normal distribution with mean 0 and variance of 1 / n, where n is the number of neurons, which in turn means a standard deviation of 1 / sqrt(n).
In this case a uniform distribution is used instead of a normal distribution, but the general idea is similar: the minimum/maximum value is determined from the number of neurons, while avoiding making it too small. If the minimum/maximum value were 1 / n, the values would get very small, so 1 / sqrt(n) is more appropriate, e.g. for 256 neurons: 1 / 256 ≈ 0.0039 whereas 1 / sqrt(256) = 0.0625.
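A short sketch mirroring reset_parameters above (the 256 neurons are just an example):
import math
import torch

hidden_size = 256
std = 1.0 / math.sqrt(hidden_size)  # 0.0625
w = torch.empty(4 * hidden_size, hidden_size).uniform_(-std, std)
print(w.min().item() >= -std, w.max().item() <= std)  # True True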
Initializing neural networks provides some explanations of different initialisations with interactive visualisations.

Variable batch_size in call function

I am trying to implement an attention network with TensorFlow 2. For every image, I want to take only a few glimpses, i.e. small parts of the image. For this I have implemented a subclass of tensorflow.keras.models.Model; here is a snippet of it.
class RecurrentAttentionModel(models.Model):
    # ...
    def call(self, inputs):
        l = tf.random.uniform((40, 2), minval=0, maxval=1)
        for _ in range(0, self.glimpses):
            glimpse = tf.image.extract_glimpse(inputs, size=(self.retina_size, self.retina_size), offsets=l, centered=False, normalized=True)
            # some other code...
            # update l to take a glimpse somewhere else
        return result
Now, the code above works and trains perfectly, but my issue is that I have the batch_size of 40, which I defined in my dataset, hardcoded in it. I am not able to read/get the batch_size in the call method, since the variable inputs is of the form Tensor("input_1_77:0", shape=(None, 250, 500, 1), dtype=float32), where the None for the batch_size seems to be expected behavior.
When I just initialize l with the following code (without the batch_size)
l = tf.random.uniform((2,), minval=0, maxval=1)
it throws this error
ValueError: Shape must be rank 2 but is rank 1 for 'recurrent_attention_model_86/ExtractGlimpse' (op: 'ExtractGlimpse') with input shapes: [?,250,500,1], [2], [2]
which I totally understand, but I have no idea how I could set the initial values according to the batch_size.
You can extract the batch size dimension dynamically by using tf.shape.
l = tf.random.uniform(tf.stack([tf.shape(inputs)[0], 2]), minval=0, maxval=1)
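A small sketch of how that could look inside call() (assuming the same attribute names self.glimpses and self.retina_size as in the question):
def call(self, inputs):
    # tf.shape(inputs)[0] is evaluated at run time, so it works even though
    # the static batch dimension is None
    l = tf.random.uniform(tf.stack([tf.shape(inputs)[0], 2]), minval=0, maxval=1)
    for _ in range(self.glimpses):
        glimpse = tf.image.extract_glimpse(inputs, size=(self.retina_size, self.retina_size), offsets=l, centered=False, normalized=True)
        # some other code...
        # update l to take a glimpse somewhere else
    return result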

How can I properly initialize the needed vectors in a Neural Network Model?

I'm trying to implement a Neural Network Model from scratch in Python (using NumPy). For reference, I'm using Chapter e-7 of the book Learning from Data, by Professor Abu-Mostafa, as theoretical support.
One of the first problems that I'm facing is how to correctly initialize the matrix of weights and the vectors of inputs and outputs (W, x and s, respectively).
Here is the my approach:
Let L be the number of layers (you do not count the 'first' layer; i.e., the layer of the vector x plus 'bias').
Let d be the dimension of the hidden layers (I'm assuming that all hidden layers have the same number of nodes).
Let out be the number of nodes at the last layer (it is typically 1).
Now, here is how I defined the matrix and vectors of interest:
Let w_ be the vector of weights. Actually, it is a vector in which each component is a matrix of the form W^{(L)}; its (i, j)-th value is the w_{i, j}^{(L)} term.
Let x_ be the vector of inputs.
Let s_ be the vector of outputs; you may see s_ as numpy.dot(W^{L}.T, x^{L-1}).
The following image summarizes what I've just described:
The problem arises from the fact that the dimensions of each layer (input, hidden layers and output) are NOT the same. What I was trying to do is to split each vector into different variables; however, working with them in the following steps of the algorithm is extremely difficult (the indexing becomes a mess). Here is the piece of code that replicates my attempt:
import numpy as np

class NeuralNetwork:
    """
    Neural Network Model
    """
    def __init__(self, L, d, out):
        self.L = L      # number of layers
        self.d = d      # dimension of hidden layers
        self.out = out  # dimension of the output layer

    def initialize_(self, X):
        # Initialize the vector of inputs
        self.x_ = np.zeros((self.L - 1) * (self.d + 1)).reshape(self.L - 1, self.d + 1)
        self.xOUT_ = np.zeros(1 * self.out).reshape(1, self.out)
        # Initialize the vector of outputs
        self.s_ = np.zeros((self.L - 1) * (self.d)).reshape(self.L - 1, self.d)
        self.sOUT_ = np.zeros(1 * self.out).reshape(1, self.out)
        # Initialize the vector of weights
        self.wIN_ = np.random.normal(0, 0.1, 1 * (X.shape[1] + 1) * self.d).reshape(1, X.shape[1] + 1, self.d)
        self.w_ = np.random.normal(0, 0.1, (self.L - 2) * (self.d + 1) * self.d).reshape(self.L - 2, self.d + 1, self.d)
        self.wOUT_ = np.random.normal(0, 0.1, 1 * (self.d + 1) * self.out).reshape(1, self.d + 1, self.out)

    def fit(self, X, y):
        self.initialize_(X)
Whenever IN or OUT appears in the code, that is my way of dealing with the different dimensions of the input and output layers, respectively.
Clearly, this is NOT a good way to do it. So my question is: How can I work with these different dimensional vectors (with respect to each layer) in a clever way?
For example, after initializing them, I want to reproduce the following algorithm (forward-propagation); you will see that, with my way of indexing things, it becomes almost impossible:
x^{(l)} = \theta(s^{(l)}) = \theta((W^{(l)})^T x^{(l-1)}), for l = 1, ..., L,
where \theta(s) = \tanh(s).
P.S.: I also tried to create an array of arrays (or an array of lists), but if I do that, my indexes become useless; they no longer represent what I wanted them to represent.
You could encapsulate the neuron logic and let the neurons perform the calculations individually:
class Neuron:
    def __init__(self, I, O, b):
        self.I = I  # input neurons from previous layer
        self.O = O  # output neurons in next layer
        self.b = b  # bias

    def activate(self, X):
        output = np.dot(self.I, X) + self.b
        ...
        return theta(output)
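A hypothetical usage sketch (theta and the weight vector passed as I are my assumptions, since the answer leaves them open):
import numpy as np

def theta(s):
    return np.tanh(s)

# one neuron with three incoming weights and a bias
n = Neuron(I=np.random.normal(0, 0.1, size=3), O=None, b=0.0)
print(n.activate(np.array([0.5, -0.2, 0.1])))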

Efficient way to avoid modifying parameter by inplace operation

I have a model which has noisy linear layers (for which you can sample values from a mu and sigma parameter) and need to create two decorrelated outputs of it.
This means I have something like:
model.sample_noise()
output_1 = model(input)
with torch.no_grad():
    model.sample_noise()
    output_2 = model(input)
sample_noise actually modifies weights attached to the model according to a normal distribution.
But in the end this leads to
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation
The question actually is: what's the best way to avoid modifying these parameters? I could deepcopy the model every iteration and then use it for the second forward pass, but that does not sound very efficient to me.
If I understand your problem correctly, you want to have a linear layer with matrix M and then create two outputs
y_1 = (M + μ_1) * x + b
y_2 = (M + μ_2) * x + b
where μ_1, μ_2 ~ P. The simplest way would be, in my opinion, to create a custom class
import torch
import torch.nn.functional as F
from torch import nn

class NoisyLinear(nn.Module):
    def __init__(self, n_in, n_out):
        super(NoisyLinear, self).__init__()
        # or any other initialization you want
        self.weight = nn.Parameter(torch.randn(n_out, n_in))
        self.bias = nn.Parameter(torch.randn(n_out))

    def sample_noise(self):
        # implement your noise generation here
        return torch.randn(*self.weight.shape) * 0.01

    def forward(self, x):
        noise = self.sample_noise()
        return F.linear(x, self.weight + noise, self.bias)

nl = NoisyLinear(4, 3)
x = torch.randn(2, 4)
y1 = nl(x)
y2 = nl(x)
print(y1, y2)
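A short follow-up sketch of the two-pass pattern from the question, assuming the NoisyLinear module above: fresh noise is drawn inside forward and the parameters themselves are never modified in place, so the second pass can simply run under torch.no_grad():
loss = y1.sum()  # any loss on the first output
loss.backward()  # gradients flow into nl.weight and nl.bias
with torch.no_grad():
    y3 = nl(x)  # an independent noise sample; no graph is built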
