My model predicts values by minimising a loss function L. However, the loss function does not have a single global minimum; instead there are a large number of points at which it attains the global minimum.
The model is set up like this:
The model input is an [n x n] tensor (say inp=[ [i_11, i_12, i_13, ..., i_1n],[i_21, i_22, ..., i_2n],...,[i_n1,i_n2, ..., i_nn] ]) and the model output is an [n x 1] tensor (say out1=[o_1, o_2,..., o_n ]).
The output tensor out1 is passed through a function f to get out2 (say f(o_1, o_2, o_3,..., o_n)=[O_1, O_2, O_3, ..., O_n]).
These two tensors (out1 and out2) are compared with MSELoss, i.e. Loss = ||out1 - out2||^2.
Now, there are many values of [o_1, o_2, ..., o_n] for which the Loss reaches its minimum.
But I want the values of [o_1, o_2, ..., o_n] for which |o_1| + |o_2| + |o_3| + ... + |o_n| is maximum.
Right now, the weights are initialised randomly:
self.weight = torch.nn.parameter.Parameter(torch.FloatTensor(in_features, out_features)) for some value of in_features and out_features
But by doing this, I am getting the values of [o_1, o_2, ..., o_n] for which |o_1| + |o_2| + |o_3| + ... + |o_n| is minimum.
I know this problem can be solved without using deep learning, but I am trying to get results like this for some task computation.
Is there a way to change this to get the largest values predicted at the output of the neural net?
Or is there any other technique (backpropagation change) to change it to get the desired largest valued output?
Thanks in advance.
EDIT 1:
Based on the answer, out1=[o_1, o_2,..., o_n ] tends to a zero-valued tensor. In the initial epochs, out2=[O_1, O_2, O_3, ..., O_n] takes very large values, but subsequently comes down to lower values.
A snippet of code below will give the idea:
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class Model(nn.Module):
    def __init__(self, inp_l, hid_l, out_l=1):
        super(Model, self).__init__()
        self.lay1 = nn.Linear(inp_l, hid_l)
        self.lay2 = nn.Linear(hid_l, out_l)
        self.dp = nn.Dropout(p=0.5)

    def forward(self, inp):
        self.out1 = torch.tensor([]).float()
        for row in range(inp.shape[0]):
            y = self.lay1(inp[row])
            y = F.relu(y)
            y = self.dp(y.float())
            y = self.lay2(y)
            y = F.relu(y)
            self.out1 = torch.cat((self.out1, y))
        return self.out1.view(inp.shape[0], -1)

def function_f(inp, out1):
    '''
    Some functional computation is done to return out2.
    '''
    return out2

def train_model(epoch):
    model.train()
    t = time.time()
    optimizer.zero_grad()
    out1 = model(inp)
    out2 = function_f(inp, out1)
    loss1 = ((out1 - out2) ** 2).mean()
    loss2 = -out1.abs().mean()
    loss_train = loss1 + loss2
    loss_train.backward(retain_graph=True)
    optimizer.step()
    if epoch % 40 == 0:
        print('Epoch: {:04d}'.format(epoch + 1),
              'loss_train: {:.4f}'.format(loss_train.item()),
              'time: {:.4f}s'.format(time.time() - t))

model = Model(inp_l=10, hid_l=5, out_l=1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
inp = torch.randint(100, (10, 10)).float()  # cast to float so nn.Linear accepts the integer-sampled data
for ep in range(100):
    train_model(ep)
But out1 goes to the trivial solution, i.e. the zero-valued tensor, which is the minimum-valued solution. As mentioned before the EDIT, I want to get the max-valued solution.
Thank you.
I am not sure I understand what you want.
Your weight initialization is overly complicated as well; you may just do:
self.weight = torch.nn.Linear(in_features, out_features)
If you want to have the largest value of a batch of inputs you may simply do:
y = self.weight(x)
return y.max(dim=0)[0]
But I am not entirely sure that is what you meant with your question.
EDIT:
It seems you have two objectives. The first thing I would try is to convert both of them into losses to be minimized by the optimizer.
loss1 = MSE(out1, out2)
loss2 = - out1.abs().mean()
loss = loss1 + loss2
Minimizing loss will simultaneously minimize the MSE between out1 and out2 and maximize the absolute values of out1 (minimizing - out1.abs().mean() is the same as maximizing out1.abs().mean()).
Notice that it is possible your neural net will just create large biases and zero out the weights as a lazy solution for the objective. You may turn off the biases to avoid that problem, but I would still expect some other training problems.
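In PyTorch that is just the bias flag of the linear layers. A minimal sketch, reusing the layer names from the snippet in the question:

self.lay1 = nn.Linear(inp_l, hid_l, bias=False)
self.lay2 = nn.Linear(hid_l, out_l, bias=False)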
I was looking for an implementation of an LSTM cell in Pytorch that I could extend, and I found an implementation of it in the accepted answer here. I will post it here because I'd like to refer to it. There are quite a few implementation details that I do not understand, and I was wondering if someone could clarify.
import math
import torch as th
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.i2h = nn.Linear(input_size, 4 * hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, x, hidden):
        h, c = hidden
        h = h.view(h.size(1), -1)
        c = c.view(c.size(1), -1)
        x = x.view(x.size(1), -1)

        # Linear mappings
        preact = self.i2h(x) + self.h2h(h)

        # activations
        gates = preact[:, :3 * self.hidden_size].sigmoid()
        g_t = preact[:, 3 * self.hidden_size:].tanh()
        i_t = gates[:, :self.hidden_size]
        f_t = gates[:, self.hidden_size:2 * self.hidden_size]
        o_t = gates[:, -self.hidden_size:]

        c_t = th.mul(c, f_t) + th.mul(i_t, g_t)
        h_t = th.mul(o_t, c_t.tanh())

        h_t = h_t.view(1, h_t.size(0), -1)
        c_t = c_t.view(1, c_t.size(0), -1)
        return h_t, (h_t, c_t)
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)?
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
3- Why do we use view for h, c, and x in the forward method?
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:
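(The image of the equations did not survive here. For reference, these are the standard LSTM cell equations that the code above implements:)

i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o)
g_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t * c_{t-1} + i_t * g_t
h_t = o_t * tanh(c_t)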
1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)?
In the equations you have included, the input x and the hidden state h are used for four calculations, each of which is a matrix multiplication with a weight. Whether you do four matrix multiplications, or concatenate the weights and do one bigger matrix multiplication and separate the results afterwards, the result is the same.
import torch

input_size = 5
hidden_size = 10
input = torch.randn((2, input_size))
# Two different weights
w_c = torch.randn((hidden_size, input_size))
w_i = torch.randn((hidden_size, input_size))
# Concatenated weights into one tensor
# with size:[2 * hidden_size, input_size]
w_combined = torch.cat((w_c, w_i), dim=0)
# Output calculated by using separate matrix multiplications
out_c = torch.matmul(w_c, input.transpose(0, 1))
out_i = torch.matmul(w_i, input.transpose(0, 1))
# One bigger matrix multiplication with the combined weights
out_combined = torch.matmul(w_combined, input.transpose(0, 1))
# The first hidden_size number of rows belong to w_c
out_combined_c = out_combined[:hidden_size]
# The second hidden_size number of rows belong to w_i
out_combined_i = out_combined[hidden_size:]
# Using torch.allclose because they are equal besides floating point errors.
torch.allclose(out_c, out_combined_c) # => True
torch.allclose(out_i, out_combined_i) # => True
By setting the output size of the linear layer to 4 * hidden_size there are four weights of size hidden_size, so only one layer is needed instead of four. There is not really an advantage to doing this, except maybe a minor performance improvement, mostly for smaller inputs that don't fully exhaust the parallelisation capabilities when done individually.
4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?
That's where the outputs are separated to correspond to the outputs of the four individual calculations. The output is the concatenation of [i_t; f_t; o_t; g_t] (before sigmoid and tanh are applied, respectively).
You can get the same separation by splitting the output into four chunks with torch.chunk:
i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
But after the separation you would have to apply torch.sigmoid to i_t, f_t and o_t, and torch.tanh to g_t.
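That is, continuing the torch.chunk variant above:

i_t = torch.sigmoid(i_t)
f_t = torch.sigmoid(f_t)
o_t = torch.sigmoid(o_t)
g_t = torch.tanh(g_t)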
5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:
The parameters W are the weights in the linear layer self.i2h and U in the linear layer self.h2h, but concatenated.
W_i, W_f, W_o, W_c = torch.chunk(self.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(self.h2h.weight, 4, dim=0)
3- Why do we use view for h, c, and x in the forward method?
Based on h_t = h_t.view(1, h_t.size(0), -1) towards the end, the hidden states have the size [1, batch_size, hidden_size]. With h = h.view(h.size(1), -1) that gets rid of the leading singleton dimension, giving size [batch_size, hidden_size]. The same could be achieved with h.squeeze(0).
2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?
Parameter initialisation can have a big impact on the model's learning capability. The general rule for the initialisation is to have values close to zero without being too small. A common initialisation is to draw from a normal distribution with mean 0 and variance of 1 / n, where n is the number of neurons, which in turn means a standard deviation of 1 / sqrt(n).
In this case it uses a uniform distribution instead of a normal distribution, but the general idea is similar: determine the minimum/maximum value based on the number of neurons while avoiding values that are too small. If the bound were 1 / n the values would get very small, so using 1 / sqrt(n) is more appropriate, e.g. with 256 neurons: 1 / 256 = 0.0039 whereas 1 / sqrt(256) = 0.0625.
Initializing neural networks provides some explanations of different initialisations with interactive visualisations.
I'm trying to implement a Neural Network Model from scratch in Python (using NumPy). For reference, I'm using Chapter e-7 of this book (Learning From Data, by Professor Abu-Mostafa) as theoretical support.
One of the first problems that I'm facing is how to correctly initialize the matrix of weights and the vectors of inputs and outputs (W, x and s, respectively).
Here is my approach:
Let L be the number of layers (you do not count the 'first' layer; i.e., the layer of the vector x plus 'bias').
Let d be the dimension of the hidden layers (I'm assuming that all hidden layers have the same number of nodes).
Let out be the number of nodes at the last layer (it is typically 1).
Now, here is how I defined the matrix and vectors of interest:
Let w_ be the vector of weights. Actually, it is a vector in which each component is a matrix of the form W^{(L)}. Here, the (i, j)-th entry is the term w_{i, j}^{(L)}.
Let x_ be the vector of inputs.
Let s_ be the vector of outputs; you may see s_ as numpy.dot(W^{L}.T, x^{L-1}).
[The image summarizing the setup just described (the per-layer shapes of w_, x_ and s_) is not included here.]
The problem arises from the fact that the dimensions of each layer (input, hidden layers and output) are NOT the same. What I was trying to do is to split each vector into different variables; however, working with them in the following steps of the algorithm is extremely difficult (because of how the indexes become a mess). Here is the piece of code that replicates my attempt:
import numpy as np

class NeuralNetwork:
    """
    Neural Network Model
    """
    def __init__(self, L, d, out):
        self.L = L      # number of layers
        self.d = d      # dimension of hidden layers
        self.out = out  # dimension of the output layer

    def initialize_(self, X):
        # Initialize the vector of inputs
        self.x_ = np.zeros((self.L - 1) * (self.d + 1)).reshape(self.L - 1, self.d + 1)
        self.xOUT_ = np.zeros(1 * self.out).reshape(1, self.out)
        # Initialize the vector of outputs
        self.s_ = np.zeros((self.L - 1) * (self.d)).reshape(self.L - 1, self.d)
        self.sOUT_ = np.zeros(1 * self.out).reshape(1, self.out)
        # Initialize the vector of weights
        self.wIN_ = np.random.normal(0, 0.1, 1 * (X.shape[1] + 1) * self.d).reshape(1, X.shape[1] + 1, self.d)
        self.w_ = np.random.normal(0, 0.1, (self.L - 2) * (self.d + 1) * self.d).reshape(self.L - 2, self.d + 1, self.d)
        self.wOUT_ = np.random.normal(0, 0.1, 1 * (self.d + 1) * self.out).reshape(1, self.d + 1, self.out)

    def fit(self, X, y):
        self.initialize_(X)
Whenever IN or OUT appear in the code, that is my way to deal with the differences of dimension between the input and output layers, respectively.
Clearly, this is NOT a good way to do it. So my question is: How can I work with these different dimensional vectors (with respect to each layer) in a clever way?
For example, after initializing them, I want to reproduce the following algorithm (forward-propagation) - you will see that, with my way of indexing things, it becomes almost impossible:
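(The image of the algorithm is missing here; reconstructed from the definitions above, the forward pass is roughly:)

x^{(0)} = [1, x]
for l = 1, ..., L:
    s^{(l)} = (W^{(l)})^T x^{(l-1)}
    x^{(l)} = [1, \theta(s^{(l)})]   (no bias entry appended at the final layer)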
Where \theta(s) = \tanh(s).
P.S.: I also tried to create an array of arrays (or an array of lists), but if I do that, my indexes become useless - they no longer represent what I wanted them to represent.
You could encapsulate the neuron logic and let the neurons perform the calculations individually:

import numpy as np

def theta(s):
    # the activation from the question, theta(s) = tanh(s)
    return np.tanh(s)

class Neuron:
    def __init__(self, I, O, b):
        self.I = I  # input weights from the previous layer
        self.O = O  # output neurons in next layer
        self.b = b  # bias

    def activate(self, X):
        output = np.dot(self.I, X) + self.b
        ...
        return theta(output)
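Alternatively, as a minimal sketch in the question's own notation: since the per-layer matrices have different shapes anyway, keep them in a plain Python list indexed by layer and loop over it; the indexes then match the book's, and nothing has to fit into one rectangular array:

import numpy as np

def init_weights(dims, scale=0.1):
    # dims = [d_0, d_1, ..., d_L]; W[l] has shape (dims[l] + 1, dims[l + 1]),
    # where the extra row holds the weight of the constant bias input x_0 = 1
    return [np.random.normal(0, scale, (dims[l] + 1, dims[l + 1]))
            for l in range(len(dims) - 1)]

def forward(x, weights):
    for W in weights:
        x = np.concatenate(([1.0], x))  # prepend the bias coordinate
        x = np.tanh(W.T @ x)            # s = W^T x, then theta(s) = tanh(s)
    return x

# usage: 2 inputs, two hidden layers of width 3, one output
W = init_weights([2, 3, 3, 1])
print(forward(np.array([0.5, -1.0]), W))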
I'm working on a school project and am stuck on how to implement backpropagation in Numpy with the current forward prop structure I have. The aim of this script is to make a simple dynamic (meaning any number of layers and nodes) fully connected network using only numpy.
I think that I have to find the derivatives of the activation functions and multiply them by the original error, as well as by the derivative of each activation function I encounter moving backward.
However, I'm having trouble figuring out how to implement this correctly in my script.
It'd be a great help if someone could explain in English what exactly I have to do given the complexities of the setup here, or even give a recommendation for a video/post that deals with dynamic-size backprop.
Right now all the weights and biases are being stored in lists for future backprop, and I'm able to get the error for each output with the small amount of code currently in the backprop function.
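For reference, the recursion I believe I have to implement is (in generic notation, with delta^{(l)} the error signal at layer l, * element-wise multiplication, s^{(l)} the pre-activations and x^{(l-1)} the stored layer inputs):

delta^{(L)} = 2 * (y_hat - Y) * theta'(s^{(L)})            # output layer, from the MSE
delta^{(l)} = (W^{(l+1)} delta^{(l+1)}) * theta'(s^{(l)})  # hidden layers, moving backward
dE/dW^{(l)} = x^{(l-1)} (delta^{(l)})^T                    # gradient for the weight update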
This code block
#initialize a test model w/ 128 batch and lr of 0.01
model = Model(128, 0.01)

#simple x data input
X = np.array([[1, 1], [0, 0], [12, 5]])
Y = np.array([[1], [0], [-1]])

#adding 2 layers
z = model.add(X, 3, "sigmoid")
z = model.add(z, 1, "sigmoid", output=True)

#this is a full forward pass through the layers
z = model.predict(X)
print(z)

#this is the error of the predictions
print(model.backprop(z, Y))
Outputs the following vectors:
[[0.50006457]
[0.50006459]
[0.50006431]]
[[0.24993544]
[0.2500646 ]
[2.25019293]]
Like I said, not sure how to move forward ( or backward ;) ) from here.
Below is the full script needed to run the example:
import math
import numpy as np

#everything below is defining activation functions
#--------------------------------------------------------------------------------------------
def b_relu(input):
    return max(0, input)

def bd_relu(input):
    if input <= 0:
        return 0
    else:
        return 1

def b_sigmoid(x):
    return 1 / (1 + math.exp(-x))

def bd_sigmoid(input):
    return b_sigmoid(input) * (1 - b_sigmoid(input))

def b_tanh(input):
    top = (math.exp(input) - math.exp(-input))
    bottom = (math.exp(input) + math.exp(-input))
    return (top / bottom)

#helper functions for tanh
def cosh(input):
    return ((math.exp(input) + math.exp(-input)) / 2)

def sinh(input):
    return ((math.exp(input) - math.exp(-input)) / 2)

def bd_tanh(input):
    # d/dx tanh(x) = 1 / cosh(x)^2
    return 1 / math.pow(cosh(input), 2)

def b_softmax(z):
    # subtracting the max adds numerical stability
    shiftx = z - np.max(z, axis=1)[:, np.newaxis]
    exps = np.exp(shiftx)
    return exps / np.sum(exps, axis=1)[:, np.newaxis]

def bd_softmax(Y_hat, Y):
    return Y_hat - Y

def b_linear(input):
    return input

def bd_linear(input):
    return 1

#vectorizing the activation and deriv. activation functions
relu = np.vectorize(b_relu)
d_relu = np.vectorize(bd_relu)
sigmoid = np.vectorize(b_sigmoid)
d_sigmoid = np.vectorize(bd_sigmoid)
tanh = np.vectorize(b_tanh)
d_tanh = np.vectorize(bd_tanh)
softmax = b_softmax  # softmax works on whole rows, so it must not be vectorized element-wise
d_softmax = bd_softmax
linear = np.vectorize(b_linear)
d_linear = np.vectorize(bd_linear)
#dispatch the activation by the name stored in self.A
#(this helper was missing from the original script)
def activate(Z, act):
    funcs = {"relu": relu, "sigmoid": sigmoid, "tanh": tanh,
             "softmax": softmax, "linear": linear}
    return funcs[act](Z)

class Model:
    def __init__(self, batch, lr):
        #initializing self lists to keep track of stuff for batches, forward prop & backprop
        self.batch = batch
        self.lr = lr
        self.W = []
        self.B = []
        self.A = []
        self.Z = []
        self.X = []
        self.layers = []
        self.tempW = []
        self.tempB = []
        #store error for backprop
        self.output_error = []

    #initialize the weights during 'model.add' so we can test our network shapes dynamically w/out model.compile
    #added an output bool here so we can make sure the shape of the output network is (1,n)
    def initial_weights(self, input_data, output_shape, output=False):
        B = np.zeros((1, output_shape))
        #assigning the shape
        W = np.random.uniform(-1e-3, 1e-3, size=(input_data.shape[len(input_data.shape) - 1], output_shape))
        self.B.append(B)
        self.W.append(W)

    def add(self, input_data, output_shape, activation, output=False):
        #append to layers so we have a correct index value
        self.layers.append(69)
        #making sure our data is in a numpy array
        if type(input_data) == np.ndarray:
            X = input_data
        else:
            X = np.asarray(input_data)
        #adding data and activations to self lists
        self.X.append(X)
        self.A.append(activation)
        #keep track of our index & initializing random weights for dynamic compatibility testing
        index = len(self.layers) - 1
        self.initial_weights(input_data, output_shape, output=False)
        X2 = self.forward(input_data, index)
        #printing layer info
        print("Layer:", index)
        print("Input Shape: ", X.shape)
        print("Weight Shape: ", self.W[index].shape)
        print("Output Shape: ", X2.shape)
        print(" ")
        return X2

    def forward(self, input_data, index):
        #pulling weights and biases from main lists for operations
        B = self.B[index]
        W = self.W[index]
        #matmul of data @ weights + bias (the matmul already sums the inputs to each activation node)
        Z = np.matmul(input_data, W) + B
        #pulling activation from index
        act = str(self.A[index])
        #activating
        Z = activate(Z, act)
        #keeping track of Z for backprop
        self.Z.append(Z)
        return Z

    def predict(self, input_data):
        for x in range(len(self.layers)):
            z = self.forward(input_data, x)
            input_data = z
        return z

    def backprop(self, model_output, ground_truth):
        #------------------------------
        #now begins the backprop portion
        #let's start with finding the error between predictions and actual values
        #gonna do MSE to keep it simple
        self.output_error = (ground_truth - model_output) ** 2
        #so now we have the error of the output layer, this tells us two things: how wrong we were, and in which direction we should update
        #the outputs of these nodes
        '''
        What to do if this was linear regression (for m & b)
        1. Take the error and multiply it by the transpose of the last layer weights
        (I think the error in this case is where the prime activation function should be if we had activations)
        2. The last layer bias is just the error
        3. The second to last layer inputs is the bias times the transpose of second layers weights
        4. Then I have no idea
        '''
        return self.output_error
I am forward-propagating and backpropagating tensor data X through two simple nn.Module PyTorch model instances, model1 and model2.
I can't get this process to work without using the deprecated Variable API.
So this works just fine:
y1 = model1(X)
v = Variable(y1.data, requires_grad=training) # It's all about this line!
y2 = model2(v)
criterion = nn.NLLLoss()
loss = criterion(y2, y)
loss.backward()
y1.backward(v.grad)
self.step()
But this will throw an error:
y1 = model1(X)
y2 = model2(y1)
criterion = nn.NLLLoss()
loss = criterion(y2, y)
loss.backward()
y1.backward(y1.grad) # it breaks here
self.step()
>>> RuntimeError: grad can be implicitly created only for scalar outputs
I just can't seem to find a relevant difference between v in the first implementation and y1 in the second. In both cases requires_grad is set to True. The only thing I could find was that y1.grad_fn=<ThnnConv2DBackward> and v.grad_fn=<ThnnConv2DBackward>.
What am I missing here? What (tensor attributes?) do I not know about, and if Variable is deprecated, what other implementation would work?
[UPDATED]
You are not correctly passing y1.grad into y1.backward in the second example. After the first backward, all the intermediate gradients are destroyed; you need a special hook to extract those gradients. In your case you are passing a None value. Here is a small example to reproduce your case:
Code:
import torch
import torch.nn as nn

torch.manual_seed(42)

class Model1(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x.pow(3)

class Model2(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x / 2

model1 = Model1()
model2 = Model2()
criterion = nn.MSELoss()

X = torch.randn(1, 5, requires_grad=True)
y = torch.randn(1, 5)

y1 = model1(X)
y2 = model2(y1)
loss = criterion(y2, y)

# We are going to backprop 2 times, so we need to
# retain_graph=True while first backward
loss.backward(retain_graph=True)

try:
    y1.backward(y1.grad)
except RuntimeError as err:
    print(err)
print('y1.grad: ', y1.grad)
Output:
grad can be implicitly created only for scalar outputs
y1.grad: None
So you need to extract them correctly:
Code:
def extract(V):
    """Gradient extractor.
    """
    def hook(grad):
        V.grad = grad
    return hook

model1 = Model1()
model2 = Model2()
criterion = nn.MSELoss()

X = torch.randn(1, 5, requires_grad=True)
y = torch.randn(1, 5)

y1 = model1(X)
y2 = model2(y1)
loss = criterion(y2, y)

y1.register_hook(extract(y1))
loss.backward(retain_graph=True)

print('y1.grad', y1.grad)
y1.backward(y1.grad)
Output:
y1.grad: tensor([[-0.1763, -0.2114, -0.0266, -0.3293, 0.0534]])
After some investigation I came to the following two solutions.
The solution provided elsewhere in this thread retained the computation graph manually, with no option to free it, thus running fine initially but causing OOM errors later on.
The first solution is to tie the models together using the built-in torch.nn.Sequential, as such:
model = torch.nn.Sequential(Model1(), Model2())
It's as easy as that. It looks clean and behaves exactly like an ordinary model would.
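Training it then looks exactly like training any single model; one optimizer over model.parameters() covers the parameters of both submodules (a sketch, assuming the models have learnable parameters and that criterion, X and y are set up as in the snippets above):

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
y_pred = model(X)
loss = criterion(y_pred, y)
loss.backward()
optimizer.step()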
The alternative is to simply tie them together manually:
model1 = Model1()
model2 = Model2()
y1 = model1(X)
y2 = model2(y1)
loss = criterion(y2, y)
loss.backward()
My fear that this would only backpropagate through model2 turned out to be unsubstantiated, since model1 is also stored in the computation graph that is backpropagated over.
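A quick sanity check for this (hypothetical, assuming model1 has learnable parameters and no optimizer has zeroed the grads yet):

print(all(p.grad is not None for p in model1.parameters()))  # True after loss.backward()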
This implementation enabled increased transparency of the interface between the two models, compared to the previous implementation.