How to get bias and neuron weights in optimizer? - python

In a TensorFlow optimizer (Python), the method _apply_dense gets called separately for the neuron weights (layer connections) and for the bias weights, but I would like to use both in this method.
def _apply_dense(self, grad, weight):
...
For example: a fully connected neural network with two hidden layers, two neurons per layer, and a bias for each.
If we take a look at layer 2, _apply_dense is called once for the neuron weights and once for the bias weights.
But I would need either both matrices in one call of _apply_dense, or a combined weight matrix that contains both the neuron and the bias weights. X_2X_4, B_1X_4, ... is just a notation for the weight of the connection between two neurons; B_1X_4 is simply the weight between B_1 and X_4.
How to do this?
MWE
For a minimal working example, here is a stochastic gradient descent optimizer implementation with momentum. For every layer, the momentum of all incoming connections from other neurons is reduced to their mean (see ndims == 2). What I need instead is the mean of not only the momentum values from the incoming neuron connections but also from the incoming bias connections (as described above).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, mu=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = mu
        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, weight):
        learning_rate_t = tf.cast(self._lr_t, weight.dtype.base_dtype)
        mu_t = tf.cast(self._mu_t, weight.dtype.base_dtype)
        momentum = self.get_slot(weight, "a")

        if momentum.get_shape().ndims == 2:  # neuron weights
            momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)
        elif momentum.get_shape().ndims == 1:  # bias weights
            momentum_mean = momentum
        else:
            momentum_mean = momentum

        momentum_update = grad + (mu_t * momentum_mean)
        momentum_t = tf.assign(momentum, momentum_update, use_locking=self._use_locking)

        weight_update = learning_rate_t * momentum_t
        weight_t = tf.assign_sub(weight, weight_update, use_locking=self._use_locking)

        return tf.group(*[weight_t, momentum_t])

    def _prepare(self):
        self._lr_t = tf.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = tf.convert_to_tensor(self._mu, name="momentum_term")
For a simple neural network: https://raw.githubusercontent.com/aymericdamien/TensorFlow-Examples/master/examples/3_NeuralNetworks/multilayer_perceptron.py (only change the optimizer to the custom SGDmomentum optimizer)
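For reference, plugging the custom optimizer into that script would look roughly like this (loss_op is the name of the loss tensor assumed from the linked example, so treat it as an assumption):

train_op = SGDmomentum(learning_rate=0.001, mu=0.9).minimize(loss_op)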

Update: I'll try to give a better answer (or at least some ideas) now that I have some understanding of your goal, but, as you suggest in the comments, there is probably no infallible way of doing this in TensorFlow.
Since TF is a general computation framework, there is no good way of determining which pairs of weights and biases belong together in a model (or whether it is a neural network at all). Here are some possible approaches to the problem that I can think of:
Annotating the tensors. This is probably not practical since you already said you have no control over the model, but an easy option would be to add extra attributes to the tensors to signify the weight/bias relationships. For example, you could do something like W.bias = B and B.weight = W, and then in _apply_dense check hasattr(weight, "bias") and hasattr(weight, "weight") (there may be some better designs in this sense).
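A minimal sketch of that annotation idea (the attribute names bias/weight are just the ones suggested above; relying on being able to attach arbitrary Python attributes to TF variable objects is an assumption worth verifying for your version):

import tensorflow as tf

# When building the model, tag each weight matrix with its bias (and vice versa).
W = tf.Variable(tf.random_normal([2, 2]), name="W")
B = tf.Variable(tf.zeros([2]), name="B")
W.bias = B      # mark B as the bias belonging to W
B.weight = W    # mark W as the weight matrix belonging to B

# Inside _apply_dense you could then branch on the annotation:
def describe(var):
    if hasattr(var, "bias"):
        return "neuron weights (matching bias available via var.bias)"
    if hasattr(var, "weight"):
        return "bias weights (matching weights available via var.weight)"
    return "unknown variable"

print(describe(W))  # neuron weights (matching bias available via var.bias)
print(describe(B))  # bias weights (matching weights available via var.weight)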
You can look into some framework built on top of TensorFlow where you may have better information about the model structure. For example, Keras is a layer-based framework that implements its own optimizer classes (based on TensorFlow or Theano). I'm not too familiar with the code or its extensibility, but probably you have more tools there to use.
Detect the structure of the network yourself from the optimizer. This is quite complicated, but theoretically possible. From the loss tensor passed to the optimizer, it should be possible to "climb up" the model graph to reach all of its nodes (taking the .op of the tensors and the .inputs of the ops). You could detect tensor multiplications and additions with variables and skip everything else (activations, loss computation, etc.) to determine the structure of the network; if the model does not match your expectations (e.g. there are no multiplications, or there is a multiplication without a later addition), you can raise an exception indicating that your optimizer cannot be used for that model.
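A rough sketch of that graph walk (this only illustrates the traversal; reliably pairing up weights and biases from the collected ops would take considerably more logic and is left open here):

def walk_graph(loss):
    # Climb from the loss tensor through op inputs, collecting MatMul/Add ops.
    seen, stack = set(), [loss.op]
    matmuls, adds = [], []
    while stack:
        op = stack.pop()
        if op.name in seen:
            continue
        seen.add(op.name)
        if op.type == "MatMul":
            matmuls.append(op)
        elif op.type in ("Add", "BiasAdd"):
            adds.append(op)
        stack.extend(t.op for t in op.inputs)
    return matmuls, adds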
Old answer, kept for reference.
I'm not 100% clear on what you are trying to do, so I'm not sure if this really answers your question.
Let's say you have a dense layer transforming an input of size M to an output of size N. According to the convention you show, you'd have an N × M weights matrix W and an N-sized bias vector B. Then, an input vector X of size M (or a batch of inputs of size M × K) would be processed by the layer as W · X + B, followed by the activation function (in the case of a batch, the addition would be a "broadcasted" operation). In TensorFlow:
X = ... # Input batch of size M x K
W = ... # Weights of size N x M
B = ... # Biases of size N
Y = tf.matmul(W, X) + B[:, tf.newaxis] # Output of size N x K
# Activation...
If you want, you can always put W and B together in a single extended weights matrix W*, basically adding B as an extra column of W, so W* would be N × (M + 1). Then you just need to append a constant 1 to the input vector X (or a new row of ones if it's a batch), so you would get X* with size M + 1 (or (M + 1) × K for a batch). The product W* · X* would then give you the same result as before. In TensorFlow:
X = ...  # Input batch of size M x K
W_star = ...  # Extended weights of size N x (M + 1)
# You can still have a "view" of the original W and B if you need it
W = W_star[:, :-1]  # All but the last column: the original weights
B = W_star[:, -1]   # The last column: the biases
X_star = tf.concat([X, tf.ones_like(X[:1])], axis=0)  # Append a row of ones
Y = tf.matmul(W_star, X_star)  # Output of size N x K
# Activation...
Now you can compute gradients and updates for weights and biases together. A drawback of this approach is that if you want to apply regularization then you should be careful to apply it only on the weights part of the matrix, not on the biases.
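For example, an L2 penalty restricted to the weights part could look like this (a sketch under the N × (M + 1) layout above; data_loss is a hypothetical placeholder for your task loss):

l2_loss = tf.reduce_sum(tf.square(W_star[:, :-1]))  # exclude the bias column
total_loss = data_loss + 0.01 * l2_loss  # data_loss: your task loss (assumption)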

Related

How to differentiate a gradient in PyTorch

I'm trying to differentiate a gradient in PyTorch. I found this link but can't get it to work.
My code looks as follows:
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim


class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(2, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 4)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.out(x)
        return x


nx = net_x()
r = torch.tensor([1.0, 2.0])
nx(r)
>>>tensor([-0.2356, -0.7315, -0.2100, -0.6741], grad_fn=<AddBackward0>)
But when I try to differentiate the function with respect to the first parameter
grad(nx, r[0])
I get the error
TypeError: 'net_x' object is not iterable
Update
Trying to extend this to tensors:
For some reason the gradient is the same for all inputs.
a = torch.rand((8, 2), requires_grad=True)
s = []
s_t = []
for input_tensor in a:
    output_tensor = nx(input_tensor)
    s.append(output_tensor[0])
    s_t_value = grad(output_tensor[0], input_tensor)[0][0]
    s_t.append(s_t_value)
print(s_t)
But the output is:
[tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326)]
The first thing to change, if you want the gradients with respect to r, is to set the requires_grad flag to True for this tensor:
nx = net_x()
r = torch.tensor([1.0,2.0], requires_grad=True)
Then, as explained in the autograd documentation, grad computes the gradients of outputs with respect to the inputs, so you need to save the output of the model:
y = nx(r)
Now you can compute the gradients with respect to r. But there is one last issue: grad only knows how to propagate gradients from a scalar tensor, which y is not. So you need to compute the gradients with respect to each coordinate:
for x in y:
    print(grad(x, r, retain_graph=True))
or equivalently:
for i in range(y.shape[0]):
    # prints the vector (dy_i/dr_0, dy_i/dr_1, ... dy_i/dr_n)
    print(grad(y[i], r, retain_graph=True))
You need retain_graph because, without this flag, the computational graph is cleared after the first gradient propagation. And there you have it: the derivative of each coordinate of nx(r) with respect to r!
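As an aside, newer PyTorch versions (1.5 and later) also provide torch.autograd.functional.jacobian, which collects all of these per-coordinate gradients in one call; a small sketch using the nx and r defined above:

from torch.autograd.functional import jacobian

J = jacobian(nx, r)  # J[i, j] = d y_i / d r_j, shape (4, 2)
print(J)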
To answer your question in the comments:
Not an error, it's normal. So you have a batched input of size (B, 2), with B = 8. You get a batched output of shape (B, 4). Now, for each vector of the batched output, for each coordinate of this vector, you can compute the derivative with respect to the batched input, which will yield a gradient of size (B, 2), like this:
for b in y:  # There are B vectors b of shape (4)
    for x in b:  # There are 4 coordinates
        # This prints a tensor of shape (B, 2)
        print(grad(x, r, retain_graph=True))
Now remember how batches work: all samples in a batch are computed together to harness the power of the GPU, but they are completely independent. So all the b vectors are actually results of the network from different inputs. This means the gradient of the i-th vector b with respect to the j-th vector of the input must be 0 if i != j. Does that make sense? It's like computing f(x, y) = (x^2, y^2): the derivative of y^2 with respect to x is obviously 0. Well, consider x and y to be two samples from one batch, and you have your explanation for why there are a lot of 0s in your results.
One last code sample to make it even clearer:
inputs = [torch.randn(1, 2, requires_grad=True) for i in range(8)]
r = torch.cat(inputs)  # shape: (8, 2)
y = nx(r)  # shape: (8, 4)
for i in range(len(y)):
    print(f"Gradients of y[{i}] wrt r[{i}]")
    for x in y[i]:
        # prints a tensor of size (2)
        print(grad(x, inputs[i], retain_graph=True))
On to why all the gradients are the same. This is because your neural network is completely linear: you have three nn.Linear layers and no non-linear activation function (as a consequence, this is literally equivalent to a network with only one layer). One property of linear layers is that their gradient is constant: d(alpha*x)/dx = alpha (independent of x). Therefore the gradients will be identical along all dimensions. Just add non-linear activation layers like sigmoids and this behavior will not happen again.
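For instance, a sketch of the forward method with ReLU activations inserted between the linear layers (any other non-linearity works too); with this change, the gradients will vary with the input:

import torch.nn.functional as F

def forward(self, x):
    x = F.relu(self.fc1(x))  # non-linearity: gradient now depends on x
    x = F.relu(self.fc2(x))
    return self.out(x)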

How to Create Multi-layer Neural Network

Thanks in advance for any help with this! I have been given the Python code for a simple single-layer perceptron, with the task to alter the code so it becomes a multi-layer perceptron. I'm still very new to all of this, but from what I understand the repeating feed-forward and back-propagation cycle is what creates the hidden layers. Given the following code, what should be altered to help create these hidden layers?
import numpy as np


# Creating a numerically stable logistic s-shaped definition to call
def sigmoid(x):
    x = np.clip(x, -500, 500)
    if x.any() >= 0:
        return 1 / (1 + np.exp(-x))
    else:
        return np.exp(x) / (1 + np.exp(x))


# define the dimensions and set the weights to random numbers
def init_parameters(dim1, dim2=1, std=1e-1, random=True):
    if random:
        return np.random.random([dim1, dim2]) * std
    else:
        return np.zeros([dim1, dim2])


# Single layer network: Forward Prop
# Passed in the weight vectors, bias vector, the input vector and the Y
def fwd_prop(W1, bias, X, Y):
    Z1 = np.dot(W1, X) + bias  # dot product of the weights and X + bias
    A1 = sigmoid(Z1)  # Uses sigmoid to create a predicted vector
    return A1


# Single layer network: Backprop
def back_prop(A1, W1, bias, X, Y):
    m = np.shape(X)[1]  # used to calculate the cost by the number of inputs -1/m
    # Cross entropy loss function
    cost = (-1/m) * np.sum(Y*np.log(A1) + (1-Y)*np.log(1-A1))  # cost of error
    dZ1 = A1 - Y  # subtract actual from pred weights
    dW1 = (1/m) * np.dot(dZ1, X.T)  # calc new weight vector
    dBias = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  # calc new bias vector
    grads = {"dW1": dW1, "dB1": dBias}  # Weight and bias vectors after backprop
    return grads, cost


def run_grad_desc(num_epochs, learning_rate, X, Y, n_1):
    n_0, m = np.shape(X)
    W1 = init_parameters(n_1, n_0, True)
    B1 = init_parameters(n_1, 1, True)
    loss_array = np.ones([num_epochs]) * np.nan  # resets the loss_array to NaNs
    for i in np.arange(num_epochs):
        A1 = fwd_prop(W1, B1, X, Y)  # get predicted vector
        grads, cost = back_prop(A1, W1, B1, X, Y)  # get gradient and the cost from BP
        W1 = W1 - learning_rate * grads["dW1"]  # update weight vector LR*gradient*[BP weights]
        B1 = B1 - learning_rate * grads["dB1"]  # update bias LR*gradient[BP bias]
        loss_array[i] = cost  # loss array gets cross ent values
    parameter = {"W1": W1, "B1": B1}  # assign
    return parameter, loss_array
We've also been asked to be able to adjust the number of nodes in a hidden layer. To be honest, I am completely lost here and am not clear on what the nodes even represent, so help here would be appreciated as well. Thanks, y'all.
It looks like this network is not even single layer. Normally "single layer" means one layer of hidden neurons. This network outputs the activations of what would normally be the hidden layer.
My advice to you would be to start studying the basics of neural networks. There are lots of great resources, including on Youtube. For backpropagation, a good place to start is here
Also note that if you are in a hurry, using autograd tools like Tensorflow or Pytorch takes care of the differentiation for you. Of course, if you are doing this to learn the details of neural networks, then building one from scratch is much better.
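That said, the usual change is to add a second weight matrix and apply the forward step twice. A minimal sketch of a two-layer forward pass reusing sigmoid from the question (the matching backprop needs the extra chain-rule terms, which are not shown; W1 is (n_1, n_0), W2 is (n_out, n_1), and "adjusting the nodes in a hidden layer" just means changing n_1):

def fwd_prop_2layer(W1, B1, W2, B2, X):
    Z1 = np.dot(W1, X) + B1   # input -> hidden layer with n_1 nodes
    A1 = sigmoid(Z1)          # hidden activations
    Z2 = np.dot(W2, A1) + B2  # hidden -> output
    A2 = sigmoid(Z2)          # predicted vector
    return A1, A2             # A1 must be kept for backprop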

How do you implement neurons in ANN?

From what I understand about neural networks, you have a number of hidden layers which each consist of X neurons. A neuron takes in a number of inputs and their respective weights, then, using an activation function (sigmoid in my case), gives an output.
My task is to implement a network from scratch (only using numpy), with 2 hidden layers, a sigmoid activation function, and 500 neurons in each hidden layer. What I don't understand is: how can I implement the concept of neurons? According to this article, one neuron is when all inputs are weighted and passed into the activation function. So do I feed in the same inputs, 500 times, with different weights each time (in the first layer, then again in the second)? I've also read this topic, where the following is said:
The neuron is nothing more than a set of inputs, a set of weights, and an activation function. The neuron translates these inputs into a single output, which can then be picked up as input for another layer of neurons later on.
So according to this, I indeed should weigh the inputs differently, 500 times, and then pass these forward to the next layer which will do the same. Am I understanding this correctly?
Here is the code I have written so far (it is very elementary but I did not want to proceed further before I clear this up), but have no idea how I would be implementing this:
import numpy as np


class NeuralNetwork:
    def __init__(self, data, y, neurons, hidden):
        self.input = data
        self.y = y
        self.output = np.zeros(y.shape)
        self.layers = hidden
        self.neurons = neurons
        self.weights = self.generateWeightArray()
        print(self.weights)

    def generateWeightArray(self):
        weightarr = []
        # Last weight array is for in between hidden and output layer
        for i in range(self.layers + 1):
            weightarr.append(self.generateWeightMatrix())
        return np.asarray(weightarr)

    def generateWeightMatrix(self):
        return np.random.rand(self.input.shape[0], self.input.shape[1] - 1)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def dsigmoid(self, x):
        return self.sigmoid(x) * (1 - self.sigmoid(x))

    def train(self):
        pass

    def run(self):
        # Since between each layer we have a matrix of weights, we can just keep going
        # for the number of hidden layers we have
        for i in range(self.layers):
            out = np.dot(self.input.transpose(), self.weights[i]).transpose()  # step 1
            self.input = self.sigmoid(out)  # step 2
            print(self.input)


net = NeuralNetwork(np.array([[1, 2, 3, 4], [3, 5, 1, 2], [5, 6, 7, 8]]), np.array([1, 0, 1]), 500, 2)
net.run()
EDIT
I have changed my code as follows
class NeuralNetwork:
    def __init__(self, data, y, neurons, hidden):
        self.input = data
        self.y = y
        self.output = np.zeros(y.shape)
        self.layers = hidden
        self.neurons = neurons
        self.weights_to_hidden = np.random.rand(self.neurons, self.input.shape[1])
        self.weights = self.generateWeightArray()
        self.weights_to_output = np.random.rand(self.neurons, 1)
        print(self.weights_to_output)

    # Generate a matrix with h+1 weight matrices, where h is the number of hidden layers (+1 for output)
    def generateWeightArray(self):
        weightarr = []
        # Last weight array is for in between hidden and output layer
        for i in range(self.layers):
            weightarr.append(self.generateWeightMatrix())
        return np.asarray(weightarr)

    # Generate a matrix with n columns and m rows, where n is the number of features
    # and m is the number of neurons in the layer
    def generateWeightMatrix(self):
        return np.random.rand(self.neurons, self.neurons)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def dsigmoid(self, x):
        return self.sigmoid(x) * (1 - self.sigmoid(x))

    def train(self):
        # 2 hidden layers, then hidden -> output layer
        hidden_in = self.sigmoid(np.dot(self.input, self.weights_to_hidden.transpose()).transpose())
        print("Going into hidden layer:")
        print(hidden_in)
        for i in range(self.layers):
            in_hidden = self.sigmoid(np.dot(hidden_in.transpose(), self.weights[i]).transpose())
            print("After ", str(i + 1), " hidden layer:")
            print(in_hidden)
        print("Output")
        out = self.sigmoid(np.dot(hidden_in.transpose(), self.weights_to_output).transpose())
        print(out)


net = NeuralNetwork(np.array([[1, 2, 3, 4], [3, 5, 1, 2], [5, 6, 7, 8]]), np.array([1, 0, 1]), 5, 2)
net.train()
And the output after running is
[[0.89405222 0.89501672 0.89717842]]
I'm not sure if self.weights_to_output has the correct shape though, because it's (n, 1), so all features (in each record) will have the same weight, rather than having 3 weights for each row (?)
The 'neurons' (usually called 'units' these days) are the activations in each layer. These are represented as a vector with one element for each unit. You will represent a layer's activations as a 1D array. So the short answer is that the neurons are elements in 1D arrays.
Let's look at the activation of one unit in layer 3 of a deep neural network:
a_3 = sigma(w_3 @ a_2 + b_3)
So:
a_3 will be a scalar — the activation for this unit;
sigma is the activation function (e.g. logistic function, tanh, or ReLU)
w_3 is the vector of weights for this layer (one element for each unit of layer 2)
a_2 is the vector of activations for the previous layer (one element for each unit of layer 2)
b_3 is the bias for this unit.
Note that w_3 and a_2 are both 1D arrays (vectors). The @ operator does matrix multiplication in Python 3.5+, and in this case it's going to perform the dot product.
Now, thanks to the magic of linear algebra, it turns out we don't need to loop over each unit to compute all the activations. If we let the weight vector w_3 be a matrix, W_3 (note the capital letter), then it can represent the weights for all units in layer 2, connecting to all units in layer 3. Then:
a_3 = sigma(W_3 @ a_2 + b_3)
Now a_3 will be a vector.
The tricky part is keeping track of all the shapes. For W @ a to work, the shapes must be compatible. For example, imagine we have a network with 2 units in layer 2 (so a_2 is a 1D array with 2 elements) and 3 units in layer 3 (so a_3 needs to have 3 elements). Now W_3 needs to be 3 × 2 and a_2 needs to be 2 × 1. Then the matrix multiply works. You can just use np.reshape() and np.transpose() to achieve the shapes you need.
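A quick numpy sketch of those shapes (random values; it just demonstrates the vectorized layer computation described above):

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))  # logistic activation

a_2 = np.random.rand(2)      # activations of the 2 units in layer 2
W_3 = np.random.rand(3, 2)   # weights connecting layer 2 -> layer 3
b_3 = np.random.rand(3)      # one bias per unit in layer 3

a_3 = sigma(W_3 @ a_2 + b_3)
print(a_3.shape)  # (3,) -- one activation per unit in layer 3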
I hope this helps... it's a lot of words.
Maybe this diagram (from this article) helps explain:
The diagram doesn't say how many records there are. There are 3 features per data instance (i.e. we have an M × 3 input matrix). The input layer is 'just' another layer, you can think of the inputs as just another set of activations. You could think of x as a_0 (the 0-th layer).
This 3Blue1Brown video is well worth watching too: https://www.youtube.com/watch?v=aircAruvnKk

Where to start on creating a method that saves desired changes in a Tensor with PyTorch?

I have two tensors from which I am calculating Spearman's rank correlation, and I would like PyTorch to automatically adjust the values in these tensors in a way that increases my Spearman's rank correlation number as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I have a loop of constant adjustments of the values in these two tensors to get the highest spearmans rank correlation from 2 lists I make from these 2 tensors while having PyTorch do the work? I just need a guide of where to go. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates that we can accomplish this by finding the matrix M and vector b which minimize ||y - (M·x + b)||².
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim

# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise

# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))

# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)

for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)

    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()

    # compute new gradients
    loss.backward()

    # update parameters M and b
    optimizer.step()

    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())

# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems not necessarily related to neural networks. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something so make sure to negate your objective function before calling backward.
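In sketch form, something like this (compute_objective is a made-up stand-in for however you score the correlation; note that Spearman's rank correlation involves a non-differentiable ranking step, so you would need a differentiable surrogate for it):

optimizer = optim.SGD((a, psfm_s), lr=0.1)  # a, psfm_s: the requires_grad tensors
objective = compute_objective(a, psfm_s)    # hypothetical differentiable score
loss = -objective  # negate so that minimizing the loss maximizes the objective
optimizer.zero_grad()
loss.backward()
optimizer.step()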

Calculate values for subset of tensor values connected to the same neuron in optimizer

I'm writing an optimizer in TensorFlow with python.
How to calculate values for subsets of tensor values that are connected as incoming connections of neurons?
For example, let's take the stochastic gradient descent optimizer with a momentum term. The momentum term is calculated for each connection individually. Now I want to calculate the momentum for one connection by calculating the mean of all momentum values of connections that are connected to the same neuron.
In this picture, you can see the two connections that are both connected to neuron 3 as incoming connections. Both connections should be considered for the weight update of one connection. Normally the update for connection (1, 3) would only include gradient(1, 3) and momentum(1, 3). For the update of connection (1, 3) I want to use the mean of momentum(1, 3) and momentum(2, 3).
Let's have a look at a simple fully connected neural network with one input neuron, two hidden layers, two neurons per hidden layer, and one output neuron.
If we look at the normal calculation of the momentum (called "accumulation" in the code) for the weight update of the connection between neuron 2 and neuron 5, we just consider the momentum from the previous time step.
We can see the normal "accumulation" update calculation from the python implementation below:
accumulation = self.get_slot(var, "a")
accumulation_update = grad + (mu_t * accumulation)
For the connection between neuron 2 and neuron 5, the accumulation looks like this:

accumulation_update(2, 5) = grad(2, 5) + (mu_t * accumulation(2, 5))
This is the part that should change. The new momentum calculation should take the mean of all connections that are connected as incoming connections to the same neuron as the connection for which the weight update is calculated. Looking at the example neural network the "accumulation" value for the connection (2, 5) is the mean of the "accumulation" value of connection (2, 5) and (3, 5). These are all incoming connections of neuron 5.
The "accumulation" update changes in the following way:
accumulation = self.get_slot(var, "a")
accumulation_means = # Code to calculate all mean values for all neurons
accumulation_update = grad + (mu_t * accumulation_means) # Use the means for the accumulation_update
The accumulation update calculation for the connection (2, 5) is now calculated the following way:
accumulation_mean = (accumulation(2, 5) + accumulation(3, 5)) / 2
accumulation_update(2, 5) = grad(2, 5) + (mu_t * accumulation_mean)
This calculation is done the same way for every connection.
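To make the reduction concrete, here is a tiny numpy sketch of the mean for the example above (whether incoming connections sit in a row or a column depends on your weight-matrix layout, so the axis choice is an assumption to verify against your model):

import numpy as np

# Rows = target neurons (4, 5), columns = source neurons (2, 3).
accumulation = np.array([[0.10, 0.20],   # accumulation(2, 4), accumulation(3, 4)
                         [0.30, 0.50]])  # accumulation(2, 5), accumulation(3, 5)

accumulation_mean = np.mean(accumulation, axis=1, keepdims=True)
print(accumulation_mean)  # [[0.15], [0.40]] -> neuron 5: (0.30 + 0.50) / 2 = 0.40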
Here is the Python implementation of stochastic gradient descent with momentum:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow.python.framework import ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import state_ops
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, momentum_term=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = momentum_term
        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, var):
        lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
        mu_t = math_ops.cast(self._mu_t, var.dtype.base_dtype)
        accumulation = self.get_slot(var, "a")

        accumulation_update = grad + (mu_t * accumulation)
        accumulation_t = state_ops.assign(accumulation, accumulation_update, use_locking=self._use_locking)

        var_update = lr_t * accumulation_t
        var_t = state_ops.assign_sub(var, var_update, use_locking=self._use_locking)

        return control_flow_ops.group(*[var_t, accumulation_t])

    def _prepare(self):
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = ops.convert_to_tensor(self._mu, name="momentum_term")
The neural network i'm testing with (MNIST): https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py
How to implement the described mean of "accumulation" values into the existing MWE code?
Just as a side note:
The MWE is not my real-life scenario. It is just a minimal working example to explain and work on the problem I'm trying to solve.
I'm writing the optimizer in Python because I couldn't build TensorFlow on Windows and therefore couldn't compile the C++ files. I put a lot of time into trying to build it on Windows and I can't afford to waste more time on it. The optimizer in Python is sufficient for me since I'm just prototyping at the moment.
I'm new to TensorFlow and Python. I can't find anything about this topic in the documentation. Linking me to a source would be great. Also, the internal structure of tensors is not digestible for me, and the error messages I get while trying things out are just not understandable to me. Please keep that in mind when explaining something.
We take neurons 2, 3, 4, and 5 as an example to compute the new momentum. We ignore the biases and consider only the weights.
We use W for the matrix of weights, G for the corresponding gradients of W, M for the matrix of the corresponding momentum values, and M~ (M-tilde) for the matrix holding the per-neuron means of M.
So the update of the new momentum is

M_new = G + mu_t * M~
W_new = W - lr_t * M_new

I changed some code in the SGDmomentum class you proposed and ran it on the MNIST example without error, which I think you have already done.
def _apply_dense(self, grad, var):
    lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
    mu_t = math_ops.cast(self._mu_t, var.dtype.base_dtype)
    accumulation = self.get_slot(var, "a")

    param_dims = len(accumulation.get_shape().as_list())
    if param_dims == 2:  # fc layer weights
        accumulation_mean = tf.reduce_mean(accumulation, axis=1, keep_dims=True)
    elif param_dims == 1:  # biases
        accumulation_mean = accumulation
    else:  # cnn? or others
        # TODO: improvement
        accumulation_mean = accumulation

    accumulation_update = grad + (mu_t * accumulation_mean)  # broadcasting is supported by tf.add()
    accumulation_t = state_ops.assign(accumulation, accumulation_update, use_locking=self._use_locking)

    var_update = lr_t * accumulation_t
    var_t = state_ops.assign_sub(var, var_update, use_locking=self._use_locking)

    return control_flow_ops.group(*[var_t, accumulation_t])
For training,
with tf.name_scope('train'):
    train_step = SGDmomentum(FLAGS.learning_rate, 0.9).minimize(cross_entropy)
    # train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(
    #     cross_entropy)
For the moment, this algorithm converges more slowly than traditional SGD with momentum on MNIST.
As for additional reading, I don't know whether the Stanford CS231n notes on gradient descent and SGD with momentum can help you; probably you already know that material.
If you are still confused by the use of a matrix structure for the gradient tensors, try to accept it, because there is hardly any difference between operating on a matrix and on a single scalar here.
What I've done here is just transform the calculation of each accumulationUpdate_* in your question into matrix form.
