I am trying to understand backpropagation in a simple 3 layered neural network with MNIST.
There is the input layer with weights and a bias. The labels are MNIST so it's a 10 class vector.
The second layer is a linear tranform. The third layer is the softmax activation to get the output as probabilities.
Backpropagation calculates the derivative at each step and call this the gradient.
Previous layers appends the global or previous gradient to the local gradient. I am having trouble calculating the local gradient of the softmax
Several resources online go through the explanation of the softmax and its derivatives and even give code samples of the softmax itself
def softmax(x):
"""Compute the softmax of vector x."""
exps = np.exp(x)
return exps / np.sum(exps)
The derivative is explained with respect to when i = j and when i != j. This is a simple code snippet I've come up with and was hoping to verify my understanding:
def softmax(self, x):
"""Compute the softmax of vector x."""
exps = np.exp(x)
return exps / np.sum(exps)
def forward(self):
# self.input is a vector of length 10
# and is the output of
# (w * x) + b
self.value = self.softmax(self.input)
def backward(self):
for i in range(len(self.value)):
for j in range(len(self.input)):
if i == j:
self.gradient[i] = self.value[i] * (1-self.input[i))
else:
self.gradient[i] = -self.value[i]*self.input[j]
Then self.gradient is the local gradient which is a vector. Is this correct? Is there a better way to write this?
I am assuming you have a 3-layer NN with W1, b1 for is associated with the linear transformation from input layer to hidden layer and W2, b2 is associated with linear transformation from hidden layer to output layer. Z1 and Z2 are the input vector to the hidden layer and output layer. a1 and a2 represents the output of the hidden layer and output layer. a2 is your predicted output. delta3 and delta2 are the errors (backpropagated) and you can see the gradients of the loss function with respect to model parameters.
This is a general scenario for a 3-layer NN (input layer, only one hidden layer and one output layer). You can follow the procedure described above to compute gradients which should be easy to compute! Since another answer to this post already pointed to the problem in your code, i am not repeating the same.
As I said, you have n^2 partial derivatives.
If you do the math, you find that dSM[i]/dx[k] is SM[i] * (dx[i]/dx[k] - SM[i]) so you should have:
if i == j:
self.gradient[i,j] = self.value[i] * (1-self.value[i])
else:
self.gradient[i,j] = -self.value[i] * self.value[j]
instead of
if i == j:
self.gradient[i] = self.value[i] * (1-self.input[i])
else:
self.gradient[i] = -self.value[i]*self.input[j]
By the way, this may be computed more concisely like so (vectorized):
SM = self.value.reshape((-1,1))
jac = np.diagflat(self.value) - np.dot(SM, SM.T)
np.exp is not stable because it has Inf.
So you should subtract maximum in x.
def softmax(x):
"""Compute the softmax of vector x."""
exps = np.exp(x - x.max())
return exps / np.sum(exps)
If x is matrix, please check the softmax function in this notebook.
Related
Thank you for any help I'm given in advance with this! I have been given the python code for a simple single layer perceptron with the task to alter the code so it is a multi-layer perceptron. I'm still very new to all of this, but from what I understand the repeating feed-forward and back-propagation cycle is what creates the hidden layers. Given the follow code, what should be altered to help create these hidden layers?
# Creating a numerically stable logistic s-shaped definition to call
def sigmoid(x):
x = np.clip(x, -500, 500)
if x.any()>=0:
return 1/(1 + np.exp(-x))
else:
return np.exp(x)/(1 + np.exp(x))
# define the dimentions and set the weights to random numbers
def init_parameters(dim1, dim2=1,std=1e-1, random = True):
if(random):
return(np.random.random([dim1,dim2])*std)
else:
return(np.zeros([dim1,dim2]))
# Single layer network: Forward Prop
# Passed in the weight vectors, bias vector, the input vector and the Y
def fwd_prop(W1, bias, X, Y):
Z1 = np.dot(W1, X) + bias # dot product of the weights and X + bias
A1 = sigmoid(Z1) # Uses sigmoid to create a predicted vector
return(A1)
#Single layer network: Backprop
def back_prop(A1, W1, bias, X, Y):
m = np.shape(X)[1] # used the calculate the cost by the number of inputs -1/m
# Cross entropy loss function
cost = (-1/m)*np.sum(Y*np.log(A1) + (1-Y)*np.log(1-A1)) # cost of error
dZ1 = A1 - Y # subtract actual from pred weights
dW1 = (1/m) * np.dot(dZ1, X.T) # calc new weight vector
dBias = (1/m) * np.sum(dZ1, axis = 1, keepdims = True) # calc new bias vector
grads ={"dW1": dW1, "dB1":dBias} # Weight and bias vectors after backprop
return(grads, cost)
def run_grad_desc(num_epochs, learning_rate, X, Y, n_1):
n_0, m = np.shape(X)
W1 = init_parameters(n_1, n_0, True)
B1 = init_parameters(n_1,1, True)
loss_array = np.ones([num_epochs])*np.nan # resets the loss_array to NaNs
for i in np.arange(num_epochs):
A1 = fwd_prop(W1, B1, X, Y) # get predicted vector
grads, cost = back_prop(A1, W1, B1, X, Y) # get gradient and the cost from BP
W1 = W1 - learning_rate*grads["dW1"] # update weight vector LR*gradient*[BP weights]
B1 = B1 - learning_rate*grads["dB1"] # update bias LR*gradient[BP bias]
loss_array[i] = cost # loss array gets cross ent values
parameter = {"W1":W1, "B1":B1} # assign
return(parameter, loss_array)
We've also been asked to be able to adjust for the nodes in a hidden layer. Being honest, I am completely lost here and am not clear on what the nodes even represent, so help here would be appreciated as well. Thanks y'all.
It looks like this network is not even single layer. Normally "single layer" means one layer of hidden neurons. This network outputs the activations of what would normally be the hidden layer.
My advice to you would be to start studying the basics of neural networks. There are lots of great resources, including on Youtube. For backpropagation, a good place to start is here
Also note that if you are in a hurry, using autograd tools like Tensorflow or Pytorch takes care of the differentiation for you. Of course, if you are doing this to learn the details of neural networks, then building one from scratch is much better.
When using the chain rule to calculate the slope of the cost function relative to the weights at the layer L , the formula becomes:
d C0 / d W(L) = ... . d a(L) / d z(L) . ...
With :
z (L) being the induced local field : z (L) = w1(L) * a1(L-1) + w2(L) * a2(L-1) * ...
a (L) beeing the ouput : a (L) = & (z (L))
& being the sigmoid function used as an activation function
Note that L is taken as a layer indicator and not as an index
Now:
d a(L) / d z(L) = &' ( z(L) )
With &' being the derivative of the sigmoid function
The problem:
But in this post which is written by James Loy on building a simple neural network from scratch with python, When doing the backpropagation, he didn't give z (L) as an input to &' to replace d a(L) / d z(L) in the chain rule function. Instead he gave it the output = last activation of the layer (L) as the input the the sigmoid derivative &'
def feedforward(self):
self.layer1 = sigmoid(np.dot(self.input, self.weights1))
self.output = sigmoid(np.dot(self.layer1, self.weights2))
def backprop(self):
# application of the chain rule to find derivative of the loss function with respect to weights2 and weights1
d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
Note that in code above the layer L is the layer 2 which is the last or output layer.
And sigmoid_derivative(self.output) this is where the activation of the current layer is given as input to the derivative of the sigmoid function used as an activation function.
The question:
Shouldn't we use this sigmoid_derivative(np.dot(self.layer1, self.weights2)) instead of this sigmoid_derivative(self.output)?
It turned out that &( z(L) ) or output was used, just to accommodate to the way sigmoid_derivative was implemented.
Here is the code of the sigmoid_derivative:
def sigmoid(x):
return 1.0/(1+ np.exp(-x))
def sigmoid_derivative(x):
return x * (1.0 - x)
The mathematical formula of the sigmoid_derivative can be written as: &' (x) = &(x) * (1-&(x))
So to get to the formula above, &(z) and not z was passed to sigmoid_derivative in order to return: &(z) * (1.0 - &(z))
You want to use the derivative with respect to the output. During backpropagation we use the weights only to determine how much of the error belongs to each one of the weights and by doing so we can further propagate the error back through the layers.
In the tutorial, the sigmoid is applied to the last layer:
self.output = sigmoid(np.dot(self.layer1, self.weights2))
From your question:
Shouldn't we use this sigmoid_derivative(np.dot(self.layer1, self.weights2)) instead of this sigmoid_derivative(self.output)?
You cannot do:
sigmoid_derivative(np.dot(self.layer1, self.weights2))
because here you are trying to take the derivative of the sigmoid when you have not yet applied it.
This is why you have to use:
sigmoid_derivative(self.output)
You're right- looks like the author made a mistake. I'll explain: When the network is done with a forward pass (all activations + loss), you have to use gradient descent to minimize the weights according to the loss function. To do this, you need the partial derivative of the loss function with respect to each weight matrix.
Some notation before I continue: loss is L, A is activation (aka sigmoid), Z means the net input, in other words, the result of W . X. Numbers are indices, so A1 means the activation for the first layer.
You can use the chain rule to move backwards through the network and express the weights as a function of the loss. To begin the backward pass, you start by getting the derivative of the loss with respect to the last layer's activation. This is dL/dA2, because the second layer is the final layer. To update the weights of the second layer, we need to complete dA2/dZ2 and dZ/dW2.
Before continuing, remember that the second layer's activation is A2 = sigmoid(W2 . A1) and Z2 = W2 . A1. For clarity, we'll write A2 = sigmoid(Z2). Treat Z2 as its own variable. So if you compute dA2/dZ2, you get sigmoid_derivative(Z2), which is sigmoid_derivative(W2 . A1) or sigmoid_derivative(np.dot(self.layer1, self.weights2)). So it shouldn't be sigmoid_derivative(self.output) because output was activated by sigmoid.
From what I understand about Neural Networks, you have a number of hidden layers which each consist of X neurons. A neuron takes in a number of inputs and prospective weights, then using an activation function (sigmoid in my case) gives an output.
My task is to implement a network from scratch (only using numpy), with 2 hidden layers, sigmoid activation function and 500 neurons in each hidden layer. What I don't understand is, how can I implement the concept of neurons? According to this article, one neuron is when all inputs are weighted and are passed into the activation function. So do I feed in the same inputs, 500 times, with different weights each time (in the first layers, then again in the second)? I've also read this topic, where the following is said:
The neuron is nothing more than a set of inputs, a set of weights, and an activation function. The neuron translates these inputs into a single output, which can then be picked up as input for another layer of neurons later on.
So according to this, I indeed should weigh the inputs differently, 500 times, and then pass these forward to the next layer which will do the same. Am I understanding this correctly?
Here is the code I have written so far (it is very elementary but I did not want to proceed further before I clear this up), but have no idea how I would be implementing this:
class NeuralNetwork:
def __init__(self, data, y, neurons, hidden):
self.input = data
self.y = y
self.output = np.zeros(y.shape)
self.layers = hidden
self.neurons = neurons
self.weights = self.generateWeightArray()
print(self.weights)
def generateWeightArray(self):
weightarr = []
#Last weight array is for inbetween hidden and output layer
for i in range(self.layers + 1):
weightarr.append(self.generateWeightMatrix())
return np.asarray(weightarr)
def generateWeightMatrix(self):
return np.random.rand(self.input.shape[0], self.input.shape[1]-1)
def sigmoid(self, x):
return 1/(1+np.exp(-x))
def dsigmoid(self, x):
return self.sigmoid(x)*(1-self.sigmoid(x))
def train(self):
pass
def run(self):
#Since between each layer we have a matrix of weights, we can just keep going for the number of hidden
#layers we have
for i in range(self.layers):
out = np.dot(self.input.transpose(), self.weights[i]).transpose() #step1
self.input = self.sigmoid(out) #step2
print(self.input)
net = NeuralNetwork(np.array([[1,2,3,4],[3,5,1,2],[5,6,7,8]]), np.array([1,0,1]), 500, 2)
net.run()
EDIT
I have changed my code as follows
class NeuralNetwork:
def __init__(self, data, y, neurons, hidden):
self.input = data
self.y = y
self.output = np.zeros(y.shape)
self.layers = hidden
self.neurons = neurons
self.weights_to_hidden = np.random.rand(self.neurons, self.input.shape[1])
self.weights = self.generateWeightArray()
self.weights_to_output = np.random.rand(self.neurons,1)
print(self.weights_to_output)
#Generate a matrix with h+1 weight matrices, where h is the number of hidden layers (+1 for output)
def generateWeightArray(self):
weightarr = []
#Last weight array is for inbetween hidden and output layer
for i in range(self.layers):
weightarr.append(self.generateWeightMatrix())
return np.asarray(weightarr)
#Generate a matrix with n columns and m rows, where n is the number of features and m is the number of neurons
#in the layer
def generateWeightMatrix(self):
return np.random.rand(self.neurons, self.neurons)
def sigmoid(self, x):
return 1/(1+np.exp(-x))
def dsigmoid(self, x):
return self.sigmoid(x)*(1-self.sigmoid(x))
def train(self):
#2 hidden layers, then hidden -> output layer
hidden_in = self.sigmoid(np.dot(self.input, self.weights_to_hidden.transpose()).transpose())
print("Going into hidden layer:")
print(hidden_in)
for i in range(self.layers):
in_hidden = self.sigmoid(np.dot(hidden_in.transpose(), self.weights[i]).transpose())
print("After ",str(i+1), " hidden layer:")
print(in_hidden)
print("Output")
out = self.sigmoid(np.dot(hidden_in.transpose(), self.weights_to_output).transpose())
print(out)
net = NeuralNetwork(np.array([[1,2,3,4],[3,5,1,2],[5,6,7,8]]), np.array([1,0,1]), 5, 2)
net.train()
And the output after running is
[[0.89405222 0.89501672 0.89717842]]
I'm not sure if self.weights_to_output has the correct shape though because its a (n,1), so all features (in each record) will have the same weight, rather than having 3 weights for each row (?)
The 'neurons' (usually called 'units' these days) are the activations in each layer. These are represented as a vector with one element for each unit. You will represent a layer's activations as a 1D array. So the short answer is that the neurons are elements in 1D arrays.
Let's look at the activation of one unit in layer 3 of a deep neural network:
a_3 = sigma(w_3 # a_2 + b_3)
So:
a_3 will be a scalar — the activation for this unit;
sigma is the activation function (e.g. logistic function, tanh, or ReLU)
w_3 is the vector of weights for this layer (one element for each unit of layer 2)
a_2 is the vector of activations for the previous layer (one element for each unit of layer 2)
b_3 is the bias for this unit.
Note that w_3 and a_2 are both 1D arrays (vectors). The # operator does matrix multiplication in Python 3.4+ and in this case it's going to perform the dot product.
Now, thanks to the magic of linear algebra, it turns out we don't need to loop over each unit to compute all the activations. If we let the weight vector w_3 be a matrix, W_3 (note the capital letter), then it can represent the weights for all units in layer 2, connecting to all units in layer 3. Then:
a_3 = sigma(W_3 # a_2 + b_3)
Now a_3 will be a vector.
The tricky part is keeping track of all the shapes. For W # a to work, the shapes must be compatible. For example, imagine we have a network with 2 units in layer 2 (so a_2 is a 1D array with 2 elements) and 3 units in layers 3 (so a_3 needs to have 3 elements). Now W_3 needs to be 3 × 2 and a_2 needs to be 2 × 1. Then the matrix multiply works. You can just use np.reshape() and np.transpose() to achieve the shapes you need.
I hope this helps... it's a lot of words.
Maybe this diagram (from this article) helps explain:
The diagram doesn't say how many records there are. There are 3 features per data instance (i.e. we have an M × 3 input matrix). The input layer is 'just' another layer, you can think of the inputs as just another set of activations. You could think of x as a_0 (the 0-th layer).
This 3 Blue 1 Brown video is well worth watching too: https://www.youtube.com/watch?v=aircAruvnKk
We are currently working on a project that involves Neural Networks. Our task consists of upscaling an 16x16 image all the way up to a picture with the resolution of 64x64.
The model used in the code example consists of only one Hidden-Layer with 5 Neurons in it. Furthermore the Hidden-Layer uses a ReLu function and the Output-Layer a simple linear one. The activations get saved and loaded , when they need to be accessed.
There is something wrong with the calculation of the Lost/Cost and therefore the weight's get updated the wrong way. The weight's keep getting bigger if their postive and more negative if they are negative.
We have seperated the Back-Propagation of the Hidden-Layer and the Output Layer, since they calculate the errors differently. There are also some Shape errors in the Back-Propagation for the first Hidden-Layer after the Output-Layer , which could indicate that the calculation is somewhat wrong. We used the website : https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html to get our errors calculated.
We are by no means expert's on the topic and would be happy to accept any help that we can get, so feel free to ask us any questions you like.
{def linearfunc(x): #The Linear Output function used for the Forward-Propagation
return(x)
def linderiv(x): #The Derivative of our Linear function used for the Back-Propagation
return(1)
def backward(): #Back-Propagtion
alpha = 0.001 #Learning Rate
Weights_Output = np.random.randn(5, 12288) #Weights of the Output Layer / 5 is the number of neurons in the Hidden-Layer and 12228 the number of neurons in the Output Layer
Bias_Output = np.random.randn(12288,1) #Bias of the Output Layer / 12288 is always derived from the features of the Output Image, which is 64*64*3 (3 For the RGB Values)
Weights = np.random.randn(768,5) #Weights of the only Hidden-Layer / The number 768 is derived from the number of input features of the images being 16*16*3
Bias = np.random.randn(5,1) #Bias of the only Hidden-Layer
#Start of the Back-Propagation for the Output-Layer
DW_Output = np.zeros((12288,5)) #Preparing the shapes of DW in the Output-Layer for the Back-Propagation
DB_Output = np.zeros((12288, 1)) #Preparing the shapes of DB in the Output-Layer for the Back-Propagation
da_dz2 = linderiv(Z_Output) #Applying the derivative of the Linear Function to our Z's in the Output-Layer / 1 Step
y = np.load('DatasetY.npy') #Loading our Output Features of our Dataset with the Y values
a = np.load('Activations.npy') #Loading the Activations of the previous Layer before the Output-Layer / Our only Hidden-Layer
Error_Output = (activations_Output-y) * da_dz2 #Computing DZ(The Error) of our Output-Layer by using the formula Error = (Activations in the Output-Layer - Y-Values) multiplied with the derivative of the ReLu function applied to the Z's of the Output Layer
DW_Output = np.dot(Error_Output , activations.T) #Computing DW of our Output-Layer by using the dot product of DZ_Output and our activations and then divide by the number of images in our Trainingset
DB_Output = np.sum(Error_Output, axis = 1, keepdims = True) * (1/10000)#Computing DB of our Output-Layer by summing up all of the DZ's along axis 1 to get the right shape
Weights_Output = Weights_Output - alpha * DW_Output.T #The Weights of the Output Layer perform the Gradient Descent Step / The first time they update they all go way too high or too low
Bias_Output = Bias_Output - alpha * DB_Output #The Bias of the Output Layer perform the Gradient Descent Step / The first time they update they all go way too high or too low
#Start of the Back-Propagation for the Hidden-Layer
dw = np.zeros((5,768)) #Preparing the shapes of dw in the only Hidden-Layer for the Back-Propagation
db = np.zeros((5, 1)) #Preparing the shapes of dw in the only Hidden-Layer for the Back-Propagation
a_2 = np.load('DatasetX.npy') #Loading the activations of the Layer previous to out Hidden-Layer, which would be the Input Features
da_dz = z >= 0 #Applying the derivative of the ReLu Function to our Z's in the Hidden-Layer / 1 Step
da_dz = da_dz.astype(np.int) #Turning all the TRUE values into 1 and all the FALSE values into 0 / 2 Step
Error = Error_Output * np.dot(Weights_Output.T , da_dz) #<< Gives a shape Error when Updating the Weight's #Computing dz(Error) in our Hidden-Layer by using the Weight's and the Error from the Output Layer and multiplying that with the derivative of the ReLu function applied to the z's of the only Hidden-Layer
dw = np.dot(Error, a_2.T) * (1/10000) #Computing dw of our Hidden-Layer by using the dot product of dz and our activations and then divide by the number of images in our Trainingset
db = (np.sum(Error, axis = 1, keepdims = True)) * (1/10000) #Computing db of our Hidden-Layer by summing up all of the dz's along axis 1 to get the right shape
Weights = Weights - alpha * dw.T #The Weights of the Hidden-Layer perform the Gradient Descent Step / The first time they update they all go negative
Bias = Bias - alpha * db #The Bias of the Hidden-Layer perform the Gradient Descent Step / The first time they update they all go negative}
In a TensorFlow optimizer (python) the method apply_dense does get called for the neuron weights (layer connections) and the bias weights but I would like to use both in this method.
def _apply_dense(self, grad, weight):
...
For example: A fully connected neural network with two hidden layer with two neurons and a bias for each.
If we take a look at layer 2 we get in apply_dense a call for the neuron weights:
and a call for the bias weights:
But I would either need both matrix in one call of apply_dense or a weight matrix like this:
X_2X_4, B_1X_4, ... is just a notation for the weight of the connection between the two neurons. Therefore B_1X_4 ist only a placeholder for the weight between B_1 and X_4.
How to do this?
MWE
For an minimal working example here a stochastic gradient descent optimizer implementation with a momentum. For every layer the momentum of all incoming connections from other neurons is reduced to the mean (see ndims == 2). What i need instead is the mean of not only the momentum values from the incoming neuron connections but also from the incoming bias connections (as described above).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from tensorflow.python.training import optimizer
class SGDmomentum(optimizer.Optimizer):
def __init__(self, learning_rate=0.001, mu=0.9, use_locking=False, name="SGDmomentum"):
super(SGDmomentum, self).__init__(use_locking, name)
self._lr = learning_rate
self._mu = mu
self._lr_t = None
self._mu_t = None
def _create_slots(self, var_list):
for v in var_list:
self._zeros_slot(v, "a", self._name)
def _apply_dense(self, grad, weight):
learning_rate_t = tf.cast(self._lr_t, weight.dtype.base_dtype)
mu_t = tf.cast(self._mu_t, weight.dtype.base_dtype)
momentum = self.get_slot(weight, "a")
if momentum.get_shape().ndims == 2: # neuron weights
momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)
elif momentum.get_shape().ndims == 1: # bias weights
momentum_mean = momentum
else:
momentum_mean = momentum
momentum_update = grad + (mu_t * momentum_mean)
momentum_t = tf.assign(momentum, momentum_update, use_locking=self._use_locking)
weight_update = learning_rate_t * momentum_t
weight_t = tf.assign_sub(weight, weight_update, use_locking=self._use_locking)
return tf.group(*[weight_t, momentum_t])
def _prepare(self):
self._lr_t = tf.convert_to_tensor(self._lr, name="learning_rate")
self._mu_t = tf.convert_to_tensor(self._mu, name="momentum_term")
For a simple neural network: https://raw.githubusercontent.com/aymericdamien/TensorFlow-Examples/master/examples/3_NeuralNetworks/multilayer_perceptron.py (only change the optimizer to the custom SGDmomentum optimizer)
Update: I'll try to give a better answer (or at least some ideas) now that I have some understanding of your goal, but, as you suggest in the comments, there is probably not infallible way of doing this in TensorFlow.
Since TF is a general computation framework, there is no good way of determining what pairs of weights and biases are there in a model (or if it is a neural network at all). Here are some possible approaches to the problem that I can think of:
Annotating the tensors. This is probably not practical since you already said you have no control over the model, but an easy option would be to add extra attributes to the tensors to signify the weight/bias relationships. For example, you could do something like W.bias = B and B.weight = W, and then in _apply_dense check hasattr(weight, "bias") and hasattr(weight, "weight") (there may be some better designs in this sense).
You can look into some framework built on top of TensorFlow where you may have better information about the model structure. For example, Keras is a layer-based framework that implements its own optimizer classes (based on TensorFlow or Theano). I'm not too familiar with the code or its extensibility, but probably you have more tools there to use.
Detect the structure of the network yourself from the optimizer. This is quite complicated, but theoretically possible. from the loss tensor passed to the optimizer, it should be possible to "climb up" in the model graph to reach all of its nodes (taking the .op of the tensors and the .inputs of the ops). You could detect tensor multiplications and additions with variables and skip everything else (activations, loss computation, etc) to determine the structure of the network; if the model does not match your expectations (e.g. there are no multiplications or there is a multiplication without a later addition) you can raise an exception indicating that your optimizer cannot be used for that model.
Old answer, kept for the sake of keeping.
I'm not 100% clear on what you are trying to do, so I'm not sure if this really answers your question.
Let's say you have a dense layer transforming an input of size M to an output of size N. According to the convention you show, you'd have an N × M weights matrix W and a N-sized bias vector B. Then, an input vector X of size M (or a batch of inputs of size M × K) would be processed by the layer as W · X + B, and then applying the activation function (in the case of a batch, the addition would be a "broadcasted" operation). In TensorFlow:
X = ... # Input batch of size M x K
W = ... # Weights of size N x M
B = ... # Biases of size N
Y = tf.matmul(W, X) + B[:, tf.newaxis] # Output of size N x K
# Activation...
If you want, you can always put W and B together in a single extended weights matrix W*, basically adding B as a new row in W, so W* would be (N + 1) × M. Then you just need to add a new element to the input vector X containing a constant 1 (or a new row if it's a batch), so you would get X* with size N + 1 (or (N + 1) × K for a batch). The product W* · X* would then give you the same result as before. In TensorFlow:
X = ... # Input batch of size M x K
W_star = ... # Extended weights of size (N + 1) x M
# You can still have a "view" of the original W and B if you need it
W = W_star[:N]
B = W_star[-1]
X_star = tf.concat([X, tf.ones_like(X[:1])], axis=0)
Y = tf.matmul(W_star, X_star) # Output of size N x K
# Activation...
Now you can compute gradients and updates for weights and biases together. A drawback of this approach is that if you want to apply regularization then you should be careful to apply it only on the weights part of the matrix, not on the biases.