How to write dice-loss backpropagation with numpy - python

I am trying to write a dice-loss function myself. Here is the forward pass I wrote, but I do not understand how to calculate the backprop. I tried to write it but it's not working. Or does dice-loss not need backprop at all?
alpha = 0.5
beta = 0.5
tp = np.sum(pred * label)
fn = np.sum((1 - pred) * label)
fp = np.sum(pred * (1 - label))
dice = tp / (tp + alpha * fn + beta * fp)

I am not sure I would call that a forward pass. How do you get pred?
Typically you need to write down the steps leading to pred. Then you compute your loss as you did. This defines a computational graph, and from there the backward pass (or backpropagation) can start. You calculate the gradient starting from the end of the computational graph and proceed backward to get the gradient of the loss with respect to the weights.
I wrote an introduction to backpropagation in a blog post (https://www.qwertee.io/blog/an-introduction-to-backpropagation), where you should find more details about how to do it.

You just need to use calculus and the chain rule to solve this.
The Dice coefficient is defined as $\mathrm{Dice} = \frac{2|X \cap Y|}{|X| + |Y|}$. X is your pred and Y is your label.
For matrices X and Y of size MxN, we can write this as $\mathrm{Dice} = \frac{2\sum_{ij} X_{ij} Y_{ij}}{\sum_{ij} X_{ij} + \sum_{ij} Y_{ij}}$.
Apply the quotient rule for an arbitrary entry i,j in X:
$\frac{\partial\,\mathrm{Dice}}{\partial X_{ij}} = \frac{2 Y_{ij}\,(|X| + |Y|) - 2\sum_{ij} X_{ij} Y_{ij}}{(|X| + |Y|)^2}$
We can now divide the numerator and denominator by $|X| + |Y|$, and that gives us a pretty neat solution:
$\frac{\partial\,\mathrm{Dice}}{\partial X_{ij}} = \frac{2 Y_{ij} - \mathrm{Dice}}{|X| + |Y|}$
where $|X| = \sum_{ij} X_{ij}$ and $|Y| = \sum_{ij} Y_{ij}$, i.e. np.sum(X) and np.sum(Y) in the code below.
In python:
dX = (2*Y-dice)/(np.sum(X)+np.sum(Y))
And if you add a smooth term in the numerator and denominator of your dice:
dX = (2*Y-dice)/(np.sum(X)+np.sum(Y)+eps)
Also if you save the Dice, |X| and |Y| variables in your forward calculation, you don't have to calculate them again in the backwards pass.
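Putting it together, a minimal numpy sketch of the plain Dice loss with a forward and a backward pass (the function names are just illustrative; the loss is 1 - Dice, so its gradient is the negative of dX above):
import numpy as np

def dice_loss_forward(pred, label, eps=1e-7):
    # forward pass: plain Dice coefficient 2*|X∩Y| / (|X| + |Y|)
    intersection = np.sum(pred * label)
    denom = np.sum(pred) + np.sum(label) + eps
    dice = 2 * intersection / denom
    # cache what the backward pass needs
    cache = (label, dice, denom)
    return 1 - dice, cache

def dice_loss_backward(cache):
    # backward pass: d(1 - Dice)/dX = -(2*Y - Dice) / (|X| + |Y|)
    label, dice, denom = cache
    return -(2 * label - dice) / denom

pred = np.array([0.9, 0.1, 0.8])
label = np.array([1.0, 0.0, 1.0])
loss, cache = dice_loss_forward(pred, label)
dpred = dice_loss_backward(cache)  # gradient of the loss with respect to pred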

Related

Trouble implementing "concurrent" softmax function from paper (PyTorch)

I am trying to implement the so-called 'concurrent' softmax function given in the paper "Large-Scale Object Detection in the Wild from Imbalanced Multi-Labels". Below is the definition of the concurrent softmax:
$\sigma_{con}(z)_i = \frac{e^{z_i}}{\sum_{j \neq i} (1 - y_j)(1 - r_{ij})\, e^{z_j} + e^{z_i}}$
NOTE: I have left the (1-rij) term out for the time being because I don't think it applies to my problem given that my training dataset has a different type of labeling compared to the paper.
To keep it simple for myself I am starting off by implementing it in a very inefficient, but easy to follow, way using for loops. However, the output I get seems wrong to me. Below is the code I am using:
import torch

# here is a one-hot encoded vector for the multi-label classification
# the image thus has 2 correct labels out of a possible 3 classes
y = [0, 1, 1]
# these are some made up logits that might come from the network.
vec = torch.tensor([0.2, 0.9, 0.7])

def concurrent_softmax(vec, y):
    for i in range(len(vec)):
        zi = torch.exp(vec[i])
        sum_over_j = 0
        for j in range(len(y)):
            sum_over_j += (1 - y[j]) * torch.exp(vec[j])
        out = zi / (sum_over_j + zi)
        yield out

for result in concurrent_softmax(vec, y):
    print(result)
From this implementation I have realized that, no matter what value I give to the first logit in 'vec' I will always get an output of 0.5 (because it essentially always calculates zi / (zi+zi)). This seems like a major problem, because I would expect the value of the logits to have some influence on the resulting concurrent-softmax value. Is there a problem in my implementation then, or is this behaviour of the function correct and there is something theoretically that I am not understanding?
This is the expected behaviour given that y[j]=1 for every other class j: the sum over j then reduces to (1-y[i])*exp(z_i) = exp(z_i), so the output is exp(z_i)/(exp(z_i)+exp(z_i)) = 0.5 regardless of the value of that logit.
Note you can simplify the summation with a dot product:
y = torch.tensor(y)

def concurrent_softmax(z, y):
    sum_over_j = torch.dot((torch.ones(len(y)) - y), torch.exp(z))
    for zi in z:
        numerator = torch.exp(zi)
        denominator = sum_over_j + numerator
        yield numerator / denominator
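For a quick sanity check with the values from the question (an illustrative run, assuming the generator above):
vec = torch.tensor([0.2, 0.9, 0.7])
y = torch.tensor([0., 1., 1.])
for p in concurrent_softmax(vec, y):
    print(p)  # the first value is 0.5 for any vec[0], as explained above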

Checking Neural Network Gradient with Finite Difference Methods Doesn't Work

After a full week of print statements, dimensional analysis, refactoring, and talking through the code out loud, I can say I'm completely stuck.
The gradients my cost function produces are too far from those produced by finite differences.
I have confirmed my cost function produces correct costs both with and without regularization. Here's the cost function:
def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
    # reshape parameter/weight vectors to suit network size
    Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)], (hidden_layer_size, (input_layer_size + 1)))
    Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):], (num_labels, (hidden_layer_size + 1)))
    if lambda_ is None:
        lambda_ = 0
    # grab number of observations
    m = X.shape[0]
    # init variables we must return
    cost = 0
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)
    # one-hot encode the vector y
    y_mtx = pd.get_dummies(y.ravel()).to_numpy()
    ones = np.ones((m, 1))
    X = np.hstack((ones, X))
    # layer 1
    a1 = X
    z2 = Theta1 @ a1.T
    # layer 2
    ones_l2 = np.ones((y.shape[0], 1))
    a2 = np.hstack((ones_l2, sigmoid(z2.T)))
    z3 = Theta2 @ a2.T
    # layer 3
    a3 = sigmoid(z3)
    reg_term = (lambda_/(2*m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1))) + np.sum(np.sum(np.multiply(Theta2, Theta2))) - np.subtract((Theta1[:,0].T @ Theta1[:,0]), (Theta2[:,0].T @ Theta2[:,0])))
    cost = (1/m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1-a3).T * (1-y_mtx))) + reg_term
    # BACKPROPAGATION
    # δ3 equals the difference between a3 and the y_matrix
    d3 = a3 - y_mtx.T
    # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units) multiplied element-wise by the g′() of z2 (computed back in Step 2).
    d2 = Theta2[:,1:].T @ d3 * sigmoidGradient(z2)
    # Δ1 equals the product of δ2 and a1.
    Delta1 = d2 @ a1
    Delta1 /= m
    # Δ2 equals the product of δ3 and a2.
    Delta2 = d3 @ a2
    Delta2 /= m
    reg_term1 = (lambda_/m) * np.append(np.zeros((Theta1.shape[0], 1)), Theta1[:,1:], axis=1)
    reg_term2 = (lambda_/m) * np.append(np.zeros((Theta2.shape[0], 1)), Theta2[:,1:], axis=1)
    Theta1_grad = Delta1 + reg_term1
    Theta2_grad = Delta2 + reg_term2
    grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())
    return cost, grad
Here's the code to check the gradients. I have been over every line and there is nothing whatsoever that I can think of to change here. It seems to be in working order.
def checkNNGradients(lambda_):
    """
    Creates a small neural network to check the backpropagation gradients.
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Input: Regularization parameter, lambda, as int or float.
    Output: Analytical gradients produced by backprop code and the numerical gradients (computed
            using computeNumericalGradient). These two gradient computations should result in
            very similar values.
    """
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    # generate 'random' test data
    Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
    Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)
    # reusing debugInitializeWeights to generate X
    X = debugInitializeWeights(m, input_layer_size - 1)
    y = np.ones(m) + np.remainder(np.arange(m), num_labels)
    # unroll parameters
    nn_params = np.append(Theta1.ravel(), Theta2.ravel())
    costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)
    cost, grad = costFunc(nn_params)
    numgrad = computeNumericalGradient(costFunc, nn_params)
    # examine the two gradient computations; the two columns should be very similar.
    print('The columns below should be very similar.\n')
    # Credit: http://stackoverflow.com/a/27663954/583834
    print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
    for numerical, analytical in zip(numgrad, grad):
        print('{:<25}{}'.format(numerical, analytical))
    # If you have a correct implementation, and assuming you used EPSILON = 0.0001
    # in computeNumericalGradient.m, then diff below should be less than 1e-9
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
    print(diff)
    print("\n")
    print('If your backpropagation implementation is correct, then \n' \
          'the relative difference will be small (less than 1e-9). \n' \
          '\nRelative Difference: {:.10f}'.format(diff))
The check function generates its own data using a debugInitializeWeights function (so there's the reproducible example; just run that and it will call the other functions), and then calls the function that calculates the gradient using finite differences. Both are below.
def debugInitializeWeights(fan_out, fan_in):
    """
    Initializes the weights of a layer with fan_in
    incoming connections and fan_out outgoing connections using a fixed
    strategy.
    Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
           of incoming connections for the same layer as int.
    Output: Weight matrix, W, of size (fan_out, 1 + fan_in); the first column of W handles the "bias" terms.
    """
    W = np.zeros((fan_out, 1 + fan_in))
    # Initialize W using "sin"; this ensures that the values in W are of similar scale,
    # which will be useful for debugging
    W = np.sin(range(1, np.size(W)+1)) / 10
    return W.reshape(fan_out, fan_in+1)

def computeNumericalGradient(J, nn_params):
    """
    Computes the gradient using "finite differences"
    and provides a numerical estimate of the gradient (i.e.,
    gradient of the function J around theta).
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Inputs: Cost function, J, such as nnCost; parameter vector, theta (nn_params).
    Output: Gradient vector using finite differences. Per Dr. Ng,
            'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
            J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should
            be the (approximately) the partial derivative of J with respect
            to theta(i).)'
    """
    numgrad = np.zeros(nn_params.shape)
    perturb = np.zeros(nn_params.shape)
    e = .0001
    for i in range(np.size(nn_params)):
        # Set perturbation (i.e., noise) vector
        perturb[i] = e
        # run cost fxn w/ noise added to and subtracted from parameters theta in nn_params
        cost1, grad1 = J((nn_params - perturb))
        cost2, grad2 = J((nn_params + perturb))
        # record the difference in cost function outputs; this is the numerical gradient
        numgrad[i] = (cost2 - cost1) / (2*e)
        perturb[i] = 0
    return numgrad
The code is not for class. That MOOC was in MATLAB and it's over. This is for me. Other solutions exist on the web; looking at them has proved fruitless. Everyone has a different (inscrutable) approach. So, I'm in serious need of assistance or a miracle.
Edit/Update: Fortran ordering when raveling vectors influences the outcome, but I have not been able to get the gradients to move together changing that option.
One thought: I think your perturbation is a little large, being 1e-4. For double precision floating point numbers, it should be more like 1e-8, i.e., the root of the machine precision (or are you working with single precision?!).
That being said, finite differences can be very bad approximations to true derivatives. Specifically, floating point computations in numpy are not deterministic, as you seem to have found out. The noise in evaluations can cancel out many significant digits under some circumstances. What values are you seeing and what are you expecting?
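As a toy illustration of that step-size trade-off (a made-up scalar example, not the poster's network):
import numpy as np

f = lambda x: np.sin(x)  # toy function with known derivative cos(x)
x0 = 1.0
for e in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10):
    num = (f(x0 + e) - f(x0 - e)) / (2 * e)  # central difference, as in computeNumericalGradient
    print(f"eps={e:.0e}  abs error={abs(num - np.cos(x0)):.2e}")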
All of the following figured into the solution to my problem. For those attempting to translate MATLAB code to Python, whether from Andrew Ng's Coursera Machine Learning course or not, these are things everyone should know.
MATLAB stores arrays in Fortran (column-major) order; NumPy defaults to C (row-major) order. This affects how vectors are populated and, thus, your results. You should stay in Fortran order if you want your answers to match what you did in MATLAB. See the NumPy docs.
Getting your vectors in Fortran order can be as easy as passing order='F' as an argument to .reshape(), .ravel(), or .flatten(). You can achieve the same thing with .ravel() by transposing the array first and then ravelling it, like so: X.T.ravel().
Speaking of .ravel(): the .ravel() and .flatten() functions do not do the same thing and may have different use cases. For example, .flatten() (which always returns a copy) is preferred by SciPy optimization methods. So, if your equivalent of fminunc isn't working, it's likely because you forgot to .flatten() your response vector y. See this Stack Overflow Q&A and the docs on .ravel(), which link to .flatten().
If you're translating your code from a MATLAB live script into a Jupyter notebook or Google Colab, you must police your namespace. On one occasion, I found that the variable I thought was being passed was not actually the variable being passed. Why? Jupyter and Colab notebooks keep a lot of global state that one would never create in an ordinary script.
There is a better way to evaluate the differences between numerical and analytical gradients: the relative error comparison, np.abs(numerical - analytical) / (numerical + analytical). Read about it in the CS231n notes. Also, consider the accepted post above.
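For example, a small sketch of such an element-wise check, using the max-of-absolute-values denominator from the CS231n notes (the function name and the threshold mentioned in the comment are illustrative):
import numpy as np

def relative_error(numerical, analytical, eps=1e-12):
    # element-wise |a - b| / max(|a|, |b|), guarded against division by zero
    num = np.abs(numerical - analytical)
    den = np.maximum(np.abs(numerical), np.abs(analytical)) + eps
    return num / den

# e.g. after running checkNNGradients-style code:
# rel = relative_error(numgrad, grad)
# print(rel.max())  # values around 1e-7 or smaller usually indicate a correct gradient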

Can Tensorflow work out gradients for integral approximations?

I am trying to use Hamiltonian Monte Carlo (HMC, from TensorFlow Probability), but my target distribution contains an intractable 1-D integral which I approximate with the trapezoidal rule. My understanding of HMC is that it calculates gradients of the target distribution to build a more efficient transition kernel. My question is: can TensorFlow work out gradients with respect to the parameters of a function like this, and are they meaningful?
For example this is a log-probability of the target distribution where 'A' is a model parameter:
# integrate e^At * f[t] with respect to t between 0 and t, for all t
t = tf.linspace(0., 10., 100)
f = tf.ones(100)
delta = t[1]-t[0]
sum_term = tfm.multiply(tfm.exp(A*t), f)
integrals = 0.5*delta*tfm.cumsum(sum_term[:-1] + sum_term[1:], axis=0)
pred = integrals
sq_diff = tfm.square(observed_data - pred)
sq_diff = tf.reduce_sum(sq_diff, axis=0)
log_lik = -0.5*tfm.log(2*PI*variance) - 0.5*sq_diff/variance
return log_lik
Are the gradients of this function in terms of A meaningful?
Yes, you can use TensorFlow's GradientTape to work out the gradients. I assume you have a mathematical function outputting log_lik with many inputs, one of which is A.
Using GradientTape to get the gradient of A
To get the gradients of log_lik with respect to A, you can use tf.GradientTape in TensorFlow.
For example:
with tf.GradientTape(persistent=True) as g:
    g.watch(A)
    t = tf.linspace(0., 10., 100)
    f = tf.ones(100)
    delta = t[1] - t[0]
    sum_term = tfm.multiply(tfm.exp(A*t), f)
    integrals = 0.5*delta*tfm.cumsum(sum_term[:-1] + sum_term[1:], axis=0)
    pred = integrals
    sq_diff = tfm.square(observed_data - pred)
    sq_diff = tf.reduce_sum(sq_diff, axis=0)
    log_lik = -0.5*tfm.log(2*PI*variance) - 0.5*sq_diff/variance
    z = log_lik

## then, you can get the gradients of log_lik with respect to A like this
dz_dA = g.gradient(z, A)
dz_dA contains the partial derivatives of log_lik with respect to (each element of) A.
The code above just shows the idea. To make it work you need to do the calculation with tensor operations, so modify your function to use tensor types throughout.
Another example but in tensor operation
x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)
    with tf.GradientTape() as gg:
        gg.watch(x)
        y = x * x
    dy_dx = gg.gradient(y, x)     # Will compute to 6.0
d2y_dx2 = g.gradient(dy_dx, x)    # Will compute to 2.0
You can see more examples in the documentation: https://www.tensorflow.org/api_docs/python/tf/GradientTape
Further discussion on "meaningfulness"
Let me translate the Python code into mathematics first (I use https://www.codecogs.com/latex/eqneditor.php, hope it displays properly):
# integrate e^At * f[t] with respect to t between 0 and t, for all t
From the above, you have a function; I will call it $g(t, A) = e^{At} f(t)$.
Then you are doing a definite integral; I will call it $G(t, A) = \int_0^{t} g(s, A)\, ds$.
From your code, t is not a variable any more; it is set to 10. So we reduce to a function that has only one variable, $h(A) = G(10, A)$.
Up to here, the function h has a definite integral inside. But since you are approximating it, we should not think of it as a real integral (dt -> 0); it is just another chain of simple maths. No mystery here.
Then the last output, log_lik, is simply some simple mathematical operations with one new input variable, observed_data, which I will call y.
Then the function z that computes log_lik is: $z(A) = -\tfrac{1}{2}\log(2\pi\,\mathrm{variance}) - \tfrac{1}{2}\,\frac{(y - h(A))^2}{\mathrm{variance}}$.
z is no different from any other normal chain of maths operations in TensorFlow. Therefore dz_dA is meaningful in the sense that the gradient of z w.r.t. A gives you the direction in which to update A so that you can minimize z.
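As a self-contained illustration that the trapezoidal approximation is differentiable with respect to A (a minimal sketch with made-up values, not the poster's actual model):
import tensorflow as tf

A = tf.Variable(0.5)
with tf.GradientTape() as tape:
    t = tf.linspace(0., 10., 100)
    delta = t[1] - t[0]
    sum_term = tf.math.exp(A * t)  # f(t) = 1 here
    # trapezoidal rule, like the question's cumsum but reduced to a single value
    integral = 0.5 * delta * tf.reduce_sum(sum_term[:-1] + sum_term[1:])

print(tape.gradient(integral, A))  # a finite, well-defined gradient d(integral)/dA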

Differentiable round function in Tensorflow?

So the output of my network is a list of probabilities, which I then round using tf.round() to be either 0 or 1; this is crucial for this project.
I then found out that tf.round isn't differentiable so I'm kinda lost there.. :/
Something along the lines of x - sin(2pi x)/(2pi)?
I'm sure there's a way to squish the slope to be a bit steeper.
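A rough sketch of that idea (the helper name is just for illustration); subtracting the scaled sine term pulls values toward the nearest integer while staying differentiable:
import numpy as np
import tensorflow as tf

def sine_soft_round(x):
    # x minus a scaled sine wave: differentiable, biased toward the nearest integer
    return x - tf.sin(2 * np.pi * x) / (2 * np.pi)

x = tf.constant([0.1, 0.5, 0.7, 0.9])
print(sine_soft_round(x))  # values pulled toward 0 or 1, but not exactly rounded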
You can use the fact that tf.maximum() and tf.minimum() are differentiable, and the inputs are probabilities from 0 to 1
# round numbers less than 0.5 to zero;
# by making them negative and taking the maximum with 0
differentiable_round = tf.maximum(x-0.499,0)
# scale the remaining numbers (0 to 0.5) to greater than 1
# the other half (zeros) is not affected by multiplication
differentiable_round = differentiable_round * 10000
# take the minimum with 1
differentiable_round = tf.minimum(differentiable_round, 1)
Example:
[0.1, 0.5, 0.7]
[-0.399, 0.001, 0.201] # x - 0.499
[0, 0.001, 0.201] # max(x-0.499, 0)
[0, 10, 2010] # max(x-0.499, 0) * 10000
[0, 1.0, 1.0] # min(max(x-0.499, 0) * 10000, 1)
This works for me:
x_rounded_NOT_differentiable = tf.round(x)
x_rounded_differentiable = x - tf.stop_gradient(x - x_rounded_NOT_differentiable)
Rounding is a fundamentally nondifferentiable function, so you're out of luck there. The normal procedure for this kind of situation is to find a way to either use the probabilities directly, say by using them to calculate an expected value, or to take the maximum probability that is output and choose that one as the network's prediction. If you aren't using the output for calculating your loss function, though, you can go ahead and just apply rounding to the result, and it doesn't matter that it isn't differentiable. Now, if you want an informative loss function for the purpose of training the network, maybe you should consider whether keeping the output in the format of probabilities might actually be to your advantage (it will likely make your training process smoother); that way you can just convert the probabilities to actual estimates outside of the network, after training.
Building on a previous answer, a way to get an arbitrarily good approximation is to approximate round() using a finite Fourier series, with as many terms as you need. Fundamentally, you can think of round(x) as adding a reverse (i.e. descending) sawtooth wave to x. So, using the Fourier expansion of the sawtooth wave, we get
$\mathrm{round}(x) \approx x - \frac{1}{\pi} \sum_{n=1}^{N} (-1)^{n+1} \frac{\sin(2\pi n x)}{n}$
With N = 5, we get a pretty nice approximation.
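A small sketch of that truncated series in TensorFlow (the function name and the number of terms are illustrative):
import numpy as np
import tensorflow as tf

def fourier_round(x, terms=5):
    # x minus the truncated Fourier series of the sawtooth x - round(x)
    result = x
    for n in range(1, terms + 1):
        result -= ((-1.0) ** (n + 1)) * tf.sin(2 * np.pi * n * x) / (np.pi * n)
    return result

x = tf.constant([0.2, 0.5, 0.8, 1.3])
print(fourier_round(x))  # roughly [0., 0.5, 1., 1.]; the jump midpoints stay at .5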
Kind of an old question, but I just solved this problem for TensorFlow 2.0. I am using the following round function in my audio auto-encoder project. I basically want to create a discrete representation of sound which is compressed in time. I use the round function to clamp the output of the encoder to integer values. It has been working well for me so far.
@tf.custom_gradient
def round_with_gradients(x):
    def grad(dy):
        return dy
    return tf.round(x), grad
In the range 0 to 1, translating and scaling a sigmoid can be a solution:
slope = 1000
center = 0.5
e = tf.exp(slope*(x-center))
round_diff = e/(e+1)
In tensorflow 2.10, there is a function called soft_round which achieves exactly this.
Fortunately, for those who are using lower versions, the source code is really simple, so I just copy-pasted those lines, and it works like a charm:
def soft_round(x, alpha, eps=1e-3):
    """Differentiable approximation to `round`.
    Larger alphas correspond to closer approximations of the round function.
    If alpha is close to zero, this function reduces to the identity.
    This is described in Sec. 4.1. in the paper
    > "Universally Quantized Neural Compression"<br />
    > Eirikur Agustsson & Lucas Theis<br />
    > https://arxiv.org/abs/2006.09952
    Args:
      x: `tf.Tensor`. Inputs to the rounding function.
      alpha: Float or `tf.Tensor`. Controls smoothness of the approximation.
      eps: Float. Threshold below which `soft_round` will return identity.
    Returns:
      `tf.Tensor`
    """
    # This guards the gradient of tf.where below against NaNs, while maintaining
    # correctness, as for alpha < eps the result is ignored.
    alpha_bounded = tf.maximum(alpha, eps)
    m = tf.floor(x) + .5
    r = x - m
    z = tf.tanh(alpha_bounded / 2.) * 2.
    y = m + tf.tanh(alpha_bounded * r) / z
    # For very low alphas, soft_round behaves like identity
    return tf.where(alpha < eps, x, y, name="soft_round")
alpha sets how soft the function is. Greater values lead to better approximations of the round function, but then it becomes harder to fit, since gradients vanish:
x = tf.convert_to_tensor(np.arange(-2, 2, .1).astype(np.float32))
for alpha in [3., 7., 15.]:
    y = soft_round(x, alpha)
    plt.plot(x.numpy(), y.numpy(), label=f'alpha={alpha}')
plt.legend()
plt.title('Soft round function for different alphas')
plt.grid()
In my case, I tried different values for alpha, and 3. looks like a good choice.

gradient descent for linear regression in python code

def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])  # flattens
    cost = np.zeros(iters)
    for i in range(iters):
        err = (X * theta.T) - y
        for j in range(params):
            term = np.multiply(err, X[:,j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost
Here is the code for the linear regression cost function and gradient descent that I found in a tutorial, but I am not quite sure how it works.
First, I get how the computeCost code works, since it just sums the squared errors and divides by 2M, where M is the number of data points.
For the gradientDescent code, I just don't understand how it works in general. I know the formula for updating theta is something like
theta = theta - (learningRate) * derivative of J (the cost function). But I am not sure where (alpha / len(X)) * np.sum(term) comes from on the line updating temp[0, j].
Please help me to understand!
I'll break this down for you. In your gradientDescent function, you are taking in the predictor variables (X), the target variable (y),
the weight matrix (theta), and two more parameters (alpha, iters) which are the training parameters. The function's job is to figure out how much
each column in the predictor variable (X) set should be multiplied by before adding them up to get the predicted values of the target variable (y).
In the first line of the function you are initializing a weight matrix called temp to zeros. This is basically the starting point of the
final weight matrix (theta) that the function will output in the end. The params variable is basically the number of weights, i.e. the number of predictor variables.
This line makes more sense in the context of neural networks, where you create abstractions of features.
In linear regression, the weight matrix will mostly be a one-dimensional array. In a typical feed-forward neural network we take in, let's say, 5 features and
convert them to maybe 4 or 6 or n features by using a weight matrix, and so on. In linear regression we simply distill all the input
features into one output feature. So theta will essentially be equivalent to a row vector (which is evident from the line temp[0,j]=...) and params will be the number of features. The cost array just stores
the cost at each iteration. Now on to the two loops. In the line err = (X * theta.T) - y under the first for loop, we are calculating the prediction
error for each training example. The shape of the err variable is (number of examples, 1). In the second for loop we are training the model,
that is, we are gradually updating our temp matrix. We run the second for loop iters
number of times. In Neural network context, we generally call it epochs. It is basically the number of times you want to train the model.
Now the line term = np.multiply(err, X[:,j]): here we are calculating the individual adjustment that should be made to each weight in the temp matrix.
We define the cost as (y_predicted - y_actual)**2 / number_of_training_points, where y_predicted = X_1*w_1 + X_2*w_2 + X_3*w_3 + ... If we differentiate this cost with respect to
a particular weight (say W_i), we get (y_predicted - y_actual)*(X_i) / number_of_training_points, where X_i is the column that W_i multiplies with. So the term = line
basically computes that differentiation part. We can multiply the term variable with the learning rate and subtract from W_i.
But as you might notice the term variable is an array. But that is taken care of in the next line.
In the next line we take the average of the term (by summing it up and then dividing it by len(X))
and subtract it from the corresponding weight in the temp matrix. Once the weights have been updated and stored in the temp matrix,
we replace the original theta by temp. We repeat this process iters number of times.
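For what it's worth, here is the same update in vectorized form (a sketch using plain numpy arrays rather than np.matrix, not the tutorial's original code); the (alpha / len(X)) * np.sum(term) factor is just one entry of the averaged gradient row vector:
import numpy as np

def gradient_descent_vectorized(X, y, theta, alpha, iters):
    m = len(X)
    cost = np.zeros(iters)
    for i in range(iters):
        err = X @ theta.T - y                      # prediction error, shape (m, 1)
        theta = theta - (alpha / m) * (err.T @ X)  # learning rate times averaged gradient
        cost[i] = np.sum(np.power(X @ theta.T - y, 2)) / (2 * m)
    return theta, cost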
If, instead of writing it as ((alpha / len(X)) * np.sum(term)), you write it as (alpha * (np.sum(term) / len(X))), which you're allowed to do since multiplication and division are commutative (if I remember the term correctly), then you just multiply alpha by the average error, since term has len(X) elements anyway.
This means that you subtract the learning rate (alpha) times the average error, which would be something like X[j] * theta (actual) - y (ideal), which incidentally is also close enough to the derivative of (X*theta - y)^2.
