Calculate Second Gradient with PyTorch

Calculate Second Gradient with PyTorch - python

In PyTorch, there are two ways of calculating second gradients. The first method is to use torch.autograd.grad function, and the other is to use backward function. I use the following examples to illustrate it:
Method 1:
x=torch.tensor([3.0], requires_grad=True)
y = torch.pow(x, 2)
grad_1 = torch.autograd.grad(y, x, create_graph=True)
print(grad_1[0].item())
grad_2 = torch.autograd.grad(grad_1[0], x)
print(grad_2)
The result makes sense for me, and the second gradient of the function is 2.
Method 2:
x=torch.tensor([3.0], requires_grad=True)
y = torch.pow(x, 2) # y=x**2
y.backward(retain_graph=True)
print(x.grad)
y.backward()
print(x.grad)
When calculating the first gradient, I use create_graph=True to make sure that we can use back prorogation method to calculate the second gradient. However, the result is is 12, which is wrong. I was wondering what's wrong with the second method?

Use the grad method from torch.autograd to differentiate your function. So the steps would be:
>>> import torch
>>> from torch.autograd import grad
>>> x = torch.tensor([3.0], requires_grad=True)
>>> y = torch.pow(x,2)
>>> z = grad(y, x, create_graph=True)
>>> print(grad(z, x, create_graph=True))
>>> (tensor([2.], grad_fn=<MulBackward0>),)
Similarly, you can loop through to make the nth derivative.

Related

Multiclass logistic regression from scratch

I’m trying to apply multiclass logistic regression from scratch. The dataset is the MNIST.
I built some functions such as hypothesis, sigmoid, cost function, cost function derivate, and gradient descendent. My code is below.
I’m struggling with:
As all images are labeled with the respective digit that they represent. There are a total of 10 classes.
Inside the function gradient descendent, I need to loop through each class, but I do not know how to apply it using the One vs All method.
In other words, what I need to do are:
How to filter each class inside the gradient descendent.
After that, how to build a function to predict the test set.
Here is my code.
import numpy as np
import pandas as pd
# Only training data set
# the test data will be load later.
url='https://drive.google.com/file/d/1-MO8oCfq4KU361QeeL4DdafVBhZePUNT/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url,header = None)
X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
def hypothesis(X, thetas):
return sigmoid( X.dot(thetas)) #- 0.0000001
def sigmoid(z):
return 1/(1+np.exp(-z))
def losscost(X, y, m, thetas):
h = hypothesis(X, thetas)
return -(1/m) * ( y.dot(np.log(h)) + (1-y).dot(np.log(1-h)) )
def derivativelosscost(X, y, m, thetas):
h = hypothesis(X, thetas)
return (h-y).dot(X)/m
def descendinggradient(X, y, m, epoch, alpha, thetas):
n = np.size(X, 1)
J_historico = []
for i in range(epoch):
for j in range(0,10): # 10 classes
# How to filter each class inside here (inside this def descendinggradient)?
# 2 lines below are wrong.
#thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
#J_historico = J_historico + [losscost(X, y, m, thetas)]
return [thetas, J_historico]
alpha = 0.01
epoch = 50
(thetas, J_historico) = descendinggradient(X, y, m, epoch, alpha)
# After that, how to build a function to predict the test set.

Let me explain this problem step-by-step:
First since you code doesn't provides the actual data or a link to it I've created a random dataset followed by the same commands you used to create X and Y:
batch_size = 20
num_classes = 10
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
4* rng.random((batch_size, num_classes + 1)) - 2, # Create Random Array Between -2, 2
columns=['X0','X1','X2','X3','X4','X5','X6','X7','X8', 'X9','Y']
)
X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
Next lets take a look at your hypothesis function. If we would just run hypothesis and take a look at the first sample, we will get a vector with the size (10,1). I also needed to provide the initial thetas for this case:
thetas = rng.random((X.shape[1],num_classes))
h = hypothesis(X, thetas)
print(h[0])
>>>[0.89701729 0.90050806 0.98358408 0.81786334 0.96636732 0.97819512
0.89118488 0.87238045 0.70612173 0.30256924]
Basically the function calculates a "propabilties"[1] for each class.
At this point we got to the first issue in your code. The result of the sigmoid function returns "propabilities" which are not "connected" to each other. So to set those "propabilties" in relation we need a another function: SOFTMAX. You will find plenty implementations about this functions. In short: It will calculate the "propabilites" based on the "sigmoid", so that the sum overall class-"propabilites" results to 1.
So for your second question "How to implement a predict after training", we only need to find the argmax value to determine the class:
h = hypothesis(X, thetas)
p = softmax(h) # needs to be implemented
prediction = np.argmax(p, axis=1)
print(prediction)
>>>[2 5 5 8 3 5 2 1 3 5 2 3 8 3 3 9 5 1 1 8]
Now that we know how to predict a class, we also need to know where to setup the training. We want to do this directly after the softmax function. But instead of using the argmax to determine the winning class, we use the costfunction and its derivative. Your problem in your code: You used the crossentropy loss for a binary problem. The binary problem also don't need to use the softmax function, because the sigmoid function already provides the connection of the two binary classes. So since we are not interested in the result at all of the cross-entropy-loss for multiple classes and only into its derivative, we also want to calculate this directly.
The conversion from binary crossentropy to multiclass is kind of unintuitive in the first view. I recommend to read a bit about it before implementing. After this you basicly use your line:
thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
for updating the thetas.
[1]These are not actuall propabilities, but this is a complete different topic.

Iterate over tensor in a custom loss function

I need to use this loss function for a CNN the list_distance and list_residual are output tensors from hidden layers which are important to compute the loss, but when i execute the code it gives me back this error
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
Is there another way to iterate over tensors without the use of the costruct
x in X or convert it in a numpy array or using the backend function of keras?
def DBL(y_true, y_pred, list_distances, list_residual, l=0.65):
prob_dist = []
Li = []
# mean of the images power spectrum
S = np.sum([np.power(np.abs(fp.fft2(residual)), 2)
for residual in list_residual], axis=0) / K.shape(list_residual)[0]
# log-ratio between the geometric and arithmetic of S
R = np.log10((scistats.gmean(S) / np.mean(S)))
for c_i, dis_i in enumerate(list_distances):
prob_dist.append([
np.exp(-dis_i) / sum([np.exp(-dis_j) if c_j != c_i else 0 for c_j, dis_j in enumerate(list_distances)])
])
for count, _ in enumerate(prob_dist):
Li.append(
-1 * np.log10(sum([p_j for c_j, p_j in enumerate(prob_dist[count])
if y_pred[count] == 1 and count != c_j])))
L0 = np.sum(Li)
return L0 - l * R

You need to define a custom function to feed into tf.map_fn() - Tensorflow dox
Mapper functions map (funnily enough) the existing object (tensor) into a new one using a function you define.
They apply the custom function to every element in the object, without all the mucking about with for loops.
For instance (non tested code, may not run - on my phone atm):
def custom(a):
b = a + 1
return b
original = np.array([2,2,2])
mapped = tf.map_fn(custom, original)
# mapped == [3, 3, 3] ... hopefully
Tensorflow examples all use lambda functions, so you might need to define your functions like that if the above doesn’t work. Tensorflow example:
elems = np.array([1, 2, 3, 4, 5, 6])
squares = map_fn(lambda x: x * x, elems)
# squares == [1, 4, 9, 16, 25, 36]
Edit:
As an aside, map functions are much easier to parallelise than for loops - it is assumed that each element of an object is processed uniquely - so you can see a performance uplift by using them.
Edit 2:
For the "reduce sum, but not on this index" part, I would heavily recommend you start looking back at matrix operations... As mentioned, map functions work element-wise - they are not aware of other elements. A reduce function is what you want, but even they are finiky when you try and do "not this index" sums... also tensorflow is built around matrix ops... Not the MapReduce paradigm.
Something along these lines might help:
sess = tf.Session()
var = np.ones([3, 3, 3]) * 5
zero_identity = tf.linalg.set_diag(
var, tf.zeros(var.shape[0:-1], dtype=tf.float64)
)
exp_one = tf.exp(var)
exp_two = tf.exp(zero_identity)
summed = tf.reduce_sum(exp_two, axis = [0,1])
final = exp_one / summed
print("input matrix: \n", var, "\n")
print("Identities of the matrix to Zero: \n", zero_identity.eval(session=sess), "\n")
print("Exponential Values numerator: \n", exp_one.eval(session=sess), "\n")
print("Exponential Values to Sum: \n", exp_two.eval(session=sess), "\n")
print("Summed values for zero identity matrix\n ... along axis [0,1]: \n", summed.eval(session=sess), "\n")
print("Output:\n", final.eval(session=sess), "\n")

Numerical computation of softmax cross entropy gradient

I implemented the softmax() function, softmax_crossentropy() and the derivative of softmax cross entropy: grad_softmax_crossentropy(). Now I wanted to compute the derivative of the softmax cross entropy function numerically. I tried to do this by using the finite difference method but the function returns only zeros. Here is my code with some random data:
import numpy as np
batch_size = 3
classes = 10
# random preactivations
a = np.random.randint(1,100,(batch_size,classes))
# random labels
y = np.random.randint(0,np.size(a,axis=1),(batch_size,1))
def softmax(a):
epowa = np.exp(a-np.max(a,axis=1,keepdims=True))
return epowa/np.sum(epowa,axis=1,keepdims=True)
print(softmax(a))
def softmax_crossentropy(a, y):
y_one_hot = np.eye(classes)[y[:,0]]
return -np.sum(y_one_hot*np.log(softmax(a)),axis=1)
print(softmax_crossentropy(a, y))
def grad_softmax_crossentropy(a, y):
y_one_hot = np.eye(classes)[y[:,0]]
return softmax(a) - y_one_hot
print(grad_softmax_crossentropy(a, y))
# Finite difference approach to compute grad_softmax_crossentropy()
eps = 1e-5
print((softmax_crossentropy(a+eps,y)-softmax_crossentropy(a,y))/eps)
What did I wrong?

Here's how you could do it. I think you're referring to the gradient wrt the activations indicated by y's indicator matrix.
First, I instantiate a as float to change individual items.
a = np.random.randint(1,100,(batch_size,classes)).astype("float")
Then,
np.diag(grad_softmax_crossentropy(a, y)[:, y.flatten()])
array([ -1.00000000e+00, -1.00000000e+00, -4.28339542e-04])
But also
b = a.copy()
for i, o in zip(y.max(axis=1), range(y.shape[0])):
b[o, i] += eps
(softmax_crossentropy(b,y)-softmax_crossentropy(a,y))/eps
[ -1.00000000e+00 -1.00000000e+00 -4.28125536e-04]
So basically you have to change a_i in softmax, not the entirety of a.

Composing functions (i.e., replacing a placeholder with a tensor)

I need to connect two functions: first I define f(x), then I define g(t), and finally I need to set that x = g(t).
MWE
import tensorflow as tf
import numpy as np
session = tf.Session()
# f(x)
x = tf.placeholder(shape=[None,4], dtype=tf.float64)
f = tf.square(x)
# g(t)
t = tf.placeholder(shape=[None,2], dtype=tf.float64)
g = tf.matmul(t,np.ones((2,4)))
session.run(tf.global_variables_initializer())
t_eval = np.asmatrix(np.random.rand(10,2))
# f(g(t))
session.run(f, {x: session.run(g, {t: t_eval})})
# but I want to do (does not work because I need to assign a value to placeholder x)
x = g
session.run(f, {t: t_eval})
Basically I want to replace placeholder x with tensor g. How can I do that?

You don't need x in your code at all.
f = tf.square(g) # replace your definition of f with this
session.run(f, {t: t_eval})

Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0

In Ridge Regression, we are solving Ax=b with L2 Regularization. The direct calculation is given by:
x = (ATA + alpha * I)-1ATb
I have looked at the scikit-learn code and they do implement the same calculation. But, I can't seem to get the same results for alpha > 0
The minimal code to reproduce this.
import numpy as np
A = np.asmatrix(np.c_[np.ones((10,1)),np.random.rand(10,3)])
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)
>>> [[ 0.37371021 0.19558433 0.06065241 0.17030177]]
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha).fit(A[:,1:],b)
print(np.c_[model.intercept_, model.coef_])
>>> [[ 0.61241566 0.02727579 -0.06363385 0.05303027]]
Any suggestions on what I can do to resolve this discrepancy?

This modification seems to yield the same result for the direct version and the numpy version:
import numpy as np
A = np.asmatrix(np.random.rand(10,3))
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print (x.T)
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha, tol=0.1, fit_intercept=False).fit(A ,b)
print model.coef_
print model.intercept_
It seems the main reason for the difference is the class Ridge has the parameter fit_intercept=True (by inheritance from class _BaseRidge) (source)
This is applying a data centering procedure before passing the matrices to the _solve_cholesky function.
Here's the line in ridge.py that does it
X, y, X_mean, y_mean, X_std = self._center_data(
X, y, self.fit_intercept, self.normalize, self.copy_X,
sample_weight=sample_weight)
Also, it seems you were trying to implicitly account for the intercept by adding the column of 1's. As you see, this is not necessary if you specify fit_intercept=False
Appendix: Does the Ridge class actually implement the direct formula?
It depends on the choice of the solverparameter.
Effectively, if you do not specify the solverparameter in Ridge, it takes by default solver='auto' (which internally resorts to solver='cholesky'). This should be equivalent to the direct computation.
Rigorously, _solve_cholesky uses numpy.linalg.solve instead of numpy.inv. But it can be easily checked that
np.linalg.solve(A.T*A + alpha * I, A.T*b)
yields the same as
np.linalg.inv(A.T*A + alpha * I)*A.T*b

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Calculate Second Gradient with PyTorch - python

Related

Multiclass logistic regression from scratch

Iterate over tensor in a custom loss function

Numerical computation of softmax cross entropy gradient

Composing functions (i.e., replacing a placeholder with a tensor)

Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0

Categories

Resources