Implementing gradient descent with a known objective function - python

I have an objective function from a paper that I would like to minimize with gradient descent. I have not yet had to do this "from scratch" and would like some advice as to how to code it up manually. The objective function is:
T(L) = tr(X.T L^s X) - beta * ||L||.
where L is an N x N positive semidefinite matrix to be estimated, X is an N x M matrix, beta is a regularization constant, X.T = X transpose, and ||.|| is the Frobenius norm.
Also, L^s is the matrix power L^s = F Λ^s F.T, where F is the matrix of eigenvectors of L and Λ is the diagonal matrix of eigenvalues of L.
The derivative of the objective function is:
dT/dL = sum_{from r = 0 to r = s - 1} L^r (XX.T) L^(s-r-1) - 2 * beta * L
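Before running any descent loop, the stated derivative can be compared with a finite-difference estimate. The sketch below is only a sanity check and makes one assumption that is not in the question: the regularizer is taken as the squared Frobenius norm, because that is what the -2 * beta * L term in the derivative corresponds to. The function names and the random test data are hypothetical.

```python
import numpy as np

def objective(L, X, s, beta):
    # T(L) = tr(X.T L^s X) - beta * ||L||_F^2 (squared norm is an assumption)
    return np.trace(X.T @ np.linalg.matrix_power(L, s) @ X) - beta * np.sum(L ** 2)

def gradient(L, X, s, beta):
    # dT/dL = sum_{r=0}^{s-1} L^r (X X.T) L^(s-r-1) - 2 * beta * L
    A = X @ X.T
    G = np.zeros_like(L)
    for r in range(s):
        G += np.linalg.matrix_power(L, r) @ A @ np.linalg.matrix_power(L, s - r - 1)
    return G - 2 * beta * L

def directional_check(L, X, s, beta, E, eps=1e-6):
    # central finite difference along E vs. the inner product <gradient, E>
    fd = (objective(L + eps * E, X, s, beta) - objective(L - eps * E, X, s, beta)) / (2 * eps)
    return fd, np.sum(gradient(L, X, s, beta) * E)

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
L = B @ B.T                       # symmetric PSD test point
X = rng.standard_normal((5, 8))
E = rng.standard_normal((5, 5))   # arbitrary perturbation direction
fd, analytic = directional_check(L, X, s=3, beta=0.1, E=E)
print(fd, analytic)               # the two numbers should agree closely
```

The check is done at a symmetric L, which is where the formula is meant to hold.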
I have done very rudimentary gradient descent problems (such as matrix factorization) where optimization is done over every element of the matrix, or using packages/libraries. This kind of problem is more complex than I am used to, and I was hoping that some of you who are much more experienced with this sort of thing could help me out.
Any general advice is much appreciated as well as specific recommendations of how to code this up in python or R.
Here is the link for the paper with this function:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0128136#sec016
Thank you very much for your help!
Paul

In general, it would probably be advisable to use a machine learning library such as tensorflow or pytorch. If you go down this route you have several advantages: 1) an efficient C++ implementation of the tensor operations, 2) automatic differentiation, and 3) easy access to more sophisticated optimizers (e.g. ADAM).
If you prefer to do the gradient computation yourself, you could do that by setting the gradient L.grad manually before the optimization step.
A simple implementation would look like this:
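import torch

n = 10      # L is n x n
m = 20      # X is n x m
s = 3       # power of L
b = 1e-3    # regularization constant beta
n_it = 40   # number of gradient steps

# L = torch.nn.Parameter(torch.rand(n, n))
# Parametrize L = F diag(D) F.T; note F is not constrained to be orthogonal here
F = torch.nn.Parameter(torch.rand(n, n))
D = torch.nn.Parameter(torch.rand(n))
X = torch.rand((n, m))
opt = torch.optim.SGD([F, D], lr=1e-4)

for i in range(n_it):
    L = F.matmul(D.unsqueeze(1) * F.T)            # L = F diag(D) F.T
    L_s = F.matmul((D**s).unsqueeze(1) * F.T)     # L^s = F diag(D^s) F.T (exact only when F is orthogonal)
    # tr(X.T L^s X) - beta * ||L||, with the Frobenius norm of L itself
    loss = X.T.matmul(L_s).matmul(X).trace() - b * L.norm(2)
    print(loss)
    opt.zero_grad()
    loss.backward()
    opt.step()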
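
For the manual route mentioned above (setting L.grad yourself), a minimal sketch could look like the following. It optimizes L directly, uses the derivative formula from the question, and assumes the squared Frobenius norm as the regularizer so that the -2 * beta * L term matches. The starting point, the step count, and the helper name manual_grad are illustrative choices, and no positive semidefinite constraint is enforced.

```python
import torch

torch.manual_seed(0)
n, m, s, b = 10, 20, 3, 1e-3

X = torch.rand(n, m, dtype=torch.float64)
A = X @ X.T                                   # X X.T, fixed during optimization
B0 = torch.rand(n, n, dtype=torch.float64)
L = torch.nn.Parameter(B0 @ B0.T / n)         # symmetric PSD starting point
opt = torch.optim.SGD([L], lr=1e-4)

def manual_grad(L):
    # sum_{r=0}^{s-1} L^r (X X.T) L^(s-r-1) - 2 * beta * L
    g = torch.zeros_like(L)
    for r in range(s):
        g += torch.linalg.matrix_power(L, r) @ A @ torch.linalg.matrix_power(L, s - r - 1)
    return g - 2 * b * L

losses = []
for _ in range(20):
    with torch.no_grad():
        # objective value, tracked only for monitoring
        loss = (X.T @ torch.linalg.matrix_power(L, s) @ X).trace() - b * (L ** 2).sum()
        losses.append(loss.item())
        L.grad = manual_grad(L)               # hand-written gradient instead of loss.backward()
    opt.step()

print(losses[0], losses[-1])
```

Because autograd is bypassed, the optimizer only sees whatever is stored in L.grad, so any optimizer from torch.optim can be swapped in without other changes.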

Related

Scipy root-finding with each dimension independent

Scipy.optimize.root finds the roots of a vector function, while Scipy.optimize.root_scalar finds the root of a scalar function. What I need to solve is somewhere in between. I have a bunch of complex functions f_i depending on index i and x_i, and I want to solve f_1(x_1)=0,f_2(x_2)=0,...,f_n(x_n)=0. But instead of solving them in a for loop, I want to solve them in a vectorized style. The reason is that querying the values of f_1,...,f_n one at a time in a for loop is expensive, while querying them as a batch (f_1,...f_n) is relatively cheap.
Let f=(f_1,...,f_n) and x=(x_1,...x_n). We want to solve f(x)=(f_1(x_1),f_2(x_2),...f_n(x_n))=0. Directly calling scipy.optimize.root is not ideal since the solver has no idea that each dimension is independent.
A toy example:
from scipy import optimize
import numpy as np
coef = np.arange(10)
def f(x):
    return x ** 2 + 2 * coef * x + coef ** 2
optimize.root(f, np.zeros(10))
How can we let the solver know each dimension is independent to speed it up?
The above is just a toy example to illustrate my problem. In the real case, the function f is a black box and there is no analytical derivative for the components f_1, f_2, ...f_n, so I cannot simply pass a diagonal Jacobian to the solver. I tried to find a way to tell the solver that the Jacobian matrix should be diagonal, but I have had no luck with this approach. Any suggestions?
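One possible direction, sketched here under assumptions rather than as a definitive answer: because each f_i depends only on x_i, a single extra batched evaluation f(x + h) yields a finite-difference estimate of the whole Jacobian diagonal, so a Newton step can be taken on all components at once. The function solve_independent and the test function below are hypothetical; the test function has simple roots and is not the toy function from the question, whose roots are double roots. scipy.optimize.newton also accepts an array-valued x0 and then iterates on each component independently, which may be worth trying first.

```python
import numpy as np

def solve_independent(f, x0, h=1e-7, tol=1e-10, maxiter=100):
    """Newton iteration on n independent scalar equations, evaluated in batch."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(maxiter):
        fx = f(x)
        if np.all(np.abs(fx) < tol):
            break
        # one extra batched call estimates the whole Jacobian diagonal
        dfx = (f(x + h) - fx) / h
        # update only components that are not converged and have a usable slope
        active = (np.abs(fx) >= tol) & (dfx != 0)
        x[active] -= fx[active] / dfx[active]
    return x

coef = np.arange(10.0)

def f(x):
    # hypothetical black box: component i depends only on x[i]
    return x ** 3 + x - coef

roots = solve_independent(f, np.zeros(10))
print(np.abs(f(roots)).max())  # residual of the worst component
```

Each iteration costs two batched calls to f, regardless of n.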

Optimization Problem with fast matrix-vector multiplication in Python / cvxpy

I want to solve the following (convex) minimization problem:
min ||x||_1 under the constraints sgn(A[x,R]) = y and ||x||_2 = 1
where A is an m x (N+1) matrix, x in R^N is a vector, and [x,R] is the vector created by appending a given number R to x. The objective is to find the optimal value for x.
A is a Fourier matrix and there are fast matrix-vector, inversion, etc. algorithms available. Since this matrix is really big, I need to use an optimization algorithm that utilizes this.
Currently, I use the following implementation in cvxpy, which is way too slow:
import cvxpy as cvx
# rewrite the problem in the form x = x^- + x^+
n = A.shape[1]-1
vx = cvx.Variable(2*n)
objective = cvx.Minimize(cvx.pnorm(vx, 1)) # min ||x||_1
constraints = [vx >= 0, cvx.multiply(A[:,:n] @ vx[:n] - A[:,:n] @ vx[n:] + A[:,n]*R, y) >= 0,
               cvx.norm(vx, 2) <= R] # sgn(A[x,R]) = y, ||x||_2 <= R
x, solve_time = solve(vx, objective, constraints)
solution = x[:n] - x[n:]
Is there a way to use fast matrix computations in cvxpy? Or is there a better library? I found a few implementations that can do this for one special algorithm but not in the general case, so I was not able to implement my problem.
No. The solver will not call your matrix multiplication code. Solvers do their own linear algebra, which is very different in many ways. In a sense, your matrix multiplication is just notation for the problem statement.
Regarding performance, it depends heavily on where the bottleneck is. Is it in generating the model (in cvxpy itself) or in the solver? What solver are you using? Consider using a different solver. Obviously, we don't have enough information (and no reproducible example) to answer this question.

Pytorch's Autograd does not support complex matrix inversion, does anyone have a workaround?

Somewhere in my loss function, I invert a complex matrix of size 64*64. Although complex matrix inversion is supported for torch.tensor, the gradient cannot be computed in the training loop as I get this error:
RuntimeError: inverse does not support automatic differentiation for outputs with complex type.
Does anyone have a workaround for this issue? a custom function instead of torch.inverse maybe?
You can do the inverse yourself using the real-valued components of your complex matrix.
Some linear algebra first:
a complex matrix C can be written as a sum of two real matrices A and B (j is the sqrt of -1):
C = A + jB
Finding the inverse of C is basically finding two real valued matrices x and y such that
(A + jB)(x + jy) = I + j0
This boils down to solving the real-valued system of equations:
A x - B y = I
B x + A y = 0
or, in block form, [[A, -B], [B, A]] [[x], [y]] = [[I], [0]].
Now that we know how to reduce a complex matrix inversion to a real-valued matrix inversion, we can use pytorch's solve to do the inverse for us.
import torch

def complex_inverse(C):
    A = torch.real(C)
    B = torch.imag(C)
    # construct the left hand side of the system of equations
    # side note: from pytorch 1.7.1 you can use vstack and hstack instead of cat
    lhs = torch.cat([torch.cat([A, -B], dim=1), torch.cat([B, A], dim=1)], dim=0)
    # construct the rhs of the system of equations
    rhs = torch.cat([torch.eye(A.shape[0]).to(A), torch.zeros_like(A)], dim=0)
    # solve the system of equations
    # (torch.linalg.solve(lhs, rhs) replaces the older torch.solve(rhs, lhs))
    raw = torch.linalg.solve(lhs, rhs)
    # write the solution as a single complex matrix
    iC = raw[:C.shape[0], :] + 1j * raw[C.shape[0]:, :]
    return iC
You can verify the solution using numpy:
import numpy as np
# C is a complex torch tensor
iC = complex_inverse(C)
with torch.no_grad():
    print(np.isclose(iC.cpu().numpy() @ C.cpu().numpy(), np.eye(C.shape[0])).all())
Note that block-matrix inversion identities may let you reduce the computational cost of the solve operation.
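One way to read the block-matrix remark, sketched here in numpy for brevity: eliminating y from A x - B y = I and B x + A y = 0 gives x = (A + B A^{-1} B)^{-1} and y = -A^{-1} B x, so only N x N real systems are solved instead of one 2N x 2N system. This assumes the real part A is itself invertible, which the original answer does not state. The function name is hypothetical, and the same steps can be written with torch.linalg.solve and torch.linalg.inv.

```python
import numpy as np

def complex_inverse_blockwise(C):
    # split C = A + jB into real and imaginary parts
    A, B = C.real, C.imag
    AinvB = np.linalg.solve(A, B)        # A^{-1} B (assumes A is invertible)
    x = np.linalg.inv(A + B @ AinvB)     # real part of the inverse
    y = -AinvB @ x                       # imaginary part of the inverse
    return x + 1j * y

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
iC = complex_inverse_blockwise(C)
print(np.allclose(iC @ C, np.eye(6)))
```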
As of 1.9, PyTorch now supports complex autograd.

TensorFlow: Compute Hessian matrix (and higher order derivatives)

I would like to be able to compute higher order derivatives for my loss function. At the very least I would like to be able to compute the Hessian matrix. At the moment I am computing a numerical approximation to the Hessian but this is more expensive, and more importantly, as far as I understand, inaccurate if the matrix is ill-conditioned (with very large condition number).
Theano implements this through symbolic looping, see here, but Tensorflow does not seem to support symbolic control flow yet, see here. A similar issue has been raised on TF github page, see here, but it looks like nobody has followed up on the issue for a while.
Is anyone aware of more recent developments or ways to compute higher order derivatives (symbolically) in TensorFlow?
Well, you can, with little effort, compute the Hessian matrix!
Suppose you have two variables:
import numpy as np
import tensorflow as tf
x = tf.Variable(np.random.random_sample(), dtype=tf.float32)
y = tf.Variable(np.random.random_sample(), dtype=tf.float32)
and a function defined using these 2 variables:
f = tf.pow(x, cons(2)) + cons(2) * x * y + cons(3) * tf.pow(y, cons(2)) + cons(4) * x + cons(5) * y + cons(6)
where:
def cons(x):
    return tf.constant(x, dtype=tf.float32)
So in algebraic terms, this function is
f(x, y) = x^2 + 2xy + 3y^2 + 4x + 5y + 6
Now we define a method that computes the Hessian:
def compute_hessian(fn, vars):
    mat = []
    for v1 in vars:
        temp = []
        for v2 in vars:
            # computing derivative twice, first w.r.t v2 and then w.r.t v1
            temp.append(tf.gradients(tf.gradients(fn, v2)[0], v1)[0])
        # tensorflow returns None when there is no gradient, so we replace None with 0
        temp = [cons(0) if t is None else t for t in temp]
        temp = tf.stack(temp)  # tf.pack was renamed tf.stack in TensorFlow 1.0
        mat.append(temp)
    mat = tf.stack(mat)
    return mat
and call it with:
# arg1: our defined function, arg2: list of tf variables associated with the function
hessian = compute_hessian(f, [x, y])
Now we grab a tensorflow session, initialize the variables, and run hessian (TensorFlow 1.x graph-mode API):
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run(hessian))
Note: Since the function we used is quadratic in nature (and we are differentiating twice), the hessian returned will have constant values irrespective of the variables.
The output is :
[[ 2. 2.]
[ 2. 6.]]
A word of caution: Hessian matrices (or, more generally, tensors) are expensive to compute and store. You may want to reconsider whether you really need the full Hessian or just some of its properties. A number of them, including traces, norms, and top eigenvalues, can be obtained without the explicit Hessian matrix, using only a Hessian-vector product oracle. In turn, Hessian-vector products can be implemented efficiently in leading autodiff frameworks such as TensorFlow and PyTorch.
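To make the Hessian-vector product idea concrete, here is a small sketch that never forms the Hessian. It approximates H v by a central difference of the gradient, which is a simpler stand-in for the exact double-backpropagation product an autodiff framework would give. The gradient is written by hand for the quadratic f(x, y) used above, and the names grad_f and hvp are hypothetical.

```python
import numpy as np

def grad_f(p):
    # gradient of f(x, y) = x^2 + 2xy + 3y^2 + 4x + 5y + 6
    x, y = p
    return np.array([2 * x + 2 * y + 4, 2 * x + 6 * y + 5])

def hvp(grad, p, v, eps=1e-5):
    # H v ~ (grad(p + eps v) - grad(p - eps v)) / (2 eps); only gradients are needed
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    return (grad(p + eps * v) - grad(p - eps * v)) / (2 * eps)

# multiplying by the first unit vector recovers the first Hessian column
col0 = hvp(grad_f, [0.3, -1.2], [1.0, 0.0])
col1 = hvp(grad_f, [0.3, -1.2], [0.0, 1.0])
print(col0, col1)  # approximately [2, 2] and [2, 6]
```

The cost is two gradient evaluations per product, independent of the number of variables, which is what makes iterative methods such as power iteration for the top eigenvalue feasible.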

Scipy - how to further optimize sparse matrix code for stochastic gradient descent

I'm working on implementing the stochastic gradient descent algorithm for recommender systems using sparse matrices with Scipy.
This is what a first basic implementation looks like:
N = self.model.shape[0] #no of users
M = self.model.shape[1] #no of items
self.p = np.random.rand(N, K)
self.q = np.random.rand(M, K)
rows,cols = self.model.nonzero()
for step in xrange(steps):
    for u, i in zip(rows,cols):
        e=self.model-np.dot(self.p,self.q.T) #calculate error for gradient
        p_temp = learning_rate * ( e[u,i] * self.q[i,:] - regularization * self.p[u,:])
        self.q[i,:]+= learning_rate * ( e[u,i] * self.p[u,:] - regularization * self.q[i,:])
        self.p[u,:] += p_temp
Unfortunately, my code is still pretty slow, even for a small 4x5 ratings matrix. I think this is probably due to the for loop over the sparse matrix. I've tried expressing the q and p updates using fancy indexing, but since I'm still pretty new to scipy and numpy, I couldn't figure out a better way to do it.
Do you have any pointers on how I could avoid iterating over the rows and columns of the sparse matrix explicitly?
I have forgotten almost everything about recommender systems, so I may have misread your code, but you re-evaluate self.model-np.dot(self.p,self.q.T) inside each loop iteration, while I am almost convinced it should be evaluated once per step.
It also seems that you do the matrix multiplication by hand, which can probably be sped up with direct matrix multiplication (numpy or scipy will do it faster than you can by hand), something like this:
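for step in xrange(steps):
    e = self.model - np.dot(self.p, self.q.T)
    p_temp = learning_rate * np.dot(e, self.q)
    # the regularization term is scaled by the learning rate, as in the question's update
    self.q *= (1 - learning_rate * regularization)
    self.q += learning_rate * np.dot(e.T, self.p)
    self.p *= (1 - learning_rate * regularization)
    self.p += p_temp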
Are you sure you are implementing SGD? In each step you have to calculate the error of one single user rating, not the error of the whole rating matrix. Or maybe I do not understand this line of your code:
e=self.model-np.dot(self.p,self.q.T) #calculate error for gradient
As for the Scipy library, I am sure you will hit a bottleneck if you access the elements of the sparse matrix directly. Instead of accessing individual elements of the rating matrix through the Scipy sparse matrix, you can bring the specific row and column into RAM in each step and then do your calculation.
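Putting the two points together (one rating per update, and no element access on the sparse matrix inside the loop), a minimal sketch might look like this. It assumes the observed ratings have already been extracted once as plain arrays of row indices, column indices, and values. The function name, hyperparameters, and the small dense example matrix are hypothetical.

```python
import numpy as np

def sgd_mf(rows, cols, vals, n_users, n_items, k=2, steps=200,
           lr=0.01, reg=0.02, seed=0):
    rng = np.random.default_rng(seed)
    p = rng.random((n_users, k))   # user factors
    q = rng.random((n_items, k))   # item factors
    for _ in range(steps):
        # visit the observed ratings in random order
        for idx in rng.permutation(len(vals)):
            u, i = rows[idx], cols[idx]
            e = vals[idx] - p[u] @ q[i]          # error of this single rating only
            p_u = p[u].copy()                    # keep old p[u] for the q update
            p[u] += lr * (e * q[i] - reg * p[u])
            q[i] += lr * (e * p_u - reg * q[i])
    return p, q

# small example: zeros stand for missing ratings
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 2],
              [1, 1, 0, 5, 0],
              [0, 1, 5, 4, 0]], dtype=float)
rows, cols = np.nonzero(R)
vals = R[rows, cols]
p, q = sgd_mf(rows, cols, vals, *R.shape)
print(np.sum((vals - np.sum(p[rows] * q[cols], axis=1)) ** 2))  # squared error on observed ratings
```

Each update touches only one row of p and one row of q, so the cost per step no longer depends on the size of the full ratings matrix.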
