I am trying to implement a real-valued cost function that evaluates a complex input in frequency space with PyTorch and autograd, since I am interested in the gradients of the cost function w.r.t. the input. When I compare the autograd results with the derivative that I computed by hand (with Wirtinger calculus), I get a different result. I'm not sure where I made the mistake, whether it is in my implementation or in my own derivation of the gradient.
The cost function and its derivative by hand look like this:
Formula of the cost function
My implementation is here:
import torch

def f_derivative_by_hand(f):
    f = torch.tensor(f, dtype=torch.complex128)
    ftilde = torch.fft.fft(f)
    absf = torch.abs(ftilde)
    f2 = absf**2
    C = torch.trapz(f2).numpy()
    grads = 2 * torch.fft.ifft(ftilde).numpy()
    return C, grads
def f_derivative_autograd(f):
    f = torch.tensor(f, dtype=torch.complex128, requires_grad=True)
    ftilde = torch.fft.fft(f)
    f2 = torch.abs(ftilde)**2
    C = torch.trapz(f2)
    C.backward()
    grads = f.grad
    return C.detach().numpy(), grads.detach().numpy()
When I use some data and evaluate it with both functions, the gradient from the automatic-differentiation implementation is tilted in comparison (note that I normalized the plotted arrays):
Autograd and derivative by hand comparison
I suspect there could also be something wrong with the automatic differentiation of the FFT, though, since if I remove the Fourier transform from the cost function and integrate the function in real space, both implementations match exactly except at the edges (again normalized):
No FFT autograd and derivative by hand
It would be fantastic if someone could help me figure out what is wrong!
After some more investigation, I found the solution to the problem of the tilted derivatives. Apparently, the trapezoidal integration rule assumes boundary conditions that produce artifacts at the boundaries, as discussed in this PyTorch forum post.
In my original problem, the observed tilt results from integrating the Fourier-transformed signal, which is asymmetric. The boundary artifacts introduce spatial frequencies which tilt the derivative in real space.
For me, the simplest solution is just to use a sum and weight it by the frequency differential. Then everything works out.
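For reference, a minimal sketch of that workaround (the d_freq argument is just an illustrative name for the frequency differential of my grid):
import torch

def f_derivative_autograd_sum(f, d_freq=1.0):
    # Same cost as before, but integrated with a plain weighted sum instead of
    # torch.trapz, so there are no trapezoidal end-point weights and therefore
    # no boundary artifacts.
    f = torch.tensor(f, dtype=torch.complex128, requires_grad=True)
    ftilde = torch.fft.fft(f)
    C = torch.sum(torch.abs(ftilde) ** 2) * d_freq
    C.backward()
    return C.detach().numpy(), f.grad.detach().numpy()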
Related
I know the curve_fit function of scipy and its power for fitting curves. I have read many examples here and in the documentation, but I cannot solve my problem.
For example, I have 10 files (chemical structures, but it does not matter) and ten experimental energy values. I have a function inside a class that calculates, for given parameters, the theoretical energy of each structure and returns a numpy array with the theoretical energy values.
I want to find the best parameters so that the theoretical values are as close as possible to the experimental ones. I will provide here a minimal example of my code.
This is the class function that reads the experimental energy files, extracts the correct substring and returns the values as a numpy array. The self.path is just the directory and self.nPoints = 10. It is not so important, but I include it for the sake of completeness:
def experimentalValues(self):
    os.chdir(self.path)
    energy = np.zeros(self.nPoints)
    for i in range(1, self.nPoints):
        f = open("p_" + str(i + 1) + ".xyz", "r")
        energy[i] = float(f.readlines()[1].split()[1])
        f.close()
    os.chdir('..')
    return energy
I calculate the theoretical value with this class function, which takes two numpy arrays as arguments, let's say
sigma = np.full(nSubstrate, 2.)
epsilon = np.full(nSubstrate, 0.15)
where nSubstrate = 9
Here is the class function. It reads the files and uses two nested loops to calculate the theoretical value for each file, returning the results as a numpy array.
def theoreticalEnergy(self, epsilon, sigma):
    os.chdir(self.path)
    cE = np.zeros(self.nPoints)
    for n in range(0, self.nPoints):
        filenameXYZ = "p_" + str(n + 1) + "_extended.xyz"
        allCoordinates = np.loadtxt(filenameXYZ, skiprows=0, usecols=(1, 2, 3))
        substrate = allCoordinates[0:self.nSubstrate]
        surface = allCoordinates[self.nSubstrate:]
        for i in range(0, substrate.shape[0]):
            positionAtomI = np.array(substrate[i][:])
            for j in range(0, surface.shape[0]):
                positionAtomJ = np.array(surface[j][:])
                distanceIJ = self.distance(positionAtomI, positionAtomJ)
                cE[n] += self.LennardJones(distanceIJ, epsilon[i], sigma[i])
    os.chdir('..')
    return cE
Again, for the sake of completeness, the Lennard-Jones class function is defined as
def LennardJones(self, distance, epsilon, sigma):
    repulsive = (sigma / distance) ** 12.
    attractive = (sigma / distance) ** 6.
    potential = 4. * epsilon * (repulsive - attractive)
    return potential
where in this case all the arguments are scalars, as is the return value.
To conclude the problem presentation, I have 3 ingredients:
a numpy array with the experimental data
two numpy arrays with a guess for the parameters sigma and epsilon
a function that takes those parameters and returns a numpy vector with the values to be fitted.
How can I solve this problem like the approach described in the documentation https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html?
Curve fitting
The curve_fit function fits a function f(w, x[i]) to points y[i] by finding w that minimizes sum((f(w, x[i]) - y[i])**2 for i in range(n)). As you will read in the first line after the function definition,
[It uses] non-linear least squares to fit a function, f, to data.
It refers to least_squares where it states
Given the residuals f(x) (an m-D real function of n real variables) and the loss function rho(s) (a scalar function), least_squares finds a local minimum of the cost function F(x):
Curve fitting is a kind of convex-cost multi-objective optimization. Since each individual cost is convex, you can add all of them together and the result will still be a convex function. Notice that the decision variables (the parameters to be optimized) are the same at every point.
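As a minimal illustration of that interface (synthetic data and a made-up exponential model, just to show how f, xdata, ydata and the parameters fit together):
import numpy as np
from scipy.optimize import curve_fit

# Made-up model: fit a and b in f(x) = a * exp(-b * x) to noisy data
def model(x, a, b):
    return a * np.exp(-b * x)

xdata = np.linspace(0, 4, 50)
ydata = model(xdata, 2.5, 1.3) + 0.05 * np.random.default_rng(0).normal(size=xdata.size)

popt, pcov = curve_fit(model, xdata, ydata, p0=[1.0, 1.0])
print(popt)  # fitted parameters, close to [2.5, 1.3]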
Your problem
In my understanding, for each energy level you have a different set of parameters. If you write it as a curve fitting problem, the objective function could be expressed as sum((f(w[i], x[i]) - y[i])**2 ...), where y[i] is determined by the energy level. Since each of the terms in the sum is independent of the other terms, this is equivalent to finding each group of parameters w[i] separately by minimizing (f(w[i], x[i]) - y[i])**2.
Convexity
Convexity is a very convenient property for optimization because it ensures that you will have only one minimum in the parameter space. I am not doing a detailed analysis but have reasonable doubts about the convexity of your energy function.
The Lennard-Jones function is the difference of a repulsive and an attractive term, both with negative even exponents of the distance; this alone is very unlikely to be convex.
The sum of multiple local functions centered at different positions has no defined convexity.
Molecular energy, or crystal energy, or protein folding are well known to be non-convex.
A few days ago (on a bike ride) I was thinking about this, how the molecules will be configured in a global minimum energy, and I was wondering if it finds that configuration so rapidly because of quantum tunneling effects.
Non-convex optimization
Non-convex (global) optimization differs from (non-linear) least squares in the sense that, when a local minimum is found, the process doesn't return immediately; it starts making new attempts in different regions of the search space. If the function is smooth you can still take advantage of a gradient-based local optimization method, but the complexity is still NP.
A classic global optimization method is simulated annealing; if you have a chemistry background I think you will have some insights reading about it. Once upon a time, simulated annealing was provided in scipy.optimize.
You will find a few global optimization methods in scipy.optimize. I would encourage you to try Basin hopping, since it was successfully applied to similar problems, as you can read in the references.
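A rough sketch of how basin hopping could be wired up for your problem (untested; I'm assuming your class instance is called model, that it exposes the theoreticalEnergy and experimentalValues methods shown above, and I pack epsilon and sigma into a single parameter vector):
import numpy as np
from scipy.optimize import basinhopping

exp_energy = model.experimentalValues()
nSubstrate = model.nSubstrate

def cost(params):
    # First half of the vector is epsilon, second half is sigma
    epsilon = params[:nSubstrate]
    sigma = params[nSubstrate:]
    residuals = model.theoreticalEnergy(epsilon, sigma) - exp_energy
    return np.sum(residuals ** 2)

x0 = np.concatenate([np.full(nSubstrate, 0.15),   # initial guess for epsilon
                     np.full(nSubstrate, 2.0)])   # initial guess for sigma

result = basinhopping(cost, x0, niter=100,
                      minimizer_kwargs={"method": "L-BFGS-B"})
print(result.x, result.fun)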
I hope this points you in the right direction toward your solution. But be aware that you will probably need to spend some time learning how to use these functions and will need to make some decisions. You will need to find a balance between accuracy, simplicity and efficiency.
If you want a better solution, take the time to derive the gradient of the cost function (you can return two values, f and df, where df is the gradient of f with respect to the decision variables).
I'm computing the first and second derivatives of a signal and then plotting them. I chose the Savitzky-Golay filter as implemented in SciPy (signal module). I'm wondering whether the output needs to be scaled; in the Matlab implementation of the same filter, it is specified that scaling is needed on the output of the filter:
savitzkyGolayFilt(X,N,DN,F) filters the signal X using a
Savitzky-Golay (polynomial) filter. The polynomial order, N, must be
less than the frame size, F, and F must be odd. DN specifies the
differentiation order (DN=0 is smoothing). For a DN higher than zero,
you'll have to scale the output by 1/T^DN to acquire the DNth smoothed
derivative of input X, where T is the sampling interval.
However, I didn't find anything similar in SciPy's documentation. Has anybody tried it and knows whether the output in Python is correct and needs no further scaling? The line of code I'm running for the first derivative is this one: first_deriv = signal.savgol_filter(spectra_signal, 7, 2, deriv=1, delta=3.1966). The spectra_signal is my "y" variable and delta is the spacing of my "x" variable.
Also, I tried to compute the first derivative without using savgol_filter, but using np.diff on the smoothed signal instead (based on the formula derivative = dy/dx): first_deriv_alternative = np.diff(signal.savgol_filter(spectra_signal, 7, 2)) / 3.1966. And the results are not the same.
Working code example:
import numpy as np
from scipy import signal
x =[405.369888, 408.561553, 411.753217, 414.944882, 418.136547, 421.328212, 424.519877, 427.711541, 430.903206]
y =[5.001440644264221191e-01,
4.990128874778747559e-01,
4.994551539421081543e-01,
5.002806782722473145e-01,
5.027571320533752441e-01,
5.053851008415222168e-01,
5.082427263259887695e-01,
5.122825503349304199e-01,
5.167465806007385254e-01]
# Variation of the x variable (constant step)
sampling_step = x[1] - x[0]

# Method 1: using savgol_filter with the delta argument
deriv1_method1 = signal.savgol_filter(y, 5, 2, deriv=1, delta=sampling_step)

# Method 2: using np.diff to compute the derivative of the filtered original data
dy = np.diff(signal.savgol_filter(y, 5, 2))
dx = np.diff(x)
deriv1_method2 = dy / dx

# Method 3: filtering the first derivative of the original data
deriv1_method3 = signal.savgol_filter(np.diff(y) / np.diff(x), 5, 2)
Under the hood, signal.savgol_filter uses signal.savgol_coeffs. If you look at the source code, it says that "The coefficient assigned to y[deriv] scales the result to take into account the order of the derivative and the sample spacing". The results are hence scaled before performing the fitting and the convolve1d. So, by default, the results are already scaled to take into account the order of the derivative.
I think that performing the derivative after applying the Savitzky-Golay filter won't give you the same results, because in that case you are computing the derivative of the already-filtered spectrum, while in the first case you are performing the derivative before the fitting and the scaling.
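A quick way to convince yourself of the built-in scaling (reusing y and sampling_step from your working example; since the derivative coefficients are divided by delta**deriv internally, the two calls below should agree up to floating-point error):
import numpy as np
from scipy import signal

scaled = signal.savgol_filter(y, 5, 2, deriv=1, delta=sampling_step)
manual = signal.savgol_filter(y, 5, 2, deriv=1) / sampling_step
print(np.allclose(scaled, manual))  # expected: True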
I'm currently writing my first multilayer neural net with python 3.7 and numpy, and I'm having trouble implementing softmax (I intend to use my network for classification, so having a working implementation of softmax is pretty crucial). I copied this code off of a different thread:
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
I think I have a basic understanding of the intended function of the softmax function; that is, to take a vector and turn its elements into probabilities so that they sum to 1. Please correct my understanding if I'm wrong. I don't quite understand how this code accomplishes that function, but I found similar code on multiple other threads, so I believe it to be correct. Please confirm.
Unfortunately, in none of these threads could I find a clear implementation of the derivative of the softmax function. I understand it to be more complicated than that of most activation functions, and to require more parameters than just x, but I have no idea how to implement it myself. I'm looking for an explanation of what those other parameters are, as well as for an implementation (or mathematical expression) of the derivative of the softmax function.
Answer for how this code accomplishes that function:
Here, we make use of a concept known as broadcasting.
When you use the function np.exp(x), then, assuming x is a vector, you actually perform an operation similar to what can be accomplished by the following code:
exps = []
for i in x:
    exps.append(np.exp(i))
return exps
The above code is the longer version of what broadcasting does automatically here.
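A quick check of that equivalence (the loop is written out explicitly here just for comparison):
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Applying np.exp to the whole array at once...
vectorized = np.exp(x)

# ...gives the same result as applying it element by element
looped = np.array([np.exp(v) for v in x])

print(np.allclose(vectorized, looped))  # True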
As for the implementation of the derivative, that's a bit more complicated, as you say.
An untested implementation that computes, for every element of X, the derivative of the corresponding softmax output with respect to that element (i.e. the diagonal of the Jacobian):
def softmax_derivative(X):
    # input  : a vector X
    # output : a vector with d softmax(X)[i] / d X[i] for every i
    # List of derivatives
    derivs = []
    # Denominator after differentiation
    denom = np.sum(np.exp(X), axis=0)
    for i, x in enumerate(X):
        # d/dx_i [exp(x_i) / denom] = exp(x_i) * (denom - exp(x_i)) / denom**2
        comm = np.exp(x) / (denom ** 2)
        # Sum of exp of every element except the current one
        factor = 0.0
        for j, other in enumerate(X):
            if j == i:
                continue
            factor += np.exp(other)
        derivs.append(comm * factor)
    return derivs
You can also use broadcasting in the above function, but I think it's clearer in this manner.
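If you need the full Jacobian rather than just the diagonal, a common vectorized formulation (a sketch assuming a 1-D input vector) is:
import numpy as np

def softmax(x):
    # Subtracting the max is a standard trick for numerical stability
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_jacobian(x):
    # J[i, j] = d softmax(x)[i] / d x[j] = s[i] * (delta_ij - s[j])
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

print(softmax_jacobian(np.array([1.0, 2.0, 3.0])))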
I am asked to write an implementation of the gradient descent in python with the signature gradient(f, P0, gamma, epsilon) where f is an unknown and possibly multivariate function, P0 is the starting point for the gradient descent, gamma is the constant step and epsilon the stopping criteria.
What I find tricky is how to evaluate the gradient of f at the point P0 without knowing anything on f. I know there is numpy.gradient but I don't know how to use it in the case where I don't know the dimensions of f. Also, numpy.gradient works with samples of the function, so how to choose the right samples to compute the gradient at a point without any information on the function and the point?
I'm assuming here that "So how can I choose a generic set of samples each time I need to compute the gradient at a given point?" means that the dimension of the function is fixed and can be deduced from your start point.
Consider this a demo, using scipy's approx_fprime, which is an easier-to-use wrapper for numerical differentiation and is also used in scipy's optimizers when a Jacobian is needed but not given.
Of course you can't ignore the parameter epsilon, which can make a difference depending on the data.
(This code also ignores optimize's args parameter, which is usually a good idea to use; I'm relying on the fact that A and b are in scope here, which is surely not best practice.)
import numpy as np
from scipy.optimize import approx_fprime, minimize
np.random.seed(1)
# Synthetic data
A = np.random.random(size=(1000, 20))
noiseless_x = np.random.random(size=20)
b = A.dot(noiseless_x) + np.random.random(size=1000) * 0.01
# Loss function
def fun(x):
return np.linalg.norm(A.dot(x) - b, 2)
# Optimize without any explicit jacobian
x0 = np.zeros(len(noiseless_x))
res = minimize(fun, x0)
print(res.message)
print(res.fun)
# Get numerical-gradient function
eps = np.sqrt(np.finfo(float).eps)
my_gradient = lambda x: approx_fprime(x, fun, eps)
# Optimize with our gradient
res = minimize(fun, x0, jac=my_gradient)
print(res.message)
print(res.fun)
# Eval gradient at some point
print(my_gradient(np.ones(len(noiseless_x))))
Output:
Optimization terminated successfully.
0.09272331925776327
Optimization terminated successfully.
0.09272331925776327
[15.77418041 16.43476772 15.40369129 15.79804516 15.61699104 15.52977276
15.60408688 16.29286766 16.13469887 16.29916573 15.57258797 15.75262356
16.3483305 15.40844536 16.8921814 15.18487358 15.95994091 15.45903492
16.2035532 16.68831635]
Using:
# Get numerical-gradient function with a way too big eps-value
eps = 1e-3
my_gradient = lambda x: approx_fprime(x, fun, eps)
shows that eps is a critical parameter resulting in:
Desired error not necessarily achieved due to precision loss.
0.09323354898565098
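With approx_fprime available, the requested gradient(f, P0, gamma, epsilon) signature could be sketched roughly like this (untested; I added a max_iter guard and a fixed finite-difference step, and the loop stops when the gradient norm drops below epsilon):
import numpy as np
from scipy.optimize import approx_fprime

def gradient(f, P0, gamma, epsilon, max_iter=10000):
    # Finite-difference step for the numerical gradient
    fd_eps = np.sqrt(np.finfo(float).eps)
    P = np.asarray(P0, dtype=float)
    for _ in range(max_iter):
        g = approx_fprime(P, f, fd_eps)
        if np.linalg.norm(g) < epsilon:   # stopping criterion
            break
        P = P - gamma * g                 # constant-step descent
    return P

# Example with the least-squares loss `fun` defined above:
# print(gradient(fun, np.zeros(20), gamma=1e-4, epsilon=1e-6))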
code:
import theano
import theano.tensor as T

a = T.vector()
b = T.vector()
loss = T.sum(a - b)
dy = T.grad(loss, a)
d2y = T.grad(loss, dy)    # this line raises the error below
f = theano.function([a, b], loss)
print(f([.5, .5, .5], [1, 0, 1]))
output:
theano.gradient.DisconnectedInputError: grad method was asked to compute
the gradient with respect to a variable that is not part of the
computational graph of the cost, or is used only by a non-differentiable
operator: Elemwise{second}.0
How is a derivative of the graph not part of the graph? Is this why scan is used to compute the Hessian?
Here:
d2y = T.grad(loss, dy)
you are attempting to compute the gradient of the loss with respect to dy. However, the loss depends only on the values of a and b, not on dy, hence the error. It only makes sense to compute partial derivatives of the loss with respect to variables that actually affect its value.
The easiest way to compute the Hessian in Theano is to use the theano.gradient.hessian convenience function:
d2y = theano.gradient.hessian(loss, a)
See the documentation here for an alternative manual method that uses a combination of theano.grad and theano.scan.
In your example the Hessian will be a 3x3 matrix of zeros, since the partial derivative of the loss w.r.t. a is independent of a (it's just a vector of ones).
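Putting it together, a minimal end-to-end sketch of your example using the convenience function (untested against a specific Theano version):
import theano
import theano.tensor as T

a = T.vector()
b = T.vector()
loss = T.sum(a - b)

dy = T.grad(loss, a)                     # a vector of ones
d2y = theano.gradient.hessian(loss, a)   # a zero matrix, since dy does not depend on a

f = theano.function([a, b], [loss, dy, d2y])
print(f([.5, .5, .5], [1., 0., 1.]))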