Prevent gradient descent from stopping too far from a local minimum

I'm implementing an algorithm in Python to find the nearest minimum of a 2-dimensional function using the gradient descent method. It takes a precision eps as input and counts an iteration as slow when the distance between the starting point and the newly found point of that iteration is less than or equal to eps; it stops after several slow iterations. The code at that stage looked like this:
while(slow_steps <= 4):
    lbd_current = lbd(x_current)
    x_previous = x_current.copy()
    grad = normalized_gradient(f_multi, x_previous)
    x_current = [x_previous[i] + lbd_current * grad[i] for i in range(len(x_previous))]
    x_list.append(x_current.copy())
    iteration += 1
    if(distance(x_previous, x_current) <= eps):
        slow_steps += 1
However, I encountered a problem with the initial version of the algorithm: it frequently got stuck in 'valleys' such as this one, depending on the function.
So far I have attempted to add a second step to traverse valleys: if the algorithm detects that the descent is slow, instead of ending immediately, it takes the points found on the latest and third-to-latest iterations and finds the nearest low point on the line between those, a line that hopefully aligns with the direction of the valley.
if(distance(x_previous, x_current) <= eps * 10 and canyon_steps >= 3):
    x_canyon = x_list[len(x_list) - 2]
    vector_canyon = [x_current[i] - x_canyon[i] for i in range(len(x_canyon))]
    lbd_current = lbd_canyon()
    x_current = [x_canyon[i] + vector_canyon[i] * lbd_current for i in range(len(x_canyon))]
if(distance(x_previous, x_current) > eps):
    canyon_steps = 0
    slow_steps = 0
if(distance(x_previous, x_current) <= eps * 10):
    canyon_steps += 1
This algorithm works for most starting positions that I've tried, but for others, such as this one, it fails if the precision is low and seems to take a very long time to finish otherwise. How can I make the algorithm arrive at a local minimum as reliably as possible?

Related

The initial Gradient in Gradient Descent is abysmal and wrong

I am building in Python using IDLE 3.9 an input optimiser, which optimises a certain input "thetas" for a certain target output "targetRes" versus a calculated output obtained from running a model with input "thetas" in a solver. The way it works is that a model is first defined with a function called FEA(thetas, fem). After running the solver, FEA returns the output.
The chosen optimisation algorithm is Gradient Descent. FEA (output) is taken as the hypothesis function, and the target output is subtracted from it. The result is then squared to give the loss function. The gradient of the loss function is then determined using FDM. The update step then takes place. For now, I am only running the algorithm on thetas[0]. Below is the GD code:
targetRes = -0.1
thetas = [1000., 1., 1., 1., 1.] # input here initial value of unknown vector theta
def LF(thetas):
    return (FEA(thetas, fem) - targetRes) ** 2 / 2
def FDM(thetas, LF):
    fdm = []
    for i, theta in enumerate(thetas):
        h = 0.1
        if i == 0:
            print(h)
        thetas_p_h = []
        for t in thetas:
            thetas_p_h.append(t)
        thetas_p_h[i] += h
        thetas_m_h = []
        for t in thetas:
            thetas_m_h.append(t)
        thetas_m_h[i] -= h
        grad = (LF(thetas_p_h) - LF(thetas_m_h)) / (2 * h)
        fdm.append(grad)
    return fdm
def GD(thetas, LF):
    tol = 0.000001
    alpha = 10000000
    Nmax = 1000
    for n in range(Nmax):
        gradient = FDM(thetas, LF)
        thetas_new = []
        for gradient_item, theta_item in zip(gradient, thetas):
            t_new = theta_item - (alpha * gradient_item)
            thetas_new.append(t_new)
        print(thetas, f'gradient = {gradient}', LF(thetas), thetas_new, LF(thetas_new))
        if tol >= abs(LF(thetas_new) - LF(thetas)):
            print(f"solution converged in {n} iterations, theta = {thetas_new}, LF(theta) = {LF(thetas_new)}, FEA(theta) = {FEA(thetas_new, fem)}")
            return thetas_new
        thetas = thetas_new
    else:
        print(f"reached max iterations, Nmax = {Nmax}")
        return None
GD(thetas, LF)
As you can see the gradient descent algorithm I am using is different to the linear regression type, and that is because there are no features to evaluate, just labels (y). Unfortunately I am not allowed to provide the solver code and most likely not allowed to provide the model code as well, defined in FEA.
In the current example, the calculated initial output is -0.070309. My issues are:
The gradient is very minute at the first update iteration, and it is 2.078369999999961e-06, and the final update iteration gradient value is 1.834250000000102e-08. In fact the first update iteration gradient value is most likely wrong, as I attempted to calculate it by hand and I got a value of somewhere around 0.000005, which is still tiny.
I am using a gigantic learning rate value, because the algorithm is horribly slow without such.
The algorithm converges to target output only with certain input values. There was another case where I had other values of thetas and the algorithm at zeroth iteration produces a loss function of order of 10^7, and at the first iteration it drops right down to zero and converges, which is unrealistic. In both cases I use a large learning rate, and both cases converge to the target output. In some other cases I have other inputs and the algorithm does not converge to the target output. Other cases lead to a negative value of thetas[0] which causes the solver to fail.
Also dismiss the fact that I am not using any external libraries; it is intended. So of course without trying to run the code, any observations? Does anyone see anything obvious that I'm missing here? As I think this could be the issue. Could it be due to the orders of magnitude of the inputs? What are your thoughts? (there are no issues with the solver or model function named FEA, both have been repeatedly verified to work perfectly fine).

I'm studying gradient descent: what does precision mean in this code?

I'm studying gradient descent by myself.
I plan to mention it in my résumé for university admission.
Does precision mean an allowable value of error?
x_old = 0
x_new = 6 # The algorithm starts at x=6
eps = 0.01 # step size
precision = 0.00001
def f_prime(x):
    return 4 * x**3 - 9 * x**2
while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - eps * f_prime(x_old)
print("Local minimum occurs at: " + str(x_new))

It seems that precision is a means to check for convergence: if the last iteration of gradient descent caused only a small change, then stop.
This approach is not very robust. First, a small change in a single iteration is not a strong indication for convergence. It would be better to look for a small change in several consecutive iterations. Second, the process might not converge at all, so some sort of guard against an infinite loop should be used.
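Both suggestions can be sketched as follows. This is only an illustration built on f_prime and the starting values from the question; the names gradient_descent, patience and max_iter are assumptions made for this sketch, not part of the original code.

```python
def f_prime(x):
    # Derivative of f(x) = x**4 - 3*x**3, as in the question.
    return 4 * x**3 - 9 * x**2


def gradient_descent(f_prime, x0, step=0.01, precision=1e-5,
                     patience=3, max_iter=10_000):
    """Return (x, iterations, converged).

    patience and max_iter are hypothetical names chosen for this sketch.
    """
    x = x0
    small_steps = 0  # consecutive iterations whose change was <= precision
    for iteration in range(1, max_iter + 1):
        x_new = x - step * f_prime(x)
        if abs(x_new - x) <= precision:
            small_steps += 1
        else:
            small_steps = 0  # a large step resets the streak
        x = x_new
        if small_steps >= patience:
            return x, iteration, True
    # Guard against an infinite loop: give up after max_iter iterations.
    return x, max_iter, False


x_min, n_iter, converged = gradient_descent(f_prime, x0=6)
print(x_min, n_iter, converged)
```

The function reports whether it actually converged, so the caller can tell a found minimum apart from a run that simply hit the iteration limit.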

Estimate velocity on a spring by iterative approach

The problem:
Consider a system with a mass and a spring as shown in the picture below. The stiffness of the spring and the mass of the object are known. Therefore, if the spring is stretched, the force the spring exerts can be calculated from Hooke's law and the instantaneous acceleration can be estimated from Newton's laws of motion. Integrating the acceleration twice yields the distance the spring would move, and subtracting that from the initial length results in a new position to calculate the acceleration and start the loop again. Therefore, as the acceleration decreases linearly, the speed levels off at a certain value (top right). Everything after that point (the spring compressing and decelerating) is neglected for this case.
My question is how to go about coding that up in Python. So far I have written some pseudocode.
instantaneous_acceleration = lambda x: 5*x/10 # a = kx/m
delta_time = 0.01 #10 milliseconds
a[0] = instantaneous_acceleration(12) #initial acceleration when stretched to 12 m
v[0] = 0 #initial velocity 0 m/s
s[0] = 12 #initial length 12 m
i = 1
while a[i] > 12:
    v[i] = a[i-1]*delta_time + v[i-1] #calculate the next velocity
    s[i] = v[i]*delta_time + s[i-1] #calculate the next position
    a[i] = instantaneous_acceleration (s[i]) #use the position to derive the new acceleration
    i = i + 1
Any help or tips are greatly appreciated.

If you're going to integrate up front - which is a good idea and absolutely the way to go when you can - then you can just write down the equations as functions of t for everything:
x'' = -kx/m
x'' + (k/m)x = 0
r^2 + k/m = 0
r^2 = -(k/m)
r = ±i*sqrt(k/m)
x(t) = C1*e^(i*sqrt(k/m)t) + C2*e^(-i*sqrt(k/m)t)
= A*cos(sqrt(k/m)t + B) (for real-valued x, with real constants A and B)
From initial conditions we know that
x(0) = 12 = A*cos(B)
v(0) = 0 = -sqrt(k/m)*A*sin(B)
The second of these equations is true only if we choose A = 0 or B = 0 or B = Pi.
If A = 0, then the first equation has no solution.
If B = 0, the first equation has solution A = 12.
If B = Pi, the first equation has solution A = -12.
We probably prefer B = 0 and A = 12. This gives
x(t) = 12*cos(sqrt(k/m)t)
v(t) = -12*sqrt(k/m)*sin(sqrt(k/m)t)
a(t) = -12*(k/m)cos(sqrt(k/m)t)
Thus, at any incremental time t[n+1] = t[n] + dt, we can simply calculate the precise position, velocity and acceleration for t[n] without any drift or inaccuracy ever accumulating.
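As a small sketch, the closed-form expressions above can be evaluated directly. The values k = 5 and m = 10 are taken from the 5*x/10 in the question; the function name spring_state is an assumption made for this illustration.

```python
import math


def spring_state(t, k=5.0, m=10.0, x0=12.0):
    """Closed-form state for x'' = -(k/m) x with x(0) = x0 and v(0) = 0."""
    w = math.sqrt(k / m)
    x = x0 * math.cos(w * t)
    v = -x0 * w * math.sin(w * t)
    a = -x0 * w**2 * math.cos(w * t)
    return x, v, a


dt = 0.01
for n in range(3):
    # Each time is evaluated independently, so no error accumulates.
    print(n * dt, spring_state(n * dt))
```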
All that said, if you are interested in how to numerically find x(t) and v(t) and a(t) given an arbitrary ordinary differential equation, the answer is much harder. There are lots of good ways of doing what can be called numerical integration. Euler's method is the easiest:
// initial conditions
t[0] = 0
x[0] = …
x'[0] = …
…
x^(n-1)[0] = …
x^(n)[0] = 0
// iterative step
x^(n)[k+1] = f(x^(n-1)[k], …, x'[k], x[k], t[k])
x^(n-1)[k+1] = x^(n-1)[k] + dt * x^(n)[k]
…
x'[k+1] = x'[k] + dt * x''[k]
x[k+1] = x[k] + dt * x'[k]
t[k+1] = t[k] + dt
The smaller a value of dt you choose, the longer it takes to run for a fixed duration of time, but the more accurate the results you get. This is basically doing a Riemann sum of the function and all its derivatives up to the highest one involved in the ODE.
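Applied to the spring, the iterative step above might look like the following sketch. It assumes k = 5, m = 10 and the corrected sign x'' = -(k/m)x; the function name euler_spring and the choice of dt are illustrative only.

```python
import math


def euler_spring(k=5.0, m=10.0, x0=12.0, v0=0.0, dt=0.001, n_steps=1000):
    """Explicit Euler for x'' = -(k/m) x; returns the lists t, x, v."""
    t, x, v = [0.0], [x0], [v0]
    for _ in range(n_steps):
        a = -(k / m) * x[-1]          # highest derivative, from the ODE
        x_next = x[-1] + dt * v[-1]   # x[k+1] = x[k] + dt * x'[k]
        v_next = v[-1] + dt * a       # x'[k+1] = x'[k] + dt * x''[k]
        t.append(t[-1] + dt)
        x.append(x_next)
        v.append(v_next)
    return t, x, v


t, x, v = euler_spring()
exact = 12.0 * math.cos(math.sqrt(0.5) * t[-1])
print(x[-1], exact)  # numerical and exact position at the final time
```

Rerunning with a larger dt shows the gap between the two printed values growing, which is the accuracy trade-off described above.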
A more accurate version of this, in the spirit of the trapezoidal rule or Simpson's rule, does the same thing but uses a weighted average of the values over the last time quantum (rather than either endpoint's value; the example above uses the beginning of the interval). For a smooth function, such an average is typically closer to the true mean over the interval than either endpoint (and if the function was constant over that interval, it is at least as good).
Probably the best standard numerical integration methods for ODEs (assuming you don't need something like leapfrog methods for greater stability) are the Runge-Kutta methods. An adaptive timestep Runge-Kutta method of sufficient order should usually do the trick and give you accurate answers. Unfortunately, the mathematics to explain the Runge-Kutta methods is probably too advanced and time consuming to cover here, but you can find information on these and other advanced techniques online or in e.g. Numerical Recipes, a series of books on numerical methods which contains lots of very useful code samples.
Even the Runge-Kutta methods work basically by refining the guess at the function's value over the time quantum, though. They just do it in more sophisticated ways which provably reduce the error at each step.
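For illustration, here is a sketch of the classical fourth-order Runge-Kutta step with a fixed timestep (not the adaptive variant mentioned above), applied to the spring with k = 5 and m = 10. The names rk4_step and spring are assumptions made for this example.

```python
import math


def rk4_step(f, t, y, dt):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    # Weighted average of the four slope estimates over the step.
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def spring(t, y, k=5.0, m=10.0):
    # First-order system: y = [x, v], so y' = [v, -(k/m) x].
    x, v = y
    return [v, -(k / m) * x]


dt, n_steps = 0.01, 1000
y = [12.0, 0.0]
for n in range(n_steps):
    y = rk4_step(spring, n * dt, y, dt)

exact = 12.0 * math.cos(math.sqrt(0.5) * n_steps * dt)
print(y[0], exact)
```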

You have a sign error in the force: for a spring or any other oscillation, it should always be opposite to the excitation direction. Correcting this immediately gives an oscillation. However, your loop condition will now never be satisfied, so you have to adapt that as well.
You can immediately increase the order of your method by elevating it from the current symplectic Euler method to Leapfrog-Verlet. You only have to change the interpretation of v[i] to be the velocity at t[i]-dt/2. Then the first update uses the acceleration in the middle at t[i-1] to compute the velocity at t[i-1]+dt/2=t[i]-dt/2 from the velocity at t[i-1]-dt/2 using a midpoint formula. Then in the next line the position update is a similar midpoint formula using the velocity at the middle time between the position times. All you have to change in the code to get this advantage is to set the initial velocity to the one at time t[0]-dt/2 using the Taylor expansion at t[0].
import numpy as np
import matplotlib.pyplot as plt
instantaneous_acceleration = lambda x: -5*x/10 # a = -kx/m
delta_time = 0.01 #10 milliseconds
s0, v0 = 12, 0 #initial length 12 m, initial velocity 0 m/s
N=1000
s = np.zeros(N+1); v = s.copy(); a = s.copy()
a[0] = instantaneous_acceleration(s0) #initial acceleration when stretched to 12 m
v[0] = v0-a[0]*delta_time/2 #velocity at t[0]-dt/2, from the Taylor expansion at t[0]
s[0] = s0
for i in range(N):
    v[i+1] = a[i]*delta_time + v[i] #calculate the next velocity
    s[i+1] = v[i+1]*delta_time + s[i] #calculate the next position
    a[i+1] = instantaneous_acceleration (s[i+1]) #use the position to derive the new acceleration
#produce plots of all these functions
t=np.arange(0,N+1)*delta_time;
fig, ax = plt.subplots(3,1,figsize=(5,3*1.5))
for g, y in zip(ax,(s,v,a)):
    g.plot(t,y); g.grid();
plt.tight_layout(); plt.show();
This is clearly an oscillation, as expected. The exact solution is 12*cos(sqrt(0.5)*t); using it and its derivatives to compute the errors in the numerical solution (remember the leapfrogging of the velocities) gives, via
w=0.5**0.5; dt=delta_time;
fig, ax = plt.subplots(3,1,figsize=(5,3*1.5))
for g, y in zip(ax,(s-12*np.cos(w*t),v+12*w*np.sin(w*(t-dt/2)),a+12*w**2*np.cos(w*t))):
    g.plot(t,y); g.grid();
plt.tight_layout(); plt.show();
the plot below, showing errors in the expected size delta_time**2.
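The delta_time**2 scaling can also be checked without a plot: halving the timestep should reduce the position error by roughly a factor of four. The sketch below repeats the same leapfrog update in plain Python; the function name leapfrog_max_error is an assumption, and the exact solution used for comparison is only valid for v0 = 0.

```python
import math


def leapfrog_max_error(dt, n_steps, k=5.0, m=10.0, s0=12.0):
    """Largest position error of the leapfrog scheme against s0*cos(w*t)."""
    w = math.sqrt(k / m)
    s = s0
    a = -(k / m) * s
    v = 0.0 - a * dt / 2      # velocity at t[0] - dt/2 (initial velocity 0)
    max_err = 0.0
    for i in range(1, n_steps + 1):
        v += a * dt           # velocity at t[i] - dt/2
        s += v * dt           # position at t[i]
        a = -(k / m) * s
        max_err = max(max_err, abs(s - s0 * math.cos(w * i * dt)))
    return max_err


# Same final time (t = 10) with two step sizes.
err_coarse = leapfrog_max_error(0.01, 1000)
err_fine = leapfrog_max_error(0.005, 2000)
print(err_coarse, err_fine, err_coarse / err_fine)
```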

An analytical approach is the simplest way to obtain the velocity of a simple system that obeys Hooke's law.
However, if you desire a physically accurate numerical/iterative approach, I strongly advise against methods like standard Euler or Runge-Kutta methods (suggested by Patrick87). [Correction: the OP's method is a symplectic 1st order method, if the sign of the acceleration term is corrected.]
You probably want to use a Hamiltonian approach and a suitable symplectic integrator such as the second order leapfrog (suggested also by Patrick87).
For Hooke's law, you can express the Hamiltonian as H = T(p) + V(q), where p is momentum (associated with velocity) and q is position (how far the spring is from equilibrium).
You have the kinetic energy T and potential energy V
T(p) = 0.5*p^2/m
V(q) = 0.5*k*q^2
You simply need the derivatives of these two expressions to simulate the system
dT/dp = p/m
dV/dq = k*q
I provided a detailed example (although for another 2-dimensional system), including an implementation of 1st and a 4th order method here:
https://zymplectic.com/case3.html under method 0 and method 1
These are symplectic integrators, which have an energy-preserving property that means you can perform long simulation without dissipative errors.
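As a rough sketch of how the two derivatives drive a simulation, here is a first-order symplectic Euler integrator for the spring. It is not the code from the linked page; the values k = 5 and m = 10 come from the question, and all function names are assumptions made for this illustration.

```python
def dT_dp(p, m=10.0):
    return p / m


def dV_dq(q, k=5.0):
    return k * q


def energy(q, p, k=5.0, m=10.0):
    return 0.5 * p**2 / m + 0.5 * k * q**2


def symplectic_euler(q, p, dt, n_steps):
    """Return final (q, p) and the largest relative energy deviation seen."""
    e0 = energy(q, p)
    max_dev = 0.0
    for _ in range(n_steps):
        p -= dt * dV_dq(q)   # dp/dt = -dV/dq
        q += dt * dT_dp(p)   # dq/dt = dT/dp, using the updated momentum
        max_dev = max(max_dev, abs(energy(q, p) - e0) / e0)
    return q, p, max_dev


q, p, max_dev = symplectic_euler(q=12.0, p=0.0, dt=0.01, n_steps=100_000)
print(max_dev)
```

The energy is not conserved exactly, but its deviation stays bounded over the whole run instead of drifting, which is the property described above.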

High frequency noise at solving differential equation

I'm trying to simulate a simple diffusion based on Fick's 2nd law.
from pylab import *
import numpy as np
gridpoints = 128
def profile(x):
    range = 2.
    straggle = .1576
    dose = 1
    return dose/(sqrt(2*pi)*straggle)*exp(-(x-range)**2/2/straggle**2)
x = linspace(0,4,gridpoints)
nx = profile(x)
dx = x[1] - x[0] # use np.diff(x) if x is not uniform
dxdx = dx**2
figure(figsize=(12,8))
plot(x,nx)
timestep = 0.5
steps = 21
diffusion_coefficient = 0.002
for i in range(steps):
    coefficients = [-1.785714e-3, 2.539683e-2, -0.2e0, 1.6e0,
                    -2.847222e0,
                    1.6e0, -0.2e0, 2.539683e-2, -1.785714e-3]
    ccf = (np.convolve(nx, coefficients) / dxdx)[4:-4] # second order derivative
    nx = timestep*diffusion_coefficient*ccf + nx
    plot(x,nx)
For the first few time steps everything looks fine, but then I start to get high frequency noise, due to a build-up of numerical errors which are amplified through the second derivative. Since it seems to be hard to increase the float precision, I'm hoping that there is something else that I can do to suppress this. I already increased the number of points that are being used to construct the 2nd derivative.

I don't have the time to study your solution in detail, but it seems that you are solving the partial differential equation with a forward Euler scheme. This is pretty easy to implement, as you show, but it can become numerically unstable if your timestep is too large for the grid spacing. With this scheme, your only options are to reduce the timestep or to coarsen the spatial grid (increase dx).
The easiest way to explain this is for the 1-D case: assume your concentration is a function of spatial coordinate x and timestep i. If you do all the math (write down your equations, substitute the partial derivatives with finite differences, should be pretty easy), you will probably get something like this:
C(x, i+1) = [1 - 2 * k] * C(x, i) + k * [C(x - 1, i) + C(x + 1, i)]
so the concentration of a point on the next step depends on its previous value and the ones of its two neighbors. It is not too hard to see that when k = 0.5, every point gets replaced by the average of its two neighbors, so a concentration profile of [...,0,1,0,1,0,...] will become [...,1,0,1,0,1,...] on the next step. If k > 0.5, such a profile will blow up exponentially. You calculate your second order derivative with a longer convolution (I effectively use [1,-2,1]), but I guess that does not change anything for the instability problem.
I don't know about normal diffusion, but based on experience with thermal diffusion, I would guess that k scales with dt * diffusion_coeff / dx^2. You thus have to choose your timestep small enough so that your simulation does not become unstable. To make the simulation stable, but still as fast as possible, choose your parameters so that k is a bit smaller than 0.5. Something similar can be derived for 2-D and 3-D cases. The easiest way to achieve this is to increase dx, since your total calculation time will scale with 1/dx^3 for a 1-D problem, 1/dx^4 for 2-D problems, and even 1/dx^5 for 3-D problems.
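A small sketch of this stability limit, assuming k = diffusion_coeff * dt / dx^2 exactly and the three-point [1,-2,1] stencil on a periodic grid (the longer stencil in the question may have a stricter limit). The function names are assumptions made for this illustration.

```python
def diffusion_number(dt, diffusion_coefficient, dx):
    """k = D * dt / dx**2, the factor in the update formula above."""
    return diffusion_coefficient * dt / dx**2


def ftcs_step(c, k):
    """One forward-Euler step with the [1, -2, 1] stencil, periodic grid."""
    n = len(c)
    return [(1 - 2 * k) * c[i] + k * (c[i - 1] + c[(i + 1) % n])
            for i in range(n)]


def max_after(k, steps=50):
    c = [0.0, 1.0] * 8   # the alternating profile [..., 0, 1, 0, 1, ...]
    for _ in range(steps):
        c = ftcs_step(c, k)
    return max(abs(value) for value in c)


# Parameters from the question: 128 points on [0, 4], dt = 0.5, D = 0.002.
dx = 4 / 127
k_question = diffusion_number(0.5, 0.002, dx)
dt_limit = 0.5 * dx**2 / 0.002   # timestep at which k reaches 0.5
print(k_question, dt_limit)
print(max_after(0.4), max_after(0.6))   # bounded versus blowing up
```

With the question's parameters k comes out at roughly 1, well above 0.5, which is consistent with the noise appearing after a few steps.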
There are better methods to solve diffusion equations. I believe that Crank-Nicolson is at least standard for solving heat equations (which is also a diffusion problem). The 'problem' is that this is an implicit method, which means that you have to solve a set of equations to calculate your 'concentration' at the next timestep, which is a bit of a pain to implement. But this method is guaranteed to be numerically stable, even for big timesteps.
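To give an idea of what the implicit step involves, here is a minimal 1-D Crank-Nicolson sketch in plain Python. It assumes the concentration is held at zero on both boundaries and uses the same k as above; the tridiagonal solver and all names are written for this illustration only.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored."""
    n = len(rhs)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def crank_nicolson_step(c, k):
    """Advance one timestep; c[0] and c[-1] are assumed to stay at zero."""
    n = len(c) - 2   # number of interior points
    lower = [-k / 2] * n
    diag = [1 + k] * n
    upper = [-k / 2] * n
    # Explicit half of the scheme goes on the right-hand side.
    rhs = [k / 2 * c[i - 1] + (1 - k) * c[i] + k / 2 * c[i + 1]
           for i in range(1, n + 1)]
    return [0.0] + solve_tridiagonal(lower, diag, upper, rhs) + [0.0]


c = [0.0] + [0.0, 1.0] * 8 + [0.0]
norm_before = sum(value * value for value in c) ** 0.5
for _ in range(20):
    c = crank_nicolson_step(c, k=5.0)   # far above the explicit limit of 0.5
norm_after = sum(value * value for value in c) ** 0.5
print(norm_before, norm_after)
```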

Fitting arbitrary gaussian functions, massive memory consumption in python

I'm trying to (in python) fit a series of an arbitrary number of gaussian functions (determined by a simple algorithm still being improved) to a data set. For my current sample data set, I have 174 gaussian functions. I have a procedure for doing the fit, but it's basically complicated guess-and-check, and consumes all 4GB of memory available.
Is there any way to accomplish this using something in scipy or numpy?
Here is what I'm trying to use, where wavelength[] is the list of x-coordinates, and cflux[] is the list of y-coordinates:
#Pick a gaussian
for repeat in range(0,2):
for f in range(0,len(centroid)):
#Iterate over every other gaussian
for i in range(0,len(centroid)):
if i!= f:
#For every wavelength,
for w in wavelength:
#Append the value of each to an list, called others
others.append(height[i]*math.exp(-(w-centroid[i])**2/(2*width[i]**2)))
#Optimize the centroid of the current gaussian
prev = centroid[f]
best = centroid[f]
#Pick an order of magnitude
for p in range (int(round(math.log10(centroid[i]))-3-repeat),int(round(math.log10(centroid[i])))-6-repeat,-1):
#Pick a value of that order of magnitude
for m in range (-5,9):
#Change the value of the current item
centroid[f] = prev + m * 10 **(p)
#Increment over all wavelengths, make a list of the new values
variancy = 0
residual = 0
test = []
#Increment across every wavelength and evaluate if this change gets R^2 any larger
for k in range(0,len(wavelength)):
test.append(height[i]*math.exp(-(wavelength[k]-centroid[f])**2/(2*width[i]**2)))
residual += (test[k]+others[k]-cflux[k])**2
variancy += (test[k]+others[k]-avgcflux)**2
rsquare = 1-(residual/variancy)
#Check the R^2 value for this new fit
if rsquare > bestr:
bestr = rsquare
best = centroid[f]
centroid[f] = best
#Optimize the height of the current gaussian
prev = height[f]
best = height[f]
#Pick an order of magnitude
for p in range (int(round(math.log10(height[i]))-repeat),int(round(math.log10(height[i])))-3-repeat,-1):
#Pick a value of that order of magnitude
for m in range (-5,9):
#Change the value of the current item
height[f] = prev + m * 10 **(p)
#Increment over all wavelengths, make a list of the new values
variancy = 0
residual = 0
test = []
#Increment across every wavelength and evaluate if this change gets R^2 any larger
for k in range(0,len(wavelength)):
test.append(height[f]*math.exp(-(wavelength[k]-centroid[i])**2/(2*width[i]**2)))
residual += (test[k]+others[k]-cflux[k])**2
variancy += (test[k]+others[k]-avgcflux)**2
rsquare = 1-(residual/variancy)
#Check the R^2 value for this new fit
if rsquare > bestr:
bestr = rsquare
best = height[f]
height[f] = best
#Optimize the width of the current gaussian
prev = width[f]
best = width[f]
#Pick an order of magnitude
for p in range (int(round(math.log10(width[i]))-repeat),int(round(math.log10(width[i])))-3-repeat,-1):
#Pick a value of that order of magnitude
for m in range (-5,9):
if prev + m * 10**(p) == 0:
m+=1
#Change the value of the current item
width[f] = prev + m * 10 **(p)
#Increment over all wavelengths, make a list of the new values
variancy = 0
residual = 0
test = []
#Increment across every wavelength and evaluate if this change gets R^2 any larger
for k in range(0,len(wavelength)):
test.append(height[i]*math.exp(-(wavelength[k]-centroid[i])**2/(2*width[f]**2)))
residual += (test[k]+others[k]-cflux[k])**2
variancy += (test[k]+others[k]-avgcflux)**2
rsquare = 1-(residual/variancy)
#Check the R^2 value for this new fit
if rsquare > bestr:
bestr = rsquare
best = width[f]
width[f] = best
count += 1
#print '{} of {} peaks optimized, iteration {} of {}'.format(f+1,len(centroid),repeat+1,2)
complete = round(100*(count/(float(len(centroid))*2)),2)
print '{}% completed'.format(complete)
print 'New R^2 = {}'.format(bestr)

Yes, it can likely be done better (easier) using scipy. But firstly, refactor your code into smaller functions; it just makes it a lot easier to read and understand what's going on.
As for the memory consumption: you're probably overextending a list far too much somewhere (others is a candidate: I never see it cleared (or initialized!), while it gets filled in a quadruple loop). That, or your data is simply that large (in which case you really should be using numpy arrays, just to speed up things). I can't tell, because you're introducing various variables without giving any idea of the size (how big is wavelengths? How large does others get? What and where are all the initializations of your data arrays?)
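One way to keep others from growing, sketched under the assumption that others[k] is meant to hold the summed contribution of all the other Gaussians at wavelength[k] (which the use of test[k]+others[k] suggests). The function names are hypothetical and the list names follow the question.

```python
import math


def gaussian(w, centroid, height, width):
    return height * math.exp(-(w - centroid) ** 2 / (2 * width ** 2))


def sum_of_other_gaussians(f, wavelength, centroid, height, width):
    """Sum of every Gaussian except index f, one value per wavelength."""
    # A fresh, fixed-size list on every call, so it can never accumulate.
    others = [0.0] * len(wavelength)
    for i in range(len(centroid)):
        if i == f:
            continue
        for j, w in enumerate(wavelength):
            others[j] += gaussian(w, centroid[i], height[i], width[i])
    return others


# Tiny made-up example: two Gaussians sampled at three wavelengths.
wavelength = [0.0, 1.0, 2.0]
centroid, height, width = [0.0, 2.0], [1.0, 2.0], [1.0, 1.0]
others = sum_of_other_gaussians(0, wavelength, centroid, height, width)
print(others)
```

Calling this once per Gaussian being optimized keeps memory use proportional to the number of wavelengths rather than to the number of loop iterations.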
Also, fitting 174 Gaussians is just a bit crazy; either look into another way of determining whatever you want to get out of your data, or split things up. From the wavelengths variable, it appears you're trying to fit lines in a high resolution spectrum; perhaps isolating most of the lines and fitting those isolated groups separately is better. If they all overlap, I doubt any normal fitting technique is going to help you.
Lastly, perhaps a package like pandas can help (e.g., the computation subpackage).
Perhaps very lastly, since I see a lot that can be improved in the code. At some point codereview may also be useful. Though for now I guess your memory usage is the most problematic part.
