scipy.optimize.minimize chi squared python

So I am doing this assignment where I am supposed to minimize the chi-squared function. I saw someone doing this on the internet, so I just copied it:
Multiple variables in SciPy's optimize.minimize
I made a chi-squared function, which is a function of 3 variables (x, y, sigma), where sigma is a random Gaussian fluctuation random.gauss(0, sigma). I did not print that code here because at first sight it might be confusing (I used a lot of recursion), but I can assure you that this function is correct.
Now this code just makes a list of the calculated minima (which are different every time because of the random Gaussian fluctuation). But here comes the main problem: if I did my calculation correctly, we should get a list with a mean of 2 (since I have 2 degrees of freedom, as you can see in this link: https://en.wikipedia.org/wiki/Chi-squared_test).
import scipy.optimize

def Chi2(pos):
    return Chi(pos[0], pos[1], 1)

x_list = []
y_list = []
chi_list = []

for i in range(1000):
    result = scipy.optimize.minimize(Chi2, [5, 5]).x
    x_list.append(result[0])
    y_list.append(result[1])
    chi_list.append(Chi2(result))
But when I use this code I get a list with a mean of 4; however, if I add the method "Powell" I get a mean of 9!
So my main question is: how is it possible that these means are so different, and how do I know which method to use to get the best optimization?
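For reference, here is a minimal sketch (not from the original post) of how one could pass the method argument explicitly and compare solvers on the same objective; it reuses Chi2 from the code above and reads result.fun, the objective value at the reported minimum:

import numpy as np
import scipy.optimize

# Compare a few solvers on the same objective; Chi2 is the function defined above.
for method in ["Nelder-Mead", "Powell", "BFGS"]:
    chi_values = []
    for _ in range(1000):
        result = scipy.optimize.minimize(Chi2, [5, 5], method=method)
        chi_values.append(result.fun)  # objective value at the reported minimum
    print(method, np.mean(chi_values))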
Because I think the error might be in my chi-square function, I will show it as well. The story behind this assignment is that we need to find the position of a mobile device, and we have routers at the positions (0,0), (20,0), (0,20) and (20,20). We used a lot of recursion, and the graph of the chi-squared looked fine (it has a minimum at (5,5)).
import random
import numpy as np

# c (speed of light) and f (signal frequency) are assumed to be defined elsewhere in the original code.

def perfectsignal(x_m, y_m, x_r, y_r):
    return 20 * np.log10(c / (4 * np.pi * f)) - 10 * np.log((x_m - x_r)**2 + (y_m - y_r)**2 + 2**2)

def signal(x_m, y_m, x_r, y_r, sigma):
    return perfectsignal(x_m, y_m, x_r, y_r) + random.gauss(0, sigma)

def res(x_m, y_m, x_r, y_r, sigma, sigma2):
    return (signal(x_m, y_m, x_r, y_r, sigma) - perfectsignal(x_m, y_m, x_r, y_r)) / float(sigma2)

def Chi(x, y, sigma):
    return (res(x, y, 0, 0, sigma, 1)**2 + res(x, y, 20, 0, sigma, 1)**2
            + res(x, y, 0, 20, sigma, 1)**2 + res(x, y, 20, 20, sigma, 1)**2)
Kees

Related

Python optimizing a calculation with scipy.integrate.quad (takes very long)

I'm currently writing a program in Python for calculating the total spectral emissivity (infrared waves) of any given material at different temperatures (200 K - 500 K), based on measurement data obtained by measuring the directional-hemispherical emissivity of the material at many different wavelengths using an IR spectroscope. The calculation is done by integrating the measured intensity over all wavelengths, using Planck's law as a weighting function (all of this doesn't really matter for my question itself; I just want to explain the background so that the code is easier to understand). This is my code:
from scipy import integrate
from scipy.interpolate import interp1d
import numpy as np
import math as m

def planck_blackbody(lambda_, T):  # wavelength, temperature
    h = 6.6260755e-34
    c = 2.99792458e+8
    k = 1.380658e-23
    try:
        a = 2.0 * h * (c ** 2)
        b = h * c / (lambda_ * k * T)
        intensity = a / ((lambda_ ** 5) * (m.exp(b) - 1.0))
        return float(intensity)
    except OverflowError:  # for lower temperatures exp(b) overflows; intensity is effectively zero
        return 0.0

def spectral_emissivity(emifilename, t, lambda_1, lambda_2):
    results = []
    with open(emifilename, 'r') as emifile:
        emilines = emifile.readlines()
        try:
            w = [float(x.split('\t')[0].strip('\n')) * 1e-6 for x in emilines]
            e = [float(x.split('\t')[1].strip('\n')) for x in emilines]
        except ValueError:
            pass
    w = np.asarray(w)  # wavelength
    e = np.asarray(e)  # measured emissivity
    def part_1(lambda_, T):
        E = interp1d(w, e, fill_value='extrapolate')(lambda_)
        return E * planck_blackbody(lambda_, T)
    def E_complete(T):
        E_complete_part_1 = integrate.quad(part_1, lambda_1, lambda_2, args=T, limit=50)
        E_complete_part_2 = integrate.quad(planck_blackbody, lambda_1, lambda_2, args=T, limit=50)
        return E_complete_part_1[0] / E_complete_part_2[0]
    for T in t:
        results.append([T, E_complete(T)])
    with open("{}.plk".format(emifilename[:-4]), 'w') as resultfile:
        for item in results:
            resultfile.write("{}\t{}\n".format(item[0], item[1]))

t = np.arange(200, 501, 1)
spectral_emissivity(r'C:\test.dat', t, 1.4e-6, 35e-6)
The measured intensity is stored in a text file with two columns, the first being the wavelength of the infrared waves and the second being the directional-hemispherical emissivity of the measured material at that wavelength.
When I run this code, while it produces the right results, I still encounter two problems:
1. I get a warning from scipy.integrate.quad:
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
Can someone explain to me what exactly this means? I understand that integrate.quad is a numerical iteration method and that my functions somehow seem to require more than 50 iterations, but is there a way around this? I tried increasing the limit, but even with 200 I still get this warning... It's especially weird given that the integrands are pretty straightforward functions...
2. This is closely connected to the first problem: this program takes ages (about 5 minutes!) to finish a single file, but I need to process many files every hour. cProfile reveals that 98% of this time is spent inside the integration function. A MathCad program doing the exact same thing and producing the same outputs takes only a few seconds to finish. Even though I spent the last week searching for a solution, I simply can't manage to speed this program up, and nobody else on Stack Overflow or elsewhere seems to have comparable timing problems with integrate.quad.
So, finally, my question: is there any obvious way to optimize this code so that it runs faster (apart from compiling it into C or anything like that)? I tried reducing all floats to 6 digits (I can't go any lower in accuracy), but that didn't change anything.
Update: looking into it some more, I figured out that most of the time wasn't actually consumed by the integration itself, but by the CubicSpline operation that I used to interpolate my data. I tried different methods, and CubicSpline seemed to be the only one that worked for some reason (even though my data is monotonically increasing, I got errors from every other method I tried, saying that some values were either above or below the interpolation range). That is, until I found out about extrapolation with scipy.interpolate.interp1d and fill_value = 'extrapolate'. This did the trick for me, enabling me to use the far less expensive interp1d method and effectively reducing the runtime of my program from 280 to 49 seconds (I also added list comprehensions for w and e). While this is a big improvement, I still wonder why my program takes nearly a minute to calculate some integrals... and I still get the above-mentioned IntegrationWarning. So any advice is highly appreciated!
(By the way, since I am pretty new to Python, I'm happy about any tips or critique I can get!)
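For what it's worth, one thing that stands out in the posted code is that part_1 rebuilds the interp1d object on every single call that quad makes. A sketch (not from the original post) of building the interpolant once, right after w and e are set inside spectral_emissivity, and letting part_1 reuse it:

    emissivity_interp = interp1d(w, e, fill_value='extrapolate')  # built once

    def part_1(lambda_, T):
        # reuse the prebuilt interpolant instead of constructing it here
        return emissivity_interp(lambda_) * planck_blackbody(lambda_, T)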

Getting coefficient of term in sympy

I need to find the coefficient of a term in a rather long, nasty expansion. I have a polynomial, say f(x) = (x+x^2)/2 and then a function that is defined recursively: g_k(x,y) = y*f(g_{k-1}(x,y)) with g_0(x,y)=yx.
I want to know, say, the coefficient of x^2y^4 in g_10(x,y)
I've coded this up as
import sympy

x, y = sympy.symbols('x y')

def f(x):
    return (x + x**2) / 2

def g(x, y, k):
    if k == 0:
        return y*x
    else:
        return y*f(g(x, y, k-1))

fxn = g(x, y, 2)
fxn.expand().coeff(x**2).coeff(y**4)
> 1/4
So far so good.
But now I want to find a coefficient for k = 10. Now fxn = g(x,y,10) and then fxn.expand() is very slow. Obviously there are a lot of steps going on, so it's not a surprise. But my knowledge of sympy is rudimentary - I've only started using it specifically because I need to be able to find these coefficients. I could imagine that there may be a way to get sympy to recognize that everything is a polynomial and so it can more quickly find a particular coefficient, but I haven't been able to find examples doing that.
Is there another approach through sympy to get this coefficient, or anything I can do to speed it up?
I assume you are only interested in the coefficients given and not the whole polynomial g(x,y,10). So you can redefine your function g to get rid of higher orders in every step of the recursion. This will significantly speed up your calculation.
def g(x, y, k):
    if k == 0:
        return y*x
    else:
        temp = y*f(g(x, y, k-1)) + sympy.O(y**5) + sympy.O(x**3)
        return temp.expand().removeO()
It works as follows: first, everything of order O(y**5) or O(x**3) (and higher) is grouped and then discarded. Keep in mind that you lose a lot of information!
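For example (a usage sketch added here, not part of the original answer), with the truncated g the k = 10 coefficient from the question can be read off the same way as before:

fxn = g(x, y, 10)   # fast now, since every recursion step is truncated
fxn.expand().coeff(x**2).coeff(y**4)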
Also have a look here: Sympy: Drop higher order terms in polynomial

Finding an abstraction for repetitive code: Bootstrap analysis

Intro
There is a pattern that I use all the time in my Python code which analyzes
numerical data. All implementations seem overly redundant or very cumbersome or
just do not play nicely with NumPy functions. I'd like to find a better way to
abstract this pattern.
The Problem / Current State
A method of statistical error propagation is the bootstrap method. It works by running the same analysis many times with slightly different inputs and looking at the distribution of the final results.
To compute the actual value of ams_phys, I have the following equation:
ams_phys = (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
All the values that go into that equation have a statistical error associated
with it. These values are also computed from other equations. For instance
amk_phys is computed from this equation, where both numbers also have
uncertainties:
amk_phys_dist = mk_phys / a_inv
The value of mk_phys is given as (494.2 ± 0.3) in a paper. What I now do is
parametric bootstrap and generate R samples from a Gaussian distribution
with mean 494.2 and standard deviation 0.3. This is what I store in
mk_phys_dist:
mk_phys_dist = bootstrap.make_dist(494.2, 0.3, R)
The same is done for a_inv, which is also quoted with an error in the literature. The above equation is then converted into a list comprehension to yield a new distribution:
amk_phys_dist = [mk_phys / a_inv
                 for a_inv, mk_phys in zip(a_inv_dist, mk_phys_dist)]
The first equation is then also converted into a list comprehension:
ams_phys_dist = [
    (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
    for ampi_phys, amk_phys, aB, amcr
    in zip(ampi_phys_dist, amk_phys_dist, aB_dist, amcr_dist)]
To get the end result in terms of (Value ± Error), I then take the average and
standard deviation of this distribution of numbers:
ams_phys_val, ams_phys_avg, ams_phys_err \
= bootstrap.average_and_std_arrays(ams_phys_dist)
The actual value is supposed to be computed with the actual input values, not the mean of this bootstrap distribution. Before, I had the code duplicated for that; now I keep the original value at the 0th position in the _dist arrays. The arrays therefore contain 1 + R elements, and the bootstrap.average_and_std_arrays function separates that element.
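For context, a hypothetical sketch of what the two helpers described above might look like (the real bootstrap module is the author's own; the signatures are only inferred from the text):

import numpy as np

def make_dist(value, error, R):
    # 0th element is the original central value, followed by R Gaussian samples
    return np.concatenate([[value], np.random.normal(value, error, R)])

def average_and_std_arrays(dist):
    dist = np.asarray(dist)
    val = dist[0]                    # central value (0th position)
    avg = np.mean(dist[1:], axis=0)  # bootstrap mean
    err = np.std(dist[1:], axis=0)   # bootstrap error estimate
    return val, avg, err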
Lines like the average_and_std_arrays call above occur for every number that I might want to quote in my writing. I got annoyed by all the typing and created a snippet for it:
$1_val, $1_avg, $1_err = bootstrap.average_and_std_arrays($1_dist)
The need for the snippet strongly told me that I need to do some refactoring.
Also the list comprehensions are always of the following pattern:
foo_dist = [ ... bar ...
for bar in bar_dist]
It feels bad to write bar three times there.
The Class Approach
I have tried to turn those _dist things into a Boot class such that I would not write ampi_dist and ampi_val but could just use ampi.val, without having to explicitly call the average_and_std_arrays function and type a bunch of names for it.
import numpy as np

class Boot(object):
    def __init__(self, dist):
        self.dist = dist

    def __str__(self):
        return str(self.dist)

    @property
    def cen(self):
        return self.dist[0]

    @property
    def val(self):
        x = np.array(self.dist)
        return np.mean(x[1:], axis=0)

    @property
    def err(self):
        x = np.array(self.dist)
        return np.std(x[1:], axis=0)
However, this still does not solve the problem of the list comprehensions. I
fear that I still have to repeat myself there three times. I could make the
Boot object inherit from list, such that I could at least write it like
this (without the _dist):
bar = Boot([... foo ... for foo in foo])
Magic Approach
Ideally all those list comprehensions would be gone such that I could just
write
bar = ... foo ...
where the dots mean some non-trivial operation. Those can be simple arithmetic
as above, but that could also be a function call to something that does not
support being called with multiple values (like NumPy function do support).
For instance the scipy.optimize.curve_fit function needs to be called a bunch of times:
popt_dist = [op.curve_fit(linear, mpi, diff)[0]
             for mpi, diff in zip(mpi_dist, diff_dist)]
One would have to write a wrapper for that, because it does not automatically loop over lists of arrays; a sketch of such a wrapper is shown right below.
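As a rough illustration (a sketch, not something from the original post), a small helper that applies an arbitrary function sample-by-sample across several Boot objects from the class above could look like this:

def boot_map(func, *boots):
    # apply func to the i-th sample of every Boot and wrap the results in a new Boot
    return Boot([func(*samples) for samples in zip(*(b.dist for b in boots))])

# Usage sketch with the curve_fit example above (linear, mpi_boot and diff_boot
# are assumed to be defined elsewhere):
# popt_boot = boot_map(lambda mpi, diff: op.curve_fit(linear, mpi, diff)[0],
#                      mpi_boot, diff_boot)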
Question
Do you see a way to abstract this process of running every transformation with
1 + R sets of data? I would like to get rid of those patterns and the huge
number of variables in each namespace (_dist, _val, _avg, ...), as this
makes passing them to functions rather tedious.
Still I need to have a lot of freedom in the ... foo ... part where I need to
call arbitrary functions.

Exponentially distributed random generator (log function) in python?

I really need help as I am stuck at the beginning of the code.
I am asked to create a function to investigate the exponential distribution on a histogram. The function is x = −log(1−y)/λ. λ is a constant which I referred to as lamdr in the code and simply gave the value 10. I gave N (the number of random numbers) the value 10 and ran the code, yet the results and the generated random numbers gave me totally different results; below you can find the code. I don't know what went wrong, I hope you guys can help me! (I use Python 2.)
import random
import math

N = raw_input('How many random numbers you request?: ')
N = int(N)
lamdr = raw_input('Enter a value:')
lamdr = int(lamdr)

def exprand(lamdr):
    y = []
    for i in range(N):
        y.append(random.uniform(0, 1))
    return y

y = exprand(lamdr)
print 'Randomly generated numbers:', (y)

x = []
for w in y:
    x.append((math.log((1 - w) / lamdr)) * -1)
print 'Results:', x
After viewing the code you provided, it looks like you have the pieces you need but you're not putting them together.
You were asked to write function exprand(lambdr) using the specified formula. Python already provides a function called random.expovariate(lambd) for generating exponentials, but what the heck, we can still make our own. Your formula requires a "random" value for y which has a uniform distribution between zero and one. The documentation for the random module tells us that random.random() will give us a uniform(0,1) distribution. So all we have to do is replace y in the formula with that function call, and we're in business:
def exprand(lambdr):
    return -math.log(1.0 - random.random()) / lambdr
An historical note: Mathematically, if y has a uniform(0,1) distribution, then so does 1-y. Implementations of the algorithm dating back to the 1950's would often leverage this fact to simplify the calculation to -math.log(random.random()) / lambdr. Mathematically this gives distributionally correct results since P{X = c} = 0 for any continuous random variable X and constant c, but computationally it will blow up in Python for the 1 in 2^64 occurrence where you get a zero from random.random(). One historical basis for doing this was that when computers were many orders of magnitude slower than now, ditching the one additional arithmetic operation was considered worth the minuscule risk. Another was that Prime Modulus Multiplicative PRNGs, which were popular at the time, never yield a zero. These days it's primarily of historical interest, and an interesting example of where math and computing sometimes diverge.
Back to the problem at hand. Now you just have to call that function N times and store the results somewhere. Likely candidates to do so are loops or list comprehensions. Here's an example of the latter:
abuncha_exponentials = [exprand(0.2) for _ in range(5)]
That will create a list of 5 exponentials with λ=0.2. Replace 0.2 and 5 with suitable values provided by the user, and you're in business. Print the list, make a histogram, use it as input to something else...
Replacing exprand with random.expovariate in the list comprehension should produce equivalent results using Python's built-in exponential generator. That's the beauty of functions as an abstraction: once somebody writes them, you can just use them to your heart's content.
Note that because of the use of randomness, this will give different results every time you run it unless you "seed" the random generator to the same value each time.
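As a small illustration of that last point (a sketch added here, not part of the original answer; it assumes matplotlib is available for the histogram the assignment asks about):

import random
import matplotlib.pyplot as plt

random.seed(12345)                              # same seed -> same sequence on every run
samples = [exprand(0.2) for _ in range(10000)]  # exprand as defined above

plt.hist(samples, bins=50)                      # should show the exponential decay shape
plt.xlabel('x')
plt.ylabel('count')
plt.show()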
What @pjs wrote is true up to a point. While the statement "mathematically, if y has a uniform(0,1) distribution, then so does 1-y" appears to be correct, the proposal to replace the code with -math.log(random.random()) / lambdr is just wrong. Why? Because Python's random module provides U(0,1) in the range [0,1) (as mentioned here), which makes such a replacement non-equivalent.
In more layman's terms: if your U(0,1) actually generates numbers in the [0,1) range, then the code
import random
import math

def exprand(lambdr):
    return -math.log(1.0 - random.random()) / lambdr
is correct, but the code
import random
import math

def exprand(lambdr):
    return -math.log(random.random()) / lambdr
is wrong: it will occasionally raise an exception, because log(0) will eventually be called.

Variable number of parameters for scipy optimize with L-BFGS-B algorithm

I am looking for the correct approach to use a variable number of parameters as input for the optimizer in scipy.
I have a set of input parameters p1,...,pn and I calculate a quality criterion with a function func(p1,...,pn). I want to minimize this value.
The input parameters are either 0 or 1, indicating whether they should be used or not. I cannot simply delete all unused ones from the parameter list, since my function for the quality criterion requires them to be "0" to remove unused terms from the equations.
def func(parameters):
    ...calculate one scalar as quality criterion...

solution = optimize.fmin_l_bfgs_b(func, parameters, approx_grad=1,
                                  bounds=((0.0, 5.0), ..., (0.0, 5.0)))  # this will vary all parameters
Within my code the optimizer runs without errors, but of course all the given parameters are varied to achieve the best solution.
Is there a way to have, e.g., 10 input parameters for func, but only 5 of them used in the optimizer?
So far I can only think of changing my func definition so that it no longer needs the "0" input from the unused parameters. I would appreciate any ideas for how to avoid that.
Thanks a lot for the help!
If I understand correctly, you are asking for a constrained best fit, such that rather than finding the best [p0, p1, p2, ..., p10] for function func(), you want to find the best [p0, p1, ..., p5] for function func() under the condition that p6=fixed6, p7=fixed7, p8=fixed8, and so on.
Translating this into Python code is straightforward if you use args=(something,) in scipy.optimize.fmin_l_bfgs_b. First, write a partially fixed function func_fixed():
def func_fixed(p_var, p_fixed):
    # this will only work if both are lists; if they are numpy arrays,
    # use hstack, append or similar
    return func(p_var + p_fixed)

solution = optimize.fmin_l_bfgs_b(func_fixed,
                                  x0=guess_parameters,
                                  approx_grad=your_grad,
                                  bounds=your_bounds,
                                  args=(your_fixed_parameters,),  # this is the deal
                                  other_things)
It is not necessary to have func_fixed(); you could use a lambda instead, but it reads much more easily this way.
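For completeness, the lambda version would look roughly like this (a sketch using the same placeholder names as above, not code from the original answer):

solution = optimize.fmin_l_bfgs_b(lambda p_var: func(p_var + your_fixed_parameters),
                                  x0=guess_parameters,
                                  approx_grad=1,
                                  bounds=your_bounds)

Because the fixed parameters are captured by the lambda, no args tuple is needed in this variant.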
I recently solved a similar problem where I wanted to optimise a different subset of parameters at each run but needed all parameters to calculate the objective function. I added two arguments to my objective function:
an index array x_idx which indicates which parameters to optimise, i.e. 0 means don't optimise and 1 means optimise
an array x0 with the initial values of all parameters
In the objective function I set the list of parameters according to the index array, either to the parameters that are being optimised or to the initial values.
import numpy
import scipy.optimize

def objective_function(x_optimised, x_idx, x0):
    x = []
    j = 0
    for i, idx in enumerate(x_idx):
        if idx == 1:
            x.append(x_optimised[j])
            j = j + 1
        else:
            x.append(x0[i])
    x = numpy.array(x)
    return sum(x**2)

if __name__ == '__main__':
    x_idx = [1, 1, 0]
    x0 = [1.1, 1.3, 1.5]
    x_initial = [x for i, x in enumerate(x0) if x_idx[i] == 1]
    xopt, fopt, iter, funcalls, warnflag = scipy.optimize.fmin(objective_function,
                                                               x_initial, args=(x_idx, x0,),
                                                               maxfun=200, full_output=True)
    print(xopt)
