I'm relatively new to Python, mainly using it for plotting. I am currently trying to fit a best-fit curve using the four-parameter logistic (4PL) equation and curve_fit from scipy. There are one or two sites showing how 4PL fitting works, but I could not get their approach to work for my data. Example data (similar to my real 4PL data) below:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = [2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]
ydata = [0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1]
def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)
guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)
print(params)
This gives a warning (with the example data I also get an exponent warning, which does not occur with my real data):
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
And params comes back equal to my initial guess; I have tried various initial guesses.
The best-fit line is drawn when plotting, but it is not a curve and does not go below x = 0 (I cannot find a reason why negative values would break the 4PL model).
4PL fit plotted
I'm not sure whether I'm getting the equation wrong, misunderstanding how the curve_fit function works, or both. I have a similar issue using least squares instead of curve_fit. I've tried a number of variations based on similar fitting examples, but I've been stuck for a while; any help pointing me in the right direction would be much appreciated.
I'm surprised you did not get any warnings, or did not share them with us. I can't analyze this problem for you scientifically, just give some remarks about the technical side:
Observation
When running your code, you should see some warnings like:
RuntimeWarning: invalid value encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Don't ignore this!
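If you find these runtime warnings easy to overlook, one option (just a sketch; whether you want hard failures is your call) is to tell NumPy to raise an exception instead:
import numpy as np

# promote invalid floating-point operations (e.g. the bad power that produces nan)
# to FloatingPointError so they can't be silently ignored
np.seterr(invalid='raise')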
Debugging
Add some prints to your function fourPL, ideally printing the individual components of the expression, and look at what's happening.
Example:
def fourPL(x, A, B, C, D):
    print('1: ', (A-D))
    print('2: ', (x/C))
    print('3: ', (1.0+((x/C)**(B))))
    return ((A-D)/(1.0+((x/C)**(B))) + D)
...
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess, maxfev=1)
# maxfev=1 -> let's just check one or a few iterations
Output:
1: -1.0
2: [ 4.60000000e+00 4.60000000e+00 4.00000000e+00 4.00000000e+00
3.40000000e+00 3.40000000e+00 2.00000000e+00 2.00000000e+00
2.00000000e-06 2.00000000e-06 -2.00000000e+00 -2.00000000e+00]
RuntimeWarning: invalid value encountered in power
print('3: ', (1.0+((x/C)**(B))))
3: [ 1.4662524 1.4662524 1.5 1.5 1.54232614
1.54232614 1.70710678 1.70710678 708.10678119 708.10678119
nan nan]
That's enough to stop. nans and infs are bad!
Theory
Now it's time for theory, and I won't go into that here. But at this point you should usually think about the underlying theory and why these problems occur.
Is there something you missed in regard to the model's assumptions?
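As a hint in that direction, here's a tiny check (plain real-valued NumPy arithmetic) of what a negative base does under a fractional exponent:
import numpy as np

# a negative base with a non-integer exponent has no real-valued result,
# so NumPy returns nan (and emits "invalid value encountered in power")
print(np.power(-2.0, -0.5))  # nan
print(np.power(2.0, -0.5))   # ~0.7071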
Repair (without checking theory)
Without checking the theory, and just glancing at an example found within 30 seconds: hmm, are negative x-values a problem?
Let's shift x (by the minimum; hardcoded 1 here):
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
Complete code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
ydata = np.array([0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1])
def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)
guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)#, maxfev=1)
x_min, x_max = np.amin(xdata), np.amax(xdata)
xs = np.linspace(x_min, x_max, 1000)
plt.scatter(xdata, ydata)
plt.plot(xs, fourPL(xs, *params))
plt.show()
Output:
RuntimeWarning: divide by zero encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Looks good, but it's time for another theory session: what did our linear shift do to our results? I'm ignoring this here as well.
So just one warning and a nice-looking output.
If you want to remove that last warning, add some small epsilon to not have 0's in xdata:
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1 + 1e-10
which will achieve the same, without any warning.
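If you prefer not to hardcode the shift of 1, a small variation (my own sketch; the epsilon of 1e-10 is arbitrary) derives it from the data minimum:
import numpy as np

xdata_raw = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1])

# shift so the smallest x becomes a small positive epsilon,
# keeping every value strictly positive for the power term
xdata = xdata_raw - xdata_raw.min() + 1e-10
Keep the earlier caveat in mind: any shift changes what the fitted C parameter means, so account for it when interpreting the results.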
Related
Could someone help me fit the data collapse_fractions with a lognormal function whose median and standard deviation are derived via the maximum likelihood method?
I tried scipy.stats.lognorm.fit(data), but I did not obtain the results I got with Excel. The Excel file can be downloaded here: https://stacks.stanford.edu/file/druid:sw589ts9300/p_collapse_from_msa.xlsx
Also, any reference is really welcomed.
import numpy as np
intensity_measure_vector = np.array([[0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1]])
no_analyses = 40
no_collapses = np.array([[0, 0, 0, 4, 6, 13, 12, 16]])
collapse_fractions = np.array(no_collapses/no_analyses)
print(collapse_fractions)
# array([[0. , 0. , 0. , 0.1 , 0.15 , 0.325, 0.3 , 0.4 ]])
collapse_fractions.shape
# (1, 8)
import matplotlib.pyplot as plt
plt.scatter(intensity_measure_vector, collapse_fractions)
I couldn't figure out how to get lognorm.fit to work, so I just implemented the functions from your Excel file and used scipy.optimize as the optimizer. The added benefit is that it is easier to understand what is actually going on compared to lognorm.fit, especially with the Excel file on the side.
Here is my implementation:
from functools import partial
import numpy as np
from scipy import optimize, stats
im = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1])
im_log = np.log(im)
number_of_analyses = np.array([40, 40, 40, 40, 40, 40, 40, 40])
number_of_collapses = np.array([0, 0, 0, 4, 6, 13, 12, 16])
FORMAT_STRING = "{:<20}{:<20}{:<20}"
print(FORMAT_STRING.format("sigma", "beta", "log_likelihood_sum"))
def neg_log_likelihood_sum(params, im_l, no_a, no_c):
    sigma, beta = params
    theoretical_fragility_function = stats.norm(np.log(sigma), beta).cdf(im_l)
    likelihood = stats.binom.pmf(no_c, no_a, theoretical_fragility_function)
    log_likelihood = np.log(likelihood)
    log_likelihood_sum = np.sum(log_likelihood)
    print(FORMAT_STRING.format(sigma, beta, log_likelihood_sum))
    return -log_likelihood_sum
neg_log_likelihood_sum_partial = partial(neg_log_likelihood_sum, im_l=im_log, no_a=number_of_analyses, no_c=number_of_collapses)
res = optimize.minimize(neg_log_likelihood_sum_partial, (1, 1), method="Nelder-Mead")
print(res)
And the final result is:
final_simplex: (array([[1.07613697, 0.42927824],
[1.07621925, 0.42935678],
[1.07622438, 0.42924577]]), array([10.7977048 , 10.79770573, 10.79770723]))
fun: 10.797704803509903
message: 'Optimization terminated successfully.'
nfev: 68
nit: 36
status: 0
success: True
x: array([1.07613697, 0.42927824])
The interesting part for you is the fitted parameter vector x (also the first row of final_simplex): the same final result as in the Excel calculation (sigma=1.07613697 and beta=0.42927824).
If you have any questions about what I did here, don't hesitate to ask, since you said you are new to Python. A few things in advance:
I minimized the negative log-likelihood sum, as there is no maximizer in scipy.
partial from functools returns a function with the specified arguments already filled in (in this case im_l, no_a and no_c, since they don't change); the partial function can then be called with just the missing argument (see the small sketch after these notes).
The neg_log_likelihood_sum function is basically what's happening in the Excel file, so it should be easy to understand when viewing them side by side.
scipy.optimize.minimize minimizes the function given as the first argument by changing the parameters (start values as the second argument). The method was chosen because it gave good results; I didn't dive deep into the abyss of different optimization methods, gradients, etc. So there may well be a better setup, but this one works fine and seems faster than the optimization with lognorm.fit.
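To make the partial point concrete, here is a tiny standalone sketch (a toy function, nothing to do with the fragility data):
from functools import partial

def add(a, b, c):
    return a + b + c

# fix b and c ahead of time; the returned function only needs a
add_to_ten = partial(add, b=4, c=6)
print(add_to_ten(0))  # 10
print(add_to_ten(5))  # 15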
The plot, like the one in the Excel file, is produced as follows with the result res from the optimization:
import matplotlib.pyplot as plt
x = np.linspace(0, 2.5, 100)
y = stats.norm(np.log(res["x"][0]), res["x"][1]).cdf(np.log(x))
plt.plot(x, y)
plt.scatter(im, number_of_collapses/number_of_analyses)
plt.show()
Good morning, everyone. I have a set of values.
import numpy as np
import matplotlib.pyplot as plt

Arr = np.array([0.11, 0.14, 0.22, 0.26, 0.31, 0.36, 0.44, 0.69, 0.70, 0.70, 0.70, 0.75, 0.98, 1.40])
I have constructed the CDF function in this way:
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]

def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
plot_ecdf(Arr)
Obtaining this figure:
Now I want to divide the space (y-axis) into 5 parts. To do this I am using the following function:
from scipy.stats.qmc import LatinHypercube
engine = LatinHypercube(d=1)
sample = engine.random(n=5) #Array of float64
For example, obtaining 5 randomly generated values:
0.0886183
0.450613
0.808077
0.753524
0.343108
At this point I would like to obtain the corresponding values on the CDF, as in the picture.
I also observed that the CDF constructed this way takes a discrete set of values, which may not be optimal for my purpose.
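For clarity, this is roughly the kind of lookup I have in mind (just a sketch of my intent using np.searchsorted on the ecdf output above; I'm not sure it is the right approach):
import numpy as np

x, y = ecdf(Arr)        # sorted unique values and their cumulative probabilities
probs = sample.ravel()  # the 5 Latin Hypercube samples in [0, 1]

# for each sampled probability, take the first data value whose ECDF reaches it
idx = np.searchsorted(y, probs)
values = x[np.clip(idx, 0, len(x) - 1)]
print(values)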
I am trying to solve an overdetermined equation system with bounds on the variables. To describe my problem, here is an example:
import numpy as np
from scipy.optimize import least_squares

### Input values
LED1_10 = np.array([1.5, 1, 0.5, 0.5])
LED1_20 = np.array([2.5, 1.75, 1.2, 1.2])
LED1_30 = np.array([3, 2.3, 1.7, 1.7])
LED2_10 = np.array([0.2, 0.8, 0.4, 0.4])
LED2_20 = np.array([0.6, 1.6, 0.5, 0.5])
LED2_30 = np.array([1.0, 2.0, 0.55, 0.55])
LED3_10 = np.array([1, 0.1, 0.4, 0.4])
LED3_20 = np.array([2.5, 0.8, 0.9, 0.9])
LED3_30 = np.array([3.25, 1, 1.3, 1.3])
### Rearrange the values
LED1 = np.stack((LED1_10, LED1_20, LED1_30)).T
LED2 = np.stack((LED2_10, LED2_20, LED2_30)).T
LED3 = np.stack((LED3_10, LED3_20, LED3_30)).T
### Fit polynomials
LEDs = np.array([LED1, LED2, LED3])
fits = [
    [np.polyfit(np.array([10, 20, 30]), LEDs[i, j], 2) for j in range(LEDs.shape[1])]
    for i in range(LEDs.shape[0])
]
fits = np.array(fits)
def g(x):
    X = np.array([x**2, x, np.ones_like(x)]).T
    return np.sum(fits * X[:,None], axis=(0, 2))
### Solve
def system(x, b):
    return g(x) - b
b = [5, 8, 4, 12]
x = least_squares(system, np.asarray((1,1,1)), bounds=(0, 20), args = b).x
In my first approach I solved the system without bounds using the solver leastsq, like this: x = scipy.optimize.leastsq(system, np.asarray((1,1,1)), args=b)[0]. This worked fine and gave me a solution for x1, x2 and x3. But now I've realized that my real-world application requires limits on the variables.
If I run my code as presented above I get the error: "system() takes 2 positional arguments but 5 were given"
Can anyone help me solve this problem? Or maybe suggest another solver for this task, if least_squares is not the right choice.
Thank you for all of your help.
You are passing a list of 4 elements as args, so least_squares unpacks it and thinks your function system takes 5 positional arguments (x plus the four values of b). Instead, either pass a one-element tuple containing your extra argument, i.e.
x = least_squares(system, np.asarray((1,1,1)), bounds=(0, 20), args = (b,)).x
or use a lambda:
x = least_squares(lambda x: g(x) - b, np.asarray((1,1,1)), bounds=(0, 20)).x
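To illustrate why the tuple matters: least_squares unpacks args and calls fun(x, *args), so a 4-element list turns into four separate extra arguments. A small self-contained sketch with a made-up residual function (not your g):
import numpy as np
from scipy.optimize import least_squares

def residuals(x, b):
    # toy residual: we want x**2 to match b elementwise
    return x**2 - b

b = np.array([4.0, 9.0, 16.0])

# args=(b,) passes b as the single extra argument to residuals;
# a bare list/array for args would be unpacked into several arguments instead
sol = least_squares(residuals, np.ones(3), bounds=(0, 20), args=(b,))
print(sol.x)  # approximately [2, 3, 4]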
I am trying to fit a trapezoid to a set of time series using the curve_fit function from scipy.optimize. The function that I'm using to generate a trapezoid is the following:
import numpy as np
from scipy.optimize import curve_fit

def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y
Where a and c are the slopes, and tau1 and tau2 mark the beginning and the end of the flat phase.
And in order to fit I just use:
popt, pcov = curve_fit(trapezoid, xdata, ydata, method = 'lm')
For most of the cases it works just fine, such as in the following:
However, I'm also getting some cases where it simply fails to fit the data, even though it looks like it should do fine:
The problem with these cases is that it sets a tau2 (end of the flat phase) smaller than tau1 (beginning of it).
Could anyone suggest a way to solve this issue? Whether by imposing a constraint or in some other way?
Example array for which the fit does not work:
array([1.2 , 1.21, 1.2 , 1.19, 1.21, 1.22, 2.47, 2.53, 2.49, 2.39, 2.28,
2.16, 2.07, 1.99, 1.91, 1.83, 1.74, 1.65, 1.57, 1.5 , 1.45, 1.41,
1.38, 1.35, 1.33, 1.29, 1.24, 1.19, 1.14, 1.11, 1.07, 1.04, 1. ,
0.95, 0.91, 0.87, 0.84, 0.8 , 0.77, 0.74, 0.72, 0.7 , 0.68, 0.66,
0.63, 0.61, 0.59, 0.57, 0.55, 0.52, 0.5 , 0.48, 0.45, 0.43, 0.41,
0.39, 0.38, 0.37, 0.37, 0.36, 0.35, 0.34, 0.34, 0.33])
Which yields: tau1 = 8.45, tau2 = 5.99
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful for this problem. Lmfit provides a slightly higher-level interface to curve fitting, still based on the scipy optimizers, but with some better abstractions and features.
In particular for your question, lmfit parameters are Python objects that can have bounds, be fixed, or be written as simple algebraic constraints in terms of other variables. This can support imposing tau2 > tau1.
The idea is essentially to set tau2=tau1+taudiff and place a lower bound of 0 on taudiff. While you could rewrite your function to do that in the code, with lmfit you don't have to do that and can put that logic in the Parameters instead.
Converting your script to use lmfit would give something like this:
import numpy as np
import pylab

from lmfit import Model
# use your same model function
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y
# turn model function into lmfit Model
tmod = Model(trapezoid)
# create Parameters for this model: they will be *named* according
# to the signature of the model function, and be used as keys in
# an ordered-dictionary-derived object. Here you can also give
# initial values
params = tmod.make_params(a=1, b=2, c=0.5, tau1=5, tau2=-1)
# now you can set bounds or constraints.
# 1st, add a new variable "taudiff"
params.add('taudiff', value=0.1, min=0, vary=True)
# constrain tau2 to be taudiff + tau1 -- it is no longer a free variable
params['tau2'].expr = "taudiff + tau1"
# now do fit to data:
result = tmod.fit(ydata, params, x=xdata)
# print report of fit
print(result.fit_report())
# get best fit params:
for parname, param in result.params.items():
    print(parname, param.value, param.stderr, param.expr)
# get best fit array for plotting
pylab.plot(xdata, ydata)
pylab.plot(xdata, result.best_fit)
Hope that helps.
Just reordering tau1 and tau2 inside the model function, so that tau1 is always the smaller of the two, does work:
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    # make sure tau1 <= tau2 regardless of the order the optimizer tries
    (tau1, tau2) = (min(tau1, tau2), max(tau1, tau2))
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y

x_data = np.arange(len(A))
popt, pcov = curve_fit(trapezoid, x_data, A, method='lm')
print(popt)
fit = trapezoid(x_data, *popt)
leads to:
I'm trying to use SciPy's UnivariateSpline to locate a point on a curve. Unfortunately, my result is nan.
Here's a minimal example:
from scipy.interpolate import UnivariateSpline
spline = UnivariateSpline([0.6, 0.4, 0.2, 0.0], [-0.3, -0.1, 0.1, 0.3], w=None, bbox=[None, None], k=1, s=0)
POINT = spline([0.15])
print POINT
The result is [ NaN].
Which feature of UnivariateSpline did I miss?
I'm using Python 2.6.6 and scipy version 0.7.2
I cannot guarantee that my data points are always increasing, so np.interp might not be an alternative.
As the docstring for UnivariateSpline states, the values in x must be increasing. You'll have to sort your data if you want to use UnivariateSpline. E.g. something like this:
In [71]: x = np.array([0.6, 0.4, 0.2, 0.0])
In [72]: y = np.array([-0.3, -0.1, 0.1, 0.3])
In [73]: order = np.argsort(x)
In [74]: spline = UnivariateSpline(x[order], y[order], w=None, bbox=[None, None], k=1, s=0)
In [75]: spline([0.15])
Out[75]: array([ 0.15])