Fit data with a lognormal function via Maximum Likelihood estimators - python

Could someone help me fit the data collapse_fractions with a lognormal function whose median and standard deviation are derived via the maximum likelihood method?
I tried scipy.stats.lognorm.fit(data), but I did not obtain the results I got with Excel. The Excel file can be downloaded here: https://stacks.stanford.edu/file/druid:sw589ts9300/p_collapse_from_msa.xlsx
Also, any reference is welcome.
import numpy as np
intensity_measure_vector = np.array([[0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1]])
no_analyses = 40
no_collapses = np.array([[0, 0, 0, 4, 6, 13, 12, 16]])
collapse_fractions = np.array(no_collapses/no_analyses)
print(collapse_fractions)
# array([[0. , 0. , 0. , 0.1 , 0.15 , 0.325, 0.3 , 0.4 ]])
collapse_fractions.shape
# (1, 8)
import matplotlib.pyplot as plt
plt.scatter(intensity_measure_vector, collapse_fractions)

I couldn't figure out how to get lognorm.fit to work either, so I just implemented the functions from your Excel file and used scipy.optimize as the optimizer. The added benefit is that it is easier to understand what is actually going on compared to lognorm.fit, especially with the Excel sheet on the side.
Here is my implementation:
from functools import partial
import numpy as np
from scipy import optimize, stats
im = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1])
im_log = np.log(im)
number_of_analyses = np.array([40, 40, 40, 40, 40, 40, 40, 40])
number_of_collapses = np.array([0, 0, 0, 4, 6, 13, 12, 16])
FORMAT_STRING = "{:<20}{:<20}{:<20}"
print(FORMAT_STRING.format("sigma", "beta", "log_likelihood_sum"))
def neg_log_likelihood_sum(params, im_l, no_a, no_c):
    sigma, beta = params
    theoretical_fragility_function = stats.norm(np.log(sigma), beta).cdf(im_l)
    likelihood = stats.binom.pmf(no_c, no_a, theoretical_fragility_function)
    log_likelihood = np.log(likelihood)
    log_likelihood_sum = np.sum(log_likelihood)
    print(FORMAT_STRING.format(sigma, beta, log_likelihood_sum))
    return -log_likelihood_sum
neg_log_likelihood_sum_partial = partial(neg_log_likelihood_sum, im_l=im_log, no_a=number_of_analyses, no_c=number_of_collapses)
res = optimize.minimize(neg_log_likelihood_sum_partial, (1, 1), method="Nelder-Mead")
print(res)
And the final result is:
 final_simplex: (array([[1.07613697, 0.42927824],
        [1.07621925, 0.42935678],
        [1.07622438, 0.42924577]]), array([10.7977048 , 10.79770573, 10.79770723]))
           fun: 10.797704803509903
       message: 'Optimization terminated successfully.'
          nfev: 68
           nit: 36
        status: 0
       success: True
             x: array([1.07613697, 0.42927824])
The interesting part for you is the entry x: the same final result as in the Excel calculation (sigma=1.07613697 and beta=0.42927824).
If you have any questions about what I did here, don't hesitate to ask, as you said you are new to Python. A few things in advance:
I minimized the negative log-likelihood sum, since there is no maximizer in scipy.
partial from functools returns a function that has the specified arguments already bound (in this case im_l, no_a and no_c, as they don't change); the partial function can then be called with just the remaining argument (a short illustrative sketch follows below).
The neg_log_likelihood_sum function is basically what's happening in the Excel file, so it should be easy to understand when viewing it side by side.
scipy.optimize.minimize minimizes the function given as the first argument by changing the parameters (start values as the second argument). The method was chosen because it gave good results; I didn't dive deep into the abyss of different optimization methods, gradients etc. So it may well be that there is a better setup, but this one works fine and seems faster than the optimization with lognorm.fit.
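For illustration, here is a tiny sketch of how functools.partial behaves (the function and numbers are made up for the example):
from functools import partial
def add(a, b, c):
    return a + b + c
add_fixed = partial(add, b=10, c=100)  # b and c are now bound
print(add_fixed(1))  # 111 -- only the missing argument is passed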
The plot from the Excel file can be reproduced like this, using the result res from the optimization:
import matplotlib.pyplot as plt
x = np.linspace(0, 2.5, 100)
y = stats.norm(np.log(res["x"][0]), res["x"][1]).cdf(np.log(x))
plt.plot(x, y)
plt.scatter(im, number_of_collapses/number_of_analyses)
plt.show()
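If you do want to express the fitted fragility curve with scipy.stats.lognorm instead of a normal CDF applied to log(x), the two parameterizations are equivalent (shape s = beta, scale = median). Note that lognorm.fit itself is not the right tool here: it fits a lognormal distribution to raw samples, whereas the fragility curve is a lognormal CDF fitted to binomial collapse counts, which is exactly what the likelihood above encodes. A minimal check, assuming res still holds the optimization result from above:
from scipy import stats
import numpy as np
theta, beta = res["x"]  # fitted median and log-standard deviation
x = np.linspace(0.01, 2.5, 50)  # start above 0 to keep log(x) finite
y_norm = stats.norm(np.log(theta), beta).cdf(np.log(x))
y_lognorm = stats.lognorm(s=beta, scale=theta).cdf(x)
print(np.allclose(y_norm, y_lognorm))  # True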


Why can't RegularGridInterpolator return several values (for a function that outputs in $R^d$)?

MRE (with working output, and output that doesn't work although I would like it to, as it would be the intuitive thing to do):
import numpy as np
from scipy.interpolate import RegularGridInterpolator, griddata
def f(x1, x2, x3):
    return x1 + 2*x2 + 3*x3, x1**2, x2
# Define the input points
xi = [np.linspace(0, 1, 5), np.linspace(0, 1, 5), np.linspace(0, 1, 5)]
# Mesh grid
x1, x2, x3 = np.meshgrid(*xi, indexing='ij')
# Outputs
y = f(x1, x2, x3)
assert (y[0][1][1][3] == (0.25 + 2*0.25 + 3*0.75))
assert (y[1][1][1][3] == (0.25**2))
assert (y[2][1][1][3] == 0.25)
#### THIS WORKS BUT I CAN ONLY GET THE nth (with n integer in [1, d]) VALUE RETURNED BY f
# Interpolate at point 0.3, 0.3, 0.4
interp = RegularGridInterpolator(xi, y[0])
print(interp([0.3, 0.3, 0.4])) # outputs 2.1 as expected
#### THIS DOESN'T WORK (I WOULD'VE EXPECTED A LIST OF TUPLES FOR EXAMPLE)
# Interpolate at point 0.3, 0.3, 0.4
interp = RegularGridInterpolator(xi, y)
print(interp([0.3, 0.3, 0.4])) # doesn't output array([2.1, 0.1, 0.3])
What is intriguing is that griddata does support functions that output values in R^d
# Same with griddata
grid_for_griddata = np.array([x1.flatten(), x2.flatten(), x3.flatten()]).T
assert (grid_for_griddata.shape == (125, 3))
y_for_griddata = np.array([y[0].flatten(), y[1].flatten(), y[2].flatten()]).T
assert (y_for_griddata.shape == (125, 3))
griddata(grid_for_griddata, y_for_griddata, [0.3, 0.3, 0.4], method='linear')[0] # outputs array([2.1, 0.1, 0.3]) as expected
Am I using RegularGridInterpolator the wrong way?
I know someone might say "just use griddata", but because my data is in a rectilinear grid, I should use RegularGridInterpolator so that it's faster, right?
Proof that it's faster:
If I define a y with the 3 outputs stacked as the last dimension:
In [196]: yarr = np.stack(y,axis=3); yarr.shape
Out[196]: (5, 5, 5, 3)
Setup works (no complaints about 3 not matching 5):
In [197]: interp = RegularGridInterpolator(xi, yarr)
And the interpolation:
In [198]: interp([.3,.3,.4])
Out[198]: array([[2.1, 0.1, 0.3]])
and for multiple points:
In [202]: interp([[.3,.3,.4],[.31,.31,.41],[.5,.4,.4]])
Out[202]: 
array([[2.1   , 0.1   , 0.3   ],
       [2.16  , 0.1075, 0.31  ],
       [2.5   , 0.25  , 0.4   ]])
While the above was just a guess that works, I see that the docs can be interpreted this way:
values: array_like, shape (m1, …, mn, …)
The ... at the end suggests that the array may have 0 or more trailing dimensions (beyond the n that match the points dimensions). But this flexibility may only apply to the linear and nearest methods; other methods seem to have problems.
This is clearer on the doc page for its __call__:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.RegularGridInterpolator.__call__.html#scipy.interpolate.RegularGridInterpolator.__call__
Returns
values_x : ndarray, shape xi.shape[:-1] + values.shape[ndim:]
interpn also documents this.
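Putting it together, here is a minimal self-contained sketch of the multi-output case (reusing f and xi from the question; the expected values are the ones shown in the transcript above):
import numpy as np
from scipy.interpolate import RegularGridInterpolator
def f(x1, x2, x3):
    return x1 + 2*x2 + 3*x3, x1**2, x2
xi = [np.linspace(0, 1, 5)] * 3
x1, x2, x3 = np.meshgrid(*xi, indexing='ij')
# stack the three outputs along a trailing axis: values has shape (5, 5, 5, 3)
yarr = np.stack(f(x1, x2, x3), axis=-1)
interp = RegularGridInterpolator(xi, yarr)
print(interp([0.3, 0.3, 0.4]))  # [[2.1 0.1 0.3]]
print(interp([[0.3, 0.3, 0.4], [0.5, 0.4, 0.4]]))  # [[2.1 0.1 0.3], [2.5 0.25 0.4]]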

How to find the optimum value from plot values in Python?

I have a python dict
{'kValues': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'WSS': [21455, 5432, 4897, 4675, 4257, 3954, 3852, 3756, 3487],
'SS': [0.75, 0.85, 0.7, 0.52, 0.33, 0.38, 0.42, 0.46, 0.47]}
When I plot kValues against WSS and SS, I get the following lines.
The optimum value of the first plot is at k = 3, and of the second plot it is also at k = 3.
How can I extract that k value from the dictionary without visualizing the plots?
Criteria: the first plot always has an elbow, and that elbow point should be extracted; the second plot always has a rise followed by a dip, and the value at that rise should be extracted.
You can use the angle formed at each point p2 by its two neighbours p1 and p3: an angle near 90 degrees indicates the elbow, and an angle near 0 degrees indicates the dip.
I am sharing my code below; the normalization is the tricky part.
import math
kValues = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
WSS = [81000, 21455, 5432, 4897, 4675, 4257, 3954, 3852, 3756, 3487]
SS = [0.75, 0.85, 0.7, 0.52, 0.33, 0.38, 0.42, 0.46, 0.47]
# get all the angles between k values as x and WSS as y
angles = []
# each WSS slab value
normalize_wss = 2000
for i in range(1, len(WSS) - 1, 1):
    p1 = (kValues[i-1] * normalize_wss, WSS[i-1])
    p2 = (kValues[i] * normalize_wss, WSS[i])
    p3 = (kValues[i+1] * normalize_wss, WSS[i+1])
    # find angle between 3 points p1, p2, p3
    angle1 = math.degrees(math.atan2(p3[1] - p2[1], p3[0] - p2[0]))
    angle2 = math.degrees(math.atan2(p1[1] - p2[1], p1[0] - p2[0]))
    angles.append([angle1 - angle2])
print(angles)
Found this wonderful python library for finding the optimum value
https://pypi.org/project/kneed/
from kneed import KneeLocator
kneedle = KneeLocator(kValues, WSS, S=1.0, curve="convex", direction="decreasing")
print(kneedle.knee) # 3
print(kneedle.elbow) # 3
curve and direction can be configured based on the pattern of your data.
You can use the derivative to find the peak in the SS:
import numpy as np
k_ss = kValues[np.where(np.sign(np.diff(SS, append=SS[-1])) == -1)[0][0]]
np.diff calculates the difference between consecutive elements (a discrete derivative), and then we find the first place where the derivative changes sign (i.e. goes over the peak).
For WSS it's a bit more complicated because you have to define a threshold that characterizes an elbow; you could calibrate it on a few examples of your data. Here is code where the threshold is set to 1/10 of the maximum derivative:
d = np.diff(WSS, append=WSS[-1])
th = max(abs(d)) / 10
k_wss = kValues[np.where(abs(d) < th)[0][0]]
Other than that, you can try to fit the data to an asymptotic curve and extract the constants.
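As a rough sketch of that last suggestion (the model form a + b/k is only an assumption for illustration, not something prescribed by the data):
import numpy as np
from scipy.optimize import curve_fit
kValues = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
WSS = np.array([21455, 5432, 4897, 4675, 4257, 3954, 3852, 3756, 3487], dtype=float)
def asymptotic(k, a, b):
    # a is the plateau the curve approaches, b controls how fast it falls towards it
    return a + b / k
(a, b), _ = curve_fit(asymptotic, kValues, WSS)
print(a, b)  # fitted constants; apply whatever elbow rule you prefer to the smooth curve
Because this model is linear in a and b, curve_fit converges without an explicit starting guess; for genuinely nonlinear models you would want to supply p0.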

Solution of an overdetermined nonlinear system of equations with boundary conditions in python

I am trying to solve an overdetermined nonlinear system of equations with boundary conditions (bounds on the unknowns). To describe my problem, here is an example:
import numpy as np
from scipy.optimize import least_squares

### Input values
LED1_10 = np.array([1.5, 1, 0.5, 0.5])
LED1_20 = np.array([2.5, 1.75, 1.2, 1.2])
LED1_30 = np.array([3, 2.3, 1.7, 1.7])
LED2_10 = np.array([0.2, 0.8, 0.4, 0.4])
LED2_20 = np.array([0.6, 1.6, 0.5, 0.5])
LED2_30 = np.array([1.0, 2.0, 0.55, 0.55])
LED3_10 = np.array([1, 0.1, 0.4, 0.4])
LED3_20 = np.array([2.5, 0.8, 0.9, 0.9])
LED3_30 = np.array([3.25, 1, 1.3, 1.3])
### Rearrange the values
LED1 = np.stack((LED1_10, LED1_20, LED1_30)).T
LED2 = np.stack((LED2_10, LED2_20, LED2_30)).T
LED3 = np.stack((LED3_10, LED3_20, LED3_30)).T
### Fit polynomials
LEDs = np.array([LED1, LED2, LED3])
fits = [
    [np.polyfit(np.array([10, 20, 30]), LEDs[i, j], 2) for j in range(LEDs.shape[1])]
    for i in range(LEDs.shape[0])
]
fits = np.array(fits)

def g(x):
    X = np.array([x**2, x, np.ones_like(x)]).T
    return np.sum(fits * X[:, None], axis=(0, 2))
### Solve
def system(x, b):
    return g(x) - b
b = [5, 8, 4, 12]
x = least_squares(system, np.asarray((1,1,1)), bounds=(0, 20), args = b).x
In my first approach I solved the system without bounds using the solver leastsq, like this: x = scipy.optimize.leastsq(system, np.asarray((1, 1, 1)), args=b)[0]. This worked out fine and gave me a solution for x1, x2 and x3. But now I've realized that my real-world application requires limits.
If I run my code as presented above, I get the error: "system() takes 2 positional arguments but 5 were given".
Can anyone help me solve this problem? Or maybe suggest another solver for this task if least_squares is not the right choice.
Thank you for all of your help.
You are passing a list of 4 elements as args, so least_squares calls your function as system(x, 5, 8, 4, 12), i.e. with 5 positional arguments. Instead, either pass a tuple containing your extra argument, i.e.
x = least_squares(system, np.asarray((1,1,1)), bounds=(0, 20), args = (b,)).x
or use a lambda:
x = least_squares(lambda x: g(x) - b, np.asarray((1,1,1)), bounds=(0, 20)).x
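A quick way to see how args is unpacked, with a toy residual function (the names here are made up for the illustration):
import numpy as np
from scipy.optimize import least_squares
def residuals(x, b):
    # least_squares calls residuals(x, *args), so args must be the tuple (b,)
    return x - b
b = np.array([5.0, 8.0, 4.0])
sol = least_squares(residuals, np.ones(3), bounds=(0, 20), args=(b,))
print(sol.x)  # approximately [5. 8. 4.]
Passing args=b (a plain list of numbers) would instead spread its elements over separate positional parameters, which is exactly the error message you saw.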

Least squares function and 4 parameter logistics function not working

Relatively new to Python, mainly using it for plotting things. I am currently attempting to determine a best-fit line using the 4 parameter logistic (4PL) equation and curve_fit from scipy. There are one or two sites showing how 4PL works, but I could not get them to work for my data. Example (similar to my 4PL data) below:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = [2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]
ydata = [0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1]
def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)

guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)
params
This gives a warning (there is also an exponent warning with the test data, but not with my real data):
OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)
And params just returns my initial guess. I have tried various initial guesses.
The best fit line is drawn when plotting, but is not a curve and does not go below x = 0 (I cannot find a reason negatives would mess with the 4PL model).
4PL fit plotted
I'm not sure if I am doing something incorrect with the equation, or how the curve fit function works, or both. I have a similar issue using least squares instead of curve fit. I've tried a bunch of variations based on similar equations for fit etc. but have been stuck for a while; any help in pointing me in the right direction would be much appreciated.
I'm surprised you did not get any warnings or did not share them with us. I can't analyze this task for you by scientific means, just some remarks about technical stuff:
Observation
When running your code, you should see some warnings like:
RuntimeWarning: invalid value encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Don't ignore this!
Debugging
Add some prints to your function fourPL, probably all the different components of your function and look what's happening.
Example:
def fourPL(x, A, B, C, D):
    print('1: ', (A-D))
    print('2: ', (x/C))
    print('3: ', (1.0+((x/C)**(B))))
    return ((A-D)/(1.0+((x/C)**(B))) + D)
...
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess, maxfev=1)
# maxfev=1 -> let's just check one or a few iterations
Output:
1: -1.0
2: [ 4.60000000e+00 4.60000000e+00 4.00000000e+00 4.00000000e+00
3.40000000e+00 3.40000000e+00 2.00000000e+00 2.00000000e+00
2.00000000e-06 2.00000000e-06 -2.00000000e+00 -2.00000000e+00]
RuntimeWarning: invalid value encountered in power
print('3: ', (1.0+((x/C)**(B))))
3: [ 1.4662524 1.4662524 1.5 1.5 1.54232614
1.54232614 1.70710678 1.70710678 708.10678119 708.10678119
nan nan]
That's enough to stop. nans and infs are bad!
Theory
Now it's time for theory, and I won't do that here. But usually you should now think about the underlying theory and why these problems occur.
Is there something you missed in regards to the assumptions?
Repair (without checking theory)
Without checking the theory, and just looking over some examples found within 30 seconds: hmm, are negative x-values a problem?
Let's shift x (by the minimum; hardcoded 1 here):
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
Complete code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
ydata = np.array([0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1])
def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)

guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)  # , maxfev=1
x_min, x_max = np.amin(xdata), np.amax(xdata)
xs = np.linspace(x_min, x_max, 1000)
plt.scatter(xdata, ydata)
plt.plot(xs, fourPL(xs, *params))
plt.show()
Output:
RuntimeWarning: divide by zero encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Looks good, but it's time for another theory session: what did our linear-shift do to our results? I'm ignoring this again.
So just one warning and a nice-looking output.
If you want to remove that last warning, add some small epsilon to not have 0's in xdata:
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1 + 1e-10
which will achieve the same, without any warning.
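As an alternative to shifting the data, the 4PL model can also be written in an exp-based logistic form that is defined for all real x. This is a different (though common) parameterization from the power form in your code, so the fitted B and C are not directly comparable; treat it as a sketch:
import numpy as np
import scipy.optimize as optimization
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1])
ydata = np.array([0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1])
def fourPL_logistic(x, A, B, C, D):
    # A: asymptote for small x, D: asymptote for large x, C: midpoint, B: steepness
    # no fractional powers of negative numbers, so negative x raises no warnings
    return D + (A - D) / (1.0 + np.exp(B * (x - C)))
guess = [1.1, 2.0, 1.5, 0.3]
params, params_covariance = optimization.curve_fit(fourPL_logistic, xdata, ydata, p0=guess)
print(params)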

Generate image data from three numpy arrays

I have three numpy arrays, X, Y, and Z.
X and Y are coordinates of a spatial grid and each grid point (X, Y) has an intensity Z. I would like to save a PNG image from this data. Interpolation is not needed, as X and Y are guaranteed to cover every grid point between their minimum and maximum values.
I'm guessing the solution lies within numpy's meshgrid() function, but I can't figure out how to reshape the Z array to NxM intensity data.
How can I do that?
To clarify the input data structure, this is what it looks like:
X | Y | Z
-----------------------------
0.1 | 0.1 | something..
0.1 | 0.2 | something..
0.1 | 0.3 | something..
...
0.2 | 0.1 | something..
0.2 | 0.2 | something..
0.2 | 0.3 | something..
...
0.2 | 0.1 | something..
0.1 | 0.2 | something..
0.3 | 0.3 | something..
...
To begin with, you should run this piece of code:
import numpy as np
X = np.asarray(<X data>)
Y = np.asarray(<Y data>)
Z = np.asarray(<Z data>)
Xu = np.unique(X)
Yu = np.unique(Y)
Then you could apply any of the following approaches. It is worth noting that all of them would work fine even if the data are NOT sorted (in contrast to the currently accepted answer):
1) A for loop and numpy.where() function
This is perhaps the simplest and most readable solution:
Zimg = np.zeros((Xu.size, Yu.size), np.uint8)
for i in range(X.size):
    Zimg[np.where(Xu == X[i]), np.where(Yu == Y[i])] = Z[i]
2) A list comprehension and the numpy.sort() function
This solution - which is a bit more involved than the previous one - relies on Numpy's structured arrays:
data_type = [('x', float), ('y', float), ('z', np.uint8)]
XYZ = [(X[i], Y[i], Z[i]) for i in range(len(X))]
table = np.array(XYZ, dtype=data_type)
Zimg = np.sort(table, order=['y', 'x'])['z'].reshape(Xu.size, Yu.size)
3) Vectorization
Using lexsort is an elegant and efficient way of performing the required task:
Zimg = Z[np.lexsort((Y, X))].reshape(Xu.size, Yu.size)
4) Pure Python, not using NumPy
You may want to check out this link for a pure Python solution without any third party dependencies.
To end up, you have different options to save Zimg as an image:
from PIL import Image
Image.fromarray(Zimg).save('z-pil.png')
import matplotlib.pyplot as plt
plt.imsave('z-matplotlib.png', Zimg)
import cv2
cv2.imwrite('z-cv2.png', Zimg)
import scipy.misc
scipy.misc.imsave('z-scipy.png', Zimg)
You said no interpolation is needed since every grid point is covered, so I assume the points are equally spaced.
If your table is already sorted primarily by increasing x and secondarily by y, you can simply take the Z array, reshape it, and save it using PIL:
import numpy as np
# Find out what shape your final array has (if you already know just hardcode these)
x_values = np.unique(X).size
y_values = np.unique(Y).size
img = np.reshape(Z, (x_values, y_values))
# Maybe you need to cast the dtype to fulfill png restrictions
#img = img.astype(np.uint) # alter it if needed
# Save the image
from PIL import Image
Image.fromarray(img).save('filename.png')
If your input isn't sorted (it looks like it is but who knows) you have to sort it before you start. Depending on your input this can be easy or really hard.
np.ufunc.at is a good tool to manage duplicates in a vectorized way.
Suppose these data:
In [3]: X,Y,Z=rand(3,10).round(1)
(array([ 0.4, 0.2, 0.1, 0.8, 0.4, 0.1, 0.5, 0.2, 0.6, 0.2]),
array([ 0.5, 0.3, 0.5, 0.9, 0.9, 0.5, 0.3, 0.6, 0.4, 0.4]),
array([ 0.4, 0.6, 0.6, 0.4, 0.1, 0.1, 0.2, 0.6, 0.9, 0.8]))
First scale the image (scale=3 here) :
In [4]: indices=[ (3*c).astype(int) for c in (X,Y)]
[array([1, 0, 0, 2, 1, 0, 1, 0, 1, 0]), array([1, 0, 1, 2, 2, 1, 0, 1, 1, 1])]
Make an empty image, image = np.zeros((3, 3)), sized according to the index bounds.
Then build. Here we keep the maximum.
In [5]: np.maximum.at(image, indices, Z)  # in place; image is now:
array([[ 0.6,  0.8,  0. ],
       [ 0.2,  0.9,  0.1],
       [ 0. ,  0. ,  0.4]])
Finally, save it as a PNG: matplotlib.pyplot.imsave('img.png', image)
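For completeness, a self-contained version of the same idea, with the indices passed as a tuple (which is what ufunc.at expects for multi-dimensional indexing in current NumPy) and random points instead of the hand-picked ones above:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
X, Y, Z = rng.random((3, 10))  # 10 random (x, y, z) points in [0, 1)
scale = 3  # image resolution
indices = tuple((scale * c).astype(int) for c in (X, Y))
image = np.zeros((scale, scale))
np.maximum.at(image, indices, Z)  # in place; keeps the max where points collide
plt.imsave('img.png', image)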
