I have a python dict
{'kValues': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'WSS': [21455, 5432, 4897, 4675, 4257, 3954, 3852, 3756, 3487],
'SS': [0.75, 0.85, 0.7, 0.52, 0.33, 0.38, 0.42, 0.46, 0.47]}
When I plot kValues against WSS and against SS, I get the following line plots.
The optimum value in the first plot is at k = 3, and in the second plot it is also at k = 3.
How can I extract that k value from the dictionary without visualizing the plots?
Criteria: the first plot always has an elbow, and that elbow point should be extracted; the second plot always has a rise followed by a dip, and the value at that rise (the peak) should be extracted.
You can use the angle between three points p1, p2 and p3: a point whose angle is near 90 degrees marks the elbow, and one near 0 degrees marks the dip.
I am sharing my code; the normalization is the tricky part.
import math

kValues = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
WSS = [81000, 21455, 5432, 4897, 4675, 4257, 3954, 3852, 3756, 3487]
SS = [0.75, 0.85, 0.7, 0.52, 0.33, 0.38, 0.42, 0.46, 0.47]

# get all the angles, with k values as x and WSS as y
angles = []
# scale factor so the k axis is on a comparable scale to the WSS axis
normalize_wss = 2000

for i in range(1, len(WSS) - 1):
    p1 = (kValues[i - 1] * normalize_wss, WSS[i - 1])
    p2 = (kValues[i] * normalize_wss, WSS[i])
    p3 = (kValues[i + 1] * normalize_wss, WSS[i + 1])
    # find the angle at vertex p2 between the directions p2->p3 and p2->p1
    angle1 = math.degrees(math.atan2(p3[1] - p2[1], p3[0] - p2[0]))
    angle2 = math.degrees(math.atan2(p1[1] - p2[1], p1[0] - p2[0]))
    angles.append([angle1 - angle2])

print(angles)
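To turn those angles into an answer without plotting, one option is to pick the k whose vertex angle is closest to 90 degrees, per the elbow criterion above. This is only a sketch; the vertex_angle helper is mine, not part of the original code:

# Normalize each turning angle to [0, 180] and pick the k closest to 90 degrees
def vertex_angle(a):
    a = abs(a) % 360
    return 360 - a if a > 180 else a

elbow_i = min(range(len(angles)), key=lambda i: abs(vertex_angle(angles[i][0]) - 90))
print(kValues[elbow_i + 1])  # angles[i] was computed at kValues[i + 1]; prints 3 for the lists above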
Found this wonderful Python library for finding the optimum value:
https://pypi.org/project/kneed/
from kneed import KneeLocator
kneedle = KneeLocator(kValues, WSS, S=1.0, curve="convex", direction="decreasing")
print(kneedle.knee) # 3
print(kneedle.elbow) # 3
The curve and direction arguments can be configured based on the shape of your data.
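For example, a curve that rises quickly and then levels off would be described as concave and increasing. This is just a sketch with made-up data to illustrate those parameter choices, not output from the question's data:

from kneed import KneeLocator

# Hypothetical concave, increasing curve: rises quickly, then flattens out
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [10, 55, 80, 90, 95, 97, 98, 99, 100]
kneedle_concave = KneeLocator(x, y, S=1.0, curve="concave", direction="increasing")
print(kneedle_concave.knee)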
You can use the derivative to find the peak in the SS:
import numpy as np
k_ss = kValues[np.where(np.sign(np.diff(SS, append=SS[-1])) == -1)[0][0]]
np.diff calculates the differences between consecutive elements (a discrete derivative), and then we find the first place where the derivative changes sign (i.e. where the curve goes over the peak).
For WSS it's a bit more complicated, because you have to define a threshold for what counts as an elbow; you could take a few examples from your data and tune it accordingly. Here is code where the threshold is set to 1/10 of the maximum derivative:
d = np.diff(WSS, append=WSS[-1])
th = max(abs(d)) / 10
k_wss = kValues[np.where(abs(d) < th)[0][0]]
Other than that, you can try to fit the data to an asymptotic curve and extract the constants.
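As a sketch of that last idea (the exponential-decay form and the starting values below are my assumptions, not something given in the question):

import numpy as np
from scipy.optimize import curve_fit

kValues = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])
WSS = np.array([21455, 5432, 4897, 4675, 4257, 3954, 3852, 3756, 3487], dtype=float)

# Assumed asymptotic form: WSS ~ a * exp(-b * k) + c, where c is the plateau
def asymptotic(k, a, b, c):
    return a * np.exp(-b * k) + c

# Rough order-of-magnitude starting values; tune p0 if the fit does not converge
params, _ = curve_fit(asymptotic, kValues, WSS, p0=(1e6, 2.0, 4000), maxfev=10000)
a, b, c = params
print(a, b, c)  # b (decay rate) and c (plateau) describe where the curve flattens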
Could someone help me fit the data in collapse_fractions with a lognormal function whose median and standard deviation are derived via the maximum likelihood method?
I tried scipy.stats.lognorm.fit(data), but I did not obtain the same values I got in Excel. The Excel file can be downloaded here: https://stacks.stanford.edu/file/druid:sw589ts9300/p_collapse_from_msa.xlsx
Also, any reference is really welcome.
import numpy as np
intensity_measure_vector = np.array([[0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1]])
no_analyses = 40
no_collapses = np.array([[0, 0, 0, 4, 6, 13, 12, 16]])
collapse_fractions = np.array(no_collapses/no_analyses)
print(collapse_fractions)
# array([[0. , 0. , 0. , 0.1 , 0.15 , 0.325, 0.3 , 0.4 ]])
collapse_fractions.shape
# (1, 8)
import matplotlib.pyplot as plt
plt.scatter(intensity_measure_vector, collapse_fractions)
I couldn't figure out how to get lognorm.fit to work, so I just implemented the functions from your Excel file and used scipy.optimize as the optimizer. The added benefit is that it is easier to understand what is actually going on compared to lognorm.fit, especially with the Excel file on the side.
Here is my implementation:
from functools import partial
import numpy as np
from scipy import optimize, stats
im = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1])
im_log = np.log(im)
number_of_analyses = np.array([40, 40, 40, 40, 40, 40, 40, 40])
number_of_collapses = np.array([0, 0, 0, 4, 6, 13, 12, 16])
FORMAT_STRING = "{:<20}{:<20}{:<20}"
print(FORMAT_STRING.format("sigma", "beta", "log_likelihood_sum"))
def neg_log_likelihood_sum(params, im_l, no_a, no_c):
    sigma, beta = params
    theoretical_fragility_function = stats.norm(np.log(sigma), beta).cdf(im_l)
    likelihood = stats.binom.pmf(no_c, no_a, theoretical_fragility_function)
    log_likelihood = np.log(likelihood)
    log_likelihood_sum = np.sum(log_likelihood)
    print(FORMAT_STRING.format(sigma, beta, log_likelihood_sum))
    return -log_likelihood_sum
neg_log_likelihood_sum_partial = partial(neg_log_likelihood_sum, im_l=im_log, no_a=number_of_analyses, no_c=number_of_collapses)
res = optimize.minimize(neg_log_likelihood_sum_partial, (1, 1), method="Nelder-Mead")
print(res)
And the final result is:
 final_simplex: (array([[1.07613697, 0.42927824],
       [1.07621925, 0.42935678],
       [1.07622438, 0.42924577]]), array([10.7977048 , 10.79770573, 10.79770723]))
           fun: 10.797704803509903
       message: 'Optimization terminated successfully.'
          nfev: 68
           nit: 36
        status: 0
       success: True
             x: array([1.07613697, 0.42927824])
The interesting part for you is the first line of final_simplex (and the x entry at the bottom): the same final result as in the Excel calculation (sigma = 1.07613697 and beta = 0.42927824).
If you have any questions about what I did here, don't hesitate to ask, as you said you are new to Python. A few things in advance:
I minimized the negative log-likelihood sum, as there is no maximizer in scipy.
partial from functools returns a function that has the specified arguments already fixed (in this case im_l, no_a and no_c, as they don't change); the partial function can then be called with just the missing argument. (There is a small illustration after these notes.)
The neg_log_likelihood_sum function is basically what's happening in the Excel file, so it should be easy to understand when viewing them side by side.
scipy.optimize.minimize minimizes the function given as the first argument by changing the parameters (the start values are given as the second argument). The method was chosen because it gave good results; I didn't dive deep into the abyss of different optimization methods, gradients, etc. So there may well be a better setup, but this one works fine and seems faster than the optimization with lognorm.fit.
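Here is the small illustration of partial mentioned in the notes above; the power function is just a toy example, unrelated to the fitting code:

from functools import partial

def power(base, exponent):
    return base ** exponent

# exponent is fixed in advance; the partial object only needs base
square = partial(power, exponent=2)
print(square(5))  # 25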
The plot, like the one in the Excel file, is produced as follows, using the result res from the optimization:
import matplotlib.pyplot as plt
x = np.linspace(0, 2.5, 100)
y = stats.norm(np.log(res["x"][0]), res["x"][1]).cdf(np.log(x))
plt.plot(x, y)
plt.scatter(im, number_of_collapses/number_of_analyses)
plt.show()
Good morning, everyone. I have a set of values.
Arr = np.array([0.11, 0.14, 0.22, 0.26, 0.31, 0.36, 0.44, 0.69, 0.70, 0.70, 0.70, 0.75, 0.98, 1.40])
I have constructed the CDF function in this way:
import numpy as np
import matplotlib.pyplot as plt

def ecdf(a):
    # sorted unique values and their cumulative relative frequencies
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]

def plot_ecdf(a):
    x, y = ecdf(a)
    # prepend a point at probability 0 so the step plot starts on the x-axis
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)

plot_ecdf(Arr)  # plot_ecdf expects the raw data, not the (x, y) tuple returned by ecdf
Obtaining this figure:
Now I want to divide the space (y-axis) into 5 parts. To do this I am using the following function:
from scipy.stats.qmc import LatinHypercube
engine = LatinHypercube(d=1)
sample = engine.random(n=5) #Array of float64
For example, I obtain 5 randomly generated values:
0.0886183
0.450613
0.808077
0.753524
0.343108
At this point I would like to find the corresponding values on the CDF, as in the picture.
I also observed that the CDF constructed this way takes a discrete set of values, which may not be optimal for my purpose.
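One way to get those values (a sketch, assuming you want to treat each sampled probability as a target quantile and invert the step ECDF; it reuses Arr, ecdf and engine from the code above):

x, y = ecdf(Arr)                     # sorted unique values and cumulative probabilities
probs = engine.random(n=5).ravel()   # 5 Latin hypercube probabilities, as above

# For each probability p, take the smallest value of Arr whose ECDF is >= p
idx = np.searchsorted(y, probs, side='left')
print(probs)
print(x[idx])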
I have the following data:
x = np.array([0, 0, 0, 0, 0, 0, 1, 3, 3, 5, 5, 5, 5, 7, 7, 14, 14, 15, 15, 15, 15, 25, 25, 25, 25, 25, 35, 35, 40, 40, 45, 45, 45, 45, 45, 45])
y = np.array([87.9, 91.3, 94.1, 173.9, 87.7, 117.8, 52.4, 46.5, 73.7, 63.3, 50.6, 56.8, 47.5, 30.3, 59.2, 38.7, 12.2, 25.7, 23.5, 37.3, 16.6, 25, 19.7, 27.2, 27.3, 11.1, 1.1, 0.1, 0.9, 0.1, 0.3, 0.5, 0.4, 1.2, 0.6, 1])
and I would like to perform weighted least squares optimization for the following model (as I have different equations for different data sets, I cannot simply use a log transformation to convert this to linear regression):
import numpy as np
from scipy.optimize import curve_fit

# defining a model
def model(x, slope):
    return 100 * np.exp(-slope * x)

# fit the parameters, weighting each data point by its inverse value: 1/y^K (where K = 1.2)
params, pcov = curve_fit(model, x, y, sigma=1/(y**1.2), absolute_sigma=False)
But I have no idea how to get the 95% prediction intervals as in the figure below (i.e. the 95% PI is wide at the beginning, from 41.5 to 158.6 at x = 0, and gets narrower with time, e.g. from -5 to 18 at x = 30):
Prediction intervals narrowing with time
I have tried calculating standard errors, MSE and the critical t-value, and using the relationship between confidence intervals and prediction intervals, but it probably doesn't work for a weighted fit:
import scipy.stats

# find T critical value (two-tailed inverse of the Student's t-distribution)
t_crit = scipy.stats.t.ppf(q=1-.05/2, df=75)
SE_CI = np.sqrt(np.diag(pcov))
MSE = np.mean((y - model(x, *params))**2)
# for some modelled data
x_pred = np.arange(50)
y_pred = 100 * np.exp(-params[0] * x_pred)
y_upper_CI = y_pred + t_crit*SE_CI
y_lower_CI = y_pred - t_crit*SE_CI
y_upper_PI = y_pred + np.sqrt((SE_CI)**2 + MSE)*t_crit
y_lower_PI = y_pred - np.sqrt((SE_CI)**2 + MSE)*t_crit
I have also found out that I might try:
Define G|x, which is the gradient of the model with respect to the parameters at a particular value of X, using all the best-fit values of the parameters. The result is a vector, with one element per parameter. For each parameter, it is defined as dY/dP, where Y is the Y value of the curve given the particular value of X and all the best-fit parameter values, and P is one of the parameters.
Cov is the covariance matrix (inverted Hessian from last iteration). It is a square matrix with the number of rows and columns equal to the number of parameters.
Now compute c = G|x * Cov * G'|x. The result is a single number for any value of X.
The prediction bands extend a further distance above and below the curve, equal to:
sqrt(c+1)*sqrt(SS/DF)*CriticalT(Confidence%, DF)
But I do not know how to implement this in Python (namely, how to get G|x, how to compute c = G|x * Cov * G'|x, and where to take the sum of squares SS from)...
Thank you in advance for your help!
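Not a definitive answer, but here is a sketch of how the quoted recipe could be coded, assuming a central-difference gradient for G|x and DF = n minus the number of fitted parameters, and reusing x, y, model, params and pcov from above. Note that the recipe is written for an unweighted fit, and whether scipy's pcov matches the Cov it refers to is a further assumption:

import numpy as np
import scipy.stats

def grad_at(x0, params, eps=1e-6):
    # Central-difference gradient of the model w.r.t. each parameter at x0 (G|x)
    p = np.asarray(params, dtype=float)
    g = np.empty_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = eps
        g[i] = (model(x0, *(p + step)) - model(x0, *(p - step))) / (2 * eps)
    return g

dof = len(y) - len(params)                   # DF
ss = np.sum((y - model(x, *params)) ** 2)    # SS, the residual sum of squares
t_crit = scipy.stats.t.ppf(1 - 0.05 / 2, df=dof)

x_pred = np.arange(50)
y_pred = model(x_pred, *params)
# c = G|x * Cov * G'|x, one number per value of x
c = np.array([grad_at(xi, params) @ pcov @ grad_at(xi, params) for xi in x_pred])
half_width = np.sqrt(c + 1) * np.sqrt(ss / dof) * t_crit
y_upper_PI = y_pred + half_width
y_lower_PI = y_pred - half_width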
Relatively new to Python, mainly using it for plotting things. I am currently attempting to determine a best-fit line using the 4-parameter logistic (4PL) equation and curve_fit from scipy. There are one or two sites showing how 4PL works, but I could not get them to work for my data. An example, with similar 4PL data, is below:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = [2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]
ydata = [0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1]
def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)

guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)
params
This gives a warning (there is also an exponent warning with the test data, but not with the real data):
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
And params just returns my initial guess. I have tried various initial guesses.
The best-fit line is drawn when plotting, but it is not a curve and does not go below x = 0 (I cannot find a reason why negative values would mess with the 4PL model).
4PL fit plotted
I'm not sure if I am doing something incorrect with the equation, or with how the curve_fit function works, or both. I have a similar issue using least squares instead of curve_fit. I've tried a bunch of variations based on similar equations for the fit, etc., but have been stuck for a while; any help pointing me in the right direction would be much appreciated.
I'm surprised you did not get any warnings, or did not share them with us. I can't analyze this task for you by scientific means, just some remarks about the technical stuff:
Observation
When running your code, you should see some warnings like:
RuntimeWarning: invalid value encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Don't ignore this!
Debugging
Add some prints to your function fourPL, probably for all the different components of your function, and look at what's happening.
Example:
def fourPL(x, A, B, C, D):
    print('1: ', (A-D))
    print('2: ', (x/C))
    print('3: ', (1.0+((x/C)**(B))))
    return ((A-D)/(1.0+((x/C)**(B))) + D)

...

params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess, maxfev=1)
# maxfev=1 -> let's just check one or a few iterations
Output:
1: -1.0
2: [ 4.60000000e+00 4.60000000e+00 4.00000000e+00 4.00000000e+00
3.40000000e+00 3.40000000e+00 2.00000000e+00 2.00000000e+00
2.00000000e-06 2.00000000e-06 -2.00000000e+00 -2.00000000e+00]
RuntimeWarning: invalid value encountered in power
print('3: ', (1.0+((x/C)**(B))))
3: [ 1.4662524 1.4662524 1.5 1.5 1.54232614
1.54232614 1.70710678 1.70710678 708.10678119 708.10678119
nan nan]
That's enough to stop. nans and infs are bad!
Theory
Now it's time for theory, and I won't do that here. But usually you should now think about the underlying theory and why these problems occur.
Is there something you missed in regard to the assumptions?
Repair (without checking theory)
Without checking the theory and just looking over some example found within 30 seconds: hmm, are negative x-values a problem?
Let's shift x (by the minimum; hardcoded 1 here):
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
Complete code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
ydata = np.array([0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1])
def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)
guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)#, maxfev=1)
x_min, x_max = np.amin(xdata), np.amax(xdata)
xs = np.linspace(x_min, x_max, 1000)
plt.scatter(xdata, ydata)
plt.plot(xs, fourPL(xs, *params))
plt.show()
Output:
RuntimeWarning: divide by zero encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Looks good, but it's time for another theory session: what did our linear-shift do to our results? I'm ignoring this again.
So just one warning and a nice-looking output.
If you want to remove that last warning, add some small epsilon so there are no 0's in xdata:
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1 + 1e-10
which will achieve the same result, without any warning.
I tried to code a formula from pattern recognition, but I cannot find a proper function to do the work. The problem is that I have a binary adjacency matrix A (M*N) and want to assign the value 1 or 0 to each cell. Every cell has a fixed probability P of being 1, and is zero otherwise. I searched for sampling methods in Python, and it seems that most of them only support sampling several elements from a list without considering probabilities. I really need help with this, and any idea is appreciated.
You could use:
A = (P > numpy.random.rand(4, 5)).astype(int)
Where P is your matrix of probabilities.
To make sure the probabilities are right you can test it using
import numpy

P = numpy.ones((4, 5)) * 0.2
S = numpy.zeros((4, 5))
for i in range(100000):
    S += (P > numpy.random.rand(4, 5)).astype(int)

print(S)         # each element should be approximately 20000
print(S.mean())  # the average should be approximately 20000, too
Let's say you have your matrix of adjacency probabilities, as follows:
import numpy as np

# Create your matrix of probabilities
matrix = np.random.randint(0, 10, (3, 3)) / 10.
# Returns, for example:
# array([[ 0. ,  0.4,  0.2],
#        [ 0.9,  0.7,  0.4],
#        [ 0.1,  0. ,  0.5]])

# Now you can use np.where; set the threshold as you like (here 0.5)
threshold = 0.5
np.where(matrix < threshold, 0, 1)
# Returns:
# array([[0, 0, 0],
#        [1, 1, 0],
#        [0, 0, 1]])