I am trying to fit Residence Time Distribution (RTD) data. An RTD is typically a skewed distribution. I have written a simple script that takes this unevenly spaced time series from the RTD.
Data set
timeArray = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 14.0]
concArray = [0.0, 0.6, 1.4, 5.0, 8.0, 10.0, 8.0, 6.0, 4.0, 3.0, 2.2, 1.5, 0.6, 0.0]
To fit the data I have been using SciPy's curve_fit function
parameters, covariance = curve_fit(nCSTR, time, conc, p0=guess)
and different models (e.g. CSTR, sine, Gaussian) to fit the data. However, I have had no success so far.
The RTD data that I have corresponds to a CSTR, and there is an equation that models this type of behavior very accurately.
#Generalize nCSTR model
y = (( (np.power(x/tau,n-1)) * np.power(n,n) ) / (tau * math.gamma(n)) ) * np.exp(-n*x/tau)
As a separate note: in the generalized nCSTR model I use gamma(n) instead of the factorial term (n-1)!, because the factorial cannot handle the non-integer values of n that come up during fitting.
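To illustrate the note above: for integer n, gamma(n) equals (n-1)!, and unlike the factorial it is also defined for non-integer n. A minimal check using only the standard library (the sample values 5 and 4.5 are arbitrary choices for illustration):

```python
import math

# For integer n, gamma(n) equals (n - 1)!
n = 5
print(math.gamma(n), math.factorial(n - 1))

# gamma also accepts non-integer arguments, which is what an optimizer needs
# when it tries values such as n = 4.5; math.factorial would raise here.
n_real = 4.5
gamma_real = math.gamma(n_real)
print(gamma_real)
```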
This CSTR model should be the one fitting the data without problem but for some reason is not able to do so. The outcome after executing my code:
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
timeArray = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0]
concArray = [0.0, 0.6, 1.4, 2.6, 5.0, 6.5, 8.0, 9.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.2, 1.8, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.3, 0.1, 0.0]
#Recast time and conc into numpy arrays
time = np.asarray(timeArray)
conc = np.asarray(concArray)
plt.plot(time, conc, 'o')
def nCSTR(x, tau, n):
    y = (( (np.power(x/tau,n-1)) * np.power(n,n) ) / (tau * math.gamma(n)) ) * np.exp(-n*x/tau)
    return y
guess = [1, 12]
parameters, covariance = curve_fit(nCSTR, time, conc, p0=guess)
tau = parameters[0]
n = parameters[1]
y = np.arange(0.0, len(time), 1.0)
for i in range(len(timeArray)):
    y[i] = (( (np.power(time[i]/tau,n-1)) * np.power(n,n) ) / (tau * math.gamma(n)) ) * np.exp(-n*time[i]/tau)
plt.plot(time,y)
produces this plot: Fitting Output (image)
I know I am missing something, and any help would be much appreciated. The model has been well known for decades, so the problem should not be in the equation. I generated some dummy data to confirm that the equation is written correctly, and the output had the type of profile that I am looking for. So the equation is fine.
import numpy as np
import math
import matplotlib.pyplot as plt
t = np.arange(0.0, 10.5, 0.5)
tau = 2
n = 5
y = np.arange(0.0, len(t), 1.0)
for i in range(len(t)):
    y[i] = (( (np.power(t[i]/tau,n-1)) * np.power(n,n) ) / (tau * math.gamma(n)) ) * np.exp(-n*t[i]/tau)
print(y)
plt.plot(t,y)
CSTR profile with Dummy Data (image)
If anyone is interested in the theory behind it, I recommend any reading on the tanks-in-series model (specifically for CSTRs); Fogler has a great book on this topic.
I think that the main problem is that your model does not allow for an overall scale factor or that your data may not be normalized as you expect.
If you'll permit me to convert your curve-fitting program to use lmfit (I am a lead author), you might do:
import numpy as np
from scipy.special import gamma
import matplotlib.pyplot as plt
from lmfit import Model
timeArray = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0]
concArray = [0.0, 0.6, 1.4, 2.6, 5.0, 6.5, 8.0, 9.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.2, 1.8, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.3, 0.1, 0.0]
#Recast time and conc into numpy arrays
time = np.asarray(timeArray)
conc = np.asarray(concArray)
plt.plot(time, conc, 'o', label='data')
def nCSTR(x, scale, tau, n):
    """scaled CSTR model"""
    z = n*x/tau
    return scale * np.exp(-z) * z**(n-1) * (n/(tau*gamma(n)))
# create a Model for your model function
cmodel = Model(nCSTR)
# now create a set of Parameters for your model (note that parameters
# are named using your function arguments), and give initial values
params = cmodel.make_params(tau=3, scale=10, n=10)
# since you have `xxx**(n-1)`, setting a lower bound of 1 on `n`
# is wise, otherwise you would have to handle complex values
params['n'].min = 1
# now fit the model to your `conc` data with those parameters
# (and also passing in independent variables using `x`: the argument
# name from the signature of the model function)
result = cmodel.fit(conc, params, x=time)
# print out a report of the results
print(result.fit_report())
# you do not need to construct the best fit yourself, it is in `result`:
plt.plot(time, result.best_fit, label='fit')
plt.legend()
plt.show()
This will print out a report that includes statistics and uncertainties:
[[Model]]
Model(nCSTR)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 29
# data points = 29
# variables = 3
chi-square = 2.84348862
reduced chi-square = 0.10936495
Akaike info crit = -61.3456602
Bayesian info crit = -57.2437727
R-squared = 0.98989860
[[Variables]]
scale: 49.7615649 +/- 0.81616118 (1.64%) (init = 10)
tau: 5.06327482 +/- 0.05267918 (1.04%) (init = 3)
n: 4.33771512 +/- 0.14012112 (3.23%) (init = 10)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, n) = -0.521
C(scale, tau) = 0.477
C(tau, n) = -0.406
and generate a plot of the data together with the best-fit curve.
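The same scaled model can also be fit with scipy.optimize.curve_fit, which the question used. The sketch below is one way to do that, not the only one: the function name scaled_ncstr, the starting values (copied from the lmfit example) and the lower bounds on scale and tau are assumptions made here; only the bound n >= 1 comes from the answer above.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import gamma

# Data from the question: 29 samples, every 0.5 time units from 0 to 14.
time = np.arange(0.0, 14.5, 0.5)
conc = np.array([0.0, 0.6, 1.4, 2.6, 5.0, 6.5, 8.0, 9.0, 10.0, 9.0,
                 8.0, 7.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.2, 1.8,
                 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.3, 0.1, 0.0])

def scaled_ncstr(x, scale, tau, n):
    """nCSTR residence-time curve multiplied by an overall scale factor."""
    z = n * x / tau
    return scale * np.exp(-z) * z ** (n - 1) * n / (tau * gamma(n))

p0 = [10.0, 3.0, 10.0]                 # scale, tau, n (same start as the lmfit fit)
lower = [0.0, 1e-3, 1.0]               # assumed bounds; n >= 1 avoids complex values
upper = [np.inf, np.inf, np.inf]
popt, pcov = curve_fit(scaled_ncstr, time, conc, p0=p0, bounds=(lower, upper))

scale_fit, tau_fit, n_fit = popt
sse = float(np.sum((conc - scaled_ncstr(time, *popt)) ** 2))
print(scale_fit, tau_fit, n_fit, sse)
```

If the optimizer reaches the same minimum as the lmfit run, the sum of squared residuals should be close to the chi-square value in the report above.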
step = [0.1,0.2,0.3,0.4,0.5]
static = []
for x in step:
    range = np.arange(5, 10 + x, x)
    static.append(range)
# this returns a list that looks something like this: [[5.,5.1,5.2,...],[5.,5.2,5.4,...],[5.,5.3,5.6,...],...]
I'm trying to create standard and dynamic stop/step ranges from 5.0 to 10.0. For the standard ranges, I used a list of steps and looped over it to get the different interval lists.
What I want now is varying step sizes within the 5.0-10.0 interval. For example, from 5.0 to 7.3 the step size is 0.2, from 7.3 to 8.3 the step is 0.5, and from 8.3 to 10.0 the step is, let's say, 0.8. What I don't understand is how to make the dynamic version run through and produce all the possible combinations.
Using a list of steps and a list of "milestones" that we are going to use to determine the start and end points of each np.arange, we can do this:
import numpy as np
def dynamic_range(milestones, steps) -> list:
    start = milestones[0]
    dynamic_range = []
    for end, step in zip(milestones[1:], steps):
        dynamic_range += np.arange(start, end, step).tolist()
        start = end
    return dynamic_range
print(dynamic_range(milestones=(5.0, 7.3, 8.3, 10.0), steps=(0.2, 0.5, 0.8)))
# [5.0, 5.2, 5.4, 5.6, 5.8, 6.0, 6.2, 6.4, 6.6, 6.8, 7.0,
# 7.2, 7.3, 7.8, 8.3, 8.3, 9.1, 9.9]
Note on performance: this answer assumes that you are going to use a few hundred points in your dynamic range. If you want millions of points, we should try another approach with pure numpy and no list concatenation.
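For the large-array case mentioned above, one possible pure-NumPy variant builds each segment with np.arange and joins them once with np.concatenate. This is a sketch, not a benchmarked solution, and the name dynamic_range_np is made up here. Like np.arange itself, it may include or drop a segment endpoint when the step is not exactly representable in floating point, so the example uses milestones and steps that are.

```python
import numpy as np

def dynamic_range_np(milestones, steps):
    """Build one np.arange per segment and join them with a single concatenate."""
    segments = [
        np.arange(start, end, step)
        for start, end, step in zip(milestones[:-1], milestones[1:], steps)
    ]
    return np.concatenate(segments)

# Two segments: step 0.25 on [0, 1) and step 0.5 on [1, 2).
result = dynamic_range_np(milestones=(0.0, 1.0, 2.0), steps=(0.25, 0.5))
print(result)
```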
If you want the values to stay within the <5, 10> interval, then don't add x to 10:
import numpy as np
step = [0.1, 0.2, 0.3, 0.4, 0.5]
static = []
for x in step:
    range = np.arange(5, 10, x)
    static.append(range)
print(static)
Dynamic:
import numpy as np
step = [0.1, 0.2, 0.3, 0.4, 0.5]
breakingpoints=[6,7,8,9,10]
dinamic = []
i=0
startingPoint=5
for x in step:
    #print(breakingpoints[i])
    range = np.arange(startingPoint, breakingpoints[i], x)
    dinamic.append(range)
    i+=1
    #print(range[-1])
    startingPoint=range[-1]
print(dinamic)
I have created several circles with different origins using Python and I am trying to implement a function that will divide each circle into n number of equal parts along the circumference. I am trying to populate an array that contains the starting [x,y] coordinate for each part on the circumference.
My code is as follows:
def fnCalculateArcCoordinates(self,intButtonCount,radius,center):
    lstButtonCoord = []
    #for degrees in range(0,360,intAngle):
    for arc in range(1,intButtonCount + 1):
        degrees = arc * 360 / intButtonCount
        xDegreesCoord = int(center[0] + radius * math.cos(math.radians(degrees)))
        yDegreesCoord = int(center[1] + radius * math.sin(math.radians(degrees)))
        lstButtonCoord.append([xDegreesCoord,yDegreesCoord])
    return lstButtonCoord
When I run the code for 3 parts, an example of the set of coordinates that are returned are:
[[157, 214], [157, 85], [270, 149]]
This means the segments are of different sizes. Could someone please help me identify where my error is?
The exact results of such trigonometric calculations are rarely exact integers. By truncating them with int, you lose some precision, of course. Checking the squared (Pythagorean) distances between the returned points suggests that your math is correct, since they are approximately equal:
(270-157)**2 + (149-85)**2
# 16865
(270-157)**2 + (214-149)**2
# 16994
(157-157)**2 + (214-85)**2
# 16641
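To see the effect of truncation directly, the sketch below computes the points as floats, then truncates them with int as in the question, and compares the chord lengths between neighbouring points. The radius and center are hypothetical values picked for illustration (the question does not state them), and math.dist needs Python 3.8 or later.

```python
import math

def circle_points(count, radius, center):
    """Return `count` evenly spaced (x, y) float points on a circle."""
    cx, cy = center
    return [
        (cx + radius * math.cos(2 * math.pi * i / count),
         cy + radius * math.sin(2 * math.pi * i / count))
        for i in range(count)
    ]

def chord_lengths(points):
    """Distance from each point to the next one, wrapping around at the end."""
    return [math.dist(p, q) for p, q in zip(points, points[1:] + points[:1])]

# Hypothetical radius and center, chosen only for illustration.
exact = circle_points(3, 75, (195, 149))
truncated = [(int(x), int(y)) for x, y in exact]

exact_chords = chord_lengths(exact)          # equal up to floating-point error
truncated_chords = chord_lengths(truncated)  # differ slightly after int()
print(exact_chords)
print(truncated_chords)
```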
Furthermore, you can use the built-in complex number type and the cmath module. In particular cmath.rect converts polar coordinates (a radius and an angle) into rectangular coordinates:
import cmath
def calc(count, radius, center):
    x, y = center
    for i in range(count):
        r = cmath.rect(radius, (2*cmath.pi)*(i/count))
        yield [round(x+r.real, 2), round(y+r.imag, 2)]
list(calc(4, 2, [0, 0]))
# [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [-0.0, -2.0]]
list(calc(6, 1, [0, 0]))
# [[1.0, 0.0], [0.5, 0.87], [-0.5, 0.87], [-1.0, 0.0], [-0.5, -0.87], [0.5, -0.87]]
You may want to adjust the rounding as you see fit.
I am currently running tests between XGBoost/lightGBM for their ability to rank items. I am reproducing the benchmarks presented here: https://github.com/guolinke/boosting_tree_benchmarks.
I have been able to successfully reproduce the benchmarks mentioned in their work. I want to make sure that I am correctly implementing my own version of the ndcg metric and also understanding the ranking problem correctly.
My questions are:
When validating on the test set using ndcg, there is a test.group file that says the first X rows are group 0, etc. To get the recommendations for a group, do I take the predicted values and known relevance scores and sort that list by descending predicted value for each group?
To get the final ndcg score from the lists created above, do I compute the ndcg score for each group and take the mean over all of them? Is this the same evaluation methodology that XGBoost/LightGBM use in the evaluation phase?
Here is my methodology for evaluating the test set after the model has finished training.
For the final tree when I run lightGBM I obtain these values on the validation set:
[500] valid_0's ndcg@1: 0.513221 valid_0's ndcg@3: 0.499337 valid_0's ndcg@5: 0.505188 valid_0's ndcg@10: 0.523407
My final step is to take the predicted output for the test set and calculate the ndcg values for the predictions.
Here is my python code for calculating ndcg:
import numpy as np
def dcg_at_k(r, k):
    # np.asfarray was removed in NumPy 2.0; np.asarray with dtype=float is equivalent
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.
def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 0.
    return dcg_at_k(r, k) / idcg
After I get the predictions for the test set for a particular group (GROUP-0) I have these predictions:
query_id predict
0 0 (2.0, -0.221681199441)
1 0 (1.0, 0.109895548348)
2 0 (1.0, 0.0262799346312)
3 0 (0.0, -0.595343431322)
4 0 (0.0, -0.52689043426)
5 0 (0.0, -0.542221350664)
6 0 (1.0, -0.448015576024)
7 0 (1.0, -0.357090949646)
8 0 (0.0, -0.279677741045)
9 0 (0.0, 0.2182200869)
NOTE
Group-0 actually has about 112 rows.
I then sort the list of tuples in descending order which provides a list of relevance scores:
def get_recommendations(x):
    sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)
    return [k for k, _ in sorted_list]
relavance = evaluation.groupby('query_id').predict.apply(get_recommendations)
query_id
0 [4.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
1 [4.0, 2.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 1.0, ...
2 [2.0, 3.0, 2.0, 2.0, 1.0, 0.0, 2.0, 2.0, 1.0, ...
3 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
4 [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...
Finally, for each query id I calculated the ndcg scores on the relevance list and then take the mean of all the ndcg scores calculated for each query id:
relavance.apply(lambda x: ndcg_at_k(x, 10)).mean()
The value I obtain is ~0.497193.
Cross-posting my Cross Validated answer to this cross-posted question:
https://stats.stackexchange.com/questions/303385/how-does-xgboost-lightgbm-evaluate-ndcg-metric-for-ranking/487487#487487
I happened across this myself, and finally dug into the code to figure it out.
The difference is the handling of a missing IDCG. Your code returns 0, while LightGBM is treating that case as a 1.
The following code produced matching results for me:
import numpy as np
def dcg_at_k(r, k):
    # np.asfarray was removed in NumPy 2.0; np.asarray with dtype=float is equivalent
    r = np.asarray(r, dtype=float)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.
def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 1. # CHANGE THIS
    return dcg_at_k(r, k) / idcg
I think the problem is caused by queries in which all the documents have the same label.
In that case, both XGBoost and LightGBM will produce an ndcg of 1 for that query.
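To make the two conventions concrete, the sketch below scores three small made-up queries and shows how the mean changes depending on what is returned when the ideal DCG is zero. The parameter name empty_value and the example relevance lists are assumptions introduced here for illustration; the behaviour attributed to LightGBM is taken from the answers above, not verified independently.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG with gain 2**rel - 1 and log2 position discount, truncated at k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances, k, empty_value=1.0):
    """NDCG; `empty_value` is returned when the ideal DCG is zero."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        return empty_value
    return dcg_at_k(relevances, k) / ideal

# Made-up relevance lists, already ordered by descending predicted score.
queries = {"all_zero": [0, 0, 0], "all_equal": [2, 2, 2], "mixed": [0, 1, 2]}

mean_as_one = np.mean([ndcg_at_k(r, 10, empty_value=1.0) for r in queries.values()])
mean_as_zero = np.mean([ndcg_at_k(r, 10, empty_value=0.0) for r in queries.values()])
print(mean_as_one, mean_as_zero)
```

A query whose labels are all equal and non-zero scores 1 under either convention, because its DCG already equals its ideal DCG; only the all-zero query is affected by the choice.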
I have a hypothetical function y of x and am trying to fit a lognormal distribution curve that matches the data best. I am using the curve_fit function and was able to fit a normal distribution, but the curve does not look optimal.
Below are the given y and x data points, where y = f(x).
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
y-axis are probabilities of an event occurring in x-axis time bins:
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
I was able to get a better fit to my data using Excel and a lognormal approach. When I attempt to use a lognormal in Python, the fit does not work, so I must be doing something wrong.
Below is the code I have for fitting a normal distribution, which seems to be the only one that I can fit in Python (hard to believe):
# fitting distribution on top of savitzky-golay
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats
import numpy as np
from scipy.stats import norm, gamma, lognorm, halflogistic, foldcauchy
from scipy.optimize import curve_fit
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# results from savgol
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
# def gamma_f(x, a, loc, scale):
# return gamma.pdf(x, a, loc, scale)
def norm_f(x, loc, scale):
    # print 'loc: ', loc, 'scale: ', scale, "\n"
    return norm.pdf(x, loc, scale)
fitting = norm_f
# param_bounds = ([-np.inf,0,-np.inf],[np.inf,2,np.inf])
result = curve_fit(fitting, x_axis, y_axis)
result_mod = result
# mod scale
# results_adj = [result_mod[0][0]*.75, result_mod[0][1]*.85]
plt.plot(x_axis, y_axis, 'ro')
plt.bar(x_axis, y_axis, 1, alpha=0.75)
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0,35,0,.1])
# convert back into probability
y_norm_fit = [fitting(_, *result[0]) for _ in x_axis]
y_fit = [_*sum_ys for _ in y_norm_fit]
print(list(y_fit))
plt.show()
I am trying to get answers to two questions:
Is this the best fit I will get from a normal distribution curve? How can I improve the fit?
Normal distribution result: (image)
How can I fit a lognormal distribution to this data, or is there a better distribution that I can use?
I was playing around with a lognormal distribution curve, adjusting mu and sigma, and it looks like a better fit is possible. I don't understand what I am doing wrong that prevents me from getting similar results in Python.
Actually, a Gamma distribution might be a good fit, as @Glen_b proposed. I'm using the second definition, with \alpha and \beta.
NB: a trick I use for a quick fit is to compute the mean and variance; for a typical two-parameter distribution this is enough to recover the parameters and get a quick idea of whether it is a good fit or not.
Code
import math
import matplotlib.pyplot as plt
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * x_axis[k]
v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - m)
    v += y_axis[k] * t * t
print(m, v)
b = m/v
a = m * b
print(a, b)
z = []
for k in range(0, len(x_axis)):
    q = b**a * x_axis[k]**(a-1.0) * math.exp( - b*x_axis[k] ) / math.gamma(a)
    z.append(q)
plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
A discrete distribution might look better; your x values are all integers, after all. You have a distribution with a variance about 3 times higher than the mean, and it is asymmetric, so most likely something like a Negative Binomial might work quite well. Here is a quick fit:
r is a bit above 6, so you might want to move to a distribution with real r, the Polya distribution.
Code
from scipy.special import comb  # scipy.misc.comb was removed from SciPy
import matplotlib.pyplot as plt
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]
s = 1.0 # shift by 1 to have them all at 0
m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * (x_axis[k] - s)
v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - s - m)
    v += y_axis[k] * t * t
print(m, v)
p = 1.0 - m/v
r = int(m*(1.0 - p) / p)
print(p, r)
z = []
for k in range(0, len(x_axis)):
    q = comb(k + r - 1, k) * (1.0 - p)**r * p**k
    z.append(q)
plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
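The remark about moving to a real-valued r can be sketched with scipy.stats.nbinom, which accepts a non-integer first shape parameter. Two things here are assumptions rather than part of the answer: the helper name nbinom_from_moments, and the example mean and variance, which are made-up numbers with a 1:3 ratio rather than values computed from the data. Note also that SciPy's p is the success probability, i.e. 1 - p in the code above.

```python
import numpy as np
from scipy.stats import nbinom

def nbinom_from_moments(mean, variance):
    """Method-of-moments (n, p) in scipy.stats.nbinom's convention.

    Requires variance > mean (overdispersion); n may be non-integer.
    """
    if variance <= mean:
        raise ValueError("negative binomial needs variance > mean")
    p = mean / variance            # SciPy's p; equals 1 - p of the code above
    n = mean * p / (1.0 - p)       # real-valued r, no int() truncation
    return n, p

# Hypothetical moments, chosen only so the arithmetic is easy to follow.
mean, variance = 11.0, 33.0
n, p = nbinom_from_moments(mean, variance)

k = np.arange(34)                  # shifted support 0..33, as in the code above
pmf = nbinom.pmf(k, n, p)
print(n, p, pmf.sum())
```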
Note that if a lognormal curve is correct and you take logs of both variables, you should have a quadratic relationship; even if that's not a suitable scale for a final model (because of variance effects -- if your variance is near constant on the original scale it will overweight the small values) it should at least give a good starting point for a nonlinear fit.
Indeed, aside from the first two points, this looks fairly good:
a quadratic fit to the solid points would describe the data quite well and should give suitable starting values if you then want to do a nonlinear fit.
(If error in x is at all possible, the lack of fit at the lowest x may be as much an issue of error in x as of error in y.)
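The log-log quadratic idea can be sketched as follows. For a scaled lognormal curve, log(y) is a quadratic in u = log(x) with leading coefficient -1/(2 sigma^2) and linear coefficient mu/sigma^2 - 1, so a polynomial fit gives starting values for mu and sigma. The function name lognormal_start_values and the synthetic test curve are assumptions made for this example; it also assumes all y are positive and that the fitted parabola opens downward.

```python
import numpy as np

def lognormal_start_values(x, y):
    """Estimate (mu, sigma) from a quadratic fit of log(y) against log(x)."""
    u = np.log(x)
    c2, c1, _c0 = np.polyfit(u, np.log(y), 2)  # highest power first
    sigma2 = -1.0 / (2.0 * c2)                 # needs c2 < 0
    mu = (c1 + 1.0) * sigma2
    return mu, np.sqrt(sigma2)

# Synthetic, noise-free scaled lognormal curve with known parameters.
x = np.arange(1.0, 35.0)
mu_true, sigma_true, amplitude = 2.5, 0.4, 3.0
y = (amplitude / (x * sigma_true * np.sqrt(2.0 * np.pi))
     * np.exp(-(np.log(x) - mu_true) ** 2 / (2.0 * sigma_true ** 2)))

mu0, sigma0 = lognormal_start_values(x, y)
print(mu0, sigma0)
```

On real data the estimates would only be rough, for the variance reasons given above, and are meant to seed a nonlinear fit.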
Incidentally, that plot seems to hint that a gamma curve may fit a little better overall than a lognormal one (in particular if you don't want to reduce the impact of those first two points relative to points 4-6). A good initial fit for that can be had by regressing log(y) on x and log(x):
The scaled gamma density is g = c.x^(a-1) exp(-bx) ... taking logs, you get log(g) = log(c) + (a-1) log(x) - b x = b0 + b1 log(x) + b2 x ... so supplying log(x) and x to a linear regression routine will fit that. The same caveats about variance effects apply (so it might be best as a starting point for a nonlinear least squares fit if your relative error in y isn't nearly constant).
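The regression described above can be sketched with an ordinary least-squares solve: regress log(y) on a constant, log(x) and x, then map the coefficients back with c = exp(b0), a = b1 + 1 and b = -b2. The function name gamma_start_values and the synthetic curve are assumptions introduced for illustration; the same caveat about variance effects applies, so the result is best treated as a starting point.

```python
import numpy as np

def gamma_start_values(x, y):
    """Fit log(y) = b0 + b1*log(x) + b2*x and return (c, a, b)."""
    x = np.asarray(x, dtype=float)
    design = np.column_stack([np.ones_like(x), np.log(x), x])
    (b0, b1, b2), *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return np.exp(b0), b1 + 1.0, -b2

# Synthetic, noise-free scaled gamma curve g = c * x**(a-1) * exp(-b*x).
x = np.arange(1.0, 35.0)
c_true, a_true, b_true = 2.0, 3.0, 0.5
y = c_true * x ** (a_true - 1.0) * np.exp(-b_true * x)

c0, a0, b0 = gamma_start_values(x, y)
print(c0, a0, b0)
```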
In Python, I explained a trick here for fitting a LogNormal very simply using the OpenTURNS library:
import numpy as np
import openturns as ot
# N is the total number of pseudo-observations to generate; it is not defined in this snippet and must be set beforehand
n_times = [int(y_axis[i] * N) for i in range(len(y_axis))]
S = np.repeat(x_axis, n_times)
sample = ot.Sample([[p] for p in S])
fitdist = ot.LogNormalFactory().buildAsLogNormal(sample)
That's it!
print(fitdist) shows LogNormal(muLog = 2.92142, sigmaLog = 0.305, gamma = -6.24996)
and the fit looks good:
import matplotlib.pyplot as plt
plt.hist(S, density=True, color='grey', bins=34, alpha=0.5)
plt.scatter(x_axis, y_axis, color= 'red')
plt.plot(x_axis, fitdist.computePDF(ot.Sample([[p] for p in x_axis])), color = 'black')
plt.show()