I am trying to fit a lognormal distribution, but the resulting parameters seem a bit odd. Could you please show me my mistake, or explain whether I am misinterpreting the parameters?
import numpy as np
import scipy.stats as st
data = np.array([1050000, 1100000, 1230000, 1300000, 1450000, 1459785, 1654000, 1888000])
s, loc, scale = st.lognorm.fit(data)
#calculating the mean
lognorm_mean = st.lognorm.mean(s = s, loc = loc, scale = scale)
The resulting mean is: 945853602904015.8.
But this doesn't make any sense.
The mean should be:
data_ln = np.log(data)
ln_mean = np.mean(data_ln)
ln_std = np.std(data_ln)
mean = np.exp(ln_mean + np.power(ln_std, 2)/2)
Here the resulting mean is 1391226.31. This should be correct.
Can you please help me with this topic?
Best regards
Norbi
I think you can tune the parameters of the minimizer to get an acceptable result:
import numpy as np
import scipy.stats as st
from scipy.optimize import minimize
data = np.array([1050000, 1100000, 1230000, 1300000,
                 1450000, 1459785, 1654000, 1888000])

def opti_wrap(fun, x0, args, disp=0, **kwargs):
    return minimize(fun, x0, args=args, method='SLSQP',
                    tol=1e-12, options={'maxiter': 1000}).x
s, loc, scale = st.lognorm.fit(data, optimizer=opti_wrap)
lognorm_mean = st.lognorm.mean(s=s, loc=loc, scale=scale)
print(lognorm_mean) # should give 1392684.4350
The reason you are seeing a strange result is that the default minimizer fails to converge on the maximum-likelihood result. This could be due to a misbehaving cost function with so few data points (you are trying to fit 3 parameters with only 8 data points...). Note: I'm using scipy version 1.1.0.
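Alternatively, if the goal is just to reproduce the log-space calculation from the question, a minimal sketch (my own suggestion, assuming a two-parameter lognormal is acceptable) is to fix the location at zero with floc=0, which makes the maximum-likelihood estimates analytic:

import numpy as np
import scipy.stats as st

data = np.array([1050000, 1100000, 1230000, 1300000, 1450000, 1459785, 1654000, 1888000])

# With loc fixed at 0, the MLE gives s = std(log(data)) and scale = exp(mean(log(data))),
# so lognorm.mean equals exp(mu + sigma**2/2), i.e. the hand-computed 1391226.31.
s, loc, scale = st.lognorm.fit(data, floc=0)
print(st.lognorm.mean(s=s, loc=loc, scale=scale))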
I'm running the code below:
import numpy as np
from lmfit import Model
def exp_model(x, ampl1=1.0, tau1=0.1):
    exponential = ampl1*np.exp(-x/tau1)
    return exponential
x = np.array([2.496,2.528,2.56,2.592,2.624])
y = np.array([8774.52,8361.68,7923.42,7502.43,7144.11])
dec_model = Model(exp_model, nan_policy='propagate')
results = dec_model.fit(y, x=x, ampl1=y[0])
results.plot()
The result I get (plot not shown here) makes it clear that the fit is just failing for some reason. I can't figure out why. It had worked for similar data before. Any help would be greatly appreciated.
It wasn't converging because the initial value for the tau1 parameter was too far away from the real value. The code below works well.
import numpy as np
from lmfit import Model
def exp_model(x, ampl1=1.0, tau1=1.0):  # the initial value of tau1 was changed from 0.1 to 1.0
    exponential = ampl1*np.exp(-x/tau1)
    return exponential
x = np.array([2.496,2.528,2.56,2.592,2.624])
y = np.array([8774.52,8361.68,7923.42,7502.43,7144.11])
dec_model = Model(exp_model, nan_policy='propagate')
results = dec_model.fit(y, x=x, ampl1=y[0])
results.plot()
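As a side note, lmfit also lets you keep the original function default and simply supply a better starting value in the fit call; a minimal sketch along those lines (same model and data as above):

import numpy as np
from lmfit import Model

def exp_model(x, ampl1=1.0, tau1=0.1):  # original default for tau1 kept on purpose
    return ampl1*np.exp(-x/tau1)

x = np.array([2.496, 2.528, 2.56, 2.592, 2.624])
y = np.array([8774.52, 8361.68, 7923.42, 7502.43, 7144.11])

dec_model = Model(exp_model, nan_policy='propagate')
# A better starting value for tau1 is passed directly to fit(),
# overriding the function default of 0.1.
results = dec_model.fit(y, x=x, ampl1=y[0], tau1=1.0)
print(results.fit_report())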
Following this post, I tried to create a logit-normal distribution by creating the LogitNormal class:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit
from scipy.stats import norm, rv_continuous
class LogitNormal(rv_continuous):
    def _pdf(self, x, **kwargs):
        return norm.pdf(logit(x), **kwargs)/(x*(1-x))

class OtherLogitNormal:
    def pdf(self, x, **kwargs):
        return norm.pdf(logit(x), **kwargs)/(x*(1-x))
fig, ax = plt.subplots()
values = np.linspace(10e-10, 1-10e-10, 1000)
sigma, mu = 1.78, 0
ax.plot(
    values, LogitNormal().pdf(values, loc=mu, scale=sigma), label='subclassed'
)
ax.plot(
    values, OtherLogitNormal().pdf(values, loc=mu, scale=sigma),
    label='not subclassed'
)
ax.legend()
fig.show()
However, the LogitNormal class does not produce the desired results. When I don't subclass rv_continuous it works. Why is that? I need the subclassing to work because I also need the other methods that come with it like rvs.
By the way, the only reason I am creating my own logit-normal distribution in Python is that the only implementations of that distribution I could find were in the PyMC3 and TensorFlow packages, both of which are pretty heavy / overkill if you only need them for that one function. I already tried PyMC3, but apparently it doesn't play well with scipy; it always crashed for me. But that's a whole different story.
Foreword
I came across this problem this week and the only relevant issue I have found about it is this post. I have almost the same requirements as the OP:
Having a random variable for the logit-normal distribution.
But I also need:
To be able to perform statistical tests as well;
While being compliant with the scipy random variable interface.
As #Jacques Gaudin pointed out, the rv_continuous interface (see the distribution architecture for details) does not forward the loc and scale parameters to _pdf when inheriting from this class, which is somewhat misleading and unfortunate.
Implementing the __init__ method does allow you to create the missing binding, but the trade-off is that it breaks the pattern scipy currently uses to implement random variables (see, for example, the implementation of the lognormal distribution).
So I took the time to dig into the scipy code and created an MCVE for this distribution. Although it is not totally complete (it mainly lacks overrides for the moments), it fits the bill for both the OP's purposes and mine, with satisfactory accuracy and performance.
MCVE
An interface-compliant implementation of this random variable could be:
import numpy as np
from scipy import special, stats

class logitnorm_gen(stats.rv_continuous):

    def _argcheck(self, m, s):
        return (s > 0.) & (m > -np.inf)

    def _pdf(self, x, m, s):
        return stats.norm(loc=m, scale=s).pdf(special.logit(x))/(x*(1-x))

    def _cdf(self, x, m, s):
        return stats.norm(loc=m, scale=s).cdf(special.logit(x))

    def _rvs(self, m, s, size=None, random_state=None):
        return special.expit(m + s*random_state.standard_normal(size))

    def fit(self, data, **kwargs):
        return stats.norm.fit(special.logit(data), **kwargs)

logitnorm = logitnorm_gen(a=0.0, b=1.0, name="logitnorm")
This implementation unlocks most of the potential of scipy random variables:
N = 1000
law = logitnorm(0.24, 1.31) # Defining a RV
sample = law.rvs(size=N) # Sampling from RV
params = logitnorm.fit(sample) # Infer parameters w/ MLE
check = stats.kstest(sample, law.cdf) # Hypothesis testing
bins = np.arange(0.0, 1.1, 0.1) # Bin boundaries
expected = np.diff(law.cdf(bins))       # Expected bin probabilities (times N for counts)
As it relies on the scipy normal distribution, we may assume the underlying functions have the same accuracy and performance as the normal random variable object. But it may indeed suffer from floating-point inaccuracy, especially when dealing with highly skewed distributions at the support boundaries.
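As a small illustration of that boundary issue (my own example): close to the upper bound, expit rounds to exactly 1.0 in double precision, so the logit transform used by the pdf, cdf and fit blows up:

from scipy import special

# expit(40) = 1/(1 + e**-40); e**-40 is about 4e-18, below double-precision resolution,
# so the result rounds to exactly 1.0 and the round trip through logit is lost.
x = special.expit(40.0)
print(x)                 # 1.0
print(special.logit(x))  # inf -> the pdf/cdf formulas break down here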
Tests
To check how it performs, we draw some distributions of interest and check them.
Let's create some fixtures:
import itertools

def generate_fixtures(
    locs=[-2.0, -1.0, 0.0, 0.5, 1.0, 2.0],
    scales=[0.32, 0.56, 1.00, 1.78, 3.16],
    sizes=[100, 1000, 10000],
    seeds=[789, 123456, 999999]
):
    for (loc, scale, size, seed) in itertools.product(locs, scales, sizes, seeds):
        yield {"parameters": {"loc": loc, "scale": scale}, "size": size, "random_state": seed}
And perform checks on related distributions and samples:
eps = 1e-8
x = np.linspace(0. + eps, 1. - eps, 10000)

for fixture in generate_fixtures():
    # Reference:
    parameters = fixture.pop("parameters")
    normal = stats.norm(**parameters)
    sample = special.expit(normal.rvs(**fixture))
    # Logit Normal Law:
    law = logitnorm(m=parameters["loc"], s=parameters["scale"])
    check = law.rvs(**fixture)
    # Fit:
    p = logitnorm.fit(sample)
    trial = logitnorm(*p)
    resample = trial.rvs(**fixture)
    # Hypothesis Tests:
    ks = stats.kstest(check, trial.cdf)
    bins = np.histogram(resample)[1]
    obs = np.diff(trial.cdf(bins))*fixture["size"]
    ref = np.diff(law.cdf(bins))*fixture["size"]
    chi2 = stats.chisquare(obs, ref, ddof=2)
Fits with n=1000 and seed=789 (this sample is quite normal) are shown in the original figures (omitted here).
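A minimal sketch of how such a comparison can be drawn with the logitnorm object defined above (the parameter values are illustrative, taken from the fixtures):

import numpy as np
import matplotlib.pyplot as plt

# Compare the histogram of a fitted sample against the reference pdf.
law = logitnorm(0.0, 1.78)
sample = law.rvs(size=1000, random_state=789)
m, s = logitnorm.fit(sample)

u = np.linspace(1e-6, 1 - 1e-6, 500)
plt.hist(sample, bins=30, density=True, alpha=0.5, label="sample")
plt.plot(u, logitnorm(m, s).pdf(u), label="fitted pdf")
plt.plot(u, law.pdf(u), "--", label="reference pdf")
plt.legend()
plt.show()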
If you look at the source code of the pdf method, you will notice that _pdf is called without the scale and loc keyword arguments.
if np.any(cond):
    goodargs = argsreduce(cond, *((x,)+args+(scale,)))
    scale, goodargs = goodargs[-1], goodargs[:-1]
    place(output, cond, self._pdf(*goodargs) / scale)
As a result, the kwargs dictionary in your overriding _pdf method is always empty.
If you look a bit closer at the code, you will also notice that the scaling and location are handled by pdf as opposed to _pdf.
In your case, the _pdf method calls norm.pdf so the loc and scale parameters must somehow be available in LogitNormal._pdf.
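To see the division of labour concretely (my own illustration with scipy.stats.norm): the public pdf applies the loc/scale transform around the shape-only _pdf, i.e. pdf(x, loc, scale) equals _pdf((x - loc)/scale)/scale:

from scipy.stats import norm

x, loc, scale = 1.3, 2.0, 3.0
# The generic pdf handles loc and scale; the private _pdf only ever sees
# the standardized variable (x - loc) / scale.
print(norm.pdf(x, loc=loc, scale=scale))
print(norm.pdf((x - loc) / scale) / scale)  # same value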
You could, for example, pass scale and loc when creating an instance of LogitNormal and store the values as instance attributes:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit
from scipy.stats import norm, rv_continuous
class LogitNormal(rv_continuous):
    def __init__(self, scale=1, loc=0):
        super().__init__()
        self.scale = scale
        self.loc = loc

    def _pdf(self, x):
        return norm.pdf(logit(x), loc=self.loc, scale=self.scale)/(x*(1-x))
fig, ax = plt.subplots()
values = np.linspace(10e-10, 1-10e-10, 1000)
sigma, mu = 1.78, 0
ax.plot(
    values, LogitNormal(scale=sigma, loc=mu).pdf(values), label='subclassed'
)
ax.legend()
fig.show()
I am building a neural network that makes use of t-distributed noise. I am using np.random.standard_t from NumPy and tf.distributions.StudentT from TensorFlow. The link to the documentation of the first function is here, and that of the second is here. I am using these functions as follows:
a = np.random.standard_t(df=3, size=10000) # numpy's function
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
sess = tf.Session()
b = sess.run(t_dist.sample(10000))
In the documentation provided for the TensorFlow implementation, there is a parameter called scale whose description reads:
The scaling factor(s) for the distribution(s). Note that scale is not technically the standard deviation of this distribution but has semantics more similar to standard deviation than variance.
I have set scale to be 1.0 but I have no way of knowing for sure if these refer to the same distribution.
Can someone help me verify this? Thanks
I would say they are, as their sampling is defined in almost the exact same way in both cases. This is how the sampling of tf.distributions.StudentT is defined:
def _sample_n(self, n, seed=None):
    # The sampling method comes from the fact that if:
    #   X ~ Normal(0, 1)
    #   Z ~ Chi2(df)
    #   Y = X / sqrt(Z / df)
    # then:
    #   Y ~ StudentT(df).
    seed = seed_stream.SeedStream(seed, "student_t")
    shape = tf.concat([[n], self.batch_shape_tensor()], 0)
    normal_sample = tf.random.normal(shape, dtype=self.dtype, seed=seed())
    df = self.df * tf.ones(self.batch_shape_tensor(), dtype=self.dtype)
    gamma_sample = tf.random.gamma([n],
                                   0.5 * df,
                                   beta=0.5,
                                   dtype=self.dtype,
                                   seed=seed())
    samples = normal_sample * tf.math.rsqrt(gamma_sample / df)
    return samples * self.scale + self.loc  # Abs(scale) not wanted.
So it is a standard normal sample X divided by sqrt(Z / df), where Z is a chi-square sample with df degrees of freedom. The chi-square sample is drawn as a gamma sample with shape 0.5 * df and rate 0.5, which is equivalent (the chi-square distribution is a special case of the gamma distribution). The scale value, like the loc, only comes into play in the last line, as a way to relocate and rescale the distribution's samples; when scale is one and loc is zero, they do nothing.
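A quick numerical check of that equivalence (my own sketch): a gamma distribution with shape 0.5 * df and rate 0.5 (i.e. scale 2) has the same pdf as a chi-square distribution with df degrees of freedom:

import numpy as np
from scipy import stats

df = 3.0
z = np.linspace(0.1, 20, 200)
# Gamma(shape=df/2, rate=0.5) == Gamma(shape=df/2, scale=2) == Chi2(df)
gamma_pdf = stats.gamma(a=df/2, scale=2.0).pdf(z)
chi2_pdf = stats.chi2(df).pdf(z)
print(np.allclose(gamma_pdf, chi2_pdf))  # True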
Here is the implementation for np.random.standard_t:
double legacy_standard_t(aug_bitgen_t *aug_state, double df) {
  double num, denom;

  num = legacy_gauss(aug_state);
  denom = legacy_standard_gamma(aug_state, df / 2);
  return sqrt(df / 2) * num / sqrt(denom);
}
So it is essentially the same thing, slightly rephrased. Here we also have a gamma sample with shape df / 2, but it is a standard one (rate one); the missing rate of 0.5 is absorbed into the sqrt(df / 2) factor in the numerator, so it is just moving the numbers around. Here there is no scale or loc, though.
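For illustration (my own sketch, using NumPy's Generator API), the same recipe can be written directly in Python and compared against the built-in sampler:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
df, n = 3.0, 100000

# Reimplementation of the C routine above: N(0, 1) * sqrt(df/2) / sqrt(Gamma(df/2, scale=1))
num = rng.standard_normal(n)
denom = rng.standard_gamma(df / 2, n)
manual = np.sqrt(df / 2) * num / np.sqrt(denom)

reference = rng.standard_t(df, n)

# Both samples should be indistinguishable from the same t(df) distribution.
print(stats.kstest(manual, stats.t(df).cdf).pvalue)
print(stats.kstest(reference, stats.t(df).cdf).pvalue)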
The real difference is that the TensorFlow distribution is a shifted and scaled (location-scale) t-distribution, which reduces to the standard one when loc=0.0 and scale=1.0. A simple empirical check that they agree in that case is to plot histograms of both samples and see how close they look.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
np.random.seed(0)
t_np = np.random.standard_t(df=3, size=10000)
with tf.Graph().as_default(), tf.Session() as sess:
    tf.random.set_random_seed(0)
    t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
    t_tf = sess.run(t_dist.sample(10000))
plt.hist((t_np, t_tf), np.linspace(-10, 10, 20), label=['NumPy', 'TensorFlow'])
plt.legend()
plt.tight_layout()
plt.show()
Output: a histogram comparison of the two samples (figure omitted here).
The two histograms look pretty close. Obviously, from the point of view of statistical samples, this is not any kind of proof. If you are still not convinced, there are statistical tools for testing whether a sample comes from a certain distribution, or whether two samples come from the same distribution.
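For example (a quick check of my own, reusing t_np and t_tf from the script above), a two-sample Kolmogorov-Smirnov test:

from scipy import stats

# A large p-value means we cannot reject the hypothesis that the NumPy
# and TensorFlow samples come from the same distribution.
print(stats.ks_2samp(t_np, t_tf))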
I am trying to get a simple fit to my data with a power-law decay of the form a*(x-x0)**b, where I know a and b must be negative. So if I plot it on a log-log plot, I should see a linear trend.
As such, I'm giving scipy.optimize initial guesses where a and b are negative, but it keeps ignoring them and giving me the error,
OptimizeWarning: Covariance of the parameters could not be estimated
...and giving me values for a and b that are positive. The fitted curve is then not a decay at all, but a parabola that bottoms out and begins to increase.
I have tried many different guesses as to the initial parameters over a large range of values (one such is in the code below), but none worked without giving me the nonsensical return and the error. This has made me start to wonder if my code is wrong, or if there's just some obvious way to get good initial guesses into the code that won't be rejected.
import math
import numpy as np
import sys
import matplotlib.pyplot as plt
import scipy as sp
import scipy.optimize
from scipy.optimize import curve_fit
import numpy.polynomial.polynomial as poly
x= [1987, 1993.85, 2003, 2010.45, 2009.3, 2019.4]
t= [31, 8.6, 4.84, 1.96, 3.9, 1.875]
def model_func(x, a, b, x0):
    return a*(x - x0)**b

# curve fit
p0 = (-.0005, -.0005, 100)
opt, pcov = curve_fit(model_func, x, t, p0)
a, b, x0 = opt
# test result
x2 = np.linspace(1980, 2020, 100)
y2 = model_func(x2, a, b,x0)
coefs, cov = poly.polyfit(x, t, 2,full=True)
ffit = poly.polyval(x2, coefs)
plt.loglog(x,t,'.')
plt.loglog(x2, ffit,'--', color="#1f77b4")
print('S = (',coefs[0],'*(t-',coefs[2],')^',coefs[1])
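One way to obtain usable starting values, sketched here under the assumption that x0 lies below the smallest x value (so x - x0 stays positive, which for the positive, decreasing t values implies a > 0 and b < 0), is the log-log linearization mentioned above:

import numpy as np

x = np.array([1987, 1993.85, 2003, 2010.45, 2009.3, 2019.4])
t = np.array([31, 8.6, 4.84, 1.96, 3.9, 1.875])

# Hypothetical guess for x0; it only needs to be smaller than every x value.
x0_guess = 1980.0

# In log-log space the model a*(x - x0)**b becomes log(t) = log(a) + b*log(x - x0),
# so a straight-line fit yields starting values for b (slope) and a (exp of intercept).
b_guess, log_a_guess = np.polyfit(np.log(x - x0_guess), np.log(t), 1)
p0 = (np.exp(log_a_guess), b_guess, x0_guess)
print(p0)  # could be passed to curve_fit as the initial guess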
I'm using lmfit to estimate the parameters of a coupled ODE system, based on this example: https://people.duke.edu/~ccc14/sta-663/CalibratingODEs.html.
In order to obtain a global minimum of the residual, I switched to either the "basinhopping" or "ampgo" method, but I get the following warning when displaying the results:
Warning: uncertainties could not be estimated:
this fitting method does not natively calculate uncertainties
and numdifftools is not installed for lmfit to do this. Use
`pip install numdifftools` for lmfit to estimate uncertainties
with this fitting method.
I have installed "numdifftools" via conda, but the warning (and the lack of uncertainties) persists.
How can I solve this?
Here is the code with minimal data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from lmfit import minimize, Parameters, Parameter, report_fit
from scipy.integrate import odeint
def f(xs, t, ps):
    Ksor = ps['Ksor'].value
    Kdes = ps['Kdes'].value
    Cw, Cs, SS = xs
    return [-Ksor*SS*Cw + Kdes*Cs, Ksor*SS*Cw - Kdes*Cs, 0]

def g(t, x0, ps):
    """
    Solution to the ODE x'(t) = f(t,x,k) with initial condition x(0) = x0
    """
    x = odeint(f, x0, t, args=(ps,))
    return x

def residual(ps, ts, data):
    x0 = ps['Cw0'].value, ps['Cs0'].value, ps['SS0'].value
    model = g(ts, x0, ps)
    return (model - data).ravel()
data1 = np.array([[100.    ,  0.    , 1.],
                  [ 66.5507, 33.4493, 1.],
                  [ 44.4018, 55.5982, 1.],
                  [ 29.7357, 70.2643, 1.]])
t = pd.Series([0.408,0.816,1.224,1.632])
x0 = np.array([100,0,1])
# set parameters including bounds
params = Parameters()
params.add('Cw0', value=100, vary=False)
params.add('Cs0', value=0, vary=False)
params.add('SS0', value=1, vary=False)
params.add('Ksor', value=2.0, min=0, max=100)
params.add('Kdes', value=1.0, min=0, max=100)
# fit model and find predicted values
result = minimize(residual, params, args=(t, data1), method='basinhopping')
final = data1 + result.residual.reshape(data1.shape)
# plot data and fitted curves
plt.plot(t, data1, 'o')
plt.plot(t, final, '-', linewidth=2);
# display fitted statistics
report_fit(result)
EDIT: the code works. I think that the installation of numdifftools wasn't detected and a restart of the PC solved the issue.
It is always helpful to provide a minimal but complete example that shows the problem and the result (including fit report and/or any exceptions) that you get. Please modify the question to include these and give the version of lmfit and numdifftools you are using.
Also: The use of numdifftools to calculate the uncertainties requires the length of the residual array to be larger than the number of variables.
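For example (an illustrative check of my own, assuming the result object returned by lmfit.minimize in the script above), that condition can be verified directly:

# Uncertainty estimation needs more residual points than varying parameters.
print(result.residual.size, result.nvarys)
assert result.residual.size > result.nvarys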