The issue
Tl;dr: I would like a function that randomly returns a float (or optionally an ndarray of floats) in an interval, following a probability distribution that resembles the sum of a "Gaussian" and a uniform distribution.
The function (or class) - let's say custom_distr() - should have as inputs (with default values already given):
the lower and upper bounds of the interval: low=0.0, high=1.0
the mean and standard deviation parameters of the "Gaussian": loc=0.5, scale=0.02
the size of the output: size=None
size can be an integer or a tuple of integers. If size is given, then loc and scale can either both be scalars or both be ndarrays whose shapes correspond to size.
The output is a scalar or an ndarray, depending on size.
The output has to be scaled to ensure that the total probability (i.e. the cumulative distribution at the upper bound) equals 1 (I'm uncertain how to do this).
Note that I'm following numpy.random.Generator's naming convention for its uniform and normal distributions as a reference, but the nomenclature and the packages used do not really matter to me.
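For illustration only, here is a minimal sketch of the kind of sampler described above. The function name sample_custom, the equal 50/50 mixture weight, and the choice to truncate the Gaussian to [low, high] (so that the total probability is exactly 1) are my assumptions, not part of the question; it also assumes a recent SciPy that accepts a numpy Generator as random_state.

import numpy as np
from scipy import stats

def sample_custom(low=0.0, high=1.0, loc=0.5, scale=0.02, size=None, rng=None):
    """Draw from an equal-weight mixture of Uniform(low, high) and a Gaussian
    truncated to [low, high] (truncation keeps the total probability at 1)."""
    rng = np.random.default_rng() if rng is None else rng
    # Standardized truncation bounds expected by scipy.stats.truncnorm.
    a, b = (low - loc) / scale, (high - loc) / scale
    gauss = stats.truncnorm.rvs(a, b, loc=loc, scale=scale, size=size,
                                random_state=rng)
    unif = rng.uniform(low, high, size)
    # Per-sample coin flip deciding which component each draw comes from
    # (returns a 0-d array when size is None).
    return np.where(rng.random(size) < 0.5, gauss, unif)

# e.g. sample_custom(size=10000), or array-valued parameters:
# sample_custom(loc=np.full((3, 4), 0.3), size=(3, 4))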
What I've tried
Since I couldn't find a way to "add" numpy.random.Generator's uniform and Gaussian distributions directly, I've tried subclassing scipy.stats.rv_continuous, but I'm stuck on how to define the _rvs method, or the _ppf method, so that it is fast.
From what I've understood of the rv_continuous class definition on GitHub, _rvs uses numpy's random.RandomState (which is outdated compared to random.Generator) to draw the samples. This seems to defeat the purpose of subclassing scipy.stats.rv_continuous.
Another option would be to define _ppf, the percent-point function of my custom distribution, since according to the rv_generic class definition on GitHub, the default _rvs implementation uses _ppf. But I'm having trouble deriving this function by hand.
Below is an MWE, tested with low=0.0, high=1.0, loc=0.3 and scale=0.02. The names differ from those in the "The issue" section because numpy and scipy use different terminology.
import numpy as np
from scipy.stats import rv_continuous
import scipy.special as sc
import matplotlib.pyplot as plt
import time
# The class definition
class custom_distr(rv_continuous):
def __init__(self, my_loc=0.5, my_scale=0.5, a=0.0, b=1.0, *args, **kwargs):
super(custom_distr, self).__init__(a, b, *args, **kwargs)
self.a = a
self.b = b
self.my_loc = my_loc
self.my_scale = my_scale
def _pdf(self, x):
# uniform distribution
aux = 1/(self.b-self.a)
# gaussian distribution
aux += 1/np.sqrt(2*np.pi*self.my_scale**2) * \
np.exp(-(x-self.my_loc)**2/2/self.my_scale**2)
return aux/2 # divide by 2?
def _cdf(self, x):
# uniform distribution
aux = (x-self.a)/(self.b-self.a)
# gaussian distribution
aux += 0.5*(1+sc.erf((x-self.my_loc)/(self.my_scale*np.sqrt(2))))
return aux/2 # divide by 2?
# Testing the class
if __name__ == "__main__":
my_cust_distr = custom_distr(name="my_dist", my_loc=0.3, my_scale=0.02)
x = np.linspace(0.0, 1.0, 10000)
start_t = time.time()
the_pdf = my_cust_distr.pdf(x)
print("PDF calc time: {:4.4f}".format(time.time()-start_t))
plt.plot(x, the_pdf, label='pdf')
start_t = time.time()
the_cdf = my_cust_distr.cdf(x)
print("CDF calc time: {:4.4f}".format(time.time()-start_t))
plt.plot(x, the_cdf, 'r', alpha=0.8, label='cdf')
# Get 10000 random values according to the custom distribution
start_t = time.time()
r = my_cust_distr.rvs(size=10000)
print("RVS calc time: {:4.4f}".format(time.time()-start_t))
plt.hist(r, density=True, histtype='stepfilled', alpha=0.3, bins=40)
plt.ylim([0.0, the_pdf.max()])
plt.grid(which='both')
plt.legend()
print("Maximum of CDF is: {:2.1f}".format(the_cdf[-1]))
plt.show()
The generated image is:
The output is:
PDF calc time: 0.0010
CDF calc time: 0.0010
RVS calc time: 11.1120
Maximum of CDF is: 1.0
The time spent computing rvs is far too long with this approach.
According to Wikipedia, the ppf, or percent-point function (also called the quantile function), can be written as the inverse of the cumulative distribution function (cdf) when the cdf is monotonically increasing.
From the figure shown in the question, the cdf of my custom distribution does indeed increase monotonically - as expected, since the cdfs of the Gaussian and uniform distributions do so too.
The ppf of the general normal distribution can be found on this Wikipedia page under "Quantile function". And the ppf of a uniform distribution defined between a and b can be calculated simply as p*(b-a)+a, where p is the desired probability.
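For reference, here are the two component quantile functions written out in code (these are the standard textbook formulas, shown only to make the next point concrete):

import numpy as np
from scipy import special

def normal_ppf(p, loc, scale):
    # Inverse CDF of a Gaussian: loc + scale * sqrt(2) * erfinv(2p - 1).
    return loc + scale * np.sqrt(2.0) * special.erfinv(2.0 * p - 1.0)

def uniform_ppf(p, a, b):
    # Inverse CDF of a uniform distribution on [a, b].
    return p * (b - a) + a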
But the inverse of the sum of two functions cannot (in general) be trivially written in terms of the individual inverses! See this Mathematics Stack Exchange post for more information.
Therefore, the partial "solution" I have found so far is to store an array containing the cdf of my custom distribution when the object is instantiated, and then to obtain the ppf (i.e. the inverse of the cdf) via 1D interpolation, which only works as long as the cdf is indeed monotonically increasing.
NOTE 1: I still haven't fixed the bounds-check issue mentioned by Peter O.
NOTE 2: The proposed solution is not viable if an ndarray of loc values is given, because of the lack of a closed-form expression for the quantile function. Therefore, the original question is still open.
The working code is now:
import numpy as np
from scipy.stats import rv_continuous
import scipy.special as sc
import matplotlib.pyplot as plt
import time
# The class definition
class custom_distr(rv_continuous):
def __init__(self, my_loc=0.5, my_scale=0.5, a=0.0, b=1.0,
init_ppf=1000, *args, **kwargs):
super(custom_distr, self).__init__(a, b, *args, **kwargs)
self.a = a
self.b = b
self.my_loc = my_loc
self.my_scale = my_scale
self.x = np.linspace(a, b, init_ppf)
self.cdf_arr = self._cdf(self.x)
def _pdf(self, x):
# uniform distribution
aux = 1/(self.b-self.a)
# gaussian distribution
aux += 1/np.sqrt(2*np.pi)/self.my_scale * \
np.exp(-0.5*((x-self.my_loc)/self.my_scale)**2)
return aux/2 # divide by 2?
def _cdf(self, x):
# uniform distribution
aux = (x-self.a)/(self.b-self.a)
# gaussian distribution
aux += 0.5*(1+sc.erf((x-self.my_loc)/(self.my_scale*np.sqrt(2))))
return aux/2 # divide by 2?
def _ppf(self, p):
if np.any((p<0.0) | (p>1.0)):
raise RuntimeError("Quantile function accepts only values between 0 and 1")
return np.interp(p, self.cdf_arr, self.x)
# Testing the class
if __name__ == "__main__":
a = 1.0
b = 3.0
my_loc = 1.5
my_scale = 0.02
my_cust_distr = custom_distr(name="my_dist", a=a, b=b,
my_loc=my_loc, my_scale=my_scale)
x = np.linspace(a, b, 10000)
start_t = time.time()
the_pdf = my_cust_distr.pdf(x)
print("PDF calc time: {:4.4f}".format(time.time()-start_t))
plt.plot(x, the_pdf, label='pdf')
start_t = time.time()
the_cdf = my_cust_distr.cdf(x)
print("CDF calc time: {:4.4f}".format(time.time()-start_t))
plt.plot(x, the_cdf, 'r', alpha=0.8, label='cdf')
start_t = time.time()
r = my_cust_distr.rvs(size=10000)
print("RVS calc time: {:4.4f}".format(time.time()-start_t))
plt.hist(r, density=True, histtype='stepfilled', alpha=0.3, bins=100)
plt.ylim([0.0, the_pdf.max()])
# plt.xlim([a, b])
plt.grid(which='both')
plt.legend()
print("Maximum of CDF is: {:2.1f}".format(the_cdf[-1]))
plt.show()
The resulting image is:
And the output is:
PDF calc time: 0.0010
CDF calc time: 0.0010
RVS calc time: 0.0010
Maximum of CDF is: 1.0
The code is faster than before, at the cost of using a bit more memory.
I am building a neural network that makes use of t-distribution noise. I am using the function np.random.standard_t from the numpy library and tf.distributions.StudentT from TensorFlow. The documentation of the first function is here and that of the second is here. I am using the said functions like below:
a = np.random.standard_t(df=3, size=10000) # numpy's function
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
sess = tf.Session()
b = sess.run(t_dist.sample(10000))
In the documentation provided for the Tensorflow implementation, there's a parameter called scale whose description reads
The scaling factor(s) for the distribution(s). Note that scale is not technically the standard deviation of this distribution but has semantics more similar to standard deviation than variance.
I have set scale to be 1.0 but I have no way of knowing for sure if these refer to the same distribution.
Can someone help me verify this? Thanks
I would say they are, as their sampling is defined in almost the exact same way in both cases. This is how the sampling of tf.distributions.StudentT is defined:
def _sample_n(self, n, seed=None):
# The sampling method comes from the fact that if:
# X ~ Normal(0, 1)
# Z ~ Chi2(df)
# Y = X / sqrt(Z / df)
# then:
# Y ~ StudentT(df).
seed = seed_stream.SeedStream(seed, "student_t")
shape = tf.concat([[n], self.batch_shape_tensor()], 0)
normal_sample = tf.random.normal(shape, dtype=self.dtype, seed=seed())
df = self.df * tf.ones(self.batch_shape_tensor(), dtype=self.dtype)
gamma_sample = tf.random.gamma([n],
0.5 * df,
beta=0.5,
dtype=self.dtype,
seed=seed())
samples = normal_sample * tf.math.rsqrt(gamma_sample / df)
return samples * self.scale + self.loc # Abs(scale) not wanted.
So it is a standard normal sample divided by the square root of a chi-squared sample with df degrees of freedom, itself divided by df. The chi-squared sample is drawn as a gamma sample with shape 0.5 * df and rate 0.5, which is equivalent (the chi-squared distribution is a special case of the gamma distribution). The scale value, like loc, only comes into play in the last line, shifting and scaling the resulting sample. When scale is one and loc is zero, they do nothing.
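As a quick sanity check (my own sketch, not part of either library), the same construction can be reproduced with NumPy alone and compared against standard_t:

import numpy as np

rng = np.random.default_rng(0)
df, n = 3.0, 100_000

# X ~ Normal(0, 1), Z ~ Chi2(df) drawn as Gamma(shape=df/2, rate=1/2),
# Y = X / sqrt(Z / df) ~ StudentT(df). A rate of 0.5 means scale = 2 in NumPy.
normal_sample = rng.standard_normal(n)
gamma_sample = rng.gamma(shape=0.5 * df, scale=2.0, size=n)
manual_t = normal_sample / np.sqrt(gamma_sample / df)

direct_t = rng.standard_t(df, size=n)
print(np.percentile(manual_t, [5, 25, 50, 75, 95]))
print(np.percentile(direct_t, [5, 25, 50, 75, 95]))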
Here is the implementation for np.random.standard_t:
double legacy_standard_t(aug_bitgen_t *aug_state, double df) {
double num, denom;
num = legacy_gauss(aug_state);
denom = legacy_standard_gamma(aug_state, df / 2);
return sqrt(df / 2) * num / sqrt(denom);
}
So it is essentially the same thing, slightly rephrased. Here we also have a gamma with shape df / 2, but it is standard (rate one). However, the factor of 0.5 that appeared as the gamma rate now shows up in the numerator, as the / 2 inside sqrt(df / 2). So it is just moving the numbers around. There is no scale or loc here, though.
In truth, the only difference is that the TensorFlow distribution is a shifted and scaled (location-scale) t-distribution rather than a standard one. A simple empirical check that they are the same for loc=0.0 and scale=1.0 is to plot histograms of both distributions and see how close they look.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
np.random.seed(0)
t_np = np.random.standard_t(df=3, size=10000)
with tf.Graph().as_default(), tf.Session() as sess:
tf.random.set_random_seed(0)
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
t_tf = sess.run(t_dist.sample(10000))
plt.hist((t_np, t_tf), np.linspace(-10, 10, 20), label=['NumPy', 'TensorFlow'])
plt.legend()
plt.tight_layout()
plt.show()
Output:
That looks pretty close. Obviously, from a statistical point of view, this is not any kind of proof. If you are still not convinced, there are statistical tools for testing whether a sample comes from a given distribution, or whether two samples come from the same distribution.
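For example, a two-sample Kolmogorov-Smirnov test is one such tool (sketched below using the t_np and t_tf samples from the snippet above); a large p-value means the test finds no evidence that the samples come from different distributions:

from scipy import stats

# Two-sample KS test between the NumPy and TensorFlow samples.
stat, p_value = stats.ks_2samp(t_np, t_tf)
print(stat, p_value)

# One-sample KS test of the NumPy sample against the theoretical t(df=3) CDF.
stat_1, p_1 = stats.kstest(t_np, 't', args=(3,))
print(stat_1, p_1)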
I am trying to fit some experimental data to a nonlinear function with one parameter that includes an arccosine, whose argument is therefore limited to the domain [-1, 1]. I use scipy's curve_fit to find the parameter of the function, but it returns the following error:
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 400.
The function I want to fit is this one:
def fitfunc(x, a):
y = np.rad2deg(np.arccos(x*np.cos(np.deg2rad(a))))
return y
For the fitting, I provide numpy arrays for x and y, respectively, which contain values in degrees (which is why the function converts to and from radians).
param, param_cov = curve_fit(fitfunc, xs, ys)
When I use other fit functions, for example a polynomial, curve_fit returns some values; the error mentioned above only occurs when I use this function, which includes an arccosine.
I suspect that it cannot fit the data points because, depending on the parameter of the arccosine, some data points fall outside its domain. I have tried raising the number of function evaluations (maxfev), but without success.
Sample data:
ys = np.array([113.46125, 129.4225, 140.88125, 145.80375, 145.4425,
146.97125, 97.8025, 112.91125, 114.4325, 119.16125,
130.13875, 134.63125, 129.4375, 141.99, 139.86,
138.77875, 137.91875, 140.71375])
xs = np.array([2.786427013, 3.325624466, 3.473013087, 3.598247534, 4.304280248,
4.958273121, 2.679526725, 2.409388637, 2.606306639, 3.661558062,
4.569923009, 4.836843789, 3.377013596, 3.664550526, 4.335401233,
3.064199519, 3.97155254, 4.100567011])
As HS-nebula mentioned in the comments, you need to define an initial value a0 of a as a starting guess for the curve fit. Moreover, you need to be careful when choosing a0, since np.arccos() is only defined on [-1, 1], and choosing the wrong a0 results in an error.
import numpy as np
from scipy.optimize import curve_fit
ys = np.array([113.46125, 129.4225, 140.88125, 145.80375, 145.4425, 146.97125,
97.8025, 112.91125, 114.4325, 119.16125, 130.13875, 134.63125,
129.4375, 141.99, 139.86, 138.77875, 137.91875, 140.71375])
xs = np.array([2.786427013, 3.325624466, 3.473013087, 3.598247534, 4.304280248, 4.958273121,
2.679526725, 2.409388637, 2.606306639, 3.661558062, 4.569923009, 4.836843789,
3.377013596, 3.664550526, 4.335401233, 3.064199519, 3.97155254, 4.100567011])
def fit_func(x, a):
a_in_rad = np.deg2rad(a)
cos_a_in_rad = np.cos(a_in_rad)
arcos_xa_product = np.arccos( x * cos_a_in_rad )
return np.rad2deg(arcos_xa_product)
a0 = 80
param, param_cov = curve_fit(fit_func, xs, ys, a0, bounds = (0, 360))
print('Using curve_fit we retrieve a value of a =', param[0])
Output:
Using curve_fit we retrieve a value of a = 100.05275506147824
However if you choose a0=60, you get the following error:
ValueError: Residuals are not finite in the initial point.
To be able to use the data with all possible values of a, a normalization, as HS-nebula suggested, is a good idea.
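For instance, one possible normalization (my own sketch; dividing by the maximum of xs is just one choice, and it reuses fit_func, xs, ys and curve_fit from the code above) keeps the arccos argument inside [-1, 1] for any angle a:

# Rescale xs so that |x * cos(a)| <= 1 for every a; note that the fitted a then
# refers to the normalized data, not to the original scale.
xs_norm = xs / xs.max()
param_norm, param_cov_norm = curve_fit(fit_func, xs_norm, ys, p0=[80],
                                       bounds=(0, 360))
print('Fitted a on the normalized data:', param_norm[0])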
I have a function Imaginary which describes a physics process and I want to fit this to a dataset x_interpolate, y_interpolate. The function is a form of a Lorentzian peak function and I have some initial values that are user given, except for f_peak (the peak location) which I find using a peak finding algorithm. All of the fit parameters, except for the offset, are expected to be positive and thus I have set bounds_I accordingly.
def Imaginary(freq, alpha, res, Ms, off):
numerator = (2*alpha*freq*res**2)
denominator = (4*(alpha*res*freq)**2) + (res**2 - freq**2)**2
Im = Ms*(numerator/denominator) + off
return Im
pI = np.array([alpha_init, f_peak, Ms_init, 0])
bounds_I = ([0, 0, 0, -np.inf], [np.inf, np.inf, np.inf, np.inf])
poptI, pcovI = curve_fit(Imaginary, x_interpolate, y_interpolate, pI, bounds=bounds_I)
In some situations I want to keep the parameter f_peak fixed during the fitting process. I tried an easy solution by changing bounds_I to:
bounds_I = ([0, f_peak-0.001, 0, -np.inf], [np.inf, f_peak+0.001, np.inf, np.inf])
This is, for many reasons, not an optimal way of doing it, so I was wondering if there is a more Pythonic approach? Thank you for your help.
If a parameter is fixed, it is not really a parameter, so it should be removed from the list of parameters. Define a model that has that parameter replaced by a fixed value, and fit that. Example below, simplified for brevity and to be self-contained:
x = np.arange(10)
y = np.sqrt(x)
def parabola(x, a, b, c):
return a*x**2 + b*x + c
fit1 = curve_fit(parabola, x, y) # [-0.02989396, 0.56204598, 0.25337086]
b_fixed = 0.5
fit2 = curve_fit(lambda x, a, c: parabola(x, a, b_fixed, c), x, y)
The second call to fit returns [-0.02350478, 0.35048631], which are the optimal values of a and c. The value of b was fixed at 0.5.
Of course, the parameter should be removed from the initial vector pI and the bounds as well.
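Applied to the original problem, that could look roughly like the sketch below (hypothetical: alpha_init, Ms_init, f_peak, x_interpolate and y_interpolate come from the question's setup, which is not shown here):

# Fix f_peak by baking it into the model via a lambda, and drop it from p0/bounds.
pI_fixed = np.array([alpha_init, Ms_init, 0])                   # alpha, Ms, off
bounds_fixed = ([0, 0, -np.inf], [np.inf, np.inf, np.inf])
poptI, pcovI = curve_fit(
    lambda freq, alpha, Ms, off: Imaginary(freq, alpha, f_peak, Ms, off),
    x_interpolate, y_interpolate, pI_fixed, bounds=bounds_fixed)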
You might find lmfit (https://lmfit.github.io/lmfit-py/) helpful. This library adds a higher-level interface to the scipy optimization routines, aiming for a more Pythonic approach to optimization and curve fitting. For example, it uses Parameter objects that allow setting bounds and fixing parameters without having to modify the objective or model function. For curve fitting, it provides high-level Model classes that can be used.
For your example, you could use your Imaginary function as you've written it with
from lmfit import Model
lmodel = Model(Imaginary)
and then create Parameters (lmfit will name the Parameter objects according to your function signature), providing initial values:
params = lmodel.make_params(alpha=alpha_init, res=f_peak, Ms=Ms_init, off=0)
By default all Parameters are unbounded and will vary in the fit, but you can modify these attributes (without rewriting the model function):
params['alpha'].min = 0
params['res'].min = 0
params['Ms'].min = 0
You can set one (or more) of the parameters to not vary in the fit with:
params['res'].vary = False
To be clear: this does not require altering the model function, which makes it much easier to change what is fixed, what bounds might be imposed, and so forth.
You would then perform the fit with the model and these parameters:
result = lmodel.fit(y_interpolate, params, freq=x_interpolate)
You can get a report of fit statistics, best-fit values, and uncertainties for the parameters with
print(result.fit_report())
The best fit Parameters will be held in result.params.
FWIW, lmfit also has built-in models for many common forms, including a Lorentzian and a constant offset. So you could construct this model as
from lmfit.models import LorentzianModel, ConstantModel
mymodel = LorentzianModel(prefix='l_') + ConstantModel()
params = mymodel.make_params()
which will have Parameters named l_amplitude, l_center, l_sigma, and c (where c is the constant), and the model will use the name x for the independent variable (your freq). This approach becomes very convenient when you want to change the functional form of the peaks or background, or when fitting multiple peaks to a spectrum.
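A rough usage sketch for completeness (hypothetical values; note that the built-in Lorentzian's amplitude and sigma do not map one-to-one onto Ms and alpha in the original Imaginary function, so the initial guesses below are placeholders):

params['l_center'].set(value=f_peak, vary=False)   # fix the peak position
params['l_amplitude'].set(value=1.0, min=0)
params['l_sigma'].set(value=0.1, min=0)
params['c'].set(value=0.0)

result = mymodel.fit(y_interpolate, params, x=x_interpolate)
print(result.fit_report())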
I was able to solve this issue for an arbitrary number of parameters and an arbitrary positioning of the fixed parameters:
from numpy import append, asarray
from scipy.optimize import curve_fit

def d_fit(x, y, param, boundMi, boundMx, listparam):
Sparam, SboundMi, SboundMx = asarray([]), asarray([]), asarray([])
Nparam, NboundMi, NboundMx = asarray([]), asarray([]), asarray([])
for i in range(len(param)):
if(listparam[i] == 1):
Sparam = append(Sparam,asarray(param[i]))
SboundMi = append(SboundMi,asarray(boundMi[i]))
SboundMx = append(SboundMx,asarray(boundMx[i]))
else:
Nparam = append(Nparam,asarray(param[i]))
def funF(x, Sparam):
j = 0
for i in range(len(param)):
if(listparam[i] == 1):
param[i] = Sparam[i-j]
else:
param[i] = Nparam[j]
j = j + 1
return fun(x, param)
return curve_fit(lambda x, *Sparam: funF(x, Sparam), x, y, p0 = Sparam, bounds = (SboundMi,SboundMx))
In this case:
param = [a,b,c,...] # parameters array (any size)
boundMi = [min_a, min_b, min_c,...] # minimum allowable value of each parameter
boundMx = [max_a, max_b, max_c,...] # maximum allowable value of each parameter
listparam = [0,1,1,0,...] # 1 = fit and 0 = fix the corresponding parameter in the fit routine
and the root function is defined as
def fun(x, param):
a,b,c,d.... = param
return a*b/c... # any function of the params a,b,c,d...
This way, you can change the root function and the number of parameters without changing the fit routine.
And, at any time, you can fix any parameter or let it be fitted by changing "listparam".
Use like this:
popt, pcov = d_fit(x, y, param, boundMi, boundMx, listparam)
"popt" and "pcov" are 1D arrays of the size of the number of "1" in "listparam" bringing the results of the fitted parameters (best value and err matrix)
"param" will ramain an 1D array of the same size of the original (input) "param", HOWEVER IT WILL BE UPDATED AUTOMATICALLY TO THE FITTED VALUES (same as "popt") for the fitted values, keeping the fixed values according to "listparam"
Hope can be usefull!
Obs1: x = 1D-array independent values and y = 1D-array dependent values
Obs2: This is my first post. Please let me know if I can improove it!
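For concreteness, here is a minimal self-contained usage sketch of d_fit, assuming the imports added above; the exponential model and all numbers are made up for illustration:

import numpy as np

def fun(x, param):
    a, b, c = param
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(0)
xdata = np.linspace(0, 4, 50)
ydata = fun(xdata, [2.5, 1.3, 0.5]) + 0.05 * rng.normal(size=xdata.size)

param = [2.0, 1.0, 0.5]     # initial guesses; the last entry (c) stays fixed
boundMi = [0.0, 0.0, 0.0]
boundMx = [10.0, 10.0, 10.0]
listparam = [1, 1, 0]       # fit a and b, keep c fixed at 0.5

popt, pcov = d_fit(xdata, ydata, param, boundMi, boundMx, listparam)
print(popt)    # best-fit values of a and b only
print(param)   # full parameter list, updated in place with the fitted values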
I want to fit a curve to my data:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

x = np.array([24, 25, 28, 37, 58, 104, 200, 235, 235])
y = np.array([340, 350, 370, 400, 430, 460, 490, 520, 550])
xerr = np.array([1.1, 1, 0.8, 1.4, 1.4, 2.6, 3.8, 2, 2])

def fit_fc(x, a, b, c):
    return a*x**b + c

popt, pcov = curve_fit(fit_fc, x, y, maxfev=5000)
plt.plot(x, fit_fc(x, popt[0], popt[1], popt[2]))
plt.errorbar(x, y, xerr=xerr, fmt='-o')
but I want to put some constraints on a, b and c. For example, I want them to be in some range, let's say between 0 and 20. How can I achieve that? I'm new to Python, so any help would be appreciated.
You could use lmfit to constrain your parameters. For the following plot, I constrained your parameters a and b to the range [0, 20] (which you mentioned in your post) and c to the range [0, 400]. The parameters you get are:
a: 19.9999991
b: 0.46769173
c: 274.074071
and the corresponding plot looks as follows:
As you can see, the model reproduces the data reasonable well and the parameters are in the given ranges.
Here is the code that reproduces the results with additional comments:
from lmfit import minimize, Parameters, Parameter, report_fit
import numpy as np
x=[24,25,28,37,58,104,200,235,235]
y=[340,350,370,400,430,460,490,520,550]
def fit_fc(params, x, data):
a = params['a'].value
b = params['b'].value
c = params['c'].value
model = np.power(x,b)*a + c
return model - data #that's what you want to minimize
# create a set of Parameters
#'value' is the initial condition
#'min' and 'max' define your boundaries
params = Parameters()
params.add('a', value= 2, min=0, max=20)
params.add('b', value= 0.5, min=0, max=20)
params.add('c', value= 300.0, min=0, max=400)
# do fit, here with leastsq model
result = minimize(fit_fc, params, args=(x, y))
# calculate final result
final = y + result.residual
# write error report
report_fit(params)
#plot results
try:
import pylab
pylab.plot(x, y, 'k+')
pylab.plot(x, final, 'r')
pylab.show()
except:
pass
If you constrain all of your parameters to the range [0,20], the plot looks rather bad:
It depends on what you want to have happen if the variables are out of range. You can use a simple if statement (in this case the program exit()s):
x = 21
if not (0 <= x <= 20):  # also covers non-integer values and both endpoints
    print("var x is out of range")
    exit()
Another way is to assert that the variable must be in the range. In this case, it's wrapped in a try/except block that handles the problem gracefully, and also exit()s like above:
try:
    assert 0 <= x <= 20
except AssertionError:
    print("variable x is out of range")
    exit()
Scipy uses unconstrained least squares in order to fit curve parameters, so it won't be that straightforward: https://github.com/scipy/scipy/blob/v0.16.0/scipy/optimize/minpack.py#L454
What you'd probably like to solve is called a constrained (nonlinear, given what you're trying to fit) least-squares problem. For instance, take a look at these discussions:
Constrained least-squares estimation in Python ( leastsq_bounds: https://gist.github.com/denis-bz/65da931bdbf92c49e4d0 )
scipy.optimize.leastsq with bound constraints
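As a side note (an addition for reference, not part of the original answers): SciPy 0.17 and later support box constraints directly in curve_fit via the bounds keyword, which covers the simple 0-to-20 ranges asked about here:

import numpy as np
from scipy.optimize import curve_fit

x = np.array([24, 25, 28, 37, 58, 104, 200, 235, 235], dtype=float)
y = np.array([340, 350, 370, 400, 430, 460, 490, 520, 550], dtype=float)

def fit_fc(x, a, b, c):
    return a * x**b + c

# Lower and upper bounds per parameter (a, b, c); c is allowed up to 400 here,
# mirroring the lmfit answer above.
popt, pcov = curve_fit(fit_fc, x, y, p0=[2, 0.5, 300],
                       bounds=([0, 0, 0], [20, 20, 400]))
print(popt)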