How to obtain perfect fit from np.random.power function - python

I have generated random data using:
bkg= 240-140*np.random.power(3.5,50000)
I plotted the points into a histogram by using
h_all = plt.hist(all,bins=binedges,histtype='step')
My question is, provided that I know the pdf (in this case called "bkg") can I generate a curve using scipy.optimize that fits the points generated perfectly, and what equation it is for the curve ?

First of all, remark that your bkg is NOT a probability density function (pdf). Rather, it is a list of observations from a pdf. By calling matplotlib.pyplot.hist on this list of observations, you get to see a curve that approximates the (offset and scaled version of the) probability density function. If you are given this curve, it is possible to get a good estimation of the parameters needed to model this, provided you've been given the parameterized model a priori.
For example:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
offset, scale, a, nsamples = 240, -140, 3.5, 500000
bkg = offset + scale*np.random.power(a, nsamples) # values range between (offset, offset+scale), which map to 0 and 1
nbins = 100
count, bins, ignored = plt.hist(bkg, bins=nbins, histtype='stepfilled', edgecolor='none')
If now you are given the centers of these bins and the counts,
xdata = .5*(bins[1:]+bins[:-1])
ydata = count
and you are asked to find the parameters of the power distribution function that fits to this data (-> someone told you this, you trust that source), then you could go about in the following manner.
First, observe that the power distribution function P(x,a) is a monotonously increasing function (i.e. P(x1, a ) < P(x2, a) when 0 <= x1 < x2 <= 1). That means that the dataset given above has been flipped left-to-right, or that it represents factor*P(x, a ) with factor < 0.
Next, notice that the given data is not given over the interval [0,1], typical for a probability density function. That means that you should rescale the given xdata to the [0,1] interval prior to attempting to fit the power function distribution to it. Just by observing the graph, you figure out that the values that 0 and 1 map to are 100 and 240. However, this is just luck here, because matplotlib chose a sensible range for plotting. When you are confronted with not actually knowing the limits to which 0 and 1 have mapped to, you could choose the less optimal (but still very good) choice of xdata[0] - binwidth/2 and xdata[-1] + binwidth/2 or (a slightly worse choice) xdata[0] and xdata[-1]. From the previous paragraph, you know that 1 maps to xdata[0] - binwidth/2 :=: a and 0 maps to xdata[-1] + binwidth/2 :=: b. The linear map that does this is lambda x: (a - b)*x + b (simple algebra).
If you pass this to [0,1]-mapped version of the xdata to curve_fit, it'll give you a good guess for the exponent.
def get_model(nobservations, binwidth, scale, offset):
def model(bin_centers, exponent):
x = (bin_centers - offset)/scale
y = exponent*x**(exponent - 1)
normed_y = nobservations * binwidth * y / np.abs(scale)
return normed_y
return model
binwidth = np.diff(xdata)[0]
p0, _ = curve_fit(get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2), xdata, ydata)
print(p0) # prints e.g.: 3.37117679
plt.plot(xdata, get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2)(xdata, *p0))
At this moment, you have found a rather accurate description of the distribution
that was used to generate the observations of bkg:
f(x) = offset + scale*(exponent * x**(exponent - 1))
= (xdata[-1] + binwidth/2) + (-xdata.ptp() - binwidth)*(p0[0] * x**(p0[0] - 1))
~ 234.85 - 1.34.85*(3.37 * x**(3.37 - 1))
By the way, I'd like to point out that replicating bkg (the observations from the distribution)
as a perfect copy is something you can only do if you know the exact parameters of the distribution (240, -140 and 3.5) AND set the seed for the random number generation equal to the seed that was in effect prior to the initial call to np.random.power.
If you'd like to fit a curve to the histogram using splines, you should retrieve the knots and coefficients from the generated spline and pass those into the function of bspleval, as shown here. The topic of writing out those equations is a long one however, and there are numerous resources on the internet that you can check to understand how it's done. Needless to say, that function bspleval is what you'll need in case you want to go that route. If it were me, I'd go the route of curve fitting shown above.

Related

Applying a half-gaussian filter to binned time series data in python

I am binning some time series data, I need to apply a half-normal filter to the binned data. How can I do this in python? I've provided a toy example bellow. I need Xbinned to be smoothed with a half-gaussian filter with std of 0.25 (or what ever). I'm pretty sure the half gaussian should be facing the forward time direction.
import numpy as np
X = np.random.randint(2, size=100) #example random process
bin_size = 5
Xbinned = []
for i in range(0, len(X)+1, bin_size):
Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
How to implement half-gaussian filtering
Scipy has a function called scipy.ndimage.gaussian_filter(). It nearly implements what we want here. Unfortunately, there's no option to use a half-gaussian instead of a gaussian. However, scipy is open-source, so we can just take the source code and modify it to be a half-gaussian.
I used this source code, and removed all of the parts that are not needed for this particular case. At the end, I had this:
import scipy.ndimage
def halfgaussian_kernel1d(sigma, radius):
"""
Computes a 1-D Half-Gaussian convolution kernel.
"""
sigma2 = sigma * sigma
x = np.arange(0, radius+1)
phi_x = np.exp(-0.5 / sigma2 * x ** 2)
phi_x = phi_x / phi_x.sum()
return phi_x
def halfgaussian_filter1d(input, sigma, axis=-1, output=None,
mode="constant", cval=0.0, truncate=4.0):
"""
Convolves a 1-D Half-Gaussian convolution kernel.
"""
sd = float(sigma)
# make the radius of the filter equal to truncate standard deviations
lw = int(truncate * sd + 0.5)
weights = halfgaussian_kernel1d(sigma, lw)
origin = -lw // 2
return scipy.ndimage.convolve1d(input, weights, axis, output, mode, cval, origin)
A short summary of how this works:
First, it generates a convolution kernel. It uses the formula e^(-1/2 * (x/sigma)^2) to generate the gaussian distribution. It keeps going until you're 4 standard deviations away from the center.
Next, it convolves that kernel against your signal. It adjusts the kernel to start at the current timestep instead of being centered on the current timestep.
Trying this on your signal, I get a result like this:
array([0.59979879, 0.6 , 0.40006707, 0.59993293, 0.79993293,
0.40013414, 0.20006707, 0.59986586, 0.40006707, 0.4 ,
0.99979879, 0.00033535, 0.59979879, 0.40006707, 0.00013414,
0.59979879, 0.20013414, 0.00006707, 0.19993293, 0.59986586])
Choice of standard deviation
If you pick a standard deviation of 0.25, that is going to have almost no effect on your signal. Here are the convolution weights it uses: [0.99966465 0.00033535]. In other words, this has less than a 0.1% effect on the signal.
I'd recommend using a larger sigma value.
Off by one error
Also, I want to point out the off-by-one error here:
for i in range(0, len(X)+1, bin_size):
Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
Numpy ranges are not inclusive, so a range of i to i+(bin_size-1) actually captures 4 elements, not 5.
To fix this, you can change it to this:
for i in range(0, len(X), bin_size):
Xbinned.append(X[i:i+bin_size].mean())
(Also, I fixed an off-by-one error in the loop specification and used a numpy shortcut for finding the mean.)

Is there a term for finding a minimal set of N points that approximate a curve?

I have spent some time answering How do I discretize a continuous function avoiding noise generation (see picture), and throughout, I felt like I was reinventing a bike.
Essentially, the problem is:
You are given a curve function - for any x, you can obtain y.
You want to approximate the curve using a piecewise-linear function with exactly N points, based on some error metric, e.g. distance to the curve, or minimize the absolute difference of the area under the curves (thanks to #QuangHoang for pointing out these are different).
Here's an example of a curve I approximated using 20 points:
Question: I've coded this up using repeated bisections. Is there a library I could have used? Is there a nice term of this problem type that I failed to google out? Does this generalize to a broader problem set?
Edit: upon request, here's how I've done it:
Google Colab
Data:
import numpy as np
from scipy.signal import gaussian
N_MOCK = 2000
# A nice-ish mock distribution
xs = np.linspace(-10.0, 10.0, num=N_MOCK)
sigmoid = 1 / (1 + np.exp(-xs))
gauss = gaussian(N_MOCK, std=N_MOCK / 10)
ys = gauss - sigmoid + 1
xs += 10
xs /= 20
Plotting:
import matplotlib.pyplot as plt
def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
"""A simplified version of the provided plotting function"""
# Setting Axis properties and titles
fig, ax = plt.subplots(figsize=(20, 4))
ax.set_title(plot_name)
# Plotting stuff
ax.plot(cont_time, cont_array, label="Continuous", color='#0000ff')
ax.plot(disc_time, disc_array, label="Discrete", color='#00ff00')
fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
Here's how I solved it, but I hope there's a more standard way:
import warnings
warnings.simplefilter('ignore', np.RankWarning)
def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
"""Assume a straight line between (x0,y0)->(x1,p1). Then sample the perfect line multiple times and compute the distance."""
straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
xs = np.linspace(x0, x1, num=integral_points)
ys = straight_line(xs)
perfect_ys = ideal_line(xs)
err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0) # Remove (x1 - x0) to only look at avg errors
return err
def discretize_bisect(xs, ys, bin_count):
"""Returns xs and ys of discrete points"""
# For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
# If it gives bad results, you can edges in many ways, e.g. with np.polyline or np.histogram_bin_edges
ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
new_xs = [xs[0], xs[-1]]
new_ys = [ys[0], ys[-1]]
while len(new_xs) < bin_count:
errors = []
for i in range(len(new_xs)-1):
err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
errors.append(err)
max_segment_id = np.argmax(errors)
new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
new_y = ideal_line(new_x)
new_xs.insert(max_segment_id+1, new_x)
new_ys.insert(max_segment_id+1, new_y)
return new_xs, new_ys
Run:
BIN_COUNT = 25
new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)
plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))
Note: while I prefer numpy, the answer can be a library in any language, or the name of the mathematical term. Please do not write lots of code, as I have done that myself already :)
Is there a nice term of this problem type that I failed to google out? Does this generalize to a broader problem set?
I know this problem as Expected Improvement (EI), or Bayesian optimization (permalink on archive.org). Given an expensive black box function for which you would like to find the global maximum, this algorithm yields the next position where to check for that maximum.
At first glance, this is different from your problem. You are looking for a way to approximate a curve with a small number of samples, while EI provides the places where the function has its most likely maximum. But both problems are equivalent insofar that you minimize an error function (which will change when you add another sample to your approximation) with the fewest possible points.
I believe this is the original research paper.
Jones, Donald & Schonlau, Matthias & Welch, William. (1998). Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization. 13. 455-492. 10.1023/A:1008306431147.
From section 1:
[...] the technique often requires the fewest function evaluations of all
competing methods. This is possible because, with typical engineering functions,
one can often interpolate and extrapolate quite accurately over large distances in
the design space. Intuitively, the method is able to ‘see’ obvious trends or patterns
in the data and ‘jump to conclusions’ instead of having to move step-by-step along
some trajectory.
As to why it is efficient:
[...] the response surface approach provides a credible stopping rule based
on the expected improvement from further searching. Such a stopping rule is possible because the statistical model provides confidence intervals on the function’s
value at unsampled points – and the ‘reasonableness’ of these confidence intervals
can be checked by model validation techniques.

Gaussian fit returning negative sigma

One of my algorithms performs automatic peak detection based on a Gaussian function, and then later determines the the edges based either on a multiplier (user setting) of the sigma or the 'full width at half maximum'. In the scenario where a user specified that he/she wants the peak limited at 2 Sigma, the algorithm takes -/+ 2*sigma from the peak center (mu). However, I noticed that the sigma returned by curve_fit can be negative, which is something that has been noticed before as can be seen here. However, as I determine the border by doing -/+ this can lead to the algorithm 'failing' (due to a - - scenario) as can be seen in the following code.
MVCE
#! /usr/bin/env python
from scipy.optimize import curve_fit
import bisect
import numpy as np
X = [16.4697402328,16.4701402404,16.4705402481,16.4709402557,16.4713402633,16.4717402709,16.4721402785,16.4725402862,16.4729402938,16.4733403014,16.473740309,16.4741403166,16.4745403243,16.4749403319,16.4753403395,16.4757403471,16.4761403547,16.4765403623,16.47694037,16.4773403776,16.4777403852,16.4781403928,16.4785404004,16.4789404081,16.4793404157,16.4797404233,16.4801404309,16.4805404385,16.4809404462,16.4813404538,16.4817404614,16.482140469,16.4825404766,16.4829404843,16.4833404919,16.4837404995,16.4841405071,16.4845405147,16.4849405224,16.48534053,16.4857405376,16.4861405452,16.4865405528,16.4869405604,16.4873405681,16.4877405757,16.4881405833,16.4885405909,16.4889405985,16.4893406062,16.4897406138,16.4901406214,16.490540629,16.4909406366,16.4913406443,16.4917406519,16.4921406595,16.4925406671,16.4929406747,16.4933406824,16.49374069,16.4941406976,16.4945407052,16.4949407128,16.4953407205,16.4957407281,16.4961407357,16.4965407433,16.4969407509,16.4973407585,16.4977407662,16.4981407738,16.4985407814,16.498940789,16.4993407966,16.4997408043,16.5001408119,16.5005408195,16.5009408271,16.5013408347,16.5017408424,16.50214085,16.5025408576,16.5029408652,16.5033408728,16.5037408805,16.5041408881,16.5045408957,16.5049409033,16.5053409109,16.5057409186,16.5061409262,16.5065409338,16.5069409414,16.507340949,16.5077409566,16.5081409643,16.5085409719,16.5089409795,16.5093409871,16.5097409947,16.5101410024,16.51054101,16.5109410176,16.5113410252,16.5117410328,16.5121410405,16.5125410481,16.5129410557,16.5133410633,16.5137410709,16.5141410786,16.5145410862,16.5149410938,16.5153411014,16.515741109,16.5161411166,16.5165411243,16.5169411319,16.5173411395,16.5177411471,16.5181411547,16.5185411624,16.51894117,16.5193411776,16.5197411852,16.5201411928,16.5205412005,16.5209412081,16.5213412157,16.5217412233,16.5221412309,16.5225412386,16.5229412462,16.5233412538,16.5237412614,16.524141269,16.5245412767,16.5249412843,16.5253412919,16.5257412995,16.5261413071,16.5265413147,16.5269413224,16.52734133,16.5277413376,16.5281413452,16.5285413528,16.5289413605,16.5293413681,16.5297413757,16.5301413833,16.5305413909,16.5309413986,16.5313414062,16.5317414138,16.5321414214,16.532541429,16.5329414367,16.5333414443,16.5337414519,16.5341414595,16.5345414671,16.5349414748,16.5353414824,16.53574149,16.5361414976,16.5365415052,16.5369415128,16.5373415205,16.5377415281,16.5381415357,16.5385415433,16.5389415509,16.5393415586,16.5397415662,16.5401415738,16.5405415814,16.540941589,16.5413415967,16.5417416043,16.5421416119,16.5425416195,16.5429416271,16.5433416348,16.5437416424,16.54414165,16.5445416576,16.5449416652,16.5453416729,16.5457416805,16.5461416881,16.5465416957,16.5469417033,16.5473417109,16.5477417186,16.5481417262,16.5485417338,16.5489417414,16.549341749,16.5497417567,16.5501417643,16.5505417719,16.5509417795,16.5513417871,16.5517417948,16.5521418024,16.55254181,16.5529418176,16.5533418252,16.5537418329,16.5541418405,16.5545418481,16.5549418557,16.5553418633,16.5557418709,16.5561418786,16.5565418862,16.5569418938,16.5573419014,16.557741909,16.5581419167,16.5585419243,16.5589419319,16.5593419395,16.5597419471,16.5601419548,16.5605419624,16.56094197,16.5613419776,16.5617419852,16.5621419929,16.5625420005,16.5629420081,16.5633420157,16.5637420233,16.564142031]
Y = [11579127.8554,11671781.7263,11764419.0191,11857026.0444,11949589.1124,12042094.5338,12134528.6188,12226877.6781,12319128.0219,12411265.9609,12503277.8053,12595149.8657,12686868.4525,12778419.8762,12869790.334,12960965.209,13051929.5278,13142668.3154,13233166.5969,13323409.3973,13413381.7417,13503068.6552,13592455.1627,13681526.2894,13770267.0602,13858662.5004,13946697.6348,14034357.4886,14121627.0868,14208491.4544,14294935.6166,14380944.5984,14466503.4248,14551597.1208,14636210.7116,14720329.3102,14803938.4081,14887023.5981,14969570.4732,15051564.6263,15132991.6503,15213837.1383,15294086.683,15373725.8775,15452740.3147,15531115.5875,15608837.2888,15685891.0116,15762262.3488,15837936.8934,15912900.2382,15987137.9762,16060635.7004,16133379.0036,16205353.4789,16276544.72,16346938.7731,16416522.8674,16485284.4226,16553210.8587,16620289.5956,16686508.0531,16751853.6511,16816313.8096,16879875.9485,16942527.4876,17004255.8468,17065048.446,17124892.7052,17183776.0442,17241685.8829,17298609.6412,17354534.739,17409448.5962,17463338.6327,17516192.2683,17567996.9463,17618741.7702,17668418.588,17717019.5043,17764536.6238,17810962.0514,17856287.8916,17900506.2493,17943609.2292,17985588.936,18026437.4744,18066146.9493,18104709.4653,18142117.1271,18178362.0396,18213436.3074,18247332.0352,18280041.3279,18311556.2901,18341869.0265,18370971.642,18398856.332,18425517.6188,18450952.493,18475158.064,18498131.4412,18519869.7341,18540370.0523,18559629.505,18577645.202,18594414.2525,18609933.7661,18624200.8523,18637212.6205,18648966.1802,18659458.6408,18668687.1119,18676648.7029,18683340.5233,18688759.6825,18692903.29,18695768.4553,18697352.5327,18697655.9558,18696681.2608,18694431.0245,18690907.8241,18686114.2363,18680052.838,18672726.2063,18664136.918,18654287.5501,18643180.6795,18630818.883,18617204.7377,18602340.8204,18586229.7081,18568873.9777,18550276.2061,18530438.9703,18509364.8471,18487056.4135,18463516.2464,18438747.4526,18412756.9228,18385553.1936,18357144.808,18327540.3094,18296748.2409,18264777.1456,18231635.5669,18197332.0479,18161875.1318,18125273.3619,18087535.2812,18048669.4331,18008684.3606,17967588.6071,17925390.7158,17882099.2297,17837722.6922,17792269.6464,17745748.6355,17698168.2027,17649537.512,17599868.3744,17549173.3069,17497464.8262,17444755.4492,17391057.6927,17336384.0736,17280747.1087,17224159.3148,17166633.2088,17108181.3075,17048816.1277,16988550.1864,16927396.0002,16865366.0862,16802472.961,16738729.1416,16674147.1447,16608739.4873,16542518.6861,16475497.2591,16407688.2541,16339106.0951,16269765.4262,16199680.8916,16128867.1358,16057338.8029,15985110.5372,15912196.9829,15838612.7844,15764372.5859,15689491.0316,15613982.7659,15537862.4329,15461144.6771,15383844.1425,15305975.4735,15227553.3143,15148592.3093,15069107.1026,14989112.3386,14908622.6595,14827652.5673,14746216.3337,14664328.209,14582002.4435,14499253.2874,14416094.9911,14332541.8049,14248607.9791,14164307.764,14079655.4098,13994665.1668,13909351.2855,13823728.016,13737809.6086,13651610.3137,13565144.3816,13478426.0625,13391469.6068,13304289.2646,13216899.2865,13129313.8865,13041546.3657,12953609.0623,12865514.2686,12777274.277,12688901.3798,12600407.8693,12511806.0378,12423108.1777,12334326.5812,12245473.5407,12156561.3486,12067602.297,11978608.6785,11889592.7852]
def gaussFunction(x, *p):
"""Define and return a Gaussian function.
This function returns the value of a Gaussian function, using the
A, mu and sigma value that is provided as *p.
Keyword arguments:
x -- number
p -- A, mu and sigma numbers
"""
A, mu, sigma = p
return A*np.exp(-(x-mu)**2/(2.*sigma**2))
newGaussX = np.linspace(10, 25, 2500*(X[-1]-X[0]))
p0 = [np.max(Y), X[np.argmax(Y)],0.1]
coeff, var_matrix = curve_fit(gaussFunction, X, Y, p0)
newGaussY = gaussFunction(newGaussX, *coeff)
print "Sigma is "+str(coeff[2])
# Original
low = bisect.bisect_left(newGaussX,coeff[1]-2*coeff[2])
high = bisect.bisect_right(newGaussX,coeff[1]+2*coeff[2])
print newGaussX[low], newGaussX[high]
# Absolute
low = bisect.bisect_left(newGaussX,coeff[1]-2*abs(coeff[2]))
high = bisect.bisect_right(newGaussX,coeff[1]+2*abs(coeff[2]))
print newGaussX[low], newGaussX[high]
Bottom-line, is taking the abs() of the sigma 'correct' or should this problem be solved in a different way?
You are fitting a function gaussFunction that does not care whether sigma is positive or negative. So whether you get a positive or negative result is mostly a matter of luck, and taking the absolute value of the returned sigma is fine. Also consider other possibilities:
(Suggested by Thomas Kühn): modify the model function so that it cares about the sign of sigma. Bringing it closer to the normalized Gaussian form would be enough: the formula A/np.sqrt(sigma)*np.exp(-(x-mu)**2/(2.*sigma**2)) would ensure that you get positive sigma only. A possible, mild downside is that the function takes a bit longer to compute.
Use the variance, sigma_squared, as a parameter:
A, mu, sigma_squared = p
return A*np.exp(-(x-mu)**2/(2.*sigma_squared))
This is probably easiest in terms of keeping the model equation simple. You will need to square your initial guess for that parameter, and take square root when you need sigma itself.
Aside: you hardcoded 0.1 as a guess for standard deviation. This probably should be based on data, like this:
peak = X[Y > np.exp(-0.5)*Y.max()]
guess_sigma = 0.5*(peak.max() - peak.min())
The idea is that within one standard deviation of the mean, the values of the Gaussian are greater than np.exp(-0.5) times the maximum value. So the first line locates this "peak" and the second takes half of its width as the guess for sigma.
For the above to work, X and Y should be already converted to NumPy arrays, e.g., X = np.array([16.4697402328,16.4701402404,..... This is a good idea in general: otherwise, you are making each NumPy method that receives X or Y make this conversion again.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful for this. It includes a Gaussian Model for curve-fitting that does normalize the Gaussian and also restricts sigma to be positive using a parameter transformation that is more gentle than abs(sigma). Your example would look like this
from lmfit.models import GaussianModel
xdat = np.array(X)
ydat = np.array(Y)
model = GaussianModel()
params = model.guess(ydat, x=xdat)
result = model.fit(ydat, params, x=xdat)
print(result.fit_report())
which will print a report with best-fit values and estimated uncertainties for all the parameters, and include FWHM.
[[Model]]
Model(gaussian)
[[Fit Statistics]]
# function evals = 31
# data points = 237
# variables = 3
chi-square = 95927408861.607
reduced chi-square = 409946191.716
Akaike info crit = 4703.055
Bayesian info crit = 4713.459
[[Variables]]
sigma: 0.04880178 +/- 1.57e-05 (0.03%) (init= 0.0314006)
center: 16.5174203 +/- 8.01e-06 (0.00%) (init= 16.51754)
amplitude: 2.2859e+06 +/- 586.4103 (0.03%) (init= 670578.1)
fwhm: 0.11491942 +/- 3.51e-05 (0.03%) == '2.3548200*sigma'
height: 1.8687e+07 +/- 910.0152 (0.00%) == '0.3989423*amplitude/max(1.e-15, sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(sigma, amplitude) = 0.949
The values for center +/- 2*sigma would be found with
xlo = result.params['center'].value - 2 * result.params['sigma'].value
xhi = result.params['center'].value + 2 * result.params['sigma'].value
You can use the result to evaluate the model with fitted parameters and different X values:
newGaussX = np.linspace(10, 25, 2500*(X[-1]-X[0]))
newGaussY = result.eval(x=newGaussX)
I would also recommend using numpy.where to find the location of center+/-2*sigma instead of bisect:
low = np.where(newGaussX > xlo)[0][0] # replace bisect_left
high = np.where(newGaussX <= xhi)[0][-1] + 1 # replace bisect_right
I got the same problem and I came up with a trivial but effective solution, which is basically to use the variance in the gaussian function definition instead of the standard deviation, since the variance is always positive. Then, you get the std_dev by square rooting the variance, obtaining a positive value i.e., the std_dev will always be positive. So, problem solved easily ;)
I mean, create the function this way:
def gaussian(x, Heigh, Mean, Variance):
return Heigh * np.exp(- (x-Mean)**2 / (2 * Variance))
Instead of:
def gaussian(x, Heigh, Mean, Std_dev):
return Heigh * np.exp(- (x-Mean)**2 / (2 * Std_dev**2))
And then do the fit as usual.

Producing an MLE for a pair of distributions in python

Ok, so my current curve fitting code has a step that uses scipy.stats to determine the right distribution based on the data,
distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
mles = []
for distribution in distributions:
pars = distribution.fit(data)
mle = distribution.nnlf(pars, data)
mles.append(mle)
results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]
for dist in sorted(zip(distributions, mles), key=lambda d: d[1]):
print dist
best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print 'Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1])
print [mod[0].name for mod in sorted(zip(distributions, mles), key=lambda d: d[1])]
Where data is a list of numeric values. This is working great so far for fitting unimodal distributions, confirmed in a script that randomly generates values from random distributions and uses curve_fit to redetermine the parameters.
Now I would like to make the code able to handle bimodal distributions, like the example below:
Is it possible to get a MLE for a pair of models from scipy.stats in order to determine if a particular pair of distributions are a good fit for the data?, something like
distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
distributionPairs = [[modelA.name, modelB.name] for modelA in distributions for modelB in distributions]
and use those pairs to get an MLE value of that pair of distributions fitting the data?
It's not a complete answer but it may help you to solve your problem. Let say you know your problem is generated by two densities.
A solution would be to use k-mean or EM algorithm.
Initalization.
You initialize your algorithm by affecting every observation to one or the other density. And you initialize the two densities (you initialize the parameters of the density, and one of the parameter in your case is "gaussian", "laplace", and so on...
Iteration.
Then, iterately, you run the two following steps :
Step 1.
Optimize the parameters assuming that the affectation of every point is right. You can now use any optimization solver. This step provide you with an estimation of the best two densities (with given parameter) that fit your data.
Step 2.
You classify every observation to one density or the other according to the greatest likelihood.
You repeat until convergence.
This is very well explained in this web-page
https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
If you do not know how many densities have generated your data, the problem is more difficult. You have to work with penalized classification problem, which is a bit harder.
Here is a coding example in an easy case : you know that your data comes from 2 different Gaussians (you don't know how many variables are generated from each density). In your case, you can adjust this code to loop on every possible pair of density (computationally longer, but would empirically work I presume)
import scipy.stats as st
import numpy as np
#hard coded data generation
data = np.random.normal(-3, 1, size = 1000)
data[600:] = np.random.normal(loc = 3, scale = 2, size=400)
#initialization
mu1 = -1
sigma1 = 1
mu2 = 1
sigma2 = 1
#criterion to stop iteration
epsilon = 0.1
stop = False
while not stop :
#step1
classification = np.zeros(len(data))
classification[st.norm.pdf(data, mu1, sigma1) > st.norm.pdf(data, mu2, sigma2)] = 1
mu1_old, mu2_old, sigma1_old, sigma2_old = mu1, mu2, sigma1, sigma2
#step2
pars1 = st.norm.fit(data[classification == 1])
mu1, sigma1 = pars1
pars2 = st.norm.fit(data[classification == 0])
mu2, sigma2 = pars2
#stopping criterion
stop = ((mu1_old - mu1)**2 + (mu2_old - mu2)**2 +(sigma1_old - sigma1)**2 +(sigma2_old - sigma2)**2) < epsilon
#result
print("The first density is gaussian :", mu1, sigma1)
print("The first density is gaussian :", mu2, sigma2)
print("A rate of ", np.mean(classification), "is classified in the first density")
Hope it helps.

gaussian sum filter for irregular spaced points

I have a set of points (x,y) as two vectors
x,y for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis. lets say for:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation in numpy/scipy for that, although it seems like a standard problem at first glance.
As the x values are not equally spaced I can't use the scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done going through furrier space and multiplying with the kernel, but I don't really know if this will be possible with irregular spaced data.
Thanks for any ideas
This will blow up for very large datasets, but the proper calculaiton you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
sigma = 0.1
delta_x = x_eval[:, None] - x
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
weights /= np.sum(weights, axis=1, keepdims=True)
y_eval = np.dot(weights, y)
plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said there, there is a simple two step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max()) # choose new x axis sampling
ny = np.interp(nx,x,y) # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
import scipy.ndimage.filters as filters
fx = filters.gaussian_filter1d(ny,sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter.
Based on #Jaime's answer I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the datapoints.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
"""Apply gaussian sum filter to data.
xdata, ydata : array
Arrays of x- and y-coordinates of data.
Must be 1d and have the same length.
xeval : array
Array of x-coordinates at which to evaluate the smoothed result
sigma : float
Standard deviation of the Gaussian to apply to each data point
Larger values yield a smoother curve.
null_thresh : float
For evaluation points far from data points, the estimate will be
based on very little data. If the total weight is below this threshold,
return np.nan at this location. Zero means always return an estimate.
The default of 0.6 corresponds to approximately one sigma away
from the nearest datapoint.
"""
# Distance between every combination of xdata and xeval
# each row corresponds to a value in xeval
# each col corresponds to a value in xdata
delta_x = xeval[:, None] - xdata
# Calculate weight of every value in delta_x using Gaussian
# Maximum weight is 1.0 where delta_x is 0
weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))
# Multiply each weight by every data point, and sum over data points
smoothed = np.dot(weights, ydata)
# Nullify the result when the total weight is below threshold
# This happens at evaluation points far from any data
# 1-sigma away from a data point has a weight of ~0.6
nan_mask = weights.sum(1) < null_thresh
smoothed[nan_mask] = np.nan
# Normalize by dividing by the total weight at each evaluation point
# Nullification above avoids divide by zero warning shere
smoothed = smoothed / weights.sum(1)
return smoothed

Categories

Resources