Why is there noise for higher bandwidth intensity distributions? - python

I have been trying to computationally evaluate the following E-field:
$E(x) = \int \sqrt{I(\lambda)}\,\cos\left(\frac{2\pi}{\lambda}x\right)\,d\lambda$
It is basically the sum of waves of a given wavelength, weighted by the square root of a Gaussian distribution $I(\lambda)$ over the wavelengths.
I compute it in Python by performing a quadrature integral for each value of $x$ with the scipy.integrate.quad function. The code is given below:
# Imports
import numpy as np
import scipy as sp
from scipy import integrate

# Parameters
mu = 0.635    # mean wavelength
sigma = 0.01  # std dev of wavelength distribution
# wl is wavelength
x_ara = np.arange(0, 1.4, 0.01)

# Limits of integration
lower, upper = mu - 4*sigma, mu + 4*sigma
if lower < 0:
    print('lower limit met')
    lower = 1e-15  # cannot evaluate wl = 0 due to the singularity in 2*pi/wl

# Functions
def Iprofile_func(wl, mu, sigma):
    profile = np.exp(-(((wl - mu) / (np.sqrt(2)*sigma))**2))
    return profile

def E_func(x_ara, wl, mu, sigma):
    return np.sqrt(Iprofile_func(wl, mu, sigma)) * np.cos(2*np.pi/wl * x_ara)

# Computation
field_ara = np.array([])
for x in x_ara:
    def E(wl):
        return E_func(x, wl, mu, sigma)
    field = sp.integrate.quad(E, lower, upper)[0]
    field_ara = np.append(field_ara, field)
I fixed the value of $\mu$ = 0.635 and performed the same computation for two values of $\sigma$: $\sigma$ = 0.01 and $\sigma$ = 0.2. The resulting arrays are plotted below; the upper plot shows the wavelength distribution, while the lower plot shows the computed field array:
Why does noise appear in the computed field when the value of sigma increases?

For large x, even small changes in lambda already lead to a rapidly oscillating integrand. At some point the numerical integration routine either takes very, very long to converge or does not use enough integration points, so the contributions from the individual points no longer cancel completely and show up as exactly the noise that you see. When I run the code I actually get a warning from scipy about reaching a limit ("IntegrationWarning: The maximum number of subdivisions (50) has been achieved.").
The good news: you know that for sufficiently large x the integral must go to zero. There is no need to compute it outside a reasonable range.
Example:
x = 10, mu = 0.635, sigma = 0.01
The integration bounds are mu+/-4sigma = [0.595, 0.675]
2Pi/0.595*10=105.6, 2Pi/0.675*10=93.08
That means roughly two oscillations of the integrand over the wavelength range at x=10.
x = 100, everything else the same
That means 20 oscillations of the integrand over the wavelength range.
x = 10, mu = 0.635, sigma = 0.1
The integration bounds are mu+/-4sigma=[0.235, 1.035]
2Pi/0.235*10=267.37, 2Pi/1.035*10=60.71
That means 33 oscillations of the integrand over the wavelength range already at x=10.
x = 100, everything else the same
That means 329 oscillations of the integrand over the wavelength range.
More and more integration points would be needed as x or sigma gets large, so for larger x there is no alternative to increasing the limit (the maximum number of subdivisions) in scipy.integrate.quad.
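For example, a minimal sketch of that last point, reusing the question's code above (limit=500 is just an illustrative choice; quad's limit keyword raises the default cap of 50 subintervals):

# Same loop as in the question, but allowing quad more adaptive
# subdivisions for the strongly oscillating integrand.
field_ara = np.array([])
for x in x_ara:
    field = sp.integrate.quad(lambda wl: E_func(x, wl, mu, sigma),
                              lower, upper, limit=500)[0]
    field_ara = np.append(field_ara, field)

Restricting x_ara to the range where the field is non-negligible keeps the run time manageable even with the larger limit.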

Related

Simulating expectation of continuous random variable

Currently I want to generate some samples and estimate the expectation and variance of the distribution from them.
Given the probability density function: f(x) = 2x for 0 <= x <= 1, and 0 otherwise.
I already found that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I have when simulating using python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is f(x) = 2x on [0, 1],
so the CDF is F(x) = x^2 on [0, 1].
Given that the CDF is an easily invertible function on the range [0,1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U^(1/2).
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third value printed is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always a positive quantity, but covariance can be either positive or negative. Consequently, it’s easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F^(-1)(U) and X' = F^(-1)(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F^(-1)(u_i) + F^(-1)(1 - u_i)) / 2, the expected value is E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance is Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, since the half-width scales as 1/sqrt(n), independent sampling would need more than an order of magnitude more data to achieve the same precision.
To approximate the integral of some function of x, say g(x), over S = [0, 1] using Monte Carlo simulation, you
1. generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1]);
2. calculate the arithmetic mean of g(x_i) over i = 1 to i = N, where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the square of the estimate of the expected value of X from this result to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y, E in this case are numpy arrays. This approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods. The second should be much faster.
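A minimal timing sketch of that comparison (the helper names and the somewhat smaller N are illustrative choices so it finishes quickly):

import timeit
import numpy as np

N = 10_000_000
X = np.random.uniform(size=N)

def with_lists():
    Y = [x * (2 * x) for x in X]     # Python-level loop over a list
    return sum(Y) / len(Y)

def with_arrays():
    Y = X * (2 * X)                  # vectorized numpy operation
    return Y.mean()

print(timeit.timeit(with_lists, number=1))   # seconds for the list version
print(timeit.timeit(with_arrays, number=1))  # seconds for the numpy version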

Applying a half-gaussian filter to binned time series data in python

I am binning some time series data and need to apply a half-normal filter to the binned data. How can I do this in Python? I've provided a toy example below. I need Xbinned to be smoothed with a half-Gaussian filter with a std of 0.25 (or whatever). I'm pretty sure the half-Gaussian should be facing the forward time direction.
import numpy as np

X = np.random.randint(2, size=100)  # example random process
bin_size = 5
Xbinned = []
for i in range(0, len(X)+1, bin_size):
    Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
How to implement half-gaussian filtering
Scipy has a function called scipy.ndimage.gaussian_filter(). It nearly implements what we want here. Unfortunately, there's no option to use a half-gaussian instead of a gaussian. However, scipy is open-source, so we can just take the source code and modify it to be a half-gaussian.
I used this source code, and removed all of the parts that are not needed for this particular case. At the end, I had this:
import numpy as np
import scipy.ndimage

def halfgaussian_kernel1d(sigma, radius):
    """
    Computes a 1-D Half-Gaussian convolution kernel.
    """
    sigma2 = sigma * sigma
    x = np.arange(0, radius+1)
    phi_x = np.exp(-0.5 / sigma2 * x ** 2)
    phi_x = phi_x / phi_x.sum()
    return phi_x

def halfgaussian_filter1d(input, sigma, axis=-1, output=None,
                          mode="constant", cval=0.0, truncate=4.0):
    """
    Convolves a 1-D Half-Gaussian convolution kernel.
    """
    sd = float(sigma)
    # make the radius of the filter equal to truncate standard deviations
    lw = int(truncate * sd + 0.5)
    weights = halfgaussian_kernel1d(sigma, lw)
    origin = -lw // 2
    return scipy.ndimage.convolve1d(input, weights, axis, output, mode, cval, origin)
A short summary of how this works:
First, it generates a convolution kernel. It uses the formula e^(-1/2 * (x/sigma)^2) to generate the gaussian distribution. It keeps going until you're 4 standard deviations away from the center.
Next, it convolves that kernel against your signal. It adjusts the kernel to start at the current timestep instead of being centered on the current timestep.
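A sketch of how the filter might be applied to the binned signal (using the corrected binning from the end of this answer and the sigma of 0.25 from the question):

X = np.random.randint(2, size=100)   # example random process from the question
Xbinned = np.array([X[i:i+5].mean() for i in range(0, len(X), 5)])
smoothed = halfgaussian_filter1d(Xbinned, sigma=0.25)
print(smoothed)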
Trying this on your signal, I get a result like this:
array([0.59979879, 0.6 , 0.40006707, 0.59993293, 0.79993293,
0.40013414, 0.20006707, 0.59986586, 0.40006707, 0.4 ,
0.99979879, 0.00033535, 0.59979879, 0.40006707, 0.00013414,
0.59979879, 0.20013414, 0.00006707, 0.19993293, 0.59986586])
Choice of standard deviation
If you pick a standard deviation of 0.25, that is going to have almost no effect on your signal. Here are the convolution weights it uses: [0.99966465 0.00033535]. In other words, this has less than a 0.1% effect on the signal.
I'd recommend using a larger sigma value.
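For comparison, a quick look at the kernel for a couple of sigma values (sigma=1.0 is just an illustrative alternative; the radius follows the truncate*sd + 0.5 rule from the code above):

print(halfgaussian_kernel1d(0.25, int(4.0*0.25 + 0.5)))  # [0.99966465 0.00033535]
print(halfgaussian_kernel1d(1.0, int(4.0*1.0 + 0.5)))    # roughly [0.57, 0.35, 0.08, 0.006, 0.0002]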
Off by one error
Also, I want to point out the off-by-one error here:
for i in range(0, len(X)+1, bin_size):
    Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
Numpy ranges are not inclusive, so a range of i to i+(bin_size-1) actually captures 4 elements, not 5.
To fix this, you can change it to this:
for i in range(0, len(X), bin_size):
    Xbinned.append(X[i:i+bin_size].mean())
(Also, I fixed an off-by-one error in the loop specification and used a numpy shortcut for finding the mean.)

How do I use the Monte Carlo method to find the uncertainties of a value?

I am trying to solve a physics equation using a Monte Carlo simulation, which I know is a long way round (I just need to use it to learn about the method).
I have around 5 values, one of which is time, and I have the random uncertainties (errors) for each of these values. For example, the mass is (10 +- 0.1) kg, where the error is 0.1 kg.
How do I actually find the distribution of measurements I would get if I performed this experiment, say, 5,000 times?
I know I could make two arrays of errors, and maybe put them in a function. But what am I supposed to do then? Do I put the errors in the equation, add the answer to the arrays, then put the changed array values in the equation and repeat this a thousand times? Or do I actually calculate the real value and add it to the array?
Please can you help me understand this.
Edit:
The problem I have is basically a sphere of density ds falling a distance l in time t through a liquid of density dl; this goes into an equation for viscosity, and I need to find the distribution of viscosity measurements.
The exact equation shouldn't matter; whatever equation I have, I should be able to use a method like this to find the distribution of measurements, whether I'm dropping a ball out of a window or whatever.
Basic Monte Carlo is very straightforward. The following might get you started:
import random, statistics, math

# The following function generates a
# random observation of f(x) where
# x is a vector of independent normal variables
# whose means are given by the vector mus
# and whose standard deviations are given by sigmas
def sample(f, mus, sigmas):
    x = (random.gauss(m, s) for m, s in zip(mus, sigmas))
    return f(*x)

# do n times, returning the sample mean and standard deviation:
def monte_carlo(f, mus, sigmas, n):
    samples = [sample(f, mus, sigmas) for _ in range(n)]
    return (statistics.mean(samples), statistics.stdev(samples))

# for testing purposes:
def V(r, h):
    return math.pi * r**2 * h

print(monte_carlo(V, (2, 4), (0.02, 0.01), 1000))
With output:
(50.2497301631037, 1.0215188736786902)
OK, let's try a simple example - you have an air gun which shoots balls with mass m and velocity v, and you have to measure the kinetic energy
E = m*v^2 / 2
There is a distribution of velocities - Gaussian, with a mean value of 10 and a standard deviation of 1.
There is a distribution of masses as well - but we cannot use a Gaussian (it could produce negative masses); let's assume it is a truncated normal, with a low limit of 1 so that there are no negative values, with loc equal to 5 and scale equal to 3.
So what we will do: sample a velocity, sample a mass, use them to compute the kinetic energy, do it many times, build the energy distribution, get its mean value and standard deviation, draw graphs, etc.
Some simple Python code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import truncnorm

def sampleMass(n, low, high, mu, sigma):
    """
    Sample n mass values from truncated normal
    """
    tn = truncnorm(low, high, loc=mu, scale=sigma)
    return tn.rvs(n)

def sampleVelocity(n, mu, sigma):
    return np.random.normal(loc=mu, scale=sigma, size=n)

mass_low = 1.
mass_high = 1000.
mass_mu = 5.
mass_sigma = 3.0

vel_mu = 10.0
vel_sigma = 1.0

nof_trials = 100000

mass = sampleMass(nof_trials, mass_low, mass_high, mass_mu, mass_sigma)  # get samples of mass
vel = sampleVelocity(nof_trials, vel_mu, vel_sigma)  # get samples of velocity

kinenergy = 0.5 * mass * vel*vel  # distribution of kinetic energy

print("Mean value and stddev of the final distribution")
print(np.mean(kinenergy))
print(np.std(kinenergy))

print("Min/max values of the final distribution")
print(np.min(kinenergy))
print(np.max(kinenergy))

# plot histogram of the distribution
n, bins, patches = plt.hist(kinenergy, 100, density=True, facecolor='green', alpha=0.75)
plt.xlabel('Energy')
plt.ylabel('Probability')
plt.title('Kinetic energy distribution')
plt.grid(True)
plt.show()
with output like
Mean value and stddev of the final distribution
483.8162951263243
118.34049421853899
Min/max values of the final distribution
128.86671038372
1391.400187563612

Gaussian fit returning negative sigma

One of my algorithms performs automatic peak detection based on a Gaussian function, and then later determines the edges based either on a multiplier (user setting) of the sigma or on the 'full width at half maximum'. In the scenario where a user specifies that he/she wants the peak limited at 2 sigma, the algorithm takes -/+ 2*sigma from the peak center (mu). However, I noticed that the sigma returned by curve_fit can be negative, which is something that has been noticed before as can be seen here. However, as I determine the borders by doing -/+, this can lead to the algorithm 'failing' (since with a negative sigma, mu - 2*sigma ends up above mu + 2*sigma), as can be seen in the following code.
MVCE
#! /usr/bin/env python
from scipy.optimize import curve_fit
import bisect
import numpy as np
X = [16.4697402328,16.4701402404,16.4705402481,16.4709402557,16.4713402633,16.4717402709,16.4721402785,16.4725402862,16.4729402938,16.4733403014,16.473740309,16.4741403166,16.4745403243,16.4749403319,16.4753403395,16.4757403471,16.4761403547,16.4765403623,16.47694037,16.4773403776,16.4777403852,16.4781403928,16.4785404004,16.4789404081,16.4793404157,16.4797404233,16.4801404309,16.4805404385,16.4809404462,16.4813404538,16.4817404614,16.482140469,16.4825404766,16.4829404843,16.4833404919,16.4837404995,16.4841405071,16.4845405147,16.4849405224,16.48534053,16.4857405376,16.4861405452,16.4865405528,16.4869405604,16.4873405681,16.4877405757,16.4881405833,16.4885405909,16.4889405985,16.4893406062,16.4897406138,16.4901406214,16.490540629,16.4909406366,16.4913406443,16.4917406519,16.4921406595,16.4925406671,16.4929406747,16.4933406824,16.49374069,16.4941406976,16.4945407052,16.4949407128,16.4953407205,16.4957407281,16.4961407357,16.4965407433,16.4969407509,16.4973407585,16.4977407662,16.4981407738,16.4985407814,16.498940789,16.4993407966,16.4997408043,16.5001408119,16.5005408195,16.5009408271,16.5013408347,16.5017408424,16.50214085,16.5025408576,16.5029408652,16.5033408728,16.5037408805,16.5041408881,16.5045408957,16.5049409033,16.5053409109,16.5057409186,16.5061409262,16.5065409338,16.5069409414,16.507340949,16.5077409566,16.5081409643,16.5085409719,16.5089409795,16.5093409871,16.5097409947,16.5101410024,16.51054101,16.5109410176,16.5113410252,16.5117410328,16.5121410405,16.5125410481,16.5129410557,16.5133410633,16.5137410709,16.5141410786,16.5145410862,16.5149410938,16.5153411014,16.515741109,16.5161411166,16.5165411243,16.5169411319,16.5173411395,16.5177411471,16.5181411547,16.5185411624,16.51894117,16.5193411776,16.5197411852,16.5201411928,16.5205412005,16.5209412081,16.5213412157,16.5217412233,16.5221412309,16.5225412386,16.5229412462,16.5233412538,16.5237412614,16.524141269,16.5245412767,16.5249412843,16.5253412919,16.5257412995,16.5261413071,16.5265413147,16.5269413224,16.52734133,16.5277413376,16.5281413452,16.5285413528,16.5289413605,16.5293413681,16.5297413757,16.5301413833,16.5305413909,16.5309413986,16.5313414062,16.5317414138,16.5321414214,16.532541429,16.5329414367,16.5333414443,16.5337414519,16.5341414595,16.5345414671,16.5349414748,16.5353414824,16.53574149,16.5361414976,16.5365415052,16.5369415128,16.5373415205,16.5377415281,16.5381415357,16.5385415433,16.5389415509,16.5393415586,16.5397415662,16.5401415738,16.5405415814,16.540941589,16.5413415967,16.5417416043,16.5421416119,16.5425416195,16.5429416271,16.5433416348,16.5437416424,16.54414165,16.5445416576,16.5449416652,16.5453416729,16.5457416805,16.5461416881,16.5465416957,16.5469417033,16.5473417109,16.5477417186,16.5481417262,16.5485417338,16.5489417414,16.549341749,16.5497417567,16.5501417643,16.5505417719,16.5509417795,16.5513417871,16.5517417948,16.5521418024,16.55254181,16.5529418176,16.5533418252,16.5537418329,16.5541418405,16.5545418481,16.5549418557,16.5553418633,16.5557418709,16.5561418786,16.5565418862,16.5569418938,16.5573419014,16.557741909,16.5581419167,16.5585419243,16.5589419319,16.5593419395,16.5597419471,16.5601419548,16.5605419624,16.56094197,16.5613419776,16.5617419852,16.5621419929,16.5625420005,16.5629420081,16.5633420157,16.5637420233,16.564142031]
Y = [11579127.8554,11671781.7263,11764419.0191,11857026.0444,11949589.1124,12042094.5338,12134528.6188,12226877.6781,12319128.0219,12411265.9609,12503277.8053,12595149.8657,12686868.4525,12778419.8762,12869790.334,12960965.209,13051929.5278,13142668.3154,13233166.5969,13323409.3973,13413381.7417,13503068.6552,13592455.1627,13681526.2894,13770267.0602,13858662.5004,13946697.6348,14034357.4886,14121627.0868,14208491.4544,14294935.6166,14380944.5984,14466503.4248,14551597.1208,14636210.7116,14720329.3102,14803938.4081,14887023.5981,14969570.4732,15051564.6263,15132991.6503,15213837.1383,15294086.683,15373725.8775,15452740.3147,15531115.5875,15608837.2888,15685891.0116,15762262.3488,15837936.8934,15912900.2382,15987137.9762,16060635.7004,16133379.0036,16205353.4789,16276544.72,16346938.7731,16416522.8674,16485284.4226,16553210.8587,16620289.5956,16686508.0531,16751853.6511,16816313.8096,16879875.9485,16942527.4876,17004255.8468,17065048.446,17124892.7052,17183776.0442,17241685.8829,17298609.6412,17354534.739,17409448.5962,17463338.6327,17516192.2683,17567996.9463,17618741.7702,17668418.588,17717019.5043,17764536.6238,17810962.0514,17856287.8916,17900506.2493,17943609.2292,17985588.936,18026437.4744,18066146.9493,18104709.4653,18142117.1271,18178362.0396,18213436.3074,18247332.0352,18280041.3279,18311556.2901,18341869.0265,18370971.642,18398856.332,18425517.6188,18450952.493,18475158.064,18498131.4412,18519869.7341,18540370.0523,18559629.505,18577645.202,18594414.2525,18609933.7661,18624200.8523,18637212.6205,18648966.1802,18659458.6408,18668687.1119,18676648.7029,18683340.5233,18688759.6825,18692903.29,18695768.4553,18697352.5327,18697655.9558,18696681.2608,18694431.0245,18690907.8241,18686114.2363,18680052.838,18672726.2063,18664136.918,18654287.5501,18643180.6795,18630818.883,18617204.7377,18602340.8204,18586229.7081,18568873.9777,18550276.2061,18530438.9703,18509364.8471,18487056.4135,18463516.2464,18438747.4526,18412756.9228,18385553.1936,18357144.808,18327540.3094,18296748.2409,18264777.1456,18231635.5669,18197332.0479,18161875.1318,18125273.3619,18087535.2812,18048669.4331,18008684.3606,17967588.6071,17925390.7158,17882099.2297,17837722.6922,17792269.6464,17745748.6355,17698168.2027,17649537.512,17599868.3744,17549173.3069,17497464.8262,17444755.4492,17391057.6927,17336384.0736,17280747.1087,17224159.3148,17166633.2088,17108181.3075,17048816.1277,16988550.1864,16927396.0002,16865366.0862,16802472.961,16738729.1416,16674147.1447,16608739.4873,16542518.6861,16475497.2591,16407688.2541,16339106.0951,16269765.4262,16199680.8916,16128867.1358,16057338.8029,15985110.5372,15912196.9829,15838612.7844,15764372.5859,15689491.0316,15613982.7659,15537862.4329,15461144.6771,15383844.1425,15305975.4735,15227553.3143,15148592.3093,15069107.1026,14989112.3386,14908622.6595,14827652.5673,14746216.3337,14664328.209,14582002.4435,14499253.2874,14416094.9911,14332541.8049,14248607.9791,14164307.764,14079655.4098,13994665.1668,13909351.2855,13823728.016,13737809.6086,13651610.3137,13565144.3816,13478426.0625,13391469.6068,13304289.2646,13216899.2865,13129313.8865,13041546.3657,12953609.0623,12865514.2686,12777274.277,12688901.3798,12600407.8693,12511806.0378,12423108.1777,12334326.5812,12245473.5407,12156561.3486,12067602.297,11978608.6785,11889592.7852]
def gaussFunction(x, *p):
    """Define and return a Gaussian function.

    This function returns the value of a Gaussian function, using the
    A, mu and sigma value that is provided as *p.

    Keyword arguments:
    x -- number
    p -- A, mu and sigma numbers
    """
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))
newGaussX = np.linspace(10, 25, int(2500*(X[-1]-X[0])))
p0 = [np.max(Y), X[np.argmax(Y)], 0.1]
coeff, var_matrix = curve_fit(gaussFunction, X, Y, p0)
newGaussY = gaussFunction(newGaussX, *coeff)

print("Sigma is " + str(coeff[2]))

# Original
low = bisect.bisect_left(newGaussX, coeff[1]-2*coeff[2])
high = bisect.bisect_right(newGaussX, coeff[1]+2*coeff[2])
print(newGaussX[low], newGaussX[high])

# Absolute
low = bisect.bisect_left(newGaussX, coeff[1]-2*abs(coeff[2]))
high = bisect.bisect_right(newGaussX, coeff[1]+2*abs(coeff[2]))
print(newGaussX[low], newGaussX[high])
Bottom line: is taking the abs() of the sigma 'correct', or should this problem be solved in a different way?
You are fitting a function gaussFunction that does not care whether sigma is positive or negative. So whether you get a positive or negative result is mostly a matter of luck, and taking the absolute value of the returned sigma is fine. Also consider other possibilities:
(Suggested by Thomas Kühn): modify the model function so that it cares about the sign of sigma. Bringing it closer to the normalized Gaussian form would be enough: the formula A/np.sqrt(sigma)*np.exp(-(x-mu)**2/(2.*sigma**2)) would ensure that you get positive sigma only. A possible, mild downside is that the function takes a bit longer to compute.
Use the variance, sigma_squared, as a parameter:
A, mu, sigma_squared = p
return A*np.exp(-(x-mu)**2/(2.*sigma_squared))
This is probably easiest in terms of keeping the model equation simple. You will need to square your initial guess for that parameter, and take square root when you need sigma itself.
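A minimal sketch of that variance parameterization, reusing the data and imports from the MVCE above (the name gaussFunctionVar is just an illustrative choice):

def gaussFunctionVar(x, *p):
    # Gaussian parameterized by the variance instead of the standard deviation
    A, mu, sigma_squared = p
    return A*np.exp(-(x-mu)**2/(2.*sigma_squared))

p0 = [np.max(Y), X[np.argmax(Y)], 0.1**2]          # square the initial sigma guess
coeff, var_matrix = curve_fit(gaussFunctionVar, X, Y, p0)
sigma = np.sqrt(coeff[2])                          # take the square root to recover sigma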
Aside: you hardcoded 0.1 as a guess for standard deviation. This probably should be based on data, like this:
peak = X[Y > np.exp(-0.5)*Y.max()]
guess_sigma = 0.5*(peak.max() - peak.min())
The idea is that within one standard deviation of the mean, the values of the Gaussian are greater than np.exp(-0.5) times the maximum value. So the first line locates this "peak" and the second takes half of its width as the guess for sigma.
For the above to work, X and Y should be already converted to NumPy arrays, e.g., X = np.array([16.4697402328,16.4701402404,..... This is a good idea in general: otherwise, you are making each NumPy method that receives X or Y make this conversion again.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful for this. It includes a Gaussian Model for curve-fitting that does normalize the Gaussian and also restricts sigma to be positive using a parameter transformation that is more gentle than abs(sigma). Your example would look like this
import numpy as np
from lmfit.models import GaussianModel
xdat = np.array(X)
ydat = np.array(Y)
model = GaussianModel()
params = model.guess(ydat, x=xdat)
result = model.fit(ydat, params, x=xdat)
print(result.fit_report())
which will print a report with best-fit values and estimated uncertainties for all the parameters, and include FWHM.
[[Model]]
Model(gaussian)
[[Fit Statistics]]
# function evals = 31
# data points = 237
# variables = 3
chi-square = 95927408861.607
reduced chi-square = 409946191.716
Akaike info crit = 4703.055
Bayesian info crit = 4713.459
[[Variables]]
sigma: 0.04880178 +/- 1.57e-05 (0.03%) (init= 0.0314006)
center: 16.5174203 +/- 8.01e-06 (0.00%) (init= 16.51754)
amplitude: 2.2859e+06 +/- 586.4103 (0.03%) (init= 670578.1)
fwhm: 0.11491942 +/- 3.51e-05 (0.03%) == '2.3548200*sigma'
height: 1.8687e+07 +/- 910.0152 (0.00%) == '0.3989423*amplitude/max(1.e-15, sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(sigma, amplitude) = 0.949
The values for center +/- 2*sigma would be found with
xlo = result.params['center'].value - 2 * result.params['sigma'].value
xhi = result.params['center'].value + 2 * result.params['sigma'].value
You can use the result to evaluate the model with fitted parameters and different X values:
newGaussX = np.linspace(10, 25, 2500*(X[-1]-X[0]))
newGaussY = result.eval(x=newGaussX)
I would also recommend using numpy.where to find the location of center+/-2*sigma instead of bisect:
low = np.where(newGaussX > xlo)[0][0] # replace bisect_left
high = np.where(newGaussX <= xhi)[0][-1] + 1 # replace bisect_right
I ran into the same problem and came up with a trivial but effective solution: use the variance in the Gaussian function definition instead of the standard deviation, since the variance is always positive. You then get the std_dev by taking the square root of the fitted variance, so the std_dev will always be positive. Problem solved ;)
I mean, create the function this way:
def gaussian(x, Height, Mean, Variance):
    return Height * np.exp(- (x-Mean)**2 / (2 * Variance))
Instead of:
def gaussian(x, Height, Mean, Std_dev):
    return Height * np.exp(- (x-Mean)**2 / (2 * Std_dev**2))
And then do the fit as usual.

gaussian sum filter for irregular spaced points

I have a set of points (x, y) as two vectors x and y, for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis, let's say for:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation of it in numpy/scipy, although it seems like a standard problem at first glance.
As the x values are not equally spaced, I can't use scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done by going through Fourier space and multiplying with the kernel, but I don't really know if that is possible with irregularly spaced data.
Thanks for any ideas
This will blow up for very large datasets, but the proper calculation you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
sigma = 0.1
delta_x = x_eval[:, None] - x
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
weights /= np.sum(weights, axis=1, keepdims=True)
y_eval = np.dot(weights, y)
plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said, there is a simple two-step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max()) # choose new x axis sampling
ny = np.interp(nx,x,y) # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
from scipy import ndimage
fx = ndimage.gaussian_filter1d(ny, sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter.
Based on #Jaime's answer I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the datapoints.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
import numpy as np

def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
    """Apply gaussian sum filter to data.

    xdata, ydata : array
        Arrays of x- and y-coordinates of data.
        Must be 1d and have the same length.
    xeval : array
        Array of x-coordinates at which to evaluate the smoothed result
    sigma : float
        Standard deviation of the Gaussian to apply to each data point
        Larger values yield a smoother curve.
    null_thresh : float
        For evaluation points far from data points, the estimate will be
        based on very little data. If the total weight is below this threshold,
        return np.nan at this location. Zero means always return an estimate.
        The default of 0.6 corresponds to approximately one sigma away
        from the nearest datapoint.
    """
    # Distance between every combination of xdata and xeval
    # each row corresponds to a value in xeval
    # each col corresponds to a value in xdata
    delta_x = xeval[:, None] - xdata

    # Calculate weight of every value in delta_x using Gaussian
    # Maximum weight is 1.0 where delta_x is 0
    weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))

    # Multiply each weight by every data point, and sum over data points
    smoothed = np.dot(weights, ydata)

    # Nullify the result when the total weight is below threshold
    # This happens at evaluation points far from any data
    # 1-sigma away from a data point has a weight of ~0.6
    nan_mask = weights.sum(1) < null_thresh
    smoothed[nan_mask] = np.nan

    # Normalize by dividing by the total weight at each evaluation point
    # Nullification above avoids divide by zero warning here
    smoothed = smoothed / weights.sum(1)

    return smoothed
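A quick usage sketch on the kind of data from the question (the sigma value is just an illustrative choice):

np.random.seed(0)                      # for repeatability
x = np.sort(np.random.rand(30))
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)

y_eval = gaussian_sum_smooth(x, y, x_eval, sigma=0.1)
print(y_eval)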
