Simulating expectation of continuous random variable - python
I want to generate samples of a continuous random variable and estimate its expectation and variance.
Given the probability density function: f(x) = {2x, 0 <= x <= 1; 0 otherwise}
I already found that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I get when I simulate in Python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is:
f(x) = 2x for 0 <= x <= 1, and 0 otherwise,
so the CDF is:
F(x) = 0 for x < 0, x^2 for 0 <= x <= 1, and 1 for x > 1.
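If you want to double-check that algebra, here is a minimal symbolic sketch; it assumes the sympy package, an extra dependency not needed for anything else here:
import sympy as sp

x, u = sp.symbols("x u", positive=True)
F = sp.integrate(2 * x, (x, 0, x))      # CDF of the density 2x on [0, 1]: x**2
print(F, sp.solve(sp.Eq(F, u), x))      # x**2 [sqrt(u)] -> the inverse CDF is sqrt(u)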
Given that the CDF is an easily invertible function for the range [0, 1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U^(1/2), i.e. X = sqrt(U).
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data: calculate the mean and variance, plot histograms, use them in simulation models, or whatever.
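For instance, a quick goodness-of-fit check against the known CDF F(x) = x^2 can confirm the sampler. This is just a sketch and assumes scipy is available as an extra dependency:
import numpy as np
from scipy import stats

X = np.sqrt(np.random.uniform(size=100_000))
print(stats.kstest(X, lambda x: x ** 2))   # a small p-value would signal a mismatch with the target CDF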
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces a histogram that closely matches the target density f(x) = 2x (plot not reproduced here).
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third column is the confidence interval half-width.
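If you want to go the other way and choose a sample size that achieves a target half-width h, you can rearrange the formula as n = (1.96 * s / h)^2. A minimal sketch, where the target value 0.0005 is purely illustrative:
import numpy as np

X = np.sqrt(np.random.uniform(size=100_000))   # pilot sample, generated as above
target_half_width = 0.0005                     # illustrative target
n_required = int(np.ceil((1.96 * np.std(X) / target_half_width) ** 2))
print(n_required)                              # on the order of 850,000 for this density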
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always non-negative, but covariance can be either positive or negative. Consequently, it's easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
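As a quick numerical sanity check of the general identity, here is a small sketch; the negative slope is chosen just so the covariance term matters:
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = -0.5 * x + rng.normal(size=100_000)     # deliberately negatively correlated with x
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)                             # identical up to floating-point error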
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F^(-1)(U) and X' = F^(-1)(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F^(-1)(U) + F^(-1)(1 - U)) / 2, the expected value is E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance is Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
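For this particular distribution you can see the negative covariance directly with a one-off numerical check (the seed is arbitrary):
import numpy as np

u = np.random.default_rng(2).uniform(size=100_000)
print(np.cov(np.sqrt(u), np.sqrt(1.0 - u))[0, 1])   # roughly -0.05, i.e. clearly negative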
To compare antithetic results head-to-head with independent sampling on a fair basis, we keep the original sample size but split it: half of the data are generated by the inverse transform of the U's, and the other half by antithetic pairing using the corresponding 1 - U's. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
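The "order of magnitude" claim follows from the square-root relationship between sample size and half-width; a quick back-of-the-envelope check using the two half-widths reported above:
ratio = 0.0014551397290634852 / 0.0003811869837216061
print(ratio, ratio ** 2)   # about 3.8, so roughly 14-15 times the data for equal precision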
To approximate the integral of some function of x, say, g(x), over S = [0, 1], using Monte Carlo simulation, you
generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1])
calculate the arithmetic mean of g(x_i) over i = 1 to i = N where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
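A minimal generic sketch of this recipe (the helper name mc_integral_01 is just illustrative, not from the answer):
import numpy as np

def mc_integral_01(g, n=100_000, rng=None):
    """Approximate the integral of g over [0, 1] as the mean of g at uniform samples."""
    rng = rng or np.random.default_rng()
    return np.mean(g(rng.uniform(size=n)))

# For this problem, g(x) = x * f(x) = x * 2x, and the integral is E(X) = 2/3.
print(mc_integral_01(lambda x: x * (2 * x)))   # about 0.667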
The expected value of a continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the result of this by the square of the estimate of the expected value of X to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y and E in this case are NumPy arrays, and this approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods; the second should be much faster.
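A minimal timing sketch of that comparison (using time.perf_counter; N is reduced to 10 million here so the list version finishes in a reasonable time):
import time
import numpy as np

N = 10_000_000
X = np.random.uniform(size=N)

t0 = time.perf_counter()
Y_list = [x * (2 * x) for x in X]    # Python-level loop over every element
t1 = time.perf_counter()
Y_arr = X * (2 * X)                  # one vectorized NumPy expression
t2 = time.perf_counter()

print(f"list comprehension: {t1 - t0:.1f} s, numpy: {t2 - t1:.3f} s")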