My Python programming problem is the following:
I want to create an array of measurement results. Each result can be described as a normal distribution for which the mean value is the measurement result itself and the standard deviation is its uncertainty.
Pseudo code could be:
x1 = N(result1, unc1)
x2 = N(result2, unc2)
...
x = array(x1, x2, ..., xN)
Then I would like to calculate the FFT of x:
f = numpy.fft.fft(x)
What I want is that the uncertainty of the measurements contained in x is propagated through the FFT calculation so that f is an array of amplitudes along with their uncertainty like this:
f = (a +/- unc(a), b +/- unc(b), ...)
Can you suggest a way to do this?
Each Fourier coefficient computed by the discrete Fourier transform
of the array x is a linear combination of the elements of x; see
the formula for X_k on the Wikipedia page on the discrete Fourier transform,
which I'll write as
X_k = sum_(n=0)^(n=N-1) [ x_n * exp(-i*2*pi*k*n/N) ]
(That is, X is the discrete Fourier transform of x.)
If the measurements x_n are independent and normally distributed with mean mu_n and variance sigma_n**2,
then a little bit of algebra shows that the variance of X_k is the sum
of the variances of the x_n (each complex exponential has magnitude 1, so it leaves the variance of its term unchanged):
Var(X_k) = sum_(n=0)^(n=N-1) sigma_n**2
In other words, the variance is the same for each Fourier coefficient;
it is the sum of the variances of the measurements in x.
Using your notation, where unc(z) is the standard deviation of z,
unc(X_0) = unc(X_1) = ... = unc(X_(N-1)) = sqrt(unc(x1)**2 + unc(x2)**2 + ...)
(Note that the distribution of the magnitude of X_k is the Rice distribution.)
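If all you need are the propagated one-sigma uncertainties (rather than full distributions), you can attach them to the FFT output directly. Here's a minimal sketch with made-up measurement values and uncertainties:
import numpy as np

# Placeholder measurements and their one-sigma uncertainties.
x = np.array([1.2, 0.7, -0.3, 2.1])
unc_x = np.array([0.05, 0.10, 0.02, 0.07])

f = np.fft.fft(x)
# Every Fourier coefficient has the same propagated standard deviation.
unc_f = np.sqrt(np.sum(unc_x**2))
for k, X_k in enumerate(f):
    print(f"X_{k} = {X_k.real:+.3f}{X_k.imag:+.3f}j +/- {unc_f:.3f}")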
Here's a script that demonstrates this result. In this example, the standard
deviation of the x values increases linearly from 0.01 to 0.5.
import numpy as np
from numpy.fft import fft
import matplotlib.pyplot as plt
np.random.seed(12345)
n = 16
# Create 'x', the vector of measured values.
t = np.linspace(0, 1, n)
x = 0.25*t - 0.2*t**2 + 1.25*np.cos(3*np.pi*t) + 0.8*np.cos(7*np.pi*t)
x[:n//3] += 3.0
x[::4] -= 0.25
x[::3] += 0.2
# Compute the Fourier transform of x.
f = fft(x)
num_samples = 5000000
# Suppose the std. dev. of the 'x' measurements increases linearly
# from 0.01 to 0.5:
sigma = np.linspace(0.01, 0.5, n)
# Generate 'num_samples' arrays of the form 'x + noise', where the standard
# deviation of the noise for each coefficient in 'x' is given by 'sigma'.
xn = x + sigma*np.random.randn(num_samples, n)
fn = fft(xn, axis=-1)
print("Sum of input variances: %8.5f" % (sigma**2).sum())
print()
print("Variances of Fourier coefficients:")
np.set_printoptions(precision=5)
print(fn.var(axis=0))
# Plot the Fourier coefficients of the first 800 arrays.
num_plot = min(num_samples, 800)
fnf = fn[:num_plot].ravel()
clr = "#4080FF"
plt.plot(fnf.real, fnf.imag, 'o', color=clr, mec=clr, ms=1, alpha=0.3)
plt.plot(f.real, f.imag, 'kD', ms=4)
plt.grid(True)
plt.axis('equal')
plt.title("Fourier Coefficients")
plt.xlabel("$\Re(X_k)$")
plt.ylabel("$\Im(X_k)$")
plt.show()
The printed output is
Sum of input variances: 1.40322
Variances of Fourier coefficients:
[ 1.40357 1.40288 1.40331 1.40206 1.40231 1.40302 1.40282 1.40358
1.40376 1.40358 1.40282 1.40302 1.40231 1.40206 1.40331 1.40288]
As expected, the sample variances of the Fourier coefficients are
all (approximately) the same as the sum of the measurement variances.
Here's the plot generated by the script. The black diamonds are the
Fourier coefficients of a single x vector. The blue dots are the
Fourier coefficients of 800 realizations of x + noise. You can see that
the point clouds around each Fourier coefficient are roughly symmetric
and all the same "size" (except, of course, for the real coefficients,
which show up in this plot as horizontal lines on the real axis).
I'm trying to interpolate a function at arbitrary points, and I have the function values at the Chebyshev extreme points. I use the real part of the Fast Fourier Transform output to compute the Chebyshev coefficients. Then I scale them by 2/N and use numpy's polynomial library to evaluate the series of Chebyshev polynomials at a set of points. This produces the wrong function approximation. Where am I going wrong?
import numpy as np
import matplotlib.pyplot as plt
# Define the number of Chebyshev extreme points
N = 10

# Define the function to be approximated
def f(x):
    return x**2

# Evaluate the function at the Chebyshev extreme points
x = np.cos(np.arange(N) * np.pi / N)
y = f(x)

# Compute the discrete Fourier transform (DFT) of the
# function values using the FFT algorithm
DFT = np.fft.fft(y).real

# Compute the correct scaling factor
scaling_factor = 2/N

# Scale the DFT coefficients by the correct scaling factor
chebyshev_coefficients = scaling_factor * DFT

# Use chebval to evaluate the approximated polynomial at a set of points
x_eval = np.linspace(-1, 1, 100)
y_approx = np.polynomial.chebyshev.chebval(x_eval, chebyshev_coefficients[::-1])

# Plot the original function and the approximated function
plt.plot(x, y, 'o', label='Original function')
plt.plot(x_eval, y_approx, '-', label='Approximated function')
plt.legend()
plt.show()
I want to generate samples of a random variable and estimate its expectation and variance.
Given the probability density function f(x) = 2x for 0 <= x <= 1 (and 0 otherwise),
I have already found analytically that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I have when simulating using python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, where X is uniform, when what you want is X itself drawn from the given density. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is f(x) = 2x for 0 <= x <= 1,
so the CDF is F(x) = x**2 on that interval.
Given that the CDF is an easily invertible function on [0, 1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U**(1/2) = sqrt(U).
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces a histogram that closely follows the target density f(x) = 2x.
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third column is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always a positive quantity, but covariance can be either positive or negative. Consequently, it’s easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F^(-1)(U) and X' = F^(-1)(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F^(-1)(u_i) + F^(-1)(1 - u_i)) / 2, the expected value E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
To approximate the integral of some function of x, say, g(x), over S = [0, 1], using Monte Carlo simulation, you
generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1])
calculate the arithmetic mean of g(x_i) over i = 1 to i = N where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the result of this by the square of the estimate of the expected value of X to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y, E in this case are numpy arrays. This approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods. The second should be much faster.
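If you want to see the difference for yourself, here is a quick timing sketch (N is kept at 10 million so the list-comprehension run stays short; exact numbers depend on your machine):
import time
import numpy as np

N = 10_000_000
X = np.random.uniform(size=N)

t0 = time.perf_counter()
Y_list = [x * (2 * x) for x in X]   # pure-Python loop over every element
t1 = time.perf_counter()
Y_arr = X * (2 * X)                 # single vectorized numpy expression
t2 = time.perf_counter()

print(f"list comprehension: {t1 - t0:.2f} s")
print(f"numpy vectorized:   {t2 - t1:.2f} s")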
There are many questions on this topic, and I have cycled through a lot of them getting conceptual pointers on handling frequencies (here and here), documentation on numpy functions (here), how-to information on extracting magnitude and phase (here), and stepping outside the site, for example this or this.
However, only the painful "proving it" to myself with simple examples and checking the output of different functions contrasted to their manual implementation has given me a bit of an idea.
The answer attempts to document and share details related to the DFT in Python that may constitute barriers to entry if not explained in simple terms.
The DFT (FFT being its algorithmic computation) is a dot product between a finite discrete number of samples N of an analogue signal s(t) (a function of time or space) and a set of basis vectors of complex exponentials (sin and cos functions). Although the sample is naturally finite and may show no periodicity, it is implicitly thought of as a periodically repeating discrete function. Even when dealing with real-valued signals (the usual situation) it is convenient to work with complex numbers (Euler's equation). It may be intimidating to implement the function on a signal with np.fft.fft(s) only to get the output coefficients in complex numbers and get stuck in their interpretation. Some steps are essential:
What are the frequencies in the complex exponentials?
The DFT itself knows nothing about the sampling frequency in hertz; to begin with, its frequencies are just indices (k).
The indices k range from 0 to N - 1 and can be thought of as having units of cycles / set (the set being the N samples of the signal s). I will omit discussing the Nyquist limit, but for real signals the frequencies form a mirror image after N / 2, and given as negative decreasing values after that point (not a problem within the framework of implicit periodicity). The frequencies used in the FFT are not simply k, but k / N, thought of as having units of cycles / sample. See this reference. Example (reference): If a signal is sampled N = 5 times the frequencies are: np.fft.fftfreq(5), yielding [ 0 , 0.2, 0.4, -0.4, -0.2], i.e. [0/5, 1/5, 2/5, -2/5, -1/5].
To convert these frequencies to meaningful units (e.g. hertz for time signals, or cycles/mm for spatial ones), the values in cycles/sample above need to be divided by the sampling interval T (e.g. the distance in seconds between samples). Continuing with the example above, there is a built-in call, np.fft.fftfreq(5, d=T): if the analogue signal s is sampled 5 times at equidistant intervals T = 1/2 sec, for a total sample duration of NT = 5 x 1/2 = 2.5 sec, the normalized frequencies are np.fft.fftfreq(5, d=1/2), yielding [0 0.4 0.8 -0.8 -0.4], i.e. [0/NT, 1/NT, 2/NT, -2/NT, -1/NT].
Either the normalized or the un-normalized frequencies can be used to obtain the angular frequencies ω_m = 2π k/(NT). Note that NT is the total duration for which the signal was sampled. The index k does result in multiples of a fundamental frequency ω_0 corresponding to k = 1, the frequency of the (co-)sine wave that completes exactly one oscillation over NT (here).
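A quick way to confirm the three frequency scales discussed above (cycles/sample, physical units, and angular frequency), reusing the N = 5, T = 1/2 example:
import numpy as np

N, T = 5, 0.5                                # 5 samples taken every 0.5 s
print(np.fft.fftfreq(N))                     # [ 0.   0.2  0.4 -0.4 -0.2]  cycles/sample
print(np.fft.fftfreq(N, d=T))                # [ 0.   0.4  0.8 -0.8 -0.4]  cycles/s (Hz)
print(2 * np.pi * np.fft.fftfreq(N, d=T))    # angular frequencies omega_m = 2*pi*k/(N*T)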
Magnitude, frequency and phase of the coefficients in the FFT
Given the output of the FFT, S = fft.fft(s), the magnitude of the output coefficients (here) is just the Euclidean norm of the complex numbers, adjusted for the symmetry in real signals (x 2) and for the number of samples (1/N): magnitudes = 2/N * np.abs(S) (the k = 0 term, and the Nyquist term when N is even, should not be doubled).
The frequencies are matched to the call explained above, np.fft.fftfreq(N), or, to incorporate the actual analogue frequency units directly, frequencies = np.fft.fftfreq(N, d=T).
The phase of each coefficient is the angle of the complex number in polar form: phase = np.angle(S), equivalent to np.arctan2(np.imag(S), np.real(S)); this is preferable to np.arctan(np.imag(S)/np.real(S)), which loses the quadrant.
How to find the dominant frequencies in the signal s in the FFT and their coefficients?
Plotting aside, finding the index k corresponding to the frequency with the highest magnitude can be accomplished as index = np.argmax(np.abs(S)). To find, for example, the 4 indices with the highest magnitude, the call is indices = np.argpartition(np.abs(S), -4)[-4:] (the partition must be applied to the magnitudes, not to the complex coefficients themselves).
The corresponding coefficient is then S[index], with frequency freq_max = np.fft.fftfreq(N, d=T)[index].
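Here is a small self-contained sketch of that recipe; the two-tone signal is made up purely for illustration:
import numpy as np

N, T = 1000, 1e-3                                  # 1000 samples, 1 ms apart
t = np.arange(N) * T
s = 2.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

S = np.fft.fft(s)
freqs = np.fft.fftfreq(N, d=T)

k = 4                                              # number of dominant coefficients to keep
top = np.argpartition(np.abs(S), -k)[-k:]          # indices of the k largest magnitudes
top = top[np.argsort(np.abs(S[top]))[::-1]]        # sort them, largest magnitude first
for i in top:
    print(f"k = {i}, f = {freqs[i]:.1f} Hz, |S_k| = {np.abs(S[i]):.1f}")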
Reproducing the original signal after obtaining the coefficients:
Reproducing s through sines and cosines (p.150 in here):
Re = np.real(S[index])
Im = np.imag(S[index])
s_recon = 2/N * (Re * np.cos(2 * np.pi * freq_max * t) - Im * np.sin(2 * np.pi * freq_max * t))
Here is a complete example:
import numpy as np
import matplotlib.pyplot as plt
N = 10000 # Sample points
T = 1/5000 # Spacing
# Total duration N * T= 2
t = np.linspace(0.0, N*T, N, endpoint=False) # Time: Vector of 10,000 elements from 0 to N*T=2.
frequency = np.fft.fftfreq(t.size, d=T) # Normalized Fourier frequencies in spectrum.
f0 = 25 # Frequency of the sampled wave
phi = np.pi/8 # Phase
A = 50 # Amplitude
s = A * np.cos(2 * np.pi * f0 * t + phi) # Signal
S = np.fft.fft(s) # Unnormalized FFT
index = np.argmax(np.abs(S))
print(S[index])
magnitude = np.abs(S[index]) * 2/N
freq_max = frequency[index]
phase = np.angle(S[index])
print(f"magnitude: {magnitude}, freq_max: {freq_max}, phase: {phase}")
print(phi)
fig, [ax1,ax2] = plt.subplots(nrows=2, ncols=1, figsize=(10, 5))
ax1.plot(t,s, linewidth=0.5, linestyle='-', color='r', marker='o', markersize=1,markerfacecolor=(1, 0, 0, 0.1))
ax1.set_xlim([0, .31])
ax1.set_ylim([-51,51])
ax2.plot(frequency[0:N//2], 2/N * np.abs(S[0:N//2]), '.', color='xkcd:lightish blue', label='amplitude spectrum')
plt.xlim([0, 100])
plt.show()
Re = np.real(S[index])
Im = np.imag(S[index])
s_recon = 2/N * (Re * np.cos(2 * np.pi * freq_max * t) - Im * np.sin(2 * np.pi * freq_max * t))
fig = plt.figure(figsize=(10, 2.5))
plt.xlim(0,0.3)
plt.ylim(-51,51)
plt.plot(t,s_recon, linewidth=0.5, linestyle='-', color='r', marker='o', markersize=1,markerfacecolor=(1, 0, 0, 0.1))
plt.show()
print(np.allclose(s, s_recon))  # check that the reconstruction matches the original signal
I am struggling with the correct normalization of the power spectral density (and its inverse).
I am given a real problem, let's say the readings of an accelerometer in the form of the power spectral density (psd) in units of Amplitude^2/Hz. I would like to translate this back into a randomized time series. However, first I want to understand the "forward" direction, time series to PSD.
According to [1], the PSD of a time series x(t) can be calculated by:
PSD(w) = 1/T * abs(F(w))^2 = df * abs(F(w))^2
in which T is the total sampling duration of x(t), F(w) is the Fourier transform of x(t), and df = 1/T is the frequency resolution in Fourier space. However, the results I am getting are not equal to what I am getting using the scipy Welch method, see code below.
This first block of code is taken from the scipy.signal.welch documentation:
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
fs = 10e3
N = 1e5
amp = 2*np.sqrt(2)
freq = 1234.0
noise_power = 0.001 * fs / 2
time = np.arange(N) / fs
x = amp*np.sin(2*np.pi*freq*time)
x += np.random.normal(scale=np.sqrt(noise_power), size=time.shape)
f, Pxx_den = signal.welch(x, fs, nperseg=1024)
plt.semilogy(f, Pxx_den)
plt.ylim([0.5e-3, 1])
plt.xlabel('frequency [Hz]')
plt.ylabel('PSD [V**2/Hz]')
plt.show()
The first thing I noticed is that the plotted PSD changes with the variable fs, which seems strange to me. (Maybe I need to adjust the nperseg argument accordingly? Why isn't nperseg set to fs automatically then?)
My code is the following. (Note that I defined my own fft_full function, which already takes care of the correct Fourier transform normalization; I verified this by checking Parseval's theorem.)
import scipy.fftpack as fftpack
def fft_full(xt, yt):
    dt = xt[1] - xt[0]
    x_fft = fftpack.fftfreq(xt.size, dt)
    y_fft = fftpack.fft(yt) * dt
    return (x_fft, y_fft)
xf,yf=fft_full(time,x)
df=xf[1] - xf[0]
psd=np.abs(yf)**2 *df
plt.figure()
plt.semilogy(xf, psd)
#plt.ylim([0.5e-3, 1])
plt.xlim(0,)
plt.xlabel('frequency [Hz]')
plt.ylabel('PSD [V**2/Hz]')
plt.show()
Unfortunately, I am not yet allowed to post images but the two plots do not look the same!
I would greatly appreciate if someone could explain to me where I went wrong and settle this once and for all :)
[1]: Eq. 2.82, Random Vibrations in Spacecraft Structures Design: Theory and Applications, J. Jaap Wijker, 2009
The scipy library uses Welch's method to estimate a PSD. This method is more complex than just taking the squared modulus of the discrete Fourier transform. In short terms, it proceeds as follows:
Let x be the input discrete signal that contains N samples.
Split x into M overlapping segments, such that each segment s_m contains nperseg samples and each two consecutive segments overlap in noverlap samples, so that nperseg = K * (nperseg - noverlap), where K is an integer (usually K = 2). Note also that:
N = nperseg + (M - 1) * (nperseg - noverlap) = (M + K - 1) * nperseg / K
From each segment s_m, subtract its mean (this removes the DC component):
t_m = s_m - sum(s_m) / nperseg
Multiply the elements of the obtained zero-mean segments t_m by the elements of a suitable (non-symmetric) window function, h (such as the Hann window):
u_m = t_m * h
Calculate the Fast Fourier Transform of all vectors u_m. Before performing these transformations, we usually first append so many zeros to each vector u_m that its new dimension becomes a power of 2 (the nfft argument of the function welch is used for this purpose). Let us suppose that len(u_m) = 2**p. In most cases, our input vectors are real-valued, so it is best to apply FFT for real data. Its results are then complex-valued vectors v_m = rfft(u_m), such that len(v_m) = 2**(p-1) + 1.
Calculate the squared modulus of all transformed vectors:
a_m = abs(v_m) ** 2,
or more efficiently:
a_m = v_m.real ** 2 + v_m.imag ** 2
Normalize the vectors a_m as follows:
b_m = a_m / sum(h * h)
b_m[1:-1] *= 2 (this takes into account the negative frequencies),
where h is a real vector of the dimension nperseg that contains the window coefficients. In case of the Hann window, we can prove that
sum(h * h) = 3 / 8 * len(h) = 3 / 8 * nperseg
Estimate the PSD as the mean of all vectors b_m:
psd = sum(b_m) / M
The result is a vector of the dimension len(psd) = 2**(p-1) + 1. If we wish that the sum of all psd coefficients matches the mean squared amplitude of the windowed input data (rather than the sum of squared amplitudes), then the vector psd must also be divided by nperseg. However, the scipy routine omits this step. In any case, we usually present psd on the decibel scale, so that the final result is:
psd_dB = 10 * log10(psd).
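To make the steps above concrete, here is a minimal sketch of the procedure, set up to mirror scipy.signal.welch's defaults (periodic Hann window, 50% overlap, detrend='constant', no zero padding, scaling='density'); the 'density' scaling also divides by fs, which is what gives units of V**2/Hz. The test signal is arbitrary:
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
fs = 10e3
x = rng.normal(size=8192)                      # arbitrary test signal

nperseg = 1024
noverlap = nperseg // 2
step = nperseg - noverlap
h = signal.get_window("hann", nperseg)         # periodic Hann, welch's default window

psd = np.zeros(nperseg // 2 + 1)
starts = range(0, len(x) - nperseg + 1, step)
for i in starts:
    seg = x[i:i + nperseg]
    t = seg - seg.mean()                       # remove the DC component
    v = np.fft.rfft(t * h)                     # FFT of the windowed, zero-mean segment
    a = v.real**2 + v.imag**2                  # squared modulus
    a[1:-1] *= 2                               # account for the negative frequencies
    psd += a
psd /= len(starts) * np.sum(h * h) * fs        # average segments, normalize to V**2/Hz

f_ref, psd_ref = signal.welch(x, fs, nperseg=nperseg)
print(np.allclose(psd, psd_ref))               # should print True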
For a more detailed description, please read Welch's original paper. See also the Wikipedia page and chapter 13.4 of Numerical Recipes in C.
I have a set of points (x,y) as two vectors
x,y for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis, let's say for:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation in numpy/scipy for that, although it seems like a standard problem at first glance.
As the x values are not equally spaced I can't use the scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done by going through Fourier space and multiplying with the kernel, but I don't really know if that is possible with irregularly spaced data.
Thanks for any ideas
This will blow up for very large datasets, but the proper calculation you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
sigma = 0.1
delta_x = x_eval[:, None] - x
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
weights /= np.sum(weights, axis=1, keepdims=True)
y_eval = np.dot(weights, y)
plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said, there is a simple two-step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max()) # choose new x axis sampling
ny = np.interp(nx,x,y) # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
from scipy.ndimage import gaussian_filter1d
fx = gaussian_filter1d(ny, sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter.
Based on @Jaime's answer I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the datapoints.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
    """Apply a Gaussian sum filter to data.

    xdata, ydata : array
        Arrays of x- and y-coordinates of data.
        Must be 1d and have the same length.
    xeval : array
        Array of x-coordinates at which to evaluate the smoothed result.
    sigma : float
        Standard deviation of the Gaussian to apply to each data point.
        Larger values yield a smoother curve.
    null_thresh : float
        For evaluation points far from data points, the estimate will be
        based on very little data. If the total weight is below this threshold,
        return np.nan at this location. Zero means always return an estimate.
        The default of 0.6 corresponds to approximately one sigma away
        from the nearest datapoint.
    """
    # Distance between every combination of xdata and xeval
    # each row corresponds to a value in xeval
    # each col corresponds to a value in xdata
    delta_x = xeval[:, None] - xdata

    # Calculate weight of every value in delta_x using Gaussian
    # Maximum weight is 1.0 where delta_x is 0
    weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))

    # Multiply each weight by every data point, and sum over data points
    smoothed = np.dot(weights, ydata)

    # Nullify the result when the total weight is below threshold
    # This happens at evaluation points far from any data
    # 1-sigma away from a data point has a weight of ~0.6
    nan_mask = weights.sum(1) < null_thresh
    smoothed[nan_mask] = np.nan

    # Normalize by dividing by the total weight at each evaluation point
    # Nullification above avoids a divide-by-zero warning here
    smoothed = smoothed / weights.sum(1)

    return smoothed
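For completeness, here is how the function might be used on data like the original question's (random points on [0, 1]); the evaluation grid deliberately extends past the data so the NaN masking is visible:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.sort(np.random.rand(30))
y = np.random.rand(30)
x_eval = np.linspace(-0.5, 1.5, 41)            # extends beyond the data on purpose

y_smooth = gaussian_sum_smooth(x, y, x_eval, sigma=0.1)   # NaN far from the data

plt.plot(x, y, 'bo-', label='data')
plt.plot(x_eval, y_smooth, 'ro-', label='smoothed')
plt.legend()
plt.show()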