Can I get data spread (noise) from singular value decomposition? - python

I was hoping to use singular value decomposition to estimate the standard deviation of ellipsoidal data. I'm not sure if this is the best approach, and I may be overthinking the entire process, so I need some help.
I simulated some data using the following script...
from matplotlib import pyplot as plt
import numpy

def svd_example():
    # simulate some data...
    # x values have standard deviation 3000
    xdata = numpy.random.normal(0, 3000, 5000).reshape(-1, 1)
    # y values standard deviation 300
    ydata = numpy.random.normal(0, 300, 5000).reshape(-1, 1)
    # apply some rotation
    ydata_rotated = ydata + (xdata * 0.5)
    data = numpy.hstack((xdata, ydata_rotated))
    # get singular values
    left_singular_matrix, singular_values, right_singular_matrix = numpy.linalg.svd(data)
    print('singular values', singular_values)
    # plot data....
    plt.scatter(data[:, 0], data[:, 1], s=5)
    plt.ylim(-15000, 15000)
    plt.show()

svd_example()
I get singular values of...
>>> singular values [ 234001.71228678 18850.45155942]
My data looks like this...
I was under the assumption that the singular values would give me some indication of the spread of the data regardless of its rotation, right? But these values, [234001.71228678 18850.45155942], make no sense to me. My standard deviations were 3000 and 300. Do these singular values represent variance? How do I convert them?

The singular values indeed give some indication of the spread. In fact, they are related to the standard deviation in these directions. However, they are not normalized. If you divide by the square root of the number of samples, you will get values that closely resemble the standard deviations used for creating the data:
singular_values / numpy.sqrt(5000)
# array([ 3398.61320614, 264.00975837])
Why do you get 3400 and 264 instead of 3000 and 300? That is because ydata + (xdata * 0.5) is not a rotation but a shearing operation. A real rotation would preserve the original standard deviations.
For example, the following code would rotate the data by 40 degrees:
# apply some rotation
s = numpy.sin(40 * numpy.pi / 180)
c = numpy.cos(40 * numpy.pi / 180)
data = numpy.hstack((xdata, ydata)).dot([[c, s], [-s, c]])
With such a rotation you will get normalized singular values that are pretty close to the original standard deviations.
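As a rough check of that claim, the sketch below rotates freshly simulated data and divides the singular values by sqrt(n). The seed and the use of `numpy.random.default_rng` are assumptions made here for repeatability; they are not part of the original script.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n = 5000
xdata = rng.normal(0, 3000, n).reshape(-1, 1)
ydata = rng.normal(0, 300, n).reshape(-1, 1)

# A true rotation by 40 degrees (orthogonal matrix, preserves lengths)
theta = np.deg2rad(40)
c, s = np.cos(theta), np.sin(theta)
rotation = np.array([[c, s], [-s, c]])
rotated = np.hstack((xdata, ydata)) @ rotation

# compute_uv=False returns only the singular values,
# which avoids building the large left singular matrix
singular_values = np.linalg.svd(rotated, compute_uv=False)
estimated_std = singular_values / np.sqrt(n)
print(estimated_std)
```

The two printed values should land close to the 3000 and 300 used to generate the data, up to sampling noise.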
Edit:
On Normalization
I have to admit, normalization is probably not the correct term to apply here. It does not necessarily mean to scale values to a certain range. Normalization, as I meant it, was to bring values into a defined range, independent of the number of samples.
To understand where the division by sqrt(5000) comes from, let's talk about the standard deviation. Let x be a data vector of n samples with zero mean. Then the standard deviation is computed as sqrt(sum(x**2)/n) or sqrt(sum(x**2)) / sqrt(n). Now, you can think of the singular value decomposition as computing only the sqrt(sum(x**2)) part, so we have to divide by sqrt(n) ourselves.
I'm afraid this is not a very mathematical explanation, but hopefully it conveys the idea.
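One way to make the idea a little more concrete: for zero-mean data, the singular values divided by sqrt(n) coincide with the square roots of the eigenvalues of the covariance matrix (computed with a 1/n normalization). The snippet below is only a sketch with made-up data to illustrate that relationship.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 5000
data = rng.normal(0, [3000, 300], size=(n, 2))
data = data - data.mean(axis=0)  # enforce exactly zero mean per column

# Spread estimated from the singular values
sv = np.linalg.svd(data, compute_uv=False)
from_svd = sv / np.sqrt(n)

# Spread estimated from the covariance matrix (1/n normalization)
cov = data.T @ data / n
from_cov = np.sqrt(np.linalg.eigvalsh(cov)[::-1])  # sort descending

# For a single zero-mean column, the vector norm plays the same role
col = data[:, 0]
norm_based_std = np.linalg.norm(col) / np.sqrt(n)

print(from_svd, from_cov, norm_based_std, col.std())
```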

Related

Applying a half-gaussian filter to binned time series data in python

I am binning some time series data and need to apply a half-normal filter to the binned data. How can I do this in Python? I've provided a toy example below. I need Xbinned to be smoothed with a half-Gaussian filter with a std of 0.25 (or whatever). I'm pretty sure the half-Gaussian should be facing the forward time direction.
import numpy as np
X = np.random.randint(2, size=100) #example random process
bin_size = 5
Xbinned = []
for i in range(0, len(X)+1, bin_size):
    Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
How to implement half-gaussian filtering
Scipy has a function called scipy.ndimage.gaussian_filter(). It nearly implements what we want here. Unfortunately, there's no option to use a half-gaussian instead of a gaussian. However, scipy is open-source, so we can just take the source code and modify it to be a half-gaussian.
I used this source code, and removed all of the parts that are not needed for this particular case. At the end, I had this:
import numpy as np
import scipy.ndimage

def halfgaussian_kernel1d(sigma, radius):
    """
    Computes a 1-D Half-Gaussian convolution kernel.
    """
    sigma2 = sigma * sigma
    x = np.arange(0, radius+1)
    phi_x = np.exp(-0.5 / sigma2 * x ** 2)
    phi_x = phi_x / phi_x.sum()
    return phi_x

def halfgaussian_filter1d(input, sigma, axis=-1, output=None,
                          mode="constant", cval=0.0, truncate=4.0):
    """
    Convolves a 1-D Half-Gaussian convolution kernel.
    """
    sd = float(sigma)
    # make the radius of the filter equal to truncate standard deviations
    lw = int(truncate * sd + 0.5)
    weights = halfgaussian_kernel1d(sigma, lw)
    origin = -lw // 2
    return scipy.ndimage.convolve1d(input, weights, axis, output, mode, cval, origin)
A short summary of how this works:
First, it generates a convolution kernel. It uses the formula e^(-1/2 * (x/sigma)^2) to generate the gaussian distribution. It keeps going until you're 4 standard deviations away from the center.
Next, it convolves that kernel against your signal. It adjusts the kernel to start at the current timestep instead of being centered on the current timestep.
Trying this on your signal, I get a result like this:
array([0.59979879, 0.6 , 0.40006707, 0.59993293, 0.79993293,
0.40013414, 0.20006707, 0.59986586, 0.40006707, 0.4 ,
0.99979879, 0.00033535, 0.59979879, 0.40006707, 0.00013414,
0.59979879, 0.20013414, 0.00006707, 0.19993293, 0.59986586])
Choice of standard deviation
If you pick a standard deviation of 0.25, that is going to have almost no effect on your signal. Here are the convolution weights it uses: [0.99966465 0.00033535]. In other words, this has less than a 0.1% effect on the signal.
I'd recommend using a larger sigma value.
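To see how quickly the kernel starts to matter, the sketch below prints the half-Gaussian weights for a few sigma values. The kernel function is repeated so the snippet runs on its own; the sigma values 1.0 and 2.0 are only illustrative choices, not recommendations from the question.

```python
import numpy as np

def halfgaussian_kernel1d(sigma, radius):
    """1-D half-Gaussian kernel, same formula as in the answer above."""
    x = np.arange(0, radius + 1)
    phi_x = np.exp(-0.5 / (sigma * sigma) * x ** 2)
    return phi_x / phi_x.sum()

truncate = 4.0
kernels = {}
for sigma in (0.25, 1.0, 2.0):
    radius = int(truncate * sigma + 0.5)
    kernels[sigma] = halfgaussian_kernel1d(sigma, radius)
    print(sigma, np.round(kernels[sigma], 4))
```

With sigma = 0.25 nearly all of the weight sits on the current sample, while larger values spread it over several neighbouring bins.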
Off by one error
Also, I want to point out the off-by-one error here:
for i in range(0, len(X)+1, bin_size):
    Xbinned.append(sum(X[i:i+(bin_size-1)])/bin_size)
Python slices exclude the end index, so the slice X[i:i+(bin_size-1)] actually captures 4 elements, not 5.
To fix this, you can change it to this:
for i in range(0, len(X), bin_size):
    Xbinned.append(X[i:i+bin_size].mean())
(I also fixed an off-by-one error in the loop specification, which produced an extra, empty bin at the end, and used a NumPy shortcut for finding the mean.)
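If the loop is not needed, the same binning can be sketched with a reshape. This is an alternative to the loop above rather than what the answer used, and it assumes any incomplete trailing bin should simply be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
X = rng.integers(2, size=100)   # example random process of 0s and 1s
bin_size = 5

# Keep only complete bins, then average each row of the reshaped array
n_bins = len(X) // bin_size
Xbinned = X[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)

# Reference: the corrected loop from the answer
loop_result = np.array([X[i:i + bin_size].mean()
                        for i in range(0, len(X), bin_size)])
print(Xbinned)
```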

Mean and standard deviation of lognormal distribution do not match analytic values

As part of my research, I measure the mean and standard deviation of draws from a lognormal distribution. Given a value of the underlying normal distribution, it should be possible to analytically predict these quantities (as given at https://en.wikipedia.org/wiki/Log-normal_distribution).
However, as can be seen in the plots below, this does not seem to be the case. The first plot shows the mean of the lognormal data against the sigma of the gaussian, while the second plot shows the sigma of the lognormal data against that of the gaussian. Clearly, the "calculated" lines deviate from the "analytic" ones very significantly.
I take the mean of the gaussian distribution to be related to the sigma by mu = -0.5*sigma**2 as this ensures that the lognormal field ought to have mean of 1. Note, this is motivated by the area of physics that I work in: the deviation from analytic values still occurs if you set mu=0.0 for example.
By copying and pasting the code at the bottom of the question, it should be possible to reproduce the plots below. Any advice as to what might be causing this would be much appreciated!
Mean of lognormal vs sigma of gaussian:
Sigma of lognormal vs sigma of gaussian:
Note, to produce the plots above, I used N=10000, but have put N=1000 in the code below for speed.
import numpy as np
import matplotlib.pyplot as plt
mean_calc = []
sigma_calc = []
mean_analytic = []
sigma_analytic = []
ss = np.linspace(1.0,10.0,46)
N = 1000
for s in ss:
    mu = -0.5*s*s
    ln = np.random.lognormal(mean=mu, sigma=s, size=(N,N))
    mean_calc += [np.average(ln)]
    sigma_calc += [np.std(ln)]
    mean_analytic += [np.exp(mu+0.5*s*s)]
    sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
plt.loglog(ss,mean_calc,label='calculated')
plt.loglog(ss,mean_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\mu_{LN}$')
plt.show()
plt.loglog(ss,sigma_calc,label='calculated')
plt.loglog(ss,sigma_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\sigma_{LN}$')
plt.show()
TL;DR
Lognormal distributions are positively skewed and heavy-tailed. When you perform floating-point arithmetic (such as sum, mean or std) on a sample drawn from a highly skewed distribution, the sample contains values that differ by several orders of magnitude (many decades). This makes the computation inaccurate.
The problem comes from these two lines:
mean_calc += [np.average(ln)]
sigma_calc += [np.std(ln)]
because ln contains both very small and very large values whose ratio exceeds float precision.
The problem can easily be detected, to warn the user that their computations are wrong, using the following predicate:
(max(ln) + min(ln)) <= max(ln)
This is obviously false for strictly positive real numbers, but it must be considered when using finite-precision arithmetic.
Modifying your MCVE
If we slightly modify your MCVE to:
from scipy import stats
for s in ss:
    mu = -0.5*s*s
    ln = stats.lognorm(s, scale=np.exp(mu)).rvs(N*N)
    f = stats.lognorm.fit(ln, floc=0)
    mean_calc += [f[2]*np.exp(0.5*s*s)]
    sigma_calc += [np.sqrt((np.exp(f[0]**2)-1)*(np.exp(2*mu + s*s)))]
    mean_analytic += [np.exp(mu+0.5*s*s)]
    sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
This gives reasonably correct estimates of the mean and standard deviation, even for high values of sigma.
The key is that fit uses an MLE algorithm to estimate the parameters. This differs completely from your original approach, which directly takes the mean of the sample.
The fit method returns a tuple (sigma, loc=0, scale=exp(mu)), which are the parameters of the scipy.stats.lognorm object as specified in the documentation.
I think you should investigate how you are estimating the mean and standard deviation. The divergence probably comes from this part of your algorithm.
There might be several reasons why it diverges. At least consider:
Biased estimator: is your estimator correct and unbiased? The mean is an unbiased estimator (see next section), but maybe not an efficient one;
Sampled outliers from the pseudo-random generator may not be as intense as they should be compared to the theoretical distribution, so maybe MLE is less sensitive than your estimator. The new MCVE below does not support this hypothesis, but float arithmetic error can explain why your estimators are underestimated;
Float arithmetic error: the new MCVE below highlights that it is part of your problem.
A scientific quote
It seems the naive mean estimator (simply taking the mean), even if unbiased, is inefficient at estimating the mean for large sigma (see Qi Tang, Comparison of Different Methods for Estimating Log-normal Means, p. 11):
The naive estimator is easy to calculate and it is unbiased. However,
this estimator can be inefficient when variance is large and sample
size is small.
The thesis reviews several methods for estimating the mean of a lognormal distribution and uses MLE as the reference for comparison. This explains why your method drifts as sigma increases while MLE tracks better, although it is not time efficient for large N. Very interesting paper.
Statistical considerations
Recall that:
The lognormal is a heavy- and long-tailed, positively skewed distribution. One consequence is that as the shape parameter sigma grows, the asymmetry and skewness grow, and so does the strength of the outliers.
Effect of sample size: as the number of samples drawn from a distribution grows, the expectation of having an outlier increases (and so does the extent).
Building a new MCVE
Let's build a new MCVE to make it clearer. The code below draws samples of different sizes (N ranges between 100 and 100000) from a lognormal distribution whose shape parameter varies (sigma ranges between 0.1 and 10) and whose scale parameter is set to unity.
import warnings
import numpy as np
from scipy import stats

# Make computation reproducible among batches:
np.random.seed(123456789)

# Parameters ranges:
sigmas = np.arange(0.1, 10.1, 0.1)
sizes = np.logspace(2, 5, 21, base=10).astype(int)

# Placeholders:
rv = np.empty((sigmas.size,), dtype=object)
xmean = np.full((3, sigmas.size, sizes.size), np.nan)
xstd = np.full((3, sigmas.size, sizes.size), np.nan)
xextent = np.full((2, sigmas.size, sizes.size), np.nan)
eps = np.finfo(np.float64).eps

# Iterate Shape Parameter:
for (i, s) in enumerate(sigmas):
    # Create Random Variable:
    rv[i] = stats.lognorm(s, loc=0, scale=1)
    # Iterate Sample Size:
    for (j, N) in enumerate(sizes):
        # Draw Samples:
        xs = rv[i].rvs(N)
        # Sample Extent:
        xextent[:,i,j] = [np.min(xs), np.max(xs)]
        # Check (max(x) + min(x)) <= max(x)
        if (xextent[0,i,j] + xextent[1,i,j]) - xextent[1,i,j] < eps:
            warnings.warn("Potential Float Arithmetic Errors: logN(mu=%.2f, sigma=%.2f).sample(%d)" % (0, s, N))
        # Generate different Estimators:
        # Fit Parameters using MLE:
        fit = stats.lognorm.fit(xs, floc=0)
        xmean[0,i,j] = fit[2]
        xstd[0,i,j] = fit[0]
        # Naive (Bad Estimators because of Float Arithmetic Error):
        xmean[1,i,j] = np.mean(xs)*np.exp(-0.5*s**2)
        xstd[1,i,j] = np.sqrt(np.log(np.std(xs)**2*np.exp(-s**2)+1))
        # Log-transform:
        xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
        xstd[2,i,j] = np.std(np.log(xs))
Observation: The new MCVE starts to raise warnings when sigma > 4.
MLE as Reference
Estimating shape and scale parameters using MLE performs well:
The two figures above show that:
The estimation error grows with the shape parameter;
The estimation error decreases as the sample size increases (CLT);
Note that MLE also fits the shape parameter well:
Float Arithmetic
It is worth plotting the extent of the drawn samples versus shape parameter and sample size:
Or the decimal magnitude between the smallest and largest number from the sample:
On my setup:
np.finfo(np.float64).precision # 15
np.finfo(np.float64).eps # 2.220446049250313e-16
This means we have at most 15 significant figures to work with. If the ratio between two numbers exceeds that, the larger number absorbs the smaller one.
A basic example: what is the result of 1 + 1e6 if we can only keep four significant figures?
The exact result is 1,000,001.0, but it must be rounded off to 1.000e6. In other words, the result of the operation equals the larger number because of the rounding precision. This is inherent to finite-precision arithmetic.
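The four-digit example can be reproduced with Python's decimal module, and the same absorption shows up in float64 once the two operands differ by about 16 orders of magnitude. The specific values below (1e16 and 1.0) are chosen here only for illustration.

```python
from decimal import Context, Decimal
import numpy as np

# Arithmetic limited to four significant figures
four_digits = Context(prec=4)
rounded = four_digits.add(Decimal(1), Decimal(10**6))
print(rounded)  # the added 1 is lost in rounding

# The same effect in float64: the small term is absorbed by the large one
big = np.float64(1e16)
absorbed = (big + 1.0) - big
print(absorbed)

# eps is the gap between 1.0 and the next representable float64
eps = np.finfo(np.float64).eps
print(1.0 + eps > 1.0, 1.0 + eps / 2 == 1.0)
```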
The two previous figures, in conjunction with the statistical considerations, support your observation that increasing N does not improve the estimation for large values of sigma in your MCVE.
The figures above and below show that when sigma > 3 we do not have enough significant figures (fewer than 5) to perform valid computations.
Furthermore, we can say that the estimator will be underestimated, because the largest numbers absorb the smallest ones, and the underestimated sum is then divided by N, making the estimator biased low.
When the shape parameter becomes sufficiently large, computations are strongly biased because of float arithmetic errors.
This means that using quantities such as:
np.mean(xs)
np.std(xs)
when computing estimates will incur a huge float arithmetic error because of the large discrepancy among the values stored in xs. The figures below reproduce your issue:
As stated, the estimates fall short (they are too low, not too high) because the high values (a few outliers) in the sampled vector absorb the small values (most of the sampled values).
Logarithmic Transformation
If we apply a logarithmic transformation, we can drastically reduce this phenomenon:
xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
xstd[2,i,j] = np.std(np.log(xs))
And then the naive estimation of the mean is correct and will be far less affected by float arithmetic error, because all sample values will lie within a few decades instead of having a relative magnitude higher than the float arithmetic precision.
Actually, taking the log-transform returns the same result for the mean and std estimates as MLE for each N and sigma:
np.allclose(xmean[0,:,:], xmean[2,:,:]) # True
np.allclose(xstd[0,:,:], xstd[2,:,:]) # True
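Applied to the setup in the question, the log-transform approach might look like the sketch below: estimate mu and sigma from the logged sample, then plug them into the analytic lognormal formulas. The values sigma = 2 and N = 100000, and the seed, are assumptions picked here for a quick check.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
s = 2.0
mu = -0.5 * s * s                # chosen so the lognormal mean is 1
x = rng.lognormal(mean=mu, sigma=s, size=100_000)

# Estimate the parameters of the underlying Gaussian from log(x)
log_x = np.log(x)
mu_hat = log_x.mean()
s_hat = log_x.std()

# Plug the estimates into the analytic lognormal mean and std
mean_est = np.exp(mu_hat + 0.5 * s_hat**2)
std_est = np.sqrt((np.exp(s_hat**2) - 1) * np.exp(2 * mu_hat + s_hat**2))

print(mu_hat, s_hat, mean_est, std_est)
print("naive:", x.mean(), x.std())
```

Comparing the last two printed lines gives a feel for how much noisier the naive sample moments are than the plug-in estimates.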
Reference
If you are looking for complete and detailed explanations of these kinds of issues in scientific computing, I recommend the excellent book: N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Second Edition, 2002.
Bonus
Here is an example of the figure generation code:
import matplotlib.pyplot as plt
fig, axe = plt.subplots()
idx = slice(None, None, 5)
axe.loglog(sigmas, xmean[0,:,idx])
axe.axhline(1, linestyle=':', color='k')
axe.set_title(r"MLE: $x \sim \log\mathcal{N}(\mu=0,\sigma)$")
axe.set_xlabel(r"Standard Deviation, $\sigma$")
axe.set_ylabel(r"Mean Estimation, $\hat{\mu}$")
axe.set_ylim([0.1,10])
lgd = axe.legend([r"$N = %d$" % s for s in sizes[idx]] + ['Exact'], bbox_to_anchor=(1,1), loc='upper left')
axe.grid(which='both')
fig.savefig('Lognorm_MLE_Emean_Sigma.png', dpi=120, bbox_extra_artists=(lgd,), bbox_inches='tight')

How do I use the Monte Carlo method to find the uncertainties of a value?

I am trying to solve a physics equation using a Monte Carlo simulation, which I know is a very long way to do it (I just need to use it to learn about it).
I have around 5 values, one of which is time, and I have the random uncertainties (errors) for each of these values. For example, mass is (10 +- 0.1) kg, where the error is 0.1 kg.
How do I actually find the distribution of measurements if I performed this experiment 5,000 times, for example?
I know I could make 2 arrays of errors, and maybe put them in a function. But what am I supposed to do then? Do I put the errors in the equation and then add the answer to the arrays, and then put the changed array values in the equation and repeat this a thousand times? Or do I actually calculate the real value and add it to the array?
Please can you help me understand this.
Edit:
The problem I have is basically a sphere of density ds falling a distance l in time t through a liquid of density dl. This fits into an equation for viscosity, and I need to find the distribution of viscosity measurements.
The equation shouldn't matter at all; whatever equation I have, I should be able to use a method like this to find the distribution of measurements, whether I'm dropping a ball out of a window or whatever.
Basic Monte Carlo is very straightforward. The following might get you started:
import random,statistics,math

#The following function generates a
#random observation of f(x) where
#x is a vector of independent normal variables
#whose means are given by the vector mus
#and whose standard deviations are given by sigmas
def sample(f,mus,sigmas):
    x = (random.gauss(m,s) for m,s in zip(mus,sigmas))
    return f(*x)

#do n times, returning the sample mean and standard deviation:
def monte_carlo(f,mus,sigmas,n):
    samples = [sample(f,mus,sigmas) for _ in range(n)]
    return (statistics.mean(samples), statistics.stdev(samples))

#for testing purposes:
def V(r,h):
    return math.pi*r**2*h

print(monte_carlo(V,(2,4),(0.02, 0.01),1000))
With output:
(50.2497301631037, 1.0215188736786902)
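For larger sample counts the same idea can be sketched with NumPy arrays, which also makes it easy to compare the Monte Carlo spread with first-order (linear) error propagation. The comparison is an addition made here as a sanity check, not part of the answer above, and the seed and n are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000
r_mu, r_sigma = 2.0, 0.02
h_mu, h_sigma = 4.0, 0.01

# Draw all the "measurements" at once and push them through the formula
r = rng.normal(r_mu, r_sigma, n)
h = rng.normal(h_mu, h_sigma, n)
volumes = np.pi * r**2 * h

mc_mean = volumes.mean()
mc_std = volumes.std(ddof=1)

# First-order propagation: sigma_V^2 = (dV/dr)^2 sigma_r^2 + (dV/dh)^2 sigma_h^2
v0 = np.pi * r_mu**2 * h_mu
dV_dr = 2 * np.pi * r_mu * h_mu
dV_dh = np.pi * r_mu**2
lin_std = np.sqrt((dV_dr * r_sigma)**2 + (dV_dh * h_sigma)**2)

print(mc_mean, mc_std, v0, lin_std)
```

The array `volumes` is the simulated distribution of measurements; it can be passed to a histogram to inspect its shape.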
OK, let's try a simple example: you have an air gun which shoots balls with mass m and velocity v. You have to measure the kinetic energy
E = m*v**2 / 2
The velocity has a Gaussian distribution with a mean of 10 and a standard deviation of 1.
The masses also have a distribution, but we cannot use a Gaussian, so let's assume it is a truncated normal with a lower limit of 1, so that there are no negative values, with loc equal to 5 and scale equal to 3.
So what we will do is: sample the velocity, sample the mass, use them to find the kinetic energy, repeat this many times, build the energy distribution, get the mean value, get the standard deviation, draw graphs, etc.
Some simple Python code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import truncnorm
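def sampleMass(n, low, high, mu, sigma):
    """
    Sample n mass values from truncated normal
    """
    # Note: scipy's truncnorm expects its bounds in standardized units,
    # i.e. (bound - loc) / scale. Passed as-is, low and high truncate the
    # masses at mu + low*sigma and mu + high*sigma rather than at low and high.
    tn = truncnorm(low, high, loc=mu, scale=sigma)
    return tn.rvs(n)

def sampleVelocity(n, mu, sigma):
    return np.random.normal(loc = mu, scale = sigma, size=n)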
mass_low = 1.
mass_high = 1000.
mass_mu = 5.
mass_sigma = 3.0
vel_mu = 10.0
vel_sigma = 1.0
nof_trials = 100000
mass = sampleMass(nof_trials, mass_low, mass_high, mass_mu, mass_sigma) # get samples of mass
vel = sampleVelocity(nof_trials, vel_mu, vel_sigma) # get samples of velocity
kinenergy = 0.5 * mass * vel*vel # distribution of kinetic energy
print("Mean value and stddev of the final distribution")
print(np.mean(kinenergy))
print(np.std(kinenergy))
print("Min/max values of the final distribution")
print(np.min(kinenergy))
print(np.max(kinenergy))
# print histogram of the distribution
n, bins, patches = plt.hist(kinenergy, 100, density=True, facecolor='green', alpha=0.75)
plt.xlabel('Energy')
plt.ylabel('Probability')
plt.title('Kinetic energy distribution')
plt.grid(True)
plt.show()
with output like
Mean value and stddev of the final distribution
483.8162951263243
118.34049421853899
Min/max values of the final distribution
128.86671038372
1391.400187563612
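One detail worth checking in the code above: scipy.stats.truncnorm takes its two bounds in standardized units, (bound - loc) / scale, so passing the mass limits directly places the cut-offs at mu + low*sigma and mu + high*sigma. A minimal sketch of a sampler that truncates at the physical limits instead is shown below; the function name sample_mass and the seed argument are introduced here for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_mass(n, low, high, mu, sigma, seed=None):
    """Sample n masses from a normal(mu, sigma) truncated to [low, high]."""
    # Convert the physical limits into standardized units for truncnorm
    a = (low - mu) / sigma
    b = (high - mu) / sigma
    return truncnorm(a, b, loc=mu, scale=sigma).rvs(size=n, random_state=seed)

mass = sample_mass(100_000, 1.0, 1000.0, 5.0, 3.0, seed=0)
print(mass.min(), mass.mean())
```

With this conversion the sampled masses stay at or above 1, so the resulting energy distribution will differ from the output listed above.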

How to obtain perfect fit from np.random.power function

I have generated random data using:
bkg= 240-140*np.random.power(3.5,50000)
I plotted the points into a histogram by using
h_all = plt.hist(all,bins=binedges,histtype='step')
My question is: given that I know the pdf (in this case called "bkg"), can I generate a curve using scipy.optimize that fits the generated points perfectly, and what is the equation of that curve?
First of all, note that your bkg is NOT a probability density function (pdf). Rather, it is a list of observations from a pdf. By calling matplotlib.pyplot.hist on this list of observations, you get to see a curve that approximates the (offset and scaled version of the) probability density function. If you are given this curve, it is possible to get a good estimate of the parameters needed to model it, provided you've been given the parameterized model a priori.
For example:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
offset, scale, a, nsamples = 240, -140, 3.5, 500000
bkg = offset + scale*np.random.power(a, nsamples) # values range between (offset, offset+scale), which map to 0 and 1
nbins = 100
count, bins, ignored = plt.hist(bkg, bins=nbins, histtype='stepfilled', edgecolor='none')
If now you are given the centers of these bins and the counts,
xdata = .5*(bins[1:]+bins[:-1])
ydata = count
and you are asked to find the parameters of the power distribution function that fits this data (someone told you it is a power distribution, and you trust that source), then you could go about it in the following manner.
First, observe that the power distribution function P(x,a) is a monotonically increasing function for a > 1 (i.e. P(x1, a) < P(x2, a) when 0 <= x1 < x2 <= 1). That means that the dataset given above has been flipped left-to-right, or that it represents factor*P(x, a) with factor < 0.
Next, notice that the data is not given over the interval [0,1], which is the support of this probability density function. That means that you should rescale the given xdata to the [0,1] interval prior to attempting to fit the power distribution function to it. Just by observing the graph, you can figure out that the values that 1 and 0 map to are 100 and 240, respectively. However, this is just luck here, because matplotlib chose a sensible range for plotting. When you do not actually know the limits that 0 and 1 have been mapped to, you could choose the less optimal (but still very good) choice of xdata[0] - binwidth/2 and xdata[-1] + binwidth/2 or (a slightly worse choice) xdata[0] and xdata[-1]. From the previous paragraph, you know that 1 maps to xdata[0] - binwidth/2 (call it a) and 0 maps to xdata[-1] + binwidth/2 (call it b). The linear map that does this is lambda x: (a - b)*x + b (simple algebra).
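The linear map and its inverse can be sketched as follows, using the endpoints a and b described above. The data is regenerated here so the snippet runs on its own, and np.histogram stands in for the plt.hist call; the helper names to_data and to_unit are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
bkg = 240 - 140 * rng.power(3.5, 50_000)
count, bins = np.histogram(bkg, bins=100)
xdata = 0.5 * (bins[1:] + bins[:-1])
binwidth = bins[1] - bins[0]

a = xdata[0] - binwidth / 2    # data value that x = 1 maps to
b = xdata[-1] + binwidth / 2   # data value that x = 0 maps to

def to_data(x):
    """Map the unit interval [0, 1] onto the observed data range."""
    return (a - b) * x + b

def to_unit(v):
    """Inverse map: bring data values back to the unit interval."""
    return (v - b) / (a - b)

unit_x = to_unit(xdata)
print(unit_x.min(), unit_x.max())
```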
If you pass the [0,1]-mapped version of xdata to curve_fit, it will give you a good estimate of the exponent.
def get_model(nobservations, binwidth, scale, offset):
    def model(bin_centers, exponent):
        x = (bin_centers - offset)/scale
        y = exponent*x**(exponent - 1)
        normed_y = nobservations * binwidth * y / np.abs(scale)
        return normed_y
    return model

binwidth = np.diff(xdata)[0]
# np.ptp(xdata) also works on NumPy 2.x, where the ndarray.ptp method is no longer available
p0, _ = curve_fit(get_model(nsamples, binwidth, scale=-np.ptp(xdata) - binwidth, offset=xdata[-1] + binwidth/2), xdata, ydata)
print(p0) # prints e.g.: 3.37117679
plt.plot(xdata, get_model(nsamples, binwidth, scale=-np.ptp(xdata) - binwidth, offset=xdata[-1] + binwidth/2)(xdata, *p0))
At this point, you have found a rather accurate description of the distribution that was used to generate the observations of bkg:
f(x) = offset + scale*(exponent * x**(exponent - 1))
= (xdata[-1] + binwidth/2) + (-np.ptp(xdata) - binwidth)*(p0[0] * x**(p0[0] - 1))
~ 234.85 - 134.85*(3.37 * x**(3.37 - 1))
By the way, I'd like to point out that replicating bkg (the observations from the distribution) as a perfect copy is something you can only do if you know the exact parameters of the distribution (240, -140 and 3.5) AND set the seed for the random number generation equal to the seed that was in effect prior to the initial call to np.random.power.
If you'd like to fit a curve to the histogram using splines, you should retrieve the knots and coefficients from the generated spline and pass those into the function of bspleval, as shown here. The topic of writing out those equations is a long one however, and there are numerous resources on the internet that you can check to understand how it's done. Needless to say, that function bspleval is what you'll need in case you want to go that route. If it were me, I'd go the route of curve fitting shown above.

gaussian sum filter for irregular spaced points

I have a set of points (x,y) as two vectors x, y, for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis, let's say at:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation of it in numpy/scipy, although it seems like a standard problem at first glance.
As the x values are not equally spaced, I can't use scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done by going through Fourier space and multiplying by the kernel, but I don't really know if this is possible with irregularly spaced data.
Thanks for any ideas
This will blow up for very large datasets, but the proper calculation you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
sigma = 0.1
delta_x = x_eval[:, None] - x
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
weights /= np.sum(weights, axis=1, keepdims=True)
y_eval = np.dot(weights, y)
plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
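The memory blow-up comes from the full (len(x_eval), len(x)) weight matrix. One possible workaround, sketched here as an assumption rather than as part of the answer, is to process the evaluation points in chunks so only a slice of that matrix exists at any time. The function name and the chunk_size parameter are hypothetical.

```python
import numpy as np

def gaussian_smooth_chunked(x, y, x_eval, sigma, chunk_size=1000):
    """Gaussian-weighted average of y at x_eval, computed chunk by chunk."""
    y_eval = np.empty(len(x_eval))
    for start in range(0, len(x_eval), chunk_size):
        stop = start + chunk_size
        delta_x = x_eval[start:stop, None] - x
        # The Gaussian prefactor cancels in the normalization, so it is omitted
        weights = np.exp(-delta_x * delta_x / (2 * sigma * sigma))
        weights /= weights.sum(axis=1, keepdims=True)
        y_eval[start:stop] = weights @ y
    return y_eval

rng = np.random.default_rng(0)  # arbitrary seed
x = np.sort(rng.random(30))
y = rng.random(30)
x_eval = np.linspace(0, 1, 11)
chunked = gaussian_smooth_chunked(x, y, x_eval, sigma=0.1, chunk_size=4)
print(chunked)
```

Peak memory then scales with chunk_size * len(x) instead of len(x_eval) * len(x), at the cost of a Python-level loop.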
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said, there is a simple two-step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max()) # choose new x axis sampling
ny = np.interp(nx,x,y) # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
from scipy.ndimage import gaussian_filter1d
fx = gaussian_filter1d(ny, sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter.
Based on @Jaime's answer, I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the datapoints.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
import numpy as np

def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
    """Apply gaussian sum filter to data.

    xdata, ydata : array
        Arrays of x- and y-coordinates of data.
        Must be 1d and have the same length.
    xeval : array
        Array of x-coordinates at which to evaluate the smoothed result
    sigma : float
        Standard deviation of the Gaussian to apply to each data point
        Larger values yield a smoother curve.
    null_thresh : float
        For evaluation points far from data points, the estimate will be
        based on very little data. If the total weight is below this threshold,
        return np.nan at this location. Zero means always return an estimate.
        The default of 0.6 corresponds to approximately one sigma away
        from the nearest datapoint.
    """
    # Distance between every combination of xdata and xeval
    # each row corresponds to a value in xeval
    # each col corresponds to a value in xdata
    delta_x = xeval[:, None] - xdata
    # Calculate weight of every value in delta_x using Gaussian
    # Maximum weight is 1.0 where delta_x is 0
    weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))
    # Multiply each weight by every data point, and sum over data points
    smoothed = np.dot(weights, ydata)
    # Nullify the result when the total weight is below threshold
    # This happens at evaluation points far from any data
    # 1-sigma away from a data point has a weight of ~0.6
    nan_mask = weights.sum(1) < null_thresh
    smoothed[nan_mask] = np.nan
    # Normalize by dividing by the total weight at each evaluation point
    # Nullification above avoids divide by zero warnings here
    smoothed = smoothed / weights.sum(1)
    return smoothed
