I have a huge dataset with 271116 rows of data. I normalized the data using z-score normalization. I have no way of knowing whether the data actually follows a normal distribution, so I plotted a simple density graph using matplotlib:
hdf = df['Height'].plot(kind = 'kde', stacked = False)
plt.show()
I got this for a result:
Though the data seems somewhat normal, can I apply the Central Limit Theorem, i.e. take the means of different random samples (say, 10000 times) to get a smooth bell curve?
Any help in Python is appreciated, thanks.
Something like:
import numpy as np

sampleMeans = []
for _ in range(100000):
    samples = df['Height'].sample(n=100)
    sampleMean = np.mean(samples)
    sampleMeans.append(sampleMean)

# Now you have a list of sample means to plot - should be normally distributed
The mean of the distribution should equal the mean of the original data, and the standard deviation should be a factor of ten less than the original data. If the result isn't smooth enough, then increase .sample(n=100) to a higher figure. This will also decrease the standard deviation of the resulting bell curve. The general rule is that the CLT standard deviation is the data standard deviation divided by sqrt(n).
It's important to note that the resulting distribution is different from the original. It is not merely smoothed out using the CLT.
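For completeness, here is a minimal sketch (reusing the sampleMeans list built above and assuming df['Height'] from the question) that plots the sample means and checks them against the CLT prediction:

import matplotlib.pyplot as plt

n = 100  # sample size used in the loop above

# Histogram of the sample means - should look like a bell curve
plt.hist(sampleMeans, bins=100, density=True)
plt.title('Distribution of sample means')
plt.show()

# CLT check: mean of means ~ original mean, std of means ~ original std / sqrt(n)
print(np.mean(sampleMeans), df['Height'].mean())
print(np.std(sampleMeans), df['Height'].std() / np.sqrt(n))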
I have a dataset with a signal and a 1/distance (Angstrom^-1) column.
This is the dataset (fourier.csv): https://pastebin.com/ucFekzc6
After applying these steps:
import pandas as pd
import numpy as np
from numpy.fft import fft
df = pd.read_csv (r'fourier.csv')
df.plot(x ='1/distance', y ='signal', kind = 'line')
I generated this plot:
To generate the Fast Fourier Transformation data, I used the numpy library for its fft function and I applied it like this:
df['signal_fft'] = fft(df['signal'])
df.plot(x ='1/distance', y ='signal_fft', kind = 'line')
Now the plot looks like this, with the FFT data plotted instead of the initial "signal" data:
What I hoped to generate is something like this (This signal is extremely similar to mine, yet yields a vastly different FFT picture):
Theory Signal before windowing:
Theory FFT:
As you can see, my initial plot looks somewhat similar to graphic (a), but my FFT plot doesn't look anywhere near as clear as graphic (b). I'm still using the 1/distance data for both horizontal axes, but I don't see anything wrong with it, only that it should be interpreted as distance (Angstrom) instead of 1/distance (1/Angstrom) in the FFT plot.
How should I apply FFT in order to get a result that resembles the theoretical FFT curve?
Here's another slide that shows a similar initial signal to mine and a yet again vastly different FFT:
Addendum: I have been asked to provide some additional information on this problem, so I hope this helps.
The origin of the dataset that I have linked is an XAS (X-Ray Absorption Spectroscopy) spectrum of iron oxide. Such an experimentally obtained spectrum looks similar to the one shown in the "Schematic of XAFS data processing" on the top left, i.e. absorbance [a.u.] plotted against the photon energy [eV]. Firstly I processed the spectrum (pre-edge baseline correction + post-edge normalization). Then I converted the data on the x-axis from energy E to wavenumber k (thus dimension 1/Angstrom) and cut off the signal at the L-edge jump, leaving only the signal in the post-edge EXAFS region, referred to as fine structure function χ(k). The mentioned dataset includes k^2 weighted χ(k) (to emphasize oscillations at large k). All of this is not entirely relevant as the only thing I want to do now is a Fourier transformation on this signal ( k^2 χ(k) vs. k). In theory, as we are dealing with photoelectrons and (back)scattering phenomena, the EXAFS region of the XAS spectrum can be approximated using a superposition of many sinusoidal waves such as described in this equation with f(k) being the amplitude and δ(k) the phase shift of the scattered wave.
The aim is to gain an understanding of the chemical environment and the coordination spheres around the absorbing atom. The goal of the Fourier transform is to obtain some sort of signal in dependence of the "radius" R [Angstrom], which could later on be correlated to e.g. an oxygen being in ~2 Angstrom distance to the Mn atom (see "Schematic of XAFS data processing" on the right).
I only want to be able to reproduce the theoretically expected output after the FFT. My main concern is to get rid of the weird output signal and produce something that in some way resembles a curve with somewhat distinct local maxima (as shown in the 4th picture).
I don't have a 100% solution for you, but here's part of the problem.
The fft function you're using assumes that your X values are equally spaced. I checked this assumption by taking the difference between each 1/distance value, and graphing it:
df['1/distance'].diff().plot()
(Y is the difference, X is the index in the dataframe.)
This is supposed to be a constant line.
In order to fix this, one solution is to resample the signal through linear interpolation so that the timestep is constant.
from scipy import interpolate
rs_df = df.drop_duplicates().copy() # Needed because 0 is present twice in dataset
x = rs_df['1/distance']
y = rs_df['signal']
flinear = interpolate.interp1d(x, y, kind='linear')
xnew = np.linspace(np.min(x), np.max(x), rs_df.index.size)
ylinear = flinear(xnew)
rs_df['signal'] = ylinear
rs_df['1/distance'] = xnew
df.plot(x ='1/distance', y ='signal', kind = 'line')
rs_df.plot(x ='1/distance', y ='signal', kind = 'line')
The new line looks visually identical, but has a constant timestep.
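If you want to confirm that the resampling worked, a quick sanity check (reusing rs_df from above) might be:

steps = rs_df['1/distance'].diff().dropna()
print(steps.min(), steps.max())  # the two values should agree to within float precision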
I still don't get your intended result from the FFT, so this is only a partial solution.
MCVE
We import required dependencies:
import numpy as np
import pandas as pd
from numpy import fft
from scipy import interpolate, signal
import matplotlib.pyplot as plt
And we load your dataset:
raw = pd.read_csv("https://pastebin.com/raw/ucFekzc6", sep="\t",
names=["k", "wchi"], header=0)
We clean the dataset a bit as it contains duplicates and a problematic point with null wave number (or infinite distance) and ensure a zero mean signal:
raw = raw.drop_duplicates()
raw = raw.iloc[1:, :]
raw["wchi"] = raw["wchi"] - raw["wchi"].mean()
The signal looks like this:
As noticed by @NickODell, the signal is not equally sampled, which is a problem if you aim to perform FFT signal processing.
We can resample your signal to have equally spaced sampling:
N = 65536
k = np.linspace(raw["k"].min(), raw["k"].max(), N)
interpolant = interpolate.interp1d(raw["k"], raw["wchi"], kind="linear")
g = interpolant(k)
Note that, by convention, the FFT output places the null-frequency component at the borders (that's why your FFT signal does not look the way it is usually presented in books). This can be corrected by using the classic fftshift method or by performing ad hoc indexing.
R = 2*np.pi*fft.fftfreq(N, np.diff(k)[0])[:N//2]
G = (1/N)*fft.fft(g)[0:N//2]
Mind the 2π factor which is involved in the units scaling of your transformation.
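A minimal plotting sketch (reusing R and G from the snippet above; the axis range is my own choice):

fig, axe = plt.subplots()
axe.plot(R, np.abs(G))           # magnitude spectrum versus distance R
axe.set_xlim(0, 10)              # assumed region of interest, in Angstrom
axe.set_xlabel("Distance, R [Angstrom]")
axe.set_ylabel("Magnitude, |G(R)|")
axe.grid()
plt.show()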
You also mentioned a windowing step (at least in a picture) that is not applied anywhere. This kind of filtering can help a lot when performing signal processing, as it filters out artifacts and unwanted noise. I leave it up to you.
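If you want to try it, one possible sketch (the choice of window is mine, not from your references) is to apply a Hann window to the resampled signal before taking the FFT, which reduces leakage from the truncated ends:

w = np.hanning(N)                        # Hann window, same length as g
Gw = (1/N)*fft.fft(g*w)[0:N//2]          # windowed spectrum, same indexing as above

Other windows (Hamming, Kaiser, ...) are worth experimenting with.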
Least Square Spectral Analysis
An alternative way to process your signal is available thanks to modern linear algebra: it is possible to estimate the periodogram of an irregularly sampled signal with a method called Least Square Spectral Analysis.
You are looking for the square root of the periodogram of your signal, and scipy implements an easy way to compute it via the Lomb-Scargle method.
To do so, we simply create a frequency vector (in this case the desired output distances) and perform the regression for each of those distances w.r.t. your signal:
Rhat = np.linspace(0.01, 10., 5000)  # candidate distances R [Angstrom]; this range is chosen by hand
Ghat = signal.lombscargle(raw["k"], raw["wchi"], freqs=Rhat, normalize=True)
Graphically it leads to:
Comparison
If we compare both methodologies, we can confirm that the major peaks definitely match.
LSSA gives a smoother curve, but do not assume it is more accurate: it is a statistical smoothing of an interpolated curve. Anyway, it fits the bill for your requirement:
I only want to be able to reproduce the theoretically expected output
after the FFT. My main concern is to get rid of the weird output
signal and produce something that in some way resembles a curve with
somewhat distinct local maxima (as shown in the 4th picture).
Conclusions
I think you have enough information to process your signal either by resampling and using FFT or by using LSSA. Both methods have advantages and drawbacks.
Of course, this needs to be validated against well-known cases. Why not reproduce the figures you posted from the data of the paper you are working from, to check that you can reconstruct them?
You also need to dig into the signal conditioning before performing post-processing (resampling, windowing, filtering).
I'm looking for some advice on how to implement some statistical models in Python. I'm interested in constructing a sequence of z values (z_1,z_2,z_3,...,z_n) where the number of jumps in an interval (z_1,z_2] is distributed according to the Poisson distribution with parameter lambda(z_2-z_1)
and the numbers of random jumps over disjoint intervals are independent random variables. I want my piecewise constant plot to look something like the two images below, where the y axis is Y(z), and Y(z) consists of, say, N(0,1) random variables on each interval.
To construct the z data, what would be the best way to tackle this? I have tried sampling values via np.random.poisson and then taking a cumulative sum, but the drawn values are repeated for small intensity values. Any help or thoughts would be really appreciated. Thanks.
np.random.poisson is used to sample the count of events that occurred in [z_i, z_j). If you want to sample the events as they occur, then you just want the exponential distribution, since the waiting times between events of a Poisson process are exponentially distributed. For example:
import numpy as np

n = 50
z = np.cumsum(np.random.exponential(1/n, size=n))  # jump locations: cumulative exponential waiting times
y = np.random.normal(size=n)                       # level of Y(z) on each interval
plotting these (using step in matplotlib) gives something similar to your plots:
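For reference, a minimal plotting sketch (reusing z and y from above) might be:

import matplotlib.pyplot as plt

plt.step(z, y, where='post')   # piecewise-constant path
plt.xlabel('z')
plt.ylabel('Y(z)')
plt.show()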
Note that the 1/n sets a "lambda" so that on average we expect n points within [0, 1]. In this case we got slightly fewer, so the path overshoots 1. Feel free to rescale if that's important to you.
I have this task:
1. Choose your favorite continuous distribution (the less it looks normal, the more interesting; try to choose one of the distributions we have not discussed in the course).
2. Generate a sample of 1000 from it, build a histogram of the sample, and draw the theoretical density of your random variable on top of it (so that the values are on the same scale, don't forget to set the histogram to normed=True). Your task is to estimate the distribution of the sample average of your random variable for different sample sizes.
3. To do this, generate 1000 samples of size n and build histograms of their sample averages for three or more values of n (for example, 5, 10, 50).
4. Using the information on the mean and variance of the original distribution (easily found on Wikipedia), calculate the values of the normal distribution parameters which, according to the central limit theorem, approximate the distribution of the sample averages. Note: to calculate these parameters, it is the theoretical mean and variance of your random variable that should be used, not their sample estimates.
5. On top of each histogram, draw the density of the corresponding normal distribution (be careful with the parameters of the function: it takes the standard deviation, not the variance).
6. Describe the difference between the distributions obtained for different values of n. How does the accuracy of the approximation of the distribution of sample averages change as n grows?
So if I want to pick the exponential distribution in Python, do I need to do it like this?
from scipy.stats import expon
import matplotlib.pyplot as plt
exdist=sc.expon(loc=2,scale=3) # loc and scale - shift and scale parameters, default values 0 and 1.
mean, var, skew, kurt = exdist.stats(moments='mvsk') # Let's see the moments of our distribution.
x = np.linspace((0,2,100))
ax.plot(x, exdist.pdf(x)) # Let's draw it
arr=exdist.rvc(size=1000) # generation of thousand of random numbers. (Is it for task 3?)
And I constantly get this error; here is a screenshot from Jupyter:
https://i.stack.imgur.com/zwUtu.png
Could you please explain to me how to write the right code? I can't figure out where to start or where I'm making a mistake. Do I have to use arr.mean() to compute the sample mean and plt.hist(arr, bins=...) to build a histogram? I would be very grateful for an explanation.
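For what it's worth, a minimal corrected sketch of the snippet above (keeping loc=2, scale=3; the plotting range and bin count are my own choices) might look like:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

exdist = expon(loc=2, scale=3)                # frozen distribution ('sc' was never defined above)
mean, var, skew, kurt = exdist.stats(moments='mvsk')

arr = exdist.rvs(size=1000)                   # sample of 1000 (the method is rvs, not rvc)
x = np.linspace(2, 20, 100)                   # np.linspace takes separate arguments, not a tuple

fig, ax = plt.subplots()                      # 'ax' has to be created before use
ax.hist(arr, bins=30, density=True)           # density=True plays the role of normed=True
ax.plot(x, exdist.pdf(x))                     # theoretical density on top
plt.show()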
As part of my research, I measure the mean and standard deviation of draws from a lognormal distribution. Given a value of the underlying normal distribution, it should be possible to analytically predict these quantities (as given at https://en.wikipedia.org/wiki/Log-normal_distribution).
However, as can be seen in the plots below, this does not seem to be the case. The first plot shows the mean of the lognormal data against the sigma of the gaussian, while the second plot shows the sigma of the lognormal data against that of the gaussian. Clearly, the "calculated" lines deviate from the "analytic" ones very significantly.
I take the mean of the gaussian distribution to be related to the sigma by mu = -0.5*sigma**2 as this ensures that the lognormal field ought to have mean of 1. Note, this is motivated by the area of physics that I work in: the deviation from analytic values still occurs if you set mu=0.0 for example.
By copying and pasting the code at the bottom of the question, it should be possible to reproduce the plots below. Any advice as to what might be causing this would be much appreciated!
Mean of lognormal vs sigma of gaussian:
Sigma of lognormal vs sigma of gaussian:
Note, to produce the plots above, I used N=10000, but have put N=1000 in the code below for speed.
import numpy as np
import matplotlib.pyplot as plt
mean_calc = []
sigma_calc = []
mean_analytic = []
sigma_analytic = []
ss = np.linspace(1.0,10.0,46)
N = 1000
for s in ss:
    mu = -0.5*s*s
    ln = np.random.lognormal(mean=mu, sigma=s, size=(N,N))
    mean_calc += [np.average(ln)]
    sigma_calc += [np.std(ln)]
    mean_analytic += [np.exp(mu+0.5*s*s)]
    sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
plt.loglog(ss,mean_calc,label='calculated')
plt.loglog(ss,mean_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\mu_{LN}$')
plt.show()
plt.loglog(ss,sigma_calc,label='calculated')
plt.loglog(ss,sigma_analytic,label='analytic')
plt.legend();plt.grid()
plt.xlabel(r'$\sigma_G$')
plt.ylabel(r'$\sigma_{LN}$')
plt.show()
TL;DR
The lognormal is a positively skewed, heavy-tailed distribution. When performing floating-point arithmetic operations (such as sum, mean or std) on a sample drawn from a highly skewed distribution, the sampling vector contains values that differ by several orders of magnitude (many decades). This makes the computation inaccurate.
The problem comes from those two lines:
mean_calc += [np.average(ln)]
sigma_calc += [np.std(ln)]
Because ln contains both very low and very high values, spanning more orders of magnitude than float precision can accommodate.
The problem can easily be detected, to warn the user that the computation is wrong, using the following predicate:
(max(ln) + min(ln)) <= max(ln)
which is obviously false for strictly positive reals but must be considered when using finite-precision arithmetic.
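A tiny demonstration of the absorption effect this predicate detects:

import numpy as np

x = np.array([1e-20, 1e20])
print((x.min() + x.max()) <= x.max())   # True: the small value is completely absorbed by the large one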
Modifying your MCVE
If we slightly modify your MCVE to:
from scipy import stats

for s in ss:
    mu = -0.5*s*s
    ln = stats.lognorm(s, scale=np.exp(mu)).rvs(N*N)
    f = stats.lognorm.fit(ln, floc=0)
    mean_calc += [f[2]*np.exp(0.5*s*s)]
    sigma_calc += [np.sqrt((np.exp(f[0]**2)-1)*(np.exp(2*mu + s*s)))]
    mean_analytic += [np.exp(mu+0.5*s*s)]
    sigma_analytic += [np.sqrt((np.exp(s**2)-1)*(np.exp(2*mu + s*s)))]
It gives reasonably correct mean and standard deviation estimates even for high values of sigma.
The key is that fit uses the MLE algorithm to estimate the parameters. This is completely different from your original approach, which directly takes the mean of the sample.
The fit method returns a tuple (sigma, loc=0, scale=exp(mu)), which are the parameters of the scipy.stats.lognorm object as specified in the documentation.
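To make the mapping explicit, the returned tuple can be unpacked like this (a small sketch reusing ln from above; the variable names are mine):

shape, loc, scale = stats.lognorm.fit(ln, floc=0)
sigma_hat = shape           # estimate of the gaussian sigma
mu_hat = np.log(scale)      # estimate of the gaussian mu, since scale = exp(mu)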
I think you should investigate how you are estimating mean and standard deviation. The divergence probably comes from this part of your algorithm.
There might be several reasons why it diverges; at least consider the following:
Biased estimator: are your estimators correct and unbiased? The mean is an unbiased estimator (see next section) but maybe not an efficient one;
Sampled outliers from the pseudo-random generator may not be as extreme as they should be compared to the theoretical distribution: maybe MLE is less sensitive than your estimator. The new MCVE below does not support this hypothesis, but float arithmetic error can explain why your estimators are underestimated;
Float arithmetic error: the new MCVE below highlights that this is part of your problem.
A scientific quote
It seems the naive mean estimator (simply taking the mean), even if unbiased, is inefficient at properly estimating the mean for large sigma (see Qi Tang, Comparison of Different Methods for Estimating Log-normal Means, p. 11):
The naive estimator is easy to calculate and it is unbiased. However,
this estimator can be inefficient when variance is large and sample
size is small.
The thesis reviews several methods for estimating the mean of a lognormal distribution and uses MLE as the reference for comparison. This explains why your method drifts as sigma increases while MLE sticks closer, although it is not time-efficient for large N. A very interesting paper.
Statistical considerations
Recall that:
The lognormal is a heavy-tailed, long-tailed, positively skewed distribution. One consequence: as the shape parameter sigma grows, the asymmetry and skewness grow, and so does the strength of the outliers.
Effect of sample size: as the number of samples drawn from a distribution grows, the expectation of encountering an outlier increases (and so does its extent).
Building a new MCVE
Let's build a new MCVE to make this clearer. The code below draws samples of different sizes (N ranges from 100 to 100,000) from a lognormal distribution whose shape parameter varies (sigma ranges between 0.1 and 10) and whose scale parameter is set to unity.
import warnings
import numpy as np
from scipy import stats
# Make computation reproducible among batches:
np.random.seed(123456789)
# Parameters ranges:
sigmas = np.arange(0.1, 10.1, 0.1)
sizes = np.logspace(2, 5, 21, base=10).astype(int)
# Placeholders:
rv = np.empty((sigmas.size,), dtype=object)
xmean = np.full((3, sigmas.size, sizes.size), np.nan)
xstd = np.full((3, sigmas.size, sizes.size), np.nan)
xextent = np.full((2, sigmas.size, sizes.size), np.nan)
eps = np.finfo(np.float64).eps
# Iterate Shape Parameter:
for (i, s) in enumerate(sigmas):
    # Create Random Variable:
    rv[i] = stats.lognorm(s, loc=0, scale=1)
    # Iterate Sample Size:
    for (j, N) in enumerate(sizes):
        # Draw Samples:
        xs = rv[i].rvs(N)
        # Sample Extent:
        xextent[:,i,j] = [np.min(xs), np.max(xs)]
        # Check (max(x) + min(x)) <= max(x)
        if (xextent[0,i,j] + xextent[1,i,j]) - xextent[1,i,j] < eps:
            warnings.warn("Potential Float Arithmetic Errors: logN(mu=%.2f, sigma=%2f).sample(%d)" % (0, s, N))
        # Generate different Estimators:
        # Fit Parameters using MLE:
        fit = stats.lognorm.fit(xs, floc=0)
        xmean[0,i,j] = fit[2]
        xstd[0,i,j] = fit[0]
        # Naive (Bad Estimators because of Float Arithmetic Error):
        xmean[1,i,j] = np.mean(xs)*np.exp(-0.5*s**2)
        xstd[1,i,j] = np.sqrt(np.log(np.std(xs)**2*np.exp(-s**2)+1))
        # Log-transform:
        xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
        xstd[2,i,j] = np.std(np.log(xs))
Observation: The new MCVE starts to raise warnings when sigma > 4.
MLE as Reference
Estimating shape and scale parameters using MLE performs well:
The two figures above show that:
the estimation error grows along with the shape parameter;
the estimation error shrinks as the sample size increases (CLT).
Note that MLE also fits the shape parameter well:
Float Arithmetic
It is worth plotting the extent of the drawn samples versus shape parameter and sample size:
Or the number of decades separating the smallest and largest numbers of the sample:
On my setup:
np.finfo(np.float64).precision # 15
np.finfo(np.float64).eps # 2.220446049250313e-16
It means we have at most 15 significant figures to work with; if the difference in magnitude between two numbers exceeds that, the larger number absorbs the smaller ones.
A basic example: What is the result of 1 + 1e6 if we can only keep four significant figures?
The exact result is 1,000,001.0, but it must be rounded off to 1.000e6. This implies that the result of the operation equals the larger number, because of the rounding precision. This is inherent to finite-precision arithmetic.
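The same phenomenon can be observed directly with Python's double-precision floats:

print(1.0 + 1e16 == 1e16)   # True: the 1.0 is completely absorbed
print(1.0 + 1e15 == 1e15)   # False: still within the ~15-16 available significant figures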
The two previous figures, in conjunction with the statistical considerations above, support your observation that increasing N does not improve the estimation for large values of sigma in your MCVE.
The figures above and below show that when sigma > 3 we no longer have enough significant figures (fewer than 5) to perform valid computations.
Furthermore, we can say that the estimator will underestimate, because the largest numbers absorb the smallest, and the underestimated sum is then divided by N, biasing the estimator downward by construction.
When the shape parameter becomes sufficiently large, computations are strongly biased because of float arithmetic errors.
It means using quantities such as:
np.mean(xs)
np.std(xs)
to compute estimates will incur huge float arithmetic error because of the large discrepancy among the values stored in xs. The figures below reproduce your issue:
As stated, the estimates fall short (rather than overshooting) because the high values (a few outliers) in the sampled vector absorb the small values (most of the sampled values).
Logarithmic Transformation
If we apply a logarithmic transformation, we can drastically reduce this phenomenon:
xmean[2,i,j] = np.exp(np.mean(np.log(xs)))
xstd[2,i,j] = np.std(np.log(xs))
Then the naive estimation of the mean is correct, and it is far less affected by float arithmetic error because all sample values lie within a few decades of each other instead of spanning more orders of magnitude than the float precision allows.
Actually, taking the log-transform returns the same mean and std estimates as MLE for each N and sigma:
np.allclose(xmean[0,:,:], xmean[2,:,:]) # True
np.allclose(xstd[0,:,:], xstd[2,:,:]) # True
Reference
If you are looking for complete and detailed explanations of this kind of issue when performing scientific computing, I recommend reading the excellent book: N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Second Edition, 2002.
Bonus
Here is an example of the figure-generation code:
import matplotlib.pyplot as plt
fig, axe = plt.subplots()
idx = slice(None, None, 5)
axe.loglog(sigmas, xmean[0,:,idx])
axe.axhline(1, linestyle=':', color='k')
axe.set_title(r"MLE: $x \sim \log\mathcal{N}(\mu=0,\sigma)$")
axe.set_xlabel(r"Standard Deviation, $\sigma$")
axe.set_ylabel(r"Mean Estimation, $\hat{\mu}$")
axe.set_ylim([0.1,10])
lgd = axe.legend([r"$N = %d$" % s for s in sizes[idx]] + ['Exact'], bbox_to_anchor=(1,1), loc='upper left')
axe.grid(which='both')
fig.savefig('Lognorm_MLE_Emean_Sigma.png', dpi=120, bbox_extra_artists=(lgd,), bbox_inches='tight')
I have some data that I sampled from a radar satellite image and wanted to perform some statistical tests on. Before this, I wanted to conduct a normality test so I could be sure my data was normally distributed. My data appears to be normally distributed, but when I perform the test I'm getting a p-value of 0, suggesting my data is not normally distributed.
I have attached my code along with the output and a histogram of the distribution (I'm relatively new to Python, so apologies if my code is clunky in any way). Can anyone tell me if I'm doing something wrong? I find it hard to believe from my histogram that my data is not normally distributed.
import h5py
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

values = 'inputfile.h5'
f = h5py.File(values, 'r')
dset = f['/DATA/DATA']
array = dset[..., 0]

print('normality =', scipy.stats.normaltest(array))

max = np.amax(array)
min = np.amin(array)
histo = np.histogram(array, bins=100, range=(min, max))
freqs = histo[0]
rangebins = (max - min)
numberbins = (len(histo[1]) - 1)
interval = (rangebins / numberbins)
newbins = np.arange((min), (max), interval)
histogram = plt.bar(newbins, freqs, width=0.2, color='gray')
plt.show()
This prints: (41099.095955202931, 0.0). The first element is the chi-squared statistic and the second is the p-value.
I have made a graph of the data, which I have attached. I thought that maybe, as I'm dealing with negative values, it was causing a problem, so I normalised the values, but the problem persists.
This question explains why you're getting such a small p-value. Essentially, normality tests almost always reject the null on very large sample sizes (in yours, for example, you can see just a bit of skew on the left side, which at your enormous sample size is far more than enough).
What would be much more practically useful in your case is to plot a normal curve fit to your data. Then you can see how the normal curve actually differs (for example, you can see whether the tail on the left side does indeed go too long). For example:
from matplotlib import pyplot as plt
import numpy as np
from scipy.stats import norm

# density=True replaces the normed=1 argument removed from recent matplotlib versions
n, bins, patches = plt.hist(array, 50, density=True)
mu = np.mean(array)
sigma = np.std(array)
# scipy.stats.norm.pdf replaces the removed matplotlib.mlab.normpdf
plt.plot(bins, norm.pdf(bins, mu, sigma))
plt.show()
(Note the density=True argument, formerly normed=1: it ensures that the histogram is normalized to have a total area of 1, which makes it comparable to a density like the normal distribution.)
In general, when the number of samples is less than 50, you should be careful about using tests of normality: these tests need enough evidence to reject the null hypothesis, which is "the distribution of the data is normal", and when the number of samples is small they are not able to find that evidence.
Keep in mind that when you fail to reject the null hypothesis it does not mean that the alternative hypothesis is correct.
There is another possibility: some implementations of statistical tests for normality compare the distribution of your data to the standard normal distribution. To account for this, I suggest you standardize the data and then apply the test of normality.
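A minimal sketch of that suggestion (reusing array from the question; the choice of the Kolmogorov-Smirnov test, which by default compares against the standard normal, is mine):

import numpy as np
from scipy import stats

z = (array - np.mean(array)) / np.std(array)   # standardize to zero mean, unit variance
print(stats.kstest(z, 'norm'))                 # compare the standardized data to N(0, 1)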