I have some data sampled from a radar satellite image that I want to run some statistical tests on. Before doing so, I wanted to run a normality test to make sure the data is normally distributed. The data appears to be normally distributed, but when I run the test I get a p-value of 0, suggesting it is not.
I have attached my code along with the output and a histogram of the distribution (I'm relatively new to Python, so apologies if my code is clunky in any way). Can anyone tell me if I'm doing something wrong? Judging from my histogram, I find it hard to believe that my data is not normally distributed.
import h5py
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

values = 'inputfile.h5'
f = h5py.File(values, 'r')
dset = f['/DATA/DATA']
array = dset[..., 0]
print('normality =', scipy.stats.normaltest(array))
# vmin/vmax instead of min/max, to avoid shadowing the Python builtins
vmax = np.amax(array)
vmin = np.amin(array)
histo = np.histogram(array, bins=100, range=(vmin, vmax))
freqs = histo[0]
rangebins = vmax - vmin
numberbins = len(histo[1]) - 1
interval = rangebins / numberbins
newbins = np.arange(vmin, vmax, interval)
histogram = plt.bar(newbins, freqs, width=0.2, color='gray')
plt.show()
This prints: (41099.095955202931, 0.0). The first element is a chi-square value and the second is a p-value.
I have made a graph of the data, which I have attached. I thought that dealing with negative values might be causing a problem, so I normalised the values, but the problem persists.
This question explains why you're getting such a small p-value. Essentially, normality tests almost always reject the null hypothesis at very large sample sizes (in your case, for example, there is a slight skew on the left side, which at your enormous sample size is more than enough to reject).
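To illustrate the sample-size effect, here is a minimal sketch on synthetic data (not the radar data from the question). It assumes a mildly skewed population drawn with `scipy.stats.skewnorm`; the shape parameter `a=1.0` and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Mildly skewed data: a skew-normal with a=1 still looks close to a bell curve
population = stats.skewnorm.rvs(a=1.0, size=200_000, random_state=rng)

# Run the same test on growing prefixes of the same data
for n in (100, 1_000, 10_000, 200_000):
    stat, p = stats.normaltest(population[:n])
    print(f"n={n:>7}  statistic={stat:10.2f}  p-value={p:.3g}")

# Result for the full sample, kept for inspection
stat_large, p_large = stats.normaltest(population)
```

The departure from normality is the same at every sample size; only the power of the test to detect it changes, which is why the p-value collapses as n grows.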
What would be much more practically useful in your case is to plot a normal curve fit to your data. Then you can see how the normal curve actually differs (for example, you can see whether the tail on the left side does indeed go too long). For example:
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats

n, bins, patches = plt.hist(array, 50, density=True)
mu = np.mean(array)
sigma = np.std(array)
# scipy.stats.norm.pdf replaces mlab.normpdf, which was removed from Matplotlib
plt.plot(bins, stats.norm.pdf(bins, mu, sigma))
plt.show()
(Note the density=True argument, spelled normed=1 in older Matplotlib versions: it ensures that the histogram is normalized to have a total area of 1, which makes it comparable to a density like the normal distribution.)
In general, when the number of samples is less than 50, you should be careful about using tests of normality. These tests need enough evidence to reject the null hypothesis, which is "the distribution of the data is normal", and when the number of samples is small they are not able to find that evidence.
Keep in mind that failing to reject the null hypothesis does not mean that the null hypothesis is correct.
There is another possibility:
Some implementations of statistical tests for normality compare the distribution of your data to the standard normal distribution. To avoid this, I suggest standardizing the data and then applying the test of normality.
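A minimal sketch of the standardization point, using `scipy.stats.kstest` as one example of a test whose default reference is the standard normal. The data here is synthetic and the location and scale (5 and 2) are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=500)  # normal, but not standard normal

# Comparing the raw data with N(0, 1) rejects purely because of location and scale
stat_raw, p_raw = stats.kstest(data, "norm")

# Standardize first, then compare with N(0, 1)
z = (data - data.mean()) / data.std(ddof=1)
stat_std, p_std = stats.kstest(z, "norm")

print(f"raw data:      statistic={stat_raw:.3f}  p-value={p_raw:.3g}")
print(f"standardized:  statistic={stat_std:.3f}  p-value={p_std:.3g}")
```

One caveat: when the mean and standard deviation are estimated from the same data, the K-S p-value is conservative (too large), so the standardized result is a rough check rather than an exact test.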
Related
I have a dataset with a signal and a 1/distance (Angstrom^-1) column.
This is the dataset (fourier.csv): https://pastebin.com/ucFekzc6
After applying these steps:
import pandas as pd
import numpy as np
from numpy.fft import fft
df = pd.read_csv (r'fourier.csv')
df.plot(x ='1/distance', y ='signal', kind = 'line')
I generated this plot:
To generate the Fast Fourier Transformation data, I used the numpy library for its fft function and I applied it like this:
df['signal_fft'] = fft(df['signal'])
df.plot(x ='1/distance', y ='signal_fft', kind = 'line')
Now the plot looks like this, with the FFT data plotted instead of the initial "signal" data:
What I hoped to generate is something like this (This signal is extremely similar to mine, yet yields a vastly different FFT picture):
Theory Signal before windowing:
Theory FFT:
As you can see, my initial plot looks somewhat similar to graphic (a), but my FFT plot doesn't look anywhere near as clear as graphic (b). I'm still using the 1/distance data for both horizontal axes, but I don't see anything wrong with it, only that it should be interpreted as distance (Angstrom) instead of 1/distance (1/Angstrom) in the FFT plot.
How should I apply FFT in order to get a result that resembles the theoretical FFT curve?
Here's another slide that shows a similar initial signal to mine and a yet again vastly different FFT:
Addendum: I have been asked to provide some additional information on this problem, so I hope this helps.
The origin of the dataset that I have linked is an XAS (X-Ray Absorption Spectroscopy) spectrum of iron oxide. Such an experimentally obtained spectrum looks similar to the one shown in the "Schematic of XAFS data processing" on the top left, i.e. absorbance [a.u.] plotted against the photon energy [eV]. Firstly I processed the spectrum (pre-edge baseline correction + post-edge normalization). Then I converted the data on the x-axis from energy E to wavenumber k (thus dimension 1/Angstrom) and cut off the signal at the L-edge jump, leaving only the signal in the post-edge EXAFS region, referred to as fine structure function χ(k). The mentioned dataset includes k^2 weighted χ(k) (to emphasize oscillations at large k). All of this is not entirely relevant as the only thing I want to do now is a Fourier transformation on this signal ( k^2 χ(k) vs. k). In theory, as we are dealing with photoelectrons and (back)scattering phenomena, the EXAFS region of the XAS spectrum can be approximated using a superposition of many sinusoidal waves such as described in this equation with f(k) being the amplitude and δ(k) the phase shift of the scattered wave.
The aim is to gain an understanding of the chemical environment and the coordination spheres around the absorbing atom. The goal of the Fourier transform is to obtain some sort of signal in dependence of the "radius" R [Angstrom], which could later on be correlated to e.g. an oxygen being in ~2 Angstrom distance to the Mn atom (see "Schematic of XAFS data processing" on the right).
I only want to be able to reproduce the theoretically expected output after the FFT. My main concern is to get rid of the weird output signal and produce something that in some way resembles a curve with somewhat distinct local maxima (as shown in the 4th picture).
I don't have a 100% solution for you, but here's part of the problem.
The fft function you're using assumes that your X values are equally spaced. I checked this assumption by taking the difference between each 1/distance value, and graphing it:
df['1/distance'].diff().plot()
(Y is the difference, X is the index in the dataframe.)
This is supposed to be a constant line.
In order to fix this, one solution is to resample the signal through linear interpolation so that the timestep is constant.
from scipy import interpolate
rs_df = df.drop_duplicates().copy() # Needed because 0 is present twice in dataset
x = rs_df['1/distance']
y = rs_df['signal']
flinear = interpolate.interp1d(x, y, kind='linear')
xnew = np.linspace(np.min(x), np.max(x), rs_df.index.size)
ylinear = flinear(xnew)
rs_df['signal'] = ylinear
rs_df['1/distance'] = xnew
df.plot(x ='1/distance', y ='signal', kind = 'line')
rs_df.plot(x ='1/distance', y ='signal', kind = 'line')
The new line looks visually identical, but has a constant timestep.
I still don't get your intended result from the FFT, so this is only a partial solution.
MCVE
We import required dependencies:
import numpy as np
import pandas as pd
from numpy import fft
from scipy import interpolate, signal
import matplotlib.pyplot as plt
And we load your dataset:
raw = pd.read_csv("https://pastebin.com/raw/ucFekzc6", sep="\t",
names=["k", "wchi"], header=0)
We clean the dataset a bit, as it contains duplicates and a problematic point with zero wavenumber (infinite distance), and we subtract the mean so that the signal has zero mean:
raw = raw.drop_duplicates()
raw = raw.iloc[1:, :]
raw["wchi"] = raw["wchi"] - raw["wchi"].mean()
The signal is about:
As noticed by @NickODell, the signal is not equally sampled, which is a problem if you aim to perform FFT signal processing.
We can resample your signal to have equally spaced sampling:
N = 65536
k = np.linspace(raw["k"].min(), raw["k"].max(), N)
interpolant = interpolate.interp1d(raw["k"], raw["wchi"], kind="linear")
g = interpolant(k)
Note that, for performance reasons, the FFT returns the spectrum with the zero-frequency component at the borders (which is why your FFT plot does not look the way it is usually presented in books). This can be corrected with the classic fftshift method or with ad hoc indexing:
R = 2*np.pi*fft.fftfreq(N, np.diff(k)[0])[:N//2]
G = (1/N)*fft.fft(g)[0:N//2]
Mind the 2π factor which is involved in the units scaling of your transformation.
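To make the 2π factor concrete, here is a small sketch on a synthetic signal. It assumes a test oscillation of the form sin(R0 * k) with a made-up value `R0 = pi`, chosen so that the peak falls exactly on an FFT bin; none of these values come from the dataset.

```python
import numpy as np

N = 4096
k = np.linspace(0.0, 20.0, N, endpoint=False)  # uniformly spaced wavenumbers
dk = k[1] - k[0]

R0 = np.pi                # hypothetical "distance" of the test oscillation
g = np.sin(R0 * k)        # signal oscillating as sin(R * k)

G = np.fft.rfft(g) / N
cycles = np.fft.rfftfreq(N, d=dk)   # ordinary frequency, in cycles per unit k
R_axis = 2 * np.pi * cycles         # angular frequency, comparable with R in sin(R * k)

peak_index = np.argmax(np.abs(G))
peak_cycles = cycles[peak_index]
peak_R = R_axis[peak_index]

print(f"peak in cycles per unit k: {peak_cycles:.4f}")
print(f"peak after 2*pi scaling:   {peak_R:.4f} (R0 = {R0:.4f})")
```

Without the scaling, the peak sits at R0 / (2π) on the frequency axis, so any distance read off the plot would be too small by that factor.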
You also mentioned windowing (at least in a picture), but it is not applied anywhere in your code. This kind of filtering may help a lot in signal processing, as it filters out artifacts and unwanted noise. I leave it up to you.
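As a rough illustration of what windowing does, the sketch below compares the spectrum of a synthetic sinusoid with and without a Hann window (`np.hanning`). The choice of a Hann window and of 10.5 cycles is an assumption made for the example; the slides in the question do not say which window was used.

```python
import numpy as np

N = 1024
n = np.arange(N)
g = np.sin(2 * np.pi * 10.5 * n / N)   # 10.5 cycles: falls between two FFT bins

window = np.hanning(N)
spec_rect = np.abs(np.fft.rfft(g))            # no window (rectangular)
spec_hann = np.abs(np.fft.rfft(g * window))   # Hann window applied first

# Normalize each spectrum by its own peak, then look far away from the peak
leak_rect = spec_rect[200] / spec_rect.max()
leak_hann = spec_hann[200] / spec_hann.max()

print(f"relative leakage at bin 200, no window:   {leak_rect:.2e}")
print(f"relative leakage at bin 200, Hann window: {leak_hann:.2e}")
```

The window trades a slightly wider main peak for much lower leakage far from it, which tends to make distinct peaks easier to see.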
Least Square Spectral Analysis
An alternative way to process your signal comes from linear algebra: the periodogram of an irregularly sampled signal can be estimated by a method called Least Squares Spectral Analysis (LSSA).
You are looking for the square root of the periodogram of your signal, and scipy provides an easy way to compute it with the Lomb-Scargle method.
To do so, we simply create a frequency vector (in this case, the desired output distances) and perform the regression for each of those distances with respect to your signal:
Rhat = np.linspace(raw["R"].min(), raw["R"].max()*2, 5000)
Ghat = signal.lombscargle(raw["k"], raw["wchi"], freqs=Rhat, normalize=True)
Graphically it leads to:
Comparison
If we compare both methodologies, we can confirm that the major peaks definitely match.
LSSA gives a smoother curve, but do not assume it is more accurate, as this is a statistical smoothing of an interpolated curve. Anyway, it fits the bill for your requirement:
I only want to be able to reproduce the theoretically expected output
after the FFT. My main concern is to get rid of the weird output
signal and produce something that in some way resembles a curve with
somewhat distinct local maxima (as shown in the 4th picture).
Conclusions
I think you have enough information to process your signal, either by resampling and using the FFT or by using LSSA. Both methods have advantages and drawbacks.
Of course, this needs to be validated with well-known cases. Why not reproduce the analysis with the experimental data from the paper you are working from, to check that you can reconstruct the figures you posted?
You also need to dig into signal conditioning before performing post-processing (resampling, windowing, filtering).
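As a minimal example of validating against a known case, the sketch below runs `scipy.signal.lombscargle` on a synthetic, irregularly sampled sinusoid and checks that the peak lands near the value that was put in. The sampling range, noise level, and `R_true = 4.0` are made-up values for illustration only.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)

# Irregularly spaced sample positions (a stand-in for the k axis)
k = np.sort(rng.uniform(0.5, 15.0, size=400))

R_true = 4.0                                  # hypothetical known oscillation
y = np.sin(R_true * k) + 0.1 * rng.normal(size=k.size)
y = y - y.mean()                              # work with a zero-mean signal

# lombscargle takes angular frequencies, so no extra 2*pi factor is needed here
R_grid = np.linspace(0.1, 10.0, 2000)
pgram = signal.lombscargle(k, y, R_grid)

R_peak = R_grid[np.argmax(pgram)]
print(f"true R = {R_true}, recovered peak at R = {R_peak:.3f}")
```

If the recovered peak matches on synthetic data but not on the measured spectrum, the discrepancy is more likely to come from the signal conditioning than from the transform itself.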
I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression, the only output you get is a predicted value, which is why it is called regression, so for any regressor the probability of a prediction is not available. That only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
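One simple way to attach an interval to a point prediction is to size it from held-out residuals. The sketch below does this on synthetic data with a straight-line fit standing in for the real model. Strictly speaking it builds a prediction interval, and it assumes roughly normal errors with constant variance, which is exactly the homoscedasticity caveat above. All data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Synthetic data with a constant noise level (homoscedastic)."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = simulate(500)
x_cal, y_cal = simulate(500)      # held-out data used only to size the interval
x_test, y_test = simulate(2000)

# Any regressor could stand in here; a straight-line fit keeps the sketch short
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x):
    return slope * x + intercept

# Half-width of an approximate 95% interval from the held-out residuals
half_width = 1.96 * np.std(y_cal - predict(x_cal), ddof=1)

lower = predict(x_test) - half_width
upper = predict(x_test) + half_width
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"half-width = {half_width:.3f}, empirical coverage = {coverage:.3f}")
```

With heteroscedastic data a single half-width would be too wide in some regions and too narrow in others, so the residual spread would have to be estimated as a function of the input.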
You can probably try to get the distribution of your known outputs and compare the prediction on that curve, and check the pvalue. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)  # density replaces the removed normed argument
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2 # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolation method is not great for outliers. If a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are better ways to do it. If your data follow a normal law, it becomes trivial.
I suggest looking into NGBoost (a gradient-boosting library in the same family as XGBoost that outputs a full probabilistic model).
Here you can find slides on the Ngboost functioning and the seminal Ngboost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (the Gaussian distribution by default) and fit a gradient-boosted model to estimate the best parameters of that distribution ($\mu$ and $\sigma$ for the Gaussian). The model splits the feature space into regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you can use the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
I have a huge dataset with 271116 rows of data. I normalized the data using z-score normalization. I have no way of knowing whether the data actually follows a normal distribution, so I plotted a simple density graph using matplotlib:
hdf = df['Height'].plot(kind = 'kde', stacked = False)
plt.show()
I got this for a result:
Though, the data seems somewhat normal, can I apply the Central Limit Theorem where I take the means of different random samples (say, 10000 times) to get a smooth bell-curve?
Any help in python is appreciated, thanks.
Something like:
import numpy as np
sampleMeans = []
for _ in range(100000):
    samples = df['Height'].sample(n=100)
    sampleMean = np.mean(samples)
    sampleMeans.append(sampleMean)
#Now you have a list of sample means to plot - should be normally distributed
The mean of the distribution should equal the mean of the original data, and the standard deviation should be a factor of ten less than the original data. If the result isn't smooth enough, then increase .sample(n=100) to a higher figure. This will also decrease the standard deviation of the resulting bell curve. The general rule is that the CLT standard deviation is the data standard deviation divided by sqrt(n).
It's important to note that the resulting distribution is different from the original. It is not merely smoothed out using the CLT.
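The sqrt(n) rule can be checked numerically. The sketch below uses a synthetic exponential population as a hypothetical stand-in for df['Height'] and draws all samples at once with NumPy (with replacement, which makes no practical difference for n=100 out of 200000 values).

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in for df['Height']: a clearly non-normal population
population = rng.exponential(scale=2.0, size=200_000)

n = 100            # observations per sample
n_samples = 5_000  # number of sample means

# Draw all samples at once and average each row
sample_means = rng.choice(population, size=(n_samples, n)).mean(axis=1)

expected_std = population.std() / np.sqrt(n)
observed_std = sample_means.std()

print(f"population mean = {population.mean():.3f}, mean of sample means = {sample_means.mean():.3f}")
print(f"expected std = {expected_std:.4f}, observed std = {observed_std:.4f}")
```

The sample means follow a much narrower, nearly normal distribution even though the population itself is strongly skewed, which is the sense in which the result differs from the original data.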
I generated two distributions using the following code:
rand_num1 = 2*np.random.randn(10000) + 1
rand_num2 = 2*np.random.randn(10000) + 1
stats.ks_2samp(rand_num1, rand_num2)
My question is why these two distributions do not test as being the same according to the K-S test and the chi-square test.
When I run a kstest on the 2 distributions I get:
Ks_2sampResult(statistic=0.019899999999999973, pvalue=0.037606196570126725)
which implies that the two distributions are statistically different. I use the following code to plot the CDF of the two distributions:
count1, bins = np.histogram(rand_num1, bins = 100)
count2, _ = np.histogram(rand_num2, bins = bins)
plt.plot(np.cumsum(count1), 'g-')
plt.plot(np.cumsum(count2), 'b.')
This is how the CDF of two distributions looks.
When I run a chisquare test I get the following:
stats.chisquare(count1, count2) # Gives an nan output
stats.chisquare(count1+1, count2+1) # Outputs "Power_divergenceResult(statistic=180.59294741316694, pvalue=1.0484033143507713e-06)"
I have 3 questions below:
1. Even though the CDFs look the same and the data comes from the same distribution, why do the K-S test and the chi-square test both reject the same-distribution hypothesis? Is there an underlying assumption that I am missing here?
2. Some counts are 0 and hence the first chisquare() gives an error. Is it an accepted practice to just add a non-zero number to all counts to get a correct estimate?
3. Is there a kstest to test against non-standard distributions, say a normal with a non-zero mean and std != 1?
The CDF, in my humble opinion, is not a good curve to look at. Because it is an integral, it hides a lot of detail. Basically, an outlier in the distribution that is way below will be compensated by another outlier that is way above.
OK, let's take a look at the distribution of K-S results. I ran the test 100 times and plotted statistic vs p-value and, as expected, some runs produce (small p, large statistic) points.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
np.random.seed(12345)
x = []
y = []
for k in range(0, 100):
    rand_num1 = 2.0*np.random.randn(10000) + 1.0
    rand_num2 = 2.0*np.random.randn(10000) + 1.0
    q = stats.ks_2samp(rand_num1, rand_num2)
    x.append(q.statistic)
    y.append(q.pvalue)
plt.scatter(x, y, alpha=0.1)
plt.show()
Graph
UPDATE
In reality, if I run a test and see the test vs control distribution of my metric as shown in my plot, then I would want to be able to say that they are the same. Are there any statistics or parameters around these tests that can tell me how close these distributions are?
Of course there are, and you are already using one such test! K-S is the most general but also the weakest test. As with any test, there are ALWAYS cases where the test says the samples come from different distributions even though you deliberately drew them from the same routine. That is just the nature of things: you get a yes or no with some confidence, but not much more. Look at the graph again for an illustration.
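To put a number on "some confidence", the sketch below repeats the two-sample K-S test on pairs drawn from the same distribution and counts how often the p-value falls below 0.05. The trial count and sample size are arbitrary choices made to keep the run short.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n_trials = 200
alpha = 0.05
p_values = np.empty(n_trials)

for i in range(n_trials):
    a = 2.0 * rng.standard_normal(1000) + 1.0
    b = 2.0 * rng.standard_normal(1000) + 1.0   # same distribution as a
    p_values[i] = stats.ks_2samp(a, b).pvalue

# Fraction of trials in which identical distributions are declared different
false_positive_rate = np.mean(p_values < alpha)
print(f"fraction of p-values below {alpha}: {false_positive_rate:.3f}")
```

By construction, a test at the 5% level rejects a true null hypothesis in roughly 5% of runs, so a single p-value of 0.038 from two identical generators is unremarkable.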
Concerning your exercises with chi2, I was skeptical from the beginning about using chi2 for such a task. For me, given the problem of making a decision about two samples, the test used should be explicitly symmetric. K-S is OK, but looking at the definition of chi2, it is NOT symmetric. A simple modification of your code
count1, bins = np.histogram(rand_num1, bins = 40, range=(-2.,2.))
count2, _ = np.histogram(rand_num2, bins = bins, range=(-2.,2.))
q = stats.chisquare(count2, count1)
print(q)
q = stats.chisquare(count1, count2)
print(q)
produces something like
Power_divergenceResult(statistic=87.645335824746468, pvalue=1.3298580128472864e-05)
Power_divergenceResult(statistic=77.582358201839526, pvalue=0.00023275129585256563)
Basically, it means that the test may pass if you run (1,2) but fail if you run (2,1), which is not good, IMHO. Chi2 is OK with me as long as you test against expected values from a known distribution curve; there the asymmetry of the test makes sense.
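If a chi-square-style comparison of two binned samples is still wanted, one symmetric option is a test of homogeneity on the 2 x bins table of counts with `scipy.stats.chi2_contingency`. This is a different technique from the `chisquare` call above, offered only as a sketch; the binning and the samples are regenerated here with a fixed seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sample1 = 2.0 * rng.standard_normal(10_000) + 1.0
sample2 = 2.0 * rng.standard_normal(10_000) + 1.0

# Shared bin edges computed from the pooled data
edges = np.histogram_bin_edges(np.concatenate([sample1, sample2]), bins=40)
count1, _ = np.histogram(sample1, bins=edges)
count2, _ = np.histogram(sample2, bins=edges)

# Drop bins that are empty in both samples instead of adding 1 to every count
keep = (count1 + count2) > 0
table = np.vstack([count1[keep], count2[keep]])

chi2_12, p_12, dof, _ = stats.chi2_contingency(table)
chi2_21, p_21, _, _ = stats.chi2_contingency(table[::-1])   # samples swapped

print(f"order (1,2): chi2={chi2_12:.3f}, p={p_12:.4f}, dof={dof}")
print(f"order (2,1): chi2={chi2_21:.3f}, p={p_21:.4f}")
```

Swapping the two samples leaves the result unchanged, and bins with zero counts in one sample no longer produce nan. Sparse tail bins can still make the chi-square approximation unreliable, so merging them may be worth considering.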
I would advise trying the Anderson-Darling test, along these lines:
q = stats.anderson_ksamp([np.sort(rand_num1), np.sort(rand_num2)])
print(q)
But remember, it is the same as with K-S, some samples may fail to pass the test even if they are drawn from the same underlying distribution - this is just the nature of the beast.
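On the third question in the post (testing against a normal with non-zero mean and non-unit standard deviation): the one-sample `scipy.stats.kstest` accepts distribution parameters through `args`. A minimal sketch, assuming the reference parameters (mean 1, std 2) are known in advance rather than estimated from the sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = 2.0 * rng.standard_normal(10_000) + 1.0   # N(mean=1, std=2)

# args is forwarded to the normal CDF as (loc, scale)
res_matched = stats.kstest(sample, "norm", args=(1.0, 2.0))

# For comparison, the default reference is the standard normal N(0, 1)
res_standard = stats.kstest(sample, "norm")

print(f"against N(1, 2): statistic={res_matched.statistic:.4f}, p={res_matched.pvalue:.4f}")
print(f"against N(0, 1): statistic={res_standard.statistic:.4f}, p={res_standard.pvalue:.3g}")
```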
UPDATE: Some reading material
https://stats.stackexchange.com/questions/187016/scipy-chisquare-applied-on-continuous-data
I've closely followed this book (http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/MorePyMC.ipynb) but have found myself running into problems when trying to use Pymc for my own problem.
I've got a bunch of order values from customers who have placed an order and they look reasonably like a Gamma distribution. I'm running an AB test and want to see how the distribution of order values changes - enter Pymc. I was following the example in the book but found it didn't really work for me - first attempt was this:
import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
from pylab import savefig
## Replace these with the actual order values in the test set
## Have made slightly different to be able to see differing distributions
observations_A = pm.rgamma(3.5, 0.013, size=1000)
observations_B = pm.rgamma(3.45, 0.016, size=2000)
## Identical prior assumptions
prior_a = pm.Gamma('prior_a', 3.5, 0.015)
prior_b = pm.Gamma('prior_b', 3.5, 0.015)
## The difference in the test groups is the most important bit
@pm.deterministic
def delta(p_A = prior_a, p_B = prior_b):
    return p_A - p_B
## Add observations
observation_a = pm.Gamma('observation_a', prior_a, value=observations_A, observed=True)
observation_b = pm.Gamma('observation_b', prior_b, value=observations_A, observed=True)
mcmc = pm.MCMC([prior_a, prior_b, delta, observation_a, observation_b])
mcmc.sample(20000,1000)
Looking at the mean of the trace for prior_a and prior_b I see values of around 3.97/3.98 and when I look at the stats of these priors I see a similar story. However, upon defining the priors, calling the rand() method on the prior gives me the kind of values I would expect (between 100 and 400). Basically, one of the updating stages (I'm least certain about the observation stages) is doing something I don't expect.
Having struggled with this for a bit I found this page (http://matpalm.com/blog/2012/12/27/dead_simple_pymc/) and decided a different approach may be in order:
import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
from pylab import savefig
## Replace these with the actual order values in the test set
observations_A = pm.rgamma(3.5, 0.013, size=1000)
observations_B = pm.rgamma(3.45, 0.016, size=2000)
## Initial assumptions
A_Rate = pm.Uniform('A_Rate', 2, 4)
B_Rate = pm.Uniform('B_Rate', 2, 4)
A_Shape = pm.Uniform('A_Shape', 0.005, 0.05)
B_Shape = pm.Uniform('B_Shape', 0.005, 0.05)
p_A = pm.Gamma('p_A', A_Rate, A_Shape, value=observations_A, observed=True)
p_B = pm.Gamma('p_B', A_Rate, B_Shape, value=observations_B, observed=True)
## Sample
mcmc = pm.MCMC([p_A, p_B, A_Rate, B_Rate, A_Shape, B_Shape])
mcmc.sample(20000, 1000)
## Plot the A_Rate, B_Rate, A_Shape, B_Shape
## Using those, determine the Gamma distribution
## Plot both - and draw 1000000... samples from each.
## Perform statistical tests on these.
So instead of going straight for the Gamma distribution, we're looking to find the parameters (I think). This seems to work a treat in that it gives me values in the traces of the right order of magnitude. However, now I can plot a histogram of samples for alpha for both test groups and for beta but that's not really what I'm after. I want to be able to plot each of the test group's 'gamma-like' distributions, calculated from a prior and the values I supply. I also want to be able to plot a 'delta' as the AB testing example shows. I feel a deterministic variable on the second example is going to be my best bet but I don't really know the best way to go about constructing this.
Long story short - I've got data drawn from a Gamma distribution that I'd like to AB test. I've got a gamma prior view of the data, though could be persuaded that I've got a normal prior view if that's easier. I'd like to update identical priors with the data I've collected, in a sensible way, and plot the distributions and the difference between them.
Cheers,
Matt
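One way to get at the "delta" described in the question without PyMC is a grid approximation of the posterior over the Gamma shape and rate. This is a swapped-in technique, not the PyMC approach from the book, and only a sketch. It assumes the same uniform prior ranges as the second attempt, uses simulated order values, and the helper name `posterior_draws` is hypothetical.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(8)

# Simulated order values, mirroring the (shape, rate) values used in the question
obs_A = rng.gamma(shape=3.5, scale=1 / 0.013, size=1000)
obs_B = rng.gamma(shape=3.45, scale=1 / 0.016, size=2000)

# Grids covering the same uniform prior ranges as the second attempt
shapes = np.linspace(2.0, 4.0, 201)[:, None]     # column vector
rates = np.linspace(0.005, 0.05, 451)[None, :]   # row vector

def posterior_draws(data, n_draws=20_000):
    """Draw (shape, rate) pairs from a grid approximation of the posterior."""
    n, sum_x, sum_log = data.size, data.sum(), np.log(data).sum()
    # Gamma log-likelihood on the whole grid; uniform priors only add a constant
    log_post = (n * (shapes * np.log(rates) - gammaln(shapes))
                + (shapes - 1.0) * sum_log - rates * sum_x)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    idx = rng.choice(post.size, size=n_draws, p=post.ravel())
    i, j = np.unravel_index(idx, post.shape)
    return shapes[i, 0], rates[0, j]

shape_A, rate_A = posterior_draws(obs_A)
shape_B, rate_B = posterior_draws(obs_B)

# Posterior of the mean order value (shape / rate) per group, and the difference
mean_A = shape_A / rate_A
mean_B = shape_B / rate_B
delta = mean_A - mean_B
prob_A_greater = np.mean(delta > 0)

print(f"posterior mean order value: A={mean_A.mean():.1f}, B={mean_B.mean():.1f}")
print(f"P(mean_A > mean_B) = {prob_A_greater:.3f}")
```

A histogram of `delta` plays the role of the delta plot in the A/B testing example, and the paired shape and rate draws can be fed to a Gamma density to plot each group's fitted distribution.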