Using a t-test while increasing sample size - Python

I have a df with different features. I will focus on one feature here, called 'x':
count 2152.000000
mean 95.162587
std 0.758480
min 92.882304
25% 94.648659
50% 95.172078
75% 95.648485
max 97.407068
I want to perform a t-test on my df while I sample data out of the df, to see the effect of the sample size. I expect the p-value to saturate after a certain number of samples. Therefore I loop over the sample size for a specific random_state:
ttest_pull = []
for N in np.arange(1, 2153, 1):
    pull = helioPosition.sample(N, random_state=140)
    ttest_pull.append(stats.ttest_ind(df['x'], pull['x'])[1])
The distribution of 'x' is normal:
When I plot the p-value of the t-test against the sample size, I get the following plot:
Is there a mistake in my code or method? I would expect a better (smaller) p-value with a higher sample size, but this is not true for every sample size. How can a sample size of ~1500 be worse than a sample size of ~450?

pull is sampled from the same data, i.e. the second sample is a random sample from the same population and the two samples have the same mean (expected value).
p-values are uniformly distributed on the interval [0, 1] when the null hypothesis is true, which is the case here. This is independent of the sample size, so we expect to see fluctuations or randomness in the p-values of the tests.
However, in this case you do not have two independent samples which is the underlying assumption of the t-test. As far as I understand your code, in the limit as N becomes large the second sample will include the entire "population" and be identical to the first sample. In that case the p-value will go to one because you are comparing two essentially identical samples.
If sample drew with replacement, then you would essentially be comparing a bootstrap sample with the "population", i.e. two samples with the same expected value and very high correlation. The p-value of a standard t-test would then be high but still a random number.
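As a quick illustration of the uniformity point, here is a minimal sketch (not the poster's data; two genuinely independent samples, using the mean and std from the describe() output above, are assumed):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvals = []
for _ in range(5000):
    # two independent samples from the same normal distribution
    a = rng.normal(95.16, 0.76, size=200)
    b = rng.normal(95.16, 0.76, size=200)
    pvals.append(stats.ttest_ind(a, b)[1])

# under the null hypothesis the p-values are roughly uniform on [0, 1]:
# about 5% of them fall below 0.05, regardless of the sample size
print(np.mean(np.array(pvals) < 0.05))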

Just to add to the answer above: what you are referring to is power, basically how many false negatives you get for a given effect and sample size. In your case the effect is zero, since both samples come from the same distribution, and note that you ran only one test per sample size, which means your p-values are essentially draws from a uniform distribution.
What you need is, first, a difference between the two distributions, and second, to perform the test repeatedly and count the number of rejections. See the example below:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
import seaborn as sns
df = pd.DataFrame({'x': np.random.normal(0, 2, 150),
                   'y': np.random.normal(1, 2, 150)})
Now we have two columns with different means. We go through the sampling with different sample sizes:
def subsampletest(da, N):
    pull = da.sample(N)
    return ttest_ind(pull['x'], pull['y'])[1]

sampleSize = np.arange(5, 150, step=5)
results = np.array([[subsampletest(df, x) for x in sampleSize] for B in range(100)])
The proportion of rejections at an alpha of 0.05 (out of 100 repetitions), per sample size, is simply:
rejections = np.mean(results<0.05,axis=0)
sns.lineplot(x=sampleSize,y=rejections)

Related

Chi Squared Analysis on Data sets that don't have matching frequencies

I have 15 data sets each of which I have fitted with a curve. Now I am trying to determine the quality of fit by doing a chi-squared test; however, when I run my code:
chi, p_value = stats.chisquare(n, y)
where n is the actual data and y is the predicted data, I get the error
For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.1350785306607008
I can't seem to understand why they have to add up to the same total - are there any ways I can run a chi-squared test without muddling my data?
This chi-squared test for goodness of fit indeed requires the sums of both inputs to be (almost) the same. So, if you want to check whether your model fits the observations n well, you have to adjust the counts y of your model as described e.g. here. This could be done with a small wrapper:
from scipy.stats import chisquare
import numpy as np

def cs(n, y):
    return chisquare(n, np.sum(n) / np.sum(y) * y)
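A quick usage sketch (the observed and fitted counts below are made up, just to show the call):

# hypothetical observed counts and model predictions whose sums differ
n = np.array([18, 25, 31, 22, 14])
y = np.array([15.2, 24.8, 33.1, 23.9, 12.5])

chi, p_value = cs(n, y)  # y is rescaled so that sum(y) == sum(n)
print(chi, p_value)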
Another possibility would be to go for R and use chisq.test.

Creating vector with intervals drawn from Poisson process

I'm looking for some advice on how to implement some statistical models in Python. I'm interested in constructing a sequence of z values (z_1,z_2,z_3,...,z_n) where the number of jumps in an interval (z_1,z_2] is distributed according to the Poisson distribution with parameter lambda(z_2-z_1)
and the numbers of random jumps over disjoint intervals are independent random variables. I want my piecewise constant plot to look something like the two images below, where the y axis is Y(z), where Y(z) consists of N(0,1) random variables in each interval say.
To construct the z data, what would be the best way to tackle this? I have tried sampling values via np.random.poisson and then taking a cumulative sum, but the values drawn are repeated for small intensity values. Any help or thoughts would be really appreciated. Thanks.
np.random.poisson samples the count of events that occurred in [z_i, z_j). If you want to sample the events as they occur, you just want the exponential distribution for the inter-arrival times. For example:
import numpy as np
n = 50
z = np.cumsum(np.random.exponential(1/n, size=n))
y = np.random.normal(size=n)
Plotting these (using step in matplotlib) gives something similar to your plots:
Note that the 1/n sets the "lambda", so on average we expect n points within [0, 1]. In this case we got slightly fewer, so the sequence overshoots 1. Feel free to rescale if that's important to you.
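A minimal sketch of that step plot, assuming matplotlib is installed (the jump locations z and levels y are generated exactly as above):

import numpy as np
import matplotlib.pyplot as plt

n = 50
z = np.cumsum(np.random.exponential(1/n, size=n))  # jump locations
y = np.random.normal(size=n)                       # level on each interval

# piecewise-constant plot: y[i] holds from z[i] until the next jump
plt.step(z, y, where='post')
plt.xlabel('z')
plt.ylabel('Y(z)')
plt.show()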

How do we apply the Central Limit Theorem using python?

I've a huge dataset with 271116 rows of data. I normalized the data using the z-score normalization method. I've no idea of knowing if the data actually follows a normal distribution. So I plotted a simple density graph using matplotlib:
hdf = df['Height'].plot(kind = 'kde', stacked = False)
plt.show()
I got this for a result:
Though the data seems somewhat normal, can I apply the Central Limit Theorem, where I take the means of different random samples (say, 10000 of them), to get a smooth bell curve?
Any help in python is appreciated, thanks.
Something like:
import numpy as np

sampleMeans = []
for _ in range(100000):
    samples = df['Height'].sample(n=100)
    sampleMean = np.mean(samples)
    sampleMeans.append(sampleMean)

# Now you have a list of sample means to plot - should be normally distributed
The mean of the distribution should equal the mean of the original data, and the standard deviation should be a factor of ten smaller than that of the original data (here n=100 and sqrt(100) = 10). If the result isn't smooth enough, increase .sample(n=100) to a higher figure; this will also decrease the standard deviation of the resulting bell curve. The general rule is that the CLT standard deviation is the data standard deviation divided by sqrt(n).
It's important to note that the resulting distribution is different from the original. It is not merely smoothed out using the CLT.
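For illustration, a minimal sketch of plotting those sample means as a density curve, mirroring the kde plot from the question (assuming matplotlib and scipy are available):

import matplotlib.pyplot as plt
import pandas as pd

# density plot of the sample means - should look like a smooth bell curve
pd.Series(sampleMeans).plot(kind='kde')
plt.show()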

Plotting confidence intervals for Maximum Likelihood Estimate

I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
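For reference, a minimal sketch of how this running duplicate count x could be computed from such a sequence (the helper name is mine, the book labels are the ones from the example):

def duplicate_counts(books):
    # running count of weeks on which the received book had already been read
    seen, counts, dup = set(), [], 0
    for b in books:
        if b in seen:
            dup += 1
        seen.add(b)
        counts.append(dup)
    return counts

print(duplicate_counts(['A', 'B', 'A', 'C', 'B', 'B', 'D']))  # [0, 0, 1, 1, 2, 3, 3]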
To visualise the likelihood function (assuming I have understood what one is correctly) I have written the following code which I believe plots the likelihood function. The maximum is around 135, which is indeed the maximum likelihood estimate according to the math.stackexchange link above.
from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np
# N is the true number of books. t is the number of weeks. unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t):
    return t - len(set([random.randint(0, N) for i in xrange(t)]))
iters = 1000
ydata = []
for N in xrange(10, 500):
    sampledunk = [numberrepeats(N, t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk / iters)
print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
The output looks like this:
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis such that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer @mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
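For instance, a minimal sketch of the rolling-window idea applied to the xdata/ydata computed in the question (the window length of 10 is an arbitrary choice):

window = 10
kernel = np.ones(window) / window

# simple moving average of the likelihood curve; mode='same' keeps the length
ysmooth = np.convolve(ydata, kernel, mode='same')
plt.plot(xdata, ysmooth)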
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t, iters):
    # one row per trial, each row holding the t draws of that trial
    rand = np.random.randint(0, N, size=(iters, t))
    return t - np.array([len(set(r)) for r in rand])
iters = 1000
ydata = np.empty(500-10)
for N in xrange(10, 500):
    sampledunk = np.count_nonzero(numberrepeats(N, t, iters) == unk)
    ydata[N - 10] = sampledunk / iters
print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
A simple (numerical) way to get a confidence interval is to run your script many times and see how much your estimate varies. You can then use that standard deviation to calculate the confidence interval.
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np

t = 30
k = 3

def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == 3).mean(-1)
    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')
    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)
    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:, mask])
    std = estimates.std()
    sterr = std * np.sqrt(sample_frac)  # is this mathematically sound?
    ci = (mean - 1.96 * sterr, mean + 1.96 * sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set than in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might be able to help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure on this.
Here is an answer to your first question and a pointer to a solution for the second:
import numpy as np
import matplotlib.pyplot as plt

plt.plot(xdata, ydata)

# calculate the cumulative distribution function
cdf = np.cumsum(ydata) / np.sum(ydata)

# get the left and right boundary of the interval that contains 95% of the probability mass
right = np.argmax(cdf > 0.975)
left = np.argmax(cdf > 0.025)

# indicate confidence interval with vertical lines
plt.vlines(xdata[left], 0, ydata[left])
plt.vlines(xdata[right], 0, ydata[right])

# hatch confidence interval
plt.fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
This produces the following figure:
I'll try to answer question 3 when I have more time :)

Pseudoexperiments in PyMC

Is it possible to perform "pseudoexperiments" using PyMC?
By pseudoexperiments, I mean generating random "observations" by sampling from the prior, and then, given each pseudoexperiment, drawing samples from the posterior. Afterwards, one would compare the trace for each parameter to the sample (obtained from the prior) used in sampling from the posterior.
A more concrete example: Suppose that I want to know the rate of process X. I count how many occurrences there are in a certain period of time. However, I know that process Y also sometimes occurs and will contaminate my count. The rate of process Y is known with some uncertainty. So, I build a model, include my observations, and sample from the posterior:
import pymc
class mymodel:
    rate_x = pymc.Uniform('rate_x', lower=0, upper=100)
    rate_y = pymc.Normal('rate_y', mu=150, tau=1./(15**2))
    total_rate = pymc.LinearCombination('total_rate', [1, 1], [rate_x, rate_y])
    data = pymc.Poisson('data', mu=total_rate, value=193, observed=True)
Mod = pymc.Model(mymodel)
MCMC = pymc.MCMC(Mod)
MCMC.sample(100000, burn=5000, thin=5)
print MCMC.stats()['rate_x']['quantiles']
However, before I do my experiment (or before I "unblind" my analysis and look at my data), I would like to know how sensitive I expect to be -- what will be the uncertainty on my measurement of rate_x?
To answer this, I could sample from the prior
Mod.draw_from_prior()
but this only samples rate_x, rate_y, and calculates total_rate. But once the values of those are set by draw_from_prior(), I can draw a pseudoexperiment:
Mod.data.random()
This just returns a number, so I have to set the value of Mod.data to a random sample. Because Mod.data has the observed flag set, I have to also "force" it:
Mod.data.set_value(Mod.data.random(), force=True)
Now I can sample from the posterior again
MCMC.sample(100000, burn=500, thin=5)
print MCMC.stats()['rate_x']['quantiles']
All this works, so I suppose the simple answer to my question is "yes". But it feels very hacky. Is there a better or more natural way to accomplish this?
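For reference, a minimal sketch of how those steps could be strung together into a loop of pseudoexperiments (same model objects and PyMC 2 calls as above; the number of pseudoexperiments and the collected statistic are arbitrary choices, and this is not necessarily any less hacky):

n_pseudo = 20
pseudo_quantiles = []
for _ in range(n_pseudo):
    # draw rate_x and rate_y (and hence total_rate) from the prior
    Mod.draw_from_prior()
    # generate a pseudo-observation and force it onto the observed node
    Mod.data.set_value(Mod.data.random(), force=True)
    # sample from the posterior given this pseudo-observation
    MCMC.sample(100000, burn=5000, thin=5)
    pseudo_quantiles.append(MCMC.stats()['rate_x']['quantiles'])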
