I am re-learning introductory statistics and wanted to try implementing my own versions of the general and unpooled formulas that find the t-value. I implemented it in two ways: one simply replicates the formulas as Python functions; the other uses Python's ability to generate a normal distribution and uses that to find the difference in means. But I noticed the values were quite different between the two versions. So my question is: why is there a difference? Is it something in how the function itself works?
Here's the "generate a distribution itself" method:
from numpy.random import seed
from numpy.random import normal
from scipy import stats
from datetime import datetime
import math
#Plan: Generate 2 random normal distributions with the desired criteria, and t-test them
data1 = normal(loc=65.2, scale=7.8, size=30)
data2 = normal(loc=70.3, scale=8.4, size=30)
stats.ttest_ind(a=data1, b=data2)
Ttest_indResult(statistic=-2.029830829733737, pvalue=0.04696953433513939)
As you can see, it gives a T statistic of ~-2.0298 and a p value of ~ 0.0470.
Here's my "manual version":
def pop_2_mean_pooled_t(mean1, mean2, s1, s2, n1, n2):
    dof = (n1 + n2) - 2
    mean_diff = mean1 - mean2
    # The N part on the right
    right_n = math.sqrt((1/n1) + (1/n2))
    # The Sp part
    sp_numerator_left = ((n1-1)*(s1**2))
    sp_numerator_right = ((n2-1)*(s2**2))
    sp = math.sqrt((sp_numerator_left + sp_numerator_right)/dof)
    pooled_sp = sp*right_n
    t = mean_diff/pooled_sp
    p = stats.t.cdf(t, dof)
    print("T is " + str(t))
    print("p is " + str(p))
    return t, p
pop_2_mean_pooled_t(65.2, 70.3, 7.8, 8.4, 30, 30)
T is -2.4368742610942298
p is 0.00895208222413155
(-2.4368742610942298, 0.00895208222413155)
As you can see, it gives a T statistic of ~-2.439 and a p value of ~ 0.009.
My question is: why is there a discrepancy here? My "manual version" is closer to the example I was referencing, but surely the generated one should be close as well?
My understanding is that if a sample is sufficiently large, it should resemble a normal distribution. Therefore, one could generate a normal distribution in code and use it to approximate the corresponding t-values. For some reason, that differed quite a bit from my "manual" version.
Your thinking is basically correct (I did not check your formulae though). What you're encountering is in the nature of the problem: the two random samples you're drawing are, well, random, and they differ from run to run, so you will always get a different p-value and t-statistic.
Two suggestions from me:
increase the sample size in the first snippet to hundreds (not 30): you should get much closer to the stats from the second snippet.
keep 30 samples in the first snippet but run the simulation several times (a minimal sketch is shown after this list); you will learn the distributions of p-values and t-statistics and, again, you can check the values from your second snippet against the simulated distributions.
(Some conceptual flaws occur in this approach, e.g. repeated testing affects the p-value, but let us put them aside for now; the goal is to see your two sets of values converge.)
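As a minimal sketch of that second suggestion (the repetition count of 5000 is an arbitrary choice), you could repeat the 30-sample experiment many times and look at the spread of the simulated t-statistics and p-values; your manually computed values should fall well inside that spread:
import numpy as np
from numpy.random import normal
from scipy import stats

t_stats, p_values = [], []
for _ in range(5000):  # arbitrary number of repetitions
    # same two populations as in the first snippet
    data1 = normal(loc=65.2, scale=7.8, size=30)
    data2 = normal(loc=70.3, scale=8.4, size=30)
    t, p = stats.ttest_ind(a=data1, b=data2)
    t_stats.append(t)
    p_values.append(p)

print("t-statistic: mean %.3f, std %.3f" % (np.mean(t_stats), np.std(t_stats)))
print("fraction of runs with p < 0.05: %.3f" % np.mean(np.array(p_values) < 0.05))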
I have run two different meta-heuristic algorithms 25 times each and I want to check which algorithm's results are better. I decided to use the Wilcoxon rank-sum test, but I could not understand the output of the given function:
import numpy as np
from scipy.stats import ranksums
rng = np.random.default_rng()
sample1 = rng.uniform(-1, 1, 200)
sample2 = rng.uniform(-0.5, 1.5, 200)
ranksums(sample1, sample2)
Output: RanksumsResult(statistic=-7.887059, pvalue=3.09390448e-15)
ranksums(sample1, sample2, alternative='less')
Output: RanksumsResult(statistic=-7.750585297581713, pvalue=4.573497606342543e-15)
ranksums(sample1, sample2, alternative='greater')
Output: RanksumsResult(statistic=-7.750585297581713, pvalue=0.9999999999999954)
How can I understand which sample is better than the other? I think sample2 is better, because in the second output the p-value is lower than 0.05 and the alternative parameter is "less". Can anyone explain this code and output?
The test does not tell you which sample is better; that's up to you to decide.
From the documentation:
The Wilcoxon rank-sum test tests the null hypothesis that two sets of measurements are drawn from the same distribution.
The test only answers the question: are the samples "statistically different"?
If you have no prior expectation which sample should be better just use the default two-sided alternative hypothesis.
Once you are confident they are different (p-value below a pre-determined threshold), you can proceed to determine which sample is better. How to define "better" depends on the metric you are comparing (e.g. errors should be smaller, scores should be larger, ...).
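For illustration, here is a minimal sketch assuming "better" means smaller error values; errors_a and errors_b are hypothetical arrays standing in for the 25 results of each algorithm (random placeholders here, not your data):
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng()
errors_a = rng.normal(loc=0.10, scale=0.02, size=25)  # placeholder results for algorithm A
errors_b = rng.normal(loc=0.15, scale=0.02, size=25)  # placeholder results for algorithm B

# two-sided: are the two sets of results statistically different?
print(ranksums(errors_a, errors_b))

# one-sided: is A's error distribution shifted towards smaller values than B's?
print(ranksums(errors_a, errors_b, alternative='less'))
A small p-value in the one-sided test supports the directional claim you chose in advance; the test itself still does not decide what "better" means.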
How can I optimize a function with fixed steps? I have developed a function that takes five thresholds as inputs, and I want to optimize them. I tried different solvers, but the steps the solvers take are so tiny that the function never converges to a good solution.
The thresholds vary from 0 to 1, and I want them to move in steps of 0.01. For example, in the case of threshold_0, I want it to go from the initial guess of 0.6 to 0.61 or 0.59, etc., depending on the error result.
from scipy import optimize

initial_guess = [0.6, 0.3, 0.6, 0.5, 0.5]

def get_sobel3d_accuracy_from_thresholds(thresholds, array_dicts, ponderation_dict):
    ...
    return error

result = optimize.minimize(
    get_sobel3d_accuracy_from_thresholds,  # function to optimize
    initial_guess,
    args=(array_dicts, ponderation_dict),  # extra fixed args
    method='nelder-mead',
    options={'xatol': 1e-8, 'disp': True})
What I want to get is a solution that minimizes the error returned from the function get_sobel3d_accuracy_from_thresholds, as follows:
optimized_thresholds = [0.61, 0.3, 0.81, 0.52, 0.44]
I would also like to fix boundaries for the thresholds from 0 to 1, but I think that can only be done with some solvers, right?
bounds = [(0, 1) for n in range(0,5)]
Thank you all.
Is there a function in numpy/scipy that lets you sample a multinomial from a vector of small log probabilities without losing precision? For example:
# sample element randomly from these log probabilities
l = [-900, -1680]
the naive method fails because of underflow:
import scipy
import numpy as np
# this makes a all zeroes
a = np.exp(l) / scipy.misc.logsumexp(l)
r = np.random.multinomial(1, a)
this is one attempt:
def s(l):
    m = np.max(l)
    norm = m + np.log(np.sum(np.exp(l - m)))
    p = np.exp(l - norm)
    return np.where(np.random.multinomial(1, p) == 1)[0][0]
is this the best/fastest method and can np.exp() in the last step be avoided?
First of all, I believe the problem you're encountering is because you're normalizing your probabilities incorrectly. This line is incorrect:
a = np.exp(l) / scipy.misc.logsumexp(l)
You're dividing a probability by a log probability, which makes no sense. Instead you probably want
a = np.exp(l - scipy.misc.logsumexp(l))
If you do that, you find a = [1, 0] and your multinomial sampler works as expected up to floating point precision in the second probability.
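As a quick sanity check of the corrected normalization (a minimal sketch; note that in current SciPy the function lives in scipy.special rather than scipy.misc):
import numpy as np
from scipy.special import logsumexp  # scipy.misc.logsumexp in older SciPy versions

l = np.array([-900.0, -1680.0])
a = np.exp(l - logsumexp(l))        # subtract in log space, then exponentiate
print(a)                            # [1. 0.]  -- the second entry underflows to zero
print(np.random.multinomial(1, a))  # [1 0]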
A Solution for Small N: Histograms
That said, if you still need more precision and performance is not as much of a concern, one way you could make progress is by implementing a multinomial sampler from scratch, and then modifying this to work at higher precision.
NumPy's multinomial function is implemented in Cython, and essentially performs a loop over a number of binomial samples and combines them into a multinomial sample.
You can call it like this:
np.random.multinomial(10, [0.1, 0.2, 0.7])
# [0, 1, 9]
(Note that the precise output values here & below are random, and will change from call to call).
Another way you might implement a multinomial sampler is to generate N uniform random values, then compute the histogram with bins defined by the cumulative probabilities:
def multinomial(N, p):
    rand = np.random.uniform(size=N)
    p_cuml = np.cumsum(np.hstack([[0], p]))
    p_cuml /= p_cuml[-1]
    return np.histogram(rand, bins=p_cuml)[0]
multinomial(10, [0.1, 0.2, 0.7])
# [1, 1, 8]
With this method in mind, we can think about doing things to higher precision by keeping everything in log-space. The main trick is to realize that the log of uniform random deviates is equivalent to the negative of exponential random deviates, and so you can do everything above without ever leaving log space:
def multinomial_log(N, logp):
    log_rand = -np.random.exponential(size=N)
    logp_cuml = np.logaddexp.accumulate(np.hstack([[-np.inf], logp]))
    logp_cuml -= logp_cuml[-1]
    return np.histogram(log_rand, bins=logp_cuml)[0]
multinomial_log(10, np.log([0.1, 0.2, 0.7]))
# [1, 2, 7]
The resulting multinomial draws will maintain precision even for very small values in the p array.
Unfortunately, these histogram-based solutions will be much slower than the native numpy.multinomial function, so if performance is an issue you may need another approach. One option would be to adapt the Cython code linked above to work in log-space, using similar mathematical tricks as I used here.
A Solution for Large N: Poisson Approximation
The problem with the above solution is that as N grows large, it becomes very slow.
I was thinking about this and realized there's a more efficient way forward, despite np.random.multinomial failing for probabilities smaller than 1E-16 or so.
Here's an example of that failure: on a 64-bit machine, this will always give zero for the first entry because of the way the code is implemented, when in reality it should give something near 10:
np.random.multinomial(1E18, [1E-17, 1])
# array([ 0, 1000000000000000000])
If you dig into the source, you can trace this issue to the binomial function upon which the multinomial function is built. The cython code internally does something like this:
def multinomial_basic(N, p, size=None):
    results = np.array([np.random.binomial(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)
multinomial_basic(1E18, [1E-17, 1])
# array([ 0, 1000000000000000000])
The problem is that the binomial function chokes on very small values of p – this is because the algorithm computes the value (1 - p), so the value of p is limited by floating-point precision.
So what can we do? Well, it turns out that for small values of p, the Poisson distribution is an extremely good approximation of the binomial distribution, and the implementation doesn't have these issues. So we can build a robust multinomial function based on a robust binomial sampler that switches to a Poisson sampler at small p:
def binomial_robust(N, p, size=None):
    if p < 1E-7:
        return np.random.poisson(N * p, size)
    else:
        return np.random.binomial(N, p, size)

def multinomial_robust(N, p, size=None):
    results = np.array([binomial_robust(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)

multinomial_robust(1E18, [1E-17, 1])
# array([ 12, 999999999999999988])
The first entry is nonzero and near 10 as expected! Note that we can't use N larger than 1E18, because it will overflow the long integer.
But we can confirm that our approach works for smaller probabilities using the size parameter, and averaging over results:
p = [1E-23, 1E-22, 1E-21, 1E-20, 1]
size = int(1E6)
multinomial_robust(1E18, p, size).mean(0)
# array([ 1.70000000e-05, 9.00000000e-05, 9.76000000e-04,
# 1.00620000e-02, 1.00000000e+18])
We see that even for these very small probabilities, the multinomial values are turning up in the right proportion. The result is a very robust and very fast approximation to the multinomial distribution for small p.
I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
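To make the bookkeeping concrete, here is a small sketch (using the book labels from the example above) that computes the running repeat count x from the received sequence:
books = ['A', 'B', 'A', 'C', 'B', 'B', 'D']
seen = set()
repeats = 0
x = []  # cumulative number of repeat deliveries after each week
for b in books:
    if b in seen:
        repeats += 1
    else:
        seen.add(b)
    x.append(repeats)
print(x)  # [0, 0, 1, 1, 2, 3, 3]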
To visualise the likelihood function (assuming I have understood what one is correctly) I have written the following code which I believe plots the likelihood function. The maximum is around 135 which is indeed the maximum likelihood estimate according to the MSE link above.
from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np

# N is the true number of books. t is the number of weeks. unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t):
    return t - len(set([random.randint(0, N) for i in xrange(t)]))

iters = 1000
ydata = []
for N in xrange(10, 500):
    sampledunk = [numberrepeats(N, t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk / iters)

print "MLE is", np.argmax(ydata)

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata, ydata)
plt.show()
The output is a plot of the estimated likelihood against N (figure omitted); its maximum is around 135.
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis so that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer #mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
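As a minimal sketch of the rolling-window option (the window length of 10 is arbitrary, and the random ydata below merely stands in for the noisy likelihood curve from your script), this is the same boxcar idea that the likelihood function further down uses:
import numpy as np

# stand-in for the noisy likelihood curve computed in the question's script
ydata = np.exp(-(np.arange(490) - 125.0) ** 2 / 5000.0) + 0.05 * np.random.rand(490)

n = 10  # arbitrary window length
ysmooth = np.convolve(ydata, np.ones(n) / n, mode='same')
curve_fit, by contrast, would need you to supply a parametric form for the curve and would return fitted parameters rather than a smoothed array.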
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t, iters):
    # one row per simulated run: t draws of a book index
    rand = np.random.randint(0, N, size=(iters, t))
    return t - np.array([len(set(r)) for r in rand])

iters = 1000
ydata = np.empty(500 - 10)
for N in xrange(10, 500):
    sampledunk = np.count_nonzero(numberrepeats(N, t, iters) == unk)
    ydata[N - 10] = sampledunk / iters
print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
A simple (numerical) way to get a confidence interval is to run your script many times and see how much your estimate varies. You can use that standard deviation to calculate the confidence interval.
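A minimal sketch of that idea (the repeat counts are placeholders chosen to keep it quick, and it re-implements the question's simulation compactly rather than calling the full script):
import numpy as np

t, unk, iters = 30, 3, 200   # fewer iterations than the question, to keep this quick

def mle_once():
    Ns = np.arange(10, 500)
    likelihood = [np.mean([t - len(np.unique(np.random.randint(0, N, size=t))) == unk
                           for _ in range(iters)])
                  for N in Ns]
    return Ns[np.argmax(likelihood)]

estimates = np.array([mle_once() for _ in range(10)])     # 10 independent re-runs
mean, std = estimates.mean(), estimates.std()
print((mean - 1.96 * std, mean + 1.96 * std))             # rough 95% interval for the estimator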
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np

t = 30
k = 3

def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == k).mean(-1)
    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')
    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)
    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:, mask])
    std = estimates.std()
    sterr = std * np.sqrt(sample_frac)  # is this mathematically sound?
    ci = (mean - 1.96 * sterr, mean + 1.96 * sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set as in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better to compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure about this.
Here is an answer to your first question and a pointer to a solution for the second:
import numpy as np
import matplotlib.pyplot as plt

# xdata and ydata come from the question's script
plt.plot(xdata, ydata)

# calculate the (normalised) cumulative distribution function
cdf = np.cumsum(ydata) / sum(ydata)

# get the left and right boundary of the interval that contains 95% of the probability mass
right = np.argmax(cdf > 0.975)
left = np.argmax(cdf > 0.025)

# indicate confidence interval with vertical lines
plt.vlines(xdata[left], 0, ydata[left])
plt.vlines(xdata[right], 0, ydata[right])

# hatch confidence interval
plt.fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
plt.show()
This produces a plot of the likelihood curve with the 95% interval marked by vertical lines and shaded in between (figure omitted).
I'll try to answer question 3 when I have more time :)