Binomial test in Python vs R

I am trying to re-implement in Python a binomial test initially developed in R. However, I am not sure that I am using the right functionality.
In R, I get:
> binom.test (2, 8, 11/2364, alternative = "greater")
0.25
With Python & SciPy, I use
from scipy.stats import binom
binom.sf(2, 8, float(11)/float(2364))
5.5441613055814931e-06
In fact, I have to write binom.sf(2, 8, float(11)/float(2364)) to make sure the third parameter is not 0 because of integer division.
Why do the values differ? Do I have to specify the moments for SciPy's binom.sf?
Should I use some other library?

Here's what I get in R:
> binom.test(2, 8, 11/2364, alternative = "greater")
Exact binomial test
data: 2 and 8
number of successes = 2, number of trials = 8, p-value = 0.0005951
alternative hypothesis: true probability of success is greater than 0.00465313
95 percent confidence interval:
0.04638926 1.00000000
sample estimates:
probability of success
0.25
>
Note that the p-value is 0.0005951.
Compare that to the result of scipy.stats.binom_test (which returns just the p-value):
In [25]: from scipy.stats import binom_test
In [26]: binom_test(2, 8, 11/2364, alternative='greater')
Out[26]: 0.00059505960517880572
So that agrees with R.
To use the survival function of scipy.stats.binom, you have to adjust the first argument (as noted in a comment by Marius):
In [27]: from scipy.stats import binom
In [28]: binom.sf(1, 8, 11/2364)
Out[28]: 0.00059505960517880572
(I am using Python 3, so 11/2364 equals 0.004653130287648054. If you are using Python 2, be sure to write that fraction as 11.0/2364 or float(11)/2364.)
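As a quick cross-check (a sketch; binom_test is the name used above, while newer SciPy releases expose the same test as scipy.stats.binomtest), the survival function, the CDF, and the exact test all give the same one-sided p-value:
from scipy.stats import binom, binom_test
p = 11 / 2364
# One-sided "greater" test of k = 2 successes in n = 8 trials:
# P(X >= 2) = P(X > 1) = sf(1), hence the shifted first argument.
print(binom.sf(1, 8, p))                           # ~0.000595
print(1 - binom.cdf(1, 8, p))                      # same value via the CDF
print(binom_test(2, 8, p, alternative='greater'))  # matches R's binom.test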

Related

Random Samples from Gamma distribution with two parameters / Python

If I would like to generate 10 random samples from a gamma distribution with the following (rate-parameterized) density,
f(x; α, β) = β^α / Γ(α) · x^(α−1) · e^(−βx),
with alpha = 2 and beta = 3, how would I do it?
The documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gamma.html
is a bit unclear to me.
My guess is that it would be like:
a = 2
b = 3
scipy.stats.gamma.rvs(a, loc = 0, scale = 1/b, size = 10)
Can anyone verify whether this is correct or provide the correct solution?
Yes, that is correct. In the formula that you show, β is often called the rate parameter. The gamma distribution in SciPy uses a scale parameter, which corresponds to 1/β. You can see the formulas for these two common parameterizations side by side in the Wikipedia article on the gamma distribution.
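For example (a minimal sanity check, not part of the original question), with shape a = 2 and rate b = 3 the sample mean should come out close to the theoretical mean a/b:
from scipy.stats import gamma
a, b = 2, 3  # shape (alpha) and rate (beta)
samples = gamma.rvs(a, loc=0, scale=1/b, size=100000)
print(samples.mean(), a / b)  # both should be close to 0.667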
If all you need is the generation of random samples (and not all the other methods provided by scipy.stats.gamma), you can use the gamma method of the NumPy class numpy.random.Generator. It uses the same parameter conventions as the SciPy gamma distribution, except that it does not have the loc parameter:
In [26]: import numpy as np
In [27]: rng = np.random.default_rng()
In [28]: a = 2
In [29]: b = 3
In [30]: rng.gamma(a, scale=1/b, size=10)
Out[30]:
array([0.637065 , 0.18436688, 1.36876183, 0.74692619, 0.12608862,
0.38395668, 0.81947237, 0.63437319, 0.47902819, 0.39094079])

Trying to run an implementation of lsqcurvefit from the MATLAB Optimization Toolbox in Python using curve_fit

I am trying to implement MATLAB's lsqcurvefit in Python using curve_fit, with no success. Below is the MATLAB code I am trying to port to Python:
myfun = @(x,xdata)(exp(x(1))./ xdata.^exp(x(2))) - x(3);
xstart = [4, -2, 54];
pX = [2, 3, 13, 12, 38, 39];
pY = [12.7595, 8.7857, -11.8802, -10.9528, -15.4390, -15.3083];
try
    fittedmodel = lsqcurvefit(myfun, xstart, double(pX), double(pY), [], [], optimset('Display', 'off'));
    disp("fitted model:");
    disp(fittedmodel);
catch
end
Below is my matlab output:
fitted model:
4.8389 3.3577 -2.0000
Below is my Python code:
from scipy.optimize import curve_fit
import numpy as np
pX = [2, 3, 13, 12, 38, 39]
pY = [12.7595, 8.7857, -11.8802, -10.9528, -15.4390, -15.3083]

def myfun(x, xdata):
    temp_val_1 = np.exp(x[0])
    temp_val_2 = np.exp(x[1])
    temp_val_3 = x[2]
    temp_val_4 = np.power(xdata, temp_val_2)
    temp_val_5 = np.divide(temp_val_1, temp_val_4)
    temp_val_6 = temp_val_5 - temp_val_3
    return temp_val_6
popt, pcov = curve_fit(myfun, pX, pY, p0=([4, -2, 54]))
print(popt, "\n", pcov)
and below is my Python output:
myfun() takes 2 positional arguments but 4 were given
I understand that there is something wrong with the inputs, but I don't understand what to change in order to get the same results as I do with MATLAB.
Here are a few hints to get you started:
Note that curve_fit expects a function with signature f(xdata, *x), where x is your optimization variable, i.e. the coefficients being searched for. This is just the other way around compared to MATLAB's lsqcurvefit. The notation *x is Python-specific and denotes a variable number of arguments.
Additionally, you don't need to use the np.power and np.divide functions. The usual mathematical operators are overloaded for np.arrays and are applied elementwise. For example, this means that for two np.arrays, a / b is equivalent to MATLAB's a ./ b. Consequently, it's more convenient to write (and to read):
def myfun(xdata, *x):
    return np.exp(x[0]) / xdata**np.exp(x[1]) - x[2]
I obtain the following coefficients:
[ 4.01234549 -0.47409326 21.70045585]
However, there seems to be an overflow for the term np.exp(x[1]), so it might be worth reformulating the objective function or increasing the floating-point precision, e.g. by using long doubles (dtype=np.float128).
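Putting these hints together, a minimal end-to-end sketch (using the data from the question; the overflow warning mentioned above may still appear) could look like this:
import numpy as np
from scipy.optimize import curve_fit
pX = np.array([2, 3, 13, 12, 38, 39], dtype=float)
pY = np.array([12.7595, 8.7857, -11.8802, -10.9528, -15.4390, -15.3083])
# xdata comes first, then the coefficients (the reverse of lsqcurvefit's convention)
def myfun(xdata, *x):
    return np.exp(x[0]) / xdata**np.exp(x[1]) - x[2]
popt, pcov = curve_fit(myfun, pX, pY, p0=[4, -2, 54])
print(popt)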

How to properly sample truncated distributions?

I am trying to learn how to sample truncated distributions. To begin with, I decided to try a simple example I found here.
I didn't really understand the division by the CDF, so I decided to tweak the algorithm a bit. The distribution being sampled is an exponential distribution for values x > 0. Here is an example Python code:
# Sample exponential distribution for the case x>0
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def pdf(x):
    return x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()
    xs = x
    if a > 0.:
        xs = a
    A = pdf(xs)/pdf(x)
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x

x = np.linspace(0, 15, 1000)
plt.plot(x, pdf(x))
plt.hist([x for x in xvec if x != 0], bins=150, normed=True)
plt.show()
And the output is:
The code above seems to work fine only when using the condition if a > 0., i.e. positive x; choosing another condition (e.g. if a > 0.5) produces wrong results.
Since my final goal was to sample a 2D Gaussian pdf on a truncated interval, I tried extending the simple example using the exponential distribution (see the code below). Unfortunately, since the simple case didn't work, I assume that the code given below would yield wrong results.
I assume that all this can be done using the advanced tools of Python. However, since my primary idea was to understand the principle behind it, I would greatly appreciate your help in understanding my mistake.
Thank you for your help.
EDIT:
# code updated according to the answer of CrazyIvan
import numpy as np
from scipy.stats import multivariate_normal

RANGE = 100000
a = 2.06072E-02
b = 1.10011E+00
a_range = [0.001, 0.5]
b_range = [0.01, 2.5]
cov = [[3.1313994E-05, 1.8013737E-03], [1.8013737E-03, 1.0421529E-01]]

x = a
y = b
j = 0
for i in range(RANGE):
    a_t, b_t = np.random.multivariate_normal([a, b], cov)
    # accept if within bounds - all that is needed to truncate
    if a_range[0] < a_t and a_t < a_range[1] and b_range[0] < b_t and b_t < b_range[1]:
        print(a_t, b_t)
EDIT:
I changed the code by normalizing the analytic pdf according to this scheme, following the answers given by @CrazyIvan and @Leandro Caniglia, for the case where the bottom of the pdf is removed: that is, dividing by (1 - CDF(0.5)), since my accept condition is x > 0.5. This again seems to show some discrepancies. Again the mystery prevails...
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def pdf(x):
    return x*np.exp(-x)

# included the corresponding cdf
def cdf(x):
    return 1. - np.exp(-x) - x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()
    xs = x
    if a > 0.5:
        xs = a
    A = pdf(xs)/pdf(x)
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x

x = np.linspace(0, 15, 1000)
# new part: norm the analytic pdf to fix the area
plt.plot(x, pdf(x)/(1. - cdf(0.5)))
plt.hist([x for x in xvec if x != 0], bins=200, normed=True)
plt.savefig("test_exp.png")
plt.show()
It seems that this can be cured by choosing a larger shift size,
shift = 15.
a = x + np.random.normal()*shift
which is in general an issue of Metropolis-Hastings. See the graph below:
I also checked shift = 150.
The bottom line is that changing the shift size definitely improves the convergence. The mystery is why, since the Gaussian is unbounded.
You say you want to learn the basic idea of sampling a truncated distribution, but your source is a blog post about
Metropolis–Hastings algorithm? Do you actually need this "method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult"? Taking this as your starting point is like learning English by reading Shakespeare.
Truncated normal
For a truncated normal, basic rejection sampling is all you need: generate samples from the original distribution and reject those outside of the bounds. As Leandro Caniglia noted, you should not expect the truncated distribution to have the same PDF except on a shorter interval; this is plainly impossible, because the area under the graph of a PDF is always 1. If you cut off stuff from the sides, there has to be more in the middle; the PDF gets rescaled.
It's quite inefficient to gather samples one by one, when you need 100000. I would grab 100000 normal samples at once, accept only those that fit; then repeat until I have enough. Example of sampling truncated normal between amin and amax:
import numpy as np
n_samples = 100000
amin, amax = -1, 2
samples = np.zeros((0,)) # empty for now
while samples.shape[0] < n_samples:
    s = np.random.normal(0, 1, size=(n_samples,))
    accepted = s[(s >= amin) & (s <= amax)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples]  # we probably got more than needed, so discard extra ones
And here is the comparison with the PDF curve, rescaled by division by cdf(amax) - cdf(amin) as explained above.
from scipy.stats import norm
import matplotlib.pyplot as plt
_ = plt.hist(samples, bins=50, density=True)
t = np.linspace(-2, 3, 500)
plt.plot(t, norm.pdf(t)/(norm.cdf(amax) - norm.cdf(amin)), 'r')
plt.show()
Truncated multivariate normal
Now we want to keep the first coordinate between amin and amax, and the second between bmin and bmax. Same story, except there will be a 2-column array and the comparison with bounds is done in a relatively sneaky way:
(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)
This means: subtract amin, bmin from each row and keep only the rows where both results are nonnegative (meaning we had a >= amin and b >= bmin). Also do a similar thing with amax, bmax. Accept only the rows that meet both criteria.
n_samples = 10
amin, amax = -1, 2
bmin, bmax = 0.2, 2.4
mean = [0.3, 0.5]
cov = [[2, 1.1], [1.1, 2]]
samples = np.zeros((0, 2)) # 2 columns now
while samples.shape[0] < n_samples:
    s = np.random.multivariate_normal(mean, cov, size=(n_samples,))
    accepted = s[(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples, :]
Not going to plot, but here are some values: naturally, within bounds.
array([[ 0.43150033, 1.55775629],
[ 0.62339265, 1.63506963],
[-0.6723598 , 1.58053835],
[-0.53347361, 0.53513105],
[ 1.70524439, 2.08226558],
[ 0.37474842, 0.2512812 ],
[-0.40986396, 0.58783193],
[ 0.65967087, 0.59755193],
[ 0.33383214, 2.37651975],
[ 1.7513789 , 1.24469918]])
To compute the truncated density function pdf_t from the entire density function pdf, do the following:
1. Let [a, b] be the truncation interval (on the x axis).
2. Let A := cdf(a) and B := cdf(b), where cdf is the non-truncated cumulative distribution function.
3. Then pdf_t(x) := pdf(x) / (B - A) if x is in [a, b], and 0 elsewhere.
In cases where a = -infinity (resp. b = +infinity), take A := 0 (resp. B := 1).
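As a small illustration (a sketch written in terms of the pdf/cdf pair from the question, not a general-purpose routine), these steps translate directly into code:
import numpy as np
def pdf(x):
    return x * np.exp(-x)
def cdf(x):
    return 1. - np.exp(-x) - x * np.exp(-x)
def pdf_t(x, a=0.5, b=np.inf):
    # truncated density: rescale inside [a, b], zero elsewhere (step 3)
    A = 0.0 if a == -np.inf else cdf(a)  # A := 0 for a = -infinity
    B = 1.0 if b == np.inf else cdf(b)   # B := 1 for b = +infinity
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), pdf(x) / (B - A), 0.0)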
As for the "mystery" you see
please note that your blue curve is wrong. It is not the pdf of your truncated distribution, it is just the pdf of the non-truncated one, scaled by the correct amount (division by 1-cdf(0.5)). The actual truncated pdf curve starts with a vertical line on x = 0.5 which goes up until it reaches your current blue curve. In other words, you only scaled the curve but forgot to truncate it, in this case to the left. Such a truncation corresponds to the "0 elsewhere" part of step 3 in the algorithm above.
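In the plotting code from the question, this amounts to zeroing the reference curve below the cut-off before drawing it, for example with the pdf_t sketch above:
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(0, 15, 1000)
plt.plot(t, pdf_t(t, a=0.5))  # truncated AND rescaled reference curve
plt.show()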

What is the pandas equivalent of R's qnorm()

I am moving some code from R to Anaconda Python. The R code uses qnorm, documented as "quantile function for the normal distribution with mean equal to mean and standard deviation equal to sd."
The call and parameters are:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
p vector of probabilities.
mean vector of means.
sd vector of standard deviations.
log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X ≤ x]; otherwise, P[X > x].
I don't see any equivalent in pandas.Series. Have I missed it, is it on another object, or is there some equivalent in another library?
A lot of this equivalent functionality is found in scipy.stats. In this case, you're looking for scipy.stats.norm.ppf.
qnorm(p, mean = 0, sd = 1) is equivalent to scipy.stats.norm.ppf(p, loc=0, scale=1).
import scipy.stats as st
>>> st.norm.ppf([0.01, 0.99])
array([-2.32634787, 2.32634787])
>>> st.norm.ppf([0.01, 0.99], loc=10, scale=0.1)
array([ 9.76736521, 10.23263479])
Just to expand on @miradulo's answer: if you also want qnorm(lower.tail = FALSE), you can just multiply the result by -1:
In R:
qnorm(0.8, lower.tail = F)
-0.8416212
In Python:
from scipy.stats import norm
norm.ppf(0.8) * -1
-0.8416212
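Alternatively, norm.isf (the inverse survival function) is a direct upper-tail counterpart of ppf, and it also works when loc and scale are not at their defaults, where simply flipping the sign would no longer be correct:
from scipy.stats import norm
# isf(q) == ppf(1 - q), i.e. the equivalent of qnorm(q, lower.tail = FALSE)
norm.isf(0.8)                     # -0.8416212...
norm.isf(0.8, loc=10, scale=0.1)  # approximately 9.9158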

SciPy step response plot seems to break for some values

I'm using SciPy instead of MATLAB in a control systems class to plot the step responses of LTI systems. It's worked great so far, but I've run into an issue with a very specific system. With this code:
from numpy import min
from scipy import linspace
from scipy.signal import lti, step
from matplotlib import pyplot as p
# Create an LTI transfer function from coefficients
tf = lti([64], [1, 16, 64])
# Step response (redo it to get better resolution)
t, s = step(tf)
t, s = step(tf, T = linspace(min(t), t[-1], 200))
# Plotting stuff
p.plot(t, s)
p.xlabel('Time / s')
p.ylabel('Displacement / m')
p.show()
The code as-is displays a flat line. If I modify the final coefficient of the denominator to 64.00000001 (i.e., tf = lti([64], [1, 16, 64.0000001])) then it works as it should, showing an underdamped step response. Setting the coefficient to 63.9999999 also works. Changing all the coefficients to have explicit decimal places (i.e., tf = lti([64.0], [1.0, 16.0, 64.0])) doesn't affect anything, so I guess it's not a case of integer division messing things up.
Is this a bug in SciPy, or am I doing something wrong?
This is a limitation of the implementation of the step function. It uses a matrix exponential to find the step response, and it doesn't handle repeated poles well. (Your system has a repeated pole at -8.)
Instead of using step, you can use the function scipy.signal.step2:
In [253]: from scipy.signal import lti, step2
In [254]: sys = lti([64], [1, 16, 64])
In [255]: t, y = step2(sys)
In [256]: plot(t, y)
Out[256]: [<matplotlib.lines.Line2D at 0x5ec6b90>]
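For completeness, here is the same idea as a self-contained script (a sketch; it assumes a SciPy version that still provides scipy.signal.step2, which later releases have deprecated):
import matplotlib.pyplot as plt
from scipy.signal import lti, step2
# second-order system with a repeated pole at -8, the case plain step struggles with
sys = lti([64], [1, 16, 64])
t, y = step2(sys)
plt.plot(t, y)
plt.xlabel('Time / s')
plt.ylabel('Displacement / m')
plt.show()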
