Binomial test in Python for very large numbers - python

I need to do a binomial test in Python that allows calculation for 'n' numbers of the order of 10000.
I have implemented a quick binomial_test function using scipy.misc.comb; however, it is limited to about n = 1000, I guess because it reaches the biggest representable number while computing the factorials or the binomial coefficient itself. Here is my function:
from scipy.misc import comb
def binomial_test(n, k):
    """Calculate binomial probability
    """
    p = comb(n, k) * 0.5**k * 0.5**(n-k)
    return p
How could I use a native python (or numpy, scipy...) function in order to calculate that binomial probability? If possible, I need scipy 0.7.2 compatible code.
Many thanks!

Edited to add this comment: please note that, as Daniel Stutzbach mentions, the "binomial test" is probably not what the original poster was asking for (though he did use this expression). He seems to be asking for the probability density function of a binomial distribution, which is not what I'm suggesting below.
Have you tried scipy.stats.binom_test?
rbp@apfelstrudel ~$ python
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from scipy import stats
>>> print stats.binom_test.__doc__
Perform a test that the probability of success is p.
This is an exact, two-sided test of the null hypothesis
that the probability of success in a Bernoulli experiment
is `p`.
Parameters
----------
x : integer or array_like
    the number of successes, or if x has length 2, it is the
    number of successes and the number of failures.
n : integer
    the number of trials. This is ignored if x gives both the
    number of successes and failures
p : float, optional
    The hypothesized probability of success. 0 <= p <= 1. The
    default value is p = 0.5
Returns
-------
p-value : float
    The p-value of the hypothesis test
References
----------
.. [1] http://en.wikipedia.org/wiki/Binomial_test
>>> stats.binom_test(500, 10000)
4.9406564584124654e-324
Small edit to add documentation link: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test
BTW: works on scipy 0.7.2, as well as on current 0.8 dev.

Any solution that looks like comb(n, k) * 0.5**k * 0.5**(n-k) isn't going to work for large n. On most (all?) platforms, the smallest value a Python float can store is around 2**-1022. For large n-k or large k, the right-hand side will get rounded to 0. Likewise, comb(n, k) can grow so large that it will not fit in a float.
A more robust approach is to compute the probability density function as the difference between two consecutive points in the cumulative distribution function, which can be computed using the regularized incomplete beta function (look in SciPy's "special functions" package). Mathematically:
pdf(p, n, k) = cdf(p, n, k) - cdf(p, n, k-1)
Another option is to use the Normal approximation, which is quite accurate for large n. If speed is a concern, this is probably the way to go:
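from math import sqrt, pi, exp
from scipy.misc import comb  # newer SciPy releases provide this as scipy.special.comb
def normal_pdf(x, m, v):
    # Normal density with mean m and variance v, evaluated at x
    return 1.0/sqrt(2*pi*v) * exp(-(x-m)**2/(2*v))
def binomial_pdf(p, n, k):
    if n < 100:
        # Fall back to your current method, which is safe for small n
        return comb(n, k) * p**k * (1.0-p)**(n-k)
    return normal_pdf(k, n*p, n*p*(1.0-p))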
I haven't tested the code, but that should give you the general idea.

GMPY also supports extended precision floating point calculations. For example:
>>> from gmpy import *
>>>
>>> def f(n,k,p,prec=256):
...     return mpf(comb(n,k),prec) * mpf(p,prec)**k * mpf(1-p,prec)**(n-k)
...
>>> print(f(1000,500,0.5))
0.0252250181783608019068416887621024545529410193921696384762532089115753731615931
>>>
I specified a floating point precision of 256 bits. By the way, source forge version is way out of date. The current version is maintained at code.google.com and supports Python 3.x. (Disclaimer: I'm the current maintainer of gmpy.)
casevh

I would look into the GNU Multi-Precision package (gmpy), which allows you to perform arbitrary precision calculations: you could probably do:
comb(n, k, exact=1)/2**k/2**(n-k)
but with the long integers of gmpy.
In fact, if you use exact integer computations, you can easily reach n=10000 for the combinations part; for this, you must use:
comb(n, k, exact=1)
instead of the floating point approximation comb(n, k), which overflows.
However, as the Original Poster noted, the returned (long) integer may be too long to be multiplied by a float!
Furthermore, one quickly runs into another problem: 0.5**1000=9.3…e-302 is already very close to the float underflow…
In summary: if you really need precise results for all k for n~10,000, you need a different approach from the formula in the original post, which suffers from the limitations of double precision floating point arithmetic. Using gmpy as indicated above could be a solution (not tested!).
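As a rough illustration of the exact-integer idea without gmpy, here is a minimal sketch that relies only on Python's built-in arbitrary-precision integers. It assumes Python 3.8 or later (for math.comb) and p = 0.5 as in the question; the function name binomial_probability_half is made up for this example.

```python
import math

def binomial_probability_half(n, k):
    """P(X = k) for X ~ Binomial(n, 0.5), via exact integer arithmetic."""
    # math.comb returns an exact int (Python 3.8+), and 2**n is an exact int too.
    # Dividing one int by the other rounds only once, at the very end.
    return math.comb(n, k) / 2**n

print(0.5**10000)  # the float factor on its own underflows to 0.0
print(binomial_probability_half(10000, 5000))
```

The point of the design is that nothing is converted to a float until the final division, so neither the huge coefficient nor the tiny power of 0.5 ever has to fit in a double.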

Not specifically a Python solution, but if you can deal with small fractional errors, you might try using Stirling's approximation for n!:
comb(n, k) = n!/(k! * (n-k)!), where n! is approximately sqrt(2*pi*n) * (n/e)^n for large n.
For n>1000 the fractional errors should be very small.
For the probability calculation with large n, use logarithms for intermediate results:
log p = log(comb(n, k)) - n * log(2)
p = exp(log(p))
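The logarithm approach above could be sketched in Python roughly as follows. Note that this sketch swaps Stirling's approximation for math.lgamma from the standard library, which returns log(n!) as lgamma(n + 1); the function names are made up for illustration, and p is assumed to lie strictly between 0 and 1.

```python
import math

def log_binomial_coefficient(n, k):
    # log(n!) == lgamma(n + 1), so this is log(n! / (k! * (n-k)!))
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def binomial_probability(n, k, p=0.5):
    """P(X = k) for X ~ Binomial(n, p) with 0 < p < 1, computed in log space."""
    log_prob = (log_binomial_coefficient(n, k)
                + k * math.log(p)
                + (n - k) * math.log(1.0 - p))
    # Only the final, moderately sized value is exponentiated
    return math.exp(log_prob)

print(binomial_probability(1000, 500))
print(binomial_probability(10000, 5000))
```

Because every intermediate quantity is a logarithm of modest size, neither the overflow of comb(n, k) nor the underflow of 0.5**n occurs.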

# Define the factorial function used for the binomial coefficients.
# range(1, n+1) is used so that the nth term is included.
def factorial(n):
    f = 1
    for x in range(1, n+1):
        f = f * x
    return f
# Calculate the binomial coefficient for given values of n and k.
# Integer division (//) keeps the result exact, even for large n.
def binomial(n, k):
    return factorial(n) // (factorial(k) * factorial(n-k))
# Build Pascal's triangle: append each binomial coefficient to a row,
# then append each row to the triangle. range(n+1) includes the nth term.
def pascal(T):
    triangle = []
    for n in range(T):
        r = []
        for k in range(n+1):
            r.append(binomial(n, k))
        triangle.append(r)
    return triangle
# Print a 20-row Pascal's triangle.
for r in pascal(20):
    print(r)

Related

Reversing pow function - finding the power [duplicate]

Given positive integers b, c, m where (b < m) is True, the task is to find a positive integer e such that
(b**e % m == c) is True
where ** is exponentiation (e.g. in Ruby, Python or ^ in some other languages) and % is modulo operation. What is the most effective algorithm (with the lowest big-O complexity) to solve it?
Example:
Given b=5; c=8; m=13 this algorithm must find e=7 because 5**7%13 = 8
From the % operator I'm assuming that you are working with integers.
You are trying to solve the Discrete Logarithm problem. A reasonable algorithm is Baby step, giant step, although there are many others, none of which are particularly fast.
The difficulty of finding a fast solution to the discrete logarithm problem is a fundamental part of some popular cryptographic algorithms, so if you find a better solution than any of those on Wikipedia please let me know!
This isn't a simple problem at all. It is called calculating the discrete logarithm and it is the inverse operation to modular exponentiation.
There is no efficient algorithm known. That is, if N denotes the number of bits in m, all known algorithms run in O(2^(N^C)) where C>0.
Python 3 Solution:
Thankfully, SymPy has implemented this for you!
SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.
This is the documentation on the discrete_log function. Use this to import it:
from sympy.ntheory import discrete_log
Their example computes \log_7(15) (mod 41):
>>> discrete_log(41, 15, 7)
3
Because of the (state-of-the-art, mind you) algorithms it employs to solve it, you'll get O(\sqrt{n}) on most inputs you try. It's considerably faster when your prime modulus has the property where p - 1 factors into a lot of small primes.
Consider a prime on the order of 100 bits: (~ 2^{100}). With \sqrt{n} complexity, that's still 2^{50} iterations. That being said, don't reinvent the wheel. This does a pretty good job. I might also add that it was almost 4 times more memory efficient than Mathematica's MultiplicativeOrder function when I ran it with large-ish inputs (44 MiB vs. 173 MiB).
Since a duplicate of this question was asked under the Python tag, here is a Python implementation of baby step, giant step, which, as @MarkBeyers points out, is a reasonable approach (as long as the modulus isn't too large):
import math
def baby_steps_giant_steps(a, b, p, N=None):
    # Solve a**x % p == b for x, assuming p is prime
    if not N: N = 1 + int(math.sqrt(p))
    # initialize baby_steps table: maps a**r % p to r
    baby_steps = {}
    baby_step = 1
    for r in range(N+1):
        baby_steps[baby_step] = r
        baby_step = baby_step * a % p
    # now take the giant steps; the stride is a**(-N) % p by Fermat's little theorem
    giant_stride = pow(a, (p-2)*N, p)
    giant_step = b
    for q in range(N+1):
        if giant_step in baby_steps:
            return q*N + baby_steps[giant_step]
        else:
            giant_step = giant_step * giant_stride % p
    return "No Match"
In the above implementation, an explicit N can be passed to fish for a small exponent even if p is cryptographically large. It will find the exponent as long as the exponent is smaller than N**2. When N is omitted, the exponent will always be found, but not necessarily in your lifetime or with your machine's memory if p is too large.
For example, if
p = 70606432933607
a = 100001
b = 54696545758787
then 'pow(a,b,p)' evaluates to 67385023448517
and
>>> baby_steps_giant_steps(a,67385023448517,p)
54696545758787
This took about 5 seconds on my machine. For the exponent and the modulus of those sizes, I estimate (based on timing experiments) that brute force would have taken several months.
Discrete logarithm is a hard problem
Computing discrete logarithms is believed to be difficult. No
efficient general method for computing discrete logarithms on
conventional computers is known.
I will add here a simple bruteforce algorithm which tries every possible value from 1 to m and outputs a solution if it was found. Note that there may be more than one solution to the problem or zero solutions at all. This algorithm will return you the smallest possible value or -1 if it does not exist.
def bruteLog(b, c, m):
    s = 1
    for i in range(m):
        s = (s * b) % m
        if s == c:
            return i + 1
    return -1
print(bruteLog(5, 8, 13))
and here you can see that 3 is in fact the solution:
print(5**3 % 13)
There is a better algorithm, but because it is often asked to be implemented in programming competitions, I will just give you a link to explanation.
As said, the general problem is hard. However, a practical way to find e, if and only if you know e is going to be small (as in your example), is simply to try each e starting from 1.
By the way, e == 3 is the first solution to your example, and you can obviously find it in 3 steps. Compare that with solving the non-discrete version and naively looking for integer solutions, i.e.
e = log(c + n*m)/log(b) where n is a non-negative integer
which finds e == 3 in 9 steps.

Why does this Python code give me the wrong answer?

I wrote a simple Python function to solve a certain hydraulics formula (Manning's equation):
import math
def mannings(units,A,P,S,n):
    if units=='SI':
        k=1.0
    elif units=='US':
        k=1.49
    R=A/P
    V=(k/n)*(math.pow(R,(2/3)))*(math.sqrt(S))
    Q=A*V
    return R,V,Q
In the code above, the velocity V is calculated from the k, n, R and S. The velocity is then used to calculate the discharge Q by multiplying with Area A. The user inputs the unit convention, the A, P, S and n. k is decided on the basis of unit convention.
When I run the function using mannings('US',1.0618,2.7916,0.02,0.015), I get (0.38035535176959456, 14.047854719572745, 14.916012141242343). The R value matches the R calculated in a spreadsheet, but the V and Q are way off. The actual V should be 7.374638178
and the Q should be 7.830634155.
It'd be great if someone can tell me what's going wrong here. This is a pretty straightforward formula and I was guessing it should work easily.
Your problem is that 2/3 is an integer division and therefore evaluates to 0. You want 2.0/3 to force a floating-point division. Or else include from __future__ import division at the top of your file to use the Python 3-style division in Python 2.x.
Assuming you don't use the __future__ solution, you will also want to write your R = A / P as e.g. R = float(A) / P because otherwise, if A and P are both integers, R will also be an integer.
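For reference, here is a sketch of the function with both fixes applied. The ValueError guard for an unknown unit string is an addition for illustration and not part of the original code. With the inputs from the question it should give V close to 7.37 and Q close to 7.83, in line with the spreadsheet to the precision of the rounded inputs.

```python
import math

def mannings(units, A, P, S, n):
    """Return (R, V, Q) from Manning's equation."""
    if units == 'SI':
        k = 1.0
    elif units == 'US':
        k = 1.49
    else:
        # Added guard (not in the original): fail loudly on an unknown unit system
        raise ValueError("units must be 'SI' or 'US'")
    R = float(A) / P  # float() guards against integer division of A by P
    V = (k / n) * math.pow(R, 2.0 / 3.0) * math.sqrt(S)  # 2.0/3.0, not 2/3
    Q = A * V
    return R, V, Q

R, V, Q = mannings('US', 1.0618, 2.7916, 0.02, 0.015)
print(R, V, Q)
```

Under Python 3 the / operator already performs true division, so the original code would behave correctly there; writing 2.0/3.0 makes the intent explicit on either version.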

Numerical precision in python

I've been doing simple numerical experiments with python, like computing
factorials. For instance, compute the factorial of 32:
My routine:
2.6313083693369503e+35
From scipy.misc:
2.6313083693369355e+35
I want to point out that my routine calculates the logarithm of the factorial:
it computes the summation of the logarithms from 1 to 32 (in this case)
and then I just take the exp function (I do it this way because of stuff learned from
Fortran 90).
It is a surprise that the correct answer is
263130836933693530167218012160000000
according to pari/gp.
I would be very happy if someone could point me to references where I can look for
correct numerical answers in Python. The documentation is OK, but only if one wants
"short" numbers.
log and exp functions operate on floating points, which have limited precision. Python's integers, on the other hand, can have arbitrary precision. So, you can compute the factorial of 32 in linear space just fine using integers.
f = 1
for i in range(32):
    f *= i + 1
print(f) # prints '263130836933693530167218012160000000'
You can do it this way:
import operator
from functools import reduce  # reduce is no longer a builtin in Python 3
n=32
print(reduce(operator.__mul__,range(1,n+1)))
# 263130836933693530167218012160000000
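To see how far the sum-of-logarithms route drifts from the exact integer, a small comparison like the following may help. It is only a sketch: it uses math.factorial from the standard library for the exact value and rebuilds the log-based routine described in the question as a one-liner, which may differ in detail from the asker's code.

```python
import math

n = 32
exact = math.factorial(n)  # arbitrary-precision integer, no rounding
via_logs = math.exp(sum(math.log(i) for i in range(1, n + 1)))  # rounded float

print(exact)
print(via_logs)
print(abs(via_logs - exact) / exact)  # relative error of the float route
```

The relative error is tiny, but since a double carries only about 16 significant digits, the trailing digits of a 36-digit result cannot all be right.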

Python - how to compute all nth roots of a number?

Is it possible to calculate the n complex roots of a given number using Python? I've briefly checked, and it looks like Python gives me wrong/incomplete answers:
(-27.0j)**(1.0/3.0) produces (2.598076211353316-1.4999999999999998j)
but proper roots should be 3 complex numbers, because every non-zero number has n different complex number nth roots. Is it possible in Python?
I don't think standard Python will do this unless you write a function for it, but you can do it with Numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.roots.html
There are many multi-valued complex functions - functions that can have more than one value corresponding to any point in their domain. For example: roots, logarithms, inverse trigonometric functions...
The reason these functions can have multiple values is usually because they are the inverse of a function that has multiple values in the domain map to the same value.
When doing calculations with such functions, it would be impractical to always return all possible values. For the inverse trigonometric functions, there are infinitely many possible values.
Usually the different function values can be expressed as a function of an integer parameter k. For example, the values of log z with z = r*(cos t + i*sin t) are log r + i*(t + k*2*pi) with k any integer. For the nth root, they are r**(1/n)*exp(i*(t+k*2*pi)/n) with k=0..n-1 inclusive.
Because returning all possible values is impractical, mathematical functions in Python and almost all other common programming languages return what's called the 'principal value' of the function. (reference) The principal value is usually the function value with k=0. Whatever choice is made, it should be stated clearly in the documentation.
So to get all the complex roots of a complex number, you just evaluate the function for all relevant values of k:
from cmath import exp, phase, pi
def roots(z, n):
    nthRootOfr = abs(z)**(1.0/n)
    t = phase(z)
    # one root for each k = 0..n-1
    return [nthRootOfr*exp((t+2*k*pi)*1j/n) for k in range(n)]
(The functions come from the cmath module, hence the import.) This gives:
>>> roots(-27j,3)
[(2.59808-1.5j), (1.83691e-16+3j), (-2.59808-1.5j)]
If you want to get all roots in plain Python, you can write a simple function to do this:
import math
def root(num, r):
    base = num ** (1.0/r)
    roots = [base]
    for i in range(1, r):
        # rotate the first root by multiples of 2*pi/r
        roots.append(base * complex(math.cos(2*math.pi * i / r), math.sin(2*math.pi * i / r)))
    return roots

Operations for Long and Float in Python

I'm trying to compute this:
from scipy import *
3600**3400 * (exp(-3600)) / factorial(3400)
the error: unsupported long and float
Try using logarithms instead of working with the numbers directly. Since none of your operations are addition or subtraction, you could do the whole thing in logarithm form and convert back at the end.
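A minimal sketch of that idea, using only the standard library: math.lgamma(3401) stands in for log(3400!), and everything else is the expression from the question rewritten as a sum of logarithms.

```python
import math

# log(3600**3400 * exp(-3600) / 3400!) written as a sum of logs;
# math.lgamma(3401) equals log(3400!)
log_value = 3400 * math.log(3600) - 3600 - math.lgamma(3401)
value = math.exp(log_value)  # convert back only at the very end
print(value)  # roughly 2.379e-05
```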
Computing with numbers of such magnitude, you just can't use ordinary 64-bit-or-so floats, which is what Python's core runtime supports. Consider gmpy (do not get the sourceforge version, it's aeons out of date) -- with that, math, and some care...:
>>> import gmpy, math
>>> e = gmpy.mpf(math.exp(1))
>>> gmpy.mpz(3600)**3400 * (e**(-3600)) / gmpy.fac(3400)
mpf('2.37929475533825366213e-5')
(I'm biased about gmpy, of course, since I originated and still participate in that project, but I'd never make strong claims about its floating point abilities... I've been using it mostly for integer stuff... still, it does make this computation possible!-).
You could try using the Decimal object. Calculations will be slower but you won't have trouble with really small numbers.
from decimal import Decimal
I don't know how Decimal interacts with the scipy module, however.
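As a rough sketch of how that could look with only the standard library (no scipy involved): the precision of 30 digits is an arbitrary choice, and math.factorial supplies the exact integer factorial.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 30  # significant digits to keep; an arbitrary choice here

# Decimal has a far wider exponent range than float, so neither the huge
# power nor the tiny exponential overflows or underflows
numerator = Decimal(3600) ** 3400 * Decimal(-3600).exp()
result = numerator / Decimal(math.factorial(3400))
print(result)
```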
This numpy discussion might be relevant.
Well the error is coming about because you are trying to multiply
3600**3400
which is a long with
exp(-3600)
which is a float.
But regardless, the error you are receiving is disguising the true problem. exp(-3600) is too small a number to be represented as a float anyway (it underflows to 0.0). The Python math library is fickle with extreme values, at best.
exp(-3600) is too small, factorial(3400) is too large:
In [1]: from scipy import exp
In [2]: exp(-3600)
Out[2]: 0.0
In [3]: from scipy import factorial
In [4]: factorial(3400)
Out[4]: array(1.#INF)
What about calculating it step by step as a workaround (it also makes sense
to check the smallest and biggest intermediate results):
from math import exp
output = 1.0
smallest = 1e100
biggest = 0
# pair the factors 1..1700 with 3400..1701 so intermediate results stay within float range
for i, j in zip(range(1, 1701), range(3400, 1699, -1)):
    # -3600.0/3400 forces floating-point division on Python 2 as well
    output = output * 3600 * exp(-3600.0/3400) / i
    output = output * 3600 * exp(-3600.0/3400) / j
    smallest = min(smallest, output)
    biggest = max(biggest, output)
print("output: ", output)
print("smallest: ", smallest)
print("biggest: ", biggest)
output is:
output: 2.37929475534e-005
smallest: 2.37929475534e-005
biggest: 1.28724174494e+214
