How can this Mathematica code be ported to Python? I do not know the Mathematica syntax and am having a hard time understanding how this is described in a more traditional language.
Source (pg 5): http://subjoin.net/misc/m496pres1.nb.pdf
This cannot be ported to Python directly as the definition a[j] uses the Symbolic Arithmetic feature of Mathematica.
a[j] is basically the coefficient of x^j in the series expansion of the rational function inside Apart.
Assume you have a[j]; then f[n] is easy. A Block in Mathematica basically introduces a scope for variables: the first list initializes the variables, and the rest is the code to execute. So:
from __future__ import division
def f(n):
    v = n // 5
    q = v // 20
    r = v % 20
    return sum(binomial(q+5-j, 5) * a[r+20*j] for j in range(5))
(binomial is the Binomial coefficient.)
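The coefficients a[j] can also be obtained without a symbolic library. The following is a minimal sketch in Python 3.8+ (for math.comb), assuming the rational function is (1-x^20)^5 / ((1-x)^2 (1-x^2) (1-x^5) (1-x^10)) as used in the answers here. It relies on the fact that dividing a power series by (1 - x^k) is a running sum with stride k. The names series_coefficients, A, a and f are illustrative, not taken from the Mathematica notebook.

```python
from math import comb

def series_coefficients(max_degree=100):
    """Coefficients of (1-x**20)**5 / ((1-x)**2 (1-x**2) (1-x**5) (1-x**10))."""
    coeffs = [0] * (max_degree + 1)
    # Numerator via the binomial theorem: sum of (-1)**i * C(5, i) * x**(20*i).
    for i in range(6):
        coeffs[20 * i] = (-1) ** i * comb(5, i)
    # Dividing by (1 - x**k): q[i] = p[i] + q[i - k], done in place.
    for k in (1, 1, 2, 5, 10):
        for i in range(k, max_degree + 1):
            coeffs[i] += coeffs[i - k]
    return coeffs

A = series_coefficients()

def a(j):
    # Coefficient of x**j; zero outside the computed range.
    return A[j] if 0 <= j < len(A) else 0

def f(n):
    v = n // 5
    q, r = divmod(v, 20)
    return sum(comb(q + 5 - j, 5) * a(r + 20 * j) for j in range(5))
```

Because every denominator factor divides 1 - x^20, the quotient is an ordinary polynomial, so the list A has only finitely many non-zero entries.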
Using the proposed solutions from the previous answers, I found that sympy sadly doesn't compute the apart() of the rational function immediately; it somehow gets confused. Moreover, the Python list of coefficients returned by Poly.all_coeffs() has different semantics than a Mathematica list. Hence the try-except clause in the definition of a().
The following code does work and the output, for some tested values, concurs with the answers given by the Mathematica formula in Mathematica 7:
from __future__ import division
from sympy import expand, Poly, binomial, apart
from sympy.abc import x
A = Poly(apart(expand(((1-x**20)**5)) / expand((((1-x)**2)*(1-x**2)*(1-x**5)*(1-x**10))))).all_coeffs()
def a(n):
    try:
        return A[n]
    except IndexError:
        return 0
def f(n):
    v = n // 5
    q = v // 20
    r = v % 20
    # a is a function here, so it must be called, not subscripted
    return sum(a(r+20*j) * binomial(q+5-j, 5) for j in range(5))
print map(f, [100, 50, 1000, 150])
The symbolics can be done with sympy. Combined with KennyTM's answer, something like this might be what you want:
from __future__ import division
from sympy import Symbol, apart, binomial
x = Symbol('x')
poly = (1-x**20)**5 / ((1-x)**2 * (1-x**2) * (1-x**5) * (1-x**10))
poly2 = apart(poly,x)
def a(j):
    return poly2.coeff(x**j)
def f(n):
    v = n // 5
    q = v // 20
    r = v % 20
    return sum(binomial(q+5-j, 5)*a(r+20*j) for j in range(5))
Although I have to admit that f(n) does not work (I'm not very good at Python).
Related
Out of curiosity, I was wondering if there's a way to compute a binomial coefficient by simulation in Python. I tried a little, but the numbers get so big so quickly that I wasn't able to compute it for anything but really small numbers.
I'm aware of this question but wasn't able to identify one solution that uses only brute force to solve the coefficient. But I have to admit that I don't understand all the implementations listed there.
Here's my naive approach:
import random
import numpy as np
from math import factorial as fac
# Calculating the reference with help of factorials
def comb(n,k):
    return fac(n) // fac(k) // fac(n-k)
# trying a simple simulation with help of random.sample
random.seed(42)
n,k = 30,3
n_sim = 100000
samples = np.empty([n_sim,k], dtype=int)
for i in range(n_sim):
    x = random.sample(range(n),k)
    samples[i] = sorted(x)
u = np.unique(samples, axis=0)
print(len(u))
print(comb(n,k))
Would it be possible to do this efficiently and fast for big numbers?
I use this; it's pretty efficient for large numbers:
def nck(n, k):
    if k < 0 or k > n:
        return 0
    if k == 0 or k == n:
        return 1
    k = min(k, n - k)  # take advantage of symmetry
    c = 1
    for i in range(k):
        c = c * (n - i) // (i + 1)
    return c
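The loop above stays in exact integer arithmetic because, after each step, c is itself a binomial coefficient: C(n, i+1) = C(n, i) * (n - i) / (i + 1), so the floor division never discards a remainder. A small Python 3 sketch that exposes the intermediate values (the helper name nck_steps is made up for this illustration):

```python
from math import factorial

def nck_steps(n, k):
    """Yield C(n, 0), C(n, 1), ..., C(n, k) using the running product."""
    c = 1
    yield c
    for i in range(k):
        # c == C(n, i) here, so c * (n - i) is divisible by (i + 1)
        c = c * (n - i) // (i + 1)
        yield c

def comb_factorial(n, k):
    """Reference value from the factorial definition."""
    return factorial(n) // factorial(k) // factorial(n - k)
```

For example, list(nck_steps(5, 2)) walks through 1, 5, 10, and every intermediate value agrees with comb_factorial.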
I'm trying to continue on my previous question, in which I'm trying to calculate Fibonacci numbers using Binet's formula. To work with arbitrary precision I found mpmath. However, the implementation seems to fail above a certain value. For instance, the 99th value gives:
218922995834555891712
This should be (ref):
218922995834555169026
Here is my code:
from mpmath import *
def Phi():
    return (1 + sqrt(5)) / 2
def phi():
    return (1 - sqrt(5)) / 2
def F(n):
    return (power(Phi(), n) - power(phi(), n)) / sqrt(5)
start = 99
end = 100
for x in range(start, end):
    print(x, int(F(x)))
mpmath does do arbitrary precision math, and it does do it accurately to any precision (as described above) if you are using the arbitrary precision math module and not the default behavior.
mpmath has more than one module which determines the accuracy and speed of the results (to be chosen depending on what you need), and by default it uses Python floats, which is what I believe you saw above.
If you call mpmath's fib( ) having set mp.dps high enough, you will get the correct answer as stated above.
>>> from mpmath import mp
>>> mp.dps = 25
>>> mp.nprint( mp.fib( 99 ), 25 )
218922995834555169026.0
>>> mp.nprint( mpmath.fib( 99 ), 25 )
218922995834555169026.0
Whereas, if you don't use the mp module, you will only get results as accurate as a Python double.
>>> import mpmath
>>> mpmath.dps = 25
>>> mpmath.nprint( mpmath.fib( 99 ), 25 )
218922995834555170816.0
mpmath provides arbitrary but finite precision (as set in mpmath.mp.dps), so its calculations are still inexact. For example, mpmath.sqrt(5) is only an approximation, so any calculation based on it will also be inexact.
To get an exact result for sqrt(5), you have to use a library which supports symbolic computation, e.g. http://sympy.org/ .
To get an exact result for Fibonacci numbers, probably the simplest way is to use an algorithm which does only integer arithmetic. For example:
def fib(n):
    if n < 0:
        raise ValueError
    def fib_rec(n):
        # returns the pair (F(n), F(n+1))
        if n == 0:
            return 0, 1
        else:
            a, b = fib_rec(n >> 1)
            c = a * ((b << 1) - a)
            d = b * b + a * a
            if n & 1:
                return d, c + d
            else:
                return c, d
    return fib_rec(n)[0]
Actually, mpmath's default precision is 15 decimal digits, which I think is not enough if you want a result that is accurate to 21 digits.
One thing you can do is set the precision to be a higher value and use mpmath's defined arithmetic functions for addition, subtraction, etc.
from mpmath import mp
mp.dps = 50
sqrt5 = mp.sqrt(5)
def Phi():
    return 0.5*mp.fadd(1, sqrt5)
def phi():
    return 0.5*mp.fsub(1, sqrt5)
def F(n):
    return mp.fdiv(mp.power(Phi(), n) - mp.power(phi(), n), sqrt5)
print int(F(99))
This will give you
218922995834555169026L
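The effect of the working precision can be reproduced without mpmath. The sketch below is Python 3 and swaps in the standard-library decimal module in place of mpmath, so it only illustrates the idea. The function name binet and its digits parameter are made up for this example.

```python
from decimal import Decimal, localcontext

def binet(n, digits):
    """Evaluate Binet's formula with `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        sqrt5 = Decimal(5).sqrt()
        big = (1 + sqrt5) / 2
        small = (1 - sqrt5) / 2
        value = (big ** n - small ** n) / sqrt5
        # Round to the nearest integer before converting.
        return int(value.to_integral_value())

exact = 218922995834555169026   # reference value quoted in the question
print(binet(99, 50) == exact)   # enough digits for a 21-digit result
print(binet(99, 15) == exact)   # too few digits
```

With 15 digits the trailing digits of the 21-digit result are lost; with 50 digits there is ample headroom.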
My problem is very simple. I would like to compute the following sum.
from __future__ import division
from scipy.misc import comb
import math
for n in xrange(2,1000,10):
    m = 2.2*n/math.log(n)
    print sum(sum(comb(n,a) * comb(n-a,b) * (comb(a+b,a)*2**(-a-b))**m
                  for b in xrange(n+1))
              for a in xrange(1,n+1))
However, Python gives "RuntimeWarning: overflow encountered in multiply" and nan as the output, and it is also very slow.
Is there a clever way to do this?
The reason why you get NaNs is you end up evaluating numbers like
comb(600 + 600, 600) == 3.96509646226102e+359
This is too large to fit into a floating point number:
>>> numpy.finfo(float).max
1.7976931348623157e+308
Take logarithms to avoid it:
from __future__ import division, absolute_import, print_function
# logsumexp lives in scipy.special in current SciPy (it used to be in scipy.misc)
from scipy.special import betaln, logsumexp
import numpy as np
def binomln(n, k):
    # Assumes binom(n, k) >= 0
    return -betaln(1 + n - k, 1 + k) - np.log(n + 1)
for n in range(2, 1000, 10):
    m = 2.2*n/np.log(n)
    a = np.arange(1, n + 1)[np.newaxis,:]
    b = np.arange(n + 1)[:,np.newaxis]
    v = (binomln(n, a)
         + binomln(n - a, b)
         + m*binomln(a + b, a)
         - m*(a+b) * np.log(2))
    term = np.exp(logsumexp(v))
    print(term)
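The same log-space idea can be illustrated with only the standard library. This is a rough Python 3.8+ sketch, not the answer's SciPy code: log_comb uses math.lgamma instead of betaln, and log_sum_exp is a hand-written stand-in for SciPy's logsumexp. Both names are hypothetical.

```python
import math

def log_comb(n, k):
    """Natural log of C(n, k) via log-gamma, so the value never overflows."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_sum_exp(values):
    """log(sum(exp(v))) computed stably by factoring out the largest term."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

# C(1200, 600) does not fit in a float, but its logarithm is a modest number.
print(log_comb(1200, 600))
# Exact big-integer reference for comparison.
print(math.log(math.comb(1200, 600)))
```

Terms are then added and multiplied as logarithms, and exp is applied only to the final sum, if it fits.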
Use the Memoize pattern. With that, redefine comb:
@memoized
def newcomb(a, b):
    return comb(a, b)
And replace all calls to comb with newcomb. Also, for a minor improvement, remove the brackets. If you make explicit lists, you waste time constructing them. If you remove them, you're effectively using generator expressions.
Update:
This won't solve the nan issue, but does make it a lot faster.
For everyone who does not see this as being faster, are you applying the memoize decorator? On my machine, the original function takes 29.7s to go up to 200, but only 3.8s with the memoized version.
What memoize does is simply store all your invocations of comb in a lookup table. So if in a later iteration you're invoking comb with the same arguments as you had at some point in the past, it doesn't recalculate it - it simply looks it up in the lookup table.
I'd like to take the modular inverse of a matrix like [[1,2],[3,4]] mod 7 in Python. I've looked at numpy (which does matrix inversion but not modular matrix inversion) and I saw a few number theory packages online, but nothing that seems to do this relatively common procedure (at least, it seems relatively common to me).
By the way, the inverse of the above matrix is [[5,1],[5,3]] (mod 7). I'd like Python to do it for me though.
Okay...for those who care, I solved my own problem. It took me a while, but I think this works. It's probably not the most elegant, and should include some more error handling, but it works:
import numpy
import math
from numpy import matrix
from numpy import linalg
def modMatInv(A,p): # Finds the inverse of matrix A mod p
    n=len(A)
    A=matrix(A)
    adj=numpy.zeros(shape=(n,n))
    for i in range(0,n):
        for j in range(0,n):
            adj[i][j]=((-1)**(i+j)*int(round(linalg.det(minor(A,j,i)))))%p
    return (modInv(int(round(linalg.det(A))),p)*adj)%p
def modInv(a,p): # Finds the inverse of a mod p, if it exists
    for i in range(1,p):
        if (i*a)%p==1:
            return i
    raise ValueError(str(a)+" has no inverse mod "+str(p))
def minor(A,i,j): # Return matrix A with the ith row and jth column deleted
    A=numpy.array(A)
    minor=numpy.zeros(shape=(len(A)-1,len(A)-1))
    p=0
    for s in range(0,len(minor)):
        if p==i:
            p=p+1
        q=0
        for t in range(0,len(minor)):
            if q==j:
                q=q+1
            minor[s][t]=A[p][q]
            q=q+1
        p=p+1
    return minor
A hackish trick which works when rounding errors aren't an issue:
1. Find the regular inverse (may have non-integer entries) and the determinant (an integer), both implemented in numpy.
2. Multiply the inverse by the determinant, and round to integers (hacky).
3. Now multiply everything by the determinant's multiplicative inverse (modulo your modulus, code below).
4. Do entrywise mod by your modulus.
A less hackish way is to actually implement gaussian elimination. Here's my code using Gaussian elimination, which I wrote for my own purposes (rounding errors were an issue for me). q is the modulus, which is not necessarily prime.
import numpy as np
def generalizedEuclidianAlgorithm(a, b):
    if b > a:
        return generalizedEuclidianAlgorithm(b, a)
    elif b == 0:
        return (1, 0)
    else:
        (x, y) = generalizedEuclidianAlgorithm(b, a % b)
        return (y, x - (a // b) * y)  # integer division
def inversemodp(a, p):
    a = a % p
    if (a == 0):
        print "a is 0 mod p"
        return None
    if a > 1 and p % a == 0:
        return None
    (x, y) = generalizedEuclidianAlgorithm(p, a % p)
    inv = y % p
    assert (inv * a) % p == 1
    return inv
def identitymatrix(n):
    return [[long(x == y) for x in range(0, n)] for y in range(0, n)]
def inversematrix(matrix, q):
    n = len(matrix)
    A = np.matrix([[ matrix[j, i] for i in range(0,n)] for j in range(0, n)], dtype = long)
    Ainv = np.matrix(identitymatrix(n), dtype = long)
    for i in range(0, n):
        # scale row i so that the pivot becomes 1 (mod q)
        factor = inversemodp(A[i,i], q)
        if factor is None:
            raise ValueError("TODO: deal with this case")
        A[i] = A[i] * factor % q
        Ainv[i] = Ainv[i] * factor % q
        # eliminate column i from every other row
        for j in range(0, n):
            if (i != j):
                factor = A[j, i]
                A[j] = (A[j] - factor * A[i]) % q
                Ainv[j] = (Ainv[j] - factor * Ainv[i]) % q
    return Ainv
EDIT: as commenters point out, there are some cases where this algorithm fails. It's slightly nontrivial to fix, and I don't have time nowadays. Back then it worked for random matrices in my case (the moduli were products of large primes). Basically, the first non-zero entry might not be relatively prime to the modulus. The prime case is easy, since you can search for a different row and swap. In the non-prime case, I think it could be that all leading entries aren't relatively prime to the modulus, so you have to combine them.
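For the prime case only, the row search and swap mentioned in the edit might look like the following. This is a minimal pure-Python sketch for Python 3.8+ (it uses pow(x, -1, p) for the scalar inverse), not a fix of the numpy code above, and it does not handle composite moduli. The function name inverse_matrix_mod_prime is made up for this example.

```python
def inverse_matrix_mod_prime(matrix, p):
    """Invert a square matrix (list of lists) modulo a prime p."""
    n = len(matrix)
    # Augmented matrix [A | I], reduced modulo p.
    aug = [[x % p for x in row] + [int(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Search this column for a usable pivot at or below the diagonal.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible modulo %d" % p)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        inv = pow(aug[col][col], -1, p)
        aug[col] = [x * inv % p for x in aug[col]]
        # Clear the column in all other rows.
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(x - factor * y) % p
                          for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

print(inverse_matrix_mod_prime([[1, 2], [3, 4]], 7))
```

Since p is prime, every non-zero entry is invertible, so a pivot exists whenever the matrix is invertible modulo p.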
It can be calculated using Sage (www.sagemath.org) as
Matrix(IntegerModRing(7), [[1, 2], [3,4]]).inverse()
Although Sage is huge to install and you have to use the version of python that comes with it which is a pain.
The sympy package's Matrix class has a method inv_mod(mod) that computes the modular matrix inverse for small and arbitrarily large moduli. By combining sympy with numpy, it becomes easy to compute the modular inverse of 2-D numpy arrays (see the code snippet below):
import numpy
from sympy import Matrix
def matInvMod (vmnp, mod):
    nr = vmnp.shape[0]
    nc = vmnp.shape[1]
    if (nr!= nc):
        print "Error: Non square matrix! exiting"
        exit()
    vmsym = Matrix(vmnp)
    vmsymInv = vmsym.inv_mod(mod)
    vmnpInv = numpy.array(vmsymInv)
    print "vmnpInv: ", vmnpInv, "\n"
    k = nr
    vmtest = [[1 for i in range(k)] for j in range(k)] # just a 2-d list
    vmtestInv = vmsym*vmsymInv
    for i in range(k):
        for j in range(k):
            #print i, j, vmtrx2[i,j] % mod
            vmtest[i][j] = vmtestInv[i,j] % mod
    print "test vmk*vkinv % mod \n:", vmtest
    return vmnpInv
if __name__ == '__main__':
    #p = 271
    p = 115792089210356248762697446949407573530086143415290314195533631308867097853951
    vm = numpy.array([[1,1,1,1], [1, 2, 4, 8], [1, 4, 16, 64], [1, 5, 25, 125]])
    #vminv = modMatInv(vm, p)
    vminv = matInvMod(vm, p)
    print vminv
    vmtestnp = vm.dot(vminv)%p # test mtrx inversion
    print vmtestnp
Unfortunately numpy does not have modular arithmetic implementations. You can always code up the proposed algorithm using row reduction or determinants as demonstrated here. A modular inverse seems to be quite useful for cryptography.
I wrote a method to calculate the cosine distance between two arrays:
def cosine_distance(a, b):
    if len(a) != len(b):
        return False
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):
        numerator += a[i]*b[i]
        denoma += abs(a[i])**2
        denomb += abs(b[i])**2
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
Running it can be very slow on a large array. Is there an optimized version of this method that would run faster?
Update: I've tried all the suggestions to date, including scipy. Here's the version to beat, incorporating suggestions from Mike and Steve:
def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError, "a and b must be same length" #Steve
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)): #Mike's optimizations:
        ai = a[i] #only calculate once
        bi = b[i]
        numerator += ai*bi #faster than exponent (barely)
        denoma += ai*ai #strip abs() since it's squaring
        denomb += bi*bi
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
If you can use SciPy, you can use cosine from spatial.distance:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
If you can't use SciPy, you could try to obtain a small speedup by rewriting your Python (EDIT: but it didn't work out like I thought it would, see below).
from itertools import izip
from math import sqrt
def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError, "a and b must be same length"
    numerator = sum(tup[0] * tup[1] for tup in izip(a,b))
    denoma = sum(avalue ** 2 for avalue in a)
    denomb = sum(bvalue ** 2 for bvalue in b)
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
It is better to raise an exception when the lengths of a and b are mismatched.
By using generator expressions inside of calls to sum() you can calculate your values with most of the work being done by the C code inside of Python. This should be faster than using a for loop.
I haven't timed this so I can't guess how much faster it might be. But the SciPy code is almost certainly written in C or C++ and it should be about as fast as you can get.
If you are doing bioinformatics in Python, you really should be using SciPy anyway.
EDIT: Darius Bacon timed my code and found it slower. So I timed my code and... yes, it is slower. The lesson for all: when you are trying to speed things up, don't guess, measure.
I am baffled as to why my attempt to put more work on the C internals of Python is slower. I tried it for lists of length 1000 and it was still slower.
I can't spend any more time on trying to hack the Python cleverly. If you need more speed, I suggest you try SciPy.
EDIT: I just tested by hand, without timeit. I find that for short a and b, the old code is faster; for long a and b, the new code is faster; in both cases the difference is not large. (I'm now wondering if I can trust timeit on my Windows computer; I want to try this test again on Linux.) I wouldn't change working code to try to get it faster. And one more time I urge you to try SciPy. :-)
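In the spirit of "don't guess, measure", a small timing comparison could be set up as below. This is only a sketch in Python 3 (zip instead of izip), with made-up function names and arbitrary test data. Actual timings depend on the machine and Python version, so no numbers are claimed here.

```python
import timeit
from math import sqrt

def cosine_loop(a, b):
    # Explicit loop, one pass over both sequences.
    ab = aa = bb = 0.0
    for ai, bi in zip(a, b):
        ab += ai * bi
        aa += ai * ai
        bb += bi * bi
    return 1 - ab / sqrt(aa * bb)

def cosine_genexp(a, b):
    # Three generator expressions fed to sum().
    ab = sum(x * y for x, y in zip(a, b))
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    return 1 - ab / sqrt(aa * bb)

def compare(length, repeats=200):
    """Return total seconds per implementation for `repeats` calls."""
    a = [float(i % 7 + 1) for i in range(length)]
    b = [float(i % 5 + 1) for i in range(length)]
    return {fn.__name__: timeit.timeit(lambda: fn(a, b), number=repeats)
            for fn in (cosine_loop, cosine_genexp)}

for n in (10, 1000):
    print(n, compare(n))
```

Running it for both short and long inputs shows whether the ranking changes with length, as reported above.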
(I originally thought) you're not going to speed it up a lot without breaking out to C (like numpy or scipy) or changing what you compute. But here's how I'd try that, anyway:
from itertools import imap
from math import sqrt
from operator import mul
def cosine_distance(a, b):
    assert len(a) == len(b)
    return 1 - (sum(imap(mul, a, b))
                / sqrt(sum(imap(mul, a, a))
                       * sum(imap(mul, b, b))))
It's roughly twice as fast in Python 2.6 with 500k-element arrays. (After changing map to imap, following Jarret Hardie.)
Here's a tweaked version of the original poster's revised code:
from itertools import izip
from math import sqrt
def cosine_distance(a, b):
    assert len(a) == len(b)
    ab_sum, a_sum, b_sum = 0, 0, 0
    for ai, bi in izip(a, b):
        ab_sum += ai * bi
        a_sum += ai * ai
        b_sum += bi * bi
    return 1 - ab_sum / sqrt(a_sum * b_sum)
It's ugly, but it does come out faster. . .
Edit: And try Psyco! It speeds up the final version by another factor of 4. How could I forget?
No need to take abs() of a[i] and b[i] if you're squaring it.
Store a[i] and b[i] in temporary variables, to avoid doing the indexing more than once.
Maybe the compiler can optimize this, but maybe not.
Check into the **2 operator. Is it simplifying it into a multiply, or is it using a general power function (log - multiply by 2 - antilog).
Don't do sqrt twice (though the cost of that is small). Do sqrt(denoma * denomb).
Similar to Darius Bacon's answer, I've been toying with operator and itertools to produce a faster answer. The following seems to be 1/3 faster on a 500-item array according to timeit:
from math import sqrt
from itertools import imap
from operator import mul
def op_cosine(a, b):
    dot_prod = sum(imap(mul, a, b))
    a_veclen = sqrt(sum(i ** 2 for i in a))
    b_veclen = sqrt(sum(i ** 2 for i in b))
    return 1 - dot_prod / (a_veclen * b_veclen)
This is faster for arrays of around 1000+ elements.
from numpy import array, sqrt
def cosine_distance(a, b):
    a=array(a)
    b=array(b)
    numerator=(a*b).sum()
    denoma=(a*a).sum()
    denomb=(b*b).sum()
    result = 1 - numerator / sqrt(denoma*denomb)
    return result
Using the C code inside of SciPy wins big for long input arrays. Using simple and direct Python wins for short input arrays; Darius Bacon's izip()-based code benchmarked out best. Thus, the ultimate solution is to decide which one to use at runtime, based on the length of the input arrays:
from scipy.spatial.distance import cosine as scipy_cos_dist
from itertools import izip
from math import sqrt
def cosine_distance(a, b):
    len_a = len(a)
    assert len_a == len(b)
    if len_a > 200: # 200 is a magic value found by benchmark
        return scipy_cos_dist(a, b)
    # function below is basically just Darius Bacon's code
    ab_sum = a_sum = b_sum = 0
    for ai, bi in izip(a, b):
        ab_sum += ai * bi
        a_sum += ai * ai
        b_sum += bi * bi
    return 1 - ab_sum / sqrt(a_sum * b_sum)
I made a test harness that tested the functions with different length inputs, and found that around length 200 the SciPy function started to win. The bigger the input arrays, the bigger it wins. For very short length arrays, say length 3, the simpler code wins. This function adds a tiny amount of overhead to decide which way to do it, then does it the best way.
In case you are interested, here is the test harness:
from darius2 import cosine_distance as fn_darius2
fn_darius2.__name__ = "fn_darius2"
from ult import cosine_distance as fn_ult
fn_ult.__name__ = "fn_ult"
from scipy.spatial.distance import cosine as fn_scipy
fn_scipy.__name__ = "fn_scipy"
import random
import time
lst_fn = [fn_darius2, fn_scipy, fn_ult]
def run_test(fn, lst0, lst1, test_len):
    start = time.time()
    for _ in xrange(test_len):
        fn(lst0, lst1)
    end = time.time()
    return end - start
for data_len in range(50, 500, 10):
    a = [random.random() for _ in xrange(data_len)]
    b = [random.random() for _ in xrange(data_len)]
    print "len(a) ==", len(a)
    test_len = 10**3
    for fn in lst_fn:
        n = fn.__name__
        r = fn(a, b)
        t = run_test(fn, a, b, test_len)
        print "%s:\t%f seconds, result %f" % (n, t, r)
def cd(a,b):
    if(len(a)!=len(b)):
        raise ValueError, "a and b must be the same length"
    rn = range(len(a))
    adb = sum([a[k]*b[k] for k in rn])
    nma = sqrt(sum([a[k]*a[k] for k in rn]))
    nmb = sqrt(sum([b[k]*b[k] for k in rn]))
    result = 1 - adb / (nma*nmb)
    return result
Your updated solution still has two square roots. You can reduce this to one by replacing the sqrt line with:
result = 1 - numerator / sqrt(denoma*denomb)
A multiply is typically quite a bit quicker than a sqrt. It might not seem much as it is only called once in the function, but it sounds like you are calculating a lot of cosine distances, so the improvement will add up.
Your code looks like it should be ripe for vector optimizations. So if cross-platform support is not an issue and you want to speed it up even further, you could write the cosine distance code in C and make sure your compiler is aggressively vectorizing the resulting code (even a Pentium II is capable of some floating-point vectorisation).