Python - calculate multinomial probability density functions on large dataset?

I originally intended to use MATLAB to tackle this problem, but the built-in function has limitations that do not suit my goal. The same limitation occurs in NumPy.
I have two tab-delimited files. The first is a file showing amino acid residue, frequency and count for an in-house database of protein structures, i.e.
A 0.25 1
S 0.25 1
T 0.25 1
P 0.25 1
The second file consists of quadruplets of amino acids and the number of times they occur, i.e.
ASTP 1
Note, there are >8,000 such quadruplets.
Based on the background frequency of occurrence of each amino acid and the count of quadruplets, I aim to calculate the multinomial probability density function for each quadruplet and subsequently use it as the expected value in a maximum likelihood calculation.
The multinomial distribution is as follows:
f(x|n, p) = n!/(x1!*x2!*...*xk!)*((p1^x1)*(p2^x2)*...*(pk^xk))
where x is the number of each of k outcomes in n trials with fixed probabilities p. n is 4 in all cases in my calculation.
I have created four functions to calculate this distribution.
# functions for multinomial distribution
def expected_quadruplets(x, y):
    expected = x*y
    return expected
# calculates the probabilities of occurrence raised to the number of occurrences
def prod_prob(p1, a, p2, b, p3, c, p4, d):
    prob_prod = (pow(p1, a))*(pow(p2, b))*(pow(p3, c))*(pow(p4, d))
    return prob_prod
# factorial() and multinomial_coefficient() work in tandem to calculate C, the multinomial coefficient
def factorial(n):
    if n <= 1:
        return 1
    return n*factorial(n-1)
def multinomial_coefficient(a, b, c, d):
    n = 24.0  # 4! for n = 4 trials
    multi_coeff = (n/(factorial(a) * factorial(b) * factorial(c) * factorial(d)))
    return multi_coeff
The problem is how best to structure the data in order to tackle the calculation most efficiently, in a manner that I can read (you guys write some cryptic code :-)) and that will not create an overflow or runtime error.
To date my data is represented as nested lists.
amino_acids = [['A', '0.25', '1'], ['S', '0.25', '1'], ['T', '0.25', '1'], ['P', '0.25', '1']]
quadruplets = [['ASTP', '1']]
I initially intended calling these functions within a nested for loop but this resulted in runtime errors or overflow errors. I know that I can reset the recursion limit but I would rather do this more elegantly.
I had the following:
for i in quadruplets:
    quad = i[0].split(' ')
    for j in amino_acids:
        for k in quadruplets:
            for v in k:
                if j[0] == v:
                    multinomial_coefficient(int(j[2]), int(j[2]), int(j[2]), int(j[2]))
I haven't really gotten to how to incorporate the other functions yet. I think that my current nested list arrangement is suboptimal.
I wish to compare each letter within the string 'ASTP' with the first component of each sub list in amino_acids. Where a match exists, I wish to pass the appropriate numeric values to the functions using indices.
Is there a better way? Can I append the appropriate numbers for each amino acid and quadruplet to a temporary data structure within a loop, pass this to the functions and clear it for the next iteration?
Thanks, S :-)

This might be tangential to your original question, but I strongly advise against calculating factorials explicitly due to overflows. Instead, make use of the fact that factorial(n) = gamma(n+1), use the logarithm of the gamma function and use additions instead of multiplications, subtractions instead of divisions. scipy.special contains a function named gammaln, which gives you the logarithm of the gamma function.
from numpy import array, log, exp
from scipy.special import gammaln
def log_factorial(x):
    """Returns the logarithm of x!
    Also accepts lists and NumPy arrays in place of x."""
    return gammaln(array(x)+1)
def multinomial(xs, ps):
    n = sum(xs)
    xs, ps = array(xs), array(ps)
    result = log_factorial(n) - sum(log_factorial(xs)) + sum(xs * log(ps))
    return exp(result)
If you don't want to install SciPy just for the sake of gammaln, here is a replacement in pure Python (of course it is slower and it is not vectorized like the one in SciPy):
from math import log
def gammaln(n):
    """Logarithm of Euler's gamma function for discrete values."""
    if n < 1:
        return float('inf')
    if n < 3:
        return 0.0
    c = [76.18009172947146, -86.50532032941677,
         24.01409824083091, -1.231739572450155,
         0.001208650973866179, -0.5395239384953 * 0.00001]
    x, y = float(n), float(n)
    tm = x + 5.5
    tm -= (x + 0.5) * log(tm)
    se = 1.0000000000000190015
    for j in range(6):
        y += 1.0
        se += c[j] / y
    return -tm + log(2.5066282746310005 * se / x)
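If SciPy is not an option, the standard library may already be enough: current Python versions ship math.lgamma, which returns the logarithm of the gamma function for a single number. Here is a minimal sketch of the same log-space calculation using only the math module. The name multinomial_pmf is mine and does not come from the answer above.

```python
from math import exp, lgamma, log

def multinomial_pmf(xs, ps):
    """Multinomial probability of the counts xs under probabilities ps."""
    n = sum(xs)
    # log(n!) - sum(log(x_i!)), using lgamma(k + 1) == log(k!)
    log_coeff = lgamma(n + 1) - sum(lgamma(x + 1) for x in xs)
    # sum(x_i * log(p_i)); outcomes with x_i == 0 contribute nothing
    log_prob = sum(x * log(p) for x, p in zip(xs, ps) if x > 0)
    return exp(log_coeff + log_prob)

# Four different residues, each with background frequency 0.25:
print(multinomial_pmf([1, 1, 1, 1], [0.25, 0.25, 0.25, 0.25]))  # 4!/4**4, about 0.09375
```

Because everything stays in log space until the final exp, the intermediate values remain small even when the counts are large.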
Another easy trick is to use a dict for amino_acids, indexed by the residue itself. Given your original amino_acids structure, you can do this:
amino_acid_dict = dict((amino_acid[0], amino_acid) for amino_acid in amino_acids)
print(amino_acid_dict)
{"A": ["A", 0.25, 1], "S": ["S", 0.25, 1], "T": ["T", 0.25, 1], "P": ["P", 0.25, 1]}
You can then look up the frequencies or counts by residue more easily:
freq_A = amino_acid_dict["A"][1]
count_A = amino_acid_dict["A"][2]
This saves you some time in the main loop:
# each entry of quadruplets is [residues, count]; iterate over the residue string
for quadruplet, _ in quadruplets:
    # float()/int() also cover the case where the values were read in as text
    probs = [float(amino_acid_dict[aa][1]) for aa in quadruplet]
    counts = [int(amino_acid_dict[aa][2]) for aa in quadruplet]
    print(quadruplet, multinomial(counts, probs))
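One thing to watch in the loop above: the counts it passes are the database counts of each residue, not how often each residue occurs inside the quadruplet, which is what x stands for in the formula from the question. Below is a sketch of that second reading, assuming the two files have been parsed into the nested lists shown in the question. The helper name quadruplet_probability is hypothetical.

```python
from collections import Counter
from math import factorial

amino_acids = [['A', '0.25', '1'], ['S', '0.25', '1'],
               ['T', '0.25', '1'], ['P', '0.25', '1']]
quadruplets = [['ASTP', '1']]

# residue -> background frequency, converted from text once
freqs = {residue: float(freq) for residue, freq, _count in amino_acids}

def quadruplet_probability(quad, freqs):
    """Multinomial probability of the residue composition of quad."""
    counts = Counter(quad)            # e.g. 'AASP' -> A: 2, S: 1, P: 1
    coeff = factorial(len(quad))      # n!, with n = 4 for a quadruplet
    prob = 1.0
    for residue, x in counts.items():
        coeff //= factorial(x)        # divide by x_i!
        prob *= freqs[residue] ** x   # multiply by p_i ** x_i
    return coeff * prob

for quad, observed in quadruplets:
    print(quad, observed, quadruplet_probability(quad, freqs))
```

With n fixed at 4 the factorials never exceed 24, so plain integer arithmetic is safe here and no recursion is needed. The log-gamma route only becomes necessary for much larger counts.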

Related

how to iterate through an array of complex numbers

I'm trying to calculate the energy of a complex-valued signal. Passing an array of complex numbers into the energy function, it separates the real and imaginary parts of the number and converts them into their polar equivalents. It then returns the sum of the squares of the real parts of each complex number. Every time I try to call the energy function, it says that the arctan2 ufunc is not supported for the input types.
def toExponential(a, b):
    c = np.sqrt(a**2 + b**2)
    d = np.arctan2(b,a)
    return (c,d)
def energy(x):
    sum = 0
    for i in x:
        e = ((i + np.conj(i))/2)
        f = ((i - np.conj(i)/(1j * 2)))
        r,i = toExponential(e,f)
        sum = r**2 + sum
    return sum

I think you are passing e and f, which are still complex-typed values, to np.arctan2(b, a) instead of the real and imaginary parts of the complex number. You can get the magnitude and phase directly with np.abs(i) and np.angle(i).
Try this out:
def energy(x):
    sum = 0
    for i in x:
        magnitude, phase = np.abs(i), np.angle(i)
        sum += magnitude**2
    return sum
In this example, the magnitude and phase of each complex number are extracted with the numpy.abs() and numpy.angle() functions, respectively. The energy of a complex signal is then the sum of the squared magnitudes of its samples.

If speed is important, here is a vectorized version of @MohamedFathallah's answer:
def energy(x):
    return np.sum(np.abs(x)**2)
or
def energy(x):
    return np.sum(np.real(x)**2 + np.imag(x)**2)
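As a quick sanity check on the versions above, the sum of squared magnitudes can also be written as the inner product of the signal with its own conjugate. This is a small sketch with made-up sample values. It uses np.vdot, which conjugates its first argument.

```python
import numpy as np

def energy(x):
    """Sum of squared magnitudes of a complex-valued signal."""
    x = np.asarray(x)
    # vdot conjugates its first argument, so this is sum(conj(x) * x)
    return np.vdot(x, x).real

signal = np.array([3 + 4j, 1 - 1j])   # illustrative values only
print(energy(signal))                 # |3+4j|^2 + |1-1j|^2 = 25 + 2
print(np.sum(np.abs(signal) ** 2))    # same quantity via abs()
```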

Recreating R Quantile Type 2 in Numpy

I'm migrating some legacy code from R to Python and I'm having trouble matching the quantile results with numpy percentile.
Given the following list of numbers:
a1 = [
5.75,6.13333333333333,7.13636363636364,9,10.1,4.80952380952381,8.82926829268293,4.7906976744186,3.83333333333333,6,6.1,
8.88235294117647,30,5.7,3.98507462686567,6.83333333333333,8.39805825242718,4.78260869565217,7.26356589147287,5.67857142857143,
3.58333333333333,6.69230769230769,14.3333333333333,14.3333333333333,5.125,5.16216216216216,5.36363636363636,10.7142857142857,
4.90909090909091,7.5,8,6,6.93939393939394,10.4,6,6.8,5.33333333333333,10.3076923076923,4.5625,5.4,6.44,3.36363636363636,
11.1666666666667,4.5,7.35714285714286,10.6363636363636,9.26746031746032,3.83333333333333,5.75,9.14285714285714,8.27272727272727,
5,5.92307692307692,5.23076923076923,4.09375,6.25,4.63888888888889,6.07142857142857,5,5.42222222222222,3.93892045454545,4.8,
8.71428571428571,6.25925925925926,4.12,5.30769230769231,4.26086956521739,5.22222222222222,4.64285714285714,5,3.64705882352941,
5.33333333333333,3.65217391304348,3.54166666666667,10.0952380952381,3.38235294117647,8.67123287671233,2.66666666666667,3.5,4.875,
4.5,6.2,5.45454545454545,4.89189189189189,4.71428571428571,1,5.33333333333333,6.09090909090909,4.36756756756757,6,5.17197452229299,
4.48717948717949,5.01219512195122,4.83098591549296,5.25,8.52,5.47692307692308,5.45454545454545,8.6578947368421,8.35714285714286,3.25,
8.5,4,5.95652173913043,7.05882352941176,7.5,8.6,8.49122807017544,5.14285714285714,4,13.3294117647059,9.55172413793103,5.57446808510638,
4.5,8,4.11764705882353,3.9,5.14285714285714,6,4.66666666666667,6,3.75,4.93333333333333,4.5,5.21666666666667,6.53125,6,7,7.28333333333333,
7.34615384615385,7.15277777777778,8.07936507936508,11.609756097561
]
Using quantile in R such that
quantile(a1, probs=.05, type=2)
Gives a result of 3.541667
Trying all of the interpolation methods in numpy to find the same result:
{x:np.percentile(a1,q=5, interpolation=x) for x in ['linear','lower','higher','nearest','midpoint']}
Yields
{'linear': 3.566666666666666,
'lower': 3.54166666666667,
'higher': 3.58333333333333,
'nearest': 3.58333333333333,
'midpoint': 3.5625}
As we can see, the lower interpolation method returns the same result as R quantile type 2.
However again with a different quantile in R we get different results:
quantile(a1, probs=.95, type=2)
Gives a result of 10.71429
And with numpy:
{x:np.percentile(a1,q=95, interpolation=x) for x in ['linear','lower','higher','nearest','midpoint']}
Yields
{'linear': 10.667532467532439,
'lower': 10.6363636363636,
'higher': 10.7142857142857,
'nearest': 10.6363636363636,
'midpoint': 10.67532467532465}
In this case the higher interpolation method returns the same result.
I'm hoping that someone familiar enough with the R quantile types can help me reproduce the same quantile logic in numpy.

You can implement this yourself. With type=2 it's a rather simple calculation. You either take the next highest order statistic or, at a discontinuity (e.g. with 100 values and p=0.06, which falls exactly on the 6th value), you take the average of that order statistic and the next greatest order statistic.
import numpy as np
def R_type2(arr, p):
    """
    arr : array-like
    p : float between [0, 1]
    """
    #m=0 for Q_2(p) in R
    x = np.sort(arr)
    n = len(x)
    aleph = n*p
    k = np.floor(np.array(aleph).clip(1, n-1)).astype(int)
    gamma = {False: 1, True: 0.5}.get(aleph==k) # Discontinuity or not
    # Deal with case where it should be smallest value
    if aleph < 1:
        return x[k-1] # x[0]
    else:
        return (1.-gamma)*x[k-1] + gamma*x[k]
R_type2(a1, 0.05)
#3.54166666666667
R_type2(a1, 0.95)
#10.7142857142857
A word of caution: k will be an integer while n*p is a float. In general it's a very bad idea to test aleph==k, because this leads to problems with floating point inaccuracies. For instance, with 100 numbers p=0.07 is NOT considered a discontinuity because 0.07 cannot be represented precisely. However, because R seems to implement a pure equality check, I left it like the above for consistency.
I personally would favor changing from the equality {False: 1, True: 0.5}.get(aleph==k) to {False: 1, True: 0.5}.get(np.isclose(aleph,k)), so that floating point issues don't become a problem.
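To make the floating point caveat concrete, here is a sketch of a tolerance-based variant. It is my own variation, not R's algorithm: the function name r_type2_tolerant is hypothetical, and it snaps n*p to the nearest integer whenever np.isclose says they match, so it also catches products that land slightly below an integer. Results can therefore differ from R in exactly those borderline cases.

```python
import numpy as np

def r_type2_tolerant(arr, p):
    """Type-2 style quantile with a tolerance check at the jump points."""
    x = np.sort(np.asarray(arr, dtype=float))
    n = len(x)
    aleph = n * p
    if aleph < 1:
        return x[0]                       # below the first order statistic
    nearest = int(round(aleph))
    if np.isclose(aleph, nearest):
        k = min(nearest, n - 1)           # keep x[k] inside the array
        on_jump = np.isclose(aleph, k)    # False only when clipped at p = 1
    else:
        k = min(int(np.floor(aleph)), n - 1)
        on_jump = False
    gamma = 0.5 if on_jump else 1.0
    return (1.0 - gamma) * x[k - 1] + gamma * x[k]

values = np.arange(1, 101)                # the integers 1..100
print(r_type2_tolerant(values, 0.06))     # average of the 6th and 7th values
print(r_type2_tolerant(values, 0.07))     # 100*0.07 is not exactly 7, still averaged
```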

Does Python have a function which computes multinomial coefficients?

I was looking for a Python library function which computes multinomial coefficients.
I could not find any such function in any of the standard libraries.
For binomial coefficients (of which multinomial coefficients are a generalization) there is scipy.special.binom and also scipy.misc.comb. Also, numpy.random.multinomial draws samples from a multinomial distribution, and sympy.ntheory.multinomial.multinomial_coefficients returns a dictionary related to multinomial coefficients.
However, I could not find a multinomial coefficients function proper, which given a,b,...,z returns (a+b+...+z)!/(a! b! ... z!). Did I miss it? Is there a good reason there is none available?
I would be happy to contribute an efficient implementation to SciPy, say. (I would have to figure out how to contribute, as I have never done this.)
For background, they do come up when expanding (a+b+...+z)^n. Also, they count the ways of depositing a+b+...+z distinct objects into distinct bins such that the first bin contains a objects, etc. I need them occasionally for a Project Euler problem.
BTW, other languages do offer this function: Mathematica, MATLAB, Maple.

To partially answer my own question, here is my simple and fairly efficient implementation of the multinomial function:
def multinomial(lst):
    res, i = 1, 1
    for a in lst:
        for j in range(1,a+1):
            res *= i
            res //= j
            i += 1
    return res
It seems from the comments so far that no efficient implementation of the function exists in any of the standard libraries.
Update (January 2020). As Don Hatch has pointed out in the comments, this can be further improved by looking for the largest argument (especially for the case that it dominates all others):
def multinomial(lst):
    res, i = 1, sum(lst)
    i0 = lst.index(max(lst))
    for a in lst[:i0] + lst[i0+1:]:
        for j in range(1,a+1):
            res *= i
            res //= j
            i -= 1
    return res

No, there is not a built-in multinomial library or function in Python.
Anyway, this time math can help you. A simple method for calculating the multinomial coefficient while keeping an eye on performance is to rewrite it as a product of binomial coefficients:
(n1+n2+...+nk)!/(n1! n2! ... nk!) = C(n1, n1) * C(n1+n2, n2) * ... * C(n1+n2+...+nk, nk)
where of course
C(n, k) = n!/(k! (n-k)!)
Thanks to scipy.special.binom and the magic of recursion you can solve the problem like this:
from scipy.special import binom
def multinomial(params):
    if len(params) == 1:
        return 1
    return binom(sum(params), params[-1]) * multinomial(params[:-1])
where params = [n1, n2, ..., nk].
Note: Splitting the multinomial into a product of binomials also helps to prevent overflow in general.

You wrote "sympy.ntheory.multinomial.multinomial_coefficients returns a dictionary related to multinomial coefficients", but it is not clear from that comment if you know how to extract the specific coefficients from that dictionary. Using the notation from the wikipedia link, the SymPy function gives you all the multinomial coefficients for the given m and n. If you only want a specific coefficient, just pull it out of the dictionary:
In [39]: from sympy import ntheory
In [40]: def sympy_multinomial(params):
    ...:     m = len(params)
    ...:     n = sum(params)
    ...:     return ntheory.multinomial_coefficients(m, n)[tuple(params)]
    ...:
In [41]: sympy_multinomial([1, 2, 3])
Out[41]: 60
In [42]: sympy_multinomial([10, 20, 30])
Out[42]: 3553261127084984957001360
Busy Beaver gave an answer written in terms of scipy.special.binom. A potential problem with that implementation is that binom(n, k) returns a floating point value. If the coefficient is large enough, it will not be exact, so it would probably not help you with a Project Euler problem. Instead of binom, you can use scipy.special.comb, with the argument exact=True. This is Busy Beaver's function, modified to use comb:
In [46]: from scipy.special import comb
In [47]: def scipy_multinomial(params):
    ...:     if len(params) == 1:
    ...:         return 1
    ...:     coeff = (comb(sum(params), params[-1], exact=True) *
    ...:              scipy_multinomial(params[:-1]))
    ...:     return coeff
    ...:
In [48]: scipy_multinomial([1, 2, 3])
Out[48]: 60
In [49]: scipy_multinomial([10, 20, 30])
Out[49]: 3553261127084984957001360

Here are two approaches, one using factorials, one using Stirling's approximation.
Using factorials
You can define a function to return multinomial coefficients in a single line using vectorised code (instead of for-loops) as follows:
from scipy.special import factorial
def multinomial_coeff(c):
    return factorial(c.sum()) / factorial(c).prod()
(Where c is an np.ndarray containing the number of counts for each different object). Usage example:
>>> import numpy as np
>>> coeffs = np.array([2, 3, 4])
>>> multinomial_coeff(coeffs)
1260.0
In some cases this might be slower because you will be computing certain factorial expressions multiple times, in other cases this might be faster because I believe that numpy naturally parallelises vectorised code. Also this reduces the required number of lines in your program and is arguably more readable. If someone has the time to run speed tests on these different options then I'd be interested to see the results.
Using Stirling's approximation
In fact the logarithm of the multinomial coefficient is much faster to compute (based on Stirling's approximation) and allows computation of much larger coefficients:
from scipy.special import gammaln
def log_multinomial_coeff(c):
    return gammaln(c.sum()+1) - gammaln(c+1).sum()
Usage example:
>>> import numpy as np
>>> coeffs = np.array([2, 3, 4])
>>> np.exp(log_multinomial_coeff(coeffs))
1259.999999999999

Your own answer (the accepted one) is quite good, and is especially simple. However, it does have one significant inefficiency: your outer loop for a in lst is executed one more time than is necessary. In the first pass through that loop, the values of i and j are always identical, so the multiplications and divisions do nothing. In your example multinomial([123, 134, 145]), there are 123 unneeded multiplications and divisions, adding time to the code.
I suggest finding the maximum value in the parameters and removing it, so those unneeded operations are not done. That adds complexity to the code but reduces the execution time, especially for short lists of large numbers. My code below executes multcoeff(123, 134, 145) in 111 microseconds, while your code takes 141 microseconds. That is not a large increase, but that could matter. So here is my code. This also takes individual values as parameters rather than a list, so that is another difference from your code.
def multcoeff(*args):
    """Return the multinomial coefficient
    (n1 + n2 + ...)! / n1! / n2! / ..."""
    if not args: # no parameters
        return 1
    # Find and store the index of the largest parameter so we can skip
    # it (for efficiency)
    skipndx = args.index(max(args))
    newargs = args[:skipndx] + args[skipndx + 1:]
    result = 1
    num = args[skipndx] + 1 # a factor in the numerator
    for n in newargs:
        for den in range(1, n + 1): # a factor in the denominator
            result = result * num // den
            num += 1
    return result

Starting with Python 3.8, since the standard library now includes the math.comb function (binomial coefficient) and since the multinomial coefficient can be computed as a product of binomial coefficients, we can implement it without external libraries:
import math
def multinomial(*params):
    return math.prod(math.comb(sum(params[:i]), x) for i, x in enumerate(params, 1))
multinomial(10, 20, 30) # 3553261127084984957001360

How to do a Sigma in python 3

I'm trying to make a calculator for something, but the formulas use a sigma. I have no idea how to do a sigma in Python; is there an operator for it?
I'll put a link here to a page that has the formulas on it for illustration: http://fromthedepths.gamepedia.com/User:Evil4Zerggin/Advanced_cannon

A sigma (∑) is a Summation operator. It evaluates a certain expression many times, with slightly different variables, and returns the sum of all those expressions.
For example, for the ballistic coefficient formula, the Python implementation would look something like this:
# Just guessing some values. You have to search the actual values in the wiki.
ballistic_coefficients = [0.3, 0.5, 0.1, 0.9, 0.1]
total_numerator = 0
total_denominator = 0
for i, coefficient in enumerate(ballistic_coefficients):
    total_numerator += 2**(-i) * coefficient
    total_denominator += 2**(-i)
print('Total:', total_numerator / total_denominator)
You may want to look at the enumerate function, and beware precision problems.

The easiest way to do this is to create a sigma function that returns the summation. It is easy to follow and you don't need a library; you just need to understand the logic.
def sigma(first, last, const):
    sum = 0
    for i in range(first, last + 1):
        sum += const * i
    return sum
# first : the first value of n (the index of summation)
# last : the last value of n
# const : the constant that each n is multiplied by before it is added to the sum

An efficient way to do this in Python is to use reduce().
To solve
 3
 Σ  i
i=1
You can use the following:
from functools import reduce
result = reduce(lambda a, x: a + x, [0]+list(range(1,3+1)))
print(result)
reduce() will take arguments of a callable and an iterable, and return one value as specified by the callable. The accumulator is a and is set to the first value (0), and then the current sum following that. The current value in the iterable is set to x and added to the accumulator. The final accumulator is returned.
The formula to the right of the sigma is represented by the lambda. The sequence we are summing is represented by the iterable. You can change these however you need.
For example, if I wanted to solve:
Σ π*i^2
i
For a sequence I = [2, 3, 5], I could do the following:
reduce(lambda a, x: a + 3.14*x*x, [0]+[2,3,5])
You can see the following two code lines produce the same result:
>>> reduce(lambda a, x: a + 3.14*x*x, [0]+[2,3,5])
119.32
>>> (3.14*2*2) + (3.14*3*3) + (3.14*5*5)
119.32
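For plain additions like these, the built-in sum with a generator expression gives the same totals and may be easier to read than reduce. A minimal sketch follows. The helper name sigma and its parameters are my own choice, not a built-in.

```python
def sigma(first, last, term):
    """Sum term(i) for i = first, ..., last (both ends included)."""
    return sum(term(i) for i in range(first, last + 1))

print(sigma(1, 3, lambda i: i))                # 1 + 2 + 3 = 6

# Summing over an arbitrary sequence instead of a range of integers:
print(sum(3.14 * x * x for x in [2, 3, 5]))    # about 119.32
```

Passing the term as a function keeps the index range and the expression being summed separate, which mirrors how the sigma notation is written.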

I've looked at all the answers that different programmers have given to your query, but I was unable to understand any of them, maybe because I am a high school student. Anyway, in my opinion using a list will definitely reduce some of the pain of coding, so here is what I think is the simplest way to form a sigma function.
#creating a sigma function
a=int(input("enter a number for sigma "))
mylst=[]
for i in range(1,a+1):
    mylst.append(i)
b=sum(mylst)
print(mylst)
print(b)

Capital sigma (Σ) applies the expression after it to all members of a range and then sums the results.
In Python, sum will take the sum of a range, and you can write the expression as a comprehension:
For example
Speed Coefficient
A factor in muzzle velocity is the speed coefficient, which is a
weighted average of the speed modifiers si of the (non-casing)
parts, where each component i starting at the head has half the
weight of the previous:
The head will thus always determine at least 25% of the speed
coefficient.
For example, suppose the shell has a Composite Head (speed modifier
1.6), a Solid Warhead Body (speed modifier 1.3), and a Supercavitation
Base (speed modifier 0.9). Then we have
s0=1.6
s1=1.3
s2=0.9
From the example we can see that i starts from 0, not the usual 1, and so we can do:
def speed_coefficient(parts):
    return (
        sum(0.75 ** i * si for i, si in enumerate(parts))
        /
        sum(0.75 ** i for i, si in enumerate(parts))
    )
>>> speed_coefficient([1.6, 1.3, 0.9])
1.3324324324324326

import numpy as np
def sigma(s,e):
    # sums (i + 1) for i = s, ..., e - 1, i.e. the integers s + 1 through e
    x = np.arange(s,e)
    return np.sum([x+1])

Manual fft not giving me same results as fft

import numpy as np
import matplotlib.pyplot as pp
curve = np.genfromtxt('C:\Users\latel\Desktop\kool\Neuro\prax2\data\curve.csv',dtype = 'float', delimiter = ',')
curve_abs2 = np.empty_like(curve)
z = 1j
N = len(curve)
for i in range(0,N-1):
    curve_abs2[i] =0
    for k in range(0,N-1):
        curve_abs2[i] += (curve[i]*np.exp((-1)*z*(np.pi)*i*((k-1)/N)))
for i in range(0,N):
    curve_abs2[i] = abs(curve_abs2[i])/(2*len(curve_abs2))
#curve_abs = (np.abs(np.fft.fft(curve)))
#pp.plot(curve_abs)
pp.plot(curve_abs2)
pp.show()
The code behind # gives me 3 values. But this is just ... different
Wrong ^^ this code: http://www.upload.ee/image/3922681/Ex5problem.png
Correct using numpy.fft.fft(): http://www.upload.ee/image/3922682/Ex5numpyformulas.png

There are several problems:
You are assigning complex values to the elements of curve_abs2, so it should be declared to be complex, e.g. curve_abs2 = np.empty_like(curve, dtype=np.complex128). (And I would recommend using the name, say, curve_fft instead of curve_abs2.)
In Python, range(low, high) gives the sequence [low, low + 1, ..., high - 2, high - 1], so instead of range(0, N - 1), you must use range(0, N) (which can be simplified to range(N), if you want).
You are missing a factor of 2 in your formula. You could fix this by using z = 2j.
In the expression that is being summed in the inner loop, you are indexing curve as curve[i], but this should be curve[k].
Also in that expression, you don't need to subtract 1 from k, because the k loop ranges from 0 to N - 1.
Because k and N are integers and you are using Python 2.7, the division in the expression (k-1)/N will be integer division, and you'll get 0 for all k. To fix this and the previous problem, you can change that term to k / float(N).
If you fix those issues, when the first double loop finishes, the array curve_abs2 (now a complex array) should match the result of np.fft.fft(curve). It won't be exactly the same, but the differences should be very small.
You could eliminate that double loop altogether using numpy vectorized calculations, but that is a topic for another question.
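Putting those fixes together, a sketch of the corrected double loop could look like the following. The input is a made-up sine wave standing in for curve.csv, and the function name manual_dft is hypothetical. The comparison against np.fft.fft is only there to check the result.

```python
import numpy as np

def manual_dft(signal):
    """Direct O(N^2) discrete Fourier transform of a 1-D real sequence."""
    signal = np.asarray(signal, dtype=float)
    n_samples = len(signal)
    result = np.zeros(n_samples, dtype=np.complex128)   # complex output array
    for i in range(n_samples):          # output frequency index
        for k in range(n_samples):      # input sample index
            # factor 2*pi, signal[k] rather than signal[i], and float division
            result[i] += signal[k] * np.exp(-2j * np.pi * i * k / n_samples)
    return result

# Stand-in data: 32 samples of a sine wave (not the asker's curve.csv)
curve = np.sin(2 * np.pi * 3 * np.arange(32) / 32)
print(np.allclose(manual_dft(curve), np.fft.fft(curve)))
```

The code is written for Python 3, where / is already true division, so the float(N) workaround mentioned for Python 2.7 is not needed.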
