In R, I use the phyper function to do a hypergeometric test for bioinformatics analysis. However, I use a lot of Python code, and calling R through rpy2 here is quite slow, so I started looking for alternatives. It seemed that scipy.stats.hypergeom had something similar.
Currently, I call phyper like this:
pvalue <- 1-phyper(45, 92, 7518, 1329)
where 45 is the number of selected items having the property of interest, 92 the total number of items having the property, 7518 the number of items not having the property, and 1329 the total number of selected items.
In R, this yields 6.92113e-13.
Attempting to do the same with scipy.stats.hypergeom, however, yields a completely different result (note that the numbers are swapped because the function takes its parameters in a different order):
import scipy.stats as stats
pvalue = 1 - stats.hypergeom.cdf(45, 7518, 92, 1329)
print pvalue
However, this returns -7.3450134863151106e-12, which makes little sense: a probability cannot be negative. Note that I've tested this on other data and had no issues (the results matched up to the 4th decimal, which is enough for me).
So it boils down to these possibilities:
I'm using the wrong function for the job (or wrong parameters)
There's a bug in scipy
In case of "1", are there other alternatives to phyper that can be used in Python?
EDIT: As the comments have noted, this is a bug in scipy, fixed in git master.
From the docs, you could try:
hypergeom.sf(x, M, n, N, loc=0): survival function (1 - cdf, sometimes more accurate)
Also, I think you might have the values mixed up.
Models drawing objects from a bin. M is the total number of objects, n is the total number of Type I objects. The RV counts the number of Type I objects in N drawn without replacement from the population.
Therefore, by my reading: x=q, M=n+m, n=m, N=k.
So I would try:
stats.hypergeom.sf(45, 92 + 7518, 92, 1329)
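For reference, a quick sanity check of that call (the expected value is the R result quoted in the question):

import scipy.stats as stats
# M = 92 + 7518 = 7610 total items, n = 92 items with the property,
# N = 1329 items drawn; sf gives P(X > 45), like R's 1 - phyper(45, 92, 7518, 1329)
pvalue = stats.hypergeom.sf(45, 92 + 7518, 92, 1329)
print(pvalue)  # ~6.92113e-13, matching R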
I'm currently working on a project researching properties of some gas mixtures. While testing my code with different inputs, I came upon a bug(?) which I am unable to explain. Basically, it concerns a computation on a numpy array in a for loop: the loop yields a different (and wrong) result than the manual construction of the same result, using the exact same code snippets as in the for loop but indexing manually. I have no clue why this happens, or whether it is my own mistake or a bug within numpy.
It is very strange that certain instances of the desired input objects run through the whole for loop without any problem, while others run fine up to a certain index, and others fail on the very first iteration.
For instance, one input always stopped at index 16, throwing a:
ValueError: could not broadcast input array from shape (25,) into shape (32,)
Upon further investigation I could confirm that the previous 15 loops produced the correct results, while the results in the loop at index 16 were wrong and not even of the correct size. When running loop 16 manually through the console, no errors occurred.
(Two screenshots omitted: one showed the wrong results for index 16 when running inside the loop; the other showed the expected, correct results for index 16 when running the same code manually in the console.)
The important part of the code is really only the np.multiply() call in the for loop - I left the rest in for context, but I'm pretty sure it shouldn't interfere with my intentions.
import copy

import cantera as ct
import numpy as np

def thermic_dissociation(input_gas, pressure):
    # Copy of the input_gas object, which may not be altered out of scope
    gas = copy.copy(input_gas)
    # Temperature range
    T = np.logspace(2.473, 4.4, 1000)
    # Matrix containing the data over the whole range of interest
    moles = np.zeros((gas.gas_cantera.n_species, len(T)))
    # Array containing the other property of interest
    sum_particles = np.zeros(len(T))
    # The troublesome for-loop:
    for index in range(len(T)):
        print(str(index) + ' start')
        # Set temperature and pressure of the gas
        gas.gas_cantera.TP = T[index], pressure
        # Set the gas mixture to a state of chemical equilibrium
        gas.gas_cantera.equilibrate('TP')
        # Sum of particles = molar density * Avogadro constant for every temperature
        sum_particles[index] = gas.gas_cantera.density_mole * ct.avogadro
        # This multiplication is doing the weird stuff; printed to see what is
        # computed before it goes into the result matrix and throws the error
        print(np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index]))
        # This is where the error is thrown, as the resulting array is smaller
        # than it should be, causing the broadcast error
        moles[:, index] = np.multiply(list(gas.gas_cantera.mole_fraction_dict().values()), sum_particles[index])
        print(str(index) + ' end')
    # A list helping to handle the results
    molecule_order = list(gas.gas_cantera.mole_fraction_dict().keys())
    return [moles, sum_particles, T, molecule_order]
Any help will be much appreciated!
If you want the array of all species mole fractions, you should use the X property of the cantera.Solution object, which always returns that full array directly; see the documentation for cantera.Solution.X.
The mole_fraction_dict method is specifically meant for cases where you want to refer to the species by name, rather than their order in the Solution object, such as when relating two different Solution objects that define different sets of species.
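As a minimal sketch of that approach (assuming gas.gas_cantera is the cantera.Solution from the question's code), the loop body and the species ordering could become:

# Full mole-fraction array, always n_species long and in a fixed order
moles[:, index] = gas.gas_cantera.X * sum_particles[index]
# Matching species order, for labelling the rows of moles
molecule_order = gas.gas_cantera.species_names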
This particular issue is not related to numpy. The call to mole_fraction_dict returns a standard Python dictionary, and the number of elements in that dictionary depends on the optional threshold argument, which has a default value of 0.0.
The source code of Cantera can be inspected to see exactly what happens: see mole_fraction_dict and getMoleFractionsByName.
In other words, a value ends up in the dictionary only if x > threshold. Maybe it would make more sense if >= were used here instead of >, and maybe that would have prevented the unexpected outcome in your case.
As confirmed in the comments, you can use mole_fraction_dict(threshold=-np.inf) to get all of the desired values in the dictionary. Or -float('inf') can also be used.
In your code you proceed to call .values() on the dictionary, but this would be problematic if the order of the values is not guaranteed (I'm not sure whether it is). It might be better to make the order explicit by retrieving values out of the dict using their keys.
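A small sketch combining both points (assuming gas is the cantera.Solution from the question; using species_names as the fixed key order is an illustrative choice, not from the question):

import numpy as np
# Include every species, even those whose mole fraction is exactly 0.0
fractions = gas.mole_fraction_dict(threshold=-np.inf)
# Make the ordering explicit by pulling values out in a fixed key order
values = np.array([fractions[name] for name in gas.species_names])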
I want to generate many randomized realizations of a low-discrepancy sequence with scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems that using Sobol.random() is discouraged in the docs.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
then to generate, say, 1000 scramblings (or other randomizations) from this initial sequence.
That avoids having to regenerate the Sobol' sequence for each sample; only the scrambling is redone.
How to do that?
It seems to me that this is the proper way to do many randomized QMC runs, but I might be wrong, and there might be other ways.
As the warning suggests, Sobol' is a sequence, meaning that each sample is linked to the previous ones. You have to respect the 2^m sample-size property. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which warns if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is reset the sequence between the draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k and skip the first 2^(k-1) points; then you can sample 2^n with n < k-1.
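As one concrete way to get many independent randomizations, here is a hedged sketch that constructs a fresh scrambled engine per realization rather than reusing a single one; the seeding scheme is an assumption for illustration:

from scipy.stats import qmc
# 1000 independently scrambled realizations of a 2-D Sobol' sequence,
# each with 2**10 points; distinct seeds give distinct scramblings
realizations = [
    qmc.Sobol(d=2, scramble=True, seed=seed).random_base2(m=10)
    for seed in range(1000)
]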
I have a dataframe with the daily returns of 6 portfolios (PORT1, PORT2, PORT3, ..., PORT6).
I have defined functions for compound annual return and risk-adjusted return, and I can run these functions for any one PORT.
I want to find the combination of portfolios (assume equal weighting) that obtains the highest returns. For example, a combination of PORT1, PORT3, PORT4, and PORT6 may provide the highest risk-adjusted return. Is there a method to automatically run the defined function on all combinations and obtain the desired combination?
No code is included, as I do not think it is necessary to show the computation used to determine risk-adjusted return.
def returns(PORT):
    val = ...  # computation of return here for PORT
    return val
Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You've six dimensions, and presumably you want to allocate 1 unit of "stuff" across all six, such that a vector of the allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of numbers, so maybe start off with increments of size 0.10. Ten possible increments across 6 dimensions gives you 10^6 possibilities.
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values, and pick the best one.
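A hedged sketch of that brute force over the 0.10-increment grid (returns_for_weights is a hypothetical function scoring one allocation; it is not defined in the question):

import itertools
best_val, best_weights = float('-inf'), None
# All 6-tuples of non-negative tenths that sum to 1.0
for grid_point in itertools.product(range(11), repeat=6):
    if sum(grid_point) != 10:
        continue
    weights = [g / 10.0 for g in grid_point]
    val = returns_for_weights(weights)  # hypothetical scoring function
    if val > best_val:
        best_val, best_weights = val, weights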
That may not be the answer you want; other methods exist, including randomising your guesses and limiting your results to a more manageable number. But the performance gain is offset by some uncertainty - and some potentially difficult conversations with your clients ("What do you mean, you did it randomly?!").
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".
You can do
import itertools

best_return = 0
best_PORT = None
for r in range(1, len(PORTS) + 1):  # combination sizes 1 through len(PORTS)
    for PORT in itertools.combinations(PORTS, r):
        cur_return = returns(PORT)
        if cur_return > best_return:
            best_return = cur_return
            best_PORT = PORT
You can also do
max((max(itertools.combinations(PORTS, r), key=returns)
     for r in range(1, len(PORTS) + 1)), key=returns)
However, this is more of an economics question than a CS one. Given a set of positions with their returns and risks, there are explicit formulae (e.g. Markowitz mean-variance optimization) to find the optimal portfolio without having to brute-force it.
I wrote the following Python code (using Python 2.7.6) to calculate the Fibonacci sequence. It doesn't use any extra libraries, just the core python modules.
I was wondering if there is a limit to how many terms of the sequence I can calculate, perhaps due to the absurd length of the resulting integers, or whether there is a point where Python no longer performs the calculations accurately.
Also, the fibopt(n) function seems to sometimes return the term below the one requested (e.g. 99th instead of 100th), but it always works for lower terms (1st, 2nd, 10th, 15th). Why is that?
def fibopt(n):  # Returns term "n" of the Fibonacci sequence.
    f = [0, 1]  # List containing the first two numbers in the Fibonacci sequence.
    x = 0  # Integer to store the next value in the sequence. Not really necessary.
    optnum = 2  # Number of calculated entries in the sequence. Starts at 2 (0, 1).
    while optnum < n:  # Until the "n"th value in the sequence has been calculated.
        if optnum % 10000 == 0:
            print "Calculating index number %s." % optnum  # Notify the user for every 10000th value calculated. Useful because the program can take a very long time for higher values (e.g. about 15 minutes on an i7-4790 for the 10000000th value).
        x = [f[-1] + f[-2]]  # Calculate the next value in the sequence based on the previous two. This could be built into the next line.
        f.extend(x)  # Append that value to the sequence. This could be f.extend([f[-1] + f[-2]]) instead.
        optnum += 1  # Increment the counter of calculated values by 1.
        del f[:-2]  # Remove all values from the list except the last two. Without this, the integers become so long that they fill 16 GB of RAM in seconds.
    return f[:n]  # Returns the requested term of the sequence.
def fib(n):  # Similar to fibopt(n), but returns all terms in the sequence up to and including term "n". Can use a lot of memory very quickly.
    f = [0, 1]
    x = 0
    while len(f) < n:
        x = [f[-1] + f[-2]]
        f.extend(x)
    return f[:n]
The good news is: integer math in Python is easy -- there are no overflows.
As long as your integers can fit within a C long, Python will use that. Once you go past that, it will auto-promote to arbitrary-precision integers (which means it'll be slower and use more memory, but the calculations will remain correct).
The only limits are:
The amount of memory addressable by the Python process. If you're using 32-bit Python, you need to be able to fit all of your data within 2 gigabytes of RAM (get past that and your program will fail with MemoryError). If you're using 64-bit Python, your physical RAM + swapfile is the theoretical limit.
The time you're willing to wait while calculations are being performed. The larger your ints, the slower the calculations are. If you ever hit your swap space, your program will reach continental drift levels of slow.
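A quick interactive illustration of that auto-promotion in Python 2 (where the question's code runs):

>>> import sys
>>> type(sys.maxint)      # largest plain (C long) int
<type 'int'>
>>> type(sys.maxint + 1)  # one past that is auto-promoted
<type 'long'>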
If you go to the Python 2.7 documentation, there is a section on Fibonacci numbers in the tutorial. The example in that section does not print the elongated answer we would all want to view; it shortens it.
If this does not answer your question, please see 4.6 Defining Functions.
If you have downloaded the interpreter, the manuals come preinstalled; you can also go online to www.python.org. Either way, you can see the Fibonacci numbers that end in an "arbitrary" short form, i.e. not the entire numerical value.
Seth
P.S. If you have any questions on where to find this section in your manual, please see The Python Tutorial / 4. More Control Flow Tools / 4.6 Defining Functions. I hope this helps a bit.
Python integers can express arbitrary-length values and will not be automatically converted to float. You can check by simply creating a very large number and checking its type:
>>> type(2**(2**25))
<class 'int'> # long in Python 2.x
fibopt returns f[:n], and that is a list. You seem to expect it to return a single term, so either the expectation (stated in the function's first comment) or the implementation must change.
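If the implementation is what changes, here is a minimal sketch of a version that returns the single n-th term while still keeping only two values in memory (the convention that term 1 is 0 and term 2 is 1 follows the question's f = [0, 1]):

def fibopt_term(n):
    # Walk the sequence iteratively, remembering only the last two values.
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a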
I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, and the 25th, 50th, and 75th percentiles.
Normally I would use the code below, but I need a more efficient way to compute these metrics because I cannot store every value in a list. How can I calculate these values more effectively in Python?
import numpy as np
np.average(mylist)
np.min(mylist)
np.max(mylist)
np.std(mylist)
np.percentile(mylist, 25)
np.percentile(mylist, 50)
np.percentile(mylist, 75)
maxx = float('-inf')
minx = float('+inf')
sumz = 0.0
count = 0
for count, line in enumerate(open("foo.txt", "r"), start=1):
    p = float(line)  # convert each line once
    maxx = max(maxx, p)
    minx = min(minx, p)
    sumz += p
my_max = maxx
my_min = minx
my_avg = sumz / count
Use a binary file. Then you can use numpy.memmap to map it into memory and perform all sorts of algorithms on it, even if the dataset is larger than RAM.
You can even use numpy.memmap to create the memory-mapped array and read your data into it from the text file; when you are done working on it, you also have the data in binary format.
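A hedged sketch of that conversion (file names and dtype are assumptions; the one-off text parse below still loads everything into memory, so a chunked conversion would be needed if even that is too big):

import numpy as np
# One-off conversion: text -> raw float64 binary
np.loadtxt("foo.txt").tofile("foo.bin")
# Afterwards, memory-map the binary file and use numpy as usual
m = np.memmap("foo.bin", dtype=np.float64, mode="r")
print(m.min(), m.max(), m.mean(), m.std())
print(np.percentile(m, 25), np.percentile(m, 50), np.percentile(m, 75))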
I think you are on the right track, iterating over the file and keeping track of the max and min values. To calculate the std, keep a sum of squares inside the loop: sum_of_squares += z**2. After the loop you can then compute std = sqrt(sum_of_squares / n - (sumz / n)**2); see the formula here (but note that this formula can suffer from numerical problems). For performance, you may want to iterate over the file in decently sized chunks of data.
To calculate the median and percentiles in a 'continuous' way, you can build up a histogram inside your loop. After the loop, you can get approximate percentiles and the median by converting the histogram to the CDF; the error will depend on the number of bins.
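A minimal single-pass sketch of both ideas (the histogram range [lo, hi) and bin count are assumptions and must cover your data; as noted, the naive variance formula can lose precision):

import math
n, sumz, sum_of_squares = 0, 0.0, 0.0
lo, hi, nbins = 0.0, 1000.0, 10000  # assumed value range and resolution
hist = [0] * nbins
with open("foo.txt") as f:
    for line in f:
        z = float(line)
        n += 1
        sumz += z
        sum_of_squares += z ** 2
        b = int((z - lo) / (hi - lo) * nbins)
        hist[min(max(b, 0), nbins - 1)] += 1  # clamp outliers into the edge bins
mean = sumz / n
std = math.sqrt(sum_of_squares / n - mean ** 2)
# Approximate percentiles by scanning the histogram's CDF
targets = {0.25: 25, 0.50: 50, 0.75: 75}
percentiles, seen = {}, 0
for b, c in enumerate(hist):
    seen += c
    for q in list(targets):
        if seen >= q * n:
            percentiles[targets.pop(q)] = lo + (b + 1) * (hi - lo) / nbins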
As Antti Haapala says, the easiest and most efficient way to do this will be to stick with numpy, and just use a memmapped binary file instead of a text file. Yes, converting from one format to the other will take a bit of time—but it'll almost certainly save more time than it costs (because you can use numpy vectorized operations instead of loops), and it will also make your code a lot simpler.
If you can't do that, Python 3.4 will come with a statistics module. A backport to 2.6+ will hopefully be available at some point after the PEP is finalized; at present, I believe you can only get stats, the earlier module it's based on, which requires 3.1+. Unfortunately, while stats does do single-pass algorithms on iterators, it doesn't have any convenient way to run multiple algorithms in parallel on the same iterator, so you have to be clever with itertools.tee and zip to force it to interleave the work instead of pulling the whole thing into memory.
And of course there are plenty of other modules out there if you search PyPI for "stats" and/or "statistics" and/or "statistical".
Either way, using a pre-built module will mean someone's already debugged all the problems you're going to run into, and they may have also optimized the code (maybe even ported it to C) to boot.
To get the percentiles, sort the text file using a command-line program. Use the line count (index in your program) to find the line numbers of the percentiles (index // 4, etc.), then retrieve those lines from the file.
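A sketch of that approach (the external sort is shown as a comment; foo_sorted.txt is an assumed file name):

# First, outside Python:  sort -g foo.txt > foo_sorted.txt
with open("foo_sorted.txt") as f:
    n = sum(1 for _ in f)  # line count
wanted = {n // 4: 25, n // 2: 50, (3 * n) // 4: 75}
percentiles = {}
with open("foo_sorted.txt") as f:
    for i, line in enumerate(f):
        if i in wanted:
            percentiles[wanted[i]] = float(line)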
Most of these operations can be expressed easily in terms of simple arithmetic. In that case, it can actually (surprisingly) be quite efficient to compute simple statistics directly from the Linux command line using awk and sed, e.g. as in this post: http://www.unixcl.com/2008/09/sum-of-and-group-by-using-awk.html
If you need to generalize to more advanced operations, like weighted percentiles, then I'd recommend using Python pandas (notably the HDFStore capabilities for later retrieval). I've used pandas with a DataFrame of over 25 million records before (10 columns by 25 million distinct rows). If you're more memory-constrained, you can read the data in chunks, calculate partial contributions from each chunk, and store intermediate results, then finish the calculation by loading just the intermediate results, in a serialized map-reduce kind of framework.
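A minimal sketch of that chunked, partial-contributions idea with pandas (the single-column layout, column name, and chunk size are assumptions):

import math
import pandas as pd
count, total, sum_sq = 0, 0.0, 0.0
for chunk in pd.read_csv("foo.txt", header=None, names=["x"], chunksize=10**6):
    count += len(chunk)
    total += chunk["x"].sum()
    sum_sq += (chunk["x"] ** 2).sum()
mean = total / count
std = math.sqrt(sum_sq / count - mean ** 2)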