(scipy.stats.qmc) How to do multiple randomized Quasi Monte Carlo - python

I want to generate many randomized realizations of a low-discrepancy sequence with scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems like using Sobol.random() is discouraged by the docs.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
then to generate, say, 1000 scramblings (or other randomizations) from this initial sequence.
That avoids having to regenerate the Sobol' sequence for each sample; only the scrambling is redone.
How to do that?
It seems to me like this is the proper way to do many randomized QMC runs, but I might be wrong and there might be other ways.

As the warning suggests, Sobol' is a sequence, meaning that each sample is linked to the previous ones. You have to respect the power-of-2 properties. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which warns if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
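For instance, a minimal sketch (my own, not from the original answer) of using Sobol.random() in a way that respects the power-of-two totals: drawing 2**10 points twice takes the running total from 1024 to 2048, both powers of two:

from scipy.stats import qmc

ld = qmc.Sobol(d=2, scramble=True)
first = ld.random(2**10)    # 1024 points generated so far
second = ld.random(2**10)   # 2048 in total, still a power of two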
In your case, what you want to do is to reset the sequence between the draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k points, skip the first 2^(k-1), and then sample 2^n with n < k-1.
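As a concrete sketch (my own illustration, not code from the original answer): one way to get many independent randomizations is to build a freshly scrambled engine per replicate, so each draw carries its own scrambling; the seed values are purely illustrative:

from scipy.stats import qmc

# 1000 independently scrambled realizations of the same Sobol' sequence,
# each with 2**10 points in 2 dimensions.
realizations = [
    qmc.Sobol(d=2, scramble=True, seed=s).random_base2(m=10)
    for s in range(1000)
]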

Related

Combining p values using scipy

I have to combine p-values to get a single p-value.
I'm using the scipy.stats.combine_pvalues function, but it gives a very small combined p-value. Is that normal?
e.g.:
>>> import scipy
>>> p_values_list=[8.017444955844044e-06, 0.1067379119652372, 5.306374345615846e-05, 0.7234201655194492, 0.13050605094545614, 0.0066989543716175, 0.9541246420333787]
>>> test_statistic, combined_p_value = scipy.stats.combine_pvalues(p_values_list, method='fisher',weights=None)
>>> combined_p_value
4.331727536209026e-08
As you can see, the combined p-value is smaller than any of the p-values in p_values_list. How can that be?
Thanks in advance,
Burcak
It is correct, because you are testing whether all of your p-values come from a random uniform distribution, i.e. whether none of the individual effects is real. The alternative hypothesis is that at least one of them is true, which in your case is very possible.
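For reference, Fisher's method combines k p-values via the statistic -2 * sum(ln p_i), which under the null follows a chi-square distribution with 2k degrees of freedom; a short sketch reproducing the numbers in question:

import numpy as np
from scipy.stats import chi2

stat = -2 * np.sum(np.log(p_values_list))               # ~62.456 for the values above
p_combined = chi2.sf(stat, df=2 * len(p_values_list))   # ~4.33e-08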
We can simulate this by drawing from a random uniform distribution 1000 times, each draw the length of your p-values list:
import numpy as np
from scipy.stats import combine_pvalues
from matplotlib import pyplot as plt

# 1000 draws of len(p_values_list) uniform p-values: the null hypothesis
random_p = np.random.uniform(0, 1, (1000, len(p_values_list)))
res = np.array([combine_pvalues(i, method='fisher', weights=None) for i in random_p])
fisher_stat = res[:, 0]   # first column holds the simulated chi-square statistics
plt.hist(fisher_stat)
From your results, the chi-square statistic is 62.456, which is really huge and nowhere near the simulated chi-square values above.
One thing to note is that the combination you did here does not take directionality into account. If that is relevant for your test, you might want to consider using Stouffer's Z along with weights. Another sane check is to run a simulation like the one above to generate lists of p-values under the null hypothesis and see how they differ from what you observed.
Interesting paper but maybe a bit on the statistics side
I am by no means an expert in this field, but I am interested in your question. After some reading of the wiki, it seems to me that the combined p-value tells you the likelihood of all the p-values in the list being obtained under the same null hypothesis, which is very unlikely considering two extremely small values.
Your set has two extremely small values: the 1st and the 3rd. If the reasoning described above is correct, removing either of them should yield a much higher combined p-value, which is indeed the case:
remove 1st: p-value of 0.00010569305282803985
remove 3rd: p-value of 2.4713196031837724e-05
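For completeness, a small sketch of that check (recombining with one value dropped):

from scipy.stats import combine_pvalues

_, p_wo_first = combine_pvalues(p_values_list[1:], method='fisher')                       # ~1.06e-04
_, p_wo_third = combine_pvalues(p_values_list[:2] + p_values_list[3:], method='fisher')   # ~2.47e-05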
In conclusion, I think this is the correct way of interpreting the meta-analysis that combine_pvalues actually performs.

Finding combination of columns which provides best combination based on function return

I have a dataframe with the daily returns of 6 portfolios (PORT1, PORT2, PORT3, ..., PORT6).
I have defined functions for compound annual returns and risk-adjusted returns. I can run this function for any one PORT.
I want to find the combination of portfolios (assume equal weighting) that obtains the highest returns. For example, a combination of PORT1, PORT3, PORT4, and PORT6 may provide the highest risk-adjusted return. Is there a method to automatically run the defined function on all combinations and obtain the desired combination?
No code is included as I do not think it is necessary to show the computation used to determine risk adj return.
def returns(PORT):
    val = ...   # [computation of risk-adjusted return here for PORT]
    return val
Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You've got six dimensions, and presumably you want to allocate 1 unit of "stuff" across all six, such that the vector of allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of vectors, so maybe we start off with increments of size 0.10. Ten increments across 6 dimensions gives on the order of 10^6 grid points before enforcing the sum-to-1 constraint (and considerably fewer after).
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values, and pick the best one, as sketched below.
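A minimal sketch of that brute force (my own illustration; risk_adj_return is a hypothetical stand-in for the asker's function):

import itertools
import numpy as np

def grid_weights(n_assets=6, steps=10):
    # every weight vector in 0.1 increments that sums to 1
    for combo in itertools.product(range(steps + 1), repeat=n_assets):
        if sum(combo) == steps:
            yield np.array(combo) / steps

best_w = max(grid_weights(), key=risk_adj_return)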
That may not be the answer you want; other methods exist, including randomising your guesses and limiting your results to a more manageable number. But the performance gain is offset by some uncertainty, and by some potentially difficult conversations with your clients ("What do you mean you did it randomly?!").
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".
You can do
import itertools

best_return = 0
best_PORT = None
for r in range(1, len(PORTS) + 1):    # include the full set, skip the empty one
    for PORT in itertools.combinations(PORTS, r):
        cur_return = returns(PORT)
        if cur_return > best_return:
            best_return = cur_return
            best_PORT = PORT
You can also do it in one expression:
max((max(itertools.combinations(PORTS, r), key=returns)
     for r in range(1, len(PORTS) + 1)), key=returns)
However, this is more of an economics question than a CS one. Given a set of positions with their returns and risks, there are explicit formulae for finding the optimal portfolio without having to brute-force it.
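For example, under mean-variance assumptions the maximum-Sharpe (tangency) portfolio has a closed form, with weights proportional to the inverse covariance matrix times the mean returns; a sketch, assuming a returns DataFrame df and a zero risk-free rate:

import numpy as np

mu = df.mean().values           # expected daily returns per portfolio
cov = df.cov().values           # covariance matrix of the returns
w = np.linalg.solve(cov, mu)    # unnormalized tangency weights
w /= w.sum()                    # normalize so the weights sum to 1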

Why is Python randint() generating bizarre grid effect when used in script to transform image?

I am playing around with images and the random number generator in Python, and I would like to understand the strange output picture my script is generating.
The larger script iterates through each pixel (left to right, then next row down) and changes the color. I used the following function to offset the given input red, green, and blue values by a randomly determined integer between 0 and 150 (so this formula is invoked 3 times for R, G, and B in each iteration):
import random

def colCh(cVal):
    random.seed()                      # note: re-seeds the PRNG on every call
    rnd = random.randint(0, 150)
    newVal = max(min(cVal - 75 + rnd, 255), 0)
    return newVal
My understanding is that random.seed() without arguments uses the system clock as the seed value. Given that it is invoked prior to the calculation of each offset value, I would have expected a fairly random output.
When reviewing the numerical output, it does appear to be quite random:
Scatter plot of every 100th R value as x and R' as y:
However, the picture this script generates has a very peculiar grid effect:
Output picture with grid effect hopefully noticeable:
Furthermore, FWIW, this grid effect seems to appear or disappear at different zoom levels.
I will be experimenting with new methods of creating seed values, but I can't just let this go without trying to get an answer.
Can anybody explain why this is happening? THANKS!!!
Update: Per Dan's comment about possible issues from JPEG compression, the input file format is .jpg and the output file format is .png. I would assume only the output file format would potentially create the issue he describes, but I admittedly do not understand how JPEG compression works at all. In order to try and isolate JPEG compression as the culprit, I changed the script so that the colCh function that creates the randomness is excluded. Instead, it merely reads the original R,G,B values and writes those exact values as the new R,G,B values. As one would expect, this outputs the exact same picture as the input picture. Even when, say, multiplying each R,G,B value by 1.2, I am not seeing the grid effect. Therefore, I am fairly confident this is being caused by the colCh function (i.e. the randomness).
Update 2: I have updated my code to incorporate Sascha's suggestions. First, I moved random.seed() outside of the function so that it is not reseeding based on the system time in every iteration. Second, while I am not quite sure I understand how there is bias in my original function, I am now sampling from a positive/negative distribution. Unfortunately, I am still seeing the grid effect. Here is my new code:
import random

random.seed()   # seed once, at module level

def colCh(cVal):
    rnd = random.uniform(-75, 75)
    newVal = int(max(min(cVal + rnd, 255), 0))
    return newVal
Any more ideas?
As imgur is down for me right now, some guessing:
Your usage of PRNGs is a bit scary. Don't use time-based seeds in very frequently called loops. It is very much possible that the same seeds are generated, and of course this will produce patterns (the granularity of the clock and the number of random bits used matter here).
So: seed your PRNG once! Don't do this every time, don't do this for every channel. Seed one global PRNG and use it for all operations.
There should be no pattern then.
(If there is: also check the effect of interpolation, i.e. image-size changes.)
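A minimal sketch of that advice (one module-level generator, seeded implicitly once from OS entropy):

import random

rng = random.Random()   # one PRNG for everything, seeded once

def colCh(cVal):
    # symmetric integer offset in [-75, 75], clipped to the valid 0..255 range
    return max(min(cVal + rng.randint(-75, 75), 255), 0)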
Edit: As imgur is up now, I recognized the macro-block-like patterns, as Dan mentioned in the comments. Please change your PRNG usage first before further analysis. And maybe show more complete code.
It may be possible that you recompressed the output and JPEG compression emphasized the effects observed before.
Another thing is:
newVal = max(min(cVal - 75 + rnd,255),0)
There is a bit of a bias here (a better approach: sample from a symmetric negative/positive distribution and clip to the 0..255 range), which can also emphasize some effect (what did those macroblocks look like before?).

How to use distancematrix function from Biopython?

I would like to calculate a distance matrix (using a genetic distance function) on a data set with http://biopython.org/DIST/docs/api/Bio.Cluster.Record-class.html#distancematrix, but I keep getting errors, typically telling me the rank is not 2. I'm not actually sure what it wants as input, since the documentation never says and there are no examples online.
Say I read in some aligned gene sequences:
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
which would be done by
data = Align.read("dataset.fasta","fasta")
But the distancematrix function in the Cluster.Record class does not accept this. How can I get it to work? I.e.
dist_mtx = distancematrix(data)
The short answer: You don't.
From the documentation:
A Record stores the gene expression data and related information
The Cluster object is meant for gene expression data, not for a multiple sequence alignment (MSA).
I would recommend using an external tool like MSARC which runs in Python as well.
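That said, if you want to stay within Biopython, one option (an assumption on my part that it fits your use case; it lives in Bio.Phylo, not Bio.Cluster) is DistanceCalculator, which accepts an alignment directly:

from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator

aln = AlignIO.read("dataset.fasta", "fasta")
calculator = DistanceCalculator("identity")   # or a substitution model such as "blosum62"
dist_mtx = calculator.get_distance(aln)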

Adding +1 to specific matrix elements

I'm currently coding an algorithm for the 4-parametric RAINFLOW method. The idea of this method is to eliminate load cycles from a cycle history, which is normally given as a load (for example force) vs. time diagram. This is a very frequently used method in mechanical engineering to determine the life span of a product/element that is exposed to a certain number of load cycles.
However, the result of this method is a so-called FROM-TO table or FROM-TO matrix, where the rows represent the FROM values and the columns the TO values, as shown in the picture below:
example of from-to table/matrix
This example is unrealistic, as you normally get a file with millions of measurement points, which means that some cycles won't occur only once (1) or twice (2) as shown in the table; they may occur thousands of times.
Now to the problem:
I coded the algorithm of the method and, as a result, formed a vector with the FROM values and a vector with the TO values, like this:
vek_from = []
vek_to = []
d = len(a) / 2
for i in range(int(d)):
    vek_from.append(a[2*i])      # FROM
    vek_to.append(a[2*i+1])      # TO
a is the vector with all values, like a=[from, to, from, to,...]
Now I'm trying to form a matrix out of this, like this:
import numpy as np

mat_from_to = np.zeros(shape=(int(d), int(d)))
MAT = np.zeros(shape=(int(d), int(d)))
s = int(d - 1)
for i in range(s):
    mat_from_to[vek_from[i]-2, vek_to[i]-2] += 1
So the problem is that I don't know how to handle a load cycle that occurs several times (i.e. has the same FROM-TO values): I need to add +1 to that FROM-TO element every time it happens, but with what I've coded it only replaces the previous value with 1, so I can never exceed 1.
So, to make the explanation shorter: whenever a combination of FROM-TO values determines the position of an element in the matrix, how do I add +1 there?
Hopefully I didn't make it too complicated and someone will be happy to help me with this.
Regards,
Luka
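A minimal sketch of the accumulation being asked about (my own illustration with hypothetical values; the key point is that += adds to the existing count, while plain assignment would overwrite it):

import numpy as np

vek_from = [2, 5, 2, 7, 2]    # illustrative FROM values
vek_to = [5, 2, 5, 3, 5]      # illustrative TO values
n = max(max(vek_from), max(vek_to)) - 1
mat = np.zeros((n, n))
for f, t in zip(vek_from, vek_to):
    mat[f - 2, t - 2] += 1    # repeated (f, t) pairs keep incrementing
# vectorized equivalent:
# np.add.at(mat, (np.array(vek_from) - 2, np.array(vek_to) - 2), 1)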
