Sampling from a huge uniform distribution in Python

I need to select 3.7*10^8 unique values from the range [0, 3*10^9], and I need to either obtain them in sorted order or be able to keep them in memory.
To do this, I started working on a simple algorithm where I sample smaller uniform distributions (that fit in memory) in order to indirectly sample the large distribution that really interests me.
The code is available at the following gist https://gist.github.com/legaultmarc/7290ac4bef4edb591d1e
Since I'm having trouble implementing something more robust, I was wondering if you had other ideas for sampling unique values from a large discrete uniform distribution. I'm looking for an algorithm, a module, or an idea on how to manage very large lists directly (perhaps using the hard drive instead of memory).

There is an interesting post, Generating sorted random ints without the sort? O(n), which suggests that instead of generating uniform random ints, you can do a running sum on exponential random deltas, which gives you a uniform random result generated in sorted order.
It's not guaranteed to give exactly the number of samples you want, but it should be pretty close, and it is much faster with much lower memory requirements.
Edit: I found a second post, generating sorted random numbers without exponentiation involved?, which suggests tweaking the distribution density as you go to generate an exact number of samples, but I am leery of exactly what this would do to your "uniform" distribution.
Edit2: Another possibility that occurs to me would be to use an inverse cumulative binomial distribution to iteratively split your sample range (predict how many uniformly generated random samples would fall in the lower half of the range, then the remainder must be in the upper half) until the block-size reaches something you can easily hold in memory.
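To make the exponential-delta idea concrete, here is a minimal sketch (the name sorted_uniform_stream and the chunk size are mine, not from the gist): it streams approximately n_target sorted draws from [0, upper) by accumulating exponential gaps, so the full sample never has to sit in memory at once. Rounding to integers and removing the occasional duplicate would still be needed to get unique integer samples.

import numpy as np

def sorted_uniform_stream(n_target, upper, chunk=10**6, seed=None):
    # Approximates the arrival times of a Poisson process with rate
    # n_target / upper: the gaps are exponential and the running sum
    # yields points that are uniform on [0, upper) and already sorted.
    rng = np.random.default_rng(seed)
    mean_gap = upper / n_target
    position = 0.0
    while True:
        gaps = rng.exponential(mean_gap, size=chunk)
        positions = position + np.cumsum(gaps)
        keep = positions[positions < upper]
        if keep.size:
            yield keep
        if keep.size < chunk:  # crossed the upper bound, so we are done
            return
        position = keep[-1]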

This is a standard sample without replacement. You can't just divide the range [0, 3*10^9] into equal bins and sample the same number of values in each bin.
Also, 3 billion is relatively large: many "ready to use" codes only handle 32-bit integers, which top out at roughly 2 billion. Take a close look at their implementations.
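As a quick sanity check of that limit (nothing more than an illustration):

import numpy as np
print(np.iinfo(np.int32).max)              # 2147483647, i.e. about 2.1 billion
print(np.iinfo(np.int32).max < 3 * 10**9)  # True: 3e9 does not fit in int32
print(np.iinfo(np.int64).max)              # 9223372036854775807, plenty of headroom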

Related

Total variation implementation in numpy for a piecewise linear function

I would like to use total variation in Python, but I wasn't able to find an existing implementation.
Assuming that I have an array with a finite number of elements, is the NumPy implementation simply:
import numpy as np
a = np.array([...], dtype=float)
tv = np.sum(np.abs(np.diff(a)))
My main doubt is how to compute the supremum of tv across all partitions, and whether the sum of the absolute differences suffices for a finite array of floats.
Edit: My input array represents a piecewise linear function, therefore the supremum over the full set of partitions is indeed the sum of absolute differences between contiguous points.
Yes, that is correct.
I imagine you're confused by the mathy definition on the Wikipedia page for total variation. Have a look at the more practical definition on the Wikipedia page for total variation denoising instead.
For an actual code (even Python) implementation, see e.g. Tensorflow's total_variation(), though this is for one or more (2D, color) images, so the TV is computed for both rows and columns, and then added together.
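As a small numeric check of the one-line formula above (the array is an arbitrary example):

import numpy as np
a = np.array([0.0, 2.0, 1.0, 3.0])
# np.diff(a) is [2, -1, 2], so the total variation is 2 + 1 + 2 = 5
tv = np.sum(np.abs(np.diff(a)))
print(tv)  # 5.0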

clustering in python without number of clusters or threshold

Is it possible to do clustering without providing any input apart from the data? The clustering method/algorithm should decide from the data how many logical groups the data can be divided into, and I also don't want to have to input the threshold Euclidean distance on which the clusters are built; this too should be learned from the data.
Could you please suggest the closest solution to my problem?
Why not code your algorithm to create clusterings for every number of clusters from 1 to n (where n could be defined in a config file, so you avoid hard coding and only fix it once)?
Once that is done, compute the clusterings for 1 to n clusters and choose the value of k which gives you the smallest mean squared error.
This requires some additional work by your machine to determine the optimal number of logical groups the data can be divided into (bounded between 1 and n).
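A minimal sketch of that sweep, assuming k-means from scikit-learn as the clustering step (any other method would do); inertia_ is the within-cluster sum of squared errors, standing in for the MSE above. Since this error keeps shrinking as k grows, in practice you would look for an elbow in the curve (or use a criterion such as the silhouette score) rather than the raw minimum.

import numpy as np
from sklearn.cluster import KMeans

def sweep_clusters(data, max_k=10):
    # Fit k-means for every k from 1 to max_k and record the
    # within-cluster sum of squared errors (inertia) for each.
    errors = {}
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        errors[k] = model.inertia_
    return errors

# Example with random data; max_k would come from your config file.
print(sweep_clusters(np.random.rand(200, 2), max_k=6))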
Clustering is an explorative technique.
This means it must always be able to produce different results, as desired by the user. Having many parameters is a feature. It means the method can be adapted easily to very different data, and to user preferences.
There will never be a generally useful parameter-free technique. At best, some parameters will have default values or heuristics (such as Euclidean distance, standardizing the input prior to clustering, or the gap statistic for choosing k) that may give a reasonable first try in 80% of cases. But after that first try, you'll need to understand the data and try other parameters to learn more about your data.
Methods that claim to be "parameter free" usually just have some hidden parameters set so that they work on the few toy examples they were demonstrated on.

How to randomly shuffle a list that has more permutations than the PRNG's period?

I have a list with around 3900 elements that I need to randomly permute to produce a statistical distribution. I looked around and found this question, Maximal Length of List to Shuffle with Python random.shuffle, which explains that the period of Python's PRNG is 2**19937-1, which caps the list length at 2080 before it becomes impossible to generate all possible permutations. I am only producing 300-1000 permutations of the list, so it is unlikely that I will produce duplicate permutations; however, since this is producing a statistical distribution, I would like all possible permutations to be potential samples.
There are longer-period PRNGs than MT, but they are hard to find.
To get all 3900! permutations, you need 40,905 bits of entropy; that's about 5 kB. You should be able to grab a chunk of bytes that size from someplace like random.org many times with no problem. To get a precisely balanced shuffle, you'll have to grab a few extra bits and do rejection sampling, i.e. grab 12 bits at a time (0..4095) and reject numbers higher than your current loop index. That might inflate the number of bits needed, but probably not beyond 8 kB.
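A minimal sketch of that rejection scheme built on a standard Fisher-Yates shuffle; rand_bits is a placeholder for whatever external entropy source you end up using (random.org bytes, os.urandom, ...), with secrets.randbits standing in here:

import secrets

def external_shuffle(items, rand_bits=secrets.randbits):
    # Fisher-Yates shuffle driven by an external bit source.
    items = list(items)
    for i in range(len(items) - 1, 0, -1):
        # Draw 12 bits (0..4095) and reject anything above the current
        # index, so j is uniform on 0..i; 12 bits is enough for ~3900 items.
        while True:
            j = rand_bits(12)
            if j <= i:
                break
        items[i], items[j] = items[j], items[i]
    return items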
I agree with @user2357112 that it is unlikely to be a genuine issue, but it seems like you should be able to use the standard random module in such a way that all permutations are at least possible.
You could do a divide-and-conquer approach. Use the initial seed to partition the list into 2 lists of around 2000 each. The number of such partitions is roughly C(4000,2000), which is approximately 1.66 x 10^1202. This is less than the period, which suggests that it is at least possible for all such partitions to be generated with random.sample(). Then reseed the random number generator and permute the first half. Then reseed a second time and permute the second half. Perhaps throw in small time delays before the reseedings so you don't run into issues involving the resolution of your system clock. You could also experiment with randomly partitioning the initial list into a larger number of smaller lists.
Mathematically, it is easy to see that if you randomly partition a list into sublists so that each partition is equally likely and then you permute each sublist in such a way that all sublist permutations are equally likely, and glue together these sublist permutations to get a whole-list permutation, then all whole-list permutations are equally likely.
Here is an implementation:
import random, time

def permuted(items, pieces=2):
    # Randomly partition the items into `pieces` sublists.
    sublists = [[] for i in range(pieces)]
    for x in items:
        sublists[random.randint(0, pieces - 1)].append(x)
    # Reseed before shuffling each sublist, then glue the pieces together.
    permutedList = []
    for i in range(pieces):
        time.sleep(0.01)
        random.seed()
        random.shuffle(sublists[i])
        permutedList.extend(sublists[i])
    return permutedList
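A quick usage sketch (the number of pieces here is arbitrary):

shuffled = permuted(list(range(3900)), pieces=4)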
I'm not sure that time.sleep(0.01) is really needed. My concern was that if the reseeds happened within a millisecond then on some systems the same seed might be used.
As a final remark: just because the above function (with a suitable choice of pieces) can't be shown to miss certain permutations by a simple counting argument (comparing the number of permutations with the number of initial states), this doesn't in and of itself prove that all permutations are in fact possible. That would require a more detailed analysis of the random number generator, the hash function that seeds it, and the shuffle algorithm.

Randomizing values accounting for floating point resolution

I have an array of values that I'm clipping to be within a certain range. I don't want large numbers of values to be identical though, so I'm adding a small amount of random noise after the operation. I think that I need to be accounting for the floating point resolution for this to work.
Right now I've got code something like this:
import numpy as np

# Clip the first three columns to the bounding box, in place.
np.minimum(x[:, 0:3], topRtBk, out=x[:, 0:3])
np.maximum(x[:, 0:3], botLftFrnt, out=x[:, 0:3])
# Then add a small amount of Gaussian noise to break up identical values.
np.add(x[:, 0:3], np.random.randn(x.shape[0], 3).astype(real_t) * 5e-5, out=x[:, 0:3])
where topRtBk and botLftFrnt are the 3D bounding limits (there's another version of this for spheres).
real_t is configurable to np.float32 or np.float64 (other parts of the code are GPU accelerated, and this may be eventually as well).
The 5e-5 is a magic number which is twice np.finfo(np.float32).resolution, and the crux of my question: what's the right value to use here?
I'd like to dither the values by the smallest possible amount while retaining sufficient variation, though I admit that "sufficient" is rather ill defined. I'm trying to minimize duplicate values, but having some won't kill me.
I guess my question is two fold: is this the right approach to use, and what's a reasonable scale factor for the random numbers?
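For reference, the floating point quantities behind that magic number can be inspected directly; this only shows where such scales come from, not which one is right for your data:

import numpy as np
print(np.finfo(np.float32).eps)         # machine epsilon for float32
print(np.finfo(np.float32).resolution)  # approximate decimal resolution
print(np.finfo(np.float64).eps)         # much finer for float64
print(np.spacing(np.float32(1.0)))      # gap to the next float32 above 1.0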

Test for statistically significant difference between two arrays

I have two 2-D arrays with the same shape (105, 234), named A and B, essentially comprised of mean values from other arrays. I am familiar with Python's scipy package, but I can't seem to find a way to test whether the two arrays are statistically significantly different at each individual array index. I'm thinking this is just a large 2-D paired t-test, but I'm having difficulty. Any ideas or other packages to use?
If we assume that the underlying variance for each mean at the gridpoints is the same, and the number of observations is the same or is known, then we can use the arrays of means to estimate the standard deviation of the means directly.
Dividing the difference at each gridpoint by the standard deviation then gives t-distributed random variables that can be tested directly, i.e. the p-values can be calculated.
Since we are testing many points, we run into a multiple testing problem (http://en.wikipedia.org/wiki/Multiple_comparisons#Large-scale_multiple_testing) and the p-values should be corrected.
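A minimal sketch of that per-gridpoint test, assuming a known common standard deviation sigma and n observations behind each mean (you would have to supply both); with n that large the t distribution is effectively normal, and the correction step uses statsmodels' multipletests with Benjamini-Hochberg as one possible choice:

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def gridpoint_tests(A, B, sigma, n, alpha=0.05):
    # Standard error of the difference of two means, each based on n
    # observations with common standard deviation sigma.
    se = sigma * np.sqrt(2.0 / n)
    z = (A - B) / se
    # Two-sided p-values from the normal approximation to the t distribution.
    pvals = 2.0 * stats.norm.sf(np.abs(z))
    # Correct for testing all 105*234 gridpoints at once (FDR control here).
    reject, pvals_corr, _, _ = multipletests(pvals.ravel(), alpha=alpha,
                                             method='fdr_bh')
    return pvals, pvals_corr.reshape(A.shape), reject.reshape(A.shape)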
If your question is "Do the two two-dimensional distributions differ?", see Numerical Recipes p. 763 (and ask further about how to do that in numpy / scipy).
You might also ask on stats.stackexchange.
I assume that x,y coordinates do not matter and we just have the two huge sets of independent measurements.
One possible approach would be to compute the standard deviation of the mean for each array, multiply this value by the Student coefficient (probably around 1.645 for your astronomic number of samples and a 95% confidence level), and obtain the confidence ranges around the means this way. If the confidence ranges of the two arrays overlap, the difference between them is not significant. Formulas can be found here.
Go to MS Excel. If you don't have it, your workplace probably does, and there are alternatives.
Enter the arrays of numbers in an Excel worksheet and run the formula =TTEST(array1, array2, tails, type) in the entry field (tails: 1 for one-tailed, 2 for two-tailed; type: 1 for a paired test)...easy peasy. It's a simple Student's t-test, and I believe you may still need a t-table to interpret the statistic (internet). Yet it's quick for on-the-fly comparison of samples.
