Numpy random integer generator not covering full interval?

Numpy random integer generator not covering full interval? - python

When generating random integers over (almost) the full interval allowed by int64, the generated integers seem to be generated on a smaller range. I'm using the following code:
import numpy
def randGenerationTest(n_gens=100000):
min_int = 2**63
max_int = 0
for _ in range(n_gens) :
randMatrix = numpy.random.randint(low=1, high = 2**63, size=(1000,1000))
a = randMatrix.min()
b = randMatrix.max()
if a < min_int:
min_int = a
if b > max_int :
max_int = b
return min_int, max_int
Which is returning the following:
randomGenerationTest()
>>> (146746577, 9223372036832037133)
I agree that [1, 146746577] represents just a tiny fraction of the full range I'm trying to get, but in 1e11 random integers generated in the range of [1,2^63), I should have come just once near to my boundaries?
Is this expected behavior when using too large intervals?
Or is it cause as a human I can not grasp how enormous these intervals are and that I am already "near enough"?
By the way, this was just to know if the Seed can be randomly set from 1 to 1e63, as it is possible to set it manually to any of those values.

You're generating 10^3 * 10^3 * 10^5 = 10^11 values. 2^63 / 10^11 ~= 10e+08. You're not even close to filling out the space of values. As a rough back of the hand calculation, if you're sampling 1/10^n elements of a uniform space, the min and max of the sample being ~n order of magnitude from the maximal and minimal element seems pretty reasonable.

The difference of your max. number 9223372036832037133 to the upper boundary of the interval 2**63 - 1 is 22738674. That's only about 2.46e-12 of the full range. The same holds for the min. value 146746577 which has a distance to the lower boundary of about 1.59e-11 relative to the full range of the interval. That means you covered more than 99.999999999% of the interval's range, i.e. pretty much everything.

Related

Been using rand.int for a while and seeing unexpected results

I've been running some code for an hour or so using a rand.int function, where the code models a dice's roll, where the dice has ten faces, and you have to roll it six times in a row, and each time it has to roll the same number, and it is tracking how many tries it takes for this to happen.
success = 0
times = 0
count = 0
total = 0
for h in range(0,100):
for i in range(0,10):
times = 0
while success == 0:
numbers = [0,0,0,0,0,0,0,0,0,0]
for j in range(0,6):
x = int(random.randint(0,9))
numbers[x] = 1
count = numbers.count(1)
if count == 1:
success = 1
else:
times += 1
print(i)
total += times
success = 0
randtst = open("RandomTesting.txt", "a" )
randtst.write(str(total / 10)+"\n")
randtst.close()
And running this code, this has been going into a file, the contents of which is below
https://pastebin.com/7kRK1Z5f
And taking the average of these numbers using
newtotal = 0
totalamounts = 0
with open ('RandomTesting.txt', 'rt') as rndtxt:
for myline in rndtxt: ,
newtotal += float(myline)
totalamounts += 1
print(newtotal / totalamounts)
Which returns 742073.7449342106. This number is incorrect, (I think) as this is not near to 10^6. I tried getting rid of the contents and doing it again, but to no avail, the number is nowhere near 10^6. Can anyone see a problem with this?
Note: I am not asking for fixes to the code or anything, I am asking whether something has gone wrong to get the above number rather that 100,000

There are several issues working against you here. Bottom line up front:
your code doesn't do what you described as your intent;
you currently have no yardstick for measuring whether your results agree with the theoretical answer; and
your expectations regarding the correct answer are incorrect.
I felt that your code was overly complex for the task you were describing, so I wrote my own version from scratch. I factored out the basic experiment of rolling six 10-sided dice and checking to see if the outcomes were all equal by creating a list of length 6 comprised of 10-sided die rolls. Borrowing shamelessly from BoarGules' comment, I threw the results into a set—which only stores unique elements—and counted the size of the set. The dice are all the same value if and only if the size of the set is 1. I kept repeating this while the number of distinct elements was greater than 1, maintaining a tally of how many trials that required, and returned the number of trials once identical die rolls were obtained.
That basic experiment is then run for any desired number of replications, with the results placed in a numpy array. The resulting data was processed by numpy and scipy to yield the average number of trials and a 95% confidence interval for the mean. The confidence interval uses the estimated variability of the results to construct a lower and an upper bound for the mean. The bounds produced this way should contain the true mean for 95% of estimates generated in this way if the underlying assumptions are met, and address the second point in my BLUF.
Here's the code:
import random
import scipy.stats as st
import numpy as np
NUM_DIGITS = 6
SAMPLE_SIZE = 1000
def expt():
num_trials = 1
while(len(set([random.randrange(10) for _ in range(NUM_DIGITS)])) > 1):
num_trials += 1
return num_trials
data = np.array([expt() for _ in range(SAMPLE_SIZE)])
mu_hat = np.mean(data)
ci = st.t.interval(alpha=0.95, df=SAMPLE_SIZE-1, loc=mu_hat, scale=st.sem(data))
print(mu_hat, ci)
The probability of producing 6 identical results of a particular value from a 10-sided die is 10-6, but there are 10 possible particular values so the overall probability of producing all duplicates is 10*10-6, or 10-5. Consequently, the expected number of trials until you obtain a set of duplicates is 105. The code above took a little over 5 minutes to run on my computer, and produced 102493.559 (96461.16185897154, 108525.95614102845) as the output. Rounding to integers, this means that the average number of trials was 102493 and we're 95% confident that the true mean lies somewhere between 96461 and 108526. This particular range contains 105, i.e., it is consistent with the expected value. Rerunning the program will yield different numbers, but 95% of such runs should also contain the expected value, and the handful that don't should still be close.

Might I suggest if you're working with whole integers that you should be receiving a whole number back instead of a floating point(if I'm understanding what you're trying to do.).
##randtst.write(str(total / 10)+"\n") Original
##randtst.write(str(total // 10)+"\n")
Using a floor division instead of a division sign will round down the number to a whole number which is more idea for what you're trying to do.
If you ARE using floating point numbers, perhaps using the % instead. This will not only divide your number, but also ONLY returns the remainder.
% is Modulo in python
// is floor division in python
Those signs will keep your numbers stable and easier to work if your total returns a floating point integer.
If this isn't the case, you will have to account for every number behind the decimal to the right of it.
And if this IS the case, your result will never reach 10x^6 because the line for totalling your value is stuck in a loop.
I hope this helps you in anyway and if not, please let me know as I'm also learning python.

Compute a list of rounded proportions

Required:
[10,20,-30] -> [1,2,-3]
[19,-14,15] -> [2,-1,2]
[-1.09,-0.92,0.02] -> [-109,-92,2]
[501.6545,-1857.1,897.543] -> [5,-19,9]
The number closest to zero in each input set should be a single digit number in the output. The proportions must be kept approximately constant, rounding errors accepted.
Context: Converting the number of shares of securities to buy from a model to round lots of 100 using the smallest orders possible.
I can brute force this in a non-pythonic way but I'm looking for pointers on Python functions to use. My background is Java.

In Python you would use numpy for such calculations. I would suggest an algorithm like this:
def process(array):
order_of_magnitude = np.floor(np.log10(np.min(np.abs(array))))
return np.round(array*10**(-order_of_magnitude))
Explanation:
Find the order of magnitude of the smallest element in the array (regardless of sign).
Scale every element (or up) according to this.
Round the result
You will need to install numpy for this. For example with pip or via your linux distribution.
Turn your lists into numpy arrays like this:
array = np.array(your_list)

Ignoring your examples, I implemented the requirement
The number closest to zero in each input set should be a single digit number in the output. The proportions must be kept approximately constant, rounding errors accepted.
This algorithm normalizes the data by the absolute value of the value closest to zero, and multiplies that result by 9 to keep the smallest number one-digit, thus minimizing the subsequent rounding error.
def normalize(l):
import numpy as np
m = np.min(np.abs(l))
return np.round(l / m * 9).astype(int)

Here is the correct answer based on #user8408080 answer.
import numpy as np
def process(array):
order_of_magnitude = np.floor(np.log10(np.min(np.abs(array)))).astype(int).item()
return np.round(np.asarray(array)*10**(-order_of_magnitude)).astype(int)

How to calculate the minrun length for timsort in Python

I am fairly new to Python and am attempting to write a timsort re-implementation. While writing the program I have had trouble working out how to get the minrun length. The sources I have consulted have described identifying minrun as:
minrun = N/minrun<=2^N
where n is the size of the array.
I understand what I am trying to do, I just dont know how i could do it in Python?
Any ideas or sample code would be very useful, thanks!

In the wikipedia trimsort-article the built in implementation of the python timsort is described:
Minrun is chosen from the range 32 to 64 inclusive, such that the size of the data, divided by minrun, is equal to, or slightly less than, a power of two. The final algorithm takes the six most significant bits of the size of the array, adds one if any of the remaining bits are set, and uses that result as the minrun. This algorithm works for all arrays, including those smaller than 64; for arrays of size 63 or less, this sets minrun equal to the array size and Timsort reduces to an insertion sort.
You could do it like this:
minrun = length
remaining_bits = length.bit_length() - 6
if remaining_bits > 0:
minrun = length >> remaining_bits
mask = (1 << remaining_bits) - 1
if (length & mask) > 0: minrun += 1
That should be it, for any other timsort questions make sure to take a look at the the python timsort.

How to randomly generate really small numbers?

Suppose I want to draw a random number in the range [10^-20, 0.1], how do I do that?
If I use numpy.random.uniform, I don't seem to go lower than 10^-2:
In [2]: np.random.uniform(0.1, 10**(-20))
Out[2]: 0.02506361878539856
In [3]: np.random.uniform(0.1, 10**(-20))
Out[3]: 0.04035553250149768
In [4]: np.random.uniform(0.1, 10**(-20))
Out[4]: 0.09801074888377342
In [5]: np.random.uniform(0.1, 10**(-20))
Out[5]: 0.09778150831277296
In [6]: np.random.uniform(0.1, 10**(-20))
Out[6]: 0.08486347093110456
In [7]: np.random.uniform(0.1, 10**(-20))
Out[7]: 0.04206753781952958
Alternatively I could generate an array instead like:
In [44]: fac = np.linspace(10**(-20),10**(-1),100)
In [45]: fac
Out[45]:
array([ 1.00000000e-20, 1.01010101e-03, 2.02020202e-03,
3.03030303e-03, 4.04040404e-03, 5.05050505e-03,
6.06060606e-03, 7.07070707e-03, 8.08080808e-03,
9.09090909e-03, 1.01010101e-02, 1.11111111e-02,
1.21212121e-02, 1.31313131e-02, 1.41414141e-02,
1.51515152e-02, 1.61616162e-02, 1.71717172e-02,
1.81818182e-02, 1.91919192e-02, 2.02020202e-02,
2.12121212e-02, 2.22222222e-02, 2.32323232e-02,
2.42424242e-02, 2.52525253e-02, 2.62626263e-02,
2.72727273e-02, 2.82828283e-02, 2.92929293e-02,
3.03030303e-02, 3.13131313e-02, 3.23232323e-02,
3.33333333e-02, 3.43434343e-02, 3.53535354e-02,
3.63636364e-02, 3.73737374e-02, 3.83838384e-02,
3.93939394e-02, 4.04040404e-02, 4.14141414e-02,
4.24242424e-02, 4.34343434e-02, 4.44444444e-02,
4.54545455e-02, 4.64646465e-02, 4.74747475e-02,
4.84848485e-02, 4.94949495e-02, 5.05050505e-02,
5.15151515e-02, 5.25252525e-02, 5.35353535e-02,
5.45454545e-02, 5.55555556e-02, 5.65656566e-02,
5.75757576e-02, 5.85858586e-02, 5.95959596e-02,
6.06060606e-02, 6.16161616e-02, 6.26262626e-02,
6.36363636e-02, 6.46464646e-02, 6.56565657e-02,
6.66666667e-02, 6.76767677e-02, 6.86868687e-02,
6.96969697e-02, 7.07070707e-02, 7.17171717e-02,
7.27272727e-02, 7.37373737e-02, 7.47474747e-02,
7.57575758e-02, 7.67676768e-02, 7.77777778e-02,
7.87878788e-02, 7.97979798e-02, 8.08080808e-02,
8.18181818e-02, 8.28282828e-02, 8.38383838e-02,
8.48484848e-02, 8.58585859e-02, 8.68686869e-02,
8.78787879e-02, 8.88888889e-02, 8.98989899e-02,
9.09090909e-02, 9.19191919e-02, 9.29292929e-02,
9.39393939e-02, 9.49494949e-02, 9.59595960e-02,
9.69696970e-02, 9.79797980e-02, 9.89898990e-02,
1.00000000e-01])
and pick a random element from that array, but wanted to clarify anyway if the first option is possible since I could be probably missing something obvious.

You need to think closely about what you're doing. You're asking for a uniform distribution between almost 0.0 and 0.1. The average result would be 0.05. Which is exactly what you're getting. It seems you want a random distribution of the exponents.
The following might do what you want:
import random
def rnd():
exp = random.randint(-19, -1)
significand = 0.9 * random.random() + 0.1
return significand * 10**exp
[rnd() for _ in range(20)]
The lowest possible value is when exp=-19 and significand=0.1 giving 0.1*10**-19 = 1**-20. And the highest possible value is when exp=-1 and significand=1.0 giving 1.0*10**-1 = 0.1.
Note: Technically, the significand can only aprach 1.0 as random() is bounded to [0.0, 1.0), i.e., including 0.0, but excluding 1.0.
Output:
[2.3038280595190108e-11,
0.02658855644891981,
4.104572641101877e-11,
3.638231824527544e-19,
6.220040206106022e-17,
7.207472203268789e-06,
6.244626749598619e-17,
2.299282102612733e-18,
0.0013251357609258432,
3.118805901868378e-06,
6.585606992344938e-05,
0.005955900790586139,
1.72779538837876e-08,
7.556972406280229e-13,
3.887023124444594e-15,
0.0019965330694999488,
1.7732147730252207e-08,
8.920398286274208e-17,
4.4422869312622194e-08,
2.4815949527034027e-18]
See "scientific notation" on wikipedia for definition of significand and exponent.

As per the numpy documentation:
low : float or array_like of floats, optional
Lower boundary of the output interval. All values generated will be greater than or equal to low. The default value is 0.
With that in mind, decreasing the value of low will produce lower numbers
>>> np.random.uniform(0.00001, 10**(-20))
6.390804027773046e-06

How about generating a random number between 1 and 10 000,
then divide that number by 100 000.

Since you want to keep a uniform distribution and avoid problems related to float representation, just draw 20 integers uniformly between 0 and 9 and "build" your result with base 10 representation (you'll still have a uniform distribution):
result = 0
digits = np.random.randint(0,10,20)
for idx,digit in enumerate(digits):
result += digit*(10**idx)
This will give you a number between 0 and 10**19 -1. You can just interpret the result differently to get what you want.

The likelyhood of a random number less than 10^-20 arising if you generate uniform random numbers in the range [0,0.1] is one in 10^-19. It will probably never happen. However, if you have to make sure that it cannot happen (maybe because a smaller number will crash your code), then simply generate your uniform random numbers in the range [0,0.1], test them, and reject any that are too small by replacing them with another uniform random number out of the same generator and re-testing. This replaces "very unlikely" by "certain never to happen".
This technique is more commonly encountered in Monte-Carlo simulations where you wish to randomly sample f(x,y) or f(x,y,z) where the coordinates (x,y[,z]) must be within some area or volume with a complicated definition, for example, the inside of a complex mechanical component. The technique is the same. Establish bounding ranges [xlow, xhigh], [ylow, yhigh] ... and generate a uniformly distributed random coordinate within this bounding box. Then check whether this random location is within the area / volume to be sampled. If not, generate another random tuple and re-check.

Use of Inf on Matlab

I am currently translating a MATLAB program into Python. I successfully ported all the previous vector operations using numpy. However I am stuck in the following bit of code which is a cosine similarity measure.
% W and ind are different sized matrices
dist = full(W * (W(ind2(range),:)' - W(ind1(range),:)' + W(ind3(range),:)'));
for i=1:length(range)
dist(ind1(range(i)),i) = -Inf;
dist(ind2(range(i)),i) = -Inf;
dist(ind3(range(i)),i) = -Inf;
end
disp(dist)
[~, mx(range)] = max(dist);
I did not understand the following part.
dist(indx(range(i)),i) = -Inf;
What actuality is happening when you use
= -Inf;
on the right side?

In Matlab (see: Inf):
Inf returns the IEEE® arithmetic representation for positive infinity.
So Inf produces a value that is greater than all other numeric values. -Inf produces a value that is guaranteed to be less than any other numeric value. It's generally used when you want to iteratively find a maximum and need a first value to compare to that's always going to be less than your first comparison.
According to Wikipedia (see: IEEE 754 Inf):
Positive and negative infinity are represented thus:
sign = 0 for positive infinity, 1 for negative infinity.
biased exponent = all 1 bits.
fraction = all 0 bits.
Python has the same concept using '-inf' (see Note 6 here):
float also accepts the strings “nan” and “inf” with an optional prefix “+” or “-” for Not a Number (NaN) and positive or negative infinity.
>>> a=float('-inf')
>>> a
-inf
>>> b=-27983.444
>>> min(a,b)
-inf

It just assigns a minus infinity value to the left-hand side.
It may appear weird to assign that value, particularly because a distance cannot be negative. But it looks like it's used for effectively removing those entries from the max computation in the last line.
If Python doesn't have "infinity" (I don't know Python) and if dist is really a distance (hence nonnegative) , you could use any negative value instead of -inf to achieve the same effect, namely remove those entries from the max computation.

The -Inf is typically used to initialize a variable such that you later can use it to in a comparison in a loop.
For instance if I want to find the the maximum value in a function (and have forgotten the command max). Then I would have made something like:
function maxF = findMax(f,a,b)
maxF = -Inf;
x = a:0.001:b;
for i = 1:length(x)
if f(x) > maxF
maxF = f(x);
end
end
It is a method in matlab to make sure that any other value is larger than the current. So the comparison in Python would be -sys.maxint +1.
See for instance:
Maximum and Minimum values for ints

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Numpy random integer generator not covering full interval? - python

Related

Been using rand.int for a while and seeing unexpected results

Compute a list of rounded proportions

How to calculate the minrun length for timsort in Python

How to randomly generate really small numbers?

Use of Inf on Matlab

Categories

Resources