Suppose I want to draw a random number in the range [10^-20, 0.1], how do I do that?
If I use numpy.random.uniform, I don't seem to go lower than 10^-2:
In [2]: np.random.uniform(0.1, 10**(-20))
Out[2]: 0.02506361878539856
In [3]: np.random.uniform(0.1, 10**(-20))
Out[3]: 0.04035553250149768
In [4]: np.random.uniform(0.1, 10**(-20))
Out[4]: 0.09801074888377342
In [5]: np.random.uniform(0.1, 10**(-20))
Out[5]: 0.09778150831277296
In [6]: np.random.uniform(0.1, 10**(-20))
Out[6]: 0.08486347093110456
In [7]: np.random.uniform(0.1, 10**(-20))
Out[7]: 0.04206753781952958
Alternatively I could generate an array instead like:
In [44]: fac = np.linspace(10**(-20),10**(-1),100)
In [45]: fac
Out[45]:
array([ 1.00000000e-20, 1.01010101e-03, 2.02020202e-03,
3.03030303e-03, 4.04040404e-03, 5.05050505e-03,
6.06060606e-03, 7.07070707e-03, 8.08080808e-03,
9.09090909e-03, 1.01010101e-02, 1.11111111e-02,
1.21212121e-02, 1.31313131e-02, 1.41414141e-02,
1.51515152e-02, 1.61616162e-02, 1.71717172e-02,
1.81818182e-02, 1.91919192e-02, 2.02020202e-02,
2.12121212e-02, 2.22222222e-02, 2.32323232e-02,
2.42424242e-02, 2.52525253e-02, 2.62626263e-02,
2.72727273e-02, 2.82828283e-02, 2.92929293e-02,
3.03030303e-02, 3.13131313e-02, 3.23232323e-02,
3.33333333e-02, 3.43434343e-02, 3.53535354e-02,
3.63636364e-02, 3.73737374e-02, 3.83838384e-02,
3.93939394e-02, 4.04040404e-02, 4.14141414e-02,
4.24242424e-02, 4.34343434e-02, 4.44444444e-02,
4.54545455e-02, 4.64646465e-02, 4.74747475e-02,
4.84848485e-02, 4.94949495e-02, 5.05050505e-02,
5.15151515e-02, 5.25252525e-02, 5.35353535e-02,
5.45454545e-02, 5.55555556e-02, 5.65656566e-02,
5.75757576e-02, 5.85858586e-02, 5.95959596e-02,
6.06060606e-02, 6.16161616e-02, 6.26262626e-02,
6.36363636e-02, 6.46464646e-02, 6.56565657e-02,
6.66666667e-02, 6.76767677e-02, 6.86868687e-02,
6.96969697e-02, 7.07070707e-02, 7.17171717e-02,
7.27272727e-02, 7.37373737e-02, 7.47474747e-02,
7.57575758e-02, 7.67676768e-02, 7.77777778e-02,
7.87878788e-02, 7.97979798e-02, 8.08080808e-02,
8.18181818e-02, 8.28282828e-02, 8.38383838e-02,
8.48484848e-02, 8.58585859e-02, 8.68686869e-02,
8.78787879e-02, 8.88888889e-02, 8.98989899e-02,
9.09090909e-02, 9.19191919e-02, 9.29292929e-02,
9.39393939e-02, 9.49494949e-02, 9.59595960e-02,
9.69696970e-02, 9.79797980e-02, 9.89898990e-02,
1.00000000e-01])
and pick a random element from that array, but I wanted to clarify whether the first option is possible anyway, since I'm probably missing something obvious.
You need to think carefully about what you're asking for. You're asking for a uniform distribution between almost 0.0 and 0.1. The average result would be 0.05, which is exactly what you're getting. It seems what you actually want is for the exponent to be uniformly distributed.
The following might do what you want:
import random

def rnd():
    # pick the exponent (decade) uniformly, then a significand in [0.1, 1.0)
    exp = random.randint(-19, -1)
    significand = 0.9 * random.random() + 0.1
    return significand * 10**exp

[rnd() for _ in range(20)]
The lowest possible value occurs when exp=-19 and significand=0.1, giving 0.1*10**-19 = 10**-20. The highest possible value occurs when exp=-1 and significand=1.0, giving 1.0*10**-1 = 0.1.
Note: Technically, the significand can only approach 1.0, since random() is bounded to [0.0, 1.0), i.e., it includes 0.0 but excludes 1.0.
Output:
[2.3038280595190108e-11,
0.02658855644891981,
4.104572641101877e-11,
3.638231824527544e-19,
6.220040206106022e-17,
7.207472203268789e-06,
6.244626749598619e-17,
2.299282102612733e-18,
0.0013251357609258432,
3.118805901868378e-06,
6.585606992344938e-05,
0.005955900790586139,
1.72779538837876e-08,
7.556972406280229e-13,
3.887023124444594e-15,
0.0019965330694999488,
1.7732147730252207e-08,
8.920398286274208e-17,
4.4422869312622194e-08,
2.4815949527034027e-18]
See "scientific notation" on wikipedia for definition of significand and exponent.
As per the numpy documentation:
low : float or array_like of floats, optional
Lower boundary of the output interval. All values generated will be greater than or equal to low. The default value is 0.
With that in mind, decreasing the value of low will produce lower numbers:
>>> np.random.uniform(0.00001, 10**(-20))
6.390804027773046e-06
How about generating a random integer between 1 and 10,000, then dividing that number by 100,000?
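A minimal sketch of that suggestion (note that this only reaches down to 1e-5, not 1e-20):

import random

# uniform over {0.00001, 0.00002, ..., 0.1}
x = random.randint(1, 10_000) / 100_000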
Since you want to keep a uniform distribution and avoid problems related to float representation, just draw 20 integers uniformly between 0 and 9 and "build" your result with base 10 representation (you'll still have a uniform distribution):
import numpy as np

result = 0
digits = np.random.randint(0, 10, 20)
for idx, digit in enumerate(digits):
    # use a Python int so the accumulation stays exact (10**19 does not fit in int64)
    result += int(digit) * 10**idx
This will give you an integer between 0 and 10**20 - 1. You can then reinterpret the result to get what you want.
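For example, one possible (hypothetical) reinterpretation is to scale the integer into [0, 0.1) using the standard-library decimal module, which keeps all 20 digits exact:

from decimal import Decimal

# Shift the 20-digit integer down by 21 decimal places, giving an exact
# decimal value in [0, 0.1) with a resolution of 1e-21.
value = Decimal(result) / Decimal(10) ** 21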
The likelihood of a random number less than 10^-20 arising when you generate uniform random numbers in the range [0, 0.1] is one in 10^19. It will probably never happen. However, if you have to make sure that it cannot happen (maybe because a smaller number will crash your code), then simply generate your uniform random numbers in the range [0, 0.1], test them, and reject any that are too small by replacing them with another uniform random number from the same generator and re-testing. This replaces "very unlikely" with "certain never to happen".
This technique is more commonly encountered in Monte-Carlo simulations where you wish to randomly sample f(x,y) or f(x,y,z) where the coordinates (x,y[,z]) must be within some area or volume with a complicated definition, for example, the inside of a complex mechanical component. The technique is the same. Establish bounding ranges [xlow, xhigh], [ylow, yhigh] ... and generate a uniformly distributed random coordinate within this bounding box. Then check whether this random location is within the area / volume to be sampled. If not, generate another random tuple and re-check.
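A minimal NumPy sketch of that accept/reject idea (the function name and default bounds are my own choices):

import numpy as np

rng = np.random.default_rng()

def uniform_with_floor(low=1e-20, high=0.1, size=1):
    # Draw uniformly in [0, high) and redraw any values below `low`.
    # With low = 1e-20 a rejection is astronomically unlikely, so the
    # loop essentially never runs more than once.
    out = rng.uniform(0.0, high, size)
    bad = out < low
    while bad.any():
        out[bad] = rng.uniform(0.0, high, bad.sum())
        bad = out < low
    return out

print(uniform_with_floor(size=5))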
Related
I happen to have a numpy array of floats:
a.dtype, a.shape
#(dtype('float64'), (32769,))
The values are:
a[0]
#3.699822718929953
all(a == a[0])
True
However:
a.mean()
3.6998227189299517
The mean is off in the 15th and 16th significant figures.
Can anybody show how this difference accumulates over a mean of ~30K elements, and whether there is a way to avoid it?
In case it matters, my OS is 64-bit.
Here is a rough approximation of a bound on the maximum error. This will not be representative of average error, and it could be improved with more analysis.
Consider calculating a sum using floating-point arithmetic with round-to-nearest ties-to-even:
sum = 0;
for (i = 0; i < n; ++i)
    sum += a[i];
where each a[i] is in [0, m).
Let ULP(x) denote the unit of least precision in the floating-point number x. (For example, in the IEEE-754 binary64 format with 53-bit significands, if the largest power of 2 not greater than |x| is 2^p, then ULP(x) = 2^(p−52).) With round-to-nearest, the maximum error in any operation with result x is ½ULP(x).
If we neglect rounding errors, the maximum value of sum after i iterations is i•m. Therefore, a bound on the error in the addition in iteration i is ½ULP(i•m). (Actually zero for i=1, since that case adds to zero, which has no error, but we neglect that for this approximation.) Then the total of the bounds on all the additions is the sum of ½ULP(i•m) for i from 1 to n. This is approximately ½•n•(n+1)/2•ULP(m) = ¼•n•(n+1)•ULP(m). (This is an approximation because it moves i outside the ULP function, but ULP is a discontinuous function. It is "approximately linear," but there are jumps. Since the jumps are by factors of two, the approximation can be off by at most a factor of two.)
So, with 32,769 elements, we can say the total rounding error will be at most about ¼•32,769•32,770•ULP(m), about 2.7•10^8 times the ULP of the maximum element value. The ULP is 2^−52 times the greatest power of two not less than m, so that is about 2.7•10^8•2^−52 = 6•10^−8 times m.
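A quick numeric check of that bound, using np.spacing as a stand-in for ULP and taking m = 4.0 as a convenient power-of-two bound on the element values (both of those choices are assumptions on my part):

import numpy as np

n = 32769
m = 4.0                           # assumed upper bound on the element values
ulp_m = np.spacing(m)             # ULP(4.0) in float64, i.e. 2**-50
bound = 0.25 * n * (n + 1) * ulp_m
print(bound)                      # roughly 2.4e-07, consistent with ~6e-08 times m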
Of course, the likelihood that 32,768 sums (not 32,769 because the first necessarily has no error) all round in the same direction by chance is vanishingly small but I conjecture one might engineer a sequence of values that gets close to that.
An Experiment
Here is a chart of (in blue) the mean error over 10,000 samples of summing arrays with sizes 100 to 32,800 by 100s and elements drawn randomly from a uniform distribution over [0, 1). The error was calculated by comparing the sum calculated with float (IEEE-754 binary32) to that calculated with double (IEEE-754 binary64). (The samples were all multiples of 2^−24, and double has enough precision so that the sum of up to 2^29 such values is exact.)
The green line is c·n·√n with c set to match the last point of the blue line. We see it tracks the blue line over the long term. At points where the average sum crosses a power of two, the mean error increases faster for a time. At these points, the sum has entered a new binade, and further additions have higher average errors due to the increased ULP. Over the course of the binade, this fixed ULP decreases relative to n, bringing the blue line back to the green line.
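A rough way to reproduce that experiment (the naive left-to-right loop and the smaller sample count are my choices; the chart itself is not reproduced here):

import numpy as np

rng = np.random.default_rng(0)

def mean_abs_error(n, samples=100):
    # Error of a naive float32 running sum relative to a float64 reference,
    # averaged over `samples` random arrays of length n with values in [0, 1).
    errs = np.empty(samples)
    for k in range(samples):
        a = rng.random(n, dtype=np.float32)
        s = np.float32(0.0)
        for x in a:               # naive accumulation, as in the loop above
            s = s + x
        errs[k] = abs(float(s) - a.astype(np.float64).sum())
    return errs.mean()

for n in (100, 1000, 10000):
    print(n, mean_abs_error(n))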
This is due to the inability of the float64 type to store the sum of your float numbers with full precision. To get around this problem you need to use a larger data type, of course*. NumPy has a longdouble dtype that you can use in such cases:
In [23]: np.mean(a, dtype=np.longdouble)
Out[23]: 3.6998227189299530693
Also, note:
In [25]: print(np.longdouble.__doc__)
Extended-precision floating-point number type, compatible with C
``long double`` but not necessarily with IEEE 754 quadruple-precision.
Character code: ``'g'``.
Canonical name: ``np.longdouble``.
Alias: ``np.longfloat``.
Alias *on this platform*: ``np.float128``: 128-bit extended-precision floating-point number type.
* read the comments for more details.
The mean is (by definition):
a.sum()/a.size
Unfortunately, adding all those values up and dividing accumulates floating point errors. They are usually around the magnitude of:
np.finfo(np.float64).eps
Out[]: 2.220446049250313e-16
Yeah, e-16, about where you get them. You can make the error smaller by using higher-accuracy floats like float128 (if your system supports it), but the errors will always accumulate whenever you sum a large number of floats together. If you truly want the identity, you'll have to hardcode it:
def mean_(arr):
    if np.all(arr == arr[0]):
        return arr[0]
    else:
        return arr.mean()
In practice, you never really want to use == between floats. Generally in numpy we use np.isclose or np.allclose to compare floats for exactly this reason. There are ways around it using other packages and leveraging arcane machine-level methods of calculating numbers to get (closer to) exact equality, but it's rarely worth the performance and clarity hit.
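For example, with a hypothetical array like the one in the question:

import numpy as np

a = np.full(32769, 3.699822718929953)
print(a.mean() == a[0])            # may well be False, as the question shows
print(np.isclose(a.mean(), a[0]))  # True: equal within a small relative tolerance
print(np.allclose(a, a[0]))        # True: every element is close to a[0]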
When generating random integers over (almost) the full interval allowed by int64, the generated integers seem to be generated on a smaller range. I'm using the following code:
import numpy
def randGenerationTest(n_gens=100000):
    min_int = 2**63
    max_int = 0
    for _ in range(n_gens):
        randMatrix = numpy.random.randint(low=1, high=2**63, size=(1000, 1000))
        a = randMatrix.min()
        b = randMatrix.max()
        if a < min_int:
            min_int = a
        if b > max_int:
            max_int = b
    return min_int, max_int
Which is returning the following:
randGenerationTest()
>>> (146746577, 9223372036832037133)
I agree that [1, 146746577] represents just a tiny fraction of the full range I'm trying to cover, but over the 1e11 random integers generated in the range [1, 2^63), shouldn't I have come close to my boundaries at least once?
Is this expected behavior when using such large intervals?
Or is it because, as a human, I simply cannot grasp how enormous these intervals are, and I am already "near enough"?
By the way, this was just to check whether the seed can be randomly set to anything from 1 up to 2^63, since it is possible to set it manually to any of those values.
You're generating 10^3 * 10^3 * 10^5 = 10^11 values, and 2^63 / 10^11 ≈ 10^8, so you're not even close to filling out the space of values. As a rough back-of-the-envelope calculation, when you draw k values uniformly from a range of size M, the expected gap between the sample minimum (or maximum) and the corresponding boundary is about M / k; here that is roughly 10^19 / 10^11 = 10^8, which is the order of magnitude you observed.
The difference between your maximum value 9223372036832037133 and the upper boundary of the interval, 2**63 - 1, is 22738674. That's only about 2.46e-12 of the full range. The same holds for the minimum value 146746577, whose distance to the lower boundary is about 1.59e-11 of the full range. That means you covered about 99.999999998% of the interval's range, i.e. pretty much everything.
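The arithmetic behind those figures:

full_range = 2**63 - 2                          # size of [1, 2**63 - 1]
top_gap = (2**63 - 1) - 9223372036832037133     # 22738674
bottom_gap = 146746577 - 1
print(top_gap / full_range)                     # ~2.46e-12
print(bottom_gap / full_range)                  # ~1.59e-11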
For an introduction to Python course, I'm looking at generating a random floating point number in Python, and I have seen a standard recommended code of
import random
lower = 5
upper = 10
range_width = upper - lower
x = random.random() * range_width + lower
for a random floating-point number from 5 up to, but not including, 10.
It seems to me that the same effect could be achieved by:
import random
x = random.randrange(5, 10) + random.random()
since that would give an integer of 5, 6, 7, 8, or 9, and then tack a random fractional part onto it.
The question I have is would this second code still give a fully even probability distribution, or would it not keep the full randomness of the first version?
According to the documentation, yes, random() does produce a uniform distribution:
random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator.
So both code examples should be fine. To shorten your code, you can equally do:
random.uniform(5, 10)
Note that uniform(a, b) is simply a + (b - a) * random() so the same as your first example.
The second example depends on the version of Python you're using.
Prior to 3.2, randrange() could produce a slightly uneven distribution.
There is a difference. Your second method is theoretically superior, although in practice it only matters for large ranges. Indeed, both methods will give you a uniform distribution. But only the second method can return all values in the range that are representable as a floating point number.
Since your range is so small, there is no appreciable difference. But still there is a difference, which you can see by considering a larger range. If you take a random real number between 0 and 1, you get a floating-point representation with a given number of bits. Now suppose your range is, say, in the order of 2**32. By multiplying the original random number by this range, you lose 32 bits of precision in the result. Put differently, there will be gaps between the values that this method can return. The gaps are still there when you multiply by 4: You have lost the two least significant bits of the original random number.
The two methods can give different results, but you'll only notice the difference in fairly extreme situations (with very wide ranges). For instance, if you generate random numbers between 0 and 2/sys.float_info.epsilon (9007199254740992.0, or a little more than 9 quadrillion), you'll notice that the version using multiplication will never give you any floats with fractional values. If you increase the maximum bound to 4/sys.float_info.epsilon, you won't get any odd integers, only even ones. That's because the 64-bit floating point type Python uses doesn't have enough precision to represent all integers at the upper end of that range, and it's trying to maintain a uniform distribution (so it omits small odd integers and fractional values even though those can be represented in parts of the range).
The second version of the calculation will give extra precision to the smaller random numbers generated. For instance, if you're generating numbers between 0 and 2/sys.float_info.epsilon and the randrange call returned 0, you can use the full precision of the random call to add a fractional part to the number. On the other hand if the randrange returned the largest number in the range (2/sys.float_info.epsilon - 1), very little of the precision of the fraction would be used (the number will round to the nearest integer without any fractional part remaining).
Adding a fractional value also can't help you deal with ranges that are too large for every integer to be represented. If randrange returns only even numbers, adding a fraction usually won't make odd numbers appear (it can in some parts of the range, but not for others, and the distribution may be very uneven). Even for ranges where all integers can be represented, the odds of a specific floating point number appearing will not be entirely uniform, since the smaller numbers can be more precisely represented. Large but imprecise numbers will be more common than smaller but more precisely represented ones.
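A small demonstration of the claims above (assuming CPython's random module, where random() returns a multiple of 2**-53):

import random
import sys

big = 2 / sys.float_info.epsilon        # 2**53, about 9.007e15
xs = [random.random() * big for _ in range(10_000)]
print(all(x == int(x) for x in xs))     # True: scaling random() by 2**53 only yields whole numbers

ys = [random.randrange(int(big)) + random.random() for _ in range(10_000)]
print(any(y != int(y) for y in ys))     # almost surely True: fractional values do appear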
I was looking at numpy.finfo and did the following:
In [14]: np.finfo(np.float16).resolution
Out[14]: 0.0010004
In [16]: np.array([0., 0.0001], dtype=np.float16)
Out[16]: array([ 0. , 0.00010002], dtype=float16)
It seems that the vector is able to store two numbers such that their difference is 10 times smaller than the type's resolution. Am I missing something?
Floating point numbers have a fixed amount of resolution after the leading digit. What this number tells you is the resolution of the number when its leading digit is at the 1.0 position. You can see this by trying to add smaller amounts to 1.0:
In [8]: np.float16(1) + np.float16(0.001)
Out[8]: 1.001
In [9]: np.float16(1) + np.float16(0.0001)
Out[9]: 1.0
This is related to the nextafter function, which gives the next representable number after the given one. Taking that difference gives approximately this resolution:
In [10]: np.nextafter(np.float16(1), np.float16(np.inf)) - np.float16(1)
Out[10]: 0.00097656
From what I understand, the precision is the number of decimal digits you can have. But since floats are stored with exponents, you can have a number smaller than the resolution. Try np.finfo(np.float16).tiny; it should give you 6.1035e-05, which is way smaller than the resolution. But the significand of that number still only has a relative resolution of ~0.001. Note that all of the limits in finfo are approximate, because the binary representation does not correspond exactly to a decimal limit.
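To see the same thing numerically (reusing the nextafter trick from the other answer):

import numpy as np

print(np.finfo(np.float16).resolution)  # ~0.001: relative resolution near 1.0
print(np.finfo(np.float16).tiny)        # ~6.1e-05: smallest normal float16

# Gap to the next representable float16 above 1.0 vs. above 0.0001
print(np.nextafter(np.float16(1.0), np.float16(np.inf)) - np.float16(1.0))    # ~0.00098
print(np.nextafter(np.float16(1e-4), np.float16(np.inf)) - np.float16(1e-4))  # ~6e-08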
I am currently translating a MATLAB program into Python. I successfully ported all the previous vector operations using numpy. However I am stuck in the following bit of code which is a cosine similarity measure.
% W and ind are different sized matrices
dist = full(W * (W(ind2(range),:)' - W(ind1(range),:)' + W(ind3(range),:)'));
for i = 1:length(range)
    dist(ind1(range(i)),i) = -Inf;
    dist(ind2(range(i)),i) = -Inf;
    dist(ind3(range(i)),i) = -Inf;
end
disp(dist)
[~, mx(range)] = max(dist);
I did not understand the following part.
dist(indx(range(i)),i) = -Inf;
What is actually happening when you use
= -Inf;
on the right side?
In Matlab (see: Inf):
Inf returns the IEEE® arithmetic representation for positive infinity.
So Inf produces a value that is greater than all other numeric values. -Inf produces a value that is guaranteed to be less than any other numeric value. It's generally used when you want to iteratively find a maximum and need a first value to compare to that's always going to be less than your first comparison.
According to Wikipedia (see: IEEE 754 Inf):
Positive and negative infinity are represented thus:
sign = 0 for positive infinity, 1 for negative infinity.
biased exponent = all 1 bits.
fraction = all 0 bits.
Python has the same concept using '-inf' (see Note 6 here):
float also accepts the strings “nan” and “inf” with an optional prefix “+” or “-” for Not a Number (NaN) and positive or negative infinity.
>>> a=float('-inf')
>>> a
-inf
>>> b=-27983.444
>>> min(a,b)
-inf
It just assigns a minus infinity value to the left-hand side.
It may appear weird to assign that value, particularly because a distance cannot be negative. But it looks like it's used for effectively removing those entries from the max computation in the last line.
If Python doesn't have "infinity" (I don't know Python) and if dist is really a distance (hence nonnegative), you could use any negative value instead of -inf to achieve the same effect, namely removing those entries from the max computation.
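A rough NumPy analogue of what the MATLAB snippet is doing (the array names and shapes here are made up):

import numpy as np

dist = np.random.rand(5, 3)
ind1 = np.array([0, 2, 4])                 # one row to exclude per column

# Mask those entries with -inf so they can never win the column-wise max ...
dist[ind1, np.arange(dist.shape[1])] = -np.inf

# ... then take the per-column argmax, which now ignores the masked rows.
mx = dist.argmax(axis=0)
print(mx)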
The -Inf is typically used to initialize a variable so that you can later use it in a comparison in a loop.
For instance, if I want to find the maximum value of a function (and have forgotten the command max), then I would write something like:
function maxF = findMax(f, a, b)
maxF = -Inf;
x = a:0.001:b;
for i = 1:length(x)
    if f(x(i)) > maxF
        maxF = f(x(i));
    end
end
It is a method in MATLAB to make sure that any other value is larger than the current one. The closest equivalent starting value in Python would be float('-inf'), or -sys.maxint - 1 for Python 2 integers.
See for instance:
Maximum and Minimum values for ints