For an introduction to Python course, I'm looking at generating a random floating point number in Python, and I have seen a standard recommended code of
import random
lower = 5
upper = 10
range_width = upper - lower
x = random.random() * range_width + lower
for a random floating point from 5 up to but not including 10.
It seems to me that the same effect could be achieved by:
import random
x = random.randrange(5, 10) + random.random()
Since that would give an integer of 5, 6, 7, 8, or 9, and then tack on a decimal to it.
The question I have is would this second code still give a fully even probability distribution, or would it not keep the full randomness of the first version?
According to the documentation then yes random() is indeed a uniform distribution.
random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator.
So both code examples should be fine. To shorten your code, you can equally do:
random.uniform(5, 10)
Note that uniform(a, b) is simply a + (b - a) * random() so the same as your first example.
The second example depends on the version of Python you're using.
Prior to 3.2 randrange() could produce a slightly uneven distributions.
There is a difference. Your second method is theoretically superior, although in practice it only matters for large ranges. Indeed, both methods will give you a uniform distribution. But only the second method can return all values in the range that are representable as a floating point number.
Since your range is so small, there is no appreciable difference. But still there is a difference, which you can see by considering a larger range. If you take a random real number between 0 and 1, you get a floating-point representation with a given number of bits. Now suppose your range is, say, in the order of 2**32. By multiplying the original random number by this range, you lose 32 bits of precision in the result. Put differently, there will be gaps between the values that this method can return. The gaps are still there when you multiply by 4: You have lost the two least significant bits of the original random number.
The two methods can give different results, but you'll only notice the difference in fairly extreme situations (with very wide ranges). For instance, If you generate random numbers between 0 and 2/sys.float_info.epsilon (9007199254740992.0, or a little more than 9 quintillion), you'll notice that the version using multiplication will never give you any floats with fractional values. If you increase the maximum bound to 4/sys.float_info.epsilon, you won't get any odd integers, only even ones. That's because the 64-bit floating point type Python uses doesn't have enough precision to represent all integers at the upper end of that range, and it's trying to maintain a uniform distribution (so it omits small odd integers and fractional values even though those can be represented in parts of the range).
The second version of the calculation will give extra precision to the smaller random numbers generated. For instance, if you're generating numbers between 0 and 2/sys.float_info.epsilon and the randrange call returned 0, you can use the full precision of the random call to add a fractional part to the number. On the other hand if the randrange returned the largest number in the range (2/sys.float_info.epsilon - 1), very little of the precision of the fraction would be used (the number will round to the nearest integer without any fractional part remaining).
Adding a fractional value also can't help you deal with ranges that are too large for every integer to be represented. If randrange returns only even numbers, adding a fraction usually won't make odd numbers appear (it can in some parts of the range, but not for others, and the distribution may be very uneven). Even for ranges where all integers can be represented, the odds of a specific floating point number appearing will not be entirely uniform, since the smaller numbers can be more precisely represented. Large but imprecise numbers will be more common than smaller but more precisely represented ones.
Related
I happen to have a numpy array of floats:
a.dtype, a.shape
#(dtype('float64'), (32769,))
The values are:
a[0]
#3.699822718929953
all(a == a[0])
True
However:
a.mean()
3.6998227189299517
The mean is off by 15th and 16th figure.
Can anybody show how this difference is accumulated over 30K mean and if there is a way to avoid it?
In case it matters my OS is 64 bit.
Here is a rough approximation of a bound on the maximum error. This will not be representative of average error, and it could be improved with more analysis.
Consider calculating a sum using floating-point arithmetic with round-to-nearest ties-to-even:
sum = 0;
for (i = 0; i < n; ++n)
sum += a[i];
where each a[i] is in [0, m).
Let ULP(x) denote the unit of least precision in the floating-point number x. (For example, in the IEEE-754 binary64 format with 53-bit significands, if the largest power of 2 not greater than |x| is 2p, then ULP(x) = 2p−52. With round-to-nearest, the maximum error in any operation with result x is ½ULP(x).
If we neglect rounding errors, the maximum value of sum after i iterations is i•m. Therefore, a bound on the error in the addition in iteration i is ½ULP(i•m). (Actually zero for i=1, since that case adds to zero, which has no error, but we neglect that for this approximation.) Then the total of the bounds on all the additions is the sum of ½ULP(i•m) for i from 1 to n. This is approximately ½•n•(n+1)/2•ULP(m) = ¼•n•(n+1)•ULP(m). (This is an approximation because it moves i outside the ULP function, but ULP is a discontinuous function. It is “approximately linear,“ but there are jumps. Since the jumps are by factors of two, the approximation can be off by at most a factor of two.)
So, with 32,769 elements, we can say the total rounding error will be at most about ¼•32,769•32,770•ULP(m), about 2.7•108 times the ULP of the maximum element value. The ULP is 2−52 times the greatest power of two not less than m, so that is about 2.7•108•2−52 = 6•10−8 times m.
Of course, the likelihood that 32,768 sums (not 32,769 because the first necessarily has no error) all round in the same direction by chance is vanishingly small but I conjecture one might engineer a sequence of values that gets close to that.
An Experiment
Here is a chart of (in blue) the mean error over 10,000 samples of summing arrays with sizes 100 to 32,800 by 100s and elements drawn randomly from a uniform distribution over [0, 1). The error was calculated by comparing the sum calculated with float (IEEE-754 binary32) to that calculated with double (IEEE-754 binary64). (The samples were all multiples of 2−24, and double has enough precision so that the sum for up to 229 such values is exact.)
The green line is c n √n with c set to match the last point of the blue line. We see it tracks the blue line over the long term. At points where the average sum crosses a power of two, the mean error increases faster for a time. At these points, the sum has entered a new binade, and further additions have higher average errors due to the increased ULP. Over the course of the binade, this fixed ULP decreases relative to n, bringing the blue line back to the green line.
This is due to incapability of float64 type to store the sum of your float numbers with correct precision. In order to get around this problem you need to use a larger data type of course*. Numpy has a longdouble dtype that you can use in such cases:
In [23]: np.mean(a, dtype=np.longdouble)
Out[23]: 3.6998227189299530693
Also, note:
In [25]: print(np.longdouble.__doc__)
Extended-precision floating-point number type, compatible with C
``long double`` but not necessarily with IEEE 754 quadruple-precision.
Character code: ``'g'``.
Canonical name: ``np.longdouble``.
Alias: ``np.longfloat``.
Alias *on this platform*: ``np.float128``: 128-bit extended-precision floating-point number type.
* read the comments for more details.
The mean is (by definition):
a.sum()/a.size
Unfortunately, adding all those values up and dividing accumulates floating point errors. They are usually around the magnitude of:
np.finfo(np.float).eps
Out[]: 2.220446049250313e-16
Yeah, e-16, about where you get them. You can make the error smaller by using higher-accuracy floats like float128 (if your system supports it) but they'll always accumulate whenever you're summing a large number of float together. If you truly want the identity, you'll have to hardcode it:
def mean_(arr):
if np.all(arr == arr[0]):
return arr[0]
else:
return arr.mean()
In practice, you never really want to use == between floats. Generally in numpy we use np.isclose or np.allclose to compare floats for exactly this reason. There are ways around it using other packages and leveraging arcane machine-level methods of calculating numbers to get (closer to) exact equality, but it's rarely worth the performance and clarity hit.
When I run the simple code:
a = np.float64([20269207])
b = np.float32(a)
The output turns to be
a = array([20269207.])
b = array([20269208.], dtype=float32)
What reason causes the difference before and after this conversion? And in what condition the outputs will be different?
It is impossible to store the value 20269207 in the float32 (IEEE 754) format.
You may see, why:
It is possible to store the values 20269206 and 20269208; their representations in binary form are (see IEEE-754 Floating Point Converter):
01001011100110101010010001001011 for 20269206
01001011100110101010010001001100 for 20269208
Their binary forms differ by 1, so there is no place for any number between 20269206 and 20269208.
By the rounding rules “Round to nearest, ties to even” and “Round to nearest, ties away from zero” of IEEE 754, your number is rounded to the nearest even higher number, i.e. to the number 20269208.
Outputs for integer numbers will be different:
for odd numbers with absolute value greater than 16,777,216,
for almost all numbers with absolute value greater than 33,554,432.
Notes:
The first number is 2^24, the second one is 2^25.
"allmost all" - there are "nice" numbers, such as powers of 2, which have precise representations even for very very large numbers.
For the moment, put aside any issues relating to pseudorandom number generators and assume that numpy.random.rand perfectly samples from the discrete distribution of floating point numbers over [0, 1). What are the odds getting at least two exactly identical floating point numbers in the result of:
numpy.random.rand(n)
for any given value of n?
Mathematically, I think this is equivalent to first asking how many IEEE 754 singles or doubles there are in the interval [0, 1). Then I guess the next step would be to solve the equivalent birthday problem? I'm not really sure. Anyone have some insight?
The computation performed by numpy.random.rand for each element generates a number 0.<53 random bits>, for a total of 2^53 equally likely outputs. (Of course, the memory representation isn't a fixed-point 0.stuff; it's still floating point.) This computation is incapable of producing most binary64 floating-point numbers between 0 and 1; for example, it cannot produce 1/2^60. You can see the code in numpy/random/mtrand/randomkit.c:
double
rk_double(rk_state *state)
{
/* shifts : 67108864 = 0x4000000, 9007199254740992 = 0x20000000000000 */
long a = rk_random(state) >> 5, b = rk_random(state) >> 6;
return (a * 67108864.0 + b) / 9007199254740992.0;
}
(Note that rk_random produces 32-bit outputs, regardless of the size of long.)
Assuming a perfect source of randomness, the probability of repeats in numpy.random.rand(n) is 1-(1-0/k)(1-1/k)(1-2/k)...(1-(n-1)/k), where k=2^53. It's probably best to use an approximation instead of calculating this directly for large values of n. (The approximation may even be more accurate, depending on how the approximation error compares to the rounding error accumulated in a direct computation.)
I think you are correct, this is like the birthday problem.
But you need to decide on the number of possible options. You do this by deciding the precision of your floating point numbers.
For example, if you decide to have a precision of 2 numbers after the dot, then there are 100 options(including zero and excluding 1).
And if you have n numbers then the probability of not having a collision is:
or when given R possible numbers and N data points, the probability of no collision is:
And of collision is 1 - P.
This is because the probability of getting any given number is 1/R. And at any point, the probability of a data point not colliding with prior data points is (R-i)/R for i being the index of the data point. But to get the probability of no data points colliding with each other, we need to multiply all the probabilities of data points not colliding with those prior to them. Applying some algebraic operations, we get the equation above.
Inspired by this answer, I wonder why numpy.nextafter gives different results for the smallest positive float number from numpy.finfo(float).tiny and sys.float_info.min:
import numpy, sys
nextafter = numpy.nextafter(0., 1.) # 5e-324
tiny = numpy.finfo(float).tiny # 2.2250738585072014e-308
info = sys.float_info.min # 2.2250738585072014e-308
According to the documentations:
numpy.nextafter
Return the next floating-point value after x1 towards x2, element-wise.
finfo(float).tiny
The smallest positive usable number. Type of tiny is an appropriate floating point type.
sys.float_info
A structseq holding information about the float type. It contains low level information about the precision and internal representation. Please study your system's :file:float.h for more information.
Does someone have an explanation for this?
The documentation’s wording on this is bad; “usable” is colloquial and not defined. Apparently tiny is meant to be the smallest positive normal number.
nextafter is returning the actual next representable value after zero, which is subnormal.
Python does not rigidly specify its floating-point properties. Python implementations commonly inherit them from underlying hardware or software, and use of IEEE-754 formats (but not full conformance to IEEE-754 semantics) is common. In IEEE-754, numbers are represented with an implicit leading one bit in the significand1 until the exponent reaches its minimum value for the format, after which the implicit bit is zero instead of one and smaller values are representable only by reducing the significand instead of reducing the exponent. These numbers with the implicit leading zero are the subnormal numbers. They serve to preserve some useful arithmetic properties, such as x-y == 0 if and only if x == y. (Without subnormal numbers, two very small numbers might be different, but their even smaller difference might not be representable because it was below the exponent limit, so computing x-y would round to zero, resulting in code like if (x != y) quotient = t / (x-y) getting a divide-by-zero error.)
Note
1 “Significand” is the term preferred by experts for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Mantissas are logarithmic, while significands are linear.
I am trying to find the cause of this result:
import numpy
result1 = numpy.rint(1.5)
result2 = numpy.rint(6.5)
print result
The output:
result1-> 2
result2-> 6
This is odd: result1 is correct but I result2 is not (It has to be 7 because rint rounds any float to the nearest integer).
Any idea? (THANKS!)
From numpy's documentation on numpy.around, equivalent to numpy.round, which supposedly also is relevant for numpy.rint:
For values exactly halfway between rounded decimal values, Numpy
rounds to the nearest even value. Thus 1.5 and 2.5 round to 2.0, -0.5
and 0.5 round to 0.0, etc. Results may also be surprising due to the
inexact representation of decimal fractions in the IEEE floating point
standard [R9] and errors introduced when scaling by powers of ten.
Also relevant: While for large numbers there might be representation errors, for small values half integers are exactly representable in binary-base floating points, in particular 1.5 and 6.5 are exactly representable in standard single-precision floats. Without the preference for either odd, even, lower, upper integers or any other scheme one would have undefined behaviour here.
As #wim points out in the comments the behaviour of Python's build-in round is different. It rounds away from zero: It prefers upper integers for positive inputs and lower integers for negative inputs. (see http://docs.python.org/2/library/functions.html#round)
I think this is the rule of the thumb - when you have a float midway between two integers, like 1.5 lies midway between 1 and 2 and since both choices are equally good, we prefer rounding to the even number(which is 2 in this case) and for 6.5, which lies midway between 6 and 7, 6 is the closest even number.