I'm summing up the values in a series, but depending on how I do it, I get different results. The two ways I've tried are:
sum(df['series'])
df['series'].sum()
Why would they return different values?
Sample Code.
s = pd.Series([
0.428229
, -0.948957
, -0.110125
, 0.791305
, 0.113980
,-0.479462
,-0.623440
,-0.610920
,-0.135165
, 0.090192])
print(s.sum())
print(sum(s))
-1.4843630000000003
-1.4843629999999999
The difference is quite small here, but in a dataset with a few thousand values, it becomes quite large.
Floating point numbers are only accurate to a certain number of significant figures. Imagine if all of your numbers - including intermediate results - are only accurate to two significant figures, and you want the sum of the list [100, 1, 1, 1, 1, 1, 1].
The "true" sum is 106, but this cannot be represented since we're only allowed two significant figures;
The "correct" answer is 110, since that's the "true" sum rounded to 2 s.f.;
But if we naively add the numbers in sequence, we'll first do 100 + 1 = 100 (to 2 s.f.), then 100 + 1 = 100 (to 2 s.f.), and so on until the final result is 100.
The "correct" answer can be achieved by adding the numbers up from smallest to largest; 1 + 1 = 2, then 2 + 1 = 3, then 3 + 1 = 4, then 4 + 1 = 5, then 5 + 1 = 6, then 6 + 100 = 110 (to 2 s.f.). However, even this doesn't work in the general case; if there were over a hundred 1s then the intermediate sums would start being inaccurate. You can do even better by always adding the smallest two remaining numbers.
Python's built-in sum function uses the naive algorithm, while df['series'].sum() method uses a more accurate algorithm with a lower accumulated rounding error. From the numpy source code, which pandas uses:
For floating point numbers the numerical precision of sum (and
np.add.reduce) is in general limited by directly adding each number
individually to the result causing rounding errors in every step.
However, often numpy will use a numerically better approach (partial
pairwise summation) leading to improved precision in many use-cases.
This improved precision is always provided when no axis is given.
The math.fsum function uses an algorithm which is more accurate still:
In contrast to NumPy, Python's math.fsum function uses a slower but
more precise approach to summation.
For your list, the result of math.fsum is -1.484363, which is the correctly-rounded answer.
Related
I happen to have a numpy array of floats:
a.dtype, a.shape
#(dtype('float64'), (32769,))
The values are:
a[0]
#3.699822718929953
all(a == a[0])
True
However:
a.mean()
3.6998227189299517
The mean is off by 15th and 16th figure.
Can anybody show how this difference is accumulated over 30K mean and if there is a way to avoid it?
In case it matters my OS is 64 bit.
Here is a rough approximation of a bound on the maximum error. This will not be representative of average error, and it could be improved with more analysis.
Consider calculating a sum using floating-point arithmetic with round-to-nearest ties-to-even:
sum = 0;
for (i = 0; i < n; ++n)
sum += a[i];
where each a[i] is in [0, m).
Let ULP(x) denote the unit of least precision in the floating-point number x. (For example, in the IEEE-754 binary64 format with 53-bit significands, if the largest power of 2 not greater than |x| is 2p, then ULP(x) = 2p−52. With round-to-nearest, the maximum error in any operation with result x is ½ULP(x).
If we neglect rounding errors, the maximum value of sum after i iterations is i•m. Therefore, a bound on the error in the addition in iteration i is ½ULP(i•m). (Actually zero for i=1, since that case adds to zero, which has no error, but we neglect that for this approximation.) Then the total of the bounds on all the additions is the sum of ½ULP(i•m) for i from 1 to n. This is approximately ½•n•(n+1)/2•ULP(m) = ¼•n•(n+1)•ULP(m). (This is an approximation because it moves i outside the ULP function, but ULP is a discontinuous function. It is “approximately linear,“ but there are jumps. Since the jumps are by factors of two, the approximation can be off by at most a factor of two.)
So, with 32,769 elements, we can say the total rounding error will be at most about ¼•32,769•32,770•ULP(m), about 2.7•108 times the ULP of the maximum element value. The ULP is 2−52 times the greatest power of two not less than m, so that is about 2.7•108•2−52 = 6•10−8 times m.
Of course, the likelihood that 32,768 sums (not 32,769 because the first necessarily has no error) all round in the same direction by chance is vanishingly small but I conjecture one might engineer a sequence of values that gets close to that.
An Experiment
Here is a chart of (in blue) the mean error over 10,000 samples of summing arrays with sizes 100 to 32,800 by 100s and elements drawn randomly from a uniform distribution over [0, 1). The error was calculated by comparing the sum calculated with float (IEEE-754 binary32) to that calculated with double (IEEE-754 binary64). (The samples were all multiples of 2−24, and double has enough precision so that the sum for up to 229 such values is exact.)
The green line is c n √n with c set to match the last point of the blue line. We see it tracks the blue line over the long term. At points where the average sum crosses a power of two, the mean error increases faster for a time. At these points, the sum has entered a new binade, and further additions have higher average errors due to the increased ULP. Over the course of the binade, this fixed ULP decreases relative to n, bringing the blue line back to the green line.
This is due to incapability of float64 type to store the sum of your float numbers with correct precision. In order to get around this problem you need to use a larger data type of course*. Numpy has a longdouble dtype that you can use in such cases:
In [23]: np.mean(a, dtype=np.longdouble)
Out[23]: 3.6998227189299530693
Also, note:
In [25]: print(np.longdouble.__doc__)
Extended-precision floating-point number type, compatible with C
``long double`` but not necessarily with IEEE 754 quadruple-precision.
Character code: ``'g'``.
Canonical name: ``np.longdouble``.
Alias: ``np.longfloat``.
Alias *on this platform*: ``np.float128``: 128-bit extended-precision floating-point number type.
* read the comments for more details.
The mean is (by definition):
a.sum()/a.size
Unfortunately, adding all those values up and dividing accumulates floating point errors. They are usually around the magnitude of:
np.finfo(np.float).eps
Out[]: 2.220446049250313e-16
Yeah, e-16, about where you get them. You can make the error smaller by using higher-accuracy floats like float128 (if your system supports it) but they'll always accumulate whenever you're summing a large number of float together. If you truly want the identity, you'll have to hardcode it:
def mean_(arr):
if np.all(arr == arr[0]):
return arr[0]
else:
return arr.mean()
In practice, you never really want to use == between floats. Generally in numpy we use np.isclose or np.allclose to compare floats for exactly this reason. There are ways around it using other packages and leveraging arcane machine-level methods of calculating numbers to get (closer to) exact equality, but it's rarely worth the performance and clarity hit.
I'm trying to return a vector (1-d numpy array) that has a sum of 1.
The key is that it has to equal 1.0 as it represents a percentage.
However, there seems to be a lot of cases where the sum does not equal to 1 even when I divided each element by the total.
In other words, the sum of 'x' does not equal to 1.0 even when x = x'/sum(x')
One of the cases where this occurred was the vector below.
x = np.array([0.090179377557090171, 7.4787182000074775e-05, 0.52465058646452456, 1.3594135000013591e-05, 0.38508165466138505])
The summation of this vector x.sum() is 1.0000000000000002 whereas the summation of the vector that is divided by this value is 0.99999999999999978.
From that point on that reciprocates.
What I did do was round the elements in the vector by the 10th decimal place (np.round(x, decimals = 10)) then divided this by the sum which results in a sum of exactly 1.0. This works when I know the size of the numerical error.
Unfortunately, that would not be the case in usual circumstances.
I'm wondering if there is a way to correct the numerical error of only when the vector is known so that the sum will equal to 1.0.
Edit:
Is floating point math broken?
This question doesn't answer my question as it states only 'why' the difference occurs and not how to resolve the issue.
A bit of a hacky solution:
x[-1] = 0
x[-1] = 1 - x.sum()
Essentially shoves the numerical errors into the last element of the array.
(No roundings beforehand are needed.)
Note: A mathematically simpler solution:
x[-1] = 1.0 - x[:-1].sum()
does not work, due to different behavior of numpy.sum on whole array vs a slice.
For the moment, put aside any issues relating to pseudorandom number generators and assume that numpy.random.rand perfectly samples from the discrete distribution of floating point numbers over [0, 1). What are the odds getting at least two exactly identical floating point numbers in the result of:
numpy.random.rand(n)
for any given value of n?
Mathematically, I think this is equivalent to first asking how many IEEE 754 singles or doubles there are in the interval [0, 1). Then I guess the next step would be to solve the equivalent birthday problem? I'm not really sure. Anyone have some insight?
The computation performed by numpy.random.rand for each element generates a number 0.<53 random bits>, for a total of 2^53 equally likely outputs. (Of course, the memory representation isn't a fixed-point 0.stuff; it's still floating point.) This computation is incapable of producing most binary64 floating-point numbers between 0 and 1; for example, it cannot produce 1/2^60. You can see the code in numpy/random/mtrand/randomkit.c:
double
rk_double(rk_state *state)
{
/* shifts : 67108864 = 0x4000000, 9007199254740992 = 0x20000000000000 */
long a = rk_random(state) >> 5, b = rk_random(state) >> 6;
return (a * 67108864.0 + b) / 9007199254740992.0;
}
(Note that rk_random produces 32-bit outputs, regardless of the size of long.)
Assuming a perfect source of randomness, the probability of repeats in numpy.random.rand(n) is 1-(1-0/k)(1-1/k)(1-2/k)...(1-(n-1)/k), where k=2^53. It's probably best to use an approximation instead of calculating this directly for large values of n. (The approximation may even be more accurate, depending on how the approximation error compares to the rounding error accumulated in a direct computation.)
I think you are correct, this is like the birthday problem.
But you need to decide on the number of possible options. You do this by deciding the precision of your floating point numbers.
For example, if you decide to have a precision of 2 numbers after the dot, then there are 100 options(including zero and excluding 1).
And if you have n numbers then the probability of not having a collision is:
or when given R possible numbers and N data points, the probability of no collision is:
And of collision is 1 - P.
This is because the probability of getting any given number is 1/R. And at any point, the probability of a data point not colliding with prior data points is (R-i)/R for i being the index of the data point. But to get the probability of no data points colliding with each other, we need to multiply all the probabilities of data points not colliding with those prior to them. Applying some algebraic operations, we get the equation above.
For an introduction to Python course, I'm looking at generating a random floating point number in Python, and I have seen a standard recommended code of
import random
lower = 5
upper = 10
range_width = upper - lower
x = random.random() * range_width + lower
for a random floating point from 5 up to but not including 10.
It seems to me that the same effect could be achieved by:
import random
x = random.randrange(5, 10) + random.random()
Since that would give an integer of 5, 6, 7, 8, or 9, and then tack on a decimal to it.
The question I have is would this second code still give a fully even probability distribution, or would it not keep the full randomness of the first version?
According to the documentation then yes random() is indeed a uniform distribution.
random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator.
So both code examples should be fine. To shorten your code, you can equally do:
random.uniform(5, 10)
Note that uniform(a, b) is simply a + (b - a) * random() so the same as your first example.
The second example depends on the version of Python you're using.
Prior to 3.2 randrange() could produce a slightly uneven distributions.
There is a difference. Your second method is theoretically superior, although in practice it only matters for large ranges. Indeed, both methods will give you a uniform distribution. But only the second method can return all values in the range that are representable as a floating point number.
Since your range is so small, there is no appreciable difference. But still there is a difference, which you can see by considering a larger range. If you take a random real number between 0 and 1, you get a floating-point representation with a given number of bits. Now suppose your range is, say, in the order of 2**32. By multiplying the original random number by this range, you lose 32 bits of precision in the result. Put differently, there will be gaps between the values that this method can return. The gaps are still there when you multiply by 4: You have lost the two least significant bits of the original random number.
The two methods can give different results, but you'll only notice the difference in fairly extreme situations (with very wide ranges). For instance, If you generate random numbers between 0 and 2/sys.float_info.epsilon (9007199254740992.0, or a little more than 9 quintillion), you'll notice that the version using multiplication will never give you any floats with fractional values. If you increase the maximum bound to 4/sys.float_info.epsilon, you won't get any odd integers, only even ones. That's because the 64-bit floating point type Python uses doesn't have enough precision to represent all integers at the upper end of that range, and it's trying to maintain a uniform distribution (so it omits small odd integers and fractional values even though those can be represented in parts of the range).
The second version of the calculation will give extra precision to the smaller random numbers generated. For instance, if you're generating numbers between 0 and 2/sys.float_info.epsilon and the randrange call returned 0, you can use the full precision of the random call to add a fractional part to the number. On the other hand if the randrange returned the largest number in the range (2/sys.float_info.epsilon - 1), very little of the precision of the fraction would be used (the number will round to the nearest integer without any fractional part remaining).
Adding a fractional value also can't help you deal with ranges that are too large for every integer to be represented. If randrange returns only even numbers, adding a fraction usually won't make odd numbers appear (it can in some parts of the range, but not for others, and the distribution may be very uneven). Even for ranges where all integers can be represented, the odds of a specific floating point number appearing will not be entirely uniform, since the smaller numbers can be more precisely represented. Large but imprecise numbers will be more common than smaller but more precisely represented ones.
I wrote a program that records how many times 2 fair dice need to be rolled to match the probabilities for each result that we should expect.
I think it works but I'm wondering if there's a more resource friendly way to solve this problem.
import random
expected = [0.0, 0.0, 0.028, 0.056, 0.083,
0.111, 0.139, 0.167, 0.139, 0.111,
0.083, 0.056, 0.028]
results = [0.0] * 13 # store our empirical results here
emp_percent = [0.0] * 13 # results / by count
count = 0.0 # how many times have we rolled the dice?
while True:
r = random.randrange(1,7) + random.randrange(1,7) # roll our die
count += 1
results[r] += 1
emp_percent = results[:]
for i in range(len(emp_percent)):
emp_percent[i] /= count
emp_percent[i] = round(emp_percent[i], 3)
if emp_percent == expected:
break
print(count)
print(emp_percent)
There are several problems here.
Firstly, there is no guarantee that this will ever terminate, nor is it particularly likely to terminate in a reasonable amount of time. Ignoring floating point arithmetic issues, this should only terminate when your numbers are distributed exactly right. But the law of large numbers does not guarantee this will ever happen. The law of large numbers works like this:
Your initial results are (by random chance) almost certainly biased one way or another.
Eventually, the trials not yet performed will greatly outnumber your initial trials, and the lack of bias in those later trials will outweigh your initial bias.
Notice that the initial bias is never counterbalanced. Rather, it is dwarfed by the rest of the results. This means the bias tends to zero, but it does not guarantee the bias actually vanishes in a finite number of trials. Indeed, it specifically predicts that progressively smaller amounts of bias will continue to exist indefinitely. So it would be entirely possible that this algorithm never terminates, because there's always that tiny bit of bias still hanging around, statistically insignificant, but still very much there.
That's bad enough, but you're also working with floating point, which has its own issues; in particular, floating point arithmetic violates lots of conventional rules of math because the computer keeps doing intermediate rounding to ensure the numbers continue to fit into memory, even if they are repeating (in base 2) or irrational. The fact that you are rounding the empirical percents to three decimal places doesn't actually fix this, because not all terminating decimals (base 10) are terminating binary values (base 2), so you may still find mismatches between your empirical and expected values. Instead of doing this:
if emp_percent == expected:
break
...you might try this (in Python 3.5+ only):
if all(map(math.is_close, emp_percent, expected)):
break
This solves both problems at once. By default, math.is_close() requires the values to be within (about) 9 decimal places of one another, so it inserts the necessary give for this algorithm to actually have a chance of working. Note that it does require special handling for comparisons involving zero, so you may need to tweak this code for your use case, like this:
is_close = functools.partial(math.is_close, abs_tol=1e-9)
if all(map(is_close, emp_percent, expected)):
break
math.is_close() also removes the need to round your empiricals, since it can do this approximation for you:
is_close = functools.partial(math.is_close, rel_tol=1e-3, abs_tol=1e-5)
if all(map(is_close, emp_percent, expected)):
break
If you really don't want these approximations, you will have to give up floating point and work with fractions exclusively. They produce exact results when divided by one another. However, you still have the problem that your algorithm is unlikely to terminate quickly (or perhaps at all), for the reasons discussed above.
Rather than trying to match floating point numbers -- you could try to match expected values for each possible sum. This is equivalent to what you are trying to do since (observed number)/(number of trials) == (theoretical probability) if and only if the observed number equals the expected number. These will always be an integer exactly when the number of rolls is a multiple of 36. Hence, if the number of rolls is not a multiple of 36 then it is impossible for your observations to equal expectations exactly.
To get the expected values, note that the numerators that appear in the exact probabilities of the various sums (1,2,3,4,5,6,5,4,3,2,1 for the sums 2,3,..., 12 respectively) are the expected values for the sums if the dice are rolled 36 times. If the dice are rolled 36i times then multiply these numerators by i to get the expected values of the sums. The following code simulates repeatedly rolling a pair of fair dice 36 times, accumulating the total counts and then comparing them with the expected counts. If there is a perfect match, the number of trials (where a trial is 36 rolls) needed to get the match is returned. If this doesn't happen by max_trials, a vector showing the discrepancy between the final counts and final expected value is given:
import random
def roll36(counts):
for i in range(36):
r1 = random.randint(1,6)
r2 = random.randint(1,6)
counts[r1+r2 - 2] += 1
def match_expected(max_trials):
counts = [0]*11
numerators = [1,2,3,4,5,6,5,4,3,2,1]
for i in range(1, max_trials+1):
roll36(counts)
expected = [i*j for j in numerators]
if counts == expected:
return i
#else:
return [c-e for c,e in zip(counts,expected)]
Here is some typical output:
>>> match_expected(1000000)
[-750, 84, 705, -286, 5783, -3504, -1208, 1460, 543, -1646, -1181]
Not only have the exact expected values never been observed in 36 million simulated rolls of a pair of fair dice, in the final state the discrepancies between observations and expectations have become quite large (in absolute value -- the relative discrepancies are approaching zero, as the law of large numbers predicts). This approach is unlikely to ever yield a perfect match. A variation that would work (while still focusing on expected numbers) would be to iterate until the observations pass a chi-squared goodness of fit test when compared with the theoretical distribution. In that case there would no longer be any reason to focus on multiples of 36.