What is the difference between expect and mean in scipy.stats?

According to the definition of the expected value, it is the same thing as the mean.
But in scipy.stats.binom, the two give different values, like this:
import scipy.stats as st
st.binom.mean(10, 0.3)            # ----> 3.0
st.binom.expect(args=(10, 0.3))   # ----> 3.0000000000000013
This confuses me. Why do they differ?

In the example the difference is in floating point computation as pointed out. In general there might also be a truncation in expect depending on the integration tolerance.
For many distributions, the mean and some other moments have an analytical solution, in which case we usually get a precise result.
expect is a general function that computes the expectation for arbitrary (*) functions through summation in the discrete case and numerical integration in the continuous case. This accumulates floating point noise but also depends on the convergence criteria for the numerical integration and will, in general, be less precise than an analytically computed moment.
(*) There might be numerical problems in the integration for some "not nice" functions, which can happen for example with default settings in scipy.integrate.quad
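To make the distinction concrete, here is a minimal sketch comparing the closed-form mean with the summation that expect performs for a discrete distribution. The manual sum over the pmf is only an illustration of the idea, not SciPy's actual implementation, and the variable names are made up for this example.

```python
import numpy as np
from scipy import stats

n, p = 10, 0.3

# Closed-form mean: for the binomial distribution this is simply n * p.
closed_form = stats.binom.mean(n, p)

# Generic expectation: a numerical sum of k * pmf(k) over the support.
numeric = stats.binom.expect(args=(n, p))

# The same idea written by hand (illustrative only).
k = np.arange(n + 1)
manual = np.sum(k * stats.binom.pmf(k, n, p))

# expect also handles other functions, e.g. the second moment E[K^2].
second_moment = stats.binom.expect(lambda x: x**2, args=(n, p))

print(closed_form, numeric, manual, second_moment)
```

The three estimates of the mean agree up to floating-point noise. Only expect can evaluate something like E[K^2] without a dedicated formula.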

This could simply be a result of numerical imprecision when calculating the average. Mathematically, they should be identical, but there are different ways of calculating the mean, and these have different properties when implemented in finite-precision arithmetic. For example, adding up the numbers and dividing by the count is not particularly reliable, especially when the numbers fluctuate by a small amount around the true (theoretical) mean, or have opposite signs. Recursive estimates may have much better properties.
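As a rough sketch of what a recursive estimate can look like, the snippet below updates a running mean one value at a time and compares it with the plain sum-and-divide approach. The function name running_mean and the test data are hypothetical, and math.fsum-based statistics.fmean is used only as a reference value. Which method is more accurate depends on the data.

```python
from statistics import fmean

def running_mean(values):
    """Recursive mean update: mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k."""
    mean = 0.0
    for count, value in enumerate(values, start=1):
        mean += (value - mean) / count
    return mean

# Hypothetical data: small fluctuations on top of a large offset.
data = [1e9 + 0.1 * (i % 7) for i in range(1000)]

naive = sum(data) / len(data)   # add everything up, then divide by the count
recursive = running_mean(data)  # never forms the large intermediate sum
reference = fmean(data)         # uses exact summation (math.fsum) internally

print(naive - reference, recursive - reference)
```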

Related

Internals of numpy.sum

Why are the two sums returning different values? In fact, if 0.1 is summed up 10 times in IEEE arithmetic, the result should not be exactly 1. It could be that np.sum() groups the sum differently, so that by chance the result is exactly 1. But is there any documentation about this (besides studying the source code)? Surely, numpy isn't rounding off the result. And AFAIK, it is also using IEEE double-precision floating point.
import numpy as np

print(1-sum(np.ones(10)*0.1))
print(1-np.sum(np.ones(10)*0.1))
-----
1.1102230246251565e-16
0.0

Sorry, I found the solution in the docs. The algorithm of numpy.sum() is indeed more clever:
For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters. In contrast to NumPy, Python’s math.fsum function uses a slower but more precise approach to summation. Especially when summing a large number of lower precision floating point numbers, such as float32, numerical errors can become significant. In such cases it can be advisable to use dtype="float64" to use a higher precision for the output.
See https://numpy.org/doc/stable/reference/generated/numpy.sum.html
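The difference between the summation strategies can be sketched in pure Python. The pairwise_sum below is a simplified recursive version written for illustration. It is not NumPy's actual implementation, which works on blocks and is considerably more involved.

```python
import math

def sequential_sum(values):
    """Add each number to a running total, rounding at every step."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values):
    """Simplified pairwise summation: split in half, sum halves, combine."""
    n = len(values)
    if n <= 2:
        return sequential_sum(values)
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

values = [0.1] * 10

seq = sequential_sum(values)   # same order of operations as the built-in sum()
pair = pairwise_sum(values)    # different grouping, different rounding errors
exact = math.fsum(values)      # slower, but tracks the rounding errors exactly

print(1 - seq, 1 - pair, 1 - exact)
```

With only ten terms the effect is tiny. The benefit of pairwise grouping grows with the number of terms, since its rounding error grows much more slowly than that of sequential addition.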

Checkgradient without solving optimization problem in MATLAB

I have a relatively complicated function and I have calculated the analytical form of the Jacobian of this function. However, sometimes, I mess up this Jacobian.
MATLAB has a nice way to check for the accuracy of the Jacobian when using some optimization technique as described here.
The problem though is that it looks like MATLAB solves the optimization problem and then returns if the Jacobian was correct or not. This is extremely time consuming, especially considering that some of my optimization problems take hours or even days to compute.
Python has a somewhat similar function in scipy as described here which just compares the analytical gradient with a finite difference approximation of the gradient for some user provided input.
Is there anything I can do to check the accuracy of the Jacobian in MATLAB without having to solve the entire optimization problem?

A laborious but useful method I've used for this sort of thing is to check that the (numerical) integral of the purported derivative is the difference of the function at the end points. I have found this more convenient than comparing fractions like (f(x+h)-f(x))/h with f'(x), because of the difficulty of choosing h: on the one hand h must not be so small that the fraction is dominated by rounding error, and on the other h must be small enough that the fraction is close to f'(x).
In the case of a function F of a single variable, the assumption is that you have code f to evaluate F and, say, fd to evaluate F'. Then the test is, for various intervals [a,b], to look at the difference, which the fundamental theorem of calculus says should be 0,
Integral{ a<=x<=b | fd(x)} - (f(b)-f(a))
with the integral being computed numerically. There is no need for the intervals to be small.
Part of the error will, of course, be due to the error in the numerical approximation to the integral. For this reason I tend to use, for example, an order 40 Gauss-Legendre integrator.
For functions of several variables, you can test one variable at a time. For several functions, these can be tested one at a time.
I've found that these tests, which are of course by no means exhaustive, show up the kinds of mistakes that occur in computing derivatives quite readily.
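The integral test described above could be sketched as follows. The example is in Python rather than MATLAB, and the function under test, its correct derivative, and its deliberately wrong derivative are all made up for illustration. The quadrature nodes come from numpy.polynomial.legendre.leggauss.

```python
import numpy as np

def ftc_residual(f, fd, a, b, order=40):
    """Integral of fd over [a, b] minus (f(b) - f(a)); should be close to 0."""
    nodes, weights = np.polynomial.legendre.leggauss(order)
    # Map the Gauss-Legendre nodes from [-1, 1] to [a, b].
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    integral = 0.5 * (b - a) * np.sum(weights * fd(x))
    return integral - (f(b) - f(a))

# Hypothetical function and two candidate derivatives.
f = lambda x: np.sin(x) * np.exp(-x)
fd_good = lambda x: (np.cos(x) - np.sin(x)) * np.exp(-x)
fd_bad = lambda x: (np.cos(x) + np.sin(x)) * np.exp(-x)  # sign error

res_good = ftc_residual(f, fd_good, 0.0, 3.0)
res_bad = ftc_residual(f, fd_bad, 0.0, 3.0)
print(res_good, res_bad)
```

A residual near rounding level suggests the derivative code is consistent with the function on that interval. A large residual points to a mistake, although it does not say where on the interval the mistake is.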

Have you considered the usage of Complex step differentiation to check your gradient? See this description
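For reference, a minimal sketch of complex-step differentiation, written in Python rather than MATLAB. It assumes the function is analytic and implemented with operations that accept complex input (no abs, no comparisons on the argument). The test function and the helper name are hypothetical.

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) as Im(f(x + i*h)) / h.

    There is no subtraction of nearly equal numbers, so h can be tiny
    without the cancellation error that limits finite differences.
    """
    return np.imag(f(x + 1j * h)) / h

# Hypothetical function and its hand-derived derivative.
f = lambda x: np.sin(x) * np.exp(-x)
fd = lambda x: (np.cos(x) - np.sin(x)) * np.exp(-x)

x0 = 0.7
approx = complex_step_derivative(f, x0)
exact = fd(x0)
print(approx - exact)
```

For a Jacobian, the same step can be applied to one input variable at a time, and each resulting column compared with the analytical one at a few test points, without running the optimizer.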

What is the Precision of SymPy's 'expand' Function?

I am trying to expand a function of the form (X + Y + Z) ^ N where N is sufficiently large so that the expanded product will contain terms with coefficients much greater than 2 ^ 64; for the sake of this discussion let's just say that N is greater than 200. This is an issue because I am hoping to do an analysis of the expanded form of this function, and this analysis requires exact precision for all of the terms and their coefficients.
To expand the function I am using the Python module SymPy, which has seemed very promising thus far and been able to expand functions where N is > 150 in a relatively short amount of time. My concern though is that after looking through some of the expanded functions, I am seeing coefficients with more trailing zeroes than I might expect. I know that I can run everything through mpmath for my analysis after the function has been expanded, but as of now, I am unsure as to whether or not some of the larger coefficients are even exactly correct in the first place.
Under the documentation for SymPy's expand function, there is no clarification of how precise the coefficients of the expansion are when working with very large numbers. I know for a fact that SymPy uses the mpmath module for some of its functions, so I know that it is capable of arbitrary precision, I just don't know if arbitrary precision explicitly applies to this case.
I know that I could also confirm whether the expand function is arbitrarily precise or not by summing all of the coefficients of a given expansion and checking whether or not that sum is equal to 3 ^ N (the value of the function at X = Y = Z = 1), but I'd rather not spend a few hours coding all the necessary pieces to make that assessment, only to find out that expand is imprecise.
If anyone has any suggestions for easier ways to confirm the precision of expand, then I would appreciate that if direct confirmation of its precision cannot be given.

Although PR 18960 has not yet been merged, you can confirm there that the coefficients are correct:
>>> multinomial(15,16,14)
50151543548788717200
>>> ((x+y+z)**(15+16+14)).expand().coeff(x**15*y**16*z**14)
50151543548788717200
>>> _ > 2**64
True
Since Python supports unlimited integers and the coefficients are integers, I don't know any reason that they would not be accurate.
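An independent check that needs nothing beyond released SymPy and the standard library is sketched below. It recomputes the same multinomial coefficient from factorials and also verifies that all coefficients sum to 3**N, which must hold because that sum is the value of the polynomial at x = y = z = 1.

```python
from math import factorial
from sympy import symbols

x, y, z = symbols("x y z")
N = 15 + 16 + 14

expanded = ((x + y + z) ** N).expand()

# Coefficient of x^15 y^16 z^14 as computed by SymPy.
from_sympy = expanded.coeff(x**15 * y**16 * z**14)

# The same multinomial coefficient from exact Python integer arithmetic.
from_factorials = factorial(N) // (factorial(15) * factorial(16) * factorial(14))

# Sum of all coefficients: evaluate the expansion at x = y = z = 1.
coefficient_sum = expanded.subs({x: 1, y: 1, z: 1})

print(from_sympy == from_factorials, coefficient_sum == 3**N)
```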

Can Python floating-point error affect statistical test over small numbers?

Can floating-point errors affect my calculations in the following scenario, where the values are small?
My purpose is to compare two sets of values and determine if their means are statistically different.
I handle very small values the usual way in performing large-sample unpaired tests with data like this:
first group (obtained from 100 samples):
first item's mean = 2.7977620220553945e-24
std dev = 3.2257148207429583e-15
second group (obtained from 100 samples):
first item's mean = 3.1086244689504383e-15
std dev = 3.92336102789548e-15
The goal is to find out whether or not the two means are statistically significantly different.
I plan to follow the usual steps of finding the standard error of the difference and the z-score and so on. I will be using Python (or Java).
My question is not about the statistical test but about the potential problem with the smallness of the numbers (floating-point errors).
Should I (must I) approximate each of the above two means to zero (and thus conclude that there is no difference)?
That is, given the smallness of the means, is it computationally meaningless to go about performing the statistical test?

64-bit floating point numbers allot 52 bits for the significand. This is approximately 15-16 decimal places (log10(2^52) ~ 15.6). In scientific notation, this is the difference between, say, 1e-9 and 1e-24 (because 10^-9 / 10^-24 == 10^15, i.e. they differ by 15 decimal places).
What does this all mean? Well, it means that if you add 10^-24 to 10^-9, it is just on the border of being too small to show up in the larger number (10^-9).
Observe:
>>> a = 1e-9
>>> a
1e-09
>>> a + 1e-23
1.00000000000001e-09
>>> a + 1e-24
1.000000000000001e-09
>>> a + 1e-25
1e-09
Since the z-score statistics stuff involves basically adding and subtracting a few standard deviations from the mean, or so, it will definitely have problems if the difference in the exponent is 16. It's probably not a good situation if the difference is like 14 or 15. The difference in your exponents is 9, which will still allow you 1/10^6 standard deviations of accuracy in the final sum. Since we're worried about errors on the order of, maybe, a tenth of a standard deviation or so when we talk about statistical significance, you should be fine.
For 32-bit (single-precision) floats, the significand gets 23 bits, which is about 6.9 decimal places.
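One way to convince yourself that the small magnitude alone is harmless is to compute the statistic twice: once on the raw values and once after multiplying everything by a common factor. The sketch below uses the figures from the question and assumes a plain large-sample z statistic for the difference of two means. The helper name z_score is made up.

```python
import math

def z_score(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample z statistic for the difference of two means."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean2 - mean1) / standard_error

# Values from the question.
m1, s1, n1 = 2.7977620220553945e-24, 3.2257148207429583e-15, 100
m2, s2, n2 = 3.1086244689504383e-15, 3.92336102789548e-15, 100

raw = z_score(m1, s1, n1, m2, s2, n2)

# Rescale everything by the same factor; z is mathematically unchanged.
scale = 1e15
scaled = z_score(m1 * scale, s1 * scale, n1, m2 * scale, s2 * scale, n2)

print(raw, scaled, math.isclose(raw, scaled, rel_tol=1e-12))
```

The two results agree to about the last digit or two, which shows that floating point carries relative precision, not absolute precision.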

In principle, if you work with numbers with the same order of magnitude, the float representation of data is sufficient to retain the same precision as numbers close to 1.
However, it is much more robust to perform the computation on whitened data.
If whitening is not an option for your use case, you can use an arbitrary-precision library for non-integer data (Python offers built-in arbitrary-precision integers), like decimal, fractions and/or statistics, and do all the computations with that.
EDIT
However, just looking at your numbers, the standard deviation ranges (the intervals [µ-σ, µ+σ]) largely overlap, therefore you have no evidence for the two means to be statistically significantly different. Of course this is meaningful only for (at least asymptotically) normally distributed populations / samples.
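As a small illustration of the exact-arithmetic option mentioned in this answer, the sketch below uses fractions.Fraction together with statistics.mean from the standard library. The data are made up for the example.

```python
from fractions import Fraction
from statistics import mean

# Ten copies of one tenth, once as binary floats and once as exact fractions.
as_floats = [0.1] * 10
as_fractions = [Fraction(1, 10)] * 10

float_total = sum(as_floats)        # accumulates binary rounding error
exact_total = sum(as_fractions)     # exact rational arithmetic
exact_mean = mean(as_fractions)     # statistics.mean accepts Fraction input

print(float_total == 1.0, exact_total == 1, exact_mean)
```

Exact rationals avoid rounding entirely, at the cost of speed and of numerators and denominators that can grow large.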

Why is numpy's sine function so inaccurate at some points?

I just checked numpy's sine function. Apparently, it produces highly inaccurate results around pi.
In [26]: import numpy as np
In [27]: np.sin(np.pi)
Out[27]: 1.2246467991473532e-16
The expected result is 0. Why is numpy so inaccurate there?
To some extent, I am unsure whether it is fair to regard the calculated result as inaccurate: its absolute error is within one machine epsilon (for binary64), whereas the relative error is +inf, which is why I feel somewhat confused. Any ideas?
[Edit] I fully understand that floating-point calculation can be inaccurate. But most floating-point libraries manage to deliver results within a small range of error. Here, the relative error is +inf, which seems unacceptable. Just imagine that we want to calculate
1/(1e-16 + sin(pi))
The results would be disastrously wrong if we use numpy's implementation.

The main problem here is that np.pi is not exactly π, it's a finite binary floating point number that is close to the true irrational real number π but still off by ~1e-16. np.sin(np.pi) is actually returning a value closer to the true infinite-precision result for sin(np.pi) (i.e. the ideal mathematical sin() function being given the approximated np.pi value) than 0 would be.
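This can be checked with the standard library alone. Since sin(π - ε) = sin(ε) ≈ ε for tiny ε, the sine of the stored double should be almost exactly the gap between π and that double. The sketch below uses math.pi, which holds the same double as np.pi, and a 50-digit decimal expansion of π typed in by hand.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 60

# pi to 50 decimal places (entered manually for this illustration).
PI_50_DIGITS = Decimal("3.14159265358979323846264338327950288419716939937510")

# Decimal(float) converts the stored binary value exactly, with no rounding.
stored_pi = Decimal(math.pi)

# How far the double falls short of the real number pi.
gap = PI_50_DIGITS - stored_pi

print(float(gap))
print(math.sin(math.pi))
```

Both printed values should agree to roughly full double precision, so the library's answer is an accurate sine of the number it was actually given.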

The value is dependent upon the algorithm used to compute it. A typical implementation will use some quickly-converging infinite series, carried out until it converges within one machine epsilon. Many modern chips (starting with the Intel 960, I think) had such functions in the instruction set.
To get 0 returned for this, we would need either a notably more accurate algorithm, one that ran extra-precision arithmetic to guarantee the closest-match result, or something that recognizes special cases: detect a multiple of PI and return the exact value.
