I just checked numpy's sine function. Apparently, it produces highly inaccurate results around pi.
In [26]: import numpy as np
In [27]: np.sin(np.pi)
Out[27]: 1.2246467991473532e-16
The expected result is 0. Why is numpy so inaccurate there?
To some extent, I feel uncertain whether it is acceptable to regard the calculated result as inaccurate: its absolute error comes within one machine epsilon (for binary64), whereas the relative error is +inf, which is why I feel somewhat confused. Any ideas?
[Edit] I fully understand that floating-point calculations can be inaccurate. But most floating-point libraries manage to deliver results within a small error bound. Here, the relative error is +inf, which seems unacceptable. Just imagine that we want to calculate
1/(1e-16 + sin(pi))
The result would be disastrously wrong if we used numpy's implementation.
The main problem here is that np.pi is not exactly π, it's a finite binary floating point number that is close to the true irrational real number π but still off by ~1e-16. np.sin(np.pi) is actually returning a value closer to the true infinite-precision result for sin(np.pi) (i.e. the ideal mathematical sin() function being given the approximated np.pi value) than 0 would be.
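A quick way to see this (a sketch using mpmath, assuming it is available; the precision setting is arbitrary): the float64 constant np.pi sits slightly below the true π, and since sin(π - ε) ≈ ε for tiny ε, the value numpy returns is essentially just that gap.
import numpy as np
from mpmath import mp

mp.dps = 40                      # 40 decimal digits of working precision
gap = mp.pi - mp.mpf(np.pi)      # distance between the true pi and the float64 np.pi
print(gap)                       # roughly 1.2246e-16
print(np.sin(np.pi))             # essentially the same number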
The value is dependent upon the algorithm used to compute it. A typical implementation will use some quickly-converging infinite series, carried out until it converges within one machine epsilon. Many modern chips (starting with the Intel 960, I think) had such functions in the instruction set.
To get 0 returned for this, we would need a notably more accurate algorithm, one that ran extra-precision arithmetic to guarantee the closest-match result, or something that recognizes special cases: detect a multiple of π and return the exact value.
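If an exact zero at multiples of π is what you actually need, one workaround (a sketch using sympy, not numpy's behaviour) is to stay symbolic until the final numeric evaluation:
import sympy as sp

print(sp.sin(sp.pi))                                  # 0, exactly: sympy treats pi symbolically
expr = 1 / (sp.Rational(1, 10**16) + sp.sin(sp.pi))   # the example from the question
print(sp.N(expr))                                     # 1e+16, rather than a garbage value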
Related
I am trying to expand a function of the form (X + Y + Z) ^ N where N is sufficiently large so that the expanded product will contain terms with coefficients much greater than 2 ^ 64; for the sake of this discussion let's just say that N is greater than 200. This is an issue because I am hoping to do an analysis of the expanded form of this function, and this analysis requires exact precision for all of the terms and their coefficients.
To expand the function I am using the Python module SymPy, which has seemed very promising thus far and been able to expand functions where N is > 150 in a relatively short amount of time. My concern though is that after looking through some of the expanded functions, I am seeing coefficients with more trailing zeroes than I might expect. I know that I can run everything through mpmath for my analysis after the function has been expanded, but as of now, I am unsure as to whether or not some of the larger coefficients are even exactly correct in the first place.
Under the documentation for SymPy's expand function, there is no clarification of how precise the coefficients of the expansion are when working with very large numbers. I know for a fact that SymPy uses the mpmath module for some of its functions, so I know that it is capable of arbitrary precision, I just don't know if arbitrary precision explicitly applies to this case.
I know that I could also confirm whether the expand function is arbitrarily precise by summing all of the coefficients of a given expansion and checking whether that sum equals 3^N (substituting x = y = z = 1 turns every term into its coefficient), but I'd rather not spend a few hours coding all the necessary pieces to make that assessment, only to find out that expand is imprecise.
If anyone has any suggestions for easier ways to confirm the precision of expand, then I would appreciate that if direct confirmation of its precision cannot be given.
Although PR 18960 has not yet been merged, you can use it to confirm that the coefficients are correct:
>>> multinomial(15,16,14)
50151543548788717200
>>> ((x+y+z)**(15+16+14)).expand().coeff(x**15*y**16*z**14)
50151543548788717200
>>> _ > 2**64
True
Since Python supports unlimited integers and the coefficients are integers, I don't know any reason that they would not be accurate.
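A cheaper sanity check than coding the full analysis (a minimal sketch; N here is just an example value): substituting x = y = z = 1 collapses every term to its coefficient, so the coefficients of (x + y + z)**N must sum to exactly 3**N, and Python compares the resulting integers exactly.
from sympy import symbols, expand

x, y, z = symbols('x y z')
N = 45  # 15 + 16 + 14, matching the example above
poly = expand((x + y + z)**N)
assert poly.subs({x: 1, y: 1, z: 1}) == 3**N  # exact integer comparison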
I am currently going through different Machine Learning methods at a low level. Since the polynomial kernel
K(x, z) = (1 + x^T z)^d
is one of the more commonly used kernels, I was under the assumption that this kernel must yield a positive (semi-)definite matrix for a fixed set of data {x1,...,xn}.
However, in my implementation, this seems to not be the case, as the following example shows:
import numpy as np
np.random.seed(93)
x=np.random.uniform(0,1,5)
# Assuming d=1, the kernel is 1 plus the outer product of x with itself
kernel = 1 + x[:, None] @ x[None, :]
np.linalg.eigvals(kernel)
The output would be
[ 6.9463439e+00 1.6070016e-01 9.5388039e-08 -1.5821310e-08 -3.7724433e-08]
I'm also getting negative eigenvalues for d>2.
Am I totally misunderstanding something here? Or is the polynomial kernel simply not PSD?
EDIT: In a previous version, I used x=np.float32(np.random.uniform(0,1,5)) to reduce computing time, which led to a greater number of negative eigenvalues (I believe due to numerical instabilities, as @user2357112 mentioned). I guess this is a good example that precision does matter? Since negative eigenvalues still occur even with float64 precision, the follow-up question would then be how to avoid such numerical instabilities, as sketched below.
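Regarding that follow-up, one common workaround (a sketch, not taken from the thread) is to use the symmetric eigensolver and then either clip the tiny negative eigenvalues to zero or add a small jitter term to the diagonal before anything that requires a PSD matrix:
import numpy as np

np.random.seed(93)
x = np.random.uniform(0, 1, 5)
kernel = 1 + x[:, None] @ x[None, :]      # same kernel as above, d=1

eig = np.linalg.eigvalsh(kernel)          # eigvalsh exploits symmetry and returns real values
eig_clipped = np.clip(eig, 0.0, None)     # treat -1e-16-scale values as exact zeros

jitter = 1e-10 * np.eye(len(x))           # alternative: regularize the diagonal slightly
eig_jittered = np.linalg.eigvalsh(kernel + jitter)
print(eig_clipped, eig_jittered)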
According to the definition of the expected value, it also refers to the mean.
But in scipy.stats.binom, they give different values, like this:
import scipy.stats as st
st.binom.mean(10, 0.3) ----> 3.0
st.binom.expect(args=(10, 0.3)) ----> 3.0000000000000013
That makes me confused! Why?
In the example, the difference is due to floating point computation, as pointed out. In general there might also be a truncation in expect, depending on the integration tolerance.
For many distributions, the mean and some other moments have an analytical solution, in which case we usually get a precise estimate.
expect is a general function that computes the expectation for arbitrary (*) functions through summation in the discrete case and numerical integration in the continuous case. This accumulates floating point noise but also depends on the convergence criteria for the numerical integration and will, in general, be less precise than an analytically computed moment.
(*) There might be numerical problems in the integration for some "not nice" functions, which can happen, for example, with default settings in scipy.integrate.quad.
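A rough illustration of the difference (a sketch that mimics, rather than reproduces, what expect does internally): summing k * pmf(k) over the support accumulates rounding noise that the closed-form mean n*p avoids.
import numpy as np
import scipy.stats as st

n, p = 10, 0.3
k = np.arange(n + 1)
print(np.sum(k * st.binom.pmf(k, n, p)))  # close to 3, but with floating point noise
print(st.binom.mean(n, p))                # 3.0, from the analytical formula n*p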
This could be simply a result of numerical imprecision when calculating the average. Mathematically, they should be identical, but there are different ways of calculating the mean which have different properties when implemented using finite-precision arithmetic. For example, adding up the numbers and dividing by the total is not particularly reliable, especially when the numbers fluctuate by a small amount around the true (theoretical) mean, or have opposite signs. Recursive estimates may have much better properties.
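For concreteness, a sketch of the kind of recursive estimate meant here (a Welford-style running mean; the test data are made up): the accumulator stays near the magnitude of the data instead of growing into a huge running sum.
def running_mean(values):
    """Update m_k = m_{k-1} + (x_k - m_{k-1}) / k one value at a time."""
    m = 0.0
    for k, v in enumerate(values, start=1):
        m += (v - m) / k
    return m

data = [1e8 + 0.1 * i for i in range(10)]
print(running_mean(data))        # stays close to 1e8 + 0.45
print(sum(data) / len(data))     # naive sum-then-divide, for comparison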
I am doing feature scaling on my data, and R and Python are giving me different answers in the scaling. They disagree on many of the statistical values:
Median:
Numpy gives 14.948499999999999 with this code: np.percentile(X[:, 0], 50, interpolation = 'midpoint').
The built in Statistics package in Python gives the same answer with the following code: statistics.median(X[:, 0]).
On the other hand, R gives 14.9632 with this code: median(X[, 1]). Interestingly, the summary() function in R gives 14.960 as the median.
A similar difference occurs when computing the mean of this same data. R gives 13.10936 using the built-in mean() function and both Numpy and the Python Statistics package give 13.097945407088607.
Again, the same thing happens when computing the standard deviation. R gives 7.390328 and Numpy (with ddof=1) gives 7.3927612774052083. With ddof=0, Numpy gives 7.3927565984408936.
The IQR also gives different results. Using the built-in IQR() function in R, the result is 12.3468. Using Numpy with this code: np.percentile(X[:, 0], 75) - np.percentile(X[:, 0], 25), the result is 12.358700000000002.
What is going on here? Why are Python and R always giving different results? It may help to know that my data has 795066 rows and is being treated as an np.array() in Python. The same data is being treated as a matrix in R.
tl;dr there are a few potential differences in algorithms even for such simple summary statistics, but given that you're seeing differences across the board and even in relatively simple computations such as the median, I think the problem is more likely that the values are getting truncated/modified/losing precision somehow in the transfer between platforms.
(This is more of an extended comment than an answer, but it was getting awkwardly long.)
you're unlikely to get much farther without a reproducible example; there are various ways to create examples to test hypotheses for the differences, but it's better if you do so yourself rather than making answerers do it.
how are you transferring data to/from Python/R? Is there some rounding in the representation used in the transfer? (What do you get for max/min, which should be based on a single number with no floating-point computations? How about if you drop one value to get an odd-length vector and take the median?)
medians: I was originally going to say that this could be a function of different ways to define quantile interpolation for an even-length vector, but the definition of the median is somewhat simpler than general quantiles, so I'm not sure. The differences you're reporting above seem way too big to be driven by floating-point computation in this case (since the computation is just an average of two values of similar magnitude).
IQRs: similarly, there are different possible definitions of percentiles/quantiles: see ?quantile in R.
median() vs summary(): R's summary() reports values at reduced precision (often useful for a quick overview); this is a common source of confusion.
mean/sd: there are some possible subtleties in the algorithm here -- for example, R uses extended precision internally when summing to reduce instability; I don't know if Python does or not. However, this shouldn't make as big a difference as you're seeing unless the data are a bit weird:
> x <- rnorm(1000000,mean=0,sd=1)
> mean(x)
[1] 0.001386724
> sum(x)/length(x)
[1] 0.001386724
> mean(x)-sum(x)/length(x)
[1] -1.734723e-18
Similarly, there are more- and less-stable ways to compute a variance/standard deviation.
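For example (a sketch with made-up data): the textbook one-pass formula E[x^2] - E[x]^2 cancels catastrophically when the mean is large relative to the spread, while the two-pass version stays accurate.
import numpy as np

x = np.random.default_rng(0).normal(loc=1e8, scale=1.0, size=100_000)

naive = np.mean(x**2) - np.mean(x)**2     # one-pass textbook formula: catastrophic cancellation
two_pass = np.mean((x - np.mean(x))**2)   # subtract the mean first, then average the squares
print(naive, two_pass)                    # the first can be wildly off (even negative), the second ~1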
I have a Bayesian Classifier programmed in Python, the problem is that when I multiply the features probabilities I get VERY small float values like 2.5e-320 or something like that, and suddenly it turns into 0.0. The 0.0 is obviously of no use to me since I must find the "best" class based on which class returns the MAX value (greater value).
What would be the best way to deal with this? I thought about finding the exponential portion of the number (-320) and, if it goes too low, multiplying the value by 1e20 or some value like that. But maybe there is a better way?
What you describe is a standard problem with the naive Bayes classifier. You can search for that term together with "underflow" to find the answer.
The short answer is that it is standard to express all of this in terms of logarithms: rather than multiplying probabilities, you sum their logarithms.
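A minimal sketch of the log-space scoring (the class names and probabilities are made up for illustration): because log is monotonic, the class with the largest log-score is also the class with the largest product, and the log-scores never underflow.
import math

def log_score(prior, feature_probs):
    """log(prior * p1 * p2 * ...) computed without ever forming the tiny product."""
    return math.log(prior) + sum(math.log(p) for p in feature_probs)

# 500 features with probabilities around 1e-3 would underflow as a raw product,
# but the log-scores are perfectly ordinary floats.
probs_a = [1e-3] * 500
probs_b = [2e-3] * 500
scores = {"class_a": log_score(0.5, probs_a), "class_b": log_score(0.5, probs_b)}
print(max(scores, key=scores.get))   # class_b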
You might want to look at other algorithms as well for classification.
Would it be possible to do your work in a logarithmic space? (For example, instead of storing 1e-320, just store -320, and use addition instead of multiplication.)
Floating point numbers don't have infinite precision or an unlimited exponent range, which is why you saw the numbers turn into 0. Could you multiply all the probabilities by a large scalar so that your numbers stay in a higher range? If you're only worried about the max and not the magnitude, you don't even need to bother dividing through at the end. Alternatively, you could use an arbitrary-precision decimal, as ikanobori suggests.
Take a look at Decimal from the stdlib.
from decimal import Decimal, getcontext
getcontext().prec = 320  # work with 320 significant digits
Decimal(1) / Decimal(7)
I am not posting the result here as it is quite long.
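For completeness, a small sketch of how the Decimal route behaves on the underflow problem from the question (the loop count and probability are made up): with a modest precision setting, the product of many small probabilities stays representable instead of collapsing to 0.0 the way a float does.
from decimal import Decimal, getcontext

getcontext().prec = 50              # 50 significant digits is plenty for this comparison
product_float = 1.0
product_dec = Decimal(1)
for _ in range(200):
    product_float *= 1e-3           # float underflows: the true value 1e-600 becomes 0.0
    product_dec *= Decimal("1e-3")  # Decimal keeps it exactly
print(product_float)                # 0.0
print(product_dec)                  # 1E-600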