I am doing data analysis in Python (NumPy) and R. My data is a 795067 x 3 array, and computing the mean, median, standard deviation, and IQR on it yields different results depending on whether I use NumPy or R. I cross-checked the values, and it looks like R gives the "correct" values.
Median:
NumPy: 14.948499999999999
R: 14.9632
Mean:
NumPy: 13.097945407088607
R: 13.10936
Standard deviation:
NumPy: 7.3927612774052083
R: 7.390328
IQR:
NumPy: 12.358700000000002
R: 12.3468
Max and min of the data are the same on both platforms. I ran a quick test to better understand what is going on here.
Multiplying 1.2*1.2 in NumPy gives 1.44 (same in R).
Multiplying 1.22*1.22 gives 1.4884 in NumPy, and the same in R.
However, multiplying 1.222*1.222 in NumPy gives 1.4932839999999998, which is clearly wrong! Doing the multiplication in R gives the correct answer of 1.493284.
Multiplying 1.2222*1.2222 in NumPy gives 1.4937728399999999, and R gives 1.493773. Once more, R is correct.
In NumPy, the numbers are float64 and in R they are doubles. What is going on here? Why are NumPy and R giving different results? I know R uses IEEE 754 double precision, but I don't know what precision NumPy uses. How can I change NumPy to give me the "correct" answer?
Python
Python's default float display can hide what is actually stored. The calculations themselves are done in the precision of the operands, and Python/NumPy uses double precision by default (at least on my 64-bit machine). Printing a handful of digits makes single and double precision look identical; printing 16 digits reveals the difference:
import numpy

single = numpy.float32(1.222) * numpy.float32(1.222)
double = numpy.float64(1.222) * numpy.float64(1.222)
pyfloat = 1.222 * 1.222

print("%.6f %.6f %.6f" % (single, double, pyfloat))
# 1.493284 1.493284 1.493284
print("%.16f, %.16f, %.16f" % (single, double, pyfloat))
# 1.4932839870452881, 1.4932839999999998, 1.4932839999999998
In an interactive Python/IPython shell, the result of an expression is displayed with repr, which shows enough digits to round-trip the double-precision value exactly:
>>> 1.222 * 1.222
1.4932839999999998
In [1]: 1.222 * 1.222
Out[1]: 1.4932839999999998
R
R also computes in double precision. Its print rounds to 7 significant digits by default, while sprintf can show the full value:
print(1.222 * 1.222)
# 1.493284
sprintf("%.16f", 1.222 * 1.222)
# "1.4932839999999998"
Unlike interactive Python shells, the interactive R prompt also rounds the results of statements to 7 significant digits:
> 1.222 * 1.222
[1] 1.493284
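R's 7-significant-digit default display can be reproduced from Python with the %g format, which makes the two behaviours easy to compare side by side (a small sketch):

```python
x = 1.222 * 1.222

# Mimic R's default display, which rounds to 7 significant digits
print("%.7g" % x)   # 1.493284

# Python's repr shows enough digits to round-trip the double exactly
print(repr(x))      # 1.4932839999999998
```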
Differences between Python and R
The differences in your results most likely come from using single-precision (float32) values in NumPy. Rounding error accumulates over many additions/subtractions and eventually surfaces:
In [1]: import numpy
In [2]: a = numpy.float32(1.222)
In [3]: a*6
Out[3]: 7.3320000171661377
In [4]: a+a+a+a+a+a
Out[4]: 7.3320003
As suggested in the comments on your question, make sure to use double-precision floats (float64) in your NumPy calculations.
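To see how much this matters at the scale of your data, here is a sketch with synthetic data standing in for yours (the size and rough moments are taken from your question; the values themselves are made up):

```python
import numpy as np

# Synthetic stand-in for the 795067-row data
rng = np.random.default_rng(0)
data64 = rng.normal(13.1, 7.4, size=795067)   # float64, like R's doubles
data32 = data64.astype(np.float32)            # what a float32 pipeline sees

# The float32 copy differs from the float64 original in two ways:
# the cast itself rounds every value, and float32 accumulation
# adds further rounding error during the reduction.
print(data32.mean(), data64.mean())
print(data32.std(), data64.std())

# Accumulating in float64 removes the accumulation error,
# though not the error introduced by the initial cast:
print(data32.mean(dtype=np.float64))
```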
Related
I am solving cumulative probability functions (or equations in general, if you want to think of it that way) with SymPy's solveset. So far so good. However, solveset returns a "set" as its result type, and I am having trouble converting that to, or saving it as, a standard Python type; in my case I would like a float.
My code is as follows:
import sympy as sp
from sympy import Symbol, erf, log, sqrt

x = Symbol('x')
p = 0.1
# mu and sigma are numeric values defined earlier in my script
sp.solveset((0.5 + 0.5*erf((log(x) - mu)/(sqrt(2)*sigma))) - p)
Out[91]:
FiniteSet(7335.64225447845*exp(-1.77553477605362*sqrt(2)))
Is there a way to convert this to a float? Just using float() does not work; I have tried that. I have also managed to store it as a list and then extract the number again, but that seems very cumbersome and ill-suited to my purpose. In the end I will solve this equation, say, a thousand times, and I would like all the results in a neat array of floating-point numbers.
If you store the above result as follows:
q = sp.solveset((0.5 + 0.5*erf((log(x) - mu)/(sqrt(2)*sigma)))-p)
then Python says the type is sympy.sets.sets.FiniteSet, and if you try to inspect the variable q it gives you an error (working in Spyder, by the way):
"Spyder was unable to retrieve the value of this variable from the console - The error message was: 'tuple' object has no attribute 'raise_error'".
I have no idea what that means. Thanks a lot.
A FiniteSet works like a Python set. You can convert it to a list and extract the element by indexing, e.g.:
In [3]: S = FiniteSet(7335.64225447845*exp(-1.77553477605362*sqrt(2)))
In [4]: S
Out[4]: FiniteSet(7335.64225447845*exp(-1.77553477605362*sqrt(2)))

In [5]: list(S)
Out[5]: [7335.64225447845*exp(-1.77553477605362*sqrt(2))]

In [6]: list(S)[0]
Out[6]: 7335.64225447845*exp(-1.77553477605362*sqrt(2))

In [7]: list(S)[0].n()
Out[7]: 595.567591563886
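Putting it together: a loop that collects solveset results into floats might look like this sketch (a simple stand-in equation is used, since mu and sigma are not shown in the question):

```python
import sympy as sp

x = sp.Symbol('x')

results = []
for p in (0.1, 0.5, 2.0):
    # Stand-in equation; substitute your erf expression here
    sol = sp.solveset(sp.Eq(x**3, p), x, domain=sp.S.Reals)
    results.append(float(list(sol)[0].n()))

print(results)
```

From there, numpy.array(results) gives the neat float array you are after.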
How to correctly add or subtract using floats?
For example, how do I perform:
2.4e-07 - 1e-8
so that it returns 2.3e-7 instead of 2.2999999999999997e-07.
Converting to int first yields unexpected results; the following returns 2.2e-07:
int(2.4e-07 * 1e8 - 1) * 1e-8
Similarly,
(2.4e-07 * 1e8 - 1) * 1e-8
returns 2.2999999999999997e-07.
How to perform subtraction and addition of numbers with 8 decimal point precision?
2.2999999999999997e-07 is not sufficient as the number is used as a lookup in a dictionary, and the key is 2.3e-7. This means that any value other than 2.3e-7 results in an incorrect lookup.
I suggest using the Decimal data type (it is in the standard Python installation), because it uses exact decimal arithmetic and so avoids just the differences you are talking about.
>>> from decimal import Decimal
>>> x = Decimal('2.4e-7')
>>> x
Decimal('2.4E-7')
>>> y = Decimal('1e-8')
>>> y
Decimal('1E-8')
>>> x - y
Decimal('2.3E-7')
It's really just a way of sidestepping floating-point arithmetic, but I too suggest the decimal package from the standard library. It lets you do exact decimal math.
Using your example,
>>> from decimal import Decimal
>>> x = Decimal('2.4e-7')
>>> y = Decimal('1e-8')
>>> x - y
Decimal('2.3E-7')
It's worth noting that Decimal objects are different from the float built-in, but they are mostly interchangeable.
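Since the motivating use case is a dictionary lookup, note that Decimal values hash by numeric value, so an exact-decimal key matches however it was written (the table here is a hypothetical stand-in for yours):

```python
from decimal import Decimal

# Hypothetical lookup table keyed by exact decimal values
table = {Decimal('2.3e-7'): 'hit'}

# The subtraction is exact, so the key matches
key = Decimal('2.4e-7') - Decimal('1e-8')
print(table[key])   # hit
```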
I do not know if it is what you are looking for, but you can try this kind of thing:
a = 0.555555555
a = float("{0:.2f}".format(a))
print(a)
# 0.56
I hope it will help you!
Adrien
I have been switching from Matlab to IPython.
In IPython, if we multiply 3.1 by 2.1, the following is the result:
In [297]:
3.1 * 2.1
Out[297]:
6.510000000000001
There is a small round-off error. It is not a big problem, but it is a little bit annoying. I assume it appears when decimal numbers are converted to binary and back. Is that right?
However, in a NumPy array, the result is correct:
>>> np.array([3.1 * 2.1])
array([ 6.51])
In Matlab command line prompt, also the result is correct:
>> 3.1 * 2.1
ans =
6.5100
The above round-off error in Python looks annoying. Are there some ways to avoid this error in the python interactive mode or in IPython?
The numpy result is no more precise than the pure Python one - the floating point imprecision is just hidden from you because, by default, numpy prints fewer decimal places of the result:
In [1]: float(np.array([3.1 * 2.1]))
Out[1]: 6.510000000000001
You can control how numpy displays floating point numbers using np.set_printoptions. For example, to print 16 decimal places rather than the usual 8:
In [2]: np.set_printoptions(precision=16)
In [3]: np.array([3.1 * 2.1])
Out[3]: array([ 6.5100000000000007])
In IPython you can also use the %precision magic to control the number of decimal places that are displayed when pretty-printing normal Python floats:
In [4]: %precision 8
Out[4]: u'%.8f'
In [5]: 3.1 * 2.1
Out[5]: 6.51000000
Note that this is purely cosmetic - the value of 3.1 * 2.1 will still be equal to 6.5100000000000006750155990... rather than 6.51.
In Octave, a MATLAB clone, I can display those distant decimals:
octave:12> printf("%24.20f\n", 3.1*2.1)
6.51000000000000067502
They are also present in your numpy array:
In [6]: np.array([3.1*2.1]).item()
Out[6]: 6.510000000000001
Even the component terms involve this sort of rounding:
octave:13> printf("%24.20f\n", 3.1)
3.10000000000000008882
octave:14> printf("%24.20f\n", 2.1)
2.10000000000000008882
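The same digits are visible from plain Python, without Octave or numpy, by asking for more decimal places than repr shows:

```python
# The doubles nearest 3.1 and 2.1 both carry the same excess,
# and so does their product
print("%.20f" % 3.1)          # 3.10000000000000008882
print("%.20f" % 2.1)          # 2.10000000000000008882
print("%.20f" % (3.1 * 2.1))  # 6.51000000000000067502
```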
I'm working with the mpmath Python library to gain precision during some computations, but I need to cast the result to a NumPy native type. More precisely, I need to cast an mpmath matrix (which contains mpf objects) to a numpy.ndarray (which contains floats).
I have solved the problem with a raw approach:
# My input matrix:
matr = mp.matrix(
    [['115.80200375', '22.80402473', '13.69453064', '54.28049263'],
     ['22.80402473', '86.14887381', '53.79999432', '42.78548627'],
     ['13.69453064', '53.79999432', '110.9695448', '37.24270321'],
     ['54.28049263', '42.78548627', '37.24270321', '95.79388469']])
# Multiple-precision computation
D = MPDBiteration(matr)
# Create a new ndarray (numpy.float is deprecated/removed; use plain float)
Z = numpy.ndarray((matr.rows, matr.cols), dtype=float)
# Fill it pretty "manually"
for i in range(matr.rows):
    for j in range(matr.cols):
        Z[i, j] = D[i, j]  # float(D[i, j]) seems to work the same
My question is:
Is there a better/more elegant/easier/clever way to do it?
UPDATE:
Reading the mpmath documentation again and again, I found the very useful method tolist(), which can be used as follows:
Z = np.array(matr.tolist(),dtype=np.float32)
It seems slightly better and more elegant (no for loops needed). Are there better ways to do it? Does my second solution round or chop extra digits?
Your second method is to be preferred, but using np.float32 means casting the numbers to single precision. For your matrix, that precision is too low: 115.80200375 becomes 115.80200195 due to truncation. You can set double precision explicitly with numpy.float64, or just pass Python's float type as the argument, which means the same:
Z = numpy.array(matr.tolist(), dtype=float)
or, to keep the matrix structure,
Z = numpy.matrix(matr.tolist(), dtype=float)
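The float32 truncation mentioned above can be checked directly, without mpmath, by round-tripping one of the matrix entries (a quick sketch):

```python
import numpy as np

value = 115.80200375
print(float(np.float32(value)))  # 115.802001953125 (digits lost)
print(float(np.float64(value)))  # 115.80200375
```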
You can also do the conversion when vectorizing a function (which is what we usually want to do anyway). The following example vectorizes and converts the jtheta function:
import numpy as np
import mpmath as mpm
jtheta3_fn = lambda z,q: mpm.jtheta(n=3,z=z,q=q)
jtheta3_fn = np.vectorize(jtheta3_fn,otypes=(float,))
I'm trying to compute this:
from scipy import *
3600**3400 * (exp(-3600)) / factorial(3400)
The error: unsupported long and float.
Try using logarithms instead of working with the numbers directly. Since none of your operations are addition or subtraction, you could do the whole thing in logarithm form and convert back at the end.
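As a sketch of that approach: math.lgamma gives log(n!) directly, so the whole expression can be evaluated in log space and exponentiated once at the end:

```python
import math

# log of 3600**3400 * exp(-3600) / factorial(3400)
log_result = 3400 * math.log(3600) - 3600 - math.lgamma(3401)
print(math.exp(log_result))   # roughly 2.3793e-05
```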
Computing with numbers of such magnitude, you just can't use ordinary 64-bit-or-so floats, which is what Python's core runtime supports. Consider gmpy (do not get the sourceforge version, it's aeons out of date) -- with that, math, and some care...:
>>> e = gmpy.mpf(math.exp(1))
>>> gmpy.mpz(3600)**3400 * (e**(-3600)) / gmpy.fac(3400)
mpf('2.37929475533825366213e-5')
(I'm biased about gmpy, of course, since I originated and still participate in that project, but I'd never make strong claims about its floating point abilities... I've been using it mostly for integer stuff... still, it does make this computation possible!-).
You could try using the Decimal object. Calculations will be slower but you won't have trouble with really small numbers.
from decimal import Decimal
I don't know how Decimal interacts with the scipy module, however.
This numpy discussion might be relevant.
Well, the error comes about because you are trying to multiply 3600**3400, which is a Python long (arbitrary-precision integer), by exp(-3600), which is a float. But regardless, the error you are receiving disguises the true problem: 3600**3400 is far too large to convert to a float, while exp(-3600) underflows to zero. The Python math library is fickle with extreme magnitudes, at best.
exp(-3600) is too small (it underflows to 0.0), and factorial(3400) is too large:
In [1]: from scipy import exp
In [2]: exp(-3600)
Out[2]: 0.0
In [3]: from scipy import factorial
In [4]: factorial(3400)
Out[4]: array(1.#INF)
As a workaround, what about computing it step by step (it also makes sense to check the smallest and biggest intermediate results)?
from math import exp

output = 1.0
smallest = 1e100
biggest = 0.0
# Interleave one factor of 3600*exp(-3600/3400) with one factorial divisor,
# so the running product stays within float range.
for i, j in zip(range(1, 1701), range(3400, 1700, -1)):
    output = output * 3600 * exp(-3600 / 3400) / i
    output = output * 3600 * exp(-3600 / 3400) / j
    smallest = min(smallest, output)
    biggest = max(biggest, output)
print("output:   %.11e" % output)
print("smallest: %.11e" % smallest)
print("biggest:  %.11e" % biggest)
The output is:
output:   2.37929475534e-05
smallest: 2.37929475534e-05
biggest:  1.28724174494e+214