Dealing with large numbers in R [Inf] and Python

I am learning Python these days, and this is probably my first post on Python. I am relatively new to R as well, and have been using R for about a year. I am comparing both the languages while learning Python. I apologize if this question is too basic.
I am unsure why R outputs Inf for something Python doesn't. Let's take 2^1500 as an example.
In R:
nchar(2^1500)
[1] 3
2^1500
[1] Inf
In Python:
len(str(2**1500))
Out[7]: 452
2**1500
Out[8]: 3507466211043403874...
I have two questions:
a) Why does R return Inf when Python doesn't?
b) I looked at the thread "How to work with large numbers in R?". It seems that the Brobdingnag package could help with large numbers. However, even with it I am unable to compute nchar. How do I compute the expression above, i.e. 2^1500, in R?
2^Brobdingnag::as.brob(500)
[1] +exp(346.57)
> nchar(2^Brobdingnag::as.brob(500))
Error in nchar(2^Brobdingnag::as.brob(500)) :
no method for coercing this S4 class to a vector

In answer to your questions:
a) They use different representations for numbers. Most numbers in R are represented as double precision floating point values. These are all 64 bits long and give about 15 to 16 significant decimal digits of precision throughout the range, which goes from -.Machine$double.xmax to .Machine$double.xmax (about 1.8e308) and then switches to signed infinite values. R also uses 32 bit integer values sometimes. These cover the range of roughly +/- 2 billion. R chooses these types because it is geared towards statistical and numerical methods, and those rarely need more precision than double precision gives. (They often need a bigger range, but usually taking logs solves that problem.)
Python is more of a general purpose platform, and it has types discussed in MichaelChirico's comment.
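The difference can be reproduced from the Python side alone. The following is a minimal sketch, not part of the original answer: it contrasts Python's exact int with a 64-bit float, which is the type R uses for 2^1500. Note that nchar(2^1500) in R returns 3 simply because the string "Inf" has three characters.

```python
import math
import sys

big_int = 2 ** 1500             # Python int: arbitrary precision, exact
digits = len(str(big_int))      # number of decimal digits in the exact value

# Largest finite double, the counterpart of R's .Machine$double.xmax.
largest_double = sys.float_info.max

# The same power computed in floating point does not fit in a double.
# Python raises OverflowError here; R silently returns Inf instead.
try:
    as_float = 2.0 ** 1500
except OverflowError:
    as_float = math.inf

print(digits)
print(largest_double)
print(as_float)
```

So Python only avoids Inf here because the literal 2 is an int; as soon as a float is involved, the same 64-bit limit applies.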
b) Besides Brobdingnag, the gmp package can handle arbitrarily large integers. For example,
> library(gmp)
> as.bigz(2)^1500
Big Integer ('bigz') :
[1] 35074662110434038747627587960280857993524015880330828824075798024790963850563322203657080886584969261653150406795437517399294548941469959754171038918004700847889956485329097264486802711583462946536682184340138629451355458264946342525383619389314960644665052551751442335509249173361130355796109709885580674313954210217657847432626760733004753275317192133674703563372783297041993227052663333668509952000175053355529058880434182538386715523683713208549376
> nchar(as.character(as.bigz(2)^1500))
[1] 452
I imagine the as.character() call would also be needed with Brobdingnag.

Apparently Python uses arbitrary precision integers by default when needed. R does not. However, there are many useful R packages to perform arbitrary precision arithmetic. Which package to pick depends on the use case.
To bring up a package that hasn't been discussed yet, consider the Rmpfr package:
> library(Rmpfr)
> a <- 2^mpfr(1500, 10000)
> a
1 'mpfr' number of precision 10000 bits
[1] 35074662110434038747627587960280857993524015880330828824075798024790963850563322203657080886584969261653150406795437517399294548941469959754171038918004700847889956485329097264486802711583462946536682184340138629451355458264946342525383619389314960644665052551751442335509249173361130355796109709885580674313954210217657847432626760733004753275317192133674703563372783297041993227052663333668509952000175053355529058880434182538386715523683713208549376
It requires you to set a precision, but if you make it large enough it can hold 2^1500 as an integer.
However, it also doesn't seem to define an as.character() function:
> as.character(a)
[1] "<S4 object of class \"mpfr1\">"
So if your problem is specifically to count digits, then the gmp package as discussed in this answer is probably the way to go. On the other hand, if you're interested in arbitrary precision floating point arithmetic, Rmpfr might be a better choice.

Related

Does Python document its behavior for rounding to a specified number of fractional digits?

Is the algorithm used for rounding a float in Python to a specified number of digits specified in any Python documentation? The semantics of round with zero fractional digits (i.e. rounding to an integer) are simple to understand, but it's not clear to me how the case where the number of digits is nonzero is implemented.
The most straightforward implementation of the function that I can think of (given the existence of round to zero fractional digits) would be:
def round_impl(x, ndigits):
    return (10 ** -ndigits) * round(x * (10 ** ndigits))
I'm trying to write some C++ code that mimics the behavior of Python's round() function for all values of ndigits, and the above agrees with Python for the most part, when translated to equivalent C++ calls. However, there are some cases where it differs, e.g.:
>>> round(0.493125, 5)
0.49312
>>> round_impl(0.493125, 5)
0.49313
There is clearly a difference that occurs when the value to be rounded is at or very near the exact midpoint between two potential output values. Therefore, it seems important that I try to use the same technique if I want similar results.
Is the specific means for performing the rounding specified by Python? I'm using CPython 2.7.15 in my tests, but I'm specifically targeting v2.7+.

Also refer to What Every Programmer Should Know About Floating-Point Arithmetic, which has more detailed explanations for why this is happening as it is.
This is a mess. First of all, as far as float is concerned, there is no such number as 0.493125. When you write 0.493125, what you actually get is:
0.493124999999999980015985556747182272374629974365234375
So this number is not exactly between two decimals, it's actually closer to 0.49312 than it is to 0.49313, so it should definitely round to 0.49312, that much is clear.
The problem is that when you multiply by 10^5, you get the exact number 49312.5. So what happened here is the multiplication gave you an inexact result which by coincidence canceled out the rounding error in the original number. Two rounding errors canceled each other out, yay! But the problem is that when you do this, the rounding is actually incorrect... at least if you want to round up at midpoints, but Python 3 and Python 2 behave differently. Python 2 rounds away from 0, and Python 3 rounds towards even least-significant digits.
Python 2
if two multiples are equally close, rounding is done away from 0
Python 3
...if two multiples are equally close, rounding is done toward the even choice...
Summary
In Python 2,
>>> round(49312.5)
49313.0
>>> round(0.493125, 5)
0.49312
In Python 3,
>>> round(49312.5)
49312
>>> round(0.493125, 5)
0.49312
And in both cases, 0.493125 is really just a short way of writing 0.493124999999999980015985556747182272374629974365234375.
So, how does it work?
I see two plausible ways for round() to actually behave.
1) Choose the closest decimal number with the specified number of digits, and then round that decimal number to float precision. This is hard to implement, because it requires doing calculations with more precision than you can get from a float.
2) Take the two closest decimal numbers with the specified number of digits, round them both to float precision, and return whichever is closer. This will give incorrect results, because it rounds numbers twice.
And Python chooses... option #1! The exactly correct, but much harder to implement version. Refer to Objects/floatobject.c:927 double_round(). It uses the following process:
Write the floating-point number to a string in decimal format, using the requested precision.
Parse the string back in as a float.
This uses code based on David Gay's dtoa library. If you want C++ code that gets the actual correct result like Python does, this is a good start. Fortunately you can just include dtoa.c in your program and call it, since its licensing is very permissive.
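The two-step process described above (write to a decimal string, parse it back) can be imitated in pure Python. This is only a sketch of the idea, not CPython's actual code: round_via_string is a hypothetical helper name, and it relies on the assumption that Python's float formatting is correctly rounded, which holds in CPython builds that use the dtoa-based code.

```python
from decimal import Decimal


def round_via_string(x, ndigits):
    """Round x to ndigits >= 0 decimals by formatting and re-parsing."""
    if ndigits < 0:
        raise ValueError("this sketch only handles ndigits >= 0")
    # Step 1: write the float as a decimal string with the requested precision.
    text = "{:.{n}f}".format(x, n=ndigits)
    # Step 2: parse the string back into a float.
    return float(text)


# The exact binary value behind the literal 0.493125 lies just below the midpoint.
print(Decimal(0.493125))
print(round_via_string(0.493125, 5))
print(round(0.493125, 5))
```

Because the formatting step sees the exact binary value, the result does not depend on an intermediate multiplication by a power of ten.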

The Python documentation for 2.7 specifies the behaviour:
Values are rounded to the closest multiple of 10 to the power minus
ndigits; if two multiples are equally close, rounding is done away
from 0.
For 3.7:
For the built-in types supporting round(), values are rounded to the
closest multiple of 10 to the power minus ndigits; if two multiples
are equally close, rounding is done toward the even choice
Update:
The (CPython) implementation can be found in floatobject.c in the function float___round___impl, which calls round if ndigits is not given, but double_round if it is.
double_round has two implementations.
One converts the double to a string (aka decimal) and back to a double.
The other one does some floating point calculations, calls to pow and at its core calls round. It seems to have more potential problems with overflows, since it actually multiplies the input by 10**-ndigits.
For the precise algorithm, look at the linked source file.

number format is different in linux and windows version of pycharm

I ran the same Python code in PyCharm on Linux and on Windows. On Linux the number was formatted as
-91.35357. On Windows it was formatted as
-91.35356999999999. The problem is that the value is part of the name of a file which I need to open (and the list of files to open is long).
Does anyone know a possible explanation and how to fix it?

Floats
Always remember that float numbers have a limited precision. If you think about it, there must be a limit to how exactly you represent a number if you limit storage to 32 or 64 bits (or any other number).
in Python
Python provides just one float type. In CPython it is implemented as a C double, which is 64 bits wide on virtually every platform, although the language itself does not strictly guarantee the size (see @Mark Dickinson's comment below).
Let's test this. But note that, because Python does not provide float32 and float64 alternatives, we will use a different library, numpy, to provide us with those types and operations:
>>> import numpy
>>> n = 1.23456789012345678901234567890
>>> n
1.2345678901234567
>>> numpy.float64(n)
1.2345678901234567
>>> numpy.float32(n)
1.2345679
Here we can see that Python, on my computer, handles the variable as a float64. This already rounds the number we entered (because a float64 can only handle so much precision).
When we use a float32, precision is further reduced and, because of truncation, the closest number we can represent is slightly different.
Conclusion
Float resolution is limited. Furthermore, some operations behave differently across different architectures.
Even if you are using a consistent float size, not all numbers can be represented, and operations will accumulate truncation errors.
Comparing a float to another float shall be done considering a possible error margin. Do not use float_a == float_b, instead use abs(float_a - float_b) < error_margin.
Relying on float representations is always a bad idea. Python sometimes uses scientific notation:
>>> a = 0.0000000001
>>> str(a)
'1e-10'
You can get a consistent rounded representation (e.g., for use in file names), but remember that storage and representation are different things. This other thread may assist you: Limiting floats to two decimal points
In general, I'd advise against using float numbers in file names or as any other kind of identifier.
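If the file names cannot be changed, one way to make them reproducible is to always build them with an explicit format instead of relying on str() or repr(). The sketch below assumes a made-up naming scheme (latitude_longitude.dat with five decimals); coord_filename and its parameters are hypothetical and would need to match the real files.

```python
def coord_filename(lat, lon, ndigits=5, ext=".dat"):
    """Build a file name from coordinates using a fixed number of decimals."""
    # An explicit precision gives the same text on every platform,
    # whatever the default float representation looks like.
    return f"{lat:.{ndigits}f}_{lon:.{ndigits}f}{ext}"


print(coord_filename(-91.35356999999999, 40.0))
print(coord_filename(-91.35357, 40.0))
```

Both calls produce the same name, so the two representations seen on Linux and Windows map to the same file.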
Latitude / Longitude
float32 numbers do not have enough precision to represent the 5th and 6th decimal places in latitude/longitude pairs (depending on whether the integer part has one, two or three digits).
If you want to learn what's really happening, check this page and test some of your numbers: https://www.h-schmidt.net/FloatConverter/IEEE754.html
Representing
Note that Python rounds float values when representing them:
>>> lat = 123.456789
>>> "{0:.6f}".format(lat)
'123.456789'
>>> "{0:.5f}".format(lat)
'123.45679'
And as stated above, latitude/longitude cannot be correctly represented by a float32 down to the 6th decimal, and furthermore, the truncated float values are rounded when presented by Python:
>>> lat = 123.456789
>>> lat
123.456789
>>> "{0:.5f}".format(numpy.float64(lat))
'123.45679'
>>> "{0:.5f}".format(numpy.float32(lat))
'123.45679'
>>> "{0:.6f}".format(numpy.float32(lat))
'123.456787'
As you can see, the float32 value rounded to six decimals (123.456787) no longer matches the original number (123.456789), while rounding to five decimals gives the same result, 123.45679, for both the float32 and the float64 value.
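The float32 behaviour can also be checked without numpy. The following sketch uses the standard struct module to round a Python float to single precision and back; to_float32 is a hypothetical helper name introduced only for this illustration.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


lat = 123.456789
lat32 = to_float32(lat)

print("{:.6f}".format(lat))     # double precision keeps the 6th decimal
print("{:.6f}".format(lat32))   # single precision has already lost it
print("{:.5f}".format(lat32))
```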

Your PyCharm on Linux is simply rounding off your long floating point number. Rounding it to 6 or 7 decimal places can resolve your issue, but DON'T USE THESE AS FILE NAMES.
Keeping your code the same in both cases, there can be several explanations:
1) 32-bit processors handle floats differently than 64-bit processors.
2) PyCharm for Linux and PyCharm for Windows handle floating point values differently in ways we cannot determine exactly; maybe PyCharm for Windows is better optimised.
edit 1
Explanation for Point 1
On 32-bit x86 processors (using the x87 FPU), everything is really done in 80-bit precision internally. The precision really just determines how many of those bits are stored in memory. This is part of the reason why different optimisation settings can change results slightly: they change the amount of rounding from 80-bit to 32- or 64-bit.
edit 2
You can use hashmapping for saving your data in files and then mapping them onto the co-ordinates.
Example:
# variable = {(long, lat): "<random_file_name>"}
coordinates_and_file = {(-92.45453534, -87.2123123): "AxdwaWAsdAwdz"}

How to avoid floating point arithmetics issues?

Python (and almost anything else) has known limitations while working with floating point numbers (nice overview provided here).
While the problem is described well in the documentation, it avoids providing any approach to fixing it. With this question I am seeking a more or less robust way to avoid situations like the following:
import math
print(math.floor(0.09/0.015)) # >> 6
print(math.floor(0.009/0.0015)) # >> 5
print(99.99-99.973) # >> 0.016999999999825377
print(.99-.973) # >> 0.017000000000000015
var = 0.009
step = 0.0015
print(var < math.floor(var/step)*step+step) # False
print(var < (math.floor(var/step)+1)*step) # True
And unlike what is suggested in this question, the solution there does not help to fix a problem like the following piece of code failing randomly:
total_bins = math.ceil((data_max - data_min) / width) # round to upper
new_max = data_min + total_bins * width
assert new_max >= data_max
# fails. because for example 1.9459999999999997 < 1.946

If you deal in discrete quantities, use int.
Sometimes people use float in places where they definitely shouldn't. If you're counting something (like number of cars in the world) as opposed to measuring something (like how much gasoline is used per day), floating-point is probably the wrong choice. Currency is another example where floating point numbers are often abused: if you're storing your bank account balance in a database, it's really not 123.45 dollars, it's 12345 cents. (But also see below about Decimal.)
Most of the rest of the time, use float.
Floating-point numbers are general-purpose. They're extremely accurate; they just can't represent certain fractions, like finite decimal numbers can't represent the number 1/3. Floats are generally suited for any kind of analog quantity where the measurement has error bars: length, mass, frequency, energy -- if there's uncertainty on the order of 2^(-52) or greater, there's probably no good reason not to use float.
If you need human-readable numbers, use float but format it.
"This number looks weird" is a bad reason not to use float. But that doesn't mean you have to display the number to arbitrary precision. If a number with only three significant figures comes out to 19.99909997918947, format it to one decimal place and be done with it.
>>> from math import e, pi
>>> print('{:0.1f}'.format(e**pi - pi))
20.0
If you need precise decimal representation, use Decimal.
Sraw's answer refers to the decimal module, which is part of the standard library. I already mentioned currency as a discrete quantity, but you may need to do calculations on amounts of currency in which not all numbers are discrete, for example calculating interest. If you're writing code for an accounting system, there will be rules that say when rounding is applied and to what accuracy various calculations are done, and those specifications will be written in terms of decimal places. In this situation and others where the decimal representation is inherent to the problem specification, you'll want to use a decimal type.
>>> from decimal import Decimal
>>> rate = Decimal('0.0345')
>>> principal = Decimal('3412.65')
>>> interest = rate*principal
>>> interest
Decimal('117.736425')
>>> interest.quantize(Decimal('0.01'))
Decimal('117.74')
But most importantly, use data types and operations that make sense in context.
Several of your examples use math.floor, which takes a float and rounds it down to the nearest integer. In any situation where you should use math.floor, floating-point error doesn't matter. (If you want to round to the nearest integer, use round instead.) Yes, there are ways to use floating-point operations that have wrong results from a mathematical standpoint. But real-world quantities usually fall into one of these categories:
Exact, and therefore should not be put in a float;
Imprecise to a degree far exceeding the likely accumulation of floating-point error.
As a programmer, it's part of your job to know the quantities you're dealing with and choose appropriate data types. So there's no "fix" for floating point numbers, because there's no "problem" really -- just people using the wrong type for the wrong thing.

Let's talk about decimal. Essentially, this library stores a number as a sequence of decimal digits (much like a string) and does its arithmetic on those digits.
So it can handle very large numbers with almost perfect precision.
But, because it computes digit by digit, it costs much more than float arithmetic.
Further, if you want to use decimal to ensure precision, you need to use it consistently. If you mix decimal with normal types such as float, it may cause unexpected problems.
Finally, when you construct a Decimal object, it is better to pass a string rather than a float.
>>> from decimal import Decimal
>>> print(Decimal(99.99) - Decimal(99.973))
0.01699999999999590727384202182
>>> print(Decimal("99.99") - Decimal("99.973"))
0.017
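Applied to the examples from the question, the same idea looks roughly as follows. This is a sketch under the assumption that the inputs are available as decimal strings; if they are already floats, the error has been introduced before Decimal ever sees them.

```python
import math
from decimal import Decimal

# Binary floats: 0.009 / 0.0015 lands slightly below 6.
float_bins = math.floor(0.009 / 0.0015)

# Decimal built from strings: the quotient is exactly 6.
var = Decimal("0.009")
step = Decimal("0.0015")
decimal_bins = math.floor(var / step)

diff = Decimal("99.99") - Decimal("99.973")
check = var < math.floor(var / step) * step + step

print(float_bins, decimal_bins)
print(diff)
print(check)
```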

It depends what your end goal is - there is no way to "perfectly" store floating point numbers, only "good enough".
If you are working with money, for example (dollars and cents), it is common practice not to store dollars, only cents (1 dollar = 100 cents) - this is how PayPal stores your account balance on their servers.
There is also the Python Decimal class for decimal arithmetic.
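A minimal sketch of the cents approach might look like this. The helper names to_cents and format_cents are made up for the example, and the parsing step assumes amounts arrive as strings with at most two decimals (anything finer is rounded by Decimal's default rule).

```python
from decimal import Decimal


def to_cents(amount):
    """Parse a decimal string such as '123.45' into integer cents."""
    return int((Decimal(amount) * 100).to_integral_value())


def format_cents(cents):
    """Render integer cents as a dollars.cents string."""
    sign = "-" if cents < 0 else ""
    dollars, rest = divmod(abs(cents), 100)
    return f"{sign}{dollars}.{rest:02d}"


balance = to_cents("0.10") + to_cents("0.20")   # integer addition, exact
print(balance)
print(format_cents(balance))
```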

Python: Decimals with trigonometric functions

I'm having a little problem, take a look:
>>> import math
>>> math.sin(math.pi)
1.2246467991473532e-16
This is not what I learnt in my Calculus class (It was 0, actually)
So, now, my question:
I need to perform some heavy trigonometric calculations with Python. What library can I use to get correct values?
Can I use Decimal?
EDIT:
Sorry, what I meant is something else.
What I want is some way to do:
>>> awesome_lib.sin(180)
0
or this:
>>> awesome_lib.sin(Decimal("180"))
0
I need a library that performs good trigonometric calculations. Everybody knows that sin 180° is 0; I need a library that can do that too.

1.2246467991473532e-16 is close to 0 -- there are 15 zeroes between the decimal point and the first significant digit -- much as 3.1415926535897931 (the value of math.pi) is close to pi. The answer is correct to 15 decimal places!
So if you want sin(pi) to equal 0, simply round it to a reasonable number of decimal places. 15 looks good to me and should be plenty for any application:
print(round(math.sin(math.pi), 15))

Pi is an irrational number so it can't be represented exactly using a finite number of bits. However, you can use some library for symbolic computation such as sympy.
>>> import sympy
>>> sympy.sin(sympy.pi)
0
Regarding the second part of your question, if you want to use degrees instead of radians you can define a simple conversion function
def radians(x):
    return x * sympy.pi / 180
and use it as follows:
>>> sympy.sin(radians(180))
0
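If pulling in sympy is too heavy, a degree-based sine that returns exact values at the quarter turns can be sketched with the standard library alone. The function sin_deg below is a hypothetical helper, not an existing library call; it special-cases multiples of 90 degrees and falls back to math.sin elsewhere, so other angles are still subject to the usual float rounding.

```python
import math


def sin_deg(deg):
    """Sine of an angle in degrees, exact at multiples of 90 degrees."""
    deg = deg % 360                      # reduce to the range [0, 360)
    exact = {0: 0.0, 90: 1.0, 180: 0.0, 270: -1.0}
    if deg in exact:
        return exact[deg]
    return math.sin(math.radians(deg))


print(sin_deg(180))
print(sin_deg(-90))
print(sin_deg(30))
```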

If you find the result unexpected, I'd suggest that you have a look at this text:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
It's really worth it.

You can also try gmpy or real.
In gmpy you can specify the precision explicitly:
gmpy.pi(256)
In real.py you could use the pa() function:
from real import pa,pi
pa(pi)

Short Answer -
Decimal.cos() and Decimal.sin() can both be implemented from the Decimal.exp() implementation by splitting all even terms into the cos() function and all the odd terms into the sin() function and alternating the sign of each term between positive and negative in both of those series. No change is needed in the loop, which only computes N terms based on the configured precision (decimal.getcontext().prec).
Long Answer -
Python's decimal.Decimal supports an exp() function that takes only a real number argument (unlike exp() in the R language) and computes the infinite series only up to the number of terms required by the configured precision (decimal.getcontext().prec).
Currently the even terms compute cosh() and the odd terms compute sinh(). Their sum is returned as the result of exp(). If the sign of each term was modified to alternate between positive and negative within each series, the even-terms-series will compute cos() and the odd-terms-series would compute sin().
Additionally, like R language, this change could enable Decimal.exp() to support complex arguments, so that exp(1j*x) could return Decimal.cos(x) + 1j * Decimal.sin(x).
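The series idea can be sketched directly in Python. The code below follows the pattern of the Taylor series recipes in the decimal module documentation rather than modifying Decimal.exp() itself; dec_sin and dec_cos are illustrative names, and the argument is assumed to be a Decimal of moderate size (large arguments would need range reduction first).

```python
from decimal import Decimal, localcontext


def dec_cos(x):
    """cos(x) from the even terms of the series, with alternating signs."""
    with localcontext() as ctx:
        ctx.prec += 2                    # two guard digits while summing
        i, last, total, fact, num, sign = 0, 0, Decimal(1), 1, Decimal(1), 1
        while total != last:             # stop once a term no longer changes the sum
            last = total
            i += 2
            fact *= i * (i - 1)
            num *= x * x
            sign *= -1
            total += num / fact * sign
    return +total                        # round back to the caller's precision


def dec_sin(x):
    """sin(x) from the odd terms of the series, with alternating signs."""
    with localcontext() as ctx:
        ctx.prec += 2
        i, last, total, fact, num, sign = 1, 0, x, 1, x, 1
        while total != last:
            last = total
            i += 2
            fact *= i * (i - 1)
            num *= x * x
            sign *= -1
            total += num / fact * sign
    return +total


print(dec_sin(Decimal("0.5")))
print(dec_cos(Decimal("0.5")))
```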

Unable to see Python's approximations in mathematical calculations

Problem: seeing when the computer makes approximations in mathematical calculations in Python
Example of the problem:
My old teacher once made the following statement:
You can never calculate 200! with your computer.
I am not completely sure whether it is true or not nowadays.
It seems that it is, since I get a lot of zeros for it from a Python script.
How can you see when your Python code makes approximations?

Python uses arbitrary-precision arithmetic for integers, so it can calculate 200! exactly. For real numbers (so-called floating-point numbers), Python does not use an exact representation. It uses a binary representation called IEEE 754, which is essentially scientific notation, except in base 2 instead of base 10.
Thus, for any real number that cannot be exactly represented in base 2 with 53 bits of precision, Python cannot produce an exact result. For example, 0.1 (in base 10) is an infinitely repeating fraction in base 2, 0.0001100110011..., so it cannot be exactly represented. Hence, if you enter on a Python prompt:
>>> 0.1
0.10000000000000001
The result you get back is different, since it has been converted from decimal to binary (with 53 bits of precision) and back to decimal. As a consequence, you get things like this:
>>> 0.1 + 0.2 == 0.3
False
For a good (but long) read, see What Every Programmer Should Know About Floating-Point Arithmetic.
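To see the approximation itself rather than its symptoms, the standard library can print the exact value a float holds. The sketch below uses Decimal and Fraction for that, and math.isclose (available since Python 3.5) as one common way to compare floats with a tolerance.

```python
import math
from decimal import Decimal
from fractions import Fraction

exact_tenth = Decimal(0.1)    # exact decimal expansion of the stored double
ratio = Fraction(0.1)         # the same value as a ratio of two integers

print(exact_tenth)
print(ratio)
print(0.1 + 0.2 == 0.3)                 # exact comparison fails
print(math.isclose(0.1 + 0.2, 0.3))     # comparison with a relative tolerance
```

The denominator of the fraction is a power of two, which is what "represented in base 2" means in practice.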

Python has unbounded integer sizes in the form of a long type. That is to say, if it is a whole number, the limit on the size of the number is restricted by the memory available to Python.
When you compute a large number such as 200! and you see an L on the end of it, that means Python has automatically cast the int to a long, because an int was not large enough to hold that number.
See section 6.4 of this page for more information.

200! is a very large number indeed.
If the range of an IEEE 64-bit double is about 1.7E+/-308 (with roughly 15 significant digits), you can see that the largest factorial that fits in a double is 170!.
Python can handle arbitrary sized numbers, as can Java with its BigInteger.
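Both claims are easy to check with the standard library. The following sketch computes 200! exactly as an int and then shows where the conversion to a 64-bit float stops working.

```python
import math
import sys

exact = math.factorial(200)          # exact integer result
digits = len(str(exact))             # how many decimal digits it has

largest_ok = float(math.factorial(170))   # still fits in a double

# 171! exceeds the largest finite double, so the conversion fails.
try:
    float(math.factorial(171))
    overflowed = False
except OverflowError:
    overflowed = True

print(digits)
print(largest_ok)
print(overflowed)
```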

Without some sort of clarification to that statement, it's obviously false. Just from personal experience, early lessons in programming (in the late 1980s) included solving very similar, if not exactly the same, problems. In general, to know some device which does calculations isn't making approximations, you have to prove (in the math sense of a proof) that it isn't.
Python's integer types (named int and long in 2.x, both folded into just the int type in 3.x) are very good, and do not overflow like, for example, the int type in C. If you do the obvious thing and print 200 * 199 * 198 * ... it may be slow, but it will be exact. Similarly, addition, subtraction, and modulus are exact. Division is a mixed bag, as there are two operators, / and //, and they underwent a change in 2.x; in general you can only treat it as inexact.
If you want more control yet don't want to limit yourself to integers, look at the decimal module.

Python handles large numbers automatically (unlike a language like C where you can overflow its datatypes and the values reset to zero, for example) - over a certain point (sys.maxint or 2147483647) it converts the integer to a "long" (denoted by the L after the number), which can be any length:
>>> def fact(x):
...     return reduce(lambda x, y: x * y, range(1, x+1))
...
>>> fact(10)
3628800
>>> fact(200)
788657867364790503552363213932185062295135977687173263294742533244359449963403342920304284011984623904177212138919638830257642790242637105061926624952829931113462857270763317237396988943922445621451664240254033291864131227428294853277524242407573903240321257405579568660226031904170324062351700858796178922222789623703897374720000000000000000000000000000000000000000000000000L
Long numbers are "easy", floating point is more complicated, and almost any computer representation of a floating point number is an approximation, for example:
>>> float(1)/3
0.33333333333333331
Obviously you can't store an infinite number of 3's in memory, so it cheats and rounds it a bit.
You may want to look at the decimal module:
Decimal numbers can be represented exactly. In contrast, numbers like 1.1 do not have an exact representation in binary floating point. End users typically would not expect 1.1 to display as 1.1000000000000001 as it does with binary floating point.
Unlike hardware based binary floating point, the decimal module has a user alterable precision (defaulting to 28 places) which can be as large as needed for a given problem
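The two points quoted from the documentation can be tried out as follows. This is a small sketch that assumes the default decimal context has not been changed; it uses localcontext so that the higher precision stays limited to one block.

```python
from decimal import Decimal, localcontext

# Decimal literals given as strings are stored exactly.
exact_sum = Decimal("1.1") + Decimal("2.2")
float_sum = 1.1 + 2.2

# The default context keeps 28 significant digits.
third_default = Decimal(1) / Decimal(3)

# The precision can be raised for a block of code.
with localcontext() as ctx:
    ctx.prec = 50
    third_50 = Decimal(1) / Decimal(3)

print(exact_sum, float_sum)
print(third_default)
print(third_50)
```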

See Handling very large numbers in Python.
Python has a built-in arbitrary-precision integer (bignum) type that can hold 200!, and it will use it automatically.
Your teacher's statement, though not exactly true here, is true in general. Computers have limitations, and it is good to know what they are. Remember that every time you add another 32-bit word of storage, you can store a number that is 2^32 (4 billion +) times larger. It is hard to comprehend how many more numbers that is - but maths gets slower as you add more words to store the exact value of a very large number.
As an example (a number that needs about 1000 bits):
>>> 2 << 1000
2143017214372534641896850098120003621122809623411067214887500776740702102249872244986396
7576313917162551893458351062936503742905713846280871969155149397149607869135549648461970
8421492101247422837559083643060929499671638825347975351183310878921541258291423929553730
84335320859663305248773674411336138752L
I tried to illustrate how big a number you can store with 10000 bits, or even 8,000,000 bits (a megabyte) but that number is many pages long.
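Rather than printing such numbers, their size can be measured. The sketch below uses int.bit_length() to count the bits of the example above and the number of decimal digits as a rough measure of how long the printout would be.

```python
n = 2 << 1000                    # the same value as 2 ** 1001

bits = n.bit_length()            # bits needed to store n
decimal_digits = len(str(n))     # length of the printed number

print(bits)
print(decimal_digits)
print((2 ** 1500).bit_length())
```

Each extra bit doubles the range, so the number of decimal digits grows by about 0.3 per bit.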
