Why does pd.to_numeric not work with large numbers?

Let's say I have a large number in a string, like '555555555555555555555'. One could choose to convert it to an int, float or even a numpy float:
int('555555555555555555555')
float('555555555555555555555')
np.float64('555555555555555555555')  # np.float was removed in NumPy 1.24
However, when I use the pandas function pd.to_numeric, things go wrong:
pd.to_numeric('555555555555555555555')
With error:
Traceback (most recent call last):
  File "pandas/_libs/src/inference.pyx", line 1173, in pandas._libs.lib.maybe_convert_numeric
ValueError: Integer out of range.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\path_to_conda\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-34-6a735441ab7b>", line 1, in <module>
    pd.to_numeric('555555555555555555555')
  File "C:\path_to_conda\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
    coerce_numeric=coerce_numeric)
  File "pandas/_libs/src/inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric
ValueError: Integer out of range. at position 0
What's going wrong? Why can't pd.to_numeric handle larger values? Are there use cases where you would use pd.to_numeric instead of float or NumPy's float type?

Because your number is larger than the largest integer that fits in the 64-bit integer types pandas uses. It already exceeds sys.maxsize, which is the signed 64-bit maximum on a 64-bit build:
In [4]: import sys
In [5]: sys.maxsize
Out[5]: 9223372036854775807
In [6]: 555555555555555555555 > sys.maxsize
Out[6]: True
Here is part of the source code that raises the ValueError:
if not (seen.float_ or as_int in na_values):
    if as_int < oINT64_MIN or as_int > oUINT64_MAX:
        raise ValueError('Integer out of range.')
As you can see, because your number is not a float, pandas treats it as an integer and checks whether it lies in the range oINT64_MIN to oUINT64_MAX. If you had passed a float instead, it would have given you the proper result:
In [9]: pd.to_numeric('555555555555555555555.0')
Out[9]: 5.5555555555555554e+20
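The range check above can be imitated in plain Python. This is only a sketch, not pandas' implementation: the constants INT64_MIN and UINT64_MAX are assumed to mirror oINT64_MIN and oUINT64_MAX, and fits_in_64_bit_integer and parse_number are hypothetical helper names introduced for illustration.

```python
# Bounds assumed to mirror oINT64_MIN / oUINT64_MAX from the pandas snippet.
INT64_MIN = -2**63
UINT64_MAX = 2**64 - 1


def fits_in_64_bit_integer(text):
    """Return True if the integer literal in `text` fits in int64 or uint64."""
    value = int(text)  # Python ints are arbitrary precision, so this cannot overflow
    return INT64_MIN <= value <= UINT64_MAX


def parse_number(text):
    """Hypothetical helper: keep the exact int when it fits, else fall back to float."""
    try:
        value = int(text)
    except ValueError:
        # Not an integer literal (for example '1.5'), so parse it as a float.
        return float(text)
    if INT64_MIN <= value <= UINT64_MAX:
        return value
    return float(value)  # loses precision, like passing '...0' to pd.to_numeric


print(fits_in_64_bit_integer('9223372036854775807'))    # True
print(fits_in_64_bit_integer('555555555555555555555'))  # False
print(type(parse_number('555555555555555555555')))      # <class 'float'>
```

Falling back to float trades exactness for range, which is the same trade-off as appending '.0' to the string in the example above.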

Related

How can I access the raw data from individual data frame cells using pandas?

I'm working on a custom elo/team rating calculator using a CSV file as input. I was able to get similar logic for this working in Excel with openpyxl but I am now trying to implement it in pandas for better integration with jupyter and matplotlib. I'm having issues running calculations on individual cells in the data frames, however.
def find_team_row(team_name):
    switcher = {
        '100T': 0,
        'C9': 1,
        'CG': 2,
        'CLG': 3,
        'FOX': 4,
        'FLY': 5,
        'GGS': 6,
        'OPT': 7,
        'TL': 8,
        'TSM': 9,
    }
    return switcher.get(team_name, None)
def update_df():
    for column in range(1, df.columns.get_loc(df.columns[-1]), 3):
        for row in range(0,9):
            init_rating = df.iloc[row,column]
            opponent_name = df.iloc[row,column+1]
            match_result = df.iloc[row,column+2]
            oppo_rating = df.iloc[find_team_row(opponent_name),column]
These exceptions are thrown with respect to this code block:
ajisaksonmac:elo_calc ajisakson$ /Library/Frameworks/Python.framework/Versions/3.7/bin/python3 "/Users/ajisakson/Google Drive/swe_projects/elo_calc/test2.py"
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 235, in _has_valid_tuple
    self._validate_key(k, i)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 2035, in _validate_key
    "a [{types}]".format(types=self._valid_types)
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/ajisakson/Google Drive/swe_projects/elo_calc/test2.py", line 74, in <module>
    update_df()
  File "/Users/ajisakson/Google Drive/swe_projects/elo_calc/test2.py", line 27, in update_df
    oppo_rating = df.iloc[find_team_row(opponent_name),column]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1418, in __getitem__
    return self._getitem_tuple(key)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 2092, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 239, in _has_valid_tuple
    "[{types}] types".format(types=self._valid_types)
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
So I'm trying to access individual cells in the data frame using iloc but I receive this ValueError where oppo_rating is assigned. I tried a number of different things to convert both of the iloc parameters to integers including int(), .iat(), .at(), .loc(), etc. and I continue to receive errors suggesting that one of my parameters is not an integer.
Here is the first part of the data frame I'm trying to manipulate/make calculations on:
[Image: example of the pandas data frame]

Python Compiler - Using Decimal for Division of Zero?

I am currently creating a compiler and would also like to implement division by zero. I noticed the decimal module in Python and thought it could be useful. The rules below show what I am trying to get at. Is there any way to split up the expression and check for division by 0 for both negative and positive numbers? Thanks in advance.
negative int / 0 = -infinity
positive int / 0 = infinity
0 / 0 = null
etc.

The Python documentation says the default behaviour is to raise an exception:
>>> import decimal
>>> D = decimal.Decimal
>>> a = D("12")
>>> b = D("0")
>>> a/b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/decimal.py", line 1350, in __truediv__
    return context._raise_error(DivisionByZero, 'x / 0', sign)
  File "/usr/lib/python3.4/decimal.py", line 4050, in _raise_error
    raise error(explanation)
decimal.DivisionByZero: x / 0
>>>
But the documentation also describes what happens instead of raising an exception:
If this signal is not trapped, returns Infinity or -Infinity with the
sign determined by the inputs to the calculation.
As for zero divided by zero, it is not mathematically defined, so I don't know what you may want to do with it. Python raises a different exception:
>>> b/b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/decimal.py", line 1349, in __truediv__
    return context._raise_error(DivisionUndefined, '0 / 0')
  File "/usr/lib/python3.4/decimal.py", line 4050, in _raise_error
    raise error(explanation)
decimal.InvalidOperation: 0 / 0
>>>
If that signal is not trapped, the division returns NaN instead. (The NaN result alone does not tell you whether the operation was zero divided by zero.)
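As a minimal sketch of the untrapped behaviour described above, the two signals can be switched off in a local context so that division returns Infinity, -Infinity or NaN rather than raising. The function name divide is a hypothetical helper, not part of the decimal module.

```python
from decimal import Decimal, DivisionByZero, InvalidOperation, localcontext


def divide(a, b):
    """Divide two Decimals with the division-related signals untrapped."""
    with localcontext() as ctx:
        # With the traps disabled, the operation returns a special value
        # instead of raising DivisionByZero or InvalidOperation.
        ctx.traps[DivisionByZero] = False
        ctx.traps[InvalidOperation] = False
        return a / b


print(divide(Decimal(12), Decimal(0)))   # Infinity
print(divide(Decimal(-12), Decimal(0)))  # -Infinity
print(divide(Decimal(0), Decimal(0)))    # NaN
```

A compiler could map the NaN result to its own null value by checking the result with Decimal.is_nan().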

numpy not returning the correct median value

Alright I am a bit confused here, I have a list that looks like:
>>> _list
['-1.24235365387e-07', '-2.31373100323e-07', '-3.4561064219e-07', '-4.5226775879e-08', '-4.8495857305e-06', '-6.05262333229e-07', '-6.87756245459e-07', '1.01130316722e-06', '1.12310282664e-07', '1.49359255132e-06', '1.56048010364e-06', '2.43283432336e-07', '3.04787966681e-07', '3.44224562526e-06', '3.89199793328e-07', '4.61725496189e-07', '4.91574219806e-07', '6.42046115267e-07', '6.52594949337e-07', '7.29511505567e-07', '8.38829381985e-07', '8.59463647511e-07', '8.89956059753e-07']
>>> len(_list)
23
With a median value of:
>>> _list[int(len(_list)/2)]
'2.43283432336e-07'
but when I do:
>>> median(array(_list,dtype=float))
4.6172549618900001e-07
I get a different median value, so I must be doing something wrong here. With a list of integers, the two approaches agree:
>>> median([-1,-2,-3,-4,-5,-6,-7,-8,-9,0,1,2,3,4,5,6,7,8,9])
0.0
>>> [-1,-2,-3,-4,-5,-6,-7,-8,-9,0,1,2,3,4,5,6,7,8,9][int(len([-1,-2,-3,-4,-5,-6,-7,-8,-9,0,1,2,3,4,5,6,7,8,9])/2)]
0
Dropping the dtype gives:
>>> median(array(_list))
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    median(array(_list))
  File "C:\Python27\lib\site-packages\numpy\lib\function_base.py", line 2718, in median
    return mean(part[indexer], axis=axis, out=out)
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2716, in mean
    out=out, keepdims=keepdims)
  File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 62, in _mean
    ret = um.add.reduce(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
TypeError: cannot perform reduce with flexible type
If someone could steer me in the right direction I would appreciate it, thanks.

I'm guessing it's because _list contains strings: your values are in lexicographic order, not numerical order, so the middle element of the list is not the median. Try re-sorting the data after the conversion to float.

Sorry, complete user error. These values are read in from a txt file I made earlier, so they have the type str instead of float, and apparently that affects numpy. Converting them to float makes everything work. Total user error, my fault.

pagerank python implementation

I want to convert PageRank MATLAB/Octave implementation to python, but when it comes to:
a=array([[inf]])
last_v = dot(ones(N,1),a)
there is a TypeError.
Traceback (most recent call last):
  File "/home/googcheng/page_rank.py", line 18, in <module>
    pagerank(0,0)
  File "/home/googcheng/page_rank.py", line 14, in pagerank
    last_v = dot(ones(N,1),a)
  File "/usr/lib/python2.7/dist-packages/numpy/core/numeric.py", line 1819, in ones
    a = empty(shape, dtype, order)
TypeError: data type not understood
Some code: https://gist.github.com/3722398
some code https://gist.github.com/3722398

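The first argument to ones, the shape, should be a tuple. In ones(N,1), the 1 is interpreted as the dtype argument, which is why the error says "data type not understood". Change ones(N,1) to ones((N,1)).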

Python cdecimal InvalidOperation

I am trying to read financial data and store it. The place I get the financial data from stores the data with incredible precision, however I am only interested in 5 figures after the decimal point. Therefore, I have decided to use t = .quantize(cdecimal.Decimal('.00001'), rounding=cdecimal.ROUND_UP) on the Decimal I create, but I keep getting an InvalidOperation exception. Why is this?
>>> import cdecimal
>>> c = cdecimal.getcontext()
>>> c.prec = 5
>>> s = '45.2091000080109'
>>> # s = '0.257585003972054' works!
>>> t = cdecimal.Decimal(s).quantize(cdecimal.Decimal('.00001'), rounding=cdecimal.ROUND_UP)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
cdecimal.InvalidOperation: [<class 'cdecimal.InvalidOperation'>]
Why is there an invalid operation here? If I change the precision to 7 (or greater), it works. If I set s to be '0.257585003972054' instead of the original value, that also works! What is going on?
Thanks!

The decimal version gives a better description of the error:
Python 2.7.2+ (default, Feb 16 2012, 18:47:58)
>>> import decimal
>>> s = '45.2091000080109'
>>> decimal.getcontext().prec = 5
>>> decimal.Decimal(s).quantize(decimal.Decimal('.00001'), rounding=decimal.ROUND_UP)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/decimal.py", line 2464, in quantize
    'quantize result has too many digits for current context')
  File "/usr/lib/python2.7/decimal.py", line 3866, in _raise_error
    raise error(explanation)
decimal.InvalidOperation: quantize result has too many digits for current context
>>>
Docs:
Unlike other operations, if the length of the coefficient after the
quantize operation would be greater than precision, then an
InvalidOperation is signaled. This guarantees that, unless there is an
error condition, the quantized exponent is always equal to that of the
right-hand operand.
But I must confess I don't know what this means.
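One way to read the quoted passage: the precision counts all significant digits of the result, not only those after the decimal point. 45.20911 has seven significant digits, which does not fit in a precision of 5, while 0.25759 has five, which does. The sketch below illustrates this with the standard decimal module (used here instead of cdecimal, as an assumption); quantize_with_precision is a hypothetical helper name.

```python
from decimal import Decimal, InvalidOperation, ROUND_UP, localcontext

STEP = Decimal('0.00001')  # five digits after the decimal point


def quantize_with_precision(text, precision):
    """Quantize `text` to five decimal places under the given context precision."""
    with localcontext() as ctx:
        ctx.prec = precision
        # quantize signals InvalidOperation if the result needs more
        # significant digits than ctx.prec allows.
        return Decimal(text).quantize(STEP, rounding=ROUND_UP)


print(quantize_with_precision('0.257585003972054', 5))  # 0.25759
print(quantize_with_precision('45.2091000080109', 7))   # 45.20911

try:
    quantize_with_precision('45.2091000080109', 5)
except InvalidOperation:
    print('45.20911 needs 7 significant digits, but the precision is 5')
```

For storing values with a fixed number of decimal places, it is usually enough to leave the context precision at its default (28) and let quantize do the rounding.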
