Converting exponential values to negative integers [duplicate] - python

My task is to read data from Excel into a dataframe. The data is a bit messy, and to clean it up I've done:
import re
import pandas as pd

df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]': 'good_name',
                            'Штрихкод': 'barcode',
                            'Цена шт. руб.': 'price',
                            'Остаток': 'balance'})
df_1 = df_1[new_columns]  # new_columns is defined elsewhere
# I don't know why, but without replacing NaN with another character the code doesn't work
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()
This returns the barcode column with dtype float64 (why is that?):
0    0.000000e+00
1    7.613037e+12
2    7.613037e+12
3    7.613034e+12
4    7.613035e+12
Name: barcode, dtype: float64
Then I try to convert that column to integer.
df_1.barcode = df_1.barcode.astype(int)
But I keep getting silly negative numbers.
df_1.barcode[0:5]
0             0
1   -2147483648
2   -2147483648
3   -2147483648
4   -2147483648
Name: barcode, dtype: int32
Thanks to @Will and @micric, I eventually got to a solution.
df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replace NaN with '0'; this makes it possible to convert the column explicitly to an integer dtype
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')
Summary:
pd.to_numeric converts NaN to float64. As a result, a column containing
both NaN and non-NaN values should be expected to come back as dtype float64.
Check the size of the numbers you're dealing with: int32 has its limits,
namely the range -2**31 to 2**31 - 1 (i.e. -2147483648 to 2147483647).
Thanks a lot for your help, guys!

That number is the int32 lower limit. Your number is outside the range of the int32 type you are trying to use, so the cast overflows to that limit (notice that 2**32 = 4294967296, and divided by 2 that is 2147483648, exactly the magnitude of your number).
You should use astype('int64') instead.
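To make the overflow concrete, here is a minimal sketch with values like the barcodes above:

import numpy as np
import pandas as pd

s = pd.Series([0.0, 7.613037e12])
print(np.iinfo(np.int32).max)  # 2147483647 -- far too small for a 13-digit barcode
print(s.astype(np.int64))      # 0 and 7613037000000 -- int64 holds it comfortably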

I ran into the same problem as the OP; using
astype(np.int64)
solved mine, see the link here.
I like this solution because it's consistent with how I usually change the dtype of a pandas column; maybe someone could compare the performance of these solutions.

Many questions in one.
So, about your expected dtype:
pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
pd.to_numeric with downcast='integer' would give you an integer; however, you have NaNs in your data, and pandas needs to use the float64 type to represent NaNs.
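A minimal sketch of that behaviour; note that downcast='integer' is simply skipped once the result contains a NaN:

import pandas as pd

s = pd.Series(['7613037000000', None])
out = pd.to_numeric(s, downcast='integer')
print(out.dtype)  # float64 -- the NaN keeps the column from being downcast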

Related

python float64 type conversion issue with pandas

I need to convert an 18-digit float64 pandas column to an integer or a string, so that it is readable without the exponential notation.
But I have not been successful so far.
df = pd.DataFrame(data={'col1': [915235514180670190, 915235514180670208]}, dtype='float64')
print(df)
           col1
0  9.152355e+17
1  9.152355e+17
Then I tried converting it to int64, but the last 3 digits come out wrong.
df.col1.astype('int64')
0    915235514180670208
1    915235514180670208
Name: col1, dtype: int64
But as you can see, the value comes out wrong, and I'm not sure why.
I read in the documentation that int64 should be able to hold an 18-digit number:
int64 Integer (-9223372036854775808 to 9223372036854775807)
Any idea what I am doing wrong?
How can I achieve my requirement?
Adding further info based on Eric Postpischil's comment:
If float64 can't hold 18 digits, I might be in trouble.
The thing is that I get this data through a pandas read_sql call from a database, and it is automatically cast to float64.
I don't see an option to specify the datatype in pandas read_sql().
Any thoughts from anyone on what I can do to overcome this problem?
The problem is that a float64 has a 53-bit mantissa, which can represent only 15 or 16 decimal digits (ref).
That means that an 18-digit float64 pandas column is an illusion. No need to even go into pandas or numpy types:
>>> n = 915235514180670190
>>> d = float(n)
>>> print(n, d, int(d))
915235514180670190 9.152355141806702e+17 915235514180670208
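The cutoff is 2**53: every integer below it is exactly representable as a float64, and above it gaps appear. A quick check in plain Python:

n = 2**53                        # 9007199254740992, the last "safe" integer
print(float(n) == float(n + 1))  # True -- n + 1 has no float64 representation
print(int(float(n + 1)))         # 9007199254740992, silently rounded back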
read_sql in Pandas has a coerce_float parameter that might help. It's on by default, and is documented as:
Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
Setting this to False helps, e.g. with the following schema/data:
import psycopg2

con = psycopg2.connect()
with con, con.cursor() as cur:
    cur.execute("CREATE TABLE foo ( id SERIAL PRIMARY KEY, num DECIMAL(30,0) )")
    cur.execute("INSERT INTO foo (num) VALUES (123456789012345678901234567890)")
I can run:
print(pd.read_sql("SELECT * FROM foo", con))
print(pd.read_sql("SELECT * FROM foo", con, coerce_float=False))
which gives me the following output:
   id           num
0   1  1.234568e+29
   id                             num
0   1  123456789012345678901234567890
preserving the precision of the value I inserted.
You've not given many details of the database you're using, but hopefully the above is helpful to somebody!
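Note that with coerce_float=False the column typically comes back as decimal.Decimal objects in an object-dtype column. A minimal sketch of what you can then do with it (the hand-built DataFrame here just stands in for the read_sql result):

from decimal import Decimal
import pandas as pd

# stand-in for pd.read_sql(..., coerce_float=False)
df = pd.DataFrame({'num': [Decimal('123456789012345678901234567890')]})
print(df['num'].dtype)                # object
print(df['num'].astype(str).iloc[0])  # all 30 digits preserved
print(int(df['num'].iloc[0]))         # exact Python int, no rounding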
I did a workaround to deal with that problem. Thought of sharing it, as it may help someone else.
# Preparing SQL to extract all rows
sql = 'SELECT *, CAST(col1 AS CHAR(18)) AS DUMMY_COL FROM table1;'
# Get data from postgres
df = pd.read_sql(sql, self.conn)
# convert the dummy column to integer
df['DUMMY_COL'] = df['DUMMY_COL'].astype('int64')
# replace the original col1 column with the int64-converted one
df['col1'] = df['DUMMY_COL']
df.drop('DUMMY_COL', axis=1, inplace=True)

Change dtype without NA values, or while reading DF in pandas?

I have a CSV whose df.head() looks like this:
          marker_name  ars120_pos snp_bs  ars120_chrn
0  ARS-BFGL-BAC-10172   5342658.0  [A/G]          2.0
1   ARS-BFGL-BAC-1020   6889656.0  [T/C]         14.0
2  ARS-BFGL-BAC-10245          NA  [T/C]         14.0
3  ARS-BFGL-BAC-10345   5105727.0  [A/C]         14.0
4  ARS-BFGL-BAC-10365  25323952.0  [A/C]           NA
The DataFrame has a few million rows. I want to change the datatype of those floats to int32.
I tried:
ARS1_2 = ARS1_2.astype({'marker_name':'str','ars120_pos':'int32','snp_bs':'str','ars120_chrn':'int32'})
But I got
ValueError: Cannot convert non-finite values (NA or inf) to integer
If I think about it properly, it means that I can't convert NA to an integer. And OK, I could drop the NAs, but the columns can also contain the X and Y chromosome symbols "X" and "Y" as strings. I know I could change those to ints, for example 99 and 98, but I want to avoid that.
So my question is:
What is the simplest method to change all the floats in a column to integers?
I tried something like
if type(value) in col == float:
    value.as_int
(it's pseudocode of course, I don't remember the exact code), but that didn't work either, and it's just playing with a regular if. Maybe I can do it better and more simply in pandas?
I looked for similar posts on SO but didn't find anything that fits, except the line above.
To change a float column to an integer column, use this:
df[col] = df[col].astype(pd.Int32Dtype())  # for a single column; put the column name in place of col
If you want to go through all columns at once:
for col in df.columns:
    if df[col].dtype == np.float64:
        df[col] = df[col].astype(pd.Int32Dtype())
To check the types of the columns:
df.dtypes
Output:
marker_name    object
ars120_pos      Int32
snp_bs         object
ars120_chrn     Int32
dtype: object
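To see that the nullable Int32 dtype really carries the missing values through, here is a minimal sketch with values like the ones in the question (requires pandas >= 0.24):

import pandas as pd

s = pd.Series([5342658.0, None, 25323952.0])
print(s.astype(pd.Int32Dtype()))
# 0     5342658
# 1        <NA>
# 2    25323952
# dtype: Int32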

python dataframe.at assignment with datatype change

Please look at the code and output below.
May I know why the data types in the *_state columns are float instead of int, and how to cast those data types to int?
Thanks,
Code
print(df_test)
for idx, row in df_test.iterrows():
    print(type(row['value']))
    df_test.at[idx, row['name'] + '_state'] = row['value']
print(df_test)
Output
        Message   name  value
0  Door_Started   Door      1
1    Light_open  Light      1
<type 'int'>
<type 'int'>
        Message   name  value  Door_state  Light_state
0  Door_Started   Door      1         1.0          NaN
1    Light_open  Light      1         NaN          1.0
You are only assigning an integer to a single column, row['name'] + '_state'. This causes, for any given index, NaN values to appear in the other column(s).
NaN is considered float (see here why), so a mixture of int and NaN values will always be upcast to float¹, for any given series. You can check this for yourself:
type(np.nan)  # float
This usually does not break subsequent manipulations / calculations, and it is efficient to keep your series float. Converting such a series to int is not possible, and workarounds are inefficient. Therefore, I advise you to do nothing.
¹ This accommodative behaviour is described in the docs:
Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.
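A two-line demonstration of that upcast at the Series level:

import numpy as np
import pandas as pd

print(pd.Series([1, 2]).dtype)       # int64
print(pd.Series([1, np.nan]).dtype)  # float64 -- the NaN forces the upcast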
Use this after the code:
pd.options.display.float_format = '{:,.0f}'.format
print(df)
@jpp is correct there. This will just change the display, so you see 1 printed instead of 1.0.
Also, if you use this solution, make sure you read about pd.reset_option too: https://pandas.pydata.org/pandas-docs/stable/options.html
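For example, a short sketch of setting and then restoring the display option:

import pandas as pd

pd.options.display.float_format = '{:,.0f}'.format
print(pd.Series([1.0]))                  # displays 1 instead of 1.0
pd.reset_option('display.float_format')  # restore the default display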

pandas to_numeric errors='coerce' doesn't coerce when number outside int64

I am designing a workflow which cleans up very messy data submitted by third parties, and I'm running into an issue with numeric conversion. Specifically, I'm using pandas.to_numeric to take data which is received and stored as text, and to test whether or not it contains valid numbers. (Yes, I know it would be easier to sanitize user input earlier, but that is unfortunately not a possibility in this situation.)
The issue I'm running into is that pandas.to_numeric seems to fail silently when it encounters an integer outside of the int64 range (roughly +/- 2^63). Is this expected behavior?
If it is expected behavior, is there a way to work around it programmatically?
I've found that it correctly coerces non-integer numeric values outside of that restriction, but not integers.
Here's a minimal example of just the problematic component:
import pandas

# when converting text representations of numbers to
# numbers, the conversion fails if the number is a very
# large integer (beyond the int64 range)
integer_version_success = pandas.to_numeric(
    pandas.Series(
        name='values',
        index=range(2),
        # 9223372036854775807 is exactly the int64 maximum (2**63 - 1)
        data=['9223372036854775807', '.50001']),
    errors='coerce')
print(integer_version_success)
# 0    9.223372e+18
# 1    5.000100e-01
# Name: values, dtype: float64

integer_version_failure = pandas.to_numeric(
    pandas.Series(
        name='values',
        index=range(2),
        # one digit longer, outside the int64 range
        data=['92233720368547758071', '.50001']),
    errors='coerce')
print(integer_version_failure)
# 0    92233720368547758071
# 1                  .50001
# Name: values, dtype: object
# | bad, leads to unexpected results

# when converting text representations of numbers to numbers,
# the conversion succeeds if the number is already represented
# as a number (non-int), regardless of whether it's larger than int64
numeric_version_success = pandas.to_numeric(
    pandas.Series(
        name='values',
        index=range(2),
        # one digit longer, outside the int64 range
        data=['92233720368547758071.0', '.50001']),
    errors='coerce')
print(numeric_version_success)
# 0    9.223372e+19
# 1    5.000100e-01
# Name: values, dtype: float64
# | putting a decimal point in the string makes coercion succeed
This is due to an issue in pandas 0.20.3, which is resolved in 0.22.0.
The documentation does not say coerce will raise an exception; it will set values that cannot be parsed to NaN, which you can later remove if wanted. If you want an exception instead, change errors to 'raise', or leave it unset, since raise is the default.
With pandas 0.22.0:
0    9.223372e+18
1    5.000100e-01
Name: values, dtype: float64
0             NaN
1         0.50001
Name: values, dtype: float64
0    9.223372e+19
1    5.000100e-01
Name: values, dtype: float64
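If upgrading is not an option, one possible workaround (my own sketch, not part of the original answer) is to coerce element by element with Python's float, which happily parses arbitrarily large integer strings:

import pandas as pd

def to_float_or_nan(series):
    # element-wise fallback for old pandas versions where to_numeric
    # leaves out-of-range integer strings untouched
    def convert(value):
        try:
            return float(value)
        except (TypeError, ValueError):
            return float('nan')
    return series.map(convert).astype('float64')

s = pd.Series(['92233720368547758071', '.50001', 'not a number'])
print(to_float_or_nan(s))
# 0    9.223372e+19
# 1    5.000100e-01
# 2             NaN
# dtype: float64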

Pandas Shift Converts Ints to Float AND Rounds

When shifting a column of integers, I know how to fix my column when pandas automatically converts the integers to floats because of the presence of a NaN.
I basically use the method described here.
However, if the shift introduces a NaN, thereby converting all integers to floats, some rounding happens (e.g. on epoch timestamps), so even recasting back to integer doesn't reproduce the original values.
Any way to fix this?
Example Data:
pd.DataFrame({'epochee': [1495571400259317500, 1495571400260585120, 1495571400260757200, 1495571400260866800]})
Out[19]:
               epochee
0  1495571400259317500
1  1495571400260585120
2  1495571400260757200
3  1495571400260866800
Example Code:
df['prior_epochee'] = df['epochee'].shift(1)
df.dropna(axis=0, how='any', inplace=True)
df['prior_epochee'] = df['prior_epochee'].astype(int)
Resulting output:
Out[22]:
               epochee        prior_epochee
1  1495571400260585120  1495571400259317504
2  1495571400260757200  1495571400260585216
3  1495571400260866800  1495571400260757248
Because you know what happens when int is cast to float due to np.nan, and because you don't want the np.nan rows anyway, you can do the shift yourself with numpy:
df[1:].assign(prior_epochee=df.epochee.values[:-1])
               epochee        prior_epochee
1  1495571400260585120  1495571400259317500
2  1495571400260757200  1495571400260585120
3  1495571400260866800  1495571400260757200
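On pandas 0.24 or later, a nullable integer dtype is another way around the problem, since the column then never falls back to float64 (a sketch, assuming that version is available):

import pandas as pd

df = pd.DataFrame({'epochee': [1495571400259317500, 1495571400260585120,
                               1495571400260757200, 1495571400260866800]})
# nullable Int64 can hold <NA> without upcasting to float64,
# so the shifted values are never rounded
df['prior_epochee'] = df['epochee'].astype('Int64').shift(1)
print(df.dropna())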
