When shifting a column of integers, I know how to fix my column when Pandas automatically converts the integers to floats because of the presence of a NaN.
I basically use the method described here.
However, if the shift introduces a NaN, thereby converting all integers to floats, some rounding happens (e.g. on epoch timestamps), so even recasting back to integer doesn't reproduce the original values.
Any way to fix this?
Example Data:
pd.DataFrame({'epoch': [1495571400259317500, 1495571400260585120, 1495571400260757200, 1495571400260866800]})
Out[19]:
                 epoch
0  1495571400259317500
1  1495571400260585120
2  1495571400260757200
3  1495571400260866800
Example Code:
df['prior_epoch'] = df['epoch'].shift(1)
df.dropna(axis=0, how='any', inplace=True)
df['prior_epoch'] = df['prior_epoch'].astype(int)
Resulting output:
Out[22]:
                 epoch          prior_epoch
1  1495571400260585120  1495571400259317504
2  1495571400260757200  1495571400260585216
3  1495571400260866800  1495571400260757248
Because you know what happens when an int column is cast to float due to np.nan, and you know that you don't want the np.nan rows anyway, you can do the shift yourself with numpy:
df[1:].assign(prior_epoch=df.epoch.values[:-1])
epoch prior_epoch
1 1495571400260585120 1495571400259317500
2 1495571400260757200 1495571400260585120
3 1495571400260866800 1495571400260757200
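The precision loss itself is easy to demonstrate (a minimal check, not part of the original answer): float64 keeps a 53-bit significand, while these 19-digit nanosecond epochs need about 61 bits.

import numpy as np

x = 1495571400259317500
# the low bits fall outside float64's 53-bit significand and are rounded away
print(int(np.float64(x)))  # 1495571400259317504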
My task is to read data from Excel into a dataframe. The data is a bit messy, and to clean it up I've done:
import re
import pandas as pd

df_1 = pd.read_excel(offers[0])
# map the Russian column headers to short English names
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]': 'good_name',
                            'Штрихкод': 'barcode',
                            'Цена шт. руб.': 'price',
                            'Остаток': 'balance'
                            })
df_1 = df_1[new_columns]
# I don't know why, but without replacing NaN with another char the code doesn't work
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()
It returns the barcode column with dtype float64 (why?):
0 0.000000e+00
1 7.613037e+12
2 7.613037e+12
3 7.613034e+12
4 7.613035e+12
Name: barcode, dtype: float64
Then I try to convert that column to integer.
df_1.barcode = df_1.barcode.astype(int)
But I keep getting silly negative numbers.
df_1.barcode[0:5]
0 0
1 -2147483648
2 -2147483648
3 -2147483648
4 -2147483648
Name: barcode, dtype: int32
Thanks to @Will and @micric, I eventually got a solution.
df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replacing NaN with '0' so the column can later be converted to an integer dtype
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')
Summary:
pd.to_numeric represents NaN as float64, so for a column with both NaN and non-NaN values we should expect dtype float64.
Check the size of the numbers you're dealing with: int32 has its limits, namely the range -2**31 to 2**31 - 1 (i.e. a maximum of 2147483647).
Thanks a lot for your help, guys!
That number is the 32-bit lower limit. Your number is outside the int32 range you are trying to use, so the cast overflows and hands you back the limit (notice that 2**32 = 4294967296; divided by 2 that is 2147483648, the magnitude of your number).
You should use astype('int64') instead.
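To make the overflow visible, a minimal sketch (the barcode value is assumed; the wrapped result is platform-dependent, and newer NumPy also emits a RuntimeWarning for it):

import pandas as pd

s = pd.Series([7613037000000.0])  # a 13-digit barcode, too big for int32
print(s.astype('int32'))  # overflows, commonly to -2147483648
print(s.astype('int64'))  # 7613037000000, as intended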
I ran into the same problem as the OP; using astype(np.int64) solved mine.
I like this solution because it's consistent with my usual way of changing the type of a pandas column; maybe someone could check the performance of these solutions.
Many questions in one. So, your expected dtype:
pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
pd.to_numeric with downcast='integer' would give you an integer; however, you have NaNs in your data, and pandas needs a float64 type to represent NaN.
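A small demonstration of both behaviours (toy data, not from the original thread):

import numpy as np
import pandas as pd

# clean values: downcast='integer' picks the smallest integer dtype
print(pd.to_numeric(pd.Series(['1', '2']), downcast='integer').dtype)     # int8
# one NaN: the result must be float64, so the integer downcast is skipped
print(pd.to_numeric(pd.Series([1.0, np.nan]), downcast='integer').dtype)  # float64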
I have a Pandas dataframe with 4 rows, and one of the columns (named limit) contains floating point values, where any zeros must be replaced with 9999999999 (9.999999999 billion). The column is set to the float32 data type, and I use the pandas.DataFrame.where method to do the replacement. But it's not working as expected, because Numpy rounds 9999999999 up to 10000000000 (10 billion).
I've tried this in IPython 3 (Python 3.6.8), Pandas version 0.24.2, Numpy version 1.14.0.
This is the replacement statement:
df['limit'] = df['limit'].where(df['limit'] != 0, 9999999999)
I'm seeing the following column values for limit:
0 1.000000e+10
1 1.000000e+10
2 1.000000e+10
3 1.000000e+10
but I'm expecting
0 9999999999.0
1 9999999999.0
2 9999999999.0
3 9999999999.0
Why does the rounding up happen? It doesn't happen with plain Python (whose floats are double precision):
In [1]: (9.999999999) * 10**9
Out[1]: 9999999999.0
This is simply because a 32-bit type is not capable of preserving that number. You can check this by calculating the number of bits needed to represent it:
In [24]: np.floor(np.log2(9999999999)) + 1
Out[24]: 34.0
As you can see, you need at least 34 bits to represent that number, while float32 carries only a 24-bit significand, so it rounds the value to the nearest representable one, 1e10. Therefore you should use a larger data type, such as float64 or int64.
Even if you test this by putting the number in an int32 series, you'll see the unexpected result (overflow) again:
In [25]: s = pd.Series([9999999999], dtype=np.int32)
In [26]: s
Out[26]:
0    1410065407
dtype: int32
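The rounding itself can be reproduced without pandas (a minimal check, not part of the original answer):

import numpy as np

# 9999999999 needs 34 bits, but float32 stores a 24-bit significand,
# so the nearest representable value is exactly 1e10
print(np.float32(9999999999))  # 1e+10
print(np.float64(9999999999))  # 9999999999.0 -- float64's 53 bits suffice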
Please look at the code and output below.
May I know why the data type in the *_state columns is float instead of int, and how to cast those columns to int?
Thanks,
Code
import pandas as pd

# the construction of df_test is assumed here, inferred from the output below
df_test = pd.DataFrame({'Message': ['Door_Started', 'Light_open'],
                        'name': ['Door', 'Light'], 'value': [1, 1]})
print(df_test)
for idx, row in df_test.iterrows():
    print(type(row['value']))
    df_test.at[idx, row['name'] + '_state'] = row['value']
print(df_test)
Output
Message name value
0 Door_Started Door 1
1 Light_open Light 1
<type 'int'>
<type 'int'>
Message name value Door_state Light_state
0 Door_Started Door 1 1.0 NaN
1 Light_open Light 1 NaN 1.0
You are only assigning an integer to a single column row['name'] + '_state'. This causes, for any given index, NaN values to appear in other column(s).
NaN is considered float (see here why), so a mixture of int and NaN values will always be upcast to float¹, for any given series. You can check this for yourself:

import numpy as np
type(np.nan)  # float
This usually does not break subsequent manipulations / calculations, and it is efficient to keep your series float. Converting such a series to int is not possible and workarounds are inefficient. Therefore, I advise you do nothing.
¹ This accommodative behaviour is described in the docs:

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.
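You can reproduce the upcast in isolation (a toy example, not from the original answer):

import numpy as np
import pandas as pd

# int values plus a NaN hole: the whole series becomes float64
print(pd.Series([1, np.nan]).dtype)  # float64
# without the hole it stays integer
print(pd.Series([1, 1]).dtype)       # int64 (int32 on some platforms)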
Use this after the code:

pd.options.display.float_format = '{:,.0f}'.format
print(df)

@jpp is correct there. This will just change the display, so you print 1 instead of 1.0.
Also, if you use this solution, make sure you read about pd.reset_option too: https://pandas.pydata.org/pandas-docs/stable/options.html
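A short usage sketch (assumed data; the option only changes formatting, not the dtype):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Door_state': [1.0, np.nan]})
pd.options.display.float_format = '{:,.0f}'.format
print(df)                                # shows 1, but the dtype is still float64
pd.reset_option('display.float_format')  # back to the default display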
I've got a pandas DataFrame with a float (decimal) index which I use to look up values (similar to a dictionary). As floats are not exactly the value they are supposed to be, I multiplied everything by a power of ten and converted the result to integers with .astype(int) before setting it as the index. However, this seems to do a floor instead of rounding: 1.999999999999999992 is converted to 1 instead of 2. Rounding with the pandas.DataFrame.round() method beforehand does not avoid this problem, as the values are still stored as floats.
The original idea (which obviously raises a KeyError) was this:
import numpy as np
import pandas as pd

idx = np.arange(1, 3, 0.001)
s = pd.Series(range(2000))
s.index = idx
print(s[2.022])  # KeyError -- 2.022 is not exactly among the float index values
Trying with a conversion to integers:

idx_int = idx * 1000
idx_int = idx_int.astype(int)
s.index = idx_int
for i in range(1000, 3000):
    print(s[i])

The output is always a bit random, as the 'real' value of an integer can be slightly above or below the wanted value. In this case the index contains the value 1000 twice and does not contain the value 2999.
You are right, astype(int) does a conversion toward zero:
‘integer’ or ‘signed’: smallest signed int dtype
from the pandas.to_numeric documentation (which is linked from astype() for numeric conversions).
If you want to round, you need to do a float round, and then convert to int:
df.round(0).astype(int)
Use other rounding functions, according to your needs.
"the output is always a bit random as the 'real' value of an integer can be slightly above or below the wanted value"

Floats can represent whole numbers exactly (up to 2**53 for float64), which makes the conversion after round(0) lossless and non-risky; check here for details.
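Putting this together for the question's index (a sketch of the assumed intent):

import numpy as np
import pandas as pd

idx = np.arange(1, 3, 0.001)
s = pd.Series(range(2000))
# round first, then cast, so truncation cannot land on the wrong integer
s.index = (idx * 1000).round().astype(int)
print(s[2022])  # 1022 -- the lookup that failed with the float index now works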
If I understand right, you could just perform the rounding operation followed by converting it to an integer:
import pandas as pd

s1 = pd.Series([1.2, 2.9])
s1 = s1.round().astype(int)
Which gives the output:
0 1
1 3
dtype: int32
In case the data frame contains both numeric and non-numeric values, and you only want to touch the numeric fields:
df = df.applymap(lambda x: int(round(x, 0)) if isinstance(x, (int, float)) else x)
There is a chance that NA values (a float type) exist in the dataframe, so an alternative solution is: df.fillna(0).astype('int')
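Combining the two ideas (toy data; filling with 0 is an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.2, np.nan, 2.9]})
# fill the NaN first (astype(int) fails on NaN), round, then cast
print(df.fillna(0).round().astype('int'))  # 1, 0, 3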
How does Pandas store floats for comparison's sake? I ran a simple check for a value and it returned what I expected, but the result is not the same as my query / comparison:
Why aren't the values of each time epoch the same?
I tried rerunning this by first casting the column as int, but then the comparison brought up nothing.
Your floats are nanoseconds since the epoch, so to convert, try this:
Code:
df.time = df.time.astype('datetime64[ns]')
Test Code:
import pandas as pd

df = pd.DataFrame([1484314274417920512., 1484314274417620224.],
                  columns=['time'])
print(df)
df.time = df.time.astype('datetime64[ns]')
print(df)
Results:
time
0 1.484314e+18
1 1.484314e+18
time
0 2017-01-13 13:31:14.417920512
1 2017-01-13 13:31:14.417620224
But: the problem likely came about when you converted from the original data source. Converting the int64 to float64 has already lost some precision, so just converting it to nanoseconds could very well still not do what you need. Some things that could be done:
Perform the original conversion directly to int64, so as not to lose precision.
If nanoseconds are not needed, round the timestamps to microseconds or milliseconds, as in the sketch below.
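A sketch of the second option, reusing the answer's example data:

import pandas as pd

df = pd.DataFrame([1484314274417920512., 1484314274417620224.],
                  columns=['time'])
# convert, then round away the sub-millisecond digits that the earlier
# float conversion may have corrupted
df.time = df.time.astype('datetime64[ns]').dt.round('ms')
print(df)  # both rows become 2017-01-13 13:31:14.418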