How does Pandas store floats for comparison's sake? I ran a simple check for a value and it returned what I expected, but the result is not the same as my query / comparison:
Why aren't the values of each time epoch the same?
I tried rerunning this by first casting the column as int but then the comparison brought up nothing.
Your floats are in nano-seconds since epoch, so to convert try this:
Code:
df.time = df.time.astype('datetime64[ns]')
Test Code:
df = pd.DataFrame([1484314274417920512., 1484314274417620224.],
columns=['time'])
print(df)
df.time = df.time.astype('datetime64[ns]')
print(df)
Results:
time
0 1.484314e+18
1 1.484314e+18
time
0 2017-01-13 13:31:14.417920512
1 2017-01-13 13:31:14.417620224
But:
The problem likely came about when you converted from the original data source. Converting the int64 to float64 has already lost some precision, so just converting to nanoseconds may well still not do what you need. Some things that could be done:
Perform the original conversion directly to int64 so as not to lose precision.
If nano-seconds are not needed then round the timestamps to micro-seconds or milli-seconds.
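To make the precision point concrete, here is a minimal sketch. The sample integers are made up for illustration (each sits one nanosecond above a float value from the test code). It shows that a float64 near 1.5e18 can only land on multiples of 256, and that truncating to microseconds is one possible way to compare timestamps when nanoseconds are not needed:

```python
import numpy as np
import pandas as pd

# Hypothetical nanosecond epochs, stored exactly as int64.
ns = np.array([1484314274417920513, 1484314274417620225], dtype="int64")

# Round-tripping through float64 snaps each value to a multiple of 256,
# because float64 carries only 53 bits of mantissa.
round_trip = ns.astype("float64").astype("int64")
lost = ns - round_trip
print(lost)  # nanoseconds lost per value

# Converting straight from int64 keeps every nanosecond.
exact = pd.to_datetime(ns, unit="ns")

# If nanoseconds are not needed, truncate to microseconds before comparing.
truncated = exact.floor("us")
print(truncated)
```

Whether floor or round is the better choice depends on how the values being compared were produced.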
Related
My task is to read data from excel to dataframe. The data is a bit messy and to clean that up I've done:
df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]':'good_name',
'Штрихкод':'barcode',
'Цена шт. руб.':'price',
'Остаток': 'balance'
})
df_1 = df_1[new_columns]
# I don't know why, but without replacing NaN with another character the code doesn't work
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()
It returns column barcode with type float64 (why so?)
0 0.000000e+00
1 7.613037e+12
2 7.613037e+12
3 7.613034e+12
4 7.613035e+12
Name: barcode, dtype: float64
Then I try to convert that column to integer.
df_1.barcode = df_1.barcode.astype(int)
But I keep getting silly negative numbers.
df_1.barcode[0:5]
0 0
1 -2147483648
2 -2147483648
3 -2147483648
4 -2147483648
Name: barcode, dtype: int32
Thanks to @Will and @micric, I eventually got a solution.
df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replacing NaN with 0, it'll help to convert the column explicitly to dtype integer
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')
Summary:
pd.to_numeric returns float64 when the column contains NaN. As a result, for a column with
both NaN and non-NaN values we should expect dtype float64.
Check the size of the numbers you're dealing with. int32 has its limits: it can only hold
values from -2**31 = -2147483648 to 2**31 - 1 = 2147483647.
Thanks a lot for your help, guys!
That number is the lower limit of a 32-bit integer. Your number is outside the int32 range you are trying to use, so you get the limit back instead (notice that 2**32 = 4294967296, and half of that is 2147483648, which is your number without the sign).
You should use astype('int64') instead.
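A small sketch of the range check, using made-up barcode values of about the same size as the ones in the question (not the OP's actual data):

```python
import numpy as np
import pandas as pd

# Made-up barcode values stored as float64, similar in size to the question's.
barcodes = pd.Series([0.0, 7613037000000.0])

# int32 can only hold values from -2**31 to 2**31 - 1.
int32_max = np.iinfo(np.int32).max
too_big = barcodes > int32_max
print(int32_max, too_big.tolist())

# int64 has plenty of room for 13-digit barcodes.
as_int64 = barcodes.astype("int64")
print(as_int64)
```

Naming the dtype explicitly avoids depending on what plain int maps to on a given platform.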
I ran into the same problem as OP, using
astype(np.int64)
solved mine, see the link here.
I like this solution because it's consistent with my habit of changing the column type of pandas column, maybe someone could check the performance of these solutions.
Many questions in one.
So your expected dtype...
pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
pd.to_numeric downcast to integer would give you an integer, however, you have NaNs in your data and pandas needs to use a float64 type to represent NaNs
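As an illustration of the NaN point, here is a sketch on made-up data. It also shows the nullable "Int64" dtype (capital I), which is one possible way to keep integers alongside missing values; that dtype is not something the answers above use, so treat it as an alternative to try rather than the accepted fix:

```python
import pandas as pd

# Made-up barcode strings with one missing value.
raw = pd.Series(["7613037000000", None])

# The missing value becomes NaN, which forces float64.
as_float = pd.to_numeric(raw)
print(as_float.dtype)

# The nullable integer dtype stores the gap as <NA> and keeps the rest as integers.
as_nullable = as_float.astype("Int64")
print(as_nullable)
```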
From the official documentation of pandas.to_datetime we can say,
unit : string, default ‘ns’
unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or
float number. This will be based off the origin. Example, with
unit=’ms’ and origin=’unix’ (the default), this would calculate the
number of milliseconds to the unix epoch start.
So when I try like this way,
import pandas as pd
df = pd.DataFrame({'time': [pd.to_datetime('2019-01-15 13:25:43')]})
df_unix_sec = pd.to_datetime(df['time'], unit='ms', origin='unix')
print(df)
print(df_unix_sec)
time
0 2019-01-15 13:25:43
0 2019-01-15 13:25:43
Name: time, dtype: datetime64[ns]
Output is not changing for the latter one. Every time it is showing the datetime value not number of milliseconds to the unix epoch start for the 2nd one. Why is that? Am I missing something?
I think you misunderstood what the argument is for. The purpose of origin='unix' is to convert an integer timestamp to datetime, not the other way.
pd.to_datetime(1.547559e+09, unit='s', origin='unix')
# Timestamp('2019-01-15 13:30:00')
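To show the direction in which unit and origin work, a short sketch that feeds the same instant in as seconds and as milliseconds (the integer values are just the example timestamp above, rescaled):

```python
import pandas as pd

# Integer counts since the Unix epoch become Timestamps; unit says how to read them.
from_seconds = pd.to_datetime(1547559000, unit="s")
from_millis = pd.to_datetime(1547559000000, unit="ms")

print(from_seconds)
print(from_millis)
```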
Here are some options:
Option 1: integer division
Conversely, you can get the timestamp by converting to integer (to get nanoseconds) and dividing by 10**9.
pd.to_datetime(['2019-01-15 13:30:00']).astype('int64') / 10**9
# Float64Index([1547559000.0], dtype='float64')
Pros:
super fast
Cons:
makes assumptions about how pandas internally stores dates
Option 2: recommended by pandas
Pandas docs recommend using the following method:
# create test data
dates = pd.to_datetime(['2019-01-15 13:30:00'])
# calculate unix datetime
(dates - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
[out]:
Int64Index([1547559000], dtype='int64')
Pros:
"idiomatic", recommended by the library
Cons:
unwieldy
not as performant as integer division
Option 3: pd.Timestamp
If you have a single date string, you can use pd.Timestamp as shown in the other answer:
pd.Timestamp('2019-01-15 13:30:00').timestamp()
# 1547559000.0
If you have to coerce multiple datetimes (where pd.to_datetime is your only option), you can initialize and map:
pd.to_datetime(['2019-01-15 13:30:00']).map(pd.Timestamp.timestamp)
# Float64Index([1547559000.0], dtype='float64')
Pros:
best method for a single datetime string
easy to remember
Cons:
not as performant as integer division
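A quick sanity check that options 2 and 3 agree, as a minimal sketch. The second date is added only for illustration, and option 1 is left out because it depends on the resolution pandas uses internally:

```python
import pandas as pd

dates = pd.to_datetime(["2019-01-15 13:30:00", "2019-01-15 13:25:43"])

# Option 2: subtract the epoch, then floor-divide by one second.
via_timedelta = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# Option 3: call Timestamp.timestamp() on each element (returns floats).
via_timestamp = dates.map(pd.Timestamp.timestamp)

print(list(via_timedelta))
print(list(via_timestamp))
```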
You can use timestamp() method which returns POSIX timestamp as float:
pd.Timestamp('2021-04-01').timestamp()
[Out]:
1617235200.0
pd.Timestamp('2021-04-01 00:02:35.234').timestamp()
[Out]:
1617235355.234
The value attribute of a pandas Timestamp holds the Unix epoch time in nanoseconds, so you can convert to microseconds or milliseconds by dividing by 1e3 or 1e6 respectively. Check the code below.
import pandas as pd
date_1 = pd.to_datetime('2020-07-18 18:50:00')
print(date_1.value)
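Spelled out a bit further, the conversions could look like this. It is only a sketch, and it uses integer division so the results stay integers rather than floats:

```python
import pandas as pd

date_1 = pd.to_datetime("2020-07-18 18:50:00")

ns = date_1.value   # nanoseconds since the Unix epoch
us = ns // 10**3    # microseconds
ms = ns // 10**6    # milliseconds
s = ns // 10**9     # seconds

print(ns, us, ms, s)
```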
When you calculate the difference between two datetimes, the dtype of the difference is timedelta64[ns] by default (ns in brackets). By changing [ns] into [ms], [s], [m] etc as you cast the output to a new timedelta64 object, you can convert the difference into milliseconds, seconds, minutes etc.
For example, to find the number of seconds passed since Unix epoch, subtract datetimes and change dtype.
df_unix_sec = (df['time'] - pd.Timestamp('1970-01-01')).astype('timedelta64[s]')
N.B. Oftentimes, the differences are very large numbers, so if you want them as integers, use astype('int64') (NOT astype(int)).
df_unix_sec = (df['time'] - pd.Timestamp('1970-01-01')).astype('timedelta64[s]').astype('int64')
For OP's example, this would yield,
0 1547558743
Name: time, dtype: int64
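If casting between timedelta64 units feels opaque, a possible alternative is to go through total_seconds() or to floor-divide by a Timedelta. This is a sketch on the OP's single-row frame, not a claim that it is faster or more idiomatic:

```python
import pandas as pd

df = pd.DataFrame({"time": [pd.to_datetime("2019-01-15 13:25:43")]})

# Difference from the epoch is a timedelta Series.
elapsed = df["time"] - pd.Timestamp("1970-01-01")

# Whole seconds, stored as int64 (not plain int).
df_unix_sec = elapsed.dt.total_seconds().astype("int64")

# Whole milliseconds, by floor-dividing by a one-millisecond Timedelta.
df_unix_ms = (elapsed // pd.Timedelta("1ms")).astype("int64")

print(df_unix_sec)
print(df_unix_ms)
```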
In case you are accessing a particular datetime64 object from the dataframe, chances are that pandas will return a Timestamp object which is essentially how pandas stores datetime64 objects.
You can use pd.Timestamp.to_datetime64() method of the pd.Timestamp object to convert it to numpy.datetime64 object with ns precision.
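For example, a minimal sketch with a one-row frame built just for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(["2019-01-15 13:25:43"])})

element = df["time"].iloc[0]         # pandas hands back a Timestamp
as_numpy = element.to_datetime64()   # explicit numpy.datetime64 scalar

print(type(element).__name__, type(as_numpy).__name__)
```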
Hi, I need help with datetime.
I have extracted minute and second values in the form mm:ss, e.g. 23:50,
but now I need to convert them to '%H:%M:%S' format. The column has dtype('O'), and the code below gives an error. What should I do?
df_raw['Time-only'] = pd.to_datetime(df_raw['time2'], format='%H:%M:%S').dt.time
First, to_datetime is used to convert a string in a given format into a datetime. For example, if I had a string like '2018-04-10 12:37:51.252':
df_raw['time2'] = pd.to_datetime(df_raw['time2'], format='%Y-%m-%d %H:%M:%S.%f')
doing the above will simply change the data types. You can verify them by doing
df_raw.dtypes
Now, to answer your question I believe you are trying to convert timestamp mm:ss to hh:mm:ss? If not, can you be more specific?
from datetime import datetime
df_raw['Time-only'] = list(df_raw['time2'].map(lambda f: datetime.strftime(f, "%H:%M:%S")))
This piece of code will change the format of your timestamp to 12:37:51.
I believe you need to convert the values to timedeltas so that later vectorized processing, such as subtracting or summing, is possible:
df_raw = pd.DataFrame({'time2':['23:50','15:23']})
df_raw['Time-only'] = pd.to_timedelta('00:' + df_raw['time2'])
print (df_raw)
time2 Time-only
0 23:50 0 days 00:23:50
1 15:23 0 days 00:15:23
If you convert to times instead, vectorized operations are not possible, because you get Python time objects.
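To illustrate what the timedelta version buys you, here is a short sketch of a sum and a per-row conversion to seconds on the same two sample values:

```python
import pandas as pd

df_raw = pd.DataFrame({"time2": ["23:50", "15:23"]})

# Prefix the hours so that "mm:ss" is read as "hh:mm:ss".
durations = pd.to_timedelta("00:" + df_raw["time2"])

total = durations.sum()                  # one Timedelta for the whole column
seconds = durations.dt.total_seconds()   # per-row length in seconds

print(total)
print(seconds.tolist())
```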
When shifting column of integers, I know how to fix my column when Pandas automatically converts the integers to floats because of the presence of a NaN.
I basically use the method described here.
However, if the shift introduces a NaN thereby converting all integers to floats, there's some rounding that happens (e.g. on epoch timestamps) so even recasting it back to integer doesn't replicate what it was originally.
Any way to fix this?
Example Data:
pd.DataFrame({'epochee':[1495571400259317500,1495571400260585120,1495571400260757200, 1495571400260866800]})
Out[19]:
epoch
0 1495571790919317503
1 1495999999999999999
2 1495571400265555555
3 1495571400267777777
Example Code:
df['prior_epochee'] = df['epochee'].shift(1)
df.dropna(axis=0, how='any', inplace=True)
df['prior_epochee'] = df['prior_epochee'].astype(int)
Resulting output:
Out[22]:
epoch prior_epoch
1 1444444444444444444 1400000000000000000
2 1433333333333333333 1490000000000000000
3 1777777777777777777 1499999999999999948
Because you know what happens when an int is cast to float due to np.nan, and you know that you don't want the np.nan rows anyway, you can do the shift yourself with numpy:
df[1:].assign(prior_epoch=df.epoch.values[:-1])
epoch prior_epoch
1 1495571400260585120 1495571400259317500
2 1495571400260757200 1495571400260585120
3 1495571400260866800 1495571400260757200
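Another route that might be worth trying is the nullable "Int64" dtype, which can hold a missing value without switching to float. This is a sketch under that assumption and is not part of the answer above; the column names follow the answer's output:

```python
import pandas as pd

df = pd.DataFrame({"epoch": [1495571400259317500, 1495571400260585120,
                             1495571400260757200, 1495571400260866800]})

# Nullable Int64 stores the gap left by shift() as <NA>, not as a float NaN,
# so the remaining integers are never rounded.
df["prior_epoch"] = df["epoch"].astype("Int64").shift(1)

# Drop the first row, then go back to a plain int64 column.
out = df.dropna().astype({"prior_epoch": "int64"})
print(out)
```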
I am confused how pandas blew out of bounds for datetime objects with these lines:
import pandas as pd
BOMoffset = pd.tseries.offsets.MonthBegin()
# here some code sets the all_treatments dataframe and the newrowix, micolix, mocolix counters
all_treatments.iloc[newrowix,micolix] = BOMoffset.rollforward(all_treatments.iloc[i,micolix] + pd.tseries.offsets.DateOffset(months = x))
all_treatments.iloc[newrowix,mocolix] = BOMoffset.rollforward(all_treatments.iloc[newrowix,micolix]+ pd.tseries.offsets.DateOffset(months = 1))
Here all_treatments.iloc[i,micolix] is a datetime set by pd.to_datetime(all_treatments['INDATUMA'], errors='coerce',format='%Y%m%d'), and INDATUMA is date information in the format 20070125.
This logic seems to work on mock data (no errors, dates make sense), so at the moment I cannot reproduce while it fails in my entire data with the following error:
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
In [54]: pd.Timestamp.min
Out[54]: Timestamp('1677-09-22 00:12:43.145225')
In [55]: pd.Timestamp.max
Out[55]: Timestamp('2262-04-11 23:47:16.854775807')
Your value, 2262-05-01 00:00:00, is outside this range, hence the out-of-bounds error.
Straight out of: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations
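The 584-year figure can be checked with a bit of arithmetic. This sketch assumes an average Gregorian year of 365.2425 days and a signed 64-bit counter centred on 1970:

```python
# Average Gregorian year expressed in nanoseconds (assumed: 365.2425 days).
NS_PER_YEAR = 365.2425 * 24 * 60 * 60 * 10**9

# A 64-bit integer has 2**64 distinct values, one per nanosecond.
span_years = 2**64 / NS_PER_YEAR

# Signed storage puts half of the span on each side of 1970.
half_span_years = 2**63 / NS_PER_YEAR

print(round(span_years, 1), round(half_span_years, 1))
print(int(1970 - half_span_years), int(1970 + half_span_years))
```

The two years printed at the end line up with the years of pd.Timestamp.min and pd.Timestamp.max shown above.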
Workaround:
This will force the dates which are outside the bounds to NaT
pd.to_datetime(date_col_to_force, errors = 'coerce')
Setting the errors parameter in pd.to_datetime to 'coerce' causes replacement of out of bounds values with NaT. Quoting the docs:
If ‘coerce’, then invalid parsing will be set as NaT
E.g.:
datetime_variable = pd.to_datetime(datetime_variable, errors = 'coerce')
This does not fix the data (obviously), but still allows processing the non-NaT data points.
The reason you are seeing the error message
"OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-12-23 00:00:00" is that the pandas Timestamp data type stores dates in nanosecond resolution (from the docs).
This means the date values have to be in the range
pd.Timestamp.min (1677-09-21 00:12:43.145225) to
pd.Timestamp.max (2262-04-11 23:47:16.854775807).
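Even if you only want the date with a resolution of seconds or microseconds, pandas will still store it internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above-mentioned range. (This describes pandas before 2.0; later releases can also hold datetime64 data at second, millisecond, or microsecond resolution.)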
This is surprising, because databases like SQL Server and libraries like numpy allow dates beyond this range to be stored, and in most cases they also use at most 64 bits to store a date.
But here is the difference.
SQL Server stores dates in nanosecond resolution but only to an accuracy of 100 ns (as opposed to 1 ns in pandas). Since the space is limited (64 bits), it's a matter of range vs. accuracy. With the pandas Timestamp we have higher accuracy but a smaller date range.
In case of numpy (pandas is built on top of numpy) datetime64 data type,
if the date falls in the above mentioned range you can store
it in nanoseconds which is similar to pandas.
OR you can give up the nanosecond resolution and go with
microseconds which will give you a much larger range. This is something that is missing in pandas timestamp type.
However if you choose to store in nanoseconds and the date is outside the range then numpy will automatically wrap around this date and you might get unexpected results (referenced below in the 4th solution).
np.datetime64("3000-06-19T08:17:14.073456178", dtype="datetime64[ns]")
> numpy.datetime64('1831-05-11T09:08:06.654352946')
Now with pandas we have below options,
import pandas as pd
data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)
1) If you are OK with losing the data that is out of range, simply use the parameter below to convert out-of-range dates to NaT (not a time).
my_df['dob'] = pd.to_datetime(my_df['dob'], errors = 'coerce')
2) If you don't want to lose the data, you can convert the values into Python datetime objects. The column "dob" then has the pandas object dtype, but each individual value is a Python datetime. However, doing this loses the benefit of vectorized functions.
import datetime as dt
my_df['dob'] = my_df['dob'].apply(lambda x: dt.datetime.strptime(x,'%Y-%m-%dT%H:%M:%S') if type(x)==str else pd.NaT)
print(type(my_df.iloc[0, 1]))
> <class 'datetime.datetime'>
3) Another option is to use numpy instead of a pandas Series, if possible. With a pandas dataframe, you can convert a series (or a column in a df) to a numpy array, process the data separately, and then join it back to the dataframe.
4) We can also use pandas time spans (Period), as suggested in the docs. Do check out the difference between Timestamp and Period before using this data type. Date range and frequency here work similarly to numpy (mentioned above in the numpy section).
my_df['dob'] = my_df['dob'].apply(lambda x: pd.Period(x, freq='ms'))
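As a rough illustration of option 4, this sketch builds the periods from the same two sample strings and collects them in a PeriodIndex (my choice here, so that attributes such as year are available on the whole set at once):

```python
import pandas as pd

dobs = ["3000-06-19T08:17:14", "2000-06-19T21:17:14"]

# One Period per value at millisecond frequency; the year 3000 is fine here.
periods = pd.PeriodIndex([pd.Period(x, freq="ms") for x in dobs])

years = periods.year
print(years.tolist())

# Periods of the same frequency can be compared with each other.
print(periods[0] > periods[1])
```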
You can try with strptime() in datetime library along with lambda expression to convert text to date values in a series object:
Example:
df['F'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S') if type(x)==str else np.nan)
None of the above are that good, because they will delete your data. Instead, you can keep the data and adjust your conversion:
# converting from epoch to datetime while maintaining the nanosecond timestamp
xbarout = pd.to_datetime(xbarout.iloc[:,0], unit='ns')
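For instance, with a made-up one-column frame standing in for xbarout (the column name and values are assumptions, borrowed from the first thread above), the nanosecond component survives the conversion as long as the input is still int64:

```python
import pandas as pd

# Hypothetical stand-in for xbarout: nanosecond epochs kept as int64.
xbarout = pd.DataFrame({"epoch_ns": [1484314274417920512, 1484314274417620224]})

converted = pd.to_datetime(xbarout.iloc[:, 0], unit="ns")

# The sub-microsecond part of each timestamp is still there.
nanos = converted.dt.nanosecond
print(converted)
print(nanos.tolist())
```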