Pandas to_datetime() function performance issues - python

I have a df like this:
Dat
10/01/2016
11/01/2014
12/02/2013
The column 'Dat' has object dtype, so I am trying to convert it to datetime with the pandas to_datetime() function like this:
from functools import partial

to_datetime_rand = partial(pd.to_datetime, format='%m/%d/%Y')
df['DAT'] = df['DAT'].apply(to_datetime_rand)
Everything works well, but I run into performance issues when my df has more than 2 billion rows; in that case this method stalls and does not finish.
Does the pandas to_datetime() function have a way to do the conversion in chunks, or perhaps iteratively in a loop?
Thanks.

If performance is a concern, I would advise using the following function to convert those columns to datetime:
def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date) for date in s.unique()}
    return s.apply(lambda v: dates[v])
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
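For reference, a usage sketch against the question's column (assuming the 'DAT' column name from the question):
df['DAT'] = lookup(df['DAT'])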

UPDATE: This enhancement has been incorporated into pandas 0.23.0
cache : boolean, default False
If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate
date strings, especially ones with timezone offsets.
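For example, on pandas 0.23.0 or later the same speed-up can be requested directly (a minimal sketch, assuming the 'DAT' column and format from the question):
import pandas as pd

# cache=True reuses parsed results for repeated date strings
df['DAT'] = pd.to_datetime(df['DAT'], format='%m/%d/%Y', cache=True)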

You could split your huge dataframe into smaller chunks; for example, this function does it and lets you decide the chunk size:
def splitDataFrameIntoSmaller(df, chunkSize=10000):
    listOfDf = list()
    numberChunks = len(df) // chunkSize + 1
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf
After you have the chunks, you can apply the datetime conversion to each chunk separately, as sketched below.
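A rough sketch of that idea (hedged; it assumes the 'DAT' column and format from the question):
import pandas as pd

chunks = splitDataFrameIntoSmaller(df, chunkSize=1000000)
# convert each chunk separately, then reassemble and assign back by index
converted = [pd.to_datetime(c['DAT'], format='%m/%d/%Y') for c in chunks]
df['DAT'] = pd.concat(converted)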

I just came across this same issue myself. Thanks to SerialDev for the excellent answer. To build on that, I tried using datetime.strptime instead of pd.to_datetime:
from datetime import datetime as dt
dates = {date : dt.strptime(date, '%m/%d/%Y') for date in df['DAT'].unique()}
df['DAT'] = df['DAT'].apply(lambda v: dates[v])
The strptime method was 6.5x faster than the to_datetime method for me.

Inspired by the previous answers, in case you have both performance problems and multiple date formats, I suggest the following solution.
from datetime import datetime

dates = {}
for date in df['DAT'].unique():
    for ft in ['%Y/%m/%d', '%Y']:
        try:
            dates[date] = datetime.strptime(date, ft) if date else None
            break
        except ValueError:
            continue
df['DAT'] = df['DAT'].apply(lambda v: dates[v])


How to convert the timestamps column into Unix epoch format [duplicate]

From the official documentation of pandas.to_datetime:
unit : string, default ‘ns’
unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or
float number. This will be based off the origin. Example, with
unit=’ms’ and origin=’unix’ (the default), this would calculate the
number of milliseconds to the unix epoch start.
So when I try it this way:
import pandas as pd
df = pd.DataFrame({'time': [pd.to_datetime('2019-01-15 13:25:43')]})
df_unix_sec = pd.to_datetime(df['time'], unit='ms', origin='unix')
print(df)
print(df_unix_sec)
time
0 2019-01-15 13:25:43
0 2019-01-15 13:25:43
Name: time, dtype: datetime64[ns]
The output does not change for the second one: it keeps showing the datetime value instead of the number of milliseconds since the Unix epoch start. Why is that? Am I missing something?
I think you misunderstood what the argument is for. The purpose of origin='unix' is to convert an integer timestamp to datetime, not the other way.
pd.to_datetime(1.547559e+09, unit='s', origin='unix')
# Timestamp('2019-01-15 13:30:00')
Here are some options:
Option 1: integer division
Conversely, you can get the timestamp by converting to integer (to get nanoseconds) and dividing by 10**9.
pd.to_datetime(['2019-01-15 13:30:00']).astype(int) / 10**9
# Float64Index([1547559000.0], dtype='float64')
Pros:
super fast
Cons:
makes assumptions about how pandas internally stores dates
Option 2: recommended by pandas
Pandas docs recommend using the following method:
# create test data
dates = pd.to_datetime(['2019-01-15 13:30:00'])
# calculate unix datetime
(dates - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
[out]:
Int64Index([1547559000], dtype='int64')
Pros:
"idiomatic", recommended by the library
Cons:
unwieldy
not as performant as integer division
Option 3: pd.Timestamp
If you have a single date string, you can use pd.Timestamp as shown in the other answer:
pd.Timestamp('2019-01-15 13:30:00').timestamp()
# 1547559000.0
If you have to coerce multiple datetimes (where pd.to_datetime is your only option), you can initialize and map:
pd.to_datetime(['2019-01-15 13:30:00']).map(pd.Timestamp.timestamp)
# Float64Index([1547559000.0], dtype='float64')
Pros:
best method for a single datetime string
easy to remember
Cons:
not as performant as integer division
You can use the timestamp() method, which returns the POSIX timestamp as a float:
pd.Timestamp('2021-04-01').timestamp()
[Out]:
1617235200.0
pd.Timestamp('2021-04-01 00:02:35.234').timestamp()
[Out]:
1617235355.234
The value attribute of a pandas Timestamp holds the Unix epoch timestamp in nanoseconds, so you can convert it to microseconds or milliseconds by dividing by 1e3 or 1e6, respectively. Check the code below.
import pandas as pd
date_1 = pd.to_datetime('2020-07-18 18:50:00')
print(date_1.value)
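For instance, a small sketch extending the code above (the divisions are just the ns-to-us/ms conversions described):
print(date_1.value)        # 1595098200000000000, nanoseconds since the Unix epoch
print(date_1.value / 1e3)  # microseconds
print(date_1.value / 1e6)  # milliseconds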
When you calculate the difference between two datetimes, the dtype of the difference is timedelta64[ns] by default (ns in brackets). By changing [ns] into [ms], [s], [m] etc as you cast the output to a new timedelta64 object, you can convert the difference into milliseconds, seconds, minutes etc.
For example, to find the number of seconds passed since Unix epoch, subtract datetimes and change dtype.
df_unix_sec = (df['time'] - pd.Timestamp('1970-01-01')).astype('timedelta64[s]')
N.B. Oftentimes, the differences are very large numbers, so if you want them as integers, use astype('int64') (NOT astype(int)).
df_unix_sec = (df['time'] - pd.Timestamp('1970-01-01')).astype('timedelta64[s]').astype('int64')
For OP's example, this would yield,
0 1547472343
Name: time, dtype: int64
In case you are accessing a particular datetime64 object from the dataframe, chances are that pandas will return a Timestamp object which is essentially how pandas stores datetime64 objects.
You can use pd.Timestamp.to_datetime64() method of the pd.Timestamp object to convert it to numpy.datetime64 object with ns precision.
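A minimal sketch of that conversion (assuming the 'time' column from the question):
ts = df['time'].iloc[0]   # a pandas Timestamp
ts.to_datetime64()        # numpy.datetime64('2019-01-15T13:25:43.000000000')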

pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset

I am confused how pandas blew out of bounds for datetime objects with these lines:
import pandas as pd
BOMoffset = pd.tseries.offsets.MonthBegin()
# here some code sets the all_treatments dataframe and the newrowix, micolix, mocolix counters
all_treatments.iloc[newrowix,micolix] = BOMoffset.rollforward(all_treatments.iloc[i,micolix] + pd.tseries.offsets.DateOffset(months = x))
all_treatments.iloc[newrowix,mocolix] = BOMoffset.rollforward(all_treatments.iloc[newrowix,micolix]+ pd.tseries.offsets.DateOffset(months = 1))
Here all_treatments.iloc[i,micolix] is a datetime set by pd.to_datetime(all_treatments['INDATUMA'], errors='coerce',format='%Y%m%d'), and INDATUMA is date information in the format 20070125.
This logic seems to work on mock data (no errors, dates make sense), so at the moment I cannot reproduce the problem in isolation, but it fails on my full data with the following error:
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
In [54]: pd.Timestamp.min
Out[54]: Timestamp('1677-09-22 00:12:43.145225')
In [55]: pd.Timestamp.max
Out[55]: Timestamp('2262-04-11 23:47:16.854775807')
And your value, 2262-05-01 00:00:00, is outside this range, hence the out-of-bounds error.
Straight out of: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations
Workaround:
This will force the dates which are outside the bounds to NaT
pd.to_datetime(date_col_to_force, errors = 'coerce')
Setting the errors parameter in pd.to_datetime to 'coerce' causes replacement of out of bounds values with NaT. Quoting the docs:
If ‘coerce’, then invalid parsing will be set as NaT
E.g.:
datetime_variable = pd.to_datetime(datetime_variable, errors = 'coerce')
This does not fix the data (obviously), but still allows processing the non-NaT data points.
The reason you are seeing the error message "OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-12-23 00:00:00" is that the pandas Timestamp data type stores dates in nanosecond resolution (from the docs).
This means the date values have to be in the range between
pd.Timestamp.min (1677-09-21 00:12:43.145225) and
pd.Timestamp.max (2262-04-11 23:47:16.854775807).
Even if you only want the date with resolution of seconds or microseconds, pandas will still store it internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above mentioned range.
This is surprising because databases like SQL Server and libraries like numpy allow storing dates beyond this range, and in most cases a maximum of 64 bits is used to store the date as well.
But here is the difference.
SQL Server stores dates in nanosecond resolution but only up to an accuracy of 100 ns (as opposed to 1 ns in pandas). Since the space is limited (64 bits), it's a matter of range vs. accuracy. With the pandas Timestamp we get higher accuracy but a smaller date range.
In the case of numpy's datetime64 data type (pandas is built on top of numpy), if the date falls within the above-mentioned range you can store it in nanoseconds, similar to pandas. Or you can give up nanosecond resolution and go with microseconds, which gives you a much larger range; this is something that is missing in the pandas Timestamp type. However, if you choose to store in nanoseconds and the date is outside the range, numpy will silently wrap around this date and you might get unexpected results (referenced below in the 4th solution).
np.datetime64("3000-06-19T08:17:14.073456178", dtype="datetime64[ns]")
> numpy.datetime64('1831-05-11T09:08:06.654352946')
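For contrast, a small sketch at microsecond resolution (datetime64[us] spans roughly +/- 290,000 years, so the same date fits comfortably in range):
np.datetime64("3000-06-19T08:17:14.073456", "us")
> numpy.datetime64('3000-06-19T08:17:14.073456')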
Now with pandas we have the options below.
import pandas as pd
data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)
1) If you are OK with losing the data that is out of range, then simply use the param below to convert out-of-range dates to NaT (not a time).
my_df['dob'] = pd.to_datetime(my_df['dob'], errors = 'coerce')
2) If you don't want to lose the data, then you can convert the values into a Python datetime type. Here the column "dob" is of pandas object dtype, but the individual values will be of Python datetime type. However, doing this we lose the benefit of vectorized functions.
import datetime as dt
my_df['dob'] = my_df['dob'].apply(lambda x: dt.datetime.strptime(x,'%Y-%m-%dT%H:%M:%S') if type(x)==str else pd.NaT)
print(type(my_df.iloc[0][1]))
> <class 'datetime.datetime'>
3) Another option is to use numpy instead of a pandas Series if possible. In the case of a pandas dataframe, you can convert a Series (or column in a df) to a numpy array, process the data separately, and then join it back to the dataframe (a rough sketch follows after option 4 below).
4) We can also use pandas timespans (Period) as suggested in the docs. Do check out the difference between Timestamp and Period before using this data type. Date range and frequency here work similarly to numpy (mentioned above in the numpy section).
my_df['dob'] = my_df['dob'].apply(lambda x: pd.Period(x, freq='ms'))
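For option 3, a rough sketch (hedged; it uses numpy at microsecond resolution and a hypothetical 'dob_np' output column):
import numpy as np

# parse the strings with numpy at microsecond resolution (wider range than nanoseconds),
# then reattach the parsed values to the dataframe as plain Python datetime objects
dob_us = np.array(my_df['dob'].tolist(), dtype='datetime64[us]')
my_df['dob_np'] = dob_us.astype(object)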
You can try strptime() from the datetime library along with a lambda expression to convert text to date values in a Series object:
Example:
import datetime
import numpy as np

df['F'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S') if type(x)==str else np.NaN)
None of the above are ideal, because they will delete your data. But if your source is an epoch value, you can keep the data and adjust the conversion:
# converting from epoch to datetime, maintaining the nanosecond timestamps
xbarout = pd.to_datetime(xbarout.iloc[:, 0], unit='ns')

Python: Convert timedelta to int in a dataframe

I would like to create a column in a pandas data frame that is an integer representation of the number of days in a timedelta column. Is it possible to use 'datetime.days' or do I need to do something more manual?
timedelta column
7 days, 23:29:00
day integer column
7
The Series class has a pandas.Series.dt accessor object with several
useful datetime attributes, including dt.days. Access this attribute via:
timedelta_series.dt.days
You can also get the seconds and microseconds attributes in the same way.
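A minimal sketch (assuming a timedelta Series built from the question's example value):
import pandas as pd

td = pd.Series([pd.Timedelta(days=7, hours=23, minutes=29)])
td.dt.days      # 7
td.dt.seconds   # 84540, the remainder within the day, not total seconds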
You could do this, where td is your series of timedeltas. The division converts the nanosecond deltas into day deltas, and the conversion to int drops to whole days.
import numpy as np
(td / np.timedelta64(1, 'D')).astype(int)
Timedelta objects have read-only instance attributes .days, .seconds, and .microseconds.
If the question isn't just "how to access an integer form of the timedelta?" but "how to convert the timedelta column in the dataframe to an int?" the answer might be a little different. In addition to the .dt.days accessor you need either df.astype or pd.to_numeric
Either of these options should help:
df['tdColumn'] = pd.to_numeric(df['tdColumn'].dt.days, downcast='integer')
or
df['tdColumn'] = df['tdColumn'].dt.days.astype('int16')
The simplest way to do this is:
df["DateColumn"] = df["DateColumn"].dt.days
A great way to do this is
dif_in_days = dif.days
(where dif is the difference between dates)
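For instance, a small sketch with two hypothetical dates:
import pandas as pd

dif = pd.Timestamp('2021-04-08') - pd.Timestamp('2021-04-01')
dif_in_days = dif.days   # 7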

Pandas dataframe groupby datetime month

Consider a CSV file:
string,date,number
a string,2/5/11 9:16am,1.0
a string,3/5/11 10:44pm,2.0
a string,4/22/11 12:07pm,3.0
a string,4/22/11 12:10pm,4.0
a string,4/29/11 11:59am,1.0
a string,5/2/11 1:41pm,2.0
a string,5/2/11 2:02pm,3.0
a string,5/2/11 2:56pm,4.0
a string,5/2/11 3:00pm,5.0
a string,5/2/14 3:02pm,6.0
a string,5/2/14 3:18pm,7.0
I can read this in, and reformat the date column into datetime format:
b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
I have been trying to group the data by month. It seems like there should be an obvious way of accessing the month and grouping by that. But I can't seem to do it. Does anyone know how?
What I am currently trying is re-indexing by the date:
b.index = b['date']
I can access the month like so:
b.index.month
However I can't seem to find a function to lump together by month.
Managed to do it:
b = pd.read_csv('b.dat')
b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
b.groupby(by=[b.index.month, b.index.year])
Or
b.groupby(pd.Grouper(freq='M')) # update for v0.21+
(Update 2018)
Note that pd.TimeGrouper is deprecated and will be removed. Instead, use:
df.groupby(pd.Grouper(freq='M'))
To group time-series data you can use the method resample. For example, to group by month:
df.resample(rule='M', on='date')['Values'].sum()
The list of offset aliases can be found here.
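Applied to the question's data, a sketch assuming the 'date' and 'number' columns from the CSV above:
import pandas as pd

b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'], format='%m/%d/%y %I:%M%p')
b.resample(rule='M', on='date')['number'].sum()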
One solution which avoids MultiIndex is to create a new datetime column setting day = 1. Then group by this column.
Normalise day of month
df = pd.DataFrame({'Date': pd.to_datetime(['2017-10-05', '2017-10-20', '2017-10-01', '2017-09-01']),
                   'Values': [5, 10, 15, 20]})
# normalize day to beginning of month, 4 alternative methods below
df['YearMonth'] = df['Date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
df['YearMonth'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day-1, unit='D')
df['YearMonth'] = df['Date'].map(lambda dt: dt.replace(day=1))
df['YearMonth'] = df['Date'].dt.normalize().map(pd.tseries.offsets.MonthBegin().rollback)
Then use groupby as normal:
g = df.groupby('YearMonth')
res = g['Values'].sum()
# YearMonth
# 2017-09-01 20
# 2017-10-01 30
# Name: Values, dtype: int64
Comparison with pd.Grouper
The subtle benefit of this solution is, unlike pd.Grouper, the grouper index is normalized to the beginning of each month rather than the end, and therefore you can easily extract groups via get_group:
some_group = g.get_group('2017-10-01')
Calculating the last day of October is slightly more cumbersome. pd.Grouper, as of v0.23, does support a convention parameter, but this is only applicable for a PeriodIndex grouper.
Comparison with string conversion
An alternative to the above idea is to convert to a string, e.g. convert datetime 2017-10-XX to string '2017-10'. However, this is not recommended since you lose all the efficiency benefits of a datetime series (stored internally as numerical data in a contiguous memory block) versus an object series of strings (stored as an array of pointers).
A slightly alternative solution to @jpp's, but outputting a YearMonth string:
df['YearMonth'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}-{month}'.format(year=x.year, month=x.month))
res = df.groupby('YearMonth')['Values'].sum()

Convert numpy.datetime64 to string object in python

I am having trouble converting a python datetime64 object into a string. For example:
t = numpy.datetime64('2012-06-30T20:00:00.000000000-0400')
Into:
'2012.07.01' as a string. (note time difference)
I have already tried to convert the datetime64 object to a datetime long then to a string, but I seem to get this error:
dt = t.astype(datetime.datetime) #1341100800000000000L
time.ctime(dt)
ValueError: unconvertible time
Solution was:
import pandas as pd
ts = pd.to_datetime(str(t))
d = ts.strftime('%Y.%m.%d')
If you don't want to do that conversion gobbledygook and are ok with just one date format, this was the best solution for me
str(t)[:10]
Out[11]: '2012-07-01'
As noted this works for pandas too
df['d'].astype(str).str[:10]
df['d'].dt.strftime('%Y-%m-%d') # equivalent
You can use Numpy's datetime_as_string function. The unit='D' argument specifies the precision, in this case days.
>>> t = numpy.datetime64('2012-06-30T20:00:00.000000000-0400')
>>> numpy.datetime_as_string(t, unit='D')
'2012-07-01'
t.item().strftime('%Y.%m.%d')
.item() will cast numpy.datetime64 to datetime.datetime, no need to import anything.
There is a route without using pandas; but see caveat below.
Well, the t variable has a resolution of nanoseconds, which can be shown by inspection in python:
>>> t.dtype
dtype('<M8[ns]')
This means that the integer value of t is 10^9 times the UNIX timestamp. The value printed in your question gives that hint. Your best bet is to divide the integer value of t by 1 billion, then you can use time.strftime:
>>> import time
>>> time.strftime("%Y.%m.%d", time.gmtime(t.astype(int)/1000000000))
2012.07.01
In using this, be conscious of two assumptions:
1) the datetime64 resolution is nanosecond
2) the time stored in datetime64 is in UTC
Side note 1: Interestingly, the numpy developers decided [1] that datetime64 object that has a resolution greater than microsecond will be cast to a long type, which explains why t.astype(datetime.datetime) yields 1341100800000000000L. The reason is that datetime.datetime object can't accurately represent a nanosecond or finer timescale, because the resolution supported by datetime.datetime is only microsecond.
Side note 2: Beware the different conventions between numpy 1.10 and earlier vs 1.11 and later:
in numpy <= 1.10, datetime64 is stored internally as UTC, and printed as local time. Parsing is assuming local time if no TZ is specified, otherwise the timezone offset is accounted for.
in numpy >= 1.11, datetime64 is stored internally as timezone-agnostic value (seconds since 1970-01-01 00:00 in unspecified timezone), and printed as such. Time parsing does not assume the timezone, although +NNNN style timezone shift is still permitted and that the value is converted to UTC.
[1]: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/datetime.c see routine convert_datetime_to_pyobject.
I wanted an ISO 8601 formatted string without needing any extra dependencies. My numpy_array has a single element as a datetime64. With help from @Wirawan-Purwanto, I added just a bit:
from datetime import datetime
ts = numpy_array.values.astype(datetime)/1000000000
return datetime.utcfromtimestamp(ts).isoformat() # "2018-05-24T19:54:48"
Building on this answer I would do the following:
import numpy
import datetime
t = numpy.datetime64('2012-06-30T20:00:00.000000000')
datetime.datetime.fromtimestamp(t.item() / 10**9).strftime('%Y.%m.%d')
The division by a billion is to convert from nanoseconds to seconds.
Here is a one-liner (note the padding with extra zeros):
datetime.strptime(str(t),'%Y-%m-%dT%H:%M:%S.%f000').strftime("%Y-%m-%d")
code sample
import numpy
from datetime import datetime
t = numpy.datetime64('2012-06-30T20:00:00.000000000-0400')
method 1:
datetime.strptime(str(t),'%Y-%m-%dT%H:%M:%S.%f000').strftime("%Y-%m-%d")
method 2:
datetime.strptime(str(t)[:10], "%Y-%m-%d").strftime("%Y-%m-%d")
output
'2012-07-01'
Also, if you want to apply the same formula to any series of datetimes in a dataframe, you can follow the steps below:
import pandas as pd

temp = []
for i in range(len(t["myDate"])):
    ts = pd.to_datetime(str(t["myDate"].iloc[i]))
    temp.append(ts.strftime('%Y-%m-%d'))
t["myDate"] = temp
datetime objects can be converted to strings using the str() method
t.__str__()
