Vectorizing a Pandas apply function for tz_convert - python

I have a dataframe where the hour column contains datetime data in UTC. I have a time_zone column with time zones for each observation, and I'm using it to convert hour to the local time and save it in a new column named local_hour. To do this, I'm using the following code:
import pandas as pd

# Sample dataframe
df = pd.DataFrame({
    'hour': ['2019-01-01 05:00:00', '2019-01-01 07:00:00', '2019-01-01 08:00:00'],
    'time_zone': ['US/Eastern', 'US/Central', 'US/Mountain']
})
# Ensure hour is in datetime format and localized to UTC
df['hour'] = pd.to_datetime(df['hour']).dt.tz_localize('UTC')
# Add local_hour column with hour in local time
df['local_hour'] = df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
df
hour time_zone local_hour
0 2019-01-01 05:00:00+00:00 US/Eastern 2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00 US/Central 2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00 US/Mountain 2019-01-01 01:00:00-07:00
The code works. However, using apply runs quite slowly, since in reality I have a large dataframe. Is there a way to vectorize this or otherwise speed it up?
Note: I have tried using the swifter package, but in my case it doesn't speed things up.

Assuming there isn't an unbounded number of distinct time_zone values, you could perform one tz_convert per group, like:
df['local_hour'] = df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
print (df)
hour time_zone local_hour
0 2019-01-01 05:00:00+00:00 US/Eastern 2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00 US/Central 2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00 US/Mountain 2019-01-01 01:00:00-07:00
On this small sample it will probably be slower than the row-wise apply, but on bigger data with few groups it should be faster.
For speed comparison, with the df of 3 rows you provided, it gives:
%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 1.6 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 2.58 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
so apply is faster here, but if you create a dataframe 1000 times bigger with still only 3 time zones, groupby becomes about 20 times faster:
df = pd.concat([df]*1000, ignore_index=True)
%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 585 ms ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 27.5 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
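A variant of the same idea, in case the groupby-apply form trips over index alignment in your pandas version: run one vectorized tz_convert per distinct time zone and stitch the pieces back together by index. A minimal sketch; note the resulting column ends up with object dtype, since a single datetime64 column cannot hold several different time zones:
import pandas as pd

# Sketch: one vectorized tz_convert per distinct time zone, then reassemble
# in the original row order via the index.
parts = [grp.dt.tz_convert(tz) for tz, grp in df.groupby('time_zone')['hour']]
df['local_hour'] = pd.concat(parts).sort_index()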

Related

How can I replace the 'year' value in a datetime column for each row?

Within my dataframe I have two columns: 'release_date' and 'release_year'.
I am trying to replace the year value in each 'release_date' instance with the corresponding value in 'release_year'.
I have tried the following:
df.loc[:, 'release_date'] = df['release_date'].apply(lambda x: x.replace(x.year == df['release_year']))
However, I am getting the error: 'value must be an integer, received <class 'pandas.core.series.Series'> for year'.
Having checked the dtype, the release_date column is stored as datetime64[ns]
You need to use pandas.DataFrame.apply here rather than pandas.Series.apply, as you need data from another column. Consider the following simple example:
import datetime
import pandas as pd

df = pd.DataFrame({'release_date': [datetime.date(1901,1,1), datetime.date(1902,1,1), datetime.date(1903,1,1)],
                   'release_year': [2001, 2002, 2003]})
df['changed_date'] = df.apply(lambda x: x.release_date.replace(year=x.release_year), axis=1)
print(df)
output
release_date release_year changed_date
0 1901-01-01 2001 2001-01-01
1 1902-01-01 2002 2002-01-01
2 1903-01-01 2003 2003-01-01
Note axis=1, which means the function is applied to each row, and the function receives the row (a pandas.Series) as its argument.
Casting to string and then parsing back to datetime is more efficient here, and also more readable if you ask me. Ex:
import datetime
import pandas as pd
N = 100000
df = pd.DataFrame({'release_date': [datetime.date(1901,1,1), datetime.date(1902,1,1), datetime.date(1903,1,1)]*N,
                   'release_year': [2001, 2002, 2003]*N})
df['changed_date'] = pd.to_datetime(
    df['release_year'].astype(str) + df['release_date'].astype(str).str[5:],
    format="%Y%m-%d"
)
df['changed_date']
df['changed_date']
Out[176]:
0 2001-01-01
1 2002-01-01
2 2003-01-01
3 2001-01-01
4 2002-01-01
...
299995 2002-01-01
299996 2003-01-01
299997 2001-01-01
299998 2002-01-01
299999 2003-01-01
Name: changed_date, Length: 300000, dtype: datetime64[ns]
>>> %timeit df['changed_date'] = df.apply(lambda x:x.release_date.replace(year=x.release_year),axis=1)
6.73 s ± 542 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df['changed_date'] = pd.to_datetime(df['release_year'].astype(str)+df['release_date'].astype(str).str[5:], format="%Y%m-%d")
651 ms ± 78.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
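If you'd rather stay in datetime space than round-trip through strings, pandas can also assemble datetimes from year/month/day components (pd.to_datetime accepts a frame with those columns). A sketch under the same sample data:
import pandas as pd

rd = pd.to_datetime(df['release_date'])  # object dates -> datetime64 so .dt works
# assemble the new dates from the replacement year plus the original month/day
df['changed_date'] = pd.to_datetime(pd.DataFrame({
    'year': df['release_year'],
    'month': rd.dt.month,
    'day': rd.dt.day,
}))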

How to optimize the conversion of DateTimeIndex to a certain column of a DataFrame in a specific format?

I have a DatetimeIndex that I need to convert into a column of a DataFrame using a specific format. My code is as follows; how can I optimize it?
import numpy as np
import pandas as pd
original = pd.date_range(start='20210520 09:00:00', end='20210520 12:00:00', freq='30min')
time = np.vectorize(lambda s: s.strftime('%H:%M:%S'))(original.to_pydatetime())
result = pd.DataFrame(time, columns=['time'])
print('original:')
print(original)
print('result:')
print(result)
original:
DatetimeIndex(['2021-05-20 09:00:00', '2021-05-20 09:30:00',
'2021-05-20 10:00:00', '2021-05-20 10:30:00',
'2021-05-20 11:00:00', '2021-05-20 11:30:00',
'2021-05-20 12:00:00'],
dtype='datetime64[ns]', freq='30T')
result:
time
0 09:00:00
1 09:30:00
2 10:00:00
3 10:30:00
4 11:00:00
5 11:30:00
6 12:00:00
Instead of this:
time = np.vectorize(lambda s: s.strftime('%H:%M:%S'))(original.to_pydatetime())
Use:
time = original.time.astype(str)
Performance:
%%timeit
original = pd.date_range(start='20210520 09:00:00', end='20210520 12:00:00', freq='30min')
time = np.vectorize(lambda s: s.strftime('%H:%M:%S'))(original.to_pydatetime())
result = pd.DataFrame(time, columns=['time'])
# 925 µs ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
original = pd.date_range(start='20210520 09:00:00', end='20210520 12:00:00', freq='30min')
time = original.time.astype(str)
result = pd.DataFrame(time, columns=['time'])
# 724 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
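One more option, mainly for readability: DatetimeIndex has its own vectorized strftime, which avoids materializing Python datetime objects entirely. A sketch; the output matches the astype(str) version:
import pandas as pd

original = pd.date_range(start='20210520 09:00:00', end='20210520 12:00:00', freq='30min')
# format the whole index in a single call
result = pd.DataFrame({'time': original.strftime('%H:%M:%S')})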

Integer to date in Python: unsupported type for timedelta days component: Series

I want to create a new column DATE in the dataframe transaction from the column DAY (which lists days from the beginning of the study). For example, if DAY is 1, I want DATE to show 2014-1-1, and if DAY is 12 it should be 2014-1-12. This is what I am doing to convert days into dates, and it works:
import datetime
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014, 1, 1, 0, 0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")
However, when I try to use this function to add a new column, it doesn't work:
transaction['DATE'] = serial_date_to_string(transaction['DAY'])
TypeError: unsupported type for timedelta days component: Series
But the DAY column type is int64. I searched on forums and found that the function could be adjusted; if I try this:
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014, 1, 1, 0, 0) + (srl_no - 1).map(datetime.timedelta)
    return new_date.strftime("%Y-%m-%d")
It still gives AttributeError: 'Series' object has no attribute 'strftime'.
Thank you for any help!
Use map:
transaction['DATE'] = transaction['DAY'].map(serial_date_to_string)
Use to_datetime with the origin parameter for the starting day and unit='d' for days:
df = pd.DataFrame({'DAY':[1,12,20]})
df['date'] = pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d')
print (df)
DAY date
0 1 2014-01-01
1 12 2014-01-12
2 20 2014-01-20
For the same output, add Series.dt.strftime:
df['date'] = pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d').dt.strftime("%Y-%m-%d")
print (df)
DAY date
0 1 2014-01-01
1 12 2014-01-12
2 20 2014-01-20
EDIT:
For your function, it is possible to use Series.apply:
transaction['DATE'] = transaction['DAY'].apply(serial_date_to_string)
Performance is similar; apply is the fastest here for 10k rows:
import datetime
import numpy as np
import pandas as pd

def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014, 1, 1, 0, 0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")

np.random.seed(2021)
df = pd.DataFrame({'DAY': np.random.randint(10, 1000, size=10000)})
In [17]: %timeit pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d').dt.strftime("%Y-%m-%d")
79.4 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [18]: %timeit df['DAY'].apply(serial_date_to_string)
57.1 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit df['DAY'].map(serial_date_to_string)
64.7 ms ± 5.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
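Another vectorized spelling of the same conversion, if you prefer explicit timedelta arithmetic to the origin parameter (a sketch on the same test frame):
import pandas as pd

# DAY 1 maps to 2014-01-01, so offset by DAY - 1 days from the start date
df['date'] = (pd.Timestamp('2014-01-01')
              + pd.to_timedelta(df['DAY'] - 1, unit='d')).dt.strftime('%Y-%m-%d')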

How to find out a customer's billing date before their contract termination using python?

I have several customers who get billed on the 25th of each month. I want to find out their last billing date before their contract was terminated. Below is a sample from the dataframe:
> data = [['Arthur','2019-03-01'],['Bart','2019-02-26'],['Cindy','2019-02-18'],['Douglas','2019-03-31']]
> df = pd.DataFrame(data, columns = ['Name','Termination Date'])
> df
Furthermore, below is the expected output:
> df['Last Billing Date'] = ['2019-02-25','2019-02-25','2019-01-25','2019-03-25']
> df
Here is one way:
import numpy as np

df['Termination Date'] = pd.to_datetime(df['Termination Date'])  # sample data is strings
s = df['Termination Date'].apply(lambda x: x.replace(day=25))
df['New'] = np.where(df['Termination Date'] >= s, s, s - pd.DateOffset(months=1))
df
Name Termination Date New
0 Arthur 2019-03-01 2019-02-25
1 Bart 2019-02-26 2019-02-25
2 Cindy 2019-02-18 2019-01-25
3 Douglas 2019-03-31 2019-03-25
If you want to do this in a (mostly) vectorized way:
df['Termination Date'] = pd.to_datetime(df['Termination Date'])
before_25 = df['Termination Date'].dt.day < 25
shifted = df['Termination Date'].copy()  # work on a copy so the original column is kept
shifted[before_25] = shifted[before_25] - pd.DateOffset(months=1)
df['Last Billing Date'] = shifted.apply(lambda dt: dt.replace(day=25))
A simple solution is to subtract a month if the day is before 25:
import datetime

def last_billing(termination_dt):
    if isinstance(termination_dt, str):  # parse if not already a datetime
        termination_dt = datetime.datetime.strptime(termination_dt, '%Y-%m-%d')
    if termination_dt.day < 25:
        # step back into the previous month first (also handles January correctly)
        termination_dt = termination_dt.replace(day=1) - datetime.timedelta(days=1)
    return termination_dt.replace(day=25)

df['Last Billing Date'] = df['Termination Date'].apply(last_billing)
Name Termination Date Last Billing Date
0 Arthur 2019-03-01 2019-02-25
1 Bart 2019-02-26 2019-02-25
2 Cindy 2019-02-18 2019-01-25
3 Douglas 2019-03-31 2019-03-25
If performance is an issue, you can vectorize the function with np.vectorize:
import numpy as np

@np.vectorize
def last_billing(termination_dt):
    if isinstance(termination_dt, str):
        termination_dt = datetime.datetime.strptime(termination_dt, '%Y-%m-%d')
    if termination_dt.day < 25:
        termination_dt = termination_dt.replace(day=1) - datetime.timedelta(days=1)
    return termination_dt.replace(day=25)

df['Last Billing Date'] = last_billing(df['Termination Date'])
Time comparisons:
%timeit df['Last Billing Date'] = df['Termination Date'].apply(last_billing)
## 113 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['Last Billing Date'] = last_billing(df['Termination Date'])
## 108 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
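For a route with no per-row Python calls at all, monthly period arithmetic also works: step back one period when the day is before the 25th, then pin the day. A sketch, assuming Termination Date has already been converted with pd.to_datetime:
import pandas as pd

term = pd.to_datetime(df['Termination Date'])
months = term.dt.to_period('M')                       # calendar month of each date
months = months.where(term.dt.day >= 25, months - 1)  # previous month if before the 25th
# first day of the chosen month plus 24 days lands on the 25th
df['Last Billing Date'] = months.dt.to_timestamp() + pd.Timedelta(days=24)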

pd to_datetime not converting DDMMMYYYY dates to datetime in python

Here's what my data looks like. As you can see, there are some columns with DDMMMYYYY format, some are NaN and some are standard DD/MM/YYYY format.
completion_date_latest 15/03/2001
completion_date_original 15/03/2001
customer_birth_date_1 30/11/1970
customer_birth_date_2 20/11/1971
d_start 01Feb2018
latest_maturity_date 28/02/2021
latest_valuation_date 15/03/2001
sdate NaN
startdt_def NaN
obs_date 01Feb2018
I want to convert them to datetime fields. I have a list of columns called varlist2, and I'm looping through them to a) remove the NAs and b) convert to datetime using the to_datetime function:
for m in range(0, len(varlist2)):
    date_var = varlist2[m]
    print('MM_Dates transform variable: ' + date_var)
    mm_dates_base[date_var] = pd.to_datetime(mm_dates_base[date_var], errors='ignore', dayfirst=True)
    mm_dates_base[date_var] = mm_dates_base[date_var].fillna('')
However, when I check my output, I get this, where d_start and obs_date haven't been converted. Any idea why this might be the case and what I can do to fix it?
In [111]: print(mm_dates_base.iloc[0])
completion_date_latest 2001-03-15 00:00:00
completion_date_original 2001-03-15 00:00:00
customer_birth_date_1 1970-11-30 00:00:00
customer_birth_date_2 1971-11-20 00:00:00
d_start 01Feb2018
latest_maturity_date 2021-02-28 00:00:00
latest_valuation_date 2001-03-15 00:00:00
sdate
startdt_def
obs_date 01Feb2018
Any ideas how I can treat the DDMMMYYYY dates at the same time?
You can select all the columns given by varlist2 as a DataFrame, then use apply + to_datetime with errors='coerce' to convert problematic formats to NaT where parsing is not possible. Finally, replace the NaTs using combine_first and assign back:
df1 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', dayfirst=True)
df2 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', format='%d%b%Y')
mm_dates_base[varlist2] = df1.combine_first(df2)
print (mm_dates_base)
completion_date_latest completion_date_original customer_birth_date_1 \
0 2001-03-15 2001-03-15 1970-11-30
customer_birth_date_2 d_start latest_maturity_date latest_valuation_date \
0 1971-11-20 2018-02-01 2021-02-28 2001-03-15
sdate startdt_def obs_date
0 NaT NaT 2018-02-01
Another, faster solution is to loop over each column:
for col in varlist2:
    a = pd.to_datetime(mm_dates_base[col], errors='coerce', dayfirst=True)
    b = pd.to_datetime(mm_dates_base[col], errors='coerce', format='%d%b%Y')
    mm_dates_base[col] = a.combine_first(b)
Speed comparison:
#[100 rows x 10 columns]
mm_dates_base = pd.concat([df] * 100, ignore_index=True)
In [41]: %%timeit
...:
...: for col in varlist2:
...: a = pd.to_datetime(mm_dates_base[col], errors='coerce', dayfirst=True)
...: b = pd.to_datetime(mm_dates_base[col], errors='coerce', format='%d%b%Y')
...: mm_dates_base[col] = a.combine_first(b)
...:
5.13 ms ± 46.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %%timeit
...: df1 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', dayfirst=True)
...: df2 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', format='%d%b%Y')
...:
...: mm_dates_base[varlist2] = df1.combine_first(df2)
...:
14.1 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The to_datetime function will usually detect the date format when converting, but the lack of separators in your d_start and obs_date columns is probably what's tripping it up. You may have to parse those specific columns with an explicit format string; from the looks of it, it'll follow something like %d%b%Y.
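Concretely, a minimal sketch for those two columns (names taken from the question), passing the format explicitly:
import pandas as pd

# parse the DDMMMYYYY columns with an explicit format; errors='coerce'
# turns anything unparseable (e.g. NaN) into NaT instead of raising
for col in ['d_start', 'obs_date']:
    mm_dates_base[col] = pd.to_datetime(mm_dates_base[col], format='%d%b%Y', errors='coerce')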
