I have a dataframe where the hour column contains datetime data in UTC. I have a time_zone column with time zones for each observation, and I'm using it to convert hour to the local time and save it in a new column named local_hour. To do this, I'm using the following code:
import pandas as pd
# Sample dataframe
df = pd.DataFrame({
    'hour': ['2019-01-01 05:00:00', '2019-01-01 07:00:00', '2019-01-01 08:00:00'],
    'time_zone': ['US/Eastern', 'US/Central', 'US/Mountain']
})
# Ensure hour is in datetime format and localized to UTC
df['hour'] = pd.to_datetime(df['hour']).dt.tz_localize('UTC')
# Add local_hour column with hour in local time
df['local_hour'] = df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
df
hour time_zone local_hour
0 2019-01-01 05:00:00+00:00 US/Eastern 2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00 US/Central 2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00 US/Mountain 2019-01-01 01:00:00-07:00
The code works. However, using apply runs quite slowly, since in reality I have a large dataframe. Is there a way to vectorize this or otherwise speed it up?
Note: I have tried using the swifter package, but in my case it doesn't speed things up.
On the assumption that there is not an unbounded number of distinct time_zone values, you could perform one tz_convert per group, like:
df['local_hour'] = df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
print (df)
hour time_zone local_hour
0 2019-01-01 05:00:00+00:00 US/Eastern 2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00 US/Central 2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00 US/Mountain 2019-01-01 01:00:00-07:00
On the sample it will probably be slower than what you did, but on bigger data with relatively few groups, it should be faster.
For speed comparison, with the df of 3 rows you provided, it gives:
%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 1.6 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 2.58 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So apply is faster on 3 rows, but if you create a dataframe 1000 times bigger, still with only 3 time zones, groupby comes out about 20 times faster:
df = pd.concat([df]*1000, ignore_index=True)
%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 585 ms ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 27.5 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
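If you want to avoid groupby, a minimal equivalent sketch under the same assumption (few distinct zones) is to loop over the unique zones with boolean masks; since the results mix UTC offsets, the column ends up as object dtype, just like the groupby version:
# Sketch: one tz_convert per unique zone, selected via a boolean mask
out = pd.Series(index=df.index, dtype='object')  # object dtype: mixed UTC offsets
for tz in df['time_zone'].unique():
    mask = df['time_zone'] == tz
    out[mask] = df.loc[mask, 'hour'].dt.tz_convert(tz)
df['local_hour'] = out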
Within my dataframe I have two columns: 'release_date' and 'release_year'.
I am trying to replace the year value in each 'release_date' instance with the corresponding value in 'release_year'.
I have tried the following:
df.loc[:, 'release_date'] = df['release_date'].apply(lambda x: x.replace(x.year == df['release_year']))
However, I am getting the error: 'value must be an integer, received <class 'pandas.core.series.Series'> for year'
Having checked the dtype, the release_date column is stored as datetime64[ns]
Excerpt from dataframe
You need to use pandas.DataFrame.apply here rather than pandas.Series.apply, as you need data from another column. Consider the following simple example:
import datetime
import pandas as pd
df = pd.DataFrame({'release_date': [datetime.date(1901,1,1), datetime.date(1902,1,1), datetime.date(1903,1,1)],
                   'release_year': [2001, 2002, 2003]})
df['changed_date'] = df.apply(lambda x: x.release_date.replace(year=x.release_year), axis=1)
print(df)
output
release_date release_year changed_date
0 1901-01-01 2001 2001-01-01
1 1902-01-01 2002 2002-01-01
2 1903-01-01 2003 2003-01-01
Note axis=1, which means the function is applied to each row, and the row (a pandas.Series) is what you get as the argument to that function.
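As an aside (my addition, not part of the answer above): if row-wise iteration is unavoidable, itertuples is usually faster than DataFrame.apply, since it avoids building a Series per row:
# Sketch: row-wise via itertuples; attribute names come from the column names
df['changed_date'] = [row.release_date.replace(year=row.release_year)
                      for row in df.itertuples()]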
Casting to string and then parsing back to datetime is more efficient here, and also more readable if you ask me. Ex:
import datetime
import pandas as pd
N = 100000
df = pd.DataFrame({'release_date':[datetime.date(1901,1,1),datetime.date(1902,1,1),datetime.date(1903,1,1)]*N,
'release_year':[2001,2002,2003]*N})
df['changed_date'] = pd.to_datetime(
    df['release_year'].astype(str) + df['release_date'].astype(str).str[5:],
    format="%Y%m-%d"
)
df['changed_date']
Out[176]:
0        2001-01-01
1        2002-01-01
2        2003-01-01
3        2001-01-01
4        2002-01-01
            ...
299995   2002-01-01
299996   2003-01-01
299997   2001-01-01
299998   2002-01-01
299999   2003-01-01
Name: changed_date, Length: 300000, dtype: datetime64[ns]
>>> %timeit df['changed_date'] = df.apply(lambda x:x.release_date.replace(year=x.release_year),axis=1)
6.73 s ± 542 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df['changed_date'] = pd.to_datetime(df['release_year'].astype(str)+df['release_date'].astype(str).str[5:], format="%Y%m-%d")
651 ms ± 78.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
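Another vectorized route, as a sketch of my own (not part of the answer above): pandas can assemble datetimes from a DataFrame of year/month/day components, which skips the string round-trip on the year side:
# Sketch: assemble datetimes from components; like the replace() approach,
# this raises if a Feb 29 source date maps to a non-leap target year
rd = pd.to_datetime(df['release_date'])
parts = pd.DataFrame({'year': df['release_year'], 'month': rd.dt.month, 'day': rd.dt.day})
df['changed_date'] = pd.to_datetime(parts)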
I have a DatetimeIndex that I need to convert into a column of the DataFrame using a specific format. My code is as follows; how can I optimize it?
import numpy as np
import pandas as pd
original = pd.date_range(start='20210520 09:00:00', end='20210520 12:00:00', freq='30min')
time = np.vectorize(lambda s: s.strftime('%H:%M:%S'))(original.to_pydatetime())
result = pd.DataFrame(time, columns=['time'])
print('original:')
print(original)
print('result:')
print(result)
original:
DatetimeIndex(['2021-05-20 09:00:00', '2021-05-20 09:30:00',
'2021-05-20 10:00:00', '2021-05-20 10:30:00',
'2021-05-20 11:00:00', '2021-05-20 11:30:00',
'2021-05-20 12:00:00'],
dtype='datetime64[ns]', freq='30T')
result:
time
0 09:00:00
1 09:30:00
2 10:00:00
3 10:30:00
4 11:00:00
5 11:30:00
6 12:00:00
Instead of this:
time = np.vectorize(lambda s: s.strftime('%H:%M:%S'))(original.to_pydatetime())
Use:
time = original.time.astype(str)
Performance:
%%timeit
original = pd.date_range(start='20210520 09:00:00', end='20210520 12:00:00', freq='30min')
time = np.vectorize(lambda s: s.strftime('%H:%M:%S'))(original.to_pydatetime())
result = pd.DataFrame(time, columns=['time'])
925 µs ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
original = pd.date_range(start='20210520 09:00:00', end='20210520 12:00:00', freq='30min')
time=original.time.astype(str)
result = pd.DataFrame(time, columns=['time'])
724 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
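If you need an explicit format rather than the default str() rendering of the time objects, DatetimeIndex.strftime is also vectorized; a quick sketch:
# Sketch: format directly on the DatetimeIndex, keeping the format string
result = pd.DataFrame(original.strftime('%H:%M:%S'), columns=['time'])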
I want to create a new column DATE in the dataframe transaction from the column DAY (which lists days from the beginning of the study). For example, if DAY is 1, I want DATE to show 2014-1-1; if DAY is 12, it should be 2014-1-12. This is what I am doing to convert days into dates, and it works:
import datetime
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014,1,1,0,0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")
However, when I try to use this formula to add a new column, it doesn't work:
transaction['DATE'] = serial_date_to_string(transaction['DAY'])
TypeError: unsupported type for timedelta days component: Series
But the DAY column type is int64. I searched on forums and found that the function could be adjusted; I tried this:
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014,1,1,0,0) + (srl_no - 1).map(datetime.timedelta)
    return new_date.strftime("%Y-%m-%d")
It still gives AttributeError: 'Series' object has no attribute 'strftime'.
Thank you for any help!
Use map:
transaction['DATE'] = transaction['DAY'].map(serial_date_to_string)
Use to_datetime with the parameters origin for the starting day and unit='d' for days:
df = pd.DataFrame({'DAY':[1,12,20]})
df['date'] = pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d')
print (df)
DAY date
0 1 2014-01-01
1 12 2014-01-12
2 20 2014-01-20
For the same output, add Series.dt.strftime:
df['date'] = pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d').dt.strftime("%Y-%m-%d")
print (df)
DAY date
0 1 2014-01-01
1 12 2014-01-12
2 20 2014-01-20
EDIT:
For your function it is possible to use Series.apply:
transaction['DATE'] = transaction['DAY'].apply(serial_date_to_string)
Performance is similar; apply is actually the fastest here for 10k rows:
import datetime
import numpy as np
import pandas as pd

def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014,1,1,0,0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")

np.random.seed(2021)
df = pd.DataFrame({'DAY': np.random.randint(10, 1000, size=10000)})
In [17]: %timeit pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d').dt.strftime("%Y-%m-%d")
79.4 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [18]: %timeit df['DAY'].apply(serial_date_to_string)
57.1 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit df['DAY'].map(serial_date_to_string)
64.7 ms ± 5.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
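Note that most of the cost in the to_datetime version above is the final strftime back to strings. If the DATE column can stay as datetime64 instead of a formatted string, dropping strftime leaves only the vectorized conversion, which should be far faster than any of the timed variants (a sketch, not timed here):
# Sketch: keep DATE as datetime64 and skip the string formatting entirely
transaction['DATE'] = pd.to_datetime(transaction['DAY'] - 1, origin='2014-01-01', unit='d')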
I have several customers who get billed every 25th of each month. I want to find out their last billing date before their contract was terminated. Below is a sample from the dataframe:
data = [['Arthur','2019-03-01'],['Bart','2019-02-26'],['Cindy','2019-02-18'],['Douglas','2019-03-31']]
df = pd.DataFrame(data, columns=['Name','Termination Date'])
df
Furthermore, below is the expected output:
df['Last Billing Date'] = ['2019-02-25','2019-02-25','2019-01-25','2019-03-25']
df
Here is one way (converting Termination Date to datetime first, since the sample stores strings):
df['Termination Date'] = pd.to_datetime(df['Termination Date'])
s = df['Termination Date'].apply(lambda x: x.replace(day=25))
df['New'] = np.where(df['Termination Date'] >= s, s, s - pd.DateOffset(months=1))
df
Name Termination Date New
0 Arthur 2019-03-01 2019-02-25
1 Bart 2019-02-26 2019-02-25
2 Cindy 2019-02-18 2019-01-25
3 Douglas 2019-03-31 2019-03-25
If you want to do this in a vectorized way:
df['Termination Date'] = pd.to_datetime(df['Termination Date'])
before_25 = df['Termination Date'].dt.day < 25
df.loc[before_25, 'Termination Date'] = df.loc[before_25, 'Termination Date'] + pd.DateOffset(months=-1)
df['Last Billing Date'] = df['Termination Date'].apply(lambda dt: dt.replace(day=25))
A simple solution is to subtract a month if the day is before the 25th:
import datetime

def last_billing(termination_dt):
    if isinstance(termination_dt, str):  # parse if not already in datetime format
        termination_dt = datetime.datetime.strptime(termination_dt, '%Y-%m-%d')
    if termination_dt.day < 25:
        if termination_dt.month == 1:  # January rolls back to December of the prior year
            return termination_dt.replace(year=termination_dt.year - 1, month=12, day=25)
        return termination_dt.replace(day=25, month=termination_dt.month - 1)
    return termination_dt.replace(day=25)
df['Last Billing Date'] = df['Termination Date'].apply(last_billing)
Name Termination Date Last Billing Date
0 Arthur 2019-03-01 2019-02-25
1 Bart 2019-02-26 2019-02-25
2 Cindy 2019-02-18 2019-01-25
3 Douglas 2019-03-31 2019-03-25
If performance is an issue, you can vectorize the function with np.vectorize (which is still a Python-level loop under the hood, as the timings below reflect):
import numpy as np

@np.vectorize
def last_billing(termination_dt):
    if isinstance(termination_dt, str):
        termination_dt = datetime.datetime.strptime(termination_dt, '%Y-%m-%d')
    if termination_dt.day < 25:
        if termination_dt.month == 1:
            return termination_dt.replace(year=termination_dt.year - 1, month=12, day=25)
        return termination_dt.replace(day=25, month=termination_dt.month - 1)
    return termination_dt.replace(day=25)

df['Last Billing Date'] = last_billing(df['Termination Date'])
Time comparisons:
%timeit df['Last Billing Date'] = df['Termination Date'].apply(last_billing)
## 113 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['Last Billing Date'] = last_billing(df['Termination Date'])
## 108 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
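For a truly vectorized variant, here is a sketch of mine (not part of the answer above) that anchors every date at the 25th of its own month via period arithmetic and steps back one month where needed:
# Sketch: 25th of the termination month, minus one month when day < 25
td = pd.to_datetime(df['Termination Date'])
on_25 = td.dt.to_period('M').dt.to_timestamp() + pd.Timedelta(days=24)
df['Last Billing Date'] = on_25.where(td.dt.day >= 25, on_25 - pd.DateOffset(months=1))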
Here's what my data looks like. As you can see, there are some columns with DDMMMYYYY format, some are NaN and some are standard DD/MM/YYYY format.
completion_date_latest 15/03/2001
completion_date_original 15/03/2001
customer_birth_date_1 30/11/1970
customer_birth_date_2 20/11/1971
d_start 01Feb2018
latest_maturity_date 28/02/2021
latest_valuation_date 15/03/2001
sdate NaN
startdt_def NaN
obs_date 01Feb2018
I want to convert them to datetime fields. The column names are in a list called varlist2, and I'm looping through them to a) convert to datetime using the to_datetime function and b) replace the resulting NAs:
for date_var in varlist2:
    print('MM_Dates transform variable: ' + date_var)
    mm_dates_base[date_var] = pd.to_datetime(mm_dates_base[date_var], errors='ignore', dayfirst=True)
    mm_dates_base[date_var] = mm_dates_base[date_var].fillna('')
However, when I check my output, I get this, where d_start and obs_date haven't been converted. Any idea why this might be the case and what I can do to fix it?
In [111]: print(mm_dates_base.iloc[0])
completion_date_latest 2001-03-15 00:00:00
completion_date_original 2001-03-15 00:00:00
customer_birth_date_1 1970-11-30 00:00:00
customer_birth_date_2 1971-11-20 00:00:00
d_start 01Feb2018
latest_maturity_date 2021-02-28 00:00:00
latest_valuation_date 2001-03-15 00:00:00
sdate
startdt_def
obs_date 01Feb2018
Any ideas how I can treat the DDMMMYYYY dates at the same time?
You can select all the columns defined by varlist2 as a DataFrame, then use apply + to_datetime with errors='coerce' to convert problematic formats to NaT where conversion is not possible. Last, replace the NaTs via combine_first and assign back:
df1 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', dayfirst=True)
df2 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', format='%d%b%Y')
mm_dates_base[varlist2] = df1.combine_first(df2)
print (mm_dates_base)
completion_date_latest completion_date_original customer_birth_date_1 \
0 2001-03-15 2001-03-15 1970-11-30
customer_birth_date_2 d_start latest_maturity_date latest_valuation_date \
0 1971-11-20 2018-02-01 2021-02-28 2001-03-15
sdate startdt_def obs_date
0 NaT NaT 2018-02-01
Another, faster solution is to loop over each column:
for col in varlist2:
    a = pd.to_datetime(mm_dates_base[col], errors='coerce', dayfirst=True)
    b = pd.to_datetime(mm_dates_base[col], errors='coerce', format='%d%b%Y')
    mm_dates_base[col] = a.combine_first(b)
Quick comparison:
#[100 rows x 10 columns]
mm_dates_base = pd.concat([df] * 100, ignore_index=True)
In [41]: %%timeit
...:
...: for col in varlist2:
...: a = pd.to_datetime(mm_dates_base[col], errors='coerce', dayfirst=True)
...: b = pd.to_datetime(mm_dates_base[col], errors='coerce', format='%d%b%Y')
...: mm_dates_base[col] = a.combine_first(b)
...:
5.13 ms ± 46.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %%timeit
...: df1 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', dayfirst=True)
...: df2 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', format='%d%b%Y')
...:
...: mm_dates_base[varlist2] = df1.combine_first(df2)
...:
14.1 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The to_datetime function will usually detect the date format when converting, but the lack of separators in your d_start and obs_date values is probably what is tripping it up. You may have to parse those specific columns with an explicit format; from the looks of it, it follows something like %d%b%Y.
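A minimal sketch of that fix, assuming d_start and obs_date are the only columns in this style:
# Sketch: parse the no-separator columns with an explicit DDMonYYYY format
for col in ['d_start', 'obs_date']:
    mm_dates_base[col] = pd.to_datetime(mm_dates_base[col], format='%d%b%Y', errors='coerce')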