Here's what my data looks like. As you can see, some columns are in DDMMMYYYY format, some are NaN, and some are in standard DD/MM/YYYY format.
completion_date_latest 15/03/2001
completion_date_original 15/03/2001
customer_birth_date_1 30/11/1970
customer_birth_date_2 20/11/1971
d_start 01Feb2018
latest_maturity_date 28/02/2021
latest_valuation_date 15/03/2001
sdate NaN
startdt_def NaN
obs_date 01Feb2018
I want to convert them to datetime fields. I have the column names in a list called varlist2, and I'm looping through them to a) remove the NAs and b) convert to datetime using the to_datetime function:
for m in range(0, len(varlist2)):
    date_var = varlist2[m]
    print('MM_Dates transform variable: ' + date_var)
    mm_dates_base[date_var] = pd.to_datetime(mm_dates_base[date_var], errors='ignore', dayfirst=True)
    mm_dates_base[date_var] = mm_dates_base[date_var].fillna('')
However, when I check my output, I get this, where d_start and obs_date haven't been converted. Any idea why this might be the case and what I can do to fix it?
In [111]: print(mm_dates_base.iloc[0])
completion_date_latest 2001-03-15 00:00:00
completion_date_original 2001-03-15 00:00:00
customer_birth_date_1 1970-11-30 00:00:00
customer_birth_date_2 1971-11-20 00:00:00
d_start 01Feb2018
latest_maturity_date 2021-02-28 00:00:00
latest_valuation_date 2001-03-15 00:00:00
sdate
startdt_def
obs_date 01Feb2018
Any ideas how I can treat the DDMMMYYYY dates at the same time?
You can select all the columns defined in varlist2 as a DataFrame, then use apply + to_datetime with errors='coerce', which converts values whose format cannot be parsed to NaT. Finally, replace the NaTs using combine_first and assign back:
df1 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', dayfirst=True)
df2 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', format='%d%b%Y')
mm_dates_base[varlist2] = df1.combine_first(df2)
print (mm_dates_base)
completion_date_latest completion_date_original customer_birth_date_1 \
0 2001-03-15 2001-03-15 1970-11-30
customer_birth_date_2 d_start latest_maturity_date latest_valuation_date \
0 1971-11-20 2018-02-01 2021-02-28 2001-03-15
sdate startdt_def obs_date
0 NaT NaT 2018-02-01
Another, faster solution is to loop over each column:
for col in varlist2:
    a = pd.to_datetime(mm_dates_base[col], errors='coerce', dayfirst=True)
    b = pd.to_datetime(mm_dates_base[col], errors='coerce', format='%d%b%Y')
    mm_dates_base[col] = a.combine_first(b)
Quick performance comparison:
#[100 rows x 10 columns]
mm_dates_base = pd.concat([df] * 100, ignore_index=True)
In [41]: %%timeit
...:
...: for col in varlist2:
...: a = pd.to_datetime(mm_dates_base[col], errors='coerce', dayfirst=True)
...: b = pd.to_datetime(mm_dates_base[col], errors='coerce', format='%d%b%Y')
...: mm_dates_base[col] = a.combine_first(b)
...:
5.13 ms ± 46.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %%timeit
...: df1 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', dayfirst=True)
...: df2 = mm_dates_base[varlist2].apply(pd.to_datetime, errors='coerce', format='%d%b%Y')
...:
...: mm_dates_base[varlist2] = df1.combine_first(df2)
...:
14.1 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The to_datetime function will usually detect the date format when converting, but the lack of separators in your d_start and obs_date values is probably what is causing them to fail. You might have to run strptime() (or pass an explicit format) on those specific values/columns. You'll have to look into this, but from the looks of it the format is something like %d%b%Y.
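For example, a minimal sketch of that idea (my addition, not from the original answer): pass the explicit format to pd.to_datetime for just the two affected columns, whose names are taken from the question's output.
import pandas as pd

# Sketch: parse only the DDMMMYYYY columns with an explicit format;
# anything that still fails to parse becomes NaT rather than raising.
for col in ['d_start', 'obs_date']:
    mm_dates_base[col] = pd.to_datetime(mm_dates_base[col], format='%d%b%Y', errors='coerce')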
Within my dataframe I have two columns: 'release_date' and 'release_year'.
I am trying to replace the year value in each 'release_date' instance with the corresponding value in 'release_year'.
I have tried the following:
df.loc[:, 'release_date'] = df['release_date'].apply(lambda x: x.replace(x.year == df['release_year']))
However, I am getting the error: 'value must be an integer, received <class 'pandas.core.series.Series'> for year'
Having checked the dtype, the release_date column is stored as datetime64[ns]
You need to use pandas.DataFrame.apply here rather than pandas.Series.apply, since you need data from another column. Consider the following simple example:
import datetime
import pandas as pd
df = pd.DataFrame({'release_date':[datetime.date(1901,1,1),datetime.date(1902,1,1),datetime.date(1903,1,1)],'release_year':[2001,2002,2003]})
df['changed_date'] = df.apply(lambda x:x.release_date.replace(year=x.release_year),axis=1)
print(df)
output
release_date release_year changed_date
0 1901-01-01 2001 2001-01-01
1 1902-01-01 2002 2002-01-01
2 1903-01-01 2003 2003-01-01
Note axis=1, which means the function is applied to each row, and the function receives the row (a pandas.Series) as its argument.
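To make that concrete, here is a small illustration (my addition, using the df built above) of what one of those row objects looks like:
row = df.iloc[0]
print(type(row))            # <class 'pandas.core.series.Series'>
print(row['release_date'])  # 1901-01-01
print(row['release_year'])  # 2001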
Casting to string and then parsing to datetime is more efficient here, and also more readable if you ask me. Example:
import datetime
import pandas as pd
N = 100000
df = pd.DataFrame({'release_date': [datetime.date(1901,1,1), datetime.date(1902,1,1), datetime.date(1903,1,1)]*N,
                   'release_year': [2001,2002,2003]*N})
df['changed_date'] = pd.to_datetime(
    df['release_year'].astype(str) + df['release_date'].astype(str).str[5:],
    format="%Y%m-%d"
)
df['changed_date']
Out[176]:
0 2001-01-01
1 2002-01-01
2 2003-01-01
3 2001-01-01
4 2002-01-01
299995 2002-01-01
299996 2003-01-01
299997 2001-01-01
299998 2002-01-01
299999 2003-01-01
Name: changed_date, Length: 300000, dtype: datetime64[ns]
>>> %timeit df['changed_date'] = df.apply(lambda x:x.release_date.replace(year=x.release_year),axis=1)
6.73 s ± 542 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df['changed_date'] = pd.to_datetime(df['release_year'].astype(str)+df['release_date'].astype(str).str[5:], format="%Y%m-%d")
651 ms ± 78.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I want to create a new column DATE in the dataframe transaction from the column DAY (which lists days from the beginning of the study). For example, if DAY is 1, I want DATE to show 2014-1-1, and if DAY is 12, it should be 2014-1-12. This is what I am doing to convert days into dates, and it works:
import datetime
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014,1,1,0,0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")
However, when I try to use this formula to add a new column, it doesn't work:
transaction['DATE'] = serial_date_to_string(transaction['DAY'])
TypeError: unsupported type for timedelta days component: Series
But the DAY column's type is int64. I searched some forums and found that the function could be adjusted, so I tried this:
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014,1,1,0,0) + (srl_no - 1).map(datetime.timedelta)
    return new_date.strftime("%Y-%m-%d")
It still gives AttributeError: 'Series' object has no attribute 'strftime'.
Thank you for any help!
Use map:
transaction['DATE'] = transaction['DAY'].map(serial_date_to_string)
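For illustration only (my example, not from the original post), assuming the serial_date_to_string function from the question and a minimal hypothetical transaction frame:
import pandas as pd

# Hypothetical minimal 'transaction' frame, just to show the result of map
transaction = pd.DataFrame({'DAY': [1, 12, 20]})
transaction['DATE'] = transaction['DAY'].map(serial_date_to_string)
print(transaction)
#    DAY        DATE
# 0    1  2014-01-01
# 1   12  2014-01-12
# 2   20  2014-01-20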
Use to_datetime with the origin parameter for the starting day and unit='d' for days:
df = pd.DataFrame({'DAY':[1,12,20]})
df['date'] = pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d')
print (df)
DAY date
0 1 2014-01-01
1 12 2014-01-12
2 20 2014-01-20
For the same output as your function (date strings), add Series.dt.strftime:
df['date'] = pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d').dt.strftime("%Y-%m-%d")
print (df)
DAY date
0 1 2014-01-01
1 12 2014-01-12
2 20 2014-01-20
EDIT:
Your function can also be used with Series.apply:
transaction['DATE'] = transaction['DAY'].apply(serial_date_to_string)
Performance is similar; apply is fastest here for 10k rows:
import datetime
import numpy as np
import pandas as pd

def serial_date_to_string(srl_no):
    new_date = datetime.datetime(2014,1,1,0,0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")

np.random.seed(2021)
df = pd.DataFrame({'DAY': np.random.randint(10, 1000, size=10000)})
In [17]: %timeit pd.to_datetime(df['DAY'] - 1, origin='2014-01-01', unit='d').dt.strftime("%Y-%m-%d")
79.4 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [18]: %timeit df['DAY'].apply(serial_date_to_string)
57.1 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit df['DAY'].map(serial_date_to_string)
64.7 ms ± 5.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have a dataframe where the hour column contains datetime data in UTC. I have a time_zone column with time zones for each observation, and I'm using it to convert hour to the local time and save it in a new column named local_hour. To do this, I'm using the following code:
import pandas as pd
# Sample dataframe
df = pd.DataFrame({
    'hour': ['2019-01-01 05:00:00', '2019-01-01 07:00:00', '2019-01-01 08:00:00'],
    'time_zone': ['US/Eastern', 'US/Central', 'US/Mountain']
})
# Ensure hour is in datetime format and localized to UTC
df['hour'] = pd.to_datetime(df['hour']).dt.tz_localize('UTC')
# Add local_hour column with hour in local time
df['local_hour'] = df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
df
hour time_zone local_hour
0 2019-01-01 05:00:00+00:00 US/Eastern 2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00 US/Central 2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00 US/Mountain 2019-01-01 01:00:00-07:00
The code works. However using apply runs quite slowly since in reality I have a large dataframe. Is there a way to vectorize this or otherwise speed it up?
Note: I have tried using the swifter package, but in my case it doesn't speed things up.
Assuming there is not an unbounded number of distinct time_zone values, you could perform a tz_convert per group, like:
df['local_hour'] = df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
print (df)
hour time_zone local_hour
0 2019-01-01 05:00:00+00:00 US/Eastern 2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00 US/Central 2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00 US/Mountain 2019-01-01 01:00:00-07:00
On this sample it will probably be slower than what you did, but on bigger data with larger groups it should be faster.
For speed comparison, with the df of 3 rows you provided, it gives:
%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 1.6 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 2.58 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
so apply is faster here, but if you create a dataframe 1000 times bigger that still has only 3 time zones, groupby becomes about 20 times faster:
df = pd.concat([df]*1000, ignore_index=True)
%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 585 ms ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 27.5 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have a data frame like this:
Date Quote-Spread
0 2013-11-17 0.010000
1 2013-12-10 0.020000
2 2013-12-11 0.013333
3 2014-06-01 0.050000
4 2014-06-23 0.050000
When I use this code I get an error:
import pandas as pd
pd.to_datetime(df1['Date'] ,format ="%Y%m%d")
ValueError: time data '2013-11-17' does not match format '%Y%m%d' (match)
How can I correct this error?
Use to_datetime only:
df1['Date'] = pd.to_datetime(df1['Date'])
print (df1['Date'])
0 2013-11-17
1 2013-12-10
2 2013-12-11
3 2014-06-01
4 2014-06-23
Name: Date, dtype: datetime64[ns]
Or, if you want to specify the format, add the - separators, because %Y%m%d matches YYYYMMDD and your format is YYYY-MM-DD:
pd.to_datetime(df1['Date'], format ="%Y-%m-%d")
to_datetime is the way to go. It is also the fastest compared to the alternatives of using a list comprehension or apply.
import pandas as pd
import datetime
# Create dataset
df1 = pd.DataFrame(dict(Date=['2013-11-17','2013-12-10']*10000))
Alt1, list comprehension:
df1.Date = [datetime.datetime.strptime(i,"%Y-%m-%d") for i in df1.Date.values]
Alt2, apply:
df1.Date = df1.Date.apply(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d"))
Alt3, to_datetime:
df1.Date = pd.to_datetime(df1.Date)
Timings
1 loop, best of 3: 744 ms per loop #1
1 loop, best of 3: 793 ms per loop #2
100 loops, best of 3: 18.5 ms per loop #3
I have an irregularly indexed time series of data with seconds resolution like:
import pandas as pd
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
       '2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
df = pd.DataFrame(status, index=idx, columns = ['status'])
df = df.reindex(pd.to_datetime(df.index))
In [62]: df
Out[62]:
status
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
and I am interested in the fraction of the year when the status is 1. The way I currently do it is that I reindex df with every second in the year and use forward filling like:
full_idx = pd.date_range(start = '1/1/2012', end = '12/31/2012', freq='s')
df1 = df.reindex(full_idx, method='ffill')
which returns a DataFrame containing every second of the year. I can then calculate the mean to see the fraction of time spent in status 1:
In [66]: df1
Out[66]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 31536001 entries, 2012-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: S
Data columns:
status 31490186 non-null values
dtypes: float64(1)
In [67]: df1.status.mean()
Out[67]: 0.31953371123308066
The problem is that I have to do this for a lot of data, and reindexing it for every second in the year is the most expensive operation by far.
What are better ways to do this?
There doesn't seem to be a pandas method to compute time differences between entries of an irregular time series, though there is a convenience method to convert a time series index to an array of datetime.datetime objects, which can be converted to datetime.timedelta objects through subtraction.
In [6]: start_end = pd.DataFrame({'status': [0, 0]},
index=[pd.datetools.parse('1/1/2012'),
pd.datetools.parse('12/31/2012')])
In [7]: df = df.append(start_end).sort()
In [8]: df
Out[8]:
status
2012-01-01 00:00:00 0
2012-01-01 12:43:35 1
2012-03-12 15:46:43 0
2012-09-26 18:35:11 1
2012-11-11 02:34:59 0
2012-12-31 00:00:00 0
In [9]: pydatetime = pd.Series(df.index.to_pydatetime(), index=df.index)
In [11]: df['duration'] = pydatetime.diff().shift(-1).\
map(datetime.timedelta.total_seconds, na_action='ignore')
In [16]: df
Out[16]:
status duration
2012-01-01 00:00:00 0 45815
2012-01-01 12:43:35 1 6145388
2012-03-12 15:46:43 0 17117308
2012-09-26 18:35:11 1 3916788
2012-11-11 02:34:59 0 4310701
2012-12-31 00:00:00 0 NaN
In [17]: (df.status * df.duration).sum() / df.duration.sum()
Out[17]: 0.31906950786402843
Note:
Our answers seem to differ because I set status before the first timestamp to zero, while those entries are NA in your df1 as there's no start value to forward fill and NA values are excluded by pandas mean().
timedelta.total_seconds() is new in Python 2.7.
Timing comparison of this method versus reindexing:
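The two helper functions being timed are not shown in the original post; they presumably wrap the two approaches roughly as follows (a sketch under that assumption, with delta_method expecting the df that already has the start/end rows appended):
import datetime
import pandas as pd

def delta_method(df):
    # Duration-weighted mean of status, mirroring In [6]-In [17] above
    pydatetime = pd.Series(df.index.to_pydatetime(), index=df.index)
    duration = pydatetime.diff().shift(-1).map(
        datetime.timedelta.total_seconds, na_action='ignore')
    return (df.status * duration).sum() / duration.sum()

def reindexing(df):
    # The question's approach: one row per second of the year, forward filled
    full_idx = pd.date_range(start='1/1/2012', end='12/31/2012', freq='s')
    return df.reindex(full_idx, method='ffill').status.mean()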
In [8]: timeit delta_method(df)
1000 loops, best of 3: 1.3 ms per loop
In [9]: timeit reindexing(df)
1 loops, best of 3: 2.78 s per loop
Another potential approach is to use traces.
import traces
from dateutil.parser import parse as date_parse
idx = ['2012-01-01 12:43:35', '2012-03-12 15:46:43',
       '2012-09-26 18:35:11', '2012-11-11 2:34:59']
status = [1, 0, 1, 0]
# create a TimeSeries from date strings and status
ts = traces.TimeSeries(default=0)
for date_string, status_value in zip(idx, status):
    ts[date_parse(date_string)] = status_value
# compute distribution
ts.distribution(
    start=date_parse('2012-01-01'),
    end=date_parse('2013-01-01'),
)
# {0: 0.6818022667476219, 1: 0.31819773325237805}
The value is calculated between the start of January 1, 2012 and the end of December 31, 2012 (equivalently, the start of January 1, 2013) without resampling, assuming the status is 0 at the start of the year (the default=0 parameter).
Timing results:
In [2]: timeit ts.distribution(
   ...:     start=date_parse('2012-01-01'),
   ...:     end=date_parse('2013-01-01')
   ...: )
619 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)