I need to resample some data in Pandas using the code below; on my data it takes 5 hours.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
This is prohibitively slow.
How can I speed up the above code on data like:
id date value
1 16-12-1 9
1 16-12-1 8
1 17-1-1 18
2 17-3-4 19
2 17-3-4 20
1 17-4-3 21
2 17-7-13 12
3 17-8-9 12
2 17-9-12 11
1 17-11-12 19
3 17-11-12 21
giving output:
id date
1 2016-12-04 17
2017-01-01 18
2017-04-09 21
2017-11-12 19
2 2017-03-05 39
2017-07-16 12
2017-09-17 11
3 2017-08-13 12
2017-11-12 21
Name: value, dtype: int64
I set up date as the index, but the code is still very slow. Any help would be great.
Give this a try.
I am going to use pd.Grouper() with the frequency set to daily, hoping that it is faster. I am also getting rid of agg and calling .sum() directly.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df2 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
Results:
id date
1 2016-12-01 17
2017-01-01 18
2017-04-03 21
2017-11-12 19
2 2017-03-04 39
2017-07-13 12
2017-09-12 11
3 2017-08-09 12
2017-11-12 21
Hope this works.
[EDIT]
So I just did a small test comparing both methods on a randomly generated df with 100,000 rows:
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 30, size=100000),
                  columns=["id"],
                  index=pd.date_range("19300101", periods=100000))
df['value'] = np.random.randint(0, 10, size=100000)
and timed both approaches; the results are:
Using resample:
startTime = time.time()
df2 = df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
print(time.time()-startTime)
1.0451831817626953 seconds
Using pd.Grouper():
startTime = time.time()
df3 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
print(time.time()-startTime)
0.08430838584899902 seconds
So pd.Grouper() is approximately 12 times faster (if my math is correct)!
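Note that the pd.Grouper() version above drops the .loc[lambda x: x>0] filter from the original code; a minimal sketch that adds the same positive-sum filter back, assuming df is already indexed by date as above:
df2 = (df.groupby(['id', pd.Grouper(freq='D')])['value']
         .sum()
         .loc[lambda x: x > 0])  # keep only positive daily sums, as in the question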
I have a dataframe as shown below:
df = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                   'offset': ['-131 days', '29 days', '142 days', '20 days', '-200 days'],
                   'date_1': ['05/29/2017', '01/21/1997', '7/27/1989', '01/01/2013', '12/31/2016'],
                   'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999', '01/01/2015', '12/31/1991'],
                   'vis_date': ['05/29/2018', '01/27/1994', '7/29/2011', '01/01/2018', '12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit - SO), I am looking for a more elegant approach; you can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
# trying to make the lines below more elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to look as shown below.
Use DataFrame.add (passing axis 0 so the offset Series is aligned row-wise) along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], 0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
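For comparison, the repetition in the question could also be collapsed into a plain loop over the column names; a minimal sketch using the same cols list:
for col in cols:
    df['shifted_' + col] = df[col] + df['offset_to_shift']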
I have two time-series below. Datetime indices are TZ-aware.
df1: five-minute interval
value_1
Timestamp
2009-04-01 10:50:00+09:30 50
2009-04-05 11:55:00+09:30 55
2009-04-23 16:00:00+09:30 0
2009-05-03 10:50:00+09:30 50
2009-05-07 11:55:00+09:30 55
2009-05-11 16:00:00+09:30 0
2009-07-04 02:05:00+09:30 5
2009-07-21 09:10:00+09:30 10
2009-07-30 12:15:00+09:30 15
2010-09-02 11:25:00+09:30 25
2010-09-22 15:30:00+09:30 30
2010-09-30 06:15:00+09:30 15
2010-12-06 11:25:00+09:30 25
2010-12-22 15:30:00+09:30 30
2010-12-28 06:15:00+09:30 15
df2: Monthly interval obtained by groupby('Month') from a different dataset.
value_2
Timestamp
2009-04-30 00:00:00+09:30 23
2009-07-31 00:00:00+09:30 28
2010-12-31 00:00:00+09:30 23
I want to combine the two datasets by index. Any record in df1 should be included in the final results if it has the same month as df2. The expected result is below.
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
These are my attempts.
result = pd.concat([df1, df2], axis=1)
# this combines the datasets, but not as expected, even with join="outer"; with join="inner", no data is shown.
result = pd.merge(df1, df2, left_on='value_1', right_index=True)
# this returns ValueError: You are trying to merge on Int64 and datetime64[ns, Australia/North] columns. If you wish to proceed you should use pd.concat
# Using @Ben.T's suggestion
mt_hMF = df1.merge(df2.reset_index().set_index(df2.index.floor('M')),
                   how='left', left_index=True, right_index=True).set_index('Timestamp')
# This gives ValueError: <MonthEnd> is a non-fixed frequency
Try this, using strftime to create a temporary merge key for both dataframes:
df1.reset_index()\
   .assign(yearmonth=df1.index.strftime('%Y%m'))\
   .merge(df2.assign(yearmonth=df2.index.strftime('%Y%m')))\
   .set_index('Timestamp')\
   .drop('yearmonth', axis=1)
Output:
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
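The same temporary-key idea can also be written with a monthly Period instead of a formatted string; a minimal sketch, assuming df1 and df2 as above (to_period drops the timezone from the key, which is fine because the key is only used for matching):
out = (df1.reset_index()
          .assign(month=df1.index.to_period('M'))
          .merge(df2.assign(month=df2.index.to_period('M')))
          .set_index('Timestamp')
          .drop('month', axis=1))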
I am trying to merge the following files
df1
unix_time,hk1,hk2,val2,hint
1560752700,10,15,3,6:25am
1560753900,20,25,5,6:45am
1560756600,10,10,-1,7:30am
df2
unix_time,hk1,hk2,val,hint
1560751200,10,15,1,6am
1560754800,20,25,2,7am
1560758400,10,10,3,8am
on unix_time
I am trying to do this as follows
merged = pd.merge_asof(df2.sort_values('unix_time'),
                       df1.sort_values('unix_time'),
                       by=['hk1', 'hk2'],
                       on='unix_time',
                       tolerance=pd.Timedelta(seconds=1800),
                       direction='nearest')
According to the docs, merge_asof's tolerance can be specified as a pd.Timedelta.
But when I run the above piece of code I get:
pandas.errors.MergeError: incompatible tolerance <class 'pandas._libs.tslibs.timedeltas.Timedelta'>, must be compat with type int64
How do I fix it?
Thank you
The expected joined vals output for the above example:
val | val2
1 | 3
2 | 5
3 | -1
Use tolerance=1800; since unix_time is an int64 column, the tolerance must be an integer in the same units (seconds here):
merged = pd.merge_asof(df2.sort_values('unix_time'),
                       df1.sort_values('unix_time'),
                       by=['hk1', 'hk2'],
                       on='unix_time',
                       tolerance=1800,
                       direction='nearest')
print (merged)
unix_time hk1 hk2 val hint_x val2 hint_y
0 1560751200 10 15 1 6am 3 6:25am
1 1560754800 20 25 2 7am 5 6:45am
2 1560758400 10 10 3 8am -1 7:30am
Or convert both columns to datetimes before merge_asof if you want to use your original solution:
df1['unix_time'] = pd.to_datetime(df1['unix_time'], unit='s')
df2['unix_time'] = pd.to_datetime(df2['unix_time'], unit='s')
merged = pd.merge_asof(df2.sort_values('unix_time'),
                       df1.sort_values('unix_time'),
                       by=['hk1', 'hk2'],
                       on='unix_time',
                       tolerance=pd.Timedelta(seconds=1800),
                       direction='nearest')
print (merged)
unix_time hk1 hk2 val hint_x val2 hint_y
0 2019-06-17 06:00:00 10 15 1 6am 3 6:25am
1 2019-06-17 07:00:00 20 25 2 7am 5 6:45am
2 2019-06-17 08:00:00 10 10 3 8am -1 7:30am
Is there a (more) convenient/efficient method to calculate the number of business days between two dates using pandas?
I could do
len(pd.bdate_range(start='2018-12-03',end='2018-12-14'))-1 # minus one only if end date is a business day
but for longer distances between the start and end day this seems rather inefficient.
There are a couple of suggestions on how to use the BDay offset object, but they all seem to refer to the creation of date ranges or something similar.
I am thinking more in terms of a Timedelta object that is represented in business-days.
Say I have two series, s1 and s2, containing datetimes. If pandas had something along the lines of
s1.dt.subtract(s2,freq='B')
# giving a new series containing timedeltas where the number of days calculated
# use business days only
would be nice.
(numpy has a busday_count() method. But I would not want to convert my pandas Timestamps to numpy, as this can get messy.)
I think np.busday_count is a good idea here; also, converting to numpy arrays is not necessary:
s1 = pd.Series(pd.date_range(start='05/01/2019',end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019',periods=10, freq='5d'))
s = pd.Series([np.busday_count(a, b) for a, b in zip(s1, s2)])
print (s)
0 3
1 5
2 7
3 10
4 14
5 17
6 19
7 23
8 25
9 27
dtype: int64
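If the Python loop becomes a bottleneck, np.busday_count also accepts array inputs; a minimal vectorized sketch, assuming the series are tz-naive (busday_count works on plain calendar dates):
# cast to datetime64[D] so np.busday_count can consume the arrays directly
s = pd.Series(np.busday_count(s1.values.astype('datetime64[D]'),
                              s2.values.astype('datetime64[D]')))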
from xone import calendar

def business_dates(start, end):
    us_cal = calendar.USTradingCalendar()
    kw = dict(start=start, end=end)
    return pd.bdate_range(**kw).drop(us_cal.holidays(**kw))
In [1]: business_dates(start='2018-12-20', end='2018-12-31')
Out[1]: DatetimeIndex(['2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26',
'2018-12-27', '2018-12-28', '2018-12-31'],
dtype='datetime64[ns]', freq=None)
Source: Get business days between start and end date using pandas
# create a dataframe with the dates
df = pd.DataFrame({'dates': pd.date_range(start='05/01/2019', end='05/31/2019')})
# keep only the dates that fall on business days
df[df['dates'].isin(pd.bdate_range(df['dates'].get(0), df['dates'].get(len(df)-1)))]
out[]:
0 2019-05-01
1 2019-05-02
2 2019-05-03
5 2019-05-06
6 2019-05-07
7 2019-05-08
8 2019-05-09
9 2019-05-10
12 2019-05-13
13 2019-05-14
14 2019-05-15
15 2019-05-16
16 2019-05-17
19 2019-05-20
20 2019-05-21
21 2019-05-22
22 2019-05-23
23 2019-05-24
26 2019-05-27
27 2019-05-28
28 2019-05-29
29 2019-05-30
30 2019-05-31
My data looks like below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-25,44
2,2016-10-27,12
I want to fill in the missing dates within each id.
For example, the date range of id=1 is 2016-10-24 ~ 2016-10-28, and 2016-10-26 is missing. Moreover, the date range of id=2 is 2016-10-21 ~ 2016-10-27, and 2016-10-23, 2016-10-24 and 2016-10-26 are missing.
I want to fill in the missing dates and fill in the target value as 0.
Therefore, I want my data to be as below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-26,0
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-23,0
2,2016-10-24,0
2,2016-10-25,44
2,2016-10-26,0
2,2016-10-27,12
Can somebody help me?
Thanks in advance.
You can use groupby with resample; the problem is then fillna, which cannot fill with a value on the resampled object, so asfreq is needed first:
#if necessary convert to datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
df = df.groupby('id').resample('d')['target'].asfreq().fillna(0).astype(int).reset_index()
print (df)
id date target
0 1 2016-10-24 22
1 1 2016-10-25 31
2 1 2016-10-26 0
3 1 2016-10-27 44
4 1 2016-10-28 12
5 2 2016-10-21 22
6 2 2016-10-22 31
7 2 2016-10-23 0
8 2 2016-10-24 0
9 2 2016-10-25 44
10 2 2016-10-26 0
11 2 2016-10-27 12