This question is about efficiently applying a custom function on logical groups of rows in a Pandas dataframe, which share a value in some column.
Consider the following example of a dataframe:
import numpy as np
import pandas as pd

sID = [1,1,1,2,4,4,5,5,5]
data = np.random.randn(len(sID))
dates = pd.date_range(start='1/1/2018', periods=len(sID))
mydf = pd.DataFrame({"subject_id":sID, "data":data, "date":dates})
mydf.loc[5, 'date'] += pd.Timedelta('2 days')
which looks like:
data date subject_id
0 0.168150 2018-01-01 1
1 -0.484301 2018-01-02 1
2 -0.522980 2018-01-03 1
3 -0.724524 2018-01-04 2
4 0.563453 2018-01-05 4
5 0.439059 2018-01-08 4
6 -1.902182 2018-01-07 5
7 -1.433561 2018-01-08 5
8 0.586191 2018-01-09 5
Imagine that for each subject_id, we want to subtract from each date the first date encountered for this subject_id. Storing the result in a new column "days_elapsed", the result will look like this:
data date subject_id days_elapsed
0 0.168150 2018-01-01 1 0
1 -0.484301 2018-01-02 1 1
2 -0.522980 2018-01-03 1 2
3 -0.724524 2018-01-04 2 0
4 0.563453 2018-01-05 4 0
5 0.439059 2018-01-08 4 3
6 -1.902182 2018-01-07 5 0
7 -1.433561 2018-01-08 5 1
8 0.586191 2018-01-09 5 2
One natural way of doing this is by using groupby and apply:
g_df = mydf.groupby('subject_id')
mydf.loc[:, "days_elapsed"] = g_df["date"].apply(lambda x: x - x.iloc[0]).astype('timedelta64[D]').astype(int)
However, if the number of groups (subject IDs) is large (e.g. 10^4), say only about 10 times smaller than the length of the dataframe, this very simple operation becomes really slow.
Is there any faster method?
PS: I have also tried setting the index to subject_id and then using the following list comprehension:
def get_first(series, ind):
    """Return the first row of the group in `series` corresponding to index `ind`
    (the group may span multiple rows)."""
    group = series.loc[ind]
    if hasattr(group, 'iloc'):
        return group.iloc[0]
    else:  # this is for indices with a single element
        return group
hind_df = mydf.set_index('subject_id')
A = pd.concat([hind_df["date"].loc[ind] - get_first(hind_df["date"], ind) for ind in np.unique(hind_df.index)])
However, it's even slower.
You can use GroupBy + transform with first. This should be more efficient as it avoids expensive lambda function calls.
You may also see a performance improvement by working with the NumPy array via pd.Series.values:
first = df.groupby('subject_id')['date'].transform('first').values
df['days_elapsed'] = (df['date'].values - first).astype('timedelta64[D]').astype(int)
print(df)
subject_id data date days_elapsed
0 1 1.079472 2018-01-01 0
1 1 -0.197255 2018-01-02 1
2 1 -0.687764 2018-01-03 2
3 2 0.023771 2018-01-04 0
4 4 -0.538191 2018-01-05 0
5 4 1.479294 2018-01-08 3
6 5 -1.993196 2018-01-07 0
7 5 -2.111831 2018-01-08 1
8 5 -0.934775 2018-01-09 2
Alternatively, subtract the per-group minimum date and use the .dt.days accessor:
mydf['days_elapsed'] = (mydf['date'] - mydf.groupby(['subject_id'])['date'].transform('min')).dt.days
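To get a feel for the difference, here is a rough way to time the apply-based and transform-based approaches on a larger frame (a sketch; the group count and group size below are made up for illustration):
import numpy as np
import pandas as pd
from timeit import timeit

# made-up sizes: 10^4 groups of 10 rows each
n_groups, rows_per_group = 10_000, 10
big = pd.DataFrame({
    "subject_id": np.repeat(np.arange(n_groups), rows_per_group),
    "date": pd.Timestamp("2018-01-01") + pd.to_timedelta(np.arange(n_groups * rows_per_group), unit="s"),
})

apply_way = lambda: big.groupby("subject_id")["date"].apply(lambda x: x - x.iloc[0])
transform_way = lambda: big["date"] - big.groupby("subject_id")["date"].transform("first")

print(timeit(apply_way, number=3))      # per-group Python lambda calls
print(timeit(transform_way, number=3))  # single vectorized subtraction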
I have a pandas data frame that looks like this:
Count Status
Date
2021-01-01 11 1
2021-01-02 13 1
2021-01-03 14 1
2021-01-04 8 0
2021-01-05 8 0
2021-01-06 5 0
2021-01-07 2 0
2021-01-08 6 1
2021-01-09 8 1
2021-01-10 10 0
I want to calculate the difference between the initial and final value of the "Count" column before the "Status" column changes from 0 to 1 or vice-versa (for every cycle) and make a new dataframe out of these values.
The output for this example would be:
Cycle Difference
1 3
2 -6
3 2
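For reference, the example frame can be rebuilt like this (a sketch based on the table above; the index name and dtypes are assumptions):
import pandas as pd

df = pd.DataFrame(
    {"Count": [11, 13, 14, 8, 8, 5, 2, 6, 8, 10],
     "Status": [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]},
    index=pd.date_range("2021-01-01", periods=10, name="Date"),
)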
Use GroupBy.agg on consecutive groups created by comparing shifted values with a cumulative sum, then subtract the first value from the last:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last'])
        .eval('last - first')
        .reset_index(name='Difference'))
print(df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
3 4 0
If you need to filter out groups that contain only one row, add a GroupBy.size aggregation and then filter the rows with DataFrame.loc:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last', 'size'])
        .loc[lambda x: x['size'] > 1]
        .eval('last - first')
        .reset_index(name='Difference'))
print(df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
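As an aside, it helps to look at the grouping key on its own: the compare-shifted-values-then-cumsum trick labels each consecutive run of equal Status values (a short illustration using the original df from the question, not part of the answer above):
cycle = df['Status'].ne(df['Status'].shift()).cumsum()
print(cycle.tolist())
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 4] for the example Status column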
You can use a GroupBy.agg on the groups formed from the consecutive values, then take the last value minus the first (see below for variants):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0])
      )
output:
Status
1 3
2 -6
3 2
4 0
Name: Count, dtype: int64
If you only want to do this for groups of more than one element:
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna()
      )
output:
Status
1 3
2 -6
3 2
Name: Count, dtype: object
output as DataFrame:
add .rename_axis('Cycle').reset_index(name='Difference'):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna()
         .rename_axis('Cycle').reset_index(name='Difference')
      )
output:
Cycle Difference
0 1 3
1 2 -6
2 3 2
I have a dataframe:
df1 = pd.DataFrame(
    [['2011-01-01', '2011-01-03', 'A'], ['2011-04-01', '2011-04-01', 'A'],
     ['2012-08-28', '2012-08-30', 'B'], ['2015-04-03', '2015-04-05', 'A'],
     ['2015-08-21', '2015-08-21', 'B']],
    columns=['d0', 'd1', 'event'])
d0 d1 event
0 2011-01-01 2011-01-03 A
1 2011-04-01 2011-04-01 A
2 2012-08-28 2012-08-30 B
3 2015-04-03 2015-04-05 A
4 2015-08-21 2015-08-21 B
It contains events A and B that occurred over the interval from d0 to d1. (There are actually more events; they are interleaved, but their date ranges do not overlap.) The interval can also be a single day (d0 = d1). I need to go from df1 to df2, in which these time intervals are "unrolled" for each event, i.e.:
df2 = pd.DataFrame(
    [['2011-01-01', 'A'], ['2011-01-02', 'A'], ['2011-01-03', 'A'], ['2011-04-01', 'A'],
     ['2012-08-28', 'B'], ['2012-08-29', 'B'], ['2012-08-30', 'B'],
     ['2015-04-03', 'A'], ['2015-04-04', 'A'], ['2015-04-05', 'A'], ['2015-08-21', 'B']],
    columns=['Date', 'event'])
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I tried doing this based on resample and comparing areas where ffill equals bfill, but couldn't come up with anything. How can this be done in the simplest way?
We can set_index to event, create a date_range per row, then explode to unwind the ranges, and reset_index to create the DataFrame:
df2 = (
    df1.set_index('event')
       .apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1)
       .explode()
       .reset_index(name='Date')[['Date', 'event']]
)
df2:
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
Let us try a comprehension to create the pairs of date and event:
pd.DataFrame(((d, c) for (*v, c) in df1.to_numpy()
              for d in pd.date_range(*v)), columns=['Date', 'Event'])
Date Event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I don't know if this is the "most simple," but it's the most intuitive way I can think to do it. I iterate over the rows and unroll it manually into a new dataframe. This means that I look at each row, iterate over the dates between d0 and d1, and construct a row for each of them and compile them into a dataframe:
from datetime import timedelta

def unroll_events(df):
    rows = []
    for _, row in df.iterrows():
        event = row['event']
        start = row['d0']
        end = row['d1']
        current = start
        while current != end:
            rows.append(dict(Date=current, event=event))
            current += timedelta(days=1)
        rows.append(dict(Date=current, event=event))  # make sure the last date is included
    return pd.DataFrame(rows)
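Usage might look like this (a sketch; d0 and d1 must be actual datetimes, so convert them first since df1 above holds strings):
import pandas as pd

df1[['d0', 'd1']] = df1[['d0', 'd1']].apply(pd.to_datetime)
df2 = unroll_events(df1)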
I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row is: for each group_id, the row that has the minimum cust_hierarchy; if there is a tie, the row with the most recent date.
I have supplied a sample table below:
row_id  group_id  cust_hierarchy  most_recent_date  master (I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id.
Does anyone have any helpful code for this?
You can basically do a groupby with idxmin(), but with a little bit of sorting to ensure the most recent date is selected by the min operation:
import pandas as pd
import numpy as np
# example data
dates = ['2020-01-03', '2019-01-01', '2019-05-01',
         '2019-04-01', '2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id': [0, 0, 1, 1, 1],
                   'cust_hierarchy': [2, 7, 7, 6, 6],
                   'most_recent_date': dates})
# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int)
                    )
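With the example frame built in the previous answer, this marks the same two rows as masters (a quick check added here, not part of the original answer):
print(df['master'].tolist())
# expected: [1, 0, 0, 0, 1]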
I have a pandas data frame mydf that has two columns, mydate and mytime, and both are datetime dtypes. I want to add three more columns: hour, weekday, and weeknum.
def getH(t):  # gives the hour
    return t.hour

def getW(d):  # gives the week number
    return d.isocalendar()[1]

def getD(d):  # gives the weekday
    return d.weekday()  # 0 for Monday, 6 for Sunday

mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient, as it loops through the data frame at least three times. I would just like to know if there's a faster and/or more optimal way to do this, for example using zip or merge. If I just create one function that returns three elements, how should I implement this? To illustrate, the function would be:
def getHWd(d, t):
    return t.hour, d.isocalendar()[1], d.weekday()
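As a side note, if mydate and mytime are datetime64 columns, the vectorized .dt accessor can compute all three without apply (a sketch; .dt.isocalendar() needs pandas >= 1.1):
mydf["hour"] = mydf["mytime"].dt.hour
mydf["weekday"] = mydf["mydate"].dt.weekday              # 0 = Monday, 6 = Sunday
mydf["weeknum"] = mydf["mydate"].dt.isocalendar().week   # ISO week number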
Here's one approach, using a single apply.
Say, df is like
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll pull the lambda function out onto a separate line for readability and define it like:
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                           x['mydate'].weekday(),
                                           x['mydate'].isocalendar()[1]])
And, apply and store the result to df[['hour', 'weekday', 'weeknum']]
In [66]: df[['hour', 'weekday', 'weeknum']] = df.apply(lambdafunc, axis=1)
And, the output is like
In [67]: df
Out[67]:
        mydate     mytime  hour  weekday  weeknum
0   2011-01-01 2011-11-14     0        5       52
1   2011-01-02 2011-11-15     0        6       52
2   2011-01-03 2011-11-16     0        0        1
3   2011-01-04 2011-11-17     0        1        1
4   2011-01-05 2011-11-18     0        2        1
5   2011-01-06 2011-11-19     0        3        1
6   2011-01-07 2011-11-20     0        4        1
7   2011-01-08 2011-11-21     0        5        1
8   2011-01-09 2011-11-22     0        6        1
9   2011-01-10 2011-11-23     0        0        2
10  2011-01-11 2011-11-24     0        1        2
11  2011-01-12 2011-11-25     0        2        2
To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                  x['mydate'].weekday(),
                                  x['mydate'].isocalendar()[1]])
newcols = df.apply(lambdafunc, axis=1)
newcols.columns = ['hour', 'weekday', 'weeknum']
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
You can do this in a somewhat cleaner method by having the function you apply return a pd.Series with named elements:
def process(row):
    return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))

my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
def getWd(d):
    return d.weekday(), d.isocalendar()[1]  # (weekday, week number)

def getH(t):
    return t.hour

mydf["hour"] = mydf["mytime"].map(getH)
mydf["weekday"], mydf["weeknum"] = zip(*mydf["mydate"].map(getWd))
Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.
pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
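For example, on a recent pandas this now works directly on datetime data (a minimal sketch, not from the original post):
import pandas as pd

s = pd.Series(pd.date_range('2013-01-01', periods=10, freq='D'))
print(pd.cut(s, bins=3))   # three equal-width datetime bins
print(pd.qcut(s, q=2))     # two quantile-based datetime bins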
Original question: Pandas cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes in the mix. Frustrating since pandas is so great at all the time-related stuff!
Here's a simple example:
def randomDates(size, start=134e7, end=137e7):
    return np.array(np.random.randint(start, end, size), dtype='datetime64[s]')

df = pd.DataFrame({'ship': randomDates(10), 'recd': randomDates(10),
                   'qty': np.random.randint(0, 10, 10), 'price': 100 * np.random.random(10)})
df
price qty recd ship
0 14.723510 3 2012-11-30 19:32:27 2013-03-08 23:10:12
1 53.535143 2 2012-07-25 14:26:45 2012-10-01 11:06:39
2 85.278743 7 2012-12-07 22:24:20 2013-02-26 10:23:20
3 35.940935 8 2013-04-18 13:49:43 2013-03-29 21:19:26
4 54.218896 8 2013-01-03 09:00:15 2012-08-08 12:50:41
5 61.404931 9 2013-02-10 19:36:54 2013-02-23 13:14:42
6 28.917693 1 2012-12-13 02:56:40 2012-09-08 21:14:45
7 88.440408 8 2013-04-04 22:54:55 2012-07-31 18:11:35
8 77.329931 7 2012-11-23 00:49:26 2012-12-09 19:27:40
9 46.540859 5 2013-03-13 11:37:59 2013-03-17 20:09:09
To bin by groups of price or quantity, I can use cut/qcut to bucket them:
df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()
price qty recd ship
qty price
(0, 1] [14.724, 46.541] 1 1 1 1
(1, 5] [14.724, 46.541] 2 2 2 2
(46.541, 61.405] 1 1 1 1
(5, 10] [14.724, 46.541] 1 1 1 1
(46.541, 61.405] 2 2 2 2
(61.405, 88.44] 3 3 3 3
But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields. For example, generate a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and then I want to label each value of df['recd'] with the period it falls into?
So the kind of output I'm looking for would be something like:
ship recv count
2011-01 2011-01 1
2011-02 3
... ...
2011-02 2011-01 2
2011-02 6
... ... ...
More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values, then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so:
ship recv price status count
2011-01 2011-01 [0-10) green 1
red 4
[10-20) yellow 2
... ... ...
2011-02 [0-10) yellow 3
... ... ... ...
As a bonus question, what's the simplest way to modify the groupby() result above to just contain a single output column called 'count'?
Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size().
In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
   ....:             pd.PeriodIndex(df.ship, freq='Q'),
   ....:             pd.cut(df['qty'], bins=[0, 5, 10]),
   ....:             pd.qcut(df['price'], q=2),
   ....:            ]).size()
Out[49]:
qty price
2012Q2 2013Q1 (0, 5] [2, 5] 1
2012Q3 2013Q1 (5, 10] [2, 5] 1
2012Q4 2012Q3 (5, 10] [2, 5] 1
2013Q1 (0, 5] [2, 5] 1
(5, 10] [2, 5] 1
2013Q1 2012Q3 (0, 5] (5, 8] 1
2013Q1 (5, 10] (5, 8] 2
2013Q2 2012Q4 (0, 5] (5, 8] 1
2013Q2 (0, 5] [2, 5] 1
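On newer pandas versions the same grouping keys can also be built with the .dt.to_period accessor instead of constructing PeriodIndex objects directly (a hedged equivalent, assuming recd and ship are datetime64 columns):
out = df.groupby([df['recd'].dt.to_period('Q'),
                  df['ship'].dt.to_period('Q'),
                  pd.cut(df['qty'], bins=[0, 5, 10]),
                  pd.qcut(df['price'], q=2)]).size()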
You just need to set the index to the field you'd like to resample by; here are some examples:
In [36]: df.set_index('recd').resample('1M',how='sum')
Out[36]:
price qty
recd
2012-07-31 64.151194 9
2012-08-31 93.476665 7
2012-09-30 94.193027 7
2012-10-31 NaN NaN
2012-11-30 NaN NaN
2012-12-31 12.353405 6
2013-01-31 NaN NaN
2013-02-28 129.586697 7
2013-03-31 NaN NaN
2013-04-30 NaN NaN
2013-05-31 211.979583 13
In [37]: df.set_index('recd').resample('1M',how='count')
Out[37]:
2012-07-31 price 1
qty 1
ship 1
2012-08-31 price 1
qty 1
ship 1
2012-09-30 price 2
qty 2
ship 2
2012-10-31 price 0
qty 0
ship 0
2012-11-30 price 0
qty 0
ship 0
2012-12-31 price 1
qty 1
ship 1
2013-01-31 price 0
qty 0
ship 0
2013-02-28 price 2
qty 2
ship 2
2013-03-31 price 0
qty 0
ship 0
2013-04-30 price 0
qty 0
ship 0
2013-05-31 price 3
qty 3
ship 3
dtype: int64
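Note that the how= argument was removed in later pandas versions; the equivalent spelling uses explicit aggregation methods on the resampler (a sketch, assuming the same df; on the newest releases the 'M' alias may need to be spelled 'ME'):
df.set_index('recd').resample('M').sum(numeric_only=True)   # monthly sums of the numeric columns
df.set_index('recd').resample('M').count()                  # monthly non-null counts per column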
I came up with an idea that relies on the underlying storage format of datetime64[ns]. If you define dcut() like this
def dcut(dts, freq='d', right=True):
    hi = pd.Period(dts.max(), freq=freq) + 1  # get first period past end of data
    periods = pd.PeriodIndex(start=dts.min(), end=hi, freq=freq)
    # get a list of integer bin boundaries representing ns-since-epoch
    # note the extra period gives us the extra right-hand bin boundary we need
    bounds = np.array(periods.to_timestamp(how='start'), dtype='int')
    # bin our time field as integers
    cut = pd.cut(np.array(dts, dtype='int'), bins=bounds, right=right)
    # relabel the bins using the periods, omitting the extra one at the end
    cut.levels = periods[:-1].format()
    return cut
Then we can do what I wanted:
df.groupby([dcut(df.recd, freq='m', right=False),dcut(df.ship, freq='m', right=False)]).count()
To get:
price qty recd ship
2012-07 2012-10 1 1 1 1
2012-11 2012-12 1 1 1 1
2013-03 1 1 1 1
2012-12 2012-09 1 1 1 1
2013-02 1 1 1 1
2013-01 2012-08 1 1 1 1
2013-02 2013-02 1 1 1 1
2013-03 2013-03 1 1 1 1
2013-04 2012-07 1 1 1 1
2013-03 1 1 1 1
I guess you could similarly define dqcut() which first "rounds" each datetime value to the integer representing the start of its containing period (at your specified frequency), and then uses qcut() to choose amongst those boundaries. Or do qcut() first on the raw integer values and round the resulting bins based on your chosen frequency?
No joy on the bonus question yet? :)
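Regarding the bonus question, the .size() suggestion from the first answer can be turned into a frame with a single 'count' column via reset_index (a sketch using .dt.to_period rather than the dcut helper above, so treat the exact spelling as an assumption):
counts = (df.groupby([df['recd'].dt.to_period('M').rename('recd'),
                      df['ship'].dt.to_period('M').rename('ship')])
            .size()
            .reset_index(name='count'))
print(counts)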
How about using Series and putting the parts of the DataFrame that you're interested into that, then calling cut on the series object?
price_series = pd.Series(df.price.tolist(), index=df.recd)
and then
pd.qcut(price_series, q=3)
and so on. (Though I think #Jeff's answer is best)