one year rolling count of unique values by group in pandas - python

So I have the following dataframe:
Period group ID
20130101 A 10
20130101 A 20
20130301 A 20
20140101 A 20
20140301 A 30
20140401 A 40
20130101 B 11
20130201 B 21
20130401 B 31
20140401 B 41
20140501 B 51
I need to count how many different IDs there are by group over the last year. So my desired output would look like this:
Period group num_ids_last_year
20130101 A 2 # ID 10 and 20 in the last year
20130301 A 2
20140101 A 2
20140301 A 2 # ID 30 enters, ID 10 leaves
20140401 A 3 # ID 40 enters
20130101 B 1
20130201 B 2
20130401 B 3
20140401 B 2 # ID 11 and 21 leave
20140501 B 2 # ID 31 leaves, ID 51 enters
Period is in datetime format. I tried many things along the lines of:
df.groupby(['group','Period'])['ID'].nunique() # Get number of IDs by group in a given period.
df.groupby(['group'])['ID'].nunique() # Get total number of IDs by group.
df.set_index('Period').groupby('group')['ID'].rolling(window=1, freq='Y').nunique()
But the last one isn't even possible. Is there any straightforward way to do this? I'm thinking maybe some kind of combination of cumcount() and pd.DateOffset, or maybe ge(df.Period - dt.timedelta(365)), but I can't find the answer.
Thanks.
Edit: added the fact that I can find more than one ID in a given Period

Looking at your data structure, I am guessing you have MANY duplicates, so start with dropping them; drop_duplicates tends to be fast.
I am assuming that the df['Period'] column is of dtype datetime64[ns].
from dateutil.relativedelta import relativedelta

df = df.drop_duplicates()
results = dict()
for start in df['Period'].drop_duplicates():
    end = start - relativedelta(years=1)
    screen = (df.Period <= start) & (df.Period >= end)  # screen for 1 year of data
    singles = df.loc[screen, ['group', 'ID']].drop_duplicates()  # unique (group, ID) pairs in that window
    x = singles.groupby('group').count()
    results[start] = x
results = pd.concat(results)
results
ID
group
2013-01-01 A 2
B 1
2013-02-01 A 2
B 2
2013-03-01 A 2
B 2
2013-04-01 A 2
B 3
2014-01-01 A 2
B 3
2014-03-01 A 2
B 1
2014-04-01 A 3
B 2
2014-05-01 A 3
B 2
is that any faster?
p.s. if df['Period'] is not a datetime:
df['Period'] = pd.to_datetime(df['Period'],format='%Y%m%d', errors='ignore')
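If you want the flat (Period, group, num_ids_last_year) layout from the question, a small reshape of the concatenated frame should do it (a sketch, assuming the results frame built above):
out = (results
       .rename_axis(['Period', 'group'])
       .rename(columns={'ID': 'num_ids_last_year'})
       .reset_index()
       .sort_values(['group', 'Period']))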

Here is a solution using groupby and rolling. Note: your desired output counts a year from YYYY0101 to the next year's YYYY0101 inclusive, so you need a rolling window of 366D instead of 365D.
import numpy as np

df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
df = df.set_index('Period')

df_final = (df.groupby('group')['ID'].rolling(window='366D')
              .apply(lambda x: np.unique(x).size, raw=True)
              .reset_index(name='ID_count')
              .drop_duplicates(['group', 'Period'], keep='last'))
Out[218]:
group Period ID_count
1 A 2013-01-01 2.0
2 A 2013-03-01 2.0
3 A 2014-01-01 2.0
4 A 2014-03-01 2.0
5 A 2014-04-01 3.0
6 B 2013-01-01 1.0
7 B 2013-02-01 2.0
8 B 2013-04-01 3.0
9 B 2014-04-01 2.0
10 B 2014-05-01 2.0
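If you also want the column name and integer dtype from the question, a small cleanup afterwards (a sketch, assuming df_final from above):
df_final = (df_final.rename(columns={'ID_count': 'num_ids_last_year'})
                    .astype({'num_ids_last_year': 'int64'}))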
Note: on 18M+ rows, I don't think this solution will finish within 10 minutes. I hope it takes about 30 minutes.

from dateutil.relativedelta import relativedelta
df.sort_values(by=['Period'], inplace=True) # if not already sorted
# create new output df: one row per (Period, group) with the list of IDs seen in that period
df1 = (df.groupby(['Period', 'group'])['ID']
         .apply(lambda x: list(x))
         .reset_index())
# for each row, look back one year within the same group and count the distinct IDs
df1['num_ids_last_year'] = df1.apply(
    lambda x: len(set(df1.loc[(df1['Period'] >= x['Period'] - relativedelta(years=1))
                              & (df1['Period'] <= x['Period'])
                              & (df1['group'] == x['group'])]
                      .ID.apply(pd.Series).stack())),
    axis=1)
df1.sort_values(by=['group'], inplace=True)
df1.drop('ID', axis=1, inplace=True)
df1 = df1.reset_index(drop=True)
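For the sample data this should reproduce the num_ids_last_year values from the question; if you also want the exact row order shown there, sorting on both keys makes it explicit (a small sketch):
df1 = df1.sort_values(['group', 'Period']).reset_index(drop=True)
print(df1)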

Related

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row is, for each group_id, the row that has the minimum cust_hierarchy; if there is a tie, use the row with the most recent date.
I have supplied some sample tables below:
row_id  group_id  cust_hierarchy  most_recent_date  master (I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending), and then adding a new column that places a 1 on the first row for each group_id?
Does anyone have any helpful code for this?
You can basically do a groupby with idxmin(), but with a little bit of sorting to ensure the most recent date is selected by the min operation:
import pandas as pd
import numpy as np
# example data
dates = ['2020-01-03', '2019-01-01', '2019-05-01',
         '2019-04-01', '2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id': [0, 0, 1, 1, 1],
                   'cust_hierarchy': [2, 7, 7, 6, 6],
                   'most_recent_date': dates})
# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int)
                    )
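The OP's own idea (sort by the two columns, then flag the first row of each group_id) also works; a minimal sketch, using the example df from the first answer:
df_sorted = df.sort_values(['cust_hierarchy', 'most_recent_date'],
                           ascending=[True, False])
df['master'] = df_sorted.groupby('group_id').cumcount().eq(0)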

Filter out values within 30 days of previous entry

I need to pick one value per 30 day period for the entire dataframe. For instance if I have the following dataframe:
df:
Date Value
0 2015-09-25 e
1 2015-11-11 b
2 2015-11-24 c
3 2015-12-02 d
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
8 2016-05-25 a
9 2016-06-15 a
10 2016-06-28 a
I need to pick the first entry and then filter out any entry within the next 30 days of that entry, and then proceed along the dataframe. For instance, indexes 0 and 1 should stay since they are at least 30 days apart, but 2 and 3 are within 30 days of 1 so they should be removed. This should continue chronologically until we have 1 entry per 30 day period:
Date Value
0 2015-09-25 e
1 2015-11-11 b
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
9 2016-06-15 a
The end result should have only 1 entry per 30 day period. Any advice or assistance would be greatly appreciated!
I have tried df.groupby(pd.Grouper(freq='M')).first() but that picks the first entry in each month rather than each entry that is at least 30 days from the previous entry.
I came up with a simple iterative solution which uses the fact that the DF is sorted, but it's fairly slow:
index = df.index.values
dates = df['Date'].tolist()
index_to_keep = []
curr_date = None
for i in range(len(dates)):
    if not curr_date or (dates[i] - curr_date).days > 30:
        index_to_keep.append(index[i])
        curr_date = dates[i]
df_out = df.loc[index_to_keep, :]
Any ideas on how to speed it up?
I think this should be what you are looking for. You need to convert your Date column to a datetime data structure so it is not interpreted as a string. Here is what it looks like:
df = pd.DataFrame({'Date': ['2015-09-25', '2015-11-11', '2015-11-24', '2015-12-02', '2015-12-14'],
                   'Value': ['e', 'b', 'c', 'd', 'a']})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df = df.groupby(pd.Grouper(freq='30D')).nth(0)
and here is the result
Value
Date
2015-09-25 e
2015-10-25 b
2015-11-24 c

Expanding multi-indexed dataframe with new dates as forecast

Note: I have followed Stack Overflow's instructions on how to create an MRE and pasted it into a code block as instructed (i.e. pasted it in the body and then pressed Ctrl+K while it was highlighted). If I am still not doing it correctly, let me know.
Back to the question: Suppose I now have a df multi-indexed on both the date (df['DT']) and ID (df['ID'])
DT,ID,value1,value2
2020-10-01,a,1,1
2020-10-01,b,2,1
2020-10-01,c,3,1
2020-10-01,d,4,1
2020-10-02,a,10,1
2020-10-02,b,11,1
2020-10-02,c,12,1
2020-10-02,d,13,1
df = df.set_index(['DT','ID'])
And now, I want to expand the df to have '2020-10-03' and '2020-10-04' with the same set of IDs {a,b,c,d} as my forecast period. To forecast value1, I assume it will take the average of the existing values, e.g. for a's value1 in both '2020-10-03' and '2020-10-04', I assume it will take (1+10)/2 = 5.5. For value2, I assume it will stay constant at 1.
The expected df will look like this:
DT,ID,value1,value2
2020-10-01,a,1.0,1
2020-10-01,b,2.0,1
2020-10-01,c,3.0,1
2020-10-01,d,4.0,1
2020-10-02,a,10.0,1
2020-10-02,b,11.0,1
2020-10-02,c,12.0,1
2020-10-02,d,13.0,1
2020-10-03,a,5.5,1
2020-10-03,b,6.5,1
2020-10-03,c,7.5,1
2020-10-03,d,8.5,1
2020-10-04,a,5.5,1
2020-10-04,b,6.5,1
2020-10-04,c,7.5,1
2020-10-04,d,8.5,1
Appreciate your help and time.
For an easy forecast with the mean, use DataFrame.unstack to get a DatetimeIndex, add the next dates with DataFrame.reindex and date_range, then replace missing values in the value1 level with DataFrame.fillna and set value2 to 1, and last reshape back with DataFrame.stack:
print (df)
value1 value2
DT ID
2020-10-01 a 1 1
b 2 1
c 3 1
d 4 1
2020-10-02 a 10 1
b 11 1
c 12 1
d 13 1
rng = pd.date_range('2020-10-01','2020-10-04', name='DT')
df1 = df.unstack().reindex(rng)
df1['value1'] = df1['value1'].fillna(df1['value1'].mean())
df1['value2'] = 1
df2 = df1.stack()
print (df2)
value1 value2
DT ID
2020-10-01 a 1.0 1
b 2.0 1
c 3.0 1
d 4.0 1
2020-10-02 a 10.0 1
b 11.0 1
c 12.0 1
d 13.0 1
2020-10-03 a 5.5 1
b 6.5 1
c 7.5 1
d 8.5 1
2020-10-04 a 5.5 1
b 6.5 1
c 7.5 1
d 8.5 1
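If you need the flat DT,ID,value1,value2 layout from the question rather than the MultiIndex, resetting the index gives that shape (a small sketch, assuming df2 from above):
out = df2.reset_index()
print(out)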
But forecasting is more complex, you can check this

How to drop records based on number of unique days using pandas?

I have a dataframe like as shown below
df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-03 13:39:00','2173-07-04 11:30:00','2173-04-04 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val': [5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What I would like to do is drop records/subjects that don't have 4 or more unique days.
If you look at my sample dataframe, you can see that subject_id = 1 has only 3 unique days, which are 3, 4 and 5, so I would like to drop subject_id = 1 completely. But subject_id = 2 has 4 or more unique days: 4, 9, 11, 13, 14. Please note that the date values have a timestamp, hence I extract the day from each datetime field and check for unique records.
This is what I tried
df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output
I expect my output to be like this
Change your function from size to DataFrameGroupBy.nunique, grouping only by the subject_id column:
df = df[df.groupby('subject_id')['day'].transform('nunique')>=4]
Or alternatively you can use filtration, but this should be slower if you're using a larger dataframe or many unique groups:
df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)
print (df)
subject_id time_1 val day month
7 2 2173-04-04 16:00:00 5 4 4
8 2 2173-04-09 22:00:00 8 9 4
9 2 2173-04-11 04:00:00 3 11 4
10 2 2173-04-13 04:30:00 4 13 4
11 2 2173-04-14 08:00:00 6 14 4
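Note that df['day'] holds the day of the month, so e.g. 2173-04-03 and 2173-07-03 count as the same day. If "unique days" should instead mean unique calendar dates (an assumption about the intent), you could normalize the timestamp and count that (a sketch):
# assumption: count distinct calendar dates, not day-of-month numbers
df['date'] = df['time_1'].dt.normalize()
df = df[df.groupby('subject_id')['date'].transform('nunique') >= 4]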

How to calculate weeks difference and add missing Weeks with count in python pandas

I have a data frame like this, and I have to find the missing weeks and the count between them.
year Data Id
20180406 57170 A
20180413 55150 A
20180420 51109 A
20180427 57170 A
20180504 55150 A
20180525 51109 A
The output should be like this.
Id Start year end-year count
A 20180420 20180420 1
A 20180518 20180525 2
Use:
# convert to weekly periods ending on Thursday (the dates here are Fridays, i.e. period starts)
df['year'] = pd.to_datetime(df['year'], format='%Y%m%d').dt.to_period('W-Thu')
# resample weekly with asfreq to expose the missing weeks
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('W-Thu')
         .asfreq()
         .rename('val')
         .reset_index())
print (df1)
Id year val
0 A 2018-04-06/2018-04-12 A
1 A 2018-04-13/2018-04-19 A
2 A 2018-04-20/2018-04-26 A
3 A 2018-04-27/2018-05-03 A
4 A 2018-05-04/2018-05-10 A
5 A 2018-05-11/2018-05-17 NaN
6 A 2018-05-18/2018-05-24 NaN
7 A 2018-05-25/2018-05-31 A
# converting to datetimes using the start dates of the periods
# http://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-between-representations
df1['year'] = df1['year'].dt.to_timestamp('D', how='s')
print (df1)
Id year val
0 A 2018-04-06 A
1 A 2018-04-13 A
2 A 2018-04-20 A
3 A 2018-04-27 A
4 A 2018-05-04 A
5 A 2018-05-11 NaN
6 A 2018-05-18 NaN
7 A 2018-05-25 A
m = df1['val'].notnull().rename('g')
# create an index by cumulative sum to get unique group labels for consecutive NaNs
df1.index = m.cumsum()
# filter only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
         .agg(['first', 'last', 'size'])
         .reset_index(level=1, drop=True)
         .reset_index())
print (df2)
Id first last size
0 A 2018-05-11 2018-05-18 2
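If you want the headers from the desired output, a final rename (a sketch, assuming df2 from above):
df2.columns = ['Id', 'Start year', 'end-year', 'count']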
