Ignoring Duplicates on Max in GroupBy - Pandas - python
I've read this thread about grouping and getting the max: Apply vs transform on a group object.
That approach works perfectly when the max is unique within a group, but I'm running into an issue: I need to ignore values that are duplicated within a group, take the max of the remaining unique values, and write the result back into the Series.
Input (named df1):
date val
2004-01-01 0
2004-02-01 0
2004-03-01 0
2004-04-01 0
2004-05-01 0
2004-06-01 0
2004-07-01 0
2004-08-01 0
2004-09-01 0
2004-10-01 0
2004-11-01 0
2004-12-01 0
2005-01-01 11
2005-02-01 11
2005-03-01 8
2005-04-01 5
2005-05-01 0
2005-06-01 0
2005-07-01 2
2005-08-01 1
2005-09-01 0
2005-10-01 0
2005-11-01 3
2005-12-01 3
My code:
df1['peak_month'] = df1.groupby(df1.date.dt.year)['val'].transform(max) == df1['val']
My Output:
date        val  peak_month
2004-01-01   0   True   #notice how all duplicates are True in 2004
2004-02-01   0   True
2004-03-01   0   True
2004-04-01   0   True
2004-05-01   0   True
2004-06-01   0   True
2004-07-01   0   True
2004-08-01   0   True
2004-09-01   0   True
2004-10-01   0   True
2004-11-01   0   True
2004-12-01   0   True
2005-01-01  11   True   #notice how these two values
2005-02-01  11   True   #are the max values for 2005 and are True
2005-03-01   8   False
2005-04-01   5   False
2005-05-01   0   False
2005-06-01   0   False
2005-07-01   2   False
2005-08-01   1   False
2005-09-01   0   False
2005-10-01   0   False
2005-11-01   3   False
2005-12-01   3   False
Expected Output:
date        val  peak_month
2004-01-01   0   False  #notice how all duplicates are False in 2004
2004-02-01   0   False  #because they are the same and all vals are max
2004-03-01   0   False
2004-04-01   0   False
2004-05-01   0   False
2004-06-01   0   False
2004-07-01   0   False
2004-08-01   0   False
2004-09-01   0   False
2004-10-01   0   False
2004-11-01   0   False
2004-12-01   0   False
2005-01-01  11   False  #notice how these two values
2005-02-01  11   False  #are the max values for 2005 but are False
2005-03-01   8   True   #this is the second max val and is True
2005-04-01   5   False
2005-05-01   0   False
2005-06-01   0   False
2005-07-01   2   False
2005-08-01   1   False
2005-09-01   0   False
2005-10-01   0   False
2005-11-01   3   False
2005-12-01   3   False
For reference:
df1 = pd.DataFrame({
    'val': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11, 8, 5, 0, 0, 2, 1, 0, 0, 3, 3],
    'date': ['2004-01-01', '2004-02-01', '2004-03-01', '2004-04-01', '2004-05-01', '2004-06-01',
             '2004-07-01', '2004-08-01', '2004-09-01', '2004-10-01', '2004-11-01', '2004-12-01',
             '2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01', '2005-05-01', '2005-06-01',
             '2005-07-01', '2005-08-01', '2005-09-01', '2005-10-01', '2005-11-01', '2005-12-01']})
df1['date'] = pd.to_datetime(df1['date'])  # needed so that df1.date.dt.year works
Not the slickest solution, but it works. The idea is to first keep only the values that appear exactly once in each year, and then apply the transform just to those unique values.
# Keep only the values that appear exactly once within each year.
df1['year'] = df1.date.dt.year
unique_vals = df1.drop_duplicates(subset=['year', 'val'], keep=False)

# Max transform on the unique values; index alignment leaves NaN
# on the rows that were dropped as duplicates.
df1['peak_month'] = unique_vals.groupby('year')['val'].transform('max') == unique_vals['val']

# Fill the NaNs as False, drop the helper column.
df1['peak_month'] = df1['peak_month'].fillna(False)
df1 = df1.drop(columns='year')
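For comparison, here is a more vectorized sketch of the same idea (my own variant, not part of the original answer): NaN out every value that is duplicated within its year, then compare what remains against the per-year max. Since NaN never compares equal, the duplicated rows and the all-duplicate years come out False automatically.

year = df1['date'].dt.year

# how many times each (year, val) pair occurs
counts = df1.groupby([year, 'val'])['val'].transform('size')

# keep only the values that occur exactly once within their year
uniq = df1['val'].where(counts.eq(1))

# True only at the per-year max of the unique values
df1['peak_month'] = uniq.eq(uniq.groupby(year).transform('max'))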
Related
How do I create a Boolean column that places a flag on two days before and two days after a holiday?
I have a data frame with Boolean columns denoting holidays. I would like to add another Boolean column that flags the two days before and the two days after any holiday, across all of the holiday columns. For example, take the data below:

import pandas as pd
from pandas.tseries.offsets import DateOffset

date_range = pd.date_range(start=pd.to_datetime("2020-01-10") + DateOffset(days=1), periods=45, freq='D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date': date_range, 'peanutbutterday': peanutbutterday, 'jellyday': jellyday, 'crackerday': crackerday}
df = pd.DataFrame.from_dict(holiday_dict)

What I would expect is an additional column, titled holiday_bookend below, that looks like the following:

import pandas as pd
from pandas.tseries.offsets import DateOffset

date_range = pd.date_range(start=pd.to_datetime("2020-01-10") + DateOffset(days=1), periods=45, freq='D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_bookend = [0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0]
holiday_dict = {'date': date_range, 'peanutbutterday': peanutbutterday, 'jellyday': jellyday, 'crackerday': crackerday, 'holiday_bookend': holiday_bookend}
df = pd.DataFrame.from_dict(holiday_dict)

I'm not sure if I should try a loop; I haven't conceptually worked that out, so I'm kind of stuck. I tried to incorporate the suggestion from here: How To Identify days before and after a holiday within pandas? but it seemed I needed a column for each holiday. I need one column that takes all of the holiday columns into account.
Basically, you add two extra columns: one that detects when a holiday has occurred (using the any method), and one that covers two days before and two days after (using the shift method). The columns work like this: the any column contains all the holiday days, and the shift calls use -2 and +2 for the two-day shifting.

Side note: avoid using for loops over a pandas DataFrame; the vectorised methods will always be faster and preferable.

So you can do this:

import pandas as pd
from pandas.tseries.offsets import DateOffset

date_range = pd.date_range(start=pd.to_datetime("2020-01-10") + DateOffset(days=1), periods=45, freq='D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date': date_range, 'peanutbutterday': peanutbutterday, 'jellyday': jellyday, 'crackerday': crackerday}
df = pd.DataFrame.from_dict(holiday_dict)

# add extra columns
df["holiday"] = df[["peanutbutterday", "jellyday", "crackerday"]].any(axis=1).astype(bool)

# column with 2 days before and 2 days after
df["holiday_extended"] = df["holiday"] | df["holiday"].shift(-2) | df["holiday"].shift(2)

which returns this:

         date  peanutbutterday  jellyday  crackerday  holiday  holiday_extended
0  2020-01-11                0         0           0    False             False
1  2020-01-12                0         0           0    False             False
2  2020-01-13                0         0           0    False             False
3  2020-01-14                0         0           0    False             False
4  2020-01-15                0         0           0    False             False
5  2020-01-16                0         0           0    False              True
6  2020-01-17                0         0           0    False              True
7  2020-01-18                1         0           0     True              True
8  2020-01-19                1         0           0     True              True
9  2020-01-20                0         0           0    False              True
10 2020-01-21                0         0           0    False              True
11 2020-01-22                0         0           0    False              True
12 2020-01-23                0         0           0    False              True
13 2020-01-24                0         1           1     True              True
14 2020-01-25                0         1           1     True              True
15 2020-01-26                0         0           0    False              True
16 2020-01-27                0         0           0    False              True
17 2020-01-28                0         0           0    False             False
18 2020-01-29                0         0           0    False             False
19 2020-01-30                0         0           0    False             False
20 2020-01-31                0         0           0    False             False
21 2020-02-01                0         0           0    False             False
22 2020-02-02                0         0           0    False             False
23 2020-02-03                0         0           0    False             False
24 2020-02-04                0         0           0    False             False
25 2020-02-05                0         0           0    False             False
26 2020-02-06                0         0           0    False             False
27 2020-02-07                0         0           0    False             False
28 2020-02-08                0         0           0    False             False
29 2020-02-09                0         0           0    False             False
30 2020-02-10                0         0           0    False             False
31 2020-02-11                0         0           0    False              True
32 2020-02-12                0         0           0    False              True
33 2020-02-13                0         0           1     True              True
34 2020-02-14                0         0           1     True              True
35 2020-02-15                0         0           1     True              True
36 2020-02-16                0         0           1     True              True
37 2020-02-17                0         0           0    False              True
38 2020-02-18                0         0           0    False              True
39 2020-02-19                0         0           0    False             False
40 2020-02-20                0         0           0    False             False
41 2020-02-21                0         0           0    False             False
42 2020-02-22                0         0           0    False             False
43 2020-02-23                0         0           0    False             False
44 2020-02-24                0         0           0    False             False
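A more general variant (my sketch, not from the answer above) gets the same ±N-day flag from a single centered rolling window. It assumes exactly one row per calendar day, as in this example, and unlike the fixed shift(-2)/shift(2) it also catches the one-day offsets around an isolated single-day holiday. Masking out the holiday days themselves reproduces the asker's expected holiday_bookend column:

N = 2  # days on each side of a holiday

# True wherever any holiday falls within a 2*N+1 day window centered on the row
near = (df["holiday"].astype(int)
        .rolling(2 * N + 1, center=True, min_periods=1)
        .max()
        .astype(bool))

# exclude the holidays themselves, as in the asker's expected output
df["holiday_bookend"] = near & ~df["holiday"]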
import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset

date_range = pd.date_range(start=pd.to_datetime("2020-01-10") + DateOffset(days=1), periods=45, freq='D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date': date_range, 'peanutbutterday': peanutbutterday, 'jellyday': jellyday, 'crackerday': crackerday}
df = pd.DataFrame.from_dict(holiday_dict)

# Grab all the holidays
holidays = df.loc[df[df.columns[1:]].sum(axis=1) > 0, 'date'].values

# Subtract every day by every holiday and get the absolute time difference in days
days_from_holiday = np.subtract.outer(df.date.values, holidays)
days_from_holiday = np.min(np.abs(days_from_holiday), axis=1)
days_from_holiday = np.array(days_from_holiday, dtype='timedelta64[D]')

# Make the comparison
df['near_holiday'] = days_from_holiday <= np.timedelta64(2, 'D')

# If you want it to read 0 or 1
df['near_holiday'] = df['near_holiday'].astype('int')
print(df)

First, we need to grab all the holidays: if we sum across the holiday columns, any row with a sum > 0 is a holiday, and we pull its date. Then we subtract every day from every holiday, which is quickly done using np.subtract.outer. Next we take the minimum of the absolute differences to see how close each date is to its nearest holiday, and convert the result to days because the default unit is nanoseconds. After that, it's just a matter of making the comparison and assigning it to the column.
Count of consecutive nulls grouped by key column in Pandas dataframe
My dataset (yearly data) looks like this:

CODE         Date        PRCP  TAVG  TMAX  TMIN
AE000041196  01-01-2020  0     21.1
AE000041196  02-01-2020  0     21.4
AE000041196  03-01-2020  0     21.2        15.4
AE000041196  04-01-2020  0     21.9        14.9
AE000041196  05-01-2020  0     23.7        16.5
AE000041196  06-01-2020  0.5   20.7
AE000041196  07-01-2020  0     18.1        11.5
AE000041196  08-01-2020  0     19.6        10.3
AE000041196  09-01-2020  0.3   20.6        13.8

I am trying to find the longest run of consecutive missing values (the max count of consecutive NaNs for each 'CODE') in the columns TMAX and TMIN. E.g., from the limited dataset above, the longest run of missing values for TMAX would be 9, and for TMIN it would be 2.

The code I am using:

df['TMAX_nullcount'] = df.TMAX.isnull().astype(int).groupby(df['TMAX'].notnull().astype(int).cumsum()).cumsum()

This leads to errors in the dataset, for example:

CODE         Date        PRCP  TAVG  TMAX  TMIN  TMAX_nullcount
CA1AB000014  10-03-2021  2.3                     297
CA1AB000014  11-03-2021  0                       298
CA1AB000014  12-03-2021  0                       299
CA1AB000014  13-03-2021  0                       300
CA1AB000014  14-03-2021  0                       301
CA1AB000015  01-01-2021  0                       302
CA1AB000015  02-01-2021  0                       303
CA1AB000015  03-01-2021  0                       304
CA1AB000015  04-01-2021  0                       305

In theory the count (TMAX_nullcount) should have restarted from 0 when CODE changed from CA1AB000014 to CA1AB000015. Also, a value in TMAX_nullcount cannot exceed 365 (yearly dataset), but my code gives values way beyond that.

Expected output file (values are made up):

CODE         TMAX_maxcnullcount  TMIN_maxcnullcount  TAVG_maxcnullcount
AE000041196  2                   2                   0
AEM00041194  1                   1                   0
AEM00041217  3                   1                   0
AEM00041218  1                   2                   45
AFM00040938  65                  65                  0
AFM00040948  132                 132                 0
AG000060390  155                 141                 0

How can I fix this? Thanks in advance.
You can use the following. First, test which values are missing:

print(df.isna())
    CODE   Date   PRCP   TAVG  TMAX   TMIN
0  False  False  False  False  True   True
1  False  False  False  False  True   True
2  False  False  False  False  True  False
3  False  False  False  False  True  False
4  False  False  False  False  True  False
5  False  False  False  False  True   True
6  False  False  False  False  True  False
7  False  False  False  False  True  False
8  False  False  False  False  True  False

# columns to test for missing values
cols = ['TMAX', 'TMIN', 'TAVG']

# CODE to index, filter the columns and create one Series
m = df.set_index('CODE')[cols].isna().unstack()

# create consecutive groups and count them, with the maximal count per column and group
df = (m.ne(m.shift()).cumsum()
       .where(m)
       .groupby(level=[0, 1]).value_counts()
       .max(level=[0, 1])
       .unstack(0)
       .add_suffix('_maxcnullcount'))
print(df)
             TMAX_maxcnullcount  TMIN_maxcnullcount
CODE
AE000041196                   9                   2
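One caveat if you run this today: the level argument of Series.max was removed in pandas 2.0, so on current versions the aggregation step needs an explicit groupby. My adaptation of the chain above, otherwise unchanged:

df = (m.ne(m.shift()).cumsum()
       .where(m)
       .groupby(level=[0, 1]).value_counts()
       .groupby(level=[0, 1]).max()
       .unstack(0)
       .add_suffix('_maxcnullcount'))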
You can try something like this:

df.groupby(['CODE', df['PRCP'].ne(df['PRCP'].shift()).cumsum()]).size().max()

Group by CODE and the runs of consecutive equal values, then compute the size of each run. Your groupby result (aggregated with size) will be:

CODE         PRCP
AE000041196  1       5
             2       1
             3       2
             4       1

Now you can find the max and min. So your final solution will look like this:

df1 = df.fillna(0)
df1.groupby(['CODE', df1['TMAX'].ne(df1['TMAX'].shift()).cumsum()]).size().max()
9
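Neither snippet above emits the full per-CODE table from the expected output by itself. A sketch that does (my own composition, assuming the column names from the question) counts the longest NaN run per column within each CODE:

import pandas as pd

def max_null_run(s: pd.Series) -> int:
    # label runs of consecutive equal values, then take the longest NaN run
    isna = s.isna()
    if not isna.any():
        return 0
    runs = isna.ne(isna.shift()).cumsum()
    return int(isna.groupby(runs).sum().max())

out = (df.groupby('CODE')[['TMAX', 'TMIN', 'TAVG']]
         .agg(max_null_run)
         .add_suffix('_maxcnullcount'))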
Days between this and next time a column value is True?
I am trying to count the days that pass between events recorded in a non-date column in pandas. I have a pandas DataFrame that looks something like this:

df = pd.DataFrame({'date':[
        '01.01.2020','02.01.2020','03.01.2020','10.01.2020',
        '01.01.2020','04.02.2020','20.02.2020','21.02.2020',
        '01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
    'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
    'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
    'booly':[True, False, False, True, True, False, False, True,
             True, True, True, True]})

I've been unable to figure out how to create a new column stating the number of days that pass between each True value in the 'booly' column, for each user. So for each row with a True in the 'booly' column: how many days until the next row with a True in the 'booly' column occurs, like so:

date        user_id  booly  days_until_next_booly
01.01.2020  1        True   9
02.01.2020  1        False  None
03.01.2020  1        False  None
10.01.2020  1        True   None
01.01.2020  2        True   51
04.02.2020  2        False  None
20.02.2020  2        False  None
21.02.2020  2        True   None
01.02.2020  3        True   9
10.02.2020  3        True   10
20.02.2020  3        True   29
20.03.2020  3        True   None
# sample data
df = pd.DataFrame({'date':[
        '01.01.2020','02.01.2020','03.01.2020','10.01.2020',
        '01.01.2020','04.02.2020','20.02.2020','21.02.2020',
        '01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
    'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
    'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
    'booly':[True, False, False, True, True, False, False, True,
             True, True, True, True]})

# convert the date column to datetime format
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# use loc with groupby to calculate the difference between True values
df.loc[df['booly'] == True, 'days_until_next_booly'] = (
    df.loc[df['booly'] == True].groupby('user_id')['date'].diff().shift(-1)
)

         date  user_id  other_val  booly days_until_next_booly
0  2020-01-01        1          0   True                9 days
1  2020-01-02        1          0  False                   NaT
2  2020-01-03        1          0  False                   NaT
3  2020-01-10        1        100   True                   NaT
4  2020-01-01        2          0   True               51 days
5  2020-02-04        2        100  False                   NaT
6  2020-02-20        2          0  False                   NaT
7  2020-02-21        2         10   True                   NaT
8  2020-02-01        3         10   True                9 days
9  2020-02-10        3          0   True               10 days
10 2020-02-20        3          0   True               29 days
11 2020-03-20        3         10   True                   NaT
(
    df
    # first convert the date column to datetime format
    .assign(date=lambda x: pd.to_datetime(x['date'], dayfirst=True))
    # sort your dates
    .sort_values('date')
    # calculate the difference between subsequent dates
    .assign(date_diff=lambda x: x['date'].diff(1).shift(-1))
    # group by the cumulative sum of the booly column to calculate
    # the cumulative days between True values
    .assign(date_diff_cum=lambda x: x.groupby(x['booly'].cumsum())['date_diff']
                                     .transform('sum')
                                     .where(x['booly'] == True))
)

Output:

date        user_id  other_val  booly  date_diff  date_diff_cum
2020-01-01  2        0          True   1 days     9 days
2020-01-02  1        0          False  1 days     NaT
2020-01-03  1        0          False  7 days     NaT
2020-01-10  1        100        True   22 days    22 days
2020-02-01  1        0          True   0 days     0 days
2020-02-01  3        10         True   3 days     9 days
2020-02-04  2        10         False  6 days     NaT
2020-02-10  3        0          True   10 days    10 days
2020-02-20  2        100        False  0 days     NaT
2020-02-20  3        0          True   1 days     1 days
2020-02-21  2        0          True   28 days    28 days
2020-03-20  3        10         True   NaT        0 days
How to conditionally drop rows in pandas
I have the following dataframe:

            True_False   cum_val
Date
2018-01-02       False       NaN
2018-01-03       False  0.006399
2018-01-04       False  0.010427
2018-01-05       False  0.017461
2018-01-08       False  0.019124
2018-01-09       False  0.020426
2018-01-10       False  0.019314
2018-01-11       False  0.026348
2018-01-12       False  0.033098
2018-01-16       False  0.029573
2018-01-17       False  0.038988
2018-01-18       False  0.037372
2018-01-19       False  0.041757
2018-01-22       False  0.049824
2018-01-23       False  0.051998
2018-01-24       False  0.051438
2018-01-25       False  0.052041
2018-01-26       False  0.063882
2018-01-29       False  0.057150
2018-01-30        True -0.010899
2018-01-31        True -0.010410
2018-02-01        True -0.011058
2018-02-02        True -0.032266
2018-02-05        True -0.073246
2018-02-06        True -0.055805
2018-02-07        True -0.060806
2018-02-08        True -0.098343
2018-02-09        True -0.083407
2018-02-12       False  0.013915
2018-02-13       False  0.016528
2018-02-14       False  0.029930
2018-02-15       False  0.041999
2018-02-16       False  0.042373
2018-02-20       False  0.036531
2018-02-21       False  0.031035
2018-03-06       False  0.013671

How can I drop the rows starting at the second True value in each run of True values and ending at the second False value that follows? For example:

            True_False   cum_val
Date
2020-01-21       False  0.022808
2020-01-22       False  0.023097
2020-01-23        True  0.001141
2020-01-24        True -0.007901  # <- start drop here, since this is the second True
2020-01-27        True -0.023632
2020-01-28       False -0.013578
2020-01-29       False -0.000867  # <- end drop here, since this is the second False
2020-01-30       False  0.003134

Edit 1: I would like to add one more condition on the new df:

2020-01-22   0.000289  False
2020-01-23   0.001141   True
2020-01-27  -0.015731   True  # <- start drop here
2020-01-28   0.010054   True
2020-01-29  -0.000867  False
2020-01-30   0.003134   True  # <- end drop here
2020-02-03   0.007255   True

As mentioned in the comments, for a pattern like [True, True, True, False, True] it would still start the drop at the second True value but would stop the drop right after the first False, even though the value after it has toggled back to True. If the next value is still True, keep dropping until the value after the False.
Let's try using where with ffill and parameter limit=2, then boolean filtering:

df[~(df['True_False'].where(df['True_False']).ffill(limit=2).cumsum() > 1)]

Output:

|    | Date       | True_False | cum_val |
|----|------------|------------|---------|
| 0  | 2020-01-21 | False      | 1       |
| 1  | 2020-01-22 | False      | 2       |
| 2  | 2020-01-23 | True       | 3       |
| 7  | 2020-01-28 | False      | 8       |

Details: first, convert the False values to np.nan using where. Next, fill the first two np.nan after the last True using ffill(limit=2). Now use cumsum so that consecutive True values add up, and select those greater than 1. Finally, negate to keep the False records above the first True record and from the third False record on.
Here's what I tried. The data I created is:

   Date        True_False  cum_val
0  2020-01-21  False       1
1  2020-01-22  False       2
2  2020-01-23  True        3
3  2020-01-24  True        4
4  2020-01-25  True        5
5  2020-01-26  False       6
6  2020-01-27  False       7
7  2020-01-28  False       8

true_count = 0
false_count = 0
drop_continue = False
for index, row in df.iterrows():
    if row['True_False'] is True and drop_continue is False:
        true_count += 1
        if true_count == 2:
            drop_continue = True
            df.drop(index, inplace=True)
            true_count = 0
            continue
    if drop_continue is True:
        if row['True_False'] is True:
            df.drop(index, inplace=True)
        if row['True_False'] is False:
            false_count += 1
            if false_count < 2:
                df.drop(index, inplace=True)
            else:
                drop_continue = False
                false_count = 0

Output:

   Date        True_False  cum_val
0  2020-01-21  False       1
1  2020-01-22  False       2
2  2020-01-23  True        3
6  2020-01-27  False       7
7  2020-01-28  False       8
You could use Series.shift and Series.bfill:

df = df[~df['True_False'].shift().bfill()]
print(df)

        Date  True_False   cum_val
0 2020-01-21       False  0.022808
1 2020-01-22       False  0.023097
2 2020-01-23        True  0.001141
6 2020-01-29       False -0.000867
7 2020-01-30       False  0.003134

This drops every row whose immediate predecessor is True (the leading NaN introduced by the shift is back-filled).
You can do:

# mark the start of the area you want to drop
df["dropit"] = np.where(df["True_False"] & df["True_False"].shift(1) & np.logical_not(df["True_False"].shift(2)), "start", None)

# mark the end of the drop area
df["dropit"] = np.where(np.logical_not(df["True_False"].shift(1)) & df["True_False"].shift(2), "end", df["dropit"])

# indicate gaps between the different drop areas
df.loc[df["dropit"].shift().eq("end") & df["dropit"].ne("start"), "dropit"] = "keep"

# forward fill
df["dropit"] = df["dropit"].ffill()

# drop the marked drop areas and drop the "dropit" column
df = df.drop(df.loc[df["dropit"].isin(["start", "end"])].index, axis=0).drop("dropit", axis=1)

Outputs:

            True_False   cum_val
Date
2018-01-02       False       NaN
2018-01-03       False  0.006399
2018-01-04       False  0.010427
2018-01-05       False  0.017461
2018-01-08       False  0.019124
2018-01-09       False  0.020426
2018-01-10       False  0.019314
2018-01-11       False  0.026348
2018-01-12       False  0.033098
2018-01-16       False  0.029573
2018-01-17       False  0.038988
2018-01-18       False  0.037372
2018-01-19       False  0.041757
2018-01-22       False  0.049824
2018-01-23       False  0.051998
2018-01-24       False  0.051438
2018-01-25       False  0.052041
2018-01-26       False  0.063882
2018-01-29       False  0.057150
2018-01-30        True -0.010899
2018-02-14       False  0.029930
2018-02-15       False  0.041999
2018-02-16       False  0.042373
2018-02-20       False  0.036531
2018-02-21       False  0.031035
2018-03-06       False  0.013671
pandas how to check differences between column values are within a range or not in each group
I have the following df:

cluster_id  date
1           2018-01-02
1           2018-02-01
1           2018-03-30
2           2018-04-01
2           2018-04-23
2           2018-05-18
3           2018-06-01
3           2018-07-30
3           2018-09-30

I'd like to create a boolean column recur_pmt, which is set to True if all differences between consecutive values of date in each cluster (df.groupby('cluster_id')) satisfy 20 < x < 40 days, and False otherwise. So the result is:

cluster_id  date        recur_pmt
1           2018-01-02  False
1           2018-02-01  False
1           2018-03-30  False
2           2018-04-01  True
2           2018-04-23  True
2           2018-05-18  True
3           2018-06-01  False
3           2018-07-30  False
3           2018-09-30  False

I tried

df['recur_pmt'] = df.groupby('cluster_id')['date'].apply(
    lambda x: (20 < x.diff().dropna().dt.days < 40).all())

but it did not work. I am also wondering whether transform could be used in this case.
Use transform with Series.between and parameter inclusive=False:

df['recur_pmt'] = df.groupby('cluster_id')['date'].transform(
    lambda x: (x.diff().dropna().dt.days.between(20, 40, inclusive=False)).all())
print(df)

   cluster_id       date  recur_pmt
0           1 2018-01-02      False
1           1 2018-02-01      False
2           1 2018-03-30      False
3           2 2018-04-01       True
4           2 2018-04-23       True
5           2 2018-05-18       True
6           3 2018-06-01      False
7           3 2018-07-30      False
8           3 2018-09-30      False
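One portability note from me: the boolean inclusive argument of Series.between was deprecated in pandas 1.3 and removed in 2.0 in favour of strings, so on current versions the same answer (my adaptation, otherwise identical) would be written as:

df['recur_pmt'] = df.groupby('cluster_id')['date'].transform(
    lambda x: x.diff().dropna().dt.days.between(20, 40, inclusive='neither').all())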