Pandas group by recurring state - python
I have a dataset of hosts sorted by time, each with a boolean state isCorrect. I would like to get only the hosts that were rated False at least 3 consecutive times; if there is a True in between, the counter should reset.
data = {'time': ['10:01', '10:02', '10:03', '10:15', '10:16', '10:18','10:20','10:21','10:22', '10:23','10:24','10:25','10:26','10:27'],
'host': ['A','B','A','A','A','B','A','A','B','B','B','B','B','B'],
'isCorrect': [True, True, False, True, False, False, True, True, False, False, True, False, False, False]}
time host isCorrect
0 10:01 A True
1 10:02 B True
2 10:03 A False
3 10:15 A True
4 10:16 A False
5 10:18 B False
6 10:20 A True
7 10:21 A True
8 10:22 B False
9 10:23 B False
10 10:24 B True
11 10:25 B False
12 10:26 B False
13 10:27 B False
With this example dataset there should be 2 clusters:
Host B due to rows 5, 8, 9, since they were False 3 times in a row.
Host B due to rows 11, 12, 13.
Note that this should be 2 clusters rather than 1 cluster of 6 items. Unfortunately, my implementation produces exactly the latter.
df = pd.DataFrame(data)
df = df[~df['isCorrect']].sort_values(['host','time'])
mask = df['host'].map(df['host'].value_counts()) >= 3
df = df[mask].copy()
df['Group'] = pd.factorize(df['host'])[0]
Which returns
time host isCorrect Group
5 10:18 B False 0
8 10:22 B False 0
9 10:23 B False 0
11 10:25 B False 0
12 10:26 B False 0
13 10:27 B False 0
Expected is an output like so:
time host isCorrect Group
5 10:18 B False 0
8 10:22 B False 0
9 10:23 B False 0
11 10:25 B False 1
12 10:26 B False 1
13 10:27 B False 1
Solution: after sorting, generate a new Group column from the cumulative sum of the True values (the False rows are the ones being tested). This yields a unique identifier per run of Falses, and those identifiers are factorized in the last step:
df = df.sort_values(['host','time'])
df['Group'] = df['isCorrect'].cumsum()
df = df[~df['isCorrect']]
mask = df['Group'].map(df['Group'].value_counts()) >= 3
df = df[mask].copy()
df['Group'] = pd.factorize(df['Group'])[0]
print(df)
time host isCorrect Group
5 10:18 B False 0
8 10:22 B False 0
9 10:23 B False 0
11 10:25 B False 1
12 10:26 B False 1
13 10:27 B False 1
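For reference, a common alternative idiom (a sketch, not part of the answer above) builds an explicit run identifier with shift + cumsum. Unlike the cumsum-of-Trues trick, it also keeps hosts separate when a False run would otherwise continue across a host boundary:

```python
import pandas as pd

data = {'time': ['10:01', '10:02', '10:03', '10:15', '10:16', '10:18', '10:20',
                 '10:21', '10:22', '10:23', '10:24', '10:25', '10:26', '10:27'],
        'host': ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
        'isCorrect': [True, True, False, True, False, False, True, True,
                      False, False, True, False, False, False]}
df = pd.DataFrame(data).sort_values(['host', 'time'])

# a new run starts whenever isCorrect changes value or the host changes
run_id = ((df['isCorrect'] != df['isCorrect'].shift())
          | (df['host'] != df['host'].shift())).cumsum()

# keep only False rows that belong to a run of at least 3
bad = df[~df['isCorrect']].copy()
runs = run_id[bad.index]
bad = bad[runs.map(runs.value_counts()) >= 3].copy()
bad['Group'] = pd.factorize(run_id[bad.index])[0]
print(bad)
```

With the question's data this keeps rows 5, 8, 9 (Group 0) and 11, 12, 13 (Group 1), matching the expected output.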
Raising the threshold to 5 consecutive False values returns no rows, since no run is that long:
df = df.sort_values(['host','time'])
df['Group'] = df['isCorrect'].cumsum()
df = df[~df['isCorrect']]
mask = df['Group'].map(df['Group'].value_counts()) >= 5
df = df[mask].copy()
df['Group'] = pd.factorize(df['Group'])[0]
print(df)
Empty DataFrame
Columns: [time, host, isCorrect, Group]
Index: []
Related
How do I create a Boolean column that places a flag on two days before and two days after a holiday?
I have a data frame with Boolean columns denoting holidays. I would like to add another Boolean column that flags the two days before and the two days after any holiday, across all holiday columns. For example, take the data below:

import pandas as pd
from pandas.tseries.offsets import DateOffset

date_range = pd.date_range(start = pd.to_datetime("2020-01-10") + DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date':date_range, 'peanutbutterday':peanutbutterday, 'jellyday':jellyday, 'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)

What I would expect is an additional column, titled holiday_bookend below, that looks like the following:

holiday_bookend = [0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0]
holiday_dict = {'date':date_range, 'peanutbutterday':peanutbutterday, 'jellyday':jellyday, 'crackerday':crackerday, 'holiday_bookend':holiday_bookend}
df = pd.DataFrame.from_dict(holiday_dict)

I'm not sure if I should try a loop; I haven't conceptually worked it out, so I'm kind of stuck.
I tried to incorporate the suggestion from here: How To Identify days before and after a holiday within pandas? but it seemed I needed a column for each holiday. I need one column that takes all holiday columns into account.
Basically, add two extra columns: one that detects when any holiday has occurred (the any method), and one for two days before and two days after (the shift method). The any column contains all the holiday days; shift(-2) and shift(2) provide the 2-day offsets. Side note: avoid using for loops over a pandas DataFrame; the vectorised methods will always be faster and preferable. So you can do this:

import pandas as pd
from pandas.tseries.offsets import DateOffset

# df constructed as in the question

# add extra columns
df["holiday"] = df[["peanutbutterday", "jellyday", "crackerday"]].any(axis=1).astype(bool)
# column with 2 days before and 2 days after
df["holiday_extended"] = df["holiday"] | df["holiday"].shift(-2) | df["holiday"].shift(2)

which returns this:

         date  peanutbutterday  jellyday  crackerday  holiday  holiday_extended
0  2020-01-11                0         0           0    False             False
1  2020-01-12                0         0           0    False             False
2  2020-01-13                0         0           0    False             False
3  2020-01-14                0         0           0    False             False
4  2020-01-15                0         0           0    False             False
5  2020-01-16                0         0           0    False              True
6  2020-01-17                0         0           0    False              True
7  2020-01-18                1         0           0     True              True
8  2020-01-19                1         0           0     True              True
9  2020-01-20                0         0           0    False              True
10 2020-01-21                0         0           0    False              True
11 2020-01-22                0         0           0    False              True
12 2020-01-23                0         0           0    False              True
13 2020-01-24                0         1           1     True              True
14 2020-01-25                0         1           1     True              True
15 2020-01-26                0         0           0    False              True
16 2020-01-27                0         0           0    False              True
17 2020-01-28                0         0           0    False             False
18 2020-01-29                0         0           0    False             False
19 2020-01-30                0         0           0    False             False
20 2020-01-31                0         0           0    False             False
21 2020-02-01                0         0           0    False             False
22 2020-02-02                0         0           0    False             False
23 2020-02-03                0         0           0    False             False
24 2020-02-04                0         0           0    False             False
25 2020-02-05                0         0           0    False             False
26 2020-02-06                0         0           0    False             False
27 2020-02-07                0         0           0    False             False
28 2020-02-08                0         0           0    False             False
29 2020-02-09                0         0           0    False             False
30 2020-02-10                0         0           0    False             False
31 2020-02-11                0         0           0    False              True
32 2020-02-12                0         0           0    False              True
33 2020-02-13                0         0           1     True              True
34 2020-02-14                0         0           1     True              True
35 2020-02-15                0         0           1     True              True
36 2020-02-16                0         0           1     True              True
37 2020-02-17                0         0           0    False              True
38 2020-02-18                0         0           0    False              True
39 2020-02-19                0         0           0    False             False
40 2020-02-20                0         0           0    False             False
41 2020-02-21                0         0           0    False             False
42 2020-02-22                0         0           0    False             False
43 2020-02-23                0         0           0    False             False
44 2020-02-24                0         0           0    False             False
import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset

# df constructed as in the question

# Grab all the holidays
holidays = df.loc[df[df.columns[1:]].sum(axis = 1) > 0, 'date'].values

# Subtract every day by every holiday and get the absolute time difference in days
days_from_holiday = np.subtract.outer(df.date.values, holidays)
days_from_holiday = np.min(np.abs(days_from_holiday), axis = 1)
days_from_holiday = np.array(days_from_holiday, dtype = 'timedelta64[D]')

# Make the comparison
df['near_holiday'] = days_from_holiday <= np.timedelta64(2, 'D')

# If you want it to read 0 or 1
df['near_holiday'] = df['near_holiday'].astype('int')
print(df)

First, we need to grab all the holidays: if we sum across all the holiday columns, any row with a sum > 0 is a holiday, and we pull that date. Then we subtract every day by every holiday, which is quickly done using np.subtract.outer. Then we take the minimum of the absolute values to find each date's distance to the closest holiday, and convert it to days (the default unit is nanoseconds). After that, it's just a matter of making the comparison and assigning it to the column.
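To see what np.subtract.outer is doing here, a minimal sketch with toy dates (not the question's data):

```python
import numpy as np

dates = np.array(['2020-01-01', '2020-01-06', '2020-01-10'], dtype='datetime64[D]')
holidays = np.array(['2020-01-02', '2020-01-09'], dtype='datetime64[D]')

# outer subtraction: element [i, j] is dates[i] - holidays[j]
diff = np.subtract.outer(dates, holidays)    # shape (3, 2), timedelta64
nearest = np.min(np.abs(diff), axis=1)       # distance to the closest holiday
near = nearest <= np.timedelta64(2, 'D')
print(near)
```

Here the middle date is 4 and 3 days from the two holidays, so only the first and last fall within the 2-day window.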
How to retrieve pandas dataframe rows surrounding rows with a True boolean?
Suppose I have a df of the following format: assuming line 198 is True for rot_mismatch, what would be the best way to retrieve the True line (easy) and the lines above and below it (unsolved)? I have multiple lines with a True boolean and would like to automatically create a dataframe for closer investigation, always including the True line and its surrounding lines. Thanks!

Edit for clarification. Exemplary input:

id  name     Bool
1   Sta      False
2   Danny    True
3   Elle     False
4   Rob      False
5   Dan      False
6   Holger   True
7   Mat      True
8   Derrick  False
9   Lisa     False

Desired output:

id  name     Bool
1   Sta      False
2   Danny    True
3   Elle     False
5   Dan      False
6   Holger   True
7   Mat      True
8   Derrick  False
Assuming this input:

  col1  rot_mismatch
0    A         False
1    B          True
2    C         False
3    D         False
4    E         False
5    F         False
6    G          True
7    H          True

to get the N rows before/after any True, you can use a rolling operation to compute a mask for boolean indexing:

N = 1
mask = (df['rot_mismatch']
        .rolling(2*N+1, center=True, min_periods=1)
        .max().astype(bool)
        )
df2 = df.loc[mask]

output:

# N = 1
  col1  rot_mismatch
0    A         False
1    B          True
2    C         False
5    F         False
6    G          True
7    H          True

# N = 0
  col1  rot_mismatch
1    B          True
6    G          True
7    H          True

# N = 2
  col1  rot_mismatch
0    A         False
1    B          True
2    C         False
3    D         False
4    E         False
5    F         False
6    G          True
7    H          True
Try with shift:

>>> df[df["rot_mismatch"]|df["rot_mismatch"].shift()|df["rot_mismatch"].shift(-1)]
        dep_ap_sched     arr_ap_sched  rot_mismatch
120      East Carmen  South Nathaniel         False
198  South Nathaniel      East Carmen          True
289      East Carmen       Joneshaven         False

Output for the amended example:

>>> df[df["Bool"]|df["Bool"].shift()|df["Bool"].shift(-1)]
   id     name   Bool
0   1      Sta  False
1   2    Danny   True
2   3     Elle  False
4   5      Dan  False
5   6   Holger   True
6   7      Mat   True
7   8  Derrick  False
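The chain of shift ORs generalizes to any window size N with a reduce (a sketch on toy data, assuming a boolean column):

```python
from functools import reduce
import pandas as pd

s = pd.Series([False, True, False, False, False, False, True, True])
N = 1  # rows to keep on each side of every True

# OR together s shifted by every offset in [-N, N]
mask = reduce(lambda a, b: a | b,
              (s.shift(k, fill_value=False) for k in range(-N, N + 1)))
print(s[mask].index.tolist())
```

For N = 1 this keeps rows 0-2 (around the True at index 1) and rows 5-7 (around the Trues at 6 and 7), matching the rolling-max answer's output shape.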
Is this what you want?

df_true = df.loc[df['rot_mismatch'], :]
df_false = df.loc[~df['rot_mismatch'], :]

Note: if rot_mismatch is a real boolean column, index with it directly as above; comparing against the strings 'True'/'False' only works if the column actually holds strings.
Expanding mean based on boolean column with False as most recent value in Pandas
If I have the following data frame:

b = {'user': [1, 1, 1, 1, 2, 2, 2],
     'value': [10, 20, 30, 40, 1, 2, 3],
     'loan': [True, True, True, False, True, False, True]}
temp_df: pd.DataFrame = pd.DataFrame(b)
temp_df['date'] = np.array([23, 24, 25, 26, 27, 28, 29])

   user  value   loan  date
0     1     10   True    23
1     1     20   True    24
2     1     30   True    25
3     1     40  False    26
4     2      1   True    27
5     2      2  False    28
6     2      3   True    29

I want to calculate, in a new column and for each user, the "rolling" mean of value with only rows where loan == True taken into account, and it should be the mean up to (not including) the current row. So the desired output should be something like this:

   user  value   loan  date  cummean_value
0     1     10   True    23              0
1     1     20   True    24             10
2     1     30   True    25             15
3     1     40  False    26             20
4     2      1   True    27              0
5     2      2  False    28              1
6     2      3   True    29              1

When loan == False I want the value to be the most recent mean calculated so far (over True values of loan). The first value for each user will be NaN, which should be replaced with 0 (as in the desired output).
Let us try with groupby + cumsum:

temp_df['new'] = temp_df['value'].where(temp_df['loan']).groupby(temp_df['user'])\
    .apply(lambda x : (x.shift().cumsum()/x.shift().notna().cumsum()).ffill().fillna(0))

Out[54]:
0     0.0
1    10.0
2    15.0
3    20.0
4     0.0
5     1.0
6     1.0
Name: value, dtype: float64
Try:

# supplementary columns:
temp_df['value2'] = np.where(temp_df['loan'], temp_df['value'], 0)
temp_df['x'] = np.where(temp_df['loan'], 1, 0)

# the whole calculation, assuming cummean up to and including the given row
temp_df['cummean_value'] = temp_df.groupby('user')['value2'].cumsum() \
    .div(temp_df.groupby('user')['x'].cumsum())

# assuming - up to the previous row (shift by one within each group)
temp_df['cummean_value'] = temp_df.groupby('user')['cummean_value'].shift().fillna(0)

# clean-up
temp_df.drop(['x', 'value2'], axis=1, inplace=True)

Outputs:

   user  value   loan  date  cummean_value
0     1     10   True    23            0.0
1     1     20   True    24           10.0
2     1     30   True    25           15.0
3     1     40  False    26           20.0
4     2      1   True    27            0.0
5     2      2  False    28            1.0
6     2      3   True    29            1.0
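An equivalent one-liner (a sketch, not from either answer above) leans on the fact that expanding().mean() skips NaN, so masking the False rows and shifting gives the exclusive running mean directly:

```python
import pandas as pd

b = {'user': [1, 1, 1, 1, 2, 2, 2],
     'value': [10, 20, 30, 40, 1, 2, 3],
     'loan': [True, True, True, False, True, False, True]}
df = pd.DataFrame(b)

df['cummean_value'] = (df['value'].where(df['loan'])    # NaN where loan is False
    .groupby(df['user'])
    .transform(lambda s: s.expanding().mean().shift())  # mean of prior True rows
    .fillna(0))
print(df)
```

Because NaN rows leave the expanding mean unchanged, False rows automatically carry the last mean computed over True rows, and the shift excludes the current row.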
Days between this and next time a column value is True?
I am trying to count the days passing between events in a non-date column of a pandas dataframe. I have a dataframe that looks something like this:

df = pd.DataFrame({'date':[
    '01.01.2020','02.01.2020','03.01.2020','10.01.2020',
    '01.01.2020','04.02.2020','20.02.2020','21.02.2020',
    '01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
    'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
    'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
    'booly':[True, False, False, True, True, False, False, True, True, True, True, True]})

I've been unable to figure out how to create a new column stating the number of days that pass between each True value in the 'booly' column, for each user. That is, for each row with a True in 'booly': how many days until the next row with a True in 'booly', like so:

date        user_id  booly  days_until_next_booly
01.01.2020        1   True                      9
02.01.2020        1  False                   None
03.01.2020        1  False                   None
10.01.2020        1   True                   None
01.01.2020        2   True                     51
04.02.2020        2  False                   None
20.02.2020        2  False                   None
21.02.2020        2   True                   None
01.02.2020        3   True                      9
10.02.2020        3   True                     10
20.02.2020        3   True                     29
20.03.2020        3   True                   None
# sample data
df = pd.DataFrame({'date':[
    '01.01.2020','02.01.2020','03.01.2020','10.01.2020',
    '01.01.2020','04.02.2020','20.02.2020','21.02.2020',
    '01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
    'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
    'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
    'booly':[True, False, False, True, True, False, False, True, True, True, True, True]})

# convert the data to datetime format
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# use loc with groupby to calculate the difference between True values
df.loc[df['booly'], 'days_until_next_booly'] = \
    df.loc[df['booly']].groupby('user_id')['date'].diff().shift(-1)

         date  user_id  other_val  booly days_until_next_booly
0  2020-01-01        1          0   True                9 days
1  2020-01-02        1          0  False                   NaT
2  2020-01-03        1          0  False                   NaT
3  2020-01-10        1        100   True                   NaT
4  2020-01-01        2          0   True               51 days
5  2020-02-04        2        100  False                   NaT
6  2020-02-20        2          0  False                   NaT
7  2020-02-21        2         10   True                   NaT
8  2020-02-01        3         10   True                9 days
9  2020-02-10        3          0   True               10 days
10 2020-02-20        3          0   True               29 days
11 2020-03-20        3         10   True                   NaT
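An alternative sketch (same data, not from the answer above) that stays entirely inside the groupby: shift the True dates backwards within each user, then subtract. This avoids relying on the interaction between a per-group diff and a global shift:

```python
import pandas as pd

df = pd.DataFrame({'date': ['01.01.2020','02.01.2020','03.01.2020','10.01.2020',
                            '01.01.2020','04.02.2020','20.02.2020','21.02.2020',
                            '01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
                   'user_id': [1,1,1,1,2,2,2,2,3,3,3,3],
                   'booly': [True, False, False, True, True, False, False, True,
                             True, True, True, True]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# next True date per user, aligned back onto the True rows only
nxt = df[df['booly']].groupby('user_id')['date'].shift(-1)
df['days_until_next_booly'] = (nxt - df['date']).dt.days
print(df)
```

The subtraction aligns on the index, so False rows (and each user's last True row) come out as NaN automatically.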
(
    df
    # first convert the date column to datetime format
    .assign(date=lambda x: pd.to_datetime(x['date'], dayfirst=True))
    # sort the dates
    .sort_values('date')
    # calculate the difference between subsequent dates
    .assign(date_diff=lambda x: x['date'].diff(1).shift(-1))
    # group by the cumulative sum of booly to total the days between True values
    .assign(date_diff_cum=lambda x: x.groupby(x['booly'].cumsum())['date_diff']
            .transform('sum').where(x['booly'] == True))
)

Output:

        date  user_id  other_val  booly date_diff date_diff_cum
  2020-01-01        2          0   True    1 days        9 days
  2020-01-02        1          0  False    1 days           NaT
  2020-01-03        1          0  False    7 days           NaT
  2020-01-10        1        100   True   22 days       22 days
  2020-02-01        1          0   True    0 days        0 days
  2020-02-01        3         10   True    3 days        9 days
  2020-02-04        2         10  False    6 days           NaT
  2020-02-10        3          0   True   10 days       10 days
  2020-02-20        2        100  False    0 days           NaT
  2020-02-20        3          0   True    1 days        1 days
  2020-02-21        2          0   True   28 days       28 days
  2020-03-20        3         10   True       NaT        0 days
How can I perform a value dependent pivot table/Groupby in Pandas?
I have the following dataframe:

  Tran ID Category  Quantity
0     001        A         5
1     001        B         2
2     001        C         3
3     002        A         4
4     002        C         2
5     003        D         6

I want to transform it into:

  Tran ID      A      B      C      D  Quantity
0     001   True   True   True  False        10
1     002   True  False   True  False         6
2     003  False  False  False   True         6

I know I can use groupby to get the sum of Quantity, but I can't figure out how to perform the pivot described above.
Use get_dummies for the indicators with max, and add a new column with the aggregated sum:

# pandas 0.23+
df1 = pd.get_dummies(df.set_index('Tran ID')['Category'], dtype=bool).max(level=0)
# older pandas versions
# df1 = pd.get_dummies(df.set_index('Tran ID')['Category']).astype(bool).max(level=0)

s = df.groupby('Tran ID')['Quantity'].sum()
df2 = df1.assign(Quantity = s).reset_index()
print(df2)

  Tran ID      A      B      C      D  Quantity
0     001   True   True   True  False        10
1     002   True  False   True  False         6
2     003  False  False  False   True         6
Or you can use:

print(df.drop('Category',1).join(df['Category'].str.get_dummies().astype(bool))
        .groupby('Tran ID',as_index=False).sum())

Or, a little easier to read:

df1 = df.drop('Category',1).join(df['Category'].str.get_dummies().astype(bool))
print(df1.groupby('Tran ID',as_index=False).sum())

Both output:

   Tran ID  Quantity      A      B      C      D
0        1        10   True   True   True  False
1        2         6   True  False   True  False
2        3         6  False  False  False   True

pandas.DataFrame.groupby with pandas.Series.str.get_dummies is the way to do it.
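On any recent pandas, pd.crosstab produces the same presence flags directly (a sketch using the question's data; note that max(level=0) in the first answer is deprecated in newer versions):

```python
import pandas as pd

df = pd.DataFrame({'Tran ID': ['001', '001', '001', '002', '002', '003'],
                   'Category': ['A', 'B', 'C', 'A', 'C', 'D'],
                   'Quantity': [5, 2, 3, 4, 2, 6]})

# crosstab counts category occurrences per transaction; astype(bool) -> flags
flags = pd.crosstab(df['Tran ID'], df['Category']).astype(bool)
out = flags.assign(Quantity=df.groupby('Tran ID')['Quantity'].sum()).reset_index()
print(out)
```

assign aligns the summed Quantity on the shared 'Tran ID' index, so no explicit merge is needed.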