Days between this and next time a column value is True? - python
I am trying to do a date calculation that counts the days passing between events marked in a non-date column in pandas.
I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'date':[
'01.01.2020','02.01.2020','03.01.2020','10.01.2020',
'01.01.2020','04.02.2020','20.02.2020','21.02.2020',
'01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
'booly':[True, False, False, True,
True, False, False, True,
True, True, True, True]})
Now, I've been unable to figure out how to create a new column stating the number of days that passed between each True value in the 'booly' column, for each user. So for each row with a True in the 'booly' column, how many days is it until the next row with a True in the 'booly' column occurs, like so:
date user_id booly days_until_next_booly
01.01.2020 1 True 9
02.01.2020 1 False None
03.01.2020 1 False None
10.01.2020 1 True None
01.01.2020 2 True 51
04.02.2020 2 False None
20.02.2020 2 False None
21.02.2020 2 True None
01.02.2020 3 True 9
10.02.2020 3 True 10
20.02.2020 3 True 29
20.03.2020 3 True None
# sample data
df = pd.DataFrame({'date':[
'01.01.2020','02.01.2020','03.01.2020','10.01.2020',
'01.01.2020','04.02.2020','20.02.2020','21.02.2020',
'01.02.2020','10.02.2020','20.02.2020','20.03.2020'],
'user_id':[1,1,1,1,2,2,2,2,3,3,3,3],
'other_val':[0,0,0,100,0,100,0,10,10,0,0,10],
'booly':[True, False, False, True,
True, False, False, True,
True, True, True, True]})
# convert data to date time format
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# use loc with groupby to calculate the difference between consecutive True values per user
df.loc[df['booly'], 'days_until_next_booly'] = df.loc[df['booly']].groupby('user_id')['date'].diff().shift(-1)
# the ungrouped shift(-1) is safe here: the first diff in every group is NaT,
# so no value leaks from one user into the next
date user_id other_val booly days_until_next_booly
0 2020-01-01 1 0 True 9 days
1 2020-01-02 1 0 False NaT
2 2020-01-03 1 0 False NaT
3 2020-01-10 1 100 True NaT
4 2020-01-01 2 0 True 51 days
5 2020-02-04 2 100 False NaT
6 2020-02-20 2 0 False NaT
7 2020-02-21 2 10 True NaT
8 2020-02-01 3 10 True 9 days
9 2020-02-10 3 0 True 10 days
10 2020-02-20 3 0 True 29 days
11 2020-03-20 3 10 True NaT
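A slightly more defensive variant keeps the whole calculation inside the groupby, so nothing can cross user boundaries even if the frame is reordered. This is a sketch of the same idea using a negative diff (next True date minus current True date):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'date': [
    '01.01.2020', '02.01.2020', '03.01.2020', '10.01.2020',
    '01.01.2020', '04.02.2020', '20.02.2020', '21.02.2020',
    '01.02.2020', '10.02.2020', '20.02.2020', '20.03.2020'],
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'booly': [True, False, False, True, True, False, False, True,
              True, True, True, True]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# diff(-1) is current-minus-next within each group, so negating it
# gives "days until the next True row", with NaT on each user's last True row
mask = df['booly']
df.loc[mask, 'days_until_next_booly'] = -df.loc[mask].groupby('user_id')['date'].diff(-1)
```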
(
df
# first convert the date column to datetime format
.assign(date=lambda x: pd.to_datetime(x['date'], dayfirst=True))
# sort your dates
.sort_values('date')
# calculate the difference between subsequent dates
.assign(date_diff=lambda x: x['date'].diff(1).shift(-1))
# group on the cumulative sum of booly to total the days between consecutive True values
.assign(date_diff_cum=lambda x: x.groupby(x['booly'].cumsum())['date_diff'].transform('sum').where(x['booly']))
)
Output:
date user_id other_val booly date_diff date_diff_cum
2020-01-01 1 0 True 0 days 0 days
2020-01-01 2 0 True 1 days 9 days
2020-01-02 1 0 False 1 days NaT
2020-01-03 1 0 False 7 days NaT
2020-01-10 1 100 True 22 days 22 days
2020-02-01 3 10 True 3 days 9 days
2020-02-04 2 100 False 6 days NaT
2020-02-10 3 0 True 10 days 10 days
2020-02-20 2 0 False 0 days NaT
2020-02-20 3 0 True 1 days 1 days
2020-02-21 2 10 True 28 days 28 days
2020-03-20 3 10 True NaT 0 days
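Note that the chain above sorts by date alone, so a gap can run across users (user 2's 2020-01-01 ends up with 9 days rather than 51). A per-user variant of the same chain, sketched here with the sample data, adds user_id to the sort and the groupby:

```python
import pandas as pd

df = pd.DataFrame({'date': [
    '01.01.2020', '02.01.2020', '03.01.2020', '10.01.2020',
    '01.01.2020', '04.02.2020', '20.02.2020', '21.02.2020',
    '01.02.2020', '10.02.2020', '20.02.2020', '20.03.2020'],
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'booly': [True, False, False, True, True, False, False, True,
              True, True, True, True]})

out = (
    df
    .assign(date=lambda x: pd.to_datetime(x['date'], dayfirst=True))
    .sort_values(['user_id', 'date'])
    # gap to the next row, computed separately for each user
    .assign(date_diff=lambda x: x.groupby('user_id')['date'].diff(-1).mul(-1))
    # total the gaps from each True row up to the next True row
    .assign(days_until_next=lambda x: x.groupby(['user_id', x['booly'].cumsum()])['date_diff']
                                       .transform('sum')
                                       .where(x['booly']))
)
```

As in the chain above, the last True row of each user reads 0 days rather than NaT, because summing an all-NaT group yields a zero timedelta.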
Related
How do I create a Boolean column that places a flag on two days before and two days after a holiday?
I have a data frame with Boolean columns denoting holidays. I would like to add another Boolean column that flags the two days before and two days after any holiday, across all of the holiday columns. For example, take the data below:

import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") + DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date':date_range, 'peanutbutterday':peanutbutterday, 'jellyday':jellyday, 'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)

What I would expect is an additional column, titled below as holiday_bookend, that looks like the following:

import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") + DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_bookend = [0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0]
holiday_dict = {'date':date_range, 'peanutbutterday':peanutbutterday, 'jellyday':jellyday, 'crackerday':crackerday, 'holiday_bookend':holiday_bookend}
df = pd.DataFrame.from_dict(holiday_dict)

I'm not sure if I should try a loop; I haven't worked that out conceptually, so I'm kind of stuck.
I tried to incorporate the suggestion from here: How To Identify days before and after a holiday within pandas? but it seemed to require a separate column for each holiday. I need one column that takes all of the holiday columns into account.
Basically, add two extra columns: one that detects when any holiday has occurred (using the any method), and one that marks two days before and two days after (using the shift method with -2 and +2). The columns work like this: the any column contains all of the holiday days, and the shifts by -2 and +2 handle the two-day offsets. Side note: avoid using for loops over a pandas dataframe; the vectorised methods will almost always be faster and preferable. So you can do this:

import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") + DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date':date_range, 'peanutbutterday':peanutbutterday, 'jellyday':jellyday, 'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)
# add extra columns
df["holiday"] = df[["peanutbutterday", "jellyday", "crackerday"]].any(axis=1).astype(bool)
# column with 2 days before and 2 days after
df["holiday_extended"] = df["holiday"] | df["holiday"].shift(-2) | df["holiday"].shift(2)

which returns this:

date peanutbutterday jellyday crackerday holiday holiday_extended
0 2020-01-11 0 0 0 False False
1 2020-01-12 0 0 0 False False
2 2020-01-13 0 0 0 False False
3 2020-01-14 0 0 0 False False
4 2020-01-15 0 0 0 False False
5 2020-01-16 0 0 0 False True
6 2020-01-17 0 0 0 False True
7 2020-01-18 1 0 0 True True
8 2020-01-19 1 0 0 True True
9 2020-01-20 0 0 0 False True
10 2020-01-21 0 0 0 False True
11 2020-01-22 0 0 0 False True
12 2020-01-23 0 0 0 False True
13 2020-01-24 0 1 1 True True
14 2020-01-25 0 1 1 True True
15 2020-01-26 0 0 0 False True
16 2020-01-27 0 0 0 False True
17 2020-01-28 0 0 0 False False
18 2020-01-29 0 0 0 False False
19 2020-01-30 0 0 0 False False
20 2020-01-31 0 0 0 False False
21 2020-02-01 0 0 0 False False
22 2020-02-02 0 0 0 False False
23 2020-02-03 0 0 0 False False
24 2020-02-04 0 0 0 False False
25 2020-02-05 0 0 0 False False
26 2020-02-06 0 0 0 False False
27 2020-02-07 0 0 0 False False
28 2020-02-08 0 0 0 False False
29 2020-02-09 0 0 0 False False
30 2020-02-10 0 0 0 False False
31 2020-02-11 0 0 0 False True
32 2020-02-12 0 0 0 False True
33 2020-02-13 0 0 1 True True
34 2020-02-14 0 0 1 True True
35 2020-02-15 0 0 1 True True
36 2020-02-16 0 0 1 True True
37 2020-02-17 0 0 0 False True
38 2020-02-18 0 0 0 False True
39 2020-02-19 0 0 0 False False
40 2020-02-20 0 0 0 False False
41 2020-02-21 0 0 0 False False
42 2020-02-22 0 0 0 False False
43 2020-02-23 0 0 0 False False
44 2020-02-24 0 0 0 False False
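The chained shifts mark only days exactly two away (plus the holidays themselves); if every day within two days of a holiday should be flagged, a centred rolling maximum is a common alternative. A sketch on a toy 0/1 flag column:

```python
import pandas as pd

s = pd.Series([0, 0, 0, 1, 1, 0, 0, 0])
# a centred 5-day window is True for any day within 2 days of a flagged day
near = s.rolling(5, center=True, min_periods=1).max().astype(bool)
```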
import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset
date_range = pd.date_range(start = pd.to_datetime("2020-01-10") + DateOffset(days=1), periods = 45, freq = 'D').to_list()
peanutbutterday = [0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
jellyday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
crackerday = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0]
holiday_dict = {'date':date_range, 'peanutbutterday':peanutbutterday, 'jellyday':jellyday, 'crackerday':crackerday}
df = pd.DataFrame.from_dict(holiday_dict)
# Grab all the holidays
holidays = df.loc[df[df.columns[1:]].sum(axis = 1) > 0, 'date'].values
# Subtract every day by every holiday and get the absolute time difference in days
days_from_holiday = np.subtract.outer(df.date.values, holidays)
days_from_holiday = np.min(np.abs(days_from_holiday), axis = 1)
days_from_holiday = np.array(days_from_holiday, dtype = 'timedelta64[D]')
# Make comparison
df['near_holiday'] = days_from_holiday <= np.timedelta64(2, 'D')
# If you want it to read 0 or 1
df['near_holiday'] = df['near_holiday'].astype('int')
print(df)

First, we need to grab all the holidays: if we sum across the holiday columns, any row with a sum > 0 is a holiday, and we pull that date. Then we subtract every day by every holiday, which is quickly done using np.subtract.outer, and take the minimum of the absolute values to find how close each date is to its nearest holiday. Then we convert that to days, because the default unit is nanoseconds. After that, it's just a matter of making the comparison and assigning it to the column.
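To see what np.subtract.outer is doing, here it is on plain integers: it builds the full matrix of pairwise differences, after which the row-wise minimum of the absolute values is each element's distance to its nearest target.

```python
import numpy as np

days = np.array([1, 5, 10])
holidays = np.array([2, 8])
diffs = np.subtract.outer(days, holidays)   # diffs[i, j] = days[i] - holidays[j]
nearest = np.min(np.abs(diffs), axis=1)     # distance to the closest holiday
```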
pandas generate a sequence of dates according to a pattern
I have this sequence of dates and I want to create a column with a flag according to a 3-2 pattern: 3 days in a row flagged, then 2 days not flagged, and so on.

import pandas as pd
date_pattern = pd.date_range(start='2020-01-01', end='2020-06-30')
date_pattern = pd.DataFrame({"my_date": date_pattern})
date_pattern

I want a 'flag' column having, for instance, 1 for the range 01 to 03 Jan, then 06 to 08 Jan, etc.
You can use modulo 5 on the index values and compare for less than 3, so each fourth and fifth row is False:

date_pattern['flag'] = date_pattern.index % 5 < 3
# alternative for a non-default index
# date_pattern['flag'] = np.arange(len(date_pattern)) % 5 < 3
print(date_pattern.head(15))

my_date flag
0 2020-01-01 True
1 2020-01-02 True
2 2020-01-03 True
3 2020-01-04 False
4 2020-01-05 False
5 2020-01-06 True
6 2020-01-07 True
7 2020-01-08 True
8 2020-01-09 False
9 2020-01-10 False
10 2020-01-11 True
11 2020-01-12 True
12 2020-01-13 True
13 2020-01-14 False
14 2020-01-15 False
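If the pattern ever changes (say 4 on, 3 off), the same flag can be built by tiling the pattern itself, which may read more directly. A sketch of that alternative:

```python
import numpy as np
import pandas as pd

date_pattern = pd.DataFrame({"my_date": pd.date_range(start='2020-01-01', end='2020-06-30')})
pattern = [True, True, True, False, False]
n = len(date_pattern)
# repeat the 3-on / 2-off pattern enough times, then trim to the frame's length
date_pattern['flag'] = np.tile(pattern, n // len(pattern) + 1)[:n]
```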
dataframe set true after the time that meets specific condition daily
I need to set the condition column to True from the time at which the price first reached 20 or higher each day, like below. I want to avoid using the apply function because I have several million rows, and I think apply would take too much time.
Use GroupBy.cummax per day and compare for greater than or equal with Series.ge:

df['datetime'] = pd.to_datetime(df['datetime'])
df['condition'] = df.groupby([df['datetime'].dt.date])['price'].cummax().ge(20)

If you also need to test per compid:

df['condition'] = df.groupby(['compid', df['datetime'].dt.date])['price'].cummax().ge(20)
print(df)

compid datetime price condition
0 1 2020-11-06 00:00:00 10 False
1 1 2020-11-06 00:00:10 20 True
2 1 2020-11-06 00:00:20 5 True
3 1 2020-11-07 00:00:00 20 True
4 1 2020-11-07 00:00:10 5 True
5 1 2020-11-07 00:00:20 25 True
You can use np.where with a grouped cumsum:

df['condition'] = np.where(df.groupby(df.datetime.dt.date).price.cumsum().ge(20), 'TRUE', 'FALSE')

compid datetime price condition
0 1 2020-11-06 00:00:00 10 FALSE
1 1 2020-11-06 00:00:10 20 TRUE
2 1 2020-11-06 00:00:20 5 TRUE
3 1 2020-11-07 00:00:00 20 TRUE
4 1 2020-11-07 00:00:10 5 TRUE
5 1 2020-11-07 00:00:20 25 TRUE

Or, if you need bool values in the condition column, do this:

df['condition'] = np.where(df.groupby(df.datetime.dt.date).price.cumsum().ge(20), True, False)

compid datetime price condition
0 1 2020-11-06 00:00:00 10 False
1 1 2020-11-06 00:00:10 20 True
2 1 2020-11-06 00:00:20 5 True
3 1 2020-11-07 00:00:00 20 True
4 1 2020-11-07 00:00:10 5 True
5 1 2020-11-07 00:00:20 25 True
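Note that the two answers are not equivalent: cummax flags rows once any single price that day has reached 20, while cumsum flags rows once the running total reaches 20, so they can disagree whenever no individual price hits the threshold. A small illustration with three prices on one day:

```python
import pandas as pd

price = pd.Series([10, 15, 5])
by_cummax = price.cummax().ge(20)   # no single price ever reaches 20
by_cumsum = price.cumsum().ge(20)   # the running total reaches 20 at the second row
```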
pandas how to check differences between column values are within a range or not in each group
I have the following df:

cluster_id date
1 2018-01-02
1 2018-02-01
1 2018-03-30
2 2018-04-01
2 2018-04-23
2 2018-05-18
3 2018-06-01
3 2018-07-30
3 2018-09-30

I'd like to create a boolean column recur_pmt, which is set to True if all differences between consecutive values of date in each cluster (df.groupby('cluster_id')) are 20 < x < 40, and False otherwise. So the result is like:

cluster_id date recur_pmt
1 2018-01-02 False
1 2018-02-01 False
1 2018-03-30 False
2 2018-04-01 True
2 2018-04-23 True
2 2018-05-18 True
3 2018-06-01 False
3 2018-07-30 False
3 2018-09-30 False

I tried

df['recur_pmt'] = df.groupby('cluster_id')['date'].apply(
    lambda x: (20 < x.diff().dropna().dt.days < 40).all())

but it did not work. I am also wondering whether transform can be used here as well.
Use transform with Series.between and parameter inclusive=False:

df['recur_pmt'] = df.groupby('cluster_id')['date'].transform(
    lambda x: (x.diff().dropna().dt.days.between(20, 40, inclusive=False)).all())
print (df)

cluster_id date recur_pmt
0 1 2018-01-02 False
1 1 2018-02-01 False
2 1 2018-03-30 False
3 2 2018-04-01 True
4 2 2018-04-23 True
5 2 2018-05-18 True
6 3 2018-06-01 False
7 3 2018-07-30 False
8 3 2018-09-30 False
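Note that inclusive=False was later removed: from pandas 1.3 on, Series.between takes string values for inclusive, so the strict 20 < x < 40 test looks like this:

```python
import pandas as pd

days = pd.Series([20, 22, 25, 40])
# "neither" excludes both endpoints, matching the old inclusive=False
strict = days.between(20, 40, inclusive="neither")
```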
Ignoring Duplicates on Max in GroupBy - Pandas
I've read this thread about grouping and getting the max: Apply vs transform on a group object. It works perfectly and is helpful if your max is unique to a group, but I'm running into an issue with ignoring duplicates in a group, getting the max of the unique items, then putting it back into the DataSeries.

Input (named df1):

date val
2004-01-01 0
2004-02-01 0
2004-03-01 0
2004-04-01 0
2004-05-01 0
2004-06-01 0
2004-07-01 0
2004-08-01 0
2004-09-01 0
2004-10-01 0
2004-11-01 0
2004-12-01 0
2005-01-01 11
2005-02-01 11
2005-03-01 8
2005-04-01 5
2005-05-01 0
2005-06-01 0
2005-07-01 2
2005-08-01 1
2005-09-01 0
2005-10-01 0
2005-11-01 3
2005-12-01 3

My code:

df1['peak_month'] = df1.groupby(df1.date.dt.year)['val'].transform(max) == df1['val']

My Output:

date val max
2004-01-01 0 true #notice how all duplicates are true in 2004
2004-02-01 0 true
2004-03-01 0 true
2004-04-01 0 true
2004-05-01 0 true
2004-06-01 0 true
2004-07-01 0 true
2004-08-01 0 true
2004-09-01 0 true
2004-10-01 0 true
2004-11-01 0 true
2004-12-01 0 true
2005-01-01 11 true #notice how these two values
2005-02-01 11 true #are the max values for 2005 and are true
2005-03-01 8 false
2005-04-01 5 false
2005-05-01 0 false
2005-06-01 0 false
2005-07-01 2 false
2005-08-01 1 false
2005-09-01 0 false
2005-10-01 0 false
2005-11-01 3 false
2005-12-01 3 false

Expected Output:

date val max
2004-01-01 0 false #notice how all duplicates are false in 2004
2004-02-01 0 false #because they are the same and all vals are max
2004-03-01 0 false
2004-04-01 0 false
2004-05-01 0 false
2004-06-01 0 false
2004-07-01 0 false
2004-08-01 0 false
2004-09-01 0 false
2004-10-01 0 false
2004-11-01 0 false
2004-12-01 0 false
2005-01-01 11 false #notice how these two values
2005-02-01 11 false #are the max values for 2005 but are false
2005-03-01 8 true #this is the second max val and is true
2005-04-01 5 false
2005-05-01 0 false
2005-06-01 0 false
2005-07-01 2 false
2005-08-01 1 false
2005-09-01 0 false
2005-10-01 0 false
2005-11-01 3 false
2005-12-01 3 false

For reference:

df1 = pd.DataFrame({'val':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11, 8, 5, 0, 0, 2, 1, 0, 0, 3, 3],
'date':['2004-01-01','2004-02-01','2004-03-01','2004-04-01','2004-05-01','2004-06-01','2004-07-01','2004-08-01','2004-09-01','2004-10-01','2004-11-01','2004-12-01','2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01','2005-06-01','2005-07-01','2005-08-01','2005-09-01','2005-10-01','2005-11-01','2005-12-01']})
# the date column must be datetime for .dt.year to work
df1['date'] = pd.to_datetime(df1['date'])
Not the slickest solution, but it works. The idea is to first determine the unique values appearing in each year, and then run the transform on just those unique values.

# Determine the unique values appearing in each year.
df1['year'] = df1.date.dt.year
unique_vals = df1.drop_duplicates(subset=['year', 'val'], keep=False)
# Max transform on the unique values.
df1['peak_month'] = unique_vals.groupby('year')['val'].transform(max) == unique_vals['val']
# Fill NaN's as False, drop the extra column.
df1['peak_month'] = df1['peak_month'].fillna(False)
df1.drop('year', axis=1, inplace=True)
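The key detail is keep=False, which drops every copy of a duplicated (year, val) pair rather than keeping the first, so a year whose maximum is tied disappears from the candidates entirely. A minimal illustration:

```python
import pandas as pd

df = pd.DataFrame({'year': [2004, 2004, 2005, 2005, 2005],
                   'val':  [0, 0, 11, 11, 8]})
# keep=False removes all rows that have any duplicate, not just the repeats
uniq = df.drop_duplicates(subset=['year', 'val'], keep=False)
```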