How to conditionally drop rows in pandas - python

I have the following dataframe:
True_False cum_val
Date
2018-01-02 False NaN
2018-01-03 False 0.006399
2018-01-04 False 0.010427
2018-01-05 False 0.017461
2018-01-08 False 0.019124
2018-01-09 False 0.020426
2018-01-10 False 0.019314
2018-01-11 False 0.026348
2018-01-12 False 0.033098
2018-01-16 False 0.029573
2018-01-17 False 0.038988
2018-01-18 False 0.037372
2018-01-19 False 0.041757
2018-01-22 False 0.049824
2018-01-23 False 0.051998
2018-01-24 False 0.051438
2018-01-25 False 0.052041
2018-01-26 False 0.063882
2018-01-29 False 0.057150
2018-01-30 True -0.010899
2018-01-31 True -0.010410
2018-02-01 True -0.011058
2018-02-02 True -0.032266
2018-02-05 True -0.073246
2018-02-06 True -0.055805
2018-02-07 True -0.060806
2018-02-08 True -0.098343
2018-02-09 True -0.083407
2018-02-12 False 0.013915
2018-02-13 False 0.016528
2018-02-14 False 0.029930
2018-02-15 False 0.041999
2018-02-16 False 0.042373
2018-02-20 False 0.036531
2018-02-21 False 0.031035
2018-03-06 False 0.013671
How can I drop rows starting at the second True value, through the rest of the run of True values, until the second False value that follows?
For example:
True_False cum_val
Date
2020-01-21 False 0.022808
2020-01-22 False 0.023097
2020-01-23 True 0.001141
2020-01-24 True -0.007901 # <- Start drop here since this is the second True
2020-01-27 True -0.023632
2020-01-28 False -0.013578
2020-01-29 False -0.000867 # <- End Drop Here since this is the second False
2020-01-30 False 0.003134
Edit 1:
I would like to add one more condition on the new df:
2020-01-22 0.000289 False
2020-01-23 0.001141 True
2020-01-27 -0.015731 True # <- Start Drop Here
2020-01-28 0.010054 True
2020-01-29 -0.000867 False
2020-01-30 0.003134 True # <- End Drop Here
2020-02-03 0.007255 True
As you mentioned in the comment, the pattern here is [True, True, True, False, True].
In this case the drop would still start at the second True value, but it would stop right after the first False, even though the value following it has toggled back to True. If the next value is still True, keep dropping until the value after the False.

Let's try using where with ffill (with the parameter limit=2), then boolean filtering:
df[~(df['True_False'].where(df['True_False']).ffill(limit=2).cumsum() > 1)]
Output:
| | Date | True_False | cum_val |
|----|------------|--------------|-----------|
| 0 | 2020-01-21 | False | 1 |
| 1 | 2020-01-22 | False | 2 |
| 2 | 2020-01-23 | True | 3 |
| 7 | 2020-01-28 | False | 8 |
Details:
First, convert the False values to np.nan using where.
Next, fill the first two np.nan after the last True using ffill(limit=2).
Now, use cumsum to number the consecutive True values and select those greater than 1.
Finally, negate the mask to keep the False records before the first True, the first True itself, and everything from the third False record onward.
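To make the intermediate steps visible, here is a minimal sketch on the toy frame used for the output above (the True_False pattern and cum_val values are assumed from that table):
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2020-01-21', periods=8),
    'True_False': [False, False, True, True, True, False, False, False],
    'cum_val': range(1, 9)})

s = df['True_False'].where(df['True_False'])   # False -> NaN, True stays True
filled = s.ffill(limit=2)                      # carry each True forward over at most two following rows
mask = filled.cumsum() > 1                     # cumsum numbers the Trues; > 1 marks the second one onward
print(df[~mask])                               # keeps rows 0, 1, 2 and 7, as in the output above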

Here's what I tried.
The data I created is:
Date True_False cum_val
0 2020-01-21 False 1
1 2020-01-22 False 2
2 2020-01-23 True 3
3 2020-01-24 True 4
4 2020-01-25 True 5
5 2020-01-26 False 6
6 2020-01-27 False 7
7 2020-01-28 False 8
true_count = 0
false_count = 0
drop_continue = False
for index, row in df.iterrows():
    # count Trues until the second one of a run, then start dropping
    if row['True_False'] is True and drop_continue is False:
        true_count += 1
        if true_count == 2:
            drop_continue = True
            df.drop(index, inplace=True)
            true_count = 0
            continue
    # while dropping, remove Trues and the first False; stop at the second False
    if drop_continue is True:
        if row['True_False'] is True:
            df.drop(index, inplace=True)
        if row['True_False'] is False:
            false_count += 1
            if false_count < 2:
                df.drop(index, inplace=True)
            else:
                drop_continue = False
                false_count = 0
Output
Date True_False cum_val
0 2020-01-21 False 1
1 2020-01-22 False 2
2 2020-01-23 True 3
6 2020-01-27 False 7
7 2020-01-28 False 8

You could use Series.shift and Series.bfill:
df = df[~df['True_False'].shift().bfill()]
print(df)
Date True_False cum_val
0 2020-01-21 False 0.022808
1 2020-01-22 False 0.023097
2 2020-01-23 True 0.001141
6 2020-01-29 False -0.000867
7 2020-01-30 False 0.003134
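Here the mask is simply the value of the previous row: shift() moves True_False down one row and bfill() fills the leading NaN, so the negation keeps only rows whose predecessor was False. A minimal sketch of the mask, assuming the same True_False pattern as the example (with an explicit cast back to bool, since shifting a boolean column produces object dtype):
import pandas as pd

s = pd.Series([False, False, True, True, True, False, False, False], name='True_False')
mask = s.shift().bfill().astype(bool)   # for each row: was the previous row True?
print(pd.DataFrame({'True_False': s, 'prev_was_true': mask, 'keep': ~mask}))
# rows are dropped exactly when the previous row was True, i.e. from the
# second True of a run through the first False after the run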

You can do:
import numpy as np

# mark the start of the area you want to drop
df["dropit"] = np.where(df["True_False"] & df["True_False"].shift(1) & np.logical_not(df["True_False"].shift(2)), "start", None)
# mark the end of the drop area
df["dropit"] = np.where(np.logical_not(df["True_False"].shift(1)) & df["True_False"].shift(2), "end", df["dropit"])
# indicate gaps between the different drop areas
df.loc[df["dropit"].shift().eq("end") & df["dropit"].ne("start"), "dropit"] = "keep"
# forward fill
df["dropit"] = df["dropit"].ffill()
# drop the marked areas and the helper "dropit" column
df = df.drop(df.loc[df["dropit"].isin(["start", "end"])].index, axis=0).drop("dropit", axis=1)
Outputs:
True_False cum_val
Date
2018-01-02 False NaN
2018-01-03 False 0.006399
2018-01-04 False 0.010427
2018-01-05 False 0.017461
2018-01-08 False 0.019124
2018-01-09 False 0.020426
2018-01-10 False 0.019314
2018-01-11 False 0.026348
2018-01-12 False 0.033098
2018-01-16 False 0.029573
2018-01-17 False 0.038988
2018-01-18 False 0.037372
2018-01-19 False 0.041757
2018-01-22 False 0.049824
2018-01-23 False 0.051998
2018-01-24 False 0.051438
2018-01-25 False 0.052041
2018-01-26 False 0.063882
2018-01-29 False 0.057150
2018-01-30 True -0.010899
2018-02-14 False 0.029930
2018-02-15 False 0.041999
2018-02-16 False 0.042373
2018-02-20 False 0.036531
2018-02-21 False 0.031035
2018-03-06 False 0.013671
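To see what the helper column looks like before anything is dropped, here is a sketch on the short pattern from the example at the top; it uses shift(fill_value=False) so the shifted columns stay boolean, but it is otherwise the same start/end/keep labelling as above. The rows labelled start or end after the forward fill are exactly the ones removed:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'True_False': [False, False, True, True, True, False, False, False]},
    index=pd.to_datetime(['2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
                          '2020-01-27', '2020-01-28', '2020-01-29', '2020-01-30']))
df.index.name = 'Date'

prev1 = df['True_False'].shift(1, fill_value=False)
prev2 = df['True_False'].shift(2, fill_value=False)
df['dropit'] = np.where(df['True_False'] & prev1 & ~prev2, 'start', None)   # second True of a run
df['dropit'] = np.where(~prev1 & prev2, 'end', df['dropit'])                # second row after the run's last True
df.loc[df['dropit'].shift().eq('end') & df['dropit'].ne('start'), 'dropit'] = 'keep'
print(df.assign(dropit=df['dropit'].ffill()))   # 'start' from 2020-01-24, 'end' at 2020-01-29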

Related

Python pandas: How to match data between two dataframes

The first dataframe (df1) is similar to this:
Result       A      B      C
2021-12-31   False  True   True
2022-01-01   False  False  True
2022-01-02   False  True   False
2022-01-03   True   False  True
df2 is an updated version of df1: the dates are new and columns may have been added. It is similar to this:
Result       A      B      C      D
2022-01-04   False  False  True   True
2022-01-05   True   False  True   True
2022-01-06   False  True   False  True
2022-01-07   False  False  True   True
I want to combine the two dataframes, but I don't know how to do it.
I want to get a result similar to the following:
Result       A      B      C      D
2021-12-31   False  True   True   NaN
2022-01-01   False  False  True   NaN
2022-01-02   False  True   False  NaN
2022-01-03   True   False  True   NaN
2022-01-04   False  False  True   True
2022-01-05   True   False  True   True
2022-01-06   False  True   False  True
2022-01-07   False  False  True   True
Thank you very much!
Use pd.concat while ignoring the indexes:
df_new = pd.concat([df1, df2], ignore_index=True)
Any missing values will be NaN.
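A runnable sketch with the frames from the question (Result kept as a regular column, as shown above):
import pandas as pd

# rebuild the two frames shown in the question
df1 = pd.DataFrame({
    'Result': ['2021-12-31', '2022-01-01', '2022-01-02', '2022-01-03'],
    'A': [False, False, False, True],
    'B': [True, False, True, False],
    'C': [True, True, False, True]})
df2 = pd.DataFrame({
    'Result': ['2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07'],
    'A': [False, True, False, False],
    'B': [False, False, True, False],
    'C': [True, True, False, True],
    'D': [True, True, True, True]})

# concatenate row-wise; column D does not exist in df1, so those cells become NaN
df_new = pd.concat([df1, df2], ignore_index=True)
print(df_new)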

pandas generate a sequence of dates according to a pattern

I have this sequence of dates and I want to create a column with a flag according to a 3-2 pattern: 3 days in a row flagged, then 2 days not flagged etc.
import pandas as pd
date_pattern = pd.date_range(start='2020-01-01', end='2020-06-30')
date_pattern = pd.DataFrame({"my_date": date_pattern})
date_pattern
I want a 'flag' column having, for instance, 1 for the range 01 to 03 Jan, then for 06 to 08 Jan, and so on.
You can take the index values modulo 5 and compare them with 3, which makes every fourth and fifth row False:
date_pattern['flag'] = date_pattern.index % 5 < 3
#alternative for not default index
#date_pattern['flag'] = np.arange(len(date_pattern)) % 5 < 3
print(date_pattern.head(15))
my_date flag
0 2020-01-01 True
1 2020-01-02 True
2 2020-01-03 True
3 2020-01-04 False
4 2020-01-05 False
5 2020-01-06 True
6 2020-01-07 True
7 2020-01-08 True
8 2020-01-09 False
9 2020-01-10 False
10 2020-01-11 True
11 2020-01-12 True
12 2020-01-13 True
13 2020-01-14 False
14 2020-01-15 False
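If you want the flag as 1/0, as described in the question, rather than True/False, the same comparison can simply be cast (a small assumption about the desired dtype):
date_pattern['flag'] = (date_pattern.index % 5 < 3).astype(int)   # 1 for the three flagged days, 0 for the two unflagged ones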

Days between this and next time a column value is True?

I am trying to do a date calculation counting days passing between events in a non-date column in pandas.
I have a pandas dataframe that looks something like this:
df = pd.DataFrame({
    'date': ['01.01.2020', '02.01.2020', '03.01.2020', '10.01.2020',
             '01.01.2020', '04.02.2020', '20.02.2020', '21.02.2020',
             '01.02.2020', '10.02.2020', '20.02.2020', '20.03.2020'],
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'other_val': [0, 0, 0, 100, 0, 100, 0, 10, 10, 0, 0, 10],
    'booly': [True, False, False, True,
              True, False, False, True,
              True, True, True, True]})
Now, I've been unable to figure out how to create a new column stating the number of days that passed between each True value in the 'booly' column, for each user. So for each row with a True in the 'booly' column, how many days is it until the next row with a True in the 'booly' column occurs, like so:
date user_id booly days_until_next_booly
01.01.2020 1 True 9
02.01.2020 1 False None
03.01.2020 1 False None
10.01.2020 1 True None
01.01.2020 2 True 51
04.02.2020 2 False None
20.02.2020 2 False None
21.02.2020 2 True None
01.02.2020 3 True 9
10.02.2020 3 True 10
20.02.2020 3 True 29
20.03.2020 3 True None
# sample data
df = pd.DataFrame({
    'date': ['01.01.2020', '02.01.2020', '03.01.2020', '10.01.2020',
             '01.01.2020', '04.02.2020', '20.02.2020', '21.02.2020',
             '01.02.2020', '10.02.2020', '20.02.2020', '20.03.2020'],
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'other_val': [0, 0, 0, 100, 0, 100, 0, 10, 10, 0, 0, 10],
    'booly': [True, False, False, True,
              True, False, False, True,
              True, True, True, True]})
# convert data to date time format
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# use loc with groupby to calculate the difference between True values
df.loc[df['booly'] == True, 'days_until_next_booly'] = df.loc[df['booly'] == True].groupby('user_id')['date'].diff().shift(-1)
date user_id other_val booly days_until_next_booly
0 2020-01-01 1 0 True 9 days
1 2020-01-02 1 0 False NaT
2 2020-01-03 1 0 False NaT
3 2020-01-10 1 100 True NaT
4 2020-01-01 2 0 True 51 days
5 2020-02-04 2 100 False NaT
6 2020-02-20 2 0 False NaT
7 2020-02-21 2 10 True NaT
8 2020-02-01 3 10 True 9 days
9 2020-02-10 3 0 True 10 days
10 2020-02-20 3 0 True 29 days
11 2020-03-20 3 10 True NaT
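The new column holds Timedelta values (9 days, NaT, ...); if you want plain numbers of days as in the desired output, it can be converted afterwards (assuming the column name used above):
df['days_until_next_booly'] = df['days_until_next_booly'].dt.days   # Timedelta -> number of days, NaN where there is no next True
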
(
    df
    # first convert the date column to datetime format
    .assign(date=lambda x: pd.to_datetime(x['date'], dayfirst=True))
    # sort your dates
    .sort_values('date')
    # calculate the difference between subsequent dates
    .assign(date_diff=lambda x: x['date'].diff(1).shift(-1))
    # group by the cumulative sum of the booly column to accumulate the days between True values
    .assign(date_diff_cum=lambda x: x.groupby(x['booly'].cumsum())['date_diff'].transform('sum').where(x['booly'] == True))
)
Output:
date user_id other_val booly date_diff date_diff_cum
2020-01-01 2 0 True 1 days 9 days
2020-01-02 1 0 False 1 days NaT
2020-01-03 1 0 False 7 days NaT
2020-01-10 1 100 True 22 days 22 days
2020-02-01 1 0 True 0 days 0 days
2020-02-01 3 10 True 3 days 9 days
2020-02-04 2 10 False 6 days NaT
2020-02-10 3 0 True 10 days 10 days
2020-02-20 2 100 False 0 days NaT
2020-02-20 3 0 True 1 days 1 days
2020-02-21 2 0 True 28 days 28 days
2020-03-20 3 10 True NaT 0 days

pandas how to check differences between column values are within a range or not in each group

I have the following df,
cluster_id date
1 2018-01-02
1 2018-02-01
1 2018-03-30
2 2018-04-01
2 2018-04-23
2 2018-05-18
3 2018-06-01
3 2018-07-30
3 2018-09-30
I'd like to create a boolean column recur_pmt, which is set to True if all differences between consecutive values of date in each cluster (df.groupby('cluster_id')) are 20 < x < 40, and False otherwise. So the result is like:
cluster_id date recur_pmt
1 2018-01-02 False
1 2018-02-01 False
1 2018-03-30 False
2 2018-04-01 True
2 2018-04-23 True
2 2018-05-18 True
3 2018-06-01 False
3 2018-07-30 False
3 2018-09-30 False
I tried
df['recur_pmt'] = df.groupby('cluster_id')['date'].apply(
    lambda x: (20 < x.diff().dropna().dt.days < 40).all())
but it did not work. I am also wondering can it use transform as well in this case.
Use transform with Series.between and parameter inclusive=False:
df['recur_pmt'] = df.groupby('cluster_id')['date'].transform(
    lambda x: (x.diff().dropna().dt.days.between(20, 40, inclusive=False)).all())
print (df)
cluster_id date recur_pmt
0 1 2018-01-02 False
1 1 2018-02-01 False
2 1 2018-03-30 False
3 2 2018-04-01 True
4 2 2018-04-23 True
5 2 2018-05-18 True
6 3 2018-06-01 False
7 3 2018-07-30 False
8 3 2018-09-30 False
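Note: in recent pandas versions the boolean inclusive argument to Series.between has been replaced by strings, so the same call would be written as follows (logic unchanged):
df['recur_pmt'] = df.groupby('cluster_id')['date'].transform(
    lambda x: (x.diff().dropna().dt.days.between(20, 40, inclusive='neither')).all())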

Ignoring Duplicates on Max in GroupBy - Pandas

I've read this thread about grouping and getting max: Apply vs transform on a group object.
It works perfectly and is helpful if your max is unique within a group, but I'm running into an issue: I need to ignore duplicates within a group, take the max of the remaining unique values, and then put that flag back into the Series.
Input (named df1):
date val
2004-01-01 0
2004-02-01 0
2004-03-01 0
2004-04-01 0
2004-05-01 0
2004-06-01 0
2004-07-01 0
2004-08-01 0
2004-09-01 0
2004-10-01 0
2004-11-01 0
2004-12-01 0
2005-01-01 11
2005-02-01 11
2005-03-01 8
2005-04-01 5
2005-05-01 0
2005-06-01 0
2005-07-01 2
2005-08-01 1
2005-09-01 0
2005-10-01 0
2005-11-01 3
2005-12-01 3
My code:
df1['peak_month'] = df1.groupby(df1.date.dt.year)['val'].transform(max) == df1['val']
My Output:
date val max
2004-01-01 0 true #notice how all duplicates are true in 2004
2004-02-01 0 true
2004-03-01 0 true
2004-04-01 0 true
2004-05-01 0 true
2004-06-01 0 true
2004-07-01 0 true
2004-08-01 0 true
2004-09-01 0 true
2004-10-01 0 true
2004-11-01 0 true
2004-12-01 0 true
2005-01-01 11 true #notice how these two values
2005-02-01 11 true #are the max values for 2005 and are true
2005-03-01 8 false
2005-04-01 5 false
2005-05-01 0 false
2005-06-01 0 false
2005-07-01 2 false
2005-08-01 1 false
2005-09-01 0 false
2005-10-01 0 false
2005-11-01 3 false
2005-12-01 3 false
Expected Output:
date val max
2004-01-01 0 false #notice how all duplicates are false in 2004
2004-02-01 0 false #because they are the same and all vals are max
2004-03-01 0 false
2004-04-01 0 false
2004-05-01 0 false
2004-06-01 0 false
2004-07-01 0 false
2004-08-01 0 false
2004-09-01 0 false
2004-10-01 0 false
2004-11-01 0 false
2004-12-01 0 false
2005-01-01 11 false #notice how these two values
2005-02-01 11 false #are the max values for 2005 but are false
2005-03-01 8 true #this is the second max val and is true
2005-04-01 5 false
2005-05-01 0 false
2005-06-01 0 false
2005-07-01 2 false
2005-08-01 1 false
2005-09-01 0 false
2005-10-01 0 false
2005-11-01 3 false
2005-12-01 3 false
For reference:
df1 = pd.DataFrame({
    'val': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11, 8, 5, 0, 0, 2, 1, 0, 0, 3, 3],
    'date': ['2004-01-01', '2004-02-01', '2004-03-01', '2004-04-01', '2004-05-01', '2004-06-01',
             '2004-07-01', '2004-08-01', '2004-09-01', '2004-10-01', '2004-11-01', '2004-12-01',
             '2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01', '2005-05-01', '2005-06-01',
             '2005-07-01', '2005-08-01', '2005-09-01', '2005-10-01', '2005-11-01', '2005-12-01']})
Not the slickest solution, but it works. The idea is to first determine the unique values appearing in each year, and then do your transform just on those unique values.
# Determine the unique values appearing in each year.
df1['year'] = df1.date.dt.year
unique_vals = df1.drop_duplicates(subset=['year', 'val'], keep=False)
# Max transform on the unique values.
df1['peak_month'] = unique_vals.groupby('year')['val'].transform(max) == unique_vals['val']
# Fill NaN's as False, drop extra column.
df1['peak_month'].fillna(False, inplace=True)
df1.drop('year', axis=1, inplace=True)
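Two small notes for running this against the reference frame: the date column there is built from strings, so it needs converting before .dt.year is available, and the boolean comparison is written back by index alignment, which is why the rows removed by drop_duplicates come back as NaN and need the fillna(False).
df1['date'] = pd.to_datetime(df1['date'])   # required before df1.date.dt.year can be used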
