How can I combine the dataframe if the date is consecutive? - python

I am new to Python and pandas.
I have the following dataframe. I would like to combine the Start and End dates if they fall on consecutive days.
import datetime as dt
import pandas as pd

data = {"Project": ["A", "A", "A", "A", "B", "B"],
        "Start": [dt.datetime(2020, 1, 1), dt.datetime(2020, 1, 16), dt.datetime(2020, 1, 31),
                  dt.datetime(2020, 7, 1), dt.datetime(2020, 1, 31), dt.datetime(2020, 2, 16)],
        "End": [dt.datetime(2020, 1, 15), dt.datetime(2020, 1, 30), dt.datetime(2020, 2, 15),
                dt.datetime(2020, 7, 15), dt.datetime(2020, 2, 15), dt.datetime(2020, 2, 20)]}
df = pd.DataFrame(data)
Project Start End
0 A 2020-01-01 2020-01-15
1 A 2020-01-16 2020-01-30
2 A 2020-01-31 2020-02-15
3 A 2020-07-01 2020-07-15
4 B 2020-01-31 2020-02-15
5 B 2020-02-16 2020-02-20
And my expected result:
Project Start End
0 A 2020-01-01 2020-02-15
1 A 2020-07-01 2020-07-15
2 B 2020-01-31 2020-02-20
If the day after one row's End equals the next row's Start, I would like to combine the two rows.
Is there any pandas function that can do this?
Thanks a lot!

Create a mask with groupby and shift, then assign the values directly and drop_duplicates:
# True where the next row's Start follows this row's End by at most one day
mask = df.groupby("Project").apply(lambda d: (d["Start"].shift(-1) - d["End"]).dt.days <= 1).reset_index(drop=True)
# pull the following row's End onto the row it continues, then drop the leftover rows
df.loc[mask, "End"] = df["End"].shift(-1)
print(df.drop_duplicates(subset=["Project", "End"], keep="first"))
Project Start End
0 A 2020-01-01 2020-01-30
1 A 2020-01-16 2020-02-15
3 A 2020-07-01 2020-07-15
4 B 2020-01-31 2020-02-20
Note that a single pass merges rows only pairwise, so project A's three-row chain is not fully collapsed.
For chains spanning multiple rows instead, one way is to expand each interval into an array of dates in long form with a list comprehension and pd.date_range, then build group labels from the day gaps with cumsum, and finally take the min/max date of each group (starting again from the original df):
# one (Project, date) tuple per active day
s = [(i[0], x) for i in df.to_numpy() for x in pd.date_range(*i[1:])]
new = pd.DataFrame(index=pd.MultiIndex.from_tuples(s, names=["Project", "Date"])).reset_index()
# a new group starts wherever consecutive days are more than 1 day apart
mask = new.groupby("Project")["Date"].diff().dt.days.gt(1).cumsum()
print(new.groupby(["Project", mask]).agg(["min", "max"]))
                    Date
                     min        max
Project Date
A       0     2020-01-01 2020-02-15
        1     2020-07-01 2020-07-15
B       1     2020-01-31 2020-02-20
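For completeness, here is a sketch of a third option (not from the original answer) that merges chains of any length without expanding to daily rows: sort within each Project, start a new block wherever the gap to the previous End exceeds one day, label blocks with cumsum, and aggregate. It assumes Start/End are datetime and intervals within a project don't overlap:
# starting again from the original df
df = df.sort_values(["Project", "Start"])
prev_end = df.groupby("Project")["End"].shift()
# True for the first row of each project or when the gap exceeds one day
new_block = prev_end.isna() | (df["Start"] - prev_end).dt.days.gt(1)
block = new_block.cumsum().rename("block")
out = (df.groupby(["Project", block])
         .agg({"Start": "min", "End": "max"})
         .reset_index()
         .drop(columns="block"))
On the sample data this produces the three expected rows directly.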

Related

Pandas: Combine rows with consecutive dates (with NaT) within groups of same id

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying other posts' answers, but they don't really match my use case.
Thanks in advance!
You can approach this as follows:
Get the day diff of each pair of consecutive entries within the same group by subtracting the previous End from the current Start using GroupBy.shift().
Set a group number group_no such that a new group number is issued when the day diff from the previous entry within the group is greater than 1.
Then group by Id and group_no and aggregate the Start and End dates of each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False when grouping. Furthermore, to get the last entry of End within each group, we use x.iloc[-1] instead of 'last'.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()
df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
Extract the month from both date columns:
df['sMonth'] = pd.to_datetime(df['Start']).dt.month
df['eMonth'] = pd.to_datetime(df['End']).dt.month
Now group the dataframe by ['Id','Feature1','Feature2','sMonth','eMonth'] to get the result:
df.groupby(['Id','Feature1','Feature2','sMonth','eMonth']).agg({'Start':'min','End':'max'}).reset_index().drop(['sMonth','eMonth'],axis=1)
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15
(Grouping by month silently drops the NaT row for B, which is why the B row here differs from the expected output above.)

How to check for interval overlap for grouped item on two dataframes?

I have two dataframes, df1 and df2. df1 has three columns: group, startdate1, and enddate1; df2 also has three columns: group, startdate2, and enddate2. For each group in df1, I'd like to check whether the interval (startdate1, enddate1) overlaps with any interval (startdate2, enddate2) for the same group.
I found this post (Is it possible to use Pandas Overlap in a Dataframe?), which used pandas.IntervalIndex.overlaps to check interval overlap. It's very similar to my question, but I'm struggling with how to use groupby with pandas.IntervalIndex.overlaps (or should I use other methods)? Below are some sample data:
df1:
group startdate1 enddate1
A 2017-07-01 2018-06-30
B 2017-07-01 2018-06-30
A 2018-07-01 2019-06-30
B 2019-07-01 2020-06-30
df2:
group startdate2 enddate2
A 2017-05-01 2018-04-30
A 2019-10-01 2020-01-31
B 2017-07-02 2018-06-29
B 2018-07-01 2019-06-30
The expected output is to add a column of 1 or 0 to df1, indicating whether there's any interval overlap with df2 for the same group.
df_output:
group startdate1 enddate1 flag
A 2017-07-01 2018-06-30 1
B 2017-07-01 2018-06-30 1
A 2018-07-01 2019-06-30 0
B 2019-07-01 2020-06-30 0
Thank you!
You can make a cartesian join within groups, find the indexes of records in df1 whose date range overlaps with df2, and then add the flag by checking whether each record's index is in that list:
ixs = (df1.reset_index().merge(df2, on=['group'])
          .query('(startdate1 < enddate2) & (enddate1 > startdate2)'))['index']
df1.assign(flag=df1.index.isin(ixs).astype(int))
Output:
group startdate1 enddate1 flag
0 A 2017-07-01 2018-06-30 1
1 B 2017-07-01 2018-06-30 1
2 A 2018-07-01 2019-06-30 0
3 B 2019-07-01 2020-06-30 0
P.S. I'm assuming all dates are already in datetime format; otherwise, we need to pd.to_datetime(...) those columns first.
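Since the question specifically asks about pandas.IntervalIndex.overlaps, here is a minimal per-group sketch using it (flag_overlaps is just an illustrative name, the row loop only makes sense for small frames, and the date columns are assumed to be datetime already):
def flag_overlaps(df1, df2):
    flags = []
    for _, row in df1.iterrows():
        sub = df2[df2["group"] == row["group"]]
        # closed="both" counts shared endpoints as an overlap, which differs
        # slightly from the strict inequalities in the query above
        iv = pd.IntervalIndex.from_arrays(sub["startdate2"], sub["enddate2"], closed="both")
        target = pd.Interval(row["startdate1"], row["enddate1"], closed="both")
        flags.append(int(iv.overlaps(target).any()))
    return df1.assign(flag=flags)

df_output = flag_overlaps(df1, df2)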

Create New DataFrame, assigning a count for each instance in a time frame

Below is a script for a simplified version of the df in question:
plan_dates = pd.DataFrame({'start_date': ['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05'],
                           'end_date': ['2021-01-03','2021-01-04','2021-02-03','2021-03-04','2021-03-05']})
plan_dates
start_date end_date
0 2021-01-01 2021-01-03
1 2021-01-02 2021-01-04
2 2021-01-03 2021-02-03
3 2021-01-04 2021-03-04
4 2021-01-05 2021-03-05
I would like to create a new DataFrame which has 2 columns:
date
count of active plans (the number of rows in plan_dates whose start_date/end_date interval contains the date)
INTENDED DF:
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
Any help would be greatly appreciated.
First convert both columns to datetime and add one day to end_date. Then repeat each index by its day span with Index.repeat, add a per-row day counter via GroupBy.cumcount with to_timedelta, and finally count dates with Series.value_counts, with some cleanup and conversion back to a DataFrame:
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df1 = (df['start_date'].add(pd.to_timedelta(counter, unit='d'))
         .value_counts()
         .sort_index()
         .rename_axis('date')
         .reset_index(name='count_active_plans'))
print (df1)
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
.. ... ...
59 2021-03-01 2
60 2021-03-02 2
61 2021-03-03 2
62 2021-03-04 2
63 2021-03-05 1
[64 rows x 2 columns]
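A shorter variant of the same idea, as a sketch: pd.date_range is inclusive of both endpoints, so no extra day is needed, and Series.explode (pandas >= 0.25) turns each range into one row per active day. This starts from the original string columns, which pd.date_range parses directly:
# one date range per row, exploded to one row per active day, then counted
days = plan_dates.apply(lambda r: pd.date_range(r['start_date'], r['end_date']), axis=1)
df1 = (days.explode()
           .value_counts()
           .sort_index()
           .rename_axis('date')
           .reset_index(name='count_active_plans'))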

How to calculate date difference between rows in pandas

I have a data frame that looks like this.
ID Start End
1 2020-12-13 2020-12-20
1 2020-12-26 2021-01-20
1 2020-02-20 2020-02-21
2 2020-12-13 2020-12-20
2 2021-01-11 2021-01-20
2 2021-02-15 2021-02-26
Using pandas, I am trying to group by ID and then subtract the start date of the current row from the end date of the previous row.
If the difference is greater than 5, it should return True.
I'm new to pandas, and I've been trying to figure this out all day.
Two assumptions:
By difference greater than 5, you mean 5 days
You mean the absolute difference
So I am starting with this dataframe, to which I have added the column 'above_5_days':
df
ID start end above_5_days
0 1 2020-12-13 2020-12-20 None
1 1 2020-12-26 2021-01-20 None
2 1 2020-02-20 2020-02-21 None
3 2 2020-12-13 2020-12-20 None
4 2 2021-01-11 2021-01-20 None
5 2 2021-02-15 2021-02-26 None
This will be the groupby object that will be used to apply the operation to each ID group:
id_grp = df.groupby("ID")
The following is the operation that will be applied to each subset:
def calc_diff(x):
    # shift the end times down one row to align each current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # subtract the start date from the previous end
    # set the new column to True/False depending on the condition
    # if you don't want the absolute difference, remove .abs()
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    return x
Now apply this to the whole group and store it in newdf:
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 False
1 1 2020-12-26 2021-01-20 True
2 1 2020-02-20 2020-02-21 True
3 2 2020-12-13 2020-12-20 False
4 2 2021-01-11 2021-01-20 True
5 2 2021-02-15 2021-02-26 True
I should point out that the False values here appear only because shifting the end column down within each group puts a NaT in the first row, and a comparison against the resulting NaT difference returns False. So those False values are really just the boolean version of missing data.
That is why I would personally change the function to:
def calc_diff(x):
    # shift the end times down one row to align each current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # subtract the start date from the previous end
    # set the new column to True/False depending on the condition
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    # where there is no previous row, leave the flag missing instead of False
    x.loc[to_subtract_from.isna(), "above_5_days"] = None
    return x
When rerunning this, you can see that the extra line right before the return statement sets the value in the new column to NaN where the shifted end times are NaT:
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 NaN
1 1 2020-12-26 2021-01-20 1.0
2 1 2020-02-20 2020-02-21 1.0
3 2 2020-12-13 2020-12-20 NaN
4 2 2021-01-11 2021-01-20 1.0
5 2 2021-02-15 2021-02-26 1.0
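The same logic can also be written without apply, using a grouped shift directly; a sketch, assuming the start/end columns are already datetime:
prev_end = df.groupby("ID")["end"].shift()
diff = (prev_end - df["start"]).abs()
# mask() blanks the first row of each group, mirroring the NaN handling above
df["above_5_days"] = (diff > pd.Timedelta(days=5)).mask(prev_end.isna())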

Counting number of entries per month pandas

I have a df in format:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
...
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
And I want to count the number of entries in each month and place the result in a new df, i.e. the number of 'start' entries for Jan, Feb, etc., to give me:
Month Entries
2020-01 3
...
2020-04 2
I am currently trying something like this, but it's not what I need:
df.index = pd.to_datetime(df['start'],format='%Y-%m-%d')
df.groupby(pd.Grouper(freq='M'))
df['start'].value_counts()
Use GroupBy.count with Series.dt:
In [1282]: df
Out[1282]:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
# Do this only when your `start` and `end` columns are object. If already datetime, you can ignore below 2 statements
In [1284]: df.start = pd.to_datetime(df.start)
In [1285]: df.end = pd.to_datetime(df.end)
In [1296]: df1 = df.groupby([df.start.dt.year, df.start.dt.month]).count().rename_axis(['year', 'month'])['start'].reset_index(name='Entries')
In [1297]: df1
Out[1297]:
year month Entries
0 2020 1 3
1 2020 4 2
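If you want the 2020-01 style labels from your desired output, Series.dt.to_period is a convenient alternative; a sketch, assuming start is already datetime:
df1 = (df['start'].dt.to_period('M')
         .value_counts()
         .sort_index()
         .rename_axis('Month')
         .reset_index(name='Entries'))
print(df1)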
