ID START DATE END DATE
5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
I have a dataframe (this is just part of it). When the ID value of one row equals the ID value of the next row, I want to check whether the dates of the two rows overlap, and if so create a new row that keeps the longest span and drops the old ones. For example, for ID 5193 I want the new row to be ID: 5193, START DATE: 2017-02-08, END DATE: 2017-04-10.
Is that even doable? I tried approaching it with the midpoint of a date but didn't get any results. Any suggestion would be highly appreciated.
Try with groupby and agg:
import pandas as pd
a = """5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
"""
df = pd.DataFrame([i.split() for i in a.splitlines()], columns=["ID", "START DATE", "END DATE"])
df = (
    df.assign(part_start_date=lambda x: x["START DATE"].astype(str).str[:7])  # year-month of the start date
      .groupby(["ID", "part_start_date"])
      .agg({"START DATE": "min", "END DATE": "max"})
      .reset_index()
      .drop("part_start_date", axis=1)
)
# Output: within each group, the longest span runs from min(START DATE) to max(END DATE)
ID START DATE END DATE
0 5188 2018-10-01 2018-11-30
1 5191 2019-02-28 2019-04-20
2 5191 2020-10-01 2020-11-20
3 5193 2017-02-08 2017-04-10
4 5193 2021-04-01 2021-05-15
5 5194 2019-05-15 2019-05-31
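Note that grouping by the start month is a heuristic; ranges that overlap across a month boundary would not be merged. A more general sketch (my addition, not part of the answer above) merges rows whose ranges actually overlap, using a running maximum of the end dates per ID:
import pandas as pd

# continues from the df built above; convert the date columns first
df["START DATE"] = pd.to_datetime(df["START DATE"])
df["END DATE"] = pd.to_datetime(df["END DATE"])

df = df.sort_values(["ID", "START DATE"]).reset_index(drop=True)
# a row starts a new group when it does not overlap the running max end date of its ID
prev_end = df.groupby("ID")["END DATE"].transform(lambda s: s.cummax().shift())
grp = (prev_end.isna() | (df["START DATE"] > prev_end)).cumsum().rename("grp")
merged = (df.groupby(["ID", grp])
            .agg({"START DATE": "min", "END DATE": "max"})
            .reset_index()
            .drop(columns="grp"))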
Data:
df:
ts_code
2018-01-01 A
2018-02-07 A
2018-03-11 A
2022-07-08 A
df_cal:
start_date end_date
2018-02-07 2018-03-12
2018-10-22 2018-11-16
2019-01-07 2019-03-08
2019-03-11 2019-04-22
2019-05-24 2019-07-02
2019-08-06 2019-09-09
2019-10-09 2019-11-05
2019-11-29 2020-01-14
2020-02-03 2020-02-21
2020-02-28 2020-03-05
2020-03-19 2020-04-28
2020-05-06 2020-07-13
2020-07-24 2020-08-31
2020-11-02 2021-01-13
2020-09-11 2020-10-13
2021-01-29 2021-02-18
2021-03-09 2021-04-30
2021-05-06 2021-07-22
2021-07-28 2021-09-14
2021-10-12 2021-12-13
2022-04-27 2022-06-30
Expected result:
ts_code col
2018-01-01 A 0
2018-02-07 A 1
2018-03-11 A 1
2022-07-08 A 0
Goal:
I want to assign values to a new column col: 1 if df.index falls within any of the df_cal date ranges, and 0 otherwise.
Reference:
I referred to this post, but it only works for a single condition, while I have many date ranges. And I don't want to use a dataframe join to achieve it, because that would break the index order.
You can check with NumPy broadcasting:
import numpy as np

# here df1 is df_cal and df2 is df from the question; compare every index
# value against every [start_date, end_date] range at once
df2['new'] = np.any((df1.end_date.values >= df2.index.values[:, None]) &
                    (df1.start_date.values <= df2.index.values[:, None]), axis=1).astype(int)
df2
Out[55]:
ts_code col new
2018-01-01 A 0 0
2018-02-07 A 1 1
2018-03-11 A 1 1
2022-07-08 A 0 0
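The full broadcast builds an n×m boolean matrix, which can get heavy when df_cal is large. An alternative sketch (my assumption, not part of the answer above): since the sample ranges do not overlap one another, you can sort them and use searchsorted, assuming df.index and the df_cal columns are datetime64.
import numpy as np

cal = df_cal.sort_values('start_date')
starts = cal['start_date'].values
ends = cal['end_date'].values

# index of the last range whose start is <= each timestamp (-1 if none)
pos = np.searchsorted(starts, df.index.values, side='right') - 1
df['col'] = ((pos >= 0) & (df.index.values <= ends[pos.clip(0)])).astype(int)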
New to Python, so please excuse poor articulation. I have some data in a dataframe to which I've applied drop_duplicates in order to identify state changes in an item. The data is shown below. My goal is to establish some aging on the Item Ids (note that Created Date is the same on all records for a given Item Id).
I've edited this to show what I've tried and the result I'm getting.
Item Id State Created Date Date Severity
0 327863 New 2019-02-11 2019-10-03 1
9 327863 Approved 2019-02-11 2019-12-05 1
12 327863 Committed 2019-02-11 2019-12-26 1
16 327863 Done 2019-02-11 2020-01-23 1
27 327864 New 2019-02-11 2019-10-03 1
33 327864 Committed 2019-02-11 2019-11-14 1
42 327864 Done 2019-02-11 2020-01-16 1
53 341283 Approved 2019-03-11 2019-10-03 1
57 341283 Done 2019-03-11 2019-10-31 1
I'm doing the following to merge the rows.
s = dfdr.groupby(['Item Id','Created Date', 'Severity']).cumcount()
df1 = dfdr.set_index(['Item Id','Created Date', 'Severity', s]).unstack().sort_index(level=1, axis=1)
df1=df1.reset_index()
print(df1[['Item Id', 'Created Date', 'Severity', 'State','Date']])
The output appears to have the nested (MultiIndex) columns I've been told to avoid.
Item Id Created Date Severity State Date
0 1 2 3 0 1 2 3
0 194795 2018-09-18 16:11:25.330 3.0 New Approved Committed Done 2019-10-03 2019-10-10 2019-10-17 2019-10-24
1 194808 2018-09-18 16:11:25.330 3.0 Duplicate NaN NaN NaN 2019-10-03 NaT NaT NaT
2 270787 2018-11-27 15:55:02.207 1.0 New Duplicate NaN NaN 2019-10-03 2019-10-10 NaT NaT
To use the data for graphing, I believe what I want is not the nested data but rather something like the following; I'm just not sure how to get there.
Item Id Created Date Severity New NewDate Approved AppDate Committed CommDate Done DoneDate
123456 3/25/2020 3 New 2019-10-03 Approved 2019-11-05 NaN NaT Done 2020-02-17
After adding pivot_table and reset_index per Sikan's answer, I'm closer, but I don't get the same output. This is the output I'm getting:
State Approved Committed Done Duplicate New
Item Id Created Date Severity
194795 2018-09-18 3.0 2019-10-10 2019-10-17 2019-10-24 NaT 2019-10-03
194808 2018-09-18 3.0 NaT NaT NaT 2019-10-03 NaT
Specifically, this is my code now:
df = pd.read_excel(r'C:\Users\xxx\Documents\Excel\DataSample.xlsx')
df = df.drop_duplicates(subset=['Item Id', 'State','Created Date'], keep='first')
df['Severity'] = df['Severity'].replace(np.nan,3)
df = pd.pivot_table(df, index=['Item Id', 'Created Date', 'Severity'], columns=['State'], values='Date', aggfunc=lambda x: x)
df.reset_index()
print(df)
This is the output
State Approved Committed Done Duplicate New
Item Id Created Date Severity
194795 2018-09-18 3.0 2019-10-10 2019-10-17 2019-10-24 NaT 2019-10-03
194808 2018-09-18 3.0 NaT NaT NaT 2019-10-03 NaT
270787 2018-11-27 1.0 NaT NaT NaT 2019-10-10 2019-10-03
Thanks
You can use pd.pivot_table for this. Note that reset_index returns a new DataFrame rather than modifying it in place, so its result must be assigned back; that is why your version still prints the MultiIndex:
df = pd.pivot_table(dfdr, index=['Item Id', 'Created Date', 'Severity'], columns=['State'], values='Date', aggfunc=lambda x: x)
df = df.reset_index()
Output:
ItemId CreatedDate Severity Approved Committed Done New
0 327863 2019-02-11 1 2019-12-05 2019-12-26 2020-01-23 2019-10-03
1 327864 2019-02-11 1 NaN 2019-11-14 2020-01-16 2019-10-03
2 341283 2019-03-11 1 2019-10-03 NaN 2019-10-31 NaN
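As a hedged follow-up toward the stated goal of aging (the column names here assume the states in the sample data, and that the pivoted date columns are converted to datetime first):
# continues from the pivoted df above, after reset_index()
df['New'] = pd.to_datetime(df['New'])
df['Done'] = pd.to_datetime(df['Done'])
df['Age (days)'] = (df['Done'] - df['New']).dt.days  # NaN where either state is missing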
I'm having difficulty figuring out a way to count the occurrences of holidays between datetime ranges in a dataframe. The holidays are in a list, while the datetime ranges are in the dataframe shown below (note that this is a subset of a very large data set):
import pandas as pd
from datetime import date

df = pd.DataFrame({'Date': ['2018-12-19 18:47','2019-01-01 06:11','2019-01-12 10:05','2019-02-17 14:22','2019-03-08 16:17','2019-03-25 17:35','2019-02-14 17:35'],
                   'End Date': ['2018-12-28 18:47','2019-01-05 06:11','2019-01-16 10:05','2019-02-19 14:22','2019-03-12 16:17','2019-03-26 17:35','2019-05-27 17:35']})
df['Date'] = pd.to_datetime(df['Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
Holidays = [date(2018,12,24), date(2018,12,25), date(2019,1,1), date(2019,1,21),
            date(2019,2,18), date(2019,3,8), date(2019,5,27)]
I've been able to find a way that determines whether or not a holiday falls within the datetime ranges, but not to get an actual count.
Is there a way to alter the code below to gather a count rather than boolean values?
This is what I've tried so far:
df['Holidays'] = [any([(z>=x)&(z<=y) for z in Holidays]) for x , y in zip(df['Date'].dt.date,df['End Date'].dt.date)]
The result I'm looking for is as follows:
result = pd.DataFrame({'Date': ['2018-12-19 18:47','2019-01-01 06:11','2019-01-12 10:05','2019-02-17 14:22','2019-03-08 16:17','2019-03-25 17:35','2019-02-14 17:35'],
'End Date': ['2018-12-28 18:47','2019-01-05 06:11','2019-01-16 10:05','2019-02-19 14:22','2019-03-12 16:17','2019-03-26 17:35','2019-05-27 17:35'],
'Holidays': [2,1,0,1,1,0,3]})
We can make a function that checks this condition and then apply it row-wise.
def fn(series):
    return sum([series.iloc[0] <= h <= series.iloc[1] for h in Holidays])

df.assign(Holidays=df.apply(fn, axis=1))
Date End Date Holidays
0 2018-12-19 18:47:00 2018-12-28 18:47:00 2
1 2019-01-01 06:11:00 2019-01-05 06:11:00 0
2 2019-01-12 10:05:00 2019-01-16 10:05:00 0
3 2019-02-17 14:22:00 2019-02-19 14:22:00 1
4 2019-03-08 16:17:00 2019-03-12 16:17:00 0
5 2019-03-25 17:35:00 2019-03-26 17:35:00 0
6 2019-02-14 17:35:00 2019-05-27 17:35:00 3
Your desired output doesn't match this because the entries in Holidays carry no time component, while the Date/End Date timestamps do. To get the output that you posted, we have to round the timestamps down to the day.
def fn(series):
    return sum([series.iloc[0].floor('d') <= h <= series.iloc[1].floor('d') for h in Holidays])

df.assign(Holidays=df.apply(fn, axis=1))
Date End Date Holidays
0 2018-12-19 18:47 2018-12-28 18:47 2
1 2019-01-01 06:11 2019-01-05 06:11 1
2 2019-01-12 10:05 2019-01-16 10:05 0
3 2019-02-17 14:22 2019-02-19 14:22 1
4 2019-03-08 16:17 2019-03-12 16:17 1
5 2019-03-25 17:35 2019-03-26 17:35 0
6 2019-02-14 17:35 2019-05-27 17:35 3
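A vectorized sketch of the same day-level count, using broadcasting instead of a row-wise apply (my addition, not part of the answer above; it assumes the df and Holidays list from the question):
import numpy as np

h = pd.to_datetime(Holidays).values
start = df['Date'].dt.floor('d').values
end = df['End Date'].dt.floor('d').values
# one row per range, one column per holiday; count the holidays inside each range
df['Holidays'] = ((h >= start[:, None]) & (h <= end[:, None])).sum(axis=1)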
I have this dataframe object:
Date
2018-12-14
2019-01-11
2019-01-25
2019-02-08
2019-02-22
2019-07-26
What I want, if it's possible, is to add, for example, 3 months to the dates, then 3 months to the new dates (original date + 3 months), and repeat this x times. I am using pd.offsets.MonthOffset, but that adds the months only once, and I need to do it several times.
I don't know if it is possible, but any help would be perfect.
Thank you so much for taking your time.
The expected output is (adding 1 month, 2 times):
[[2019-01-14, 2019-02-11, 2019-02-25, 2019-03-08, 2019-03-22, 2019-08-26],[2019-02-14, 2019-03-11, 2019-03-25, 2019-04-08, 2019-04-22, 2019-09-26]]
I believe you need a loop with f-strings for the new column names:
for i in range(1, 4):
    df[f'Date_added_{i}_months'] = df['Date'] + pd.offsets.MonthBegin(i)
print(df)
Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14 2019-01-01 2019-02-01 2019-03-01
1 2019-01-11 2019-02-01 2019-03-01 2019-04-01
2 2019-01-25 2019-02-01 2019-03-01 2019-04-01
3 2019-02-08 2019-03-01 2019-04-01 2019-05-01
4 2019-02-22 2019-03-01 2019-04-01 2019-05-01
5 2019-07-26 2019-08-01 2019-09-01 2019-10-01
Or:
for i in range(1, 4):
    # pd.offsets.MonthOffset was removed in pandas 2.0; pd.DateOffset(months=i) is the equivalent
    df[f'Date_added_{i}_months'] = df['Date'] + pd.DateOffset(months=i)
print(df)
Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14 2019-01-14 2019-02-14 2019-03-14
1 2019-01-11 2019-02-11 2019-03-11 2019-04-11
2 2019-01-25 2019-02-25 2019-03-25 2019-04-25
3 2019-02-08 2019-03-08 2019-04-08 2019-05-08
4 2019-02-22 2019-03-22 2019-04-22 2019-05-22
5 2019-07-26 2019-08-26 2019-09-26 2019-10-26
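If the list-of-lists shape from the question is the goal, a minimal sketch (adding 1 month, repeated 2 times; it assumes the Date column is already datetime64):
result = [(df['Date'] + pd.DateOffset(months=i)).dt.strftime('%Y-%m-%d').tolist()
          for i in range(1, 3)]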
I hope this helps:
from dateutil.relativedelta import relativedelta

month_offset = [3, 6, 9]
for i in month_offset:
    # str(i) is needed here; concatenating the int i directly raises a TypeError
    df['Date_plus_' + str(i) + '_months'] = df['Date'].map(lambda x: x + relativedelta(months=i))
If your dates are datetime objects, it should be pretty easy, although note that datetime.timedelta has no month unit, so use dateutil's relativedelta to add 3 months to each date.
Alternatively, you can convert strings to datetime objects with .strptime(), do the arithmetic, and convert back to strings with .strftime().
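A small sketch of that string round trip (the sample date is taken from the question's frame):
from datetime import datetime
from dateutil.relativedelta import relativedelta

d = datetime.strptime('2018-12-14', '%Y-%m-%d')
shifted = d + relativedelta(months=3)
print(shifted.strftime('%Y-%m-%d'))  # 2019-03-14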
I have the following dataframe:
Group Deadline Time Deadline Date Task Completed Date Task Completed Time
Group 1 20:00:00 17-07-2012 17-07-2012 20:34:00
Group 2 20:15:00 17-07-2012 17-07-2012 20:39:00
Group 3 22:00:00 17-07-2012 17-07-2012 22:21:00
Group 4 23:50:00 17-07-2012 18-07-2012 00:09:00
Group 5 20:00:00 18-07-2012 18-07-2012 20:37:00
Group 6 20:15:00 18-07-2012 18-07-2012 21:13:00
Group 7 22:00:00 18-07-2012 18-07-2012 22:56:00
Group 8 23:50:00 18-07-2012 19-07-2012 00:01:00
Group 9 20:15:00 19-07-2012 19-07-2012 20:34:00
Group 10 20:00:00 19-07-2012 19-07-2012 20:24:00
How do I calculate the time delay as:
Time Delay (mins)
00:34:00
00:24:00
00:21:00
00:19:00
00:37:00
00:58:00
00:56:00
00:11:00
00:19:00
00:24:00
I have tried, without success, combining the 'Deadline' date & time columns and the 'Task Completed' date & time columns, and finding the difference as 'Task Completed' minus 'Deadline'.
Combine them as strings ("addition" works), convert them to datetime type, and then subtract, which gives a Series of timedelta type.
In [14]: deadline = pd.to_datetime(df['Deadline Date'] + ' ' + df['Deadline Time'], dayfirst=True)
In [15]: completed = pd.to_datetime(df['Task Completed Date'] + ' ' + df['Task Completed Time'], dayfirst=True)
In [16]: completed - deadline
Out[16]:
0 00:34:00
1 00:24:00
2 00:21:00
3 00:19:00
4 00:37:00
5 00:58:00
6 00:56:00
7 00:11:00
8 00:19:00
9 00:24:00
dtype: timedelta64[ns]
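Since the desired header is "Time Delay (mins)", a short follow-up converts the timedelta Series to whole minutes (a sketch continuing from the variables above):
df['Time Delay (mins)'] = ((completed - deadline).dt.total_seconds() // 60).astype(int)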