Python merging rows with conditional logic

New to Python, so please excuse poor articulation. I have some data in a dataframe that I've applied drop_duplicates to in order to identify state changes in an item. The data is shown below. My goal is to establish some aging on the Item Ids (note: Created Date is the same on all records for a specific Item Id).
I've edited this to show what I've tried and the result I'm getting.
Item Id State Created Date Date Severity
0 327863 New 2019-02-11 2019-10-03 1
9 327863 Approved 2019-02-11 2019-12-05 1
12 327863 Committed 2019-02-11 2019-12-26 1
16 327863 Done 2019-02-11 2020-01-23 1
27 327864 New 2019-02-11 2019-10-03 1
33 327864 Committed 2019-02-11 2019-11-14 1
42 327864 Done 2019-02-11 2020-01-16 1
53 341283 Approved 2019-03-11 2019-10-03 1
57 341283 Done 2019-03-11 2019-10-31 1
I'm doing the following to merge the rows.
s = dfdr.groupby(['Item Id', 'Created Date', 'Severity']).cumcount()
df1 = dfdr.set_index(['Item Id', 'Created Date', 'Severity', s]).unstack().sort_index(level=1, axis=1)
df1 = df1.reset_index()
print(df1[['Item Id', 'Created Date', 'Severity', 'State', 'Date']])
The output looks to me like it shows what I'm told to avoid: chained indexing.
Item Id Created Date Severity State Date
0 1 2 3 0 1 2 3
0 194795 2018-09-18 16:11:25.330 3.0 New Approved Committed Done 2019-10-03 2019-10-10 2019-10-17 2019-10-24
1 194808 2018-09-18 16:11:25.330 3.0 Duplicate NaN NaN NaN 2019-10-03 NaT NaT NaT
2 270787 2018-11-27 15:55:02.207 1.0 New Duplicate NaN NaN 2019-10-03 2019-10-10 NaT NaT
To use the data for graphing, I believe what I want is not the nested data but rather something like the following; I'm just not sure how to get there.
Item Id Created Date Severity New NewDate Approved AppDate Committed CommDate Done DoneDate
123456 3/25/2020 3 New 2019-10-03 Approved 2019-11-05 NaN NaT Done 2020-02-17
After adding pivot_table and reset_index per Sikan's answer, I'm closer, but I don't get the same output. This is the output I'm getting:
State Approved Committed Done Duplicate New
Item Id Created Date Severity
194795 2018-09-18 3.0 2019-10-10 2019-10-17 2019-10-24 NaT 2019-10-03
194808 2018-09-18 3.0 NaT NaT NaT 2019-10-03 NaT
This is specifically my code now:
import numpy as np
import pandas as pd

df = pd.read_excel(r'C:\Users\xxx\Documents\Excel\DataSample.xlsx')
df = df.drop_duplicates(subset=['Item Id', 'State', 'Created Date'], keep='first')
df['Severity'] = df['Severity'].replace(np.nan, 3)
df = pd.pivot_table(df, index=['Item Id', 'Created Date', 'Severity'], columns=['State'], values='Date', aggfunc=lambda x: x)
df.reset_index()
print(df)
This is the output
State Approved Committed Done Duplicate New
Item Id Created Date Severity
194795 2018-09-18 3.0 2019-10-10 2019-10-17 2019-10-24 NaT 2019-10-03
194808 2018-09-18 3.0 NaT NaT NaT 2019-10-03 NaT
270787 2018-11-27 1.0 NaT NaT NaT 2019-10-10 2019-10-03
Thanks

You can use pd.pivot_table for this:
df = pd.pivot_table(
    dfdr,
    index=['Item Id', 'Created Date', 'Severity'],
    columns=['State'],
    values='Date',
    aggfunc=lambda x: x,
)
df = df.reset_index()
Output:
ItemId CreatedDate Severity Approved Committed Done New
0 327863 2019-02-11 1 2019-12-05 2019-12-26 2020-01-23 2019-10-03
1 327864 2019-02-11 1 NaN 2019-11-14 2020-01-16 2019-10-03
2 341283 2019-03-11 1 2019-10-03 NaN 2019-10-31 NaN
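One thing worth noting about the code in the question: reset_index returns a new DataFrame, so the result has to be assigned back (df = df.reset_index()); calling df.reset_index() on its own line leaves df unchanged, which is why the printed output still shows Item Id / Created Date / Severity stacked on the left. To get from the pivoted result to something closer to the wide layout sketched in the question, here is a minimal follow-up sketch; the renamed columns are illustrative, not part of the original answer:
# starting from the answer's pivoted-and-reset frame `df`
df.columns.name = None  # drop the leftover 'State' label on the column axis

# optional: rename the per-state date columns to mirror the layout in the question
df = df.rename(columns={'New': 'NewDate', 'Approved': 'AppDate',
                        'Committed': 'CommDate', 'Done': 'DoneDate'})
print(df)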

Related

Overlapping dates in pandas dataframe

ID START DATE END DATE
5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
I have a dataframe (this is just a part of it). When the ID value of one row equals the ID value of the next row, I want to check whether the dates of the two rows overlap, and if so create a new row that keeps the longest date range and drops the old ones, i.e. when the ID is 5193 I want my new row to be ID: 5193, START DATE: 2017-02-08, END DATE: 2017-04-10.
Is that even doable? I tried to approach it with the middle point of a date but didn't get any results! Any suggestion would be highly appreciated.
Try with groupby and agg
import pandas as pd
a = """5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
"""
df = pd.DataFrame([i.split() for i in a.splitlines()], columns=["ID", "START DATE", "END DATE"])
df = (
    df.assign(part_start_date=lambda x: x["START DATE"].astype(str).str[:7])
      .groupby(["ID", "part_start_date"])
      .agg({"START DATE": "min", "END DATE": "max"})
      .reset_index()
      .drop("part_start_date", axis=1)
)
# output: the longest range for each group is where START DATE is the min and END DATE is the max
ID START DATE END DATE
0 5188 2018-10-01 2018-11-30
1 5191 2019-02-28 2019-04-20
2 5191 2020-10-01 2020-11-20
3 5193 2017-02-08 2017-04-10
4 5193 2021-04-01 2021-05-15
5 5194 2019-05-15 2019-05-31
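If the rows need to be merged strictly by overlap (the approach above groups rows whose start dates fall in the same month, which happens to work for this sample), a common pattern is to sort per ID and carry a running maximum end date. A minimal sketch, starting again from the un-merged frame and assuming the dates have been parsed with pd.to_datetime:
df["START DATE"] = pd.to_datetime(df["START DATE"])
df["END DATE"] = pd.to_datetime(df["END DATE"])
df = df.sort_values(["ID", "START DATE"]).reset_index(drop=True)

# a row starts a new group when it begins after every earlier interval of the same ID has ended
run_end = df.groupby("ID")["END DATE"].cummax()
prev_end = run_end.groupby(df["ID"]).shift()
group_id = (df["START DATE"] > prev_end).cumsum()

merged = (
    df.groupby(["ID", group_id])
      .agg({"START DATE": "min", "END DATE": "max"})
      .droplevel(1)
      .reset_index()
)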

For each date - is it between any of the provided date bounds?

Data:
df:
ts_code
2018-01-01 A
2018-02-07 A
2018-03-11 A
2022-07-08 A
df_cal:
start_date end_date
2018-02-07 2018-03-12
2018-10-22 2018-11-16
2019-01-07 2019-03-08
2019-03-11 2019-04-22
2019-05-24 2019-07-02
2019-08-06 2019-09-09
2019-10-09 2019-11-05
2019-11-29 2020-01-14
2020-02-03 2020-02-21
2020-02-28 2020-03-05
2020-03-19 2020-04-28
2020-05-06 2020-07-13
2020-07-24 2020-08-31
2020-11-02 2021-01-13
2020-09-11 2020-10-13
2021-01-29 2021-02-18
2021-03-09 2021-04-30
2021-05-06 2021-07-22
2021-07-28 2021-09-14
2021-10-12 2021-12-13
2022-04-27 2022-06-30
Expected result:
ts_code col
2018-01-01 A 0
2018-02-07 A 1
2018-03-11 A 1
2022-07-08 A 0
Goal:
I want to assign values to a new column col: 1 if df.index falls within any of the df_cal date ranges, and 0 otherwise.
Reference:
I referred to this post, but it only works for one condition and mine has many date ranges. And I don't want to use a dataframe join to achieve it because it would break the index order.
You can check with numpy broadcasting:
df2['new'] = np.any(
    (df1.end_date.values >= df2.index.values[:, None])
    & (df1.start_date.values <= df2.index.values[:, None]),
    axis=1,
).astype(int)
df2
Out[55]:
ts_code col new
2018-01-01 A 0 0
2018-02-07 A 1 1
2018-03-11 A 1 1
2022-07-08 A 0 0
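A self-contained sketch of the same broadcasting idea using the frame names from the question (the answer's df1/df2 correspond to df_cal/df here); it assumes the index and both bound columns are already datetimes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ts_code': ['A'] * 4},
                  index=pd.to_datetime(['2018-01-01', '2018-02-07',
                                        '2018-03-11', '2022-07-08']))
df_cal = pd.DataFrame({'start_date': pd.to_datetime(['2018-02-07', '2019-03-11']),
                       'end_date':   pd.to_datetime(['2018-03-12', '2019-04-22'])})

dates = df.index.values[:, None]   # shape (n_dates, 1) for broadcasting against the ranges
inside = (dates >= df_cal['start_date'].values) & (dates <= df_cal['end_date'].values)
df['col'] = inside.any(axis=1).astype(int)   # 1 if the date falls inside any range
print(df)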

Why is pandas str.replace returning NaN?

I am trying to remove the comma separator from values in a dataframe in Pandas to enable me to convert them to integers. I have been using the following method:
df_orders['qty'] = df_orders['qty'].str.replace(',','')
However, this seems to be returning NaN values for some numbers which did not originally contain ',' in their values. I have included a sample of my input data and current output below:
Input:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A 18
667919 2020-10-13 A 5
674990 2020-10-12 A 2
703901 2020-10-09 A 1
715411 2020-10-08 A 1
721557 2020-10-07 A 31
740515 2020-10-06 A 49
752670 2020-10-05 A 4
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A 2
969909 2020-09-07 A 3
1021548 2020-08-31 A 2
1032254 2020-08-30 A 8
1077443 2020-08-25 A 5
1089670 2020-08-24 A 24
1098843 2020-08-23 A 16
1102025 2020-08-22 A 23
1179347 2020-08-12 A 1
1305700 2020-07-29 A 1
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
Current Output:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A NaN
667919 2020-10-13 A NaN
674990 2020-10-12 A NaN
703901 2020-10-09 A NaN
715411 2020-10-08 A NaN
721557 2020-10-07 A NaN
740515 2020-10-06 A NaN
752670 2020-10-05 A NaN
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A NaN
969909 2020-09-07 A NaN
1021548 2020-08-31 A NaN
1032254 2020-08-30 A NaN
1077443 2020-08-25 A NaN
1089670 2020-08-24 A NaN
1098843 2020-08-23 A NaN
1102025 2020-08-22 A NaN
1179347 2020-08-12 A NaN
1305700 2020-07-29 A NaN
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
I have had a look around but can't seem to find what is causing this error.
I was able to reproduce your issue:
# toy df
df
qty
0 1
1 2,
2 3
df['qty'].str.replace(',', '')
0 NaN
1 2
2 NaN
Name: qty, dtype: object
I created df by doing this:
df = pd.DataFrame({'qty': [1, '2,', 3]})
In other words, your column has mixed data types - some values are integers while others are strings. So when you apply .str methods on mixed types, non-str values are converted to NaN to indicate "hey, it doesn't make sense to run a str method on an int".
You may fix this by converting the entire column to string, then back to int:
df['qty'].astype(str).str.replace(',', '').astype(int)
Or if you want something a little more robust, try
df['qty'] = pd.to_numeric(
    df['qty'].astype(str).str.extract(r'(\d+)', expand=False),
    errors='coerce')
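If you want to confirm the mixed-dtype diagnosis on your own data first, a quick check (a sketch, not part of the original answer) is to count the Python types present in the column:
# shows how many values are ints vs strings; a mix confirms why .str returns NaN for some rows
print(df_orders['qty'].map(type).value_counts())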

Python: How to chronologically sort by date and find any gaps

I searched for an answer, but couldn't find one!
I have a dataframe that looks like:
import pandas as pd

df = pd.DataFrame({'Cust_Name': ['APPT1', 'APPT1', 'APPT2', 'APPT2'],
                   'Move_In': ['2013-02-01', '2019-02-01', '2019-02-04', '2019-02-19'],
                   'Move_Out': ['2019-01-31', '', '2019-02-15', '']})
I am looking to find a way to calculate the vacancy.
APPT1 was occupied from 2013-02-01 to 2019-01-31, and again from the next day, 2019-02-01. So the vacancy for APPT1 is 0 and it is currently occupied.
APPT2 was occupied from 2019-02-04 to 2019-02-15, and again from 2019-02-19. So the vacancy for APPT2 is 2 business days and it is currently occupied.
An empty Move_Out (NaT) means the apartment is currently occupied.
TIA
df = pd.DataFrame({
    'Cust_Name': ['APPT1', 'APPT1', 'APPT2', 'APPT2'],
    'Move_In': ['2013-02-01', '2019-02-01', '2019-02-04', '2019-02-19'],
    'Move_Out': ['2019-01-31', '', '2019-02-15', '']
})
df['Move_In'] = pd.to_datetime(df['Move_In'])
df['Move_Out'] = pd.to_datetime(df['Move_Out'])
df['Prev_Move_Out'] = df['Move_Out'].shift()
Cust_Name Move_In Move_Out Prev_Move_Out
0 APPT1 2013-02-01 2019-01-31 NaT
1 APPT1 2019-02-01 NaT 2019-01-31
2 APPT2 2019-02-04 2019-02-15 NaT
3 APPT2 2019-02-19 NaT 2019-02-15
def calculate_business_day_vacancy(df):
    try:
        return len(pd.date_range(start=df['Prev_Move_Out'], end=df['Move_In'], freq='B')) - 2
    except ValueError:
        # Consider instead running the function only on rows that do not contain NaT.
        return 0
df['Vacancy_BDays'] = df.apply(calculate_business_day_vacancy, axis=1)
Output
Cust_Name Move_In Move_Out Prev_Move_Out Vacancy_BDays
0 APPT1 2013-02-01 2019-01-31 NaT 0
1 APPT1 2019-02-01 NaT 2019-01-31 0
2 APPT2 2019-02-04 2019-02-15 NaT 0
3 APPT2 2019-02-19 NaT 2019-02-15 1
Note that there is only one Business Day vacancy between 15 Feb 2019 and 19 Feb 2019.
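An alternative sketch (not part of the original answer) that avoids building a date_range per row: numpy's np.busday_count counts business days in a half-open [begin, end) interval, so the vacancy can be computed in a vectorised way, with rows that have no previous move-out left at 0:
import numpy as np

mask = df['Prev_Move_Out'].notna()
begin = (df.loc[mask, 'Prev_Move_Out'] + pd.Timedelta(days=1)).values.astype('datetime64[D]')
end = df.loc[mask, 'Move_In'].values.astype('datetime64[D]')

df['Vacancy_BDays'] = 0
# count business days from the day after move-out up to (but not including) the next move-in
df.loc[mask, 'Vacancy_BDays'] = np.busday_count(begin, end)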

Pandas Dataframe Create Column of Next Future Date from Unique values of two other columns, with Groupby

I have a dataframe which can be created with this:
import pandas as pd
import datetime
#create df
data={'id':[1,1,1,1,2,2,2,2],
'date1':[datetime.date(2016,1,1),datetime.date(2016,7,23),datetime.date(2017,2,26),datetime.date(2017,5,28),
datetime.date(2015,11,1),datetime.date(2016,7,23),datetime.date(2017,6,28),datetime.date(2017,5,23)],
'date2':[datetime.date(2017,5,12),datetime.date(2016,8,10),datetime.date(2017,10,26),datetime.date(2017,9,22),
datetime.date(2015,11,9),datetime.date(2016,9,23),datetime.date(2017,8,3),datetime.date(2017,9,22)]}
df=pd.DataFrame.from_dict(data)
df=df[['id','date1','date2']]
And looks like this:
df
Out[83]:
id date1 date2
0 1 2016-01-01 2017-05-12
1 1 2016-07-23 2016-08-10
2 1 2017-02-26 2017-10-26
3 1 2017-05-28 2017-09-22
4 2 2015-11-01 2015-11-09
5 2 2016-07-23 2016-09-23
6 2 2017-06-28 2017-08-03
7 2 2017-05-23 2017-09-22
What I need to do is create a new column called 'newdate' which, at the groupby('id') level, takes all of the unique date values from columns date1 and date2 and gives me the NEXT FUTURE date from those unique values after the date in date2.
So the new dataframe would look like:
df
Out[87]:
id date1 date2 newdate
0 1 2016-01-01 2017-05-12 2017-05-28
1 1 2016-07-23 2016-08-10 2017-02-26
2 1 2017-02-26 2017-10-26 None
3 1 2017-05-28 2017-09-22 2017-10-26
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 None
For clarification, take a look at the id=2 records. Note that in row 4, the newdate is 2016-07-23. This is because it is the FIRST date, from all of the dates represented for id=2 in columns date1 & date2, that FOLLOWS the row 4 date2.
We definitely need to use groupby. I think we could use some form of unique(), np.unique, or pd.unique to get the dates? But then how do you select the 'NEXT' one and assign it? Just stumped...
A few other points: don't assume the dataframe is sorted in any way, and efficiency is important here because the actual dataframe is very large. Note also that the 'None' values in newdate are there because we have no 'NEXT' future date represented, as the maximum date in the subset is the same as date2. We can use None, NaN, whatever to represent these...
EDIT:
Based on Wen's answer: his answer fails if there are duplicate dates. If you use this dataset:
data={'id':[1,1,1,1,2,2,2,2],
'date1':[datetime.date(2016,1,1),datetime.date(2016,7,23),datetime.date(2017,2,26),datetime.date(2017,5,28),
datetime.date(2015,11,1),datetime.date(2016,7,23),datetime.date(2017,6,28),datetime.date(2017,5,23)],
'date2':[datetime.date(2017,5,12),datetime.date(2017,5,12),datetime.date(2017,2,26),datetime.date(2017,9,22),
datetime.date(2015,11,9),datetime.date(2016,9,23),datetime.date(2017,8,3),datetime.date(2017,9,22)]}
df=pd.DataFrame.from_dict(data)
df=df[['id','date1','date2']]
Then the result is:
df
Out[104]:
id date1 date2 newdate
0 1 2016-01-01 2017-05-12 2017-05-12
1 1 2016-07-23 2017-05-12 2017-05-28
2 1 2017-02-26 2017-02-26 2017-05-12
3 1 2017-05-28 2017-09-22 NaN
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 NaN
Note that row 0 'newdate' should be 2017-05-28, the 'next' available date from the superset of date1&date2 for id==1.
I believe melt gets us closer though...
Perhaps not the quickest, depending on your actual dataframe ("very large" could mean anything). Basically two steps: first create a lookup table mapping every date to the next date, then merge that lookup with the original table.
#get the latest date for each row - just the max of date1 and date2
df['latest_date'] = df.loc[:, ['date1','date2']].max(axis=1)
#for each date, find the next date - basically create a lookup table
new_date_lookup = (df
    .melt(id_vars=['id'], value_vars=['date1', 'date2'])
    .loc[:, ['id', 'value']]
)
new_date_lookup = (new_date_lookup
    .merge(new_date_lookup, on="id")
    .query("value_y > value_x")
    .groupby(["id", "value_x"])
    .min()
    .reset_index()
    .rename(columns={'value_x': 'value', 'value_y': 'new_date'})
)
#merge the original and lookup table together to get the new_date for each row
new_df = (pd
    .merge(df, new_date_lookup, how='left', left_on=['id', 'latest_date'], right_on=['id', 'value'])
    .drop(['latest_date', 'value'], axis=1)
)
print(new_df)
Which gives the output:
id date1 date2 new_date
0 1 2016-01-01 2017-05-12 2017-05-28
1 1 2016-07-23 2016-08-10 2017-02-26
2 1 2017-02-26 2017-10-26 NaN
3 1 2017-05-28 2017-09-22 2017-10-26
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 NaN
And for the second example, added in the edit, gives the output:
id date1 date2 new_date
0 1 2016-01-01 2017-05-12 2017-05-28
1 1 2016-07-23 2017-05-12 2017-05-28
2 1 2017-02-26 2017-02-26 2017-05-12
3 1 2017-05-28 2017-09-22 NaN
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 NaN
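Since the question stresses efficiency on a very large frame, here is a sketch of an alternative (not from the answer above) that avoids the quadratic self-merge: for each id, collect the sorted unique dates once and use numpy's searchsorted to find the first date strictly greater than each row's date2. The helper name is illustrative:
import numpy as np
import pandas as pd

def next_future_date(group):
    # all unique dates seen in either column for this id, sorted
    dates = np.sort(pd.unique(group[['date1', 'date2']].values.ravel()))
    # position of the first date strictly greater than each row's date2
    pos = np.searchsorted(dates, group['date2'].values, side='right')
    out = pd.Series(pd.NaT, index=group.index, dtype=object)
    found = pos < len(dates)
    out[found] = dates[pos[found]]
    return out

df['newdate'] = df.groupby('id', group_keys=False).apply(next_future_date)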
