ID START DATE END DATE
5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
I have a dataframe (this is just a part of it). When the ID value of one row equals the ID value of the next row, I want to check whether the dates of the two rows overlap, and if so, create a new row that keeps the longest combined date range and drops the old ones. For example, for ID 5193 I want the new row to be ID: 5193, START DATE: 2017-02-08, END DATE: 2017-04-10.
Is that even doable? I tried approaching it with the midpoint of a date but didn't get any results. Any suggestion would be highly appreciated.
Try with groupby and agg
import pandas as pd
a = """5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
"""
df = pd.DataFrame([i.split() for i in a.splitlines()], columns=["ID", "START DATE", "END DATE"])
df = (
    df.assign(part_start_date=lambda x: x["START DATE"].astype(str).str[:7])
      .groupby(["ID", "part_start_date"])
      .agg({"START DATE": "min", "END DATE": "max"})
      .reset_index()
      .drop("part_start_date", axis=1)
)
# Output. The longest range is the min start date paired with the max end date:
ID START DATE END DATE
0 5188 2018-10-01 2018-11-30
1 5191 2019-02-28 2019-04-20
2 5191 2020-10-01 2020-11-20
3 5193 2017-02-08 2017-04-10
4 5193 2021-04-01 2021-05-15
5 5194 2019-05-15 2019-05-31
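The month-prefix trick above relies on overlapping rows sharing the same start month. A more general sketch (my own variant, not the answer's method) merges any overlapping intervals per ID by sorting and tracking the running maximum end date:

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": [5194, 5193, 5193, 5193, 5191, 5191, 5188],
     "START DATE": ["2019-05-15", "2017-02-08", "2017-02-15", "2021-04-01",
                    "2020-10-01", "2019-02-28", "2018-10-01"],
     "END DATE": ["2019-05-31", "2017-04-02", "2017-04-10", "2021-05-15",
                  "2020-11-20", "2019-04-20", "2018-11-30"]}
)
df["START DATE"] = pd.to_datetime(df["START DATE"])
df["END DATE"] = pd.to_datetime(df["END DATE"])

# Sort so overlap detection only has to look backwards.
df = df.sort_values(["ID", "START DATE"]).reset_index(drop=True)

# A new group starts whenever the current interval does NOT overlap the
# running maximum end date of earlier rows, or when the ID changes.
prev_max_end = df.groupby("ID")["END DATE"].cummax().shift()
new_group = (df["START DATE"] > prev_max_end) | (df["ID"] != df["ID"].shift())
df["grp"] = new_group.cumsum()

# Collapse each group of overlapping intervals into one row.
merged = (df.groupby(["ID", "grp"], as_index=False)
            .agg({"START DATE": "min", "END DATE": "max"})
            .drop(columns="grp"))
```

Unlike the month-prefix version, this also merges chains of overlapping rows that span different start months.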
Data:
df:
ts_code
2018-01-01 A
2018-02-07 A
2018-03-11 A
2022-07-08 A
df_cal:
start_date end_date
2018-02-07 2018-03-12
2018-10-22 2018-11-16
2019-01-07 2019-03-08
2019-03-11 2019-04-22
2019-05-24 2019-07-02
2019-08-06 2019-09-09
2019-10-09 2019-11-05
2019-11-29 2020-01-14
2020-02-03 2020-02-21
2020-02-28 2020-03-05
2020-03-19 2020-04-28
2020-05-06 2020-07-13
2020-07-24 2020-08-31
2020-11-02 2021-01-13
2020-09-11 2020-10-13
2021-01-29 2021-02-18
2021-03-09 2021-04-30
2021-05-06 2021-07-22
2021-07-28 2021-09-14
2021-10-12 2021-12-13
2022-04-27 2022-06-30
Expected result:
ts_code col
2018-01-01 A 0
2018-02-07 A 1
2018-03-11 A 1
2022-07-08 A 0
Goal:
I want to assign values to a new column col: to 1 if df.index is between any of df_cal date ranges, and to 0 otherwise.
Reference:
I refer this post. But it just works for one condition and mine is lots of date ranges. And I don't want to use dataframe join method to achieve it because it will break index order.
You can check with NumPy broadcasting (here df1 is your df_cal and df2 is your df):
import numpy as np

df2['new'] = np.any((df1.end_date.values >= df2.index.values[:, None]) &
                    (df1.start_date.values <= df2.index.values[:, None]), axis=1).astype(int)
df2
Out[55]:
ts_code col new
2018-01-01 A 0 0
2018-02-07 A 1 1
2018-03-11 A 1 1
2022-07-08 A 0 0
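An alternative sketch using pd.IntervalIndex, shown here on a trimmed-down df_cal (the Python-level loop trades some speed for readability):

```python
import pandas as pd

# Trimmed-down versions of the question's df_cal and df.
df_cal = pd.DataFrame({
    "start_date": pd.to_datetime(["2018-02-07", "2022-04-27"]),
    "end_date": pd.to_datetime(["2018-03-12", "2022-06-30"]),
})
idx = pd.to_datetime(["2018-01-01", "2018-02-07", "2018-03-11", "2022-07-08"])
df = pd.DataFrame({"ts_code": ["A"] * 4}, index=idx)

# Build closed intervals from the calendar ranges; endpoints count as inside.
intervals = pd.IntervalIndex.from_arrays(
    df_cal["start_date"], df_cal["end_date"], closed="both")

# For each index date, flag whether any calendar interval contains it.
df["col"] = [int(intervals.contains(d).any()) for d in df.index]
```

Index order is preserved because the column is assigned in place, which matches the question's requirement of not using a join.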
I am trying to remove the comma separator from values in a dataframe in Pandas so that I can convert them to integers. I have been using the following method:
df_orders['qty'] = df_orders['qty'].str.replace(',','')
However, this seems to return NaN for some values that did not originally contain ',' at all. I have included a sample of my input data and current output below:
Input:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A 18
667919 2020-10-13 A 5
674990 2020-10-12 A 2
703901 2020-10-09 A 1
715411 2020-10-08 A 1
721557 2020-10-07 A 31
740515 2020-10-06 A 49
752670 2020-10-05 A 4
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A 2
969909 2020-09-07 A 3
1021548 2020-08-31 A 2
1032254 2020-08-30 A 8
1077443 2020-08-25 A 5
1089670 2020-08-24 A 24
1098843 2020-08-23 A 16
1102025 2020-08-22 A 23
1179347 2020-08-12 A 1
1305700 2020-07-29 A 1
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
Current Output:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A NaN
667919 2020-10-13 A NaN
674990 2020-10-12 A NaN
703901 2020-10-09 A NaN
715411 2020-10-08 A NaN
721557 2020-10-07 A NaN
740515 2020-10-06 A NaN
752670 2020-10-05 A NaN
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A NaN
969909 2020-09-07 A NaN
1021548 2020-08-31 A NaN
1032254 2020-08-30 A NaN
1077443 2020-08-25 A NaN
1089670 2020-08-24 A NaN
1098843 2020-08-23 A NaN
1102025 2020-08-22 A NaN
1179347 2020-08-12 A NaN
1305700 2020-07-29 A NaN
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
I have had a look around but can't seem to find what is causing this error.
I was able to reproduce your issue:
# toy df
df
qty
0 1
1 2,
2 3
df['qty'].str.replace(',', '')
0 NaN
1 2
2 NaN
Name: qty, dtype: object
I created df by doing this:
df = pd.DataFrame({'qty': [1, '2,', 3]})
In other words, your column has mixed data types: some values are integers while others are strings. When you apply .str methods to mixed types, the non-string values are converted to NaN, as if to say "hey, it doesn't make sense to run a str method on an int".
You may fix this by converting the entire column to string, then back to int:
df['qty'].astype(str).str.replace(',', '').astype(int)
Or if you want something a little more robust, try
df['qty'] = pd.to_numeric(
    df['qty'].astype(str).str.extract(r'(\d+)', expand=False), errors='coerce')
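A quick check of both fixes on the toy column above:

```python
import pandas as pd

df = pd.DataFrame({"qty": [1, "2,", 3]})

# Fix 1: force everything to string first, then strip commas and cast back.
fixed = df["qty"].astype(str).str.replace(",", "", regex=False).astype(int)

# Fix 2: extract the digits and coerce anything unparseable to NaN.
robust = pd.to_numeric(
    df["qty"].astype(str).str.extract(r"(\d+)", expand=False), errors="coerce")
```

Note that the extract-based fix pulls out only the first run of digits, so a thousands-separated value like '1,234' would become 1; the replace-based fix handles that case correctly.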
I searched for an answer, but couldn't find!
I have a dataframe that looks like:
import pandas as pd
df = pd.DataFrame({'Cust_Name' : ['APPT1', 'APPT1','APPT2','APPT2'],
'Move_In':['2013-02-01','2019-02-01','2019-02-04','2019-02-19'],
'Move_Out':['2019-01-31','','2019-02-15','']})
I am looking to find a way to calculate the vacancy.
APPT1 was occupied from 2013-02-01 to 2019-01-31 and, again from the next day 2019-02-01. So the vacancy for APPT1 is 0 and is currently occupied.
APPT2 was occupied from 2019-02-04 to 2019-02-15 and, again from 2019-02-19. So the vacancy for APPT2 is 2 business days and is currently occupied.
NaT (an empty Move_Out) means the unit is currently occupied.
TIA
df = pd.DataFrame({
'Cust_Name': ['APPT1', 'APPT1','APPT2','APPT2'],
'Move_In': ['2013-02-01','2019-02-01','2019-02-04','2019-02-19'],
'Move_Out': ['2019-01-31','','2019-02-15','']
})
df['Move_In'] = pd.to_datetime(df['Move_In'])
df['Move_Out'] = pd.to_datetime(df['Move_Out'])
# Shift within each customer so a move-out never leaks across units.
df['Prev_Move_Out'] = df.groupby('Cust_Name')['Move_Out'].shift()
Cust_Name Move_In Move_Out Prev_Move_Out
0 APPT1 2013-02-01 2019-01-31 NaT
1 APPT1 2019-02-01 NaT 2019-01-31
2 APPT2 2019-02-04 2019-02-15 NaT
3 APPT2 2019-02-19 NaT 2019-02-15
def calculate_business_day_vacancy(df):
    try:
        # Business days strictly between Prev_Move_Out and Move_In.
        return len(pd.date_range(start=df['Prev_Move_Out'], end=df['Move_In'], freq='B')) - 2
    except ValueError:
        # Raised when either date is NaT. Consider instead running the
        # function only on rows that do not contain NaT.
        return 0

df['Vacancy_BDays'] = df.apply(calculate_business_day_vacancy, axis=1)
Output
Cust_Name Move_In Move_Out Prev_Move_Out Vacancy_BDays
0 APPT1 2013-02-01 2019-01-31 NaT 0
1 APPT1 2019-02-01 NaT 2019-01-31 0
2 APPT2 2019-02-04 2019-02-15 NaT 0
3 APPT2 2019-02-19 NaT 2019-02-15 1
Note that there is only one Business Day vacancy between 15 Feb 2019 and 19 Feb 2019.
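A vectorized alternative (my sketch, not the answer's code) that avoids the row-wise apply by using np.busday_count:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cust_Name": ["APPT1", "APPT1", "APPT2", "APPT2"],
    "Move_In": pd.to_datetime(["2013-02-01", "2019-02-01", "2019-02-04", "2019-02-19"]),
    "Move_Out": pd.to_datetime(["2019-01-31", None, "2019-02-15", None]),
})

# Previous move-out within each customer, so dates never leak across units.
prev_out = df.groupby("Cust_Name")["Move_Out"].shift()

# np.busday_count counts business days in the half-open range [begin, end),
# so starting one day after the previous move-out excludes both endpoints.
mask = prev_out.notna().to_numpy()
vacancy = np.zeros(len(df), dtype=int)
vacancy[mask] = np.busday_count(
    (prev_out[mask] + pd.Timedelta(days=1)).to_numpy().astype("datetime64[D]"),
    df["Move_In"][mask].to_numpy().astype("datetime64[D]"),
)
df["Vacancy_BDays"] = vacancy
```

This produces the same vacancy values as the apply version on the example data, without a try/except for the NaT rows.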
I have a dataframe which can be created with this:
import pandas as pd
import datetime
#create df
data={'id':[1,1,1,1,2,2,2,2],
'date1':[datetime.date(2016,1,1),datetime.date(2016,7,23),datetime.date(2017,2,26),datetime.date(2017,5,28),
datetime.date(2015,11,1),datetime.date(2016,7,23),datetime.date(2017,6,28),datetime.date(2017,5,23)],
'date2':[datetime.date(2017,5,12),datetime.date(2016,8,10),datetime.date(2017,10,26),datetime.date(2017,9,22),
datetime.date(2015,11,9),datetime.date(2016,9,23),datetime.date(2017,8,3),datetime.date(2017,9,22)]}
df=pd.DataFrame.from_dict(data)
df=df[['id','date1','date2']]
And looks like this:
df
Out[83]:
id date1 date2
0 1 2016-01-01 2017-05-12
1 1 2016-07-23 2016-08-10
2 1 2017-02-26 2017-10-26
3 1 2017-05-28 2017-09-22
4 2 2015-11-01 2015-11-09
5 2 2016-07-23 2016-09-23
6 2 2017-06-28 2017-08-03
7 2 2017-05-23 2017-09-22
What I need to do is create a new column called 'newdate' which at the groupby['id'] level will take all the unique grouped by date values from columns date1 and date2, and give me the NEXT FUTURE date from those unique values after the date in date2.
So the new dataframe would look like:
df
Out[87]:
id date1 date2 newdate
0 1 2016-01-01 2017-05-12 2017-05-28
1 1 2016-07-23 2016-08-10 2017-02-26
2 1 2017-02-26 2017-10-26 None
3 1 2017-05-28 2017-09-22 2017-10-26
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 None
For clarification, take a look at the id=2 records. note in row 4, the newdate is 2016-07-23. This is because it is the FIRST date from all of the dates represented for id=2 in columns date1 & date2, that FOLLOWS the row 4 date2.
We definitely need to use groupby. I think we could use some form(s) of unique(), np.unique, pd.unique to get the dates? But then how do you select the 'NEXT' one and assign? Just stumped...
A few other points: don't assume the dataframe is sorted in any way, and efficiency matters because the actual dataframe is very large. Note also that the 'None' values in newdate are there because no NEXT future date exists, as the maximum date in the subset is the same as date2. We can use None, NaN, whatever to represent these...
EDIT:
Based on Wen's answer: it fails when there are duplicate dates. If you use this dataset:
data={'id':[1,1,1,1,2,2,2,2],
'date1':[datetime.date(2016,1,1),datetime.date(2016,7,23),datetime.date(2017,2,26),datetime.date(2017,5,28),
datetime.date(2015,11,1),datetime.date(2016,7,23),datetime.date(2017,6,28),datetime.date(2017,5,23)],
'date2':[datetime.date(2017,5,12),datetime.date(2017,5,12),datetime.date(2017,2,26),datetime.date(2017,9,22),
datetime.date(2015,11,9),datetime.date(2016,9,23),datetime.date(2017,8,3),datetime.date(2017,9,22)]}
df=pd.DataFrame.from_dict(data)
df=df[['id','date1','date2']]
Then the result is:
df
Out[104]:
id date1 date2 newdate
0 1 2016-01-01 2017-05-12 2017-05-12
1 1 2016-07-23 2017-05-12 2017-05-28
2 1 2017-02-26 2017-02-26 2017-05-12
3 1 2017-05-28 2017-09-22 NaN
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 NaN
Note that row 0 'newdate' should be 2017-05-28, the 'next' available date from the superset of date1&date2 for id==1.
I believe melt gets us closer though...
Perhaps not the quickest, depending on your actual dataframe ("very large" could mean anything). Basically two steps: first create a lookup table mapping every date to the next date, then merge that lookup with the original table.
#get the latest date for each row - just the max of date1 and date2
df['latest_date'] = df.loc[:, ['date1','date2']].max(axis=1)
#for each date, find the next date - basically create a lookup table
new_date_lookup = (df
.melt(id_vars=['id'], value_vars=['date1', 'date2'])
.loc[:, ['id','value']]
)
new_date_lookup = (new_date_lookup
.merge(new_date_lookup, on="id")
.query("value_y > value_x")
.groupby(["id", "value_x"])
.min()
.reset_index()
.rename(columns={'value_x': 'value', 'value_y':'new_date'})
)
#merge the original and lookup table together to get the new_date for each row
new_df = (pd
.merge(df, new_date_lookup, how='left', left_on=['id', 'latest_date'], right_on=['id','value'])
.drop(['latest_date', 'value'], axis=1)
)
print(new_df)
Which gives the output:
id date1 date2 new_date
0 1 2016-01-01 2017-05-12 2017-05-28
1 1 2016-07-23 2016-08-10 2017-02-26
2 1 2017-02-26 2017-10-26 NaN
3 1 2017-05-28 2017-09-22 2017-10-26
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 NaN
And for the second example, added in the edit, gives the output:
id date1 date2 new_date
0 1 2016-01-01 2017-05-12 2017-05-28
1 1 2016-07-23 2017-05-12 2017-05-28
2 1 2017-02-26 2017-02-26 2017-05-12
3 1 2017-05-28 2017-09-22 NaN
4 2 2015-11-01 2015-11-09 2016-07-23
5 2 2016-07-23 2016-09-23 2017-05-23
6 2 2017-06-28 2017-08-03 2017-09-22
7 2 2017-05-23 2017-09-22 NaN
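For large frames, the self-merge in the lookup step grows quadratically with the number of rows per id. A sketch of an alternative using np.searchsorted against each id's sorted unique dates, run here on the first example dataset:

```python
import datetime
import numpy as np
import pandas as pd

data = {
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "date1": [datetime.date(2016, 1, 1), datetime.date(2016, 7, 23),
              datetime.date(2017, 2, 26), datetime.date(2017, 5, 28),
              datetime.date(2015, 11, 1), datetime.date(2016, 7, 23),
              datetime.date(2017, 6, 28), datetime.date(2017, 5, 23)],
    "date2": [datetime.date(2017, 5, 12), datetime.date(2016, 8, 10),
              datetime.date(2017, 10, 26), datetime.date(2017, 9, 22),
              datetime.date(2015, 11, 9), datetime.date(2016, 9, 23),
              datetime.date(2017, 8, 3), datetime.date(2017, 9, 22)],
}
df = pd.DataFrame(data)

parts = []
for _, g in df.groupby("id"):
    # All unique dates seen in either column for this id, sorted ascending.
    dates = np.sort(pd.unique(g[["date1", "date2"]].to_numpy().ravel()))
    # Index of the first date strictly greater than each date2.
    pos = np.searchsorted(dates, g["date2"].to_numpy(), side="right")
    parts.append(g.assign(
        newdate=[dates[p] if p < len(dates) else None for p in pos]))

# Restore the original row order, which groupby does not guarantee.
out = pd.concat(parts).sort_index()
```

Because searchsorted uses side="right", duplicate dates map to the first strictly later date, which is the case that tripped up the answer referenced in the question's edit.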