How create field weeks in between two dates - python

How to calculate using pandas weeks between two dates such 2019-12-15 and 2019-12-28
Data:
cw = pd.DataFrame({ "lead_date" : ["2019-12-28" , "2019-12-23"] ,
"Received_date" : ["2019-12-15" , "2019-12-21" ] })
So I could do something like
cw["weeks_between"]= ( cw["lead_date"] - cw["Received_date"]) / 7
The problem is..
For row 1:
it will return 1.85, but is wrong value because one day starts in on beginning of week Vs End of week
For row 2:
It will return 0.28, but also wrong because one day starts end of week Vs beginning of week.
-
So how can I get the number of weeks in between this two dates?

Method 1: Using list comprehension, dt.period & getattr
provided by Jon Clements in comments
This method will work when years change between the compared dates:
cw['weeks_diff'] = (
[getattr(el, 'n', 0)
for el in cw['lead_date'].dt.to_period('W') - cw['Received_date'].dt.to_period('W')]
)
Method 2: using weeknumbers with dt.strftime('%W')
We can use pd.to_datetime to convert your dates to datetime. Then we use the dt.strftime accessor to get the weeknumbers with %W.
Finally we substract both weeknumbers:
weeks = (cw[['lead_date', 'Received_date']]
.apply(lambda x: pd.to_datetime(x).dt.strftime('%W'))
.replace('NaT', 0)
.astype(int)
)
cw['weeks_diff'] = weeks['lead_date'] - weeks['Received_date']
lead_date Received_date weeks_diff
0 2019-12-28 2019-12-15 2
1 2019-12-23 2019-12-21 1

You need to use convert to datetime using pandas
import pandas as pd
import numpy as np
df = pd.DataFrame({ "lead_date" : ["2019-12-28" , "2019-12-23"] ,
"Received_date" : ["2019-12-15" , "2019-12-21" ] })
df['lead_date']=pd.to_datetime(df['lead_date'])
df['Received_date']=pd.to_datetime(df['Received_date'])
Here is the difference in days between "lead_date" and "Received_date"
df['time_between'] =df['lead_date']-df['Received_date']
print(df.head())
lead_date Received_date time_between
0 2019-12-28 2019-12-15 13 days
1 2019-12-23 2019-12-21 2 days
Update: edits below to get number of weeks. Also added import pandas and numpy.
To get 'time_between' column in weeks:
df['time_between']= df['time_between']/np.timedelta64(1,'W')
will yield
lead_date Received_date time_between
0 2019-12-28 2019-12-15 1.857143
1 2019-12-23 2019-12-21 0.285714
Update 2: If you want week number subtractions and not days between then use:
df['lead_date']=pd.to_datetime(df['lead_date']).dt.week
df['Received_date']=pd.to_datetime(df['Received_date']).dt.week
df['time_between'] =df['lead_date']-df['Received_date']
yields,
lead_date Received_date time_between
0 52 50 2
1 52 51 1
.dt.week returns week number in the year.

Related

Calculate avg time between multiple dates

I have a dataframe that looks like this:
Part
Date
1
9/1/2021
1
9/8/2021
1
9/15/2021
2
9/1/2020
2
9/1/2021
2
9/1/2022
The dataframe is already sorted by part, then by date.
I am trying to find the average days between each date grouped by the Part column.
So the desired output would be:
Part
Avg Days
1
7
2
365
How would you go about processing this data to achieve the desired output?
You can groupby "Date", use apply+ diff to get the time delta between consecutive rows, and get the mean:
(df.groupby('Part')['Date']
.apply(lambda s: s.diff().mean())
.to_frame()
.reset_index()
)
output:
Part Date
1 7 days
2 365 days

How to count business days per month for the whole year with different weekmask every week?

I'm trying to make a program that will equally distribute employees' day off. There are 4 groups and each group has it's own weekmask for each week of the month. So far I've made a code that will change weekmask when it locates 0 in Dataframe(Sunday). I'm stuck on structuring this command np.busday_count(start, end, weekmask=) to automatically change the start and the end date.
My Dataframe looks like this:
And here's my code:
a: int = 0
week_mask: str = '1100111'
def _change_week_mask():
global a, week_mask
a += 1
if a == 1:
week_mask = '1111000'
elif a == 2:
week_mask = '1111111'
elif a == 3:
week_mask = '0011111'
else:
a = 0
for line in rows['Workday']:
if line is '0':
_change_week_mask()
Edit: changed the value of start week from 6 to 0.
Ok, so to answer your problem I have created the sample data frame with below code.
Then I have added below columns to the data frame.
dayofweek - to reach to similar data which you created by setting every Sunday as zero. In this case Monday is set as zero and Sunday is six.
weeknum - week of year
week - instead of counting and than changing the week mask, I have assigned the value to week from 0 to 3 and based on it, we can calculate the mask.
weekmask - using value of the week, I have calculate the mask, you might need to align this as per your logic.
weekenddate- end date I have calculate by adding 7 to start date, if month is changing mid week then this will have month end date.
b
after this we can create a new data frame to have only end of week entry, in this case Monday is 0 so I have taken 0.
then you can apply function and store the result to data frame.
import datetime
import pandas as pd
import numpy as np
df_ = pd.DataFrame({'startdate':pd.date_range(pd.to_datetime('2018-10-01'), pd.to_datetime('2018-11-30'))})
df_['dayofweek'] = df_.startdate.dt.dayofweek
df_['remaining_days_in_month'] = df_.startdate.dt.days_in_month - df_.startdate.dt.day
df_['week'] = df_.startdate.dt.week%4
df_['day'] = df_.startdate.dt.day
df_['weekmask'] = df_.week.map({0 : '1100111', 1 : '1111000' , 2 : '1111111', 3: '0011111'})
df_['weekenddate'] = [x[0] + datetime.timedelta(days=(7-x[1])) if x[2] > 7-x[1] else x[0] + datetime.timedelta(days=(x[2])) for x in df_[['startdate','dayofweek','remaining_days_in_month']].values]
final_df = df_[(df_['dayofweek']==0) | ( df_['day']==1)][['startdate','weekenddate','weekmask']]
final_df['numberofdays'] = [ np.busday_count((x[0]).astype('<M8[D]'), x[1].astype('<M8[D]'), weekmask=x[2]) for x in final_df.values.astype(str)]
Output:
startdate weekenddate weekmask numberofdays
0 2018-10-01 2018-10-08 1100111 5
7 2018-10-08 2018-10-15 1111000 4
14 2018-10-15 2018-10-22 1111111 7
21 2018-10-22 2018-10-29 0011111 5
28 2018-10-29 2018-10-31 1100111 2
31 2018-11-01 2018-11-05 1100111 3
35 2018-11-05 2018-11-12 1111000 4
42 2018-11-12 2018-11-19 1111111 7
49 2018-11-19 2018-11-26 0011111 5
56 2018-11-26 2018-11-30 1100111 2
let me know if this needs some changes as per your requirement.

Pandas: Group by bi-monthly date field

I am trying to group by hospital staff working hours bi monthly. I have raw data on daily basis which look like below.
date hourse_spent emp_id
9/11/2016 8 1
15/11/2016 8 1
22/11/2016 8 2
23/11/2016 8 1
How I want to group by is.
cycle hourse_spent emp_id
1/11/2016-15/11/2016 16 1
16/11/2016-31/11/2016 8 2
16/11/2016-31/11/2016 8 1
I am trying to do the same with grouper and frequency in pandas something as below.
data.set_index('date',inplace=True)
print data.head()
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
#df.resample('10d').mean().interpolate(method='linear',axis=0)
print dt.resample('SMS').sum()
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this is giving data of 15 days interval not like 1 to 15 and 15 to 31.
Please let me know what I am doing wrong here.
You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
The freq='SM' is the concept of semi-months which will use the 15th and the last day of every month
Put DateTime-Values into Bins
If I got you right, you basically want to put your values in the date column into bins. For this, pandas has the pd.cut() function included, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
df = pd.DataFrame({
'hours' : 8,
'emp_id' : [1,1,2,1],
'date' : [pd.datetime(2016,11,9),
pd.datetime(2016,11,15),
pd.datetime(2016,11,22),
pd.datetime(2016,11,23)]
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The construction "df1['Date'] + pd.DateOffset(days=-1)" will take whatever is in the date column and -1 day.
The construction "+ pd.offsets.SemiMonthEnd()" converts it to a bimonthly basket, but its off by a day unless you reduce the reference date by 1.
The construction "df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')" cleans out the time so you just have days.

Finding number of months between overlapping periods - pandas

I have the data set of customers with their policies, I am trying to find the number of months the customer is with us. (tenure)
df
cust_no poly_no start_date end_date
1 1 2016-06-01 2016-08-31
1 2 2017-05-01 2018-05-31
1 3 2016-11-01 2018-05-31
output should look like,
cust_no no_of_months
1 22
So basically, it should get rid of the months where there is no policy and count the overlapping period once not twice. I have to do this for every customers, so group by cust_no, how can i do this?
Thanks.
One way to do this is to create date ranges for each records, then use stack to get all the months. Next, take the unique values only to count a month only once:
s = df.apply(lambda x: pd.Series(pd.date_range(x.start_date, x.end_date, freq='M').values), axis=1)
ss = s.stack().unique()
ss.shape[0]
Output:
22
For multiple customers you can use groupby. Continuing with #ScottBoston's answer:
df_range = df.apply(lambda r: pd.Series(
pd.date_range(start=r.start_date, end=r.end_date, freq='M')
.values), axis=1)
df_range.groupby('cust_no').apply(lambda x: x.stack().unique().shape[0])

Propagate dates pandas and interpolate

We have some ready available sales data for certain periods, like 1week, 1month...1year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date of 1week, 1month..1year from now.
Then, I'd like to create another column df['days_from_now'], taking how many days are on these pillars (1week would be 7days, 1month would be around 30days..1year around 365days).
The goal of this is then to use any day as input for a a simple linear_interpolation_method() to obtain sales data for any given day (eg, what are sales for 4Octobober2018? ---> We would interpolate between 3months and 1year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
df['date'] = [i.date() for i in
[d+delt for d,delt in zip([datetime.now()] * 4 ,
[relativedelta(weeks=1), relativedelta(months=1),
relativedelta(months=3), relativedelta(years=1)])]]
df['days_from_now'] = df['date'] - datetime.now().date()
return df
create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 3 weeks, etc. from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
[d+delt for d,delt in zip([datetime.now()] * 4 ,
[relativedelta(weeks=1), relativedelta(months=1),
relativedelta(months=3), relativedelta(years=1)])]]
Takes a list of the date today (datetime.now()) repeated 4 times, and adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), finally creating a new column using the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward, it simply subtracts those new dates that you got above from the date today. The result is a timedelta object, which pandas conveniently formats as "n days".

Categories

Resources