Get average intervals in a list of dates in dataframe - python

I have a dataframe like this
Event   dates                                  Duration
Event1  [1796-12-02, 1796-12-10, 1796-12-11]   9 days
Event2  [1848-03-31, 1848-02-26]               34 days
Event3  [1826-05-20]                           0 days
And I would like to add an "Average time in between dates" column that computes the mean difference between consecutive dates, and looks like:
"Average"
4.5
17
0
In the first row, 4.5 comes from the two consecutive gaps: 8 days (1796-12-02 to 1796-12-10) and 1 day (1796-12-10 to 1796-12-11), so (8 + 1) / 2 = 4.5.

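You can divide the total duration by the number of intervals, which is one fewer than the number of dates. This assumes Duration is stored as a string such as "9 days". A row with a single date has no interval, so the divisor is clamped to 1 to avoid a ZeroDivisionError:
df["Average"] = df.apply(lambda x: float(x["Duration"].replace(" days", "")) / max(len(x["dates"]) - 1, 1), axis=1)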
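If the Duration column is not available or not trustworthy, the average can also be computed from the date lists themselves. The following is a minimal sketch, assuming the dates are stored as lists of ISO-formatted strings; the helper name average_gap_days is hypothetical. Note that a row with two dates has a single interval, so Event2 comes out as 34 here rather than the 17 shown in the question.

```python
import pandas as pd

# Sample data rebuilt from the question; dates are assumed to be ISO strings.
df = pd.DataFrame(
    {
        "dates": [
            ["1796-12-02", "1796-12-10", "1796-12-11"],
            ["1848-03-31", "1848-02-26"],
            ["1826-05-20"],
        ]
    },
    index=["Event1", "Event2", "Event3"],
)


def average_gap_days(dates):
    """Mean number of days between consecutive dates (0.0 for fewer than two dates)."""
    stamps = pd.to_datetime(pd.Series(dates)).sort_values()
    if len(stamps) < 2:
        return 0.0
    # diff() leaves NaT in the first position; mean() skips it.
    return float(stamps.diff().dt.days.mean())


df["Average"] = df["dates"].apply(average_gap_days)
print(df["Average"])
```

Sorting first means the result does not depend on the order in which the dates were stored.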

Related

Calculate avg time between multiple dates

I have a dataframe that looks like this:
Part  Date
1     9/1/2021
1     9/8/2021
1     9/15/2021
2     9/1/2020
2     9/1/2021
2     9/1/2022
The dataframe is already sorted by part, then by date.
I am trying to find the average days between each date grouped by the Part column.
So the desired output would be:
Part  Avg Days
1     7
2     365
How would you go about processing this data to achieve the desired output?
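You can group by "Part", use apply + diff to get the time delta between consecutive rows within each group, and take the mean. The Date column must already be a datetime column (convert it with pd.to_datetime if it holds strings):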
(df.groupby('Part')['Date']
.apply(lambda s: s.diff().mean())
.to_frame()
.reset_index()
)
output:
Part Date
1 7 days
2 365 days
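For reference, here is a self-contained sketch of the same idea that returns plain numbers of days instead of Timedelta values. It assumes the dates are month/day/year strings as in the question; the variable names gaps and avg_days are illustrative only.

```python
import pandas as pd

# Sample data from the question, parsed as month/day/year.
df = pd.DataFrame(
    {
        "Part": [1, 1, 1, 2, 2, 2],
        "Date": ["9/1/2021", "9/8/2021", "9/15/2021",
                 "9/1/2020", "9/1/2021", "9/1/2022"],
    }
)
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Gap in days to the previous row of the same Part (NaN for each group's first row).
gaps = df.groupby("Part")["Date"].diff().dt.days

# Average the gaps per Part; mean() ignores the NaN entries.
avg_days = gaps.groupby(df["Part"]).mean().rename("Avg Days").reset_index()
print(avg_days)
```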

How create field weeks in between two dates

How can I calculate, using pandas, the number of weeks between two dates such as 2019-12-15 and 2019-12-28?
Data:
cw = pd.DataFrame({ "lead_date" : ["2019-12-28" , "2019-12-23"] ,
"Received_date" : ["2019-12-15" , "2019-12-21" ] })
So I could do something like
cw["weeks_between"]= ( cw["lead_date"] - cw["Received_date"]) / 7
The problem is:
For row 1, it returns 1.85, which is the wrong value because one date falls at the beginning of a week and the other at the end of a week.
For row 2, it returns 0.28, which is also wrong because one date falls at the end of a week and the other at the beginning of the next.
So how can I get the number of weeks between these two dates?
Method 1: Using list comprehension, dt.period & getattr
provided by Jon Clements in comments
This method will work when years change between the compared dates:
cw['weeks_diff'] = (
[getattr(el, 'n', 0)
for el in cw['lead_date'].dt.to_period('W') - cw['Received_date'].dt.to_period('W')]
)
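A self-contained sketch of Method 1 follows, including the conversion to datetime that the snippet above takes for granted. The third row is hypothetical and was added only to show a pair of dates that spans a year change. Weekly periods default to weeks ending on Sunday.

```python
import pandas as pd

# First two rows are from the question; the third row is a made-up
# example that crosses from 2019 into 2020.
cw = pd.DataFrame(
    {
        "lead_date": ["2019-12-28", "2019-12-23", "2020-01-06"],
        "Received_date": ["2019-12-15", "2019-12-21", "2019-12-23"],
    }
).apply(pd.to_datetime)

# Map each date to the calendar week (Monday to Sunday) that contains it.
lead_week = cw["lead_date"].dt.to_period("W")
received_week = cw["Received_date"].dt.to_period("W")

# Subtracting periods yields offset objects whose .n is the number of weeks;
# getattr(..., 0) covers missing values, which have no .n attribute.
cw["weeks_diff"] = [getattr(delta, "n", 0) for delta in lead_week - received_week]
print(cw)
```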
Method 2: using week numbers with dt.strftime('%W')
We can use pd.to_datetime to convert your dates to datetime. Then we use the dt.strftime accessor to get the week numbers with %W. Week numbers restart every year, so this method only works when both dates fall in the same year.
Finally we subtract both week numbers:
weeks = (cw[['lead_date', 'Received_date']]
.apply(lambda x: pd.to_datetime(x).dt.strftime('%W'))
.replace('NaT', 0)
.astype(int)
)
cw['weeks_diff'] = weeks['lead_date'] - weeks['Received_date']
lead_date Received_date weeks_diff
0 2019-12-28 2019-12-15 2
1 2019-12-23 2019-12-21 1
You need to convert the columns to datetime using pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({ "lead_date" : ["2019-12-28" , "2019-12-23"] ,
"Received_date" : ["2019-12-15" , "2019-12-21" ] })
df['lead_date']=pd.to_datetime(df['lead_date'])
df['Received_date']=pd.to_datetime(df['Received_date'])
Here is the difference in days between "lead_date" and "Received_date"
df['time_between'] =df['lead_date']-df['Received_date']
print(df.head())
lead_date Received_date time_between
0 2019-12-28 2019-12-15 13 days
1 2019-12-23 2019-12-21 2 days
Update: edits below to get number of weeks. Also added import pandas and numpy.
To get 'time_between' column in weeks:
df['time_between']= df['time_between']/np.timedelta64(1,'W')
will yield
lead_date Received_date time_between
0 2019-12-28 2019-12-15 1.857143
1 2019-12-23 2019-12-21 0.285714
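Update 2: If you want the difference in week numbers rather than the number of days between the dates, use:
df['lead_date']=pd.to_datetime(df['lead_date']).dt.isocalendar().week.astype(int)
df['Received_date']=pd.to_datetime(df['Received_date']).dt.isocalendar().week.astype(int)
df['time_between'] =df['lead_date']-df['Received_date']
yields,
lead_date Received_date time_between
0 52 50 2
1 52 51 1
.dt.isocalendar().week returns the ISO week number in the year. It replaces .dt.week, which was deprecated in pandas 1.1 and removed in pandas 2.0.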
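Subtracting week numbers only gives a meaningful result when both dates are in the same year. The sketch below illustrates this with .dt.isocalendar().week (available from pandas 1.1). The third row is hypothetical and added to show what happens across a year boundary. The cast to int64 is there because the ISO week column is an unsigned integer type, which is not suited to negative differences.

```python
import pandas as pd

# First two rows are from the question; the third is a made-up pair
# of dates that spans the 2019/2020 year change.
df = pd.DataFrame(
    {
        "lead_date": ["2019-12-28", "2019-12-23", "2020-01-06"],
        "Received_date": ["2019-12-15", "2019-12-21", "2019-12-23"],
    }
).apply(pd.to_datetime)

# ISO week numbers, cast to a signed integer type before subtracting.
lead_week = df["lead_date"].dt.isocalendar().week.astype("int64")
received_week = df["Received_date"].dt.isocalendar().week.astype("int64")

df["time_between"] = lead_week - received_week
print(df)
```

The last row comes out negative even though the dates are only two weeks apart, which is why the period-based approach is the safer choice when dates can span different years.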

Finding number of months between overlapping periods - pandas

I have the data set of customers with their policies, I am trying to find the number of months the customer is with us. (tenure)
df
cust_no poly_no start_date end_date
1 1 2016-06-01 2016-08-31
1 2 2017-05-01 2018-05-31
1 3 2016-11-01 2018-05-31
output should look like,
cust_no no_of_months
1 22
So basically, it should leave out the months where there is no policy and count overlapping periods once, not twice. I have to do this for every customer, so I need to group by cust_no. How can I do this?
Thanks.
One way to do this is to create a date range for each record, then use stack to get all the months. Next, take only the unique values so that each month is counted once:
s = df.apply(lambda x: pd.Series(pd.date_range(x.start_date, x.end_date, freq='M').values), axis=1)
ss = s.stack().unique()
ss.shape[0]
Output:
22
For multiple customers you can use groupby. Continuing with @ScottBoston's answer (df_range has no cust_no column, so group by the column from the original dataframe):
df_range = df.apply(lambda r: pd.Series(
pd.date_range(start=r.start_date, end=r.end_date, freq='M')
.values), axis=1)
df_range.groupby(df['cust_no']).apply(lambda x: x.stack().unique().shape[0])
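An alternative sketch, not taken from the answers above, collects monthly periods into a set per customer, which avoids building a wide intermediate dataframe. The second customer and the helper name months_covered are hypothetical. Unlike date_range with freq='M', which only yields month ends inside the range, period_range counts every calendar month the policy touches, so the two approaches can differ for policies that start or end mid-month.

```python
import pandas as pd

# Customer 1 is from the question; customer 2 is a made-up extra row.
df = pd.DataFrame(
    {
        "cust_no": [1, 1, 1, 2],
        "poly_no": [1, 2, 3, 1],
        "start_date": pd.to_datetime(
            ["2016-06-01", "2017-05-01", "2016-11-01", "2020-01-15"]),
        "end_date": pd.to_datetime(
            ["2016-08-31", "2018-05-31", "2018-05-31", "2020-03-10"]),
    }
)


def months_covered(group):
    """Count distinct calendar months covered by any policy in the group."""
    months = set()
    for start, end in zip(group["start_date"], group["end_date"]):
        # Each policy contributes every monthly period from start to end.
        months.update(pd.period_range(start, end, freq="M"))
    return len(months)


tenure = pd.Series(
    {cust: months_covered(group) for cust, group in df.groupby("cust_no")},
    name="no_of_months",
)
print(tenure)
```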

How to convert this to a for-loop with an output to CSV

I'm trying to put together a generic piece of code that would:
1. Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date 4. close decile
2017-01-03 1158.2 0
2017-01-04 1166.5 1
2017-01-05 1181.4 2
2017-01-06 1175.7 1
... ...
2018-04-23 1326.0 7
2018-04-24 1333.2 8
2018-04-25 1327.2 7
[374 rows x 2 columns]
2. Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
3. Take an average of those 30-day price returns for each decile
# .ix was removed in pandas 1.0; .loc selects existing labels and
# reindex returns NaN rows for shifted dates that are not in the index
level1 = gold.loc[datelist]
level2 = gold.reindex(datelist2)
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1,level2, how='inner', left_index=True, right_index=True)
def ret(one, two):
    return (two - one)/one
pricereturns = result.apply(lambda x :ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
4. Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3, but only for a single decile. I'm struggling to expand this into a loop over all 10 deciles at once with a clean CSV output.
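First append the close price at t + 1 month as a new column on the whole dataframe. reindex is used rather than .loc so that shifted dates missing from the index (weekends, holidays) give NaN instead of raising a KeyError.
gold2_close = gold['close'].reindex(gold.index + pd.DateOffset(months=1))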
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
However, what matters in practice is the number of trading days, since you won't have prices for weekends or holidays. So I'd suggest you shift by a number of rows rather than by a date range, e.g. the next 20 trading days:
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1. and 2. directly so you don't need the additional column (only if you shift by number of rows, not by fixed daterange due to reindexing)
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data you get something like the dataframe below. close and ret are the averages for each decile. You can create a csv from a dataframe via pandas.DataFrame.to_csv
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938
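Put together, the steps above can be sketched without any explicit loop. The price series below is synthetic (an evenly spaced ramp on business days) and only stands in for the real gold data; the 20-row horizon is the assumption suggested above. Rows in the last 20 trading days have no forward price, so their return is NaN and they drop out of the averages.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the gold prices: 100 business days, rising by 1 per day.
dates = pd.bdate_range("2017-01-03", periods=100)
gold = pd.DataFrame({"close": np.linspace(1100.0, 1199.0, 100)}, index=dates)

# Step 1: assign each row to a price decile (0 = lowest, 9 = highest).
gold["decile"] = pd.qcut(gold["close"], 10, labels=False)

# Steps 2-3: forward return over the next 20 rows (trading days).
gold["ret"] = gold["close"].shift(-20) / gold["close"] - 1

# Average forward return per decile.
mean_ret = gold.groupby("decile")["ret"].mean()

# Step 4: to_csv returns the CSV text when no path is given;
# pass a file name instead to write it to disk.
csv_text = mean_ret.to_csv()
print(csv_text)
```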

Propagate dates pandas and interpolate

We have some ready available sales data for certain periods, like 1week, 1month...1year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date of 1week, 1month..1year from now.
Then, I'd like to create another column df['days_from_now'], taking how many days are on these pillars (1week would be 7days, 1month would be around 30days..1year around 365days).
The goal of this is then to use any day as input for a simple linear_interpolation_method() to obtain sales data for any given day (e.g., what are the sales for 4 October 2018? ---> We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
    df['date'] = [i.date() for i in
                  [d+delt for d,delt in zip([datetime.now()] * 4 ,
                   [relativedelta(weeks=1), relativedelta(months=1),
                    relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df
create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 1 month, etc. from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
[d+delt for d,delt in zip([datetime.now()] * 4 ,
[relativedelta(weeks=1), relativedelta(months=1),
relativedelta(months=3), relativedelta(years=1)])]]
Takes a list of the date today (datetime.now()) repeated 4 times, and adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), finally creating a new column using the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts today's date from the new dates that you got above. The result is a timedelta object, which pandas conveniently formats as "n days".
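For the interpolation part of the question, one possible sketch uses numpy.interp on the days_from_now values. This is an assumption about what the asker meant by linear interpolation, not something the answer above confirms. The reference date is fixed to 2018-04-04 so the example is reproducible, pd.DateOffset is used in place of relativedelta, and the function name interpolate_sales is hypothetical.

```python
import numpy as np
import pandas as pd

# Fixed "today" so the example is reproducible (an assumption for illustration).
today = pd.Timestamp("2018-04-04")

# Calendar offset corresponding to each pillar label.
offsets = {
    "1W": pd.DateOffset(weeks=1),
    "1M": pd.DateOffset(months=1),
    "3M": pd.DateOffset(months=3),
    "1Y": pd.DateOffset(years=1),
}

df = pd.DataFrame(
    {"time_pillar": ["1W", "1M", "3M", "1Y"], "sales": [4.75, 5.00, 5.10, 5.75]}
)
df["date"] = [today + offsets[pillar] for pillar in df["time_pillar"]]
df["days_from_now"] = (df["date"] - today).dt.days


def interpolate_sales(target_date):
    """Linearly interpolate sales between the two pillars surrounding target_date."""
    days = (pd.Timestamp(target_date) - today).days
    # np.interp needs increasing x values; it clamps to the end values
    # for dates before the first pillar or after the last one.
    return float(np.interp(days, df["days_from_now"], df["sales"]))


print(interpolate_sales("2018-10-04"))
```

Because numpy.interp clamps outside the known range, a date earlier than the 1W pillar simply returns the 1W sales figure; extrapolation would need separate handling.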
