Python Date mid-way between two dates - python

I have a DataFrame that looks like this:
raw_data = {'Series_Date':['2017-03-10','2017-03-13','2017-03-14','2017-03-15'],'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Series_Date','SeriesDate'])
print(df)
To this DF, I would like to append four columns at the end:
1) Start_Date = SeriesDate - 10 Business Days
2) End_Date = SeriesDate - 3 Business Days
3) Date_Difference = (End_Date - Start_Date)/2. However, if the date difference is 4.5 days the value should be 5 and not 4 i.e. it should round up.
4) Roll_Date = End_Date - 'Date_Difference' Business Days. i.e. if Date_Difference is 5 then the Roll_Date = End_Date - 5 Business Days
I am able to append the first two columns as follows:
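from pandas.tseries.offsets import BDay
# the dates are stored as strings, so convert them before doing date arithmetic
df['SeriesDate'] = pd.to_datetime(df['SeriesDate'])
df['Start_Date'] = df['SeriesDate'] - BDay(10)
df['End_Date'] = df['SeriesDate'] - BDay(3)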
However, I am struggling with the last 2 columns. Could anyone provide some help?

Once you have this df:
Series_Date Start_Date End_Date
0 2017-03-10 2017-02-24 2017-03-07
1 2017-03-13 2017-02-27 2017-03-08
2 2017-03-14 2017-02-28 2017-03-09
3 2017-03-15 2017-03-01 2017-03-10
You can complete the 2 columns:
df['Date_Difference'] = ((df.End_Date - df.Start_Date) / 2).dt.ceil('D')
df['Roll_Date'] = df.End_Date - pd.Series(BDay(dd.days) for dd in df.Date_Difference)
Explanation:
((df.End_Date - df.Start_Date) / 2) gives a Series of timedeltas. .dt.ceil('D') rounds each value in this Series up to a whole day.
pd.Series(BDay(dd.days) for dd in df.Date_Difference) creates a Series of BDay offsets, one per row, based on the number of days in Date_Difference. (There is very likely a better way to do it, but I'm a newbie with pandas.)
Side question: why do you have 2 columns, Series_Date and SeriesDate, with the same content?
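For reference, here is a minimal end-to-end sketch of the four columns, assuming SeriesDate has been converted with pd.to_datetime. Unlike the answer above, it stores Date_Difference as an integer number of days and builds Roll_Date with a plain list comprehension; that is just one possible way to write it, not the only one.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

df = pd.DataFrame({
    'SeriesDate': pd.to_datetime(
        ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'])
})

df['Start_Date'] = df['SeriesDate'] - BDay(10)
df['End_Date'] = df['SeriesDate'] - BDay(3)

# Half of the calendar-day gap, rounded up to a whole day (4.5 days -> 5)
half_gap = ((df['End_Date'] - df['Start_Date']) / 2).dt.ceil('D')
df['Date_Difference'] = half_gap.dt.days

# Step back a different number of business days on each row
df['Roll_Date'] = [
    end - BDay(int(n))
    for end, n in zip(df['End_Date'], df['Date_Difference'])
]

print(df)
```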

Related

Time elapsed since first log for each user

I'm trying to calculate the time difference between all the logs of a user and the first log of that same user. There are users with several logs.
The dataframe looks like this:
16 00000021601 2022-08-23 17:12:04
20 00000021601 2022-08-23 17:12:04
21 00000031313 2022-10-22 11:16:57
22 00000031313 2022-10-22 12:16:44
23 00000031313 2022-10-22 14:39:07
24 00000065137 2022-05-06 11:51:33
25 00000065137 2022-05-06 11:51:33
I know that I could do df['DELTA'] = df.groupby('ID')['DATE'].shift(-1) - df['DATE'] to get the difference between consecutive dates for each user, but since something like iat[0] doesn't work in this case I don't know how to get the difference in relation to the first date.
You can try this code
import pandas as pd
dates = ['2022-08-23 17:12:04',
'2022-08-23 17:12:04',
'2022-10-22 11:16:57',
'2022-10-22 12:16:44',
'2022-10-22 14:39:07',
'2022-05-06 11:51:33',
'2022-05-06 11:51:33',]
ids = [1,1,1,2,2,2,2]
df = pd.DataFrame({'id':ids, 'dates':dates})
df['dates'] = pd.to_datetime(df['dates'])
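# subtract the first date of each group; selecting the column by name avoids
# accidentally picking up the 'id' column with a positional lookup
df.groupby('id').apply(lambda x: x['dates'] - x['dates'].iloc[0])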
Out:
id
1   0      0 days 00:00:00
    1      0 days 00:00:00
    2     59 days 18:04:53
2   3      0 days 00:00:00
    4      0 days 02:22:23
    5   -170 days +23:34:49
    6   -170 days +23:34:49
Name: dates, dtype: timedelta64[ns]
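A vectorized alternative that avoids apply altogether is to broadcast each group's first date back onto its rows with transform. This is only a sketch using the same illustrative id and dates columns; if the logs are not already in chronological order, 'min' may be a better choice than 'first'.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 2],
    'dates': pd.to_datetime([
        '2022-08-23 17:12:04',
        '2022-08-23 17:12:04',
        '2022-10-22 11:16:57',
        '2022-10-22 12:16:44',
        '2022-10-22 14:39:07',
        '2022-05-06 11:51:33',
        '2022-05-06 11:51:33',
    ]),
})

# transform('first') returns a Series aligned with df, holding the first
# date of each row's group
first_log = df.groupby('id')['dates'].transform('first')
df['delta'] = df['dates'] - first_log

print(df)
```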
If your dataframe is large and apply takes a long time, you can try parallel-pandas. It's very simple:
import pandas as pd
from parallel_pandas import ParallelPandas
ParallelPandas.initialize(n_cpu=8)
dates = ['2022-08-23 17:12:04',
'2022-08-23 17:12:04',
'2022-10-22 11:16:57',
'2022-10-22 12:16:44',
'2022-10-22 14:39:07',
'2022-05-06 11:51:33',
'2022-05-06 11:51:33',]
ids = [1,1,1,2,2,2,2]
df = pd.DataFrame({'id':ids, 'dates':dates})
df['dates'] = pd.to_datetime(df['dates'])
#p_apply is parallel analogue of apply method
df.groupby('id').p_apply(lambda x: x['dates'] - x['dates'].iloc[0])
On a large DataFrame this can be several times faster, depending on the number of cores and groups.

Python: Add Weeks to Date from df

How would I add two df columns together (date + weeks):
This works for me:
df['Date'] = pd.to_datetime(startDate, format='%Y-%m-%d') + datetime.timedelta(weeks = 3)
But when I try to add weeks from a column, I get a type error: unsupported type for timedelta weeks component: Series
df['Date'] = pd.to_datetime(startDate, format='%Y-%m-%d') + datetime.timedelta(weeks = df['Duration (weeks)'])
Would appreciate any help thank you!
You can use the pandas to_timedelta function to transform the number-of-weeks column to a timedelta, like this:
import pandas as pd
import numpy as np
# create a DataFrame with a `date` column
df = pd.DataFrame(
pd.date_range(start='1/1/2018', end='1/08/2018'),
columns=["date"]
)
# add a column `weeks` with a random number of weeks
df['weeks'] = np.random.randint(1, 6, df.shape[0])
# use `pd.to_timedelta` to transform the number of weeks column to a timedelta
# and add it to the `date` column
df["new_date"] = df["date"] + pd.to_timedelta(df["weeks"], unit="W")
df.head()
date weeks new_date
0 2018-01-01 5 2018-02-05
1 2018-01-02 2 2018-01-16
2 2018-01-03 2 2018-01-17
3 2018-01-04 4 2018-02-01
4 2018-01-05 3 2018-01-26
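Applied to the situation in the question (one fixed start date plus a column of week counts), the same idea might look like the sketch below. The start_date value and the sample durations are made up for illustration.

```python
import pandas as pd

start_date = '2021-01-04'  # hypothetical value standing in for startDate
df = pd.DataFrame({'Duration (weeks)': [1, 3, 10]})

# A single Timestamp plus a Series of timedeltas gives a Series of dates
df['Date'] = (
    pd.to_datetime(start_date, format='%Y-%m-%d')
    + pd.to_timedelta(df['Duration (weeks)'], unit='W')
)

print(df)
```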

Add months to a date in Pandas

I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
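One detail worth knowing about pd.DateOffset(months=...) is what happens when the target month is shorter than the original day of the month: the result is clipped to the last valid day. A small sketch, with an extra end-of-month date added to the sample for illustration:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-10-11', '2016-11-01', '2016-11-30']))
plus_month_period = 3

# The day of month is preserved where possible and clipped otherwise
future = dates + pd.DateOffset(months=plus_month_period)

print(future)
```

Here 2016-11-30 plus 3 months lands on 2017-02-28, since February 2017 has no 30th.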
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date  Months_to_add
2014-06-01  23
2014-06-01  4
2000-10-01  10
2016-07-01  3
2017-12-01  90
2019-01-01  2
In such a scenario, a single scalar offset such as the one in Zero's or mattblack's code won't work, because each row needs a different number of months. One option is to apply a function over the rows that takes 2 arguments:
A date to which months need to be added to
A month value in integer format
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta
# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this you can use the following code snippet to add months to the Start_Date column. It uses progress_apply, which tqdm adds to pandas objects once tqdm.pandas() has been called; a plain df.apply works the same way without the progress bar. Refer to this Stack Overflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code, starting from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date  Months_to_add  End_Date
2014-06-01  23             2016-05-01
2014-06-01  4              2014-10-01
2000-10-01  10             2001-08-01
2016-07-01  3              2016-10-01
2017-12-01  90             2025-06-01
2019-01-01  2              2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
I believe that the simplest and fastest way to solve this is to convert the dates to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert the result back to datetime with .dt.to_timestamp().
Using the sample data created by @Aruparna Maity (with some of the days changed):
Start_Date  Months_to_add
2014-06-01  23
2014-06-20  4
2000-10-01  10
2016-07-05  3
2017-12-15  90
2019-01-01  2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
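If the exact day is needed, repeat the process with daily periods, adding back the original day of the month. Note that this spills over into the following month when the target month is shorter than the original day (for example, when starting from the 31st).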
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day -1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
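If keeping the day of month matters and end-of-month dates should be clipped rather than spill into the next month, a per-row pd.DateOffset is another option. This sketch trades speed for simplicity (it loops in Python), and the 2019-01-31 row is added here purely to show the clipping.

```python
import pandas as pd

df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2014-06-20', '2017-12-15', '2019-01-31']),
    'Months_to_add': [4, 90, 1],
})

# One DateOffset per row; the day is kept where the target month allows it
df['End_Date'] = [
    start + pd.DateOffset(months=int(n))
    for start, n in zip(df['Start_Date'], df['Months_to_add'])
]

print(df)
```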
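Another way is NumPy's timedelta64. Note that NumPy treats a month as the average month length (30.436875 days), so the result is only approximate, as the time component below shows; newer pandas versions may reject the 'M' unit altogether.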
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]

Subtract days in a filtered timeseries column

I have the following dataframe:
date number
2016-01-20 1
2016-06-21 1
2012-05-6 1
I now want to subtract 10 days from each date, but only for those dates that are later than March 2014. The result should look like this:
date number
2016-01-10 1
2016-06-11 1
2012-05-6 1
I tried the following command, but it simply does not change the column. Does anybody know what I am doing wrong here?
df[df["date"].isin(pd.date_range("2014-02-01", "2018-01-01"))]["date"] = df[df["date"].isin(pd.date_range("2014-02-01", "2018-01-01"))]["date"] - pd.Timedelta(10, "D")
If I just run this command:
df[df["date"].isin(pd.date_range("2014-02-01", "2018-01-01"))]["date"] - pd.Timedelta(10, "D")
It correctly gives me the subtracted dates of the filtered dataframe. However, I do not know how to map these back to the original date column to replace the unmodified dates.
You can use Series.where:
df.date = df.date.where(df.date < '2014-03-01', df.date - pd.Timedelta(10, 'D'))
df
# date number
#0 2016-01-10 1
#1 2016-06-11 1
#2 2012-05-06 1
Or use loc with boolean indexing and assignment:
df.loc[df.date > '2014-03-01', 'date'] -= pd.Timedelta(10, 'D')
df
# date number
#0 2016-01-10 1
#1 2016-06-11 1
#2 2012-05-06 1
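The attempt in the question has no effect because df[mask]["date"] = ... is chained indexing: the assignment lands on a temporary copy, not on df. A self-contained sketch of the loc version, assuming the date column is parsed with pd.to_datetime first:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-20', '2016-06-21', '2012-05-06']),
    'number': [1, 1, 1],
})

# Select rows and column in a single .loc call so the write hits df itself
mask = df['date'] >= '2014-03-01'
df.loc[mask, 'date'] = df.loc[mask, 'date'] - pd.Timedelta(days=10)

print(df)
```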

Pandas: get the days between two dates that fall in a particular month

I have a pandas dataframe with three columns. A start and end date and a month.
I would like to add a column for how many days within the month are between the two dates. I started doing something with apply, the calendar library and some math, but it started to get really complex. I bet pandas has a simple solution, but am struggling to find it.
Input:
import pandas as pd
df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
['2015-03-02', '2016-02-10', '2016-02-01'],
['2011-01-02', '2018-02-10', '2016-03-01']],
columns=['start date', 'end date date', 'Month'])
Desired Output:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
Here is one solution:
build the list of dates between the start and end dates with pd.date_range, then count how many of them share the year and month of the target month.
def overlap(x):
    # x is one row: start date, end date, target month (in that order)
    md = pd.to_datetime(x.iloc[2])
    cand = [(ad.year, ad.month) for ad in pd.date_range(x.iloc[0], x.iloc[1])]
    return len([ym for ym in cand if ym == (md.year, md.month)])
df1["Days in Month"] = df1.apply(overlap, axis=1)
You'll get:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31
You can convert your cells to datetime with
df = df.apply(pd.to_datetime)
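Then find the intersection days with a function (the end date is counted as part of the range, and month is assumed to be the first day of the month):
def intersectionDaysInMonth(start, end, month):
    # first day of the following month; DateOffset also handles December
    next_month = month + pd.DateOffset(months=1)
    # clip the [start, end] range to the month
    overlap_start = max(start, month)
    overlap_end = min(end + pd.Timedelta(days=1), next_month)
    # no overlap gives a negative span, so floor it at zero
    return max(overlap_end - overlap_start, pd.Timedelta(0))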
Then apply
df['Days in Month'] = df.apply(lambda row: intersectionDaysInMonth(*row).days, axis=1)
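For larger frames, the same clipping logic can be written without apply. The sketch below assumes the Month column always holds the first day of a month and that the end date is inclusive; the intermediate names (month_end, lo, hi) are just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
                         ['2015-03-02', '2016-02-10', '2016-02-01'],
                         ['2011-01-02', '2018-02-10', '2016-03-01']],
                   columns=['start date', 'end date date', 'Month'])
d = df1.apply(pd.to_datetime)

month_start = d['Month']
month_end = month_start + pd.offsets.MonthBegin(1)  # first day of next month

# Element-wise max(start, month_start)
lo = d['start date'].where(d['start date'] > month_start, month_start)
# Element-wise min(end + 1 day, month_end), so the end date is inclusive
hi = d['end date date'] + pd.Timedelta(days=1)
hi = hi.where(hi < month_end, month_end)

# Negative spans mean no overlap, so clip them to zero
df1['Days in Month'] = (hi - lo).dt.days.clip(lower=0)

print(df1)
```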
