How to create a binary variable based on date ranges

How to create a binary variable based on date ranges - python

I would like to flag all rows with dates 1 week before and 1 week after a specific holiday to be = 1; = 0 otherwise.
What's the best way to do so? Below are my codes, which only flag New Year's Day to be new_year = 1. What I want is all 3 rows to have new_year = 1 (since they fall within 1 week before and after New Year's Day).
Note: I would like the code to work for any holidays (e.g. Thanksgiving, Easter, etc.).
Thank you!
# importing pandas as pd
import pandas as pd
import holidays
# Creating the dataframe
df = pd.DataFrame({'Date': ['1/1/2019', '1/5/2019', '12/28/2018'],
'Event': ['Music', 'Poetry', 'Theatre'],
'Cost': [10000, 5000, 15000]})
df['newDate'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
new_year = holidays.HolidayBase()
new_year.append({"2018-01-01": "New Year's Day",
"2019-01-01": "New Year's Day"})
df['hol_new_year'] = np.where(df['newDate'] in new_year, 1, 0)

You can use pandas' time series offsets:
ye = pd.tseries.offsets.YearEnd()
yb = pd.tseries.offsets.YearBegin()
d = pd.to_timedelta('1w')
s = df['newDate']
df['hol_new_year'] = (s.between(s-ye-d, s-ye+d)
|s.between(s+yb-d, s+yb+d)
).astype(int)
Output:
Date Event Cost newDate hol_new_year
0 1/1/2019 Music 10000 2019-01-01 1
1 1/5/2019 Poetry 5000 2019-01-05 1
2 12/28/2018 Theatre 15000 2018-12-28 1
3 1/15/2021 SO 0 2021-01-15 0

Related

How to tackle a dataset that has multiple same date values

I have a large data set that I'm trying to produce a time series using ARIMA. However
some of the data in the date column has multiple rows with the same date.
The data for the dates was entered this way in the data set as it was not known the exact date of the event, hence unknown dates where entered for the first of that month(biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise the date but I cannot find a way to replace the data with the date ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
stime = time.mktime(time.strptime(start, time_format))
etime = time.mktime(time.strptime(end, time_format))
ptime = stime + prop * (etime - stime)
return time.strftime(time_format, time.localtime(ptime))
def random_date(start, end, prop):
return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
The code above I use to generate a random date within a date range but I'm stuggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks

With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
{
"date": [
"2016-01-01",
"2015-01-01",
"2013-01-01",
"2014-01-01",
"2017-01-01",
"2011-01-01",
],
"value": [10035, 5397, 4567, 4343, 3981, 2049],
}
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Ouput
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049

Since the data in the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by the year and month and select only those who have the day equal 1. Then, use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
for n,g in grouped:
last_day = calendar.monthrange(n[0], n[1])[1] # get last day for this month and year
g['New_Date'] = g['date'].apply(lambda d:
d.replace(day=np.random.randint(FIRST_DAY,last_day+1))
)
df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12

Is there a quick way for checking whether a date lies within n days(say 7) from a list of dates

I'm working with the following dataset:
Date
2016-01-04
2016-01-05
2016-01-06
2016-01-07
2016-01-08
and a list holidays = ['2016-01-01','2016-01-18'....'2017-11-23','2017-12-25']
Objective: Create a column indicating whether a particular date is within +- 7 days of any holiday present in the list.
Mock output:
Date
Within a week of Holiday
2016-01-04
1
2016-01-05
1
2016-01-06
1
2016-01-07
1
2016-01-08
0
I'm working with a lot of date records and thus trying to find a quick(most optimized) way to do this.
My Current Solution:
One way I figured to do this quickly would be to create another list with only the unique dates for my desired duration(say 2 years). This way, I can implement a simple solution with 2 for loops to check if a date is within +-7days of a holiday, and it wouldn't be computationally heavy as both lists would be relatively small(730 unique dates and ~20 dates in the holiday list).
Once I have my desired list of dates, all I have to do is run a single check on my 'Date' column to see if that date is a part of this new list I created. However, any suggestions to do this even quicker?

Turn holidays into a DataFrame and then merge_asof with a tolerance of 6 days:
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
Complete Working Example:
import numpy as np
import pandas as pd
holidays = pd.DataFrame(pd.to_datetime(['2016-01-01', '2016-01-18']),
columns=['Holiday'])
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
print(new_df)
new_df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Or turn Holdiays into a np datetime array then broadcast subtraction across the 'Date' Column, compare the abs to 7 days, and see if there are any matches:
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
Complete Working Example:
import numpy as np
import pandas as pd
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
print(df)
df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0

make a function that calculate date with +- 7 days and check if calculated date is in holidays so return True else False and apply that function to Data frame
import datetime
import pandas as pd
holidays = ['2016-01-01','2016-01-18','2017-11-23','2017-12-25']
def holiday_present(date):
date = datetime.datetime.strptime(date, '%Y-%m-%d')
for i in range(-7,7):
datte = (date - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
if datte in holidays:
return True
return False
data = {
"Date":[
"2016-01-04",
"2016-01-05",
"2016-01-06",
"2016-01-07",
"2016-01-08"]
}
df= pd.DataFrame(data)
df["Within a week of Holiday"] = df["Date"].apply(holiday_present).astype(int)
Output:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0

Try this:
Sample:
import pandas as pd
df = pd.DataFrame({'Date': {0: '2016-01-04',
1: '2016-01-05',
2: '2016-01-06',
3: '2016-01-07',
4: '2016-01-08'}})
Code:
def get_date_range(holidays):
h = [pd.to_datetime(x) for x in holidays]
h = [pd.date_range(x - pd.DateOffset(6), x + pd.DateOffset(6)) for x in h]
h = [x.strftime('%Y-%m-%d') for y in h for x in y]
return h
df['Within a week of Holiday'] = df['Date'].isin(get_date_range(holidays))*1
Result:
Out[141]:
0 1
1 1
2 1
3 1
4 0
Name: Within a week of Holiday, dtype: int32

Python Optimization - Can I avoid the double for loop?

I have a df like the following:
import datetime as dt
import pandas as pd
import pytz
cols = ['utc_datetimes', 'zone_name']
data = [
['2019-11-13 14:41:26,2019-12-18 23:04:12', 'Europe/Stockholm'],
['2019-12-06 21:49:04,2019-12-11 22:52:57,2019-12-18 20:30:58,2019-12-23 18:49:53,2019-12-27 18:34:23,2020-01-07 21:20:51,2020-01-11 17:36:56,2020-01-20 21:45:47,2020-01-30 20:48:49,2020-02-03 21:04:52,2020-02-07 20:05:02,2020-02-10 21:07:21', 'Europe/London']
]
df = pd.DataFrame(data, columns=cols)
print(df)
# utc_datetimes zone_name
# 0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm
# 1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London
And I would like to count the number of nights and Wednesdays, of the row's local time, the dates in the df represent. This is the desired output:
utc_datetimes zone_name nights wednesdays
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 0 1
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 11 2
I've come up with the following double for loop, but it is not as efficient as I'd like it for the sizable df:
# New columns.
df['nights'] = 0
df['wednesdays'] = 0
for row in range(df.shape[0]):
date_list = df['utc_datetimes'].iloc[row].split(',')
user_time_zone = df['zone_name'].iloc[row]
for date in date_list:
datetime_obj = dt.datetime.strptime(
date, '%Y-%m-%d %H:%M:%S'
).replace(tzinfo=pytz.utc)
local_datetime = datetime_obj.astimezone(pytz.timezone(user_time_zone))
# Get day of the week count:
if local_datetime.weekday() == 2:
df['wednesdays'].iloc[row] += 1
# Get time of the day count:
if (local_datetime.hour >17) & (local_datetime.hour <= 23):
df['nights'].iloc[row] += 1
Any suggestions will be appreciated :)
PD. disregard the definition of 'night', just an example.

One way is to first create a reference df by exploding your utc_datetimes column and then get the TimeDelta for each zone:
df = pd.DataFrame(data, columns=cols)
s = (df.assign(utc_datetimes=df["utc_datetimes"].str.split(","))
.explode("utc_datetimes"))
s["diff"] = [pd.Timestamp(a, tz=b).utcoffset() for a,b in zip(s["utc_datetimes"],s["zone_name"])]
With this helper df you can calculate the number of wednesdays and nights:
df["wednesdays"] = (pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.day_name().eq("Wednesday").groupby(level=0).sum()
df["nights"] = ((pd.to_datetime(s["utc_datetimes"])+s["diff"]).dt.hour>17).groupby(level=0).sum()
print (df)
#
utc_datetimes zone_name wednesdays nights
0 2019-11-13 14:41:26,2019-12-18 23:04:12 Europe/Stockholm 1.0 0.0
1 2019-12-06 21:49:04,2019-12-11 22:52:57,2019-1... Europe/London 2.0 11.0

Pandas get days in a between two two dates from a particular month

I have a pandas dataframe with three columns. A start and end date and a month.
I would like to add a column for how many days within the month are between the two dates. I started doing something with apply, the calendar library and some math, but it started to get really complex. I bet pandas has a simple solution, but am struggling to find it.
Input:
import pandas as pd
df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
['2015-03-02', '2016-02-10', '2016-02-01'],
['2011-01-02', '2018-02-10', '2016-03-01']],
columns=['start date', 'end date date', 'Month'])
Desired Output:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31

There is a solution:
get a date list by pd.date_range between start and end dates, and then check how many date has the same year and month with the target month.
def overlap(x):
md = pd.to_datetime(x[2])
cand = [(ad.year, ad.month) for ad in pd.date_range(x[0], x[1])]
return len([x for x in cand if x ==(md.year, md.month)])
df1["Days in Month"]= df1.apply(overlap, axis=1)
You'll get:
start date end date date Month Days in Month
0 2017-01-01 2017-06-01 2016-01-01 0
1 2015-03-02 2016-02-10 2016-02-01 10
2 2011-01-02 2018-02-10 2016-03-01 31

You can convert your cell to datetime by
df = df.applymap(lambda x: pd.to_datetime(x))
Then find intersection days with function
def intersectionDaysInMonth(start, end, month):
end_month = month.replace(month=month.month + 1)
if month <= start <= end_month:
return end_month - start
if month <= end <= end_month:
return end - month
if start <= month < end_month <= end:
return end_month - month
return pd.to_timedelta(0)
Then apply
df['Days in Month'] = df.apply(lambda row: intersectionDaysInMonth(*row).days, axis=1)

Pandas offset DatetimeIndex to next business if date is not a business day

I have a DataFrame which is indexed with the last day of the month. Sometimes this date is a weekday and sometimes it is a weekend. Ignoring holidays, I'm looking to offset the date to the next business date if the date is on a weekend and leave the result unchanged if it is already on a weekday.
Some example data would be
import pandas as pd
idx = [pd.to_datetime('20150430'), pd.to_datetime('20150531'),
pd.to_datetime('20150630')]
df = pd.DataFrame(0, index=idx, columns=['A'])
df
A
2015-04-30 0
2015-05-31 0
2015-06-30 0
df.index.weekday
array([3, 6, 1], dtype=int32)
Something like the following works, however I would appreciate if someone has a solution that is a little more straightforward.
idx = df.index.copy()
wknds = (idx.weekday == 5) | (idx.weekday == 6)
idx2 = idx[~wknds]
idx2 = idx2.append(idx[wknds] + pd.datetools.BDay(1))
idx2 = idx2.order()
df.index = idx2
df
A
2015-04-30 0
2015-06-01 0
2015-06-30 0

You can add 0*BDay()
from pandas.tseries.offsets import BDay
df.index = df.index.map(lambda x : x + 0*BDay())
You can also use this with a Holiday calendar with CDay(calendar) in case there are holidays.

You can map the index with a lambda function, and set the result back to the index.
df.index = df.index.map(lambda x: x if x.dayofweek < 5 else x + pd.DateOffset(7-x.dayofweek))
df
A
2015-04-30 0
2015-06-01 0
2015-06-30 0

Using DataFrame.resample
A more idiomatic method would be to resample to business days:
df.resample('B', label='right', closed='right').first().dropna()
A
2015-04-30 0.0
2015-06-01 0.0
2015-06-30 0.0

Can also use a variation of the logic: a)given input date = 'inputdate', go back one business day using pandas date_range which has business days input; then b) go forward one business day using the same. To do this, you generate a vector with 2 inputs using data_range and select the min or max value to return the appropriate single value. So this could look as follows:
a) get business day before:
date_1b_bef = min(pd.date_range(start=inputdate, periods = 2, freq='-1B'))
b) get business day after the 'business day before':
date_1b_aft = max(pd.date_range(start=date_1b_bef, periods = 2, freq='1B'))
or substituting a) into b) to get one line:
date_1b_aft = max(pd.date_range(start=min(pd.date_range(start=inputdate, periods = 2, freq='-1B')), periods = 2, freq='1B'))
This can also be used with relativedelta to get the business day after some calendar period offset from inputdate. For example:
a) get the business day (using 'following' convention if offset day is not a business day) for 1 calendar month prior to 'input date':
date_1mbef_fol = max(pd.date_range(min(pd.date_range(start=inputdate + relativedelta(months=-1), periods = 2, freq='-1B')), periods = 2, freq = '1B'))
b) get the business day (using 'preceding' convention if offset day is not a business day) for 1 year prior to 'input date':
date_1ybef_pre = min(pd.date_range(max(pd.date_range(start=inputdate + relativedelta(years=-1), periods = 2, freq='1B')), periods = 2, freq = '-1B'))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to create a binary variable based on date ranges - python

Related

How to tackle a dataset that has multiple same date values

Is there a quick way for checking whether a date lies within n days(say 7) from a list of dates

Python Optimization - Can I avoid the double for loop?

Pandas get days in a between two two dates from a particular month

Pandas offset DatetimeIndex to next business if date is not a business day

Categories

Resources