I want to calculate the number of business days between two dates and create a new pandas dataframe column with those days. I also have a holiday calendar and I want to exclude dates in the holiday calendar while making my calculation.
I looked around and I saw the numpy busday_count function as a useful tool for it. The function counts the number of business days between two dates and also allows you to include a holiday calendar.
I also looked around and I saw the holidays package which gives me the holiday dates for different countries. I thought it will be great to add this holiday calendar into the numpy function.
Then I proceeded as follows;
import pandas as pd
import numpy as np
import holidays
from datetime import datetime, timedelta, date
df = {'start' : ['2019-01-02', '2019-02-01'],
'end' : ['2020-01-04', '2020-03-05']
}
df = pd.DataFrame(df)
holidays_country = holidays.CountryHoliday('UnitedKingdom')
start_date = [d.date for d in df['start']]
end_date = [d.date for d in df['end']]
holidays_numpy = holidays_country[start_date:end_date]
df['business_days'] = np.busday_count(begindates = start_date,
enddates = end_date,
holidays=holidays_numpy)
When I run this code, it throws this error TypeError: Cannot convert type '<class 'list'>' to date
When I looked further, I noticed that the start_date and end_date are lists and that might be whey the error was occuring.
I then changed the holidays_numpy variable to holidays_numpy = holidays_country['2019-01-01':'2019-12-31'] and it worked.
However, since my dates are different for each row in my dataframe, is there a way to set the two arguments in my holiday_numpy variable to select corresponding values (just like the zip function) each from start_date and end_date?
I'm also open to alternative ways of solving this problem.
This should work:
import pandas as pd
import numpy as np
import holidays
df = {'start' : ['2019-01-02', '2019-02-01'],
'end' : ['2020-01-04', '2020-03-05']}
df = pd.DataFrame(df)
holidays_country = holidays.CountryHoliday('UK')
def f(x):
return np.busday_count(x[0],x[1],holidays=holidays_country[x[0]:x[1]])
df['business_days'] = df[['start','end']].apply(f,axis=1)
df.head()
Related
I'm trying to group my data by a week that I predefined using to_datetime and timedelta. However, after copying my script a few times, I was hoping there was a way to iterate this process over multiple weeks. Is this something that can be done?
The data set that I'm working with lists sales out sales revenue and spending out by the day for each data source and its corresponding id.
Below is what I have so far but my knowledge of loops is pretty limited due to being self-taught.
Let me know if what I'm asking is feasible or if I have to continue to copy my code every week.
Code
import pandas as pd
from datetime import datetime, timedelta,date
startdate = '2021-09-26'
enddate = pd.to_datetime(startdate) + timedelta(days=6)
last7 = (df.date >= startdate) & (df.date <= enddate)
df = df.loc[last7,['datasource','id','revenue','spend']]
df.groupby(by=['datasource_name','id'],as_index=False).sum()
df['start_date'] = startdate
df['end_date'] = enddate
df
If I have understood your issue correctly, you are basically trying to aggregate daily data into weekly. You can try following code
import datetime as dt
import pandas as pd
#Get weekend date for each date
df['week_end_date']=df['date'].apply(lambda x: pd.Period(x,freq='W').end_time.date().strftime('%Y-%m-%d'))
#Aggregate sales and revenue at weekly level
df_agg = df.groupby(['datasource_name','id','week_end_date']).agg({'revenue':'sum','spend':'sum'}).reset_index()
df_agg will have all your sales and revenue numbers aggregated by the weekend date for corresponding date.
I am using the yfinance library to import data for a given stock. See code below:
import yfinance as yf
from datetime import datetime as dt
import pandas as pd
# Naming Constants
stock = "AAPL"
start_date = "2014-01-01"
end_date = "2018-01-01"
# Importing all the data into a dataFrame
stock_data = yf.download(stock, start=start_date, end=end_date)
When I call print(stock_data.index) I have the following:
DatetimeIndex(['2014-01-02', '2014-01-03', '2014-01-06', '2014-01-07', '2014-01-08', '2014-01-09', '2014-01-10', '2014-01-13', '2014-01-14', '2014-01-15',
...
'2017-12-15', '2017-12-18', '2017-12-19', '2017-12-20', '2017-12-21', '2017-12-22', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29'],
dtype='datetime64[ns]', name='Date', length=1007, freq=None)
I wish to switch the frequency argument from None to daily since every Date refers to a trading day.
When I say stock_data.index.freq = 'B' I get the following error:
ValueError: Inferred frequency None from passed values does not conform to passed frequency B
And if I put stock_data = stock_data.asfreq('B'), it will change the frequency but it will add certain lines that were not there originally and fills them with NA values.
In other words, what is the offset ALIAS used for trading days?
You can find the list of alias from the Pandas documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
The error with stock_data.index.freq = 'B' indicates that your timeseries frequency is not 'business-day', but undefined or 'None'.
With
stock_data = stock_data.asfreq('B')
your are re-indexing your timeseries to business-daily frequency: The missing timestamps will be added, and the missing stock data values are set to NaN. Now you need to decide how replace them, so have a look here: pandas.DataFrame.asfreq. So you could replace all NaN's with a fixed value like -999, but in general what you want to do with stock data is take the last valid value at a given point in time, which is forward filling the gaps:
stock_data = stock_data.asfreq('B', method='ffill')
It's always worth reading the docs.
date['Maturity_date'] = data.apply(lambda data: relativedelta(months=int(data['TRM_LNTH_MO'])) + data['POL_EFF_DT'], axis=1)
Tried this also:
date['Maturity_date'] = date['POL_EFF_DT'] + date['TRM_LNTH_MO'].values.astype("timedelta64[M]")
TypeError: 'type' object does not support item assignment
import pandas as pd
import datetime
#Convert the date column to date format
date['date_format'] = pd.to_datetime(date['Maturity_date'])
#Add a month column
date['Month'] = date['date_format'].apply(lambda x: x.strftime('%b'))
If you are using Pandas, you may use a resource called: "Frequency Aliases". Something very out of the box:
# For "periods": 1 (is the current date you have) and 2 the result, plus 1, by the frequency of 'M' (month).
import pandas as pd
_new_period = pd.date_range(_existing_date, periods=2, freq='M')
Now you can get exactly the period you want as the second element returned:
# The index for your information is 1. Index 0 is the existing date.
_new_period.strftime('%Y-%m-%d')[1]
# You can format in different ways. Only Year, Month or Day. Whatever.
Consult this link for further information
I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the traget date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed, as I know, the parameter in the dateoffset should be a varaible or a fixed number.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It failes too.
Anyone can help? thank you.
edit:
this is a large dataset with millions rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame({'Date': [pd.datetime(2019,1,1), pd.datetime(2019,2,1)], 'month_diff': [1,2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or list comprehension
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would approach in the following method to compute your "target_date".
Apply the target month offset (in your case +3months), using your pd.DateOffset.
Get the last day of that target month (using for example calendar.monthrange, see also "Get last day of the month"). This will provide you with the "flexible" part of that date" offset.
Apply the flexible day offset, when comparing the result of step 1. and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
max_date = calendar.monthrange(new_.year, new_.month)[1]
end_ = new_ + relativedelta(days=max_date - new_.day)
print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach in solving your issue.
However for some reason I am getting a semantic error in my output even though I am sure it is the correct way. Please everyone correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
I was looking for a solution I can write in one line only and apply does the job. However, by default apply function performs action on each column, so you have to remember to specify correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
Revised question with appropriate MCVE:
As part of a script I'm writing I need to have a loop that contains a different pair of dates during each iteration, these dates are the first and last available stock trading dates of each month. I have managed to find a calendar with the available dates in an index however despite my research I am not sure how to select the correct dates from this index so that they can be used in the DateTime variables start and end.
Here is as far as my research has got me and I will continue to search for and build my own solution which I will post if I manage to find one:
from __future__ import division
import numpy as np
import pandas as pd
import datetime
import pandas_market_calendars as mcal
from pandas_datareader import data as web
from datetime import date
'''
Full date range:
'''
startrange = datetime.date(2016, 1, 1)
endrange = datetime.date(2016, 12, 31)
'''
Tradable dates in the year:
'''
nyse = mcal.get_calendar('NYSE')
available = nyse.valid_days(start_date='2016-01-01', end_date='2016-12-31')
'''
The loop that needs to take first and last trading date of each month:
'''
dict1 = {}
for i in available:
start = datetime.date('''first available trade day of the month''')
end = datetime.date('''last available trade day of the month''')
diffdays = ((end - start).days)/365
dict1 [i] = diffdays
print (dict1)
That is probably because 1 January 2016 was not a trading day. To check if I am right, try giving it the date 4 January 2016, which was the following Monday. If that works, then you will have to be more sophisticated about the dates you ask for.
Look in the documentaion for dm.BbgDataManager(). It is possible that you can ask it what dates are available.