Pandas fill a DataFrame from another by DatetimeIndex - python

I have a DataFrame of sales numbers with a DatetimeIndex, for data that extends over a couple of years at the minute level, and I want to first calculate totals (of sales) per year, month, day, hour and location, then average over years and months.
Then with that data, I want to extrapolate to a new month, per day, hour and location. To do that, I calculate the sales numbers per hour for each day of the week (expecting that weekend days will behave differently from work week days). I then create a new DataFrame for the month I want to extrapolate to, and for each day in that month I compute (day of week, hour, POS) and use the past data for the corresponding (day of week, hour, POS) as my "prediction" for what will be sold at that POS at the given hour and day in the given month.
The reason I'm doing it this way is that once I calculate a mean per day of the week in the past, when I populate the DataFrame for the month of June, the 1st of June could be any day of the week, and that is important as weekdays/weekend days behave differently. I want the past sales number for a Friday, if the 1st is a Friday.
I have the following, which is unfortunately too slow - or maybe wrong; in any case, there is no error message, but it doesn't complete on the real data:
import numpy as np
import pandas as pd
# Setup some sales data for the past 2 years for some stores
hours = pd.date_range('2018-01-01', '2019-12-31', freq='h')
sales = pd.DataFrame(index = hours, columns=['Store', 'Count'])
sales['Store'] = np.random.randint(0,10, sales.shape[0])
sales['Count'] = np.random.randint(0,100, sales.shape[0])
# Calculate the average of sales over these 2 years for each hour in
# each day of the week and each store
avg = sales.groupby([sales.index.year, sales.index.month, sales.index.dayofweek, sales.index.hour, 'Store'])['Count'] \
    .sum() \
    .rename_axis(index=['Year', 'Month', 'DayOfWeek', 'Hour', 'Store']) \
    .reset_index() \
    .groupby(['DayOfWeek', 'Hour', 'Store'])['Count'] \
    .mean() \
    .rename_axis(index=['DayOfWeek', 'Hour', 'Store'])
# Setup a DataFrame to predict May sales per store/day/hour
may_hours = pd.date_range('2020-05-01', '2020-05-31', freq='h')
predicted = pd.DataFrame(index=pd.MultiIndex.from_product([may_hours, range(0, 10)]), columns=['Count']) \
.rename_axis(index=['Datetime', 'Store'])
# "Predict" sales for each (day, hour, store) in May 2020
# by retrieving the average sales for the corresponding
# (day of week, hour, store)
for idx in predicted.index:
    qidx = (idx[0].dayofweek, idx[0].hour, idx[1])
    predicted.loc[idx] = avg[qidx] if qidx in avg.index else 0
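For reference, a vectorized sketch of the same lookup (assuming the averages computed above are assigned to avg): build the (DayOfWeek, Hour, Store) keys for the target month and pull in the historical means with a single merge, instead of looping row by row.
pred = predicted.reset_index()
pred['DayOfWeek'] = pred['Datetime'].dt.dayofweek
pred['Hour'] = pred['Datetime'].dt.hour
# look up the historical mean for each (DayOfWeek, Hour, Store) key
pred = pred.drop(columns='Count').merge(avg.reset_index(),
                                        on=['DayOfWeek', 'Hour', 'Store'],
                                        how='left')
predicted = pred.set_index(['Datetime', 'Store'])[['Count']].fillna(0)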

Related

How to get date range for a specific month starting with a previous month using Pandas

I'm trying to build a list of "pay days" for a given month in the future knowing only when the pay days started months ago. For example:
Starting date - When the paychecks started: 1/6/2023
Frequency is every two weeks
So if I want to know which dates are pay days in March, I have to start at 1/6/2023 and add two weeks until I get to March, to know that the first pay day in March is 3/3/2023.
Then I want my final list of dates to be only those March dates of:
(3/3/2023, 3/17/2023, 3/31/2023)
I know I can use pandas to do something like:
pd.date_range(starting_date, starting_date+relativedelta(months=1), freq='14d')
but it would include every date back to 1/6/2023.
The easiest thing to do here would be to just update the starting_date parameter to be the first pay day in the month you're interested in.
To do this, you can use this function that finds the first pay day in a given month by first finding the difference between your start date and the desired month.
import datetime

# month is the number of the month (1-12)
def get_first_pay_day_in_month(month=datetime.datetime.now().month,
                               year=datetime.datetime.now().year,
                               start_date=datetime.datetime(2023, 1, 6),
                               ):
    diff = datetime.datetime(year, month, 1) - start_date
    freq = 14
    if diff.days % freq == 0:
        print(f'Difference: {diff.days / freq} two-week periods')
        return datetime.datetime(year, month, 1)
    else:
        print(f'Difference: {diff.days} days')
        print(f'Days: {diff.days % freq} extra')
        return datetime.datetime(year, month, 1 + 14 - (diff.days % freq))
Then you can use this function to get the first pay day of a specific month and plug it into the date_range method.
from dateutil import relativedelta
starting_date = get_first_pay_day_in_month(month=3)
pay_days = pd.date_range(starting_date, starting_date+relativedelta.relativedelta(months=1), freq='14d')
print(pay_days)
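With the defaults above (and assuming this runs in 2023, so year defaults to 2023), pay_days comes out as DatetimeIndex(['2023-03-03', '2023-03-17', '2023-03-31'], freq='14D'), matching the list in the question. Depending on where the first pay day falls, the one-month range can occasionally spill into the start of the following month; if that matters, a simple extra filter keeps only the target month:
# optionally keep only the pay days that fall in the requested month
pay_days = pay_days[pay_days.month == starting_date.month]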

Changing timeseries column into a date

I have a timeseries with 2 columns, the first being hours after 1 Jan 1970. In this column, a year is only 360 days, with 12 months of 30 days. I need to convert this column into a usable date so that I can analyse the other column based on month, year etc. (e.g. 1997-Jan-1-1 being year-month-day-hour).
I need to make an array with modulo to convert each row of the hours column into hour_of_day, day_of_month, year etc., so that the column is instead a year, month, day and hour. But I don't know how to do this. I appreciate it might be confusing. Any help on doing this would be very helpful.
Input: 233280.5 (in hours)
Output: 1997-01-01-01 (year-month-day-hour)
You can calculate the number of years and add it to the reference date, e.g.:
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
refdate = pd.Timestamp('1970-01-01')
df = pd.DataFrame({'360d_year_hours': [233280.5]})
# we calculate the number of years and fractional years as helper Series
y_frac, y = np.modf(df['360d_year_hours'] / (24*360))
# now we can calculate the new date's year:
df['datetime'] = pd.Series(refdate + DateOffset(years=i) for i in y)
# we need the days in the given year to be able to use y_frac
daysinyear = np.where(df['datetime'].dt.is_leap_year, 366, 365)
# ...so we can update the datetime and round to the hour:
df['datetime'] = (df['datetime'] + pd.to_timedelta(y_frac*daysinyear, unit='d')).dt.round('h')
# df['datetime']
# 0 1997-01-01 01:00:00
# Name: datetime, dtype: datetime64[ns]
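If you would rather stay entirely inside the 360-day calendar (12 months of 30 days), as the question hints with modulo, here is a minimal sketch using integer division and modulo; it labels the result with the 360-day calendar's own year/month/day instead of mapping onto the real calendar:
total_hours = df['360d_year_hours'].round().astype(int)   # nearest whole hour
years, rem = np.divmod(total_hours, 360 * 24)
months, rem = np.divmod(rem, 30 * 24)
days, hour_of_day = np.divmod(rem, 24)
cal360 = pd.DataFrame({'year': 1970 + years, 'month': months + 1,
                       'day': days + 1, 'hour': hour_of_day})
# for 233280.5 this gives year=1997, month=1, day=1, hour=0; the half hour
# falls inside hour 0 of the 360-day calendar, whereas the approach above
# lands on 01:00 because it spreads the fraction over a real 365-day year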

Pandas date_range - subtracting numpy timedelta gives odd result, time becomes not 0:00:00

I am trying to generate a set of dates with pandas date_range functionality. Then I want to iterate over this range and subtract several months from each of the dates (the exact number of months is determined in the loop) to get a new date.
I get some very odd results when I do this.
MVP:
import numpy as np
import pandas as pd

# get date range; test_size is the step between dates, in months
test_size = 3
dates = pd.date_range(start='1/1/2013', end='1/1/2018', freq=str(test_size)+'MS', closed='left', normalize=True)
#take first date as example
date = dates[0]
date
Timestamp('2013-01-01 00:00:00', freq='3MS')
So far so good.
Now let's say I want to go just one month back from this date. I define a numpy timedelta (it supports months in its definition, while pandas' timedelta doesn't):
#get timedelta of 1 month
deltaGap = np.timedelta64(1,'M')
#subtract one month from date
date - deltaGap
Timestamp('2012-12-01 13:30:54', freq='3MS')
Why so? Why do I get 13:30:54 in the time component instead of midnight?
Moreover, if I subtract more than 1 month, the shift becomes so large that I lose a whole day:
#let's say I want to subtract both 2 years and then 1 month
deltaTrain = np.timedelta64(2,'Y')
#subtract 2 years and then subtract 1 month
date - deltaTrain - deltaGap
Timestamp('2010-12-02 01:52:30', freq='3MS')
I've had similar issues with timedelta, and the solution I ended up using was relativedelta from dateutil, which is specifically built for this kind of application (taking into account all the calendar weirdness like leap years, weekdays, etc.). For example, given:
from dateutil.relativedelta import relativedelta
date = dates[0]
>>> date
Timestamp('2013-01-01 00:00:00', freq='10MS')
deltaGap = relativedelta(months=1)
>>> date-deltaGap
Timestamp('2012-12-01 00:00:00', freq='10MS')
deltaGap = relativedelta(years=2, months=1)
>>> date-deltaGap
Timestamp('2010-12-01 00:00:00', freq='10MS')
Check out the documentation for more info on relativedelta
The issues with numpy.timedelta64
I think that the problem with np.timedelta64 is revealed in these 2 parts of the docs:
There are two Timedelta units (‘Y’, years and ‘M’, months) which are treated specially, because how much time they represent changes depending on when they are used. While a timedelta day unit is equivalent to 24 hours, there is no way to convert a month unit into days, because different months have different numbers of days.
and
The length of the span is the range of a 64-bit integer times the length of the date or unit. For example, the time span for ‘W’ (week) is exactly 7 times longer than the time span for ‘D’ (day), and the time span for ‘D’ (day) is exactly 24 times longer than the time span for ‘h’ (hour).
So the timedeltas are fine for hours, days, and weeks, because these are fixed-length timespans. However, months and years are variable in length (think leap years), and so to take this into account, numpy takes some sort of "average" (I guess). One numpy "year" seems to be one year, 5 hours, 49 minutes and 12 seconds, while one numpy "month" seems to be 30 days, 10 hours, 29 minutes and 6 seconds.
# Adding one numpy month adds 30 days + 10:29:06:
deltaGap = np.timedelta64(1,'M')
date+deltaGap
# Timestamp('2013-01-31 10:29:06', freq='10MS')
# Adding one numpy year adds 1 year + 05:49:12:
deltaGap = np.timedelta64(1,'Y')
date+deltaGap
# Timestamp('2014-01-01 05:49:12', freq='10MS')
This is not so easy to work with, which is why I would just go to relativedelta, which is much more intuitive (to me).
You can try using pd.DateOffset, which is designed for applying offset logic (month, year, hour) to dates.
# generate some hourly dates
dates = pd.date_range(start='1/1/2013', freq='H', periods=100, closed='left', normalize=True)
#take first date as example
date = dates[0]
# subtract a month
dates[0] - pd.DateOffset(months=1)
Timestamp('2012-12-01 00:00:00')
# to apply this on all dates
new_dates = list(map(lambda x: x - pd.DateOffset(months=1), dates))
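As a side note, the offset can usually be applied to the whole DatetimeIndex in one shot instead of mapping over it (some pandas versions emit a PerformanceWarning for non-vectorized offsets, but the result is the same):
# subtract one month from every date in the index at once
new_dates = dates - pd.DateOffset(months=1)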

Compute salary days in Pandas

I've created a custom Calendar:
holidays_list = [...] # list of all weekends and holidays for needed time period
class MyBusinessCalendar(AbstractHolidayCalendar):
start_date = datetime(2011, 1, 1)
end_date = datetime(2017, 12, 31)
rules = [
Holiday(name='Day Off', year=d.year, month=d.month, day=d.day) for d in holidays_list
]
cal = MyBusinessCalendar()
I know that salary days are the 5th and the 20th days of each month or the previous business days if these ones are days off.
Therefore I take
bus_day = CustomBusinessDay(calendar=cal)
r = pd.date_range('2011-01-01', '2017-12-31', freq=bus_day)
and I'd like to compute for each day from r if it's a salary day. How can I get this?
The list of salary days (paydays in American English) is defined by you as:
the 5th and the 20th days of each month or the previous business days if these ones are days off
To generate the list of paydays programmatically using a holiday calendar, you can generate the list of every 6th of the month and every 21st of the month:
from datetime import date

# year is the year you are interested in
dates = [date(year, month, 6) for month in range(1, 13)] + \
        [date(year, month, 21) for month in range(1, 13)]
Then get the previous working day, i.e. offset=-1. I'd use this:
np.busday_offset(dates, -1, roll='forward', holidays=my_holidays)
The reason I use numpy.busday_offset instead of the Pandas stuff for doing the offsets is that it is vectorized and runs very fast, whereas the Pandas busday offset logic is very slow. If the number of dates is small, it won't matter. You can still use Pandas to generate the list of holidays if you want.
Note that roll='forward' is because you want the logic to be that if the 6th is on a weekend or holiday, you roll forward to the 7th or 8th, then from there you offset -1 working day to get the payday.
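Putting the pieces together for the custom calendar above (a sketch; it assumes cal is the MyBusinessCalendar instance from the question and follows the 6th/21st plus offset -1 logic just described):
import numpy as np
import pandas as pd

# holidays as datetime64[D], which is what np.busday_offset expects
my_holidays = cal.holidays(start='2011-01-01', end='2017-12-31').values.astype('datetime64[D]')
# candidate dates: the 6th and the 21st of every month in the range
firsts = pd.date_range('2011-01-01', '2017-12-31', freq='MS')
dates = np.concatenate([(firsts + pd.Timedelta(days=5)).values.astype('datetime64[D]'),
                        (firsts + pd.Timedelta(days=20)).values.astype('datetime64[D]')])
# roll off weekends/holidays, then step back one business day
salary_days = np.sort(np.busday_offset(dates, -1, roll='forward', holidays=my_holidays))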

Difference between multi-year timeseries and its 'standard year'

Assume I've a timeseries of a certain number of years as in:
rng = pd.date_range(start='2001-01-01', periods=5113)
ts = pd.Series(np.random.randn(len(rng)), rng)
Then I can calculate its standard year (the average value of each day over all years) by doing:
std = ts.groupby([ts.index.month, ts.index.day]).mean()
Now I was wondering how I could subtract this standard year from my multi-year timeseries, in order to get a timeseries that shows which days were below or above its standard.
You can do this using the groupby; just subtract each group's mean from the values for that group:
average_diff = ts.groupby([ts.index.month, ts.index.day]).apply(
    lambda g: g - g.mean()
)
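Equivalently, transform broadcasts each group's mean back onto the original DatetimeIndex, which can read a bit more directly:
# subtract each (month, day) mean from the original series
average_diff = ts - ts.groupby([ts.index.month, ts.index.day]).transform('mean')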
