Defining time as categorical variables in decision tree algorithms - python

I'm using LightGBM to solve a time-series regression problem using decision tree methods (determining the price of strawberries over several years). The function lightgbm.Dataset accepts a list of categorical features, and I'm not sure if time features should be included in the list.
I've separated my time index data into year, month, season etc:
import holidays
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(df.index)
df['year'] = idx.year
df['month'] = idx.month
df['week'] = idx.isocalendar().week.astype(np.int64).values
df['day'] = idx.dayofweek  # Mon=0, ..., Sun=6
df['weekend'] = (idx.dayofweek >= 5).astype(np.int64)  # weekday=0, weekend=1
# 1=winter, 2=spring, 3=summer, 4=autumn
df['season'] = idx.month % 12 // 3 + 1
# national public holidays (build the holiday calendar once, not per row)
bel_holidays = holidays.CountryHoliday('BEL')
df['hols'] = pd.Series(idx).apply(lambda x: x in bel_holidays).astype(np.int64).values
Now I'm trying to determine which of these should be classified as categorical variables. I've had a look at this Data Science post, but it still seems inconclusive.
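For reference, the mechanics of declaring categoricals are straightforward (a minimal sketch with hypothetical data; which columns to list is exactly the judgment call in question):

```python
import pandas as pd

# Hypothetical frame with a few of the engineered columns from above
df = pd.DataFrame({
    "year": [2019, 2020], "month": [1, 7], "season": [1, 3],
    "hols": [0, 1], "price": [2.5, 3.1],
})

# Low-cardinality, unordered columns are the usual candidates; casting them
# to pandas 'category' dtype lets LightGBM pick them up automatically
cat_cols = ["month", "season", "hols"]
df[cat_cols] = df[cat_cols].astype("category")

# Equivalently, pass the names explicitly:
# lightgbm.Dataset(df.drop(columns="price"), label=df["price"],
#                  categorical_feature=cat_cols)
```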

I have used this kind of temporal variable in the past, but I think there is a drawback to using it like that. For instance, take a look at season and these 3 dates:
January 10 -> winter
March 19 -> winter
March 20 -> spring
Which days are nearest in your case: (10-01, 19-03) or (19-03, 20-03)? IMHO, the price of strawberries is probably closer between (19-03, 20-03) than between (10-01, 19-03), even though they are not in the same season.
I had a similar problem with DayOfYear (1->365):
2019-12-31: 365
2020-01-01: 1
2020-07-01: 183
2020-12-31: 365
The distance between days was not representative. I solved it with this formula:
Value of day = min(365 - DoY(day), DoY(day) - 1)
2019-12-31: 0
2020-01-01: 0
2020-07-01: 182
2020-12-31: 0
It was the right choice for European electricity consumption because the load is really seasonal.
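The same idea can be sketched in pandas, made leap-year aware (the fixed 365 in the formula above under-counts years like 2020):

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2019-12-31", "2020-01-01", "2020-07-01", "2020-12-31"])
doy = dates.dayofyear
days_in_year = np.where(dates.is_leap_year, 366, 365)

# Distance to the nearest New Year boundary, so 31 Dec and 1 Jan are close
value_of_day = np.minimum(doy - 1, days_in_year - doy)
```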

Related

Cumulative Sales of a Last Period

I have the following code that starts like this:
# Import libraries
import numpy as np
import pandas as pd
import datetime as dt
# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(10)
The file that I import, you can download it from here: Data
And it looks like this:
Then what I do is group the values and create a metric called Rolling Year, with columns (RY_ACTUAL) and (RY_LAST); these tell me the sales of each category, for example the Blue category, over the last twelve months. This metric works fine:
# ROLLING YEAR
# I want to make a rolling year for each category, i.e. how much each category
# sold from 12 months ago TO the current month
# RY_ACTUAL: one year has 12 months, so I pass 12 to rolling()
f = lambda x: x.rolling(12).sum()
df_group["RY_ACTUAL"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f)
# RY_24: a rolling window of 24 months to compare the actual RY vs the last RY
f_1 = lambda x: x.rolling(24).sum()
df_group["RY_24"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f_1)
# RY_LAST: subtract RY_ACTUAL from RY_24 to get the previous rolling year (RY-1)
df_group["RY_LAST"] = df_group["RY_24"] - df_group["RY_ACTUAL"]
My problem is with the metric called Year To Date, which is simply the accumulated sales of each category from January up to the month of the current row. For example, if I stop at March 2015, I want to know how much each category sold from January to March 2015. The column I created called YTD_ACTUAL does just that, and I compute it like this:
# YTD_ACTUAL
df_group['YTD_ACTUAL'] = df_group.groupby(["CATEGORY","DATE"]).Sales.cumsum()
However, what I have not been able to build is the YTD_LAST column, i.e. the same figure for the previous period: continuing the example stopped at March 2015, for the Blue category it should return the accumulated sales from January to March of 2014.
My try >.<
#YTD_LAST
df_group['YTD_LAST'] = df_group.groupby(["CATEGORY", "DATE"]).Sales.apply(f)
Could someone help me to make this column correctly?
Thank you in advance, community!
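Not from the thread, but one way to sketch YTD_LAST is to compute YTD within each category-year and then, assuming complete and sorted monthly data, shift 12 rows back per category (column names mirror the question; the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical monthly data: one row per (CATEGORY, DATE), no gaps
df_group = pd.DataFrame({
    "CATEGORY": ["Blue"] * 24,
    "DATE": pd.date_range("2014-01-01", periods=24, freq="MS"),
    "Sales": range(1, 25),
})

# YTD_ACTUAL: cumulative sales within each category-year
df_group["YTD_ACTUAL"] = df_group.groupby(
    ["CATEGORY", df_group["DATE"].dt.year])["Sales"].cumsum()

# YTD_LAST: the same month's YTD one year earlier -- with complete, sorted
# monthly data that is simply 12 rows back within each category
df_group["YTD_LAST"] = df_group.groupby("CATEGORY")["YTD_ACTUAL"].shift(12)
```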

Return Earliest Date based on value within dataset

I am working with REIGN data, which documents elections and leaders in countries around the world (https://www.oneearthfuture.org/datasets/reign).
In the dataset there is a boolean election-anticipation variable that turns from 0 to 1 to denote that an election is anticipated within at least the next 6 months, possibly sooner.
Excel sheet of data in question
I want to create a new column that returns the earliest date of when anticipation (column N) turns 1 (i.e. when was the election first anticipated).
So, for example, for Afghanistan we have an election in 2014 and in 2017.
In column N we see the variable turn from 0 to 1 in Oct 2014 (election anticipated), then go back to 0 in July 2014 (election concluded), until it turns back to 1 in Jan 2019 (election anticipated) and then back to 0 in Oct 2019.
So, if successful, I would capture Oct 2014 and Jan 2019 as election announcement dates in a new column, along with any other dates an election was anticipated.
Currently I have the following:
import pandas as pd

# bringing in the REIGN CSV
regin = pd.read_csv('REIGN_2021_7(1).csv')
#shows us the first 5 rows to make sure they look good
print(regin.head())
#show us the rows and columns in the file
regin.shape
#Getting our column names
print(regin.columns)
#adding in a date column that concatenates year and month
regin['date'] = pd.to_datetime(regin[['year', 'month']].assign(DAY=1))
regin.head()

def conditions(s):
    if s['anticipation'] == 1:
        return s['date']
    else:
        return 0

regin['announced_date'] = regin.apply(conditions, axis=1)
print(regin.head())
The biggest issue for me is that while this returns the date wherever a 1 appears, it does not give the earliest date. How can I loop through the anticipation column and return the minimum date, and do so multiple times? A country will have many elections over the years, so there are multiple instances in column N of the anticipation turning on (1) and off (0).
Thanks in advance for any assistance! Let me know if anything is unclear.
If you can loop over your dates, you will probably want to use the datetime module (assuming all dates have the same format):
from datetime import datetime
[...]
earliest_date = datetime.today()
[... loop over data, by country ...]
    date1 = datetime.strptime(input_date_string1, date_string_format)
    if date1 < earliest_date:
        earliest_date = date1
[...]
This module supports (among other things):
parsing date objects from a string (.strptime(in_str, format))
comparison of date objects (date1 > date2)
datetime object from current date + time (.today())
datetime object from arbitrary date (.date(year, month, day))
docs: https://docs.python.org/3/library/datetime.html
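As an alternative to the explicit loop, a vectorised pandas sketch (with a hypothetical mini-frame mimicking the REIGN columns) can flag each 0 -> 1 transition per country and collect those dates:

```python
import pandas as pd

# Hypothetical mini-frame mimicking the REIGN columns
regin = pd.DataFrame({
    "country": ["Afghanistan"] * 6,
    "anticipation": [0, 1, 1, 0, 1, 0],
    "date": pd.to_datetime(["2014-08-01", "2014-10-01", "2014-12-01",
                            "2015-07-01", "2019-01-01", "2019-10-01"]),
})

# A 0 -> 1 transition is a row where anticipation is 1 and the previous
# row within the same country was 0
turned_on = (regin["anticipation"] == 1) & (
    regin.groupby("country")["anticipation"].shift(1, fill_value=0) == 0
)
announced = regin[turned_on].groupby("country")["date"].apply(list)
```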

Modelling Loan Payments - Calculate IRR

Working with loan data.
I have a dataframe with the columns:
df_irr = df1[['id', 'funded_amnt_t', 'Expect_NoPayments','installment']]
ID of the loan | funded amount | expected number of payments | fixed installment of the annuity.
I have estimated the number of payments with regression analysis.
The loans have 36 or 60 months maturity.
Now I am trying to calculate the expected IRR (internal rate of return), but I am stuck.
I was planning to use numpy.irr, however I never got the chance to use it, as my data is not in the right format.
I have tried pandas pivot and reshape functions. No luck.
Time series of cash flows:
- Columns: months 0, ..., 60
- Rows: ID for each loan
- Value in month 0: -funded_amount
- Values in months 1-60: installment if expected_number_of_payments >= month
My old Stata code was:
keep id installment funded_amnt expectednumberofpayments
sort id
expand 61, generate(expand)
bysort id : gen month = _n
gen cf = 0
replace cf = installment if (expectednumberofpayments+1)>=month
replace cf = funded_amnt*-1 if month==1
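The same layout the Stata snippet builds can be sketched in pandas/NumPy (column names taken from the question; the sample loan values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical loans with the question's column names
df_irr = pd.DataFrame({
    "id": [1, 2],
    "funded_amnt_t": [1000.0, 5000.0],
    "Expect_NoPayments": [12, 36],
    "installment": [90.0, 160.0],
})

months = np.arange(61)  # months 0..60
# installment in months 1..Expect_NoPayments, zero afterwards
cf = np.where(
    (months[None, :] >= 1)
    & (months[None, :] <= df_irr["Expect_NoPayments"].to_numpy()[:, None]),
    df_irr["installment"].to_numpy()[:, None],
    0.0,
)
cf[:, 0] = -df_irr["funded_amnt_t"].to_numpy()  # month 0: the outflow
cashflows = pd.DataFrame(cf, index=df_irr["id"], columns=months)
```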
numpy.irr is the wrong function to use: it is for irregular payments (e.g. $100 in month 1, $0 in month 2, and $400 in month 3). Instead, you want numpy.rate. I'm making some assumptions about your data for this solution:
import numpy as np

# pv is the cash outflow, hence the minus sign; note that in NumPy >= 1.20
# this function lives in the numpy-financial package as numpy_financial.rate
df_irr['rate'] = np.rate(nper=df_irr['Expect_NoPayments'],
                         pmt=df_irr['installment'],
                         pv=-df_irr['funded_amnt_t'])
More information can be found here numpy documentation.

Adjust datetime in Pandas to get CustomBusinessWeek

I have a long series of daily stock prices and I am trying to get weekly prices to do some calculations. I have been reading the documentation, and I see you can set offsets to get a specific day of the week, which is what I want. This is the code (assume stock is part of a loop I am running):
df_clean_BW['WEEKLY_PricesFriday'] = stock.resample('W-FRI').last()
But in the US stock market many Fridays are holidays, and then I saw you can adjust for US calendar holidays. This is the code I was using:
from pandas.tseries.offsets import CustomBusinessDay
from pandas.tseries.holiday import USFederalHolidayCalendar
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
But I don't know how to combine the two so that if a Friday is a holiday, it takes the day prior (the Thursday) instead. So, something like this, but this throws an error:
df_clean_BW['WEEKLY_PricesFriday'] = stock.resample('W-FRI' & bday_us).last()
I have a long list of dates, so I don't want to build a list of exception days by hand; it would be too long. Here is an example of the output I would want. In this case Jan 1, 2016 was a Friday, so I just want to take December 31, 2015 instead. This must be a common request for anyone who looks at stock data, but I can't figure out a way to do it.
Date        Price     Week Price
12/30/2015  103.3227
12/31/2015  101.3394
1/4/2016    101.426   101.3394  << take 12/31, since 1/1 is a holiday
1/5/2016    98.8844
1/6/2016    96.9492
1/7/2016    92.8575
1/8/2016    93.3485   93.3485
First generate your array of Fridays including holidays. Then use np.busday_offset() to offset them like this:
np.busday_offset(fridays, 0, roll='backward', busdaycal=bday_us.calendar)
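Putting that together (the date range here is a hypothetical sample window):

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay
from pandas.tseries.holiday import USFederalHolidayCalendar

bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# Every Friday in the sample window, holidays included
fridays = pd.date_range("2015-12-20", "2016-01-15", freq="W-FRI")

# Roll any Friday that falls on a holiday back to the previous business day
adjusted = np.busday_offset(
    fridays.values.astype("datetime64[D]"),
    0,
    roll="backward",
    busdaycal=bday_us.calendar,
)
```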

Linking Simpy simulation time to Python Calendar for week day specific actions

I want to build a simulation model of a production network with SimPy comprising the following features with regard to time:
Plants work from Monday to Friday (with two shifts of 8 hours)
Heavy trucks drive on all days of the week except Sunday
Light trucks drive on all days of the week, including Sunday
To this purpose, I want to construct a BroadcastPipe as given in the docs combined with timeouts to make the objects wait during days they are not working (for the plants additional logic is required to model shifts). This BroadcastPipe would just count the days (assuming 24*60 minutes for each day) and then say "It's Monday, everybody". The objects (plant, light and heavy trucks) would then process this information individually and act accordingly.
Now, I wonder whether there is an elegant way to link simulation time to regular Python calendar objects in order to easily access days of the week. This would be useful for clarity and for enhancements like bank holidays and varying starting days. Do you have any advice on how to do this (or general advice on how to model this better)? Thanks in advance!
I usually set a start date and define it to be equal to simulation time (Environment.now) 0. Since SimPy's simulation time has no inherent unit, I also define that it is in seconds. Using arrow, I can then easily calculate an actual date and time from the current simulation time:
import arrow
import simpy
start = arrow.get('2015-01-01T00:00:00')
env = simpy.Environment()
# do some simulation ...
current_date = start.shift(seconds=env.now)  # older arrow versions used start.replace(seconds=env.now)
print('Current weekday:', current_date.weekday())
You could also use the datetime module and compute a day_of_week value, though you would still need to calculate the elapsed time:
import datetime

# yyyy = four-digit year integer
# mm = 1- or 2-digit month integer
# dd = 1- or 2-digit day integer
day_of_week = datetime.datetime(yyyy, mm, dd).strftime('%a')
if day_of_week == 'Mon':
    pass  # Do Monday tasks...
elif day_of_week == 'Tue':
    pass  # Tuesday...
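A pure-stdlib variant of the same idea maps minutes of simulation time since a fixed start date to a weekday name (the start date here is hypothetical):

```python
import datetime

START = datetime.datetime(2015, 1, 1)  # defined to be simulation time 0

def sim_weekday(now_minutes):
    """Weekday name for a simulation clock counted in minutes since START."""
    return (START + datetime.timedelta(minutes=now_minutes)).strftime("%a")
```

In a SimPy process, plants could then time out whenever sim_weekday(env.now) is 'Sat' or 'Sun'.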
