HTS Prophet Holidays Issue - python

I am attempting to use the htsprophet package in Python, with the example code below (pulled from https://github.com/CollinRooney12/htsprophet/blob/master/htsprophet/runHTS.py). The issue I am getting is ValueError: "holidays must be a DataFrame with 'ds' and 'holiday' columns". I am wondering if there is a workaround, because I clearly have a DataFrame holidays with the two columns ds and holiday. I believe the error comes from the forecaster file of the fbprophet dependency. Is there anything I need to add, or has anyone added something to fix this?
import pandas as pd
from htsprophet.hts import hts, orderHier, makeWeekly
from htsprophet.htsPlot import plotNode, plotChild, plotNodeComponents
import numpy as np
#%% Random data (Change this to whatever data you want)
date = pd.date_range("2015-04-02", "2017-07-17")
date = np.repeat(date, 10)
medium = ["Air", "Land", "Sea"]
businessMarket = ["Birmingham","Auburn","Evanston"]
platform = ["Stone Tablet","Car Phone"]
mediumDat = np.random.choice(medium, len(date))
busDat = np.random.choice(businessMarket, len(date))
platDat = np.random.choice(platform, len(date))
sessions = np.random.randint(1000,10000,size=(len(date),1))
data = pd.DataFrame(date, columns = ["day"])
data["medium"] = mediumDat
data["platform"] = platDat
data["businessMarket"] = busDat
data["sessions"] = sessions
#%% Run HTS
##
# Make the daily data weekly (optional)
##
data1 = makeWeekly(data)
##
# Put the data in the format to run HTS, and get the nodes input (a list of list that describes the hierarchical structure)
##
data2, nodes = orderHier(data, 1, 2, 3)
##
# load in prophet inputs (Running HTS runs prophet, so all inputs should be gathered beforehand)
# Made up holiday data
##
holidates = pd.date_range("12/25/2013","12/31/2017", freq = 'A')
holidays = pd.DataFrame(["Christmas"]*5, columns = ["holiday"])
holidays["ds"] = holidates
holidays["lower_window"] = [-4]*5
holidays["upper_window"] = [0]*5
##
# Run hts. The "method" argument chooses the hierarchical aggregation approach; "CVselect" picks one automatically
# based on the minimum mean Mean Absolute Scaled Error, but it takes a while, so if you want results in minutes
# instead of half-hours pick a different method (method = "FP" is used below).
# h (52 here) is how many steps ahead you would like to forecast. If you're using daily data you don't have to specify freq.
##
myDict = hts(data2, 52, nodes, holidays = holidays, method = "FP", transform = "BoxCox")
##

The problem lies in the htsprophet package, in the fitForecast.py file. The instantiation of the Prophet object relies on positional arguments only, but a new argument has been added to the Prophet class, so the arguments no longer line up.
You can solve this by editing the htsprophet module and changing the positional arguments to keyword arguments; fixing lines 73-74 of fitForecast.py should be sufficient to get it running:
Prophet(growth=growth, changepoints=changepoints1, n_changepoints=n_changepoints1,
        yearly_seasonality=yearly_seasonality, weekly_seasonality=weekly_seasonality,
        holidays=holidays, seasonality_prior_scale=seasonality_prior_scale,
        holidays_prior_scale=holidays_prior_scale, changepoint_prior_scale=changepoint_prior_scale,
        mcmc_samples=mcmc_samples, interval_width=interval_width, uncertainty_samples=uncertainty_samples)
I'll submit a bug report for this to the creators.
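If you want to confirm the mismatch before editing anything, a quick diagnostic (just a sketch, assuming fbprophet imports as shown) is to print the Prophet constructor's parameters and compare their order with the positional call in fitForecast.py:
import inspect
from fbprophet import Prophet

# Any newly inserted parameter shifts the meaning of the purely positional
# arguments that htsprophet passes, which is what triggers the holidays ValueError.
print(inspect.signature(Prophet.__init__))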

Related

Big O notation: limited input

As an exercise, I am trying to set up a Monte Carlo simulation on a chosen ticker symbol.
from numpy.random import randint
from datetime import date
from datetime import timedelta
import pandas as pd
import yfinance as yf
from math import log
# ticker symbol
ticker_input = "AAPL" # change
# start and end dates for the Yahoo Finance API, 5 years of data
start_date = date.today()
end_date = start_date - timedelta(days=1826)
# retrieve data from Yahoo Finance (end_date is the earlier date, so it is passed as the download start)
data = yf.download(ticker_input, end_date, start_date)
yf_data = data.reset_index()
# dataframe : define columns
df = pd.DataFrame(columns=['date', "ln_change", 'open_price', 'random_num'])
open_price = []
date_historical = []
for column in yf_data:
    open_price = yf_data["Open"].values
    date_historical = yf_data["Date"].values
# list order: descending
open_price[:] = open_price[::-1]
date_historical[:] = date_historical[::-1]
# Populate data into dataframe
for i in range(0, len(open_price)-1):
    # date
    day = date_historical[i]
    # ln_change: log base 2 of the day-over-day price ratio
    lnc = log(open_price[i]/open_price[i+1], 2)
    # random number
    rnd = randint(1, 1258)
    # values in the same order as the columns defined above
    df.loc[i] = [day, lnc, open_price[i], rnd]
I was wondering how to calculate Big O if you have e.g. nested loops or exponential complexity but a limited input like the one in my example: the maximum input size is 1259 float values, and it is not going to change.
How do you determine code complexity in that scenario?
It is a matter of point of view; both ways of seeing it are technically correct. The question is: what information do you wish to convey to the reader?
Consider the following code:
def do_something_constant():
    pass  # stands in for any O(1) piece of work

def quadratic_algorithm(n):
    for i in range(n):
        for j in range(n):
            do_something_constant()

quadratic_algorithm(1000)
The function is clearly O(n²). And yet the program will always run in the same, constant time, because it contains just a single call with n = 1000. It is still perfectly valid to refer to the function as O(n²), and we can refer to the program as O(1).
But sometimes the boundaries are not that clear. Then it is up to you to choose whether you see it as an algorithm whose time complexity is some function of n, or as a piece of constant code that runs in O(1). What matters is making clear to the reader how you define things.
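As a rough illustration of the distinction (count_operations is just a hypothetical helper that counts how often the constant-time body runs): the function's cost grows quadratically with n, while a program that always calls it with n = 1000 does the same fixed amount of work on every run.
def count_operations(n):
    # counts how many times the O(1) body would execute
    ops = 0
    for i in range(n):
        for j in range(n):
            ops += 1
    return ops

print(count_operations(10))    # 100
print(count_operations(100))   # 10000 -> quadratic growth in n
print(count_operations(1000))  # 1000000, the fixed cost of the program above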

Using PM4Py, outliers are being ignored when calculating the average

I have a CSV file with the following attributes: (id, date, status).
First I store the values in a DataFrame and process it:
import pandas as pd
from pm4py.objects.log.util import dataframe_utils

# get the data in a data frame
log_csv = pd.read_csv('HILTGLOBAL.csv', sep=',')
# processing the data
log_csv = dataframe_utils.convert_timestamp_columns_in_df(log_csv)
log_csv['createdDate'] = pd.to_datetime(log_csv.createdDate)
log_csv['createdDate'] = log_csv['createdDate'].values.astype('datetime64[D]')
log_csv = log_csv.sort_values('createdDate')
After that I rename some columns as required by PM4PY, and get the event log
from pm4py.objects.conversion.log import converter as log_converter

# renaming
log_csv.rename(columns={'currentStatus': 'concept:name', 'createdDate': 'time:timestamp', 'candidateId': 'case:concept:name'}, inplace=True)
# getting the event log
log = log_converter.apply(log_csv)
Then I try to get the directly follows graph of the above dataframe.
I want the edges to represent the average time between each stage.
from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.objects.dfg.retrieval.log import Parameters
from pm4py.visualization.dfg import visualizer as dfg_visualization

dfg = dfg_discovery.apply(log, variant=dfg_discovery.Variants.PERFORMANCE,
                          parameters={Parameters.AGGREGATION_MEASURE: 'mean'})
gviz = dfg_visualization.apply(dfg, log=log, variant=dfg_visualization.Variants.PERFORMANCE,
                               parameters={Parameters.AGGREGATION_MEASURE: 'mean'})
dfg_visualization.view(gviz)
However, the outliers are being ignored in calculating the average time.
I do not know how to fix it such that all points are considered.

TypeError: _append_dispatcher()

I am trying to predict the stock price of Facebook on the 1664th row of the .csv file. I am encountering an error when it comes to appending to a np.array. Here's my code:
##predicts price of facebook stock for one day
from sklearn.svm import SVR
import numpy as np
import pandas as pd
##store and show data
df = pd.read_csv (r'fb.csv')
##get and print last row of data
actual_price = df.tail(1)
#print(actual_price)
##prepare and print svr models
##get all of the data except for last row
df = df.head(len(df)-1)
ind = (np.arange((len(df.index))))
df["index"] = ind
##create empty list to store dependent and independent data
# days1 =
days = np.array([])
adj_close_prices = np.array([])
##get the date and adjusted close prices
df_days = df.loc[:, 'index']
df_adj_close = df.loc[:, 'Adj Close']
##create the independent dataset ### this part to specify
for day in df_days:
    days = np.append(float(day))
And the error which keeps occurring is the following:
days = np.append(float(day))
File "<__array_function__ internals>", line 4, in append
TypeError: _append_dispatcher() missing 1 required positional argument: 'values'
I have a very basic level of Python knowledge and have been using YouTube and online resources to come up with what I have already.
I haven't checked your whole code, but the append problem isn't too hard to fix. Check the numpy.append() documentation and you will notice that it takes two required parameters (and an optional third), in the form numpy.append(array_to_append_to, values_to_append). So in your code it should be days = np.append(days, float(day)).
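For context, here is the corrected loop next to a loop-free alternative (a small sketch; df_days is the pandas Series from the question, and converting it directly is generally faster than growing an array with np.append):
import numpy as np

# corrected loop: np.append needs both the target array and the value to append
days = np.array([])
for day in df_days:
    days = np.append(days, float(day))

# equivalent without a loop
days = df_days.to_numpy(dtype=float)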

Trying to translate my Matlab routine for loading and preparing data to Python. Stuck in pandas

This is my first post, and I have been struggling with this problem for a few days now. The following Matlab code is what I usually use to load my data (.csv files) and prepare it for further calculations.
% I use this later on to predefine my arrays because they need to have the same length for calculations
maxind = 400;
% i is given a vector (like test_numbers = [1 2 3]) I get from the user so I can iterate over the numbers of the test specimens
for i = test_numbers
    % Setup the Import Options and import the data
    opts = delimitedTextImportOptions("NumVariables", 4);
    % Specify range and delimiter
    opts.DataLines = [2, Inf];
    opts.Delimiter = ";";
    % Specify column names and types
    opts.VariableNames = ["time", "force", "displ_1", "displ_2"];
    opts.VariableTypes = ["double", "double", "double", "double"];
    % Specify file level properties
    opts.ExtraColumnsRule = "ignore";
    opts.EmptyLineRule = "read";
    % Import the data
    % Here I build the name and read the files / the csv files are in the same folder as the main program.
    data_col = readtable(['specimen_name',num2str(i),'.csv'], opts);
    Data.force(:,i)=nan(maxind,1);
    Data.force(1:length(data_col.time),i)=data_col.force;
    Data.displ(:,i)=nan(maxind,1);
    Data.displ(1:length(data_col.time),i)=nanmean([data_col.displ_1,data_col.displ_2]')';
    Data.time(:,i)=nan(maxind,1);
    Data.time(1:length(data_col.time),i)=data_col.time;
    Data.name(i)={['specimen_name',num2str(i)]};
    % Clear temporary variables
    clear opts
end
Now I have to use Python instead of Matlab, and I started with pandas to read my csv files as DataFrames.
Now my question: is there a way to access my data like in this part of my Matlab code, or should I not use DataFrames in the first place for something like that? (I know I can access my data by column name, but I got stuck trying to address my data the way Data.force(1:length(data_col.time),i) does in a new DataFrame.)
Data.force(:,i)=nan(maxind,1);
Data.force(1:length(data_col.time),i)=data_col.force;
Data.displ(:,i)=nan(maxind,1);
Data.displ(1:length(data_col.time),i)=nanmean([data_col.displ_1,data_col.displ_2]')';
Data.time(:,i)=nan(maxind,1);
Data.time(1:length(data_col.time),i)=data_col.time;
Data.name(i)={['specimen_name',num2str(i)]};
Many thanks in advance for your help.
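For reference, a minimal pandas sketch of one way to mirror that structure (an assumption-laden example, not a drop-in translation: it supposes files named specimen_name<i>.csv with a semicolon delimiter and the same four columns, and the variable names are illustrative):
import pandas as pd

maxind = 400
test_numbers = [1, 2, 3]  # illustrative specimen numbers

force = pd.DataFrame(index=range(maxind))
displ = pd.DataFrame(index=range(maxind))
time = pd.DataFrame(index=range(maxind))
names = {}

for i in test_numbers:
    name = f"specimen_name{i}"
    data_col = pd.read_csv(f"{name}.csv", sep=";",
                           names=["time", "force", "displ_1", "displ_2"], header=0)
    n = len(data_col)
    # reindex pads with NaN up to maxind, like the nan(maxind,1) preallocation
    force[i] = data_col["force"].reindex(range(maxind))
    displ[i] = data_col[["displ_1", "displ_2"]].mean(axis=1).reindex(range(maxind))
    time[i] = data_col["time"].reindex(range(maxind))
    names[i] = name

# element access analogous to Data.force(1:length(data_col.time), i):
# force.loc[:n-1, i]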

Get results in an Earth Engine python script

I'm trying to get the NDVI mean for every polygon in a feature collection with the Earth Engine Python API.
I think I succeeded in getting the result (a feature collection inside a feature collection), but then I don't know how to get the data out of it.
The data I want are the feature IDs and the NDVI mean for each feature.
import datetime
import ee
ee.Initialize()
#Feature collection
fc = ee.FeatureCollection("ft:1s57dkY_Sg_E_COTe3sy1tIR_U-5Gw-BQNwHh4Xel")
fc_filtered = fc.filter(ee.Filter.equals('NUM_DECS', 1))
#Image collection
Sentinel_collection1 = (ee.ImageCollection('COPERNICUS/S2')).filterBounds(fc_filtered)
Sentinel_collection2 = Sentinel_collection1.filterDate(datetime.datetime(2017, 1, 1),datetime.datetime(2017, 8, 1))
# NDVI function to use with ee map
def NDVIcalc(image):
    red = image.select('B4')
    nir = image.select('B8')
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    # NDVI mean calculation with reduceRegions
    MeansFeatures = ndvi.reduceRegions(reducer=ee.Reducer.mean(), collection=fc_filtered, scale=10)
    return MeansFeatures
# Result from which I don't know how to get the information: feature IDs and NDVI mean
result = Sentinel_collection2.map(NDVIcalc)
If the result is small, you can pull it into Python using result.getInfo(). That will give you a Python dictionary containing a list of features (which are more dictionaries). However, if the results are large or the polygons cover large regions, you'll have to Export the collection instead.
That said, there are probably some other things you'll want to do first:
1) You might want to flatten() the collection, so it's not nested collections. It'll be easier to handle that way.
2) You might want to add a date to each result so you know what time the result came from. You can do that with a map on the result, inside your NDVIcalc function
return MeansFeatures.map(lambda f: f.set('date', image.date().format()))
3) If what you really want is a time-series of NDVI over time for each polygon (most common), then restructuring your code to map over polygons first will be easier:
Sentinel_collection = (ee.ImageCollection('COPERNICUS/S2')
                       .filterBounds(fc_filtered)
                       .filterDate(ee.Date('2017-01-01'), ee.Date('2017-08-01')))

def GetSeries(feature):
    def NDVIcalc(img):
        red = img.select('B4')
        nir = img.select('B8')
        ndvi = nir.subtract(red).divide(nir.add(red)).rename(['NDVI'])
        return (feature
                .set(ndvi.reduceRegion(ee.Reducer.mean(), feature.geometry(), 10))
                .set('date', img.date().format("YYYYMMdd")))

    series = Sentinel_collection.map(NDVIcalc)
    # Get the time-series of values as two lists.
    values = series.reduceColumns(ee.Reducer.toList(2), ['date', 'NDVI']).get('list')
    return feature.set(ee.Dictionary(ee.List(values).flatten()))

result = fc_filtered.map(GetSeries)
print(result.getInfo())
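If you do pull the result down with getInfo(), one way to tabulate it (a small sketch; getInfo() on a FeatureCollection returns a dict with a 'features' list, and the resulting column names will depend on your data) is:
import pandas as pd

info = result.getInfo()  # dict with a 'features' list of per-polygon dictionaries
rows = [f['properties'] for f in info['features']]
df = pd.DataFrame(rows)
print(df.head())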
4) And finally, if you're going to try to Export the result, you're likely to run into an issue where the columns of the exported table are selected from whatever columns the first feature has, so it's good to provide a "header" feature that has all columns (times), that you can merge() with the result as the first feature:
# Get all possible dates.
dates = ee.List(Sentinel_collection.map(
    lambda img: ee.Feature(None, {'date': img.date().format("YYYYMMdd")})
).aggregate_array('date'))
# Make a default value for every date.
header = ee.Feature(None, ee.Dictionary.fromLists(dates, ee.List.repeat(-1, dates.size())))
output = ee.FeatureCollection([header]).merge(result)
ee.batch.Export.table.toDrive(...)
