Can time series analysis forecast the past? - python

I'm trying to estimate past data using time series analysis.
Time series analysis usually forecasts the future, but can it "forecast" in the opposite direction, into the past?
The reason I want to do this is that part of the past data is missing.
I'm trying to write the code in R or Python.
I tried forecast(arima, h=-92) in R, but this didn't work.
This is the code I tried in R:
library('ggfortify')
library('data.table')
library('ggplot2')
library('forecast')
library('tseries')
library('urca')
library('dplyr')
library('TSstudio')
library("xts")
df <- read.csv('https://drive.google.com/file/d/1Dt2ZLOCASYIbvviWQkwwgdo2BdmKfl9H/view?usp=sharing')
colnames(df) <- c("date", "production")
df$date <- as.Date(df$date, format = "%Y-%m-%d")
CandyXTS <- xts(df[-1], df[[1]])
CandyTS <- ts(df$production, start = c(1972, 1), end = c(2017, 8), frequency = 12)
ggAcf(CandyTS)
forecast(CandyTS, h = -92)

It is possible. It is called backcasting. You can find some information in this chapter of Forecasting: Principles and Practice.
Basically, you need to forecast in reverse: you create a reversed index and use it to forecast back in time. I have added an example below based on the code in the chapter and your data; adjust as needed. You can also use models other than ETS, with the same principle.
# I downloaded data.
df1 <- readr::read_csv("datasets/candy_production.csv")
colnames(df1) <- c("date", "production")

library(fpp3)

back_cast <- df1 %>%
  as_tsibble() %>%
  mutate(reverse_time = rev(row_number())) %>%
  update_tsibble(index = reverse_time) %>%
  model(ets = ETS(production ~ season(period = 12))) %>%
  # backcast
  forecast(h = 12) %>%
  # add dates in reverse order to the forecast, with the same name as in the original dataset
  mutate(date = df1$date[1] %m-% months(1:12)) %>%
  as_fable(index = date, response = "production",
           distribution = "production")

back_cast %>%
  autoplot(df1) +
  labs(title = "Backcast of candy production",
       y = "production")

Related

Appending values corresponding to matching date in different dataframes (in R or Python)

I have the following data:
#1. Dates at a 15-day frequency:
dates = seq(as.Date("2017-01-01"), as.Date("2020-12-30"), by=15)
#2. A dataframe containing the dates on which a certain observation is recorded per variable, as:
#3. Values corresponding to the dates in #2, as:
What I am trying to do is assign the values to their respective dates, keep NaN for the dates that have no observation, and save the result as a text file. The output should look something like below. I'd appreciate your help; it can be in R or in Python.
This code works on the example data you provided.
Because of the loop it will not be the fastest approach out there, but it does the job.
The date DataFrame contains the dates (your data shown in #2), and data is the DataFrame containing the values shown in #3.
# IMPORT PACKAGES AND LOAD DATA
import pandas as pd
import numpy as np

data = pd.read_csv("./data.csv")
date = pd.read_csv("./date.csv")

# GET THE UNIQUE DATES
date_unique = pd.Series(
    np.concatenate([date[col_name].dropna().unique() for col_name in date.columns]).flat
).unique()

# GET THE FINAL DATA FRAME READY
data_reform = pd.DataFrame(data=[], columns=["date"])
for col in data.columns:
    data_reform.insert(len(data_reform.columns), col, [])
data_reform["date"] = date_unique
data_reform.sort_values(by=["date"], inplace=True)

# ITERATE THROUGH THE DATA AND ALLOCATE VALUES TO THE FINAL DATA FRAME
# (.loc avoids the chained-assignment pitfall of data_reform[col][mask] = value)
for row_id, row_val in data.iterrows():
    for col_id, col_val in row_val.dropna().items():
        data_reform.loc[data_reform["date"] == date[col_id].iloc[row_id], col_id] = col_val
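
If the loop is too slow, here is a sketch of a vectorized alternative. It assumes, as above, that date and data have the same shape and column order, so that date[col].iloc[i] holds the date of data[col].iloc[i]:

# Vectorized alternative: melt both frames into long form, pair each value
# with its date, drop empty cells, then pivot back to a wide date table.
import pandas as pd

long_dates = date.melt(var_name="variable", value_name="date")
long_values = data.melt(var_name="variable", value_name="value")

long = pd.concat([long_dates, long_values["value"]], axis=1)
long = long.dropna(subset=["date"])

data_reform = (long.pivot_table(index="date", columns="variable",
                                values="value", aggfunc="first")
                   .reset_index()
                   .sort_values("date"))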
You can use stack, then merge, and finally spread to get back to your column-style matrix. However, this can create a sparse matrix, which is not very memory-efficient for large datasets.
library(pivottabler)
library(dplyr)
library(tidyr)

dates_df <- data.frame(read.csv("date.txt"))
dates_df$dates_formatted <- as.Date(dates_df$dates, format = "%d/%m/%Y")
dates_df <- data.frame(dates_df[, 2])
names(dates_df) <- c("dates")

valid_observations <- read.csv("cp.txt", sep = " ", na.strings = c("NaN"))
observations <- read.csv("cpAC.txt", sep = " ")

# Idea is to produce an EAV table:
# date var_name value
EAV_Table <- cbind(stack(valid_observations), stack(observations))
complete_EAV <- EAV_Table[complete.cases(EAV_Table), 1:3]
complete_EAV$date <- as.Date(complete_EAV$values, format = "%Y-%m-%d")
complete_EAV <- complete_EAV[, 2:4]
names(complete_EAV) <- c("Variable", "Value", "dates")
complete_EAV$Variable <- as.factor(complete_EAV$Variable)

dates_measures <- merge(dates_df, complete_EAV, all.x = TRUE)
result <- spread(dates_measures, Variable, Value)
write.csv(result, "data_measurements.csv", row.names = FALSE)

How to replicate Python code in R to find duplicates?

I'm trying to reproduce this Python code in R:
# Sort by user overall rating first
reviews = reviews.sort_values('review_overall', ascending=False)
# Keep the highest rating from each user and drop the rest
reviews = reviews.drop_duplicates(subset= ['review_profilename','beer_name'], keep='first')
and I've written this piece of code in R:
library(dplyr)
reviews_df <- df[order(-df$review_overall), ]
df_clean <- distinct(reviews_df, review_profilename, beer_name, .keep_all = TRUE)
The problem is that I get 1496263 records with Python but 1496596 records with R.
link to dataset: dataset
Can someone help me see my mistake?
Without having some data, it's difficult to help, but you might be looking for:
library(tidyverse)

df_clean <- reviews_df %>%
  arrange(desc(review_overall)) %>%
  distinct(across(c(review_profilename, beer_name)), .keep_all = TRUE)
This sorts descending by review_overall, then for every profilename + beer name combination keeps the first row (i.e. the one with the highest overall rating).
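
To narrow down where the extra rows come from, it can help to compare the key columns across the two environments. A small, hypothetical check on the pandas side (the file name is a placeholder):

# Hypothetical diagnostic: count missing keys and distinct key combinations.
# If the two environments disagree on either number, the inputs to the
# deduplication already differ before any sorting happens.
import pandas as pd

reviews = pd.read_csv("beer_reviews.csv")
print(reviews[["review_profilename", "beer_name"]].isna().sum())
print(reviews.groupby(["review_profilename", "beer_name"], dropna=False).ngroups)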

HTS Prophet Holidays Issue

I am attempting to use the htsprophet package in Python. I am using the example code below, pulled from https://github.com/CollinRooney12/htsprophet/blob/master/htsprophet/runHTS.py. The issue I am getting is a ValueError: "holidays must be a DataFrame with 'ds' and 'holiday' columns", even though I clearly have a holidays data frame with the columns ds and holiday. I believe the error comes from fbprophet's forecaster file, one of the dependencies. I am wondering whether there is a workaround, or whether anyone has added something to fix this.
import pandas as pd
from htsprophet.hts import hts, orderHier, makeWeekly
from htsprophet.htsPlot import plotNode, plotChild, plotNodeComponents
import numpy as np
#%% Random data (Change this to whatever data you want)
date = pd.date_range("2015-04-02", "2017-07-17")
date = np.repeat(date, 10)
medium = ["Air", "Land", "Sea"]
businessMarket = ["Birmingham","Auburn","Evanston"]
platform = ["Stone Tablet","Car Phone"]
mediumDat = np.random.choice(medium, len(date))
busDat = np.random.choice(businessMarket, len(date))
platDat = np.random.choice(platform, len(date))
sessions = np.random.randint(1000,10000,size=(len(date),1))
data = pd.DataFrame(date, columns = ["day"])
data["medium"] = mediumDat
data["platform"] = platDat
data["businessMarket"] = busDat
data["sessions"] = sessions
#%% Run HTS
##
# Make the daily data weekly (optional)
##
data1 = makeWeekly(data)
##
# Put the data in the format to run HTS, and get the nodes input (a list of lists that describes the hierarchical structure)
##
data2, nodes = orderHier(data, 1, 2, 3)
##
# load in prophet inputs (Running HTS runs prophet, so all inputs should be gathered beforehand)
# Made up holiday data
##
holidates = pd.date_range("12/25/2013","12/31/2017", freq = 'A')
holidays = pd.DataFrame(["Christmas"]*5, columns = ["holiday"])
holidays["ds"] = holidates
holidays["lower_window"] = [-4]*5
holidays["upper_window"] = [0]*5
##
# Run hts with the CVselect function (this decides which hierarchical aggregation method to use based on minimum mean Mean Absolute Scaled Error)
# h (which is 52 here) - how many steps ahead you would like to forecast. If you're using daily data you don't have to specify freq.
#
# NOTE: CVselect takes a while, so if you want results in minutes instead of half-hours pick a different method
##
myDict = hts(data2, 52, nodes, holidays = holidays, method = "FP", transform = "BoxCox")
##
The problem lies in the htsProphet package, in the 'fitForecast.py' file. The instantiation of the fbProphet object relies on positional arguments only, but a new argument has been added to the fbProphet class, so the arguments no longer line up.
You can solve this by patching the fbProphet module and changing the positional arguments to keyword arguments; fixing lines 73-74 should be sufficient to get it running:
Prophet(growth=growth, changepoints=changepoints1, n_changepoints=n_changepoints1,
        yearly_seasonality=yearly_seasonality, weekly_seasonality=weekly_seasonality,
        holidays=holidays, seasonality_prior_scale=seasonality_prior_scale,
        holidays_prior_scale=holidays_prior_scale,
        changepoint_prior_scale=changepoint_prior_scale, mcmc_samples=mcmc_samples,
        interval_width=interval_width, uncertainty_samples=uncertainty_samples)
I'll submit a bug report for this to the creators.

Convert JSON to R dataframe

I have a very messy JSON file (lists inside of lists) that I'm trying to convert to an R dataframe (part of the reason for converting the file is that I need to export it into a .csv file). Here is a sample of the data (https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=0). I tried this solution (Parse nested JSON to Data Frame in R), but it got rid of many of my columns. Below is the code I have so far:
library("twitteR")
library ("streamR")
library("rjson")
json_file <- "20140909-20141010_10zdfxhqf0_2014_09_09_01_00_activities.json"
json_data <- fromJSON(file=json_file) #convert to r list
str (json_data) #list of 16 objects
#unlist elements
tweets.i <- lapply(json_data, function(x){ unlist(x)})
tweets <- do.call("rbind", tweets.i)
tweets <- as.data.frame(tweets)
library(plyr)
tweets <- rbind.fill(lapply(tweets.i,
function(x) do.call("data.frame", as.list(x))
))
Does anyone have a way to convert the file to an R dataframe without losing all the info? I'm open to using Python for this too; I just don't have the expertise to figure out how to code it.
This is not very efficient, but it may work for you:
download.file("https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=1",
              destfile = tf <- tempfile(fileext = ".json"))
txt <- readLines(tf)

library(jsonlite)
library(plyr)
df <- do.call(plyr::rbind.fill,
              lapply(txt[txt != ""],
                     function(x) as.data.frame(t(unlist(fromJSON(x))))))
I like the answer provided above as a really quick way to get everything. You could try tidyjson, but it will also not be efficient, since it requires prior knowledge of the structure. listviewer::jsonedit might help you visualize what you are working with.
# devtools::install_github("timelyportfolio/listviewer")
library(listviewer)
jsonedit(readLines(
  "https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=1"
)[2])
Perhaps a data.frame really isn't the best structure, but it really depends on what you are trying to accomplish.
This is just a sample to hopefully show you how it might look.
library(tidyjson)
library(dplyr)

json <- readLines(
  "https://www.dropbox.com/s/ikb4znhpaavyc9z/20140909-20141010_10zdfxhqf0_2014_10_09_23_50_activities.json?dl=1"
)

json %>%
  {
    Filter(function(x) nchar(x) != 0, .)
  } %>%
  as.tbl_json() %>%
  spread_values(
    id = jstring("id")
    , objectType = jstring("objectType")
    , link = jstring("link")
    , body = jstring("body")
    , favoritesCount = jstring("favoritesCount")
    , twitter_filter_level = jstring("twitter_filter_level")
    , twitter_lang = jstring("twitter_lang")
    , retweetCount = jnumber("retweetCount")
    , verb = jstring("verb")
    , postedTime = jstring("postedTime")
    # from the actor object in the JSON
    , actor_objectType = jstring("actor", "objectType")
    , actor_id = jstring("actor", "id")
    , actor_link = jstring("actor", "link")
    , actor_displayName = jstring("actor", "displayName")
    , actor_image = jstring("actor", "image")
    , actor_summary = jstring("actor", "summary")
    , actor_friendsCount = jnumber("actor", "friendsCount")
    , actor_followersCount = jnumber("actor", "followersCount")
  ) %>%
  # careful: once you enter an object you can't go back up
  enter_object("actor", "links") %>%
  gather_array() %>%
  spread_values(
    actor_href = jstring("href")
  )
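
Since the question mentions being open to Python, here is a minimal sketch of the same flattening there. It assumes the file is the newline-delimited activity stream from the question; pd.json_normalize turns nested objects into dot-separated columns:

# Read the stream line by line, skip blank lines, flatten each JSON object
# into columns, and export to CSV; keys missing from a record become NaN.
import json
import pandas as pd

with open("20140909-20141010_10zdfxhqf0_2014_09_09_01_00_activities.json") as f:
    records = [json.loads(line) for line in f if line.strip()]

df = pd.json_normalize(records)  # nested fields -> "actor.displayName" etc.
df.to_csv("activities.csv", index=False)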

Data frames in R

Pandas has proven very successful as a tool for working with time series data. For example, to compute a 5-minute mean you can use the resample function like this:
import pandas as pd

dframe = pd.read_table("test.csv",
                       delimiter=",", index_col=0, parse_dates=True, date_parser=parse)
## 5-minute mean
dframe.resample('5t', how='mean')
## daily mean
dframe.resample('D', how='mean')
How can I perform this in R?
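
As an aside on the pandas side: the how= argument used above has since been deprecated and removed. In current pandas the same resampling is written with method chaining (assuming the dframe from the question):

# Modern pandas spelling of the resampling calls from the question.
dframe.resample("5min").mean()   # 5-minute mean
dframe.resample("D").mean()      # daily mean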
In R you can use the xts package, which is specialised in time series manipulation. For example, you can use the period.apply function like this:
library(xts)
zoo.data <- zoo(rnorm(231) + 10, as.Date(13514:13744, origin = "1970-01-01"))
ep <- endpoints(zoo.data, 'days')
## daily mean
period.apply(zoo.data, INDEX = ep, FUN = function(x) mean(x))
There are some handy wrappers for this function:
apply.daily(x, FUN, ...)
apply.weekly(x, FUN, ...)
apply.monthly(x, FUN, ...)
apply.quarterly(x, FUN, ...)
apply.yearly(x, FUN, ...)
R has data frames (data.frame) and it can also read CSV files, e.g.
dframe <- read.csv2("test.csv")
For dates, you may need to specify the columns using the colClasses parameter. See ?read.csv2. For example:
dframe <- read.csv2("test.csv", colClasses=c("POSIXct",NA,NA))
You should then be able to round the date field using round or trunc, which will allow you to break up the data into the desired frequencies.
For example,
library(plyr)
dframe$trunc.times <- trunc(dframe$date.field, units = 'mins')
means <- daply(dframe, 'trunc.times', function(df) mean(df$value))
Where value is the name of the field that you want to average.
Personally, I really like a combination of lubridate and zoo aggregate() for these operations:
ts.month.sum <- aggregate(ts.data, month, sum)
ts.daily.mean <- aggregate(ts.data, day, mean)
ts.mins.mean <- aggregate(ts.data, minute, mean)
You can also use the standard time functions yearmon() or yearqtr(), or custom functions for both split and apply. This method is as syntactically sweet as that of pandas.
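
For comparison, the split-by-a-function-of-the-index pattern maps back to pandas as a groupby on components of the DatetimeIndex (again assuming the dframe from the question):

# Pandas analogue of aggregating by a function of the time index:
# group by components of the DatetimeIndex instead of resampling.
monthly_sum = dframe.groupby(dframe.index.month).sum()    # across years, like month above
daily_mean = dframe.groupby(dframe.index.date).mean()
minute_mean = dframe.groupby(dframe.index.minute).mean()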
