Data frames in R - Python

Pandas has proven very successful as a tool for working with time series data. For example, to compute a 5-minute mean you can use the resample function like this:
import pandas as pd
from dateutil.parser import parse  # the date parser referenced below

dframe = pd.read_table("test.csv", delimiter=",", index_col=0,
                       parse_dates=True, date_parser=parse)
## 5-minute mean
dframe.resample('5t', how='mean')
## daily mean
dframe.resample('D', how='mean')
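(In newer pandas releases the how= argument has been removed; the equivalent is to call the aggregation as a method:)
dframe.resample('5min').mean()
dframe.resample('D').mean()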
How can I do this in R?

In R you can use the xts package, which specialises in time series manipulation. For example, you can use the period.apply function like this:
library(xts)
zoo.data <- zoo(rnorm(231) + 10, as.Date(13514:13744, origin = "1970-01-01"))
ep <- endpoints(zoo.data,'days')
## daily mean
period.apply(zoo.data, INDEX=ep, FUN=function(x) mean(x))
There are some handy wrappers for this function:
apply.daily(x, FUN, ...)
apply.weekly(x, FUN, ...)
apply.monthly(x, FUN, ...)
apply.quarterly(x, FUN, ...)
apply.yearly(x, FUN, ...)

R has data frames (data.frame) and can also read CSV files, e.g.
dframe <- read.csv("test.csv")
(read.csv2 is for semicolon-separated files with comma decimals; your file is comma-separated, so read.csv is the right variant.) For dates, you may need to specify the column classes using the colClasses parameter. See ?read.csv. For example:
dframe <- read.csv("test.csv", colClasses=c("POSIXct",NA,NA))
You should then be able to round the date field using round or trunc, which will allow you to break up the data into the desired frequencies.
For example,
library(plyr)
dframe$trunc.times <- trunc(dframe$date.field, units='mins')
means <- daply(dframe, 'trunc.times', function(df) mean(df$value))
Here, value is the name of the field that you want to average.

Personally, I really like a combination of lubridate and zoo aggregate() for these operations:
ts.month.sum <- aggregate(ts.data, month, sum)
ts.daily.mean <- aggregate(ts.data, day, mean)
ts.mins.mean <- aggregate(ts.data, minute, mean)
You can also use the standard time functions yearmon() or yearqtr(), or custom functions for both split and apply. This method is as syntactically sweet as that of pandas.


Can time series analysis forecast the past?

I'm trying to estimate past data using time series analysis.
Usually time series analysis forecasts the future, but can it work in the opposite direction and "forecast" the past?
The reason I want to do this is that part of the past data is missing.
I'm trying to write the code in R or Python.
I tried forecast(arima, h=-92) in R, but this didn't work.
This is the code I tried in R:
library('ggfortify')
library('data.table')
library('ggplot2')
library('forecast')
library('tseries')
library('urca')
library('dplyr')
library('TSstudio')
library("xts")
df<- read.csv('https://drive.google.com/file/d/1Dt2ZLOCASYIbvviWQkwwgdo2BdmKfl9H/view?usp=sharing')
colnames(df)<-c("date", "production")
df$date<-as.Date(df$date, format="%Y-%m-%d")
CandyXTS<- xts(df[-1], df[[1]])
CandyTS<- ts(df$production, start=c(1972,1),end=c(2017,8), frequency=12 )
ggAcf(CandyTS)
forecast(CandyTS, h=-92)
It is possible. It is called backcasting. You can find some information in this chapter of Forecasting: Principles and Practice.
Basically, you need to forecast in reverse. I have added an example based on the code in that chapter and your data; adjust as needed. You create a reverse index and use it to backcast in time. You can use models other than ETS; the principle is the same.
# I downloaded data.
df1 <- readr::read_csv("datasets/candy_production.csv")
colnames(df1) <- c("date", "production")
library(fpp3)
back_cast <- df1 %>%
  as_tsibble() %>%
  mutate(reverse_time = rev(row_number())) %>%
  update_tsibble(index = reverse_time) %>%
  model(ets = ETS(production ~ season(period = 12))) %>%
  # backcast
  forecast(h = 12) %>%
  # add dates in reverse order to the forecast, with the same name as in the original dataset
  mutate(date = df1$date[1] %m-% months(1:12)) %>%
  as_fable(index = date, response = "production",
           distribution = "production")
back_cast %>%
  autoplot(df1) +
  labs(title = "Backcast of candy production",
       y = "production")

Python: How to use date as an independent variable when running a regression? [duplicate]

It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:
date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...
I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.
My current (non-working, but compiling code):
import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm
df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()
As I vary the city value, I get R^2 = 1 results, which is wrong. I have also attempted index_col=0, parse_dates=True when defining the dataframe df, but without success.
I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!
Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:
'\xef\xbb\xbf2014-04-28'
How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).
For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.
This does the trick nicely:
import numpy as np

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()
The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
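For instance, a small sketch of the day-of-year variant (reusing the df from the snippet above):
# day of year as the regressor, i.e. Jan 1st of each row's year is the zero point
df['day_of_year'] = df['date'].dt.dayofyear
result = sm.ols(formula='sales ~ day_of_year', data=df).fit()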
There's also evidence that statsmodels supports time series from pandas. You may be able to apply this to linear models as well:
http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html
Also, a quick note:
You should be able to read column names directly out of the csv automatically as in the sample code I posted. In your example I see there are spaces between the commas in the first line of the csv file, resulting in column names like ' date'. Remove the spaces and automatic csv header reading should just work.
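Alternatively, a small sketch that leaves the file untouched and strips the padding while reading (skipinitialspace is a standard read_csv option):
# skip the space that follows each comma instead of editing the csv
df = pd.read_csv('gameAct.csv', skipinitialspace=True)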
Get date as floating-point year
I prefer a date format that can be understood without context, hence the floating-point year representation.
The nice thing here is that the solution works at the numpy level and hence should be fast.
import numpy as np
import pandas as pd


def dt64_to_float(dt64):
    """Converts numpy.datetime64 to year as float.

    Rounded to days

    Parameters
    ----------
    dt64 : np.datetime64 or np.ndarray(dtype='datetime64[X]')
        date data

    Returns
    -------
    float or np.ndarray(dtype=float)
        Year in floating point representation
    """
    year = dt64.astype('M8[Y]')
    # print('year:', year)
    days = (dt64 - year).astype('timedelta64[D]')
    # print('days:', days)
    year_next = year + np.timedelta64(1, 'Y')
    # print('year_next:', year_next)
    days_of_year = (year_next.astype('M8[D]') - year.astype('M8[D]')
                    ).astype('timedelta64[D]')
    # print('days_of_year:', days_of_year)
    dt_float = 1970 + year.astype(float) + days / days_of_year
    # print('dt_float:', dt_float)
    return dt_float


if __name__ == "__main__":
    dates = np.array([
        '1970-01-01', '2014-01-01', '2020-12-31', '2019-12-31', '2010-04-28'],
        dtype='datetime64[D]')
    df = pd.DataFrame({
        'date': dates,
        'number': np.arange(5)
    })
    df['date_float'] = dt64_to_float(df['date'].to_numpy())
    print('df:', df, sep='\n')
    print()

    dt64 = np.datetime64("2011-11-11")
    print('dt64:', dt64_to_float(dt64))
Output:
df:
date number date_float
0 1970-01-01 0 1970.000000
1 2014-01-01 1 2014.000000
2 2020-12-31 2 2020.997268
3 2019-12-31 3 2019.997260
4 2010-04-28 4 2010.320548
dt64: 2011.8602739726027
I'm not sure about the specifics of statsmodels, but this post lists all the date/time conversions for Python. They aren't always one-to-one, so it's a reference I used often ;-)
(df.date - df.date.min()).dt.total_seconds()
dt.total_seconds() is defined for timedelta data, so subtract a reference timestamp first (which gives a timedelta64[ns] column); the call then returns the elapsed time in seconds as a float.
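Tying that back to the regression above, a short sketch (same assumed df and column names):
# seconds elapsed since the first observation, usable directly as a float regressor
df['date_seconds'] = (df['date'] - df['date'].min()).dt.total_seconds()
result = sm.ols(formula='sales ~ date_seconds', data=df).fit()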

Formatting time and plotting it

I have the following Excel file, with timestamps in the format
20180821_2330
1) for a lot of days. How would I convert them to a standard time format so that I can plot them against the other sensor values?
2) I would like to have one big plot with, for example, the sensor 1 readings against all the days. Is that possible?
https://www.mediafire.com/file/m36ha4777d6epvd/median_data.xlsx/file
Is this something you are looking for? I improvised and created an 'n' dict that could represent your 'timestamp' column as a data frame. Basically, what I think you should do is apply a function (let's call it 'apply_fun') to the column that stores the timestamps: a function that takes each element and parses it with strptime().
import datetime
import pandas as pd

n = {'timestamp': ['20180822_2330', '20180821_2334', '20180821_2334', '20180821_2330']}
data_series = pd.DataFrame(n)

def format_dates(n):
    x = n.find('_')
    y = datetime.datetime.strptime(n[:x] + n[x+1:], '%Y%m%d%H%M')
    return y

def apply_fun(dataset):
    dataset['timestamp2'] = dataset['timestamp'].apply(format_dates)
    return dataset

print(apply_fun(data_series))
When it comes to the 2nd point, I am not able to reach the site due to the McAfee agent at work, which does not allow me to open it. Once you have the 1st part working, you can ask about the 2nd separately.
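As a rough sketch of the plotting part (the 'timestamp' and 'sensor1' column names are assumptions about the linked file):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('median_data.xlsx')                    # assumed local copy of the file
df['timestamp2'] = pd.to_datetime(df['timestamp'], format='%Y%m%d_%H%M')
df.plot(x='timestamp2', y='sensor1', figsize=(14, 4))     # one sensor across all the days
plt.show()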

Tracking Error on a number of benchmarks

I'm trying to calculate tracking error for a number of different benchmarks versus a fund that I'm looking at (tracking error is defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all the benchmarks are in a data frame that I'm reading from an Excel file. What I have so far is below (the idea being that arg1 represents each benchmark and is applied using applymap), but it's returning a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
return tracking_err
data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value of all the columns to your calling function (index_analytics), so arg1 is an individual scalar value from your dataframe; data[arg1] is always going to raise a KeyError unless all your values are also column names.
You also shouldn't need applymap to do this. Assuming your benchmarks are in the same dataframe, you should be able to do something like this for each benchmark (next time, please include a sample of your dataframe):
df['Benchmark1_result'] = (df['Fund'] - df['Benchmark1']) / df['Benchmark1']
And if you want to calculate the standard deviations for all the benchmarks at once, you can do this:
# assume benchmark_columns is a list of all the benchmark column names
benchmark_columns = [list, of, benchmark, columns]
np.std((df['Fund'].values[:, None] - df[benchmark_columns].values)
       / df[benchmark_columns].values, axis=0)
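A slightly more explicit sketch of the same computation that keeps the benchmark names attached (benchmark_columns is still assumed to list the benchmark column names):
# percent difference of the fund against each benchmark, then the std per column
diffs = df[benchmark_columns].apply(lambda b: (df['Fund'] - b) / b)
tracking_errors = diffs.std()   # note: pandas std() uses ddof=1, np.std defaults to ddof=0
print(tracking_errors)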
Assuming you're following the definition of tracking error as the square root of the sum of squared active returns:
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val ** 2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")

Python convert a type 'datetime.timedelta' to 'pandas.tseries.offsets'

How can I convert a 'datetime.timedelta' object to a 'pandas.tseries.offsets' in Python?
For instance: datetime.timedelta(1) to to_offset('1D')?
The long story: I would like to plot a pandas DataFrame with OHLC bars, but when there are too many data points the bars are too thin and the chart becomes unreadable. In that case, I would like to compute bars with a longer period so as to get fewer than 100 bars on my chart.
Given a DataFrame df (the data can have anything from secondly to monthly periodicity):
Step 1: get the source period of the dataframe:
source_period = min([t2-t1 for t2, t1 in zip(df.index[1:], df.index[:-1])])
(I cannot use df.index.freq here.)
Step 2: estimate the target period:
target_period = source_period * float(len(df.index))/100
Step 3: group and combine the data:
grouped_df = df.groupby(pd.TimeGrouper(period_convert(target_period))).agg(ohlc_combine)
I am missing the period_convert function, and I do not really know where to start.
For the sake of having a starting point, my ridiculously ugly hack to do so:
target_datapoints = 100
source_period = min([t2-t1 for t2, t1 in zip(df.index[1:], df.index[:-1])])
target_period = source_period.total_seconds() * float(len(df.index)) / target_datapoints
target_offset = str(int(target_period)) + 'S'
target_df = df.groupby(pd.TimeGrouper(target_offset)).agg(ohlc_combine)
Issues: ugly, slow, edge cases not handled...
You can just construct it directly; e.g. if the source period was 60S and the target_period is now 301S, just use that.
If you want to convert these to 'bigger' time periods, you could do this (requires numpy 1.7), which returns a string:
In [18]: pd.tslib.repr_timedelta64(np.timedelta64(timedelta(seconds=301)).astype('m8[ns]'))
Out[18]: '00:05:01'
But that's almost as hard to deal with.
It might be nice to have a function to do this actually.
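For what it's worth, in later pandas versions the conversion is direct, since to_offset() also accepts timedelta objects; a small sketch (the 301 seconds is just the example value above):
import datetime
from pandas.tseries.frequencies import to_offset

offset = to_offset(datetime.timedelta(seconds=301))
print(offset)   # a DateOffset usable as a resample / Grouper frequency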
