Ensure consistent time indexing in statsmodels predict - python

I am trying to fit an AR(1) model to a Pandas time series and project forward. The data is annual with each year starting at 1 April. When I use statsmodels.tsa.ar_model.AR.predict to forecast from the estimated model the output is a Pandas time series with the annual forecasts centred on 31 December.
Code:
mod1 = sm.tsa.AR(ser['1972-01-04':'2007-01-04'], freq='A')
res1 = mod1.fit(order=1)
fcast1 = res1.predict('2007-01-04', '2018-01-04')
print(fcast1)
Output:
2007-12-31 988.121031
2008-12-31 1035.640294
2009-12-31 1081.584720
...
Can I get the predict method to create a time series indexed on 1 April, or do I have to re-index the forecast series after creating it? I'd like to be able to compare it to other series in the dataframe so the indexing is quite important.
Thanks for your help!

No, not at the moment, but you should be able to in the next release. The fix should be fairly trivial. The pandas time-series stuff is relatively new compared to when I wrote the TSA infrastructure, and I just haven't had a chance to catch up. Too much to do.
https://github.com/statsmodels/statsmodels/issues/319
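Until that is fixed, re-indexing the forecast after the fact is one workaround. The sketch below is only an illustration: `fcast` is a hypothetical stand-in for `fcast1` built from the values printed in the question, and it assumes each forecast labelled 31 December belongs to the year that should be labelled 1 April.

```python
import pandas as pd

# Hypothetical stand-in for fcast1, using the first values printed in the question.
fcast = pd.Series(
    [988.121031, 1035.640294, 1081.584720],
    index=pd.to_datetime(["2007-12-31", "2008-12-31", "2009-12-31"]),
)

# Keep each forecast's year but move the label to 1 April (assumed mapping).
fcast.index = pd.DatetimeIndex(
    [pd.Timestamp(year=ts.year, month=4, day=1) for ts in fcast.index]
)
print(fcast)
```

After this the forecast should align with other series in the dataframe that are indexed on 1 April.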

Related

Calculating and plotting a 20 year Climatology

I am working on plotting a 20 year climatology and have had issues with averaging.
My data is hourly data since December 1999 in CSV format. I used an API to get the data and currently have it in a pandas data frame. I was able to split up hours, days, etc like this:
dfROVC1['Month'] = dfROVC1['time'].apply(lambda cell: int(cell[5:7]))
dfROVC1['Day'] = dfROVC1['time'].apply(lambda cell: int(cell[8:10]))
dfROVC1['Year'] = dfROVC1['time'].apply(lambda cell: int(cell[0:4]))
dfROVC1['Hour'] = dfROVC1['time'].apply(lambda cell: int(cell[11:13]))
So I averaged all the days using:
z=dfROVC1.groupby([dfROVC1.index.day,dfROVC1.index.month]).mean()
That worked, but I realized I should take the average of the mins and average of the maxes of all my data. I have been having a hard time figuring all of this out.
I want my plot to look like this:
[Figure: Monthly Average Section]
but I can't figure out how to make it work.
I am currently using Jupyter Notebook with Python 3.
Any help would be appreciated.
Is there a reason you didn't just use datetime to convert your time column?
The minimums by month would be:
z=dfROVC1.groupby(['Year','Month']).min()
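For the "average of the mins and average of the maxes" part, one possible approach is to parse the time column with pd.to_datetime, take the daily min and max, and then average those by calendar month. This is a rough sketch on synthetic data: the 'temp' column and the sine-shaped values are assumptions made for illustration, not the asker's data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for the real CSV; 'temp' is a hypothetical column.
times = pd.date_range("2000-01-01", periods=2 * 366 * 24, freq=pd.Timedelta(hours=1))
temp = 10 + 10 * np.sin(2 * np.pi * np.asarray(times.hour) / 24)
df = pd.DataFrame({"time": times.strftime("%Y-%m-%d %H:%M"), "temp": temp})

# Parse the strings once instead of slicing them by character position.
df["time"] = pd.to_datetime(df["time"])
df = df.set_index("time")

# Daily extremes first, then the mean of those extremes for each calendar month.
daily = df["temp"].resample("D").agg(["min", "max"])
climatology = daily.groupby(daily.index.month).mean()
print(climatology)
```

The result has one row per month with the average daily minimum and average daily maximum, which can then be plotted as two lines.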

Time-series analysis with Python

So I have sensor-based time series data for a subject measured in second intervals, with the corresponding heart rate at each time point in an Excel format. My goal is to analyze whether there are any trends over time. When I import it into Python, I can see a certain number, but not the time. However, when imported in Excel, I can convert it into time format easily.
This is what it looks like in Python.. (column 1 = timestamp, column 2 = heart rate in bpm)
This is what it should look like though:
This is what I tried to convert it into datetime format in Python:
import datetime
Time = datetime.datetime.now()
"%s:%s.%s" % (Time.minute, Time.second, str(Time.microsecond)[:2])
if isinstance(Time,datetime.datetime):
    print ("Yay!")
df3.set_index('Time', inplace=True)
Time gets recognized as a float64 if I do this, not datetime64 [ns].
Consequently, when I try to plot this timeseries, I get the following:
I even ran the Dickey-Fuller test to analyze trends in Python with this dataset. Does my misconfiguration of the time column in Python actually affect my ADF test? I'm assuming that since only trends in the 'HeartRate' column are analyzed with this code, it shouldn't matter, right?
Here's the code I used:
#Perform Dickey-Fuller test:
print("Results of Dickey-Fuller Test:")
dftest=adfuller(df3 ['HeartRate'], autolag='AIC')
dfoutput=pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
test_stationarity(df3)
Did I do this correctly? I don't have experience in the engineering field and I'm doing this to improve healthcare for older people, so any help will be very much appreciated!
Thanks in advance! :)
It seems that the date format in Excel is expressed as the number of days that have passed since 12/30/1899. To convert the number in the timestamp column to seconds, you only need to multiply it by 24*60*60 = 86400 (the number of seconds in one day).
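As a small illustration of that conversion, the sketch below turns an Excel serial number into a Python datetime using only the standard library. The helper name and the sample value 43831.5 are made up for the example, and it assumes the float column really holds Excel day serials.

```python
from datetime import datetime, timedelta

# Excel's day zero, as described above.
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    """Convert an Excel serial day number (float) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

serial = 43831.5                                  # hypothetical cell value
stamp = excel_serial_to_datetime(serial)
seconds_into_day = (serial % 1) * 86400           # fractional day -> seconds
print(stamp, seconds_into_day)
```

If the data is already in a DataFrame, pd.to_datetime with unit='D' and origin='1899-12-30' should give the same result for a whole column.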

how to vectorise Pandas calculation that is based on last x rows of data

I have fairly sophisticated prediction code with over 20 columns and millions of rows per column, using WLS. Currently I use iterrows to loop through dates, then, based on those dates and the values on those dates, extract differently sized slices of data for the calculation. It takes hours to run in production. I have simplified the code into the following:
import pandas as pd
import numpy as np
from datetime import timedelta
df=pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')
def calculateC(A, dte):
    if A>0: #based on values has different cutoff length for trend prediction
        depth=10
    else:
        depth=20
    lastyear=(dte-timedelta(days=365))
    df2=df[df.dte<lastyear].head(depth) #use last year same date data for basis of prediction
    return df2.B.mean() #uses WLS in my model but for simplification replace with mean
for index, row in df.iterrows():
    if index>365:
        df.loc[index,'C']=calculateC(row.A, row.dte)
I read that iterrows is the main cause because it is not an efficient way to use Pandas, and that I should use vectorized methods. However, I can't seem to find a way to vectorize based on conditions (dates, different lengths, and ranges of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
df['result'] = np.where( df['A'] > 0,
df.shift(365).rolling(10).B.mean(),
df.shift(365).rolling(20).B.mean() )
The tough (slow) part of your code is this:
df2=df[df.dte<lastyear].head(depth)
However, as long as your dates are all 365 days apart, you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
And actually, even if your dates aren't completely regular, you can probably resample and get this way to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
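The frequency-based shift could look roughly like the sketch below. It assumes 'dte' has become the index and that "last year's data" means the 10 or 20 rows ending 365 days before each date. That is an interpretation, so it will not reproduce the head(depth) results of the original loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2014-09-01", periods=1000, freq="D")
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=list("AB"), index=dates)

# Shift by time rather than by rows: every label moves forward 365 days.
shifted = df["B"].shift(freq=pd.Timedelta(days=365))

# Rolling means over the shifted series, aligned back onto the original dates.
short_mean = shifted.rolling(10).mean().reindex(df.index)
long_mean = shifted.rolling(20).mean().reindex(df.index)

# Pick the window length per row based on the sign of A.
df["C"] = np.where(df["A"] > 0, short_mean, long_mean)
```

Rows with less than a year of history simply come out as NaN, which plays the same role as the index>365 guard in the loop.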
I would try pandas.DataFrame.apply(func, axis=1)
def calculateC2(row):
    if row.name >365: # row.name is the index of the row
        if row.A >0: #based on values has different cutoff length for trend prediction
            depth=10
        else:
            depth=20
        lastyear=(row.dte-timedelta(days=365))
        df2=df[df.dte<lastyear].B.head(depth) #use last year same date data for basis of prediction
        return np.mean(df2) #uses WLS in my model but for simplification replace with mean
df['C'] = df.apply(calculateC2,axis=1) # rows with index <= 365 get NaN

Date ranges in Pandas

After fighting with NumPy and dateutil for days, I recently discovered the amazing Pandas library. I've been poring through the documentation and source code, but I can't figure out how to get date_range() to generate indices at the right breakpoints.
from datetime import date
import pandas as pd
start = date(2012, 1, 15)
end = date(2012, 9, 20)
# 'M' is month-end, instead I need same-day-of-month
pd.date_range(start, end, freq='M')
What I want:
2012-01-15
2012-02-15
2012-03-15
...
2012-09-15
What I get:
2012-01-31
2012-02-29
2012-03-31
...
2012-08-31
I need month-sized chunks that account for the variable number of days in a month. This is possible with dateutil.rrule:
rrule(freq=MONTHLY, dtstart=start, bymonthday=(start.day, -1), bysetpos=1)
Ugly and illegible, but it works. How can I do this with pandas? I've played with both date_range() and period_range(), so far with no luck.
My actual goal is to use groupby, crosstab and/or resample to calculate values for each period based on sums/means/etc of individual entries within the period. In other words, I want to transform data from:
total
2012-01-10 00:01 50
2012-01-15 01:01 55
2012-03-11 00:01 60
2012-04-28 00:01 80
#Hypothetical usage
dataframe.resample('total', how='sum', freq='M', start='2012-01-09', end='2012-04-15')
to
total
2012-01-09 105 # Values summed
2012-02-09 0 # Missing from dataframe
2012-03-09 60
2012-04-09 0 # Data past end date, not counted
Given that Pandas originated as a financial analysis tool, I'm virtually certain that there's a simple and fast way to do this. Help appreciated!
freq='M' is for month-end frequencies. But you can use .shift to shift it by any number of days (or any frequency for that matter):
pd.date_range(start, end, freq='M').shift(15, freq='D')
There actually is no "day of month" frequency (e.g. "DOMXX" like "DOM09"), but I don't see any reason not to add one.
http://github.com/pydata/pandas/issues/2289
I don't have a simple workaround for you at the moment because resample requires passing a known frequency rule. I think it should be augmented to be able to take any date range to be used as arbitrary bin edges, also. Just a matter of time and hacking...
Try:
pd.date_range(start, end, freq=pd.DateOffset(months=1))
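Building on the DateOffset suggestion, the sketch below shows one way the question's "hypothetical usage" might be approximated: the same-day-of-month dates are used as bin edges and each row is assigned to its bin with searchsorted. It is an illustration using the sample data from the question, not a built-in resample feature.

```python
import pandas as pd

df = pd.DataFrame(
    {"total": [50, 55, 60, 80]},
    index=pd.to_datetime(
        ["2012-01-10 00:01", "2012-01-15 01:01", "2012-03-11 00:01", "2012-04-28 00:01"]
    ),
)
start, end = pd.Timestamp("2012-01-09"), pd.Timestamp("2012-04-15")

# Period starts on the same day of each month; the end date closes the last bin.
starts = pd.date_range(start, end, freq=pd.DateOffset(months=1))
edges = starts.append(pd.DatetimeIndex([end]))

# Position of the bin each timestamp falls into; out-of-range rows are dropped.
pos = edges.searchsorted(df.index, side="right") - 1
valid = (pos >= 0) & (pos < len(starts))

sums = df["total"][valid].groupby(starts[pos[valid]]).sum()
result = sums.reindex(starts, fill_value=0)
print(result)
```

Periods with no rows show up as 0, and the entry dated 2012-04-28 is left out because it falls after the end date.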

Remove Holidays and Weekends in a very long time-serie, how to model time-series in Python?

Is there some function in Python to handle this? Google Docs has a WEEKDAY operation, so perhaps there is something like that in Python. I am pretty sure someone must have solved this; similar problems occur with sparse data in finance and research. I am basically trying to organize a huge number of different-sized vectors indexed by day (time series). I am not sure how I should handle the days: should I mark the first day with 1 and the last day with N, or use Unix time, or something else? I am also not sure whether the time series should be stored in a matrix so that I could more easily calculate correlation matrices and the like. Is there anything ready-made for this?
Let's try to solve this problem without the "practical" extra clutter:
from itertools import compress, cycle
seq = range(100000)
criteria = cycle([True]*10 + [False]*801)
list(compress(seq, criteria))
Now we have to change them into days and then change the $\mathbb R$ into a tuple $(\mathbb R, \mathbb R)$. So a map $V : \mathbb R \to \mathbb R^{2}$ is missing; investigating.
[Update]
Let's play! The code below solves the subproblem of creating some test data. Now we need to create arbitrary days and valuations on them so that we can test it on arbitrary time series. If we can create some function $V$, we are very close to solving this problem... it must take the holidays and weekends into account, though, so it may not be easy (not sure).
import itertools as i
import time
import math
import numpy
def createRandomData():
    samples=[]
    for x in range(5):
        seq = range(5)
        criteria = i.cycle([True]*x+ [False]*3)
        samples += [list(i.compress( seq, criteria ))]
    return samples
def createNNtriangularMatrix(data):
    N = len(data)
    return [aa+[0]*(N-len(aa)) for aa in data]
A= createNNtriangularMatrix(createRandomData())
print(numpy.array(A))
print(numpy.corrcoef(A))
I think you should figure out some way to determine the days you want to INCLUDE, and create a (probably looping) subroutine that uses slicing operations on your big list.
For discontinuous slices, you can take a look at this question:
Discontinuous slice in python list
Or perhaps you could make the days you do not want receive a null value (zero or None).
Try using pandas. You can create a DateOffset for business days and include your data in a DataFrame (see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html) to analyze it.
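A minimal sketch of that idea follows, assuming a daily series and a hand-written holiday list. The two holiday dates and the sample values are hypothetical; a real calendar would have to be supplied.

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Daily series covering weekends and holidays (values are placeholders).
days = pd.date_range("2012-12-20", "2013-01-05", freq="D")
series = pd.Series(range(len(days)), index=days)

# Hypothetical holiday list; Monday-Friday is the default week mask.
holidays = [pd.Timestamp("2012-12-25"), pd.Timestamp("2013-01-01")]
business_day = CustomBusinessDay(holidays=holidays)

# Dates that are neither weekends nor listed holidays, then filter the series.
keep = pd.date_range(days[0], days[-1], freq=business_day)
filtered = series[series.index.isin(keep)]
print(filtered)
```

Several such filtered series can then be placed side by side in a DataFrame, whose corr() method gives the correlation matrix.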
I think it depends on the scope of your problem, for a personal calendar, 'day' is good enough for indexing.
Even a life as long as 200 years is only about 73,000 days, so simply calculate and record them all, maybe in a dict, e.g.
day = {}
# day[0] = [event_a, event_b, ...]
# or you may want to rewrite the __getitem__ method like this: day['09-05-2012']
Why would you want to remove the holidays and weekends? Is it because they are outliers or zeroes? If they are zeroes they will be handled by the model. You would want to leave the data in the time series and use dummy variables to model the seasonal effects (i.e. monthly dummies), day-of-the-week dummies and holiday dummies. Clearly, I am dumbfounded. I have seen people who are unable to deal with time series analysis even break the weekdays into one time series and the weekends into another, which completely ignores the lead and lag impacts around holidays.
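As a rough illustration of the dummy-variable approach, the sketch below builds day-of-week and holiday indicator columns that could be passed to a regression model as extra regressors. The date range and holiday list are hypothetical.

```python
import pandas as pd

days = pd.date_range("2012-12-20", "2013-01-05", freq="D")
holidays = pd.to_datetime(["2012-12-25", "2013-01-01"])  # hypothetical list

# One indicator column per weekday (0 = Monday ... 6 = Sunday).
features = pd.get_dummies(pd.Series(days.dayofweek, index=days), prefix="dow")

# Extra indicator that is 1 on listed holidays and 0 otherwise.
features["holiday"] = days.isin(holidays).astype(int)
print(features.head())
```

When fitting a model with an intercept, one of the weekday columns would normally be dropped to avoid perfect collinearity.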
If it is trading days you want then you can use the pandas datareader package to download the s&p 500 historical prices for U.S. and use the index of dates as a mask to your data.
Answered on mobile, I'll add links and code later.
