Time-series analysis with Python

I have sensor-based time-series data for one subject, measured at one-second intervals, with the corresponding heart rate at each time point, stored in an Excel file. My goal is to analyze whether there are any trends over time. When I import it into Python, the time column shows up as a plain number rather than a time. In Excel, however, I can convert it into a time format easily.
This is what it looks like in Python (column 1 = timestamp, column 2 = heart rate in bpm):
This is what it should look like though:
This is what I tried to convert it into datetime format in Python:
import datetime
Time = datetime.datetime.now()
"%s:%s.%s" % (Time.minute, Time.second, str(Time.microsecond)[:2])
if isinstance(Time, datetime.datetime):
    print("Yay!")
df3.set_index('Time', inplace=True)
Time gets recognized as float64 if I do this, not datetime64[ns].
Consequently, when I try to plot this time series, I get the following:
I even ran the Dickey-Fuller test to analyze trends in this dataset. Does my misconfiguration of the time column actually affect the ADF test? I'm assuming that since only the 'HeartRate' column is analyzed with this code, it shouldn't matter, right?
Here's the code I used:
import pandas as pd
from statsmodels.tsa.stattools import adfuller
# Perform Dickey-Fuller test:
print("Results of Dickey-Fuller Test:")
dftest = adfuller(df3['HeartRate'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
test_stationarity(df3)
Did I do this correctly? I don't have experience in the engineering field and I'm doing this to improve healthcare for older people, so any help will be very much appreciated!
Thanks in advance! :)

It seems that the date format in Excel is expressed as the number of days that have passed since 12/30/1899. To transform the number in the timestamp column into seconds, you only need to multiply it by 24*60*60 = 86400 (the number of seconds in one day).
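As a rough sketch of that conversion, assuming the column really holds Excel serial numbers (days since 12/30/1899): the function names and the sample values below are made up for illustration, and only the standard library is used.

```python
from datetime import datetime, timedelta

# Day zero of the Excel serial-date scale, as described above.
EXCEL_EPOCH = datetime(1899, 12, 30)
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (days since 1899-12-30) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


def excel_serial_to_seconds(serial):
    """Convert an Excel serial number (in days) to a number of seconds."""
    return serial * SECONDS_PER_DAY


# Hypothetical serial values; the real ones come from the Time column.
stamp = excel_serial_to_datetime(43831.5)
seconds_into_day = excel_serial_to_seconds(0.5)
print(stamp, seconds_into_day)
```

For a whole pandas column, pd.to_datetime(df3['Time'], unit='D', origin='1899-12-30') should give the same result in one step, assuming 'Time' is the float64 column from the question.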

Related

How to generate one value each minute out of irregular data?

I have values that are measured on an event-driven basis, so there is not the same amount of data every minute. To make this data easier to handle, I want to take only the first row of values for every minute.
The time of the data I import from a csv looks like this:
time
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:12
11.11.2011 11:12
11.11.2011 11:13
The other values are temperatures.
One main problem is importing the time in the right format.
I tried to solve this with the help of this community like this:
import datetime as dt
with open('my_file.csv', 'r') as file:
    for line in file:
        try:
            time = line.split(';')[0]  # split the line at the semicolon and take the first field
            time = dt.datetime.strptime(time, '%d.%m.%Y %H:%M')
            print(time)
        except ValueError:
            pass  # skip lines (such as the header) that do not start with a timestamp
Then I imported the temperature columns and joined them like this:
df = pd.read_csv("my_file.csv", sep=';', encoding='latin-1')
df=df[["time", "T1", "T2", "DT1", "DT2"]]
When I printed the dtypes of my data, the time was datetime64[ns] and the others were objects.
I tried different options of groupby and resample. Like the following:
df=df.groupby([pd.Grouper(key = 'time', freq='1min')])
df.resample('M')
One main problem stated in the error messages was that the data type of the time was not appropriate for grouping, because it is not a DatetimeIndex.
So I tried to convert the dates to a DatetimeIndex like this:
df.index = pd.to_datetime(daten["time"].index, format='%Y-%m-%d %H:%M:%S')
but then I received an index that starts counting from 1970-01-01, so I am not quite sure whether this conversion is possible with irregular data.
Without this conversion I also get the message <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026938A74850>
That message is shown when I try to display my dataframe, and when I save it to CSV like this:
df.to_csv('04_01_DTempminuten.csv', index=False, encoding='utf-8', sep =';', date_format = '%Y-%m-%d %H:%M:%S')
I receive either the same message or only one line with a decimal number instead of the time.
Does anyone have an idea how to deal with this irregular data to get one line of values each minute?
Thank you for reading my question. I am really thankful for any ideas.
Without sample data I can only show how I do it with irregular time series, which I think is your case. I work with price data, which arrives at irregular time intervals. If you need to sample by taking the first value in each minute, you can use resample for a specific interval with the ohlc aggregation function, which gives you four columns for each sample interval:
open: first value in the interval
high: highest value
low: lowest value
close: last value
In your case the sampling interval would be 1 minute ('T', spelled 'min' in recent pandas versions).
In the following example I'm using one second ('S') as the resampling frequency to resample the ask column (the counterpart of your temperature column):
import pandas as pd
df = pd.read_csv('my_tick_data.csv')
df['date_time'] = pd.to_datetime(df['date_time'])
df.set_index('date_time', inplace=True)
df.head(6)
df['ask'].resample('S').ohlc()
This does not solve your date issue, which is a prerequisite for this part, because the data set needs to be indexed by date. If you can provide sample data, maybe I can help you with that part as well.
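For the date part, a minimal sketch could look like the following. It assumes the layout described in the question (semicolon-separated, day-first timestamps in a "time" column); the temperature values are invented for illustration, and first() is used instead of ohlc() because only the first row per minute was asked for.

```python
import io

import pandas as pd

# Invented sample in the layout described in the question.
raw = """time;T1
11.11.2011 11:11;20.1
11.11.2011 11:11;20.4
11.11.2011 11:11;20.3
11.11.2011 11:12;21.0
11.11.2011 11:12;21.2
11.11.2011 11:13;19.8
"""

df = pd.read_csv(io.StringIO(raw), sep=';')

# Parse the day-first timestamps explicitly and use them as the index,
# which gives resample the DatetimeIndex it needs.
df['time'] = pd.to_datetime(df['time'], format='%d.%m.%Y %H:%M')
df = df.set_index('time')

# One row per minute, keeping the first value observed in that minute.
first_per_minute = df.resample('1min').first()
print(first_per_minute)
```

Minutes without any measurement would show up as NaN rows; calling dropna() on the result removes them if they are not wanted.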

Take maximum rainfall value for each season over a time period (xarray)

I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way the output has a shape of (4,145,192) indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season every year. In other words, output should have something with a shape like (40,145,192) (4 values for each year x 10 years)
I've looked into trying to do this with DataSet.resample as well using time=3M as the frequency, but then it doesn't split the months up correctly. If I have to I can alter the dataset, so it starts in the correct place, but I was hoping there would be an easier way considering there's already a function to group it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Python - select certain time range pandas

Python newbie here. I have some intraday financial data going back to 2012, so it has the same hours each day (the same trading session each day), just on different dates. I want to be able to select certain times out of the data, check the corresponding OHLC data for that period, and then do some analysis on it.
So at the moment it's a CSV file, and I'm doing:
import pandas as pd
data = pd.read_csv('data.csv')
date = data['date']
op = data['open']
high = data['high']
low = data['low']
close = data['close']
volume = data['volume']
The thing is that the date column is in the format "dd/mm/yyyy 00:00:00" as one string, so is it possible to still select a certain time window, like between "09:00:00" and "10:00:00"? Or do I have to separate the time part from the date and make it its own column? If so, how?
So I believe pandas has a between_time() function, but that seems to need a DataFrame, so how can I convert it to a DataFrame, then I should be able to use the between_time function to select between the times I want. Also because there's obviously thousands of days, all with their own "xx:xx:xx" to "xx:xx:xx" I want to pull that same time period I want to look at from each day, not just the first lot of "xx:xx:xx" to "xx:xx:xx" as it makes its way down the data, if that makes sense. Thanks!!
Consider the dataframe df
import numpy as np
from pandas_datareader import data
df = data.get_data_yahoo('AAPL', start='2016-08-01', end='2016-08-03')
df = df.asfreq('H').ffill()
option 1
convert index to series then dt.hour.isin
slc = df.index.to_series().dt.hour.isin([9, 10])
df.loc[slc]
option 2
numpy broadcasting
slc = (df.index.hour.to_numpy()[:, None] == [9, 10]).any(axis=1)
df.loc[slc]
response to comment
To then get a range within that time slot per day, use resample + agg + np.ptp (peak to peak)
df.loc[slc].resample('D').agg(np.ptp)
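Since the question mentioned between_time, here is a small sketch of that route as well. It assumes the date column uses the "dd/mm/yyyy HH:MM:SS" layout described in the question; the rows and the "close" values are made up.

```python
import pandas as pd

# Made-up rows spanning two trading days.
raw = pd.DataFrame({
    'date': ['01/08/2016 08:30:00', '01/08/2016 09:00:00',
             '01/08/2016 09:30:00', '01/08/2016 10:00:00',
             '01/08/2016 10:30:00', '02/08/2016 09:15:00',
             '02/08/2016 11:00:00'],
    'close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# Parse the day-first strings; between_time needs a DatetimeIndex.
raw['date'] = pd.to_datetime(raw['date'], format='%d/%m/%Y %H:%M:%S')
indexed = raw.set_index('date')

# Rows from 09:00 to 10:00 (both ends included) on every day.
session = indexed.between_time('09:00:00', '10:00:00')
print(session)
```

Because the filter looks only at the time of day, the same window is pulled from every date in the index, which is what the question asked for.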

how to vectorise Pandas calculation that is based on last x rows of data

I have fairly sophisticated prediction code, using WLS, with over 20 columns and millions of rows per column. Currently I use iterrows to loop through dates and then, based on those dates and the values on those dates, extract differently sized slices of data for the calculation. It takes hours to run in production. I have simplified the code into the following:
import pandas as pd
import numpy as np
from datetime import timedelta
df=pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')
def calculateC(A, dte):
    if A > 0:  # based on values has different cutoff length for trend prediction
        depth = 10
    else:
        depth = 20
    lastyear = dte - timedelta(days=365)
    df2 = df[df.dte < lastyear].head(depth)  # use last year same date data for basis of prediction
    return df2.B.mean()  # uses WLS in my model but for simplification replace with mean
for index, row in df.iterrows():
    if index > 365:
        df.loc[index, 'C'] = calculateC(row.A, row.dte)
I read that iterrows is the main cause, because it is not an efficient way to use pandas, and that I should use vectorized methods. However, I can't find a way to vectorize based on conditions (dates, different lengths, and ranges of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
df['result'] = np.where(df['A'] > 0,
                        df.shift(365).rolling(10).B.mean(),
                        df.shift(365).rolling(20).B.mean())
The tough (slow) part of your code is this:
df2=df[df.dte<lastyear].head(depth)
However, as long as your dates are all 365 days apart, you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
And actually, even if your dates aren't completely regular, you can probably resample and get this way to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
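To illustrate the frequency-based shift mentioned above, here is a small sketch with made-up, irregularly spaced dates. The series name b and its values are hypothetical; the point is only that shift(freq=...) moves the index by calendar time rather than by a number of rows.

```python
import pandas as pd

# Made-up, irregularly spaced observations.
dates = pd.to_datetime(['2014-01-01', '2014-01-03',
                        '2015-01-01', '2015-01-02', '2015-01-03'])
b = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=dates)

# Move every observation 365 calendar days forward, then line the
# result up with the original dates; dates with no observation
# exactly 365 days earlier become NaN.
b_year_ago = b.shift(freq='365D').reindex(b.index)
print(b_year_ago)
```

A rolling mean over b_year_ago would then play the role of head().mean() in the original loop, under the same caveat as above that it may not reproduce the original results exactly.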
I would try pandas.DataFrame.apply(func, axis=1)
def calculateC2(row):
    if row.name > 365:  # row.name is the index of the row
        if row.A > 0:  # based on values has different cutoff length for trend prediction
            depth = 10
        else:
            depth = 20
        lastyear = row.dte - timedelta(days=365)
        df2 = df[df.dte < lastyear].B.head(depth)  # use last year same date data for basis of prediction
        print(row.name, np.mean(df2))  # uses WLS in my model but for simplification replace with mean
df.apply(calculateC2, axis=1)

Remove holidays and weekends in a very long time series: how to model time series in Python?

Is there some function in Python to handle this? Google Docs has a WEEKDAY operation, so perhaps there is something like that in Python. I am pretty sure someone must have solved this; similar problems occur with sparse data, such as in finance and research. I am basically just trying to organize a huge number of differently sized vectors indexed by days (time series), and I am not sure how I should handle the days: mark the first day with 1 and the last day with N, use Unix time, or something else? I am also not sure whether the time series should be saved into a matrix so that I could model them more easily, for example to calculate correlation matrices. Is there anything ready-made for such things?
Let's try to solve this problem without the "practical" extra clutter:
from itertools import compress, cycle
seq = range(100000)
criteria = cycle([True]*10 + [False]*801)
list(compress(seq, criteria))
Now we have to change them into days and then turn each $\mathbb R$ value into a tuple $(\mathbb R, \mathbb R)$. So a function $V : \mathbb R \to \mathbb R^{2}$ is still missing; investigating.
[Update]
Let's play! The code below solves the subproblem: it creates some test data to test things. Now we need to create arbitrary days and valuations there to try it on arbitrary time series. If we can create some function $V$, we are very close to solving this problem. It must account for the holidays and weekends, though, so it may not be easy (not sure).
import itertools as i
import time
import math
import numpy
def createRandomData():
    samples = []
    for x in range(5):
        seq = range(5)
        criteria = i.cycle([True]*x + [False]*3)
        samples += [list(i.compress(seq, criteria))]
    return samples
def createNNtriangularMatrix(data):
    N = len(data)
    return [aa + [0]*(N - len(aa)) for aa in data]
A = createNNtriangularMatrix(createRandomData())
print(numpy.array(A))
print(numpy.corrcoef(A))
I think you should somehow figure out the days you want to INCLUDE, and create a (probably looping) subroutine that uses slicing operations on your big list.
For discontinuous slices, you can take a look at this question:
Discontinuous slice in python list
Or perhaps you could make the days you do not want receive a null value (zero or None).
Try using pandas. You can create a DateOffset for business days and include your data in a DataFrame (see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html) to analyze it.
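A rough sketch of the pandas route, assuming the data is a daily series: the values and the single holiday below are made up, and a real holiday calendar would have to be supplied by you.

```python
import pandas as pd

# Made-up daily series covering a weekend and one holiday.
days = pd.date_range('2012-12-21', '2012-12-28', freq='D')
values = pd.Series(range(len(days)), index=days)

# Hypothetical holiday list; replace it with the calendar you need.
holidays = ['2012-12-25']

# freq='C' selects the custom business-day frequency, which is what
# makes bdate_range honour the holidays argument.
business_days = pd.bdate_range(days[0], days[-1], freq='C', holidays=holidays)

# Keep only the observations that fall on a business day.
business_only = values[values.index.isin(business_days)]
print(business_only)
```

Once the weekends and holidays are gone, several such series can be put side by side in a DataFrame, and its corr() method gives the correlation matrix the question asks about.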
I think it depends on the scope of your problem, for a personal calendar, 'day' is good enough for indexing.
One's life is at most about 200 years, roughly 73000 days, so simply calculate and record them all, maybe using a dict, e.g.
day = {}
# day[0] = [event_a, event_b, ...]
# or you may want to rewrite the __getitem__ method like this: day['09-05-2012']
Why would you want to remove the holidays and weekends? Is it because they are outliers or zeroes? If they are zeroes, they will be handled by the model. You would want to leave the data in the time series and use dummy variables to model the seasonal effects (i.e. monthly dummies), day-of-the-week dummies and holiday dummies. Clearly, I am dumbfounded. I have seen people who are unable to deal with time series analysis even break the weekdays into one time series and the weekends into another, which completely ignores the lead and lag impacts around holidays.
If it is trading days you want, then you can use the pandas-datareader package to download the S&P 500 historical prices for the U.S. and use the index of dates as a mask for your data.
Answered on mobile, I'll add links and code later.
