TSFRESH - features extracted by a symmetric sliding window - python

As raw data we have measurements m_{i,j}, taken every 30 seconds (i = 0, 30, 60, 90, ..., 720, ...) for every subject j in the dataset.
I wish to use the TSFRESH package to extract time-series features, such that for a point of interest at time i the features are calculated over a symmetric rolling window.
We wish to calculate the feature vector of time point i,j based on the measurements from 3 hours of context before i and 3 hours after i.
Thus, each point of interest is represented by a 721-measurement window of 6 hours of "context", i.e. 360 measurements before and 360 measurements after the point of interest.
For every point of interest, features should be extracted based on 721 measurements of m_{i,j}.
I've tried using the rolling_direction parameter of roll_time_series(), but the only options are to roll either backwards or forwards in "time"; I'm looking for a way to include both "past" and "future" data in the feature calculation.

If I understand your idea correctly, it is even possible to do this with only one-sided rolling. Let's try with one example:
You want to predict for the time 8:00, and for this you need the data from 5:00 until 11:00.
If you roll through the data with a window size of 6 h and a positive rolling direction, you will end up with a dataset which also includes exactly this part of the data (5:00 to 11:00). Normally it would be used to train for the value at 11:00 (or 12:00), but nothing prevents you from using it to predict the value at 8:00.
Basically, it is just a matter of re-indexing.
(The same is actually true for the negative rolling direction.)
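For concreteness, here is a minimal sketch of that re-indexing idea (my illustration, not the original poster's code): it reuses the variable names from the snippet further down, assumes the sort column is in seconds with one sample every 30 s, and assumes a recent tsfresh version in which each rolled window is labelled by a (subject, last timestamp) tuple - check the exact off-by-one convention of max_timeshift in your version.
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series

# one-sided rolling over ~6 h of context (720 samples of 30 s plus the point itself)
rolled = roll_time_series(activity_data,
                          column_id=id_column,
                          column_sort=sort_column,
                          rolling_direction=1,
                          max_timeshift=720)
features = extract_features(rolled,
                            column_id=id_column,
                            column_sort=sort_column)
# each window is labelled by its last timestamp; shift the label back by 3 h
# (10800 s) so the feature vector describes the centre of the window instead
features.index = [(subject, t - 3 * 60 * 60) for subject, t in features.index]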

A "workaround" solution:
Use the "roll_time_series" function twice; one for "backward" rolling (setting rolling_direction=1) and the second for "forward" (rolling_direction=-1), and then combine them into one.
This will provide, for each time point in the original dataset m_{i,j}$, a time series rolling object with 360 values "from the past" and 360 values "from the future" (i.e., the time point is at the center of the window and max_timeshift=360)
Note to the use of pandas functions below: concat(), sort_values(), drop_duplicates() - which are mandatory for this solution to work.
import numpy as np
import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
# roll backwards: each window ends at the point of interest
rolled_backward = roll_time_series(activity_data,
                                   column_id=id_column,
                                   column_sort=sort_column,
                                   column_kind=None,
                                   rolling_direction=1,
                                   max_timeshift=360)
# roll forwards: each window starts at the point of interest
rolled_forward = roll_time_series(activity_data,
                                  column_id=id_column,
                                  column_sort=sort_column,
                                  column_kind=None,
                                  rolling_direction=-1,
                                  max_timeshift=360)
# merge into one dataframe, with a forward and a backward window for every time point (sample)
df = pd.concat([rolled_backward, rolled_forward])
# important! - sort and drop duplicates, so the point of interest is not counted twice
df.sort_values(by=[id_column, sort_column], inplace=True)
df.drop_duplicates(subset=[id_column, sort_column, 'activity'], inplace=True, keep='first')
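From here the features can be extracted per combined window; a minimal sketch continuing from the snippet above (it assumes the rolled id column still carries the name passed as column_id, as in recent tsfresh versions, and uses the MinimalFCParameters already imported):
from tsfresh import extract_features

# one feature vector per point of interest, computed over its
# 360 past + 360 future measurements
features = extract_features(df,
                            column_id=id_column,
                            column_sort=sort_column,
                            default_fc_parameters=MinimalFCParameters())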

Related

How can I accurately compute the phase lag variation over time for two time series?

My data set consists of two time series with a sampling frequency of 1 sample per minute. When I plot the data, it is very obvious that the two time series have a phase lag. However, I am not sure how I can correctly measure the phase lag on a temporal basis, because I am interested in measuring how the phase lag changes with time.
Here are the steps and the script I have used so far for my computation:
What the data looks like
Cross-correlation to measure the phase lag
First Attempt
I directly cross-correlate the two time series:
from scipy import signal      # imports were missing from the original snippet
import numpy as np

d['h_ven'] = series1           # d is a pandas DataFrame created earlier
d['h_tid'] = series2
corr = signal.correlate(d['h_ven'], d['h_tid'])
lags = signal.correlation_lags(len(d['h_tid']), len(d['h_ven']))
corr /= np.max(corr)           # normalise the correlation
ax1.plot(lags, corr)           # ax1 is a matplotlib Axes created earlier
Second approach
I attempted to interpolate first by polynomial fitting:
#inte = d['h_ven'].interpolate(method='spline', order=5)
#df = pd.DataFrame(dict(x=d['h_ven']))
#x_f = df[["x"]].apply(savgol_filter, window_length=31, polyorder=2)
# other approach
from scipy.signal import savgol_filter   # import needed for the filter below
res1 = savgol_filter(d['h_ven'], 31, 3)
res2 = savgol_filter(d['h_tid'], 31, 3)
Afterwards, I cross-correlate again using the same script.
However, I am not sure about the result of the cross-correlation.
Can someone suggest how I can improve/correct it further?
Expected Result:
I expect the phase delay (in degrees/radians, within -360 to 360 or -2π to +2π) as a function of time. The expected output is attached here:
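One common way to get a time-varying estimate (offered here only as an illustrative sketch, not as the asker's method) is to repeat the cross-correlation inside a sliding window and convert the lag at the correlation peak into a phase using a known dominant period; the window length, step and the 12-hour period in the usage comment are all assumptions.
import numpy as np
from scipy import signal

def sliding_phase_lag(a, b, window=240, step=60, period=None):
    """Sketch: lag (in samples) between a and b inside each window,
    optionally converted to degrees given the dominant period (in samples)."""
    out = []
    for start in range(0, len(a) - window, step):
        wa = a[start:start + window] - np.mean(a[start:start + window])
        wb = b[start:start + window] - np.mean(b[start:start + window])
        corr = signal.correlate(wa, wb)
        lags = signal.correlation_lags(len(wa), len(wb))
        lag = lags[np.argmax(corr)]        # lag at the correlation peak
        if period is not None:
            lag = 360.0 * lag / period     # samples -> degrees of phase
        out.append((start + window // 2, lag))
    return out

# usage: 1 sample per minute, 4 h windows evaluated every hour, phase in degrees
# assuming a dominant period of 720 minutes (12 h) - adjust to your data
# phase = sliding_phase_lag(d['h_ven'].to_numpy(), d['h_tid'].to_numpy(),
#                           window=240, step=60, period=720)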

Rolling weighted mean in pandas using date range

I want to calculate the rolling weighted mean of a time series, with the average calculated over a specific time interval. For example, this calculates the rolling mean with a 90-day window (not weighted):
import numpy as np
import pandas as pd
data = np.random.randint(0, 1000, (1000, 10))
index = pd.date_range("20190101", periods=1000, freq="18H")
df = pd.DataFrame(index=index, data=data)
df = df.rolling("90D").mean()
However, when I apply a weighting function (line below) I get an error: "ValueError: Invalid window 90D"
df = df.rolling("90D", win_type="gaussian").mean(std=60)
On the other hand, the weighted average works if I make the window an integer instead of an offset:
df = df.rolling(90, win_type="gaussian").mean(std=60)
Using an integer does not work for my application since the observations are not evenly spaced in time.
Two questions:
1. Can I do a weighted rolling mean with an offset (e.g. "90D" or "3M")?
2. If I can do a weighted rolling mean with an offset, then what does std refer to when I specify window="90D" and win_type="gaussian"? Does it mean the std is 60D?
Okay, I discovered that it's not implemented yet in pandas.
Look here:
https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/core/window.py
If you follow line 2844 you see that when win_type is not None a Window object is returned:
if win_type is not None:
return Window(obj, win_type=win_type, **kwds)
Then check the validate method of the Window object at line 630; it only allows integer or list-like windows.
I think this is because pandas uses the scipy.signal window functions, which receive an array, so it cannot take into account the distribution of your data over time.
You could implement your own weighting function and use apply(), but its performance won't be great.
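As a minimal sketch of that apply() route (my illustration, not the answerer's code): with raw=False pandas passes each window as a Series carrying its timestamps, so Gaussian weights can be computed from each observation's age; the 30-day std is just an example value.
import numpy as np

def gaussian_time_weighted_mean(window, std_days=30.0):
    # age of each observation (in days) relative to the newest one in the window
    age_days = (window.index[-1] - window.index).total_seconds() / 86400.0
    weights = np.exp(-0.5 * (age_days / std_days) ** 2)
    return np.average(window.to_numpy(), weights=weights)

# raw=False so each window is passed as a Series with its DatetimeIndex
weighted = df.rolling("90D").apply(gaussian_time_weighted_mean, raw=False)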
It is not clear to me what you want the weights in your weighted average to be, but is the weight a measure of the time for which an observation is 'in effect'?
If so, I believe you can re-index the dataframe so it has regularly-spaced observations. Then fill NAs appropriately - see method in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
That will allow rolling to work and will also help you think explicitly about how missing observations are treated - for instance, should a missing sample take its value from the last valid sample or from the nearest sample?
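A sketch of that re-indexing idea, assuming an hourly grid and forward-filling are acceptable for your data (both are assumptions, as is the 60-day std):
import pandas as pd

# put the observations on a regular hourly grid, carrying the last valid
# sample forward; the integer/win_type rolling machinery then applies
regular = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1H"),
                     method="ffill")
# 90 days = 2160 hourly samples; std is now in samples (here 60 days = 1440 samples)
weighted = regular.rolling(2160, win_type="gaussian").mean(std=1440)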

Automatic Trend Detection for Time Series / Signal Processing

What are good algorithms to automatically detect a trend or draw a trend line (up trend, down trend, no trend) for time series data? I would appreciate it if you could point me to any good research paper or good library in Python, R or Matlab.
Ideally, the output from this algorithm will have 4 columns:
from_time
to_time
trend (up/down/no trend/unknown)
probability_of_trend or degree_of_trend
Thank you so much for your time.
I had a similar problem - I wanted to segment the time series into segments with similar trends. For that task, you can use the trend-classifier Python library. It is pip installable (pip3 install trend-classifier).
Here is an example that gets time series data from Yahoo Finance and performs the analysis.
import yfinance as yf
from trend_classifier import Segmenter
# download data from yahoo finance
df = yf.download("AAPL", start="2018-09-15", end="2022-09-05", interval="1d", progress=False)
x_in = list(range(0, len(df.index.tolist()), 1))
y_in = df["Adj Close"].tolist()
seg = Segmenter(x_in, y_in, n=20)
seg.calculate_segments()
Now, you can plot the time series with trend lines and segment boundaries with:
seg.plot_segments()
You can inspect details about each segment (e.g. a positive value for slope indicates an up-trend and a negative value a down-trend). To see info about the segment with index 3:
from devtools import debug
debug(seg.segments[3])
You can get information about all segments in tabular form using the Segmenter.segments.to_dataframe() method, which produces a pandas DataFrame.
seg.segments.to_dataframe()
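If you want something close to the four columns asked for in the question (from_time, to_time, trend, degree), a rough sketch along these lines could be built on top of that DataFrame; the column names start, stop and slope are assumptions about the library's output, so check them against your installed version:
segments_df = seg.segments.to_dataframe()

def label(slope, eps=1e-3):          # eps: illustrative "no trend" threshold
    if slope > eps:
        return "up"
    if slope < -eps:
        return "down"
    return "no trend"

# assumed columns: 'start'/'stop' are segment boundaries, 'slope' the fitted slope
trends = segments_df.assign(
    from_time=segments_df["start"],
    to_time=segments_df["stop"],
    trend=segments_df["slope"].apply(label),
    degree_of_trend=segments_df["slope"].abs(),
)[["from_time", "to_time", "trend", "degree_of_trend"]]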
There is a parameter that controls the "generalization" factor: you can fit trend lines to smaller ranges of the time series and end up with a large number of segments, or go for segments spanning a bigger part of the time series (a more general trend line) and end up with fewer segments. To control that behavior, use various values for the n parameter when initializing Segmenter() (e.g. Segmenter(x_in, y_in, n=20)). The larger n is, the stronger the generalization (fewer segments).
Disclaimer: I'm the author of the trend-classifier package.

pandas rolling_std only perform every Nth calculation

I am working on some code optimization. Currently I use the pandas rolling_mean and rolling_std to compute normalized cross-correlations of time series data from seismic instruments. For non-pertinent technical reasons I am only interested in every Nth value of the output of these pandas rolling mean and rolling std calls, so I am looking for a way to compute only every Nth value. I may have to write Cython code to do this, but I would prefer not to. Here is an example:
import pandas as pd
import numpy as np
As = 5000   # array size
ws = 150    # moving window size (renamed from "as", which is a Python keyword)
N = 3       # only interested in every Nth value of the output array
ar = np.random.rand(As)                  # generate a generic random array
RSTD = pd.rolling_std(ar, ws)[ws - 1:]   # don't return the NaNs before the windows overlap
foo = RSTD[::N]                          # use array indexing to decimate RSTD to every Nth value
Is there a good pandas way to only calculate every Nth value of RSTD rather than calculate all the values and decimate?
Thanks
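There is no pandas switch for this, but as a sketch of one way to evaluate the windowed standard deviation only at every Nth position (an O(n) cumulative-sum approach, offered as an illustration rather than a built-in pandas feature; the names are mine):
import numpy as np

def strided_rolling_std(ar, window, step, ddof=1):
    """Sketch: rolling std of `ar` over `window` samples, evaluated only at
    every `step`-th window end, via cumulative sums of x and x**2."""
    csum = np.concatenate(([0.0], np.cumsum(ar)))
    csum2 = np.concatenate(([0.0], np.cumsum(ar ** 2)))
    ends = np.arange(window, len(ar) + 1, step)      # window end positions
    s = csum[ends] - csum[ends - window]
    s2 = csum2[ends] - csum2[ends - window]
    var = (s2 - s ** 2 / window) / (window - ddof)
    return np.sqrt(np.maximum(var, 0.0))             # clip tiny negatives from round-off

# matches the positions of RSTD[::N] in the example above
# foo_fast = strided_rolling_std(ar, ws, N)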

Converting excel files to python to frequency

Essentially I've got an Excel file with voltage in the first column and time in the second. I want to find the period of the voltage: plotting it gives a graph with voltage on the y axis and time on the x axis that is periodic, looking similar to a sine function.
To find the frequency I have loaded my Excel file into Python, as I think this will make it easier - there may be something I've missed that would simplify this.
So far in python I have:
import xlrd
import numpy as N
import numpy.fft as F
import matplotlib.pyplot as P
wb = xlrd.open_workbook('temp7.xls') #LOADING EXCEL FILE
wb.sheet_names()
sh = wb.sheet_by_index(0)
first_column = sh.col_values(1) #VALUES FROM EXCEL
second_column = sh.col_values(2) #VALUES FROM EXCEL
Now how do I find the frequency from this?
I'm not sure how much you know about the Fourier transform, so forgive me if this is too much background.
Your signal does not have "a frequency", but it can be thought of as the sum of many frequencies. The Fourier transform will tell you the weights of all the frequencies that make up your signal. Unfortunately, information may be lost when sampling from the analog (continuous-time) to the digital (discrete-time) domain. This puts a constraint on the information we can get about frequency - namely, the maximum frequency component we can determine is related to the digital sampling rate (the Nyquist-Shannon criterion):
fs > 2B
where fs is your sampling rate (samples per unit time, typically in Hz or something like it) and B is the maximum frequency of your signal. If your signal actually has frequencies higher than B, they will be "aliased" to some value below fs/2.
For your problem, all you have to do is this:
x = N.array(first_column)
X = F.fft(x)
Now X is the frequency-domain representation of your voltage signal. The corresponding frequency axis covers [0, fs), based on the sampling theorem. So, what is fs? You need to calculate that by looking at the number of samples you have divided by the total duration of your sampled signal (note your units here):
fs = len(second_column) / second_column[-1]
Note that this representation of your signal will also (probably) be complex, i.e. each frequency will have an associated amplitude and phase.
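To read the dominant frequency (and hence the period) off X, a minimal sketch continuing from the snippet above; skipping the DC bin and using only the first half of the spectrum are the usual conventions:
freqs = F.fftfreq(len(x), d=1.0 / fs)   # frequency of each FFT bin
half = len(x) // 2                      # keep only the non-negative frequencies
magnitude = N.abs(X[:half])
magnitude[0] = 0.0                      # ignore the DC (zero-frequency) bin
dominant_freq = freqs[N.argmax(magnitude)]
period = 1.0 / dominant_freq            # in the same time units as second_column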
Hopefully this helps, and hopefully I didn't cover a bunch of stuff you already knew.
