What are good algorithms to automatically detect a trend or draw a trend line (up trend, down trend, no trend) for time series data? I'd appreciate a pointer to any good research paper or a good library in Python, R, or Matlab.
Ideally, the output from this algorithm will have 4 columns:
from_time
to_time
trend (up/down/no trend/unknown)
probability_of_trend or degree_of_trend
Thank you so much for your time.
I had a similar problem - I wanted to segment a time series into segments with similar trends. For that task, you can use the trend-classifier Python library. It is pip installable (pip3 install trend-classifier).
Here is an example that gets time series data from Yahoo Finance and performs the analysis.
import yfinance as yf
from trend_classifier import Segmenter
# download data from yahoo finance
df = yf.download("AAPL", start="2018-09-15", end="2022-09-05", interval="1d", progress=False)
x_in = list(range(len(df)))  # integer x-axis (0..len(df)-1) for the segmenter
y_in = df["Adj Close"].tolist()
seg = Segmenter(x_in, y_in, n=20)
seg.calculate_segments()
Now, you can plot the time series with trend lines and segment boundaries with:
seg.plot_segments()
You can inspect details about each segment (e.g. a positive slope indicates an up-trend, a negative slope a down-trend). To see info about the segment with index 3:
from devtools import debug
debug(seg.segments[3])
You can get information about all segments in tabular form using the Segmenter.segments.to_dataframe() method, which produces a pandas DataFrame.
seg.segments.to_dataframe()
There is a parameter that controls the "generalization" factor: you can fit trend lines to smaller ranges of the time series and end up with a large number of segments, or go for segments spanning a bigger part of the time series (a more general trend line) and end up with the time series divided into fewer segments. To control that behavior, pass different values of the n parameter when initializing Segmenter() (e.g. Segmenter(x_in, y_in, n=20)). The larger n, the stronger the generalization (fewer segments).
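To illustrate, a quick comparison sketch (the n values below are arbitrary choices, and it assumes seg.segments supports len()):
for n in (10, 20, 40):
    seg = Segmenter(x_in, y_in, n=n)
    seg.calculate_segments()
    print(f"n={n}: {len(seg.segments)} segments")  # larger n -> fewer segments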
Disclaimer: I'm the author of the trend-classifier package.
My data set consists of two time series sampled at 1 sample per minute. When I plot the data, it is very obvious that the two time series have a phase lag. However, I am not sure how to correctly measure the phase lag as a function of time, since I am interested in how the phase lag changes over time.
Here are the steps I have used so far for my computation:
How the data looks
Cross-correlation to measure the phase lag
First Attempt
I directly cross-correlated the two time series:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal

d = pd.DataFrame()  # series1, series2: the two raw 1-sample/minute series
d['h_ven'] = series1
d['h_tid'] = series2
corr = signal.correlate(d['h_ven'], d['h_tid'])
lags = signal.correlation_lags(len(d['h_tid']), len(d['h_ven']))
corr /= np.max(corr)
fig, ax1 = plt.subplots()
ax1.plot(lags, corr)
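The peak of this curve gives a single overall lag estimate; as a small addition (not in the original script):
lag_at_peak = lags[np.argmax(corr)]  # lag in samples; at 1 sample/minute, this is minutes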
Second approach
I attempted to interpolate first by polynomial fitting:
from scipy.signal import savgol_filter

#inte = d['h_ven'].interpolate(method='spline', order=5)
#df = pd.DataFrame(dict(x=d['h_ven']))
#x_f = df[["x"]].apply(savgol_filter, window_length=31, polyorder=2)
# other approach: Savitzky-Golay smoothing
res1 = savgol_filter(d['h_ven'], 31, 3)
res2 = savgol_filter(d['h_tid'], 31, 3)
Afterward, I cross-correlate the smoothed series using the same script as above.
However, I am not sure about the result of the cross-correlation.
Could someone suggest how I can improve or correct it further?
Expected Result:
I expect the phase delay (in degrees/radians, within -360 to 360 or -2π to +2π) as a function of time. The expected output is attached here:
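Since the stated goal is a lag that varies over time, one way to sketch it is a sliding-window version of the same cross-correlation (the 6-hour, 360-sample window is a hypothetical choice):
win = 360  # hypothetical window length: 6 h at 1 sample/minute
lags_over_time = []
for start in range(0, len(series1) - win, win):
    s1 = series1[start:start+win] - np.mean(series1[start:start+win])
    s2 = series2[start:start+win] - np.mean(series2[start:start+win])
    corr_w = signal.correlate(s1, s2)
    lags_w = signal.correlation_lags(len(s1), len(s2))
    lags_over_time.append(lags_w[np.argmax(corr_w)])  # lag (minutes) per window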
I have a dataset with a signal and a 1/distance (Angstrom^-1) column.
This is the dataset (fourier.csv): https://pastebin.com/ucFekzc6
After applying these steps:
import pandas as pd
import numpy as np
from numpy.fft import fft
df = pd.read_csv(r'fourier.csv')
df.plot(x='1/distance', y='signal', kind='line')
I generated this plot:
To generate the Fast Fourier Transform data, I used the numpy library's fft function and applied it like this:
df['signal_fft'] = fft(df['signal'])
df.plot(x='1/distance', y='signal_fft', kind='line')
Now the plot looks like this, with the FFT data plotted instead of the initial "signal" data:
What I hoped to generate is something like this (This signal is extremely similar to mine, yet yields a vastly different FFT picture):
Theory Signal before windowing:
Theory FFT:
As you can see, my initial plot looks somewhat similar to graphic (a), but my FFT plot doesn't look anywhere near as clear as graphic (b). I'm still using the 1/distance data for both horizontal axes, and I don't see anything wrong with that, except that in the FFT plot the axis should be interpreted as distance (Angstrom) instead of 1/distance (1/Angstrom).
How should I apply FFT in order to get a result that resembles the theoretical FFT curve?
Here's another slide that shows a similar initial signal to mine and a yet again vastly different FFT:
Addendum: I have been asked to provide some additional information on this problem, so I hope this helps.
The origin of the dataset that I have linked is an XAS (X-Ray Absorption Spectroscopy) spectrum of iron oxide. Such an experimentally obtained spectrum looks similar to the one shown in the "Schematic of XAFS data processing" on the top left, i.e. absorbance [a.u.] plotted against the photon energy [eV]. First, I processed the spectrum (pre-edge baseline correction + post-edge normalization). Then I converted the data on the x-axis from energy E to wavenumber k (thus dimension 1/Angstrom) and cut off the signal at the L-edge jump, leaving only the signal in the post-edge EXAFS region, referred to as the fine structure function χ(k). The linked dataset contains the k^2-weighted χ(k) (to emphasize oscillations at large k). All of this is not entirely relevant, as the only thing I want to do now is a Fourier transformation of this signal (k^2 χ(k) vs. k). In theory, as we are dealing with photoelectrons and (back)scattering phenomena, the EXAFS region of the XAS spectrum can be approximated by a superposition of many sinusoidal waves, as described in the referenced equation with f(k) being the amplitude and δ(k) the phase shift of the scattered wave.
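The equation itself is not reproduced in the post; for orientation, a standard simplified textbook form of the EXAFS sum is:
χ(k) ≈ Σ_j [N_j f_j(k) / (k R_j^2)] · e^(-2 k^2 σ_j^2) · sin(2 k R_j + δ_j(k))
where N_j, R_j, and σ_j^2 are the coordination number, distance, and mean-square disorder of the j-th coordination shell.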
The aim is to gain an understanding of the chemical environment and the coordination spheres around the absorbing atom. The goal of the Fourier transform is to obtain some sort of signal in dependence of the "radius" R [Angstrom], which could later on be correlated to e.g. an oxygen being in ~2 Angstrom distance to the Mn atom (see "Schematic of XAFS data processing" on the right).
I only want to be able to reproduce the theoretically expected output after the FFT. My main concern is to get rid of the weird output signal and produce something that in some way resembles a curve with somewhat distinct local maxima (as shown in the 4th picture).
I don't have a 100% solution for you, but here's part of the problem.
The fft function you're using assumes that your X values are equally spaced. I checked this assumption by taking the difference between each 1/distance value, and graphing it:
df['1/distance'].diff().plot()
(Y is the difference, X is the index in the dataframe.)
For equally spaced samples this would be a constant line, and it clearly is not.
In order to fix this, one solution is to resample the signal through linear interpolation so that the timestep is constant.
import numpy as np
from scipy import interpolate

rs_df = df.drop_duplicates().copy()  # needed because 0 is present twice in the dataset
x = rs_df['1/distance']
y = rs_df['signal']
flinear = interpolate.interp1d(x, y, kind='linear')
xnew = np.linspace(np.min(x), np.max(x), rs_df.index.size)
ylinear = flinear(xnew)
rs_df['signal'] = ylinear
rs_df['1/distance'] = xnew
df.plot(x='1/distance', y='signal', kind='line')
rs_df.plot(x='1/distance', y='signal', kind='line')
The new line looks visually identical, but has a constant timestep.
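For completeness, a sketch of re-running the FFT on the resampled data (this alone did not reproduce the theoretical curve, so take it as a building block):
from numpy.fft import fft, fftfreq
n = rs_df.index.size
dk = float(np.diff(rs_df['1/distance'])[0])  # now a constant step
spectrum = np.abs(fft(rs_df['signal'].to_numpy()))[:n//2]  # magnitude, positive half
distance = fftfreq(n, d=dk)[:n//2]  # conjugate axis, in Angstrom
Plot spectrum against distance to inspect the result.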
I still don't get your intended result from the FFT, so this is only a partial solution.
MCVE
We import required dependencies:
import numpy as np
import pandas as pd
from scipy import fft, interpolate, signal
import matplotlib.pyplot as plt
And we load your dataset:
raw = pd.read_csv("https://pastebin.com/raw/ucFekzc6", sep="\t",
                  names=["k", "wchi"], header=0)
We clean the dataset a bit, as it contains duplicates and a problematic point with a null wave number (i.e. infinite distance), and ensure a zero-mean signal:
raw = raw.drop_duplicates()
raw = raw.iloc[1:, :]
raw["wchi"] = raw["wchi"] - raw["wchi"].mean()
The signal looks like this:
As noticed by @NickODell, the signal is not equally sampled, which is a problem if you aim to perform FFT signal processing.
We can resample your signal to have equally spaced sampling:
N = 65536
k = np.linspace(raw["k"].min(), raw["k"].max(), N)
interpolant = interpolate.interp1d(raw["k"], raw["wchi"], kind="linear")
g = interpolant(k)
Notice that, for performance reasons, the FFT puts the null-frequency component at the borders of its output (that's why your FFT signal does not look the way it is usually presented in books). This can be corrected with the classic fftshift method or by ad hoc indexing, as below.
R = 2*np.pi*fft.fftfreq(N, np.diff(k)[0])[:N//2]
G = (1/N)*fft.fft(g)[0:N//2]
Mind the 2π factor, which is involved in the unit scaling of your transformation.
You also mentioned a windowing step (at least in a picture) that is not referenced anywhere. This kind of filtering can help a lot in signal processing, as it filters out artifacts and unwanted noise. I leave it up to you.
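For instance, a minimal windowing sketch (the Hann window is an assumed choice here, not something the question specifies):
w = np.hanning(N)  # Hann taper, reduces spectral leakage at the record edges
Gw = (1/N)*fft.fft(g*w)[:N//2]  # windowed counterpart of G above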
Least Square Spectral Analysis
An alternative way to process your signal has been available since the advent of modern linear algebra: the periodogram of an irregularly sampled signal can be estimated by a method called Least Square Spectral Analysis.
You are looking for the square root of the periodogram of your signal, and scipy implements an easy way to compute it by the Lomb-Scargle method.
To do so, we simply create a frequency vector (in this case, the desired output distances) and perform the regression for each of those distances w.r.t. your signal:
# raw has no "R" column, so the output distance grid must be chosen explicitly;
# the range below (up to 10 Angstrom) is an assumed, typical EXAFS window
Rhat = np.linspace(1e-3, 10, 5000)
Ghat = signal.lombscargle(raw["k"], raw["wchi"], freqs=Rhat, normalize=True)
Graphically it leads to:
Comparison
If we compare both methodologies, we can confirm that the major peaks definitely match.
LSSA gives a smoother curve, but do not assume it is more accurate, as it is a statistical smoothing of an interpolated curve. Anyway, it fits the bill for your requirement:
I only want to be able to reproduce the theoretically expected output after the FFT. My main concern is to get rid of the weird output signal and produce something that in some way resembles a curve with somewhat distinct local maxima (as shown in the 4th picture).
Conclusions
I think you have enough information to process your signal, either by resampling and using the FFT or by using LSSA. Both methods have advantages and drawbacks.
Of course, this needs to be validated against well-known cases. Why not reproduce the figures from the paper you are working with, to check that you can reconstruct them?
You also need to dig into the signal conditioning (resampling, windowing, filtering) before performing the post-processing.
Hi, can someone give me a beginner's guide to setting the order and seasonal_order parameters in the SARIMA model from statsmodels?
Are those numbers obtained from the ACF and PACF plots? If so, how do you get the numbers for AR and MA from those plots? I know the differencing order (I) cannot be derived from those two plots; how should I decide whether to set it to 0, 1, or 2?
Thank you
You can use pmdarima:
pip install pmdarima
or its older incarnation, pyramid-arima:
pip install pyramid-arima
Please look at:
pyramid-auto-arima
import matplotlib.pyplot as plt
from pyramid.arima.stationarity import ADFTest

# series: your univariate time series (e.g. a pandas Series)
adf_test = ADFTest(alpha=0.05)
adf_test.is_stationary(series)
train, test = series[1:500], series[501:910]
train.shape
test.shape
plt.plot(train)
plt.plot(test)
plt.title("Pyramid")
plt.show()
To fit the model with statsmodels directly:
from statsmodels.tsa.statespace.sarimax import SARIMAX

arima_fit = SARIMAX(data_set, order=(1, 0, 1), seasonal_order=(0, 1, 0, 50), trend='c').fit()
prediction = arima_fit.predict(start=0, end=len(data_set) - 1, dynamic=True)  # in-sample, dynamic
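If you would rather have the orders chosen automatically, here is a minimal pmdarima sketch (series is assumed to be your pandas Series, and the seasonal period m=12 is a placeholder; use your own):
import pmdarima as pm
model = pm.auto_arima(series, seasonal=True, m=12, trace=True, suppress_warnings=True)
print(model.summary())  # reports the selected (p,d,q)(P,D,Q,m)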
Regarding your question about the ACF and PACF:
The ACF stands for Autocorrelation Function, and the PACF for Partial Autocorrelation Function. Looking at these two plots together can help us form an idea of what models to fit. Autocorrelation computes and plots the autocorrelations of a time series; autocorrelation is the correlation between observations of a time series separated by k time units.
Similarly, partial autocorrelations measure the strength of relationship with other terms being accounted for, in this case the other terms being the intervening lags present in the model. For example, the partial autocorrelation at lag 4 is the correlation at lag 4, accounting for the correlations at lags 1, 2, and 3. To generate these plots in Minitab, we go to Stat > Time Series > Autocorrelation or Stat > Time Series > Partial Autocorrelation. I've generated these plots for our simulated data below:
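If you don't have Minitab, the equivalent plots can be generated in Python with statsmodels (a sketch; series is assumed to be your data):
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=40, ax=ax1)  # candidate MA order q: where the ACF cuts off
plot_pacf(series, lags=40, ax=ax2)  # candidate AR order p: where the PACF cuts off
plt.show()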
Fitting-an-arima-model
As raw data we have measurements m_{i,j}, measured every 30 seconds (i=0, 30, 60, 90,...720,..) for every subject j in the dataset.
I wish to use the TSFRESH package to extract time-series features, such that for a point of interest at time i, features are calculated based on a symmetric rolling window.
We wish to calculate the feature vector of time point i,j based on measurements from 3 hours of context before i and 3 hours after i.
Thus, the 721-dim feature vector represents a point of interest surrounded by 6 hours “context”, i.e. 360 measurements before and 360 measurements after the point of interest.
For every point of interest, features should be extracted based on 721 measurements of m_{i,j}.
I've tried using the rolling_direction parameter in roll_time_series(), but the only options are to roll either backwards or forwards in "time" - I'm looking for a way to include both "past" and "future" data in the feature calculation.
If I understand your idea correctly, it is even possible to do this with only one-sided rolling. Let's try an example:
You want to predict for the time 8:00 - and for this you need the data from 5:00 until 11:00.
If you roll through the data with a window size of 6 h and positive rolling direction, you will end up with a dataset which also includes exactly this part of the data (5:00 to 11:00). Normally, it would be used to train for the value at 11:00 (or 12:00) - but nothing prevents you from using it to predict the value at 8:00.
Basically, it is just a matter of re-indexing.
(The same is actually true for the negative rolling direction.)
A "workaround" solution:
Use the "roll_time_series" function twice; one for "backward" rolling (setting rolling_direction=1) and the second for "forward" (rolling_direction=-1), and then combine them into one.
This will provide, for each time point in the original dataset m_{i,j}$, a time series rolling object with 360 values "from the past" and 360 values "from the future" (i.e., the time point is at the center of the window and max_timeshift=360)
Note to the use of pandas functions below: concat(), sort_values(), drop_duplicates() - which are mandatory for this solution to work.
import numpy as np
import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

# activity_data: the raw measurements; id_column / sort_column: its id and time columns
rolled_backward = roll_time_series(activity_data,
                                   column_id=id_column,
                                   column_sort=sort_column,
                                   column_kind=None,
                                   rolling_direction=1,
                                   max_timeshift=360)

rolled_forward = roll_time_series(activity_data,
                                  column_id=id_column,
                                  column_sort=sort_column,
                                  column_kind=None,
                                  rolling_direction=-1,
                                  max_timeshift=360)

# merge into one dataframe, with a forward and a backward window for every time point (sample)
df = pd.concat([rolled_backward, rolled_forward])

# important! - sort and drop duplicates
df.sort_values(by=[id_column, sort_column], inplace=True)
df.drop_duplicates(subset=[id_column, sort_column, 'activity'], inplace=True, keep='first')
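From here, the usual next step (not shown in the original answer) is feature extraction on the combined windows; a sketch reusing the imported MinimalFCParameters:
from tsfresh import extract_features
features = extract_features(df,
                            column_id=id_column,
                            column_sort=sort_column,
                            default_fc_parameters=MinimalFCParameters())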
Essentially I've got an Excel file with voltage in the first column and time in the second. I want to find the period of the voltage signal: plotted with voltage on the y-axis and time on the x-axis, it is periodic, looking similar to a sine function.
To find the frequency, I have loaded my Excel file into Python, as I think this will make it easier - there may be something I've missed that would simplify this.
So far in python I have:
import xlrd
import numpy as N
import numpy.fft as F
import matplotlib.pyplot as P

wb = xlrd.open_workbook('temp7.xls')  # load the Excel file
wb.sheet_names()
sh = wb.sheet_by_index(0)
first_column = sh.col_values(0)   # voltage values (xlrd columns are 0-indexed)
second_column = sh.col_values(1)  # time values
Now how do I find the frequency from this?
I'm not sure how much you know about the Fourier transform, so forgive me if this is too much background.
Your signal does not have "a frequency"; rather, it can be thought of as the sum of many frequencies. The Fourier transform will tell you the weights of all the frequencies that make up your signal. Unfortunately, information may be lost when sampling from the analog (continuous-time) to the digital (discrete-time) domain. This puts a constraint on the information we can get about frequency, namely that the maximum frequency component we can determine is related to the digital sampling rate (the Nyquist-Shannon criterion):
fs > 2B
where fs is your sampling rate (samples per unit time, typically in Hz or something like it) and B is the maximum frequency of your signal. For example, at 1000 samples per second you can only resolve components up to 500 Hz. If your signal actually contains frequencies above fs/2, they will be "aliased" to some value below fs/2.
For your problem, all you have to do is this:
x = N.array(first_column)
X = F.fft(x)
Now X is the frequency-domain representation of your voltage signal. The corresponding frequency axis covers [0, fs), per the sampling theorem. So, what is fs? You calculate it from the number of samples divided by the total duration of your sampled signal (note your units here):
fs = len(second_column) / (second_column[-1] - second_column[0])  # samples / total duration
Note that this representation of your signal will also (probably) be complex, i.e. each frequency will have an associated amplitude and phase.
Hopefully this helps, and hopefully I didn't cover a bunch of stuff you already knew.
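To actually read off the dominant frequency (and hence the period), here is a minimal sketch building on the arrays above:
freqs = F.fftfreq(len(x), d=1/fs)  # frequency of each FFT bin
mag = N.abs(X)  # magnitude spectrum (phase dropped)
peak = N.argmax(mag[1:len(x)//2]) + 1  # skip the DC bin, positive half only
print("dominant frequency:", freqs[peak])
print("period:", 1/freqs[peak])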