I am trying to use wavelet coefficients as features for a neural network on time series data, and I am a bit confused about how to use them. Do I need to compute the coefficients on the entire time series at once, or use a sliding window? In other words, would computing the coefficients on the entire series at once leak future data points into those coefficients? What is the right approach to using wavelets on time series data without introducing look-ahead bias?
It is hard to provide you with a detailed answer without knowing what you are trying to achieve.
In a nutshell, you first need to decide whether you want to apply a discrete (DWT) or a continuous (CWT) wavelet transform to your time series.
A DWT will allow you to decompose your input data into a set of discrete levels, providing you with information about the frequency content of the signal, i.e., whether it contains high-frequency variations or low-frequency trends. Think of it as applying several band-pass filters to your input data.
I do not think that you should apply a DWT to your entire time series at once. Since you are working with financial data, decomposing your input signal into 1-day windows and applying a DWT to each subset might do the trick. Because each window uses only samples up to its end, this also avoids the look-ahead bias you mention.
In any case, I would suggest:
Installing the pywt toolbox and playing with a dummy time series to understand how wavelet decomposition works (see the sketch after this list).
Checking out the abundant literature available on wavelet analysis of financial data. For instance, if you are interested in financial time series forecasting, you might want to read this paper.
Posting your future questions on the DSP Stack Exchange, unless you have a specific coding-related question.
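As a minimal sketch of a windowed DWT with pywt on a dummy series (the wavelet choice 'db4', window length, and decomposition level are arbitrary placeholders, not recommendations):

```python
import numpy as np
import pywt

# Dummy time series: a slow random-walk trend plus high-frequency noise
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=1000)) + 0.5 * rng.normal(size=1000)

window = 128  # samples per window (placeholder)
features = []
for start in range(0, len(series) - window + 1, window):
    segment = series[start:start + window]
    # wavedec returns [approximation, detail_n, ..., detail_1] coefficients
    coeffs = pywt.wavedec(segment, 'db4', level=3)
    # Flatten the coefficient arrays into one feature vector per window
    features.append(np.concatenate(coeffs))

features = np.asarray(features)
print(features.shape)  # (n_windows, n_coefficients)
```

Each feature vector here is built only from samples inside its own window, so no future information leaks into it.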
My understanding of downsampling is that it is an operation to decrease the sample rate of x by keeping the first sample and then every nth sample after the first.
The example from the documentation of scipy's resample method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html) clearly illustrates this operation.
In an enlarged view, it is evident that the original data points were resampled point by point.
However, in the mne example of downsampling, accessible via https://mne.tools/dev/auto_examples/preprocessing/plot_resample.html, I notice that the data points were not resampled point by point, as illustrated visually there.
This is despite the fact that mne's resample is based on scipy's resample method, as can be seen in the mne source:
https://github.com/mne-tools/mne-python/blob/607fb4613fb5a80dd225132a4a53fe43b8fde0fb/mne/filter.py#L1342
May I know whether this issue is due to the ringing artifacts or due to other problems?
Also, are there remedies to mitigate this problem?
Thanks for any insight; I appreciate it.
The same question has been asked in the mne discussion repo but remains unanswered as of the time of writing.
My understanding of downsampling is that it is an operation to decrease the sample rate of x by keeping the first sample and then every nth sample after the first.
Resampling typically consists of two steps: low-pass filtering to avoid aliasing, then sample rate reduction (subselecting samples from the resulting signal). The low-passing actually changes the values, so the subselection-of-filtered-data step will not necessarily yield points that were "on" the original signal.
May I know whether this issue is due to the ringing artifacts or due to other problems?
In this case it's likely due to the (implicit) low-pass filtering in the frequency-domain resampling of the signal. It looks pretty reasonable to me. If you want to play around with it a bit, you can (a sketch of two of these options follows below the list):
Call scipy.signal.resample directly on your data and see how closely it matches.
Pad your signal, call scipy.signal.resample, and remove the (now reduced-length) padding -- this is what MNE does internally.
Use scipy.signal.resample_poly directly on your data.
Manually low-pass filter and then directly subselect samples from the low-passed signal, which is what resample_poly does internally.
Also, scipy.signal.resample does frequency-domain resampling, so it implicitly uses a brick-wall filter at Nyquist when downsampling (unless you specify something for the window argument, which gets applied in the frequency domain in addition to the effective brick-wall filter).
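For illustration, here is a rough sketch of the first and third options from the list above; the test signal and sample rates are made up:

```python
import numpy as np
from scipy.signal import resample, resample_poly

# Toy signal: 1 second of a 5 Hz sine sampled at 1000 Hz
fs, fs_new = 1000, 250
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 5 * t)

# Frequency-domain resampling (implicit brick-wall filter at Nyquist)
x_fft = resample(x, fs_new)

# Polyphase resampling (FIR low-pass filter, then subselection)
x_poly = resample_poly(x, up=1, down=fs // fs_new)

print(x_fft.shape, x_poly.shape)          # both (250,)
print(np.max(np.abs(x_fft - x_poly)))     # the two methods differ slightly
```

Comparing the two outputs against your original samples should make it clear that neither method simply picks points "on" the original signal.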
P.S. This answer is an extract from the discussion with the folks at mne, namely Eric Larson, Brunner Clemens, and Phillip Alday. Credit should be given to them.
Using Python, I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers, and the time steps are not constant. The model, however, uses constant time steps.
As a first step, I would like to use a statistical method to compare how similar the two light curves are. Which method is best suited for this?
As a second step, I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in an independent piece of software. Basically, the model data depends on four parameters, each limited to a certain range, which I am currently feeding manually to the software (automation is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that might be better answered on https://datascience.stackexchange.com/, rather than something specific to Python.
That said, as a data science layperson, this may be a problem suited for gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the square error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
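A minimal numerical sketch of that loop follows. The model function here is just a hypothetical stand-in for your external simulator (in your case each cost evaluation would call out to that software), and the learning rate and step count are arbitrary:

```python
import numpy as np

def model(params, t):
    # Hypothetical stand-in for the external light-curve simulator
    amp, period, phase, offset = params
    return amp * np.sin(2 * np.pi * t / period + phase) + offset

def mse(params, t, observed):
    # Mean-square-error cost at the known points
    return np.mean((model(params, t) - observed) ** 2)

def gradient_descent(t, observed, params, lr=1e-3, eps=1e-6, steps=5000):
    params = np.asarray(params, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(params)
        base = mse(params, t, observed)
        for i in range(len(params)):
            bumped = params.copy()
            bumped[i] += eps
            # Finite-difference estimate of d(cost)/d(param_i)
            grad[i] = (mse(bumped, t, observed) - base) / eps
        params -= lr * grad  # step downhill on all parameters at once
    return params

t = np.linspace(0, 10, 200)
observed = model([1.2, 3.0, 0.4, 0.1], t)           # synthetic "measurement"
fitted = gradient_descent(t, observed, [1.0, 2.8, 0.0, 0.0])
print(fitted)
```

In practice, scipy.optimize.minimize can do this bookkeeping (and more robust line searches) for you; the loop above is just to show the mechanics.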
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result, meaning the phase and period are the same, the magnitude is of the same order, and even the specular flashes should occur with the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just phase/period/amplitude of each? In this case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
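For example, assuming your curve is uniformly sampled, something like this would pull out the dominant frequency, amplitude, and phase (the sample rate and test signal are made up):

```python
import numpy as np

fs = 100.0                                          # samples per second (placeholder)
t = np.arange(0, 10, 1 / fs)
signal = 2.0 * np.sin(2 * np.pi * 0.7 * t + 0.3)    # toy light curve

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

peak = np.argmax(np.abs(spectrum[1:])) + 1          # skip the DC bin
amplitude = 2 * np.abs(spectrum[peak]) / len(signal)
phase = np.angle(spectrum[peak])                    # phase relative to a cosine

print(freqs[peak], amplitude, phase)                # ~0.7 Hz, ~2.0
```

Note that the FFT assumes uniform sampling, so for your gappy, unevenly sampled measurement you would need to interpolate first or use a method designed for uneven sampling.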
I am trying to remove muscle artifacts from an EEG signal corresponding to an epileptic patient. For that, I used the FastICA method in Python. The figure below represents the independent components:
[Figure: independent components extracted by FastICA]
Unfortunately, I could not distinguish the components corresponding to the artifacts. Is there a way to help me know which components to remove?
First of all, you should know what an EEG signal looks like. In the attached picture, I think ICA21, ICA7 and ICA2 are completely noisy components.
Not sure if this is possible given the data that you have, but one possibility is to frame it as a supervised problem. Say you have a few epileptic patients' EEGs and a few from non-epileptic patients. You can apply an ICA decomposition to the whole dataset, and then use each component by itself as a feature vector (maybe discretizing it) to predict the class (i.e., epileptic vs. non-epileptic).
The noise components should have no predictive value, so you might be able to find that a cluster of components has a (statistically) significantly higher predictive value than another. This will require manually looking at the accuracy value of each component and making a subjective decision, but maybe it can help as an exploratory analysis.
Of course, this only works if you have data from multiple patients.
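A rough sketch of that idea with scikit-learn, where the data shapes, number of components, and labels are all hypothetical placeholders:

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical dataset: 200 EEG windows, flattened to one row each
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))       # placeholder EEG windows
y = rng.integers(0, 2, size=200)      # 0 = non-epileptic, 1 = epileptic (placeholder)

ica = FastICA(n_components=20, random_state=0)
sources = ica.fit_transform(X)        # shape (200, 20)

# Score each component on its own as a predictor of the class
for i in range(sources.shape[1]):
    component = sources[:, [i]]
    acc = cross_val_score(LogisticRegression(), component, y, cv=5).mean()
    print(f"component {i}: accuracy {acc:.2f}")
```

Components whose accuracy stays at chance level would be candidates for the noise cluster.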
I am studying the correlation between a set of input variables and a response variable, price. These are all time series.
1) Is it necessary that I smooth out the curve where the input variable is cyclical (autoregressive)? If so, how?
2) Once a correlation is established, I would like to quantify exactly how the input variable affects the response variable.
Eg: "Once X increases >10% then there is an 2% increase in y 6 months later."
Which python libraries should I be looking at to implement this - in particular to figure out the lag time between two correlated occurrences?
I have already looked at statsmodels.tsa.ARMA, but it seems to deal with predicting only one variable over time. In scipy, the covariance matrix can tell me about the correlation, but it does not help with figuring out the lag time.
While part of the question is more statistics based, the bit about how to do it in Python seems at home here. I see that you've since decided to do this in R from looking at your question on Cross Validated, but in case you decide to move back to Python, or for the benefit of anyone else finding this question:
I think you were in the right area looking at statsmodels.tsa, but there's a lot more to it than just the ARMA package:
http://statsmodels.sourceforge.net/devel/tsa.html
In particular, have a look at statsmodels.tsa.vector_ar for modelling multivariate time series. The documentation for it is available here:
http://statsmodels.sourceforge.net/devel/vector_ar.html
The page above specifies that it's for working with stationary time series - I presume this means removing both trend and any seasonality or periodicity. The following link is ultimately readying a model for forecasting, but it discusses the Box-Jenkins approach for building a model, including making it stationary:
http://www.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture16_TS3.pdf
You'll notice that link discusses looking for autocorrelations (ACF) and partial autocorrelations (PACF), and then using the Augmented Dickey-Fuller test to test whether the series is now stationary. Tools for all three can be found in statsmodels.tsa.stattools. Likewise, statsmodels.tsa.arma_process has ACF and PACF.
The above link also discusses using metrics like AIC to determine the best model; both statsmodels.tsa.var_model and statsmodels.tsa.ar_model include AIC (amongst other measures). The same measures seem to be used for calculating lag order in var_model, using select_order.
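To make that concrete, here is a minimal sketch of the workflow with statsmodels; the two synthetic series are placeholders for your input variable and price:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller

# Placeholder data: y lags x by 6 steps; difference to make them stationary
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))
y = np.r_[np.zeros(6), x[:-6]] + rng.normal(size=500)
data = pd.DataFrame({'x': x, 'y': y}).diff().dropna()

# Augmented Dickey-Fuller: a small p-value suggests the series is stationary
for col in data:
    print(col, adfuller(data[col])[1])

model = VAR(data)
print(model.select_order(maxlags=12).summary())   # AIC/BIC per candidate lag order
results = model.fit(maxlags=12, ic='aic')
print(results.summary())
```

The lag order chosen by select_order is exactly the kind of "lag time between two correlated occurrences" you are after.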
In addition, the pandas library is at least partially integrated into statsmodels and has a lot of time series and data analysis functionality itself, so will probably be of interest. The time series documentation is located here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
I'm working on a project to find the instantaneous frequency of a multicomponent audio signal in Python. I am currently using a Butterworth bandpass filter combined with scipy.signal.lfilter to extract around my desired frequency region. I then use the analytic signal (from scipy.signal.hilbert) to get the instantaneous phase, which can be unwrapped to give frequency.
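A stripped-down version of what I am doing looks like this (the sample rate, band edges, filter order, and test chirp are made up for illustration):

```python
import numpy as np
from scipy.signal import butter, lfilter, hilbert

fs = 8000.0                                      # sample rate (placeholder)
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * (700 + 50 * t) * t)       # toy chirp, ~700-800 Hz

# Butterworth bandpass around the frequency region of interest
b, a = butter(4, [600 / (fs / 2), 800 / (fs / 2)], btype='bandpass')
filtered = lfilter(b, a, x)

# Analytic signal -> unwrapped phase -> instantaneous frequency
analytic = hilbert(filtered)
phase = np.unwrap(np.angle(analytic))
inst_freq = np.diff(phase) * fs / (2 * np.pi)    # in Hz
print(inst_freq.mean())
```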
As a relative novice to signal processing, I have two main questions:
I have read that in many applications, it is preferable to use scipy.signal.filtfilt over scipy.signal.lfilter. Certainly when I apply filtfilt to my data, I get a significantly smoother looking instantaneous frequency signal. I would like to know the major differences between the two, bearing in mind that I would like to get an output which is as close to the "true" instantaneous frequency as possible.
The instantaneous frequency data is nonstationary, which means that in some instances I must use a wider bandpass filter to capture all my desired data. This appears to introduce additional noise, and occasional instabilities, into my signal. Are there ways to deal with these kinds of problems, for example with a better-designed filter?
EDIT
In response to flebool, below are some images of the data I am looking at. First, a comparison of filt and filtfilt:
Both the above signals have had the same Butterworth filter applied (although the filter function is different), followed by extraction of instantaneous frequency (which is what is plotted, as a function of time). filtfilt seems to shift and smooth the data. Is one of these signals a better approximation of the "true" signal?
Note that this graph shows just a subset of a particular signal.
Second, the effect of increasing the Butterworth filter size:
This is for the same subset of data as figure 1. The legend shows the lower and upper bound for the filter, respectively (the red trace is the filt version of the data in figure 1).
Although it may not be clear here why I would use a larger pass band, in some instances the data may be located at various points between, say, 600 and 800 Hz. It is here that I would require a broader filter design. You can see that additional noise enters the trace as the filter gets wider; I'd like to know if there's a way to optimise/improve my filter design.
Some sparse comments:
1) On the top picture: I can't comment on what is best between filt and filtfilt, though the shift in frequency of filtfilt is worrying. You can obtain a similar result by applying a low-pass filter to the filt signal.
2) There isn't a "true" instantaneous frequency, unless the signal was specifically generated with a certain tone. In my experience unwrapping the phase of the Hilbert transform does a good job in many cases, though. It becomes less and less reliable as the ratio of noise to signal intensity grows.
3) Regarding the bottom picture, you say that sometimes you need a large bandpass filter. Is this because the signal is very long, and the instantaneous frequency moves around between 500 and 800 Hz? If so, you may want to window the signal to a length at which the filtered signal has a distinct peak in the Fourier spectrum, extract that peak, tailor your bandpass filter to that peak, apply the Hilbert transform to the windowed signal, extract the phase, and filter the phase.
This is worth doing if you are sure the signal contains harmonics other than noise and the one you are interested in, and it takes a while. Before doing so, I would want to be sure the data I am obtaining is wrong.
If it is simply one harmonic plus noise, I would low-pass, apply the Hilbert transform, extract the instantaneous phase, and then low-pass again on the instantaneous phase.
I can't speak intelligently on your first problem, but scipy is generally well documented, so I'd start by reading through its documentation.
On your second problem, a better-designed filter would certainly help. You say the data is "non-stationary": do you know where it will be, or what kind of frequencies it might occupy? For example, if the signal is centred around one of three frequencies that you know a priori, you could have three different filters and run the signal through all three (only one giving you the output you want, of course).
If you don't have that kind of knowledge about the signal, I would first apply a wider BPF, then do some peak detection, and apply a more stringent BPF once you know where the data you want is located, as sketched below.
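As a sketch of that wide-filter-then-refine approach (the sample rate, band edges, bandwidths, and test signal are all made up):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 8000.0
t = np.arange(0, 1, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 700 * t) + 0.3 * rng.normal(size=len(t))  # toy tone + noise

def bandpass(sig, lo, hi, fs, order=4):
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype='bandpass')
    return filtfilt(b, a, sig)

# Step 1: wide filter over the whole plausible range
wide = bandpass(x, 500, 900, fs)

# Step 2: locate the spectral peak of the wide-filtered signal
spectrum = np.abs(np.fft.rfft(wide))
freqs = np.fft.rfftfreq(len(wide), d=1 / fs)
peak = freqs[np.argmax(spectrum)]

# Step 3: re-filter with a narrow band centred on the detected peak
narrow = bandpass(x, peak - 25, peak + 25, fs)
print(f"peak found at {peak:.1f} Hz")
```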