I'm only asking this question because I've recently stumbled upon some clever code that I would never have thought of on my own. The code I refer to uses the numpy Python library. It converts the signal to a true/false array based on whether the signal is above a threshold, then generates an array of indices that align with the middle of each bit, then reshapes the data into groups of 8. It takes half a dozen lines of code to analyze thousands of points of data. I've written code that does similar things, but it walks through the entire dataset using for loops, looking for edges and then converting those edges to bits, and it takes literally hundreds of lines of code.
Pictured is an example of a dataset I'm trying to analyze. The beginning always has a preamble of 8 bits that are the same. I want to extract what the period of the signal is using the preamble.
Are there any methods for doing so in python without painstakingly looking for edges?
import numpy as np

# x is the sampled signal, limit is the jump size that counts as a transition
# Find the transitions
edges = np.abs(x[:-1] - x[1:]) > limit
# Find the indices where the transitions occur
indices = np.where(edges)[0]
# Count the elements between consecutive transition indices
counts = indices[1:] - indices[:-1]
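Building on that, here is a minimal sketch of extracting the bit period from the preamble without walking the data in a loop. It assumes `x` is the sampled signal, `threshold` separates high from low, and the preamble produces regularly spaced transitions; the names and the use of the first eight spacings are illustrative, not part of the original code.

import numpy as np

# Hypothetical sketch: threshold, locate transitions, take the median spacing in the preamble
high = x > threshold                   # True/False array: is the line above the threshold?
edges = np.flatnonzero(np.diff(high))  # sample indices where the level flips
spacings = np.diff(edges)              # samples between consecutive transitions
period = np.median(spacings[:8])       # bit-period estimate from the preamble region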
I have spent the best part of the last few days searching forums and reading papers trying to solve the following question. I have thousands of time series arrays, each of varying length, containing a single column vector. This column vector contains the time between clicks for dolphins using echolocation.
I have managed to cluster these into similar groups using DTW and want to check which trains have a high degree of self-similarity, i.e. repeated patterns. I only want to know the similarity of each train with itself and don't care to compare them with other trains, as I have already applied DTW for that. I'm hoping some of these clusters will contain trains with a high proportion of repeated patterns.
I have already applied the Ljung–Box test to each series to check for autocorrelation, but I think I should maybe be using something with the FFT and the power spectrum. I don't have much experience in this but have tried to do so using the Python package waipy. Ultimately, I just want to know if there is some kind of repeated pattern in the data, ideally tested with a p-value. The image I have attached shows an example train across the top. The maximum length of my trains is 550.
example output from Waipy
I know this is quite a complex question, but any help would be greatly appreciated, even if it is a link to a helpful Python library.
Thanks,
Dex
For anyone in a similar position: I decided to go with motifs, as they are able to find a repeated pattern in a time series using Euclidean distance. There is a really good package in Python called Stumpy which makes this very easy!
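As a rough illustration of that route, here is a minimal sketch of motif discovery with Stumpy's matrix profile. The window length m and the random placeholder series are assumptions you would replace with a real click train and a pattern size that makes sense for your data.

import numpy as np
import stumpy

train = np.random.rand(550)        # placeholder for one inter-click-interval series
m = 25                             # subsequence window length (an assumption to tune)
mp = stumpy.stump(train, m)        # matrix profile: column 0 = distance to nearest neighbour, column 1 = its index
motif_idx = np.argmin(mp[:, 0])    # subsequence with the closest repeat elsewhere in the train
neighbour_idx = mp[motif_idx, 1]   # where that closest repeat starts
print(motif_idx, neighbour_idx, mp[motif_idx, 0])

A low matrix-profile value at motif_idx means the train contains a closely repeated pattern of length m somewhere else in the same train.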
Thanks,
Dex
I have some large datasets of sensor values, each consisting of a single sensor value sampled at a one-minute interval, like a waveform. The total dataset spans a few years.
I wish to (using Python) enter/select an arbitrary stretch of sensor data (for instance 600 values, so 10 hours' worth of data) and find all timestamps where roughly the same shape occurred in these datasets.
The matches should be made by shape (relative differences), not by actual values, as there are different sensors used with different biases and environments. Also, I wish to retrieve multiple matches within a single dataset, to further analyse.
I’ve been looking into pandas, but I’m stuck at the moment... any guru here?
I don't know much about the functionalities available in Pandas.
I think you need to first decide the typical time span T over which the correlation is supposed to occur. What I would do is split all your time series into (possibly overlapping) segments of duration T using NumPy (see here for instance). This will lead to a long list of segments. I would then compute the correlation between all pairs of segments using e.g. corrcoef. You get a large correlation matrix in which you can spot the pairs of similar segments by applying a threshold to the absolute value of the correlation. You can estimate the correct threshold by applying this algorithm to a data set where you don't expect any correlation, or by randomizing your data.
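A minimal sketch of that idea in NumPy. The segment length T, the 50% overlap step, the 0.9 threshold, and the name `values` (one sensor series) are all illustrative assumptions.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

T = 600                                            # segment length: 10 hours at one sample per minute
step = 300                                         # 50% overlap between segments
segments = sliding_window_view(values, T)[::step]  # (n_segments, T) array of overlapping segments
corr = np.corrcoef(segments)                       # correlation between every pair of segments
i, j = np.where(np.triu(np.abs(corr), k=1) > 0.9)  # indices of segment pairs above the threshold

For very long series you would probably compare segments in blocks rather than build the full correlation matrix in one go.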
I have some trouble understanding the output of the Discrete Cosine Transform.
Background:
I want to achieve a simple audio compression by saving only the most relevant frequencies of a DCT. In order to be somewhat general, I would cut several audio tracks into pieces of a fixed size, say 5 seconds.
Then I would do a DCT on each sample and find out which are the most important frequencies among all short snippets.
This, however, does not work, which might be due to my misunderstanding of the DCT. See for example the images below:
The first image shows the DCT of the first 40 seconds of an audio track (I wanted to make it long enough to get a good mix of frequencies).
The second image shows the DCT of the first ten seconds.
The third image shows the DCT of the first 40 seconds concatenated with their reverse (like abc -> abccba).
I added a vertical mark at 2e5 for comparison. The sample rate of the music is the usual 44.1 kHz.
So here are my questions:
What is the frequency that corresponds to an individual value of the DCT output vector? Is it bin/2? For example, if I have a spike at bin = 10000, which real-world frequency does this correspond to?
Why does the first plot show strong amplitudes for so many more frequencies than the second? My intuition was that the DCT would yield values for all frequencies up to 44.1 kHz (so bin number 88.2k if my assumption in #1 is correct), only that the scale of the spikes would be different, which would then make up the difference in the music.
Why does the third plot show strong amplitudes for more frequencies than the first does? I thought that by concatenating the data, I would not get any new frequencies.
As the DCT and FFT/DFT are very similar, I tried to learn more about the Fourier transform (this and this helped), but apparently it didn't suffice.
Figured it out myself, and it was indeed written in the link I posted in the question. The frequency that corresponds to a certain bin_id is given by (bin_id * freq/2) / (N/2), which essentially boils down to bin_id * 1/t with N = freq * t. This means that the plots just have different granularities. So if plot #1 has a peak at position x, plot #2 will likely show a peak at x/4 and plot #3 at x*2.
The image below shows the data of plot #1 stretched to twice its size (in blue) and the data of plot #3 in yellow.
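Here is a small sketch of the granularity point with SciPy, using a made-up 1 kHz test tone and two made-up window lengths: the peak of a pure tone lands at a bin index proportional to the length of the analyzed snippet, so the 40-second DCT places it at roughly four times the bin index of the 10-second DCT.

import numpy as np
from scipy.fft import dct

fs = 44100           # sample rate in Hz
f0 = 1000.0          # hypothetical 1 kHz test tone

def peak_bin(seconds):
    n = np.arange(int(fs * seconds))
    tone = np.cos(2 * np.pi * f0 * n / fs)
    return int(np.argmax(np.abs(dct(tone, type=2))))

print(peak_bin(10), peak_bin(40))   # the second peak bin is about four times the first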
I have a huge sequence (1,000,000) of small matrices (32x32) stored in an HDF5 file, each one with a label.
Each of these matrices represents sensor data for a specific time.
I want to obtain the evolution of each pixel over a small time slice, different for each x,y position in the matrix.
This is taking more time than I expect.
def getPixelSlice(self, xpixel, ypixel, initphoto, endphoto):
    # Obtain h5 keys inside the time range between initphoto and endphoto
    valid = np.where(np.logical_and(self.photoList >= initphoto, self.photoList < endphoto))
    # Look at pixel data in the valid frames:
    # for each valid frame, obtain the data and append the target pixel to the list.
    evolution = []
    for frame in valid[0]:
        data = self.h5f[str(self.photoList[frame])]
        evolution.append(data[ypixel][xpixel])
    return evolution, valid
So, there is a problem here that took me a while to sort out for a similar application. Due to the physical limitations of hard drives, the data are stored in such a way that with a three-dimensional array it will always be easier to read in one orientation than another. It all depends on what order you stored the data in.
How you handle this problem depends on your application. My specific application can be characterized as "write few, read many". In this case, it makes the most sense to store the data in the order that I expect to read it. To do this, I use PyTables and specify a "chunkshape" that is the same as one of my time series. So, in your case it would be (1, 1, 1000000). I'm not sure if that size is too large, though, so you may need to break it down a bit further, say (1, 1, 10000) or something like that.
For more info see PyTables Optimization Tips.
For applications where you intend to read in a specific orientation many times, it is crucial that you choose an appropriate chunk shape for your HDF5 arrays.
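A minimal sketch of what that looks like with PyTables; the file name, array layout, and the (1, 1, 10000) chunkshape are placeholders to adapt to the slices you actually read.

import tables

# Hypothetical layout: (y, x, time), chunked so one pixel's time series is stored contiguously.
with tables.open_file("sensor.h5", mode="w") as h5f:
    frames = h5f.create_carray(
        h5f.root, "frames",
        atom=tables.Float32Atom(),
        shape=(32, 32, 1000000),
        chunkshape=(1, 1, 10000),   # each chunk holds one pixel's run of 10000 time steps
    )

# Later, reading one pixel's slice touches only a handful of chunks:
with tables.open_file("sensor.h5", mode="r") as h5f:
    series = h5f.root.frames[5, 7, 1000:2000]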
Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each of them to match the length of the time series seems very wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as necessary for its length to match the length of the time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing the x-, y-, and z-locations of several motion-capture markers at time intervals of 10 ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd

subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=[i for i, _ in enumerate(subject_trials)])
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
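To make the unbiased-groupby part concrete, here is a minimal sketch assuming hypothetical column names `difficulty` and `correct` and the trial number as level 0 of the MultiIndex: collapsing to one row per trial before grouping avoids weighting each trial by the length of its time series.

# One row per trial: the single-valued columns are constant within a trial,
# so taking the first row of each trial recovers the per-trial table.
per_trial = subject_df.groupby(level=0)[["difficulty", "correct"]].first()

# Proportion of correct responses at each difficulty level, unweighted by trial length.
prop_correct = per_trial.groupby("difficulty")["correct"].mean()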