I am trying to generate captions for time series data based on increasing/decreasing values.
I have a column of values that change gradually over time (the time horizon is irrelevant for now). When we visualize such data as a graph, we make comments like "increasing", "decreasing", "steep curve", etc.
I am looking into what libraries are available for this. Has any research been done on graph captioning / time series captioning?
Currently I am experimenting with pyts, exploring its quantization options by converting the values into small clusters and analysing the resulting bag of words.
As of now, it simply looks for the point where the values change direction from increasing to decreasing, stops there, generates "increasing" for the previous segment, and moves on. This approach scales poorly.
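For illustration, here is a minimal sketch of this direction-change approach (plain NumPy, with made-up names, not the actual code):

import numpy as np

def _label(d):
    return "increasing" if d > 0 else "decreasing" if d < 0 else "flat"

def caption_segments(values):
    """Label maximal runs where the series keeps moving in one direction."""
    values = np.asarray(values, dtype=float)
    direction = np.sign(np.diff(values))   # +1 up, -1 down, 0 flat, per step
    captions, start = [], 0
    for i in range(1, len(direction)):
        if direction[i] != direction[i - 1]:   # direction change: close the segment
            captions.append((start, i, _label(direction[i - 1])))
            start = i
    if len(direction):
        captions.append((start, len(values) - 1, _label(direction[-1])))
    return captions

print(caption_segments([1, 2, 4, 3, 2, 2.5, 3]))
# [(0, 2, 'increasing'), (2, 4, 'decreasing'), (4, 6, 'increasing')]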
Looking for guidance, suggestions, resources.
I'm in the process of collecting O2 data for work. The data shows periodic behavior. I would like to parse out each repetition in order to get statistical information such as the average and the theoretical error (see the data figure).
Is there a convenient way to do this programmatically:
1. Identify the cyclical data?
2. Pick out starting and ending indices so that the repeating cycles can be concatenated, post-processed, etc.?
I had a few ideas, but I lack the Python programming experience:
1. Brute force: condition the data in Excel beforehand. (I will likely collect similar data in the future, so I would like a more robust method.)
2. Train a neural network to identify a cycle and output its indices. (Limited training set; I would have to label it.)
3. Decompose into trend/seasonal components, apply a Fourier series to the seasonal data, and pick out N cycles.
4. Heuristically, i.e. identify rate-of-change thresholds and detect events (difficult due to the secondary hump, please see the data). A rough sketch of this idea appears below, after the sample data.
Is there a Python program that systematically does this for me? Any help would be greatly appreciated.
Sample Data
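For reference, a minimal sketch of idea 4 above, using scipy.signal.find_peaks to mark cycle boundaries; the distance and prominence values are placeholders that would need tuning on the real data so that the secondary hump is not picked up as a separate peak:

import numpy as np
from scipy.signal import find_peaks

def split_cycles(signal, min_distance=50, prominence=None):
    """Split a periodic 1-D signal into cycles, using its peaks as boundaries.
    Returns a list of (start_index, end_index) pairs, one per cycle."""
    signal = np.asarray(signal, dtype=float)
    peaks, _ = find_peaks(signal, distance=min_distance, prominence=prominence)
    return list(zip(peaks[:-1], peaks[1:]))

# cycles = [signal[s:e] for s, e in split_cycles(signal, min_distance=100)]
# resample the cycles to a common length, then average and compute the spread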
I have some large datasets of sensor values, each consisting of a single sensor value sampled at a one-minute interval, like a waveform. The total dataset spans a few years.
I wish to (using Python) select an arbitrary stretch of sensor data (for instance 600 values, i.e. 10 hours' worth of data) and find all timestamps in these datasets where roughly the same shape occurred.
The matches should be made by shape (relative differences), not by actual values, as different sensors are used with different biases and environments. Also, I wish to retrieve multiple matches within a single dataset, for further analysis.
I’ve been looking into pandas, but I’m stuck at the moment... any guru here?
I don't know much about the functionality available in pandas.
I think you first need to decide the typical time span T over which the correlation is supposed to occur. What I would do is split all your time series into (possibly overlapping) segments of duration T using NumPy (see here for instance).
This will lead to a long list of segments. I would then compute the correlation between all pairs of segments using e.g. corrcoef. You get a large correlation matrix in which you can spot the pairs of similar segments by applying a threshold to the absolute value of the correlation. You can estimate a suitable threshold by applying this algorithm to a dataset where you don't expect any correlation, or by randomizing your data.
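A rough sketch of that recipe (the segment length, step, and threshold are placeholders to tune):

import numpy as np

def find_similar_segments(series, segment_len, step, threshold=0.9):
    """Return index pairs of overlapping segments whose Pearson correlation
    exceeds `threshold` in absolute value."""
    series = np.asarray(series, dtype=float)
    starts = np.arange(0, len(series) - segment_len + 1, step)
    segments = np.stack([series[s:s + segment_len] for s in starts])
    corr = np.corrcoef(segments)                     # one row/column per segment
    hits = np.argwhere(np.triu(np.abs(corr) > threshold, k=1))
    return [(starts[i], starts[j]) for i, j in hits]

(Constant segments produce NaN correlations, so you may want to drop them first.)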
I have a dataset containing the position of a person walking in an indoor environment at given times.
I don't have any information about the environment, just the dataset.
The table is structured like this:
(ID, X, Y, time)
where ID is the primary key, X and Y are the coordinates, and time is the timestamp.
The data is gathered at a rate of 1 sample every 0.2 seconds.
Before I start any analysis of the path, speeds, etc., I'd like to remove the noise from the dataset, but I'm not sure which approach to use.
I've read about using clustering functions like DBSCAN, and for suitable parameters it seems to do something, but since it is a density-based clustering method I don't feel it is the best solution. On the other hand, ST-DBSCAN takes time into account, so it seems more appropriate, but it is still density-based.
Is there a better way to filter noise in a context like this or is DBSCAN the right approach?
If you think of your data as a 2-dimensional time series, then it makes sense to apply one of the algorithms listed here: https://github.com/rob-med/awesome-TS-anomaly-detection
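As a very simple baseline in that spirit (not one of the packaged algorithms from the list), you could flag samples that deviate strongly from a rolling median of each coordinate; the window size and distance threshold below are assumptions you would tune to your data:

import pandas as pd

def flag_position_outliers(df, window=11, max_jump=1.5):
    """Mark samples whose X/Y deviate from a centred rolling median by more
    than `max_jump` (in the same units as the coordinates)."""
    med = df[["X", "Y"]].rolling(window, center=True, min_periods=1).median()
    deviation = ((df[["X", "Y"]] - med) ** 2).sum(axis=1) ** 0.5
    return deviation > max_jump          # boolean Series: True = likely noise

# df = pd.read_csv("positions.csv")      # columns: ID, X, Y, time
# clean = df[~flag_position_outliers(df)]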
I am trying to determine the condition of a wireless channel by analysing captured I/Q samples. I have 50,000 data samples and, as shown in the attached figure, there are spikes in the graph whenever there is activity (e.g. a data transmission) on the channel. I am trying to count the number of spikes, i.e. data values higher than a threshold.
I need an accurate estimate of the threshold; with that I can find the channel load. The threshold value in the attached figure is around 0.0025, and it should be noted that it varies over time. So, each time I take 50,000 samples, I first have to find the threshold value using some sort of unsupervised learning.
I tried k-means (in Python, scikit-learn) to cluster the data and find the centroids of the estimated clusters, but it does not give me a good estimate of the threshold value (especially when there is no activity on the channel and the channel is idle).
I would like to know whether anyone has prior experience with similar topics.
Captured data
Since the idle noise seems relatively consistent and very different from when data is transmitted, I can think of several simple algorithms which could give you a reasonable threshold in an unsupervised manner.
The most direct method would be to sort the values (perhaps after grouping them into buckets), then find the lowest-valued region in which a large enough proportion (at least ~5%) of the values falls. Take a reasonable margin above the highest of those values (50%?) and you should be good to go.
You'll need to fiddle with the thresholds a bit. I'd collect sample data and tweak the values until I get it working 100% of the time and the values used make sense.
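One possible reading of that recipe as code (the 5% idle fraction and 50% margin are the placeholder values mentioned above):

import numpy as np

def estimate_threshold(samples, idle_fraction=0.05, margin=1.5):
    """Estimate a spike threshold from raw magnitude samples, assuming the
    lowest `idle_fraction` of the sorted values is pure idle noise."""
    samples = np.sort(np.abs(np.asarray(samples, dtype=float)))
    idle_top = samples[int(len(samples) * idle_fraction)]   # upper edge of the idle region
    return margin * idle_top

# threshold = estimate_threshold(iq_magnitudes)
# channel_load = np.mean(iq_magnitudes > threshold)   # fraction of samples above threshold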
Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable-length time series. For instance, each trial will have eye-movement data sampled at 1 kHz (so we get the time of acquisition, and the x and y data at that time point). Because each trial has a different total duration, the length of these time series differs across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to the trial number and level 1 corresponds to the time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each of them to match the length of the time series seems wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby-type analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as needed to match the length of the time series on that trial (which is irrelevant to computing the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing the x-, y-, and z-locations of several motion-capture markers at 10 ms intervals, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as motivation for learning about pandas, so I'm certainly not "fluent" with it yet. But I have found it incredibly convenient to be able to concatenate the data frames for the individual trials into a single larger frame for, e.g., one subject:
import pandas as pd
# one frame per trial CSV, stacked with the trial number as index level 0
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=range(len(subject_trials)))
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates when doing, e.g., trial-outcome analysis, it's really straightforward:
df.outcome.unique()
assuming your data frame has an "outcome" column.
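If the trial number is level 0 of the MultiIndex (as with the concat above), you can also recover one row per trial, which avoids the bias you were worried about when averaging; "difficulty" is just an assumed column name here:

# one row per trial, assuming the trial id is index level 0 of subject_df
per_trial = subject_df.groupby(level=0).first()

# e.g. proportion of correct responses per difficulty level,
# assuming "outcome" is coded as 0/1 or boolean
accuracy = per_trial.groupby("difficulty").outcome.mean()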