I'm looking for a neat way to detect particular events in time series data.
In my case, an event might consist of a value changing by more than a certain amount from one sample to the next, or it might consist of a sample being (for example) greater than a threshold while another parameter is less than another threshold.
For example, imagine a time series list in which I've got three parameters: a timestamp, some temperature data and some humidity data:
time_series = []
# time, temp, humidity
time_series.append([0.0, 12.5, 87.5])
time_series.append([0.1, 12.8, 92.5])
time_series.append([0.2, 12.9, 95.5])
A useful time series would of course be much longer than this.
I could loop through this data, checking each row (and potentially the previous row) against my criteria, but I'm wondering whether there's a neat library or technique for searching time series data for particular events, especially where an event is defined as a function of several contiguous samples, or of samples in more than one column.
Does anyone know of such a library or technique?
You might like to investigate pandas, which includes time series tools; see this pandas doc.
I think that what you are trying to do is take "slices" through the data. [This link on earthpy.org](http://earthpy.org/pandas-basics.html) has a nice introduction to using time series data with pandas, and if you follow down through the examples it shows how to take out slices, which I think would correspond to pulling out parameters that exceed thresholds, etc. in your data.
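As a rough illustration of that idea, here is a minimal sketch using boolean masks in pandas. The first three rows come from the question; the last two rows, the threshold values and the variable names are assumptions made up for this example so that each kind of event actually occurs.

```python
import pandas as pd

# First three rows are from the question; the last two are hypothetical.
time_series = [
    [0.0, 12.5, 87.5],
    [0.1, 12.8, 92.5],
    [0.2, 12.9, 95.5],
    [0.3, 14.6, 96.0],   # hypothetical: temperature jumps by about 1.7
    [0.4, 14.7, 78.0],   # hypothetical: warm and comparatively dry
]
df = pd.DataFrame(time_series, columns=["time", "temp", "humidity"]).set_index("time")

# Assumed thresholds, chosen only for illustration.
MAX_STEP = 1.0
TEMP_MIN = 14.0
HUMIDITY_MAX = 80.0
HUMIDITY_RUN = 90.0

# Event 1: temperature changes by more than MAX_STEP between consecutive samples.
jumps = df[df["temp"].diff().abs() > MAX_STEP]

# Event 2: a condition across two columns in the same sample.
warm_and_dry = df[(df["temp"] > TEMP_MIN) & (df["humidity"] < HUMIDITY_MAX)]

# Event 3: a function of several contiguous samples, here the mean humidity
# over a window of 3 samples (the row marks the end of each window).
humid_run = df[df["humidity"].rolling(3).mean() > HUMIDITY_RUN]

print(jumps, warm_and_dry, humid_run, sep="\n\n")
```

Each mask is an ordinary boolean Series, so conditions can be combined with `&` and `|` before indexing, which avoids writing the row-by-row loop by hand.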
Related
I have two lists: one is the simulated data and the other is the observed data. I want to check whether my simulated data is really similar to the real data by measuring the distance between them using AIC. How can I do it?
For example:
obs=[51,12,13,47,45]
smlt=[21,34,14,45,47]
These two data sets are based on time, meaning obs[0] and smlt[0] are both at time 1, and so on. Thanks for the help!
I tried to use SSD and absolute difference, but I don't think they can help in this case, as they don't really account for the variation over time.
I have a multithreaded simulation code doing some calculation in discrete time steps. So for each time stamp I have a set of variables which I want to store and access later on.
Now my question is:
What is a good (or the best) data structure, given the following conditions?
It has to be thread-safe (so I guess an ordered dictionary might be best here). I have one thread writing data; the other threads only read it.
I want to find data later given a time interval which is not necessarily a multiple of the time-step size.
E.g.: I simulate values from t=0 to t=10 in steps of 1. If I get a request for all data in the range of t=5.6 to t=8.1, I want to get the simulated values such that the requested times are within the returned time range. In this case all data from t=5 to t=9.
The time-step size can vary from run to run. It is constant within a run, so each data set has a consistent time-step size, but I might want to restart the simulation with a finer time resolution.
The number of time stamps calculated might be rather large (maybe up to a million).
From searching the net I get the impression that some tree-like structure implemented as a dictionary might be a good idea, but I would also need some kind of iterator or index to go through the data, since I always want to fetch data from time intervals. I have no real idea what something like that could look like ...
There are posts about finding the key in a dictionary closest to a given value, but these always involve looking at all the keys in the dictionary, which might not be so cool for a million keys (that is how I feel, at least).
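To make the interval requirement concrete, here is a minimal sketch of one possible approach, not a definitive answer: keep the timestamps in a sorted list and use binary search from the standard `bisect` module, so a lookup does not scan all keys. The class name `TimeSeriesStore`, its methods and the stored values are hypothetical, and the sketch assumes the single writer appends timestamps in strictly increasing order. A plain `threading.Lock` guards both reads and writes.

```python
import bisect
import threading


class TimeSeriesStore:
    """Append-only store; timestamps must arrive in strictly increasing order."""

    def __init__(self):
        self._times = []
        self._values = []
        self._lock = threading.Lock()

    def append(self, t, value):
        with self._lock:
            if self._times and t <= self._times[-1]:
                raise ValueError("timestamps must be strictly increasing")
            self._times.append(t)
            self._values.append(value)

    def query(self, t_start, t_end):
        """Return (time, value) pairs whose time range encloses [t_start, t_end]."""
        with self._lock:
            # Index of the last sample at or before t_start (clamped to the first).
            lo = max(bisect.bisect_right(self._times, t_start) - 1, 0)
            # Index of the first sample at or after t_end (clamped to the last).
            hi = min(bisect.bisect_left(self._times, t_end), len(self._times) - 1)
            return list(zip(self._times[lo:hi + 1], self._values[lo:hi + 1]))


store = TimeSeriesStore()
for step in range(11):                      # t = 0, 1, ..., 10 as in the example
    store.append(float(step), {"x": step * step})

result = store.query(5.6, 8.1)
times = [t for t, _ in result]
print(times)
```

Because the lookup only relies on the timestamps being sorted, it works unchanged when a later run uses a different step size. Each query costs two binary searches plus the size of the returned slice.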
I am trying to decompose a Time Series, however my data does not have Dates, it is composed of entries taken at regular (and unknown) time intervals.
This solution is great and exactly what I want; however, it assumes that my series has a datetime index, which it does not.
I can estimate the frequency parameter in this specific case, but this will need to be automated for different data, so I cannot use the freq parameter of the seasonal_decompose function (unless there is some way to calculate it automatically) to make up for the fact that my series lacks a datetime index.
I have managed to estimate the season length by using the seasonal Python package:
I call the fit_seasons function and then check the length of the returned seasons.
Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to time of continuous data acquisition. Then all I'd need to do is repeat the single valued columns to match the length of the time series on that trial. But I immediately foresee 2 problems. First, the number of single valued columns is large enough that extending each one of them to match the length of the time series seems very wasteful if not impractical. Second, and more importantly, if I wanna do basic groupby type of analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results because whether each trial was correct or wrong will be repeated as many times as necessary for its length to match the length of time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x-, y- and z-locations of several motion-capture markers at time intervals of 10 ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd

# subject_trials: list of per-trial CSV files for one subject
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=list(range(len(subject_trials))))
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
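One caveat: `unique()` returns the distinct values in the column rather than one value per trial, so for per-trial statistics it may help to collapse the frame to one row per trial first. Here is a minimal sketch, assuming a frame built with `pd.concat` and `keys` as above. The column names (`difficulty`, `outcome`), the index level names and the data are made up for illustration.

```python
import pandas as pd

# Hypothetical per-trial frames: time-series columns (t, x) plus
# single-valued columns (difficulty, outcome) repeated on every row.
trial_frames = [
    pd.DataFrame({"t": [0, 1, 2], "x": [0.1, 0.2, 0.3],
                  "difficulty": "easy", "outcome": 1}),
    pd.DataFrame({"t": [0, 1], "x": [0.5, 0.4],
                  "difficulty": "hard", "outcome": 0}),
    pd.DataFrame({"t": [0, 1, 2, 3], "x": [0.9, 0.8, 0.7, 0.6],
                  "difficulty": "easy", "outcome": 0}),
]
df = pd.concat(trial_frames,
               keys=list(range(len(trial_frames))),
               names=["trial", "sample"])

# Naive grouping weights each trial by its number of samples (biased).
biased = df.groupby("difficulty")["outcome"].mean()

# Collapse to one row per trial, then group: each trial counts once.
per_trial = df.groupby(level="trial")[["difficulty", "outcome"]].first()
accuracy = per_trial.groupby("difficulty")["outcome"].mean()

print(biased)
print(accuracy)
```

In this toy data the two "easy" trials have 3 and 4 samples, so the naive mean comes out as 3/7 while the per-trial mean is 0.5. This is the bias the question was worried about.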
I am considering the use of Cassandra as a time-series store. I have millions of series, and each series has around 10K sequential points with uniform intervals. Some series, though, have a few thousand points or fewer. They may start and end at different points, but all share the same times. I access the data series in two ways:
Vertically: predefined partitions (e.g. all days in a year) and I need all the rows.
Horizontally: All values of a specific series (random)
I am considering two options. First, I could just have a column per time, as is recommended for monitoring systems, for example (I have a different access pattern, though). Second, I could use list columns, one per partition.
I am worried about read performance (the second use case is more critical) and storage overhead. I did find the following formula here:
total_column_size = column_name_size + column_value_size + 15
I think that would make the first option quite expensive in terms of storage. I could not find any documentation for a list storage layout. Do you know of any? Have other recommendations?
BTW, I am using python as a client for cassandra if that makes any difference.
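To put a rough number on the storage concern, the formula quoted above can be evaluated for a single series. This is only back-of-the-envelope arithmetic: the 8-byte column name (a timestamp) and 8-byte value (a double) are assumptions, and only the 15-byte constant and the 10K points per series come from the question.

```python
def total_column_size(column_name_size, column_value_size, overhead=15):
    """Per-column size in bytes, following the formula quoted in the question."""
    return column_name_size + column_value_size + overhead


NAME_BYTES = 8              # assumption: timestamp used as the column name
VALUE_BYTES = 8             # assumption: one double-precision value
POINTS_PER_SERIES = 10_000  # "around 10K" points per series, from the question

per_column = total_column_size(NAME_BYTES, VALUE_BYTES)
per_series = per_column * POINTS_PER_SERIES
overhead_fraction = 1 - VALUE_BYTES / per_column

print(per_column, "bytes per column")
print(per_series, "bytes per series")
print(f"{overhead_fraction:.0%} of each column is not payload")
```

Under these assumptions each stored value carries 23 bytes of name and overhead for 8 bytes of payload, which is the kind of ratio that makes the column-per-time layout look expensive.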
"Storage is cheap" is generally the philosophy here. If you have 2 query patterns, which you seem to, then store everything twice: once partitioned by your desired verticals (days by the looks), and once again by your chosen series. If you don't know how to partition your series in advance (it wasn't clear from the question) then it becomes more complicated. Cassandra reads are sequential when reading in order - and this is the only way you should be using it anyway.
You have in the region of X0bn points which is larger than your average DB but is not bordering on ridiculous, particularly when distributed over a cluster. It's hard to put an exact figure given that I don't know the width of your data points, but if these are just scalar values then this is only going to be 2TB or so of data.