Visualizing timestamps and cumulative time usage in Python

I have some sensor data, which is essentially a timestamp and a status number. A new line, with timestamp and status, is recorded semi-periodically (anything from every second to every hour). The same status can be repeated over hundreds of lines. I want to represent how long the component has been in each of the states it can be in.
To get the accumulated time in each state I just need to loop over all the records and sum up the time spent in each state, right? But how do I visualize this? The idea is to show both how much accumulated time is spent in each state and how the status changes along a date axis.
Suggestions for plot types are welcome.
Example data (unixtime | state): http://pastebin.com/6TmXFZQd

This design is what I would do to visualize the data described in the question:
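For reference, a minimal matplotlib sketch of one such layout: a bar chart of the accumulated time per state, plus a step plot of the state along the date axis. The file name, separator and column names below are assumptions about the pastebin data, not part of it.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one "unixtime state" pair per line in a local copy of the data.
df = pd.read_csv("sensor.txt", sep=r"\s+", names=["unixtime", "state"])
df["time"] = pd.to_datetime(df["unixtime"], unit="s")

# Each record is assumed to stay in effect until the next one arrives,
# so a row's duration is the gap to the following timestamp.
df["duration"] = df["unixtime"].shift(-1) - df["unixtime"]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))

# Top: accumulated time per state, in seconds.
df.groupby("state")["duration"].sum().plot.bar(ax=ax1)
ax1.set_ylabel("total seconds in state")

# Bottom: state changes along the date axis.
ax2.step(df["time"], df["state"], where="post")
ax2.set_ylabel("state")

plt.tight_layout()
plt.show()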

Related

How to find periods from time series data in Python

I have time series data.
How do I find the period?
The graph in the first picture and the first graph in the second picture are the same graph; please ignore the second graph in the second picture.
We wanted to measure micro-currents while a person works out. This chart shows a person pressing the sensor with a finger, because we haven't built a good sensor yet.
My team tried to reduce the noise, so we set the data below 400 to 0, but we can restore it to the original data.
There seem to be 7 similar periods.
I have used
https://github.com/gsubramani/SignalRecognition
but it raised an error; the code did not work well.
https://github.com/guillaume-chevalier/seq2seq-signal-prediction
My computer has no GPU, so I couldn't test it; it also had errors.
https://github.com/tbnsilveira/STFT_analysis/blob/master/STFT_sinusoidal_signal.ipynb
This one doesn't seem to show how to get the periods.
I am using Python.
Any help would be appreciated! Thank you in advance.
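No answer is recorded here, but one common generic way to estimate the period of a roughly repeating 1-D signal is autocorrelation. The sketch below is not taken from any of the repositories above; the signal and the min_lag parameter are placeholders to adapt to your data.

import numpy as np

def estimate_period(signal, min_lag=1):
    # Estimate the dominant period (in samples) via autocorrelation.
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]   # non-negative lags only
    # Skip the lag-0 main lobe, then take the strongest peak as the period estimate.
    return min_lag + int(np.argmax(acf[min_lag:]))

# Example: a noisy sine with a period of 50 samples.
t = np.arange(1000)
noisy = np.sin(2 * np.pi * t / 50) + 0.3 * np.random.randn(1000)
print(estimate_period(noisy, min_lag=10))   # roughly 50

Divide the result by your sampling rate to convert it from samples to seconds.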

How to build a prediction model that takes input in date_time format?

Dataset Image
I have been working on predicting water usage on a weekly basis. I have the starting day of every week in one column and the water consumed in another. I want my model to make a prediction when I give it input in date-time format, like 21-01-2021 (say), in the predict() function. Which model should I use, and how can I achieve this?
I've previously tried an ARIMA model for time series analysis.
Most ML/DL algorithms use floating-point input values. That said, and based on most of the datasets I've seen, you should do some data transformation and compute a time delta (you'd see something like TimeDT). That's done by setting a base date (the first date that appears in your training data) and computing each subsequent row's delta from it, based on your criteria (seconds, hours or days elapsed, etc.).
TL;DR
As I understand it, you're computing based on the day of the week (please correct me if I'm wrong), so your time delta would be daily, restarting each week. The most appropriate approach in that case is calendar-based: decompose the date and add two new features, week_of_year and day_of_week.
Is week_of_year important? Well, in summer there might be a tendency to consume more water; that's something your dataset can tell you.
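A minimal sketch of that decomposition with pandas; the column names "date" and "consumption" are assumptions about your dataset.

import pandas as pd

df = pd.DataFrame({
    "date": ["04-01-2021", "11-01-2021", "18-01-2021"],   # start of each week
    "consumption": [120.5, 130.2, 118.7],
})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")

base = df["date"].min()                           # first date in the training data
df["time_delta"] = (df["date"] - base).dt.days    # days elapsed since the base date
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["day_of_week"] = df["date"].dt.dayofweek

At prediction time, apply the same transformation to the incoming date (e.g. 21-01-2021) before passing the numeric features to the model.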

Extract data faster from Redis and store it in a Pandas DataFrame by avoiding key generation

I am using Redis with Python to store my per-second ticker data (price and volume of an instrument). I am performing r.hget(instrument, key) and facing the following issue.
My key (string) looks like 01/01/2020-09:32:01 and goes on incrementing per second till the user specified interval.
For example 01/01/2020-09:32:01
01/01/2020-09:32:02 01/01/2020-09:32:03 ....
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval, like 2 years, that is, he/she wants the data from 01/01/2020 to 31/12/2020 (d/m/y format). So to perform the get operation I first have to generate the timestamps for that period and then perform the get operation to build a Pandas DataFrame. Generating these timestamps to use as keys for the get operation is slowing down my process terribly (although it also ensures that the data is in strict order; for example, 01/01/2020-09:32:01 will definitely come before 01/01/2020-09:32:02). Is there another way to achieve the same result?
If I simply do r.hgetall(...) I won't be able to satisfy the user's time-interval condition.
Redis sorted sets are a good fit for such range queries. Sorted sets are made up of unique members with a score; in your case the timestamp (in epoch seconds) can be the score and the price and volume can be the member. However, since members in a sorted set must be unique, you may consider appending the timestamp to the member to make it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set you can do range queries using zrangebyscore as below
zrangebyscore instrument 1577883600 1577883610
If your instrument contains many values, consider sharding it into multiple sets, for example one set per month: instrument:202001, instrument:202002, and so on.
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
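For reference, this is roughly how the commands above might look from Python with redis-py; the connection settings and sample values are placeholders.

import redis

r = redis.Redis(host="localhost", port=6379)

# score = epoch seconds, member = "price,volume,timestamp"
# (the timestamp is appended to keep members unique, as suggested above).
r.zadd("instrument", {"672.2,432,1577883600": 1577883600})
r.zadd("instrument", {"672.2,412,1577883610": 1577883610})

# Range query over the user's interval (epoch seconds), returned in score order.
for raw in r.zrangebyscore("instrument", 1577883600, 1577883610):
    price, volume, ts = raw.decode().split(",")
    print(ts, price, volume)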
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields a smaller set of values, for a smaller time span (one week or one month).
So the new workflow will be in batches, see this loop:
generate a small set of timestamps
fetch items from redis
Pros:
minimize the memory usage
easy to change your current code to this new algo.
I don't know the Redis-specific functions, so other, more specific solutions may be better. My idea is a general approach; I have used it with success for other problems.
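A sketch of that batched loop, assuming the same per-second hash keys as in the question; the week-sized batches and the key format are illustrative, not prescriptive.

from datetime import datetime, timedelta

def weekly_key_batches(start, end):
    # Yield lists of per-second keys, one list per week, instead of
    # materialising the whole interval at once.
    step, t = timedelta(seconds=1), start
    while t < end:
        batch_end = min(t + timedelta(weeks=1), end)
        keys = []
        while t < batch_end:
            keys.append(t.strftime("%d/%m/%Y-%H:%M:%S"))
            t += step
        yield keys

start = datetime(2020, 1, 1, 9, 32, 1)
end = datetime(2020, 1, 15, 9, 32, 1)
for keys in weekly_key_batches(start, end):
    # values = r.hmget(instrument, keys)   # fetch one week at a time, then append to the DataFrame
    pass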
Have you considered using RedisTimeSeries for this task? It is a redis module that is tailored exactly for the sort of task you are describing.
You can keep two time series per instrument that will hold price and volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.

Array resampling with Pandas

I have a 2000 x 200 HDF array from a 2000-run Monte Carlo simulation. The data are the times in minutes before and after a certain event. For the first 50 or so rows, before the event is triggered, the time is negative; then, when the event occurs, the time crosses zero and thereafter increases until the end of the simulation. The data has no header or date column; this is how the data was given to me. I have loaded the data and extracted the time rows centered on a few minutes before and after the trigger event. I am now trying to use the pandas resample function on this data subset to obtain a 1-second-frequency data set for statistical analysis. I can't find any examples to help me up-sample a data array in the format in which this data was received. Perhaps I could add a date column to the array, or resample each column separately, but I wonder if anyone knows of a simpler, faster way? Thanks and best regards!
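No answer is recorded for this question, but here is one possible sketch of the "resample each column separately" idea: give each run a TimedeltaIndex built from its own minute values and upsample it to 1-second spacing. Whether runs are the rows or the columns of the 2000 x 200 array is an assumption here (one run per row), as is the placeholder data.

import numpy as np
import pandas as pd

# Placeholder: 3 runs x 20 irregular samples of time-to-event, in minutes.
times_min = np.cumsum(np.random.uniform(0.5, 1.5, size=(3, 20)), axis=1) - 10.0

upsampled = {}
for run, row in enumerate(times_min):
    # Shift so the TimedeltaIndex starts at zero; the event offset stays in the values.
    idx = pd.to_timedelta(row - row[0], unit="m")
    s = pd.Series(row, index=idx)
    upsampled[run] = s.resample("1s").interpolate("linear")

result = pd.DataFrame(upsampled)   # one column per run, on a 1-second index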

Pandas and the best method for representing variable-length time-series

Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one of them to match the length of the time series seems very wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby types of analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as necessary for its length to match the length of the time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x-, y-, and z-locations of several motion-capture markers at time intervals of 10 ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=[i for i, _ in enumerate(subject_trials)])
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
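To avoid the bias the question worries about when averaging per-trial values, one option is to collapse the stacked frame back to one row per trial first. A small sketch with toy data; the "outcome" and "difficulty" column names follow the question, and the frame layout mirrors the pd.concat() call above.

import pandas as pd

# Toy stacked frame: two trials with different numbers of 1 kHz samples, and
# the per-trial columns repeated down each trial's index.
trial_a = pd.DataFrame({"x": [0.1, 0.2, 0.3], "outcome": 1, "difficulty": "easy"})
trial_b = pd.DataFrame({"x": [0.5, 0.4], "outcome": 0, "difficulty": "easy"})
df = pd.concat([trial_a, trial_b], keys=[0, 1])

# Collapse to one row per trial so repeated values don't bias the statistics.
per_trial = df.groupby(level=0)[["outcome", "difficulty"]].first()

# Unbiased proportion of correct responses at each difficulty level.
print(per_trial.groupby("difficulty")["outcome"].mean())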
