Selecting a 10 minute time chunk in Pandas - python

I'm new to dealing with time-series data and pandas in Python. The data in question are physiological measurements sampled once per second.
I have a pandas DataFrame (called df_baseline) indexed by datetime, with a frequency of one row per second for about 25 minutes.
e.g.
2016-01-22 14:35:05
2016-01-22 14:35:06
2016-01-22 14:35:07
type(df_baseline.index)
Out[21]: pandas.tseries.index.DatetimeIndex
Data for this project were collected at varying times and dates, with start and stop points that differ from file to file. I want to iterate through several such files, pick up the first 10 minutes of each, and compute averages and other statistics.
I have searched and am getting confused between datetimes, time series, periods and timedeltas, but the solutions I've found relate to selecting between specific, known time ranges using things like df.between_time, which doesn't fit here because each data file starts and stops at a different time.
Am I approaching this the wrong way? I feel like the solution is staring me in the face. Please get me on the right track!
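For what it's worth, here is a minimal sketch of the kind of selection described above, assuming each file loads into a frame sorted on its DatetimeIndex (the baseline_files list and the read_csv arguments are assumptions about how the files are stored, not part of the original setup):

import pandas as pd

def first_ten_minutes(df):
    # slice from the first timestamp up to that timestamp plus 10 minutes
    start = df.index[0]
    return df.loc[start : start + pd.Timedelta(minutes=10)]

stats = []
for path in baseline_files:  # baseline_files: hypothetical list of per-session files
    frame = pd.read_csv(path, index_col=0, parse_dates=True).sort_index()
    chunk = first_ten_minutes(frame)
    stats.append(chunk.mean())

summary = pd.DataFrame(stats, index=baseline_files)  # one row of averages per file

For a sorted DatetimeIndex, df.first('10min') does essentially the same slice in a single call.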

Related

Creating multiple subsets of a timeseries pandas dataframe by weekly intervals

I'm new to Python. I have a dataframe with a datetime column (essentially a huge time series). I basically want to divide this into multiple subsets, where each subset dataframe contains one week's worth of data (starting from the first timestamp). I have been trying this with groupby and Grouper, but it returns tuples which themselves don't contain a week's worth of data. In addition, the Grouper (formerly TimeGrouper) documentation isn't very clear on this.
This is what I tried. Any better ideas or approaches?
grouped = uema_label_format.groupby(pd.Grouper(key='HEADER_START_TIME', freq='W'))
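One hedged sketch of turning that grouping into actual per-week DataFrames, assuming HEADER_START_TIME is already a datetime64 column:

import pandas as pd

grouped = uema_label_format.groupby(pd.Grouper(key='HEADER_START_TIME', freq='W'))
# iterating the groupby yields (week_end_timestamp, DataFrame) pairs;
# the DataFrame part is that week's subset of rows
weekly_frames = {week_end: frame for week_end, frame in grouped if not frame.empty}

Note that freq='W' bins on calendar weeks ending on Sunday; freq='7D' counts 7-day blocks from the start of the data instead, which may be closer to "starting from the first timestamp".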
If your dataset is really big, it could be worth externalising this work to a time-series database and then querying it to get each week you are interested in. The results can then be loaded into pandas, while the database handles the heavy lifting. For example, in QuestDB you could get a given week as follows:
select * from yourTable where timestamp = '2020-06-22;7d'
Although this would return the data for a single week, you could iterate over it to pull out the individual weekly frames quickly, since the queries come back almost instantly. You can also easily change the sample interval after the fact, for example to monthly using 1M, and still get an immediate response.
You can try this here, using the following query as an example to get one week's worth of data (roughly 3M rows) out of a 1.6-billion-row NYC taxi dataset:
select * from trips where pickup_datetime = '2015-08-01;7d';
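Changing the sample interval as mentioned above could then look roughly like this, using QuestDB's SAMPLE BY clause (the count() aggregate is just an assumption to illustrate the shape of the query):

select pickup_datetime, count() from trips sample by 1M;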
If this would solve your use case, there is a tutorial on how to get query results from QuestDB to pandas here.

Array resampling with Pandas

I have a 2000 x 200 HDF array from a 2000-run Monte Carlo. The data are the times in minutes before and after a certain event. For the first 50 or so rows, before the event is triggered, the time is negative; when the event occurs, the time crosses zero and thereafter increases until the end of the simulation. The data have no header and no date column; this is how the data were given to me. I have loaded the data and extracted the rows centered on a few minutes before and after the trigger event. I am now trying to use the pandas resample function on this subset to obtain a 1-second-frequency data set for statistical analysis. I can't find any examples to help me up-sample a data array in the format in which this data was received. Perhaps I could add a date column to the array, or resample each column separately, but I wonder if anyone knows of a simpler, faster way? Thanks and best regards!
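A minimal sketch of the synthetic-index idea, assuming the extracted subset sits in an array-like called subset_array with one column per run, and that the rows are evenly spaced at a known simulation step (the one-minute step below is an assumption):

import pandas as pd

# rows = simulation output steps, columns = Monte Carlo runs
df = pd.DataFrame(subset_array)
df.index = pd.timedelta_range(start='0s', periods=len(df), freq='1min')
upsampled = df.resample('1s').interpolate()  # linear interpolation onto a 1-second grid

A TimedeltaIndex avoids having to invent calendar dates just to satisfy resample.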

Python: aggregating data by row count

I'm trying to aggregate this call center data in various different ways in Python, for example mean q_time by type and priority. This is fairly straightforward using df.groupby.
However, I would also like to be able to aggregate by call volume. The problem is that each line of the data represents a call, so I'm not sure how to do it. If I'm grouping by date, I can just use 'count' as the aggregate function, but how would I aggregate by, e.g., weekday, i.e. create a data frame like:
weekday mean_row_count
1 100
2 150
3 120
4 220
5 200
6 30
7 35
Is there a good way to do this? All I can think of is looping through each weekday and counting the number of unique dates, then dividing the counts per weekday by the number of unique dates. But I think this could get messy, and it might really slow things down if I also need to group by other variables, or break it down by date and hour of the day.
Since the date of each call is given, one idea is to implement a function that determines the day of the week from a given date. There are many ways to do this, such as Conway's Doomsday algorithm:
https://en.wikipedia.org/wiki/Doomsday_rule
One can then go through each line, determine the weekday, and add to the count for that weekday.
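In pandas itself the weekday is also available directly from the datetime accessor, so the per-date counts and the per-weekday means can be computed without a hand-rolled calendar algorithm. A sketch, assuming the call timestamps live in a datetime64 column named date (the column name is an assumption):

import pandas as pd

calls_per_day = df.groupby(df['date'].dt.date).size()   # one count per calendar date
calls_per_day.index = pd.to_datetime(calls_per_day.index)
# dayofweek is 0 for Monday, so +1 matches the 1-7 numbering in the question
mean_row_count = calls_per_day.groupby(calls_per_day.index.dayofweek + 1).mean()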
When I find myself wondering how to aggregate and query data in a versatile way, I think the solution is probably a database. SQLite is a lightweight embedded database with high performance for simple use cases, and Python has native support for it.
My advice here is: create a database and a table for your data, possibly add ancillary tables depending on your needs, load the data into it, and use interactive sqlite or Python scripts for your queries.
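A rough sketch of that load-then-query pattern with the sqlite3 module that ships with Python (the file, table and column names are assumptions):

import sqlite3
import pandas as pd

con = sqlite3.connect('calls.db')
df.to_sql('calls', con, if_exists='replace', index=False)   # load the call records
query = """
    SELECT strftime('%w', date) AS weekday, COUNT(*) AS n_calls
    FROM calls
    GROUP BY weekday
"""
counts = pd.read_sql_query(query, con)   # calls per weekday (0 = Sunday in SQLite)

This assumes the date column is stored as ISO-8601 text so that SQLite's strftime can parse it.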

Pandas and the best method for representing variable-length time-series

Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to the time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one of them to match the length of the time series seems very wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby-type analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as needed for its length to match the length of the time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x- y- and z-locations of several motion-capture markers at time intervals of 10ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd

# one frame per trial, concatenated with the trial number as the outermost index level
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=[i for i, _ in enumerate(subject_trials)])
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
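If the frame is stacked with the trial id as the outermost index level (as in the concat example above), another way to get back to one row per trial for the groupby analyses mentioned in the question is to collapse the duplicated per-trial columns first. A sketch, where the outcome and difficulty column names are assumptions and outcome is assumed to be coded 0/1:

# one row per trial: take the first (constant) value of each per-trial column
per_trial = subject_df.groupby(level=0)[['outcome', 'difficulty']].first()
# proportion correct by difficulty, no longer weighted by trial duration
prop_correct = per_trial.groupby('difficulty')['outcome'].mean()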

Best way to construct a pandas.DataFrame composed of different chunks

I'm running daily simulations in a batch: I do 365 simulations to get results for a full year. After every run, I want to extract some arrays from the results and add them to a pandas.DataFrame for later analysis.
I have a rough model (used for an optimisation) and a more precise model for a post-simulation, so I can get the same variable from two sources. When the post-simulation has been run, its results should overwrite the optimization results.
To make it more complicated, the optimization model has a smaller output interval, depending on the discretisation settings, but the final analysis will happen on the larger interval of the post-simulation.
What is the best way to construct this DataFrame?
This was my first approach:
1. creation of an empty DataFrame df for the full year, with a DateRange index with the larger post-simulation interval (= 15 minutes)
2. do optimization for 1 day ==> create temporary df_temp with DateRange as index with smaller interval
3. downsample this DataFrame to 15 minutes as described here:
4. update df with df_temp (rows in df are still empty, except for the last row of the previous run, so I have to take df_temp[1:])
5. do simulation for same day ==> create temporary df_temp2 with interval = 15 min
6. overwrite the corresponding rows in df with df_temp2
Which methods should I use in steps 4) and 6)? Or is there a better way from the start?
Thanks,
Roel
I think that using DataFrame.combine_first could be the way to go, but depending on the scale of the data, it might be more useful to have a method like "update" that just modified particular rows in an existing DataFrame. combine_first is more general and can cause the result to be of a different size than either of the inputs (because the indexes will get unioned together).
https://github.com/pydata/pandas/issues/961
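For reference, a sketch of what steps 4) and 6) might look like with combine_first and the update method that pandas now provides (df, df_temp and df_temp2 as defined in the question):

# step 4: fill the still-empty 15-minute rows with the downsampled optimisation output;
# existing values in df win, NaN holes are filled from df_temp
df = df.combine_first(df_temp)

# step 6: overwrite the corresponding rows with the post-simulation results;
# update() works in place and only touches index/column labels present in df_temp2
df.update(df_temp2)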
