Best way to construct a pandas.DataFrame composed of different chunks - python

I'm running daily simulations in a batch: I do 365 simulations to get results for a full year. After every run, I want to extract some arrays from the results and add them to a pandas.DataFrame for later analysis.
I have a rough model (doing an optimisation) and a more precise model for a post-simulation, so I can get the same variable from two sources. When the post-simulation has been run, its results should overwrite the optimisation results.
To make it more complicated, the optimisation model has a smaller output interval (depending on the discretisation settings), but the final analysis will happen on the larger interval of the post-simulation.
What is the best way to construct this DataFrame?
This was my first approach:
1) create an empty DataFrame df for the full year, with a DateRange index at the larger post-simulation interval (= 15 minutes)
2) run the optimization for 1 day ==> create a temporary df_temp with a DateRange index at the smaller interval
3) downsample this DataFrame to 15 minutes as described here
4) update df with df_temp (rows in df are still empty, except for the last row of the previous run, so I have to take df_temp[1:])
5) run the simulation for the same day ==> create a temporary df_temp2 with interval = 15 min
6) overwrite the corresponding rows in df with df_temp2
Which methods should I use in steps 4) and 6)? Or is there a better way from the start?
Thanks,
Roel

I think that using DataFrame.combine_first could be the way to go, but depending on the scale of the data it might be more useful to have a method like "update" that just modifies particular rows in an existing DataFrame. combine_first is more general and can produce a result of a different size than either of the inputs (because the indexes get unioned together).
https://github.com/pydata/pandas/issues/961
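For illustration, in current pandas both DataFrame.combine_first and DataFrame.update are available, so a minimal sketch of steps 3), 4) and 6) could look like the following; the column name 'value' and the 5-minute optimisation interval are assumptions, not taken from the question.
import numpy as np
import pandas as pd

# Step 1: empty year-long frame on the coarse 15-minute grid.
index = pd.date_range('2012-01-01', periods=365 * 96, freq='15min')
df = pd.DataFrame(index=index, columns=['value'], dtype=float)

# Step 2: hypothetical one-day optimisation output on a finer 5-minute grid.
fine = pd.date_range('2012-01-01', periods=288, freq='5min')
df_temp = pd.DataFrame({'value': np.random.rand(len(fine))}, index=fine)

# Step 3: downsample to the 15-minute grid.
df_temp = df_temp.resample('15min').mean()

# Step 4: fill the still-empty rows; combine_first keeps existing values and only fills NaNs in df.
df = df.combine_first(df_temp)

# Steps 5-6: the post-simulation result for the same day overwrites those rows in place.
coarse = pd.date_range('2012-01-01', periods=96, freq='15min')
df_temp2 = pd.DataFrame({'value': np.random.rand(len(coarse))}, index=coarse)
df.update(df_temp2)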

Related

Creating multiple subsets of a timeseries pandas dataframe by weekly intervals

I'm new to Python. I have a dataframe with a datetime column (essentially huge time-series data). I basically want to divide this into multiple subsets, where each subset dataframe contains one week's worth of data (starting from the first timestamp). I have been trying this with groupby and Grouper, but it returns tuples which themselves don't contain a week's worth of data. In addition, the Grouper (erstwhile TimeGrouper) documentation isn't very clear on this.
This is what I tried. Any better ideas or approaches?
grouped = uema_label_format.groupby(pd.Grouper(key='HEADER_START_TIME', freq='W'))
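For reference, iterating over the grouped object yields (week label, sub-DataFrame) pairs, so the weekly subsets can be collected directly; note that freq='W' bins by calendar week (ending on Sunday) rather than weeks anchored to the first timestamp. A small self-contained sketch with a stand-in dataframe:
import pandas as pd

# Stand-in for the question's dataframe, keeping the column name from the snippet above.
uema_label_format = pd.DataFrame({
    'HEADER_START_TIME': pd.date_range('2020-01-01', periods=30, freq='D'),
    'value': range(30),
})

grouped = uema_label_format.groupby(pd.Grouper(key='HEADER_START_TIME', freq='W'))

# Each iteration yields the week label and the rows that fall in that week.
weekly_subsets = {week: subset for week, subset in grouped}

# Or simply a list of one-week DataFrames, in chronological order.
weekly_frames = [subset for _, subset in grouped]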
If your dataset is really big, it could be worth externalising this work to a time-series database and then querying it for each week you are interested in. The results can then be loaded into pandas, while the database handles the heavy lifting. For example, in QuestDB you could get a given week as follows:
select * from yourTable where timestamp = '2020-06-22;7d'
Although this returns the data for a single week, you could iterate over successive weeks to get the individual subsets quickly, since the queries return almost instantly. Also, you can easily change the sample interval after the fact, for example to monthly using 1M; this would still be an instant response.
You can try this here, using the query below as an example to get one week's worth of data (roughly 3M rows) out of a 1.6-billion-row NYC taxi dataset:
select * from trips where pickup_datetime = '2015-08-01;7d';
If this would solve your use case, there is a tutorial on how to get query results from QuestDB into pandas here.
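As a rough illustration of that last step, the sketch below pulls a query result into pandas over QuestDB's HTTP export endpoint; the host, port and endpoint details are assumptions and may differ between QuestDB versions.
import io
import requests
import pandas as pd

# Assumes a local QuestDB instance with its HTTP API on the default port 9000;
# /exp returns query results as CSV.
query = "select * from trips where pickup_datetime = '2015-08-01;7d'"
resp = requests.get('http://localhost:9000/exp', params={'query': query})
resp.raise_for_status()

week_df = pd.read_csv(io.StringIO(resp.text), parse_dates=['pickup_datetime'])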

Extract data faster from Redis and store in Panda Dataframe by avoiding key generation

I am using Redis with Python to store my per-second ticker data (price and volume of an instrument). I am performing r.hget(instrument, key) and facing the following issue.
My key (a string) looks like 01/01/2020-09:32:01 and keeps incrementing per second up to the user-specified interval.
For example: 01/01/2020-09:32:01, 01/01/2020-09:32:02, 01/01/2020-09:32:03, ...
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval, say a year or two; for example, he/she wants the data from 01/01/2020 to 31/12/2020 (d/m/y format). So to perform the get operation I have to first generate timestamps for that period and then perform the get operation to form a pandas dataframe. Generating these timestamps to use as keys for the get operation is slowing down my process terribly (although it also ensures that the data is in strict order: for example, 01/01/2020-09:32:01 will definitely come before 01/01/2020-09:32:02). Is there another way to achieve the same result?
If I simply do r.hgetall(...) I won't be able to satisfy the user's time-interval condition.
Redis sorted sets are a good fit for such range queries. A sorted set is made up of unique members, each with a score: in your case the timestamp (in epoch seconds) can be the score, and the price/volume string can be the member. Since members of a sorted set must be unique, consider appending the timestamp to the member to keep it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set, you can do range queries using zrangebyscore as below:
zrangebyscore instrument 1577883600 1577883610
If your instrument set contains many values, consider sharding it into multiple sets, for example one per month: instrument:202001, instrument:202002, and so on.
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
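A minimal sketch of the same idea from Python with redis-py; the key name and sample values are illustrative.
import redis
import pandas as pd

r = redis.Redis()

# Score = epoch seconds, member = 'price,volume,timestamp' (timestamp appended to keep members unique).
r.zadd('instrument', {'672.2,432,1577883600': 1577883600,
                      '672.2,412,1577883610': 1577883610})

# Range query over the requested time window.
rows = r.zrangebyscore('instrument', 1577883600, 1577883610)

# Parse the members back into a DataFrame.
records = [m.decode().split(',') for m in rows]
df = pd.DataFrame(records, columns=['price', 'volume', 'epoch']).astype(
    {'price': float, 'volume': int, 'epoch': int})
df['time'] = pd.to_datetime(df['epoch'], unit='s')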
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields a smaller set of values, for a smaller time span (one week or one month).
So the new workflow will run in batches; each iteration of the loop will:
generate a small set of timestamps
fetch items from redis
Pros:
it minimizes memory usage
it is easy to change your current code to this new algorithm.
I don't know the Redis-specific functions, so other, more specific solutions may be better. My idea is a general approach that I have used successfully for other problems.
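A sketch of that batched workflow, assuming the per-second hash layout from the question; the helper name, batch size and use of hmget for the batched fetch are made up for illustration.
from datetime import datetime, timedelta
import redis

r = redis.Redis()

def timestamp_batches(start, end, batch=timedelta(days=7), step=timedelta(seconds=1)):
    # Yield lists of key strings covering [start, end), one batch (e.g. one week) at a time.
    cursor = start
    while cursor < end:
        batch_end = min(cursor + batch, end)
        keys = []
        t = cursor
        while t < batch_end:
            keys.append(t.strftime('%d/%m/%Y-%H:%M:%S'))
            t += step
        yield keys
        cursor = batch_end

rows = []
for keys in timestamp_batches(datetime(2020, 1, 1, 9, 30), datetime(2020, 1, 1, 10, 30)):
    # One round trip per batch instead of one hget per key.
    values = r.hmget('instrument', keys)
    rows.extend(zip(keys, values))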
Have you considered using RedisTimeSeries for this task? It is a Redis module tailored exactly to the sort of task you are describing.
You can keep two time series per instrument, holding the price and the volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and it uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.
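A minimal sketch of driving the same commands from Python; it simply issues the raw RedisTimeSeries commands through redis-py (the module must be loaded in the Redis server), and the timestamps and values are placeholders.
import redis

r = redis.Redis()

# Create the two series with labels.
r.execute_command('TS.CREATE', 'instrument:price', 'LABELS', 'instrument', 'violin', 'type', 'string')
r.execute_command('TS.CREATE', 'instrument:volume', 'LABELS', 'instrument', 'violin', 'type', 'string')

# Add samples (timestamp, value).
r.execute_command('TS.ADD', 'instrument:price', 123456, 9.99)
r.execute_command('TS.ADD', 'instrument:volume', 123456, 42)

# Query a single series over its full range.
price_samples = r.execute_command('TS.RANGE', 'instrument:price', '-', '+')

# Query several series at once by filtering on labels.
all_series = r.execute_command('TS.MRANGE', '-', '+', 'FILTER', 'instrument=violin')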

Can I .set_index() lazily ( or to be executed concurrently ), on Dask Dataframes?

tl;dr: Is it possible to call the .set_index() method on several Dask DataFrames in parallel, concurrently? Alternatively, is it possible to call .set_index() lazily on several Dask DataFrames, which would consequently lead to the indexes being set in parallel, concurrently?
Here is the scenario:
I have several time series
Each time series is stored in several .csv files. Each file contains data related to a specific day. Also, the files are scattered amongst different folders (each folder contains data for one month)
Each time series has different sampling rates
All time series have the same columns. All have a column which contains DateTime, amongst others.
Data is too large to be processed in memory. That's why I am using Dask.
I want to merge all the time series into a single DataFrame, aligned by DateTime. For this, I need to first .resample() all of the time series to a common sampling rate, and then .join() them.
.resample() can only be applied on an index. Hence, before resampling, I need to .set_index() on the DateTime column of each time series.
When I call the .set_index() method on one time series, computation starts immediately, which leaves my code blocked and waiting. At that moment, if I check my machine's resource usage, I can see that many cores are being used but the usage does not go above ~15%. This makes me think that, ideally, I could have the .set_index() method applied to more than one time series at the same time.
After reaching the above situation, I've tried some inelegant ways of parallelising the application of .set_index() on several time series (e.g. creating a multiprocessing.Pool), which were not successful. Before giving more details on those: is there a clean way to solve the situation above? Was this scenario considered at some point when implementing Dask?
Alternatively, is it possible to call .set_index() lazily? If .set_index() could be applied lazily, I would create a full computation graph with the steps described above and, in the end, everything would be computed in parallel, concurrently (I think).
Dask.dataframe needs to know the min and max values of all of the partitions of the dataframe in order to do datetime operations sensibly in parallel. By default it will read the data once in order to find good partitions. If the data is not sorted, it will then do a (perhaps very expensive) shuffle to sort it.
In your case it sounds like your data is already sorted and that you might be able to provide these values explicitly. You should look at the last example in the dd.DataFrame.set_index docstring:
A common case is when we have a datetime column that we know to be
sorted and is cleanly divided by day. We can set this index for free
by specifying both that the column is pre-sorted and the particular
divisions along which it is separated:
>>> import pandas as pd
>>> divisions = pd.date_range('2000', '2010', freq='1D')
>>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions) # doctest: +SKIP
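Building on that docstring example, here is a sketch of how the whole pipeline from the question might look when the files are already sorted by time; the file patterns, column names and the 15-minute target rate are assumptions.
import dask.dataframe as dd
import pandas as pd

# One division boundary per day for the whole period. This assumes each CSV file
# (and hence each partition) holds exactly one day of data, so that
# len(divisions) == npartitions + 1.
divisions = list(pd.date_range('2019-01-01', '2020-01-01', freq='1D'))

def load_series(pattern, value_col):
    df = dd.read_csv(pattern, parse_dates=['DateTime'])
    # sorted=True with explicit divisions sets the index lazily,
    # without an eager pass over the data or a shuffle.
    df = df.set_index('DateTime', sorted=True, divisions=divisions)
    # Resample to the common sampling rate.
    return df[value_col].resample('15min').mean()

a = load_series('series_a/*/*.csv', 'value')
b = load_series('series_b/*/*.csv', 'value')

# Join on the shared DateTime index; nothing is computed until compute() is called.
merged = a.to_frame('a').join(b.to_frame('b'))
result = merged.compute()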

Pandas and the best method for representing variable-length time-series

Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to the trial number and level 1 corresponds to the time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one of them to match the length of the time series seems very wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby-type analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as necessary for its length to match the length of the time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x-, y- and z-locations of several motion-capture markers at 10 ms intervals, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
# Concatenate one frame per trial into a single MultiIndexed frame, keyed by trial number.
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=[i for i, _ in enumerate(subject_trials)])
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
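To make the denormalised layout concrete, here is a small sketch; the column names (trial, t, x, y, correct, difficulty) are invented for illustration. The groupby at the end collapses the repeated per-trial values back to one row per trial, so trial-level statistics are not biased by trial length.
import pandas as pd

# One row per (trial, sample), with per-trial values repeated down each trial's samples.
df = pd.DataFrame({
    'trial':      [0, 0, 0, 1, 1],
    't':          [0, 1, 2, 0, 1],        # sample time within the trial (ms)
    'x':          [0.1, 0.2, 0.3, 0.0, 0.1],
    'y':          [0.0, 0.1, 0.1, 0.2, 0.2],
    'correct':    [True, True, True, False, False],
    'difficulty': [1, 1, 1, 2, 2],
}).set_index(['trial', 't'])

# Collapse back to one row per trial before computing trial-level statistics.
per_trial = df.groupby(level='trial')[['correct', 'difficulty']].first()
prop_correct = per_trial.groupby('difficulty')['correct'].mean()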

Splitting time data into "runs" in order to plot and examine differences

I am trying to investigate differences between runs/experiments in a continuously logged data set. I am taking a fixed subset of a few months of this data set and then analysing it to come up with an estimate of when each run was started. I have this as a sorted series of times.
With this I then chop the data up into 30-hour chunks (the approximate time between runs) and put them into a dictionary:
data = {}
for time in times:
    timeNow = np.datetime64(time.to_datetime())
    time30hr = np.datetime64(time.to_datetime()) + np.timedelta64(30*60*60, 's')
    data[time] = df[timeNow:time30hr]
So now I have a dictionary of dataframes, indexed by StartTime, and each one contains all of my data for a run, plus some extra to ensure I have everything for every run. But to compare two runs I need a common X value to stack them on top of each other. Every run is different, and the point I want to consider "the same" varies depending on what I'm looking at. For the example below I have used the largest value in each dataset to "pivot" on.
for time in data:
    A = data[time]
    # Find max point for value. And take the first if there is more than 1
    maxTtime = A[A['Value'] == A['Value'].max()]['DateTime'][0]
    # Now we can say we want 12 hours before and end 12 after.
    new = A[maxTtime - datetime.timedelta(0.5):maxTtime + datetime.timedelta(0.5)]
    # Stick on a new column with time from 0 point:
    new['RTime'] = new['DateTime'] - maxTtime
    # Plot values against this new time
    plot(new['RTime'], new['Value'])
This yields a graph of Value against RTime for each run, which is great except that I can't get a decent legend to tell which run was which and work out how much variation there is. I believe half my problem comes from iterating over a dictionary of dataframes, which is causing issues.
Could someone recommend how to better organise this (a dictionary of dataframes is all I could do to get it to work)? I've thought of using a hierarchical dataframe and, instead of indexing it by run time, assigning a set of identifiers to the runs (the actual time is contained within the dataframes themselves, so I have no problem losing the assumed start time) and then plotting it with a legend.
My final aim is to have a dataset and methodology that let me investigate the similarities and differences between runs using different "pivot points", and to produce a graph for each one which I can then interrogate (or at least tell which data set is which, so I can interrogate the data directly), but I couldn't get past various errors when creating this.
I can upload a set of the data to a CSV if required, but I am not sure of the best place to upload it to. Thanks.
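As a sketch of the hierarchical-dataframe idea mentioned above: concatenating the per-run frames gives a single MultiIndexed frame keyed by run, and plotting each group with a label produces a usable legend. The Value and DateTime column names come from the snippets above; everything else (the 12-hour window, the axis units) is hypothetical.
import datetime
import pandas as pd
import matplotlib.pyplot as plt

# Stack the per-run frames into one frame; the dict keys become the first index level.
runs = pd.concat(data)

fig, ax = plt.subplots()
for run_id, run in runs.groupby(level=0):
    run = run.droplevel(0)
    # Pivot each run on the time of its maximum Value (first occurrence).
    max_time = run.loc[run['Value'].idxmax(), 'DateTime']
    mask = ((run['DateTime'] >= max_time - datetime.timedelta(hours=12)) &
            (run['DateTime'] <= max_time + datetime.timedelta(hours=12)))
    window = run[mask]
    # Hours relative to the pivot point, so the runs can be overlaid and labelled.
    rel_hours = (window['DateTime'] - max_time).dt.total_seconds() / 3600
    ax.plot(rel_hours, window['Value'], label=str(run_id))

ax.legend(title='run')
plt.show()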
