I'm trying to understand the ideal way to organise data within pandas to achieve the best aggregation performance. The data I am dealing with arrives in files of the form yyyy-mm.csv, which I just read_csv in and then to_hdf out. It generally looks something like this:
ObjectID  Timestamp            ParamA  ParamB  -->  ParamZ
1         2013-01-01 00:00:00  1       9
2         2013-01-01 00:00:00  3       2
1         2013-01-01 00:10:00  8       11
2         2013-01-01 00:10:00  6       14
There are about 50 object ids, with readings every 10 minutes for the whole month. The end result I want to achieve is aggregated data (e.g. the mean) for a single parameter, grouped by month (or potentially a finer resolution eventually) over, say, 5 years.
What I've discovered so far is that an HDFStore.select of a single column isn't really a great deal quicker than bringing all of the params into a single data frame at once, so it feels very wasteful and the performance is not great. Without knowing exactly why this is, I can't really decide the best way to move forward. It seems that if the data were transposed such that yyyy-mm ran along the x axis with dd hh:mm:ss down the y axis, and there were one of these data frames per parameter, performance would massively improve as it could bring in more data in one hit. The groupby operations are really quick once things have been read in from disk. However, I'm not at all convinced that this is how it is supposed to be used. Can anyone advise the best way to organise and store the data?
Thanks
Please review the HDFStore docs here, and the cookbook recipes here.
PyTables stores data in a row-oriented format, so it behooves you to generally have long and not so wide tables. However, if you tend to query and need/want the entire row then the width does not present a problem.
On the other hand, if you are generally after a small subset of columns, you will want to shard the table into multiple tables (possibly with the same indexing scheme), so you can use a 'master' table to run the query, then select 'columns' (other tables) as needed. You can accomplish this via the append_to_multiple/select_as_multiple methods, for example. Taken to the extreme, you could store each column in a separate group and make yourself a column-oriented table. However, this will be substantially slower if, say, you tend to select a lot of columns.
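As a rough sketch of the sharding idea (the file name and column split below are made up, and this needs PyTables installed):

```python
import pandas as pd

# a small wide-ish frame; 'A' is the column we usually query on
df = pd.DataFrame({'A': [1.0, -1.0, 2.0, -2.0],
                   'B': [10.0, 11.0, 12.0, 13.0],
                   'C': [20.0, 21.0, 22.0, 23.0],
                   'D': [30.0, 31.0, 32.0, 33.0]})

with pd.HDFStore('sharded.h5', mode='w') as store:
    # 'master' holds the queryable column(s); 'rest' (None) takes the remainder
    store.append_to_multiple({'master': ['A'], 'rest': None}, df,
                             selector='master', data_columns=['A'])
    # run the query against the master table, then pull the matching rows
    # from both tables and glue them back together
    result = store.select_as_multiple(['master', 'rest'],
                                      where='A > 0', selector='master')
```

The point of the split is that the query only scans the narrow master table; the wide remainder is read just for the matching rows.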
Furthermore, you always want the queryable columns to be indexes or data_columns, as these are what allow queries in the first place and are indexed.
So it comes down to the ratio of queries that select lots of columns vs single-column selections.
For example
In [5]: df = DataFrame(np.random.randn(16, 2),
                       columns=['A', 'B'],
                       index=MultiIndex.from_tuples(
                           [(i, j) for i in range(4)
                            for j in date_range('20130101 00:00:00',
                                                periods=4, freq='10T')],
                           names=['id', 'date']))
In [6]: df
Out[6]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
1 2013-01-01 00:00:00 -0.009998 0.239947
2013-01-01 00:10:00 2.038563 0.640080
2013-01-01 00:20:00 1.123922 -0.944170
2013-01-01 00:30:00 -1.757766 -1.398392
2 2013-01-01 00:00:00 -1.053324 -1.015211
2013-01-01 00:10:00 0.062408 -1.476484
2013-01-01 00:20:00 -1.202875 -0.747429
2013-01-01 00:30:00 -0.798126 -0.485392
3 2013-01-01 00:00:00 0.496098 0.700073
2013-01-01 00:10:00 -0.042914 1.099115
2013-01-01 00:20:00 -1.762597 -0.239100
2013-01-01 00:30:00 -0.344125 -1.607524
[16 rows x 2 columns]
In 0.12, use table=True rather than format='table':
In [7]: df.to_hdf('test.h5','df',mode='w',format='table')
In [8]: store = pd.HDFStore('test.h5')
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->16,ncols->4,indexers->[index],dc->[date,id])
In [10]: store.select('df',where='id=0')
Out[10]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
[4 rows x 2 columns]
This is 0.13 syntax; it is a bit more tricky in 0.12:
In [18]: store.select('df',where='date>"20130101 00:10:00" & date<"20130101 00:30:00"')
Out[18]:
A B
id date
0 2013-01-01 00:20:00 -2.399376 -0.188057
1 2013-01-01 00:20:00 1.123922 -0.944170
2 2013-01-01 00:20:00 -1.202875 -0.747429
3 2013-01-01 00:20:00 -1.762597 -0.239100
[4 rows x 2 columns]
In [19]: store.close()
So, for example, to do a groupby on the id, you can select all of the unique ids (using the select_column method), then iterate over them, doing a query and performing your function on the results. This will be quite fast, and these are indexed columns. Something like this:
In [24]: ids = store.select_column('df','id').unique()
In [25]: ids
Out[25]: array([0, 1, 2, 3])
In [27]: pd.concat([ store.select('df',where='id={0}'.format(i)).sum() for i in ids ],axis=1)
Out[27]:
0 1 2 3
A -2.655407 1.394721 -2.991917 -1.653539
B 0.618590 -1.462535 -3.724516 -0.047436
[2 rows x 4 columns]
A multi-groupby is just a combination query, e.g. id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'
You might find this example here instructive as well.
Related
I have some data in a pandas dataframe that has entries at the per-second level over the course of a few hours. Entries are indexed by datetime format as TIMESTAMP. I would like to group all data within each minute and do some calculations and manipulations. That is, I would like to take all data within 09:00:00 to 09:00:59 and report some things about what happened in this minute. I would then like to do the same calculations and manipulations from 09:01:00 to 09:01:59 and so on through to the end of my dataset.
I've been fiddling around with groupby() and .resample() but I have had no success so far. I can think of a very inelegant way to do it with a series of for loops and if statements but I was wondering if there was an easier way here.
You didn't provide any data or code, so I'll just make some up. You also don't specify what calculations you want to do, so I'm just taking the mean:
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range("1/1/2020 00:00:00", "1/1/2020 04:00:00", freq="S")
>>> values = np.random.random(len(dates))
>>> df = pd.DataFrame({"dates": dates, "values": values})
>>> df.resample("1Min", on="dates").mean().reset_index()
dates values
0 2020-01-01 00:00:00 0.486985
1 2020-01-01 00:01:00 0.454880
2 2020-01-01 00:02:00 0.467397
3 2020-01-01 00:03:00 0.543838
4 2020-01-01 00:04:00 0.502764
.. ... ...
236 2020-01-01 03:56:00 0.478224
237 2020-01-01 03:57:00 0.460435
238 2020-01-01 03:58:00 0.508211
239 2020-01-01 03:59:00 0.415030
240 2020-01-01 04:00:00 0.050993
[241 rows x 2 columns]
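If the per-minute "calculations and manipulations" go beyond a single mean, the same resample can feed an .agg call; a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("1/1/2020 00:00:00", "1/1/2020 01:00:00", freq="s")
df = pd.DataFrame({"dates": dates, "values": rng.random(len(dates))})

# one row per minute, several statistics computed in one pass
per_minute = df.resample("1min", on="dates")["values"].agg(["mean", "min", "max", "count"])
```

Any function (including your own) can go in the agg list, so there's no need for per-minute loops.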
I would like to make a calculation for each group of ones that follow continuously.
I have a database on how a compressor works. Every 5 minutes I get the compressor status (whether it is ON/OFF) and the electricity consumed at that moment. The On_Off column contains a 1 when the compressor works (ON) and a 0 when it is OFF.
Compresor = pd.Series([0,0,1,1,1,0,0,1,1,1,0,0,0,0,1,1,1,0], index = pd.date_range('1/1/2012', periods=18, freq='5 min'))
df = pd.DataFrame(Compresor)
df.index.rename("Date", inplace=True)
df.set_axis(["ON_OFF"], axis=1, inplace=True)
df.loc[(df.ON_OFF == 1), 'Electricity'] = np.random.randint(4, 20, int(df.ON_OFF.sum()))
df.loc[(df.ON_OFF < 1), 'Electricity'] = 0
df
ON_OFF Electricity
Date
2012-01-01 00:00:00 0 0.0
2012-01-01 00:05:00 0 0.0
2012-01-01 00:10:00 1 4.0
2012-01-01 00:15:00 1 10.0
2012-01-01 00:20:00 1 9.0
2012-01-01 00:25:00 0 0.0
2012-01-01 00:30:00 0 0.0
2012-01-01 00:35:00 1 17.0
2012-01-01 00:40:00 1 10.0
2012-01-01 00:45:00 1 5.0
2012-01-01 00:50:00 0 0.0
2012-01-01 00:55:00 0 0.0
2012-01-01 01:00:00 0 0.0
2012-01-01 01:05:00 0 0.0
2012-01-01 01:10:00 1 14.0
2012-01-01 01:15:00 1 5.0
2012-01-01 01:20:00 1 19.0
2012-01-01 01:25:00 0 0.0
What I would like to do is add up the electrical consumption over each run of consecutive ones and put the results in another DataFrame. For example:
In this example, the first time the compressor was turned on was between 00:10 and 00:20; during this period it consumed 23 (4+10+9). The second time it stayed on between 00:35 and 00:45 and consumed 32 (17+10+5). The third time (01:10-01:20) it consumed 38 (14+5+19).
I would like to do this automatically, but I'm new to pandas and I can't think of a way to do it.
Let's say you have the following data:
from operator import itemgetter
import numpy as np
import numpy.random as rnd
import pandas as pd
from funcy import concat, repeat
from toolz import partitionby
base_data = {
'time': list(range(20)),
'state': list(concat(repeat(0, 3), repeat(1, 4), repeat(0, 5), repeat(1, 6), repeat(0, 2))),
'value': list(concat(repeat(0, 3), rnd.randint(5, 20, 4), repeat(0, 5), rnd.randint(5, 20, 6), repeat(0, 2)))
}
Well, there are two ways:
The first one is functional and independent of pandas: you simply partition your data by a field, i.e. the method processes the data sequentially and starts a new partition every time the value of the field changes. You can then simply summarize each partition as desired.
# transform into sample data
sample_data = [dict(zip(base_data.keys(), x)) for x in zip(*base_data.values())]
# and compute statistics the functional way
[sum(x['value'] for x in part if x['state'] == 1)
for part in partitionby(itemgetter('state'), sample_data)
if part[0]['state'] == 1]
There is also the pandas way, similar to what @ivallesp mentioned:
You compute the change of state by shifting the state column, then summarize your data frame by the resulting group:
pd_data = pd.DataFrame(base_data)
pd_data['shifted_state'] = pd_data['state'].shift(fill_value = pd_data['state'][0])
pd_data['cum_state'] = np.cumsum(pd_data['state'] != pd_data['shifted_state'])
pd_data[pd_data['state'] == 1].groupby('cum_state').sum()
Depending on what you and your peers can read best, you can choose your way. Also, the functional way may not be easily readable; it can also be rewritten with plain, readable loop statements.
What I would do is create a variable giving each period of activity an integer ID, then group by it and sum the Electricity column. An easy way of creating it is to cumulatively count the points where On_Off switches from 0 to 1 (the data has to be sorted by increasing date) and multiply the resulting value by the On_Off column. If you provide a reproducible example of your table in pandas, I can quickly write you the solution.
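A minimal sketch of that idea on made-up numbers (the run id comes from cumulatively counting the 0 -> 1 switches, then masking out the OFF rows):

```python
import pandas as pd

df = pd.DataFrame(
    {'ON_OFF':      [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
     'Electricity': [0, 0, 4, 10, 9, 0, 0, 17, 10, 5, 0]},
    index=pd.date_range('1/1/2012', periods=11, freq='5min'))

# run id: goes up by one at every OFF -> ON transition, stays 0 while OFF
run_id = (df['ON_OFF'].diff() == 1).cumsum() * df['ON_OFF']

# total consumption per run of consecutive ones
on = df['ON_OFF'] == 1
totals = df.loc[on, 'Electricity'].groupby(run_id[on]).sum()
```

For the data above this yields 23 for the first run (4+10+9) and 32 for the second (17+10+5).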
Hope it helps
I have a df like,
stamp value
0 00:00:00 2
1 00:00:00 3
2 01:00:00 5
converting to time delta
df['stamp']=pd.to_timedelta(df['stamp'])
slicing only odd index and adding 30 mins,
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
#print(odd_df)
1 00:30:00
Name: stamp, dtype: timedelta64[ns]
now, updating df with odd_df,
as per the documentation it should give my expected output.
expected output:
df.update(odd_df)
#print(df)
stamp value
0 00:00:00 2
1 00:30:00 3
2 01:00:00 5
What I am getting,
df.update(odd_df)
#print(df)
stamp value
0 00:30:00 00:30:00
1 00:30:00 00:30:00
2 00:30:00 00:30:00
Please help: what is wrong with this?
Try this instead:
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
This ensures you update just the values in the DataFrame specified by the .loc indexer while keeping the rest of your original DataFrame. To test, run df.shape; you will get (3, 2) with the method above.
In your code here:
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
The odd_df Series only holds the parts of your original DataFrame that you sliced. Its shape is (1,).
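Putting the .loc fix together on the frame from the question (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame({'stamp': ['00:00:00', '00:00:00', '01:00:00'],
                   'value': [2, 3, 5]})
df['stamp'] = pd.to_timedelta(df['stamp'])

# add 30 minutes to the odd rows only, in place; shape is unchanged
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
```

After this, only row 1 carries the 30-minute offset and the value column is untouched.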
I am seeing some particularly strange behavior from pivot_table (or at least I think it is strange...).
I have a dataframe extracted from a database, with dates, that I use with pivot_table to do some basic stats:
pd.pivot_table(df, values=["Diff_DGach_Dcent", "Diff_DepCh_BoxPlt", "Attente_totale"],
               index=["id_chantier", "date_bl"], aggfunc=np.sum, fill_value=0)
I can pivot on these fields, but if I also add the field "Cpt" to the values (it's a simple int field containing 1, used to count how many lines are grouped by the pivot table), it only displays the Cpt field and no longer the timedelta ones...
Is it impossible to do a pivot table on columns of different dtypes?
EDIT :
Sample of Data to be processed
Diff_DGach_Dcent Diff_DepCh_BoxPlt Attente_totale Cpt
00:21:00 00:45:00 01:23:00 1
00:26:00 00:18:00 02:16:00 1
00:15:00 00:18:00 01:25:00 1
00:25:00 00:18:00 01:25:00 1
00:26:00 00:10:00 01:20:00 1
00:20:00 00:14:00 01:38:00 1
You should set the columns parameter to specify the group that you want the values to be grouped on.
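For what it's worth, once the duration strings are parsed into real timedeltas, recent pandas versions will aggregate timedelta and int columns side by side; a sketch with made-up ids:

```python
import pandas as pd

df = pd.DataFrame({
    "id_chantier": [1, 1, 2],
    "Diff_DGach_Dcent": pd.to_timedelta(["00:21:00", "00:26:00", "00:15:00"]),
    "Cpt": [1, 1, 1],
})

# sum the timedelta column and the int counter in the same pivot
out = pd.pivot_table(df,
                     values=["Diff_DGach_Dcent", "Cpt"],
                     index=["id_chantier"],
                     aggfunc="sum")
```

Note that fill_value=0 is left out here, since 0 is not a valid fill for a timedelta column.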
Do you have sample data?
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
I have a rather straightforward problem I'd like to solve with more efficiency than I'm currently getting.
I have a bunch of data coming in as a set of monitoring metrics. Input data is structured as an array of tuples. Each tuple is (timestamp, value). Timestamps are integer epoch seconds, and values are normal floating point numbers. Example:
inArr = [ (1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8), ... ]
The timestamps are not always the same number of seconds apart, but it's usually close. Sometimes we get duplicate numbers submitted, sometimes we miss datapoints, etc.
My current solution takes the timestamps and:
finds the num seconds between each successive pair of timestamps;
finds the median of these delays;
creates an array of the correct size;
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period);
averages values that happen to go into the same time bucket;
adds data to this array according to the correct (timestamp - starttime)/median element.
if there's no value for a time range, I obviously output a None value.
Output data has to be in the format:
outArr = [ (startTime, timeStep, numVals), [ val1, val2, val3, val4, ... ] ]
I suspect this is a solved problem with Python Pandas http://pandas.pydata.org/ (or Numpy / SciPy).
Yes, my solution works, but when I'm operating on 60K datapoints it can take a tenth of a second (or more) to run. This is troublesome when I'm trying to work on large numbers of sets of data.
So, I'm looking for a solution that might run faster than my pure-Python version. I guess I'm presuming (based on a couple of previous conversations with an Argonne National Labs guy) that SciPy and Numpy are (clearing-throat) "somewhat faster" at array operations. I've looked briefly (an hour or so) at the Pandas code but it looks cumbersome to do this set of operations. Am I incorrect?
-- Edit to show expected output --
The median time between datapoints is 20 seconds, half that is 10 seconds. To make sure we put the lines well between the timestamps, we make the start time 10 seconds before the first datapoint. If we just make the start time the first timestamp, it's a lot more likely that we'll get 2 timestamps in one interval.
So, 1388435242 - 10 = 1388435232. The timestep is the median, 20 seconds. The numvals here is 3.
outArr = [ (1388435232, 20, 3), [ 12.3, 11.1, 12.8 ] ]
This is the format that Graphite expects when we're graphing the output; it's not my invention. It seems common, though, to have timeseries data be in this format - a starttime, interval, and then an array of values.
Here's a sketch
Create your input series
In [24]: x = list(zip(pd.date_range('20130101',periods=1000000,freq='s').asi8//1000000000,
                      np.random.randn(1000000)))
In [49]: x[0]
Out[49]: (1356998400, 1.2809949462375376)
Create the frame
In [25]: df = DataFrame(x,columns=['time','value'])
Make the dates a bit random (to simulate some data)
In [26]: df['time1'] = df['time'] + np.random.randint(0,10,size=1000000)
Convert the epoch seconds to datetime64[ns] dtype
In [29]: df['time2'] = pd.to_datetime(df['time1'],unit='s')
Difference the series (to create timedeltas)
In [32]: df['diff'] = df['time2'].diff()
Looks like this
In [50]: df
Out[50]:
time value time1 time2 diff
0 1356998400 -0.269644 1356998405 2013-01-01 00:00:05 NaT
1 1356998401 -0.924337 1356998401 2013-01-01 00:00:01 -00:00:04
2 1356998402 0.952466 1356998410 2013-01-01 00:00:10 00:00:09
3 1356998403 0.604783 1356998411 2013-01-01 00:00:11 00:00:01
4 1356998404 0.140927 1356998407 2013-01-01 00:00:07 -00:00:04
5 1356998405 -0.083861 1356998414 2013-01-01 00:00:14 00:00:07
6 1356998406 1.287110 1356998412 2013-01-01 00:00:12 -00:00:02
7 1356998407 0.539957 1356998414 2013-01-01 00:00:14 00:00:02
8 1356998408 0.337780 1356998412 2013-01-01 00:00:12 -00:00:02
9 1356998409 -0.368456 1356998410 2013-01-01 00:00:10 -00:00:02
10 1356998410 -0.355176 1356998414 2013-01-01 00:00:14 00:00:04
11 1356998411 -2.912447 1356998417 2013-01-01 00:00:17 00:00:03
12 1356998412 -0.003209 1356998418 2013-01-01 00:00:18 00:00:01
13 1356998413 0.122424 1356998414 2013-01-01 00:00:14 -00:00:04
14 1356998414 0.121545 1356998421 2013-01-01 00:00:21 00:00:07
15 1356998415 -0.838947 1356998417 2013-01-01 00:00:17 -00:00:04
16 1356998416 0.329681 1356998419 2013-01-01 00:00:19 00:00:02
17 1356998417 -1.071963 1356998418 2013-01-01 00:00:18 -00:00:01
18 1356998418 1.090762 1356998424 2013-01-01 00:00:24 00:00:06
19 1356998419 1.740093 1356998428 2013-01-01 00:00:28 00:00:04
20 1356998420 1.480837 1356998428 2013-01-01 00:00:28 00:00:00
21 1356998421 0.118806 1356998427 2013-01-01 00:00:27 -00:00:01
22 1356998422 -0.935749 1356998427 2013-01-01 00:00:27 00:00:00
Calc median
In [34]: df['diff'].median()
Out[34]:
0 00:00:01
dtype: timedelta64[ns]
Calc mean
In [35]: df['diff'].mean()
Out[35]:
0 00:00:00.999996
dtype: timedelta64[ns]
Should get you started
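To carry the sketch through to the requested output format, one possible continuation (just a sketch, assuming the simple bucket arithmetic below) bins each timestamp into (t - start) // step buckets, averages duplicates, and fills empty buckets with None:

```python
import pandas as pd

inArr = [(1388435242, 12.3), (1388435262, 11.1),
         (1388435281, 12.8), (1388435301, 13.0)]
df = pd.DataFrame(inArr, columns=['time', 'value'])

step = int(df['time'].diff().median())       # 20 for this data
start = int(df['time'].iloc[0]) - step // 2  # half a step before the first point

# bucket index for every reading; duplicates in a bucket get averaged
bucket = (df['time'] - start) // step
means = df.groupby(bucket)['value'].mean()

# dense output array with None for empty buckets
vals = [None] * (int(bucket.max()) + 1)
for b, v in means.items():
    vals[int(b)] = v

outArr = [(start, step, len(vals)), vals]
```

The groupby/mean handles duplicate timestamps, and the pre-sized vals list makes the gaps explicit as None.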
You can pass your inArr to a pandas Dataframe:
df = pd.DataFrame(inArr, columns=['time', 'value'])
num seconds between each successive pair of timestamps: df['time'].diff()
median delay: df['time'].diff().median()
creates an array of the correct size (I think that's taken care of?)
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period); I don't know what you mean here
averages values that happen to go into the same time bucket
For several of these problems it may make sense to convert your seconds to datetime and set it as the index:
In [39]: df['time'] = pd.to_datetime(df['time'], unit='s')
In [41]: df = df.set_index('time')
In [42]: df
Out[42]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
Then to handle multiple values in the same time, use groupby.
In [49]: df.groupby(level='time').mean()
Out[49]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
It's the same since there aren't any dupes.
I'm not sure what you mean by the last two.
Also, your desired output seems to contradict what you wanted earlier: should values with the same timestamp be averaged, or do you want them all? Maybe clear that up a bit.