I have a rather straightforward problem I'd like to solve with more efficiency than I'm currently getting.
I have a bunch of data coming in as a set of monitoring metrics. Input data is structured as an array of tuples. Each tuple is (timestamp, value). Timestamps are integer epoch seconds, and values are normal floating point numbers. Example:
inArr = [ (1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8), ... ]
The timestamps are not always the same number of seconds apart, but it's usually close. Sometimes we get duplicate numbers submitted, sometimes we miss datapoints, etc.
My current solution takes the timestamps and:
finds the num seconds between each successive pair of timestamps;
finds the median of these delays;
creates an array of the correct size;
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period);
averages values that happen to go into the same time bucket;
places each value into the array at index (timestamp - starttime) / median;
if there's no value for a time bucket, I output a None value.
Output data has to be in the format:
outArr = [ (startTime, timeStep, numVals), [ val1, val2, val3, val4, ... ] ]
I suspect this is a solved problem with Python Pandas http://pandas.pydata.org/ (or Numpy / SciPy).
Yes, my solution works, but when I'm operating on 60K datapoints it can take a tenth of a second (or more) to run. This is troublesome when I'm trying to work on large numbers of sets of data.
So, I'm looking for a solution that might run faster than my pure-Python version. I guess I'm presuming (based on a couple of previous conversations with an Argonne National Labs guy) that SciPy and Numpy are (clearing-throat) "somewhat faster" at array operations. I've looked briefly (an hour or so) at the Pandas code but it looks cumbersome to do this set of operations. Am I incorrect?
-- Edit to show expected output --
The median time between datapoints is 20 seconds, and half that is 10 seconds. To make sure the bucket boundaries fall well between the timestamps, we make the start time 10 seconds before the first datapoint. If we just made the start time the first timestamp, it would be much more likely that two timestamps end up in one interval.
So, 1388435242 - 10 = 1388435232. The timestep is the median, 20 seconds. The numvals here is 3.
outArr = [ (1388435232, 20, 3), [ 12.3, 11.1, 12.8 ] ]
This is the format that Graphite expects when we're graphing the output; it's not my invention. It seems common, though, to have timeseries data be in this format - a starttime, interval, and then an array of values.
Here's a sketch
Create your input series
In [24]: x = list(zip(pd.date_range('20130101',periods=1000000,freq='s').asi8 // 1000000000, np.random.randn(1000000)))
In [49]: x[0]
Out[49]: (1356998400, 1.2809949462375376)
Create the frame
In [25]: df = pd.DataFrame(x, columns=['time','value'])
Make the dates a bit random (to simulate some data)
In [26]: df['time1'] = df['time'] + np.random.randint(0,10,size=1000000)
Convert the epoch seconds to datetime64[ns] dtype
In [29]: df['time2'] = pd.to_datetime(df['time1'],unit='s')
Difference the series (to create timedeltas)
In [32]: df['diff'] = df['time2'].diff()
Looks like this
In [50]: df
Out[50]:
time value time1 time2 diff
0 1356998400 -0.269644 1356998405 2013-01-01 00:00:05 NaT
1 1356998401 -0.924337 1356998401 2013-01-01 00:00:01 -00:00:04
2 1356998402 0.952466 1356998410 2013-01-01 00:00:10 00:00:09
3 1356998403 0.604783 1356998411 2013-01-01 00:00:11 00:00:01
4 1356998404 0.140927 1356998407 2013-01-01 00:00:07 -00:00:04
5 1356998405 -0.083861 1356998414 2013-01-01 00:00:14 00:00:07
6 1356998406 1.287110 1356998412 2013-01-01 00:00:12 -00:00:02
7 1356998407 0.539957 1356998414 2013-01-01 00:00:14 00:00:02
8 1356998408 0.337780 1356998412 2013-01-01 00:00:12 -00:00:02
9 1356998409 -0.368456 1356998410 2013-01-01 00:00:10 -00:00:02
10 1356998410 -0.355176 1356998414 2013-01-01 00:00:14 00:00:04
11 1356998411 -2.912447 1356998417 2013-01-01 00:00:17 00:00:03
12 1356998412 -0.003209 1356998418 2013-01-01 00:00:18 00:00:01
13 1356998413 0.122424 1356998414 2013-01-01 00:00:14 -00:00:04
14 1356998414 0.121545 1356998421 2013-01-01 00:00:21 00:00:07
15 1356998415 -0.838947 1356998417 2013-01-01 00:00:17 -00:00:04
16 1356998416 0.329681 1356998419 2013-01-01 00:00:19 00:00:02
17 1356998417 -1.071963 1356998418 2013-01-01 00:00:18 -00:00:01
18 1356998418 1.090762 1356998424 2013-01-01 00:00:24 00:00:06
19 1356998419 1.740093 1356998428 2013-01-01 00:00:28 00:00:04
20 1356998420 1.480837 1356998428 2013-01-01 00:00:28 00:00:00
21 1356998421 0.118806 1356998427 2013-01-01 00:00:27 -00:00:01
22 1356998422 -0.935749 1356998427 2013-01-01 00:00:27 00:00:00
Calc median
In [34]: df['diff'].median()
Out[34]:
0 00:00:01
dtype: timedelta64[ns]
Calc mean
In [35]: df['diff'].mean()
Out[35]:
0 00:00:00.999996
dtype: timedelta64[ns]
Should get you started
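From there, with current pandas, one hedged way to finish is to set the datetime column as the index and resample at the median spacing; duplicates in the same bucket get averaged and empty buckets come back as NaN. A rough sketch (assuming the df built above, where df['diff'].median() is a Timedelta):

med = df['diff'].median()                      # a Timedelta, roughly 1 second here
rule = '{}S'.format(int(med.total_seconds()))  # e.g. '1S'

ts = df.set_index('time2')['value']
binned = ts.resample(rule).mean()              # duplicates averaged, gaps become NaN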
You can pass your inArr to a pandas DataFrame:
df = pd.DataFrame(inArr, columns=['time', 'value'])
num seconds between each successive pair of timestamps: df['time'].diff()
median delay: df['time'].diff().median()
creates an array of the correct size (I think that's taken care of?)
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period); I don't know what you mean here
averages values that happen to go into the same time bucket
For several of these problems it may make sense to convert your seconds to datetime and set it as the index:
In [39]: df['time'] = pd.to_datetime(df['time'], unit='s')
In [41]: df = df.set_index('time')
In [42]: df
Out[42]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
Then to handle multiple values in the same time, use groupby.
In [49]: df.groupby(level='time').mean()
Out[49]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
It's the same since there aren't any dupes.
Not sure what you mean about the last two.
And your desired output seems to contradict what you wanted earlier. Your values with the same timestamp should be averaged, yet now you want them all? Maybe clear that up a bit.
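For what it's worth, here is a rough, untested end-to-end sketch of the bucketing described in the question (plain pandas/NumPy on the small inArr, with None for empty buckets); treat it as a starting point rather than a drop-in solution:

import numpy as np
import pandas as pd

inArr = [(1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8)]

times = np.array([t for t, v in inArr])
vals = np.array([v for t, v in inArr])

step = int(np.median(np.diff(times)))  # the question quotes 20 s for the full series
start = times[0] - step // 2           # start half a step before the first point

bins = (times - start) // step                # bucket index for each sample
means = pd.Series(vals).groupby(bins).mean()  # average duplicates in the same bucket

n = int(bins.max()) + 1
out_vals = [None] * n                  # None where a bucket has no data
for i, m in means.items():
    out_vals[int(i)] = m

outArr = [(int(start), step, n), out_vals]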
Related
I am using pandas in a Python notebook to do some data analysis. I am trying to use a simple nested loop, but it performs very badly.
The problem is that I have two tables made of two columns each, the first containing time stamps (hh:mm:ss) and the second containing some integer values.
The first table (big_table) contains 86400 rows, one for each possible timestamp in a day, and each integer value is initially set to 0.
The second table (small_table) contains less rows, one for each timestamp in which an actual integer value is registered.
The goal is to map the small_table integers to the big_table integers in the rows where the timestamp is the same. I also want to carry forward the last written integer when a big_table timestamp is not found among the small_table timestamps.
I am doing this by trying to "force" a Java/C way of doing it, iterating over each element and accessing them as the [i][j] elements of a matrix.
Is there any better way of doing this using pandas/numpy?
Code:
rel_time_pointer = small_table.INTEGER.iloc[0]

for i in range(small_table.shape[0]):
    for j in range(big_table.shape[0]):
        if (small_table.time.iloc[i] == big_table.time.iloc[j]):
            rel_time_pointer = small_table.INTEGER.iloc[i]
            big_table.INTEGER.iloc[j] = rel_time_pointer
            break
        else:
            big_table.INTEGER.iloc[j] = rel_time_pointer
example:
big_table:
time INTEGER
00:00:00 0
00:00:01 0
00:00:02 0
00:00:03 0
00:00:04 0
00:00:05 0
00:00:06 0
.
.
.
23:59:59 0
small_table:
time INTEGER
00:00:03 100
00:00:05 100
big_table_after_execution:
time INTEGER
00:00:00 0
00:00:01 0
00:00:02 0
00:00:03 100
00:00:04 100
00:00:05 200
00:00:06 200
Using @gtomer's merge command:
big_table = big_table.merge(small_table, on='time', how='left')
and adding .fillna(0) at the end of the command I get:
time INTEGER__x INTEGER__y
00:00:00 0 0.0
00:00:01 0 0.0
... ... ...
with the INTEGER values of small_table in the right places of big_table_after_execution. Now I'm trying to set the remaining 0 values to the last non-zero value above them, so that I get:
time INTEGER__x INTEGER__y
00:00:00 0 0.0
00:00:01 0 0.0
00:00:02 0 0.0
00:00:03 0 1.0
00:00:04 0 1.0
00:00:05 0 2.0
00:00:06 0 2.0
instead of:
00:00:00 0 0.0
00:00:01 0 0.0
00:00:02 0 0.0
00:00:03 0 1.0
00:00:04 0 0.0
00:00:05 0 2.0
00:00:06 0 0.0
Please try the following:
big_table_after_execution = big_table.merge(small_table, on='time', how='left')
Please post the output you get and we'll continue from there
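To build on that with the forward-fill the follow-up above is after, a rough (untested) sketch along these lines might work; the miniature tables are hypothetical and INTEGER_y is just pandas' default merge suffix:

import pandas as pd

big_table = pd.DataFrame({'time': ['00:00:0%d' % i for i in range(7)], 'INTEGER': 0})
small_table = pd.DataFrame({'time': ['00:00:03', '00:00:05'], 'INTEGER': [1, 2]})

merged = big_table.merge(small_table, on='time', how='left')
# INTEGER_y holds small_table's values where the times matched, NaN elsewhere;
# forward-filling carries the last written value down, and the leading NaNs become 0
big_table['INTEGER'] = merged['INTEGER_y'].ffill().fillna(0).astype(int)
print(big_table)   # INTEGER column: 0, 0, 0, 1, 1, 2, 2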
Numpy iteration and enumeration options:
if you have a 2d np.ndarray type object, then iteration can be achieved in one line as follows:
for (i,j), value in np.ndenumerate(ndarray_object):...
This works like regular enumerate, but allows you to deconstruct the higher dimensional index into a tuple of appropriate dimensions.
You could maybe place your values into a 2d array structure from numpy and iterate through them like that?
The easiest way to modify what you already have so that it looks less 'c-like' is probably to just use regular enumerate:
for small_index, small_value in enumerate(small_table):
for big_index, big_value in enumerate(big_table):...
zip
Another option for grouping your iteration together is the zip() function, which will combine iterables 1 and 2, but it will only produce a resultant iterable with a length equal to the shorter iterable's length.
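A minimal illustration with hypothetical miniature tables, just to show the lockstep behaviour:

import pandas as pd

small_table = pd.DataFrame({'time': ['00:00:03', '00:00:05'], 'INTEGER': [100, 100]})
big_table = pd.DataFrame({'time': ['00:00:0%d' % i for i in range(7)], 'INTEGER': 0})

# zip stops after the shorter iterable -- here, after small_table's two rows
for small_row, big_row in zip(small_table.itertuples(index=False),
                              big_table.itertuples(index=False)):
    print(small_row.time, small_row.INTEGER, big_row.time, big_row.INTEGER)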
I have a Pandas DataFrame where the index is datetimes for every 12 minutes in a day (120 rows total). I went ahead and resampled the data to every 30 minutes.
Time Rain_Rate
1 2014-04-02 00:00:00 0.50
2 2014-04-02 00:30:00 1.10
3 2014-04-02 01:00:00 0.48
4 2014-04-02 01:30:00 2.30
5 2014-04-02 02:00:00 4.10
6 2014-04-02 02:30:00 5.00
7 2014-04-02 03:00:00 3.20
I want to take 3 hour means centered on hours 00, 03, 06, 09, 12, 15 ,18, and 21. I want the mean to consist of 1.5 hours before 03:00:00 (so 01:30:00) and 1.5 hours after 03:00:00 (04:30:00). The 06:00:00 time would overlap with the 03:00:00 average (they would both use 04:30:00).
Is there a way to do this using pandas? I've tried a few things but they haven't worked.
Method 1
I'm going to suggest just changing your resample from the get-go to get the chunks you want. Here's some fake data resembling yours, before resampling at all:
dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12T', closed='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'
#df.head()
Rain_Rate
Time
2014-04-02 00:00:00 0.616588
2014-04-02 00:12:00 0.201390
2014-04-02 00:24:00 0.802754
2014-04-02 00:36:00 0.712743
2014-04-02 00:48:00 0.711766
Averaging by 3 hour chunks initially will be the same as doing 30 minute chunks and then 3 hour chunks. You just have to tweak a couple of things to get the bins you want. First, add the bin you will start from (i.e. 10:30 pm on the previous day, even if there's no data there; the first bin runs from 10:30 pm to 1:30 am), then resample starting from this point:
before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()
output = df.resample('3H', base=22.5, loffset='90min').mean()
The base parameter here means start at the 22.5th hour (10:30), and loffset means push the bin names back by 90 minutes. You get the following output:
Rain_Rate
Time
2014-04-02 00:00:00 0.555515
2014-04-02 03:00:00 0.546571
2014-04-02 06:00:00 0.439953
2014-04-02 09:00:00 0.460898
2014-04-02 12:00:00 0.506690
2014-04-02 15:00:00 0.605775
2014-04-02 18:00:00 0.448838
2014-04-02 21:00:00 0.387380
2014-04-03 00:00:00 0.604204 #this is the bin at midnight on the following day
You could also start with the data binned at 30 minutes and use this method, and you should get the same answer.*
Method 2
Another approach would be to find the locations of the indexes you want to create averages for, and then calculate the averages for entries in the 3 hours surrounding:
resampled = df.resample('30T').mean()  # like your data in the post
centers = [0,3,6,9,12,15,18,21]
mask = np.where(df.index.hour.isin(centers) & (df.index.minute==0), True, False)
df_centers = df.index[mask]
output = []
for center in df_centers:
    cond1 = (df.index >= (center - pd.Timedelta(hours=1.5)))
    cond2 = (df.index <= (center + pd.Timedelta(hours=1.5)))
    output.append(df[cond1 & cond2].values.mean())
Output here is the same, but the answers are in a list (and the last point of "24 hours" is not included):
[0.5555146139562004,
0.5465709237162698,
0.43995277270996735,
0.46089800625663596,
0.5066902552121085,
0.6057747262752732,
0.44883794039466535,
0.3873795731806939]
*You mentioned you wanted some points on the edge of bins to be included in both bins. resample doesn't do this (and generally I don't think most people want to), but the second method is explicit about doing so (by using >= and <= in cond1 and cond2). However, these two methods achieve the same result here, presumably because the use of resample at different stages causes data points to be included in different bins. It's hard for me to wrap my head around that, but one could do a little manual binning to verify what is going on. The point is, I would recommend spot-checking the output of these methods (or any resample-based method) against your raw data to make sure things look correct. For these examples, I did so using Excel.
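For instance, a quick hedged spot-check of the 03:00 bin against the raw 12-minute data (using the df and output from Method 1) could look like this:

# raw samples from 01:30 through 04:30, the window centred on 03:00
window = df['2014-04-02 01:30:00':'2014-04-02 04:30:00']
print(window['Rain_Rate'].mean())
print(output.loc['2014-04-02 03:00:00'])  # compare with the resampled value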
My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum of the values within a specified time range and save it to a new dataframe.
Let's say I want the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00, i.e. a 6-hour window that spans two days. I want to sum the data in each such time range (10 PM to 4 AM the next day) and put the results in a different dataframe, for example df_timerange_sum. Note that the range crosses midnight, so it covers two different dates.
What did I do?
I used sum() to calculate the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df, not of the time range I specified.
I also tried resample, but I cannot find a way to do it for a specific time range.
df['time'].dt.hour.between(10, 4) is always False because no number is at least 10 and at most 4 at the same time. What you want is to mark between(4, 21) and then negate that to get the other hours.
Here's what I would do:
# mark those between 4AM and 10PM
# data we want is where s==False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() marks the consecutive False blocks
# on which we will take the sum
blocks = s.cumsum()
# again we only care for ~s
(df[~s].groupby(blocks[~s], as_index=False)  # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})  # time: min -- the beginning of each block
)                                            # Open: sum -- the sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas.
df.set_index(['time'], inplace=True)  # make time the index col (not 100% necessary)

# new df that stores your desired output + start and end times if you need them
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])

df2['start_time'] = df[df.index.hour == 22].index  # gets/stores all start datetimes
df2['end_time'] = df[df.index.hour == 4].index     # gets/stores all end datetimes

for i, row in df2.iterrows():
    # .at replaces the deprecated set_value
    df2.at[i, 'sum_Open'] = df[(df.index >= row['start_time']) &
                               (df.index <= row['end_time'])]['Open'].sum()
you'd have to add an if statement or something to handle the last day which ends at 11pm.
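One hedged way to handle that edge case (and the 4 AM on the very first day, which has no 10 PM before it) is to trim the two index lists before building df2; a sketch, assuming hourly data that starts at midnight:

start_times = df[df.index.hour == 22].index
end_times = df[df.index.hour == 4].index

end_times = end_times[end_times > start_times[0]]  # drop any 4 AM before the first 10 PM
start_times = start_times[:len(end_times)]         # drop a trailing 10 PM with no 4 AM after it

df2 = pd.DataFrame({'start_time': start_times, 'end_time': end_times})
df2['sum_Open'] = 0.0  # then fill it in with the same loop as above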
Say I have a dataframe with several timestamps and values. I would like to measure Δ values / Δt every 2.5 seconds. Does Pandas provide any utilities for time differentiation?
time_stamp values
19492 2014-10-06 17:59:40.016000-04:00 1832128
167106 2014-10-06 17:59:41.771000-04:00 2671048
202511 2014-10-06 17:59:43.001000-04:00 2019434
161457 2014-10-06 17:59:44.792000-04:00 1294051
203944 2014-10-06 17:59:48.741000-04:00 867856
It most certainly does. First, you'll need to convert your timestamps into a pandas DatetimeIndex and then use the resampling and offset functionality available to series/dataframes indexed with that class. Helpful documentation here. Read more here about offset aliases.
This code should resample your data to 2.5s intervals
# df is your dataframe; assumes the value column is named 'values'
index = pd.DatetimeIndex(df['time_stamp'])
values = pd.Series(df['values'].values, index=index)
# Read the link above about the different Offset Aliases; S = seconds
resampled_values = values.resample('2.5S').mean()
resampled_values.diff()  # compute the difference between each point!
That should do it.
If you really want the time derivative, then you also need to divide by the time difference (delta time, dt) since the last sample.
An example:
dti = pd.DatetimeIndex([
'2018-01-01 00:00:00',
'2018-01-01 00:00:02',
'2018-01-01 00:00:03'])
X = pd.DataFrame({'data': [1,3,4]}, index=dti)
X.head()
data
2018-01-01 00:00:00 1
2018-01-01 00:00:02 3
2018-01-01 00:00:03 4
You can find the time delta by using the diff() on the DatetimeIndex. This gives you a series of type Time Deltas. You only need the values in seconds, though
dt = pd.Series(X.index).diff().dt.seconds.values
dXdt = X.diff().div(dt, axis=0)
dXdt.head()
data
2018-01-01 00:00:00 NaN
2018-01-01 00:00:02 1.0
2018-01-01 00:00:03 1.0
As you can see, this approach takes into account that there are two seconds between the first two values, and only one between the two last values. :)
I'm trying to understand the ideal way to organise data within Pandas to achieve the best aggregating performance. The data I am dealing with comes in files of the form yyyy-mm.csv, which I just read_csv in and then to_hdf out. It generally looks a bit like this:
ObjectID Timestamp ParamA ParamB --> ParamZ
1 2013-01-01 00:00:00 1 9
2 2013-01-01 00:00:00 3 2
1 2013-01-01 00:10:00 8 11
2 2013-01-01 00:10:00 6 14
There are about 50 object ids and readings for each batch of 10 minutes for the whole month. The end result I want to achieve is aggregated data (e.g. the mean) for a single parameter grouped by month (or potentially finer resolution eventually) over say 5 years.
What I've discovered so far is that a HDFStore.select of a single column isn't really a great deal quicker than bringing all of those params into a single data frame at once. Therefore it feels very wasteful and the performance is not great. Without knowing exactly why this is, I can't really decide the best way to move forward.

It seems that if the data were transposed such that yyyy-mm ran along the x axis with dd hh:mm:ss down the y axis, and there were one of these data frames per parameter, then the performance would massively improve as it could bring in more data in one hit. The groupbys are really quick once things have been read in from disk. However, I'm not at all convinced that this is how it is supposed to be used. Can anyone advise the best way to organise and store the data?
Thanks
Please review the HDFStore docs here, and the cookbook recipes here.
PyTables stores data in a row-oriented format, so it behooves you to generally have long and not so wide tables. However, if you tend to query and need/want the entire row then the width does not present a problem.
On the other hand, if you are generally after a small subset of columns, you will want to shard the table into multiples (possibly with the same indexing scheme), so you can use a 'master' table to run the query, then select 'columns' (other tables) as needed. You can accomplish this via the append_to_multiple/select_as_multiple methods, for example. Taken to the extreme, you could store a single column in a separate group and make yourself a column-oriented table. However, this will substantially slow things down if, say, you tend to select a lot of columns.
Furthermore you always want to have the queryable columns as indexes or data_columns, as these allow queries in the first place and are indexed.
So it comes down to the ratio of queries that select lots of columns vs single-column selections.
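For instance, a rough sketch of that sharding approach (hypothetical file, table and column names; untested, so check the HDFStore docs for the exact options):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['ParamA', 'ParamB', 'ParamC', 'ParamD'])
df['ObjectID'] = [1, 2] * 4
df['Timestamp'] = pd.date_range('2013-01-01', periods=8, freq='10T')

store = pd.HDFStore('sharded.h5', mode='w')

# 'master' holds the queryable columns; None means "all remaining columns" go to 'params'
store.append_to_multiple({'master': ['ObjectID', 'Timestamp', 'ParamA'], 'params': None},
                         df, selector='master', data_columns=['ObjectID', 'Timestamp'])

# run the query against the master table, then pull the matching rows from both tables
result = store.select_as_multiple(['master', 'params'],
                                  where=["ObjectID=1"],
                                  selector='master')
store.close()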
For example
In [5]: df = pd.DataFrame(np.random.randn(16, 2),
                          columns=['A', 'B'],
                          index=pd.MultiIndex.from_tuples(
                              [(i, j) for i in range(4)
                                      for j in pd.date_range('20130101 00:00:00', periods=4, freq='10T')],
                              names=['id', 'date']))
In [6]: df
Out[6]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
1 2013-01-01 00:00:00 -0.009998 0.239947
2013-01-01 00:10:00 2.038563 0.640080
2013-01-01 00:20:00 1.123922 -0.944170
2013-01-01 00:30:00 -1.757766 -1.398392
2 2013-01-01 00:00:00 -1.053324 -1.015211
2013-01-01 00:10:00 0.062408 -1.476484
2013-01-01 00:20:00 -1.202875 -0.747429
2013-01-01 00:30:00 -0.798126 -0.485392
3 2013-01-01 00:00:00 0.496098 0.700073
2013-01-01 00:10:00 -0.042914 1.099115
2013-01-01 00:20:00 -1.762597 -0.239100
2013-01-01 00:30:00 -0.344125 -1.607524
[16 rows x 2 columns]
In 0.12, use table=True rather than format='table'.
In [7]: df.to_hdf('test.h5','df',mode='w',format='table')
In [8]: store = pd.HDFStore('test.h5')
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->16,ncols->4,indexers->[index],dc->[date,id])
In [10]: store.select('df',where='id=0')
Out[10]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
[4 rows x 2 columns]
This is 0.13 syntax, this is a bit more tricky in 0.12
In [18]: store.select('df',where='date>"20130101 00:10:00" & date<"20130101 00:30:00"')
Out[18]:
A B
id date
0 2013-01-01 00:20:00 -2.399376 -0.188057
1 2013-01-01 00:20:00 1.123922 -0.944170
2 2013-01-01 00:20:00 -1.202875 -0.747429
3 2013-01-01 00:20:00 -1.762597 -0.239100
[4 rows x 2 columns]
In [19]: store.close()
So, for example, to do a groupby on the id, you can select all of the unique ids (use the select_column method), then iterate over these, doing a query and performing your function on the results. This will be quite fast, and these are indexed columns. Something like this:
In [24]: ids = store.select_column('df','id').unique()
In [25]: ids
Out[25]: array([0, 1, 2, 3])
In [27]: pd.concat([ store.select('df',where='id={0}'.format(i)).sum() for i in ids ],axis=1)
Out[27]:
0 1 2 3
A -2.655407 1.394721 -2.991917 -1.653539
B 0.618590 -1.462535 -3.724516 -0.047436
[2 rows x 4 columns]
A multi-groupby is just a combination query, e.g. id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'
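For example, assuming the store is still open, that query can be passed straight to select:

store.select('df', where="id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'")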
You might find this example instructive as well here