Python Pandas MA for irregular dataframe - python

I'd like to calculate a rolling moving average for a data set that is timestamped in ms but is irregular. For a 2-day dataframe, the irregular data set has ~36K records. If I resample into ms bars, I melt the computer: the result has 32M bars.
To be clear, consider the following data set taken from the Pandas docs:
(I've changed the NaN to 0)
df = pd.DataFrame({'B': [0, 1, 2, 0, 4]},  # the docs' NaN replaced by 0
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06')])
df.rolling('2s').mean()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 1.5
2013-01-01 09:00:05 0.0
2013-01-01 09:00:06 2.0
But the answer I'd like is:
df.rolling('2s').mean()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 0.5
2013-01-01 09:00:03 1.5
2013-01-01 09:00:05 1.0
2013-01-01 09:00:06 2.0
This has the entries rolled forward (ffill style) in order to calc the mean. I'd like to solve this problem without exploding the memory usage and without just going through it sequentially (which I know I can do).
I had thought that something like:
df.rolling('2s', freq='1s').mean()
would work but it throws off an error expecting 7 rows but having only 5 (ValueError: Shape of passed values is (1,5), indices imply (1,7)).
If I resample into another dataframe using pad (forward fill) and then do a rolling mean, it works:
df2 = df.resample('1s').ffill()  # spelled .pad() in older pandas
df2.rolling('2s').mean()
Is there a built in for this? Or do I just iterate through?
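A minimal sketch of the resample-then-roll workaround from the question, using the sample frame with the NaN replaced by 0 and `ffill()` in place of `pad()`. The final `.loc[df.index]` step is an addition, not part of the question: it only trims the regular 1-second grid back to the five original timestamps. This still builds the full regular grid, so it does not remove the memory cost at ms resolution.

```python
import pandas as pd

# Sample frame from the question (NaN replaced by 0).
df = pd.DataFrame(
    {"B": [0, 1, 2, 0, 4]},
    index=pd.to_datetime([
        "2013-01-01 09:00:00",
        "2013-01-01 09:00:02",
        "2013-01-01 09:00:03",
        "2013-01-01 09:00:05",
        "2013-01-01 09:00:06",
    ]),
)

# Forward-fill onto a regular 1-second grid, then take the 2-second rolling mean.
regular = df.resample("1s").ffill()
rolled = regular.rolling("2s").mean()

# Keep only the rows at the original, irregular timestamps.
result = rolled.loc[df.index]
print(result)
```

On this sample the selected rows are 0.0, 0.5, 1.5, 1.0 and 2.0, which is the table the question asks for.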

Related

How do I plot a scatter graph comparing two dataframes?

I have two separate DataFrames, which both contain rainfall amounts and dates corresponding to them.
df1:
time tp
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 0.0
3 2013-01-01 03:00:00 0.0
4 2013-01-01 04:00:00 0.0
... ...
8755 2013-12-31 19:00:00 0.0
8756 2013-12-31 20:00:00 0.0
8757 2013-12-31 21:00:00 0.0
8758 2013-12-31 22:00:00 0.0
8759 2013-12-31 23:00:00 0.0
[8760 rows x 2 columns]
df2:
time tp
0 2013-07-18T18:00:01 0.002794
1 2013-07-18T20:00:00 0.002794
2 2013-07-18T21:00:00 0.002794
3 2013-07-18T22:00:00 0.002794
4 2013-07-19T00:00:00 0.000000
... ...
9656 2013-12-30T13:30:00 0.000000
9657 2013-12-30T23:30:00 0.000000
9658 2013-12-31T00:00:00 0.000000
9659 2013-12-31T00:00:00 0.000000
9660 2014-01-01T00:00:00 0.000000
[9661 rows x 2 columns]
I'm trying to plot a scatter graph comparing the two data frames. The way I'm doing it is by choosing a specific date and time and plotting the df1 tp on one axis and df2 tp on the other axis.
For example,
If the date/time on both dataframes = 2013-12-31 19:00:00, then plot tp for df1 onto x-axis, and tp for df2 on the y-axis.
To solve this, I tried using the following:
df1['dates_match'] = np.where(df1['time'] == df2['time'], 'True', 'False')
which will tell me if the dates match, and if they do I can plot. The problem arises as I have a different number of rows on each dataframe, and most methods only allow comparison of dataframes with exactly the same amount of rows.
Does anyone know of an alternative method I could use to plot the graph?
Thanks in advance!
The main goal is to plot two time series that apparently don't have the same frequency so you can compare them.
Since the main issue is the mismatched timestamps, let's tackle that with pandas resample so each observation falls on a uniform time grid. To take the sum over 30-minute intervals (feel free to change the interval and the aggregation function), you can do:
df1["time"] = pd.to_datetime(df1["time"])  # resample needs a datetime index
df2["time"] = pd.to_datetime(df2["time"])
df1.set_index("time", inplace=True)
df2.set_index("time", inplace=True)
df1_resampled = df1.resample("30min").sum()  # sum over 30-minute intervals
df2_resampled = df2.resample("30min").sum()  # sum over 30-minute intervals
Now that the timestamps are aligned, you can merge the resampled dataframes, if you want to, and then plot the result:
df_joined = df1_resampled.join(df2_resampled, lsuffix="_1", rsuffix="_2")
df_joined.plot(marker="o", figsize=(12,6))
# df_joined.plot(subplots=True) if you want to plot them separately
Since df1 starts on 2013-01-01 and df2 on 2013-07-18, there is an initial period where only df1 has data. If you want to plot only the overlapping period, pass how="inner" when joining the two dataframes.
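For the scatter plot the question actually asks about (one point per timestamp that appears in both frames), an inner merge on the time column is one possible route. The sketch below uses small made-up frames in the same shape as df1 and df2; the values and the `_1`/`_2` suffixes are illustrative assumptions, not data from the question.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (hourly) and df2 (irregular, ISO strings).
df1 = pd.DataFrame({
    "time": pd.date_range("2013-12-31 19:00", periods=5, freq="h"),
    "tp": [0.0, 0.1, 0.2, 0.3, 0.4],
})
df2 = pd.DataFrame({
    "time": ["2013-12-31T19:00:00", "2013-12-31T20:30:00", "2013-12-31T21:00:00"],
    "tp": [0.5, 0.6, 0.7],
})
df2["time"] = pd.to_datetime(df2["time"])  # strings must become datetimes to match

# Keep only timestamps present in both frames; row counts need not be equal.
pairs = df1.merge(df2, on="time", how="inner", suffixes=("_1", "_2"))
print(pairs)
```

With matplotlib installed, `pairs.plot.scatter(x="tp_1", y="tp_2")` would then draw df1's tp against df2's tp. A timestamp that is duplicated in one frame (df2 lists 2013-12-31T00:00:00 twice) yields one merged row per duplicate, so it may be worth dropping or aggregating duplicates first.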

Autofill datetime in Pandas by previous increment

Given previous datetime values in a Pandas DataFrame--either as an index or as values in a column--is there a way to "autofill" remaining time increments based on the previous fixed increments?
For example, given:
import pandas as pd
import numpy as np
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
index = [pd.Timestamp('20130101 09:00:00'),
pd.Timestamp('20130101 09:00:05'),
pd.Timestamp('20130101 09:00:10'),
np.nan,
np.nan])
I would like to apply a function to yield:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:05 1.0
2013-01-01 09:00:10 2.0
2013-01-01 09:00:15 NaN
2013-01-01 09:00:20 4.0
Where I have missing timesteps for my last two data points. Here, timesteps are fixed in 5 second increments.
This will be for thousands of rows. While I might reset_index and then create a function to apply to each row, this seems cumbersome. Is there a slick or built-in way to do this that I'm not finding?
Assuming the first index value is a valid datetime and all the values are spaced 5s apart, you could do the following:
df.index = pd.date_range(df.index[0], periods=len(df), freq='5s')
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:05 1.0
2013-01-01 09:00:10 2.0
2013-01-01 09:00:15 NaN
2013-01-01 09:00:20 4.0
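If the increment should not be hard-coded, it can be read off the first two timestamps instead. This is a small variation on the answer above, not something stated in it, and it assumes the first two index entries are valid and that the spacing is constant throughout.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"B": [0, 1, 2, np.nan, 4]},
    index=[
        pd.Timestamp("20130101 09:00:00"),
        pd.Timestamp("20130101 09:00:05"),
        pd.Timestamp("20130101 09:00:10"),
        np.nan,
        np.nan,
    ],
)

known = pd.to_datetime(df.index)  # missing entries become NaT
step = known[1] - known[0]        # assumption: the first two stamps define the increment

# Rebuild the whole index from the first stamp and the inferred step.
df.index = pd.date_range(known[0], periods=len(df), freq=step)
print(df)
```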
This solution might also work for you, though it relies on reset_index():
new_dateindex = pd.Series(pd.date_range(start=pd.Timestamp('20130101 09:00:00'), periods=1000, freq='5s'), name='Date')
# 'periods=1000' can be changed to 'periods=len(df.index)'
df.reset_index().join(new_dateindex, how='right')

How to match time series in python?

I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, by discarding the extra data, in order to run some regression analysis?
You can use the combine_first function of pandas Series. Where both series contain the same index label, it takes the element of the calling object; labels present in only one series are kept as well.
The following code shows a minimal example:
idx1 = pd.date_range('2018-01-01', periods=5, freq='h')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='h')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
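Since the question asks to discard the extra data, an inner alignment may be closer to what is wanted than combine_first, which keeps the union of the two indexes. The sketch below swaps in `Series.align` with `join="inner"` on the same toy series; it is an alternative to the answer's method, not a restatement of it.

```python
import pandas as pd

idx1 = pd.date_range("2018-01-01", periods=5, freq="h")
idx2 = pd.date_range("2018-01-01 01:00", periods=5, freq="h")
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)

# Keep only the timestamps that occur in both series.
ts1_common, ts2_common = ts1.align(ts2, join="inner")
print(ts1_common)
print(ts2_common)
```

Both results share the same index (01:00 to 04:00 here), so they can be passed to a regression routine row by row.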

How to resample data in a single dataframe within 3 distinct groups

I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1h').agg({'energy_kwh': 'sum', 'average_w': 'mean', 'norm_average_kw/kw': 'mean', 'temperature_degc': 'mean', 'voltage_v': 'mean'})
df
To get a result like (please forgive the column formatting, I have no idea how to paste this properly to make it look nice):
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, the original CSV has a column containing URLs: in the dataset of 100,000 rows, there are 3 different URLs (effectively IDs). I want each to be resampled individually rather than having a 'lump' resample from all (e.g. 9.00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by URL, as in this minimal example:
In [157]:
df = pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5h')  # create random dataset
df.set_index('Datetime', inplace=True)  # moves the column into the index
df['Location'] = np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
print(df.groupby('Location')['Val'].resample('10D').mean().to_frame())
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
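For the question's case of several columns with different aggregations, one possible sketch groups by the user column together with `pd.Grouper(freq=...)`. This is a swapped-in technique rather than the groupby-then-resample call shown above, and the column name `url` and the sample values are assumptions, since the question does not name its URL column.

```python
import pandas as pd

# Hypothetical 10-minute readings for two users over two hours.
idx = pd.date_range("2013-04-30 06:00", periods=12, freq="10min")
df = pd.DataFrame(
    {
        "url": ["user_a"] * 12 + ["user_b"] * 12,
        "energy_kwh": [1.0] * 12 + [2.0] * 12,
        "average_w": [30.0] * 12 + [60.0] * 12,
    },
    index=idx.append(idx),
)
df.index.name = "Datetime"
df = df.sort_index()

# One row per (user, hour): sums for energy, means for power.
hourly = df.groupby(["url", pd.Grouper(freq="1h")]).agg(
    {"energy_kwh": "sum", "average_w": "mean"}
)
print(hourly)
```

Unlike a plain resample, this form only emits hours in which a user actually has data, which may or may not be what is wanted.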

Pandas and HDF5 aggregate performance

I'm trying to understand the ideal way to organise data within Pandas to achieve the best aggregating performance. The data I am dealing with is of the form yyyy-mm.csv which I just read_csv in and then to_hdf out. It generally looks something a bit like this:
ObjectID Timestamp ParamA ParamB --> ParamZ
1 2013-01-01 00:00:00 1 9
2 2013-01-01 00:00:00 3 2
1 2013-01-01 00:10:00 8 11
2 2013-01-01 00:10:00 6 14
There are about 50 object ids and readings for each batch of 10 minutes for the whole month. The end result I want to achieve is aggregated data (e.g. the mean) for a single parameter grouped by month (or potentially finer resolution eventually) over say 5 years.
What I've discovered so far is that a HDFStore.select of a single column isn't really a great deal quicker than bringing in all of those params into a single data frame at once. Therefore it feels very wasteful and the performance is not great. Without knowing exactly why this is, I can't really decide the best way to move forward. It seems that if the data were transposed such that the yyyy-mm was along the x axis with the dd hh:mm:ss down the y axis, and there were one of these data frames per parameter that the performance would massively improve as it could bring in more data in one hit. The groupby's are really quick once things have been read in from disk. However I'm not at all convinced that this is how it is supposed to be used. Can anyone advise the best way to organise and store the data?
Thanks
Please review the HDFStore docs here, and the cookbook recipes here
PyTables stores data in a row-oriented format, so it behooves you to generally have long and not so wide tables. However, if you tend to query and need/want the entire row then the width does not present a problem.
On the other hand, if you are generally after a small subset of columns, you will want to shard the table into multiples (possibly with the same indexing scheme), so you can use a 'master' table to run the query, then select 'columns' (other tables) as needed. You can accomplish this via the append_to_multiple/select_as_multiple methods, for example. Taken to the extreme, you could store each column in a separate group and make yourself a column-oriented table. However, this will slow down substantially if, say, you tend to select a lot of columns.
Furthermore you always want to have the queryable columns as indexes or data_columns, as these allow queries in the first place and are indexed.
So it comes down to the ratio of queries that select lots of columns vs single-column selections.
For example
In [5]: df = pd.DataFrame(np.random.randn(16, 2),
                          columns=['A', 'B'],
                          index=pd.MultiIndex.from_tuples(
                              [(i, j) for i in range(4)
                               for j in pd.date_range('20130101 00:00:00', periods=4, freq='10min')],
                              names=['id', 'date']))
In [6]: df
Out[6]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
1 2013-01-01 00:00:00 -0.009998 0.239947
2013-01-01 00:10:00 2.038563 0.640080
2013-01-01 00:20:00 1.123922 -0.944170
2013-01-01 00:30:00 -1.757766 -1.398392
2 2013-01-01 00:00:00 -1.053324 -1.015211
2013-01-01 00:10:00 0.062408 -1.476484
2013-01-01 00:20:00 -1.202875 -0.747429
2013-01-01 00:30:00 -0.798126 -0.485392
3 2013-01-01 00:00:00 0.496098 0.700073
2013-01-01 00:10:00 -0.042914 1.099115
2013-01-01 00:20:00 -1.762597 -0.239100
2013-01-01 00:30:00 -0.344125 -1.607524
[16 rows x 2 columns]
In 0.12, use table=True rather than format
In [7]: df.to_hdf('test.h5','df',mode='w',format='table')
In [8]: store = pd.HDFStore('test.h5')
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->16,ncols->4,indexers->[index],dc->[date,id])
In [10]: store.select('df',where='id=0')
Out[10]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
[4 rows x 2 columns]
This is 0.13 syntax, this is a bit more tricky in 0.12
In [18]: store.select('df',where='date>"20130101 00:10:00" & date<"20130101 00:30:00"')
Out[18]:
A B
id date
0 2013-01-01 00:20:00 -2.399376 -0.188057
1 2013-01-01 00:20:00 1.123922 -0.944170
2 2013-01-01 00:20:00 -1.202875 -0.747429
3 2013-01-01 00:20:00 -1.762597 -0.239100
[4 rows x 2 columns]
In [19]: store.close()
So, for example, to do a groupby on the id, you can select all of the unique ids (use the select_column method), then iterate over them, running a query and applying your function to the results. This will be quite fast, as these are indexed columns. Something like this:
In [24]: ids = store.select_column('df','id').unique()
In [25]: ids
Out[25]: array([0, 1, 2, 3])
In [27]: pd.concat([ store.select('df',where='id={0}'.format(i)).sum() for i in ids ],axis=1)
Out[27]:
0 1 2 3
A -2.655407 1.394721 -2.991917 -1.653539
B 0.618590 -1.462535 -3.724516 -0.047436
[2 rows x 4 columns]
A multi-groupby is just a combination query, e.g. id=1 & date>="20130101 00:10:00" & date<="20130101 00:30:00"
You might find this example instructive as well here
