Resample intraday pandas DataFrame without adding new days - python

I want to downsample some intraday data without adding in new days
df.resample('30Min')
will add weekends etc., which is undesirable. Is there any way around this?

A combined groupby/resample might work:
In [22]: dates = pd.date_range('01-Jan-2014','11-Jan-2014', freq='T')[0:-1]
...: dates = dates[dates.dayofweek < 5]
...: s = pd.TimeSeries(np.random.randn(dates.size), dates)
...:
In [23]: s.size
Out[23]: 11520
In [24]: s.groupby(lambda d: d.date()).resample('30min').size
Out[24]: 384
In [25]: s.groupby(lambda d: d.date()).resample('30min')
Out[25]:
2014-01-01 2014-01-01 00:00:00 0.202943
2014-01-01 00:30:00 -0.466010
2014-01-01 01:00:00 0.029175
2014-01-01 01:30:00 -0.064492
2014-01-01 02:00:00 -0.113348
2014-01-01 02:30:00 0.100408
2014-01-01 03:00:00 -0.036561
2014-01-01 03:30:00 -0.029578
2014-01-01 04:00:00 -0.047602
2014-01-01 04:30:00 -0.073846
2014-01-01 05:00:00 -0.410143
2014-01-01 05:30:00 0.143853
2014-01-01 06:00:00 -0.077783
2014-01-01 06:30:00 -0.122345
2014-01-01 07:00:00 0.153003
...
2014-01-10 2014-01-10 16:30:00 -0.107377
2014-01-10 17:00:00 -0.157420
2014-01-10 17:30:00 0.201802
2014-01-10 18:00:00 -0.189018
2014-01-10 18:30:00 -0.310503
2014-01-10 19:00:00 -0.086091
2014-01-10 19:30:00 -0.090800
2014-01-10 20:00:00 -0.263758
2014-01-10 20:30:00 -0.036789
2014-01-10 21:00:00 0.041957
2014-01-10 21:30:00 -0.192332
2014-01-10 22:00:00 -0.263690
2014-01-10 22:30:00 -0.395939
2014-01-10 23:00:00 -0.171149
2014-01-10 23:30:00 0.263057
Length: 384
In [26]: np.unique(_25.index.get_level_values(1).minute)
Out[26]: array([ 0, 30])
In [27]: np.unique(_25.index.get_level_values(1).dayofweek)
Out[27]: array([0, 1, 2, 3, 4])
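A rough equivalent on current pandas, where pd.TimeSeries no longer exists and resample must be followed by an explicit aggregation, might look like this sketch:
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', '2014-01-11', freq='min')[:-1]
dates = dates[dates.dayofweek < 5]                     # keep weekday minutes only
s = pd.Series(np.random.randn(dates.size), index=dates)

# group by calendar date, then build 30-minute bins within each day;
# no weekend bins appear because no single group spans a weekend
out = s.groupby(s.index.date).resample('30min').mean()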

The easiest workaround right now is probably something like:
rs = df.resample('30min').mean()  # or whatever aggregation you need
rs[rs.index.dayofweek < 5]

Probably the simplest way is to just do a dropna afterwards to get rid of the empty rows, e.g.
df.resample('30Min').mean().dropna()

Related

Pandas - Resample on MultiIndex based DataFrame and use of offset

I have a df which has a MultiIndex [(latitude, longitude, time)] with the number of rows being 148 x 244 x 90 x 24. For each latitude and longitude, the time is hourly from 2014-01-01 00:00:00 to 2014-03-31 23:00:00 in UTC.
FFDI
latitude longitude time
-39.20000 140.80000 2014-01-01 00:00:00 6.20000
2014-01-01 01:00:00 4.10000
2014-01-01 02:00:00 2.40000
2014-01-01 03:00:00 1.90000
2014-01-01 04:00:00 1.70000
2014-01-01 05:00:00 1.50000
2014-01-01 06:00:00 1.40000
2014-01-01 07:00:00 1.30000
2014-01-01 08:00:00 1.20000
2014-01-01 09:00:00 1.00000
2014-01-01 10:00:00 1.00000
2014-01-01 11:00:00 0.90000
2014-01-01 12:00:00 0.90000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
140.83786 2014-01-01 00:00:00 3.20000
2014-01-01 01:00:00 2.90000
2014-01-01 02:00:00 2.10000
2014-01-01 03:00:00 2.90000
2014-01-01 04:00:00 1.20000
2014-01-01 05:00:00 0.90000
2014-01-01 06:00:00 1.10000
2014-01-01 07:00:00 1.60000
2014-01-01 08:00:00 1.40000
2014-01-01 09:00:00 1.50000
2014-01-01 10:00:00 1.20000
2014-01-01 11:00:00 0.80000
2014-01-01 12:00:00 0.40000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
... ... ... ...
... ... ...
-33.90000 140.80000 2014-01-01 00:00:00 6.20000
2014-01-01 01:00:00 4.10000
2014-01-01 02:00:00 2.40000
2014-01-01 03:00:00 1.90000
2014-01-01 04:00:00 1.70000
2014-01-01 05:00:00 1.50000
2014-01-01 06:00:00 1.40000
2014-01-01 07:00:00 1.30000
2014-01-01 08:00:00 1.20000
2014-01-01 09:00:00 1.00000
2014-01-01 10:00:00 1.00000
2014-01-01 11:00:00 0.90000
2014-01-01 12:00:00 0.90000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
140.83786 2014-01-01 00:00:00 3.20000
2014-01-01 01:00:00 2.90000
2014-01-01 02:00:00 2.10000
2014-01-01 03:00:00 2.90000
2014-01-01 04:00:00 1.20000
2014-01-01 05:00:00 0.90000
2014-01-01 06:00:00 1.10000
2014-01-01 07:00:00 1.60000
2014-01-01 08:00:00 1.40000
2014-01-01 09:00:00 1.50000
2014-01-01 10:00:00 1.20000
2014-01-01 11:00:00 0.80000
2014-01-01 12:00:00 0.40000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
78001920 rows × 1 columns
I need to calculate a daily maximum FFDI value for a date using hourly values from 13:00:00 of the previous day to 12:00:00 of the current day to suit my time zone (+11). For example, if calculating daily max FFDI for 2014-01-10 in the +11 time zone, I can use hourly FFDI from 2014-01-09 13:00:00 to 2014-01-10 12:00:00.
df_daily_max = df.groupby(['latitude', 'longitude', pd.Grouper(freq='24H', base=13, loffset='11H', level='time')])['FFDI'].max().reset_index(name='Max FFDI')
The calculation starts with 13:00:00 and with a frequency of 24 hours.
The output is:
latitude longitude time Max FFDI
0 -39.20000076293945312500 140.80000305175781250000 2013-12-31 13:00:00 6.19999980926513671875
1 -39.20000076293945312500 140.80000305175781250000 2014-01-01 13:00:00 1.50000000000000000000
2 -39.20000076293945312500 140.80000305175781250000 2014-01-02 13:00:00 1.60000002384185791016
... ... ... ...
I would like the output to be:
latitude longitude time Max FFDI
0 -39.20000076293945312500 140.80000305175781250000 2014-01-01 6.19999980926513671875
1 -39.20000076293945312500 140.80000305175781250000 2014-01-02 1.50000000000000000000
2 -39.20000076293945312500 140.80000305175781250000 2014-01-03 1.60000002384185791016
... ... ... ...
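A hedged sketch of one way to get those plain dates, assuming the +11 h window described above: shift the time level forward by 11 hours so that each 13:00-to-12:00 UTC window falls on a single local calendar date, then group on that date. The tmp frame and the 'date' column name below are just illustrative.
import pandas as pd

# shift UTC timestamps by +11 h so the 13:00 -> 12:00 window maps onto one local date
tmp = df.reset_index()
tmp['date'] = (tmp['time'] + pd.Timedelta(hours=11)).dt.date
df_daily_max = (tmp.groupby(['latitude', 'longitude', 'date'])['FFDI']
                   .max()
                   .reset_index(name='Max FFDI'))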

How to fill the first date in the column?

I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with a datetime, e.g. a Timestamp:
df['start'] = df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, on pandas 0.24+, with the fill_value parameter:
df['start'] = df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If the datetimes are regular, always 15 minutes apart, you can instead subtract an offsets.DateOffset:
df['start'] = df['end'] - pd.offsets.DateOffset(minutes=15)
print(df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
How about this?
df = pd.DataFrame({'end': pd.date_range(start=pd.Timestamp(2019, 1, 1, 0, 15),
                                        end=pd.Timestamp(2019, 1, 2), freq='15min')})
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
# infer the step from two consecutive rows and extrapolate it backwards for the first row
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00

Pandas groupby then fill missing rows

I have a dataframe structured like this:
df_all:
day_time LCLid energy(kWh/hh)
2014-02-08 23:00:00 MAC000006 0.077
2014-02-08 23:30:00 MAC000006 0.079
...
2014-02-08 23:00:00 MAC000007 0.045
...
There are four sequential datetimes (across all LCLids) missing from the data that I want to fill with the previous and following values.
If the dataframe was split into sub-dataframes (df), one per LCLid eg as per:
gb = df.groupby('LCLid')
df_list = [gb.get_group(x) for x in gb.groups]
Then I could do this for each df in df_list:
#valid data before gap
prev_row = df.loc['2013-09-09 22:30:00'].copy()
#valid data after gap
post_row = df.loc['2013-09-10 01:00:00'].copy()
df.loc[pd.to_datetime('2013-09-09 23:00:00')] = prev_row
df.loc[pd.to_datetime('2013-09-09 23:30:00')] = prev_row
df.loc[pd.to_datetime('2013-09-10 00:00:00')] = post_row
df.loc[pd.to_datetime('2013-09-10 00:30:00')] = post_row
df = df.sort_index()
How can I do this on df_all in one go, filling the missing data with 'valid' data from each LCLid only?
The solution
The input DataFrame:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
What you need to do:
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
The explanation
First I'll build an example DataFrame that looks like yours
import numpy as np
import pandas as pd
# Building an example DataFrame that looks like yours
df = pd.DataFrame({
    'day_time': [
        pd.Timestamp(2014, 1, 1, 0, 0),
        pd.Timestamp(2014, 1, 1, 0, 0),
        pd.Timestamp(2014, 1, 1, 0, 30),
        pd.Timestamp(2014, 1, 1, 0, 30),
        pd.Timestamp(2014, 1, 1, 3, 0),
        pd.Timestamp(2014, 1, 1, 3, 0),
        pd.Timestamp(2014, 1, 1, 3, 30),
        pd.Timestamp(2014, 1, 1, 3, 30),
    ],
    'LCLid': [
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
    ],
    'energy(kWh/hh)': np.random.rand(8)
}).set_index('day_time')
Result:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
Notice how we're missing the following timestamps:
2014-01-01 01:00:00
2014-01-01 01:30:00
2014-01-01 02:00:00
2014-01-01 02:30:00
df.reindex()
First thing to know is that df.reindex() allows you to fill in missing index values, and will default to NaN for missing values. In your case, you would want to supply the full timestamp range index, including the values that don't show up in your starting DataFrame.
Here I used pd.date_range() to list all timestamps between your min and max starting index values, taking strides of 30 minutes. WARNING: this way of doing it means that if your missing timestamp values are at the beginning or the end, you're not adding them back! So maybe you want to specify start and end explicitly.
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
Result:
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:30:00',
'2014-01-01 01:00:00', '2014-01-01 01:30:00',
'2014-01-01 02:00:00', '2014-01-01 02:30:00',
'2014-01-01 03:00:00', '2014-01-01 03:30:00'],
dtype='datetime64[ns]', freq='30T')
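If any of the missing timestamps could fall before the first or after the last row you actually have, min/max won't recover them, so a hedged variant with explicit bounds (the timestamps below are just illustrative) may be safer:
full_idx = pd.date_range(start='2014-01-01 00:00', end='2014-01-01 03:30', freq='30T')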
Now if we use that to reindex one of your grouped sub-DataFrames, we would get this:
grouped_df = df[df.LCLid == 'MAC000006']
grouped_df.reindex(full_idx)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 NaN NaN
2014-01-01 01:30:00 NaN NaN
2014-01-01 02:00:00 NaN NaN
2014-01-01 02:30:00 NaN NaN
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
You said you want to fill missing values using the closest available surrounding value. This can be done during reindexing, as follows:
grouped_df.reindex(full_idx, method='nearest')
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
Doing all the groups at once using df.groupby()
Now we'd like to apply this transformation to every group in your DataFrame, where
a group is defined by its LCLid.
(
    df
    .groupby('LCLid', as_index=False)  # use LCLid as groupby key, but don't add it as a group index
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  # do this for each group
    .reset_index(level=0, drop=True)  # get rid of the automatic index generated during groupby
    .sort_index()  # This is optional, just in case you want timestamps in chronological order
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
Relevant doc:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd; import numpy as np; import datetime; from datetime import timedelta;
df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5 minute period over the course of a 24 hour day - without considering what day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals, then group on the time attribute and aggregate with the mean.
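If the timestamps were stored in an ordinary column rather than the index, the same idea works through the .dt accessor; a minimal sketch, assuming a datetime column named 'ts':
df.groupby(df['ts'].dt.floor('5min').dt.time)['value'].mean()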

Pandas filtering values in dataframe

I have this dataframe. The columns represent the highs and the lows in daily EURUSD price:
df.low df.high
2013-01-17 16:00:00 1.33394 2013-01-17 20:00:00 1.33874
2013-01-18 18:00:00 1.32805 2013-01-18 09:00:00 1.33983
2013-01-21 00:00:00 1.32962 2013-01-21 09:00:00 1.33321
2013-01-22 11:00:00 1.32667 2013-01-22 09:00:00 1.33715
2013-01-23 17:00:00 1.32645 2013-01-23 14:00:00 1.33545
2013-01-24 10:00:00 1.32860 2013-01-24 18:00:00 1.33926
2013-01-25 04:00:00 1.33497 2013-01-25 17:00:00 1.34783
2013-01-28 10:00:00 1.34246 2013-01-28 16:00:00 1.34771
2013-01-29 13:00:00 1.34143 2013-01-29 21:00:00 1.34972
2013-01-30 08:00:00 1.34820 2013-01-30 21:00:00 1.35873
2013-01-31 13:00:00 1.35411 2013-01-31 17:00:00 1.35944
I merged them into a third column (df.extremes).
df.extremes
2013-01-17 16:00:00 1.33394
2013-01-17 20:00:00 1.33874
2013-01-18 18:00:00 1.32805
2013-01-18 09:00:00 1.33983
2013-01-21 00:00:00 1.32962
2013-01-21 09:00:00 1.33321
2013-01-22 09:00:00 1.33715
2013-01-22 11:00:00 1.32667
2013-01-23 14:00:00 1.33545
2013-01-23 17:00:00 1.32645
2013-01-24 10:00:00 1.32860
2013-01-24 18:00:00 1.33926
2013-01-25 04:00:00 1.33497
2013-01-25 17:00:00 1.34783
2013-01-28 10:00:00 1.34246
2013-01-28 16:00:00 1.34771
2013-01-29 13:00:00 1.34143
2013-01-29 21:00:00 1.34972
2013-01-30 08:00:00 1.34820
2013-01-30 21:00:00 1.35873
2013-01-31 13:00:00 1.35411
2013-01-31 17:00:00 1.35944
But now I want to filter some values from df.extremes.
To explain what to filter, I'll try with this "pseudocode":
IF following the index we move from: previous df.low --> df.low --> df.high:
IF df.low > previous df.low: delete df.low
IF df.low < previous df.low: delete previous df.low
If I try to work this out with a for loop, it gives me a KeyError: 1.3339399999999999.
day = df.groupby(pd.TimeGrouper('D'))
is_day_min = day.extremes.apply(lambda x: x == x.min())

for i in df.extremes:
    if is_day_min[i] == True and is_day_min[i+1] == True:
        if df.extremes[i] > df.extremes[i+1]:
            del df.extremes[i]

for i in df.extremes:
    if is_day_min[i] == True and is_day_min[i+1] == True:
        if df.extremes[i] < df.extremes[i+1]:
            del df.extremes[i+1]
How can I filter/delete the values as explained in the pseudocode above?
I'm struggling with indexing and booleans and can't solve this. I strongly suspect that I need to use a lambda function, but I don't know how to apply it. Please bear with me, I've been trying this for far too long. Hope I've been clear enough.
All you're really missing is a way of saying "previous low" in a vectorized fashion. That's spelled df['low'].shift(1). Once you have that, both pseudocode rules become element-wise comparisons:
prev_low, next_low = df.low.shift(1), df.low.shift(-1)
# keep a low only if it is not higher than the previous low and the next low is not lower
filtered_df = df[~((df.low > prev_low) | (next_low < df.low))]
