How to conditionally aggregate a Pandas dataframe in Python

I have a dataframe with some data that I'm going to run simulations on. Each row is a datetime and a value. Because of the nature of the problem, I need to keep the original frequency of 1 hour when the value is above a certain threshold. When it's not, I could resample the data and run that part of the simulation on lower frequency data, in order to speed up the simulation.
My idea is to somehow group the dataframe by day (since I've noticed there are many whole days where the value stays below the threshold), check the max value over each group, and if the max is below the threshold then aggregate the data in that group into a single mean value.
Here's a minimal working example:
import pandas as pd
import numpy as np
threshold = 3
idx = pd.date_range("2018-01-01", periods=27, freq="H")
df = pd.Series(np.append(np.ones(26), 5), index=idx).to_frame("v")
print(df)
Output:
v
2018-01-01 00:00:00 1.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 1.0
2018-01-01 03:00:00 1.0
2018-01-01 04:00:00 1.0
2018-01-01 05:00:00 1.0
2018-01-01 06:00:00 1.0
2018-01-01 07:00:00 1.0
2018-01-01 08:00:00 1.0
2018-01-01 09:00:00 1.0
2018-01-01 10:00:00 1.0
2018-01-01 11:00:00 1.0
2018-01-01 12:00:00 1.0
2018-01-01 13:00:00 1.0
2018-01-01 14:00:00 1.0
2018-01-01 15:00:00 1.0
2018-01-01 16:00:00 1.0
2018-01-01 17:00:00 1.0
2018-01-01 18:00:00 1.0
2018-01-01 19:00:00 1.0
2018-01-01 20:00:00 1.0
2018-01-01 21:00:00 1.0
2018-01-01 22:00:00 1.0
2018-01-01 23:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
The desired output of the operation would be this dataframe:
v
2018-01-01 00:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
where the first value is the mean of the first day.
I think I'm getting close:
grouped = df.resample("1D")
for name, group in grouped:
    if group["v"].max() <= 3:
        group['v'].agg("mean")
but I'm unsure how to actually apply the aggregation to the desired groups and get a dataframe back.
Any help is greatly appreciated.

So I found a solution:
grouped = df.resample("1D")

def conditionalAggregation(x):
    # whole day at or below the threshold: collapse it into a single mean row
    if x['v'].max() <= 3:
        idx = [x.index[0].replace(hour=0, minute=0, second=0, microsecond=0)]
        return pd.DataFrame(x['v'].mean(), index=idx, columns=['v'])
    # otherwise keep the original hourly rows
    else:
        return x

conditionallyAggregated = grouped.apply(conditionalAggregation)
conditionallyAggregated = conditionallyAggregated.droplevel(level=0)  # drop the day level added by apply
conditionallyAggregated
This gives the following df:
v
2018-01-01 00:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
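An alternative sketch that avoids the custom index handling (the variable names quiet_days, low and high are my own): compute the per-day maximum once, take a daily mean over the days that stay below the threshold, and concatenate that with the untouched hourly rows of the remaining days.
daily_max = df["v"].resample("1D").max()
quiet_days = daily_max.index[daily_max <= threshold]  # days that never exceed the threshold
is_quiet = df.index.normalize().isin(quiet_days)      # hourly mask for those days

low = df[is_quiet].resample("1D").mean().dropna()     # one mean row per quiet day
high = df[~is_quiet]                                  # keep the original hourly rows
result = pd.concat([low, high]).sort_index()
For the example above, result contains the same four rows as the desired output.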

Related

How to extract the first and last value from a data sequence based on a column value?

I have a time series dataset that can be created with the following code.
import pandas as pd

idx = pd.date_range("2018-01-01", periods=100, freq="H")
ts = pd.Series(idx)
dft = ts.to_frame("date")
dft["data"] = ""
dft.loc[0:4, "data"] = "a"
dft.loc[5:14, "data"] = "b"
dft.loc[15:19, "data"] = "c"
dft.loc[20:29, "data"] = "d"
dft.loc[30:39, "data"] = "a"
dft.loc[40:69, "data"] = "c"
dft.loc[70:84, "data"] = "b"
dft.loc[85:, "data"] = "c"
In the data column, the unique values are a, b, c, d. These values repeat in sequences over different time windows. I want to capture the first and last timestamp of each window. How can I do that?
Compute a grouper for your changing values using shift to compare consecutive rows, then use groupby+agg to get the min/max per group:
group = dft.data.ne(dft.data.shift()).cumsum()
dft.groupby(group)['date'].agg(['min', 'max'])
output:
min max
data
1 2018-01-01 00:00:00 2018-01-01 04:00:00
2 2018-01-01 05:00:00 2018-01-01 14:00:00
3 2018-01-01 15:00:00 2018-01-01 19:00:00
4 2018-01-01 20:00:00 2018-01-02 05:00:00
5 2018-01-02 06:00:00 2018-01-02 15:00:00
6 2018-01-02 16:00:00 2018-01-03 21:00:00
7 2018-01-03 22:00:00 2018-01-04 12:00:00
8 2018-01-04 13:00:00 2018-01-05 03:00:00
Edit: combining with the original data:
dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})
output:
data date
first min max
data
1 a 2018-01-01 00:00:00 2018-01-01 04:00:00
2 b 2018-01-01 05:00:00 2018-01-01 14:00:00
3 c 2018-01-01 15:00:00 2018-01-01 19:00:00
4 d 2018-01-01 20:00:00 2018-01-02 05:00:00
5 a 2018-01-02 06:00:00 2018-01-02 15:00:00
6 c 2018-01-02 16:00:00 2018-01-03 21:00:00
7 b 2018-01-03 22:00:00 2018-01-04 12:00:00
8 c 2018-01-04 13:00:00 2018-01-05 03:00:00
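If you prefer flat column names over the MultiIndex that the dict-style agg produces, named aggregation (pandas 0.25+) is an option (a sketch; data, start and end are just the output names I picked):
out = dft.groupby(group).agg(data=('data', 'first'),
                             start=('date', 'min'),
                             end=('date', 'max'))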

How do I group a time series by hour of day?

I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values between each hour. Is there a faster way?
Some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what I've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However, it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
There are two steps to achieve this (reading the sample as three columns: Date, Actual holding the time of day, and Consumption):
Convert Actual to datetime:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour + 1]).Consumption.sum().reset_index()
I assumed you want to sum the Consumption (if you prefer the mean or another aggregation, just change it). One note: hour + 1 makes the hours start from 1 rather than 0; remove it if you want midnight to be hour 0.
desired result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
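If the goal is really one boxplot per hour of day regardless of date (as in the linked figure), here is a sketch assuming the sample is read into a datetime column Date and a numeric column Actual Consumption:
import pandas as pd
import matplotlib.pyplot as plt

df['Date'] = pd.to_datetime(df['Date'])
df['hour'] = df['Date'].dt.hour                      # 0-23, independent of the date
df.boxplot(column='Actual Consumption', by='hour')   # one box per hour of day
plt.show()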

Pad a data frame according to a frequency for each group

I have a pandas.DataFrame df with a pandas.DatetimeIndex and a column named group_column.
I need the df to have a minutely frequency (meaning there is a row for every minute).
However, this needs to be the case for every value in the group_column, so every minute can potentially have several values.
NOTE:
The group_column can have hundreds of unique values.
Some groups can "last" several minutes and others can last for days; the edges are determined by the first and last appearances of the values in group_column.
example
Input:
dates = [pd.Timestamp('2018-01-01 12:00'),
         pd.Timestamp('2018-01-01 12:01'),
         pd.Timestamp('2018-01-01 12:01'),
         pd.Timestamp('2018-01-01 12:03'),
         pd.Timestamp('2018-01-01 12:04')]
df = pd.DataFrame({'group_column': ['a', 'a', 'b', 'a', 'b'],
                   'data_column': [1.2, 2.2, 4, 1, 2]}, index=dates)
group_column data_column
2018-01-01 12:00:00 a 1.2
2018-01-01 12:01:00 a 2.2
2018-01-01 12:01:00 b 4.0
2018-01-01 12:03:00 a 1.0
2018-01-01 12:04:00 b 2.0
desired output:
group_column data_column
2018-01-01 12:00:00 a 1.2
2018-01-01 12:01:00 a 2.2
2018-01-01 12:02:00 a 2.2
2018-01-01 12:03:00 a 1.0
2018-01-01 12:01:00 b 4.0
2018-01-01 12:02:00 b 4.0
2018-01-01 12:03:00 b 4.0
2018-01-01 12:04:00 b 2.0
My attempt
I have done this, however it seems highly inefficient:
def group_resample(df, group_column_name):
    values = df[group_column_name].unique()
    for value in values:
        df_g = df.loc[df[group_column_name] == value]
        df_g = df_g.asfreq('min', 'pad')
        yield df_g

df_padded = pd.concat(group_resample(df, 'group_column'))
Use GroupBy.apply with asfreq:
df1 = (df.groupby('group_column')
         .apply(lambda x: x.asfreq('min', 'pad'))
         .reset_index(level=0, drop=True))
print(df1)
group_column data_column
2018-01-01 12:00:00 a 1.2
2018-01-01 12:01:00 a 2.2
2018-01-01 12:02:00 a 2.2
2018-01-01 12:03:00 a 1.0
2018-01-01 12:01:00 b 4.0
2018-01-01 12:02:00 b 4.0
2018-01-01 12:03:00 b 4.0
2018-01-01 12:04:00 b 2.0
My approach would be
df2 = df.groupby('group_column').resample('min').ffill().reset_index(level=0, drop=True)
print(df2)
data_column group_column
2018-01-01 12:00:00 1.2 a
2018-01-01 12:01:00 2.2 a
2018-01-01 12:02:00 2.2 a
2018-01-01 12:03:00 1.0 a
2018-01-01 12:01:00 4.0 b
2018-01-01 12:02:00 4.0 b
2018-01-01 12:03:00 4.0 b
2018-01-01 12:04:00 2.0 b

Pandas drop before first valid index and after last valid index for each column of a dataframe

I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': pd.date_range('2018-01-01', '2018-01-02', freq='2h', closed='right'),
                   'col1': [np.nan, np.nan, np.nan, 1, 2, 3, 4, 5, 6, 7, 8, np.nan],
                   'col2': [np.nan, np.nan, 0, 1, 2, 3, 4, 5, np.nan, np.nan, np.nan, np.nan],
                   'col3': [np.nan, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col4': [-2, -1, 0, 1, 2, 3, 4, np.nan, np.nan, np.nan, np.nan, np.nan]
                   })[['timestamp', 'col1', 'col2', 'col3', 'col4']]
which looks like this:
timestamp col1 col2 col3 col4
0 2018-01-01 02:00:00 NaN NaN NaN -2.0
1 2018-01-01 04:00:00 NaN NaN -1.0 -1.0
2 2018-01-01 06:00:00 NaN 0.0 NaN 0.0
3 2018-01-01 08:00:00 1.0 1.0 1.0 1.0
4 2018-01-01 10:00:00 2.0 NaN 2.0 2.0
5 2018-01-01 12:00:00 3.0 3.0 NaN 3.0
6 2018-01-01 14:00:00 NaN 4.0 4.0 4.0
7 2018-01-01 16:00:00 5.0 NaN 5.0 NaN
8 2018-01-01 18:00:00 6.0 NaN 6.0 NaN
9 2018-01-01 20:00:00 7.0 NaN 7.0 NaN
10 2018-01-01 22:00:00 8.0 NaN 8.0 NaN
11 2018-01-02 00:00:00 NaN NaN 9.0 NaN
Now, I want to find an efficient and pythonic way of chopping off, for each column (not counting timestamp), the NaN values before the first valid index and after the last valid index. In this example I have 4 columns, but in reality I have a lot more, 600 or so.
One way would be to loop through, I guess, but is there a better way? This way has to be efficient. I tried to "unpivot" the dataframe using melt, but that didn't help.
An obvious point is that each column would have a different number of rows after the chopping, so I would like the result to be a list of data frames (one for each column) containing timestamp and the column in question. For instance:
timestamp col1
3 2018-01-01 08:00:00 1.0
4 2018-01-01 10:00:00 2.0
5 2018-01-01 12:00:00 3.0
6 2018-01-01 14:00:00 NaN
7 2018-01-01 16:00:00 5.0
8 2018-01-01 18:00:00 6.0
9 2018-01-01 20:00:00 7.0
10 2018-01-01 22:00:00 8.0
My try
I tried like this:
final = []
columns = [c for c in df if c != 'timestamp']
for col in columns:
    first = df.loc[:, col].first_valid_index()
    last = df.loc[:, col].last_valid_index()
    final.append(df.loc[:, ['timestamp', col]].iloc[first:last + 1, :])
One idea is to use a list or dictionary comprehension after setting your index as timestamp. You should test with your data to see if this resolves your issue with performance. It is unlikely to help if your limitation is memory.
df = df.set_index('timestamp')
final = {col: df[col].loc[df[col].first_valid_index(): df[col].last_valid_index()]
         for col in df}
print(final)
{'col1': timestamp
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
2018-01-01 16:00:00 5.0
2018-01-01 18:00:00 6.0
2018-01-01 20:00:00 7.0
2018-01-01 22:00:00 8.0
Name: col1, dtype: float64,
...
'col4': timestamp
2018-01-01 02:00:00 -2.0
2018-01-01 04:00:00 -1.0
2018-01-01 06:00:00 0.0
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
Name: col4, dtype: float64}
You can use the power of functional programming and apply a function to each column. This may speed things up. Also, as your timestamps look sorted, you can use them as the index of your DataFrame.
df.set_index('timestamp', inplace=True)
final = []

def func(col):
    first = col.first_valid_index()
    last = col.last_valid_index()
    final.append(col.loc[first:last])

df.apply(func)
Also, you can compact everything into a one-liner:
final = []
df.apply(lambda col: final.append(col.loc[col.first_valid_index() : col.last_valid_index()]))
My approach is to take the cumulative count of non-null values for each column, once from the top and once from the bottom, and keep the entries where both counts are greater than 0. Then I use a dict comprehension to return a dataframe for each column (you can change that to a list if that's what you prefer).
For your example we have
cols = [c for c in df.columns if c != 'timestamp']
result_dict = {c: df[(df[c].notnull().cumsum() > 0) &
                     (df[c][::-1].notnull().cumsum()[::-1] > 0)][['timestamp', c]]
               for c in cols}

Change the frequency of a Pandas datetimeindex from daily to hourly, to select hourly data based on a condition on daily resampled data

I am working on hourly and sub-hourly time series. However, one of the conditions I need to test is on daily averages. I need to find the days that meet the condition, and then select all hours (or other time steps) from those days to change their values. But right now, the only value that is actually changed is the first hour on the selected day. How can I select and modify every hour?
This is an example of my dataset:
In[]: print(hourly_dataset.head())
Out[]:
GHI DNI DHI
2016-01-01 00:00:00 0.0 0.0 0.0
2016-01-01 01:00:00 0.0 0.0 0.0
2016-01-01 02:00:00 0.0 0.0 0.0
2016-01-01 03:00:00 0.0 0.0 0.0
2016-01-01 04:00:00 0.0 0.0 0.0
And this is the condition I need to check. I saved the indexes that satisfy the condition on the daily standard deviation as ix.
ix = hourly_dataset['GHI'].resample('D').std()[hourly_dataset['GHI'].resample('D').std() > 300].index
In[]: print(ix)
Out[]: DatetimeIndex(['2016-05-31', '2016-07-17', '2016-07-18'], dtype='datetime64[ns]', freq=None)
But then I assign a nan value to those days and only the first hour is actually modified to nan.
hourly_dataset.loc[ix,'GHI'] = np.nan
In[]: print(hourly_dataset.loc['2016-05-31','GHI'].head())
Out[]:
2016-05-31 00:00:00 NaN
2016-05-31 01:00:00 0.0
2016-05-31 02:00:00 0.0
2016-05-31 03:00:00 0.0
2016-05-31 04:00:00 7.4
Freq: H, Name: GHI, dtype: float64
I would like all values in that day to be assigned nan.
Thanks for the help!
Possible workaround:
for i in ix:
    hourly_dataset.loc[i.strftime('%Y-%m-%d'), 'GHI'] = np.nan
Explanation
I had a quick look, and the issue appears when we select rows with a list of daily Timestamps: .loc only matches the rows at exactly those timestamps (here, midnight), whereas a date string like '2018-01-01' selects the whole day. I was able to reproduce your result.
Consider this example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range(start='2018-01-01', freq='2H', periods=24),
    'GHI': 0.0
}).set_index('date')

ix = pd.date_range(start='2018-01-01', end='2018-01-02')
df.loc[ix, 'GHI'] = np.nan
print(df.head())
Returns:
GHI
date
2018-01-01 00:00:00 NaN
2018-01-01 02:00:00 0.0
2018-01-01 04:00:00 0.0
2018-01-01 06:00:00 0.0
2018-01-01 08:00:00 0.0
Maybe not the best, but one workaround is to loop through ix and use loc with each date formatted as a YYYY-mm-dd string.
# df.loc[ix.strftime('%Y-%m-%d'), 'GHI'] = np.nan  --> does not work
for i in ix:
    df.loc[i.strftime('%Y-%m-%d'), 'GHI'] = np.nan
print(df.head())
GHI
date
2018-01-01 00:00:00 NaN
2018-01-01 02:00:00 NaN
2018-01-01 04:00:00 NaN
2018-01-01 06:00:00 NaN
2018-01-01 08:00:00 NaN
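A vectorized alternative to the loop (a sketch, not tested on the original dataset): normalize the hourly index to midnight and compare it against the daily index ix.
# mark every hour whose day appears in ix, then assign in one step
mask = hourly_dataset.index.normalize().isin(ix)
hourly_dataset.loc[mask, 'GHI'] = np.nan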
