From hourly data, get daily nsmallest values for each column - python

I have a dataframe df with n columns, with hourly data (date_i X1_i X2_i ... Xn_i).
For each day, I want to get the nsmallest values for each column, but I cannot find a way to do it without looping over the columns.
It is easy for the single smallest value, as df.groupby(pd.Grouper(freq='D')).min() seems to do the trick, but when I try the nsmallest method, I get the following error message: "Cannot access callable attribute 'nsmallest' of 'DataFrameGroupBy' objects, try using the 'apply' method".
I tried to use nsmallest with the 'apply' method but was asked to specify columns...
If someone has an idea, it would be very helpful. Thanks!
PS: sorry for the formatting, this is my first post ever.
Edit: some illustrations.
What my data looks like:
0 1 ... 9678 9679
2022-01-08 00:00:00 18472.232746 28934.878033 ... 20668.503228 22079.457224
2022-01-08 01:00:00 19546.101746 30239.880033 ... 21789.779228 23330.190224
2022-01-08 02:00:00 22031.448746 33016.048033 ... 24278.199228 25990.503224
2022-01-08 03:00:00 24089.368644 36134.608919 ... 26327.332591 28089.134306
2022-01-08 04:00:00 24640.942644 36818.412919 ... 26894.204591 28736.705306
2022-01-08 05:00:00 23329.700644 35639.693919 ... 25555.199591 27379.323306
2022-01-08 06:00:00 20990.043644 33329.805919 ... 23137.500591 24917.126306
2022-01-08 07:00:00 18314.599644 30347.799919 ... 20167.500591 22022.524306
2022-01-08 08:00:00 17628.482226 31301.113041 ... 21665.296600 24202.625832
2022-01-08 09:00:00 15743.339226 29588.354041 ... 19912.297600 22341.947832
2022-01-08 10:00:00 15498.405226 29453.561041 ... 19799.009600 22131.170832
2022-01-08 11:00:00 14950.121226 28767.791041 ... 19328.678600 21507.167832
2022-01-08 12:00:00 13925.869226 27530.472041 ... 18404.139600 20460.316832
2022-01-08 13:00:00 17502.122226 30922.783041 ... 21990.380600 24008.382832
2022-01-08 14:00:00 19159.511385 34275.005187 ... 23961.590286 26460.214883
2022-01-08 15:00:00 20583.356385 35751.662187 ... 25315.380286 27793.800883
2022-01-08 16:00:00 20443.423385 35925.362187 ... 25184.576286 27672.536883
2022-01-08 17:00:00 15825.211385 31604.614187 ... 20646.669286 23145.311883
2022-01-08 18:00:00 11902.354052 28786.559805 ... 16028.363856 19313.677750
2022-01-08 19:00:00 13483.710052 30631.806805 ... 17635.338856 20948.556750
2022-01-08 20:00:00 16084.773323 33944.862396 ... 20627.810852 22763.962851
2022-01-08 21:00:00 18340.833323 36435.799396 ... 22920.037852 25240.320851
2022-01-08 22:00:00 15110.698323 33159.222396 ... 19794.355852 22102.416851
2022-01-08 23:00:00 15663.400323 33741.501396 ... 20180.693852 22605.909851
2022-01-09 00:00:00 19500.930751 39058.431760 ... 24127.257756 26919.289816
2022-01-09 01:00:00 20562.985751 40330.807760 ... 25123.488756 28051.573816
2022-01-09 02:00:00 23408.547751 43253.635760 ... 27840.447756 30960.372816
2022-01-09 03:00:00 25975.071191 45523.722743 ... 30274.316013 32276.174330
2022-01-09 04:00:00 27180.858191 46586.959743 ... 31348.131013 33414.631330
2022-01-09 05:00:00 26383.511191 45793.920743 ... 30598.931013 32605.280330
... ... ... ... ...
What I get with the min function:
2022-01-08 11902.354052 27530.472041 ... 16028.363856 19313.677750
2022-01-09 14491.281907 30293.870235 ... 16766.428013 21386.135041
...
What I would like to have, for example with nsmallest(2):
2022-01-08 11902.354052 27530.472041 ... 16028.363856 19313.677750
13483.710052 28767.791041 ... 17635.338856 20460.316832
2022-01-09 14491.281907 30293.870235 ... 16766.428013 21386.135041
14721.392907 30722.928235 ... 17130.594013 21732.426041
...

Group by day, take the 2 smallest values of each column as a list, then explode all columns at once (exploding a list of columns requires pandas >= 1.3.0):
# two smallest values of each column, returned as a list
get_2smallest = lambda x: x.nsmallest(2).tolist()
out = df.resample('D').apply(get_2smallest).explode(df.columns.tolist())
print(out)
# Output
0 1 9678 9679
2022-01-08 11902.354052 27530.472041 16028.363856 19313.67775
2022-01-08 13483.710052 28767.791041 17635.338856 20460.316832
2022-01-09 19500.930751 39058.43176 24127.257756 26919.289816
2022-01-09 20562.985751 40330.80776 25123.488756 28051.573816
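As a self-contained check, here is a toy version of the same recipe (the two column names and the random data are made up for illustration; it assumes pandas >= 1.3.0 for the multi-column explode):
import numpy as np
import pandas as pd

rng = pd.date_range('2022-01-08', periods=48, freq='H')  # two full days, hourly
df = pd.DataFrame(np.random.rand(48, 2), index=rng, columns=['X1', 'X2'])

get_2smallest = lambda x: x.nsmallest(2).tolist()
out = df.resample('D').apply(get_2smallest).explode(df.columns.tolist())
print(out)  # two rows per day: the 1st and 2nd smallest value of each column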
Update
Another version, maybe faster:
out = (df.set_index(df.index.date).stack()             # long format: (day, column) -> value
         .rename_axis(['Date', 'Col']).rename('Val')
         .sort_values()                                 # one global sort
         .groupby(level=[0, 1]).head(2)                 # 2 smallest per (day, column)
         .sort_index().reset_index()
         .assign(Idx=lambda x: x.index % 2)             # rank 0/1 inside each pair
         .pivot(index=['Date', 'Idx'], columns='Col', values='Val')
         .droplevel('Idx').rename_axis(index=None, columns=None))
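The trick in this version is to melt the frame into one long Series, sort it once, and keep the first two entries of every (day, column) group with head(2); doing a single global sort instead of calling nsmallest per group is what can make it faster on wide frames.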

Related

Resampling Hourly Data into Half Hourly in Pandas

I have the following DataFrame called prices:
DateTime PriceAmountGBP
0 2022-03-27 23:00:00 202.807890
1 2022-03-28 00:00:00 197.724150
2 2022-03-28 01:00:00 191.615328
3 2022-03-28 02:00:00 188.798436
4 2022-03-28 03:00:00 187.706682
... ... ...
19 2023-01-24 18:00:00 216.915400
20 2023-01-24 19:00:00 197.050516
21 2023-01-24 20:00:00 168.227992
22 2023-01-24 21:00:00 158.954200
23 2023-01-24 22:00:00 149.039322
I'm trying to resample prices to show Half Hourly data instead of Hourly, with PriceAmountGBP repeating on the half hour, desired output below:
DateTime PriceAmountGBP
0 2022-03-27 23:00:00 202.807890
1 2022-03-27 23:30:00 202.807890
2 2022-03-28 00:00:00 197.724150
3 2022-03-28 00:30:00 197.724150
4 2022-03-28 01:00:00 191.615328
... ... ...
19 2023-01-24 18:00:00 216.915400
20 2023-01-24 18:30:00 216.915400
21 2023-01-24 19:00:00 197.050516
22 2023-01-24 19:30:00 197.050516
23 2023-01-24 20:00:00 168.227992
I've attempted the below which is incorrect:
prices.set_index('DateTime').resample('30T').interpolate()
Output:
PriceAmountGBP
DateTime
2022-03-27 23:00:00 202.807890
2022-03-27 23:30:00 200.266020
2022-03-28 00:00:00 197.724150
2022-03-28 00:30:00 194.669739
2022-03-28 01:00:00 191.615328
... ...
2023-01-24 20:00:00 168.227992
2023-01-24 20:30:00 163.591096
2023-01-24 21:00:00 158.954200
2023-01-24 21:30:00 153.996761
2023-01-24 22:00:00 149.039322
Any help appreciated!
You want to resample without any transformation, and then do a so-called "forward fill" of the resulting null values.
That's:
result = (
    prices.set_index('DateTime')
          .resample('30T')
          .asfreq()  # no transformation
          .ffill()   # drag previous values down
)
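If you want DateTime back as a regular column instead of the index, finish with a reset_index (a small optional addition):
result = result.reset_index()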

Replacing NaNs with Mean Value using Pandas

Say I have a DataFrame called Data with shape (71067, 4):
StartTime EndDateTime TradeDate Values
0 2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01 -44.676
1 2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01 -36.113
2 2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01 -19.229
3 2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01 -23.606
4 2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01 -25.899
... ... ... ... ...
2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30 -27.198
2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30 -13.221
2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30 -12.034
2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30 -16.464
2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30 -25.441
71067 rows × 4 columns
When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
Data.isna().sum().sum()
> 1391
Shown here:
Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')
0 2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01 NaN
1 2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04 NaN
2 2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04 NaN
3 2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04 NaN
4 2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04 NaN
... ... ... ... ...
1386 2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06 NaN
1387 2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06 NaN
1388 2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22 NaN
1389 2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25 NaN
1390 2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25 NaN
Is there any way of replacing each of the NaN values in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means
HH Values
0 00:00:00 5.236811
1 00:30:00 2.056571
2 01:00:00 4.157455
3 01:30:00 2.339253
4 02:00:00 2.658238
5 02:30:00 0.230557
6 03:00:00 0.217599
7 03:30:00 -0.630243
8 04:00:00 -0.989919
9 04:30:00 -0.494372
For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.
Let's group the dataframe by HH, then transform the Values with mean to broadcast the mean values back to the original column shape, then use fillna to fill the null values:
avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
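For reference, the same fill can be written as a one-liner without the intermediate avg (purely a stylistic variant):
Data['Values'] = Data['Values'].fillna(Data.groupby('HH')['Values'].transform('mean'))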

PANDAS - Resample monthly time series to hourly

Suppose I have a multi-index Pandas data frame with two index levels: month_begin and month_end
import pandas as pd
import numpy as np

multi_index = pd.MultiIndex.from_tuples([("2022-03-01", "2022-03-31"),
                                         ("2022-04-01", "2022-04-30"),
                                         ("2022-05-01", "2022-05-31"),
                                         ("2022-06-01", "2022-06-30")])
multi_index.names = ['month_begin', 'month_end']
df = pd.DataFrame(np.random.rand(4, 100), index=multi_index)
df
0 1 ... 98 99
month_begin month_end ...
2022-03-01 2022-03-31 0.322032 0.205307 ... 0.975128 0.673460
2022-04-01 2022-04-30 0.113813 0.278981 ... 0.951049 0.090765
2022-05-01 2022-05-31 0.777918 0.842734 ... 0.667831 0.274189
2022-06-01 2022-06-30 0.221407 0.555711 ... 0.745158 0.648246
I would like to resample the data to have the value in a month at every hour in the respective month:
0 1 ... 98 99
...
2022-03-01 00:00 0.322032 0.205307 ... 0.975128 0.673460
2022-03-01 01:00 0.322032 0.205307 ... 0.975128 0.673460
2022-03-01 02:00 0.322032 0.205307 ... 0.975128 0.673460
...
2022-06-30 22:00 0.221407 0.555711 ... 0.745158 0.648246
2022-06-30 23:00 0.221407 0.555711 ... 0.745158 0.648246
I know I can use resample(), but I am struggling with how to do this. Does anybody have a clue?
IIUC, try this using a list comprehension and explode with pd.date_range:
df['Date'] = [pd.date_range(s, e, freq='H') for s, e in df.index]
df_out = df.explode('Date').set_index('Date')
Output:
0 1 ... 98 99
Date ...
2022-03-01 00:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 01:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 02:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 03:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 04:00:00 0.396311 0.138263 ... 0.637640 0.106366
... ... ... ... ... ...
2022-06-29 20:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-29 21:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-29 22:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-29 23:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-30 00:00:00 0.129921 0.654878 ... 0.619212 0.142297
[2836 rows x 100 columns]
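One caveat: pd.date_range(s, e, freq='H') stops at 00:00 on month_end, which is why the output above ends at 2022-06-30 00:00:00 rather than 23:00 as in the desired output. A sketch of one fix is to push each end point forward by 23 hours before building the range:
df['Date'] = [pd.date_range(s, pd.Timestamp(e) + pd.Timedelta(hours=23), freq='H')
              for s, e in df.index]
df_out = df.explode('Date').set_index('Date')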

Pandas to_datetime adds year 1900 while I specified format to month day and time

This is basically my code:
for index in range(scen_1['DateTime'].size):
    DateandTime = scen_1['DateTime'][index][1:3] + '-' + scen_1['DateTime'][index][4:6] + ' ' + scen_1['DateTime'][index][8:]
    if '24:00:00' in DateandTime:
        DateandTime = DateandTime.replace('24:00:00', '00:00:00')
    scen_1['DateTime'][index] = DateandTime
scen_1['Date'] = pd.to_datetime(scen_1['DateTime'], format='%m-%d %H:%M:%S')
So I get this result:
DateTime OutdoorTemp ... HeadOfficeOcc Date
0 01-01 00:15:00 NaN ... 0 1900-01-01 00:15:00
1 01-01 00:30:00 NaN ... 0 1900-01-01 00:30:00
2 01-01 00:45:00 NaN ... 0 1900-01-01 00:45:00
3 01-01 01:00:00 5.2875 ... 0 1900-01-01 01:00:00
4 01-01 01:15:00 NaN ... 0 1900-01-01 01:15:00
... ... ... ... ... ...
17371 06-30 23:00:00 19.9875 ... 0 1900-06-30 23:00:00
17372 06-30 23:15:00 NaN ... 0 1900-06-30 23:15:00
17373 06-30 23:30:00 NaN ... 0 1900-06-30 23:30:00
17374 06-30 23:45:00 NaN ... 0 1900-06-30 23:45:00
17375 06-30 00:00:00 17.8250 ... 0 1900-06-30 00:00:00
Any help is very much appreciated. I tried dt.date and dt.time, and I don't know what else to try. Thanks!
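The 1900 appears because the format string has no year component, so to_datetime fills the missing year with its default of 1900. A minimal sketch of one workaround is to prepend an explicit year before parsing (2022 below is an arbitrary placeholder; use whatever year the data actually belongs to):
# prepend a placeholder year and add %Y to the format
scen_1['Date'] = pd.to_datetime('2022-' + scen_1['DateTime'],
                                format='%Y-%m-%d %H:%M:%S')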

Imputation using pandas

I have a multi-year timeseries with half-hourly resolution with some gaps and would like to impute them based on averages of the values of other years, but at the same time. E.g. if a value is missing at 2005-1-1 12:00, I'd like to take all the values at the same time, but from all other years and average them, then impute the missing value by the average. Here's what I got:
import pandas as pd
import numpy as np
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = None
grouped = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]).mean()
Which gives me the averages I need, but I don't know how to plug them back into the original timeseries.
You are almost there. Just use .transform to fill the NaNs.
import pandas as pd
import numpy as np
# your data
# ==================================================
np.random.seed(0)
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = np.nan
somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 NaN
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 NaN
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 NaN
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183
[175345 rows x 1 columns]
# processing
# ==================================================
result = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute],
                    as_index=False).transform(lambda g: g.fillna(g.mean()))
somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 0.2671
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 0.3957
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 0.4784
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183
[175345 rows x 1 columns]
# take a look at a particular sample
# ======================================
x = list(df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]
somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 NaN
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 NaN
2010-01-01 0.5183
x.mean() # output: 0.3998
list(result.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]
somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 0.3998
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 0.3998
2010-01-01 0.5183
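In the sample group above, both missing slots (2003 and 2009) were filled with the group mean 0.3998, confirming that transform broadcast the means back onto the original index.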
