Imputation using pandas - python

I have a multi-year timeseries at half-hourly resolution with some gaps, and I would like to impute them based on averages of the values from other years at the same time. E.g. if a value is missing at 2005-1-1 12:00, I'd like to take all the values at that same time from all other years, average them, and impute the missing value with that average. Here's what I have so far:
import pandas as pd
import numpy as np
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = None
grouped = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]).mean()
Which gives me the averages I need, but I don't know how to plug them back into the original timeseries.

You are almost there. Just use .transform to fill the NaNs.
import pandas as pd
import numpy as np
# your data
# ==================================================
np.random.seed(0)
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = np.nan
somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 NaN
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 NaN
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 NaN
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183
[175345 rows x 1 columns]
# processing
# ==================================================
# group by (month, day, hour, minute) so each group collects the same
# clock slot across all years, then fill each group's NaNs with its mean
result = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute],
                    as_index=False).transform(lambda g: g.fillna(g.mean()))
somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 0.2671
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 0.3957
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 0.4784
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183
[175345 rows x 1 columns]
# take a look at a particular sample
# ======================================
x = list(df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]
somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 NaN
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 NaN
2010-01-01 0.5183
x.mean() # output: 0.3998
list(result.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]
somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 0.3998
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 0.3998
2010-01-01 0.5183
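If the lambda proves slow on a long series, an equivalent variant (a sketch, reusing the same df and grouping key as above) is to let groupby compute the means in one vectorized pass and then fill only the gaps:
key = [df.index.month, df.index.day, df.index.hour, df.index.minute]
# transform('mean') broadcasts each group's mean back to the original shape,
# so fillna only touches the rows that were NaN to begin with
group_means = df.groupby(key)['somedata'].transform('mean')
result2 = df['somedata'].fillna(group_means)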

Related

Replacing NaNs with Mean Value using Pandas

Say I have a DataFrame called Data with shape (71067, 4):
StartTime EndDateTime TradeDate Values
0 2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01 -44.676
1 2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01 -36.113
2 2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01 -19.229
3 2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01 -23.606
4 2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01 -25.899
... ... ... ... ...
2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30 -27.198
2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30 -13.221
2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30 -12.034
2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30 -16.464
2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30 -25.441
71067 rows × 4 columns
When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
Data.isna().sum().sum()
> 1391
Shown here:
Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')
StartTime EndDateTime TradeDate Values
0 2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01 NaN
1 2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04 NaN
2 2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04 NaN
3 2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04 NaN
4 2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04 NaN
... ... ... ... ...
1386 2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06 NaN
1387 2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06 NaN
1388 2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22 NaN
1389 2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25 NaN
1390 2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25 NaN
Is there any way of replacing each of the NaN values in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means
HH Values
0 00:00:00 5.236811
1 00:30:00 2.056571
2 01:00:00 4.157455
3 01:30:00 2.339253
4 02:00:00 2.658238
5 02:30:00 0.230557
6 03:00:00 0.217599
7 03:30:00 -0.630243
8 04:00:00 -0.989919
9 04:30:00 -0.494372
For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.
Let's group the dataframe by HH, then transform the Values with mean to broadcast the group means back to the original column shape, then use fillna to fill the null values:
avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
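A quick sanity check after filling (assuming the Data frame from the question): the NaN count should drop to zero, provided every half-hour slot has at least one observed value across the years.
print(Data['Values'].isna().sum())  # expect 0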

Replace nan with zero or linear interpolation

I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need to do is to replace a NaN value with either 0 if it sits between other NaN values, or with the result of interpolation if it sits between numeric values. Any idea how I can achieve that?
Use DataFrame.interpolate with limit_area='inside' to interpolate only between numeric values, then replace the remaining missing values with 0:
print (df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print (df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
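For reference, a minimal self-contained sketch of the example above (the seven-row frame is reconstructed by hand here):
import pandas as pd
import numpy as np

idx = pd.date_range('2017-01-01', periods=7, freq='h')
df = pd.DataFrame({'PV_Power': [np.nan, 4, np.nan, np.nan, 5, np.nan, np.nan]}, index=idx)
df.index.name = 'date'
# interpolate only between known values, then zero the leading/trailing gaps
df = df.interpolate(limit_area='inside').fillna(0)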
You could reindex your dataframe, dropping the NaN rows and refilling every gap with 0 (note this skips the interpolation step):
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0
You can use fillna(0):
df['PV_Power'] = df['PV_Power'].fillna(0)
or you can replace it:
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)

How do I resample a pandas Series using values around the hour

I have timeseries data recorded at 10min frequency. I want to average the values at one-hour intervals. But for that I want to take the 3 values before the hour, the value at the hour, and the 2 values after it, take the average, and assign that value to the exact hour timestamp.
for example, I have the series
index = pd.date_range('2000-01-01T00:30:00', periods=63, freq='10min')
series = pd.Series(range(63), index=index)
series
2000-01-01 00:30:00 0
2000-01-01 00:40:00 1
2000-01-01 00:50:00 2
2000-01-01 01:00:00 3
2000-01-01 01:10:00 4
2000-01-01 01:20:00 5
2000-01-01 01:30:00 6
2000-01-01 01:40:00 7
2000-01-01 01:50:00 8
2000-01-01 02:00:00 9
2000-01-01 02:10:00 10
..
2000-01-01 08:50:00 50
2000-01-01 09:00:00 51
2000-01-01 09:10:00 52
2000-01-01 09:20:00 53
2000-01-01 09:30:00 54
2000-01-01 09:40:00 55
2000-01-01 09:50:00 56
2000-01-01 10:00:00 57
2000-01-01 10:10:00 58
2000-01-01 10:20:00 59
2000-01-01 10:30:00 60
2000-01-01 10:40:00 61
2000-01-01 10:50:00 62
Freq: 10T, Length: 63, dtype: int64
So, if I do
series.resample('1H').mean()
2000-01-01 00:00:00 1.0
2000-01-01 01:00:00 5.5
2000-01-01 02:00:00 11.5
2000-01-01 03:00:00 17.5
2000-01-01 04:00:00 23.5
2000-01-01 05:00:00 29.5
2000-01-01 06:00:00 35.5
2000-01-01 07:00:00 41.5
2000-01-01 08:00:00 47.5
2000-01-01 09:00:00 53.5
2000-01-01 10:00:00 59.5
Freq: H, dtype: float64
The first value is the average of 0, 1, and 2, assigned to hour 0; the second is the average of the values from 1:00:00 to 1:50:00, assigned to 1:00:00; and so on.
What I would like to have is the first average centered at 1:00:00 calculated using values from 00:30:00 through 01:20:00, the second centered at 02:00:00 calculated from 01:30:00 to 02:20:00 and so on...
What will be the best way to do that?
Thanks!
You should be able to do that with:
series.index = series.index - pd.Timedelta(30, unit='m')  # shift so each hourly bin starts at HH:30
series_grouped_mean = series.groupby(pd.Grouper(freq='60min')).mean()
series_grouped_mean.index = series_grouped_mean.index + pd.Timedelta(60, unit='m')  # relabel each bin at its centre hour
series_grouped_mean
I got:
2000-01-01 01:00:00 2.5
2000-01-01 02:00:00 8.5
2000-01-01 03:00:00 14.5
2000-01-01 04:00:00 20.5
2000-01-01 05:00:00 26.5
2000-01-01 06:00:00 32.5
2000-01-01 07:00:00 38.5
2000-01-01 08:00:00 44.5
2000-01-01 09:00:00 50.5
2000-01-01 10:00:00 56.5
2000-01-01 11:00:00 61.0
Freq: H, dtype: float64
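An alternative sketch that stays within resample (assuming pandas >= 1.1 for the offset argument): shift the bin edges back by 30 minutes, then move the labels forward so each mean is stamped on the hour it is centred on.
centered = series.resample('60min', offset='30min').mean()
centered.index = centered.index + pd.Timedelta(30, unit='m')
This should produce the same output as the groupby approach above.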

How can I get a conditional hourly mean in pandas?

I have the DataFrame below and want to make an hourly-mean DataFrame, with the condition that for each hour the mean is calculated only from the 00:15:00–00:45:00 values.
date/time form a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The results should be as below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select the rows whose times end with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15–00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       lvl1new.where(m)], names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
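A more compact sketch under the same assumptions (the time level holds datetime.time values): keep only the :15/:30/:45 rows and group them by date plus the hour those times fall in.
import datetime

lvl0 = df1.index.get_level_values(0)
lvl1 = df1.index.get_level_values(1)
m = pd.Index([t.minute for t in lvl1]) != 0              # keep the 15/30/45 rows only
hours = pd.Index([datetime.time(t.hour) for t in lvl1])  # label each row with its hour
out = df1[m].groupby([lvl0[m], hours[m]])['aaa'].mean()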

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta

df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5-minute period over the course of a 24-hour day, without considering which day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals and then group on the time attribute and aggregate the mean.
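If you also want to see how many observations sit behind each 5-minute mean (a quick robustness check, reusing the same df), agg lets you compute both at once:
summary = df.groupby(df.index.floor('5min').time)['value'].agg(['mean', 'count'])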
