I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need to do is replace each NaN with 0 if it lies between other NaN values, or with an interpolated value if it lies between numeric values. Any idea how I can achieve that?
Use DataFrame.interpolate with limit_area='inside' to interpolate only the NaNs that lie between numeric values, then replace the remaining missing values with 0:
print (df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print (df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
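For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The sample values are the ones printed above; building the index with pd.date_range is my assumption about how the frame was created.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame shown above (hourly index, assumed to be named "date").
index = pd.date_range("2017-01-01 00:00", periods=7, freq="60min", name="date")
df = pd.DataFrame(
    {"PV_Power": [np.nan, 4.0, np.nan, np.nan, 5.0, np.nan, np.nan]},
    index=index,
)

# Step 1: fill only NaNs that are surrounded by valid values on both sides.
interpolated = df.interpolate(limit_area="inside")

# Step 2: the leading and trailing NaNs are still missing, so set them to 0.
result = interpolated.fillna(0)
print(result)
```

Keeping the two steps separate makes it easy to check which NaNs were left alone by the interpolation before they are zeroed.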
You could drop the NaN rows and reindex your DataFrame, filling the re-created rows with 0:
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0
You can use fillna(0):
df['PV_Power'] = df['PV_Power'].fillna(0)
or you can replace the values:
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)
Say I have a Dataframe called Data with shape (71067, 4):
StartTime EndDateTime TradeDate Values
0 2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01 -44.676
1 2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01 -36.113
2 2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01 -19.229
3 2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01 -23.606
4 2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01 -25.899
... ... ... ... ...
2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30 -27.198
2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30 -13.221
2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30 -12.034
2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30 -16.464
2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30 -25.441
71067 rows × 4 columns
When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
Data.isna().sum().sum()
> 1391
Shown here:
Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')
0 2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01 NaN
1 2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04 NaN
2 2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04 NaN
3 2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04 NaN
4 2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04 NaN
... ... ... ... ...
1386 2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06 NaN
1387 2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06 NaN
1388 2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22 NaN
1389 2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25 NaN
1390 2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25 NaN
Is there any way of replacing each NaN value in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means
HH Values
0 00:00:00 5.236811
1 00:30:00 2.056571
2 01:00:00 4.157455
3 01:30:00 2.339253
4 02:00:00 2.658238
5 02:30:00 0.230557
6 03:00:00 0.217599
7 03:30:00 -0.630243
8 04:00:00 -0.989919
9 04:30:00 -0.494372
For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.
Group the dataframe by HH and transform Values with mean, which broadcasts each group's mean back to the shape of the original column. Then use fillna to fill the null values from that result:
avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
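A small runnable sketch of the same idea, using a made-up six-row frame (the numbers are invented for illustration, not taken from the question's data):

```python
import numpy as np
import pandas as pd

# Toy data: two half-hour slots over three days, with one gap in each slot.
Data = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 04:00", "2019-01-01 04:30",
        "2019-01-02 04:00", "2019-01-02 04:30",
        "2019-01-03 04:00", "2019-01-03 04:30",
    ]),
    "Values": [1.0, 10.0, np.nan, 20.0, 3.0, np.nan],
})

# Time-of-day key, as in the question.
Data["HH"] = Data["StartTime"].dt.time

# transform('mean') returns one value per original row (NaNs are skipped
# when computing the mean), so it lines up with Data['Values'] for fillna.
avg = Data.groupby("HH")["Values"].transform("mean")
Data["Values"] = Data["Values"].fillna(avg)
print(Data)
```

Here the missing 04:00 value becomes 2.0 (mean of 1.0 and 3.0) and the missing 04:30 value becomes 15.0 (mean of 10.0 and 20.0).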
I am looking for a way to interpolate only over short gaps in a Pandas DataFrame that has a DatetimeIndex. Long gaps should be kept as they are.
df = pd.DataFrame(
{ "value": [ 1, np.nan, 3, np.nan, np.nan, 5, np.nan, 11, np.nan, 21, np.nan, 41 ] },
index=pd.to_datetime( [
"2021-01-01 00:00", "2021-01-01 00:05", "2021-01-01 00:10",
"2021-01-01 00:11", "2021-01-01 00:13", "2021-01-01 00:14",
"2021-01-01 00:15", "2021-01-01 01:30", "2021-01-01 03:00",
"2021-01-01 04:00", "2021-01-01 05:45", "2021-01-01 06:45",
] )
)
value
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 NaN
2021-01-01 00:13:00 NaN
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
The idea is to keep gaps that are longer than a certain time (>5 minutes in this case), but interpolate all missing values within shorter gaps.
interpolate() has a limit argument that limits the number of missing values to be interpolated, but this does not respect the time delta between the rows, only the number of rows.
I would like the result to be like this:
value
2021-01-01 00:00:00 1.000000
2021-01-01 00:05:00 2.000000
2021-01-01 00:10:00 3.000000
2021-01-01 00:11:00 3.500000
2021-01-01 00:13:00 4.500000
2021-01-01 00:14:00 5.000000
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.000000
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.000000
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.000000
This solution fills value gaps whose time span is no longer than a specified limit. The filled values are set in proportion to the entry's position within the gap's time span (time-interpolated values). Julian dates are used for easier computation.
Set the maximum time gap to fill with time-interpolated values (5 minutes, expressed in days):
jd_max_gap_fill = 5/(60*24)
Calculate the value gap:
df['ffill'] = df['value'].ffill()
df['value_gap'] = df['value'].bfill() - df['value'].ffill()
Get the Julian date for the entry:
df['jd'] = df.index.to_julian_date()
Calculate the time gap:
df['jd_nan'] = np.where(~df['value'].isna(), df['jd'], np.nan)
df['jd_gap'] = df['jd_nan'].bfill() - df['jd_nan'].ffill()
Time-wise, calculate how far into the value gap we are:
df['jd_start'] = df['jd_nan'].ffill()
df['jd_prp'] = np.where(df['jd_gap'] != 0, (df['jd'] - df['jd_start'])/df['jd_gap'], 0)
Calculate time-interpolated values:
df['filled_value'] = np.where(df['jd_gap'] <= jd_max_gap_fill, df['ffill'] + df['value_gap'] * df['jd_prp'], np.nan)
df['filled_value']
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 3.5
2021-01-01 00:13:00 4.5
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
Note that my output differs from your expected output because the first NaN sits in a 10-minute gap, which exceeds the 5-minute limit.
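For comparison, here is a shorter sketch of the same rule that avoids Julian dates. It is not the method above but a variation: it measures each gap as a Timedelta between the surrounding valid timestamps and lets interpolate(method='time') do the arithmetic. The helper name interpolate_short_gaps and its max_gap parameter are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(s, max_gap):
    """Time-interpolate NaNs only where the surrounding valid points are at most max_gap apart."""
    stamps = s.index.to_series()
    # Keep the timestamp only on rows that hold a real value (NaT elsewhere).
    valid_stamps = stamps.where(s.notna())
    # For every row: time between the next and the previous valid observation.
    gap = valid_stamps.bfill() - valid_stamps.ffill()
    filled = s.interpolate(method="time", limit_area="inside")
    # Keep original values, plus interpolated ones that sit in a short enough gap.
    return filled.where(s.notna() | (gap <= max_gap))

df = pd.DataFrame(
    {"value": [1, np.nan, 3, np.nan, np.nan, 5, np.nan, 11, np.nan, 21, np.nan, 41]},
    index=pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 00:05", "2021-01-01 00:10",
        "2021-01-01 00:11", "2021-01-01 00:13", "2021-01-01 00:14",
        "2021-01-01 00:15", "2021-01-01 01:30", "2021-01-01 03:00",
        "2021-01-01 04:00", "2021-01-01 05:45", "2021-01-01 06:45",
    ]),
)

result = interpolate_short_gaps(df["value"], pd.Timedelta(minutes=5))
print(result)
```

Comparing Timedelta values also sidesteps floating-point rounding when a gap is exactly equal to the limit.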
I have the data below (date/time form a MultiIndex) and I want to replace the column values in the range 00:00:00~07:00:00 with this numpy array:
[[ 21.63920663 21.62012822 20.9900515 21.23217008 21.19482458
21.10839656 20.89631935 20.79977166 20.99176729 20.91567565
20.87258765 20.76210464 20.50357827 20.55897631 20.38005033
20.38227309 20.54460993 20.37707293 20.08279925 20.09955877
20.02559575 20.12390737 20.2917257 20.20056711 20.1589065
20.41302289 20.48000767 20.55604102 20.70255192]]
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
03:15:00 NaN
03:30:00 NaN
03:45:00 NaN
04:00:00 NaN
04:15:00 NaN
04:30:00 NaN
04:45:00 NaN
05:00:00 NaN
05:15:00 NaN
05:30:00 NaN
05:45:00 NaN
06:00:00 NaN
06:15:00 NaN
06:30:00 NaN
06:45:00 NaN
07:00:00 NaN
07:15:00 NaN
07:30:00 NaN
07:45:00 NaN
08:00:00 NaN
08:15:00 NaN
08:30:00 NaN
08:45:00 NaN
09:00:00 NaN
09:15:00 NaN
09:30:00 NaN
09:45:00 NaN
10:00:00 NaN
10:15:00 NaN
10:30:00 NaN
10:45:00 NaN
11:00:00 NaN
Name: temp, dtype: float64
<class 'datetime.time'>
How can I do this?
You can use slicers:
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
Or, if the second level holds datetime.time objects:
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = 1
Sample:
print (df1)
aaa
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 2.00
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
To assign an array, it is necessary to use numpy.tile to repeat it once per unique value of the first level:
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, 10),len(df1.index.levels[0]))
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
A more general solution generates the array from the length of the slice:
idx = pd.IndexSlice
len0 = df1.loc[idx[df1.index.levels[0][0], '00:00:00':'02:00:00'],:].shape[0]
len1 = len(df1.index.levels[0])
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, len0 + 1), len1)
Tested with times:
import datetime
idx = pd.IndexSlice
arr =np.tile(np.arange(1, 10),len(df1.index.levels[0]))
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = arr
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
The problem was finally found: my solution works with a one-column DataFrame, but when working with a Series you need to remove one : (the trailing ,: in the indexer):
arr = np.array([[ 21.63920663, 21.62012822, 20.9900515, 21.23217008, 21.19482458, 21.10839656,
20.89631935, 20.79977166, 20.99176729, 20.91567565, 20.87258765, 20.76210464,
20.50357827, 20.55897631, 20.38005033, 20.38227309, 20.54460993, 20.37707293,
20.08279925, 20.09955877, 20.02559575, 20.12390737, 20.2917257, 20.20056711,
20.1589065, 20.41302289, 20.48000767, 20.55604102, 20.70255192]])
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0): datetime.time(7, 0, 0)]] = arr[0]
---^^^
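To tie the pieces together, a self-contained sketch of the slice-and-tile assignment. It assumes the time level holds zero-padded strings (so that label order matches chronological order) and it names the column explicitly instead of using a trailing ,: which is a small variation on the code above. The frame is rebuilt here from the sample layout, starting with every value missing.

```python
import numpy as np
import pandas as pd

# Two dates x sixteen quarter-hour labels, all values initially missing.
dates = ["2018-01-26", "2018-01-27"]
times = [f"{h:02d}:{m:02d}:00" for h in range(4) for m in (0, 15, 30, 45)]
index = pd.MultiIndex.from_product([dates, times], names=["date", "time"])
df1 = pd.DataFrame({"aaa": np.nan}, index=index)

idx = pd.IndexSlice
rows = idx[:, "00:00:00":"02:00:00"]  # every date, times 00:00 through 02:00

# Number of time slots per date inside the slice, and number of dates.
n_dates = df1.index.get_level_values("date").nunique()
n_slots = len(df1.loc[rows, :]) // n_dates

# One block of new values per date, repeated for every date in the index.
new_values = np.arange(1.0, n_slots + 1)
df1.loc[rows, "aaa"] = np.tile(new_values, n_dates)
print(df1)
```

The tiled array must have exactly as many elements as the slice selects, which is why the slot count is derived from the slice rather than hard-coded.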
I have the dataframe below and I want to build an hourly mean dataframe,
on the condition that each hour's mean is calculated only from the 00:15:00~00:45:00 values of that hour.
Date/time form a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The result should look like the one below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select the rows whose time ends with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
lvl1new.where(m)], names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
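A possibly simpler sketch of the 00:15-00:45 mean, assuming the second index level holds datetime.time objects: flatten the index, drop the on-the-hour rows, and group by date and hour. The column name hour and the variable names are made up for this example; the values are the twenty rows shown in the question.

```python
import pandas as pd

# Rebuild the sample: 20 quarter-hour rows indexed by (date, time).
stamps = pd.date_range("2017-01-01 00:00", periods=20, freq="15min")
values = [146.88, 143.28, 143.28, 141.12, 134.64, 132.48, 136.80, 138.24,
          131.76, 131.04, 134.64, 139.68, 136.08, 132.48, 132.48, 139.68,
          134.64, 131.04, 160.56, 177.12]
df1 = pd.DataFrame(
    {"aaa": values},
    index=pd.MultiIndex.from_arrays([stamps.date, stamps.time], names=["date", "time"]),
)

# Turn the index levels into ordinary columns so they are easy to inspect.
flat = df1.reset_index()
flat["hour"] = flat["time"].map(lambda t: t.hour)
minute = flat["time"].map(lambda t: t.minute)

# Keep only the :15, :30 and :45 rows, then average them per date and hour.
hourly_mean = flat[minute != 0].groupby(["date", "hour"])["aaa"].mean()
print(hourly_mean)
```

This leaves the original df1 index untouched, at the cost of returning an integer hour level rather than a time label.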
I have got the following data:
data
timestamp
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
and would like to sort it by time (ascending, as in the output below) and add a start and end date at the top and bottom of the data, so that it looks like this:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-01 10:00:00 9
2012-06-01 13:00:00 9
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-02 00:00:00 NaN
and finally I would like to extend the dataset to cover all hours from start to end in one hour steps, filling the dataframe with missing timestamps containing 'None'/'NaN' as data.
So far I have the following code:
df2 = pd.DataFrame({'data':temperature, 'timestamp': pd.DatetimeIndex(timestamp)}, dtype=float)
df2.set_index('timestamp',inplace=True)
df3 = pd.DataFrame({ 'timestamp': pd.Series([ts1, ts2]), 'data': [None, None]})
df3.set_index('timestamp',inplace=True)
print(df3)
merged = pd.concat([df3, df2])  # DataFrame.append was removed in pandas 2.0
print(merged)
with the following print outs:
df3:
data
timestamp
2012-06-01 00:00:00 None
2012-06-02 00:00:00 None
merged:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-02 00:00:00 NaN
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
I have tried:
merged = merged.asfreq('H')
but this returned an unsatisfying result:
data
2012-06-01 00:00:00 NaN
2012-06-01 01:00:00 NaN
2012-06-01 02:00:00 NaN
2012-06-01 03:00:00 NaN
2012-06-01 04:00:00 NaN
2012-06-01 05:00:00 NaN
2012-06-01 06:00:00 NaN
2012-06-01 07:00:00 NaN
2012-06-01 08:00:00 NaN
2012-06-01 09:00:00 NaN
2012-06-01 10:00:00 9
Where is the rest of the dataframe? Why does it only contain data till the first valid value?
Help is much appreciated. Thanks a lot in advance
First create an empty dataframe with the timestamp index that you want and then do a left merge with your original dataset:
df2 = pd.DataFrame(index = pd.date_range('2012-06-01','2012-06-02', freq='H'))
df3 = pd.merge(df2, df, left_index = True, right_index = True, how = 'left')
df3
Out[103]:
timestamp value
2012-06-01 00:00:00 NaN NaN
2012-06-01 01:00:00 NaN NaN
2012-06-01 02:00:00 NaN NaN
2012-06-01 03:00:00 NaN NaN
2012-06-01 04:00:00 NaN NaN
2012-06-01 05:00:00 NaN NaN
2012-06-01 06:00:00 NaN NaN
2012-06-01 07:00:00 NaN NaN
2012-06-01 08:00:00 NaN NaN
2012-06-01 09:00:00 NaN NaN
2012-06-01 10:00:00 2012-06-01 10:00:00 9
2012-06-01 11:00:00 NaN NaN
2012-06-01 12:00:00 NaN NaN
2012-06-01 13:00:00 2012-06-01 13:00:00 9
2012-06-01 14:00:00 NaN NaN
2012-06-01 15:00:00 NaN NaN
2012-06-01 16:00:00 NaN NaN
2012-06-01 17:00:00 2012-06-01 17:00:00 9
2012-06-01 18:00:00 NaN NaN
2012-06-01 19:00:00 NaN NaN
2012-06-01 20:00:00 2012-06-01 20:00:00 8
2012-06-01 21:00:00 NaN NaN
2012-06-01 22:00:00 NaN NaN
2012-06-01 23:00:00 NaN NaN
2012-06-02 00:00:00 NaN NaN
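As an alternative sketch to the merge, the same hourly grid can be obtained by sorting the index and reindexing onto the full range, which keeps a single data column. The frame below is rebuilt from the four rows in the question; freq='60min' is used only to avoid depending on which hourly alias the installed pandas version accepts.

```python
import pandas as pd

# The four observations from the question, deliberately left unsorted.
df = pd.DataFrame(
    {"data": [9.0, 8.0, 9.0, 9.0]},
    index=pd.DatetimeIndex(
        ["2012-06-01 17:00", "2012-06-01 20:00", "2012-06-01 13:00", "2012-06-01 10:00"],
        name="timestamp",
    ),
)

# Every hour from the start date to the end date, inclusive (25 stamps).
full_range = pd.date_range("2012-06-01", "2012-06-02", freq="60min", name="timestamp")

# reindex keeps existing values and inserts NaN rows for the missing hours.
hourly = df.sort_index().reindex(full_range)
print(hourly)
```

Because the start and end stamps come from the date range itself, there is no need to append placeholder rows first.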