Replacing NaNs with Mean Value using Pandas - python

Say I have a Dataframe called Data with shape (71067, 4):
StartTime EndDateTime TradeDate Values
0 2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01 -44.676
1 2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01 -36.113
2 2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01 -19.229
3 2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01 -23.606
4 2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01 -25.899
... ... ... ... ...
2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30 -27.198
2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30 -13.221
2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30 -12.034
2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30 -16.464
2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30 -25.441
71067 rows × 4 columns
When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
Data.isna().sum().sum()
> 1391
Shown here:
Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')
0 2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01 NaN
1 2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04 NaN
2 2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04 NaN
3 2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04 NaN
4 2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04 NaN
... ... ... ... ...
1386 2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06 NaN
1387 2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06 NaN
1388 2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22 NaN
1389 2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25 NaN
1390 2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25 NaN
Is there any way of replacing each NaN value in the dataset with the mean value of the corresponding half hour, computed across the 70,000-plus rows? The means are shown below:
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means
HH Values
0 00:00:00 5.236811
1 00:30:00 2.056571
2 01:00:00 4.157455
3 01:30:00 2.339253
4 02:00:00 2.658238
5 02:30:00 0.230557
6 03:00:00 0.217599
7 03:30:00 -0.630243
8 04:00:00 -0.989919
9 04:30:00 -0.494372
For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.

Group the DataFrame by HH, use transform with 'mean' to broadcast each group's mean back to the shape of the original column, then use fillna to fill the null values:
avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
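To see the approach work end to end, here is a minimal sketch on a hypothetical six-row frame (the values are made up, not taken from the question). It assumes StartTime is already a datetime column and that each half-hour slot has at least one non-NaN value; a slot with no valid values would stay NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame: three days, two half-hour slots.
data = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 04:00", "2019-01-01 04:30",
        "2019-01-02 04:00", "2019-01-02 04:30",
        "2019-01-03 04:00", "2019-01-03 04:30",
    ]),
    "Values": [1.0, 10.0, np.nan, 20.0, 3.0, np.nan],
})

# Time of day is the grouping key, so every 04:00 row lands in one group.
data["HH"] = data["StartTime"].dt.time

# transform returns one mean per row (NaNs are skipped when averaging),
# aligned with the original index, so fillna can match row by row.
avg = data.groupby("HH")["Values"].transform("mean")
data["Values"] = data["Values"].fillna(avg)

print(data[["StartTime", "Values"]])
```

The missing 04:00 value becomes the mean of the other 04:00 rows, and likewise for 04:30; the rows that already had values are left unchanged.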

Related

How to use datetime index from a Series to modify another series?

I have 2 Series with datetime index. I want to edit the values (float) of the first one at each corresponding daily minimum of the second one.
I tried
ser_1.loc[ser_2.groupby(ser_2.index.day_of_year).idxmin()] += 1
But I get this error :
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
Series 2 and series 1, respectively, look like this:
2019-01-01 00:00:00 0.04980
2019-01-01 01:00:00 0.04426
2019-01-01 02:00:00 0.05100
2019-01-01 03:00:00 0.04627
2019-01-01 04:00:00 0.03978
...
2019-12-31 19:00:00 0.04773
2019-12-31 20:00:00 0.04600
2019-12-31 21:00:00 0.04220
2019-12-31 22:00:00 0.03974
2019-12-31 23:00:00 0.03888
Name: 0, Length: 8760, dtype: float64
2019-01-01 23:00:00 0.000
2019-01-02 00:00:00 0.000
2019-01-02 01:00:00 0.000
2019-01-02 02:00:00 0.000
2019-01-02 03:00:00 0.000
...
2019-12-13 06:00:00 1.534
2019-12-13 07:00:00 2.425
2019-12-13 08:00:00 1.622
2019-12-13 09:00:00 1.974
2019-12-13 10:00:00 1.729
Freq: H, Name: 1, Length: 8292, dtype: float64
Could it be a non-corresponding index format, or just bad use of a function?
I found my problem.
The function I used is correct,
but ser_1 is incomplete, so some of the timestamps taken from ser_2 have no corresponding entry in it.
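A minimal sketch of that situation, using hypothetical data in which ser_1 covers only part of the period covered by ser_2. One possible workaround, not taken from the original post, is to keep only the labels that actually exist in ser_1 by using Index.intersection before modifying it.

```python
import pandas as pd

# Hypothetical data: ser_2 spans two days, ser_1 only the second day.
idx = pd.date_range("2019-01-01", periods=48, freq=pd.Timedelta(hours=1))
ser_2 = pd.Series([float(v) for v in range(48, 0, -1)], index=idx)
ser_1 = pd.Series(0.0, index=idx[24:])

# Timestamp of the daily minimum of ser_2 (the last hour of each day here).
min_labels = ser_2.groupby(ser_2.index.day_of_year).idxmin()

# Selecting labels that ser_1 does not contain raises KeyError.
try:
    ser_1.loc[min_labels]
except KeyError as exc:
    print("KeyError:", exc)

# Keep only the labels that exist in ser_1 before modifying it.
present = pd.DatetimeIndex(min_labels).intersection(ser_1.index)
ser_1.loc[present] += 1
```

Whether silently skipping the missing timestamps is acceptable depends on the data; reindexing ser_1 onto the full index of ser_2 first would be another option.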

Replace nan with zero or linear interpolation

I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need to do is replace each NaN value with either 0, if it lies between other NaN values, or with the result of interpolation, if it lies between numeric values. Any idea how I can achieve that?
Use DataFrame.interpolate with limit_area='inside' to interpolate only between numeric values, then replace the remaining missing values with 0:
print (df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print (df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
You could reindex your dataframe:
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0
You can use fillna(0):
df['PV_Power'] = df['PV_Power'].fillna(0)
or you can replace the values:
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)

Drop all rows for the month if a column has more than one value that crossed the threshold

I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
Suppose the value -9999 is repeated 200 times in the month of January and the threshold is 150. In that case the entire month of January, i.e. all of its rows, must be deleted.
date values repeated
1 2013-02 0
2 2013-03 2
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
With this I think I can drop the rows that repeat, but I want to drop the whole month.
import numpy as np
df['month'] = df['date'].dt.to_period('M')
df['new_value'] = np.where((df['values'] == -9999) & (df['n_missing'] > 150),np.nan,df['values'])
df.dropna()
How can I do that?
One way is to use pandas.to_datetime with pandas.DataFrame.groupby.filter.
Here's a sample with months that have -9999 repeated 2, 1, 0, 2 times each:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999.0
3 2013-01-01 03:00:00 -9999.0
4 2013-01-01 04:00:00 0.0
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999.0
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999.0
8759 2016-12-31 23:00:00 0.0
Then we do filtering:
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() < 2)
print(new_df)
Output:
date values
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
You can see the months with 2 or more repeats are deleted.
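The same idea can also be written with a boolean mask instead of filter. The sketch below is an alternative, not the answer's method: it counts the sentinel per month with transform and keeps rows whose month is at or below the threshold. The constant names and the threshold of 2 are hypothetical, chosen so the tiny sample shows a dropped month; the question's threshold of 150 would go in the same place.

```python
import pandas as pd

SENTINEL = -9999   # marker for a missing reading, as in the question
THRESHOLD = 2      # hypothetical; the question uses 150

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-01-01 00:00", "2013-01-01 01:00", "2013-01-01 02:00",
        "2013-02-01 12:00", "2013-02-02 12:00",
        "2013-03-01 12:00",
    ]),
    "values": [-9999.0, -9999.0, -9999.0, -9999.0, 5.0, 0.0],
})

# Monthly period used as the grouping key (2013-01, 2013-02, ...).
month = df["date"].dt.to_period("M")

# Per-row count of sentinel values in that row's month.
is_missing = df["values"].eq(SENTINEL).astype(int)
n_missing = is_missing.groupby(month).transform("sum")

# Drop every row of a month whose sentinel count exceeds the threshold.
kept = df[n_missing <= THRESHOLD]
print(kept)
```

January has three sentinel rows and is removed entirely; February keeps both of its rows, including the one that still holds -9999.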

Add hours to year-month-day data in pandas data frame

I have the following data frame with hourly resolution
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 20.96
1 2017-01-01 20.90
2 2017-01-01 18.13
3 2017-01-01 16.03
4 2017-01-01 16.43
... ...
8756 2017-12-31 25.56
8757 2017-12-31 11.02
8758 2017-12-31 7.32
8759 2017-12-31 1.86
type(day_ahead_DK1)
Out[28]: pandas.core.frame.DataFrame
But the current column DateStamp is missing hours. How can I add hours 00:00:00, to 2017-01-01 for Index 0 so it will be 2017-01-01 00:00:00, and then 01:00:00, to 2017-01-01 for Index 1 so it will be 2017-01-01 01:00:00, and so on, so that all my days will have hours from 0 to 23. Thank you!
The expected output:
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
... ...
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
Use GroupBy.cumcount as a per-day counter, convert it to hours with to_timedelta, and add it to the DateStamp column:
df['DateStamp'] = pd.to_datetime(df['DateStamp'])
df['DateStamp'] += pd.to_timedelta(df.groupby('DateStamp').cumcount(), unit='h')
print (df)
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
8756 2017-12-31 00:00:00 25.56
8757 2017-12-31 01:00:00 11.02
8758 2017-12-31 02:00:00 7.32
8759 2017-12-31 03:00:00 1.86
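The counter-based approach relies on the rows of each day already being in chronological order and on no day having more than 24 rows. A small self-contained sketch with a guard for the second assumption follows; the 2017-01-02 rows are hypothetical and only there to show the counter restarting.

```python
import pandas as pd

# Hypothetical sample: three rows on the first day, two on the second.
df = pd.DataFrame({
    "DateStamp": ["2017-01-01"] * 3 + ["2017-01-02"] * 2,
    "DK1": [20.96, 20.90, 18.13, 16.03, 16.43],
})
df["DateStamp"] = pd.to_datetime(df["DateStamp"])

# 0, 1, 2, ... within each date; restarts at 0 when the date changes.
hour = df.groupby("DateStamp").cumcount()

# Guard: more than 24 rows for one date would spill into the next day.
if hour.max() > 23:
    raise ValueError("a date has more than 24 rows")

df["DateStamp"] = df["DateStamp"] + pd.to_timedelta(hour, unit="h")
print(df)
```

Days with 23 or 25 hourly rows (for example around daylight-saving changes) would need separate handling.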

How to fill the first date in the column?

I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with datetimes, e.g. by Timestamp:
df['start']= df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, in pandas 0.24+, use the fill_value parameter:
df['start']= df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If the datetimes are regular, always exactly 15 minutes apart, you can instead subtract an offsets.DateOffset:
df['start']= df['end'] - pd.offsets.DateOffset(minutes=15)
print (df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
How about that?
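# Build the 'end' column directly instead of assigning into an empty frame
df = pd.DataFrame({'end': pd.date_range(start=pd.Timestamp(2019,1,1,0,15), end=pd.Timestamp(2019,1,2), freq='15min')})
df['start'] = df['end'].shift(1)
# Infer the regular step from two consecutive rows
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
# Fill the first start (NaT after the shift) by stepping back from the second one
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta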
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00
