I am looking for a way to interpolate only over short gaps in a Pandas DataFrame that has a DateTimeIndex. Long gaps should be kept as they are.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": [1, np.nan, 3, np.nan, np.nan, 5, np.nan, 11, np.nan, 21, np.nan, 41]},
    index=pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 00:05", "2021-01-01 00:10",
        "2021-01-01 00:11", "2021-01-01 00:13", "2021-01-01 00:14",
        "2021-01-01 00:15", "2021-01-01 01:30", "2021-01-01 03:00",
        "2021-01-01 04:00", "2021-01-01 05:45", "2021-01-01 06:45",
    ]),
)
value
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 NaN
2021-01-01 00:13:00 NaN
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
The idea is to keep gaps that are longer than a certain time (here, more than 5 minutes), but interpolate all missing values within shorter gaps.
interpolate() has a limit argument that caps the number of consecutive missing values filled, but it counts rows and does not respect the time delta between them.
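For illustration (my example, using the df above), limit counts rows, which is exactly backwards from what I want here:

# limit counts consecutive NaN rows, not elapsed time: the lone NaN at 00:05
# (a 10-minute gap) is filled, while 00:13 stays NaN because it is the second
# NaN of its run, even though its gap spans only 4 minutes
print(df['value'].interpolate(limit=1))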
I would like the result to be like this:
value
2021-01-01 00:00:00 1.000000
2021-01-01 00:05:00 2.000000
2021-01-01 00:10:00 3.000000
2021-01-01 00:11:00 3.500000
2021-01-01 00:13:00 4.500000
2021-01-01 00:14:00 5.000000
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.000000
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.000000
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.000000
This solution fills value gaps whose time span is shorter than a specified maximum. The filled values are set proportionally to the entry's position within the gap's time span (time-interpolated values). Julian dates are used for easier computation.
Set the maximum gap span to fill with time-interpolated values, here 5 minutes (expressed in days, since Julian dates are measured in days):
jd_max_gap_fill = 5 / (60 * 24)
Calculate the value gap:
df['ffill'] = df['value'].ffill()
df['value_gap'] = df['value'].bfill() - df['value'].ffill()
Get the Julian date for the entry:
df['jd'] = df.index.to_julian_date()
Calculate the time gap:
df['jd_nan'] = np.where(~df['value'].isna(), df['jd'], np.nan)
df['jd_gap'] = df['jd_nan'].bfill() - df['jd_nan'].ffill()
Time-wise, calculate how far into the value gap we are:
df['jd_start'] = df['jd_nan'].ffill()
df['jd_prp'] = np.where(df['jd_gap'] != 0, (df['jd'] - df['jd_start'])/df['jd_gap'], 0)
Calculate time-interpolated values:
df['filled_value'] = np.where(df['jd_gap'] <= jd_max_gap_fill, df['ffill'] + df['value_gap'] * df['jd_prp'], np.nan)
df['filled_value']
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 3.5
2021-01-01 00:13:00 4.5
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
Note that my output differs from your expected output because the first NaN sits in a 10-minute gap.
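For reference, a more compact sketch of the same idea, using interpolate(method='time') and masking rows whose surrounding gap is too long (my variant, assuming the df from the question):

max_gap = pd.Timedelta(minutes=5)

# timestamps of the rows that actually have a value (NaT elsewhere)
t = pd.Series(df.index, index=df.index).where(df['value'].notna())

# time span between the valid values surrounding each row (0 on valid rows,
# NaT before the first / after the last valid value, which the mask drops)
gap = t.bfill() - t.ffill()

# time-proportional interpolation, then keep only rows inside short gaps
df['filled_value'] = df['value'].interpolate(method='time').where(gap <= max_gap)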
I am working with a dataset where the first column holds dates in datetime format and the remaining columns are hours as floats, like this:
date 1.0 2.0 3.0 ... 21.0 22.0 23.0 24.0
0 2021-01-01 24.95 24.35 23.98 ... 27.32 26.98 26.44 25.64
1 2021-01-02 25.59 24.91 24.74 ... 27.38 26.96 26.85 25.94
and what I want to achieve is this:
Date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 04:00:00 ...
So I have been figuring that the first step should be to convert the hours into a datetime format.
I have been trying this code, for example: df[1.0] = pd.to_datetime(df[1.0], format='%h')
which raises: "ValueError: 'h' is a bad directive in format '%h'"
Then I would rearrange the columns and rows; I have been thinking about doing this with pandas pivot_table and transform. Any help would be appreciated. Thank you.
Use DataFrame.set_index first, convert the column labels to timedeltas, reshape with DataFrame.unstack, and finally add the timedeltas to the dates:
df['date'] = pd.to_datetime(df['date'])

# map an hour-as-float column label to a timedelta
f = lambda x: pd.to_timedelta(float(x), unit='h')

df1 = (df.set_index('date')
         .rename(columns=f)
         .unstack()
         .reset_index(name='Price')
         .assign(date=lambda x: x['date'] + x.pop('level_0')))
print(df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-02 01:00:00 25.59
2 2021-01-01 02:00:00 24.35
3 2021-01-02 02:00:00 24.91
4 2021-01-01 03:00:00 23.98
5 2021-01-02 03:00:00 24.74
6 2021-01-01 21:00:00 27.32
7 2021-01-02 21:00:00 27.38
8 2021-01-01 22:00:00 26.98
9 2021-01-02 22:00:00 26.96
10 2021-01-01 23:00:00 26.44
11 2021-01-02 23:00:00 26.85
12 2021-01-02 00:00:00 25.64
13 2021-01-03 00:00:00 25.94
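If you also want the rows in chronological order, sort at the end (as the second approach below does):

df1 = df1.sort_values('date', ignore_index=True)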
Or use DataFrame.melt and then add the variable column converted to timedeltas:
df['date'] = pd.to_datetime(df['date'])
df1 = (df.melt('date', value_name='Price')
.assign(date = lambda x: x['date'] +
pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
.sort_values('date', ignore_index=True))
print(df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 21:00:00 27.32
4 2021-01-01 22:00:00 26.98
5 2021-01-01 23:00:00 26.44
6 2021-01-02 00:00:00 25.64
7 2021-01-02 01:00:00 25.59
8 2021-01-02 02:00:00 24.91
9 2021-01-02 03:00:00 24.74
10 2021-01-02 21:00:00 27.38
11 2021-01-02 22:00:00 26.96
12 2021-01-02 23:00:00 26.85
13 2021-01-03 00:00:00 25.94
I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need to do is replace each NaN with 0 if it sits between other NaNs, or with an interpolated value if it sits between numeric values. Any idea how I can achieve that?
Use DataFrame.interpolate with limit_area='inside' to interpolate only between numeric values, and then replace the remaining missing values with 0:
print(df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print(df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
You could reindex your dataframe (note that this replaces every NaN with 0, without interpolating):
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0
You can use fillna(0):
df['PV_Power'].fillna(0, inplace=True)
or you can replace it:
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)
I have a dataframe that's indexed by datetime and has one column of integers and another column that I want to put in a string if a condition of the integers is met. I need the condition to assess the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition, however if the shifted row is on a different day then the condition will still use it and I want it to ignore any rows that are on a different day. I've tried various iterations of groupby(df.index.date) but can't seem to figure out if that will work or not.
Not sure if this is the best way to do it, but it gets you the answer (group by the date part of the index, so the shift never crosses a day boundary):
df['StringCol'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
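A minimal self-contained check of that one-liner (my sample data, column names from the question):

import numpy as np
import pandas as pd

idx = pd.to_datetime(['2021-01-01 09:00', '2021-01-01 10:00',
                      '2021-01-02 09:00', '2021-01-02 10:00'])
df = pd.DataFrame({'IntCol': [1, 2, 1, 5]}, index=idx)
df['StringCol'] = np.where(
    df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1),
    'Success', 'Failure')
print(df)
# The first row of each day compares against NaN, so it comes out 'Failure'.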
I think this is what you want. You were probably closer to the answer than you thought...
Two dataframes are used below to show that your logic works whether the data is random or a sorted range of integers.
You will need to import random to reproduce the second example.
import random

dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))

def compare(x):
    x.loc[x['IntCol'] > x['IntCol'].shift(periods=1), 'StringCol'] = 'Success'
    return x

#### Will show Success in all rows except where the date changes, because the values are a range in numerical order
df = pd.DataFrame({'IntCol': range(10, 26)}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
### random numbers to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0,16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success
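The same logic also works without apply, using a grouped shift (my sketch; like the apply version, it leaves NaN where the condition is false):

df.loc[df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'StringCol'] = 'Success'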
I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The full dataset runs until 2021-01-01 00:00:00 (value 95.6), at 15-minute intervals.
Since the frequency is 15 minutes, I would like to change it to 1 hour and drop the intermediate values.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks
Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5
If you're doing data analysis or data science, I don't think dropping the middle values is a good approach at all! You could sum them instead (I don't know your use case, but I know some things about time series data).
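For example, a minimal sketch of that aggregation using resample (my snippet; 'dates' is the column name from the question):

# hourly sum of the 15-minute samples (use .mean() if an average fits better),
# labelled on the right edge so the interval (00:00, 01:00] is reported as 01:00
hourly = df.resample('1H', on='dates', closed='right', label='right').sum()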
I want a running 24-hour difference. For a dataframe with no missing values, this would be as easy as df.diff(periods=24, axis=0). But how is it possible to tie the calculation to the index values rather than row positions?
Reproducible dataframe - Code:
# Imports
import pandas as pd
import numpy as np
# A dataframe with two variables, random numbers and hourly time series
np.random.seed(123)
rows = 36
rng = pd.date_range('1/1/2017', periods=rows, freq='H')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['A', 'B'])
df = df.set_index(rng)
Desired output - Code:
# Running difference step = 24
df = df.diff(periods=24, axis=0)
df = df.dropna(axis=0, how='all')
The real challenge
The problem is that my real-world examples are full of missing values.
So I'll have to connect the difference intervals to the index values, and I have no idea how. I've tried a few solutions that fill in the missing hours in the index first and then run the differences as before, but it's not very elegant.
Thank you for any suggestions!
Edit - As requested in the comments, here's my best attempt for a bit longer time period:
df_missing = df.drop(df.index[[2,3]])
newIndex = pd.date_range(start = '1/1/2017', end = '1/3/2017', freq='H')
df_missing = df_missing.reindex(newIndex, fill_value = np.nan)
df_refilled = df_missing.diff(periods=24, axis=0)
Compared to the other suggestions, I would say that this is not very elegant =)
I think maybe you can use groupby on the hour of the index:
df.groupby(df.index.hour).diff().dropna()
Out[784]:
A B
2017-01-02 00:00:00 -3.0 3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00 -4.0 -7.0
2017-01-02 03:00:00 3.0 -29.0
2017-01-02 04:00:00 -4.0 3.0
2017-01-02 05:00:00 -17.0 -6.0
2017-01-02 06:00:00 -20.0 35.0
2017-01-02 07:00:00 -2.0 -40.0
2017-01-02 08:00:00 13.0 -21.0
2017-01-02 09:00:00 -9.0 -13.0
2017-01-02 10:00:00 0.0 3.0
2017-01-02 11:00:00 -21.0 -9.0
You can snap your dataframe to an hourly frequency using asfreq, and then use diff:
df.asfreq('1H').diff(periods=24, axis=0).dropna()
Or, use shift and then subtract (instead of diff):
v = df.asfreq('1h')
(v - v.shift(periods=24)).dropna()
A B
2017-01-02 00:00:00 -3.0 3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00 -4.0 -7.0
2017-01-02 03:00:00 3.0 -29.0
2017-01-02 04:00:00 -4.0 3.0
2017-01-02 05:00:00 -17.0 -6.0
2017-01-02 06:00:00 -20.0 35.0
2017-01-02 07:00:00 -2.0 -40.0
2017-01-02 08:00:00 13.0 -21.0
2017-01-02 09:00:00 -9.0 -13.0
2017-01-02 10:00:00 0.0 3.0
2017-01-02 11:00:00 -21.0 -9.0
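As a quick sanity check (my snippet, reusing the original 36-row hourly df from the question, before any diff), the two approaches agree when no rows are missing:

a = df.asfreq('H').diff(periods=24, axis=0).dropna(how='all')
b = df.groupby(df.index.hour).diff().dropna()
print(a.equals(b))  # True on the gap-free example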