Fill NaN in candlestick OHLCV data - python

I have a DataFrame like this
OPEN HIGH LOW CLOSE VOL
2012-01-01 19:00:00 449000 449000 449000 449000 1336303000
2012-01-01 20:00:00 NaN NaN NaN NaN NaN
2012-01-01 21:00:00 NaN NaN NaN NaN NaN
2012-01-01 22:00:00 NaN NaN NaN NaN NaN
2012-01-01 23:00:00 NaN NaN NaN NaN NaN
...
OPEN HIGH LOW CLOSE VOL
2013-04-24 14:00:00 11700000 12000000 11600000 12000000 20647095439
2013-04-24 15:00:00 12000000 12399000 11979000 12399000 23997107870
2013-04-24 16:00:00 12399000 12400000 11865000 12100000 9379191474
2013-04-24 17:00:00 12300000 12397995 11850000 11850000 4281521826
2013-04-24 18:00:00 11850000 11850000 10903000 11800000 15546034128
I need to fill the NaN values according to this rule:
When OPEN, HIGH, LOW, CLOSE are all NaN,
set VOL to 0
set OPEN, HIGH, LOW, CLOSE to the previous candle's CLOSE value
otherwise keep NaN

Since neither of the other two answers works, here's a complete answer.
I'm testing two methods here. The first is based on working4coin's comment on hd1's answer, and the second is a slower, pure-Python implementation. It seems obvious that the Python implementation should be slower, but I decided to time the two methods to make sure and to quantify the results.
def nans_to_prev_close_method1(data_frame):
    data_frame['volume'] = data_frame['volume'].fillna(0.0)  # volume should always be 0 (if there were no trades in this interval)
    data_frame['close'] = data_frame['close'].fillna(method='pad')  # i.e. pull the last close into this close
    # now copy the close that was pulled down from the last timestep into this row, across into o/h/l
    data_frame['open'] = data_frame['open'].fillna(data_frame['close'])
    data_frame['low'] = data_frame['low'].fillna(data_frame['close'])
    data_frame['high'] = data_frame['high'].fillna(data_frame['close'])
Method 1 does most of the heavy lifting in C (inside the pandas code), and so should be quite fast.
The slow, pure-Python approach (method 2) is shown below:
def nans_to_prev_close_method2(data_frame):
    prev_row = None
    for index, row in data_frame.iterrows():
        if np.isnan(row['open']):  # row.isnull().any():
            pclose = prev_row['close']
            # assumes first row has no nulls!!
            row['open'] = pclose
            row['high'] = pclose
            row['low'] = pclose
            row['close'] = pclose
            row['volume'] = 0.0
        prev_row = row
Testing the timing on both of them:
df = trades_to_ohlcv(PATH_TO_RAW_TRADES_CSV, '1s') # splits raw trades into secondly candles
df2 = df.copy()
wrapped1 = wrapper(nans_to_prev_close_method1, df)
wrapped2 = wrapper(nans_to_prev_close_method2, df2)
print("method 1: %.2f sec" % timeit.timeit(wrapped1, number=1))
print("method 2: %.2f sec" % timeit.timeit(wrapped2, number=1))
The results were:
method 1: 0.46 sec
method 2: 151.82 sec
Clearly method 1 is far faster (approx 330 times faster).
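For reference, newer pandas versions deprecate fillna(method='pad') in favour of .ffill(), so a fully vectorized sketch of the same idea, assuming the question's OPEN/HIGH/LOW/CLOSE/VOL column names, might look like this:

import pandas as pd

def nans_to_prev_close_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # rows where the whole candle is missing
    gap = out[['OPEN', 'HIGH', 'LOW', 'CLOSE']].isna().all(axis=1)
    out['VOL'] = out['VOL'].fillna(0.0)   # no trades -> zero volume
    out['CLOSE'] = out['CLOSE'].ffill()   # carry the last close forward
    for col in ['OPEN', 'HIGH', 'LOW']:
        # flat candle at the previous close for the gap rows
        out.loc[gap, col] = out.loc[gap, 'CLOSE']
    return out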

Here's how to do it via masking
Simulate a frame with some holes (A is your 'close' field)
In [20]: df = DataFrame(randn(10,3),index=date_range('20130101',periods=10,freq='min'),
columns=list('ABC'))
In [21]: df.iloc[1:3,:] = np.nan
In [22]: df.iloc[5:8,1:3] = np.nan
In [23]: df
Out[23]:
A B C
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362
2013-01-01 00:01:00 NaN NaN NaN
2013-01-01 00:02:00 NaN NaN NaN
2013-01-01 00:03:00 1.788240 -0.593195 0.059606
2013-01-01 00:04:00 1.097781 0.835491 -0.855468
2013-01-01 00:05:00 0.753991 NaN NaN
2013-01-01 00:06:00 -0.456790 NaN NaN
2013-01-01 00:07:00 -0.479704 NaN NaN
2013-01-01 00:08:00 1.332830 1.276571 -0.480007
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401
The rows that are all NaN
In [24]: mask_0 = pd.isnull(df).all(axis=1)
In [25]: mask_0
Out[25]:
2013-01-01 00:00:00 False
2013-01-01 00:01:00 True
2013-01-01 00:02:00 True
2013-01-01 00:03:00 False
2013-01-01 00:04:00 False
2013-01-01 00:05:00 False
2013-01-01 00:06:00 False
2013-01-01 00:07:00 False
2013-01-01 00:08:00 False
2013-01-01 00:09:00 False
Freq: T, dtype: bool
The rows where we want to propagate A
In [26]: mask_fill = pd.isnull(df['B']) & pd.isnull(df['C'])
In [27]: mask_fill
Out[27]:
2013-01-01 00:00:00 False
2013-01-01 00:01:00 True
2013-01-01 00:02:00 True
2013-01-01 00:03:00 False
2013-01-01 00:04:00 False
2013-01-01 00:05:00 True
2013-01-01 00:06:00 True
2013-01-01 00:07:00 True
2013-01-01 00:08:00 False
2013-01-01 00:09:00 False
Freq: T, dtype: bool
Propagate first
In [28]: df.loc[mask_fill,'C'] = df['A']
In [29]: df.loc[mask_fill,'B'] = df['A']
Fill the all-NaN rows with 0
In [30]: df.loc[mask_0] = 0
Done
In [31]: df
Out[31]:
A B C
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362
2013-01-01 00:01:00 0.000000 0.000000 0.000000
2013-01-01 00:02:00 0.000000 0.000000 0.000000
2013-01-01 00:03:00 1.788240 -0.593195 0.059606
2013-01-01 00:04:00 1.097781 0.835491 -0.855468
2013-01-01 00:05:00 0.753991 0.753991 0.753991
2013-01-01 00:06:00 -0.456790 -0.456790 -0.456790
2013-01-01 00:07:00 -0.479704 -0.479704 -0.479704
2013-01-01 00:08:00 1.332830 1.276571 -0.480007
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401

This illustrates pandas' missing data behaviour. The incantation you're looking for is the fillna method, which takes a value:
In [1381]: df2
Out[1381]:
one two three four five timestamp
a NaN 1.138469 -2.400634 bar True NaT
c NaN 0.025653 -1.386071 bar False NaT
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00
h NaN -1.157886 -0.551865 bar False NaT
In [1382]: df2.fillna(0)
Out[1382]:
one two three four five timestamp
a 0.000000 1.138469 -2.400634 bar True 1970-01-01 00:00:00
c 0.000000 0.025653 -1.386071 bar False 1970-01-01 00:00:00
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00
h 0.000000 -1.157886 -0.551865 bar False 1970-01-01 00:00:00
You can even propagate them forward and backward:
In [1384]: df
Out[1384]:
one two three
a NaN 1.138469 -2.400634
c NaN 0.025653 -1.386071
e 0.863937 0.252462 1.500571
f 1.053202 -2.338595 -0.374279
h NaN -1.157886 -0.551865
In [1385]: df.fillna(method='pad')
Out[1385]:
one two three
a NaN 1.138469 -2.400634
c NaN 0.025653 -1.386071
e 0.863937 0.252462 1.500571
f 1.053202 -2.338595 -0.374279
h 1.053202 -1.157886 -0.551865
For your specific case, I think you'll need to do:
df['VOL'].fillna(0)
df.fillna(df['CLOSE'])

Related

Replace nan with zero or linear interpolation

I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need to do is replace a NaN value with either 0 if it is between other NaN values, or with the result of interpolation if it is between numeric values. Any idea how I can achieve that?
Use DataFrame.interpolate with limit_area='inside' if you need to interpolate between numeric values, and then replace the remaining missing values:
print (df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print (df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
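Put together, a small self-contained version of the interpolate-then-fill approach (the index and values here are made up to mirror the example above):

import numpy as np
import pandas as pd

idx = pd.date_range('2017-01-01', periods=7, freq='H', name='date')
df = pd.DataFrame({'PV_Power': [np.nan, 4.0, np.nan, np.nan, 5.0, np.nan, np.nan]}, index=idx)

# interpolate only between valid values, then zero out the leading/trailing NaNs
df = df.interpolate(limit_area='inside').fillna(0)
print(df)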
You could reindex your dataframe
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0
You can use fillna(0):
df['PV_Power'].fillna(0, inplace=True)
or you can replace it:
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)

resampling and appending to same dataframe

I have a dataframe which I want to resample, appending the results to the original dataframe as a new column.
What I have:
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
series
time value
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
What I want:
time value mean_resampled
2000-01-01 00:00:00 0. 2
2000-01-01 00:01:00 1. NaN
2000-01-01 00:02:00 2. NaN
2000-01-01 00:03:00 3. NaN
2000-01-01 00:04:00 4. NaN
2000-01-01 00:05:00 5. 6.5
2000-01-01 00:06:00 6. NaN
2000-01-01 00:07:00 7. NaN
2000-01-01 00:08:00 8. NaN
Note: resampling frequency is '5T'
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index, name='values')
sample = series.resample('5T').mean() # create a sample at some frequency
df = series.to_frame() # convert series to frame
df.loc[sample.index.values, 'mean_resampled'] = sample # use loc to assign new values
values mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
Use resample to compute the mean and concat to merge your Series with new values.
>>> pd.concat([series, series.resample('5T').mean()], axis=1) \
.rename(columns={0: 'value', 1: 'mean_resampled'})
value mean_resampled
2000-01-01 00:00:00 0 2.0
2000-01-01 00:01:00 1 NaN
2000-01-01 00:02:00 2 NaN
2000-01-01 00:03:00 3 NaN
2000-01-01 00:04:00 4 NaN
2000-01-01 00:05:00 5 6.5
2000-01-01 00:06:00 6 NaN
2000-01-01 00:07:00 7 NaN
2000-01-01 00:08:00 8 NaN
If you have a DataFrame instead of Series in your real case, you have just to add a new column:
>>> df['mean_resampled'] = df.resample('5T').mean()
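A complete, runnable variant of the column-assignment approach, using the Series on the right-hand side so that index alignment leaves NaN on the non-bucket rows:

import pandas as pd

index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index, name='value')

df = series.to_frame()
# the 5-minute means land on 00:00 and 00:05; every other row stays NaN
df['mean_resampled'] = series.resample('5T').mean()
print(df)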

Pandas fillna() method not filling all missing values

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is the dataframe I want to use to fill the NaN values:
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When you fillna, you probably want a method, like filling with the previous/next value, the mean of the column, etc. What we can do is something like this:
nulls_index = df['rain_gauge_value'].isnull()
df = df.fillna(method='ffill') # use ffill as example
nulls_after_fill = df[nulls_index]
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to inform pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
from datetime import date, time
import numpy as np
import pandas as pd

raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)], temp=[1.,np.nan], rain=[4.,np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
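Applied to the question's frames, that would look roughly like this (assuming both df and rain_df keep date and time as ordinary columns, as in the printouts above):

# align on (date, time), patch the gaps from rain_df, then restore the columns
patched = (
    df.set_index(['date', 'time'])
      .fillna(rain_df.set_index(['date', 'time']))
      .reset_index()
)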

While resampling, put NaN in the resulting value if there are some NaN values in the source interval

Example:
import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").sum()
2000-01-01 00:00:00 5.0
2000-01-01 00:05:00 30.0
2000-01-01 00:10:00 30.0
Freq: 5T, dtype: float64
In the above example, it computes the sum of the interval 00:00-00:05 as if the missing value were zero. What I want is for it to produce NaN for the 00:00 slot instead.
Or, maybe I'd like for it to be OK if there's one missing value in the interval, but NaN if there are two missing values in the interval.
How can I do these?
For one or more NaN values:
ts.resample('5min').agg(pd.Series.sum, skipna=False)
For a minimum of 2 non-NaN values:
ts.resample('5min').agg(pd.Series.sum, min_count=2)
Allowing a maximum of 2 NaN values seems trickier:
ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)
You might expect ts.resample('5min').sum(skipna=False) to work in the same way as ts.sum(skipna=False), but the implementations are not consistent.
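In newer pandas versions, Resampler.sum itself accepts min_count, so the threshold variants can also be written without agg; a sketch against the ts above:

# NaN unless the bucket has at least 5 non-NaN samples
# (note the partial 00:10 bucket only holds 2 samples, so it becomes NaN too)
ts.resample("5min").sum(min_count=5)

# NaN when fewer than 4 valid samples remain, i.e. two or more missing in a full bucket
ts.resample("5min").sum(min_count=4)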

Extend dataframe by adding start and end date and fill it with timestamps and NaN

I have got the following data:
data
timestamp
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
and would like to sort it ascending by time, add a start and end date at the top and bottom of the data, so that it looks like this:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-01 10:00:00 9
2012-06-01 13:00:00 9
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-02 00:00:00 NaN
and finally I would like to extend the dataset to cover all hours from start to end in one hour steps, filling the dataframe with missing timestamps containing 'None'/'NaN' as data.
So far I have the following code:
df2 = pd.DataFrame({'data':temperature, 'timestamp': pd.DatetimeIndex(timestamp)}, dtype=float)
df2.set_index('timestamp',inplace=True)
df3 = pd.DataFrame({ 'timestamp': pd.Series([ts1, ts2]), 'data': [None, None]})
df3.set_index('timestamp',inplace=True)
print(df3)
merged = df3.append(df2)
print(merged)
with the following print outs:
df3:
data
timestamp
2012-06-01 00:00:00 None
2012-06-02 00:00:00 None
merged:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-02 00:00:00 NaN
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
I have tried:
merged = merged.asfreq('H')
but this returned an unsatisfying result:
data
2012-06-01 00:00:00 NaN
2012-06-01 01:00:00 NaN
2012-06-01 02:00:00 NaN
2012-06-01 03:00:00 NaN
2012-06-01 04:00:00 NaN
2012-06-01 05:00:00 NaN
2012-06-01 06:00:00 NaN
2012-06-01 07:00:00 NaN
2012-06-01 08:00:00 NaN
2012-06-01 09:00:00 NaN
2012-06-01 10:00:00 9
Where is the rest of the dataframe? Why does it only contain data till the first valid value?
Help is much appreciated. Thanks a lot in advance
First create an empty dataframe with the timestamp index that you want and then do a left merge with your original dataset:
df2 = pd.DataFrame(index = pd.date_range('2012-06-01','2012-06-02', freq='H'))
df3 = pd.merge(df2, df, left_index = True, right_index = True, how = 'left')
df3
Out[103]:
timestamp value
2012-06-01 00:00:00 NaN NaN
2012-06-01 01:00:00 NaN NaN
2012-06-01 02:00:00 NaN NaN
2012-06-01 03:00:00 NaN NaN
2012-06-01 04:00:00 NaN NaN
2012-06-01 05:00:00 NaN NaN
2012-06-01 06:00:00 NaN NaN
2012-06-01 07:00:00 NaN NaN
2012-06-01 08:00:00 NaN NaN
2012-06-01 09:00:00 NaN NaN
2012-06-01 10:00:00 2012-06-01 10:00:00 9
2012-06-01 11:00:00 NaN NaN
2012-06-01 12:00:00 NaN NaN
2012-06-01 13:00:00 2012-06-01 13:00:00 9
2012-06-01 14:00:00 NaN NaN
2012-06-01 15:00:00 NaN NaN
2012-06-01 16:00:00 NaN NaN
2012-06-01 17:00:00 2012-06-01 17:00:00 9
2012-06-01 18:00:00 NaN NaN
2012-06-01 19:00:00 NaN NaN
2012-06-01 20:00:00 2012-06-01 20:00:00 8
2012-06-01 21:00:00 NaN NaN
2012-06-01 22:00:00 NaN NaN
2012-06-01 23:00:00 NaN NaN
2012-06-02 00:00:00 NaN NaN
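As for why merged.asfreq('H') stopped at 10:00: asfreq appears to build its new index from the first and last entries of the existing index (not its min and max), and merged was not sorted, so its last entry was 2012-06-01 10:00:00. Sorting the index first should give the full hourly range; a sketch:

merged = merged.sort_index()  # put the rows in chronological order first
merged = merged.asfreq('H')   # hourly range now spans 2012-06-01 00:00 to 2012-06-02 00:00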
