Pandas fillna() method not filling all missing values - python

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value:
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is the dataframe I want to use to fill the NaN values:
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted:
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?

When you call fillna, you usually want to specify a strategy - fill with the previous/next value, the column mean, etc. We can do it like this:
nulls_index = df['rain_gauge_value'].isnull()  # remember which rows were null
df = df.fillna(method='ffill')  # forward-fill, as an example
nulls_after_fill = df[nulls_index]  # inspect the rows that were filled
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
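Note: in recent pandas versions the method= argument to fillna is deprecated in favor of dedicated methods; a sketch of the common strategies:
df['rain_gauge_value'] = df['rain_gauge_value'].ffill()    # previous value
# df['rain_gauge_value'] = df['rain_gauge_value'].bfill()  # next value
# df['rain_gauge_value'] = df['rain_gauge_value'].fillna(
#     df['rain_gauge_value'].mean())                       # column mean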

You need to tell pandas how you want to patch. It may be obvious to you that the "patch" dataframe's values should be used where the dates and times line up, but it won't be obvious to pandas. See my dummy example:
import numpy as np
import pandas as pd
from datetime import date, time

raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)],
                        time=[time(0,0,0), time(0,0,1)],
                        temp=[1., np.nan], rain=[4., np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)],
                          time=[time(0,0,0), time(0,0,1)],
                          temp=[5., 5.], rain=[10., 10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
you need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time)
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
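Applied to the frames in the question, a sketch along the same lines (assuming the date and time columns of df and rain_df hold comparable values) would be:
filled = (
    df.set_index(['date', 'time'])
      .fillna(rain_df.set_index(['date', 'time']))
      .reset_index()
)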

Related

How can I split my dataframe by year or month

I have a dataframe that contains a time series with hourly data from 2015 to 2020. I want to create a new dataframe that has one column with the values of the time series for each year, or for each month of each year, to perform a separate analysis. As I have one leap year, I want them to share an index but have a NaN value at that position (29 Feb) on the years that are not leap. I tried using merge, creating two new columns called month and day_of_month, but the index goes crazy and ends up having millions of entries instead of the ~40,000 it should have; in the end it uses more than 20 GB of RAM and crashes:
years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day

gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
    years = years.merge(group, right_on=['month', 'day_of_month'],
                        left_on=['month', 'day_of_month'])
RESULT:
month day_of_month value
0 1 1 0
1 1 1 6
2 1 1 2
3 1 1 0
4 1 1 1
... ... ... ...
210259 12 31 6
210260 12 31 2
210261 12 31 4
210262 12 31 5
210263 12 31 1
How can I construct the frame with one value column for each year, or for each month?
Here is the original frame from which I want to create the new one; the only column needed for now is value:
value month day_of_month week day_name year hour season dailyp day_of_week ... hourly_no_noise daily_trend daily_seasonal daily_residuals daily_no_noise daily_trend_h daily_seasonal_h daily_residuals_h daily_no_noise_h Total
date
2015-01-01 00:00:00 0 1 1 1 Thursday 2015 0 Invierno 165.0 3 ... NaN NaN -9.053524 NaN NaN NaN -3.456929 NaN NaN 6436996.0
2015-01-01 01:00:00 6 1 1 1 Thursday 2015 1 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -4.879983 NaN NaN NaN
2015-01-01 02:00:00 2 1 1 1 Thursday 2015 2 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -5.895367 NaN NaN NaN
2015-01-01 03:00:00 0 1 1 1 Thursday 2015 3 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.468616 NaN NaN NaN
2015-01-01 04:00:00 1 1 1 1 Thursday 2015 4 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.441830 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-12-31 19:00:00 6 12 31 1 Tuesday 2019 19 Invierno NaN 1 ... 11.529465 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.805720 1.176491 17.823509 NaN
2019-12-31 20:00:00 3 12 31 1 Tuesday 2019 20 Invierno NaN 1 ... 11.314857 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.928751 1.176491 17.823509 NaN
2019-12-31 21:00:00 3 12 31 1 Tuesday 2019 21 Invierno NaN 1 ... 10.141139 230.571429 -4.997480 -11.299166 237.299166 9.613095 1.774848 1.176491 17.823509 NaN
2019-12-31 22:00:00 3 12 31 1 Tuesday 2019 22 Invierno NaN 1 ... 8.823152 230.571429 -4.997480 -11.299166 237.299166 9.613095 0.663344 1.176491 17.823509 NaN
2019-12-31 23:00:00 6 12 31 1 Tuesday 2019 23 Invierno NaN 1 ... 6.884636 230.571429 -4.997480 -11.299166 237.299166 9.613095 -1.624980 1.176491 17.823509 NaN
I would like to end up with a dataframe like this:
2015 2016 2017 2018 2019
2016-01-01 00:00:00 0.074053 0.218161 0.606810 0.687365 0.352672
2016-01-01 01:00:00 0.465167 0.210297 0.722825 0.683341 0.885175
2016-01-01 02:00:00 0.175964 0.610560 0.722479 0.016842 0.205916
2016-01-01 03:00:00 0.945955 0.807490 0.627525 0.187677 0.535116
2016-01-01 04:00:00 0.757608 0.797835 0.639215 0.455989 0.042285
... ... ... ... ... ...
2016-12-30 20:00:00 0.046138 0.139100 0.397547 0.738687 0.335306
2016-12-30 21:00:00 0.672800 0.802090 0.617625 0.787601 0.007535
2016-12-30 22:00:00 0.698141 0.776686 0.423712 0.667808 0.298338
2016-12-30 23:00:00 0.198089 0.642073 0.586527 0.106567 0.514569
2016-12-31 00:00:00 0.367572 0.390791 0.105193 0.592167 0.007365
where 29 Feb is NaN on non-leap years:
df['2016-02']
2015 2016 2017 2018 2019
2016-02-01 00:00:00 0.656703 0.348784 0.383639 0.208786 0.183642
2016-02-01 01:00:00 0.488729 0.909498 0.873642 0.122028 0.547563
2016-02-01 02:00:00 0.210427 0.912393 0.505873 0.085149 0.358841
2016-02-01 03:00:00 0.281107 0.534750 0.622473 0.643611 0.258437
2016-02-01 04:00:00 0.187434 0.327459 0.701008 0.887041 0.385816
... ... ... ... ... ...
2016-02-29 19:00:00 NaN 0.742402 NaN NaN NaN
2016-02-29 20:00:00 NaN 0.013419 NaN NaN NaN
2016-02-29 21:00:00 NaN 0.517194 NaN NaN NaN
2016-02-29 22:00:00 NaN 0.003136 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.128406 NaN NaN NaN
IIUC, you just need the original DataFrame:
origin = 2016  # or whatever year of your choosing
newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year)
    .set_axis(newidx, axis=0)
    .pivot(columns='year', values='value')
)
Using the small sample data you provided for that "original frame" df, we get:
>>> newdf
year 2015 2019
date
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 6.0 NaN
2016-01-01 02:00:00 2.0 NaN
... ... ...
2016-12-31 21:00:00 NaN 3.0
2016-12-31 22:00:00 NaN 3.0
2016-12-31 23:00:00 NaN 6.0
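One caveat: pivot raises ValueError if there are duplicate (timestamp, year) pairs (e.g. a repeated hour around a DST change). If your data can contain those, a tolerant substitute might be pivot_table with an explicit aggregation:
newdf = (
    df[['value']]
    .assign(year=df.index.year)
    .set_axis(newidx, axis=0)
    .pivot_table(index=newidx, columns='year', values='value', aggfunc='first')
)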
On a larger (made-up) DataFrame:
np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)
# (code above)
>>> newdf
year 2015 2016 2017 2018 2019
2016-01-01 00:00:00 44.0 82.0 96.0 68.0 71.0
2016-01-01 01:00:00 47.0 99.0 54.0 44.0 71.0
2016-01-01 02:00:00 64.0 28.0 11.0 10.0 55.0
... ... ... ... ... ...
2016-12-31 21:00:00 0.0 30.0 28.0 53.0 14.0
2016-12-31 22:00:00 47.0 82.0 19.0 6.0 64.0
2016-12-31 23:00:00 22.0 75.0 13.0 37.0 35.0
and, as expected, only 2016 has values for 02/29:
>>> newdf[:'2016-02-29 02:00:00'].tail()
year 2015 2016 2017 2018 2019
2016-02-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-02-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-02-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-02-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-02-29 02:00:00 NaN 67.0 NaN NaN NaN
Addendum: by months
The code above can easily be adapted for month columns:
Either using MultiIndex columns:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year, month=df.index.month)
    .set_axis(newidx, axis=0)
    .pivot(columns=['year', 'month'], values='value')
)
>>> newdf
year 2015 ... 2019
month 1 2 3 4 5 6 7 8 9 10 ... 3 4 5 6 7 8 9 10 11 12
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0 23.0 65.0 19.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0 29.0 21.0 68.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0 45.0 25.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN 91.0 NaN 14.0
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN 34.0 NaN 64.0
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN 98.0 NaN 35.0
or a simple string column made of %Y-%m to indicate year/month:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(ym=df.index.strftime('%Y-%m'))
    .set_axis(newidx, axis=0)
    .pivot(columns='ym', values='value')
)
>>> newdf
ym 2015-01 2015-02 2015-03 2015-04 2015-05 2015-06 2015-07 2015-08 2015-09 2015-10 ... 2019-03 2019-04 2019-05 2019-06 2019-07 2019-08 2019-09 \
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN
ym 2019-10 2019-11 2019-12
2016-01-01 00:00:00 23.0 65.0 19.0
2016-01-01 01:00:00 29.0 21.0 68.0
2016-01-01 02:00:00 45.0 25.0 99.0
... ... ... ...
2016-01-31 21:00:00 91.0 NaN 14.0
2016-01-31 22:00:00 34.0 NaN 64.0
2016-01-31 23:00:00 98.0 NaN 35.0
The former gives you more flexibility to index sub-parts. For example, here is a selection of rows for "all February months":
>>> newdf.loc[:'2016-01-29 02:00:00', (slice(None), 2)].tail()
year 2015 2016 2017 2018 2019
month 2 2 2 2 2
2016-01-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-01-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-01-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-01-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-01-29 02:00:00 NaN 67.0 NaN NaN NaN
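An equivalent selection that may read more clearly uses xs to pick one month level across all years:
feb = newdf.xs(2, axis=1, level='month')  # all February columns, one per year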
So let's assume we have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'),
                  columns=['Date and Time'])
df['str'] = df['Date and Time'].dt.strftime('%Y-%m-%d')
df[['Year', 'Month', 'Day']] = df['str'].apply(lambda x: pd.Series(str(x).split("-")))
df['Values'] = np.random.rand(len(df))
print(df)
Output:
Date and Time str Year Month Day Values
0 2015-01-01 00:00:00 2015-01-01 2015 01 01 0.153948
1 2015-01-01 01:00:00 2015-01-01 2015 01 01 0.663132
2 2015-01-01 02:00:00 2015-01-01 2015 01 01 0.141534
3 2015-01-01 03:00:00 2015-01-01 2015 01 01 0.263551
4 2015-01-01 04:00:00 2015-01-01 2015 01 01 0.094391
... ... ... ... ... .. ...
43820 2019-12-31 20:00:00 2019-12-31 2019 12 31 0.055802
43821 2019-12-31 21:00:00 2019-12-31 2019 12 31 0.952963
43822 2019-12-31 22:00:00 2019-12-31 2019 12 31 0.106768
43823 2019-12-31 23:00:00 2019-12-31 2019 12 31 0.834583
43824 2020-01-01 00:00:00 2020-01-01 2020 01 01 0.325849
[43825 rows x 6 columns]
Now we separate the dataframe by year and store each year in a dictionary:
d = {}
for i in range(2015, 2020):
    d[i] = pd.DataFrame(df[df['Year'] == str(i)])
    d[i].sort_values(by='Date and Time', inplace=True, ignore_index=True)

for i in range(2015, 2020):
    print('Feb', i, ':', (d[i][d[i]['Month'] == '02']).shape)
    print((d[i][d[i]['Month'] == '02']).tail(3))
    print('-----------------------------------------------------------------')
Output:
Feb 2015 : (672, 6)
Date and Time str Year Month Day Values
1413 2015-02-28 21:00:00 2015-02-28 2015 02 28 0.517525
1414 2015-02-28 22:00:00 2015-02-28 2015 02 28 0.404741
1415 2015-02-28 23:00:00 2015-02-28 2015 02 28 0.299090
-----------------------------------------------------------------
Feb 2016 : (696, 6)
Date and Time str Year Month Day Values
1437 2016-02-29 21:00:00 2016-02-29 2016 02 29 0.854047
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.035787
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.955364
-----------------------------------------------------------------
Feb 2017 : (672, 6)
Date and Time str Year Month Day Values
1413 2017-02-28 21:00:00 2017-02-28 2017 02 28 0.936354
1414 2017-02-28 22:00:00 2017-02-28 2017 02 28 0.954680
1415 2017-02-28 23:00:00 2017-02-28 2017 02 28 0.625131
-----------------------------------------------------------------
Feb 2018 : (672, 6)
Date and Time str Year Month Day Values
1413 2018-02-28 21:00:00 2018-02-28 2018 02 28 0.965274
1414 2018-02-28 22:00:00 2018-02-28 2018 02 28 0.848050
1415 2018-02-28 23:00:00 2018-02-28 2018 02 28 0.238984
-----------------------------------------------------------------
Feb 2019 : (672, 6)
Date and Time str Year Month Day Values
1413 2019-02-28 21:00:00 2019-02-28 2019 02 28 0.476142
1414 2019-02-28 22:00:00 2019-02-28 2019 02 28 0.498278
1415 2019-02-28 23:00:00 2019-02-28 2019 02 28 0.127525
-----------------------------------------------------------------
To fix the leap year problem:
There is definitely a better way, but the only thing I can think of is to create the missing rows as NaN, add them, and then join the dataframes.
indexes = list(range(1416, 1440))
lines = pd.DataFrame(np.nan, columns=df.columns.values, index=indexes)
print(lines.head())
Output:
Date and Time str Year Month Day Values
1416 NaN NaN NaN NaN NaN NaN
1417 NaN NaN NaN NaN NaN NaN
1418 NaN NaN NaN NaN NaN NaN
1419 NaN NaN NaN NaN NaN NaN
1420 NaN NaN NaN NaN NaN NaN
Then I add the NaN rows to the data frame with the following code:
b = {}
for i in range(2015, 2020):
    if list(d[i][d[i]['Month'] == '02'].tail(1)['Day'])[0] == '28':
        bi = pd.concat([d[i].iloc[0:1416], lines]).reset_index(drop=True)
        b[i] = pd.concat([bi, d[i].iloc[1416:8783]]).reset_index(drop=True)
    else:
        b[i] = d[i].copy()

for i in range(2015, 2020):
    print(i, ':', b[i].shape)
    print(b[i].iloc[1438:1441])
    print('-----------------------------------------------------------------')
Output:
2015 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2015-03-01 2015-03-01 2015 03 01 0.676486
-----------------------------------------------------------------
2016 : (8784, 6)
Date and Time str Year Month Day Values
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.035787
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.955364
1440 2016-03-01 00:00:00 2016-03-01 2016 03 01 0.014158
-----------------------------------------------------------------
2017 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2017-03-01 2017-03-01 2017 03 01 0.035952
-----------------------------------------------------------------
2018 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2018-03-01 2018-03-01 2018 03 01 0.44876
-----------------------------------------------------------------
2019 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2019-03-01 2019-03-01 2019 03 01 0.096433
-----------------------------------------------------------------
And finally, to build the dataframe you asked for:
final_df = pd.DataFrame(index=b[2016]['Date and Time'])
for i in range(2015, 2020):
    final_df[i] = np.array(b[i]['Values'])
Output:
2015 2016 2017 2018 2019
Date and Time
2016-01-01 00:00:00 0.153948 0.145602 0.957265 0.427620 0.868948
2016-01-01 01:00:00 0.663132 0.318746 0.013658 0.380105 0.442332
2016-01-01 02:00:00 0.141534 0.483471 0.048050 0.139065 0.702211
2016-01-01 03:00:00 0.263551 0.737948 0.528827 0.472889 0.165095
2016-01-01 04:00:00 0.094391 0.939737 0.120343 0.134011 0.297611
... ... ... ... ... ...
2016-02-28 22:00:00 0.404741 0.864423 0.954680 0.848050 0.498278
2016-02-28 23:00:00 0.299090 0.348466 0.625131 0.238984 0.127525
2016-02-29 00:00:00 NaN 0.375469 NaN NaN NaN
2016-02-29 01:00:00 NaN 0.186092 NaN NaN NaN
... ... ... ... ... ...
2016-02-29 22:00:00 NaN 0.035787 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.955364 NaN NaN NaN
2016-03-01 00:00:00 0.676486 0.014158 0.035952 0.448760 0.096433
2016-03-01 01:00:00 0.792168 0.520436 0.138874 0.229396 0.913848
... ... ... ... ... ...
2016-12-31 19:00:00 0.517459 0.956219 0.116335 0.736170 0.739740
2016-12-31 20:00:00 0.814362 0.324332 0.324911 0.485508 0.055802
2016-12-31 21:00:00 0.870459 0.809150 0.335461 0.124459 0.952963
2016-12-31 22:00:00 0.549891 0.043623 0.997053 0.144286 0.106768
2016-12-31 23:00:00 0.047090 0.730074 0.698159 0.235253 0.834583
[8784 rows x 5 columns]
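For what it's worth, a more compact hypothetical alternative (assuming the same df as above): reindexing each year against a complete leap-year hourly index puts NaN on 29 Feb automatically, with no manual row surgery:
full_idx = pd.date_range('2016-01-01', '2016-12-31 23:00:00', freq='1H')
s = df.set_index('Date and Time')['Values']

final_df = pd.DataFrame(index=full_idx)
for year in range(2015, 2020):
    part = s[s.index.year == year]
    # map this year's timestamps onto the 2016 calendar, then align
    part.index = pd.to_datetime(part.index.strftime('2016-%m-%d %H:%M:%S'))
    final_df[year] = part.reindex(full_idx)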

Finding maximum null values in stretch and generating flag

I have a dataframe with a datetime index and two columns. For each particular date, I have to find the maximum stretch of null values in column 'X' and replace it with zero in both columns for that date. In addition, I have to create a third column named 'Flag' that carries a value of 1 for every zero imputation in the other two columns, and 0 otherwise. In the example below, on January 1st the maximum stretch of null values is 3 rows, so I have to replace those with zero. Similarly, I have to replicate the process for January 2nd.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
"Below is the result that I am expecting"
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is an extension of a previous question. Here is the link: Python - Find maximum null values in stretch and replacing with 0
First, create consecutive groups for each column, filled with unique values:
df1 = df.isna()
# label each consecutive NaN run (per calendar date), keeping labels only on NaN rows
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
# scale Y's labels so they cannot collide with X's when counted together below
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
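As an aside, the ne/shift/cumsum idiom above is the standard way to label consecutive runs; on a toy Series it looks like this:
s = pd.Series([np.nan, 1, np.nan, np.nan, 2])
isna = s.isna()
print(isna.ne(isna.shift()).cumsum().tolist())  # [1, 2, 3, 3, 4] - the consecutive NaNs share label 3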
Then get the group with the maximum count - here group 4:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Get a mask of the matching rows, set them to 0, and create the Flag column by casting the mask to integer (True/False to 1/0):
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
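For comparison, a hypothetical per-date variant (a sketch, not the answer's exact semantics: it zeroes the longest X-stretch within every date, rather than only the single longest stretch overall):
def longest_nan_run_mask(s):
    # boolean mask marking the longest consecutive NaN run in s
    isna = s.isna()
    run_id = isna.ne(isna.shift()).cumsum().where(isna)  # labels only on NaN rows
    counts = run_id.value_counts()
    if counts.empty:
        return pd.Series(False, index=s.index)
    return run_id.eq(counts.idxmax())

mask = df['X'].groupby(df.index.date, group_keys=False).apply(longest_nan_run_mask)
df.loc[mask, ['X', 'Y']] = 0
df['Flag'] = mask.astype(int)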
EDIT:
Added a new condition to match only dates from a list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
# the following also works if there are no NaNs in the filtered rows
# (prevents IndexError: index 0 is out of bounds):
# a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0

How can I replace specific values in a time-series dataframe in pandas?

I have the dataframe below (date/time is a MultiIndex) and I want to replace the column's values in the 00:00:00~07:00:00 range with this numpy array:
[[ 21.63920663 21.62012822 20.9900515 21.23217008 21.19482458
21.10839656 20.89631935 20.79977166 20.99176729 20.91567565
20.87258765 20.76210464 20.50357827 20.55897631 20.38005033
20.38227309 20.54460993 20.37707293 20.08279925 20.09955877
20.02559575 20.12390737 20.2917257 20.20056711 20.1589065
20.41302289 20.48000767 20.55604102 20.70255192]]
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
03:15:00 NaN
03:30:00 NaN
03:45:00 NaN
04:00:00 NaN
04:15:00 NaN
04:30:00 NaN
04:45:00 NaN
05:00:00 NaN
05:15:00 NaN
05:30:00 NaN
05:45:00 NaN
06:00:00 NaN
06:15:00 NaN
06:30:00 NaN
06:45:00 NaN
07:00:00 NaN
07:15:00 NaN
07:30:00 NaN
07:45:00 NaN
08:00:00 NaN
08:15:00 NaN
08:30:00 NaN
08:45:00 NaN
09:00:00 NaN
09:15:00 NaN
09:30:00 NaN
09:45:00 NaN
10:00:00 NaN
10:15:00 NaN
10:30:00 NaN
10:45:00 NaN
11:00:00 NaN
Name: temp, dtype: float64
<class 'datetime.time'>
How can I do this?
You can use slicers:
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
Or if second levels are times:
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = 1
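One caveat worth noting: label-based slicing on a MultiIndex requires the index to be lexsorted, otherwise pandas raises an UnsortedIndexError; sorting first is cheap insurance:
df1 = df1.sort_index()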
Sample:
print (df1)
aaa
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 2.00
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
To assign an array, you need numpy.tile to repeat it by the number of unique values in the first index level:
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, 10),len(df1.index.levels[0]))
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
A more general solution generates the array from the length of the slice:
idx = pd.IndexSlice
len0 = df1.loc[idx[df1.index.levels[0][0], '00:00:00':'02:00:00'],:].shape[0]
len1 = len(df1.index.levels[0])
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, len0 + 1), len1)
Tested with times:
import datetime
idx = pd.IndexSlice
arr = np.tile(np.arange(1, 10), len(df1.index.levels[0]))
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = arr
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
One last problem was found - my solution works with a one-column DataFrame, but when working with a Series you need to remove one ':' (the trailing ', :' in the .loc call):
arr = np.array([[ 21.63920663, 21.62012822, 20.9900515, 21.23217008, 21.19482458, 21.10839656,
20.89631935, 20.79977166, 20.99176729, 20.91567565, 20.87258765, 20.76210464,
20.50357827, 20.55897631, 20.38005033, 20.38227309, 20.54460993, 20.37707293,
20.08279925, 20.09955877, 20.02559575, 20.12390737, 20.2917257, 20.20056711,
20.1589065, 20.41302289, 20.48000767, 20.55604102, 20.70255192]])
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0): datetime.time(7, 0, 0)]] = arr[0]
# note: no trailing ', :' here, because df1 is a Series
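Putting it together, a minimal end-to-end sketch of the final fix (the index construction below is an assumption modeled on the question's output):
import datetime
import numpy as np
import pandas as pd

# rebuild a 15-minute MultiIndex Series like the one in the question
times = pd.date_range('2018-01-26', periods=45, freq='15min')
df1 = pd.Series(np.nan, name='temp',
                index=pd.MultiIndex.from_arrays([times.date, times.time],
                                                names=['date', 'time']))

idx = pd.IndexSlice
# 00:00..07:00 inclusive at 15-minute steps is 29 slots, matching arr's 29 values
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(7, 0, 0)]] = arr[0]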

Pandas: Find first occurrence - on daily basis in a timeseries

I'm struggling with this, so any input is appreciated. I want to iterate over the values in a dataframe column and return the first instance a value is seen every day. groupby looked to be a good option for this, but when using df.groupby(grouper).first() with the grouper set to daily, the following output is seen:
In [95]:
df.groupby(grouper).first()
Out[95]:
test_1
2014-03-04 1.0
2014-03-05 1.0
This only gives the day the value was seen in test_1; it does not reset first() on a daily basis, which is what I need (see desired output below).
I want to preserve the time this value was seen, in the following format:
This is the input dataframe:
test_1
2014-03-04 09:00:00 NaN
2014-03-04 10:00:00 NaN
2014-03-04 11:00:00 NaN
2014-03-04 12:00:00 NaN
2014-03-04 13:00:00 NaN
2014-03-04 14:00:00 1.0
2014-03-04 15:00:00 NaN
2014-03-04 16:00:00 1.0
2014-03-05 09:00:00 1.0
This is the desired output:
test_1 test_output
2014-03-04 09:00:00 NaN NaN
2014-03-04 10:00:00 NaN NaN
2014-03-04 11:00:00 NaN NaN
2014-03-04 12:00:00 NaN NaN
2014-03-04 13:00:00 NaN NaN
2014-03-04 14:00:00 1.0 1.0
2014-03-04 15:00:00 NaN NaN
2014-03-04 16:00:00 1.0 NaN
2014-03-05 09:00:00 1.0 NaN
I just want to mark the time an event first occurs each day, in a new column named test_output.
Admins: please note this question is different from the one marked as a duplicate, as this one requires a rolling one-day first occurrence.
Try this, using this data:
rng = pd.DataFrame({'test_1': [None, None, None, None, 1, 1, 1, None, None, None, 1, None, None, None]},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng
test_1
2014-04-02 09:00:00 NaN
2014-04-02 10:00:00 NaN
2014-04-02 11:00:00 NaN
2014-04-02 12:00:00 NaN
2014-04-02 13:00:00 1.0
2014-04-02 14:00:00 1.0
2014-04-02 15:00:00 1.0
2014-04-02 16:00:00 NaN
2014-04-03 09:00:00 NaN
2014-04-03 10:00:00 NaN
2014-04-03 11:00:00 1.0
2014-04-03 12:00:00 NaN
2014-04-03 13:00:00 NaN
2014-04-03 14:00:00 NaN
The output is this:
rng['test_output'] = rng['test_1'].loc[rng.groupby(pd.Grouper(freq='D'))['test_1'].idxmin()]
(Note: the original answer used pd.TimeGrouper, which has since been removed from pandas; pd.Grouper is the equivalent. idxmin returns the index of each day's minimum, skipping NaN, which here is simply the first non-NaN time of the day.)
test_1 test_output
2014-04-02 09:00:00 NaN NaN
2014-04-02 10:00:00 NaN NaN
2014-04-02 11:00:00 NaN NaN
2014-04-02 12:00:00 NaN NaN
2014-04-02 13:00:00 1.0 1.0
2014-04-02 14:00:00 1.0 NaN
2014-04-02 15:00:00 1.0 NaN
2014-04-02 16:00:00 NaN NaN
2014-04-03 09:00:00 NaN NaN
2014-04-03 10:00:00 NaN NaN
2014-04-03 11:00:00 1.0 1.0
2014-04-03 12:00:00 NaN NaN
2014-04-03 13:00:00 NaN NaN
2014-04-03 14:00:00 NaN NaN
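For what it's worth, a hypothetical alternative that sidesteps the idxmin trick is to take each day's first_valid_index directly:
first_idx = (rng.groupby(rng.index.date)['test_1']
                .apply(lambda s: s.first_valid_index())
                .dropna())
rng['test_output'] = rng['test_1'].where(rng.index.isin(first_idx))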

How do I merge two columns in a dataframe based on a datetime index?

I got a big dataset in pandas with a datetime index.
The dataframe starts at 2010-04-09 and ends at present time. When I created this dataset, a few columns only had data starting from 2011-06-01; the values before that were NaN. Now I have managed to get the data for a few of those columns between 2010-04-09 and 2011-06-01. That data is in a different dataframe with the same datetime index.
Now I want to fill the old columns in the original dataset with the values of the new one but I seem not to be able to do it.
My original dataframe looks like this:
>>> data.head()
bc_conc stability wind_speed Qnet Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN NaN NaN
2010-04-09 11:00:00 663.500000 NaN NaN NaN NaN
2010-04-09 12:00:00 524.661667 NaN NaN NaN NaN
2010-04-09 13:00:00 228.706667 NaN NaN NaN NaN
2010-04-09 14:00:00 279.721667 NaN NaN NaN NaN
wind_direction Rain seizoen clouds
2010-04-09 10:00:00 NaN NaN lente 1
2010-04-09 11:00:00 NaN NaN lente 6
2010-04-09 12:00:00 NaN NaN lente 8
2010-04-09 13:00:00 NaN NaN lente 4
2010-04-09 14:00:00 NaN NaN lente 7
The dataframe I want to add looks like this:
>>> df.loc['20100409']
Qnet Rain Windspeed Winddirection
2010-04-09 10:00:00 326.3 0.0 2.4 288
2010-04-09 11:00:00 331.8 0.0 3.6 308
2010-04-09 12:00:00 212.7 0.0 3.8 349
2010-04-09 13:00:00 246.6 0.0 4.1 354
2010-04-09 14:00:00 422.7 0.0 4.5 343
2010-04-09 15:00:00 210.9 0.0 4.6 356
2010-04-09 16:00:00 120.6 0.0 4.5 3
2010-04-09 17:00:00 83.3 0.0 4.5 4
2010-04-09 18:00:00 -23.8 0.0 3.3 7
2010-04-09 19:00:00 -54.0 0.0 3.0 15
2010-04-09 20:00:00 -44.3 0.0 2.7 3
2010-04-09 21:00:00 -41.9 0.0 2.6 3
2010-04-09 22:00:00 -42.1 0.0 2.2 1
2010-04-09 23:00:00 -47.4 0.0 2.2 2
So I want to add the values of df['Qnet'] to data['Qnet'], etc
I tried a lot of things with merge and join, but nothing seems to really work. There is no overlapping data in the frames: the 'df' dataframe stops at 2011-05-31, and the 'data' dataframe has NaN values until that date in the columns I want to change. The original columns in data do have values from 2011-06-01 onward, and I want to keep those!
I do know how to merge the two datasets, but then I get a Qnet_x and a Qnet_y column.
So the question is: how do I combine/merge two columns across two datasets (or within the same one)?
I hope the question is clear.
Thanks in advance for the help.
UPDATE2:
This version should also work with duplicates in the index:
data = data.join(df['Qnet'], rsuffix='_new')
data['Qnet'] = data['Qnet'].combine_first(data['Qnet_new'])
data.drop(['Qnet_new'], axis=1, inplace=True)
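Since combine_first aligns on the index, a shorter form may work when the index has no duplicates:
data['Qnet'] = data['Qnet'].combine_first(df['Qnet'])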
UPDATE:
data.loc[pd.isnull(data.Qnet), 'Qnet'] = df['Qnet']
(The original answer used data.ix, which has since been removed from pandas; .loc behaves the same here.)
In [114]: data.loc[data.index[-1], 'Qnet'] = 9999
In [115]: data
Out[115]:
bc_conc stability wind_speed Qnet Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN NaN NaN
2010-04-09 11:00:00 663.500000 NaN NaN NaN NaN
2010-04-09 12:00:00 524.661667 NaN NaN NaN NaN
2010-04-09 13:00:00 228.706667 NaN NaN NaN NaN
2010-04-09 14:00:00 279.721667 NaN NaN 9999.0 NaN
wind_direction Rain seizoen clouds
2010-04-09 10:00:00 NaN NaN lente 1
2010-04-09 11:00:00 NaN NaN lente 6
2010-04-09 12:00:00 NaN NaN lente 8
2010-04-09 13:00:00 NaN NaN lente 4
2010-04-09 14:00:00 NaN NaN lente 7
In [116]: data.loc[pd.isnull(data.Qnet), 'Qnet'] = df['Qnet']
In [117]: data
Out[117]:
bc_conc stability wind_speed Qnet Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN 326.3 NaN
2010-04-09 11:00:00 663.500000 NaN NaN 331.8 NaN
2010-04-09 12:00:00 524.661667 NaN NaN 212.7 NaN
2010-04-09 13:00:00 228.706667 NaN NaN 246.6 NaN
2010-04-09 14:00:00 279.721667 NaN NaN 9999.0 NaN
wind_direction Rain seizoen clouds
2010-04-09 10:00:00 NaN NaN lente 1
2010-04-09 11:00:00 NaN NaN lente 6
2010-04-09 12:00:00 NaN NaN lente 8
2010-04-09 13:00:00 NaN NaN lente 4
2010-04-09 14:00:00 NaN NaN lente 7
OLD answer:
you can do it this way:
In [97]: data.drop(['Qnet'], axis=1).join(df['Qnet'])
Out[97]:
bc_conc stability wind_speed Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN NaN
2010-04-09 11:00:00 663.500000 NaN NaN NaN
2010-04-09 12:00:00 524.661667 NaN NaN NaN
2010-04-09 13:00:00 228.706667 NaN NaN NaN
2010-04-09 14:00:00 279.721667 NaN NaN NaN
wind_direction Rain seizoen clouds Qnet
2010-04-09 10:00:00 NaN NaN lente 1 326.3
2010-04-09 11:00:00 NaN NaN lente 6 331.8
2010-04-09 12:00:00 NaN NaN lente 8 212.7
2010-04-09 13:00:00 NaN NaN lente 4 246.6
2010-04-09 14:00:00 NaN NaN lente 7 422.7
