This is part of my data:
Day_Data Hour_Data WIN_D WIN_S TEM RHU PRE_1h
1 0 58 1 22 78 0
1 3 32 1.9 24.6 65 0
1 6 41 3.2 25.6 59 0
1 9 20 0.8 24.8 64 0
1 12 44 1.7 22.7 76 0
1 15 118 0.7 20.2 92 0
1 18 70 2.6 20.2 94 0
1 21 76 3.4 19.9 66 0
2 0 76 3.8 19.4 58 0
2 3 75 5.8 19.4 47 0
2 6 81 5.1 19.5 42 0
2 9 61 3.6 17.4 48 0
2 12 50 0.9 15.8 46 0
2 15 348 1.1 14.5 52 0
2 18 357 1.9 13.5 60 0
2 21 333 1.2 12.4 74 0
and I want to generate the missing hourly rows in between, where each fill value is the mean of the previous observed value and the next observed value.
How can I do that?
Thank you!
And, #jdy, thanks for the reminder; this is what I have done so far:
import pandas as pd

data['time'] = ('2017-10-' + data['Day_Data'].map(int).map(str)
                + ' ' + data['Hour_Data'].map(int).map(str) + ':00:00')
data['Date'] = pd.to_datetime(data['time'])
data = data.drop(['Day_Data', 'Hour_Data', 'time'], axis=1)
data = data.set_index('Date').resample('1h').mean()
Output:
2017-10-01 00:00:00 58.0 1.0 22.0 78.0 0.0
2017-10-01 01:00:00 NaN NaN NaN NaN NaN
2017-10-01 02:00:00 NaN NaN NaN NaN NaN
2017-10-01 03:00:00 32.0 1.9 24.6 65.0 0.0
2017-10-01 04:00:00 NaN NaN NaN NaN NaN
2017-10-01 05:00:00 NaN NaN NaN NaN NaN
2017-10-01 06:00:00 41.0 3.2 25.6 59.0 0.0
2017-10-01 07:00:00 NaN NaN NaN NaN NaN
2017-10-01 08:00:00 NaN NaN NaN NaN NaN
2017-10-01 09:00:00 20.0 0.8 24.8 64.0 0.0
2017-10-01 10:00:00 NaN NaN NaN NaN NaN
2017-10-01 11:00:00 NaN NaN NaN NaN NaN
2017-10-01 12:00:00 44.0 1.7 22.7 76.0 0.0
2017-10-01 13:00:00 NaN NaN NaN NaN NaN
2017-10-01 14:00:00 NaN NaN NaN NaN NaN
2017-10-01 15:00:00 118.0 0.7 20.2 92.0 0.0
2017-10-01 16:00:00 NaN NaN NaN NaN NaN
2017-10-01 17:00:00 NaN NaN NaN NaN NaN
2017-10-01 18:00:00 70.0 2.6 20.2 94.0 0.0
2017-10-01 19:00:00 NaN NaN NaN NaN NaN
2017-10-01 20:00:00 NaN NaN NaN NaN NaN
2017-10-01 21:00:00 76.0 3.4 19.9 66.0 0.0
2017-10-01 22:00:00 NaN NaN NaN NaN NaN
2017-10-01 23:00:00 NaN NaN NaN NaN NaN
2017-10-02 00:00:00 76.0 3.8 19.4 58.0 0.0
2017-10-02 01:00:00 NaN NaN NaN NaN NaN
2017-10-02 02:00:00 NaN NaN NaN NaN NaN
2017-10-02 03:00:00 75.0 5.8 19.4 47.0 0.0
2017-10-02 04:00:00 NaN NaN NaN NaN NaN
2017-10-02 05:00:00 NaN NaN NaN NaN NaN
2017-10-02 06:00:00 81.0 5.1 19.5 42.0 0.0
But I have no idea how to fill the NaN values with the mean of the previous value and the next value.
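For the record, a minimal sketch of that exact fill, averaging a forward-filled and a backward-filled copy (assuming data is the resampled frame above):
# ffill carries the previous observation forward, bfill carries the next
# observation backward; their average is the mean of the two neighbours
data = (data.ffill() + data.bfill()) / 2
Note this is not the same as data.interpolate(), which weights the two neighbours by their distance in time, so the two in-between hours would get 1/3-2/3 mixes rather than the plain mean.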
I have a dataframe that contains an hourly time series from 2015 to 2020. I want to create a new dataframe that has one column with the values of the time series for each year (or for each month of each year), to perform a separate analysis on each. Since the range contains one leap year, I want the columns to share an index but hold NaN at the 29 Feb positions in the non-leap years. I tried using merge after creating two new columns, month and day_of_month, but the index goes crazy and ends up with millions of entries instead of the ~40,000 it should have; in the end it takes more than 20 GB of RAM and breaks:
years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day
gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
    years = years.merge(group, right_on=['month', 'day_of_month'], left_on=['month', 'day_of_month'])
RESULT:
month day_of_month value
0 1 1 0
1 1 1 6
2 1 1 2
3 1 1 0
4 1 1 1
... ... ... ...
210259 12 31 6
210260 12 31 2
210261 12 31 4
210262 12 31 5
210263 12 31 1
How can I get the frame constructed with one value column for each year, or for each month?
Here is the original frame from which I want to create the new one; the only column needed for now is value:
value month day_of_month week day_name year hour season dailyp day_of_week ... hourly_no_noise daily_trend daily_seasonal daily_residuals daily_no_noise daily_trend_h daily_seasonal_h daily_residuals_h daily_no_noise_h Total
date
2015-01-01 00:00:00 0 1 1 1 Thursday 2015 0 Invierno 165.0 3 ... NaN NaN -9.053524 NaN NaN NaN -3.456929 NaN NaN 6436996.0
2015-01-01 01:00:00 6 1 1 1 Thursday 2015 1 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -4.879983 NaN NaN NaN
2015-01-01 02:00:00 2 1 1 1 Thursday 2015 2 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -5.895367 NaN NaN NaN
2015-01-01 03:00:00 0 1 1 1 Thursday 2015 3 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.468616 NaN NaN NaN
2015-01-01 04:00:00 1 1 1 1 Thursday 2015 4 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.441830 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-12-31 19:00:00 6 12 31 1 Tuesday 2019 19 Invierno NaN 1 ... 11.529465 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.805720 1.176491 17.823509 NaN
2019-12-31 20:00:00 3 12 31 1 Tuesday 2019 20 Invierno NaN 1 ... 11.314857 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.928751 1.176491 17.823509 NaN
2019-12-31 21:00:00 3 12 31 1 Tuesday 2019 21 Invierno NaN 1 ... 10.141139 230.571429 -4.997480 -11.299166 237.299166 9.613095 1.774848 1.176491 17.823509 NaN
2019-12-31 22:00:00 3 12 31 1 Tuesday 2019 22 Invierno NaN 1 ... 8.823152 230.571429 -4.997480 -11.299166 237.299166 9.613095 0.663344 1.176491 17.823509 NaN
2019-12-31 23:00:00 6 12 31 1 Tuesday 2019 23 Invierno NaN 1 ... 6.884636 230.571429 -4.997480 -11.299166 237.299166 9.613095 -1.624980 1.176491 17.823509 NaN
I would like to end up with a dataframe like this:
2015 2016 2017 2018 2019
2016-01-01 00:00:00 0.074053 0.218161 0.606810 0.687365 0.352672
2016-01-01 01:00:00 0.465167 0.210297 0.722825 0.683341 0.885175
2016-01-01 02:00:00 0.175964 0.610560 0.722479 0.016842 0.205916
2016-01-01 03:00:00 0.945955 0.807490 0.627525 0.187677 0.535116
2016-01-01 04:00:00 0.757608 0.797835 0.639215 0.455989 0.042285
... ... ... ... ... ...
2016-12-30 20:00:00 0.046138 0.139100 0.397547 0.738687 0.335306
2016-12-30 21:00:00 0.672800 0.802090 0.617625 0.787601 0.007535
2016-12-30 22:00:00 0.698141 0.776686 0.423712 0.667808 0.298338
2016-12-30 23:00:00 0.198089 0.642073 0.586527 0.106567 0.514569
2016-12-31 00:00:00 0.367572 0.390791 0.105193 0.592167 0.007365
where 29 Feb is NaN on non-leap years:
df['2016-02']
2015 2016 2017 2018 2019
2016-02-01 00:00:00 0.656703 0.348784 0.383639 0.208786 0.183642
2016-02-01 01:00:00 0.488729 0.909498 0.873642 0.122028 0.547563
2016-02-01 02:00:00 0.210427 0.912393 0.505873 0.085149 0.358841
2016-02-01 03:00:00 0.281107 0.534750 0.622473 0.643611 0.258437
2016-02-01 04:00:00 0.187434 0.327459 0.701008 0.887041 0.385816
... ... ... ... ... ...
2016-02-29 19:00:00 NaN 0.742402 NaN NaN NaN
2016-02-29 20:00:00 NaN 0.013419 NaN NaN NaN
2016-02-29 21:00:00 NaN 0.517194 NaN NaN NaN
2016-02-29 22:00:00 NaN 0.003136 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.128406 NaN NaN NaN
IIUC, you just need the original DataFrame:
origin = 2016  # or whatever year of your choosing
newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year)              # remember each row's original year
    .set_axis(newidx, axis=0)                # re-date every row into the origin year
    .pivot(columns='year', values='value')   # one column per year
)
Using the small sample data you provided for that "original frame" df, we get:
>>> newdf
year 2015 2019
date
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 6.0 NaN
2016-01-01 02:00:00 2.0 NaN
... ... ...
2016-12-31 21:00:00 NaN 3.0
2016-12-31 22:00:00 NaN 3.0
2016-12-31 23:00:00 NaN 6.0
On a larger (made-up) DataFrame:
import numpy as np
import pandas as pd

np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)
# (code above)
>>> newdf
year 2015 2016 2017 2018 2019
2016-01-01 00:00:00 44.0 82.0 96.0 68.0 71.0
2016-01-01 01:00:00 47.0 99.0 54.0 44.0 71.0
2016-01-01 02:00:00 64.0 28.0 11.0 10.0 55.0
... ... ... ... ... ...
2016-12-31 21:00:00 0.0 30.0 28.0 53.0 14.0
2016-12-31 22:00:00 47.0 82.0 19.0 6.0 64.0
2016-12-31 23:00:00 22.0 75.0 13.0 37.0 35.0
and, as expected, only 2016 has values for 02/29:
>>> newdf[:'2016-02-29 02:00:00'].tail()
year 2015 2016 2017 2018 2019
2016-02-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-02-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-02-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-02-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-02-29 02:00:00 NaN 67.0 NaN NaN NaN
Addendum: by months
The code above can easily be adapted for month columns:
Either using MultiIndex columns:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year, month=df.index.month)
    .set_axis(newidx, axis=0)
    .pivot(columns=['year', 'month'], values='value')
)
>>> newdf
year 2015 ... 2019
month 1 2 3 4 5 6 7 8 9 10 ... 3 4 5 6 7 8 9 10 11 12
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0 23.0 65.0 19.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0 29.0 21.0 68.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0 45.0 25.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN 91.0 NaN 14.0
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN 34.0 NaN 64.0
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN 98.0 NaN 35.0
or a simple string column made of %Y-%m to indicate year/month:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(ym=df.index.strftime('%Y-%m'))
    .set_axis(newidx, axis=0)
    .pivot(columns='ym', values='value')
)
>>> newdf
ym 2015-01 2015-02 2015-03 2015-04 2015-05 2015-06 2015-07 2015-08 2015-09 2015-10 ... 2019-03 2019-04 2019-05 2019-06 2019-07 2019-08 2019-09 \
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN
ym 2019-10 2019-11 2019-12
2016-01-01 00:00:00 23.0 65.0 19.0
2016-01-01 01:00:00 29.0 21.0 68.0
2016-01-01 02:00:00 45.0 25.0 99.0
... ... ... ...
2016-01-31 21:00:00 91.0 NaN 14.0
2016-01-31 22:00:00 34.0 NaN 64.0
2016-01-31 23:00:00 98.0 NaN 35.0
The former gives you more flexibility to index sub-parts. For example, here is a selection of rows for "all February months":
>>> newdf.loc[:'2016-01-29 02:00:00', (slice(None), 2)].tail()
year 2015 2016 2017 2018 2019
month 2 2 2 2 2
2016-01-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-01-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-01-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-01-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-01-29 02:00:00 NaN 67.0 NaN NaN NaN
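With the string-column variant, the analogous "all February months" selection can be done by filtering the column labels; a small sketch:
>>> newdf.filter(regex='-02$')  # every ym column ending in "-02"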
So let's assume we have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'),
columns = ['Date and Time'])
df['str'] = df['Date and Time'].dt.strftime('%Y-%m-%d')
df[['Year', 'Month','Day']] = df['str'].apply(lambda x: pd.Series(str(x).split("-")))
df['Values'] = np.random.rand(len(df))
print(df)
Output:
Date and Time str Year Month Day Values
0 2015-01-01 00:00:00 2015-01-01 2015 01 01 0.153948
1 2015-01-01 01:00:00 2015-01-01 2015 01 01 0.663132
2 2015-01-01 02:00:00 2015-01-01 2015 01 01 0.141534
3 2015-01-01 03:00:00 2015-01-01 2015 01 01 0.263551
4 2015-01-01 04:00:00 2015-01-01 2015 01 01 0.094391
... ... ... ... ... .. ...
43820 2019-12-31 20:00:00 2019-12-31 2019 12 31 0.055802
43821 2019-12-31 21:00:00 2019-12-31 2019 12 31 0.952963
43822 2019-12-31 22:00:00 2019-12-31 2019 12 31 0.106768
43823 2019-12-31 23:00:00 2019-12-31 2019 12 31 0.834583
43824 2020-01-01 00:00:00 2020-01-01 2020 01 01 0.325849
[43825 rows x 6 columns]
Now we split the dataframe by year and store each year's part in a dict:
d = {}
for i in range(2015, 2020):
    d[i] = pd.DataFrame(df[df['Year'] == str(i)])
    d[i].sort_values(by='Date and Time', inplace=True, ignore_index=True)

for i in range(2015, 2020):
    print('Feb', i, ':', (d[i][d[i]['Month'] == '02']).shape)
    print((d[i][d[i]['Month'] == '02']).tail(3))
    print('-----------------------------------------------------------------')
Output:
Feb 2015 : (672, 6)
Date and Time str Year Month Day Values
1413 2015-02-28 21:00:00 2015-02-28 2015 02 28 0.517525
1414 2015-02-28 22:00:00 2015-02-28 2015 02 28 0.404741
1415 2015-02-28 23:00:00 2015-02-28 2015 02 28 0.299090
-----------------------------------------------------------------
Feb 2016 : (696, 6)
Date and Time str Year Month Day Values
1437 2016-02-29 21:00:00 2016-02-29 2016 02 29 0.854047
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.035787
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.955364
-----------------------------------------------------------------
Feb 2017 : (672, 6)
Date and Time str Year Month Day Values
1413 2017-02-28 21:00:00 2017-02-28 2017 02 28 0.936354
1414 2017-02-28 22:00:00 2017-02-28 2017 02 28 0.954680
1415 2017-02-28 23:00:00 2017-02-28 2017 02 28 0.625131
-----------------------------------------------------------------
Feb 2018 : (672, 6)
Date and Time str Year Month Day Values
1413 2018-02-28 21:00:00 2018-02-28 2018 02 28 0.965274
1414 2018-02-28 22:00:00 2018-02-28 2018 02 28 0.848050
1415 2018-02-28 23:00:00 2018-02-28 2018 02 28 0.238984
-----------------------------------------------------------------
Feb 2019 : (672, 6)
Date and Time str Year Month Day Values
1413 2019-02-28 21:00:00 2019-02-28 2019 02 28 0.476142
1414 2019-02-28 22:00:00 2019-02-28 2019 02 28 0.498278
1415 2019-02-28 23:00:00 2019-02-28 2019 02 28 0.127525
-----------------------------------------------------------------
To fix the leap-year problem:
There is definitely a better way, but the only thing I can think of is to create the missing NaN rows, insert them, and then concatenate the dataframes.
indexes = list(range(1416, 1440))
lines = pd.DataFrame(np.nan, columns=df.columns.values, index=indexes)
print(lines.head())
Output:
Date and Time str Year Month Day Values
1416 NaN NaN NaN NaN NaN NaN
1417 NaN NaN NaN NaN NaN NaN
1418 NaN NaN NaN NaN NaN NaN
1419 NaN NaN NaN NaN NaN NaN
1420 NaN NaN NaN NaN NaN NaN
Then I add the NaN rows to the data frame with the following code:
b = {}
for i in range(2015, 2020):
    if list(d[i][d[i]['Month'] == '02'].tail(1)['Day'])[0] == '28':
        bi = pd.concat([d[i].iloc[0:1416], lines]).reset_index(drop=True)
        b[i] = pd.concat([bi, d[i].iloc[1416:8783]]).reset_index(drop=True)
    else:
        b[i] = d[i].copy()

for i in range(2015, 2020):
    print(i, ':', b[i].shape)
    print(b[i].iloc[1438:1441])
    print('-----------------------------------------------------------------')
Output:
2015 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2015-03-01 2015-03-01 2015 03 01 0.676486
-----------------------------------------------------------------
2016 : (8784, 6)
Date and Time str Year Month Day Values
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.035787
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.955364
1440 2016-03-01 00:00:00 2016-03-01 2016 03 01 0.014158
-----------------------------------------------------------------
2017 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2017-03-01 2017-03-01 2017 03 01 0.035952
-----------------------------------------------------------------
2018 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2018-03-01 2018-03-01 2018 03 01 0.44876
-----------------------------------------------------------------
2019 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2019-03-01 2019-03-01 2019 03 01 0.096433
-----------------------------------------------------------------
And finally, we build the dataframe you want:
final_df = pd.DataFrame(index=b[2016]['Date and Time'])
for i in range(2015, 2020):
    final_df[i] = np.array(b[i]['Values'])
Output:
2015 2016 2017 2018 2019
Date and Time
2016-01-01 00:00:00 0.153948 0.145602 0.957265 0.427620 0.868948
2016-01-01 01:00:00 0.663132 0.318746 0.013658 0.380105 0.442332
2016-01-01 02:00:00 0.141534 0.483471 0.048050 0.139065 0.702211
2016-01-01 03:00:00 0.263551 0.737948 0.528827 0.472889 0.165095
2016-01-01 04:00:00 0.094391 0.939737 0.120343 0.134011 0.297611
... ... ... ... ... ...
2016-02-28 22:00:00 0.404741 0.864423 0.954680 0.848050 0.498278
2016-02-28 23:00:00 0.299090 0.348466 0.625131 0.238984 0.127525
2016-02-29 00:00:00 NaN 0.375469 NaN NaN NaN
2016-02-29 01:00:00 NaN 0.186092 NaN NaN NaN
... ... ... ... ... ...
2016-02-29 22:00:00 NaN 0.035787 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.955364 NaN NaN NaN
2016-03-01 00:00:00 0.676486 0.014158 0.035952 0.448760 0.096433
2016-03-01 01:00:00 0.792168 0.520436 0.138874 0.229396 0.913848
... ... ... ... ... ...
2016-12-31 19:00:00 0.517459 0.956219 0.116335 0.736170 0.739740
2016-12-31 20:00:00 0.814362 0.324332 0.324911 0.485508 0.055802
2016-12-31 21:00:00 0.870459 0.809150 0.335461 0.124459 0.952963
2016-12-31 22:00:00 0.549891 0.043623 0.997053 0.144286 0.106768
2016-12-31 23:00:00 0.047090 0.730074 0.698159 0.235253 0.834583
[8784 rows x 5 columns]
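As an aside, a shorter route to the same leap-day padding, sketched here against the d dict built above, is to re-date every year onto 2016 and reindex against the full leap-year hourly range, which inserts the missing Feb 29 rows as NaN automatically:
full_idx = pd.date_range('2016-01-01', '2016-12-31 23:00:00', freq='1H')
b_alt = {}
for i in range(2015, 2020):
    # map each timestamp onto the leap year 2016, keeping month/day/time
    synthetic = pd.to_datetime(d[i]['Date and Time'].dt.strftime('2016-%m-%d %H:%M:%S'))
    b_alt[i] = d[i].set_index(synthetic).reindex(full_idx)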
I'm processing a MIMIC dataset. I want to combine the data in the rows whose time difference (delta time) is below 10 minutes.
The original data:
charttime hadm_id age is_male HR RR SPO2 Systolic_BP Diastolic_BP MAP PEEP PO2
0 2119-07-20 17:54:00 26270240 NaN NaN NaN NaN NaN 103.0 66.0 81.0 NaN NaN
1 2119-07-20 17:55:00 26270240 68.0 1.0 113.0 26.0 NaN NaN NaN NaN NaN NaN
2 2119-07-20 17:57:00 26270240 NaN NaN NaN NaN 92.0 NaN NaN NaN NaN NaN
3 2119-07-20 18:00:00 26270240 68.0 1.0 114.0 28.0 NaN 85.0 45.0 62.0 16.0 NaN
4 2119-07-20 18:01:00 26270240 NaN NaN NaN NaN 91.0 NaN NaN NaN NaN NaN
5 2119-07-30 21:00:00 26270240 68.0 1.0 90.0 16.0 93.0 NaN NaN NaN NaN NaN
6 2119-07-30 21:00:00 26270240 68.0 1.0 89.0 9.0 94.0 NaN NaN NaN NaN NaN
7 2119-07-30 21:01:00 26270240 68.0 1.0 89.0 10.0 93.0 NaN NaN NaN NaN NaN
8 2119-07-30 21:05:00 26270240 NaN NaN NaN NaN NaN 109.0 42.0 56.0 NaN NaN
9 2119-07-30 21:10:00 26270240 68.0 1.0 90.0 10.0 93.0 NaN NaN NaN NaN NaN
After combining the rows whose delta time is less than 10 min, the output I want:
(when several rows in a group have data in the same column, just take the first one)
charttime hadm_id age is_male HR RR SPO2 Systolic_BP Diastolic_BP MAP PEEP PO2
0 2119-07-20 17:55:00 26270240 68.0 1.0 113.0 26.0 92.0 103.0 66.0 81.0 16.0 NaN
1 2119-07-30 21:00:00 26270240 68.0 1.0 89.0 9.0 94.0 109.0 42.0 56.0 NaN NaN
How can I do this?
First, I would floor the timestamp column to 10-minute bins:
df['charttime'] = pd.to_datetime(df['charttime']).dt.floor('10T')  # keep the date part so rows from different days don't collide
Then, I would drop the duplicates, based on the columns you want to compare (for example, hadm_id and charttime):
df.drop_duplicates(subset=['charttime', 'hadm_id'], keep='first', inplace=True)
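Note that drop_duplicates keeps only the first row of each bin. If you also want the rows merged column by column, taking the first non-NaN value as in your desired output, a groupby sketch along the same lines should work (fixed 10-minute bins are an approximation of "within 10 minutes of each other"):
df['charttime'] = pd.to_datetime(df['charttime'])
bins = df['charttime'].dt.floor('10min').rename('timebin')
# GroupBy.first() takes the first non-NaN value per column within each bin
combined = df.groupby([bins, 'hadm_id']).first().reset_index()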
I want to find the last valid index of the first Dataframe, and use it to index the second Dataframe.
So, suppose I have the following Dataframe (df1):
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Now I can use "last_valid_index()" to find the last valid index of each column:
lvi = df.apply(lambda series: series.last_valid_index())
Which yields:
Site 1 2017-01-01
Site 2 2013-01-01
Site 3 2019-01-01
Site 4 2020-01-01
Site 5 2006-01-01
Site 6 2017-01-01
How do I apply this to another Dataframe, using the index to slice its time series? An example of such a Dataframe could be created with:
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
df2 = pd.DataFrame({
"Site 1": np.random.rand(21),
"Site 2": np.random.rand(21),
"Site 3": np.random.rand(21),
"Site 4": np.random.rand(21),
"Site 5": np.random.rand(21),
"Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01', freq='AS')
df2 = df2.set_index(idx)
How do I use that "lvi" variable to index into df2?
To do this manually I could just use:
df_s1 = df['Site 1'].loc['2000-01-01':'2017-01-01']
To get something like:
2000-01-01 13.0
2001-01-01 77.0
2002-01-01 50.0
2003-01-01 7.0
2004-01-01 11.0
2005-01-01 50.0
2006-01-01 30.0
2007-01-01 50.0
2008-01-01 24.0
2009-01-01 56.0
2010-01-01 87.0
2011-01-01 22.0
2012-01-01 12.0
2013-01-01 1.0
2014-01-01 94.0
2015-01-01 2.0
2016-01-01 81.0
2017-01-01 59.0
Is there a better way to approach this? Also, will each column have to essentially be its own dataframe to work? Any help is greatly appreciated!
This might be a bit more idiomatic:
df2[df.notna()]
or even
df2.where(df.notna())
Note that in these cases (and df1*0 + df2), the operations are done for matching index values of df and df2. For example, df2[df.reset_index(drop=True).notna()] will return all nan because there are no common index values.
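If you do want to slice with lvi explicitly, note the columns end on different dates, so the result is one Series per site; a minimal sketch:
# slice each column of df2 up to that column's last valid index in df
sliced = {col: df2.loc[:lvi[col], col] for col in df2.columns}
sliced['Site 1']  # runs 2000-01-01 through 2017-01-01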
This seems to work just fine:
In [34]: d
Out[34]:
x y
Date
2020-01-01 1.0 2.0
2020-01-02 1.0 2.0
2020-01-03 1.0 2.0
2020-01-04 1.0 2.0
2020-01-05 1.0 2.0
2020-01-06 1.0 NaN
2020-01-07 1.0 NaN
2020-01-08 1.0 NaN
2020-01-09 1.0 NaN
2020-01-10 1.0 NaN
2020-01-11 NaN NaN
2020-01-12 NaN NaN
2020-01-13 NaN NaN
2020-01-14 NaN NaN
2020-01-15 NaN NaN
2020-01-16 NaN NaN
2020-01-17 NaN NaN
2020-01-18 NaN NaN
2020-01-19 NaN NaN
2020-01-20 NaN NaN
In [35]: d.apply(lambda col: col.last_valid_index())
Out[35]:
x 2020-01-10
y 2020-01-05
dtype: datetime64[ns]
And then:
In [15]: d.apply(lambda col: col.last_valid_index()).apply(lambda date: df2.loc[date])
Out[15]:
          z
x  0.940396
y  0.564007
Alright, so after thinking about this for a while and trying to come up with a detailed procedure that involved a for loop, I came to the conclusion that this simple math operation will do the trick. Basically I am taking advantage of how arithmetic between Dataframes works in pandas.
output = df1*0 + df2
This gives df2's values with the NaN pattern of df1 imprinted on them:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 0.690597 0.443933 0.787931 0.659639 0.363606 0.922373
2001-01-01 0.388669 0.577734 0.450225 0.021592 0.554249 0.305546
2002-01-01 0.578212 0.927848 0.361426 0.840541 0.626881 0.545491
2003-01-01 0.431668 0.128282 0.893351 0.783488 0.122182 0.666194
2004-01-01 0.151491 0.928584 0.834474 0.945401 0.590830 0.802648
2005-01-01 0.113477 0.398326 0.649955 0.202538 0.485927 0.127925
2006-01-01 0.521906 0.458672 0.923632 0.948696 0.638754 0.552753
2007-01-01 0.266599 0.839047 0.099069 0.000928 NaN 0.018146
2008-01-01 0.819810 0.809779 0.706223 0.247780 NaN 0.759691
2009-01-01 0.441574 0.020291 0.702551 0.468862 NaN 0.341191
2010-01-01 0.277030 0.130573 0.906697 0.589474 NaN 0.819986
2011-01-01 0.795344 0.103121 0.846405 0.589916 NaN 0.564411
2012-01-01 0.697255 0.599767 0.206482 0.718980 NaN 0.731366
2013-01-01 0.891771 0.001944 0.703132 0.751986 NaN 0.845933
2014-01-01 0.672579 NaN 0.466981 0.466770 NaN 0.618069
2015-01-01 0.767219 NaN 0.702156 0.370905 NaN 0.481971
2016-01-01 0.315264 NaN 0.793531 0.754920 NaN 0.091432
2017-01-01 0.431651 NaN 0.974520 0.708074 NaN 0.870077
2018-01-01 NaN NaN 0.408743 0.430576 NaN NaN
2019-01-01 NaN NaN 0.751509 0.755521 NaN NaN
2020-01-01 NaN NaN NaN 0.518533 NaN NaN
I was basically wanting to imprint the NaN values from one Dataframe onto another. I cannot believe how difficult I was making this. As long as my Dataframes are the same size this should work fine for my needs.
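For what it's worth, the trick works because NaN propagates through arithmetic: df1*0 keeps df1's NaN mask (and is 0.0 everywhere else), and adding df2 then imprints df2's values on the surviving cells. A two-line check:
import numpy as np
np.nan * 0    # nan -> multiplying by zero does not clear a NaN
np.nan + 0.5  # nan -> and it survives the addition with df2 as well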
Now I should be able to take it from here to calculate the percent change from each last valid datapoint. Thank you everyone for the input!
EDIT:
Just to show everyone what I was ultimately trying to accomplish, here is the final code I produced with everyone's help and suggestions!
The original df looked like:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Then I came up with a second full dataframe (df2) with:
df2 = pd.DataFrame({
"Site 1": np.random.rand(21),
"Site 2": np.random.rand(21),
"Site 3": np.random.rand(21),
"Site 4": np.random.rand(21),
"Site 5": np.random.rand(21),
"Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01', freq='AS')
df2 = df2.set_index(idx)
Now I imprint the NaN values from df onto df2:
dfr = df2[df.notna()]
Then I reverse the dataframe so the most recent date comes first:
dfr = dfr[::-1]
valid_first = dfr.apply(lambda col: col.first_valid_index())
valid_last = dfr.apply(lambda col: col.last_valid_index())
Now I want to calculate the percent change from my last valid data point, which is fixed for each column. This gives me the % change from the present to the past, with respect to the most recent (or last valid) data point.
new = []
for j in dfr:
    m = dfr[j].loc[valid_first[j]:valid_last[j]]
    pc = m / m.iloc[0] - 1
    new.append(pc)
final = pd.concat(new, axis=1)
print(final)
Which gave me:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
2000-01-01 0.270209 -0.728445 -0.636105 0.380330 41.339081 -0.462147
2001-01-01 0.854952 -0.827804 -0.703568 -0.787391 40.588791 -0.884806
2002-01-01 -0.677757 -0.120482 -0.208255 -0.982097 54.348094 -0.483415
2003-01-01 -0.322010 -0.061277 -0.382602 1.025088 5.440808 -0.602661
2004-01-01 1.574451 -0.768251 -0.543260 1.210434 50.494788 -0.859331
2005-01-01 -0.412226 -0.866441 -0.055027 -0.168267 1.346869 -0.385080
2006-01-01 1.280867 -0.640899 0.354513 1.086703 0.000000 0.108504
2007-01-01 1.121585 -0.741675 -0.735990 -0.768578 NaN -0.119436
2008-01-01 -0.210467 -0.376884 -0.575106 -0.779147 NaN 0.055949
2009-01-01 1.864107 -0.966827 0.566590 1.003121 NaN -0.214482
2010-01-01 0.571762 -0.311459 -0.518113 1.036950 NaN -0.513911
2011-01-01 -0.122525 -0.178137 -0.641642 0.197481 NaN 0.033141
2012-01-01 0.403578 -0.829402 0.161753 -0.438578 NaN -0.996595
2013-01-01 0.383481 0.000000 -0.305824 0.602079 NaN -0.057711
2014-01-01 -0.699708 NaN -0.515074 -0.277157 NaN -0.840873
2015-01-01 0.422364 NaN -0.759708 1.230037 NaN -0.663253
2016-01-01 -0.418945 NaN 0.197396 -0.445260 NaN -0.299741
2017-01-01 0.000000 NaN -0.897428 0.669791 NaN 0.000000
2018-01-01 NaN NaN 0.138997 0.486961 NaN NaN
2019-01-01 NaN NaN 0.000000 0.200771 NaN NaN
2020-01-01 NaN NaN NaN 0.000000 NaN NaN
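As a side note, the loop can be collapsed into two vectorized lines; a sketch that should give the same frame, assuming every column has at least one valid value, since cells outside each column's valid range are already NaN in dfr:
base = dfr.apply(lambda col: col.loc[col.first_valid_index()])
final = dfr / base - 1  # divide each column by its most recent valid value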
I know often times these questions don't have context, so here is the final output achieved thanks to your input. Again, thank you to everyone for the help!
Here is a multi-index, multi-level dataframe (originally attached as a file). Loading the dataframe from a csv:
import pandas as pd
df = pd.read_csv('./enviar/only-bh-extreme-events-satellite.csv',
                 index_col=[0, 1, 2, 3, 4],
                 header=[0, 1, 2, 3],
                 skipinitialspace=True,
                 tupleize_cols=True)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
                                                       ci
                                                        1
                                                        1
                                                      00h  06h  12h  18h
wsid lat        lon        start               prcp_24
329 -43.969397 -19.883945 2007-03-18 10:00:00 72.0 NaN NaN NaN NaN
2007-03-20 10:00:00 104.4 NaN NaN NaN NaN
2007-10-18 23:00:00 92.8 NaN NaN NaN NaN
2007-12-21 00:00:00 60.4 NaN NaN NaN NaN
2008-01-19 18:00:00 53.0 NaN NaN NaN NaN
2008-04-05 01:00:00 80.8 0.0 0.0 0.0 0.0
2008-10-31 17:00:00 101.8 NaN NaN NaN NaN
2008-11-01 04:00:00 82.0 NaN NaN NaN NaN
2008-12-29 00:00:00 57.8 NaN NaN NaN NaN
2009-03-28 10:00:00 72.4 NaN NaN NaN NaN
2009-10-07 02:00:00 57.8 NaN NaN NaN NaN
2009-10-08 00:00:00 83.8 NaN NaN NaN NaN
2009-11-28 16:00:00 84.4 NaN NaN NaN NaN
2009-12-18 04:00:00 51.8 NaN NaN NaN NaN
2009-12-28 00:00:00 96.4 NaN NaN NaN NaN
2010-01-06 05:00:00 74.2 NaN NaN NaN NaN
2011-12-18 00:00:00 113.6 NaN NaN NaN NaN
2011-12-19 00:00:00 90.6 NaN NaN NaN NaN
2012-11-15 07:00:00 85.8 NaN NaN NaN NaN
2013-10-17 00:00:00 52.4 NaN NaN NaN NaN
2014-04-01 22:00:00 72.0 0.0 0.0 0.0 0.0
2014-10-20 06:00:00 56.6 NaN NaN NaN NaN
2014-12-13 09:00:00 104.4 NaN NaN NaN NaN
2015-02-09 00:00:00 62.0 NaN NaN NaN NaN
2015-02-16 19:00:00 56.8 NaN NaN NaN NaN
2015-05-06 17:00:00 50.8 0.0 0.0 0.0 0.0
2016-02-26 00:00:00 52.2 NaN NaN NaN NaN
343 -44.416883 -19.885398 2008-08-30 21:00:00 50.4 0.0 0.0 0.0 0.0
2009-02-01 01:00:00 53.8 NaN NaN NaN NaN
2010-03-22 00:00:00 51.4 NaN NaN NaN NaN
2011-11-12 21:00:00 57.8 NaN NaN NaN NaN
2011-11-25 22:00:00 107.6 NaN NaN NaN NaN
2012-12-28 20:00:00 94.0 NaN NaN NaN NaN
2013-10-16 22:00:00 50.8 NaN NaN NaN NaN
2014-11-06 21:00:00 55.2 NaN NaN NaN NaN
2015-01-24 00:00:00 80.0 NaN NaN NaN NaN
2015-01-27 00:00:00 52.8 NaN NaN NaN NaN
370 -43.958651 -19.980034 2015-01-28 23:00:00 50.4 NaN NaN NaN NaN
2015-01-29 00:00:00 50.6 NaN NaN NaN NaN
I'm trying to describe the data grouped by level 0 (the variables ci, d, r, z, ...): I'd like to get the count, max, min, std, etc. When I tried df.describe(), it did not group by level 0. This is what I expected:
ci cc z r -> Level 0
count 39.000000 39.000000 39.000000 39.000000
mean 422577.032051 422025.595353 421672.402244 422449.004808
std 144740.869473 144550.040108 144425.167173 144692.422425
min 0.000000 0.000000 0.000000 0.000000
25% 467962.437500 467512.156250 467915.437500 468552.750000
50% 470644.687500 469924.468750 469772.312500 470947.468750
75% 472557.875000 471953.828125 471156.250000 472279.937500
max 473988.062500 473269.187500 472358.125000 473675.812500
I had created this helper function:
def format_percentiles(percentiles):
    percentiles = np.asarray(percentiles)
    percentiles = 100 * percentiles
    int_idx = (percentiles.astype(int) == percentiles)
    if np.all(int_idx):
        out = percentiles.astype(int).astype(str)
        return [i + '%' for i in out]
    # fallback for non-integer percentages (unused for 0.25/0.5/0.75)
    return [str(p) + '%' for p in percentiles]
And this is my own describe function:
import numpy as np
import pandas as pd
from functools import reduce

def describe_customized(df):
    _df = pd.DataFrame()
    data = []
    variables = list(set(df.columns.get_level_values(0)))
    variables.sort()
    for var in variables:
        idx = pd.IndexSlice
        values = df.loc[:, idx[[var]]].values.tolist()  # all values of one variable
        z = reduce(lambda x, y: x + y, values)          # flatten the list of lists
        data.append(pd.Series(z, name=var))
    for series in data:
        percentiles = np.array([0.25, 0.5, 0.75])
        formatted_percentiles = format_percentiles(percentiles)
        stat_index = ['count', 'mean', 'std', 'min'] + formatted_percentiles + ['max']
        d = ([series.count(), series.mean(), series.std(), series.min()] +
             [series.quantile(x) for x in percentiles] + [series.max()])
        s = pd.Series(d, index=stat_index, name=series.name)
        _df = pd.concat([_df, s], axis=1)
    return _df

dd = describe_customized(df)
dd = describe_customized(df)
Result:
al asn cc chnk ci ciwc \
25% 0.130846 0.849998 0.000000 0.018000 0.0 0.000000e+00
50% 0.131369 0.849999 0.000000 0.018000 0.0 0.000000e+00
75% 0.134000 0.849999 0.000000 0.018000 0.0 0.000000e+00
count 624.000000 624.000000 23088.000000 624.000000 64.0 2.308800e+04
max 0.137495 0.849999 1.000000 0.018006 0.0 5.576574e-04
mean 0.119082 0.762819 0.022013 0.016154 0.0 8.247306e-07
min 0.000000 0.000000 0.000000 0.000000 0.0 0.000000e+00
std 0.040338 0.258087 0.098553 0.005465 0.0 8.969210e-06
I created a function that returns a new dataframe with the statistics of the variables for a level of your choice:
def describe_levels(df, level):
    df_des = pd.DataFrame(
        index=df.columns.levels[0],
        columns=['count', 'mean', 'std', 'min', '25', '50', '75', 'max']
    )
    for index in df_des.index:
        sub = df[index]['1'][level]
        df_des.loc[index, 'count'] = len(sub)
        df_des.loc[index, 'mean'] = sub.mean().mean()
        df_des.loc[index, 'std'] = sub.std().mean()
        df_des.loc[index, 'min'] = sub.min().mean()
        df_des.loc[index, 'max'] = sub.max().mean()
        df_des.loc[index, '25'] = sub.quantile(q=0.25).mean()
        df_des.loc[index, '50'] = sub.quantile(q=0.5).mean()
        df_des.loc[index, '75'] = sub.quantile(q=0.75).mean()
    return df_des
For example, I called:
describe_levels(df,'1').T
This gives the statistics for pressure level 1.
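As a footnote, the expected per-variable table can also be produced in one step by stacking the lower column levels into the row index and describing what remains; a sketch, assuming the four header levels parse as shown above:
# column levels 1-3 move into the row index; level 0 (ci, cc, z, r, ...)
# stays as columns, so describe() pools every value under each variable
stats = df.stack(level=[1, 2, 3]).describe()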
I have this code which loads my data into a dataframe, and I try to fill the NaN values using .interpolate() instead of replacing them with 0.
my dataframe looks like this:
weight height wc hc FBS HBA1C
0 NaN NaN NaN NaN NaN NaN
1 55.6 151.0 NaN NaN 126.0 NaN
2 42.8 151.0 73.0 79.0 NaN NaN
3 60.8 155.0 NaN NaN 201.0 NaN
4 NaN NaN NaN NaN NaN NaN
5 60.0 NaN 87.0 92.0 NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN 194.0 NaN
9 57.0 158.0 95.0 90.0 NaN NaN
10 46.0 NaN 83.0 91.0 223.0 NaN
11 NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN
13 58.5 164.0 NaN NaN NaN NaN
14 62.0 154.0 80.5 100.0 NaN NaN
15 NaN NaN NaN NaN NaN NaN
16 57.0 152.0 NaN NaN NaN NaN
17 62.4 153.0 88.0 99.0 NaN NaN
18 NaN NaN NaN NaN NaN NaN
19 48.0 146.0 NaN NaN NaN NaN
20 68.7 NaN NaN NaN NaN NaN
21 49.0 146.0 NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN NaN
24 70.2 161.0 NaN NaN NaN NaN
25 70.4 161.0 93.0 68.0 NaN NaN
26 61.8 143.0 91.0 98.0 NaN NaN
27 70.4 NaN NaN NaN NaN NaN
28 70.1 144.0 100.0 103.0 NaN NaN
29 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
318 49.0 146.0 92.0 89.0 NaN NaN
319 64.7 145.0 87.0 107.0 NaN NaN
320 55.5 149.0 81.0 101.0 NaN NaN
321 55.4 145.0 87.0 96.0 NaN NaN
322 53.1 153.0 83.0 96.0 NaN NaN
323 52.1 147.0 89.0 92.0 NaN NaN
324 68.9 167.0 96.0 100.0 NaN NaN
325 NaN NaN NaN NaN NaN NaN
326 57.0 142.0 100.0 101.0 NaN NaN
327 72.5 163.0 98.0 95.0 NaN NaN
328 73.5 157.0 94.0 114.0 NaN NaN
329 61.0 160.0 90.0 89.5 NaN NaN
330 49.0 150.0 80.0 90.0 NaN NaN
331 50.0 150.0 83.0 90.0 NaN NaN
332 67.6 155.0 92.0 103.0 NaN NaN
333 NaN NaN NaN NaN NaN NaN
334 78.7 162.0 99.0 101.0 NaN NaN
335 74.5 155.0 98.0 110.0 NaN NaN
336 68.0 152.0 85.0 93.0 NaN NaN
337 67.0 152.0 NaN NaN 179.1 NaN
338 NaN NaN NaN NaN 315.0 NaN
339 38.0 145.0 66.0 NaN 196.0 NaN
340 50.0 148.0 NaN NaN 133.0 NaN
341 73.5 NaN NaN NaN NaN NaN
342 74.5 NaN NaN NaN NaN NaN
343 NaN NaN NaN NaN NaN NaN
344 67.0 152.0 106.0 NaN NaN NaN
345 52.0 145.0 94.0 NaN NaN NaN
346 52.0 159.0 89.0 NaN NaN NaN
347 67.0 153.0 92.0 91.0 NaN NaN
my code:
import pandas as pd
import numpy as np

df = pd.read_csv('final_dataset_3.csv')
for col in ['weight', 'height', 'wc', 'hc', 'FBS', 'HBA1C']:
    df[col].replace(0, np.nan, inplace=True)
df1 = df.interpolate()
df1
df1 looks like this:
weight height wc hc FBS HBA1C
0 NaN NaN NaN NaN NaN NaN
1 55.600000 151.0 NaN NaN 126.000000 NaN
2 42.800000 151.0 73.000000 79.000000 163.500000 NaN
3 60.800000 155.0 77.666667 83.333333 201.000000 NaN
4 60.400000 155.5 82.333333 87.666667 199.600000 NaN
5 60.000000 156.0 87.000000 92.000000 198.200000 NaN
6 59.250000 156.5 89.000000 91.500000 196.800000 NaN
After running the code, it didn't replace the NaN values with sensible values; instead it filled them with interpolated numbers carrying more decimal places.
Looking at this data leads me to believe that interpolating the values would be improper. Each row represents some attributes for different people. You cannot base a missing value of, say, weight on adjacent rows. I understand that you need to deal with the NaN's because much of the data will be useless when building many types of models.
Instead maybe you should fill with the mean() or median(). Here's a simple dataframe with some missing values.
df
Out[58]:
height weight
0 54.0 113.0
1 61.0 133.0
2 NaN 129.0
3 48.0 NaN
4 60.0 107.0
5 51.0 114.0
6 NaN 165.0
7 51.0 NaN
8 53.0 147.0
9 NaN 124.0
To replace missing values with the mean() of the column:
df.fillna(df.mean())
Out[59]:
height weight
0 54.0 113.0
1 61.0 133.0
2 54.0 129.0
3 48.0 129.0
4 60.0 107.0
5 51.0 114.0
6 54.0 165.0
7 51.0 129.0
8 53.0 147.0
9 54.0 124.0
Of course, you could easily use median() or some other method that makes sense for your data.
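For instance, the median version is the same one-liner:
df.fillna(df.median())  # per-column median, more robust to outliers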