I have a pandas dataframe as below:
>>> df.head()
timestamp count_200 count_201 count_503 count_504 mean_200 mean_201 mean_503 mean_504 count_500
0 2020-09-18 09:00:00 4932.0 51.0 NaN NaN 59.501014 73.941176 0.0 0.0 0
1 2020-09-18 10:00:00 1697.0 9.0 NaN NaN 57.807896 69.111111 0.0 0.0 0
2 2020-09-18 11:00:00 6895.0 6.0 2.0 1.0 54.037273 98.333333 33.0 1511.0 0
3 2020-09-18 12:00:00 2943.0 97.0 NaN NaN 74.334353 74.268041 0.0 0.0 0
4 2020-09-18 13:00:00 2299.0 43.0 NaN NaN 70.539800 102.302326 0.0 0.0 0
fillna does not replace the NaN values:
>>> df.fillna(0)
timestamp count_200 count_201 count_503 count_504 mean_200 mean_201 mean_503 mean_504 count_500
0 2020-09-18 09:00:00 4932.0 51.0 NaN NaN 59.501014 73.941176 0.000000 0.000 0
1 2020-09-18 10:00:00 1697.0 9.0 NaN NaN 57.807896 69.111111 0.000000 0.000 0
2 2020-09-18 11:00:00 6895.0 6.0 2.0 1.0 54.037273 98.333333 33.000000 1511.000 0
3 2020-09-18 12:00:00 2943.0 97.0 NaN NaN 74.334353 74.268041 0.000000 0.000 0
4 2020-09-18 13:00:00 2299.0 43.0 NaN NaN 70.539800 102.302326 0.000000 0.000 0
But if we access just one row, fillna on the resulting series works as expected:
>>> df.iloc[0]
timestamp 2020-09-18 09:00:00
count_200 4932
count_201 51
count_503 NaN
count_504 NaN
mean_200 59.501
mean_201 73.9412
mean_503 0
mean_504 0
count_500 0
Name: 0, dtype: object
>>> df.iloc[0].fillna(0)
timestamp 2020-09-18 09:00:00
count_200 4932
count_201 51
count_503 0
count_504 0
mean_200 59.501
mean_201 73.9412
mean_503 0
mean_504 0
count_500 0
Name: 0, dtype: object
What is going on here?
>>> df.iloc[0,3]
nan
>>> type(df.iloc[0,3])
<class 'numpy.float64'>
Pandas recognises it as NA:
>>> df.isna()
timestamp count_200 count_201 count_503 count_504 mean_200 mean_201 mean_503 mean_504 count_500
0 False False False True True False False False False False
1 False False False True True False False False False False
2 False False False False False False False False False False
3 False False False True True False False False False False
4 False False False True True False False False False False
But with NumPy's built-in function this can be worked around in pandas:
>>> df.head().apply(np.nan_to_num)
timestamp count_200 count_201 count_503 count_504 mean_200 mean_201 mean_503 mean_504 count_500
0 2020-09-18 09:00:00 4932.0 51.0 0.0 0.0 59.501014 73.941176 0.0 0.0 0
1 2020-09-18 10:00:00 1697.0 9.0 0.0 0.0 57.807896 69.111111 0.0 0.0 0
2 2020-09-18 11:00:00 6895.0 6.0 2.0 1.0 54.037273 98.333333 33.0 1511.0 0
3 2020-09-18 12:00:00 2943.0 97.0 0.0 0.0 74.334353 74.268041 0.0 0.0 0
4 2020-09-18 13:00:00 2299.0 43.0 0.0 0.0 70.539800 102.302326 0.0 0.0 0
Is this expected? I can't find it documented. What am I missing? Is this a bug?
df.head()
timestamp count_200 count_201 count_503 count_504 mean_200 mean_201 mean_503 mean_504 count_500
0 2020-09-18 09:00:00 4932.0 51.0 NaN NaN 59.501014 73.941176 0.0 0.0 0
1 2020-09-18 10:00:00 1697.0 9.0 NaN NaN 57.807896 69.111111 0.0 0.0 0
2 2020-09-18 11:00:00 6895.0 6.0 2.0 1.0 54.037273 98.333333 33.0 1511.0 0
3 2020-09-18 12:00:00 2943.0 97.0 NaN NaN 74.334353 74.268041 0.0 0.0 0
4 2020-09-18 13:00:00 2299.0 43.0 NaN NaN 70.539800 102.302326 0.0 0.0 0
Replacing NaN with 0
df.fillna(0)
timestamp count_200 count_201 count_503 count_504 mean_200 mean_201 mean_503 mean_504 count_500
0 2020-09-18 09:00:00 4932.0 51.0 0.0 0.0 59.501014 73.941176 0.0 0.0 0
1 2020-09-18 10:00:00 1697.0 9.0 0.0 0.0 57.807896 69.111111 0.0 0.0 0
2 2020-09-18 11:00:00 6895.0 6.0 2.0 1.0 54.037273 98.333333 33.0 1511.0 0
3 2020-09-18 12:00:00 2943.0 97.0 0.0 0.0 74.334353 74.268041 0.0 0.0 0
4 2020-09-18 13:00:00 2299.0 43.0 0.0 0.0 70.539800 102.302326 0.0 0.0 0
It is working fine for me.
Use inplace=True to apply the changes to the dataframe:
df.fillna(0, inplace=True)
The pandas version I'm using is:
print(pd.__version__)
0.23.0
Please restart your IDE/Python kernel.
Check and update your pandas version (if required).
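A quick way to check the installed version from Python; the upgrade itself happens in the shell (pip shown here, use conda update pandas in a conda environment):
import pandas as pd
print(pd.__version__)
# then, from a shell (not inside Python):
#   pip install --upgrade pandas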
df[df.isna()] = 0  # boolean-mask assignment: set every NA cell to 0
You can use this. The pandas library can be confusing, since for one piece of functionality there are many things you can try; I generally try them all rather than getting stuck on one. Tell me if this works, or at least what it does.
I can't seem to recreate the error. If I copy your provided df and use pd.read_clipboard() to turn it into a df, then df.fillna(0) gives the expected results for me.
When you provide the return from df.fillna(0), is that the actual return, or are you printing the df? If the latter, remember that fillna returns a new dataframe by default; use the inplace=True parameter or assign the result back.
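A minimal sketch of the difference, using a throwaway frame rather than the question's data:
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'x': [1.0, np.nan]})
tmp.fillna(0)                  # returns a filled copy; tmp itself is unchanged
print(tmp['x'].isna().any())   # True
tmp = tmp.fillna(0)            # assign the result back (or pass inplace=True)
print(tmp['x'].isna().any())   # False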
I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is the dataframe I want to use to fill the NaN values:
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When you call fillna, you usually want a fill method, such as filling with the previous/next value or the mean of the column. What we can do is something like this:
nulls_index = df['rain_gauge_value'].isnull()  # remember which rows were null
df = df.fillna(method='ffill')                 # forward-fill, as an example
nulls_after_fill = df[nulls_index]             # inspect the rows that were filled
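For the other strategies mentioned, a couple of sketches along the same lines (assuming the column names from the question):
# back-fill: take the next valid observation instead
df['rain_gauge_value'] = df['rain_gauge_value'].fillna(method='bfill')
# or fill with the column mean
df['rain_gauge_value'] = df['rain_gauge_value'].fillna(df['rain_gauge_value'].mean())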
Take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to tell pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values where the dates and times line up, but it won't be obvious to pandas. See my dummy example:
from datetime import date, time
import numpy as np
import pandas as pd

raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0), time(0,0,1)], temp=[1., np.nan], rain=[4., np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
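Applied to the frames in the question, the same idea would look roughly like this (a sketch; it assumes the date/time pairs in rain_df actually cover the rows that are missing in df):
patched = df.set_index(['date', 'time']).fillna(rain_df.set_index(['date', 'time']))
patched = patched.reset_index()  # restore date and time as ordinary columns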
What's the most succinct way to resample this dataframe:
>>> uneven = pd.DataFrame({'a': [0, 12, 19]}, index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))
>>> print(uneven)
a
2020-12-08 0
2020-12-20 12
2020-12-27 19
...into this dataframe:
>>> daily = pd.DataFrame({'a': range(20)}, index=pd.date_range('2020-12-08', periods=3*7-1, freq='D'))
>>> print(daily)
a
2020-12-08 0
2020-12-09 1
...
2020-12-19 11
2020-12-20 12
2020-12-21 13
...
2020-12-27 19
NB: 12 days between the 8th and 20th Dec, 7 days between the 20th and 27th.
Also, to clarify the kind of interpolation/resampling I want to do:
>>> print(daily.diff())
a
2020-12-08 NaN
2020-12-09 1.0
2020-12-10 1.0
...
2020-12-19 1.0
2020-12-20 1.0
2020-12-21 1.0
...
2020-12-27 1.0
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
first_dose second_dose
date areaCode
2020-12-08 E92000001 0.0 0.0
N92000002 0.0 0.0
S92000003 0.0 0.0
W92000004 0.0 0.0
2020-12-20 E92000001 574829.0 0.0
N92000002 16068.0 0.0
S92000003 60333.0 0.0
W92000004 24056.0 0.0
2020-12-27 E92000001 267809.0 0.0
N92000002 14948.0 0.0
S92000003 34535.0 0.0
W92000004 12495.0 0.0
2021-01-03 E92000001 330037.0 20660.0
N92000002 9669.0 1271.0
S92000003 21446.0 44.0
W92000004 14205.0 27.0
I think you need:
df = df.reset_index('areaCode').groupby('areaCode')[['first_dose','second_dose']].resample('D').interpolate()
print (df)
first_dose second_dose
areaCode date
E92000001 2020-12-08 0.000000 0.000000
2020-12-09 47902.416667 0.000000
2020-12-10 95804.833333 0.000000
2020-12-11 143707.250000 0.000000
2020-12-12 191609.666667 0.000000
... ...
W92000004 2020-12-30 13227.857143 11.571429
2020-12-31 13472.142857 15.428571
2021-01-01 13716.428571 19.285714
2021-01-02 13960.714286 23.142857
2021-01-03 14205.000000 27.000000
[108 rows x 2 columns]
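For the original single-column uneven frame, the same idea without the grouping appears to reduce to one line (a sketch, assuming plain linear interpolation is what is wanted):
>>> uneven.resample('D').interpolate()
This upsamples to daily rows and fills the inserted NaNs linearly, reproducing the daily frame above with steps of 1.0.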
This is part of my data:
Day_Data Hour_Data WIN_D WIN_S TEM RHU PRE_1h
1 0 58 1 22 78 0
1 3 32 1.9 24.6 65 0
1 6 41 3.2 25.6 59 0
1 9 20 0.8 24.8 64 0
1 12 44 1.7 22.7 76 0
1 15 118 0.7 20.2 92 0
1 18 70 2.6 20.2 94 0
1 21 76 3.4 19.9 66 0
2 0 76 3.8 19.4 58 0
2 3 75 5.8 19.4 47 0
2 6 81 5.1 19.5 42 0
2 9 61 3.6 17.4 48 0
2 12 50 0.9 15.8 46 0
2 15 348 1.1 14.5 52 0
2 18 357 1.9 13.5 60 0
2 21 333 1.2 12.4 74 0
And I want to generate extra data like this:
the fill values are the mean of the last value and the next value.
How can I do that?
Thank you!
And, @jdy, thanks for the reminder; this is what I have done:
# build a timestamp string like '2017-10-1 0:00:00' from the day and hour columns
data['time'] = ('2017' + '-' + '10' + '-' + data['Day_Data'].map(int).map(str)
                + ' ' + data['Hour_Data'].map(int).map(str) + ':' + '00' + ':' + '00')
data.loc[:, 'Date'] = pd.to_datetime(data['time'])
data = data.drop(['Day_Data', 'Hour_Data', 'time'], axis=1)
# index by the timestamp and resample to hourly; the hours in between appear as NaN
index = data.set_index(data['Date'])
data = index.resample('1h').mean()
Output:
2017-10-01 00:00:00 58.0 1.0 22.0 78.0 0.0
2017-10-01 01:00:00 NaN NaN NaN NaN NaN
2017-10-01 02:00:00 NaN NaN NaN NaN NaN
2017-10-01 03:00:00 32.0 1.9 24.6 65.0 0.0
2017-10-01 04:00:00 NaN NaN NaN NaN NaN
2017-10-01 05:00:00 NaN NaN NaN NaN NaN
2017-10-01 06:00:00 41.0 3.2 25.6 59.0 0.0
2017-10-01 07:00:00 NaN NaN NaN NaN NaN
2017-10-01 08:00:00 NaN NaN NaN NaN NaN
2017-10-01 09:00:00 20.0 0.8 24.8 64.0 0.0
2017-10-01 10:00:00 NaN NaN NaN NaN NaN
2017-10-01 11:00:00 NaN NaN NaN NaN NaN
2017-10-01 12:00:00 44.0 1.7 22.7 76.0 0.0
2017-10-01 13:00:00 NaN NaN NaN NaN NaN
2017-10-01 14:00:00 NaN NaN NaN NaN NaN
2017-10-01 15:00:00 118.0 0.7 20.2 92.0 0.0
2017-10-01 16:00:00 NaN NaN NaN NaN NaN
2017-10-01 17:00:00 NaN NaN NaN NaN NaN
2017-10-01 18:00:00 70.0 2.6 20.2 94.0 0.0
2017-10-01 19:00:00 NaN NaN NaN NaN NaN
2017-10-01 20:00:00 NaN NaN NaN NaN NaN
2017-10-01 21:00:00 76.0 3.4 19.9 66.0 0.0
2017-10-01 22:00:00 NaN NaN NaN NaN NaN
2017-10-01 23:00:00 NaN NaN NaN NaN NaN
2017-10-02 00:00:00 76.0 3.8 19.4 58.0 0.0
2017-10-02 01:00:00 NaN NaN NaN NaN NaN
2017-10-02 02:00:00 NaN NaN NaN NaN NaN
2017-10-02 03:00:00 75.0 5.8 19.4 47.0 0.0
2017-10-02 04:00:00 NaN NaN NaN NaN NaN
2017-10-02 05:00:00 NaN NaN NaN NaN NaN
2017-10-02 06:00:00 81.0 5.1 19.5 42.0 0.0
but, I have no idea about how to fill the NaN by the mean of the last value and the next value.
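One way to get exactly "the mean of the last value and the next value" is to average a forward-fill with a backward-fill (a sketch against the resampled frame above; note that data.interpolate() would instead produce a linear ramp across each two-hour gap):
data = (data.ffill() + data.bfill()) / 2
Every NaN inside a gap then gets the same value: the mean of the two surrounding observations.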
Here I need to calculate the mean of the columns duration and km for the rows with value == 1 and value == 0.
This time I would like the aggregation period to be flexible.
df
Out[20]:
Date duration km value
0 2015-03-28 09:07:00.800001 0 0 0
1 2015-03-28 09:36:01.819998 1 2 1
2 2015-03-30 09:36:06.839997 1 3 1
3 2015-03-30 09:37:27.659997 nan 5 0
4 2015-04-22 09:51:40.440003 3 7 0
5 2015-04-23 10:15:25.080002 0 nan 1
For the aggregation period of 1 day I can use the solution suggested before:
ndf = df.pivot_table(values=['duration','km'], columns=['value'], index=df['Date'].dt.date, aggfunc='mean')
ndf.columns = [i[0]+str(i[1]) for i in ndf.columns]
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
However, I do not know how to change the aggregation period when, for example, I want to pass it as an argument of a function...
For this reason an approach with pd.Grouper(freq=freq_aggregation), where freq_aggregation = 'D' or '60s', would be preferred...
You can pass a Grouper as the index of the pivot table. Hope this is what you are looking for, i.e.
ndf = df.pivot_table(values=['duration','km'],columns=['value'],index=pd.Grouper(key='Date', freq='60s'),aggfunc='mean')
ndf.columns = [i[0]+str(i[1]) for i in ndf.columns]
Output:
duration0 duration1 km0 km1
Date
2015-03-28 09:07:00 0.0 NaN 0.0 NaN
2015-03-28 09:36:00 NaN 1.0 NaN 2.0
2015-03-30 09:36:00 NaN 1.0 NaN 3.0
2015-03-30 09:37:00 NaN NaN 5.0 NaN
2015-04-22 09:51:00 3.0 NaN 7.0 NaN
2015-04-23 10:15:00 NaN 0.0 NaN NaN
If the frequency is 'D', then:
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
Let's use pd.Grouper, unstack, and a columns map:
freq_str = '60s'
df_out = df.groupby([pd.Grouper(freq=freq_str, key='Date'), 'value'])[['duration','km']].agg('mean').unstack()
df_out.columns = df_out.columns.map('{0[0]}{0[1]}'.format)
df_out
Output:
duration0 duration1 km0 km1
Date
2015-03-28 09:07:00 0.0 NaN 0.0 NaN
2015-03-28 09:36:00 NaN 1.0 NaN 2.0
2015-03-30 09:36:00 NaN 1.0 NaN 3.0
2015-03-30 09:37:00 NaN NaN 5.0 NaN
2015-04-22 09:51:00 3.0 NaN 7.0 NaN
2015-04-23 10:15:00 NaN 0.0 NaN NaN
Now, let's change freq_str to 'D':
freq_str = 'D'
df_out = df.groupby([pd.Grouper(freq=freq_str, key='Date'), 'value'])[['duration','km']].agg('mean').unstack()
df_out.columns = df_out.columns.map('{0[0]}{0[1]}'.format)
print(df_out)
Output:
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
Use groupby (note that pd.TimeGrouper is deprecated; pd.Grouper is the current spelling):
df = df.set_index('Date')
df.groupby([pd.Grouper(freq='D'), 'value']).mean()
duration km
Date value
2017-10-11 0 1.500000 4.0
1 0.666667 2.5
df.groupby([pd.Grouper(freq='60s'), 'value']).mean()
duration km
Date value
2017-10-11 09:07:00 0 0.0 0.0
2017-10-11 09:36:00 1 1.0 2.5
2017-10-11 09:37:00 0 NaN 5.0
2017-10-11 09:51:00 0 3.0 7.0
2017-10-11 10:15:00 1 0.0 NaN
If you want it unstacked, then unstack it:
df.groupby([pd.Grouper(freq='D'), 'value']).mean().unstack()
duration km
value 0 1 0 1
Date
2017-10-11 1.50 0.67 4.00 2.50
I have data like this:
OwnerUserId Score
CreationDate
2015-01-01 00:16:46.963 1491895.0 0.0
2015-01-01 00:23:35.983 1491895.0 1.0
2015-01-01 00:30:55.683 1491895.0 1.0
2015-01-01 01:10:43.830 2141635.0 0.0
2015-01-01 01:11:08.927 1491895.0 1.0
2015-01-01 01:12:34.273 3297613.0 1.0
..........
This is a whole year of data with different users' scores, and I hope to get data like this:
OwnerUserId 1491895.0 1491895.0 1491895.0 2141635.0 1491895.0
00:00 0.0 3.0 0.0 3.0 5.8
00:01 5.0 3.0 0.0 3.0 5.8
00:02 3.0 33.0 20.0 3.0 5.8
......
23:40 12.0 33.0 10.0 3.0 5.8
23:41 32.0 33.0 20.0 3.0 5.8
23:42 12.0 13.0 10.0 3.0 5.8
Each element of the dataframe is the score (mean or sum).
I have tried the following:
pd.pivot_table(data_series.reset_index(), index=['CreationDate'], columns=['OwnerUserId'],
               fill_value=0).resample('W').sum()['Score']
I get a result like the image shows.
I think you need:
# remove the `[]` and pass the values parameter to avoid a MultiIndex in the columns
df = pd.pivot_table(data_series.reset_index(),
index='CreationDate',
columns='OwnerUserId',
values='Score',
fill_value=0)
#truncate seconds and convert to timedeltaindex
df.index = pd.to_timedelta(df.index.floor('T').strftime('%H:%M:%S'))
#or round to minutes
#df.index = pd.to_timedelta(df.index.round('T').strftime('%H:%M:%S'))
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:16:00 0 0 0
00:23:00 1 0 0
00:30:00 1 0 0
01:10:00 0 0 0
01:11:00 1 0 0
01:12:00 0 0 1
idx = pd.timedelta_range('00:00:00', '23:59:00', freq='T')
# resample by minute and aggregate with sum; use reindex to add the missing rows
df = df.resample('T').sum().fillna(0).reindex(idx, fill_value=0)
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:00:00 0.0 0.0 0.0
00:01:00 0.0 0.0 0.0
00:02:00 0.0 0.0 0.0
00:03:00 0.0 0.0 0.0
00:04:00 0.0 0.0 0.0
00:05:00 0.0 0.0 0.0
00:06:00 0.0 0.0 0.0
...
...