I have the below dataframe:
df = pd.DataFrame({'a': [2.85, 3.11, 3.3, 3.275, np.nan, 4.21],
                   'b': [3.65, 3.825, 3.475, np.nan, 4.10, 2.73],
                   'c': [4.3, 3.08, np.nan, 2.40, 3.33, 2.48]},
                  index=pd.date_range('2019-01-01', periods=6, freq='M'))
This gives the dataframe as below:
                a      b     c
2019-01-31  2.850  3.650  4.30
2019-02-28  3.110  3.825  3.08
2019-03-31  3.300  3.475   NaN
2019-04-30  3.275    NaN  2.40
2019-05-31    NaN  4.100  3.33
2019-06-30  4.210  2.730  2.48
Expected:
                    a          b         c
2019-01-31      2.850      3.650      4.30
2019-02-28      3.110      3.825      3.08
2019-03-31      3.300      3.475  **3.69**
2019-04-30      3.275  **3.650**      2.40
2019-05-31  **3.220**      4.100      3.33
2019-06-30      4.210      2.730      2.48
I want to replace the NaN values with the 3-month rolling average. How should I go about this?
If you are happy to count the NaNs as 0 in your means, you can do:
df.fillna(0,inplace=True)
df.rolling(3).mean()
This will give you:
                   a         b         c
2019-01-31       NaN       NaN       NaN
2019-02-28       NaN       NaN       NaN
2019-03-31  3.086667  3.650000  2.460000
2019-04-30  3.228333  2.433333  1.826667
2019-05-31  2.191667  2.525000  1.910000
2019-06-30  2.495000  2.276667  2.736667
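The fillna(0) approach above rewrites every cell and pulls the averages towards zero. If the goal is to touch only the missing cells, one possible sketch is to compute a rolling mean that skips NaNs (min_periods=1) and pass it to fillna. The explicit index and the names rolling_mean and filled are only illustrative, and this assumes the "3 month" window means the current row plus the two before it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [2.85, 3.11, 3.3, 3.275, np.nan, 4.21],
        "b": [3.65, 3.825, 3.475, np.nan, 4.10, 2.73],
        "c": [4.3, 3.08, np.nan, 2.40, 3.33, 2.48],
    },
    index=pd.to_datetime(
        ["2019-01-31", "2019-02-28", "2019-03-31",
         "2019-04-30", "2019-05-31", "2019-06-30"]
    ),
)

# Mean over the current row and the two before it; NaNs inside the
# window are ignored because min_periods=1 allows partial windows.
rolling_mean = df.rolling(3, min_periods=1).mean()

# Only the NaN cells are replaced; existing values are left untouched.
filled = df.fillna(rolling_mean)
print(filled)
```

On this data the filled cells come out as 3.69 for c and 3.65 for b, matching the expected table, and 3.2875 for a, which differs slightly from the 3.220 shown there.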
I have a dataframe named zz.
Its column names are ['Ancolmekar','Cidurian','Dayeuhkolot','Hantap','Kertasari','Meteolembang','Sapan']
for col in zz.columns:
    df = pd.DataFrame(zz[col],index=pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
    df.resample('1M').mean()
error : invalid syntax
I want the monthly mean of data recorded at 10-minute intervals. When I run this, only the Sapan column appears, and it is all NaN. Earlier I replaced the data with 1 where it was NaN and 0 otherwise.
Sapan
2017-01-31 NaN
2017-02-28 NaN
2017-03-31 NaN
2017-04-30 NaN
2017-05-31 NaN
2017-06-30 NaN
2017-07-31 NaN
2017-08-31 NaN
2017-09-30 NaN
2017-10-31 NaN
2017-11-30 NaN
2017-12-31 NaN
2018-01-31 NaN
2018-02-28 NaN
2018-03-31 NaN
2018-04-30 NaN
2018-05-31 NaN
2018-06-30 NaN
2018-07-31 NaN
2018-08-31 NaN
2018-09-30 NaN
2018-10-31 NaN
2018-11-30 NaN
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 NaN
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 NaN
2021-02-28 NaN
2021-03-31 NaN
2021-04-30 NaN
2021-05-31 NaN
2021-06-30 NaN
2021-07-31 NaN
2021-08-31 NaN
2021-09-30 NaN
2021-10-31 NaN
2021-11-30 NaN
2021-12-31 NaN
What should I do? Thanks in advance.
You are re-assigning the variable df to a single-column dataframe on each pass through the for loop. The last column is Sapan, so only that column is shown.
Additionally, you are setting an index on df that probably isn't the index of zz, so the labels that don't match come back as NaN (Not a Number).
If the rows of zz correspond to the timestamps you are setting, this should work:
df = zz.copy()
df['new_column'] = pd.Series(pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
df = df.set_index('new_column')
df.resample('1M').mean()
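As a rough sketch of the same idea on a frame you can run yourself: the zz below is a small hypothetical stand-in (two columns, 59 days of 10-minute rows) whose default integer index is replaced by the timestamps. This only works when the number of rows equals the length of the date range. I use 'MS' (month start) and '10min' here because those aliases are accepted across pandas versions; the month-end alias is 'M' or 'ME' depending on the version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for zz: 59 days of 10-minute readings
# (all of January and February 2017) with a default integer index.
n = 6 * 24 * 59
zz = pd.DataFrame({
    "Ancolmekar": np.arange(n, dtype=float),
    "Sapan": np.ones(n),
})

# Attach the timestamps as the index; the lengths must match exactly.
zz.index = pd.date_range("2017-01-01 00:00:00", periods=n, freq="10min")

# One resample call handles every column at once, so no loop is needed.
monthly = zz.resample("MS").mean()
print(monthly)
```

Because the index is assigned rather than passed to the DataFrame constructor, nothing is re-aligned and no values turn into NaN.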
I have this df:
Week U.S. 30 yr FRM U.S. 15 yr FRM
0 2014-12-31 3.87 3.15
1 2015-01-01 NaN NaN
2 2015-01-02 NaN NaN
3 2015-01-03 NaN NaN
4 2015-01-04 NaN NaN
... ... ... ...
2769 2022-07-31 NaN NaN
2770 2022-08-01 NaN NaN
2771 2022-08-02 NaN NaN
2772 2022-08-03 NaN NaN
2773 2022-08-04 4.99 4.26
And when I try to run this interpolation:
pmms_df.interpolate(method = 'nearest', inplace = True)
I get ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got nearest
I read in this post that pandas interpolate doesn't do well with the time columns, so I tried this:
pmms_df[['U.S. 30 yr FRM', 'U.S. 15 yr FRM']].interpolate(method = 'nearest', inplace = True)
but the output is exactly the same as before the interpolation.
It may not work great with date columns, but it works well with a datetime index, which is probably what you should be using here:
df = df.set_index('Week')
df = df.interpolate(method='nearest')
print(df)
# Output:
U.S. 30 yr FRM U.S. 15 yr FRM
Week
2014-12-31 3.87 3.15
2015-01-01 3.87 3.15
2015-01-02 3.87 3.15
2015-01-03 3.87 3.15
2015-01-04 3.87 3.15
... ... ...
2022-07-31 4.99 4.26
2022-08-01 4.99 4.26
2022-08-02 4.99 4.26
2022-08-03 4.99 4.26
2022-08-04 4.99 4.26
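Two side notes, offered as a sketch rather than a definitive fix. First, the second attempt in the question changed nothing because pmms_df[[...]] returns a new frame, so inplace=True modified that temporary copy; assigning the result back avoids this. Second, interpolate(method='nearest') relies on SciPy. If that is not available, a similar nearest-neighbour fill can be done with reindex(method='nearest') on the known rows. The six-day frame below is made up for illustration, and dropna(how='all') assumes both rate columns are missing on the same rows:

```python
import numpy as np
import pandas as pd

# Made-up, shortened version of the frame in the question.
pmms_df = pd.DataFrame({
    "Week": pd.date_range("2015-01-01", periods=6, freq="D"),
    "U.S. 30 yr FRM": [3.87, np.nan, np.nan, np.nan, np.nan, 4.99],
    "U.S. 15 yr FRM": [3.15, np.nan, np.nan, np.nan, np.nan, 4.26],
})

df = pmms_df.set_index("Week")

# Keep only the rows that carry data, then look up the nearest
# known date for every date in the full index.
known = df.dropna(how="all")
filled = known.reindex(df.index, method="nearest")
print(filled)
```

Here the first three days take the 2015-01-01 values and the last three take the 2015-01-06 values.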
How can I create empty rows at 7-day steps going back from 2016-01-01 to January 2015? I tried reindexing.
df
date value
0 2016-01-01 4.0
1 2016-01-08 5.0
2 2016-01-15 1.0
Expected Output
date value
2015-01-02 NaN
....
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
First create DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
Then use DataFrame.reindex with a date_range running from your chosen start date to the minimal index value, combined via Index.union so the original index values are not lost:
rng = pd.date_range('2015-01-02', df.index.min(), freq='7d').union(df.index)
df = df.reindex(rng)
print (df)
value
2015-01-02 NaN
2015-01-09 NaN
2015-01-16 NaN
2015-01-23 NaN
2015-01-30 NaN
2015-02-06 NaN
2015-02-13 NaN
2015-02-20 NaN
2015-02-27 NaN
2015-03-06 NaN
2015-03-13 NaN
2015-03-20 NaN
2015-03-27 NaN
2015-04-03 NaN
2015-04-10 NaN
2015-04-17 NaN
2015-04-24 NaN
2015-05-01 NaN
2015-05-08 NaN
2015-05-15 NaN
2015-05-22 NaN
2015-05-29 NaN
2015-06-05 NaN
2015-06-12 NaN
2015-06-19 NaN
2015-06-26 NaN
2015-07-03 NaN
2015-07-10 NaN
2015-07-17 NaN
2015-07-24 NaN
2015-07-31 NaN
2015-08-07 NaN
2015-08-14 NaN
2015-08-21 NaN
2015-08-28 NaN
2015-09-04 NaN
2015-09-11 NaN
2015-09-18 NaN
2015-09-25 NaN
2015-10-02 NaN
2015-10-09 NaN
2015-10-16 NaN
2015-10-23 NaN
2015-10-30 NaN
2015-11-06 NaN
2015-11-13 NaN
2015-11-20 NaN
2015-11-27 NaN
2015-12-04 NaN
2015-12-11 NaN
2015-12-18 NaN
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
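If you would rather not work out the start date by hand, a possible variation is to count back from the first existing date with date_range(end=..., periods=...). The name weeks_back is a hypothetical parameter introduced for this sketch; 52 happens to land on 2015-01-02 for this data:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-08", "2016-01-15"],
    "value": [4.0, 5.0, 1.0],
})
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

weeks_back = 52  # hypothetical: how many 7-day steps to add before the first row

# Dates at 7-day steps ending on the first existing date (inclusive).
earlier = pd.date_range(end=df.index.min(), periods=weeks_back + 1, freq="7D")

# The union keeps the original dates so reindex does not drop them.
out = df.reindex(earlier.union(df.index))
print(out.head())
print(out.tail())
```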
I would like to resample df by creating monthly data for all columns and filling in missing values with 0, within the time frame of say 2019-01-01 to 2019-12-31.
df:
ITEM_ID Date Value YearMonth
0 101002 2019-03-31 1.0 2019-03
1 101002 2019-04-30 1.0 2019-04
2 101002 2019-10-31 0.0 2019-10
3 101002 2019-11-30 8.0 2019-11
4 101002 2019-12-31 5.0 2019-12
Expected output:
ITEM_ID Date Value YearMonth
... 0 2019-01 (added)
... 0 2019-02 (added)
0 101002 2019-03-31 1.0 2019-03
1 101002 2019-04-30 1.0 2019-04
... 0 2019-05 (added)
... 0 2019-06 (added)
... 0 2019-07 (added)
... 0 2019-08 (added)
... 0 2019-09 (added)
2 101002 2019-10-31 0.0 2019-10
3 101002 2019-11-30 8.0 2019-11
4 101002 2019-12-31 5.0 2019-12
I came across a few methods like multiindex and resample. multiindex seems to be versatile but gets a bit complicated when it involves different levels of indexes; I am not sure if resample allows me to extend the effect to specified time frame. What is the best way to do it?
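I think you need DataFrame.reindex over a MultiIndex of every ITEM_ID and every month start. Note that the range below runs through 2020-01-01, so the output has one extra row; end it at '2019-12-01' to stay within 2019: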
df['YearMonth'] = pd.to_datetime(df['YearMonth'])
r = pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1MS'))
mux = pd.MultiIndex.from_product([df['ITEM_ID'].unique(), r], names=['ITEM_ID','YearMonth'])
df = df.set_index(['ITEM_ID','YearMonth']).reindex(mux).fillna({'Value':0}).reset_index().reindex(df.columns, axis=1)
print (df)
ITEM_ID Date Value YearMonth
0 101002 NaN 0.0 2019-01-01
1 101002 NaN 0.0 2019-02-01
2 101002 2019-03-31 1.0 2019-03-01
3 101002 2019-04-30 1.0 2019-04-01
4 101002 NaN 0.0 2019-05-01
5 101002 NaN 0.0 2019-06-01
6 101002 NaN 0.0 2019-07-01
7 101002 NaN 0.0 2019-08-01
8 101002 NaN 0.0 2019-09-01
9 101002 2019-10-31 0.0 2019-10-01
10 101002 2019-11-30 8.0 2019-11-01
11 101002 2019-12-31 5.0 2019-12-01
12 101002 NaN 0.0 2020-01-01
Here is a solution using an outer merge against a frame that lists every month:
import pandas as pd
df1 = df.copy()  # the dataframe from the question; change accordingly
print(df1)
data=[['2019-01'],['2019-02'],['2019-03'],['2019-04'],['2019-05'],['2019-06'],['2019-07'],['2019-08'],
['2019-09'],['2019-10'],['2019-11'],['2019-12']]
df2=pd.DataFrame(data=data,columns=['YearMonth'])
print(df2)
final_DF = pd.merge(df1,df2,on ='YearMonth',how ='outer').sort_values('YearMonth')
final_DF = final_DF.fillna(0)  # note: this fills every column with 0, including ITEM_ID and Date
print(final_DF)
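For a version of the merge approach above that can be run as-is, here is a minimal sketch. It assumes YearMonth is stored as 'YYYY-MM' strings and that there is a single ITEM_ID (with several items you would need one block of months per item). The names months and out are just for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    "ITEM_ID": [101002] * 5,
    "Date": pd.to_datetime(
        ["2019-03-31", "2019-04-30", "2019-10-31", "2019-11-30", "2019-12-31"]
    ),
    "Value": [1.0, 1.0, 0.0, 8.0, 5.0],
    "YearMonth": ["2019-03", "2019-04", "2019-10", "2019-11", "2019-12"],
})

# Every month of 2019 as 'YYYY-MM' strings.
months = pd.DataFrame({
    "YearMonth": pd.period_range("2019-01", "2019-12", freq="M").strftime("%Y-%m")
})

# A left merge keeps one row per month, in calendar order.
out = months.merge(df1, on="YearMonth", how="left")

# Fill only the columns where a default makes sense; Date stays NaT.
out["Value"] = out["Value"].fillna(0)
out["ITEM_ID"] = out["ITEM_ID"].fillna(df1["ITEM_ID"].iloc[0]).astype(int)
out = out[df1.columns]
print(out)
```

Filling column by column keeps Date as a proper datetime column instead of mixing in zeros.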
Instead of working with year and month columns, this creates an empty data frame spanning the start and end dates and merges it with the original data frame.
df['Date'] = pd.to_datetime(df['Date'])
df1 = pd.DataFrame(index=pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1M'))).reset_index()
df1 = df1.merge(df, left_on='index', right_on='Date', how='outer')
df1['yearmonth'] = df1['index'].apply(lambda x: str(x.year) + '-' + '{:02}'.format(x.month))
df1
index ITEM_ID Date Value YearMonth yearmonth
0 2019-01-31 NaN NaT NaN NaN 2019-01
1 2019-02-28 NaN NaT NaN NaN 2019-02
2 2019-03-31 101002.0 2019-03-31 1.0 2019-03 2019-03
3 2019-04-30 101002.0 2019-04-30 1.0 2019-04 2019-04
4 2019-05-31 NaN NaT NaN NaN 2019-05
5 2019-06-30 NaN NaT NaN NaN 2019-06
6 2019-07-31 NaN NaT NaN NaN 2019-07
7 2019-08-31 NaN NaT NaN NaN 2019-08
8 2019-09-30 NaN NaT NaN NaN 2019-09
9 2019-10-31 101002.0 2019-10-31 0.0 2019-10 2019-10
10 2019-11-30 101002.0 2019-11-30 8.0 2019-11 2019-11
11 2019-12-31 101002.0 2019-12-31 5.0 2019-12 2019-12
I have several .csv files which I import via pandas and then summarise (min, max, mean), ideally as weekly and monthly reports. I have the following code but cannot get the monthly summary to work; I suspect the problem is with the timestamp conversion.
What am I doing wrong?
import pandas as pd
import numpy as np
#Format of the data that is being imported
#2017-05-11 18:29:14+00:00,264.0,987.99,26.5,23.70,512.0,11.763,52.31
df = pd.read_csv('data.csv')
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
print 'month info'
print [g for n, g in df.groupby(pd.Grouper(key='timestamp',freq='M'))]
print(data.groupby('timestamp')['light'].mean())
IIUC, you almost have it, and your datetime conversion is fine. Here is an example:
Starting from a dataframe like this (which is your example row, duplicated with slight modifications):
>>> df
time x y z a b c d
0 2017-05-11 18:29:14+00:00 264.0 947.99 24.5 53.7 511.0 11.463 12.31
1 2017-05-15 18:29:14+00:00 265.0 957.99 25.5 43.7 512.0 11.563 22.31
2 2017-05-21 18:29:14+00:00 266.0 967.99 26.5 33.7 513.0 11.663 32.31
3 2017-06-11 18:29:14+00:00 267.0 977.99 26.5 23.7 514.0 11.763 42.31
4 2017-06-22 18:29:14+00:00 268.0 997.99 27.5 13.7 515.0 11.800 52.31
You can do what you did before with your datetime:
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
And then get your summaries either separately:
monthly_mean = df.groupby(pd.Grouper(key='timestamp',freq='M')).mean()
monthly_max = df.groupby(pd.Grouper(key='timestamp',freq='M')).max()
monthly_min = df.groupby(pd.Grouper(key='timestamp',freq='M')).min()
weekly_mean = df.groupby(pd.Grouper(key='timestamp',freq='W')).mean()
weekly_min = df.groupby(pd.Grouper(key='timestamp',freq='W')).min()
weekly_max = df.groupby(pd.Grouper(key='timestamp',freq='W')).max()
# Examples:
>>> monthly_mean
x y z a b c d
timestamp
2017-05-31 265.0 957.99 25.5 43.7 512.0 11.5630 22.31
2017-06-30 267.5 987.99 27.0 18.7 514.5 11.7815 47.31
>>> weekly_mean
x y z a b c d
timestamp
2017-05-14 264.0 947.99 24.5 53.7 511.0 11.463 12.31
2017-05-21 265.5 962.99 26.0 38.7 512.5 11.613 27.31
2017-05-28 NaN NaN NaN NaN NaN NaN NaN
2017-06-04 NaN NaN NaN NaN NaN NaN NaN
2017-06-11 267.0 977.99 26.5 23.7 514.0 11.763 42.31
2017-06-18 NaN NaN NaN NaN NaN NaN NaN
2017-06-25 268.0 997.99 27.5 13.7 515.0 11.800 52.31
Or aggregate them all together to get a multi-indexed dataframe with your summaries:
monthly_summary = df.groupby(pd.Grouper(key='timestamp',freq='M')).agg(['mean', 'min', 'max'])
weekly_summary = df.groupby(pd.Grouper(key='timestamp',freq='W')).agg(['mean', 'min', 'max'])
# Example of summary of column 'x':
>>> monthly_summary['x']
mean min max
timestamp
2017-05-31 265.0 264.0 266.0
2017-06-30 267.5 267.0 268.0
>>> weekly_summary['x']
mean min max
timestamp
2017-05-14 264.0 264.0 264.0
2017-05-21 265.5 265.0 266.0
2017-05-28 NaN NaN NaN
2017-06-04 NaN NaN NaN
2017-06-11 267.0 267.0 267.0
2017-06-18 NaN NaN NaN
2017-06-25 268.0 268.0 268.0
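On recent pandas versions the calls above may need small adjustments, so here is a hedged, self-contained sketch of the same summaries. The assumptions: the raw string column time is dropped before aggregating (newer pandas raises on a mean over text columns instead of silently skipping them), the timestamps are parsed with utc=True because they carry a +00:00 offset, and 'MS' (month start) is used as the monthly alias since it is accepted across versions, so the monthly labels are the first day of each month rather than the last. Only the x and y columns from the example are reproduced:

```python
import pandas as pd

df = pd.DataFrame({
    "time": [
        "2017-05-11 18:29:14+00:00",
        "2017-05-15 18:29:14+00:00",
        "2017-05-21 18:29:14+00:00",
        "2017-06-11 18:29:14+00:00",
        "2017-06-22 18:29:14+00:00",
    ],
    "x": [264.0, 265.0, 266.0, 267.0, 268.0],
    "y": [947.99, 957.99, 967.99, 977.99, 997.99],
})

# Parse the offset-aware strings into UTC timestamps.
df["timestamp"] = pd.to_datetime(df["time"], utc=True)

# Keep only the timestamp and the numeric columns.
numeric = df.drop(columns="time")

monthly_summary = numeric.groupby(
    pd.Grouper(key="timestamp", freq="MS")
).agg(["mean", "min", "max"])

weekly_summary = numeric.groupby(
    pd.Grouper(key="timestamp", freq="W")
).agg(["mean", "min", "max"])

print(monthly_summary["x"])
print(weekly_summary["x"])
```

Weeks with no readings still appear in weekly_summary as all-NaN rows; chaining .dropna(how="all") removes them if they are not wanted.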