I have this DataFrame.
timestamp Val1
2020-04-02 06:44:00 NaN
2020-04-03 16:52:00 NaN
2020-04-03 16:53:00 NaN
2020-04-03 16:54:00 NaN
2020-04-03 16:55:00 NaN
2020-04-17 02:03:00 NaN
2020-04-17 02:04:00 NaN
2020-04-17 02:05:00 NaN
2020-04-17 02:06:00 NaN
I am trying to split it into groups based on runs of consecutive minutes. Rows that are more than 1 minute apart must not end up in the same group.
So the output will be like this:
#Group 1
timestamp Val1
2020-04-02 06:44:00 NaN
#Group 2
timestamp Val1
2020-04-03 16:52:00 NaN
2020-04-03 16:53:00 NaN
2020-04-03 16:54:00 NaN
2020-04-03 16:55:00 NaN
#Group 3
timestamp Val1
2020-04-17 02:03:00 NaN
2020-04-17 02:04:00 NaN
2020-04-17 02:05:00 NaN
2020-04-17 02:06:00 NaN
So far I can only get the min and max over all the data, which is not what I want.
Take the difference between consecutive rows and check whether it is above your desired difference ('1min'). Taking the cumsum of this Boolean Series creates the grouping label. I've assigned it to a column here for illustration.
#df['timestamp'] = pd.to_datetime(df['timestamp'])
df['group'] = df['timestamp'].diff().gt('1min').cumsum()
timestamp Val1 group
0 2020-04-02 06:44:00 NaN 0
1 2020-04-03 16:52:00 NaN 1
2 2020-04-03 16:53:00 NaN 1
3 2020-04-03 16:54:00 NaN 1
4 2020-04-03 16:55:00 NaN 1
5 2020-04-17 02:03:00 NaN 2
6 2020-04-17 02:04:00 NaN 2
7 2020-04-17 02:05:00 NaN 2
8 2020-04-17 02:06:00 NaN 2
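Once the label exists, it can be passed to groupby to get the separate pieces or a per-group summary. A minimal sketch, assuming the sample data from the question; the names `labels`, `groups` and `summary` are only illustrative:

```python
import pandas as pd

# Sample data from the question (Val1 is all NaN there).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-04-02 06:44:00",
        "2020-04-03 16:52:00", "2020-04-03 16:53:00",
        "2020-04-03 16:54:00", "2020-04-03 16:55:00",
        "2020-04-17 02:03:00", "2020-04-17 02:04:00",
        "2020-04-17 02:05:00", "2020-04-17 02:06:00",
    ]),
    "Val1": float("nan"),
})

# A new group starts wherever the gap to the previous row exceeds 1 minute.
labels = df["timestamp"].diff().gt(pd.Timedelta("1min")).cumsum().rename("group")

# One DataFrame per run of consecutive minutes.
groups = [part for _, part in df.groupby(labels)]

# First timestamp, last timestamp and row count of every run.
summary = df.groupby(labels)["timestamp"].agg(["min", "max", "count"])
print(summary)
```

Keeping the label as a separate Series leaves the original frame untouched; assigning it to a column works just as well.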
I have a DataFrame named zz.
Its column names are ['Ancolmekar','Cidurian','Dayeuhkolot','Hantap','Kertasari','Meteolembang','Sapan']
for col in zz.columns:
    df = pd.DataFrame(zz[col],index=pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
    df.resample('1M').mean()
Error: invalid syntax
I want the monthly mean of data recorded at 10-minute intervals. When I run this, only the Sapan values appear, and they are all NaN. Before this, I replaced NaN values with 1 and everything else with 0.
Sapan
2017-01-31 NaN
2017-02-28 NaN
2017-03-31 NaN
2017-04-30 NaN
2017-05-31 NaN
2017-06-30 NaN
2017-07-31 NaN
2017-08-31 NaN
2017-09-30 NaN
2017-10-31 NaN
2017-11-30 NaN
2017-12-31 NaN
2018-01-31 NaN
2018-02-28 NaN
2018-03-31 NaN
2018-04-30 NaN
2018-05-31 NaN
2018-06-30 NaN
2018-07-31 NaN
2018-08-31 NaN
2018-09-30 NaN
2018-10-31 NaN
2018-11-30 NaN
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 NaN
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 NaN
2021-02-28 NaN
2021-03-31 NaN
2021-04-30 NaN
2021-05-31 NaN
2021-06-30 NaN
2021-07-31 NaN
2021-08-31 NaN
2021-09-30 NaN
2021-10-31 NaN
2021-11-30 NaN
2021-12-31 NaN
What should I do? Thanks in advance.
You are reassigning the variable df to a single-column DataFrame on each pass through the for loop. The last column is Sapan, so only that column is shown.
Additionally, you are setting an index on df that probably isn't the index of zz, so you get NaN (Not a Number) for the non-existent values.
If the rows of zz correspond to the index you are setting, this should work:
df = zz.copy()
df['new_column'] = pd.Series(pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
df = df.set_index('new_column')
df.resample('1M').mean()
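The loop is not needed at all, because resample works on every column at once. A small sketch under assumptions: `zz` here is made-up data covering only three months and two of the columns, and the frequency aliases are written as '10min' and 'MS' (month start) because '10T' and 'M' are deprecated in recent pandas releases:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for zz: 10-minute data with a default RangeIndex.
idx = pd.date_range("2017-01-01 00:00:00", "2017-03-31 23:50:00", freq="10min")
zz = pd.DataFrame({
    "Ancolmekar": np.arange(len(idx), dtype=float),
    "Sapan": 1.0,
})

# Attach the timestamps by position, so no values are lost to index alignment.
zz.index = idx

# Monthly mean of every column in a single call, labelled by month start.
monthly = zz.resample("MS").mean()
print(monthly)
```

Passing `index=` to the DataFrame constructor, as in the question, reindexes the existing data instead of relabelling it, which is where the NaN values come from.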
Is there a way using pandas to compute rolling sum over 12 months as opposed to 365 days? (which is suggested here)
For the sake of simplicity I do it here just for one group, but the code should work with multiple groups. The true data spans about a century, so the one day errors add up.
Example
Note the switch from the end of month to beginning of month around the end of 2020.
data = {'date': ['2020-01-31', '2020-02-29', '2020-03-31',
'2020-04-30', '2020-05-31', '2020-06-30',
'2020-07-31', '2020-08-31', '2020-09-30',
'2020-10-31', '2020-11-30', '2020-12-31',
'2021-01-01'],
'values': [1, 1,1,1,1,1,1,1,1,1,1,1,1],
'group': [1, 1,1,1,1,1,1,1,1,1,1,1,1]}
df = pd.DataFrame(data, columns=['date', 'values', 'group'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').set_index('date')
df.groupby('group').rolling("365d", min_periods=12).sum()[['values']]
The input
date values group
0 2020-01-31 1 1
1 2020-02-29 1 1
2 2020-03-31 1 1
3 2020-04-30 1 1
4 2020-05-31 1 1
5 2020-06-30 1 1
6 2020-07-31 1 1
7 2020-08-31 1 1
8 2020-09-30 1 1
9 2020-10-31 1 1
10 2020-11-30 1 1
11 2020-12-31 1 1
12 2021-01-01 1 1
is transformed to
values
group date
1 2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 12.0
2021-01-01 13.0
The desired output is
values
group date
1 2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 12.0
2021-01-01 12.0
Edit
I also need to account for the case that certain months are missing so that the sum starts from scratch. Note that February 2020 is missing in the following example:
data = {'date': ['2020-01-31', '2020-03-31',
'2020-04-30', '2020-05-31', '2020-06-30',
'2020-07-31', '2020-08-31', '2020-09-30',
'2020-10-31', '2020-11-30', '2020-12-31',
'2021-01-31', '2021-02-28', '2021-03-31' ],
'values': [1, 1,1,1,1,1,1,1,1,1,1,1,1, 1],
'group': [1, 1,1,1,1,1,1,1,1,1,1,1,1, 1]}
df = pd.DataFrame(data, columns=['date', 'values', 'group'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').set_index('date')
df.groupby('group').rolling(12, min_periods=12).sum()[['values']]
Output:
values
group date
1 2020-01-31 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 12.0
2021-02-28 12.0
2021-03-31 12.0
Desired output:
values
group date
1 2020-01-31 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 NaN
2021-02-28 NaN
2021-03-31 12.0
How can I create empty rows at 7-day steps before 2016-01-01, going back to January 2015? I tried reindexing.
df
date value
0 2016-01-01 4.0
1 2016-01-08 5.0
2 2016-01-15 1.0
Expected Output
date value
2015-01-02 NaN
....
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
First create a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
Then use DataFrame.reindex with a date_range that runs from your chosen start date to the minimal index value, combined with the original index via Index.union so that no original index values are lost:
rng = pd.date_range('2015-01-02', df.index.min(), freq='7d').union(df.index)
df = df.reindex(rng)
print (df)
value
2015-01-02 NaN
2015-01-09 NaN
2015-01-16 NaN
2015-01-23 NaN
2015-01-30 NaN
2015-02-06 NaN
2015-02-13 NaN
2015-02-20 NaN
2015-02-27 NaN
2015-03-06 NaN
2015-03-13 NaN
2015-03-20 NaN
2015-03-27 NaN
2015-04-03 NaN
2015-04-10 NaN
2015-04-17 NaN
2015-04-24 NaN
2015-05-01 NaN
2015-05-08 NaN
2015-05-15 NaN
2015-05-22 NaN
2015-05-29 NaN
2015-06-05 NaN
2015-06-12 NaN
2015-06-19 NaN
2015-06-26 NaN
2015-07-03 NaN
2015-07-10 NaN
2015-07-17 NaN
2015-07-24 NaN
2015-07-31 NaN
2015-08-07 NaN
2015-08-14 NaN
2015-08-21 NaN
2015-08-28 NaN
2015-09-04 NaN
2015-09-11 NaN
2015-09-18 NaN
2015-09-25 NaN
2015-10-02 NaN
2015-10-09 NaN
2015-10-16 NaN
2015-10-23 NaN
2015-10-30 NaN
2015-11-06 NaN
2015-11-13 NaN
2015-11-20 NaN
2015-11-27 NaN
2015-12-04 NaN
2015-12-11 NaN
2015-12-18 NaN
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
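The start date does not have to be typed in by hand. A sketch, assuming the sample frame from the question, that steps back from the first existing date in 7-day increments; `periods=53` is an assumption chosen so that the range reaches 2015-01-02:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2016-01-01", "2016-01-08", "2016-01-15"],
                   "value": [4.0, 5.0, 1.0]})
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# 53 weekly stamps ending at the first existing date: 52 steps of 7 days back.
back = pd.date_range(end=df.index.min(), periods=53, freq="7D")

# The union keeps the original dates, so their values survive the reindex.
out = df.reindex(back.union(df.index))
print(out.head(3))
print(out.tail(4))
```

Anchoring the range at `df.index.min()` guarantees that the new rows stay on the same weekly grid as the existing ones.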
I would like to resample df by creating monthly data for all columns and filling in missing values with 0, within the time frame of say 2019-01-01 to 2019-12-31.
df:
ITEM_ID Date Value YearMonth
0 101002 2019-03-31 1.0 2019-03
1 101002 2019-04-30 1.0 2019-04
2 101002 2019-10-31 0.0 2019-10
3 101002 2019-11-30 8.0 2019-11
4 101002 2019-12-31 5.0 2019-12
Expected output:
ITEM_ID Date Value YearMonth
... 0 2019-01 (added)
... 0 2019-02 (added)
0 101002 2019-03-31 1.0 2019-03
1 101002 2019-04-30 1.0 2019-04
... 0 2019-05 (added)
... 0 2019-06 (added)
... 0 2019-07 (added)
... 0 2019-08 (added)
... 0 2019-09 (added)
2 101002 2019-10-31 0.0 2019-10
3 101002 2019-11-30 8.0 2019-11
4 101002 2019-12-31 5.0 2019-12
I came across a few methods like multiindex and resample. multiindex seems to be versatile but gets a bit complicated when it involves different levels of indexes; I am not sure if resample allows me to extend the effect to specified time frame. What is the best way to do it?
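I think you need DataFrame.reindex. Note that the date range below ends at '2020-01-01', so the result contains one extra row for January 2020; end the range at '2019-12-01' to stay within 2019: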
df['YearMonth'] = pd.to_datetime(df['YearMonth'])
r = pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1MS'))
mux = pd.MultiIndex.from_product([df['ITEM_ID'].unique(), r], names=['ITEM_ID','YearMonth'])
df = df.set_index(['ITEM_ID','YearMonth']).reindex(mux).fillna({'Value':0}).reset_index().reindex(df.columns, axis=1)
print (df)
ITEM_ID Date Value YearMonth
0 101002 NaN 0.0 2019-01-01
1 101002 NaN 0.0 2019-02-01
2 101002 2019-03-31 1.0 2019-03-01
3 101002 2019-04-30 1.0 2019-04-01
4 101002 NaN 0.0 2019-05-01
5 101002 NaN 0.0 2019-06-01
6 101002 NaN 0.0 2019-07-01
7 101002 NaN 0.0 2019-08-01
8 101002 NaN 0.0 2019-09-01
9 101002 2019-10-31 0.0 2019-10-01
10 101002 2019-11-30 8.0 2019-11-01
11 101002 2019-12-31 5.0 2019-12-01
12 101002 NaN 0.0 2020-01-01
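For reference, the same approach restricted to the months of 2019 can be run end to end. This is a sketch that assumes the sample frame from the question and spells out the '%Y-%m' format explicitly:

```python
import pandas as pd

df = pd.DataFrame({
    "ITEM_ID": [101002] * 5,
    "Date": pd.to_datetime(["2019-03-31", "2019-04-30", "2019-10-31",
                            "2019-11-30", "2019-12-31"]),
    "Value": [1.0, 1.0, 0.0, 8.0, 5.0],
    "YearMonth": ["2019-03", "2019-04", "2019-10", "2019-11", "2019-12"],
})

df["YearMonth"] = pd.to_datetime(df["YearMonth"], format="%Y-%m")

# Month starts from January to December 2019 only.
months = pd.date_range("2019-01-01", "2019-12-01", freq="MS")

# Every (ITEM_ID, month) combination that should exist in the result.
mux = pd.MultiIndex.from_product([df["ITEM_ID"].unique(), months],
                                 names=["ITEM_ID", "YearMonth"])

out = (df.set_index(["ITEM_ID", "YearMonth"])
         .reindex(mux)                 # adds the missing months as NaN rows
         .fillna({"Value": 0})         # only Value is filled, Date stays missing
         .reset_index()
         .reindex(df.columns, axis=1)) # restore the original column order
print(out)
```

Because the MultiIndex is built from the product of items and months, the same code covers frames with several ITEM_ID values.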
Here is a solution:
import pandas as pd
df1 = df  # the DataFrame from the question; change accordingly
print(df1)
# one row per month of 2019
data = [['2019-01'], ['2019-02'], ['2019-03'], ['2019-04'], ['2019-05'], ['2019-06'], ['2019-07'], ['2019-08'],
['2019-09'], ['2019-10'], ['2019-11'], ['2019-12']]
df2 = pd.DataFrame(data=data, columns=['YearMonth'])
print(df2)
# the outer merge keeps every month; months missing from df1 get NaN
final_DF = pd.merge(df1, df2, on='YearMonth', how='outer').sort_values('YearMonth')
# note: fillna(0) fills every column, including ITEM_ID and Date
final_DF = final_DF.fillna(0)
print(final_DF)
Instead of thinking in terms of year and month columns, create an empty data frame that spans the start and end dates and merge it with the original data frame.
df['Date'] = pd.to_datetime(df['Date'])
df1 = pd.DataFrame(index=pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1M'))).reset_index()
df1 = df1.merge(df, left_on='index', right_on='Date', how='outer')
df1['yearmonth'] = df1['index'].apply(lambda x: str(x.year) + '-' + '{:02}'.format(x.month))
df1
index ITEM_ID Date Value YearMonth yearmonth
0 2019-01-31 NaN NaT NaN NaN 2019-01
1 2019-02-28 NaN NaT NaN NaN 2019-02
2 2019-03-31 101002.0 2019-03-31 1.0 2019-03 2019-03
3 2019-04-30 101002.0 2019-04-30 1.0 2019-04 2019-04
4 2019-05-31 NaN NaT NaN NaN 2019-05
5 2019-06-30 NaN NaT NaN NaN 2019-06
6 2019-07-31 NaN NaT NaN NaN 2019-07
7 2019-08-31 NaN NaT NaN NaN 2019-08
8 2019-09-30 NaN NaT NaN NaN 2019-09
9 2019-10-31 101002.0 2019-10-31 0.0 2019-10 2019-10
10 2019-11-30 101002.0 2019-11-30 8.0 2019-11 2019-11
11 2019-12-31 101002.0 2019-12-31 5.0 2019-12 2019-12
I set up a new data frame SimMean:
columns = ['Tenor','5x16', '7x8', '2x16H']
index = range(0,12)
SimMean = pd.DataFrame(index=index, columns=columns)
SimMean
Tenor 5x16 7x8 2x16H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
I have another data frame FwdDf:
FwdDf
Tenor 5x16 7x8 2x16H
0 2017-01-01 50.94 34.36 43.64
1 2017-02-01 50.90 32.60 42.68
2 2017-03-01 42.66 26.26 37.26
3 2017-04-01 37.08 22.65 32.46
4 2017-05-01 42.21 20.94 33.28
5 2017-06-01 39.30 22.05 32.29
6 2017-07-01 50.90 21.80 38.51
7 2017-08-01 42.77 23.64 35.07
8 2017-09-01 37.45 19.61 32.68
9 2017-10-01 37.55 21.75 32.10
10 2017-11-01 35.61 22.73 32.90
11 2017-12-01 40.16 29.79 37.49
12 2018-01-01 53.45 36.09 47.61
13 2018-02-01 52.89 35.74 45.00
14 2018-03-01 44.67 27.79 38.62
15 2018-04-01 38.48 24.21 34.43
16 2018-05-01 43.87 22.17 34.69
17 2018-06-01 40.24 22.85 34.31
18 2018-07-01 49.98 23.58 39.96
19 2018-08-01 45.57 24.76 37.23
20 2018-09-01 38.90 21.74 34.22
21 2018-10-01 39.75 23.36 35.20
22 2018-11-01 38.04 24.20 34.62
23 2018-12-01 42.68 31.03 40.00
Now I need to assign the 'Tenor' data from rows 12 to 23 of FwdDf to the new data frame SimMean.
I used
SimMean.loc[0:11,'Tenor'] = FwdDf.loc[12:23,'Tenor']
but it didn't work:
SimMean
Tenor 5x16 7x8 2x16H
0 None NaN NaN NaN
1 None NaN NaN NaN
2 None NaN NaN NaN
3 None NaN NaN NaN
4 None NaN NaN NaN
5 None NaN NaN NaN
6 None NaN NaN NaN
7 None NaN NaN NaN
8 None NaN NaN NaN
9 None NaN NaN NaN
10 None NaN NaN NaN
11 None NaN NaN NaN
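I'm new to Python. I would appreciate your help. Thanks
The assignment aligns on index labels, and labels 12 to 23 do not exist in SimMean (its index runs from 0 to 11), so nothing matches. Call .values so there are no index alignment issues: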
In [35]:
SimMean.loc[0:11,'Tenor'] = FwdDf.loc[12:23,'Tenor'].values
SimMean
Out[35]:
Tenor 5x16 7x8 2x16H
0 2018-01-01 NaN NaN NaN
1 2018-02-01 NaN NaN NaN
2 2018-03-01 NaN NaN NaN
3 2018-04-01 NaN NaN NaN
4 2018-05-01 NaN NaN NaN
5 2018-06-01 NaN NaN NaN
6 2018-07-01 NaN NaN NaN
7 2018-08-01 NaN NaN NaN
8 2018-09-01 NaN NaN NaN
9 2018-10-01 NaN NaN NaN
10 2018-11-01 NaN NaN NaN
11 2018-12-01 NaN NaN NaN
EDIT
As your column is actually datetime, you need to convert the type again:
In [46]:
SimMean['Tenor'] = pd.to_datetime(SimMean['Tenor'])
SimMean
Out[46]:
Tenor 5x16 7x8 2x16H
0 2018-01-01 NaN NaN NaN
1 2018-02-01 NaN NaN NaN
2 2018-03-01 NaN NaN NaN
3 2018-04-01 NaN NaN NaN
4 2018-05-01 NaN NaN NaN
5 2018-06-01 NaN NaN NaN
6 2018-07-01 NaN NaN NaN
7 2018-08-01 NaN NaN NaN
8 2018-09-01 NaN NaN NaN
9 2018-10-01 NaN NaN NaN
10 2018-11-01 NaN NaN NaN
11 2018-12-01 NaN NaN NaN
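The alignment behaviour can be made visible on a small example. This sketch uses made-up frames `fwd` and `sim` in place of FwdDf and SimMean, and it swaps in `reset_index(drop=True)` as an alternative to `.values`; with that variant the column keeps its datetime dtype, so no conversion is needed afterwards:

```python
import pandas as pd

# Hypothetical stand-ins for FwdDf and SimMean.
fwd = pd.DataFrame({"Tenor": pd.date_range("2017-01-01", periods=24, freq="MS")})
sim = pd.DataFrame(index=range(12), columns=["Tenor", "5x16"])

source = fwd.loc[12:23, "Tenor"]        # carries the labels 12..23

# What label alignment does: look up labels 0..11 in a Series that only has 12..23.
misaligned = source.reindex(sim.index)  # every entry is missing (NaT)

# Dropping the old labels renumbers the Series 0..11, so it lines up with sim.
sim["Tenor"] = source.reset_index(drop=True)
print(sim.dtypes["Tenor"])
print(sim.head(3))
```

Both approaches discard the source labels; `.values` hands over a bare array, while `reset_index(drop=True)` keeps a Series with a fresh index.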