Pandas timeseries groupby using TimeGrouper - python

I have a time series which is like this
Time Demand
Date
2014-01-01 0:00 2899.0
2014-01-01 0:15 2869.0
2014-01-01 0:30 2827.0
2014-01-01 0:45 2787.0
2014-01-01 1:00 2724.0
2014-01-01 1:15 2687.0
2014-01-01 1:30 2596.0
2014-01-01 1:45 2543.0
2014-01-01 2:00 2483.0
It is in 15-minute increments. I want the average for every hour of every day, so I tried df.groupby(pd.TimeGrouper(freq='H')).mean(). It didn't work out quite right because it returned mostly NaNs.
My dataset has data like this for the whole year, and I would like to calculate the mean for each hour of the day across all months, so that I end up with 24 points, e.g. the first point is the mean of the first hour over all the months. The expected output would be
2014 00:00:00 2884.0
2014 01:00:00 2807.0
2014 02:00:00 2705.5
2014 03:00:00 2569.5
..........
2014 23:00:00 2557.5
How can I achieve this?

I think you first need to add the Time column to the index:
df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
Time Demand
2014-01-01 00:00:00 0:00 2899.0
2014-01-01 00:15:00 0:15 2869.0
2014-01-01 00:30:00 0:30 2827.0
2014-01-01 00:45:00 0:45 2787.0
2014-01-01 01:00:00 1:00 2724.0
2014-01-01 01:15:00 1:15 2687.0
2014-01-01 01:30:00 1:30 2596.0
2014-01-01 01:45:00 1:45 2543.0
2014-01-01 02:00:00 2:00 2483.0
print (df.groupby(pd.Grouper(freq='H')).mean())
# same as, in older pandas (TimeGrouper was later deprecated in favor of pd.Grouper):
# print (df.groupby(pd.TimeGrouper(freq='H')).mean())
Demand
2014-01-01 00:00:00 2845.5
2014-01-01 01:00:00 2637.5
2014-01-01 02:00:00 2483.0
Thanks to pansen for another idea, resample:
print (df.resample("H").mean())
Demand
2014-01-01 00:00:00 2845.5
2014-01-01 01:00:00 2637.5
2014-01-01 02:00:00 2483.0
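The steps above can be put together as a minimal, self-contained sketch. It assumes the Date index holds midnight timestamps and Time holds 'H:MM' strings, as in the question. Selecting the Demand column before aggregating is a choice made here so that the string Time column is never averaged, and the lowercase "h" alias is used because recent pandas releases prefer it over "H".

```python
import pandas as pd

# Sample data shaped like the question: dates in the index, 'H:MM' strings in Time.
df = pd.DataFrame(
    {
        "Time": ["0:00", "0:15", "0:30", "0:45", "1:00", "1:15", "1:30", "1:45", "2:00"],
        "Demand": [2899.0, 2869.0, 2827.0, 2787.0, 2724.0, 2687.0, 2596.0, 2543.0, 2483.0],
    },
    index=pd.DatetimeIndex(["2014-01-01"] * 9, name="Date"),
)

# Turn 'H:MM' into 'H:MM:00' so it parses as a timedelta, then add it to the dates.
offsets = pd.to_timedelta((df["Time"] + ":00").to_numpy())
df.index = df.index + offsets

# Average the numeric column per hour.
hourly = df["Demand"].resample("h").mean()
print(hourly)
```

If the index is left as dates only, every row falls into the midnight bin of its day, which would explain the mostly-NaN result in the question.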
EDIT:
print (df)
Time Demand
Date
2014-01-01 0:00 1.0
2014-01-01 0:15 2.0
2014-01-01 0:30 4.0
2014-01-01 0:45 5.0
2014-01-01 1:00 1.0
2014-01-01 1:15 0.0
2015-01-01 1:30 1.0
2015-01-01 1:45 2.0
2015-01-01 2:00 3.0
df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
Time Demand
2014-01-01 00:00:00 0:00 1.0
2014-01-01 00:15:00 0:15 2.0
2014-01-01 00:30:00 0:30 4.0
2014-01-01 00:45:00 0:45 5.0
2014-01-01 01:00:00 1:00 1.0
2014-01-01 01:15:00 1:15 0.0
2015-01-01 01:30:00 1:30 1.0
2015-01-01 01:45:00 1:45 2.0
2015-01-01 02:00:00 2:00 3.0
df1 = df.groupby([df.index.year, df.index.hour]).mean().reset_index()
df1.columns = ['year','hour','Demand']
print (df1)
year hour Demand
0 2014 0 3.0
1 2014 1 0.5
2 2015 1 1.5
3 2015 2 3.0
For a DatetimeIndex in the output, use:
df1 = df.groupby([df.index.year, df.index.hour]).mean()
df1.index = pd.to_datetime(df1.index.get_level_values(0).astype(str) +
df1.index.get_level_values(1).astype(str), format='%Y%H')
print (df1)
Demand
2014-01-01 00:00:00 3.0
2014-01-01 01:00:00 0.5
2015-01-01 01:00:00 1.5
2015-01-01 02:00:00 3.0
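If the goal is exactly 24 points for a single year, grouping on the hour of day alone may be enough. The sketch below uses hypothetical data (two days of 15-minute readings standing in for the full year) and assumes the index is already a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two days of 15-minute readings (a stand-in for the full year).
idx = pd.date_range("2014-01-01", periods=2 * 96, freq="15min")
demand = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Demand")

# Group on hour of day only, so every date contributes to the same 24 buckets.
profile = demand.groupby(demand.index.hour).mean()
print(profile)
```

The result is indexed by the integers 0 to 23 rather than by timestamps.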

Related

Pandas - Resample on MultiIndex based DataFrame and use of offset

I have a df which has a MultiIndex [(latitude, longitude, time)] with the number of rows being 148 x 244 x 90 x 24. For each latitude and longitude, the time is hourly from 2014-01-01 00:00:00 to 2014-03-31 23:00:00 in UTC.
FFDI
latitude longitude time
-39.20000 140.80000 2014-01-01 00:00:00 6.20000
2014-01-01 01:00:00 4.10000
2014-01-01 02:00:00 2.40000
2014-01-01 03:00:00 1.90000
2014-01-01 04:00:00 1.70000
2014-01-01 05:00:00 1.50000
2014-01-01 06:00:00 1.40000
2014-01-01 07:00:00 1.30000
2014-01-01 08:00:00 1.20000
2014-01-01 09:00:00 1.00000
2014-01-01 10:00:00 1.00000
2014-01-01 11:00:00 0.90000
2014-01-01 12:00:00 0.90000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
140.83786 2014-01-01 00:00:00 3.20000
2014-01-01 01:00:00 2.90000
2014-01-01 02:00:00 2.10000
2014-01-01 03:00:00 2.90000
2014-01-01 04:00:00 1.20000
2014-01-01 05:00:00 0.90000
2014-01-01 06:00:00 1.10000
2014-01-01 07:00:00 1.60000
2014-01-01 08:00:00 1.40000
2014-01-01 09:00:00 1.50000
2014-01-01 10:00:00 1.20000
2014-01-01 11:00:00 0.80000
2014-01-01 12:00:00 0.40000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
... ... ... ...
... ... ...
-33.90000 140.80000 2014-01-01 00:00:00 6.20000
2014-01-01 01:00:00 4.10000
2014-01-01 02:00:00 2.40000
2014-01-01 03:00:00 1.90000
2014-01-01 04:00:00 1.70000
2014-01-01 05:00:00 1.50000
2014-01-01 06:00:00 1.40000
2014-01-01 07:00:00 1.30000
2014-01-01 08:00:00 1.20000
2014-01-01 09:00:00 1.00000
2014-01-01 10:00:00 1.00000
2014-01-01 11:00:00 0.90000
2014-01-01 12:00:00 0.90000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
140.83786 2014-01-01 00:00:00 3.20000
2014-01-01 01:00:00 2.90000
2014-01-01 02:00:00 2.10000
2014-01-01 03:00:00 2.90000
2014-01-01 04:00:00 1.20000
2014-01-01 05:00:00 0.90000
2014-01-01 06:00:00 1.10000
2014-01-01 07:00:00 1.60000
2014-01-01 08:00:00 1.40000
2014-01-01 09:00:00 1.50000
2014-01-01 10:00:00 1.20000
2014-01-01 11:00:00 0.80000
2014-01-01 12:00:00 0.40000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
78001920 rows × 1 columns
I need to calculate a daily maximum FFDI value for a date using hourly values from 13:00:00 of the previous day to 12:00:00 of the current day to suit my time zone (+11). For example, if calculating daily max FFDI for 2014-01-10 in the +11 time zone, I can use hourly FFDI from 2014-01-09 13:00:00 to 2014-01-10 12:00:00.
df_daily_max = df.groupby(['latitude', 'longitude', pd.Grouper(freq='24H', base=13, loffset='11H', level='time')])['FFDI'].max().reset_index(name='Max FFDI')
The calculation starts with 13:00:00 and with a frequency of 24 hours.
The output is:
latitude longitude time Max FFDI
0 -39.20000076293945312500 140.80000305175781250000 2013-12-31 13:00:00 6.19999980926513671875
1 -39.20000076293945312500 140.80000305175781250000 2014-01-01 13:00:00 1.50000000000000000000
2 -39.20000076293945312500 140.80000305175781250000 2014-01-02 13:00:00 1.60000002384185791016
... ... ... ...
I would like the output to be:
latitude longitude time Max FFDI
0 -39.20000076293945312500 140.80000305175781250000 2014-01-01 6.19999980926513671875
1 -39.20000076293945312500 140.80000305175781250000 2014-01-02 1.50000000000000000000
2 -39.20000076293945312500 140.80000305175781250000 2014-01-03 1.60000002384185791016
... ... ... ...
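One possible approach, sketched on a hypothetical miniature of the frame (a single grid point with 48 hourly values): shift the UTC times by +11 hours and truncate to midnight, so that each row is labelled with its local calendar day before grouping. This also avoids the base and loffset arguments, which newer pandas releases no longer accept. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame: one grid point, 48 hourly UTC values.
times = pd.date_range("2014-01-01 00:00", periods=48, freq="h")
index = pd.MultiIndex.from_product(
    [[-39.2], [140.8], times], names=["latitude", "longitude", "time"]
)
df = pd.DataFrame({"FFDI": np.arange(48, dtype=float)}, index=index)

# Flatten the index, shift UTC by +11 hours, and keep only the local calendar day.
flat = df.reset_index()
flat["time"] = (flat["time"] + pd.Timedelta(hours=11)).dt.normalize()

df_daily_max = (
    flat.groupby(["latitude", "longitude", "time"])["FFDI"]
    .max()
    .reset_index(name="Max FFDI")
)
print(df_daily_max)
```

With this labelling, the UTC hours from 13:00 on one day to 12:00 on the next all land on the same local date, which matches the window described in the question.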

How to either change the date or get rid of it after using pd.to_datetime()?

I have a df that looks as follows:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 02:15:00 20.5 1 1
and their data types are:
Datum object
Dates object
Time object
Menge float64
day int64
month int64
dtype: object
I wanted to calculate a few things like the hourly average, daily average, monthly average and for that, I had to convert the types of the Dates and Time column. For that, I did:
data_nan_dropped['Dates'] = pd.to_datetime(data_nan_dropped.Dates)
data_nan_dropped.Time = pd.to_datetime(data_nan_dropped.Time, format='%H:%M:%S')
which converted my df to:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 1900-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:00:00 1900-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:00:00 1900-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:00:00 1900-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 00:00:00 1900-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 00:00:00 1900-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 00:00:00 1900-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 00:00:00 1900-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 00:00:00 1900-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 00:00:00 1900-01-01 02:15:00 20.5 1 1
Now the Time column has been converted, but every value carries the date 1900-01-01. I don't want that.
If possible, I would like one of the following:
The Time column is converted to datetime64[ns] without the date being displayed
or
The date from the Datum column is displayed there instead of 1900-01-01.
How can I achieve this?
Expected output:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 2018-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:00:00 2018-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:00:00 2018-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:00:00 2018-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 00:00:00 2018-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 00:00:00 2018-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 00:00:00 2018-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 00:00:00 2018-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 00:00:00 2018-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 00:00:00 2018-01-01 02:15:00 20.5 1 1
If I understand you correctly by looking at your expected output, we can use the Datum column to create the right Time column:
df['Dates'] = pd.to_datetime(df['Dates'])
df['Time'] = pd.to_datetime(df['Datum'], format='%d/%m/%Y %H:%M')
Datum Dates Time Menge day month
0 1/1/2018 0:00 2018-01-01 2018-01-01 00:00:00 19.5 1 1
1 1/1/2018 0:15 2018-01-01 2018-01-01 00:15:00 19.0 1 1
2 1/1/2018 0:30 2018-01-01 2018-01-01 00:30:00 19.5 1 1
3 1/1/2018 0:45 2018-01-01 2018-01-01 00:45:00 19.5 1 1
4 1/1/2018 1:00 2018-01-01 2018-01-01 01:00:00 21.0 1 1
5 1/1/2018 1:15 2018-01-01 2018-01-01 01:15:00 19.5 1 1
6 1/1/2018 1:30 2018-01-01 2018-01-01 01:30:00 20.0 1 1
7 1/1/2018 1:45 2018-01-01 2018-01-01 01:45:00 23.0 1 1
8 1/1/2018 2:00 2018-01-01 2018-01-01 02:00:00 20.5 1 1
9 1/1/2018 2:15 2018-01-01 2018-01-01 02:15:00 20.5 1 1
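As an alternative sketch, the timestamp can also be built from the Dates and Time string columns directly. A datetime64 value always carries a date, so the first wish in the question (a datetime64[ns] column with no date) is not possible as stated; the closest option is a column of plain time-of-day objects. The column names Timestamp and TimeOnly are hypothetical and chosen only for this example.

```python
import pandas as pd

# Two rows shaped like the question, with Dates and Time still stored as strings.
df = pd.DataFrame(
    {
        "Dates": ["2018-01-01", "2018-01-01"],
        "Time": ["00:15:00", "01:00:00"],
        "Menge": [19.0, 21.0],
    }
)

# Option 1: full timestamps built from both string columns.
df["Timestamp"] = pd.to_datetime(
    df["Dates"] + " " + df["Time"], format="%Y-%m-%d %H:%M:%S"
)

# Option 2: time of day only; this column has object dtype holding datetime.time values.
df["TimeOnly"] = pd.to_datetime(df["Time"], format="%H:%M:%S").dt.time

print(df)
```

Option 1 is usually the more convenient one for hourly, daily, or monthly averages, because the column can be used with resample or the .dt accessor.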

Unable to combine date and time in pandas

I would like to combine the following date and time columns to 1 date_time column:
casinghourly[['Date','Time']].head()
Out[275]:
Date Time
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
I've used the following code:
casinghourly.loc[:,'Date_Time'] = pd.to_datetime(casinghourly.Date.astype(str)+' '+casinghourly.Time.astype(str))
But I get the following error:
ValueError: Unknown string format
Fyi:
casinghourly[['Date','Time']].dtypes
Out[276]:
Date datetime64[ns]
Time timedelta64[ns]
dtype: object
Can somebody help me here please?
You can directly add datetime64[ns] and timedelta64[ns]:
df['Date'] = df['Date']+df['Time']
print(df['Date'])
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
Name: Date, dtype: datetime64[ns]
print(df)
Date Time
0 2014-01-01 00:00:00 00:00:00
1 2014-01-01 01:00:00 01:00:00
2 2014-01-01 02:00:00 02:00:00
3 2014-01-01 03:00:00 03:00:00
4 2014-01-01 04:00:00 04:00:00
print(df.dtypes)
Date datetime64[ns]
Time timedelta64[ns]
dtype: object
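A minimal, self-contained sketch of the same idea, writing the result to a separate Date_Time column as the question intended. The original attempt most likely failed because converting a timedelta column to strings yields text such as '0 days 01:00:00', which pd.to_datetime cannot parse when appended to a date.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2014-01-01", "2014-01-01"]),
        "Time": pd.to_timedelta(["00:00:00", "01:00:00"]),
    }
)

# A datetime column plus a timedelta column is plain addition; no string round trip needed.
df["Date_Time"] = df["Date"] + df["Time"]
print(df)
```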

How can I get a conditional hourly mean in pandas?

I have the dataframe below and want to build an hourly-mean dataframe,
on the condition that each hour's mean is calculated only from the values at 00:15:00~00:45:00.
The date and time columns form a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The result should look like the one below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select the rows whose time ends in 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
lvl1new.where(m)], names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
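A different way to get the same means, sketched under the assumption that the time level is timedelta-like (the question does not show its dtype): floor each time to the start of its hour, drop the rows that sit exactly on the hour, and group the rest. Only the first two hours of the sample data are reconstructed here.

```python
import pandas as pd

# Reconstruction of the first two hours, assuming the time level is timedelta-like.
dates = pd.to_datetime(["2017-01-01"] * 8)
times = pd.to_timedelta(
    ["00:00:00", "00:15:00", "00:30:00", "00:45:00",
     "01:00:00", "01:15:00", "01:30:00", "01:45:00"]
)
df1 = pd.DataFrame(
    {"aaa": [146.88, 143.28, 143.28, 141.12, 134.64, 132.48, 136.80, 138.24]},
    index=pd.MultiIndex.from_arrays([dates, times], names=["date", "time"]),
)

time_level = df1.index.get_level_values("time")
date_level = df1.index.get_level_values("date")
hour_start = time_level.floor("h")     # 00:15 -> 00:00, 01:45 -> 01:00
off_hour = time_level != hour_start    # True for the :15, :30 and :45 rows

# Keep only the off-hour rows, then average them per (date, hour start).
result = (
    df1.loc[off_hour, "aaa"]
    .groupby([date_level[off_hour], hour_start[off_hour]])
    .mean()
)
print(result)
```

Unlike the approach above, this leaves the index of df1 unchanged.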

Create multiple columns in pandas aggregation function

I'd like to create multiple columns while resampling a pandas DataFrame like the built-in ohlc method.
def mhl(data):
    return pandas.Series([np.mean(data),np.max(data),np.min(data)],index = ['mean','high','low'])
ts.resample('30Min',how=mhl)
Dies with
Exception: Must produce aggregated value
Any suggestions? Thanks!
You can pass a dictionary of functions to the resample method:
In [35]: ts
Out[35]:
2013-01-01 00:00:00 0
2013-01-01 00:15:00 1
2013-01-01 00:30:00 2
2013-01-01 00:45:00 3
2013-01-01 01:00:00 4
2013-01-01 01:15:00 5
...
2013-01-01 23:00:00 92
2013-01-01 23:15:00 93
2013-01-01 23:30:00 94
2013-01-01 23:45:00 95
2013-01-02 00:00:00 96
Freq: 15T, Length: 97
Create a dictionary of functions:
mhl = {'m':np.mean, 'h':np.max, 'l':np.min}
Pass the dictionary to the how parameter of resample:
In [36]: ts.resample("30Min", how=mhl)
Out[36]:
h m l
2013-01-01 00:00:00 1 0.5 0
2013-01-01 00:30:00 3 2.5 2
2013-01-01 01:00:00 5 4.5 4
2013-01-01 01:30:00 7 6.5 6
2013-01-01 02:00:00 9 8.5 8
2013-01-01 02:30:00 11 10.5 10
2013-01-01 03:00:00 13 12.5 12
2013-01-01 03:30:00 15 14.5 14
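The how parameter of resample was deprecated and later removed from pandas, so the call above only works on old versions. A rough equivalent for current releases, assuming the same 15-minute series, passes the functions to agg instead:

```python
import numpy as np
import pandas as pd

# Same shape as the example series: 97 points at 15-minute spacing.
ts = pd.Series(
    np.arange(97),
    index=pd.date_range("2013-01-01", periods=97, freq="15min"),
)

# A list of function names gives one output column per function.
out = ts.resample("30min").agg(["mean", "max", "min"])
out.columns = ["m", "h", "l"]
print(out.head())
```

The column order follows the list passed to agg, so it can differ from the h, m, l order shown in the older output.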
