Pandas - Resample on MultiIndex based DataFrame and use of offset - python

I have a df which has a MultiIndex [(latitude, longitude, time)] with the number of rows being 148 x 244 x 90 x 24. For each latitude and longitude, the time is hourly from 2014-01-01 00:00:00 to 2014-03-31 23:00:00, in UTC.
FFDI
latitude longitude time
-39.20000 140.80000 2014-01-01 00:00:00 6.20000
2014-01-01 01:00:00 4.10000
2014-01-01 02:00:00 2.40000
2014-01-01 03:00:00 1.90000
2014-01-01 04:00:00 1.70000
2014-01-01 05:00:00 1.50000
2014-01-01 06:00:00 1.40000
2014-01-01 07:00:00 1.30000
2014-01-01 08:00:00 1.20000
2014-01-01 09:00:00 1.00000
2014-01-01 10:00:00 1.00000
2014-01-01 11:00:00 0.90000
2014-01-01 12:00:00 0.90000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
140.83786 2014-01-01 00:00:00 3.20000
2014-01-01 01:00:00 2.90000
2014-01-01 02:00:00 2.10000
2014-01-01 03:00:00 2.90000
2014-01-01 04:00:00 1.20000
2014-01-01 05:00:00 0.90000
2014-01-01 06:00:00 1.10000
2014-01-01 07:00:00 1.60000
2014-01-01 08:00:00 1.40000
2014-01-01 09:00:00 1.50000
2014-01-01 10:00:00 1.20000
2014-01-01 11:00:00 0.80000
2014-01-01 12:00:00 0.40000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
... ... ... ...
... ... ...
-33.90000 140.80000 2014-01-01 00:00:00 6.20000
2014-01-01 01:00:00 4.10000
2014-01-01 02:00:00 2.40000
2014-01-01 03:00:00 1.90000
2014-01-01 04:00:00 1.70000
2014-01-01 05:00:00 1.50000
2014-01-01 06:00:00 1.40000
2014-01-01 07:00:00 1.30000
2014-01-01 08:00:00 1.20000
2014-01-01 09:00:00 1.00000
2014-01-01 10:00:00 1.00000
2014-01-01 11:00:00 0.90000
2014-01-01 12:00:00 0.90000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
140.83786 2014-01-01 00:00:00 3.20000
2014-01-01 01:00:00 2.90000
2014-01-01 02:00:00 2.10000
2014-01-01 03:00:00 2.90000
2014-01-01 04:00:00 1.20000
2014-01-01 05:00:00 0.90000
2014-01-01 06:00:00 1.10000
2014-01-01 07:00:00 1.60000
2014-01-01 08:00:00 1.40000
2014-01-01 09:00:00 1.50000
2014-01-01 10:00:00 1.20000
2014-01-01 11:00:00 0.80000
2014-01-01 12:00:00 0.40000
... ... ... ...
2014-03-31 21:00:00 0.30000
2014-03-31 22:00:00 0.30000
2014-03-31 23:00:00 0.50000
78001920 rows × 1 columns
I need to calculate a daily maximum FFDI value for a date using hourly values from 13:00:00 of the previous day to 12:00:00 of the current day to suit my time zone (+11). For example, if calculating daily max FFDI for 2014-01-10 in the +11 time zone, I can use hourly FFDI from 2014-01-09 13:00:00 to 2014-01-10 12:00:00.
df_daily_max = df.groupby(['latitude', 'longitude', pd.Grouper(freq='24H', base=13, loffset='11H', level='time')])['FFDI'].max().reset_index(name='Max FFDI')
The calculation starts at 13:00:00 and uses a frequency of 24 hours.
The output is:
latitude longitude time Max FFDI
0 -39.20000076293945312500 140.80000305175781250000 2013-12-31 13:00:00 6.19999980926513671875
1 -39.20000076293945312500 140.80000305175781250000 2014-01-01 13:00:00 1.50000000000000000000
2 -39.20000076293945312500 140.80000305175781250000 2014-01-02 13:00:00 1.60000002384185791016
... ... ... ...
I would like the output to be:
latitude longitude time Max FFDI
0 -39.20000076293945312500 140.80000305175781250000 2014-01-01 6.19999980926513671875
1 -39.20000076293945312500 140.80000305175781250000 2014-01-02 1.50000000000000000000
2 -39.20000076293945312500 140.80000305175781250000 2014-01-03 1.60000002384185791016
... ... ... ...
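One possible sketch (only a sketch, assuming a fixed +11 offset with no DST handling, and avoiding base/loffset, which are deprecated in newer pandas): shift the UTC timestamps forward by 11 hours so that each 13:00→12:00 UTC window falls on a single local calendar day, then take the per-day max:
shifted = df.reset_index()
# move to local (+11) time and keep only the calendar day
shifted['time'] = (shifted['time'] + pd.Timedelta(hours=11)).dt.normalize()
df_daily_max = (shifted
                .groupby(['latitude', 'longitude', 'time'])['FFDI']
                .max()
                .reset_index(name='Max FFDI'))
# use .dt.date instead of .dt.normalize() if plain dates rather than midnight timestamps are wanted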

Related

Create Unlimited DataFrames from other DataFrame Column Category with Time-Series Data

I have a time series with three columns.
The time series is hourly, and the date is the index value.
I have multiple categories ('levels') that are being measured hourly.
The level names are arbitrary, usually odd strings, and I may pull anywhere between 40 and 40,000 at a time.
Each also has a varying value: a score from 0 - 100.
So:
I want to make each level have its own DataFrame:
(FULL DataFrame):
df =
date levels score
2019-01-01 00:00:00 1005 99.438851
2019-01-01 01:00:00 1005 92.081975
2019-01-01 02:00:00 1005 93.032991
2019-01-01 03:00:00 1005 1.991615
2019-01-01 04:00:00 1005 12.723531
2019-01-01 05:00:00 1005 74.443313
(One of hundreds of individual DataFrames I want generated, but NOT in a DICT)
df_is_1005 =
date score
2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
.... but for ALL the levels.
I have a bit of a problem!
I've done quite a lot of digging and have tried making a dict of the DataFrames. How do I extract each of these?
Also, how do I name them individually as df_of_{levels}?
This is the time-series data I'll create for a toy model (but there should be multiple datetimes for each and every level, unlike here):
import numpy as np
import pandas as pd
date_rng = pd.date_range(start='1/1/2019', end='3/30/2019', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['levels'] = np.random.randint(1000, 1033, size=(len(date_rng)))
df['score'] = np.random.uniform(0, 100, size=(len(date_rng)))
Keep in mind, the levels I deal with could number in the hundreds and they have bizarre names.
I'll have the timestamps for each of these as separate rows.
My desired goal is to have each of the possible levels (and there may well be more than the small number here) dynamically create its own DataFrame.
NOW: I know I can create a dictionary of DataFrames.
BUT how do I extract each of the DataFrames by its individual level number?
I want, for example
df =
date levels score
2019-01-01 00:00:00 1005 99.438851
2019-01-01 01:00:00 1005 92.081975
2019-01-01 02:00:00 1005 93.032991
2019-01-01 03:00:00 1005 1.991615
2019-01-01 04:00:00 1005 12.723531
2019-01-01 05:00:00 1005 74.443313
2019-01-01 06:00:00 1005 12.154499
2019-01-01 07:00:00 1005 96.439228
2019-01-01 08:00:00 1005 64.283731
2019-01-01 09:00:00 1005 83.165093
2019-01-01 10:00:00 1005 75.740610
2019-01-01 11:00:00 1005 25.721404
2019-01-01 12:00:00 1005 37.493829
2019-01-01 13:00:00 1005 51.783549
2019-01-01 14:00:00 1005 7.223582
2019-01-01 15:00:00 1005 0.932651
2019-01-01 16:00:00 1005 95.916686
2019-01-01 17:00:00 1005 11.579450
and the same df, much further down:
date levels score
2019-01-01 00:00:00 1027 99.438851
2019-01-01 01:00:00 1027 92.081975
2019-01-01 02:00:00 1027 93.032991
2019-01-01 03:00:00 1027 1.991615
2019-01-01 04:00:00 1027 12.723531
2019-01-01 05:00:00 1027 74.443313
2019-01-01 06:00:00 1027 12.154499
2019-01-01 07:00:00 1027 96.439228
2019-01-01 08:00:00 1027 64.283731
2019-01-01 09:00:00 1027 83.165093
2019-01-01 10:00:00 1027 75.740610
2019-01-01 11:00:00 1027 25.721404
2019-01-01 12:00:00 1027 37.493829
2019-01-01 13:00:00 1027 51.783549
2019-01-01 14:00:00 1027 7.223582
2019-01-01 15:00:00 1027 0.932651
2019-01-01 16:00:00 1027 95.916686
2019-01-01 17:00:00 1027 11.579450
2019-01-01 18:00:00 1027 91.226938
2019-01-01 19:00:00 1027 31.564530
2019-01-01 20:00:00 1027 39.511358
2019-01-01 21:00:00 1027 59.787468
2019-01-01 22:00:00 1027 4.666549
2019-01-01 23:00:00 1027 92.197337
...etcetera...
EACH level individually, whatever it may be called (and there may be hundreds of them with random values):
To be converted to
df_{level_value_generated} =
date score
2019-01-01 00:00:00 8.040233
2019-01-01 01:00:00 55.736688
2019-01-01 02:00:00 37.910143
2019-01-01 03:00:00 22.907763
2019-01-01 04:00:00 4.586205
2019-01-01 05:00:00 88.090652
2019-01-01 06:00:00 50.474533
2019-01-01 07:00:00 92.890208
2019-01-01 08:00:00 70.949978
2019-01-01 09:00:00 23.191488
2019-01-01 10:00:00 60.506870
2019-01-01 11:00:00 25.689149
2019-01-01 12:00:00 49.234296
2019-01-01 13:00:00 65.369771
2019-01-01 14:00:00 55.550065
2019-01-01 15:00:00 35.112297
2019-01-01 16:00:00 45.989587
2019-01-01 17:00:00 76.829787
2019-01-01 18:00:00 5.982378
2019-01-01 19:00:00 83.603115
2019-01-01 20:00:00 5.995648
2019-01-01 21:00:00 95.658097
2019-01-01 22:00:00 21.877945
2019-01-01 23:00:00 30.428798
2019-01-02 00:00:00 72.450284
2019-01-02 01:00:00 91.947018
2019-01-02 02:00:00 66.741502
2019-01-02 03:00:00 77.535416
2019-01-02 04:00:00 29.624868
2019-01-02 05:00:00 89.652003
So I can then list these DataFrames that are created DYNAMICALLY.
From here, I'd like to add them to a dictionary, because I want to train a time-series model on each and every one of the individual DataFrames, so that I have a different model for each of them, each with its own training and outputs.
If possible, can I train multiple DataFrames from inside a dictionary individually?
If I just do a pivot table or groupby, I will have one large DataFrame whose columns I'll have to call out individually to train on the time series, so I'd rather not do that.
So, how do I dynamically create:
newly named DataFrames from levels whose values are not all known in advance,
each named:
df_{level_name}:
DateTime Column: Score_Column:
some dates... scores 0-100
each of which drops the 'level_name' column in its own DataFrame, so that I can have as many DataFrames as necessary, each named uniquely and programmatically, and then plug each of them into a new model or whatever?
If I've understood your problem correctly, a MultiIndex should do exactly what you want.
To do this on your dataframe:
df.reset_index(inplace=True)
df.set_index(['levels', 'date'], inplace=True)
# in the case of your example above, this will produce:
df =
levels date score
1005 2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
1027 2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
2019-01-01 18:00:00 91.226938
2019-01-01 19:00:00 31.564530
2019-01-01 20:00:00 39.511358
2019-01-01 21:00:00 59.787468
2019-01-01 22:00:00 4.666549
2019-01-01 23:00:00 92.197337
#... etc
You can then access each level of the data using these indices:
df.loc[1005, :]
>
date score
2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
You can also loop through all the 'levels' of the data using:
for level, data in df.groupby(level=0):
    # do something with 'level' (the level value) and 'data' (that level's sub-DataFrame)
And, if needed, get a list of all 'levels' contained in the data:
df.index.levels[0]
> [1005, 1027, ...]
This might prove to be more flexible than creating numerous individually named dataframes, and is closer to the intended use of pandas.
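If you really do need separate per-level objects (for example to fit one model per level), a minimal sketch building on the MultiIndexed df above is to collect them in a dict keyed by level value rather than generating variable names:
dfs = {level: data.reset_index(level=0, drop=True)
       for level, data in df.groupby(level=0)}
# dfs[1005] is now the per-level DataFrame indexed by date
Looping over dfs.items() then lets you train a separate model for each level.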

Pandas groupby then fill missing rows

I have a dataframe structured like this:
df_all:
day_time LCLid energy(kWh/hh)
2014-02-08 23:00:00 MAC000006 0.077
2014-02-08 23:30:00 MAC000006 0.079
...
2014-02-08 23:00:00 MAC000007 0.045
...
There are four sequential datetimes (across all LCLids) missing from the data that I want to fill with previous and trailing values.
If the dataframe was split into sub-dataframes (df), one per LCLid eg as per:
gb = df.groupby('LCLid')
df_list = [gb.get_group(x) for x in gb.groups]
Then I could do this for each df in df_list:
#valid data before gap
prev_row = df.loc['2013-09-09 22:30:00'].copy()
#valid data after gap
post_row = df.loc['2013-09-10 01:00:00'].copy()
df.loc[pd.to_datetime('2013-09-09 23:00:00')] = prev_row
df.loc[pd.to_datetime('2013-09-09 23:30:00')] = prev_row
df.loc[pd.to_datetime('2013-09-10 00:00:00')] = post_row
df.loc[pd.to_datetime('2013-09-10 00:30:00')] = post_row
df = df.sort_index()
How can I do this on df_all in one go, filling the missing data with 'valid' data just from each LCLid?
The solution
The input DataFrame:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
What you need to do:
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
The explanation
First I'll build an example DataFrame that looks like yours
import numpy as np
import pandas as pd
# Building an example DataFrame that looks like yours
df = pd.DataFrame({
    'day_time': [
        pd.Timestamp(2014, 1, 1, 0, 0),
        pd.Timestamp(2014, 1, 1, 0, 0),
        pd.Timestamp(2014, 1, 1, 0, 30),
        pd.Timestamp(2014, 1, 1, 0, 30),
        pd.Timestamp(2014, 1, 1, 3, 0),
        pd.Timestamp(2014, 1, 1, 3, 0),
        pd.Timestamp(2014, 1, 1, 3, 30),
        pd.Timestamp(2014, 1, 1, 3, 30),
    ],
    'LCLid': [
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
    ],
    'energy(kWh/hh)': np.random.rand(8)
}).set_index('day_time')
Result:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
Notice how we're missing the following timestamps:
2014-01-01 01:00:00
2014-01-01 01:30:00
2014-01-01 02:00:00
2014-01-01 02:30:00
df.reindex()
First thing to know is that df.reindex() allows you to fill in missing index values, and will default to NaN for missing values. In your case, you would want to supply the full timestamp range index, including the values that don't show up in your starting DataFrame.
Here I used pd.date_range() to list all timestamps between your min and max starting index values, taking strides of 30 minutes. WARNING: this way of doing it means that if your missing timestamp values are at the beginning or the end, you're not adding them back! So maybe you want to specify start and end explicitly.
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
Result:
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:30:00',
'2014-01-01 01:00:00', '2014-01-01 01:30:00',
'2014-01-01 02:00:00', '2014-01-01 02:30:00',
'2014-01-01 03:00:00', '2014-01-01 03:30:00'],
dtype='datetime64[ns]', freq='30T')
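As the warning above suggests, if the gaps can fall at the very start or end of the data you may want to pass explicit boundaries instead of deriving them from the index, e.g. (using the example's known start and end):
full_idx = pd.date_range(start='2014-01-01 00:00', end='2014-01-01 03:30', freq='30T')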
Now if we use that to reindex one of your grouped sub-DataFrames, we would get this:
grouped_df = df[df.LCLid == 'MAC000006']
grouped_df.reindex(full_idx)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 NaN NaN
2014-01-01 01:30:00 NaN NaN
2014-01-01 02:00:00 NaN NaN
2014-01-01 02:30:00 NaN NaN
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
You said you want to fill missing values using the closest available surrounding value. This can be done during reindexing, as follows:
grouped_df.reindex(full_idx, method='nearest')
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
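If you would rather always carry the previous reading forward instead of taking the nearest one, reindex accepts other fill methods too:
grouped_df.reindex(full_idx, method='ffill')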
Doing all the groups at once using df.groupby()
Now we'd like to apply this transformation to every group in your DataFrame, where
a group is defined by its LCLid.
(
    df
    .groupby('LCLid', as_index=False)  # use LCLid as groupby key, but don't add it as a group index
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  # do this for each group
    .reset_index(level=0, drop=True)  # get rid of the automatic index generated during groupby
    .sort_index()  # This is optional, just in case you want timestamps in chronological order
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
Relevant doc:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html

How can I get a conditional hourly mean in pandas?

I have the DataFrame below and I want to make an hourly-mean DataFrame,
with the condition that for every hour I only calculate the mean of the 00:15:00~00:45:00 values.
date/time are a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The results should be as below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select the rows whose times end with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                        lvl1new.where(m)], names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
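An alternative sketch, assuming the time level holds values that pd.to_timedelta can parse (e.g. '00:15:00'): rebuild a flat DatetimeIndex, drop the on-the-hour rows, and let resample average the remaining 00:15-00:45 rows within each hour:
idx = (pd.to_datetime(df1.index.get_level_values('date'))
       + pd.to_timedelta(df1.index.get_level_values('time').astype(str)))
s = pd.Series(df1['aaa'].to_numpy(), index=idx)
#keep only the 15/30/45-minute rows, then average them per hour
print (s[s.index.minute != 0].resample('H').mean())
This produces the same hourly means as above, just with a flat DatetimeIndex instead of a MultiIndex.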

Pandas timeseries groupby using TimeGrouper

I have a time series which is like this
Time Demand
Date
2014-01-01 0:00 2899.0
2014-01-01 0:15 2869.0
2014-01-01 0:30 2827.0
2014-01-01 0:45 2787.0
2014-01-01 1:00 2724.0
2014-01-01 1:15 2687.0
2014-01-01 1:30 2596.0
2014-01-01 1:45 2543.0
2014-01-01 2:00 2483.0
It is in 15-minute increments. I want the average for every hour of every day, so I tried something like this: df.groupby(pd.TimeGrouper(freq='H')).mean(). It didn't work out quite right because it returned mostly NaNs.
Now my dataset has data like this for the whole year, and I would like to calculate the mean for each hour of the day across all the months, so that I have 24 points where each point is the mean of that hour over the whole year, e.g. the first point is the mean of the first hour across all the months. The expected output would be
2014 00:00:00 2884.0
2014 01:00:00 2807.0
2014 02:00:00 2705.5
2014 03:00:00 2569.5
..........
2014 23:00:00 2557.5
How can I achieve this?
I think you first need to add the Time column to the index:
df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
Time Demand
2014-01-01 00:00:00 0:00 2899.0
2014-01-01 00:15:00 0:15 2869.0
2014-01-01 00:30:00 0:30 2827.0
2014-01-01 00:45:00 0:45 2787.0
2014-01-01 01:00:00 1:00 2724.0
2014-01-01 01:15:00 1:15 2687.0
2014-01-01 01:30:00 1:30 2596.0
2014-01-01 01:45:00 1:45 2543.0
2014-01-01 02:00:00 2:00 2483.0
print (df.groupby(pd.Grouper(freq='H')).mean())
#same as
#print (df.groupby(pd.TimeGrouper(freq='H')).mean())
Demand
2014-01-01 00:00:00 2845.5
2014-01-01 01:00:00 2637.5
2014-01-01 02:00:00 2483.0
Thanks pansen for another idea resample:
print (df.resample("H").mean())
Demand
2014-01-01 00:00:00 2845.5
2014-01-01 01:00:00 2637.5
2014-01-01 02:00:00 2483.0
EDIT:
print (df)
Time Demand
Date
2014-01-01 0:00 1.0
2014-01-01 0:15 2.0
2014-01-01 0:30 4.0
2014-01-01 0:45 5.0
2014-01-01 1:00 1.0
2014-01-01 1:15 0.0
2015-01-01 1:30 1.0
2015-01-01 1:45 2.0
2015-01-01 2:00 3.0
df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
Time Demand
2014-01-01 00:00:00 0:00 1.0
2014-01-01 00:15:00 0:15 2.0
2014-01-01 00:30:00 0:30 4.0
2014-01-01 00:45:00 0:45 5.0
2014-01-01 01:00:00 1:00 1.0
2014-01-01 01:15:00 1:15 0.0
2015-01-01 01:30:00 1:30 1.0
2015-01-01 01:45:00 1:45 2.0
2015-01-01 02:00:00 2:00 3.0
df1 = df.groupby([df.index.year, df.index.hour]).mean().reset_index()
df1.columns = ['year','hour','Demand']
print (df1)
year hour Demand
0 2014 0 3.0
1 2014 1 0.5
2 2015 1 1.5
3 2015 2 3.0
For DatetimeIndex use:
df1 = df.groupby([df.index.year, df.index.hour]).mean()
df1.index = pd.to_datetime(df1.index.get_level_values(0).astype(str) +
                            df1.index.get_level_values(1).astype(str), format='%Y%H')
print (df1)
Demand
2014-01-01 00:00:00 3.0
2014-01-01 01:00:00 0.5
2015-01-01 01:00:00 1.5
2015-01-01 02:00:00 3.0
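If you only want 24 points for the hours of the day regardless of year, a simpler variant (using the same DatetimeIndex built above) is:
print (df.groupby(df.index.hour)['Demand'].mean())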

How can I find out the max and min value of each day from hourly data sets

This is hourly data over 2 years, and I need the daily max and min of the data. How can I find this?
value
record ts
2014-01-01 00:00:00 5.83
2014-01-01 01:00:00 5.38
2014-01-01 02:00:00 4.80
2014-01-01 03:00:00 3.81
2014-01-01 04:00:00 4.46
2014-01-01 05:00:00 5.04
2014-01-01 06:00:00 5.76
2014-01-01 07:00:00 6.15
2014-01-01 08:00:00 6.66
2014-01-01 09:00:00 7.02
2014-01-01 10:00:00 7.43
2014-01-01 11:00:00 7.34
2014-01-01 12:00:00 7.24
2014-01-01 13:00:00 7.71
2014-01-01 14:00:00 8.89
2014-01-01 15:00:00 10.31
You can use resample with Resampler.aggregate, using min and max:
print (df)
value
record ts
2014-01-01 00:00:00 5.83
2014-01-01 01:00:00 5.38
2014-01-01 02:00:00 4.80
2014-01-01 03:00:00 3.81
2014-01-02 04:00:00 4.46
2014-01-02 05:00:00 5.04
2014-01-02 06:00:00 5.76
2014-01-03 07:00:00 6.15
2014-01-03 08:00:00 6.66
2014-01-03 09:00:00 7.02
2014-01-03 10:00:00 7.43
2014-01-04 11:00:00 7.34
2014-01-04 12:00:00 7.24
2014-01-04 13:00:00 7.71
2014-01-05 14:00:00 8.89
2014-01-05 15:00:00 10.31
#if not DatetimeIndex
df.index = pd.to_datetime(df.index)
print (df.resample('D')['value'].agg(['min', 'max']))
min max
record ts
2014-01-01 3.81 5.83
2014-01-02 4.46 5.76
2014-01-03 6.15 7.43
2014-01-04 7.24 7.71
2014-01-05 8.89 10.31
Another solution:
print (df.groupby(pd.TimeGrouper('D'))['value'].agg(['min', 'max']))
min max
record ts
2014-01-01 3.81 5.83
2014-01-02 4.46 5.76
2014-01-03 6.15 7.43
2014-01-04 7.24 7.71
2014-01-05 8.89 10.31
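Note that pd.TimeGrouper has since been deprecated and removed in newer pandas versions; the equivalent with pd.Grouper is:
print (df.groupby(pd.Grouper(freq='D'))['value'].agg(['min', 'max']))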