Pandas groupby then fill missing rows - python
I have a dataframe structured like this:
df_all:
day_time LCLid energy(kWh/hh)
2014-02-08 23:00:00 MAC000006 0.077
2014-02-08 23:30:00 MAC000006 0.079
...
2014-02-08 23:00:00 MAC000007 0.045
...
There are four sequential datetimes (across all LCLids) missing from the data that I want to fill with the preceding and following valid values.
If the dataframe was split into sub-dataframes (df), one per LCLid, e.g. as per:
gb = df.groupby('LCLid')
df_list = [gb.get_group(x) for x in gb.groups]
Then I could do this for each df in df_list:
#valid data before gap
prev_row = df.loc['2013-09-09 22:30:00'].copy()
#valid data after gap
post_row = df.loc['2013-09-10 01:00:00'].copy()
df.loc[pd.to_datetime('2013-09-09 23:00:00')] = prev_row
df.loc[pd.to_datetime('2013-09-09 23:30:00')] = prev_row
df.loc[pd.to_datetime('2013-09-10 00:00:00')] = post_row
df.loc[pd.to_datetime('2013-09-10 00:30:00')] = post_row
df = df.sort_index()
How can I do this on df_all in one go, filling the missing data with 'valid' data taken only from each LCLid?
The solution
The input DataFrame:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
What you need to do:
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
The explanation
First I'll build an example DataFrame that looks like yours
import numpy as np
import pandas as pd
# Building an example DataFrame that looks like yours
df = pd.DataFrame(
    {
        'day_time': [
            pd.Timestamp(2014, 1, 1, 0, 0),
            pd.Timestamp(2014, 1, 1, 0, 0),
            pd.Timestamp(2014, 1, 1, 0, 30),
            pd.Timestamp(2014, 1, 1, 0, 30),
            pd.Timestamp(2014, 1, 1, 3, 0),
            pd.Timestamp(2014, 1, 1, 3, 0),
            pd.Timestamp(2014, 1, 1, 3, 30),
            pd.Timestamp(2014, 1, 1, 3, 30),
        ],
        'LCLid': [
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
        ],
        'energy(kWh/hh)': np.random.rand(8),
    },
).set_index('day_time')
Result:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
Notice how we're missing the following timestamps:
2014-01-01 01:00:00
2014-01-01 01:30:00
2014-01-01 02:00:00
2014-01-01 02:30:00
df.reindex()
First thing to know is that df.reindex() allows you to fill in missing index values, and will default to NaN for missing values. In your case, you would want to supply the full timestamp range index, including the values that don't show up in your starting DataFrame.
Here I used pd.date_range() to list all timestamps between your min and max starting index values, taking strides of 30 minutes. WARNING: this way of doing it means that if your missing timestamp values are at the beginning or the end, you're not adding them back! So maybe you want to specify start and end explicitly.
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
Result:
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:30:00',
               '2014-01-01 01:00:00', '2014-01-01 01:30:00',
               '2014-01-01 02:00:00', '2014-01-01 02:30:00',
               '2014-01-01 03:00:00', '2014-01-01 03:30:00'],
              dtype='datetime64[ns]', freq='30T')
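Following on from the warning above: if the gaps sit at the very edges of your data, min/max can't see them. When you know the window the series should cover, you can hedge by passing explicit endpoints instead (a sketch; the endpoint values here are hypothetical):
full_idx = pd.date_range(start='2014-01-01 00:00', end='2014-01-01 23:30', freq='30T')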
Now if we use that to reindex one of your grouped sub-DataFrames, we would get this:
grouped_df = df[df.LCLid == 'MAC000006']
grouped_df.reindex(full_idx)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 NaN NaN
2014-01-01 01:30:00 NaN NaN
2014-01-01 02:00:00 NaN NaN
2014-01-01 02:30:00 NaN NaN
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
You said you want to fill missing values using the closest available surrounding value. This can be done during reindexing, as follows:
grouped_df.reindex(full_idx, method='nearest')
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
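Note that method='nearest' fills each missing row from whichever valid neighbour is closer in time, which here happens to reproduce your manual fix (previous row for the first half of the gap, next row for the second half). If you would rather always fill from one side, reindex also accepts 'ffill' and 'bfill' (a sketch):
grouped_df.reindex(full_idx, method='ffill')  # always carry the last valid row forward
grouped_df.reindex(full_idx, method='bfill')  # always pull the next valid row backward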
Doing all the groups at once using df.groupby()
Now we'd like to apply this transformation to every group in your DataFrame, where
a group is defined by its LCLid.
(
    df
    .groupby('LCLid', as_index=False)  # use LCLid as groupby key, but don't add it as a group index
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  # do this for each group
    .reset_index(level=0, drop=True)  # get rid of the automatic index generated during groupby
    .sort_index()  # optional, in case you want timestamps in chronological order
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
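If the reset_index step feels opaque, an equivalent spelling of the same logic (a sketch, not a different algorithm) is to reindex each group explicitly and concatenate:
filled = pd.concat(
    group.reindex(full_idx, method='nearest')  # fill each LCLid's gaps from its own rows
    for _, group in df.groupby('LCLid')
).sort_index()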
Relevant doc:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html
Related
Pandas - Resample on MultiIndex based DataFrame and use of offset
I have a df which has a MultiIndex [(latitude, longitude, time)] with the number of rows being 148 x 244 x 90 x 24. For each latitude and longitude, the time is hourly from 2014-01-01 00:00:00 to 2014-03-31 23:00:00 in the UTC format.
                                            FFDI
latitude  longitude time
-39.20000 140.80000 2014-01-01 00:00:00  6.20000
                    2014-01-01 01:00:00  4.10000
                    2014-01-01 02:00:00  2.40000
                    2014-01-01 03:00:00  1.90000
                    2014-01-01 04:00:00  1.70000
                    2014-01-01 05:00:00  1.50000
                    2014-01-01 06:00:00  1.40000
                    2014-01-01 07:00:00  1.30000
                    2014-01-01 08:00:00  1.20000
                    2014-01-01 09:00:00  1.00000
                    2014-01-01 10:00:00  1.00000
                    2014-01-01 11:00:00  0.90000
                    2014-01-01 12:00:00  0.90000
...                 ...                      ...
                    2014-03-31 21:00:00  0.30000
                    2014-03-31 22:00:00  0.30000
                    2014-03-31 23:00:00  0.50000
          140.83786 2014-01-01 00:00:00  3.20000
                    2014-01-01 01:00:00  2.90000
                    2014-01-01 02:00:00  2.10000
                    2014-01-01 03:00:00  2.90000
                    2014-01-01 04:00:00  1.20000
                    2014-01-01 05:00:00  0.90000
                    2014-01-01 06:00:00  1.10000
                    2014-01-01 07:00:00  1.60000
                    2014-01-01 08:00:00  1.40000
                    2014-01-01 09:00:00  1.50000
                    2014-01-01 10:00:00  1.20000
                    2014-01-01 11:00:00  0.80000
                    2014-01-01 12:00:00  0.40000
...                 ...                      ...
                    2014-03-31 21:00:00  0.30000
                    2014-03-31 22:00:00  0.30000
                    2014-03-31 23:00:00  0.50000
...       ...       ...                      ...
-33.90000 140.80000 2014-01-01 00:00:00  6.20000
                    2014-01-01 01:00:00  4.10000
                    2014-01-01 02:00:00  2.40000
                    2014-01-01 03:00:00  1.90000
                    2014-01-01 04:00:00  1.70000
                    2014-01-01 05:00:00  1.50000
                    2014-01-01 06:00:00  1.40000
                    2014-01-01 07:00:00  1.30000
                    2014-01-01 08:00:00  1.20000
                    2014-01-01 09:00:00  1.00000
                    2014-01-01 10:00:00  1.00000
                    2014-01-01 11:00:00  0.90000
                    2014-01-01 12:00:00  0.90000
...                 ...                      ...
                    2014-03-31 21:00:00  0.30000
                    2014-03-31 22:00:00  0.30000
                    2014-03-31 23:00:00  0.50000
          140.83786 2014-01-01 00:00:00  3.20000
                    2014-01-01 01:00:00  2.90000
                    2014-01-01 02:00:00  2.10000
                    2014-01-01 03:00:00  2.90000
                    2014-01-01 04:00:00  1.20000
                    2014-01-01 05:00:00  0.90000
                    2014-01-01 06:00:00  1.10000
                    2014-01-01 07:00:00  1.60000
                    2014-01-01 08:00:00  1.40000
                    2014-01-01 09:00:00  1.50000
                    2014-01-01 10:00:00  1.20000
                    2014-01-01 11:00:00  0.80000
                    2014-01-01 12:00:00  0.40000
...                 ...                      ...
                    2014-03-31 21:00:00  0.30000
                    2014-03-31 22:00:00  0.30000
                    2014-03-31 23:00:00  0.50000
78001920 rows × 1 columns
I need to calculate a daily maximum FFDI value for a date using hourly values from 13:00:00 of the previous day to 12:00:00 of the current day to suit my time zone (+11). For example, when calculating the daily max FFDI for 2014-01-10 in the +11 time zone, I can use the hourly FFDI from 2014-01-09 13:00:00 to 2014-01-10 12:00:00.
df_daily_max = df.groupby(['latitude', 'longitude',
                           pd.Grouper(freq='24H', base=13, loffset='11H', level='time')])['FFDI'] \
                 .max().reset_index(name='Max FFDI')
The calculation starts at 13:00:00 and runs with a frequency of 24 hours. The output is:
   latitude                  longitude                 time                 Max FFDI
0  -39.20000076293945312500  140.80000305175781250000  2013-12-31 13:00:00  6.19999980926513671875
1  -39.20000076293945312500  140.80000305175781250000  2014-01-01 13:00:00  1.50000000000000000000
2  -39.20000076293945312500  140.80000305175781250000  2014-01-02 13:00:00  1.60000002384185791016
...
I would like the output to be:
   latitude                  longitude                 time        Max FFDI
0  -39.20000076293945312500  140.80000305175781250000  2014-01-01  6.19999980926513671875
1  -39.20000076293945312500  140.80000305175781250000  2014-01-02  1.50000000000000000000
2  -39.20000076293945312500  140.80000305175781250000  2014-01-03  1.60000002384185791016
...
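One possible way to get the date-only labels (a sketch, not a verified answer): each bin's 13:00 start label plus 11 hours lands exactly on midnight of the local day the bin belongs to, so you can shift and keep just the date:
# hypothetical post-processing of the df_daily_max produced by the groupby above
df_daily_max['time'] = (df_daily_max['time'] + pd.Timedelta(hours=11)).dt.date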
Unable to combine date and time in pandas
I would like to combine the following date and time columns into one date_time column:
casinghourly[['Date','Time']].head()
Out[275]:
        Date     Time
0 2014-01-01 00:00:00
1 2014-01-01 01:00:00
2 2014-01-01 02:00:00
3 2014-01-01 03:00:00
4 2014-01-01 04:00:00
I've used the following code:
casinghourly.loc[:,'Date_Time'] = pd.to_datetime(casinghourly.Date.astype(str)+' '+casinghourly.Time.astype(str))
But I get the following error:
ValueError: Unknown string format
Fyi:
casinghourly[['Date','Time']].dtypes
Out[276]:
Date     datetime64[ns]
Time    timedelta64[ns]
dtype: object
Can somebody help me here, please?
You can directly add datetime64[ns] and timedelta64[ns]:
df['Date'] = df['Date'] + df['Time']
print(df['Date'])
0   2014-01-01 00:00:00
1   2014-01-01 01:00:00
2   2014-01-01 02:00:00
3   2014-01-01 03:00:00
4   2014-01-01 04:00:00
Name: Date, dtype: datetime64[ns]
print(df)
                 Date     Time
0 2014-01-01 00:00:00 00:00:00
1 2014-01-01 01:00:00 01:00:00
2 2014-01-01 02:00:00 02:00:00
3 2014-01-01 03:00:00 03:00:00
4 2014-01-01 04:00:00 04:00:00
print(df.dtypes)
Date     datetime64[ns]
Time    timedelta64[ns]
dtype: object
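A minimal self-contained reproduction of the idea (a sketch with made-up column values):
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-01'] * 3),                     # datetime64[ns]
    'Time': pd.to_timedelta(['00:00:00', '01:00:00', '02:00:00']),  # timedelta64[ns]
})
df['Date_Time'] = df['Date'] + df['Time']  # datetime64 + timedelta64 adds directly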
How can I get a conditional hourly mean in pandas?
I have the DataFrame below, with date/time as a MultiIndex, and I want to build an hourly mean DataFrame from it, where each hour's value is the mean of just the 00:15:00~00:45:00 samples within that hour.
                        aaa
date       time
2017-01-01 00:00:00  146.88
           00:15:00  143.28
           00:30:00  143.28
           00:45:00  141.12
           01:00:00  134.64
           01:15:00  132.48
           01:30:00  136.80
           01:45:00  138.24
           02:00:00  131.76
           02:15:00  131.04
           02:30:00  134.64
           02:45:00  139.68
           03:00:00  136.08
           03:15:00  132.48
           03:30:00  132.48
           03:45:00  139.68
           04:00:00  134.64
           04:15:00  131.04
           04:30:00  160.56
           04:45:00  177.12
...
The result should look like the below. How can I do it?
                        aaa
date       time
2017-01-01 00:00:00  146.88
           01:00:00  134.64
           02:00:00  131.76
           03:00:00  136.08
           04:00:00  134.64
...
It seems you only need to select the rows whose time ends with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
                        aaa
date       time
2017-01-01 00:00:00  146.88
           01:00:00  134.64
           02:00:00  131.76
           03:00:00  136.08
           04:00:00  134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       lvl1new.where(m)],
                                      names=df1.index.names)
print (df1)
                        aaa
date       time
2017-01-01 NaN       146.88
           00:00:00  143.28
           00:00:00  143.28
           00:00:00  141.12
           NaN       134.64
           01:00:00  132.48
           01:00:00  136.80
           01:00:00  138.24
           NaN       131.76
           02:00:00  131.04
           02:00:00  134.64
           02:00:00  139.68
           NaN       136.08
           03:00:00  132.48
           03:00:00  132.48
           03:00:00  139.68
           NaN       134.64
           04:00:00  131.04
           04:00:00  160.56
           04:00:00  177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date        time
2017-01-01  00:00:00    142.56
            01:00:00    135.84
            02:00:00    135.12
            03:00:00    134.88
            04:00:00    156.24
Name: aaa, dtype: float64
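If the MultiIndex gymnastics feel heavy, an alternative sketch (assuming both index levels stringify to plain dates and times) is to flatten the index into one DatetimeIndex, keep only the quarter-hour samples, and resample; note the result then has a flat hourly DatetimeIndex rather than a date/time MultiIndex:
import pandas as pd

idx = pd.to_datetime(df1.index.get_level_values(0).astype(str) + ' '
                     + df1.index.get_level_values(1).astype(str))
s = pd.Series(df1['aaa'].to_numpy(), index=idx)
s = s[s.index.minute.isin([15, 30, 45])]  # keep only the 00:15/00:30/00:45 samples
hourly_mean = s.resample('H').mean()      # 00:15-00:45 fall into the 00:00 hourly bin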
Pandas timeseries groupby using TimeGrouper
I have a time series which looks like this:
            Time  Demand
Date
2014-01-01  0:00  2899.0
2014-01-01  0:15  2869.0
2014-01-01  0:30  2827.0
2014-01-01  0:45  2787.0
2014-01-01  1:00  2724.0
2014-01-01  1:15  2687.0
2014-01-01  1:30  2596.0
2014-01-01  1:45  2543.0
2014-01-01  2:00  2483.0
It is in 15-minute increments. I want the average for every hour of every day, so I tried something like df.groupby(pd.TimeGrouper(freq='H')).mean(). It didn't work out quite right because it returned mostly NaNs.
Now my dataset has data like this for the whole year, and I would like to calculate the mean for all the hours of all the months, such that I have 24 points, where each mean is taken over all hours of the year, e.g. the first hour gets the mean of the first hour of every day. The expected output would be:
2014 00:00:00  2884.0
2014 01:00:00  2807.0
2014 02:00:00  2705.5
2014 03:00:00  2569.5
..........
2014 23:00:00  2557.5
How can I achieve this?
I think you first need to add the Time column to the index:
df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
                     Time  Demand
2014-01-01 00:00:00  0:00  2899.0
2014-01-01 00:15:00  0:15  2869.0
2014-01-01 00:30:00  0:30  2827.0
2014-01-01 00:45:00  0:45  2787.0
2014-01-01 01:00:00  1:00  2724.0
2014-01-01 01:15:00  1:15  2687.0
2014-01-01 01:30:00  1:30  2596.0
2014-01-01 01:45:00  1:45  2543.0
2014-01-01 02:00:00  2:00  2483.0

print (df.groupby(pd.Grouper(freq='H')).mean())
#same as
#print (df.groupby(pd.TimeGrouper(freq='H')).mean())
                     Demand
2014-01-01 00:00:00  2845.5
2014-01-01 01:00:00  2637.5
2014-01-01 02:00:00  2483.0

Thanks pansen for another idea, resample:
print (df.resample("H").mean())
                     Demand
2014-01-01 00:00:00  2845.5
2014-01-01 01:00:00  2637.5
2014-01-01 02:00:00  2483.0

EDIT:
print (df)
            Time  Demand
Date
2014-01-01  0:00     1.0
2014-01-01  0:15     2.0
2014-01-01  0:30     4.0
2014-01-01  0:45     5.0
2014-01-01  1:00     1.0
2014-01-01  1:15     0.0
2015-01-01  1:30     1.0
2015-01-01  1:45     2.0
2015-01-01  2:00     3.0

df.index = df.index + pd.to_timedelta(df.Time + ':00')
print (df)
                     Time  Demand
2014-01-01 00:00:00  0:00     1.0
2014-01-01 00:15:00  0:15     2.0
2014-01-01 00:30:00  0:30     4.0
2014-01-01 00:45:00  0:45     5.0
2014-01-01 01:00:00  1:00     1.0
2014-01-01 01:15:00  1:15     0.0
2015-01-01 01:30:00  1:30     1.0
2015-01-01 01:45:00  1:45     2.0
2015-01-01 02:00:00  2:00     3.0

df1 = df.groupby([df.index.year, df.index.hour]).mean().reset_index()
df1.columns = ['year','hour','Demand']
print (df1)
   year  hour  Demand
0  2014     0     3.0
1  2014     1     0.5
2  2015     1     1.5
3  2015     2     3.0

For a DatetimeIndex use:
df1 = df.groupby([df.index.year, df.index.hour]).mean()
df1.index = pd.to_datetime(df1.index.get_level_values(0).astype(str) +
                           df1.index.get_level_values(1).astype(str), format='%Y%H')
print (df1)
                     Demand
2014-01-01 00:00:00     3.0
2014-01-01 01:00:00     0.5
2015-01-01 01:00:00     1.5
2015-01-01 02:00:00     3.0
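If what you ultimately want is the 24-point profile for the whole year (one mean per hour of day, pooled over every day), you can group on the hour attribute alone (a sketch, assuming the DatetimeIndex built above):
hourly_profile = df.groupby(df.index.hour)['Demand'].mean()  # index 0..23, one mean per hour of day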
Resample intraday pandas DataFrame without add new days
I want to downsample some intraday data without adding in new days:
df.resample('30Min')
This will add weekends etc., which is undesirable. Is there any way around this?
A combined groupby/resample might work:
In [22]: dates = pd.date_range('01-Jan-2014','11-Jan-2014', freq='T')[0:-1]
    ...: dates = dates[dates.dayofweek < 5]
    ...: s = pd.TimeSeries(np.random.randn(dates.size), dates)
    ...:

In [23]: s.size
Out[23]: 11520

In [24]: s.groupby(lambda d: d.date()).resample('30min').size
Out[24]: 384

In [25]: s.groupby(lambda d: d.date()).resample('30min')
Out[25]:
2014-01-01  2014-01-01 00:00:00    0.202943
            2014-01-01 00:30:00   -0.466010
            2014-01-01 01:00:00    0.029175
            2014-01-01 01:30:00   -0.064492
            2014-01-01 02:00:00   -0.113348
            2014-01-01 02:30:00    0.100408
            2014-01-01 03:00:00   -0.036561
            2014-01-01 03:30:00   -0.029578
            2014-01-01 04:00:00   -0.047602
            2014-01-01 04:30:00   -0.073846
            2014-01-01 05:00:00   -0.410143
            2014-01-01 05:30:00    0.143853
            2014-01-01 06:00:00   -0.077783
            2014-01-01 06:30:00   -0.122345
            2014-01-01 07:00:00    0.153003
...
2014-01-10  2014-01-10 16:30:00   -0.107377
            2014-01-10 17:00:00   -0.157420
            2014-01-10 17:30:00    0.201802
            2014-01-10 18:00:00   -0.189018
            2014-01-10 18:30:00   -0.310503
            2014-01-10 19:00:00   -0.086091
            2014-01-10 19:30:00   -0.090800
            2014-01-10 20:00:00   -0.263758
            2014-01-10 20:30:00   -0.036789
            2014-01-10 21:00:00    0.041957
            2014-01-10 21:30:00   -0.192332
            2014-01-10 22:00:00   -0.263690
            2014-01-10 22:30:00   -0.395939
            2014-01-10 23:00:00   -0.171149
            2014-01-10 23:30:00    0.263057
Length: 384

In [26]: np.unique(_25.index.get_level_values(1).minute)
Out[26]: array([ 0, 30])

In [27]: np.unique(_25.index.get_level_values(1).dayofweek)
Out[27]: array([0, 1, 2, 3, 4])
The easiest workaround right now is probably something like:
rs = df.resample('30min')
rs[rs.index.dayofweek < 5]
Probably the simplest way is to just do a dropna afterwards to get rid of the empty rows, e.g. df.resample('30Min').dropna() (on current pandas, where resample returns a Resampler object, that would be df.resample('30Min').mean().dropna()).
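A runnable sketch of that whole idea on current pandas (pd.TimeSeries is long gone; plain pd.Series replaces it):
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', '2014-01-11', freq='T')[:-1]
dates = dates[dates.dayofweek < 5]        # drop weekends up front
s = pd.Series(np.random.randn(dates.size), index=dates)

out = s.resample('30min').mean().dropna() # weekend bins come out all-NaN and vanish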