Create new columns by grouping and aggregating multiple columns in pandas - python

I have a dataframe with about 50 columns, including period_start_time, id, and speed_througput.
A sample of the dataframe:
   id   period_start_time  speed_througput  ...
0   1 2017-06-14 20:00:00                6
1   1 2017-06-14 20:00:00               10
2   1 2017-06-14 21:00:00                2
3   1 2017-06-14 21:00:00                5
4   2 2017-06-14 20:00:00                8
5   2 2017-06-14 20:00:00               12
...
I have tried to create two new columns by grouping on two columns (id and period_start_time) and computing the avg and min of speed_througput.
The code I've tried:
df['Throughput_avg'] = df.sort_values(['period_start_time'], ascending=False).groupby(['period_start_time', 'id'])[['speed_througput']].mean()
df['Throughput_min'] = df.groupby(['period_start_time', 'id'])[['speed_througput']].min()
As you can see, I have tried two ways, but neither works.
The error message I received for both attempts:
TypeError: incompatible index of inserted column with frame index
I suppose you know what my output needs to be, so there is no need to post it.

Option 1
Use agg in a groupby and join to attach the result to the main dataframe. The aggregation is indexed by the group keys, which pandas cannot align with the frame's original index when assigned directly (that is what raises the TypeError); joining on the key columns maps the values back:
df.join(
    df.groupby(['id', 'period_start_time']).speed_througput.agg(
        ['mean', 'min']
    ).rename(columns={'mean': 'avg'}).add_prefix('Throughput_'),
    on=['id', 'period_start_time']
)
   id   period_start_time  speed_througput  Throughput_avg  Throughput_min
0   1 2017-06-14 20:00:00                6             8.0               6
1   1 2017-06-14 20:00:00               10             8.0               6
2   1 2017-06-14 21:00:00                2             3.5               2
3   1 2017-06-14 21:00:00                5             3.5               2
4   2 2017-06-14 20:00:00                8            10.0               8
5   2 2017-06-14 20:00:00               12            10.0               8
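Note that join returns a new DataFrame rather than modifying df in place; assign the result back (df = df.join(...)) if you want to keep the new columns.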
Option 2
Use transform in a groupby context and use assign to add the new columns:
g = df.groupby(['id', 'period_start_time']).speed_througput.transform
df.assign(Throughput_avg=g('mean'), Throughput_min=g('min'))
   id   period_start_time  speed_througput  Throughput_avg  Throughput_min
0   1 2017-06-14 20:00:00                6             8.0               6
1   1 2017-06-14 20:00:00               10             8.0               6
2   1 2017-06-14 21:00:00                2             3.5               2
3   1 2017-06-14 21:00:00                5             3.5               2
4   2 2017-06-14 20:00:00                8            10.0               8
5   2 2017-06-14 20:00:00               12            10.0               8
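For reference, here is a minimal self-contained sketch that rebuilds the sample frame and applies Option 2 (the frame construction is an assumption based on the sample above). transform broadcasts each group's aggregate back onto the original row index, so the result aligns with df and no join is needed:

import pandas as pd

# Rebuild the sample frame from the question (assumed values).
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'period_start_time': pd.to_datetime(
        ['2017-06-14 20:00:00', '2017-06-14 20:00:00',
         '2017-06-14 21:00:00', '2017-06-14 21:00:00',
         '2017-06-14 20:00:00', '2017-06-14 20:00:00']),
    'speed_througput': [6, 10, 2, 5, 8, 12],
})

# transform returns a result indexed like the original frame.
g = df.groupby(['id', 'period_start_time']).speed_througput.transform
df = df.assign(Throughput_avg=g('mean'), Throughput_min=g('min'))
print(df)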

Related

How to reframe a time series data-frame by adding 3 columns that will contain the information of row-level partition of the data-frame [closed]

How to reframe a time series data-frame by adding 3 columns that will contain the information of a row-level partition of the data-frame?
You can get each of your columns with groupby and transform:
df["group"] = df.groupby(df["fault_code"].ne(df["fault_code"].shift()).cumsum()).ngroup().add(1)
df["count"] = df.groupby("group")["timestamp"].transform("count")
df["duration"] = df.groupby("group")["timestamp"].transform(lambda x: (x.max()-x.min()+pd.Timedelta(minutes=1)).total_seconds()/60)
>>> df
             timestamp fault_code  pulse  group  count  duration
0  2022-05-01 00:01:00          A      1      1      4       4.0
1  2022-05-01 00:02:00          A      1      1      4       4.0
2  2022-05-01 00:03:00          A      1      1      4       4.0
3  2022-05-01 00:04:00          A      1      1      4       4.0
4  2022-05-01 00:05:00          B      1      2      3       3.0
5  2022-05-01 00:06:00          B      1      2      3       3.0
6  2022-05-01 00:07:00          B      1      2      3       3.0
7  2022-05-01 00:08:00          A      1      3      2       2.0
8  2022-05-01 00:09:00          A      1      3      2       2.0
9  2022-05-01 00:10:00          C      1      4      5       5.0
10 2022-05-01 00:11:00          C      1      4      5       5.0
11 2022-05-01 00:12:00          C      1      4      5       5.0
12 2022-05-01 00:13:00          C      1      4      5       5.0
13 2022-05-01 00:14:00          C      1      4      5       5.0
14 2022-05-01 00:15:00          B      1      5      2       2.0
15 2022-05-01 00:16:00          B      1      5      2       2.0
16 2022-05-01 00:17:00          A      1      6      1       1.0
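To experiment with this, a minimal sketch that rebuilds an input frame matching the output above (the exact frame is an assumption reconstructed from that output):

import pandas as pd

# Hypothetical input: one row per minute, fault codes as shown above.
df = pd.DataFrame({
    'timestamp': pd.date_range('2022-05-01 00:01:00', periods=17, freq='min'),
    'fault_code': list('AAAABBBAACCCCCBBA'),
    'pulse': 1,
})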
Edit:
If you only want to fill the count and duration for the first row of each group, you can further do:
df[["count","duration"]] = df[["count","duration"]].where(~df["group"].duplicated()).fillna('')
>>> df
             timestamp fault_code  pulse  group count duration
0  2022-05-01 00:01:00          A      1      1   4.0      4.0
1  2022-05-01 00:02:00          A      1      1
2  2022-05-01 00:03:00          A      1      1
3  2022-05-01 00:04:00          A      1      1
4  2022-05-01 00:05:00          B      1      2   3.0      3.0
5  2022-05-01 00:06:00          B      1      2
6  2022-05-01 00:07:00          B      1      2
7  2022-05-01 00:08:00          A      1      3   2.0      2.0
8  2022-05-01 00:09:00          A      1      3
9  2022-05-01 00:10:00          C      1      4   5.0      5.0
10 2022-05-01 00:11:00          C      1      4
11 2022-05-01 00:12:00          C      1      4
12 2022-05-01 00:13:00          C      1      4
13 2022-05-01 00:14:00          C      1      4
14 2022-05-01 00:15:00          B      1      5   2.0      2.0
15 2022-05-01 00:16:00          B      1      5
16 2022-05-01 00:17:00          A      1      6   1.0      1.0
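Note that filling with '' converts count and duration to object dtype (a mix of floats and strings), which is fine for display but awkward for further computation; drop the fillna('') and keep the NaN from where(...) if you still need numeric columns.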

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['11/10/2017 0:00', '11/10/2017 03:00', '11/10/2017 06:00', '11/10/2017 09:00', '11/10/2017 12:00',
                            '11/11/2017 0:00', '11/11/2017 03:00', '11/11/2017 06:00', '11/11/2017 09:00', '11/11/2017 12:00',
                            '11/12/2017 00:00', '11/12/2017 03:00', '11/12/2017 06:00', '11/12/2017 09:00', '11/12/2017 12:00'],
                   'value': [850, np.nan, np.nan, np.nan, np.nan, 500, 650, 780, np.nan, 800, 350, 690, 780, np.nan, np.nan],
                   'consecutive_hour': [3, 0, 0, 0, 0, 3, 6, 9, 0, 3, 3, 6, 9, 0, 0]})
With this DataFrame, I want a third column, consecutive_hour, such that whenever the value at a timestamp is below 1000 it contributes 3 hours, and consecutive occurrences accumulate (3, 6, 9, ...) as shown above.
Lastly, I want to summarize the table by counting each consecutive-hours value and the number of days on which it occurs, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours': [3, 6, 9, 12],
                           'number_of_day': [2, 0, 2, 0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more; I spent several days on it but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
def cumcount_reset(b):
    # Cumulative count of consecutive True values, reset to 0 at each False.
    csum = b.cumsum()
    return csum.sub(csum.where(~b).ffill().fillna(0)).astype(int)

# Flag sub-1000 values, restart the count on each calendar day via
# pd.Grouper(freq="D"), then scale the count to 3-hour steps.
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(cumcount_reset).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table:
# Keep the last row of each within-day run (where consecutive_hour is about
# to drop back), then count on how many days each run length occurs.
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"]
                      .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                    "consecutive_hour"] \
    .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
    .rename("number_of_day") \
    .rename_axis("consecutive_hour") \
    .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0

Pandas DateTime only partially showing in Matplotlib

Problem: I am trying to make a bar chart based on a very simple Pandas DataFrame that has a DateTime index and integers in one column. Data below. Only some of the data is showing up in Matplotlib, however.
Code:
fig, ax = plt.subplots(figsize=(16,6))
ax.bar(unique_opps_by_month.index, unique_opps_by_month['Opportunity Name'])
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
ax.set_xlim('2016-01', '2021-04')
fig.autofmt_xdate()
The output, however, shows only a few of the bars (figure omitted). This doesn't match the data: there should be eight bars in 2017, not all the same height, and there are similar problems in the other years. Why are only some of the bars showing, and how can I make Matplotlib show all the data?
Data
2016-02-01 1
2016-05-01 1
2016-08-01 1
2016-09-01 1
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-07-01 3
2017-10-01 2
2017-11-01 3
2017-12-01 1
2018-02-01 2
2018-03-01 2
2018-04-01 2
2018-06-01 1
2018-07-01 1
2018-08-01 1
2018-11-01 1
2018-12-01 2
2019-03-01 5
2019-04-01 2
2019-05-01 1
2019-06-01 2
2019-07-01 1
2019-08-01 2
2019-09-01 4
2019-11-01 5
2020-01-01 4
2020-02-01 6
2020-03-01 1
2020-06-01 1
2020-07-01 2
2020-09-01 3
2020-10-01 5
2020-11-01 4
2020-12-01 6
2021-01-01 3
2021-02-01 6
2021-03-01 3
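A likely cause, offered as a hedged sketch since the thread's answer is not excerpted here: ax.bar measures bar width in x-axis data units, and on a date axis one unit is one day, so the default width of 0.8 days is far below one pixel on an axis spanning five years and most bars simply vanish. Passing an explicit width of roughly 20 days makes every bar visible:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# unique_opps_by_month is assumed to be the frame from the question:
# a DatetimeIndex of month starts with counts in 'Opportunity Name'.
fig, ax = plt.subplots(figsize=(16, 6))
ax.bar(unique_opps_by_month.index,
       unique_opps_by_month['Opportunity Name'],
       width=20)  # width in data units, i.e. days on a date axis
ax.xaxis.set_major_locator(mdates.MonthLocator((1, 4, 7, 10)))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
ax.set_xlim('2016-01', '2021-04')
fig.autofmt_xdate()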

Is there a way to apply a function to a MultiIndex dataframe slice with the same outer index without iterating each slice?

Basically, what I'm trying to accomplish is to fill in the missing dates (creating new DataFrame rows) for each product, then create a new column based on a cumulative sum of column 'A' (example shown below).
The data is a MultiIndex with (product, date) as indexes.
Basically I would like to apply this answer to a MultiIndex DataFrame using only the rightmost index and calculating a subsequent np.cumsum for each product (and all dates).
A
product date
0 2017-01-02 1
2017-01-03 2
2017-01-04 2
2017-01-05 1
2017-01-06 4
2017-01-07 1
2017-01-10 7
1 2018-06-29 1
2018-06-30 4
2018-07-01 1
2018-07-02 1
2018-07-04 2
What I want to accomplish (efficiently) is:
A CumSum
product date
0 2017-01-02 1 1
2017-01-03 2 3
2017-01-04 2 5
2017-01-05 1 6
2017-01-06 4 10
2017-01-07 1 11
2017-01-08 0 11
2017-01-09 0 11
2017-01-10 7 18
1 2018-06-29 1 1
2018-06-30 4 5
2018-07-01 1 6
2018-07-02 1 7
2018-07-03 0 7
2018-07-04 2 9
There are two ways to do this.
One way:
Use groupby with apply, resampling and taking the cumsum inside each group. Finally, pd.concat the result with df.A and fillna with 0:
s = (df.reset_index(0).groupby('product')
       .apply(lambda x: x.resample(rule='D').asfreq(0).A.cumsum()))
pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
Out[337]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
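Note that asfreq(0) both inserts the missing daily rows and fills their newly created values with 0 (the 0 is passed as fill_value), so the cumulative sum carries flat across the gaps.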
Another way:
Use two groupbys: the first for resample, the second for cumsum. Finally, use pd.concat and fillna with 0:
s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
Out[351]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
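For reference, a minimal sketch that rebuilds the sample MultiIndex frame so the snippets above run as-is (the construction is an assumption from the sample; the date level must be datetime-typed for resample to work):

import pandas as pd

# Hypothetical reconstruction of the (product, date) MultiIndex frame.
dates0 = pd.to_datetime(['2017-01-02', '2017-01-03', '2017-01-04',
                         '2017-01-05', '2017-01-06', '2017-01-07',
                         '2017-01-10'])
dates1 = pd.to_datetime(['2018-06-29', '2018-06-30', '2018-07-01',
                         '2018-07-02', '2018-07-04'])
idx = pd.MultiIndex.from_tuples(
    [(0, d) for d in dates0] + [(1, d) for d in dates1],
    names=['product', 'date'])
df = pd.DataFrame({'A': [1, 2, 2, 1, 4, 1, 7, 1, 4, 1, 1, 2]}, index=idx)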

Aggregate 15-minute timestamps to the hour and find sum, avg and max for multiple columns in pandas

I have a dataframe with period_start_time in 15-minute steps, and I need to aggregate to 1 hour and calculate the sum, avg and max for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
The same goes for val2 and every other column in the dataframe.
I have no idea how to group period_start_time by hour rather than by whole day, and no idea how to start.
I believe you need Series.dt.floor to round the timestamps down to the hour, and then aggregate with agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'), 'ID']).agg(['mean', 'sum', 'max'])
# Flatten the MultiIndex columns, e.g. ('val1', 'mean') -> 'val1_mean'.
df.columns = df.columns.map('_'.join)
print(df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print(df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly you can convert PERIOD_START_TIME to a pandas Period.
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
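As a modern alternative (pandas 0.25+), named aggregation produces the flat column names directly, without a separate columns.map step; a sketch assuming a frame with the question's layout and just the val1 column:

import pandas as pd

# Hypothetical frame matching the question's layout.
df = pd.DataFrame({
    'PERIOD_START_TIME': pd.to_datetime(
        ['2017-06-21 22:15:00', '2017-06-21 22:30:00',
         '2017-06-21 22:45:00', '2017-06-21 23:00:00']),
    'ID': [12, 12, 12, 12],
    'val1': [3, 5, 0, 5],
})

# Each keyword names an output column as (source column, aggregation).
out = (df.groupby([df['PERIOD_START_TIME'].dt.floor('H'), 'ID'])
         .agg(val1_mean=('val1', 'mean'),
              val1_sum=('val1', 'sum'),
              val1_max=('val1', 'max'))
         .reset_index())
print(out)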
