Pandas groupby apply is very slow - python

grouped = data_v1.sort_values(by = "Strike_Price").groupby(['dateTime','close','Index','Expiry','group'])
def calc_summary(group):
    name = group.name
    if name[3] == "above":
        call_oi = group['Call_OI'].sum()
        call_vol = group['Call_Volume'].sum()
        put_oi = group['Put_OI'].sum()
        put_vol = group['Put_Volume'].sum()
        call_oi_1 = group.head(1)['Call_OI'].sum()
        call_vol_1 = group.head(1)['Call_Volume'].sum()
        put_oi_1 = group.head(1)['Put_OI'].sum()
        put_vol_1 = group.head(1)['Put_Volume'].sum()
    else:
        call_oi = group['Call_OI'].sum()
        call_vol = group['Call_Volume'].sum()
        put_oi = group['Put_OI'].sum()
        put_vol = group['Put_Volume'].sum()
        call_oi_1 = group.tail(1)['Call_OI'].sum()
        call_vol_1 = group.tail(1)['Call_Volume'].sum()
        put_oi_1 = group.tail(1)['Put_OI'].sum()
        put_vol_1 = group.tail(1)['Put_Volume'].sum()
    summary = pd.DataFrame([{'call_oi': call_oi,
                             'call_vol': call_vol,
                             'put_oi': put_oi,
                             'put_vol': put_vol,
                             'call_oi_1': call_oi_1,
                             'call_vol_1': call_vol_1,
                             'put_oi_1': put_oi_1,
                             'put_vol_1': put_vol_1}])
    return summary

result = grouped.apply(calc_summary)
The code above takes too much time to run given that the dataset is not even that big. Currently, it takes about 23 seconds on my system.
I tried swifter, but that doesn't work with groupby objects.
What should I do to make my code faster?
Edit:
The data looks like this
{'dateTime': {0: Timestamp('2023-02-06 09:21:00'),
1: Timestamp('2023-02-06 09:21:00'),
2: Timestamp('2023-02-06 09:21:00'),
3: Timestamp('2023-02-06 09:21:00'),
4: Timestamp('2023-02-06 09:21:00')},
'close': {0: 17780.55, 1: 17780.55, 2: 17780.55, 3: 17780.55, 4: 17780.55},
'Index': {0: 'NIFTY', 1: 'NIFTY', 2: 'NIFTY', 3: 'NIFTY', 4: 'NIFTY'},
'Expiry': {0: '16FEB2023',
1: '23FEB2023',
2: '9FEB2023',
3: '16FEB2023',
4: '23FEB2023'},
'Expiry_order': {0: 'week_2',
1: 'week_3',
2: 'week_1',
3: 'week_2',
4: 'week_3'},
'group': {0: 'below', 1: 'below', 2: 'below', 3: 'below', 4: 'below'},
'Call_OI': {0: nan, 1: 60.0, 2: 4.0, 3: nan, 4: nan},
'Put_OI': {0: 1364.0, 1: 11255.0, 2: 91059.0, 3: 343.0, 4: 153.0},
'Call_Volume': {0: nan, 1: 3.0, 2: 2.0, 3: nan, 4: nan},
'Put_Volume': {0: 84.0, 1: 1246.0, 2: 5197.0, 3: 24.0, 4: 1.0},
'Strike_Price': {0: 16100.0, 1: 16100.0, 2: 16100.0, 3: 16150.0, 4: 16150.0}}

Using your sample data:
import io
import pandas as pd
csv = """
dateTime,close,Index,Expiry,Expiry_order,group,Call_OI,Put_OI,Call_Volume,Put_Volume,Strike_Price
2023-02-06 09:21:00,17780.55,NIFTY,16FEB2023,week_2,below,,1364.0,,84.0,16100.0
2023-02-06 09:21:00,17780.55,NIFTY,23FEB2023,week_3,below,60.0,11255.0,3.0,1246.0,16100.0
2023-02-06 09:21:00,17780.55,NIFTY,9FEB2023,week_1,below,4.0,91059.0,2.0,5197.0,16100.0
2023-02-06 09:21:00,17780.55,NIFTY,16FEB2023,week_2,below,,343.0,,24.0,16150.0
2023-02-06 09:21:00,17780.55,NIFTY,23FEB2023,week_3,below,,153.0,,1.0,16150.0
"""
df = pd.read_csv(io.StringIO(csv))
The output of your calc_summary function:
>>> df.sort_values(by='Strike_Price').groupby(['dateTime', 'close', 'Index', 'Expiry', 'group']).apply(calc_summary)
call_oi call_vol put_oi put_vol call_oi_1 call_vol_1 put_oi_1 put_vol_1
dateTime close Index Expiry group
2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0 0.0 0.0 1707.0 108.0 0.0 0.0 343.0 24.0
23FEB2023 below 0 60.0 3.0 11408.0 1247.0 0.0 0.0 153.0 1.0
9FEB2023 below 0 4.0 2.0 91059.0 5197.0 4.0 2.0 91059.0 5197.0
.agg()
You're performing an aggregation where you conditionally want the head/tail depending on the value of the group column.
You could aggregate both values instead and then do the filtering afterwards.
This allows you to use .agg() directly.
We can use first and last aggregations for head/tail but must first fillna(0) as they handle NaN values differently.
summary = (
    df.fillna(0)  # needed for first/last as they ignore NaN
    .sort_values(by='Strike_Price')
    .groupby(['dateTime', 'close', 'Index', 'Expiry', 'group'])
    [['Call_OI', 'Call_Volume', 'Put_OI', 'Put_Volume']]
    .agg(['first', 'last', 'sum'])
    .reset_index()
)
Which produces a multi-indexed column structure like:
dateTime close Index Expiry group Call_OI Call_Volume Put_OI Put_Volume
first last sum first last sum first last sum first last sum
0 2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0.0 0.0 0.0 0.0 0.0 0.0 1364.0 343.0 1707.0 84.0 24.0 108.0
1 2023-02-06 09:21:00 17780.55 NIFTY 23FEB2023 below 60.0 0.0 60.0 3.0 0.0 3.0 11255.0 153.0 11408.0 1246.0 1.0 1247.0
2 2023-02-06 09:21:00 17780.55 NIFTY 9FEB2023 below 4.0 4.0 4.0 2.0 2.0 2.0 91059.0 91059.0 91059.0 5197.0 5197.0 5197.0
To say you want the last values when group != "above" you can:
>>> below = summary.loc[summary['group'] != 'above', summary.columns.get_level_values(1) != 'first']
>>> below
dateTime close Index Expiry group Call_OI Call_Volume Put_OI Put_Volume
last sum last sum last sum last sum
0 2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0.0 0.0 0.0 0.0 343.0 1707.0 24.0 108.0
1 2023-02-06 09:21:00 17780.55 NIFTY 23FEB2023 below 0.0 60.0 0.0 3.0 153.0 11408.0 1.0 1247.0
2 2023-02-06 09:21:00 17780.55 NIFTY 9FEB2023 below 4.0 4.0 2.0 2.0 91059.0 91059.0 5197.0 5197.0
To flatten the column structure similar to your functions output you can:
>>> below.columns = [left.lower() + ('' if right in {'', 'sum'} else '_1') for left, right in below.columns]
>>> below
datetime close index expiry group call_oi_1 call_oi call_volume_1 call_volume put_oi_1 put_oi put_volume_1 put_volume
0 2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0.0 0.0 0.0 0.0 343.0 1707.0 24.0 108.0
1 2023-02-06 09:21:00 17780.55 NIFTY 23FEB2023 below 0.0 60.0 0.0 3.0 153.0 11408.0 1.0 1247.0
2 2023-02-06 09:21:00 17780.55 NIFTY 9FEB2023 below 4.0 4.0 2.0 2.0 91059.0 91059.0 5197.0 5197.0
There are no examples of above in your data - but you could do the same for those rows using == 'above' and != 'last' and concat both sets of rows into a single dataframe.
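For example, a rough sketch of that last step (only a sketch, since your sample has no 'above' rows; it reuses the summary and below frames built above):
# keep the 'first' (head) and 'sum' aggregates for the 'above' groups
above = summary.loc[summary['group'] == 'above', summary.columns.get_level_values(1) != 'last']
# flatten the columns the same way, then stitch both subsets back together
above.columns = [left.lower() + ('' if right in {'', 'sum'} else '_1') for left, right in above.columns]
result = pd.concat([below, above], ignore_index=True)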
Polars
You may also wish to compare how the dataset performs with polars.
One possible approach which generates the same output:
import io
import polars as pl
df = pl.read_csv(io.StringIO(csv))
columns = ["Call_OI", "Call_Volume", "Put_OI", "Put_Volume"]
(
    df
    .sort("Strike_Price")
    .groupby(["dateTime", "close", "Index", "Expiry", "group"], maintain_order=True)
    .agg([
        pl.col(columns).sum(),
        pl.when(pl.col("group").first() == "above")
        .then(pl.col(columns).first())
        .otherwise(pl.col(columns).last())
        .suffix("_1")
    ])
    .fill_null(0)
)
shape: (3, 13)
┌─────────────────────┬──────────┬───────┬───────────┬───────┬─────────┬─────────────┬─────────┬────────────┬───────────┬───────────────┬──────────┬──────────────┐
│ dateTime | close | Index | Expiry | group | Call_OI | Call_Volume | Put_OI | Put_Volume | Call_OI_1 | Call_Volume_1 | Put_OI_1 | Put_Volume_1 │
│ --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- │
│ str | f64 | str | str | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 │
╞═════════════════════╪══════════╪═══════╪═══════════╪═══════╪═════════╪═════════════╪═════════╪════════════╪═══════════╪═══════════════╪══════════╪══════════════╡
│ 2023-02-06 09:21:00 | 17780.55 | NIFTY | 16FEB2023 | below | 0.0 | 0.0 | 1707.0 | 108.0 | 0.0 | 0.0 | 343.0 | 24.0 │
│ 2023-02-06 09:21:00 | 17780.55 | NIFTY | 23FEB2023 | below | 60.0 | 3.0 | 11408.0 | 1247.0 | 0.0 | 0.0 | 153.0 | 1.0 │
│ 2023-02-06 09:21:00 | 17780.55 | NIFTY | 9FEB2023 | below | 4.0 | 2.0 | 91059.0 | 5197.0 | 4.0 | 2.0 | 91059.0 | 5197.0 │
└─────────────────────┴──────────┴───────┴───────────┴───────┴─────────┴─────────────┴─────────┴────────────┴───────────┴───────────────┴──────────┴──────────────┘
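Note: on more recent polars releases (an assumption about the version you have installed), groupby was renamed to group_by and .suffix() moved to .name.suffix(), so the same query would look roughly like:
(
    df
    .sort("Strike_Price")
    .group_by(["dateTime", "close", "Index", "Expiry", "group"], maintain_order=True)
    .agg(
        pl.col(columns).sum(),
        pl.when(pl.col("group").first() == "above")
        .then(pl.col(columns).first())
        .otherwise(pl.col(columns).last())
        .name.suffix("_1"),
    )
    .fill_null(0)
)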

Related

Replace nan values with data from previous months

I have a DataFrame as follows. This DataFrame contains NaN values. I want to replace the NaN values with the earlier non-NaN value in my DataFrame from the previous month(s):
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | nan
2022-02-02 | nan
2022-03-02 | nan
2022-04-02 | nan
...
2022-01-03 | nan
2022-02-03 | nan
2022-03-03 | nan
2022-04-03 | nan
Desired outcome
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | 1
2022-02-02 | 2
2022-03-02 | 3
2022-04-02 | 4
...
2022-01-03 | 1
2022-02-03 | 2
2022-03-03 | 3
2022-04-03 | 4
Data:
{'date (y-d-m)': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
'2022-01-02', '2022-02-02', '2022-03-02', '2022-04-02',
'2022-01-03', '2022-02-03', '2022-03-03', '2022-04-03'],
'value': [1.0, 2.0, 3.0, 4.0, nan, nan, nan, nan, nan, nan, nan, nan]}
You could convert the "date (y-d-m)" column to datetime, then group by day and forward fill with ffill (values from the same day of previous months):
df['date (y-d-m)'] = pd.to_datetime(df['date (y-d-m)'], format='%Y-%d-%m')
df['value'] = df.groupby(df['date (y-d-m)'].dt.day)['value'].ffill()
Output:
date (y-d-m) value
0 2022-01-01 1.0
1 2022-01-02 2.0
2 2022-01-03 3.0
3 2022-01-04 4.0
4 2022-02-01 1.0
5 2022-02-02 2.0
6 2022-02-03 3.0
7 2022-02-04 4.0
8 2022-03-01 1.0
9 2022-03-02 2.0
10 2022-03-03 3.0
11 2022-03-04 4.0

Convert list with multiple entries per day to standard datetime index and give each entry its own column

I have a file that looks like this:
Date | col1 | col2 | col3
2010-01-01 | -1.4 | 0.0 | 0.0
2010-01-01 | -1.4 | 0.0 | 0.0
2010-01-01 | -2.4 | 0.0 | 0.66
2010-01-02 | -2.4 | 0.0 | 0.08
2010-01-02 | -4.3 | 0.0 | 0.1
2010-01-02 | -4.3 | 0.0 | 1.04
Within a given day, each row refers to a specific city, so for 2010-01-01 there is data for 3 cities, the same for 2010-01-02 and all other days (it's always the same amount, at the moment 13 cities = 13 rows per day).
The city names are in a list whose order matches the order of the rows within each day:
["city1", "city2", "city3"]
So "city1" is the first row for each day, then "city2", then "city3" and so on.
I need to get this into a standard format where I can set the Date as the index, so I need something like this:
Date | city1_col1 | city1_col2 | city1_col3 | city2_col1| city2_col2 | city2_col3 | city3_col1| city3_col2 | city3_col3
2010-01-01 | -1.4 | 0.0 | 0.0 | -1.4 | 0.0 | 0.0 | -2.4 | 0.0 | 0.66
2010-01-02 | -2.4 | 0.0 | 0.08 | -4.3 | 0.0 | 0.1 | -4.3 | 0.0 | 1.04
The data is later merged with other dataframes where the indexes are also the days of the year so a multiindex won't work.
How can I achieve this with pandas?
Here's a way to do that:
df["city"] = cities * (len(df) // len(cities))
df = pd.pivot_table(df, index="Date", columns="city")
df.columns = [c[1] + "_" + c[0] for c in df.columns]
df=df.sort_index(axis=1)
The output is:
city1_col1 city1_col2 city1_col3 city2_col1 city2_col2 city2_col3 city3_col1 city3_col2 city3_col3
Date
2010-01-01 -1.4 0.0 0.00 -1.4 0.0 0.0 -2.4 0.0 0.66
2010-01-02 -2.4 0.0 0.08 -4.3 0.0 0.1 -4.3 0.0 1.04
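For completeness, here is a self-contained sketch of the whole pipeline on the sample data above (the cities list and column names come from the question; nothing else is assumed):
import pandas as pd

cities = ["city1", "city2", "city3"]
df = pd.DataFrame({
    "Date": ["2010-01-01"] * 3 + ["2010-01-02"] * 3,
    "col1": [-1.4, -1.4, -2.4, -2.4, -4.3, -4.3],
    "col2": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "col3": [0.0, 0.0, 0.66, 0.08, 0.1, 1.04],
})

# label every row with its city (the list repeats once per day), then pivot wide
df["city"] = cities * (len(df) // len(cities))
wide = pd.pivot_table(df, index="Date", columns="city")
wide.columns = [f"{city}_{col}" for col, city in wide.columns]  # flatten (value, city) pairs
wide = wide.sort_index(axis=1)
print(wide)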

maximum variation within one second for each row of a DataFrame

I'm having a calculation problem with pandas and I'd like to know if anyone could help me.
I have this df, created using this code:
df = pd.DataFrame({'B': [0, 2, 1, np.nan, 4, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
df
| | B |
|-------------------------|------|
| 2013-01-01 09:31:23.999 | 0.0 |
| 2013-01-01 09:31:24.200 | 2.0 |
| 2013-01-01 09:31:24.250 | 1.0 |
| 2013-01-01 09:31:25.000 | NaN |
| 2013-01-01 09:31:25.375 | 4.0 |
| 2013-01-01 09:31:25.850 | 1.0 |
| 2013-01-01 09:31:26.100 | 3.0 |
| 2013-01-01 09:31:27.150 | 10.0 |
| 2013-01-01 09:31:28.050 | NaN |
| 2013-01-01 09:31:28.850 | 3.0 |
| 2013-01-01 09:31:29.200 | 6.0 |
I would like to be able to calculate for each row what the maximum variation of B has been during one second.
For example, for the first row you would look at how much it has changed with respect to the second and third rows, which are the ones within the one-second interval, and calculate the difference with the maximum value.
In this case, the maximum value is in the second row "09:31:24.200", the maximum variation will be 2 - 0.
Then, we will create a new column with all these maximum variations for each of the rows.
df
| | B | Maximum Variation |
|-------------------------|------|--------------------|
| 2013-01-01 09:31:23.999 | 0.0 | 2.0 |
| 2013-01-01 09:31:24.200 | 2.0 | 1.0 |
| 2013-01-01 09:31:24.250 | 1.0 | 0.0 |
| 2013-01-01 09:31:25.000 | NaN | 4.0 |
| 2013-01-01 09:31:25.375 | 4.0 |-3.0 |
| 2013-01-01 09:31:25.850 | 1.0 | 2.0 |
| 2013-01-01 09:31:26.100 | 3.0 | 0.0 |
| 2013-01-01 09:31:27.150 | 10.0 | 0.0 |
| 2013-01-01 09:31:28.050 | NaN | 3.0 |
| 2013-01-01 09:31:28.850 | 3.0 | 3.0 |
| 2013-01-01 09:31:29.200 | 6.0 | 0.0 |
I hope it's clear enough
A solution has been found and shared in the answers, but an efficiency improvement that doesn't involve looping over each row of the df would still be more than welcome.
I've finally found the solution:
df = pd.DataFrame({'B': [0, 1, 2, 8, 6, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
df = df.reset_index()
df = df.rename(columns={"index": "start_date"})
df['duration_in_seconds'] = 1
df['end_date'] = df['start_date'] + pd.to_timedelta(df['duration_in_seconds'], unit='s')
df['max'] = np.nan
for index, row in df.iterrows():
    start = row['start_date']
    end = row['end_date']
    maxi = df[(df['start_date'] >= start) & (df['start_date'] <= end)]['B'].max()
    df.iloc[index, df.columns.get_loc('max')] = maxi
df['Maximum Variation'] = df['max'] - df['B']
df
| | start_date | B | duration_in_seconds | end_date | max | Maximum Variation |
|----|-------------------------|------|---------------------|-------------------------|------|-------------------|
| 0 | 2013-01-01 09:31:23.999 | 0.0 | 1 | 2013-01-01 09:31:24.999 | 2.0 | 2.0 |
| 1 | 2013-01-01 09:31:24.200 | 1.0 | 1 | 2013-01-01 09:31:25.200 | 8.0 | 7.0 |
| 2 | 2013-01-01 09:31:24.250 | 2.0 | 1 | 2013-01-01 09:31:25.250 | 8.0 | 6.0 |
| 3 | 2013-01-01 09:31:25.000 | 8.0 | 1 | 2013-01-01 09:31:26.000 | 8.0 | 0.0 |
| 4 | 2013-01-01 09:31:25.375 | 6.0 | 1 | 2013-01-01 09:31:26.375 | 6.0 | 0.0 |
| 5 | 2013-01-01 09:31:25.850 | 1.0 | 1 | 2013-01-01 09:31:26.850 | 3.0 | 2.0 |
| 6 | 2013-01-01 09:31:26.100 | 3.0 | 1 | 2013-01-01 09:31:27.100 | 3.0 | 0.0 |
| 7 | 2013-01-01 09:31:27.150 | 10.0 | 1 | 2013-01-01 09:31:28.150 | 10.0 | 0.0 |
| 8 | 2013-01-01 09:31:28.050 | NaN | 1 | 2013-01-01 09:31:29.050 | 3.0 | NaN |
| 9 | 2013-01-01 09:31:28.850 | 3.0 | 1 | 2013-01-01 09:31:29.850 | 6.0 | 3.0 |
| 10 | 2013-01-01 09:31:29.200 | 6.0 | 1 | 2013-01-01 09:31:30.200 | 6.0 | 0.0 |
More time efficient solutions are still welcome
More efficient solution
df = df.reset_index()
df = df.rename(columns={"index": "start_date"})
df['duration_in_seconds'] = 1
df['end_date'] = df['start_date'] + pd.to_timedelta(df['duration_in_seconds'], unit='s')
df['max'] = np.nan
df["max"] = df.apply(lambda row : df.loc[(df["start_date"] >= row['start_date']) & (df["start_date"] <=row['end_date'])]["B"].max(), axis = 1)
df['Maximum Variation'] = df['max'] - df['B']
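If that is still too slow, a rough alternative sketch (my own assumption, not part of the answers above): find every row's 1-second window once with numpy's searchsorted, so each row only scans its own slice instead of re-filtering the whole frame. It continues from the frame above, which already has the sorted start_date column:
import numpy as np

t = df['start_date'].to_numpy()                      # sorted timestamps
ends = np.searchsorted(t, t + np.timedelta64(1, 's'), side='right')
b = df['B'].to_numpy()

# max of B over [start, start + 1s] for each row; NaN if the whole window is NaN
df['max'] = [
    np.nanmax(b[i:j]) if not np.all(np.isnan(b[i:j])) else np.nan
    for i, j in enumerate(ends)
]
df['Maximum Variation'] = df['max'] - df['B']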
import numpy as np
import pandas as pd
df = pd.DataFrame({'B': [0, 2, 1, np.nan, 4, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
print(df)
B
2013-01-01 09:31:23.999 0.0
2013-01-01 09:31:24.200 2.0
2013-01-01 09:31:24.250 1.0
2013-01-01 09:31:25.000 NaN
2013-01-01 09:31:25.375 4.0
2013-01-01 09:31:25.850 1.0
2013-01-01 09:31:26.100 3.0
2013-01-01 09:31:27.150 10.0
2013-01-01 09:31:28.050 NaN
2013-01-01 09:31:28.850 3.0
2013-01-01 09:31:29.200 6.0
df_min = df.resample('1S').min()
print(df_min)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 1.0
2013-01-01 09:31:25 1.0
2013-01-01 09:31:26 3.0
2013-01-01 09:31:27 10.0
2013-01-01 09:31:28 3.0
2013-01-01 09:31:29 6.0
df_max = df.resample('1S').max()
print(df_max)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 2.0
2013-01-01 09:31:25 4.0
2013-01-01 09:31:26 3.0
2013-01-01 09:31:27 10.0
2013-01-01 09:31:28 3.0
2013-01-01 09:31:29 6.0
df_diff = df_max - df_min
print(df_diff)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 1.0
2013-01-01 09:31:25 3.0
2013-01-01 09:31:26 0.0
2013-01-01 09:31:27 0.0
2013-01-01 09:31:28 0.0
2013-01-01 09:31:29 0.0

Efficiently apply several different operations to a dataset

I have to do several different operations on many columns of a dataset; I did it, but not in a very efficient way...
As an example, I have this table:
| A | B | C | D | E |
|------|------|------|------|------|
| 1.0 | 1.0 | 1.0 | 2.0 | a |
| 2.0 | 1.0 | 1.5 | 5.0 | a |
| 3.0 | 1.0 | 2.0 | 3.0 | b |
| 1.0 | 2.0 | 2.0 | 6.0 | a |
| 2.0 | 2.0 | 3.0 | 4.0 | b |
| 3.0 | 2.0 | 4.0 | 2.0 | b |
| 1.0 | 3.0 | 5.0 | 5.0 | b |
| 2.0 | 3.0 | 6.0 | 1.0 | a |
| 3.0 | 3.0 | 10.0 | 2.0 | a |
And I need to get the following result:
# I don't need the A column; the criterion is the B column: apply the mean
# to C, the sum to D, and the most frequent value to E
| B | C | D | E |
|------|------|------|------|
| 1.0 | 1.5 | 10.0 | a |
| 2.0 | 3.0 | 12.0 | b |
| 3.0 | 7.0 | 8.0 | a |
Here is my attempt, but it is extremely slow. My original dataset has 2,000,000 rows. Transforming it to 130,000 takes more than 30 minutes, and I have to apply it three times... this is why I need something more efficient.
import pandas as pd
df = pd.DataFrame({"A":[1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
"B":[1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
"C":[1.0, 1.5, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0],
"D":[2.0, 5.0, 3.0, 6.0, 4.0, 2.0, 5.0, 1.0, 2.0],
"E":['a', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'a']})
print(df)
dict_ds = { 'B' : [], 'C' : [], 'D' : [], 'E' : []}
df2 = pd.DataFrame(dict_ds)
df=df.groupby('B')
for n in df.first().index:
    data = df.get_group(n)
    partial = data.mean()
    new_C = partial['C']
    partial = data.sum()
    new_D = partial['D']
    new_E = data['E'].mode()[0]
    df2.loc[len(df2)] = (n, new_C, new_D, new_E)
print(df2)
This part was added after getting the solution.
If I apply the unique operation in the agg:
df.groupby('B').agg({
    'A': 'unique',
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: x.mode()
}).reset_index()
I have the next result:
B A C D E
0 1.0 [1.0, 2.0, 3.0] 1.5 10.0 a
1 2.0 [1.0, 2.0, 3.0] 3.0 12.0 b
2 3.0 [1.0, 2.0, 3.0] 7.0 8.0 a
But I need to have it in this other way:
A B C D E
0 1.0 1.0 1.5 10.0 a
1 2.0 1.0 1.5 10.0 a
2 3.0 1.0 1.5 10.0 a
3 1.0 2.0 3.0 12.0 b
4 2.0 2.0 3.0 12.0 b
5 3.0 2.0 3.0 12.0 b
6 1.0 3.0 7.0 8.0 a
7 2.0 3.0 7.0 8.0 a
8 3.0 3.0 7.0 8.0 a
Is it possible to have something similar? A very efficient way?
new_df = df.groupby('B').agg({
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: x.mode()
})
>>> new_df
B C D E
1.0 1.5 10.0 a
2.0 3.0 12.0 b
3.0 7.0 8.0 a
EDIT: For your 2nd question...
I can't guarantee that this will be efficient but it gets what you want done:
df_1 = new_df['A'].apply(pd.Series).unstack().reset_index(level = 0, drop = True)
df_1.name = 'A'
df_2 = new_df[[col for col in df.columns if col != 'A']]
df_2.name = 'others'
pd.merge(df_1, df_2, left_index = True, right_index = True).reset_index(drop = True)
>>> output
A B C D E
0 1.0 1.0 1.5 10.0 a
0 2.0 1.0 1.5 10.0 a
0 3.0 1.0 1.5 10.0 a
1 1.0 2.0 3.0 12.0 b
1 2.0 2.0 3.0 12.0 b
1 3.0 2.0 3.0 12.0 b
2 1.0 3.0 7.0 8.0 a
2 2.0 3.0 7.0 8.0 a
2 3.0 3.0 7.0 8.0 a
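An alternative sketch for that broadcast step (my own suggestion, not part of the answer above), assuming df is the original 9-row frame and new_df is the aggregated frame from the first snippet: merge the group results back onto the rows on B, which keeps A and repeats the aggregates per row:
# new_df is indexed by B, so reset_index turns B back into a join key
expanded = df[['A', 'B']].merge(new_df.reset_index(), on='B', how='left')
print(expanded)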

Construct new columns based on all previous pairs information using Pandas

I am trying to create 3 new columns in a dataframe, which are based on previous pairs information.
You can think of the dataframe as the results of competition ('xx' column) within different types ('type' column) at different dates ('date' column).
The idea is to create the following new columns:
(i) numb_comp_past: sum of the number of times each type faced its competitors in the past.
(ii) win_comp_past: sum of the wins (+1), ties (0), and losses (-1) of the previous competitions that the types competing with each other had in the past.
(iii) win_comp_past_difs: sum of the differences of the results of the previous competitions that the types competing with each other had in the past.
The original dataframe (df) is the following:
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},{'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},{'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y')
df=df.set_index(['date','type'])
df['xx'] = df.xx.astype('float')
And it looks like this:
xx
date type
2018-01-01 A 1.0
B 5.0
2018-02-01 B 3.0
2018-03-01 A 2.0
B 7.0
C 3.0
D 1.0
E 6.0
2018-05-01 B 3.0
2018-06-01 A 5.0
B 2.0
C 3.0
2018-07-01 A 1.0
2018-08-01 B 9.0
C 3.0
2018-09-01 A 2.0
B 7.0
2018-10-01 C 3.0
A 6.0
B 8.0
2018-11-01 A 2.0
2018-12-01 B 7.0
C 9.0
The 3 new columns I need to add to the dataframe are shown below (expected output of the Pandas code):
xx numb_comp_past win_comp_past win_comp_past_difs
date type
2018-01-01 A 1.0 0.0 0.0 0.0
B 5.0 0.0 0.0 0.0
2018-02-01 B 3.0 0.0 0.0 0.0
2018-03-01 A 2.0 1.0 -1.0 -4.0
B 7.0 1.0 1.0 4.0
C 3.0 0.0 0.0 0.0
D 1.0 0.0 0.0 0.0
E 6.0 0.0 0.0 0.0
2018-05-01 B 3.0 0.0 0.0 0.0
2018-06-01 A 5.0 3.0 -3.0 -10.0
B 2.0 3.0 3.0 13.0
C 3.0 2.0 0.0 -3.0
2018-07-01 A 1.0 0.0 0.0 0.0
2018-08-01 B 9.0 2.0 0.0 3.0
C 3.0 2.0 0.0 -3.0
2018-09-01 A 2.0 3.0 -1.0 -6.0
B 7.0 3.0 1.0 6.0
2018-10-01 C 3.0 5.0 -1.0 -10.0
A 6.0 6.0 -2.0 -10.0
B 8.0 7.0 3.0 20.0
2018-11-01 A 2.0 0.0 0.0 0.0
2018-12-01 B 7.0 4.0 2.0 14.0
C 9.0 4.0 -2.0 -14.0
Note that:
(i) for numb_comp_past if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, the type A has a value of 3 given that he previously competed with type B on 2018-01-01 and 2018-03-01 and with type C on 2018-03-01.
(ii) for win_comp_past if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, the type A has a value of -3 given that he previously lost with type B on 2018-01-01 (-1) and 2018-03-01 (-1) and with type C on 2018-03-01 (-1). Thus adding -1-1-1=-3.
(iii) for win_comp_past_difs if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of -10 given that it previously lost to type B on 2018-01-01 by a difference of -4 (=1-5) and on 2018-03-01 by a difference of -5 (=2-7), and to type C on 2018-03-01 by -1 (=2-3). Thus adding -4-5-1=-10.
I really don't know how to start solving this problem. Any ideas on how to compute the new columns described in (i), (ii) and (iii) are very welcome.
Here's my take:
# get differences of pairs, useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    tmp = pd.DataFrame(x[:, None] - x[None, :],
                       columns=teams.values,
                       index=teams.values).stack()
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

new_df = df.groupby('date').xx.apply(get_diff).to_frame()

# win matches
new_df['win'] = new_df.xx.ge(0).astype(int) - new_df.xx.le(0).astype(int)

# group by players
groups = new_df.groupby(level=[1, 2])

# sum function
def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values
df['num_comp_past'] = groups.xx.cumcount().sum(level=[0, 1])
df['win_comp_past'] = groups.win.apply(cumsum_shift).sum(level=[0, 1])
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).sum(level=[0, 1])
Output:
+------------+------+-----+---------------+---------------+--------------------+
| | | xx | num_comp_past | win_comp_past | win_comp_past_difs |
+------------+------+-----+---------------+---------------+--------------------+
| date | type | | | | |
+------------+------+-----+---------------+---------------+--------------------+
| 2018-01-01 | A | 1.0 | 0.0 | 0.0 | 0.0 |
| | B | 5.0 | 0.0 | 0.0 | 0.0 |
| 2018-02-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-03-01 | A | 2.0 | 1.0 | -1.0 | -4.0 |
| | B | 7.0 | 1.0 | 1.0 | 4.0 |
| | C | 3.0 | 0.0 | 0.0 | 0.0 |
| | D | 1.0 | 0.0 | 0.0 | 0.0 |
| | E | 6.0 | 0.0 | 0.0 | 0.0 |
| 2018-05-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-06-01 | A | 5.0 | 3.0 | -3.0 | -10.0 |
| | B | 2.0 | 3.0 | 3.0 | 13.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-07-01 | A | 1.0 | NaN | NaN | NaN |
| 2018-08-01 | B | 9.0 | 2.0 | 0.0 | 3.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-09-01 | A | 2.0 | 3.0 | -1.0 | -6.0 |
| | B | 7.0 | 3.0 | 1.0 | 6.0 |
| 2018-10-01 | C | 3.0 | 5.0 | -1.0 | -10.0 |
| | A | 6.0 | 6.0 | -2.0 | -10.0 |
| | B | 8.0 | 7.0 | 3.0 | 20.0 |
| 2018-11-01 | A | 2.0 | NaN | NaN | NaN |
| 2018-12-01 | B | 7.0 | 4.0 | 2.0 | 14.0 |
| | C | 9.0 | 4.0 | -2.0 | -14.0 |
+------------+------+-----+---------------+---------------+--------------------+
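One compatibility note on the last three assignments (an assumption about the pandas version in use): Series.sum(level=...) has since been deprecated and removed, so on recent pandas the equivalent spelling is an explicit groupby over the same index levels, roughly:
df['num_comp_past'] = groups.xx.cumcount().groupby(level=[0, 1]).sum()
df['win_comp_past'] = groups.win.apply(cumsum_shift).groupby(level=[0, 1]).sum()
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).groupby(level=[0, 1]).sum()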
