Pandas: Row reduction via Groupby - python

For example, I have a simple DataFrame like:
index  data1  replace_me  agg_me  ID
1      100    (+)         25      1
2      200    (-)         35      2
3      200    (+)         45      2
4      300    (+)         55      3
5      400    (+)         10      4
6      400    (+)         10      4
7      400    (-)         10      4
8      400    (-)         10      4
I am trying to aggregate rows together wherever a given ID occurs more than once, i.e. where len(groupby on ID) > 1. In those cases, I am looking to:
Add the "agg_me" values together
Replace (-) and (+) with (=)
Enter min(agg_me) / sum(agg_me) into a new column called "Percent"
Do this such that it only "pairs" off rows, i.e. it doesn't collapse 4 rows -> 1.
So as a result:
index  data1  replace_me  agg_me  ID  Percent
1      100    (+)         25      1   0
2      200    (=)         80      2   0.4375
4      300    (+)         55      3   0
5      400    (=)         20      4   0.5
6      400    (=)         20      4   0.5
Any help is appreciated!

Try this:
vc = df['ID'].map(df['ID'].value_counts()).gt(1)

pd.concat([
    df.loc[~vc],
    df.loc[vc]
      .groupby(['ID', df.groupby('ID').cumcount().floordiv(2)])
      .agg(index=('index', 'first'),
           data1=('data1', 'first'),
           replace_me=('replace_me', lambda x: '(=)'),
           agg_me=('agg_me', 'sum'),
           Percent=('agg_me', lambda x: x.min() / x.sum()))
      .reset_index(level=0)
]).fillna(0).sort_values('ID').reset_index(drop=True)
Output:
index data1 replace_me agg_me ID Percent
0 1 100 (+) 25 1 0.0000
1 2 200 (=) 80 2 0.4375
2 4 300 (+) 55 3 0.0000
3 5 400 (=) 20 4 0.5000
4 7 400 (=) 20 4 0.5000
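The pairing comes from the second grouping key: df.groupby('ID').cumcount() numbers the rows within each ID 0, 1, 2, ... and .floordiv(2) maps that to 0, 0, 1, 1, ..., so four rows sharing an ID form two pairs rather than one group. A minimal sketch of just that key, using a small hypothetical frame:

import pandas as pd

d = pd.DataFrame({'ID': [2, 2, 4, 4, 4, 4]})
pair_key = d.groupby('ID').cumcount().floordiv(2)
print(pair_key.tolist())  # [0, 0, 0, 0, 1, 1] -- ID 4 splits into two pairs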

Related

Python Pandas: "Series" objects are mutable, thus cannot be hashed when using .groupby

I want to take the 2nd derivative of the 'Value' column and place it into another column. There is also a column called 'Cycle' that organizes the data into various cycles, so for each cycle I want to take the 2nd derivative of that set of numbers.
I have tried using this:
Data3['Diff2'] = Data3.groupby('Cycle#').apply(Data3['Value'] - 2*Data3['Value'].shift(1) + Data3['Value'].shift(2))
Which works for giving me the 2nd derivative (before adding the groupby) but now I am getting the error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Anyone know why?
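The expression inside apply() is evaluated before apply() ever runs, so apply() receives an already-computed Series rather than a function; while handling that argument pandas tries to hash it, and Series are unhashable, hence the message. Passing a lambda instead, and using transform so the result aligns with the original index, fixes it: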
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "Cycle#": rng.integers(1, 4, size=12),
    "Value": rng.integers(1, 11, size=12) * 10,
})
df
###
Cycle# Value
0 1 80
1 3 80
2 2 80
3 2 80
4 2 60
5 3 20
6 1 90
7 3 50
8 1 60
9 1 40
10 2 20
11 3 100
df['Diff2'] = df.groupby('Cycle#', as_index=False)['Value'].transform(lambda x: x - 2*x.shift(1) + x.shift(2))
df
###
Cycle# Value Diff2
0 1 80 NaN
1 3 80 NaN
2 2 80 NaN
3 2 80 NaN
4 2 60 -20.0
5 3 20 NaN
6 1 90 NaN
7 3 50 90.0
8 1 60 -40.0
9 1 40 10.0
10 2 20 -20.0
11 3 100 20.0
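As a side note, the second difference x - 2*x.shift(1) + x.shift(2) is algebraically the same as differencing twice, so an equivalent (and arguably more readable) spelling, assuming the same df as above, would be:

df['Diff2'] = df.groupby('Cycle#')['Value'].transform(lambda x: x.diff().diff())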

How to group dataframe by column and receive new column for every group

I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following per-group statistic:
np.sum(v1 * v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But for the stat column I receive all NaN values. What is wrong in my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but also sets the column name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as there are only 2 rows where the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this by using groupby transform, which is designed to "produce a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values".
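If you'd rather keep the two-step shape explicit, aggregating once and then mapping the per-timestamp sums back by key gives the same column (a sketch using the df defined above):

per_ts = (df['v1'] * df['v2']).groupby(df['timestamp']).sum()
df['stat'] = df['timestamp'].map(per_ts)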

Python loop for calculating sum of column values in pandas

I have the data frame below (a single column of labels and numbers):
a
100
200
200
b
20
30
40
c
400
50
I need help calculating the sum of values for each item and placing it in a 2nd column, which ideally should look like the below:
a 500
100
200
200
b 90
20
30
40
c 450
400
50
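For reference, assuming the single column is named col (as the answer below does), the frame can be reproduced with:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 100, 200, 200, 'b', 20, 30, 40, 'c', 400, 50]})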
If you need the sum per group, convert column col to numeric, build the group labels by forward-filling the non-numeric values, and use GroupBy.transform:
s = pd.to_numeric(df['col'], errors='coerce')
mask = s.isna()
df.loc[mask, 'new'] = s.groupby(df['col'].where(mask).ffill()).transform('sum')
print(df)
col new
0 a 500.0
1 100 NaN
2 200 NaN
3 200 NaN
4 b 90.0
5 20 NaN
6 30 NaN
7 40 NaN
8 c 450.0
9 400 NaN
10 50 NaN
Or, if you prefer empty strings to NaN, compute the full-length transform first and select with np.where:
new = s.groupby(df['col'].where(mask).ffill()).transform('sum')
df['new'] = np.where(mask, new.astype(int), '')
print(df)
col new
0 a 500
1 100
2 200
3 200
4 b 90
5 20
6 30
7 40
8 c 450
9 400
10 50

pandas calculate difference based on a pattern

I have a pandas dataframe as
df
Category NET A B C_DIFF 1 2 DD_DIFF .....
0 tom CD 10 20 NaN 30 40 NaN
1 tom CD 100 200 NaN 300 400 NaN
2 tom CD 100 200 NaN 300 400 NaN
3 tom CD 100 200 NaN 300 400 NaN
4 tom CD 100 200 NaN 300 400 NaN
Now my columns whose names end with _DIFF, i.e. C_DIFF and DD_DIFF, should get the difference of the two columns immediately before them: the A-B values should go in C_DIFF and the 1-2 difference should populate DD_DIFF. How do I get this desired output?
Edit: there are 20 columns ending with _DIFF, so I need to do this programmatically and not hard-code the columns.
Generalizing this:
m = df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:, m] = pd.concat(
    [df.iloc[:, a] - df.iloc[:, b] for a, b in zip(m - 2, m - 1)],
    axis=1).values
print(df)
Category NET A B C_DIFF 1 2 DD_DIFF
0 tom CD 10 20 -10 30 40 -10
1 tom CD 100 200 -100 300 400 -100
2 tom CD 100 200 -100 300 400 -100
3 tom CD 100 200 -100 300 400 -100
4 tom CD 100 200 -100 300 400 -100
Explanation:
df.filter() filters the columns whose names contain DIFF.
df.columns.get_indexer (which uses pd.Index.get_indexer) gets the positional indices of those columns.
We then zip the shifted positions to pair each _DIFF column with the two columns before it, calculate the differences, concat them into one frame, and finally access .values to assign back.
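For the sample frame above, this is roughly what those positions look like (a walkthrough, assuming the column order shown):

m = df.columns.get_indexer(df.filter(like='DIFF').columns)
print(m)                        # [4 7]  -> positions of C_DIFF and DD_DIFF
print(list(zip(m - 2, m - 1)))  # [(2, 3), (5, 6)] -> (A, B) and (1, 2)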
EDIT:
To handle strings you can use pd.to_numeric() with errors='coerce':
m = df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:, m] = pd.concat(
    [pd.to_numeric(df.iloc[:, a], errors='coerce') - pd.to_numeric(df.iloc[:, b], errors='coerce')
     for a, b in zip(m - 2, m - 1)], axis=1).values

Perform an operation on all pairs of rows in a column

Assume the following DataFrame:
id A
1 0
2 10
3 200
4 3000
I would like to make a calculation between every row and every other row.
For example, if the calculation were lambda r1, r2: abs(r1-r2), then the output would be (in some order)
id col_name
1 10
2 200
3 3000
4 190
5 2990
6 2800
Questions:
How to get only the above output?
How to associate each result with the rows that produced it, in the most "pandas-like" way?
I would like to keep everything in a single table as much as possible, in a way that still supports reasonable lookup.
The size of my data is not large, and never will be.
EDIT1:
One way that would answer my question 2 would be
id col_name origin1 origin2
1 10 1 2
2 200 1 3
3 3000 1 4
4 190 2 3
5 2990 2 4
6 2800 3 4
And I would like to know if this is standard, and has a built in way of doing this, or if there is another/better way
IIUC, itertools:
import itertools

s = list(itertools.combinations(df.index, 2))
pd.Series([df.A.loc[x[1]] - df.A.loc[x[0]] for x in s])
Out[495]:
0 10
1 200
2 3000
3 190
4 2990
5 2800
dtype: int64
Update
s = list(itertools.combinations(df.index, 2))
pd.DataFrame([x + (df.A.loc[x[1]] - df.A.loc[x[0]],) for x in s])
Out[518]:
0 1 2
0 0 1 10
1 0 2 200
2 0 3 3000
3 1 2 190
4 1 3 2990
5 2 3 2800
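To get the layout sketched in EDIT1, the same list comprehension can feed named columns (a sketch; here origin1/origin2 hold the positional index labels, which is an assumption about what the "creators" should be):

out = pd.DataFrame([x + (df.A.loc[x[1]] - df.A.loc[x[0]],) for x in s],
                   columns=['origin1', 'origin2', 'col_name'])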
Use broadcasted subtraction, then np.tril_indices to extract the lower triangle (the positive values).
# pandas <= 0.23
# u = df['A'].values
# pandas 0.24+
u = df['A'].to_numpy()

u2 = u[:, None] - u
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
0 10
1 200
2 190
3 3000
4 2990
5 2800
dtype: int64
Or, use subtract.outer to avoid the conversion to array beforehand.
u2 = np.subtract.outer(*[df.A]*2)
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
If you need the indices as well, use
idx = np.tril_indices_from(u2, k=-1)
pd.DataFrame({
    'val': u2[idx],
    'row': idx[0],
    'col': idx[1]
})
val row col
0 10 1 0
1 200 2 0
2 190 2 1
3 3000 3 0
4 2990 3 1
5 2800 3 2
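And if the EDIT1-style origin columns are wanted from this version too, the positional indices in idx can be mapped back (a sketch, assuming id is an ordinary column of df):

ids = df['id'].to_numpy()
pd.DataFrame({
    'col_name': u2[idx],      # the pairwise differences
    'origin1': ids[idx[1]],   # column position = earlier row of the pair
    'origin2': ids[idx[0]],   # row position = later row of the pair
})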
