Pandas: Row reduction via Groupby - python

For example, I have a simple DataFrame like:
index  data1  replace_me  agg_me  ID
1      100    (+)         25      1
2      200    (-)         35      2
3      200    (+)         45      2
4      300    (+)         55      3
5      400    (+)         10      4
6      400    (+)         10      4
7      400    (-)         10      4
8      400    (-)         10      4
I am trying to aggregate rows together wherever a given ID occurs more than once, i.e. where len(groupby on ID) > 1. In those cases, I am looking to:
Add the "agg_me" values together
Replace (-) and (+) with (=)
Enter min(agg_me) / sum(agg_me) into a new column called "Percent"
Do this such that it only "pairs" off rows, i.e. it doesn't collapse 4 rows -> 1.
So as a result:
index  data1  replace_me  agg_me  ID  Percent
1      100    (+)         25      1   0
2      200    (=)         80      2   0.4375
4      300    (+)         55      3   0
5      400    (=)         20      4   0.5
6      400    (=)         20      4   0.5
Any help is appreciated!

Try this:
vc = df['ID'].map(df['ID'].value_counts()).gt(1)

pd.concat([
    df.loc[~vc],
    df.loc[vc]
      .groupby(['ID', df.groupby('ID').cumcount().floordiv(2)])
      .agg(index=('index', 'first'),
           data1=('data1', 'first'),
           replace_me=('replace_me', lambda x: '(=)'),
           agg_me=('agg_me', 'sum'),
           Percent=('agg_me', lambda x: x.min() / x.sum()))
      .reset_index(level=0)
]).fillna(0).sort_values('ID').reset_index(drop=True)
Output:
index data1 replace_me agg_me ID Percent
0 1 100 (+) 25 1 0.0000
1 2 200 (=) 80 2 0.4375
2 4 300 (+) 55 3 0.0000
3 5 400 (=) 20 4 0.5000
4 7 400 (=) 20 4 0.5000
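The pairing comes from the second grouping key: df.groupby('ID').cumcount() numbers the rows within each ID 0, 1, 2, ... and .floordiv(2) maps that to 0, 0, 1, 1, ..., so four rows sharing an ID form two pairs rather than one group. A minimal sketch of just that key, using a small hypothetical frame:

import pandas as pd

d = pd.DataFrame({'ID': [2, 2, 4, 4, 4, 4]})
pair_key = d.groupby('ID').cumcount().floordiv(2)
print(pair_key.tolist())  # [0, 0, 0, 0, 1, 1] -- ID 4 splits into two pairs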

Related

Python Pandas: "Series" objects are mutable, thus cannot be hashed when using .groupby

I want to take the 2nd derivative of the 'Value' column and place it into another column. There is also a column called 'Cycle' that organizes the data into various cycles, so for each cycle I want to take the 2nd derivative of that set of numbers.
I have tried using this:
Data3['Diff2'] = Data3.groupby('Cycle#').apply(Data3['Value'] - 2*Data3['Value'].shift(1) + Data3['Value'].shift(2))
Which works for giving me the 2nd derivative (before adding the groupby) but now I am getting the error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Anyone know why?
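The expression inside apply() is evaluated before apply() ever runs, so apply() receives an already-computed Series rather than a function; while handling that argument pandas tries to hash it, and Series are unhashable, hence the message. Passing a lambda instead, and using transform so the result aligns with the original index, fixes it: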
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "Cycle#": rng.integers(1, 4, size=12),
    "Value": rng.integers(1, 11, size=12) * 10,
})
df
###
Cycle# Value
0 1 80
1 3 80
2 2 80
3 2 80
4 2 60
5 3 20
6 1 90
7 3 50
8 1 60
9 1 40
10 2 20
11 3 100
df['Diff2'] = df.groupby('Cycle#', as_index=False)['Value'].transform(lambda x: x - 2*x.shift(1) + x.shift(2))
df
###
Cycle# Value Diff2
0 1 80 NaN
1 3 80 NaN
2 2 80 NaN
3 2 80 NaN
4 2 60 -20.0
5 3 20 NaN
6 1 90 NaN
7 3 50 90.0
8 1 60 -40.0
9 1 40 10.0
10 2 20 -20.0
11 3 100 20.0
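As a side note, the second difference x - 2*x.shift(1) + x.shift(2) is algebraically the same as differencing twice, so an equivalent (and arguably more readable) spelling, assuming the same df as above, would be:

df['Diff2'] = df.groupby('Cycle#')['Value'].transform(lambda x: x.diff().diff())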

How to group dataframe by column and receive new column for every group

I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following per-group statistic:
np.sum(v1 * v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But for the stat column I receive all NaN values. What is wrong in my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but also sets the column name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as there are only 2 rows where the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this by using groupby transform, which is designed to "produce a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values".
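If you'd rather keep the two-step shape explicit, aggregating once and then mapping the per-timestamp sums back by key gives the same column (a sketch using the df defined above):

per_ts = (df['v1'] * df['v2']).groupby(df['timestamp']).sum()
df['stat'] = df['timestamp'].map(per_ts)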

Python loop for calculating sum of column values in pandas

I have the data frame below (a single column of labels and numbers):
a
100
200
200
b
20
30
40
c
400
50
I need help calculating the sum of values for each item and placing it in a 2nd column, which ideally should look like the below:
a 500
100
200
200
b 90
20
30
40
c 450
400
50
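For reference, assuming the single column is named col (as the answer below does), the frame can be reproduced with:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 100, 200, 200, 'b', 20, 30, 40, 'c', 400, 50]})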
If you need the sum per group, convert column col to numeric, build the group labels by forward-filling the non-numeric values, and use GroupBy.transform:
s = pd.to_numeric(df['col'], errors='coerce')
mask = s.isna()
df.loc[mask, 'new'] = s.groupby(df['col'].where(mask).ffill()).transform('sum')
print(df)
col new
0 a 500.0
1 100 NaN
2 200 NaN
3 200 NaN
4 b 90.0
5 20 NaN
6 30 NaN
7 40 NaN
8 c 450.0
9 400 NaN
10 50 NaN
Or, if you prefer empty strings to NaN, compute the full-length transform first and select with np.where:
new = s.groupby(df['col'].where(mask).ffill()).transform('sum')
df['new'] = np.where(mask, new.astype(int), '')
print(df)
col new
0 a 500
1 100
2 200
3 200
4 b 90
5 20
6 30
7 40
8 c 450
9 400
10 50

pandas calculate difference based on a pattern

I have a pandas dataframe as
df
Category NET A B C_DIFF 1 2 DD_DIFF .....
0 tom CD 10 20 NaN 30 40 NaN
1 tom CD 100 200 NaN 300 400 NaN
2 tom CD 100 200 NaN 300 400 NaN
3 tom CD 100 200 NaN 300 400 NaN
4 tom CD 100 200 NaN 300 400 NaN
Now my columns whose names end with _DIFF, i.e. C_DIFF and DD_DIFF, should get the difference of the two columns immediately before them: the A-B values should go in C_DIFF and the 1-2 difference should populate DD_DIFF. How do I get this desired output?
Edit: there are 20 columns ending with _DIFF, so I need to do this programmatically and not hard-code the columns.
Generalizing this:
m = df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:, m] = pd.concat(
    [df.iloc[:, a] - df.iloc[:, b] for a, b in zip(m - 2, m - 1)],
    axis=1).values
print(df)
Category NET A B C_DIFF 1 2 DD_DIFF
0 tom CD 10 20 -10 30 40 -10
1 tom CD 100 200 -100 300 400 -100
2 tom CD 100 200 -100 300 400 -100
3 tom CD 100 200 -100 300 400 -100
4 tom CD 100 200 -100 300 400 -100
Explanation:
df.filter() filters the columns whose names contain DIFF.
df.columns.get_indexer (which uses pd.Index.get_indexer) gets the positional indices of those columns.
We then zip the shifted positions to pair each _DIFF column with the two columns before it, calculate the differences, concat them into one frame, and finally access .values to assign back.
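For the sample frame above, this is roughly what those positions look like (a walkthrough, assuming the column order shown):

m = df.columns.get_indexer(df.filter(like='DIFF').columns)
print(m)                        # [4 7]  -> positions of C_DIFF and DD_DIFF
print(list(zip(m - 2, m - 1)))  # [(2, 3), (5, 6)] -> (A, B) and (1, 2)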
EDIT:
To handle strings you can use pd.to_numeric() with errors='coerce':
m = df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:, m] = pd.concat(
    [pd.to_numeric(df.iloc[:, a], errors='coerce') - pd.to_numeric(df.iloc[:, b], errors='coerce')
     for a, b in zip(m - 2, m - 1)], axis=1).values

Perform an operation on all pairs of rows in a column

Assume the following DataFrame:
id A
1 0
2 10
3 200
4 3000
I would like to make a calculation between every row and every other row.
For example, if the calculation were lambda r1, r2: abs(r1-r2), then the output would be (in some order)
id col_name
1 10
2 200
3 3000
4 190
5 2990
6 2800
Questions:
How to get only the above output?
How to associate each result with the rows that produced it, in the most "pandas-like" way?
I would like to keep everything in a single table as much as possible, in a way that still supports reasonable lookup.
The size of my data is not large, and never will be.
EDIT1:
One way that would answer my question 2 would be
id col_name origin1 origin2
1 10 1 2
2 200 1 3
3 3000 1 4
4 190 2 3
5 2990 2 4
6 2800 3 4
And I would like to know if this is standard, and has a built in way of doing this, or if there is another/better way
IIUC, itertools:
import itertools

s = list(itertools.combinations(df.index, 2))
pd.Series([df.A.loc[x[1]] - df.A.loc[x[0]] for x in s])
Out[495]:
0 10
1 200
2 3000
3 190
4 2990
5 2800
dtype: int64
Update
s = list(itertools.combinations(df.index, 2))
pd.DataFrame([x + (df.A.loc[x[1]] - df.A.loc[x[0]],) for x in s])
Out[518]:
0 1 2
0 0 1 10
1 0 2 200
2 0 3 3000
3 1 2 190
4 1 3 2990
5 2 3 2800
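To get the layout sketched in EDIT1, the same list comprehension can feed named columns (a sketch; here origin1/origin2 hold the positional index labels, which is an assumption about what the "creators" should be):

out = pd.DataFrame([x + (df.A.loc[x[1]] - df.A.loc[x[0]],) for x in s],
                   columns=['origin1', 'origin2', 'col_name'])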
Use broadcasted subtraction, then np.tril_indices to extract the lower triangle (the positive values).
# pandas <= 0.23
# u = df['A'].values
# pandas 0.24+
u = df['A'].to_numpy()

u2 = u[:, None] - u
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
0 10
1 200
2 190
3 3000
4 2990
5 2800
dtype: int64
Or, use subtract.outer to avoid the conversion to array beforehand.
u2 = np.subtract.outer(*[df.A]*2)
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
If you need the indices as well, use
idx = np.tril_indices_from(u2, k=-1)
pd.DataFrame({
    'val': u2[idx],
    'row': idx[0],
    'col': idx[1]
})
val row col
0 10 1 0
1 200 2 0
2 190 2 1
3 3000 3 0
4 2990 3 1
5 2800 3 2
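And if the EDIT1-style origin columns are wanted from this version too, the positional indices in idx can be mapped back (a sketch, assuming id is an ordinary column of df):

ids = df['id'].to_numpy()
pd.DataFrame({
    'col_name': u2[idx],      # the pairwise differences
    'origin1': ids[idx[1]],   # column position = earlier row of the pair
    'origin2': ids[idx[0]],   # row position = later row of the pair
})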
