Multiply all values in a pandas df by max value within group - python

I am trying to return the max value within a pandas df for each specific group. I then want to use this max value to multiply the other values and return the result in a separate column.
For example, using the df below, the max value for each group in Item is:
X = 5
Y = 2
I want to use these values to multiply the other values in each group, storing the result in a separate column.
import pandas as pd
d = ({
    'Item': ['X','X','X','Y','Y','Y','Y'],
    'Count': [0,2,5,3,1,2,1],
})
df = pd.DataFrame(data = d)
This is my attempt:
df['Mult_max'] = df.groupby('Item').apply(lambda x: x['Count'].max() * x['Count'])
Intended Output:
Item Count Mult_max
0 X 0 0
1 X 2 10
2 X 5 25
3 Y 3 9
4 Y 1 3
5 Y 2 6
6 Y 1 3

Use GroupBy.transform to get a Series the same length as the original DataFrame, filled with each group's max value, then multiply by Count:
df['Mult_max'] = df.groupby('Item')['Count'].transform('max') * df['Count']
print (df)
Item Count Mult_max
0 X 0 0
1 X 2 10
2 X 5 25
3 Y 3 9
4 Y 1 3
5 Y 2 6
6 Y 1 3
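For reference, the intermediate Series that transform produces can be inspected directly (a quick check using the same df as above); the per-group max is broadcast to every row, which is what makes the element-wise multiplication line up:
print (df.groupby('Item')['Count'].transform('max'))
0    5
1    5
2    5
3    2
4    2
5    2
6    2
Name: Count, dtype: int64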

Related

How to conditionally replace a value with a value from the same row in a different column using pandas?

Say I have a data frame:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
I want column Z to equal column X, unless column X is equal to 0. In that case, I would want column Z to equal column Y. Is there a way to do this without a for loop using pandas?
Use a conditional with numpy.where:
df['Z'] = np.where(df['X'].eq(0), df['Y'], df['X'])
Or Series.where:
df['Z'] = df['Y'].where(df['X'].eq(0), df['X'])
Output:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
You can try using np.where():
df['Z'] = np.where(df['X'] == 0,df['Y'],df['X'])
Basically this translates to "if X = 0, then use the corresponding value from column Y; otherwise (X different from 0), use column X".
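A minimal runnable sketch tying the pieces together (the DataFrame construction here is an assumption, mirroring the sample data above):
import numpy as np
import pandas as pd

# assumed reconstruction of the sample data
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'X': [3, 1, 5, 0],
                   'Y': [5, 4, 3, 7]})

# Z takes X, except where X is 0, in which case it takes Y
df['Z'] = np.where(df['X'].eq(0), df['Y'], df['X'])
print (df)
   ID  X  Y  Z
0   1  3  5  3
1   2  1  4  1
2   3  5  3  5
3   4  0  7  7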

Drop the rows until we reach a certain string, set that row as the column names, and append another dataframe after doing the same in pandas

I have 2 dataframes
df1 = pd.DataFrame([["M","N","O"],["A","B","C"],["X","Y","Z"],[2,3,4],[1,2,3]])
0 1 2
M N O
A B C
X Y Z
2 3 4
1 2 3
df2 = pd.DataFrame([["P","Q","R","S"],["X","Z","W","Y"],[4,5,6,7],[7,8,9,3]])
0 1 2 3
P Q R S
X Z W Y
4 5 6 7
7 8 9 3
I want to read the 1st dataframe, drop the rows until the row that starts with X, and make that row the column names; then read the 2nd dataframe, again drop the rows until the row that starts with X, and append it to the 1st dataframe. I need to repeat the process in a loop because I have multiple such dataframes.
Expected Output:
df_out = pd.DataFrame([[2,3,4,0],[1,2,3,0],[4,7,5,6],[7,3,8,9]],columns=["X","Y","Z","W"])
X Y Z W
2 3 4 0
1 2 3 0
4 7 5 6
7 3 8 9
How to do it?
First, build a mask of all rows after the match: shift the DataFrame, test whether the value X exists in any column with DataFrame.eq plus DataFrame.any, and propagate the match downward with DataFrame.cummax. Then select the matching row itself and set it as the column names with DataFrame.set_axis. Use the same solution for the other DataFrame, join both by concat, replace the missing values, and for the expected order add DataFrame.reindex with the union of both sets of column names:
m1 = df1.shift().eq('X').cummax().any(axis=1)
cols1 = df1[df1.eq('X').any(axis=1)].to_numpy().tolist()
df11 = df1[m1].set_axis(cols1, axis=1)
m2 = df2.shift().eq('X').cummax().any(axis=1)
cols2 = df2[df2.eq('X').any(axis=1)].to_numpy().tolist()
df22 = df2[m2].set_axis(cols2, axis=1)
df = (pd.concat([df11, df22], ignore_index=True)
        .fillna(0)
        .reindex(df11.columns.union(df22.columns, sort=False), axis=1))
print (df)
X Y Z W
0 2 3 4 0
1 1 2 3 0
2 4 7 5 6
3 7 3 8 9
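Since the question mentions repeating this for multiple DataFrames, the steps above can be wrapped in a helper and applied in a loop. A sketch under the assumption that the raw frames are collected in a list; promote_x_row is a hypothetical name:
import pandas as pd

def promote_x_row(df):
    # mask of rows strictly after the row containing 'X'
    m = df.shift().eq('X').cummax().any(axis=1)
    # the row containing 'X' becomes the header
    cols = df[df.eq('X').any(axis=1)].iloc[0].tolist()
    return df[m].set_axis(cols, axis=1)

dfs = [df1, df2]  # assumed list of raw DataFrames
parts = [promote_x_row(d) for d in dfs]
cols = parts[0].columns
for p in parts[1:]:
    cols = cols.union(p.columns, sort=False)
df_out = pd.concat(parts, ignore_index=True).fillna(0).reindex(cols, axis=1)
print (df_out)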
This works:
shift = 0
for index in df.index:
    if df.iloc[index - 1, 0] == "X":
        X = df.iloc[index - 1, :].values
        break
    shift -= 1
df = df.shift(shift).dropna()
df.columns = X
df
Output -
   X  Y  Z
0  2  3  4
1  1  2  3

Split a dataframe based on certain column values

Let's say I have a DF like this:
Mean 1  Mean 2  Stat 1  Stat 2  ID
5       10      15      20      Z
3       6       9       12      X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 for each ID.
Basically I would double the number of rows for each ID, with each row dedicated to either #1 or #2, and a new column added to specify which number we are looking at. Instead of Mean 1 and Mean 2 being on the same row, they will be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean  Stat  ID  #
5     15    Z   1
10    20    Z   2
3     9     X   1
6     12    X   2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
    df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or set_index, then str.split the columns, then stack, if the row order must match the OP's expected output:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
        .pivot(index=['idx', 'ID', '#'], columns='variable')
        .reset_index()
        .drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12

How to return a value based on the max column in pandas?

Input
df
A B C
0 0 0
1 2 4
5 8 7
9 1 1
8 8 8
All values ≥ 0
Goal
A B C D
0 0 0 unk
1 2 4 Z
5 8 7 Y
9 1 1 X
8 8 8 balance
Return the D column; the details are as follows:
Return a value based on which of the A/B/C columns holds the row's max value.
If the max value is in A, then D is X; if in B, then Y; if in C, then Z. This applies when the values are not all the same.
If the values in A/B/C are all the same: when the shared value is 0, return unk; when it is not 0, return balance.
Try
I searched for something like idxmax or max(axis=1) but could not get it to work, especially for the balance and unk cases.
import numpy as np

key_map = {0: 'X', 1: 'Y', 2: 'Z'}

def mapper(row):
    if row['A'] == row['B'] == row['C']:
        if row['A'] == 0:
            return 'unk'
        else:
            return 'balance'
    else:
        column = np.argmax([row['A'], row['B'], row['C']])
        return key_map[column]

df['D'] = df.apply(mapper, axis=1)
Explanation: create a dictionary key_map, then use apply() to call mapper() row by row and return values according to the conditions mentioned.
Let's try your logic with np.select. Note the condition order: a row of all zeros satisfies both tests, and np.select returns the choice for the first condition that is True, so the unk check must come before balance:
# extract max and min
max_vals, min_vals = df.max(1), df.min(1)
# extract columns with max values
idxmax = df.idxmax(1)
# map the column names
key_map={'A':'X', 'B':'Y', 'C':'Z'}
df['D'] = np.select((max_vals == 0, max_vals == min_vals),
                    ('unk', 'balance'), idxmax.map(key_map))
Output:
A B C D
0 0 0 0 unk
1 1 2 4 Z
2 5 8 7 Y
3 9 1 1 X
4 8 8 8 balance

Remove all groups with more than N observations

If a value occurs more than two times in a column, I want to drop every row in which it occurs.
The input df would look like:
Name Num
X 1
X 2
Y 3
Y 4
X 5
The output df would look like:
Name Num
Y 3
Y 4
I know it is possible to remove duplicates, but that only works if I want to remove the first or last duplicate that is found, not the nth duplicate.
df = df.drop_duplicates(subset = ['Name'], drop='third')
This code is completely wrong but it helps explain what I was trying to do.
Using head, which keeps only the first two rows of every group:
df.groupby('Name').head(2)
Out[375]:
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Or, to drop every group that has more than two rows, filter on group size:
s = df.groupby('Name').size() <= 2
df.loc[df.Name.isin(s[s].index)]
Out[380]:
Name Num
2 Y 3
3 Y 4
Use GroupBy.cumcount as a counter and keep rows where the counter is less than 2 (the first two rows of each group):
df1 = df[df.groupby('Name').cumcount() < 2]
print (df1)
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Detail:
print (df.groupby('Name').cumcount())
0 0
1 1
2 0
3 1
4 2
dtype: int64
EDIT
To remove whole groups with more than two rows, as in the expected output, filter by GroupBy.transform with GroupBy.size:
df1 = df[df.groupby('Name')['Num'].transform('size') < 3]
print (df1)
Name Num
2 Y 3
3 Y 4
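The title asks about a general N; the transform pattern parameterizes directly. A minimal sketch, where N is an assumed variable matching the question's two-observation cutoff:
N = 2  # drop every group with more than N rows
df1 = df[df.groupby('Name')['Num'].transform('size') <= N]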
