I am trying to return the max value within a pandas df for each specific group. I then want to use this max value to multiply separate values and return in a separate column.
For example, using the df below, the max value for each group in Item is:
X = 5
Y = 3
I want to use these values to multiply all other values as a separate column.
import pandas as pd
d = ({
'Item' : ['X','X','X','Y','Y','Y','Y'],
'Count' : [0,2,5,3,1,2,1],
})
df = pd.DataFrame(data = d)
This is my attempt:
df['Mult_max'] = df.groupby('Item').apply(lambda x: x['Count'].max() * x['Count'])
Intended Output:
Group Value Mult_max
0 X 0 0
1 X 2 10
2 X 5 25
3 Y 3 9
4 Y 1 3
5 Y 2 6
6 Y 1 3
Use GroupBy.transform, which returns a Series the same size as the original DataFrame, filled with each group's max value:
df['Mult_max'] = df.groupby('Item')['Count'].transform('max') * df['Count']
print (df)
Item Count Mult_max
0 X 0 0
1 X 2 10
2 X 5 25
3 Y 3 9
4 Y 1 3
5 Y 2 6
6 Y 1 3
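The same idea as a self-contained script, as a minimal sketch: it rebuilds the sample frame from the question and keeps the per-group maximum in an intermediate variable. The name `group_max` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
    'Count': [0, 2, 5, 3, 1, 2, 1],
})

# transform('max') returns a Series aligned to df's index,
# where every row holds the maximum of its own group
group_max = df.groupby('Item')['Count'].transform('max')

# Element-wise product of two index-aligned Series
df['Mult_max'] = group_max * df['Count']
print(df)
```

Because the transformed Series shares the index of df, it can be assigned or multiplied directly, which is what the apply-based attempt in the question could not do.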
Say I have a data frame:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
I want column Z to equal column X, unless column X is equal to 0. In that case, I would want column Z to equal column Y. Is there a way to do this without a for loop using pandas?
Use a conditional with numpy.where:
df['Z'] = np.where(df['X'].eq(0), df['Y'], df['X'])
Or Series.where:
df['Z'] = df['Y'].where(df['X'].eq(0), df['X'])
Output:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
You can try using np.where():
df['Z'] = np.where(df['X'] == 0,df['Y'],df['X'])
Basically this translates to "If X = 0 then use the corresponding value for column Y, else (different than 0) use column X"
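For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame and compares np.where with Series.mask. Series.mask is not used in the answers above; it is shown only as an equivalent alternative, and the names `z_numpy` and `z_mask` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'X': [3, 1, 5, 0],
                   'Y': [5, 4, 3, 7]})

# np.where(condition, value_if_true, value_if_false)
z_numpy = np.where(df['X'].eq(0), df['Y'], df['X'])

# Series.mask replaces values where the condition is True
z_mask = df['X'].mask(df['X'].eq(0), df['Y'])

df['Z'] = z_numpy
print(df)
```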
I have 2 dataframes
df1 = pd.DataFrame([["M","N","O"],["A","B","C"],["X","Y","Z"],[2,3,4],[1,2,3]])
0 1 2
M N O
A B C
X Y Z
2 3 4
1 2 3
df2 = pd.DataFrame([["P","Q","R","S"],["X","Z","W","Y"],[4,5,6,7],[7,8,9,3]])
0 1 2 3
P Q R S
X Z W Y
4 5 6 7
7 8 9 3
I want to read the first DataFrame, drop every row up to the row that starts with X, and use that row as the column names. Then I want to do the same for the second DataFrame and append it to the first. This needs to run in a loop because I have many such DataFrames.
Expected Output:
df_out = pd.DataFrame([[2,3,4,0],[1,2,3,0],[4,7,5,6],[7,3,8,9]],columns=["X","Y","Z","W"])
X Y Z W
2 3 4 0
1 2 3 0
4 7 5 6
7 3 8 9
How to do it?
First, find the row that contains X: shift the DataFrame down by one row, test for X, and use DataFrame.cummax with DataFrame.any to build a mask of all rows after the match. Set the column names from the matching row with DataFrame.set_axis. Apply the same steps to the other DataFrame, join both with concat, replace missing values, and, for the expected column order, add DataFrame.reindex with the union of both sets of column names:
m1 = df1.shift().eq('X').cummax().any(axis=1)
cols1 = df1[df1.eq('X').any(axis=1)].to_numpy().tolist()
df11 = df1[m1].set_axis(cols1, axis=1)
m2 = df2.shift().eq('X').cummax().any(axis=1)
cols2 = df2[df2.eq('X').any(axis=1)].to_numpy().tolist()
df22 = df2[m2].set_axis(cols2, axis=1)
df = (pd.concat([df11, df22], ignore_index=True)
.fillna(0)
.reindex(df11.columns.union(df22.columns, sort=False), axis=1))
print (df)
X Y Z W
0 2 3 4 0
1 1 2 3 0
2 4 7 5 6
3 7 3 8 9
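Since the question asks for a loop over many DataFrames, the steps can also be wrapped in a function. The following is only a sketch under a few assumptions: `promote_header` is a hypothetical helper name, the marker is looked up in the first column only, and every frame is assumed to contain the marker row (the helper raises otherwise).

```python
import pandas as pd

def promote_header(frame, marker='X'):
    """Hypothetical helper: use the first row whose first cell equals
    `marker` as the header and keep only the rows below it."""
    hits = frame.iloc[:, 0].eq(marker).to_numpy()
    if not hits.any():
        raise ValueError(f"no row starts with {marker!r}")
    pos = hits.argmax()  # position of the first matching row
    out = frame.iloc[pos + 1:].copy()
    out.columns = frame.iloc[pos].tolist()
    return out

df1 = pd.DataFrame([["M", "N", "O"], ["A", "B", "C"], ["X", "Y", "Z"],
                    [2, 3, 4], [1, 2, 3]])
df2 = pd.DataFrame([["P", "Q", "R", "S"], ["X", "Z", "W", "Y"],
                    [4, 5, 6, 7], [7, 8, 9, 3]])

frames = [df1, df2]  # any number of frames can go in this list

# concat aligns on column names; columns missing from a frame become NaN
df_out = (pd.concat([promote_header(f) for f in frames], ignore_index=True)
            .fillna(0)
            .astype(int))
print(df_out)
```

The final astype(int) assumes all remaining cells are integers, as in the sample data.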
This works,
shift = 0
for index in df.index:
    if df.iloc[index - 1, 0] == "X":
        X = df.iloc[index - 1, :].values
        break
    shift -= 1
df = df.shift(shift).dropna()
df.columns = X
df
Output -
X Y Z
0 2 3 4
1 1 2 3
Let's say I have a DF like this:
Mean 1  Mean 2  Stat 1  Stat 2  ID
5       10      15      20      Z
3       6       9       12      X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 for each ID.
Basically I would double the amount of rows for each ID, with each one being dedicated to either #1 or #2, and a new column will be added to specify which number we are looking at. Instead of Mean 1 and 2 being on the same row, they will be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean Stat ID #
5 15 Z 1
10 20 Z 2
3 9 X 1
6 12 X 2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or set_index then str.split the columns then stack if order must match the OP:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
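For reference, the pd.wide_to_long call can be run end to end as follows. This is a minimal sketch that rebuilds the sample frame from the question; it assumes the column names follow the "stub, space, number" pattern exactly.

```python
import pandas as pd

df = pd.DataFrame({'Mean 1': [5, 3], 'Mean 2': [10, 6],
                   'Stat 1': [15, 9], 'Stat 2': [20, 12],
                   'ID': ['Z', 'X']})

# stubnames are the column prefixes, sep is the separator before the
# numeric suffix, and j names the new column that receives the suffix
new_df = pd.wide_to_long(
    df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
print(new_df)
```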
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
.pivot(index=['idx', 'ID', '#'], columns='variable').reset_index().drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Input
df
A B C
0 0 0
1 2 4
5 8 7
9 1 1
8 8 8
All values≥0
Goal
A B C D
0 0 0 unk
1 2 4 Z
5 8 7 Y
9 1 1 X
8 8 8 balance
Return a D column as follows:
The value is based on which of the A/B/C columns holds the row maximum.
If the max is in A, D is X; if in B, D is Y; if in C, D is Z. This applies only when the three values are not all the same.
If all values in A/B/C are the same, return unk when that value is 0 and balance when it is not 0.
Try
I looked into idxmax and max(axis=1) but could not get it to work, especially the balance and unk cases.
import numpy as np
key_map={0:'X',1:'Y',2:'Z'}
def mapper(row):
    if row['A']==row['B']==row['C']:
        if row['A']==0:
            return 'unk'
        else:
            return 'balance'
    else:
        column=np.argmax([row['A'],row['B'],row['C']])
        return key_map[column]
df['D']=df.apply(mapper,axis=1)
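Explanation: key_map maps the position of the row maximum (0, 1, 2 for A, B, C) to its label. apply() with axis=1 calls mapper() on each row; it returns 'unk' or 'balance' when all three values are equal and the mapped label otherwise.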
Let's try your logic:
# extract max and min
max_vals, min_vals = df.max(1), df.min(1)
# extract columns with max values
idxmax = df.idxmax(1)
# map the column names
key_map={'A':'X', 'B':'Y', 'C':'Z'}
df['D'] = np.select((max_vals==0, max_vals==min_vals),
('unk','balance'), idxmax.map(key_map)
)
Output:
A B C D
0 0 0 0 unk
1 1 2 4 Z
2 5 8 7 Y
3 9 1 1 X
4 8 8 8 balance
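A self-contained variant of the np.select approach, as a sketch: it rebuilds the sample frame and restricts the computation to the A/B/C columns, so it can be re-run after D exists. The names `cols` and `labels` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 5, 9, 8],
                   'B': [0, 2, 8, 1, 8],
                   'C': [0, 4, 7, 1, 8]})

cols = ['A', 'B', 'C']
max_vals = df[cols].max(axis=1)
min_vals = df[cols].min(axis=1)

# Column name of the row maximum, translated to its label
labels = df[cols].idxmax(axis=1).map({'A': 'X', 'B': 'Y', 'C': 'Z'})

# Conditions are checked in order, so the all-zero case wins over the
# all-equal case; rows matching neither fall through to the default
df['D'] = np.select([max_vals.eq(0), max_vals.eq(min_vals)],
                    ['unk', 'balance'],
                    default=labels.to_numpy())
print(df)
```

Note that idxmax returns the first column on a tie, so a row such as 8, 8, 1 would be labelled X. The question does not say how partial ties should be handled, so that behaviour is an assumption.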
If a value occurs more than two times in a column I want to drop every row that it occurs in.
The input df would look like:
Name Num
X 1
X 2
Y 3
Y 4
X 5
The output df would look like:
Name Num
Y 3
Y 4
I know it is possible to remove duplicates, but that only works if I want to remove the first or last duplicate that is found, not the nth duplicate.
df = df.drop_duplicates(subset = ['Name'], drop='third')
This code is completely wrong but it helps explain what I was trying to do.
Using head
df.groupby('Name').head(2)
Out[375]:
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
s=df.groupby('Name').size()<=2
df.loc[df.Name.isin(s[s].index)]
Out[380]:
Name Num
2 Y 3
3 Y 4
Use GroupBy.cumcount as a per-group counter and keep the rows where it is less than 2:
df1 = df[df.groupby('Name').cumcount() < 2]
print (df1)
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Detail:
print (df.groupby('Name').cumcount())
0 0
1 1
2 0
3 1
4 2
dtype: int64
EDIT
Filter by GroupBy.transform and GroupBy.size:
df1 = df[df.groupby('Name')['Num'].transform('size') < 3]
print (df1)
Name Num
2 Y 3
3 Y 4
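A runnable sketch of the size-based filter, rebuilding the sample frame from the question. The value_counts variant is an alternative added for comparison and does not appear in the answers above; `counts`, `kept`, and `kept_vc` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['X', 'X', 'Y', 'Y', 'X'],
                   'Num': [1, 2, 3, 4, 5]})

# Size of each row's group, aligned to the original index
counts = df.groupby('Name')['Num'].transform('size')

# Drop every row whose Name occurs more than two times
kept = df[counts <= 2]

# Alternative: look up each Name's frequency via value_counts
kept_vc = df[df['Name'].map(df['Name'].value_counts()) <= 2]
print(kept)
```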