custom groupby function pandas python

I have the following dataframe:
I would like to group by id and add a flag column that contains Y if Y has occurred at any time for that id; the resulting DataFrame would look like the following:
Here is my approach, which is too time-consuming, and I am not sure it is correct:
import numpy as np
import pandas as pd

temp = pd.DataFrame()
j = 'flag'
for i in df['id'].unique():
    # slow: one filter + one append per unique id
    test = df[df['id'] == i]
    test[j] = np.where(np.any(test[j] == 'Y'), 'Y', test[j])
    temp = temp.append(test)

You can do groupby + max, since 'Y' > 'N':
df.groupby('id', as_index=False)['flag'].max()
id flag
0 1 Y
1 2 Y
2 3 N
3 4 N
4 5 Y
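The question's sample dataframe is not shown above, so here is a minimal self-contained sketch with an assumed id/flag frame that reproduces this output:
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2, 2, 3, 4, 5, 5],
    'flag': ['N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'Y'],
})

out = df.groupby('id', as_index=False)['flag'].max()
print(out)
#    id flag
# 0   1    Y
# 1   2    Y
# 2   3    N
# 3   4    N
# 4   5    Y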

Compare flag to Y, group by id, and use any:
new_df = (df['flag'] == 'Y').groupby(df['id']).any().map({True:'Y', False:'N'}).reset_index()
Output:
>>> new_df
id flag
0 1 Y
1 2 Y
2 3 N
3 4 N
4 5 Y
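A small caveat on the two approaches, shown as a sketch: the max trick relies on 'Y' sorting after every other value in the column, while the boolean/any version only tests for 'Y', so it still behaves if the flag column ever holds some other value:
s = pd.Series(['N', 'Z', 'N'])    # hypothetical flag column with a stray value
print(s.max())                    # 'Z' -- max no longer means "Y occurred"
print((s == 'Y').any())           # False -- still answers the right question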

Related

How to conditionally replace a value with a value from the same row in a different column using pandas?

Say I have a data frame:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
I want column Z to equal column X, unless column X is equal to 0. In that case, I would want column Z to equal column Y. Is there a way to do this without a for loop using pandas?
Use a conditional with numpy.where:
df['Z'] = np.where(df['X'].eq(0), df['Y'], df['X'])
Or Series.where:
df['Z'] = df['Y'].where(df['X'].eq(0), df['X'])
Output:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
You can try using np.where():
df['Z'] = np.where(df['X'] == 0,df['Y'],df['X'])
Basically this translates to: if X == 0, use the corresponding value from column Y; otherwise (X != 0), use column X.
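A self-contained sketch combining both variants on the sample frame from the question (the Z2 column is introduced here only to show both results side by side):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'X':  [3, 1, 5, 0],
                   'Y':  [5, 4, 3, 7]})

df['Z'] = np.where(df['X'].eq(0), df['Y'], df['X'])    # numpy.where version
df['Z2'] = df['Y'].where(df['X'].eq(0), df['X'])       # Series.where version
print(df)
#    ID  X  Y  Z  Z2
# 0   1  3  5  3   3
# 1   2  1  4  1   1
# 2   3  5  3  5   5
# 3   4  0  7  7   7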

Multiply all values in a pandas df by max value within group

I am trying to return the max value within a pandas df for each specific group. I then want to use this max value to multiply separate values and return in a separate column.
For example, using the df below, the max value for each group in Item is:
X = 5
Y = 3
I want to use these values to multiply all other values as a separate column.
import pandas as pd
d = ({
'Item' : ['X','X','X','Y','Y','Y','Y'],
'Count' : [0,2,5,3,1,2,1],
})
df = pd.DataFrame(data = d)
This is my attempt:
df['Mult_max'] = df.groupby('Item').apply(lambda x: x['Count'].max() * x['Count'])
Intended Output:
Item Count Mult_max
0 X 0 0
1 X 2 10
2 X 5 25
3 Y 3 9
4 Y 1 3
5 Y 2 6
6 Y 1 3
Use GroupBy.transform to get a Series the same size as the original DataFrame, filled with each group's max value:
df['Mult_max'] = df.groupby('Item')['Count'].transform('max') * df['Count']
print (df)
Item Count Mult_max
0 X 0 0
1 X 2 10
2 X 5 25
3 Y 3 9
4 Y 1 3
5 Y 2 6
6 Y 1 3
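For context on why the original apply attempt misbehaves: transform always returns a Series with the same index and length as the original frame, so it can be assigned straight back as a column, whereas the per-group Series produced by groupby().apply generally comes back keyed by the group (version-dependent) and does not align with df. A quick check on the df built above:
maxes = df.groupby('Item')['Count'].transform('max')
print(maxes.index.equals(df.index))   # True -- transform keeps the original row index
print(maxes.tolist())                 # [5, 5, 5, 3, 3, 3, 3]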

Remove all groups with more than N observations

If a value occurs more than two times in a column I want to drop every row that it occurs in.
The input df would look like:
Name Num
X 1
X 2
Y 3
Y 4
X 5
The output df would look like:
Name Num
Y 3
Y 4
I know it is possible to remove duplicates, but that only works if I want to remove the first or last duplicate that is found, not the nth duplicate.
df = df.drop_duplicates(subset = ['Name'], drop='third')
This code is completely wrong but it helps explain what I was trying to do.
Using head, which keeps at most the first two rows of each group:
df.groupby('Name').head(2)
Out[375]:
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Or, to drop every row belonging to any group with more than two rows, as the question asks:
s = df.groupby('Name').size() <= 2
df.loc[df.Name.isin(s[s].index)]
Out[380]:
Name Num
2 Y 3
3 Y 4
Use GroupBy.cumcount as a per-group counter and keep rows whose counter is less than 2:
df1 = df[df.groupby('Name').cumcount() < 2]
print (df1)
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Detail:
print (df.groupby('Name').cumcount())
0 0
1 1
2 0
3 1
4 2
dtype: int64
EDIT
To drop every row of any group with more than two rows, filter on the group size via GroupBy.transform:
df1 = df[df.groupby('Name')['Num'].transform('size') < 3]
print (df1)
Name Num
2 Y 3
3 Y 4
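A self-contained sketch of the size-based filter, with the threshold pulled out into a variable (max_per_group is a name introduced here, not from the question) so the "more than N" part generalizes:
import pandas as pd

df = pd.DataFrame({'Name': ['X', 'X', 'Y', 'Y', 'X'],
                   'Num':  [1, 2, 3, 4, 5]})

max_per_group = 2   # drop every row of any Name that appears more than this many times
out = df[df.groupby('Name')['Num'].transform('size') <= max_per_group]
print(out)
#   Name  Num
# 2    Y    3
# 3    Y    4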

Pandas: Trying to drop rows based on for loop?

I have a dataframe with multiple columns; two of them, x and y, are both filled with numbers ranging from 1 to 3. I want to drop all rows where the number in x is less than the number in y. For example, if in one row x = 1 and y = 3, I want to drop that entire row. This is the code I've written so far:
for num1 in df.x:
    for num2 in df.y:
        if num1 < num2:
            df.drop(df.iloc[num1], inplace=True)
but I keep getting the error:
labels ['new' 'active' 1 '1'] not contained in axis
Any help is greatly appreciated. Thanks!
I would avoid loops in your scenario, and just use .drop:
df.drop(df[df['x'] < df['y']].index, inplace=True)
Example:
df = pd.DataFrame({'x':np.random.randint(0,4,5), 'y':np.random.randint(0,4,5)})
>>> df
x y
0 1 2
1 2 1
2 3 1
3 2 1
4 1 3
df.drop(df[df['x'] < df['y']].index, inplace = True)
>>> df
x y
1 2 1
2 3 1
3 2 1
[EDIT]: Or, more simply, without using drop:
df=df[~(df['x'] < df['y'])]
Writing two for loops is very inefficient; instead, you can just compare the two columns:
df['x'] >= df['y']
This returns a boolean Series which you can use to filter the dataframe:
df[df['x'] >= df['y']]
I think it is better to use boolean indexing or query, changing the condition to >=:
df[df['x'] >= df['y']]
Or:
df = df.query('x >= y')
Sample:
df = pd.DataFrame({'x':[1,2,3,2], 'y':[0,4,5,1]})
print (df)
x y
0 1 0
1 2 4
2 3 5
3 2 1
df = df[df['x'] >= df['y']]
print (df)
x y
0 1 0
3 2 1
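As for the error in the original loop: df.drop expects index labels, but df.iloc[num1] is an entire row, so its cell values (the 'new', 'active', 1, '1' in the message) appear to be interpreted as labels that do not exist in the index, hence "not contained in axis". The vectorized filters above sidestep that entirely.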

Pandas copy dataframe keeping only max value for rows with same index

If I have a dataframe that looks like
value otherstuff
0 4 x
0 5 x
0 2 x
1 2 x
2 3 x
2 7 x
what is a succinct way to get a new dataframe that looks like
value otherstuff
0 5 x
1 2 x
2 7 x
where rows with the same index have been dropped so only the row with the maximum 'value' remains? As far as I am aware there is no option in df.drop_duplicates to keep the max, only the first or last occurrence.
You can use max with level=0:
df.max(level=0)
Output:
value otherstuff
0 5 x
1 2 x
2 7 x
OR, to address other columns mentioned in comments:
df.groupby(level=0,group_keys=False)\
.apply(lambda x: x.loc[x['value']==x['value'].max()])
Output:
value otherstuff
0 5 x
1 2 x
2 7 x
You can use groupby.transform to calculate the maximum value per group and then compare the value column with that maximum; rows where the comparison is True are kept:
df[df.groupby(level=0).value.transform('max').eq(df.value)]
# value otherstuff
#0 5 x
#1 2 x
#2 7 x
You could sort by value to ensure you will take the maximum, then group by the index and take the first member for each group.
(df.sort_values(by='value', ascending=False)
.groupby(level=0)
.head(1)
.sort_index())
Which yields
value otherstuff
0 5 x
1 2 x
2 7 x
Without groupby, you can use sort_values and drop_duplicates:
df2['INDEX'] = df2.index
df2.sort_values(['INDEX', 'value'], ascending=[True, False]).drop_duplicates(['INDEX'], keep='first')
Out[47]:
value otherstuff INDEX
0 5 x 0
1 2 x 1
2 7 x 2
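One version note on the first answer above: the level= argument to reductions such as DataFrame.max has been deprecated and removed in current pandas releases, so the groupby form is the durable spelling. A minimal self-contained sketch of the sample frame and the groupby equivalent:
import pandas as pd

df = pd.DataFrame({'value': [4, 5, 2, 2, 3, 7],
                   'otherstuff': ['x'] * 6},
                  index=[0, 0, 0, 1, 2, 2])

out = df.groupby(level=0).max()   # same result as df.max(level=0) on older pandas
print(out)
#    value otherstuff
# 0      5          x
# 1      2          x
# 2      7          x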
