Remove all groups with more than N observations - python

If a value occurs more than two times in a column I want to drop every row that it occurs in.
The input df would look like:
Name Num
X 1
X 2
Y 3
Y 4
X 5
The output df would look like:
Name Num
Y 3
Y 4
I know it is possible to remove duplicates, but that only works if I want to remove the first or last duplicate that is found, not the nth duplicate.
df = df.drop_duplicates(subset = ['Name'], drop='third')
This code is completely wrong but it helps explain what I was trying to do.

Using head, which keeps the first two rows of each group:
df.groupby('Name').head(2)
Out[375]:
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
To drop every row of an oversized group instead, filter on the group size:
s=df.groupby('Name').size()<=2
df.loc[df.Name.isin(s[s].index)]
Out[380]:
Name Num
2 Y 3
3 Y 4
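The two snippets above answer slightly different questions, which a side-by-side run makes visible. This is a minimal sketch that rebuilds the question's sample frame; the variable names first_two, small and whole_groups are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["X", "X", "Y", "Y", "X"], "Num": [1, 2, 3, 4, 5]})

# head(2) truncates each group to its first two rows, so X is still present.
first_two = df.groupby("Name").head(2)

# The size mask keeps only names whose group has at most two rows,
# so every X row is dropped.
small = df.groupby("Name").size() <= 2
whole_groups = df.loc[df["Name"].isin(small[small].index)]

print(first_two)
print(whole_groups)
```

Only the second result matches the output requested in the question.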

Use GroupBy.cumcount as a per-group counter and keep the rows where it is less than 2 (the first two rows of each group):
df1 = df[df.groupby('Name').cumcount() < 2]
print (df1)
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Detail:
print (df.groupby('Name').cumcount())
0 0
1 1
2 0
3 1
4 2
dtype: int64
EDIT
Filter by GroupBy.transform and GroupBy.size:
df1 = df[df.groupby('Name')['Num'].transform('size') < 3]
print (df1)
Name Num
2 Y 3
3 Y 4
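For the general "more than N observations" case in the title, the transform approach can be wrapped in a small helper. This is a sketch under assumptions: drop_large_groups and its parameter names are hypothetical, not part of pandas.

```python
import pandas as pd

def drop_large_groups(df, key, max_size):
    """Keep only rows whose `key` value occurs at most `max_size` times."""
    # transform('size') broadcasts each group's row count back onto its rows.
    counts = df.groupby(key)[key].transform("size")
    return df[counts <= max_size]

df = pd.DataFrame({"Name": ["X", "X", "Y", "Y", "X"], "Num": [1, 2, 3, 4, 5]})
result = drop_large_groups(df, "Name", 2)
print(result)
```

With max_size=2 this drops all three X rows; with max_size=3 nothing is dropped.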

Related

How to conditionally replace a value with a value from the same row in a different column using pandas?

Say I have a data frame:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
I want column Z to equal column X, unless column X is equal to 0. In that case, I would want column Z to equal column Y. Is there a way to do this without a for loop using pandas?
Use a conditional with numpy.where:
df['Z'] = np.where(df['X'].eq(0), df['Y'], df['X'])
Or Series.where:
df['Z'] = df['Y'].where(df['X'].eq(0), df['X'])
Output:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
You can try using np.where():
df['Z'] = np.where(df['X'] == 0,df['Y'],df['X'])
Basically, this translates to "if X equals 0, use the corresponding value from column Y; otherwise use column X."
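A self-contained run of the np.where approach on the question's data might look like the sketch below. The second column, Z_mask, uses Series.mask as an alternative that is not in the answers above; it replaces values where the condition is True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4], "X": [3, 1, 5, 0], "Y": [5, 4, 3, 7]})

# Where X is 0 take Y, otherwise take X.
df["Z"] = np.where(df["X"] == 0, df["Y"], df["X"])

# Alternative (not from the answers): start from X and overwrite the zeros with Y.
df["Z_mask"] = df["X"].mask(df["X"] == 0, df["Y"])

print(df)
```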

custom groupby function pandas python

I have the following dataframe:
I would like to group by id and add a flag column that contains Y if Y has ever occurred for that id. The resulting DataFrame would look like the following:
Here is my approach, which is too time-consuming, and I am not sure it is correct:
temp=pd.DataFrame()
j='flag'
for i in df['id'].unique():
    test=df[df['id']==i]
    test[j]=np.where(np.any((test[j]=='Y')),'Y',test[j])
    temp=temp.append(test)
You can do groupby + max since Y > N:
df.groupby('id', as_index=False)['flag'].max()
id flag
0 1 Y
1 2 Y
2 3 N
3 4 N
4 5 Y
Compare flag to Y, group by id, and use any:
new_df = (df['flag'] == 'Y').groupby(df['id']).any().map({True:'Y', False:'N'}).reset_index()
Output:
>>> new_df
id flag
0 1 Y
1 2 Y
2 3 N
3 4 N
4 5 Y
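Both answers above collapse the frame to one row per id. If the goal is to keep every original row and add the flag as a column, as the question's loop tries to do, a transform-based variant is one possibility. The sample data here is hypothetical, since the question's table is not shown, and flag_any is an illustrative column name.

```python
import pandas as pd

# Hypothetical data: the original question's table is not available.
df = pd.DataFrame({"id": [1, 1, 2, 2, 3], "flag": ["N", "Y", "Y", "N", "N"]})

# transform('any') broadcasts "does this id have any Y?" back to every row.
has_y = (df["flag"] == "Y").groupby(df["id"]).transform("any")
df["flag_any"] = has_y.map({True: "Y", False: "N"})

print(df)
```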

Concatenate pandas Dataframe via groupby

I have a pandas DataFrame with columns 'x', 'y', 'z'
However a lot of the x and y values are redundant. I want to take all rows that have the same x and y values and sum the third column, returning a smaller DataFrame.
So given
x y z
0 1 2 1
1 1 2 5
2 1 2 0
3 1 3 0
4 2 6 1
it would return:
x y z
0 1 2 6
1 1 3 0
2 2 6 1
I've tried
df = df.groupby(['x', 'y'])['z'].sum
but I'm not sure how to work with grouped objects.
Very close as-is; you just need to call .sum() and then reset the index:
>>> df.groupby(['x', 'y'])['z'].sum().reset_index()
x y z
0 1 2 6
1 1 3 0
2 2 6 1
There is also a parameter to groupby() that handles that:
>>> df.groupby(['x', 'y'], as_index=False)['z'].sum()
x y z
0 1 2 6
1 1 3 0
2 2 6 1
In your question, you have df.groupby(['x', 'y'])['z'].sum without parentheses. This simply references the method .sum as a Python object, without calling it.
>>> type(df.groupby(['x', 'y'])['z'].sum)
method
>>> callable(df.groupby(['x', 'y'])['z'].sum)
True
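Another option is to set x and y as the index and sum on the index levels. Older pandas versions accepted sum(level=...); that argument was removed in pandas 2.0, so current versions group on the levels instead:
df.set_index(['x','y']).groupby(level=[0,1]).sum().reset_index()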
Output:
x y z
0 1 2 6
1 1 3 0
2 2 6 1
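Putting the approaches above together on the question's data, a runnable sketch could look like this. The names summed and by_level are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 1, 1, 1, 2],
    "y": [2, 2, 2, 3, 6],
    "z": [1, 5, 0, 0, 1],
})

# as_index=False keeps x and y as ordinary columns in the result.
summed = df.groupby(["x", "y"], as_index=False)["z"].sum()

# Index-level variant: move x and y into the index, then group on those levels.
by_level = df.set_index(["x", "y"]).groupby(level=["x", "y"]).sum().reset_index()

print(summed)
```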

Pandas: Trying to drop rows based on for loop?

I have a dataframe consisting of multiple columns, two of which, x and y, are filled with numbers ranging from 1 to 3. I want to drop all rows where the number in x is less than the number in y. For example, if in one row x = 1 and y = 3, I want to drop that entire row. This is the code I've written so far:
for num1 in df.x:
    for num2 in df.y:
        if (num1< num2):
            df.drop(df.iloc[num1], inplace = True)
but I keep getting the error:
labels ['new' 'active' 1 '1'] not contained in axis
Any help is greatly appreciated. Thanks!
I would avoid loops in your scenario, and just use .drop:
df.drop(df[df['x'] < df['y']].index, inplace=True)
Example:
df = pd.DataFrame({'x':np.random.randint(0,4,5), 'y':np.random.randint(0,4,5)})
>>> df
x y
0 1 2
1 2 1
2 3 1
3 2 1
4 1 3
df.drop(df[df['x'] < df['y']].index, inplace = True)
>>> df
x y
1 2 1
2 3 1
3 2 1
[EDIT]: Or, more simply, without using drop:
df=df[~(df['x'] < df['y'])]
Writing two for loops is very inefficient. Instead, you can just compare the two columns:
df['x'] >= df['y']
This returns a boolean Series, which you can use to filter the dataframe:
df[df['x'] >= df['y']]
I think it is better to use boolean indexing or query, with the condition changed to >=:
df[df['x'] >= df['y']]
Or:
df = df.query('x >= y')
Sample:
df = pd.DataFrame({'x':[1,2,3,2], 'y':[0,4,5,1]})
print (df)
x y
0 1 0
1 2 4
2 3 5
3 2 1
df = df[df['x'] >= df['y']]
print (df)
x y
0 1 0
3 2 1
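The answers above use two forms of the filter, ~(x < y) and x >= y. They agree on complete data but differ when a value is missing, because every comparison with NaN is False. The sketch below illustrates this; the NaN in the sample is an assumption added for the demonstration, as the question's data has no missing values.

```python
import numpy as np
import pandas as pd

# Assumed sample with one missing x value, to show where the two forms diverge.
df = pd.DataFrame({"x": [1, 2, np.nan, 2], "y": [0, 4, 5, 1]})

# Drops only rows where x < y is True, so the NaN row is kept.
kept_negated = df[~(df["x"] < df["y"])]

# Keeps only rows where x >= y is True, so the NaN row is dropped.
kept_ge = df[df["x"] >= df["y"]]

print(kept_negated.index.tolist())
print(kept_ge.index.tolist())
```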

Pandas copy dataframe keeping only max value for rows with same index

If I have a dataframe that looks like
value otherstuff
0 4 x
0 5 x
0 2 x
1 2 x
2 3 x
2 7 x
what is a succinct way to get a new dataframe that looks like
value otherstuff
0 5 x
1 2 x
2 7 x
where rows with the same index have been dropped so only the row with the maximum 'value' remains? As far as I am aware there is no option in df.drop_duplicates to keep the max, only the first or last occurrence.
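You can take the max per index level. Older pandas versions accepted max(level=0); the level argument was removed in pandas 2.0, so current versions use groupby(level=0):
df.groupby(level=0).max()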
Output:
value otherstuff
0 5 x
1 2 x
2 7 x
OR, to address other columns mentioned in comments:
df.groupby(level=0,group_keys=False)\
.apply(lambda x: x.loc[x['value']==x['value'].max()])
Output:
value otherstuff
0 5 x
1 2 x
2 7 x
You can use groupby.transform to calculate the maximum value per group and then compare the value column with the maximum, if true, keep the rows:
df[df.groupby(level=0).value.transform('max').eq(df.value)]
# value otherstuff
#0 5 x
#1 2 x
#2 7 x
You could sort by value to ensure you will take the maximum, then group by the index and take the first member for each group.
(df.sort_values(by='value', ascending=False)
.groupby(level=0)
.head(1)
.sort_index())
Which yields
value otherstuff
0 5 x
1 2 x
2 7 x
Without groupby, you can use sort_values and drop_duplicates:
df2['INDEX'] = df2.index
(df2.sort_values(['INDEX', 'value'], ascending=[True, False])
    .drop_duplicates(['INDEX'], keep='first'))
Out[47]:
value otherstuff INDEX
0 5 x 0
1 2 x 1
2 7 x 2
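A variant of the sort-based idea that avoids the helper INDEX column is to sort by value and then drop repeated index labels with Index.duplicated. This is a sketch of that alternative, not one of the answers above; ordered and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [4, 5, 2, 2, 3, 7], "otherstuff": ["x"] * 6},
    index=[0, 0, 0, 1, 2, 2],
)

# Sort so the largest value comes first, then keep the first row per index label.
ordered = df.sort_values("value", ascending=False)
result = ordered[~ordered.index.duplicated(keep="first")].sort_index()

print(result)
```

Unlike the transform-based filter, this keeps exactly one row per index label even when several rows tie for the maximum.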
