Pandas: select rows if a specific column satisfies a certain condition - python

Say I have this dataframe df:
A B C
0 1 1 2
1 2 2 2
2 1 3 1
3 4 5 2
Say you want to select all rows which column C is >1. If I do this:
newdf=df['C']>1
I only obtain True or False in the resulting df. Instead, in the example given I want this result:
A B C
0 1 1 2
1 2 2 2
3 4 5 2
What would you do? Do you suggest using iloc?

Use boolean indexing:
newdf=df[df['C']>1]

use query
df.query('C > 1')

Related

How to drop row with bracket in Pandas

I would like to drop the [] for a given df
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
Such that the expected output will be
a
0 1
1 2
2 4
3 5
Edit:
or to make thing more interesting, what if we have two columns and some of the cell is with [] to be dropped.
df=pd.DataFrame(dict(a=[1,2,4,[],5],b=[2,[],1,[],6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(1)]
output:
a b
0 1 2
2 4 1
4 5 6

How to find group-column have duplicate values in a dataframegroup python?

first i have a df, when i groupby it with a column, will it remove duplicate values?.
Second, how to know which group have duplicate values ( i tried to find how to know which columns of a df have duplicate values but couldn't find anything, they just talk about how each element duplicated or not)
ex i have a df like this:
A B C
1 1 2 3
2 1 4 3
3 2 2 2
4 2 3 4
5 2 2 3
after groupby('A')
A B C
1 2 3
4 3
2 2 2
3 2
2 3
i want to know how many group A have B duplicated, and how many group A have C duplicated
result:
B C
1 1 2
or maybe better can caculate percent
B : 50%
C : 100%
thanks
You could use a lambda function inside GroupBy.agg to compare number of unique values that is not equal to the number of values in a group. To get the number of unique we can use Series.nunique and Series.size for the number of values in a group.
df.groupby(level=0).agg(lambda x: x.size!=x.nunique())
# B C
# 1 False True
# 2 True False
Let us try
out = df.groupby(level=0).agg(lambda x : x.duplicated().any())
B C
1 False True
2 True False

Setting Values in Pandas Dataframe Based on Condition in Another Column

I am looking to update the values in a pandas series that satisfy a certain condition and take the corresponding value from another column.
Specifically, I want to look at the subcluster column and if the value equals 1, I want the record to update to the corresponding value in the cluster column.
For example:
Cluster
Subcluster
3
1
3
2
3
1
3
4
4
1
4
2
Should result in this
Cluster
Subcluster
3
3
3
2
3
3
3
4
4
4
4
2
I've been trying to use apply and a lambda function, but can't seem to get it to work properly. Any advice would be greatly appreciated. Thanks!
You can use np.where:
import numpy as np
df['Subcluster'] = np.where(df['Subcluster'].eq(1), df['Cluster'], df['Subcluster'])
Output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
In your case try mask
df.Subcluster.mask(lambda x : x==1, df.Cluster,inplace=True)
df
Out[12]:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
Or
df.loc[df.Subcluster==1,'Subcluster'] = df['Cluster']
Really all you need here is to use .loc with a mask (you don't actually need to create the mask, you could apply a mask inline)
df = pd.DataFrame({'cluster':np.random.randint(0,10,10)
,'subcluster':np.random.randint(0,3,10)}
)
df.to_clipboard(sep=',')
df at this point
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,1
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,1
create and apply the mask (you could do this all in one line)
mask = df.subcluster == 1
df.loc[mask,'subcluster'] = df.loc[mask,'cluster']
df.to_clipboard(sep=',')
final output:
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,6
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,3
Here's the lambda you couldn't write. In lamba, x corresponds to the index, so you can use that to refer a specific row in a column.
df['Subcluster'] = df.apply(lambda x: x['Cluster'] if x['Subcluster'] == 1 else x['Subcluster'], axis = 1)
And the output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2

Groupby selected rows by a condition on a column value and then transform another column

This seems to be easy but couldn't find a working solution for it:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
'B': [1,1,2,2,3],
'C': [1,1,2,3,4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on values of column A, then groupby based on values of column B, and finally transform values of column C into sum. something along the line of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split dataframe and do it. I am looking more for a solution that does it without the need of split+concat/merge. Thank you.
Is it just
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
df[~s])
)
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
.assign(D=(~df['A'].isin([2])).cumsum())
.groupby(['D','A','B'])['C'].sum()
.reset_index('D',drop=True)
.reset_index()
)
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4

python pandas, trying to find unique combinations of two columns and merging while summing a third column

Hi I will show what im trying to do through examples:
I start with a dataframe like this:
> pd.DataFrame({'A':['a','a','a','c'],'B':[1,1,2,3], 'count':[5,6,1,7]})
A B count
0 a 1 5
1 a 1 6
2 a 2 1
3 c 3 7
I need to find a way to get all the unique combinations between column A and B, and merge them. The count column should be added together between the merged columns, the result should be like the following:
A B count
0 a 1 11
1 a 2 1
2 c 3 7
Thans for any help.
Use groupby with aggregating sum:
print (df.groupby(['A','B'], as_index=False)['count'].sum())
A B count
0 a 1 11
1 a 2 1
2 c 3 7
print (df.groupby(['A','B'])['count'].sum().reset_index())
A B count
0 a 1 11
1 a 2 1
2 c 3 7

Categories

Resources