I have a DataFrame:
User Numbers
A 0
A 4
A 5
B 0
B 0
C 1
C 3
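For reference, a minimal sketch of constructing this example (the answer below refers to the column as Number, so that name is used here):
import pandas as pd

df = pd.DataFrame({'User': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Number': [0, 4, 5, 0, 0, 1, 3]})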
I want to perform an operation on each group of rows. For example, if I want to remove every User whose Numbers are all 0, the result should look like:
User Numbers
A 0
A 4
A 5
C 1
C 3
since all of User B's Numbers are 0.
Or, for example, if I want to find the variance of the Numbers for each User, the result should look like:
Users Variance
A 7
B 0
C 2
This means that only A's Numbers are used to compute the variance for A, and so on.
Is there a general way to do these kinds of computations on grouped data?
You want two different operations: filtration per group and aggregation per group.
Filtration:
For better performance, it is better to create a boolean mask with transform and filter by boolean indexing.
df1 = df[~df['Number'].eq(0).groupby(df['User']).transform('all')]
print (df1)
User Number
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Steps:
1. First create a boolean Series by comparing Number to 0 with eq:
print (df['Number'].eq(0))
0 True
1 False
2 False
3 True
4 True
5 False
6 False
Name: Number, dtype: bool
2. Then group this boolean Series by the other column and use transform with the function all to check whether all values in a group are True; transform returns a mask with the same size as the original DataFrame:
print (df['Number'].eq(0).groupby(df['User']).transform('all'))
0 False
1 False
2 False
3 True
4 True
5 False
6 False
Name: Number, dtype: bool
3. Invert the boolean mask with ~:
print (~df['Number'].eq(0).groupby(df['User']).transform('all'))
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: Number, dtype: bool
4. Filter by boolean indexing:
print (df[~df['Number'].eq(0).groupby(df['User']).transform('all')])
User Number
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Another solution, slower on large DataFrames, uses GroupBy.filter with the same logic as the first solution:
df2 = df.groupby('User').filter(lambda x: ~x['Number'].eq(0).all())
print (df2)
User Number
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Aggregation:
For simpler aggregation of one column with one aggregate function, e.g. GroupBy.var, use:
df3 = df.groupby('User', as_index=False)['Number'].var()
print (df3)
User Number
0 A 7
1 B 0
2 C 2
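More generally, GroupBy.agg accepts one or several aggregation functions per column, so the same grouping pattern covers other statistics as well; a sketch (the extra functions here are just examples):
df3 = df.groupby('User')['Number'].agg(['var', 'mean', 'size'])
print (df3)
      var  mean  size
User
A     7.0   3.0     3
B     0.0   0.0     2
C     2.0   2.0     2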
Related
First, I have a df: when I groupby it by a column, will it remove duplicate values?
Second, how do I know which groups have duplicate values? (I tried to find out how to know which columns of a df have duplicate values but couldn't find anything; the answers only discuss whether each element is duplicated or not.)
For example, I have a df like this:
A B C
1 1 2 3
2 1 4 3
3 2 2 2
4 2 3 4
5 2 2 3
After groupby('A'):
A B C
1 2 3
4 3
2 2 2
3 2
2 3
I want to know how many A groups have B duplicated, and how many A groups have C duplicated.
Result:
B C
1 1 2
Or, maybe better, calculate the percentage:
B : 50%
C : 100%
Thanks.
You could use a lambda function inside GroupBy.agg to check whether the number of unique values differs from the number of values in a group. Series.nunique gives the number of unique values and Series.size the number of values in a group.
df.groupby(level=0).agg(lambda x: x.size!=x.nunique())
# B C
# 1 False True
# 2 True False
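The percentage asked for in the question is not covered here directly, but as a sketch you could take the column means of the boolean result above (True counts as 1):
dup = df.groupby(level=0).agg(lambda x: x.size != x.nunique())
print(dup.mean() * 100)
# B    50.0
# C    50.0
# dtype: float64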
Let us try
out = df.groupby(level=0).agg(lambda x : x.duplicated().any())
B C
1 False True
2 True False
I have something like this:
data = {'SKU': [1, 1, 2, 1, 2, 2, 3],
        'QTY': [5, 12, 2, 24, 1, 2, 12],
        'TYPE': ['M', 'C', 'M', 'C', 'M', 'M', 'C']}
df = pd.DataFrame(data)
print(df)
OUTPUT:
SKU QTY TYPE
0 1 5 M
1 1 12 C
2 2 2 M
3 1 24 C
4 2 1 M
5 2 2 M
6 3 12 C
And I want a list of unique SKUs and a True/False column indicating whether TYPE is 'C' for all instances of that SKU.
Something like this:
SKU Case
0 1 False
1 2 False
2 3 True
I've tried all manner of combinations of groupby, filter, agg, value_counts, etc. and just can't seem to find a reasonable way to achieve this.
Any help would be much appreciated. I'm sure the answer will be humbling.
import numpy as np

print(df.groupby('SKU')['TYPE'].agg(lambda x: np.all(x == 'C')).reset_index())
Prints:
SKU TYPE
0 1 False
1 2 False
2 3 True
Let us do groupby + all:
s = df.TYPE.eq('C').groupby(df['SKU']).all().reset_index()
print(s)
SKU TYPE
0 1 False
1 2 False
2 3 True
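If you also want the result to carry the Case column name from the question, a sketch using named aggregation (pandas 0.25+), building on the df defined above:
out = df.groupby('SKU', as_index=False).agg(Case=('TYPE', lambda s: s.eq('C').all()))
print(out)
   SKU   Case
0    1  False
1    2  False
2    3   True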
I have a dataframe with two columns: "Agent" and "Client"
Each row corresponds to an interaction between an Agent and a client.
I want to keep a row only if its client had interactions with at least 2 agents.
How can I do that?
Worth adding that you can now use df.duplicated():
df = df.loc[df.duplicated(subset='Agent', keep=False)]
Use groupby and transform with value_counts:
df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
Note that, as mentioned here, one agent might interact with the same client multiple times. Such rows would be retained as false positives. If you do not want this, you can add a drop_duplicates call before filtering:
df = df.drop_duplicates()
df = df[df.Agent.groupby(df.Agent).transform('value_counts') > 1]
print(df)
A B
0 1 2
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)
0 False
1 True
2 True
3 True
4 True
5 True
Name: B, dtype: bool
df = df[mask]
print(df)
A B
1 2 5
2 3 1
3 4 1
4 5 5
5 6 1
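If you need the condition exactly as phrased in the question (keep the rows of clients that interacted with at least 2 distinct agents), a hedged sketch using GroupBy.transform('nunique'); the Agent/Client values below are made up for illustration:
import pandas as pd

df = pd.DataFrame({'Agent': ['a1', 'a2', 'a1', 'a1'],
                   'Client': ['c1', 'c1', 'c2', 'c2']})

# count distinct agents per client and keep clients with 2 or more
out = df[df.groupby('Client')['Agent'].transform('nunique') >= 2]
print(out)
  Agent Client
0    a1     c1
1    a2     c1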
Say I have this dataframe df:
A B C
0 1 1 2
1 2 2 2
2 1 3 1
3 4 5 2
Say you want to select all rows for which column C is > 1. If I do this:
newdf=df['C']>1
I only obtain a Series of True/False values. Instead, for the example given, I want this result:
A B C
0 1 1 2
1 2 2 2
3 4 5 2
What would you do? Do you suggest using iloc?
Use boolean indexing:
newdf=df[df['C']>1]
Use query:
df.query('C > 1')
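Both approaches extend to multiple conditions; a small sketch (the second condition is just an example): combine masks with & or | inside parentheses, or put the whole expression in query.
newdf = df[(df['C'] > 1) & (df['A'] < 4)]
# equivalent with query
newdf = df.query('C > 1 and A < 4')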
For example, if I want to consider flower species, number of petals, germination time and user ID, the user ID is going to have a hyphen in it, so I don't want to use it in my data analysis. I'm aware that I can hard-code this, but I want it so that when I input any dataset, it automatically removes columns with non-numeric values.
Edit: the question was unclear. I'm reading data from a CSV file using pandas.
Example:
Species NPetals GermTime UserID
1 R. G 5 4 65-78
2 R. F 5 3 65-81
I want to remove the UserID and Species columns from the dataset.
From the docs you can just select the numeric data by filtering using select_dtypes:
In [5]:
df = pd.DataFrame({'a': np.random.randn(6).astype('f4'),'b': [True, False] * 3,'c': [1.0, 2.0] * 3})
df
Out[5]:
a b c
0 0.338710 True 1
1 1.530095 False 2
2 -0.048261 True 1
3 -0.505742 False 2
4 0.729667 True 1
5 -0.634482 False 2
In [15]:
df.select_dtypes(include=[np.number])
Out[15]:
a c
0 0.338710 1
1 1.530095 2
2 -0.048261 1
3 -0.505742 2
4 0.729667 1
5 -0.634482 2
You can pass any valid NumPy dtype or dtype hierarchy.
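Applied to the flower example from the question, a sketch (dtypes assumed from the values shown; Species and UserID come in as object dtype and are therefore dropped):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Species': ['R. G', 'R. F'],
                   'NPetals': [5, 5],
                   'GermTime': [4, 3],
                   'UserID': ['65-78', '65-81']})

print(df.select_dtypes(include=[np.number]))
   NPetals  GermTime
0        5         4
1        5         3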