I have a dataframe:
df = [type1 , type2 , type3 , val1, val2, val3
a b q 1 2 3
a c w 3 5 2
b c t 2 9 0
a b p 4 6 7
a c m 2 1 8
a b h 8 6 3
a b e 4 2 7]
I want to apply groupby based on columns type1, type2 and delete from the dataframe the groups with more than 2 rows. So the new dataframe will be:
df = [type1 , type2 , type3 , val1, val2, val3
a c w 3 5 2
b c t 2 9 0
a c m 2 1 8
]
What is the best way to do so?
Use GroupBy.transform to get the count of each group as a Series the same size as the original, then filter by Series.le (<=) in boolean indexing:
df = df[df.groupby(['type1','type2'])['type1'].transform('size').le(2)]
print (df)
type1 type2 type3 val1 val2 val3
1 a c w 3 5 2
2 b c t 2 9 0
4 a c m 2 1 8
If performance is not important or the DataFrame is small, you can use DataFrameGroupBy.filter:
df = df.groupby(['type1','type2']).filter(lambda x: len(x) <= 2)
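For a self-contained check, here is a minimal sketch that rebuilds the sample frame from the question (the construction itself is an assumption about the exact values/dtypes) and applies the transform-based filter:

import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'type1': list('aabaaaa'),
                   'type2': list('bccbcbb'),
                   'type3': list('qwtpmhe'),
                   'val1': [1, 3, 2, 4, 2, 8, 4],
                   'val2': [2, 5, 9, 6, 1, 6, 2],
                   'val3': [3, 2, 0, 7, 8, 3, 7]})

# keep only (type1, type2) groups that have at most 2 rows
out = df[df.groupby(['type1', 'type2'])['type1'].transform('size').le(2)]
print(out)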
I have a data frame where I need to convert all the columns into rows with their unique values.
A B C
1 2 2
1 2 3
5 2 9
Desired output
X1 V1
A 1
A 5
B 2
C 2
C 3
C 9
I can get unique values with the unique() function, but I don't know how to get the desired output in pandas.
You can use melt and drop_duplicates:
df.melt(var_name='X1', value_name='V1').drop_duplicates()
Output:
X1 V1
0 A 1
2 A 5
3 B 2
6 C 2
7 C 3
8 C 9
P.S. You can add .reset_index(drop=True) if you want a sequential integer index.
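As a self-contained sketch (the frame construction is assumed from the sample data above):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 5], 'B': [2, 2, 2], 'C': [2, 3, 9]})

# wide -> long, then keep one row per (column name, value) pair
out = (df.melt(var_name='X1', value_name='V1')
         .drop_duplicates()
         .reset_index(drop=True))
print(out)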
What I'm looking to do is group my DataFrame on a categorical column, compute quantiles using a second column, and store the result in a third column. For simplicity let's just do the P50. Example below:
Original DF:
Col1 Col2
A 2
B 4
C 2
A 6
B 12
C 10
Desired DF:
Col1 Col2 Col3_P50
A 2 4
B 4 8
C 2 6
A 6 4
B 12 8
C 10 6
One easy way would be to create a small dataframe for each category (A, B, C), compute the quantile, and merge back to the existing DF, but my actual dataset has hundreds of categories, so this isn't an option. Any suggestions would be much appreciated!
You can use transform with quantile:
df['Col3_P50'] = df.groupby("Col1")['Col2'].transform('quantile',0.5)
print(df)
Col1 Col2 Col3_P50
0 A 2 4
1 B 4 8
2 C 2 6
3 A 6 4
4 B 12 8
5 C 10 6
If you need multiple quantiles, one way is to create a dictionary mapping the new column names to the quantile values and loop over it:
d = {'P_50': 0.5, 'P_90': 0.9}
for k, v in d.items():
    df[k] = df.groupby("Col1")['Col2'].transform('quantile', v)
print(df)
Col1 Col2 P_50 P_90
0 A 2 4 5.6
1 B 4 8 11.2
2 C 2 6 9.2
3 A 6 4 5.6
4 B 12 8 11.2
5 C 10 6 9.2
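The loop can also be folded into a single assign call; a minimal sketch, assuming the same d mapping of new column names to quantiles as above:

df = df.assign(**{k: df.groupby('Col1')['Col2'].transform('quantile', v)
                  for k, v in d.items()})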
df:
id1 id2 value1 value2
-----------------------------------
a b 10 5
c a 5 10
b c 0 0
c d 2 1
d a 10 20
a c 5 10
get the rolling sum of the values associated with id 'a', taken from whichever of columns ['id1','id2'] contains it:
id1 id2 a.rolling(2).sum()
-----------------------------------
a b NaN
c a 20
d a 30
a c 25
How would I get the rolling sum of the values for id 'a' from two different columns with a df.groupby function?
I tried df.groupby(['id1','id2'])['value1','value2'].transform(lambda x: x.rolling(2).sum()), but that didn't work.
Here's one way to do it:
import numpy as np

i = df.filter(like='id')     # the id columns
v = df.filter(like='va')     # the value columns
x, y = np.where(i == 'a')    # row/column positions where 'a' appears
df.iloc[x].assign(A=v.values[x, y]).assign(Roll=lambda d: d.A.rolling(2).sum())
id1 id2 value1 value2 A Roll
0 a b 10 5 10 NaN
1 c a 5 10 10 20.0
4 d a 10 20 20 30.0
5 a c 5 10 5 25.0
Using concat after filter:
df1 = df.filter(like='1')
df2 = df.filter(like='2')
df2.columns = df1.columns
s = pd.concat([df1, df2]).sort_index().groupby('id1').rolling(2).sum()
s = s.loc['a']
df.loc[s.index].assign(new=s)
Out[99]:
id1 id2 value1 value2 new
0 a b 10 5 NaN
1 c a 5 10 20.0
4 d a 10 20 30.0
5 a c 5 10 25.0
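If you need this for every id at once rather than only 'a', one possible sketch (not from either answer) stacks the (id, value) pairs into long form and rolls within each id group; the column names 'id' and 'value' are made up here:

# stack (id1, value1) and (id2, value2) into one long frame
long = pd.concat([
    df[['id1', 'value1']].rename(columns={'id1': 'id', 'value1': 'value'}),
    df[['id2', 'value2']].rename(columns={'id2': 'id', 'value2': 'value'}),
]).sort_index()

# rolling sum of 2 within each id, keyed by (id, original row index)
roll = long.groupby('id')['value'].rolling(2).sum()
print(roll.loc['a'])   # same values as the 'a'-only answers above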
I have following dataframe:
name gender count
0 A M 3
1 A F 2
2 A NaN 3
3 B NaN 2
4 C F 4
5 D M 5
6 D NaN 5
I would like to build a resulting dataframe df1 which deletes the last row of each group of the name attribute if the count of that group is greater than 1. For example, name A is present 3 times, hence the last row containing A should be removed. B and C are only present once, hence the rows containing them should be retained.
Resulting dataframe df1 should be like this:
name gender count
0 A M 3
1 A F 2
2 B NaN 2
3 C F 4
4 D M 5
Please advise.
Use groupby.apply:
In [4598]: (df.groupby('name').apply(lambda x: x.iloc[:-1] if len(x)>1 else x)
.reset_index(drop=True))
Out[4598]:
name gender count
0 A M 3
1 A F 2
2 B NaN 2
3 C F 4
4 D M 5
Using groupby + head:
g = df.groupby('name', as_index=False, group_keys=False)\
.apply(lambda x: x.head(-1) if x.shape[0] > 1 else x)
print(g)
name gender count
0 A M 3
1 A F 2
3 B NaN 2
4 C F 4
5 D M 5
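If you want to avoid apply entirely, a possible vectorized sketch (not from either answer) keeps every row that is not the last of its group, plus rows whose group has only one member:

size = df.groupby('name')['name'].transform('size')
is_last = df.groupby('name').cumcount(ascending=False).eq(0)   # True on the last row of each group
out = df[~is_last | size.eq(1)]
print(out)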
I have the following dataframe
ID ID2 SCORE X Y
0 0 a 10 1 2
1 0 b 20 2 3
2 0 b 20 3 4
3 0 b 30 4 5
4 1 c 5 5 6
5 1 d 6 6 7
What I would like to do, is to groupby ID and ID2 and to average the SCORE taking into consideration only UNIQUE scores.
Now, if I use the standard df.groupby(['ID', 'ID2'])['SCORE'].mean() I would get ~23.33, whereas what I am looking for is a score of 25.
I know I can filter out X and Y, drop the duplicates and do that, but I want to keep them as they are relevant.
How can I achieve that?
If I understand correctly:
In [41]: df.groupby(['ID', 'ID2'])['SCORE'].agg(lambda x: x.unique().sum()/x.nunique())
Out[41]:
ID ID2
0 a 10
b 25
1 c 5
d 6
Name: SCORE, dtype: int64
or a bit easier:
In [43]: df.groupby(['ID', 'ID2'])['SCORE'].agg(lambda x: x.unique().mean())
Out[43]:
ID ID2
0 a 10
b 25
1 c 5
d 6
Name: SCORE, dtype: int64
You can get the unique scores within groups of ('ID', 'ID2') by dropping duplicates beforehand.
cols = ['ID', 'ID2', 'SCORE']
d1 = df.drop_duplicates(cols)
d1.groupby(cols[:-1]).SCORE.mean()
ID ID2
0 a 10
b 25
1 c 5
d 6
Name: SCORE, dtype: int64
You could also use
In [108]: df.drop_duplicates(['ID', 'ID2', 'SCORE']).groupby(['ID', 'ID2'])['SCORE'].mean()
Out[108]:
ID ID2
0 a 10
b 25
1 c 5
d 6
Name: SCORE, dtype: int64
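If you also want that per-group unique mean broadcast back onto every original row (keeping X and Y), one possible sketch is a merge; the column name SCORE_unique_mean is made up here:

means = (df.drop_duplicates(['ID', 'ID2', 'SCORE'])
           .groupby(['ID', 'ID2'], as_index=False)['SCORE'].mean()
           .rename(columns={'SCORE': 'SCORE_unique_mean'}))
out = df.merge(means, on=['ID', 'ID2'], how='left')
print(out)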