Count of unique values per group as new column with pandas - python

I would like to count the unique observations by a group in a pandas dataframe and create a new column that has the unique count. Importantly, I would not like to reduce the rows in the dataframe; effectively performing something similar to a window function in SQL.
df = pd.DataFrame({
'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})
df.groupby('mID')['uID'].nunique()
This returns the unique count per group, but it summarises (reduces the rows). I would effectively like to do something along the lines of:
df['ncount'] = df.groupby('mID')['uID'].transform('nunique')
(this obviously does not work)
It is possible to accomplish the desired outcome by taking the unique summarised dataframe and joining it to the original dataframe but I am wondering if there is a more minimal solution.
Thanks

GroupBy.transform('nunique')
On v0.23.4, your solution works for me.
df['ncount'] = df.groupby('mID')['uID'].transform('nunique')
df
uID mID ncount
0 James A 5
1 Henry B 2
2 Abe A 5
3 James B 2
4 Henry A 5
5 Brian A 5
6 Claude A 5
7 James C 1
GroupBy.nunique + pd.Series.map
Additionally, with your existing solution, you could map the series back to mID:
df['ncount'] = df.mID.map(df.groupby('mID')['uID'].nunique())
df
uID mID ncount
0 James A 5
1 Henry B 2
2 Abe A 5
3 James B 2
4 Henry A 5
5 Brian A 5
6 Claude A 5
7 James C 1

You are very close!
df['ncount'] = df.groupby('mID')['uID'].transform(pd.Series.nunique)
uID mID ncount
0 James A 5
1 Henry B 2
2 Abe A 5
3 James B 2
4 Henry A 5
5 Brian A 5
6 Claude A 5
7 James C 1
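For comparison, the join-based route mentioned in the question can be sketched as below, next to the transform version. This is only a minimal sketch on the question's sample data; the names counts, merged and via_transform are illustrative and do not come from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
    'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C'],
})

# Summarise first: one row per mID holding the number of distinct uIDs.
counts = df.groupby('mID')['uID'].nunique().rename('ncount').reset_index()

# Join the summary back; a left merge keeps every original row in its original order.
merged = df.merge(counts, on='mID', how='left')

# The transform version broadcasts the same per-group value without an explicit join.
via_transform = df.groupby('mID')['uID'].transform('nunique')
```

Both routes should give the same column; transform simply skips the intermediate summary frame.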

Related

How to find common values in groupby groups?

I have a df of this format, my goal is to find users who participate in more than one tournament and ultimately set their 'val' value to the one they first appear with. Initially, I was thinking I need to groupby 'tour' but then it needs some intersection but I'm not sure how to proceed. Alternatively, I can do pd.crosstab(df.user, df.tour) but I'm not sure how to proceed either.
df = pd.DataFrame(data = [['jim','1','1', 10],['john','1','1', 12], ['jack','2', '1', 14],['jim','2','1', 10],
['mel','3','2', 20],['jim','3','2', 10],['mat','4','2', 14],['nick','4','2', 20],
['tim','5','3', 16],['john','5','3', 10],['lin','6','3', 16],['mick','6','3', 20]],
columns = ['user', 'game', 'tour', 'val'])
Since df is already sorted by tour, we could use groupby + first:
df['val'] = df.groupby('user')['val'].transform('first')
Output:
user game tour val
0 jim 1 1 10
1 john 1 1 12
2 jack 2 1 14
3 jim 2 1 10
4 mel 3 2 20
5 jim 3 2 10
6 mat 4 2 14
7 nick 4 2 20
8 tim 5 3 16
9 john 5 3 12
10 lin 6 3 16
11 mick 6 3 20
You can groupby on 'user' and filter out groups with only 1 element, and then select the first one, like so:
df.groupby(['user']).filter(lambda g:len(g)>1).groupby('user').head(1)
output
user game tour val
0 jim 1 1 10
1 john 1 1 12
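If "more than one tournament" should mean more than one distinct 'tour' value rather than more than one row, a sketch along these lines may be closer to the goal. It assumes the sample frame from the question; tours_per_user, multi_users and the explicit sort are illustrative additions, not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['jim', '1', '1', 10], ['john', '1', '1', 12], ['jack', '2', '1', 14], ['jim', '2', '1', 10],
          ['mel', '3', '2', 20], ['jim', '3', '2', 10], ['mat', '4', '2', 14], ['nick', '4', '2', 20],
          ['tim', '5', '3', 16], ['john', '5', '3', 10], ['lin', '6', '3', 16], ['mick', '6', '3', 20]],
    columns=['user', 'game', 'tour', 'val'])

# Number of distinct tournaments each user appears in, broadcast to every row.
tours_per_user = df.groupby('user')['tour'].transform('nunique')
multi_users = sorted(df.loc[tours_per_user > 1, 'user'].unique())

# Sort by tour explicitly (stable, so ties keep their order) instead of relying
# on the frame already being sorted, then take each user's first 'val'.
# The assignment aligns on the index, so the original row order is kept.
df['val'] = (df.sort_values('tour', kind='stable')
               .groupby('user')['val']
               .transform('first'))
```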

Select top n items in a pandas groupby and calculate the mean

I have the following dataframe:
df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','Jim','John','Jim','John']})
df
Value Name
0 0 John
1 1 Jim
2 2 John
3 3 Jim
4 4 John
5 5 Jim
6 6 Jim
7 7 John
8 8 Jim
9 9 John
I would like to select the top n items by Name and find the mean from the Value column.
I have tried this:
df['Top2Mean'] = df.groupby(['Name'])['Value'].nlargest(2).transform('mean')
But I get the following error:
ValueError: transforms cannot produce aggregated results
My expected result is a new column called Top2Mean with 8 next to John and 7 next to Jim.
Thanks in advance!
Let us calculate the mean per Name by grouping on level=0 of the nlargest result, then map the calculated mean to the Name column to broadcast the aggregated results.
top2 = df.groupby('Name')['Value'].nlargest(2).groupby(level=0).mean()
df['Top2Mean'] = df['Name'].map(top2)
If we need to group on multiple columns, for example Name and City, then we have to take the mean on level=[Name, City] and map the calculated mean values using MultiIndex.map:
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).groupby(level=c).mean()
df['Top2Mean'] = df.set_index(c).index.map(top2)
Alternative approach with groupby and transform using a custom lambda function
df['Top2Mean'] = df.groupby('Name')['Value']\
.transform(lambda v: v.nlargest(2).mean())
Value Name Top2Mean
0 0 John 8
1 1 Jim 7
2 2 John 8
3 3 Jim 7
4 4 John 8
5 5 Jim 7
6 6 Jim 7
7 7 John 8
8 8 Jim 7
9 9 John 8
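The transform approach can be wrapped in a small helper that handles one or several grouping columns. This is a hedged sketch: top_n_mean is a hypothetical helper name, and the City column is made-up sample data added only to illustrate the multi-column case.

```python
import pandas as pd

def top_n_mean(frame, group_cols, value_col, n=2):
    """Mean of the n largest values per group, broadcast back to every row."""
    return (frame.groupby(group_cols)[value_col]
                 .transform(lambda v: v.nlargest(n).mean()))

df = pd.DataFrame({
    'Value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'Jim', 'John', 'Jim', 'John'],
    # Hypothetical second key, not part of the original question.
    'City': ['NY', 'NY', 'NY', 'NY', 'LA', 'LA', 'LA', 'LA', 'LA', 'LA'],
})

by_name = top_n_mean(df, ['Name'], 'Value', n=2)
by_name_city = top_n_mean(df, ['Name', 'City'], 'Value', n=2)
```

Because the lambda returns a single scalar per group, transform broadcasts it to the group's rows, which avoids the "transforms cannot produce aggregated results" error from the question.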

How can I group sum and count in Python creating a new dataframe?

So, I'm trying to do something similar to this:
select a, b, c, sum(d), sum(e), count(*)
from df
group by 1,2,3
In other words, I have this:
a b c d e
Billy Profesor 1 10 5
Billy Profesor 1 17 3
Andrew Student 8 2 7
And I want the output to be:
a b c d e count
Billy Profesor 1 27 8 2
Andrew Student 8 2 7 1
I tried this, and it partially worked:
df.groupby(['a','b','c']).sum().reset_index()
I still couldn't make it work for the count. I also tried the answer in the post "Group dataframe and get sum AND count?", but using the agg function makes things very messy and it counts every column.
UPDATE: I changed column c because I have a numeric column to group, but not sum.
You can do a join:
groups=df.groupby(['a','b','c'])
groups.sum().join(groups.size().to_frame('count')).reset_index()
Output:
a b c d e count
0 Andrew Student CA 2 7 1
1 Billy Profesor NY 27 8 2
Try NamedAgg
df_final = df.groupby(['a','b','c'], sort=False).agg(d=('d', 'sum'),
e=('e', 'sum'),
count=('e', 'count')).reset_index()
Out[12]:
a b c d e count
0 Billy Profesor NY 27 8 2
1 Andrew Student CA 2 7 1
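As a sketch of the named-aggregation route on the question's updated data (numeric c), the version below uses 'size' for the row count, which mirrors SQL count(*) because it also counts rows whose values are missing, whereas 'count' skips them. The frame is rebuilt from the question's table; the variable name out is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['Billy', 'Billy', 'Andrew'],
    'b': ['Profesor', 'Profesor', 'Student'],
    'c': [1, 1, 8],
    'd': [10, 17, 2],
    'e': [5, 3, 7],
})

# c is a grouping key here, so it is never summed even though it is numeric.
out = (df.groupby(['a', 'b', 'c'], sort=False)
         .agg(d=('d', 'sum'),
              e=('e', 'sum'),
              count=('d', 'size'))   # rows per group, like count(*)
         .reset_index())
```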

How to know the occurrence of a text in row of data frame pandas python

C1
0 John
1 John
2 John
3 Michale
4 Michale
5 Newton
6 Newton
7 John
8 John
9 John
I want to know over which consecutive rows each name occurs. For example, John occurs from row 0 to row 2, so the result should show 0 to 2 for John, then 3 to 4 for Michale, and 5 to 6 for Newton.
Result I want in this format:
Start End Name
0 2 John
3 4 Michale
5 6 Newton
7 9 John
Use
In [163]: df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
Out[163]:
start end
C1
John 0 2
Michale 3 4
Newton 5 6
@Zero: Would adding the below to your code help? :)
df_new = df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
df_new.reset_index().rename(columns={'C1':'Name'})
Edit: Maybe something like this..? I am still learning but there is no harm trying. :)
# Label each run of consecutive equal names: the counter increases whenever the name changes
labels = (df.C1 != df.C1.shift()).cumsum()
df1 = df.assign(label=labels)
# Select the 'index' column before aggregating so the result has flat column names
df_new = df1.reset_index().groupby(['label','C1'])['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'}).reset_index().rename(columns={'C1':'Name'})
df_new
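Grouping by name alone merges the two separate John runs (rows 0 to 2 and rows 7 to 9) into one group, so the runs have to be labelled first. A minimal sketch of that idea on the question's data follows; run_id and runs are illustrative names, and the Start/End/Name columns follow the format requested in the question.

```python
import pandas as pd

df = pd.DataFrame({'C1': ['John'] * 3 + ['Michale'] * 2 + ['Newton'] * 2 + ['John'] * 3})

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change points gives each run its own id.
run_id = (df['C1'] != df['C1'].shift()).cumsum()

runs = (df.reset_index()
          .groupby(run_id.to_numpy())
          .agg(Start=('index', 'min'),
               End=('index', 'max'),
               Name=('C1', 'first'))
          .reset_index(drop=True))
```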

Pandas, how to make matrix

I have a question about pandas and if someone could help me, I would be grateful for that very much.
I have a dataframe
df1 = pd.DataFrame( {'Name': ['A', 'B','A','A']})
df1
I want to do groupby for this.
x=df1.groupby("Name").size()
x
I also have another dataframe
df2 = pd.DataFrame( {'Name2': ['Jon','Maria','Maria','Mike','Mike','Mike']})
df2
For this one, I do groupby as well.
y= df2.groupby("Name2").size()
And then I want to make a matrix whose rows come from x and whose columns come from y, where each entry is the product of the two counts.
I want the matrix like this.
Jon Maria Mike
A 3 6 9
B 1 2 3
If you could tell me how to do that, I would greatly appreciate it.
You could perform a dot product:
x.to_frame().dot(y.to_frame().T)
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
If you want to remove the axis labels, use rename_axis:
x.to_frame().dot(y.to_frame().T)\
.rename_axis(None).rename_axis(None, axis=1)
Jon Maria Mike
A 3 6 9
B 1 2 3
Alternatively, assign in-place:
v = x.to_frame().dot(y.to_frame().T)
v.index.name = v.columns.name = None
v
Jon Maria Mike
A 3 6 9
B 1 2 3
In [35]: (pd.DataFrame(y.values[:,None].dot(x.values[:,None].T).T, columns=y.index, index=x.index)
.rename_axis(None)
.rename_axis(None, axis=1))
Out[35]:
Jon Maria Mike
A 3 6 9
B 1 2 3
Or we can use np.multiply.outer
pd.DataFrame(np.multiply.outer(x.values,y.values),columns=y.index,index=x.index)
Out[344]:
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
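Putting the pieces together, an end-to-end sketch with np.outer might look like the following. It rebuilds x and y from the question's frames; result is an illustrative name, and dropping the axis names via Index.rename(None) is just one possible way to get the plain layout asked for.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B', 'A', 'A']})
df2 = pd.DataFrame({'Name2': ['Jon', 'Maria', 'Maria', 'Mike', 'Mike', 'Mike']})

x = df1.groupby('Name').size()    # A -> 3, B -> 1
y = df2.groupby('Name2').size()   # Jon -> 1, Maria -> 2, Mike -> 3

# Outer product: entry (i, j) is x[i] * y[j]; rows follow x, columns follow y.
result = pd.DataFrame(np.outer(x.to_numpy(), y.to_numpy()),
                      index=x.index.rename(None),
                      columns=y.index.rename(None))
```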
