Count visitors with the same id but different names and show them - python

I have a dataframe:
import pandas as pd
df1 = pd.DataFrame({'id': ['1','2','2','3','3','4','4'],
                    'name': ['James','Jim','jimy','Daniel','Dane','Ash','Ash'],
                    'event': ['Basket','Soccer','Soccer','Basket','Soccer','Basket','Soccer']})
I want to count the rows for each id but also show the unique names; the result I expect is:
id  name          count
1   James         1
2   Jim, jimy     2
3   Daniel, Dane  2
4   Ash           2
I tried grouping by id and name but it doesn't count as I expected.

You could try:
df1.groupby('id').agg(
    name=('name', lambda x: ', '.join(x.unique())),
    count=('name', 'count')
)
We are basically grouping by id and then joining the unique names into a comma-separated list!
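For the sample df1 above, the output should look something like:
            name  count
id
1          James      1
2      Jim, jimy      2
3   Daniel, Dane      2
4            Ash      2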

Here is a solution:
groups = df1[["id", "name"]].groupby("id")
a = groups.agg(lambda x: ", ".join(set(x)))
b = groups.size().rename("count")
c = pd.concat([a, b], axis=1)
I'm not an expert when it comes to pandas, but I thought I might as well post my solution because I find it straightforward and readable.
In your example, the groupby is done on the id column only, not on id and name: the name column you see in your expected DataFrame is the result of an aggregation performed after the groupby, as the quick check below shows.
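Grouping by both columns would split the counts per distinct (id, name) pair, so for df1 it would give something like:
df1.groupby(['id', 'name']).size()
id  name
1   James     1
2   Jim       1
    jimy      1
3   Dane      1
    Daniel    1
4   Ash       2
dtype: int64
Here Jim and jimy are counted separately, which is not what you want.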
My solution is maybe not the shortest, but I still find it readable:
Create a groupby object groups by grouping by id
Create a DataFrame a from groups by aggregating into a comma-separated string (you also need to remove the duplicates using set(...)): lambda x: ", ".join(set(x))
The DataFrame a will thus have the following data:
              name
id
1            James
2        Jim, jimy
3     Daniel, Dane
4              Ash
Create a Series b by computing the size of each group in groups: groups.size() (you should also rename it to count)
id
1    1
2    2
3    2
4    2
Name: count, dtype: int64
Concat a and b horizontally and you get what you wanted:
            name  count
id
1          James      1
2      Jim, jimy      2
3   Daniel, Dane      2
4            Ash      2

Related

Can't sort values after aggregation using Pandas dataframe

I have a dataframe on which I run:
df[['ID','Team']].groupby(['Team']).agg([('total','count')]).reset_index("total").sort_values("count")
I basically need to count the number of IDs by Team and then sort by the total number of IDs.
The aggregation part is fine and gives me the expected result, but when I try the sort part I get this:
KeyError: 'Requested level (total) does not match index name (Team)'
What am I doing wrong?
Use named aggregation to specify new column names in the aggregate function, and remove total from DataFrame.reset_index (reset_index expects an index level name, and the only index level here is Team, hence the KeyError):
df = pd.DataFrame({
    'ID': list('abcdef'),
    'Team': list('aaabcb')
})
df = df.groupby('Team').agg(count=('ID','count')).reset_index().sort_values("count")
print (df)
  Team  count
2    c      1
1    b      2
0    a      3
Your solution can be fixed by specifying the column to process after groupby, then specifying the new column name together with the aggregate function in a tuple, and finally removing total from reset_index:
df = df.groupby('Team')['ID'].agg([('count','count')]).reset_index().sort_values("count")
print (df)
  Team  count
2    c      1
1    b      2
0    a      3

Count number of matching values from pandas groupby

I have created a pandas dataframe for a store, with columns Transaction and Item_Type:
import pandas as pd
data = {'Transaction':[1, 2, 2, 2, 3], 'Item_Type':['Food', 'Drink', 'Food', 'Drink', 'Food']}
df = pd.DataFrame(data, columns=['Transaction', 'Item_Type'])
Transaction Item_Type
1 Food
2 Drink
2 Food
2 Drink
3 Food
I am trying to group by transaction and count the number of drinks per transaction, but cannot find the right syntax to do it.
df = df.groupby(['Transaction','Item_Type']).size()
This sort of works, but gives me a MultiIndex Series, and I cannot yet figure out how to select drinks per transaction from it:
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
This seems clunky - is there a better way?
This Stack Overflow question seemed the most similar: Adding a 'count' column to the result of a groupby in pandas?
Another way is possible with pivot_table:
s = df.pivot_table(index='Transaction',
                   columns='Item_Type', aggfunc=len).stack().astype(int)
Or:
s = df.pivot_table(index=['Transaction','Item_Type'], aggfunc=len)  # thanks @Ch3steR
s.index = s.index.map("{0[0]}/{0[1]}".format)
print(s)
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
Or if you wish to filter a particular category:
to_filter = 'Drink'
(df.pivot_table(index='Transaction', columns='Item_Type', aggfunc=len, fill_value=0)
   .filter(items=[to_filter]))
Item_Type    Drink
Transaction
1                0
2                2
3                0
Edit: replacing original xs approach with unstack after seeing anky's answer.
>>> df.groupby('Transaction')['Item_Type'].value_counts().unstack(fill_value=0)['Drink']
Transaction
1 0
2 2
3 0
Name: Drink, dtype: int64
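For reference, the original xs approach was presumably something along these lines:
df.groupby(['Transaction', 'Item_Type']).size().xs('Drink', level='Item_Type')
Transaction
2    2
dtype: int64
This only returns transactions that actually contain drinks, whereas unstack(fill_value=0) keeps every transaction with an explicit 0.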
For a particular condition, you can check the condition first and then sum the resulting Boolean Series within each group.
df['Item_Type'].eq('Drink').groupby(df['Transaction']).sum()
#Transaction
#1 0.0
#2 2.0
#3 0.0
#Name: Item_Type, dtype: float64
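If you prefer integer counts (some pandas versions upcast to float here, as above), chaining a cast should tidy that up:
df['Item_Type'].eq('Drink').groupby(df['Transaction']).sum().astype(int)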
I found a solution, I think:
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
df = df.groupby(['Transaction','Item_Type']).size().reset_index(name='counts')
It gives me the information I need:
   Transaction Item_Type  counts
0            1      Food       1
1            2     Drink       2
2            2      Food       1
3            3      Food       1
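From there, selecting the drinks is a plain boolean filter (a sketch, assuming the reset_index(name='counts') frame above):
df[df['Item_Type'] == 'Drink']
#   Transaction Item_Type  counts
# 1           2     Drink       2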
You may use agg and value_counts:
s = df.astype(str).agg('/'.join, axis=1).value_counts(sort=False)
Out[61]:
3/Food 1
2/Drink 2
1/Food 1
2/Food 1
dtype: int64
If you want to keep the original order, chain an additional sort_index:
s = df.astype(str).agg('/'.join, axis=1).value_counts().sort_index(kind='mergesort')
Out[62]:
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
dtype: int64

Counting all string values in a given column of a table and grouping them based on a third column

I have three columns. The table looks like this:
ID.  names       tag
1.   john.       1
2.   sam         0
3.   sam,robin.  1
4.   robin.      1
Id: type integer
Names: type string
Tag: type integer (just 0, 1)
What I want is to find how many times each name is repeated, grouped by tag 0 and 1. This is to be done in Python.
The answer must look like:
        0   1
John   23  12
Robin  32  10
sam     9  30
Using extractall and crosstab:
s = df.names.str.extractall(r'(\w+)').reset_index(1, drop=True).join(df.tag)
pd.crosstab(s[0], s['tag'])
tag    0  1
0
john   0  1
robin  0  2
sam    1  1
Because of the nature of your names column, there is some re-processing that needs to be done before you can get value counts. In the case of your example dataframe, this could look something like:
my_counts = (df.set_index(['ID.', 'tag'])
               # Get rid of periods and split on commas
               .names.str.strip('.').str.split(',')
               .apply(pd.Series)
               .stack()
               .reset_index([0, 1])
               # rename column 0 for consistency, easier reading
               .rename(columns={0: 'names'})
               # Get value counts of names per tag:
               .groupby('tag')['names']
               .value_counts()
               .unstack('tag', fill_value=0))
>>> my_counts
tag    0  1
names
john   0  1
robin  0  2
sam    1  1
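On newer pandas (0.25+), explode can replace the apply(pd.Series).stack() dance; a sketch of the same counting, assuming a sample frame built from the question's table:
import pandas as pd
# Hypothetical sample matching the question's table
df = pd.DataFrame({'ID.': [1, 2, 3, 4],
                   'names': ['john.', 'sam', 'sam,robin.', 'robin.'],
                   'tag': [1, 0, 1, 1]})
# Strip the trailing periods, split on commas, then give each name its own row
exploded = df.assign(names=df['names'].str.strip('.').str.split(',')).explode('names')
print(pd.crosstab(exploded['names'], exploded['tag']))
# tag    0  1
# names
# john   0  1
# robin  0  2
# sam    1  1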

Pandas value_counts using multiple columns [duplicate]

I've heard that in pandas there are often multiple ways to do the same thing, but I was wondering:
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?
There is a difference: value_counts returns:
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
but count does not; it sorts the output by the index (created from the column in groupby('col')).
df.groupby('colA').count()
aggregates all columns of df with the function count, so it counts values excluding NaNs.
So if you need to count only one column, use:
df.groupby('colA')['colA'].count()
Sample:
import numpy as np
import pandas as pd
df = pd.DataFrame({'colB': list('abcdefg'),
                   'colC': [1,3,5,7,np.nan,np.nan,4],
                   'colD': [np.nan,3,6,9,2,4,np.nan],
                   'colA': ['c','c','b','a',np.nan,'b','b']})
print (df)
  colA colB  colC  colD
0    c    a   1.0   NaN
1    c    b   3.0   3.0
2    b    c   5.0   6.0
3    a    d   7.0   9.0
4  NaN    e   NaN   2.0
5    b    f   NaN   4.0
6    b    g   4.0   NaN
print (df['colA'].value_counts())
b 3
c 2
a 1
Name: colA, dtype: int64
print (df.groupby('colA').count())
colB colC colD
colA
a 1 1 1
b 3 2 2
c 2 2 1
print (df.groupby('colA')['colA'].count())
colA
a 1
b 3
c 2
Name: colA, dtype: int64
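To make the ordering difference concrete, sorting value_counts by its index should line up with the per-column groupby count above:
print(df['colA'].value_counts().sort_index())
#a    1
#b    3
#c    2
#Name: colA, dtype: int64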
Groupby and value_counts are quite different functions, and in older pandas versions you could not perform value_counts on a DataFrame at all (DataFrame.value_counts was added in pandas 1.1).
value_counts is limited to a single column or Series, and its sole purpose is to return a Series of the frequencies of the values.
Groupby returns an object over which one can perform statistical computations, so when you do df.groupby(col).count(), it returns the number of non-null values present in each column with respect to the specific column(s) in the groupby.
When should value_counts be used, and when should groupby.count be used?
Let's take an example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4],
                   'color': ["r","r","b","b","g","g","r"],
                   'size': [1,2,1,2,1,3,4]})
Groupby count:
df.groupby('color').count()
       id  size
color
b       2     2
g       2     2
r       3     3
Groupby count is generally used for getting the number of valid values present in all the columns, with respect to one or more specified columns, so NaN values are excluded.
To find the frequency using groupby, you need to aggregate against the specified column itself, like @jez did (perhaps value_counts is implemented to avoid this and make developers' lives easier); see the sketch below.
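For the sample df above, that self-aggregation would look something like:
df.groupby('color')['color'].count()
#color
#b    2
#g    2
#r    3
#Name: color, dtype: int64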
Value Counts:
df['color'].value_counts()
r 3
g 2
b 2
Name: color, dtype: int64
Value counts are generally used for finding the frequency of the values present in one particular column.
In conclusion:
.groupby(col).count() should be used when you want to find the frequency of valid values present in columns with respect to a specified col.
.value_counts() should be used to find the frequencies of a Series.
In simple words: .value_counts() returns a Series containing counts of unique rows in the DataFrame, i.e. it treats each complete row as one value and reports how many times each distinct row appears.
Imagine we have a dataframe like:
df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
Then we apply value_counts on it:
df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
dtype: int64
As you can see, it didn't count rows with NA values.
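(If you want those rows counted too, newer pandas versions accept a dropna flag, e.g. df.value_counts(dropna=False); I believe that parameter arrived around pandas 1.3.)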
However, count() counts non-NA cells for each column or row.
In our example:
df.count()
first_name 4
middle_name 2
dtype: int64

Sort a column within groups in Pandas

I am new to pandas. I'm trying to sort a column within each group. So far, I was able to group the first and second column values together and calculate the mean of the third column, but I am still struggling to sort the third column.
(My input dataframe and the dataframe after applying the groupby and mean functions were shown as images in the original post.)
I used the following line of code to group the input dataframe:
df_o = df.groupby(by=['Organization Group','Department']).agg({'Total Compensation': np.mean})
Please let me know how to sort the last column for each group in the first column using pandas.
It seems you need sort_values:
# to return a DataFrame, add parameter as_index=False
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group': ['a','b','a','a'],
                   'Department': ['d','f','a','a'],
                   'Total Compensation': [1,8,9,1]})
print (df)
  Department Organization Group  Total Compensation
0          d                  a                   1
1          f                  b                   8
2          a                  a                   9
3          a                  a                   1
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
print (df_o)
  Organization Group Department  Total Compensation
0                  a          a                   5
1                  a          d                   1
2                  b          f                   8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
  Organization Group Department  Total Compensation
1                  a          d                   1
0                  a          a                   5
2                  b          f                   8
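Note that sort_values(['Total Compensation','Organization Group']) orders primarily by the compensation value, so the groups stay contiguous here only by coincidence. If the goal is to sort within each group, putting the group key first should do it:
df_o = df_o.sort_values(['Organization Group', 'Total Compensation'])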
