Pandas - Applying filter in groupby - python

I am trying to perform a groupby in a DataFrame. I need two aggregations: the total count, and a count based on filtering one column.
product, count, type
prod_a,100,1
prod_b,200,2
prod_c,23,3
prod_d,23,1
I am trying to create a pivot with two columns: column 1 with the count of products sold, and column 2 with the count of products of type 1.
product, sold, type_1
prod_a,1,1
prod_b,1,0
prod_c,1,0
prod_d,1,1
I am able to get the count of products sold, but I am not sure how to apply a filter and get the count by type:
df.groupby("product").agg({'count': [('sold', 'count')]})

If you only need a count under one condition, like type == 1, use GroupBy.agg with named aggregations:
df2 = df.groupby("product").agg(sold=('count', 'count'),
                                type_1=('type', lambda x: (x == 1).sum()))
print(df2)
         sold  type_1
product
prod_a      1       1
prod_b      1       0
prod_c      1       0
prod_d      1       1
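As a self-contained check, the whole answer can be run end to end; the DataFrame below is rebuilt from the question's sample data:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "product": ["prod_a", "prod_b", "prod_c", "prod_d"],
    "count": [100, 200, 23, 23],
    "type": [1, 2, 3, 1],
})

# Named aggregations: one plain row count, one conditional count
df2 = df.groupby("product").agg(
    sold=("count", "count"),
    type_1=("type", lambda x: (x == 1).sum()),
)
print(df2)
```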
To improve performance, first create the helper column and then aggregate with sum:
df2 = (df.assign(type_1=df['type'].eq(1).astype(int))
         .groupby("product").agg(sold=('count', 'count'),
                                 type_1=('type_1', 'sum')))
For all type combinations, use crosstab with DataFrame.join:
df1 = pd.crosstab(df['product'], df['type']).add_prefix('type_')
df2 = df.groupby("product").agg(sold = ('count','count')).join(df1)
print(df2)
         sold  type_1  type_2  type_3
product
prod_a      1       1       0       0
prod_b      1       0       1       0
prod_c      1       0       0       1
prod_d      1       1       0       0
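A runnable sketch of the crosstab route, again rebuilding the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["prod_a", "prod_b", "prod_c", "prod_d"],
    "count": [100, 200, 23, 23],
    "type": [1, 2, 3, 1],
})

# One indicator column per type value, joined onto the per-product row counts
df1 = pd.crosstab(df["product"], df["type"]).add_prefix("type_")
df2 = df.groupby("product").agg(sold=("count", "count")).join(df1)
print(df2)
```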

Related

Printing count of a column based on value of another column

I have a data frame:
Dept_Name  Placed
A          1
B          0
C          1
where the 'Placed' column holds a boolean value. I want to print the count of rows that have the value 1 in 'Placed', grouped by Dept_Name:
Dept_Name  Count(Placed == 1)
A          3
B          4
C          0
If the values are 0/1 or True/False you can aggregate with sum; then, to get a column named Count, use Series.reset_index:
df1 = df.groupby('Dept_Name')['Placed'].sum().reset_index(name='Count')
If you need to test non-boolean values - e.g. to count occurrences of 100:
df2 = df['Placed'].eq(100).groupby(df['Dept_Name']).sum().reset_index(name='Count')
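A minimal sketch of the sum approach, using a hypothetical extended sample (the question's three-row table is too small to give non-trivial counts):

```python
import pandas as pd

# Hypothetical extended sample: department A has 3 placed, B has 0, C has 1
df = pd.DataFrame({"Dept_Name": ["A", "A", "A", "B", "C", "C"],
                   "Placed": [1, 1, 1, 0, 1, 0]})

# With 0/1 values, summing per group counts the 1s
df1 = df.groupby("Dept_Name")["Placed"].sum().reset_index(name="Count")
print(df1)
```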
As you have boolean 0/1 values, a simple sum will work:
out = df.groupby('Dept_Name', as_index=False).sum()
output:
  Dept_Name  Placed
0         A       5
1         B       0
2         C       2
For a named column:
out = df.groupby('Dept_Name', as_index=False).agg(**{'Count': ('Placed', 'sum')})
output:
  Dept_Name  Count
0         A       5
1         B       0
2         C       2

How to match column names with dictionary keys and add value to counter

I created a dataframe that has binary values for each cell, where each row is a user and each column is a company the user can select (or not), like this:
company1  company2  company3
       1         0         0
       0         0         1
       0         1         1
And I created a dictionary that categorizes each company into either a high, mid, or low value company:
{'company1': 'high',
'company2': 'low',
'company3': 'low'}
Currently there are companies that are in the dataframe but not in the dictionary, but this should be fixed relatively soon. I would like to create variables for how many times each user selected a high, mid, or low value company. Ultimately it should look something like this:
company1  company2  company3  total_low  total_mid  total_high
       1         0         0          0          0           1
       0         0         1          1          0           0
       0         1         1          2          0           0
I started creating a loop to accomplish this, but I'm not sure how to match the column name with the dictionary key/value, or if this is even the most efficient method (there are ~18,000 rows/users and ~100 columns/companies in total):
total_high = []
total_mid = []
total_low = []
for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        if df.iloc[row, col] == 1:
            # match column name with dict key and add value to
            # counter
One possible approach:
d = {'company1': 'high',
     'company2': 'low',
     'company3': 'low'}

df.join(df.rename(columns=d)
          .groupby(level=0, axis=1).sum()
          .reindex(['low', 'mid', 'high'], axis=1, fill_value=0)
          .add_prefix('total_'))
Output:
   company1  company2  company3  total_low  total_mid  total_high
0         1         0         0          0          0           1
1         0         0         1          1          0           0
2         0         1         1          2          0           0
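Note that groupby(level=0, axis=1) is deprecated in recent pandas; transposing first gives the same result on any version. A sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "company1": [1, 0, 0],
    "company2": [0, 0, 1],
    "company3": [0, 1, 1],
})
d = {"company1": "high", "company2": "low", "company3": "low"}

# Transpose, group the rows (former columns) by their dict category, sum,
# then transpose back; this avoids the deprecated groupby(axis=1)
totals = (df.T.groupby(d).sum().T
            .reindex(["low", "mid", "high"], axis=1, fill_value=0)
            .add_prefix("total_"))
out = df.join(totals)
print(out)
```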
Not as short as @Quang Hoang's, but another way:
Melt the dataframe:
df2 = pd.melt(df, value_vars=['company1', 'company2', 'company3'])
Map the dictionary, creating another column total:
df2['total'] = df2.variable.map(d)
Pivot high/low, add the missing mid column, and join back to df:
compa = ['low', 'mid', 'high']
df.join(df2.groupby(['variable', 'total'])['value'].sum()
           .unstack('total', fill_value=0)
           .reindex(compa, axis=1, fill_value=0)
           .add_prefix('total_')
           .reset_index()
           .drop(columns=['variable']))

How to group phone number with and without country code

I am trying to detect phone numbers. My country code is +62, but some phone manufacturers or operators use 0 instead of +62, so after querying and pivoting I get the pivoted data below, with the same number split across two columns.
Here's the pivoted data
Id  +623684682  03684682  +623684684  03684684
1            1         0           1         1
2            1         1           2         1
Here's what I need to group to, but I don't want to group manually (+623684682 and 03684682 are the same, etc.):
Id  03684682  03684684
1          1         2
2          2         3
I think you need replace on the column names with an aggregate sum:
df = df.groupby(lambda x: x.replace('+62','0'), axis=1).sum()
Or replace columns names and sum:
df.columns = df.columns.str.replace(r'\+62', '0', regex=True)
df = df.sum(level=0, axis=1)
print (df)
    03684682  03684684
Id
1          1         2
2          2         3
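A self-contained sketch, rebuilding the pivoted sample; transposing before the grouped sum avoids sum(level=...), which was removed in pandas 2.0:

```python
import pandas as pd

# Rebuild the pivoted sample from the question
df = pd.DataFrame({"+623684682": [1, 1], "03684682": [0, 1],
                   "+623684684": [1, 2], "03684684": [1, 1]},
                  index=pd.Index([1, 2], name="Id"))

# Normalise the country code in the column names, then sum duplicate columns
df.columns = df.columns.str.replace(r"\+62", "0", regex=True)
out = df.T.groupby(level=0).sum().T
print(out)
```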

Python Pandas Create Cooccurence Matrix from two rows

I have a DataFrame which looks like this (the columns are filled with ids for a movie and ids for an actor):
movie actor clusterid
0 0 1 2
1 0 2 2
2 1 1 2
3 1 3 2
4 2 2 1
and I want to create a binary co-occurrence matrix from this dataframe which looks like this:
                     actor1  actor2  actor3
clusterid 2  movie0       1       1       0
             movie1       1       0       1
clusterid 1  movie2       0       1       0
where my dataframe has a multiindex (clusterid, movie) and a binary indicator for the actors who acted in each movie according to my initial dataframe.
I tried:
df.groupby("movie").agg('count').unstack(fill_value=0)
but unfortunately this doesn't expand the dataframe; it just counts the totals. Can something like this be done easily using built-in pandas functions?
Thank you for any advice
You can create an extra auxiliary column to indicate if the value exists and then do pivot_table:
(df.assign(actor="actor" + df.actor.astype(str), indicator=1)
   .pivot_table('indicator', ['clusterid', 'movie'], 'actor', fill_value=0))
Or use the set_index().unstack() pattern:
(df.assign(actor="actor" + df.actor.astype(str), indicator=1)
   .set_index(['clusterid', 'movie', 'actor']).indicator.unstack('actor', fill_value=0))
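The pivot_table route runs end to end as follows, rebuilding the movie/actor/cluster sample from the question:

```python
import pandas as pd

# Rebuild the movie/actor/cluster sample
df = pd.DataFrame({"movie": [0, 0, 1, 1, 2],
                   "actor": [1, 2, 1, 3, 2],
                   "clusterid": [2, 2, 2, 2, 1]})

# Indicator column + pivot_table -> binary co-occurrence matrix
mat = (df.assign(actor="actor" + df.actor.astype(str), indicator=1)
         .pivot_table("indicator", ["clusterid", "movie"], "actor",
                      fill_value=0))
print(mat)
```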

Column headers like pivot table

I am trying to find out the mix of member grades that visit my stores.
import pandas as pd

df = pd.DataFrame({'MbrID': ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7'],
                   'Store': ['PAR', 'TPM', 'AMK', 'TPM', 'PAR', 'PAR', 'AMK'],
                   'Grade': ['A', 'A', 'B', 'A', 'C', 'A', 'C']})
df = df[['MbrID', 'Store', 'Grade']]
print(df)
df.groupby('Store').agg({'Grade': pd.Series.nunique})
Below is the dataframe and also the result of groupby function.
How do I produce the result like Excel Pivot table, such that the categories of Grade (A,B,C) is the column headers? This is assuming that I have quite a wide range of member grades.
I think you can use groupby with size and reshape with unstack:
df1 = df.groupby(['Store', 'Grade'])['Grade'].size().unstack(fill_value=0)
print(df1)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
Solution with crosstab:
df2 = pd.crosstab(df.Store, df.Grade)
print(df2)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
and with pivot_table:
df3 = df.pivot_table(index='Store',
                     columns='Grade',
                     values='MbrID',
                     aggfunc=len,
                     fill_value=0)
print(df3)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
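Since the question asks for the mix of grades, crosstab can also return row proportions directly via its normalize parameter; a small extension of the example above:

```python
import pandas as pd

df = pd.DataFrame({"MbrID": ["M1", "M2", "M3", "M4", "M5", "M6", "M7"],
                   "Store": ["PAR", "TPM", "AMK", "TPM", "PAR", "PAR", "AMK"],
                   "Grade": ["A", "A", "B", "A", "C", "A", "C"]})

# normalize="index" turns each store's grade counts into row proportions
mix = pd.crosstab(df.Store, df.Grade, normalize="index")
print(mix)
```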
