Pandas: Groupby count as column value - python

I have a pandas dataframe that looks like this:
I would like to count the instances of 'x' (regardless of whether they're unique or not) per 'id'. The result would be inserted as a column labeled 'x_count', as shown below:
Any tips would be helpful.

Simply use a groupby with transform('count'):
df['x_count'] = df.groupby('id')['x'].transform('count')
If you also want to count the NaN values, use `size`:
df['x_count'] = df.groupby('id')['x'].transform('size')

Alternatively, try DataFrame.value_counts (available from pandas 1.1) with .map:
df['x_count'] = df['id'].map(df.value_counts('id'))
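Since the question's dataframe isn't shown, here is a minimal sketch with an invented frame that illustrates the difference between 'count' (which skips NaN) and 'size' (which includes it):

```python
import pandas as pd
import numpy as np

# Hypothetical example frame; 'x' contains a NaN to show the difference.
df = pd.DataFrame({
    "id": ["a", "a", "b", "b", "b"],
    "x": [1, np.nan, 2, 3, 4],
})

# 'count' excludes NaN, 'size' counts every row in the group.
df["x_count"] = df.groupby("id")["x"].transform("count")
df["x_size"] = df.groupby("id")["x"].transform("size")

print(df)
```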

Related

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

data={'order_num':[123,234,356,123,234,356],'email':['abc@gmail.com','pqr@hotmail.com','xyz@yahoo.com','abc@gmail.com','pqr@hotmail.com','xyz@gmail.com'],'product_code':['rdcf1','6fgxd','2sdfs','34fgdf','gvwt5','5ganb']}
df=pd.DataFrame(data,columns=['order_num','email','product_code'])
My data frame looks something like this:
Image of data frame
For the sake of simplicity, I omitted the other columns when making the example. What I need to do is group by the column called order_num, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: Expected output appearance
but I do not wish to group by the other columns. Is there an alternative way to perform the list operation? I tried using transform, but it threw a size mismatch error, and I don't think it's the right approach either.
If there are many other columns and you need to group by order_num only, use Series.map to create the new column filled with lists, then remove duplicates with DataFrame.drop_duplicates by column order_num, and finally sort if necessary:
df['product_code']=df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
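A runnable sketch of the approach above, using the question's data; the timestamp values are invented for illustration, since the question's frame isn't fully shown:

```python
import pandas as pd

data = {
    "order_num": [123, 234, 356, 123, 234, 356],
    "email": ["abc@gmail.com", "pqr@hotmail.com", "xyz@yahoo.com",
              "abc@gmail.com", "pqr@hotmail.com", "xyz@gmail.com"],
    "product_code": ["rdcf1", "6fgxd", "2sdfs", "34fgdf", "gvwt5", "5ganb"],
    # Invented timestamps; the question mentions such a column exists.
    "timestamp": pd.to_datetime(["2021-01-02", "2021-01-03", "2021-01-01",
                                 "2021-01-04", "2021-01-05", "2021-01-06"]),
}
df = pd.DataFrame(data)

# Map each order_num to the full list of its product codes...
df["product_code"] = df["order_num"].map(
    df.groupby("order_num")["product_code"].apply(list)
)
# ...then keep one row per order and sort by timestamp.
out = df.drop_duplicates("order_num").sort_values("timestamp")
print(out)
```

Note that email survives untouched because map only rewrites product_code; no grouping on the other columns is needed.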

is there a way to use the index numbers to select columns when doing pandas groupby?

I am using groupby to do a sum of groups. The code I am working on looks like this:
data1=data.groupby('a')[['b_1','b_2']].mean().reset_index()
However, I have more than 30 columns that need to be calculated, from 'b_1' to 'b_30'. I don't want to list all the column names, so I tried using the index numbers of the dataset, like this:
data1=data.groupby('a')[list(range(3,33))].mean().reset_index()
But I always got: KeyError: 'Columns not found'
So I just wonder is there another way to do this?
Thanks!
Try:
data.filter(like='b_').groupby(data['a']).mean().reset_index()
Or you can manually create the list of columns:
cols = ['b_{}'.format(i) for i in range(1,31)]
data1=data.groupby('a')[cols].mean().reset_index()
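If you really want positional selection, data.columns is itself indexable, so a slice of it can feed the groupby selection directly. A sketch with an invented frame (the key 'a' at position 0 and b_1..b_30 after it; the question's real frame may have the key elsewhere):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame: grouping key 'a' plus 30 numeric columns b_1..b_30.
data = pd.DataFrame({"a": ["x", "x", "y", "y"]})
for i in range(1, 31):
    data[f"b_{i}"] = rng.integers(0, 10, size=4)

# Slice the column labels by position instead of listing them.
cols = data.columns[1:31]          # positions 1..30 -> 'b_1'..'b_30'
data1 = data.groupby("a")[list(cols)].mean().reset_index()
print(data1.shape)
```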

How to add an integer-represented column in Pandas dataframe

I need to add an integer-represented column to a pandas dataframe. For example, if I have a dataframe with names and genders like the following:
I would need to add a new column with an integer value depending on the gender. The expected output would be as follows:
df['Gender_code']=df['Gender'].transform(lambda gender: 1 if gender=='Female' else 0)
Explanation: Using transform(), you can apply a function to all values of any column. Here, I applied the function defined using lambda to column 'Gender'
For just two genders, you can use a comparison:
df['Gender_code'] = df['Gender'].eq('Female').astype(int)
In the general case, you can resort to factorize:
df['Gender_code'] = df['Gender'].factorize()[0]
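A minimal sketch comparing the two approaches, on an invented frame matching the question's description (the names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with names and genders.
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cleo"],
                   "Gender": ["Female", "Male", "Female"]})

# Two categories: a boolean comparison cast to int.
df["Gender_code"] = df["Gender"].eq("Female").astype(int)

# General case: factorize assigns 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(df["Gender"])
print(df["Gender_code"].tolist(), codes.tolist())
```

Note the two encodings can disagree: eq('Female') maps Female to 1, while factorize numbers values in order of appearance, so here Female gets 0.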

Pandas list down unique values in column and assign it to separate columns

I have following table:
I want to create new data frame or column in same data frame where unique values are listed. e.g.
I used following code:
data.groupby('EMAIL')['Classification'].transform('nunique')
But it gives me the number of unique values (for CLASSIFICATION, it is 2).
However, I want the values in list format, so that at the end I can remove duplicate rows and keep a single row for each unique email id. Please advise on this.
Thanks!
For performance, use set for the unique values and pass a lambda function to GroupBy.agg; note the order may differ from the original:
df = data.groupby('EMAIL').agg(lambda x: ','.join(set(x))).reset_index()
To keep the original order, use the dict.fromkeys trick:
f = lambda x: ','.join(dict.fromkeys(x))
df = data.groupby('EMAIL').agg(f).reset_index()
Use df.groupby(as_index=False) with df.groupby.agg:
data.groupby('EMAIL',as_index=False).agg(lambda x: ','.join(x.unique()))
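A runnable sketch of the order-preserving variant, on an invented frame (the question's table isn't shown, so the emails and classifications here are hypothetical):

```python
import pandas as pd

# Hypothetical frame: duplicate emails with repeated classifications.
data = pd.DataFrame({
    "EMAIL": ["a@x.com", "a@x.com", "a@x.com", "b@y.com"],
    "Classification": ["red", "blue", "red", "green"],
})

# dict.fromkeys de-duplicates while preserving first-seen order,
# unlike set, which has no defined order.
f = lambda x: ",".join(dict.fromkeys(x))
out = data.groupby("EMAIL", as_index=False).agg(f)
print(out)
```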

pandas pivot_table returns empty dataframe

I get an empty dataframe when I try to group values using the pivot_table. Let's first create some stupid data:
import pandas as pd
df = pd.DataFrame({"size":['large','middle','xsmall','large','middle','small'],
"color":['blue','blue','red','black','red','red']})
When I use:
df1 = df.pivot_table(index='size', aggfunc='count')
it returns what I expect. Now I would like a complete pivot table with the colors as columns:
df2 = df.pivot_table(index='size', aggfunc='count',columns='color')
But this results in an empty dataframe. Why? How can I get a simple pivot table that counts the number of combinations?
Thank you.
You need to use len as the aggfunc, like so
df.pivot_table(index='size', aggfunc=len, columns='color')
If you want to use count, here are the steps:
First add a frequency columns, like so:
df['freq'] = df.groupby(['color', 'size'])['color'].transform('count')
Then create the pivot table using the frequency column:
df.pivot_table(values='freq', index='size', aggfunc='count', columns='color')
You need another column to be used as the values for aggregation.
Add a column -
df['freq']=1
Your code will work.
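A runnable sketch of the frequency-column approach, using the question's data; pd.crosstab (not mentioned in the answers) is an equivalent shortcut for counting combinations:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["large", "middle", "xsmall", "large", "middle", "small"],
    "color": ["blue", "blue", "red", "black", "red", "red"],
})

# Give the aggregation something to count.
df["freq"] = 1
pivot = df.pivot_table(values="freq", index="size", columns="color",
                       aggfunc="count", fill_value=0)

# pd.crosstab builds the same count table directly.
xtab = pd.crosstab(df["size"], df["color"])
print(pivot)
```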
