I have the following df1:
id period color size rate
1 01 red 12 30
1 02 red 12 30
2 01 blue 12 35
3 03 blue 12 35
4 01 blue 12 35
4 02 blue 12 35
5 01 pink 10 40
6 01 pink 10 40
I need to create a new df2 with an index that is an aggregate of 3 columns color-size-rate, then groupby 'period' and get the count of unique ids.
My final df should have the following structure:
index period count
red-12-30 01 1
red-12-30 02 1
blue-12-35 01 2
blue-12-35 03 1
blue-12-35 02 1
pink-10-40 01 2
Thank you in advance for your help.
Try .agg('-'.join) to build the key, then .groupby:
df1 = (df.groupby([df[["color", "size", "rate"]].astype(str)
                     .agg("-".join, axis=1).rename("index"), "period"])
         .agg(count=("id", "nunique"))
         .reset_index())
print(df1)
index period count
0 blue-12-35 1 2
1 blue-12-35 2 1
2 blue-12-35 3 1
3 pink-10-40 1 2
4 red-12-30 1 1
5 red-12-30 2 1
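For reference, a self-contained version of this answer, rebuilding df from the question's table (period is kept as strings here to preserve the leading zeros, so it prints as 01/02/03 rather than the 1/2/3 shown above):

```python
import pandas as pd

# Rebuild the question's df1 data.
df = pd.DataFrame({
    "id":     [1, 1, 2, 3, 4, 4, 5, 6],
    "period": ["01", "02", "01", "03", "01", "02", "01", "01"],
    "color":  ["red", "red", "blue", "blue", "blue", "blue", "pink", "pink"],
    "size":   [12, 12, 12, 12, 12, 12, 10, 10],
    "rate":   [30, 30, 35, 35, 35, 35, 40, 40],
})

# Join color-size-rate into one key, then count unique ids per (key, period).
key = df[["color", "size", "rate"]].astype(str).agg("-".join, axis=1).rename("index")
df1 = (df.groupby([key, "period"])
         .agg(count=("id", "nunique"))
         .reset_index())
print(df1)
```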
You can achieve this with a groupby; use nunique so distinct ids are counted (count would count rows), and cast size and rate to str before joining:
df2 = df1.groupby(['color', 'size', 'rate', 'period'], as_index=False).agg(count=('id', 'nunique'))
df2['index'] = df2.apply(lambda x: '-'.join([x['color'], str(x['size']), str(x['rate'])]), axis=1)
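A runnable sketch of this groupby approach, rebuilding df1 from the question's data (note that nunique is needed to count distinct ids, and size and rate must be cast to str for the join):

```python
import pandas as pd

df1 = pd.DataFrame({
    "id":     [1, 1, 2, 3, 4, 4, 5, 6],
    "period": ["01", "02", "01", "03", "01", "02", "01", "01"],
    "color":  ["red", "red", "blue", "blue", "blue", "blue", "pink", "pink"],
    "size":   [12, 12, 12, 12, 12, 12, 10, 10],
    "rate":   [30, 30, 35, 35, 35, 35, 40, 40],
})

# Count unique ids per (color, size, rate, period) group, then build the key.
df2 = (df1.groupby(["color", "size", "rate", "period"], as_index=False)
          .agg(count=("id", "nunique")))
df2["index"] = (df2["color"] + "-" + df2["size"].astype(str)
                + "-" + df2["rate"].astype(str))
print(df2[["index", "period", "count"]])
```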
Related
I have a pandas dataframe as such:
id =[30,30,40,40,30,40,55,30]
month =[1,3,11,4,10,2,12,12]
average=[90,80,50,92,18,15,16,55]
sec =['id1','id1','id3','id4','id2','id2','id1','id1']
df = pd.DataFrame(list(zip(id,sec,month,average)),columns =['id','sec','month','Average'])
We want to add one more column containing the comma-separated months for rows that satisfy both conditions below:
the sec value id2 must be excluded,
and average must be below 90.
Desired Output
I have tried the code below but am not getting the desired output:
final = pd.DataFrame()
for i in set(sec):
    if i != 'id2':  # Exclude id2
        d2 = df[df['sec'] == i]
        d2 = df[df['average'] < 90]  # apply below-90 condition
        d2 = d2[['id', 'month']].groupby(['id'], as_index=False).agg(lambda x: ', '.join(sorted(set(x.astype(str)))))  # comma-separated data
        d2.rename(columns={'month': 'problematic_month'}, inplace=True)
        d2['sec'] = i
        tab = df.merge(d2, on=['id', 'sec'], how='inner')
        final = final.append(tab)
    else:
        d2 = df[df['sec'] == i]
        d2['problematic_month'] = np.NaN
        final = final.append(d2)
Kindly suggest any other way (without merge) to get the desired output.
Another way using groupby+transform
import calendar
import numpy as np
d = dict(enumerate(calendar.month_abbr))
s = df['month'].map(d).where(df['sec'].ne("id2") & df['Average'].lt(90))
col = s.groupby([df["id"], df['sec']]).transform(lambda x: ','.join(x.dropna()))
out = df.assign(problematic_column=col.replace("", np.nan)).sort_values(['id', 'sec'])
print(out)
id sec month Average problematic_column
0 30 id1 1 90 Mar,Dec
1 30 id1 3 80 Mar,Dec
7 30 id1 12 55 Mar,Dec
4 30 id2 10 18 NaN
5 40 id2 2 15 NaN
2 40 id3 11 50 Nov
3 40 id4 4 92 NaN
6 55 id1 12 16 Dec
Steps:
Map the month column via calendar to get the month abbreviations.
Retain values only where the condition matches.
Use groupby and transform to drop NaNs and join by comma.
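The steps above can be run end-to-end by rebuilding the frame from the question's lists:

```python
import calendar
import numpy as np
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    "id":      [30, 30, 40, 40, 30, 40, 55, 30],
    "sec":     ["id1", "id1", "id3", "id4", "id2", "id2", "id1", "id1"],
    "month":   [1, 3, 11, 4, 10, 2, 12, 12],
    "Average": [90, 80, 50, 92, 18, 15, 16, 55],
})

d = dict(enumerate(calendar.month_abbr))            # 1 -> 'Jan', 2 -> 'Feb', ...
mask = df["sec"].ne("id2") & df["Average"].lt(90)   # keep non-id2 rows below 90
s = df["month"].map(d).where(mask)
col = s.groupby([df["id"], df["sec"]]).transform(lambda x: ",".join(x.dropna()))
out = df.assign(problematic_column=col.replace("", np.nan)).sort_values(["id", "sec"])
print(out)
```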
You can start by converting your int months to actual month abbreviations using the calendar module:
import calendar
df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x])
print(df.head(3))
id sec month Average
0 30 id1 Jan 90
1 30 id1 Mar 80
2 40 id3 Nov 50
Then I would use loc to narrow your dataframe based on your conditions above, then a groupby to get your months together per sec.
Thereafter use map to attach it to your initial dataframe:
r = (df.loc[~(df['Average'].gt(90) | df['sec'].eq('id2'))]
       .groupby('sec').agg({'month': lambda x: ','.join(x)})
       .reset_index()
       .rename({'month': 'problematic_month'}, axis=1))
print(r)
sec problematic_month
0 id1 Jan,Mar,Dec
1 id3 Nov
# Attach with map
df['problematic_month'] = df['sec'].map(dict(zip(r.sec,r.problematic_month)))
>>> print(df)
id sec month Average problematic_month
0 30 id1 Jan 90 Jan,Mar,Dec
1 30 id1 Mar 80 Jan,Mar,Dec
2 40 id3 Nov 50 Nov
3 40 id4 Apr 92 NaN
4 30 id2 Oct 18 NaN
5 40 id2 Feb 15 NaN
6 55 id1 Dec 16 Jan,Mar,Dec
Then, using this problematic_month column, you can check whether it contains a comma, and if it does, keep only the first and last month:
import numpy as np
f = df['problematic_month'].str.split(',').str[0]
l = ',' + df['problematic_month'].str.split(',').str[-1]
df['problematic_month'] = np.where(df['problematic_month'].str.contains(','),f+l, df['problematic_month'])
Answer:
>>> print(df)
id sec month Average problematic_month
0 30 id1 Jan 90 Jan,Dec
1 30 id1 Mar 80 Jan,Dec
2 40 id3 Nov 50 Nov
3 40 id4 Apr 92 NaN
4 30 id2 Oct 18 NaN
5 40 id2 Feb 15 NaN
6 55 id1 Dec 16 Jan,Dec
I have a dataset where I would like to pivot the entire dataframe, using certain columns as values.
Data
id date sun moon stars total pcp base final status space galaxy
aa Q1 21 5 1 2 8 0 200 41 5 1 1
aa Q2 21 4 1 2 7 1 200 50 6 2 1
Desired
id date type pcp base final final2 status type2 final3
aa Q1 21 sun 0 200 41 5 5 space 1
aa Q1 21 moon 0 200 41 1 5 galaxy 1
aa Q1 21 stars 0 200 41 2 5 space 1
aa Q2 21 sun 1 200 50 4 6 space 2
aa Q2 21 moon 1 200 50 1 6 galaxy 1
aa Q2 21 stars 1 200 50 2 6 space 2
Doing
df.drop(columns='total').melt(['id','date','final','final2','base','ppp'],var_name='type',value_name='ppp')
This works well for pivoting the first set of values (sun, moon, etc.); however, I am not sure how to incorporate the second set, space and galaxy.
Any suggestion is appreciated
This is a partial answer:
cols = ['id', 'date', 'pcp', 'base', 'final', 'status']
df = df.drop(columns='total')
df1 = df.melt(id_vars=cols, value_vars=['sun', 'moon', 'stars'], var_name='type')
df2 = df.melt(id_vars=cols, value_vars=['galaxy', 'space'], var_name='type2')
out = pd.merge(df1, df2, on=cols)
At this point, your dataframe looks like:
>>> out
id date pcp base final status type value_x type2 value_y
0 aa Q1 21 0 200 41 5 sun 5 galaxy 1
1 aa Q1 21 0 200 41 5 sun 5 space 1
2 aa Q1 21 0 200 41 5 moon 1 galaxy 1
3 aa Q1 21 0 200 41 5 moon 1 space 1
4 aa Q1 21 0 200 41 5 stars 2 galaxy 1
5 aa Q1 21 0 200 41 5 stars 2 space 1
6 aa Q2 21 1 200 50 6 sun 4 galaxy 1
7 aa Q2 21 1 200 50 6 sun 4 space 2
8 aa Q2 21 1 200 50 6 moon 1 galaxy 1
9 aa Q2 21 1 200 50 6 moon 1 space 2
10 aa Q2 21 1 200 50 6 stars 2 galaxy 1
11 aa Q2 21 1 200 50 6 stars 2 space 2
Now the question is how final3 is derived, so that the dataframe can be reduced.
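The partial answer above can be run as-is by rebuilding the frame from the question's table:

```python
import pandas as pd

# Rebuild the question's data.
df = pd.DataFrame({
    "id":    ["aa", "aa"],
    "date":  ["Q1 21", "Q2 21"],
    "sun":   [5, 4], "moon": [1, 1], "stars": [2, 2],
    "total": [8, 7], "pcp": [0, 1], "base": [200, 200],
    "final": [41, 50], "status": [5, 6],
    "space": [1, 2], "galaxy": [1, 1],
})

cols = ["id", "date", "pcp", "base", "final", "status"]
df = df.drop(columns="total")
# Melt each set of value columns separately, then merge on the shared id columns.
df1 = df.melt(id_vars=cols, value_vars=["sun", "moon", "stars"], var_name="type")
df2 = df.melt(id_vars=cols, value_vars=["galaxy", "space"], var_name="type2")
out = pd.merge(df1, df2, on=cols)
print(out)
```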
I have the following dataframe:
site height_id height_meters
0 9 c3 24
1 9 c2 30
2 9 c1 36
3 3 c0 18
4 3 bf 24
5 3 be 30
6 4 10 18
7 4 0f 24
8 4 0e 30
I want to transform it to the following: the column index is the values of 'site', the values are 'height_meters', and the rows are indexed by the order of the values within each site. (I looked on the internet and didn't find something similar; I tried groupby and some pivot tables without success.)
9 3 4
0 24 18 18
1 30 24 24
2 36 30 30
The gap between the numbers isn't necessary.
Here is the df:
my_df = pd.DataFrame(dict(
site=[9, 9, 9, 3, 3, 3, 4, 4, 4],
height_id='c3,c2,c1,c0,bf,be,10,0f,0e'.split(','),
height_meters=[24, 30, 36, 18, 24, 30, 18, 24, 30]
))
You can use GroupBy.cumcount to create a counter per site group:
print (my_df.groupby('site').cumcount())
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
dtype: int64
You can set it as the index together with the site column and reshape with Series.unstack:
df = my_df.set_index([my_df.groupby('site').cumcount(), 'site'])['height_meters'].unstack()
print (df)
site 3 4 9
0 18 18 24
1 24 24 30
2 30 30 36
A similar solution with DataFrame.pivot and a column created by cumcount (keyword arguments are used here, since positional arguments to pivot were removed in pandas 2.0):
df = my_df.assign(new=my_df.groupby('site').cumcount()).pivot(index='new', columns='site', values='height_meters')
print (df)
site 3 4 9
new
0 18 18 24
1 24 24 30
2 30 30 36
If the original order is important, add DataFrame.reindex with the unique values of the site column:
df = (my_df.set_index([my_df.groupby('site').cumcount(), 'site'])['height_meters']
.unstack()
.reindex(my_df['site'].unique(), axis=1))
print (df)
site 9 3 4
0 24 18 18
1 30 24 24
2 36 30 30
Last, to remove the site (new) column and index names, use DataFrame.rename_axis:
df = df.rename_axis(index=None, columns=None)
print (df)
3 4 9
0 18 18 24
1 24 24 30
2 30 30 36
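Putting the steps above together, a runnable sketch:

```python
import pandas as pd

my_df = pd.DataFrame(dict(
    site=[9, 9, 9, 3, 3, 3, 4, 4, 4],
    height_id='c3,c2,c1,c0,bf,be,10,0f,0e'.split(','),
    height_meters=[24, 30, 36, 18, 24, 30, 18, 24, 30],
))

# Per-site counter becomes the new row index; site becomes the columns.
counter = my_df.groupby('site').cumcount()
df = (my_df.set_index([counter, 'site'])['height_meters']
           .unstack()
           .reindex(my_df['site'].unique(), axis=1)   # keep the 9, 3, 4 order
           .rename_axis(index=None, columns=None))    # drop index/column names
print(df)
```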
I have the following df1:
id period color size rate
1 01 red 12 30
1 02 red 12 30
2 01 blue 12 35
3 03 blue 12 35
4 01 blue 12 35
4 02 blue 12 35
5 01 pink 10 40
6 01 pink 10 40
I need to create an index column that is basically a group of color-size-rate and count the number of unique id(s) that have this combination. My output df after transformation should look like this:
id period color size rate index count
1 01 red 12 30 red-12-30 1
1 02 red 12 30 red-12-30 1
2 01 blue 12 35 blue-12-35 3
3 03 blue 12 35 blue-12-35 3
4 01 blue 12 35 blue-12-35 3
4 02 blue 12 35 blue-12-35 3
5 01 pink 10 40 pink-10-40 2
6 01 pink 10 40 pink-10-40 2
I am able to get a count, but it counts the number of occurrences rather than unique ids:
1 01 red 12 30 red-12-30 2
1 02 red 12 30 red-12-30 2
2 01 blue 12 35 blue-12-35 4
3 03 blue 12 35 blue-12-35 4
4 01 blue 12 35 blue-12-35 4
4 02 blue 12 35 blue-12-35 4
This is wrong as it is not actually grouping by id to count unique ones.
Appreciate any pointers in this direction.
Adding an edit here as my requirement changed:
The count needs to also be grouped by 'period', i.e. my final df should be:
index period count
red-12-30 01 1
red-12-30 02 1
blue-12-35 01 2
blue-12-35 03 1
blue-12-35 02 1
pink-10-40 01 2
When I try adding another groupby('period'), I am getting a dimension mismatch error.
Solution: from #anky below.
Thank you in advance.
You can try aggregating a join to create the index column, then group on that same key and get nunique using groupby + transform:
idx = df[['color', 'size', 'rate']].astype(str).agg('-'.join, axis=1)
out = df.assign(index=idx, count=df.groupby(idx)['id'].transform('nunique'))
print(out)
id period color size rate index count
0 1 1 red 12 30 red-12-30 1
1 1 2 red 12 30 red-12-30 1
2 2 1 blue 12 35 blue-12-35 3
3 3 3 blue 12 35 blue-12-35 3
4 4 1 blue 12 35 blue-12-35 3
5 4 2 blue 12 35 blue-12-35 3
6 5 1 pink 10 40 pink-10-40 2
7 6 1 pink 10 40 pink-10-40 2
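A self-contained version of this answer, rebuilding df from the question's data (periods as strings to keep the leading zeros):

```python
import pandas as pd

df = pd.DataFrame({
    "id":     [1, 1, 2, 3, 4, 4, 5, 6],
    "period": ["01", "02", "01", "03", "01", "02", "01", "01"],
    "color":  ["red", "red", "blue", "blue", "blue", "blue", "pink", "pink"],
    "size":   [12, 12, 12, 12, 12, 12, 10, 10],
    "rate":   [30, 30, 35, 35, 35, 35, 40, 40],
})

# Build the color-size-rate key, then broadcast the per-key unique-id count
# back onto every row with transform('nunique').
idx = df[["color", "size", "rate"]].astype(str).agg("-".join, axis=1)
out = df.assign(index=idx, count=df.groupby(idx)["id"].transform("nunique"))
print(out)
```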
In pandas I have a dataframe as follows (the first two lines below are the column headers, a two-level MultiIndex):
2012 2013 2012 2013
women women men men
0 14 43 24 45
1 34 54 35 65
and would like to get it like
women men
2012 0 14 24
2012 1 34 35
2013 0 43 45
2013 1 54 65
Using df.stack and df.unstack I did not get anywhere.
Any elegant solution?
In [5]: df
Out[5]:
2012 2013
women men women men
0 0 1 2 3
1 4 5 6 7
The idea is to first stack the first level of the columns into the index, and then swap the two index levels (pandas.DataFrame.swaplevel):
In [6]: df.stack(level=0).swaplevel(0,1,axis=0)
Out[6]:
men women
2012 0 1 0
2013 0 3 2
2012 1 5 4
2013 1 7 6
df.stack is most likely what you want. See below; you do need to specify that you want the first level.
In [79]: df = pd.DataFrame(0., index=[0,1], columns=pd.MultiIndex.from_product([[2012,2013], ['women','men']]))
In [83]: df.stack(level=0)
Out[83]:
men women
0 2012 0 0
2013 0 0
1 2012 0 0
2013 0 0
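A runnable sketch using the numbers from the question (the rebuilt column MultiIndex groups by year first, which is an assumption about the original layout; add sort_index to get the year-major row order the question asked for):

```python
import pandas as pd

# Rebuild the question's frame: values are (2012, women), (2012, men),
# (2013, women), (2013, men) per row.
cols = pd.MultiIndex.from_product([[2012, 2013], ["women", "men"]])
df = pd.DataFrame([[14, 24, 43, 45], [34, 35, 54, 65]], columns=cols)

# Move the year level of the columns into the row index, then reorder levels.
out = df.stack(level=0).swaplevel(0, 1, axis=0).sort_index()
print(out)
```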