I am new to pandas and am trying to sort a column within each group. So far, I have grouped by the first and second columns and calculated the mean of the third column, but I am still struggling to sort the third column.
This is my input dataframe
This is my dataframe after applying groupby and mean function
I used the following line of code to group the input dataframe:
df_o=df.groupby(by=['Organization Group','Department']).agg({'Total Compensation':np.mean})
Please let me know how to sort the last column within each group of the first column using pandas.
It seems you need sort_values:
# add as_index=False so the result is a DataFrame with the group keys as regular columns
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
'Department':['d','f','a','a'],
'Total Compensation':[1,8,9,1]})
print (df)
Department Organization Group Total Compensation
0 d a 1
1 f b 8
2 a a 9
3 a a 1
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
print (df_o)
Organization Group Department Total Compensation
0 a a 5
1 a d 1
2 b f 8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
Organization Group Department Total Compensation
1 a d 1
0 a a 5
2 b f 8
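If the goal is to keep each Organization Group together and order the departments inside it, one option is to put the group column first in sort_values. The following is a minimal sketch on the same sample data; the name df_sorted and the choice of descending order within each group are assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Organization Group': ['a', 'b', 'a', 'a'],
                   'Department': ['d', 'f', 'a', 'a'],
                   'Total Compensation': [1, 8, 9, 1]})

# Mean compensation per (group, department) pair, as a flat DataFrame
df_o = df.groupby(['Organization Group', 'Department'],
                  as_index=False)['Total Compensation'].mean()

# Listing the group column first keeps each group contiguous; the second
# key orders rows inside each group (descending here is an assumption)
df_sorted = df_o.sort_values(['Organization Group', 'Total Compensation'],
                             ascending=[True, False])
print(df_sorted)
```

Passing a list to ascending lets each sort key have its own direction, so the groups can stay alphabetical while the means run from highest to lowest.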
Related
I have a dataframe:
df1 = pd.DataFrame({'id': ['1','2','2','3','3','4','4'],
'name': ['James','Jim','jimy','Daniel','Dane','Ash','Ash'],
'event': ['Basket','Soccer','Soccer','Basket','Soccer','Basket','Soccer']})
I want to count the rows for each id while also listing the names. The result I expect is:
id name count
1 James 1
2 Jim, jimy 2
3 Daniel, Dane 2
4 Ash 2
I tried to group by id and name, but it doesn't count as I expected.
You could try:
df1.groupby('id').agg(
name=('name', lambda x: ', '.join(x.unique())),
count=('name', 'count')
)
We are basically grouping by id and then joining the unique names to a comma separated list!
Here is a solution:
groups = df1[["id", "name"]].groupby("id")
a = groups.agg(lambda x: ", ".join( set(x) ))
b = groups.size().rename("count")
c = pd.concat([a,b], axis=1)
I'm not an expert when it comes to pandas, but I thought I might as well post my solution because I find it readable, even if it is maybe not the shortest.
In your example, the groupby is done on the id column alone, not on id and name. The name column you see in your expected DataFrame is the result of an aggregation done after the groupby.
The steps are:
Create a groupby object groups by grouping by id
Create a DataFrame a from groups by joining each group's names with commas, using set(...) to remove the duplicates: lambda x: ", ".join( set(x) ). Note that a set does not guarantee order, so the names within a row may come out in a different order.
The DataFrame a will thus have the following data:
name
id
1 James
2 Jim, jimy
3 Daniel, Dane
4 Ash
Create a Series b by computing the size of each group in groups: groups.size() (you should also rename it to count)
id
1 1
2 2
3 2
4 2
Name: count, dtype: int64
Concatenate a and b horizontally and you get what you wanted:
name count
id
1 James 1
2 Jim, jimy 2
3 Daniel, Dane 2
4 Ash 2
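Because set(...) does not promise any particular order, the joined names can vary between runs. A minimal sketch of an order-preserving variant follows; it swaps in dict.fromkeys for the de-duplication step, which is my substitution rather than something either answer above uses, and the name result is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '2', '3', '3', '4', '4'],
                    'name': ['James', 'Jim', 'jimy', 'Daniel', 'Dane', 'Ash', 'Ash'],
                    'event': ['Basket', 'Soccer', 'Soccer', 'Basket', 'Soccer', 'Basket', 'Soccer']})

result = df1.groupby('id').agg(
    # dict.fromkeys drops duplicates while keeping first-seen order
    name=('name', lambda s: ', '.join(dict.fromkeys(s))),
    # 'count' counts the non-null names in each group
    count=('name', 'count'),
).reset_index()
print(result)
```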
I have the following dataframe:
df[['ID','Team']].groupby(['Team']).agg([('total','count')]).reset_index("total").sort_values("count")
Basically, I need to count the number of IDs per Team and then sort by the total number of IDs.
The aggregation part is fine and gives me the expected result. But when I add the sort part, I get this:
KeyError: 'Requested level (total) does not match index name (Team)'
What am I doing wrong?
Use named aggregation to specify the new column names in the aggregate function, and remove total from DataFrame.reset_index:
df = pd.DataFrame({
'ID':list('abcdef'),
'Team':list('aaabcb')
})
# keep the result in a new name so the original df stays available
out = df.groupby('Team').agg(count=('ID','count')).reset_index().sort_values("count")
print (out)
Team count
2 c 1
1 b 2
0 a 3
To fix your own solution, select the column to process after groupby, pass the new column name together with the aggregate function as a tuple, and also remove total from reset_index:
out = df.groupby('Team')['ID'].agg([('count','count')]).reset_index().sort_values("count")
print (out)
Team count
2 c 1
1 b 2
0 a 3
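The first positional argument of reset_index names an index level to reset, which is why passing a column label there raises a KeyError. As a possible alternative for plain row counts, here is a minimal sketch using size(); it is my substitution, not part of the answer above, and the name counts is hypothetical. Unlike count, size also counts rows where ID is missing.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list('abcdef'),
    'Team': list('aaabcb')
})

counts = (df.groupby('Team')
            .size()                      # rows per team, as a Series
            .reset_index(name='count')   # back to a DataFrame with a named column
            .sort_values('count'))       # smallest team first
print(counts)
```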
I ran g = df.groupby('name of the column') and got a groupby object. Now, for every distinct value of 'name of the column', I want to sum the values in another column, so that I end up with a Series (sorted by the sum) containing each value of 'name of the column' and its respective sum. What I did was:
for name, dfaux in g:
print(name, dfaux['name of the column where the values are specified'].sum())
I did get the series that I wanted, but I don't know how to sort it. Any help? Thanks!
Do you want the kind of sorting shown below? If so, you can write it like this.
Your data-frame:
0 a 1
1 b 2
2 a 3
3 c 4
4 b 5
If you expect the output to be
a 4
c 4
b 7
d = {'col1':['a','b','a','c','b'], 'col2':[1,2,3,4,5]}
df = pd.DataFrame(d)
print(df.groupby(['col1']).sum().sort_values(by=['col2']))
Here groupby(...).sum() returns a DataFrame that keeps the original column names,
so you can just sort the returned DataFrame with sort_values.
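Since the question asks for a sorted Series rather than a DataFrame, a minimal sketch of that variant is to select the value column before summing. The name totals and the descending order are assumptions for illustration.

```python
import pandas as pd

d = {'col1': ['a', 'b', 'a', 'c', 'b'], 'col2': [1, 2, 3, 4, 5]}
df = pd.DataFrame(d)

# Selecting a single column before sum() yields a Series indexed by col1
totals = df.groupby('col1')['col2'].sum().sort_values(ascending=False)
print(totals)
```

This replaces the explicit for loop over the groups: the sum is computed for every group at once, and sort_values on a Series needs no by argument.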
I am trying to add an underscore and incremental numbers to any repeating values ordered by index and within a group that is defined by another column.
For example, I would like the repeating values in the Chemistry column to have underscores and incremental numbers ordered by index and grouped by the Cycle column.
df = pd.DataFrame([[1,1,1,1,1,1,2,2,2,2,2,2], ['NaOH', 'H20', 'MWS', 'H20', 'MWS', 'NaOh', 'NaOH', 'H20', 'MWS', 'H20', 'MWS', 'NaOh']]).transpose()
df.columns = ['Cycle', 'Chemistry']
df
Original Table
So the output will look like the table in the link below:
Desired output table
IIUC:
pandas.Series.str.cat and cumcount
df['Chemistry'] = df.Chemistry.str.cat(
df.groupby(['Cycle', 'Chemistry']).cumcount().add(1).astype(str),
sep='_'
)
df
Cycle Chemistry
0 1 NaOH_1
1 1 H20_1
2 1 MWS_1
3 1 H20_2
4 1 MWS_2
5 1 NaOh_1
6 2 NaOH_1
7 2 H20_1
8 2 MWS_1
9 2 H20_2
10 2 MWS_2
11 2 NaOh_1
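To see what cumcount contributes on its own, here is a minimal sketch on a smaller, hypothetical frame (the four rows and the names occurrence and labels are made up for illustration). cumcount numbers the rows of each (Cycle, Chemistry) group from 0 in index order, so adding 1 gives the suffix.

```python
import pandas as pd

# Hypothetical data: H20 appears three times within the same cycle
df = pd.DataFrame({'Cycle': [1, 1, 1, 1],
                   'Chemistry': ['H20', 'MWS', 'H20', 'H20']})

# 0-based position of each row inside its (Cycle, Chemistry) group
occurrence = df.groupby(['Cycle', 'Chemistry']).cumcount()
print(occurrence.tolist())

# Turn the position into a 1-based suffix
labels = df['Chemistry'] + '_' + (occurrence + 1).astype(str)
print(labels.tolist())
```

Grouping is case-sensitive, so values such as NaOH and NaOh are counted as different chemicals.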
I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
df = pd.DataFrame({'Number Type 1':[1,2,np.nan],
'Number Type 2':[np.nan,3,4],
'Info':list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, reshaping the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column and the types are extracted into the Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 share the same information, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
# raw string avoids an invalid-escape warning; expand=False returns a Series
df['Type'] = df['Type'].str.extract(r'(\d+)', expand=False)
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
# raw string avoids an invalid-escape warning; expand=False returns a Series
df['Type'] = df['Type'].str.extract(r'(\d+)', expand=False)
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4
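The two outputs differ only in their index labels (0, 1, 4, 5 versus 0, 1, 2, 3). A minimal sketch of why, starting again from the original frame: melt stacks the two number columns into six rows, and dropna keeps the labels of the surviving rows. The names long, kept and tidy are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})

# One row per (Info, original column) pair: 3 rows x 2 columns = 6 rows
long = df.melt('Info', value_name='Number', var_name='Type')

# Rows whose Number is NaN are removed, but the remaining labels are untouched
kept = long.dropna(subset=['Number'])
print(kept.index.tolist())

# Optional: renumber the rows from 0
tidy = kept.reset_index(drop=True)
print(tidy)
```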