Can't sort values after aggregation using Pandas dataframe - python

I have the following dataframe:
df[['ID','Team']].groupby(['Team']).agg([('total','count')]).reset_index("total").sort_values("count")
I basically, need to count the number of IDs by Team and then sort by the total number of IDs.
The aggregation part it's good and it gives me the expected result. But when I try the sort part I got this:
KeyError: 'Requested level (total) does not match index name (Team)'
What I am doing wrong?

Use names aggregation for specify new columns names in aggregate function, remove total from DataFrame.reset_index:
df = pd.DataFrame({
'ID':list('abcdef'),
'Team':list('aaabcb')
})
df = df.groupby('Team').agg(count=('ID','count')).reset_index().sort_values("count")
print (df)
Team count
2 c 1
1 b 2
0 a 3
Your solution should be changed by specify column after groupby for processing, then specify new column name with aggregate function in tuple and last also remove total from reset_index:
df = df.groupby('Team')['ID'].agg([('count','count')]).reset_index().sort_values("count")
print (df)
Team count
2 c 1
1 b 2
0 a 3

Related

Assign counts from .count() to a dataframe + column names - pandas python

Hoping someone can help me here - i believe i am close to the solution.
I have a dataframe, of which i have am using .count() in order to return a series of all column names of my dataframe, and each of their respective non-NAN value counts.
Example dataframe:
feature_1
feature_2
1
1
2
NaN
3
2
4
NaN
5
3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe, with the column names "Feature" and "Count". To have the expected output look like this:
Feature
Count
feature_1
5
feature_2
3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However receiving this error message - "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements", as if though it is not recognising the actual column names (Feature) as a column with values.
How can i get it to recognise both Feature and Count columns to be able to add column names to them?
Add Series.reset_index instead Series.to_frame for 2 columns DataFrame - first column from index, second from values of Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution with name parameter and Series.rename_axis or with DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column (the column name is taken as series index, then translated into dataframe index with the func to_frame()). In order to assign a 2 elements list to df.columns you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']

How to get the index value of column value compared with another column value

I want something like this.
Index Sentence
0 I
1 want
2 like
3 this
Keyword Index
want 1
this 3
I tried with df.index("Keyword") but its not giving for all the rows. It will be really helpful if someone solve this.
Use isin with boolean indexing only:
df = df[df['Sentence'].isin(['want', 'this'])]
print (df)
Index Sentence
1 1 want
3 3 this
EDIT: If need compare by another column:
df = df[df['Sentence'].isin(df['Keyword'])]
#another DataFrame df2
#df = df[df['Sentence'].isin(df2['Keyword'])]
And if need index values:
idx = df.index[df['Sentence'].isin(df['Keyword'])]
#alternative
#idx = df[df['Sentence'].isin(df['Keyword'])].index

Sort a column within groups in Pandas

I am new to pandas. I'm trying to sort a column within each group. So far, I was able to group first and second column values together and calculate the mean value in third column. But I am still struggling to sort 3rd column.
This is my input dataframe
This is my dataframe after applying groupby and mean function
I used the following line of code to group input dataframe,
df_o=df.groupby(by=['Organization Group','Department']).agg({'Total Compensation':np.mean})
Please let me know how to sort the last column for each group in 1st column using pandas.
It seems you need sort_values:
#for return df add parameter as_index=False
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
'Department':['d','f','a','a'],
'Total Compensation':[1,8,9,1]})
print (df)
Department Organization Group Total Compensation
0 d a 1
1 f b 8
2 a a 9
3 a a 1
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
print (df_o)
Organization Group Department Total Compensation
0 a a 5
1 a d 1
2 b f 8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
Organization Group Department Total Compensation
1 a d 1
0 a a 5
2 b f 8

Counting instances in a pandas DF

I have a dataframe that looks like this:
TF name
0 A
1 A
0 A
0 A
1 B
1 B
0 B
1 B
1 B
I need to produce a resulting dataframe that would count how many 0's and 1's each person in my dataframe has.
So the result for the above would be:
name True False
A 3 1
B 1 4
I don't think groupby would work in this instance. Any solution other than looping and counting?
You can perform groupby letting TF be the grouped key. Take the corresponding value_counts of the name column to get distinct counts.
Unstack level=0 of the multi-index series so that a dataframe object gets produced. Finally, rename the integer columns by type-casting them as boolean values.
df.groupby('TF')['name'].value_counts().unstack(0).rename(columns=bool)
To have the column names take on string values:
1) Use lambda function:
<...operations on df...>.rename(columns=lambda x: str(x.astype(bool)))
2) Or chain the syntaxes together:
<...operations on df...>.rename(columns=bool).rename(columns=str)
I would first convert your columns to boolean and then group by both name and TF and then unstack the boolean column TF.
df['TF']=df['TF'].astype(bool)
df.groupby(['name', 'TF']).size().unstack('TF')
TF False True
name
A 3 1
B 1 4

Pivot Pandas Dataframe with Duplicates using Masking

A non-indexed df contains rows of gene, a cell that contains a mutation in that gene, and the type of mutation in that gene:
df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
'cell': ['A', 'A', 'C', 'A', 'B', 'C','A'],
'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})
df:
cell gene mutation
0 A one frameshift
1 A one missense
2 C one nonsense
3 A two 3UTR
4 B two 3UTR
5 C two 3UTR
6 A three 3UTR
I'd like to pivot this df so I can index by gene and set columns to cells. The trouble is that there can be multiple entries per cell: there can be multiple mutations in any one gene in a given cell (cell A has two different mutations in gene One). So when I run:
df.pivot_table(index='gene', columns='cell', values='mutation')
this happens:
DataError: No numeric types to aggregate
I'd like to use masking to perform the pivot while capturing the presence of at least one mutation:
A B C
gene
one 1 1 1
two 0 1 0
three 1 1 0
Solution with drop_duplicates and pivot_table:
df = df.drop_duplicates(['cell','gene'])
.pivot_table(index='gene',
columns='cell',
values='mutation',
aggfunc=len,
fill_value=0)
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
Another solution with drop_duplicates, groupby with aggregate size and last reshape by unstack:
df = df.drop_duplicates(['cell','gene'])
.groupby(['cell', 'gene'])
.size()
.unstack(0, fill_value=0)
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
The error message is not what is produced when you run pivot_table. You can have multiple values in the index for pivot_table. I don't believe this is true for the pivot method. You can however fix your problem by changing the aggregation to something that works on strings as opposed to numerics. Most aggregation functions operate on numeric columns and the code you wrote above would produce an error relating to the data type of the column not an index error.
df.pivot_table(index='gene',
columns='cell',
values='mutation',
aggfunc='count', fill_value=0)
If you only want 1 value per cell you can do a groupby and aggregate everything to 1 and then unstack a level.
df.groupby(['cell', 'gene']).agg(lambda x: 1).unstack(fill_value=0)

Categories

Resources