Sort pandas dataframe within groups - python

I have a dataframe:
>>> df
  Category  Score
0        A      1
1        A      2
2        A      3
3        B      5
4        B      9
I want to sort Score within each Category, so I expect the output:
>>> df
  Category  Score
2        A      3
1        A      2
0        A      1
4        B      9
3        B      5
Any ideas?

Use sort_values, passing the columns in the order you want them sorted:
In [17]: df.sort_values(by=['Category', 'Score'], ascending=[True, False])
Out[17]:
  Category  Score
2        A      3
1        A      2
0        A      1
4        B      9
3        B      5
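If the per-group criteria were more involved (say, a different sort key per group), the same result can be produced group by group. A minimal sketch using groupby.apply; this is an addition, not part of the original answer:

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'A', 'A', 'B', 'B'],
                   'Score': [1, 2, 3, 5, 9]})

# Sort each Category group by Score descending; group_keys=False keeps
# the flat row index instead of prepending the Category group key.
out = (df.groupby('Category', group_keys=False)
         .apply(lambda g: g.sort_values('Score', ascending=False)))
print(out)

For a plain column sort like this one, though, sort_values as shown above is shorter and faster.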

Count unique values for each group in multi column with criteria in Pandas

UPDATED THE SAMPLE DATASET
I have the following data:
location ID Value
A 1 1
A 1 1
A 1 1
A 1 1
A 1 2
A 1 2
A 1 2
A 1 2
A 1 3
A 1 4
A 2 1
A 2 2
A 3 1
A 3 2
B 4 1
B 4 2
B 5 1
B 5 1
B 5 2
B 5 2
B 6 1
B 6 1
B 6 1
B 6 1
B 6 1
B 6 2
B 6 2
B 6 2
B 7 1
I want to count unique Values (only where the value equals 1 or 2) for each location and for each ID, to get the following output:
location ID_Count Value_Count
A 3 6
B 4 7
I tried using df.groupby(['location'])['ID','Value'].nunique(), but that gives the unique count over all values, e.g. a Value count of 4 for A and 2 for B.
Try agg, slicing ID on the True values of the isin mask; inside agg, x is the boolean mask for one location, so x[x].index are the rows where Value is 1 or 2. For your updated sample you just need to drop duplicates before processing; the rest is the same:
df = df.drop_duplicates(['location', 'ID', 'Value'])
df_agg = (df.Value.isin([1,2]).groupby(df.location)
            .agg(ID_count=lambda x: df.loc[x[x].index, 'ID'].nunique(),
                 Value_count='sum'))
Out[93]:
          ID_count  Value_count
location
A                3            6
B                4            7
IIUC, you can try Series.isin with groupby.agg:
out = (df.assign(Value_Count=df['Value'].isin([1,2]))
         .groupby("location", as_index=False)
         .agg({"ID": 'nunique', "Value_Count": 'sum'}))
print(out)
  location  ID  Value_Count
0        A   3          6.0
1        B   4          7.0
Roughly the same as anky's, but using Series.where and named aggregation, so we can rename the columns while creating them in the groupby:
grp = df.assign(Value=df['Value'].where(df['Value'].isin([1, 2]))).groupby('location')
grp.agg(
    ID_count=('ID', 'nunique'),
    Value_count=('Value', 'count')
).reset_index()
  location  ID_count  Value_count
0        A         3            6
1        B         4            7
Let's try a very similar approach to the other answers. This time we filter first:
(df[df['Value'].isin([1,2])]
   .groupby(['location'], as_index=False)
   .agg({'ID': 'nunique', 'Value': 'size'})
)
Output:
  location  ID  Value
0        A   3      6
1        B   4      7
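For completeness, the filter-first route can be combined with named aggregation (pandas 0.25+) to produce exactly the column names the question asks for. A sketch against the question's df, assuming duplicates are dropped first as in the first answer:

dedup = df.drop_duplicates(['location', 'ID', 'Value'])
out = (dedup[dedup['Value'].isin([1, 2])]
         .groupby('location')
         .agg(ID_Count=('ID', 'nunique'),      # distinct IDs per location
              Value_Count=('Value', 'size'))   # rows remaining after the filter
         .reset_index())
print(out)

which gives the requested counts:
  location  ID_Count  Value_Count
0        A         3            6
1        B         4            7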

Split rows into multiple rows based on column value

Input DF:
Index  Parameters      A      B       C
1      Apple           1      2       3
2      Banana          2      4       5
3      Potato          3      5       2
4      Tomato      1 x 4  1 x 6  2 x 12
Output DF
Index  Parameters  A  B   C
1      Apple       1  2   3
2      Banana      2  4   5
3      Potato      3  5   2
4      Tomato_P    1  1   2
5      Tomato_Q    4  6  12
Problem Statement:
I want to convert a row into multiple rows when a cell contains several values (the Tomato row), splitting on the separator x.
Code/Findings:
I have code that works if I transpose this dataset, apply the answer from here or here, and then transpose back.
Looking for a solution that works directly on the given dataframe.
A solution if the data always has at most one x per cell: first Series.str.split the columns in the list, then Series.explode, join the remaining columns back with DataFrame.join, and set the _P and _Q suffixes with Index.duplicated and numpy.select:
cols = ['A','B','C']
df[cols] = df[cols].apply(lambda x: x.str.split(' x '))
df1 = pd.concat([df[x].explode() for x in cols], axis=1)
#print (df1)
df = df[df.columns.difference(cols)].join(df1)
df['Parameters'] += np.select([df.index.duplicated(keep='last'),
                               df.index.duplicated()],
                              ['_P', '_Q'],
                              default='')
df = df.reset_index(drop=True)
print (df)
  Parameters  A  B   C
0      Apple  1  2   3
1     Banana  2  4   5
2     Potato  3  5   2
3   Tomato_P  1  1   2
4   Tomato_Q  4  6  12
EDIT:
Answer with no explode:
cols = df.columns[1:]
df1 = (pd.concat([df[x].str.split(' x ', expand=True).stack() for x in cols],
                 axis=1, keys=cols)
         .reset_index(level=1, drop=True))
print (df1)
       A  B   C
Index
1      1  2   3
2      2  4   5
3      3  5   2
4      1  1   2
4      4  6  12
df = df.iloc[:, [0]].join(df1)
df['Parameters'] += np.select([df.index.duplicated(keep='last'),
                               df.index.duplicated()],
                              ['_P', '_Q'],
                              default='')
df = df.reset_index(drop=True)
print (df)
  Parameters  A  B   C
0      Apple  1  2   3
1     Banana  2  4   5
2     Potato  3  5   2
3   Tomato_P  1  1   2
4   Tomato_Q  4  6  12
This is more of an explode problem; explode is available from pandas 0.25:
df[['A','B','C']]=df[['A','B','C']].apply(lambda x : x.str.split(' x '))
df
   Index Parameters       A       B        C
0      1      Apple     [1]     [2]      [3]
1      2     Banana     [2]     [4]      [5]
2      3     Potato     [3]     [5]      [2]
3      4     Tomato  [1, 4]  [1, 6]  [2, 12]
df.set_index(['Index','Parameters'],inplace=True)
pd.concat([df[x].explode() for x in ['A','B','C']],axis=1)
                  A  B   C
Index Parameters
1     Apple       1  2   3
2     Banana      2  4   5
3     Potato      3  5   2
4     Tomato      1  1   2
      Tomato      4  6  12
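On newer pandas (1.3+), DataFrame.explode accepts a list of columns, so the concat step is unnecessary. A minimal sketch, rebuilding the question's input:

import pandas as pd

df = pd.DataFrame({'Parameters': ['Apple', 'Banana', 'Potato', 'Tomato'],
                   'A': ['1', '2', '3', '1 x 4'],
                   'B': ['2', '4', '5', '1 x 6'],
                   'C': ['3', '5', '2', '2 x 12']})

cols = ['A', 'B', 'C']
df[cols] = df[cols].apply(lambda c: c.str.split(' x '))
# Explode all three list columns in lockstep (pandas >= 1.3); the per-row
# list lengths must match across the exploded columns.
out = df.explode(cols).reset_index(drop=True)
print(out)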

Aggregate data frame rows based on conditions

I have this table
A B C E
1 2 1 3
1 2 4 4
2 7 1 1
3 4 0 2
3 4 8 3
Now, I want to remove duplicates based on columns A and B and at the same time sum up column C. For E, it should take the value from the row where C is at its maximum. The desired result table should look like this:
A B C E
1 2 5 4
2 7 1 1
3 4 8 3
I tried this: df.groupby(['A', 'B']).sum()['C'], but my data frame does not change at all. I think I didn't incorporate the E column part properly... Can somebody advise?
Thanks so much!
Rows that share the same A and B count as duplicates, so we can group by those two columns.
In [20]: df
Out[20]:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  3

In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A  B
1  1    6
3  3    8
Name: C, dtype: int64
As for "I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all": yes, that is because pandas does not overwrite the initial DataFrame; groupby returns a new object.
In [22]: df
Out[22]:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  3
You have to overwrite it explicitly.
In [23]: df = df.groupby(['A', 'B'])['C'].sum()

In [24]: df
Out[24]:
A  B
1  1    6
3  3    8
Name: C, dtype: int64
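The snippet above only covers the C sum; the question also wants E taken from the row where C is largest within each (A, B) group. A sketch using idxmax on the question's original table; it assumes ties in C are broken by the first occurrence:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'E': [3, 4, 1, 2, 3]})

# Row labels where C is largest within each (A, B) group.
idx = df.groupby(['A', 'B'])['C'].idxmax()

out = df.groupby(['A', 'B'], as_index=False)['C'].sum()
# Both aggregations sort by the group keys, so the rows line up.
out['E'] = df.loc[idx, 'E'].to_numpy()
print(out)

   A  B  C  E
0  1  2  5  4
1  2  7  1  1
2  3  4  8  3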

Pandas: Sort the column on frequency by another column having same value grouped

I have a dataframe that is grouped by the y column and sorted on the count of each y value (the count column).
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort('count', ascending=False)
Output:
x y count
1 a 4
3 a 4
2 a 4
1 a 4
2 c 3
1 c 3
2 c 3
2 b 2
1 b 2
Now I want to sort the x column by its frequency within each y group, keeping equal x values together, like below:
Expected Output:
x y count
1 a 4
1 a 4
2 a 4
3 a 4
2 c 3
2 c 3
1 c 3
2 b 2
1 b 2
It seems you need groupby with value_counts, and then numpy.repeat to expand the index values by their counts into a DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y  x
a  1    2
   2    1
   3    1
c  2    2
   1    1
b  1    1
   2    1
Name: x, dtype: int64
df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
#change order of columns (reindex_axis was removed in newer pandas)
df1 = df1.reindex(columns=['x','y'])
print (df1)
   x  y
0  1  a
1  1  a
2  2  a
3  3  a
4  2  c
5  2  c
6  1  c
7  1  b
8  2  b
If you are using an older version where df.sort_values is not supported, you can use:
df.sort(columns=['count','x'], ascending=[False,True])
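A shorter route that keeps the original rows is to compute both frequencies with transform and sort once. A sketch; ties in the x frequency are broken by x ascending, which matches the answer's output above rather than the b ordering in the expected output:

import pandas as pd

df = pd.DataFrame({'x': [1, 3, 2, 1, 2, 1, 2, 2, 1],
                   'y': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'b', 'b']})

df['count'] = df.groupby('y')['y'].transform('size')            # frequency of y
df['x_count'] = df.groupby(['y', 'x'])['x'].transform('size')   # frequency of x within y
out = (df.sort_values(['count', 'y', 'x_count', 'x'],
                      ascending=[False, True, False, True])
         .drop(columns='x_count'))
print(out)

The y key in the middle keeps groups contiguous if two y values ever share the same count.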

How to groupby based on two columns in pandas?

A similar question may have been asked before, but I couldn't find one that exactly fits my problem.
I want to group a dataframe by two columns.
For example, to turn this:
id product quantity
1 A 2
1 A 3
1 B 2
2 A 1
2 B 1
3 B 2
3 B 1
Into this:
id product quantity
1 A 5
1 B 2
2 A 1
2 B 1
3 B 3
That is, sum the "quantity" column over rows with the same "id" and "product".
You need groupby with the parameter as_index=False to return a DataFrame, aggregating sum:
df = df.groupby(['id','product'], as_index=False)['quantity'].sum()
print (df)
   id product  quantity
0   1       A         5
1   1       B         2
2   2       A         1
3   2       B         1
4   3       B         3
Or add reset_index:
df = df.groupby(['id','product'])['quantity'].sum().reset_index()
print (df)
   id product  quantity
0   1       A         5
1   1       B         2
2   2       A         1
3   2       B         1
4   3       B         3
You can use pivot_table with aggfunc='sum'
df.pivot_table('quantity', ['id', 'product'], aggfunc='sum').reset_index()
   id product  quantity
0   1       A         5
1   1       B         2
2   2       A         1
3   2       B         1
4   3       B         3
You can use groupby with an aggregate function:
import pandas as pd
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3],
    'product': ['A', 'A', 'B', 'A', 'B', 'B', 'B'],
    'quantity': [2, 3, 2, 1, 1, 2, 1]
})
print(df)
   id product  quantity
0   1       A         2
1   1       A         3
2   1       B         2
3   2       A         1
4   2       B         1
5   3       B         2
6   3       B         1
df = df.groupby(['id','product']).agg({'quantity':'sum'}).reset_index()
print(df)
   id product  quantity
0   1       A         5
1   1       B         2
2   2       A         1
3   2       B         1
4   3       B         3
