Sorting an Aggregated DataFrame in Python [duplicate]

This question already has answers here:
Multi Index Sorting in Pandas
(2 answers)
Closed 2 years ago.
I have a dataframe and I aggregated it as below. I want to sort it (descending) according to 'mean'. I'm using the code below, but it gives an error.
df_agg = df.groupby('Subject Field').agg({'seniority_level':['min','mean','median','max']})
df_agg.sort_values(by='mean',ascending=False).head(10)
Error

Your aggregated dataframe has a multi-level column index, so you need to address this by specifying both seniority_level and mean.
df_agg.sort_values(('seniority_level', 'mean'), ascending=False)
Quick check to demonstrate:
df = pd.DataFrame({
'Accounting': [1, 2, 3],
'Acoustics': [4, 5, 6],
}).melt(var_name='Subject Field', value_name='seniority_level')
df_agg = df.groupby('Subject Field').agg(
{'seniority_level':['min', 'mean', 'median']}
)
df_agg.sort_values(('seniority_level','mean'), ascending=True)
              seniority_level
                          min mean median
Subject Field
Accounting                  1    2      2
Acoustics                   4    5      5
df_agg.sort_values(('seniority_level','mean'), ascending=False)
              seniority_level
                          min mean median
Subject Field
Acoustics                   4    5      5
Accounting                  1    2      2
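If you would rather avoid the MultiIndex entirely, named aggregation is an alternative; this is just a sketch (assuming pandas 0.25+ and your original df), and it produces flat column names so sort_values('mean') works directly:

import pandas as pd

# named aggregation yields single-level columns instead of a MultiIndex
df_agg = df.groupby('Subject Field').agg(
    min=('seniority_level', 'min'),
    mean=('seniority_level', 'mean'),
    median=('seniority_level', 'median'),
    max=('seniority_level', 'max'),
)
df_agg.sort_values('mean', ascending=False).head(10)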

pandas: create a new column by dividing one column by another, checking that I do not divide by 0 [duplicate]

This question already has answers here:
handling zeros in pandas DataFrames column divisions in Python
(4 answers)
Closed 3 months ago.
I want to create a new column based on dividing one column by another, but make sure that I do not divide by 0; if the price is 0, the result should be set to None.
If I just divide, I get 'inf' where the price is 0:
df['new'] = df['memory'] / df['price']
   id  memory    price
0   0    7568   751.64
1   1   53759   885.17
2   2   41140  1067.78
3   3   10558        0
4   4   44436  1023.13
I didn't find a way to add a condition.
To avoid division by zero, only perform the division where the price is non-zero. Please take a look at the following example; I hope it helps.
import pandas as pd

data = {'id': [0, 1, 2, 3, 4],
        'memory': [7568, 53759, 41140, 10558, 44436],
        'price': [751.64, 885.17, 1067.78, 0, 1023.13]}
df = pd.DataFrame(data)

# add a new column and set every value to "none"
df['new'] = "none"

# overwrite with the division only where the price (column 2) is non-zero
for i in range(len(df)):
    if df.iat[i, 2] != 0:
        df.iat[i, 3] = df.iat[i, 1] / df.iat[i, 2]

print(df)
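As a vectorized alternative (just a sketch, not the original answer), you can replace zero prices with NaN before dividing; rows with a zero price then end up as NaN rather than the string "none":

import numpy as np
import pandas as pd

data = {'id': [0, 1, 2, 3, 4],
        'memory': [7568, 53759, 41140, 10558, 44436],
        'price': [751.64, 885.17, 1067.78, 0, 1023.13]}
df = pd.DataFrame(data)

# a zero divisor becomes NaN, so the division yields NaN instead of inf
df['new'] = df['memory'] / df['price'].replace(0, np.nan)
print(df)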

I want to return the original df indexing for values returned by pd.groupby.max() function [duplicate]

This question already has answers here:
Pandas groupby: how to select adjacent column data after selecting a row based on data in another column in pandas groupby groups?
(3 answers)
Closed 2 years ago.
The code below will create a table containing the max temps for each day. What I would like to do is return the index for all of these max temp values so I can apply it to the original df.
df = pd.DataFrame({'date': list1, 'max_temp': list2})
grouped = df.groupby(by='date', as_index=False).max()
You can define another column called "index" before grouping the dataframe:
import pandas as pd
list1 = [7, 9, 3, 4]
list2 = [8, 6, 8, 9]
df = pd.DataFrame({'date': list1, 'max_temp': list2})
df['index'] = df.index
grouped = df.groupby(by="date", as_index=False).max()
print(grouped)
Output:
   date  max_temp  index
0     3         8      2
1     4         9      3
2     7         8      0
3     9         6      1
Now, using df.query, we can look up the "date" value for a given original "index":
print(grouped.query("index==0")["date"])
Output:
2 7
Name: date, dtype: int64
df.groupby('date')['max_temp'].idxmax()
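If I read this suggestion correctly, idxmax returns the original row label of each group's maximum, so (as a quick sketch, with max_idx just an illustrative name) the labels can be fed straight back into df.loc to recover the full rows:

# original index labels of the max temp within each date
max_idx = df.groupby('date')['max_temp'].idxmax()

# rows of the original dataframe where each daily max occurred
print(df.loc[max_idx])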
It would seem I've found a great solution at the following link...
Pandas groupby: how to select adjacent column data after selecting a row based on data in another column in pandas groupby groups?
(Although this doesn't seem to be the answer they accepted, for some reason.) Anyway, the following worked well for me, if anyone finds themselves in the same position...
idx = df.groupby('date')['max_temp'].transform(max) == df['max_temp']
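For completeness, a short sketch of how that boolean mask can then be used to pull the matching rows (with their original index) out of df:

# keep only the rows holding each date's maximum temperature;
# the original index labels are preserved
print(df[idx])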

How do I create one column in pandas (Python) based on indexes out of multiple columns? [duplicate]

This question already has answers here:
Pandas Melt Function
(2 answers)
Closed 2 years ago.
I have a data frame where there are multiple options for a certain index (a 1-M relationship) - e.g. states as the index and counties as the respective columns. I want to group it in a way that creates just one column, but with all the values. This is a basic transformation, but somehow I can't get it right.
Sorry, I don't know how to insert code that has actually been run, so here is the code to create example DataFrames showing what I have and what I'd like to end up with.
import numpy as np
import pandas as pd

df = pd.DataFrame({'INDEX': ['INDEX1', 'INDEX2', 'INDEX3'],
                   'col1': ['a', 'b', 'd'],
                   'col2': ['c', 'f', np.nan],
                   'col3': ['e', np.nan, np.nan]})
and I want to transform it so that I end up with this data frame:
pd.DataFrame({'INDEX': ['INDEX1', 'INDEX1', 'INDEX1', 'INDEX2', 'INDEX2', 'INDEX3'],
              'col1': ['a', 'c', 'e', 'b', 'f', 'd']})
You can use melt here:
df = pd.melt(df, id_vars=['INDEX']).drop(columns=['variable']).dropna()
print(df)
    INDEX value
0  INDEX1     a
1  INDEX2     b
2  INDEX3     d
3  INDEX1     c
4  INDEX2     f
6  INDEX1     e
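If you also need the result to match the desired frame exactly (a column named col1, rows grouped by INDEX, and a fresh 0..n index), a small follow-up sketch could be:

df = (df.rename(columns={'value': 'col1'})
        .sort_values('INDEX', kind='mergesort')  # stable sort keeps the original order within each INDEX
        .reset_index(drop=True))
print(df)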

python pandas groupby unexpected empty column [duplicate]

This question already has answers here:
How to assign a name to the size() column?
(5 answers)
Closed 3 years ago.
I want to aggregate some data to append to a dataframe. The following gives me the number of wins per name
import pandas as pd
data = [[1,'tom', 10], [1,'nick', 15], [2,'juli', 14], [2,'peter', 20], [3,'juli', 3], [3,'peter', 13]]
have = pd.DataFrame(data, columns = ['Round', 'Winner', 'Score'])
WinCount= have.groupby(['Winner']).size().to_frame('WinCount')
WinCount
However, the output does not give me two columns named Winner and WinCount. Instead, the first column has no name, and the column name then appears on the second line.
How can I get a dataframe without these two "blank" fields?
Try this:
WinCount=have.groupby(['Winner']).size().to_frame('WinCount').reset_index()
Output:
  Winner  WinCount
0   juli         2
1   nick         1
2  peter         2
3    tom         1
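Equivalently (and closer to the linked duplicate), you can name the size() column in one step; a minimal sketch:

WinCount = have.groupby('Winner').size().reset_index(name='WinCount')
print(WinCount)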

How to calculate a difference between values in the same column with data in a "long" format in Python/Pandas [duplicate]

This question already has answers here:
Pandas groupby multiple fields then diff
(2 answers)
Closed 4 years ago.
I have a data frame sorted by IDs in long format. Most IDs have more than one row, and all rows have a date. I want to calculate the difference between dates in consecutive rows within each ID.
I've tried using a groupby object in Pandas and pivoting the data to a wide format, but haven't had success. The setup is below. (Sorry, I couldn't figure out how to post the console output of the setup code.)
The integers in the date column are stand-ins for the dates. I know how to work with dates, so I don't need help there. The code should calculate the difference in dates between consecutive rows within an ID and put the difference in a new column called 'difference' (i.e., it should "start over" when it gets to the next ID). The first row in each ID will not have an entry for difference because there is no difference to calculate. The second should be the difference between the dates in the first and second rows within an ID, etc.
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'action': ['first', 'end', 'first', 'change', 'change',
                              'last', 'first', 'change', 'end'],
                   'date': [1, 2, 2, 4, 6, 8, 1, 2, 9],
                   'movement': [1, 0, 1, 1, 1, 0, 1, 1, 0]})
Here is an image of the dataframe from my console:
Code to generate desired output is below:
desiredOutput = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 2, 3, 3, 3],
                              'action': ['first', 'end', 'first', 'change', 'change',
                                         'last', 'first', 'change', 'end'],
                              'date': [1, 2, 2, 4, 6, 8, 1, 2, 9],
                              'movement': [1, 0, 1, 1, 1, 0, 1, 1, 0],
                              'difference': [0, 1, 0, 2, 2, 2, 0, 1, 7]})
This is a groupby problem. You can use GroupBy.diff, remembering to replace null values with 0 and convert to int:
df['difference'] = df.groupby('ID')['date'].diff().fillna(0).astype(int)
print(df)
# ID action date movement difference
# 0 1 first 1 1 0
# 1 1 end 2 0 1
# 2 2 first 2 1 0
# 3 2 change 4 1 2
# 4 2 change 6 1 2
# 5 2 last 8 0 2
# 6 3 first 1 1 0
# 7 3 change 2 1 1
# 8 3 end 9 0 7
