Find the maximum value in a column with respect to another column - python

I have the data frame below.
Input:
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25
1 Molly Jacobson 52 24 94
2 Tina Ali 36 31 57
3 Jake Milner 24 2 62
4 Amy Cooze 73 3 70
I want the output to be:
Amy 73
So basically I want to find the highest value in the age column, along with the name of the person who has that age.
I tried pandas with a group by, as below:
df2=df.groupby(['first_name'])['age'].max()
But with this I am getting the output below:
first_name
Amy 73
Jake 24
Jason 42
Molly 52
Tina 36
Name: age, dtype: int64
whereas I only want
Amy 73
How should I go about it in pandas?

You can get your result with the code below
df.loc[df.age.idxmax(),['first_name','age']]
Here, with df.age.idxmax() we are getting the index of the row which has the maximum age value.
Then with df.loc[df.age.idxmax(),['first_name','age']] we are getting the columns 'first_name' & 'age' at that index.
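To try this answer end to end, here is a minimal sketch that rebuilds the data frame from the question in memory. The variable names oldest_label and oldest are only for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "age": [42, 52, 36, 24, 73],
    "preTestScore": [4, 24, 31, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70],
})

# idxmax() returns the index label of the first row holding the maximum age.
oldest_label = df["age"].idxmax()

# .loc with a single label and a list of columns returns a Series.
oldest = df.loc[oldest_label, ["first_name", "age"]]
print(oldest["first_name"], oldest["age"])
```

Note that the result is a Series (one row), not a data frame. Passing the label as a list, df.loc[[oldest_label], ['first_name', 'age']], should give a one-row data frame instead.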

This line of code should do the work
df[df['age']==df['age'].max()][['first_name','age']]
The [['first_name','age']] part lists the columns you want in the output; change it as needed.
In this case the output will be:
first_name age
4 Amy 73
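One practical difference between the boolean-mask approach and idxmax() shows up when several rows share the maximum. The sketch below uses made-up data with a tie (the names and ages are hypothetical, not from the question) to illustrate it.

```python
import pandas as pd

# Hypothetical data with a tie: Amy and Bob share the highest age.
df = pd.DataFrame({
    "first_name": ["Jason", "Amy", "Bob"],
    "age": [42, 73, 73],
})

# The boolean mask keeps every row whose age equals the maximum.
all_oldest = df.loc[df["age"] == df["age"].max(), ["first_name", "age"]]

# idxmax() only reports the first occurrence of the maximum.
first_oldest = df.loc[df["age"].idxmax(), "first_name"]

print(all_oldest)
print(first_oldest)
```

So if ties matter, the mask is probably the safer choice; if you only ever want one row, idxmax() is enough.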

Related

How to turn column header into pandas index

So my pandas df currently looks something like this:
Detail  Person 1  Person 2  Person 3
Name    Steve     Larry     Dave
Age     45        56        67
Hobbie  Running   Skating   Painting
But I want to reshape it to this:
Person    Name   Age  Hobbie
Person 1  Steve  45   Running
Person 2  Larry  56   Skating
Person 3  Dave   67   Painting
Anyone know a way of doing this?
Use:
out = (df.set_index('Detail').T
.rename_axis('Person')
.reset_index()
.rename_axis(columns=None)
)
Output:
Person Name Age Hobbie
0 Person 1 Steve 45 Running
1 Person 2 Larry 56 Skating
2 Person 3 Dave 67 Painting
All you need to do is transpose the dataframe with df.T and rename the column using df.rename(). The catch is that df.T also transposes the index, so the default integer index would end up as the column labels. To work around that, set the column you want as the index before transposing.
Here is the step by step code:
data.csv:
Detail,Person 1,Person 2,Person 3
Name,Steve,Larry,Dave
Age,45,56,67
Hobbie,Running,Skating,Painting
Reading from file:
import pandas as pd
df = pd.read_csv("data.csv")
print(df)
output:
Detail Person 1 Person 2 Person 3
0 Name Steve Larry Dave
1 Age 45 56 67
2 Hobbie Running Skating Painting
Changing the column name:
df = df.rename(columns={"Detail":"Person"})
print(df)
output:
Person Person 1 Person 2 Person 3
0 Name Steve Larry Dave
1 Age 45 56 67
2 Hobbie Running Skating Painting
Transposing with new index column:
df = df.set_index('Person').T
print(df)
output:
Person Name Age Hobbie
Person 1 Steve 45 Running
Person 2 Larry 56 Skating
Person 3 Dave 67 Painting
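If you want to check the reshaping without creating data.csv, the following sketch builds the same frame in memory and applies the set_index / transpose chain from the first answer. The final pd.to_numeric step is an addition of mine, not part of either answer: after a transpose of mixed data the columns come out as object dtype, so Age may need converting before you compute on it.

```python
import pandas as pd

# Same content as data.csv, built in memory.
df = pd.DataFrame({
    "Detail": ["Name", "Age", "Hobbie"],
    "Person 1": ["Steve", 45, "Running"],
    "Person 2": ["Larry", 56, "Skating"],
    "Person 3": ["Dave", 67, "Painting"],
})

out = (
    df.set_index("Detail")        # 'Detail' values will become the column labels
      .T                          # transpose: persons become rows
      .rename_axis("Person")      # name the new index
      .reset_index()              # turn that index into a regular column
      .rename_axis(columns=None)  # drop the leftover 'Detail' axis name
)

# Assumption: Age should be numeric; the transpose left it as object dtype.
out["Age"] = pd.to_numeric(out["Age"])
print(out)
```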

Choose higher value based off column value between two dataframes

I have a question about choosing values based on two data frames.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is every row of df whose 'age' is higher than the 'age' in df2 for the same 'name'.
expected result
age name
12 55 Anna
Per comments, use merge and filter dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name','age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
Try this, assuming age holds the value you want to compare against:
index = df[df['age'] > age].index
df.loc[index]
You don't mention how you would like to resolve a few edge cases, but generally what you want to do is iterate down the df, compare ages row by row, and keep the larger. You could do so in the following manner:
rows = []
# Compare by position (iloc), since df and df2 do not share index labels.
for x in range(min(len(df), len(df2))):
    if df['age'].iloc[x] > df2['age'].iloc[x]:
        rows.append(df[['age', 'name']].iloc[x])
    else:
        rows.append(df2[['age', 'name']].iloc[x])
df3 = pd.DataFrame(rows, columns=['age', 'name']).reset_index(drop=True)
You will still need to modify this to reflect how you want to resolve names that appear in only one frame, or frames of different sizes (this version stops at the shorter one).
One solution that comes to mind is merge and filter:
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
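The merge-and-filter answers above return the row under a new index, while the expected result in the question shows the original label 12. Here is a sketch of the same idea that carries the original index through the merge. The suffix '_ref' and the variable names are my own choices, and it assumes each name appears only once in df2.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "age": [44, 22, 33, 44, 55, 66, 11, 43, 12, 19, 11, 32, 55, 33, 32],
    "name": ["Anna", "Bob", "Cindy", "Danis", "Cindy", "Danis", "Anna", "Bob",
             "Cindy", "Danis", "Anna", "Anna", "Anna", "Anna", "Anna"],
})
df2 = pd.DataFrame(
    {"age": [66, 55, 44, 43], "name": ["Danis", "Cindy", "Anna", "Bob"]},
    index=[5, 4, 0, 7],
)

# reset_index() stores df's labels in a column named 'index' so they survive the merge.
merged = df.reset_index().merge(df2, on="name", suffixes=("", "_ref"))

# Keep rows of df whose age is strictly higher than the age in df2 for the same name.
result = (
    merged.loc[merged["age"] > merged["age_ref"], ["index", "age", "name"]]
          .set_index("index")
          .rename_axis(None)
)
print(result)
```

Only Anna's row with age 55 beats her reference age of 44; the other names merely tie with df2, so they are filtered out by the strict comparison.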

Pivoting count of column value using python pandas

I have student data with id's and some values and I need to pivot the table for count of ID.
Here's an example of data:
id name maths science
0 B001 john 50 60
1 B021 Kenny 89 77
2 B041 Jessi 100 89
3 B121 Annie 91 73
4 B456 Mark 45 33
pivot table:
count of ID
5
There are lots of ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
import pandas as pd
data = {'id' : ['0','1','2','3','4'],
        'name' : ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
        'math' : [50,89,100,91,45],
        'science' : [60,77,89,73,33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
Then use either of the following:
df.shape[0], which gives you the number of rows in the data frame (shape is an attribute holding (rows, columns), so it is not called with parentheses),
or
in:df['id'].nunique()
out:5
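For completeness, a small sketch using the ids from the question itself. The one-cell summary frame at the end is just my guess at what the "pivot table" in the question is meant to look like.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "id": ["B001", "B021", "B041", "B121", "B456"],
    "name": ["john", "Kenny", "Jessi", "Annie", "Mark"],
    "maths": [50, 89, 100, 91, 45],
    "science": [60, 77, 89, 73, 33],
})

row_count = df.shape[0]          # number of rows, same as len(df)
unique_ids = df["id"].nunique()  # number of distinct ids, ignoring NaN

# Assumed target layout: a single cell labelled 'count of ID'.
summary = pd.DataFrame({"count of ID": [unique_ids]})
print(summary)
```

The two numbers only agree when every id is unique and non-null, so pick shape if you want rows and nunique() if you want distinct ids.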

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do i compute the mean hours of exercise of female students with a 'status' of passing...?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Anyone please help me in solving this.
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
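Here is a runnable sketch of this answer using only the five rows shown in the question, cut down to the three relevant columns. The boolean-mask variant at the end is an alternative I am adding for comparison; it is not part of the answer above.

```python
import pandas as pd

# First five rows from the question, reduced to the relevant columns.
df = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "female"],
    "exercise": [3, 4, 5, 2, 4],
    "status": ["Pass", "Pass", "Pass", "Pass", "Pass"],
})

# Group by a list of labels; the result is a Series with a two-level index.
res = df.groupby(["gender", "status"])["exercise"].mean()

# Look up one group by its (gender, status) tuple.
female_pass = res.get(("female", "Pass"))

# Alternative: filter first, then take the mean, if only one group is needed.
mask = (df["gender"] == "female") & (df["status"] == "Pass")
direct = df.loc[mask, "exercise"].mean()

print(female_pass, direct)
```

The groupby version is handy when you want every gender/status combination at once; the mask is shorter if you only care about one.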

Pandas to have one row per email

Say I have the following table:
Name Age occupation BillingContactEmail
Peter 44 Salesman a#a.com
Andy 43 Manager a#a.com
Halla 33 Fisherman b#b.com
How can I get pandas to produce the following,
Name Age occupation BillingContactEmail
Peter 44 Salesman a#a.com
Halla 33 Fisherman b#b.com
where each email appears only once (meaning we end up with distinct emails)?
Use drop_duplicates:
df.drop_duplicates(subset=['BillingContactEmail'])
Name Age occupation BillingContactEmail
0 Peter 44 Salesman a#a.com
2 Halla 33 Fisherman b#b.com
Addressing @DSM's comment
You should be more specific about what criterion you want to use to decide which one to keep. The first seen with that email? The oldest? Etc.
By default, drop_duplicates keeps the first instance found. This is equivalent to
df.drop_duplicates(subset=['BillingContactEmail'], keep='first')
However, you could also specify to keep the last instance via keep='last'
df.drop_duplicates(subset=['BillingContactEmail'], keep='last')
Name Age occupation BillingContactEmail
1 Andy 43 Manager a#a.com
2 Halla 33 Fisherman b#b.com
Or, drop all duplicates
df.drop_duplicates(subset=['BillingContactEmail'], keep=False)
Name Age occupation BillingContactEmail
2 Halla 33 Fisherman b#b.com
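If the criterion is something other than row order, for example "the oldest" as the comment suggests, one possible sketch is to sort first and then drop duplicates. The variable names are mine, and the email strings are copied exactly as they appear in the question.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Name": ["Peter", "Andy", "Halla"],
    "Age": [44, 43, 33],
    "occupation": ["Salesman", "Manager", "Fisherman"],
    "BillingContactEmail": ["a#a.com", "a#a.com", "b#b.com"],
})

# Oldest person per email: sort so the oldest comes first, keep the first per email.
oldest = (
    df.sort_values("Age", ascending=False)
      .drop_duplicates(subset=["BillingContactEmail"])
      .sort_index()  # restore the original row order
)

# Youngest person per email: same idea with ascending order.
youngest = (
    df.sort_values("Age")
      .drop_duplicates(subset=["BillingContactEmail"])
      .sort_index()
)

print(oldest)
print(youngest)
```

With this data the oldest per email are Peter and Halla, and the youngest are Andy and Halla.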
