pandas dataframe filter calculation - python

I have the following dataframe
student_id gender major admitted
0 35377 female Chemistry False
1 56105 male Physics True
2 31441 female Chemistry False
3 51765 male Physics True
4 53714 female Physics True
5 50693 female Chemistry False
6 25946 male Physics True
7 27648 female Chemistry True
8 55247 male Physics False
9 35838 male Physics True
How would I calculate the admission rate for female physics majors?

I think -
df_f = df[(df['gender']=='female') & (df['major']=='Physics')]
df_f['admitted'].mean()
The first line keeps only the rows for female Physics majors. The second takes the mean of the admitted column.
Taking the mean may look unintuitive, but it gives the admission rate directly. Python treats boolean values as 0 and 1, so summing them and dividing by the count (which is what mean does) yields the fraction of female Physics majors who were admitted. Multiply by 100 if you want a percentage.
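To make the "mean of booleans" point concrete, here is a minimal sketch that rebuilds the sample table from the question (student_id is left out, which is my simplification) and compares the mean with an explicit sum divided by count. The names mask, admitted, rate and same_rate are only illustrative.

```python
import pandas as pd

# Sample data copied from the question (student_id omitted for brevity).
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True,
                 False, True, True, False, True],
})

# Boolean mask that is True only for female Physics majors.
mask = (df["gender"] == "female") & (df["major"] == "Physics")
admitted = df.loc[mask, "admitted"]

# True counts as 1 and False as 0, so mean == sum / count.
rate = admitted.mean()
same_rate = admitted.sum() / admitted.count()
print(rate, same_rate)
```

With this sample only one row matches the filter, so the rate is based on a single student. If no rows match at all, the mean comes back as NaN, so it may be worth checking mask.sum() first.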

import numpy as np
np.average(dat['admitted'][(dat['gender']=='female') & (dat['major']=='Physics')].values)
Working principle: (dat['gender']=='female') & (dat['major']=='Physics') creates a boolean pandas Series, which is used to select the matching entries from the dat['admitted'] Series. The .values attribute extracts those entries into a NumPy array. Finally, we take the average of those entries, which gives the admission rate.

import numpy as np
import pandas as pd
df = pd.DataFrame({"gender":np.random.choice(["male","female"],[20]),
"admitted":np.random.choice([True,False],[20]),
"major":np.random.choice(["Chemistry","Physics"],[20])})
phy_female_admited = df.loc[(df["major"]=="Physics") & (df["admitted"]==True) & ((df["gender"]=="female"))]
phy_female_applied = df.loc[(df["major"]=="Physics") & ((df["gender"]=="female"))]
acceptance_rate = phy_female_admited.shape[0]/phy_female_applied.shape[0]
A slightly more expanded answer, but it works in basically the same way as DZurico's.
Ignore the lines where I create a dataframe and use your own data instead.

Solution for all admission rates with groupby and GroupBy.size, and GroupBy.transform with sum:
a = df.groupby(['gender' ,'admitted', 'major']).size()
print (a)
gender  admitted  major
female  False     Chemistry    3
        True      Chemistry    1
                  Physics      1
male    False     Physics      1
        True      Physics      4
dtype: int64
b = a.groupby(['gender' ,'major']).transform('sum')
print (b)
gender  admitted  major
female  False     Chemistry    4
        True      Chemistry    4
                  Physics      1
male    False     Physics      5
        True      Physics      5
dtype: int64
c = a.div(b)
print (c)
gender  admitted  major
female  False     Chemistry    0.75
        True      Chemistry    0.25
                  Physics      1.00
male    False     Physics      0.20
        True      Physics      0.80
dtype: float64
Select the row of c you need by passing a tuple:
print (c.loc[('female',True,'Physics')])
1.0
If you want all values in a DataFrame:
d = a.div(b).reset_index(name='rates')
print (d)
gender admitted major rates
0 female False Chemistry 0.75
1 female True Chemistry 0.25
2 female True Physics 1.00
3 male False Physics 0.20
4 male True Physics 0.80
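If only the admission rate per (gender, major) pair is needed, a shorter route is to group by those two columns and take the mean of the boolean column. This is a sketch of an alternative, not the method used in the answer above. It rebuilds the question's sample data without student_id, and rates is just an illustrative name.

```python
import pandas as pd

# Sample data copied from the question (student_id omitted for brevity).
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True,
                 False, True, True, False, True],
})

# Mean of a boolean column per group = share of True values per group.
rates = df.groupby(["gender", "major"])["admitted"].mean()
print(rates)

# Look up one group by its (gender, major) tuple.
print(rates.loc[("female", "Physics")])
```

This gives one row per group that actually occurs in the data, so combinations with no students (male Chemistry in this sample) simply do not appear.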

Related

How pandas calculates correlation between categorical variables and continuous variables?

Suppose I have a dataframe something like below:
age sex bmi children smoker region charges
19 female 27.900 0 yes southwest 16884.92400
18 male 33.770 1 no southeast 1725.55230
28 male 33.000 3 no southeast 4449.46200
33 male 22.705 0 no northwest 21984.47061
32 male 28.880 0 no northwest 3866.85520
I want to calculate the correlation between sex and smoker, both of which are categorical variables. I tried calculating it using df.corr(), and it came out as 0.076185.
I also tried Cramér's V using:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # bias-corrected Cramér's V for two categorical Series
    confusion_matrix = pd.crosstab(x,y)
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))
cramers_v(df["sex"], df["smoker"])
0.06914461040709625
It is not very clear from the source code how it calculates the correlation between all the possible combinations of categorical and continuous variables.
You would need to change the strings to integers.
So for example:
male=1, female=0 and smoker=1 or smoker=0 (for yes or no).
Here is an example with just the two relevant columns:
import pandas as pd
d = {'sex':['male','female','female','female','female'], 'smoke':[1,0,0,0,0], 'hello':[1,2,3,4,5]}
df = pd.DataFrame(d)
# example of how to convert sex from string to numeric
df['sex'] = df['sex'].apply(lambda r: 1 if r=='male' else 0)
c = df[['sex','smoke']].corr()
print(c)
The output:
sex smoke
sex 1.0 1.0
smoke 1.0 1.0
In this simple example the two columns are 100% correlated (because of the data).
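As a further sketch of the same idea, the snippet below encodes the two categorical columns from the question's five sample rows as 0/1 and computes their Pearson correlation. The 0/1 coding is my assumption. Swapping which category gets 1 only flips the sign. For two binary columns, this Pearson value is the phi coefficient.

```python
import pandas as pd

# The five sample rows from the question, reduced to the two columns of interest.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
})

# Hypothetical 0/1 coding; the choice of which label maps to 1 is arbitrary.
sex_code = df["sex"].map({"female": 0, "male": 1})
smoker_code = df["smoker"].map({"no": 0, "yes": 1})

# Series.corr uses Pearson correlation by default.
r = sex_code.corr(smoker_code)
print(r)
```

In these five rows the only smoker is also the only female, so the two encoded columns are perfectly anti-correlated. That is a property of this tiny sample, not of the full dataset.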

Sum based on grouping in pandas dataframe?

I have a pandas dataframe df which contains:
major men women rank
Art 5 4 1
Art 3 5 3
Art 2 4 2
Engineer 7 8 3
Engineer 7 4 4
Business 5 5 4
Business 3 4 2
Basically, I need to find the total number of students per major, counting men and women together and ignoring the rank column. For Art, for example, the total of all men + women should be 23; for Engineer, 26; and for Business, 17.
I have tried
df.groupby(['major_category']).sum()
But this separately sums the men and women rather than combining their totals.
Just add both columns and then groupby:
(df.men+df.women).groupby(df.major).sum()
major
Art 23
Business 17
Engineer 26
dtype: int64
melt() then groupby():
df.drop(columns='rank').melt('major').groupby('major',as_index=False)['value'].sum()
major value
0 Art 23
1 Business 17
2 Engineer 26
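For reference, here is a self-contained sketch that rebuilds the table from the question and checks the expected totals. The total column is a hypothetical helper that does not exist in the original data.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "major": ["Art", "Art", "Art", "Engineer", "Engineer", "Business", "Business"],
    "men": [5, 3, 2, 7, 7, 5, 3],
    "women": [4, 5, 4, 8, 4, 5, 4],
    "rank": [1, 3, 2, 3, 4, 4, 2],
})

# Add men and women row by row, then sum that helper column per major.
totals = (
    df.assign(total=df["men"] + df["women"])
      .groupby("major")["total"]
      .sum()
)
print(totals)
```

Selecting the single total column before calling sum keeps rank out of the result without having to drop it first.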

Plotting boolean frequency against qualitative data in pandas

I'll start off by saying that I'm not really talented in statistical analysis. I have a dataset stored in a .csv file that I'm looking to represent graphically. What I'm trying to represent is the frequency of survival (represented for each person as a 0 or 1 in the Survived column) for each unique entry in the other columns.
For example: one of the other columns, Class, holds one of three possible values (1, 2, or 3). I want to graph the probability that someone from Class 1 survives versus Class 2 versus Class 3, so that I can visually determine whether or not class is correlated to survival rate.
I've attached the snippet of code that I've developed so far, but I'd understand if everything I'm doing is wrong because I've never used pandas before.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

print(list(df)[2:]) # slicing first 2 values of "ID" and "Survived"

for column in list(df)[2:]:
    try:
        df.plot(x='Survived',y=column,kind='hist')
    except TypeError:
        print("Column {} not usable.".format(column))

plt.show()
EDIT: I've attached a small segment of the dataframe below
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ... PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ... 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q
I think you want this:
df.groupby('Pclass')['Survived'].mean()
This separates the dataframe into three groups based on the three unique values of Pclass. It then takes the mean of Survived, which equals the number of 1 values divided by the total number of values. This would produce a Series looking something like this:
Pclass
1 0.558824
2 0.636364
3 0.696970
It is then trivial from there to plot a bar graph with .plot.bar() if you wish.
Adding to the answer, here is a simple bar graph.
result = df.groupby('Pclass')['Survived'].mean()
result.plot(kind='bar', rot=1, ylim=(0, 1))
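A small sketch of the same groupby, using only the six rows shown in the question, may help when checking the numbers by hand. Reporting the group size next to the mean is my addition, so that rates based on very few passengers are easy to spot.

```python
import pandas as pd

# Pclass and Survived for the six sample rows shown in the question.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# mean = survival rate per class, size = number of passengers per class.
rates = df.groupby("Pclass")["Survived"].agg(["mean", "size"])
print(rates)
```

With matplotlib installed, rates["mean"].plot.bar() should draw the corresponding bar chart. Class 2 is missing here only because none of the six sample rows belongs to it.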

How to label the Pie chart?

My stud_alcoh data set is given below
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = stud_alcoh.groupby('legal_drinker').size()
number_of_drinkers
legal_drinker
False 284
True 111
dtype: int64
I have to draw a pie chart from number_of_drinkers, with True as 111 and False as 284. I wrote number_of_drinkers.plot(kind='pie'),
but neither the Y label nor the numbers (284 and 111) are shown.
This should work:
number_of_drinkers.plot(kind = 'pie', label = 'my label', autopct = '%.2f%%')
The autopct argument gives you a notation of percentage inside the plot, with the desired number of decimals indicated right before the letter "f". So you can change this, for example, to %.1f%% for only one decimal.
I personally don't know of a way to show the raw numbers inside the wedges instead of the percentages, but to the best of my understanding showing proportions is the purpose of a pie chart.
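One possible way to get the raw counts into the wedges relies on the fact that matplotlib's autopct argument also accepts a callable that receives each wedge's percentage. The sketch below converts that percentage back into a count. make_autopct is a hypothetical helper name, and the Series is rebuilt from the counts given in the question.

```python
import pandas as pd

# Counts taken from the question's groupby output.
number_of_drinkers = pd.Series(
    [284, 111],
    index=pd.Index([False, True], name="legal_drinker"),
)

def make_autopct(counts):
    """Return a formatter that turns a wedge percentage back into a count."""
    total = counts.sum()

    def autopct(pct):
        # pct is the wedge's share in percent; scale it back to a count.
        count = int(round(pct * total / 100.0))
        return "{} ({:.1f}%)".format(count, pct)

    return autopct

# Preview the labels without drawing anything.
fmt = make_autopct(number_of_drinkers)
labels = [fmt(100.0 * c / number_of_drinkers.sum()) for c in number_of_drinkers]
print(labels)
```

To use it in the chart, pass the formatter instead of a format string, for example number_of_drinkers.plot.pie(autopct=make_autopct(number_of_drinkers)). The count is recovered by rounding, so it is exact only as long as the percentages are not rounded before they reach the formatter.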
Your question already has a good answer. You could also try this. I'm using the data frame you shared.
import pandas as pd
df = pd.read_clipboard()
df
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = df.groupby('legal_drinker').size() # Series
number_of_drinkers
legal_drinker
False 4
True 1
dtype: int64
number_of_drinkers.plot.pie(label='counts', autopct='%1.1f%%') # Label the wedges with their percentage share

Use of iterrows() and arithmetic in Pandas

Indexing into an array in C is pretty easy and the brackets handle arithmetic nicely, thus allowing for the comparison of adjacent values. That's what I'd like to do with iterrows() in Pandas, but I can't find a suitable example that shows how to do so. Consider the following:
Year Name Winner Count
432 1936 Alice 0.0 2
538 1937 Alice 1.0 2
6391 1985 Bob 1.0 2
6818 1989 Brad 0.0 2
Alice did not win a prize in 1936, but she did win one in 1937. I need to iterate over all of the rows and 1) check whether the Year in row n immediately follows the Year in row n - 1, and 2) if so, check whether the subject won in the second year and not the first. Alice fits the bill, and I'd like to loop through the frame printing out her name and everyone else who meets the criteria.
I had started with . . .
for index, row in df.iterrows():
if df['Year'] > df[df.Year - 1]:
And got, among other things, that the data type I had explicitly cast as an int (i.e., Year), is now being returned as a string. Is there a way to do this, or should I explore a different method?
Here's some augmented sample data, to account for edge cases:
Year Name Winner Count
432 1936 Alice 0.0 2
538 1937 Alice 1.0 2
6390 1985 Bob 1.0 2
6817 1989 Brad 0.0 2
433 1997 Alice 0.0 2
539 1993 Alice 1.0 2
6391 1986 Bob 1.0 2
6818 1990 Brad 0.0 2
6819 1991 Brad 0.0 2
This approach sorts rows by Name and Year, then establishes whether a given year meets the criteria for inclusion (i.e., consecutive with the year before, and a win).
Then a simple groupby() finds the subjects who qualify.
import pandas as pd
df = pd.read_clipboard()
df.sort_values(['Name','Year'], inplace=True)
# eligible = consecutive year and won in that year
df['eligible'] = (df.Year.subtract(df.Year.shift()) == 1.) & (df.Winner)
# identify any person with at least one eligible year
print(df.groupby('Name').eligible.any())
Output:
Name
Alice True
Bob True
Brad False
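The answer above counts any win in a consecutive year, which is why Bob qualifies. If the stricter reading of the question is wanted (a win that directly follows a loss in the previous calendar year, for the same person), a sketch along the following lines might work. The column name won_after_loss and the per-name diff and shift are my additions, not part of the original answer.

```python
import pandas as pd

# Augmented sample data from the answer (original index labels omitted).
df = pd.DataFrame({
    "Year": [1936, 1937, 1985, 1989, 1997, 1993, 1986, 1990, 1991],
    "Name": ["Alice", "Alice", "Bob", "Brad", "Alice", "Alice", "Bob", "Brad", "Brad"],
    "Winner": [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})
df = df.sort_values(["Name", "Year"])

by_name = df.groupby("Name")
# Compare each row only with the previous row of the same person.
consecutive = by_name["Year"].diff() == 1
lost_before = by_name["Winner"].shift() == 0

# Hypothetical column: won this year, lost the year immediately before.
df["won_after_loss"] = consecutive & lost_before & (df["Winner"] == 1)

result = df.groupby("Name")["won_after_loss"].any()
print(result)
```

Under this stricter rule only Alice qualifies: Bob won in both 1985 and 1986, and Brad never won. Computing diff and shift per name also keeps the first row of one person from being compared with the last row of the previous person after sorting.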
