My stud_alcoh data set is given below
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = stud_alcoh.groupby('legal_drinker').size()
number_of_drinkers
legal_drinker
False 284
True 111
dtype: int64
I have to draw a pie chart from number_of_drinkers, with True as 111 and False as 284. I wrote number_of_drinkers.plot(kind='pie'),
but the y label and the numbers (284 and 111) are not shown as labels on the chart.
This should work:
number_of_drinkers.plot(kind = 'pie', label = 'my label', autopct = '%.2f%%')
The autopct argument gives you a notation of percentage inside the plot, with the desired number of decimals indicated right before the letter "f". So you can change this, for example, to %.1f%% for only one decimal.
I personally don't know of a way to show the raw numbers inside the wedges instead of the percentages, but to the best of my understanding showing proportions is the purpose of a pie chart.
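For what it's worth, matplotlib's autopct also accepts a callable that receives each wedge's percentage, so the raw counts can be recovered from the total. A minimal sketch, assuming the counts from the question; count_label is a hypothetical helper name, not part of pandas or matplotlib:

```python
import pandas as pd

# Counts taken from the groupby output in the question.
number_of_drinkers = pd.Series({False: 284, True: 111}, name="legal_drinker")


def count_label(counts):
    """Hypothetical helper: build an autopct-style formatter showing count and percentage."""
    total = counts.sum()

    def fmt(pct):
        # pct is the wedge's share in percent; convert it back to a raw count.
        count = int(round(pct * total / 100))
        return f"{count} ({pct:.1f}%)"

    return fmt


labeler = count_label(number_of_drinkers)
print(labeler(100 * 284 / 395))  # label text for the False wedge
```

To use it on the chart, pass the formatter instead of a format string: number_of_drinkers.plot.pie(autopct=count_label(number_of_drinkers), label='my label').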
Your question already has a good answer. You could also try this. I'm using the data frame you shared.
import pandas as pd
df = pd.read_clipboard()
df
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = df.groupby('legal_drinker').size() # Series
number_of_drinkers
legal_drinker
False 4
True 1
dtype: int64
number_of_drinkers.plot.pie(label='counts', autopct='%1.1f%%') # Label the wedges with their percentage share
Data Frame :
city Temperature
0 Chandigarh 15
1 Delhi 22
2 Kanpur 20
3 Chennai 26
4 Manali -2
0 Bengalaru 24
1 Coimbatore 35
2 Srirangam 36
3 Pondicherry 39
I need to create another column in the data frame that contains a boolean value for each city, indicating whether it is a union territory or not. Chandigarh, Pondicherry and Delhi are the only 3 union territories here.
I have written the code below:
import numpy as np
conditions = [df3['city'] == 'Chandigarh',df3['city'] == 'Pondicherry',df3['city'] == 'Delhi']
values =[1,1,1]
df3['territory'] = np.select(conditions, values)
Is there an easier or more efficient way to write this?
You can use isin:
union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]
df3["territory"] = df3["city"].isin(union_terrs).astype(int)
This checks each entry in the city column and gives True if it is in union_terrs and False otherwise. The astype(int) call then converts True/False to 1/0,
to get
city Temperature territory
0 Chandigarh 15 1
1 Delhi 22 1
2 Kanpur 20 0
3 Chennai 26 0
4 Manali -2 0
0 Bengalaru 24 0
1 Coimbatore 35 0
2 Srirangam 36 0
3 Pondicherry 39 1
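Since the question asks for a boolean value, the astype(int) step can simply be dropped. A small self-contained sketch, with the data frame rebuilt from the question; the column name is_union_territory is an assumption chosen for illustration:

```python
import pandas as pd

# Data rebuilt from the question (index reset for simplicity).
df3 = pd.DataFrame({
    "city": ["Chandigarh", "Delhi", "Kanpur", "Chennai", "Manali",
             "Bengalaru", "Coimbatore", "Srirangam", "Pondicherry"],
    "Temperature": [15, 22, 20, 26, -2, 24, 35, 36, 39],
})

union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]

# isin already returns a boolean Series, so no conversion is needed.
df3["is_union_territory"] = df3["city"].isin(union_terrs)
print(df3)
```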
My data looks like:
Club Count
0 AC Milan 2
1 Ajax 1
2 FC Barcelona 4
3 Bayern Munich 2
4 Chelsea 1
5 Dortmund 1
6 FC Porto 1
7 Inter Milan 1
8 Juventus 1
9 Liverpool 2
10 Man U 2
11 Real Madrid 7
I'm trying to draw an area plot using Club as the x axis. When plotting all the data, the plot looks correct, but the x axis shows the index and not the clubs.
When I specify Club as the index (index=x), the x axis is correct, but the y axis only runs from 0 to 0.05. I assume that's why nothing is displayed, since the counts range from 1 to 7. Any suggestions?
Code used:
data.columns = ['Club', 'Count']
x=data.Club
y=data.Count
print(data)
ax.margins(0, 10)
data.plot.area()
df = pd.DataFrame(y,index=x)
df.plot.area()
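Passing y itself together with a new index makes pandas realign y to labels it does not have, which leaves only NaN. Pass the raw values instead:
df = pd.Series(y.values, index=x)
df.plot.area()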
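The empty plot in the question comes from index alignment: when a Series is passed together with a new index, pandas looks up the new labels in the old index rather than relabelling the values. A minimal sketch of the effect and of one alternative, set_index, using the data from the question; the variable names misaligned and counts are only illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "Club": ["AC Milan", "Ajax", "FC Barcelona", "Bayern Munich", "Chelsea",
             "Dortmund", "FC Porto", "Inter Milan", "Juventus", "Liverpool",
             "Man U", "Real Madrid"],
    "Count": [2, 1, 4, 2, 1, 1, 1, 1, 1, 2, 2, 7],
})
x = data.Club
y = data.Count

# y is indexed 0..11, so none of the club names are found: every value becomes NaN.
misaligned = pd.DataFrame(y, index=x)
print(misaligned["Count"].isna().all())

# set_index keeps each count attached to its club.
counts = data.set_index("Club")["Count"]
print(counts)
```

With the counts indexed by club, counts.plot.area() should put the club names on the x axis and scale the y axis to the actual values.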
I have a dataframe called passenger_details which is shown below
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male I drive to work car 1 hour
Passenger2 26 Female I take the metro train NaN ...
Passenger3 33 Female NaN NaN 30 mins ...
Passenger4 29 Female I take the metro train NaN ...
...
I want to apply an if function that will turn missing values(NaN values) to 0 and present values to 1, to column headings that have the string 'Commute' in them.
This is basically what I'm trying to achieve
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male 1 1 1
Passenger2 26 Female 1 1 0 ...
Passenger3 33 Female 0 0 1 ...
Passenger4 29 Female 1 1 0 ...
...
However, I'm struggling with how to phrase my code. This is what I have done
passenger_details = passenger_details.filter(regex = 'Location_', axis = 1).apply(lambda value: str(value).replace('value', '1', 'NaN','0'))
But I get a Type Error of
'replace() takes at most 3 arguments (4 given)'
Any help would be appreciated
Select the columns with Index.str.contains, test for non-missing values with DataFrame.notna, and finally cast to integer to map True/False to 1/0:
c = df.columns.str.contains('Commute')
df.loc[:, c] = df.loc[:, c].notna().astype(int)
print (df)
Passenger Age Gender Commute_to_work Commute_mode Commute_time
0 Passenger1 32 Male 1 1 1
1 Passenger2 26 Female 1 1 0
2 Passenger3 33 Female 0 0 1
3 Passenger4 29 Female 1 1 0
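For anyone who wants to run this end to end, here is a self-contained sketch with the visible columns rebuilt from the question (the elided "..." columns are left out). It assigns whole columns rather than going through .loc, on the assumption that this gives the result columns an integer dtype regardless of pandas version:

```python
import numpy as np
import pandas as pd

# Visible part of passenger_details, rebuilt from the question.
df = pd.DataFrame({
    "Passenger": ["Passenger1", "Passenger2", "Passenger3", "Passenger4"],
    "Age": [32, 26, 33, 29],
    "Gender": ["Male", "Female", "Female", "Female"],
    "Commute_to_work": ["I drive to work", "I take the metro", np.nan, "I take the metro"],
    "Commute_mode": ["car", "train", np.nan, "train"],
    "Commute_time": ["1 hour", np.nan, "30 mins", np.nan],
})

# Names of all columns containing 'Commute'.
commute_cols = df.columns[df.columns.str.contains("Commute")]

# notna gives True/False per cell; astype(int) maps that to 1/0.
df[commute_cols] = df[commute_cols].notna().astype(int)
print(df)
```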
I have the following dataframe
student_id gender major admitted
0 35377 female Chemistry False
1 56105 male Physics True
2 31441 female Chemistry False
3 51765 male Physics True
4 53714 female Physics True
5 50693 female Chemistry False
6 25946 male Physics True
7 27648 female Chemistry True
8 55247 male Physics False
9 35838 male Physics True
How would I calculate the admission rate for female physics majors?
I think -
df_f = df[(df['gender']=='female') & (df['major']=='Physics')]
df_f['admitted'].mean()
The first line keeps only the rows for female Physics students. The second takes the mean of the admitted column.
Taking the mean of a boolean column may look unintuitive, but it gives exactly the rate you want. Boolean values are treated as 0 and 1, so summing them and dividing by the count (which is what mean does) yields the proportion of female Physics majors who were admitted. Multiply by 100 if you want a percentage.
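To make the "mean of booleans" point concrete, this sketch rebuilds the data frame from the question and compares mean with an explicit sum divided by count. The helper name admission_rate is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "student_id": [35377, 56105, 31441, 51765, 53714, 50693, 25946, 27648, 55247, 35838],
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True, False, True, True, False, True],
})


def admission_rate(frame, gender, major):
    """Hypothetical helper: share of admitted students in one gender/major group."""
    mask = (frame["gender"] == gender) & (frame["major"] == major)
    return frame.loc[mask, "admitted"].mean()


# Same quantity computed by hand: True counts as 1, False as 0.
male_physics = df.loc[(df["gender"] == "male") & (df["major"] == "Physics"), "admitted"]
by_hand = male_physics.sum() / male_physics.count()

print(admission_rate(df, "female", "Physics"))
print(admission_rate(df, "male", "Physics"), by_hand)
```

In this sample only one female student applied to Physics and she was admitted, so the rate is 1.0; the male Physics group (4 of 5 admitted) shows the averaging more clearly.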
import numpy as np
np.average(dat['admitted'][(dat['gender']=='female') & (dat['major']=='Physics')].values)
Working Principle: (dat['gender']=='female') & (dat['major']=='Physics') creates a boolean pandas Series which can be used to select the correct entries from the dat['admitted'] Series. The .values functionality extracts those entries into a numpy array. At the end we take the average of those entries giving us the admittance ratio.
import numpy as np
import pandas as pd
df = pd.DataFrame({"gender":np.random.choice(["male","female"],[20]),
"admitted":np.random.choice([True,False],[20]),
"major":np.random.choice(["Chemistry","Physics"],[20])})
phy_female_admited = df.loc[(df["major"]=="Physics") & (df["admitted"]==True) & ((df["gender"]=="female"))]
phy_female_applied = df.loc[(df["major"]=="Physics") & ((df["gender"]=="female"))]
acceptance_rate = phy_female_admited.shape[0]/phy_female_applied.shape[0]
This is a slightly more expanded answer, but it works the same way as DZurico's.
Ignore the lines where I create a DataFrame and use your own data instead.
Solution for all admission rates with groupby and GroupBy.size, and GroupBy.transform with sum:
a = df.groupby(['gender' ,'admitted', 'major']).size()
print (a)
gender admitted major
female False Chemistry 3
True Chemistry 1
Physics 1
male False Physics 1
True Physics 4
dtype: int64
b = a.groupby(['gender' ,'major']).transform('sum')
print (b)
gender admitted major
female False Chemistry 4
True Chemistry 4
Physics 1
male False Physics 5
True Physics 5
dtype: int64
c = a.div(b)
print (c)
gender admitted major
female False Chemistry 0.75
True Chemistry 0.25
Physics 1.00
male False Physics 0.20
True Physics 0.80
dtype: float64
Select the row of c you need by passing a tuple:
print (c.loc[('female',True,'Physics')])
1.0
If you want all values in a DataFrame:
d = a.div(b).reset_index(name='rates')
print (d)
gender admitted major rates
0 female False Chemistry 0.75
1 female True Chemistry 0.25
2 female True Physics 1.00
3 male False Physics 0.20
4 male True Physics 0.80
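If only the admission rate per group is needed (not the rejection rate as well), a shorter route is to group by gender and major and take the mean of the boolean column directly. A sketch under that assumption, with the data rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True, False, True, True, False, True],
})

# The mean of a boolean column within each group is the share of True values.
rates = df.groupby(["gender", "major"])["admitted"].mean()
print(rates)

# Pick one group by its (gender, major) tuple.
print(rates.loc[("female", "Physics")])
```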
My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each of the facilities appear in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new set of data containing those rows where its Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note the list of facility counts is MUCH LARGER as I have cut out most of the information as it did not contribute to the question at hand (so simply removing occurrences of "Special Event", " Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT=df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT>50])]
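A third variant, sketched here on toy data rather than the inspection file, uses transform('size') to attach each row's group size and then filters with a plain boolean mask. The column names type and value and the threshold of 1 are just for the example; with the real data you would use 'Facility Type' and 50:

```python
import numpy as np
import pandas as pd

# Toy data: 'b' and 'c' occur once, the other types occur more than once.
df = pd.DataFrame({"type": list("aabcddddee"), "value": np.arange(10)})
threshold = 1

# Size of the group each row belongs to, aligned with the original rows.
group_size = df.groupby("type")["type"].transform("size")
kept = df[group_size > threshold]

# For comparison: the value_counts + isin approach.
counts = df["type"].value_counts()
kept_isin = df[df["type"].isin(counts.index[counts > threshold])]

print(kept)
```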