Plotting boolean frequency against qualitative data in pandas - python

I'll start off by saying that I'm not really talented in statistical analysis. I have a dataset stored in a .csv file that I'm looking to represent graphically. What I'm trying to represent is the frequency of survival (represented for each person as a 0 or 1 in the Survived column) for each unique entry in the other columns.
For example: one of the other columns, Class, holds one of three possible values (1, 2, or 3). I want to graph the probability that someone from Class 1 survives versus Class 2 versus Class 3, so that I can visually determine whether or not class is correlated to survival rate.
I've attached the snippet of code that I've developed so far, but I'd understand if everything I'm doing is wrong because I've never used pandas before.
1 import pandas as pd
2 import matplotlib.pyplot as plt
3
4 df = pd.read_csv('train.csv')
5
6 print(list(df)[2:]) # slicing first 2 values of "ID" and "Survived"
7
8 for column in list(df)[2:]:
9 try:
10 df.plot(x='Survived',y=column,kind='hist')
11 except TypeError:
12 print("Column {} not usable.".format(column))
13
14 plt.show()
EDIT: I've attached a small segment of the dataframe below
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ... PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ... 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q

I think you want this:
df.groupby('Pclass')['Survived'].mean()
This separates the dataframe into three groups based on the three unique values of Pclass. It then takes the mean of Survived, which is equal to the number of 1 values divided by the number of values total. This would produce a dataframe looking something like this:
Pclass
1 0.558824
2 0.636364
3 0.696970
It is then trivial from there to plot a bar graph with .plot.bar() if you wish.

Adding to the answer, here is a simple bar graph.
result = df.groupby('Pclass')['Survived'].mean()
result.plot(kind='bar', rot=1, ylim=(0, 1))

Related

Get unique values from pandas series of lists

I have a column in DataFrame containing list of categories. For example:
0 [Pizza]
1 [Mexican, Bars, Nightlife]
2 [American, New, Barbeque]
3 [Thai]
4 [Desserts, Asian, Fusion, Mexican, Hawaiian, F...
6 [Thai, Barbeque]
7 [Asian, Fusion, Korean, Mexican]
8 [Barbeque, Bars, Pubs, American, Traditional, ...
9 [Diners, Burgers, Breakfast, Brunch]
11 [Pakistani, Halal, Indian]
I am attempting to do two things:
1) Get unique categories - My approach is have a empty set, iterate through series and append each list.
my code:
unique_categories = {'Pizza'}
for lst in restaurant_review_df['categories_arr']:
unique_categories = unique_categories | set(lst)
This give me a set of unique categories contained in all the lists in the column.
2) Generate pie plot of category counts and each restaurant can belong to multiple categories. For example: restaurant 11 belongs to Pakistani, Indian and Halal categories. My approach is again iterate through categories and one more iteration through series to get counts.
Are there simpler or elegant ways of doing this?
Thanks in advance.
Update using pandas 0.25.0+ with explode
df['category'].explode().value_counts()
Output:
Barbeque 3
Mexican 3
Fusion 2
Thai 2
American 2
Bars 2
Asian 2
Hawaiian 1
New 1
Brunch 1
Pizza 1
Traditional 1
Pubs 1
Korean 1
Pakistani 1
Burgers 1
Diners 1
Indian 1
Desserts 1
Halal 1
Nightlife 1
Breakfast 1
Name: Places, dtype: int64
And with plotting:
df['category'].explode().value_counts().plot.pie(figsize=(8,8))
Output:
For older verions of pandas before 0.25.0
Try:
df['category'].apply(pd.Series).stack().value_counts()
Output:
Mexican 3
Barbeque 3
Thai 2
Fusion 2
American 2
Bars 2
Asian 2
Pubs 1
Burgers 1
Traditional 1
Brunch 1
Indian 1
Korean 1
Halal 1
Pakistani 1
Hawaiian 1
Diners 1
Pizza 1
Nightlife 1
New 1
Desserts 1
Breakfast 1
dtype: int64
With plotting:
df['category'].apply(pd.Series).stack().value_counts().plot.pie()
Output:
Per #coldspeed's comments
from itertools import chain
from collections import Counter
pd.DataFrame.from_dict(Counter(chain(*df['category'])), orient='index').sort_values(0, ascending=False)
Output:
Barbeque 3
Mexican 3
Bars 2
American 2
Thai 2
Asian 2
Fusion 2
Pizza 1
Diners 1
Halal 1
Pakistani 1
Brunch 1
Breakfast 1
Burgers 1
Hawaiian 1
Traditional 1
Pubs 1
Korean 1
Desserts 1
New 1
Nightlife 1
Indian 1

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do i compute the mean hours of exercise of female students with a 'status' of passing...?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Anyone please help me in solving this.
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))

Chained conditional count in Pandas

I have a dataframe that looks at how a form has been filled out. Here's an example:
ID Name Postcode Street Employer Salary
1 John NaN Craven Road NaN NaN
2 Sue TD2 NAN NaN 15000
3 Jimmy MW6 Blake Street Bank 40000
4 Laura QE2 Mill Lane NaN 20000
5 Sam NW2 Duke Avenue Farms 35000
6 Jordan SE6 NaN NaN NaN
7 NaN CB2 NaN Startup NaN `
I want to return a count of successively filled out columns on the condition that all previous columns have been filled. The final output should look something like:
Name Postcode Street Employer salary
6 5 3 2 2
Is there a good Pandas way of doing this? I suppose there could be a way of applying a mask so that if any previous boolean is given as zero the current column is also zero and then counting that but I'm not sure if that is the best way.
Thanks!
I think you can use notnull and cummin:
In [99]: df.notnull().cummin(axis=1).sum(axis=0)
Out[99]:
Name 6
Postcode 5
Street 3
Employer 2
Salary 2
dtype: int64
Although note that I had to replace your NAN (Sue's street) with a float NaN before I did that, and I assumed that ID was your index.
The cumulative minimum is one way to implement "applying a mask so that if any previous boolean is given as zero the current column is also zero", as you predicted would work.
Maybe cumprod BTW you have 'NAN' in your df, I try then as notnull here
df.notnull().cumprod(1).sum()
Out[59]:
ID 7
Name 6
Postcode 5
Street 4
Employer 2
Salary 2
dtype: int64

Pandas fillna with DataFrame of values

Accordingly to the docs, the fillna value parameter can be one among the following:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
I have a data frame that looks like:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
And that is what I want to do:
NaN Cabin will be filled according to the median value given the Pclass feature value
NaN Age will be filled according to its median value across the data set
NaN Embarked will be filled according to the median value given the Pclass feature value
So after some data manipulation, I got this data frame:
Pclass Cabin Embarked Ticket
0 1 C S 50
1 2 F S 13
2 3 G S 5
What it says is that for the Pclass == 1 the most common Cabin is C. Given that, in my original data frame df I want to fill every df['Cabin'] == null with C.
This is a small example and I could treat each possible null combination by hand with something as:
df_both[df_both['Pclass'] == 1 & df_both['Cabin'] == np.NaN] = 'C'
However, I wonder if I can use this derived data frame to do all this filling automatic.
Thank you.
If you want to fill all Nan's with something like the median or the mean of the specific column you can do the following.
for median:
df.fillna(df.median())
for mean
df.fillna(df.mean())
see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-with-a-pandasobject for more information.
Edit:
Alternatively you can use a dictionary with specified values. The keys need to map to column names. This way you can also impute for missing values in strings.
df.fillna({'col1':'a','col2': 1})

How to boxplot data after different column values in pandas

I have a dataframe like this:
Country Year Column1 Column2
1 Guatemala 1999 5 1
4 Mexico 2000 1 3
5 Mexico 2000 2 2
6 Mexico 2000 2 1
8 Guatemala 2000 3 2
11 Guatemala 2003 4 3
12 Guatemala 2003 6 4
13 Guatemala 2003 5 5
What I want to make is a boxplot for each group in Country, displaying a number of boxes corresponding to the number of unique values in Years. These boxes should represent the values in Column2.
I group the data and get boxplots like this:
df1=df.groupby('Origin').boxplot(column='Column2', subplots=True)
That gives me a boxplot for each Country, but with just one plot in it, representing all the values from that group, not separated by years. How can I get a box for each unique value in year, representing the values in Column2 in my code?
I would use the seaborn package, in particular combining the FacetGrid with boxplot.
For your situation, the code might look like this:
import seaborn as sns
g = sns.FacetGrid(df, col="Country", sharex=False)
g.map(sns.boxplot, 'Year', 'Column2')
Edit: this is what I get for your data above:

Categories

Resources