Better way to plot Gender count using Python - python

I am making a graph to plot Gender count for the time series data that look like following data. Each row represent hourly data of each respective patient.
HR
SBP
DBP
Sepsis
Gender
P_ID
92
120
80
0
0
0
98
115
85
0
0
0
93
125
75
1
1
1
95
130
90
1
1
1
102
120
80
0
0
2
109
115
75
0
0
2
94
135
100
0
0
2
97
100
70
1
1
3
85
120
80
1
1
3
88
115
75
1
1
3
93
125
85
1
1
3
78
130
90
1
0
4
115
140
110
1
0
4
102
120
80
0
1
5
98
140
110
0
1
5
This is my code:
gender = df_n['Gender'].value_counts()
plt.figure(figsize=(7, 6))
ax = gender.plot(kind='bar', rot=0, color="c")
ax.set_title("Bar Graph of Gender", y = 1)
ax.set_xlabel('Gender')
ax.set_ylabel('Number of People')
ax.set_xticklabels(('Male', 'Female'))
for rect in ax.patches:
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2
space = 1
label = format(y_value)
ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
Now what is happening is the code is calculating total number of instances (0: Male, 1: Female) and plotting it. But I want to plot the total males and females, not the total number of 0s and 1s, as the Same patient is having multiple rows of data (as per P_ID). Like how many patients are male and how many are female?
Can someone help me out? I guess maybe sns.countplot can be used. But I don't know how.
Thanks for helping me out >.<
__________ Udpate ________________
How I can group those Genders that are sepsis (1) or no sepsis (0)?
__________ Update 2 ___________
So, I got the total actual count of Male and Female, thanks to #Shaido.
In the whole dataset, there are only 2932 septic patients. Rest are non-septic. This is what I got from #JohanC answer.
Now, the problem is that as there are only 2932 septic patients, by looking at the graph, it is assumed that only 426 (251 Male) and (175 Female) are septic patients (out of 2932), rest are non-septic. But this is not true. Please help. Thanks.

I have a working example for selecting the unique IDS, it looks ugly so there is probably a better way, but it works...
import pandas as pd
# example of data:
data = {'gender': [0, 0, 1, 1, 1, 1, 0, 0], 'id': [1, 1, 2, 2, 3, 3, 4, 4]}
df = pd.DataFrame(data)
# get all unique ids:
ids = set(df.id)
# Go over all id, get first element of gender:
g = [list(df[df['id'] == i]['gender'])[0] for i in ids]
# count genders, laze way using pandas since the rest of the code also assumes a dataframe for plotting:
gender_counts = pd.DataFrame(g).value_counts()
# from here you can use your plot function.
# Or Counter
from collections import Counter
gender_counts = Counter(g)
# You have to create another method for plotting the gender.

You can group by 'P_ID' and take the first row for each of them (supposing a 'P_ID' has only one gender and only one sepsis). Then you can call sns.countplot on that dataframe, using gender for x and sepsis for hue (or vice versa). You can rename the values in the columns to show their names in the legend and in the tick labels.
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
data_str = '''
HR|SBP|DBP|Sepsis|Gender|P_ID
92|120|80|0|0|0
98|115|85|0|0|0
93|125|75|1|1|1
95|130|90|1|1|1
102|120|80|0|0|2
109|115|75|0|0|2
94|135|100|0|0|2
97|100|70|1|1|3
85|120|80|1|1|3
88|115|75|1|1|3
93|125|85|1|1|3
78|130|90|1|0|4
115|140|110|1|0|4
102|120|80|0|1|5
98|140|110|0|1|5
'''
df = pd.read_csv(StringIO(data_str), delimiter='|')
# new df: take Sepsis and Gender from the first row for every P_ID
df_per_PID = df.groupby('P_ID')[['Sepsis', 'Gender']].first()
# give names to the values in the columns
df_per_PID = df_per_PID.replace({'Gender': {0: 'Male', 1: 'Female'}, 'Sepsis': {0: 'No sepsis', 1: 'Sepsis'}})
# show counts per Gender and Sepsis
ax = sns.countplot(data=df_per_PID, x='Gender', hue='Sepsis', palette='rocket')
ax.legend(title='') # remove title, as it is clear from the legend items
ax.set_xlabel('')
for bars in ax.containers:
ax.bar_label(bars)
# ax.margins(y=0.1) # make some extra space for the labels
ax.locator_params(axis='y', integer=True)
sns.despine()
plt.show()

Related

Trying to work together with groupby and Anova

I have a data set with categorical values for Family size (1 to 4) and Loan taken (0 and 1). I want to know if there is significant diff between the mean of family size and loan taken.
I did groupby to get the count of Loan by family size as :
gp = df.groupby(["Family", "Personal Loan"])["Personal Loan"].count()
with output
Family Personal Loan
1 0 1365
1 107
2 0 1190
1 106
3 0 877
1 133
4 0 1088
1 134
Now I need to apply f_two way anova to see if there is significant difference between the loan taken and family size. Need help how to go about it.
You cannot do an two way anova with count data and binary response. What you can do is a chisq test, to test that the proportions of loan==1 are not equal across all families:
import seaborn as sns
import pandas as pd
from scipy.stats import chi2_contingency
I have to get back something like your original df:
df = pd.DataFrame({
'Family':np.repeat(np.arange(1,5),[1472,1296,1010,1222]),
'Personal Loan':np.repeat([0,1,0,1,0,1,0,1],
[1365,107,1190,106,877,133,1088,134]),
})
gp = df.groupby(["Family","Personal Loan"])["Personal Loan"].count()
gp
Family Personal Loan
1 0 1365
1 107
2 0 1190
1 106
3 0 877
1 133
4 0 1088
1 134
Name: Personal Loan, dtype: int64
Now we do the chi-sq using crosstab:
contingency = pd.crosstab(df['Family'],df['Personal Loan'])
test = chi2_contingency(contingency)
test
(29.676116414854746, 1.6144121228248757e-06, 3, array([[1330.688, 141.312],
[1171.584, 124.416],
[ 913.04 , 96.96 ],
[1104.688, 117.312]]))
The 2nd value, 1.614e-06 is the p-value of your test that all the loan == 1 ratios are equal

How to groupby column, and then create a scatterplot of counts

I have a dataframe similar to the one below:
id date available
0 1944 2019-07-11 f
1 1944 2019-07-11 t
2 159454 2019-07-12 f
3 159454 2019-07-13 f
4 159454 2019-07-14 f
I would like form a scatter plot where each id has a corresponding point; the x value is the number of t occurrences, and the y value is the number of f occurrences in the available column.
I have tried:
grouped = df.groupby(['listing_id'])['available'].value_counts().to_frame()
grouped.head()
This gives me something like
available
listing_id available
1944 t 364
f 1
2015 f 184
t 181
3176 t 279
f 10
But I'm not sure how to work this anymore. How can I get my desired plot? Is there a better way to proceed?
Assuming you won't have to use the date column:
# Generate example data
N = 100
np.random.seed(1)
df = pd.DataFrame({'id': np.random.choice(list(range(1, 6)), size=N),
'available': np.random.choice(['t', 'f'], size=N)})
df = df.sort_values('id').reset_index(drop=True)
# For each id: get t and f counts, unstack into columns, ensure
# column order is ['t', 'f']
counts = df.groupby(['id', 'available']).size().unstack()[['t', 'f']]
# Plot
fig, ax = plt.subplots()
counts.plot(x='t', y='f', kind='scatter', ax=ax)
# Optional: label each data point with its id.
# This is rough and might not look good beyond a few data points
for label, (t, f) in counts.iterrows():
ax.text(t + .05, f + .05, label)
Output:
You can group by both listing_id and available, do a count and then unstack and then plot using seaborn.
Below I used some random numbers, the image is only for illustration.
import seaborn as sns
data = df.groupby(['listing_id', 'available'])['date'].count().unstack()
sns.scatterplot(x=data.t, y=data.f, hue=data.index, legend='full')
Using your data:
reset the index
df.reset_index(inplace=True)
id available count
1944 t 364
1944 f 1
2015 f 184
2015 t 181
3176 t 279
3176 f 10
create a t & f dataframe:
t = df[df.available == 't'].reset_index(drop=True)
id available count
0 1944 t 364
1 2015 t 181
2 3176 t 279
f = df[df.available == 'f'].reset_index(drop=True)
id available count
0 1944 f 1
1 2015 f 184
2 3176 f 10
Plot the data:
plt.scatter(x=t['count'], y=f['count'])
plt.xlabel('t')
plt.ylabel('f')
for i, txt in enumerate(f['id'].tolist()):
plt.annotate(txt, (t['count'].loc[i] + 3, f['count'].loc[i]))

Stacked Plot To Represent Genders For An Age Group From CSV containing Identifier , Age and Gender On Python / Pandas/ Matplotlib

I have a csv data with age, gender(Men,Women) and identifier. I grouped age and gender of individuals by count of identifier on pandas with
counts = df.groupby(['Age','Gender']).count()
print counts
and the result looked something like this :
Age Gender Id_count
15 W 1
17 M 1
19 M 2
20 M 6
W 1
21 M 3
W 1
23 M 4
W 3
24 M 8
W 3
25 M 9
26 M 6
W 1
27 M 3
W 1
28 M 9
W 2
29 M 5
W 1
30 M 3
31 M 9
W 1 ..
Unique ages on my dataset are from age 15 to 90. I now want to do an age group analysis with a stacked plot at the end.For that , i want to lets say range the ages into certain age group (10-20,21-30,31-40 and so on) and plot sum of identifier on each age group , showing sum on the top of the bar and my aim is to get two different colors for stacked bar representing men and women according to their proportion of id_count. To implement this : i created a dictionary where i gave range as shown below..
df['ids_counted']= np.round(df['Age'])
categories_dict = { 15 : 'Between 10 and 20',
16 : 'Between 10 and 20',
17 : 'Between 10 and 20',
18 : 'Between 10 and 20',
19 : 'Between 10 and 20',
20 : 'Between 10 and 20',
21 : 'Between 21 and 30',
22 : 'Between 21 and 30',..
90 : 'Between 81 and 90',}
Then I created this dataframe.
df['category'] = df['id_counted'].map(categories_dict)
count2 = df.groupby(['category','Age','Gender','Id_Count']).count()
total= count2.sum(level= 0)
print total
now i have successfully counted the total of identifier on each age group. It looked something like this :
Between 10 and 20 11
Between 21 and 30 62
Between 31 and 40 82
Between 41 and 50 120
Between 51 and 60 125
Between 61 and 70 141
Between 71 and 80 192
Between 81 and 90 38
But i lost my way here because i wanted to plot gender too. lets take age between 10 and 20 . Total 11 should have been on the top of my bar and portion 9 men and 2 women should have been plotted on a stacked bar. I thought about another approach because i think this way to approach won't get me to my result. I generated a grouped dataframe with the counts of each M and F per age, then calculated the total number of individual per age group.
totals = counts.sum(level=0)
Now to plot :
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['W'], bottom=counts['M'], color='red', label='W')
plt.legend()
plt.xlabel('Age Group')
plt.ylabel('Occurences Of Identifiers')
plt.title('ttl',fontsize=20)
for age,tot in zip(ages,totals.values.flatten()):
plt.annotate('{:d}'.format(tot), xy=(age+0.39, tot), xytext=(0,1), textcoords='offset points', ha='center', va='bottom')
plt.show()
plt.save()
plt.close()
and got this plot which turned out to be okay but it is for individual age and my target is to generate same plot for age group on my dictionary. I would be very grateful if anyone would suggest me or give me an idea to obtain my aimed result. Thank you so much for your time.
Assigning age groups is easier using np.digitize.
n = 100
age = np.random.randint(15, 91, size=n)
gender = np.random.randint(2, size=n)
df = pd.DataFrame.from_items([('Age', age), ('Gender', gender)])
bins = np.arange(1, 10) * 10
df['category'] = np.digitize(df.Age, bins, right=True)
print(df.head())
Age Gender category
0 22 1 2
1 54 0 5
2 85 1 8
3 77 0 7
4 86 1 8
Now count grouping by category and gender, then unstack the result to have gender as columns.
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
Gender 0 1
category
1 2 7
2 7 5
3 6 4
4 11 9
5 5 8
6 2 4
7 10 7
8 6 7
Plotting is now a breeze.
counts.plot(kind='bar', stacked=True)
This turned out to be my code at last :
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.style.use('fivethirtyeight')
df = pd.read_csv('/home/Desktop/cocktail_ids_age_gender.csv')
df.values
bins = np.arange(10, 100, 10)
df['category'] = np.digitize(df.Age, bins, right=True)
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
ax = counts.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
ax.annotate(np.round(p.get_height(),decimals=0).astype(np.int64), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.xlabel ('Age Group')
plt.ylabel ('Co-Occurences ')
plt.title('Comparison Of Occurences In An Age Group',fontsize=20)
plt.show()
And i decided to leave it stacked anyways because it made analysis easier. Everything turned out well , thanks to goyo. But the only thing that is again bothering me is my x-axis. Instead of showing 1 , 2 , 3 , 4.. i wanted to show 10-20,20-30 and so on. I am not grasping how i could do that. Can anyone help me. Thank you

Stacked Bar Plot By Group Count On Pandas Python

My csv data looks something like the one provided below. I wanted to create a stack bar plot with pandas/python where each bar represent male and female portions with two colors and on the top of the bar it shows the total count of both male and female taking the drug(in my case). For instance, for the Age of 20 fall total of 7 people and 6 of them are male and 1 is female so on the bar plot there should be 7 on the top of the bar and this 6:1 portion is shown in the bar with two colors. I managed to group the people according to their age count and plot it but I wanted to show the bar with two genders on different colors as well. Any help will be appreciated . Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:
This questions come back often, so I decided to write a step by step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.
I started by generating getting a list of ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
df = pd.read_csv(StringIO.StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print counts
Drug_ID
Age Gender
15 F 1
17 M 1
19 M 2
20 F 1
M 6
21 F 1
M 3
23 F 3
M 4
24 F 3
M 2
Using that, I can easily calculate the total number of individual per age group:
totals = counts.sum(level=0)
print totals
Drug_ID
Age
15 1
17 1
19 2
20 7
21 4
23 7
24 5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print counts
Gender F M
Age
15 1.0 NaN
17 NaN 1.0
19 NaN 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
Looks pretty good. I'll just do a final refinement and replace the NaN by 0.
counts = counts.fillna(0)
print counts
Gender F M
Age
15 1.0 0.0
17 0.0 1.0
19 0.0 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in one single pass, instead we'll loop through the ages and the totals (for simplicity sake, I take the values and flatten() them because they're not quite in the right format, not exactly sure why here)
for age,tot in zip(ages,totals.values.flatten()):
plt.annotate('N={:d}'.format(tot), xy=(age+0.4, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
the coordinates for the annotations are (age+0.4, tot) because the bars go from x to x+width with width=0.8by default, and therefore x+0.4 is the center of the bar, while tot is of course the full height of the bar. To offset the text a bit, I offset the text by a few (5) points in the y direction. Adjust according to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations

Histogram of one field wrt another in pandas python

I am trying to plot a histogram distribution of one column with respect to another. For example, if the dataframe columns are ['count','age'], then I want to plot the total counts in each age group. Suppose in
age: 0-10 -> total count was 20
age: 10-20 -> total count was 10
age: 20-30 -> ... etc
I tried groupby('age') and than plotting histogram but it didn't work.
Thanks.
Update
Here is some of my data
df.head()
age count
0 65 2417.86
1 65 4173.50
2 65 3549.16
3 65 509.07
4 65 0.00
Also, df.plot( x='age', y='count', kind='hist') shows
Ok,if I understand correctly, you want a weighted histogram
import pylab as plt
import pandas as pd
np = pd.np
df = pd.DataFrame( {'age':np.random.normal( 50,10,300).astype(int),
'counts':1000*np.random.random(300)} ) # test data
#df.head()
# age counts
#0 38 797.174450
#1 36 402.171434
#2 49 894.218420
#3 66 841.786623
#4 51 597.040259
df.hist('age',weights=df['counts'] )
plt.ylabel('counts')
plt.show()
yields the figure

Categories

Resources