Histogram of one field wrt another in pandas python - python

I am trying to plot a histogram distribution of one column with respect to another. For example, if the dataframe columns are ['count','age'], then I want to plot the total counts in each age group. Suppose in
age: 0-10 -> total count was 20
age: 10-20 -> total count was 10
age: 20-30 -> ... etc
I tried groupby('age') and then plotting a histogram, but it didn't work.
Thanks.
Update
Here is some of my data
df.head()
age count
0 65 2417.86
1 65 4173.50
2 65 3549.16
3 65 509.07
4 65 0.00
Also, df.plot( x='age', y='count', kind='hist') shows

OK, if I understand correctly, you want a weighted histogram:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# test data: 300 random ages, each with a random weight in 'counts'
df = pd.DataFrame({'age': np.random.normal(50, 10, 300).astype(int),
                   'counts': 1000 * np.random.random(300)})
#df.head()
# age counts
#0 38 797.174450
#1 36 402.171434
#2 49 894.218420
#3 66 841.786623
#4 51 597.040259
df.hist('age',weights=df['counts'] )
plt.ylabel('counts')
plt.show()
yields the figure
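If you want the per-bin totals as numbers rather than only as a figure, the same weighted sums can be computed with pd.cut and groupby. This is a minimal sketch, not part of the original answer: the sample values and the 10-year bin edges are assumptions chosen to mirror the example in the question.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's frame
df = pd.DataFrame({
    'age':   [5, 8, 12, 15, 19, 25, 65, 65],
    'count': [10.0, 10.0, 4.0, 3.0, 3.0, 7.5, 2417.86, 4173.50],
})

# Assumed bin edges; right=False gives bins [0, 10), [10, 20), ...
edges = list(range(0, 101, 10))
age_group = pd.cut(df['age'], bins=edges, right=False)

# Sum of 'count' per age bin: the bar heights of a weighted histogram
totals = df.groupby(age_group, observed=False)['count'].sum()
print(totals)
```

Calling totals.plot(kind='bar') on the result should draw the same bar heights as a weighted histogram that uses these bin edges.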

Related

Better way to plot Gender count using Python

I am making a graph to plot Gender count for the time series data that look like following data. Each row represent hourly data of each respective patient.
HR   SBP  DBP  Sepsis  Gender  P_ID
92   120  80   0       0       0
98   115  85   0       0       0
93   125  75   1       1       1
95   130  90   1       1       1
102  120  80   0       0       2
109  115  75   0       0       2
94   135  100  0       0       2
97   100  70   1       1       3
85   120  80   1       1       3
88   115  75   1       1       3
93   125  85   1       1       3
78   130  90   1       0       4
115  140  110  1       0       4
102  120  80   0       1       5
98   140  110  0       1       5
This is my code:
gender = df_n['Gender'].value_counts()
plt.figure(figsize=(7, 6))
ax = gender.plot(kind='bar', rot=0, color="c")
ax.set_title("Bar Graph of Gender", y = 1)
ax.set_xlabel('Gender')
ax.set_ylabel('Number of People')
ax.set_xticklabels(('Male', 'Female'))
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
What happens now is that the code counts the total number of rows with each value (0: Male, 1: Female) and plots that. But I want to plot the number of male and female patients, not the number of 0s and 1s, because the same patient has multiple rows of data (one P_ID, many rows). In other words: how many patients are male and how many are female?
Can someone help me out? I guess sns.countplot could be used, but I don't know how.
Thanks for helping me out >.<
__________ Update ________________
How can I group the genders by sepsis (1) versus no sepsis (0)?
__________ Update 2 ___________
So, I got the actual total count of males and females, thanks to @Shaido.
In the whole dataset there are only 2932 septic patients; the rest are non-septic. This is what I got from @JohanC's answer.
Now the problem is that, although there are 2932 septic patients, the graph suggests that only 426 patients (251 male and 175 female) are septic and the rest are non-septic. But this is not true. Please help. Thanks.
I have a working example for selecting the unique IDs. It looks ugly, so there is probably a better way, but it works:
import pandas as pd
# example of data:
data = {'gender': [0, 0, 1, 1, 1, 1, 0, 0], 'id': [1, 1, 2, 2, 3, 3, 4, 4]}
df = pd.DataFrame(data)
# get all unique ids:
ids = set(df.id)
# go over all ids and take the first gender value for each:
g = [list(df[df['id'] == i]['gender'])[0] for i in ids]
# count genders, the lazy way, using pandas since the rest of the code also assumes a dataframe for plotting:
gender_counts = pd.DataFrame(g).value_counts()
# from here you can use your plot function.
# Or Counter
from collections import Counter
gender_counts = Counter(g)
# You have to create another method for plotting the gender.
You can group by 'P_ID' and take the first row for each of them (supposing a 'P_ID' has only one gender and only one sepsis). Then you can call sns.countplot on that dataframe, using gender for x and sepsis for hue (or vice versa). You can rename the values in the columns to show their names in the legend and in the tick labels.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from io import StringIO
data_str = '''
HR|SBP|DBP|Sepsis|Gender|P_ID
92|120|80|0|0|0
98|115|85|0|0|0
93|125|75|1|1|1
95|130|90|1|1|1
102|120|80|0|0|2
109|115|75|0|0|2
94|135|100|0|0|2
97|100|70|1|1|3
85|120|80|1|1|3
88|115|75|1|1|3
93|125|85|1|1|3
78|130|90|1|0|4
115|140|110|1|0|4
102|120|80|0|1|5
98|140|110|0|1|5
'''
df = pd.read_csv(StringIO(data_str), delimiter='|')
# new df: take Sepsis and Gender from the first row for every P_ID
df_per_PID = df.groupby('P_ID')[['Sepsis', 'Gender']].first()
# give names to the values in the columns
df_per_PID = df_per_PID.replace({'Gender': {0: 'Male', 1: 'Female'}, 'Sepsis': {0: 'No sepsis', 1: 'Sepsis'}})
# show counts per Gender and Sepsis
ax = sns.countplot(data=df_per_PID, x='Gender', hue='Sepsis', palette='rocket')
ax.legend(title='') # remove title, as it is clear from the legend items
ax.set_xlabel('')
for bars in ax.containers:
    ax.bar_label(bars)
# ax.margins(y=0.1) # make some extra space for the labels
ax.locator_params(axis='y', integer=True)
sns.despine()
plt.show()
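Regarding Update 2 in the question: one possible reason for the low septic count is that a patient's Sepsis flag may change over time, so the first row per P_ID can still be 0 for a patient who becomes septic later. The sketch below is only an illustration under that assumption. It uses made-up rows and treats a patient as septic if any of their rows has Sepsis == 1, by taking the maximum per P_ID instead of the first value.

```python
import pandas as pd

# Hypothetical hourly rows: patients 1 and 2 only become septic in later rows
df = pd.DataFrame({
    'P_ID':   [0, 0, 1, 1, 2, 2, 2, 3, 3],
    'Gender': [0, 0, 1, 1, 0, 0, 0, 1, 1],
    'Sepsis': [0, 0, 0, 1, 0, 0, 1, 0, 0],
})

# One row per patient: Gender from the first row,
# Sepsis = 1 if any row of that patient is 1 (assumed definition)
per_patient = df.groupby('P_ID').agg(
    Gender=('Gender', 'first'),
    Sepsis=('Sepsis', 'max'),
)

# Patients per Gender/Sepsis combination
table = pd.crosstab(per_patient['Gender'], per_patient['Sepsis'])
print(table)
```

The per_patient frame can then be renamed and passed to sns.countplot exactly as in the answer above.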

Seaborn Line Plot for plotting multiple parameters

I have dataset as below,
index   10_YR_CAGR  5_YR_CAGR  1_YR_CAGR
c1_rev  20.5        21.5       31.5
c2_rev  20.5        22.5       24
c3_rev  21          24         27
c4_rev  20          26         30
c5_rev  24          19         15
c1_eps  21          22         23
c2_eps  21          24         25
This data has 5 companies and its parameters like rev, eps, profit etc. I need to plot as below:
rev:
x_axis-> index_col c1_rev, ...c5_rev
y_axis -> 10_YR_CAGR .. 1_YR_CAGR
eps:
x_axis -> index_col: c1_eps,...c5_eps
y_axis -> 10_YR_CAGR,... 1_YR_CAGR
etc...
I have tried with following code:
eps = analysis_df[analysis_df.index.str.contains('eps',regex=True)]
for i1 in eps.columns[eps.columns!='index']:
    sns.lineplot(x="index",y=i1,data=eps,label=i1)
This requires building a separate dataframe for each parameter and then looping over it. How can I loop over the main source dataframe directly?
Instead of a separate loop per parameter, how can I use the parameters (rev, eps, profit) as FacetGrid facets? How do I apply those filters in a FacetGrid?
Sample output of the above code: [figure]
How can I produce the same kind of plot for each parameter in a single for loop?
Facets are typically plotted by "melting" analysis_df into id/variable/value columns.
First, split() the index column into Company and Parameter, which we'll later use as id columns when melting:
analysis_df[['Company', 'Parameter']] = analysis_df['index'].str.split('_', expand=True)
# index 10_YR_CAGR 5_YR_CAGR 1_YR_CAGR Company Parameter
# 0 c1_rev 100 21 1 c1 rev
# 1 c2_rev 1 32 24 c2 rev
# ...
melt() the CAGR columns:
melted = analysis_df.melt(
    id_vars=['Company', 'Parameter'],
    value_vars=['10_YR_CAGR', '5_YR_CAGR', '1_YR_CAGR'],
    var_name='Period',
    value_name='CAGR',
)
# Company Parameter Period CAGR
# 0 c1 rev 10_YR_CAGR 100
# 1 c2 rev 10_YR_CAGR 1
# 2 c3 rev 10_YR_CAGR 14
# 3 c1 eps 10_YR_CAGR 1
# ...
# 25 c2 pft 1_YR_CAGR 14
# 26 c3 pft 1_YR_CAGR 17
relplot() CAGR vs Company (colored by Period) for each Parameter using the melted dataframe:
sns.relplot(
    data=melted,
    kind='line',
    col='Parameter',
    x='Company',
    y='CAGR',
    hue='Period',
    col_wrap=1,
    facet_kws={'sharex': False, 'sharey': False},
)
Sample data to reproduce this plot:
import io
import pandas as pd
csv = '''
index,10_YR_CAGR,5_YR_CAGR,1_YR_CAGR
c1_rev,100,21,1
c2_rev,1,32,24
c3_rev,14,23,7
c1_eps,1,20,50
c2_eps,21,20,25
c3_eps,31,20,37
c1_pft,20,1,10
c2_pft,25,20,14
c3_pft,11,55,17
'''
analysis_df = pd.read_csv(io.StringIO(csv))

Stacked Plot To Represent Genders For An Age Group From CSV containing Identifier , Age and Gender On Python / Pandas/ Matplotlib

I have CSV data with age, gender (men, women) and an identifier. I grouped the individuals by age and gender and counted the identifiers with pandas:
counts = df.groupby(['Age','Gender']).count()
print(counts)
and the result looked something like this:
Age Gender Id_count
15 W 1
17 M 1
19 M 2
20 M 6
W 1
21 M 3
W 1
23 M 4
W 3
24 M 8
W 3
25 M 9
26 M 6
W 1
27 M 3
W 1
28 M 9
W 2
29 M 5
W 1
30 M 3
31 M 9
W 1 ..
The unique ages in my dataset range from 15 to 90. I now want to do an age group analysis with a stacked plot at the end. For that, I want to bin the ages into age groups (10-20, 21-30, 31-40 and so on) and plot the sum of identifiers for each age group, showing the sum on top of each bar. My aim is to get two different colors in each stacked bar, representing men and women in proportion to their Id_count. To implement this, I created a dictionary that maps each age to its range, as shown below:
df['id_counted']= np.round(df['Age'])
categories_dict = { 15 : 'Between 10 and 20',
16 : 'Between 10 and 20',
17 : 'Between 10 and 20',
18 : 'Between 10 and 20',
19 : 'Between 10 and 20',
20 : 'Between 10 and 20',
21 : 'Between 21 and 30',
22 : 'Between 21 and 30',..
90 : 'Between 81 and 90',}
Then I created this dataframe.
df['category'] = df['id_counted'].map(categories_dict)
count2 = df.groupby(['category','Age','Gender','Id_Count']).count()
total = count2.groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
print(total)
Now I have successfully counted the total of identifiers in each age group. It looked something like this:
Between 10 and 20 11
Between 21 and 30 62
Between 31 and 40 82
Between 41 and 50 120
Between 51 and 60 125
Between 61 and 70 141
Between 71 and 80 192
Between 81 and 90 38
But I lost my way here because I wanted to plot gender too. Take the age group between 10 and 20: the total of 11 should appear on top of the bar, and the bar should be stacked into 9 men and 2 women. I then thought about another approach, because I don't think this one will get me to my result. I generated a grouped dataframe with the counts of M and W per age, then calculated the total number of individuals per age:
totals = counts.groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
Now to plot:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['W'], bottom=counts['M'], color='red', label='W')
plt.legend()
plt.xlabel('Age Group')
plt.ylabel('Occurences Of Identifiers')
plt.title('ttl',fontsize=20)
for age,tot in zip(ages,totals.values.flatten()):
    plt.annotate('{:d}'.format(tot), xy=(age+0.39, tot), xytext=(0,1), textcoords='offset points', ha='center', va='bottom')
plt.show()
plt.save()
plt.close()
and got this plot, which turned out okay, but it is per individual age, and my goal is to generate the same plot per age group from my dictionary. I would be very grateful if anyone could suggest how to obtain this result. Thank you so much for your time.
Assigning age groups is easier using np.digitize:
import numpy as np
import pandas as pd
n = 100
age = np.random.randint(15, 91, size=n)
gender = np.random.randint(2, size=n)
df = pd.DataFrame({'Age': age, 'Gender': gender})  # DataFrame.from_items was removed in pandas 1.0
bins = np.arange(1, 10) * 10
df['category'] = np.digitize(df.Age, bins, right=True)
print(df.head())
Age Gender category
0 22 1 2
1 54 0 5
2 85 1 8
3 77 0 7
4 86 1 8
Now count grouping by category and gender, then unstack the result to have gender as columns.
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
Gender 0 1
category
1 2 7
2 7 5
3 6 4
4 11 9
5 5 8
6 2 4
7 10 7
8 6 7
Plotting is now a breeze.
counts.plot(kind='bar', stacked=True)
This turned out to be my code in the end:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.style.use('fivethirtyeight')
df = pd.read_csv('/home/Desktop/cocktail_ids_age_gender.csv')
df.values
bins = np.arange(10, 100, 10)
df['category'] = np.digitize(df.Age, bins, right=True)
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
ax = counts.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0).astype(np.int64), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.xlabel ('Age Group')
plt.ylabel ('Co-Occurences ')
plt.title('Comparison Of Occurences In An Age Group',fontsize=20)
plt.show()
In the end I decided to leave it stacked anyway because it made the analysis easier. Everything turned out well, thanks to goyo. The only thing still bothering me is my x-axis: instead of showing 1, 2, 3, 4, ... I want it to show 10-20, 20-30 and so on. I don't see how to do that. Can anyone help me? Thank you.
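One way to get readable x-axis labels is to let pd.cut assign a text label to each bin instead of the integer that np.digitize returns. The following is a rough sketch with made-up ages, not the code from the thread. The bin edges copy np.arange(10, 100, 10) from the code above, the label format is an assumption, and right=True matches the right=True used with np.digitize.

```python
import pandas as pd

# Hypothetical data in the same shape as the question's CSV
df = pd.DataFrame({
    'Age':    [15, 20, 21, 30, 31, 45, 90],
    'Gender': ['W', 'M', 'M', 'W', 'M', 'M', 'W'],
})

edges = list(range(10, 100, 10))  # 10, 20, ..., 90
# One text label per bin: '10-20', '20-30', ..., '80-90'
labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]

# right=True puts an age of exactly 20 into '10-20', as np.digitize(..., right=True) does
df['category'] = pd.cut(df['Age'], bins=edges, labels=labels, right=True)

# Count per age group and gender, with genders as columns
counts = (df.groupby(['category', 'Gender'], observed=False)
            .size()
            .unstack(fill_value=0))
print(counts)
```

Because the index of counts now holds the label strings, counts.plot(kind='bar', stacked=True) should show them as tick labels on the x-axis.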

Stacked Bar Plot By Group Count On Pandas Python

My CSV data looks like the sample below. I want to create a stacked bar plot with pandas/Python where each bar shows the male and female portions in two colors, with the total count of males and females taking the drug displayed on top of the bar. For instance, 7 people fall into age 20, of whom 6 are male and 1 is female, so the bar for age 20 should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age, count them and plot the result, but I also want each bar to show the two genders in different colors. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:
This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.
I started by getting a list of ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
import io  # in Python 3, StringIO lives in the io module
df = pd.read_csv(io.StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
Drug_ID
Age Gender
15 F 1
17 M 1
19 M 2
20 F 1
M 6
21 F 1
M 3
23 F 3
M 4
24 F 3
M 2
Using that, I can easily calculate the total number of individuals per age group:
totals = counts.groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
print(totals)
Drug_ID
Age
15 1
17 1
19 2
20 7
21 4
23 7
24 5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender F M
Age
15 1.0 NaN
17 NaN 1.0
19 NaN 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
Looks pretty good. I'll just do a final refinement and replace the NaN by 0.
counts = counts.fillna(0)
print(counts)
Gender F M
Age
15 1.0 0.0
17 0.0 1.0
19 0.0 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. It cannot be done in a single call; instead we loop through the ages and the totals. (For simplicity I take the values and flatten() them: totals is a one-column DataFrame, so totals.values is a 2-D array.)
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0, 5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age, tot) because, since Matplotlib 2.0, bars are centered on x by default (align='center'), while tot is the full height of the bar. In older Matplotlib versions the bars ran from x to x+width with width=0.8 by default, so the center of the bar was at x+0.4 and the annotation had to be placed at (age+0.4, tot). The text is offset by a few (5) points in the y direction; adjust to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations
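As a shorter alternative to the groupby / unstack / fillna steps above, pd.crosstab builds the same Age by Gender count table in one call. This is only a sketch of that shortcut, not the method used in the answer. The frame below reuses the first few Age/Gender pairs from the sample data and leaves out the Drug_ID column.

```python
import pandas as pd

# Age/Gender pairs taken from the first rows of the sample CSV (Drug_ID omitted)
df = pd.DataFrame({
    'Age':    [15, 17, 19, 19, 20, 20, 20, 20, 20, 20, 20],
    'Gender': ['F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
})

# Rows are ages, columns are genders, missing combinations are already 0
counts = pd.crosstab(df['Age'], df['Gender'])

# Total per age, i.e. the number to annotate on top of each stacked bar
totals = counts.sum(axis=1)

print(counts)
print(totals)
```

Here counts['M'] and counts['F'] can be passed to plt.bar as in the answer, and totals is a plain Series, so no flatten() is needed when looping over it.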

Weighted average using pivot tables in pandas

I have written some code to compute a weighted average using pivot tables in pandas. However, I am not sure how to add the column that performs the actual weighted averaging (a new column where each row contains the value of 'cumulative'/'COUNT').
The data looks like so:
VALUE COUNT GRID agb
1 43 1476 1051
2 212 1476 2983
5 7 1477 890
4 1361 1477 2310
Here is my code:
# Read input data
lup_df = pandas.read_csv(o_dir+LUP+'.csv')  # DataFrame.from_csv was removed in pandas 1.0
# Insert a new column with area * variable
lup_df['cumulative'] = lup_df['COUNT']*lup_df['agb']
# Create and output pivot table
lup_pvt = pandas.pivot_table(lup_df, values='agb', index=['GRID'])  # 'rows' was renamed to 'index'
# TODO: Add a new column where each row contains value of 'cumulative'/'COUNT'
lup_pvt.to_csv(o_dir+PIVOT+'.csv',index=True,header=True,sep=',')
How can I do this?
So you want, for each value of grid, the weighted average of the agb column where the weights are the values in the count column. If that interpretation is correct, I think this does the trick with groupby:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 50
df = pd.DataFrame({'count': np.random.choice(np.arange(10)+1, n),
'grid': np.random.choice(np.arange(10)+50, n),
'value': np.random.randn(n) + 12})
df['prod'] = df['count'] * df['value']
grouped = df.groupby('grid').sum()
grouped['wtdavg'] = grouped['prod'] / grouped['count']
print(grouped)
count value prod wtdavg
grid
50 22 57.177042 243.814417 11.082474
51 27 58.801386 318.644085 11.801633
52 11 34.202619 135.127942 12.284358
53 24 59.340084 272.836636 11.368193
54 39 137.268317 482.954857 12.383458
55 47 79.468986 531.122652 11.300482
56 17 38.624369 214.188938 12.599349
57 22 38.572429 279.948202 12.724918
58 27 36.492929 327.315518 12.122797
59 34 60.851671 408.306429 12.009013
Or, if you want to be a bit slick and write a weighted average function you can use over and over:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 50
df = pd.DataFrame({'count': np.random.choice(np.arange(10)+1, n),
'grid': np.random.choice(np.arange(10)+50, n),
'value': np.random.randn(n) + 12})
def wavg(val_col_name, wt_col_name):
    # returns a function that computes the weighted average of one group
    def inner(group):
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wtd_avg'
    return inner
slick = df.groupby('grid').apply(wavg('value', 'count'))
print(slick)
grid
50 11.082474
51 11.801633
52 12.284358
53 11.368193
54 12.383458
55 11.300482
56 12.599349
57 12.724918
58 12.122797
59 12.009013
dtype: float64
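Applied to the four sample rows from the question, the same idea can also be written with np.average, which accepts a weights argument. This is a small sketch rather than the answer's own code. It assumes COUNT is the weight and agb the value being averaged, as the question describes, and weighted_avg is a helper name introduced only for this example.

```python
import numpy as np
import pandas as pd

# The four sample rows shown in the question
df = pd.DataFrame({
    'VALUE': [1, 2, 5, 4],
    'COUNT': [43, 212, 7, 1361],
    'GRID':  [1476, 1476, 1477, 1477],
    'agb':   [1051, 2983, 890, 2310],
})

def weighted_avg(group):
    # sum(agb * COUNT) / sum(COUNT) within one GRID cell
    return np.average(group['agb'], weights=group['COUNT'])

# Select only the needed columns so the grouping key is not passed to the function
wavg_per_grid = df.groupby('GRID')[['agb', 'COUNT']].apply(weighted_avg)
print(wavg_per_grid)
```

The result is a Series indexed by GRID, so it can be written out with to_csv just like the pivot table in the question.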
