Confidence interval does not display on barplot - python

I want to display the confidence interval for each bar in my plot, but the intervals do not show. I have two dataframes, and I am plotting the average of the NUMBER_GIRLS column from each of them.
For example, consider the two dataframes (shown below).
schools_north_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 32
2 SCHOOL_2 12
3 SCHOOL_3 26
schools_south_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 56
2 SCHOOL_2 33
3 SCHOOL_3 34
Therefore, I have used this code (shown below) to plot my barplot with the confidence intervals showing for each bar - but when plotting it, the confidence interval does not show up.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
objects = ('North', 'South')
y_pos = np.arange(len(objects))
avg_girls = [schools_north_df['NUMBER_GIRLS'].mean(), schools_south_df['NUMBER_GIRLS'].mean()]
sns.barplot(y_pos, avg_girls, ci=95)
plt.xticks(y_pos, objects)
plt.title('Average Number of Girls')
plt.show()
Could anyone help me and point out what is wrong with my code? I really need the confidence interval to display on my barplot.
Thank you very much!

If you want seaborn to display the confidence intervals, you need to let seaborn aggregate the data by itself (that is to say, provide the raw data instead of calculating the mean yourself).
I would create a new dataframe with an extra column (region) to indicate whether the data are from the "north" or the "south" and then request seaborn to plot NUMBER_GIRLS vs region:
df = pd.concat([schools_north_df.assign(region='North'), schools_south_df.assign(region='South')])
output:
ID NAME NUMBER_GIRLS region
0 1 SCHOOL_1 32 North
1 2 SCHOOL_2 12 North
2 3 SCHOOL_3 26 North
0 1 SCHOOL_1 56 South
1 2 SCHOOL_2 33 South
2 3 SCHOOL_3 34 South
plot:
# In seaborn 0.12 and later, ci= is deprecated; use errorbar=('ci', 95) instead.
sns.barplot(data=df, x='region', y='NUMBER_GIRLS', ci=95)
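To see why the raw data matter: seaborn's default confidence interval comes from bootstrapping, that is, resampling the observations. The sketch below is a simplified standard-library stand-in for that idea, not seaborn's actual implementation; the function name bootstrap_ci and its parameters are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=2000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample the observations with replacement and record each resample's mean.
    means = sorted(
        statistics.mean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    tail = (100 - level) / 2 / 100
    lower = means[int(tail * n_boot)]
    upper = means[int((1 - tail) * n_boot) - 1]
    return lower, upper


north = [32, 12, 26]  # NUMBER_GIRLS from schools_north_df

# With the raw observations, resampling produces a spread of means.
low, high = bootstrap_ci(north)

# With only the precomputed mean, every resample is identical,
# so the interval collapses to zero width and there is nothing to draw.
low_single, high_single = bootstrap_ci([statistics.mean(north)])
```

This is what happens in the question's code: each bar receives a single number, so there is no variation left from which to estimate an interval.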

Related

Create one boxplot per cluster for each column of information for a dataframe

Consider the following pandas DataFrame:
value  other_value  cluster
1382   2.1          0
10     3.9          1
104    5.9          1
82     -1.1         0
100    0.9          2
1003   0.85         2
232    4.1          0
19     0.6          3
1434   0.3          3
23     1.6          3
Using the seaborn module, I want to display a set of boxplots for each column of values, showing the comparative information per value of the cluster column.
That is, for the above DataFrame, it would show a first graph for the 'value' column with 4 boxplots, one for each cluster value. The second graph would include information for the 'other_value' column also showing 1 boxplot for each cluster.
My idea is to do the same, but instead of in R language, in python: Boxplots of different variables by cluster assigned on one graph in ggplot
My code only shows one graph at a time. I would like to get a single figure containing all the graphs, as in the link above:
sns.boxplot(y='value', x='cluster',
            data=df,
            palette="colorblind",
            hue='cluster')
Thanks for the help offered.
Most seaborn functions work best with the data in "long form".
Here is what the code could look like:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_html('https://stackoverflow.com/questions/72301993/')[0]
df_long = df.melt(id_vars='cluster', value_vars=df.columns[:-1], var_name='variable', value_name='values')
sns.catplot(kind='box', data=df_long,
            col='variable', y='values', x='cluster', hue='cluster', palette="colorblind", sharey=False, col_wrap=2)
plt.tight_layout()
plt.show()
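For readers unsure what "long form" means, here is a minimal sketch of the melt step on its own, using the first four rows of the question's table typed in by hand (so it does not depend on read_html). Leaving out value_vars makes melt use every column that is not listed in id_vars.

```python
import pandas as pd

# First four rows of the question's DataFrame.
df = pd.DataFrame({
    "value": [1382, 10, 104, 82],
    "other_value": [2.1, 3.9, 5.9, -1.1],
    "cluster": [0, 1, 1, 0],
})

# Wide -> long: one row per (cluster, variable) pair instead of one column per variable.
df_long = df.melt(id_vars="cluster", var_name="variable", value_name="values")
print(df_long)
```

Each original row becomes two rows, one per measured column, which is the shape that col='variable' in catplot relies on.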

Hide lines from a multiple line plot

I have a dataframe with 12 columns and 30 rows (only the first 5 rows are shown here):
0 1 2 3 4 5 6 7 8 9 10 11
0
10 0.420000 0.724000 0.552000 0.316000 0.176000 0.320000 0.228000 0.552000 0.476000 0.468000 0.560000 0.332000
20 0.387097 0.701613 0.516129 0.338710 0.177419 0.346774 0.217742 0.443548 0.483871 0.435484 0.516129 0.330645
30 0.353659 0.731707 0.365854 0.280488 0.158537 0.243902 0.231707 0.451220 0.524390 0.414634 0.451220 0.329268
40 0.377049 0.557377 0.311475 0.213115 0.213115 0.262295 0.262295 0.459016 0.540984 0.475410 0.377049 0.262295
50 0.285714 0.673469 0.183673 0.183673 0.163265 0.285714 0.204082 0.387755 0.489796 0.367347 0.306122 0.244898
I would like to plot a dot plot with the row indices on the x-axis and the column values on the y-axis (i.e. 12 dots at each x).
I have tried the following:
df.plot()
and I get this plot
I would like to show only the markers (dots) and not the lines
I tried df.plot(linestyle='None') but then I get an empty plot.
How can I change my code to show the dots/markers and hide the lines?
pandas.DataFrame.plot passes **kwargs to matplotlib's .plot method. Thus you can use any of the matplotlib.lines.Line2D properties:
df.plot(ls='', marker='.')
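The attempt with linestyle='None' came out empty most likely because lines have no marker by default, so once the line is hidden nothing is left to draw. A minimal sketch on a small made-up subset of the data (three rows, three columns, rounded values):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for the question's 30x12 DataFrame.
df = pd.DataFrame(
    {0: [0.42, 0.39, 0.35], 1: [0.72, 0.70, 0.73], 2: [0.55, 0.52, 0.37]},
    index=[10, 20, 30],
)

# Hide the line and ask for a marker explicitly.
ax = df.plot(linestyle="", marker=".")

# pandas also accepts a matplotlib format string through `style`;
# "." means point markers with no connecting line.
ax2 = df.plot(style=".")
```

Call plt.show() afterwards to display the figures when running this as a script.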

Scatter plot in python with Groups

Below are three columns VMDensity, ServerswithCorrectable errors and VMReboots.
VMDensity correctableCount avgVMReboots
LowDensity 7 5
HighDensity 1 23
LowDensity 5 11
HighDensity 1 23
LowDensity 9 5
HighDensity 1 22
HighDensity 1 22
LowDensity 9 2
LowDensity 9 6
LowDensity 5 3
I tried the following but not sure how to create it by groups with different colors.
import matplotlib.pyplot as plt
import pandas as pd
plt.scatter(df.correctableCount, df.avgVMReboots)
Now, I need generate a scatter plot with the grouping by VMDensity. The low density VM's should be in one color and the high density in another one.
If I understand you correctly, you do not need to "group" the data: you want to plot all data points regardless and just color them differently. So try something like
plt.scatter(df.correctableCount, df.avgVMReboots, c=df.VMDensity)
You will need to map the df.VMDensity strings to numbers and/or play with scatter's cmap parameter.
See this example from matplotlib's gallery.
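One possible way to do that mapping, sketched on the first five rows of the question's data. The palette dictionary and the color names in it are arbitrary choices for illustration, not something from the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of the question's table.
df = pd.DataFrame({
    "VMDensity": ["LowDensity", "HighDensity", "LowDensity", "HighDensity", "LowDensity"],
    "correctableCount": [7, 1, 5, 1, 9],
    "avgVMReboots": [5, 23, 11, 23, 5],
})

# Arbitrary color per category; any valid matplotlib color would do.
palette = {"LowDensity": "tab:blue", "HighDensity": "tab:orange"}
colors = df["VMDensity"].map(palette)

fig, ax = plt.subplots()
# `c` accepts one color per point.
ax.scatter(df["correctableCount"], df["avgVMReboots"], c=colors.tolist())
ax.set_xlabel("correctableCount")
ax.set_ylabel("avgVMReboots")
```

Mapping to color names avoids having to choose a colormap; mapping to numbers and passing cmap works as well.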

How to plot a pandas dataframe?

Hi, I am new to Python and am trying to plot a dataframe.
subject name marks
0 maths ankush 313
1 maths anvesh 474
2 maths amruth 264
3 science ankush 81
4 socail ankush 4
5 maths anirudh 16470
6 science anvesh 568
7 socail anvesh 5
8 science amruth 15
I am looking to plot a bar graph like the one shown in the figure.
Thank You for your help.
The problem is two-fold.
What format does data need to be in to produce bar chart?
How to get data into that format?
For the chart you want, you need the names in the x-axis in the index of the dataframe and the subjects as columns.
This requires a pivot
df.set_index(['name', 'subject']).marks.unstack(fill_value=0)
subject maths science socail
name
amruth 264 15 0
anirudh 1647 0 0
ankush 313 81 4
anvesh 474 568 5
And the subsequent plot
df.set_index(['name', 'subject']).marks.unstack(fill_value=0).plot.bar()
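As a side note, the same reshaping can be sketched with pivot_table, which may be preferable if a (name, subject) pair can occur more than once: unstack raises an error on duplicate index entries, whereas pivot_table aggregates them. The example uses a hand-typed subset of the question's rows, and aggfunc="sum" is an assumption about how duplicates should be combined.

```python
import pandas as pd

# Subset of the question's rows.
df = pd.DataFrame({
    "subject": ["maths", "maths", "maths", "science", "socail", "science", "socail", "science"],
    "name": ["ankush", "anvesh", "amruth", "ankush", "ankush", "anvesh", "anvesh", "amruth"],
    "marks": [313, 474, 264, 81, 4, 568, 5, 15],
})

# The answer's approach: names in the index, subjects as columns.
wide = df.set_index(["name", "subject"])["marks"].unstack(fill_value=0)

# Alternative that tolerates duplicate (name, subject) pairs by summing them.
wide_alt = df.pivot_table(index="name", columns="subject", values="marks",
                          aggfunc="sum", fill_value=0)
print(wide_alt)
```

Either result can be passed to .plot.bar() to get one group of bars per name.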
The above is a very good answer. However since you are new to python, pandas, & matplotlib, I thought I would share a blog post I have found really good in showing the basics of matplotlib and how it is combined with pandas.
http://pbpython.com/effective-matplotlib.html?utm_content=buffer76b10&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
I hope you find it useful

Stacked Bar Plot By Group Count On Pandas Python

My CSV data looks like the sample below. I want to create a stacked bar plot with pandas where each bar shows the male and female portions in two colors, with the total count of people taking the drug shown on top of the bar. For instance, 7 people are aged 20, of whom 6 are male and 1 is female, so the bar for age 20 should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age and plot the counts, but I also want each bar to show the two genders in different colors. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:
This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.
I started by getting a list of ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
from io import StringIO  # Python 3 location of StringIO
df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
Drug_ID
Age Gender
15 F 1
17 M 1
19 M 2
20 F 1
M 6
21 F 1
M 3
23 F 3
M 4
24 F 3
M 2
Using that, I can easily calculate the total number of individuals per age group:
totals = counts.groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
print(totals)
Drug_ID
Age
15 1
17 1
19 2
20 7
21 4
23 7
24 5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender F M
Age
15 1.0 NaN
17 NaN 1.0
19 NaN 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
Looks pretty good. I'll just do a final refinement and replace the NaN by 0.
counts = counts.fillna(0)
print(counts)
Gender F M
Age
15 1.0 0.0
17 0.0 1.0
19 0.0 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in a single pass; instead we loop through the ages and the totals. (totals is a one-column DataFrame, so totals.values is a 2D array with one column; flatten() turns it into the 1D sequence that zip() needs.)
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age, tot) because, since matplotlib 2.0, bars are centered on x by default, so age is the center of the bar, while tot is the full height of the bar. (In older versions bars ran from x to x+width with width=0.8 by default, so the center was at age+0.4.) To offset the text a bit, I shift it by a few (5) points in the y direction. Adjust to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations
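For comparison, here is a more compact sketch of the same idea using pd.crosstab and pandas' own stacked bar plotting. This is an alternative route, not the method walked through above, and it uses only the first rows of the data, typed in by hand. Note that pandas draws these bars at positions 0, 1, 2, ... rather than at the age values, so the annotations use the position.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First rows of the question's data.
df = pd.DataFrame({
    "Age": [15, 17, 19, 19, 20, 20, 20, 20, 20, 20, 20],
    "Gender": ["F", "M", "M", "M", "F", "M", "M", "M", "M", "M", "M"],
})

# Count rows per (Age, Gender); missing combinations become 0 automatically.
counts = pd.crosstab(df["Age"], df["Gender"])
totals = counts.sum(axis=1)

ax = counts[["M", "F"]].plot.bar(stacked=True, color=["blue", "pink"])
# Bars sit at integer positions 0..n-1, one per age group.
for position, total in enumerate(totals):
    ax.annotate(f"N={total}", xy=(position, total), xytext=(0, 5),
                textcoords="offset points", ha="center", va="bottom")
ax.set_xlabel("Ages")
ax.set_ylabel("Count")
```

Because crosstab fills absent combinations with 0, the separate unstack and fillna steps are not needed in this variant.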
