Consider the following pandas DataFrame:
value  other_value  cluster
1382   2.1          0
10     3.9          1
104    5.9          1
82     -1.1         0
100    0.9          2
1003   0.85         2
232    4.1          0
19     0.6          3
1434   0.3          3
23     1.6          3
Using the seaborn module, I want to display a set of boxplots for each value column, comparing the distributions across the values of the cluster column.
That is, for the above DataFrame, the first graph would show the 'value' column with 4 boxplots, one per cluster value. The second graph would show the 'other_value' column, again with one boxplot per cluster.
My idea is to do the same as in this R question, but in Python: Boxplots of different variables by cluster assigned on one graph in ggplot
My code only shows one graph at a time. I would like to get a joint figure containing all the graphs, as in the link above:
sns.boxplot(y='value', x='cluster',
            data=df,
            palette="colorblind",
            hue='cluster')
Thanks for the help offered.
Most seaborn functions work best with the data in "long form".
Here is what the code could look like:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_html('https://stackoverflow.com/questions/72301993/')[0]
df_long = df.melt(id_vars='cluster', value_vars=df.columns[:-1], var_name='variable', value_name='values')
sns.catplot(kind='box', data=df_long,
            col='variable', y='values', x='cluster', hue='cluster',
            palette="colorblind", sharey=False, col_wrap=2)
plt.tight_layout()
plt.show()
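To see what "long form" means here, the reshaping step can be sketched on its own. This is a minimal example that assumes the DataFrame is built by hand from the question's table rather than read from the web page; the column names `variable` and `values` are the ones chosen in the answer above.

```python
import pandas as pd

# Data copied by hand from the question's table (an assumption of this sketch;
# the answer above reads it from the question page instead).
df = pd.DataFrame({
    "value": [1382, 10, 104, 82, 100, 1003, 232, 19, 1434, 23],
    "other_value": [2.1, 3.9, 5.9, -1.1, 0.9, 0.85, 4.1, 0.6, 0.3, 1.6],
    "cluster": [0, 1, 1, 0, 2, 2, 0, 3, 3, 3],
})

# Wide -> long: every non-id column becomes rows of (cluster, variable, values).
df_long = df.melt(id_vars="cluster", var_name="variable", value_name="values")
print(df_long.head())
```

Each original row now contributes two rows, one per measured column, which is the shape that lets `catplot` split the figure by `col='variable'`.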
Below are three columns: VMDensity, servers with correctable errors (correctableCount), and VM reboots (avgVMReboots).
VMDensity correctableCount avgVMReboots
LowDensity 7 5
HighDensity 1 23
LowDensity 5 11
HighDensity 1 23
LowDensity 9 5
HighDensity 1 22
HighDensity 1 22
LowDensity 9 2
LowDensity 9 6
LowDensity 5 3
I tried the following, but I am not sure how to plot the groups in different colors.
import matplotlib.pyplot as plt
import pandas as pd
plt.scatter(df.correctableCount, df.avgVMReboots)
I need to generate a scatter plot grouped by VMDensity: the low-density VMs should be in one color and the high-density VMs in another.
If I understand you correctly, you do not need to "group" the data: you want to plot all data points regardless, just colored differently. So try something like
plt.scatter(df.correctableCount, df.avgVMReboots, c=df.VMDensity)
Passing the raw strings will not work as is: you will need to map the df.VMDensity strings to numbers or color names, and/or adjust scatter's cmap parameter.
See this example from matplotlib's gallery.
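As a rough sketch of the mapping step, the density labels can be translated to color names before calling scatter. The DataFrame is rebuilt by hand from the question's table, and the `color_map` dictionary and its two colors are arbitrary choices made for illustration, not part of the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Data copied by hand from the question's table.
df = pd.DataFrame({
    "VMDensity": ["LowDensity", "HighDensity", "LowDensity", "HighDensity", "LowDensity",
                  "HighDensity", "HighDensity", "LowDensity", "LowDensity", "LowDensity"],
    "correctableCount": [7, 1, 5, 1, 9, 1, 1, 9, 9, 5],
    "avgVMReboots": [5, 23, 11, 23, 5, 22, 22, 2, 6, 3],
})

# Hypothetical label -> color mapping; any valid matplotlib colors would do.
color_map = {"LowDensity": "tab:blue", "HighDensity": "tab:orange"}
colors = df["VMDensity"].map(color_map).tolist()

fig, ax = plt.subplots()
# A single scatter call: every point is drawn, each with the color of its group.
ax.scatter(df["correctableCount"], df["avgVMReboots"], c=colors)
ax.set_xlabel("correctableCount")
ax.set_ylabel("avgVMReboots")
```

Call plt.show() afterwards to display the figure. If a legend is needed, one scatter call per group with a label argument is an alternative.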
Hi, I am new to Python and am trying to plot a dataframe.
subject name marks
0 maths ankush 313
1 maths anvesh 474
2 maths amruth 264
3 science ankush 81
4 socail ankush 4
5 maths anirudh 16470
6 science anvesh 568
7 socail anvesh 5
8 science amruth 15
I am looking to plot a bar graph something like the one shown in the figure.
Thank you for your help.
The problem is two-fold:
1. What format does the data need to be in to produce the bar chart?
2. How do you get the data into that format?
For the chart you want, the names on the x-axis need to be in the index of the dataframe, with the subjects as columns.
This requires a pivot:
df.set_index(['name', 'subject']).marks.unstack(fill_value=0)
subject maths science socail
name
amruth 264 15 0
anirudh 1647 0 0
ankush 313 81 4
anvesh 474 568 5
And the subsequent plot
df.set_index(['name', 'subject']).marks.unstack(fill_value=0).plot.bar()
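For completeness, here is a self-contained sketch of the same pivot-then-plot idea. It assumes the DataFrame is typed in by hand from the question's table (including the 'socail' spelling used there); the variable name `wide` is just illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Data copied by hand from the question's table.
df = pd.DataFrame({
    "subject": ["maths", "maths", "maths", "science", "socail",
                "maths", "science", "socail", "science"],
    "name": ["ankush", "anvesh", "amruth", "ankush", "ankush",
             "anirudh", "anvesh", "anvesh", "amruth"],
    "marks": [313, 474, 264, 81, 4, 16470, 568, 5, 15],
})

# Names go to the index, subjects to the columns; missing pairs become 0.
wide = df.set_index(["name", "subject"])["marks"].unstack(fill_value=0)

# One group of bars per name, one bar per subject within each group.
ax = wide.plot.bar()
ax.set_ylabel("marks")
```

Call plt.show() afterwards to display the chart.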
The above is a very good answer. However, since you are new to Python, pandas, and matplotlib, I thought I would share a blog post that I found really good at showing the basics of matplotlib and how it combines with pandas.
http://pbpython.com/effective-matplotlib.html?utm_content=buffer76b10&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
I hope you find it useful
My CSV data looks like the sample below. I want to create a stacked bar plot with pandas where each bar shows the male and female portions in two colors, with the total count of people taking the drug displayed on top of the bar. For instance, 7 people are aged 20, of whom 6 are male and 1 is female, so the bar for age 20 should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age and plot the counts, but I also want to show the two genders in different colors within each bar. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:
This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.
I started by getting a list of ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
from io import StringIO  # Python 3 location of StringIO
df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
            Drug_ID
Age Gender
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2
Using that, I can easily calculate the total number of individual per age group:
totals = counts.groupby(level=0).sum()  # counts.sum(level=0) in older pandas versions
print(totals)
     Drug_ID
Age
15         1
17         1
19         2
20         7
21         4
23         7
24         5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender    F    M
Age
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
Looks pretty good. I'll just do a final refinement and replace the NaN by 0.
counts = counts.fillna(0)
print(counts)
Gender    F    M
Age
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. It cannot be done in a single call, so we loop through the ages and the totals. (totals is a one-column DataFrame, so totals.values is a 2-D array; flatten() turns it into a 1-D sequence of counts.)
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age, tot): since matplotlib 2.0, bars are centered on their x value by default, and tot is of course the full height of the stacked bar. In older matplotlib versions, bars ran from x to x+width with width=0.8 by default, so the center was at x+0.4 and the annotation needed xy=(age+0.4, tot). To offset the text a bit, I shift it by a few (5) points in the y direction. Adjust to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations.
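As a more compact variant of the steps above, the counting and the stacking can both be left to pandas. This is a sketch using a different technique from the answer (pd.crosstab plus DataFrame.plot.bar with stacked=True), and it assumes the Age/Gender pairs are rebuilt by hand from the CSV, with Drug_ID left out for brevity.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Age/Gender pairs copied from the CSV above (Drug_ID omitted for brevity).
records = (
    [(15, "F"), (17, "M")] + [(19, "M")] * 2
    + [(20, "F")] + [(20, "M")] * 6
    + [(21, "F")] + [(21, "M")] * 3
    + [(23, "F")] * 3 + [(23, "M")] * 4
    + [(24, "F")] * 3 + [(24, "M")] * 2
)
df = pd.DataFrame(records, columns=["Age", "Gender"])

# Rows are ages, columns are genders; absent combinations are counted as 0.
counts = pd.crosstab(df["Age"], df["Gender"])
totals = counts.sum(axis=1)

ax = counts[["M", "F"]].plot.bar(stacked=True, color=["blue", "pink"])

# pandas places categorical bars at positions 0, 1, 2, ... (not at the age
# values), so the annotations are anchored by position.
for pos, tot in enumerate(totals):
    ax.annotate(f"N={tot}", xy=(pos, tot), xytext=(0, 5),
                textcoords="offset points", ha="center", va="bottom")
ax.set_xlabel("Ages")
ax.set_ylabel("Count")
```

Call plt.show() afterwards to display the figure. Because crosstab returns integer counts with zeros filled in, no fillna() step is needed in this variant.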