This is my first time drawing bar charts in Python.
My df op:
key descript score
0 noodles taste 5
1 noodles color -2
2 noodles health 3
3 apple color 7
4 apple hard 9
My code:
import matplotlib.pyplot as plt
op['positive'] = op['score'] > 0
op['score'].plot(kind='barh', color=op.positive.map({True: 'r', False: 'k'}), use_index=True)
plt.show()
plt.savefig('sample1.png')
Output:
But this is not what I expected. I would like to draw two charts, one for each key, with the index shown, and maybe use different colors, like below:
How can I accomplish this?
Try:
fig, ax = plt.subplots(1, op.key.nunique(), figsize=(15, 5), sharex=True)
i = 0
# Fix some data issues/typos
op['key'] = op.key.str.replace('noodels', 'noodles')
for n, g in op.assign(positive=op['score'] >= 0).groupby('key'):
    g.plot.barh(y='score', x='descript', ax=ax[i], color=g['positive'].map({True: 'red', False: 'blue'}), legend=False)\
        .set_xlabel(n)
    ax[i].set_ylabel('Score')
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
    ax[i].spines['left'].set_position('zero')
    i += 1
Output:
Update: added moving of the y-axis labels, thanks to this SO solution by @ImportanceOfBeingErnest.
fig, ax = plt.subplots(1, op.key.nunique(), figsize=(15, 5), sharex=True)
i = 0
# Fix some data issues/typos
op['key'] = op.key.str.replace('noodels', 'noodles')
for n, g in op.assign(positive=op['score'] >= 0).groupby('key'):
    g.plot.barh(y='score', x='descript', ax=ax[i], color=g['positive'].map({True: 'red', False: 'blue'}), legend=False)\
        .set_xlabel(n)
    ax[i].set_ylabel('Score')
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
    ax[i].spines['left'].set_position('zero')
    # Keep the y tick labels at the left edge of the axes after moving the spine
    plt.setp(ax[i].get_yticklabels(), transform=ax[i].get_yaxis_transform())
    i += 1
Output:
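A variant of the same idea, as a rough sketch rather than a drop-in replacement: it pairs each axes object with its group via zip instead of a manual counter, and it passes squeeze=False so the code would also work when there is only one key. The sample frame is rebuilt from the question; the names groups, axes and bar_colors are mine, and it calls matplotlib's barh directly instead of the pandas plot wrapper.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Sample data copied from the question.
op = pd.DataFrame({
    "key": ["noodles", "noodles", "noodles", "apple", "apple"],
    "descript": ["taste", "color", "health", "color", "hard"],
    "score": [5, -2, 3, 7, 9],
})

groups = list(op.groupby("key"))
# squeeze=False always returns a 2-D array of axes, even for a single key.
fig, axes = plt.subplots(1, len(groups), figsize=(15, 5), sharex=True, squeeze=False)

for ax, (name, g) in zip(axes[0], groups):
    # One color per bar: red for non-negative scores, blue for negative ones.
    bar_colors = ["red" if s >= 0 else "blue" for s in g["score"]]
    ax.barh(g["descript"], g["score"], color=bar_colors)
    ax.set_xlabel(name)
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
```

Call plt.show() or fig.savefig(...) afterwards, depending on whether you want a window or a file.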
New here so putting hyperlinks. My dataframe looks like this.
HR ICULOS SepsisLabel PatientID
100.3 1 0 1
117.0 2 0 1
103.9 3 0 1
104.7 4 0 1
102.0 5 0 1
88.1 6 0 1
Access the whole file here. What I wanted is to add a marker on the HR graph based on SepsisLabel (See the file). E.g., at ICULOS = 249, Sepsis Label changed from 0 to 1. I wanted to show that at this point on graph, sepsis label changed. I was able to calculate the position using this code:
mark = dummy.loc[dummy['SepsisLabel'] == 1, 'ICULOS'].iloc[0]
print("The ICULOS where SepsisLabel changes from 0 to 1 is:", mark)
Output: The ICULOS where SepsisLabel changes from 0 to 1 is: 249
I plotted the graph using this code:
plt.figure(figsize=(15,6))
ax = plt.gca()
ax.set_title("Patient ID = 1")
ax.set_xlabel('ICULOS')
ax.set_ylabel('HR Readings')
sns.lineplot(ax=ax,
x="ICULOS",
y="HR",
data=dummy,
marker = '^',
markersize=5,
markeredgewidth=1,
markeredgecolor='black',
markevery=mark)
plt.show()
This is what I got: Graph. The marker was supposed to be on position 249 only. But it is also on position 0. Why is it happening? Can someone help me out?
Thanks.
Working with markevery can be tricky in this case, as it strongly depends on there being exactly one entry for each patient and each ICULOS.
Here is an alternative approach, using an explicit scatter plot to draw the marker:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'HR': np.random.randn(200).cumsum() + 60,
'ICULOS': np.tile(np.arange(1, 101), 2),
'SepsisLabel': np.random.binomial(2, 0.05, 200),
'PatientID': np.repeat([1, 2], 100)})
for patient_id in [1, 2]:
    dummy = df[df['PatientID'] == patient_id]
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.set_title(f"Patient ID = {patient_id}")
    ax.set_xlabel('ICULOS')
    ax.set_ylabel('HR Readings')
    sns.lineplot(ax=ax,
                 x="ICULOS",
                 y="HR",
                 data=dummy)
    # Coordinates of the first row where SepsisLabel == 1
    x = dummy[dummy['SepsisLabel'] == 1]["ICULOS"].values[0]
    y = dummy[dummy['SepsisLabel'] == 1]["HR"].values[0]
    ax.scatter(x=x,
               y=y,
               marker='^',
               s=5,
               linewidth=1,
               edgecolor='black')
    ax.text(x, y, str(x) + '\n', ha='center', va='center', color='red')
plt.show()
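As for why the marker also showed up at position 0: an integer markevery=N tells matplotlib to mark every N-th point, starting with the point at index 0, so markevery=249 marks indices 0, 249, 498 and so on. Passing a list of positional indices marks only those points. Below is a minimal sketch of that, assuming one row per ICULOS value; the data is a made-up stand-in for a single patient, first_pos is a name I introduced, and it uses plain ax.plot (the question's plot suggests seaborn forwards markevery to matplotlib, but I have not relied on that here).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for one patient's data: 300 readings,
# with SepsisLabel switching from 0 to 1 at ICULOS == 249.
iculos = np.arange(1, 301)
dummy = pd.DataFrame({
    "ICULOS": iculos,
    "HR": 90 + 10 * np.sin(iculos / 20),
    "SepsisLabel": (iculos >= 249).astype(int),
})

# Positional index (0-based row number) of the first row with SepsisLabel == 1.
first_pos = int(np.flatnonzero(dummy["SepsisLabel"].to_numpy() == 1)[0])

fig, ax = plt.subplots(figsize=(15, 6))
# markevery=[first_pos] marks only that one point; markevery=first_pos (an int)
# would mark every first_pos-th point starting from index 0.
(line,) = ax.plot(dummy["ICULOS"], dummy["HR"], marker="^", markersize=5,
                  markeredgecolor="black", markevery=[first_pos])
```

Note that the list holds row positions, not ICULOS values: here ICULOS 249 sits at position 248 because ICULOS starts at 1.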
For your new question, here is an example of how to convert the 'ICULOS' column to pandas dates. The example uses the date 20210101 to correspond with ICULOS == 1. You probably have a different starting date for each patient.
df_fb = pd.DataFrame()
df_fb['Y'] = df['HR']
df_fb['DS'] = pd.to_datetime('20210101') + pd.to_timedelta(df['ICULOS'] - 1, unit='D')
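If each patient does have their own starting date, one possible way to handle it is to look the date up per row before adding the offset. This is only a sketch: the start_dates mapping and its dates are invented for illustration, and it keeps the one-day-per-ICULOS step from the snippet above.

```python
import pandas as pd

# Small stand-in frame; the column names follow the question.
df = pd.DataFrame({
    "HR": [100.3, 117.0, 103.9, 88.1, 90.2],
    "ICULOS": [1, 2, 3, 1, 2],
    "PatientID": [1, 1, 1, 2, 2],
})

# Hypothetical starting dates, one per patient (not from the original data).
start_dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-03-15"]), index=[1, 2])

df_fb = pd.DataFrame({
    "Y": df["HR"],
    # Look up each row's start date, then add (ICULOS - 1) days as above.
    "DS": df["PatientID"].map(start_dates) + pd.to_timedelta(df["ICULOS"] - 1, unit="D"),
})
```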
I'm looking for a way to plot side-by-side stacked bar plots to compare the host composition of positive (Condition==True) and total cases in each country from my DataFrame.
Here is a sample of the DataFrame.
id Location Host genus_name #ofGenes Condition
1 Netherlands Homo sapiens Escherichia 4.0 True
2 Missing Missing Klebsiella 3.0 True
3 Missing Missing Aeromonas 2.0 True
4 Missing Missing Glaciecola 2.0 True
5 Antarctica Missing Alteromonas 2.0 True
6 Indian Ocean Missing Epibacterium 2.0 True
7 Missing Missing Klebsiella 2.0 True
8 China Homo sapiens Escherichia 0 False
9 Missing Missing Escherichia 2.0 True
10 China Plantae kingdom Pantoea 0 False
11 China Missing Escherichia 2.0 True
12 Pacific Ocean Missing Halomonas 0 False
I need something similar to the image below, but I want to plot percentages.
Can anyone help me?
I guess what you want is a stacked categorical bar plot, which cannot be directly plotted using seaborn. But you can achieve it by customizing one.
Import some necessary packages.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
Read the dataset. Since your sample data is too small, I randomly generate some more rows so that the plot looks good.
def gen_fake_data(data, size=400):
    unique_values = []
    for c in data.columns:
        unique_values.append(data[c].unique())
    new_data = pd.DataFrame({c: np.random.choice(unique_values[i], size=size)
                             for i, c in enumerate(data.columns)})
    new_data = pd.concat([data, new_data])
    new_data['id'] = new_data.index + 1
    return new_data
data = pd.read_csv('data.csv')
new_data = gen_fake_data(data)
Define the stacked categorical bar plot
def stack_catplot(x, y, cat, stack, data, palette=sns.color_palette('Reds')):
    ax = plt.gca()
    # pivot the data based on categories and stacks
    df = data.pivot_table(values=y, index=[cat, x], columns=stack,
                          dropna=False, aggfunc='sum').fillna(0)
    ncat = data[cat].nunique()
    nx = data[x].nunique()
    nstack = data[stack].nunique()
    range_x = np.arange(nx)
    width = 0.8 / ncat  # width of each bar
    for i, c in enumerate(data[cat].unique()):
        # iterate over categories, i.e., Conditions
        # calculate the location of each bar
        loc_x = (0.5 + i - ncat / 2) * width + range_x
        bottom = 0
        for j, s in enumerate(data[stack].unique()):
            # iterate over stacks, i.e., Hosts
            # obtain the height of each stack of a bar
            height = df.loc[c][s].values
            # plot the bar, you can customize the color yourself
            ax.bar(x=loc_x, height=height, bottom=bottom, width=width,
                   color=palette[j + i * nstack], zorder=10)
            # change the bottom attribute to achieve a stacked barplot
            bottom += height
    # make xlabel; the pivoted index is sorted, so sort the labels to match the bars
    ax.set_xticks(range_x)
    ax.set_xticklabels(sorted(data[x].unique()), rotation=45)
    ax.set_ylabel(y)
    # make legend
    plt.legend([Patch(facecolor=palette[i]) for i in range(ncat * nstack)],
               [f"{c}: {s}" for c in data[cat].unique() for s in data[stack].unique()],
               bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
    plt.grid()
    plt.show()
Let's plot!
plt.figure(figsize=(6, 3), dpi=300)
stack_catplot(x='Location', y='#ofGenes', cat='Condition', stack='Host', data=new_data)
If you want to plot percentages, calculate them in the raw dataset first.
total_genes = new_data.groupby(['Location', 'Condition'], as_index=False)['#ofGenes'].sum().rename(
columns={'#ofGenes': 'TotalGenes'})
new_data = new_data.merge(total_genes, how='left')
new_data['%ofGenes'] = new_data['#ofGenes'] / new_data['TotalGenes'] * 100
plt.figure(figsize=(6, 3), dpi=300)
stack_catplot(x='Location', y='%ofGenes', cat='Condition', stack='Host', data=new_data)
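One thing to watch with the percentage step: a (Location, Condition) group whose total is 0 gives 0/0, which is NaN. In the sample, the Condition == False rows all have 0 genes, so this may well come up. Here is a hedged sketch of the same calculation using groupby(...).transform instead of a merge, reporting such groups as 0 %. The small frame is illustrative, and treating empty groups as 0 % is my assumption about what you would want.

```python
import numpy as np
import pandas as pd

# Illustrative rows; the column names follow the question.
df = pd.DataFrame({
    "Location": ["China", "China", "China", "Netherlands"],
    "Condition": [True, True, False, True],
    "#ofGenes": [2.0, 6.0, 0.0, 4.0],
})

# Per-row group total without a merge: transform keeps the original row order.
totals = df.groupby(["Location", "Condition"])["#ofGenes"].transform("sum")

# Groups whose total is 0 would give 0/0 = NaN; report them as 0 % instead.
share = df["#ofGenes"] / totals.replace(0, np.nan) * 100
df["%ofGenes"] = np.where(totals > 0, share, 0.0)
```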
You didn't specify how you would like to stack the bars, but you should be able to do something like this...
df = pd.read_csv('data.csv')
agg_df = df.pivot_table(index='Location', columns='Host', values='Condition', aggfunc='count')
agg_df.plot(kind='bar', stacked=True)
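Since the question asks for percentages, the pivoted counts can be normalized row by row before plotting. A minimal sketch, assuming counts per Location and Host are what should be stacked; the rows are illustrative, and counts and percent are names I made up.

```python
import pandas as pd

# A few rows modeled on the question's sample (values are illustrative).
df = pd.DataFrame({
    "Location": ["China", "China", "China", "Netherlands", "Antarctica"],
    "Host": ["Homo sapiens", "Plantae kingdom", "Missing", "Homo sapiens", "Missing"],
    "Condition": [False, False, True, True, True],
})

# Count rows per (Location, Host); absent combinations become 0.
counts = df.pivot_table(index="Location", columns="Host", values="Condition",
                        aggfunc="count", fill_value=0)

# Divide each row by its total so every Location sums to 100 %.
percent = counts.div(counts.sum(axis=1), axis=0) * 100

# percent.plot(kind="bar", stacked=True) would then draw the stacked bars.
```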
I am trying to plot several different things in scatter plots by having several subplots and iterating over the remaining categories, but the plots only display the first iteration without throwing any error. To clarify, here is an example of what the data actually look like:
a kind state property T
0 0.905618 I dry prop1 10
1 0.050311 I wet prop1 20
2 0.933696 II dry prop1 30
3 0.114824 III wet prop1 40
4 0.942719 IV dry prop1 50
5 0.276627 II wet prop2 10
6 0.612303 III dry prop2 20
7 0.803451 IV wet prop2 30
8 0.257816 II dry prop2 40
9 0.122468 IV wet prop2 50
And this is how I generated the example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
kinds = ['I','II','III','IV']
states = ['dry','wet']
props = ['prop1','prop2']
T = [10,20,30,40,50]
a = np.random.rand(10)
k = ['I','I','II','III','IV','II','III','IV','II','IV']
s = ['dry','wet','dry','wet','dry','wet','dry','wet','dry','wet']
p = ['prop1','prop1','prop1','prop1','prop1','prop2','prop2','prop2','prop2','prop2']
t = [10,20,30,40,50,10,20,30,40,50]
df = pd.DataFrame(index=range(10),columns=['a','kind','state','property','T'])
df['a']=a
df['kind']=k
df['state']=s
df['property']=p
df['T']=t
print(df)
Next, I am going to generate 2 rows and 2 columns of subplots, to display variabilities in property1 and property2 in wet and dry states. So I basically slice my dataframe into several smaller ones like this:
first = df[(df['state']=='dry')&(df['property']=='prop1')]
second = df[(df['state']=='wet')&(df['property']=='prop1')]
third = df[(df['state']=='dry')&(df['property']=='prop2')]
fourth = df[(df['state']=='wet')&(df['property']=='prop2')]
dfs = [first,second,third,fourth]
In each of these subplots, which represent certain laboratory conditions, I want to plot the values of a versus T for different kinds of samples. To distinguish between the kinds of samples, I assign different colours and markers to them. So here is my plotting script:
fig = plt.figure(figsize=(8,8.5))
gs = gridspec.GridSpec(2,2, hspace=0.4, wspace=0.3)
colours = ['r','b','g','gold']
symbols = ['v','v','^','^']
titles=['dry 1','wet 1','dry 2','wet 2']
for no, df in enumerate(dfs):
    ax = fig.add_subplot(gs[no])
    for i, r in enumerate(kinds):
        #print i, r
        df = df[df['kind']==r]
        c = colours[i]
        m = symbols[i]
        plt.scatter(df['T'],df['a'],c=c,s=50.0, marker=m, edgecolor='k')
    ax = plt.xlabel('T')
    ax = plt.xticks(T)
    ax = plt.ylabel('A')
    ax = plt.title(titles[no],fontsize=12,alpha=0.75)
plt.show()
But the result only plots the first iteration, in this case kind I in red triangles. If I remove this first item from the iterating lists, it only plots the first variable (kind II in blue triangles).
What am I doing wrong?
The figure looks like this, but I would like to have each subplot accordingly populated with red and blue and green and gold markers.
(Please note this happens with my real data as well, so the problem should not be in the way I generate the example.)
Your problem lies within the inner for loop. By writing df = df[df['kind']==r], you replace the original df with the version filtered for I. Then, in the next iteration of the loop, where you would filter for II, no further data is found. Therefore you also get no error message, as the code is otherwise 'correct'. If you rewrite the relevant piece of code like this:
for no, df in enumerate(dfs):
    ax = fig.add_subplot(gs[no])
    for i, r in enumerate(kinds):
        #print i, r
        df2 = df[df['kind']==r]
        c = colours[i]
        m = symbols[i]
        plt.scatter(df2['T'],df2['a'],c=c,s=50.0, marker=m, edgecolor='k')
    ax = plt.xlabel('T')
    ax = plt.xticks(T)
    ax = plt.ylabel('A')
    ax = plt.title(titles[no],fontsize=12,alpha=0.75)
It should work just fine. Tested on Python 3.5.
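To see the effect in isolation, here is a small sketch without any plotting. It counts how many rows each iteration finds with the overwriting pattern versus a separate name, and shows groupby as another way to get one sub-frame per kind. The tiny frame and the names work, buggy_sizes, fixed_sizes and grouped_sizes are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0.9, 0.1, 0.5, 0.3],
    "kind": ["I", "II", "III", "IV"],
    "T": [10, 20, 30, 40],
})
kinds = ["I", "II", "III", "IV"]

# Overwriting pattern: the filtered frame replaces the frame being filtered,
# so every iteration after the first searches an already-filtered frame.
work = df
buggy_sizes = []
for r in kinds:
    work = work[work["kind"] == r]
    buggy_sizes.append(len(work))

# Fixed pattern: always filter the untouched original.
fixed_sizes = [len(df[df["kind"] == r]) for r in kinds]

# Alternative: let groupby hand out one sub-frame per kind.
grouped_sizes = {kind: len(sub) for kind, sub in df.groupby("kind")}
```

Here buggy_sizes comes out as [1, 0, 0, 0], which matches what you saw: only the first kind is ever plotted.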
When I increase the font size in matplotlib parts of the x-axis title are cut off. Is there a way to keep the plot/figure as it is and just tell matplotlib to increase the output page size instead of cutting off?
My code is:
import random as rnd  # assumption: rnd is the standard-library random module
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('paper')
fig = plt.figure()
ax1 = plt.subplot2grid((3,1),(0,0),rowspan=2)
ax2 = plt.subplot2grid((3,1),(2,0),sharex=ax1)
# first subplot
plt.setp(ax1.get_xticklabels(),visible=False)
ax1.set_ylabel("$\mathrm{Arbitrary\; Units}$", size=20)
ax1.yaxis.get_major_formatter().set_powerlimits((0,1))
ax1.get_yaxis().set_label_coords(-0.1,0.5)
x1,x2 = [],[]
for i in range(1000000):
    r = rnd.random()
    x1.append(r**(1/2))
    x2.append(r**(1/3))
xmin = 0
xmax = 1
nbins = 50
h1,_,_ = ax1.hist(x1,bins=nbins,range=(xmin,xmax),density=True,color="#FF0000",histtype='step')
h2,_,_ = ax1.hist(x2,bins=nbins,range=(xmin,xmax),density=True,color="#00FF00",histtype='step')
# legend
old, = ax1.plot([0,0],color="#FF0000",label="2")
new, = ax1.plot([0,0],color="#00FF00",label="3")
ax1.legend(loc=2, ncol=1, borderaxespad=0.)
old.set_visible(False)
new.set_visible(False)
# second subplot
ax2.plot(np.linspace(xmin,xmax,nbins),h2/h1)
ax2.plot([xmin,xmax],[1,1],color="black")
ax2.set_ylim(ymin=0,ymax=1.99)
ax2.set_xlabel(r'$\phi_\mu$', size=20, weight="light")
ax2.set_ylabel("$\mathrm{Ratio}$", size=20)
ax2.get_yaxis().set_label_coords(-0.1,0.5)
fig.subplots_adjust(hspace=0.)
ax2.set_yticks(ax2.get_yticks()[1:-1])
The stylesheet that I use ('paper') is:
axes.axisbelow: True
axes.grid: False
axes.labelcolor: 333333
axes.linewidth: 0.6
backend: GTK
font.family: serif
font.size: 20
grid.alpha: 0.1
grid.color: black
grid.linestyle: -
grid.linewidth: 0.8
legend.fontsize: 20
legend.framealpha: 0
legend.numpoints: 1
lines.linestyle: -
lines.linewidth: 2
lines.markeredgewidth: 0
patch.linewidth: 2
text.color: 555555
text.usetex: True
xtick.color: 333333
ytick.color: 333333
This is what it looks like:
Since I want the subplots to share the x-axis I don't want any space between them.
Can anyone help me here?
Cheers.
In Python pandas I have created a DataFrame with one value for each year and two subclasses, i.e., one metric for each parameter triplet:
import pandas, requests, numpy
import matplotlib.pyplot as plt
df
Metric Tag_1 Tag_2 year
0 5770832 FOOBAR1 name1 2008
1 7526436 FOOBAR1 xyz 2008
2 33972652 FOOBAR1 name1 2009
3 17491416 FOOBAR1 xyz 2009
...
16 6602920 baznar2 name1 2008
17 6608 baznar2 xyz 2008
...
30 142102944 baznar2 name1 2015
31 0 baznar2 xyz 2015
I would like to produce a bar plot with metrics as y-values over x=(year,Tag_1,Tag_2) and sorting primarily for years and secondly for tag_1 and color the bars depending on tag_1. Something like
(2008,FOOBAR,name1) --> 5770832 *RED*
(2008,baznar2,name1) --> 6602920 *BLUE*
(2008,FOOBAR,xyz) --> 7526436 *RED*
(2008,baznar2,xyz) --> ... *BLUE*
(2008,FOOBAR,name1) --> ... *RED*
I tried starting with a grouping of columns like
df.plot.bar(x=['year','Tag_1','Tag_2'])
but have not found a way to separate selections into two bar sets next to each other.
This should get you on your way:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('path_to_file.csv')
# Group by the desired columns
new_df = df.groupby(['year', 'Tag_1', 'Tag_2']).sum()
# Sort by Metric (ascending by default; pass ascending=False to sort descending)
new_df.sort_values('Metric', inplace=True)
# Helper function for generating an alternating sequence of 'r' and 'b' colors
def get_color(i):
    if i%2 == 0:
        return 'r'
    else:
        return 'b'
colors = [get_color(j) for j in range(new_df.shape[0])]
# Make the plot
fig, ax = plt.subplots()
ind = np.arange(new_df.shape[0])
width = 0.65
a = ax.barh(ind, new_df.Metric, width, color = colors) # plot a vals
ax.set_yticks(ind) # position axis ticks (bars are centered on ind in current matplotlib)
ax.set_yticklabels(new_df.index.values) # set them to the names
fig.tight_layout()
plt.show()
You can also do it this way:
fig, ax = plt.subplots()
df.groupby(['year', 'Tag_1', 'Tag_2']).sum().plot.barh(color=['r','b'], ax=ax)
fig.tight_layout()
plt.show()
PS: if you don't like scientific notation, you can get rid of it:
ax.get_xaxis().get_major_formatter().set_scientific(False)
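The same switch is also available through ax.ticklabel_format, which calls set_scientific on the axis formatter for you. A small self-contained sketch, using two Metric values from the question as bar lengths (the rest of the setup is only illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Two Metric values taken from the question.
ax.barh([0, 1], [5770832, 142102944], color=["r", "b"])

# style="plain" turns off scientific notation on the x axis.
ax.ticklabel_format(style="plain", axis="x")
fig.canvas.draw()  # force the tick labels to be computed

labels = [t.get_text() for t in ax.get_xticklabels()]
```

This only applies while the axis uses the default ScalarFormatter; with a custom formatter, ticklabel_format raises an error instead.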