I am new to pandas and matplotlib and am trying to accomplish the following. I have a data frame, shown below, which lists the performance of players by match date.
name runs match_date player_id
Dockrell, G H 0 2018-06-17 3752
Stirling, P R 81 2018-06-17 3586
O'Brien, K J 28 2018-06-17 3391
McCarthy, B J 0 2018-06-17 4563
Poynter, S W 0 2018-06-17 4326
Poynter, S W 2 2018-06-17 4326
McCarthy, B J 0 2018-06-17 4563
Shannon, J N K 5 2018-06-17 4219
Shannon, J N K 6 2018-06-17 4219
Stirling, P R 51 2018-06-17 3586
This is a subset of the data that I created with the following code:
match_performance = dataFrame[['name','runs','match_date','player_id']].sort_values('match_date',ascending=False).groupby('player_id').head(5)
sns.set_context({"figure.figsize": (10, 4)})
ax = sns.barplot(x="name", y="runs", data=match_performance)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
I need to plot this as either a stacked bar or a grouped bar chart to display the performance of players in their last 5 matches, based on the player id that I have in the dataframe, but I am not sure how to go about plotting this data as required.
Michael Waskom, the creator of Seaborn, posted this on Twitter:
@randyzwitch I don't really like stacked bar charts, I'd suggest maybe
using pointplot / factorplot with kind=point
— Michael Waskom (@michaelwaskom) September 4, 2014
Regretfully, the answer is no. There is no built-in function in Seaborn for plotting stacked bar charts.
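Since seaborn has no stacked-bar function, one common workaround is to pivot the data with pandas and use its built-in stacked bar plot. The following is only a minimal sketch: it assumes the column names from the question, and the small frame and the `match_no` helper column are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample in the same shape as match_performance
match_performance = pd.DataFrame({
    "name": ["Stirling, P R"] * 3 + ["O'Brien, K J"] * 3,
    "runs": [81, 51, 10, 28, 0, 45],
    "match_date": pd.to_datetime(
        ["2018-06-17", "2018-06-16", "2018-06-12"] * 2),
    "player_id": [3586] * 3 + [3391] * 3,
})

# Number each player's matches from most recent (1) to oldest
match_performance = match_performance.sort_values("match_date", ascending=False)
match_performance["match_no"] = (
    match_performance.groupby("player_id").cumcount() + 1)

# One row per player, one column per match
wide = match_performance.pivot_table(
    index="name", columns="match_no", values="runs", aggfunc="sum")

# pandas stacks the columns on top of each other for every player
ax = wide.plot(kind="bar", stacked=True, figsize=(10, 4))
ax.set_ylabel("runs")
plt.tight_layout()
```

Dropping `stacked=True` gives a grouped bar chart from the same pivoted frame.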
While this is an older question, I found it while looking for a solution, so I hope this may help someone. Achieving stacked bars is a bit tricky with seaborn, but this should do the trick:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
def stacked_chart_sns(df, x, y, group_by, palette):
    array_of_dfs = []
    w_0 = None
    for u in df[group_by].unique():
        w = df[df[group_by] == u].copy()
        if w_0 is not None:
            w = w.merge(w_0, how='outer').fillna(0)
            w[group_by] = u
            w[y] = w.apply(lambda x: x[y] + x['y_prev'], axis=1)
            w = w.drop(columns=['y_prev'])
        array_of_dfs += [w]
        w_0 = w.drop(columns=[group_by]).rename(columns={y:'y_prev'}).copy()

    patches = []
    for i, d in enumerate(array_of_dfs[::-1]):
        sns.barplot(x=x, y=y, data=d, color=palette[i])
        patches += [mpatches.Patch(label=list(df[group_by].unique())[::-1][i], color=palette[i])]

    plt.legend(handles=patches, loc = 'upper left', ncol=1, labelspacing=1)
    plt.show()
### use it with - for the example data in the question:
stacked_chart_sns(match_performance, 'match_date', 'runs', 'player_id', sns.color_palette("Spectral"))
I have the following graph [1], obtained with the code in [2]. As you can see from the first line inside the for loop, I set the height of the rectangles from the standard deviation value. But I can't figure out how to get the height of the corresponding rectangle. For example, given the blue rectangle, I would like to return the two bounds of the interval it covers, which are approximately 128.8 and 130.6. How can I do this?
[2] The code I used is the following:
import pandas as pd
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
import numpy as np
dfLunedi = pd.read_csv( "0.lun.csv", encoding = "ISO-8859-1", sep = ';')
dfSlotMean = dfLunedi.groupby('slotID', as_index=False).agg( NLunUn=('date', 'nunique'),NLunTot = ('date', 'count'), MeanBPM=('tempo', 'mean'), std = ('tempo','std') )
#print(dfSlotMean)
dfSlotMean.drop(dfSlotMean[dfSlotMean.NLunUn < 3].index, inplace=True)
df = pd.DataFrame(dfSlotMean)
df.to_csv('1.silLunedi.csv', sep = ';', index=False)
print(df)
bpmMattino = df['MeanBPM']
std = df['std']
listBpm = bpmMattino.tolist()
limInf = df['MeanBPM'] - df['std']
limSup = df['MeanBPM'] + df['std']
tick_spacing = 1
fig, ax = plt.subplots(1, 1)
for _, r in df.iterrows():
    #
    ax.plot([r['slotID'], r['slotID']+1], [r['MeanBPM']]*2, linewidth = r['std'] )
    #ax.plot([r['slotID'], r['slotID']+1], [r['MeanBPM']]*2, linewidth = r['std'])
ax.xaxis.grid(True)
ax.yaxis.grid(True)
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
This is the content of the csv:
slotID NMonUnique NMonTot MeanBPM std
0 7 11 78 129.700564 29.323091
2 11 6 63 123.372397 24.049397
3 12 6 33 120.625667 24.029006
4 13 5 41 124.516341 30.814985
5 14 4 43 118.904512 26.205309
6 15 3 13 116.380538 24.336491
7 16 3 42 119.670881 27.416843
8 17 5 40 125.424125 32.215865
9 18 6 45 130.540578 24.437559
10 19 9 58 128.180172 32.099529
11 20 5 44 125.596045 28.060657
I would advise against using linewidth to show anything related to your data. The reason is that linewidth is measured in "points" (see the matplotlib documentation), the size of which is not related to the xy-space that you plot your data in. To see this in action, try plotting with different linewidths and changing the size of the plotting window. The linewidth will not change with the axes.
Instead, if you do indeed want a rectangle, I suggest using matplotlib.patches.Rectangle. There is a good example of how to do that in the documentation, and I've also added an even shorter example below.
To give the rectangles different colors, you can do as shown here and simply get a random tuple with 3 elements and use that for the color. Another option is to take a list of colors, for example the TABLEAU_COLORS from matplotlib.colors, and take consecutive colors from that list. The latter may be better for testing, as the rectangles will get the same color for each run, but notice that there are just 10 colors in TABLEAU_COLORS, so you will have to cycle if you have more than 10 rectangles.
import matplotlib.pyplot as plt
import matplotlib.patches as ptc
import random
x = 3
y = 4.5
y_std = 0.3
fig, ax = plt.subplots()
for i in range(10):
    c = tuple(random.random() for _ in range(3))
    # The other option as a comment here (needs: import matplotlib.colors as mcolors)
    #c = mcolors.TABLEAU_COLORS[list(mcolors.TABLEAU_COLORS.keys())[i]]
    rect = ptc.Rectangle(xy=(x, y-y_std), width=1, height=2*y_std, color=c)
    ax.add_patch(rect)
ax.set_xlim((0,10))
ax.set_ylim((0,5))
plt.show()
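The remark about cycling through TABLEAU_COLORS when there are more than 10 rectangles could be handled with `itertools.cycle`. This is only a sketch under assumptions: the rectangle count of 12 and the choice to place each rectangle at a different x position are made up for illustration.

```python
import itertools
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.patches as ptc

y, y_std = 4.5, 0.3
n_rects = 12  # hypothetical: more rectangles than TABLEAU_COLORS has entries

fig, ax = plt.subplots()
# cycle() restarts the colour list after its last (10th) entry
colors = itertools.cycle(mcolors.TABLEAU_COLORS.values())
rects = []
for x, c in zip(range(n_rects), colors):
    rect = ptc.Rectangle(xy=(x, y - y_std), width=1, height=2 * y_std, color=c)
    ax.add_patch(rect)
    rects.append(rect)

ax.set_xlim((0, n_rects))
ax.set_ylim((0, 5))
```

Because the colour order is fixed, the 11th rectangle gets the same colour as the first on every run.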
If you define the height as the standard deviation, and the center is at the mean, then the interval should be [mean-(std/2) ; mean+(std/2)] for each rectangle, right? Is it intentional that the rectangles overlap? If not, I think your use of linewidth to size the rectangles is at fault. If the plot is there to visualize the mean and variance of the different categories, something like a boxplot or raincloud plot might be better.
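To make the interval explicit, the bounds can be computed as columns and the rectangles drawn in data coordinates, so that each patch reports its own extent. This is a rough sketch that assumes the height equals one standard deviation centred on the mean, as described above. The two rows reuse values from the CSV in the question, and the `limInf`/`limSup` names follow the question's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import matplotlib.patches as ptc
import pandas as pd

df = pd.DataFrame({
    "slotID": [7, 11],
    "MeanBPM": [129.700564, 123.372397],
    "std": [29.323091, 24.049397],
})

# Interval covered by each rectangle: mean - std/2 up to mean + std/2
df["limInf"] = df["MeanBPM"] - df["std"] / 2
df["limSup"] = df["MeanBPM"] + df["std"] / 2

fig, ax = plt.subplots()
for r in df.itertuples(index=False):
    ax.add_patch(ptc.Rectangle(xy=(r.slotID, r.limInf), width=1,
                               height=r.limSup - r.limInf, alpha=0.5))
ax.set_xlim(df["slotID"].min(), df["slotID"].max() + 1)
ax.set_ylim(df["limInf"].min(), df["limSup"].max())

# The patch itself also reports its vertical extent in data units
first = ax.patches[0]
lower, upper = first.get_y(), first.get_y() + first.get_height()
```

If the rectangle should instead span one full standard deviation on each side of the mean, drop the division by 2.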
I'm looking for a way to plot side-by-side stacked barplots to compare the host composition of positive (Condition==True) and total cases in each country from my dataframe.
Here is a sample of the DataFrame.
id Location Host genus_name #ofGenes Condition
1 Netherlands Homo sapiens Escherichia 4.0 True
2 Missing Missing Klebsiella 3.0 True
3 Missing Missing Aeromonas 2.0 True
4 Missing Missing Glaciecola 2.0 True
5 Antarctica Missing Alteromonas 2.0 True
6 Indian Ocean Missing Epibacterium 2.0 True
7 Missing Missing Klebsiella 2.0 True
8 China Homo sapiens Escherichia 0 False
9 Missing Missing Escherichia 2.0 True
10 China Plantae kingdom Pantoea 0 False
11 China Missing Escherichia 2.0 True
12 Pacific Ocean Missing Halomonas 0 False
I need something similar to the image below, but I want to plot percentages.
Can anyone help me?
I guess what you want is a stacked categorical bar plot, which cannot be directly plotted using seaborn. But you can achieve it by customizing one.
Import some necessary packages.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
Read the dataset. Since your sample data is too small, I randomly generate some more to make the plot look good.
def gen_fake_data(data, size=400):
    unique_values = []
    for c in data.columns:
        unique_values.append(data[c].unique())
    new_data = pd.DataFrame({c: np.random.choice(unique_values[i], size=size)
                             for i, c in enumerate(data.columns)})
    new_data = pd.concat([data, new_data])
    new_data['id'] = new_data.index + 1
    return new_data
data = pd.read_csv('data.csv')
new_data = gen_fake_data(data)
Define the stacked categorical bar plot
def stack_catplot(x, y, cat, stack, data, palette=sns.color_palette('Reds')):
    ax = plt.gca()
    # pivot the data based on categories and stacks
    df = data.pivot_table(values=y, index=[cat, x], columns=stack,
                          dropna=False, aggfunc='sum').fillna(0)
    ncat = data[cat].nunique()
    nx = data[x].nunique()
    nstack = data[stack].nunique()
    range_x = np.arange(nx)
    width = 0.8 / ncat # width of each bar

    for i, c in enumerate(data[cat].unique()):
        # iterate over categories, i.e., Conditions
        # calculate the location of each bar
        loc_x = (0.5 + i - ncat / 2) * width + range_x
        bottom = 0
        for j, s in enumerate(data[stack].unique()):
            # iterate over stacks, i.e., Hosts
            # obtain the height of each stack of a bar
            height = df.loc[c][s].values
            # plot the bar, you can customize the color yourself
            ax.bar(x=loc_x, height=height, bottom=bottom, width=width,
                   color=palette[j + i * nstack], zorder=10)
            # change the bottom attribute to achieve a stacked barplot
            bottom += height
    # make xlabel
    ax.set_xticks(range_x)
    ax.set_xticklabels(data[x].unique(), rotation=45)
    ax.set_ylabel(y)
    # make legend
    plt.legend([Patch(facecolor=palette[i]) for i in range(ncat * nstack)],
               [f"{c}: {s}" for c in data[cat].unique() for s in data[stack].unique()],
               bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
    plt.grid()
    plt.show()
Let's plot!
plt.figure(figsize=(6, 3), dpi=300)
stack_catplot(x='Location', y='#ofGenes', cat='Condition', stack='Host', data=new_data)
If you want to plot percentages, calculate them in the raw dataset.
total_genes = new_data.groupby(['Location', 'Condition'], as_index=False)['#ofGenes'].sum().rename(
columns={'#ofGenes': 'TotalGenes'})
new_data = new_data.merge(total_genes, how='left')
new_data['%ofGenes'] = new_data['#ofGenes'] / new_data['TotalGenes'] * 100
plt.figure(figsize=(6, 3), dpi=300)
stack_catplot(x='Location', y='%ofGenes', cat='Condition', stack='Host', data=new_data)
You didn't specify how you would like to stack the bars, but you should be able to do something like this...
df = pd.read_csv('data.csv')
agg_df = df.pivot_table(index='Location', columns='Host', values='Condition', aggfunc='count')
agg_df.plot(kind='bar', stacked=True)
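For the percentage version with plain pandas, one possibility is `pd.crosstab` with `normalize='index'`, which turns counts into row-wise shares. The sketch below is not the asker's exact layout: it counts cases rather than genes, uses a handful of made-up rows in the shape of the sample, and puts total and positive cases in two panels instead of paired bars.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# A few hypothetical rows in the shape of the question's sample
df = pd.DataFrame({
    "Location": ["Netherlands", "China", "China", "China", "Antarctica"],
    "Host": ["Homo sapiens", "Homo sapiens", "Plantae kingdom",
             "Missing", "Missing"],
    "Condition": [True, False, False, True, True],
})

# Share of each host per location, over all cases ...
total_pct = pd.crosstab(df["Location"], df["Host"], normalize="index") * 100
# ... and over positive cases only
positive = df[df["Condition"]]
positive_pct = pd.crosstab(positive["Location"], positive["Host"],
                           normalize="index") * 100

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
total_pct.plot(kind="bar", stacked=True, ax=axes[0], title="All cases")
positive_pct.plot(kind="bar", stacked=True, ax=axes[1],
                  title="Condition == True")
axes[0].set_ylabel("% of cases")
plt.tight_layout()
```

Every bar sums to 100, so the two panels compare composition rather than absolute counts.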
Is there a way I can get a size-frequency histogram for a population under different scenarios on specific days in Python, showing means with error bars?
My data are in a format similar to this table:
SCENARIO RUN MEAN DAY
A 1 25 10
A 1 15 30
A 2 20 10
A 2 27 30
B 1 45 10
B 1 50 30
B 2 43 10
B 2 35 30
results_data.groupby(['Scenario', 'Run']).mean() does not give me the days I want to visualize the data by
it returns the mean on the days in each run.
Use seaborn.FacetGrid
FacetGrid is a multi-plot grid for plotting conditional relationships
Map seaborn.distplot onto the FacetGrid and use hue=DAY.
Setup Data and DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random # just for test data
import numpy as np # just for test data
# data
random.seed(365)
np.random.seed(365)
data = {'MEAN': [np.random.randint(20, 51) for _ in range(500)],
'SCENARIO': [random.choice(['A', 'B']) for _ in range(500)],
'DAY': [random.choice([10, 30]) for _ in range(500)],
'RUN': [random.choice([1, 2]) for _ in range(500)]}
# create dataframe
df = pd.DataFrame(data)
Plot with kde=False
g = sns.FacetGrid(df, col='RUN', row='SCENARIO', hue='DAY', height=5)
g = g.map(sns.distplot, 'MEAN', bins=range(20, 51, 5), kde=False, hist_kws=dict(edgecolor="k", linewidth=1)).add_legend()
plt.show()
Plot with kde=True
g = sns.FacetGrid(df, col='RUN', row='SCENARIO', hue='DAY', height=5, palette='GnBu')
g = g.map(sns.distplot, 'MEAN', bins=range(20, 51, 5), kde=True, hist_kws=dict(edgecolor="k", linewidth=1)).add_legend()
plt.show()
Plots with error bars
Using how to add error bars to histogram diagram in python
Using df from above
Use matplotlib.pyplot.errorbar to plot the error bars on the histogram.
from itertools import product
# create unique combinations for filtering df
scenarios = df.SCENARIO.unique()
runs = df.RUN.unique()
days = df.DAY.unique()
combo_list = [scenarios, runs, days]
results = list(product(*combo_list))
# plot
for i, result in enumerate(results, 1):  # iterate through each set of combinations
    s, r, d = result
    data = df[(df.SCENARIO == s) & (df.RUN == r) & (df.DAY == d)]  # filter dataframe
    # add subplot rows, columns; needs to equal the number of combinations in results
    plt.subplot(2, 4, i)
    # plot hist and unpack values
    n, bins, _ = plt.hist(x='MEAN', bins=range(20, 51, 5), data=data, color='g')
    # calculate bin centers
    bin_centers = 0.5 * (bins[:-1] + bins[1:])
    # draw error bars, use the sqrt error. You can use what you want there
    # poissonian 1 sigma intervals would make more sense
    plt.errorbar(bin_centers, n, yerr=np.sqrt(n), fmt='k.')
    plt.title(f'Scenario: {s} | Run: {r} | Day: {d}')
plt.tight_layout()
plt.show()
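If the goal is the mean per scenario and day with error bars across runs, grouping by SCENARIO and DAY (instead of SCENARIO and RUN) keeps the days visible. The sketch below uses the table from the question; taking the standard deviation across runs as the error bar is an assumption about what the error should represent.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

results_data = pd.DataFrame({
    "SCENARIO": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "RUN": [1, 1, 2, 2, 1, 1, 2, 2],
    "MEAN": [25, 15, 20, 27, 45, 50, 43, 35],
    "DAY": [10, 30, 10, 30, 10, 30, 10, 30],
})

# Group by scenario and day so that the runs are what gets averaged
summary = (results_data.groupby(["SCENARIO", "DAY"])["MEAN"]
           .agg(["mean", "std"]))

# Scenarios on the x axis, one bar per day, std across runs as error bars
means = summary["mean"].unstack("DAY")
stds = summary["std"].unstack("DAY")
ax = means.plot(kind="bar", yerr=stds, capsize=4)
ax.set_ylabel("MEAN")
plt.tight_layout()
```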
I'm trying to add a bar-plot (stacked or otherwise) for each row in a seaborn clustermap.
Let's say that I have a dataframe like this:
import pandas as pd
import numpy as np
import random
df = pd.DataFrame(np.random.randint(0,100,size=(100, 8)), columns=["heatMap_1","heatMap_2","heatMap_3","heatMap_4","heatMap_5", "barPlot_1","barPlot_1","barPlot_1"])
df['index'] = [ random.randint(1,10000000) for k in df.index]
df.set_index('index', inplace=True)
df.head()
heatMap_1 heatMap_2 heatMap_3 heatMap_4 heatMap_5 barPlot_1 barPlot_1 barPlot_1
index
4552288 9 3 54 37 23 42 94 31
6915023 7 47 59 92 70 96 39 59
2988122 91 29 59 79 68 64 55 5
5060540 68 80 25 95 80 58 72 57
2901025 86 63 36 8 33 17 79 86
I can use the first 5 columns (in this example starting with the prefix heatMap_) to create a seaborn clustermap using this (or the seaborn equivalent):
sns.clustermap(df.iloc[:,0:5], )
and the stacked barplot for the last three columns (in this example starting with the prefix barPlot_) using this:
df.iloc[:,5:8].plot(kind='bar', stacked=True)
but I'm a bit confused about how to merge both plot types. I understand that clustermap creates its own figure, and I'm not sure if I can extract just the heatmap from clustermap and then use it with subfigures. (Discussed here: Adding seaborn clustermap to figure with other plots). This creates a weird output.
Edit:
Using this:
import pandas as pd
import numpy as np
import random
import seaborn as sns; sns.set(color_codes=True)
import matplotlib.pyplot as plt
import matplotlib.gridspec
df = pd.DataFrame(np.random.randint(0,100,size=(100, 8)), columns=["heatMap_1","heatMap_2","heatMap_3","heatMap_4","heatMap_5", "barPlot_1","barPlot_2","barPlot_3"])
df['index'] = [ random.randint(1,10000000) for k in df.index]
df.set_index('index', inplace=True)
g = sns.clustermap(df.iloc[:,0:5], )
g.gs.update(left=0.05, right=0.45)
gs2 = matplotlib.gridspec.GridSpec(1,1, left=0.6)
ax2 = g.fig.add_subplot(gs2[0])
df.iloc[:,5:8].plot(kind='barh', stacked=True, ax=ax2)
creates this:
which does not really match well (i.e. due to dendrograms there is a shift).
Another option is to manually perform the clustering and create a matplotlib heatmap, and then add associated subfigures like barplots (discussed here: How to get flat clustering corresponding to color clusters in the dendrogram created by scipy).
Is there a way I can use clustermap as a subplot along with other plots ?
This is the result I'm looking for[1]:
While not a proper answer, I decided to break it down and do everything manually.
Taking inspiration from answer here, I decided to cluster and reorder the heatmap separately:
# Assumed imports: the function uses the aliases sch and dist without defining them
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as dist

def heatMapCluter(df):
    row_method = "ward"
    column_method = "ward"
    row_metric = "euclidean"
    column_metric = "euclidean"

    if column_method == "ward":
        d2 = dist.pdist(df.transpose())
        D2 = dist.squareform(d2)
        Y2 = sch.linkage(D2, method=column_method, metric=column_metric)
        Z2 = sch.dendrogram(Y2, no_plot=True)
        ind2 = sch.fcluster(Y2, 0.7 * max(Y2[:, 2]), "distance")
        idx2 = Z2["leaves"]
        df = df.iloc[:, idx2]
        ind2 = ind2[idx2]
    else:
        idx2 = range(df.shape[1])

    if row_method:
        d1 = dist.pdist(df)
        D1 = dist.squareform(d1)
        Y1 = sch.linkage(D1, method=row_method, metric=row_metric)
        Z1 = sch.dendrogram(Y1, orientation="right", no_plot=True)
        ind1 = sch.fcluster(Y1, 0.7 * max(Y1[:, 2]), "distance")
        idx1 = Z1["leaves"]
        df = df.iloc[idx1, :]
        ind1 = ind1[idx1]
    else:
        idx1 = range(df.shape[0])

    return df
Rearranged the original dataframe:
clusteredHeatmap = heatMapCluter(df.iloc[:, 0:5].copy())
# Extract the "barplot" rows and merge them
clusteredDataframe = df.reindex(list(clusteredHeatmap.index.values))
clusteredDataframe = clusteredDataframe.reindex(
    list(clusteredHeatmap.columns.values)
    + list(df.iloc[:, 5:8].columns.values),
    axis=1,
)
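The cluster-then-reorder step can also be sketched more compactly with `scipy.cluster.hierarchy.leaves_list`, which returns the dendrogram leaf order directly. This is an illustrative variant, not the function above: it applies Ward linkage to the raw rows and columns instead of to a squareform distance matrix, so the resulting order may differ, and the random frame is made up.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

# Hypothetical frame: five heat-map columns and three bar-plot columns
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(20, 8)),
    columns=[f"heatMap_{i}" for i in range(1, 6)]
    + [f"barPlot_{i}" for i in range(1, 4)],
)

heat = df.iloc[:, 0:5]
# Ward linkage on the raw observations; leaves_list gives the dendrogram order
row_order = sch.leaves_list(sch.linkage(heat.to_numpy(), method="ward"))
col_order = sch.leaves_list(sch.linkage(heat.to_numpy().T, method="ward"))

# Reorder heat-map rows and columns; the bar-plot columns follow the same rows
clustered = pd.concat(
    [heat.iloc[row_order, col_order], df.iloc[row_order, 5:8]], axis=1)
```

The `clustered` frame plays the same role as `clusteredDataframe` in the plotting step.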
and then used the gridspec to plot both "subfigures" (clustermap and barplot):
# Now let's plot this - first the heatmap and then the barplot.
# Since it is a "two" part plot which shares the same axis, it is
# better to use gridspec
from matplotlib.gridspec import GridSpec  # assumed import; GridSpec is used unqualified below

fig = plt.figure(figsize=(12, 12))
gs = GridSpec(3, 3)
gs.update(wspace=0.015, hspace=0.05)
ax_main = plt.subplot(gs[0:3, :2])
ax_yDist = plt.subplot(gs[0:3, 2], sharey=ax_main)
im = ax_main.imshow(
    clusteredDataframe.iloc[:, 0:5],
    cmap="Greens",
    interpolation="nearest",
    aspect="auto",
)
clusteredDataframe.iloc[:, 5:8].plot(
    kind="barh", stacked=True, ax=ax_yDist, sharey=True
)
ax_yDist.spines["right"].set_color("none")
ax_yDist.spines["top"].set_color("none")
ax_yDist.spines["left"].set_visible(False)
ax_yDist.xaxis.set_ticks_position("bottom")
ax_yDist.set_xlim([0, 100])
ax_yDist.set_yticks([])
ax_yDist.xaxis.grid(False)
ax_yDist.yaxis.grid(False)
Jupyter notebook: https://gist.github.com/siddharthst/2a8b7028d18935860062ac7379b9279f
Image:
1 - http://code.activestate.com/recipes/578175-hierarchical-clustering-heatmap-python/
This is my first time drawing bar charts in python.
My df op:
key descript score
0 noodles taste 5
1 noodles color -2
2 noodles health 3
3 apple color 7
4 apple hard 9
My code:
import matplotlib.pyplot as plt
op['positive'] = op['score'] > 0
op['score'].plot(kind='barh', color=op.positive.map({True: 'r', False: 'k'}), use_index=True)
plt.show()
plt.savefig('sample1.png')
Output:
But this is not what I expected. I would like to draw two charts by different keys in this case with index and maybe use different colors like below:
How can I accomplish this?
Try:
fig, ax = plt.subplots(1,op.key.nunique(), figsize=(15,5), sharex=True)
i = 0
#Fix some data issues/typos
op['key']=op.key.str.replace('noodels','noodles')

for n, g in op.assign(positive=op['score'] >= 0).groupby('key'):
    g.plot.barh(y='score', x='descript', ax=ax[i], color=g['positive'].map({True:'red',False:'blue'}), legend=False)\
        .set_xlabel(n)
    ax[i].set_ylabel('Score')
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
    ax[i].spines['left'].set_position('zero')
    i += 1
Output:
Update: added moving of labels for the y-axis. Thanks to this SO solution by @ImportanceOfBeingErnest.
fig, ax = plt.subplots(1,op.key.nunique(), figsize=(15,5), sharex=True)
i = 0
#Fix some data issues/typos
op['key']=op.key.str.replace('noodels','noodles')

for n, g in op.assign(positive=op['score'] >= 0).groupby('key'):
    g.plot.barh(y='score', x='descript', ax=ax[i], color=g['positive'].map({True:'red',False:'blue'}), legend=False)\
        .set_xlabel(n)
    ax[i].set_ylabel('Score')
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
    ax[i].spines['left'].set_position('zero')
    plt.setp(ax[i].get_yticklabels(), transform=ax[i].get_yaxis_transform())
    i += 1
Output:
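The same per-key layout can also be sketched with plain matplotlib calls. This variant is an illustration, not the answer's code: it builds the frame from the question's table and passes `squeeze=False`, so indexing the axes also works when there is only one key.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

op = pd.DataFrame({
    "key": ["noodles", "noodles", "noodles", "apple", "apple"],
    "descript": ["taste", "color", "health", "color", "hard"],
    "score": [5, -2, 3, 7, 9],
})

groups = list(op.groupby("key"))
# squeeze=False keeps axes two-dimensional even when there is only one key
fig, axes = plt.subplots(1, len(groups), figsize=(15, 5), sharex=True,
                         squeeze=False)

for ax, (key, g) in zip(axes[0], groups):
    # red for non-negative scores, blue for negative ones
    colors = ["red" if s >= 0 else "blue" for s in g["score"]]
    ax.barh(g["descript"], g["score"], color=colors)
    ax.set_xlabel(key)
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
```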