How can I group a stacked bar chart? - python

I'm trying to create a grouped, stacked bar chart.
Currently I have the following DataFrame:
>>> df
Value
Rating 1 2 3
Context Parameter
Total 1 43.312347 9.507902 1.580367
2 42.862649 9.482205 1.310549
3 43.710651 9.430811 1.400488
4 43.209559 9.803418 1.349094
5 42.541436 10.008994 1.220609
6 42.978286 9.430811 1.336246
7 42.734164 10.317358 1.606064
User 1 47.652348 11.138861 2.297702
2 47.102897 10.589411 1.848152
3 46.853147 10.139860 1.848152
4 47.252747 11.138861 1.748252
5 45.954046 10.239760 1.448551
6 46.353646 10.439560 1.498501
7 47.102897 11.338661 1.998002
I'd like to have for each Parameter the bars for Total and User grouped together.
This is the resulting chart with df.plot(kind='bar', stacked=True):
The bars themselves look right, but how do I get the bars for Total and User next to each other for each Parameter, ideally with some margin between the parameters?

The following approach allows grouped and stacked bars at the same time.
First the dataframe is sorted by parameter, context. Then the context is unstacked from the index, creating new columns for every context, value pair.
Finally, three bar plots are drawn over each other to visualize the stacked bars.
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame(columns=['Context', 'Parameter', 'Val1', 'Val2', 'Val3'],
data=[['Total', 1, 43.312347, 9.507902, 1.580367],
['Total', 2, 42.862649, 9.482205, 1.310549],
['Total', 3, 43.710651, 9.430811, 1.400488],
['Total', 4, 43.209559, 9.803418, 1.349094],
['Total', 5, 42.541436, 10.008994, 1.220609],
['Total', 6, 42.978286, 9.430811, 1.336246],
['Total', 7, 42.734164, 10.317358, 1.606064],
['User', 1, 47.652348, 11.138861, 2.297702],
['User', 2, 47.102897, 10.589411, 1.848152],
['User', 3, 46.853147, 10.139860, 1.848152],
['User', 4, 47.252747, 11.138861, 1.748252],
['User', 5, 45.954046, 10.239760, 1.448551],
['User', 6, 46.353646, 10.439560, 1.498501],
['User', 7, 47.102897, 11.338661, 1.998002]])
df.set_index(['Context', 'Parameter'], inplace=True)
df0 = df.reorder_levels(['Parameter', 'Context']).sort_index()
colors = plt.cm.Paired.colors
df0 = df0.unstack(level=-1) # unstack the 'Context' column
fig, ax = plt.subplots()
(df0['Val1']+df0['Val2']+df0['Val3']).plot(kind='bar', color=[colors[1], colors[0]], rot=0, ax=ax)
(df0['Val2']+df0['Val3']).plot(kind='bar', color=[colors[3], colors[2]], rot=0, ax=ax)
df0['Val3'].plot(kind='bar', color=[colors[5], colors[4]], rot=0, ax=ax)
legend_labels = [f'{val} ({context})' for val, context in df0.columns]
ax.legend(legend_labels)
plt.tight_layout()
plt.show()
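If more control over the spacing is needed (the question asks for a margin between the parameters), one alternative that is not part of the answer above is to place the bars manually with matplotlib's ax.bar and its bottom argument. The following is only a minimal sketch on a hypothetical two-parameter subset of the data; the bar width and the offset formula are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical two-parameter subset of the question's data (rounded values)
df = pd.DataFrame(
    columns=["Context", "Parameter", "Val1", "Val2", "Val3"],
    data=[
        ["Total", 1, 43.3, 9.5, 1.6],
        ["Total", 2, 42.9, 9.5, 1.3],
        ["User", 1, 47.7, 11.1, 2.3],
        ["User", 2, 47.1, 10.6, 1.8],
    ],
)

contexts = ["Total", "User"]
value_cols = ["Val1", "Val2", "Val3"]
params = sorted(df["Parameter"].unique())
x = np.arange(len(params))
width = 0.35  # two bars per group; 2 * width < 1 leaves a gap between parameters

fig, ax = plt.subplots()
for i, context in enumerate(contexts):
    sub = df[df["Context"] == context].set_index("Parameter").loc[params]
    bottom = np.zeros(len(params))  # running top of the stack for this context
    for col in value_cols:
        heights = sub[col].to_numpy()
        # shift each context left or right of the group centre
        ax.bar(x + (i - 0.5) * width, heights, width,
               bottom=bottom, label=f"{col} ({context})")
        bottom += heights

ax.set_xticks(x)
ax.set_xticklabels(params)
ax.legend()
```

Reducing width widens the margin between parameter groups, since each group occupies 2 * width of the unit spacing between ticks.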

Here's a way to do it:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
# reshape your data - ensure no index is set initially
df1 = (df
       .set_index(['Parameter','Context'])
       .stack()
       .reset_index()
       .drop(columns='level_2')
       .rename(columns={0:'value'}))
print(df1.head())
Parameter Context value
0 1 Total 43.312347
1 1 Total 9.507902
2 1 Total 1.580367
3 2 Total 42.862649
4 2 Total 9.482205
sns.barplot(x = 'Parameter',
y = 'value',
hue='Context',
data=df1,
errwidth=0.1)
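Note that sns.barplot aggregates the rows sharing the same x and hue (by default with the mean), so the plot above is grouped but not stacked. The sketch below, on a hypothetical two-parameter subset of the data, shows an equivalent reshape with melt and the heights that the default estimator would produce; the column names Rating and value are assumptions.

```python
import pandas as pd

# Hypothetical two-parameter subset of the question's data, no index set
df = pd.DataFrame(
    columns=["Context", "Parameter", "Val1", "Val2", "Val3"],
    data=[
        ["Total", 1, 43.3, 9.5, 1.6],
        ["Total", 2, 42.9, 9.5, 1.3],
        ["User", 1, 47.7, 11.1, 2.3],
        ["User", 2, 47.1, 10.6, 1.8],
    ],
)

# Wide -> long: one row per (Parameter, Context, value column)
long_df = df.melt(id_vars=["Parameter", "Context"],
                  var_name="Rating", value_name="value")

# Mean of the three values per group, i.e. what the default estimator computes
bar_heights = long_df.groupby(["Parameter", "Context"])["value"].mean()
print(bar_heights)
```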

Related

Divide two columns in pivot table and plot grouped bar chart with pandas

I have a dataset that looks like this:
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
Vintage Model Count Case
0 2016Q1 A 1 0
1 2016Q1 A 1 1
2 2016Q2 A 1 1
3 2016Q3 A 1 0
4 2016Q4 A 1 1
5 2016Q1 B 1 1
6 2016Q2 B 1 0
7 2016Q2 B 1 0
8 2016Q2 B 1 1
9 2016Q3 B 1 1
10 2016Q4 B 1 0
What I need to do is:
Plot grouped bar chart, where vintage is the groups and model is the hue/color
Two line plots in the same chart that show the percentage of case over count, aka plot the division of case over count for each model and vintage.
I figured out how to do the first task with a pivot table but haven't been able to add the percentage from the same pivot.
This is the solution for point 1:
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
I tried dividing between columns in the pivot table but it's not the right format to plot.
How can I do the percentage calculation and the line plots without creating a different table?
Could the whole task be done with groupby instead? (as I find it easier to use in general)
Here's a solution using the seaborn plotting library; I'm not sure whether it's OK for you to use it for your problem:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
agg_df = df.groupby(['Vintage','Model']).sum().reset_index()
agg_df['Fraction'] = agg_df['Case']/agg_df['Count']
sns.barplot(
x = 'Vintage',
y = 'Count',
hue = 'Model',
alpha = 0.5,
data = agg_df,
)
sns.lineplot(
x = 'Vintage',
y = 'Fraction',
hue = 'Model',
marker = 'o',
legend = False,
data = agg_df,
)
plt.show()
plt.close()
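The question also asks whether the percentage can come out of the same pivot table. As a minimal sketch (not part of the answer above), pivot_table can aggregate both Count and Case at once, after which the two sub-tables can be divided element-wise; the name pct is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Vintage': ['2016Q1', '2016Q1', '2016Q2', '2016Q3', '2016Q4',
                '2016Q1', '2016Q2', '2016Q2', '2016Q2', '2016Q3', '2016Q4'],
    'Model': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Case': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})

# One pivot holding both sums; the columns become a (value, Model) MultiIndex
dfp = df.pivot_table(index='Vintage', columns='Model',
                     values=['Count', 'Case'], aggfunc='sum')

# Divide the two sub-tables; they align on Vintage (rows) and Model (columns)
pct = dfp['Case'].div(dfp['Count']).mul(100).round(2)
print(pct)
```

dfp['Count'] can then feed the bar plot and pct the line plot, both indexed by Vintage with one column per Model.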
IIUC you want the lines to be drawn on the same plot. I'd recommend creating a new y-axis after computing the division from the original df. Then you can plot the lines with seaborn:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
ax1 = dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
dfd = df.groupby(["Vintage", "Model"]).sum() \
.assign(div_pct=lambda x:100*x["Case"]/x["Count"]) \
.reset_index()
ax2 = ax1.twinx() # creating a second y axis
sns.lineplot(data=dfd, x="Vintage", y="div_pct", hue="Model", style="Model", ax=ax2, markers=True, dashes=False)
plt.show()
Output:

Create a stacked bar plot and annotate with count and percent

I have the following dataframe
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# 3.5.3
df=pd.DataFrame({'Type': [ 'Sentence', 'Array', 'String', '-','-', 'Sentence', 'Array', 'String', '-','-', 'Sentence'],
'Length': [42,21,11,6,6,42,21,11,6,6,42],
'label': [1,1,0,0,0,1,1,0,0,0,1],
})
print(df)
# Type Length label
#0 Sentence 42 1
#1 Array 21 1
#2 String 11 0
#3 - 6 0
#4 - 6 0
#5 Sentence 42 1
#6 Array 21 1
#7 String 11 0
#8 - 6 0
#9 - 6 0
#10 Sentence 42 1
I want to plot a stacked bar chart for an arbitrary column of the dataframe (either numerical, e.g. the Length column, or categorical, e.g. the Type column), stacked with respect to the label column and annotated with both count and percentage, so that small values of rare observations are also displayed. The following script gives me the wrong results:
ax = df.plot.bar(stacked=True)
#ax = df[["Type","label"]].plot.bar(stacked=True)
#ax = df.groupby('Type').size().plot(kind='bar', stacked=True)
ax.legend(["0: Normal", "1: Anomaly"])
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2,
            y+height/2,
            '{:.0f} %'.format(height),
            horizontalalignment='center',
            verticalalignment='center')
I can imagine that somehow I need to calculate the counts of the selected column with respect to label column:
## counts will be used for the labels
counts = df.apply(lambda x: x.value_counts())
## percents will be used to determine the height of each bar
percents = counts.div(counts.sum(axis=1), axis=0)
So far I have tried a solution inspired by this post, using df.groupby(['selected column', 'label']), without success: I get TypeError: unsupported operand type(s) for +: 'int' and 'str' for total = sum(dff.sum()), and I can't figure out whether the problem is in the indexing or in the DataFrame transformation.
BTW, I collected all possible solutions in this Google Colab Notebook; nevertheless, I couldn't find a straightforward way to adapt them to my dataframe with Matplotlib, so I'm also looking for an elegant way of using Seaborn or Plotly.
df = df.groupby(["Type","label"]).count()
#dfp_Type = df.pivot_table(index='Type', columns='label', values= 'Length', aggfunc='mean')
dfp_Type = df.pivot_table(index='Type', columns='label', values= df.Type.size(), aggfunc='mean')
#dfp_Length = df.pivot_table(index='Length', columns='label', values= df.Length.size(), aggfunc='mean')
ax = dfp_Type.plot(kind='bar', stacked=True, rot=0)
# iterate through each bar container
for c in ax.containers:
    labels = [v.get_height() if v.get_height() > 0 else '' for v in c]
    # add the annotations
    ax.bar_label(c, fmt='%0.0f%%', label_type='center')
# move the legend
ax.legend(title='Class', bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()
output:
Expected output:
The values in Expected output do not match df in the OP, so the sample DataFrame has been updated.
Plot with pandas.DataFrame.plot, using kind='bar' and stacked=True. pandas uses and imports matplotlib as the default plotting backend, so there's no need to import other plotting libraries.
Resources:
How to aggregate unique count with pandas pivot_table for details about using aggfunc=len in .pivot_table.
How to add value labels on a bar chart for details and examples about .bar_label.
How to add multiple annotations to a bar plot & How to create and annotate a stacked proportional bar chart for adding count and percent to a bar plot.
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1
import pandas as pd
# sample dataframe
df = pd.DataFrame({'Type': [ 'Sentence', 'Array', 'String', '-','-', 'Sentence', 'Array', 'String', '-','-', 'Sentence'],
'Length': [42, 21, 11, 6, 6, 42, 21, 11, 6, 6, 42],
'label': [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
# pivot the dataframe and get len
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)
# get the total for each row
total = dfp.sum(axis=1)
# calculate the percent for each row
per = dfp.div(total, axis=0).mul(100).round(2)
# plot the pivoted dataframe
ax = dfp.plot(kind='bar', stacked=True, figsize=(10, 8), rot=0)
# set the colors for each Class
segment_colors = {'0': 'white', '1': 'black'}
# iterate through the containers
for c in ax.containers:
    # get the current segment label (a string); corresponds to column / legend
    label = c.get_label()
    # create custom labels with the bar height and the percent from the per column
    # the column labels in per and dfp are int, so convert label to int
    labels = [f'{v.get_height()}\n({row}%)' if v.get_height() > 0 else '' for v, row in zip(c, per[int(label)])]
    # add the annotation
    ax.bar_label(c, labels=labels, label_type='center', fontweight='bold', color=segment_colors[label])
# move the legend
_ = ax.legend(title='Class', bbox_to_anchor=(1, 1.01), loc='upper left')
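As an alternative to pivot_table with aggfunc=len, the counts and the row percentages can also be obtained with pd.crosstab. This is only a sketch, not the method used above; one difference is that crosstab fills missing combinations with 0 instead of NaN.

```python
import pandas as pd

df = pd.DataFrame({'Type': ['Sentence', 'Array', 'String', '-', '-', 'Sentence',
                            'Array', 'String', '-', '-', 'Sentence'],
                   'Length': [42, 21, 11, 6, 6, 42, 21, 11, 6, 6, 42],
                   'label': [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})

# count of rows per (Type, label) combination; absent combinations become 0
counts = pd.crosstab(df['Type'], df['label'])

# normalize='index' divides each row by its row total
per = pd.crosstab(df['Type'], df['label'], normalize='index').mul(100).round(2)

print(counts)
print(per)
```

Because the labelling loop above skips bars whose height is not greater than 0, the zero-filled cells would simply stay unlabelled.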
Comment Updates
How to always have a spot for 'Array' if it's not in the data:
Add 'Array' to dfp if it's not in dfp.index.
df.Type = pd.Categorical(df.Type, ['-', 'Array', 'Sentence', 'String'], ordered=True) does not ensure the missing categories are plotted.
How to have all the annotations, even if they're small:
Don't stack the bars, and set logy=True.
This uses the full-data, which was provided in a link.
import numpy as np
# pivot the dataframe and get len
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)
# append Array if it's not included
if 'Array' not in dfp.index:
    dfp = pd.concat([dfp, pd.DataFrame({0: [np.nan], 1: [np.nan]}, index=['Array'])])
# order the index
dfp = dfp.loc[['-', 'Array', 'Sentence', 'String'], :]
# calculate the percent for each row
per = dfp.div(dfp.sum(axis=1), axis=0).mul(100).round(2)
# plot the pivoted dataframe
ax = dfp.plot(kind='bar', stacked=False, figsize=(10, 8), rot=0, logy=True, width=0.75)
# iterate through the containers
for c in ax.containers:
    # get the current segment label (a string); corresponds to column / legend
    label = c.get_label()
    # create custom labels with the bar height and the percent from the per column
    # the column labels in per and dfp are int, so convert label to int
    labels = [f'{v.get_height()}\n({row}%)' if v.get_height() > 0 else '' for v, row in zip(c, per[int(label)])]
    # add the annotation
    ax.bar_label(c, labels=labels, label_type='edge', fontsize=10, fontweight='bold')
# move the legend
ax.legend(title='Class', bbox_to_anchor=(1, 1.01), loc='upper left')
# pad the spacing between the number and the edge of the figure
_ = ax.margins(y=0.1)
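A shorter way to guarantee a row for every expected Type, assuming the full list of types is known in advance, is DataFrame.reindex, which adds the missing labels as NaN rows and orders the index in one step. The data and the name expected_types below are hypothetical.

```python
import pandas as pd

# Hypothetical data in which 'Array' never occurs
df = pd.DataFrame({'Type': ['Sentence', 'String', '-', '-', 'Sentence'],
                   'Length': [42, 11, 6, 6, 42],
                   'label': [1, 0, 0, 1, 0]})

dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)

# reindex inserts an all-NaN row for 'Array' and fixes the row order
expected_types = ['-', 'Array', 'Sentence', 'String']
dfp = dfp.reindex(expected_types)
print(dfp)
```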
DataFrame Views
Based on the sample data in the OP
df
Type Length label
0 Sentence 42 1
1 Array 21 1
2 String 11 0
3 - 6 0
4 - 6 0
5 Sentence 42 1
6 Array 21 1
7 String 11 0
8 - 6 0
9 - 6 1
10 Sentence 42 0
dfp
label 0 1
Type
- 3.0 1.0
Array NaN 2.0
Sentence 1.0 2.0
String 2.0 NaN
total
Type
- 4.0
Array 2.0
Sentence 3.0
String 2.0
dtype: float64
per
label 0 1
Type
- 75.00 25.00
Array NaN 100.00
Sentence 33.33 66.67
String 100.00 NaN
I slightly adjusted the data so the graph would look identical to yours (e.g., Type '-' has three 0 labels and one 1 label).
df
###
Type Length label
0 Sentence 42 1
1 Array 21 1
2 String 11 0
3 - 6 0
4 - 6 0
5 Sentence 42 1
6 Array 21 1
7 String 11 0
8 - 6 0
9 - 6 1
10 Sentence 42 0
df_plot = df.groupby(['Type','label']).size().reset_index()
df_plot.columns = ['Type', 'Class', 'count']
df_plot = df_plot.astype({'Class':'str'})
df_plot['percentage'] = df.groupby(['Type','label']).size().groupby(level=0).apply(lambda x: 100*x/float(x.sum())).values.round(2).astype(str)
df_plot['percentage'] = "(" + df_plot['percentage'] + '%)'
df_plot
###
Type Class count percentage
0 - 0 3 (75.0%)
1 - 1 1 (25.0%)
2 Array 1 2 (100.0%)
3 Sentence 0 1 (33.33%)
4 Sentence 1 2 (66.67%)
5 String 0 2 (100.0%)
import plotly.express as px
fig = px.bar(df_plot,
x='Type',
y='count',
color='Class',
text=df_plot['count'].astype(str) + "<br>" + df_plot['percentage'],
width=550,
height=400,
category_orders={'Type':['-','Array','Sentence','String']},
template='plotly_white',
log_y=True
)
fig.show('browser')
With your CSV file, I followed the same transformation steps to build df_plot2.
Because the counts for Class 0 and Class 1 differ hugely, a stacked bar chart (the default setting) won't give you a distinguishable result, so we can use barmode='group' instead:
fig2 = px.bar(df_plot2,
barmode='group',
x='Type',
y='count',
color='Class',
color_discrete_map={'0':'#5DA597', '1':'#FFC851'},
text=df_plot2['count'].astype(str) + "<br>" + df_plot2['percentage'],
width=850,
height=800,
category_orders={'Type': ['-', 'Array', 'Sentence', 'String']},
template='plotly_white',
log_y=True,
)
fig2.update_yaxes(dtick=1)
fig2.show('browser')

Weird shifting of boxplot in pandas boxplot combining it with seaborn pointplot - what is going on?

Imagine I have the following dataframes
import pandas as pd
import seaborn as sns
import numpy as np
d = {'val': [1, 2,3,4], 'a': [1, 1, 2, 2]}
d2 = {'val': [1, 2], 'a': [1, 2]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
This will give me two dataframes that look the following:
df =
val a
0 1 1
1 2 1
2 3 2
3 4 2
and
df2 =
val a
0 1 1
1 2 2
Now I want to create a boxplot of val in df, grouped by the values of a. That is, fix a value of a, e.g. a=1: it has two values of val, 1 and 2, so create a box at x=1 based on the values {1,2}. Then move on to a=2: it has val={3,4}, so create a box at x=2 based on the values {3,4}.
Then I simply want to draw a line based on df2, where a is again my x-axis and val my y-axis. The way I did that is the following:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
sns.pointplot(x='a', y='val', data=df2, ax=ax)
The problem is that the box for a=1 is shifted to a=2 and the box for a=2 has disappeared. I am not sure whether I have an error in my code or whether it is a bug.
If I just add the boxplot, everything is fine, so if I do:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
The boxes are at the right position, but as soon as I add the pointplot, things don't seem to work anymore.
Does anyone have an idea what to do?
The problem is that you are plotting categories on the x-axis. Pointplot places the first category at position 0, while boxplot starts at 1, hence the shift. One possibility is to use a twinned axis:
ax = df.boxplot(column=['val'], by = ['a'])
ax2 = ax.twiny()
sns.pointplot(x='a', y='val', data=df2, ax=ax2)
ax2.xaxis.set_visible(False)
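Another option, sketched here as an assumption rather than as part of the answer above, is to skip pointplot and draw the line with plain matplotlib at the positions the boxplot uses (1, 2, ... in category order). The dictionary position is a hypothetical helper.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4], 'a': [1, 1, 2, 2]})
df2 = pd.DataFrame({'val': [1, 2], 'a': [1, 2]})

fig, ax = plt.subplots()
df.boxplot(column='val', by='a', ax=ax)

# the boxplot puts the i-th category (in sorted order) at x = i + 1
categories = sorted(df['a'].unique())
position = {cat: i + 1 for i, cat in enumerate(categories)}

# translate the category values of df2 into those x positions and draw the line
line_x = df2['a'].map(position)
ax.plot(line_x, df2['val'], marker='o')
```

In this particular data the categories happen to equal their positions; the mapping matters as soon as a takes values such as 10 and 20.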

Plot groupby percentage dataframe

I didn't find a complete answer to what I want to do:
I have a dataframe. I want to group by user and their answers to a survey, compute the ratio of their good answers to the total of their answers, display it as a percentage and plot the result.
I have an answer column which contains 1, 0 or -1. I want to filter it in order to exclude -1.
Here is what I did so far:
df_sample.groupby('user').filter(lambda x : x['answer'].mean() >-1)
or :
a = df_sample.loc[df_sample['answer']!=-1,['user','answer']]
b = a.groupby(['user','answer']).agg({'answer' : 'sum'})
As you can see, it's incomplete. Thank you for any suggestions you may have.
Let's try with some sample data:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
df.head():
user answer
0 D 1
1 C 0
2 D -1
3 B 1
4 C 1
Option 1
Filter out the -1 values and use named aggregation to get the "good answers" and "total answers":
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df:
good_answer total_answer
user
A 9 15
B 11 20
C 15 19
D 7 14
Use division and multiplication to get percentage:
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
plot_df:
good_answer total_answer pct
user
A 9 15 60.000000
B 11 20 55.000000
C 15 19 78.947368
D 7 14 50.000000
Then this can be plotted with DataFrame.plot:
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
Option 2
If only the percentage is needed, groupby mean leads to the resulting plot directly after filtering out the -1s:
plot_df = df[df['answer'].ne(-1)].groupby('user')['answer'].mean().mul(100)
ax = plot_df.plot(
kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
plot_df:
answer
user
A 60.000000
B 55.000000
C 78.947368
D 50.000000
Both options Produce:
All together:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
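The two options agree because, once the -1 rows are removed, every answer is 0 or 1, so the sum of good answers divided by the number of answers is just the mean. A small check on hand-made (hypothetical) data:

```python
import pandas as pd

# Tiny hand-made sample (hypothetical values)
df = pd.DataFrame({'user': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'answer': [1, 0, -1, 1, 1, 0, -1]})

# drop the -1 answers first
valid = df[df['answer'].ne(-1)]

# Option 1 style: good answers / total answers * 100
by_ratio = valid.groupby('user')['answer'].agg(lambda s: s.sum() / s.size * 100)

# Option 2 style: mean of the 0/1 answers * 100
by_mean = valid.groupby('user')['answer'].mean().mul(100)

print(by_ratio)
print(by_mean)
```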
Here is a sample solution assuming you want to calculate percentage based on filtered dataframe.
import pandas as pd
import numpy as np
df_sample = pd.DataFrame(np.random.randint(-1,2,size=(10, 1)), columns=['answer'])
df_sample['user'] = [i for i in 'a b c d e f a b c d'.split(' ')]
df_filtered = df_sample[df_sample.answer>-1]
print(df_filtered.groupby('user').agg({'answer' : lambda x: x.sum()/len(df_filtered)*100}))

Shared Categorical Y Axis on Matplotlib

I tried the below but this gives wrong results - the Y labels of subplot 1 get incorrectly overwritten by the Y labels of subplot 2.
import pandas as pd
import matplotlib.pyplot as plt
ab = {
'a': ['a','b','a','b'],
'b': [1,2,3,4]
}
ab = pd.DataFrame(ab)
cd = {
'c': ['e','e','f','d'],
'd': [1,2,3,4]
}
cd = pd.DataFrame(cd)
fig, axs = plt.subplots(
1, 2,
figsize = (15, 5),
sharey = True,
sharex = True
)
axs[0].scatter(
ab['b'],
ab['a']
)
axs[1].scatter(
cd['d'],
cd['c']
)
The correct result should have all the letters - a,b,d,e,f on the Y axis, preferably in order, and the points of the scatter plot placed correctly.
Thanks!
If the values in the a and c columns are unique, it is possible to reindex by the union of both:
import numpy as np
cats = np.union1d(ab['a'], cd['c'])
ab = ab.set_index('a').reindex(cats)
cd = cd.set_index('c').reindex(cats)
and then plot the indexes instead of the columns:
fig, axs = plt.subplots(
1, 2,
figsize = (15, 5),
sharey = True,
sharex = True
)
axs[0].scatter(
ab['b'],
ab.index
)
axs[1].scatter(
cd['d'],
cd.index
)
If the values are not unique, use numpy.setdiff1d to find the missing categories, add them as extra rows, and sort with sort_values:
ab = {
'a': ['a','b','a','b'],
'b': [1,2,3,4]
}
ab = pd.DataFrame(ab)
cd = {
'c': ['e','e','f','d'],
'd': [1,2,3,4]
}
cd = pd.DataFrame(cd)
cats = np.union1d(ab['a'], cd['c'])
print (cats)
['a' 'b' 'd' 'e' 'f']
ab1 = pd.DataFrame({'a': np.setdiff1d(cats, ab['a'].unique())})
ab = pd.concat([ab, ab1], ignore_index=True).sort_values('a')
print (ab)
a b
0 a 1.0
2 a 3.0
1 b 2.0
3 b 4.0
4 d NaN
5 e NaN
6 f NaN
cd1 = pd.DataFrame({'c': np.setdiff1d(cats, cd['c'].unique())})
cd = pd.concat([cd, cd1], ignore_index=True).sort_values('c')
print (cd)
c d
4 a NaN
5 b NaN
3 d 4.0
0 e 1.0
1 e 2.0
2 f 3.0
fig, axs = plt.subplots(
1, 2,
figsize = (15, 5),
sharey = True,
sharex = True
)
axs[0].scatter(
ab['b'],
ab['a']
)
axs[1].scatter(
cd['d'],
cd['c']
)
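A variant that avoids adding NaN rows, shown here only as a sketch under the assumption that integer positions are acceptable on the y-axis, is to encode both columns against the same category list with pd.Categorical and to label the ticks manually. The names ab_codes and cd_codes are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

ab = pd.DataFrame({'a': ['a', 'b', 'a', 'b'], 'b': [1, 2, 3, 4]})
cd = pd.DataFrame({'c': ['e', 'e', 'f', 'd'], 'd': [1, 2, 3, 4]})

# sorted union of all categories that appear in either dataframe
cats = np.union1d(ab['a'], cd['c'])

# encode both columns against the same category list, so codes are comparable
ab_codes = pd.Categorical(ab['a'], categories=cats).codes
cd_codes = pd.Categorical(cd['c'], categories=cats).codes

fig, axs = plt.subplots(1, 2, figsize=(15, 5), sharey=True, sharex=True)
axs[0].scatter(ab['b'], ab_codes)
axs[1].scatter(cd['d'], cd_codes)

# the y-axis is shared, so setting the ticks once is enough
axs[0].set_yticks(range(len(cats)))
axs[0].set_yticklabels(cats)
```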
This happens because the y-axis categories are not the same in both subplots. I have checked that it works if the category values ('a' etc.) are the same in both dataframes. From the matplotlib subplots documentation:
When subplots have a shared x-axis along a column, only the x tick
labels of the bottom subplot are created. Similarly, when subplots
have a shared y-axis along a row, only the y tick labels of the first
column subplot are created.
That is what is happening in this case. I am not sure which tick labels matplotlib can choose when the categorical values don't match.
You could trick the axis to plot numerical values and change the labels manually:
# Imports and data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ab = {
'a': ['a','b','a','b'],
'b': [1,2,3,4]
}
ab = pd.DataFrame(ab)
cd = {
'c': ['e','e','f','d'],
'd': [1,2,3,4]
}
cd = pd.DataFrame(cd)
# from categorical to numerical
idx = {j:i for i,j in enumerate(np.unique(list(ab['a']) + list(cd['c'])))}
fig, axs = plt.subplots(
1, 2,
figsize = (15, 5),
sharey = True,
sharex = True
)
# correct ticks
axs[0].set_yticks(range(len(idx)))
axs[0].set_yticklabels(idx.keys())
axs[0].scatter(
ab['b'],
[idx[i] for i in ab['a']] # plot numerical
)
axs[1].scatter(
cd['d'],
[idx[i] for i in cd['c']] # plot numerical
)
plt.show()
Resulting Plot:
