set scatter plot legend labels with legend_elements - python

I just upgraded matplotlib to version 3.1.1 and I am experimenting with using legend_elements.
I am making a scatterplot of the top two components from PCA on a dataset of 30,000 flattened, grayscale images. Each image is labeled as one of four master categories (Accessory, Apparel, Footwear, Personal Care). I have color coded the plot by 'master category' by creating a colors column with values from 0 to 3.
I have read the documentation for PathCollection.legend_elements, but I haven't successfully incorporated the 'func' or 'fmt' parameters.
https://matplotlib.org/3.1.1/api/collections_api.html#matplotlib.collections.PathCollection.legend_elements
Also, I have tried to follow examples provided:
https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/scatter_with_legend.html
### create column for color codes
masterCat_codes = {'Accessories':0,'Apparel':1, 'Footwear':2, 'Personal Care':3}
df['colors'] = df['masterCategory'].apply(lambda x: masterCat_codes[x])
### create scatter plot
fig, ax = plt.subplots(figsize=(8,8))
scatter = ax.scatter( *full_pca.T, s=.1 , c=df['colors'], label= df['masterCategory'], cmap='viridis')
### using legend_elements
legend1 = ax.legend(*scatter.legend_elements(num=[0,1,2,3]), loc="upper left", title="Category Codes")
ax.add_artist(legend1)
plt.show()
The resulting legend labels are 0, 1, 2, 3. (This happens whether or not I specify label = df['masterCategory'] when defining 'scatter'). I would like labels to say Accessories, Apparel, Footwear, Personal Care.
Is there a way to accomplish this with legend_elements?
Note: As the dataset is large and the preprocessing is computationally heavy, I have written an example that is simpler to reproduce:
fake_data = np.array([[1,1],[1,2],[1,3],[2,1],[2,2],[2,3],[3,1],[3,2],[3,3]])
fake_df = pd.DataFrame(fake_data, columns=['X', 'Y'])
groups = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'])
fake_df['Group'] = groups
group_codes = {k:idx for idx, k in enumerate(fake_df.Group.unique())}
fake_df['colors'] = fake_df['Group'].apply(lambda x: group_codes[x])
fig, ax = plt.subplots()
scatter = ax.scatter(fake_data[:,0], fake_data[:,1], c=fake_df['colors'])
legend = ax.legend(*scatter.legend_elements(num=[0,1,2]), loc="upper left", title="Group \nCodes")
ax.add_artist(legend)
plt.show()

Solution
Thanks to ImportanceOfBeingErnest
.legend_elements returns a tuple of legend handles and labels for a PathCollection.
The handles are the first object returned by the method, so indexing the result with [0] extracts them; your own labels can then be passed to ax.legend alongside those handles.
Also see Scatter plots with a legend
group_codes = {k:idx for idx, k in enumerate(fake_df.Group.unique())}
fake_df['colors'] = fake_df['Group'].apply(lambda x: group_codes[x])
fig, ax = plt.subplots(figsize=(8,8))
scatter = ax.scatter(fake_data[:,0], fake_data[:,1], c=fake_df['colors'])
handles = scatter.legend_elements(num=[0,1,2])[0] # extract the handles for the three codes in fake_df
ax.legend(title='Group\nCodes', handles=handles, labels=list(group_codes)) # keys are in code order: A, B, C
plt.show()
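The question also asked about the fmt parameter. As a hedged alternative to replacing the labels afterwards, and assuming fmt accepts a matplotlib.ticker.Formatter as the linked documentation describes, a FuncFormatter can map each numeric code back to its group name. The code_to_name dictionary and name_formatter below are illustrative names, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import pandas as pd

# Same toy data as in the question
fake_data = np.array([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3]])
fake_df = pd.DataFrame(fake_data, columns=['X', 'Y'])
fake_df['Group'] = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
group_codes = {k: idx for idx, k in enumerate(fake_df['Group'].unique())}
fake_df['colors'] = fake_df['Group'].map(group_codes)

# Hypothetical helper: invert the mapping so a code can be turned back into a name
code_to_name = {idx: k for k, idx in group_codes.items()}
name_formatter = FuncFormatter(lambda value, pos: code_to_name[int(round(value))])

fig, ax = plt.subplots()
scatter = ax.scatter(fake_df['X'], fake_df['Y'], c=fake_df['colors'])

# legend_elements formats each numeric code with the formatter passed as fmt
handles, labels = scatter.legend_elements(num=[0, 1, 2], fmt=name_formatter)
ax.legend(handles, labels, loc="upper left", title="Group")
```

This keeps the code-to-name mapping in one place, so the legend stays correct even if the order of the codes changes.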

Related

How to generate labelled barplots using seaborn?

I am a bit new to Python. And I am playing with a dummy dataset to get some Python data manipulation practice. Below is the code for generating the dummy data:
d = {
'SeniorCitizen': [0,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0] ,
'CollegeDegree': [0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1] ,
'Married': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1] ,
'FulltimeJob': [1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,0,0,1,1,0,0,0,1] ,
'DistancefromBranch': [7,9,14,20,21,12,22,25,9,9,9,12,13,14,16,25,27,4,14,14,20,19,15,23,2] ,
'ReversedPayment': [0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0] }
CarWash = pd.DataFrame(data = d)
categoricals = ['SeniorCitizen','CollegeDegree','Married','FulltimeJob','ReversedPayment']
numerical = ['DistancefromBranch']
CarWash[categoricals] = CarWash[categoricals].astype('category')
I am basically struggling with a couple of things:
#1. A stacked barplot with absolute values (like the excel example below)
#2. A stacked barplot with percentage values (like the excel example below)
Below are my target visualizations for # 1 and # 2 using countplot().
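[Images: Excel mock-ups of target charts #1 (stacked bars, absolute values) and #2 (stacked bars, percentage values)]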
For # 1, instead of a stacked barplot, with countplot() I am able to make a clustered barplot, like below, and also the annotation snippet feels more like a workaround rather than being Python elegant.
# Looping through each categorical column and viewing target variable distribution (ReversedPayment) by value
figure, axes = plt.subplots(2,2,figsize = (10,10))
for i,ax in zip(categoricals[:-1],axes.flatten()):
    sns.countplot(x= i, hue = 'ReversedPayment', data = CarWash, ax = ax)
    for p in ax.patches:
        height = np.nan_to_num(p.get_height()) # gets the height of each patch/bar
        adjust = np.nan_to_num(p.get_width())/2 # a calculation for adjusting the data label later
        label_xy = (np.nan_to_num(p.get_x()) + adjust,np.nan_to_num(p.get_height()) + adjust) #x,y coordinates where we want to put our data label
        ax.annotate(height,label_xy) # final annotation
For # 2, I tried creating a new data frame housing % values but that felt tedious and error-prone.
I feel an option like stacked = True, proportion = True, axis = 1, annotate = True could have been so useful for countplot() to have.
Are there any other libraries that would be more straightforward and less code-intensive? Any comments or suggestions are welcome.
In this case, I think plotly.express may be more intuitive for you.
import plotly.express as px
df_temp = CarWash.groupby(['SeniorCitizen', 'ReversedPayment'])['DistancefromBranch'].count().reset_index().rename({'DistancefromBranch':'count'}, axis=1)
fig = px.bar(df_temp, x="SeniorCitizen", y="count", color="ReversedPayment", title="SeniorCitizen", text='count')
fig.update_traces(textposition='inside')
fig.show()
Basically, if you want more flexibility to adjust your charts, it is hard to avoid writing a lot of code.
I also tried using matplotlib and pandas to create a stacked bar chart of percentages, in case you are interested:
sns.set()
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=[12,8], dpi=100)
# Convert the axes matrix to a 1-D array
axes = ax.flatten()
for i, col in enumerate(['SeniorCitizen', 'CollegeDegree', 'Married', 'FulltimeJob']):
    # Count ReversedPayment values within each category of `col`
    df_temp = (CarWash.groupby(col)['ReversedPayment']
               .value_counts()
               .unstack(1).fillna(0)
               .rename({0:'No', 1:'Yes'})           # relabel the index (values of `col`)
               .rename({0:'No', 1:'Yes'}, axis=1))  # relabel the columns (ReversedPayment)
    # Divide each ReversedPayment column by its own total
    df_temp = df_temp / df_temp.sum(axis=0)
    df_temp.plot.bar(stacked=True, ax=axes[i])
    axes[i].set_title(col, y=1.03, fontsize=16)
    rects = axes[i].patches
    # Bars are drawn one column at a time, so flatten column-major to match the patch order
    labels = df_temp.values.flatten(order='F')
    for rect, label in zip(rects, labels):
        if label == 0:
            continue
        axes[i].text(rect.get_x() + rect.get_width() / 2, rect.get_y() + rect.get_height() / 3, '{:.2%}'.format(label),
                     ha='center', va='bottom', color='white', fontsize=12)
    axes[i].legend(title='Reversed\nPayment', bbox_to_anchor=(1.05, 1), loc='upper left', title_fontsize = 10, fontsize=10)
    axes[i].tick_params(rotation=0)
plt.tight_layout()
plt.show()
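If the goal is bars that each stack to 100%, one possible variant is to normalise within each category (row-wise) rather than within each ReversedPayment column. The sketch below is not the answerer's code: it assumes pandas.crosstab with normalize="index" and Axes.bar_label (matplotlib 3.4 or later), and uses a small made-up demo frame instead of CarWash.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature stand-in for the CarWash data
demo = pd.DataFrame({
    "Married":         [0, 0, 0, 1, 1, 1, 1, 0],
    "ReversedPayment": [0, 1, 0, 0, 1, 1, 0, 0],
})

# Row-normalised contingency table: every row (category) sums to 1
shares = pd.crosstab(demo["Married"], demo["ReversedPayment"], normalize="index")

fig, ax = plt.subplots()
shares.plot.bar(stacked=True, ax=ax)

# One container per ReversedPayment value; label each segment with its share
for container in ax.containers:
    ax.bar_label(container,
                 labels=[f"{v:.0%}" for v in container.datavalues],
                 label_type="center")

ax.set_ylabel("Share of customers")
ax.legend(title="Reversed\nPayment", bbox_to_anchor=(1.05, 1), loc="upper left")
fig.tight_layout()
```

Dropping normalize="index" gives the absolute-count version of the same stacked chart.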

How to plot multiple seaborn.distplot in a single figure

I want to draw multiple seaborn distplots in the same figure, where every subplot has the same x and y grid. My attempt, shown below, does not work.
# function to plot the density curve of the 200 Median Stn. MC-losses
def make_density(stat_list,color, layer_num):
    num_subplots = len(stat_list)
    ncols = 3
    nrows = (num_subplots + ncols - 1) // ncols
    fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(ncols * 6, nrows * 5))
    for i in range(len(stat_list)):
        # Plot formatting
        plt.title('Layer ' + layer_num)
        plt.xlabel('Median Stn. MC-Loss')
        plt.ylabel('Density')
        plt.xlim(-0.2,0.05)
        plt.ylim(0, 85)
        min_ylim, max_ylim = plt.ylim()
        # Draw the density plot.
        sns.distplot(stat_list, hist = True, kde = True,
                     kde_kws = {'linewidth': 2}, color=color)
# `stat_list` is a list of 6 lists
# I want to draw histogram and density plot of
# each of these 6 lists contained in `stat_list` in a single window,
# where each row containing the histograms and densities of the 3 plots
# so in my example, there would be 2 rows of 3 columns of plots (2 x 3 =6).
stat_list = [[0.3,0.5,0.7,0.3,0.5],[0.2,0.1,0.9,0.7,0.4],[0.9,0.8,0.7,0.6,0.5],
             [0.2,0.6,0.75,0.87,0.91],[0.2,0.3,0.8,0.9,0.3],[0.2,0.3,0.8,0.87,0.92]]
How can I modify my function to draw multiple distplot under the same window, where the x and y grid for each displayed plot is identical?
Thank you,
PS: Aside, I want the 6 distplots to have identical color, preferably green for all of them.
The easiest method is to load the data into pandas and then use seaborn.displot.
.displot replaces .distplot in seaborn version 0.11.0
Technically, what you would have wanted before, is a FacetGrid mapped with distplot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# data
stat_list = [[0.3,0.5,0.7,0.3,0.5], [0.2,0.1,0.9,0.7,0.4], [0.9,0.8,0.7,0.6,0.5], [0.2,0.6,0.75,0.87,0.91], [0.2,0.3,0.8,0.9,0.3], [0.2,0.3,0.8,0.87,0.92]]
# load the data into pandas and then transpose it for the correct column data
df = pd.DataFrame(stat_list).T
# name the columns; specify a layer number
df.columns = ['A', 'B', 'C', 'D', 'E', 'F']
# now stack the data into a long (tidy) format
dfl = df.stack().reset_index(level=1).rename(columns={'level_1': 'Layer', 0: 'Median Stn. MC-Loss'})
# plot a displot
g = sns.displot(data=dfl, x='Median Stn. MC-Loss', col='Layer', col_wrap=3, kde=True, color='green')
g.set_axis_labels(y_var='Density')
g.set(xlim=(0, 1.0), ylim=(0, 3.0))
sns.FacetGrid and sns.distplot
.distplot is deprecated
p = sns.FacetGrid(data=dfl, col='Layer', col_wrap=3, height=5)
p.map(sns.distplot, 'Median Stn. MC-Loss', bins=5, kde=True, color='green')
p.set(xlim=(0, 1.0))

Scatter plot with different text at each data point that matches the size and colour of the marker

I have this scatter plot (I know it's a mess!) and I am trying to change the colour and size of the text adjacent to the marker to match that of the marker. In this case, text that is next to a green dot would be green and text that is next to an orange dot would be orange. Ideally, I would also be able to make the text smaller.
The code I use to generate the scatter plot below is:
plot = plt.figure(figsize=(30,20))
ax = sns.scatterplot(x='Recipients', y='Donors', data=concatenated, hue = 'Cost of Transfer',
                     palette="Set2", s= 300)
def label_point(x, y, val, ax):
    a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    for i, point in a.iterrows():
        ax.text(point['x']+.1, point['y'], str(point['val']))
label_point(concatenated.Recipients, concatenated.Donors, concatenated.Species, plt.gca())
Any help is greatly appreciated :)
It would be pretty complicated and probably error-prone to try to find the colors of the points in a sns.scatterplot(). Do you really need to use scatterplot()?
If not, I would suggest forgetting about seaborn and just creating the plot using matplotlib directly, which gives you much more control:
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset("iris")
iris['label'] = 'label_'+iris.index.astype(str) # create a label for each point
df = iris
x_col = 'sepal_length'
y_col = 'sepal_width'
hue_col = 'species'
label_col = 'label'
palette = 'Set2'
size = 5
fig, ax = plt.subplots()
# one colour per hue level (plt.get_cmap also works where matplotlib.cm.get_cmap has been removed)
colors = plt.get_cmap(palette)(range(len(df[hue_col].unique())))
for (g,temp),c in zip(iris.groupby('species'),colors):
    print(g,c)
    ax.plot(temp[x_col], temp[y_col], 'o', color=c, ms=size, label=g)
    # annotate every point of this group in the group's colour
    for i,row in temp.iterrows():
        ax.annotate(row[label_col], xy=(row[x_col],row[y_col]), color=c)
ax.set_xlabel(x_col)
ax.set_ylabel(y_col)
ax.legend(title=hue_col)
The text in the plot is drawn by ax.text() (matplotlib's Axes.text), which also accepts color and font size settings.
# Before
ax.text(point['x']+.1, point['y'], str(point['val']))
# After
ax.text(point['x']+.1, point['y'], str(point['val']), {'color': 'g', 'fontsize': 20})
Pick whatever color and font size you like.
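To make each label follow the colour of its own marker while staying with seaborn, one hedged option is to build the palette as an explicit dictionary and reuse it for both the scatterplot and ax.text. The sketch assumes 'Cost of Transfer' is categorical. The small data frame is made up, since the original concatenated frame is not shown.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the `concatenated` frame from the question
data = pd.DataFrame({
    "Recipients": [1, 2, 3, 4],
    "Donors": [4, 3, 2, 1],
    "Species": ["a", "b", "c", "d"],
    "Cost of Transfer": ["low", "high", "low", "high"],
})

# Explicit level -> colour mapping, shared by the markers and the text
levels = data["Cost of Transfer"].unique()
palette = dict(zip(levels, sns.color_palette("Set2", n_colors=len(levels))))

fig, ax = plt.subplots()
sns.scatterplot(x="Recipients", y="Donors", hue="Cost of Transfer",
                palette=palette, s=300, data=data, ax=ax)

for _, row in data.iterrows():
    ax.text(row["Recipients"] + 0.1, row["Donors"], str(row["Species"]),
            color=palette[row["Cost of Transfer"]],  # same colour as the marker
            fontsize=8)                              # smaller text
```

Because the palette is a plain dictionary, there is no need to read colours back out of the PathCollection that scatterplot creates.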

Add Legend to Seaborn point plot

I am plotting multiple dataframes as point plot using seaborn. Also I am plotting all the dataframes on the same axis.
How would I add legend to the plot ?
My code takes each of the dataframe and plots it one after another on the same figure.
Each dataframe has same columns
date count
2017-01-01 35
2017-01-02 43
2017-01-03 12
2017-01-04 27
My code :
f, ax = plt.subplots(1, 1, figsize=figsize)
x_col='date'
y_col = 'count'
sns.pointplot(ax=ax,x=x_col,y=y_col,data=df_1,color='blue')
sns.pointplot(ax=ax,x=x_col,y=y_col,data=df_2,color='green')
sns.pointplot(ax=ax,x=x_col,y=y_col,data=df_3,color='red')
This plots 3 lines on the same plot. However, the legend is missing, and pointplot does not accept a label argument.
One workaround that worked was creating a new dataframe and using hue argument.
df_1['region'] = 'A'
df_2['region'] = 'B'
df_3['region'] = 'C'
df = pd.concat([df_1,df_2,df_3])
sns.pointplot(ax=ax,x=x_col,y=y_col,data=df,hue='region')
But I would like to know if there is a way to create a legend for the code that first adds sequentially point plot to the figure and then add a legend.
Sample output :
I would suggest not using seaborn pointplot here; it makes things unnecessarily complicated.
Instead use matplotlib plot_date. It lets you attach a label to each plot, and ax.legend() then collects those labels into a legend automatically.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
date = pd.date_range("2017-03", freq="M", periods=15)
count = np.random.rand(15,4)
df1 = pd.DataFrame({"date":date, "count" : count[:,0]})
df2 = pd.DataFrame({"date":date, "count" : count[:,1]+0.7})
df3 = pd.DataFrame({"date":date, "count" : count[:,2]+2})
f, ax = plt.subplots(1, 1)
x_col='date'
y_col = 'count'
ax.plot_date(df1.date, df1["count"], color="blue", label="A", linestyle="-")
ax.plot_date(df2.date, df2["count"], color="red", label="B", linestyle="-")
ax.plot_date(df3.date, df3["count"], color="green", label="C", linestyle="-")
ax.legend()
plt.gcf().autofmt_xdate()
plt.show()
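The same labelling idea works with plain ax.plot, which handles datetime values directly; this may be preferable because plot_date is discouraged in newer Matplotlib releases. The sketch below is a variant under that assumption. The first frame uses the sample values from the question; the other two are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2017-01-01", periods=4, freq="D")
frames = {
    "A": pd.DataFrame({"date": dates, "count": [35, 43, 12, 27]}),  # values from the question
    "B": pd.DataFrame({"date": dates, "count": [20, 25, 30, 22]}),  # made-up values
    "C": pd.DataFrame({"date": dates, "count": [5, 15, 40, 33]}),   # made-up values
}
colors = {"A": "blue", "B": "green", "C": "red"}

fig, ax = plt.subplots()
for name, frame in frames.items():
    # label= is what ax.legend() picks up later
    ax.plot(frame["date"], frame["count"], marker="o", color=colors[name], label=name)

legend = ax.legend(title="region")
fig.autofmt_xdate()
```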
In case you still want a legend for the pointplots, here is one way to do it:
sns.pointplot(ax=ax,x=x_col,y=y_col,data=df1,color='blue')
sns.pointplot(ax=ax,x=x_col,y=y_col,data=df2,color='green')
sns.pointplot(ax=ax,x=x_col,y=y_col,data=df3,color='red')
ax.legend(handles=ax.lines[::len(df1)+1], labels=["A","B","C"])
ax.set_xticklabels([t.get_text().split("T")[0] for t in ax.get_xticklabels()])
plt.gcf().autofmt_xdate()
plt.show()
Old question, but there's an easier way.
sns.pointplot(x=x_col,y=y_col,data=df_1,color='blue')
sns.pointplot(x=x_col,y=y_col,data=df_2,color='green')
sns.pointplot(x=x_col,y=y_col,data=df_3,color='red')
plt.legend(labels=['legendEntry1', 'legendEntry2', 'legendEntry3'])
This lets you add the plots sequentially, and not have to worry about any of the matplotlib crap besides defining the legend items.
I tried using Adam B's answer, however, it didn't work for me. Instead, I found the following workaround for adding legends to pointplots.
import matplotlib.patches as mpatches
red_patch = mpatches.Patch(color='#bb3f3f', label='Label1')
black_patch = mpatches.Patch(color='#000000', label='Label2')
In the pointplots, the color can be specified as mentioned in previous answers. Once these patches corresponding to the different plots are set up,
plt.legend(handles=[red_patch, black_patch])
And the legend ought to appear in the pointplot.
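A small variation on the patch idea, sketched under the assumption that a line-with-marker symbol is wanted in the legend instead of a coloured box: use Line2D proxy artists. The frames and colours below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import pandas as pd
import seaborn as sns

days = ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]
frames = {
    "A": pd.DataFrame({"date": days, "count": [35, 43, 12, 27]}),
    "B": pd.DataFrame({"date": days, "count": [20, 25, 30, 22]}),
    "C": pd.DataFrame({"date": days, "count": [5, 15, 40, 33]}),
}
colors = {"A": "blue", "B": "green", "C": "red"}

fig, ax = plt.subplots()
for name, frame in frames.items():
    sns.pointplot(ax=ax, x="date", y="count", data=frame, color=colors[name])

# Proxy artists: never drawn on the axes, only used to describe the legend entries
proxies = [Line2D([0], [0], color=colors[name], marker="o", label=name) for name in frames]
legend = ax.legend(handles=proxies, title="region")
```

Since the proxies are built from the same colours dictionary as the plots, the legend cannot drift out of sync with the lines.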
This goes a bit beyond the original question, but it builds on @PSub's response to get to something more general. I know some of this is easier in Matplotlib directly, but many of Seaborn's default styling options are quite nice, so I wanted to work out how to have more than one legend on a point plot (or other Seaborn plot) without dropping into Matplotlib right at the start.
Here's one solution:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# We will need to access some of these matplotlib classes directly
from matplotlib.lines import Line2D # For points and lines
from matplotlib.patches import Patch # For KDE and other plots
from matplotlib.legend import Legend
from matplotlib import cm
# Initialise random number generator
rng = np.random.default_rng(seed=42)
# Generate sample of 25 numbers
n = 25
clusters = []
for c in range(0,3):
    # Crude way to get different distributions
    # for each cluster
    p = rng.integers(low=1, high=6, size=4)
    df = pd.DataFrame({
        'x': rng.normal(p[0], p[1], n),
        'y': rng.normal(p[2], p[3], n),
        'name': f"Cluster {c+1}"
    })
    clusters.append(df)
# Flatten to a single data frame
clusters = pd.concat(clusters)
# Now do the same for data to feed into
# the second (scatter) plot...
n = 8
points = []
for c in range(0,2):
    p = rng.integers(low=1, high=6, size=4)
    df = pd.DataFrame({
        'x': rng.normal(p[0], p[1], n),
        'y': rng.normal(p[2], p[3], n),
        'name': f"Group {c+1}"
    })
    points.append(df)
points = pd.concat(points)
# And create the figure
f, ax = plt.subplots(figsize=(8,8))
# The KDE-plot generates a Legend 'as usual'
k = sns.kdeplot(
    data=clusters,
    x='x', y='y',
    hue='name',
    shade=True,
    thresh=0.05,
    n_levels=2,
    alpha=0.2,
    ax=ax,
)
# Notice that we access this legend via the
# axis to turn off the frame, set the title,
# and adjust the patch alpha level so that
# it closely matches the alpha of the KDE-plot
ax.get_legend().set_frame_on(False)
ax.get_legend().set_title("Clusters")
for lh in ax.get_legend().get_patches():
    lh.set_alpha(0.2)
# You would probably want to sort your data
# frame or set the hue and style order in order
# to ensure consistency for your own application
# but this works for demonstration purposes
groups = points.name.unique()
markers = ['o', 'v', 's', 'X', 'D', '<', '>']
colors = plt.get_cmap('Dark2').colors  # plt.get_cmap also works where cm.get_cmap has been removed
# Generate the scatterplot: notice that Legend is
# off (otherwise this legend would overwrite the
# first one) and that we're setting the hue, style,
# markers, and palette using the 'name' parameter
# from the data frame and the number of groups in
# the data.
p = sns.scatterplot(
    data=points,
    x="x",
    y="y",
    hue='name',
    style='name',
    markers=markers[:len(groups)],
    palette=colors[:len(groups)],
    legend=False,
    s=30,
    alpha=1.0
)
# Here's the 'magic' -- we use zip to link together
# the group name, the color, and the marker style. You
# *cannot* retrieve the marker style from the scatterplot
# since that information is lost when rendered as a
# PathCollection (as far as I can tell). Anyway, this allows
# us to loop over each group in the second data frame and
# generate a 'fake' Line2D plot (with zero elements and no
# line-width in our case) that we can add to the legend. If
# you were overlaying a line plot or a second plot that uses
# patches you'd have to tweak this accordingly.
patches = []
for x in zip(groups, colors[:len(groups)], markers[:len(groups)]):
    patches.append(Line2D([0],[0], linewidth=0.0, linestyle='',
                          color=x[1], markerfacecolor=x[1],
                          marker=x[2], label=x[0], alpha=1.0))
# And add these patches (with their group labels) to the new
# legend item and place it on the plot.
leg = Legend(ax, patches, labels=groups,
loc='upper left', frameon=False, title='Groups')
ax.add_artist(leg);
# Done
plt.show();
Here's the output:

Python Seaborn Matplotlib setting line style as legend

I have the following plot build with seaborn using factorplot() method.
Is it possible to use the line style as a legend to replace the legend based on line color on the right?
graycolors = sns.mpl_palette('Greys_r', 4)
g = sns.factorplot(x="k", y="value", hue="class", palette=graycolors,
data=df, linestyles=["-", "--"])
Furthermore I'm trying to get both lines in black color using the color="black" parameter in my factorplot method but this results in an exception "factorplot() got an unexpected keyword argument 'color'". How can I paint both lines in the same color and separate them by the linestyle only?
I have been looking for a solution trying to put the linestyle in the legend like matplotlib, but I have not yet found how to do this in seaborn. However, to make the data clear in the legend I have used different markers:
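import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.show() below
# same grey palette as in the question
graycolors = sns.mpl_palette('Greys_r', 4)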
# creating some data
n = 11
x = np.linspace(0,2, n)
y = np.sin(2*np.pi*x)
y2 = np.cos(2*np.pi*x)
data = {'x': np.append(x, x), 'y': np.append(y, y2),
'class': np.append(np.repeat('sin', n), np.repeat('cos', n))}
df = pd.DataFrame(data)
# plot the data with the markers
# note that I put the legend=False to move it up (otherwise it was blocking the graph)
g=sns.factorplot(x="x", y="y", hue="class", palette=graycolors,
data=df, linestyles=["-", "--"], markers=['o','v'], legend=False)
# placing the legend up
g.axes[0][0].legend(loc=1)
# showing graph
plt.show()
You can try the following:
h = plt.gca().get_lines()
lg = plt.legend(handles=h, labels=['YOUR Labels List'], loc='best')
It worked fine for me.
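For the second part of the question (both lines in black, told apart by line style only), a minimal sketch that sidesteps factorplot is to plot each class with matplotlib directly, so the legend entries carry the line style automatically. The linestyles dictionary is an illustrative choice, and the data mirrors the sin/cos example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

n = 11
x = np.linspace(0, 2, n)
df = pd.DataFrame({
    "x": np.tile(x, 2),
    "y": np.concatenate([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)]),
    "class": ["sin"] * n + ["cos"] * n,
})

# Hypothetical mapping from class name to line style
linestyles = {"sin": "-", "cos": "--"}

fig, ax = plt.subplots()
for name, group in df.groupby("class", sort=False):
    # Same colour for every class; only the line style differs
    ax.plot(group["x"], group["y"], color="black",
            linestyle=linestyles[name], label=name)

legend = ax.legend(title="class")
```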
