Plotting 3 variables on dataframe [duplicate] - python

data = {'name': ['A', 'B', 'C', 'D'],
'score': [-9.5, -8.3, -8.1, -7.0],
'color': [4, 3, 2, 1]}
df = pd.DataFrame(data)
I have my data in a dataframe like above and I am plotting it to a seaborn swarmplot like the one below. The points are plotted based on their score, and depending on how that falls between the 3 dotted lines, I want to color the points differently. I use the 'color' column of the df to assign a key based on where the 'score' values fall that corresponds to colors and a dictionary.
colors = {1:'pink', 2:'orange', 3:'red', 4:'green'}
I then create the swarmplot with the below code and map the color dictionary to the colors column of my df.
ax = sns.swarmplot(data=df, y='score', s=10, c=df['color'].map(colors))
When I do this I don't generate any errors, but no colors are applied, and the points remain their default blue (image below, left). So, how can I assign colors to points in a seaborn swarmplot based on my df['color'] column?
Final note: When I try to use palette=df['color'].map(colors) instead of c=df['color'].map(colors), the graph just changes everything to the last color in my colors dictionary (image below, right)
Edit: Thank you for your suggestion, @Trenton McKinney. It is partly successful in that the colors now map properly, but when I include the 'name' column (it's actually 'Title') for x, as below, my points are plotted like a scatter plot instead of a swarm plot. However, I get an error if I try to remove x='Title' from my parameters.
ax = sns.swarmplot(data=data, x='Title', y='score', s=10, hue='color', palette=colors)

As per seaborn issue 941, this is the expected behavior.
Seems like the API doesn't play well when specifying only x or only y.
The issue is resolved by passing x a list that repeats a single string once per row of the dataframe: ['']*len(df) or ['text']*len(df)
ax = sns.swarmplot(data=df, x=['']*len(df), y='score', hue='color', palette=colors)
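The question mentions deriving the 'color' key from where each score falls relative to the three dotted lines. A minimal sketch of that step could use pd.cut. The cutoffs -9.0, -8.2 and -7.5 are hypothetical, since the post does not give the actual positions of the lines:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                   'score': [-9.5, -8.3, -8.1, -7.0]})

# Hypothetical cutoffs standing in for the three dotted lines
cutoffs = [-9.0, -8.2, -7.5]
bins = [-np.inf, *cutoffs, np.inf]

# Four bins, labelled 4 (lowest scores) down to 1 (highest scores)
df['color'] = pd.cut(df['score'], bins=bins, labels=[4, 3, 2, 1]).astype(int)

# Same dictionary as in the question
colors = {1: 'pink', 2: 'orange', 3: 'red', 4: 'green'}
point_colors = df['color'].map(colors)
print(df.assign(plot_color=point_colors))
```

The resulting column can then be passed as hue='color' together with palette=colors, as in the answer above.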

Related

How to reduce the blank area in a grouped boxplot with many missing hue categories

I have an issue when plotting a categorical grouped boxplot by seaborn in Python, especially using 'hue'.
My raw data is as shown in the figure below. And I wanted to plot values in column 8 after categorized by column 1 and 4.
I used seaborn and my code is shown below:
ax = sns.boxplot(x=output[:,1], y=output[:,8], hue=output[:,4])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.legend([],[])
However, the generated plot always contains large blank area, as shown in the upper figure below. I tried to add 'dodge=False' in sns.boxplot according to a post here (https://stackoverflow.com/questions/53641287/off-center-x-axis-in-seaborn), but it gives the lower figure below.
Actually, what I want Python to plot is a boxplot like what I generated using JMP below.
It seems that if one of the second-level categories is empty, seaborn still reserves space for it within each first-level category, which causes the offset and blank areas.
So I wonder whether there is any way to solve this, for example with another Python package?
Seaborn reserves a spot for each individual hue value, even when some of these values are missing. When many hue values are missing, this leads to annoying open spots. (When there would be only one box per x-value, dodge=False would solve the problem.)
A workaround is to generate a separate subplot for each individual x-label.
Reproducible example for default boxplot with missing hue values
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(20230206)
df = pd.DataFrame({'label': np.repeat(['label1', 'label2', 'label3', 'label4'], 250),
                   'cat': np.repeat(np.random.choice([*'abcdefghijklmnopqrst'], 40), 25),
                   'value': np.random.randn(1000).cumsum()})
df['cat'] = pd.Categorical(df['cat'], [*'abcdefghijklmnopqrst'])
sns.set_style('white')
plt.figure(figsize=(15, 5))
ax = sns.boxplot(df, x='label', y='value', hue='cat', palette='turbo')
sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1, 1), ncol=2)
sns.despine()
plt.tight_layout()
plt.show()
Individual subplots per x value
A FacetGrid is generated with a subplot ("facet") for each x value
The original hue will be used as x-value for each subplot. To avoid empty spots, the hue should be of string type. When the hue would be pd.Categorical, seaborn would still reserve a spot for each of the categories.
df['cat'] = df['cat'].astype(str) # the column should be of string type, not pd.Categorical
g = sns.FacetGrid(df, col='label', sharex=False)
g.map_dataframe(sns.boxplot, x='cat', y='value')
for label, ax in g.axes_dict.items():
    ax.set_title('') # remove the title generated by sns.FacetGrid
    ax.set_xlabel(label) # use the label from the dataframe as xlabel
plt.tight_layout()
plt.show()
Adding consistent coloring
A dictionary palette can color the boxes such that corresponding boxes in different subplots have the same color. hue= with the same column as the x= will do the coloring, and dodge=False will remove the empty spots.
df['cat'] = df['cat'].astype(str) # the column should be of string type, not pd.Categorical
cats = np.sort(df['cat'].unique())
palette_dict = {cat: color for cat, color in zip(cats, sns.color_palette('turbo', len(cats)))}
g = sns.FacetGrid(df, col='label', sharex=False)
g.map_dataframe(sns.boxplot, x='cat', y='value',
                hue='cat', dodge=False, palette=palette_dict)
for label, ax in g.axes_dict.items():
    ax.set_title('') # remove the title generated by sns.FacetGrid
    ax.set_xlabel(label) # use the label from the dataframe as xlabel
    # ax.tick_params(axis='x', labelrotation=90) # optionally rotate the tick labels
plt.tight_layout()
plt.show()
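To see how many slots the default grouped boxplot leaves empty, it can help to count the hue levels that actually occur for each x label. This is only a diagnostic sketch on the same synthetic data as the example above; the names observed and wasted are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(20230206)
all_cats = [*'abcdefghijklmnopqrst']
df = pd.DataFrame({'label': np.repeat(['label1', 'label2', 'label3', 'label4'], 250),
                   'cat': np.repeat(np.random.choice(all_cats, 40), 25),
                   'value': np.random.randn(1000).cumsum()})

# Number of distinct hue levels that actually have data, per x label
observed = df.groupby('label')['cat'].nunique()

# Seaborn reserves one slot per hue level for every x label
wasted = len(all_cats) - observed
print(pd.DataFrame({'observed': observed, 'empty_slots': wasted}))
```

Each label here holds at most 10 of the 20 categories, so at least half of the reserved slots stay empty in the default plot.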

Hue and shared x-axis not working Seaborn facet grid

I'm aiming to plot a stacked chart that displays normalised values from a pandas df. Using the code below, each unique value in Item has its own row. I then aim to plot a stacked chart containing normalised values from Label, with Num along the x-axis.
However, hue seems to pass a different set of colours for each individual Item. They aren't consistent; for example, A in Up is blue, while A in Right is green.
I'm also hoping to share the x-axis so that Num is consistent for each Item. At the moment the values aren't aligned with the respective x-axis.
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.DataFrame({
'Num' : [1,2,1,2,3,2,1,3,2,2,1,2,3,3,1,3],
'Label' : ['A','B','C','B','B','C','C','B','B','A','C','A','B','A','C','A'],
'Item' : ['Up','Left','Up','Left','Down','Right','Up','Down','Right','Down','Right','Up','Up','Right','Down','Left'],
})
g = sns.FacetGrid(df,
row = 'Item',
row_order = ['Up','Right','Down','Left'],
aspect = 2,
height = 4,
sharex = True,
legend_out = True
)
g.map_dataframe(sns.histplot, x = 'Num', hue = 'Label', multiple = 'fill', shrink = 0.8, binwidth = 1)
g.add_legend()
Using FacetGrid directly can be tricky; it is basically doing a group-by and a for loop over the axes, and it does not track any function-specific state that would be needed to make sure that the answer to questions like "what order should be used for each hue level" is the same in each facet. So you would need to supply that information somehow (i.e. hue_order or passing a palette dictionary). In fact, there is a warning in the documentation to this effect.
But you generally don't need to use FacetGrid directly; you can use one of the figure-level functions, which do all of the bookkeeping for you to make sure that information is aligned across facets. Here you would use displot:
sns.displot(
data=df, x="Num", hue="Label",
row='Item', row_order=['Up','Right','Down','Left'],
multiple="fill", shrink=.8, discrete=True,
aspect=4, height=2,
)
Note that I've made one other change to your code here, which is to use discrete=True instead of binwidth=1, which is what I think you want.

Show first and last label in pandas plot

I have a DataFrame with 361 columns. I want to plot it but showing only the first and last columns in the legend. For instance:
d = {'col1':[1,2],'col2':[3,4],'col3':[5,6],'col4':[7,8]}
df = pd.DataFrame(data=d)
If I plot through df.plot() all the legends will be displayed, but I only want 'col1' and 'col4' in my legend with the proper color code (I am using a colormap) and legend title.
One way to do this is to plot each column separately through matplotlib without using legends and then plot two more empty plots with only the labels (example below), but I wonder if there is a direct way to do it with pandas.
for columns in df:
    plt.plot(df[columns])
plt.plot([],[],label=df.columns[0])
plt.plot([],[],label=df.columns[-1])
plt.legend()
plt.show()
Let's try extracting the handlers/labels from the axis and defining new legend:
ax = df.plot()
handlers, labels = ax.get_legend_handles_labels()
new_handlers, new_labels = [], []
for h,l in zip(handlers, labels):
    if l in ['col1','col4']:
        new_handlers.append(h)
        new_labels.append(l)
ax.legend(new_handlers, new_labels)
Alternatively, you can split your df into two dataframes, the second containing only the columns of interest, then plot both and show a legend only for the second.
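The handle-filtering approach from the first answer generalizes to any number of columns by picking the first and last handle by position instead of by name. A minimal sketch, assuming the 'viridis' colormap and the legend title 'Columns' as placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6], 'col4': [7, 8]}
df = pd.DataFrame(data=d)

# Plot every column; the colormap assigns one color per column
ax = df.plot(colormap='viridis')
handles, labels = ax.get_legend_handles_labels()

# Keep only the first and last entries; the line colors are unchanged
ax.legend([handles[0], handles[-1]], [labels[0], labels[-1]], title='Columns')
```

Since the handles are the plotted lines themselves, the two legend entries keep the colors that the colormap assigned to the first and last columns.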

customise correlation heatmap in seaborn

I am new to Python and I am trying to make a correlation heatmap with seaborn. Could anyone tell me how to customise the default values on the right of the heatmap with my own correlation cutoffs? I get something like the one in the picture, but I want to customise it with my own cutoffs and three values instead of four.
import pandas as pd
import seaborn as sns
new_df = df[['A', 'B', 'C', 'D', 'G','E']]
sns.heatmap(new_df.corr(), annot = False,square=True)
If I understand your question correctly, you want to change the tick marks that display on the colorbar.
If so, the heatmap function has a cbar_kws parameter, which accepts a dictionary that is passed on to the matplotlib colorbar. A potential solution is:
sns.heatmap(
new_df.corr(),
annot = False,
square=True,
cbar_kws={'ticks': [0.25, .5, 1]}
)
Seaborn heatmap documentation
Matplotlib Colorbar documentation (for the key words)
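A self-contained version of the cbar_kws approach is sketched below. It uses random data as a stand-in for the asker's df (which is not shown), 0.25, 0.5 and 1 as example cutoffs, and an added vmin/vmax of -1 and 1 so that all three ticks fall inside the color range; these are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; the real df from the question is not available
rng = np.random.default_rng(0)
new_df = pd.DataFrame(rng.normal(size=(50, 6)),
                      columns=['A', 'B', 'C', 'D', 'G', 'E'])

cutoffs = [0.25, 0.5, 1.0]
ax = sns.heatmap(new_df.corr(), annot=False, square=True,
                 vmin=-1, vmax=1, cbar_kws={'ticks': cutoffs})

# The colorbar is attached to the mesh that heatmap draws
cbar = ax.collections[0].colorbar
print(cbar.get_ticks())
```

Only the tick positions change; the color scale itself stays continuous.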

plot different color for different categorical levels using matplotlib

I have a data frame, diamonds, which is composed of variables like (carat, price, color), and I want to draw a scatter plot of price against carat for each color, so that each diamond color is drawn in a different plot color.
This is easy in R with ggplot:
ggplot(aes(x=carat, y=price, color=color), #by setting color=color, ggplot automatically draw in different colors
data=diamonds) + geom_point(stat='summary', fun.y=median)
I wonder how this could be done in Python using matplotlib?
PS:
I know about auxiliary plotting packages, such as seaborn and ggplot for Python, but I'd rather not use them; I just want to find out whether it is possible to do the job using matplotlib alone ;P
Imports and Sample DataFrame
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns # for sample data
from matplotlib.lines import Line2D # for legend handle
# DataFrame used for all options
df = sns.load_dataset('diamonds')
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
With matplotlib
You can pass plt.scatter a c argument, which allows you to select the colors. The following code defines a colors dictionary to map the diamond colors to the plotting colors.
fig, ax = plt.subplots(figsize=(6, 6))
colors = {'D':'tab:blue', 'E':'tab:orange', 'F':'tab:green', 'G':'tab:red', 'H':'tab:purple', 'I':'tab:brown', 'J':'tab:pink'}
ax.scatter(df['carat'], df['price'], c=df['color'].map(colors))
# add a legend
handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v, label=k, markersize=8) for k, v in colors.items()]
ax.legend(title='color', handles=handles, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
df['color'].map(colors) effectively maps the colors from "diamond" to "plotting".
(Forgive me for not putting another example image up, I think 2 is enough :P)
With seaborn
You can use seaborn which is a wrapper around matplotlib that makes it look prettier by default (rather opinion-based, I know :P) but also adds some plotting functions.
For this you could use seaborn.lmplot with fit_reg=False (which prevents it from automatically doing some regression).
sns.scatterplot(x='carat', y='price', data=df, hue='color', ec=None) also does the same thing.
Selecting hue='color' tells seaborn to split and plot the data based on the unique values in the 'color' column.
sns.lmplot(x='carat', y='price', data=df, hue='color', fit_reg=False)
With pandas.DataFrame.groupby & pandas.DataFrame.plot
If you don't want to use seaborn, use pandas.DataFrame.groupby to split the data by color and then plot each group. You'll have to assign the colors manually as you go; an example is below:
fig, ax = plt.subplots(figsize=(6, 6))
grouped = df.groupby('color')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='carat', y='price', label=key, color=colors[key])
plt.show()
This code assumes the same DataFrame as above, and then groups it based on color. It then iterates over these groups, plotting for each one. To select a color, I've created a colors dictionary, which can map the diamond color (for instance D) to a real color (for instance tab:blue).
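The same per-group loop also works with plain ax.scatter, which keeps the drawing within matplotlib and lets the default color cycle pick a color for each group. A minimal sketch on a small made-up table (the values below are illustrative, not taken from the diamonds dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows standing in for the diamonds data
df = pd.DataFrame({
    'carat': [0.23, 0.31, 0.50, 0.70, 0.90, 1.01],
    'price': [326, 420, 1200, 2400, 3900, 5200],
    'color': ['E', 'D', 'E', 'F', 'D', 'F'],
})

fig, ax = plt.subplots(figsize=(6, 6))
# One scatter call per group; each call takes the next color in the cycle
for key, group in df.groupby('color'):
    ax.scatter(group['carat'], group['price'], label=key)
ax.set(xlabel='carat', ylabel='price')
ax.legend(title='color')
```

Passing label=key on each call means ax.legend can build the legend without any manually created handles.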
Here's a succinct and generic solution to use a seaborn color palette.
First find a color palette you like and optionally visualize it:
sns.palplot(sns.color_palette("Set2", 8))
Then you can use it with matplotlib doing this:
# Unique category labels: 'D', 'F', 'G', ...
color_labels = df['color'].unique()
# List of RGB triplets
rgb_values = sns.color_palette("Set2", 8)
# Map label to RGB
color_map = dict(zip(color_labels, rgb_values))
# Finally use the mapped values
plt.scatter(df['carat'], df['price'], c=df['color'].map(color_map))
I had the same question, and have spent all day trying out different packages.
I had originally used matplotlib, and was not happy with either mapping categories to predefined colors or grouping/aggregating and then iterating through the groups (and still having to map colors). I just felt it was a poor package implementation.
Seaborn wouldn't work on my case, and Altair ONLY works inside of a Jupyter Notebook.
The best solution for me was PlotNine, which "is an implementation of a grammar of graphics in Python, and based on ggplot2".
Below is the plotnine code to replicate your R example in Python:
from plotnine import *
from plotnine.data import diamonds
g = ggplot(diamonds, aes(x='carat', y='price', color='color')) + geom_point(stat='summary')
print(g)
So clean and simple :)
Here is a combination of markers and colors from a qualitative colormap in matplotlib:
import itertools
import numpy as np
from matplotlib import markers
import matplotlib.pyplot as plt
m_styles = markers.MarkerStyle.markers
N = 60
colormap = plt.cm.Dark2.colors # Qualitative colormap
for i, (marker, color) in zip(range(N), itertools.product(m_styles, colormap)):
    plt.scatter(*np.random.random(2), color=color, marker=marker, label=i)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., ncol=4);
Using Altair.
from altair import *
import pandas as pd
df = datasets.load_dataset('iris')
Chart(df).mark_point().encode(x='petalLength',y='sepalLength', color='species')
The easiest way is to simply pass an array of integer category levels to the plt.scatter() color parameter.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv')
plt.scatter(df['carat'], df['price'], c=pd.factorize(df['color'])[0],)
plt.gca().set(xlabel='Carat', ylabel='Price', title='Carat vs. Price')
This creates a plot without a legend, using the default "viridis" colormap. In this case "viridis" is not a good default choice because the colors appear to imply a sequential order rather than purely nominal categories.
To choose your own colormap and add a legend, the simplest approach is this:
import matplotlib.patches
levels, categories = pd.factorize(df['color'])
colors = [plt.cm.tab10(i) for i in levels] # using the "tab10" colormap
handles = [matplotlib.patches.Patch(color=plt.cm.tab10(i), label=c) for i, c in enumerate(categories)]
plt.scatter(df['carat'], df['price'], c=colors)
plt.gca().set(xlabel='Carat', ylabel='Price', title='Carat vs. Price')
plt.legend(handles=handles, title='Color')
I chose the "tab10" discrete (aka qualitative) colormap here, which does a better job at signaling the color factor is a nominal categorical variable.
Extra credit:
In the first plot, the default colors are chosen by passing min-max scaled values from the array of category level ints pd.factorize(df['color'])[0] to the __call__ method of the plt.cm.viridis colormap object.
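That scaling can be checked directly. The sketch below, on a small made-up table, normalizes the integer codes by hand and compares the result with the colors that scatter assigns. It assumes matplotlib's default colormap setting ("viridis") has not been changed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import Normalize

# Made-up sample rows standing in for the diamonds data
df = pd.DataFrame({
    'carat': [0.23, 0.31, 0.50, 0.70, 0.90, 1.01],
    'price': [326, 420, 1200, 2400, 3900, 5200],
    'color': ['E', 'D', 'E', 'F', 'D', 'F'],
})
levels, categories = pd.factorize(df['color'])

# Manual version: min-max scale the codes to [0, 1], then call the colormap
norm = Normalize(vmin=levels.min(), vmax=levels.max())
manual = plt.cm.viridis(norm(levels))

# Automatic version: let scatter map the integer codes itself
fig, ax = plt.subplots()
sc = ax.scatter(df['carat'], df['price'], c=levels)
automatic = sc.to_rgba(levels)

print(np.allclose(manual, automatic))
```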
With df.plot()
Normally when quickly plotting a DataFrame, I use pd.DataFrame.plot(). This takes the index as the x value, the value as the y value and plots each column separately with a different color.
A DataFrame in this form can be achieved by using set_index and unstack.
import matplotlib.pyplot as plt
import pandas as pd
carat = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
price = [100, 100, 200, 200, 300, 300, 400, 400, 500, 500, 600, 600]
color =['D', 'D', 'D', 'E', 'E', 'E', 'F', 'F', 'F', 'G', 'G', 'G',]
df = pd.DataFrame(dict(carat=carat, price=price, color=color))
df.set_index(['color', 'carat']).unstack('color')['price'].plot(style='o')
plt.ylabel('price')
With this method you do not have to manually specify the colors.
This procedure may make more sense for other data series. In my case I have time series data, so the MultiIndex consists of datetime and categories. It is also possible to use this approach with more than one column to color by, but the legend gets messy.
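To make the reshaping step concrete, the sketch below prints the wide table that set_index and unstack produce from the sample data above: one row per carat value and one column per color, with NaN where a combination is missing. The name wide is only illustrative.

```python
import pandas as pd

carat = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
price = [100, 100, 200, 200, 300, 300, 400, 400, 500, 500, 600, 600]
color = ['D', 'D', 'D', 'E', 'E', 'E', 'F', 'F', 'F', 'G', 'G', 'G']
df = pd.DataFrame(dict(carat=carat, price=price, color=color))

# Move 'color' from the row index into the columns
wide = df.set_index(['color', 'carat']).unstack('color')['price']
print(wide)
```

Calling .plot(style='o') on this table draws each color column as its own series, which is why no colors need to be specified by hand.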
You can convert the categorical column into a numerical one by using the commands:
# convert the column to categorical data
cat_col = df['column_name'].astype('category')
# get the integer code of each category
cat_col = cat_col.cat.codes
# use the c parameter to color the points by code
plt.scatter(df['column1'], df['column2'], c=cat_col)
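Because the integer codes carry no labels, a legend has to be built separately. One possible sketch uses PathCollection.legend_elements together with the category names; the column names 'x', 'y' and 'kind' and the 'tab10' colormap are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data with one categorical column
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [2, 1, 4, 3, 6, 5],
                   'kind': ['b', 'a', 'c', 'a', 'b', 'c']})

cat_col = df['kind'].astype('category')
codes = cat_col.cat.codes  # integer codes follow the sorted category order

fig, ax = plt.subplots()
sc = ax.scatter(df['x'], df['y'], c=codes, cmap='tab10')

# One handle per distinct code, in ascending code order
handles, _ = sc.legend_elements()
ax.legend(handles, list(cat_col.cat.categories), title='kind')
```

The handles come back in ascending code order, which matches the order of cat.categories, so the names line up with the colors.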
