Hexbin plot in PairGrid with Seaborn - python

I am trying to get a hexbin plot in a Seaborn Grid. I have the following code,
# Works in Jupyter with Python 2 Kernel.
%matplotlib inline
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
# Borrowed from http://stackoverflow.com/a/31385996/4099925
def hexbin(x, y, color, **kwargs):
cmap = sns.light_palette(color, as_cmap=True)
plt.hexbin(x, y, gridsize=15, cmap=cmap, extent=[min(x), max(x), min(y), max(y)], **kwargs)
g = sns.PairGrid(tips, hue='sex')
g.map_diag(plt.hist)
g.map_lower(sns.stripplot, jitter=True, alpha=0.5)
g.map_upper(hexbin)
However, that gives me the following image,
How can I fix the hexbin plots in such a way that they cover the entire surface of the graph and not just a subset of the shown plot area?

There are (at least) three problems with what you are trying to do here.
stripplot is for data where at least one axis is categorical. This is not true in this case. Seaborn guesses that the x axis is the categorical one which messes up the x axes of your subplots. From the docs for stripplot:
Draw a scatterplot where one variable is categorical.
In my suggested code below I have changed it to a simple scatter plot.
Drawing two hexbin-plots on top of eachother will only show the latter one. I added some alpha=0.5 to the hexbin arguments, but the result is far from pretty.
The extent parameter in your code adjusted the hexbin plot to x and y of each sex one at a time. But both of the hexbin plots need to be equal in size so they should use min/max of an entire series over both sexes. To achieve this I passed in the minimum and maximum values for all series to the hexbin function which can then pick and use the relevant ones.
Here is what I came up with:
# Works in Jupyter with Python 2 Kernel.
%matplotlib inline
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
# Borrowed from http://stackoverflow.com/a/31385996/4099925
def hexbin(x, y, color, max_series=None, min_series=None, **kwargs):
cmap = sns.light_palette(color, as_cmap=True)
ax = plt.gca()
xmin, xmax = min_series[x.name], max_series[x.name]
ymin, ymax = min_series[y.name], max_series[y.name]
plt.hexbin(x, y, gridsize=15, cmap=cmap, extent=[xmin, xmax, ymin, ymax], **kwargs)
g = sns.PairGrid(tips, hue='sex')
g.map_diag(plt.hist)
g.map_lower(plt.scatter, alpha=0.5)
g.map_upper(hexbin, min_series=tips.min(), max_series=tips.max(), alpha=0.5)
And here is the result:

Related

Removing legend from mpl parallel coordinates plot?

I have a parallel coordinates plot with lots of data points so I'm trying to use a continuous colour bar to represent that, which I think I have worked out. However, I haven't been able to remove the default key that is put in when creating the plot, which is very long and hinders readability. Is there a way to remove this table to make the graph much easier to read?
This is the code I'm currently using to generate the parallel coordinates plot:
parallel_coordinates(data[[' male_le','
female_le','diet','activity','obese_perc','median_income']],'median_income',colormap = 'rainbow',
alpha = 0.5)
fig, ax = plt.subplots(figsize=(6, 1))
fig.subplots_adjust(bottom=0.5)
cmap = mpl.cm.rainbow
bounds = [0.00,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
norm = mpl.colors.BoundaryNorm(bounds, cmap.N,)
plt.colorbar(mpl.cm.ScalarMappable(norm = norm, cmap=cmap),cax = ax, orientation = 'horizontal',
label = 'normalised median income', alpha = 0.5)
plt.show()
Current Output:
I want my legend to be represented as a color bar, like this:
Any help would be greatly appreciated. Thanks.
You can use ax.legend_.remove() to remove the legend.
The cax parameter of plt.colorbar indicates the subplot where to put the colorbar. If you leave it out, matplotlib will create a new subplot, "stealing" space from the current subplot (subplots are often referenced to by ax in matplotlib). So, here leaving out cax (adding ax=ax isn't necessary, as here ax is the current subplot) will create the desired colorbar.
The code below uses seaborn's penguin dataset to create a standalone example.
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
from pandas.plotting import parallel_coordinates
penguins = sns.load_dataset('penguins')
fig, ax = plt.subplots(figsize=(10, 4))
cmap = plt.get_cmap('rainbow')
bounds = np.arange(penguins['body_mass_g'].min(), penguins['body_mass_g'].max() + 200, 200)
norm = mpl.colors.BoundaryNorm(bounds, 256)
penguins = penguins.dropna(subset=['body_mass_g'])
parallel_coordinates(penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']],
'body_mass_g', colormap=cmap, alpha=0.5, ax=ax)
ax.legend_.remove()
plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cmap),
ax=ax, orientation='horizontal', label='body mass', alpha=0.5)
plt.show()

Showing a correct legend when doing scatter plot with palette

Stupid way to plot a scatter plot
Suppose I have a data with 3 classes, the following code can give me a perfect graph with a correct legend, in which I plot out data class by class.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
import numpy as np
X, y = make_blobs()
X0 = X[y==0]
X1 = X[y==1]
X2 = X[y==2]
ax = plt.subplot(1,1,1)
ax.scatter(X0[:,0],X0[:,1], lw=0, s=40)
ax.scatter(X1[:,0],X1[:,1], lw=0, s=40)
ax.scatter(X2[:,0],X2[:,1], lw=0, s=40)
ax.legend(['0','1','2'])
Better way to plot a scatter plot
However, if I have a dataset with 3000 classes, the above method doesn't work anymore. (You won't expect me to write 3000 line corresponding to each class, right?)
So I come up with the following plotting code.
num_classes = len(set(y))
palette = np.array(sns.color_palette("hls", num_classes))
ax = plt.subplot(1,1,1)
ax.scatter(X[:,0], X[:,1], lw=0, s=40, c=palette[y.astype(np.int)])
ax.legend(['0','1','2'])
This code is perfect, we can plot out all the classes with only 1 line. However, the legend is not showing correctly this time.
Question
How to maintain a correct legend when we plot graphs by using the following?
ax.scatter(X[:,0], X[:,1], lw=0, s=40, c=palette[y.astype(np.int)])
plt.legend() works best when you have multiple "artists" on the plot. That is the case in your first example which is why calling plt.legend(labels) works effortlessly.
If you are worried about writing lots of lines of code then you can take advantage of for loops.
As we can see with this example using 5 classes:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import numpy as np
X, y = make_blobs(centers=5)
ax = plt.subplot(1,1,1)
for c in np.unique(y):
ax.scatter(X[y==c,0],X[y==c,1],label=c)
ax.legend()
np.unique() returns a sorted array of the unique elements of y, by looping through these and plotting each class with its own artist plt.legend() can easily provide a legend.
Edit:
You can also assign labels to the plots as you make them which is probably safer.
plt.scatter(..., label=c) followed by plt.legend()
Why not simply do the following?
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
import numpy as np
X, y = make_blobs()
ngroups = 3
ax = plt.subplot(1, 1, 1)
for i in range(ngroups):
ax.scatter(X[y==i][:,0], X[y==i][:,1], lw=0, s=40, label=i)
ax.legend()

Create a discrete colorbar in matplotlib

I've tried the other threads, but can't work out how to solve. I'm attempting to create a discrete colorbar. Much of the code appears to be working, a discrete bar does appear, but the labels are wrong and it throws the error: "No mappable was found to use for colorbar creation. First define a mappable such as an image (with imshow) or a contour set (with contourf)."
Pretty sure the error is because I'm missing an argument in plt.colorbar, but not sure what it's asking for or how to define it.
Below is what I have. Any thoughts gratefully received:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
norm = mpl.colors.BoundaryNorm(np.arange(-0.5,4), cmap.N)
ex2 = sample_data.plot.scatter(x='order_count', y='total_value',c='cluster', marker='+', ax=ax, cmap='plasma', norm=norm, s=100, edgecolor ='none', alpha=0.70)
plt.colorbar(ticks=np.linspace(0,3,4))
plt.show()
Indeed, the fist argument to colorbar should be a ScalarMappable, which would be the scatter plot PathCollection itself.
Setup
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"x" : np.linspace(0,1,20),
"y" : np.linspace(0,1,20),
"cluster" : np.tile(np.arange(4),5)})
cmap = mpl.colors.ListedColormap(["navy", "crimson", "limegreen", "gold"])
norm = mpl.colors.BoundaryNorm(np.arange(-0.5,4), cmap.N)
Pandas plotting
The problem is that pandas does not provide you access to this ScalarMappable directly. So one can catch it from the list of collections in the axes, which is easy if there is only one single collection present: ax.collections[0].
fig, ax = plt.subplots()
df.plot.scatter(x='x', y='y', c='cluster', marker='+', ax=ax,
cmap=cmap, norm=norm, s=100, edgecolor ='none', alpha=0.70, colorbar=False)
fig.colorbar(ax.collections[0], ticks=np.linspace(0,3,4))
plt.show()
Matplotlib plotting
One could consider using matplotlib directly to plot the scatter in which case you would directly use the return of the scatter function as argument to colorbar.
fig, ax = plt.subplots()
scatter = ax.scatter(x='x', y='y', c='cluster', marker='+', data=df,
cmap=cmap, norm=norm, s=100, edgecolor ='none', alpha=0.70)
fig.colorbar(scatter, ticks=np.linspace(0,3,4))
plt.show()
Output in both cases is identical.

Using seaborn and contourf, how can I plot gridlines?

Using the following code, the first contour plot has grid lines. For the second plot, I have imported seaborn, but the grid lines don't show up. What do I need to add to make the grid lines show on the second plot.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
dx=0.05
x=np.arange(0,5+dx,dx)
y=x
X,Y = np.meshgrid(x,y)
Z = np.sin(X)**10+np.cos(10+Y*Y)*np.cos(X)
nbins=10
levels=mpl.ticker.MaxNLocator(nbins=nbins).tick_values(Z.min(),Z.max())
plt.figure()
plt.contourf(x,y,Z,levels=levels)
plt.colorbar()
plt.grid('on')
import seaborn as sns
sns.set_context("notebook")
sns.set_style("whitegrid")
plt.figure()
plt.contourf(x,y,Z,levels=levels)
plt.colorbar()
plt.grid('on')
plt.show()
You either need to change either the axes.axisbelow rc parameter or the zorder of the contourf plot. So you could do
sns.set(context="notebook", style="whitegrid",
rc={"axes.axisbelow": False})
When you set up the style or
plt.contourf(x, y, Z, levels=levels, zorder=0)
When you draw the plot.

matplotlib colorbar for scatter

I'm working with data that has the data has 3 plotting parameters: x,y,c. How do you create a custom color value for a scatter plot?
Extending this example I'm trying to do:
import matplotlib
import matplotlib.pyplot as plt
cm = matplotlib.cm.get_cmap('RdYlBu')
colors=[cm(1.*i/20) for i in range(20)]
xy = range(20)
plt.subplot(111)
colorlist=[colors[x/2] for x in xy] #actually some other non-linear relationship
plt.scatter(xy, xy, c=colorlist, s=35, vmin=0, vmax=20)
plt.colorbar()
plt.show()
but the result is TypeError: You must first set_array for mappable
From the matplotlib docs on scatter 1:
cmap is only used if c is an array of floats
So colorlist needs to be a list of floats rather than a list of tuples as you have it now.
plt.colorbar() wants a mappable object, like the CircleCollection that plt.scatter() returns.
vmin and vmax can then control the limits of your colorbar. Things outside vmin/vmax get the colors of the endpoints.
How does this work for you?
import matplotlib.pyplot as plt
cm = plt.cm.get_cmap('RdYlBu')
xy = range(20)
z = xy
sc = plt.scatter(xy, xy, c=z, vmin=0, vmax=20, s=35, cmap=cm)
plt.colorbar(sc)
plt.show()
Here is the OOP way of adding a colorbar:
fig, ax = plt.subplots()
im = ax.scatter(x, y, c=c)
fig.colorbar(im, ax=ax)
If you're looking to scatter by two variables and color by the third, Altair can be a great choice.
Creating the dataset
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame(40*np.random.randn(10, 3), columns=['A', 'B','C'])
Altair plot
from altair import *
Chart(df).mark_circle().encode(x='A',y='B', color='C').configure_cell(width=200, height=150)
Plot

Categories

Resources