Plotting correlation heatmaps with Seaborn FacetGrid - python

I am trying to create a single image with heatmaps representing the correlation of features of data points for each label separately. With seaborn I can create a heatmap for a single class like so
grouped = df.groupby('target')
sns.heatmap(grouped.get_group('Class_1').corr())
And I get this, which makes sense:
But then I try to make a list of all the labels like so:
g = sns.FacetGrid(df, col='target')
g.map(lambda grp: sns.heatmap(grp.corr()))
And sadly I get this which makes no sense to me:

Turns out you can do it pretty concisely with just seaborn if you use map_dataframe instead of map:
g = sns.FacetGrid(df, col='target')
g.map_dataframe(lambda data, color: sns.heatmap(data.corr(), linewidths=0))
@mwaskom points out in his comment that it might be a good idea to explicitly set the limits of the colormap so that the different facets can be compared more directly. The documentation describes the relevant heatmap parameters:
vmin, vmax : floats, optional
Values to anchor the colormap, otherwise they are inferred from the data and other keyword arguments.
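A minimal sketch of the map_dataframe approach with a fixed color scale, so every facet uses the same range. The data frame here is synthetic, the helper name corr_heatmap is hypothetical, and dropping the label column before calling corr() is an assumption made so that only numeric features are correlated:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the asker's data: four features plus a class label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(90, 4)), columns=list("abcd"))
df["target"] = np.repeat(["Class_1", "Class_2", "Class_3"], 30)

def corr_heatmap(data, **kwargs):
    """Draw the feature correlation matrix of one facet's subset."""
    kwargs.pop("color", None)  # FacetGrid passes a color that heatmap does not accept
    features = data.drop(columns="target")
    # Correlations live in [-1, 1], so anchoring vmin/vmax makes facets comparable.
    sns.heatmap(features.corr(), vmin=-1, vmax=1, cmap="coolwarm",
                cbar=False, **kwargs)

g = sns.FacetGrid(df, col="target")
g.map_dataframe(corr_heatmap)
```

With vmin and vmax fixed, the same color means the same correlation in every panel; without them each facet infers its own range from its data.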

Without FacetGrid, but making a corr heatmap for each group in a column:
import pandas as pd
import seaborn as sns
from numpy.random import randint
import matplotlib.pyplot as plt
df = pd.DataFrame(randint(0,10,(200,12)),columns=list('abcdefghijkl'))
grouped = df.groupby('a')
rowlength = grouped.ngroups // 2  # integer division; fix up if odd number of groups
fig, axs = plt.subplots(figsize=(9,4), nrows=2, ncols=rowlength)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
    sns.heatmap(grouped.get_group(key).corr(), ax=ax,
                xticklabels=(i >= rowlength),
                yticklabels=(i%rowlength==0),
                cbar=False) # Use cbar_ax into single side axis
    ax.set_title('a=%d'%key)
plt.show()
Maybe there's a way to set up a lambda to correctly pass the data from the g.facet_data() generator through corr before going to heatmap.
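The two loose ends in the code comments above (an odd number of groups, and a single colorbar on the side) could be handled roughly as follows. This is only a sketch on synthetic data: the colorbar position passed to fig.add_axes is an arbitrary choice, and the grouping column is dropped before computing correlations, which is an assumption rather than what the code above does.

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with exactly 5 groups in column 'a' (an odd count).
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 10, size=(200, 5)), columns=list("bcdef"))
df["a"] = np.arange(200) % 5

grouped = df.groupby("a")
ncols = math.ceil(grouped.ngroups / 2)  # round up so every group gets a cell
fig, axs = plt.subplots(nrows=2, ncols=ncols, figsize=(9, 4))
fig.subplots_adjust(right=0.9)  # leave room for the shared colorbar
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])  # arbitrary position on the right

for i, ((key, group), ax) in enumerate(zip(grouped, axs.flat)):
    corr = group.drop(columns="a").corr()
    # Draw the colorbar once; a fixed vmin/vmax makes it valid for all panels.
    sns.heatmap(corr, ax=ax, vmin=-1, vmax=1,
                cbar=(i == 0), cbar_ax=cbar_ax if i == 0 else None)
    ax.set_title(f"a={key}")

# Hide any leftover cell when the group count is odd.
for ax in axs.flat[grouped.ngroups:]:
    ax.set_visible(False)
```

Sharing one colorbar only makes sense because every heatmap uses the same vmin and vmax.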

Related

Is there a way to adjust the axes limits of pairplot(), but not as individual plots?

Is there a way to adjust the axes limits of pairplot(), but not as individual plots? Maybe a setting to produce better axes limits?
I would like the plots to have a bigger range for the axes. My plots' axes allow all the data to be visualized, but they are too 'zoomed in'.
My code is:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
g = sns.pairplot(iris, hue = 'species', diag_kind = 'hist', palette = 'Dark2', plot_kws={"s": 20})
The link for my plot and what I would like the plot to look like is here:
pairplot
To change the subplots, g.map(func, <parameters>) can be used. A small problem is that func needs to accept color as a parameter, and plt.margins() gives an error when color is passed. Moreover, map uses x and y to indicate the row and column variables. You could write a dummy function that simply calls plt.margins(), for example g.map(lambda *args, **kwargs: plt.margins(x=0.2, y=0.3)).
An alternative is to loop through g.axes.flat and call ax.margins() on each of them. Note that many axes are shared in x and/or y direction. The diagonal is treated differently; for some reason ax.margins needs to be called a second time on the diagonal.
To have the histogram for the different colors stacked instead of overlapping, diag_kws={"multiple": "stack"} can be set.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
iris = sns.load_dataset('iris')
g = sns.pairplot(iris, hue='species', diag_kind='hist', palette='Dark2',
                 plot_kws={"s": 20}, diag_kws={"multiple": "stack"})
# g.map(plt.margins, x=0.2, y=0.2) # gives an error
for ax in g.axes.flat:
    ax.margins(x=0.2, y=0.2)
for ax in g.diag_axes:
    ax.margins(y=0.2)
plt.show()
PS: another option is to change the rcParams, which affects all plots created later in the code:
import matplotlib as mpl
mpl.rcParams['axes.xmargin'] = 0.2
mpl.rcParams['axes.ymargin'] = 0.2
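If changing the margins globally is too broad, the same rcParams can be applied temporarily with matplotlib's rc_context. The sketch below uses a plain line plot rather than a pairplot to keep it small; the data values are made up for illustration.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

# Axes created inside the context pick up the larger margins...
with mpl.rc_context({"axes.xmargin": 0.2, "axes.ymargin": 0.2}):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    padded_xlim = ax.get_xlim()

# ...while axes created afterwards fall back to the previous settings.
fig2, ax2 = plt.subplots()
ax2.plot([0, 1, 2], [0, 1, 4])
default_xlim = ax2.get_xlim()

print(padded_xlim, default_xlim)
```

The margin setting is read when the axes are created, so the figure has to be built inside the with block for it to take effect.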

How to set seaborn color palette for multiple categories?

I am passing a pandas dataframe to be plotted with sns.scatterplot and want to use the 'bright' color palette. The color is to be determined by values in an integer Series I pass as hue to the plotting function.
The problem is that this only works when the hue Series has exactly two distinct values. When it has only one, or more than two, different values, the plot defaults to a beige-to-purple color palette.
When setting the color palette using sns.set_palette('bright'), everything happens as described above. But when I pass palette='bright' inside the plotting function call (and n_classes != 2), I get an explicit ValueError:
ValueError: Palette {} not understood
Here is the code for reproducing:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('bright') # first method
n_classes = 3
a = np.arange(10)
b = np.random.randn(10)
c = np.random.randint(n_classes, size=10)
s = pd.DataFrame({'A': a, 'B':b, 'C': c})
sns.scatterplot(data=s, x='A', y='B', hue='C')
plt.show()
For the second method simply change the scatterplot call to
sns.scatterplot(data=s, x='A', y='B', hue='C', palette='bright')
Is there a way to get multiple hue levels in the palette I want? Am I doing anything wrong or is this a bug?
You need to pass the number of colors to the palette, something like this:
sns.scatterplot(data=s,
                x='A',
                y='B',
                hue='C',
                palette=sns.color_palette('bright', s.C.unique().shape[0])
                )
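A related variation, shown here as a sketch rather than as the answer's method, is to build an explicit mapping from each hue level to a color. A dict palette pins every class to a specific color regardless of how many levels appear in a given plot. The data is synthetic and the variable names levels and palette are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
s = pd.DataFrame({"A": np.arange(10),
                  "B": rng.normal(size=10),
                  "C": np.arange(10) % 3})  # three integer classes

# One 'bright' color per distinct hue value, keyed by the value itself.
levels = sorted(s["C"].unique())
palette = dict(zip(levels, sns.color_palette("bright", n_colors=len(levels))))

ax = sns.scatterplot(data=s, x="A", y="B", hue="C", palette=palette)
```

Because the keys are the hue values, the same class keeps the same color across several plots even if one of them is missing a class.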

How to plot annotations on every axes of lmplot?

I'm now practicing machine-learning and I would like to add annotations on clustering plots.
Here I'm using soil sample data and trying to divide the samples into several groups. When I focus on a specific element, I would like to see the correlations with other elements, or find the sample IDs and look them up on a map. I'm now trying to put the IDs on the plots, but I'm not sure how to do it with lmplot.
import pandas as pd
import seaborn as sns
sns.set()
data=pd.read_csv(r"E:\Programming\Python\Matplotlib\Geochemi_test3.csv", index_col=0) #reading my dataset (raw string keeps the backslashes literal)
data_x = data.drop(labels=["E","N","B_ppm","Geology","Height"], axis=1)
data_y=data["Geology"]
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(data_x)
X_2D = model.transform(data_x)
data['PCA1'] = X_2D[:, 0]
data['PCA2'] = X_2D[:, 1]
#sns.lmplot("PCA1", "PCA2", data=data, hue="Geology", fit_reg=False)
from sklearn.mixture import GaussianMixture as GMM
model = GMM(n_components=4,covariance_type='full')
model.fit(data_x)
y_gmm = model.predict(data_x)
data['cluster'] = y_gmm
fgrid = sns.lmplot(x="PCA1", y="PCA2", data=data, hue="Se_ppm", col="cluster",fit_reg=False)
ax = fgrid.axes[0,0]
p1=sns.regplot(data=data, x="PCA1", y="PCA2", fit_reg=False, marker="o", scatter_kws={'s':10})
for line in range(0,data.shape[0]):
    p1.text(data.PCA1[line]+0.2, data.PCA2[line], data.index[line], horizontalalignment='left', size='medium', color='black', weight='semibold')
Running this code gives me this plot.
Is it possible to add annotations on each axes? Here annotations are only shown on the right axes.
When I searched for annotations, I could only find examples that annotate a regplot. Can I also annotate an lmplot that is divided into columns?
The return of lmplot is a FacetGrid. You need to specify each Axes object within the FacetGrid to annotate each one. Something like this:
for ax in fgrid.axes.flat:  # axes is a 2-D array, so iterate over its flattened view
    for line in range(0,data.shape[0]):
        ax.text(...)
However, you seem to have overwritten the last Axes object with your regplot call. I'm not sure if that's intentional.
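As a rough sketch of what per-facet annotation might look like, the example below labels only the points that belong to each column's cluster. The data is synthetic (the sample IDs S0 to S11 and the cluster assignment are made up), and pairing fgrid.col_names with fgrid.axes.flat assumes a single row of facets without column wrapping surprises.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the PCA result: 12 samples in 3 clusters.
rng = np.random.default_rng(0)
data = pd.DataFrame({"PCA1": rng.normal(size=12),
                     "PCA2": rng.normal(size=12),
                     "cluster": np.arange(12) % 3},
                    index=[f"S{i}" for i in range(12)])

fgrid = sns.lmplot(data=data, x="PCA1", y="PCA2", col="cluster", fit_reg=False)

# Each facet shows one cluster, so annotate it with that cluster's rows only.
for cluster, ax in zip(fgrid.col_names, fgrid.axes.flat):
    subset = data[data["cluster"] == cluster]
    for sample_id, row in subset.iterrows():
        ax.text(row["PCA1"] + 0.05, row["PCA2"], sample_id,
                horizontalalignment="left", size="small")
```

Filtering by cluster avoids writing every sample ID onto every panel, which is what a plain loop over all rows would do.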

Matplotlib: Identify bars in bar plot based on criteria

The code below:
import pandas as pd
import matplotlib.pyplot as plt
data = [['Apple',10],['Banana',15],['Kiwi',11],['Orange',17]]
df = pd.DataFrame(data,columns=['Fruit','Quantity'])
df.set_index('Fruit', inplace=True)
df.plot.bar(color='gray',rot=0)
plt.show()
gives the following output:
I would like to plot bars in red color for the top two quantity fruits i.e., Orange and Banana. How can I do that? Instead of giving a fixed threshold value to change color, I would prefer if my plot is robust enough to identify top two bars.
There might be a straightforward and simpler way but I was able to come up with the following solution which would work in principle for any number of top n values. The idea is:
First get the top n elements (n=2 in the example below) from the DataFrame using nlargest
Then, loop over the x-tick labels and change the color of the patches (bars) for those values which are the largest using an if statement to get their index. Here we created an axis instance ax to be able to extract the patches for setting the colors.
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data = [['Apple',10],['Banana',15],['Kiwi',11],['Orange',17]]
df = pd.DataFrame(data,columns=['Fruit','Quantity'])
df.set_index('Fruit', inplace=True)
df.plot.bar(color='gray',rot=0, ax=ax)
top = df['Quantity'].nlargest(2).keys() # Top 2 values here
for i, tick in enumerate(ax.get_xticklabels()):
    if tick.get_text() in top:
        ax.patches[i].set_color('r')
plt.show()
Plotting a colored bar plot
The problem is that pandas bar plots apply the color argument column-wise, and here you have a single column. Hence the canonical attempt to color the bars individually does not work:
pd.DataFrame([12,14]).plot.bar(color=["red", "green"])
A workaround is to create a diagonal matrix instead of a single column and plot it with the stacked=True option.
df = pd.DataFrame([12,14])
df = pd.DataFrame(np.diag(df[0].values), index=df.index, columns=df.index)
df.plot.bar(color=["red", "green"], stacked=True)
Another option is to use matplotlib instead.
df = pd.DataFrame([12,14])
plt.bar(df.index, df[0].values, color=color)
Choosing the colors according to values
Now the question remains on how to create a list of the colors to use in either of the two solutions above. Given a dataframe df you can create an array of equal length to the frame and fill it with the default color, then you can set those entries of the two highest values to another color:
color = np.array(["gray"]*len(df))
color[np.argsort(df["Quantity"])[-2:]] = "red"
Solution:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = [['Apple',10],['Banana',15],['Kiwi',11],['Orange',17]]
df = pd.DataFrame(data,columns=['Fruit','Quantity'])
df.set_index('Fruit', inplace=True)
color = np.array(["gray"]*len(df))
color[np.argsort(df["Quantity"])[-2:]] = "red"
plt.bar(df.index, df["Quantity"], color=color)  # pass the 1-D column, not the 2-D df.values
plt.show()
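One caveat with the NumPy approach: an array built from "gray" has a fixed string width of four characters, so assigning a longer color name such as "orange" would be silently truncated. A plain Python list avoids that. The following is a sketch that combines nlargest with a list comprehension; the variable names top and colors are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = [['Apple', 10], ['Banana', 15], ['Kiwi', 11], ['Orange', 17]]
df = pd.DataFrame(data, columns=['Fruit', 'Quantity']).set_index('Fruit')

# Index labels of the two largest quantities.
top = df['Quantity'].nlargest(2).index

# One color per bar, in the same order as the rows of df.
colors = ['red' if fruit in top else 'gray' for fruit in df.index]

fig, ax = plt.subplots()
bars = ax.bar(df.index, df['Quantity'], color=colors)
```

Changing the argument of nlargest is enough to highlight a different number of bars.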

python facetgrid with sns.barplot and map; target no overlapping group bars

I am currently implementing a code for facetgrid with subplots of barplots with two different groups ('type'), respectively. I am intending to get a plot, where the different groups are not stacked and not overlapping. I am using following code
g = sns.FacetGrid(data,
col='C',
hue = 'type',
sharex=False,
sharey=False,
size=7,
palette=sns.color_palette(['red','green']),
)
g = g.map(sns.barplot, 'A', 'B').add_legend()
The data is a pandas long format df with following example structure:
data=pd.DataFrame({'A':['X','X','Y','Y','X','X','Y','Y'],
'B':[0,1,2,3,4,5,6,7],
'C':[1,1,1,1,2,2,2,2],
'type':['ctrl','cond1','ctrl','cond1','ctrl','cond1','ctrl','cond1']}
)
In the resulting barplots the two groups fully overlap, so ctrl is missing (see below). However, I want neighbouring, non-overlapping bars for each group. How can I achieve that? My real code has more bars per plot, where you can see the overlapping colors (here one fully covers the other).
This answer shows how to use FacetGrid directly.
But, if you have 0.9.0 installed, I would recommend you make use of the new catplot() function that will produce the right (at least I think?) plot. Note that this function returns a FacetGrid object. You can pass kwargs to the call to customize the resulting FacetGrid, or modify its properties afterwards.
g = sns.catplot(data=data, x='A', y='B', hue='type', col='C', kind='bar')
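For illustration, here is a sketch of the catplot call on the example data from the question, with a couple of keyword arguments forwarded to the grid and some properties changed afterwards. The palette mapping, the height, and the axis label text are arbitrary choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import pandas as pd
import seaborn as sns

data = pd.DataFrame({'A': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y'],
                     'B': [0, 1, 2, 3, 4, 5, 6, 7],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2],
                     'type': ['ctrl', 'cond1', 'ctrl', 'cond1',
                              'ctrl', 'cond1', 'ctrl', 'cond1']})

# hue is handled inside each barplot, so the two types are drawn side by side.
g = sns.catplot(data=data, x='A', y='B', hue='type', col='C', kind='bar',
                palette={'ctrl': 'red', 'cond1': 'green'},
                sharey=False, height=4)

# The returned object is a FacetGrid, so its methods are available afterwards.
g.set_axis_labels('A', 'mean of B')
g.set_titles(col_template='C = {col_name}')
```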
I think you want to provide the hue argument to the barplot, not to the FacetGrid, because the grouping takes place within each (single) barplot, not at the facet level.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
data=pd.DataFrame({'A':['X','X','Y','Y','X','X','Y','Y'],
'B':[0,1,2,3,4,5,6,7],
'C':[1,1,1,1,2,2,2,2],
'type':['ctrl','cond1','ctrl','cond1','ctrl','cond1','ctrl','cond1']})
g = sns.FacetGrid(data,
col='C',
sharex=False,
sharey=False,
height=4)
g = g.map(sns.barplot, 'A', 'B', "type",
hue_order=np.unique(data["type"]),
order=["X", "Y"],
palette=sns.color_palette(['red','green']))
g.add_legend()
plt.show()
