I'm practicing machine learning and would like to add annotations to clustering plots.
I'm using soil sample data and trying to divide the samples into several groups. When I focus on a specific element, I would like to see its correlations with other elements, or find the sample IDs and look them up on a map. I'm trying to put the IDs on the plots, but I'm not sure how to do that with lmplot.
import pandas as pd
import seaborn as sns
sns.set()
data = pd.read_csv(r"E:\Programming\Python\Matplotlib\Geochemi_test3.csv", index_col=0)  # read my dataset (raw string keeps the backslashes literal)
data_x = data.drop(labels=["E","N","B_ppm","Geology","Height"], axis=1)
data_y=data["Geology"]
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(data_x)
X_2D = model.transform(data_x)
data['PCA1'] = X_2D[:, 0]
data['PCA2'] = X_2D[:, 1]
#sns.lmplot("PCA1", "PCA2", data=data, hue="Geology", fit_reg=False)
from sklearn.mixture import GaussianMixture as GMM
model = GMM(n_components=4,covariance_type='full')
model.fit(data_x)
y_gmm = model.predict(data_x)
data['cluster'] = y_gmm
fgrid = sns.lmplot(data=data, x="PCA1", y="PCA2", hue="Se_ppm", col="cluster", fit_reg=False)
ax = fgrid.axes[0,0]
p1=sns.regplot(data=data, x="PCA1", y="PCA2", fit_reg=False, marker="o", scatter_kws={'s':10})
for line in range(0,data.shape[0]):
    p1.text(data.PCA1[line]+0.2, data.PCA2[line], data.index[line], horizontalalignment='left', size='medium', color='black', weight='semibold')
Running this code gives me this plot.
Is it possible to add annotations to each axes? Here the annotations are only shown on the rightmost axes.
When I searched for annotation examples I could only find ones for regplot. Can I also annotate an lmplot that is split into columns?
lmplot returns a FacetGrid. You need to address each Axes object within the FacetGrid to annotate each one. Note that fgrid.axes is a 2D array, so iterate over fgrid.axes.flat. Something like this:
for ax in fgrid.axes.flat:
    for line in range(0,data.shape[0]):
        ax.text(...)
However, your regplot call draws onto the last Axes of the grid, which is why the annotations only appear there. I'm not sure if that's intentional.
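As a rough illustration of the loop above, here is a minimal sketch that labels each facet with only the points belonging to it. The data frame `demo`, its sample IDs and the 0.05 text offset are made up for the example; only the PCA1, PCA2 and cluster column names come from the question.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the soil data: 12 samples in 3 clusters,
# indexed by made-up sample IDs.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "PCA1": rng.normal(size=12),
        "PCA2": rng.normal(size=12),
        "cluster": np.repeat([0, 1, 2], 4),
    },
    index=[f"S{i:02d}" for i in range(12)],
)

fgrid = sns.lmplot(data=demo, x="PCA1", y="PCA2", col="cluster", fit_reg=False)

# One Axes per column facet; col_names lists the facet values in the same order.
for cluster_value, ax in zip(fgrid.col_names, fgrid.axes.flat):
    subset = demo[demo["cluster"] == cluster_value]
    for sample_id, row in subset.iterrows():
        # Small horizontal offset so the label does not sit on the marker.
        ax.text(row["PCA1"] + 0.05, row["PCA2"], sample_id,
                horizontalalignment="left", size="small", color="black")
```

Filtering by the facet value keeps each panel's labels in step with the points drawn there. Call plt.show() (from matplotlib.pyplot) to display the figure.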
Related
I have used FacetGrid() from the seaborn module to break a line graph into segments with labels for each region as the title of each subplot. I saw the option in the documentation to have the x-axes be independent. However, I could not find anything related to having the plot sizes correspond to the size of each axis.
The code I used to generate this plot, along with the plot, are found below.
import matplotlib.pyplot as plt
import seaborn as sns
# Added during Edit 1.
sns.set()
graph = sns.FacetGrid(rmsf_crys, col = "Subunit", sharex = False)
graph.map(plt.plot, "Seq", "RMSF")
graph.set_titles(col_template = '{col_name}')
plt.show()
Plot resulting from the above code
Edit 1
Updated plot code using relplot() instead of calling FacetGrid() directly. The final result is the same graph.
import matplotlib.pyplot as plt
import seaborn as sns
# Forgot to include this in the original code snippet.
sns.set()
graph = sns.relplot(data = rmsf_crys, x = "Seq", y = "RMSF",
                    col = "Subunit", kind = "line",
                    facet_kws = dict(sharex=False))
graph.set_titles(col_template = '{col_name}')
plt.show()
Full support for this would need to live at the matplotlib layer, and I don't believe it's currently possible to have independent axes but shared transforms. (Someone with deeper knowledge of the matplotlib scale internals may prove me wrong).
But you can get pretty close by calculating the x range you'll need ahead of time and using that to parameterize the gridspec for the facets:
import numpy as np, seaborn as sns
tips = sns.load_dataset("tips")
xranges = tips.groupby("size")["total_bill"].agg(np.ptp)
xranges *= 1.1 # Account for default margins
sns.relplot(
    data=tips, kind="line",
    x="total_bill", y="tip",
    col="size", col_order=xranges.index,
    height=3, aspect=.65,
    facet_kws=dict(sharex=False, gridspec_kws=dict(width_ratios=xranges))
)
Is there a way to adjust the axes limits of pairplot(), but not as individual plots? Maybe a setting to produce better axes limits?
I would like the plots to have a bigger range on the axes. The current axes limits show all the data, but the plots are too 'zoomed in'.
My code is:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
g = sns.pairplot(iris, hue = 'species', diag_kind = 'hist', palette = 'Dark2', plot_kws={"s": 20})
Here is a link to my plot and to what I would like the plot to look like:
pairplot
To change the subplots, g.map(func, <parameters>) can be used. A small problem is that func needs to accept color as a parameter, and plt.margins() raises an error when color is passed. Moreover, map uses x and y for the row and column variables, which clashes with the x and y keywords of plt.margins(). You could write a dummy function that simply calls plt.margins(), for example g.map(lambda *args, **kwargs: plt.margins(x=0.2, y=0.3)).
An alternative is to loop through g.axes.flat and call ax.margins() on each of them. Note that many axes are shared in x and/or y direction. The diagonal is treated differently; for some reason ax.margins needs to be called a second time on the diagonal.
To have the histogram for the different colors stacked instead of overlapping, diag_kws={"multiple": "stack"} can be set.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
iris = sns.load_dataset('iris')
g = sns.pairplot(iris, hue='species', diag_kind='hist', palette='Dark2',
plot_kws={"s": 20}, diag_kws={"multiple": "stack"})
# g.map(plt.margins, x=0.2, y=0.2) # gives an error
for ax in g.axes.flat:
    ax.margins(x=0.2, y=0.2)
for ax in g.diag_axes:
    ax.margins(y=0.2)
plt.show()
PS: yet another option is to change the rcParams, which will affect all plots created later in the code:
import matplotlib as mpl
mpl.rcParams['axes.xmargin'] = 0.2
mpl.rcParams['axes.ymargin'] = 0.2
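If changing the margins globally is too broad, the same two rcParams can be scoped to a block with matplotlib's rc_context. This is only a sketch using plain matplotlib axes with toy data; the names ax_in and ax_out are made up for the example.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# Axes created inside the context pick up the wider margins.
with mpl.rc_context({"axes.xmargin": 0.2, "axes.ymargin": 0.2}):
    fig_in, ax_in = plt.subplots()
    ax_in.plot([0, 1], [0, 1])

# Axes created afterwards fall back to the previous settings.
fig_out, ax_out = plt.subplots()
ax_out.plot([0, 1], [0, 1])

print(ax_in.margins(), ax_out.margins())
```

The margins are read when an Axes is created, so a pairplot would have to be created inside the with block for the setting to apply to it.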
I am creating a violinplot using the following code:
import seaborn as sns
ax = sns.violinplot(data=df[['SoundProduction','SoundForecast','diff']])
ax.set_ylabel("Sound power level [dB(A)]")
It gives me the following result:
Is there any way I can plot diff on a second y-axis so that all three series become clearly visible?
Also, is there a way to plot a vertical line in between 2 series? In this case I want a vertical line between SoundForecast and diff once they are plotted on two different axes.
You can achieve this using multiple subplots, which are easily set up using plt.subplots.
This allows you to display your distributions on appropriate scales without wasting display space. Most of seaborn's axes-level plotting functions accept the ax= argument, so you can choose the axes where the plot will be rendered. The axes also have clear separations between them.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# generate some random distribution data
n = 800 # samples
prod = 95 + 5 * np.random.beta(0.6, 0.5, size=n) # a bimodal distribution
forecast = prod + 3*np.random.randn(n) # forecast is noisy estimate around the "true" production
diff = prod-forecast # should be with mu 0 sigma 3
df = pd.DataFrame(np.array([prod, forecast, diff]).T, columns=['SoundProduction','SoundForecast','diff'])
# set up two subplots, with one wider than the other
fig, ax = plt.subplots(1,2, num=1, gridspec_kw={'width_ratios':[2,1]})
# plot violin distribution estimates separately so the y-scaling makes sense in each group
sns.violinplot(data=df[['SoundProduction','SoundForecast']], ax=ax[0])
sns.violinplot(data=df[['diff']], ax=ax[1])
I have a pandas dataframe with 3 classes and datapoints of n features.
The following code produces a scatter matrix with histograms in the diagonal, of 4 of the features in the dataframe.
columns = ['n1','n2','n3','n4']
grr = pd.plotting.scatter_matrix(
    dataframe[columns], c=y_train, figsize=(15,15), label=['B','N','O'], marker='.',
    hist_kwds={'bins':20}, s=10, alpha=.8, cmap='brg')
plt.legend()
plt.show()
like this:
The problem I'm having is that plt.legend() doesn't seem to work: it shows no legend at all (or it's the tiny 'le8' barely visible in the first column of the second row...)
What I'd like to have is a single legend that just shows which color is which class.
I've tried all the suggested questions but none have a solution.
I also tried to put the labels in the legend function parameters like this:
plt.legend(label=['B','N','O'], loc=1)
but to no avail..
What am I doing wrong?
The pandas scatter_matrix is a wrapper for several matplotlib scatter plots. Arguments are passed on to the scatter function. However, the scatter is usually meant to be used with a colormap and not a legend with discrete labeled points, so there is no argument available to create a legend automatically.
I'm afraid you have to create the legend manually. To this end you may create the dots from the scatter using matplotlib's plot function (with empty data) and add them as handles to the legend.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.subplot.right"] = 0.8
v= np.random.rayleigh(size=(30,5))
v[:,4] = np.random.randint(1,4,size=30)/3.
dataframe= pd.DataFrame(v, columns=['n1','n2','n3','n4',"c"])
columns = ['n1','n2','n3','n4']
grr = pd.plotting.scatter_matrix(
    dataframe[columns], c=dataframe["c"], figsize=(7,5), label=['B','N','O'], marker='.',
    hist_kwds={'bins':20}, s=10, alpha=.8, cmap='brg')
handles = [plt.plot([],[],color=plt.cm.brg(i/2.), ls="", marker=".",
                    markersize=np.sqrt(10))[0] for i in range(3)]
labels=["Label A", "Label B", "Label C"]
plt.legend(handles, labels, loc=(1.02,0))
plt.show()
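A variation on the same idea, as a sketch only: build the handles as Line2D proxy artists and derive their colors from the same colormap and normalization the scatter uses, so the legend colors stay in step with the points. The random data, the integer class codes 0 to 2 and the names features, classes and class_names are assumptions for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from matplotlib.lines import Line2D

# Made-up data: 4 features and an integer class code (0, 1, 2) per row.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.rayleigh(size=(30, 4)), columns=["n1", "n2", "n3", "n4"])
classes = rng.integers(0, 3, size=30)
classes[:3] = [0, 1, 2]  # make sure every class occurs at least once
class_names = ["B", "N", "O"]

cmap = plt.get_cmap("brg")
# scatter autoscales the colormap to the min and max of c; mirror that here.
norm = Normalize(vmin=classes.min(), vmax=classes.max())

axes = pd.plotting.scatter_matrix(
    features, c=classes, cmap=cmap, marker=".", s=10, alpha=0.8,
    hist_kwds={"bins": 20}, figsize=(7, 5))

# One empty proxy artist per class, colored like the corresponding points.
handles = [Line2D([], [], linestyle="", marker=".", color=cmap(norm(code)))
           for code in range(len(class_names))]
legend = axes[0, 0].figure.legend(handles, class_names, loc="center right")
```

Attaching the legend to the figure rather than to one subplot keeps it out of the individual panels.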
As mentioned in ImportanceOfBeingErnest's answer, scatter plots select colors from a colormap. However, plt.colorbar() does not work with pd.plotting.scatter_matrix. Here's a simple workaround that consists of plotting an image of the colorbar and labeling it with your target names. Below, I use the iris dataset from sklearn as an example:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
_ = pd.plotting.scatter_matrix(df, c=iris.target, figsize=[8,8], s=100, alpha=0.8)
plt.figure()
plt.imshow([np.unique(iris.target)])
_ = plt.xticks(ticks=np.unique(iris.target),labels=iris.target_names)
This generates the following figures:
I am trying to create a single image with heatmaps representing the correlation of features of data points for each label separately. With seaborn I can create a heatmap for a single class like so
grouped = df.groupby('target')
sns.heatmap(grouped.get_group('Class_1').corr())
And I get this, which makes sense:
But then I try to make a list of all the labels like so:
g = sns.FacetGrid(df, col='target')
g.map(lambda grp: sns.heatmap(grp.corr()))
And sadly I get this which makes no sense to me:
Turns out you can do it pretty concisely with just seaborn if you use map_dataframe instead of map:
g = sns.FacetGrid(df, col='target')
g.map_dataframe(lambda data, color: sns.heatmap(data.corr(), linewidths=0))
@mwaskom points out in his comment that it might be a good idea to explicitly set the limits of the colormap so that the different facets can be more directly compared. The documentation describes the relevant heatmap parameters:
vmin, vmax : floats, optional
Values to anchor the colormap, otherwise they are inferred from the data and other keyword arguments.
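To make the suggestion concrete, here is a minimal sketch that anchors every facet to the same color scale. Correlations lie in [-1, 1], so those are used as vmin and vmax. The random data, the helper name corr_heatmap and the choice of the coolwarm colormap are assumptions for the example.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: four numeric features and a three-level 'target' label.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(90, 4)), columns=list("abcd"))
demo["target"] = np.repeat(["Class_1", "Class_2", "Class_3"], 30)

def corr_heatmap(data, color=None, **_ignored):
    """Draw the correlation heatmap of one facet on the current axes."""
    corr = data.drop(columns="target").corr()
    # Same limits in every facet so colors are directly comparable.
    sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", cbar=False)

g = sns.FacetGrid(demo, col="target")
g.map_dataframe(corr_heatmap)
```

With fixed limits, a given color means the same correlation value in every panel, which is not guaranteed when each heatmap infers its own range.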
Without FacetGrid, but making a corr heatmap for each group in a column:
import pandas as pd
import seaborn as sns
from numpy.random import randint
import matplotlib.pyplot as plt
df = pd.DataFrame(randint(0,10,(200,12)),columns=list('abcdefghijkl'))
grouped = df.groupby('a')
rowlength = (grouped.ngroups + 1) // 2 # integer division, rounded up for an odd number of groups
fig, axs = plt.subplots(figsize=(9,4), nrows=2, ncols=rowlength)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
    sns.heatmap(grouped.get_group(key).corr(), ax=ax,
                xticklabels=(i >= rowlength),
                yticklabels=(i%rowlength==0),
                cbar=False) # Use cbar_ax into single side axis
    ax.set_title('a=%d'%key)
plt.show()
Maybe there's a way to set up a lambda to correctly pass the data from the g.facet_data() generator through corr before going to heatmap.