Pandas legend for scatter matrix - python

I have a pandas dataframe with 3 classes and datapoints of n features.
The following code produces a scatter matrix with histograms in the diagonal, of 4 of the features in the dataframe.
columns = ['n1','n2','n3','n4']
grr = pd.scatter_matrix(
dataframe[columns], c=y_train, figsize=(15,15), label=['B','N','O'], marker='.',
hist_kwds={'bins':20}, s=10, alpha=.8, cmap='brg')
plt.legend()
plt.show()
like this:
The problem I'm having is that plt.legend() doesn't seem to work: it shows no legend at all (or it's the tiny 'le8' barely visible in the first column of the second row...).
What I'd like to have is a single legend that just shows which color is which class.
I've tried all the suggested questions but none have a solution.
I also tried to put the labels in the legend function parameters like this:
plt.legend(label=['B','N','O'], loc=1)
but to no avail..
What am I doing wrong?

The pandas scatter_matrix is a wrapper for several matplotlib scatter plots. Arguments are passed on to the scatter function. However, the scatter is usually meant to be used with a colormap and not a legend with discrete labeled points, so there is no argument available to create a legend automatically.
I'm afraid you have to create the legend manually. To this end you may create the dots from the scatter using matplotlib's plot function (with empty data) and add them as handles to the legend.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.subplot.right"] = 0.8
v= np.random.rayleigh(size=(30,5))
v[:,4] = np.random.randint(1,4,size=30)/3.
dataframe= pd.DataFrame(v, columns=['n1','n2','n3','n4',"c"])
columns = ['n1','n2','n3','n4']
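# newer pandas versions expose this function as pd.plotting.scatter_matrix
grr = pd.plotting.scatter_matrix(
    dataframe[columns], c=dataframe["c"], figsize=(7,5), label=['B','N','O'], marker='.',
    hist_kwds={'bins':20}, s=10, alpha=.8, cmap='brg')
# one empty line plot per class, used only as a legend handle
handles = [plt.plot([],[],color=plt.cm.brg(i/2.), ls="", marker=".",
                    markersize=np.sqrt(10))[0] for i in range(3)]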
labels=["Label A", "Label B", "Label C"]
plt.legend(handles, labels, loc=(1.02,0))
plt.show()
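A variation on the same idea, as a minimal sketch under stated assumptions: the classes are taken to be integer codes 0..2, the names `classes` and `class_names` are hypothetical, and the legend handles are built with `Line2D` objects instead of empty `plt.plot` calls. The colors are read from the same colormap positions that the scatter plots use, so the legend and the points should agree.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.lines import Line2D

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.rayleigh(size=(30, 4)), columns=["n1", "n2", "n3", "n4"])
classes = rng.integers(0, 3, size=30)   # hypothetical integer class codes 0..2
class_names = ["B", "N", "O"]           # class names from the question
n_classes = len(class_names)

cmap = plt.get_cmap("brg")
# vmin/vmax are forwarded to the underlying scatter calls, pinning code k
# to colormap position k / (n_classes - 1)
axes = pd.plotting.scatter_matrix(
    df, c=classes, cmap=cmap, vmin=0, vmax=n_classes - 1,
    figsize=(7, 5), marker=".", s=10, alpha=0.8, hist_kwds={"bins": 20})

# one marker-only proxy artist per class, colored like the scatter points
handles = [
    Line2D([], [], linestyle="", marker=".", markersize=10,
           color=cmap(k / (n_classes - 1)), label=name)
    for k, name in enumerate(class_names)
]

fig = axes[0, 0].figure
fig.subplots_adjust(right=0.85)          # leave room for the legend
fig.legend(handles=handles, title="class", loc="center right")
fig.savefig("scatter_matrix_legend.png")
```

Using a figure-level legend keeps it out of the individual subplots; whether that placement suits a given figure size is a matter of taste.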

As mentioned in ImportanceOfBeingErnest's answer, scatter plots select colors from a colormap. However, plt.colorbar() does not work with pd.plotting.scatter_matrix. Here's a simple workaround that consists of plotting an image of the colorbar and labeling it with your target names. Below, I use the iris dataset from sklearn as an example:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
_ = pd.plotting.scatter_matrix(df, c=iris.target, figsize=[8,8], s=100, alpha=0.8)
plt.figure()
plt.imshow([np.unique(iris.target)])
_ = plt.xticks(ticks=np.unique(iris.target),labels=iris.target_names)
This generates the following figures:

Related

Overlaying Pandas plot with Matplotlib is sensitive to the plotting order

I have the following problem: I'm trying to overlay two plots: one pandas plot via plot.area() for a dataframe, and a second plot that is a standard Matplotlib plot. Depending on the order of the code for the two, the Matplotlib plot is displayed only if its code comes before the pandas plot.area() call on the same axes.
Example: I have a Pandas dataframe called revenue that has a DateTimeIndex, and a single column with "revenue" values (float). Separately I have a dataset called projection with data along the same index (revenue.index)
If the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Pandas area plot
revenue.plot.area(ax = ax)
# Second -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
plt.tight_layout()
plt.show()
Then the only thing displayed is the pandas plot.area() like this:
1/ Pandas plot.area() and 2/ Matplotlib line plot
However, if the order of the plotting is reversed:
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
# Second -- Pandas area plot
revenue.plot.area(ax = ax)
plt.tight_layout()
plt.show()
Then the plots are overlayed properly, like this:
1/ Matplotlib line plot and 2/ Pandas plot.area()
Can someone please explain what I'm doing wrong, or what I need to do to make the code more robust? TIA.
The values on the x-axis are different in both plots. I think DataFrame.plot.area() formats the DateTimeIndex in a pretty way, which is not compatible with pyplot.plot().
If you plot the projection first, plot.area() can still plot the data and does not format the x-axis.
Mixing the two seems tricky to me, so I would either use pyplot or Dataframe.plot for both the area and the line:
import pandas as pd
from matplotlib import pyplot as plt
projection = [1000, 2000, 3000, 4000]
datetime_series = pd.to_datetime(["2021-12","2022-01", "2022-02", "2022-03"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]})
revenue = revenue.set_index(datetime_index)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Option 1: only pyplot
ax[0].fill_between(revenue.index, revenue.value)
ax[0].plot(revenue.index, projection, color='black', linewidth=3)
ax[0].set_title("Pyplot")
# Option 2: only DataFrame.plot
revenue["projection"] = projection
revenue.plot.area(y='value', ax=ax[1])
revenue.plot.line(y='projection', ax=ax[1], color='black', linewidth=3)
ax[1].set_title("DataFrame.plot")
The results then look like this, where DataFrame.plot gives a much cleaner looking result:
If you do not want the projection in the revenue DataFrame, you can put it in a separate DataFrame and set the index to match revenue:
projection_df = pd.DataFrame({"projection": projection})
projection_df = projection_df.set_index(datetime_index)
projection_df.plot.line(ax=ax[1], color='black', linewidth=3)
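If mixing the two APIs is still preferred, one option that may help is the `x_compat=True` keyword of the pandas plotting functions, which, as far as I understand it, turns off the pandas-specific date handling on the x-axis so that the area plot uses ordinary Matplotlib date values. The sketch below assumes made-up monthly data and plots the pandas area first, then the Matplotlib line, which is the order that failed in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical monthly data standing in for the question's revenue/projection
index = pd.to_datetime(["2021-12-01", "2022-01-01", "2022-02-01", "2022-03-01"])
revenue = pd.DataFrame({"revenue": [1200.0, 2200.0, 2800.0, 4100.0]}, index=index)
projection = [1000, 2000, 3000, 4000]

fig, ax = plt.subplots(figsize=(10, 6))
# First: pandas area plot, asked to keep Matplotlib-compatible x values
revenue.plot.area(ax=ax, x_compat=True)
# Second: plain Matplotlib line on the same axes
(line,) = ax.plot(revenue.index, projection, color="black", linewidth=3)

x_lo, x_hi = ax.get_xlim()
fig.tight_layout()
fig.savefig("overlay_x_compat.png")
```

The trade-off is that the axis loses the pandas date formatting, so tick labels may need a Matplotlib date formatter to look tidy.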

How to change joypy joyplot y-axis labels colors

How do you change the colors of the y-axis labels in a joyplot using joypy package?
Here is some sample code where I can change the color of the x-axis labels, but not the y-axis.
import joypy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
## DATA
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
new_names = ['SepalLength','SepalWidth','PetalLength','PetalWidth','Name']
iris = pd.read_csv(url, names=new_names, skiprows=0, delimiter=',')
## PLOT
fig, axes = joypy.joyplot(iris)
## X AXIS
plt.tick_params(axis='x', colors='red')
## Y AXIS (NOT WORKING)
plt.tick_params(axis='y', colors='red')
I'm pretty sure the issue is that there are multiple sub-y-axes, one for each density plot, and they are actually hidden already.
Not sure how to access the y-axis that is actually shown (I want to change the color of "SepalLength")
Joyplot is using Matplotlib
r-beginners' comment worked for me. If you want to change the colors of all the y-axis labels, you can iterate through them like this:
for ax in axes:
    label = ax.get_yticklabels()
    ax.set_yticklabels(label, fontdict={'color': 'r'})
This results in a warning that you're not supposed to use set_xticklabels() before fixing the tick positions using set_xticks (see documentation here) but with joypy it didn't result in any errors for me.
Here's another solution that just changes the color of the label directly:
for ax in axes:
    label = ax.get_yticklabels()
    label[0].set_color('red')
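A third possibility, sketched here on plain Matplotlib subplots rather than on joypy itself, is to call `tick_params` on every axes object instead of on `plt`, since `plt.tick_params` only touches the current axes. The stacked subplots and their labels below are stand-ins (an assumption) for the axes list that `joypy.joyplot` returns; I have not verified the behavior against joypy directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

names = ["SepalLength", "SepalWidth", "PetalLength"]  # labels from the question's data

# stand-in for the list of axes returned by joypy.joyplot (assumption)
fig, axes = plt.subplots(len(names), 1, sharex=True)
for ax, name in zip(axes, names):
    ax.plot([0, 1], [0, 1])
    ax.set_yticks([0])            # fix the tick position first...
    ax.set_yticklabels([name])    # ...then label it, which avoids the warning

# color the y tick labels on every axes, not just the current one
for ax in axes:
    ax.tick_params(axis="y", labelcolor="red")

fig.canvas.draw()
label_colors = [lbl.get_color() for ax in axes for lbl in ax.get_yticklabels()]
```

`labelcolor` changes only the label text; passing `colors="red"` instead would also recolor the tick marks.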

Is there a way to adjust the axes limits of pairplot(), but not as individual plots?

Is there a way to adjust the axes limits of pairplot(), but not as individual plots? Maybe a setting to produce better axes limits?
I would like to have the plots with a bigger range for the axes. My plots axes allows all the data to be visualized, but it is too 'zoomed in'.
My code is:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
g = sns.pairplot(iris, hue = 'species', diag_kind = 'hist', palette = 'Dark2', plot_kws={"s": 20})
The link for my plot and what I would like to plot to look like is here:
pairplot
To change the subplots, g.map(func, <parameters>) can be used. A small problem is that func needs to accept color as a parameter, and plt.margins() gives an error when color is used. Moreover, map uses x and y to indicate the row and column variables. You could write a dummy function that simply calls plt.margins(), for example g.map(lambda *args, **kwargs: plt.margins(x=0.2, y=0.3)).
An alternative is to loop through g.axes.flat and call ax.margins() on each of them. Note that many axes are shared in x and/or y direction. The diagonal is treated differently; for some reason ax.margins needs to be called a second time on the diagonal.
To have the histogram for the different colors stacked instead of overlapping, diag_kws={"multiple": "stack"} can be set.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
iris = sns.load_dataset('iris')
g = sns.pairplot(iris, hue='species', diag_kind='hist', palette='Dark2',
plot_kws={"s": 20}, diag_kws={"multiple": "stack"})
# g.map(plt.margins, x=0.2, y=0.2) # gives an error
for ax in g.axes.flat:
    ax.margins(x=0.2, y=0.2)
for ax in g.diag_axes:
    ax.margins(y=0.2)
plt.show()
PS: Yet another option is to change the rcParams, which affects all plots created later in the code:
import matplotlib as mpl
mpl.rcParams['axes.xmargin'] = 0.2
mpl.rcParams['axes.ymargin'] = 0.2
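To make the effect of a margin of 0.2 concrete, here is a small worked example on a single plain Matplotlib axes (the data values are made up). A margin is a fraction of the data range added on each side, so data spanning 0 to 10 with `x=0.2` should end up with limits of -2 and 12.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 10], [0, 5])          # x range is 10, y range is 5

ax.margins(x=0.2, y=0.2)          # pad each side by 20% of the data range
xlim = ax.get_xlim()              # 0 - 0.2*10 and 10 + 0.2*10
ylim = ax.get_ylim()              # 0 - 0.2*5  and 5  + 0.2*5

print(xlim, ylim)
```

The same arithmetic applies to each subplot of the pairplot, with the caveat from above that shared axes move together.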

How to plot annotations on every axes of lmplot?

I'm now practicing machine-learning and I would like to add annotations on clustering plots.
Here I'm using soil samples data, and trying to divide them into several groups. When I focus on a specific element I would like to see the correlations of other elements or find out the sample IDs and look them on a map. I'm now trying to put ID on the plots but I'm not sure how to do it with lmplots.
import pandas as pd
import seaborn as sns
sns.set()
data=pd.read_csv(r"E:\Programming\Python\Matplotlib\Geochemi_test3.csv", index_col=0) #reading my dataset (raw string keeps the backslashes literal)
data_x = data.drop(labels=["E","N","B_ppm","Geology","Height"], axis=1)
data_y=data["Geology"]
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(data_x)
X_2D = model.transform(data_x)
data['PCA1'] = X_2D[:, 0]
data['PCA2'] = X_2D[:, 1]
#sns.lmplot("PCA1", "PCA2", data=data, hue="Geology", fit_reg=False)
from sklearn.mixture import GaussianMixture as GMM
model = GMM(n_components=4,covariance_type='full')
model.fit(data_x)
y_gmm = model.predict(data_x)
data['cluster'] = y_gmm
fgrid = sns.lmplot("PCA1", "PCA2", data=data, hue="Se_ppm", col="cluster",fit_reg=False)
ax = fgrid.axes[0,0]
p1=sns.regplot(data=data, x="PCA1", y="PCA2", fit_reg=False, marker="o", scatter_kws={'s':10})
for line in range(0,data.shape[0]):
    p1.text(data.PCA1[line]+0.2, data.PCA2[line], data.index[line], horizontalalignment='left', size='medium', color='black', weight='semibold')
This code produces the following plot.
Is it possible to add annotations on each axes? Here annotations are only shown on the right axes.
As I searched about annotations I only could find plotting on regplot. Can I annotate on lmplot as well which is divided by columns?
The return of lmplot is a FacetGrid. You need to specify each Axes object within the FacetGrid to annotate each one. Something like this:
for ax in fgrid.axes.flat:
    for line in range(0,data.shape[0]):
        ax.text(...)
However, you seem to have overwritten the last Axes object with your regplot call. I'm not sure if that's intentional.
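A slightly fuller sketch of that loop, under some assumptions: the data below is random and only mimics the question's columns (`PCA1`, `PCA2`, `cluster`, with sample IDs in the index), and it relies on the `axes_dict` attribute that recent seaborn versions provide on a FacetGrid. Each facet is annotated only with the samples that belong to its own cluster, rather than with every row of the dataframe.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical stand-in for the question's dataframe
rng = np.random.default_rng(1)
data = pd.DataFrame(
    {
        "PCA1": rng.normal(size=12),
        "PCA2": rng.normal(size=12),
        "cluster": np.repeat([0, 1, 2], 4),
    },
    index=[f"S{i:02d}" for i in range(12)],   # sample IDs
)

fgrid = sns.lmplot(data=data, x="PCA1", y="PCA2", col="cluster", fit_reg=False)

# axes_dict maps each value of the `col` variable to its Axes
for cluster, ax in fgrid.axes_dict.items():
    subset = data[data["cluster"] == cluster]
    for sample_id, row in subset.iterrows():
        ax.text(row["PCA1"] + 0.05, row["PCA2"], str(sample_id),
                horizontalalignment="left", size="small", color="black")

texts_per_facet = [len(ax.texts) for ax in fgrid.axes.flat]
fgrid.savefig("lmplot_annotated.png")
```

Filtering by cluster keeps labels from appearing on facets where the corresponding point is not drawn.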

Get actual numbers instead of normalized value in seaborn KDE plots

I have three dataframes and I plot their KDEs using the seaborn module in Python. The issue is that these plots make the area under the curve equal to 1 (which is how they are intended to work), so the heights in the plots are normalized. Is there any way to show the actual values instead of the normalized ones? Also, is there any way I can find the points of intersection of the curves?
Note: I do not want to use the curve_fit method of scipy as I am not sure about the distribution I will get for each dataframe, it can be multimodal also.
import seaborn as sns
plt.figure()
sns.distplot(data_1['gap'],kde=True,hist=False,label='1')
sns.distplot(data_2['gap'],kde=True,hist=False,label='2')
sns.distplot(data_3['gap'],kde=True,hist=False,label='3')
plt.legend(loc='best')
plt.show()
Output for the code is attached in the link as I can't post images.plot_link
You can just grab the line and rescale its y-values with set_data:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# create some data
n = 1000
x = np.random.rand(n)
# plot stuff
fig, ax = plt.subplots(1,1)
ax = sns.distplot(x, kde=True, hist=False, ax=ax)
# find the line and rescale y-values
children = ax.get_children()
for child in children:
    if isinstance(child, matplotlib.lines.Line2D):
        x, y = child.get_data()
        y *= n
        child.set_data(x,y)
# update y-limits (not done automatically)
ax.set_ylim(y.min(), y.max())
fig.canvas.draw()
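For the intersection part of the question, one possible approach that avoids fitting a parametric distribution is to evaluate the rescaled curves on a common grid and look for sign changes in their difference. The sketch below is only an illustration: it computes a Gaussian KDE by hand with a rule-of-thumb bandwidth (an assumption, so the curves will not match seaborn's exactly), uses made-up normal samples, and the helper name `count_density` is hypothetical.

```python
import numpy as np

def count_density(samples, grid):
    """Gaussian KDE scaled so the area under the curve is len(samples), not 1."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # rule-of-thumb bandwidth (assumption; seaborn's default may differ)
    h = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / h
    # summing the kernels (instead of averaging) gives n times the density
    return np.exp(-0.5 * z**2).sum(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=1000)   # made-up sample 1
b = rng.normal(2.0, 1.0, size=1000)   # made-up sample 2

grid = np.linspace(-5.0, 7.0, 1201)
ya = count_density(a, grid)
yb = count_density(b, grid)

# trapezoidal check: the area under the rescaled curve is close to n
area_a = np.sum((ya[1:] + ya[:-1]) / 2 * np.diff(grid))

# crossings: grid intervals where the difference changes sign,
# refined by linear interpolation inside each interval
diff = ya - yb
idx = np.nonzero(np.sign(diff[:-1]) * np.sign(diff[1:]) < 0)[0]
crossings = grid[idx] - diff[idx] * (grid[idx + 1] - grid[idx]) / (diff[idx + 1] - diff[idx])

print(area_a, crossings)
```

Because nothing here assumes a particular shape, multimodal samples work the same way; they may simply produce several crossings.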
