how to plot legend for scatter points without reorganising array? - python

Labelled points are usually given in X, y form:
X is a multi-dimensional array, and y holds the label/class of each point in X.
What I want to do:
import matplotlib.pyplot as plt
import numpy as np
X = [[0,1],[1,2],[2,3],[3,4]]
X = np.array(X)
y = np.array([0,0,1,2])
myCmap = np.array(['r', 'g', 'b'])
myLabelMap = np.array(['car', 'bicycle', 'plane'])
plt.scatter(X[:, 0], X[:, 1], color=myCmap[y], label=myLabelMap[y])
plt.legend(loc='upper right')
plt.show()
However, this messes up the legend: as you can see in the legend section, it plots all labels for all points.
Is there a way to solve this without splitting X into separate arrays?

First you find out the unique labels, and the points they refer to. You then plot those points with the labels, and the others without labels:
import matplotlib.pyplot as plt
import numpy as np
X = [[0,1],[1,2],[2,3],[3,4]]
X = np.array(X)
y = np.array([0,0,1,2])
myCmap = np.array(['r', 'g', 'b'])
myLabelMap = np.array(['car', 'bicycle', 'plane'])
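# Unique labels and the index of the first point carrying each label
y_unique, id_unique = np.unique(y, return_index=True)
X_unique = X[id_unique]
X = np.asarray(X, dtype=float)  # float dtype so entries can be set to NaN
for j, yj in enumerate(y_unique):
    plt.scatter(X_unique[j, 0], X_unique[j, 1], color=myCmap[yj], label=myLabelMap[yj])
# Blank out the points already plotted so they are not drawn twice
X[id_unique] = np.nan
plt.scatter(X[:, 0], X[:, 1], color=myCmap[y])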
plt.legend(loc='upper right')
plt.show()
See also this question.
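On Matplotlib 3.1 or later, another option that leaves X and y untouched is to pass y as the color values and let legend_elements() build one handle per class. This is only a sketch and not part of the answer above; the use of ListedColormap and the names sc, handles and present are choices made here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

X = np.array([[0, 1], [1, 2], [2, 3], [3, 4]])
y = np.array([0, 0, 1, 2])
myCmap = ListedColormap(['r', 'g', 'b'])
myLabelMap = ['car', 'bicycle', 'plane']

fig, ax = plt.subplots()
# Color each point by its class index; vmin/vmax pin class i to color i.
sc = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=myCmap,
                vmin=0, vmax=len(myLabelMap) - 1)
# With only a few distinct values, one handle is created per unique
# value of y, in ascending order.
handles, _ = sc.legend_elements()
present = np.unique(y)
ax.legend(handles, [myLabelMap[i] for i in present], loc='upper right')
```

Call plt.show() afterwards to display the figure.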

Related

matplotlib.pyplot.scatter - define sizes of entries in legend for size of marker

How can I change the marker sizes and their labels in the legend to meaningful values such as [20, 40, 60, 80]?
Do I need to derive handles and labels from an additional dummy dataset, and if so, how do I plot it so that it is not visible (I assume alpha=0.0 will not work)?
import matplotlib.pyplot as plt
import numpy as np
x = [1,2,3,4,5]
y = [1,2,3,4,5]
size = np.asarray([0.84,0.53,0.24,0.47,0.18]) * 100
s1 = plt.scatter(x, y, s=size)
handles, labels = s1.legend_elements(prop="sizes")
legend2 = plt.legend(handles, labels, frameon=False, title="Sizes")
plt.show()
The function legend_elements(...) has a parameter num= which can be a Locator. So, you can try e.g. a MultipleLocator:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import numpy as np
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]
size = np.asarray([0.84, 0.53, 0.24, 0.47, 0.18]) * 100
s1 = plt.scatter(x, y, s=size)
handles, labels = s1.legend_elements(prop="sizes", num=MultipleLocator(20))
legend2 = plt.legend(handles, labels, frameon=False, title="Sizes")
plt.show()
PS: In this case, you can also just pass a number for num=, e.g. s1.legend_elements(prop="sizes", num=4). That also seems to produce rounded values. When only a few different size values are used in the plot, the default num='auto' uses these values instead of rounded values.
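Besides an integer or a Locator, num= is documented to accept an array-like of explicit values, which maps directly onto the [20, 40, 60, 80] asked for. A minimal sketch, assuming Matplotlib 3.1 or later (where legend_elements was introduced):

```python
import numpy as np
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]
size = np.asarray([0.84, 0.53, 0.24, 0.47, 0.18]) * 100

fig, ax = plt.subplots()
s1 = ax.scatter(x, y, s=size)
# Explicit legend values; entries outside the range of the plotted
# sizes (here 18 to 84) would be dropped.
handles, labels = s1.legend_elements(prop="sizes", num=[20, 40, 60, 80])
ax.legend(handles, labels, frameon=False, title="Sizes")
```

The labels come back as mathtext strings, so they contain the numbers rather than being equal to them.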

3D PCA in matplotlib: how to add legend?

I am attempting to use http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html for my own data to construct a 3D PCA plot. The tutorial, however, did not specify how I can add a legend. Another page, https://matplotlib.org/users/legend_guide.html did, but I cannot see how I can apply the information in the second tutorial to the first.
How can I modify the code below to add a legend?
# Code source: Gaël Varoquaux
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets
np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data#the floating point values
y = iris.target#unsigned integers specifying group
fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)
for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.spectral,
           edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
plt.show()
There are some issues with the other answer on which neither the OP, nor the answerer seem to be clear about; this is hence not a complete answer, but rather an appendix to the existing answer.
The spectral colormap has been removed from matplotlib in version 2.2,
use Spectral or nipy_spectral or any other valid colormap.
Any colormap in matplotlib ranges from 0 to 1. If you call it with any value outside that range,
it will just give you the outermost color. To get a color from a colormap you hence need to normalize the values.
This is done via a Normalize instance. In this case this is internal to scatter.
Hence use sc = ax.scatter(...) and then sc.cmap(sc.norm(value)) to get a value according to the same mapping that is used within the scatter.
Therefore the code should rather use
[sc.cmap(sc.norm(i)) for i in [1, 2, 0]]
The legend is outside the figure. The figure is 4 x 3 inches in size (figsize=(4, 3)).
The axes takes 95% of that space in width (rect=[0, 0, .95, 1]).
The call to legend places the legend's right center point at 1.7 times the axes width = 4*0.95*1.7 = 6.46 inches. (bbox_to_anchor=(1.7,0.5)).
Alternative suggestion from my side: Make the figure larger (figsize=(5.5, 3)), such that the legend will fit in, make the axes take only 70% of the figure width, such that you have 30% left for the legend. Position the legend's left side close to the axes boundary (bbox_to_anchor=(1.0, .5)).
For more on this topic see How to put the legend out of the plot.
The reason you still see the complete figure including the legend in a jupyter notebook is that jupyter will just save everything inside the canvas, even if it overlaps and thereby enlarge the figure.
In total the code may then look like
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
import numpy as np; np.random.seed(5)
from sklearn import decomposition, datasets
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data  # the floating point values
y = iris.target  # unsigned integers specifying group
fig = plt.figure(figsize=(5.5, 3))
# The axes take 70% of the figure width, leaving 30% for the legend.
# add_axes is used because recent matplotlib no longer adds Axes3D(fig, ...) to the figure automatically.
ax = fig.add_axes([0, 0, .7, 1], projection='3d', elev=48, azim=134)
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)
labelTups = [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]
for name, label in labelTups:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)  # np.float was removed in NumPy 1.24
sc = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap="Spectral", edgecolor='k')
# ax.w_xaxis and friends were removed in matplotlib 3.8; ax.xaxis works everywhere
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
# Same value-to-color mapping that scatter uses internally
colors = [sc.cmap(sc.norm(i)) for i in [1, 2, 0]]
custom_lines = [plt.Line2D([], [], ls="", marker='.',
                           mec='k', mfc=c, mew=.1, ms=20) for c in colors]
ax.legend(custom_lines, [lt[0] for lt in labelTups],
          loc='center left', bbox_to_anchor=(1.0, .5))
plt.show()
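The normalization point above can be checked in isolation. The sketch below uses a small 2D scatter with made-up values 0, 1 and 2 (an assumption for illustration, not the iris data) to show that a colormap called with a float outside [0, 1] returns its end color, and that sc.norm rescales the data range to [0, 1] before the lookup.

```python
import matplotlib.pyplot as plt

cmap = plt.get_cmap("Spectral")

# Called with floats, a colormap expects values in [0, 1];
# anything above 1 yields the same color as 1.0.
assert cmap(2.0) == cmap(1.0)

fig, ax = plt.subplots()
sc = ax.scatter([0, 1, 2], [0, 1, 2], c=[0.0, 1.0, 2.0], cmap="Spectral")

# scatter autoscaled its norm to the data range (0 to 2 here), so
# sc.norm maps 0 -> 0.0, 1 -> 0.5 and 2 -> 1.0 before the lookup.
colors = [sc.cmap(sc.norm(v)) for v in (0.0, 1.0, 2.0)]
```

With class values 0, 1 and 2, this rescaling is exactly a division by 2.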
Needed a few tweaks (plt.cm.spectral is the danged weirdest colormap I've ever dealt with), but it seems to be good now:
from matplotlib.lines import Line2D
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from sklearn import decomposition
from sklearn import datasets
np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data#the floating point values
y = iris.target#unsigned integers specifying group
fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)
labelTups = [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]
for name, label in labelTups:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.spectral, edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
colors = [plt.cm.spectral(np.float(i/2)) for i in [1, 2, 0]]
custom_lines = [Line2D([0], [0], linestyle="none", marker='.', markeredgecolor='k', markerfacecolor=c, markeredgewidth=.1, markersize=20) for c in colors]
ax.legend(custom_lines, [lt[0] for lt in labelTups], loc='right', bbox_to_anchor=(1.7, .5))
plt.show()
Here's a link to an online Jupyter notebook with a live version of the script (requires an account for rerunning, though).
Short explanation
You're trying to add three legend markers for a single plot, which is nonstandard behavior. Thus, you need to manually create the shapes that your legend will display.
Longer explanation
This line of code recreates the colors you used in your plot:
colors = [plt.cm.spectral(np.float(i/2)) for i in [1, 2, 0]]
and then this line of code draws some appropriate-looking dots that we'll eventually display on your legend:
custom_lines = [Line2D([0], [0], linestyle="none", marker='.', markeredgecolor='k', markerfacecolor=c, markeredgewidth=.1, markersize=20) for c in colors]
The first two args are just the (internal) x and y coords of the single dot that will be drawn, linestyle="none" suppresses the line that Line2D would normally draw by default, and the rest of the args create and style the dot itself (referred to as a marker in the terminology of the matplotlib api).
Finally, this statement actually creates the legend:
ax.legend(custom_lines, [lt[0] for lt in labelTups], loc='right', bbox_to_anchor=(1.7, .5))
The first arg is of course a list of the dots we just drew, and the second arg is a list of the labels (one per dot). The remaining two args tell matplotlib where to draw the actual box containing the legend. The last arg, bbox_to_anchor, is basically a way to manually fiddle with the positioning of the legend, which I had to do since matplotlib support for 3D anything is still a little behind the curve. On 2D plots you typically don't need it, and, since matplotlib usually does a decent job of automatically positioning the legend on 2D plots in the first place, you often don't even need the loc arg either.
Some colormap weirdness
Don't quite know what was going on with plt.cm.spectral, but in order to get it to behave, for every value I fed it I had to:
a) first cast the value to float
b) then divide the value by 2
a) does occur explicitly in the OP's original code, right before they plot. The divide by 2 thing, I don't know where that comes from. Somehow the call to ax.scatter is implicitly normalizing all of the y values so that the maximum is 1? I guess?
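The manual-handle idea from this answer can be sketched on a plain 2D plot. The colors and coordinates below are arbitrary values chosen for illustration; only the Line2D and legend calls mirror the answer.

```python
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

names = ['Setosa', 'Versicolour', 'Virginica']
colors = ['tab:blue', 'tab:orange', 'tab:green']  # arbitrary illustration colors

fig, ax = plt.subplots()
# A single scatter call: by itself it yields at most one legend entry.
ax.scatter([1, 2, 3], [1, 2, 3], c=colors, edgecolor='k')

# Handles that are never added to the axes; they exist only so the
# legend has one marker per class to draw.
proxies = [Line2D([], [], linestyle='none', marker='.', markersize=20,
                  markeredgecolor='k', markerfacecolor=c) for c in colors]
legend = ax.legend(proxies, names, loc='center left', bbox_to_anchor=(1.0, 0.5))
```

Because the Line2D objects are only passed to legend(), nothing extra is drawn in the plot area itself.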

Matplotlib - understanding color values

I found a piece of code which passes a 1D NumPy array to Matplotlib. The values of the array are either 1 or 0, but the plotted graph uses yellow and purple. I am unable to find any documentation about this.
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
num_observations = 5000
x1 = np.random.multivariate_normal([0, 0], [[1, .85],[.85, 1]], num_observations) # mean, covariance
x2 = np.random.multivariate_normal([1, 4], [[1, .85],[.85, 1]], num_observations)
features = np.vstack((x1, x2)).astype(np.float32)
labels = np.hstack((np.zeros(num_observations),np.ones(num_observations)))
plt.figure(figsize=(12,8))
plt.scatter(features[:, 0], features[:, 1],
            c=labels, alpha=.4)
plt.show()
Can anyone explain how we are getting the colors as yellow and violet? Relevant Documentation would also help.
It's using the default viridis colormap, so purple represents 0 and yellow represents 1. See here for more about colormaps: https://matplotlib.org/examples/color/colormaps_reference.html.
Adding a colorbar helps here. Adding one to your example is easy:
import numpy as np
import matplotlib.pyplot as plt
num_observations = 5000
x1 = np.random.multivariate_normal([0, 0], [[1, .85],[.85, 1]], num_observations) # mean, covariance
x2 = np.random.multivariate_normal([1, 4], [[1, .85],[.85, 1]], num_observations)
features = np.vstack((x1, x2)).astype(np.float32)
labels = np.hstack((np.zeros(num_observations),np.ones(num_observations)))
plt.figure(figsize=(12,8))
p = plt.scatter(features[:, 0], features[:, 1],
                c=labels, alpha=.4)
plt.colorbar(p)
plt.show()
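To see the two end colors directly, the colormap can be queried at 0.0 and 1.0. This is a small check added for illustration; it assumes the default colormap has not been changed from viridis in the local configuration.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

viridis = plt.get_cmap("viridis")
low = viridis(0.0)   # RGBA used for the smallest value (label 0)
high = viridis(1.0)  # RGBA used for the largest value (label 1)
print(to_hex(low), to_hex(high))  # prints #440154 #fde725
```

The first hex code is a dark purple and the second a yellow, matching the two clusters in the plot.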

Matplotlib PCA sample not working after altering dimensions

I am trying to learn how to use matplotlib.mlab.PCA. Below I have the following code:
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.mlab import PCA as mlabPCA
from mpl_toolkits.mplot3d import Axes3D, proj3d
np.random.seed(234234782384239784)
DIMENSIONS = 3
mu_vec1 = np.array([0 for i in xrange(DIMENSIONS)])
cov_mat1 = np.identity(DIMENSIONS)
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20).T
assert class1_sample.shape == (DIMENSIONS, 20)
mu_vec2 = np.array([3 for i in xrange(DIMENSIONS)])
cov_mat2 = np.identity(DIMENSIONS)
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, 20).T
assert class2_sample.shape == (DIMENSIONS, 20)
# Combine the two together
all_samples = np.vstack([class1_sample.T, class2_sample.T])
all_samples = all_samples.T
assert all_samples.shape == (DIMENSIONS, 40)
mlab_pca = mlabPCA(all_samples.T)
# 2d plotting
plt.plot(mlab_pca.Y[0:20, 0],
         mlab_pca.Y[0:20, 1],
         'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(mlab_pca.Y[20:40, 0],
         mlab_pca.Y[20:40, 1],
         '^', markersize=7, color='red', alpha=0.5, label='class2')
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.legend()
plt.title('Transformed samples with class labels from matplotlib.mlab.PCA()')
plt.show()
As you can see, PCA works pretty well and I get the following graph:
However, when I try to change DIMENSIONS = 100 (I am trying to simulate spectral data analysis), I am getting this error:
RuntimeError: we assume data in a is organized with numrows>numcols
"Ok sure, I can just apply PCA onto the transpose of this matrix instead." I told myself naively.
DIMENSIONS = 100
...
mlab_pca = mlabPCA(all_samples)
plt.plot(mlab_pca.Y[0, 0:20],
         mlab_pca.Y[1, 0:20],
         'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(mlab_pca.Y[0, 20:40],
         mlab_pca.Y[1, 20:40],
         '^', markersize=7, color='red', alpha=0.5, label='class2')
...
My resulting plot looks completely off!
Am I doing something wrong? Or is adding that many dimensions actually messing up my data?
I would not expect the points to separate. PCA(X) is not the same thing as PCA(X.T).T
It seems that requiring numrows > numcols is a limitation of matplotlib PCA.
Both R's prcomp and Python's sklearn PCA can take matrices with either numrows > numcols or numcols > numrows.
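As a sketch of the sklearn route mentioned above, the following builds the same kind of data (two classes of 20 samples in 100 dimensions, unit variance, means 0 and 3) and projects it without transposing. The random seed and variable names are choices made here, not taken from the question.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
DIMENSIONS = 100

# Rows are samples, columns are features: 40 rows, 100 columns.
class1 = rng.normal(loc=0.0, scale=1.0, size=(20, DIMENSIONS))
class2 = rng.normal(loc=3.0, scale=1.0, size=(20, DIMENSIONS))
all_samples = np.vstack([class1, class2])

# sklearn accepts more columns than rows, so no transpose is needed.
projected = PCA(n_components=2).fit_transform(all_samples)
```

projected[:20] and projected[20:] can then be plotted with plt.plot exactly as in the 3-dimensional version.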

scatter plots with string arrays in matplotlib

This seems like it should be an easy one, but I can't figure it out. I have a pandas data frame and would like to do a 3D scatter plot with 3 of the columns. The X and Y columns are not numeric, they are strings, but I don't see how this should be a problem.
X = myDataFrame.columnX.values  # string
Y = myDataFrame.columnY.values  # string
Z = myDataFrame.columnZ.values  # float (assumed column name)
fig = pl.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X, Y, np.log10(Z), s=20, c='b')
pl.show()
Isn't there an easy way to do this? Thanks.
You could use np.unique(..., return_inverse=True) to get representative ints for each string. For example,
In [117]: uniques, X = np.unique(['foo', 'baz', 'bar', 'foo', 'baz', 'bar'], return_inverse=True)
In [118]: X
Out[118]: array([2, 1, 0, 2, 1, 0])
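Note that X has an integer dtype: np.unique returns the inverse indices as np.intp, which is int32 on a 32-bit build (as in the session above) and int64 on most 64-bit builds. With int32 indices, at most 2**31 unique strings can be represented.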
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as axes3d
N = 12
arr = np.arange(N*2).reshape(N,2)
words = np.array(['foo', 'bar', 'baz', 'quux', 'corge'])
df = pd.DataFrame(words[arr % 5], columns=list('XY'))
df['Z'] = np.linspace(1, 1000, N)
Z = np.log10(df['Z'])
Xuniques, X = np.unique(df['X'], return_inverse=True)
Yuniques, Y = np.unique(df['Y'], return_inverse=True)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(X, Y, Z, s=20, c='b')
ax.set(xticks=range(len(Xuniques)), xticklabels=Xuniques,
       yticks=range(len(Yuniques)), yticklabels=Yuniques)
plt.show()
Scatter does this automatically now (from at least matplotlib 2.1.0):
plt.scatter(['A', 'B', 'B', 'C'], [0, 1, 2, 1])
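A quick way to see what scatter does with the strings on a regular 2D axes is to inspect the plotted positions. This is a sketch for illustration only and makes no claim about 3D axes.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sc = ax.scatter(['A', 'B', 'B', 'C'], [0, 1, 2, 1])

# Each distinct string gets an integer position in order of first
# appearance, so repeated labels share the same x coordinate.
positions = sc.get_offsets()[:, 0]
```

Here 'A', 'B' and 'C' end up at x positions 0, 1 and 2.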
Try converting the characters to numbers for the plotting and then use the characters again for the axis labels.
Using hash
You could use the hash function for the conversion;
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
xlab = myDataFrame.columnX.values
ylab = myDataFrame.columnY.values
X = [hash(l) for l in xlab]
Y = [hash(l) for l in ylab]
Z = myDataFrame.columnZ.values  # float (assumed column name)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X, Y, np.log10(Z), s=20, c='b')
ax.set_xticks(X)
ax.set_xticklabels(xlab)
ax.set_yticks(Y)
ax.set_yticklabels(ylab)
plt.show()
As M4rtini has pointed out in the comments, it's not clear what the spacing/scaling of string coordinates should be; the hash function could give unexpected spacings.
Nondegenerate uniform spacing
If you wanted to have the points uniformly spaced then you would have to use a different conversion.
For example you could use
X =[i for i in range(len(xlab))]
though that would cause each point to have a unique x-position even if the label is the same, and the x and y points would be correlated if you used the same approach for Y.
Degenerate uniform spacing
A third alternative is to first get the unique members of xlab (using e.g. set) and then map each xlab to a position using the unique set for the mapping; e.g.
xmap = dict((sn, i) for i, sn in enumerate(set(xlab)))
X = [xmap[l] for l in xlab]
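Put together with tick labels, the third alternative might look like the 2D sketch below. The sample labels and values are hypothetical, and sorted() is added here so that the label-to-position mapping is reproducible, since a bare set has no fixed order.

```python
import matplotlib.pyplot as plt

xlab = ['foo', 'baz', 'bar', 'foo', 'baz', 'bar']  # hypothetical string column
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]                # hypothetical numeric column

# One integer position per distinct label, in alphabetical order.
unique_labels = sorted(set(xlab))
xmap = {label: i for i, label in enumerate(unique_labels)}
X = [xmap[label] for label in xlab]

fig, ax = plt.subplots()
ax.scatter(X, z, s=20, c='b')
# Put the original strings back on the axis at the mapped positions.
ax.set_xticks(range(len(unique_labels)))
ax.set_xticklabels(unique_labels)
```

Points with the same label share an x position, and the positions are evenly spaced.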
