plotting results of hierarchical clustering on top of a matrix of data - python

How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python? An example is the following figure:
This is Figure 6 from: A panel of induced pluripotent stem cells from chimpanzees: a resource for comparative functional genomics
I use scipy.cluster.hierarchy.dendrogram to make my dendrogram and perform hierarchical clustering on a matrix of data. How can I then plot the data as a matrix where the rows have been reordered to reflect the clustering induced by cutting the dendrogram at a particular threshold, and have the dendrogram plotted alongside the matrix? I know how to plot the dendrogram in scipy, but not how to plot the intensity matrix of data with the right scale bar next to it.

The question does not define matrix very well: "matrix of values", "matrix of data". I assume that you mean a distance matrix. In other words, element D_ij in the symmetric nonnegative N-by-N distance matrix D denotes the distance between two feature vectors, x_i and x_j. Is that correct?
If so, then try this (edited June 13, 2010, to reflect two different dendrograms).
Tested in python 3.10 and matplotlib 3.5.1
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform
# Generate random features and distance matrix.
np.random.seed(200) # for reproducible data
x = np.random.rand(40)
D = np.zeros([40, 40])
for i in range(40):
    for j in range(40):
        D[i, j] = abs(x[i] - x[j])
condensedD = squareform(D)
# Compute and plot first dendrogram.
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_axes([0.09, 0.1, 0.2, 0.6])
Y = sch.linkage(condensedD, method='centroid')
Z1 = sch.dendrogram(Y, orientation='left')
ax1.set_xticks([])
ax1.set_yticks([])
# Compute and plot second dendrogram.
ax2 = fig.add_axes([0.3, 0.71, 0.6, 0.2])
Y = sch.linkage(condensedD, method='single')
Z2 = sch.dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])
# Plot distance matrix.
axmatrix = fig.add_axes([0.3, 0.1, 0.6, 0.6])
idx1 = Z1['leaves']
idx2 = Z2['leaves']
D = D[idx1,:]
D = D[:,idx2]
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu)
axmatrix.set_xticks([]) # remove axis labels
axmatrix.set_yticks([]) # remove axis labels
# Plot colorbar.
axcolor = fig.add_axes([0.91, 0.1, 0.02, 0.6])
plt.colorbar(im, cax=axcolor)
plt.show()
fig.savefig('dendrogram.png')
Edit: For different colors, adjust the cmap argument passed to matshow. See the scipy/matplotlib docs for examples. That page also describes how to create your own colormap. For convenience, I recommend using a preexisting colormap. In my example, I used YlGnBu.
Edit: add_axes (see documentation here) accepts a list or tuple: (left, bottom, width, height). For example, (0.5,0,0.5,1) adds an Axes on the right half of the figure. (0,0.5,1,0.5) adds an Axes on the top half of the figure.
Most people probably use add_subplot for its convenience. I like add_axes for its control.
To remove the border, use add_axes([left,bottom,width,height], frame_on=False). See example here.
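The question also asks about the clustering obtained by cutting the dendrogram at a threshold, which the code above does not show. A minimal sketch of that step, assuming a feature matrix X, average linkage, and a cut height picked purely for illustration (none of these come from the original answer), could look like this:

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hypothetical data: 40 observations with 5 features each.
rng = np.random.default_rng(200)
X = rng.random((40, 5))

# Hierarchical clustering on the condensed pairwise distances.
Y = sch.linkage(pdist(X), method='average')

# Cut the tree at an illustrative height (70% of the tallest merge).
threshold = 0.7 * Y[:, 2].max()
labels = sch.fcluster(Y, t=threshold, criterion='distance')

# Leaf order of the dendrogram; reordering rows this way keeps the
# members of each flat cluster next to each other.
order = sch.leaves_list(Y)
X_ordered = X[order, :]
labels_ordered = labels[order]
```

X_ordered can then be passed to matshow in place of the reordered distance matrix, and labels_ordered tells you where one cluster ends and the next begins.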

If the labels of the elements should be shown in addition to the matrix and dendrograms, the following code can be used. It shows all the labels, rotating the x labels and reducing the font size to avoid overlap on the x axis. The colorbar has to be moved to make room for the y labels:
axmatrix.set_xticks(range(40))
axmatrix.set_xticklabels(idx2, minor=False)  # columns follow the top dendrogram
axmatrix.xaxis.set_label_position('bottom')
axmatrix.xaxis.tick_bottom()
plt.setp(axmatrix.get_xticklabels(), rotation=-90, fontsize=8)
axmatrix.set_yticks(range(40))
axmatrix.set_yticklabels(idx1, minor=False)  # rows follow the left dendrogram
axmatrix.yaxis.set_label_position('right')
axmatrix.yaxis.tick_right()
axcolor = fig.add_axes([0.94, 0.1, 0.02, 0.6])
The result obtained is this (with a different color map):

Related

Project variables in PCA plot in Python

After performing a PCA analysis in R we can do:
ggbiplot(pca, choices=1:2, groups=factor(row.names(df_t)))
That will plot the data in the 2 PC space, and the direction and weight of the variables in such space as vectors (with different length and direction).
In Python I can plot the data in the 2 PC space, and I can get the weights of the variables, but how do I know the direction.
In other words, how could I plot the variable contribution to both PC (weight and direction) in Python?
I am not aware of any pre-made implementation of this kind of plot, but it can be created using matplotlib.pyplot.quiver. Here's an example I quickly put together. You can use this as a basis to create a nice plot that works well for your data.
Example Data
This generates some example data. It is reused from this answer.
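# Imports used by the snippets in this answer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# User input
n_samples = 100
n_features = 5
# Prep
data = np.empty((n_samples,n_features))
np.random.seed(42)
# Generate
for i,mu in enumerate(np.random.choice([0,1,2,3], n_samples, replace=True)):
    data[i,:] = np.random.normal(loc=mu, scale=1.5, size=n_features)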
PCA
pca = PCA().fit(data)
Variables Factor Map
Here we go:
# Get the PCA components (loadings)
PCs = pca.components_
# Use quiver to generate the basic plot
fig = plt.figure(figsize=(5,5))
plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
           PCs[0,:], PCs[1,:],
           angles='xy', scale_units='xy', scale=1)
# Add labels based on feature names (here just numbers)
feature_names = np.arange(PCs.shape[1])
for i,j,z in zip(PCs[1,:]+0.02, PCs[0,:]+0.02, feature_names):
    plt.text(j, i, z, ha='center', va='center')
# Add unit circle
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
# Ensure correct aspect ratio and axis limits
plt.axis('equal')
plt.xlim([-1.0,1.0])
plt.ylim([-1.0,1.0])
# Label axes
plt.xlabel('PC 0')
plt.ylabel('PC 1')
# Done
plt.show()
Being Uncertain
I struggled a bit with the scaling of the arrows. Please make sure they correctly reflect the loadings for your data. A quick check of whether feature 4 really correlates strongly with PC 1 (as this example would suggest) looks promising:
data_pca = pca.transform(data)
plt.scatter(data_pca[:,1], data[:,4])
plt.xlabel('PC 1')
plt.ylabel('feature 4')
plt.show()
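To check the arrows numerically rather than by eye, one option is to compare the loadings, rescaled by the spread of each component, with the plain correlation between every feature and every PC score. The sketch below is an assumption-laden illustration, not part of the original answer: it redoes the PCA with a NumPy SVD so it runs without scikit-learn, and the component signs may differ from what sklearn reports.

```python
import numpy as np

# Example data in the same spirit as above (hypothetical seed and sizes).
rng = np.random.default_rng(42)
n_samples, n_features = 100, 5
mu = rng.choice([0, 1, 2, 3], size=n_samples)
data = rng.normal(loc=mu[:, None], scale=1.5, size=(n_samples, n_features))

# PCA via SVD of the centered data: rows of Vt are the components.
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T

# Standard deviation of each PC score and of each original feature.
pc_std = s / np.sqrt(n_samples - 1)
feature_std = data.std(axis=0, ddof=1)

# corr[j, k]: correlation of feature j with PC k, derived from the loadings.
corr = (Vt.T * pc_std) / feature_std[:, None]
print(np.round(corr[:, :2], 2))
```

If the arrows drawn with quiver are meant to show correlations, these are the numbers they should match; raw components_ values only show the direction of each unit-length component.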
Thanks to WhoIsJack for the earlier answer.
I adapted their code into the function below, which takes a fitted PCA object and the data it was based on. It produces a figure similar to the one above, but I substituted the real column names for the column indices, and pruned the plot to show only a certain number of contributing columns.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def plot_pca_vis(pca, df: pd.DataFrame, pc_x: int = 0, pc_y: int = 1, num_dims: int = 5):
    """
    https://stackoverflow.com/questions/45148539/project-variables-in-pca-plot-in-python
    Adapted into function by Tim Cashion
    """
    # Get the PCA components (loadings)
    PCs = pca.components_
    # Indices of the num_dims largest (signed) loadings on each of the two PCs
    PC_x_index = PCs[pc_x, :].argsort()[-num_dims:][::-1]
    PC_y_index = PCs[pc_y, :].argsort()[-num_dims:][::-1]
    combined_index = sorted(set(PC_x_index) | set(PC_y_index))
    PCs = PCs[:, combined_index]
    # Use quiver to generate the basic plot
    fig = plt.figure(figsize=(5,5))
    plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
               PCs[pc_x,:], PCs[pc_y,:],
               angles='xy', scale_units='xy', scale=1)
    # Add labels using the names of the selected columns only,
    # so that each label matches its arrow
    feature_names = df.columns[combined_index]
    for i,j,z in zip(PCs[pc_y,:]+0.02, PCs[pc_x,:]+0.02, feature_names):
        plt.text(j, i, z, ha='center', va='center')
    # Add unit circle
    circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
    plt.gca().add_artist(circle)
    # Ensure correct aspect ratio and axis limits
    plt.axis('equal')
    plt.xlim([-1.0,1.0])
    plt.ylim([-1.0,1.0])
    # Label axes
    plt.xlabel('PC ' + str(pc_x))
    plt.ylabel('PC ' + str(pc_y))
    # Done
    plt.show()
Hope this helps someone!

Hatch area using pcolormesh in Basemap

I try to hatch only the regions where I have statistically significant results. How can I do this using Basemap and pcolormesh?
plt.figure(figsize=(12,12))
lons = iris_cube.coord('longitude').points
lats = iris_cube.coord('latitude').points
m = Basemap(llcrnrlon=lons[0], llcrnrlat=lats[0], urcrnrlon=lons[-1], urcrnrlat=lats[-1], resolution='l')
lon, lat = np.meshgrid(lons, lats)
plt.subplot(111)
cs = m.pcolormesh(lon, lat, significant_data, cmap=cmap, norm=norm, hatch='/')
It seems pcolormesh does not support hatching (see https://github.com/matplotlib/matplotlib/issues/3058). Instead, the advice is to use pcolor, which starting from this example would look like,
import matplotlib.pyplot as plt
import numpy as np
dx, dy = 0.15, 0.05
y, x = np.mgrid[slice(-3, 3 + dy, dy),
                slice(-3, 3 + dx, dx)]
z = (1 - x / 2. + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)
z = z[:-1, :-1]
zm = np.ma.masked_less(z, 0.3)
cm = plt.pcolormesh(x, y, z)
plt.pcolor(x, y, zm, hatch='/', alpha=0.)
plt.colorbar(cm)
plt.show()
where a mask array is used to get the values of z greater than 0.3 and these are hatched using pcolor.
To avoid plotting another colour over the top (so you get only hatching), I've set alpha to 0. in pcolor, which feels a bit like a hack. The alternative is to use patches and assign them to the areas you want. For an example, see the question "Python: Leave Numpy NaN values from matplotlib heatmap and its legend". This may be trickier for basemaps and the like than just choosing areas with pcolor.
I have a simple solution for this problem, using only pcolormesh and not pcolor: Plot the color mesh, then hatch the entire plot, and then plot the original mesh again, this time by masking statistically significant cells, so that the only hatching visible is those on significant cells. Alternatively, you can put a marker on every cell (looks good too), instead of hatching the entire figure.
(I use cartopy instead of basemap, but this shouldn't matter.)
Step 1: Plot your field (z) normally, using pcolormesh.
mesh = plt.pcolormesh(x,y,z)
where x/y can be lons/lats.
Step 2: Hatch the entire plot. For this, use fill_between:
hatch = plt.fill_between([xmin,xmax],y1,y2,hatch='///////',color="none",edgecolor='black')
Check details of fill_between to set xmin, xmax, y1 and y2. You simply define two horizontal lines beyond the bounds of your plot, and hatch the area in between. Use more, or less /s to set hatch density.
To adjust hatch thickness, use below lines:
import matplotlib as mpl
mpl.rcParams['hatch.linewidth'] = 0.3
As an alternative to hatching everything, you can plot all your x-y points (or, lon-lat couples) as markers. A simple solution is putting a dot (x also looks good).
hatch = plt.plot(x,y,'.',color='black',markersize=1.5)
One of the above will be the basis of your 'hatch'. This is how it should look after Step 2:
Step 3: On top of these two, plot your color mesh once again with pcolormesh, this time masking cells containing statistically significant values. This way, the markers on your 'insignificant' cells become invisible again, while significant markers stay visible.
Assuming you have an identically sized array containing the t statistic for each cell (t_z), you can mask significant values using numpy's ma module.
z_masked = numpy.ma.masked_where(t_z >= your_threshold, z)
Then, plot the color mesh, using the masked array.
mesh_masked = plt.pcolormesh(x,y,z_masked)
Use zorder to make sure the layers are in correct order. This is how it should look after Step 3:
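Put together, the three steps might look like the sketch below. The grid, the random field z, the statistic t_z and the threshold are all made-up placeholders, and both meshes are given the same vmin/vmax so the second layer reuses the colours of the first:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical field on a regular grid, plus a made-up test statistic.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 21)   # cell edges in x (20 cells)
y = np.linspace(0.0, 5.0, 11)    # cell edges in y (10 cells)
z = rng.normal(size=(10, 20))
t_z = np.abs(rng.normal(size=(10, 20)))
threshold = 1.0                  # illustrative significance cut-off
vmin, vmax = z.min(), z.max()

fig, ax = plt.subplots()

# Step 1: the full field.
mesh = ax.pcolormesh(x, y, z, vmin=vmin, vmax=vmax, zorder=1)

# Step 2: hatch the whole plotting area on top of it.
ax.fill_between([x.min(), x.max()], y.min(), y.max(),
                facecolor='none', edgecolor='black',
                hatch='///', linewidth=0.0, zorder=2)

# Step 3: redraw the field with the significant cells masked out, so the
# hatching stays visible only where t_z reaches the threshold.
z_masked = np.ma.masked_where(t_z >= threshold, z)
mesh_masked = ax.pcolormesh(x, y, z_masked, vmin=vmin, vmax=vmax, zorder=3)

fig.colorbar(mesh, ax=ax)
```

Call plt.show() or fig.savefig(...) afterwards to look at the result. On a map the same layering should apply, with lon/lat in place of x/y.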

matplotlib spectrogram, intensity scale [duplicate]

I am using matplotlib to plot some data in Python, and the plots require a standard colour bar. The data consists of a series of NxM matrices containing frequency information, so that a simple imshow() plot gives a 2D histogram with colour describing frequency. Each matrix contains data in different, but overlapping, ranges. imshow normalizes the data in each matrix to the range 0-1, which means that, for example, the plot of matrix A will appear identical to the plot of the matrix 2*A (though the colour bar will show double the values). What I would like is for the colour red, for example, to correspond to the same frequency in all of the plots. In other words, a single colour bar would suffice for all the plots. Any suggestions would be greatly appreciated.
Not to steal @ianilis's answer, but I wanted to add an example...
There are multiple ways, but the simplest is just to specify the vmin and vmax kwargs to imshow. Alternatively, you can make a matplotlib.colors.Normalize instance and pass it as norm, but that's more complicated than necessary for simple cases.
Here's a quick example with a single colorbar for all images:
import numpy as np
import matplotlib.pyplot as plt
# Generate some data where each slice has a different range
# (The overall range is from 0 to 2)
data = np.random.random((4,10,10))
data *= np.array([0.5, 1.0, 1.5, 2.0])[:,None,None]
# Plot each slice as an independent subplot
fig, axes = plt.subplots(nrows=2, ncols=2)
for dat, ax in zip(data, axes.flat):
    # The vmin and vmax arguments specify the color limits
    im = ax.imshow(dat, vmin=0, vmax=2)
# Make an axis for the colorbar on the right side
cax = fig.add_axes([0.9, 0.1, 0.03, 0.8])
fig.colorbar(im, cax=cax)
plt.show()
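Another way to share one colour scale is to build a single matplotlib.colors.Normalize object and hand it to every imshow call through the norm argument. This is only a sketch with made-up data; taking the limits from the overall minimum and maximum is my assumption, so pick whatever range suits your data:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Four slices with different ranges (hypothetical data).
rng = np.random.default_rng(0)
data = rng.random((4, 10, 10)) * np.array([0.5, 1.0, 1.5, 2.0])[:, None, None]

# One normalization shared by every image.
norm = Normalize(vmin=data.min(), vmax=data.max())

fig, axes = plt.subplots(nrows=2, ncols=2)
images = [ax.imshow(dat, norm=norm) for dat, ax in zip(data, axes.flat)]

# Any of the images can drive the colorbar, since they share the norm.
fig.colorbar(images[0], ax=axes.ravel().tolist())
```

Because all images point at the same norm object, changing its limits later (for example norm.vmax = 3) rescales every subplot together.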
Easiest solution is to call clim(lower_limit, upper_limit) with the same arguments for each plot.
This only answers half of the question, or rather starts a new one.
If you change
data *= np.array([0.5, 1.0, 1.5, 2.0])[:,None,None]
to
data *= np.array([2.0, 1.0, 1.5, 0.5])[:,None,None]
your colorbar will go from 0 to 0.5 which in this case is dark blue to slightly lighter blue and will not cover the whole range (0 to 2).
The colorbar will only show the colors from the last image or contour regardless of vmin and vmax.
I wasn't happy with the solutions that suggested to manually set vmin and vmax, so I decided to read the limits of each plot and automatically set vmin and vmax.
The example below shows three plots of samples taken from normal distributions with increasing mean value.
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
import numpy as np
numberOfPlots = 3
data = []
for i in range(numberOfPlots):
    mean = i
    data.append(np.random.normal(mean, size=(100,100)))
fig = plt.figure()
grid = ImageGrid(fig, 111, nrows_ncols=(1,numberOfPlots), cbar_mode='single')
ims = []
for i in range(numberOfPlots):
    ims.append(grid[i].imshow(data[i]))
    grid[i].set_title("Mean = " + str(i))
# Collect the automatically chosen color limits of every image
clims = [im.get_clim() for im in ims]
vmin = min([clim[0] for clim in clims])
vmax = max([clim[1] for clim in clims])
# Apply the common limits to all images
for im in ims:
    im.set_clim(vmin=np.floor(vmin),vmax=np.ceil(vmax))
grid[0].cax.colorbar(ims[0]) # with cbar_mode="single", cax attribute of all axes are identical
plt.show()

Plot histogram normalized by fixed parameter

I need to plot a normalized histogram (by normalized I mean divided by a fixed value) using the histtype='step' style.
The issue is that plt.bar() doesn't seem to support that style, and if I instead use plt.hist(), which does, I can't (or at least don't know how to) plot the normalized histogram.
Here's a MWE of what I mean:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
    return np.random.uniform(low=10., high=20., size=(200,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Plot histogram normalized by the parameter defined above.
plt.ylim(0, 3)
plt.bar(edges1[:-1], hist1 / param, width=binwidth, color='none', edgecolor='r')
plt.show()
(notice the normalization: hist1 / param) which produces this:
I can generate a histtype='step' histogram using:
plt.hist(x1, bins=bin_n, histtype='step', color='r')
and get:
but then it wouldn't be normalized by the param value.
The step plot will generate the appearance that you want from a set of bins and the count (or normalized count) in those bins. Here I've used plt.hist to get the counts, then plot them, with the counts normalized. It's necessary to duplicate the first entry in order to get it to actually have a line there.
(a,b,c) = plt.hist(x1, bins=bin_n, histtype='step', color='r')
a = np.append(a[0],a[:])
plt.close()
plt.step(b, a / param, color='r')
This is not quite right, because it doesn't finish the plot correctly: the end of the line hangs in free space rather than dropping down to the x axis.
You can fix that by adding a 0 to the end of a and one more bin point to b:
a = np.append(a[:], 0)
b = np.append(b, 2 * b[-1] - b[-2])
plt.step(b, a / param, color='r')
Lastly, the ax.step mentioned would be used if you had created the figure with
fig, ax = plt.subplots()
to give you access to the figure and axis directly. For examples, see http://matplotlib.org/examples/ticks_and_spines/spines_demo_bounds.html
Based on tcaswell's comment (use step) I've developed my own answer. Notice that I need to add elements to both the x (one zero element at the beginning of the array) and y arrays (one zero element at the beginning and another at the end of the array) so that step will plot the vertical lines at the beginning and the end of the bars.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
    return np.random.uniform(low=10., high=20., size=(5000,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Create arrays adding elements so plt.bar will plot the first and last
# vertical bars.
x2 = np.concatenate((np.array([0.]), edges1))
y2 = np.concatenate((np.array([0.]), (hist1 / param), np.array([0.])))
# Plot histogram normalized by the parameter defined above.
plt.xlim(min(edges1) - (min(edges1) / 10.), max(edges1) + (min(edges1) / 10.))
plt.bar(x2, y2, width=binwidth, color='none', edgecolor='b')
plt.step(x2, y2, where='post', color='r', ls='--')
plt.show()
and here's the result:
The red lines generated by step are equal to those blue lines generated by bar as can be seen.
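A different route, not used in the answers above, is the weights argument of hist: giving every sample a weight of 1/param divides each bin count by param while keeping histtype='step'. A minimal sketch with made-up data and fixed bin edges:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data in the same range as the question.
rng = np.random.default_rng(0)
x1 = rng.uniform(low=10.0, high=20.0, size=200)

binwidth = 0.25
bin_n = np.arange(10.0, 20.0 + binwidth, binwidth)
param = 5.0  # normalization parameter

# Each sample contributes 1/param to its bin instead of 1.
weights = np.full_like(x1, 1.0 / param)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(x1, bins=bin_n, histtype='step',
                           weights=weights, color='r')
```

The returned counts are already the normalized bin heights, so no extra bookkeeping with step is needed. Add plt.show() to display the figure.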

Plotting a 2D mesh grid with matplotlib

I would like to plot a 2D discretization rectangular mesh with non-regular
x y axes values, e.g. the typical discretization meshes used in CFD.
An example of the code may be:
fig = plt.figure(1,figsize=(12,8))
axes = fig.add_subplot(111)
matplotlib.rcParams.update({'font.size':17})
axes.set_xticks(self.xPoints)
axes.set_yticks(self.yPoints)
plt.grid(color='black', linestyle='-', linewidth=1)
myName = "2D.jpg"
fig.savefig(myName)
where self.xPoints and self.yPoints are 1D non-regular vectors.
This piece of code produces a good discretization mesh. The problem is the
xtick and ytick labels, because they appear for all values of xPoints and yPoints (they overlap).
How can I easily redefine the printed values in the axes?
Let's say I only want to show the minimum and maximum value for x and y and not all values from the discretization mesh.
I can't post an example figure because this is the first time I have asked something here (I can send it by mail if requested).
The problem is that you explicitly told matplotlib to label each point when you wrote:
axes.set_xticks(self.xPoints)
axes.set_yticks(self.yPoints)
Comment out those lines and see what the result looks like.
Of course, if you only want the first and last point labelled, it becomes:
axes.set_xticks([self.xPoints[0], self.xPoints[-1]])
...
If the gridlines are specified by axes.set_xticks(), I don't think it is possible to show the ticks without overlap in your case.
I may have a solution for you:
...
ax = plt.gca()
# Arr_y: y-direction data, 1D numpy array.
for j in range(len(Arr_y)):
    plt.hlines(y=Arr_y[j], xmin=Arr_x.min(), xmax=Arr_x.max(), colors='black')
# Arr_x: x-direction data, 1D numpy array.
for i in range(len(Arr_x)):
    plt.vlines(x=Arr_x[i], ymin=Arr_y.min(), ymax=Arr_y.max(), colors='black')
# Customize your ticks here, 1D numpy array or list.
ax.set_xticks(Arr_xticks)
ax.set_yticks(Arr_yticks)
plt.xlim(Arr_x.min(), Arr_x.max())
plt.ylim(Arr_y.min(), Arr_y.max())
plt.show()
...
hlines and vlines draw horizontal and vertical lines; you can bound those lines with the limits of the data in the x and y directions.
I tried it with a 60×182 non-uniform mesh grid, which took about 1.2 s. I hope I can post a picture here.
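For reference, here is a self-contained version of the same idea. The point arrays are invented for illustration, and hlines/vlines are called once with whole arrays instead of in a loop, which they accept:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical non-uniform mesh coordinates (spacing grows along each axis).
x_points = np.cumsum(np.linspace(0.1, 1.0, 15))
y_points = np.cumsum(np.geomspace(0.05, 0.5, 10))

fig, ax = plt.subplots(figsize=(12, 8))

# Draw the mesh itself, independently of the tick positions.
ax.vlines(x_points, y_points.min(), y_points.max(), colors='black', linewidth=1)
ax.hlines(y_points, x_points.min(), x_points.max(), colors='black', linewidth=1)

# Label only the first and last coordinate on each axis.
ax.set_xticks([x_points[0], x_points[-1]])
ax.set_yticks([y_points[0], y_points[-1]])
ax.set_xlim(x_points[0], x_points[-1])
ax.set_ylim(y_points[0], y_points[-1])

fig.savefig("2D.png")
```

Since the grid lines no longer depend on the ticks, any subset of positions can be passed to set_xticks and set_yticks without changing the mesh.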
