Retrieve leaf colors from scipy dendrogram - python

I cannot get the leaf colors from the scipy dendrogram dictionary. As stated in the documentation and in this GitHub issue, the color_list key in the dendrogram dictionary refers to the links, not the leaves. It would be nice to have another key referring to the leaves, because sometimes you need the leaf colors for other kinds of graphics, such as the scatter plot in the example below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# DATA EXAMPLE
x = np.array([[ 5, 3],
[10,15],
[15,12],
[24,10],
[30,30],
[85,70],
[71,80]])
# DENDROGRAM
plt.figure()
plt.subplot(121)
z = linkage(x, 'single')
d = dendrogram(z)
# COLORED PLOT
# This is what I would like to achieve. Colors are assigned manually by looking
# at the dendrogram, because I failed to get it from d['color_list'] (it refers
# to links, not observations)
plt.subplot(122)
points = d['leaves']
colors = ['r','r','g','g','g','g','g']
for point, color in zip(points, colors):
    plt.plot(x[point, 0], x[point, 1], 'o', color=color)
Manual color assignment is easy in this example, but I'm dealing with huge datasets. Until the dictionary gains a leaf-color entry, I'm trying to infer the leaf colors from the information it already contains, but I'm out of ideas so far. Can anyone help me?
Thanks.

In scipy 1.7.1 this functionality is implemented: the dictionary returned by the dendrogram function also contains an entry 'leaves_color_list', which makes this task easy.
Here is a working version of the OP's code (see the last line, marked "NEW CODE"):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# DATA EXAMPLE
x = np.array([[ 5, 3],
[10,15],
[15,12],
[24,10],
[30,30],
[85,70],
[71,80]])
# DENDROGRAM
plt.figure()
plt.subplot(121)
z = linkage(x, 'single')
d = dendrogram(z)
# COLORED PLOT
# Leaf colors come straight from d['leaves_color_list'], which is in the same
# order as d['leaves'], so no manual assignment is needed.
plt.subplot(122)
# NEW CODE
plt.scatter(x[d['leaves'], 0], x[d['leaves'], 1], color=d['leaves_color_list'])
plt.show()
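If the colors are needed in the order of the original observations rather than in leaf order, the two lists can be combined. The following is a minimal sketch, assuming a SciPy version that provides 'leaves_color_list'; the name colors_by_obs and the use of no_plot=True (to skip drawing) are choices made for this illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

x = np.array([[5, 3], [10, 15], [15, 12], [24, 10],
              [30, 30], [85, 70], [71, 80]])
z = linkage(x, 'single')
# no_plot=True only computes the dictionary, nothing is drawn
d = dendrogram(z, no_plot=True)

# d['leaves'][k] is the observation index shown at leaf position k,
# and d['leaves_color_list'][k] is the color of that leaf.
colors_by_obs = np.empty(len(x), dtype=object)
colors_by_obs[d['leaves']] = d['leaves_color_list']

for obs, color in enumerate(colors_by_obs):
    print(obs, color)
```

With this, colors_by_obs[i] is the color of x[i], so it can be passed directly to a call such as plt.scatter(x[:, 0], x[:, 1], color=list(colors_by_obs)).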

The following approach seems to work. The dictionary returned by dendrogram contains 'color_list' with the colors of the links, plus 'icoord' and 'dcoord' with the x and y plot coordinates of those links. Leaves sit at the x-positions 5, 15, 25, ..., so a link endpoint at one of these x-positions starts at a leaf. Testing for these x-positions therefore leads from a link back to the corresponding leaf, and lets us assign the link's color to that point.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# DATA EXAMPLE
x = np.random.uniform(0, 10, (20, 2))
# DENDROGRAM
plt.figure()
plt.subplot(121)
z = linkage(x, 'single')
d = dendrogram(z)
plt.yticks([])
# COLORED PLOT
plt.subplot(122)
points = d['leaves']
colors = ['none'] * len(points)
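for xs, ys, c in zip(d['icoord'], d['dcoord'], d['color_list']):
    for xi, yi in zip(xs, ys):
        # A leaf endpoint sits at x = 5, 15, 25, ... and at height 0; the
        # height check skips cluster midpoints that happen to land on such an x.
        if yi == 0 and xi % 10 == 5:
            colors[(int(xi)-5) // 10] = c
for point, color in zip(points, colors):
    plt.plot(x[point, 0], x[point, 1], 'o', color=color)
    plt.text(x[point, 0], x[point, 1], f' {point}')
plt.show()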
PS: This post about matching points with their clusters might also be relevant.
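On matching points with their clusters: a different technique from the one above is to skip the dendrogram dictionary and ask scipy for flat cluster labels with fcluster. The sketch below assumes that the dendrogram was drawn with its documented default color threshold of 0.7 times the largest merge distance; the variable names are illustrative. The labels are integers per observation, not the dendrogram's color names, and points right at the threshold may be treated differently than in the plot.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

x = np.array([[5, 3], [10, 15], [15, 12], [24, 10],
              [30, 30], [85, 70], [71, 80]])
z = linkage(x, 'single')

# Assumed to match dendrogram's default color_threshold
threshold = 0.7 * z[:, 2].max()

# labels[i] is the (1-based) flat cluster id of observation x[i]
labels = fcluster(z, t=threshold, criterion='distance')
print(labels)
```

The labels can then be used as color values, for example plt.scatter(x[:, 0], x[:, 1], c=labels), with whatever colormap suits the figure.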

Related

Plotting array over background map using cartopy

I am trying to plot a NumPy array over a background map tile using cartopy. When the background map is included, the array is not visible.
I am adding background map tiles using cimgt and geo_axes.add_image(). This method has worked for me before when plotting points with plt.scatter(). I have tried several projections (PlateCarree, Mercator, and EPSG32630) and map tiles (OSM, GoogleTiles). The array contains np.nans and floats.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt
# array creation
array = np.asarray([[1, np.nan, np.nan], [1, 1, 1], [2, 2, 1]])
x_coords = np.asarray([690000, 691000, 692000])
y_coords = np.asarray([4958000, 4959000, 496000])
# create figure
fig = plt.figure(figsize=(8, 6), dpi=100)
# create geo axes
projection = ccrs.epsg(32630)
geo_axes = plt.subplot(projection=projection)
# add open street map background
# when commenting the two following lines, the data array is plotted correctly
osm_background = cimgt.OSM()
geo_axes.add_image(osm_background, 14)
# plot dataset
plt.imshow(
array,
origin="upper",
extent=(x_coords[0], x_coords[1], y_coords[0], y_coords[1]),
transform=projection,
)
# show plot
plt.show()
I can't seem to find what is causing the issue. Has anyone encountered this before, or can anyone see what I am doing wrong?

Two adjustments make all the plotted layers visible: draw the array with partial transparency (alpha) and give it a higher zorder so that it sits on top of the background tiles. Here is the relevant code to update yours; with it, both the OSM background and the array image show up.
# plot dataset
plt.imshow(
array,
origin="upper",
extent=(x_coords[0], x_coords[1], y_coords[0], y_coords[1]),
transform=projection,
alpha=0.25, # lets the background image show through
zorder=10 # draws this layer on top
)
# draw graticule and labels
geo_axes.gridlines(color='lightgrey', linestyle='-', draw_labels=True)

how to draw an asymptote with a dashed line?

I would like the asymptotes of the tg(x) function to be drawn with dashed lines, but I don't know how to change that in this code:
import matplotlib.ticker as tck
import matplotlib.pyplot as plt
import numpy as np
f,ax=plt.subplots(figsize=(8,5))
x=np.linspace(-np.pi, np.pi,100)
y=np.sin(x)/np.cos(x)
plt.ylim([-4, 4])
plt.title("f(x) = tg(x)")
plt.xlabel("x")
plt.ylabel("y")
ax.plot(x/np.pi,y)
ax.xaxis.set_major_formatter(tck.FormatStrFormatter(r'%g $\pi$'))

Interesting question. My approach is to look for the discontinuities by examining the derivative of the function, and to split the original function at the locations of these discontinuities.
So for tan(x), since the derivative is always positive (outside of the asymptotes) we look for points where np.diff(y) < 0. Based on all the locations where the previous condition is true, we split up the original function into segments and plot those individually (with the same plot properties so the lines look the same) and then plot black dashed lines separately. The following code shows this working:
import matplotlib.ticker as tck
import matplotlib.pyplot as plt
import numpy as np
f,ax=plt.subplots(figsize=(8,5))
x=np.linspace(-np.pi, np.pi,100)
y=np.sin(x)/np.cos(x)
plt.ylim([-4, 4])
plt.title("f(x) = tg(x)")
plt.xlabel("x")
plt.ylabel("y")
# raw string so that \p is not read as an (invalid) escape sequence
ax.xaxis.set_major_formatter(tck.FormatStrFormatter(r'%g $\pi$'))
# Search for points with negative slope
dydx = np.diff(y)
negativeSlopeIdx = np.nonzero(dydx < 0)[0]
# Take those points and parse the original function into segments to plot
yasymptote = np.array([-4, 4])
iprev = 0
for i in negativeSlopeIdx:
    ax.plot(x[iprev:i-1]/np.pi, y[iprev:i-1], "b", linewidth=2)
    ax.plot(np.array([x[i], x[i]])/np.pi, yasymptote, "--k")
    iprev = i+1
ax.plot(x[iprev:]/np.pi, y[iprev:], "b", linewidth=2)
plt.show()
The final plot shows the branches of the tangent as solid blue lines, with a black dashed vertical line at each asymptote.
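A variation on the same idea is to keep a single plot call and break the line by inserting NaN at each detected jump, since matplotlib does not draw through NaN values. This is only a sketch: the names jumps, asymptotes, x_broken and y_broken are made up here, and each asymptote is approximated by the midpoint between the two samples on either side of the jump, so it is only as accurate as the sampling step.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 100)
y = np.tan(x)

# tan(x) rises everywhere except across an asymptote, so a negative
# difference marks the last sample before each jump.
jumps = np.nonzero(np.diff(y) < 0)[0]
asymptotes = 0.5 * (x[jumps] + x[jumps + 1])  # approximate positions

# Insert a NaN between the two samples around each jump to break the line
x_broken = np.insert(x, jumps + 1, asymptotes)
y_broken = np.insert(y, jumps + 1, np.nan)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x_broken / np.pi, y_broken, "b", linewidth=2)
for a in asymptotes:
    ax.axvline(a / np.pi, color="k", linestyle="--")
ax.set_ylim(-4, 4)
```

Calling plt.show() afterwards displays the figure. For a function with known asymptotes, such as the tangent, the exact positions could be passed to axvline instead of the detected ones.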

Python Scatterplot: Changing color based on both X and Y values

I've tried to recreate the image attached using cmaps as well as with if/else statements.
My current attempt is based on the advice given in this thread.
I tried using 1.8<=x<=2.2, but I get an error.
Here is my current code:
import numpy as np
import matplotlib.pyplot as plt
N = 500
# center, variation, number of points
x = np.random.normal(2,0.2,N)
y = np.random.normal(2,0.2,N)
colors = np.where(x<=2.2,'r',np.where(y<=2.2,'b','b'))
plt.scatter(x , y, c=colors)
plt.colorbar()
plt.show()

To make that plot, you need to pass an array with the color of each point. In this case the color encodes the distance to the point (2, 2), since both distributions are centered on that point.
import numpy as np
import matplotlib.pyplot as plt
N = 500
# center, variation, number of points
x = np.random.normal(2,0.2,N)
y = np.random.normal(2,0.2,N)
# Compute each point's distance to (2, 2);
# this distance is what sets the point's color.
color = np.sqrt((x-2)**2 + (y-2)**2)
# alpha makes the points semi-transparent,
# so where points overlap their colors blend.
plt.scatter(x , y, c=color, cmap='plasma', alpha=0.7)
plt.show()
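As for the error mentioned in the question: it most likely comes from the chained comparison, because Python evaluates 1.8<=x<=2.2 as two comparisons joined by `and`, and a NumPy array with several elements has no single truth value. A minimal sketch of the element-wise alternative follows; the red/blue rule and the names in_x_band and in_y_band are illustrative assumptions, not taken from the original image.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2, 0.2, 500)
y = rng.normal(2, 0.2, 500)

# Element-wise version of 1.8 <= x <= 2.2: combine two boolean
# arrays with & (the parentheses are required).
in_x_band = (x >= 1.8) & (x <= 2.2)
in_y_band = (y >= 1.8) & (y <= 2.2)

# Illustrative rule: red inside the central square, blue elsewhere
colors = np.where(in_x_band & in_y_band, 'r', 'b')
```

The resulting array of color names can be passed to plt.scatter(x, y, c=colors); a colorbar is not meaningful in that case because the colors are not mapped from numeric values.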

Plot 2D histogram data with pcolormesh

I need to plot a binned statistic, as one would get from scipy.stats.binned_statistic_2d. Basically, that means I have edge values and within-bin data. This also means I cannot (to my knowledge) use plt.hist2d. Here's a code snippet to generate the sort of data I might need to plot:
import numpy as np
x_edges = np.arange(6)
y_edges = np.arange(6)
bin_values = np.random.randn(5, 5)
One would imagine that I could use pcolormesh for this, but the issue is that pcolormesh does not allow for bin edge values. The following will only plot the values in bins 1 through 4. The 5th value is excluded, since while pcolormesh "knows" that the value at 4.0 is some value, there is no later value to plot, so the width of the 5th bin is zero.
import matplotlib.pyplot as plt
X, Y = np.broadcast_arrays(x_edges[:5, None], y_edges[None, :5])
plt.figure()
plt.pcolormesh(X, Y, bin_values)
plt.show()
I can get around this with an ugly hack by adding an additional set of values equal to the last values:
import matplotlib.pyplot as plt
X, Y = np.broadcast_arrays(x_edges[:, None], y_edges[None, :])
dummy_bin_values = np.zeros([6, 6])
dummy_bin_values[:5, :5] = bin_values
dummy_bin_values[5, :] = dummy_bin_values[4, :]
dummy_bin_values[:, 5] = dummy_bin_values[:, 4]
plt.figure()
plt.pcolormesh(X, Y, dummy_bin_values)
plt.show()
However, this is an ugly hack. Is there any cleaner way to plot 2D histogram data with bin edge values? "No" is possibly the correct answer, but convince me that's the case if it is.

I do not see the problem with either of the two options. pcolormesh accepts X and Y arrays with one more row and column than the data, in which case they are treated as cell edges. So here is simply some code that uses both: numpy-histogrammed data with pcolormesh, and plain plt.hist2d.
import numpy as np
import matplotlib.pyplot as plt
x_edges = np.arange(6)
y_edges = np.arange(6)
data = np.random.rand(340,2)*5
### using numpy.histogram2d
bin_values,_,__ = np.histogram2d(data[:,0],data[:,1],bins=(x_edges, y_edges) )
X, Y = np.meshgrid(x_edges,y_edges)
fig, (ax,ax2) = plt.subplots(ncols=2)
ax.set_title("numpy.histogram2d \n + plt.pcolormesh")
ax.pcolormesh(X, Y, bin_values.T)
### using plt.hist2d
ax2.set_title("plt.hist2d")
ax2.hist2d(data[:,0],data[:,1],bins=(x_edges, y_edges))
plt.show()
Of course this would equally work with scipy.stats.binned_statistic_2d.
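For completeness, here is a sketch of the scipy.stats.binned_statistic_2d variant. The sample data and the per-sample quantity `values` are made up for illustration; the edge arrays returned by the function are passed to pcolormesh unchanged.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)
x_edges = np.arange(6)
y_edges = np.arange(6)
data = rng.random((340, 2)) * 5
values = rng.standard_normal(340)  # made-up quantity to average per bin

stat, xe, ye, _ = binned_statistic_2d(
    data[:, 0], data[:, 1], values, statistic='mean', bins=[x_edges, y_edges]
)

fig, ax = plt.subplots()
# xe and ye hold 6 edges each; stat is (5, 5) and indexed [x_bin, y_bin],
# hence the transpose so that x runs along the horizontal axis.
mesh = ax.pcolormesh(xe, ye, stat.T)
fig.colorbar(mesh, ax=ax)
```

Calling plt.show() afterwards displays the figure. All five bins per axis are drawn, with no dummy row or column needed.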

Plotting using PolyCollection in matplotlib

I am trying to make a 3-dimensional plot in matplotlib. I have to plot the Frequency vs Amplitude distribution for four (or more) radii in a single 3D plot. I was looking at the PolyCollection class available in matplotlib.collections and also went through the example, but I do not know how to get from my existing data to the plot.
The dimensions of the quantities that I have are,
Frequency : 4000 x 4,
Amplitude : 4000 x 4,
Radius : 4
I would like a plot with the X axis being Frequency, the Y axis being Radius, and the Z axis being Amplitude. How do I go about solving this problem?

PolyCollection expects a sequence of vertices, which matches your desired data pretty well. You don't provide any example data, so I'll make some up for illustration (my dimension of 200 would be your 4000, although I might consider a different kind of plot if you have that many data points):
import matplotlib.pyplot as plt
from matplotlib.collections import PolyCollection
from mpl_toolkits.mplot3d import axes3d
import numpy as np
# These will be (200, 4), (200, 4), and (4)
freq_data = np.linspace(0,300,200)[:,None] * np.ones(4)[None,:]
amp_data = np.random.rand(200*4).reshape((200,4))
rad_data = np.linspace(0,2,4)
verts = []
for irad in range(len(rad_data)):
    # I'm adding a zero amplitude at the beginning and the end to get a nice
    # flat bottom on the polygons
    xs = np.concatenate([[freq_data[0,irad]], freq_data[:,irad], [freq_data[-1,irad]]])
    ys = np.concatenate([[0],amp_data[:,irad],[0]])
    verts.append(list(zip(xs, ys)))
poly = PolyCollection(verts, facecolors = ['r', 'g', 'c', 'y'])
poly.set_alpha(0.7)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# The zdir keyword makes it plot the "z" vertex dimension (radius)
# along the y axis. The zs keyword sets each polygon at the
# correct radius value.
ax.add_collection3d(poly, zs=rad_data, zdir='y')
ax.set_xlim3d(freq_data.min(), freq_data.max())
ax.set_xlabel('Frequency')
ax.set_ylim3d(rad_data.min(), rad_data.max())
ax.set_ylabel('Radius')
ax.set_zlim3d(amp_data.min(), amp_data.max())
ax.set_zlabel('Amplitude')
plt.show()
Most of this is taken straight from the example you mention; I just made it clear where your particular datasets would go.
