Given the following example, which is from https://python-graph-gallery.com/404-dendrogram-with-heat-map/, it generates a dendrogram that I assume is based on scipy.
# Libraries
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)
df = df.set_index('model')
df.index.name = None  # 'del df.index.name' no longer works on recent pandas
df
# Default plot
sns.clustermap(df)
Question: How can one get the dendrogram in non-graphical form?
Background information:
From the root of that dendrogram I want to cut it at the largest length. For example, we have one edge from the root to a left cluster (L) and an edge to a right cluster (R) ...from those two I'd like to get their edge lengths and cut the whole dendrogram at the longer of the two edges.
Best regards
clustermap returns a handle to the ClusterGrid object, which includes child objects for each dendrogram,
h.dendrogram_col and h.dendrogram_row.
Inside these are the dendrograms themselves, which provide the dendrogram geometry as per the scipy.cluster.hierarchy.dendrogram return data, from which you can compute the lengths of a specific branch.
import numpy as np

h = sns.clustermap(df)
dgram = h.dendrogram_col.dendrogram
D = np.array(dgram['dcoord'])
I = np.array(dgram['icoord'])
# the root node will be the last entry; each dcoord row is [y0, ytop, ytop, y3],
# so the lengths of the L/R branches are
yy = D[-1]
lenL = yy[1] - yy[0]
lenR = yy[2] - yy[3]
The linkage matrix, the input used to compute the dendrogram, might also help:
h.dendrogram_col.linkage
h.dendrogram_row.linkage
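Putting both together, here is a minimal sketch of the cut itself; the choice of cutting halfway down the longer branch is an assumption about what "cut at the longest edge" means:

import numpy as np
from scipy.cluster.hierarchy import fcluster

h = sns.clustermap(df)
Z = h.dendrogram_col.linkage                               # scipy linkage matrix
yy = np.array(h.dendrogram_col.dendrogram['dcoord'])[-1]   # root entry
lenL, lenR = yy[1] - yy[0], yy[2] - yy[3]                  # left/right branch lengths
cut_height = yy[1] - 0.5 * max(lenL, lenR)                 # partway down the longer branch
labels = fcluster(Z, t=cut_height, criterion='distance')   # flat cluster labels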
So when one exports r.out.vtk from GRASS GIS, we get a bad surface with -99999 points instead of nulls:
I want to remove them, yet a simple clip is not enough:
import pyvista as pv

pd = pv.read('./pid1.vtk')
pd = pd.clip((0, 1, 1), invert=False).extract_surface()
p = pv.Plotter()
p.add_mesh(pd)  # add atoms to scene
p.show()
resulting in:
So I wonder how to keep only the top points (> -999) and their connected vertices, in order to get only the top plane (it is actually curved, not flat), using pyvista?
link to example .vtk
There is an easy way to do this and there isn't...
You could use pyvista's threshold filter with all_scalars=True as long as you have only one set of scalars:
import pyvista as pv
pd = pv.read('./pid1.vtk')
pd = pd.threshold(-999, all_scalars=True)
plotter = pv.Plotter()
plotter.add_mesh(pd) #add atoms to scene
plotter.show()
Since all_scalars filters based on every scalar array, this will only do what you'd expect if there are no other scalars. Unfortunately, there also seems to be a bug in pyvista (expected to be fixed in version 0.32.0) which makes the use of this keyword impossible.
What you can do in the meantime (if you don't want to use pyvista's main branch before the fix is released) is to threshold the data yourself using numpy:
import pyvista as pv
pd = pv.read('./pid1.vtk')
scalars = pd.active_scalars
keep_inds = (scalars > -999).nonzero()[0]
pd = pd.extract_points(keep_inds, adjacent_cells=False)
plotter = pv.Plotter()
plotter.add_mesh(pd) #add atoms to scene
plotter.show()
The main point of both all_scalars (in threshold) and adjacent_cells (in extract_points) is to only keep cells where every point satisfies the condition.
With both of the above I get the following figure using your data:
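If it helps to see the all_scalars semantics in isolation, here is a toy sketch on synthetic data (the plane, its resolution, and the fake scalars are all assumptions, and the all_scalars call is subject to the version caveat above):

import numpy as np
import pyvista as pv

mesh = pv.Plane(i_resolution=4, j_resolution=4)          # 25 points, 16 cells
mesh['vals'] = np.linspace(-1000.0, 0.0, mesh.n_points)  # fake scalars, some below -999
loose = mesh.threshold(-999)                             # keeps cells with any passing point
strict = mesh.threshold(-999, all_scalars=True)          # keeps cells where every point passes
print(loose.n_cells, strict.n_cells)                     # strict keeps fewer cells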
I'm new to Python. Please help me solve a problem with graph construction. I have a database with the attributes "Source", "Interlocutor" and "Frequency".
An example of three rows:
I need to build a graph based on the Source-Interlocutor pairs, where the frequency is also taken into account.
Like this:
My code:
import networkx as nx
import pandas as pd
from matplotlib import pyplot as plt

dic_values = {'Source': [24120.0, 24120.0, 24120.0],
              'Interlocutor': [34, 34, 34],
              'Frequency': [446625000, 442475000, 445300000]}
session_graph = pd.DataFrame(dic_values)
friquency = session_graph['Frequency'].unique()
plt.figure(figsize=(10, 10))
for i in range(len(friquency)):
    df_friq = session_graph[session_graph['Frequency'] == friquency[i]]
    G_frique = nx.from_pandas_edgelist(df_friq, source='Source', target='Interlocutor')
    pos = nx.spring_layout(G_frique)
    nx.draw_networkx_nodes(G_frique, pos, cmap=plt.get_cmap('jet'), node_size=20)
    nx.draw_networkx_edges(G_frique, pos, arrows=True)
    nx.draw_networkx_labels(G_frique, pos)
plt.show()
And I get this:
Your problem requires a MultiGraph
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import pydot
from IPython.display import Image
dic_values = {"Source":[24120.0,24120.0,24120.0], "Interlocutor":[34,34,34],
"Frequency":[446625000, 442475000, 445300000]}
session_graph = pd.DataFrame(dic_values)
sources = session_graph['Source'].unique()
targets = session_graph['Interlocutor'].unique()
#create a Multigraph and add the unique nodes
G = nx.MultiDiGraph()
for n in [sources, targets]:
    G.add_node(n[0])

# Add edges; multiple connections between the same pair of nodes are fine,
# since a MultiGraph keys parallel edges by an enumerated edge key.
# itertuples() is a faster way to iterate through a pandas DataFrame;
# here we add one edge per row.
for row in session_graph.itertuples():
    # print(row[1], row[2], row[3])
    G.add_edge(row[1], row[2], label=row[3])
#Now, render it to a file...
p=nx.drawing.nx_pydot.to_pydot(G)
p.write_png('multi.png')
Image(filename='multi.png') #optional
This will produce the following:
Please note that node layouts are trickier when you use Graphviz/Pydot.
For example, check this SO answer. If you do want a Graphviz layout together with matplotlib rendering, see the sketch below. I hope this helps you move forward, and welcome to SO.
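A sketch of that combination (Graphviz positions, matplotlib drawing; assumes Graphviz itself is installed):

from networkx.drawing.nx_pydot import graphviz_layout

pos = graphviz_layout(G, prog='dot')    # node positions computed by Graphviz
nx.draw(G, pos, with_labels=True, node_size=300)
plt.show()                              # note: parallel edges overlap in matplotlib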
I use the fuzzy-c-means clustering implementation and I would like the data X to form the number of clusters I define in the algorithm (I believe that is how it works). But the behavior is confusing.
cm = FCM(n_clusters=6)
cm.fit(X)
This code generates a plot with 4 labels - [0,2,4,6]
cm = FCM(n_clusters=4)
cm.fit(X)
This code generates a plot with 4 labels - [0,1,2,3]
I expect labels [0,1,2,3,4,5] when I initialize the cluster number to be 6.
code:
from fcmeans import FCM
from matplotlib import pyplot as plt
from seaborn import scatterplot as scatter
# fit the fuzzy-c-means
fcm = FCM(n_clusters=6)
fcm.fit(X)
# outputs
fcm_centers = fcm.centers
fcm_labels = fcm.u.argmax(axis=1)
# plot result
%matplotlib inline
f, axes = plt.subplots(1, 2, figsize=(11,5))
scatter(X[:,0], X[:,1], ax=axes[0])
scatter(X[:,0], X[:,1], ax=axes[1], hue=fcm_labels)
scatter(fcm_centers[:,0], fcm_centers[:,1], ax=axes[1],marker="s",s=200)
plt.show()
Fuzzy c-means is a fuzzy clustering algorithm.
The labels are only an approximation to the fuzzy assignment.
Most likely two clusters are pretty weak, and hence never win the argmax operation used to produce the labels. That doesn't mean these clusters have not been used; you are just not using the full fuzzy result.
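For example, a small sketch inspecting the full membership matrix from the question's fcm object:

import numpy as np

memberships = fcm.u                            # shape (n_samples, n_clusters); rows sum to 1
print(np.unique(memberships.argmax(axis=1)))   # only the clusters that ever "win" the argmax
print(memberships.max(axis=0))                 # peak membership per cluster; weak ones stay low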
I'm using fuzzy-c-means version 1.7.0:
>>> import fcmeans
>>> fcmeans.__version__
'1.7.0'
Using the iris data:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris().data
>>> model = fcmeans.FCM(n_clusters = 2)
>>> model.fit(iris)
>>> pred = model.predict(iris)
>>> from collections import Counter
>>> Counter(pred)
Counter({0: 97, 1: 53})
So, n_clusters was applied correctly.
I read about it, and it looks like once the algorithm reaches the knee point (the maximum number of clusters it can form with the data), it won't create anything more than this. So in my question, 4 was the maximum number of clusters the algorithm could form with the given dataset.
Both seaborn and pandas provide APIs to plot bivariate histograms as a hexbin plot (example plotted below). However, I am looking for a way to execute a query on the points that are located in the same hexbin. Is there a function to retrieve the rows associated with the data points in a hexbin?
To give an example:
My data frame contains 3 columns: A, B and C. I use sns.jointplot(x=A, y=B) to plot the density. Now, I want to execute a query on each data point located in the same bin. For instance, for each bin, compute the mean of the C values associated with its points.
Current solution -- Quick Hack
Currently, I have implemented the following function to apply a function to the data associated with a (x,y) coordinate located in the same hexbin:
from matplotlib import pyplot as plt

def hexagonify(x, y, values, func=None):
    """Apply func (default: mean) to the values falling in each hexbin."""
    hexagonized_list = []
    fig = plt.figure()
    fig.set_visible(False)
    if func is not None:
        image = plt.hexbin(x=x, y=y, C=values, reduce_C_function=func)
    else:
        image = plt.hexbin(x=x, y=y, C=values)
    values = image.get_array()
    verts = image.get_offsets()
    for offc in range(verts.shape[0]):
        binx, biny = verts[offc][0], verts[offc][1]
        val = values[offc]
        if val:  # skip empty/masked bins
            hexagonized_list.append((binx, biny, val))
    fig.clear()
    plt.close(fig)
    return hexagonized_list
The values (with the same size as x or y) are passed through the values parameter. The hexbins are computed through the hexbin function of matplotlib. The values are retrieved through the get_array() function of the returned PolyCollection. By default, the np.mean function is applied to the accumulated values per bin. This behaviour can be changed by providing a function to the func parameter. Subsequently, the get_offsets() method allows us to calculate the centers of the bins (discussed here). In this way, we can associate the (by default) mean value of the provided values per hexbin. However, this solution is a hack, so any improvements are welcome.
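Example usage of the hack, with hypothetical column names A, B and C taken from the problem statement:

import numpy as np

# list of (bin_center_x, bin_center_y, aggregated_value) tuples
result = hexagonify(df['A'], df['B'], df['C'], func=np.median)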
From matplotlib
If you have already drawn the plot, you can get the bin counts from the PolyCollection returned by matplotlib:
polycollection: A PolyCollection instance; use PolyCollection.get_array on this to get the counts in each hexagon.
This functionality is also available in:
matplotlib.pyplot.hist2d;
numpy.histogram2d (see the sketch below).
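For example, a sketch using numpy.histogram2d and numpy.digitize to map each row back to its bin (rectangular bins rather than hexagons; df and the column names A, B, C follow the question):

import numpy as np

counts, xedges, yedges = np.histogram2d(df['A'], df['B'], bins=10)
# bin index of every row along each axis (clipped so the right edge falls in the last bin)
ix = np.clip(np.digitize(df['A'], xedges) - 1, 0, counts.shape[0] - 1)
iy = np.clip(np.digitize(df['B'], yedges) - 1, 0, counts.shape[1] - 1)
# rows sharing a bin can now be grouped and queried, e.g. the mean of C per bin
mean_c = df['C'].groupby([ix, iy]).mean()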
Pure pandas
Here is an MCVE using only pandas that can handle the C property:
import numpy as np
import pandas as pd
# Trial Dataset:
N=1000
d = np.array([np.random.randn(N), np.random.randn(N), np.random.rand(N)]).T
df = pd.DataFrame(d, columns=['x', 'y', 'c'])
# Create bins:
df['xb'] = pd.cut(df.x, 3)
df['yb'] = pd.cut(df.y, 3)
# Group by and Aggregate:
p = df.groupby(['xb', 'yb']).agg('mean')['c']
p.unstack()
First we create bins using pandas.cut. Then we group by and aggregate. You can pick the aggregation function you like to aggregate C (e.g. max, median, etc.).
The output looks like:
yb (-2.857, -0.936] (-0.936, 0.98] (0.98, 2.895]
xb
(-2.867, -0.76] 0.454424 0.519920 0.507443
(-0.76, 1.34] 0.535930 0.484818 0.513158
(1.34, 3.441] 0.441094 0.493657 0.385987
I'm trying to find (but not draw!) contour lines for some data:
from pprint import pprint
import matplotlib.pyplot
z = [[0.350087, 0.0590954, 0.002165], [0.144522, 0.885409, 0.378515],
[0.027956, 0.777996, 0.602663], [0.138367, 0.182499, 0.460879],
[0.357434, 0.297271, 0.587715]]
cn = matplotlib.pyplot.contour(z)
I know cn contains the contour lines I want, but I can't seem to get
to them. I've tried several things:
print(dir(cn))
pprint(cn.collections[0])
print(dir(cn.collections[0]))
pprint(cn.collections[0].figure)
print(dir(cn.collections[0].figure))
to no avail. I know cn is a ContourSet, and cn.collections is an array
of LineCollections. I would think a LineCollection is an array of line segments, but I
can't figure out how to extract those segments.
My ultimate goal is to create a KML file that plots data on a world
map, and the contours for that data as well.
However, since some of my data points are close together, and others
are far away, I need the actual polygons (linestrings) that make up
the contours, not just a rasterized image of the contours.
I'm somewhat surprised qhull doesn't do something like this.
Using Mathematica's ListContourPlot and then exporting as SVG works, but I
want to use something open source.
I can't use the well-known CONREC algorithm because my data isn't on a
mesh (there aren't always multiple y values for a given x value, and
vice versa).
The solution doesn't have to be Python, but it does have to be open source and runnable on Linux.
You can get the vertices back by looping over collections and paths and using the iter_segments() method of matplotlib.path.Path.
Here's a function that returns the vertices as a set of nested lists of contour lines, contour sections and arrays of x,y vertices:
import numpy as np
def get_contour_verts(cn):
    contours = []
    # for each contour line
    for cc in cn.collections:
        paths = []
        # for each separate section of the contour line
        for pp in cc.get_paths():
            xy = []
            # for each segment of that section
            for vv in pp.iter_segments():
                xy.append(vv[0])
            paths.append(np.vstack(xy))
        contours.append(paths)
    return contours
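For example, with cn from the question:

contours = get_contour_verts(cn)
print(len(contours))         # number of contour levels
print(contours[0][0][:3])    # first three x, y vertices of the first section of level 0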
Edit:
It's also possible to compute the contours without plotting anything using the undocumented matplotlib._cntr C module:
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import _cntr as cntr

z = np.array([[0.350087, 0.0590954, 0.002165],
              [0.144522, 0.885409, 0.378515],
              [0.027956, 0.777996, 0.602663],
              [0.138367, 0.182499, 0.460879],
              [0.357434, 0.297271, 0.587715]])
x, y = np.mgrid[:z.shape[0], :z.shape[1]]
c = cntr.Cntr(x, y, z)
# trace a contour at z == 0.5
res = c.trace(0.5)
# result is a list of arrays of vertices and path codes
# (see docs for matplotlib.path.Path)
nseg = len(res) // 2
segments, codes = res[:nseg], res[nseg:]
fig, ax = plt.subplots(1, 1)
img = ax.imshow(z.T, origin='lower')
plt.colorbar(img)
ax.hold(True)
p = plt.Polygon(segments[0], fill=False, color='w')
ax.add_artist(p)
plt.show()
I would suggest using scikit-image's find_contours.
It returns a list of contours for a given level.
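A minimal sketch with the z array from the question; note that find_contours returns vertices in (row, column) order, not (x, y):

import numpy as np
from skimage.measure import find_contours

z = np.array([[0.350087, 0.0590954, 0.002165],
              [0.144522, 0.885409, 0.378515],
              [0.027956, 0.777996, 0.602663],
              [0.138367, 0.182499, 0.460879],
              [0.357434, 0.297271, 0.587715]])
contours = find_contours(z, 0.5)   # list of (N, 2) arrays of (row, col) vertices
print(len(contours), contours[0][:3])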
matplotlib._cntr has been removed from matplotlib since v2.2 (see here).
It seems that the contour data is in the .allsegs attribute of the QuadContourSet object returned by the plt.contour() function.
The .allsegs attribute is a list over the levels (which can be specified when calling plt.contour(X, Y, Z, V)); for each level you get a list of n x 2 NumPy arrays.
plt.figure()
C = plt.contour(X, Y, Z, [0], colors='r')

plt.figure()
for ii, seg in enumerate(C.allsegs[0]):
    plt.plot(seg[:, 0], seg[:, 1], '.-', label=ii)
plt.legend(fontsize=9, loc='best')
In the above example, only one level is given, so len(C.allsegs) == 1. You get two figures: the contour plot and the extracted curves.
The vertices of all paths can be returned as a NumPy array of float64 simply via:
vertices = cn.allsegs[i][j] # for element j, in level i
with cn defined as in the original question:
import matplotlib.pyplot as plt
z = [[0.350087, 0.0590954, 0.002165], [0.144522, 0.885409, 0.378515],
[0.027956, 0.777996, 0.602663], [0.138367, 0.182499, 0.460879],
[0.357434, 0.297271, 0.587715]]
cn = plt.contour(z)
More detailed:
Going through the collections and extracting the paths and vertices is not the most straightforward or fastest thing to do. The returned ContourSet actually has attributes for the segments via cn.allsegs, which returns a nested list of shape [level][element][vertex_coord]:
num_levels = len(cn.allsegs)
num_element = len(cn.allsegs[0]) # in level 0
num_vertices = len(cn.allsegs[0][0]) # of element 0, in level 0
num_coord = len(cn.allsegs[0][0][0]) # of vertex 0, in element 0, in level 0
See reference:
https://matplotlib.org/3.1.1/api/contour_api.html