Show individual cluster elements in a dendrogram graph - python

It is easy to use two images to showcase what I am working to create. I have the following Python dendrogram, created from the following code (not currently reproducible, but wanted to show the code regardless):
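import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# Initialize Plot (Z is the linkage matrix, computed elsewhere)
plt.figure(figsize=(18, 9))
hierarchy.dendrogram(
    Z=Z,
    p=20,
    orientation="top",
    truncate_mode='lastp',
    leaf_rotation=45.,
    leaf_font_size=15.,
)
plt.show()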
It is fairly simple and straightforward to this point. However, in an effort to better visualize the clusters, I'd like to show this same dendrogram, with string values for the elements in the cluster below the cluster, as such (created demo by pasting image into Excel and typing values):
Is this possible to do in Python with the dendrogram function, or in any other way? A hacky approach that uses subplots + a 2nd "graph" that is actually a table could be a possible solution, however it would be good if a less-hacky solution existed.
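One possible approach, sketched here under assumptions rather than taken from an answer: scipy's dendrogram accepts a leaf_label_func callback, and hierarchy.to_tree(Z, rd=True) returns a list of cluster nodes indexed by id, so the callback can look up which original observations sit under each truncated leaf. The helper name make_leaf_label_func, the max_names cutoff, and the random demo data are all hypothetical.

```python
import numpy as np
from scipy.cluster import hierarchy


def make_leaf_label_func(Z, names, max_names=5):
    """Build a leaf_label_func that lists the members of each (truncated) leaf."""
    # rd=True also returns a list in which nodes[i] is the cluster node with id i.
    _, nodes = hierarchy.to_tree(Z, rd=True)

    def label(node_id):
        # pre_order() yields the ids of the original observations under this node.
        members = [names[i] for i in nodes[node_id].pre_order()]
        shown = members[:max_names]
        if len(members) > max_names:
            shown.append(f"(+{len(members) - max_names} more)")
        return "\n".join(shown)

    return label


# Hypothetical demo data: 12 random 2-D points with made-up names.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
names = [f"item{i}" for i in range(12)]
Z = hierarchy.linkage(X, method="ward")

info = hierarchy.dendrogram(
    Z,
    p=4,
    truncate_mode="lastp",
    leaf_label_func=make_leaf_label_func(Z, names, max_names=12),
    no_plot=True,  # only compute the labels here; remove this to draw the plot
)
```

To draw the figure, remove no_plot=True, call plt.figure(...) beforehand and plt.show() afterwards, and set leaf_rotation=0 so the stacked names read as a column under each cluster. Long member lists will need a smaller leaf_font_size or a lower max_names.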

Related

Creating a packed bubble / scatter plot in python (jitter based on size to avoid overlapping)

I have come across a number of plots (end of page) that are very similar to scatter / swarm plots, which jitter the y-axis in order to avoid overlapping dots / bubbles.
How can I get the y values (ideally in an array) based on a given set of x and z values (dot sizes)?
I found the python circlify library but it's not quite what I am looking for.
Example of what I am trying to create
EDIT: For this project I need to be able to output the x, y and z values so that they can be plotted in the user's tool of choice. Therefore I am more interested in solutions that generate the y-coords rather than the actual plot.
Answer:
What you describe in your text is known as a swarm plot (or beeswarm plot) and there are python implementations of these (esp see seaborn), but also, eg, in R. That is, these plots allow adjustment of the y-position of each data point so they don't overlap, but otherwise are closely packed.
Seaborn swarm plot:
Discussion:
But the plots that you show aren't standard swarm plots (which almost always have the weird looking "arms"). Instead, they seem to be driven by some type of physics engine which allows for motion along x as well as y, and this produces the well packed structures you see in the plots (eg, like a water drop on a spider's web).
That is, in the plot above, by imagining moving points only along the vertical axis so that it packs better, you can see that, for the most part, you can't really do it. (Honestly, maybe the data shown could be packed a bit better, but not dramatically so -- eg, the first arm from the left couldn't be improved, and if any of them could, it's only by moving one or two points inward). Instead, to get the plot like you show, you'll need some motion in x, like would be given by some type of physics engine, which hopefully is holding x close to its original value, but also allows for some variation. But that's a trade-off that needs to be decided on a data level, not a programming level.
For example, here's a plotting library, RAWGraphs, which produces a compact beeswarm plot like the Politico graphs in the question:
But critically, they give the warning:
"It’s important to keep in mind that a Beeswarm plot uses forces to avoid collision between the single elements of the visual model. While this helps to see all the circles in the visualization, it also creates some cases where circles are not placed in the exact position they should be on the linear scale of the X Axis."
Or, similarly, in notes from this D3 package: "Other implementations use force layout, but the force layout simulation naturally tries to reach its equilibrium by pushing data points along both axes, which can be disruptive to the ordering of the data." And here's a nice demo based on D3 force layout where sliders adjust the relative forces pulling the points to their correct values.
Therefore, this plot is a compromise between a swarm plot and a violin plot (which shows a smoothed average for the distribution envelope). Both of those give an honest representation of the data, whereas the closely packed representation comes at the cost of misrepresenting the x-position of the individual data points. Its advantage seems to be that you can color and click on the individual points (where, if you wanted, you could give the actual x-data, although that's not done in the linked plots).
Seaborn violin plot:
Personally, I'm really hesitant to misrepresent the data in some unknown way (that's the outcome of a physics engine calculation but not obvious to the reader). Maybe a better compromise would be a violin filled with non-circular patches, or something like a Raincloud plot.
I created an Observable notebook to calculate the y values of a beeswarm plot with variable-sized circles. The image below gives an example of the results.
If you need to use the JavaScript code in a script, it should be straightforward to copy and paste the code for the AccurateBeeswarm class.
The algorithm simply places the points one by one, as close as possible to the y=0 line (the x-axis) while avoiding overlaps. There are also options to add a little randomness to improve the appearance. x values are never altered; this is the one big advantage of this approach over force-directed algorithms such as the one used by RAWGraphs.
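For anyone who wants the y-coordinates from Python, the placement idea described above can be sketched roughly as follows. This is not the notebook's AccurateBeeswarm code, only a simplified O(n²) illustration of the same greedy idea; the function name beeswarm_y, the largest-first ordering, and the padding parameter are assumptions made for the example.

```python
import math


def beeswarm_y(xs, radii, padding=0.0):
    """Greedy beeswarm: keep every x fixed and pick the free y closest to 0."""
    n = len(xs)
    ys = [0.0] * n
    placed = []
    # Assumption: place the largest circles first (the order is a design choice).
    for i in sorted(range(n), key=lambda k: -radii[k]):
        # Already-placed circles close enough in x to possibly touch circle i.
        near = [j for j in placed
                if abs(xs[i] - xs[j]) < radii[i] + radii[j] + padding]

        # Candidates: y = 0, or just touching one neighbour from above or below.
        candidates = [0.0]
        for j in near:
            d = radii[i] + radii[j] + padding
            dy = math.sqrt(d * d - (xs[i] - xs[j]) ** 2)
            candidates += [ys[j] + dy, ys[j] - dy]

        def is_free(y):
            # Small tolerance so that "exactly touching" counts as free.
            return all(
                (xs[i] - xs[j]) ** 2 + (y - ys[j]) ** 2
                >= (radii[i] + radii[j] + padding) ** 2 - 1e-9
                for j in near
            )

        # The topmost candidate is always free, so this never runs out of options.
        ys[i] = min((y for y in candidates if is_free(y)), key=abs)
        placed.append(i)
    return ys


# Hypothetical input: x positions and circle radii (the "z" sizes).
xs = [0.0, 0.1, 0.2, 0.3, 5.0, 0.15]
radii = [1.0, 0.5, 0.3, 0.8, 1.0, 0.2]
ys = beeswarm_y(xs, radii)
```

The returned ys line up index-for-index with xs and radii, so the three arrays can be exported and plotted in any tool.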

marker style by third variable

Might seem like a repeat question, but the solution in this post doesn't seem to work for me.
I have a bunch of data I want to plot as lines/curves, and another dataset linked to the curves consisting of XYZ data, where Z represents a labeling variable for the curves.
I've got some example code here with some XY data, and labels for anyone wanting to replicate what I'm doing:
plt.plot(xdata, ydata)
plt.scatter(xlab, ylab, c=lab) # needs a marker function adding
plt.show()
Ideally I want to add some kind of unique marker based on the label values; 0.1,0.5,1,2,3,4,6,8,10,20. The labels are the same for each curve.
I have over 100 curves to plot, so something quick and effective is needed. Any help would be great!
My current solution would be to just split the data by labelling values, and then plot separately for each one (long and messy in my opinion). Figured someone might have a more elegant solution here.
I'm guessing you could do this with a dictionary... but I might need some help doing that!
Cheers, KB
Matplotlib does not accept different markers within a single scatter call.
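The dictionary idea from the question can still be done in plain matplotlib with one scatter call per label value. The following is a minimal sketch under assumptions: the helper scatter_by_label, the marker choices, and the three made-up curves are illustrative only, while the label values are the ones listed in the question.

```python
import numpy as np
import matplotlib.pyplot as plt

LABELS = [0.1, 0.5, 1, 2, 3, 4, 6, 8, 10, 20]
MARKERS = ["o", "s", "^", "v", "D", "P", "X", "*", "<", ">"]
marker_for = dict(zip(LABELS, MARKERS))  # label value -> marker style


def scatter_by_label(ax, x, y, lab, marker_for):
    """Draw one scatter call per distinct label so each gets its own marker."""
    x, y, lab = (np.asarray(a) for a in (x, y, lab))
    handles = {}
    for value in np.unique(lab):
        mask = lab == value
        handles[float(value)] = ax.scatter(
            x[mask], y[mask], marker=marker_for[value], label=f"{value:g}"
        )
    return handles


# Hypothetical data: three curves, each labelled at the same ten values.
fig, ax = plt.subplots()
xlab, ylab, lab = [], [], []
for scale in (1.0, 2.0, 3.0):
    xdata = np.linspace(0, 20, 200)
    ax.plot(xdata, scale * np.sqrt(xdata), color="grey")
    xlab += LABELS
    ylab += [scale * np.sqrt(v) for v in LABELS]
    lab += LABELS

handles = scatter_by_label(ax, xlab, ylab, lab, marker_for)
ax.legend(title="label")
```

Because the loop runs over the ten label values rather than over the 100+ curves, the number of scatter calls stays at ten however many curves are plotted. Call plt.show() to display the result.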
However, a less verbose and more robust solution for large datasets is to use the pandas and seaborn libraries:
Additionally, you can use the pandas.cut function to plot bins (it's something I regularly need when producing graphs where a third continuous value is used as a parameter). The way to use it is:
import pandas as pd
import seaborn as sns
url = 'https://pastebin.com/raw/dwGBLqSb' # url of paste
df = pd.read_csv(url)
sns.scatterplot(data = df, x='labx', y='laby', style='lab')
and it produces the following example:
If you need more advanced labelling, you could also look at the LabelEncoder from sklearn.
Hopefully, I've edited this answer enough not to run afoul of "don't post identical answers to multiple questions". For what it's worth, I am not affiliated with the seaborn library in any way, nor am I trying to promote anything. The only thing I am trying to do is help someone with a similar problem to one I came across and couldn't easily find a clear answer to on SE.
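The pandas.cut step mentioned above is not shown in the snippet, so here is a small sketch of what it might look like. The bin edges, the bin names, and the demo rows are assumptions; the column names labx, laby and lab follow the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one row per labelled point.
df = pd.DataFrame({
    "labx": range(10),
    "laby": [v ** 0.5 for v in range(10)],
    "lab": [0.1, 0.5, 1, 2, 3, 4, 6, 8, 10, 20],
})

# Bin the continuous label into three categories; intervals are right-inclusive,
# i.e. (0, 1], (1, 5] and (5, 20].
df["lab_bin"] = pd.cut(df["lab"], bins=[0, 1, 5, 20], labels=["low", "mid", "high"])
```

Passing style='lab_bin' (and optionally hue='lab_bin') to sns.scatterplot then gives one marker per bin instead of one per distinct value.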

Plotting a large graph in igraph

I created a Graph using the igraph package (http://hal.elte.hu/~nepusz/development/igraph/tutorial/tutorial.html), and I'm trying to plot it using the Plot() function. My graph has ~20000 Vertices and ~30000 Edges, and I'm struggling to find a way to plot it such that it looks remotely presentable.
Distributed Recursive Layout (DrL) gives decent results, but it doesn't make much sense since vertices are arbitrarily grouped together. Fiddling with the vertex, label, and edge sizes doesn't help much either.
Could someone, who has experience with plots of this size, suggest a suitable layout to data of this size?

Embed matplotlib figure in larger figure

I am writing a bunch of scripts and functions for processing astronomical data. I have a set of galaxies, for which I want to plot some different properties in a 3-panel plot. I have an example of the layout here:
Now, this is not a problem. But sometimes, I want to create this plot just for a single galaxy. In other cases, I want to make a larger plot consisting of subplots that are each made up of the three-pane structure, like this mockup:
For the sake of modularity and reusability of my code, I would like my function to just return a matplotlib.figure.Figure object and then let the caller (a function or an interactive session) decide whether to show() or savefig the object, or embed it in a larger figure. But I cannot seem to find any hints of this in the documentation or elsewhere; it doesn't seem to be something people do that often.
Any suggestions as to what would be the best road to take? I have speculated whether using axes_grid would be the solution, but it doesn't seem quite clean and caller-agnostic to me. Any suggestions?
The best solution is to separate the figure layout logic from the plotting logic. Write your plotting code something like this:
def three_panel_plot(data, plotting_args, ax1, ax2, ax3):
    # what you do to plot
    ...
So now the code that turns data -> images takes as arguments the data and where it should plot that data to.
If you want to do just one, it's easy, if you want to do a 3x3 grid, you just need to generate the layout and then loop over the axes sets + data.
The way you are suggesting (returning an object out of your plotting routine) would be very hard in matplotlib, because the internals are too connected.
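A rough sketch of how that separation could look with nested grid specs is below. The helper make_three_axes, the placeholder panel contents, and the fake_galaxy data generator are hypothetical; only the idea of passing the axes into the plotting function comes from the answer above.

```python
import numpy as np
import matplotlib.pyplot as plt


def three_panel_plot(data, ax1, ax2, ax3):
    """Plotting logic only: draws into whatever axes the caller provides."""
    x, y = data
    ax1.plot(x, y)
    ax2.hist(y, bins=10)
    ax3.scatter(x, y, s=4)


def make_three_axes(fig, cell):
    """Layout logic only: split one grid cell into a row of three axes."""
    inner = cell.subgridspec(1, 3)
    return [fig.add_subplot(inner[0, k]) for k in range(3)]


def fake_galaxy(seed):
    # Hypothetical stand-in for real per-galaxy data.
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, 50)
    return x, np.sin(6 * x) + rng.normal(scale=0.2, size=x.size)


# One galaxy: a figure with a single cell.
single = plt.figure(figsize=(9, 3))
three_panel_plot(fake_galaxy(0), *make_three_axes(single, single.add_gridspec(1, 1)[0, 0]))

# Nine galaxies: a 3x3 outer grid, each cell holding the three-panel unit.
grid = plt.figure(figsize=(15, 9))
outer = grid.add_gridspec(3, 3)
for n in range(9):
    three_panel_plot(fake_galaxy(n), *make_three_axes(grid, outer[n // 3, n % 3]))
```

The caller keeps hold of the figures and decides whether to call savefig or plt.show(); the plotting function itself never creates or displays anything.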

Barchart (or plot) 3D in Python

I need to plot some data in various forms. Currently I'm using Matplotlib and I'm fairly happy with the plots I'm able to produce.
This question is on how to plot the last one. The data is similar to a "distance table", like this (just bigger; my table is 128x128 and still has 3 or more numbers per element).
Now, my data is much better "structured" than a distance table (my data doesn't vary "randomly" like in an alphabetically sorted distance table), thus a 3D barchart, or maybe 3 of them, would be perfect. My understanding is that such a chart is missing in Matplotlib.
I could use a (colored) Contour3d like these or something in 2D like imshow, but it isn't really representative of what the data is (the data has meaning only at my 128 points; there isn't anything between two points). And the height of bars is more readable than color, IMO.
Thus the questions:
is it possible to create 3D barchart in Matplotlib? It should be clear that I mean with a 2D domain, not just a 2D barchart with a "fake" 3D rendering for aesthetics purposes
if the answer to the previous question is no, then is there some other library able to do that? I strongly prefer something Python-based, but I'm OK with other Linux-friendly possibilities
if the answer to the previous question is no, then do you have any suggestions on how to show that data? E.g. create a table with the values, superimposed to the imshow or other colored way?
For some time now, matplotlib had no 3D support, but it has been added back recently. You will need to use the svn version, since no release has been made since, and the documentation is a little sparse (see examples/mplot3d/demo.py). I don't know if mplot3d supports real 3D bar charts, but one of the demos looks a little like it could be extended to something like that.
Edit: The source code for the demo is in the examples but for some reason the result is not. I mean the test_polys function, and here's what it looks like:
example figure http://www.iki.fi/jks/tmp/poly3d.png
The test_bar2D function would be even better, but it's commented out in the demo as it causes an error with the current svn version. Might be some trivial problem, or something that's harder to fix.
MayaVi2 can make 3D barcharts (scroll down a bit). Once you have MayaVi/VTK/ETS/etc. installed it all works beautifully, but it can be some work getting it all installed. Ubuntu has all of it packaged, but they're the only Linux distribution I know that does.
One more possibility is Gnuplot, which can draw some kind of pseudo 3D bar charts, and gnuplot.py allows interfacing to Gnuplot from Python. I have not tried it myself, though.
This is my code for a simple Bar-3d using matplotlib.
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib
import matplotlib.pyplot as plt
# In a Jupyter notebook, also run: %matplotlib inline

## The values you want to plot
zval = [0.020752244, 0.078514652, 0.170302899, 0.29543857, 0.45358061,
        0.021255922, 0.079022499, 0.171294169, 0.29749654, 0.457114286,
        0.020009631, 0.073154019, 0.158043498, 0.273889264, 0.419618287]
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
col = ["#ccebc5", "#b3cde3", "#fbb4ae"] * 5
xpos = [1, 2, 3] * 5
ypos = list(range(1, 6)) * 5  # list() is needed on Python 3, where a range cannot be multiplied
zpos = [0] * 15
dx = [0.4] * 15
dy = [0.5] * 15
dz = zval
# Draw one bar per (ypos, xpos) pair
for i in range(15):
    ax.bar3d(ypos[i], xpos[i], zpos[i], dx[i], dy[i], dz[i],
             color=col[i], alpha=0.75)
ax.view_init(azim=120)
plt.show()
http://i8.tietuku.com/ea79b55837914ab2.png
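For a full table such as the 128x128 one in the question, bar3d also accepts arrays, so the explicit loop is not required. Below is a small sketch that assumes the values live in a 2D NumPy array; the random 8x8 table and the bar width of 0.8 are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)

# Hypothetical stand-in for the real table (one bar per cell).
rng = np.random.default_rng(0)
table = rng.random((8, 8))

# Row/column index of every cell, flattened to 1-D for bar3d.
rows, cols = np.indices(table.shape)
x = cols.ravel().astype(float)
y = rows.ravel().astype(float)
dz = table.ravel()

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
# All bars start at z=0; the scalar widths are broadcast to every bar.
ax.bar3d(x, y, np.zeros_like(dz), 0.8, 0.8, dz, shade=True)
ax.set_xlabel("column")
ax.set_ylabel("row")
```

With three numbers per element, one such axes per value (for example three subplots side by side) keeps the bars readable. Call plt.show() to display the figure.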
You might check out Chart Director:
http://www.advsofteng.com
It has a pretty wide variety of charts and graphs and has a nice Python (and several other languages) API.
There are two editions: the free version puts a blurb on the generated image, and the paid version is pretty reasonably priced.
Here's one of the more interesting looking 3d stacked bar charts:
(source: advsofteng.com)
