I'm generating basic matplotlib line plots of CPU utilisation per core for multi-core servers. The line plots for each successive series of data are overlaid on top of each other, so the graph for the first series to be plotted is often buried behind the others.
I've managed to improve the plot by progressively reducing the alpha for each series I'm plotting, but the problem is that often a later series is very 'busy'. The same line gets drawn repeatedly on the same pixels, so even with a low alpha it still obscures all the data behind it.
Ideally the colour and alpha for each line should be applied only once to each pixel, no matter how often that line actually passes through the pixel. What I'd like to do is something like drawing each series on a separate 'layer', then apply the alpha to the whole layer in one go, so it doesn't matter how often a line is drawn on any given pixel. I hope that makes sense. Any ideas?
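One possible workaround, sketched below as an illustration rather than a drop-in solution: rasterise each series fully opaque onto its own transparent Agg canvas, then alpha-composite the resulting RGBA layers yourself, so the per-layer alpha is applied exactly once per pixel no matter how often the line crosses it. The helper name, figure size and the made-up "core" data are my assumptions, not your setup.

import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure
import matplotlib.pyplot as plt

def render_layer(x, y, color, xlim, ylim, px=(800, 400), dpi=100):
    # Rasterize one series at full opacity onto its own transparent layer.
    fig = Figure(figsize=(px[0] / dpi, px[1] / dpi), dpi=dpi)
    canvas = FigureCanvasAgg(fig)
    ax = fig.add_axes([0, 0, 1, 1])
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.set_axis_off()
    fig.patch.set_alpha(0)                    # transparent background
    ax.plot(x, y, color=color, linewidth=1)
    canvas.draw()
    return np.asarray(canvas.buffer_rgba()).astype(float) / 255.0

# made-up data: three "cores", the last one very busy
t = np.linspace(0, 10, 20000)
series = [np.sin(t) + 1, np.cos(3 * t) + 1, np.sin(60 * t) * np.cos(t) + 1]
colors = ['C0', 'C1', 'C2']
xlim, ylim = (0, 10), (-0.5, 2.5)

composite = np.ones((400, 800, 3))            # white background
for y, c in zip(series, colors):
    layer = render_layer(t, y, c, xlim, ylim)
    a = 0.5 * layer[..., 3:4]                 # layer alpha, applied once per pixel
    composite = (1 - a) * composite + a * layer[..., :3]

plt.figure(figsize=(8, 4))
plt.imshow(composite, extent=[*xlim, *ylim], aspect='auto')
plt.show()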
I have come across a number of plots (end of page) that are very similar to scatter / swarm plots which jitter the y-axis in order to avoid overlapping dots / bubbles.
How can I get the y values (ideally in an array) based on a given set of x and z values (dot sizes)?
I found the python circlify library but it's not quite what I am looking for.
Example of what I am trying to create
EDIT: For this project I need to be able to output the x, y and z values so that they can be plotted in the user's tool of choice. Therefore I am more interested in solutions that generate the y-coords rather than the actual plot.
Answer:
What you describe in your text is known as a swarm plot (or beeswarm plot) and there are Python implementations of these (especially see seaborn), but also, e.g., in R. That is, these plots allow adjustment of the y-position of each data point so they don't overlap, but otherwise are closely packed.
Seaborn swarm plot:
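For reference, a minimal seaborn swarm plot, using its bundled "tips" dataset as stand-in data rather than your own:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")               # stand-in data
sns.swarmplot(data=tips, x="total_bill", y="day")
plt.show()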
Discussion:
But the plots that you show aren't standard swarm plots (which almost always have the weird-looking "arms"), but instead seem to be driven by some type of physics engine which allows for motion along x as well as y, which produces the well-packed structures you see in the plots (e.g., like a water drop on a spider's web).
That is, in the plot above, if you imagine moving points only along the vertical axis so that they pack better, you can see that, for the most part, you can't really do it. (Honestly, maybe the data shown could be packed a bit better, but not dramatically so -- e.g., the first arm from the left couldn't be improved, and if any of them could be, it's only by moving one or two points inward.) Instead, to get a plot like the ones you show, you'll need some motion in x, such as a physics engine would provide, which hopefully holds x close to its original value but also allows for some variation. But that's a trade-off that needs to be decided on a data level, not a programming level.
For example, here's a plotting library, RAWGraphs, which produces a compact beeswarm plot like the Politico graphs in the question:
But critically, they give the warning:
"It’s important to keep in mind that a Beeswarm plot uses forces to avoid collision between the single elements of the visual model. While this helps to see all the circles in the visualization, it also creates some cases where circles are not placed in the exact position they should be on the linear scale of the X Axis."
Or, similarly, in notes from this D3 package: "Other implementations use force layout, but the force layout simulation naturally tries to reach its equilibrium by pushing data points along both axes, which can be disruptive to the ordering of the data." And here's a nice demo based on D3 force layout where sliders adjust the relative forces pulling the points to their correct values.
Therefore, this plot is a compromise between a swarm plot and a violin plot (which shows a smoothed average for the distribution envelope). But both of those plots give an honest representation of the data, whereas in these closely packed plots the representation comes at the cost of misrepresenting the x-position of the individual data points. Their advantage seems to be that you can color and click on the individual points (where, if you wanted, you could give the actual x-data, although that's not done in the linked plots).
Seaborn violin plot:
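For comparison, the violin version of the same stand-in "tips" data as above:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.violinplot(data=tips, x="total_bill", y="day")
plt.show()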
Personally, I'm really hesitant to misrepresent the data in some unknown way (the outcome of a physics-engine calculation that isn't obvious to the reader). Maybe a better compromise would be a violin filled with non-circular patches, or something like a Raincloud plot.
I created an Observable notebook to calculate the y values of a beeswarm plot with variable-sized circles. The image below gives an example of the results.
If you need to use the JavaScript code in a script, it should be straightforward to copy and paste the code for the AccurateBeeswarm class.
The algorithm simply places the points one by one, as close as possible to the x=0 line while avoiding overlaps. There are also options to add a little randomness to improve the appearance. x values are never altered; this is the one big advantage of this approach over force-directed algorithms such as the one used by RAWGraphs.
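To sketch the idea in Python (this is not the notebook's code, just a rough reimplementation of the "place each point as close to the axis as possible" rule; the ordering heuristic, tolerance and example data are my assumptions):

import numpy as np

def beeswarm_y(x, r):
    # Assign a y to each circle (fixed x, radius r) so circles don't overlap
    # and each sits as close to y = 0 as possible. x is never altered.
    x, r = np.asarray(x, float), np.asarray(r, float)
    y = np.zeros(len(x))
    placed = []                                   # indices already positioned
    for i in np.argsort(-r):                      # big circles first (a heuristic)
        candidates = [0.0]
        for j in placed:
            dx, rr = x[i] - x[j], r[i] + r[j]
            if abs(dx) < rr:                      # horizontal overlap: must dodge in y
                dy = np.sqrt(rr * rr - dx * dx)
                candidates += [y[j] + dy, y[j] - dy]
        def free(cy):                             # collides with nothing placed so far?
            return all((x[i] - x[j]) ** 2 + (cy - y[j]) ** 2
                       >= (r[i] + r[j]) ** 2 - 1e-9 for j in placed)
        y[i] = min((c for c in candidates if free(c)), key=abs)
        placed.append(i)
    return y

# made-up example: 200 points with random positions and sizes
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
rs = rng.uniform(0.02, 0.08, size=200)
ys = beeswarm_y(xs, rs)                           # export xs, ys, rs to any tool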
I have a scatter plot of data and would like to highlight certain ranges of the x-axis. When the number of ranges to highlight is relatively small, using BoxAnnotation works well. However, I'm trying to make many adjacent highlighted regions (with different opacities). With many adjacent BoxAnnotations, zoomed out, the boxes slightly overlap, creating lines. Additionally, thousands of BoxAnnotations take a long time to generate and do not run smoothly when interacting with the plot.
To be more specific about my case, I have some temporal data and a predictive model detecting the probability of some event occurring in the data. I want each segment to be highlighted with an opacity given by the probability that an event is occurring at that point in time. However, my current BoxAnnotation approach results in artificial lines from overlap of boxes when zoomed out (they disappear when zooming in on a region), and slow responsiveness of the interactive plot.
Is there a way to accomplish something similar to this without the artifacts and with a smoother experience?
Current method:
from bokeh.models import BoxAnnotation, ColumnDataSource
from bokeh.plotting import figure, show

source = ColumnDataSource(data=data_frame)
figure_ = figure(x_axis_label='Time', y_axis_label='Intensity')
for index in range(data_frame.shape[0] - 1):
    figure_.add_layout(
        BoxAnnotation(left=data_frame['time'].values[index],
                      right=data_frame['time'].values[index + 1],
                      fill_alpha=data_frame['prediction'].values[index],
                      fill_color='red', line_alpha=0)
    )
figure_.circle(x='time', y='intensity', source=source)
show(figure_)
Example of artificial lines when there are too many small adjacent BoxAnnotations:
When zooming on the x-axis, the lines disappear:
There's probably not any way to salvage this exact approach. The artifacts are due to the functioning of the underlying raster HTML canvas, and there's not anything that can be done about that. And any slowness is due to the fact that this kind of use of BoxAnnotation (with so very many individual instances) is not at all what was envisioned, and it is simply not optimized to show hundreds of instances the way e.g. scatter glyphs are. You are trying to use box annotations to construct a sort of translucent heat map, and that is not a good fit, for the reasons above.
You could potentially overcome slowness by using a single rect or vbar glyph that draws all the boxes at once in a vectorized way. But that won't alleviate the compositing issues.
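As a rough sketch of that vectorized approach (reusing data_frame and source from your snippet, with column names taken from it; the bottom/top values are just placeholders for "tall enough to span the data"):

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

times = data_frame['time'].values
box_source = ColumnDataSource(data=dict(
    x=(times[:-1] + times[1:]) / 2,               # box centres
    width=times[1:] - times[:-1],                 # box widths
    alpha=data_frame['prediction'].values[:-1],   # per-box fill alpha
))

figure_ = figure(x_axis_label='Time', y_axis_label='Intensity')
figure_.vbar(x='x', width='width', source=box_source,
             bottom=data_frame['intensity'].min(),
             top=data_frame['intensity'].max(),
             fill_color='red', fill_alpha='alpha', line_alpha=0)
figure_.circle(x='time', y='intensity', source=source)
show(figure_)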
Your best bet is to create a semi-transparent "heatmap" image overlay yourself with a tool or code that can afford better control over the details of rasterization and compositing. I can't really advise you on how to do that in any detail. The Datashader library might be useful for this.
The program pastebinned below generates a plot that looks like:
Pastebin: http://pastebin.com/wNgAG6K9
Basically, the program solves an equation for AA, and plots the values provided AA>0 and AA=/=0. The data is plotted using pcolormesh from 3 arrays called x, y and z (lines 57 - 59).
What I want to do:
I would like to plot a line around the boundary where the solutions go from zero (black) to non-zero (yellow/green), see plot below. What is the most sensible way to go about this?
I.e. lines in red (done crudely in MS paint)
Further info: I need to be able to store the red dashed boundary values so that I can plot the red dashed boundary condition to another 2d plot made from real/measured/non-theoretical data.
Feel free to ask for further information.
Without seeing your data, I would suggest first trying to work with matplotlib's internal algorithm to plot the contour line corresponding to the zero level. This is simple, but it might happen that the interpolation that is used for this doesn't look good enough (I don't know if it can find that sharp peak in the contour line). The proof of the pudding is in the eating:
plt.contour(x,y,z,[0],colors='r',linewidths=2,linestyles='dashed')
If that doesn't suffice, you might have to resort to image processing methods to find the boundaries of your data (after turning it into binary).
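If the contour does look acceptable, you can also pull the boundary coordinates back out of the returned contour set and store them for use with your measured data (x, y, z are the same arrays you pass to pcolormesh; the file name is just an example):

import numpy as np
import matplotlib.pyplot as plt

cs = plt.contour(x, y, z, [0], colors='r', linewidths=2, linestyles='dashed')

# allsegs[0] holds the segments of the first requested level; each segment is
# an (N, 2) array of (x, y) vertices tracing the boundary.
boundary = np.vstack(cs.allsegs[0])
np.savetxt('boundary_xy.txt', boundary)

# later, on the plot of measured data:
# plt.plot(boundary[:, 0], boundary[:, 1], 'r--')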
I want to plot a narrow peak that contains several thousand data points, additionally I want to draw a vertical line marking the peak. I'm using Python 2.7 and Matplotlib 1.3.1 on Win7 (checked on two separate PCs).
For example, here is a Gaussian centred on sqrt(2) with its peak marked in red:
from numpy import arange, exp, sqrt
import matplotlib.pyplot as plt
plt.plot(-arange(-4, 4, 1e-5) + sqrt(2), exp(-arange(-4, 4, 1e-5)**2)*0.92, 'k',
         [sqrt(2), sqrt(2)], [0, 1], 'r')
When the plot is wide enough you won't notice that anything is wrong, but as you make the plot narrower you will start to see that the red line sits to the right of the peak. By taking a screenshot and zooming in on the pixels you can see that the line is exactly one pixel off where it should be. (The image above shows the peak enlarged with and without the red line so as to make it clear that the line has missed the peak.)
Am I being stupid or is this a bug? If it's a bug presumably only one of the two lines is wrong: black or red?
Worrying about single pixels might sound pedantic, but your eye is actually surprisingly good at doing sub-pixel interpolation and can easily spot the problem.
Which setting do I have to use to align the y-axis position of the middle plot with the other two? The bigger y-axis scale pushes it out of alignment, sadly.
I am creating the graphs with:
plotting.gridplot(rows)
Where
rows.append(l)
with
l = line('x', 'y', source=datasource,
         x_range=x_range[0], ...)
x_range[0] = l.x_range
for multiple 'y' in the datasource.
The graphs' ranges are coupled via the shared x_range.
That's a bit hard to do at the moment, unfortunately. We are in the process of integrating cassowary.js for much better layout options for subplots, guides, annotations, etc. In the meantime, you can set min_border (and min_border_left, min_border_top, etc.) on your plots. This will make the border area a minimum size even if it could be smaller, so if you set it large enough to accommodate any labels you expect to see, it should help make the plot sizes consistent.
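For example, something along these lines (written against the current bokeh.plotting API rather than the older interface in your snippet; the column names and sizes are placeholders):

from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

figs = []
for col in ['y1', 'y2', 'y3']:                 # placeholder column names
    p = figure(width=300, height=300)
    p.line('x', col, source=datasource)
    p.min_border_left = 80                     # large enough for any tick labels
    p.min_border_top = 40
    figs.append(p)

# share one x range so the plots stay coupled, then lay them out in a row
for p in figs[1:]:
    p.x_range = figs[0].x_range
show(gridplot([figs]))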