How to better fit seaborn violinplots?


The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?

As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
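To see why the violin reaches below zero, here is a minimal sketch that fits a Gaussian KDE directly with scipy.stats.gaussian_kde. Using SciPy as a stand-in for what seaborn does internally is an assumption, and the seed and the evaluation point -0.05 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.random(100)                 # uniform samples in [0, 1), so none are negative

kde = gaussian_kde(data)               # Gaussian kernels with the default bandwidth rule
density_below_zero = kde([-0.05])[0]   # evaluate the estimate slightly below the data range

print(data.min() >= 0)                 # True: the data has no negative values
print(density_below_zero > 0)          # True: the smoothed estimate still has mass there
```

Every Gaussian kernel has infinite tails, so a sample near 0 always contributes some density on the negative side.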
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.

Related

evaluate numerically the density that `sns.kdeplot` has put in the plot?

Out of the box, seaborn does a very good job of plotting a 2D KDE or jointplot. However, it does not return anything like a function that I can evaluate to read the values of the estimated density numerically.
How can I evaluate numerically the density that sns.kdeplot or jointplot has put in the plot?
Just for completeness: I saw something interesting in the SciPy docs, stats.gaussian_kde, but I am getting very clunky density plots which, apparently because of a missing extent, are really off compared to the scatter plot. So I would like to stay away from the SciPy KDE, at least until I figure out how to make it work and why pyplot is so much less "smart" than seaborn.
Anyhow, the evaluate method of the scipy.stats.gaussian_kde does its job.
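A minimal sketch of that evaluate route, assuming synthetic data in place of the asker's (the variable names and grid size are illustrative only). It evaluates the density at a single point and on a regular grid.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(size=500)                         # hypothetical sample, not the asker's data
y = 0.5 * x + rng.normal(scale=0.5, size=500)

kde = gaussian_kde(np.vstack([x, y]))            # dataset must have shape (n_dims, n_points)

# Density at a single point (0, 0); points are passed as shape (n_dims, n_points) too
value_at_origin = kde.evaluate([[0.0], [0.0]])[0]

# Density on a 100 x 100 grid spanning the data
xx, yy = np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
zz = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
```

When drawing zz with plt.imshow, passing zz.T together with origin="lower" and extent=[x.min(), x.max(), y.min(), y.max()] should line the image up with a scatter plot of the same data, which is probably the "missing extent" problem mentioned in the question.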

I also ran into this issue with the jointplot() method. I opened the file distributions.py under anaconda3/lib/python3.7/site-packages/seaborn/ and added these lines to the _bivariate_kdeplot() function:
print("xx=",xx[50])
print("yy=",yy[:,50])
print("z=",z[50])
This prints the 100 values at index 50 of the xx, yy and z arrays. Here z is the density, and xx and yy are the grid coordinates in meshgrid form; their extent and resolution follow the bandwidth, cut, clip and gridsize settings given by the user. This gave me some idea of the actual values behind the 2D KDE plot.
If you print each array in full, you get 100 x 100 values for each.

seaborn tsplot with non-connected confidence intervals

I'm using seaborn's tsplot function to plot how well my model fit matches actual data in a time series, with CIs showing my predictions' standard deviations. My question is: Is there a way for tsplot not to fill in CIs between points? That is, for it to show the CIs of each point individually without connecting one CI to the next.
For the means this is accomplished by setting "interpolate" to False. I'm looking for the same, but for CIs.
To illustrate, my plots currently look like this:
I'm fine with how this looks for means (red dots) that are close together, but the CI-transition looks rather odd when one mean is close to 1 and the next is close to 0. The data just happens to be like this. I'd be happy to turn the CI "connection" off, but would also be happy for any related aesthetic suggestions. Thank you.
For completeness' sake, the relevant offending code fragment is as follows:
import seaborn as sns; sns.set(color_codes=True)
import matplotlib.pyplot as plt
model_fit = ...  # fit data (elided in the original post)
actual_data = ...  # actual data (elided in the original post)
sns.tsplot(model_fit,interpolate=False,ci='sd',color='indianred',condition='predicted')
plt.plot(X,actual_data ,linestyle='None',marker='*',label='actual')
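No answer to this question is recorded here. One possible workaround, assuming plain matplotlib is acceptable instead of tsplot, is to draw a separate error bar per time point so that nothing is filled in between neighbours. The array shape (runs x time points) and the random data below are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
model_fit = rng.random((20, 15))     # hypothetical: 20 model runs x 15 time points

x = np.arange(model_fit.shape[1])
mean = model_fit.mean(axis=0)        # per-time-point mean across runs
sd = model_fit.std(axis=0)           # per-time-point standard deviation

fig, ax = plt.subplots()
# fmt="o" draws markers only, so neither the means nor the intervals are connected
ax.errorbar(x, mean, yerr=sd, fmt="o", color="indianred", capsize=3, label="predicted")
ax.legend()
```

The actual data can then be added to the same axes with ax.plot(..., linestyle="None", marker="*") as in the fragment above.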

Finding only the "prominent" local maxima of a 1d array

I have a couple of data sets with clusters of peaks that look like the following:
You can see that the main features here are clusters of peaks, each cluster having three peaks. I would like to find the x values of those local peaks, but I am running into a few problems. My current code is as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelmax
def rounddown(x):
    return int(np.floor(x / 10.0)) * 10
# np.loadtxt is the NumPy original of the deprecated scipy.loadtxt alias
pixel, value = np.loadtxt('voltage152_4.txt', unpack=True, skiprows=0)
ax = plt.axes()
ax.plot(pixel, value, '-')
ax.axis([0, np.max(pixel), np.min(value), np.max(value) + 1])
maxTemp = argrelmax(value, order=5)
maxes = []
for maxi in maxTemp[0]:
    if value[maxi] > 40:
        maxes.append(maxi)
ax.plot(maxes, value[maxes], 'ro')
plt.yticks(np.arange(rounddown(value.min()), value.max(), 10))
plt.savefig("spectrum1.pdf")
plt.show()
Which works relatively well, but still isn't perfect. Some peaks labeled: The main problem here is that my signal isn't smooth, so a few things that aren't actually my relevant peaks are getting picked up. You can see this in the stray maxima about halfway down a cluster, as well as peaks that have two maxima where in reality it should be one. You can see near the center of the plot there are some high frequency maxima. I was picking those up so I added in the loop only considering values above a certain point.
I am afraid that smoothing the curve will actually make me lose some of the clustered peaks that I want, as in some of my other datasets they are even closer together. Maybe my fears are unfounded, though, and I am just misunderstanding how smoothing works. Any help would be appreciated.
Does anyone have a solution for picking out only the "prominent" peaks? That is, only those peaks that are quite large compared to the others?

Starting with SciPy version 1.1.0 you may also use the function scipy.signal.find_peaks which allows you to select detected peaks based on their topographic prominence. This function is often easier to use than find_peaks_cwt. You'll have to play around a little bit to find the optimal lower bound to pass as a value to prominence but e.g. find_peaks(..., prominence=5) will ignore the unwanted peaks in your example. This should bring you reasonably close to your goal. If that's not enough you might do your own peak selection based upon peak properties like the left_/right_bases which are optionally returned.
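As a rough sketch of how this might look, the example below builds a synthetic signal (three Gaussian peaks plus a small sine ripple standing in for noise, all made up for illustration) and compares find_peaks with and without a prominence bound. The value 5 is the one suggested above; the right bound for real data has to be found by trial.

```python
import numpy as np
from scipy.signal import find_peaks

x = np.linspace(0, 100, 1001)
centers = (30.0, 50.0, 70.0)                      # hypothetical peak positions

# Three tall Gaussian peaks plus a small deterministic ripple that mimics noise
signal = sum(50.0 * np.exp(-((x - c) ** 2) / (2 * 2.0**2)) for c in centers)
signal = signal + 0.5 * np.sin(3.0 * x)

all_peaks, _ = find_peaks(signal)                 # every local maximum, ripple included
prominent_peaks, props = find_peaks(signal, prominence=5)

print(len(all_peaks), len(prominent_peaks))       # many local maxima versus only the tall ones
print(x[prominent_peaks])                         # x positions of the retained peaks
print(props["prominences"])                       # prominence of each retained peak
```

The returned props dictionary also holds left_bases and right_bases, which can be used for the custom selection mentioned above.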

I'd also recommend scipy.signal.find_peaks for what you're looking for. The other, older SciPy alternative, find_peaks_cwt, is quite complicated to use.
It will basically do what you're looking for in a single line. Apart from the prominence parameter that lagru mentioned, either the threshold or height parameters might also do what you need for your data.
height=40 would filter to get all the peaks you like.
Exactly what prominence does can be a bit hard to wrap your head around sometimes.
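A tiny hand-made example may help to separate the two ideas. The nine-sample signal below is invented for illustration: it has a tall peak, a small bump sitting on the tall peak's shoulder, and a low isolated peak. Height looks only at the peak's value, while prominence measures how far the peak rises above the higher of the two lowest points (one on each side) found before the signal climbs past the peak's own height or reaches the edge.

```python
import numpy as np
from scipy.signal import find_peaks, peak_prominences

signal = np.array([0, 10, 50, 45, 46, 20, 0, 5, 0], dtype=float)

peaks, _ = find_peaks(signal)                       # local maxima at indices 2, 4, 7
prominences = peak_prominences(signal, peaks)[0]    # 50, 1 and 5 respectively

by_height, _ = find_peaks(signal, height=40)        # keeps indices 2 and 4
by_prominence, _ = find_peaks(signal, prominence=10)  # keeps index 2 only
```

The bump at index 4 is high in absolute terms (46) but rises only 1 above the dip that separates it from the taller neighbour, so a height filter keeps it and a prominence filter drops it.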

Confusion with bandwidth on seaborn's kdeplot

lineslist, below, represents a set of lines (for some chemical spectrum, let's say), in MHz. I know the linewidth of the laser used to probe these lines to be 5 MHz. So, naively, the kernel density estimate of these lines with a bandwidth of 5 should give me the continuous distribution that would be produced in an experiment using the aforementioned laser.
The following code:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5)
plt.show()
yields
Which looks like a Gaussian with bandwidth much larger than 5 MHz.
I'm guessing that for some reason, the bandwidth of the kdeplot has different units than the plot itself. The separation between the highest and lowest line is ~170.0 MHz. Supposing that I need to rescale the bandwidth by this factor:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5/(np.max(lineslist)-np.min(lineslist)))
plt.show()
I get:
With lines that seem to have the expected 5 MHz bandwidth.
As dandy as that solution is, I've pulled it from my arse, and I'm curious whether someone more familiar with seaborn's kdeplot internals can comment on why this is.
Thanks,
Samuel

One thing to note is that Seaborn doesn't actually handle the bandwidth itself - it passes the setting on more-or-less as-is to either SciPy or the Statsmodels packages, depending on what you have installed. (It prefers Statsmodels, but will fall back to SciPy.)
The documentation for this parameter in the various sub-packages is a little confusing, but from what I can tell, the key issue here is that the setting for SciPy is a bandwidth factor, rather than a bandwidth itself. That is, this factor is (effectively) multiplied by the standard deviation of the data you're plotting to give you the actual bandwidth used in the kernels.
So with SciPy, if you have a fixed number which you want to use as your bandwidth, you need to divide through by your data standard deviation. And if you're trying to plot multiple datasets consistently, you need to adjust for the standard deviation of each dataset. This adjustment is effectively what you did by scaling by the range, but again, it's not the range of the data that's used, but the standard deviation of the data.
To make things all the more confusing, Statsmodels expects the true bandwidth when given a scalar value, rather than a factor that's multiplied by the standard deviation of the sample. So depending on what backend you're using, Seaborn will behave differently. There's no direct way to tell Seaborn which backend to use - the best way to test is probably trying to import statsmodels, and seeing if that succeeds (takes bandwidth directly) or fails (takes bandwidth factor).
By the way, these results were tested against Seaborn version 0.7.0 - I expect (hope?) that versions in the future might change this behavior.
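To sidestep the backend ambiguity, one option is to call scipy.stats.gaussian_kde directly. The sketch below uses the line list from the question and converts the desired 5 MHz bandwidth into SciPy's factor by dividing by the sample standard deviation; ddof=1 matches the unbiased covariance that gaussian_kde computes. The grid limits are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])

target_bw = 5.0                                    # desired kernel standard deviation, in MHz
factor = target_bw / lineslist.std(ddof=1)         # SciPy expects a factor, not a bandwidth

kde = gaussian_kde(lineslist, bw_method=factor)
effective_bw = float(np.sqrt(kde.covariance[0, 0]))  # kernel standard deviation actually used

grid = np.linspace(lineslist.min() - 20, lineslist.max() + 20, 500)
density = kde(grid)                                # values that can be plotted with plt.plot

print(effective_bw)                                # approximately 5.0
```

Note that the bandwidth here is the standard deviation of the Gaussian kernel; if the 5 MHz laser linewidth is a full width at half maximum, it would need converting first.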

Matplotlib slow with large data sets, how to enable decimation?

I use matplotlib for a signal processing application and I noticed that it chokes on large data sets. This is something that I really need to improve to make it a usable application.
What I'm looking for is a way to let matplotlib decimate my data. Is there a setting, property or other simple way to enable that? Any suggestions on how to implement this are welcome.
Some code:
import numpy as np
import matplotlib.pyplot as plt
n=100000 # more than 100000 points makes it unusably slow
plt.plot(np.random.random_sample(n))
plt.show()
Some background information
I used to work on a large C++ application where we needed to plot large datasets and to solve this problem we used to take advantage of the structure of the data as follows:
In most cases, if we want a line plot, the data is ordered and often even equidistant. If it is equidistant, you can calculate the start and end index in the data array directly from the zoom rectangle and the inverse axis transformation. If it is ordered but not equidistant, a binary search can be used.
Next, the zoomed slice is decimated. Because the data is ordered, we can simply iterate over the blocks of points that fall inside one pixel, and for each block calculate the mean, maximum and minimum. Instead of one pixel, we then draw a bar in the plot.
For example: if the x axis is ordered, a vertical line is drawn for each block, possibly with the mean in a different color.
To avoid aliasing, the plot is oversampled by a factor of two.
In the case of a scatter plot, the data can be ordered by sorting, because the plotting sequence is not important.
The nice thing about this simple recipe is that the more you zoom in, the faster it becomes. In my experience, as long as the data fits in memory, the plots stay very responsive. For instance, 20 plots of time-history data with 10 million points should be no problem.
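The binary-search step of this recipe can be sketched in NumPy with np.searchsorted. The function name visible_slice and the sample data are made up for illustration; the view limits would come from the axes in a real application.

```python
import numpy as np

def visible_slice(x_sorted, x_min, x_max):
    """Return the slice of a sorted x array that lies inside [x_min, x_max]."""
    start = np.searchsorted(x_sorted, x_min, side="left")    # first index with x >= x_min
    stop = np.searchsorted(x_sorted, x_max, side="right")    # one past last index with x <= x_max
    return slice(int(start), int(stop))

rng = np.random.default_rng(0)
x = np.sort(rng.random(1_000_000) * 100.0)   # ordered but not equidistant x values

window = visible_slice(x, 25.0, 26.0)        # hypothetical zoom rectangle in x
visible_x = x[window]                        # only these points need to be drawn
```

Each lookup costs O(log n), so narrowing the zoom window shrinks the amount of data to process, which is the "faster as you zoom in" effect described above.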

It seems like you just need to decimate the data before you plot it:
import numpy as np
import matplotlib.pyplot as plt
n = 100000  # more than 100000 points makes it unusably slow
X = np.random.random_sample(n)
i = 10 * np.arange(n // 10)  # every 10th index; same as slicing with X[::10]
plt.plot(X[i])
plt.show()

Decimation is not always the best option: for example, if you decimate sparse data, it might all appear as zeros.
The decimation has to be smart, so that each horizontal LCD pixel is plotted with the min and the max of the data between decimation points. Then, as you zoom in, you see more and more detail.
With zooming, this cannot easily be done outside matplotlib and is thus better handled internally.
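A minimal sketch of such min/max decimation, done outside matplotlib with NumPy. The helper name minmax_decimate and the bin count are assumptions for illustration; in practice the number of bins would be tied to the pixel width of the axes.

```python
import numpy as np

def minmax_decimate(y, n_bins):
    """Return bin centers plus the per-bin minimum and maximum of y."""
    y = np.asarray(y, dtype=float)
    block = len(y) // n_bins                     # samples per bin
    if block == 0:
        raise ValueError("n_bins must not exceed len(y)")
    # Trailing samples that do not fill a whole bin are dropped
    blocks = y[: block * n_bins].reshape(n_bins, block)
    centers = (np.arange(n_bins) + 0.5) * block  # x position (in samples) of each bin
    return centers, blocks.min(axis=1), blocks.max(axis=1)

y = np.zeros(100_000)
y[12_345] = 1.0                                  # sparse data: one isolated spike

naive = y[::100]                                 # plain decimation steps over the spike
centers, lo, hi = minmax_decimate(y, 1000)

print(naive.max())                               # 0.0: the spike is lost
print(hi.max())                                  # 1.0: the spike survives in its bin's maximum
```

The result can be drawn with plt.fill_between(centers, lo, hi). To refresh the envelope while zooming, the decimation could be re-run from a callback registered with ax.callbacks.connect("xlim_changed", ...), though that wiring is not shown here.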
