Confusion with bandwidth on seaborn's kdeplot

Confusion with bandwidth on seaborn's kdeplot - python

lineslist, below, represents a set of lines (for some chemical spectrum, let's say), in MHz. I know the linewidth of the laser used to probe these lines to be 5 MHz. So, naively, the kernel density estimate of these lines with a bandwidth of 5 should give me the continuous distribution that would be produced in an experiment using the aforementioned laser.
The following code:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5)
plt.show()
yields
Which looks like a Gaussian with bandwidth much larger than 5 MHz.
I'm guessing that for some reason, the bandwidth of the kdeplot has different units than the plot itself. The separation between the highest and lowest line is ~170.0 MHz. Supposing that I need to rescale the bandwidth by this factor:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5/(np.max(lineslist)-np.min(lineslist)))
plt.show()
I get:
With lines that seem to have the expected 5 MHz bandwidth.
As dandy as that solution is, I've pulled it from my arse, and I'm curious whether someone more familiar with seaborn's kdeplot internals can comment on why this is.
Thanks,
Samuel

One thing to note is that Seaborn doesn't actually handle the bandwidth itself - it passes the setting on more-or-less as-is to either SciPy or the Statsmodels packages, depending on what you have installed. (It prefers Statsmodels, but will fall back to SciPy.)
The documentation for this parameter in the various sub-packages is a little confusing, but from what I can tell, the key issue here is that the setting for SciPy is a bandwidth factor, rather than a bandwidth itself. That is, this factor is (effectively) multiplied by the standard deviation of the data you're plotting to give you the actual bandwidth used in the kernels.
So with SciPy, if you have a fixed number which you want to use as your bandwidth, you need to divide through by your data standard deviation. And if you're trying to plot multiple datasets consistently, you need to adjust for the standard deviation of each dataset. This adjustment effectively what you did by scaling by the range -- but again, it's not the range of the data that's the number used, but the standard deviation of the data.
To make things all the more confusing, Statsmodels expects the true bandwidth when given a scalar value, rather than a factor that's multiplied by the standard deviation of the sample. So depending on what backend you're using, Seaborn will behave differently. There's no direct way to tell Seaborn which backend to use - the best way to test is probably trying to import statsmodels, and seeing if that succeeds (takes bandwidth directly) or fails (takes bandwidth factor).
By the way, these results were tested against Seaborn version 0.7.0 - I expect (hope?) that versions in the future might change this behavior.

Related

Amplitude units in FFT

I'm completely new to python, scipy, matplotlib and programming in general.
I'm using the following code, which I came across online, to apply FFT to .wav files:
import scipy.io.wavfile as wavfile
import scipy
import scipy.fftpack as fftpk
import numpy as np
from matplotlib import pyplot as plt
s_rate, signal = wavfile.read("file.wav")
FFT = abs(scipy.fft.fft(signal))
freqs = fftpk.fftfreq(len(FFT), (1.0/s_rate))
plt.plot(freqs[range(len(FFT)//2)], FFT[range(len(FFT)//2)])
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude')
plt.show()
The resulting graphs give amplitude values that range from 0 to a few thousands, depending on the files, and I have no idea what unit these are in. I'm guessing they might be relative amplitudes, and I was wondering if there is a way to turn that into decibels, as I need specific values.
Thank you
Tanguy

They are amplitudes relative to the quantization units used for the samples in your input signal. So, without calibrating your input signal against a known level of source input (to get Volts per 1 bit change, etc.), the actual units are unknown. If calibrated, you may still need to divide the magnitudes of the FFT output by N (the FFT length), depending on your particular FFT implementation.
To get Decibels, convert by taking 20*log10(abs(...)) of the FFT results, and offset by your 0 dB calibration level.

Prevent matplotlib Function psd() from plotting - or faster alternative

I'm using the matplotlib function psd to generate the power spectral density of a bunch of radio signals I'm receiving. All I want are the returned values but the function plots the whole spectrum not matter what. Is there a way to prevent it from plotting? Is there another function that could do this without the plot? I'm trying to run this as rapidly as possible so anything to speed it up (aka preventing the plot entirely) would be very useful.
The code is pretty straightforward but I'm not sure how to suppress this plotting and ideally prevent it from doing it entirely because I want this code to run as fast as possible:
from pylab import *
power, psd_frequencies = psd(radio_samples, NFFT=256, Fs=samples_rate, Fc=center_frequency)
Alternatives to running psd() that would be faster are very welcome too.

To reproduce exactly what matplotlib plots in the psd plot, you may use its own method:
from matplotlib.mlab import psd
power, psd_frequencies = psd(radio_samples, NFFT=256, Fs=samples_rate)
psd_frequencies += center_frequency
This gives you the data, but without the plot.

What are ways to speed up seaborns pairplot

I have a dataframe with 250.000 rows but 140 columns and I'm trying to construct a pair plot. of the variables.
I know the number of subplots is huge, as well as the time it takes to do the plots. (I'm waiting for more than an hour on an i5 with 3,4 GHZ and 32 GB RAM).
Remebering that scikit learn allows to construct random forests in parallel, I was checking if this was possible also with seaborn.
However, I didn't find anything. The source code seems to call the matplotlib plot function for every single image.
Couldn't this be parallelised? If yes, what is a good way to start from here?

Rather than parallelizing, you could downsample your DataFrame to say, 1000 rows to get a quick peek, if the speed bottleneck is indeed occurring there. 1000 points is enough to get a general idea of what's going on, usually.
i.e. sns.pairplot(df.sample(1000)).

Save your pairplot to image and then show this image instead of rendering it all in your browser.
from IPython.display import Image
import seaborn as sns
import matplotlib.pyplot as plt
sns_plot = sns.pairplot(df, size=2.0)
sns_plot.savefig("pairplot.png")
plt.clf() # Clean parirplot figure from sns
Image(filename='pairplot.png') # Show pairplot as image

For me, I had a situation where the histograms were taking a very long time due to the variance in the data. I only had 1200 rows and 4 columns, but it took half an hour before I gave up. I think it was so spread out and unordered that the histogram was constantly updating. One workaround might be to play with the bin parameter, but my solution was to use a KDE for the diagonal instead. With the KDE, it takes only a few seconds.
sns.pairplot(df, diag_kind='kde')

How to better fit seaborn violinplots?

The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?

As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.

Matplotlib slow with large data sets, how to enable decimation?

I use matplotlib for a signal processing application and I noticed that it chokes on large data sets. This is something that I really need to improve to make it a usable application.
What I'm looking for is a way to let matplotlib decimate my data. Is there a setting, property or other simple way to enable that? Any suggestion of how to implement this are welcome.
Some code:
import numpy as np
import matplotlib.pyplot as plt
n=100000 # more then 100000 points makes it unusable slow
plt.plot(np.random.random_sample(n))
plt.show()
Some background information
I used to work on a large C++ application where we needed to plot large datasets and to solve this problem we used to take advantage of the structure of the data as follows:
In most cases, if we want a line plot then the data is ordered and often even equidistantial. If it is equidistantial, then you can calculate the start and end index in the data array directly from the zoom rectangle and the inverse axis transformation. If it is ordered but not equidistantial a binary search can be used.
Next the zoomed slice is decimated, and because the data is ordered we can simply iterate a block of points that fall inside one pixel. And for each block the mean, maximum and minimum is calculated. Instead of one pixel, we then draw a bar in the plot.
For example: if the x axis is ordered, a vertical line will be drawn for each block, possibly the mean with a different color.
To avoid aliasing the plot is oversampled with a factor of two.
In case it is a scatter plot, the data can be made ordered by sorting, because the sequence of plotting is not important.
The nice thing of this simple recipe is that the more you zoom in the faster it becomes. In my experience, as long as the data fits in memory the plots stays very responsive. For instance, 20 plots of timehistory data with 10 million points should be no problem.

It seems like you just need to decimate the data before you plot it
import numpy as np
import matplotlib.pyplot as plt
n=100000 # more then 100000 points makes it unusable slow
X=np.random.random_sample(n)
i=10*array(range(n/10))
plt.plot(X[i])
plt.show()

Decimation is not best for example if you decimate sparse data it might all appear as zeros.
The decimation has to be smart such that each LCD horizontal pixel is plotted with the min and the max of the data between decimation points. Then as you zoom in you see more an more detail.
With zooming this can not be done easy outside matplotlib and thus is better to handle internally.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.