I plot a 2D KDE with seaborn with:
ax = sns.kdeplot(scatter_all["s_zscore"], scatter_all["p_zscore"])
I want my levels of the density estimation to be meaningful, i.e. I want to mark confidence intervals. Basically, I would like to obtain something very close to
this answer, but my data are not normalized and they have to stay that way.
Could someone please explain where, how, and why I should change the calculation of the levels? I am looking for a clear statistical explanation, as stated in my comment below.
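To make the question concrete, here is a rough sketch of the kind of level calculation I have in mind, done directly with scipy.stats.gaussian_kde (the 95%/68% masses and the grid size are just examples, and the random DataFrame below stands in for my scatter_all):

import numpy as np
import pandas as pd
from scipy import stats

# stand-in for my DataFrame; in reality scatter_all holds the z-scored columns
scatter_all = pd.DataFrame({"s_zscore": np.random.randn(500),
                            "p_zscore": np.random.randn(500)})

x = scatter_all["s_zscore"].values
y = scatter_all["p_zscore"].values
kde = stats.gaussian_kde(np.vstack([x, y]))

# evaluate the density on a grid covering the data
xx, yy = np.mgrid[x.min():x.max():200j, y.min():y.max():200j]
zz = kde(np.vstack([xx.ravel(), yy.ravel()]))

# density thresholds whose super-level sets enclose roughly 95% and 68% of the mass
zs = np.sort(zz)[::-1]
mass = np.cumsum(zs) / zs.sum()
levels = [zs[np.searchsorted(mass, frac)] for frac in (0.95, 0.68)]
print(levels)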
I am currently learning how to use OPTICS in sklearn. I am inputting a numpy array of shape (205, 22). I am able to get plots out of it, but I do not understand how I am getting a 2D plot out of multiple dimensions and how I am supposed to read it. I more or less understand the reachability plot, but the rest of it makes no sense to me. Can someone please explain what is happening? Is the function just simplifying the data to two dimensions somehow? Thank you
From the sklearn user guide:
The reachability distances generated by OPTICS allow for variable density extraction of clusters within a single data set. As shown in the above plot, combining reachability distances and data set ordering_ produces a reachability plot, where point density is represented on the Y-axis, and points are ordered such that nearby points are adjacent. ‘Cutting’ the reachability plot at a single value produces DBSCAN like results; all points above the ‘cut’ are classified as noise, and each time that there is a break when reading from left to right signifies a new cluster.
The other three plots are visual representations of the actual clusters found by three different algorithms.
As you can see in the OPTICS clustering plot, there are two high-density clusters (blue and cyan); according to the reachability plot, the gray crosses are classified as noise because of the low xi value.
In the DBSCAN clustering with eps = 0.5, everything is considered noise, since the epsilon value is too low and the algorithm cannot find any dense regions.
In the third plot it is clear that the algorithm found just a single cluster because of the adjusted epsilon value, and everything above the 2.0 line is considered noise.
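If it helps, here is a small sketch of how that reachability "cut" can be reproduced in code (the array shape and the parameter values below are only placeholders for your setup):

import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

X = np.random.rand(205, 22)               # placeholder for your (205, 22) array

clust = OPTICS(min_samples=10, xi=0.05)   # example parameters
clust.fit(X)

# ingredients of the reachability plot: points in cluster order vs. reachability
reachability = clust.reachability_[clust.ordering_]
labels_xi = clust.labels_[clust.ordering_]

# "cutting" the reachability plot at eps = 2.0 gives DBSCAN-like labels,
# as in the third plot; points above the cut get the label -1 (noise)
labels_cut = cluster_optics_dbscan(reachability=clust.reachability_,
                                   core_distances=clust.core_distances_,
                                   ordering=clust.ordering_,
                                   eps=2.0)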
For more details, please refer to the sklearn user guide.
Out of the box, seaborn does a very good job of plotting a 2D KDE or jointplot. However, it does not return anything like a function that I can evaluate to read off the values of the estimated density numerically.
How can I numerically evaluate the density that sns.kdeplot or jointplot has put in the plot?
Just for completeness: I see something interesting in the scipy docs, stats.gaussian_kde, but the density plots I get with it are very clunky, and for some reason (apparently because of a missing extent) they are really off compared to the scatter plot. So I would like to stay away from the scipy KDE, at least until I figure out how to make it work and why pyplot is so much less "smart" here than seaborn is.
Anyhow, the evaluate method of scipy.stats.gaussian_kde does its job.
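For the record, a minimal sketch of how I use it (x and y stand in for the two data columns that go into the seaborn plot, and the 100 x 100 grid is arbitrary):

import numpy as np
from scipy import stats

x, y = np.random.randn(2, 500)       # stand-ins for the two data columns

kde = stats.gaussian_kde(np.vstack([x, y]))

# evaluate the density on an explicit grid (so the extent is under my control)
xx, yy = np.meshgrid(np.linspace(x.min(), x.max(), 100),
                     np.linspace(y.min(), y.max(), 100))
zz = kde.evaluate(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

# density at an arbitrary point
print(kde.evaluate(np.array([[0.0], [0.0]])))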
I also faced this issue with the jointplot() method. I opened the file distributions.py at the path anaconda3/lib/python3.7/site-packages/seaborn/ and added these lines to the _bivariate_kdeplot() function:
print("xx=",xx[50])
print("yy=",yy[:,50])
print("z=",z[50])
This prints out 100 values from the xx, yy, and z arrays at index 50, where "z" is the density and "xx" and "yy" are the grid coordinates, adjusted according to the bandwidth, cut, and clip given by the user and laid out as a meshgrid whose resolution follows the grid size. This gave me some idea of the actual values behind the 2D KDE plot.
If you print out the entire array of each variable, you get 100 x 100 values for each.
I created a graph in MATLAB (see figure below) in which a data distribution is plotted around every data point (the grey area plots). The way I did it in MATLAB was to create a separate set of axes for every distribution curve, one at every point of the data curve, and then plot the curves without showing those axes. I also used the 'linkaxes' command to set the axis limits for all the curves at once.
I must say that this is far from an elegant solution, and I had a lot of trouble saving this figure with the correct aspect ratio settings. All in all, I couldn't find any other useful option in MATLAB.
Is there a more elegant solution for this type of graph in Python? I am not so much interested in how to highlight the areas as in how to place a set of curves (distributions) exactly at the positions of the main data curve's points.
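To make it concrete, something along these lines is roughly what I am after in Python: a rough sketch using matplotlib's Axes.inset_axes placed in data coordinates (the data values and the Gaussian bumps below are only placeholders for my curve and distributions):

import numpy as np
import matplotlib.pyplot as plt

# main data curve (placeholder data)
x = np.arange(1, 6)
y = np.array([2.0, 3.5, 3.0, 4.2, 3.8])

fig, ax = plt.subplots()
ax.plot(x, y, "o-")
ax.set_xlim(0, 6)
ax.set_ylim(0, 6)

# one small frame-less axes per data point, positioned in data coordinates
t = np.linspace(-3, 3, 100)
for xi, yi in zip(x, y):
    ins = ax.inset_axes([xi - 0.4, yi - 0.4, 0.8, 0.8], transform=ax.transData)
    ins.fill_between(t, np.exp(-t**2 / 2), color="0.8")   # placeholder distribution
    ins.set_axis_off()

plt.show()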
Thank you!
I'm getting into seaborn for Python, and I have a quick question that I was not able to find an answer to. If I add jitter to a plot, does it actually change the fit values (such as r^2, p-value, etc.), or is it just cosmetic for the plot's look?
Compare, for example, sns.lmplot("size", "tip", tips, x_jitter=.15) with sns.lmplot("size", "tip", tips) at https://web.stanford.edu/~mwaskom/software/seaborn/tutorial/quantitative_linear_models.html
No, the regression is estimated on the original data; the jitter is applied to a copy of the data that is used to draw the scatterplot.
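A quick way to convince yourself is to draw the fit with and without jitter side by side; just a sketch with regplot, which takes the same x_jitter argument. The fitted line and confidence band are identical in both panels; only the drawn points move.

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True)
sns.regplot(x="size", y="tip", data=tips, ax=axes[0])                 # no jitter
sns.regplot(x="size", y="tip", data=tips, x_jitter=.15, ax=axes[1])   # jittered points, same fit
plt.show()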
The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?
As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.
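Putting the two options together, here is a minimal self-contained sketch using the bw and cut arguments as described above (argument names may be spelled differently in other seaborn versions):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.rand(100)

fig, axes = plt.subplots(1, 3, sharey=True)
sns.violinplot(y=data, ax=axes[0])           # defaults: scaled bandwidth, cut=2
sns.violinplot(y=data, bw=.1, ax=axes[1])    # narrower kernel: bandwidth = bw * data.std()
sns.violinplot(y=data, cut=0, ax=axes[2])    # truncate the violin at the data extremes
plt.show()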