How can I numerically evaluate the density that `sns.kdeplot` has put in the plot? - python

Out of the box seaborn does a very good job of plotting a 2D KDE or jointplot. However, it does not return anything like a function that I can evaluate to read off the values of the estimated density numerically.
How can I numerically evaluate the density that sns.kdeplot or jointplot has put in the plot?
Just for completeness: I did see something interesting in the SciPy docs, stats.gaussian_kde, but the density plots I get from it are very clunky and, apparently because of a missing extent, badly misaligned with the scatter plot. So I would like to stay away from the SciPy KDE, at least until I figure out how to make it work and why pyplot is so much less clever here than seaborn.
Anyhow, the evaluate method of scipy.stats.gaussian_kde does the job.
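For reference, a minimal sketch of that approach (the data, bandwidth and grid here are made up for illustration and are not seaborn's defaults):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical 2D sample standing in for the plotted data.
rng = np.random.default_rng(42)
x, y = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=500).T

kde = gaussian_kde(np.vstack([x, y]))

# Evaluate the estimated density at arbitrary points...
print(kde.evaluate(np.array([[0.0], [0.0]])))  # density at (0, 0)

# ...or on a grid matching the extent of the scatter plot.
xg, yg = np.meshgrid(np.linspace(x.min(), x.max(), 100),
                     np.linspace(y.min(), y.max(), 100))
z = kde(np.vstack([xg.ravel(), yg.ravel()])).reshape(xg.shape)
```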

I also faced this issue with the jointplot() method. I opened the file distributions.py under anaconda3/lib/python3.7/site-packages/seaborn/ and added these lines to the _bivariate_kdeplot() function:
print("xx=",xx[50])
print("yy=",yy[:,50])
print("z=",z[50])
This prints the 100 values of the xx, yy and z arrays at index 50, where z is the density and xx and yy are the grid coordinates: a meshgrid whose resolution is set by the gridsize and whose extent is adjusted according to the bandwidth, cut and clip values given by the user. This gave me some idea of the actual values behind the 2D KDE plot.
If you print out the entire arrays you will get 100 x 100 values of each.

Related

Pick values from a CDF curve

Hi everyone,
I have a generic distribution of values (graph posted below).
Is there a way to generate a CDF from these values? Using sns I can create a graph:
My goal is to pick a value on the y-axis and read off the corresponding x-axis value from the CDF. I have searched online but can't find a method that doesn't require going through curve normalisation.
I'm not sure of the exact data format, but something like numpy.cumsum will take a numpy array that represents a PDF and turn it into an array that represents the CDF.
From there, with your arrays of p and cdf, it is straightforward to find the p value that corresponds to a given cdf value (which is what I understand you are looking for) by interpolating with "nearest" as the kind of interpolation (see the documentation on scipy.interpolate.interp1d, for example).
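A minimal sketch of that recipe (the array contents here are made up for illustration):

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical binned density: x positions and PDF values.
x = np.linspace(0.0, 10.0, 101)
pdf = np.exp(-0.5 * (x - 5.0) ** 2)  # unnormalised bell curve

cdf = np.cumsum(pdf)
cdf /= cdf[-1]                       # normalise so the CDF ends at 1

# Inverse lookup: given a probability on the y-axis, return the nearest x.
inv_cdf = interp1d(cdf, x, kind="nearest",
                   bounds_error=False, fill_value=(x[0], x[-1]))
print(inv_cdf(0.5))                  # x at which half the mass is reached
```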

Constraining RBF interpolation of 3D surface to keep curvature

I've been tasked to develop an algorithm that, given a set of sparse points representing measurements of an existing surface, would allow us to compute the z coordinate of any point on the surface. The challenge is to find a suitable interpolation method that can recreate the 3D surface given only a few points and extrapolate values also outside of the range containing the initial measurements (a notorious problem for many interpolation methods).
After trying to fit many analytic curves to the points I've decided to use RBF interpolation as I thought this will better reproduce the surface given that the points should all lie on it (I'm assuming the measurements have a negligible error).
The first results are quite impressive considering the few points that I'm using.
Interpolation results
In the picture that I'm showing the blue points are the ones used for the RBF interpolation which produces the shape represented in gray scale. The red points are instead additional measurements of the same shape that I'm trying to reproduce with my interpolation algorithm.
Unfortunately there are some outliers, especially when I'm trying to extrapolate points outside of the area where the initial measurements were taken (you can see this in the upper right and lower center insets in the picture). This is to be expected, especially in RBF methods, as I'm trying to extract information from an area that initially does not have any.
Apparently the RBF interpolation is trying to flatten out the surface, whereas I would need it to continue following the curvature of the shape. Of course the method knows nothing about that, given how it is defined, but it causes a large discrepancy from the measurements that I'm trying to fit.
That's why I'm asking if there is any way to constrain the interpolation method to keep the curvature, or to use a different radial basis function that doesn't smooth out so quickly at the border of the interpolation range. I've tried different combinations of the epsilon parameter and distance functions without luck. This is what I'm using right now:
from scipy import interpolate
import numpy as np

spline = interpolate.Rbf(df.X.values, df.Y.values, df.Z.values,
                         function='thin_plate')
X, Y = np.meshgrid(np.linspace(xmin.round(), xmax.round(), precision),
                   np.linspace(ymin.round(), ymax.round(), precision))
Z = spline(X, Y)
I was also thinking of creating some additional dummy points outside of the interpolation range to constrain the model even more, but that would be quite complicated.
I'm also attaching an animation to give a better idea of the surface.
Animation
Just wanted to post my solution in case someone has the same problem. The issue was indeed with the SciPy implementation of the RBF interpolation, so I instead adopted a more flexible library: https://rbf.readthedocs.io/en/latest/index.html#.
The results are pretty cool! Using the following options
from rbf.interpolate import RBFInterpolant
spline = RBFInterpolant(X_obs, U_obs, phi='phs5', order=1, sigma=0.0, eps=1.)
I was able to get the right shape even at the edge.
Surface interpolation
I've played around with the different phi functions, and here is a boxplot of the spread between the interpolated surface and the points I'm testing the interpolation against (the red points in the picture).
Boxplot
With phs5 I get the best result: an average spread of about 0.5 mm on the upper surface and 0.8 mm on the lower one. Before, I was getting a similar average but with many outliers > 15 mm. Definitely a success :)
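As an aside (my addition, not part of the original solution): recent SciPy versions (1.7+) ship scipy.interpolate.RBFInterpolator, whose 'quintic' kernel is the same r^5 polyharmonic spline family as phs5. A minimal sketch with synthetic stand-ins for X_obs and U_obs:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Synthetic stand-in for the sparse measurements (X_obs: (n, 2), U_obs: (n,)).
rng = np.random.default_rng(0)
X_obs = rng.uniform(-1, 1, size=(50, 2))
U_obs = X_obs[:, 0] ** 2 + X_obs[:, 1] ** 2   # a curved surface

# 'quintic' is the r**5 polyharmonic kernel; SciPy requires an appended
# polynomial of degree >= 2 for it, which also helps extrapolation.
spline = RBFInterpolator(X_obs, U_obs, kernel="quintic", degree=2)

# Evaluate on a grid, including points slightly outside the data range.
xg, yg = np.meshgrid(np.linspace(-1.2, 1.2, 30), np.linspace(-1.2, 1.2, 30))
Z = spline(np.column_stack([xg.ravel(), yg.ravel()])).reshape(xg.shape)
```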

Set confidence intervals in seaborn 2D kdeplot #2

I plot a 2D KDE with seaborn with:
ax = sns.kdeplot(scatter_all["s_zscore"], scatter_all["p_zscore"])
I want the levels of the density estimate to be meaningful, i.e. I want to mark confidence intervals. Basically I would like to obtain something very close to:
this answer, but my data are not normalized and they have to stay that way.
Could someone please explain where, how and why I should change the calculations for the levels? I am looking for a clear statistical explanation, as said in my comment below.
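One standard recipe (a sketch with synthetic data and my own choice of masses, not seaborn's internal calculation): evaluate the KDE on a grid, sort the density values in descending order, and choose as contour levels the density thresholds whose highest-density regions enclose the desired fraction of the total mass:

```python
import numpy as np
from scipy import stats

# Hypothetical 2D sample standing in for s_zscore / p_zscore.
rng = np.random.default_rng(1)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=2000)

kde = stats.gaussian_kde(data.T)
xg, yg = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))
z = kde(np.vstack([xg.ravel(), yg.ravel()]))

# For each target mass, find the density threshold t such that the region
# {density >= t} contains roughly that fraction of the total probability.
z_sorted = np.sort(z)[::-1]
cum = np.cumsum(z_sorted)
cum /= cum[-1]
levels = [z_sorted[np.searchsorted(cum, mass)] for mass in (0.39, 0.86, 0.99)]
# Pass sorted(levels) to plt.contour(xg, yg, z.reshape(xg.shape), levels=...)
```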

Kernel Density Estimation in Python

I have a list of counts ('y' in the code below) that I am using to plot a probability distribution, so note that this is not raw data but frequencies I have already calculated, falling across various bins. A scatter plot, and even a histogram (plotted with the bar function), revealed that it is some manner of bimodal distribution. I wanted to fit a pdf to this, so I first tried a sum of Gaussians, but SciPy's curve-fitting algorithm failed to fit the curve. I then came across kernel density estimation, which from what I have read is the best way to achieve this. However, even after putting together code from a Stack Overflow answer to a similar question and from another website, both of which recommended the gaussian_kde function from scipy.stats, I have so far been unsuccessful. Am I wrong in assuming that I can do this with what I have in the first place? If I am right, what can I do to get it working?
x = np.linspace(x_min, x_max, n_bins)
y = np.array(normed_pdf)
plt.scatter(x, y, s=5, label='Sim Data')
kde = gaussian_kde(y, bw_method=0.1 / y.std(ddof=1))
kde.covariance_factor = lambda: .25
kde._compute_covariance()
plt.plot(x, kde(x), 'r-', label='fit')
plt.grid(True)
plt.legend(prop={'size': 10})
plt.show()
I know that I might as well use R or gnuplot or some other tool to do this, but I want to be able to do it within Python. Call me a stickler for self-contained, consistent code.
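For what it's worth, a sketch of one way this can work (the data here are made up, and this is my addition rather than code from the thread): gaussian_kde expects raw samples rather than pre-binned frequencies, but since SciPy 1.2 the bin positions can be passed as the sample with the counts as weights:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical binned data: bin centres and counts from a bimodal distribution.
x = np.linspace(-5, 5, 50)
counts = np.exp(-(x + 2) ** 2) + np.exp(-(x - 2) ** 2)  # stand-in frequencies

# Treat the bin centres as the sample and the frequencies as weights.
kde = gaussian_kde(x, weights=counts, bw_method=0.25)

grid = np.linspace(-5, 5, 200)
density = kde(grid)  # smooth bimodal estimate, integrates to ~1
```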

How to better fit seaborn violinplots?

The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?
As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.
