I am using gaussian_kde from scipy.stats to fit a joint PDF to multivariate data on, let's say, X and Y.
Now I want to resample from this PDF conditionally on a value of X. For example, once my X=x, generate Y from its conditional distribution.
Let's use the example from the documentation here. kernel.resample(1) would generate a pair of (X,Y) over all of the distribution. How could I generate Y once X is, for example, 0?
An approach could be to create a custom continuous distribution from a pdf.
The pdf can be created from the kernel function. As the pdf needs an area of 1, the kernel limited to a given x0 should be scaled by the area.
The custom distribution seems to be quite slow, though. A faster solution could be to create a histogram from ys = np.linspace(-10, 10, 1000); kernel(np.vstack([np.full_like(ys, x0), ys])) and use rv_histogram. Faster still (but much less random, because only the discrete grid values can be drawn) would be to use np.random.choice(..., p=...) with p calculated from the constrained kernel.
The following code starts from an adaptation of the linked example code of a 2D kde.
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np
def measure(n):
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1 + m2, m1 - m2 ** 2
m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
x0 = 0.678
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))
ax1.imshow(np.rot90(Z), cmap=plt.cm.magma_r, alpha=0.4, extent=[xmin, xmax, ymin, ymax])
ax1.plot(m1, m2, 'k.', markersize=2)
ax1.axvline(x0, color='dodgerblue', ls=':')
ax1.set_xlim([xmin, xmax])
ax1.set_ylim([ymin, ymax])
# create a distribution given the kernel function limited to x=x0
class Special_distrib(stats.rv_continuous):
    def _pdf(self, y, x0, area_x0):
        return kernel(np.vstack([np.full_like(y, x0), y])) / area_x0
ys = np.linspace(-10, 10, 1000)
area_x0 = np.trapz(kernel(np.vstack([np.full_like(ys, x0), ys])), ys)
special_distr = Special_distrib(name="special")
vals = special_distr.rvs(x0, area_x0, size=500)
ax2.hist(vals, bins=20, color='dodgerblue')
plt.show()
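The faster histogram-based and np.random.choice-based alternatives mentioned above could look roughly like the following. This is only a sketch: the seeded generator, the 1000 bins on [-10, 10] and the variable names are assumptions made for illustration, and the kernel is evaluated at bin centres rather than at the grid points themselves.

```python
import numpy as np
from scipy import stats

# Same kind of data as in the example above, but with a seeded generator
rng = np.random.default_rng(0)
m1 = rng.normal(size=2000)
m2 = rng.normal(scale=0.5, size=2000)
kernel = stats.gaussian_kde(np.vstack([m1 + m2, m1 - m2 ** 2]))

x0 = 0.678

# Slice the joint density along x = x0, evaluated at the centres of 1000 bins in y
edges = np.linspace(-10, 10, 1001)
centers = 0.5 * (edges[:-1] + edges[1:])
dens = kernel(np.vstack([np.full_like(centers, x0), centers]))

# Option 1: rv_histogram normalizes the bin heights itself and
# samples uniformly within each bin, so the draws are continuous
cond = stats.rv_histogram((dens, edges))
ys_hist = cond.rvs(size=500, random_state=rng)

# Option 2: draw bin centres directly; only the 1000 grid values can occur
p = dens / dens.sum()
ys_choice = rng.choice(centers, size=500, p=p)
```

Both variants only need one evaluation of the kernel along the slice, which is why they avoid the repeated pdf evaluations that make the rv_continuous subclass slow.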
I have some 2D data that I am smoothing using:
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
but what if my data isn't Gaussian/tophat/the other options? Mine looks more elliptical before smoothing, so should I really have a different bandwidth in x and then y? The variance in one direction is a lot higher, and also the values of the x axis are higher, so it feels like a simple Gaussian might miss something?
This is what I get with your defined X and Y. Seems good. Were you expecting something different?
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def generate(n):
    # generate data
    np.random.seed(42)
    x = np.random.normal(size=n, loc=1, scale=0.01)
    np.random.seed(1)
    y = np.random.normal(size=n, loc=200, scale=100)
    return x, y
x, y = generate(100)
xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([x, y])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
fig, ax = plt.subplots(figsize=(7, 7))
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax],
aspect='auto', alpha=.75
)
ax.plot(x, y, 'ko', ms=5)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
plt.show()
The distributions of x and y are Gaussian.
You can verify with seaborn too
import pandas as pd
import seaborn as sns
# I pass a DataFrame because passing
# (x,y) alone will be soon deprecated
g = sns.jointplot(data=pd.DataFrame({'x':x, 'y':y}), x='x', y='y')
g.plot_joint(sns.kdeplot, color="r", zorder=0, levels=6)
update
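As a simplified picture, a Kernel Density Estimate of 2-dimensional data can be thought of as being done separately along each axis and then joined together. (Strictly speaking, scipy's gaussian_kde scales its kernel by the covariance matrix of the data, which is why a different spread in x and y is handled automatically.)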
Let's make an example with the dataset we already used.
As we can see in the seaborn jointplot, you have not only the estimated 2d-kde but also marginal distributions of x and y (the histograms).
So, step by step, let's estimate the density of x and y and then evaluate the density over a linear space
kde_x = stats.gaussian_kde(x)
kde_x_space = np.linspace(x.min(), x.max(), 100)
kde_x_eval = kde_x.evaluate(kde_x_space)
kde_x_eval /= kde_x_eval.sum()
kde_y = stats.gaussian_kde(y)
kde_y_space = np.linspace(y.min(), y.max(), 100)
kde_y_eval = kde_y.evaluate(kde_y_space)
kde_y_eval /= kde_y_eval.sum()
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].plot(kde_x_space, kde_x_eval, 'k.')
ax[0].set(title='KDE of x')
ax[1].plot(kde_y_space, kde_y_eval, 'k.')
ax[1].set(title='KDE of y')
plt.show()
So we now have the marginal distributions of x and y. These are probability density functions, so if x and y are independent (as they are in this generated dataset), the joint probability of x and y can be seen as the intersection of independent events. We can therefore multiply the estimated probability densities of x and y in a 2D matrix and plot the result in a 3D projection
# Grid of x and y
X, Y = np.meshgrid(kde_x_space, kde_y_space)
# Grid of probability density
kX, kY = np.meshgrid(kde_x_eval, kde_y_eval)
# Intersection
Z = kX * kY
fig, ax = plt.subplots(
    2, 2,
    subplot_kw={"projection": "3d"},
    figsize=(10, 10))
for i, (elev, azim, title) in enumerate(zip([10, 10, 25, 25],
                                            [0, -90, 25, -25],
                                            ['y axis', 'x axis', 'view 1', 'view 2'])):
    # Plot the surface.
    surf = ax.flat[i].plot_surface(X, Y, Z, cmap=plt.cm.gist_earth_r,
                                   linewidth=0, antialiased=False, alpha=.75)
    ax.flat[i].scatter(x, y, zs=0, zdir='z', c='k')
    ax.flat[i].set(
        xlabel='x', ylabel='y',
        title=title
    )
    ax.flat[i].view_init(elev=elev, azim=azim)
plt.show()
This is a very simple and naive method, meant only to give an idea of how it works and why the x and y scales don't matter for a 2D KDE.
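The claim that the axis scales don't matter can also be checked numerically. The sketch below assumes made-up data and scale factors (0.01 and 100, mirroring the spreads used above): it fits gaussian_kde once on unit-scale data and once on rescaled data, then compares the two densities after accounting for the change of variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Unit-scale data, and the same data stretched very differently per axis
data = rng.normal(size=(2, 200))
scale = np.array([[0.01], [100.0]])   # hypothetical per-axis scale factors

kde_raw = stats.gaussian_kde(data)
kde_scaled = stats.gaussian_kde(data * scale)

# Evaluate both estimates at corresponding points
pts = rng.normal(size=(2, 50))
dens_raw = kde_raw(pts)
# A density shrinks by the product of the scale factors when the axes are stretched
dens_scaled = kde_scaled(pts * scale) * scale.prod()

scale_invariant = np.allclose(dens_raw, dens_scaled)
print(scale_invariant)
```

If the two estimates agree, rescaling x or y changes nothing about the shape of the estimate, because the kernel bandwidth is derived from the covariance of the data rather than from one fixed width shared by both axes.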
I would like to plot multiple subplots containing histograms. Additionally, I would like to plot a curve showing the normal distribution for each subplot. While I found different answers on this forum on how to plot a normal curve over a single plot (histogram), I am struggling to achieve the same with subplots. I have tried the following:
from scipy import stats
import numpy as np
import matplotlib.pylab as plt
fig, ((ax1, ax2)) = plt.subplots(1,2,figsize=(10,4))
# create some normal random noisy data
data1 = 50*np.random.rand() * np.random.normal(10, 10, 100) + 20
data2= 50*np.random.rand() * np.random.normal(10, 10, 100) + 50
# plot normed histogram
ax1.hist(data1, density=True)
# find minimum and maximum of xticks,
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data1))
# lets try the normal distribution first
m1, s1 = stats.norm.fit(data1) # get mean and standard deviation
pdf_1 = stats.norm.pdf(lnspc, m1, s1) # now get theoretical values in our interval
ax1.plot(lnspc, pdf_1, label="Norm") # plot it
# plot second hist
ax2.hist(data2, density=True)
# find minimum and maximum of xticks
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data2))
# lets try the normal distribution first
m2, s2 = stats.norm.fit(data2) # get mean and standard deviation
pdf_2 = stats.norm.pdf(lnspc, m2, s2) # now get theoretical values in our interval
ax2.plot(lnspc, pdf_2, label="Norm") # plot it
plt.show()
Now my problem is that the normal curve is always optimal for the second plot but not the first. This is because of xmin and xmax; however, I don't know how to apply these two commands individually to each subplot. Does anyone have any experience with this? I have been trying all afternoon.
Any help is highly appreciated, thanks in advance!
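You can use axes instead of a tuple. plt.xticks() always acts on the current axes, which after plt.subplots is the last one created, so the limits for the first plot were being read from the second subplot. You can make each axis current in turn using plt.sca before querying its ticks. See below if that's what you needed.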
from scipy import stats
import numpy as np
import matplotlib.pylab as plt
# fig, ((ax1, ax2)) = plt.subplots(1,2,figsize=(10,4)) << INSTEAD OF THIS DO:
fig, axes = plt.subplots(nrows = 1, ncols = 2,figsize=(10,4))
# create some normal random noisy data
data1 = 50*np.random.rand() * np.random.normal(10, 10, 100) + 20
data2= 50*np.random.rand() * np.random.normal(10, 10, 100) + 50
plt.sca(axes[0]) #Refer to the first axis
# plot normed histogram
axes[0].hist(data1, density=True)
# find minimum and maximum of xticks,
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data1))
# lets try the normal distribution first
m1, s1 = stats.norm.fit(data1) # get mean and standard deviation
pdf_1 = stats.norm.pdf(lnspc, m1, s1) # now get theoretical values in our interval
axes[0].plot(lnspc, pdf_1, label="Norm") # plot it
plt.sca(axes[1]) #Refer to the second axis
# plot second hist
axes[1].hist(data2, density=True)
# find minimum and maximum of xticks
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data2))
# lets try the normal distribution first
m2, s2 = stats.norm.fit(data2) # get mean and standard deviation
pdf_2 = stats.norm.pdf(lnspc, m2, s2) # now get theoretical values in our interval
axes[1].plot(lnspc, pdf_2, label="Norm") # plot it
plt.show()
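An alternative that avoids the notion of a current axes altogether is to ask each Axes object for its own limits. The following is a sketch rather than a drop-in replacement: the helper name hist_with_normal and the seeded generator are assumptions, and it uses the x-limits of the axes instead of the tick positions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def hist_with_normal(ax, data, n_points=200):
    """Draw a density histogram of `data` on `ax` and overlay a fitted normal pdf."""
    ax.hist(data, density=True)
    xmin, xmax = ax.get_xlim()              # limits of this axes only
    grid = np.linspace(xmin, xmax, n_points)
    mu, sigma = stats.norm.fit(data)        # mean and standard deviation
    ax.plot(grid, stats.norm.pdf(grid, mu, sigma), label="Norm")
    return mu, sigma

# create some normal random noisy data
rng = np.random.default_rng(0)
data1 = 50 * rng.random() * rng.normal(10, 10, 100) + 20
data2 = 50 * rng.random() * rng.normal(10, 10, 100) + 50

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
fits = [hist_with_normal(ax, d) for ax, d in zip(axes, (data1, data2))]
```

Calling plt.show() afterwards displays the figure. Because every call goes through the ax argument, the same helper works for any number of subplots.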
Would somebody be able to explain to me how to use the location parameter with the gamma.fit function in Scipy?
It seems to me that a location parameter (μ) changes the support of the distribution from x ≥ 0 to y = ( x - μ ) ≥ 0. If μ is positive then aren't we losing all the data which doesn't satisfy x - μ ≥ 0?
Thanks!
The fit function takes all of the data into consideration when finding a fit. Adding noise to your data will alter the fit parameters and can give a distribution that does not represent the data very well. So we have to be a bit clever when we are using fit.
Below is some code that generates data, y1, with loc=2 and scale=1 using numpy. It also adds noise to the data over the range 0 to 10 to create y2. Fitting y1 yields excellent results, but attempting to fit the noisy y2 is problematic. The noise we added smears out the distribution. However, we can also hold one or more parameters constant when fitting the data. In this case we pass floc=2 to the fit, which forces the location to be held at 2 when performing the fit, returning much better results.
from scipy.stats import gamma
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0,10,.1)
y1 = np.random.gamma(shape=1, scale=1, size=1000) + 2 # sets loc = 2
y2 = np.hstack((y1, 10*np.random.rand(100))) # add noise from 0 to 10
# fit the distributions, get the PDF distribution using the parameters
shape1, loc1, scale1 = gamma.fit(y1)
g1 = gamma.pdf(x=x, a=shape1, loc=loc1, scale=scale1)
shape2, loc2, scale2 = gamma.fit(y2)
g2 = gamma.pdf(x=x, a=shape2, loc=loc2, scale=scale2)
# again fit the distribution, but force loc=2
shape3, loc3, scale3 = gamma.fit(y2, floc=2)
g3 = gamma.pdf(x=x, a=shape3, loc=loc3, scale=scale3)
And make some plots...
# plot the distributions and fits. too lazy to do iteration today
fig, axes = plt.subplots(1, 3, figsize=(13,4))
ax = axes[0]
# `normed` was removed from newer matplotlib versions; `density` replaces it
ax.hist(y1, bins=40, density=True)
ax.plot(x, g1, 'r-', linewidth=6, alpha=.6)
# the annotation text is passed positionally, which works across matplotlib versions
ax.annotate('shape = %.3f\nloc = %.3f\nscale = %.3f' %(shape1, loc1, scale1), xy=(6,.2))
ax.set_title('gamma fit')
ax = axes[1]
ax.hist(y2, bins=40, density=True)
ax.plot(x, g2, 'r-', linewidth=6, alpha=.6)
ax.annotate('shape = %.3f\nloc = %.3f\nscale = %.3f' %(shape2, loc2, scale2), xy=(6,.2))
ax.set_title('gamma fit with noise')
ax = axes[2]
ax.hist(y2, bins=40, density=True)
ax.plot(x, g3, 'r-', linewidth=6, alpha=.6)
ax.annotate('shape = %.3f\nloc = %.3f\nscale = %.3f' %(shape3, loc3, scale3), xy=(6,.2))
ax.set_title('gamma fit w/ noise, location forced')
plt.show()
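To come back to the original question about the support: a point below loc simply has zero density under the shifted distribution. The sketch below illustrates that with made-up data and also shows floc pinning the location. Dropping the points at or below loc before fitting is an assumption made here, not part of the code above; depending on the SciPy version, passing such points together with floc may raise an error instead of returning a fit.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
y = rng.gamma(shape=1.0, scale=1.0, size=1000) + 2.0   # true loc = 2

# The shifted distribution assigns zero density to anything below loc
below = gamma.pdf(np.array([0.5, 1.5]), a=1.0, loc=2.0, scale=1.0)

# floc pins the location; only shape and scale are estimated
shape, loc, scale = gamma.fit(y, floc=2.0)

# Add uniform noise on [0, 10), then keep only points inside the support
noisy = np.hstack((y, 10 * rng.random(100)))
kept = noisy[noisy > 2.0]
shape_n, loc_n, scale_n = gamma.fit(kept, floc=2.0)

print(below, (shape, loc, scale), (shape_n, loc_n, scale_n))
```

So a positive loc does not silently discard data. Points outside the support are either impossible under the fitted model or have to be excluded explicitly.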
I have a large dataset of (x,y,z) protein positions and would like to plot areas of high occupancy as a heatmap. Ideally the output should look similar to the volumetric visualisation below, but I'm not sure how to achieve this with matplotlib.
My initial idea was to display my positions as a 3D scatter plot and color their density via a KDE. I coded this up as follows with test data:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
mu, sigma = 0, 0.1
x = np.random.normal(mu, sigma, 1000)
y = np.random.normal(mu, sigma, 1000)
z = np.random.normal(mu, sigma, 1000)
xyz = np.vstack([x,y,z])
density = stats.gaussian_kde(xyz)(xyz)
idx = density.argsort()
x, y, z, density = x[idx], y[idx], z[idx], density[idx]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, c=density)
plt.show()
This works well! However, my real data contains many thousands of data points and calculating the kde and the scatter plot becomes extremely slow.
A small sample of my real data:
My research would suggest that a better option is to evaluate the gaussian kde on a grid. I'm just not sure how to do this in 3D:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
mu, sigma = 0, 0.1
x = np.random.normal(mu, sigma, 1000)
y = np.random.normal(mu, sigma, 1000)
nbins = 50
xy = np.vstack([x,y])
density = stats.gaussian_kde(xy)
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
di = density(np.vstack([xi.flatten(), yi.flatten()]))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.pcolormesh(xi, yi, di.reshape(xi.shape))
plt.show()
Thanks to mwaskon for suggesting the mayavi library.
I recreated the density scatter plot in mayavi as follows:
import numpy as np
from scipy import stats
from mayavi import mlab
mu, sigma = 0, 0.1
x = 10*np.random.normal(mu, sigma, 5000)
y = 10*np.random.normal(mu, sigma, 5000)
z = 10*np.random.normal(mu, sigma, 5000)
xyz = np.vstack([x,y,z])
kde = stats.gaussian_kde(xyz)
density = kde(xyz)
# Plot scatter with mayavi
figure = mlab.figure('DensityPlot')
pts = mlab.points3d(x, y, z, density, scale_mode='none', scale_factor=0.07)
mlab.axes()
mlab.show()
Setting the scale_mode to 'none' prevents glyphs from being scaled in proportion to the density vector. In addition for large datasets, I disabled scene rendering and used a mask to reduce the number of points.
# Plot scatter with mayavi
figure = mlab.figure('DensityPlot')
figure.scene.disable_render = True
pts = mlab.points3d(x, y, z, density, scale_mode='none', scale_factor=0.07)
mask = pts.glyph.mask_points
mask.maximum_number_of_points = x.size
mask.on_ratio = 1
pts.glyph.mask_input_points = True
figure.scene.disable_render = False
mlab.axes()
mlab.show()
Next, to evaluate the gaussian kde on a grid:
import numpy as np
from scipy import stats
from mayavi import mlab
mu, sigma = 0, 0.1
x = 10*np.random.normal(mu, sigma, 5000)
y = 10*np.random.normal(mu, sigma, 5000)
z = 10*np.random.normal(mu, sigma, 5000)
xyz = np.vstack([x,y,z])
kde = stats.gaussian_kde(xyz)
# Evaluate kde on a grid
xmin, ymin, zmin = x.min(), y.min(), z.min()
xmax, ymax, zmax = x.max(), y.max(), z.max()
xi, yi, zi = np.mgrid[xmin:xmax:30j, ymin:ymax:30j, zmin:zmax:30j]
coords = np.vstack([item.ravel() for item in [xi, yi, zi]])
density = kde(coords).reshape(xi.shape)
# Plot volume rendering with mayavi
figure = mlab.figure('DensityPlot')
grid = mlab.pipeline.scalar_field(xi, yi, zi, density)
# dmin/dmax avoid shadowing the built-in min() and max()
dmin = density.min()
dmax = density.max()
mlab.pipeline.volume(grid, vmin=dmin, vmax=dmin + .5*(dmax-dmin))
mlab.axes()
mlab.show()
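Before handing a gridded density to a renderer, it can be worth checking the grid itself without mayavi. The sketch below uses an assumed smaller test set (1000 points, 20 grid points per axis) and approximates the integral of the density over the bounding box. The result should be close to, and typically slightly below, 1, because the box cuts off the tails of the estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
xyz = rng.normal(0, 1, size=(3, 1000))
kde = stats.gaussian_kde(xyz)

# Evaluate the kde on a regular grid spanning the data range
mins, maxs = xyz.min(axis=1), xyz.max(axis=1)
n = 20
xi, yi, zi = np.mgrid[mins[0]:maxs[0]:n*1j,
                      mins[1]:maxs[1]:n*1j,
                      mins[2]:maxs[2]:n*1j]
coords = np.vstack([item.ravel() for item in (xi, yi, zi)])
density = kde(coords).reshape(xi.shape)

# Riemann-sum approximation of the probability mass inside the box
cell_volume = np.prod((maxs - mins) / (n - 1))
mass = density.sum() * cell_volume
print(density.shape, mass)
```

A mass far from 1 usually points to a grid that is too coarse or to axes that got mixed up when reshaping.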
As a final improvement, I sped up the evaluation of the kernel density function by calling the kde function in parallel.
import numpy as np
from scipy import stats
from mayavi import mlab
import multiprocessing
def calc_kde(data):
    return kde(data.T)
mu, sigma = 0, 0.1
x = 10*np.random.normal(mu, sigma, 5000)
y = 10*np.random.normal(mu, sigma, 5000)
z = 10*np.random.normal(mu, sigma, 5000)
xyz = np.vstack([x,y,z])
kde = stats.gaussian_kde(xyz)
# Evaluate kde on a grid
xmin, ymin, zmin = x.min(), y.min(), z.min()
xmax, ymax, zmax = x.max(), y.max(), z.max()
xi, yi, zi = np.mgrid[xmin:xmax:30j, ymin:ymax:30j, zmin:zmax:30j]
coords = np.vstack([item.ravel() for item in [xi, yi, zi]])
# Multiprocessing
cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=cores)
results = pool.map(calc_kde, np.array_split(coords.T, cores))  # one chunk per worker process
density = np.concatenate(results).reshape(xi.shape)
# Plot volume rendering with mayavi
figure = mlab.figure('DensityPlot')
grid = mlab.pipeline.scalar_field(xi, yi, zi, density)
# dmin/dmax avoid shadowing the built-in min() and max()
dmin = density.min()
dmax = density.max()
mlab.pipeline.volume(grid, vmin=dmin, vmax=dmin + .5*(dmax-dmin))
mlab.axes()
mlab.show()
I am using scipy.stats.kde.gaussian_kde() for KDE analysis. It takes a long time to process a large number of points (for 100000 points with a 250x250 grid it takes 5 minutes).
As a faster alternative to gaussian_kde I found the fast_kde function here, written by Joe Kington. (Weighted KDE was also a factor in choosing fast_kde.)
Rather than plotting the result, I extract it to a file in the format (xmin,xmax,ymin,ymax,value) for later use. I am using this technique to extract the results in raw form by using pcolormesh.
Here is the problem statement:
Results produced by the fast_kde function for a (500,500) grid cannot be plotted by pcolormesh, and the output in raw form reflects the same invalid results; however, the imshow method plots this result perfectly.
Generate some random two-dimensional data:
import numpy as np
from scipy import stats
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2
m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
Perform a kernel density estimate on the data:
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
Save results to file: (x,y,value)
# flattened density values, one per column of `positions`
# (kept under a separate name so the 2D array Z stays available for plotting)
Z_flat = kernel(positions).T
with open('output.csv', 'w') as fid:
    for currentIndex, value in enumerate(Z_flat):
        s1 = '%f %f %f\n' % (positions[0][currentIndex], positions[1][currentIndex], value)
        fid.write(s1)
Print results: (minx,maxx,miny,maxy,value)
mshgrd = ax.pcolormesh(X,Y,Z)
pths = mshgrd.get_paths()
arr = mshgrd.get_array()
for currentIndex,elem in enumerate(pths):
    if arr[currentIndex]>0:
        bbox = elem.get_extents()
        s2 = ",".join([str(i) for i in bbox.extents])+","+ str(arr[currentIndex]) +'\n'
        print(s2)
Plot the results:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax])
ax.plot(m1, m2, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
plt.show()
Code used for fast_kde (problem area)
kernel = fast_kde(m1,m2,(500,500))
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
mshgrd = ax.pcolormesh(X,Y,kernel)
plt.show()
Please suggest how I can add images here (where should I upload them?)
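One thing worth checking in the snippet above is that X and Y were built as a 100x100 grid while the density grid is 500x500; pcolormesh needs coordinate arrays that match the shape of the values. The cell bounds can also be produced directly with numpy, without going through pcolormesh at all. The sketch below is only an illustration: grid_to_cells is a hypothetical helper, a 2D histogram stands in for the fast_kde output, and it assumes the density array is laid out as (ny, nx), i.e. rows along y, which would need to be verified for the real function.

```python
import numpy as np

def grid_to_cells(grid, xmin, xmax, ymin, ymax):
    """Return one row (x_lo, x_hi, y_lo, y_hi, value) per cell of a (ny, nx) grid."""
    ny, nx = grid.shape
    x_edges = np.linspace(xmin, xmax, nx + 1)
    y_edges = np.linspace(ymin, ymax, ny + 1)
    # meshgrid output has shape (ny, nx), matching the assumed layout of `grid`
    x_lo, y_lo = np.meshgrid(x_edges[:-1], y_edges[:-1])
    x_hi, y_hi = np.meshgrid(x_edges[1:], y_edges[1:])
    return np.column_stack([a.ravel() for a in (x_lo, x_hi, y_lo, y_hi, grid)])

# Stand-in for a 500x500 density grid (NOT fast_kde itself)
rng = np.random.default_rng(0)
pts = rng.normal(size=(2, 2000))
counts, xe, ye = np.histogram2d(pts[0], pts[1], bins=(500, 500))
grid = counts.T                      # rows along y, columns along x

cells = grid_to_cells(grid, xe[0], xe[-1], ye[0], ye[-1])
nonzero = cells[cells[:, 4] > 0]     # keep only cells with a positive value
print(cells.shape, nonzero.shape)
```

The rows of nonzero can then be written out with np.savetxt(..., delimiter=',') in the (xmin,xmax,ymin,ymax,value) format described above.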