Kernel Density estimation - absolute numbers


I have been using kernel density estimation for a while, but so far I have always taken the easy way out by analysing only normalised distributions, where comparisons between different data sets were not necessary. In my current project I want to compare 2D density distributions on absolute scales, and it seems I have missed a critical point about how KDE works. I need to compare stellar densities on the sky from two different data sets, and for this I need either absolute numbers (in stars per some area) or two density estimates that can be compared directly. To illustrate my problem, have a look at this code:
# Import stuff
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.ticker import MultipleLocator
# Define kernel
kernel = KernelDensity(kernel="gaussian", bandwidth=1)
# Set some parameters for the synthetic data
mean = [0, 0]
cov = [[0.2, 1], [0, 1]]
# Create two data sets with different densities
x1, y1 = np.random.multivariate_normal(mean,cov,100).T
x2, y2 = np.random.multivariate_normal(mean,cov,1000).T
# Create grid
xgrid = np.arange(-5, 5, 0.1)
ygrid = np.arange(-5, 5, 0.1)
xy_coo = np.meshgrid(xgrid, ygrid)
grid = np.array([xy_coo[0].reshape(-1), xy_coo[1].reshape(-1)])
# Prepare data
data1 = np.vstack([x1, y1])
data2 = np.vstack([x2, y2])
# Evaluate density
log_dens1 = kernel.fit(data1.T).score_samples(grid.T)
dens1 = np.exp(log_dens1).reshape([len(xgrid), len(ygrid)])
log_dens2 = kernel.fit(data2.T).score_samples(grid.T)
dens2 = np.exp(log_dens2).reshape([len(xgrid), len(ygrid)])
# Plot the distributions and densities
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
im1 = ax1.imshow(dens1, extent=[-5, 5, -5, 5], origin="lower", vmin=0, vmax=0.1)
ax1.scatter(x1, y1, s=1, marker=".")
divider1 = make_axes_locatable(ax1)
cax1 = divider1.append_axes("top", size="10%", pad=0.4)
cbar1 = plt.colorbar(im1, cax=cax1, orientation="horizontal", ticks=MultipleLocator(0.02), format="%.2f")
im2 = ax2.imshow(dens2, extent=[-5, 5, -5, 5], origin="lower", vmin=0, vmax=0.1)
ax2.scatter(x2, y2, s=1, marker=".")
divider2 = make_axes_locatable(ax2)
cax2 = divider2.append_axes("top", size="10%", pad=0.4)
cbar2 = plt.colorbar(im2, cax=cax2, orientation="horizontal", ticks=MultipleLocator(0.02), format="%.2f")
plt.show()
Now, the above image is an example of the results obtained with this code. The code just generates two datasets: One set with 100 sources, the other one with 1000 sources. Their distribution is shown in the plots as scattered points. Then the code evaluates the kernel density on a given grid. This kernel density is shown in the background of the images with colors. Now what puzzles me is that the densities I get (the values of the color in the colorbar) are almost the same for both distributions, even though I have 10 times more sources in the second set. This makes it impossible to compare the density distributions directly to each other.
My questions:
a) How exactly are the densities normalised? By number counts?
b) Is there any way to get an absolute density estimation from the KDE? Say sources per 1x1 box in these arbitrary units?
thanks 😊

KDE is a non-parametric estimate of the probability density function, so the estimated density integrates to 1 over the whole plane, regardless of how many points went into it. You can think of it as a smoothed histogram normalised by the number of observations.
So, to get absolute numbers (sources per unit area), you just need to multiply the estimated density by the number of observations.
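As a minimal sketch of that multiplication, assuming a fixed-bandwidth Gaussian kernel: the snippet below uses a small hand-written KDE in plain NumPy so that the normalisation is visible. The helper name gaussian_kde_grid and the synthetic data are made up for illustration; this is not sklearn's implementation.

```python
import numpy as np

def gaussian_kde_grid(x, y, xgrid, ygrid, bandwidth):
    """Gaussian KDE of the points (x, y) on a rectangular grid; integrates to ~1."""
    # The 2D Gaussian kernel factorises into a product of two 1D Gaussians.
    kx = np.exp(-0.5 * ((xgrid[:, None] - x[None, :]) / bandwidth) ** 2)  # (nx, N)
    ky = np.exp(-0.5 * ((ygrid[:, None] - y[None, :]) / bandwidth) ** 2)  # (ny, N)
    norm = 2.0 * np.pi * bandwidth ** 2 * x.size  # kernel normalisation times N
    return ky @ kx.T / norm                       # (ny, nx), probability per unit area

rng = np.random.default_rng(0)
step = 0.2
xgrid = np.arange(-8, 8, step)
ygrid = np.arange(-8, 8, step)

surface_density = {}
for n in (100, 1000):
    x, y = rng.normal(size=(2, n))
    pdf = gaussian_kde_grid(x, y, xgrid, ygrid, bandwidth=1.0)
    surface_density[n] = n * pdf  # sources per unit area

# Integrating the scaled density over the grid recovers the number of sources.
totals = {n: dens.sum() * step ** 2 for n, dens in surface_density.items()}
print(totals)
```

With the sklearn estimator from the question, the same scaling would be len(x1) * np.exp(log_dens1) and len(x2) * np.exp(log_dens2), after which the two maps share one absolute scale.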

Related

How to put a 'grid' (for example dividing the x-y plane into bins) on an image to calculate mean of the z-values in every bin and plot as a heatmap?

My aim:
I have x, y and z values as arrays. For example:
x=np.array([10,2,-4,12,3,6,8,14])
y=np.array([5,5,-6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,20,8])
I want to plot a heatmap where the z-values will act as the intensity or 'weight' for every pair of (x,y) and the axes will be x and y values. So, my plot will be on a x-y plane. I want to lay a 'grid' on top of my plot by dividing my x-y plane into bins and then calculate the mean of the z-values within every bin and use that mean value as my color or intensity for that bin. I also want to make another plot but there I want to plot the variance of z-values as the intensity within the bins.
What I have done:
I coded it the following way, but I think I am misinterpreting things. I don't think I understand bins etc. well (I am new to programming).
import numpy as np
import matplotlib.pyplot as plt
x=np.array([10,2,-4,12,3,6,8,14])
y=np.array([5,5,-6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,-20,8])
# Bin the data onto a 2x2 grid
# Have to reverse x & y due to row-first indexing
zi, yi, xi = np.histogram2d(y, x, bins=(2,2), weights=z)
counts, _, _ = np.histogram2d(y, x, bins=(2,2))
#to get mean divide by counts
zi = zi / counts
print(zi)
zi = np.ma.masked_invalid(zi)
fig, ax = plt.subplots()
sc=ax.pcolormesh(xi, yi, zi, edgecolors='black')
sct = ax.scatter(x, y, c=z, s=200) #shows the points in the bins
fig.colorbar(sc)
ax.margins(0.05)
plt.show()
Where I am stuck:
I am not even sure if the above code is doing the right thing. So, feel free to forget it and advise me on any other standard way of solving this problem.
With the above code I get a plot where the axes limits are determined by the given dataset automatically but I want to keep my axes constant at xmin=-20,xmax=20,ymin=-20,ymax=20.
Also, I am not sure how to manipulate the z-values within the bins to calculate other statistical quantities like variance or standard deviation etc.
EDIT: I now have better code that computes the mean z value in each bin with np.histogram2d, and I can set the axes to my liking. However, with weights=z, H is the sum of the z-values in each bin; I can get the mean from that, but not other statistics such as the variance. I would like access to the individual values in each bin so that I can compute their variance and use that result as the intensity of the heatmap.
I am attaching the plot for mean z in bins.
import numpy as np
import matplotlib.pyplot as plt
x=np.array([10,2,4,12,3,6,8,14])
y=np.array([5,5,6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,20,8])
x_bins = np.linspace(0, 20, 3)
y_bins = np.linspace(0, 20, 3)
H, xedges, yedges = np.histogram2d(x, y, bins = [x_bins, y_bins], weights = z)
H_counts, xedges, yedges = np.histogram2d(x, y, bins = [x_bins, y_bins])
print(H)
H1 = H/H_counts
print(H1)
plt.xlabel("x")
plt.ylabel("y")
plt.imshow(H1.T, origin='lower', cmap='RdBu',
extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.colorbar().set_label('mean z', rotation=270)
EDIT 2: When I use stats for standard deviation I get the following plot
The deep red bin on the top right is actually empty and has no z values, so I want its standard deviation to be NaN instead of being assigned a value of 0. How can I do that?
My code for this plot is:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
x=np.array([10,2,4,12,3,6,8,14])
y=np.array([5,5,6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,20,8])
x_bins = np.linspace(0, 20, 3)
y_bins = np.linspace(0, 20, 3)
H, xedges, yedges = np.histogram2d(x, y, bins = [x_bins, y_bins], weights = z)
#mean = stats.binned_statistic_2d(x,y,z,statistic='',bins=[x_bins,y_bins])
#mean.statistic
std = stats.binned_statistic_2d(x,y,z,statistic='std',bins=[x_bins,y_bins])
#std.statistic
#print(std.statistic)
plt.xlabel("x")
plt.ylabel("y")
plt.imshow(std.statistic.T, origin='lower', cmap='RdBu',
extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
#plt.clim(0, 20)
plt.colorbar().set_label('std z', rotation=270)

Your data need to be interpolated onto a regular grid, since the computer does not know the z value where there is no data point. Luckily there is already a function for that: scipy.interpolate.griddata.
import numpy as np
from scipy.interpolate import griddata
import matplotlib.pyplot as plt
# Dummy data
x=np.array([10,2,-4,12,3,6,8,14])
y=np.array([5,5,-6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,-20,8])
# Create a regular grid along x and y axis
grid_x, grid_y = np.mgrid[x.min():x.max()+1, y.min():y.max()+1]
# Linear interpolation
# But you could also use a cubic interpolation or whatever you want/need
z_interpolated = griddata((x,y), z, (grid_x, grid_y), method='linear')
# Plot the result:
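# np.mgrid puts x on the first axis, so transpose and set the extent to get x horizontal
plt.imshow(z_interpolated.T, origin='lower', extent=[x.min(), x.max(), y.min(), y.max()], cmap='plasma')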
And we obtain:
Notice that there is no value near the image boundary: your spatial domain is not defined beyond the points contained in x and y, so with linear interpolation the computer cannot guess the values beyond those points. The heatmap is therefore restricted to the convex hull of your points; anything outside it would be extrapolation.
Edit:
If you need to compute a two-dimensional binned statistic you can use:
scipy.stats.binned_statistic_2d()
In your case, to compute the standard deviation and the mean:
from scipy import stats
std = stats.binned_statistic_2d(x,y,z,statistic='std',bins=[x_bins,y_bins])
mean = stats.binned_statistic_2d(x,y,z,statistic='mean',bins=[x_bins,y_bins])
Here mean.statistic is equivalent to your H/H_counts.
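If you would rather stay with np.histogram2d, the per-bin variance can also be obtained from the binned sums of z and z**2, and empty bins can be set to NaN explicitly. The following is only a sketch of that idea using the data from the question's edit (it computes the population variance, E[z^2] - E[z]^2, per bin); it is not how scipy computes its statistics internally.

```python
import numpy as np

x = np.array([10, 2, 4, 12, 3, 6, 8, 14])
y = np.array([5, 5, 6, 8, 20, 10, 2, 2])
z = np.array([4, 6, 10, 40, 22, 14, 20, 8], dtype=float)
x_bins = np.linspace(0, 20, 3)
y_bins = np.linspace(0, 20, 3)

# Number of points, sum of z and sum of z**2 in every bin
counts, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins])
sum_z, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins], weights=z)
sum_z2, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins], weights=z ** 2)

# 0/0 in empty bins yields NaN; silence the warnings that come with it
with np.errstate(divide="ignore", invalid="ignore"):
    mean_z = sum_z / counts
    var_z = sum_z2 / counts - mean_z ** 2

# Clip tiny negative values caused by rounding; keep NaN where a bin is empty
var_z = np.where(counts > 0, np.maximum(var_z, 0.0), np.nan)
std_z = np.sqrt(var_z)
print(mean_z)
print(std_z)
```

Passing np.ma.masked_invalid(std_z.T) to imshow leaves the empty bin uncoloured instead of drawing it as 0.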

plot norm curve over subplotted histograms

I would like to plot multiple subplots containing histograms. Additionally, I would like to plot a curve showing the normal distribution for each subplot. While I found different answers on this forum on how to plot a normal curve over a single plot (histogram), I am struggling to achieve the same with subplots. I have tried the following:
from scipy import stats
import numpy as np
import matplotlib.pylab as plt
fig, ((ax1, ax2)) = plt.subplots(1,2,figsize=(10,4))
# create some normal random noisy data
data1 = 50*np.random.rand() * np.random.normal(10, 10, 100) + 20
data2= 50*np.random.rand() * np.random.normal(10, 10, 100) + 50
# plot normed histogram
ax1.hist(data1, density=True)
# find minimum and maximum of xticks,
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data1))
# lets try the normal distribution first
m1, s1 = stats.norm.fit(data1) # get mean and standard deviation
pdf_1 = stats.norm.pdf(lnspc, m1, s1) # now get theoretical values in our interval
ax1.plot(lnspc, pdf_1, label="Norm") # plot it
# plot second hist
ax2.hist(data2, density=True)
# find minimum and maximum of xticks
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data2))
# lets try the normal distribution first
m2, s2 = stats.norm.fit(data2) # get mean and standard deviation
pdf_2 = stats.norm.pdf(lnspc, m2, s2) # now get theoretical values in our interval
ax2.plot(lnspc, pdf_2, label="Norm") # plot it
plt.show()
Now my problem is that the normal curve is always optimal for the second plot but not for the first. This is because of xmin and xmax; however, I don't know how to apply these two commands individually to each subplot. Does anyone have any experience with this? I have been trying all afternoon.
Any help is highly appreciated, thanks in advance!

You can keep the array of axes returned by plt.subplots instead of unpacking it into a tuple, and then make each one the current axes with plt.sca before calling plt.xticks(). See below if that's what you needed.
from scipy import stats
import numpy as np
import matplotlib.pylab as plt
# fig, ((ax1, ax2)) = plt.subplots(1,2,figsize=(10,4)) << INSTEAD OF THIS DO:
fig, axes = plt.subplots(nrows = 1, ncols = 2,figsize=(10,4))
# create some normal random noisy data
data1 = 50*np.random.rand() * np.random.normal(10, 10, 100) + 20
data2= 50*np.random.rand() * np.random.normal(10, 10, 100) + 50
plt.sca(axes[0]) #Refer to the first axis
# plot normed histogram
axes[0].hist(data1, density=True)
# find minimum and maximum of xticks,
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data1))
# lets try the normal distribution first
m1, s1 = stats.norm.fit(data1) # get mean and standard deviation
pdf_1 = stats.norm.pdf(lnspc, m1, s1) # now get theoretical values in our interval
axes[0].plot(lnspc, pdf_1, label="Norm") # plot it
plt.sca(axes[1]) #Refer to the second axis
# plot second hist
axes[1].hist(data2, density=True)
# find minimum and maximum of xticks
xt = plt.xticks()[0]
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(data2))
# lets try the normal distribution first
m2, s2 = stats.norm.fit(data2) # get mean and standard deviation
pdf_2 = stats.norm.pdf(lnspc, m2, s2) # now get theoretical values in our interval
axes[1].plot(lnspc, pdf_2, label="Norm") # plot it
plt.show()
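An alternative that avoids switching the current axes is to ask each Axes object for its own limits. The sketch below assumes that the x-limits are an acceptable substitute for the outermost ticks; the helper name hist_with_normal and the synthetic data are made up for illustration, and the normal pdf is written out with NumPy (np.mean and np.std give the same maximum-likelihood estimates as stats.norm.fit).

```python
import numpy as np
import matplotlib.pyplot as plt

def hist_with_normal(ax, data, n_points=200):
    """Draw a density histogram of `data` on `ax` and overlay a fitted normal pdf."""
    ax.hist(data, density=True)
    xmin, xmax = ax.get_xlim()  # limits of this axes, not of the "current" one
    xs = np.linspace(xmin, xmax, n_points)
    mu, sigma = np.mean(data), np.std(data)  # maximum-likelihood normal fit
    pdf = np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    (line,) = ax.plot(xs, pdf, label="Norm")
    return line

rng = np.random.default_rng(0)
data1 = rng.normal(20, 5, 100)
data2 = rng.normal(500, 50, 100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
line1 = hist_with_normal(ax1, data1)
line2 = hist_with_normal(ax2, data2)
# call plt.show() here to display the figure
```

Because the helper only touches the Axes it is given, it works unchanged for any number of subplots.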

Plotting contour lines that show percentage of particles

What I am trying to produce is something similar to this plot:
Which is a contour plot representing 68%, 95%, 99.7% of the particles comprised in two data sets.
So far, I have tried to implement a Gaussian KDE and to plot the resulting density as contours.
Files are added here https://www.dropbox.com/sh/86r9hf61wlzitvy/AABG2mbmmeokIiqXsZ8P76Swa?dl=0
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import numpy as np
# My data
x = RelDist
y = RadVel
# Perform the kernel density estimate
k = gaussian_kde(np.vstack([RelDist, RadVel]))
xi, yi = np.mgrid[x.min():x.max():x.size**0.5*1j,y.min():y.max():y.size**0.5*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
fig = plt.figure()
ax = fig.gca()
CS = ax.contour(xi, yi, zi.reshape(xi.shape), colors='darkslateblue')
plt.clabel(CS, inline=1, fontsize=10)
ax.set_xlim(20, 800)
ax.set_ylim(-450, 450)
ax.set_xscale('log')
plt.show()
Producing this:
Where 1) I do not know how to control the bin number in the Gaussian KDE, 2) the contour labels are all zero, and 3) I have no clue how to determine the percentiles.
Any help is appreciated.

Taken from this example in the matplotlib documentation:
you can transform your data zi to a relative scale (0-1) and then draw the contour plot.
You can also manually set the levels of the contour plot when you call plt.contour().
Below is an example with 2 randomly generated normal bivariate distributions:
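import numpy as np
import matplotlib.pyplot as plt
def bivariate_normal(X, Y, sigmax=1.0, sigmay=1.0, mux=0.0, muy=0.0):
    # Stand-in for matplotlib.mlab.bivariate_normal, which was removed in
    # Matplotlib 3.1 (uncorrelated case only)
    z = ((X - mux) / sigmax)**2 + ((Y - muy) / sigmay)**2
    return np.exp(-z / 2) / (2 * np.pi * sigmax * sigmay)
delta = 0.025
x = y = np.arange(-3.0, 3.01, delta)
X, Y = np.meshgrid(x, y)
Z1 = bivariate_normal(X, Y, 1.0, 1.0, 0.0, 0.0)
Z2 = bivariate_normal(X, Y, 1.5, 0.5, 1, 1)
Z = 10 * (Z1 - Z2)
# transform Z to a 0-1 range
Z = (Z - Z.min()) / (Z.max() - Z.min())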
levels = [0.68, 0.95, 0.997]
origin = 'lower'
CS = plt.contour(X, Y, Z, levels,
colors=('k',),
linewidths=(3,),
origin=origin)
plt.clabel(CS, fmt='%2.3f', colors='b', fontsize=14)
Using the data you provided the code works just as well:
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import numpy as np
RadVel = np.loadtxt('RadVel.txt')
RelDist = np.loadtxt('RelDist.txt')
x = RelDist
y = RadVel
k = gaussian_kde(np.vstack([RelDist, RadVel]))
xi, yi = np.mgrid[x.min():x.max():x.size**0.5*1j,y.min():y.max():y.size**0.5*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
#set zi to 0-1 scale
zi = (zi-zi.min())/(zi.max() - zi.min())
zi =zi.reshape(xi.shape)
#set up plot
origin = 'lower'
levels = [0,0.1,0.25,0.5,0.68, 0.95, 0.975,1]
CS = plt.contour(xi, yi, zi,levels = levels,
colors=('k',),
linewidths=(1,),
origin=origin)
plt.clabel(CS, fmt='%.3f', colors='b', fontsize=8)
plt.gca()
plt.xlim(10,1000)
plt.xscale('log')
plt.ylim(-200,200)

The answer from @Tkanno is programmatically correct but does not do exactly what was asked in the question.
The KDE returns the estimated probability density at a sample's position under the modelled distribution. The contour lines are therefore levels of that density: the 0.1 contour is the boundary beyond which the estimated density drops below 0.1 (a density per unit area, not a 10% probability). By normalising the z values as proposed by Tkanno, relative densities are plotted instead, so in Tkanno's answer the 0.1 contour is the boundary beyond which samples are 10 times less likely to appear than the most likely sample.
You could get very similar contour plots to those proposed by Tkanno (though not smoothed) by computing a 2D histogram, normalising by the most populated bin and plotting the contours with the same levels.
This should not be confused with a boundary containing 90% of the data.
I think contour plots that encompass a given fraction of the data are a bit more complicated to get (cf https://stats.stackexchange.com/questions/68105/contours-containing-a-given-fraction-of-x-y-points and the solution with bag plots).
Apparently there is an implementation of bag plots in R, maybe someone has/will make it for python.
To illustrate the difficulty of the question, think of a dataset with 100 points. Any region containing 95 points and excluding 5 would technically answer the question. What is probably implicitly asked for is the smallest region containing 95 points (hence representing the highest likelihood or density), and this is a combinatorial optimisation problem.
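A common approximation, which is neither the bag plot nor either answer's method, is to work with the gridded density itself: sort the grid cells from densest to sparsest, accumulate their probability mass, and take the density value at which the running total reaches the wanted fraction. The resulting contours enclose that fraction of the estimated probability mass, which is only approximately the same fraction of the sample points. A sketch under those assumptions, with a hypothetical helper name and an analytic Gaussian standing in for the KDE output:

```python
import numpy as np

def mass_enclosing_levels(density, cell_area, fractions):
    """Density thresholds whose super-level sets hold the given probability mass."""
    sorted_density = np.sort(density.ravel())[::-1]  # densest cells first
    cumulative = np.cumsum(sorted_density * cell_area)
    cumulative /= cumulative[-1]                     # renormalise (grid truncation)
    idx = np.searchsorted(cumulative, fractions)
    idx = np.minimum(idx, sorted_density.size - 1)
    return sorted_density[idx]

# Stand-in for a KDE evaluated on a regular grid: a standard bivariate normal
step = 0.02
axis = np.arange(-6, 6 + step, step)
X, Y = np.meshgrid(axis, axis)
density = np.exp(-0.5 * (X ** 2 + Y ** 2)) / (2 * np.pi)

fractions = [0.68, 0.95, 0.997]
levels = mass_enclosing_levels(density, step ** 2, fractions)
print(levels)
```

The returned densities decrease as the fraction grows, so they need to be sorted before plotting, e.g. ax.contour(xi, yi, zi, levels=np.sort(levels)) with zi reshaped to the grid and not rescaled to 0-1.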

Plot a Correlation Circle in Python

I've been doing some Geometrical Data Analysis (GDA) such as Principal Component Analysis (PCA). I'm looking to plot a Correlation Circle... these look a bit like this:
Basically, it shows to what extent each variable is correlated with the principal components (dimensions) of a dataset.
Does anyone know if there is a Python package that plots such a data visualization?

Here is a simple example using sklearn and the iris dataset. Includes both the factor map for the first two dimensions and a scree plot:
from sklearn.decomposition import PCA
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
df = sns.load_dataset('iris')
n_components = 4
# Do the PCA.
pca = PCA(n_components=n_components)
reduced = pca.fit_transform(df[['sepal_length', 'sepal_width',
                                'petal_length', 'petal_width']])
# Append the principal components for each entry to the dataframe
for i in range(0, n_components):
    df['PC' + str(i + 1)] = reduced[:, i]
print(df.head())
# Do a scree plot
ind = np.arange(0, n_components)
(fig, ax) = plt.subplots(figsize=(8, 6))
sns.pointplot(x=ind, y=pca.explained_variance_ratio_)
ax.set_title('Scree plot')
ax.set_xticks(ind)
ax.set_xticklabels(ind)
ax.set_xlabel('Component Number')
ax.set_ylabel('Explained Variance')
plt.show()
# Show the points in terms of the first two PCs
g = sns.lmplot(x='PC1',
               y='PC2',
               hue='species', data=df,
               fit_reg=False,
               scatter=True,
               height=7)
plt.show()
# Plot a variable factor map for the first two dimensions.
(fig, ax) = plt.subplots(figsize=(8, 8))
for i in range(0, pca.components_.shape[1]):
    ax.arrow(0,
             0,  # Start the arrow at the origin
             pca.components_[0, i],  # 0 for PC1
             pca.components_[1, i],  # 1 for PC2
             head_width=0.1,
             head_length=0.1)
    plt.text(pca.components_[0, i] + 0.05,
             pca.components_[1, i] + 0.05,
             df.columns.values[i])
an = np.linspace(0, 2 * np.pi, 100)
plt.plot(np.cos(an), np.sin(an)) # Add a unit circle for scale
plt.axis('equal')
ax.set_title('Variable factor map')
plt.show()
It'd be a good exercise to extend this to further PCs, to deal with scaling if all components are small, and to avoid plotting factors with minimal contributions.
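Note that pca.components_ holds the eigenvectors (loadings), whereas a correlation circle in the strict sense plots the correlation of each original variable with each principal component. A minimal NumPy sketch of those coordinates follows, assuming the PCA is done on standardised variables (i.e. on the correlation matrix); the function name and the toy data are made up for illustration.

```python
import numpy as np

def correlation_circle_coords(data, n_components=2):
    """Correlation of each original variable with the leading principal components."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardise the columns
    corr = z.T @ z / z.shape[0]                        # correlation matrix
    eigval, eigvec = np.linalg.eigh(corr)              # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]    # keep the largest ones
    # corr(variable j, PC k) = eigvec[j, k] * sqrt(eigval[k])
    return eigvec[:, order] * np.sqrt(eigval[order])

rng = np.random.default_rng(0)
latent = rng.normal(size=200)
data = np.column_stack([
    latent + 0.1 * rng.normal(size=200),   # strongly tied to the latent factor
    -latent + 0.1 * rng.normal(size=200),  # anti-correlated with the first column
    rng.normal(size=200),                  # independent noise
])
coords = correlation_circle_coords(data)
print(coords)  # one row per variable: (corr with PC1, corr with PC2)
```

Each row gives the tip of one arrow; since these are correlations, every arrow lies inside the unit circle, and variables that are well represented by the first two components end up close to it.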

I agree it's a pity not to have it in some mainstream package such as sklearn.
Here is a home-made implementation:
https://github.com/mazieres/analysis/blob/master/analysis.py#L19-34

Generate a heatmap using a scatter data set

I have a set of X,Y data points (about 10k) that are easy to plot as a scatter plot but that I would like to represent as a heatmap.
I looked through the examples in Matplotlib and they all seem to already start with heatmap cell values to generate the image.
Is there a method that converts a bunch of x, y, all different, to a heatmap (where zones with higher frequency of x, y would be "warmer")?

If you don't want hexagons, you can use numpy's histogram2d function:
import numpy as np
import numpy.random
import matplotlib.pyplot as plt
# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)
heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.clf()
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()
This makes a 50x50 heatmap. If you want, say, 512x384, you can put bins=(512, 384) in the call to histogram2d.
Example:

In Matplotlib lexicon, I think you want a hexbin plot.
If you're not familiar with this type of plot, it's just a bivariate histogram in which the xy-plane is tessellated by a regular grid of hexagons.
So, as with a histogram, you discretize the plotting region into a set of windows (here hexagons), assign each point to one of these windows and count the points falling in each; finally, map the counts onto a color array, and you've got a hexbin diagram.
Though less commonly used than, e.g., circles or squares, it is intuitive that hexagons are a better choice for the geometry of the binning container:
- hexagons have nearest-neighbor symmetry (square bins don't: the distance from a point on a square's border to a point inside that square is not everywhere equal), and
- the hexagon is the highest n-polygon that gives a regular plane tessellation (i.e., you can safely re-model your kitchen floor with hexagonal tiles because you won't have any void space between the tiles when you are finished; this is not true for any higher-n, n >= 7, polygon).
(Matplotlib uses the term hexbin plot; so do (AFAIK) all of the plotting libraries for R; still, I don't know if this is the generally accepted term for plots of this type, though I suspect it's likely given that hexbin is short for hexagonal binning, which describes the essential step in preparing the data for display.)
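from matplotlib import pyplot as PLT
from matplotlib import cm as CM
import numpy as NP
def bivariate_normal(X, Y, sigmax=1.0, sigmay=1.0, mux=0.0, muy=0.0):
    # Stand-in for matplotlib.mlab.bivariate_normal, which was removed in
    # Matplotlib 3.1 (uncorrelated case only)
    z = ((X - mux) / sigmax)**2 + ((Y - muy) / sigmay)**2
    return NP.exp(-z / 2) / (2 * NP.pi * sigmax * sigmay)
n = 1e5
x = y = NP.linspace(-5, 5, 100)
X, Y = NP.meshgrid(x, y)
Z1 = bivariate_normal(X, Y, 2, 2, 0, 0)
Z2 = bivariate_normal(X, Y, 4, 1, 1, 1)
ZD = Z2 - Z1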
x = X.ravel()
y = Y.ravel()
z = ZD.ravel()
gridsize=30
PLT.subplot(111)
# if 'bins=None', then color of each hexagon corresponds directly to its count
# 'C' is optional--it maps values to x-y coordinates; if 'C' is None (default) then
# the result is a pure 2D histogram
PLT.hexbin(x, y, C=z, gridsize=gridsize, cmap=CM.jet, bins=None)
PLT.axis([x.min(), x.max(), y.min(), y.max()])
cb = PLT.colorbar()
cb.set_label('mean value')
PLT.show()

Edit: For a better approximation of Alejandro's answer, see below.
I know this is an old question, but I wanted to add something to Alejandro's answer: if you want a nice smoothed image without using py-sphviewer, you can instead use np.histogram2d and apply a Gaussian filter (gaussian_filter from scipy.ndimage) to the heatmap:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.ndimage import gaussian_filter
def myplot(x, y, s, bins=1000):
    heatmap, xedges, yedges = np.histogram2d(x, y, bins=bins)
    heatmap = gaussian_filter(heatmap, sigma=s)
    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    return heatmap.T, extent
fig, axs = plt.subplots(2, 2)
# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)
sigmas = [0, 16, 32, 64]
for ax, s in zip(axs.flatten(), sigmas):
    if s == 0:
        ax.plot(x, y, 'k.', markersize=5)
        ax.set_title("Scatter plot")
    else:
        img, extent = myplot(x, y, s)
        ax.imshow(img, extent=extent, origin='lower', cmap=cm.jet)
        ax.set_title(r"Smoothing with $\sigma$ = %d" % s)
plt.show()
Produces:
The scatter plot and s=16 plotted on top of each other for Agape Gal'lo:
One difference I noticed with my gaussian filter approach and Alejandro's approach was that his method shows local structures much better than mine. Therefore I implemented a simple nearest neighbour method at pixel level. This method calculates for each pixel the inverse sum of the distances of the n closest points in the data. This method is at a high resolution pretty computationally expensive and I think there's a quicker way, so let me know if you have any improvements.
Update: As I suspected, there's a much faster method using SciPy's scipy.spatial.cKDTree. See Gabriel's answer for the implementation.
Anyway, here's my code:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
def data_coord2view_coord(p, vlen, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * vlen
    return dv
def nearest_neighbours(xs, ys, reso, n_neighbours):
    im = np.zeros([reso, reso])
    extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]
    xv = data_coord2view_coord(xs, reso, extent[0], extent[1])
    yv = data_coord2view_coord(ys, reso, extent[2], extent[3])
    for x in range(reso):
        for y in range(reso):
            xp = (xv - x)
            yp = (yv - y)
            d = np.sqrt(xp**2 + yp**2)
            im[y][x] = 1 / np.sum(d[np.argpartition(d.ravel(), n_neighbours)[:n_neighbours]])
    return im, extent
n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)
resolution = 250
fig, axes = plt.subplots(2, 2)
for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 64]):
    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=2)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:
        im, extent = nearest_neighbours(xs, ys, resolution, neighbours)
        ax.imshow(im, origin='lower', extent=extent, cmap=cm.jet)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])
plt.show()
Result:

Instead of using np.histogram2d, which in general produces quite ugly histograms, I would like to recycle py-sphviewer, a Python package for rendering particle simulations using an adaptive smoothing kernel, which can easily be installed with pip (see the webpage documentation). Consider the following code, which is based on the example:
import numpy as np
import numpy.random
import matplotlib.pyplot as plt
import sphviewer as sph
def myplot(x, y, nb=32, xsize=500, ysize=500):
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)
    x0 = (xmin+xmax)/2.
    y0 = (ymin+ymax)/2.
    pos = np.zeros([len(x),3])
    pos[:,0] = x
    pos[:,1] = y
    w = np.ones(len(x))
    P = sph.Particles(pos, w, nb=nb)
    S = sph.Scene(P)
    S.update_camera(r='infinity', x=x0, y=y0, z=0,
                    xsize=xsize, ysize=ysize)
    R = sph.Render(S)
    R.set_logscale()
    img = R.get_image()
    extent = R.get_extent()
    # shift the extent so that it is centred on the data (Python 3 syntax)
    for i, j in zip(range(4), [x0,x0,y0,y0]):
        extent[i] += j
    print(extent)
    return img, extent
fig = plt.figure(1, figsize=(10,10))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)
# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)
#Plotting a regular scatter plot
ax1.plot(x,y,'k.', markersize=5)
ax1.set_xlim(-3,3)
ax1.set_ylim(-3,3)
heatmap_16, extent_16 = myplot(x,y, nb=16)
heatmap_32, extent_32 = myplot(x,y, nb=32)
heatmap_64, extent_64 = myplot(x,y, nb=64)
ax2.imshow(heatmap_16, extent=extent_16, origin='lower', aspect='auto')
ax2.set_title("Smoothing over 16 neighbors")
ax3.imshow(heatmap_32, extent=extent_32, origin='lower', aspect='auto')
ax3.set_title("Smoothing over 32 neighbors")
#Make the heatmap using a smoothing over 64 neighbors
ax4.imshow(heatmap_64, extent=extent_64, origin='lower', aspect='auto')
ax4.set_title("Smoothing over 64 neighbors")
plt.show()
which produces the following image:
As you see, the images look pretty nice, and we are able to identify different substructures in them. These images are constructed by spreading a given weight for every point within a certain domain, defined by the smoothing length, which in turn is given by the distance to the nb-th closest neighbor (I've chosen 16, 32 and 64 for the examples). So, higher density regions are typically spread over smaller regions than lower density regions.
The function myplot is just a very simple function that I've written in order to give the x,y data to py-sphviewer to do the magic.

If you are using Matplotlib 1.2.x or later:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(100000)
y = np.random.randn(100000)
plt.hist2d(x,y,bins=100)
plt.show()

Seaborn now has the jointplot function which should work nicely here:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)
sns.jointplot(x=x, y=y, kind='hex')
plt.show()

Here's Jurgy's great nearest neighbour approach but implemented using scipy.spatial.cKDTree. In my tests it's about 100x faster.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.spatial import cKDTree
def data_coord2view_coord(p, resolution, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * resolution
    return dv
n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)
resolution = 250
extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]
xv = data_coord2view_coord(xs, resolution, extent[0], extent[1])
yv = data_coord2view_coord(ys, resolution, extent[2], extent[3])
def kNN2DDens(xv, yv, resolution, neighbours, dim=2):
    """
    Inverse of the summed distances from each grid pixel to its
    `neighbours` nearest data points.
    """
    # Create the tree
    tree = cKDTree(np.array([xv, yv]).T)
    # Find the `neighbours` closest data points to each grid pixel
    grid = np.mgrid[0:resolution, 0:resolution].T.reshape(resolution**2, dim)
    dists = tree.query(grid, neighbours)
    # Inverse of the sum of distances to each grid point.
    inv_sum_dists = 1. / dists[0].sum(1)
    # Reshape
    im = inv_sum_dists.reshape(resolution, resolution)
    return im
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 63]):
    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=5)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:
        im = kNN2DDens(xv, yv, resolution, neighbours)
        ax.imshow(im, origin='lower', extent=extent, cmap=cm.Blues)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])
plt.savefig('new.png', dpi=150, bbox_inches='tight')

and the initial question was... how to convert scatter values to grid values, right?
histogram2d does count the frequency per cell, however, if you have other data per cell than just the frequency, you'd need some additional work to do.
x = data_x # between -10 and 4, log-gamma of an svc
y = data_y # between -4 and 11, log-C of an svc
z = data_z #between 0 and 0.78, f1-values from a difficult dataset
So, I have a dataset with Z-results for X and Y coordinates. However, I was calculating few points outside the area of interest (large gaps), and heaps of points in a small area of interest.
Yes here it becomes more difficult but also more fun. Some libraries (sorry):
from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
from scipy.interpolate import griddata
pyplot is my graphic engine today,
cm is a range of color maps with some interesting choices,
numpy for the calculations,
and griddata for attaching values to a fixed grid.
The last one is important especially because the frequency of xy points is not equally distributed in my data. First, let's start with some boundaries fitting to my data and an arbitrary grid size. The original data has datapoints also outside those x and y boundaries.
#determine grid boundaries
gridsize = 500
x_min = -8
x_max = 2.5
y_min = -2
y_max = 7
So we have defined a grid with 500 pixels between the min and max values of x and y.
In my data, there are many more than 500 values available in the area of high interest, whereas in the low-interest area there are not even 200 values in the total grid; between the graphic boundaries of x_min and x_max there are even fewer.
So for getting a nice picture, the task is to get an average for the high interest values and to fill the gaps elsewhere.
I define my grid now. For each xx-yy pair, i want to have a color.
xx = np.linspace(x_min, x_max, gridsize) # array of x values
yy = np.linspace(y_min, y_max, gridsize) # array of y values
grid = np.array(np.meshgrid(xx, yy.T))
grid = grid.reshape(2, grid.shape[1]*grid.shape[2]).T
Why the strange shape? scipy.interpolate.griddata wants a shape of (n, D).
Griddata calculates one value per point in the grid, by a predefined method.
I choose "nearest" - empty grid points will be filled with values from the nearest neighbor. This looks as if the areas with less information have bigger cells (even if it is not the case). One could choose to interpolate "linear", then areas with less information look less sharp. Matter of taste, really.
points = np.array([x, y]).T # because griddata wants it that way
z_grid2 = griddata(points, z, grid, method='nearest')
# You get a 1D vector as the result. Reshape it to picture format (rows follow y, columns follow x)!
z_grid2 = z_grid2.reshape(yy.shape[0], xx.shape[0])
And hop, we hand over to matplotlib to display the plot
fig = plt.figure(1, figsize=(10, 10))
ax1 = fig.add_subplot(111)
ax1.imshow(z_grid2, extent=[x_min, x_max,y_min, y_max, ],
origin='lower', cmap=cm.magma)
ax1.set_title("SVC: empty spots filled by nearest neighbours")
ax1.set_xlabel('log gamma')
ax1.set_ylabel('log C')
plt.show()
Around the pointy part of the V-Shape, you see I did a lot of calculations during my search for the sweet spot, whereas the less interesting parts almost everywhere else have a lower resolution.
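To see the difference between "nearest" and "linear" mentioned above without the original data, here is a small sketch on synthetic points. The sample function z = x + y, the grid sizes, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(1)
# Hypothetical scattered samples of a smooth function z = x + y
points = rng.uniform(0.0, 1.0, size=(200, 2))
values = points[:, 0] + points[:, 1]

# Regular grid, flattened to shape (n, 2) as griddata expects
xx = np.linspace(0.0, 1.0, 50)
yy = np.linspace(0.0, 1.0, 40)
gx, gy = np.meshgrid(xx, yy)  # both arrays have shape (len(yy), len(xx))
grid = np.column_stack([gx.ravel(), gy.ravel()])

# Reshape back to the meshgrid shape so rows follow y and columns follow x
z_nearest = griddata(points, values, grid, method="nearest").reshape(gx.shape)
z_linear = griddata(points, values, grid, method="linear").reshape(gx.shape)

# 'nearest' fills every grid point; 'linear' leaves NaN outside the
# convex hull of the samples (for example at the grid corners)
n_missing = int(np.isnan(z_linear).sum())
```

So "linear" gives smoother transitions but leaves holes at the edges, whereas "nearest" always returns a value. Which one looks better is, as said, a matter of taste.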

Make a 2-dimensional array that corresponds to the cells in your final image, called say heatmap_cells and instantiate it as all zeroes.
Choose two scaling factors that define the difference between each array element in real units, for each dimension, say x_scale and y_scale. Choose these such that all your datapoints will fall within the bounds of the heatmap array.
For each raw datapoint with x_value and y_value:
heatmap_cells[floor(x_value/x_scale),floor(y_value/y_scale)]+=1
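The steps above can be sketched in NumPy roughly as follows. The function name count_heatmap is made up for this example, and the sketch assumes that all coordinates are non-negative and fall inside the chosen array shape (negative indices would silently wrap around).

```python
import numpy as np

def count_heatmap(x_values, y_values, x_scale, y_scale, shape):
    """Count how many raw data points fall into each heatmap cell."""
    heatmap_cells = np.zeros(shape, dtype=int)
    # Convert real-unit coordinates to integer cell indices
    ix = np.floor(np.asarray(x_values) / x_scale).astype(int)
    iy = np.floor(np.asarray(y_values) / y_scale).astype(int)
    # np.add.at accumulates correctly when the same cell is hit several times
    np.add.at(heatmap_cells, (ix, iy), 1)
    return heatmap_cells

cells = count_heatmap([0.1, 0.2, 1.5], [0.1, 0.3, 2.9], x_scale=1.0, y_scale=1.0, shape=(2, 3))
```

A plain `heatmap_cells[ix, iy] += 1` would count repeated cells only once, which is why np.add.at is used here.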

Very similar to @Piti's answer, but using 1 call instead of 2 to generate the points:
import numpy as np
import matplotlib.pyplot as plt
pts = 1000000
mean = [0.0, 0.0]
cov = [[1.0,0.0],[0.0,1.0]]
x,y = np.random.multivariate_normal(mean, cov, pts).T
plt.hist2d(x, y, bins=50, cmap=plt.cm.jet)
plt.show()
Output:

Here's one I made on a 1 million point set with 3 categories (colored red, green, and blue). Here's a link to the repository if you'd like to try the function: GitHub Repo
histplot(
X,
Y,
labels,
bins=2000,
range=((-3,3),(-3,3)),
normalize_each_label=True,
colors = [
[1,0,0],
[0,1,0],
[0,0,1]],
gain=50)

I'm afraid I'm a little late to the party, but I had a similar question a while ago. The accepted answer (by @ptomato) helped me out, but I also want to post this in case it's of use to someone.
''' I wanted to create a heatmap resembling a football pitch which would show the different actions performed '''
import numpy as np
import matplotlib.pyplot as plt
import random
#fixing random state for reproducibility
np.random.seed(1234324)
fig = plt.figure(12)
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Grid sized to cover the scaled pitch (x runs 0-10.5, y runs 0-6.8), so truncated indices 10 and 6 stay in bounds
hmap = np.zeros((7, 11), dtype=int)
#print(hmap)
xlist = np.random.uniform(low=0.0, high=100.0, size=(20))
ylist = np.random.uniform(low=0.0, high=100.0, size=(20))
#UEFA Pitch Standards are 105m x 68m
xlist = (xlist/100)*10.5
ylist = (ylist/100)*6.8
ax1.scatter(xlist,ylist)
#int of the co-ordinates to populate the array
xlist_int = xlist.astype(int)
ylist_int = ylist.astype(int)
#print(xlist_int, ylist_int)
for i, j in zip(xlist_int, ylist_int):
    #this populates the array according to the x,y co-ordinate values it encounters
    hmap[j][i] = hmap[j][i] + 1
#Reversing the rows is necessary
hmap = hmap[::-1]
#print(hmap)
im = ax2.imshow(hmap)
Here's the result

None of these solutions worked for my application, so this is what I came up with. Essentially I am placing a 2D Gaussian at every single point:
import cv2
import numpy as np
import matplotlib.pyplot as plt
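def getGaussian2D(ksize, sigma, norm=True):
    # The outer product of a 1D Gaussian with itself gives the 2D kernel
    oneD = cv2.getGaussianKernel(ksize=ksize, sigma=sigma)
    twoD = np.outer(oneD.T, oneD)
    return twoD / np.sum(twoD) if norm else twoD
def pt2heat(pts, shape, kernel=16, sigma=5):
    heat = np.zeros(shape)
    k = getGaussian2D(kernel, sigma)
    # The first element of each point indexes columns (axis 1), the second indexes rows (axis 0)
    for y,x in pts:
        x, y = int(x), int(y)
        # Add the kernel window around the point, skipping cells outside the image
        for i in range(-kernel//2, kernel//2):
            for j in range(-kernel//2, kernel//2):
                if 0 <= x+i < shape[0] and 0 <= y+j < shape[1]:
                    heat[x+i, y+j] = heat[x+i, y+j] + k[i+kernel//2, j+kernel//2]
    return heat
# pts (the list of points) and img (the background image) are assumed to be defined already
heat = pt2heat(pts, img.shape[:2])
plt.imshow(heat, cmap='hot')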
Here are the points overlaid on top of their associated image, along with the resulting heat map:
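As a side note, placing a Gaussian at every point amounts to convolving a per-pixel count image with a Gaussian kernel, so the nested loops can be swapped for a single filter call. The sketch below uses scipy.ndimage.gaussian_filter instead of the cv2 kernel above, so it is not numerically identical (the filter truncates at 4 sigma by default rather than at a fixed 16-pixel window). The function name points_to_heat and the (row, col) point convention are assumptions for this example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_heat(points, shape, sigma=5.0):
    """Heat map from (row, col) pixel coordinates (an assumed convention)."""
    counts = np.zeros(shape, dtype=float)
    rows, cols = np.asarray(points, dtype=int).T
    # Drop points that fall outside the image
    inside = (rows >= 0) & (rows < shape[0]) & (cols >= 0) & (cols < shape[1])
    # Count the points per pixel; np.add.at handles repeated pixels correctly
    np.add.at(counts, (rows[inside], cols[inside]), 1.0)
    # Smoothing the counts spreads each point into a 2D Gaussian;
    # mode="constant" treats everything outside the image as zero
    return gaussian_filter(counts, sigma=sigma, mode="constant")

heat = points_to_heat([(50, 50), (50, 50), (20, 70)], shape=(100, 100), sigma=3.0)
```

Because the kernel is normalised, the heat map sums to roughly the number of points as long as they are not close to the border, which is handy if you need absolute numbers rather than just a picture.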
