Generating a 2-dimensional sample dataset from a Gaussian mixture distribution - python

I wanted to generate a 2-dimensional sample dataset. I copied the code from this link and duplicated it to generate the vectors X and Y, then scattered them as a 2-dimensional dataset as shown below. But the result was not what I wanted; in fact I wanted something like the picture below.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

mu = [1, 4]
sigma = [2, 1]
p_i = [0.3, 0.7]
n = 1000
x = []
y = []
for i in range(n):
    # np.random.multinomial(1, [0.3, 0.5, 0.2]) returns the outcome of one roll of a
    # loaded three-sided die, e.g. [1, 0, 0], meaning side one came up and the others
    # did not. The goal is to pick mu[z_i] at random according to the weights p_i.
    z_i = np.argmax(np.random.multinomial(1, p_i))
    x_i = np.random.normal(mu[z_i], sigma[z_i])
    x.append(x_i)

mu = [3, 6]
sigma = [1, 2]
p_i = [0.6, 0.4]
for i in range(n):
    z_i = np.argmax(np.random.multinomial(1, p_i))
    y_i = np.random.normal(mu[z_i], sigma[z_i])
    y.append(y_i)

plt.scatter(x, y)
plt.show()
Can anyone help me?

It looks like what you're trying to plot is data sampled from 2 distinct 2D Gaussians. Here is code that plots mock data that looks like that. Feel free to adjust the means and covariance matrices to suit your needs.
import numpy as np
import matplotlib.pyplot as plt

# First 2D gaussian:
mu = [1, 3]
cov = [[0.07, 0], [0, 1.8]]
x, y = np.random.multivariate_normal(mu, cov, 200).T
plt.figure(figsize=(10, 6))
plt.scatter(x, y, s=5, color='blue')

# Second 2D gaussian:
mu = [2, 1]
cov = [[0.8, -0.4], [-0.4, 0.5]]
x, y = np.random.multivariate_normal(mu, cov, 200).T
plt.scatter(x, y, s=5, color='red')
plt.xlim([-2, 8])
plt.ylim([-6, 10])
This produces something like the plot below (different colors so you can see the pattern):
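If you instead want the two components mixed into a single cloud, as in the question (one component label drawn per point), here is a minimal sketch; the weights, means and covariances below are illustrative assumptions, so adjust them to your data:
import numpy as np
import matplotlib.pyplot as plt

# illustrative mixture parameters (assumed values)
weights = [0.3, 0.7]
means = [np.array([1, 3]), np.array([2, 1])]
covs = [np.diag([0.07, 1.8]), np.array([[0.8, -0.4], [-0.4, 0.5]])]

n = 1000
labels = np.random.choice(len(weights), size=n, p=weights)  # pick a component per point
samples = np.array([np.random.multivariate_normal(means[k], covs[k]) for k in labels])

plt.scatter(samples[:, 0], samples[:, 1], s=5, c=labels)
plt.show()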

Related

Python: is there a way to define a global scale of colors and plot colored lines using this custom scale?

I've been reading about plotting multiple lines and multicolored lines, but every post I read uses a continuous set of data, like some trigonometric function:
x = np.linspace(0, 3 * np.pi, 500)
y = np.sin(x)
So, here is my problem: I'm making a plotting script for 1D finite element problems like the image attached. I plot the elements as individual lines from the X and Y coordinate arrays, and I would like to color the lines based on a third array such as axial stress, temperature, or any other quantity.
The problem is that when I follow the examples I've found, every line gets a local color distribution instead of a global one. I'm thinking of defining a global color scale based on the maximum and minimum values of the third array, say temperature, and passing the coordinates of each element plus its current average temperature might do the job, but I don't know if something like that is possible.
Can anyone help?
You can map values to a particular colormap, and plot each element as a single line with a particular color, like this:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import Normalize
import numpy as np
# roto-translation matrix
T = lambda t, u, v: np.array([[np.cos(t), -np.sin(t), u], [np.sin(t), np.cos(t), 0], [0, 0, 1]])
# coordinates of a triangle
triangle = np.array([[-0.5, -0.5, 1], [0.5, -0.5, 1], [-0.5, 0.5, 1], [-0.5, -0.5, 1]]).T
n_elements = 10
values = np.linspace(-5, 5, n_elements)
# create a normalizer
norm = Normalize(vmin=values.min(), vmax=values.max())
# normalize values
norm_values = norm(values)
# choose a colormap
cmap = cm.magma
# create colors
colors = cmap(norm_values)
# map values to a colorbar
mappable = cm.ScalarMappable(norm=norm, cmap=cmap)
mappable.set_array(values)
f, ax = plt.subplots(1)
ax.set_aspect("equal")
for i in range(n_elements):
    tr = np.matmul(T(i * np.pi, int(i / 2), 0), triangle)
    ax.plot(tr[0, :], tr[1, :], color=colors[i])
cb = f.colorbar(mappable)
cb.set_label("Value")

How to draw slope triangles in matplotlib?

What I would like to have are the triangles as shown in the image:
Here is my code:
import matplotlib.pyplot as plt
data = [0.2855, 0.3030, 0.4995]
x = [1, 2, 3]
plt.plot(x, data)
plt.show()
Is there a simple way of inserting these slope triangles as shown in the image in an automatic fashion? I would like to have the triangle in the middle between two points and with the slope written next to it.
Depending on your idea of "automatic fashion", this might be a suitable solution:
import matplotlib.pyplot as plt
import numpy as np
# Data
x = np.array([1, 2, 3])
y = np.array([0.2855, 0.3030, 0.4995])
# Calculate triangle coordinates values
x_mid = np.convolve(x, [0.5, 0.5], mode='valid')
x_tri = np.vstack((x_mid, x_mid + 0.3))
y_tri = np.interp(x_tri, x, y)
# Calculate slopes
slopes = np.diff(y) / np.diff(x)
# Plot
plt.plot(x, y)
plt.plot(x_tri, np.tile(y_tri[0, :], [2, 1]), 'r') # red horizontal line
plt.plot(np.tile(x_tri[1, :], [2, 1]), y_tri, 'r') # red vertical line
for i, slope in enumerate(slopes):  # slope values
    plt.text(x_tri[1, i] + 0.05, np.mean(y_tri[:, i]), r'{0:.3f}'.format(slope))
plt.show()
Output:
Put all the triangle stuff in a separate function, and it won't affect your main code too much.
Hope that helps!
The triangle construction can also be done using mpltools:
import matplotlib.pyplot as plt
from mpltools import annotation
import numpy as np
data = [0.2855, 0.3030, 0.4995]
x = [1, 2, 3]
# get midpoint coordinates
x_mid = np.convolve(x, [0.5, 0.5], mode='valid')
y_mid = np.interp(x_mid, x, data)
# compute the gradient of each segment
gradients = np.diff(data)/np.diff(x)
# plot
plt.plot(x, data)
axes = plt.gca()
for xm, ym, g in zip(x_mid, y_mid, gradients):
    annotation.slope_marker((xm, ym), g)
plt.show()
The first argument of annotation.slope_marker is a tuple containing the coordinates for the left-hand corner of the triangle and the second argument is the gradient. So here we loop over the midpoints of the line segments and their gradients and annotate with a triangular slope marker for that gradient at those coordinates.
Expected Output:

How to do a simple Gaussian mixture sampling and PDF plotting with NumPy/SciPy?

I add three normal distributions to obtain a new distribution as shown below. How can I sample from this distribution in Python?
import matplotlib.pyplot as plt
import scipy.stats as ss
import numpy as np
x = np.linspace(0, 10, 1000)
y1 = [ss.norm.pdf(v, loc=5, scale=1) for v in x]
y2 = [ss.norm.pdf(v, loc=1, scale=1.3) for v in x]
y3 = [ss.norm.pdf(v, loc=9, scale=1.3) for v in x]
y = np.sum([y1, y2, y3], axis=0)/3
plt.plot(x, y, '-')
plt.xlabel('$x$')
plt.ylabel('$P(x)$')
BTW, is there a better way to plot such a probability distribution?
It seems that you're asking two questions: how do I sample from a distribution and how do I plot the PDF?
Assuming you're trying to sample from a mixture distribution of the 3 normals shown in your code, the following code snippet performs this kind of sampling in the naïve, straightforward way as a proof of concept.
Basically, the idea is to:
1. Choose an index i among the indices of the components, i.e. 0, 1, 2, ..., according to their probability weights.
2. Having chosen i, select the corresponding distribution and obtain a sample point from it.
3. Continue from 1 until enough sample points are collected.
However, to plot the PDF, you don't really need a sample in this case, because the theoretical solution is quite easy. In the more general case, the PDF can be approximated by a histogram from the sample.
The code below performs both sampling and PDF-plotting using the theoretical PDF.
import numpy as np
import numpy.random
import scipy.stats as ss
import matplotlib.pyplot as plt
# Set-up.
n = 10000
numpy.random.seed(0x5eed)
# Parameters of the mixture components
norm_params = np.array([[5, 1],
                        [1, 1.3],
                        [9, 1.3]])
n_components = norm_params.shape[0]
# Weight of each component, in this case all of them are 1/3
weights = np.ones(n_components, dtype=np.float64) / 3.0
# A stream of indices from which to choose the component
mixture_idx = numpy.random.choice(len(weights), size=n, replace=True, p=weights)
# y is the mixture sample
y = numpy.fromiter((ss.norm.rvs(*(norm_params[i])) for i in mixture_idx),
                   dtype=np.float64)
# Theoretical PDF plotting -- generate the x and y plotting positions
xs = np.linspace(y.min(), y.max(), 200)
ys = np.zeros_like(xs)
for (l, s), w in zip(norm_params, weights):
    ys += ss.norm.pdf(xs, loc=l, scale=s) * w
plt.plot(xs, ys)
plt.hist(y, density=True, bins="fd")  # 'normed' was removed from matplotlib; use 'density'
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
In order to make the answer of Cong Ma work more general, I slightly modified his code. The weights work now for any number of mixture components.
import numpy as np
import numpy.random
import scipy.stats as ss
import matplotlib.pyplot as plt
# Set-up.
n = 10000
numpy.random.seed(0x5eed)
# Parameters of the mixture components
norm_params = np.array([[5, 1],
                        [1, 1.3],
                        [9, 1.3]])
n_components = norm_params.shape[0]
# Weight of each component, in this case all of them are 1/3
weights = np.ones(n_components, dtype=np.float64) / float(n_components)
# A stream of indices from which to choose the component
mixture_idx = numpy.random.choice(n_components, size=n, replace=True, p=weights)
# y is the mixture sample
y = numpy.fromiter((ss.norm.rvs(*(norm_params[i])) for i in mixture_idx),
                   dtype=np.float64)
# Theoretical PDF plotting -- generate the x and y plotting positions
xs = np.linspace(y.min(), y.max(), 200)
ys = np.zeros_like(xs)
for (l, s), w in zip(norm_params, weights):
    ys += ss.norm.pdf(xs, loc=l, scale=s) * w
plt.plot(xs, ys)
plt.hist(y, density=True, bins="fd")  # 'normed' was removed from matplotlib; use 'density'
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()

Plotting confidence and prediction intervals with repeated entries

I have a correlation plot for two variables, the predictor variable (temperature) on the x-axis, and the response variable (density) on the y-axis. My best fit least squares regression line is a 2nd order polynomial. I would like to also plot confidence and prediction intervals. The method described in this answer seems perfect. However, my dataset (n=2340) has repeated entries for many (x,y) pairs. My resulting plot looks like this:
Here is my relevant code (slightly modified from linked answer above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import summary_table
d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)
x = df.temp
y = df.dens
plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')
# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})
# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)', data=df).fit()
# this correctly plots my single 2nd-order poly best-fit line:
plt.plot(x1.temp, poly_2.predict(x1), 'g-', label='Poly n=2 $R^2$=%.2f' % poly_2.rsquared,
         alpha=0.9)
prstd, iv_l, iv_u = wls_prediction_std(poly_2)
st, data, ss2 = summary_table(poly_2, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
# check we got the right things
print(np.max(np.abs(poly_2.fittedvalues - fittedvalues)))
print(np.max(np.abs(iv_l - predict_ci_low)))
print(np.max(np.abs(iv_u - predict_ci_upp)))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
The print statements evaluate to 0.0, as expected.
However, I need single lines for the polynomial best fit line, and the confidence and prediction intervals (rather than the multiple lines I currently have in my plot). Any ideas?
Update:
Following the first answer from @kpie, I ordered my confidence and prediction interval arrays according to temperature:
data_intervals = {'temp': x, 'predict_low': predict_ci_low, 'predict_upp': predict_ci_upp, 'conf_low': predict_mean_ci_low, 'conf_high': predict_mean_ci_upp}
df_intervals = pd.DataFrame(data=data_intervals)
df_intervals_sort = df_intervals.sort_values(by='temp')  # DataFrame.sort() was removed in newer pandas
This achieved the desired results:
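For reference, a minimal sketch of the plotting calls on the sorted frame (column names as defined above), so each interval comes out as a single smooth line:
# plot the sorted prediction and confidence intervals as single lines
plt.plot(df_intervals_sort.temp, df_intervals_sort.predict_low, 'r--', lw=2)
plt.plot(df_intervals_sort.temp, df_intervals_sort.predict_upp, 'r--', lw=2)
plt.plot(df_intervals_sort.temp, df_intervals_sort.conf_low, 'b--', lw=2)
plt.plot(df_intervals_sort.temp, df_intervals_sort.conf_high, 'b--', lw=2)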
You need to order your predicted values based on temperature, I think.
So to get nice curvy lines you will have to use numpy.polynomial.polynomial.polyfit, which returns a list of coefficients. You will have to split the x and y data into two lists so they fit into the function.
You can then plot this function with:
import numpy as np
import matplotlib.pyplot as plt

def strPolynomialFromArray(coeffs):
    # build a string such as "c0*x**0+c1*x**1+..." from the coefficient list
    return "".join(str(k) + "*x**" + str(n) + "+" for n, k in enumerate(coeffs))[:-1]

# xs, ys are your split data lists and degree is the polynomial order
coeffs = np.polynomial.polynomial.polyfit(xs, ys, degree)
x = np.linspace(-15, 45, 300)  # your smooth line will be made of 300 pieces
y = eval(strPolynomialFromArray(coeffs))  # note: eval, not exec, so the result is returned
plt.plot(x, y)
You can read more about plotting smooth lines here; just remember that all plotted lines are really linear splines (short straight pieces), because a truly continuous curve cannot be rendered.
I believe that the polynomial fitting is done with least squares fitting (process described here)
Good Luck!
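A slightly simpler sketch of the same idea, assuming xs, ys and degree are defined as described above, that evaluates the fitted coefficients with numpy's polyval instead of building a string:
import numpy as np
import matplotlib.pyplot as plt
from numpy.polynomial import polynomial as P

coeffs = P.polyfit(xs, ys, degree)        # least-squares fit, coefficients low order first
x_smooth = np.linspace(min(xs), max(xs), 300)
y_smooth = P.polyval(x_smooth, coeffs)    # evaluate the polynomial on a fine grid
plt.plot(x_smooth, y_smooth)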

Generate a heatmap using a scatter data set

I have a set of X,Y data points (about 10k) that are easy to plot as a scatter plot but that I would like to represent as a heatmap.
I looked through the examples in Matplotlib and they all seem to already start with heatmap cell values to generate the image.
Is there a method that converts a bunch of x, y, all different, to a heatmap (where zones with higher frequency of x, y would be "warmer")?
If you don't want hexagons, you can use numpy's histogram2d function:
import numpy as np
import numpy.random
import matplotlib.pyplot as plt
# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)
heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.clf()
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()
This makes a 50x50 heatmap. If you want, say, 512x384, you can put bins=(512, 384) in the call to histogram2d.
Example:
In Matplotlib lexicon, I think you want a hexbin plot.
If you're not familiar with this type of plot, it's just a bivariate histogram in which the xy-plane is tessellated by a regular grid of hexagons.
So, from the histogram point of view, you just discretize the plotting region into a set of hexagonal windows, assign each point to one of these windows, count the number of points falling in each hexagon, and finally map the counts onto a color array, and you've got a hexbin diagram.
Though less commonly used than, e.g., circles or squares, it is intuitive that hexagons are a better choice for the geometry of the binning container:
hexagons have nearest-neighbor symmetry (square bins don't: the distance from a point on a square's border to a point inside that square is not everywhere equal), and
the hexagon is the highest-n polygon that gives a regular plane tessellation (i.e., you can safely re-model your kitchen floor with hexagonal tiles because you won't have any void space between the tiles when you are finished, which is not true for other higher-n, n >= 7, polygons).
(Matplotlib uses the term hexbin plot; so do, AFAIK, all of the plotting libraries for R; still I don't know if this is the generally accepted term for plots of this type, though I suspect it's likely, given that hexbin is short for hexagonal binning, which describes the essential step in preparing the data for display.)
from matplotlib import pyplot as PLT
from matplotlib import cm as CM
from scipy.stats import multivariate_normal
import numpy as NP

x = y = NP.linspace(-5, 5, 100)
X, Y = NP.meshgrid(x, y)
# matplotlib.mlab.bivariate_normal has been removed from recent matplotlib releases;
# scipy.stats.multivariate_normal gives the same 2D gaussian PDFs (the sigmas become
# variances on the covariance diagonal).
Z1 = multivariate_normal(mean=[0, 0], cov=[[2**2, 0], [0, 2**2]]).pdf(NP.dstack((X, Y)))
Z2 = multivariate_normal(mean=[1, 1], cov=[[4**2, 0], [0, 1**2]]).pdf(NP.dstack((X, Y)))
ZD = Z2 - Z1

x = X.ravel()
y = Y.ravel()
z = ZD.ravel()

gridsize = 30
PLT.subplot(111)
# if 'bins=None', the color of each hexagon corresponds directly to its count.
# 'C' is optional--it maps values to x-y coordinates; if 'C' is None (default) then
# the result is a pure 2D histogram.
PLT.hexbin(x, y, C=z, gridsize=gridsize, cmap=CM.jet, bins=None)
PLT.axis([x.min(), x.max(), y.min(), y.max()])
cb = PLT.colorbar()
cb.set_label('mean value')
PLT.show()
Edit: For a better approximation of Alejandro's answer, see below.
I know this is an old question, but wanted to add something to Alejandro's answer: If you want a nice smoothed image without using py-sphviewer, you can instead use np.histogram2d and apply a gaussian filter (from scipy.ndimage) to the heatmap:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.ndimage import gaussian_filter  # scipy.ndimage.filters is deprecated; import from scipy.ndimage

def myplot(x, y, s, bins=1000):
    heatmap, xedges, yedges = np.histogram2d(x, y, bins=bins)
    heatmap = gaussian_filter(heatmap, sigma=s)
    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    return heatmap.T, extent

fig, axs = plt.subplots(2, 2)

# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)

sigmas = [0, 16, 32, 64]
for ax, s in zip(axs.flatten(), sigmas):
    if s == 0:
        ax.plot(x, y, 'k.', markersize=5)
        ax.set_title("Scatter plot")
    else:
        img, extent = myplot(x, y, s)
        ax.imshow(img, extent=extent, origin='lower', cmap=cm.jet)
        ax.set_title(r"Smoothing with $\sigma$ = %d" % s)
plt.show()
Produces:
The scatter plot and s=16 plotted on top of each other for Agape Gal'lo (click for better view):
One difference I noticed between my gaussian filter approach and Alejandro's approach was that his method shows local structures much better than mine. Therefore I implemented a simple nearest-neighbour method at the pixel level. This method calculates, for each pixel, the inverse sum of the distances to the n closest points in the data. At high resolution it is pretty computationally expensive, and I think there's a quicker way, so let me know if you have any improvements.
Update: As I suspected, there's a much faster method using Scipy's scipy.cKDTree. See Gabriel's answer for the implementation.
Anyway, here's my code:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
def data_coord2view_coord(p, vlen, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * vlen
    return dv

def nearest_neighbours(xs, ys, reso, n_neighbours):
    im = np.zeros([reso, reso])
    extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]
    xv = data_coord2view_coord(xs, reso, extent[0], extent[1])
    yv = data_coord2view_coord(ys, reso, extent[2], extent[3])
    for x in range(reso):
        for y in range(reso):
            xp = (xv - x)
            yp = (yv - y)
            d = np.sqrt(xp**2 + yp**2)
            im[y][x] = 1 / np.sum(d[np.argpartition(d.ravel(), n_neighbours)[:n_neighbours]])
    return im, extent

n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)
resolution = 250

fig, axes = plt.subplots(2, 2)
for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 64]):
    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=2)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:
        im, extent = nearest_neighbours(xs, ys, resolution, neighbours)
        ax.imshow(im, origin='lower', extent=extent, cmap=cm.jet)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])
plt.show()
Result:
Instead of using np.hist2d, which in general produces quite ugly histograms, I would like to recycle py-sphviewer, a Python package for rendering particle simulations with an adaptive smoothing kernel, which can easily be installed from pip (see the webpage documentation). Consider the following code, which is based on the example:
import numpy as np
import numpy.random
import matplotlib.pyplot as plt
import sphviewer as sph
def myplot(x, y, nb=32, xsize=500, ysize=500):
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)

    x0 = (xmin + xmax) / 2.
    y0 = (ymin + ymax) / 2.

    pos = np.zeros([len(x), 3])
    pos[:, 0] = x
    pos[:, 1] = y
    w = np.ones(len(x))

    P = sph.Particles(pos, w, nb=nb)
    S = sph.Scene(P)
    S.update_camera(r='infinity', x=x0, y=y0, z=0,
                    xsize=xsize, ysize=ysize)
    R = sph.Render(S)
    R.set_logscale()
    img = R.get_image()
    extent = R.get_extent()
    for i, j in zip(range(4), [x0, x0, y0, y0]):
        extent[i] += j
    print(extent)
    return img, extent
fig = plt.figure(1, figsize=(10,10))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)
# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)
#Plotting a regular scatter plot
ax1.plot(x,y,'k.', markersize=5)
ax1.set_xlim(-3,3)
ax1.set_ylim(-3,3)
heatmap_16, extent_16 = myplot(x,y, nb=16)
heatmap_32, extent_32 = myplot(x,y, nb=32)
heatmap_64, extent_64 = myplot(x,y, nb=64)
ax2.imshow(heatmap_16, extent=extent_16, origin='lower', aspect='auto')
ax2.set_title("Smoothing over 16 neighbors")
ax3.imshow(heatmap_32, extent=extent_32, origin='lower', aspect='auto')
ax3.set_title("Smoothing over 32 neighbors")
#Make the heatmap using a smoothing over 64 neighbors
ax4.imshow(heatmap_64, extent=extent_64, origin='lower', aspect='auto')
ax4.set_title("Smoothing over 64 neighbors")
plt.show()
which produces the following image:
As you see, the images look pretty nice, and we are able to identify different substructures in them. These images are constructed by spreading a given weight for every point within a certain domain, defined by the smoothing length, which in turn is given by the distance to the nb-th closest neighbor (I've chosen 16, 32 and 64 for the examples). So, higher-density regions are typically spread over smaller areas compared to lower-density regions.
The function myplot is just a very simple function that I've written in order to give the x,y data to py-sphviewer to do the magic.
If you are using matplotlib 1.2.x or newer, you can simply use plt.hist2d:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(100000)
y = np.random.randn(100000)
plt.hist2d(x,y,bins=100)
plt.show()
Seaborn now has the jointplot function which should work nicely here:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)
sns.jointplot(x=x, y=y, kind='hex')
plt.show()
Here's Jurgy's great nearest neighbour approach but implemented using scipy.cKDTree. In my tests it's about 100x faster.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.spatial import cKDTree
def data_coord2view_coord(p, resolution, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * resolution
    return dv

n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)
resolution = 250

extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]
xv = data_coord2view_coord(xs, resolution, extent[0], extent[1])
yv = data_coord2view_coord(ys, resolution, extent[2], extent[3])

def kNN2DDens(xv, yv, resolution, neighbours, dim=2):
    # Create the tree
    tree = cKDTree(np.array([xv, yv]).T)
    # Find the closest nnmax-1 neighbors (first entry is the point itself)
    grid = np.mgrid[0:resolution, 0:resolution].T.reshape(resolution**2, dim)
    dists = tree.query(grid, neighbours)
    # Inverse of the sum of distances to each grid point.
    inv_sum_dists = 1. / dists[0].sum(1)
    # Reshape
    im = inv_sum_dists.reshape(resolution, resolution)
    return im

fig, axes = plt.subplots(2, 2, figsize=(15, 15))
for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 63]):
    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=5)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:
        im = kNN2DDens(xv, yv, resolution, neighbours)
        ax.imshow(im, origin='lower', extent=extent, cmap=cm.Blues)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])
plt.savefig('new.png', dpi=150, bbox_inches='tight')
and the initial question was... how to convert scatter values to grid values, right?
histogram2d counts the frequency per cell; however, if you have other data per cell than just the frequency, you need some additional work.
x = data_x # between -10 and 4, log-gamma of an svc
y = data_y # between -4 and 11, log-C of an svc
z = data_z #between 0 and 0.78, f1-values from a difficult dataset
So, I have a dataset with Z-results for X and Y coordinates. However, I calculated only a few points outside the area of interest (large gaps), and heaps of points in a small area of interest.
Yes, here it becomes more difficult but also more fun. Some libraries (sorry):
from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
from scipy.interpolate import griddata
pyplot is my graphic engine today,
cm is a range of color maps with some interesting choices,
numpy for the calculations,
and griddata for attaching values to a fixed grid.
The last one is especially important because the frequency of xy points is not equally distributed in my data. First, let's start with some boundaries fitting my data and an arbitrary grid size. The original data has datapoints also outside those x and y boundaries.
#determine grid boundaries
gridsize = 500
x_min = -8
x_max = 2.5
y_min = -2
y_max = 7
So we have defined a grid with 500 pixels between the min and max values of x and y.
In my data, there are lots more than the 500 values available in the area of high interest, whereas in the low-interest area there are not even 200 values in the total grid; between the graphic boundaries of x_min and x_max there are even fewer.
So for getting a nice picture, the task is to get an average for the high interest values and to fill the gaps elsewhere.
I define my grid now. For each xx-yy pair, I want to have a color.
xx = np.linspace(x_min, x_max, gridsize) # array of x values
yy = np.linspace(y_min, y_max, gridsize) # array of y values
grid = np.array(np.meshgrid(xx, yy.T))
grid = grid.reshape(2, grid.shape[1]*grid.shape[2]).T
Why the strange shape? scipy.griddata wants a shape of (n, D).
Griddata calculates one value per point in the grid, by a predefined method.
I choose "nearest" - empty grid points will be filled with values from the nearest neighbor. This looks as if the areas with less information have bigger cells (even if it is not the case). One could choose to interpolate "linear", then areas with less information look less sharp. Matter of taste, really.
points = np.array([x, y]).T # because griddata wants it that way
z_grid2 = griddata(points, z, grid, method='nearest')
# you get a 1D vector as result. Reshape to picture format!
z_grid2 = z_grid2.reshape(xx.shape[0], yy.shape[0])
And hop, we hand over to matplotlib to display the plot
fig = plt.figure(1, figsize=(10, 10))
ax1 = fig.add_subplot(111)
ax1.imshow(z_grid2, extent=[x_min, x_max, y_min, y_max],
           origin='lower', cmap=cm.magma)
ax1.set_title("SVC: empty spots filled by nearest neighbours")
ax1.set_xlabel('log gamma')
ax1.set_ylabel('log C')
plt.show()
Around the pointy part of the V-Shape, you see I did a lot of calculations during my search for the sweet spot, whereas the less interesting parts almost everywhere else have a lower resolution.
Make a 2-dimensional array that corresponds to the cells in your final image, called say heatmap_cells and instantiate it as all zeroes.
Choose two scaling factors that define the difference between each array element in real units, for each dimension, say x_scale and y_scale. Choose these such that all your datapoints will fall within the bounds of the heatmap array.
For each raw datapoint with x_value and y_value:
heatmap_cells[floor(x_value/x_scale),floor(y_value/y_scale)]+=1
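A minimal NumPy sketch of that recipe (the data, cell sizes, and the shift by the minimum are illustrative assumptions):
import numpy as np
import matplotlib.pyplot as plt

# illustrative data and cell sizes (assumed values)
x = np.random.randn(10000)
y = np.random.randn(10000)
x_scale = y_scale = 0.1  # real-unit width of one heatmap cell in each dimension

# shift to the origin so every point falls inside the array, then bin
x_idx = np.floor((x - x.min()) / x_scale).astype(int)
y_idx = np.floor((y - y.min()) / y_scale).astype(int)
heatmap_cells = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
np.add.at(heatmap_cells, (x_idx, y_idx), 1)  # += 1 for each raw datapoint

plt.imshow(heatmap_cells.T, origin='lower')
plt.show()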
Very similar to @Piti's answer, but using 1 call instead of 2 to generate the points:
import numpy as np
import matplotlib.pyplot as plt
pts = 1000000
mean = [0.0, 0.0]
cov = [[1.0,0.0],[0.0,1.0]]
x,y = np.random.multivariate_normal(mean, cov, pts).T
plt.hist2d(x, y, bins=50, cmap=plt.cm.jet)
plt.show()
Output:
Here's one I made on a 1 Million point set with 3 categories (colored Red, Green, and Blue). Here's a link to the repository if you'd like to try the function. Github Repo
histplot(
    X,
    Y,
    labels,
    bins=2000,
    range=((-3, 3), (-3, 3)),
    normalize_each_label=True,
    colors=[
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]],
    gain=50)
I'm afraid I'm a little late to the party, but I had a similar question a while ago. The accepted answer (by @ptomato) helped me out, but I'd also like to post this in case it's of use to someone.
''' I wanted to create a heatmap resembling a football pitch which would show the different actions performed '''
import numpy as np
import matplotlib.pyplot as plt
import random
#fixing random state for reproducibility
np.random.seed(1234324)
fig = plt.figure(12)
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Ratio of the pitch with respect to UEFA standards
hmap= np.full((6, 10), 0)
#print(hmap)
xlist = np.random.uniform(low=0.0, high=100.0, size=(20))
ylist = np.random.uniform(low=0.0, high =100.0, size =(20))
#UEFA Pitch Standards are 105m x 68m
xlist = (xlist/100)*10.5
ylist = (ylist/100)*6.5
ax1.scatter(xlist,ylist)
#int of the co-ordinates to populate the array
xlist_int = xlist.astype(int)
ylist_int = ylist.astype(int)
#print(xlist_int, ylist_int)
for i, j in zip(xlist_int, ylist_int):
    # this populates the array according to the x,y co-ordinate values it encounters
    hmap[j][i] = hmap[j][i] + 1
#Reversing the rows is necessary
hmap = hmap[::-1]
#print(hmap)
im = ax2.imshow(hmap)
Here's the result
None of these solutions worked for my application, so this is what I came up with. Essentially I am placing a 2D Gaussian at every single point:
import cv2
import numpy as np
import matplotlib.pyplot as plt

def getGaussian2D(ksize, sigma, norm=True):
    oneD = cv2.getGaussianKernel(ksize=ksize, sigma=sigma)
    twoD = np.outer(oneD.T, oneD)
    return twoD / np.sum(twoD) if norm else twoD

def pts2heat(pts, shape, kernel=16, sigma=5):
    heat = np.zeros(shape)
    k = getGaussian2D(kernel, sigma)
    for y, x in pts:
        x, y = int(x), int(y)
        for i in range(-kernel//2, kernel//2):
            for j in range(-kernel//2, kernel//2):
                if 0 <= x+i < shape[0] and 0 <= y+j < shape[1]:
                    heat[x+i, y+j] = heat[x+i, y+j] + k[i+kernel//2, j+kernel//2]
    return heat

# pts holds the (y, x) point coordinates and img is the associated image
heat = pts2heat(pts, img.shape[:2])
plt.imshow(heat, cmap='hot')  # 'heat' is not a matplotlib colormap; 'hot' is the closest built-in
Here are the points overlaid on top of the associated image, along with the resulting heat map:
