Adding hexbin plots together - python

I have four hexbin plots which have all been normalized. How do I add them together to make one big distribution?
I have tried concatenating the input vectors and then creating the hexbin plot, but this throws off the normalization of the individual distributions:
So how do I add the individual hexbin distributions whilst still maintaining the individual normalization?
The relevant part of my code is:
def hex_plot(x, y, max_v):
    # The sigma bounds (2.0 keeps the 3-sigma exponent at -4.5 under Python 2 as well)
    bounds = [0, max_v*m.exp(-(3**2)/2.0), max_v*m.exp(-2), max_v*m.exp(-0.5), max_v]
    norm = mpl.colors.BoundaryNorm(bounds, ncolors=4)
    hex_ = plt.hexbin(x, y, C=None, gridsize=gridsize, reduce_C_function=np.mean, cmap=cmap, mincnt=1, norm=norm)
    print("Hex plot max: ", hex_.norm.vmax)
    return hex_

cmap = mpl.colors.ListedColormap(['grey', '#6A92D4', '#1049A9', '#052C6E'])
Thank you.

I've written a bit of code that does what you're after. From the snippet in your question, it looks like you already know the height (max_v) of your distribution given your binning scheme, so I worked under that assumption. Depending on the data you're applying this to, this might not actually be the case, in which case the following will fail (it's only as good as your guess/knowledge of the height of the distributions). For the purposes of my example data, I've just taken a reasonable guess (based on a quick plot) for the values of max_v1 and max_v2. Switching the c1 and c2 I've defined for the commented versions should reproduce your original problem.
import math
import numpy as np
import matplotlib.pyplot as pyplot
import matplotlib.colors

# need to know the height of the distributions a priori
max_v1 = 850.0  # approximate height of distribution 1 (defined below) with binning defined below
max_v2 = 400.0  # approximate height of distribution 2 (defined below) with binning defined below
max_v = max(max_v1, max_v2)

# make 2 differently sized datasets (so will require different normalizations)
# all normal distributions with assorted means/variances
x1 = np.random.randn(50000)/6.0 + 0.5
y1 = np.random.randn(50000)/3.0 + 0.5
x2 = np.random.randn(100000)/2.0 - 0.5
y2 = np.random.randn(100000)/2.0 - 0.5
#c1 = np.ones(len(x1))  # I don't assign meaningful weights here
#c2 = np.ones(len(x2))  # I don't assign meaningful weights here
c1 = np.ones(len(x1))*(max_v/max_v1)  # highest distribution: no net change in normalization here
c2 = np.ones(len(x2))*(max_v/max_v2)  # renormalized to same height as highest distribution

# define plot boundaries
# (the original values are missing; these are assumed so that they cover the example data)
xmin, xmax, ymin, ymax = -2.5, 2.0, -2.5, 2.0

# custom colormap
cmap = matplotlib.colors.ListedColormap(['grey', '#6A92D4', '#1049A9', '#052C6E'])
# the bounds of 1sigma, 2sigma, etc. regions
bounds = [0, max_v*math.exp(-(3**2)/2.0), max_v*math.exp(-2), max_v*math.exp(-0.5), max_v]
norm = matplotlib.colors.BoundaryNorm(bounds, ncolors=4)

# make the hexbin plot
hexplot = pyplot.subplot(111)
# combine distributions and weights
pyplot.hexbin(np.concatenate((x1, x2)), np.concatenate((y1, y2)), C=np.concatenate((c1, c2)), cmap=cmap, mincnt=1, extent=(xmin, xmax, ymin, ymax), gridsize=50, reduce_C_function=np.sum, norm=norm)
cax = pyplot.axes([0.86, 0.1, 0.03, 0.85])
cb = pyplot.colorbar(cax=cax)
cax.set_yticklabels([' ', '3', '2', '1', ' '])
pyplot.subplots_adjust(wspace=0, hspace=0, bottom=0.1, right=0.78, top=0.95, left=0.12)
Here's the result without the fix (commented c1 and c2 used),
and the result with the fix (code as-is);
Hope that helps.
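The reweighting idea can also be checked numerically without drawing anything. The following is only a sketch under assumptions that are not in the answer above: it uses np.histogram2d on a square grid as a stand-in for hexbin, synthetic data, and it measures each peak height from the data instead of guessing it. The names edges, peak1, peak2 and scale1, scale2 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic datasets of different size (assumed example data).
x1, y1 = rng.normal(0.5, 0.2, 5000), rng.normal(0.5, 0.3, 5000)
x2, y2 = rng.normal(-0.5, 0.5, 10000), rng.normal(-0.5, 0.5, 10000)

# A shared binning is required so that the bins of both datasets line up.
edges = np.linspace(-3.0, 3.0, 41)

# Measure the peak height of each distribution on that binning.
H1, _, _ = np.histogram2d(x1, y1, bins=[edges, edges])
H2, _, _ = np.histogram2d(x2, y2, bins=[edges, edges])
peak1, peak2 = H1.max(), H2.max()
peak = max(peak1, peak2)

# Per-point weights that bring every distribution to the same peak height.
scale1, scale2 = peak / peak1, peak / peak2
w1 = np.full(len(x1), scale1)
w2 = np.full(len(x2), scale2)

# Histogram of the concatenated, weighted data.
H_sum, _, _ = np.histogram2d(
    np.concatenate((x1, x2)),
    np.concatenate((y1, y2)),
    bins=[edges, edges],
    weights=np.concatenate((w1, w2)),
)

# The combined histogram is the sum of the individually rescaled histograms.
print(np.allclose(H_sum, scale1 * H1 + scale2 * H2))
```

Where the two distributions overlap, the summed bin values can exceed the common peak height, so the sigma bounds only keep their meaning in regions dominated by a single distribution.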


Is there a way I can find the range of local maxima of histogram?

I'm wondering if there's a way I can find the range of local maxima of a histogram. For instance, suppose I have the following histogram (just ignore the orange curve):
The histogram is actually obtained from a dictionary. I'm hoping to find the range of local maxima of this histogram (on the horizontal axis), which are, say, 1.3-1.6, and 2.1-2.4 in this case. I have no idea which tools would be helpful or which techniques I may want to use. I know there's a tool to find local maxima of a 1-D array:
from scipy.signal import argrelextrema
x = np.random.random(12)
argrelextrema(x, np.greater)
but I don't think it would work here, since I'm looking for a range and there are some 'wiggles' in the histogram. Can anyone give me some suggestions or examples of how I can obtain the ranges I'm looking for? Thanks a lot for the help.
PS: I'm trying not to simply search for the ranges of x whose y values are above a certain limit :)
I don't know if I correctly understand what you want to do, but you can treat the histogram as a Probability Density Function (PDF) of a bimodal distribution, then find the modes and the Highest Density Intervals (HDIs) around the two modes.
So, I create some sample data
import numpy as np
import pandas as pd
import scipy.stats as sps
from scipy.signal import find_peaks, argrelextrema
import matplotlib.pyplot as plt
d1 = sps.norm(loc=1.3, scale=.2)
d2 = sps.norm(loc=2.2, scale=.3)
r1 = d1.rvs(size=5000, random_state=1)
r2 = d2.rvs(size=5000, random_state=1)
r = np.concatenate((r1, r2))
h = plt.hist(r, bins=100, density=True);
We only have h, the result of the hist function, which contains the densities (100 values) and the bin edges (101 values).
So we first need to compute the midpoint of each bin
density = h[0]
values = h[1][:-1] + np.diff(h[1])[0] / 2
plt.hist(r, bins=100, density=True, alpha=.25)
plt.plot(values, density);
Now we can normalize the PDF (to sum to 1) and smooth the data with a moving average, which we'll use only to locate the peaks (maxima) and minima
norm_density = density / density.sum()
norm_density_ma = pd.Series(norm_density).rolling(7, center=True).mean().values
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density);
Now we can obtain indexes of maxima
peaks = find_peaks(norm_density_ma)[0]
array([24, 57])
and minima
minima = argrelextrema(norm_density_ma, np.less)[0]
and check they're correct
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
for peak in peaks:
    plt.axvline(values[peak], color='r')
for minimum in minima:
    plt.axvline(values[minimum], color='k', ls='--');
Finally, we have to find out the HDIs around the two modes (peaks) from the normalized h histogram data. We can use a simple function to get the HDI of grid (see HDI_of_grid for details and Doing Bayesian Data Analysis by John K. Kruschke)
def HDI_of_grid(probMassVec, credMass=0.95):
    sortedProbMass = np.sort(probMassVec, axis=None)[::-1]
    HDIheightIdx = np.min(np.where(np.cumsum(sortedProbMass) >= credMass))
    HDIheight = sortedProbMass[HDIheightIdx]
    HDImass = np.sum(probMassVec[probMassVec >= HDIheight])
    idx = np.where(probMassVec >= HDIheight)[0]
    return {'indexes':idx, 'mass':HDImass, 'height':HDIheight}
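To see what this function returns, it may help to run it on a tiny hand-made probability vector. The sketch below re-implements the same logic under the hypothetical name hdi_of_grid so that it is self-contained; the vector toy and the credible mass of 0.6 are made-up illustration values.

```python
import numpy as np

def hdi_of_grid(prob_mass, cred_mass=0.95):
    """Same logic as HDI_of_grid above: keep the highest bins until cred_mass is reached."""
    sorted_mass = np.sort(prob_mass, axis=None)[::-1]
    # Index of the first sorted bin at which the cumulative mass reaches cred_mass.
    height_idx = np.min(np.where(np.cumsum(sorted_mass) >= cred_mass))
    height = sorted_mass[height_idx]
    # All bins at least as high as that threshold belong to the HDI.
    idx = np.where(prob_mass >= height)[0]
    return {'indexes': idx, 'mass': prob_mass[idx].sum(), 'height': height}

# Toy unimodal probability vector that sums to 1.
toy = np.array([0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05])
hdi = hdi_of_grid(toy, cred_mass=0.6)
print(hdi['indexes'], hdi['height'], hdi['mass'])
```

The selected bins are simply those above a height threshold, so for multimodal data they need not be contiguous. That is why the second mode is handled on a slice of the density further down.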
Let's say we want the HDIs to contain a mass of 0.3
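# HDI around the 1st mode
hdi1 = HDI_of_grid(norm_density, credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
# shade the HDI region (the fill_between call is an assumption; only its arguments survive in the source)
plt.fill_between(values[hdi1['indexes']],
                 0, norm_density[hdi1['indexes']])
for peak in peaks:
    plt.axvline(values[peak], color='r')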
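For the 2nd mode, we'll compute the HDI starting from the minimum, to exclude the 1st mode
# HDI around the 2nd mode
hdi2 = HDI_of_grid(norm_density[minima[0]:], credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
# shade both HDI regions (the fill_between calls are an assumption; only their arguments survive in the source)
plt.fill_between(values[hdi1['indexes']],
                 0, norm_density[hdi1['indexes']])
plt.fill_between(values[hdi2['indexes']+minima],
                 0, norm_density[hdi2['indexes']+minima])
for peak in peaks:
    plt.axvline(values[peak], color='r')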
And we have the values of the two HDIs
# 1st mode
# 0.3 HDI
values[hdi1['indexes']].take([0, -1])
array([1.12857599, 1.45715851])
# 2nd mode
# 0.3 HDI
values[hdi2['indexes']+minima].take([0, -1])
array([1.95003229, 2.47028795])

Scaling Lognormal Fit

I have two arrays with x- and y- data.
This data shows lognormal behavior. I need a graph of the fit, as well as the mu and the sigma to do some statistics.
I did a fit, in order to calculate the mu, the sigma, and further on some statistical values of it. (See code below)
I obtain the scaling factor, by which I have to multiply the distribution, from an integral over the data points.
The code below does work. My question is whether there is a better way to do this (I am sure there is). It feels like a workaround that will only work sometimes, and I have to plot hundreds of these.
My code (sorry that it is this long; I wanted to include everything except the import of the raw data):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# produce plot True/False
ploton = True
x0=np.array([3.58381e+01, 3.27125e+01, 2.98680e+01, 2.72888e+01, 2.49364e+01,
2.27933e+01, 2.08366e+01, 1.90563e+01, 1.74380e+01, 1.59550e+01,
1.45904e+01, 1.33460e+01, 1.22096e+01, 1.11733e+01, 1.02262e+01,
9.35893e+00, 8.56556e+00, 7.86688e+00, 7.20265e+00, 6.59782e+00,
6.01571e+00, 5.53207e+00, 5.03979e+00, 4.64415e+00, 4.19920e+00,
3.83595e+00, 3.50393e+00, 3.28070e+00, 3.00930e+00, 2.75634e+00,
2.52050e+00, 2.31349e+00, 2.12280e+00, 1.92642e+00, 1.77820e+00,
1.61692e+00, 1.49094e+00, 1.36233e+00, 1.22935e+00, 1.14177e+00,
1.03078e+00, 9.39603e-01, 8.78425e-01, 1.01490e+00, 1.07461e-01,
4.81523e-02, 4.81523e-02, 1.00000e-02, 1.00000e-02])
y0=np.array([3.94604811e+04, 2.78223936e+04, 1.95979179e+04, 2.14447807e+04,
1.68677487e+04, 1.79429516e+04, 1.73589776e+04, 2.16101026e+04,
3.79705638e+04, 6.83622301e+04, 1.73687772e+05, 5.74854475e+05,
1.69497465e+06, 3.79135941e+06, 7.76757753e+06, 1.33429094e+07,
1.96096415e+07, 2.50403065e+07, 2.72818618e+07, 2.53120387e+07,
1.93102362e+07, 1.22219224e+07, 4.96725699e+06, 1.61174658e+06,
3.19352386e+05, 1.80305856e+05, 1.41728002e+05, 1.66191809e+05,
1.33223816e+05, 1.31384905e+05, 2.49100945e+05, 2.28300583e+05,
3.01063903e+05, 1.84271914e+05, 1.26412781e+05, 8.57488083e+04,
1.35536571e+05, 4.50076293e+04, 1.98080100e+05, 2.27630303e+05,
1.89484527e+05, 0.00000000e+00, 1.36543525e+05, 2.20677520e+05,
3.60100586e+05, 1.62676486e+05, 1.90105093e+04, 9.27461467e+05,
Dnm = x0
dndlndp = y0
#lognormal PDF:
def f(x, mu, sigma):
    return 1/(np.sqrt(2*np.pi)*sigma*x)*np.exp(-((np.log(x)-mu)**2)/(2*sigma**2))
#normalizing y-values to obtain lognormal distributed data:
y0_normalized = y0/np.trapz(x0.ravel(), y0.ravel())
#calculating mu/sigma of this distribution:
params, extras = curve_fit(f, x0.ravel(), y0_normalized.ravel())
median = np.exp(params[0])
mu = params[0]
sigma = params[1]
#output of mu / sigma / calculated median:
print("mu=%g, sigma=%g" % (params[0], params[1]))
print("median=%g" % median)
#new variable z for smooth fit-curve:
z = np.linspace(0.1, 100, 10000)
Dnm = np.ravel(Dnm)
dndlndp = np.ravel(dndlndp)
Dnm_rev = list(reversed(Dnm))
dndlndp_rev = list(reversed(dndlndp))
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
if ploton:
    plt.plot(z, f(z, mu, sigma)*scalingfactor, label="fit", color="red")
    plt.scatter(x0, y0, label="data")
EDIT1: Maybe I should add that I have no idea why the scaling factor calculated with
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
is right. It was simply trial and error. I really want to know why this does the trick, since the "area" of all bins combined is:
N = np.trapz(dndlndp_rev, np.log(Dnm_rev), dx = np.log(Dnm_rev))
because the width of the bins is log(Dnm).
EDIT2: Thank you for all the answers. I copied the arrays into the code, which is now runnable. I want to simplify the question, since I think that, due to my poor English, I was not able to say what I really want:
I have a lognormal set of data. The code above allows me to calculate mu and sigma. To do so, I need to normalize the data, so that the area under the function is 1 from then on.
In order to plot a lognormal function with the calculated mu and sigma, I need to multiply the function by an (unknown) factor, because the area under the real function is something like 1e8, certainly not one. My workaround is to calculate this "scalingfactor" via the trapz integral of the discrete raw data.
There has to be a better way to plot the fitted function when mu and sigma are already known.
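One possible way to avoid the separate normalization and rescaling steps is to make the scale a third fit parameter, so that curve_fit returns mu, sigma and the factor together. The sketch below is only an illustration under assumptions: it runs on synthetic, noise-free data rather than the arrays above, and the names scaled_lognormal, amplitude, true_params and area_guess are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaled_lognormal(x, amplitude, mu, sigma):
    """Lognormal PDF multiplied by a free amplitude (the area under the curve)."""
    pdf = np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma * x)
    return amplitude * pdf

# Synthetic, noise-free stand-in for the measured data (made-up parameter values).
true_params = (2.0e8, 2.0, 0.25)
x = np.linspace(0.5, 40.0, 200)
y = scaled_lognormal(x, *true_params)

# Starting values: area from a trapezoid sum, mu from the peak position, a rough sigma.
area_guess = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))
p0 = (area_guess, np.log(x[np.argmax(y)]), 0.5)

popt, pcov = curve_fit(scaled_lognormal, x, y, p0=p0)
amplitude, mu, sigma = popt
print("amplitude=%g, mu=%g, sigma=%g" % (amplitude, mu, sigma))
```

With this approach the fitted curve can be plotted directly as scaled_lognormal(z, *popt). Whether it behaves well on real data with a noisy background, such as the arrays in the question, would have to be checked case by case.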

matplotlib: plot hist2d piecewise

I would like to plot a large sample stored in the arrays a and b with matplotlib's hist2d feature. However, generating H, xedges, yedges, img does not work directly for this data, as it uses too much memory. It works for half the number of samples, though, so I would like to do something like
H_1, xedges_1, yedges_1, img_1 = plt.hist2d(a[:len(a)//2], b[:len(b)//2], bins = 10)
followed by
H_2, xedges_2, yedges_2, img_2 = plt.hist2d(a[len(a)//2:], b[len(b)//2:], bins = 10)
while perhaps deleting the first half of the arrays after calculating the first set of variables. Is there a way to merge these two sets of variables and generate a combined plot for the data?
If (and only if!) you specify the bin edges manually, then your histograms will be compatible. You can simply add the occurrences in each bin for both subsets, and you'll recover the full histogram:
import numpy as np
import matplotlib.pyplot as plt
H_1, xedges_1, yedges_1, img_1 = plt.hist2d(a[:len(a)//2], b[:len(b)//2], bins = np.linspace(binmin,binmax,10+1))
H_2, xedges_2, yedges_2, img_2 = plt.hist2d(a[len(a)//2:], b[len(b)//2:], bins = np.linspace(binmin,binmax,10+1))
H_3, xedges_3, yedges_3, img_3 = plt.hist2d(a, b, bins = np.linspace(binmin,binmax,10+1))
In [150]: (H_1+H_2==H_3).all()
Out[150]: True
Which you can easily plot using plt.pcolor. That's what hist2d seems to use, albeit with an additional transpose of the data:
img_3 (left) vs (H_1+H_2).T (right):
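If memory is the main concern, the counting can be done without any plotting by accumulating np.histogram2d over chunks. This is a minimal sketch under assumptions: the data are synthetic, and the helper name chunked_hist2d, the chunk size and the bin range are illustrative choices, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=10_000)  # synthetic stand-ins for the large arrays
b = rng.normal(size=10_000)

# Fixed bin edges shared by every chunk; this is what makes the counts addable.
edges = np.linspace(-4.0, 4.0, 10 + 1)

def chunked_hist2d(x, y, edges, chunk_size=2_500):
    """Accumulate 2D histogram counts chunk by chunk on a fixed grid."""
    total = np.zeros((len(edges) - 1, len(edges) - 1))
    for start in range(0, len(x), chunk_size):
        stop = start + chunk_size
        counts, _, _ = np.histogram2d(x[start:stop], y[start:stop], bins=[edges, edges])
        total += counts
    return total

H_chunks = chunked_hist2d(a, b, edges)

# Reference: histogram of the full arrays in one call.
H_full, _, _ = np.histogram2d(a, b, bins=[edges, edges])
print(np.array_equal(H_chunks, H_full))
```

The accumulated array can then be drawn once, for example with plt.pcolormesh(edges, edges, H_chunks.T).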

Relation between 2D KDE bandwidth in sklearn vs bandwidth in scipy

I'm attempting to compare the performance of sklearn.neighbors.KernelDensity versus scipy.stats.gaussian_kde for a two dimensional array.
From this article I see that the bandwidths (bw) are treated differently in each function. The article gives a recipe for setting the bw in scipy so that it is equivalent to the one used in sklearn: it divides the bw by the sample standard deviation. The result is this:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
where x is the sample array I'm using to obtain the KDE. This works just fine in 1D, but I can't make it work in 2D.
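The 1D recipe can be verified numerically. In the sketch below, a hand-written Gaussian KDE with a fixed bandwidth stands in for sklearn's KernelDensity (an assumption made so the example only needs numpy and scipy); the helper name fixed_bandwidth_kde and the data are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0.2, 0.2, size=500)  # synthetic 1D sample
h = 0.15                            # bandwidth in data units (the sklearn convention)

# scipy multiplies the sample standard deviation by the scalar factor,
# so dividing h by the standard deviation yields a kernel width of exactly h.
kde_scipy = stats.gaussian_kde(x, bw_method=h / x.std(ddof=1))

def fixed_bandwidth_kde(points, data, bandwidth):
    """Plain Gaussian KDE whose kernel standard deviation is `bandwidth`."""
    z = (points[:, None] - data[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(data) * bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(-0.5, 1.0, 50)
dens_scipy = kde_scipy(grid)
dens_manual = fixed_bandwidth_kde(grid, x, h)
print(np.allclose(dens_scipy, dens_manual))
```

In more than one dimension, gaussian_kde scales the full data covariance matrix by the squared factor, so a single scalar can only reproduce an isotropic fixed bandwidth approximately. That would be consistent with small residual differences in the 2D case.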
Here's a MWE of what I got:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
# Generate random data.
n = 1000
m1, m2 = np.random.normal(0.2, 0.2, size=n), np.random.normal(0.2, 0.2, size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
kernel = stats.gaussian_kde(values, bw_method=0.15/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1,y1))
print('iso1 = ', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
# sklearn expects an array of shape (n_samples, n_features), hence the transpose.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=0.15).fit(values.T)
# Get KDE value for the point.
iso2 = kernel_sk.score_samples([[x1, y1]])
print('iso2 = ', np.exp(iso2[0]))
( iso2 is presented as an exponential since sklearn returns the log values)
The results I get for iso1 and iso2 are different and I'm lost as to how should I affect the bandwidth (in either function) to make them equal (as they should).
I was advised over at sklearn chat (by ep) that I should scale the values in (x,y) before calculating the kernel with scipy in order to obtain comparable results with sklearn.
So this is what I did:
# Scale values.
x_val_sca = np.asarray(values[0])/np.asarray(values).std(axis=1)[0]
y_val_sca = np.asarray(values[1])/np.asarray(values).std(axis=1)[1]
values = [x_val_sca, y_val_sca]
kernel = stats.gaussian_kde(values, bw_method=bw_value)
i.e., I scaled both dimensions before getting the kernel with scipy, while leaving the line that obtains the kernel in sklearn untouched.
This gave better results, but there are still differences in the kernels obtained:
where the red dot is the (x1,y1) point in the code. So as can be seen, there are still differences in the shapes of the density estimates, albeit very small ones. Perhaps this is the best that can be achieved?
A couple of years later I tried this and think I got it to work with no re-scaling needed for the data. Bandwidth values do need some scaling though:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
The evaluation of both KDEs for the same point is not exactly equal. For example here's an evaluation for the (x1, y1) point:
iso1 = 0.00984751705005 # Scipy
iso2 = 0.00989788224787 # Sklearn
but I guess it's close enough.
Here's the MWE for the 2D case and the output which, as far as I can see, look almost exactly the same:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
# Generate random data.
n = 1000
m1, m2 = np.random.normal(-3., 3., size=n), np.random.normal(-3., 3., size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
ext_range = [xmin, xmax, ymin, ymax]
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# Bandwidth value.
bw = 0.15
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
# **Bandwidth needs to be scaled to match Sklearn results**
kernel = stats.gaussian_kde(
    values, bw_method=bw/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1, y1))
print('iso1 = ', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
# sklearn expects an array of shape (n_samples, n_features), hence the transpose.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(values.T)
# Get KDE value for the point. Use exponential since sklearn returns the
# log values
iso2 = np.exp(kernel_sk.score_samples([[x1, y1]]))
print('iso2 = ', iso2[0])
# Plot
fig = plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(1, 2)
# Scipy (the subplot selection is assumed; gs is otherwise unused in the surviving code)
plt.subplot(gs[0])
plt.title("Scipy", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos = kernel(positions)
kde = np.reshape(k_pos.T, x.shape)
plt.imshow(np.rot90(kde), extent=ext_range)
plt.contour(x, y, kde, 5, colors='k', linewidths=0.6)
# Sklearn
plt.subplot(gs[1])
plt.title("Sklearn", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos2 = np.exp(kernel_sk.score_samples(positions.T))
kde2 = np.reshape(k_pos2.T, x.shape)
plt.imshow(np.rot90(kde2), extent=ext_range)
plt.contour(x, y, kde2, 5, colors='k', linewidths=0.6)
plt.savefig('KDEs', dpi=300, bbox_inches='tight')

Python interp1d vs. UnivariateSpline

I'm trying to port some MatLab code over to Scipy, and I've tried two different functions from scipy.interpolate, interp1d and UnivariateSpline. The interp1d results match the interp1d MatLab function, but the UnivariateSpline numbers come out different - and in some cases very different.
f = interp1d(row1,row2,kind='cubic',bounds_error=False,fill_value=numpy.max(row2))
return f(interp)
f = UnivariateSpline(row1,row2,k=3,s=0)
return f(interp)
Could anyone offer any insight? My x vals aren't equally spaced, although I'm not sure why that would matter.
I just ran into the same issue.
Short answer
Use InterpolatedUnivariateSpline instead:
f = InterpolatedUnivariateSpline(row1, row2)
return f(interp)
Long answer
UnivariateSpline is a 'one-dimensional smoothing spline fit to a given set of data points' whereas InterpolatedUnivariateSpline is a 'one-dimensional interpolating spline for a given set of data points'. The former smooths the data (unless the smoothing factor s is set to 0), whereas the latter is a more conventional interpolation method and reproduces the results expected from interp1d. The figure below illustrates the difference.
The code to reproduce the figure is shown below.
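from numpy import linspace, pi, sin
from matplotlib.pyplot import subplot, plot, ylim, ylabel, legend
import scipy.interpolate as ip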
#Define independent variable
sparse = linspace(0, 2 * pi, num = 20)
dense = linspace(0, 2 * pi, num = 200)
#Define function and calculate dependent variable
f = lambda x: sin(x) + 2
fsparse = f(sparse)
fdense = f(dense)
ax = subplot(2, 1, 1)
#Plot the sparse samples and the true function
plot(sparse, fsparse, label = 'Sparse samples', linestyle = 'None', marker = 'o')
plot(dense, fdense, label = 'True function')
#Plot the different interpolation results
interpolate = ip.InterpolatedUnivariateSpline(sparse, fsparse)
plot(dense, interpolate(dense), label = 'InterpolatedUnivariateSpline', linewidth = 2)
smoothing = ip.UnivariateSpline(sparse, fsparse)
plot(dense, smoothing(dense), label = 'UnivariateSpline', color = 'k', linewidth = 2)
ip1d = ip.interp1d(sparse, fsparse, kind = 'cubic')
plot(dense, ip1d(dense), label = 'interp1d')
ylim(.9, 3.3)
legend(loc = 'upper right', frameon = False)
#Plot the fractional error
subplot(2, 1, 2, sharex = ax)
plot(dense, smoothing(dense) / fdense - 1, label = 'UnivariateSpline')
plot(dense, interpolate(dense) / fdense - 1, label = 'InterpolatedUnivariateSpline')
plot(dense, ip1d(dense) / fdense - 1, label = 'interp1d')
ylabel('Fractional error')
legend(loc = 'upper left', frameon = False)
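The difference between the two classes can also be checked without a plot. The sketch below reuses the sparse sine samples from the figure code and additionally compares against UnivariateSpline with s=0, which is the setting used in the question. The variable names are illustrative.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline, UnivariateSpline

# Sparse samples of the same test function as in the figure code.
x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x) + 2
dense = np.linspace(0, 2 * np.pi, 200)

interpolating = InterpolatedUnivariateSpline(x, y)   # cubic, passes through every point
smooth_s0 = UnivariateSpline(x, y, k=3, s=0)         # smoothing factor forced to zero
smooth_default = UnivariateSpline(x, y)              # default smoothing factor

# With s=0 the smoothing spline reduces to the interpolating spline.
print(np.allclose(interpolating(dense), smooth_s0(dense)))

# With the default s the spline no longer passes through the samples.
print(np.abs(smooth_default(x) - y).max())
```

This suggests that large discrepancies with UnivariateSpline usually come from a non-zero smoothing factor rather than from the class itself.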
The reason why the results are different (but both likely correct) is that the interpolation routines used by UnivariateSpline and interp1d are different.
interp1d constructs a smooth B-spline using the x-points you gave to it as knots
UnivariateSpline is based on FITPACK, which also constructs a smooth B-spline. However, FITPACK tries to choose new knots for the spline, to fit the data better (probably to minimize chi^2 plus some penalty for curvature, or something similar). You can find out what knot points it used via g.get_knots().
So the reason why you get different results is that the interpolation algorithm is different. If you want B-splines with knots at data points, use interp1d or splmake. If you want what FITPACK does, use UnivariateSpline. In the limit of dense data, both methods give the same results, but when data is sparse, you may get different results.
(How do I know all this: I read the code :-)
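The knot selection mentioned above can be inspected with get_knots(). A small sketch, assuming the same kind of sparse sine data as in the earlier answer; the variable names are illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x) + 2

# s=0 forces interpolation, so FITPACK needs many knots to pass through all points.
knots_s0 = UnivariateSpline(x, y, k=3, s=0).get_knots()

# With the default smoothing factor FITPACK stops adding knots as soon as the
# residual is below the smoothing threshold, which typically needs far fewer knots.
knots_default = UnivariateSpline(x, y, k=3).get_knots()

print(len(knots_s0), len(knots_default))
```

Comparing the two knot vectors is a quick way to tell whether a given spline is smoothing the data or interpolating it.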
Works for me,
from numpy import allclose, linspace
from scipy.interpolate import interp1d, UnivariateSpline
from numpy.random import normal
from pylab import plot, show
n = 2**5
x = linspace(0,3,n)
y = (2*x**2 + 3*x + 1) + normal(0.0,2.0,n)
i = interp1d(x,y,kind=3)
u = UnivariateSpline(x,y,k=3,s=0)
m = 2**4
t = linspace(1,2,m)
print(allclose(i(t),u(t))) # evaluates to True
This gives me,
'UnivariateSpline: A more recent wrapper of the FITPACK routines.'
this might explain the slightly different values? (I also experienced that UnivariateSpline is much faster than interp1d.)

