I have a set of points [x1,y1], [x2,y2], ..., [xn,yn]. I need to display them as a kernel density estimation in a 2D image. How can I do this? I was referring to the following code, and it's a bit confusing. I'm looking for a simple explanation.
https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html
img = np.zeros((height, width), np.uint8)
circles_xy = [[524,290], [234,180], ..., [432,30]]
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde.fit(circles_xy)
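For reference, a minimal sketch of one way to turn such a fit into an image, assuming circles_xy holds [x, y] pixel coordinates (the image size and bandwidth below are arbitrary illustrative choices): evaluate the fitted density over a pixel grid and reshape it into an image.
import numpy as np
from sklearn.neighbors import KernelDensity
height, width = 400, 600  # assumed image size
circles_xy = np.array([[524, 290], [234, 180], [432, 30]])  # example points
# pixel coordinates usually call for a much wider bandwidth than 1.0
kde = KernelDensity(bandwidth=20.0, kernel='gaussian').fit(circles_xy)
# evaluate the log-density at every pixel (x, y), exponentiate, reshape to an image
xx, yy = np.meshgrid(np.arange(width), np.arange(height))
grid = np.column_stack([xx.ravel(), yy.ravel()])
density_img = np.exp(kde.score_samples(grid)).reshape(height, width)
# density_img can now be displayed, e.g. with matplotlib's plt.imshow(density_img)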
I would continue on the same path by drawing the contours of the PDF of the kernel density estimate. However, this might not give the information you need, because the values of the PDF are not very informative. Instead, I would rather compute the minimum volume level set. For a given probability level, the minimum volume level set is the domain containing that fraction of the distribution. This domain is bounded by a contour where the PDF takes a fixed value, but that PDF value is not known in advance: finding it requires an inversion.
Based on a given sample, the natural idea is to compute an approximate distribution based on kernel smoothing, just like you did. Then, for any distribution in OpenTURNS, the computeMinimumVolumeLevelSetWithThreshold method computes the required level set and the corresponding PDF value.
Let's see how it goes in practice. In order to get an interesting example, I create a 2D distribution from a mixture of two Gaussian distributions.
import openturns as ot
# Create a gaussian
corr = ot.CorrelationMatrix(2)
corr[0, 1] = 0.2
copula = ot.NormalCopula(corr)
x1 = ot.Normal(-1., 1)
x2 = ot.Normal(2, 1)
x_funk = ot.ComposedDistribution([x1, x2], copula)
# Create a second gaussian
x1 = ot.Normal(1.,1)
x2 = ot.Normal(-2,1)
x_punk = ot.ComposedDistribution([x1, x2], copula)
# Mix the distributions
mixture = ot.Mixture([x_funk, x_punk], [0.5,1.])
# Generate the sample
sample = mixture.getSample(500)
This is where your problem starts: creating the bivariate kernel smoothing estimate with the multidimensional Scott's rule only requires two lines.
factory = ot.KernelSmoothing()
distribution = factory.build(sample)
It would be straightforward just to plot the contours of this estimated distribution.
distribution.drawPDF()
produces:
This shows the shape of the distribution. However, the contours of the PDF do not convey much information on the initial sample.
The inversion needed to compute the minimum volume level set requires an initial sample, which is generated with the Monte Carlo method when the dimension is greater than 1. The default sample size (close to 16,000) is OK, but I usually set it myself just to make sure that I understand what I am doing.
ot.ResourceMap.SetAsUnsignedInteger(
"Distribution-MinimumVolumeLevelSetSamplingSize", 1000
)
alpha = 0.9
levelSet, threshold = distribution.computeMinimumVolumeLevelSetWithThreshold(alpha)
The threshold variable contains the solution of the problem, i.e. the PDF value which corresponds to the minimum volume level set.
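As a quick check (this snippet is an addition; it assumes the returned LevelSet exposes a contains() method, as OpenTURNS domains generally do), a point lies inside the minimum volume level set exactly when its PDF is above the threshold:
print("PDF threshold:", threshold)
x0 = sample[0]
print(levelSet.contains(x0), distribution.computePDF(x0) >= threshold)  # both should agree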
The final step is to plot the sample and the corresponding minimum volume level set.
def drawLevelSetContour2D(
distribution, numberOfPointsInXAxis, alpha, threshold, sample
):
"""
Compute the minimum volume LevelSet of measure equal to alpha and get the
corresponding density value (named threshold).
Draw a contour plot for the distribution, where the PDF is equal to threshold.
"""
sampleSize = sample.getSize()
X1min = sample[:, 0].getMin()[0]
X1max = sample[:, 0].getMax()[0]
X2min = sample[:, 1].getMin()[0]
X2max = sample[:, 1].getMax()[0]
xx = ot.Box([numberOfPointsInXAxis], ot.Interval([X1min], [X1max])).generate()
yy = ot.Box([numberOfPointsInXAxis], ot.Interval([X2min], [X2max])).generate()
xy = ot.Box(
[numberOfPointsInXAxis, numberOfPointsInXAxis],
ot.Interval([X1min, X2min], [X1max, X2max]),
).generate()
data = distribution.computePDF(xy)
graph = ot.Graph("", "X1", "X2", True, "topright")
labels = ["%.2f%%" % (100 * alpha)]
contour = ot.Contour(xx, yy, data, ot.Point([threshold]), ot.Description(labels))
contour.setColor("black")
graph.setTitle(
"%.2f%% of the distribution, sample size = %d" % (100 * alpha, sampleSize)
)
graph.add(contour)
cloud = ot.Cloud(sample)
graph.add(cloud)
return graph
We finally plot the contours of the level set with 50 points in each axis.
numberOfPointsInXAxis = 50
graph = drawLevelSetContour2D(distribution, numberOfPointsInXAxis, alpha, threshold, sample)
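The function returns an ot.Graph; one way to actually render it (assuming the matplotlib-based viewer bundled with OpenTURNS) is:
from openturns.viewer import View
View(graph)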
The following figure plots the sample along with the contour of the domain which contains 90% of the population estimated from the kernel smoothing distribution. Any point outside of this region can be considered as an outlier, although we might use the higher alpha=0.95 value for this purpose.
The full example is detailed in Minimum volume level set. An application of this to stochastic processes is done in othdrplot. The ideas used here are detailed in: Rob J. Hyndman and Han Lin Shang. Rainbow plots, bagplots and boxplots for functional data. Journal of Computational and Graphical Statistics, 19:29-45, 2009.
The situation
I am trying to apply a high-pass filter to a black-and-white image to enhance its texture by keeping the high frequencies. The goal is to filter from a specific frequency value obtained from the outcome of applying signal.welch() from the scipy library. Up to this point, the code I have tried works well enough that I can plot the frequency-PSD graph and visually identify the frequency value of interest.
The code
The following code takes a numpy array image, calculates the PSD of the derivative of each horizontal line, and takes the mean of these to plot the periodogram.
def plot_periodogram(image):
# Calculate increment for each line (derivative)
img_incr = np.diff(image[:,:,0], axis=1)
# Calculate PSD for each increment of the line
f_tot, Pxx_tot = [], []
for i in range(image.shape[0]):
f, Pxx = signal.welch(img_incr[i,:])
f_tot.append(f)
Pxx_tot.append(Pxx)
# Calculate mean of the increments of the line
f_mean = np.mean(f_tot, axis=0)
Pxx_mean = np.mean(Pxx_tot, axis=0)
# log-log plot of frequency-PSD
plt.loglog(f_mean, Pxx_mean)
plt.xlabel('frequency [Hz]')
plt.ylabel('PSD')
plt.show()
return f_mean, Pxx_mean, f_tot, Pxx_tot
f_mean, Pxx_mean, f_tot, Pxx_tot = plot_periodogram(img)
In this sample image (ignore the red lines), the peak around f=0.23 helps to identify the **cut-off frequency** (i.e. fc=0.23) which should be used to apply the filter.
The Question
Having the cut-off frequency, how should I proceed to filter the image in frequency domain and return to spatial domain?
My best guess is that I should set to 0 all Pxx_tot elements whose corresponding f_tot values are lower than fc. If this approach is correct, I still don't know how to go back to the spatial domain after filtering the image.
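One common route, sketched below under the assumption that fc is in the same normalised cycles-per-pixel units that scipy.signal.welch returns by default, is to take the 2D FFT of the image, zero out the low-frequency components, and invert the transform:
import numpy as np
def highpass_filter(image, fc):
    """Zero every spatial-frequency component below the cut-off fc and return to the spatial domain."""
    F = np.fft.fft2(image)
    fy = np.fft.fftfreq(image.shape[0])  # cycles per pixel along rows
    fx = np.fft.fftfreq(image.shape[1])  # cycles per pixel along columns
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    F[radius < fc] = 0.0  # suppress low frequencies (this also removes the mean)
    return np.real(np.fft.ifft2(F))  # back to the spatial domain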
I have a simulated signal which is displayed as a histogram. I want to emulate the real measured signal by convolving it with a Gaussian of a specific width, since in the real experiment the detector has a certain uncertainty in the measured channels.
I have tried to do a convolution using np.convolve as well as scipy.signal.convolve, but can't seem to get the filtering right. Not only is the shape off (it should be a slightly smeared version of the histogram), but the x-axis, e.g. the energy scale, is off as well.
I tried defining my Gaussian with a width of 20 keV as:
gauss = np.random.normal(0, 20000, len(coincidence['esum']))
hist_gauss = plt.hist(gauss, bins=100)[0]
where len(coincidence['esum']) is the length of my coincidence dataframe column. This column I bin using:
counts = plt.hist(coincidence['esum'], bins=100)[0]
Besides this approach to generating a suitable Gaussian, I tried scipy.signal.gaussian(50, 30000), which unfortunately generates a parabolic-looking curve and does not exhibit the characteristic tails.
I tried doing the convolution using both coincidence['esum'] and counts with both Gaussian approaches. Note that when doing a simple convolution with the standard example from Finding the convolution of two histograms, it works without problems.
Would anyone know how to do such a convolution in python? I exported the column of coincidence['esum'] that I use for my histogram to a pastebin, in case anyone is interested and wants to recreate it with the specific data: https://pastebin.com/WFiSBFa6
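For reference, a minimal sketch of the detector-resolution smearing itself: convolve the bin counts with a Gaussian whose sigma is expressed in bin units (counts is the binned esum spectrum from above; bin_width_keV is an assumed name for the bin width of that histogram):
import numpy as np
from scipy.ndimage import gaussian_filter1d
bin_width_keV = 10.0  # assumed width of one esum bin, in keV
sigma_keV = 20.0  # detector resolution to emulate
sigma_bins = sigma_keV / bin_width_keV  # express the Gaussian width in bin units
smeared_counts = gaussian_filter1d(counts.astype(float), sigma_bins)
# the energy axis (bin edges) is unchanged, so the energy scale stays intact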
As you may be aware, convolving two histograms that share the same bin size gives the histogram of the sums obtained by adding each element of one sample to each element of the other sample.
I cannot see exactly what you are doing. One important thing that you do not seem to be doing is making sure that the bins of the two histograms have the same width; you also have to take care of the position of the edges of the second histogram's bins.
In code we have
def hist_of_addition(A, B, bins=10, plot=False):
A_heights, A_edges = np.histogram(A, bins=bins)
# make sure the histogram is equally spaced
assert(np.allclose(np.diff(A_edges), A_edges[1] - A_edges[0]))
# make sure to use the same interval
step = A_edges[1] - A_edges[0]
# specify parameters to make sure the histogram of B will
# have the same bin size as the histogram of A
nBbin = int(np.ceil((np.max(B) - np.min(B))/step))
left = np.min(B)
B_heights, B_edges = np.histogram(B, range=(left, left + step * nBbin), bins=nBbin)
# check that the bins for the second histogram matches the first
assert(np.allclose(np.diff(B_edges), step))
C_heights = np.convolve(A_heights, B_heights)
C_edges = B_edges[0] + A_edges[0] + np.arange(0, len(C_heights) + 1) * step
if plot:
plt.figure(figsize=(12, 4))
plt.subplot(131)
plt.bar(A_edges[:-1], A_heights, step)
plt.title('A')
plt.subplot(132)
plt.bar(B_edges[:-1], B_heights, step)
plt.title('B')
plt.subplot(133)
plt.bar(C_edges[:-1], C_heights, step)
plt.title('A+B')
return C_edges, C_heights
Then
A = -np.cos(np.random.rand(10**6))
B = np.random.normal(1.5, 0.025, 10**5)
hist_of_addition(A, B, bins=100, plot=True);
Gives
I have a set of values that I'd like to plot the Gaussian kernel density estimation of; however, there are two problems that I'm having:
I only have the heights of the bars, not the underlying values themselves
I am plotting onto a categorical axis
Here's the plot I've generated so far:
The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.
I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.
Here's the code for the above grouped bar plot using python and matplotlib:
N = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N) # the x locations for the groups
width = .5 # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
b.xy = (b.xy[0], b.xy[1]+width)
Thanks!
How to plot a "KDE" starting from a histogram
The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical PDF (i.e. the histogram) instead, but then it wouldn't be a KDE distribution.
Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
n = 100000
# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())
# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)
# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()
Output:
The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.
If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:
Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
Munge your categorical data into an appropriate form
Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.
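As a rough illustration of what that could look like (every name below is made up, and the resampling trick from the previous answer is reused to go from bar heights to a KDE):
import numpy as np
import scipy.stats as sts
species = ["sp_a", "sp_b", "sp_c", "sp_d"]  # categories in phylogenetic order (assumed)
bar_heights = np.array([12.0, 30.0, 7.0, 21.0])  # one bar height per species (assumed)
positions = np.arange(len(species))  # the numeric "x" value of each category
# draw pseudo-samples at the category positions, in proportion to the bar heights
resamples = np.random.choice(positions, size=5000, p=bar_heights / bar_heights.sum())
kde = sts.gaussian_kde(resamples)
grid = np.linspace(-1, len(species), 200)
density = kde(grid)  # overlay this curve on the bar plot, e.g. with plt.plot(grid, density)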
I have stated my reservations about applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances which need not satisfy the triangle inequality (nor even need to be symmetric, I think). The following outlines this approach:
import numpy as np
#--------------------------------------------------------------------------------
# simulate data
total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)
# Distances from each value to itself are zero; hence zero out the diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))
# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.
# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer
# Set number of neighbours to "interpolate" over.
k = 3
# Compute average based on sample value itself and k neighbouring values weighted by the inverse distance.
# The following assumes that the value of distance_matrix[ii, jj] corresponds to the distance from ii to jj.
new_sample_values = np.zeros(total_classes)  # array to hold the smoothed estimates
for ii in range(total_classes):
# determine neighbours
indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the value of the sample itself
# compute weights
distances = distance_matrix[ii, indices]
weights = 1. / distances
weights /= np.sum(weights) # weights need to sum to 1
# compute weighted average
values = sample_values[indices]
new_sample_values[ii] = np.sum(values * weights)
print(new_sample_values)
THE EASY WAY
For now, I am skipping any philosophical argument about the validity of using kernel density estimation in such settings. I will come back to that later.
An easy way to do this is using scikit-learn KernelDensity:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing
ds=pd.read_csv('data-by-State.csv')
Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...
# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y) # le.classes_ would be ['AL', 'AK', 'AZ',...
y=le.transform(Y) # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis] # preparing for kde
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)
# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation
x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens) # these are the density function values
array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
0.05818931, 0.05771472, 0.05740393, 0.057276 , 0.05734634,
0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
..............
0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])
And these values are all you need to plot your kernel density over your histogram. Got it?
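A minimal plotting sketch for that last step (it reuses the y, x, and dens arrays from the snippet above; the styling choices are arbitrary):
import matplotlib.pyplot as plt
plt.hist(y[:, 0], bins=len(le.classes_), density=True, alpha=0.5, label='data histogram')
plt.plot(x[:, 0], dens, label='Gaussian KDE')
plt.legend()
plt.show()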
Now, on the theoretical side: if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1
k(x1, x2) = 1 - h if x1 = x2, and k(x1, x2) = h / (c - 1) otherwise
is a valid kernel. For an ordered X,
k(x1, x2) = h^|x1 - x2|
where |x1 - x2| should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency count. h is often referred to as the bandwidth.
(*) No distance needs to be defined on the variable space. Doesn't need to be a metric space.
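In code, the two kernels written out above could look like this (a sketch based on the formulas reconstructed above, not on any particular library; the function names are made up):
import numpy as np
def unordered_kernel(x1, x2, h, c):
    """Aitchison-Aitken style kernel: 1-h when the categories match, h/(c-1) otherwise."""
    return np.where(np.asarray(x1) == np.asarray(x2), 1.0 - h, h / (c - 1))
def ordered_kernel(x1, x2, h):
    """Ordered categorical kernel: h raised to the number of levels separating x1 and x2."""
    return h ** np.abs(np.asarray(x1) - np.asarray(x2))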
Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.
I need to calculate the area where two functions overlap. I use normal distributions in this particular simplified example, but I need a more general procedure that adapts to other functions too.
See image below to get an idea of what I mean, where the red area is what I'm after:
This is the MWE I have so far:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
# Generate random data, normally distributed.
a = np.random.normal(1., 0.1, 1000)
b = np.random.normal(1., 0.1, 1000)
# Obtain KDE estimates for each set of data.
xmin, xmax = -1., 2.
x_pts = np.mgrid[xmin:xmax:1000j]
# Kernels.
ker_a = stats.gaussian_kde(a)
ker_b = stats.gaussian_kde(b)
# KDEs for plotting.
kde_a = np.reshape(ker_a(x_pts).T, x_pts.shape)
kde_b = np.reshape(ker_b(x_pts).T, x_pts.shape)
# Random sample from a KDE distribution.
sample = ker_a.resample(size=1000)
# Compute the points below which to integrate.
iso = ker_b(sample)
# Filter the sample.
insample = ker_a(sample) < iso
# As per Monte Carlo, the integral is equivalent to the
# probability of drawing a point that gets through the
# filter.
integral = insample.sum() / float(insample.shape[0])
print(integral)
plt.xlim(0.4,1.9)
plt.plot(x_pts, kde_a)
plt.plot(x_pts, kde_b)
plt.show()
where I apply Monte Carlo to obtain the integral.
The problem with this method is that when I evaluate sampled points in either distribution with ker_b(sample) (or ker_a(sample)), I get values placed directly on the KDE curve. Because of this, even clearly overlapping distributions, which should return an overlap area value very close to 1, instead return small values (the total area under either curve is 1 since they are probability density estimates).
How could I fix this code to give the expected results?
This is how I applied Zhenya's answer
from scipy.integrate import quad
# Calculate overlap between the two KDEs.
def y_pts(pt):
y_pt = min(ker_a(pt), ker_b(pt))
return y_pt
# Store overlap value.
overlap = quad(y_pts, -1., 2.)[0]  # quad returns (integral, abserr); keep the integral
The red area on the plot is the integral of min(f(x), g(x)), where f and g are your two functions, green and blue. To evaluate the integral, you can use any of the integrators from scipy.integrate (quad's the default one, I'd say) -- or an MC integrator, of course, but I don't quite see the point of that.
I think another solution would be to multiply the two curves, then take the integral. You may want to do some sort of normalization. The analogy is orbital overlap in chemistry: https://en.wikipedia.org/wiki/Orbital_overlap
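A rough sketch of that product-and-integrate idea, reusing KDEs built as in the question (the normalisation choice is left open):
import numpy as np
from scipy import stats
from scipy.integrate import quad
a = np.random.normal(1.0, 0.1, 1000)
b = np.random.normal(1.0, 0.1, 1000)
ker_a, ker_b = stats.gaussian_kde(a), stats.gaussian_kde(b)
# integral of the product of the two densities, as in orbital overlap
product_overlap, _ = quad(lambda x: ker_a(x)[0] * ker_b(x)[0], -1.0, 2.0)
print(product_overlap)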
I'm trying to get python to return, as close as possible, the center of the most obvious clustering in an image like the one below:
In my previous question I asked how to get the global maximum and the local maxima of a 2D array, and the answers given worked perfectly. The issue is that the center estimate I can get by averaging the global maxima obtained with different bin sizes is always slightly off from the one I would set by eye, because I'm only accounting for the biggest bin instead of a group of the biggest bins (like one does by eye).
I tried adapting the answer to this question to my problem, but it turns out my image is too noisy for that algorithm to work. Here's my code implementing that answer:
import numpy as np
from scipy.ndimage.filters import maximum_filter
from scipy.ndimage.morphology import generate_binary_structure, binary_erosion
import matplotlib.pyplot as pp
from os import getcwd
from os.path import join, realpath, dirname
# Save path to dir where this code exists.
mypath = realpath(join(getcwd(), dirname(__file__)))
myfile = 'data_file.dat'
x, y = np.loadtxt(join(mypath,myfile), usecols=(1, 2), unpack=True)
xmin, xmax = min(x), max(x)
ymin, ymax = min(y), max(y)
rang = [[xmin, xmax], [ymin, ymax]]
paws = []
for d_b in range(25, 110, 25):
# Number of bins in x,y given the bin width 'd_b'
binsxy = [int((xmax - xmin) / d_b), int((ymax - ymin) / d_b)]
H, xedges, yedges = np.histogram2d(x, y, range=rang, bins=binsxy)
paws.append(H)
def detect_peaks(image):
"""
    Takes an image and detects the peaks using the local maximum filter.
Returns a boolean mask of the peaks (i.e. 1 when
the pixel's value is the neighborhood maximum, 0 otherwise)
"""
# define an 8-connected neighborhood
neighborhood = generate_binary_structure(2,2)
    #apply the local maximum filter; all pixels of maximal value
#in their neighborhood are set to 1
local_max = maximum_filter(image, footprint=neighborhood)==image
#local_max is a mask that contains the peaks we are
#looking for, but also the background.
#In order to isolate the peaks we must remove the background from the mask.
#we create the mask of the background
background = (image==0)
#a little technicality: we must erode the background in order to
    #successfully subtract it from local_max, otherwise a line will
#appear along the background border (artifact of the local maximum filter)
eroded_background = binary_erosion(background, structure=neighborhood, border_value=1)
#we obtain the final mask, containing only peaks,
#by removing the background from the local_max mask
    detected_peaks = local_max & ~eroded_background
return detected_peaks
#applying the detection and plotting results
for i, paw in enumerate(paws):
detected_peaks = detect_peaks(paw)
pp.subplot(4,2,(2*i+1))
pp.imshow(paw)
pp.subplot(4,2,(2*i+2) )
pp.imshow(detected_peaks)
pp.show()
and here's the result of that (varying the bin size):
Clearly my background is too noisy for that algorithm to work, so the question is: how can I make that algorithm less sensitive? If an alternative solution exists then please let me know.
EDIT
Following Bi Rico's advice, I attempted smoothing my 2D array before passing it on to the local maximum finder, like so:
H, xedges, yedges = np.histogram2d(x, y, range=rang, bins=binsxy)
H1 = gaussian_filter(H, 2, mode='nearest')
paws.append(H1)
These were the results with a sigma of 2, 4 and 8:
EDIT 2
Using mode='constant' seems to work much better than 'nearest'. It converges to the right center with sigma=2 for the largest bin size:
So, how do I get the coordinates of the maximum that shows in the last image?
Answering the last part of your question: whenever you have point sources in an image, you can find their coordinates by searching, in some order, for the local maxima of the image. In case your data is not a point source, you can apply a mask to each peak in order to prevent the peak's neighborhood from being picked up as a maximum in a later search. I propose the following code:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import copy
def get_std(image):
return np.std(image)
def get_max(image,sigma,alpha=20,size=10):
i_out = []
j_out = []
image_temp = copy.deepcopy(image)
while True:
k = np.argmax(image_temp)
j,i = np.unravel_index(k, image_temp.shape)
if(image_temp[j,i] >= alpha*sigma):
i_out.append(i)
j_out.append(j)
x = np.arange(i-size, i+size)
y = np.arange(j-size, j+size)
xv,yv = np.meshgrid(x,y)
image_temp[yv.clip(0,image_temp.shape[0]-1),
xv.clip(0,image_temp.shape[1]-1) ] = 0
            print(xv)
else:
break
return i_out,j_out
#reading the image
image = mpimg.imread('ggd4.jpg')
#computing the standard deviation of the image
sigma = get_std(image)
#getting the peaks
i,j = get_max(image[:,:,0],sigma, alpha=10, size=10)
#let's see the results
plt.imshow(image, origin='lower')
plt.plot(i,j,'ro', markersize=10, alpha=0.5)
plt.show()
The image ggd4 for the test can be downloaded from:
http://www.ipac.caltech.edu/2mass/gallery/spr99/ggd4.jpg
The first part is to get some information about the noise in the image. I did it by computing the standard deviation of the full image (actually, it is better to select a small rectangle without signal). This tells us how much noise is present in the image.
The idea for getting the peaks is to ask for successive maxima that are above a certain threshold (let's say 3, 4, 5, 10, or 20 times the noise). This is what the function get_max is actually doing. It keeps searching for maxima until one of them falls below the threshold imposed by the noise. In order to avoid finding the same maximum many times, it is necessary to remove each peak from the image. In general, the shape of the mask used to do so depends strongly on the problem one wants to solve. For the case of stars, it would be good to remove the star using a Gaussian function, or something similar. For simplicity I have chosen a square function, and the size of the function (in pixels) is the variable "size".
I think that from this example, anybody can improve the code by adding more general things.
EDIT:
The original image looks like:
While the image after identifying the luminous points looks like this:
Too much of a n00b on Stack Overflow to comment on Alejandro's answer elsewhere here. I would refine his code a bit to use a preallocated numpy array for output:
def get_max(image,sigma,alpha=3,size=10):
from copy import deepcopy
import numpy as np
# preallocate a lot of peak storage
k_arr = np.zeros((10000,2))
image_temp = deepcopy(image)
peak_ct=0
while True:
k = np.argmax(image_temp)
j,i = np.unravel_index(k, image_temp.shape)
if(image_temp[j,i] >= alpha*sigma):
k_arr[peak_ct]=[j,i]
# this is the part that masks already-found peaks.
x = np.arange(i-size, i+size)
y = np.arange(j-size, j+size)
xv,yv = np.meshgrid(x,y)
# the clip here handles edge cases where the peak is near the
# image edge
image_temp[yv.clip(0,image_temp.shape[0]-1),
xv.clip(0,image_temp.shape[1]-1) ] = 0
peak_ct+=1
else:
break
# trim the output for only what we've actually found
return k_arr[:peak_ct]
Profiling this and Alejandro's code using his example image, this code is about 33% faster (0.03 s for Alejandro's code, 0.02 s for mine). I expect that on images with larger numbers of peaks it would be even faster, since appending the output to a list gets slower and slower as the number of peaks grows.
I think the first step needed here is to express the values in H in terms of the standard deviation of the field:
import numpy as np
H = H / np.std(H)
Now you can put a threshold on the values of this H. If the noise is assumed to be Gaussian, then by picking a threshold of 3 you can be quite sure (99.7%) that a pixel above it is associated with a real peak and not with noise. See here.
Now the further selection can start. It is not entirely clear to me what you want to find. Do you want the exact location of the peak values? Or do you want one location for a cluster of peaks, somewhere in the middle of that cluster?
Anyway, starting from this point with all pixel values expressed in standard deviations of the field, you should be able to get what you want. If you want to find clusters you could perform a nearest neighbour search on the >3-sigma gridpoints and put a threshold on the distance. I.e. only connect them when they are close enough to each other. If several gridpoints are connected you can define this as a group/cluster and calculate some (sigma-weighted?) center of the cluster.
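One concrete way to realise that last step (a sketch, not necessarily what is meant above; it assumes H is the 2D histogram from the question and uses simple bin adjacency as the "close enough" criterion):
import numpy as np
from scipy import ndimage
H_sigma = H / np.std(H)  # field expressed in standard deviations
mask = H_sigma > 3  # keep only the >3-sigma bins
labels, n_clusters = ndimage.label(mask)  # connect adjacent significant bins into clusters
# sigma-weighted centre (in bin coordinates) of each connected cluster
centres = ndimage.center_of_mass(H_sigma, labels, range(1, n_clusters + 1))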
Hope my first contribution on Stackoverflow is useful for you!
The way I would do it:
1) normalize H between 0 and 1.
2) pick a threshold value, as tcaswell suggests. It could be between .9 and .99 for example
3) use masked arrays to keep only the x,y coordinates with H above threshold:
import numpy.ma as ma
x_masked = ma.masked_array(x, mask=H < threshold)
y_masked = ma.masked_array(y, mask=H < threshold)
4) now you can weight-average on the masked coordinates, with a weight such as (H-threshold)^2, or any other power greater than or equal to one, depending on your taste/tests; see the sketch just below.
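A sketch of step 4 (an interpretation that works on the 2D-histogram bin centres rather than on the raw points; xedges, yedges and H are assumed to come from np.histogram2d as in the question):
import numpy as np
H_norm = (H - H.min()) / (H.max() - H.min())  # step 1: normalize to [0, 1]
threshold = 0.95  # step 2: pick a threshold
weights = np.clip(H_norm - threshold, 0, None) ** 2  # zero below threshold, (H-threshold)^2 above
xc = 0.5 * (xedges[:-1] + xedges[1:])  # bin centres
yc = 0.5 * (yedges[:-1] + yedges[1:])
X, Y = np.meshgrid(xc, yc, indexing='ij')  # matches H's (x, y) bin layout
x_peak = np.average(X, weights=weights)  # weights must not be all zero
y_peak = np.average(Y, weights=weights)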
Comment:
1) This is not robust with respect to the type of peaks you have, since you may have to adapt the threshold. This is the lesser problem;
2) This DOES NOT work with two peaks as it is, and will give wrong results if the 2nd peak is above threshold.
Nonetheless, it will always give you an answer without crashing (with pros and cons of the thing..)
I'm adding this answer because it's the solution I ended up using. It's a combination of Bi Rico's comment here (May 30 at 18:54) and the answer given in this question: Find peak of 2d histogram.
As it turns out, using the peak detection algorithm from the question Peak detection in a 2D array only complicates matters. After applying the Gaussian filter to the image, all that needs to be done is to ask for the maximum bin (as Bi Rico pointed out) and then obtain the maximum in coordinates.
So instead of using the detect-peaks function as I did above, I simply add the following code after the Gaussian 2D histogram is obtained:
# Get 2D histogram.
H, xedges, yedges = np.histogram2d(x, y, range=rang, bins=binsxy)
# Get Gaussian filtered 2D histogram.
H1 = gaussian_filter(H, 2, mode='nearest')
# Get center of maximum in bin coordinates.
x_cent_bin, y_cent_bin = np.unravel_index(H1.argmax(), H1.shape)
# Get center in x,y coordinates.
x_cent_coord, y_cent_coord = np.average(xedges[x_cent_bin:x_cent_bin + 2]), np.average(yedges[y_cent_bin:y_cent_bin + 2])