Python : Generate normal distribution in the order of the bell - python

I want to generate normal distribution in the order of the bell.
I used this code to generate the numbers:
import numpy as np
mu,sigma,n = 0.,1.,1000
def normal(x,mu,sigma):
    return ( 2.*np.pi*sigma**2. )**-.5 * np.exp( -.5 * (x-mu)**2. / sigma**2. )
x = np.random.normal(mu,sigma,n) #generate random list of points from normal distribution
y = normal(x,mu,sigma) #evaluate the probability density at each point
x,y = x[np.argsort(y)],np.sort(y) #sort according to the probability density
which is the code proposed in "Generating normal distribution in order python, numpy",
but the numbers do not follow the bell shape.
Any ideas?
Thank you very much

You are confusing a couple of things.
np.random.normal draws n numbers randomly from a bell curve.
So you have 1000 numbers, each distinct, all drawn from the curve. To recreate the curve, you need to apply some binning. The number of points in each bin will recreate the curve (a single point by itself can hardly represent a probability). Using some extensive binning on your x vector of only 1000 points:
h,hx=np.histogram(x,bins=50)
and plotting h as a function of hx (I grouped your thousand numbers into 50 bins, so the y axis shows the number of points in each bin):
Now we can see x was drawn from a bell distribution: the chance to fall in the center bin is determined by the Gaussian. This is a sampling, so each point may vary a bit, of course; the more points you use and the finer the binning, the smoother it will be.
y = normal(x,mu,sigma)
This just evaluates the Gaussian at any given x, so really, supply normal with any list of numbers around your mean (mu) and it will calculate the bell curve exactly (the exact probability density). Plotting your y against x (it doesn't matter that your x is itself Gaussian; it is 1000 points around the mean, so it can recreate the function):
See how smooth that is? That's because it's not a sampling, it's an exact calculation of the function. You could have used just any 1000 points around 0 and it would have looked just as good.
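The two views described above can also be compared numerically. The following is a minimal sketch, not the original poster's code: the bin count, the seed and the names `centers`, `xs`, `ys` and `max_gap` are illustrative assumptions. It normalises the histogram with `density=True` so that bar heights are comparable to the exact density, and it sorts by x (rather than by y, as in the question) so that the evaluated points run left to right across the bell.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 1000

def normal(x, mu, sigma):
    """Exact Gaussian probability density evaluated at x."""
    return (2.0 * np.pi * sigma**2) ** -0.5 * np.exp(-0.5 * (x - mu) ** 2 / sigma**2)

x = rng.normal(mu, sigma, n)

# Sampling view: normalised bin counts approximate the density (noisy).
h, edges = np.histogram(x, bins=30, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])

# Exact view: sort by x (not by y) so the points trace the bell left to right.
order = np.argsort(x)
xs = x[order]
ys = normal(xs, mu, sigma)

# Largest gap between the histogram and the exact curve at the bin centres.
max_gap = np.max(np.abs(h - normal(centers, mu, sigma)))
```

Plotting `ys` against `xs` gives the smooth curve, while plotting `h` against `centers` gives the noisy sampled version of the same shape.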

Your code works just fine.
import numpy as np
import matplotlib.pyplot as plt
mu,sigma,n = 0.,1.,1000
def normal(x,mu,sigma):
    return ( 2.*np.pi*sigma**2. )**-.5 * np.exp( -.5 * (x-mu)**2. / sigma**2. )
x = np.random.normal(mu,sigma,n)
y = normal(x,mu,sigma)
plt.plot(x,y)

Related

python signal.scipy.welch for complex input returns frequency indices which are not ordered negative to positive

My aim is to plot the PSD of a complex vector x.
I computed the spectral estimate using scipy.signal.welch (SciPy version 1.4.1):
f, Px = scipy.signal.welch(x, return_onesided=False, detrend=False)
and then plotted:
plt.plot(f, 10*np.log10(Px),'.-')
plt.show()
The PSD was plotted fine, but I noticed an interpolation line from the last sample to the first on the plot. I then checked the frequency indices and noticed that they are ordered from DC(0) to half the sample rate(0.5 in this case) and then from -0.5 to almost zero. This is why the plot has a straight line across from Px(0.5) to Px(-0.5).
Why is the returned f vector (and the corresponding Px) not ordered from -0.5 to 0.5?
Can someone suggest a straightforward method? (I'm used to MATLAB and it is much simpler to plot a PSD there...)
Thanks
Think about the angle a complex value rotates through between two samples. It can be anywhere from 0 to 360 degrees (this is the order in which welch returns the frequencies), but a rotation of, say, 200 degrees can equally be seen as a rotation of 360-200=160 degrees in the other direction.
Since you are asking it to return the two-sided spectrum:
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
x = np.random.randn(1000)
x = x + np.roll(x, 1) + np.roll(x, -1) # circular 3-point moving sum
f, p = scipy.signal.welch(x, detrend=False, return_onesided=False)
It will give you the positive-frequency part of the spectrum followed by the negative-frequency part. NumPy provides an fftshift function to rearrange an FFT frequency vector. Check what the frequency vector (the x axis) looks like:
plt.plot(f)
plt.plot(np.fft.fftshift(f))
plt.legend(['welch return', 'fftshifted'])
So if you plot directly, you will see the line connecting the last point of the positive-frequency spectrum to the first point of the negative-frequency spectrum:
plt.plot(f, np.log(p))
If you reorder both f and p, you see the expected result:
plt.plot(np.fft.fftshift(f), np.log(np.fft.fftshift(p)))
Note that for real data welch will return the same values for the negative part and the positive part.
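As a quick check of the reordering, here is a small sketch with a synthetic complex signal. The tone frequency of 0.2 cycles/sample, the noise level and all variable names are assumptions made for illustration only. It applies the same `fftshift` to `f` and `p`, confirms that the shifted frequency axis is increasing, and also checks the symmetry of the two-sided PSD for a real input.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)

# Complex test signal: a tone at +0.2 cycles/sample plus weak complex noise.
n = np.arange(4096)
noise = rng.standard_normal(n.size) + 1j * rng.standard_normal(n.size)
x = np.exp(2j * np.pi * 0.2 * n) + 0.1 * noise

f, p = signal.welch(x, return_onesided=False, detrend=False)

# Apply the same shift to both arrays so each PSD value keeps its frequency.
f_sorted = np.fft.fftshift(f)
p_sorted = np.fft.fftshift(p)
peak_freq = f_sorted[np.argmax(p_sorted)]

# For a real input the two-sided PSD is symmetric: P(-f) equals P(f).
xr = rng.standard_normal(4096)
fr, pr = signal.welch(xr, return_onesided=False, detrend=False)
half = pr.size // 2
symmetric = np.allclose(pr[1:half], pr[-1:-half:-1])
```

Plotting `p_sorted` against `f_sorted` then gives a curve that runs from -0.5 up to just below 0.5 without the wrap-around line.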

How to generate a random sample of points from a 3-D ellipsoid using Python?

I am trying to sample around 1000 points from a 3-D ellipsoid, uniformly. Is there some way to code it such that we can get points starting from the equation of the ellipsoid?
I want points on the surface of the ellipsoid.
Theory
Using this excellent answer to the MSE question How to generate points uniformly distributed on the surface of an ellipsoid? we can
generate a point uniformly on the sphere, apply the mapping f :
(x,y,z) -> (x'=ax,y'=by,z'=cz) and then correct the distortion
created by the map by discarding the point randomly with some
probability p(x,y,z).
Assuming that the 3 axes of the ellipsoid are named such that
0 < a < b < c
We discard a generated point with probability
p(x,y,z) = 1 - mu(x,y,z)/mu_max
i.e. we keep it with probability mu(x,y,z)/mu_max, where
mu(x,y,z) = ((acy)^2 + (abz)^2 + (bcx)^2)^0.5
and
mu_max = bc
Implementation
import numpy as np
np.random.seed(42)
# Function to generate a random point on a uniform sphere
# (relying on https://stackoverflow.com/a/33977530/8565438)
def randompoint(ndim=3):
    vec = np.random.randn(ndim,1)
    vec /= np.linalg.norm(vec, axis=0)
    return vec
# Give the length of each axis (example values):
a, b, c = 1, 2, 4
# Function to scale up generated points using the function `f` mentioned above:
f = lambda x,y,z : np.multiply(np.array([a,b,c]),np.array([x,y,z]))
# Keep the point with probability `mu(x,y,z)/mu_max`, ie
def keep(x, y, z, a=a, b=b, c=c):
    mu_xyz = ((a * c * y) ** 2 + (a * b * z) ** 2 + (b * c * x) ** 2) ** 0.5
    return mu_xyz / (b * c) > np.random.uniform(low=0.0, high=1.0)
# Generate points until we have, let's say, 1000 points:
n = 1000
points = []
while len(points) < n:
    [x], [y], [z] = randompoint()
    if keep(x, y, z):
        points.append(f(x, y, z))
Checks
Check if all points generated satisfy the ellipsoid condition (ie that x^2/a^2 + y^2/b^2 + z^2/c^2 = 1):
for p in points:
    pscaled = np.multiply(p,np.array([1/a,1/b,1/c]))
    assert np.allclose(np.sum(np.dot(pscaled,pscaled)),1)
Runs without raising any errors. Visualize the points:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
points = np.array(points)
ax.scatter(points[:, 0], points[:, 1], points[:, 2])
# set aspect ratio for the axes using https://stackoverflow.com/a/64453375/8565438
ax.set_box_aspect((np.ptp(points[:, 0]), np.ptp(points[:, 1]), np.ptp(points[:, 2])))
plt.show()
These points seem evenly distributed.
Problem with currently accepted answer
Generating a point on a sphere and then just reprojecting it onto an ellipsoid without any further correction will result in a distorted distribution. This is essentially the same as setting this post's p(x,y,z) to 0. Imagine an ellipsoid where one axis is orders of magnitude bigger than another; it is then easy to see that naive reprojection is not going to work.
Consider using Monte-Carlo simulation: generate a random 3D point; check if the point is inside the ellipsoid; if it is, keep it. Repeat until you get 1,000 points.
P.S. Since the OP changed their question, this answer is no longer valid.
J.F. Williamson, "Random selection of points distributed on curved surfaces", Physics in Medicine & Biology 32(10), 1987, describes a general method of choosing a uniformly random point on a parametric surface. It is an acceptance/rejection method that accepts or rejects each candidate point depending on its stretch factor (norm-of-gradient). To use this method for a parametric surface, several things have to be known about the surface, namely—
x(u, v), y(u, v) and z(u, v), which are functions that generate 3-dimensional coordinates from two dimensional coordinates u and v,
The ranges of u and v,
g(point), the norm of the gradient ("stretch factor") at each point on the surface, and
gmax, the maximum value of g for the entire surface.
The algorithm is then:
Generate a point on the surface, xyz.
If g(xyz) >= RNDU01()*gmax, where RNDU01() is a uniform random variate in [0, 1), accept the point. Otherwise, repeat this process.
Chen and Glotzer (2007) apply the method to the surface of a prolate spheroid (one form of ellipsoid) in "Simulation studies of a phenomenological model for elongated virus capsid formation", Physical Review E 75(5), 051504 (preprint).
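The acceptance/rejection steps listed above can be sketched as follows for an ellipsoid parametrised by two angles. This is an illustration under stated assumptions, not code from the cited papers: the function names `sample_parametric_surface`, `ellipsoid_xyz` and `ellipsoid_g` are hypothetical, the semi-axes are example values, the stretch factor is taken to be the norm of the cross product of the two partial derivatives of the parametrisation, and `gmax` is only an upper bound on it (which is sufficient for rejection sampling).

```python
import numpy as np

def sample_parametric_surface(xyz, g, gmax, u_range, v_range, n, rng):
    """Acceptance/rejection sampling on a parametric surface.

    xyz(u, v) maps the parameters to a 3-D point, g(u, v) is the stretch
    factor there, and gmax is an upper bound on g over the whole surface.
    """
    points = []
    while len(points) < n:
        u = rng.uniform(*u_range)
        v = rng.uniform(*v_range)
        # Accept with probability g(u, v) / gmax, otherwise draw again.
        if g(u, v) >= rng.random() * gmax:
            points.append(xyz(u, v))
    return np.array(points)

# Illustrative ellipsoid with semi-axes a, b, c (example values).
a, b, c = 1.0, 2.0, 4.0

def ellipsoid_xyz(t, p):
    return (a * np.sin(t) * np.cos(p), b * np.sin(t) * np.sin(p), c * np.cos(t))

def ellipsoid_g(t, p):
    # Norm of the cross product of the partial derivatives in t and p.
    st, ct = np.sin(t), np.cos(t)
    return st * np.sqrt((b * c * st * np.cos(p)) ** 2
                        + (a * c * st * np.sin(p)) ** 2
                        + (a * b * ct) ** 2)

rng = np.random.default_rng(0)
gmax = max(a * b, a * c, b * c)  # upper bound on ellipsoid_g
pts = sample_parametric_surface(ellipsoid_xyz, ellipsoid_g, gmax,
                                (0.0, np.pi), (0.0, 2.0 * np.pi), 500, rng)
```

A looser `gmax` only lowers the acceptance rate; it does not change the distribution of the accepted points.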
Here is a generic function to pick a random point on the surface of a sphere, spheroid or any triaxial ellipsoid with a, b and c parameters. Note that generating angles directly will not provide a uniform distribution and will cause excessive population of points along the z direction. Instead, phi is obtained as the inverse cosine of a randomly generated cos(phi).
import numpy as np
def random_point_ellipsoid(a,b,c):
    u = np.random.rand()
    v = np.random.rand()
    theta = u * 2.0 * np.pi
    phi = np.arccos(2.0 * v - 1.0)
    sinTheta = np.sin(theta)
    cosTheta = np.cos(theta)
    sinPhi = np.sin(phi)
    cosPhi = np.cos(phi)
    rx = a * sinPhi * cosTheta
    ry = b * sinPhi * sinTheta
    rz = c * cosPhi
    return rx, ry, rz
This function is adapted from this post: https://karthikkaranth.me/blog/generating-random-points-in-a-sphere/
One way of doing this which generalises to any shape or surface is to convert the surface to a voxel representation at arbitrarily high resolution (the higher the resolution the better, but also the slower). Then you can easily select voxels randomly however you want, and select a point on the surface within each chosen voxel using the parametric equation. The voxel selection should be completely unbiased; the selection of the point within the voxel will suffer the same biases that come from using the parametric equation, but if there are enough voxels the size of these biases will be very small.
You need high-quality cube intersection code, but for something like an ellipsoid that can be optimised quite easily. I'd suggest stepping through the bounding box subdivided into voxels. A quick distance check will eliminate most cubes, and you can do a proper intersection check for the ones where an intersection is possible. For the point within the cube I'd be tempted to do something simple like a random XYZ offset from the centre and then cast a ray from the centre of the ellipsoid through it; the selected point is where the ray intersects the surface. As I said above, it will be biased, but with small voxels the bias will probably be small enough.
There are libraries that do convex shape intersection very efficiently, and cube/ellipsoid will be one of the options. They will be highly optimised, but I think the distance culling would probably be worth doing by hand regardless. You will also need a library that differentiates between a surface intersection and one object being totally inside the other.
And if you know your ellipsoid is aligned to an axis, then you can do the voxel/edge intersection very easily as a stack of 2D square/ellipse intersection problems, with the set of squares to be tested defined as those that are adjacent to those in the layer above. That might be quicker.
One of the things that makes this approach more manageable is that you do not need to write all the code for edge cases (it is a lot of work to get around issues with floating point inaccuracies that can lead to missing or doubled voxels at the intersection). That's because these will be very rare, so they won't affect your sampling.
It might even be quicker to simply find all the voxels inside the ellipsoid and then throw away all the voxels with 6 neighbours... Lots of options. It all depends on how important performance is. This will be much slower than the other suggestions, but if you want ~1000 points then ~100,000 voxels feels about the minimum for the surface, so you probably need ~1,000,000 voxels in your bounding box. However, even testing 1,000,000 intersections is pretty fast on modern computers.
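The last variant (find all voxels inside the ellipsoid, then discard those with 6 inside neighbours) is short enough to sketch with NumPy. This is a rough illustration only: the semi-axes, the voxel size `h` and all names are assumptions, the ellipsoid is axis-aligned, and the number of shell voxels per unit area depends on the local orientation of the surface, so the result carries the kind of bias discussed above.

```python
import numpy as np

a, b, c = 3.0, 2.0, 1.0   # example semi-axes (assumed values)
h = 0.1                   # edge length of the cubic voxels

def axis(semi):
    # Voxel centres along one axis, with a margin of two voxels so that
    # the outermost layer of the grid always lies outside the ellipsoid.
    return np.arange(-semi - 2 * h, semi + 2.5 * h, h)

gx, gy, gz = axis(a), axis(b), axis(c)
X, Y, Z = np.meshgrid(gx, gy, gz, indexing="ij")
inside = (X / a) ** 2 + (Y / b) ** 2 + (Z / c) ** 2 <= 1.0

# A voxel is interior when all 6 face neighbours are inside as well.
interior = inside.copy()
for ax in range(3):
    interior &= np.roll(inside, 1, axis=ax) & np.roll(inside, -1, axis=ax)
surface = inside & ~interior

# Pick shell voxels at random and project their centres radially onto
# the ellipsoid by rescaling them so the ellipsoid equation holds.
rng = np.random.default_rng(0)
idx = np.argwhere(surface)
chosen = idx[rng.choice(len(idx), size=1000)]
pts = np.column_stack([gx[chosen[:, 0]], gy[chosen[:, 1]], gz[chosen[:, 2]]])
scale = np.sqrt((pts[:, 0] / a) ** 2 + (pts[:, 1] / b) ** 2 + (pts[:, 2] / c) ** 2)
pts = pts / scale[:, None]
```

Because the choice is made with replacement, the same voxel can be selected more than once; adding a small random offset inside each voxel before projecting would avoid exact duplicates.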
Depending on what "uniformly" refers to, different methods are applicable. In any case, we can use the parametric equations in spherical coordinates (from Wikipedia):
x = a*s*sin(theta)*cos(phi)
y = b*s*sin(theta)*sin(phi)
z = c*s*cos(theta)
where s = 1 refers to the surface of the ellipsoid given by the semi-axes a > b > c. From these equations we can derive the relevant volume/area element and generate points such that their probability of being generated is proportional to that volume/area element. This will provide constant volume/area density across the surface of the ellipsoid.
1. Constant volume density
This method generates points on the surface of an ellipsoid such that their volume density across the surface of the ellipsoid is constant. A consequence of this is that the one-dimensional projections (i.e. the x, y, z coordinates) are uniformly distributed; for details see the plot below.
The volume element for a triaxial ellipsoid is given by
dV = a*b*c * s^2 * sin(theta) ds dtheta dphi
and is thus proportional to sin(theta) (for 0 <= theta <= pi). We can use this as the basis for a probability distribution that indicates "how many" points should be generated for a given value of theta: where the volume density is low/high, the probability for generating a corresponding value of theta should be low/high, too.
Hence, we can use the function f(theta) = sin(theta)/2 as our probability distribution on the interval [0, pi]. The corresponding cumulative distribution function is F(theta) = (1 - cos(theta))/2. Now we can use Inverse transform sampling to generate values of theta according to f(theta) from a uniform random distribution. The values of phi can be obtained directly from a uniform distribution on [0, 2*pi].
Example code:
import matplotlib.pyplot as plt
import numpy as np
from numpy import sin, cos, pi
rng = np.random.default_rng(seed=0)
a, b, c = 10, 3, 1
N = 5000
phi = rng.uniform(0, 2*pi, size=N)
theta = np.arccos(1 - 2*rng.random(size=N))
x = a*sin(theta)*cos(phi)
y = b*sin(theta)*sin(phi)
z = c*cos(theta)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x, y, z, s=2)
plt.show()
which produces the following plot:
The following plot shows the one-dimensional projections (i.e. density plots of x, y, z):
import seaborn as sns
sns.kdeplot(data=dict(x=x, y=y, z=z))
plt.show()
2. Constant area density
This method generates points on the surface of an ellipsoid such that their area density is constant across the surface of the ellipsoid.
Again, we start by calculating the corresponding area element. For simplicity we can use SymPy:
from sympy import cos, sin, symbols, Matrix
a, b, c, t, p = symbols('a b c t p')
x = a*sin(t)*cos(p)
y = b*sin(t)*sin(p)
z = c*cos(t)
J = Matrix([
[x.diff(t), x.diff(p)],
[y.diff(t), y.diff(p)],
[z.diff(t), z.diff(p)],
])
print((J.T @ J).det().simplify())
This yields
-a**2*b**2*sin(t)**4 + a**2*b**2*sin(t)**2 + a**2*c**2*sin(p)**2*sin(t)**4 - b**2*c**2*sin(p)**2*sin(t)**4 + b**2*c**2*sin(t)**4
and further simplifies to (dividing by (a*b)**2 and taking the sqrt):
sin(t)*np.sqrt(1 + ((c/b)**2*sin(p)**2 + (c/a)**2*cos(p)**2 - 1)*sin(t)**2)
Since for this case the area element is more complex, we can use rejection sampling:
import matplotlib.pyplot as plt
import numpy as np
from numpy import cos, sin
def f_redo(t, p):
    return (
        sin(t)*np.sqrt(1 + ((c/b)**2*sin(p)**2 + (c/a)**2*cos(p)**2 - 1)*sin(t)**2)
        < rng.random(size=t.size)
    )
rng = np.random.default_rng(seed=0)
N = 5000
a, b, c = 10, 3, 1
t = rng.uniform(0, np.pi, size=N)
p = rng.uniform(0, 2*np.pi, size=N)
redo = f_redo(t, p)
while redo.any():
    t[redo] = rng.uniform(0, np.pi, size=redo.sum())
    p[redo] = rng.uniform(0, 2*np.pi, size=redo.sum())
    redo[redo] = f_redo(t[redo], p[redo])
x = a*np.sin(t)*np.cos(p)
y = b*np.sin(t)*np.sin(p)
z = c*np.cos(t)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x, y, z, s=2)
plt.show()
which yields the following distribution:
The following plot shows the corresponding one-dimensional projections (x, y, z):

Inverse FFT returns negative values when it should not

I have several points (x,y,z coordinates) in a 3D box with associated masses. I want to draw an histogram of the mass-density that is found in spheres of a given radius R.
I have written code that, provided I did not make any errors (which I think I may have), works in the following way:
1. My "real" data is huge, so I wrote a little code to randomly generate non-overlapping points with arbitrary mass in a box.
2. I compute a 3D histogram (weighted by mass) with a bin size about 10 times smaller than the radius of my spheres.
3. I take the FFT of my histogram, compute the wave-modes (kx, ky and kz) and use them to multiply my histogram in Fourier space by the analytic expression of the 3D top-hat window (sphere filtering) function in Fourier space.
4. I inverse FFT my newly computed grid.
Thus drawing a 1D histogram of the values in each bin would give me what I want.
My issue is the following: given what I do, there should not be any negative values in my inverted FFT grid (step 4), but I get some, with values much higher than the numerical error.
If I run my code on a small box (300x300x300 cm3, with the points separated by at least 1 cm) I do not get the issue. I do get it for 600x600x600 cm3 though.
If I set all the masses to 0, thus working on an empty grid, I do get back my 0 without any noted issues.
I here give my code in a full block so that it is easily copied.
import numpy as np
import matplotlib.pyplot as plt
import random
from numba import njit
# 1. Generate a bunch of points with masses from 1 to 3 separated by a radius of 1 cm
radius = 1
rangeX = (0, 100)
rangeY = (0, 100)
rangeZ = (0, 100)
rangem = (1,3)
qty = 20000 # or however many points you want
# Generate a set of all points within 1 of the origin, to be used as offsets later
deltas = set()
for x in range(-radius, radius+1):
    for y in range(-radius, radius+1):
        for z in range(-radius, radius+1):
            if x*x + y*y + z*z<= radius*radius:
                deltas.add((x,y,z))
X = []
Y = []
Z = []
M = []
excluded = set()
for i in range(qty):
    x = random.randrange(*rangeX)
    y = random.randrange(*rangeY)
    z = random.randrange(*rangeZ)
    m = random.uniform(*rangem)
    if (x,y,z) in excluded: continue
    X.append(x)
    Y.append(y)
    Z.append(z)
    M.append(m)
    excluded.update((x+dx, y+dy, z+dz) for (dx,dy,dz) in deltas)
print("There is ",len(X)," points in the box")
# Compute the 3D histogram
a = np.vstack((X, Y, Z)).T
b = 200
H, edges = np.histogramdd(a, weights=M, bins = b)
# Compute the FFT of the grid
Fh = np.fft.fftn(H, axes=(-3,-2, -1))
# Compute the different wave-modes
kx = 2*np.pi*np.fft.fftfreq(len(edges[0][:-1]))*len(edges[0][:-1])/(np.amax(X)-np.amin(X))
ky = 2*np.pi*np.fft.fftfreq(len(edges[1][:-1]))*len(edges[1][:-1])/(np.amax(Y)-np.amin(Y))
kz = 2*np.pi*np.fft.fftfreq(len(edges[2][:-1]))*len(edges[2][:-1])/(np.amax(Z)-np.amin(Z))
# I create a matrix containing the values of the filter in each point of the grid in Fourier space
R = 5
Kh = np.empty((len(kx),len(ky),len(kz)))
@njit(parallel=True)
def func_njit(kx, ky, kz, Kh):
    for i in range(len(kx)):
        for j in range(len(ky)):
            for k in range(len(kz)):
                if np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2) != 0:
                    Kh[i][j][k] = (np.sin((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)-(np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R*np.cos((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R))*3/((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)**3
                else:
                    Kh[i][j][k] = 1
    return Kh
Kh = func_njit(kx, ky, kz, Kh)
# I multiply each point of my grid by the associated value of the filter (multiplication in Fourier space = convolution in real space)
Gh = np.multiply(Fh, Kh)
# I take the inverse FFT of my filtered grid. I take the real part to get back floats but there should only be zeros for the imaginary part.
Density = np.real(np.fft.ifftn(Gh,axes=(-3,-2, -1)))
# Here it shows if there are negative values the magnitude of the error
print(np.min(Density))
D = Density.flatten()
N = np.mean(D)
# I then compute the histogram I want
hist, bins = np.histogram(D/N, bins='auto', density=True)
bin_centers = (bins[1:]+bins[:-1])*0.5
plt.plot(bin_centers, hist)
plt.xlabel('rho/rhom')
plt.ylabel('P(rho)')
plt.show()
Do you know why I'm getting these negative values? Do you think there is a simpler way to proceed?
Sorry if this is a very long post, I tried to make it very clear and will edit it with your comments, thanks a lot!
-EDIT-
A follow-up question on the issue can be found here.
The filter you create in the frequency domain is only an approximation to the filter you want to create. The problem is that we are dealing with the DFT here, not the continuous-domain FT (with its infinite frequencies). The Fourier transform of a ball is indeed the function you describe, however this function extends infinitely far in frequency -- it is not band-limited!
By sampling this function only within a window, you are effectively multiplying it with an ideal low-pass filter (the rectangle of the domain). This low-pass filter, in the spatial domain, has negative values. Therefore, the filter you create also has negative values in the spatial domain.
This is a slice through the origin of the inverse transform of Kh (after I applied fftshift to move the origin to the middle of the image, for better display):
As you can tell here, there is some ringing that leads to negative values.
One way to overcome this ringing is to apply a windowing function in the frequency domain. Another option is to generate a ball in the spatial domain, and compute its Fourier transform. This second option would be the simplest to achieve. Do remember that the kernel in the spatial domain must also have the origin at the top-left pixel to obtain a correct FFT.
A windowing function is typically applied in the spatial domain to avoid issues with the image border when computing the FFT. Here, I propose to apply such a window in the frequency domain to avoid similar issues when computing the IFFT. Note, however, that this will always further reduce the bandwidth of the kernel (the windowing function would work as a low-pass filter after all), and therefore yield a smoother transition of foreground to background in the spatial domain (i.e. the spatial domain kernel will not have as sharp a transition as you might like). The best known windowing functions are Hamming and Hann windows, but there are many others worth trying out.
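The second option mentioned above (build the ball in the spatial domain and take its FFT) could look roughly like the sketch below. It is an illustration under assumptions rather than a drop-in fix: the helper name `ball_kernel_fft` is hypothetical, the radius is expressed in grid cells, the grid is treated as periodic (as the FFT does anyway), and a small random array stands in for the mass histogram.

```python
import numpy as np

def ball_kernel_fft(shape, radius):
    """FFT of a normalised ball whose centre sits at index (0, 0, 0).

    `radius` is in grid cells. Offsets are wrapped periodically, which
    places the origin at the corner of the array, as the FFT expects.
    """
    axes = [np.fft.fftfreq(n) * n for n in shape]   # 0, 1, ..., -2, -1
    dx, dy, dz = np.meshgrid(*axes, indexing="ij")
    ball = (dx**2 + dy**2 + dz**2 <= radius**2).astype(float)
    ball /= ball.sum()                              # weights sum to 1
    return np.fft.fftn(ball)

rng = np.random.default_rng(0)
H = rng.random((32, 32, 32))          # stand-in for the mass histogram
Kh = ball_kernel_fft(H.shape, radius=4)
filtered = np.fft.ifftn(np.fft.fftn(H) * Kh)
density = np.real(filtered)
```

Since both the histogram and the spatial kernel are non-negative, the filtered grid can only go negative at the level of rounding error, and the imaginary part of `filtered` stays at that level too.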
Unsolicited advice:
I simplified your code to compute Kh to the following:
kr = np.sqrt(kx[:,None,None]**2 + ky[None,:,None]**2 + kz[None,None,:]**2)
kr *= R
Kh = (np.sin(kr)-kr*np.cos(kr))*3/(kr)**3
Kh[0,0,0] = 1
I find this easier to read than the nested loops. It should also be significantly faster, and avoid the need for njit. Note that you were computing the same distance (what I call kr here) 5 times. Factoring out such computation is not only faster, but yields more readable code.
Just a guess:
Where do you get the idea that the imaginary part MUST be zero? Have you ever tried to take the absolute values (sqrt(re^2 + im^2)) and forget about the phase instead of just taking the real part? Just something that came to my mind.

Python: Choose the n points better distributed from a bunch of points

I have a numpy array of points in an XY plane like:
I want to select the n points (let's say 100) that are best distributed among all these points. That is, I want the density of points to be constant everywhere.
Something like this:
Is there any pythonic way or any numpy/scipy function to do this?
@EMS is very correct that you should give a lot of thought to exactly what you want.
There are more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force-ish approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.
The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that number.
A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy, as well.
As an example of the simplest possible, brute force, grid approach: (There's a lot we could do better, here.)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))
# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000
# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)
# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])
# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()
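For comparison, the same gridding idea can be written in "pure" NumPy, as mentioned above. The sketch below is one possible way to do it, not the answerer's code: the names `cell`, `perm`, `keep`, `x_sub` and `y_sub` are illustrative. It shuffles the point indices first and then relies on `np.unique(..., return_index=True)` returning the first occurrence of each cell, which amounts to one random point per occupied cell while keeping the x and y of each selected point together.

```python
import numpy as np

rng = np.random.default_rng(0)
total_num, subset_num = 100_000, 1000
x, y = rng.normal(0, 1, (2, total_num))

nbins = int(np.sqrt(subset_num))
# Integer cell coordinates in [0, nbins - 1] for every point.
ix = np.minimum((nbins * (x - x.min()) / (x.max() - x.min())).astype(int), nbins - 1)
iy = np.minimum((nbins * (y - y.min()) / (y.max() - y.min())).astype(int), nbins - 1)
cell = iy * nbins + ix

# Shuffle first: np.unique then reports the first occurrence of each cell
# in the shuffled order, i.e. one random point per occupied cell.
perm = rng.permutation(total_num)
_, first = np.unique(cell[perm], return_index=True)
keep = perm[first]
x_sub, y_sub = x[keep], y[keep]
```

As with the pandas version, the number of selected points is the number of occupied cells, which is at most `nbins**2` and usually smaller.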
Loosely based on @EMS's suggestion in a comment, here's another approach.
We'll calculate the density of points using a kernel density estimate, and then use the inverse of that as the probability that a given point will be chosen.
scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, special case here of pairwise distances, etc). However, that's beyond the scope of this particular question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run.
The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))
# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)
# Try playing around with this weight. Compare 1/dens, 1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()
# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as an object array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()
Unless you give a specific criterion for defining "better distributed" we can't give a definite answer.
The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be correctly represented.
A different approach might be as follows:
1. Calculate the distance matrix between all pairs of points.
2. Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, such as eigenvector centrality, betweenness centrality or Bonacich centrality.
3. Order the points in descending order according to the centrality measure, and keep the first 100.
4. Repeat steps 1-3, possibly using a different notion of "distance" between points and with different centrality measures.
Many of these functions are provided directly by SciPy, NetworkX, and scikit-learn and will work directly on a NumPy array.
If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo methods. In particular, you could try to compute the convex hull of the set of points and then apply a QMC technique to regularly sample from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.
Yet another interesting approach would be to simply run the K-means algorithm on the scattered data, with a fixed number of clusters K=100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means and then sample from that larger set of possible means. Since your data do not appear to actually cluster into 100 components naturally, the convergence of this approach won't be very good and may require running the algorithm for a large number of iterations. This also has the downside that the resulting set of 100 points are not necessarily points that come from the observed data, and instead will be local averages of many points.
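A minimal sketch of the K-means idea, using `scipy.cluster.vq.kmeans2` with k-means++ initialisation (`minit="++"`, available in SciPy 1.2 and later). The synthetic data, the iteration count and the variable names are assumptions for illustration. The final step, which snaps each cluster mean to its nearest observed point with a k-d tree, is an addition of this sketch that addresses the last downside mentioned above; it is not part of the suggestion itself.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.normal(0, 1, (5000, 2))   # stand-in for the scattered data
k = 100

# Cluster means: spread over the data, but not members of the data set.
centroids, labels = kmeans2(points, k, iter=20, minit="++")

# Optional extra step: snap each mean to its nearest observed point.
_, nearest = cKDTree(points).query(centroids)
selected = points[nearest]
```

The centroids still follow the data density to some degree, so this evens out the selection rather than making the density strictly constant.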
This method iteratively picks, from the remaining points, the point whose minimum distance to the already picked points is largest. It has terrible time complexity but produces fairly uniformly distributed results:
from numpy import array, argmax, ndarray, vstack
from numpy.random import randint
from scipy.spatial.distance import cdist
def well_spaced_points(points: ndarray, num_points: int):
    """
    Pick `num_points` well-spaced points from `points` array.
    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick.
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # pick a random starting point from the whole array
    current_point_index = randint(0, points.shape[0])
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[: current_point_index],
        points[current_point_index + 1:]
    ))
    # while there are more points to pick
    while picked_points.shape[0] < num_points:
        # find the remaining point farthest from its nearest picked point
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # add it to picked points and remove it from remaining
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[: i_furthest],
            remaining_points[i_furthest + 1:]
        ))
    return picked_points

Calculate overlap area of two functions

I need to calculate the area where two functions overlap. I use normal distributions in this particular simplified example, but I need a more general procedure that adapts to other functions too.
See image below to get an idea of what I mean, where the red area is what I'm after:
This is the MWE I have so far:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
# Generate random, normally distributed data.
a = np.random.normal(1., 0.1, 1000)
b = np.random.normal(1., 0.1, 1000)
# Obtain KDE estimates for each set of data.
xmin, xmax = -1., 2.
x_pts = np.mgrid[xmin:xmax:1000j]
# Kernels.
ker_a = stats.gaussian_kde(a)
ker_b = stats.gaussian_kde(b)
# KDEs for plotting.
kde_a = np.reshape(ker_a(x_pts).T, x_pts.shape)
kde_b = np.reshape(ker_b(x_pts).T, x_pts.shape)
# Random sample from a KDE distribution.
sample = ker_a.resample(size=1000)
# Compute the points below which to integrate.
iso = ker_b(sample)
# Filter the sample.
insample = ker_a(sample) < iso
# As per Monte Carlo, the integral is equivalent to the
# probability of drawing a point that gets through the
# filter.
integral = insample.sum() / float(insample.shape[0])
print(integral)
plt.xlim(0.4,1.9)
plt.plot(x_pts, kde_a)
plt.plot(x_pts, kde_b)
plt.show()
where I apply Monte Carlo to obtain the integral.
The problem with this method is that when I evaluate sampled points in either distribution with ker_b(sample) (or ker_a(sample)), I get values placed directly over the KDE line. Because of this, even clearly overlapped distributions, which should return a common/overlapped area value very close to 1, instead return small values (the total area under either curve is 1, since they are probability density estimates).
How could I fix this code to give the expected results?
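This is how I applied Zhenya's answer:
from scipy.integrate import quad
# Calculate overlap between the two KDEs.
def y_pts(pt):
    # ker_a and ker_b return length-1 arrays for a scalar input.
    y_pt = min(ker_a(pt)[0], ker_b(pt)[0])
    return y_pt
# Store overlap value (quad returns the integral and an error estimate).
overlap, abs_err = quad(y_pts, -1., 2.)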
The red area on the plot is the integral of min(f(x), g(x)), where f and g are your two functions, green and blue. To evaluate the integral, you can use any of the integrators from scipy.integrate (quad's the default one, I'd say) -- or an MC integrator, of course, but I don't quite see the point of that.
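To make the min(f(x), g(x)) integral concrete, here is a small sketch with two analytic normal densities in place of the KDEs. The means, the common standard deviation and the integration limits are example values chosen for illustration. For two normals with equal sigma the overlap has the closed form 2*Phi(-|mu1 - mu2| / (2*sigma)), which gives something to compare the numerical result against.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Two example densities (assumed values): same sigma, means 0.2 apart.
f = stats.norm(loc=1.0, scale=0.1).pdf
g = stats.norm(loc=1.2, scale=0.1).pdf

def overlap_integrand(x):
    # Height of the overlapping region at x.
    return min(f(x), g(x))

# The curves cross at x = 1.1, where the integrand has a kink; passing it
# through `points` lets quad split the interval there.
overlap, abs_err = quad(overlap_integrand, 0.0, 2.2, points=[1.1])

# Closed form for equal sigma: 2 * Phi(-|mu1 - mu2| / (2 * sigma)).
expected = 2.0 * stats.norm.cdf(-0.2 / (2.0 * 0.1))
```

With KDE objects instead of analytic densities the integrand is built the same way, and identical distributions give an overlap close to 1.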
I think another solution would be to multiply the two curves, then take the integral. You may want to do some sort of normalization. The analogy is orbital overlap in chemistry: https://en.wikipedia.org/wiki/Orbital_overlap
