I have the following problem. Imaging you have a set of coordinates that are somewhat organized in a regular pattern, such as the one shown below.
What i want to do is to automatically extract coordinates, such that they are ordered from left to right and top to bottom. In addition, the total number of coordinates should be as large as possible, but only include coordinates, such that the extracted coordinates are on a nearly rectangular grid (even if the coordinates have a different symmetry, e.g. hexagonal). I always want to extract coordinates that follow a rectangular unit cell structure.
For the example shown above, the largest number that contain such an orthorhombic set would be 8 x 8 coordinates (lets call this dimensions: m x n), as framed by the red rectangle.
The problem is that the given coordinates are noisy and distorted.
My approach was to generate an artificial lattice, and minimizing the difference to the given coordinates, taking into account some rotation, shift and simple distortion of the lattice. However, it turned out to be tricky to define a cost function that covers the complexity of the problem, i.e. minimizing the difference between the given coordinates and the fitted lattice, but also maximizing the grid components m x n.
If anyone has a smart idea how to tackle this problem, maybe also with machine learning algorithms, i would be very thankful.
Here is the code that i have used so far:
A function to generate the artificial lattice with m x n coordinates that are spaced by a and b in the "n" and "m" directions. The angle theta allows for a rotation of the lattice.
def lattice(m, n, a, b, theta):
coords = []
for j in range(m):
for i in range(n):
coords.append([np.sin(theta)*a*i + np.cos(theta)*b*j, np.cos(theta)*a*i - np.sin(theta)*b*j])
return np.array(coords)
I used the following function to measure the mean minimal distance between points, which is a good starting point for fitting:
def mean_min_distance(coords):
from scipy.spatial import distance
cd = distance.cdist(coords, coords)
cd_1 = np.where(cd == 0, np.nan, cd)
return np.mean(np.nanmin(cd_1, axis=1))
The following function provides all possible combinations of m x n that theoretically fit into the lengths of the coordinates, whose arrangement is assumed to be unknown. The ability to limit this to minimal and maximal values is included already:
def get_all_mxn(l, min_m=2, min_n=2, max_m=None, max_n=None):
poss = []
if max_m is None:
max_m = l + 1
if max_n is None:
max_n = l +1
for i in range(min_m, max_m):
for j in range(min_n, max_n):
if i * j <= l:
poss.append([i, j])
return np.array(poss)
The definition of the costfunction i used (for one particular set of m x n). So i first wanted to get a good fit for a certain m x n arrangement.
def cost(x0):
a, b, theta, shift_a, shift_b, dd1 = x0
# generate lattice
l = lattice(m, n, a, b, theta)
# distort lattice by affine transformation
distortion_matr = np.array([[1, dd1], [0, 1]])
l = np.dot(distortion_matr, l.T).T
# shift lattice
l = l + np.array((shift_b, shift_a))
# Some padding to make the lists the same length
len_diff = coords.shape[0] - l.shape[0]
l = np.append(l, (1e3, 1e3)*len_diff).reshape((l.shape[0] + len_diff, 2))
# calculate all distances between all points
cd = distance.cdist(coords, l)
minimum distance between each artificial lattice point and all coords
cd_min = np.min(cd[:, :coords.shape[0] - len_diff], axis=0)
# returns root mean square difference of all minimal distances
return np.sqrt(np.sum(np.abs(cd_min) ** 2) )
I then run the minimization:
md = mean_min_distance(coords)
# initial guess
x0 = np.array((md, md, np.deg2rad(-3.), 3, 1, 0.12))
res = minimize(cost, x0)
However, the results are extremely dependend on the initial parameter x0 and i have not even included a fitting of m and n.
Related
I am trying to sample around 1000 points from a 3-D ellipsoid, uniformly. Is there some way to code it such that we can get points starting from the equation of the ellipsoid?
I want points on the surface of the ellipsoid.
Theory
Using this excellent answer to the MSE question How to generate points uniformly distributed on the surface of an ellipsoid? we can
generate a point uniformly on the sphere, apply the mapping f :
(x,y,z) -> (x'=ax,y'=by,z'=cz) and then correct the distortion
created by the map by discarding the point randomly with some
probability p(x,y,z).
Assuming that the 3 axes of the ellipsoid are named such that
0 < a < b < c
We discard a generated point with
p(x,y,z) = 1 - mu(x,y,y)/mu_max
probability, ie we keep it with mu(x,y,y)/mu_max probability where
mu(x,y,z) = ((acy)^2 + (abz)^2 + (bcx)^2)^0.5
and
mu_max = bc
Implementation
import numpy as np
np.random.seed(42)
# Function to generate a random point on a uniform sphere
# (relying on https://stackoverflow.com/a/33977530/8565438)
def randompoint(ndim=3):
vec = np.random.randn(ndim,1)
vec /= np.linalg.norm(vec, axis=0)
return vec
# Give the length of each axis (example values):
a, b, c = 1, 2, 4
# Function to scale up generated points using the function `f` mentioned above:
f = lambda x,y,z : np.multiply(np.array([a,b,c]),np.array([x,y,z]))
# Keep the point with probability `mu(x,y,z)/mu_max`, ie
def keep(x, y, z, a=a, b=b, c=c):
mu_xyz = ((a * c * y) ** 2 + (a * b * z) ** 2 + (b * c * x) ** 2) ** 0.5
return mu_xyz / (b * c) > np.random.uniform(low=0.0, high=1.0)
# Generate points until we have, let's say, 1000 points:
n = 1000
points = []
while len(points) < n:
[x], [y], [z] = randompoint()
if keep(x, y, z):
points.append(f(x, y, z))
Checks
Check if all points generated satisfy the ellipsoid condition (ie that x^2/a^2 + y^2/b^2 + z^2/c^2 = 1):
for p in points:
pscaled = np.multiply(p,np.array([1/a,1/b,1/c]))
assert np.allclose(np.sum(np.dot(pscaled,pscaled)),1)
Runs without raising any errors. Visualize the points:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
points = np.array(points)
ax.scatter(points[:, 0], points[:, 1], points[:, 2])
# set aspect ratio for the axes using https://stackoverflow.com/a/64453375/8565438
ax.set_box_aspect((np.ptp(points[:, 0]), np.ptp(points[:, 1]), np.ptp(points[:, 2])))
plt.show()
These points seem evenly distributed.
Problem with currently accepted answer
Generating a point on a sphere and then just reprojecting it without any further corrections to an ellipse will result in a distorted distribution. This is essentially the same as setting this posts's p(x,y,z) to 0. Imagine an ellipsoid where one axis is orders of magnitude bigger than another. This way, it is easy to see, that naive reprojection is not going to work.
Consider using Monte-Carlo simulation: generate a random 3D point; check if the point is inside the ellipsoid; if it is, keep it. Repeat until you get 1,000 points.
P.S. Since the OP changed their question, this answer is no longer valid.
J.F. Williamson, "Random selection of points distributed on curved surfaces", Physics in Medicine & Biology 32(10), 1987, describes a general method of choosing a uniformly random point on a parametric surface. It is an acceptance/rejection method that accepts or rejects each candidate point depending on its stretch factor (norm-of-gradient). To use this method for a parametric surface, several things have to be known about the surface, namely—
x(u, v), y(u, v) and z(u, v), which are functions that generate 3-dimensional coordinates from two dimensional coordinates u and v,
The ranges of u and v,
g(point), the norm of the gradient ("stretch factor") at each point on the surface, and
gmax, the maximum value of g for the entire surface.
The algorithm is then:
Generate a point on the surface, xyz.
If g(xyz) >= RNDU01()*gmax, where RNDU01() is a uniform random variate in [0, 1), accept the point. Otherwise, repeat this process.
Chen and Glotzer (2007) apply the method to the surface of a prolate spheroid (one form of ellipsoid) in "Simulation studies of a phenomenological model for elongated virus capsid formation", Physical Review E 75(5), 051504 (preprint).
Here is a generic function to pick a random point on a surface of a sphere, spheroid or any triaxial ellipsoid with a, b and c parameters. Note that generating angles directly will not provide uniform distribution and will cause excessive population of points along z direction. Instead, phi is obtained as an inverse of randomly generated cos(phi).
import numpy as np
def random_point_ellipsoid(a,b,c):
u = np.random.rand()
v = np.random.rand()
theta = u * 2.0 * np.pi
phi = np.arccos(2.0 * v - 1.0)
sinTheta = np.sin(theta);
cosTheta = np.cos(theta);
sinPhi = np.sin(phi);
cosPhi = np.cos(phi);
rx = a * sinPhi * cosTheta;
ry = b * sinPhi * sinTheta;
rz = c * cosPhi;
return rx, ry, rz
This function is adopted from this post: https://karthikkaranth.me/blog/generating-random-points-in-a-sphere/
One way of doing this whch generalises for any shape or surface is to convert the surface to a voxel representation at arbitrarily high resolution (the higher the resolution the better but also the slower). Then you can easily select the voxels randomly however you want, and then you can select a point on the surface within the voxel using the parametric equation. The voxel selection should be completely unbiased, and the selection of the point within the voxel will suffer the same biases that come from using the parametric equation but if there are enough voxels then the size of these biases will be very small.
You need a high quality cube intersection code but with something like an elipsoid that can optimised quite easily. I'd suggest stepping through the bounding box subdivided into voxels. A quick distance check will eliminate most cubes and you can do a proper intersection check for the ones where an intersection is possible. For the point within the cube I'd be tempted to do something simple like a random XYZ distance from the centre and then cast a ray from the centre of the elipsoid and the selected point is where the ray intersects the surface. As I said above, it will be biased but with small voxels, the bias will probably be small enough.
There are libraries that do convex shape intersection very efficiently and cube/elipsoid will be one of the options. They will be highly optimised but I think the distance culling would probably be worth doing by hand whatever. And you will need a library that differentiates between a surface intersection and one object being totally inside the other.
And if you know your elipsoid is aligned to an axis then you can do the voxel/edge intersection very easily as a stack of 2D square intersection elipse problems with the set of squares to be tested defined as those that are adjacent to those in the layer above. That might be quicker.
One of the things that makes this approach more managable is that you do not need to write all the code for edge cases (it is a lot of work to get around issues with floating point inaccuracies that can lead to missing or doubled voxels at the intersection). That's because these will be very rare so they won't affect your sampling.
It might even be quicker to simply find all the voxels inside the elipse and then throw away all the voxels with 6 neighbours... Lots of options. It all depends how important performance is. This will be much slower than the opther suggestions but if you want ~1000 points then ~100,000 voxels feels about the minimum for the surface, so you probably need ~1,000,000 voxels in your bounding box. However even testing 1,000,000 intersections is pretty fast on modern computers.
Depending on what "uniformly" refers to, different methods are applicable. In any case, we can use the parametric equations using spherical coordinates (from Wikipedia):
where s = 1 refers to the ellipsoid given by the semi-axes a > b > c. From these equations we can derive the relevant volume/area element and generate points such that their probability of being generated is proportional to that volume/area element. This will provide constant volume/area density across the surface of the ellipsoid.
1. Constant volume density
This method generates points on the surface of an ellipsoid such that their volume density across the surface of the ellipsoid is constant. A consequence of this is that the one-dimensional projections (i.e. the x, y, z coordinates) are uniformly distributed; for details see the plot below.
The volume element for a triaxial ellipsoid is given by (see here):
and is thus proportional to sin(theta) (for 0 <= theta <= pi). We can use this as the basis for a probability distribution that indicates "how many" points should be generated for a given value of theta: where the area density is low/high, the probability for generating a corresponding value of theta should be low/high, too.
Hence, we can use the function f(theta) = sin(theta)/2 as our probability distribution on the interval [0, pi]. The corresponding cumulative distribution function is F(theta) = (1 - cos(theta))/2. Now we can use Inverse transform sampling to generate values of theta according to f(theta) from a uniform random distribution. The values of phi can be obtained directly from a uniform distribution on [0, 2*pi].
Example code:
import matplotlib.pyplot as plt
import numpy as np
from numpy import sin, cos, pi
rng = np.random.default_rng(seed=0)
a, b, c = 10, 3, 1
N = 5000
phi = rng.uniform(0, 2*pi, size=N)
theta = np.arccos(1 - 2*rng.random(size=N))
x = a*sin(theta)*cos(phi)
y = b*sin(theta)*sin(phi)
z = c*cos(theta)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x, y, z, s=2)
plt.show()
which produces the following plot:
The following plot shows the one-dimensional projections (i.e. density plots of x, y, z):
import seaborn as sns
sns.kdeplot(data=dict(x=x, y=y, z=z))
plt.show()
2. Constant area density
This method generates points on the surface of an ellipsoid such that their area density is constant across the surface of the ellipsoid.
Again, we start by calculating the corresponding area element. For simplicity we can use SymPy:
from sympy import cos, sin, symbols, Matrix
a, b, c, t, p = symbols('a b c t p')
x = a*sin(t)*cos(p)
y = b*sin(t)*sin(p)
z = c*cos(t)
J = Matrix([
[x.diff(t), x.diff(p)],
[y.diff(t), y.diff(p)],
[z.diff(t), z.diff(p)],
])
print((J.T # J).det().simplify())
This yields
-a**2*b**2*sin(t)**4 + a**2*b**2*sin(t)**2 + a**2*c**2*sin(p)**2*sin(t)**4 - b**2*c**2*sin(p)**2*sin(t)**4 + b**2*c**2*sin(t)**4
and further simplifies to (dividing by (a*b)**2 and taking the sqrt):
sin(t)*np.sqrt(1 + ((c/b)**2*sin(p)**2 + (c/a)**2*cos(p)**2 - 1)*sin(t)**2)
Since for this case the area element is more complex, we can use rejection sampling:
import matplotlib.pyplot as plt
import numpy as np
from numpy import cos, sin
def f_redo(t, p):
return (
sin(t)*np.sqrt(1 + ((c/b)**2*sin(p)**2 + (c/a)**2*cos(p)**2 - 1)*sin(t)**2)
< rng.random(size=t.size)
)
rng = np.random.default_rng(seed=0)
N = 5000
a, b, c = 10, 3, 1
t = rng.uniform(0, np.pi, size=N)
p = rng.uniform(0, 2*np.pi, size=N)
redo = f_redo(t, p)
while redo.any():
t[redo] = rng.uniform(0, np.pi, size=redo.sum())
p[redo] = rng.uniform(0, 2*np.pi, size=redo.sum())
redo[redo] = f_redo(t[redo], p[redo])
x = a*np.sin(t)*np.cos(p)
y = b*np.sin(t)*np.sin(p)
z = c*np.cos(t)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x, y, z, s=2)
plt.show()
which yields the following distribution:
The following plot shows the corresponding one-dimensional projections (x, y, z):
Consider a 3D numpy array D of dimension, say, (30 x 40 x 50). For each voxel D[x,y,z] I want to store a vector that contains neighboring voxels within a certain radius (including the D[x,y,z] itself).
(As an example here is a picture of such a sphere of radius 2: https://puu.sh/wwIYW/e3bd63ceae.png)
Is there a simple and fast way to code this?
I have written a function for it, but it is painfully slow and IDLE eventually crashes because the data structure I store the vectors in becomes too large.
Current code:
def searchlight(M_in):
radius = 4
[m,n,k] = M_in.shape
M_out = np.zeros([m,n,k],dtype=object)
count = 0
for i in range(m):
for j in range(n):
for z in range(k):
i_interval = list(range((i-4),(i+5)))
j_interval = list(range((j-4),(j+5)))
z_interval = list(range((z-4),(z+5)))
coordinates = list(itertools.product(i_interval,j_interval,z_interval))
coordinates = [pair for pair in coordinates if ((abs(pair[0]-i)+abs(pair[1]-j)+abs(pair[2]-z))<=radius)]
coordinates = [pair for pair in coordinates if ((pair[0]>=0) and (pair[1]>=0) and pair[2]>=0) and (pair[0]<m) and (pair[1]<n) and (pair[2]<k)]
out = []
for pair in coordinates:
out.append(M_in[pair[0],pair[1],pair[2]])
M_out[i,j,z] = out
count = count +1
return M_out
Here a way to do that. For efficiency, you need therefore to use ndarrays : This only take in account complete voxels. Edges must be managed "by hand".
from pylab import *
a=rand(100,100,100) # the data
r=4
ra=range(-r,r+1)
sphere=array([[x,y,z] for x in ra for y in ra for z in ra if np.abs((x,y,z)).sum()<=r])
# the unit "sphere"
indcenters=array(meshgrid(*(range(r,n-r) for n in a.shape),indexing='ij'))
# indexes of the centers of the voxels. edges are cut.
all_inds=(indcenters[newaxis].T+sphere.T).T
#all the indexes.
voxels=np.stack([a[tuple(inds)] for inds in all_inds],-1)
# the voxels.
#voxels.shape is (92, 92, 92, 129)
All the costly operations are vectorized. Comprehension lists are prefered for clarity in external loop.
You can now perform vectorized operations on voxels. for exemple the brightest voxel :
light=voxels.sum(-1)
print(np.unravel_index(light.argmax(),light.shape))
#(33,72,64)
All of this is of course extensive in memory. you must split your space for
big data or voxels.
Since you say the data structure is too large, you'll likely have to compute the vector on the fly for a given voxel. You can do this pretty quickly though:
class SearchLight(object):
def __init__(self, M_in, radius):
self.M_in = M_in
m, n, k = self.M_in.shape
# compute the sphere coordinates centered at (0,0,0)
# just like in your sample code
i_interval = list(range(-radius,radius+1))
j_interval = list(range(-radius,radius+1))
z_interval = list(range(-radius,radius+1))
coordinates = list(itertools.product(i_interval,j_interval,z_interval))
coordinates = [pair for pair in coordinates if ((abs(pair[0])+abs(pair[1])+abs(pair[2]))<=radius)]
# store those indices as a template
self.sphere_indices = np.array(coordinates)
def get_vector(self, i, j, k):
# offset sphere coordinates by the requested centre.
coordinates = self.sphere_indices + [i,j,k]
# filter out of bounds coordinates
coordinates = coordinates[(coordinates >= 0).all(1)]
coordinates = coordinates[(coordinates < self.M_in.shape).all(1)]
# use those coordinates to index the initial array.
return self.M_in[coordinates[:,0], coordinates[:,1], coordinates[:,2]]
To use the object on a given array you can simply do:
sl = SearchLight(M_in, 4)
# get vector of values for voxel i,j,k
vector = sl.get_vector(i,j,k)
This should give you the same vector you would get from
M_out[i,j,k]
in your sample code, without storing all the results at once in memory.
This can also probably be further optimized, particularly in terms of the coordinate filtering, but it may not be necessary. Hope that helps.
For example, find the image below, which explains the problem for a simple 2D case. The label (N) and coordinates (x,y) for each point is known. I need to find all the point labels that lie within the red circle
My actual problem is in 3D and the points are not uniformly distributed
Sample input file which contain coordinates of 7.25 M points is attached here point file.
I tried the following piece of code
import numpy as np
C = [50,50,50]
R = 20
centroid = np.loadtxt('centroid') #chk the file attached
def dist(x,y): return sum([(xi-yi)**2 for xi, yi in zip(x,y)])
elabels=[i+1 for i in range(len(centroid)) if dist(C,centroid[i])<=R**2]
For an single search it takes ~ 10 min. Any suggestions to make it faster ?
Thanks,
Prithivi
When using numpy, avoid using list comprehensions on arrays.
Your computation can be done using vectorized expressions like this
centre = np.array((50., 50., 50.))
points = np.loadtxt('data')
distances2= np.sum((points-centre)**2, axis=1)
points is a N x 2 array, points-centre is also a N x 2 array,
(points-centre)**2 computes the squares of each element of the difference and eventually np.sum(..., axis=1) sums the elements of the squared differences along axis no. 1, that is, across columns.
To filter the array of positions, you can use boolean indexing
close = points[distances2<max_dist**2]
You are heavily calling the dist function. You could try to low level optimize it, and control with the timeit Python module which is more efficient. On my machine, I tried this one:
def dist(x,y):
d0 = y[0] -x[0]
d1 = y[1] -x[1]
d2 = y[2] -x[2]
return d0 * d0 + d1*d1 + d2*d2
and timeit said it was more than 3 times quicker.
This one was just in the middle:
def dist(x,y):
s = 0
for i in range(len(x)):
d = y[i] - x[i]
s += d * d
return s
I am trying to generate an efficient code for generating a number of random position vectors which I then use to calculate a pair correlation function. I am wondering if there is straightforward way to set a constraint on the minimum distance allowed between any two points placed in my box.
My code currently is as follows:
def pointRun(number, dr):
"""
Compute the 3D pair correlation function
for a random distribution of 'number' particles
placed into a 1.0x1.0x1.0 box.
"""
## Create array of distances over which to calculate.
r = np.arange(0., 1.0+dr, dr)
## Generate list of arrays to define the positions of all points,
## and calculate number density.
a = np.random.rand(number, 3)
numberDensity = len(a)/1.0**3
## Find reference points within desired region to avoid edge effects.
b = [s for s in a if all(s > 0.4) and all(s < 0.6) ]
## Compute pairwise correlation for each reference particle
dist = scipy.spatial.distance.cdist(a, b, 'euclidean')
allDists = dist[(dist < np.sqrt(3))]
## Create histogram to generate radial distribution function, (RDF) or R(r)
Rr, bins = np.histogram(allDists, bins=r, density=False)
## Make empty containers to hold radii and pair density values.
radii = []
rhor = []
## Normalize RDF values by distance and shell volume to get pair density.
for i in range(len(Rr)):
y = (r[i] + r[i+1])/2.
radii.append(y)
x = np.average(Rr[i])/(4./3.*np.pi*(r[i+1]**3 - r[i]**3))
rhor.append(x)
## Generate normalized pair density function, by total number density
gr = np.divide(rhor, numberDensity)
return radii, gr
I have previously tried using a loop that calculated all distances for each point as it was made and then accepted or rejected. This method was very slow if I use a lot of points.
Here is a scalable O(n) solution using numpy. It works by initially specifying an equidistant grid of points and then perturbing the points by some amount keeping the distance between the points at most min_dist.
You'll want to tweak the number of points, box shape and perturbation sensitivity to get the min_dist you want.
Note: If you fix the size of a box and specify a minimum distance between every point, it makes sense that there will be a limit to the number of points you can draw satisfying the minimum distance.
import numpy as np
import matplotlib.pyplot as plt
# specify params
n = 500
shape = np.array([64, 64])
sensitivity = 0.8 # 0 means no movement, 1 means max distance is init_dist
# compute grid shape based on number of points
width_ratio = shape[1] / shape[0]
num_y = np.int32(np.sqrt(n / width_ratio)) + 1
num_x = np.int32(n / num_y) + 1
# create regularly spaced neurons
x = np.linspace(0., shape[1]-1, num_x, dtype=np.float32)
y = np.linspace(0., shape[0]-1, num_y, dtype=np.float32)
coords = np.stack(np.meshgrid(x, y), -1).reshape(-1,2)
# compute spacing
init_dist = np.min((x[1]-x[0], y[1]-y[0]))
min_dist = init_dist * (1 - sensitivity)
assert init_dist >= min_dist
print(min_dist)
# perturb points
max_movement = (init_dist - min_dist)/2
noise = np.random.uniform(
low=-max_movement,
high=max_movement,
size=(len(coords), 2))
coords += noise
# plot
plt.figure(figsize=(10*width_ratio,10))
plt.scatter(coords[:,0], coords[:,1], s=3)
plt.show()
Based on #Samir 's answer, and make it a callable function for your convenience :)
import numpy as np
import matplotlib.pyplot as plt
def generate_points_with_min_distance(n, shape, min_dist):
# compute grid shape based on number of points
width_ratio = shape[1] / shape[0]
num_y = np.int32(np.sqrt(n / width_ratio)) + 1
num_x = np.int32(n / num_y) + 1
# create regularly spaced neurons
x = np.linspace(0., shape[1]-1, num_x, dtype=np.float32)
y = np.linspace(0., shape[0]-1, num_y, dtype=np.float32)
coords = np.stack(np.meshgrid(x, y), -1).reshape(-1,2)
# compute spacing
init_dist = np.min((x[1]-x[0], y[1]-y[0]))
# perturb points
max_movement = (init_dist - min_dist)/2
noise = np.random.uniform(low=-max_movement,
high=max_movement,
size=(len(coords), 2))
coords += noise
return coords
coords = generate_points_with_min_distance(n=8, shape=(2448,2448), min_dist=256)
# plot
plt.figure(figsize=(10,10))
plt.scatter(coords[:,0], coords[:,1], s=3)
plt.show()
As I understood, you're looking for an algorithm to create many random points in a box such that no two points are closer than some minimum distance. If this is your problem, then you can take advantage of statistical physics, and solve it using molecular dynamics software. Moreover, you do need molecular dynamics or Monte Carlo to obtain exact solution of this problem.
You place N atoms in a rectangular box, create a repulsive interaction of a fixed radius between them (such as shifted Lennard-Jones interaction), and run simulation for some time (untill you see that the points spread out uniformly throughout the box). By laws of statistical physics you can show that positions of the points would be maximally random given the constraint that points cannot be close than some distance. This would not be true if you use iterative algorithm, such as placing points one-by-one and rejecting them if they overlap
I would estimate a runtime of several seconds for 10000 points, and several minutes for 100k. I use OpenMM for all my moelcular dynamics simulations.
#example of generating 50 points in a square of 4000x4000 and with minimum distance of 400
import numpy as np
import random as rnd
n_points=50
x,y = np.zeros(n_points),np.zeros(n_points)
x[0],y[0]=np.round(rnd.uniform(0,4000)),np.round(rnd.uniform(0,4000))
min_distances=[]
i=1
while i<n_points :
x_temp,y_temp=np.round(rnd.uniform(0,4000)),np.round(rnd.uniform(0,4000))
distances = []
for j in range(0,i):
distances.append(np.sqrt((x_temp-x[j])**2+(y_temp-y[j])**2))
min_distance = np.min(distances)
if min_distance>400 :
min_distances.append(min_distance)
x[i]=x_temp
y[i]=y_temp
i = i+1
print(x,y)
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of covariagram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pair of points that are separated by h (or less than h), multiply its values and for each of these points, calculate its mean, which in this case, are defined as m(x_i). However, according to the definition of m(x_{i}), if I want to compute m(x1), I need to obtain the average of the values located within distance h from x1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this assuming a two dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, that is why I will refrain from posting the code here. Here is another attempt of a very naive implementation:
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(np.array(coordinates))) # coordinates is a nx2 array
z = np.array(z) # z are the values
cutoff = np.max(distances)/3.0 # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []
for w in np.arange(len(widths)-1): # for each width
# for each pairwise distance
for i in np.arange(distances.shape[0]):
for j in np.arange(distances.shape[1]):
if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
m1 = []
m2 = []
# when a distance is within a given width, calculate the means of
# the points involved
for x in np.arange(distances.shape[1]):
if distances[i,x] <= widths[w+1] and distances[i, x] > widths[w]:
m1.append(z[x])
for y in np.arange(distances.shape[1]):
if distances[j,y] <= widths[w+1] and distances[j, y] > widths[w]:
m2.append(z[y])
mean_m1 = np.array(m1).mean()
mean_m2 = np.array(m2).mean()
Z.append(z[i]*z[j] - mean_m1*mean_m2)
Z_mean = np.array(Z).mean() # calculate covariogram for width w
Cov.append(Z_mean) # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spacial correlations. I'll do this by first making the spatial correlations, and then using random data points that are generated using this, where the data is positioned according to the underlying map, and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And, I changed the map so that left and right side were shifted relative to eachother to create negative correlation at large h.
from numpy import *
import random
import matplotlib.pyplot as plt
S = 1000
N = 900
# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8* abs(outer(sx, sx))
density[:,:S//2] += .2
# make a point cloud motivated by this density
random.seed(10) # so this can be repeated
points = []
while len(points)<N:
v, ix, iy = random.random(), random.randint(0,S-1), random.randint(0,S-1)
if True: #v<density[ix,iy]:
points.append([ix, iy, density[ix,iy]])
locations = array(points).transpose()
print locations.shape
plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1,:], locations[0,:], '.k')
plt.xlim((0,S))
plt.ylim((0,S))
plt.show()
# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0,i]-L[0,j])**2+(L[1,i]-L[1,j])**2), L[2,i], L[2,j]]
for i in range(N) for j in range(N) if i>j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize it's production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of h-values in ihvals -- ie, S/1000 was used below).
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:,0]) # h sorting
h = m[i,0]
zh = m[i,1]
zsh = m[i,2]
zz = zh*zsh
hvals = linspace(0,S,1000) # the values of h to use (S should be in the units of distance, here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
for i, ihval in enumerate(ihvals[1:]):
start, stop = ihvals[i-1], ihval
N = stop-start
if N>0:
mnh = sum(zh[start:stop])/N
mph = sum(zsh[start:stop])/N
szz = sum(zz[start:stop])/N
C = szz-mnh*mph
result.append([h[ihval], C])
result = array(result)
plt.plot(result[:,0], result[:,1])
plt.grid()
plt.show()
which looks reasonable to me as one can see bumps or troughs at the expected for the h values, but I haven't done a careful check.
The main speedup here over scipy.cov, is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculate could be sped up even more by doing partial sums, ie, from ihvals[n-1] to ihvals[n] at each timestep n, but I doubt that will be necessary.