I'm currently searching for an efficient algorithm that takes a set of points in three-dimensional space and groups them into classes (maybe represented by a list). A point should belong to a class if it is close to one or more other points from the class. Two classes are then the same if they share any point.
Because I'm working with large data sets, I don't want to use recursive methods. Also, using something like a distance matrix with O(n^2) performance is what I try to avoid.
I looked at some algorithms online, but most of them don't suit this specific purpose (e.g. k-d trees or other clustering algorithms). I thought about partitioning space into smaller parts, but that (potentially) gives an inexact result.
I tried to write something myself, but it turned out to be flawed. I would sort my points by distance from the origin, append the distance as a fourth coordinate, and then repeat the following code segment:
def grouping_presorted(lst, distance):
    positions = [0]
    x = []
    while positions:
        curr_el = lst[positions[-1]]
        nn_i = HasNeighbor(lst, distance, positions[-1])
        if nn_i is None:
            x.append(lst.pop(positions[-1]))
    return x

def HasNeighbor(lst, distance, index):
    i = index + 1
    while lst[i][3] - lst[index][3] < distance:
        dist = (lst[i][0]-lst[index][0])**2 + (lst[i][1]-lst[index][1])**2 + (lst[i][2]-lst[index][2])**2
        if dist < distance:
            return i
    return None
Aside from a (probably easy to fix) overflow error, there's a bigger flaw in the logic of linking the points. If you think of my points as describing lines in space, the algorithm only works for lines that point strictly outward from the origin, but not for circles or similar structures.
Does anybody know of a prewritten code for this or have an idea what I could try?
Thanks in advance.
Edit: It seems my spelling and maybe my confusion of some terms have sparked some misunderstandings. I hope that this (badly-made) sketch helps. In this example, I marked my reference distance as d and circled in red the two containers I want to end up with.
You could try OPTICS (https://en.wikipedia.org/wiki/OPTICS_algorithm). If you index the points first (e.g. with an R-tree), this should be possible in O(n log n).
If you already know your epsilon and how many points are minimally in a cluster (minpoints) then DBSCAN could be the better choice.
What I ended up doing
After following the suggestions in your comments, getting help from cs.stackexchange and doing some research, I was able to write down two different methods for solving this problem. In case someone might be interested, I decided to share them here. Again, the problem is to write a program that takes in a set of coordinate tuples and groups them into clusters. Two points x,y belong to the same cluster if there is a sequence of elements x=x_1,..,y=x_N such that d(x_i,x_i+1) < r for every i.
DBSCAN: By fixing euclidean metric, minPts = 2 and grouping distance epsilon = r.
scikit-learn provides a nice implementation of this algorithm. A minimal code snippet for the task would be:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer releases
import networkx as nx
import scipy.spatial as sp

def cluster(data, epsilon, N):  # DBSCAN, euclidean distance; N is passed on as min_samples
    db = DBSCAN(eps=epsilon, min_samples=N).fit(data)
    labels = db.labels_  # labels of the found clusters (-1 marks noise)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters
    clusters = [data[labels == i] for i in range(n_clusters)]  # list of clusters
    return clusters, n_clusters

N = 20000  # number of sample points
centers = [[1, 1, 1], [-1, -1, 1], [1, -1, 1]]
X, _ = make_blobs(n_samples=N, centers=centers, cluster_std=0.4)
clusters, n_clusters = cluster(X, 0.1, 2)  # epsilon = 0.1, minPts = 2
On my machine, clustering N=20000 points this way with epsilon = 0.1 takes just 290ms, so this seems really quick to me.
Graph components: One can think of this problem as follows: The coordinates define nodes of a graph, and two nodes are adjacent if their distance is smaller than epsilon (the r from above). A cluster is then given as a connected component of this graph. At first I had problems implementing this graph, but there are many ways to write a linear time algorithm to do this. The easiest and fastest way for me, however, was to use scipy.spatial's cKDTree data structure and its query_pairs() method, which returns a set of index tuples of points that lie within a given distance of each other. One could for example write it like this:
class IGraph:
    def __init__(self, nodelst=[], radius=1):
        self.igraph = nx.Graph()
        self.radii = radius
        self.nodelst = nodelst  # nodelst is array of coordinate tuples, graph contains indices as nodes
        self.__make_edges__()  # build the edges right away, otherwise the graph stays empty

    def __make_edges__(self):
        # query_pairs returns all index pairs (i, j) whose points are at most `radii` apart
        self.igraph.add_edges_from(sp.cKDTree(self.nodelst).query_pairs(r=self.radii))

    def get_conn_comp(self):
        # points without any neighbour never enter the graph, so they are not reported
        ind = [list(x) for x in nx.connected_components(self.igraph) if len(x) > 1]
        return [self.nodelst[indlist] for indlist in ind]

def graph_cluster(data, epsilon):
    graph = IGraph(nodelst=data, radius=epsilon)
    clusters = graph.get_conn_comp()
    return clusters, len(clusters)
For the same dataset mentioned above, this method takes 420ms to find the connected components. However, for smaller clusters, e.g. N=700, this snippet runs faster. It also seems to have an advantage for finding smaller clusters (that is being given smaller epsilon values) and a vast disadvantage in the other direction (all on this specific dataset of course). I think, depending on the given situation, both methods are worth considering.
Hope this is of use for somebody.
Edit: Theoretically, DBSCAN has computational complexity O(n log n) when properly implemented (according to wikipedia...), while constructing the graph as well as finding its connected components runs linear in time. I'm not sure how well these statements hold for the given implementations though.
Adapt a path-finding algorithm, such as Dijkstra's or A*, or alternatively adapt the breadth-first or depth-first search of a graph. Start at any point in the set of unvisited points, and proceed with whichever algorithm you've picked, with the caveat that a point is considered to be connected only to those points whose distance from it is less than the threshold. When you've finished off one class (i.e. when you can discover no more new nodes), pick any node from the set of unvisited nodes and repeat.
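A minimal sketch of that idea, under a few assumptions of mine: it uses an iterative breadth-first search (so no recursion), and to avoid comparing every pair of points it first hashes the points into cubic cells whose edge length equals the threshold. Any point closer than the threshold must then sit in one of the 27 cells around the current one, so the result stays exact. The function name cluster_points and the grid are illustrative choices, not something from the question.

```python
from collections import defaultdict, deque
from itertools import product
from math import floor


def cluster_points(points, threshold):
    """Group 3D points into clusters chained together by distances below `threshold`."""

    def cell_of(point):
        # Cubic cell (edge length = threshold) that contains the point.
        return tuple(floor(c / threshold) for c in point)

    # Hash every point index into its cell.
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        cells[cell_of(point)].append(idx)

    def neighbours(idx):
        # Only the 27 cells around the point can hold points closer than threshold.
        x, y, z = points[idx]
        cx, cy, cz = cell_of(points[idx])
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for other in cells.get((cx + dx, cy + dy, cz + dz), ()):
                if other == idx:
                    continue
                ox, oy, oz = points[other]
                if (x - ox) ** 2 + (y - oy) ** 2 + (z - oz) ** 2 < threshold ** 2:
                    yield other

    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        # Start a new class from any unvisited point and flood outwards.
        start = unvisited.pop()
        queue = deque([start])
        component = [start]
        while queue:
            current = queue.popleft()
            for other in neighbours(current):
                if other in unvisited:
                    unvisited.remove(other)
                    queue.append(other)
                    component.append(other)
        clusters.append(sorted(component))
    return clusters
```

Each returned cluster is a list of indices into points. A point with no close neighbour ends up in a cluster of its own.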
I have written code to find paths of connected spheres using the NetworkX library in Python. To do so, I need to find the distances between the spheres before building the graph. This part of the code (the calculation section, i.e. the numba function that finds distances and connections) led to memory leaks when I used arrays in a parallel scheme with numba (I had this problem when using np.linalg or scipy.spatial.distance.cdist, too). So, I wrote non-parallel numba code that uses lists instead. Now it is memory-friendly but takes a long time to calculate these distances (it uses just ~10-20% of 16GB memory and ~30-40% of each core of my 4-core CPU). For example, on a data volume of ~12000 spheres, the calculation section and the NetworkX graph creation each took less than one second; on ~550000 spheres, the calculation section (numba part) took around 25 minutes, and graph creation plus getting the output list took 7 seconds.
import numpy as np
import numba as nb
import networkx as nx

radii = np.load('rad_dist_12000.npy')
poss = np.load('pos_dist_12000.npy')

@nb.njit("(Tuple([float64[:, ::1], float64[:, ::1]]))(float64[::1], float64[:, ::1])", parallel=True)
def distances_numba_parallel(radii, poss):
    # dense N x N arrays: summed radii and centre distances for every pair i < j
    radii_arr = np.zeros((radii.shape[0], radii.shape[0]), dtype=np.float64)
    poss_arr = np.zeros((poss.shape[0], poss.shape[0]), dtype=np.float64)
    for i in nb.prange(radii.shape[0] - 1):
        for j in range(i+1, radii.shape[0]):
            radii_arr[i, j] = radii[i] + radii[j]
            poss_arr[i, j] = ((poss[i, 0] - poss[j, 0]) ** 2 + (poss[i, 1] - poss[j, 1]) ** 2 + (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
    return radii_arr, poss_arr

@nb.njit("(List(UniTuple(int64, 2)))(float64[::1], float64[:, ::1])")
def distances_numba_non_parallel(radii, poss):
    connections = []
    for i in range(radii.shape[0] - 1):
        connections.append((i, i))  # self-loop so that isolated spheres still become nodes
        for j in range(i+1, radii.shape[0]):
            radii_arr_ij = radii[i] + radii[j]
            poss_arr_ij = ((poss[i, 0] - poss[j, 0]) ** 2 + (poss[i, 1] - poss[j, 1]) ** 2 + (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
            if poss_arr_ij <= radii_arr_ij:
                connections.append((i, j))
    return connections

def connected_spheres_path(radii, poss):
    # in parallel mode
    # maximum_distances, distances = distances_numba_parallel(radii, poss)
    # connections = distances <= maximum_distances
    # connections[np.tril_indices_from(connections, -1)] = False

    # in non-parallel mode
    connections = distances_numba_non_parallel(radii, poss)
    G = nx.Graph(connections)
    return list(nx.connected_components(G))
My datasets will contain at most 10 million spheres (the data are positions and radii), and mostly up to 1 million. As mentioned above, most of the time is spent in the calculation section. I have little experience with graphs and don't know if (and how) this can be handled much faster using all CPU cores or the available RAM (max 12GB), or if it can be calculated internally (I doubt that the connected spheres need to be found separately before using graphs) by other Python libraries such as graph-tool, igraph, and networkit, so that the whole process runs efficiently in C or C++.
I would be grateful for any suggested answer that makes my code faster for large data volumes (performance is the first priority; if large data volumes need a lot of memory, stating how much, ideally with some benchmarks, would be helpful).
Since just using trees will not be helpful enough to improve the performance, I have written an advanced optimized code to improve the calculation section speed by combining tree-based algorithms and numba jitting.
Now, I am curious whether it can be calculated internally (the calculation section is an integral part of, and a basic need for, such graphing) by other Python libraries such as graph-tool, igraph, and networkit, so that the whole process runs efficiently in C or C++.
radii: 12000, 50000, 550000
poss: 12000, 50000, 550000
If you are computing the pairwise distance between all points, that's N^2 calculations, which will take a very long time for sufficiently many data points.
If you can place an upper bound on the distance you need to consider for any two points, then there are some nice data structures for finding pairs of neighbors in a set of points. If you already have scipy installed, then the most convenient structure to reach for is the KDTree (or the optimized version, cKDTree). (Read more here.)
The basic recipe is:
Load your point set into the KDTree.
Ask the KDTree for all pairs of points which are within some maximum distance from each other.
Calculate the actual distances between each of the returned points.
Compare those distances with the summed radii associated with the point pair. Drop the pairs whose distances are too large.
Finally, you need to determine the clusters of spheres. Your question mentions "paths", but in your example code you're only concerned with connected components. Of course you can use networkx or graph-tool for that, but maybe that's overkill.
If connected components are all you need, then you don't even need a proper graph data structure. You just need a way to find the groups of linked nodes, without maintaining the specific connections that linked them. Again, scipy has a nice tool: DisjointSet. (Read more here.)
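To illustrate why no graph object is needed, here is a small pure-Python sketch of the union-find idea behind a disjoint-set structure. It is not scipy's implementation; the function name connected_groups and the path-halving detail are my own choices for illustration.

```python
def connected_groups(n, pairs):
    """Return the groups of node indices 0..n-1 that are linked by `pairs`."""
    parent = list(range(n))  # every node starts as the root of its own group

    def find(i):
        # Walk up to the root, halving the path on the way to keep trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for u, v in pairs:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Only the parent array is stored, so the memory cost is one integer per sphere regardless of how many links were found.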
Here is a complete example. The execution time depends on not only the number of points, but how "dense" they are. I tried some reasonable (I think) test data with 1M points, which took 24 seconds to process on my laptop.
Your example data (the largest of the sets provided above) takes longer: about 45 seconds. The KDTree finds 312M pairs of points to consider, of which fewer than 1M are actually valid connections.
import numpy as np
from scipy.spatial import cKDTree
from scipy.cluster.hierarchy import DisjointSet
## Example data (2D)
# N = 1000
# D = 2
# max_point = 1000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)
## Example data (3D)
# N = 1_000_000
# D = 3
# max_point = 3000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)
# Question data (3D)
points = np.load('b (556024).npy')
radii = np.load('a (556024).npy')
N = len(points)
# Load into a KD tree and extract all pairs which could possibly be linked
# (using the maximum radius as the upper bound of the search distance.)
kd = cKDTree(points)
pairs = kd.query_pairs(2 * radii.max(), output_type='ndarray')
def filter_pairs(pairs):
    # Calculate the distance between each pair of points
    vectors = points[pairs[:, 1]] - points[pairs[:, 0]]
    distances = np.linalg.norm(vectors, axis=1)
    # Drop the pairs whose summed radii aren't large enough
    # to span the distance between the points.
    thresholds = radii[pairs].sum(axis=1)
    return pairs[distances <= thresholds]
# We could do this in one big step
# ...but that might require lots of RAM.
# It's cheaper to do it in big chunks, in a loop.
fp = []
CHUNK = 1_000_000
for i in range(0, len(pairs), CHUNK):
    fp.append(filter_pairs(pairs[i:i + CHUNK]))
filtered_pairs = np.concatenate(fp)
# Load the pairs into a DisjointSet (a.k.a. UnionFind)
# data structure and extract the groups.
ds = DisjointSet(range(N))
for u, v in filtered_pairs:
    ds.merge(u, v)
connected_sets = list(ds.subsets())
print(f"Found {len(connected_sets)} sets of circles/spheres")
Just for fun, here's a visualization of the 2D test data:
from bokeh.plotting import output_notebook, figure, show
p = figure()
p.circle(*points.T, radius=radii, fill_alpha=0.25)
p.segment(*points[filtered_pairs[:, 0]].T,
          *points[filtered_pairs[:, 1]].T)
show(p)
"... to find connected spheres using NetworkX library in Python. For doing so, I need to find distances between the spheres"
Are you calculating the distance between every pair of spheres?
If all you need is to know the pairs of spheres that touch, or maybe that overlap, then you do NOT need to calculate the distance between every pair of spheres, only between ones that are in reasonable proximity to each other. The standard way of handling this is to use an octree: https://en.wikipedia.org/wiki/Octree
This takes some time to set up, but once you have it, you can quickly find all the spheres that are close and skip those that are too far away. A reasonable distance would be twice the radius of the largest sphere. For large datasets the improvement in performance can be spectacular.
( For more details about this test https://github.com/JamesBremner/quadtree )
So, the complete algorithm to find the paths through the connected spheres can be broken out into four conceptual steps:
1. Find the connected spheres, using an octree to optimize finding them. Instead of searching through every pair of spheres, loop over the spheres and search through the spheres in the same octree cell. For more details on how to make this work you might want to look at the C++ code at https://github.com/JamesBremner/quadtree
2. Create the adjacency matrix of connected spheres. Conceptually this is a separate step; however, you will probably want to do it as you search for connected spheres in the first step. Construct an empty N by N adjacency matrix, where N is the number of spheres. Each time you find a pair of connected spheres, fill in the matrix.
3. Load the matrix into a graph library. It may be more efficient to simply add the link between two connected spheres directly into the library and let it build the adjacency matrix.
4. Use the graph library methods to find the path.
I have a set of approximately 10,000 vectors max (random directions) in 3d space and I'm looking for a new direction v_dev (vector) which deviates from all other directions in the set by e.g. a minimum of 5 degrees. My naive initial try is the following, which has of course bad runtime complexity but succeeds for some cases.
#!/usr/bin/env python
import numpy as np

numVecs = 10000
vecs = np.random.rand(numVecs, 3)
randVec = np.random.rand(1, 3)
notFound = True
iter = 1

while notFound:
    below = False
    for vec in vecs:
        angle = np.rad2deg(np.arccos(np.vdot(vec, randVec)/(np.linalg.norm(vec) * np.linalg.norm(randVec))))
        if angle < 5:
            # too close to an existing direction, reject this candidate
            below = True
            break
    if below:
        randVec = np.random.rand(1, 3)
    else:
        notFound = False
    print("iteration no. %i" % iter)
    iter = iter + 1

# print the angle between the found direction and every vector in the set
foundVec = randVec
for vec in vecs:
    angle = np.rad2deg(np.arccos(np.vdot(vec, foundVec)/(np.linalg.norm(vec) * np.linalg.norm(foundVec))))
    print("angle: %f\n" % angle)
Any hints on how to approach this problem (language agnostic) would be appreciated.
Consider the vectors in a spherical coordinate system (u,w,r), where r is always 1 because vector length doesn't matter here. Any vector can be expressed as (u,w), and the "deadzone" around each vector x, in which the target vector t cannot fall, can be expressed as dist((u_x, w_x, 1), (u_x-u_t, w_x-w_t, 1)) < 5°. However, calculating this distance can be a bit tricky, so converting back into cartesian coordinates might be easier. These deadzones are circular on the spherical shell around the origin and you're looking for a t that doesn't hit any of them.
For any fixed u_t you can iterate over all x and, using the distance function, find the start and end point of the range of w_t that is blocked because it falls into the deadzone of the vector x. Whatever is left after removing the union of all 10000 blocked ranges is the set of possible values of w_t for that given u_t. The same can be done for any fixed w_t, looking for a u_t.
Now comes the part that I'm not entirely sure of: Given that you have two unknowns u_t and w_t and 20000 knowns, the system is just a tad overdetermined and if there's a solution, it should be possible to find it.
My suggestion: Set u_t fixed to a random value and check which w_t are possible. If you find a non-empty range, great, you're done. If all w_t are blocked, select a different u_t and try again. Now, selecting u_t at random will work eventually, yet a smarter iteration should be possible. Maybe u_t(n) = u_t(n-1)*phi % 360°, where phi is the golden ratio. That way the u_t never repeat and will cover the whole space with finer and finer granularity instead of starting from one end and going slowly to the other.
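A rough sketch of this "fix u_t, scan w_t" loop, with several assumptions of mine: instead of computing the blocked ranges analytically it samples w_t on a grid, it checks candidates in cartesian coordinates via dot products, and it steps u_t additively by a golden-ratio fraction of the full turn (a variation on the formula above). The name find_deviating_direction and its parameters are hypothetical, and the input vectors are assumed to be non-zero.

```python
import numpy as np


def find_deviating_direction(vecs, min_deg=5.0, max_azimuths=1000, n_polar=180):
    """Return a unit vector at least `min_deg` degrees away from every row of `vecs`, or None."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    cos_limit = np.cos(np.deg2rad(min_deg))  # angle >= min_deg  <=>  cosine <= cos_limit
    golden = (1 + 5 ** 0.5) / 2
    polar = np.linspace(0.0, np.pi, n_polar)  # sampled values of w_t

    for n in range(max_azimuths):
        # Golden-ratio stepping: azimuths never repeat and fill the circle evenly.
        u = 2 * np.pi * ((n / golden) % 1.0)
        cands = np.column_stack((np.sin(polar) * np.cos(u),
                                 np.sin(polar) * np.sin(u),
                                 np.cos(polar)))
        # Largest cosine (smallest angle) between each candidate and any input vector.
        max_cos = (cands @ unit.T).max(axis=1)
        ok = np.flatnonzero(max_cos <= cos_limit)
        if ok.size:
            return cands[ok[0]]
    return None
```

Because w_t is only sampled, this can miss a very narrow free gap; a None result therefore means "nothing found at this resolution", not "no solution exists".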
Edit: You might also have more luck on the mathematics stackexchange since this isn't so much a code question as it is a mathematics question. For example I'm not sure what I wrote is all that rigorous, so I don't even know it works.
One way would be to build a 2d manifold (area on the sphere) of forbidden areas. You start by adding a point; the forbidden area is then a circle on the sphere surface.
While true, pick a point on the boundary of the area. If this is not close (within 5 degrees) to any other vector, then you're done: return it. If not, you just found a new circle of forbidden area. Add it to your manifold of forbidden area. You'll need to chop the circle into line or arc segments and build the boundary as a list.
If the set of vectors has no solution, your boundary will collapse to an empty point. Then you return failure.
It's not the easiest approach, and you'll have to deal with the boundaries of a complex shape over a sphere. But it's guaranteed to work and should have reasonable complexity.
OSMnx provides a way to calculate the shortest path between two nodes, but I would like to do the same with points on streets (I have GPS coordinates recorded from vehicles). I know there is also a method to get the closest node, but I have two questions about this problem of mine.
i) When the closest node is computed, is the street on which the point lies also taken into consideration? (I assume not)
ii) If I wanted to implement something like this, I would like to know how a street (edge) is represented as a curve (Bézier curve maybe?). Is it possible to get the curve (or the equation of the curve) of an edge?
I asked this question here because the OSMnx contribution guidelines ask for it.
Streets and nodes in OSMnx are shapely.geometry.LineString and shapely.geometry.Point objects, so there is no curve, only a sequence of coordinates. The technical term for what you described is map matching. There are different ways of map matching, the simplest one being geometric map matching, in which you find the nearest geometry (node or edge) to the GPS point.
Point-to-point map matching can be easily achieved using the built-in osmnx function ox.get_nearest_node(). If you have the luxury of dense GPS tracks, this approach could work reasonably well. For point-to-line map matching you have to use shapely functions. The problem with this approach is that it is very slow. You can speed up the algorithm using a spatial index, but still, it will not be fast enough for most purposes. Note that geometric map matching is the least accurate of all approaches.
I wrote a function a few weeks ago that does simple point-to-line map matching using the edge GeoDataFrame and node GeoDataFrame that you can get from OSMnx. I abandoned this idea and now I am working on a new algorithm (hopefully much faster), which I will publish on GitHub upon completion. Meanwhile, this may be of some help for you or someone else, so I post it here. This is an early version of abandoned code, not tested enough and not optimized. Give it a try and let me know if it works for you.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

def GeoMM(traj, gdfn, gdfe):
    """
    performs map matching on a given sequence of points

    Returns a list of tuples each containing (timestamp, projected point to the line,
    the edge to which GPS point has been projected, the geometry of the edge)
    """
    traj = pd.DataFrame(traj, columns=['timestamp', 'xy'])
    traj['geom'] = traj.apply(lambda row: Point(row.xy), axis=1)
    # EPSG3740 is a CRS constant that is not defined in this snippet
    traj = gpd.GeoDataFrame(traj, geometry=traj['geom'], crs=EPSG3740)
    traj.drop('geom', axis=1, inplace=True)
    n_sindex = gdfn.sindex
    res = []
    for gps in traj.itertuples():
        tm = gps[1]
        p = gps[3]
        circle = p.buffer(150)
        possible_matches_index = list(n_sindex.intersection(circle.bounds))
        possible_matches = gdfn.iloc[possible_matches_index]
        precise_matches = possible_matches[possible_matches.intersects(circle)]
        candidate_nodes = list(precise_matches.index)
        candidate_edges = []
        for nid in candidate_nodes:
            # assumption: collect the (u, v) edges touching each candidate node from gdfe
            touching = gdfe[(gdfe.u == nid) | (gdfe.v == nid)]
            candidate_edges.append(list(zip(touching.u, touching.v)))
        candidate_edges = [item for sublist in candidate_edges for item in sublist]
        dist = []
        for edge in candidate_edges:
            # get the geometry
            ls = gdfe[(gdfe.u == edge[0]) & (gdfe.v == edge[1])].geometry
            dist.append([ls.distance(p).min(), edge, ls])
        dist.sort(key=lambda d: d[0])  # nearest edge first
        true_edge = dist[0][1]
        true_edge_geom = dist[0][2].item()
        pp = true_edge_geom.interpolate(true_edge_geom.project(p))  # projected point
        res.append((tm, pp, true_edge, true_edge_geom))
    return res
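The last step of the function, interpolate(project(p)), snaps the GPS point onto the chosen edge. For readers who want to see what that means geometrically, here is a dependency-free sketch of projecting a point onto a polyline. It assumes planar (projected) coordinates, as the 150-unit buffer above suggests, and the function name project_onto_polyline is made up for this illustration.

```python
from math import hypot


def project_onto_polyline(p, coords):
    """Return (distance, nearest_point) from point p to the polyline given by coords."""
    px, py = p
    best = None
    for (ax, ay), (bx, by) in zip(coords, coords[1:]):
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        # Parameter t of the perpendicular foot on the segment, clamped to [0, 1].
        t = 0.0 if seg_len2 == 0 else ((px - ax) * dx + (py - ay) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
        qx, qy = ax + t * dx, ay + t * dy
        d = hypot(px - qx, py - qy)
        if best is None or d < best[0]:
            best = (d, (qx, qy))
    return best
```

Since an edge is just a sequence of coordinates, the nearest point is found segment by segment; there is no curve equation involved.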
OSMnx was recently updated since there have been a couple of requests in this direction (see https://github.com/gboeing/osmnx/pull/234 and references therein). So in the last update, you'll find a function like this:
ox.get_nearest_edge(G, (lat, lon))
It will give you the ID of the nearest edge, which is much better than nearest nodes.
However, I think it is more useful to also get the actual distance of the nearest edge in order to check whether or not your data point is on the road or a few thousand meters apart...
To do this, I followed the implementation from https://github.com/gboeing/osmnx/pull/231/files
# Convert Graph to graph data frame
gdf = ox.graph_to_gdfs(G, nodes=False, fill_edge_geometry=True)
# extract roads and some properties
roads = gdf[["geometry", "u", "v","ref","name","highway","lanes"]].values.tolist()
# calculate and attach distance
roads_with_distances = [(road, ox.Point(tuple(reversed((lat,lon)))).distance(road[0])) for road in roads]
# sort by distance
roads_with_distances = sorted(roads_with_distances, key=lambda x: x[1])
# Select closest road
closest_road = roads_with_distances[0]
# Check whether you are actually "on" the road
if closest_road[1] < 0.0001: print('Hit the road, Jack!')
I have the impression that a distance on the order of $10^{-5}$ means that the coordinate is actually "on" the road.
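A quick sanity check on that order of magnitude, assuming the graph is unprojected so that the shapely distance above is measured in degrees: one degree of latitude is roughly 111 km, so 1e-5 degrees is about one metre and the 0.0001 threshold about eleven metres. The haversine helper below is a generic sketch for converting such offsets to metres and is not part of OSMnx.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Offsets of 1e-5 and 1e-4 degrees of latitude, expressed in metres
print(haversine_m(0.0, 0.0, 1e-5, 0.0))
print(haversine_m(0.0, 0.0, 1e-4, 0.0))
```

Note that a degree of longitude shrinks with the cosine of the latitude, so a threshold in raw degrees is not the same distance in every direction; projecting the graph first avoids this.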
I'd like to interpolate some 3D finite-element stress field data from a bunch of known nodes at points where nodes don't exist. I realise that node stresses are already extrapolated from gauss points, but it is the best I can do with the data I have available. The image below gives a 2D representation. The red and pink points would represent locations where I'd like to interpolate the value.
Initially I thought I could find the smallest bounding box (hull) or simplex that contained the point of interest and no other known points. Visualising this in 2D I realised that this might incorrectly ignore data from a close-by value. I was planning on using the scipy LinearNDInterpolator, but I notice there is some unexpected behaviour, and I'm worried it will exclude nearby points in the way that I just described. Notice how the pink point would take its value from the orange triangle only and ignore the point outside it (a vertex of the green triangle), although that point is probably more relevant.
As far as I can tell the best way is to take the nearest surrounding nodes, and interpolating by weighted averaging on distance. I'm not sure if there is something readily available or if it needs to be written. I'd imagine this is a fairly common problem so I'd presume the wheel has already been invented...
Actually my final goal is to interpolate/regress values for a 3D line through the set of points.
You can try Inverse distance weighting. Here is an example in 1D (easily generalizable to 3D):
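from pylab import *
# imaginary samples (placeholder data, assumed here for illustration)
xmax = 10
x = xmax*rand(10)
y = sin(x)
plot(x, y, 'o', label="Samples")
# interpolation
x2 = linspace(0, xmax, 150)  # new sampling
def weight(x, x0, p):  # modify this function in 3D
    return 1/(((x-x0)**2)**(p/2)+0.00001)  # 0.00001 to avoid infinity
for p in range(1, 4):
    y2 = zeros_like(x2)
    for i in range(len(y2)):
        # weighted average of all samples, weights decaying with distance**p
        y2[i] = sum(y*weight(x, x2[i], p))/sum(weight(x, x2[i], p))
    plot(x2, y2, label="Interpolation p="+str(p))
legend()
show()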
Here is the result
As you can see, it's not really fantastic. The best results are, I think, for p=2, but it will be different in 3D. I have obtained better curves with a gaussian weight, but have no theoretical background for such a choice.
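For the 3D case from the question, the same weighting can be sketched with numpy as follows. This is only an illustration under my own assumptions: a scalar value per node (e.g. one stress component), only the k nearest nodes contributing to each query point, and a brute-force distance computation. The name idw_interpolate and its parameters are hypothetical.

```python
import numpy as np


def idw_interpolate(nodes, values, query, p=2, k=8, eps=1e-12):
    """Inverse-distance-weighted estimate at each query point from the k nearest nodes."""
    nodes = np.asarray(nodes, dtype=float)
    values = np.asarray(values, dtype=float)
    query = np.atleast_2d(np.asarray(query, dtype=float))
    out = np.empty(len(query))
    for i, q in enumerate(query):
        d = np.linalg.norm(nodes - q, axis=1)  # distance from q to every node
        nearest = np.argsort(d)[:k]            # indices of the k closest nodes
        if d[nearest[0]] < eps:
            out[i] = values[nearest[0]]        # query coincides with a node
            continue
        w = 1.0 / d[nearest] ** p              # weights fall off with distance**p
        out[i] = np.sum(w * values[nearest]) / np.sum(w)
    return out
```

For many query points the argsort over all nodes becomes the bottleneck; a k-d tree lookup of the k nearest nodes would be the natural replacement for that line.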
The first answer here was helpful, but the 1-D example shows that the approach actually does some strange things with p=1 (wildly different from the data), and with p=3 we get some weird plateaux.
I took a look at Radial Basis Functions which are implemented in SciPy, and modified JPG's code as follows.
Modified Code
from pylab import *
from scipy.interpolate import Rbf, InterpolatedUnivariateSpline
# imaginary samples (placeholder data, assumed here for illustration)
xmax = 10
x = xmax*rand(10)
# Rbf requires sorted lists:
x.sort()
y = sin(x)
plot(x, y, 'o', label="Samples")
# interpolation
x2 = linspace(0, xmax, 150)  # new sampling
def weight(x, x0, p):  # modify this function in 3D
    return 1/(((x-x0)**2)**(p/2)+0.00001)  # 0.00001 to avoid infinity
for p in range(1, 4):
    y2 = zeros_like(x2)
    for i in range(len(y2)):
        y2[i] = sum(y*weight(x, x2[i], p))/sum(weight(x, x2[i], p))
    plot(x2, y2, label="Interpolation p="+str(p))
yrbf = Rbf(x, y)
fi = yrbf(x2)
plot(x2, fi, label="Radial Basis Function")
ius = InterpolatedUnivariateSpline(x, y)
yius = ius(x2)
plot(x2, yius, label="Univariate Spline")
legend()
show()
The results are interesting and probably more suitable to my intended usage. The following figure was produced.
But the RBF implementation in SciPy (google for alternatives) has a major problem when points are repeated - not likely in a real scenario - and goes completely ballistic:
When smoothed (smooth=0.1 was used) it goes normal again. This might show some programming weirdness.
I have two dimensional discrete spatial data. I would like to make an approximation of the spatial boundaries of this data so that I can produce a plot with another dataset on top of it.
Ideally, this would be an ordered set of (x,y) points that matplotlib can plot with the plt.Polygon() patch.
My initial attempt is very inelegant: I place a fine grid over the data, and where data is found in a cell, a square matplotlib patch is created of that cell. The resolution of the boundary thus depends on the sampling frequency of the grid. Here is an example, where the grey region are the cells containing data, black where no data exists.
1st attempt http://astro.dur.ac.uk/~dmurphy/data_limits.png
OK, problem solved - why am I still here? Well.... I'd like a more "elegant" solution, or at least one that is faster (ie. I don't want to get on with "real" work, I'd like to have some fun with this!). The best way I can think of is a ray-tracing approach - eg:
1. from xmin to xmax, at y=ymin, check if data boundary crossed in intervals dx
2. y=ymin+dy, do 1
3. do 1-2, but now sample in y
An alternative is defining a centre, and sampling in r-theta space - ie radial spokes in dtheta increments.
Both would produce a set of (x,y) points, but then how do I order/link neighbouring points to create the boundary?
A nearest neighbour approach is not appropriate as, for example (to borrow from Geography), an isthmus (think of Panama connecting N&S America) could then close off and isolate regions. This also might not deal very well with the holes seen in the data, which I would like to represent as a different plt.Polygon.
The solution perhaps comes from solving an area maximisation problem. For a set of points defining the data limits, what is the maximum contiguous area contained within those points? To form the enclosed area, what are the neighbouring points for the nth point? How will the holes be treated in this scheme - is this erring into topology now?
Apologies, much of this is me thinking out loud. I'd be grateful for some hints, suggestions or solutions. I suspect this is an oft-studied problem with many solution techniques, but I'm looking for something simple to code and quick to run... I guess everyone is, really!
OK, here's attempt #2 using Mark's idea of convex hulls:
http://astro.dur.ac.uk/~dmurphy/data_limitsv2.png
For this I used qconvex from the qhull package, getting it to return the extreme vertices. For those interested:
cat [data] | qconvex Fx > out
The sampling of the perimeter seems quite low, and although I haven't played much with the settings, I'm not convinced I can improve the fidelity.
I think what you are looking for is the convex hull of the data. That will give you a set of points that, if connected, enclose your data: all your points are on or inside the resulting polygon.
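If you'd rather not call out to an external tool, a convex hull in 2D is short enough to write by hand. The sketch below uses Andrew's monotone chain algorithm (my choice of algorithm, not something from the question); the function name convex_hull is illustrative. It returns the hull vertices in counter-clockwise order, which is the kind of ordered (x,y) list a polygon patch expects.

```python
def convex_hull(points):
    """Return the convex hull of 2D points as a counter-clockwise list of vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 if o -> a -> b turns left, < 0 if it turns right, 0 if collinear.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]
```

Keep in mind that a convex hull, by construction, cannot follow concave stretches of the boundary or represent holes, so it only answers part of the question.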
I may have missed something, but why not simply determine the maximum and minimum x and y values? Unless you have an enormous amount of data, you could simply iterate through your points and determine the minimum and maximum levels fairly quickly.
This isn't the most efficient example, but if your data set is small this won't be particularly slow:
import random
data = [(random.randint(-100, 100), random.randint(-100, 100)) for i in range(1000)]
x_min = min([point[0] for point in data])
x_max = max([point[0] for point in data])
y_min = min([point[1] for point in data])
y_max = max([point[1] for point in data])