Proximity Graph in python

Proximity Graph in python - python

I'm relatively new to Python coding (I'm switching from R mostly due to running time speed) and I'm trying to figure out how to code a proximity graph.
That is suppose i have an array of evenly-spaced points in d-dimensional Euclidean space, these will be my nodes. I want to make these into an undirected graph by connecting two points if and only if they lie within e apart. How can I encode this functionally with parameters:
n: spacing between two points on the same axis
d: dimension of R^d
e: maximum distance allowed for an edge to exist.

The graph-tool library has much of the functionality you need. So you could do something like this, assuming you have numpy and graph-tool:
coords = numpy.meshgrid(*(numpy.linspace(0, (n-1)*delta, n) for i in range(d)))
# coords is a Python list of numpy arrays
coords = [c.flatten() for c in coords]
# now coords is a Python list of 1-d numpy arrays
coords = numpy.array(coords).transpose()
# now coords is a numpy array, one row per point
g = graph_tool.generation.geometric_graph(coords, e*(1+1e-9))
The silly e*(1+1e-9) thing is because your criterion is "distance <= e" and geometric_graph's criterion is "distance < e".
There's a parameter called delta that you didn't mention because I think your description of parameter n is doing duty for two params (spacing between points, and number of points).

This bit of code should work, although it certainly isn't the most efficient. It will go through each node and check its distance to all the other nodes (that haven't already compared to it). If that distance is less than your value e then the corresponding value in the connected matrix is set to one. Zero indicates two nodes are not connected.
In this code I'm assuming that your nodeList is a list of cartesian coordinates of the form nodeList = [[x1,y1,...],[x2,y2,...],...[xN,yN,...]]. I also assume you have some function called calcDistance which returns the euclidean distance between two cartesian coordinates. This is basic enough to implement that I haven't written the code for that, and in any case using a function allows for future generalizing and modability.
numNodes = len(nodeList)
connected = np.zeros([numNodes,numNodes])
for i, n1 in enumerate(nodeList):
for j, n2 in enumerate(nodeList[i:]):
dist = calcDistance(n1, n2)
if dist < e:
connected[i,j] = 1
connected[j,i] = 1

Related

Time-efficient way to find connected spheres paths in Python

I have written a code to find connected spheres paths using NetworkX library in Python. For doing so, I need to find distances between the spheres before using the graph. This part of the code (calculation section (the numba function) --> finding distances and connections) led to memory leaks when using arrays in parallel scheme by numba (I had this problem when using np.linalg or scipy.spatial.distance.cdist, too). So, I wrote a non-parallel numba code using lists to do so. Now, it is memory-friendly but consumes a much time to calculate these distances (it consumes just ~10-20% of 16GB memory and ~30-40% of each CPU cores of my 4-cores CPU machine). For example, when I was testing on ~12000 data volume, it took less than one second for each of the calculation section and the NetworkX graph creation and for ~550000 data volume, it took around 25 minutes for calculation section (numba part) and 7 seconds for graph creation and getting the output list.
import numpy as np
import numba as nb
import networkx as nx
radii = np.load('rad_dist_12000.npy')
poss = np.load('pos_dist_12000.npy')
#nb.njit("(Tuple([float64[:, ::1], float64[:, ::1]]))(float64[::1], float64[:, ::1])", parallel=True)
def distances_numba_parallel(radii, poss):
radii_arr = np.zeros((radii.shape[0], radii.shape[0]), dtype=np.float64)
poss_arr = np.zeros((poss.shape[0], poss.shape[0]), dtype=np.float64)
for i in nb.prange(radii.shape[0] - 1):
for j in range(i+1, radii.shape[0]):
radii_arr[i, j] = radii[i] + radii[j]
poss_arr[i, j] = ((poss[i, 0] - poss[j, 0]) ** 2 + (poss[i, 1] - poss[j, 1]) ** 2 + (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
return radii_arr, poss_arr
#nb.njit("(List(UniTuple(int64, 2)))(float64[::1], float64[:, ::1])")
def distances_numba_non_parallel(radii, poss):
connections = []
for i in range(radii.shape[0] - 1):
connections.append((i, i))
for j in range(i+1, radii.shape[0]):
radii_arr_ij = radii[i] + radii[j]
poss_arr_ij = ((poss[i, 0] - poss[j, 0]) ** 2 + (poss[i, 1] - poss[j, 1]) ** 2 + (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
if poss_arr_ij <= radii_arr_ij:
connections.append((i, j))
return connections
def connected_spheres_path(radii, poss):
# in parallel mode
# maximum_distances, distances = distances_numba_parallel(radii, poss)
# connections = distances <= maximum_distances
# connections[np.tril_indices_from(connections, -1)] = False
# in non-parallel mode
connections = distances_numba_non_parallel(radii, poss)
G = nx.Graph(connections)
return list(nx.connected_components(G))
My datasets will contain maximum of 10 millions spheres (data are positions and radii), mostly, up to 1 millions; As it is mentioned above, the most part of the consumed time is related to the calculation section. I have little experience using graphs and don't know if (and how) it can be handled much faster using all CPU cores or RAM capacity (max 12GB) or if it can be calculated internally (I doubt that it is needed to calculate and find the connected spheres separately before using graphs) using other Python libraries such as graph-tool, igraph, and netwrokit to do all the process in C or C++ in an efficient way.
I would be grateful for any suggested answer that can make my code faster for large data volumes (performance is the first priority; if much memory capacities are needed for large data volumes, mentioning (some benchmarks) its amounts will be helpful).
Update:
Since just using trees will not be helpful enough to improve the performance, I have written an advanced optimized code to improve the calculation section speed by combining tree-based algorithms and numba jitting.
Now, I am curious if it can be calculated internally (calculation section is an integral part and basic need for such graphing) by other Python libraries such as graph-tool, igraph, and netwrokit to do all the process in C or C++ in an efficient way.
Data
radii: 12000, 50000, 550000
poss: 12000, 50000, 550000

If you are computing the pairwise distance between all points, that's N^2 calculations, which will take a very long time for sufficiently many data points.
If you can place an upper bound on the distance you need to consider for any two points, then there are some nice data structures for finding pairs of neighbors in a set of points. If you already have scipy installed, then the most convenient structure to reach for is the KDTree (or the optimized version, cKDTree). (Read more here.)
The basic recipe is:
Load your point set into the KDTree.
Ask the KDTree for all pairs of points which are within some maximum distance from each other.
Calculate the actual distances between each of the returned points.
Compare those distances with the summed radii associated with the point pair. Drop the pairs whose distances are too large.
Finally, you need to determine the clusters of spheres. Your question mentions "paths", but in your example code you're only concerned with connected components. Of course you can use networkx or graph-tool for that, but maybe that's overkill.
If connected components are all you need, then you don't even need a proper graph data structure. You just need a way to find the groups of linked nodes, without maintaining the specific connections that linked them. Again, scipy has a nice tool: DisjointSet. (Read more here.)
Here is a complete example. The execution time depends on not only the number of points, but how "dense" they are. I tried some reasonable (I think) test data with 1M points, which took 24 seconds to process on my laptop.
Your example data (the largest of the sets provided above) takes longer: about 45 seconds. The KDTree finds 312M pairs of points to consider, of which fewer than 1M are actually valid connections.
import numpy as np
from scipy.spatial import cKDTree
from scipy.cluster.hierarchy import DisjointSet
## Example data (2D)
## N = 1000
# D = 2
# max_point = 1000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)
## Example data (3D)
# N = 1_000_000
# D = 3
# max_point = 3000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)
# Question data (3D)
points = np.load('b (556024).npy')
radii = np.load('a (556024).npy')
N = len(points)
# Load into a KD tree and extract all pairs which could possibly be linked
# (using the maximum radius as the upper bound of the search distance.)
kd = cKDTree(points)
pairs = kd.query_pairs(2 * radii.max(), output_type='ndarray')
def filter_pairs(pairs):
# Calculate the distance between each pair of points
vectors = points[pairs[:, 1]] - points[pairs[:, 0]]
distances = np.linalg.norm(vectors, axis=1)
# Drop the pairs whose summed radii aren't large enough
# to span the distance between the points.
thresholds = radii[pairs].sum(axis=1)
return pairs[distances <= thresholds]
# We could do this in one big step
# ...but that might require lots of RAM.
# It's cheaper to do it in big chunks, in a loop.
fp = []
CHUNK = 1_000_000
for i in range(0, len(pairs), CHUNK):
fp.append(filter_pairs(pairs[i:i+CHUNK]))
filtered_pairs = np.concatenate(fp)
# Load the pairs into a DisjointSet (a.k.a. UnionFind)
# data structure and extract the groups.
ds = DisjointSet(range(N))
for u, v in filtered_pairs:
ds.merge(u, v)
connected_sets = list(ds.subsets())
print(f"Found {len(connected_sets)} sets of circles/spheres")
Just for fun, here's a visualization of the 2D test data:
from bokeh.plotting import output_notebook, figure, show
output_notebook()
p = figure()
p.circle(*points.T, radius=radii, fill_alpha=0.25)
p.segment(*points[filtered_pairs[:, 0]].T,
*points[filtered_pairs[:, 1]].T,
line_color='red')
show(p)

to find connected spheres using NetworkX library in Python. For
doing so, I need to find distances between the spheres
Are you calculating the distance between every pair of spheres?
If all you need is to know the pairs of spheres that touch, or maybe that overlap, then you do NOT need to calculate the distance between every pair of spheres, only ones that are in reasonable proximity to each other. The standard way of handling this it to use an octree https://en.wikipedia.org/wiki/Octree
This takes some time to set up, but once you have it, you can find quickly all the spheres that are close but none that are two far away. A reasonable distance would be twice the radius of the largest sphere. For large dataset the improvement in performance can be spectacular
( For more details about this test https://github.com/JamesBremner/quadtree )
So, the complete algorithm to find the paths through the connected spheres can be broken out into four conceptual steps
Find the connected spheres, using an octree to optimize finding them. Instead of searching through every pair of spheres, loop over the spheres and search through the spheres in the same octree cell. For more details on how to make this work you might want to look at the C++ code at https://github.com/JamesBremner/quadtree
Create the adjacency matrix of connected spheres. Conceptually this is a separate step, however, you will probably want to do that as you search for connected sphere in the first step. Construct an empty adjacency matrix N by N where N is the number of spheres. Each time you find a pair of connected spheres, fill in in matrix.
Load the matrix into a graph library. It may be more efficient to simply add the link between two connected spheres directly into the library and let it build the adjacency matrix.
Use the graph library methods to find the path.

Efficient way for maximum medians from triangle polygons

Objective
I have a soup of triangle polygons. I want to retrieve the largest median as vector for each triangle.
State of work
Starting point:
Array of points (n,3) , e.g. [x,y,z]
Array of triangle point indices (n, 3) referencing the array of points above, e.g. [[0,1,2],[2,3,4]...]
I combine both two one single matrix containing the real 3D point coordinates. Then I calculate the median vectors and their lengths.
/Edit : I updated the code to my current version of it
def calcMedians(polygon):
# C -> AB = C-(A + 0.5(B-A))
# B -> AC = B - (A + 0.5(C-A))
# A -> BC = A - (B
dim = np.shape(polygon)
medians = np.zeros((dim[0],3,2,dim[1]))
medians[:,0,0] = polygon[:,2]
medians[:,0,1] = polygon[:,0] + 0.5*(polygon[:,1]-polygon[:,0])
medians[:,1,0] = polygon[:,1]
medians[:,1,1] = polygon[:,0] + 0.5*(polygon[:,2]-polygon[:,0])
medians[:,2,0] = polygon[:,0]
medians[:,2,1] = polygon[:,1] + 0.5*(polygon[:,2]-polygon[:,1])
m1 = np.linalg.norm(medians[:,0,0]-medians[:,0,1],axis=1)
m2 = np.linalg.norm(medians[:,1,0]-medians[:,1,1],axis=1)
m3 = np.linalg.norm(medians[:,2,0]-medians[:,2,1],axis=1)
medianlengths = np.vstack((m1,m2,m3)).T
maxlengths = np.argmax(medianlengths,axis=1)
final = np.zeros((dim[0],2,dim[1]))
dim = np.shape(medians)
for i in range(0,dim[0]):
idx = maxlengths[i]
final[i] = medians[i,idx]
return final
Now I am creating the final median vector matrix using an empty matrix first. The lengths are calculated using np.linalg.norm and collected in a matrix. For this matrix, the argmax method is used to identify to target median vector.
Problem
Old:However, I am somehow confused by the dimensionality and currently not able to get this to work or to understand if the result is correct.
Does somebody know how to do this correctly and/or if this approach is efficient?
My target would be a construct of the 3 medians in form of [n_polygons, 3( due to 3 medians), 2 (start and end point), 3 (xyz)]
Using the max lengths information, i would like to reduce it to [n_polygons, 2 (start and end point), 3 (xyz)]
Using this improvised for loop in the function, I can create the output. But there has to be a more "clean" matrix method to it. Using medians[:,maxlengths,:,:] leads to a shape of [4,n_polygons,2,3] instead of [n_polygons,2,3] and I do not understand why.
Example image for medians of two triangles:
Unfortunately, I don't have a large exemplary data set but I guess that this can be generated quite quickly. The example data set from the picture shown above is:
polygons = np.array([[0,1,2],[0,3,2]])
points = np.array([[0,0],
[1,0],
[1,1],
[0,1]])
polygons3d = points[polygons[:,:]]

The longest median is for the shortest triangle side. Look here and rewrite median length formula as
M[i] = Sqrt(2(a^2+b^2+c^2)-3*side[i]^2) / 2
So you can simplify calculations a bit using only side lengths (perhaps you already have them)
Concerning 3D coordinates - just use projection on any coordinate plane not perpendicular to your point plane - ignore one dimension (choose dimension with the lowest value range)

What is the Most Efficient Way to Compute the (euclidean) Distance of the Nearest Neighbor in a List of (x,y,z) points?

What is the most efficient way compute (euclidean) distance of the nearest neighbor for each point in an array?
I have a list of 100k (X,Y,Z) points and I would like to compute a list of nearest neighbor distances. The index of the distance would correspond to the index of the point.
I've looked into PYOD and sklearn neighbors, but those seem to require "teaching". I think my problem is simpler than that. For each point: find nearest neighbor, compute distance.
Example data:
points = [
(0 0 1322.1695
0.006711111 0 1322.1696
0.026844444 0 1322.1697
0.0604 0 1322.1649
0.107377778 0 1322.1651
0.167777778 0 1322.1634
0.2416 0 1322.1629
0.328844444 0 1322.1631
0.429511111 0 1322.1627...)]
compute k = 1 nearest neighbor distances
result format:
results = [nearest neighbor distance]
example results:
results = [
0.005939372
0.005939372
0.017815632
0.030118587
0.041569616
0.053475883
0.065324964
0.077200014
0.089077602)
]
UPDATE:
I've implemented two of the approaches suggested.
Use the scipy.spatial.cdist to compute the full distances matrices
Use a nearest X neighbors in radius R to find subset of neighbor distances for every point and return the smallest.
Results are that Method 2 is faster than Method 1 but took a lot more effort to implement (makes sense).
It seems the limiting factor for Method 1 is the memory needed to run the full computation, especially when my data set is approaching 10^5 (x, y, z) points. For my data set of 23k points, it takes ~ 100 seconds to capture the minimum distances.
For method 2, the speed scales as n_radius^2. That is, "neighbor radius squared", which really means that the algorithm scales ~ linearly with number of included neighbors. Using a Radius of ~ 5 (more than enough given application) it took 5 seconds, for the set of 23k points, to provide a list of mins in the same order as the point_list themselves. The difference matrix between the "exact solution" and Method 2 is basically zero.
Thanks for everyones' help!

Similar to Caleb's answer, but you could stop the iterative loop if you get a distance greater than some previous minimum distance (sorry - no code).
I used to program video games. It would take too much CPU to calculate the actual distance between two points. What we did was divide the "screen" into larger Cartesian squares and avoid the actual distance calculation if the Delta-X or Delta-Y was "too far away" - That's just subtraction, so maybe something like that to qualify where the actual Eucledian distance metric calculation is needed (extend to n-dimensions as needed)?
EDIT - expanding "too far away" candidate pair selection comments.
For brevity, I'll assume a 2-D landscape.
Take the point of interest (X0,Y0) and "draw" an nxn square around that point, with (X0,Y0) at the origin.
Go through the initial list of points and form a list of candidate points that are within that square. While doing that, if the DeltaX [ABS(Xi-X0)] is outside of the square, there is no need to calculate the DeltaY.
If there are no candidate points, make the square larger and iterate.
If there is exactly one candidate point and it is within the radius of the circle incribed by the square, that is your minimum.
If there are "too many" candidates, make the square smaller, but you only need to reexamine the candidate list from this iteration, not all the points.
If there are not "too many" candidates, then calculate the distance for that list. When doing so, first calculate DeltaX^2 + DeltaY^2 for the first candidate. If for subsequent candidates the DetlaX^2 is greater than the minumin so far, no need to calculate the DeltaY^2.
The minimum from that calculation is the minimum if it is within the radius of the circle inscribed by the square.
If not, you need to go back to a previous candidate list that includes points within the circle that has the radius of that minimum. For example, if you ended with one candidate in a 2x2 square that happened to be on the vertex X=1, Y=1, distance/radius would be SQRT(2). So go back to a previous candidate list that has a square greated or equal to 2xSQRT(2).
If warranted, generate a new candidate list that only includes points withing the +/- SQRT(2) square.
Calculate distance for those candidate points as described above - omitting any that exceed the minimum calcluated so far.
No need to do the square root of the sum of the Delta^2 until you have only one candidate.
How to size the initial square, or if it should be a rectangle, and how to increase or decrease the size of the square/rectangle could be influenced by application knowledge of the data distribution.
I would consider recursive algorithms for some of this if the language you are using supports that.

How about this?
from scipy.spatial import distance
A = (0.003467119 ,0.01422762 ,0.0101960126)
B = (0.007279433 ,0.01651597 ,0.0045558849)
C = (0.005392258 ,0.02149997 ,0.0177409387)
D = (0.017898802 ,0.02790659 ,0.0006487222)
E = (0.013564214 ,0.01835688 ,0.0008102952)
F = (0.013375397 ,0.02210725 ,0.0286032185)
points = [A, B, C, D, E, F]
results = []
for point in points:
distances = [{'point':point, 'neighbor':p, 'd':distance.euclidean(point, p)} for p in points if p != point]
results.append(min(distances, key=lambda k:k['d']))
results will be a list of objects, like this:
results = [
{'point':(x1, y1, z1), 'neighbor':(x2, y2, z2), 'd':"distance from point to neighbor"},
...]
Where point is the reference point and neighbor is point's closest neighbor.

The fastest option available to you may be scipy.spatial.distance.cdist, which finds the pairwise distances between all of the points in its input. While finding all of those distances may not be the fastest algorithm to find the nearest neighbors, cdist is implemented in C, so it is likely run faster than anything you try in Python.
import scipy as sp
import scipy.spatial
from scipy.spatial.distance import cdist
points = sp.array(...)
distances = sp.spatial.distance.cdist(points)
# An element is not its own nearest neighbor
sp.fill_diagonal(distances, sp.inf)
# Find the index of each element's nearest neighbor
mins = distances.argmin(0)
# Extract the nearest neighbors from the data by row indexing
nearest_neighbors = points[mins, :]
# Put the arrays in the specified shape
results = np.stack((points, nearest_neighbors), 1)
You could theoretically make this run faster (mostly by combining all of the steps into one algorithm), but unless you're writing in C, you won't be able to compete with SciPy/NumPy.
(cdist runs in Θ(n2) time (if the size of each point is fixed), and every other part of the algorithm in O(n) time, so even if you did try to optimize the code in Python, you wouldn't notice the change for small amounts of data, and the improvements would be overshadowed by cdist for more data.)

Algorithm for grouping points in given distance

I'm currently searching for an efficient algorithm that takes in a set of points from three dimensional spaces and groups them into classes (maybe represented by a list). A point should belong to a class if it is close to one or more other points from the class. Two classes are then the same if they share any point.
Because I'm working with large data sets, I don't want to use recursive methods. Also, using something like a distance matrix with O(n^2) performance is what I try to avoid.
I tried to check for some algorithms online, but most of them don't appeal to this specific purpose (e.g. k-d tree or other cluster algorithms). I thought about parting space into smaller parts, but that (potentially) results in an inexact result.
I tried to write something myself, but it turned out to be flawed. I would sort my points after distance and append the distance as a fourth coordinate and then repeat the following the following code-segment:
def grouping_presorted(lst, distance):
positions = [0]
x = []
while positions:
curr_el = lst[ positions[-1] ]
nn_i = HasNeighbor(lst, distance, positions[-1])
if nn_i is None:
x.append(lst.pop(positions[-1]) )
positions.pop(-1)
else:
positions.append(nn_i)
return x
def HasNeighbor(lst,distance,index):
i = index+1
while lst[i][3]- lst[index][3] < distance:
dist = (lst[i][0]-lst[index][0])**2 + (lst[i][1]-lst[index][1])**2 + (lst[i][2]-lst[index][2])**2
if dist < distance:
return i
i+=1
return None
Aside from an (probably easy to fix) overflow error, there's a bigger flaw in the logic of linking the points. If you think of my points describing lines in space, the algorithm only works for lines that strictly point outwards the origin, but not for circles or similar structures.
Does anybody know of a prewritten code for this or have an idea what I could try?
Thanks in advance.
Edit: It seems my spelling and maybe confusion of some terms has sparked some misunderstandings. I hope that this (badly-made) sketch helps. In this example, I marked my reference distance as d and circled the two containers I wan't to end up with in red.

You could try https://en.wikipedia.org/wiki/OPTICS_algorithm. When you index the points first (e.g, with an R-Tree) this should be possible in O(n log n).
Edit:
If you already know your epsilon and how many points are minimally in a cluster (minpoints) then DBSCAN could be the better choice.

What I ended up doing
After following all the suggestions of your comments, help from cs.stackexchange and doing some research I was able to write down two different methods for solving this problem. In case someone might be interested, I decided to share them here. Again, the problem is to write a program that takes in a set of coordinate tuples and groups them into clusters. Two points x,y belong to the same cluster if there is a sequence of elements x=x_1,..,y=x_N such that d(x_i,x_i+1)
DBSCAN: By fixing euclidean metric, minPts = 2 and grouping distance epsilon = r.
scikit-learn provides a nice implementation of this algorithm. A minimal code snippet for the task would be:
from sklearn.cluster import DBSCAN
from sklearn.datasets.samples_generator import make_blobs
import networkx as nx
import scipy.spatial as sp
def cluster(data, epsilon,N): #DBSCAN, euclidean distance
db = DBSCAN(eps=epsilon, min_samples=N).fit(data)
labels = db.labels_ #labels of the found clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0) #number of clusters
clusters = [data[labels == i] for i in range(n_clusters)] #list of clusters
return clusters, n_clusters
centers = [[1, 1,1], [-1, -1,1], [1, -1,1]]
X,_ = make_blobs(n_samples=N, centers=centers, cluster_std=0.4,
random_state=0)
cluster(X,epsilon,N)
On my machine, N=20000 for this clustering variation with an epsilon of epsilon = 0.1 takes just 290ms, so this seems really quick to me.
Graph components: One can think of this problem as follows: The coordinates define nodes of a graph, and two nodes are adjacent if their distance is smaller than epsilon/r. A cluster is then given as a connected component of this graph. At first I had problems implementing this graph, but there are many ways to write a linear time algorithm to do this. The easiest and fastest way however, for me, was to use scipy.spatial's cKDTree data structure and the corresponding query_pairs() method, that returns a list of indice tuples of points that are in given distance. One could for example write it like this:
class IGraph:
def __init__(self, nodelst=[], radius = 1):
self.igraph = nx.Graph()
self.radii = radius
self.nodelst = nodelst #nodelst is array of coordinate tuples, graph contains indices as nodes
self.__make_edges__()
def __make_edges__(self):
self.igraph.add_edges_from( sp.cKDTree(self.nodelst).query_pairs(r=self.radii) )
def get_conn_comp(self):
ind = [list(x) for x in nx.connected_components(self.igraph) if len(x)>1]
return [self.nodelst[indlist] for indlist in ind]
def graph_cluster(data, epsilon):
graph = IGraph(nodelst = data, radius = epsilon)
clusters = graph.get_conn_comp()
return clusters, len(clusters)
For the same dataset mentioned above, this method takes 420ms to find the connected components. However, for smaller clusters, e.g. N=700, this snippet runs faster. It also seems to have an advantage for finding smaller clusters (that is being given smaller epsilon values) and a vast disadvantage in the other direction (all on this specific dataset of course). I think, depending on the given situation, both methods are worth considering.
Hope this is of use for somebody.
Edit: Theoretically, DBSCAN has computational complexity O(n log n) when properly implemented (according to wikipedia...), while constructing the graph as well as finding its connected components runs linear in time. I'm not sure how well these statements hold for the given implementations though.

Adapt a path-finding algorithm, such as Dijkstra's or A*, or alternatively adapt the breadth-first or depth-first search of a graph. Start at any point in the set of unvisited points, and proceed with whichever algorithm you've picked with the caveat that a point is considered to be connected only to all points to which its distance is less than the threshhold. When you've finished off with one class (i.e. when you can discover no more new nodes), pick any node from the set of unvisited nodes and repeat.

Generate, fill and plot a hexagonal lattice in Python

I'd like to modify a Python script of mine operating on a square lattice (it's an agent based model for biology), to work in a hexagonal universe.
This is how I create and initialize the 2D matrix in the square model: basically, N is the size of the lattice and R gives the radius of the part of the matrix where I need to change value at the beginning of the algorithm:
a = np.zeros(shape=(N,N))
center = N/2
for i in xrange(N):
for j in xrange(N):
if( ( pow((i-center),2) + pow((j-center),2) ) < pow(R,2) ):
a[i,j] = 1
I then let the matrix evolve according to certains rules and finally print via the creation of a pickle file:
name = "{0}-{1}-{2}-{3}-{4}.pickle".format(R, A1, A2, B1, B2)
pickle.dump(a, open(name,"w"))
Now, I'd like to do exactly the same but on an hexagonal lattice. I read this interesting StackOverflow question which clearified how to represent the positions on a hexagonal lattice with three coordinates, but a couple of things stay obscure to my knowledge, i.e.
(a) how should I deal with the three axes in Python, considering that what I want is not equivalent to a 3D matrix, due to the constraints on the coordinates, and
(b) how to plot it?
As for (a), this is what I was trying to do:
a = np.zeros(shape=(N,N,N))
for i in xrange(N/2-R, N/2+R+1):
for j in xrange(N/2-R, N/2+R+1):
for k in xrange(N/2-R, N/2+R+1):
if((abs(i)+abs(j)+abs(k))/2 <= 3*N/4+R/2):
a[i,j,k] = 1
It seems to me pretty convoluted to initialize a NxNxN matrix like that and then find a way to print a subset of it according to the constraints over the coordinates. I'm looking for a simpler way and, more importantly, for understanding how to plot the hexagonal lattice resulting from the algorithm (no clue on that, I haven't tried anything for the moment).

I agree that trying to shoehorn a hexagonal lattice into a cubic is problematic. My suggestion is to use a general scheme - represent the neighboring sites as a graph. This works very well with pythons dictionary object and it trivial to implement the "axial coordinate scheme" in one of the links you provided. Here is an example that creates and draws the "lattice" using networkx.
import networkx as nx
G = nx.Graph(directed=False)
G.add_node((0,0))
for n in xrange(4):
for (q,r) in G.nodes():
G.add_edge((q,r),(q,r-1))
G.add_edge((q,r),(q-1,r))
G.add_edge((q,r),(q-1,r+1))
G.add_edge((q,r),(q,r+1))
G.add_edge((q,r),(q+1,r-1))
G.add_edge((q,r),(q+1,r))
pos = nx.graphviz_layout(G,prog="neato")
nx.draw(G,pos,alpha=.75)
import pylab as plt
plt.axis('equal')
plt.show()
This is isn't the most optimal implementation but it can generate arbitrarily large lattices:

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.