The project I'm working on involves reading and analyzing a huge data set (Illustris 1 Dark, about 4,000,000 dark matter halos). In order to get better results, I want to impose an isolation criterion as follows:
Only keep those halos that are the biggest within a 2D circle of radius 300 kpc, and discard the other halos in that circle.
The current implementation I have runs in O(n^2) time, which means the code could take days to finish. I really want to do better but can't figure out how. This is what I have so far:
# Function for returning a list of neighbors of Group1
def neighbors(Group1, Radius):
    Neighbors = []
    for Group2 in Groups:
        if Distance_2D(Group1, Group2) < Radius:
            Neighbors.append(Group2)
    return Neighbors
# Function for returning the biggest group given a list of neighbors
def biggest(Neighbors):
    Biggest = Neighbors[0]
    for N in Neighbors:
        if N.mass > Biggest.mass:
            Biggest = N
    return Biggest
# Putting it all together
for Group in Groups:
    Neighbors = neighbors(Group, 300)
    if Group != biggest(Neighbors):
        Groups.remove(Group)
    else:
        for N in Neighbors:   # Group itself is in Neighbors, so this removes it too
            Groups.remove(N)
        Groups.append(Group)  # put the biggest group back
After the for-loop, Groups should be a list of halos that are the largest within their 300 kpc radius.
I also know that removing something from a list while iterating over the same list is not good practice, so if your hint/answer takes care of that, that would be great!
Thank you all in advance :)
The halo with the largest mass will always be kept. Any halo closer than the radius to an already-kept halo will not be kept. With that, the outline of the algorithm is:
result = []
for h in sorted(halos, key=lambda h: h.mass, reverse=True):
    if all(distance(h.position, x.position) > 300 for x in result):
        result.append(h)
The distance check is the tricky part.
If the expected size of the result list is small, then each halo is checked against that small list. That algorithm has complexity O(n*log(n) + n * |result set|).
If the expected size of the result list is large, then a space partition can help.
Like:
result = []
sp = SpacePartition()  # Some space partition
for h in sorted(halos, key=lambda h: h.mass, reverse=True):
    if not sp.has_point_close(h.position, 300):
        result.append(h)
        sp.add(h.position)
That algorithm has complexity O(n*log(n)), since checking and adding to the partition are O(log(n)) each.
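If you don't have a k-d tree handy, one simple way to fill in the SpacePartition placeholder is a uniform grid (spatial hash). This is only a sketch: GridPartition is a made-up name, and it assumes each halo's position is a 2D (x, y) tuple in the same units as the radius.

from collections import defaultdict

class GridPartition:
    """Uniform grid (spatial hash) with cell size equal to the search radius."""
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, pos):
        return (int(pos[0] // self.cell_size), int(pos[1] // self.cell_size))

    def add(self, pos):
        self.cells[self._cell(pos)].append(pos)

    def has_point_close(self, pos, radius):
        # With cell_size >= radius, any point within `radius` of `pos` must
        # lie in the 3x3 block of cells around the cell containing `pos`.
        cx, cy = self._cell(pos)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for p in self.cells.get((cx + dx, cy + dy), ()):
                    if (p[0] - pos[0])**2 + (p[1] - pos[1])**2 < radius**2:
                        return True
        return False

Used in the loop above:

result = []
sp = GridPartition(300)  # cell size = isolation radius in kpc
for h in sorted(halos, key=lambda h: h.mass, reverse=True):
    if not sp.has_point_close(h.position, 300):
        result.append(h)
        sp.add(h.position)

Lookups only touch the nine cells around a point, so checking and adding are effectively constant time for reasonably uniform data; the O(log n) bound quoted above applies to tree-based partitions such as a k-d tree.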
Optimizing a leetcode-style question - DP/DFS
The task is the following:
Given N heights, find the minimum number of suboptimal jumps required to go from start to end. [1-D Array]
A jump is suboptimal, if the height of the starting point i is less or equal to the height of the target point j.
A jump from i to j is possible if j - i <= k, where k is the maximal jump distance.
For the first subtask, there is only one k value.
For the second subtask, there are two k values; output the number of suboptimal jumps for each k value.
For the third subtask, there are 100 k values; output the number of suboptimal jumps for each k value.
My Attempt
The following snippet is my shot at solving the problem, it gives the correct solution.
This was optimized to handle multiple k values without having to do a lot of unnecessary work.
The problem is that even the solution for a single k value is O(n^2) in the worst case (since k <= N).
A solution would be to eliminate the nested for loop, but I'm not sure how to approach that.
def solve(testcase):
    N, Q = 10, 1
    h = [1, 2, 4, 2, 8, 1, 2, 4, 8, 16]  # output 3
    # ^---- + ---^ 0 ^--- + --^ + ^
    k = [3]
    l_k = max(k)
    distances = [99999999999] * N
    distances[N-1] = 0
    db = [[0]*N for i in range(N)]
    for i in range(N-2, -1, -1):
        minLocalDistance = 99999999999
        for j in range(min(i+l_k, N-1), i, -1):
            minLocalDistance = min(minLocalDistance, distances[j] + (h[i] <= h[j]))
            db[i][j] = distances[j] + (h[i] <= h[j])
        distances[i] = minLocalDistance
    print(f"Case #{testcase}: {distances[0]}")
NOTE: This is different from the classic min. jumps problem
Consider the best cost to get to a position i. It is the smaller of:
The minimum cost to get to any of the preceding k positions, plus one (a suboptimal jump); or
The minimum cost to get to any of the lower-height positions in the same window (an optimal jump).
Case (1) can be handled with the sliding-window-minimum algorithm that you can find described, for example, here: Sliding window maximum in O(n) time. This takes amortized constant time per position, or O(N) all together.
Case (2) has a somewhat obvious solution with a BST: As the window moves, insert each new position into a BST sorted by height. Remove positions that are no longer in the window. Additionally, in each node, store the minimum cost within its subtree. With this structure, you can find the minimum cost for any height bound in O(log k) time.
The expense in case 2 leads to a total complexity of O(N log k) for a single k-value. That's not too bad for complexity, but such BSTs are somewhat complicated and aren't usually provided in standard libraries.
You can make this simpler and faster by recognizing that if the minimum cost in the window is C, then optimal jumps are only beneficial if they come from predecessors of cost C, because cost C+1 is attainable with a sub-optimal jump.
For each cost, then, you can use that same sliding-window-minimum algorithm to keep track of the minimum height in the window for nodes with that cost. Then for case (2), you just need to check to see if that minimum height for the minimum cost is lower than the height you want to jump to.
Maintaining these sliding windows again takes amortized constant time per operation, leading to O(N) time for the whole single-k-value algorithm.
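For reference, here is a minimal sketch of the sliding-window-minimum primitive that both cases rely on; `costs` and the window size `k` are placeholder names, not variables from the code above.

from collections import deque

def sliding_window_min(costs, k):
    """Yield, for each index i, min(costs[max(0, i - k + 1) : i + 1]).
    A monotone deque of indices gives amortized O(1) work per element."""
    dq = deque()  # indices of candidate minima; costs increase along the deque
    for i, c in enumerate(costs):
        while dq and dq[0] <= i - k:      # drop indices outside the window
            dq.popleft()
        while dq and costs[dq[-1]] >= c:  # drop dominated candidates
            dq.pop()
        dq.append(i)
        yield costs[dq[0]]                # current window minimum

For example, list(sliding_window_min([3, 1, 4, 1, 5, 9, 2], 3)) yields [3, 1, 1, 1, 1, 1, 2]. In case (1) this window would run over the DP cost array; in case (2), as described above, the same structure tracks the minimum height among positions with the current minimum cost.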
I doubt that there would be any benefit in trying to manage multiple k-values at once.
Given a list of 20 floats, I want to find a largest subset in which any two of the numbers differ by more than mindiff = 1. Right now I am using a brute-force method that searches from the largest to the smallest subsets using itertools.combinations. As shown below, the code finds a subset after about 4 s for a list of 20 numbers.
from itertools import combinations
import random
from time import time

mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]
t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # calculate the differences between every two adjacent numbers
        diffs = [j-i for i, j in zip(ss[:-1], ss[1:])]
        if min(diffs) > mindiff:
            sample = set(subset)
            found = True
            break
    # check subsets with size n-1
    n -= 1
print(sample)
print(time()-t0)
Output:
{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524
However, in reality I have a list of 200 numbers, for which a brute-force enumeration is infeasible. I want a fast algorithm to sample just one random largest subset with a minimum difference larger than 1. Note that I want each sample to be random and of maximum size. Any suggestions?
My previous answer assumed you simply wanted a single optimal solution, not a uniform random sample of all solutions. This answer assumes you want one that samples uniformly from all such optimal solutions.
1. Construct a directed acyclic graph G where there is one node for each point, and there is an edge from a to b when b - a > mindist. Also add two virtual nodes, s and t, where s -> x for all x and x -> t for all x.
2. Calculate for each node in G how many paths of length k exist to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k], filling initially P[x][0] = 0 except P[t][0] = 1, and then P[x][k] = sum(P[y][k-1] for y in neighbors(x)). Keep doing this until you reach the maximum k - you now know the size of the optimal subset.
3. Uniformly sample a path of length k from s to t, using P to weight your choices. This is done by starting at s: we look at each neighbor y of s and choose one at random, weighted by P[y][k-1], the number of ways of completing the path through that neighbor. This gives us the first element of the optimal set. We then repeat this step: at node x, with i steps already taken, we look at the neighbors y of x and pick one at random with weight P[y][k-i-1].
4. Use the nodes you sampled in step 3 as your random subset.
An implementation of the above in pure Python:
import random

def sample_mindist_subset(xs, mindist):
    # Construct directed graph G.
    n = len(xs)
    s = n; t = n + 1  # Two virtual nodes, source and sink.
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []

    # Compute number of paths P[x][k] from x to t of length k.
    P = [[0 for _ in range(n+2)] for _ in range(n+2)]
    P[t][0] = 1
    for k in range(1, n+2):
        for x in range(n+2):
            P[x][k] = sum(P[y][k-1] for y in neighbors[x])

    # Sample maximum length path uniformly at random.
    maxk = max(k for k in range(n+2) if P[s][k] > 0)
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[cn][maxk-len(path)] for cn in candidates]
        path.append(random.choices(candidates, weights)[0])
    return [xs[i] for i in path[1:-1]]
Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time and can re-use it.
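A quick usage sketch (the numbers here are made up, not the question's data):

xs = [1.2, 2.5, 2.9, 4.1, 5.0, 7.3]
print(sample_mindist_subset(xs, 1.0))
# One maximum-size subset, e.g. [1.2, 2.5, 4.1, 7.3]; repeated calls may
# return a different subset of the same (maximum) size.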
I probably don't fully understand the question, because right now the solution is quite trivial. EDIT: yes, I misunderstood after all; the OP does not just want an optimal solution, but wishes to randomly sample from the set of optimal solutions. This answer is not incorrect, but it answers a different question than the one the OP is interested in.
Simply sort the numbers and greedily construct the subset:
def mindist_subset(xs, mindist):
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result
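For example, on a small made-up list:

print(mindist_subset([3.0, 1.0, 3.9, 1.8, 5.2], 1.0))
# [1.0, 3.0, 5.2]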
Sketch of proof of correctness.
Suppose we have a solution S of optimal size for input array A. If S does not contain min(A), note that we could remove min(S) from S and add min(A) instead, since this can only increase the gap between the smallest and second-smallest numbers in S. Conclusion: we can assume without loss of generality that min(A) is part of an optimal solution.
Now we can apply this argument recursively. We add min(A) to the solution and remove all elements too close to min(A), leaving the remaining elements A'. We are then left with a subproblem to which exactly the same argument applies: we can choose min(A') as the next element of the solution, and so on.
I have a list of tuples formed by 1000 object ids and their scores, i.e.:
scored_items = [('14',534.9),('4',86.0),('78',543.21),....].
Let T be the aggregated score of the top 20 highest scoring items.
That's easy. Using python:
top_20 = sorted(scored_items, key=lambda k: k[1], reverse=True)[:20]
T = sum(n for _, n in top_20)
Next, let t equal a quarter of T. I.e. in python: t = math.ceil(T/4)
My question is: what's the most efficient way to randomly select 20 items (without replacement) from scored_items such that their aggregated score is equal to or greater than (but never lower than) t? They may or may not include items from top_20.
Would prefer an answer in Python, and would prefer to not rely on external libraries much
Background: This is an item-ranking algorithm that is strategy proof according to an esoteric - but useful - Game Theory theorem. Source: section 2.5 in this paper, or just read footnote 18 on page 11 of this same link. Btw strategy proof essentially means it's tough to game it.
I'm a neophyte python programmer and have been mulling how to solve this problem for a while now, but just can't seem to wrap my head around it. Would be great to know how the experts would approach and solve this.
I suppose the most simplistic (and least performant perhaps) way is to keep randomly generating sets of 20 items till their scores' sum exceeds or equals t.
But there has to be a better way to do this right?
Here is an implementation of what I mentioned in the comments.
Since we want items such that the sum of the scores is large, we can weight the choice so that we are more likely to pick samples with large scores.
import numpy as np
import math

def normalize(p):
    return p / sum(p)

def get_sample(scored_items, N=20, max_iter=1000):
    topN = sorted(scored_items, key=lambda k: k[1], reverse=True)[:N]
    T = sum(n for _, n in topN)
    t = math.ceil(T/4)
    i = 0
    scores = np.array([x[1] for x in scored_items])
    p = normalize(scores)
    while i < max_iter:
        sample_indexes = np.random.choice(a=range(len(scored_items)), size=N, replace=False, p=p)
        sample = [scored_items[x] for x in sample_indexes]
        if sum(n for _, n in sample) >= t:
            print("Found a solution at iteration %d" % i)
            return sample
        i += 1
    print("Could not find a solution after %d iterations" % max_iter)
    return None
An example of how to use it:
np.random.seed(0)
ids = range(1000)
scores = 10000*np.random.random_sample(size=len(ids))
scored_items = list(zip(map(str, ids), scores))
sample = get_sample(scored_items, 20)
#Found a solution at iteration 0
print(sum(n for _, n in sample))
#139727.1229832652
Though this is not guaranteed to get a solution, I ran this in a loop 100 times and each time a distinct solution was found on the first iteration.
Though I do not know of an efficient way for huge lists, something like this works even for 1000 or so items. You can do a bit better if you don't need true randomness:
import random

testList = [x for x in range(1, 1000)]
t = sum(range(975, 1000)) / 4
while True:
    rs = random.sample(testList, 15)
    if sum(rs) >= t:
        break
print(rs)
I'm trying to calculate the area of a skyline (overlapping rectangles with the same baseline):
building_count = int(input())
items = {}  # dictionary, location on x axis is the key, height is the value
count = 0   # total area
for j in range(building_count):
    line = input().split(' ')
    H = int(line[0])  # height
    L = int(line[1])  # left point (start of the building)
    R = int(line[2])  # right point (end of the building)
    for k in range(R - L):
        if not (L+k in items):   # if it's not there, add it
            items[L+k] = H
        elif H > items[L+k]:     # if we have a higher building on that index
            items[L+k] = H
for value in items.values():     # we add each column basically
    count += value
print(count)
sample input would be:
5
3 -3 0
2 -1 1
4 2 4
2 3 7
3 6 8
and output is 29.
The issue is memory efficiency: when there are lots of values, the script simply throws MemoryError. Anyone have some ideas for optimizing memory usage?
You are allocating a separate key-value pair for every single integer value in your range. Imagine the case where L = 1 and R = 1000000: your items dictionary will be filled with a million entries. Your basic idea of processing/removing overlaps is sound, but the way you do it is massive overkill.
Like so much else in life, this is a graph problem in disguise. Imagine the vertices being the rectangles you are trying to process and the (weighted) edges being the overlaps. The complication is that you cannot just add up the areas of the vertices and subtract the areas of the overlaps, because many of the overlaps overlap each other as well. The overlap issue can be resolved by applying a transformation that converts two overlapping rectangles into non-overlapping rectangles, effectively cutting the edge that connects them. The transformation is shown in the image below. Notice that in some cases one of the vertices will be removed as well, simplifying the graph, while in another case a new vertex is added:
Green: overlap to be chopped out.
Normally, if we have m rectangles and n overlaps between them, constructing the graph would be an O(m^2) operation because we would have to check all vertices for overlaps against each other. However, we can bypass constructing the input graph entirely to get an O(m + n) traversal algorithm, which is optimal since we only analyze each rectangle once and construct the output graph, with no overlaps, as efficiently as possible. O(m + n) assumes that your input rectangles are sorted by their left edges in ascending order. If that is not the case, the algorithm will be O(m*log(m) + n) to account for the initial sorting step. Note that as the graph density increases, n will go from ~m to ~m^2. This confirms the intuitive idea that the fewer overlaps there are, the more you would expect the process to run in O(m) time, while the more overlaps there are, the closer you get to O(m^2) time.
The space complexity of the proposed algorithm will be O(m): each rectangle in the input will result in at most two rectangles in the output, and 2m = O(m).
Enough about complexity analysis and on to the algorithm itself. The input will be a sequence of rectangles defined by L, R, H as you have now. I will assume that the input is sorted by the leftmost edge L. The output graph will be a linked list of rectangles defined by the same parameters, sorted in descending order by the rightmost edge. The head of the list will be the rightmost rectangle. The output will have no overlaps between any rectangles, so the total area of the skyline will just be the sum of H * (R - L) for each of the ~m output rectangles.
The reason for picking a linked list is that the only two operations we need is iteration from the head node and the cheapest insertion possible to maintain the list in sorted order. The sorting will be done as part of overlap checking, so we do not need to do any kind of binary searches through the list or anything like that.
Since the input list is ordered by increasing left edge and the output list is ordered by decreasing right edge, we can guarantee that each rectangle added will be checked only against the rectangles it actually overlaps¹. We will do overlap checking and removal as shown in the diagram above until we reach a rectangle whose left edge is less than or equal to the left edge of the new rectangle. All further rectangles in the output list are guaranteed not to overlap with the new rectangle. This check-and-chop operation guarantees that each overlap is visited at most once, and that no non-overlapping rectangles are processed unnecessarily, making the algorithm optimal.
Before I show code, here is a diagram of the algorithm in action. Red rectangles are new rectangles; note that their left edges progress to the right. Blue rectangles are ones that are already added and have overlap with the new rectangle. Black rectangles are already added and have no overlap with the new one. The numbering represents the order of the output list. It is always done from the right. A linked list is a perfect structure to maintain this progression since it allows cheap insertions and replacements:
Here is an implementation of the algorithm which assumes that the input coordinates are passed in as an iterable of objects having the attributes l, r, and h. The iteration order is assumed to be sorted by the left edge. If that is not the case, apply sorted or list.sort to the input first:
from collections import namedtuple

# Defined in this order so you can sort a list by left edge without a custom key
Rect = namedtuple('Rect', ['l', 'r', 'h'])

class LinkedList:
    __slots__ = ['value', 'next']
    """
    Implements a singly-linked list with mutable nodes and an iterator.
    """
    def __init__(self, value=None, next=None):
        self.value = value
        self.next = next

    def __iter__(self):
        """
        Iterate over the *nodes* in the list, starting with this one.
        The `value` and `next` attribute of any node may be modified
        during iteration.
        """
        while self:
            yield self
            self = self.next

    def __str__(self):
        """
        Provided for inspection purposes.
        Works well with `namedtuple` values.
        """
        return ' -> '.join(repr(x.value) for x in self)

def process_skyline(skyline):
    """
    Turns an iterable of rectangles sharing a common baseline into a
    `LinkedList` of rectangles containing no overlaps.
    The input is assumed to be sorted in ascending order by left edge.
    Each element of the input must have the attributes `l`, `r`, `h`.
    The output will be sorted in descending order by right edge.
    Return `None` if the input is empty.
    """
    def intersect(r1, r2, default=None):
        """
        Return (1) a flag indicating the order of `r1` and `r2`,
        (2) a linked list of between one and three non-overlapping
        rectangles covering the exact same area as `r1` and `r2`,
        (3) a pointer to the last node, and (4) a pointer to the
        second-to-last node, or `default` if there is only one node.
        The flag is set to True if the left edge of `r2` is strictly less
        than the left edge of `r1`. That would indicate that the left-most
        (last) chunk of the tuple came from `r2` instead of `r1`. For the
        algorithm as a whole, that means that we need to keep checking for
        overlaps.
        The resulting list is always returned sorted descending by the
        right edge. The input rectangles will not be modified. If they are
        not returned as-is, a `Rect` object will be used instead.
        """
        # Swap so left edge of r1 < left edge of r2
        if r1.l > r2.l:
            r1, r2 = r2, r1
            swapped = True
        else:
            swapped = False
        if r2.l >= r1.r:
            # case 0: no overlap at all
            last = LinkedList(r1)
            s2l = result = LinkedList(r2, last)
        elif r1.r < r2.r:
            # case 1: simple overlap
            if r1.h > r2.h:
                # Chop r2
                r2 = Rect(r1.r, r2.r, r2.h)
            else:
                r1 = Rect(r1.l, r2.l, r1.h)
            last = LinkedList(r1)
            s2l = result = LinkedList(r2, last)
        elif r1.h < r2.h:
            # case 2: split into 3
            r1a = Rect(r1.l, r2.l, r1.h)
            r1b = Rect(r2.r, r1.r, r1.h)
            last = LinkedList(r1a)
            s2l = LinkedList(r2, last)
            result = LinkedList(r1b, s2l)
        else:
            # case 3: complete containment
            result = LinkedList(r1)
            last = result
            s2l = default
        return swapped, result, last, s2l

    root = LinkedList()
    skyline = iter(skyline)
    try:
        # Add the first node as-is
        root.next = LinkedList(next(skyline))
    except StopIteration:
        # Empty input iterator
        return None
    for new_rect in skyline:
        prev = root
        for rect in root.next:
            need_to_continue, replacement, last, second2last = \
                intersect(rect.value, new_rect, prev)
            # Replace the rectangle with the de-overlapped regions
            prev.next = replacement
            if not need_to_continue:
                # Retain the remainder of the list
                last.next = rect.next
                break
            # Force the iterator to move on to the last node
            new_rect = last.value
            prev = second2last
    return root.next
Computing the total area is now trivial:
skyline = [
    Rect(-3, 0, 3), Rect(-1, 1, 2), Rect(2, 4, 4),
    Rect(3, 7, 2), Rect(6, 8, 3),
]
processed = process_skyline(skyline)
area = sum((x.value.r - x.value.l) * x.value.h for x in processed) if processed else None
Notice the altered order of the input parameters (h moved to the end). The resulting area is 29. This matches with what I get by doing the computation by hand. You can also do
>>> print(processed)
Rect(l=6, r=8, h=3) -> Rect(l=4, r=6, h=2) -> Rect(l=2, r=4, h=4) ->
Rect(l=0, r=1, h=2) -> Rect(l=-3, r=0, h=3)
This is to be expected from the diagram of the inputs/output shown below:
As an additional verification, I added a new building, Rect(-4, 9, 1) to the start of the list. It overlaps all the others and adds three units to area, or a final result of 32. processed comes out as:
Rect(l=8, r=9, h=1) -> Rect(l=6, r=8, h=3) -> Rect(l=4, r=6, h=2) ->
Rect(l=2, r=4, h=4) -> Rect(l=1, r=2, h=1) -> Rect(l=0, r=1, h=2) ->
Rect(l=-3, r=0, h=3) -> Rect(l=-4, r=-3, h=1)
Note:
While I am sure that this problem has been solved many times over, the solution I present here is entirely my own work, done without consulting any other references. The idea of using an implicit graph representation and the resulting analysis is inspired by a recent reading of Steven Skiena's Algorithm Design Manual, Second Edition. It is one of the best comp-sci books I have ever come across.
¹ Technically, if a new rectangle does not overlap any other rectangles, it will be checked against one rectangle it does not overlap. If that extra check was always the case, the algorithm would have an additional m - 1 comparisons to do. Fortunately, m + m + n - 1 = O(m + n) even if we always had to check one extra rectangle (which we don't).
The reason you are getting MemoryError is the huge size of the dictionary being created. In the worst case, the dict can have 10^10 keys, which would end up taking all your memory. If there really is a need, shelve is a possible solution to make use of such a large dict.
Instead of storing one entry per integer coordinate, you can keep a list of (x, height) pairs that only records the points where the skyline height changes. Let's say there is a building with 10 0 100 and another with 20 50 150; then that list might have info like [(-10^9, 0), (0, 10), (50, 20), (150, 0), (10^9, 0)]. As you come across more buildings, you can add more entries in this list. This will be O(n^2).
This might help you further.
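As a rough sketch of that breakpoint idea (my own illustration, not code from the answer above), keep the change points in a sorted list and, for each new building, raise every breakpoint that falls inside it. The function name and the (H, L, R) input layout follow the question's input format; memory stays proportional to the number of buildings, while the list insertions keep it O(n^2) in time as noted above.

import bisect

def skyline_area(buildings):
    """buildings: iterable of (H, L, R) triples, as in the question's input.
    Stores one breakpoint per height change instead of one dict entry per
    integer x coordinate."""
    INF = 10**9
    xs = [-INF, INF]  # sorted breakpoint x coordinates
    hs = [0, 0]       # hs[i] is the skyline height from xs[i] to xs[i+1]
    for H, L, R in buildings:
        i = bisect.bisect_right(xs, L) - 1  # segment containing L
        j = bisect.bisect_right(xs, R) - 1  # segment containing R
        height_at_R = hs[j]                 # height in effect just before R
        if xs[i] != L:                      # ensure a breakpoint exactly at L
            i += 1
            xs.insert(i, L)
            hs.insert(i, hs[i - 1])
            j += 1
        if xs[j] != R:                      # ensure a breakpoint exactly at R
            j += 1
            xs.insert(j, R)
            hs.insert(j, height_at_R)
        for k in range(i, j):               # raise every segment inside [L, R)
            hs[k] = max(hs[k], H)
    return sum(h * (x2 - x1) for x1, x2, h in zip(xs, xs[1:], hs))

For the sample input, skyline_area([(3, -3, 0), (2, -1, 1), (4, 2, 4), (2, 3, 7), (3, 6, 8)]) returns 29, matching the expected output.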
I have a list of points that looks like this:
points = [(54592748,54593510),(54592745,54593512), ...]
Many of these points are similar in the sense that points[n][0] is almost equal to points[m][0] AND points[n][1] is almost equal to points[m][1], where 'almost equal' is whatever integer I decide.
I would like to filter out all the similar points from the list, keeping just one of it.
Here is my code.
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10  # max distance allowed between two points

to_compare = points[:]  # make a list of items to compare
to_remove = set()       # keep track of items to be removed

for point in points:
    to_compare.remove(point)  # do not compare with itself
    for other_point in to_compare:
        if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
            to_remove.add(other_point)

for point in to_remove:
    points.remove(point)
It works...
>>>points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]
but I am looking for a faster solution since my list is millions items long.
PyPy helped a lot; it sped up the whole process 6 times, but probably there is a more efficient way to do this in the first place, or not?
Any help is very welcome.
=======
UPDATE
I have tested some of the answers with the points object you can pickle.load() from here https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs
My code takes 1104 seconds and reduces the list to 96428 points (from 99920).
David's code does the job in 14 seconds! But misses something: 96431 points left.
Martin's code takes 0.06 seconds!! But also misses something, 96462 points left.
Any clue about why the results are not the same?
Depending on how accurate you need this to be, the following approach should work well:
points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648), (1354358, 1619520), (54592746, 54593509)]
d = 20
hpoints = {(x - (x % d), y - (y % d)): (x, y) for x, y in points}

for x in hpoints.values():
    print(x)
This converts each point into a dictionary key with its x and y coordinates rounded down to the nearest multiple of d. The result is a dictionary holding the coordinates of the last point seen in a given area. For the data you have given, this would display the following:
(117628626, 117630648)
(54592746, 54593509)
(1354358, 1619520)
Sorting the list first avoids the inner for loop and thus the n^2 time. I'm not sure if it's practically any quicker, though, since I don't have your full data. Try this (it outputs the same as far as I can see from your example points, just ordered):
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10  # max distance allowed between two points

points.sort()
to_remove = set()  # keep track of items to be removed

for i, point in enumerate(points):
    if i == len(points) - 1:
        break
    other_point = points[i+1]
    if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
        to_remove.add(point)

for point in to_remove:
    points.remove(point)

print(points)
This function for getting unique items from a list (it isn't mine, I found it a while back) only loops over the list once (plus dictionary lookups).
def unique(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
The id function will require some cleverness. point[0] is divided by error and floored to an integer. So all point[0]'s such that x*error <= point[0] < (x+1)*error are the same and similarly for point[1].
def id(point):
    error = 4
    x = point[0] // error
    y = point[1] // error
    idValue = str(x) + "//" + str(y)
    return idValue
So these functions will reduce points between consecutive multiples of error to the same point. The good news is that it only touches the original list once, plus the dictionary lookups. The bad news is that this id function won't catch, for example, that 15 and 17 should be the same, because 15 reduces to 3 while 17 reduces to 4. It is possible that, with some cleverness, this issue could be resolved (see the sketch below).
[NOTE: I originally used exponents of primes for the idValue, but the exponents would be way too large. If you could make the idValue an int, that would increase lookup speed.]
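One possible way to resolve that boundary issue (my own sketch, not part of the original answer) is to keep the grid-cell bucketing but, before accepting a point, also check the eight neighbouring cells. The function name filter_similar is made up; md is the question's max-distance threshold.

def filter_similar(points, md):
    """Keep a point only if no previously kept point is within md in both
    coordinates. With grid cells of size md, only the 3x3 block of cells
    around a point can contain such a neighbour."""
    kept = {}    # (cell_x, cell_y) -> list of kept points in that cell
    result = []
    for x, y in points:
        cx, cy = x // md, y // md
        close = any(
            abs(x - px) <= md and abs(y - py) <= md
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for px, py in kept.get((cx + dx, cy + dy), ())
        )
        if not close:
            kept.setdefault((cx, cy), []).append((x, y))
            result.append((x, y))
    return result

On the question's five example points with md = 10 this returns [(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)], the same three points as the original O(n^2) code, while only looking at a constant number of cells per point.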