I'm trying to calculate the area of a skyline (overlapping rectangles with the same baseline).
building_count = int(input())
items = {}  # dictionary: location on the x axis is the key, height is the value
count = 0   # total area
for j in range(building_count):
    line = input().split(' ')
    H = int(line[0])  # height
    L = int(line[1])  # left point (start of the building)
    R = int(line[2])  # right point (end of the building)
    for k in range(R - L):
        if L + k not in items:  # if it's not there, add it
            items[L + k] = H
        elif H > items[L + k]:  # if we have a higher building on that index
            items[L + k] = H
for value in items.values():  # we add up each column, basically
    count += value
print(count)
A sample input would be:
5
3 -3 0
2 -1 1
4 2 4
2 3 7
3 6 8
and output is 29.
The issue is memory efficiency: when there are lots of values, the script simply throws a MemoryError. Does anyone have ideas for optimizing memory usage?
You are allocating a separate key-value pair for every single integer value in your range. Imagine the case where L = 1 and R = 1000000. Your items dictionary will be filled with 999999 entries. Your basic idea of processing/removing overlaps is sound, but the way you do it is massive overkill.
Like so much else in life, this is a graph problem in disguise. Imagine the vertices being the rectangles you are trying to process and the (weighted) edges being the overlaps. The complication is that you cannot just add up the areas of the vertices and subtract the areas of the overlaps, because many of the overlaps overlap each other as well. The overlap issue can be resolved by applying a transformation that converts two overlapping rectangles into non-overlapping rectangles, effectively cutting the edge that connects them. The transformation is shown in the image below. Notice that in some cases one of the vertices will be removed as well, simplifying the graph, while in another case a new vertex is added:
Green: overlap to be chopped out.
Normally, if we have m rectangles and n overlaps between them, constructing the graph would be an O(m^2) operation, because we would have to check all vertices for overlaps against each other. However, we can bypass constructing the input graph entirely to get an O(m + n) traversal algorithm, which is optimal since we only analyze each rectangle once, and we construct the output graph, which has no overlaps, as efficiently as possible. O(m + n) assumes that your input rectangles are sorted according to their left edges in ascending order. If that is not the case, the algorithm will be O(m log(m) + n) to account for the initial sorting step. Note that as the graph density increases, n will go from ~m to ~m^2. This confirms the intuitive idea that the fewer overlaps there are, the closer the process will run to O(m) time, while the more overlaps there are, the closer it will run to O(m^2) time.
The space complexity of the proposed algorithm will be O(m): each rectangle in the input will result in at most two rectangles in the output, and 2m = O(m).
Enough about complexity analysis and on to the algorithm itself. The input will be a sequence of rectangles defined by L, R, H as you have now. I will assume that the input is sorted by the leftmost edge L. The output graph will be a linked list of rectangles defined by the same parameters, sorted in descending order by the rightmost edge. The head of the list will be the rightmost rectangle. The output will have no overlaps between any rectangles, so the total area of the skyline will just be the sum of H * (R - L) for each of the ~m output rectangles.
The reason for picking a linked list is that the only two operations we need are iteration from the head node and the cheapest insertion possible to maintain the list in sorted order. The sorting is done as part of overlap checking, so we do not need to do any kind of binary search through the list or anything like that.
Since the input list is ordered by increasing left edge and the output list is ordered by decreasing right edge, we can guarantee that each rectangle added will be checked only against the rectangles it actually overlaps¹. We will do overlap checking and removal as shown in the diagram above until we reach a rectangle whose left edge is less than or equal to the left edge of the new rectangle. All further rectangles in the output list are guaranteed not to overlap with the new rectangle. This check-and-chop operation guarantees that each overlap is visited at most once, and that no non-overlapping rectangles are processed unnecessarily, making the algorithm optimal.
Before I show code, here is a diagram of the algorithm in action. Red rectangles are new rectangles; note that their left edges progress to the right. Blue rectangles are ones that are already added and have overlap with the new rectangle. Black rectangles are already added and have no overlap with the new one. The numbering represents the order of the output list. It is always done from the right. A linked list is a perfect structure to maintain this progression since it allows cheap insertions and replacements:
Here is an implementation of the algorithm which assumes that the input coordinates are passed in as an iterable of objects having the attributes l, r, and h. The iteration order is assumed to be sorted by the left edge. If that is not the case, apply sorted or list.sort to the input first:
from collections import namedtuple

# Defined in this order so you can sort a list by left edge without a custom key
Rect = namedtuple('Rect', ['l', 'r', 'h'])

class LinkedList:
    """
    Implements a singly-linked list with mutable nodes and an iterator.
    """
    __slots__ = ['value', 'next']

    def __init__(self, value=None, next=None):
        self.value = value
        self.next = next

    def __iter__(self):
        """
        Iterate over the *nodes* in the list, starting with this one.
        The `value` and `next` attribute of any node may be modified
        during iteration.
        """
        while self:
            yield self
            self = self.next

    def __str__(self):
        """
        Provided for inspection purposes.
        Works well with `namedtuple` values.
        """
        return ' -> '.join(repr(x.value) for x in self)

def process_skyline(skyline):
    """
    Turns an iterable of rectangles sharing a common baseline into a
    `LinkedList` of rectangles containing no overlaps.

    The input is assumed to be sorted in ascending order by left edge.
    Each element of the input must have the attributes `l`, `r`, `h`.

    The output will be sorted in descending order by right edge.

    Return `None` if the input is empty.
    """
    def intersect(r1, r2, default=None):
        """
        Return (1) a flag indicating the order of `r1` and `r2`,
        (2) a linked list of between one and three non-overlapping
        rectangles covering the exact same area as `r1` and `r2`,
        (3) a pointer to the last node, and (4) a pointer to the
        second-to-last node, or `default` if there is only one node.

        The flag is set to True if the left edge of `r2` is strictly less
        than the left edge of `r1`. That would indicate that the left-most
        (last) chunk of the tuple came from `r2` instead of `r1`. For the
        algorithm as a whole, that means that we need to keep checking for
        overlaps.

        The resulting list is always returned sorted descending by the
        right edge. The input rectangles will not be modified. If they are
        not returned as-is, a `Rect` object will be used instead.
        """
        # Swap so left edge of r1 < left edge of r2
        if r1.l > r2.l:
            r1, r2 = r2, r1
            swapped = True
        else:
            swapped = False
        if r2.l >= r1.r:
            # case 0: no overlap at all
            last = LinkedList(r1)
            s2l = result = LinkedList(r2, last)
        elif r1.r < r2.r:
            # case 1: simple overlap
            if r1.h > r2.h:
                # Chop r2
                r2 = Rect(r1.r, r2.r, r2.h)
            else:
                r1 = Rect(r1.l, r2.l, r1.h)
            last = LinkedList(r1)
            s2l = result = LinkedList(r2, last)
        elif r1.h < r2.h:
            # case 2: split into 3
            r1a = Rect(r1.l, r2.l, r1.h)
            r1b = Rect(r2.r, r1.r, r1.h)
            last = LinkedList(r1a)
            s2l = LinkedList(r2, last)
            result = LinkedList(r1b, s2l)
        else:
            # case 3: complete containment
            result = LinkedList(r1)
            last = result
            s2l = default
        return swapped, result, last, s2l

    root = LinkedList()
    skyline = iter(skyline)
    try:
        # Add the first node as-is
        root.next = LinkedList(next(skyline))
    except StopIteration:
        # Empty input iterator
        return None
    for new_rect in skyline:
        prev = root
        for rect in root.next:
            need_to_continue, replacement, last, second2last = \
                intersect(rect.value, new_rect, prev)
            # Replace the rectangle with the de-overlapped regions
            prev.next = replacement
            if not need_to_continue:
                # Retain the remainder of the list
                last.next = rect.next
                break
            # Force the iterator to move on to the last node
            new_rect = last.value
            prev = second2last
    return root.next
Computing the total area is now trivial:
skyline = [
    Rect(-3, 0, 3), Rect(-1, 1, 2), Rect(2, 4, 4),
    Rect(3, 7, 2), Rect(6, 8, 3),
]
processed = process_skyline(skyline)
area = sum((x.value.r - x.value.l) * x.value.h
           for x in processed) if processed else None
Notice the altered order of the input parameters (h moved to the end). The resulting area is 29, which matches what I get by doing the computation by hand. You can also do
>>> print(processed)
Rect(l=6, r=8, h=3) -> Rect(l=4, r=6, h=2) -> Rect(l=2, r=4, h=4) ->
Rect(l=0, r=1, h=2) -> Rect(l=-3, r=0, h=3)
This is to be expected from the diagram of the inputs/output shown below:
As an additional verification, I added a new building, Rect(-4, 9, 1), to the start of the list. It overlaps all the others and adds three units of area, for a final result of 32. processed comes out as:
Rect(l=8, r=9, h=1) -> Rect(l=6, r=8, h=3) -> Rect(l=4, r=6, h=2) ->
Rect(l=2, r=4, h=4) -> Rect(l=1, r=2, h=1) -> Rect(l=0, r=1, h=2) ->
Rect(l=-3, r=0, h=3) -> Rect(l=-4, r=-3, h=1)
Note:
While I am sure that this problem has been solved many times over, the solution I present here is entirely my own work, done without consulting any other references. The idea of using an implicit graph representation and the resulting analysis is inspired by a recent reading of Steven Skiena's Algorithm Design Manual, Second Edition. It is one of the best comp-sci books I have ever come across.
¹ Technically, if a new rectangle does not overlap any other rectangles, it will be checked against one rectangle it does not overlap. If that extra check were always needed, the algorithm would have an additional m - 1 comparisons to do. Fortunately, m + m + n - 1 = O(m + n) even if we always had to check one extra rectangle (which we don't).
The reason for the MemoryError is the huge size of the dictionary being created. In the worst case, the dict can have 10^10 keys, which would end up taking all your memory. If there really is a need for it, shelve is a possible solution to make use of such a large dict.
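As an illustration only, here is a minimal sketch of the shelve idea (Python 3; the filename and sample input are mine): it swaps the in-memory dict in your loop for a disk-backed one, trading speed for bounded memory:

import shelve

buildings = [(3, -3, 0), (2, -1, 1), (4, 2, 4), (2, 3, 7), (3, 6, 8)]  # (H, L, R)
with shelve.open('items.db') as items:  # disk-backed dict-like object
    for H, L, R in buildings:
        for k in range(R - L):
            key = str(L + k)  # shelve keys must be strings
            if key not in items or H > items[key]:
                items[key] = H
    print(sum(items.values()))  # 29 for your sample input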
A better approach is to keep a sorted list of (x, height) breakpoints, one entry per change in height, instead of one entry per x coordinate. Say there is a building given by 10 0 100 and another by 20 50 150; then that list might have info like [(-10^9, 0), (0, 10), (50, 20), (150, 0), (10^9, 0)]. As you come across more buildings, you add more entries to this list. This will be O(n^2) in time, but only O(n) in memory.
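A minimal sketch of that idea, assuming sentinel breakpoints at +/-10^9 as above (my own illustration, so treat it as a starting point rather than a finished solution):

def add_building(bps, h, l, r):
    # bps is a list of (x, height) pairs sorted by x; each height holds
    # from its x up to the next breakpoint.
    out = []
    prev_h = 0  # height in effect just before the current breakpoint
    for x, bh in bps:
        if out and out[-1][0] < l < x:
            out.append((l, max(prev_h, h)))  # building starts between breakpoints
        if out and out[-1][0] < r < x:
            out.append((r, prev_h))          # building ends between breakpoints
        if l <= x < r:
            out.append((x, max(bh, h)))      # raise breakpoints the building covers
        else:
            out.append((x, bh))
        prev_h = bh
    return out

def total_area(bps):
    return sum(h * (x2 - x1) for (x1, h), (x2, _) in zip(bps, bps[1:]))

bps = [(-10**9, 0), (10**9, 0)]
for h, l, r in [(3, -3, 0), (2, -1, 1), (4, 2, 4), (2, 3, 7), (3, 6, 8)]:
    bps = add_building(bps, h, l, r)
print(total_area(bps))  # 29

It can leave redundant breakpoints behind (consecutive entries with the same height), but those do not affect the area, and memory stays proportional to the number of buildings rather than the width of the skyline.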
Related
I have a space of 23x23x30, and each 1x1x1 cube represents a point. Some of these 23x23x30 points are populated with numbers ranging from -65 to -45. I want to make sure that there is no more than 1 number in any given region of 5x5x30 around a populated point; if there are multiple points in any region of 5x5x30, the points with the smallest numbers should be eliminated.

I have done this in a serial way using nested for loops, but that's a very expensive operation, so I would like to parallelize it. I have n cores, and each core has its own subregion of the total 23x23x30 region, without any overlap. I can collect those subregions and construct the full 23x23x30 region mentioned above, so that all cores can access the full 23x23x30 region while also having their own subregion. I am not sure if there are any libraries available for this kind of operation in Python.

In my application, 8 processes fill up this 23x23x30 space with about 3500 points. Right now I'm doing this 'filtering' operation on all 8 processes (i.e. duplicating the work), which is a waste of resources, so I will have to do this 'filtering' in parallel in order to use the available resources efficiently.
Here is the serial code. self.tntatv_stdp_ids is a dictionary with keys step1, step2, ... up to 30 steps in dimension z. Each key holds the numbers (1 to 529) of the points in that step that are populated. Note that in the serial implementation of the code, points in each step in the z dimension run from 1 to 529.

self.cell_voltages is a dictionary with the same keys step1, step2, ... up to 30 steps in dimension z. Each key gives the numbers present in a point.
a_keys = self.tntatv_stdp_ids.keys()
# Filter tentative neuron ids using suppression algo to come up with final stdp neuron ids.
for i in range(0, len(a_keys)):
    b_keys = list(set(a_keys) - set([a_keys[i]]))
    c_keys = self.tntatv_stdp_ids[a_keys[i]]
    for j in range(0, len(b_keys)):
        d_keys = self.tntatv_stdp_ids[b_keys[j]]
        for k in c_keys[:]:
            key = k
            key_row = key / (image_size - kernel + 1)
            key_col = key % (image_size - kernel + 1)
            remove = 0
            for l in d_keys[:]:
                target = l
                tar_row = target / (image_size - kernel + 1)
                tar_col = target % (image_size - kernel + 1)
                if abs(key_row - tar_row) > kernel - 1 and abs(key_col - tar_col) > kernel - 1:
                    pass
                else:
                    if self.cell_voltages[a_keys[i]][key] >= self.cell_voltages[b_keys[j]][target]:
                        d_keys.remove(target)
                    else:
                        remove += 1
            if remove:
                c_keys.remove(key)
At the end of this operation, if there are multiple points left over in the 30 regions of 23x23x1, one final winner point for each of those 30 regions can be selected by seeing which of the remaining populated points of each 23x23x1 region has the highest number. In this way the maximum number of winners is 30 from all of the points in 23x23x30, 1 for each 23x23x1 region. There can be fewer than 30 as well, depending on how many of the 23x23x30 points were populated to start with.
This problem likely doesn't require parallelization:
import random

# Generate a random array of appropriate size for testing
super_array = [[[None for _ in range(30)] for _ in range(529)] for _ in range(529)]
for _ in range(3500):
    super_array[random.randint(0, 528)][random.randint(0, 528)][random.randint(0, 29)] = random.randint(-65, -45)
The first step is building a list of filled nodes:
filled = []
for x in range(len(super_array)):
    for y in range(len(super_array[0])):
        for z in range(len(super_array[0][0])):
            if super_array[x][y][z] is not None:
                filled.append((x, y, z, super_array[x][y][z]))
Then, sort the list from high to low:
sfill = sorted(filled, key=lambda x: x[3], reverse=True)
Now, generate a blocking grid:
block_array = [[None for _ in range(529)] for _ in range(529)]
And traverse the list, blocking off neighborhoods as you find nodes and deleting nodes in an already occupied neighborhood:
for node in sfill:
    x, y, z, _ = node
    if block_array[x][y] is not None:
        super_array[x][y][z] = None  # kill node if it's in the neighborhood of a larger node
    else:  # Block their neighborhood
        for dx in range(5):
            for dy in range(5):
                cx = x + dx - 2
                cy = y + dy - 2
                if 529 > cx >= 0 and 529 > cy >= 0:
                    block_array[cx][cy] = True
Some notes:
This uses a sliding neighborhood, so it checks a 5x5 centered on each node. Doing the check from highest to lowest is important, as that ensures a node which is removed hasn't previously forced a different node to be removed.
You could do this even more efficiently by using ranges instead of a full 529x529 array, but the neighborhood blocking takes less than a second, and the full process, from generated array to pruned final list, is 1.2 seconds.
Building of a filled nodes list could be improved by only adding the highest value node within any z stack. This will reduce the size of the list which must be sorted if a significant number of nodes end up with the same x,y values.
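A sketch of that last suggestion (my own, untested against the timings below): keep only the best node per (x, y) stack while scanning, since the per-stack losers can never survive the blocking pass anyway:

best = {}  # (x, y) -> (x, y, z, value) with the largest value in that stack
for x in range(len(super_array)):
    for y in range(len(super_array[0])):
        for z in range(len(super_array[0][0])):
            v = super_array[x][y][z]
            if v is not None and ((x, y) not in best or v > best[(x, y)][3]):
                best[(x, y)] = (x, y, z, v)
filled = list(best.values())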
On a 23x23x30 grid, it takes ~18 ms, again including the time to build the 3D array:
>>> timeit.timeit(prune_test, number=1000)
17.61786985397339
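prune_test is not shown above; one plausible reconstruction (my guess, including the fill count of 100), wrapping the same steps at the 23x23x30 size, would be:

import random

def prune_test():
    nx, ny, nz = 23, 23, 30
    grid = [[[None for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]
    for _ in range(100):  # sparse fill for the small grid
        grid[random.randrange(nx)][random.randrange(ny)][random.randrange(nz)] = random.randint(-65, -45)
    filled = [(x, y, z, grid[x][y][z])
              for x in range(nx) for y in range(ny) for z in range(nz)
              if grid[x][y][z] is not None]
    sfill = sorted(filled, key=lambda n: n[3], reverse=True)
    block_array = [[None for _ in range(ny)] for _ in range(nx)]
    for x, y, z, _ in sfill:
        if block_array[x][y] is not None:
            grid[x][y][z] = None  # suppressed by a larger neighbor
        else:
            for dx in range(-2, 3):
                for dy in range(-2, 3):
                    if 0 <= x + dx < nx and 0 <= y + dy < ny:
                        block_array[x + dx][y + dy] = True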
The project I'm working on involves reading and analyzing a huge data set (Illustris 1 Dark, about 4,000,000 dark matter halos). In order to get better results, I want to impose an isolation criterion as follows:
Only keep those halos that are the biggest in the 2D circle of radius 300kpc, and get rid of the other halos in that circle
Now, the current implementation I have runs in O(n^2) time, which means the code could take days to finish. I really want to do better but can't figure out how. This is what I have so far:
Function for returning a list of neighbors of Group1
def neighbors(Group1, Radius):
    Neighbors = []
    for Group2 in Groups:
        if Distance_2D(Group1, Group2) < Radius:
            Neighbors.append(Group2)
    return Neighbors
Function for returning the biggest group given a list of neighbors
def biggest(Neighbors):
    Biggest = Neighbors[0]
    for N in Neighbors:
        if N.mass > Biggest.mass:
            Biggest = N
    return Biggest
Putting it all together
for Group in Groups:
    Neighbors = neighbors(Group, 300)
    if not Group == biggest(Neighbors):
        Groups.remove(Group)
    else:
        for N in Neighbors:
            Groups.remove(N)
        Groups.append(Group)
After the for-loop, Groups should be a list of halos that are the largest within their 300kpc radius.
I also know that removing something from a list while iterating over the same list is not good practice, so if your hint/answer takes care of that, that would be great!
Thank you all in advance :)
The halo with the largest mass will be in the result. Halos that are closer than the radius to the largest halo will not be. With that, an outline of the algorithm is:
result = []
for h in sorted(halos, key=lambda h: h.mass, reverse=True):
    if all(distance(h.position, x.position) > 300 for x in result):
        result.append(h)
The distance checking is the tricky part. If the expected size of the result list is small, then each halo is checked against that small list; that algorithm has complexity O(n*log(n) + n*|result set|). If the expected size of the result list is large, a space partition can help, like so:
result = []
sp = SpacePartition()  # Some space partition
for h in sorted(halos, key=lambda h: h.mass, reverse=True):
    if not sp.has_point_close(h.position, 300):
        result.append(h)
        sp.add(h.position)
That algorithm has complexity O(n*log(n)), since checking and adding to the partition are O(log(n)).
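SpacePartition, has_point_close and add are placeholder names rather than a real library. As an illustration only, here is one simple realization using a uniform hash grid; a k-d tree or similar structure would give the log(n) bounds quoted above:

import math

class SpacePartition(object):
    # Uniform grid; cell size must be >= the query radius so that the
    # 3x3 block of cells around a point contains every candidate.
    def __init__(self, cell=300.0):
        self.cell = cell
        self.grid = {}  # (i, j) -> list of positions

    def _key(self, pos):
        return (int(pos[0] // self.cell), int(pos[1] // self.cell))

    def add(self, pos):
        self.grid.setdefault(self._key(pos), []).append(pos)

    def has_point_close(self, pos, radius):
        i, j = self._key(pos)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for x, y in self.grid.get((i + di, j + dj), ()):
                    if math.hypot(x - pos[0], y - pos[1]) <= radius:
                        return True
        return False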
I am trying to iterate over two lists of the same length and, for each pair of entries per index, execute a function. The function aims to cluster the entries
according to some requirement X on the value the function returns.
The lists in question are:
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]
and the function takes 4 arguments: every combination of entries from e_list and p_list.
The function is defined as
def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp
I have so far been able to loop over the lists simultaneously as:
for index, (eta, phi) in enumerate(zip(e_list, p_list)):
    for index2, (eta2, phi2) in enumerate(zip(e_list, p_list)):
        if index == index2: continue  # to avoid same indices
        if deltaR(eta, phi, eta2, phi2) < X:
            print (index, index2), deltaR(eta, phi, eta2, phi2)
This loop executes the function on every combination, except those with identical indices, i.e. (0, 0) or (1, 1), etc.
The output of the code returns:
(0, 1) 0.659449892453
(1, 0) 0.659449892453
(2, 3) 0.657024790285
(2, 4) 0.642297230697
(3, 2) 0.657024790285
(3, 4) 0.109675332432
(4, 2) 0.642297230697
(4, 3) 0.109675332432
I am trying to return the number of indices that are all matched following the condition above. In other words, to rearrange the output to:
output = [No. matched entries]
i.e.
output = [2, 3]
2 coming from the fact that indices 0 and 1 are matched
3 coming from the fact that indices 2, 3, and 4 are all matched
A possible way I have thought of is to append all the indices used to a list, such that I return
output_list = [0, 1, 1, 0, 2, 3, 4, 3, 2, 4, 4, 2, 3]
Then, I use a defaultdict to count the occurrences:
from collections import defaultdict

hits = defaultdict(int)
for index in output_list:
    hits[index] += 1
From the dict I can manipulate it to return [2, 3], but is there a more Pythonic way of achieving this?
This is finding connected components of a graph, which is very easy and well documented, once you revisit the problem from that view.
The data being in two lists is a distraction. I am going to consider the data to be zip(e_list, p_list). Consider this as a graph, which in this case has 5 nodes (but could have many more on a different data set). Construct the graph using these nodes, and connect them with an edge if they pass your distance test.
From there, you only need to determine the connected components of an undirected graph, which is covered in many, many places. Here is a basic depth first search on this site: Find connected components in a graph
You loop through the nodes once, performing a DFS to find all connected nodes. Once you look at a node, mark it visited, so it does not get counted again. To get the answer in the format you want, simply count the number of unvisited nodes found from each unvisited starting point, and append that to a list.
------------------------ graph theory ----------------------
You have data points that you want to break down into related groups. This is a topic in both mathematics and computer science known as graph theory. see: https://en.wikipedia.org/wiki/Graph_theory
You have data points. Imagine drawing them in eta phi space as rectangular coordinates, and then draw lines between the points that are close to each other. You now have a "graph" with vertices and edges.
Determining which of these dots have lines between them is finding connected components. Obviously it's easy to see by eye, but if you have thousands of points and you want a computer to find the connected components quickly, you use graph theory.
Suppose I make a list of all the eta phi points with zip(e_list, p_list), and each entry in the list is a vertex. If you store the graph in "adjacency list" format, then each vertex will also have a list of the outgoing edges which connect it to another vertex.
Finding a connected component is literally as easy as looking at each vertex, putting a checkmark by it, and then following every line to the next vertex and putting a checkmark there, until you can't find anything else connected. Now find the next vertex without a checkmark, and repeat for the next connected component.
As a programmer, you know that writing your own data structures for common problems is a bad idea when you can use published and reviewed code to handle the task. Google "python graph module". One example mentioned in comments is "pip install networkx". If you build the graph in networkx, you can get the connected components as a list of lists, then take the len of each to get the format you want: [len(_) for _ in nx.connected_components(G)]
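For example (a sketch assuming networkx is installed; X and deltaR are as defined in the question):

import itertools
import networkx as nx

points = list(zip(e_list, p_list))
G = nx.Graph()
G.add_nodes_from(range(len(points)))
# Add an edge between every pair of points that passes the distance test
for (i, (e1, p1)), (j, (e2, p2)) in itertools.combinations(enumerate(points), 2):
    if deltaR(e1, p1, e2, p2) < X:
        G.add_edge(i, j)
print([len(c) for c in nx.connected_components(G)])  # e.g. [2, 3]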
---------------- code -------------------
But if you don't understand the math, then you might not understand a graph module, nor a base Python implementation; still, it's pretty easy if you just look at some of those links. Basically dots and lines, but pretty useful once you apply the concepts, as you can see with your problem being nothing but a very simple graph theory problem in disguise.
My graph is a basic list here, so the vertices don't actually have names. They are identified by their list index.
e_list = [-0.619489, -0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836, -1.14238, 1.73884, 1.94904, 1.84]

def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp

X = 1  # you never actually said, but this works

def these_two_particles_are_going_the_same_direction(p1, p2):
    return deltaR(p1.eta, p1.phi, p2.eta, p2.phi) < X

class Vertex(object):
    def __init__(self, eta, phi):
        self.eta = eta
        self.phi = phi
        self.connected = []
        self.visited = False

class Graph(object):
    def __init__(self, e_list, p_list):
        self.vertices = []
        for eta, phi in zip(e_list, p_list):
            self.add_node(eta, phi)

    def add_node(self, eta, phi):
        # add this data point at the next available index
        n = len(self.vertices)
        a = Vertex(eta, phi)
        for i, b in enumerate(self.vertices):
            if these_two_particles_are_going_the_same_direction(a, b):
                b.connected.append(n)
                a.connected.append(i)
        self.vertices.append(a)

    def reset_visited(self):
        for v in self.vertices:
            v.visited = False

    def DFS(self, n):
        # perform depth first search from node n, return count of connected vertices
        count = 0
        v = self.vertices[n]
        if not v.visited:
            v.visited = True
            count += 1
            for i in v.connected:
                count += self.DFS(i)
        return count

    def connected_components(self):
        self.reset_visited()
        components = []
        for i, v in enumerate(self.vertices):
            if not v.visited:
                components.append(self.DFS(i))
        return components

g = Graph(e_list, p_list)
print(g.connected_components())
I have a list of points that looks like this:
points = [(54592748,54593510),(54592745,54593512), ...]
Many of these points are similar, in the sense that points[n][0] is almost equal to points[m][0] AND points[n][1] is almost equal to points[m][1], where 'almost equal' means within whatever integer tolerance I decide.
I would like to filter out all the similar points from the list, keeping just one of it.
Here is my code.
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10  # max distance allowed between two points
to_compare = points[:]  # make a list of items to compare
to_remove = set()  # keep track of items to be removed
for point in points:
    to_compare.remove(point)  # do not compare with itself
    for other_point in to_compare:
        if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
            to_remove.add(other_point)
for point in to_remove:
    points.remove(point)
It works...
>>> points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]
but I am looking for a faster solution, since my list is millions of items long.
PyPy helped a lot, it sped the whole process up 6 times, but there is probably a more efficient way to do this in the first place, or not?
Any help is very welcome.
=======
UPDATE
I have tested some of the answers with the points object you can pickle.load() from here https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs
My code takes 1104 seconds and reduces the list to 96428 points (from 99920).
David's code does the job in 14 seconds! But it misses something, 96431 points left.
Martin's code takes 0.06 seconds!! But also misses something, 96462 points left.
Any clue about why the results are not the same?
Depending on how accurate you need this to be, the following approach should work well:
points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648), (1354358, 1619520), (54592746, 54593509)]

d = 20
hpoints = {((x - (x % d)), (y - (y % d))): (x, y) for x, y in points}

for x in hpoints.itervalues():
    print x
This converts each point into a dictionary key, with each x and y coordinate rounded down by the modulus. The result is a dictionary holding the coordinates of the last point seen in a given area. For the data you have given, this would display the following:
(117628626, 117630648)
(54592746, 54593509)
(1354358, 1619520)
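Note that itervalues is Python 2; on Python 3 the last two lines would be:

for x in hpoints.values():
    print(x)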
Sorting the list first avoids the inner for loop and thus the n^2 time. I'm not sure if it's practically any quicker, though, since I don't have your full data. Try this (it outputs the same as far as I can see from your example points, just ordered):
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10  # max distance allowed between two points

points.sort()
to_remove = set()  # keep track of items to be removed

for i, point in enumerate(points):
    if i == len(points) - 1:
        break
    other_point = points[i+1]
    if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
        to_remove.add(point)

for point in to_remove:
    points.remove(point)

print(points)
This function for getting unique items from a list (it isn't mine, I found it a while back) only loops over the list once (plus dictionary lookups).
def unique(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
The id function will require some cleverness. point[0] is divided by error and floored to an integer. So all point[0]'s such that x*error <= point[0] < (x+1)*error are the same and similarly for point[1].
def id(point):
    error = 4
    x = point[0] // error
    y = point[1] // error
    idValue = str(x) + "//" + str(y)
    return idValue
So these functions will reduce points between consecutive multiples of error to the same point. The good news is that it only touches the original list once, plus the dictionary lookups. The bad news is that this id function won't catch, for example, that 15 and 17 should be the same, because 15 reduces to 3 while 17 reduces to 4. It is possible that with some cleverness this issue could be resolved (see the sketch below).
[NOTE: I originally used exponents of primes for the idValue, but the exponents would be way too large. If you could make the idValue an int, that would increase lookup speed.]
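As a sketch of that "cleverness" (my own suggestion, not part of the original function): instead of relying on two nearby points landing in the same cell, check the 3x3 block of neighboring cells around each point's cell. With error set to the maximum distance md, any two points within md of each other are guaranteed to land in the same or adjacent cells:

def unique_within(points, error):
    seen = {}  # (cell_x, cell_y) -> kept point
    result = []
    for p in points:
        cx, cy = p[0] // error, p[1] // error
        if any((cx + dx, cy + dy) in seen
               for dx in (-1, 0, 1) for dy in (-1, 0, 1)):
            continue  # a kept point is (or may be) within range
        seen[(cx, cy)] = p
        result.append(p)
    return result

This is still approximate, since it can suppress points up to nearly 2*error apart, but it never misses a pair within error, and it keeps the single pass plus dictionary lookups.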
I was wondering how the flow of this recursive algorithm works: an inversion counter based on merge sort. When I looked at diagrams of the merge-sort recursion tree, it seemed fairly lucid; I thought that the leaves would keep splitting until each leaf was a single unit, then merge() would start combining them, and therefore start 'moving back up' the tree, so to speak.
But in the code below, if we print out this function with a given array, print(sortAndCount(test_case)), then we're actually getting our 'final' output from the merge() function, not the return statement in sortAndCount()? So in the code below, I thought that sortAndCount() would call itself over and over in (invCountA, A) = sortAndCount(anArray[:halfN]) until reaching the base case, and then move on to processing the other half of the array, but now that seems incorrect. Can someone correct my understanding of this recursive flow? (N.b. I truncated some of the code for the merge() method, since I'm only interested in the recursive process.)
def sortAndCount(anArray):
    N = len(anArray)
    halfN = N // 2
    # base case:
    if N == 1: return (0, anArray)
    (invCountA, A) = sortAndCount(anArray[:halfN])
    (invCountB, B) = sortAndCount(anArray[halfN:])
    (invCountCross, anArray) = merge(A, B)
    return (invCountA + invCountB + invCountCross, anArray)
def merge(listA, listB):
    counter = 0
    i, j = 0, 0
    # some additional code...
    # ...
    # ...
    # If all items in one array have been selected,
    # we just return remaining values from other array:
    if (i == Asize):
        return (counter, output_array + listB[j:])
    else:
        return (counter, output_array + listA[i:])
The following image, created using rcviz, shows the order of the recursive calls. As explained in the documentation, the edges are numbered by the order in which they were traversed by the execution, and they are colored from black to grey to indicate that order: black edges first, grey edges last.
So if we follow the steps closely, we see that we first traverse the left half of the original array completely, then the right.
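To watch that order for yourself without rcviz, you can instrument the function with prints. The merge below is a standard inversion-counting merge, filled in only because the question truncated it, so it may differ from the asker's version:

def merge(listA, listB):
    counter, i, j = 0, 0, 0
    output_array = []
    while i < len(listA) and j < len(listB):
        if listA[i] <= listB[j]:
            output_array.append(listA[i])
            i += 1
        else:
            counter += len(listA) - i  # listB[j] precedes every remaining item in listA
            output_array.append(listB[j])
            j += 1
    if i == len(listA):
        return (counter, output_array + listB[j:])
    else:
        return (counter, output_array + listA[i:])

def sortAndCount(anArray, depth=0):
    print('  ' * depth + 'enter  %s' % anArray)
    N = len(anArray)
    if N == 1:
        return (0, anArray)
    halfN = N // 2
    (invCountA, A) = sortAndCount(anArray[:halfN], depth + 1)  # left half first
    (invCountB, B) = sortAndCount(anArray[halfN:], depth + 1)  # then right half
    (invCountCross, anArray) = merge(A, B)
    print('  ' * depth + 'merged %s' % anArray)
    return (invCountA + invCountB + invCountCross, anArray)

print(sortAndCount([3, 1, 2]))  # prints the call order, then (2, [1, 2, 3])

The indentation of each printed line is the recursion depth; all the 'enter' lines for the left half appear before any for the right half, and the 'merged' lines show results propagating back up the tree.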