Finding range overlap between at least three ranges in Python

I have a list of tuples, which I am using to mark the lower and upper bounds of ranges. For example:
[(3,10), (4,11), (2,6), (8,11), (9,11)] # 5 separate ranges.
I want to find the ranges where three or more of the input ranges overlap. For instance the tuples listed above would return:
[(4,6), (8,11)]
I tried the method provided by @WolframH in answer to this post.
But I don't know what I can do to:
Give me more than one output range
Set a threshold of at least three range overlaps to qualify an output

You first have to find all combinations of ranges. Then you can transform them to sets and calculate the intersection:
import itertools

limits = [(3, 10), (4, 11), (2, 6), (8, 11), (9, 11)]
ranges = [range(*lim) for lim in limits]
results = []
for comb in itertools.combinations(ranges, 3):
    intersection = set(comb[0]).intersection(comb[1])
    intersection = intersection.intersection(comb[2])
    if intersection and intersection not in results and \
            not any(map(intersection.issubset, results)):
        # drop previous results that this intersection makes redundant
        results = [res for res in results if not intersection.issuperset(res)]
        results.append(intersection)
result_limits = [(min(res), max(res) + 1) for res in results]
It should give you all maximal 3-wise intersections.

You can, of course, solve this by brute-force checking all the combinations if you want. If you need this algorithm to scale, though, you can do it in (pseudo) O(n log n). You can technically come up with a degenerate worst case that's O(n**2), but whatchagonnado.
Basically, you sort the ranges, then for a given range you look to its immediate left to see that the bounds overlap, and if so you then look right to mark overlapping intervals as results. Pseudocode (which is actually almost valid python, look at that):
ranges.sort()
for left_range, current_range, right_range in sliding_window(ranges, 3):
    if left_range.right < current_range.left:
        continue
    while right_range.left < min(left_range.right, current_range.right):
        results.append(overlap(left_range, right_range))
        right_range = right_range.next
    # Before moving on to the next node, extend the current_range's right bound
    # to be the longer of (left_range.right, current_range.right).
    # This makes sense if you think about it.
    current_range.right = max(left_range.right, current_range.right)
merge_overlapping(results)
(You also need to merge some possibly-overlapping ranges at the end; this is another O(n log n) operation, though n will usually be much smaller there. I won't discuss the code for that, but it's similar in approach to the above, involving a sort-then-merge. See here for an example.)
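For the threshold-of-three requirement specifically, a compact alternative is an event-based sweep: count how many input ranges cover each boundary point and report the stretches where the count reaches 3. A minimal sketch, assuming half-open (lo, hi) bounds as in the question:

import itertools

def overlaps_of_at_least(limits, k=3):
    # +1 event at each range start, -1 at each end; ends sort before
    # starts at the same position, which matches half-open intervals
    events = sorted(itertools.chain(
        ((lo, +1) for lo, hi in limits),
        ((hi, -1) for lo, hi in limits)))
    results, count, start = [], 0, None
    for pos, delta in events:
        count += delta
        if count >= k and start is None:
            start = pos                    # entered a region covered k+ times
        elif count < k and start is not None:
            results.append((start, pos))   # left that region
            start = None
    return results

print(overlaps_of_at_least([(3, 10), (4, 11), (2, 6), (8, 11), (9, 11)]))
# [(4, 6), (8, 11)]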

Related

Is there a more efficient and robust way to create a minimum proximity algorithm for a distance matrix?

I am trying to make an algorithm that propagates from point to point in a distance matrix using the smallest distance in the proximity. The code has two conditions: the minimum distance must be greater than 0, and each point must be visited once before returning to the starting position.
This is my code in its entirety:
def totalDistance(aList):
    path = []
    for j in range(0, len(aList)):
        k = j
        order = []
        for l in range(0, len(aList)):
            order.append(k)
            initval = min(x for x in aList[k] if x > 0)
            k = aList[k].index(initval)
            for s in range(0, len(aList)):
                for t in range(0, len(aList[s])):
                    aList[s][k] = 0
        path.append(order)
    return path
The code is meant to return the indexes of the points within the closest proximity of the evaluated point. Here aList = [[0,3,4,6],[3,0,7,3],[4,7,0,9],[6,3,9,0]] represents the distance matrix.
When running the code, I get the following error:
initval= min(x for x in aList[k] if x > 0 )
ValueError: min() arg is an empty sequence
I presume that when I zero out the columns in my distance matrix with the following loop:

for s in range(0, len(aList)):
    for t in range(0, len(aList[s])):
        aList[s][k] = 0

the min() function is unable to find a value matching the given conditions. Is there a better way to structure my code so that this does not occur, or a better approach to this problem altogether?
Here is one technique, and a pointer on the rest that you say is working...
For preventing re-visiting / backtracking. One of the common design patterns for this is to keep a separate data structure to "mark" the places you've been. Because your points are numerically indexed, you could use a list of booleans, but I think it is much easier to just keep a set of the places you've been. Something like this...
visited = set()  # places already seen
# If I decide to visit point/index "3"...
visited.add(3)
It's not really a great practice to modify your input data as you are doing, especially while you are looping over it, which you are... that leads to headaches.
So then... Your current error occurs because when you screen the rows for x > 0 you eventually get an empty sequence (you keep zeroing values out), and min() chokes on it. Part of the above can fix that: you don't need to zero-ize, just mark the points as visited.
Then, the obvious question: how do you use the marks? Just make them part of your search. This works well with the enumerate command, which can return both the index and the value.
Try something like this, which will make a list of "eligible" tuples with the distance and index location.
pts_to_consider = [(dist, idx) for idx, dist in enumerate(aList[k])
                   if dist > 0
                   and idx not in visited]
There are other ways to do this with numpy and other things, but this is a reasonable approach and close to what you have in code now. Comment back if stuck. I don't want to give away the whole farm because this is probably H/W. Perhaps you can use some of the hints here.
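To show how the pieces fit together, here is a minimal nearest-neighbour sketch built on the visited set and the eligible-points idea above. The helper name nearest_neighbour_order is hypothetical, not the asker's totalDistance:

def nearest_neighbour_order(matrix, start=0):
    visited = {start}          # places already seen
    order = [start]
    k = start
    while len(visited) < len(matrix):
        # eligible (distance, index) pairs: positive distance, not yet visited
        dist, k = min((d, i) for i, d in enumerate(matrix[k])
                      if d > 0 and i not in visited)
        visited.add(k)
        order.append(k)
    return order

aList = [[0, 3, 4, 6], [3, 0, 7, 3], [4, 7, 0, 9], [6, 3, 9, 0]]
print(nearest_neighbour_order(aList))  # [0, 1, 3, 2]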

Python: complexity of check for overlapping ranges

I have two ranges and want to check whether they overlap in Python (v3.5). Here are some solutions.
1a: use set intersection with range:
def overlap_intersection_set(range1, range2):
    return bool(set(range1).intersection(range2))

1b: use set intersection with two sets:

def overlap_intersection_two_sets(range1, range2):
    return bool(set(range1).intersection(set(range2)))

2: use any and range in:

def overlap_any(range1, range2):
    return any(i1 in range2 for i1 in range1)
I've been trying to compute the cost for these approaches, mostly in terms of time, but space complexity might also be considerable.
The Python Wiki page "Time Complexity" lists for the set intersections (average case):
Intersection s&t (average case): O(min(len(s), len(t))) (replace "min" with "max" if t is not a set)
For solution 1b, I hence assume O(min(len(range1), len(range2))), plus two set creations from ranges. I consider the bool function very cheap.
For solution 1a: O(max(len(range1), len(range2))), plus one set creation from a range.
For solution 2 (any): I have not found much documentation regarding complexities, neither for any nor for range in. For the latter, I assume that a range behaves like a list, which would mean O(n) for each in call, hence resulting in O(n*m) with n=len(range1) and m=len(range2). At the same time, any should lead to a shortcut as soon as a match is found and the set creation can be spared.
My questions thus involve algorithmic complexities as well as their Python-specific implementations:
How expensive is it to convert a range to a set?
How expensive is the bool() function really?
Does in for a range really behave as in a list (O(n))?
What other implementation details are relevant apart from algorithmic complexity?
Ultimately, considering these questions: what is the most efficient way to check for an overlap between two ranges?
This is not easy to evaluate empirically as the actual computation time depends a lot on the properties of the ranges, i.e. how early an overlapping element is found, and their sizes. That is why I am looking for a more analytical explanation.
Don't do that. Instead:
Arrange every range as lowest-to-highest.
if range1.lowest > range2.lowest then swap range1 with range2
If range1.highest > range2.lowest then ranges intersect
If range1.highest == range2.lowest then ranges touch
If range1.highest < range2.lowest then ranges are distinct.
The above algorithm is independent of the sizes of the ranges and can handle non-integer ranges too.
Something like:
def is_overlapped(r1, r2):
    if r1.lowest > r2.lowest:
        r1, r2 = r2, r1
    return r1.highest > r2.lowest
A fuller implementation:
from collections import namedtuple

class Range(namedtuple('Range', 'lowest, highest')):
    __slots__ = ()
    def __new__(_cls, lowest, highest):
        'Enforces lowest <= highest'
        if lowest > highest:
            lowest, highest = highest, lowest
        return super().__new__(_cls, lowest, highest)

def is_overlapped(r1, r2):
    r1, r2 = sorted([r1, r2])
    return r1.highest > r2.lowest

if __name__ == '__main__':
    range1, range2 = Range(4, -4), Range(7, 3)
    assert is_overlapped(range2, range1) == is_overlapped(range1, range2)
    print(is_overlapped(range2, range1))  # True
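The same endpoint comparison can be applied directly to built-in range objects, which are half-open. A sketch, assuming step == 1 (the strict < is the correct test for half-open bounds):

def ranges_overlap(r1, r2):
    # O(1): compare endpoints only, never materialize the ranges
    return r1.start < r2.stop and r2.start < r1.stop

print(ranges_overlap(range(1, 5), range(4, 9)))   # True
print(ranges_overlap(range(1, 5), range(5, 9)))   # False (they only touch)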

Sorting points on multiple lines

Given that we have two lines on a graph (I just noticed that I inverted the numbers on the Y axis; this was a mistake, it should go from 11 to 1), and we only care about whole-number X axis intersections, we need to order these points from highest Y value to lowest Y value, regardless of their position on the X axis. (Note: I did these pictures by hand, so they may not line up perfectly.)
I have a couple of questions:
1) I have to assume this is a known problem, but does it have a particular name?
2) Is there a known optimal solution when dealing with tens of billions (or hundreds of millions) of lines? Our current process of manually calculating each point and then comparing it to a giant list requires hours of processing. Even though we may have a hundred million lines, we typically only want the top 100 or 50,000 results; some of the lines are so far "below" other lines that calculating their points is unnecessary.
Your data structure is a set of tuples
lines = {(y0, Δy0), (y1, Δy1), ...}
You need only the ntop points, hence build a set containing only the top ntop yi values, with a single pass over the data:
top_points = choose(lines, ntop)
EDIT --- to choose the ntop points we had to keep track of the smallest one kept; that is interesting info, so let's return this value from choose as well. We also need to initialize decremented:

top_points, smallest = choose(lines, ntop)
decremented = top_points

and start a loop...

while True:
    # generate a set of decremented values, dropping lines
    # that have already fallen below the smallest kept point
    decremented = {(y - Δy, Δy) for y, Δy in decremented if y > smallest}
    if not decremented:
        break
    # generate a set of candidates
    candidates = top_points.union(decremented)
    # choose a new set of top points
    top_points, smallest = choose(candidates, ntop)

(The earlier check new_top_points == top_points is no longer necessary; the y > smallest filter terminates the loop on its own.)
The difficult part is the choose function, but I think that this answer to the question "How can I sort 1 million numbers, and only print the top 10 in Python?" could help you.
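For completeness, a possible choose under this answer's assumptions (lines stored as (y, Δy) tuples, so tuples sort by y first); heapq.nlargest does the top-k selection in one pass:

import heapq

def choose(points, ntop):
    # keep the ntop tuples with the largest y, plus the smallest kept y
    top = heapq.nlargest(ntop, points)
    smallest = top[-1][0] if top else float('-inf')
    return set(top), smallest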
It's not a really complicated thing, just a "normal" sorting problem.
Usually sorting requires a large amount of computing time. But your case is one where you don't need to use complex sorting techniques.
Your lines are each growing or falling constantly; there are no "jumps". You can use this to your advantage. The basic algorithm (a sketch follows the list):
identify if a graph is growing or falling.
write a generator that generates the values: from left to right if rising, from right to left if falling.
get the first value from both graphs
insert the lower one into the result list
get a new value from the graph that had the lower value
repeat the last two steps until one generator is "empty"
append the leftover items from the other generator.
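A sketch of that merge using heapq.merge, emitting points in descending y. It assumes each line is a list of (x, y) points that is monotonic in y, which is what lets a single reversal orient every generator the same way:

import heapq

def merge_lines_descending(lines):
    def oriented(points):
        # rising line: reverse so we emit from the high-y end first
        return points if points[0][1] >= points[-1][1] else points[::-1]
    # negate y so heapq.merge's ascending order yields descending y
    streams = [((-y, x) for x, y in oriented(line)) for line in lines]
    for neg_y, x in heapq.merge(*streams):
        yield x, -neg_y

line1 = [(x, 11 - x) for x in range(1, 11)]   # falling
line2 = [(x, x) for x in range(1, 11)]        # rising
print(list(merge_lines_descending([line1, line2]))[:4])
# [(1, 10), (10, 10), (2, 9), (9, 9)]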

Mapping a range of integers to a single integer

I have a function which receives an integer as an input and, depending on what range this input lies in, assigns to it a difficulty value. I know that this can be done with a chain of if/else statements. I was wondering whether there is a more efficient/cleaner way to do it.
I tried to do something like this
TIME_RATING_KEY = {
    range(0, 46): 1,
    range(46, 91): 2,
    range(91, 136): 3,
    range(136, 201): 4,
    range(201, 10800): 5,
}
But I found out that we can't use a range as a key in a dict like that (right?). So is there a better way to do this?
You can implement an interval tree. This kind of data structure is able to return all the intervals that intersect a given input point.
In your case the intervals don't overlap, so a query would always return exactly one interval.
Centered interval trees run in O(log n + m) time, where m is the number of intervals returned (1 in your case). So this would reduce the complexity from O(n) to O(log n).
The idea of these interval trees is the following:
You consider the interval that encloses all the intervals you have
Take the center of that interval and partition the given intervals into those that end before that point, those that contain that point and those that start after it.
Recursively construct the same kind of tree for the intervals ending before the center and those starting after it
Keep the intervals that contain the center point in two sorted sequences. One sorted by starting point, and the other sorted by ending point
When searching go left or right depending on the center point. When you find an overlap you use binary search on the sorted sequence you want to check (this allows for looking up not only intervals that contain a given point but intervals that intersect or contain a given interval).
It's trivial to modify the data structure to return a specific value instead of the found interval.
This said, from the context I don't think you actually need to speed up this lookup, and you should probably use the simpler and more readable solution, since it would be more maintainable and there are fewer chances to make mistakes.
However, reading about the more efficient data structure can turn out useful in the future.
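Since the ranges here are contiguous and non-overlapping, you can also get the O(log n) lookup with just a sorted list of boundaries and the standard bisect module, no tree required. A sketch using the bounds from the question:

import bisect

BOUNDARIES = [46, 91, 136, 201]  # upper bounds of the first four ranges

def time_rating(n):
    if not 0 <= n < 10800:
        raise ValueError(n)
    # number of boundaries <= n, shifted to the 1..5 difficulty scale
    return bisect.bisect_right(BOUNDARIES, n) + 1

print(time_rating(32), time_rating(68), time_rating(150), time_rating(250))
# 1 2 4 5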
The simplest way is probably just to write a short function:
def convert(n, difficulties=[0, 46, 91, 136, 201]):
    if n < difficulties[0]:
        raise ValueError
    for difficulty, end in enumerate(difficulties):
        if n < end:
            return difficulty
    else:
        return len(difficulties)
Examples:
>>> convert(32)
1
>>> convert(68)
2
>>> convert(150)
4
>>> convert(250)
5
As a side note: You can use a range as a dictionary key in Python 3.x, but not directly in 2.x (because range returns a list). You could do:
TIME_RATING_KEY = {tuple(range(0, 46)): 1, ...}
However that won't be much help!

Performance of a "fuzzy" Jaccard index implementation

I'm trying to calculate a kind of fuzzy Jaccard index between two sets with the following rationale: as with the Jaccard index, I want to calculate the ratio between the number of items that are common to both sets and the total number of different items in both sets. The problem is that I want to use a similarity function with a threshold to determine what counts as the "same" item being in both sets, so that items that are similar:
Aren't counted twice in the union
Are counted in the intersection.
I have a working implementation here (in Python):

import itertools

def fuzzy_jaccard(set1, set2, similarity, threshold):
    intersection_size = union_size = len(set1 & set2)
    shorter_difference, longer_difference = sorted([set2 - set1, set1 - set2], key=len)
    while len(shorter_difference) > 0:
        item1, item2 = max(
            itertools.product(longer_difference, shorter_difference),
            key=lambda pair: similarity(*pair)
        )
        longer_difference.remove(item1)
        shorter_difference.remove(item2)
        if similarity(item1, item2) > threshold:
            union_size += 1
            intersection_size += 1
        else:
            union_size += 2
    union_size = union_size + len(longer_difference)
    return intersection_size / union_size
The problem here is that this is quadratic in the size of the sets, because in itertools.product I iterate over all possible pairs of items, taking one from each set. Now, I think I must do this because I want to match each item a from set1 with the best possible candidate b from set2 that isn't more similar to another item a' from set1.
I have a feeling that there should be an O(n) way of doing this that I'm not grasping. Do you have any suggestions?
There are other issues too, like recalculating the similarity for each pair once I get the best match, but I don't care too much about them.
I doubt there's any way that would be O(n) in the general case, but you can probably do a lot better than O(n^2) at least for most cases.
Does your similarity satisfy the triangle inequality? By this I mean: can you assume that distance(a, c) <= distance(a, b) + distance(b, c)? If not, this answer probably won't help. I'm treating similarities like distances.
Try clumping the data:
Pick a radius r. Based on intuition, I suggest setting r to one-third of the average of the first 5 similarities you calculate, or something.
The first point you pick in set1 becomes the centre of your first clump. Classify the points in set2 as being in the clump (similarity to the centre point <= r) or outside the clump. Also keep track of points that are within 2r of the clump centre.
You can require that clump centre points be at least a distance of 2r from each other; in that case some points may not be in any clump. I suggest making them at least r from each other. (Maybe less if you're dealing with a large number of dimensions.) You could treat every point as a clump centre but then you wouldn't save any processing time.
When you pick a new point, first compare it with the clump centre points (even though they're in the same set). Either it's in an already existing clump, or it becomes a new clump centre, (or perhaps neither if it's between r and 2r of a clump centre). If it's within r of a clump centre, then compare it with all points in the other set that are within 2r of that clump centre. You may be able to ignore points further than 2r from the clump centre. If you don't find a similar point within the clump (perhaps because the clump has no points left), then you may have to scan all the rest of the points for that case. Hopefully this would mostly happen only when there aren't many points left in the set. If this works well, then in most cases you'd find the most similar point within the clump and would know that it's the most similar point.
This idea may require some tweaking.
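A rough sketch of the clump-building step under this answer's assumptions (the similarity function is used as a distance, centres are kept at least r apart, and every clump remembers which points of the other set fall within r and within 2r of its centre); the helper name build_clumps is hypothetical:

def build_clumps(centre_candidates, other_points, distance, r):
    clumps = []  # (centre, members_within_r, neighbours_within_2r)
    for p in centre_candidates:
        # only promote p to a centre if no existing centre is within r
        if all(distance(p, c) > r for c, _, _ in clumps):
            clumps.append((p, [], []))
    for q in other_points:
        for c, members, neighbours in clumps:
            d = distance(q, c)
            if d <= r:
                members.append(q)
            if d <= 2 * r:
                neighbours.append(q)
    return clumps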
If there are a large number of dimensions involved, then you might find that for a given radius r, frustratingly many points are within 2r of each other while few are within r of each other.
Here's another algorithm. The more time-consuming it is to calculate your similarity function (as compared to the time it takes to maintain sorted lists of points) the more index points you might want to have. If you know the number of dimensions, it might make sense to use that number of index points. You might reject a point as a candidate index point if it's too similar to another index point.
For the first point you use, and any others you decide to use as index points, generate a list of all the remaining points in the other set, sorted in order of distance from the index point.
When you're comparing a point P1 to points in the other set, I think you can skip over points for two possible reasons. Consider the most similar point P2 you've found to P1 so far. If P2 is similar to an index point, then you can skip all points which are sufficiently dissimilar from that index point. If P2 is dissimilar to an index point, then you can skip over all points which are sufficiently similar to that index point. I think in some cases you can skip over some of both types of point for the same index point.
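To make the skip concrete, here is a sketch using bisect on one index point's sorted distance list. By the triangle inequality, only points whose distance to the index lies within best_d of P1's own distance to the index can still beat the current best; everything outside that window is skipped. The helper name candidate_slice is hypothetical:

import bisect

def candidate_slice(sorted_dists, d_p1_to_index, best_d):
    # sorted_dists: distances from the index point to each point of the
    # other set, ascending and parallel to a list of those points
    lo = bisect.bisect_left(sorted_dists, d_p1_to_index - best_d)
    hi = bisect.bisect_right(sorted_dists, d_p1_to_index + best_d)
    return lo, hi  # only points[lo:hi] need their similarity computed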
