Suppose there are two arrays. Every element is a short line segment given as a start position and an end position.
a1 = [[0,1],[3,6],[7,9]]
a2 = [[2,6],[0,1]]
In this example, a1[0] is the same as a2[1], so the overlap length is 1. a1[1] and a2[0] have an overlap length of 3. The total result is 4.
Is there an easy way to achieve this?
You can use itertools.product to generate all interval pairs and then calculate the overlap for each pair. Two intervals overlap if each one starts before the other ends.
import itertools

overlap = 0
for x, y in itertools.product(a1, a2):
    # overlap of [x0, x1] and [y0, y1] is max(0, min(x1, y1) - max(x0, y0))
    max_start = max(x[0], y[0])
    min_end = min(x[1], y[1])
    overlap += max(0, min_end - max_start)
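With the a1 and a2 from the question, this leaves overlap equal to 4, matching the expected total.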
There is an ambiguity in the problem statement: can intervals in the same set overlap each other, and if so, do we double-count the overlap of those intervals with an interval in the other set or not?
Anyway, the brute-force approach takes O(N^2) time, which may be fine depending on how large the sets are. It can be improved to O(N log N) by sorting the two sets by their starting points. If overlapping within the same set is not allowed, you can simply go from left to right, keeping track of the last intervals in each set that overlap each other. If overlapping within the same set is allowed, you can keep a heap of intervals of the first set whose endpoints have not been reached yet, and iterate over the second set (see the heap-based sketch after the code below).
In the case of non-overlapping intervals within the same set, the code will be something like this:
a1 = [[0,1],[3,6],[7,9]]
a2 = [[2,6],[0,1]]
a1.sort(key=lambda x: x[0])
a2.sort(key=lambda x: x[0])
i1 = 0
i2 = 0
overlapping = 0
while i1 < len(a1) and i2 < len(a2):
    # start and end of the overlapping
    start = max(a1[i1][0], a2[i2][0])
    end = min(a1[i1][1], a2[i2][1])
    overlapping += max(0, end - start)
    # move the interval that ends first to the next interval in the same set
    if a1[i1][1] < a2[i2][1]:
        i1 += 1
    else:
        i2 += 1
print(overlapping)
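The answer only sketches the heap-based variant for the overlapping case. Here is a minimal sketch of that idea, assuming the same overlap arithmetic as above (total_overlap is a name chosen here, not from the original):

import heapq

def total_overlap(a1, a2):
    # Sum the pairwise overlap of intervals in a1 and a2, where intervals
    # within the same set may themselves overlap (every pair is counted).
    a1 = sorted(a1)
    a2 = sorted(a2)
    heap = []  # (end, start) of a1 intervals whose endpoints we haven't passed
    total = 0
    i = 0
    for start2, end2 in a2:
        # activate every a1 interval that starts before this a2 interval ends
        while i < len(a1) and a1[i][0] < end2:
            heapq.heappush(heap, (a1[i][1], a1[i][0]))
            i += 1
        # retire a1 intervals that end before this a2 interval starts;
        # they cannot overlap this or any later a2 interval
        while heap and heap[0][0] <= start2:
            heapq.heappop(heap)
        # every remaining active a1 interval may overlap [start2, end2]
        for end1, start1 in heap:
            total += max(0, min(end1, end2) - max(start1, start2))
    return total

For the example arrays, total_overlap(a1, a2) also returns 4. The inner loop only touches currently active intervals, so the cost is driven by the number of actually overlapping pairs.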
Given a sequence of distinct items Sa, we wish to create a sequence Sb (composed of the same items in Sa, but in a different order) such that the sequence S = Sa + Sb (sequence Sb appended immediately after sequence Sa) satisfies the following properties:
The distance (number of positions) between the two occurrences of item I in S is at least some number T for all items I.
If items I and J are within N positions in Sa, then I and J are not within N positions in Sb.
I've been able to program the first stipulation in Python fairly simply. However, the second one is where I struggle. Essentially, I want these two things:
I want the second sequence to have its items "far away enough" from their occurrence in the first sequence.
I don't want neighbors of the first sequence to also be neighbors in the second sequence (with N referring to the distance in which items are considered neighbors).
Here's what I have so far:
import random
clips = list(range(10)) # arbitrary items
choice_pool = clips[:]
Sa = clips[:]
random.shuffle(Sa)
Sb = []
count = len(Sa)
threshold = 0.5*len(clips) # the minimum distance the item has to be away from itself in the concatenated sequence
while len(Sb) != len(Sa):
    jj = random.randint(0, len(choice_pool) - 1)
    # we want clip a1 to be at least threshold away from clip b1
    if count - Sa.index(choice_pool[jj]) >= threshold:
        Sb.append(choice_pool[jj])
        del choice_pool[jj]
        count += 1
print("Sa:", Sa)
print("Sb:", Sb)
print("S :", Sa + Sb)
Do you have any advice on how to also achieve the second stipulation, while always guaranteeing such a sequence exists (not ending up in an infinite loop)? Thanks.
I would take randomness out of the equation. With randomness you are never guaranteed that you won't get stuck in an infinite loop. There are better improvements to this algorithm, but here is the base.
import itertools as it

def filter_criteria(sequence):
    # put your filters here; return True if you find a sequence that works
    pass

def find_sb(sa):
    # wrapped in a function so the returns are valid; the name is arbitrary
    for sb_try in it.permutations(sa, len(sa)):
        if filter_criteria(tuple(sa) + sb_try):  # tuple(sa): permutations yields tuples
            return sb_try
    return "no permutation matches"
I have a space of 23x23x30, and each 1x1x1 cube represents a point. Some of these 23x23x30 points are populated with numbers ranging from -65 to -45. I want to make sure that there is no more than one number in any given 5x5x30 region around a populated point; if there are multiple points in any 5x5x30 region, the points with the smaller numbers should be eliminated.
I have done this in a serial way using nested for loops, but that's a very expensive operation, and I would like to parallelize it. I have n cores, and each core has its own sub-region of the total 23x23x30 region, without any overlap. I can collect those sub-regions and construct the full 23x23x30 region, so that all cores can access the full region while also having their own sub-region. I am not sure if there are any libraries available for this kind of operation in Python.
In my application, 8 processes fill up this 23x23x30 space with about 3500 points. Right now I'm doing this filtering operation on all 8 processes (i.e. duplicating the work), which is a waste of resources, so I will have to do the filtering in parallel in order to use the available resources efficiently.
Here is the serial code. self.tntatv_stdp_ids is a dictionary with keys step1, step2, ... up to 30 steps in the z dimension. Each key holds the numbers (1 to 529) of the points in that step that are populated. Note that in the serial implementation, points in each step of the z dimension are numbered 1 to 529.
self.cell_voltages is a dictionary with the same keys, step1, step2, ... up to 30 steps in the z dimension. Each key gives the number present at each populated point.
a_keys = self.tntatv_stdp_ids.keys()
# Filter tentative neuron ids using the suppression algo to come up with final stdp neuron ids.
for i in range(0, len(a_keys)):
    b_keys = list(set(a_keys) - set([a_keys[i]]))
    c_keys = self.tntatv_stdp_ids[a_keys[i]]
    for j in range(0, len(b_keys)):
        d_keys = self.tntatv_stdp_ids[b_keys[j]]
        for k in c_keys[:]:
            key = k
            key_row = key / (image_size - kernel + 1)  # integer division in Python 2 (// in Python 3)
            key_col = key % (image_size - kernel + 1)
            remove = 0
            for l in d_keys[:]:
                target = l
                tar_row = target / (image_size - kernel + 1)
                tar_col = target % (image_size - kernel + 1)
                if abs(key_row - tar_row) > kernel - 1 and abs(key_col - tar_col) > kernel - 1:
                    pass
                else:
                    if self.cell_voltages[a_keys[i]][key] >= self.cell_voltages[b_keys[j]][target]:
                        d_keys.remove(target)
                    else:
                        remove += 1
            if remove:
                c_keys.remove(key)
At the end of this operation, if there are multiple points left over in the 30 regions of 23x23x1, one final winner for each of those 30 regions can be selected by seeing which of the remaining populated points in each 23x23x1 slice has the highest number. In this way the maximum number of winners is 30 across all of the points in 23x23x30, one for each 23x23x1 slice. There can also be fewer than 30, depending on how many of the 23x23x30 points were populated to start with.
This problem likely doesn't require parallelization:
import random

# Generate a random array of appropriate size for testing
super_array = [[[None for _ in range(30)] for _ in range(529)] for _ in range(529)]
for _ in range(3500):
    super_array[random.randint(0, 528)][random.randint(0, 528)][random.randint(0, 29)] = random.randint(-65, -45)
First step is building a list of filled nodes:
filled = []
for x in range(len(super_array)):
    for y in range(len(super_array[0])):
        for z in range(len(super_array[0][0])):
            if super_array[x][y][z] is not None:
                filled.append((x, y, z, super_array[x][y][z]))
Then, sort list from high to low:
sfill = sorted(filled, key=lambda x: x[3], reverse=True)
Now, generate a blocking grid:
block_array = [[None for _ in range(529)] for _ in range(529)]
And traverse the list, blocking off neighborhoods as you find nodes and deleting nodes in an already occupied neighborhood:
for node in sfill:
    x, y, z, _ = node
    if block_array[x][y] is not None:
        super_array[x][y][z] = None  # kill node if it's in the neighborhood of a larger node
    else:  # Block their neighborhood
        for dx in range(5):
            for dy in range(5):
                cx = x + dx - 2
                cy = y + dy - 2
                if 529 > cx >= 0 and 529 > cy >= 0:
                    block_array[cx][cy] = True
Some notes:
This uses a sliding neighborhood, so it checks a 5x5 centered on each node. Doing the check from highest to lowest is important, as that ensures a node which is removed hasn't previously forced a different node to be removed.
You could do this even more efficiently by using ranges instead of a full 529x529 array, but the neighborhood blocking takes less than a second, and the full process, from generated array to pruned final list, is 1.2 seconds.
Building of a filled nodes list could be improved by only adding the highest value node within any z stack. This will reduce the size of the list which must be sorted if a significant number of nodes end up with the same x,y values.
On a 23x23x30 grid it takes ~18 ms, again including the time to build the 3D array:
timeit.timeit(prune_test, number=1000)
17.61786985397339
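The prune_test used in the timing isn't shown; here is a minimal sketch of how the steps above might be assembled into one function for the 23x23x30 case, using the ~3500 points from the question (names and defaults are illustrative):

import random

def prune_test(nx=23, ny=23, nz=30, n_points=3500):
    # build a random test grid, as in the 529x529x30 example above
    grid = [[[None] * nz for _ in range(ny)] for _ in range(nx)]
    for _ in range(n_points):
        grid[random.randrange(nx)][random.randrange(ny)][random.randrange(nz)] = random.randint(-65, -45)
    # collect filled nodes and sort from high to low
    filled = [(x, y, z, grid[x][y][z])
              for x in range(nx) for y in range(ny) for z in range(nz)
              if grid[x][y][z] is not None]
    filled.sort(key=lambda n: n[3], reverse=True)
    # traverse, blocking the 5x5 neighborhood (all of z) around each survivor
    blocked = [[False] * ny for _ in range(nx)]
    for x, y, z, _ in filled:
        if blocked[x][y]:
            grid[x][y][z] = None  # in the neighborhood of a larger node
        else:
            for cx in range(max(0, x - 2), min(nx, x + 3)):
                for cy in range(max(0, y - 2), min(ny, y + 3)):
                    blocked[cx][cy] = True
    return grid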
I need to count the number of unique elements in a set of given ranges. My input is the start and end coordinates of these ranges, and I do the following.
>>>coordinates
[[7960383, 7961255],
[15688414, 15689284],
[19247797, 19248148],
[21786109, 21813057],
[21822367, 21840682],
[21815951, 21822369],
[21776839, 21783355],
[21779693, 21786111],
[21813097, 21815959],
[21776839, 21786111],
[21813097, 21819613],
[21813097, 21822369]]
>>>len(set(chain(*[range(i[0],i[1]+1) for i in coordinates]))) #here chain is from itertools
The problem is that it is not fast enough. This takes 3.5 ms (measured with %timeit) on my machine (buying a new computer is not an option), and since I need to do this on millions of sets, it is too slow.
Any suggestions on how this could be improved?
Edit: The number of rows can vary. In this case there are 12 rows. But I can't put any upper limit on it.
You could just take the difference between the coordinates, and subtract the overlap:
coordinates =[
[ 7960383, 7961255],
[15688414, 15689284],
[19247797, 19248148],
[21776839, 21786111],
[21813097, 21819613],
[21813097, 21822369]
]
# sort by increasing first coordinate, and if equal, by second:
coordinates.sort()
count = 0
prevEnd = 0
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start, prevEnd)
        prevEnd = end
print(count)
This is both cheap in space and time.
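For reference, with the six (exclusive-end) ranges above, this prints 20637.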
Inclusive end coordinates
After your edit, it became clear you wanted the second coordinate to be inclusive. In that case "correct" the calculation like this:
count = 0
prevEnd = -1
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start - 1, prevEnd)
        prevEnd = end
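With the same six ranges, the inclusive version prints 20642: one extra point for each of the five disjoint groups of ranges.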
Maybe this is better?
from functools import reduce  # needed on Python 3

ranges = [range(s, e + 1) for s, e in coordinates]
len(reduce(lambda x, y: set(x).union(set(y)), ranges))
With NumPy you can do:
import numpy as np
coordinates = ...
nums = np.concatenate([np.arange(start, end + 1) for start, end in coordinates], axis=0)  # end + 1: end coordinates are inclusive
num_unique = len(np.unique(nums))
Update
If you can afford a matrix with as many rows as the number of coordinates and as many columns as the biggest coordinate value, another option would be:
import numpy as np
coordinates = np.asarray(coordinates)
nums = np.tile(np.arange(np.max(coordinates) + 1), (len(coordinates), 1))
m = (nums >= coordinates[:, :1]) & (nums <= coordinates[:, 1:])  # inclusive ends
num_unique = np.count_nonzero(np.logical_or.reduce(m, axis=0))
I work with a large amount of data, and the execution time of this piece of code is very important. The results of each iteration are interdependent, so it's hard to parallelize. It would be awesome if there is a faster way to implement some parts of this code, like:
finding the max element in the matrix and its indices
changing the values in a row/column with the max from another row/column
removing a specific row and column
Filling the weights matrix is pretty fast.
The code does the following:
it contains a list of lists of words word_list, with count elements in it. At the beginning each word is a separate list.
it contains a two-dimensional list (count x count) of float values weights (a lower triangular matrix; the values for which i <= j are zeros)
in each iteration it does the following:
it finds the two words with the most similar value (the max element in the matrix and its indices)
it merges their row and column, saving the larger value from the two in each cell
it merges the corresponding word lists in word_list. It saves both lists in the one with the smaller index (max_j) and it removes the one with the larger index (max_i).
it stops if the largest value is less than a given THRESHOLD
I might think of a different algorithm to do this task, but I have no ideas for now and it would be great if there is at least a small performance improvement.
I tried using NumPy but it performed worse.
weights = fill_matrix(count, N, word_list)
while 1:
    # find the max element in the matrix and its indices
    max_element = 0
    for i in range(count):
        max_e = max(weights[i])
        if max_e > max_element:
            max_element = max_e
            max_i = i
            max_j = weights[i].index(max_e)
    if max_element < THRESHOLD:
        break
    # reset the value of the max element
    weights[max_i][max_j] = 0
    # here it is important that max_j is always less than max_i (since it's a lower triangular matrix)
    for j in range(count):
        weights[max_j][j] = max(weights[max_i][j], weights[max_j][j])
    for i in range(count):
        weights[i][max_j] = max(weights[i][max_j], weights[i][max_i])
    # compare the symmetrical elements, set the ones above to 0
    for i in range(count):
        for j in range(count):
            if i <= j:
                if weights[i][j] > weights[j][i]:
                    weights[j][i] = weights[i][j]
                    weights[i][j] = 0
    # remove the max_i-th column
    for i in range(len(weights)):
        weights[i].pop(max_i)
    # remove the max_i-th row
    weights.pop(max_i)
    new_list = word_list[max_j]
    new_list += word_list[max_i]
    word_list[max_j] = new_list
    # remove the element that was recently merged into a cluster
    word_list.pop(max_i)
    count -= 1
This might help:
def max_ij(A):
    t1 = [max(enumerate(row), key=lambda r: r[1]) for row in A]
    t2 = max(enumerate(t1), key=lambda r: r[1][1])
    i, (j, max_) = t2
    return max_, i, j
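For example, the linear search at the top of the while loop can then be replaced by a single call (using the names from the question's code):

max_element, max_i, max_j = max_ij(weights)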
It depends on how much work you want to put into it, but if you're really concerned about speed you should look into Cython. The quick start tutorial gives a few examples ranging from a 35% speedup to an amazing 150x speedup (with some added effort on your part).
I have two lists in python and I want to know if they intersect at the same index. Is there a mathematical way of solving this?
For example, if I have [9,8,7,6,5] and [3,4,5,6,7], I'd like a simple and efficient formula/algorithm that finds that they intersect at index 3. I know I could do a search; I'm just wondering if there is a better way.
I know there is a formula to solve two lines in y = mx + b form by subtracting them from each other, but my "line" isn't truly a line because it's limited to the items in the list, and it may have curves.
Any help is appreciated.
You could take the set-theoretic intersection of the coordinates in both lists:
intersecting_points = set(enumerate(list1)).intersection(set(enumerate(list2)))
...enumerate gives you an iterable of tuples of indexes and values - in other words, (0,9),(1,8),(2,7),etc.
http://docs.python.org/library/stdtypes.html#set-types-set-frozenset
...make sense? Of course, that won't truly give you geometric intersection - for example, [1,2] intersects with [2,1] at [x=0.5,y=1.5] - if that's what you want, then you have to solve the linear equations at each interval.
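For the example lists from the question:

>>> set(enumerate([9, 8, 7, 6, 5])) & set(enumerate([3, 4, 5, 6, 7]))
{(3, 6)}

i.e. they meet at index 3 with value 6.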
def find_intersection(lineA, lineB):
    # walk consecutive point pairs of both lines
    # (the original used itertools.izip, Python 2's lazy zip)
    for pos, (A0, B0, A1, B1) in enumerate(zip(lineA, lineB, lineA[1:], lineB[1:])):
        # check integer intersections
        if A0 == B0:  # check required if the intersection is at position 0
            return pos
        if A1 == B1:  # check required if the intersection is at the last position
            return pos + 1
        # check for intersection between points
        if (A0 > B0 and A1 < B1) or (A0 < B0 and A1 > B1):
            # intersection between pos and pos+1!
            return pos + solve_linear_equation(A0, A1, B0, B1)
    # no intersection
    return None
...where solve_linear_equation finds the intersection between segments (0,A0)→(1,A1) and (0,B0)→(1,B1).
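A minimal sketch of such a helper (hypothetical; derived from the segment parameterization above, assuming Python 3 division):

def solve_linear_equation(A0, A1, B0, B1):
    # Segments (0,A0)→(1,A1) and (0,B0)→(1,B1) cross where
    # A0 + t*(A1 - A0) == B0 + t*(B1 - B0); solve for t in (0, 1).
    # The denominator is nonzero whenever the strict crossing condition holds.
    return (B0 - A0) / ((A1 - A0) - (B1 - B0))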
I assume one dimension in your list is implicit, e.g. [9,8,7,6,5] are heights at x1,x2,x3,x4,x5, right? In that case, how would your list represent curves like y=0?
In any case, I don't think there can be any shortcut for calculating the intersection of generic or random curves; the best solution is to do an efficient search.
def intersect_at_same_index(seq1, seq2):
    # yield indices of equal pairs lazily and take the first one
    # (the original used itertools.izip and .next(), Python 2 idioms)
    return next(
        idx
        for idx, (item1, item2)
        in enumerate(zip(seq1, seq2))
        if item1 == item2)
This will return the first index where the two sequences have equal items, and raise StopIteration if all item pairs are different. If you don't like this behaviour, enclose the call in a try statement, and in the except StopIteration clause return your favourite failure indicator (e.g. -1, None…)
This will return the index where the two sequences have equal items, and raise a StopIteration if all item pairs are different. If you don't like this behaviour, enclose the return statement in a try statement, and at the except StopIteration clause return your favourite failure indicator (e.g. -1, None…)