I need to count the number of unique elements covered by a set of given ranges. My input is the start and end coordinates of these ranges, and I do the following:
>>> coordinates
[[7960383, 7961255],
[15688414, 15689284],
[19247797, 19248148],
[21786109, 21813057],
[21822367, 21840682],
[21815951, 21822369],
[21776839, 21783355],
[21779693, 21786111],
[21813097, 21815959],
[21776839, 21786111],
[21813097, 21819613],
[21813097, 21822369]]
>>> len(set(chain(*[range(i[0], i[1]+1) for i in coordinates])))  # chain is from itertools
The problem is that this is not fast enough. It takes 3.5 ms per call (measured with %timeit) on my machine (buying a new computer is not an option), and since I need to do this for millions of sets of ranges, it is too slow.
Any suggestions on how this could be improved?
Edit: The number of rows can vary. In this case there are 12 rows. But I can't put any upper limit on it.
You could just take the difference between the coordinates of each range and subtract the overlapping parts:
coordinates = [
    [ 7960383,  7961255],
    [15688414, 15689284],
    [19247797, 19248148],
    [21776839, 21786111],
    [21813097, 21819613],
    [21813097, 21822369],
]
# sort by increasing first coordinate, and if equal, by second:
coordinates.sort()
count = 0
prevEnd = 0
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start, prevEnd)
        prevEnd = end
print(count)
This is both cheap in space and time.
Inclusive end coordinates
After your edit, it became clear you wanted the second coordinate to be inclusive. In that case, "correct" the calculation like this:
count = 0
prevEnd = -1
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start - 1, prevEnd)
        prevEnd = end
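For the six sample intervals above, this gives count = 20642, which a quick cross-check against the question's set-based approach confirms:

from itertools import chain

assert count == len(set(chain(*[range(s, e + 1) for s, e in coordinates])))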
Maybe this is better?

from functools import reduce

len(reduce(lambda x, y: set(x).union(set(y)), (range(s, e + 1) for s, e in coordinates)))

Note that this still materializes every element of every range, so it is unlikely to beat the set/chain one-liner from the question.
With NumPy you can do:
import numpy as np

coordinates = ...
# end + 1 because the end coordinates are inclusive
nums = np.concatenate([np.arange(start, end + 1) for start, end in coordinates], axis=0)
num_unique = len(np.unique(nums))
Update
If you can afford a matrix with as many rows as the number of coordinate pairs and as many columns as the biggest coordinate, another option would be:
import numpy as np

coordinates = np.asarray(coordinates)
# one row per coordinate pair, one column per possible value (0 .. max)
nums = np.tile(np.arange(np.max(coordinates) + 1), (len(coordinates), 1))
# mark the values covered by each inclusive range
m = (nums >= coordinates[:, :1]) & (nums <= coordinates[:, 1:])
num_unique = np.count_nonzero(np.logical_or.reduce(m, axis=0))
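For comparison, here is a fully vectorized sketch of the sorted-merge idea from the first answer (assuming coordinates is the list of inclusive [start, end] pairs); it never materializes the ranges, so memory stays small:

import numpy as np

coords = np.asarray(coordinates)
coords = coords[np.argsort(coords[:, 0])]  # sort rows by start coordinate
starts = coords[:, 0]
ends = coords[:, 1] + 1                    # convert inclusive ends to exclusive
# running maximum of all previous ends marks what has already been counted
prev_end = np.maximum.accumulate(np.r_[0, ends[:-1]])
count = int(np.maximum(ends - np.maximum(starts, prev_end), 0).sum())

For the six sample intervals from the first answer this gives 20642, the same as the merge loop.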
Related
Say we are given the total size of the interval space. Say we are also given an array of tuples giving us the start and end indices of the interval to sum over along with a value. After completing all the sums, we would like to return the maximum element. How would I go about solving this efficiently?
Input format: n = interval space, intervals = array of tuples that contain start index, end index, and value to add to each element
E.g.:
Input: n = 5, intervals = [(1,2,100),(2,5,100),(3,4,100)]
Output: 200
So the array is initially [0,0,0,0,0].
At each iteration the following modifications will be made:
1) [100,100,0,0,0]
2) [100,200,100,100,100]
3) [100,200,200,200,100]
Thus the answer is 200.
All I've figured out so far is the brute-force solution of slicing the array and adding a value to the sliced portion. How can I do better? Any help is appreciated!
One way is to split each interval into a beginning event and an end event that specify how much is added to or subtracted from the running total when you enter or leave that interval. Once you sort the events by their position on the number line, you traverse them, adding or subtracting the values as you go. Here is some code to do so:
from collections import defaultdict

def find_max_val(intervals):
    operations = []
    for i in intervals:
        operations.append([i[0], i[2]])       # entering the interval: add its value
        operations.append([i[1] + 1, -i[2]])  # just past its end: subtract it
    unique_ops = defaultdict(int)
    for operation in operations:
        unique_ops[operation[0]] += operation[1]
    sorted_keys = sorted(unique_ops.keys())
    curr_val = unique_ops[sorted_keys[0]]
    max_val = curr_val
    for key in sorted_keys[1:]:
        curr_val += unique_ops[key]
        max_val = max(max_val, curr_val)
    return max_val

intervals = [(1, 2, 100), (2, 5, 100), (3, 4, 100)]
print(find_max_val(intervals))
# Output: 200
Here is the code for 3 intervals.
n = int(input())
x = [0] * n
for _ in range(3):
    s = int(input())  # start
    e = int(input())  # end
    v = int(input())  # value
    # add the value to every element in the (1-indexed, inclusive) interval
    for i in range(s - 1, e):
        x[i] += v
print(max(x))
You can use a list comprehension to do a lot of the work.
n = 5
intervals = [(1, 2, 100), (2, 5, 100), (3, 4, 100)]
intlst = [[r[2] if r[0] - 1 <= i <= r[1] - 1 else 0 for i in range(n)] for r in intervals]
lst = [0] * n  # [0, 0, 0, 0, 0]
for ls in intlst:
    lst = [lst[i] + ls[i] for i in range(n)]
print(lst)
print(max(lst))
Output
[100, 200, 200, 200, 100]
200
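If NumPy is available, a difference-array (prefix-sum) version of the same sweep idea makes each interval update O(1) regardless of its width; this is a minimal sketch (find_max_val_np is a name made up here):

import numpy as np

def find_max_val_np(n, intervals):
    diff = np.zeros(n + 1)
    for s, e, v in intervals:
        diff[s - 1] += v  # the value enters at 1-indexed position s
        diff[e] -= v      # and leaves just past the inclusive end e
    # the cumulative sum reconstructs the array; its max is the answer
    return int(np.cumsum(diff[:n]).max())

print(find_max_val_np(5, [(1, 2, 100), (2, 5, 100), (3, 4, 100)]))  # 200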
Suppose there are two arrays. Every element in each array is a short line segment that contains a start position and an end position.
a1 = [[0,1],[3,6],[7,9]]
a2 = [[2,6],[0,1]]
In this example, a1[0] is the same as a2[1], and their overlap length is 1. a1[1] and a2[0] have an overlap length of 3. The total result is 4.
Is there any way to achieve this easily?
You can use itertools.product to generate all interval pairs and then calculate the overlap for each pair. Two intervals overlap when the later of their starts lies before the earlier of their ends.
import itertools

a1 = [[0, 1], [3, 6], [7, 9]]
a2 = [[2, 6], [0, 1]]

overlap = 0
for x, y in itertools.product(a1, a2):
    max_start = max(x[0], y[0])
    min_end = min(x[1], y[1])
    overlap += max(0, min_end - max_start)  # 0 when the pair does not overlap
print(overlap)  # 4
There is an ambiguity in the problem statement: can intervals in the same set overlap each other, and if so, do we double-count the overlap of those intervals with an interval in the other set or not?
Anyway, the brute-force approach takes O(N^2) time, which may be fine depending on how large the sets are. But it can be improved to O(N*log N) by sorting both sets by their starting points. If overlapping within the same set is not allowed, you can simply go from left to right, keeping track of the last intervals in each set that overlap each other. If overlapping within the same set is allowed, you can keep a heap of intervals of the first set whose endpoints have not been reached yet and iterate over the second set (a sketch of that variant follows the code below).
In the case of non-overlapping intervals within the same set, the code will be something like this:
a1 = [[0, 1], [3, 6], [7, 9]]
a2 = [[2, 6], [0, 1]]

a1.sort(key=lambda x: x[0])
a2.sort(key=lambda x: x[0])

i1 = 0
i2 = 0
overlapping = 0
while i1 < len(a1) and i2 < len(a2):
    # start and end of the overlap
    start = max(a1[i1][0], a2[i2][0])
    end = min(a1[i1][1], a2[i2][1])
    overlapping += max(0, end - start)
    # move past the interval that ends first
    if a1[i1][1] < a2[i2][1]:
        i1 += 1
    else:
        i2 += 1
print(overlapping)  # 4
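For the case where intervals within the same set may overlap, here is a minimal sketch of the heap variant described above (total_overlap is a name made up here; the inner loop over the live heap makes the cost depend on the number of overlapping pairs):

import heapq

def total_overlap(a1, a2):
    a1 = sorted(a1)
    a2 = sorted(a2)
    total = 0
    i = 0      # next a1 interval not yet opened
    live = []  # min-heap of (end, start) for a1 intervals that may still overlap
    for s2, e2 in a2:
        # open every a1 interval that starts before this a2 interval ends
        while i < len(a1) and a1[i][0] < e2:
            heapq.heappush(live, (a1[i][1], a1[i][0]))
            i += 1
        # drop a1 intervals that end before this a2 interval starts;
        # they cannot overlap this or any later a2 interval
        while live and live[0][0] <= s2:
            heapq.heappop(live)
        for e1, s1 in live:
            total += max(0, min(e1, e2) - max(s1, s2))
    return total

print(total_overlap([[0, 1], [3, 6], [7, 9]], [[2, 6], [0, 1]]))  # 4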
I have a numpy array of N time series of length T. I want the index at which each one first crosses some threshold, and a -1 or something similar if it never crosses. Take ts_array = np.random.randn(N, T).
np.argmax(ts_array > cutoff, axis=1) gets close, but it returns a 0 both for time series that cross the threshold at time 0 and for time series that never cross it.
np.where(...) and np.nonzero(...) are possibilities, but their return values would require rather gruesome handling to extract the vector in which I'm interested.
This question is similar to Numpy first occurrence of value greater than existing value, but none of the answers there solve it.
One liner:
(ts > c).argmax() if (ts > c).any() else -1
assuming ts = ts_array and c = cutoff
Otherwise:
Use argmax() and any()
import numpy as np

np.random.seed([3, 1415])

def xover(ts, cut):
    x = ts > cut
    return x.argmax() if x.any() else -1
ts_array = np.random.random(5).round(4)
ts_array looks like:
print(ts_array)
[ 0.4449 0.4076 0.4601 0.4652 0.4627]
Various checks:
print(xover(ts_array, 0.400))
0
print(xover(ts_array, 0.460))
2
print(xover(ts_array, 0.465))
3
print(xover(ts_array, 1.000))
-1
It's not too bad with np.where. I would use the following as a starting point:
import numpy as np

ts_array = np.random.rand(10, 10)
cutoff = 0.5
# Get the indices of all entries that satisfy the condition
rows, cols = np.where(ts_array > cutoff)
if len(rows) > 0:
    index = (rows[0], cols[0])
else:
    index = -1
Note that np.where returns two arrays: a list of row indices and a list of column indices. They are matched, so taking the first element of each gives the first instance (in row-major order) where a value is above the cutoff. I don't have a nice one-liner, but the handling code isn't too bad, and it should be easily adaptable to your situation.
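For the per-row case in the question (one index per time series), the argmax/any idea from the other answer vectorizes directly; a sketch, assuming ts_array has shape (N, T):

import numpy as np

ts_array = np.random.randn(5, 100)
cutoff = 1.0
mask = ts_array > cutoff
# argmax returns the first True in each row; rows with no crossing get -1
first_cross = np.where(mask.any(axis=1), mask.argmax(axis=1), -1)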
I work with a large amount of data, and the execution time of this piece of code is very, very important. The results of each iteration depend on the previous one, so it's hard to parallelize. It would be awesome if there is a faster way to implement some parts of this code, like:
finding the max element in the matrix and its indices
changing the values in a row/column with the max from another row/column
removing a specific row and column
Filling the weights matrix is pretty fast.
The code does the following:
it contains a list of lists of words word_list, with count elements in it. At the beginning each word is a separate list.
it contains a two-dimensional list (count x count) of float values, weights (a lower triangular matrix; the values for which i <= j are zeros)
in each iteration it does the following:
it finds the two words with the most similar value (the max element in the matrix and its indices)
it merges their row and column, saving the larger value from the two in each cell
it merges the corresponding word lists in word_list. It saves both lists in the one with the smaller index (max_j) and it removes the one with the larger index (max_i).
it stops if the largest value is less than a given THRESHOLD
I might think of a different algorithm to do this task, but I have no ideas for now and it would be great if there is at least a small performance improvement.
I tried using NumPy but it performed worse.
weights = fill_matrix(count, N, word_list)

while 1:
    # find the max element in the matrix and its indices
    max_element = 0
    for i in range(count):
        max_e = max(weights[i])
        if max_e > max_element:
            max_element = max_e
            max_i = i
            max_j = weights[i].index(max_e)
    if max_element < THRESHOLD:
        break
    # reset the value of the max element
    weights[max_i][max_j] = 0
    # here it is important that max_j is always less than max_i
    # (since it's a lower triangular matrix)
    for j in range(count):
        weights[max_j][j] = max(weights[max_i][j], weights[max_j][j])
    for i in range(count):
        weights[i][max_j] = max(weights[i][max_j], weights[i][max_i])
    # compare the symmetrical elements, set the ones above the diagonal to 0
    for i in range(count):
        for j in range(count):
            if i <= j:
                if weights[i][j] > weights[j][i]:
                    weights[j][i] = weights[i][j]
                    weights[i][j] = 0
    # remove the max_i-th column
    for i in range(len(weights)):
        weights[i].pop(max_i)
    # remove the max_i-th row
    weights.pop(max_i)
    # merge the two word lists into the one with the smaller index (max_j)
    new_list = word_list[max_j]
    new_list += word_list[max_i]
    word_list[max_j] = new_list
    # remove the element that was recently merged into a cluster
    word_list.pop(max_i)
    count -= 1
This might help:
def max_ij(A):
    # (index, value) of the max in each row, then the max across rows
    t1 = [max(enumerate(row), key=lambda r: r[1]) for row in A]
    t2 = max(enumerate(t1), key=lambda r: r[1][1])
    i, (j, max_) = t2
    return max_, i, j
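For example, on a small made-up lower triangular matrix:

weights = [[0.0, 0.0, 0.0],
           [0.7, 0.0, 0.0],
           [0.2, 0.9, 0.0]]
print(max_ij(weights))  # (0.9, 2, 1)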
It depends on how much work you want to put into it, but if you're really concerned about speed you should look into Cython. The quick start tutorial gives a few examples ranging from a 35% speedup to an amazing 150x speedup (with some added effort on your part).
I have some audio data loaded in a numpy array and I wish to segment the data by finding silent parts, i.e. parts where the audio amplitude is below a certain threshold over a period in time.
An extremely simple way to do this is something like this:
import re

values = ''.join("1" if abs(x) < SILENCE_THRESHOLD else "0" for x in samples)
pattern = re.compile('1{%d,}' % int(MIN_SILENCE))
for match in pattern.finditer(values):
    # code goes here
    pass
The code above finds parts where there are at least MIN_SILENCE consecutive elements smaller than SILENCE_THRESHOLD.
Now, obviously, the above code is horribly inefficient and a terrible abuse of regular expressions. Is there some other method that is more efficient, but still results in equally simple and short code?
Here's a numpy-based solution.
I think (?) it should be faster than the other options. Hopefully it's fairly clear.
However, it does require twice as much memory as the various generator-based solutions. As long as you can hold a single temporary copy of your data in memory (for the diff), plus a boolean array of the same length as your data (one byte per element in numpy), it should be pretty efficient...
import numpy as np

def main():
    # Generate some random data
    x = np.cumsum(np.random.random(1000) - 0.5)
    condition = np.abs(x) < 1
    # Print the start and stop indices of each region where the absolute
    # values of x are below 1, and the min and max of each of these regions
    for start, stop in contiguous_regions(condition):
        segment = x[start:stop]
        print(start, stop)
        print(segment.min(), segment.max())

def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()

    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]

    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

main()
There is a very convenient solution to this using scipy.ndimage. For an array:
a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0])
(which could be the result of a condition applied to another array), finding the contiguous regions is as simple as:
import scipy.ndimage

regions = scipy.ndimage.find_objects(scipy.ndimage.label(a)[0])
Then, applying any function to those regions can be done e.g. like:
[np.sum(a[r]) for r in regions]
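A complete run on the array above, for reference:

import numpy as np
import scipy.ndimage

a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0])
regions = scipy.ndimage.find_objects(scipy.ndimage.label(a)[0])
print(regions)                               # [(slice(0, 4, None),), (slice(7, 10, None),)]
print([int(np.sum(a[r])) for r in regions])  # [4, 3]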
Slightly sloppy, but simple and fast-ish, if you don't mind using scipy:
import numpy as np
from scipy.ndimage import gaussian_filter

sigma = 3
threshold = 1
# smooth the amplitude (np.abs), not the raw signed samples, so that loud
# zero-mean audio doesn't average out to zero just like the quiet parts
above_threshold = gaussian_filter(np.abs(data), sigma=sigma) > threshold
The idea is that quiet portions of the data will smooth down to low amplitude, and loud regions won't. Tune 'sigma' to affect how long a 'quiet' region must be; tune 'threshold' to affect how quiet it must be. This slows down for large sigma, at which point using FFT-based smoothing might be faster.
This has the added benefit that single 'hot pixels' won't disrupt your silence-finding, so you're a little less sensitive to certain types of noise.
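For illustration, a minimal end-to-end run on synthetic data (the signal shape, sigma, and threshold here are all made up):

import numpy as np
from scipy.ndimage import gaussian_filter

# synthetic signal: quiet, loud, quiet
data = np.concatenate([0.1 * np.random.randn(200),
                       5.0 * np.random.randn(200),
                       0.1 * np.random.randn(200)])
above_threshold = gaussian_filter(np.abs(data), sigma=3) > 1
silent = ~above_threshold  # True in the two quiet regions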
I haven't tested this, but it should be close to what you are looking for. It's slightly more lines of code, but it should be more efficient and readable, and it doesn't abuse regular expressions :-)
def find_silent(samples):
    num_silent = 0
    start = 0
    for index in range(len(samples)):
        if abs(samples[index]) < SILENCE_THRESHOLD:
            if num_silent == 0:
                start = index
            num_silent += 1
        else:
            if num_silent >= MIN_SILENCE:  # >= so runs of exactly MIN_SILENCE count too
                yield samples[start:index]
            num_silent = 0
    if num_silent >= MIN_SILENCE:
        yield samples[start:]

for match in find_silent(samples):
    # code goes here
    pass
This should return a list of (start,length) pairs:
def silent_segs(samples, threshold, min_dur):
    start = -1
    silent_segments = []
    for idx, x in enumerate(samples):
        if start < 0 and abs(x) < threshold:
            start = idx
        elif start >= 0 and abs(x) >= threshold:
            dur = idx - start
            if dur >= min_dur:
                silent_segments.append((start, dur))
            start = -1
    # don't lose a silent segment that runs to the very end of the samples
    if start >= 0 and len(samples) - start >= min_dur:
        silent_segments.append((start, len(samples) - start))
    return silent_segments
And a simple test:
>>> s = [-1, 0, 0, 0, -1, 10, -10, 1, 2, 1, 0, 0, 0, -1, -10]
>>> silent_segs(s, 2, 2)
[(0, 5), (9, 5)]
Another way to do this quickly and concisely:
import pylab as pl

v = [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
vd = pl.diff(v)
# vd[i] == 1 for a 0->1 crossing; vd[i] == -1 for a 1->0 crossing
# need to add +1 to the indexes as pl.diff shifts everything to the left by 1
i1 = pl.array([i for i in range(len(vd)) if vd[i] == 1]) + 1
i2 = pl.array([i for i in range(len(vd)) if vd[i] == -1]) + 1
# corner cases for the first and the last element
if v[0] == 1:
    i1 = pl.hstack((0, i1))
if v[-1] == 1:
    i2 = pl.hstack((i2, len(v)))
Now i1 contains the beginning indexes and i2 the end indexes of the 1,...,1 areas.
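Pairing them up then gives the (start, end) segments directly:

segments = list(zip(i1, i2))  # [(2, 4), (6, 11), (12, 13), (14, 16), (21, 22)] for the v above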
@joe-kington: I got about a 20%-25% speed improvement over the np.diff / np.nonzero solution by using argmax instead (see the code below; condition is a boolean array):
def contiguous_regions(condition):
    idx = []
    i = 0
    while i < len(condition):
        x1 = i + condition[i:].argmax()
        try:
            x2 = x1 + condition[x1:].argmin()
        except ValueError:
            x2 = x1 + 1
        if x1 == x2:
            if condition[x1]:
                x2 = len(condition)
            else:
                break
        idx.append([x1, x2])
        i = x2
    return idx
Of course, your mileage may vary depending on your data.
Besides, I'm not entirely sure, but I guess numpy may optimize argmin/argmax over boolean arrays to stop searching on the first True/False occurrence. That might explain it.
I know I'm late to the party, but another way to do this is with 1d convolutions:
np.convolve(sig > threshold, np.ones(cons_samples), 'same') == cons_samples
where cons_samples is the number of consecutive samples you require above the threshold.
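A quick demonstration on a made-up signal (the values are arbitrary); positions marked True are the centres of windows in which all cons_samples samples exceed the threshold:

import numpy as np

sig = np.array([0.1, 0.9, 0.8, 0.7, 0.2, 0.9, 0.9, 0.1])
threshold = 0.5
cons_samples = 3
hits = np.convolve(sig > threshold, np.ones(cons_samples), 'same') == cons_samples
print(hits)  # True only at index 2, the centre of the run 0.9, 0.8, 0.7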