Find large number of consecutive values fulfilling condition in a numpy array - python

I have some audio data loaded in a numpy array and I wish to segment the data by finding silent parts, i.e. parts where the audio amplitude is below a certain threshold over a period in time.
An extremely simple way to do this is something like this:
values = ''.join(("1" if (abs(x) < SILENCE_THRESHOLD) else "0" for x in samples))
pattern = re.compile('1{%d,}'%int(MIN_SILENCE))
for match in pattern.finditer(values):
    # code goes here
The code above finds parts where there are at least MIN_SILENCE consecutive elements smaller than SILENCE_THRESHOLD.
Now, obviously, the above code is horribly inefficient and a terrible abuse of regular expressions. Is there some other method that is more efficient, but still results in equally simple and short code?

Here's a numpy-based solution.
I think (?) it should be faster than the other options. Hopefully it's fairly clear.
However, it does require twice as much memory as the various generator-based solutions. As long as you can hold a single temporary copy of your data in memory (for the diff), plus a boolean array of the same length as your data (one byte per element in numpy), it should be pretty efficient...
import numpy as np

def main():
    # Generate some random data
    x = np.cumsum(np.random.random(1000) - 0.5)
    condition = np.abs(x) < 1
    # Print the start and stop indices of each region where the absolute
    # values of x are below 1, and the min and max of each of these regions
    for start, stop in contiguous_regions(condition):
        segment = x[start:stop]
        print start, stop
        print segment.min(), segment.max()

def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition.astype(int))
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True, prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

main()
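Applied to the original question, the output of contiguous_regions can then be filtered by run length. A minimal sketch, assuming samples, SILENCE_THRESHOLD and MIN_SILENCE are defined as in the question:
# Sketch: silent stretches at least MIN_SILENCE samples long
# (samples, SILENCE_THRESHOLD and MIN_SILENCE are assumed from the question).
condition = np.abs(samples) < SILENCE_THRESHOLD
regions = contiguous_regions(condition)
long_enough = regions[(regions[:, 1] - regions[:, 0]) >= MIN_SILENCE]
for start, stop in long_enough:
    silent_chunk = samples[start:stop]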

There is a very convenient solution to this using scipy.ndimage. For an array:
import numpy as np
import scipy.ndimage

a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0])
which can be the result of a condition applied to another array, finding the contiguous regions is as simple as:
regions = scipy.ndimage.find_objects(scipy.ndimage.label(a)[0])
Then, applying any function to those regions can be done e.g. like:
[np.sum(a[r]) for r in regions]
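With the same imports, a sketch applying this directly to the silence problem (samples, SILENCE_THRESHOLD and MIN_SILENCE are assumed from the question):
# Label contiguous quiet runs, then keep only those at least MIN_SILENCE long.
condition = np.abs(samples) < SILENCE_THRESHOLD
labeled, num_features = scipy.ndimage.label(condition)
slices = scipy.ndimage.find_objects(labeled)
silent_slices = [s[0] for s in slices
                 if (s[0].stop - s[0].start) >= MIN_SILENCE]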

Slightly sloppy, but simple and fast-ish, if you don't mind using scipy:
from scipy.ndimage import gaussian_filter
sigma = 3
threshold = 1
above_threshold = gaussian_filter(data, sigma=sigma) > threshold
The idea is that quiet portions of the data will smooth down to low amplitude, and loud regions won't. Tune 'sigma' to affect how long a 'quiet' region must be; tune 'threshold' to affect how quiet it must be. This slows down for large sigma, at which point using FFT-based smoothing might be faster.
This has the added benefit that single 'hot pixels' won't disrupt your silence-finding, so you're a little less sensitive to certain types of noise.
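A small self-contained sketch of that idea for finding the quiet regions (the synthetic data and the 0.1 threshold are made up for illustration):
from scipy.ndimage import gaussian_filter
import numpy as np

# Synthetic signal that alternates between loud and quiet stretches.
data = np.random.randn(10000) * (np.sin(np.linspace(0, 20, 10000)) > 0)
# Smooth the rectified amplitude; quiet stretches stay near zero.
smoothed = gaussian_filter(np.abs(data), sigma=3)
quiet = smoothed < 0.1  # tune to taste, as described above
# "quiet" can then be fed to contiguous_regions() from the earlier answer
# to recover (start, stop) pairs for each silent stretch.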

I haven't tested this, but it should be close to what you are looking for. It's slightly more lines of code, but it should be more efficient and readable, and it doesn't abuse regular expressions :-)
def find_silent(samples):
    num_silent = 0
    start = 0
    for index in range(0, len(samples)):
        if abs(samples[index]) < SILENCE_THRESHOLD:
            if num_silent == 0:
                start = index
            num_silent += 1
        else:
            if num_silent > MIN_SILENCE:
                yield samples[start:index]
            num_silent = 0
    if num_silent > MIN_SILENCE:
        yield samples[start:]

for match in find_silent(samples):
    # code goes here

This should return a list of (start,length) pairs:
def silent_segs(samples, threshold, min_dur):
    start = -1
    silent_segments = []
    for idx, x in enumerate(samples):
        if start < 0 and abs(x) < threshold:
            start = idx
        elif start >= 0 and abs(x) >= threshold:
            dur = idx - start
            if dur >= min_dur:
                silent_segments.append((start, dur))
            start = -1
    return silent_segments
And a simple test:
>>> s = [-1,0,0,0,-1,10,-10,1,2,1,0,0,0,-1,-10]
>>> silent_segs(s,2,2)
[(0, 5), (9, 5)]

Another way to do this quickly and concisely:
import pylab as pl
v=[0,0,1,1,0,0,1,1,1,1,1,0,1,0,1,1,0,0,0,0,0,1,0,0]
vd = pl.diff(v)
#vd[i]==1 for 0->1 crossing; vd[i]==-1 for 1->0 crossing
#need to add +1 to indexes as pl.diff shifts to left by 1
i1=pl.array([i for i in xrange(len(vd)) if vd[i]==1])+1
i2=pl.array([i for i in xrange(len(vd)) if vd[i]==-1])+1
#corner cases for the first and the last element
if v[0] == 1:
    i1 = pl.hstack((0, i1))
if v[-1] == 1:
    i2 = pl.hstack((i2, len(v)))
Now i1 contains the beginning indices and i2 the end indices of the 1,...,1 runs.
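For the original question, those index pairs can then be filtered by run length (MIN_SILENCE is the assumed name from the question):
# keep only runs of 1s that are at least MIN_SILENCE elements long
lengths = i2 - i1
starts = i1[lengths >= MIN_SILENCE]
ends = i2[lengths >= MIN_SILENCE]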

@Joe Kington: I've got about a 20%-25% speed improvement over the np.diff / np.nonzero solution by using argmax instead (see code below; condition is a boolean array):
def contiguous_regions(condition):
    idx = []
    i = 0
    while i < len(condition):
        x1 = i + condition[i:].argmax()
        try:
            x2 = x1 + condition[x1:].argmin()
        except ValueError:  # condition[x1:] is empty
            x2 = x1 + 1
        if x1 == x2:
            if condition[x1]:
                x2 = len(condition)
            else:
                break
        idx.append([x1, x2])
        i = x2
    return idx
Of course, your mileage may vary depending on your data.
Besides, I'm not entirely sure, but I guess numpy may optimize argmin/argmax over boolean arrays to stop searching on the first True/False occurrence. That might explain it.

I know I'm late to the party, but another way to do this is with 1d convolutions:
np.convolve(sig > threshold, np.ones(cons_samples), 'same') == cons_samples
where cons_samples is the number of consecutive samples you require above threshold.
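As a small self-contained sketch of that idea applied to the silence problem (all the values here are illustrative, not from the answer above):
import numpy as np

# Synthetic signal with quiet stretches; threshold and run length are made up.
samples = np.random.randn(5000) * (np.cos(np.linspace(0, 30, 5000)) > 0)
SILENCE_THRESHOLD = 0.05
MIN_SILENCE = 100
quiet = np.abs(samples) < SILENCE_THRESHOLD
# True wherever the window of MIN_SILENCE samples (roughly) centred on a
# position is entirely quiet; 'same' mode keeps the output the same length.
in_long_run = np.convolve(quiet, np.ones(MIN_SILENCE), 'same') == MIN_SILENCE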

Related

Efficient method to count number of unique elements in ranges?

I need to count number of unique elements in a set of given ranges. My input is the start and end coordinates for these ranges and I do the following.
>>> coordinates
[[7960383, 7961255],
[15688414, 15689284],
[19247797, 19248148],
[21786109, 21813057],
[21822367, 21840682],
[21815951, 21822369],
[21776839, 21783355],
[21779693, 21786111],
[21813097, 21815959],
[21776839, 21786111],
[21813097, 21819613],
[21813097, 21822369]]
>>> len(set(chain(*[range(i[0], i[1]+1) for i in coordinates])))  # here chain is from itertools
Problem is that it is not fast enough. This is taking 3.5ms (found using %timeit) on my machine (buying a new computer is not an option) and since I need to do this on millions of sets, it is not fast.
Any suggestions how this could be improved?
Edit: The number of rows can vary. In this case there are 12 rows. But I can't put any upper limit on it.
You could just take the difference between the coordinates, and subtract overlapping:
coordinates =[
[ 7960383, 7961255],
[15688414, 15689284],
[19247797, 19248148],
[21776839, 21786111],
[21813097, 21819613],
[21813097, 21822369]
]
# sort by increasing first coordinate, and if equal, by second:
coordinates.sort()
count = 0
prevEnd = 0
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start, prevEnd)
        prevEnd = end
print(count)
This is both cheap in space and time.
Inclusive end coordinates
After your edit, it became clear you wanted the second coordinate to be inclusive. In that case "correct" the calculation like this:
count = 0
prevEnd = -1
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start - 1, prevEnd)
        prevEnd = end
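For reuse, the same sweep can be wrapped in a small function (just a repackaging of the loop above; the function name is made up):
def count_unique_covered(coordinates):
    # sweep the ranges in sorted order, counting only newly covered integers
    count = 0
    prevEnd = -1
    for start, end in sorted(coordinates):
        if end > prevEnd:  # skip ranges fully contained in the previous one
            count += end - max(start - 1, prevEnd)
            prevEnd = end
    return count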
Maybe this is better?
len(reduce(lambda x, y: set(x).union(set(y)), array))
With NumPy you can do:
import numpy as np
coordinates = ...
nums = np.concatenate([np.arange(start, end) for start, end in coordinates], axis=0)
num_unique = len(np.unique(nums))
Update
If you can afford having a matrix with as many rows as the number of coordinates and as many columns as the biggest number, another option would be:
import numpy as np
coordinates = np.asarray(coordinates)
nums = np.tile(np.arange(np.max(coordinates)), (len(coordinates), 1))
m = (nums >= coordinates[:, :1]) & (nums < coordinates[:, 1:])
num_unique = np.count_nonzero(np.logical_or.reduce(m, axis=0))

First element of series to cross threshold in numpy, with handling of series that never cross

I have a numpy array of N time series of length T. I want the index at which each first crosses some threshold, and a -1 or something similar if it never crosses. Take ts_array = np.random.randn(N, T).
np.argmax(ts_array > cutoff, axis=1) gets close, but it returns a 0 for both time series that cross the threshold at time 0, and time series that never cross.
np.where(...) and np.nonzero(...) are possibilities, but their return values would require rather gruesome handling to extract the vector in which I'm interested
This question is similar to Numpy first occurrence of value greater than existing value, but none of the answers there solve it.
One liner:
(ts > c).argmax() if (ts > c).any() else -1
assuming ts = ts_array and c = cutoff
Otherwise:
Use argmax() and any()
np.random.seed([3, 1415])

def xover(ts, cut):
    x = ts > cut
    return x.argmax() if x.any() else -1

ts_array = np.random.random(5).round(4)
ts_array looks like:
print ts_array, '\n'
[ 0.4449 0.4076 0.4601 0.4652 0.4627]
Various checks:
print xover(ts_array, 0.400), '\n'
0
print xover(ts_array, 0.460), '\n'
2
print xover(ts_array, 0.465), '\n'
3
print xover(ts_array, 1.000), '\n'
-1
It's not too bad with np.where. I would use the following as a starting point:
ts_array = np.random.rand(10, 10)
cutoff = 0.5
# Get a list of all indices that satisfy the condition
rows, cols = np.where(ts_array > cutoff)
if len(rows) > 0:
    index = (rows[0], cols[0])
else:
    index = -1
Note that np.where returns two arrays, a list of row indices and a list of column indices. They are matched, so choosing the first one of each array will give us the first instance where the values are above the cutoff. I don't have a nice one-liner, but the handling code isn't too bad. It should be easily adaptable to your situation.
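For the full 2-D case in the question (N series of length T), the argmax/any idea also vectorises across rows; a short sketch:
import numpy as np

# Per-row index of the first crossing, -1 where a series never crosses.
np.random.seed(0)
ts_array = np.random.randn(5, 100)   # N series of length T
cutoff = 2.0
crossed = ts_array > cutoff
first_idx = np.where(crossed.any(axis=1), crossed.argmax(axis=1), -1)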

Speeding up this Python code for large input

I wrote this Python code to do a particular computation in a bigger project. It works fine for smaller values of N, but it doesn't scale up very well for large values, and even though I ran it for a number of hours to collect the data, I was wondering if there is a way to speed it up.
import numpy as np
def FillArray(arr):
    while 0 in arr:
        ind1 = np.random.randint(0, N)
        if arr[ind1] == 0:
            if ind1 == 0:
                arr[ind1] = 1
                arr[ind1+1] = 2
            elif ind1 == len(arr)-1:
                arr[ind1] = 1
                arr[ind1-1] = 2
            else:
                arr[ind1] = 1
                arr[ind1+1] = 2
                arr[ind1-1] = 2
        else:
            continue
    return arr

N = 50000
dist = []
for i in range(1000):
    arr = [0 for x in range(N)]
    dist.append(FillArray(arr).count(2))
For N = 50,000, it currently takes slightly over a minute on my computer for one iteration to fill the array. So if I want to simulate this, lets say, a 1000 times, it takes many hours. Is there something I can do to speed this up?
Edit 1: I forgot to mention what it actually does. I have a list of length N and I initialize it by having zeros in each entry. Then I pick a random number between 0 and N and if that index of the list has a zero, I replace it by 1 and its neighboring indices by 2 to indicate they are not filled by 1 but they can't be filled again. I keep doing this till I populate the whole list by 1 and 2 and then I count how many of the entries contain 2 which is the result of this computation. Thus I want to find out if I fill an array randomly with this constraint, how many entries will not be filled.
Obviously I do not claim that this is the most efficient way to find this number, so I am hoping that perhaps there is a better alternative if this code can't be sped up.
As @SylvainLeroux noted in the comments, the approach of trying to find which zero you're going to change by drawing a random location and hoping it's zero is going to slow down when you start running out of zeros. Simply choosing from the ones you know are going to be zero will speed it up dramatically. Something like
def faster(N):
    # pad on each side
    arr = np.zeros(N + 2)
    arr[0] = arr[-1] = -1  # ignore edges
    while True:
        # zeros left
        zero_locations = np.where(arr == 0)[0]
        if not len(zero_locations):
            break  # we're done
        np.random.shuffle(zero_locations)
        for zloc in zero_locations:
            if arr[zloc] == 0:
                arr[zloc-1:zloc+2] = [2, 1, 2]
    return arr[1:-1]  # remove edges
will be much faster (times on my old notebook):
>>> %timeit faster(50000)
10 loops, best of 3: 105 ms per loop
>>> %time [(faster(50000) == 2).sum() for i in range(1000)]
CPU times: user 1min 46s, sys: 4 ms, total: 1min 46s
Wall time: 1min 46s
We could improve this by vectorizing more of the computation, but depending on your constraints this might already suffice.
First I will reformulate the problem from tri-variate to bi-variate. What you are doing is splitting the vector of length N into two smaller vectors at a random point k.
Let's assume that you start with a vector of zeros, then you put '1' at a randomly selected k and from there take two smaller vectors of zeros, [0..k-2] and [k+2..N-1]. No need for a third state. You repeat the process until exhaustion, i.e. until you are left with vectors containing only one element.
Using recursion this is reasonably fast even on my iPad mini with Pythonista.
import numpy as np
from random import randint

def SplitArray(l, r):
    while l < r:
        k = randint(l, r)
        arr[k] = 1
        return SplitArray(l, k-2) + [k] + SplitArray(k+2, r)
    return []

N = 50000
L = 1000
dist = np.zeros(L)
for i in xrange(L):
    arr = [0 for x in xrange(N)]
    SplitArray(0, N-1)
    dist[i] = arr.count(0)
print dist, np.mean(dist), np.std(dist)
However, if you would like to make it really fast, then the bivariate problem could be coded very effectively and naturally with bit arrays instead of storing 1 and 0 in arrays of integers or, worse, floats in numpy arrays. The bit manipulation should be quick, and in some cases you could easily get close to machine-level speed.
Something along these lines (this is an idea, not optimal code):
from bitarray import BitArray
from random import randint
import numpy as np

def SplitArray(l, r):
    while l < r:
        k = randint(l, r)
        arr.set_bit(k)
        return SplitArray(l, k-2) + [k] + SplitArray(k+2, r)
    return []

def count0(ba):
    i = 0
    for n in xrange(1, N):
        if ba.get_bit(n) == 0:
            i += 1
    return i

N = 50000
L = 1000
dist = np.zeros(L)
for i in xrange(L):
    arr = BitArray(N, initialize=0)
    SplitArray(1, N)
    dist[i] = count0(arr)
print np.mean(dist), np.std(dist)
(using the bitarray package)
The solution converges very nicely, so perhaps half an hour spent looking for an analytical solution would make this whole MC exercise unnecessary?

Speed up loop to fill an array with closest values from another array

I have a block of code that I need to optimize as much as possible since I have to run it several thousand times.
What it does is find the float in one sub-list of a given array that is closest to a random float, and store the corresponding float (i.e. the one with the same index) from another sub-list of that array. It repeats the process until the sum of the floats stored reaches a certain limit.
Here's the MWE to make it clearer:
import numpy as np
# Define array with two sub-lists.
a = [np.random.uniform(0., 100., 10000), np.random.random(10000)]
# Initialize empty final list.
b = []
# Run until the condition is met.
while sum(b) < 10000:
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[1].
    idx = np.argmin(np.abs(u - a[1]))
    # Store value located in sub-list a[0].
    b.append(a[0][idx])
The code is reasonably simple but I haven't found a way to speed it up. I tried to adapt the great (and very fast) answer given in a similar question I made some time ago, to no avail.
OK, here's a slightly left-field suggestion. As I understand it, you are just trying to sample uniformly from the elements in a[0] until you have a list whose sum exceeds some limit.
Although it will be more costly memory-wise, I think you'll probably find it's much faster to generate a large random sample from a[0] first, then take the cumsum and find where it first exceeds your limit.
For example:
import numpy as np
# array of reference float values, equivalent to a[0]
refs = np.random.uniform(0, 100, 10000)
def fast_samp_1(refs, lim=10000, blocksize=10000):
    # sample uniformly from refs
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)
    # find where the cumsum first exceeds your limit
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
    # # if it's ok to be just under lim rather than just over then this might
    # # be quicker
    # return samp[samp_sum <= lim]
Of course, if the sum of the sample of blocksize elements is < lim then this will fail to give you a sample whose sum is >= lim. You could check whether this is the case, and append to your sample in a loop if necessary.
def fast_samp_2(refs, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)
    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Note that concatenating arrays is pretty slow, so it would probably be better to make blocksize large enough to be reasonably sure that the sum of a single block will be >= to your limit, without being excessively large.
Update
I've adapted your original function a little bit so that its syntax more closely resembles mine.
def orig_samp(refs, lim=10000):
    # Initialize empty final list.
    b = []
    a1 = np.random.random(10000)
    # Run until the condition is met.
    while sum(b) < lim:
        # Draw random [0,1) value.
        u = np.random.random()
        # Find closest value in sub-list a[1].
        idx = np.argmin(np.abs(u - a1))
        # Store value located in sub-list a[0].
        b.append(refs[idx])
    return b
Here's some benchmarking data.
%timeit orig_samp(refs, lim=10000)
# 100 loops, best of 3: 11 ms per loop
%timeit fast_samp_2(refs, lim=10000, blocksize=1000)
# 10000 loops, best of 3: 62.9 µs per loop
That's over two orders of magnitude faster. You can do a bit better by reducing the blocksize a fraction - you basically want it to be comfortably larger than the length of the arrays you're getting out. In this case, you know that on average the output will be about 200 elements long, since the mean of all real numbers between 0 and 100 is 50, and 10000 / 50 = 200.
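That back-of-the-envelope estimate can be turned into a rough blocksize heuristic (a sketch, not part of the original answer; it reuses the refs and lim names from above):
def suggested_blocksize(refs, lim, safety=3):
    # expected number of draws is lim / mean(refs); pad with a safety factor
    # so a single block almost always reaches the target sum
    return max(1, int(safety * lim / np.mean(refs)))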
Update 2
It's easy to get a weighted sample rather than a uniform sample - you can just pass the p= parameter to np.random.choice:
def weighted_fast_samp(refs, weights=None, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True, p=weights)
    samp_sum = np.cumsum(samp)
    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True,
                                   p=weights)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Write it in Cython. That's going to get you a lot more for a high-iteration operation.
http://cython.org/
One obvious optimization - don't re-calculate sum on each iteration, accumulate it
b_sum = 0
while b_sum < 10000:
    ....
    idx = np.argmin(np.abs(u - a[1]))
    add_val = a[0][idx]
    b.append(add_val)
    b_sum += add_val
EDIT:
I think some minor improvement (check it out if you feel like it) may be achieved by pre-referencing sublists before the loop
a_0 = a[0]
a_1 = a[1]
...
while ...:
    ....
    idx = np.argmin(np.abs(u - a_1))
    b.append(a_0[idx])
It may save some on run time - though I don't believe it will matter that much.
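Putting the accumulated sum and the pre-referenced sub-lists together on the question's MWE gives roughly this (a sketch combining the two suggestions above):
import numpy as np

a = [np.random.uniform(0., 100., 10000), np.random.random(10000)]
a_0, a_1 = a[0], a[1]   # pre-reference the sub-lists once
b = []
b_sum = 0               # running total instead of sum(b) every iteration
while b_sum < 10000:
    u = np.random.random()
    idx = np.argmin(np.abs(u - a_1))
    add_val = a_0[idx]
    b.append(add_val)
    b_sum += add_val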
Sort your reference array.
That allows O(log n) lookups instead of needing to scan the whole list (using bisect, for example, to find the closest elements).
For starters, I swap a[0] and a[1] to simplify the sort, and sort both rows together by the [0, 1) values so that the pairing between them is preserved:
a = np.array([np.random.random(10000), np.random.uniform(0., 100., 10000)])
a = a[:, np.argsort(a[0])]
Now, a is sorted by order of a[0], meaning if you are looking for the closest value to an arbitrary number, you can start by a bisect:
import bisect

while sum(b) < 10000:
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[0].
    idx = bisect.bisect(a[0], u)
    # idx is the insertion point, so the closest value is at idx or idx - 1
    if idx == len(a[0]) or (idx != 0 and np.abs(a[0][idx] - u) > np.abs(a[0][idx - 1] - u)):
        idx = idx - 1
    # Store value located in sub-list a[1].
    b.append(a[1][idx])
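The same lookup can also be vectorised with np.searchsorted, resolving a whole batch of random draws at once (a sketch; it assumes a has been sorted by a[0] as above):
# Draw a batch of candidates and find the closest a[0] entry for each.
u = np.random.random(1000)
idx = np.searchsorted(a[0], u)
idx = np.clip(idx, 1, len(a[0]) - 1)
# pick whichever neighbour is closer
left_closer = np.abs(a[0][idx - 1] - u) <= np.abs(a[0][idx] - u)
idx = idx - left_closer
closest_vals = a[1][idx]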

Some python / numpy optimization possible?

I am profiling some genetic algorithm code with some nested loops and from what I see most of the time is spent in two of my functions which involve slicing and adding up numpy arrays. I tried my best to further optimize them but would like to see if others come up with ideas.
Function 1:
The first function is called 2,954,684 times, for a total of 19 seconds spent inside the function.
We basically just create views inside numpy arrays contained in data[0], according to coordinates contained in data[1]
def get_signal(data, options):
    # data[0] contains bed, data[1] contains position
    # forward = 0, reverse = 1
    start = data[1][0] - options.halfwinwidth
    end = data[1][0] + options.halfwinwidth
    if data[1][1] == 0:
        normals_forward = data[0]['normals_forward'][start:end]
        normals_reverse = data[0]['normals_reverse'][start:end]
    else:
        normals_forward = data[0]['normals_reverse'][end - 1:start - 1:-1]
        normals_reverse = data[0]['normals_forward'][end - 1:start - 1:-1]
    row = {'normals_forward': normals_forward,
           'normals_reverse': normals_reverse,
           }
    return row
Function 2:
Called 857 times for a total time of 13.674 seconds spent inside the function:
signal is a list of numpy arrays of equal length with dtype float, options is just random options
The goal of the function is just to add up the numpy arrays in the list into a single one, calculate the intersection of the two curves formed by the forward and reverse arrays, and return the result.
def calculate_signal(signal, options):
    profile_normals_forward = np.zeros(options.halfwinwidth * 2, dtype='f')
    profile_normals_reverse = np.zeros(options.halfwinwidth * 2, dtype='f')
    # here I tried np.sum over axis=0; it's significantly slower than the for loop approach
    for b in signal:
        profile_normals_forward += b['normals_forward']
        profile_normals_reverse += b['normals_reverse']
    count = len(signal)
    if options.normalize == 1:
        # print "Normalizing to max counts"
        profile_normals_forward /= max(profile_normals_forward)
        profile_normals_reverse /= max(profile_normals_reverse)
    elif options.normalize == 2:
        # print "Normalizing to number of elements"
        profile_normals_forward /= count
        profile_normals_reverse /= count
    intersection_signal = np.fmin(profile_normals_forward, profile_normals_reverse)
    intersection = np.sum(intersection_signal)
    results = {"intersection": intersection,
               "profile_normals_forward": profile_normals_forward,
               "profile_normals_reverse": profile_normals_reverse,
               }
    return results
As you can see the two are very simple, but account for > 60% of my execution time on a script that can run for hours / days (genetic algorithm optimization), so even minor improvements are welcome :)
One simple thing I would do to increase the speed of the first function is to use a different notation for accessing the array indices, as detailed here.
For example:
foo = numpyArray[1][0]
bar = numpyArray[1,0]
The second line will execute much faster because you don't have to return the entire element at numpyArray[1] and then find the first element of that. Try it out
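A quick way to check the difference on your own machine (a small, made-up benchmark):
import timeit
import numpy as np

arr = np.random.rand(1000, 1000)
print(timeit.timeit(lambda: arr[1][0], number=100000))  # chained indexing
print(timeit.timeit(lambda: arr[1, 0], number=100000))  # single tuple index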
