Find span where condition is True using NumPy


Imagine I have a numpy array and I need to find the spans/ranges where a condition is True. For example, I have the following array in which I'm trying to find spans where items are greater than 1:
[0, 0, 0, 2, 2, 0, 2, 2, 2, 0]
I would need to find indices (start, stop):
(3, 5)
(6, 9)
The fastest thing I've been able to implement is making a boolean array of:
truth = data > threshold
and then looping through the array using numpy.argmin and numpy.argmax to find start and end positions.
pos = 0
truth = container[RATIO,:] > threshold
while pos < len(truth):
    start = numpy.argmax(truth[pos:]) + pos + offset
    end = numpy.argmin(truth[start:]) + start + offset
    if not truth[start]:  # nothing more
        break
    if start == end:  # goes to the end
        end = len(truth)
    pos = end
But this has been too slow for the billions of positions in my arrays and the fact that the spans I'm finding are usually just a few positions in a row. Does anyone know a faster way to find these spans?

Here's one way. First take the boolean array you have:
In [11]: a
Out[11]: array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
In [12]: a1 = a > 1
Shift it one to the right (to get the previous state at each index) using roll:
In [13]: a1_rshifted = np.roll(a1, 1)
In [14]: starts = a1 & ~a1_rshifted # it's True but the previous isn't
In [15]: ends = ~a1 & a1_rshifted
Where this is non-zero is the start of each True batch (or, respectively, end batch):
In [16]: np.nonzero(starts)[0], np.nonzero(ends)[0]
Out[16]: (array([3, 6]), array([5, 9]))
And zipping these together:
In [17]: zip(np.nonzero(starts)[0], np.nonzero(ends)[0])
Out[17]: [(3, 5), (6, 9)]
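One caveat with the np.roll version is that it wraps around, so a run of True values touching either end of the array can be reported incorrectly. A minimal sketch that pads the mask with False instead is shown here; the helper name true_spans is made up for illustration and is not part of NumPy.

```python
import numpy as np

def true_spans(mask):
    """Return an (n, 2) array of [start, stop) index pairs for runs of True."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so runs touching either edge are closed.
    padded = np.concatenate(([False], mask, [False]))
    # +1 marks a False->True step (start), -1 a True->False step (stop).
    changes = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(changes == 1)
    stops = np.flatnonzero(changes == -1)
    return np.column_stack((starts, stops))

a = np.array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
print(true_spans(a > 1).tolist())  # [[3, 5], [6, 9]]
```

Because everything is done with whole-array operations, this should avoid the Python-level loop from the question, although it has not been benchmarked on arrays with billions of elements.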

If you have access to the scipy library:
You can use scipy.ndimage.label to identify any regions of non-zero values. It returns an array where the value of each element is the id of a span or range in the original array.
You can then use scipy.ndimage.find_objects to return the slices you would need to extract those ranges. You can access the start / end values directly from those slices.
In your example:
import numpy
from scipy.ndimage import label, find_objects  # the older scipy.ndimage.measurements path is deprecated in recent SciPy
data = numpy.array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
labels, number_of_regions = label(data)
ranges = find_objects(labels)
for identified_range in ranges:
    print(identified_range[0].start, identified_range[0].stop)
You should see:
3 5
6 9
Hope this helps!

Related

Reorder numpy array to new bitlength elements without loop

If I have a numpy array with elements each representing e.g. a 9-bit integer, is there an easy way (maybe without a loop) to reorder it in a way that the resulting array elements each represent a 8-bit integer with the "lost bits" at the end of the previous element getting added at the beginning of the next element?
for example to get the following
np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100]) # initial array in binary
# convert to
np.array([0b10011100, 0b01001011, 0b00110011, 0b10011001, 0b01000000]) # resulting array
I hope it is clear what I want to achieve.
Additional info, I don't know if this makes any difference:
All of my 9-bit numbers start with the MSB being 1 (they are bigger than 255) and the last two bits are always 0, like in the example above.
The arrays I want to process are much bigger with thousands of elements.
Thanks for your help in advance!
edit:
my current (complicated) solution is the following:
import numpy as np
def get_bits(data, offset, leng):
    data = (data % (1 << (offset + leng))) >> offset
    return data
data1 = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100])
i = 1
part1 = []
part2 = []
for el in data1:
    if i == 1:
        part2.append(0)
    part1.append(get_bits(el, i, 8))
    part2.append(get_bits(el, 0, i)<<(8-i))
    if i == 8:
        i = 1
        part1.append(0)
    else:
        i += 1
if i != 1:
    part1.append(0)
res = np.array(part1) + np.array(part2)

It's been bugging me that np.packbits and np.unpackbits are inefficient, so I came up with a bit twiddling answer.
The general idea is to work it like any resampler: you make an output array, and figure out where each piece of the output comes from in the input. You have N elements of 9 bits each, so the output is:
data = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100])
result = np.empty(np.ceil(data.size * 9 / 8).astype(int), dtype=np.uint8)
Every nine output bytes have the following pattern relative to the corresponding eight input integers. I use {...} to indicate the (inclusive) bit range taken from each input integer:
result[0] = data[0]{8:1}
result[1] = data[0]{0:0} data[1]{8:2}
result[2] = data[1]{1:0} data[2]{8:3}
result[3] = data[2]{2:0} data[3]{8:4}
result[4] = data[3]{3:0} data[4]{8:5}
result[5] = data[4]{4:0} data[5]{8:6}
result[6] = data[5]{5:0} data[6]{8:7}
result[7] = data[6]{6:0} data[7]{8:8}
result[8] = data[7]{7:0}
The index of result (call it i) is really given modulo 9. The index into data is therefore offset by 8 * (i // 9). The lower portion of the byte is given by data[...] >> (i + 1). The upper portion is given by data[...] & ((1 << i) - 1), shifted left by 8 - i bits.
That makes it pretty easy to come up with a vectorized solution:
i = np.arange(result.size)
j = i[:-1]
result[i] = (data[8 * (i // 9) + (i % 9) - 1] & ((1 << i % 9) - 1)) << (8 - i % 9)
result[j] |= (data[8 * (j // 9) + (j % 9)] >> (j % 9 + 1)).astype(np.uint8)
You need to clip the index of the low portion because it may go out of bounds. You don't need to clip the high portion because -1 is a perfectly valid index, and you don't care which element it accesses. And of course numpy won't let you OR or add int elements to a uint8 array, so you have to cast.
>>> [bin(x) for x in result]
['0b10011100', '0b1001011', '0b110011', '0b10011001', '0b1000000']
This solution should be scalable to arrays of any size, and I wrote it so that you can work out different combinations of shifts, not just 9-to-8.
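When experimenting with index and shift arithmetic like this, it can help to have a slow but obvious reference to compare against. The following is only a sketch using plain Python integers; the name pack_reference is invented here for illustration.

```python
def pack_reference(values, nbits=9):
    """Concatenate nbits-wide values MSB first and cut the result into bytes."""
    big = 0
    for v in values:
        # Append the low nbits of each value to the right of the running integer.
        big = (big << nbits) | (int(v) & ((1 << nbits) - 1))
    total = nbits * len(values)
    pad = -total % 8          # zero bits needed to fill the last byte
    big <<= pad
    return list(big.to_bytes((total + pad) // 8, "big"))

data = [0b100111000, 0b100101100, 0b110011100, 0b110010100]
print(pack_reference(data))  # [156, 75, 51, 153, 64]
```

The printed list matches the result array from the question (0b10011100, 0b01001011, 0b00110011, 0b10011001, 0b01000000), so a vectorized version can be checked against it on random inputs.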

You can do it in two steps with np.unpackbits and np.packbits. First turn your array into a little-endian column vector:
>>> z = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100], dtype='<u2').reshape(-1, 1)
>>> z.view(np.uint8)
array([[ 56, 1],
[ 44, 1],
[156, 1],
[148, 1]], dtype=uint8)
You can convert this into an array of bits directly by unpacking. In fact, at some point (PR #10855) I added the count parameter to chop off the high zeros for you:
>>> np.unpackbits(z.view(np.uint8), axis=1, bitorder='l', count=9)
array([[0, 0, 0, 1, 1, 1, 0, 0, 1],
[0, 0, 1, 1, 0, 1, 0, 0, 1],
[0, 0, 1, 1, 1, 0, 0, 1, 1],
[0, 0, 1, 0, 1, 0, 0, 1, 1]], dtype=uint8)
Now you can just repack the reversed raveled array:
>>> u = np.unpackbits(z.view(np.uint8), axis=1, bitorder='l', count=9)[:, ::-1].ravel()
>>> result = np.packbits(u)
>>> result.dtype
dtype('uint8')
>>> [bin(x) for x in result]
['0b10011100', '0b1001011', '0b110011', '0b10011001', '0b1000000']
If your machine is native little-endian (e.g., most Intel architectures), you can do this in a one-liner once the data is a 16-bit column vector:
z = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100], dtype=np.uint16).reshape(-1, 1)
result = np.packbits(np.unpackbits(z.view(np.uint8), axis=1, bitorder='l', count=9)[:, ::-1].ravel())
Otherwise, you can do z.byteswap().view(np.uint8) to get the right starting order (still a one-liner though, I suppose).
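If byte order is a concern, one alternative is to extract the bits with shifts rather than with a uint8 view, since that does not depend on how the integers are laid out in memory. This is a sketch only: pack_nbit is a hypothetical helper name, and it assumes non-negative values that fit in nbits bits.

```python
import numpy as np

def pack_nbit(values, nbits):
    """Pack nbits-wide non-negative integers into bytes, MSB first."""
    values = np.asarray(values, dtype=np.int64)
    shifts = np.arange(nbits - 1, -1, -1)              # highest bit first
    # Broadcasting gives one row of nbits bits per input value.
    bits = ((values[:, None] >> shifts) & 1).astype(np.uint8)
    return np.packbits(bits.ravel())                   # zero-pads the last byte

z = [0b100111000, 0b100101100, 0b110011100, 0b110010100]
print(pack_nbit(z, 9).tolist())  # [156, 75, 51, 153, 64]
```

It builds an intermediate array of one byte per bit, so it trades memory for simplicity in the same way the unpackbits approach does.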

I think I understood most of what you want. Bitwise operations on numpy arrays are applied element-wise (between two arrays of the same shape, or between an array and a single number), so you only need to construct the appropriate shift and mask arrays, something like this:
>>> import numpy as np
>>> x = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100])
>>> goal=np.array([0b10011100, 0b01001011, 0b00110011, 0b10011001, 0b01000000])
>>> x
array([312, 300, 412, 404])
>>> goal
array([156, 75, 51, 153, 64])
>>> shift1 = np.array(range(1,1+len(x)))
>>> shift1
array([1, 2, 3, 4])
>>> mask1 = np.array([2**n -1 for n in range(1,1+len(x))])
>>> mask1
array([ 1, 3, 7, 15])
>>> res=((x>>shift1)|((x&mask1)<<(9-shift1)))&0b11111111
>>> res
array([156, 75, 51, 153], dtype=int32)
>>> goal
array([156, 75, 51, 153, 64])
>>>
I don't understand why your goal array has one extra element, but the above operation gives the other numbers, and adding one extra is not complicated, so adjust as necessary.
Now to explain ((x>>shift1)|((x&mask1)<<(9-shift1)))&0b11111111:
First, I noticed that you shift each element by a larger amount than the previous one; that part is simple:
>>> x>>shift1
array([156, 75, 51, 25], dtype=int32)
>>>
>>> list(map(bin,x>>shift1))
['0b10011100', '0b1001011', '0b110011', '0b11001']
>>>
We also want to catch the bits that would be lost by the shift; ANDing with an appropriate mask gives us those:
>>> x&mask1
array([0, 0, 4, 4], dtype=int32)
>>> list(map(bin,mask1))
['0b1', '0b11', '0b111', '0b1111']
>>> list(map(bin,x&mask1))
['0b0', '0b0', '0b100', '0b100']
>>>
then we left shift that result by the complementary amount
>>> 9-shift1
array([8, 7, 6, 5])
>>> ((x&mask1)<<(9-shift1))
array([ 0, 0, 256, 128], dtype=int32)
>>> list(map(bin,_))
['0b0', '0b0', '0b100000000', '0b10000000']
>>>
then we or both together
>>> (x>>shift1) | ((x&mask1)<<(9-shift1))
array([156, 75, 307, 153], dtype=int32)
>>> list(map(bin,_))
['0b10011100', '0b1001011', '0b100110011', '0b10011001']
>>>
and finally we AND that with 0b11111111 to keep only the 8 bits we want.
Additionally, you mention that the last 2 bits are always zero, so a simpler solution is to shift right by 2; to recover the original, just shift back in the other direction:
>>> x
array([312, 300, 412, 404])
>>> y = x>>2
>>> y
array([ 78, 75, 103, 101], dtype=int32)
>>> y<<2
array([312, 300, 412, 404], dtype=int32)
>>>

Python: searching a sequence (1,1,2) in a numpy vector (contains random 0,1,2 numbers) with for loop:

The code I have tried
rand_array2 = np.random.randint(0,3, size=1000)
rand_array2
I am looking for a way to count the (1,1,2) sequences in rand_array2 with a for loop.

A possible solution could look like:
import numpy as np
rand_array2 = np.random.randint(0,3, size=1000)
pattern = np.array((1,1,2))
matches=0
for i in range(len(rand_array2)-len(pattern)+1):
    match = rand_array2[i:i+len(pattern)]==pattern  # window as long as the pattern
    if match.all():
        matches+=1
matches

Try this approach. Here we loop through the range of the random array and match any sequences. This should work with any size of sequence as long as it's less than the length of the random array.
seq = (1, 1, 2)
arr = (1, 1, 2, 1, 3, 2, 3, 2, 1, 1, 2, 1)
count = 0
for i in range(len(arr) - len(seq) + 1):
    if seq == arr[i:i+len(seq)]:
        count += 1
Edit: was a bit late.
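The question asks for a for loop, but for comparison the same count can be done without one. The sketch below assumes NumPy 1.20 or newer, where numpy.lib.stride_tricks.sliding_window_view is available; count_pattern is just an illustrative name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def count_pattern(arr, pattern):
    """Count (possibly overlapping) occurrences of pattern in a 1-D array."""
    arr = np.asarray(arr)
    pattern = np.asarray(pattern)
    if len(pattern) == 0 or len(arr) < len(pattern):
        return 0
    # Each row of windows is one candidate slice arr[i:i+len(pattern)].
    windows = sliding_window_view(arr, len(pattern))
    return int((windows == pattern).all(axis=1).sum())

arr = np.array([1, 1, 2, 1, 3, 2, 3, 2, 1, 1, 2, 1])
print(count_pattern(arr, (1, 1, 2)))  # 2
```

The windows are views into the original array, so no copy of the data is made before the comparison.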

Finding indices of first non-zero items in a list

I have the following list :
list_test = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
I would like to find the indices of all the first numbers in the list that are not equal to zero.
In this case the output should be:
output = [3,5,10]
Is there a Pythonic way to do this?

According to the output, I think you want the first index of continuous non-zero sequences.
As for Pythonic, I take it to mean a list comprehension, although this one is not very readable.
# also works when the list starts with a non-zero element.
# list_test = [1, 0, 0, 1, 0, 2, 5, 4, 0, 0, 5, 5, 3, 0, 0]
list_test = [0, 0, 0, 1, 0, 2, 5, 4, 0, 0, 5, 5, 3, 0, 0]
output = [i for i in range(len(list_test)) if list_test[i] != 0 and (i == 0 or list_test[i - 1] == 0)]
print(output)

There is also a numpy based solution:
import numpy as np
l = np.array([0,0,0,1,0,2,5,4,0,0,5,5,3,0,0])
non_zeros = np.where(l != 0)[0]
diff = np.diff(non_zeros)
np.append(non_zeros[0], non_zeros[1 + np.where(diff >= 2)[0]])  # array([ 3, 5, 10], dtype=int64)
Explanation:
First, we find the non-zero positions, then we calculate the pairwise differences of those positions (we need to add 1 because out[i] = a[i+1] - a[i]; read more about np.diff). Finally, we take the first non-zero position together with every position where the difference was greater than 1.
Note:
It will also work when the array starts with a non-zero element or contains only non-zero elements.
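A variation on the same idea compares each element's non-zero state with that of its predecessor. The sketch below uses an invented helper name, run_starts, and it also returns an empty result for an all-zero input, where indexing non_zeros[0] would raise an IndexError.

```python
import numpy as np

def run_starts(values):
    """Indices where a run of non-zero values begins."""
    nonzero = np.asarray(values) != 0
    # State of the previous element; index 0 has no predecessor, so use False.
    prev = np.concatenate(([False], nonzero[:-1]))
    return np.flatnonzero(nonzero & ~prev)

list_test = [0, 0, 0, 1, 0, 2, 5, 4, 0, 0, 5, 5, 3, 0, 0]
print(run_starts(list_test).tolist())  # [3, 5, 10]
```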

From the Link.
l = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
v = {}
for i, x in enumerate(l):
    if x != 0 and x not in v:
        v[x] = i

list_test = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
res = {}
for index, item in enumerate(list_test):
    if item > 0:
        res.setdefault(index, None)
print(res.keys())

I don't know what you mean by a Pythonic way, but this is an answer using a simple loop:
list_test = [0,0,0,1,0,2,5,4,0,0,5,5,3,0,0]
out = []
if list_test[0] != 0:  # a run that starts at index 0
    out.append(0)
for i in range(1, len(list_test)):
    if (list_test[i-1] == 0) and (list_test[i] != 0):
        out.append(i)
Don't hesitate to clarify what you mean by "Pythonic"!

Algorithm to offset a list of data

Given a list of data as follows:
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
I would like to create an algorithm that is able to offset the list by a certain number of steps. For example, if the offset = -1:
def offsetFunc(inputList, offsetList):
    # make something
    return output
where:
output = [0,0,0,0,1,1,5,5,5,5,5,5,3,3,3,2,2]
Important Note: The elements of the list are float numbers and they are not in any progression. So I actually need to shift them, I cannot use any work-around for getting the result.
So basically, the algorithm should replace the first set of values (the four 1s) with 0 and then it should:
Detect the length of the next range of values
Create a parallel output vector with the values delayed by one set
The way I have roughly described the algorithm above is how I would do it. However, I'm a newbie to Python (and a beginner in programming in general) and I have found out over time that Python has a lot of built-in functions that could make the algorithm lighter and less iterative. Does anyone have any suggestions for a better way to write a script for this kind of job? This is the code I have written so far (assuming a static offset of -1):
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
output = []
PrevVal = 0
NextVal = input[0]
i = 0
while input[i] == NextVal:
    output.append(PrevVal)
    i += 1
while i < len(input):
    PrevVal = NextVal
    NextVal = input[i]
    while input[i] == NextVal:
        output.append(PrevVal)
        i += 1
        if i >= len(input):
            break
print(output)
Thanks in advance for any help!
BETTER DESCRIPTION
My list will always be composed of "sets" of values. They are usually float numbers, and they take values such as this short example below:
Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
In this example, the first set (the one with value "1.236") has length 4 while the second one has length 6. What I would like to get as an output, when the offset = -1, is:
The value "0.000" in the first 4 elements;
The value "1.236" in the second 6 elements.
So basically, this "offset" function is creating the list with the same "structure" (ranges of lengths) but with the values delayed by "offset" times.
I hope it's clear now, unfortunately the problem itself is still a bit silly to me (plus I don't even speak good English :) )
Please don't hesitate to ask any additional info to complete the question and make it clearer.

How about this:
def generateOutput(input, value=0, offset=-1):
    values = []
    for i in range(len(input)):
        if i < 1 or input[i] == input[i-1]:
            yield value
        else:  # value change in input detected
            values.append(input[i-1])
            if len(values) >= -offset:
                value = values.pop(0)
            yield value
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
print(list(generateOutput(input)))
It will print this:
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
And in case you just want to iterate, you do not even need to build the list. Just use for i in generateOutput(input): … then.
For other offsets, use this:
print(list(generateOutput(input, 0, -2)))
prints:
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 5, 5, 5, 3, 3]

Using a deque as the queue, with maxlen defining the shift length. It only holds unique values: pushing in new values at the end pushes old values out at the start of the queue once the shift length has been reached.
from collections import deque
def shift(it, shift=1):
    q = deque(maxlen=shift+1)
    q.append(0)
    for i in it:
        if q[-1] != i:
            q.append(i)
        yield q[0]
Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
print(list(shift(Sample)))
#[0, 0, 0, 0, 1.236, 1.236, 1.236, 1.236, 1.236, 1.236]

My try:
#Input
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
shift = -1
#Build service structures: for each 'set of data' store its length and its value
set_lengths = []
set_values = []
prev_value = None
set_length = 0
for value in input:
    if prev_value is not None and value != prev_value:
        set_lengths.append(set_length)
        set_values.append(prev_value)
        set_length = 0
    set_length += 1
    prev_value = value
else:
    set_lengths.append(set_length)
    set_values.append(prev_value)
#Output the result, shifting the values
output = []
for i, l in enumerate(set_lengths):
    j = i + shift
    if j < 0:
        output += [0] * l
    else:
        output += [set_values[j]] * l
print(input)
print(output)
gives:
[1, 1, 1, 1, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 5, 5]
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]

def x(list, offset):
    return [el + offset for el in list]

A completely different approach than my first answer is this:
import itertools
First analyze the input:
values, amounts = zip(*((n, len(list(g))) for n, g in itertools.groupby(input)))
We now have (1, 5, 3, 2, 5) and (4, 2, 6, 3, 2). Now apply the offset:
values = (0,) * (-offset) + values # nevermind that it is longer now.
And synthesize it again:
output = sum([ [v] * a for v, a in zip(values, amounts) ], [])
This is way more elegant, way less understandable and probably way more expensive than my other answer, but I didn't want to hide it from you.
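Put together as one function, the groupby idea might look like the sketch below. The name offset_runs and the fill parameter are made up for illustration, and only offsets of zero or less are handled. It uses list.extend rather than sum(..., []), since summing lists copies the accumulated result at every step.

```python
from itertools import groupby

def offset_runs(data, offset=-1, fill=0):
    """Delay run values by -offset runs (offset <= 0), keeping run lengths."""
    if offset > 0:
        raise ValueError("this sketch only handles offset <= 0")
    # Analyze: one (value, length) pair per run of equal values.
    runs = [(value, len(list(group))) for value, group in groupby(data)]
    # Shift: prepend fill values; zip() below drops the surplus at the end.
    values = [fill] * (-offset) + [value for value, _ in runs]
    out = []
    for value, (_, length) in zip(values, runs):
        out.extend([value] * length)
    return out

data = [1, 1, 1, 1, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 5, 5]
print(offset_runs(data))
# [0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
```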

Reverse indices of a sorted list

I want to return the 'reverse' indices of a sorted list. What I mean by that is: I have an unsorted list U and I sort it via S=sorted(U). Now, I can get the sort indices such that U(idx)=S - but I want S(Ridx) = U.
Here is a little example:
U=[5,2,3,1,4]
S=sorted(U)
idx = [U.index(S[i]) for i in range(len(U))]
>>> idx
[3, 1, 2, 4, 0]
Ridx = [S.index(U[i]) for i in range(len(U))]
>>> Ridx
[4, 1, 2, 0, 3]
>>>[U[idx[i]] for i in range(len(U))] == S
True
>>>[S[Ridx[i]] for i in range(len(U))] == U
True
What I need is an efficient way to get Ridx.
Thanks!
Edit:
All right! I did a little speed test for both of the solutions (@Jon Clements and @Whatang) which answered the question.
The script:
import datetime as DT
import random
U=[int(1000*random.random()) for i in xrange(pow(10,8))]
S=sorted(U)
idx = sorted(xrange(len(U)), key=U.__getitem__)
T0 = DT.datetime.now()
ridx = sorted(xrange(len(U)), key=idx.__getitem__)
print [S[ridx[i]] for i in range(len(U))]==U
elapsed = DT.datetime.now()-T0
print str(elapsed)
print '==============='
T0 = DT.datetime.now()
ridx = [ y for (x,y) in sorted(zip(idx, range(len(idx)))) ]
print [S[ridx[i]] for i in range(len(U))]==U
elapsed = DT.datetime.now()-T0
print str(elapsed)
And the results:
True
0:02:45.278000
===============
True
0:06:48.889000
Thank you all for the quick and meaningful help!

The most efficient I can think of (short of possibly looking to numpy) that gets rid of the .index and can be used for both idx and ridx:
U=[5,2,3,1,4]
idx = sorted(xrange(len(U)), key=U.__getitem__)
ridx = sorted(xrange(len(U)), key=idx.__getitem__)
# [3, 1, 2, 4, 0] [4, 1, 2, 0, 3]

Not quite the data structure you asked for, but I think this gets the info you want:
>>> sorted(x[::-1] for x in enumerate(['z', 'a', 'c', 'x', 'm']))
[('a', 1), ('c', 2), ('m', 4), ('x', 3), ('z', 0)]

With numpy you can do
>>> import numpy as np
>>> U = [5, 2, 3, 1, 4]
>>> np.array(U).argsort().argsort()
array([4, 1, 2, 0, 3])
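If the sort indices are already available, the second argsort can probably be avoided by inverting the permutation with a single indexed assignment. This is a sketch of that idea, not a benchmarked claim; kind="stable" is chosen here so that ties keep their original order.

```python
import numpy as np

U = np.array([5, 2, 3, 1, 4])
idx = np.argsort(U, kind="stable")   # U[idx] is the sorted array
ridx = np.empty_like(idx)
# Element idx[k] of U lands at sorted position k, so store k at index idx[k].
ridx[idx] = np.arange(len(idx))

print(idx.tolist())   # [3, 1, 2, 4, 0]
print(ridx.tolist())  # [4, 1, 2, 0, 3]
```

The assignment is a single linear pass, whereas a second argsort has to sort again.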

Assuming you already have the list idx, you can do
ridx = [ y for (x,y) in sorted(zip(idx, range(len(idx)))) ]
Then for all i from 0 to len(U) - 1
S[ridx[i]] == U[i]
You can avoid the sort if you use a dictionary:
ridx_dict = dict(zip(idx, range(len(idx))))
which can then be converted to a list:
ridx = [ ridx_dict[k] for k in range(len(idx)) ]
Thinking about permutations is the key to this problem. One way of writing down a permutation is to write all the indexes in order on one line, then on the line below write the new index of the element with that index. e.g., for your example
0 1 2 3 4
3 1 2 4 0
This second line is your idx list. You read down the columns, so the element which starts at index 0 moves to index 3, the element which starts at index 1 stays at index 1, and so on.
The inverse permutation is the ridx you're looking for. To find this, sort the lower line of your permutation keeping columns together, then write down the new top line. So the example becomes:
4 1 2 0 3
0 1 2 3 4
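The column-swapping description above can be written as a short loop that needs neither a sort nor a dictionary. A minimal sketch in Python 3 syntax, with invert_permutation as an illustrative name:

```python
def invert_permutation(idx):
    """Return ridx such that ridx[idx[k]] == k for every position k."""
    ridx = [0] * len(idx)
    for sorted_pos, original_pos in enumerate(idx):
        # The element at original_pos ends up at sorted_pos after sorting.
        ridx[original_pos] = sorted_pos
    return ridx

U = [5, 2, 3, 1, 4]
S = sorted(U)
idx = sorted(range(len(U)), key=U.__getitem__)
ridx = invert_permutation(idx)
print(ridx)  # [4, 1, 2, 0, 3]
```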

If I understand the question correctly (which I didn't) I think U.index(S[i]) is what you are looking for
EDIT: so I guess you could save a dictionary of the original indices and keep the retrieval syntax pretty simple
OIDX = {U[i]: i for i in range(0, len(U))}
S = sorted(U)
OIDX[S[i]]
