Count of islands of negative and positive numbers in a NumPy array - python

I have an array containing chunks of negative and chunks of positive elements. A much simplified example of it would be an array a looking like: array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
(a<0).sum() and (a>0).sum() give me the total number of negative and positive elements but how do I count these in order? By this I mean I want to know that my array contains first 3 negative elements, 6 positive and 2 negative.
This sounds like a topic that has been addressed somewhere, and there may be a duplicate out there, but I can't find one.
One method is to use numpy.roll(a, 1) in a loop over the whole array and count the number of elements of a given sign appearing in, e.g., the first element of the array as it rolls, but that looks neither very NumPy-like (or Pythonic) nor very efficient to me.

Here's one vectorized approach -
def pos_neg_counts(a):
    mask = a>0
    idx = np.flatnonzero(mask[1:] != mask[:-1])
    count = np.concatenate(( [idx[0]+1], idx[1:] - idx[:-1], [a.size-1-idx[-1]] ))
    if a[0]<0:
        return count[1::2], count[::2] # pos, neg counts
    else:
        return count[::2], count[1::2] # pos, neg counts
Sample runs -
In [155]: a
Out[155]: array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
In [156]: pos_neg_counts(a)
Out[156]: (array([6]), array([3, 2]))
In [157]: a[0] = 3
In [158]: a
Out[158]: array([ 3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
In [159]: pos_neg_counts(a)
Out[159]: (array([1, 6]), array([2, 2]))
In [160]: a[-1] = 7
In [161]: a
Out[161]: array([ 3, -2, -1, 1, 2, 3, 4, 5, 6, -5, 7])
In [162]: pos_neg_counts(a)
Out[162]: (array([1, 6, 1]), array([2, 1]))
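Note that pos_neg_counts indexes idx[0] and idx[-1], so it assumes the array contains at least one sign change. A minimal sketch of a variant that also handles single-sign and empty arrays might look like the following; the name island_lengths and its (flags, lengths) return convention are assumptions made for illustration and are not part of the answer above.

```python
import numpy as np

def island_lengths(a):
    """Return (is_positive, lengths) for each run of same-sign elements.

    Hypothetical helper; zeros are grouped with the negatives, as `a > 0` does.
    """
    a = np.asarray(a)
    if a.size == 0:
        return np.array([], dtype=bool), np.array([], dtype=int)
    mask = a > 0
    # First index of every run: 0 plus each position where the flag flips.
    starts = np.concatenate(([0], np.flatnonzero(mask[1:] != mask[:-1]) + 1))
    # Run length = distance to the next run start (or to the end of the array).
    lengths = np.diff(np.concatenate((starts, [a.size])))
    return mask[starts], lengths

is_pos, lengths = island_lengths(np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4]))
print(lengths[is_pos], lengths[~is_pos])  # positive run lengths, negative run lengths
```

Returning the sign flag of each run alongside its length avoids the a[0] check, since the caller can select positive or negative runs with a boolean mask.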
Runtime test
Other approach(es) -
# @Franz's soln
def split_app(my_array):
    negative_index = my_array<0
    splits = np.split(negative_index, np.where(np.diff(negative_index))[0]+1)
    len_list = [len(i) for i in splits]
    return len_list
Timings on bigger dataset -
In [20]: # Setup input array
...: reps = np.random.randint(3,10,(100000))
...: signs = np.ones(len(reps),dtype=int)
...: signs[::2] = -1
...: a = np.repeat(signs, reps)*np.random.randint(1,9,reps.sum())
...:
In [21]: %timeit split_app(a)
10 loops, best of 3: 90.4 ms per loop
In [22]: %timeit pos_neg_counts(a)
100 loops, best of 3: 2.21 ms per loop

Just use
my_array = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
negative_index = my_array<0
and you'll get a boolean mask marking the negative values. After that you can split this mask wherever the sign changes:
splits = np.split(negative_index, np.where(np.diff(negative_index))[0]+1)
and then compute the length of each piece:
len_list = [len(i) for i in splits]
print(len_list)
And you'll get what you are looking for:
Out[1]: [3, 6, 2]
You just have to know the sign of your first element; in this example it is negative, so the lengths alternate negative, positive, negative.
So just execute:
my_array = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
negative_index = my_array<0
splits = np.split(negative_index, np.where(np.diff(negative_index))[0]+1)
len_list = [len(i) for i in splits]
print(len_list)

My (rather simple and probably inefficient) solution would be:
import numpy as np
arr = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])
sgn = np.sign(arr[0])
res = []
cntr = 1 # counting the first one
for i in range(1, len(arr)):
    if np.sign(arr[i]) != sgn:
        # sign flipped: store the finished run and start a new one
        res.append(cntr)
        cntr = 0
        sgn *= -1
    cntr += 1
res.append(cntr)  # store the last run
print(res)
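For small arrays, the same run counting can be sketched with the standard library instead of a hand-written loop. This swaps in itertools.groupby, which is not used by any of the answers above, and it is not expected to beat the vectorized version on large inputs.

```python
from itertools import groupby

import numpy as np

arr = np.array([-3, -2, -1, 1, 2, 3, 4, 5, 6, -5, -4])

# groupby collapses consecutive equal keys, so each group is one island
# of negative (True) or non-negative (False) elements.
run_lengths = [sum(1 for _ in group) for _, group in groupby((arr < 0).tolist())]
print(run_lengths)
```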

Related

How to get the indices of at least two consecutive values that are all greater than a threshold?

For example, let's consider the following numpy array:
[1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
Also, let's suppose that the threshold is equal to 3.
That is to say that we are looking for sequences of at least two consecutive values that are all above the threshold.
The output would be the indices of those values, which in our case is:
[[3, 4, 5], [8, 9]]
If the output array was flattened that would work as well!
[3, 4, 5, 8, 9]
Output Explanation
In our initial array we can see that for index = 1 we have the value 5, which is greater than the threshold, but is not part of a sequence (of at least two values) where every value is greater than the threshold. That's why this index would not make it to our output.
On the other hand, for indices [3, 4, 5] we have a sequence of (at least two) neighboring values [5, 4, 6], each of which is above the threshold, and that's the reason their indices are included in the final output!
My Code so far
I have approached the issue with something like this:
(arr > 3).nonzero()
The above command gathers the indices of all the items that are above the threshold. However, I cannot determine whether they are consecutive or not. I have thought of trying a diff on the outcome of the above snippet and then maybe locating the ones (which mean that two indices directly follow each other). That would give us:
np.diff((arr > 3).nonzero())
But I'd still be missing something here.
If you convolve a boolean array with a window of ones of size win_size ([1] * win_size), the result equals win_size exactly at the positions where the condition held for win_size consecutive items:
import numpy as np
def groups(arr, *, threshold, win_size, merge_contiguous=False, flat=False):
    # strict '>' to match the question ("greater than the threshold")
    conv = np.convolve((arr > threshold).astype(int), [1] * win_size, mode="valid")
    indexes_start = np.where(conv == win_size)[0]
    indexes = [np.arange(index, index + win_size) for index in indexes_start]
    if flat or merge_contiguous:
        indexes = np.unique(indexes)
        if merge_contiguous:
            indexes = np.split(indexes, np.where(np.diff(indexes) != 1)[0] + 1)
    return indexes
arr = np.array([1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
threshold = 3
win_size = 2
print(groups(arr, threshold=threshold, win_size=win_size))
print(groups(arr, threshold=threshold, win_size=win_size, merge_contiguous=True))
print(groups(arr, threshold=threshold, win_size=win_size, flat=True))
[array([3, 4]), array([4, 5]), array([8, 9])]
[array([3, 4, 5]), array([8, 9])]
[3 4 5 8 9]
You can do what you want using simple numpy operations
import numpy as np
arr = np.array([1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
arr_padded = np.concatenate(([0], arr, [0]))
a = np.where(arr_padded > 3, 1, 0)
da = np.diff(a)
idx_start = (da == 1).nonzero()[0]
idx_stop = (da == -1).nonzero()[0]
valid = (idx_stop - idx_start >= 2).nonzero()[0]
result = [list(range(idx_start[i], idx_stop[i])) for i in valid]
print(result)
Explanation
Array a is a padded binary version of the original array, with 1s where the original elements are greater than three. da contains 1s where "islands" of 1s begin in a, and -1s where the "islands" end in a. Due to the padding, there is guaranteed to be an equal number of 1s and -1s in da. Extracting their indices, we can calculate the length of the islands. Valid index pairs are those whose respective "islands" have length >= 2. Then, it's just a matter of generating all numbers between the index bounds of the valid "islands".
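The padded-diff steps described above can be wrapped into a reusable function. This is only a sketch: the name runs_above and the min_len parameter are assumptions introduced here, generalizing the hard-coded threshold of 3 and minimum length of 2.

```python
import numpy as np

def runs_above(arr, threshold, min_len=2):
    """Index lists of runs of at least `min_len` consecutive values > threshold."""
    arr = np.asarray(arr)
    # Pad with zeros so every island has both a rising and a falling edge.
    flags = np.concatenate(([0], (arr > threshold).astype(int), [0]))
    steps = np.diff(flags)
    starts = np.flatnonzero(steps == 1)   # first index of each island
    stops = np.flatnonzero(steps == -1)   # one past the last index of each island
    keep = (stops - starts) >= min_len
    return [list(range(s, e)) for s, e in zip(starts[keep], stops[keep])]

print(runs_above([1, 5, 0, 5, 4, 6, 1, -1, 5, 10], threshold=3))
```

Because of the padding, the start and stop positions index directly into the original array, so no offset correction is needed.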
I follow your original idea. You are almost done.
I use another diff2 to pick the index of the first value in a sequence. See comments in code for details.
import numpy as np
arr = np.array([ 1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
threshold = 3
all_idx = (arr > threshold).nonzero()[0]
# array([1, 3, 4, 5, 8, 9])
result = np.empty(0)
if all_idx.size > 1:
    diff1 = np.zeros_like(all_idx)
    diff1[1:] = np.diff(all_idx)
    # array([0, 2, 1, 1, 3, 1])
    diff1[0] = diff1[1]
    # array([2, 2, 1, 1, 3, 1])
    # **Positions with a value 1 in diff1 should be kept.**
    # But we also want the position before each 1. Create another diff2
    diff2 = np.zeros_like(all_idx)
    diff2[:-1] = np.diff(diff1)
    # array([ 0, -1, 0, 2, -2, 0])
    # **Positions with a negative value in diff2 should be kept.**
    result = all_idx[(diff1==1) | (diff2<0)]
print(result)
# [3 4 5 8 9]
I'll try something different using window views, I'm not sure this works all the time so counterexamples are welcome. It has the advantage of not requiring Python loops.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as window
def consec_thresh(arr, thresh):
    win = window(np.argwhere(arr > thresh), (2, 1))
    return np.unique(win[np.diff(win, axis=2).ravel() == 1, :,:].ravel())
How does it work?
So we start with the array and gather the indices where the threshold is met:
In [180]: np.argwhere(arr > 3)
Out[180]:
array([[1],
[3],
[4],
[5],
[8],
[9]])
Then we build a sliding window that pairs up neighboring values along the column (which is the reason for the (2, 1) shape of the window).
In [181]: window(np.argwhere(arr > 3), (2, 1))
Out[181]:
array([[[[1],
[3]]],
[[[3],
[4]]],
[[[4],
[5]]],
[[[5],
[8]]],
[[[8],
[9]]]])
Now we want to take the difference inside each pair, if it's one then the indices are consecutive.
In [182]: np.diff(window(np.argwhere(arr > 3), (2, 1)), axis=2)
Out[182]:
array([[[[2]]],
[[[1]]],
[[[1]]],
[[[3]]],
[[[1]]]])
We can plug those values back in the windows we created above,
In [185]: window(np.argwhere(arr > 3), (2, 1))[np.diff(window(np.argwhere(arr > 3), (2, 1)), axis=2).ravel() == 1, :, :]
Out[185]:
array([[[[3],
[4]]],
[[[4],
[5]]],
[[[8],
[9]]]])
Then we can ravel (flatten without copy when possible), we have to get rid of the repeated indices created by windowing so I call np.unique. We ravel again and get:
array([3, 4, 5, 8, 9])
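The same pairing idea can also be sketched without window views, by comparing each qualifying index with its successor. The function name consecutive_above is an assumption for illustration. This variant only finds runs of at least two values, as in the question.

```python
import numpy as np

def consecutive_above(arr, thresh):
    """Flat, sorted indices that belong to a run of >= 2 values above thresh."""
    idx = np.flatnonzero(np.asarray(arr) > thresh)
    # adjacent[k] is True when idx[k] and idx[k + 1] are neighbors in arr.
    adjacent = np.diff(idx) == 1
    # Keep both members of every neighboring pair; union1d removes repeats.
    return np.union1d(idx[:-1][adjacent], idx[1:][adjacent])

print(consecutive_above([1, 5, 0, 5, 4, 6, 1, -1, 5, 10], 3))
```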
The iterative code below solves this in O(n):
arr = [1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
threshold = 3
sequence = 2
output = []
temp_arr = []
for i in range(len(arr)):
    if arr[i] > threshold:
        temp_arr.append(i)
    else:
        if len(temp_arr) >= sequence:
            output.append(temp_arr)
        temp_arr = []
# flush the last run, applying the same minimum-length check
if len(temp_arr) >= sequence:
    output.append(temp_arr)
temp_arr = []
print(output)
# Output
# [[3, 4, 5], [8, 9]]
I would suggest using a for loop with two indices: one starts at i=0 and the other at j=1, both stepping forward by 1.
You can then check whether the values at both indices are greater than the threshold; if so,
add the indices to a list and keep moving j forward until the value at j is no longer greater than the threshold.
values = [1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
res=[]
threshold= 3
i=0
j=0
for _ in values:
    j=i+1
    lista=[]
    try:
        print(f"i: {i} j:{j}")
        # check if condition is met
        if(values[i] > threshold and values[j] > threshold):
            lista.append(i)
            # add sequence
            while values[j] > threshold:
                lista.append(j)
                print(f"j while: {j}")
                j+=1
                if(j>=len(values)):
                    break
            res.append(lista)
        i=j
        if(j>=len(values)):
            break
    except IndexError:
        # j ran past the end of the list
        print("ex")
print(res)
This works, but needs refactoring.
Let's try the following code:
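# Simple is better than complex
# Complex is better than complicated
arr = [1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
# keep the index where the value exceeds 3, otherwise the placeholder 'a'
arr_3=[i if arr[i]>3 else 'a' for i in range(len(arr))]
arr_4=''.join(str(x) for x in arr_3)
# arr_5 was not defined in the original post; splitting on the placeholder
# is the apparent intent (this only works while every index is a single digit)
arr_5=arr_4.split('a')
i=0
while i<len(arr_5):
    if len(arr_5[i]) <=1:
        del arr_5[i]
    else:
        i+=1
arr_6=[list(map(lambda x: int(x), list(x))) for x in arr_5]
print(arr_6)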
Outputs:
[[3, 4, 5], [8, 9]]
Here is a solution that makes use of pandas Series:
import numpy as np
import pandas as pd
arr = np.array([1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
thresh = 3
win_size = 2
s = pd.Series(arr)
# locating groups of values where there are at least (win_size) consecutive values above the threshold
groups = s.groupby(s.le(thresh).cumsum().loc[s.gt(thresh)]).transform('count').ge(win_size)
0 False
1 False
2 False
3 True
4 True
5 True
6 False
7 False
8 True
9 True
dtype: bool
We can now easily take their indices in a 1D array:
np.flatnonzero(groups)
# array([3, 4, 5, 8, 9], dtype=int64)
OR multiple lists:
[np.arange(index.start, index.stop) for index in np.ma.clump_unmasked(np.ma.masked_not_equal(groups.values, value=True))]
# [array([3, 4, 5], dtype=int64), array([8, 9], dtype=int64)]

How to create a numpy array from 2 lists

I have a count of integer frequencies that I am trying to get into an array. L1 holds the integers from 1 to 9 that actually occur in the data, and I want to use these as the array indices. L2 holds the frequency of each integer, and I want those values entered in the array.
L1 = [1,3,4,5,6,7,8,9] #no twos occurred in the data so 2 is not in L1
L2 = [6,7,1,2,8,4,2,1]
The output I want to get is: A1 = [[6,0,7],[1,2,8],[4,2,1]]
I feel like I'm missing something but this is my last attempt:
for num in L1 and count in L2:
a1[:num] = L2[:count]
Convert the lists to arrays for ease of use:
In [286]: L1 = np.array([1,3,4,5,6,7,8,9])
...: L2 = np.array([6,7,1,2,8,4,2,1])
Make a place to put values:
In [287]: res = np.zeros(10,int)
In [288]: res[L1]
Out[288]: array([0, 0, 0, 0, 0, 0, 0, 0])
In [289]: res[L1]=L2
In [290]: res
Out[290]: array([0, 6, 0, 7, 1, 2, 8, 4, 2, 1])
Oops, the values are offset by one (index 0 is unused). Subtract 1 from the indices:
In [291]: res = np.zeros(10,int)
In [292]: res[L1-1]=L2
In [293]: res
Out[293]: array([6, 0, 7, 1, 2, 8, 4, 2, 1, 0])
Correct the initial size, and reshape:
In [294]: res = np.zeros(9,int)
In [295]: res[L1-1]=L2
In [296]: res.reshape(3,3)
Out[296]:
array([[6, 0, 7],
[1, 2, 8],
[4, 2, 1]])

Assign same lexicographic rank to duplicate elements of 2d array

I'm trying to lexicographically rank array components. The below code works fine, but I'd like to assign equal ranks to equal elements.
import numpy as np
values = np.asarray([
[1, 2, 3],
[1, 1, 1],
[2, 2, 3],
[1, 2, 3],
[1, 1, 2]
])
# need to flip, because for `np.lexsort` last
# element has highest priority.
values_reversed = np.fliplr(values)
# this returns the order, i.e. the order in
# which the elements should be in a sorted
# array (not the rank by index).
order = np.lexsort(values_reversed.T)
# convert order to ranks.
n = values.shape[0]
ranks = np.empty(n, dtype=int)
# use order to assign ranks.
ranks[order] = np.arange(n)
The ranks variable contains [2, 0, 4, 3, 1], but a rank array of [2, 0, 4, 2, 1] is required because the elements [1, 2, 3] (index 0 and 3) should share the same rank. Consecutive rank numbers are OK too, so [2, 0, 3, 2, 1] is also an acceptable rank array.
Here's one approach -
# Get lexsorted indices and hence sorted values by those indices
lexsort_idx = np.lexsort(values.T[::-1])
lexsort_vals = values[lexsort_idx]
# Mask of steps where rows shift (there are no duplicates in subsequent rows)
mask = np.r_[True,(lexsort_vals[1:] != lexsort_vals[:-1]).any(1)]
# Get the stepped indices (indices shift at non duplicate rows) and
# the index values are scaled corresponding to row numbers
stepped_idx = np.maximum.accumulate(mask*np.arange(mask.size))
# Re-arrange the stepped indices based on the original order of rows
# This is basically same as the original code does in last 4 steps,
# just in a concise manner
out_idx = stepped_idx[lexsort_idx.argsort()]
Sample step-by-step intermediate outputs -
In [55]: values
Out[55]:
array([[1, 2, 3],
[1, 1, 1],
[2, 2, 3],
[1, 2, 3],
[1, 1, 2]])
In [56]: lexsort_idx
Out[56]: array([1, 4, 0, 3, 2])
In [57]: lexsort_vals
Out[57]:
array([[1, 1, 1],
[1, 1, 2],
[1, 2, 3],
[1, 2, 3],
[2, 2, 3]])
In [58]: mask
Out[58]: array([ True, True, True, False, True], dtype=bool)
In [59]: stepped_idx
Out[59]: array([0, 1, 2, 2, 4])
In [60]: lexsort_idx.argsort()
Out[60]: array([2, 0, 4, 3, 1])
In [61]: stepped_idx[lexsort_idx.argsort()]
Out[61]: array([2, 0, 4, 2, 1])
Performance boost
To speed up the computation of lexsort_idx.argsort(), we could use the following helper, which is identical to the last 4 lines of the original code -
def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    n = idx.size
    sidx = np.empty(n,dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
Thus, lexsort_idx.argsort() could be alternatively computed with argsort_unique(lexsort_idx).
Runtime test
Applying a few more optimization tricks, we would have a version like so -
def numpy_app(values):
    lexsort_idx = np.lexsort(values.T[::-1])
    lexsort_v = values[lexsort_idx]
    mask = np.concatenate(( [False],(lexsort_v[1:] == lexsort_v[:-1]).all(1) ))
    stepped_idx = np.arange(mask.size)
    stepped_idx[mask] = 0
    np.maximum.accumulate(stepped_idx, out=stepped_idx)
    return stepped_idx[argsort_unique(lexsort_idx)]
@Warren Weckesser's rankdata based method as a func for timings -
from scipy.stats import rankdata
def scipy_app(values):
    v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
    return rankdata(v, method='min') - 1
Timings -
In [97]: a = np.random.randint(0,9,(10000,3))
In [98]: out1 = numpy_app(a)
In [99]: out2 = scipy_app(a)
In [100]: np.allclose(out1, out2)
Out[100]: True
In [101]: %timeit scipy_app(a)
100 loops, best of 3: 5.32 ms per loop
In [102]: %timeit numpy_app(a)
100 loops, best of 3: 1.96 ms per loop
Here's a way to do it using scipy.stats.rankdata (with method='min'), by viewing the 2-d array as a 1-d structured array:
In [15]: values
Out[15]:
array([[1, 2, 3],
[1, 1, 1],
[2, 2, 3],
[1, 2, 3],
[1, 1, 2]])
In [16]: v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
In [17]: rankdata(v, method='min') - 1
Out[17]: array([2, 0, 4, 2, 1])
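If consecutive ("dense") ranks are acceptable, as the question allows, another option is np.unique with axis=0, which sorts the unique rows lexicographically and can return each row's position among them. This is a sketch of a different technique from the two answers above. The np.ravel call is a precaution, since the shape of the inverse array has differed between NumPy releases.

```python
import numpy as np

values = np.asarray([
    [1, 2, 3],
    [1, 1, 1],
    [2, 2, 3],
    [1, 2, 3],
    [1, 1, 2],
])

# Unique rows come back in lexicographic order; `inverse` maps every
# original row to the index of its unique row, i.e. its dense rank.
_, inverse = np.unique(values, axis=0, return_inverse=True)
ranks = np.ravel(inverse)
print(ranks)
```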

Generate 1D NumPy array of concatenated ranges

I want to generate the following array a:
nv = np.random.randint(3, 10+1, size=(1000000,))
a = np.concatenate([np.arange(1,i+1) for i in nv])
Thus, the output would be something like -
[1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, ...]
Does there exist any better way to do it?
Here's a vectorized approach using cumulative summation -
def ranges(nv, start = 1):
    shifts = nv.cumsum()
    id_arr = np.ones(shifts[-1], dtype=int)
    id_arr[shifts[:-1]] = -nv[:-1]+1
    id_arr[0] = start # Skip if we know the start of ranges is 1 already
    return id_arr.cumsum()
Sample runs -
In [23]: nv
Out[23]: array([3, 2, 5, 7])
In [24]: ranges(nv, start=0)
Out[24]: array([0, 1, 2, 0, 1, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6])
In [25]: ranges(nv, start=1)
Out[25]: array([1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7])
Runtime test -
In [62]: nv = np.random.randint(3, 10+1, size=(100000,))
In [63]: %timeit your_func(nv) # @MSeifert's solution
10 loops, best of 3: 129 ms per loop
In [64]: %timeit ranges(nv)
100 loops, best of 3: 5.54 ms per loop
Instead of doing this with numpy methods you could use normal python ranges and just convert the result to an array:
from itertools import chain
import numpy as np
def your_func(nv):
    ranges = (range(1, i+1) for i in nv)
    flattened = list(chain.from_iterable(ranges))
    return np.array(flattened)
This avoids hard-to-understand NumPy slicing and constructs. To show a sample case:
>>> import random
>>> nv = [random.randint(1, 10) for _ in range(5)]
>>> print(nv)
[4, 2, 10, 5, 3]
>>> print(your_func(nv))
[ 1 2 3 4 1 2 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 1 2 3]
Why two steps?
a = np.concatenate([np.arange(0,np.random.randint(3,11)) for i in range(1000000)])
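For comparison, the concatenated ranges can also be sketched with np.repeat: subtract each block's starting offset from a global counter. The name ranges_repeat is an assumption for illustration, and no timing claim is made for it relative to the cumulative-sum version above.

```python
import numpy as np

def ranges_repeat(nv, start=1):
    """Concatenate the ranges start, start+1, ..., start+n-1 for every n in nv."""
    nv = np.asarray(nv)
    # Offset of the first element of each block within the output.
    block_starts = np.cumsum(nv) - nv
    # Global position minus the block offset gives the position inside the block.
    return np.arange(nv.sum()) - np.repeat(block_starts, nv) + start

print(ranges_repeat(np.array([3, 2, 5, 7]), start=0))
```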

Change all values exceeding threshold to the negative of itself

I have an array with a bunch of rows and three columns. The code below changes every value in a column that exceeds the threshold to 0. Is there a trick to replace each such value with its own negative instead? Let's say I have the array np.array([[1,2,3],[4,5,6],[7,8,9]]). I choose column one and get the values 1, 4, 7 (the first value of each row). If the threshold is 5, is there a way to negate every value larger than 5, so that 1, 4, 7 becomes 1, 4, -7?
import numpy as np
arr = np.array(my_array)  # np.ndarray(my_array) would treat my_array as a shape
threshold = 5
column_id = 0
replace_value = 0
arr[arr[:, column_id] > threshold, column_id] = replace_value
Try this
In [37]: arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
In [38]: arr[:, column_id] *= (arr[:, column_id] > threshold) * -2 + 1
In [39]: arr
Out[39]:
array([[ 1, 2, 3],
[ 4, 5, 6],
[-7, 8, 9]])
Edit: I recommend the version below, which may be faster.
In [48]: arr
Out[48]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [49]: col = arr[:, column_id]
In [50]: col[col > threshold] *= -1
In [51]: arr
Out[51]:
array([[ 1, 2, 3],
[ 4, 5, 6],
[-7, 8, 9]])
import numpy as np
x= list(np.arange(1,10))
b = []
for i in x:
    if i > 4:
        b.append(-i)
    else:
        b.append(i)
print(b)
e = np.array(b).reshape(3,3)
print('changed array')
print(e[:,0])
output :
[1, 2, 3, 4, -5, -6, -7, -8, -9]
changed array :
[ 1 4 -7]
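The boolean-mask answers above modify arr in place. If the original array must stay untouched, a non-mutating variant can be sketched with np.where; the function name negate_above is an assumption for illustration.

```python
import numpy as np

def negate_above(arr, threshold, column_id):
    """Return a copy of arr with values > threshold in one column negated."""
    out = np.array(arr)  # np.array copies its input by default
    col = out[:, column_id]
    # Pick -value where the condition holds, the value itself otherwise.
    out[:, column_id] = np.where(col > threshold, -col, col)
    return out

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(negate_above(arr, threshold=5, column_id=0))
```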
