remove items with low frequency - python

Let's consider the array of length n:
y=np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1])
and the matrix X of size n x m.
I want to remove items of y and rows of X, for which the corresponding value of y has low frequency.
I figured out this would give me the values of y which should be removed:
>>> items, count = np.unique(y, return_counts=True)
>>> to_remove = items[count < 3] # array([4])
and this would remove the items:
>>> X=X[y != to_remove,:]
>>> y=y[y != to_remove]
array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1])
While the code above works when there is only one label to remove, it fails when there are multiple values of y with low frequency (i.e. y=np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1,5,5,1,1]) would cause to_remove to be array([4, 5])):
>>> y[y != to_remove,:]
Traceback (most recent call last):
File "<input>", line 1, in <module>
IndexError: too many indices for array
How to fix this in a concise way?

You can use an additional output parameter return_inverse in np.unique like so -
def unique_where(y):
    _, idx, count = np.unique(y, return_inverse=True, return_counts=True)
    return y[np.in1d(idx, np.where(count >= 3)[0])]

def unique_arange(y):
    _, idx, count = np.unique(y, return_inverse=True, return_counts=True)
    return y[np.in1d(idx, np.arange(count.size)[count >= 3])]
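For reference, the inverse index also lets you look up each element's own count directly, which gives a slightly shorter variant (a sketch; the function name is just for illustration, threshold of 3 as above):
import numpy as np

def unique_count_lookup(y):
    # idx maps each element of y to its unique value, so count[idx] is each element's frequency
    _, idx, count = np.unique(y, return_inverse=True, return_counts=True)
    return y[count[idx] >= 3]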
You can also use np.bincount, which is typically quite efficient at counting and might suit this better, assuming y contains non-negative integers, like so -
def bincount_where(y):
    counts = np.bincount(y)
    return y[np.in1d(y, np.where(counts >= 3)[0])]
def bincount_arange(y):
    counts = np.bincount(y)
    # counts has length y.max()+1, so index with an arange of the same size
    return y[np.in1d(y, np.arange(counts.size)[counts >= 3])]
Runtime tests -
This section times the three approaches listed above along with the approach from @Ashwini Chaudhary's solution -
In [85]: y = np.random.randint(0,100000,50000)
In [90]: def unique_items_indexed(y):  # @Ashwini Chaudhary's solution
    ...:     items, count = np.unique(y, return_counts=True)
    ...:     return y[np.in1d(y, items[count >= 3])]
    ...:
In [115]: %timeit unique_items_indexed(y)
10 loops, best of 3: 19.8 ms per loop
In [116]: %timeit unique_where(y)
10 loops, best of 3: 26.9 ms per loop
In [117]: %timeit unique_arange(y)
10 loops, best of 3: 26.5 ms per loop
In [118]: %timeit bincount_where(y)
100 loops, best of 3: 16.7 ms per loop
In [119]: %timeit bincount_arange(y)
100 loops, best of 3: 16.5 ms per loop

You're looking for numpy.in1d:
>>> y = np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1,5,5,1,1])
>>> items, count = np.unique(y, return_counts=True)
>>> to_remove = items[count < 3]
>>> y[~np.in1d(y, to_remove)]
array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1])
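Since the question also needs the matching rows of X dropped, you can compute the keep-mask once and index both arrays with it (a sketch; X here is just a random stand-in with one row per element of y):
import numpy as np

y = np.array([1,1,1,1,2,2,2,3,3,3,3,3,2,2,2,2,1,4,1,1,1,5,5,1,1])
X = np.random.rand(y.size, 3)            # stand-in for the real n x m matrix

items, count = np.unique(y, return_counts=True)
keep = ~np.in1d(y, items[count < 3])     # True for rows/elements to keep

X, y = X[keep, :], y[keep]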

If you have more than one value in to_remove the operation is ill-defined:
>>> to_remove
array([4, 5])
>>> y != to_remove
True
Use the operator in1d:
>>> ~np.in1d(y, to_remove)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, True, True], dtype=bool)


How to create a numpy array with random entries that exclude one element at each index?

I have an array val of possible values (e.g. val = [0, 1, 2, 3, 4, 5]) and a (possibly very long) array A of selected values (e.g. A = [2, 3, 1, 0, 2, 1, ... , 2, 3, 1, 0, 4]).
Now I want to create an array B of the same length as A such that A[i] is different from B[i] for each i and the entries of B are selected randomly. How can this be done efficiently using numpy?
A simple method would be drawing the difference between A and B modulo n where n is the number of possible outcomes. A[i] != B[i] means that this difference is not zero, hence we draw from 1,...,n-1:
n,N = 10,100
A = np.random.randint(0,n,N)
D = np.random.randint(1,n,N)
B = (A-D)%n
Update: while arguably elegant this solution is not the fastest. We could save some time by replacing the (slow) modulo operator with just testing for negative values and adding n to them.
In this form this solution starts looking quite similar to @Divakar's: two blocks of possible values, one needs to be shifted.
But we can do better: instead of shifting on average half the values we can instead swap them out only if A[i] == B[i]. As this is expected to happen rarely unless the list of permissible values is very short, the code runs faster:
B = np.random.randint(1,n,N)
B[B==A] = 0
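A quick sanity check that both variants really avoid A[i] == B[i] (a small sketch, reusing n and N from above):
import numpy as np

n, N = 10, 100
A = np.random.randint(0, n, N)

D = np.random.randint(1, n, N)
B1 = (A - D) % n                  # modulo variant
B2 = np.random.randint(1, n, N)   # swap-out variant: 0 is only used where a clash occurred
B2[B2 == A] = 0

assert not np.any(B1 == A)
assert not np.any(B2 == A)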
This is somewhat wasteful as it creates a temporary list for every item in A, but otherwise it fulfills your requirements:
from random import choice
val = [0, 1, 2, 3, 4, 5]
A = [2, 3, 1, 0, 2, 1, 2, 3, 1, 0, 4]
val = set(val)
B = [choice(list(val - {x})) for x in A]
print(B) # -> [4, 2, 3, 2, 5, 4, 1, 5, 5, 4, 1]
In a nutshell:
val is converted to a set from which the current item of A is removed. An item is then chosen at random from this resulting subset and appended to B.
You can also test it with:
print(all(x!=y for x, y in zip(A, B)))
which of course returns True
Finally, note that the approach above only works with hashable items. So if you might have something like val = [[1, 2], [2, 3], ..] for example you will run into problems.
Here's one vectorized way -
def randnum_excludeone(A, val):
    n = val[-1]
    idx = np.random.randint(0, n, len(A))
    idx[idx >= A] += 1
    return idx
The idea is that we generate random integers for each entry of A covering the entire length of val minus 1. Then we add 1 wherever the generated number is greater than or equal to the corresponding element of A, and keep it unchanged otherwise. This way every draw lands on a value different from the current A element, which gives our final output - idx.
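To make the offset idea concrete, here is a minimal trace for a single element (a sketch; val is assumed to be the contiguous range 0..val[-1]):
import numpy as np

val = np.array([0, 1, 2, 3, 4, 5])
a_i = 2                      # current element of A
n = val[-1]                  # 5

draws = np.arange(n)         # every possible raw draw: 0..4
draws[draws >= a_i] += 1     # draws >= 2 get shifted up by one
print(draws)                 # [0 1 3 4 5] -> every value of val except 2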
Let's verify the random-ness and make sure it's uniform across non-A elements -
In [42]: A
Out[42]: array([2, 3, 1, 0, 2, 1, 2, 3, 1, 0, 4])
In [43]: val
Out[43]: array([0, 1, 2, 3, 4, 5])
In [44]: c = np.array([randnum_excludeone(A, val) for _ in range(10000)])
In [45]: [np.bincount(i) for i in c.T]
Out[45]:
[array([2013, 2018, 0, 2056, 1933, 1980]),
array([2018, 1985, 2066, 0, 1922, 2009]),
array([2032, 0, 1966, 1975, 2040, 1987]),
array([ 0, 2076, 1986, 1931, 2013, 1994]),
array([2029, 1943, 0, 1960, 2100, 1968]),
array([2028, 0, 2048, 2031, 1929, 1964]),
array([2046, 2065, 0, 1990, 1940, 1959]),
array([2040, 2003, 1935, 0, 2045, 1977]),
array([2008, 0, 2011, 2030, 1937, 2014]),
array([ 0, 2000, 2015, 1983, 2023, 1979]),
array([2075, 1995, 1987, 1948, 0, 1995])]
Benchmarking on large arrays
Other vectorized approach(es) :
# @Paul Panzer's solution
def pp(A, val):
    n, N = val[-1] + 1, len(A)
    D = np.random.randint(1, n, N)
    B = (A - D) % n
    return B
Timing results -
In [66]: np.random.seed(0)
...: A = np.random.randint(0,6,100000)
In [67]: %timeit pp(A,val)
100 loops, best of 3: 3.11 ms per loop
In [68]: %timeit randnum_excludeone(A, val)
100 loops, best of 3: 2.53 ms per loop
In [69]: np.random.seed(0)
...: A = np.random.randint(0,6,1000000)
In [70]: %timeit pp(A,val)
10 loops, best of 3: 39.9 ms per loop
In [71]: %timeit randnum_excludeone(A, val)
10 loops, best of 3: 25.9 ms per loop
Extending the range of val to 10 -
In [60]: np.random.seed(0)
...: A = np.random.randint(0,10,1000000)
In [61]: %timeit pp(A,val)
10 loops, best of 3: 31.2 ms per loop
In [62]: %timeit randnum_excludeone(A, val)
10 loops, best of 3: 23.6 ms per loop
Quick and dirty, and improvements could be made, but here goes.
Your requirements can be accomplished as follows:
val = [0, 1, 2, 3, 4, 5]
A = [2, 3, 1, 0, 2, 1,4,4, 2, 3, 1, 0, 4]
val_shifted = np.roll(val,1)
dic_val = {i:val_shifted[i] for i in range(len(val_shifted))}
B = [dic_val[i] for i in A]
which gives a result that meets your requirement:
A = [2, 3, 1, 0, 2, 1, 4, 4, 2, 3, 1, 0, 4]
B = [1, 2, 0, 5, 1, 0, 3, 3, 1, 2, 0, 5, 3]
Here is another approach. B first gets a random shuffle of A. Then, all the values where A and B overlap get shuffled. In the special case where all the overlapping elements have the same value, they get swapped with random good values.
An interesting property of this approach is that it also works when A contains only a very limited set of different values. Unlike the other approaches, B is an exact shuffle of A, so it also works when A doesn't have a uniform distribution. Also, B is a completely random shuffle apart from the requirement of being different at equal indices.
import random

N = 10000
A = [random.randrange(0, 6) for _ in range(N)]
B = A.copy()
random.shuffle(B)
print(A)
print(B)
while True:
    equal_vals = {i for i, j in zip(A, B) if i == j}
    print(len(equal_vals), equal_vals)
    if len(equal_vals) == 0:  # finished, no equal values at the same positions
        break
    else:
        # create a list of indices where A and B are equal
        equal_ind = [k for k, (i, j) in enumerate(zip(A, B)) if i == j]
        random.shuffle(equal_ind)  # as the list was ordered, shuffle it to get a random order
        if len(equal_vals) == 1:  # special case, all equal indices have the same value
            special_val = equal_vals.pop()
            # find all the indices where special_val could be placed without problems
            good_ind = [k for k, (i, j) in enumerate(zip(A, B)) if i != special_val and j != special_val]
            if len(good_ind) < len(equal_ind):
                print("problem: there are too many equal values in list A")
            else:
                # swap each bad index with a random good index
                chosen_ind = random.sample(good_ind, len(equal_ind))
                for k1, k2 in zip(equal_ind, chosen_ind):
                    B[k1], B[k2] = B[k2], B[k1]  # swap
            break
        elif len(equal_vals) >= 2:
            # cyclically permute B over the list of equal indices;
            # as there are at least 2 different values, at least two indices will get a desired value
            prev = equal_ind[0]
            old_first = B[prev]
            for k in equal_ind[1:]:
                B[prev] = B[k]
                prev = k
            B[prev] = old_first
print(A)
print(B)
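A quick check of the two claims above, i.e. no position matches and B is an exact permutation of A (a sketch to run after the loop has finished):
from collections import Counter

assert all(x != y for x, y in zip(A, B))  # different at every index
assert Counter(A) == Counter(B)           # B is an exact shuffle of A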

Find consecutive sequences based on Boolean array

I'm trying to extract sequences from an array b for which a boolean array a is used as index (len(a) >= len(b), but (a==True).sum() == len(b), i.e. there are exactly as many True values in a as there are elements in b). The sequences should be represented in the result as the start and end index of each run of consecutive positions in a where a[i] is True.
For instance, for the following arrays of a and b
a = np.asarray([True, True, False, False, False, True, True, True, False])
b = [1, 2, 3, 4, 5]
the result should be [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])], so there are as many elements in the result as there are True-runs. Each True run should contain the start and end index from a and the values from b these relate to.
So for the above:
[
((0, 1), [1, 2]), # first true sequence: starting at index=0 (in a), ending at index=1, mapping to the values [1, 2] in b
((5, 7), [3, 4, 5]) # second true sequence: starting at index=5, ending at index=7, with values in b=[3, 4, 5]
]
How can this be done efficiently in numpy?
Here's one NumPy based one inspired by this post -
def func1(a, b):
    # "Enclose" mask with sentinels to catch shifts later on
    mask = np.r_[False, a, False]
    # Get the shifting indices
    idx = np.flatnonzero(mask[1:] != mask[:-1])
    s0, s1 = idx[::2], idx[1::2]
    idx_b = np.r_[0, (s1 - s0).cumsum()]
    out = []
    for (i, j, k, l) in zip(s0, s1 - 1, idx_b[:-1], idx_b[1:]):
        out.append(((i, j), b[k:l]))
    return out
Sample run -
In [104]: a
Out[104]: array([ True, True, False, False, False, True, True, True, False])
In [105]: b
Out[105]: [1, 2, 3, 4, 5]
In [106]: func1(a,b)
Out[106]: [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]
Timings -
In [156]: # Using given sample data and tiling it 1000x
...: a = np.asarray([True, True, False, False, False, True, True, True, False])
...: b = [1, 2, 3, 4, 5]
...: a = np.tile(a,1000)
...: b = np.tile(b,1000)
# @Chris's solution
In [157]: %%timeit
    ...: res = []
    ...: gen = (i for i in b)
    ...: for k, g in itertools.groupby(enumerate(a), lambda x:x[1]):
    ...:     if k:
    ...:         ind, bools = list(zip(*g))
    ...:         res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
100 loops, best of 3: 13.8 ms per loop
In [158]: %timeit func1(a,b)
1000 loops, best of 3: 1.29 ms per loop
Using itertools.groupby and itertools.islice:
import itertools

res = []
gen = (i for i in b)
for k, g in itertools.groupby(enumerate(a), lambda x: x[1]):
    if k:
        ind, bools = list(zip(*g))
        res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
Output
[((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]
Insights:
itertools.groupby returns consecutive runs of Trues and Falses as groups.
lst[0::len(lst)-1] returns the first and last elements of lst.
Since b has exactly as many elements as there are Trues in a, make b a generator and grab as many elements as there are Trues in each group.
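Two of those building blocks in isolation (a small sketch):
import itertools

a = [True, True, False, False, False, True, True, True, False]

ind = (0, 1, 2, 3)
print(ind[0::len(ind) - 1])      # (0, 3) -> first and last element only

# groupby splits enumerate(a) into runs of equal boolean values
runs = [(k, list(g)) for k, g in itertools.groupby(enumerate(a), lambda x: x[1])]
print(runs[0])                   # (True, [(0, True), (1, True)])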
Time taken:
def itertool_version():
    res = []
    gen = (i for i in b)
    for k, g in itertools.groupby(enumerate(a), lambda x: x[1]):
        if k:
            ind, bools = list(zip(*g))
            res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
    return res

%timeit itertool_version()
7.11 µs ± 313 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I don't know about a solution using numpy, but maybe the following for-loop solution will help you (or others) finding a different, more efficient solution:
import numpy as np
a = np.asarray([True, True, False, False, False, True, True, True, False])
b = []
temp_list = []
count = 0
for val in a:
    if val:
        count += 1
        if len(temp_list) == 0:  # Only add the first 'True' value in a sequence
            temp_list.append(count)
    # Code only reached if val is not True -> append to b if temp_list has an entry
    elif len(temp_list) > 0:
        temp_list.append(count)  # Add the last true value in a sequence
        b.append(temp_list)
        temp_list = []
if temp_list:  # flush a trailing True sequence in case a ends with True
    temp_list.append(count)
    b.append(temp_list)
print(b)
>>> [[1, 2], [3, 5]]
Here is my two cents. Hope it helps. [EDITED]
# Get Data
a = np.asarray([True, True, False, False, False, True, True, True, False])
b = [1, 2, 3, 4, 5]
# Write the values of b into the True positions
ac = a.astype(float)
ac[ac == 1] = b
# Select edges
ac[(np.roll(ac, 1) != 0) & (np.roll(ac, -1) != 0)] = 0  # Clear out intermediates
indices = ac[ac != 0]  # Select only edges
indices.reshape(2, int(indices.shape[0] / 2))  # group in pairs
Output
>> [[1, 2], [3, 5]]
This solution uses numpy's where():
result = []
f = np.where(a)[0]
m = 1
for j in list(create(f)):
    lo = j[1] - j[0] + 1
    result.append((j, [*range(m, m + lo)]))
    m += lo
print(result)
#OUTPUT: [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]
And here is a helper that splits an index array such as [0 1 5 6 7] into runs: [(0, 1), (5, 7)]:
def create(k):
    le = len(k)
    i = 0
    while i < le:
        left = k[i]
        # advance while the indices are consecutive
        while i < le - 1 and k[i] + 1 == k[i + 1]:
            i += 1
        right = k[i]
        if right > left:   # run of at least two consecutive indices
            yield (left, right)
        else:              # isolated index
            yield (left,)
        i += 1
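For example, applied to the index array from the question (a sketch, using the create generator above):
import numpy as np

a = np.asarray([True, True, False, False, False, True, True, True, False])
f = np.where(a)[0]
print(f)                 # [0 1 5 6 7]
print(list(create(f)))   # the two runs: (0, 1) and (5, 7)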

Create mask for numpy array based on values' set membership

I want to create a 'mask' index array for an array, based on whether the elements of that array are members of some set. What I want can be achieved as follows:
x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}
x_mask = np.array([xi in interesting_numbers for xi in x])
I'm wondering if there's a faster way to execute that last line. As it is, it builds a list in Python by repeatedly calling a __contains__ method, then converts that list to a numpy array.
I want something like x_mask = x[x in interesting_numbers] but that's not valid syntax.
You can use np.in1d:
np.in1d(x, list(interesting_numbers))
#array([False, True, False, False, False, True, False, True, False,
# False, False, False, False, False, False, False, False, True,
# True, False], dtype=bool)
Timing: this is faster if the array x is large:
x = np.arange(10000)
interesting_numbers = {1, 5, 7, 17, 18}
%timeit np.in1d(x, list(interesting_numbers))
# 10000 loops, best of 3: 41.1 µs per loop
%timeit x_mask = np.array([xi in interesting_numbers for xi in x])
# 1000 loops, best of 3: 1.44 ms per loop
Here's one approach with np.searchsorted -
def set_membership(x, interesting_numbers):
    b = np.sort(list(interesting_numbers))
    idx = np.searchsorted(b, x)
    idx[idx == b.size] = 0  # guard: values of x larger than everything in b would index out of bounds
    return b[idx] == x
Runtime test -
# Setup inputs with random numbers that are not necessarily sorted
In [353]: x = np.random.choice(100000, 10000, replace=0)
In [354]: interesting_numbers = set(np.random.choice(100000, 1000, replace=0))
In [355]: x_mask = np.array([xi in interesting_numbers for xi in x])
# Verify output with set_membership
In [356]: np.allclose(x_mask, set_membership(x, interesting_numbers))
Out[356]: True
# @Psidom's solution
In [357]: %timeit np.in1d(x, list(interesting_numbers))
1000 loops, best of 3: 1.04 ms per loop
In [358]: %timeit set_membership(x, interesting_numbers)
1000 loops, best of 3: 682 µs per loop

Efficient numpy subarrays extraction from a mask

I am searching for a pythonic way to extract multiple subarrays from a given array using a mask, as shown in the example:
a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])
The output should be a collection of arrays like the following, where each contiguous "region" of True values (True values next to each other) in the mask m selects the indices that generate one subarray.
L[0] = np.array([10, 5])
L[1] = np.array([2, 1])
Here's one approach -
def separate_regions(a, m):
    m0 = np.concatenate(([False], m, [False]))
    idx = np.flatnonzero(m0[1:] != m0[:-1])
    return [a[idx[i]:idx[i+1]] for i in range(0, len(idx), 2)]
Sample run -
In [41]: a = np.array([10, 5, 3, 2, 1])
...: m = np.array([True, True, False, True, True])
...:
In [42]: separate_regions(a, m)
Out[42]: [array([10, 5]), array([2, 1])]
Runtime test
Other approach(es) -
# @kazemakase's solution
def zip_split(a, m):
    d = np.diff(m)
    cuts = np.flatnonzero(d) + 1
    asplit = np.split(a, cuts)
    msplit = np.split(m, cuts)
    L = [aseg for aseg, mseg in zip(asplit, msplit) if np.all(mseg)]
    return L
Timings -
In [49]: a = np.random.randint(0,9,(100000))
In [50]: m = np.random.rand(100000)>0.2
# @kazemakase's solution
In [51]: %timeit zip_split(a,m)
10 loops, best of 3: 114 ms per loop
# @Daniel Forsman's solution
In [52]: %timeit splitByBool(a,m)
10 loops, best of 3: 25.1 ms per loop
# Proposed in this post
In [53]: %timeit separate_regions(a, m)
100 loops, best of 3: 5.01 ms per loop
Increasing the average length of islands -
In [58]: a = np.random.randint(0,9,(100000))
In [59]: m = np.random.rand(100000)>0.1
In [60]: %timeit zip_split(a,m)
10 loops, best of 3: 64.3 ms per loop
In [61]: %timeit splitByBool(a,m)
100 loops, best of 3: 14 ms per loop
In [62]: %timeit separate_regions(a, m)
100 loops, best of 3: 2.85 ms per loop
def splitByBool(a, m):
    if m[0]:
        return np.split(a, np.nonzero(np.diff(m))[0] + 1)[::2]
    else:
        return np.split(a, np.nonzero(np.diff(m))[0] + 1)[1::2]
This will return a list of arrays containing the chunks of a where m is True.
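For example, on the arrays from the question (a small sketch):
import numpy as np

a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])
print(splitByBool(a, m))   # [array([10,  5]), array([2, 1])]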
Sounds like a natural application for np.split.
You first have to figure out where to cut the array, which is where the mask changes between True and False. Next discard all elements where the mask is False.
a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])
d = np.diff(m)
cuts = np.flatnonzero(d) + 1
asplit = np.split(a, cuts)
msplit = np.split(m, cuts)
L = [aseg for aseg, mseg in zip(asplit, msplit) if np.all(mseg)]
print(L[0]) # [10 5]
print(L[1]) # [2 1]

Efficient way to create an array that is a sequence of variable length ranges in numpy

Suppose I have an array
import numpy as np
x=np.array([5,7,2])
I want to create an array that contains a sequence of ranges stacked together with the
length of each range given by x:
y=np.hstack([np.arange(1,n+1) for n in x])
Is there some way to do this without the speed penalty of a list comprehension or looping? (x could be a very large array)
The result should be
y == np.array([1,2,3,4,5,1,2,3,4,5,6,7,1,2])
You could use accumulation:
def my_sequences(x):
    x = x[x != 0]  # you can skip this if you do not have 0s in x.
    # Create result array, filled with ones:
    y = np.cumsum(x, dtype=np.intp)
    a = np.ones(y[-1], dtype=np.intp)
    # Set all beginnings to - previous length:
    a[y[:-1]] -= x[:-1]
    # and just add it all up (btw. np.add.accumulate is equivalent):
    return np.cumsum(a, out=a)  # here, in-place should be safe.
(One word of caution: if your result array would be larger than np.iinfo(np.intp).max, this might, with some bad luck, return wrong results instead of erroring out cleanly...)
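To see the trick at work on the question's input, here are the intermediate arrays (a sketch):
import numpy as np

x = np.array([5, 7, 2])
y = np.cumsum(x)                    # [ 5 12 14] -> where each segment ends
a = np.ones(y[-1], dtype=np.intp)   # 14 ones
a[y[:-1]] -= x[:-1]                 # a[5] -= 5, a[12] -= 7
print(np.cumsum(a))                 # [1 2 3 4 5 1 2 3 4 5 6 7 1 2]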
And because everyone always wants timings (compared to @Ophion's method):
In [11]: x = np.random.randint(0, 20, 1000000)
In [12]: %timeit ua,uind=np.unique(x,return_inverse=True);a=[np.arange(1,k+1) for k in ua];np.concatenate(np.take(a,uind))
1 loops, best of 3: 753 ms per loop
In [13]: %timeit my_sequences(x)
1 loops, best of 3: 191 ms per loop
Of course, the my_sequences function does not perform poorly when the values in x get large.
First idea: prevent multiple calls to np.arange; also, concatenate should be much faster than hstack:
import numpy as np
x=np.array([5,7,2])
>>>a=np.arange(1,x.max()+1)
>>> np.hstack([a[:k] for k in x])
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])
>>> np.concatenate([a[:k] for k in x])
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])
If there are many nonunique values this seems more efficient:
>>>ua,uind=np.unique(x,return_inverse=True)
>>>a=[np.arange(1,k+1) for k in ua]
>>>np.concatenate(np.take(a,uind))
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])
Some timings for your case:
x=np.random.randint(0,20,1000000)
Original code
#Using hstack
%timeit np.hstack([np.arange(1,n+1) for n in x])
1 loops, best of 3: 7.46 s per loop
#Using concatenate
%timeit np.concatenate([np.arange(1,n+1) for n in x])
1 loops, best of 3: 5.27 s per loop
First code:
#Using hstack
%timeit a=np.arange(1,x.max()+1);np.hstack([a[:k] for k in x])
1 loops, best of 3: 3.03 s per loop
#Using concatenate
%timeit a=np.arange(1,x.max()+1);np.concatenate([a[:k] for k in x])
10 loops, best of 3: 998 ms per loop
Second code:
%timeit ua,uind=np.unique(x,return_inverse=True);a=[np.arange(1,k+1) for k in ua];np.concatenate(np.take(a,uind))
10 loops, best of 3: 522 ms per loop
Looks like we gain a 14x speedup with the final code.
Small sanity check:
ua,uind=np.unique(x,return_inverse=True)
a=[np.arange(1,k+1) for k in ua]
out=np.concatenate(np.take(a,uind))
>>>out.shape
(9498409,)
>>>np.sum(x)
9498409
