numpy: How to vectorize aggregation loop [duplicate]

numpy: How to vectorize aggregation loop [duplicate] - python

I have a Numpy array and a list of indices whose values I would like to increment by one. This list may contain repeated indices, and I would like the increment to scale with the number of repeats of each index. Without repeats, the command is simple:
a=np.zeros(6).astype('int')
b=[3,2,5]
a[b]+=1
With repeats, I've come up with the following method.
b=[3,2,5,2] # indices to increment by one each replicate
bbins=np.bincount(b)
b.sort() # sort b because bincount is sorted
incr=bbins[np.nonzero(bbins)] # create increment array
bu=np.unique(b) # sorted, unique indices (len(bu)=len(incr))
a[bu]+=incr
Is this the best way? Is there are risk involved with assuming that the np.bincount and np.unique operations would result in the same sorted order? Am I missing some simple Numpy operation to solve this?

In numpy >= 1.8, you can also use the at method of the addition 'universal function' ('ufunc'). As the docs note:
For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
So taking your example:
a = np.zeros(6).astype('int')
b = [3, 2, 5, 2]
…to then…
np.add.at(a, b, 1)
…will leave a as…
array([0, 0, 2, 1, 0, 1])

After you do
bbins=np.bincount(b)
why not do:
a[:len(bbins)] += bbins
(Edited for further simplification.)

If b is a small subrange of a, one can refine Alok's answer like this:
import numpy as np
a = np.zeros( 100000, int )
b = np.array( [99999, 99997, 99999] )
blo, bhi = b.min(), b.max()
bbins = np.bincount( b - blo )
a[blo:bhi+1] += bbins
print a[blo:bhi+1] # 1 0 2

Related

Loop over clump_masked indices

I have an array y_filtered that contains some masked values. I want to replace these values by some value I calculate based on their neighbouring values. I can get the indices of the masked values by using masked_slices = ma.clump_masked(y_filtered). This returns a list of slices, e.g. [slice(194, 196, None)].
I can easily get the values from my masked array, by using y_filtered[masked_slices], and even loop over them. However, I need to access the index of the values as well, so i can calculate its new value based on its neighbours. Enumerate (logically) returns 0, 1, etc. instead of the indices I need.
Here's the solution I came up with.
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
y_enum = [(i, y_i) for i, y_i in zip(range(len(y_filtered)), y_filtered)]
for sl in masked_slices:
for i, y_i in y_enum[sl]:
# simplified example calculation
y_filtered[i] = np.average(y_filtered[i-2:i+2])
It is very ugly method i.m.o. and I think there has to be a better way to do this. Any suggestions?
Thanks!

EDIT:
I figured out a better way to achieve what I think you want to do. This code picks every window of 5 elements and compute its (masked) average, then uses those values to fill the gaps in the original array. If some index does not have any unmasked value close enough it will just leave it as masked:
import numpy as np
from numpy.lib.stride_tricks import as_strided
SMOOTH_MARGIN = 2
x = np.ma.array(data=[1, 2, 3, 4, 5, 6, 8, 9, 10],
mask=[0, 1, 0, 0, 1, 1, 1, 1, 0])
print(x)
# [1 -- 3 4 -- -- -- -- 10]
pad_data = np.pad(x.data, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant')
pad_mask = np.pad(x.mask, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant',
constant_values=True)
k = 2 * SMOOTH_MARGIN + 1
isize = x.dtype.itemsize
msize = x.mask.dtype.itemsize
x_pad = np.ma.array(
data=as_strided(pad_data, (len(x), k), (isize, isize), writeable=False),
mask=as_strided(pad_mask, (len(x), k), (msize, msize), writeable=False))
x_avg = np.ma.average(x_pad, axis=1).astype(x_pad.dtype)
fill_mask = ~x_avg.mask & x.mask
result = x.copy()
result[fill_mask] = x_avg[fill_mask]
print(result)
# [1 2 3 4 3 4 10 10 10]
(note all the values are integers here because x was originally of integer type)
The original posted code has a few errors, firstly it both reads and writes values from y_filtered in the loop, so the results of later indices are affected by the previous iterations, this could be fixed with a copy of the original y_filtered. Second, [i-2:i+2] should probably be [max(i-2, 0):i+3], in order to have a symmetric window starting at zero or later always.
You could do this:
from itertools import chain
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
y_filtered[idx] = np.average(y_filtered[max(idx - 2, 0):idx + 3])

Check how many numpy array within a numpy array are equal to other numpy arrays within another numpy array of different size

My problem
Suppose I have
a = np.array([ np.array([1,2]), np.array([3,4]), np.array([5,6]), np.array([7,8]), np.array([9,10])])
b = np.array([ np.array([5,6]), np.array([1,2]), np.array([3,192])])
They are two arrays, of different sizes, containing other arrays (the inner arrays have same sizes!)
I want to count how many items of b (i.e. inner arrays) are also in a. Notice that I am not considering their position!
How can I do that?
My Try
count = 0
for bitem in b:
for aitem in a:
if aitem==bitem:
count+=1
Is there a better way? Especially in one line, maybe with some comprehension..

The numpy_indexed package contains efficient (nlogn, generally) and vectorized solutions to these types of problems:
import numpy_indexed as npi
count = len(npi.intersection(a, b))
Note that this is subtly different than your double loop, discarding duplicate entries in a and b for instance. If you want to retain duplicates in b, this would work:
count = npi.in_(b, a).sum()
Duplicate entries in a could also be handled by doing npi.count(a) and factoring in the result of that; but anyway, im just rambling on for illustration purposes since I imagine the distinction probably does not matter to you.

Here is a simple way to do it:
a = np.array([ np.array([1,2]), np.array([3,4]), np.array([5,6]), np.array([7,8]), np.array([9,10])])
b = np.array([ np.array([5,6]), np.array([1,2]), np.array([3,192])])
count = np.count_nonzero(
np.any(np.all(a[:, np.newaxis, :] == b[np.newaxis, :, :], axis=-1), axis=0))
print(count)
>>> 2

You can do what you want in one liner as follows:
count = sum([np.array_equal(x,y) for x,y in product(a,b)])
Explanation
Here's an explanation of what's happening:
Iterate through the two arrays using itertools.product which will create an iterator over the cartesian product of the two arrays.
Compare each two arrays in a tuple (x,y) coming from step 1. using np.array_equal
True is equal to 1 when using sum on a list
Full example:
The final code looks like this:
import numpy as np
from itertools import product
a = np.array([ np.array([1,2]), np.array([3,4]), np.array([5,6]), np.array([7,8]), np.array([9,10])])
b = np.array([ np.array([5,6]), np.array([1,2]), np.array([3,192])])
count = sum([np.array_equal(x,y) for x,y in product(a,b)])
# output: 2

You can convert the rows to dtype = np.void and then use np.in1d as on the resulting 1d arrays
def void_arr(a):
return np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
b[np.in1d(void_arr(b), void_arr(a))]
array([[5, 6],
[1, 2]])
If you just want the number of intersections, it's
np.in1d(void_arr(b), void_arr(a)).sum()
2
Note: if there are repeat items in b or a, then np.in1d(void_arr(b), void_arr(a)).sum() likely won't be equal to np.in1d(void_arr(a), void_arr(b)).sum(). I've reversed the order from my original answer to match your question (i.e. how many elements of b are in a?)
For more information, see the third answer here

NumPy: Sort and sum data into target indices [duplicate]

I have a Numpy array and a list of indices whose values I would like to increment by one. This list may contain repeated indices, and I would like the increment to scale with the number of repeats of each index. Without repeats, the command is simple:
a=np.zeros(6).astype('int')
b=[3,2,5]
a[b]+=1
With repeats, I've come up with the following method.
b=[3,2,5,2] # indices to increment by one each replicate
bbins=np.bincount(b)
b.sort() # sort b because bincount is sorted
incr=bbins[np.nonzero(bbins)] # create increment array
bu=np.unique(b) # sorted, unique indices (len(bu)=len(incr))
a[bu]+=incr
Is this the best way? Is there are risk involved with assuming that the np.bincount and np.unique operations would result in the same sorted order? Am I missing some simple Numpy operation to solve this?

In numpy >= 1.8, you can also use the at method of the addition 'universal function' ('ufunc'). As the docs note:
For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
So taking your example:
a = np.zeros(6).astype('int')
b = [3, 2, 5, 2]
…to then…
np.add.at(a, b, 1)
…will leave a as…
array([0, 0, 2, 1, 0, 1])

After you do
bbins=np.bincount(b)
why not do:
a[:len(bbins)] += bbins
(Edited for further simplification.)

If b is a small subrange of a, one can refine Alok's answer like this:
import numpy as np
a = np.zeros( 100000, int )
b = np.array( [99999, 99997, 99999] )
blo, bhi = b.min(), b.max()
bbins = np.bincount( b - blo )
a[blo:bhi+1] += bbins
print a[blo:bhi+1] # 1 0 2

Find whether a numpy array is a subset of a larger array in Python

I have 2 arrays, for the sake of simplicity let's say the original one is a random set of numbers:
import numpy as np
a=np.random.rand(N)
Then I sample and shuffle a subset from this array:
b=np.array() <------size<N
The shuffling I do do not store the index values, so b is an unordered subset of a
Is there an easy way to get the original indexes of b, so they are in the same order as a, say, if element 2 of b has the index 4 in a, create an array of its assignation.
I could use a for cycle checking element by element, but perhaps there is a more pythonic way
Thanks

I think the most computationally efficient thing to do is to keep track of the indices that associate b with a as b is created.
For example, instead of sampling a, sample the indices of a:
indices = random.sample(range(len(a)), k) # k < N
b = a[indices]

On the off chance a happens to be sorted you could do:
>>> from numpy import array
>>> a = array([1, 3, 4, 10, 11])
>>> b = array([11, 1, 4])
>>> a.searchsorted(b)
array([4, 0, 2])
If a is not sorted you're probably best off going with something like #unutbu's answer.

Increment Numpy array with repeated indices

I have a Numpy array and a list of indices whose values I would like to increment by one. This list may contain repeated indices, and I would like the increment to scale with the number of repeats of each index. Without repeats, the command is simple:
a=np.zeros(6).astype('int')
b=[3,2,5]
a[b]+=1
With repeats, I've come up with the following method.
b=[3,2,5,2] # indices to increment by one each replicate
bbins=np.bincount(b)
b.sort() # sort b because bincount is sorted
incr=bbins[np.nonzero(bbins)] # create increment array
bu=np.unique(b) # sorted, unique indices (len(bu)=len(incr))
a[bu]+=incr
Is this the best way? Is there are risk involved with assuming that the np.bincount and np.unique operations would result in the same sorted order? Am I missing some simple Numpy operation to solve this?

In numpy >= 1.8, you can also use the at method of the addition 'universal function' ('ufunc'). As the docs note:
For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
So taking your example:
a = np.zeros(6).astype('int')
b = [3, 2, 5, 2]
…to then…
np.add.at(a, b, 1)
…will leave a as…
array([0, 0, 2, 1, 0, 1])

After you do
bbins=np.bincount(b)
why not do:
a[:len(bbins)] += bbins
(Edited for further simplification.)

If b is a small subrange of a, one can refine Alok's answer like this:
import numpy as np
a = np.zeros( 100000, int )
b = np.array( [99999, 99997, 99999] )
blo, bhi = b.min(), b.max()
bbins = np.bincount( b - blo )
a[blo:bhi+1] += bbins
print a[blo:bhi+1] # 1 0 2

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

numpy: How to vectorize aggregation loop [duplicate] - python

After you do bbins=np.bincount(b) why not do: a[:len(bbins)] += bbins (Edited for further simplification.)

If b is a small subrange of a, one can refine Alok's answer like this: import numpy as np a = np.zeros( 100000, int ) b = np.array( [99999, 99997, 99999] ) blo, bhi = b.min(), b.max() bbins = np.bincount( b - blo ) a[blo:bhi+1] += bbins print a[blo:bhi+1] # 1 0 2

Related

Loop over clump_masked indices

Check how many numpy array within a numpy array are equal to other numpy arrays within another numpy array of different size

NumPy: Sort and sum data into target indices [duplicate]

Find whether a numpy array is a subset of a larger array in Python

Increment Numpy array with repeated indices

Categories

Resources