Most efficient way to implement numpy.in1d for muliple arrays

Most efficient way to implement numpy.in1d for muliple arrays - python

What is the best way to implement a function which takes an arbitrary number of 1d arrays and returns a tuple containing the indices of the matching values (if any).
Here is some pseudo-code of what I want to do:
a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
(ind_a, ind_b, ind_c) = return_equals(a, b, c)
# ind_a = [2, 4]
# ind_b = [1, 3]
# ind_c = [0, 1]
(ind_a, ind_b, ind_c) = return_equals(a, b, c, sorted_by=a)
# ind_a = [2, 4]
# ind_b = [3, 1]
# ind_c = [0, 1]
def return_equals(*args, sorted_by=None):
...

You can use numpy.intersect1d with reduce for this:
def return_equals(*arrays):
matched = reduce(np.intersect1d, arrays)
return np.array([np.where(np.in1d(array, matched))[0] for array in arrays])
reduce may be little slow here because we are creating intermediate NumPy arrays here(for large number of input it may be very slow), we can prevent this if we use Python's set and its .intersection() method:
matched = np.array(list(set(arrays[0]).intersection(*arrays[1:])))
Related GitHub ticket: n-array versions of set operations, especially intersect1d

This solution basically concatenates all input 1D arrays into one big 1D array with the intention of performing the required operations in a vectorized manner. The only place where it uses loop is at the start where it gets the lengths of the input arrays, which must be minimal on runtime costs.
Here's the function implementation -
import numpy as np
def return_equals(*argv):
# Concatenate input arrays into one big array for vectorized processing
A = np.concatenate((argv[:]))
# lengths of input arrays
narr = len(argv)
lens = np.zeros((1,narr),int).ravel()
for i in range(narr):
lens[i] = len(argv[i])
N = A.size
# Start indices of each group of identical elements from different input arrays
# in a sorted version of the huge concatenated input array
start_idx = np.where(np.append([True],np.diff(np.sort(A))!=0))[0]
# Runlengths of islands of identical elements
runlens = np.diff(np.append(start_idx,N))
# Starting and all indices of the positions in concatenate array that has
# islands of identical elements which are present across all input arrays
good_start_idx = start_idx[runlens==narr]
good_all_idx = good_start_idx[:,None] + np.arange(narr)
# Get offsetted indices and sort them to get the desired output
idx = np.argsort(A)[good_all_idx] - np.append([0],lens[:-1].cumsum())
return np.sort(idx.T,1)

In Python:
def return_equal(*args):
rtr=[]
for i, arr in enumerate(args):
rtr.append([j for j, e in enumerate(arr) if
all(e in a for a in args[0:i]) and
all(e in a for a in args[i+1:])])
return rtr
>>> return_equal(a,b,c)
[[2, 4], [1, 3], [0, 1]]

For start, I'd try:
def return_equals(*args):
x=[]
c=args[-1]
for a in args:
x.append(np.nonzero(np.in1d(a,c))[0])
return x
If I add a d=np.array([1,0,4,3,0]) (it has only 1 match; what if there are no matches?)
then
return_equals(a,b,d,c)
produces:
[array([2, 4], dtype=int32),
array([1, 3], dtype=int32),
array([2], dtype=int32),
array([0, 1], dtype=int32)]
Since the length of both input and returned arrays can differ, you really can't vectorize the problem. That is, it takes some special gymnastics to perform the operation across all inputs at once. And if the number of arrays is small compared to their typical length, I wouldn't worry about speed. Iterating a few times is not expensive. It's iterating over a 100 values that's expensive.
You could, of course, pass the keyword arguments on to in1d.
It's not clear what you are trying to do with the sorted_by parameter. Is that something that you could just as easily apply to the arrays before you pass them to this function?
List comprehension version of this iteration:
[np.nonzero(np.in1d(x,c))[0] for x in [a,b,d,c]]
I can imagine concatenating the arrays into one longer one, applying in1d, and then splitting it up into subarrays. There is a np.split, but it requires that you tell it how many elements to put in each sublist. That means, somehow, determining how many matches there are for each argument. Doing that without looping could be tricky.
The pieces for this (that still need to be packed as function) are:
args=[a,b,d,c]
lens=[len(x) for x in args]
abc=np.concatenate(args)
C=np.cumsum(lens)
I=np.nonzero(np.in1d(abc,c))[0]
S=np.split(I,(2,4,5))
[S[0],S[1]-C[0],S[2]-C[1],S[3]-C[2]]
I
# array([ 2, 4, 6, 8, 12, 15, 16], dtype=int32)
C
# array([ 5, 10, 15, 17], dtype=int32)
The (2,4,5) are the number of elements of I between successive values of C, i.e. the number of elements that match for each of a,b,...

Related

Stable conversion of a multi-column (2D) numpy array to an indicator vector

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
[2, 20, 1],
[1, 10, 3],
[2, 20, 2],
[2, 20, 1],
[1, 20, 3],
[2, 20, 2],
])
The output I like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items in a manner that they start from 0 but I feel that there is a simple numpy trick that I can use and therefore avoid adding sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
if item not in dct:
dct[item] = len(dct)
indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted in stackoverflow and correctly returns an stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) to represent the values in an stable order starting from 0.
3. Pandas's get_dummies: function. But it returns a "hot encoding" (matrix of indicator values). In contrast, I would like to have an indicator vector. It is indeed possible to convert the "hot encoding" to the indicator vector by few lines of code and data manipulation. But again that approach is not going to be highly efficient.

In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is its own inverse to fix the order. Note that idx.argsort() places unq into sorted order. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
Of course there's nothing special about these operations to 2D.
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.

Divide numpy array into multiple arrays using indices array (Python)

I have an array:
a = [1, 3, 5, 7, 29 ... 5030, 6000]
This array gets created from a previous process, and the length of the array could be different (it is depending on user input).
I also have an array:
b = [3, 15, 67, 78, 138]
(Which could also be completely different)
I want to use the array b to slice the array a into multiple arrays.
More specifically, I want the result arrays to be:
array1 = a[:3]
array2 = a[3:15]
...
arrayn = a[138:]
Where n = len(b).
My first thought was to create a 2D array slices with dimension (len(b), something). However we don't know this something beforehand so I assigned it the value len(a) as that is the maximum amount of numbers that it could contain.
I have this code:
slices = np.zeros((len(b), len(a)))
for i in range(1, len(b)):
slices[i] = a[b[i-1]:b[i]]
But I get this error:
ValueError: could not broadcast input array from shape (518) into shape (2253412)

You can use numpy.split:
np.split(a, b)
Example:
np.split(np.arange(10), [3,5])
# [array([0, 1, 2]), array([3, 4]), array([5, 6, 7, 8, 9])]

b.insert(0,0)
result = []
for i in range(1,len(b)):
sub_list = a[b[i-1]:b[i]]
result.append(sub_list)
result.append(a[b[-1]:])

You are getting the error because you are attempting to create a ragged array. This is not allowed in numpy.
An improvement on #Bohdan's answer:
from itertools import zip_longest
result = [a[start:end] for start, end in zip_longest(np.r_[0, b], b)]
The trick here is that zip_longest makes the final slice go from b[-1] to None, which is equivalent to a[b[-1]:], removing the need for special processing of the last element.
Please do not select this. This is just a thing I added for fun. The "correct" answer is #Psidom's answer.

Selecting unique random values from the third column of a an array in python

I have a 41000x3 numpy array that I call "sortedlist" in the function below. The third column has a bunch of values, some of which are duplicates, others which are not. I'd like to take a sample of unique values (no duplicates) from the third column, which is sortedlist[:,2]. I think I can do this easily with numpy.random.sample(sortedlist[:,2], sample_size). The problem is I'd like to return, not only those values, but all three columns where, in the last column, there are the randomly chosen values that I get from numpy.random.sample.
EDIT: By unique values I mean I want to choose random values which appear only once. So If I had an array:
array = [[0, 6, 2]
[5, 3, 9]
[3, 7, 1]
[5, 3, 2]
[3, 1, 1]
[5, 2, 8]]
And I wanted to choose 4 values of the third column, I want to get something like new_array_1 out:
new_array_1 = [[5, 3, 9]
[3, 7, 1]
[5, 3, 2]
[5, 2, 8]]
But I don't want something like new_array_2, where two values in the 3rd column are the same:
new_array_2 = [[5, 3, 9]
[3, 7, 1]
[5, 3, 2]
[3, 1, 1]]
I have the code to choose random values but without the criterion that they shouldn't be duplicates in the third column.
samplesize = 100
rand_sortedlist = sortedlist[np.random.randint(len(sortedlist), size = sample_size),:]]
I'm trying to enforce this criterion by doing something like this
array_index = where( array[:,2] == sample(SelectionWeight, sample_size) )
But I'm not sure if I'm on the right track. Any help would be greatly appreciated!

I can't think of a clever numpythonic way to do this that doesn't involve multiple passes over the data. (Sometimes numpy is so much faster than pure Python that's still the fastest way to go, but it never feels right.)
In pure Python, I'd do something like
def draw_unique(vec, n):
# group indices by value
d = {}
for i, x in enumerate(vec):
d.setdefault(x, []).append(i)
drawn = [random.choice(d[k]) for k in random.sample(d, n)]
return drawn
which would give
>>> a = np.random.randint(0, 10, (41000, 3))
>>> drawn = draw_unique(a[:,2], 3)
>>> drawn
[4219, 6745, 25670]
>>> a[drawn]
array([[5, 6, 0],
[8, 8, 1],
[5, 8, 3]])
I can think of some tricks with np.bincount and scipy.stats.rankdata but they hurt my head, and there always winds up being one step at the end I can't see how to vectorize.. and if I'm not vectorizing the whole thing I might as well use the above which at least is simple.

I believe this will do what you want. Note that the running time will almost certainly be dominated by whatever method you use to generate your random numbers. (An exception is if the dataset is gigantic but you only need a small number of rows, in which case very few random numbers need to be drawn.) So I'm not sure this will run much faster than a pure python method would.
# arrayify your list of lists
# please don't use `array` as a variable name!
a = np.asarray(arry)
# sort the list ... always the first step for efficiency
a2 = a[np.argsort(a[:, 2])]
# identify rows that are duplicates (3rd column is non-increasing)
# Note this has length one less than a2
duplicate_rows = np.diff(a2[:, 2]) == 0)
# if duplicate_rows[N], then we want to remove row N and N+1
keep_mask = np.ones(length(a2), dtype=np.bool) # all True
keep_mask[duplicate_rows] = 0 # remove row N
keep_mask[1:][duplicate_rows] = 0 # remove row N + 1
# now actually slice the array
a3 = a2[keep_mask]
# select rows from a3 using your preferred random number generator
# I actually prefer `random` over numpy.random for sampling w/o replacement
import random
result = a3[random.sample(xrange(len(a3)), DESIRED_NUMBER_OF_ROWS)]

Indexing NumPy 2D array with another 2D array

I have something like
m = array([[1, 2],
[4, 5],
[7, 8],
[6, 2]])
and
select = array([0,1,0,0])
My target is
result = array([1, 5, 7, 6])
I tried _ix as I read at Simplfy row AND column extraction, numpy, but this did not result in what I wanted.
p.s. Please change the title of this question if you can think of a more precise one.

The numpy way to do this is by using np.choose or fancy indexing/take (see below):
m = array([[1, 2],
[4, 5],
[7, 8],
[6, 2]])
select = array([0,1,0,0])
result = np.choose(select, m.T)
So there is no need for python loops, or anything, with all the speed advantages numpy gives you. m.T is just needed because choose is really more a choise between the two arrays np.choose(select, (m[:,0], m[:1])), but its straight forward to use it like this.
Using fancy indexing:
result = m[np.arange(len(select)), select]
And if speed is very important np.take, which works on a 1D view (its quite a bit faster for some reason, but maybe not for these tiny arrays):
result = m.take(select+np.arange(0, len(select) * m.shape[1], m.shape[1]))

I prefer to use NP.where for indexing tasks of this sort (rather than NP.ix_)
What is not mentioned in the OP is whether the result is selected by location (row/col in the source array) or by some condition (e.g., m >= 5). In any event, the code snippet below covers both scenarios.
Three steps:
create the condition array;
generate an index array by calling NP.where, passing in this
condition array; and
apply this index array against the source array
>>> import numpy as NP
>>> cnd = (m==1) | (m==5) | (m==7) | (m==6)
>>> cnd
matrix([[ True, False],
[False, True],
[ True, False],
[ True, False]], dtype=bool)
>>> # generate the index array/matrix
>>> # by calling NP.where, passing in the condition (cnd)
>>> ndx = NP.where(cnd)
>>> ndx
(matrix([[0, 1, 2, 3]]), matrix([[0, 1, 0, 0]]))
>>> # now apply it against the source array
>>> m[ndx]
matrix([[1, 5, 7, 6]])
The argument passed to NP.where, cnd, is a boolean array, which in this case, is the result from a single expression comprised of compound conditional expressions (first line above)
If constructing such a value filter doesn't apply to your particular use case, that's fine, you just need to generate the actual boolean matrix (the value of cnd) some other way (or create it directly).

What about using python?
result = array([subarray[index] for subarray, index in zip(m, select)])

IMHO, this is simplest variant:
m[np.arange(4), select]

Since the title is referring to indexing a 2D array with another 2D array, the actual general numpy solution can be found here.
In short:
A 2D array of indices of shape (n,m) with arbitrary large dimension m, named inds, is used to access elements of another 2D array of shape (n,k), named B:
# array of index offsets to be added to each row of inds
offset = np.arange(0, inds.size, inds.shape[1])
# numpy.take(B, C) "flattens" arrays B and C and selects elements from B based on indices in C
Result = np.take(B, offset[:,np.newaxis]+inds)
Another solution, which doesn't use np.take and I find more intuitive, is the following:
B[np.expand_dims(np.arange(B.shape[0]), -1), inds]
The advantage of this syntax is that it can be used both for reading elements from B based on inds (like np.take), as well as for assignment.

result = array([m[j][0] if i==0 else m[j][1] for i,j in zip(select, range(0, len(m)))])

finding indexes of common subarrays

I have 2 large, unsorted arrays (structured set of xyz coordinates) and I'm trying to find the positions of all identical subarrays (common points consisting of 3 coordinates). Example:
a = array([[0, 1, 2], [3, 4, 5]])
b = array([[3, 4, 5], [6, 7, 8]])
Here the correct subarray would be [3, 4, 5], but more than one identical subarrays are possible. The correct indexes would be [0,1] for a and [1,0] for b.
I already implemented a pure python method by iterating over all points of one array and comparing them to every point of the other array, but this is extremely slow.
My question is, is there an efficient way to find the indexes for both arrays (preferably in numpy, because I need the arrays for further calculations)? Perhaps a rolling_window approach?

A general solution for Python iterables (not specific to numpy or arrays) that works in linear average time (O(n+m), n is the number of subarrays and m is the number of unique subarrays):
a = [[0, 1, 2], [3, 4, 5]]
b = [[3, 4, 5], [6, 7, 8]]
from collections import defaultdict
indexmap = defaultdict(list)
for row, sublist in enumerate((a, b)):
for column, item in enumerate(sublist):
indexmap[tuple(item)].append((row, column))
repeats = dict((key, value) for key, value in indexmap.iteritems() if len(value) > 1)
Gives
{(3, 4, 5): [(0, 1), (1, 0)]}
If you don't need the double-row-indexes (index in the list and in the stored index) you can simplify the loop to
for row in (a, b):
for column, item in enumerate(sublist):
indexmap[tuple(item)].append(column)
as a will be processed before b, any duplicates will get numbered by row automatically:
{(3, 4, 5): [1, 0]}
With repeats[key][rownum] returning the column index for that row.

I did a little further experimenting and found a numpy specific way to solve this:
import numpy as np
a = np.arange(24).reshape(2,4,3)
b = np.arange(24, 36).reshape(2,2,3)
Array b receives 2 entries from a:
b[1,0] = a[0,1]
b[0,1] = a[1,1]
Finding common entries:
c = np.in1d(a, b).reshape(a.shape)
d = np.in1d(b, a).reshape(b.shape)
Checking where common entries exist in all 3 coordinates:
indexesC = np.where(c[:,:,0] & c[:,:,1] & c[:,:,2])
indexesD = np.where(d[:,:,0] & d[:,:,1] & d[:,:,2])

Can you map each sub-array to its position index in a hash table? So basically, you change your data structure. After that in linear time O(n), where n is the size of the biggest hash hash table, in one loop you can O(1) query each hash table and find out if you have same sub-array present in two or more hash tables.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Most efficient way to implement numpy.in1d for muliple arrays - python

In Python: def return_equal(*args): rtr=[] for i, arr in enumerate(args): rtr.append([j for j, e in enumerate(arr) if all(e in a for a in args[0:i]) and all(e in a for a in args[i+1:])]) return rtr >>> return_equal(a,b,c) [[2, 4], [1, 3], [0, 1]]

Related

Stable conversion of a multi-column (2D) numpy array to an indicator vector

Divide numpy array into multiple arrays using indices array (Python)

Selecting unique random values from the third column of a an array in python

Indexing NumPy 2D array with another 2D array

finding indexes of common subarrays

Categories

Resources