Numpy: Finding count of distinct values from associations through binning - python

Prerequisite
This is a question is an extension of this post. So, some of the introduction of the problem will be similar to that post.
Problem
Let's say result is a 2D array and values is a 1D array. values holds some values associated with each element in result. The mapping of an element in values to result is stored in x_mapping and y_mapping. A position in result can be associated with different values. (x,y) pair from x_mapping and y_mapping is associated with results[-y,x]. I have to find the unique count of the values grouped by associations.
An example for better clarification.
result array:
[[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ 0., 0.]]
values array:
[ 1., 2., 1., 1., 5., 6., 7., 1.]
Note: Here result arrays and values have the same number of elements. But it might not be the case. There is no relation between the sizes at all.
x_mapping and y_mapping have mappings from 1D values to 2D result. The sizes of x_mapping, y_mapping and values will be the same.
x_mapping - [0, 1, 0, 0, 0, 0, 0, 0]
y_mapping - [0, 3, 2, 2, 0, 3, 2, 0]
Here, 1st value(values[0]), 5th value(values[4]) and 8th value(values[7]) have x as 0 and y as 0 (x_mapping[0] and y_mappping[0]) and hence associated with result[0, 0]. If we compute the count of distinct values from this group- (1,5,1), we will have 2 as result.
#WarrenWeckesser
Let's see how [1, 3] (x,y) pair from x_mapping and y_mapping contribute to results. Since there is only one value, ie 2, associated with this particular group, the results[-3,1] will have one as the number of distinct values associated with that cell is one.
Another example. Let's compute the value of results[-1,1]. From mappings, since there is no value associated with the cell, the value of results[-1,1] will be zero.
Similarly, the position [-2, 0] in results will have value 2.
Note that if there is no association at all then the default value for result will be zero.
The result after computation,
[[ 2., 0.],
[ 1., 1.],
[ 2., 0.],
[ 0., 0.]]
Current working solution
Using the answer from #Divakar, I was able to find a working solution.
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
sidx = lidx.argsort()
idx = lidx[sidx]
val = values[sidx]
m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
unq_ids = idx[m_idx]
r_res = np.zeros(m_idx.size, dtype=np.float32)
for i in range(0, m_idx.shape[0]):
_next = None
arr = None
if i == m_idx.shape[0]-1:
_next = val.shape[0]
else:
_next = m_idx[i+1]
_start = m_idx[i]
if _start >= _next:
arr = val[_start]
else:
arr = val[_start:_next]
r_res[i] = np.unique(arr).size
result.flat[unq_ids] = r_res
Question
Now, the above solution takes 15ms for operating on 19943 values.
I'm looking for a way to compute the result faster. Is there any more performant way to do this?
Side note
I'm using Numpy version 1.14.3 with Python 3.5.2
Edits
Thanks to #WarrenWeckesser, pointing out that I haven't explained how an element in results is associated with (x,y) from mappings. I have updated the post and added examples for clarity.

Here is one solution
import numpy as np
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)
# Get flat indices
idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
# Sort flat indices and reorders values accordingly
reorder = np.argsort(idx_mapping)
idx_mapping = idx_mapping[reorder]
values = values[reorder]
# Get unique values
val_uniq = np.unique(values)
# Find where each unique value appears
val_uniq_hit = values[:, np.newaxis] == val_uniq
# Find reduction indices (slices with the same flat index)
reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
# Reduce slices
reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
# Count distinct values on each slice
counts = np.count_nonzero(reduced, axis=1)
# Put counts in result
result.flat[idx_mapping[reduce_idx]] = counts
print(result)
# [[2. 0.]
# [1. 1.]
# [2. 0.]
# [0. 0.]]
This method takes more memory (O(len(values) * len(np.unique(values)))), but a small benchmark comparing with your original solution shows a significant speedup (although that depends on the actual size of the problem):
import numpy as np
np.random.seed(100)
result = np.zeros([400, 200], dtype=np.float32)
values = np.random.randint(100, size=(20000,)).astype(np.float32)
x_mapping = np.random.randint(result.shape[1], size=values.shape)
y_mapping = np.random.randint(result.shape[0], size=values.shape)
res1 = solution_orig(x_mapping, y_mapping, values, result)
res2 = solution(x_mapping, y_mapping, values, result)
print(np.allclose(res1, res2))
# True
# Original solution
%timeit solution_orig(x_mapping, y_mapping, values, result)
# 76.2 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This solution
%timeit solution(x_mapping, y_mapping, values, result)
# 13.8 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Full code of benchmark functions:
import numpy as np
def solution(x_mapping, y_mapping, values, result):
result = np.array(result)
idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
reorder = np.argsort(idx_mapping)
idx_mapping = idx_mapping[reorder]
values = values[reorder]
val_uniq = np.unique(values)
val_uniq_hit = values[:, np.newaxis] == val_uniq
reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
counts = np.count_nonzero(reduced, axis=1)
result.flat[idx_mapping[reduce_idx]] = counts
return result
def solution_orig(x_mapping, y_mapping, values, result):
result = np.array(result)
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
sidx = lidx.argsort()
idx = lidx[sidx]
val = values[sidx]
m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
unq_ids = idx[m_idx]
r_res = np.zeros(m_idx.size, dtype=np.float32)
for i in range(0, m_idx.shape[0]):
_next = None
arr = None
if i == m_idx.shape[0]-1:
_next = val.shape[0]
else:
_next = m_idx[i+1]
_start = m_idx[i]
if _start >= _next:
arr = val[_start]
else:
arr = val[_start:_next]
r_res[i] = np.unique(arr).size
result.flat[unq_ids] = r_res
return result

Related

Transfom a 2D numpy array into a same-size array containing positions at which an element is found in a 1D array

I have 2 arrays, let's call them Points and Line.
Points is a 2 dimensional array, with Points.shape = (M, 3), such as:
Points = array([p1, q1, r1],
[p2, q2, r2],
[p3, q3, r3],
...,
[pM, qM, rM])
p, q, r are integers, following no specific order, with each representing a particular point. For any specific row, all points p, q, r are different. But a particular integer can appear multiple times in Points (for example, q1 = p7 = p19 = r309 = 52106).
Line, on the other hand, is a 1D array, with Line.shape = (N, ) such that Line = array([l1, l2, l3, ..., lN]). The terms l also represent integers like Points. Usually, M is much bigger than N.
Here is the problem : all integers in Line appear at least once in Points, but most p, q, r don't appear in Line. I want to construct a new 2D array Points_index, with the same shape as Points, such as :
if the element in Points is also present in Line, return the position (index) of the element in Line (starting with 1, not 0).
if not, return 0.
To illustrate, if :
Points = array([ 50, 156, 10],
[ 5, 509, 2225],
[599, 1006, 1],
[ 1, 5, 156],
[ 50, 509, 47])
Line = array([50, 5, 156, 47])
then Points_index is :
array([1, 3, 0],
[2, 0, 0],
[0, 0, 0],
[0, 2, 3],
[1, 0, 4])
I want to this as fast as possible. I tried in1d, but it gives a True/False mask instead of the indexes. I have tried :
Points_index_123 = {}
for i in range(3):
extr_i = np.zeros_like(Points[:, i])
for k, el in enumerate(Points[:, i]):
if el in Line:
extr_i[k] = np.where(el == Line)[0]+1
else:
extr_i[k] = 0
Points_index_123[i] = extr_i
Points_index = np.stack((Points_index_123[0],
Points_index_123[1],
Points_index_123[2]), axis = 1)
But I feel that this is too slow (2 nested loops). Is there any way to do it?
Try this:
import numpy as np
idx = np.zeros(Points.shape)
i, j, k = np.nonzero(Lines[:, None, None] == Points)
idx[j, k] = i + 1
idx
# array([[1., 3., 0.],
# [2., 0., 0.],
# [0., 0., 0.],
# [0., 2., 3.],
# [1., 0., 4.]])
Uses broadcasted comparison of each element in Lines with the Points array, followed by finding out True locations by np.nonzero() and then (fancy) indexing the idx array to get their respective locations.
The +1 is added to i since you need the locations in Lines in 1-indexing.

How to efficiently filter maximum elements of a matrix per row

Given a 2D array, I'm looking for a pythonic way to get an array of same shape, with only the maximum element per each row.
See max_row_filter function below
def max_row_filter(mat2d):
m = np.zeros(mat2d.shape)
for r in range(mat2d.shape[0]):
c = np.argmax(mat2d[r])
m[r,c]=mat2d[r,c]
return m
p = np.array([[1,2,3],[5,4,3,],[9,10,3]])
max_row_filter(p)
Out: array([[ 0., 0., 3.],
[ 5., 0., 0.],
[ 0., 10., 0.]])
I'm looking for an efficient way to do this, suitable to be done on big arrays.
Alternative answer (this will keep duplicates):
p * (p==p.max(axis=1, keepdims=True))
If there are no duplicates, you could use numpy.argmax:
import numpy as np
p = np.array([[1, 2, 3],
[5, 4, 3, ],
[9, 10, 3]])
result = np.zeros_like(p)
rows, cols = zip(*enumerate(np.argmax(p, axis=1)))
result[rows, cols] = p[rows, cols]
print(result)
Output
[[ 0 0 3]
[ 5 0 0]
[ 0 10 0]]
Note that, for multiple occurrences argmax return the first occurence.

Numpy: Finding minimum and maximum values from associations through binning

Prerequisite
This is a question derived from this post. So, some of the introduction of the problem will be similar to that post.
Problem
Let's say result is a 2D array and values is a 1D array. values holds some values associated with each element in result. The mapping of an element in values to result is stored in x_mapping and y_mapping. A position in result can be associated with different values. Now, I have to find the minimum and maximum of the values grouped by associations.
An example for better clarification.
min_result array:
[[0, 0],
[0, 0],
[0, 0],
[0, 0]]
max_result array:
[[0, 0],
[0, 0],
[0, 0],
[0, 0]]
values array:
[ 1., 2., 3., 4., 5., 6., 7., 8.]
Note: Here result arrays and values have the same number of elements. But it might not be the case. There is no relation between the sizes at all.
x_mapping and y_mapping have mappings from 1D values to 2D result(both min and max). The sizes of x_mapping, y_mapping and values will be the same.
x_mapping - [0, 1, 0, 0, 0, 0, 0, 0]
y_mapping - [0, 3, 2, 2, 0, 3, 2, 1]
Here, 1st value(values[0]) and 5th value(values[4]) have x as 0 and y as 0(x_mapping[0] and y_mappping[0]) and hence associated with result[0, 0]. If we compute the minimum and maximum from this group, we will have 1 and 5 as results respectively. So, min_result[0, 0] will have 1 and max_result[0, 0] will have 5.
Note that if there is no association at all then the default value for result will be zero.
Current working solution
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])
values = np.array([ 1., 2., 3., 4., 5., 6., 7., 8.], dtype=np.float32)
max_result = np.zeros([4, 2], dtype=np.float32)
min_result = np.zeros([4, 2], dtype=np.float32)
min_result[-y_mapping, x_mapping] = values # randomly initialising from values
for i in range(values.size):
x = x_mapping[i]
y = y_mapping[i]
# maximum
if values[i] > max_result[-y, x]:
max_result[-y, x] = values[i]
# minimum
if values[i] < min_result[-y, x]:
min_result[-y, x] = values[i]
min_result,
[[1., 0.],
[6., 2.],
[3., 0.],
[8., 0.]]
max_result,
[[5., 0.],
[6., 2.],
[7., 0.],
[8., 0.]]
Failed solutions
#1
min_result = np.zeros([4, 2], dtype=np.float32)
np.minimum.reduceat(values, [-y_mapping, x_mapping], out=min_result)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-126de899a90e> in <module>()
1 min_result = np.zeros([4, 2], dtype=np.float32)
----> 2 np.minimum.reduceat(values, [-y_mapping, x_mapping], out=min_result)
ValueError: object too deep for desired array
#2
min_result = np.zeros([4, 2], dtype=np.float32)
np.minimum.reduceat(values, lidx, out= min_result)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-24-07e8c75ccaa5> in <module>()
1 min_result = np.zeros([4, 2], dtype=np.float32)
----> 2 np.minimum.reduceat(values, lidx, out= min_result)
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (4,2)->(4,) (8,)->() (8,)->(8,)
#3
lidx = ((-y_mapping) % 4) * 2 + x_mapping #from mentioned post
min_result = np.zeros([8], dtype=np.float32)
np.minimum.reduceat(values, lidx, out= min_result).reshape(4,2)
[[1., 4.],
[5., 5.],
[1., 3.],
[5., 7.]]
Question
How to use np.minimum.reduceat and np.maximum.reduceat for solving this problem? I'm looking for a solution that is optimised for runtime.
Side note
I'm using Numpy version 1.14.3 with Python 3.5.2
Approach #1
Again, the most intuitive ones would be with numpy.ufunc.at.
Now, since, these reductions would be performed against the existing values, we need to initialize the output with max values for minimum reductions and min values for maximum ones. Hence, the implementation would be -
min_result[-y_mapping, x_mapping] = values.max()
max_result[-y_mapping, x_mapping] = values.min()
np.minimum.at(min_result, [-y_mapping, x_mapping], values)
np.maximum.at(max_result, [-y_mapping, x_mapping], values)
Approach #2
To leverage np.ufunc.reduceat, we need to sort data -
m,n = max_result.shape
out_dtype = max_result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
sidx = lidx.argsort()
idx = lidx[sidx]
val = values[sidx]
m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
unq_ids = idx[m_idx]
max_result_out.flat[unq_ids] = np.maximum.reduceat(val, m_idx)
min_result_out.flat[unq_ids] = np.minimum.reduceat(val, m_idx)

Check common elements of two 2D numpy arrays, either row or column wise

Given two numpy arrays of nx3 and mx3, what is an efficient way to determine the row indices (counter) wherein the rows are common in the two arrays. For instance I have the following solution, which is significantly slow for not even much larger arrays
def arrangment(arr1,arr2):
hits = []
for i in range(arr2.shape[0]):
current_row = np.repeat(arr2[i,:][None,:],arr1.shape[0],axis=0)
x = current_row - arr1
for j in range(arr1.shape[0]):
if np.isclose(x[j,0],0.0) and np.isclose(x[j,1],0.0) and np.isclose(x[j,2],0.0):
hits.append(j)
return hits
It checks if rows of arr2 exist in arr1 and returns the row indices of arr1 where the rows match. I need this arrangement to be always sequentially ascending in terms of rows of arr2. For instance given
arr1 = np.array([[-1., -1., -1.],
[ 1., -1., -1.],
[ 1., 1., -1.],
[-1., 1., -1.],
[-1., -1., 1.],
[ 1., -1., 1.],
[ 1., 1., 1.],
[-1., 1., 1.]])
arr2 = np.array([[-1., 1., -1.],
[ 1., 1., -1.],
[ 1., 1., 1.],
[-1., 1., 1.]])
The function should return:
[3, 2, 6, 7]
quick and dirty answer
(arr1[:, None] == arr2).all(-1).argmax(0)
array([3, 2, 6, 7])
Better answer
Takes care of chance a row in arr2 doesn't match anything in arr1
t = (arr1[:, None] == arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
array([ 3., 2., 6., 7.])
As pointed out by #Divakar np.isclose accounts for rounding error in comparing floats
t = np.isclose(arr1[:, None], arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
I had a similar problem in the past and I came up with a fairly optimised solution for it.
First you need a generalisation of numpy.unique for multidimensional arrays, which for the sake of completeness I would copy it here
def unique2d(arr,consider_sort=False,return_index=False,return_inverse=False):
"""Get unique values along an axis for 2D arrays.
input:
arr:
2D array
consider_sort:
Does permutation of the values within the axis matter?
Two rows can contain the same values but with
different arrangements. If consider_sort
is True then those rows would be considered equal
return_index:
Similar to numpy unique
return_inverse:
Similar to numpy unique
returns:
2D array of unique rows
If return_index is True also returns indices
If return_inverse is True also returns the inverse array
"""
if consider_sort is True:
a = np.sort(arr,axis=1)
else:
a = arr
b = np.ascontiguousarray(a).view(np.dtype((np.void,
a.dtype.itemsize * a.shape[1])))
if return_inverse is False:
_, idx = np.unique(b, return_index=True)
else:
_, idx, inv = np.unique(b, return_index=True, return_inverse=True)
if return_index == False and return_inverse == False:
return arr[idx]
elif return_index == True and return_inverse == False:
return arr[idx], idx
elif return_index == False and return_inverse == True:
return arr[idx], inv
else:
return arr[idx], idx, inv
Now all you need is to concatenate (np.vstack) your arrays and find the unique rows. The reverse mapping together with np.searchsorted will give you the indices you need. So lets write another function similar to numpy.in2d but for multidimensional (2D) arrays
def in2d_unsorted(arr1, arr2, axis=1, consider_sort=False):
"""Find the elements in arr1 which are also in
arr2 and sort them as the appear in arr2"""
assert arr1.dtype == arr2.dtype
if axis == 0:
arr1 = np.copy(arr1.T,order='C')
arr2 = np.copy(arr2.T,order='C')
if consider_sort is True:
sorter_arr1 = np.argsort(arr1)
arr1 = arr1[np.arange(arr1.shape[0])[:,None],sorter_arr1]
sorter_arr2 = np.argsort(arr2)
arr2 = arr2[np.arange(arr2.shape[0])[:,None],sorter_arr2]
arr = np.vstack((arr1,arr2))
_, inv = unique2d(arr, return_inverse=True)
size1 = arr1.shape[0]
size2 = arr2.shape[0]
arr3 = inv[:size1]
arr4 = inv[-size2:]
# Sort the indices as they appear in arr2
sorter = np.argsort(arr3)
idx = sorter[arr3.searchsorted(arr4, sorter=sorter)]
return idx
Now all you need to do is call in2d_unsorted with your input parameters
>>> in2d_unsorted(arr1,arr2)
array([ 3, 2, 6, 7])
While may not be fully optimised this approach is much faster. Let's benchmark it against #piRSquareds solutions
def indices_piR(arr1,arr2):
t = np.isclose(arr1[:, None], arr2).all(-1)
return np.where(t.any(0), t.argmax(0), np.nan)
with the following arrays
n=150
arr1 = np.random.permutation(n).reshape(n//3, 3)
idx = np.random.permutation(n//3)
arr2 = arr1[idx]
In [13]: np.allclose(in2d_unsorted(arr1,arr2),indices_piR(arr1,arr2))
True
In [14]: %timeit indices_piR(arr1,arr2)
10000 loops, best of 3: 181 µs per loop
In [15]: %timeit in2d_unsorted(arr1,arr2)
10000 loops, best of 3: 85.7 µs per loop
Now, for n=1500
In [24]: %timeit indices_piR(arr1,arr2)
100 loops, best of 3: 10.3 ms per loop
In [25]: %timeit in2d_unsorted(arr1,arr2)
1000 loops, best of 3: 403 µs per loop
and for n=15000
In [28]: %timeit indices_piR(A,B)
1 loop, best of 3: 1.02 s per loop
In [29]: %timeit in2d_unsorted(arr1,arr2)
100 loops, best of 3: 4.65 ms per loop
So for largeish arrays this is over 200X faster compared to #piRSquared's vectorised solution.

numpy 2D array assignment with 2D value and indices arrays

My goal is to assign the values of an existing 2D array, or create a new array, using two 2D arrays of the same shape, one with values and one with indices to assign the corresponding value to.
X = np.array([range(5),range(5)])
X
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
Y= np.array([range(5), [2,3,4,1,0]])
Y
array([[0, 1, 2, 3, 4],
[2, 3, 4, 1, 0]])
My desired output is an array of the same shape as X and Y, with the values of X given in the index from the corresponding row in Y. This result can be achieved by looping through each row in the following way:
output = np.zeros(X.shape)
for i in range(X.shape[0]):
output[i][Y[i]] = X[i]
output
array([[ 0., 1., 2., 3., 4.],
[ 4., 3., 0., 1., 2.]])
Is there a more efficient way to apply this sort of assignment?
np.take(output, Y)
Will return the items in the output array I would like to assign to the values of X to, but I believe np.take does not produce a reference to the original array, and instead a new array.
for i in range(X.shape[0]):
output[i][Y[i]] = X[i]
is equivalent to
I = np.arange(X.shape[0])[:, np.newaxis]
output[I, Y] = X
For example,
X = np.array([range(5),range(5)])
Y = np.array([range(5), [2,3,4,1,0]])
output = np.zeros(X.shape)
I = np.arange(X.shape[0])[:, np.newaxis]
output[I, Y] = X
yields
>>> output
array([[ 0., 1., 2., 3., 4.],
[ 4., 3., 0., 1., 2.]])
There is not much difference in performance when the loop has few iterations.
But if X.shape[0] is large, then using indexing is much faster:
def using_loop(X, Y):
output = np.zeros(X.shape)
for i in range(X.shape[0]):
output[i][Y[i]] = X[i]
return output
def using_indexing(X, Y):
output = np.zeros(X.shape)
I = np.arange(X.shape[0])[:, np.newaxis]
output[I, Y] = X
return output
X2 = np.tile(X, (100,1))
Y2 = np.tile(Y, (100,1))
In [77]: %timeit using_loop(X2, Y2)
1000 loops, best of 3: 376 µs per loop
In [78]: %timeit using_indexing(X2, Y2)
100000 loops, best of 3: 15.2 µs per loop

Categories

Resources