Summation by class label without loop in numpy - python

I have a matrix of distances to the k nearest neighbours of a set of points,
and a matrix of the class labels of those neighbours (both are N-by-k matrices).
What is the best way, WITHOUT an explicit Python loop (I actually want to implement this in Theano, where such loops are not going to work), to build an (N-by-#classes) matrix whose (i,j) element is the sum of the distances from the i-th point to those of its k nearest neighbours that have class label 'j'?
Example:
# N = 2
# k = 5
# number of classes = 3
K_val = np.array([[1,2,3,4,6],
                  [2,4,5,5,7]])
l_val = np.array([[0,1,2,0,1],
                  [2,0,1,2,0]])
"""
result -> [[5,8,3],
[11,5,7]]
"""

You can compute this with numpy.bincount. It has a weights parameter, which lets you count the items in l_val while weighting each item by the corresponding entry of K_val.
The only snag is that each row of K_val and l_val has to be treated independently, whereas bincount works on a single flat array. So add a shift to l_val so that the labels in each row are distinct from those in every other row.
import numpy as np
num_classes = 3
K_val = np.array([[1,2,3,4,6],
                  [2,4,5,5,7]])
l_val = np.array([[0,1,2,0,1],
                  [2,0,1,2,0]])
def label_distance(l_val, K_val):
    nrows, ncols = l_val.shape
    # Offset the labels of row i by i*num_classes so every row gets its own bins
    shift = (np.arange(nrows)*num_classes)[:, np.newaxis]
    result = (np.bincount((l_val+shift).ravel(), weights=K_val.ravel(),
                          minlength=num_classes*nrows)
              .reshape(nrows, num_classes))
    return result
print(label_distance(l_val, K_val))
yields
[[ 5. 8. 3.]
[ 11. 5. 7.]]
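To see what the shift actually does, here is a small sketch that prints the intermediate arrays. It uses the same data as above; the names shifted and flat_sums are mine and only there for illustration.

```python
import numpy as np

num_classes = 3
K_val = np.array([[1, 2, 3, 4, 6],
                  [2, 4, 5, 5, 7]])
l_val = np.array([[0, 1, 2, 0, 1],
                  [2, 0, 1, 2, 0]])

nrows = l_val.shape[0]
# Row i is offset by i * num_classes: row 0 uses bins 0..2, row 1 uses bins 3..5.
shift = (np.arange(nrows) * num_classes)[:, np.newaxis]
shifted = l_val + shift
print(shifted)
# [[0 1 2 0 1]
#  [5 3 4 5 3]]

# One flat bincount now sums the distances per (row, label) pair.
flat_sums = np.bincount(shifted.ravel(), weights=K_val.ravel(),
                        minlength=num_classes * nrows)
print(flat_sums)  # bins 0..5 hold 5, 8, 3, 11, 5, 7 (as floats)
```

Reshaping flat_sums to (nrows, num_classes) then gives one row of per-label sums for each point.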
Although senderle's method is really elegant, using bincount is faster:
def using_extradim(l_val, K_val):
    return (K_val[:,:,None] * (l_val[:,:,None] == np.arange(3)[None,None,:])).sum(axis=1)
In [34]: K2 = np.tile(K_val, (1000,1))
In [35]: L2 = np.tile(l_val, (1000,1))
In [36]: %timeit using_extradim(L2, K2)
1000 loops, best of 3: 584 µs per loop
In [40]: %timeit label_distance(L2, K2)
10000 loops, best of 3: 67.7 µs per loop

Here's a way to calculate the values directly. As unutbu's tests show, using bincount is much faster for large datasets, but I think it's worth knowing how to do this using vanilla broadcasting as well:
>>> (K_val[:,:,None] * (l_val[:,:,None] == numpy.arange(3)[None,None,:])).sum(axis=1)
array([[ 5, 8, 3],
[11, 5, 7]])
That's a bit hairy, so I'll step through it slowly. It's probably best to do it this way in code you want to be able to read later! There are four steps:
labels = numpy.arange(3)
l_selector = l_val[:,:,None] == labels[None,None,:]
distances = (K_val[:,:,None] * l_selector)
result = distances.sum(axis=1)
First we create a list of labels (labels above). Then we create a boolean index array:
>>> l_selector = l_val[:,:,None] == labels[None,None,:]
This expands l_val and labels into arrays that can be broadcast together. The None values (equivalent to np.newaxis) add new empty dimensions:
>>> l_val[:,:,None].shape
(2, 5, 1)
>>> labels[None,None,:].shape
(1, 1, 3)
The dimensions are aligned, so both arrays can be expanded (by repeating the values) along their empty dimensions:
>>> l_selector.shape
(2, 5, 3)
Now we have a (n_points, n_neighbors, n_labels) array, where each column corresponds to a label. (See how each row has only one True value?)
>>> l_selector
array([[[ True, False, False],
[False, True, False],
[False, False, True],
[ True, False, False],
[False, True, False]],
[[False, False, True],
[ True, False, False],
[False, True, False],
[False, False, True],
[ True, False, False]]], dtype=bool)
So now we can use this to separate out the distances for each of the three labels. But again, we have to make sure that our arrays are broadcastable, hence the K_val[:,:,None] here:
>>> distances = (K_val[:,:,None] * l_selector)
>>> distances
array([[[1, 0, 0],
[0, 2, 0],
[0, 0, 3],
[4, 0, 0],
[0, 6, 0]],
[[0, 0, 2],
[4, 0, 0],
[0, 5, 0],
[0, 0, 5],
[7, 0, 0]]])
Now all we have to do is sum over the neighbors axis (axis=1), that is, down each column.
>>> result = distances.sum(axis=1)
>>> result
array([[ 5, 8, 3],
[11, 5, 7]])
You might also consider the transposed approach, which requires a little bit less reshaping:
>>> labels = numpy.arange(3)
>>> l_selector = l_val[None,:,:] == labels[:,None,None]
>>> distances = K_val * l_selector
>>> distances.sum(axis=-1)
array([[ 5, 11],
[ 8, 5],
[ 3, 7]])
>>> distances.sum(axis=-1).T
array([[ 5, 8, 3],
[11, 5, 7]])
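For reuse, the four steps can be wrapped in a function. This is only a sketch: the function name label_distance_broadcast and the num_classes parameter are my own additions, not part of the answer above.

```python
import numpy as np

def label_distance_broadcast(l_val, K_val, num_classes):
    """Sum K_val per class label for each row, using broadcasting only."""
    labels = np.arange(num_classes)
    # (N, k, 1) == (1, 1, C) -> boolean array of shape (N, k, C)
    l_selector = l_val[:, :, None] == labels[None, None, :]
    # Keep each distance only in the slot of its own label
    distances = K_val[:, :, None] * l_selector
    # Collapse the neighbour axis, leaving shape (N, C)
    return distances.sum(axis=1)

K_val = np.array([[1, 2, 3, 4, 6],
                  [2, 4, 5, 5, 7]])
l_val = np.array([[0, 1, 2, 0, 1],
                  [2, 0, 1, 2, 0]])
result = label_distance_broadcast(l_val, K_val, num_classes=3)
print(result)
```

Keep in mind that this builds an intermediate array of shape (N, k, num_classes), so memory use grows with the number of classes.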

Related

Get matrix entries based on upper and lower bound vectors?

Let's say I have a matrix mat = [[1,2,3,4,5,6],[1,2,3,4,5,6],[1,2,3,4,5,6]],
a lower bound vector vector_low = [2.1,1.9,1.7] and an upper bound vector vector_up = [3.1,3.5,4.1].
How do I get the values in each row of the matrix that lie between that row's lower and upper bounds?
Expected Output:
[[3],[2,3],[2,3,4]] (it's a list #mozway)
Alternatively, a single vector with all of them would also do.
(Extra question: get the values of the matrix that are between the upper and lower bounds, but with the bounds rounded down/up to the next value in the matrix.
Expected Output:
[[2,3,4],[1,2,3,4],[1,2,3,4,5]])
There should be a fast solution without a loop. I hope someone can help, thanks!
PS: In the end I just want to sum over the list entries, so the output format is not important.
I probably shouldn't indulge you since you haven't provided the code I asked for, but to satisfy my own curiosity, here are my solution(s).
Your lists:
In [72]: alist = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]]
In [73]: low = [2.1,1.9,1.7]; up = [3.1,3.5,4.1]
A utility function:
In [74]: def between(row, l, u):
    ...:     return [i for i in row if l <= i <= u]
and the straightforward list comprehension solution - VERY PYTHONIC:
In [75]: [between(row, l, u) for row, l, u in zip(alist, low, up)]
Out[75]: [[3], [2, 3], [2, 3, 4]]
A numpy solution requires starting with arrays:
In [76]: arr = np.array(alist)
In [77]: Low = np.array(low)
    ...: Up = np.array(up)
We can check the bounds with:
In [79]: Low[:, None] <= arr
Out[79]:
array([[False, False, True, True, True, True],
[False, True, True, True, True, True],
[False, True, True, True, True, True]])
In [80]: (Low[:, None] <= arr) & (Up[:,None] >= arr)
Out[80]:
array([[False, False, True, False, False, False],
[False, True, True, False, False, False],
[False, True, True, True, False, False]])
Applying the mask to index arr produces a flat array of values:
In [81]: arr[_]
Out[81]: array([3, 2, 3, 2, 3, 4])
To get the values grouped by row, we still have to iterate:
In [82]: [row[mask] for row, mask in zip(arr, Out[80])]
Out[82]: [array([3]), array([2, 3]), array([2, 3, 4])]
For the small case I expect the list approach to be faster. For larger cases [81] will do better - IF we already have arrays. Creating arrays from the lists is not a time-trivial task.
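Since the question's PS says that only the sums matter in the end, the boolean mask can be used directly, with no per-row iteration. A minimal sketch with the same data (the names row_sums and total are mine):

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5, 6],
                [1, 2, 3, 4, 5, 6],
                [1, 2, 3, 4, 5, 6]])
low = np.array([2.1, 1.9, 1.7])
up = np.array([3.1, 3.5, 4.1])

# Broadcast each row's bounds against that row of arr
mask = (low[:, None] <= arr) & (arr <= up[:, None])

# Zero out the values outside the bounds, then sum along each row
row_sums = np.where(mask, arr, 0).sum(axis=1)
print(row_sums)  # [3 5 9]

# Or a single total over all selected values
total = arr[mask].sum()
print(total)  # 17
```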

numpy create Mx2 matrix based on Mx1 matrix

I have a np.array with 800 values. Each value is either 0 or 1.
If the value is 0, I want to replace it with mu_0, which is a 1x2 array; else, I want to replace it with mu_1, which is also a 1x2 array.
I tried using np.where(y == 0, mu_0, mu_1), but python would only broadcast the value of mu to match y, not the other way around. In particular, the error I get is
ValueError: operands could not be broadcast together with shapes (800,) (2,) (2,)
I tried expanding y into (800, 2), by padding y_pad = np.c_[y, np.zeros(800)], but I am unsure how to condition on the first value of each row.
If I use np.where(y_pad[:, 0] == 0, ...), the array gets sliced back into (800,) again.
You can indeed expand y into (800, 2) with, for example, np.repeat so that 0s are 0s, 1s are 1s for each row. Then we can use np.where:
# first casting to (800, 1) with `newaxis` then repetition
y_rep = np.repeat(y[:, np.newaxis], repeats=2, axis=1)
result = np.where(y_rep == 0, mu_0, mu_1)
sample run:
mu_0 = np.array([ 9, 17])
mu_1 = np.array([-3, -5])
y = np.array([0, 0, 1, 1, 0, 1, 1, 1])
then
>>> result
array([[ 9, 17],
[ 9, 17],
[-3, -5],
[-3, -5],
[ 9, 17],
[-3, -5],
[-3, -5],
[-3, -5]])
where the condition became:
>>> y_rep == 0
array([[ True, True],
[ True, True],
[False, False],
[False, False],
[ True, True],
[False, False],
[False, False],
[False, False]])
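The repetition is not strictly needed: if the condition has shape (n, 1), np.where broadcasts it against the (2,) arrays by itself. A minimal sketch with the same sample data as above:

```python
import numpy as np

mu_0 = np.array([9, 17])
mu_1 = np.array([-3, -5])
y = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# (8, 1) condition broadcasts against the (2,) choices to give an (8, 2) result
result = np.where(y[:, np.newaxis] == 0, mu_0, mu_1)
print(result.shape)  # (8, 2)
```

This avoids materialising the repeated copy of y, which only matters for large inputs.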
You can use indexing:
## Dummy data
# The 1x2 array
mu_0 = np.array([[1,1]])
mu_1 = np.array([[2,2]])
# The boolean array
y = np.array([0,1,0,1,0])
## Get the result:
res = np.vstack((mu_0,mu_1))[y,:]
And we obtain the following array:
array([[1, 1],
[2, 2],
[1, 1],
[2, 2],
[1, 1]])

Numpy: When applying a boolean mask for an array of arrays, most efficient way to record which items were in the original arrays

I can perform a boolean mask on an array of arrays like this
import numpy as np
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[True, False, False], [True, True, False], [False, False, False]]
np.array(a)[np.array(b)]
and get array([1, 4, 5])
How would I preserve the information of which numbers belonged to the same array?
something like this would work
is_in_original(1, 4)
> False
is_in_original(5, 4)
> True
One thing I could think of is this
def is_in_original(x, y):
    for arry in np.array(a):
        if x in arry and y in arry:
            return True
    return False
I am wondering if this is the most computationally efficient method. I will be working with very large array of arrays, and need the throughput to be as fast as possible.
You can use np.where(mask, array, 0) to preserve dimensions.
import numpy as np
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[True, False, False], [True, True, False], [False, False, False]]
ret = np.where(np.array(b), np.array(a), 0)
Output:
array([[1, 0, 0],
[4, 5, 0],
[0, 0, 0]])
Here the third parameter of np.where is 0; you can change it to any other fill value, such as a different number or inf.
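Another option is to record the row index of every selected value with np.nonzero, which returns indices in the same row-major order as boolean indexing. The sketch below is only illustrative: the row_of lookup is a hypothetical helper of mine, and it assumes the selected values are unique and that only values kept by the mask are queried.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[True, False, False],
              [True, True, False],
              [False, False, False]])

values = a[b]               # selected values: 1, 4, 5
row_ids, _ = np.nonzero(b)  # row of each selected value: 0, 1, 1

# Hypothetical lookup table: selected value -> index of its original row
row_of = dict(zip(values.tolist(), row_ids.tolist()))

def is_in_original(x, y):
    """True if x and y were both selected from the same original row."""
    return x in row_of and y in row_of and row_of[x] == row_of[y]

print(is_in_original(1, 4))  # False
print(is_in_original(5, 4))  # True
```

Each query is then a dictionary lookup instead of a scan over all rows.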

Python Scipy How to traverse upper/lower triangular portion non-zeros from csr_matrix

I have a very sparse matrix (a similarity matrix) with dimensions 300k * 300k. To find the relatively large similarities between users, I only need the upper/lower triangular portion of the matrix. So, how can I efficiently get the coordinates of the entries whose value is larger than a threshold?
Thanks.
How about
sparse.triu(M)
If M is
In [819]: M.A
Out[819]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]], dtype=int32)
In [820]: sparse.triu(M).A
Out[820]:
array([[0, 1, 2],
[0, 4, 5],
[0, 0, 8]], dtype=int32)
You may need to construct a new sparse matrix, with just nonzeros above the threshold.
In [826]: sparse.triu(M>2).A
Out[826]:
array([[False, False, False],
[False, True, True],
[False, False, True]], dtype=bool)
In [827]: sparse.triu(M>2).nonzero()
Out[827]: (array([1, 1, 2], dtype=int32), array([1, 2, 2], dtype=int32))
Here's the code for triu:
def triu(A, k=0, format=None):
    A = coo_matrix(A, copy=False)
    mask = A.row + k <= A.col
    row = A.row[mask]
    col = A.col[mask]
    data = A.data[mask]
    return coo_matrix((data,(row,col)), shape=A.shape).asformat(format)
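The same masking idea can be applied directly to the COO arrays, combining the triangle test and the threshold in one step so that no intermediate boolean matrix is built. A minimal sketch, assuming a small stand-in matrix M and a threshold of 2 (both are my own example values):

```python
import numpy as np
from scipy import sparse

# Small stand-in for the real 300k x 300k similarity matrix
M = sparse.csr_matrix(np.arange(9).reshape(3, 3))
threshold = 2

coo = M.tocoo()
# Upper triangle including the diagonal (as triu with k=0), and above the threshold
keep = (coo.row <= coo.col) & (coo.data > threshold)
rows, cols = coo.row[keep], coo.col[keep]

coords = sorted(zip(rows.tolist(), cols.tolist()))
print(coords)  # [(1, 1), (1, 2), (2, 2)]
```

If self-similarities on the diagonal are not of interest, using coo.row < coo.col instead would drop them.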

Converting Specified Elements of a NumPy Array by a New Value

I wanted to convert the specified elements of the NumPy array A: 1, 5, and 8 into 0.
So I did the following:
import numpy as np
A = np.array([[1,2,3,4,5],[6,7,8,9,10]])
bad_values = (A==1)|(A==5)|(A==8)
A[bad_values] = 0
print(A)
Yes, I got the expected result, i.e., the new array.
However, in my real-world problem the given array (A) is very large and also 2-dimensional, and there are far too many bad values to list them one by one like this. So I tried the following way of doing that:
bads = [1,5,8] # Suppose they are the values to be converted into 0
bad_values = A == x for x in bads # HERE is the problem I am facing
How can I do this?
Then, of course, the rest is the same as before.
A[bad_values] = 0
print(A)
If you want to get the index of where a bad value occurs in your array A, you could use in1d to find out which values are in bads:
>>> np.in1d(A, bads)
array([ True, False, False, False, True, False, False, True, False, False], dtype=bool)
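So, for a 1-D array, you can just write A[np.in1d(A, bads)] = 0 to set the bad values of A to 0. Note that in1d always returns a flat 1-D mask, so it cannot index a 2D array directly.
EDIT: If your array is 2D (as A is in the question), one way would be to use the in1d method and then reshape: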
>>> B = np.arange(9).reshape(3, 3)
>>> B
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> np.in1d(B, bads).reshape(3, 3)
array([[False, True, False],
[False, False, True],
[False, False, True]], dtype=bool)
So you could do the following:
>>> B[np.in1d(B, bads).reshape(3, 3)] = 0
>>> B
array([[0, 0, 2],
[3, 4, 0],
[6, 7, 0]])
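On NumPy 1.13 and later, np.isin does the same membership test but returns a mask with the same shape as the input, so no reshape is needed. A minimal sketch using the array from the question:

```python
import numpy as np

A = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10]])
bads = [1, 5, 8]

# np.isin keeps the 2D shape of A, so the mask can index A directly
A[np.isin(A, bads)] = 0
print(A)
# [[ 0  2  3  4  0]
#  [ 6  7  0  9 10]]
```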
