I have a np.array with 800 values. Each value is either 0 or 1.
If the value is 0, I want to replace it with mu_0, which is a 1x2 array; else, I want to replace it with mu_1, which is also a 1x2 array.
I tried using np.where(y == 0, mu_0, mu_1), but NumPy only tries to broadcast mu_0 and mu_1 to match y, not the other way around. In particular, the error I get is
ValueError: operands could not be broadcast together with shapes (800,) (2,) (2,)
I tried expanding y into (800, 2), by padding y_pad = np.c_[y, np.zeros(800)], but I am unsure how to condition on the first value of each row.
If I use np.where(y_pad[:, 0] == 0, ...), the array gets sliced back into (800,) again.
You can indeed expand y into (800, 2) with, for example, np.repeat, so that each row holds two copies of its original 0 or 1. Then we can use np.where:
# first reshape to (800, 1) with `newaxis`, then repeat along the new axis
y_rep = np.repeat(y[:, np.newaxis], repeats=2, axis=1)
result = np.where(y_rep == 0, mu_0, mu_1)
sample run:
mu_0 = np.array([ 9, 17])
mu_1 = np.array([-3, -5])
y = np.array([0, 0, 1, 1, 0, 1, 1, 1])
then
>>> result
array([[ 9, 17],
[ 9, 17],
[-3, -5],
[-3, -5],
[ 9, 17],
[-3, -5],
[-3, -5],
[-3, -5]])
where the condition became:
>>> y_rep == 0
array([[ True, True],
[ True, True],
[False, False],
[False, False],
[ True, True],
[False, False],
[False, False],
[False, False]])
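The explicit np.repeat is not strictly needed. A minimal sketch, assuming the same mu_0, mu_1 and y as in the sample run, that lets broadcasting stretch an (8, 1) condition against the (2,) arrays:

```python
import numpy as np

mu_0 = np.array([9, 17])
mu_1 = np.array([-3, -5])
y = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# y[:, np.newaxis] has shape (8, 1); it broadcasts against the (2,) arrays
# to give an (8, 2) result without materialising a repeated copy of y.
result = np.where(y[:, np.newaxis] == 0, mu_0, mu_1)

# Cross-check against the np.repeat version.
y_rep = np.repeat(y[:, np.newaxis], repeats=2, axis=1)
same = np.array_equal(result, np.where(y_rep == 0, mu_0, mu_1))
print(result.shape, same)
```

The only change is that the condition stays at shape (8, 1) and NumPy expands it on the fly.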
You can use indexing:
## Dummy data
# The 1x2 array
mu_0 = np.array([[1,1]])
mu_1 = np.array([[2,2]])
# The 0/1 label array (integer dtype, so it can be used as an index)
y = np.array([0,1,0,1,0])
## Get the result:
res = np.vstack((mu_0,mu_1))[y,:]
And we obtain the following array:
array([[1, 1],
[2, 2],
[1, 1],
[2, 2],
[1, 1]])
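The same lookup also works when mu_0 and mu_1 are 1-D arrays of shape (2,), as in the error message of the question. A minimal sketch, assuming y holds only 0s and 1s; the cast to an integer type matters because a boolean y would be treated as a mask rather than as row indices:

```python
import numpy as np

mu_0 = np.array([9, 17])
mu_1 = np.array([-3, -5])
y = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# Row 0 of the lookup table is mu_0, row 1 is mu_1.
lookup = np.stack([mu_0, mu_1])      # shape (2, 2)

# Integer indexing picks one table row per entry of y.
result = lookup[y.astype(np.intp)]   # shape (8, 2)
print(result)
```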
I can perform a boolean mask on an array of arrays like this
import numpy as np
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[True, False, False], [True, True, False], [False, False, False]]
np.array(a)[np.array(b)]
and get array([1, 4, 5]).
How would I preserve the information of which numbers belonged to the same array?
Something like this would work:
is_in_original(1, 4)
> False
is_in_original(5, 4)
> True
One thing I could think of is this:
def is_in_original(x, y):
    for arry in np.array(a):
        if x in arry and y in arry:
            return True
    return False
I am wondering whether this is the most computationally efficient method. I will be working with very large arrays of arrays and need the throughput to be as high as possible.
You can use np.where(mask, array, 0) to preserve dimensions.
import numpy as np
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[True, False, False], [True, True, False], [False, False, False]]
ret = np.where(np.array(b), np.array(a), 0)
Output:
array([[1, 0, 0],
[4, 5, 0],
[0, 0, 0]])
Here the third argument of np.where is 0; you can change this fill value to any other number, or to inf.
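With the mask kept in its 2-D shape, the membership test from the question can be vectorised. A minimal sketch, where the signature is_in_original(x, y, arr, mask) is a hypothetical variant of the question's function. It tests the mask directly instead of the zero-filled array, so the fill value cannot be confused with a genuine 0 in the data:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[True, False, False], [True, True, False], [False, False, False]])

def is_in_original(x, y, arr, mask):
    """Return True if x and y both survive the mask in the same row of arr."""
    has_x = ((arr == x) & mask).any(axis=1)   # per row: x present and kept
    has_y = ((arr == y) & mask).any(axis=1)   # per row: y present and kept
    return bool((has_x & has_y).any())

print(is_in_original(1, 4, a, b))
print(is_in_original(5, 4, a, b))
```

Each call makes a few passes over the whole array, so for many queries it may be worth precomputing a value-to-row mapping instead.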
I have a very sparse matrix (a similarity matrix) with dimensions 300k x 300k. To find the relatively large similarities between users, I only need the upper or lower triangular portion of the matrix. How can I efficiently get the coordinates of the user pairs whose value is larger than a threshold?
Thanks.
How about
sparse.triu(M)
If M is
In [819]: M.A
Out[819]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]], dtype=int32)
In [820]: sparse.triu(M).A
Out[820]:
array([[0, 1, 2],
[0, 4, 5],
[0, 0, 8]], dtype=int32)
You may need to construct a new sparse matrix, with just nonzeros above the threshold.
In [826]: sparse.triu(M>2).A
Out[826]:
array([[False, False, False],
[False, True, True],
[False, False, True]], dtype=bool)
In [827]: sparse.triu(M>2).nonzero()
Out[827]: (array([1, 1, 2], dtype=int32), array([1, 2, 2], dtype=int32))
Here's the code for triu:
def triu(A, k=0, format=None):
    A = coo_matrix(A, copy=False)
    mask = A.row + k <= A.col
    row = A.row[mask]
    col = A.col[mask]
    data = A.data[mask]
    return coo_matrix((data,(row,col)), shape=A.shape).asformat(format)
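Putting the pieces together for the original question: a minimal sketch, assuming SciPy is installed and using a small made-up matrix M and threshold. It keeps the strict upper triangle (k=1, which drops the self-similarity diagonal) and filters the stored values directly, which avoids building an intermediate boolean sparse matrix.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the similarity matrix (values are made up).
M = sparse.csr_matrix(np.array([[0, 1, 2],
                                [3, 4, 5],
                                [6, 7, 8]]))
threshold = 2

upper = sparse.triu(M, k=1).tocoo()   # strict upper triangle in COO form
keep = upper.data > threshold         # compare only the stored entries
rows, cols = upper.row[keep], upper.col[keep]
vals = upper.data[keep]
pairs = list(zip(rows.tolist(), cols.tolist(), vals.tolist()))
print(pairs)
```

Only the explicitly stored entries are compared, so this fits a positive threshold, where implicit zeros can never qualify.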
I have a matrix which holds the distances to the k nearest neighbours of a set of points, and a matrix of the class labels of those nearest neighbours (both are N-by-k matrices).
What is the best way, WITHOUT an explicit Python loop (I actually want to implement this in Theano, where such loops are not going to work), to build an (N-by-#classes) matrix whose (i,j) element is the sum of the distances from the i-th point to those of its k-NN points that have class label 'j'?
Example:
# N = 2
# k = 5
# number of classes = 3
K_val = np.array([[1,2,3,4,6],
[2,4,5,5,7]])
l_val = np.array([[0,1,2,0,1],
[2,0,1,2,0]])
"""
result -> [[5,8,3],
[11,5,7]]
"""
You can compute this with numpy.bincount. It has a weights parameter which allows you to count the items in l_val but weight the items according to K_val.
The only little snag is that each row of K_val and l_val needs to be treated independently. So add a shift to l_val so that each row has values which are distinct from every other row.
import numpy as np
num_classes = 3
K_val = np.array([[1,2,3,4,6],
[2,4,5,5,7]])
l_val = np.array([[0,1,2,0,1],
[2,0,1,2,0]])
def label_distance(l_val, K_val):
    nrows, ncols = l_val.shape
    shift = (np.arange(nrows)*num_classes)[:, np.newaxis]
    result = (np.bincount((l_val+shift).ravel(), weights=K_val.ravel(),
                          minlength=num_classes*nrows)
              .reshape(nrows, num_classes))
    return result
print(label_distance(l_val, K_val))
yields
[[ 5. 8. 3.]
[ 11. 5. 7.]]
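To see why the shift works, it helps to look at the intermediate values. A small sketch with the same example data: after shifting, row i uses the label range from i*num_classes up to (but not including) (i+1)*num_classes, so a single flat bincount keeps the rows apart.

```python
import numpy as np

num_classes = 3
K_val = np.array([[1, 2, 3, 4, 6],
                  [2, 4, 5, 5, 7]])
l_val = np.array([[0, 1, 2, 0, 1],
                  [2, 0, 1, 2, 0]])

nrows = l_val.shape[0]
shift = (np.arange(nrows) * num_classes)[:, np.newaxis]   # [[0], [3]]
shifted = l_val + shift   # row 0 keeps labels 0..2, row 1 becomes 3..5
print(shifted)

# One bin per (row, class) pair, filled with the summed distances.
flat = np.bincount(shifted.ravel(), weights=K_val.ravel(),
                   minlength=num_classes * nrows)
print(flat)
print(flat.reshape(nrows, num_classes))
```

The minlength argument matters when the highest class never appears in the last row; without it the reshape would fail.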
Although senderle's method is really elegant, using bincount is faster:
def using_extradim(l_val, K_val):
    return (K_val[:,:,None] * (l_val[:,:,None] == np.arange(3)[None,None,:])).sum(axis=1)
In [34]: K2 = np.tile(K_val, (1000,1))
In [35]: L2 = np.tile(l_val, (1000,1))
In [36]: %timeit using_extradim(L2, K2)
1000 loops, best of 3: 584 µs per loop
In [40]: %timeit label_distance(L2, K2)
10000 loops, best of 3: 67.7 µs per loop
Here's a way to calculate the values directly. As unutbu's tests show, using bincount is much faster for large datasets, but I think it's worth knowing how to do this using vanilla broadcasting as well:
>>> (K_val[:,:,None] * (l_val[:,:,None] == numpy.arange(3)[None,None,:])).sum(axis=1)
array([[ 5, 8, 3],
[11, 5, 7]])
That's a bit hairy, so I'll step through it slowly. It's probably best to do it this way in code you want to be able to read later! There are four steps:
labels = numpy.arange(3)
l_selector = l_val[:,:,None] == labels[None,None,:]
distances = (K_val[:,:,None] * l_selector)
result = distances.sum(axis=1)
First we create a list of labels (labels above). Then we create a boolean index array:
>>> l_selector = l_val[:,:,None] == labels[None,None,:]
This expands l_val and labels into arrays that can be broadcast together. The None values (equivalent to np.newaxis) add new empty dimensions:
>>> l_val[:,:,None].shape
(2, 5, 1)
>>> labels[None,None,:].shape
(1, 1, 3)
The dimensions are aligned, so both arrays can be expanded (by repeating the values) along their empty dimensions:
>>> l_selector.shape
(2, 5, 3)
Now we have a (n_points, n_neighbors, n_labels) array, where each column corresponds to a label. (See how each row has only one True value?)
>>> l_selector
array([[[ True, False, False],
[False, True, False],
[False, False, True],
[ True, False, False],
[False, True, False]],
[[False, False, True],
[ True, False, False],
[False, True, False],
[False, False, True],
[ True, False, False]]], dtype=bool)
So now we can use this to separate out the distances for each of the three labels. But again, we have to make sure that our arrays are broadcastable, hence the K_val[:,:,None] here:
>>> distances = (K_val[:,:,None] * l_selector)
>>> distances
array([[[1, 0, 0],
[0, 2, 0],
[0, 0, 3],
[4, 0, 0],
[0, 6, 0]],
[[0, 0, 2],
[4, 0, 0],
[0, 5, 0],
[0, 0, 5],
[7, 0, 0]]])
Now all we have to do is sum down the columns, that is, over the neighbor axis (axis=1).
>>> result = distances.sum(axis=1)
>>> result
array([[ 5, 8, 3],
[11, 5, 7]])
You might also consider the transposed approach, which requires a little bit less reshaping:
>>> labels = numpy.arange(3)
>>> l_selector = l_val[None,:,:] == labels[:,None,None]
>>> distances = K_val * l_selector
>>> distances.sum(axis=-1)
array([[ 5, 11],
[ 8, 5],
[ 3, 7]])
>>> distances.sum(axis=-1).T
array([[ 5, 8, 3],
[11, 5, 7]])
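The same selector-and-sum idea can also be written as a single contraction. This is a sketch of an alternative formulation, not part of the answers above: it builds the one-hot selector by indexing an identity matrix and lets np.einsum do the multiply-and-sum over the neighbor axis. The names n_labels and one_hot are illustrative.

```python
import numpy as np

K_val = np.array([[1, 2, 3, 4, 6],
                  [2, 4, 5, 5, 7]])
l_val = np.array([[0, 1, 2, 0, 1],
                  [2, 0, 1, 2, 0]])
n_labels = 3

# one_hot[i, j, c] is 1 where neighbor j of point i has label c, else 0.
one_hot = np.eye(n_labels, dtype=K_val.dtype)[l_val]      # shape (2, 5, 3)

# Multiply distances by the selector and sum over the neighbor axis k.
result = np.einsum('nk,nkc->nc', K_val, one_hot)
print(result)
```

Like the broadcasting version, this still materialises the (n_points, n_neighbors, n_labels) selector, so it shares that version's memory cost.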
I have a two-dimensional NxM numpy array:
a = np.ndarray((N,M), dtype=np.float32)
I would like to make a sub-matrix with a selected number of columns and rows. For each dimension I have as input either a binary vector or a vector of indices. How can I do this most efficiently?
Examples
a = array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
cols = [True, False, True]
rows = [False, False, True, True]
cols_i = [0,2]
rows_i = [2,3]
result = wanted_function(a, cols, rows) or wanted_function_i(a, cols_i, rows_i)
result = array([[2, 3],
[ 10, 11]])
There are several ways to get a submatrix in NumPy:
In [35]: ri = [0,2]
...: ci = [2,3]
...: a[np.reshape(ri, (-1, 1)), ci]
Out[35]:
array([[ 2, 3],
[10, 11]])
In [36]: a[np.ix_(ri, ci)]
Out[36]:
array([[ 2, 3],
[10, 11]])
In [37]: s=a[np.ix_(ri, ci)]
In [38]: np.may_share_memory(a, s)
Out[38]: False
Note that the submatrix you get is a new copy, not a view of the original matrix.
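np.ix_ also accepts boolean sequences, which covers the binary-vector input from the question. A minimal sketch, where row_mask and col_mask are illustrative names for masks over the first and second axis:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

row_mask = np.array([True, False, True])          # one flag per row
col_mask = np.array([False, False, True, True])   # one flag per column

# np.ix_ turns each boolean mask into the indices of its True entries
# and shapes them so that they broadcast into an outer selection.
sub = a[np.ix_(row_mask, col_mask)]
print(sub)

# Equivalent selection with explicit integer indices.
same = np.array_equal(sub, a[np.ix_([0, 2], [2, 3])])
```

Each mask must have the same length as the axis it selects from.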
You only need to make cols and rows NumPy arrays; then you can index with [] directly. (Following the question's naming, cols masks the first axis and rows masks the second.)
import numpy as np
a = np.array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
cols = np.array([True, False, True])
rows = np.array([False, False, True, True])
result = a[cols][:,rows]
print(result)
print(type(result))
# [[ 2 3]
# [10 11]]
# <class 'numpy.ndarray'>
For example, I would like to set to zero all elements of a matrix above its counterdiagonal (i + j < n - 1).
I thought about generating a mask, but it would lead to the same problem of accessing such elements in the mask matrix.
What's the best solution?
Since your matrix seems to be square, you can use a boolean mask and do:
n = mat.shape[0]
idx = np.arange(n)
mask = idx[:, None] + idx < n - 1
mat[mask] = 0
To understand what's going on:
>>> mat = np.arange(16).reshape(4, 4)
>>> n = 4
>>> idx = np.arange(n)
>>> idx[:, None] + idx
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6]])
>>> idx[:, None] + idx < n - 1
array([[ True, True, True, False],
[ True, True, False, False],
[ True, False, False, False],
[False, False, False, False]], dtype=bool)
>>> mat[idx[:, None] + idx < n -1] = 0
>>> mat
array([[ 0, 0, 0, 3],
[ 0, 0, 6, 7],
[ 0, 9, 10, 11],
[12, 13, 14, 15]])
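As a cross-check, the same mask can be built from NumPy's triangle helpers. This is a sketch of an equivalent construction for a square matrix, not the answer's method: np.triu with k=1 marks the entries strictly above the main diagonal, and np.fliplr mirrors them onto the region above the counterdiagonal.

```python
import numpy as np

n = 4
mat = np.arange(n * n).reshape(n, n)

# True strictly above the main diagonal (j > i) ...
upper = np.triu(np.ones((n, n), dtype=bool), k=1)
# ... mirrored left-to-right becomes True where i + j < n - 1.
mask = np.fliplr(upper)

# Compare with the broadcasting mask from the answer.
idx = np.arange(n)
same = np.array_equal(mask, idx[:, None] + idx < n - 1)

mat[mask] = 0
print(mat)
```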