Efficiently create a compressed banded diagonal matrix in MXNet (Python)

In my problem I have a vector containing n elements. Given a window size k, I want to efficiently create a matrix of size n x (2k+1) that contains the banded diagonal. For example:
a = [a_1, a_2, a_3, a_4, a_5]
k = 1
b = [[0, a_1, a_2],
     [a_1, a_2, a_3],
     [a_2, a_3, a_4],
     [a_3, a_4, a_5],
     [a_4, a_5, 0]]
The naive way to implement this would be with for loops:
out_data = mx.ndarray.zeros((n, 2*k+1))
for i in range(0, n):
    for j in range(0, 2*k+1):
        index = i - k + j
        if not (index < 0 or index >= n):
            out_data[i][j] = in_data[index]
This is very slow.
Creating the full matrix would be easy by just using tile and reshape, however the masking part is not clear.
I found a faster, yet still very slow, implementation:
window = 2*self.windowSize + 1
in_data_reshaped = in_data.reshape((batch_size, seq_len))
out_data = mx.ndarray.zeros((seq_len * window))
for i in range(0, seq_len):
    copy_from_start = max(i - self.windowSize, 0)
    copy_from_end = min(seq_len -1, i+1+self.windowSize)
    copy_length = copy_from_end - copy_from_start
    copy_to_start = i*window + (2*self.windowSize + 1 - copy_length)
    copy_to_end = copy_to_start + copy_length
    out_data[copy_to_start:copy_to_end] = in_data_reshaped[copy_from_start:copy_from_end]
out_data = out_data.reshape((seq_len, window))

If k and n are constant in your operation, you can do what you want using a combination of mxnet.nd.gather_nd() and mxnet.nd.scatter_nd(). Generating the indices tensor is just as inefficient as the loops above, but because you only need to do it once, that isn't a problem. You would use gather_nd to effectively "duplicate" your data from the original array and then use scatter_nd to scatter the copies into the final matrix shape. Alternatively, you can simply prepend a 0 element to your input array (for example, [a_1, a_2, a_3] would turn into [0, a_1, a_2, a_3]) and then use only mxnet.nd.gather_nd() to duplicate elements into your final matrix.
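The second suggestion (prepend a zero, then gather) can be sketched in NumPy as a stand-in for the MXNet calls. The function names build_gather_index and banded_windows are made up for illustration, and NumPy fancy indexing is assumed to play the role of mxnet.nd.gather_nd(). The index matrix depends only on n and k, so it can be built once and reused for every input vector.

```python
import numpy as np

def build_gather_index(n, k):
    """Index matrix of shape (n, 2k+1) into the zero-prepended vector.

    Index 0 refers to the prepended zero; index j + 1 refers to a[j].
    """
    # src[i, j] is the position in the original vector wanted at out[i, j]
    src = np.arange(n)[:, None] + np.arange(-k, k + 1)[None, :]
    valid = (src >= 0) & (src < n)
    # out-of-range positions are redirected to the zero at index 0
    return np.where(valid, src + 1, 0)

def banded_windows(a, gather_index):
    """Gather the banded matrix from a using a precomputed index matrix."""
    padded = np.concatenate((np.zeros(1, dtype=a.dtype), a))
    return padded[gather_index]

a = np.array([1, 2, 3, 4, 5])
idx = build_gather_index(len(a), k=1)   # computed once for fixed n and k
b = banded_windows(a, idx)
```

Under these assumptions, only the final gather runs per input; the Python-level work of building the indices is paid once.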


Slicing for each column of a 2D NumPy array

In my code, I have a 2D array (D), and for each column I want to extract k neighboring columns on each side (left and right) and sum them up. A naive approach would be to use a for loop, but to speed this up I am trying to slice the 2D matrix for each column to get a submatrix and sum it column-wise. Surprisingly, the naive approach is faster than the slicing option for k > 6. Any suggestions on how I can make the implementation efficient?
Naive implementation:
k = 64
index = np.arange(D.shape[1])
index_kp = index + k
index_kn = index - k
# neighbors can be less than k if sufficient neighbors not available; for ex. near beginning and the end of an array
index_kn[np.where(index_kn <0)] = np.where(index_kn <0)
index_kp[np.where(index_kp > (len(index)-1))] = np.where(index_kp > (len(index)-1))
Dsmear = np.empty_like(D) #stores the summation of neighboring k columns for each col
for i in range(len(index_kp)):
    Dsmear[:,i] = np.sum(D[:, index_kn[i]:index_kp[i]], axis=1)
Slicing implementation:
D1 = np.concatenate((np.repeat(D[:,0].reshape(-1,1),k,axis=1), D, np.repeat(D[:,-1].reshape(-1,1),k,axis=1)),axis=1) #padding the edges with k columns
idx = np.asarray([np.arange(i-k,i+k+1) for i in range(k, D.shape[1]+k)], dtype=np.int32)
D_broadcast = D1[:, idx] # 3D array; is a bottleneck
Dsmear = np.sum(D_broadcast, axis=2)
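One possible alternative, offered only as a sketch, is to avoid the 3D intermediate altogether by taking differences of a cumulative sum. It assumes the same edge handling as the slicing implementation (the first and last columns are replicated k times), not the clipping used in the naive loop. The function name window_sum_cumsum is hypothetical.

```python
import numpy as np

def window_sum_cumsum(D, k):
    """Sum each column with its k left and k right neighbours (edges replicated)."""
    w = 2 * k + 1
    # pad k copies of the first/last column, as in the slicing implementation
    D1 = np.pad(D, ((0, 0), (k, k)), mode="edge")
    C = np.cumsum(D1, axis=1)
    # leading zero column so that C[:, j] is the sum of the first j padded columns
    C = np.concatenate((np.zeros((D.shape[0], 1), dtype=C.dtype), C), axis=1)
    # the sum of padded columns i .. i + w - 1 is C[:, i + w] - C[:, i]
    return C[:, w:] - C[:, :-w]

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 10))
k = 3
Dsmear = window_sum_cumsum(D, k)

# reference: explicit window sums over the padded array
D1 = np.pad(D, ((0, 0), (k, k)), mode="edge")
reference = np.stack(
    [D1[:, i:i + 2 * k + 1].sum(axis=1) for i in range(D.shape[1])], axis=1
)
```

The work is proportional to the size of D regardless of k, at the cost of some floating-point rounding from subtracting large running sums.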

How to vectorize indexing and computation when indexed tensors are different dimensions?

I'm trying to vectorize the following for-loop in PyTorch. I'd be happy with just vectorizing the inner for-loop, but doing the whole batch would also be awesome.
# B: the batch size
# N: the number of training examples
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss
batch_data = torch.tensor(...) # Tensor of shape [B x N x dim]
batch_labels = torch.tensor(...) # Tensor of shape [B x N x 1], each element is one of K labels (ints)
batch_losses = [] # Ultimately should be [B x 1]
batch_centroids = [] # Ultimately should be [B x K_i x dim]
for i in range(B):
    data = batch_data[i] # [N x dim]
    labels = batch_labels[i].squeeze(-1) # [N]
    centroids = [] # Keep track of the means for each class.
    classes = torch.unique(labels) # Get the unique labels for the classes.
    # NOTE: The number of classes K for each item in the batch might actually
    # be different. This may complicate batch-level operations.
    total_loss = 0
    # For each class independently. This is the part I want to vectorize.
    for cl in classes:
        # Take the subset of training examples with that label.
        subset = data[torch.where(labels == cl)]
        # Find the centroid of that subset.
        centroid = subset.mean(dim=0)
        # Get the distance between each point in the subset and the centroid.
        dists = subset - centroid
        norm = torch.linalg.norm(dists, dim=1)
        # The loss is the mean of the hinge loss across the subset.
        margin = norm - delta
        hinge = torch.clamp(margin, min=0.0) ** 2
        total_loss += hinge.mean()
        # Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
        centroids.append(centroid)
    loss = total_loss.mean()
    batch_losses.append(loss)
    batch_centroids.append(centroids)
I've been scratching my head on how to deal with the irregularly sized tensors. The number of classes in each batch K_i is different, and the size of each subset is different.
It turns out it actually is possible to vectorize across ragged arrays. I'll use numpy, but the code should be directly translatable to torch. The key technique is to:
1. Sort by ragged array membership
2. Perform an accumulation
3. Find boundary indices, compute adjacent differences
For a single (non-batch) input of an n x d matrix X and an n-length array label, the following returns the k x d centroids and the n-length array of squared distances to the respective centroids:
def inverse_permutation(p):
    """Return q such that q[p] == arange(len(p)). Helper assumed here; it is not shown in the original."""
    q = np.empty_like(p)
    q[p] = np.arange(len(p))
    return q
def vcentroids(X, label):
    """Vectorized version of centroids."""
    # order points by cluster label
    ix = np.argsort(label)
    label = label[ix]
    Xz = X[ix]
    # compute pos where pos[i]:pos[i+1] is span of cluster i
    d = np.diff(label, prepend=0) # jump between consecutive sorted labels (nonzero where labels change)
    pos = np.flatnonzero(d) # indices where labels change
    pos = np.repeat(pos, d[pos]) # repeat for 0-length clusters
    pos = np.append(np.insert(pos, 0, 0), len(X))
    # prefix sums with a leading zero row, so differences give per-cluster sums
    Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
    Xsums = np.cumsum(Xz, axis=0)
    Xsums = np.diff(Xsums[pos], axis=0)
    counts = np.diff(pos)
    c = Xsums / np.maximum(counts, 1)[:, np.newaxis]
    # expand centroids to one row per (sorted) point, then undo the sort
    repeated_centroids = np.repeat(c, counts, axis=0)
    aligned_centroids = repeated_centroids[inverse_permutation(ix)]
    dist = np.sum((X - aligned_centroids) ** 2, axis=1) # squared Euclidean distance
    return c, dist
Batching requires little special handling. For an input B x n x d array batch_X, with B x n batch labels batch_labels, create unique labels for each batch:
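batch_k = batch_labels.max(axis=1) + 1 # number of labels per batch element
# shift right by one so that the cumulative sum gives each element's starting offset
batch_k[1:] = batch_k[:-1]
batch_k[0] = 0
base = np.cumsum(batch_k)
batch_labels += base[:, np.newaxis]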
So now each batch element has a unique contiguous range of labels. I.e., the first batch element will have its n labels in some range [0, k0), where k0 is the first element's label count (the value of batch_k[0] before the shift above), the second will have the range [k0, k0 + k1), where k1 is the second element's label count, etc.
Then just flatten the B x n x d input to B*n x d and call the same vectorized method. Your loss function is derivable using the final distances and the same position-array-based reduction technique.
For a detailed explanation of how the vectorization works, see my blog post.
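As a rough illustration of how the loss could be derived after flattening, here is a NumPy sketch. Note that it swaps in np.bincount and np.add.at for the per-label reductions rather than the sort-and-cumsum technique described above, and it assumes every batch element draws labels from the same range [0, K), so a fixed offset of i*K per batch element is enough. The function name class_hinge_losses and the test data are hypothetical.

```python
import numpy as np

def class_hinge_losses(X, label, num_labels, delta):
    """Per-label centroids and mean squared hinge loss.

    X: (n, d) float array; label: (n,) ints in [0, num_labels).
    """
    counts = np.bincount(label, minlength=num_labels)
    sums = np.zeros((num_labels, X.shape[1]), dtype=X.dtype)
    np.add.at(sums, label, X)               # unbuffered scatter-add of rows per label
    safe = np.maximum(counts, 1)            # avoid 0/0 for unused labels
    centroids = sums / safe[:, None]
    dist = np.linalg.norm(X - centroids[label], axis=1)
    hinge = np.clip(dist - delta, 0.0, None) ** 2
    per_label = np.bincount(label, weights=hinge, minlength=num_labels) / safe
    return centroids, per_label, counts

B, n, d, K = 3, 50, 4, 5
delta = 1.0
rng = np.random.default_rng(0)
batch_X = rng.standard_normal((B, n, d))
batch_labels = rng.integers(0, K, size=(B, n))

# give each batch element its own label range, then flatten
flat_labels = (batch_labels + K * np.arange(B)[:, None]).reshape(-1)
centroids, per_label, counts = class_hinge_losses(
    batch_X.reshape(-1, d), flat_labels, B * K, delta
)
# unused labels contribute 0, so a plain sum matches the loop over present classes
batch_losses = per_label.reshape(B, K).sum(axis=1)
```

If K differs per batch element, the cumulative-sum offsets from the answer above would replace the fixed i*K offset.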
You can vectorize the whole thing if you use a one-hot encoding for your classes and a pairwise distance trick for your norms:
import torch
B = 32
N = 1000
dim = 50
K = 25
batch_data = torch.randn((B, N, dim))
batch_labels = torch.randint(0, K, size=(B, N))
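batch_one_hot = torch.nn.functional.one_hot(batch_labels)
# class sums [B x K x dim] divided by class counts [B x K x 1]
centroids = torch.matmul(
    batch_one_hot.transpose(-1, 1).type(batch_data.dtype),
    batch_data,
) / batch_one_hot.sum(1)[..., None]
# distance from every point to every centroid: [B x N x K]
norms = torch.linalg.norm(batch_data[:, :, None] - centroids[:, None], dim=-1)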
# Compute the rest of your loss
# ...
A couple of things to watch out for:
1. You'll get a divide by zero for any batches that have a missing class. You can handle this by first computing the class sums (with matmul) and counts (summing the one-hot tensor along axis 1) separately. Then, mask the sums with count == 0 and divide the rest of them by their class counts.
2. If you have a large number of classes, this will cause memory problems because the one-hot tensor will be too big. In that case, the answer from @VF1 probably makes more sense.
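The divide-by-zero caveat can be sketched as follows. This uses NumPy as a stand-in for the torch calls (an assumption made so the example has no torch dependency); np.eye(K)[labels] plays the role of one_hot and np.einsum plays the role of the batched matmul. The tiny label array is made up so that class 3 is missing from the first batch element.

```python
import numpy as np

B, N, dim, K = 2, 6, 3, 4
rng = np.random.default_rng(0)
batch_data = rng.standard_normal((B, N, dim))
batch_labels = np.array([[0, 0, 1, 1, 2, 2],    # class 3 never occurs here
                         [0, 1, 2, 3, 3, 3]])

one_hot = np.eye(K)[batch_labels]                        # (B, N, K)
sums = np.einsum("bnk,bnd->bkd", one_hot, batch_data)    # class sums, (B, K, dim)
counts = one_hot.sum(axis=1)                             # class counts, (B, K)

# divide only where the class is present; absent classes keep a zero centroid
centroids = np.zeros_like(sums)
np.divide(sums, counts[..., None], out=centroids, where=counts[..., None] > 0)
```

Leaving absent classes at zero is one choice among several; the counts array can also be kept as a mask so those centroids are excluded from the loss.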

Using python/numpy to create a complex matrix

Using python/numpy, I would like to create a 2D matrix M whose components are:
M_{i,j} = \sum_{k=g(i,j)}^{h(i,j)} f(i,j,k)
I know I can do this with a bunch of for loops but is there a better way to do this by using numpy (not using for loops)?
This is what I tried, which ended up giving me a value error.
I first tried to define a function that takes the sum over k:
def sum_function(i,j):
    initial_array = np.arange(g(i,j),h(i,j)+1)
    applied_array = f(i,j,initial_array)
    return applied_array.sum()
Then I tried to create the M matrix with np.mgrid as follows:
ii, jj = np.mgrid[start:fin, start:fin]
M_matrix = sum_function(ii,jj)
Let me write down the concrete form of a matrix as an example:
M_{i,j} = \sum_{k=min(i,j)}^{i+j}\sin{\left( (i+j)^k \right)}
If i, j = 0, 1, then this matrix is 2 by 2 and its form will be
\begin{pmatrix} \sin(0) & \sin(1) \\ \sin(1) & \sin(2)+\sin(4) \end{pmatrix}
Now if the matrix gets really big, how would I create this matrix without using for loops?
To simplify thinking, let's ravel the i,j dimensions into one ij dimension. Can we evaluate 3 arrays:
G = g(ij) # for all ij values
H = h(ij)
F = f(ij, kk) # for all ij, and all kk
In other words, can g,h,f be evaluated at multiple values, to produce whole-arrays?
If the G and H values were the same for all ij, or subsets (preferably slices), then
F[:, G:H].sum(axis=1)
would be the value for all ij.
If the H-G difference, the size of each slice, was the same, then we can construct a 2d indexing array, GH such that
F[:, GH].sum(axis=1)
In other words we are summing constant size windows of the F rows.
But if the H-G differences vary across ij, I think we are stuck with doing the sum for each ij element separately, with Python-level loops or ones compiled with numba or cython.
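For the constant-width case, a minimal sketch of the 2D indexing array could look like this. The array F, the start positions G and the width are made-up stand-ins for f(ij, kk) and g(ij). Because the window start differs per row here, the sketch pairs GH with an explicit row index so that each row of F is read only at its own window.

```python
import numpy as np

n_ij, n_k, width = 4, 10, 3
F = np.arange(n_ij * n_k, dtype=float).reshape(n_ij, n_k)  # stand-in for f(ij, kk)
G = np.array([0, 2, 5, 7])                                  # window start for each ij

# GH[r, c] is the c-th column index of row r's window
GH = G[:, None] + np.arange(width)[None, :]                 # shape (n_ij, width)
rows = np.arange(n_ij)[:, None]                             # broadcasts against GH
window_sums = F[rows, GH].sum(axis=1)
```

If every ij shared the same window, GH would collapse to a single 1D index and F[:, GH] alone would be enough.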
I think I found an answer to this myself. I first create a 3D array F_{i,j,k} = f(i,j,k), and then a mask_array whose component is True if g(i,j) <= k <= h(i,j) and False otherwise. Then I compute the element-wise product of these two arrays, F*mask_array, and take the sum over the k axis.
For example, this matrix can be efficiently created by the following code.
M_{i,j} = \sum_{k=min(i,j)}^{i+j}\sin{\left( (i+j)^k \right)}
# in this example, g(i,j) = min(i,j), h(i,j) = i+j and f(i,j,k) = sin((i+j)^k)
# 0 <= i, j <= 2
# kk should range from min g(i,j) to max h(i,j)
ii, jj, kk = np.mgrid[0:3,0:3,0:5]
# k >= g(i,j) = min(i,j), i.e. k >= i or k >= j
frm1 = kk >= jj
frm2 = kk >= ii
frm = np.logical_or(frm1,frm2)
# k <= h(i,j)
to = kk <= ii+jj
k_mask = np.logical_and(frm,to)
def f(i,j,k):
    return np.sin((i+j)**k)
M_before_mask = f(ii,jj,kk)
# Matrix created
M_matrix = (M_before_mask*k_mask).sum(axis=2)
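To check the masking approach, a small sketch can compare it against the explicit triple loop for the same example. The function names masked_sum_matrix and loop_matrix are hypothetical; g, h and f are the ones from the example above.

```python
import numpy as np

def masked_sum_matrix(n):
    """M[i, j] = sum_{k=min(i,j)}^{i+j} sin((i+j)^k), via a 3D grid and a mask."""
    # k must cover 0 .. max(i + j) = 2 * (n - 1)
    ii, jj, kk = np.mgrid[0:n, 0:n, 0:2 * n - 1]
    mask = (kk >= np.minimum(ii, jj)) & (kk <= ii + jj)
    return (np.sin((ii + jj) ** kk) * mask).sum(axis=2)

def loop_matrix(n):
    """Same matrix with plain Python loops, used only as a reference."""
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j), i + j + 1):
                M[i, j] += np.sin((i + j) ** k)
    return M

M_vec = masked_sum_matrix(3)
M_loop = loop_matrix(3)
```

The 3D grid evaluates f for every (i, j, k), including masked-out entries, so memory grows with n squared times the k range; for larger n the integer powers can also overflow 64-bit integers, so this is a sketch for small sizes only.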

How to do vector-matrix multiplication with conditions?

I want to obtain a list (or array, doesn't matter) of A from the following formula:
A_i = X_(k!=i) * S_(k!=i) * X'_(k!=i)
X is a vector (and X' is the transpose of X), S is a matrix, and the subscript k is defined as {k=1,2,3,...n| k!=i}.
X = [x1, x2, ..., xn]
S = [[s11,s12,...,s1n],
[... ... ... ..]
I take the following as an example:
X = [0.1,0.2,0.3,0.5]
S = [[0.4,0.1,0.3,0.5],
So, eventually, I would get a list of four values for A.
I did this:
import numpy as np
x = np.array([0.1,0.2,0.3,0.5])
s = np.matrix([[0.4,0.1,0.3,0.5],[1,2,1.5,2.4,0.6],[0.4,0.1,0.3,0.5],[1,2,1.5,2.4,0.6]])
for k in range(x) if k!=i
A = (x.dot(s)).dot(np.transpose(x))
print (A)
I am confused with how to use a conditional 'for' loop. Could you please help me to solve it? Thanks.
Just to explain more. If you take i=1, then the formula will be:
A_1 = X_(k!=1) * S_(k!=1) * X'_(k!=1)
So any array (or value) associated with subscript 1 will be deleted in X and S. like:
X = [0.2,0.3,0.5]
S = [[1.5,2.4,0.6]
Step 1: correctly calculate A_i
Step 2: collect them into A
I assume what you want to calculate is
A_i = \sum_{j \neq i} \sum_{k \neq i} x_j s_{jk} x_k
An easy way to do so is to mask away the entries using masked arrays. This way we don't need to delete or copy any matrices.
# sample
x = np.array([1,2,3,4])
s = np.diag([4,5,6,7])
# we will use masked arrays to remove k=i
vec_mask = np.zeros_like(x)
matrix_mask = np.zeros_like(s)
i = 0 # start
# set masks
vec_mask[i] = 1
matrix_mask[i] = matrix_mask[:,i] = 1
s_mask = np.ma.array(s, mask=matrix_mask)
x_mask = np.ma.array(x, mask=vec_mask)
# reduced product; remember to use np.ma.inner instead of np.inner
Ai = np.ma.inner(np.ma.inner(x_mask, s_mask), x_mask.T)
# unset masks
vec_mask[i] = 0
matrix_mask[i] = matrix_mask[:,i] = 0
As terms of 0 don't add to the sum, we actually can ignore masking the matrix and just mask the vector:
# we will use masked arrays to remove k=i
mask = np.zeros_like(x)
i = 0 # start
# set masks
mask[i] = 1
x_mask = np.ma.array(x, mask=mask)
# reduced product
Ai = np.ma.inner(np.ma.inner(x_mask, s), x_mask.T)
# unset mask
mask[i] = 0
The final step is to assemble A out of the A_is, so in total we get
x = np.array([1,2,3,4])
s = np.diag([4,5,6,7])
mask = np.zeros_like(x)
x_mask = np.ma.array(x, mask=mask)
A = []
for i in range(len(x)):
    x_mask.mask[i] = 1
    Ai = np.ma.inner(np.ma.inner(x_mask, s), x_mask.T)
    A.append(Ai)
    x_mask.mask[i] = 0
A_vec = np.array(A)
Implementing a matrix/vector product using loops will be rather slow in Python. Therefore, I suggest actually deleting the rows/columns/elements at the given index and performing the fast built-in dot product without any explicit loops:
i = 0 # don't forget Python's indices are zero-based
x_ = np.delete(X, i) # remove element
s_ = np.delete(S, i, axis=0) # remove row
s_ = np.delete(s_, i, axis=1) # remove column
result = x_.dot(s_).dot(x_) # no need to transpose a 1-D array
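To collect all A_i, the deletion step can be wrapped in a loop over i. The sketch below also includes a loop-free variant that is not part of the answers above: it subtracts the contributions of row i and column i from the full quadratic form and adds back the doubly subtracted diagonal term. Both function names are hypothetical.

```python
import numpy as np

def leave_one_out_forms(x, S):
    """A[i] = x_(-i) . S_(-i,-i) . x_(-i), computed by deleting index i."""
    out = []
    for i in range(len(x)):
        x_ = np.delete(x, i)
        S_ = np.delete(np.delete(S, i, axis=0), i, axis=1)
        out.append(x_ @ S_ @ x_)
    return np.array(out)

def leave_one_out_forms_closed(x, S):
    """Same values without deletion: full form minus row i and column i terms."""
    total = x @ S @ x
    # row-i terms: x_i * (S x)_i; column-i terms: x_i * (S^T x)_i;
    # x_i^2 * S_ii was subtracted twice, so add it back once
    return total - x * (S @ x) - x * (S.T @ x) + x ** 2 * np.diag(S)

x = np.array([1.0, 2.0, 3.0, 4.0])
S = np.diag([4.0, 5.0, 6.0, 7.0])
A = leave_one_out_forms(x, S)
A_closed = leave_one_out_forms_closed(x, S)
```

With the diagonal sample matrix used earlier, the full form is 190 and each A_i is 190 minus x_i squared times S_ii.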

Efficient double iteration over array

I have the following code, where points is many lines by 3 cols list of lists, coorRadius is a radius within which I want to find the local coordinate maxima, and localCoordinateMaxima is an array where I store the i's of these maxima:
for i,x in enumerate(points):
    check = 1
    for j,y in enumerate(points):
        if linalg.norm(x-y) <= coorRadius and x[2] < y[2]:
            check = 0
    if check == 1:
        localCoordinateMaxima.append(i)
print(localCoordinateMaxima)
Unfortunately, this takes forever when I have several thousand points, so I am looking for a way to speed it up. I tried to do it with an if all() condition, but I didn't manage it, and I am not even sure it would be more efficient. Could you propose a way to make it faster?
Your search for neighbors is best done using a KDTree.
from scipy.spatial import cKDTree
tree = cKDTree(points)
pairs = tree.query_pairs(coorRadius)
Now pairs is a set of two item tuples (i, j), where i < j and points[i] and points[j] are within coorRadius of each other. You can now simply iterate over these, which will likely be a much smaller set than the len(points)**2 you are currently iterating over:
is_maximum = [True] * len(points)
for i, j in pairs:
    if points[i][2] < points[j][2]:
        is_maximum[i] = False
    elif points[j][2] < points[i][2]:
        is_maximum[j] = False
localCoordinateMaxima, = np.nonzero(is_maximum)
This can be further sped up by vectorizing it:
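points = np.asarray(points) # the fancy indexing below needs an array
pairs = np.array(list(pairs))
pairs = np.vstack((pairs, pairs[:, ::-1]))
pairs = pairs[np.argsort(pairs[:, 0])]
# True where the first point of the pair is not lower than its neighbor
is_z_not_smaller = points[pairs[:, 0], 2] >= points[pairs[:, 1], 2]
bins, = np.nonzero(pairs[:-1, 0] != pairs[1:, 0])
bins = np.concatenate(([0], bins+1))
# a point is a maximum only if that holds for all of its neighbors
is_maximum = np.logical_and.reduceat(is_z_not_smaller, bins)
localCoordinateMaxima, = np.nonzero(is_maximum)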
The above code assumes that every point has at least one neighbor within coorRadius. If that is not the case, you need to slightly complicate things:
points = np.asarray(points) # the fancy indexing below needs an array
pairs = np.array(list(pairs))
pairs = np.vstack((pairs, pairs[:, ::-1]))
pairs = pairs[np.argsort(pairs[:, 0])]
# True where the first point of the pair is not lower than its neighbor
is_z_not_smaller = points[pairs[:, 0], 2] >= points[pairs[:, 1], 2]
bins, = np.nonzero(pairs[:-1, 0] != pairs[1:, 0])
bins = np.concatenate(([0], bins+1))
has_neighbors = pairs[bins, 0] # index of the point owning each group
is_maximum = np.ones((len(points),), bool) # points without neighbors stay maxima
is_maximum[has_neighbors] &= np.logical_and.reduceat(is_z_not_smaller, bins)
localCoordinateMaxima, = np.nonzero(is_maximum)
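The grouped reduction can be traced on a toy example without building a tree. The z-values and the neighbor pairs below are made up; the pairs are written in the same (i, j) with i < j form that query_pairs returns. The sketch tests z_i >= z_j for each ordered pair, so that an AND over one point's pairs means that none of its neighbors is higher. Point 5 deliberately has no neighbors.

```python
import numpy as np

# hypothetical data: z-coordinates of 6 points and their neighbour pairs
z = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.5])
pairs = {(0, 1), (1, 2), (3, 4)}

p = np.array(sorted(pairs))
p = np.vstack((p, p[:, ::-1]))               # add (j, i) for every (i, j)
p = p[np.argsort(p[:, 0], kind="stable")]    # group rows by their first index

not_smaller = z[p[:, 0]] >= z[p[:, 1]]
# start offset of each group of rows sharing the same first index
starts = np.concatenate(([0], np.flatnonzero(p[:-1, 0] != p[1:, 0]) + 1))
group_all = np.logical_and.reduceat(not_smaller, starts)

is_maximum = np.ones(len(z), dtype=bool)     # isolated points count as maxima
is_maximum[p[starts, 0]] = group_all
maxima = np.flatnonzero(is_maximum)
```

Points 1 and 3 dominate all of their neighbors, and point 5 is kept because it has none.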
Here is the version of your code just tightened-up a bit:
for i, x in enumerate(points):
    x2 = x[2]
    for y in points:
        if linalg.norm(x-y) <= coorRadius and x2 < y[2]:
            break
    else:
        localCoordinateMaxima.append(i)
print(localCoordinateMaxima)
Factor-out the x[2] lookup.
The j variable was unused.
Add a break for an early-out
Use a for-else construct instead of a flag variable
With numpy this is not too hard. You can do it with a single (long) expression, if you want:
import numpy as np
points = np.array(points)
localCoordinateMaxima = np.where(np.all((np.linalg.norm(points - points[:, None], axis=-1) >
                                         coorRadius) |
                                        (points[:, 2] >= points[:, None, 2]),
                                        axis=0))[0]
The algorithm your current code implements is essentially where(not(any(w <= x and y < z))). If you distribute the not through the logical operations inside it (using De Morgan's laws), you can avoid one level of nesting by flipping the inequalities, getting where(all(w > x or y >= z)).
w is a matrix of norms applied to the differences of the points broadcast together. x is a constant. y and z are both arrays with the third coordinates of the points, shaped so that they broadcast together into the same shape as w.
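A small sketch can confirm that the broadcast form agrees with the nested loops. The function names, the random test points and the radius are made up for illustration. The comparison relies on no pairwise distance landing exactly on the radius, which is assumed to hold for random data.

```python
import numpy as np

def local_maxima_broadcast(points, radius):
    """Indices b such that no point within radius of b has a larger z."""
    points = np.asarray(points, dtype=float)
    # w[a, b] = distance between points a and b
    w = np.linalg.norm(points[None, :, :] - points[:, None, :], axis=-1)
    z = points[:, 2]
    # ok[a, b]: a is too far from b, or b is at least as high as a
    ok = (w > radius) | (z[None, :] >= z[:, None])
    return np.flatnonzero(ok.all(axis=0))

def local_maxima_loops(points, radius):
    """Reference implementation using for-else with an early break."""
    result = []
    for i, x in enumerate(points):
        for y in points:
            if np.linalg.norm(x - y) <= radius and x[2] < y[2]:
                break
        else:
            result.append(i)
    return result

rng = np.random.default_rng(1)
pts = rng.random((200, 3))
fast = local_maxima_broadcast(pts, 0.2)
slow = local_maxima_loops(pts, 0.2)
```

The broadcast version builds an N x N x 3 intermediate, so it trades memory for speed; for many thousands of points the tree-based approach is likely the safer choice.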

