Tensor reduction based off index vector - python

As an example, I have 2 tensors: A = [1;2;3;4;5;6;7] and B = [2;3;2]. The idea is that I want to reduce A based on B, such that B's values describe how to sum A's values: B = [2;3;2] means the reduced A should be the sum of the first 2 values, the next 3, and the last 2: A' = [(1+2);(3+4+5);(6+7)]. Note that the sum of B will always equal the length of A. I'm trying to do this as efficiently as possible - preferably with specific functions or matrix operations from pytorch/python. Thanks!

Here is the solution.
First, we create an index tensor B_idx with the same length as A.
Then, we accumulate (add) the elements of A according to the indices in B_idx using index_add_.
import torch

A = torch.arange(1, 8)
B = torch.tensor([2, 3, 2])
B_idx = [idx.repeat(times) for idx, times in zip(torch.arange(len(B)), B)]
B_idx = torch.cat(B_idx)  # tensor([0, 0, 1, 1, 1, 2, 2])
A_sum = torch.zeros_like(B)
A_sum.index_add_(dim=0, index=B_idx, source=A)
print(A_sum)  # tensor([ 3, 12, 13])
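As a side note (not part of the original answer, just a possible simplification): torch.repeat_interleave builds the same index vector in a single call, and Tensor.split can express the reduction without any index vector at all. With A and B as defined above:
B_idx = torch.repeat_interleave(torch.arange(len(B)), B)  # tensor([0, 0, 1, 1, 1, 2, 2])
A_sum = torch.zeros_like(B).index_add_(0, B_idx, A)       # tensor([ 3, 12, 13])
# Or without an explicit index vector, summing each chunk directly:
A_sum2 = torch.stack([chunk.sum() for chunk in A.split(B.tolist())])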

Related

Monte Carlo Simulation with multiple distributions in each loop

I have an array of NaNs 10 columns wide and 5 rows long.
I have a 5x3 array of poisson random number generations. This represents 5 runs of each A, B, and C, where each column has a different lambda value for the poisson distribution.
 A  B  C
[1, 1, 2,
 1, 2, 2,
 2, 1, 4,
 1, 2, 3,
 0, 1, 2]
Each row represents the number of events. That is, the first row would produce one event of type A, one event of type B, and two events of type C.
I would like to loop through each row and produce a set of uniform random numbers. For A, it would be between 1 and 100, for B it would be between 101 and 200, and for C it would be between 201 and 300.
The output of the first row would have four numbers, one number between 1 and 100, one number between 101 and 200, and two numbers between 201 and 300. So a sample output of the first row might be:
[34, 105, 287, 221]
The second output row would have five numbers in it, the third row would have seven, etc. I would like to store it in my array of NaNs by overwriting the NaNs that get replaced in each row. Can anyone please help with this? Thanks!
I've got a rather inefficient/unvectorised method which may or may not be what you're looking for, because one part of your question is unclear to me. Do you want the final array to have rows of different sizes, or to be the same size but padded with nans?
This solution assumes padding with nans, since you talked about the nans being overwritten and didn't mention the extra/unused nans being deleted. I'm also assuming that your ABC thing is structured into a numpy array of size (5,3), and I'm calling the array of nans results_arr.
import numpy as np
from random import randint
# Initializing the arrays
results_arr = np.full((5,10), np.nan)
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
# Loops through each row in ABC
for row_idx in range(len(abc)):
    a, b, c = abc[row_idx]
    # Here, I'm getting a number in the specified uniform distribution as many
    # times as is specified in the A column. The other 2 loops do the same for
    # the B and C columns.
    for i in range(0, a):
        results_arr[row_idx, i] = randint(1, 100)
    for j in range(a, a+b):
        results_arr[row_idx, j] = randint(101, 200)
    for k in range(a+b, a+b+c):
        results_arr[row_idx, k] = randint(201, 300)
Hope that helps!
P.S. Here's a solution with uneven rows. The result is stored in a list of lists because numpy doesn't support ragged arrays (i.e. rows of different lengths).
import numpy as np
from random import randint
# Initializations
results_arr = []
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
# Same code logic as before, just storing the results differently
for row_idx in range(len(abc)):
    a, b, c = abc[row_idx]
    results_this_row = []
    for i in range(0, a):
        results_this_row.append(randint(1, 100))
    for j in range(a, a+b):
        results_this_row.append(randint(101, 200))
    for k in range(a+b, a+b+c):
        results_this_row.append(randint(201, 300))
    results_arr.append(results_this_row)
I hope these two solutions cover what you're looking for!
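A possible numpy-oriented variant (my own sketch, not part of the answer above): numpy's Generator can draw all of a row's numbers in one call by repeating each column's bounds according to its count. Note that Generator.integers excludes the upper bound unless endpoint=True.
import numpy as np

rng = np.random.default_rng()
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
lows, highs = np.array([1, 101, 201]), np.array([100, 200, 300])

results_arr = np.full((len(abc), 10), np.nan)
for row_idx, counts in enumerate(abc):
    # Repeat each column's bounds as many times as that column's count,
    # then draw all of this row's numbers in a single call.
    lo = np.repeat(lows, counts)
    hi = np.repeat(highs, counts)
    draws = rng.integers(lo, hi, endpoint=True)
    results_arr[row_idx, :len(draws)] = draws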

Select specific indexes of 3D Pytorch Tensor using a 1D long tensor that represents indexes

So I have a tensor that is M x B x C, where M is the number of models, B is the batch and C is the classes, and each cell is the probability of a class for a given model and batch. Then I have a tensor of the correct answers, which is just a 1D tensor of size B that we'll call "t". How do I use the 1D tensor of size B to return an M x B x 1 tensor, where the returned tensor is just the value at the correct class? Say the M x B x C tensor is called "blah". I've tried
blah[:, :, t]
for i in range(M):
    blah[i, :, t]
blah[:, t, :]
The top 2 just return the values of indexes t in the 3rd dimension of every slice. The last one returns the values at t indexes in the 2nd dimension. How do I do this?
We can get the desired result by combining advanced and basic indexing:
import torch
# shape [2, 3, 4]
blah = torch.tensor([
    [[ 0,  1,  2,  3],
     [ 4,  5,  6,  7],
     [ 8,  9, 10, 11]],
    [[12, 13, 14, 15],
     [16, 17, 18, 19],
     [20, 21, 22, 23]]])
# shape [3]
t = torch.tensor([2, 1, 0])
b = torch.arange(blah.shape[1]).type_as(t)
# shape [2, 3, 1]
result = blah[:, b, t].unsqueeze(-1)
which results in
>>> result
tensor([[[ 2],
         [ 5],
         [ 8]],

        [[14],
         [17],
         [20]]])
Here is one way to do it:
Suppose a is your M x B x C shaped tensor. I am taking some representative values below,
>>> M = 3
>>> B = 5
>>> C = 4
>>> a = torch.rand(M, B, C)
>>> a
tensor([[[0.6222, 0.6703, 0.0057, 0.3210],
         [0.6251, 0.3286, 0.8451, 0.5978],
         [0.0808, 0.8408, 0.3795, 0.4872],
         [0.8589, 0.8891, 0.8033, 0.8906],
         [0.5620, 0.5275, 0.4272, 0.2286]],

        [[0.2419, 0.0179, 0.2052, 0.6859],
         [0.1868, 0.7766, 0.3648, 0.9697],
         [0.6750, 0.4715, 0.9377, 0.3220],
         [0.0537, 0.1719, 0.0013, 0.0537],
         [0.2681, 0.7514, 0.6523, 0.7703]],

        [[0.5285, 0.5360, 0.7949, 0.6210],
         [0.3066, 0.1138, 0.6412, 0.4724],
         [0.3599, 0.9624, 0.0266, 0.1455],
         [0.7474, 0.2999, 0.7476, 0.2889],
         [0.1779, 0.3515, 0.8900, 0.2301]]])
Let's say the 1D class tensor is t, which gives the true class of each example in the batch. So it is a 1D tensor of shape (B, ) having class labels in the range {0, 1, 2, ..., C-1}.
>>> t = torch.randint(C, size = (B, ))
>>> t
tensor([3, 2, 1, 1, 0])
So basically you want to select the indices corresponding to t from the innermost dimension of a. This can be achieved using fancy indexing and broadcasting combined as follows:
>>> i = torch.arange(M).reshape(M, 1, 1)
>>> j = torch.arange(B).reshape(1, B, 1)
>>> k = t.reshape(1, B, 1)
Note that once you index a with (i, j, k), the three index arrays broadcast together to the shape (M, B, 1), which is the desired output shape.
Now just indexing a by i, j and k gives:
>>> a[i, j, k]
tensor([[[0.3210],
         [0.8451],
         [0.8408],
         [0.8891],
         [0.5620]],

        [[0.6859],
         [0.3648],
         [0.4715],
         [0.1719],
         [0.2681]],

        [[0.6210],
         [0.6412],
         [0.9624],
         [0.2999],
         [0.1779]]])
So essentially, if you generate the index arrays conveying your access pattern beforehand, you can directly use them to extract some slice of the tensor.
You simply need to pass:
your index as the third slice
range(B) as the second slice
(i.e. which element in the 2nd dim each 3rd dim index corresponds to)
blah[:,range(B),t]
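For completeness, torch.gather can express the same selection. This is a sketch of an alternative rather than part of the answers above, reusing the small example from the first answer:
import torch

blah = torch.arange(24).reshape(2, 3, 4)   # same values as the first answer's example
t = torch.tensor([2, 1, 0])
# Expand t across the model dimension, then gather along the class dimension
result = blah.gather(2, t.view(1, -1, 1).expand(blah.shape[0], -1, 1))
# result has shape [2, 3, 1] and contains 2, 5, 8, 14, 17, 20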

How to find the nearest neighbor in numpy?

There are two arrays, u and v:
u.shape = (N,d)
v.shape = (q,d)
For each of the q rows of v, I need to find the index in u of the nearest value, separately for each of the d columns.
For example:
u = [[5,3],
     [3,4],
     [3,2],
     [8,7]] , shape (4,2)
v = [[1,3],
     [2,4]] , shape (2,2)
and I found many people saying we can do this:
v = np.expand_dims(v, axis=1) # reshape to (2,1,2) for broadcasting
result = np.argmin(abs(v-u), axis=1) # (v-u).shape = (2,4,2)
Of course it finds the nearest value's index. But when there are two equally near values, I need to take the "second" one's index.
In that case:
v-u = [[[-4,  0],
        [-2, -1],
        [-2,  1],
        [-7, -4]],
       [[-3,  1],
        [-1,  0],
        [-1,  2],
        [-6, -3]]]
Along axis=1, there are two -2 in (v-u)[0,:,0] and two -1 in (v-u)[1,:,0].
If we directly use:
result = np.argmin(abs(v-u),axis=1)
result will be:
array([[1, 0],
       [1, 1]], dtype=int64)
It returns the indices corresponding to the first occurrence, but I need the second one, i.e.
array([[2, 0],
       [2, 1]], dtype=int64)
Can anyone help? Thanks!
If there can be at most 2 minimal values, you can retrieve the index of the last minimum. To do it:
reverse abs(v-u) along axis 1,
compute argmin, getting a "reversed index" (actually the index into the reversed array),
map back to the "original" indices using the formula u.shape[0] - 1 - <reversed_index> (in your case of 4 rows, reversed index == 3 corresponds to original index == 0).
The whole code is:
u.shape[0] - 1 - np.argmin(abs(v-u)[:,::-1,:],axis=1)
Another choice, when there can be more than 2 minimal values, is to write a specialized version of argmin for a 1-D input array, returning the index of the second minimal value if there is a tie for the minimum:
def argmin2(arr):
    ind = arr.argpartition(1)[:2]
    return ind[0] if arr[ind[0]] < arr[ind[1]] else ind[1]
and then apply it to abs(v-u) along axis 1:
np.apply_along_axis(argmin2, 1, abs(v-u))
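For reference, here is the first approach put together as a runnable snippet (my own consolidation, using the arrays from the question):
import numpy as np

u = np.array([[5, 3], [3, 4], [3, 2], [8, 7]])   # shape (4, 2)
v = np.array([[1, 3], [2, 4]])                   # shape (2, 2)

d = np.abs(np.expand_dims(v, axis=1) - u)        # shape (2, 4, 2)
# Reverse along axis 1 so argmin hits the last occurrence of a tied minimum,
# then map the "reversed" index back to an index into the original rows.
result = u.shape[0] - 1 - np.argmin(d[:, ::-1, :], axis=1)
print(result)
# [[2 0]
#  [2 1]]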

Find the number of clusters in a list of integers

Let's consider the distance d(a, b) = number of digits which are pairwise different in a and b, e.g.:
d(1003000000, 1000090000) = 2 # the 4th and 6th digits don't match
(we only work with 10-digit numbers) and this list:
L = [2678888873,
     2678878873, # distance 1 from L[0]
     1000000000,
     1000040000, # distance 1 from L[2]
     1000300000, # distance 1 from L[2], distance 2 from L[3]
     1000300009, # distance 1 from L[4], distance 2 from L[2]
     ]
I would like to find the minimal number of points P such that each integer in the list is at a distance <= 1 of a point in P.
Here I think this number is 3: every number in the list is at distance <= 1 of 2678888873, 1000000000, or 1000300009.
I imagine an O(n^2) algorithm is possible by first computing a distance matrix i.e. M[i, j] = d(L[i], L[j]).
Is there a better way to do this, especially using Numpy? (maybe there's a built-in algorithm in Numpy/Scipy?)
PS: If we see these 10-digit integers as strings, we're close to finding a minimal number of clusters in a list of many words with a Levenshtein distance.
PS2: I now realize this distance has a name on strings: the Hamming distance.
Let's see what we know from the distance metric. Given a number P (not necessarily in L), if two members of L are within distance 1 of P, they each share 9 digits with P, but not necessarily the same ones, so they are only guaranteed to share 8 digits with each other. So any two numbers that are at distance 2 from each other are guaranteed to have two unique Ps that are distance 1 from each of them (and distance 2 from each other as well). You can use this information to reduce the amount of brute-force effort required to optimize the selection of P.
Let's say you have a distance matrix. You can immediately discard rows (or columns) that don't have entries less than 3: they are their own cluster automatically. For the remaining entries that are equal to 2, construct a list of possible P values. Find the number of elements of L that are within 1 of each element of P (another distance matrix). Sort P by the number of neighbors, and select. You will need to update the matrix at each iteration as you remove members with maximal neighbors to avoid inefficient grouping due to overlap (members of L that are near multiple members of P).
You can compute a distance matrix for L in numpy by first converting it to a 2D array of digits:
import numpy as np

L = np.array([2678888873, 2678878873, 1000000000, 1000040000, 1000300000, 1000300009])
z = 10 # Number of digits
n = len(L) # Number of numbers
dec = 10**np.arange(z).reshape(-1, 1).astype(np.int64)
digits = (L // dec) % 10
digits is now a 10xN array:
array([[3, 3, 0, 0, 0, 9],
       [7, 7, 0, 0, 0, 0],
       [8, 8, 0, 0, 0, 0],
       [8, 8, 0, 0, 0, 0],
       [8, 7, 0, 4, 0, 0],
       [8, 8, 0, 0, 3, 3],
       [8, 8, 0, 0, 0, 0],
       [7, 7, 0, 0, 0, 0],
       [6, 6, 0, 0, 0, 0],
       [2, 2, 1, 1, 1, 1]], dtype=int64)
You can compute the distance between digits and itself, or digits and any other 10xM array using != and sum along the right axis:
distance = (digits[:, None, :] != digits[..., None]).sum(axis=0)
The result:
array([[ 0,  1, 10, 10, 10, 10],
       [ 1,  0, 10, 10, 10, 10],
       [10, 10,  0,  1,  1,  2],
       [10, 10,  1,  0,  2,  3],
       [10, 10,  1,  2,  0,  1],
       [10, 10,  2,  3,  1,  0]])
We are only concerned with the upper (or lower) triangle of that matrix, so we can immediately mask out the other triangle:
distance[np.tril_indices(n)] = z + 1
Find all candidate values of P: all elements of L, but also all pairs between elements that have distance 2:
# Find indices of pairs that differ by 2
indices = np.nonzero(distance == 2)
# Extract those numbers as 10xKx2 array
d = digits[:, np.stack(indices, axis=1)]
# Compute where the difference is nonzero (z x K)
locs = np.diff(d, axis=2).astype(bool).squeeze()
# Find the index of the first digit to replace (K)
s = np.argmax(locs, axis=0)
The extra values of P are constructed from each half of d, with the digit at the position given by s swapped in from the other half:
P0 = digits[:, indices[0]]
P1 = digits[:, indices[1]]
k = np.arange(s.size)
tmp = P0[s, k]
P0[s, k] = P1[s, k]
P1[s, k] = tmp
Pextra = np.unique(np.concatenate((P0, P1), axis=1), axis=1)
So now you can compute the total set of possibilities for P:
P = np.concatenate((digits, Pextra), axis=1)
distance2 = (P[:, None, :] != digits[..., None]).sum(axis=0)
You can discard any elements of Pextra that match with elements of digits based on the distance:
mask = np.concatenate((np.ones(n, bool), distance2[:, n:].all(axis=0)))
P = P[:, mask]
distance2 = distance2[:, mask]
Now you can iteratively compare P with L, select the best values of P, and remove the elements of L that are already covered from the distance matrix. A greedy selection from P will not necessarily be optimal, since an alternative combination may require fewer elements due to overlaps, but that is a matter for a simple (but somewhat expensive) graph traversal algorithm. The following snippet just shows a simple greedy selection, which will work fine for your toy example:
distMask = distance2 <= 1
quality = distMask.sum(axis=0)
clusters = []
accounted = 0
while accounted < n:
    # Get the cluster location
    best = np.argmax(quality)
    # Get the cluster number
    clusters.append(P[:, best].dot(dec).item())
    # Remove numbers in this cluster from consideration
    accounted += quality[best]
    quality -= distMask[distMask[:, best], :].sum(axis=0)
The last couple of steps can be optimized using sets and graphs, but this shows a starting point for a valid approach. This is going to be slow for large data, but probably not prohibitively so. Do some benchmarks to decide how much time you want to spend optimizing vs just running the algorithm.
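As a side note (an assumption on my part, not part of the answer above): if SciPy is available, the distance-matrix step can also be written with scipy.spatial.distance.cdist, whose 'hamming' metric returns the fraction of mismatching positions, so scaling by the number of digits reproduces the same matrix:
import numpy as np
from scipy.spatial.distance import cdist

L = np.array([2678888873, 2678878873, 1000000000, 1000040000, 1000300000, 1000300009])
z = 10
dec = 10**np.arange(z).reshape(-1, 1).astype(np.int64)
digits = (L // dec) % 10   # z x N digit matrix, as above

# cdist works on rows, so transpose; scale the fraction back up to a digit count
distance = (cdist(digits.T, digits.T, metric='hamming') * z).round().astype(int)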

Sorting by another matrix works in one case but fails for another

I need to sort matrices according to the descending order of the values in another matrix.
E.g. in a first step I would have the following matrix A:
1 0 1 0 1
0 1 0 1 0
0 1 0 1 1
1 0 1 0 0
Then for the procedure I am following I need to take the rows of the matrix as binary numbers and sort them in descending order of their binary value.
I am doing this the following way:
for i in range(0,num_rows):
    for j in range(0,num_cols):
        row_val[i] = row_val[i] + A[i][j] * (2 ** (num_cols - 1 - j))
This gets me a 4x1 vector row_val with the following values:
21
10
11
20
Now I am sorting the rows of the matrix according to row_val by
A = [x for _,x in sorted(zip(row_val,A),reverse=True)]
This works perfectly fine; I get the matrix A:
1 0 1 0 1
1 0 1 0 0
0 1 0 1 1
0 1 0 1 0
However, now I need to apply the same procedure to the columns. So I calculate the col_val vector with the binary values of the columns:
12
3
12
3
3
To sort the matrix A according to the vector col_val I thought I could just transpose matrix A and then do the same as before:
At = np.transpose(A)
At = [y for _,y in sorted(zip(col_val,At),reverse=True)]
Unfortunatly this fails with the error message
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I am suspecting that this might be because there are several entries with the same value in vector col_val, however in an example shown in another question the sorting seems to work for a case with several equal entries.
Your suspicion is correct: you can't sort multidimensional numpy arrays using the Python builtin sorted, because the comparison of two rows, say, will yield a row of truth values instead of a single one
A[0] < A[1]
# array([False, True, False, True, False])
so sorted can't tell which should go before the other.
In your first example this is masked by the lexicographic ordering of tuples: because tuples are compared left to right, and because row_val has unique entries, the comparison never looks at the second elements.
But in your second example, because some col_val entries are equal, the comparison looks at At for a tie breaker, which is where the exception occurs.
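A tiny illustration of that difference (my own example, with made-up key values):
import numpy as np

rows = [np.array([1, 0]), np.array([0, 1])]
sorted(zip([21, 10], rows))       # fine: 21 != 10, so the arrays are never compared
try:
    sorted(zip([3, 3], rows))     # equal keys force a comparison of the arrays...
except ValueError as err:
    print(err)                    # ...which raises the ambiguity error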
Here is a working method which uses numpy methods:
A[np.argsort(np.packbits(A, axis=1).ravel())[::-1]]
# array([[1, 0, 1, 0, 1],
#        [1, 0, 1, 0, 0],
#        [0, 1, 0, 1, 1],
#        [0, 1, 0, 1, 0]])
A[:, np.argsort(np.packbits(A, axis=0).ravel())[::-1]]
# array([[1, 1, 1, 0, 0],
#        [0, 0, 0, 1, 1],
#        [1, 0, 0, 1, 1],
#        [0, 1, 1, 0, 0]])
Explanation:
np.packbits, as the name suggests, packs binary vectors into a bit field; it is almost equivalent to your hand-written code. There is one small difference: packbits operates on chunks of 8 and pads with zeros on the right, so for example [1, 1] will go to 192, not 3.
np.argsort does an indirect sort: it doesn't actually move the elements of its operand A, but just writes down the sequence of indices I into A which would sort it (A[I] == np.sort(A)). This is useful when we want to sort something based on the order of something else, as in this case.
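A quick check of the padding behaviour mentioned above (my own example):
import numpy as np

np.packbits(np.array([[1, 1]], dtype=np.uint8), axis=1)
# array([[192]], dtype=uint8)   i.e. 11000000: the two bits are padded out to a full byte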
