Summing over specific indices in PyTorch (similar to scatter_add) - python

I have a matrix whose rows each belong to some label, in no particular order. I want to sum all rows for each label.
Here is how it can be done with a loop:
labels = torch.tensor([0, 1, 0])
x = torch.tensor([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
torch.stack([torch.sum(x[labels == i], dim=0) for i in torch.unique(labels)])
desired output:
tensor([[ 8, 10, 12],
        [ 4,  5,  6]])
EDIT: Just to make it clear, I have the labels tensor, I know which labels repeat, I am interested in computing the final line without the use of a loop. I was thinking scatter_add_ or gather might help.

Just use the torch.index_add_ function.
labels = torch.tensor([0, 1, 0])
x = torch.tensor([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
nrow = torch.unique(labels).size(0)
ncol = x.size(1)
out = torch.zeros((nrow, ncol), dtype=x.dtype)
out.index_add_(0, labels, x)
print(out)
# the output will be
# tensor([[ 8, 10, 12],
#         [ 4,  5,  6]])
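Since the question mentions scatter_add_, here is roughly the same computation written with it (a sketch; note that scatter_add_ expects the index tensor expanded to the shape of the source):
labels = torch.tensor([0, 1, 0])
x = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
out = torch.zeros((2, x.size(1)), dtype=x.dtype)
# expand labels to x's shape so every element of x knows its target row
index = labels.unsqueeze(1).expand_as(x)
out.scatter_add_(0, index, x)
# tensor([[ 8, 10, 12],
#         [ 4,  5,  6]])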

First, I tried to find the repeated labels:
import numpy as np
import torch

def get_repeated_labels(label_list):
    """
    Args:
        label_list (ndarray): target list
    Return:
        (list): Groups of indices for each repeated label
    """
    records_array = label_list
    values, inverse, count = np.unique(records_array,
                                       return_inverse=True,
                                       return_counts=True)
    repeated = np.where(count > 1)[0]
    repeated = values[repeated]
    rows, cols = np.where(inverse == repeated[:, np.newaxis])
    _, inverse_rows = np.unique(rows, return_index=True)
    res = np.split(cols, inverse_rows[1:])
    return res

if __name__ == '__main__':
    labels = torch.tensor([0, 1, 0])
    r = get_repeated_labels(labels.numpy())
Output:
[array([0, 2])]
This means the 0th and the 2nd indices are repeating, so we need to sum the rows at indices 0 and 2:
torch.sum(x[r[i]], dim=0)
But r only contains groups for the repeated labels (here just one group), while we have two labels. Therefore I used an if-else condition.
Final:
print(torch.stack([torch.sum(x[r[i]], dim=0) if len(r) >= i + 1 else x[i] for i, _ in enumerate(torch.unique(labels))]))
Output:
tensor([[ 8, 10, 12],
        [ 4,  5,  6]])

Related

How to get the indices of at least two consecutive values that are all greater than a threshold?

For example, let's consider the following numpy array:
[1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
Also, let's suppose that the threshold is equal to 3.
That is to say that we are looking for sequences of at least two consecutive values that are all above the threshold.
The output would be the indices of those values, which in our case is:
[[3, 4, 5], [8, 9]]
If the output array was flattened that would work as well!
[3, 4, 5, 8, 9]
Output Explanation
In our initial array we can see that for index = 1 we have the value 5, which is greater than the threshold, but is not part of a sequence (of at least two values) where every value is greater than the threshold. That's why this index would not make it to our output.
On the other hand, for indices [3, 4, 5] we have a sequence of (at least two) neighboring values [5, 4, 6] where each and every one of them is above the threshold, and that's the reason their indices are included in the final output!
My Code so far
I have approached the issue with something like this:
(arr > 3).nonzero()
The above command gathers the indices of all the items that are above the threshold. However, I cannot determine whether they are consecutive or not. I have thought of trying a diff on the outcome of the above snippet and then maybe locating the ones (that is to say, indices that follow one another). Which would give us:
np.diff((arr > 3).nonzero())
But I'd still be missing something here.
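For what it's worth, the diff idea can be completed along these lines (a sketch of one possible way, keeping an index whenever it starts or ends a consecutive pair):
import numpy as np

arr = np.array([1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
idx = (arr > 3).nonzero()[0]   # indices of values above the threshold
consec = np.diff(idx) == 1     # True where neighboring indices touch
# keep an index if it opens or closes a consecutive pair
mask = np.r_[consec, False] | np.r_[False, consec]
print(idx[mask])               # [3 4 5 8 9]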
If you convolve a boolean array with a window full of 1s of size win_size ([1] * win_size), then you will obtain an array containing the value win_size at every position where the condition held for win_size consecutive items:
import numpy as np
def groups(arr, *, threshold, win_size, merge_contiguous=False, flat=False):
    conv = np.convolve((arr >= threshold).astype(int), [1] * win_size, mode="valid")
    indexes_start = np.where(conv == win_size)[0]
    indexes = [np.arange(index, index + win_size) for index in indexes_start]
    if flat or merge_contiguous:
        indexes = np.unique(indexes)
        if merge_contiguous:
            indexes = np.split(indexes, np.where(np.diff(indexes) != 1)[0] + 1)
    return indexes
arr = np.array([1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
threshold = 3
win_size = 2
print(groups(arr, threshold=threshold, win_size=win_size))
print(groups(arr, threshold=threshold, win_size=win_size, merge_contiguous=True))
print(groups(arr, threshold=threshold, win_size=win_size, flat=True))
[array([3, 4]), array([4, 5]), array([8, 9])]
[array([3, 4, 5]), array([8, 9])]
[3 4 5 8 9]
You can do what you want using simple numpy operations
import numpy as np
arr = np.array([1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
arr_padded = np.concatenate(([0], arr, [0]))
a = np.where(arr_padded > 3, 1, 0)
da = np.diff(a)
idx_start = (da == 1).nonzero()[0]
idx_stop = (da == -1).nonzero()[0]
valid = (idx_stop - idx_start >= 2).nonzero()[0]
result = [list(range(idx_start[i], idx_stop[i])) for i in valid]
print(result)
Explanation
Array a is a padded binary version of the original array, with 1s where the original elements are greater than three. da contains 1s where "islands" of 1s begin in a, and -1 where the "islands" end in a. Due to the padding, there is guaranteed to be an equal number of 1s and -1s in da. Extracting their indices, we can calculate the length of the islands. Valid index pairs are those whose respective "islands" have length >= 2. Then, it's just a matter of generating all numbers between the index bounds of the valid "islands".
I follow your original idea. You are almost done.
I use another diff2 to pick the index of the first value in a sequence. See comments in code for details.
import numpy as np
arr = np.array([ 1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
threshold = 3
all_idx = (arr > threshold).nonzero()[0]
# array([1, 3, 4, 5, 8, 9])
result = np.empty(0)
if all_idx.size > 1:
    diff1 = np.zeros_like(all_idx)
    diff1[1:] = np.diff(all_idx)
    # array([0, 2, 1, 1, 3, 1])
    diff1[0] = diff1[1]
    # array([2, 2, 1, 1, 3, 1])
    # **Positions with a value 1 in diff1 should be kept.**
    # But we also want the position before each 1. Create another diff2
    diff2 = np.zeros_like(all_idx)
    diff2[:-1] = np.diff(diff1)
    # array([ 0, -1,  0,  2, -2,  0])
    # **Positions with a negative value in diff2 should be kept.**
    result = all_idx[(diff1 == 1) | (diff2 < 0)]
print(result)
# array([3, 4, 5, 8, 9])
I'll try something different using window views, I'm not sure this works all the time so counterexamples are welcome. It has the advantage of not requiring Python loops.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as window
def consec_thresh(arr, thresh):
    win = window(np.argwhere(arr > thresh), (2, 1))
    return np.unique(win[np.diff(win, axis=2).ravel() == 1, :, :].ravel())
How does it work?
So we start with the array and gather the indices where the threshold is met:
In [180]: np.argwhere(arr > 3)
Out[180]:
array([[1],
       [3],
       [4],
       [5],
       [8],
       [9]])
Then we build a sliding window that makes up pairs of values along the column (which is the reason for the (2, 1) shape of the window).
In [181]: window(np.argwhere(arr > 3), (2, 1))
Out[181]:
array([[[[1],
         [3]]],
       [[[3],
         [4]]],
       [[[4],
         [5]]],
       [[[5],
         [8]]],
       [[[8],
         [9]]]])
Now we want to take the difference inside each pair, if it's one then the indices are consecutive.
In [182]: np.diff(window(np.argwhere(arr > 3), (2, 1)), axis=2)
Out[182]:
array([[[[2]]],
       [[[1]]],
       [[[1]]],
       [[[3]]],
       [[[1]]]])
We can plug those values back into the windows we created above:
In [185]: window(np.argwhere(arr > 3), (2, 1))[np.diff(window(np.argwhere(arr > 3), (2, 1)), axis=2).ravel() == 1, :, :]
Out[185]:
array([[[[3],
         [4]]],
       [[[4],
         [5]]],
       [[[8],
         [9]]]])
Then we ravel (flatten, without copy when possible), get rid of the repeated indices created by windowing with np.unique, and get:
array([3, 4, 5, 8, 9])
The iterative code below should help, with O(n) complexity:
arr = [1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
threshold = 3
sequence = 2
output = []
temp_arr = []
for i in range(len(arr)):
    if arr[i] > threshold:
        temp_arr.append(i)
    else:
        if len(temp_arr) >= sequence:
            output.append(temp_arr)
        temp_arr = []
# flush the last run (it must also meet the minimum length)
if len(temp_arr) >= sequence:
    output.append(temp_arr)
print(output)
# Output
# [[3, 4, 5], [8, 9]]
I would suggest using a for loop with two indices. You will have one that starts at j=1 and the other at i=0, both stepping forward by 1.
You can then ask whether the values at both are greater than the threshold; if so,
add the indices to a list and keep moving forward with j until the value at j is no longer greater than the threshold.
values = [1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
res = []
threshold = 3
i = 0
j = 0
for _ in values:
    j = i + 1
    lista = []
    try:
        print(f"i: {i} j:{j}")
        # check if condition is met
        if values[i] > threshold and values[j] > threshold:
            lista.append(i)
            # add sequence
            while values[j] > threshold:
                lista.append(j)
                print(f"j while: {j}")
                j += 1
                if j >= len(values):
                    break
            res.append(lista)
        i = j
        if j >= len(values):
            break
    except:
        print("ex")
This works, but needs refactoring.
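For reference, here is one possible refactor of the same two-pointer idea (my sketch, not the original author's code):
def consecutive_above(values, threshold, min_len=2):
    res = []
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] > threshold:
            j += 1                         # extend the run while above threshold
        if j - i >= min_len:
            res.append(list(range(i, j)))  # keep runs of at least min_len
        i = max(j, i + 1)                  # jump past the run, or step forward
    return res

print(consecutive_above([1, 5, 0, 5, 4, 6, 1, -1, 5, 10], 3))
# [[3, 4, 5], [8, 9]]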
Let's try the following code:
# Simple is better than complex
# Complex is better than complicated
arr = [1, 5, 0, 5, 4, 6, 1, -1, 5, 10]
arr_3 = [i if arr[i] > 3 else 'a' for i in range(len(arr))]
arr_4 = ''.join(str(x) for x in arr_3)
arr_5 = arr_4.split('a')  # "a1a345aa89" -> ['', '1', '', '345', '', '89']
i = 0
while i < len(arr_5):
    if len(arr_5[i]) <= 1:
        del arr_5[i]   # drop runs shorter than two indices
    else:
        i += 1
arr_6 = [list(map(lambda x: int(x), list(x))) for x in arr_5]
print(arr_6)
# note: this string trick only works while the indices are single digits
Outputs:
[[3, 4, 5], [8, 9]]
Here is a solution that makes use of pandas Series:
import numpy as np
import pandas as pd

arr = np.array([1, 5, 0, 5, 4, 6, 1, -1, 5, 10])
thresh = 3
win_size = 2
s = pd.Series(arr)
# locating groups of values where there are at least (win_size) consecutive values above the threshold
groups = s.groupby(s.le(thresh).cumsum().loc[s.gt(thresh)]).transform('count').ge(win_size)
0    False
1    False
2    False
3     True
4     True
5     True
6    False
7    False
8     True
9     True
dtype: bool
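If the chained expression is hard to parse, an equivalent step-by-step decomposition (my own, not from the original answer) could look like this:
ids = s.le(thresh).cumsum()          # run id, incremented after each value <= thresh
above = ids.loc[s.gt(thresh)]        # keep only the positions above the threshold
run_len = above.groupby(above).transform('count')   # length of each run
groups = run_len.ge(win_size).reindex(s.index, fill_value=False)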
We can now easily take their indices in a 1D array:
np.flatnonzero(groups)
# array([3, 4, 5, 8, 9], dtype=int64)
Or as multiple lists:
[np.arange(index.start, index.stop) for index in np.ma.clump_unmasked(np.ma.masked_not_equal(groups.values, value=True))]
# [array([3, 4, 5], dtype=int64), array([8, 9], dtype=int64)]

Best way to find indexes of first occurrences of integers in each row of numpy array?

If I have an array such as:
a = np.array([[1, 1, 2, 2, 1, 3, 4],
              [8, 7, 7, 7, 4, 8, 8]])
what would be the best way to get as output:
array([[0, 2, 5, 6], [0, 1, 4]])
or
array([[0, 2, 5, 6], [4, 1, 0]])
These are the indices of the first occurrence of each integer in each row. The order of the indices is not important.
Currently I am using:
res = []
for row in a:
    unique, unique_indexes = np.unique(row, return_index=True)
    res.append(unique_indexes)
But I wonder if there is a (num)pythonic way to avoid the for loop.
You can transform the array in such a way that you process the entire thing in one batch. Let's start with an example very similar to the one in your question:
a = np.array([[1, 1, 2, 2, 1, 3, 4], [8, 7, 7, 7, 5, 8, 8]])
Now get the indices:
_, ix = np.unique(a, return_index=True)
# ix = array([ 0, 2, 5, 6, 11, 8, 7])
Notice that the indices of the first row's elements are correct. Elements from the following rows are offset by multiples of the row length. In general, the row offset is
offset = ix // a.shape[-1]
# offset = array([0, 0, 0, 0, 1, 1, 1])
ix %= a.shape[-1]
# ix = array([0, 2, 5, 6, 4, 1, 0])
You can call np.split on the new ix at every location where offset changes value:
ix = np.split(ix, np.flatnonzero(np.diff(offset)) + 1)
So why is this example valid, but the one in the question is not? The key is that np.unique uses a sort-based approach (which makes it run in O(n log n) rather than the O(n) of collections.Counter). That means that for the indices to land in the right rows, every value in a row must be distinct from, and larger than, every value in the preceding rows. Notice that in your example, 4 appears in both rows, so its first occurrence in the second row is shadowed by the one in the first.
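A quick illustration of the shadowing, using the array from the question (my own check):
import numpy as np

a_bad = np.array([[1, 1, 2, 2, 1, 3, 4], [8, 7, 7, 7, 4, 8, 8]])
_, ix = np.unique(a_bad, return_index=True)
# ix = [0, 2, 5, 6, 8, 7]: six indices for what should be seven groups, because
# the 4 at flat index 11 (row 1) is hidden by the 4 at flat index 6 (row 0)
You can ensure the rows are disjoint and increasing with a simple check of the max and min values in each row: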
mn = a.min(axis=1)
mx = a.max(axis=1)
diff = np.r_[0, (mx - mn + 1)[:-1].cumsum(0)] - mn
# diff = array([-1, -1])
b = a + diff[:, None]
# b = array([[0, 0, 1, 1, 0, 2, 3],
#            [7, 6, 6, 6, 4, 7, 7]])
Notice that you have to offset the cumulative sum by one to get the right index. If you deal with large integers and/or very large arrays, you will have to be more careful about how you build diff to avoid overflow.
Now you can use b in place of a in the call to np.unique.
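For example, a minimal guard (my suggestion, assuming integer input) is to promote the row spans to int64 before accumulating:
span = mx.astype(np.int64) - mn + 1       # per-row value range, in a wide dtype
diff = np.r_[0, span[:-1].cumsum()] - mn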
TL;DR
Here is a general no-loop approach, applicable to any axis, not just the last:
def global_unq(a, axis=-1):
    n = a.shape[axis]
    a = np.moveaxis(np.asanyarray(a), axis, -1).reshape(-1, n)
    mn = a.min(-1)
    mx = a.max(-1)
    diff = np.r_[0, (mx - mn + 1)[:-1].cumsum(0)] - mn
    _, ix = np.unique(a + diff[:, None], return_index=True)
    return np.split(ix % n, np.flatnonzero(np.diff(ix // n)) + 1)
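A quick sanity check with the array from the question (my own test):
a = np.array([[1, 1, 2, 2, 1, 3, 4], [8, 7, 7, 7, 4, 8, 8]])
print(global_unq(a))
# [array([0, 2, 5, 6]), array([4, 1, 0])]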
You could put it into a list comprehension, but your loop is fairly clean already:
[list(np.unique(e, return_index=True)[-1]) for e in a]
# [[0, 2, 5, 6], [4, 1, 0]]

compare entire row and column in numpy array and delete selected rows and columns

I have two square arrays with shape = (25, 25), and I want to check whether an entire row is filled with zeros and the corresponding column is filled with zeros. If this is the case, I want to remove those columns and rows from the array.
For example:
array = np.array([[1, 0, 1, 1],
                  [0, 0, 0, 0],
                  [1, 0, 1, 1],
                  [1, 0, 1, 1]])
I want it manipulated to
array = np.array([[1, 1, 1],
                  [1, 1, 1],
                  [1, 1, 1]])
I hope you can understand what I am aiming at. In this example row and column two have been removed as they are zero rows/columns.
I could do that by iterating through all of those arrays, but as I have 10 million of them I would like a pythonic/efficient way to solve this.
The second array is a tensorflow tensor; manipulating it should be no problem if I know the indices of the rows and columns I want removed.
Edit:
I have now found the following solution, but it uses for-loops:
def removepadding(y_true, y_pred):
    shape = np.shape(y_true)
    y_true_cleaned = []
    for i in range(shape[0]):
        x = y_true[i]
        for n in range(shape[1] - 1, -1, -1):
            if sum(x[n, :]) == 0 and sum(x[:, n]) == 0:
                x = np.delete(np.delete(x, n, 0), n, 1)
        y_true_cleaned.append(x)
    return y_true_cleaned
You can do it in one line:
array[array.any(axis = 1)][:, array.any(axis = 0)]
# array([[1, 1, 1],
#        [1, 1, 1],
#        [1, 1, 1]])
If there are negative values in the array, an np.sum-based check may fail, because positive and negative entries can cancel out.
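A tiny illustration:
import numpy as np

row = np.array([1, -1, 0])
print(row.sum())         # 0, even though the row is not all zeros
print(np.all(row == 0))  # False, which is the check we actually want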
For a 2D array:
import numpy as np
a = np.array([[1, 0, 2, 3, 0, 4],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [2, 0, 3, 4, 0, 5],
              [3, 0, 4, 5, 0, 6],
              [4, 0, 5, 6, 0, 7],
              [5, 0, 6, 7, 0, 8]])
row = np.all(a==0, axis=1)
col = np.all(a==0, axis=0)
a[~row][:,~col]
Output:
array([[1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8]])
For a 3D array:
a = np.ones((3,3,3))
a[1,:,1] = 0
a[1,1,:] = 0
a[:,1,1] = 0
z = np.all(a==0, axis=2)
y = np.all(a==0, axis=1)
x = np.all(a==0, axis=0)
Z = ~np.array([z]*a.shape[2])
Y = ~np.array([y]*a.shape[1])
X = ~np.array([x]*a.shape[0])
ZZ, YY, XX = (Z*Y*X).nonzero()
a[ZZ, YY, XX]
You can use np.count_nonzero to get the indices in one step per dimension:
nnz_row = np.count_nonzero(array, axis=1)
nnz_col = np.count_nonzero(array, axis=0)
Now you make a mask of where both are zero:
mask = (nnz_row == 0) & (nnz_col == 0)
You can turn the mask into indices and pass it to np.delete:
ind = np.flatnonzero(mask)
array = np.delete(np.delete(array, ind, axis=0), ind, axis=1)
Alternatively, you can compute the positive mask:
pmask = nnz_row.astype(bool) | nnz_col.astype(bool)
This mask can select directly, analogously to what delete did with the negative mask:
array = array[pmask, :][:, pmask]
Edit: Thanks to @Mad Physicist, we can use np.flatnonzero. Here's the 2d case:
import numpy as np
a = np.array([[1, 0, 2, 3, 0, 4],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [2, 0, 3, 4, 0, 5],
              [3, 0, 4, 5, 0, 6],
              [4, 0, 5, 6, 0, 7],
              [5, 0, 6, 7, 0, 8]])
cols_to_keep = np.flatnonzero(a.sum(axis=0))
rows_to_keep = np.flatnonzero(a.sum(axis=1))
a = a[:, cols_to_keep]
a = a[rows_to_keep, :]
a
>>>
array([[1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8]])
Here's the 3d case:
import numpy as np
a = np.array([
    [[1, 0, 2, 3, 0, 4],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [2, 0, 3, 4, 0, 5],
     [3, 0, 4, 5, 0, 6],
     [4, 0, 5, 6, 0, 7],
     [5, 0, 6, 7, 0, 8]],
    [[0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]],
    [[5, 0, 5, 5, 0, 5],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [2, 0, 3, 4, 0, 5],
     [3, 0, 4, 5, 0, 6],
     [4, 0, 5, 6, 0, 7],
     [5, 0, 6, 7, 0, 8]],
])
ix_keep_axis_0 = np.flatnonzero(a.sum(axis=(1, 2)))
ix_keep_axis_1 = np.flatnonzero(a.sum(axis=(0, 2)))
ix_keep_axis_2 = np.flatnonzero(a.sum(axis=(0, 1)))
a = a[ix_keep_axis_0, :, :]
a = a[:, ix_keep_axis_1, :]
a = a[:, :, ix_keep_axis_2]
a
>>>
array([[[1, 2, 3, 4],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [4, 5, 6, 7],
        [5, 6, 7, 8]],

       [[5, 5, 5, 5],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [4, 5, 6, 7],
        [5, 6, 7, 8]]])

How to use tf.unique_with_counts for each row/column of a tensor

I'm trying to solve KNN using tensorflow. After I get the K neighbours for N vectors, I have a N by K tensor. Now, for each vector in N, I need to use tf.unique_with_counts to find the majority vote. However, I cannot iterate in a tensor and I cannot run tf.unique_with_counts with a multi-dimensional tensor. It keeps giving me InvalidArgumentError (see above for traceback): unique expects a 1D vector.
Example:
import tensorflow as tf

def knnVote():
    '''
    KNN using majority vote
    '''
    # nearest indices
    A = tf.constant([1, 1, 2, 4, 4, 4, 7, 8, 8])
    print(A.shape)
    nearest_k_y, idx, votes = tf.unique_with_counts(A)
    print("y", nearest_k_y.eval())
    print("idx", idx.eval())
    print("votes", votes.eval())
    majority = tf.argmax(votes)
    predict_res = tf.gather(nearest_k_y, majority)
    print("majority", majority.eval())
    print("predict", predict_res.eval())
    return predict_res
Result:
y [1 2 4 7 8]
idx [0 0 1 2 2 2 3 4 4]
votes [2 1 3 1 2]
majority 2
predict 4
But how can I extend this to an N-by-D input A, such as:
A = tf.constant([[1, 1, 2, 4, 4, 4, 7, 8, 8],
                 [2, 2, 3, 3, 3, 4, 4, 5, 6]])
You can use tf.while_loop to iterate over A rows and process each row independently. This requires a little bit of dark magic with shape_invariants (to accumulate the results) and careful processing in a loop body. But it becomes more or less clear after you stare at it for some time.
Here's the code:
def multidimensionalKnnVote():
    A = tf.constant([
        [1, 1, 2, 4, 4, 4, 7, 8, 8],
        [2, 2, 3, 3, 3, 4, 4, 5, 6],
    ])

    def cond(i, all_idxs, all_vals):
        return i < A.shape[0]

    def body(i, all_idxs, all_vals):
        nearest_k_y, idx, votes = tf.unique_with_counts(A[i])
        majority_idx = tf.argmax(votes)
        majority_val = nearest_k_y[majority_idx]
        majority_idx = tf.reshape(majority_idx, shape=(1,))
        majority_val = tf.reshape(majority_val, shape=(1,))
        new_idxs = tf.cond(tf.equal(i, 0),
                           lambda: majority_idx,
                           lambda: tf.concat([all_idxs, majority_idx], axis=0))
        new_vals = tf.cond(tf.equal(i, 0),
                           lambda: majority_val,
                           lambda: tf.concat([all_vals, majority_val], axis=0))
        return i + 1, new_idxs, new_vals

    # This means: starting from 0, apply the `body`, while the `cond` is true.
    # Note that `shape_invariants` allow the 2nd and 3rd tensors to grow.
    i0 = tf.constant(0)
    idx0 = tf.constant(0, shape=(1,), dtype=tf.int64)
    val0 = tf.constant(0, shape=(1,), dtype=tf.int32)
    _, idxs, vals = tf.while_loop(cond, body,
                                  loop_vars=(i0, idx0, val0),
                                  shape_invariants=(i0.shape,
                                                    tf.TensorShape([None]),
                                                    tf.TensorShape([None])))
    print('majority:', idxs.eval())
    print('predict:', vals.eval())
You can use tf.map_fn to apply a function to each row of a matrix variable:
def knnVote(A):
    nearest_k_y, idx, votes = tf.unique_with_counts(A)
    majority = tf.argmax(votes)
    predict_res = tf.gather(nearest_k_y, majority)
    return predict_res

sess = tf.Session()
with sess.as_default():
    B = tf.constant([[1, 1, 2, 4, 4, 4, 7, 8, 8],
                     [2, 2, 3, 3, 3, 4, 4, 5, 6]])
    C = tf.map_fn(knnVote, B)
    print(C.eval())
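As a side note, roughly the same thing in TF 2.x eager mode might look like this (a sketch; no Session needed):
import tensorflow as tf

def knn_vote_row(row):
    values, _, counts = tf.unique_with_counts(row)
    return values[tf.argmax(counts)]

B = tf.constant([[1, 1, 2, 4, 4, 4, 7, 8, 8],
                 [2, 2, 3, 3, 3, 4, 4, 5, 6]])
print(tf.map_fn(knn_vote_row, B).numpy())  # [4 3]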

How to fetch specific rows from a tensor in Tensorflow?

I have a tensor defined as follows:
temp_var = tf.Variable(initial_value=np.asarray([[1, 2, 3],[4, 5, 6],[7, 8, 9],[10, 11, 12]]))
I also have an array of indexes of rows to be fetched from tensor:
idx = tf.constant([0, 2])
Now I want to take a subset of temp_var at those indexes i.e. idx
I know that to take a single index or a slice, we can do something like
temp_var[single_row_index, :]
or
temp_var[start:end, :]
But how to fetch rows indicated by idx array?
Something like temp_var[idx, :] ?
The tf.gather() op does exactly what you need: it selects rows from a matrix (or in general (N-1)-dimensional slices from an N-dimensional tensor). Here's how it would work in your case:
temp_var = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
idx = tf.constant([0, 2])
rows = tf.gather(temp_var, idx)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
print(sess.run(rows)) # ==> [[1, 2, 3], [7, 8, 9]]
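If you are on TF 2.x, the same lookup works eagerly, without a Session (a quick sketch):
import tensorflow as tf

temp_var = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
idx = tf.constant([0, 2])
print(tf.gather(temp_var, idx).numpy())  # [[1 2 3] [7 8 9]]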
