Sorry for the confusing title; I'm not sure how to make it more concise. Here are my requirements:
arr1 = np.array([3,5,9,1])
arr2 = ?(arr1)
arr2 would then be:
[
[0,1,2,0,0,0,0,0,0],
[0,1,2,3,4,0,0,0,0],
[0,1,2,3,4,5,6,7,8],
[0,0,0,0,0,0,0,0,0]
]
It doesn't need to vary based on the max; the shape is known in advance. So to start, I've been able to create an array of zeros:
arr2 = np.zeros((len(arr1),max_len))
And then of course I could do a for loop over arr1 like this:
for i, element in enumerate(arr1):
    arr2[i, 0:element] = np.arange(element)
but that would likely take a long time, since both dimensions here are rather large (arr1 is a few million rows, max_len is around 500). Is there a clean, optimized way to do this in numpy?
Building on a 'padding' idea posted by @Divakar some years ago:
In [161]: res = np.arange(9)[None,:].repeat(4,0)
In [162]: res[res>=arr1[:,None]] = 0
In [163]: res
Out[163]:
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6, 7, 8],
[0, 0, 0, 0, 0, 0, 0, 0, 0]])
Try this with itertools.zip_longest -
import numpy as np
import itertools
l = map(range, arr1)
arr2 = np.column_stack((itertools.zip_longest(*l, fillvalue=0)))
print(arr2)
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6, 7, 8],
[0, 0, 0, 0, 0, 0, 0, 0, 0]])
I am adding a slight variation on @hpaulj's answer because you mentioned that max_len is around 500 and you have millions of rows. In this case, you can precompute a 500-by-500 matrix containing all possible rows and index into it using arr1:
import numpy as np
np.random.seed(0)
max_len = 500
arr = np.random.randint(0, max_len, size=10**5)
# generate all unique rows first, then index
# can be faster if max_len << len(arr)
# 53 ms
template = np.tril(np.arange(max_len)[None,:].repeat(max_len,0), k=-1)
res = template[arr,:]
# 173 ms
res1 = np.arange(max_len)[None,:].repeat(arr.size,0)
res1[res1>=arr[:,None]] = 0
assert (res == res1).all()
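For reference, a broadcasting variant with np.where is also possible; this is just a sketch of the same masking idea and has not been benchmarked against the versions above:

import numpy as np

arr1 = np.array([3, 5, 9, 1])
max_len = 9

# keep the column index wherever it is below the per-row length, else write 0
col = np.arange(max_len)
arr2 = np.where(col < arr1[:, None], col, 0)
print(arr2)
# [[0 1 2 0 0 0 0 0 0]
#  [0 1 2 3 4 0 0 0 0]
#  [0 1 2 3 4 5 6 7 8]
#  [0 0 0 0 0 0 0 0 0]]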
This may be a very simple question, as I am still exploring Python. For this issue I am using numpy.
Updated 09/30/21: I adopted and modified the code shown below for potential future reference. I also added an elif in the loop for classes that have fewer counts than the wanted size. Some of the code may be unnecessary, though.
new_array = test_array.copy()
uniques, counts = np.unique(new_array, return_counts=True)
print("classes:", uniques, "counts:", counts)
for unique, count in zip(uniques, counts):
    #print (unique, count)
    if unique != 0 and count > 3:
        ids = np.random.choice(count, count-3, replace=False)
        new_array[tuple(i[ids] for i in np.where(new_array == unique))] = 0
    elif unique != 0 and count <= 3:
        ids = np.random.choice(count, count, replace=False)
        new_array[tuple(i[ids] for i in np.where(new_array == unique))] = unique
Below is the original question.
Let's say I have a 2D array like this:
test_array = np.array([[0,0,0,0,0],
[1,1,1,1,1],
[0,0,0,0,0],
[2,2,2,4,4],
[4,4,4,2,2],
[0,0,0,0,0]])
print("existing classes:", np.unique(test_array))
# "existing classes: [0 1 2 4]"
Now I want to keep a fixed number of values (e.g. 2) in each class that is != 0 (in this case two 1s, two 2s, and two 4s) and replace the rest with 0, where the values being replaced are chosen at random on each run (or from a seed).
For example, with run 1 I will have
([[0,0,0,0,0],
[1,0,0,1,0],
[0,0,0,0,0],
[2,0,0,0,4],
[4,0,0,2,0],
[0,0,0,0,0]])
with another run it might be
([[0,0,0,0,0],
[1,1,0,0,0],
[0,0,0,0,0],
[2,0,2,0,4],
[4,0,0,0,0],
[0,0,0,0,0]])
etc. Could anyone help me with this?
My strategy is
Create a new array initialized to all zeros
Find the elements in each class
For each class
Randomly sample two of elements to keep
Set those elements of the new array to the class value
The trick is keeping the shape of the indexes appropriate so you retain the shape of the original array.
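As a small illustration of that trick (my own toy example, not part of the solution below): np.nonzero returns one index array per axis, and subsetting each of those arrays with the same sampled positions still forms a valid multi-dimensional index, so the kept values land back in their original positions.

import numpy as np

m = np.array([[0, 7, 0],
              [7, 0, 7]])
rows, cols = np.nonzero(m == 7)     # rows = [0, 1, 1], cols = [1, 0, 2]
keep = np.array([0, 2])             # pretend these positions were sampled
out = np.zeros_like(m)
out[rows[keep], cols[keep]] = 7     # 7s end up at (0, 1) and (1, 2) only

The full function below applies the same idea per class.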
import numpy as np
test_array = np.array([[0,0,0,0,0],
[1,1,1,1,1],
[0,0,0,0,0],
[2,2,2,4,4],
[4,4,4,2,2],
[0,0,0,0,0]])
def sample_classes(arr, n_keep=2, random_state=42):
    classes, counts = np.unique(arr, return_counts=True)
    rng = np.random.default_rng(random_state)
    out = np.zeros_like(arr)
    for klass, count in zip(classes, counts):
        # Find locations of the class elements
        indexes = np.nonzero(arr == klass)
        # Sample up to n_keep elements of the class
        keep_idx = rng.choice(count, n_keep, replace=False)
        # Select the kept elements and reformat for indexing the output array and retaining its shape
        keep_idx_reshape = tuple(ind[keep_idx] for ind in indexes)
        out[keep_idx_reshape] = klass
    return out
You can use it like
In [3]: sample_classes(test_array)
Out[3]:
array([[0, 0, 0, 0, 0],
[0, 1, 1, 0, 0],
[0, 0, 0, 0, 0],
[2, 0, 0, 4, 0],
[4, 0, 0, 2, 0],
[0, 0, 0, 0, 0]])
In [4]: sample_classes(test_array, n_keep=3)
Out[4]:
array([[0, 0, 0, 0, 0],
[1, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 2, 0, 4, 0],
[4, 4, 0, 2, 2],
[0, 0, 0, 0, 0]])
In [5]: sample_classes(test_array, random_state=88)
Out[5]:
array([[0, 0, 0, 0, 0],
[0, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[4, 0, 4, 2, 2],
[0, 0, 0, 0, 0]])
In [6]: sample_classes(test_array, random_state=88, n_keep=4)
Out[6]:
array([[0, 0, 0, 0, 0],
[0, 1, 1, 1, 1],
[0, 0, 0, 0, 0],
[2, 2, 0, 4, 4],
[4, 4, 0, 2, 2],
[0, 0, 0, 0, 0]])
Here is my not-so-elegant solution:
def unique(arr, num=2, seed=None):
    np.random.seed(seed)
    vals = {}
    for i, row in enumerate(arr):
        for j, val in enumerate(row):
            if val in vals and val != 0:
                vals[val].append((i, j))
            elif val != 0:
                vals[val] = [(i, j)]
    new = np.zeros_like(arr)
    for val in vals:
        np.random.shuffle(vals[val])
        while len(vals[val]) > num:
            vals[val].pop()
        for row, col in vals[val]:
            new[row, col] = val
    return new
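A quick usage sketch of my own (assuming the test_array from the question): whatever the seed, each non-zero class ends up with exactly num entries.

out = unique(test_array, num=2, seed=0)
print(out.shape == test_array.shape)                # True
print([int((out == v).sum()) for v in (1, 2, 4)])   # [2, 2, 2]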
The following should be O(n log n) in array size
def keep_k_per_class(data, k, rng):
    out = np.zeros_like(data)
    unq, cnts = np.unique(data, return_counts=True)
    assert (cnts >= k).all()
    # calculate class boundaries from class sizes
    CNTS = cnts.cumsum()
    # indirectly group classes together by partial sorting
    idx = data.ravel().argpartition(CNTS[:-1])
    # the following lines implement simultaneous drawing without replacement
    # from all classes
    # lower boundaries of intervals to draw random numbers from
    # for each class they start with the lower class boundary
    # and from there grow one by one - together with the
    # swapping out below this implements "without replacement"
    lb = np.add.outer(np.arange(k), CNTS - cnts)
    pick = rng.integers(lb, CNTS, lb.shape)
    for l, p in zip(lb, pick):
        # populate output array
        out.ravel()[idx[p]] = unq
        # swap out used indices so still available ones occupy a linear
        # range (per class)
        idx[p] = idx[l]
    return out
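The grouping step may be easier to see in isolation. Here is a small sketch of my own (not part of the function above) showing how argpartition at the cumulative class boundaries clusters the flat indices of each class into contiguous blocks:

import numpy as np

data = np.array([2, 1, 1, 2, 1])
unq, cnts = np.unique(data, return_counts=True)   # [1 2], [3 2]
CNTS = cnts.cumsum()                              # [3 5]
idx = data.argpartition(CNTS[:-1])
print(data[idx])                                  # the three 1s first, then the two 2s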
Examples:
>>> rng = np.random.default_rng()
>>> keep_k_per_class(test_array,2,rng)
array([[0, 0, 0, 0, 0],
[1, 1, 0, 0, 0],
[0, 0, 0, 0, 0],
[2, 0, 2, 0, 4],
[0, 4, 0, 0, 0],
[0, 0, 0, 0, 0]])
>>> keep_k_per_class(test_array,2,rng)
array([[0, 0, 0, 0, 0],
[1, 1, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 2, 0, 0, 0],
[4, 0, 4, 0, 2],
[0, 0, 0, 0, 0]])
and a large one
>>> BIG = np.add.outer(np.tile(test_array,(100,100)),np.arange(0,500,5))
>>> BIG.size
30000000
>>> res = keep_k_per_class(BIG,30,rng)
### takes ~4 sec
### check
>>> np.unique(np.bincount(res.ravel()),return_counts=True)
(array([ 0, 30, 29988030]), array([100, 399, 1]))
Suppose I have a 2D array with shape (3, 3), call it a, and an array of zeros with shape (7, 7, 5, 5), call it b. I want to modify b in the following way:
for p in range(5):
    for q in range(5):
        b[p:p + 3, q:q + 3, p, q] = a
Given:
a = np.array([[4, 2, 2],
[9, 0, 5],
[9, 9, 4]])
b = np.zeros((7, 7, 5, 5), dtype=int)
b would end up something like:
>>> b[:, :, 0, 0]
array([[4, 2, 2, 0, 0, 0, 0],
[9, 0, 5, 0, 0, 0, 0],
[9, 9, 4, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
>>> b[:, :, 0, 1]
array([[0, 4, 2, 2, 0, 0, 0],
[0, 9, 0, 5, 0, 0, 0],
[0, 9, 9, 4, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
One way to think about this is to make a sliding window view of b (6D), slice out the parts you want (3D or 4D), and assign a to them.
However, there is a simpler way to do this altogether. The way a sliding window view works is by creating a dimension that steps along less than the full size of the dimension you are viewing. For example:
>>> x = np.array([1, 2, 3, 4])
>>> x
array([1, 2, 3, 4])
>>> window = np.lib.stride_tricks.as_strided(
...     x, shape=(x.shape[0] - 2, 3),
...     strides=x.strides * 2)
>>> print(window)
[[1 2 3]
 [2 3 4]]
I'm deliberately using np.lib.stride_tricks.as_strided rather than np.lib.stride_tricks.sliding_window_view here because it has a certain flexibility that you need.
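For this simple window the two agree; just as a quick aside of my own:

>>> np.lib.stride_tricks.sliding_window_view(x, 3)
array([[1, 2, 3],
       [2, 3, 4]])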
You can have a stride that is larger than the axis you are viewing, as long as you are careful. Contiguous arrays are more forgiving in this case, but by no means a requirement. An example of this is np.diag. You can implement it something like this:
>>> x = np.arange(12).reshape(3, 4)
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> diag = np.lib.stride_tricks.as_strided(
...     x, shape=(min(x.shape),),
...     strides=(sum(x.strides),))
>>> diag
array([ 0,  5, 10])
The trick is to make a view of only the parts of b you care about in a way that makes the assignment easy. Because of broadcasting rules, you will want the last two dimensions of the view to be a.shape, and the strides to be b.strides[:2], since that's where you want to place a.
The first two dimensions of the view will be responsible for making the copies of a. You want 25 copies, so the shape will be (5, 5). The strides are the trickier part. Let's take a look at a 2D case, just because that's easier to visualize, and then attempt to generalize:
>>> a0 = np.array([1, 2])
>>> b0 = np.zeros((4, 3), dtype=int)
>>> b0[0:2, 0] = b0[1:3, 1] = b0[2:4, 2] = a0
The goal is to make a view that strides along the diagonal of b0 in the first axis. So:
>>> np.lib.stride_tricks.as_strided(
...     b0, shape=(b0.shape[0] - a0.shape[0] + 1, a0.shape[0]),
...     strides=(sum(b0.strides), b0.strides[0]))[:] = a0
>>> b0
array([[1, 0, 0],
[2, 1, 0],
[0, 2, 1],
[0, 0, 2]])
So that's what you do for b, except that each of the first two strides of the view is the sum of the corresponding leading and trailing strides of b:
a = np.array([[4, 2, 2],
[9, 0, 5],
[9, 9, 4]])
b = np.zeros((7, 7, 5, 5), dtype=int)
vshape = (*np.subtract(b.shape[:a.ndim], a.shape) + 1,
*a.shape)
vstrides = (*np.add(b.strides[:a.ndim], b.strides[a.ndim:]),
*b.strides[:a.ndim])
np.lib.stride_tricks.as_strided(b, shape=vshape, strides=vstrides)[:] = a
TL;DR
def emplace_window(a, b):
vshape = (*np.subtract(b.shape[:a.ndim], a.shape) + 1, *a.shape)
vstrides = (*np.add(b.strides[:a.ndim], b.strides[a.ndim:]), *b.strides[:a.ndim])
np.lib.stride_tricks.as_strided(b, shape=vshape, strides=vstrides)[:] = a
I've phrased it this way because now you can apply it to any number of dimensions. The only expectations are that 2 * a.ndim == b.ndim and that b.shape[a.ndim:] == b.shape[:a.ndim] - a.shape + 1.
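As a quick sanity check of my own (not part of the original answer), the strided assignment reproduces the double loop from the question:

import numpy as np

def emplace_window(a, b):
    vshape = (*np.subtract(b.shape[:a.ndim], a.shape) + 1, *a.shape)
    vstrides = (*np.add(b.strides[:a.ndim], b.strides[a.ndim:]), *b.strides[:a.ndim])
    np.lib.stride_tricks.as_strided(b, shape=vshape, strides=vstrides)[:] = a

a = np.array([[4, 2, 2], [9, 0, 5], [9, 9, 4]])
b = np.zeros((7, 7, 5, 5), dtype=int)
emplace_window(a, b)

expected = np.zeros_like(b)
for p in range(5):
    for q in range(5):
        expected[p:p + 3, q:q + 3, p, q] = a
print(np.array_equal(b, expected))   # True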
I have a 2D matrix K with shape (j, k), and I'd like to turn it into a 3D one, Q, by distributing the vectors according to another vector d (of length i), which matches the new first dimension.
I have already done that with a loop, but the routine is called several times and the overall process takes too long:
def shift(self, arr, num, init):
    result = np.zeros((arr.shape[0], arr.shape[1]), dtype=np.int32)
    offset = num + init
    result[offset:,:] = arr[:-offset,:]
    return result
q = np.zeros((i, j, k), dtype=np.int32)
q[0,:,:] = K
for i in range(len(d)):
    q[:, i, :] = shift(q[:, i, :], int(d[i]))
Here, basically, I copy the initial matrix K into the first element of the first dimension of the 3D matrix Q (which is created with np.zeros and the final dimensions d.shape + K.shape), and afterwards, for each element along dimension 2, I "push" the values down by the number of positions given in vector d.
Any ideas to avoid that for loop?
For the case in your comment:
In [853]: v = [4, 6, 3, 8]; d = [9, 2, 3, 1]
In [865]: q = np.zeros((10,4), int)
In [866]: q[d,np.arange(4)]=v
In [867]: q
Out[867]:
array([[0, 0, 0, 0],
[0, 0, 0, 8],
[0, 6, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[4, 0, 0, 0]])
Let
a = tensor([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
b = torch.tensor([1, 2])
c = tensor([[1, 2, 0, 0],
[0, 1, 2, 0],
[0, 0, 1, 2]])
Is there a way to obtain c by assigning b to slices of a without any loops? That is, a[indices] = b for some indices or something similar?
You can use the scatter_ method in PyTorch.
a = torch.tensor([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
b = torch.tensor([1, 2])
index = torch.tensor([[0,1],[1,2],[2,3]])
a.scatter_(1, index, b.view(-1,2).repeat(3,1))
# tensor([[1, 2, 0, 0],
# [0, 1, 2, 0],
# [0, 0, 1, 2]])
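For comparison, the same result can also be obtained with plain advanced indexing instead of scatter_ (a sketch of my own, assuming the same shapes as above):

import torch

a = torch.zeros(3, 4, dtype=torch.long)
b = torch.tensor([1, 2])
rows = torch.arange(3).unsqueeze(1)     # shape (3, 1)
cols = rows + torch.arange(2)           # shape (3, 2): [[0, 1], [1, 2], [2, 3]]
a[rows, cols] = b                       # b broadcasts across the three rows
# tensor([[1, 2, 0, 0],
#         [0, 1, 2, 0],
#         [0, 0, 1, 2]])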
The logic behind this operation is a bit iffy in the sense that it is not clear what the parameters of the operation are.
However, one way of obtaining the desired output from the input with vectorized operations only is:
determine how many rows are needed (3 for your example)
create an array a with len(b) + num_rows columns (2 + 3) and the chosen number of rows (3)
assign b to the beginning of each row of a
flatten the array, cut off num_rows trailing zeros, and reshape to the target shape.
In NumPy, this can be implemented as:
import numpy as np
b = np.array([1, 2])
c = np.array([[1, 2, 0, 0],
[0, 1, 2, 0],
[0, 0, 1, 2]])
num_rows = 3
a = np.zeros((num_rows, len(b) + num_rows), dtype=b.dtype)
a[:, :len(b)] = b
a = a.ravel()[:-num_rows].reshape((num_rows, len(b) + num_rows - 1))
print(a)
# [[1 2 0 0]
# [0 1 2 0]
# [0 0 1 2]]
print(np.all(a == c))
# True
EDIT
The same approach implemented in Torch:
import torch as to
b = to.tensor([1, 2])
c = to.tensor([[1, 2, 0, 0],
[0, 1, 2, 0],
[0, 0, 1, 2]])
num_rows = 3
a = to.zeros((num_rows, len(b) + num_rows), dtype=b.dtype)
a[:, :len(b)] = b
a = a.flatten()[:-num_rows].reshape((num_rows, len(b) + num_rows - 1))
print(a)
# tensor([[1, 2, 0, 0],
# [0, 1, 2, 0],
# [0, 0, 1, 2]])
print(to.all(a == c))
# tensor(1, dtype=torch.uint8)