I want to modify an empty bitmap at positions given by an array of indices (x and y coordinates).
For every coordinate in that array the corresponding value should be raised by one.
So far everything seems to work, but if the array contains duplicate index pairs, the value is only raised once.
>>> img
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
>>> inds
array([[0, 0],
       [3, 4],
       [3, 4]])
Operation:
>>> img[inds[:,1], inds[:,0]] += 1
Result:
>>> img
array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])
Expected result:
>>> img
array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0]])
Does anyone have an idea how to solve this? Preferably a fast approach without loops.
This is one way. Counting algorithm courtesy of @AlexRiley.
For performance implications of relative sizes of img and inds, see @PaulPanzer's answer.
# count occurrences of each row and return array
counts = (inds[:, None] == inds).all(axis=2).sum(axis=1)
# apply indices and counts
img[inds[:,1], inds[:,0]] += counts
print(img)
array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0]])
You could use numpy.add.at with a bit of manipulation to get the indices ready.
np.add.at(img, tuple(inds[:, [1, 0]].T), 1)
If you have larger inds arrays, this approach should remain fast... (though Paul Panzer's solution is faster)
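The underlying reason the plain += in the question only increments once: in-place operations with fancy indexing are buffered, so duplicate indices collapse to a single update, whereas np.add.at applies every occurrence unbuffered. A minimal 1-D sketch:
import numpy as np

a = np.zeros(3, int)
idx = np.array([1, 1, 2])
a[idx] += 1            # buffered: the duplicate index 1 counts only once
print(a)               # [0 1 1]

b = np.zeros(3, int)
np.add.at(b, idx, 1)   # unbuffered: every occurrence is applied
print(b)               # [0 2 1]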
Two remarks on the other two answers:
1) @jpp's can be improved by using np.unique with the axis and return_counts keywords.
2) If we translate to flat indexing we can use np.bincount, which often (but not always, see the last test case in the benchmarks) is faster than np.add.at.
Thanks @miradulo for the initial version of the benchmarks.
import numpy as np

def jpp(img, inds):
    counts = (inds[:, None] == inds).all(axis=2).sum(axis=1)
    img[inds[:,1], inds[:,0]] += counts

def jpp_pp(img, inds):
    unq, cnts = np.unique(inds, axis=0, return_counts=True)
    img[unq[:,1], unq[:,0]] += cnts

def miradulo(img, inds):
    np.add.at(img, tuple(inds[:, [1, 0]].T), 1)

def pp(img, inds):
    imgf = img.ravel()
    # translate the index pairs into flat indices, then count with bincount
    indsf = np.ravel_multi_index(inds.T[::-1], img.shape[::-1])
    imgf += np.bincount(indsf, None, img.size)

inds = np.random.randint(0, 5, (3, 2))
big_inds = np.random.randint(0, 5, (10000, 2))
sml_inds = np.random.randint(0, 1000, (5, 2))

from timeit import timeit
for f in jpp, jpp_pp, miradulo, pp:
    print(f.__name__)
    for i, n, a in [(inds, 1000, 5), (big_inds, 10, 5), (sml_inds, 10, 1000)]:
        img = np.zeros((a, a), int)
        print(timeit("f(img, i)", globals=dict(img=img, i=i, f=f), number=n) * 1000 / n, 'ms')
Output:
jpp
0.011815106990979984 ms
2623.5026352020213 ms
0.04642329877242446 ms
jpp_pp
0.041291153989732265 ms
5.418520100647584 ms
0.05826510023325682 ms
miradulo
0.007099648006260395 ms
0.7788308983435854 ms
0.009103797492571175 ms
pp
0.0035401539935264736 ms
0.06540440081153065 ms
3.486583800986409 ms
Related
Sorry for the confusing title; I'm not sure how to make it more concise. Here are my requirements:
arr1 = np.array([3,5,9,1])
arr2 = ?(arr1)
arr2 would then be:
[
 [0,1,2,0,0,0,0,0,0],
 [0,1,2,3,4,0,0,0,0],
 [0,1,2,3,4,5,6,7,8],
 [0,0,0,0,0,0,0,0,0]
]
It doesn't need to vary based on the max; the shape is known in advance. So to start, I've been able to create an array of zeros with the right shape:
arr2 = np.zeros((len(arr1),max_len))
And then of course I could do a for loop over arr1 like this:
for i, element in enumerate(arr1):
    arr2[i, 0:element] = np.arange(element)
but that would likely take a long time and both dimensions here are rather large (arr1 is a few million rows, max_len is around 500). Is there a clean optimized way to do this in numpy?
Building on a 'padding' idea posted by @Divakar some years ago:
In [161]: res = np.arange(9)[None,:].repeat(4,0)
In [162]: res[res>=arr1[:,None]] = 0
In [163]: res
Out[163]:
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])
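The same padding idea can be wrapped in a small helper (a sketch; the name ranges_per_row is illustrative, not from the original answer):
def ranges_per_row(lims, max_len):
    # tile 0..max_len-1 for every row, then zero out entries
    # at or beyond each row's limit
    out = np.arange(max_len)[None, :].repeat(len(lims), 0)
    out[out >= lims[:, None]] = 0
    return out

print(ranges_per_row(np.array([3, 5, 9, 1]), 9))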
Try this with itertools.zip_longest -
import numpy as np
import itertools
l = map(range, arr1)
arr2 = np.column_stack((itertools.zip_longest(*l, fillvalue=0)))
print(arr2)
array([[0, 1, 2, 0, 0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 0, 0, 0, 0],
       [0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])
I am adding a slight variation on @hpaulj's answer because you mentioned that max_len is around 500 and you have millions of rows. In this case, you can precompute a 500 by 500 matrix containing all possible rows and index into it using arr1:
import numpy as np
np.random.seed(0)
max_len = 500
arr = np.random.randint(0, max_len, size=10**5)
# generate all unique rows first, then index
# can be faster if max_len << len(arr)
# 53 ms
template = np.tril(np.arange(max_len)[None,:].repeat(max_len,0), k=-1)
res = template[arr,:]
# 173 ms
res1 = np.arange(max_len)[None,:].repeat(arr.size,0)
res1[res1>=arr[:,None]] = 0
assert (res == res1).all()
This may be a very simple question, as I am still exploring Python. For this issue I am using numpy.
Updated 09/30/21: I adopted and modified the code shown below for any potential future reference. I also added an elif branch in the loop for classes that have fewer counts than the wanted size. Some of the code may be unnecessary, though.
new_array = test_array.copy()
uniques, counts = np.unique(new_array, return_counts=True)
print("classes:", uniques, "counts:", counts)
for unique, count in zip(uniques, counts):
    #print(unique, count)
    if unique != 0 and count > 3:
        ids = np.random.choice(count, count-3, replace=False)
        new_array[tuple(i[ids] for i in np.where(new_array == unique))] = 0
    elif unique != 0 and count <= 3:
        ids = np.random.choice(count, count, replace=False)
        new_array[tuple(i[ids] for i in np.where(new_array == unique))] = unique
Below is original question.
Let's say I have a 2D array like this:
test_array = np.array([[0,0,0,0,0],
                       [1,1,1,1,1],
                       [0,0,0,0,0],
                       [2,2,2,4,4],
                       [4,4,4,2,2],
                       [0,0,0,0,0]])
print("existing classes:", np.unique(test_array))
# "existing classes: [0 1 2 4]"
Now I want to keep a fixed number of values (e.g. 2) in each class that != 0 (in this case two 1s, two 2s, and two 4s) and replace the rest with 0, where the values being replaced are chosen at random on each run (or from a seed).
For example, with run 1 I will have
([[0,0,0,0,0],
  [1,0,0,1,0],
  [0,0,0,0,0],
  [2,0,0,0,4],
  [4,0,0,2,0],
  [0,0,0,0,0]])
with another run it might be
([[0,0,0,0,0],
  [1,1,0,0,0],
  [0,0,0,0,0],
  [2,0,2,0,4],
  [4,0,0,0,0],
  [0,0,0,0,0]])
etc. Could anyone help me with this?
My strategy is:
Create a new array initialized to all zeros
Find the elements in each class
For each class:
  Randomly sample two elements to keep
  Set those elements of the new array to the class value
The trick is keeping the shape of the indexes appropriate so you retain the shape of the original array.
import numpy as np
test_array = np.array([[0,0,0,0,0],
                       [1,1,1,1,1],
                       [0,0,0,0,0],
                       [2,2,2,4,4],
                       [4,4,4,2,2],
                       [0,0,0,0,0]])

def sample_classes(arr, n_keep=2, random_state=42):
    # count classes in the arr argument (not the global test_array)
    classes, counts = np.unique(arr, return_counts=True)
    rng = np.random.default_rng(random_state)
    out = np.zeros_like(arr)
    for klass, count in zip(classes, counts):
        # Find locations of the class elements
        indexes = np.nonzero(arr == klass)
        # Sample n_keep elements of the class (assumes count >= n_keep)
        keep_idx = rng.choice(count, n_keep, replace=False)
        # Select the kept elements and reformat for indexing the output array and retaining its shape
        keep_idx_reshape = tuple(ind[keep_idx] for ind in indexes)
        out[keep_idx_reshape] = klass
    return out
You can use it like
In [3]: sample_classes(test_array)
Out[3]:
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 0, 4, 0],
       [4, 0, 0, 2, 0],
       [0, 0, 0, 0, 0]])
In [4]: sample_classes(test_array, n_keep=3)
Out[4]:
array([[0, 0, 0, 0, 0],
       [1, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 4, 0],
       [4, 4, 0, 2, 2],
       [0, 0, 0, 0, 0]])
In [5]: sample_classes(test_array, random_state=88)
Out[5]:
array([[0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [4, 0, 4, 2, 2],
       [0, 0, 0, 0, 0]])
In [6]: sample_classes(test_array, random_state=88, n_keep=4)
Out[6]:
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       [2, 2, 0, 4, 4],
       [4, 4, 0, 2, 2],
       [0, 0, 0, 0, 0]])
Here is my not-so-elegant solution:
def unique(arr, num=2, seed=None):
    np.random.seed(seed)
    vals = {}
    for i, row in enumerate(arr):
        for j, val in enumerate(row):
            if val in vals and val != 0:
                vals[val].append((i, j))
            elif val != 0:
                vals[val] = [(i, j)]
    new = np.zeros_like(arr)
    for val in vals:
        np.random.shuffle(vals[val])
        while len(vals[val]) > num:
            vals[val].pop()
        for row, col in vals[val]:
            new[row, col] = val
    return new
The following should be O(n log n) in array size
def keep_k_per_class(data, k, rng):
    out = np.zeros_like(data)
    unq, cnts = np.unique(data, return_counts=True)
    assert (cnts >= k).all()
    # calculate class boundaries from class sizes
    CNTS = cnts.cumsum()
    # indirectly group classes together by partial sorting
    idx = data.ravel().argpartition(CNTS[:-1])
    # the following lines implement simultaneous drawing without replacement
    # from all classes
    # lower boundaries of intervals to draw random numbers from:
    # for each class they start with the lower class boundary
    # and from there grow one by one - together with the
    # swapping out below this implements "without replacement"
    lb = np.add.outer(np.arange(k), CNTS - cnts)
    pick = rng.integers(lb, CNTS, lb.shape)
    for l, p in zip(lb, pick):
        # populate output array
        out.ravel()[idx[p]] = unq
        # swap out used indices so still available ones occupy a linear
        # range (per class)
        idx[p] = idx[l]
    return out
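To unpack the "swap out used indices" trick, here is a minimal one-dimensional sketch of the same idea (an illustration, not part of the original answer): drawing k items without replacement by swapping each pick out of the not-yet-used tail, Fisher-Yates style.
import numpy as np

rng = np.random.default_rng(0)
pool = np.arange(10, 20)            # candidate indices for a single class
k = 3
for i in range(k):
    j = rng.integers(i, pool.size)  # draw from the not-yet-used tail
    pool[i], pool[j] = pool[j], pool[i]  # swap the pick out of the tail
print(pool[:k])                     # k distinct draws from 10..19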
Examples:
>>> rng = np.random.default_rng()
>>> keep_k_per_class(test_array, 2, rng)
array([[0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 2, 0, 4],
       [0, 4, 0, 0, 0],
       [0, 0, 0, 0, 0]])
>>> keep_k_per_class(test_array, 2, rng)
array([[0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [4, 0, 4, 0, 2],
       [0, 0, 0, 0, 0]])
and a large one
>>> BIG = np.add.outer(np.tile(test_array,(100,100)),np.arange(0,500,5))
>>> BIG.size
30000000
>>> res = keep_k_per_class(BIG,30,rng)
### takes ~4 sec
### check
>>> np.unique(np.bincount(res.ravel()),return_counts=True)
(array([ 0, 30, 29988030]), array([100, 399, 1]))
I have a matrix M with values 0 through N-1 within it. I'd like to unroll this matrix to create a new matrix A where each submatrix A[i, :, :] represents whether or not M == i.
The solution below uses a loop.
# Example Setup
import numpy as np
np.random.seed(0)
N = 5
M = np.random.randint(0, N, size=(5,5))
# Solution with Loop
A = np.zeros((N, M.shape[0], M.shape[1]))
for i in range(N):
    A[i, :, :] = M == i
This yields:
M
array([[4, 0, 3, 3, 3],
       [1, 3, 2, 4, 0],
       [0, 4, 2, 1, 0],
       [1, 1, 0, 1, 4],
       [3, 0, 3, 0, 2]])
M.shape
# (5, 5)
A
array([[[0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 0, 1, 0, 0],
        [0, 1, 0, 1, 0]],
       ...
       [[1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0]]])
A.shape
# (5, 5, 5)
Is there a faster way, or a way to do it in a single numpy operation?
Broadcasted comparison is your friend:
B = (M[None, :] == np.arange(N)[:, None, None]).view(np.int8)
np.array_equal(A, B)
# True
The idea is to expand the dimensions in such a way that the comparison can be broadcasted in the manner desired.
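Spelling out the shapes involved (a quick sketch): M[None, :] has shape (1, 5, 5), np.arange(N)[:, None, None] has shape (N, 1, 1), and the comparison broadcasts to (N, 5, 5), so B[i] is exactly (M == i).
print(M[None, :].shape)                   # (1, 5, 5)
print(np.arange(N)[:, None, None].shape)  # (5, 1, 1)
print(B.shape)                            # (5, 5, 5)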
As pointed out by @Alex Riley in the comments, you can use np.equal.outer to avoid having to do the indexing yourself:
B = np.equal.outer(np.arange(N), M).view(np.int8)
np.array_equal(A, B)
# True
You can make use of some broadcasting here:
P = np.arange(N)
Y = np.broadcast_to(P[:, None], M.shape)
T = np.equal(M, Y[:, None]).astype(int)
Alternative using indices:
X, Y = np.indices(M.shape)
Z = np.equal(M, X[:, None]).astype(int)
You can index into the identity matrix like so
A = np.identity(N, int)[:, M]
or so
A = np.identity(N, int)[M.T].T
Or use the new (v1.15.0) put_along_axis
A = np.zeros((N,5,5), int)
np.put_along_axis(A, M[None], 1, 0)
Note if N is much larger than 5 then creating an NxN identity matrix may be considered wasteful. We can mitigate this using stride tricks:
def read_only_identity(N, dtype=float):
    z = np.zeros(2*N - 1, dtype)
    s, = z.strides
    z[N-1] = 1
    return np.lib.stride_tricks.as_strided(z[N-1:], (N, N), (-s, s))
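A usage sketch: the strided view can be indexed just like the materialized identity matrix, without ever allocating N*N elements.
I = read_only_identity(N, int)
A = I[:, M]  # fancy indexing copies, so A is a normal writable array
np.array_equal(A, np.identity(N, int)[:, M])
# True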
I'm trying to analyse map terrain given by the StarCraft 2 bot API.
A beginner's task for this analysis was finding cliffs for reapers, which are special units in SC2 that can jump up and down cliffs.
To solve this, I analyse points where the point itself is not pathable (= a cliff) and the two points directly north and south of it are pathable. Pathable points are marked as 1 and non-pathable points as 0 in the array.
The terrain map exists as a 2D numpy array. The following is a small excerpt from a larger 200x200 array:
import numpy as np
example = np.array([[0, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0]])
Here, the points [2, 1] and [2, 2] would match the criteria where the points themselves are not pathable (=0) and the points above and below them are pathable (=1).
This can be achieved by the following code:
above = np.roll(example, 1, axis=0) # Shift rows downwards
below = np.roll(example, -1, axis=0) # Shift rows upwards
result = np.zeros_like(example) # Create array with zeros
result[(example == 0) & (above == 1) & (below == 1)] = 1 # Set cells to 1 that match condition
print(repr(result))
# array([[0, 0, 0, 0],
#        [0, 0, 0, 0],
#        [0, 1, 1, 0],
#        [0, 0, 0, 0],
#        [0, 0, 0, 0]])
Now my question is whether the same can be achieved with less code.
The np.roll function creates a new array each time, so analysing hundreds of nearby points could result in a lot of unnecessary code and high memory usage.
I'm trying to find something similar to
result = np.zeros_like(example)
result[(example == 0) & (example[-1, 0] == 1) & (example[1, 0] == 1)] = 1
# or
result[(example == 0) & (example[-1:2, 0].sum() == 2)] = 1
Here the numbers in the brackets display the relative position to the currently analysed point, but I don't know if there is a way to get this to work with numpy.
Also, the result for the zeroth row wouldn't be well defined when checking the point "above" it: it could wrap around to the last row, raise an error, or return a default value (0 or 1).
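For reference, np.roll always wraps around, which is why the original approach silently uses the last row when checking "above" row zero:
x = np.array([1, 2, 3, 4])
np.roll(x, 1)
# array([4, 1, 2, 3]) - the last element wraps to the front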
Edit:
I found this post, which pointed me towards scipy's convolve2d function; it might be what I am looking for:
import numpy as np
from scipy import signal
example = np.array([[0, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0]])
kernel = np.zeros((3, 3), dtype=int)
kernel[::2, 1] = 1
print(repr(kernel))
# array([[0, 1, 0],
#        [0, 0, 0],
#        [0, 1, 0]])
result2 = signal.convolve2d(example, kernel, mode="same")
print(repr(result2))
# array([[0, 1, 1, 0],
#        [0, 0, 0, 0],
#        [0, 2, 2, 0],
#        [0, 0, 0, 0],
#        [0, 1, 1, 0]])
result2[result2 < 2] = 0
result2[result2 == 2] = 1
print(repr(result2))
# array([[0, 0, 0, 0],
#        [0, 0, 0, 0],
#        [0, 1, 1, 0],
#        [0, 0, 0, 0],
#        [0, 0, 0, 0]])
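One caveat (an observation, not from the original post): the kernel ignores the centre cell, so a point that is itself pathable with pathable vertical neighbours would also sum to 2. To reproduce the original (example == 0) condition exactly, such points can be masked out afterwards:
result2[example == 1] = 0  # enforce that the point itself is not pathable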
Edit2:
Another solution may be scipy.ndimage.minimum_filter which seems to work similarly:
import numpy as np
from scipy import ndimage
example = np.array([[0, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 0]])
kernel = np.zeros((3, 3), dtype=int)
kernel[::2, 1] = 1
print(repr(kernel))
# array([[0, 1, 0],
#        [0, 0, 0],
#        [0, 1, 0]])
result3 = ndimage.minimum_filter(example, footprint=kernel, mode="constant")
print(repr(result3))
# array([[0, 0, 0, 0],
#        [0, 0, 0, 0],
#        [0, 1, 1, 0],
#        [0, 0, 0, 0],
#        [0, 0, 0, 0]])
Say I've labeled an image with scipy.ndimage.measurements.label like so:
[[0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 3, 0],
 [2, 2, 0, 0, 0, 0],
 [2, 2, 0, 0, 0, 0]]
What's a fast way to collect the coordinates belonging to each label? I.e. something like:
{ 1: [[0, 1], [1, 1], [2, 1]],
  2: [[4, 0], [4, 1], [5, 0], [5, 1]],
  3: [[3, 4]] }
I'm working with images that are ~15,000 x 5000 pixels in size, and roughly half of each image's pixels are labeled (i.e. non-zero).
Rather than iterating through the entire image with nditer, would it be faster to do something like np.where(img == label) for each label?
EDIT:
Which algorithm is fastest depends on how big the labeled image is as compared to how many labels it has. Warren Weckesser and Salvador Dali / BHAT IRSHAD's methods (which are based on np.nonzero and np.where) all seem to scale linearly with the number of labels, whereas iterating through each image element with nditer obviously scales linearly with the size of labeled image.
The results of a small test:
size: 1000 x 1000, num_labels: 10
weckesser ... 0.214357852936s
dali ... 0.650229930878s
nditer ... 6.53645992279s
size: 1000 x 1000, num_labels: 100
weckesser ... 0.936990022659s
dali ... 1.33582305908s
nditer ... 6.81486487389s
size: 1000 x 1000, num_labels: 1000
weckesser ... 8.43906402588s
dali ... 9.81333303452s
nditer ... 7.47897100449s
size: 1000 x 1000, num_labels: 10000
weckesser ... 100.405524015s
dali ... 118.17239809s
nditer ... 9.14583897591s
So the question becomes more specific:
For labeled images in which the number of labels is on the order of sqrt(size(image)) is there an algorithm to gather label coordinates that is faster than iterating through every image element (i.e. with nditer)?
Here's a possibility:
import numpy as np
a = np.array([[0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 3, 0],
              [2, 2, 0, 0, 0, 0],
              [2, 2, 0, 0, 0, 0]])
# If the array was computed using scipy.ndimage.measurements.label, you
# already know how many labels there are.
num_labels = 3
nz = np.nonzero(a)
coords = np.column_stack(nz)
nzvals = a[nz[0], nz[1]]
res = {k:coords[nzvals == k] for k in range(1, num_labels + 1)}
I called this script get_label_indices.py. Here's a sample run:
In [97]: import pprint
In [98]: run get_label_indices.py
In [99]: pprint.pprint(res)
{1: array([[0, 1],
           [1, 1],
           [2, 1]]),
 2: array([[4, 0],
           [4, 1],
           [5, 0],
           [5, 1]]),
 3: array([[3, 4]])}
You can do something like this (where img is your original ndarray):
res = {}
for i in np.unique(img)[1:]:
    x, y = np.where(img == i)
    res[i] = list(zip(x, y))
which will give you what you want:
{
  1: [(0, 1), (1, 1), (2, 1)],
  2: [(4, 0), (4, 1), (5, 0), (5, 1)],
  3: [(3, 4)]
}
Whether it will be faster - is up to the benchmark to determine.
Per Warren's suggestion, I do not need to use unique and can just do
res = {}
for i in range(1, num_labels + 1):
    x, y = np.where(img == i)
    res[i] = list(zip(x, y))
Try this:
>>> z
array([[0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 3, 0],
       [2, 2, 0, 0, 0, 0],
       [2, 2, 0, 0, 0, 0]])
>>> {i: list(zip(*np.where(z == i))) for i in np.unique(z) if i}
{1: [(0, 1), (1, 1), (2, 1)], 2: [(4, 0), (4, 1), (5, 0), (5, 1)], 3: [(3, 4)]}
This is basically an argsort operation, with some additional work to get the desired format:
def sorting_based(img, nlabels):
    img_flat = img.ravel()
    label_counts = np.bincount(img_flat)
    lin_idx = np.argsort(img_flat)[label_counts[0]:]
    coor = np.column_stack(np.unravel_index(lin_idx, img.shape))
    ptr = np.cumsum(label_counts[1:-1])
    out = dict(enumerate(np.split(coor, ptr), start=1))
    return out
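A quick usage sketch (note the within-label row order may vary, since argsort is not stable by default):
a = np.array([[0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 3, 0],
              [2, 2, 0, 0, 0, 0],
              [2, 2, 0, 0, 0, 0]])
print(sorting_based(a, nlabels=3))
# maps each label to an array of its coordinates, e.g. 3 -> array([[3, 4]])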
As you found out, doing np.where(img == label) for each label results in quadratic runtime O(m*n), with m=n_pixels and n=n_labels. The sorting based approach reduces the complexity to O(m*log(m) + n).
It is possible to do this operation in linear time, but I don't think it's possible to vectorize with Numpy. You could abuse the scipy.sparse.csr_matrix similar to this answer, but at that point you're probably better off writing code that actually makes sense, in Numba, Cython, etc.
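For completeness, a rough sketch of what the csr_matrix abuse could look like (an assumption-laden illustration, not a tested implementation: it relies on scipy's COO-to-CSR conversion grouping entries by row in roughly linear time):
import numpy as np
from scipy.sparse import csr_matrix

def csr_based(img, nlabels):
    img_flat = img.ravel()
    lin_idx = np.arange(img_flat.size)
    # rows = labels, cols = pixel positions; store lin_idx + 1 as data so
    # that pixel 0 is not mistaken for an implicit zero entry
    m = csr_matrix((lin_idx + 1, (img_flat, lin_idx)),
                   shape=(nlabels + 1, img_flat.size))
    # m.indptr delimits the entries of each label (CSR groups by row)
    return {k: np.column_stack(np.unravel_index(
                m.data[m.indptr[k]:m.indptr[k + 1]] - 1, img.shape))
            for k in range(1, nlabels + 1)}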