Numpy indirect indexing - python

I'm trying to perform aggregation functions on one or multiple arrays by using another array which contains the indices. These indices could contain duplicates which need to be dealt with depending on the aggregation function (I'm interested in a general way to do this "indirect indexing", so I hope I don't need to differentiate aggregation functions).
For instance, assume we want to obtain a sum w from the elements in v by the index in ix.
ix = [ 0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4]
v = [100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0]
=>
# 0 1 2 3 4 5 6 7
w = [1100 (100+300+700), 400, 1700 (800+900), 600, 0, 0, 0, 700 (200+500)]
Sum might be an easy one, but a weighted average, for instance, would be trickier (multiplying v1 and v2 before collapsing into w). Is there an array/numpy way of doing this?

A fast numpy method:
In [107]: ix = np.array([ 0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4])
...: v = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0])
In [108]: np.bincount(ix,v)
Out[108]: array([1100., 400., 1700., 600., 0., 0., 0., 700.])
Another approach, not quite as fast, but potentially more flexible (the same pattern works with other ufuncs, e.g. np.maximum.at):
In [119]: a = np.zeros(8, int)
     ...: np.add.at(a, ix, v)
     ...: a
Out[119]: array([1100, 400, 1700, 600, 0, 0, 0, 700])
Timings on this small example:
In [121]: timeit [np.sum(v[ix == [x]]) for x in range(ix.max() + 1)]
159 µs ± 311 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [122]: %%timeit
...: df = pd.DataFrame(zip(ix, v), columns=["idx", "v"])
...: w = df.groupby(df.idx).v.sum().to_numpy()
1.48 ms ± 884 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [123]: timeit np.bincount(ix,v)
2.15 µs ± 6.79 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [124]: %%timeit
...: a = np.zeros(8,int)
...: np.add.at(a, ix,v)
...: a
9.4 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
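For the weighted average the question mentions, the same bincount trick can be used twice: once for the weighted sums and once for the normalizing weights. A minimal sketch, assuming a hypothetical weights array v2 (not part of the question's data):
v1 = v.astype(float)                           # the values from above
v2 = np.linspace(1, 2, v1.size)                # hypothetical per-element weights
num = np.bincount(ix, weights=v1 * v2)         # per-index sum of v1*v2
den = np.bincount(ix, weights=v2)              # per-index sum of weights
w = np.divide(num, den, out=np.zeros_like(num), where=den != 0)   # guard empty bins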

Try this:
[np.sum(v[ix == [x]]) for x in range(ix.max() + 1)]
Result:
[1100 400 1700 600 0 0 0 700]
The same as a runnable snippet:
import numpy as np
ix = np.array([0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4])
v = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0])
print([np.sum(v[ix == [x]]) for x in range(ix.max() + 1)])

You're looking for a groupby operation. Pandas has a pretty extensive API for this kind of thing and wraps numpy under the hood, so you get vectorization (as fast as numpy for some operations). Here is an example:
import pandas as pd
ix = [ 0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4]
v = [100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0]
df = pd.DataFrame(zip(ix, v), columns=["idx", "v"])
# groupby the index, apply a sum function, convert type to numpy:
# array([1100, 400, 1700, 600, 0, 0, 0, 700])
w = df.groupby(df.idx).v.sum().to_numpy()
You can do more sophisticated calculations and use overloaded arithmetic operations for convenience:
df["weights"] = np.random.rand(len(df))
df["weights"].mul(df["v"]).groupby(df["idx"]).sum()
And it is generally performant:
n = 1000000
df = pd.DataFrame({"idx": np.random.choice(10, n), "v": np.random.rand(n)})
%timeit df.groupby("idx")["v"].sum()
# 11.7 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As a demonstration of versatility, you can apply more exotic functions, such as the harmonic mean, to each group (apply is a little slower):
from scipy.stats.mstats import hmean
%timeit df.groupby("idx").apply(hmean)
# 51.3 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
idx
0 0.083368
1 0.049457
2 0.077801
3 0.074263
4 0.065142
5 0.035001
6 0.080105
7 0.002465
8 0.076336
9 0.036461
or a custom function:
def my_func(rows):
    return np.max(rows) / np.min(rows)
%timeit df.groupby("idx")["v"].apply(my_func)
# 46.6 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
idx
0 8.265517e+04
1 8.900603e+05
2 1.874362e+05
3 1.419228e+05
4 4.722633e+05
5 1.382114e+06
6 1.000876e+05
7 3.939510e+07
8 7.747462e+04
9 8.919914e+05

Related

Roll first column by 1, second column by 2, etc

I have an array in numpy. I want to roll the first column by 1, second column by 2, etc.
Here is an example.
>>> x = np.reshape(np.arange(15), (5, 3))
>>> x
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
What I want to do:
>>> y = roll(x)
>>> y
array([[12, 10,  8],
       [ 0, 13, 11],
       [ 3,  1, 14],
       [ 6,  4,  2],
       [ 9,  7,  5]])
What is the best way to do it?
The real array will be very big. I'm using cupy, the GPU version of numpy. I will prefer solution fastest on GPU, but of course, any idea is welcomed.
You could use advanced indexing:
import numpy as np
x = np.reshape(np.arange(15), (5, 3))
h, w = x.shape
rows, cols = np.arange(h), np.arange(w)
offsets = cols + 1
shifted = np.subtract.outer(rows, offsets) % h
y = x[shifted, cols]
y:
array([[12, 10,  8],
       [ 0, 13, 11],
       [ 3,  1, 14],
       [ 6,  4,  2],
       [ 9,  7,  5]])
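Since the question mentions cupy, the same advanced indexing should carry over largely unchanged, because cupy mirrors numpy's indexing API. An untested sketch (using plain broadcasting instead of np.subtract.outer), assuming a working cupy install:
import cupy as cp

x = cp.arange(15).reshape(5, 3)
h, w = x.shape
rows, cols = cp.arange(h), cp.arange(w)
shifted = (rows[:, None] - (cols + 1)) % h   # same index matrix, built via broadcasting
y = x[shifted, cols]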
I implemented a naive solution (roll_for) and compared it to @Chrysophylaxs's solution (roll_indexing).
Conclusion: roll_indexing is faster for small arrays, but the difference shrinks as the array gets bigger; it eventually becomes slower than roll_for for very large arrays.
Implementations:
import numpy as np

def roll_for(x, shifts=None, axis=-1):
    if shifts is None:
        shifts = np.arange(1, x.shape[axis] + 1)  # OP requirement
    xt = x.swapaxes(axis, 0)  # https://stackoverflow.com/a/31094758/13636407
    yt = np.empty_like(xt)
    for idx, shift in enumerate(shifts):
        yt[idx] = np.roll(xt[idx], shift=shift)
    return yt.swapaxes(0, axis)

def roll_indexing(x):
    h, w = x.shape
    rows, cols = np.arange(h), np.arange(w)
    offsets = cols + 1
    shifted = np.subtract.outer(rows, offsets) % h  # fix
    return x[shifted, cols]
Tests:
M, N = 5, 3
x = np.arange(M * N).reshape(M, N)
expected = np.array([[12, 10, 8], [0, 13, 11], [3, 1, 14], [6, 4, 2], [9, 7, 5]])
assert np.array_equal(expected, roll_for(x))
assert np.array_equal(expected, roll_indexing(x))
M, N = 100, 200
# roll_indexing didn't work when M < N before the fix
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
Benchmark:
M, N = 100, 100
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 859 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit roll_indexing(x) # 81 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
M, N = 1_000, 1_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 12.7 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit roll_indexing(x) # 12.4 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
M, N = 10_000, 10_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 1.3 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit roll_indexing(x) # 1.61 s ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Return majority weighted vote from array based in columns

I have a 3 x 3 matrix x and a vector w of shape (3,):
x = np.array([[1, 2, 1],
              [3, 2, 1],
              [1, 2, 2]])
w = np.array([0.3, 0.4, 0.3])
I need to generate another vector y that is a majority vote for each row of x. Each column of x is weighted by the corresponding value in w. Something like this:
For y[0], it should look at x[0] => [1, 2, 1]:
columns with value 1 = first and third [0, 2]
columns with value 2 = second [1]
columns with value 3 = none
Sum the weights (in w) of the columns grouped by their value in x:
sum of weights of columns with value 1: 0.3 + 0.3 = 0.6
sum of weights of columns with value 2: 0.4
sum of weights of columns with value 3: 0
Since the sum of weights of columns with value 1 is the highest, y[0] = 1. And so on.
You can do it with numpy if you understand broadcasting. The downside is that because the code is vectorized, you do more computations than you need. This would matter if the size of the w vector is very large.
Perhaps someone comes up with an easier way to write it, but this is how I would do it without thinking too much.
The answer first (this uses the 4-row x defined in the explanation below):
i = np.arange(3) + 1
m = (x.reshape((1,4,3)) == i.reshape((3,1,1)))
np.argmax(np.sum(m, axis=2).T*w, axis=1) + 1
Now the step-by-step explanation... Note that it is usually better to start counting from zero, but I followed your convention.
I added one row so the array is not symmetric (easier to check shapes)
In [1]: x = np.array([[1, 2, 1],
   ...:               [3, 2, 1],
   ...:               [1, 2, 2],
   ...:               [3, 1, 3]])
   ...:
   ...: w = np.array([0.3, 0.4, 0.3])
The first step is to have the array of indices i. Your convention starts at one.
In [2]: i = np.arange(3) + 1
The tricky step: create an array with shape (3, 4, 3), where the i-th entry of the array is a (4, 3) array with all entries 0 or 1; it is 1 if and only if x == i. This is done by adding dimensions to x and i so they can be broadcast. The operation basically compares all combinations of x and i, because every dimension of x matches a size-1 dimension of i and vice versa:
In [3]: m = (x.reshape((1,4,3)) == i.reshape((3,1,1)))*1
In [4]: m
Out[4]:
array([[[1, 0, 1],
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]],

       [[0, 1, 0],
        [0, 1, 0],
        [0, 1, 1],
        [0, 0, 0]],

       [[0, 0, 0],
        [1, 0, 0],
        [0, 0, 0],
        [1, 0, 1]]])
now you sum along rows (which is axis=2) to get the number of times each selection appeared in each row of x (note that the result is transposed when you compare it to x):
In [5]: np.sum(m, axis=2)
Out[5]:
array([[2, 1, 1, 1],
       [1, 1, 2, 0],
       [0, 1, 0, 2]])
I hope you can already see where this is going. You can read it directly: in the first row of x, 1 appears twice and 2 appears once; in the second row of x, each value appears once; in the third row of x, 1 appears once and 2 appears twice, etc.
multiply this by the weights:
In [7]: np.sum(m, axis=2).T*w
Out[7]:
array([[0.6, 0.4, 0. ],
       [0.3, 0.4, 0.3],
       [0.3, 0.8, 0. ],
       [0.3, 0. , 0.6]])
get the index of the maximum (argmax) along the rows (adding one to conform to your convention):
In [8]: np.argmax(np.sum(m, axis=2).T*w, axis=1) + 1
Out[8]: array([1, 2, 2, 3])
Special Case: a Tie
The following case was brought up in the comments:
x = np.array([[2, 2, 4, 1]])
w = np.array([0.1, 0.2, 0.3, 0.4])
the sum of the weights is:
[0.1, 0.4, 0., 0.4]
so in this case there is no winner. It isn't clear from the question what one would do in this case. One could take all, take none... One can look for these cases at the end:
final_w = np.sum(m, axis=2).T*w
result = np.argmax(np.sum(m*w, axis=2), axis=0) + 1
special_cases = np.argwhere(np.sum(final_w == np.max(final_w), axis=1) > 1)
Note: I used the reshape method for readability, but I often use np.expand_dims or np.newaxis. Something like this:
i = np.arange(3) + 1
m = (x[np.newaxis] == i[:, np.newaxis, np.newaxis])
np.argmax(np.sum(m, axis=2).T*w, axis=1) + 1
an alternative: you could also use some kind of compiled code. For example, numba is pretty easy to use in this case.
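For reference, a minimal numba sketch of that idea (my illustration, not from the original answer; it assumes the labels are the integers 1..n_labels):
import numpy as np
from numba import njit

@njit
def weighted_vote_numba(x, w, n_labels):
    out = np.empty(x.shape[0], dtype=np.int64)
    for r in range(x.shape[0]):
        acc = np.zeros(n_labels + 1)          # slot 0 unused; labels start at 1
        for c in range(x.shape[1]):
            acc[x[r, c]] += w[c]              # accumulate the weight of each label
        out[r] = np.argmax(acc)               # label with the largest weighted sum
    return out

# weighted_vote_numba(x, w, n_labels=3) -> array([1, 2, 2]) on the question's 3x3 example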
Here's a really crazy way to do it, which involves sorting and indexing rather than adding a new dimension. This is sort of like the sort-based method used by np.unique.
First find the sorted indices in each row:
rows = np.repeat(np.arange(x.shape[0]), x.shape[1]) # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
cols = np.argsort(x, axis=1).ravel() # [0, 2, 1, 2, 1, 0, 0, 1, 2, 1, 0, 2]
Now you can create an array of sorted elements per-column, both unweighted and weighted. The former will be used to get the indices for summing, the latter will actually be summed.
u = x[rows, cols] # [1, 1, 2, 1, 2, 3, 1, 2, 2, 1, 3, 3]
v = np.broadcast_to(w, x.shape)[rows, cols] # [0.3, 0.3, 0.4, 0.3, 0.4, 0.3, 0.3, 0.4, 0.3, 0.4, 0.3, 0.3]
You can find the breakpoints at which to apply np.add.reduceat:
row_breaks = np.diff(rows).astype(bool) # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
col_breaks = np.diff(u).astype(bool) # [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
break_mask = row_breaks | col_breaks # [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
breaks = np.r_[0, np.flatnonzero(break_mask) + 1] # [ 0, 2, 3, 4, 5, 6, 7, 9, 10]
Now you have the sums of the weights for identical numbers in each row:
sums = np.add.reduceat(v, breaks) # [0.6, 0.4, 0.3, 0.4, 0.3, 0.3, 0.7, 0.4, 0.6]
But you need to break them up into segments corresponding to the number of unique elements per row:
unique_counts = np.add.reduceat(break_mask, np.arange(0, x.size, x.shape[1]))
unique_counts[-1] += 1 # The last segment will be missing from the mask: # [2, 3, 2, 2]
unique_rows = np.repeat(np.arange(x.shape[0]), unique_counts) # [0, 0, 1, 1, 1, 2, 2, 3, 3]
You can now sort each segment to find the maximum value:
indices = np.lexsort(np.stack((sums, unique_rows), axis=0)) # [1, 0, 2, 4, 3, 5, 6, 7, 8]
The index at the end of each run is given by:
max_inds = np.cumsum(unique_counts) - 1 # [1, 4, 6, 8]
So the maximum sums are:
sums[indices[max_inds]] # [0.6, 0.4, 0.7, 0.6]
And you can unravel the indices-within-indices to get the correct element from each row. Notice that max_inds, and everything that depends on it, has one entry per row of x (size x.shape[0]), as expected:
result = u[breaks[indices[max_inds]]]
This method does not look very pretty, but it is likely more space efficient than using an extra dimension on the array. Additionally, it works regardless of the numbers in x. Notice that I never subtracted anything or adjusted x in any way. In fact, all the rows are treated independently, and the coincidence of a maximum element being identical to the minimum of the next is broken by row_breaks when constructing breaks.
TL;DR
Enjoy:
def weighted_vote(x, w):
    rows = np.repeat(np.arange(x.shape[0]), x.shape[1])
    cols = np.argsort(x, axis=1).ravel()
    u = x[rows, cols]
    v = np.broadcast_to(w, x.shape)[rows, cols]
    row_breaks = np.diff(rows).astype(bool)
    col_breaks = np.diff(u).astype(bool)
    break_mask = row_breaks | col_breaks
    breaks = np.r_[0, np.flatnonzero(break_mask) + 1]
    sums = np.add.reduceat(v, breaks)
    unique_counts = np.add.reduceat(break_mask, np.arange(0, x.size, x.shape[1]))
    unique_counts[-1] += 1
    unique_rows = np.repeat(np.arange(x.shape[0]), unique_counts)
    indices = np.lexsort(np.stack((sums, unique_rows), axis=0))
    max_inds = np.cumsum(unique_counts) - 1
    return u[breaks[indices[max_inds]]]
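A quick check on the 4-row example used in the step-by-step explanation above (the output matches the argmax result shown there):
x = np.array([[1, 2, 1], [3, 2, 1], [1, 2, 2], [3, 1, 3]])
w = np.array([0.3, 0.4, 0.3])
print(weighted_vote(x, w))   # [1 2 2 3]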
Benchmarks
Benchmarks are run in the following format:
rows = ...
cols = ...
x = np.random.randint(cols, size=(rows, cols)) + 1
w = np.random.rand(cols)
%timeit weighted_vote_MP(x, w)
%timeit weighted_vote_JG(x, w)
assert (weighted_vote_MP(x, w) == weighted_vote_JG(x, w)).all()
I used the following generalization for weighted_vote_JG, with appropriate corrections:
def weighted_vote_JG(x, w):
    i = np.arange(w.size) + 1
    m = (x[None, ...] == i.reshape(-1, 1, 1))
    return np.argmax(np.sum(m * w, axis=2), axis=0) + 1
Rows: 100, Cols: 10
MP: 440 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
* JG: 153 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Rows: 1000, Cols: 10
MP: 2.53 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
* JG: 1.03 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Rows: 10000, Cols: 10
MP: 23.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
* JG: 16.6 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rows: 100000, Cols: 10
MP: 322 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
* JG: 188 ms ± 858 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Rows: 100, Cols: 100
* MP: 3.31 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
JG: 12.6 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rows: 1000, Cols: 100
* MP: 31 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
JG: 134 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Rows: 10000, Cols: 100
* MP: 417 ms ± 7.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
JG: 1.42 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Rows: 100000, Cols: 100
* MP: 4.94 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
JG: MemoryError: Unable to allocate 7.45 GiB for an array with shape (100, 100000, 100) and data type float64
Moral of the story: for a small number of columns and weights, the expanded solution is faster. For a larger number of columns, use my version instead.

Numpy group by multiple vectors, get group indices

I have several numpy arrays; I want to build a groupby method that produces group ids for these arrays, which I can then use to index the arrays and perform operations on the groups.
For an example:
import numpy as np
import pandas as pd
a = np.array([1,1,1,2,2,3])
b = np.array([1,2,2,2,3,3])
def group_np(groupcols):
    groupby = np.array([''.join([str(b) for b in bs]) for bs in zip(*[c for c in groupcols])])
    _, groupby = np.unique(groupby, return_inverse=True)
    return groupby

def group_pd(groupcols):
    df = pd.DataFrame(groupcols[0])
    for i in range(1, len(groupcols)):
        df[i] = groupcols[i]
    for i in range(len(groupcols)):
        df[i] = df[i].fillna(-1)
    return df.groupby(list(range(len(groupcols)))).grouper.group_info[0]
Outputs:
group_np([a,b]) -> [0, 1, 1, 2, 3, 4]
group_pd([a,b]) -> [0, 1, 1, 2, 3, 4]
Is there a more efficient way of implementing it, ideally in pure numpy? The bottleneck currently seems to be building a vector that would have unique values for each group - at the moment I am doing that by concatenating the values for each vector as strings.
I want this to work for any number of input vectors, which can have millions of elements.
Edit: here is another testcase:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
Here, group elements 2,3,4,7 should all be the same.
Edit2: adding some benchmarks.
a = np.random.randint(1, 1000, 30000000)
b = np.random.randint(1, 1000, 30000000)
c = np.random.randint(1, 1000, 30000000)
def group_np2(groupcols):
    _, groupby = np.unique(np.stack(groupcols), return_inverse=True, axis=1)
    return groupby
%timeit group_np2([a,b,c])
# 25.1 s ± 1.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit group_pd([a,b,c])
# 21.7 s ± 646 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
After using np.stack on the arrays a and b, setting the parameter return_inverse to True in np.unique gives the output you are looking for:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
_, inv = np.unique(np.stack([a,b]), axis=1, return_inverse=True)
print (inv)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
and you can replace [a,b] in np.stack by a list of all the vectors.
Edit: a faster solution is to use np.unique on the sum of the arrays, each multiplied by the cumulative product (np.cumprod) of (max + 1) of all previous arrays in groupcols, such as:
def group_np_sum(groupcols):
    groupcols_max = np.cumprod([ar.max() + 1 for ar in groupcols[:-1]])
    return np.unique(sum([groupcols[0]] +
                         [ar * m for ar, m in zip(groupcols[1:], groupcols_max)]),
                     return_inverse=True)[1]
To check:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print (group_np_sum([a,b]))
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
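Why this works (my explanation, not from the original answer): each pair (a[i], b[i]) is packed into a single integer a[i] + b[i] * (a.max() + 1), which is unique per distinct pair as long as the values are non-negative integers, so np.unique over the packed keys recovers the same groups:
a = np.array([1, 2, 1, 1, 1, 2, 3, 1])
b = np.array([1, 2, 2, 2, 2, 3, 3, 2])
keys = a + b * (a.max() + 1)   # one scalar key per (a, b) pair
print(keys)                    # [ 5 10  9  9  9 14 15  9]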
Note: the number associated with each group may not be the same between the two methods (here I changed the first element of a to 3):
a = np.array([3,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np2([a,b]))
print (group_np_sum([a,b]))
array([3, 1, 0, 0, 0, 2, 4, 0], dtype=int64)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
but groups themselves are the same.
Now to check for timing:
a = np.random.randint(1, 100, 30000)
b = np.random.randint(1, 100, 30000)
c = np.random.randint(1, 100, 30000)
groupcols = [a,b,c]
%timeit group_pd(groupcols)
#13.7 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit group_np2(groupcols)
#34.2 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit group_np_sum(groupcols)
#3.63 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The numpy_indexed package (disclaimer: I am its author) covers this type of use case:
import numpy_indexed as npi
npi.group_by((a, b))
Passing a tuple of index-arrays like this avoids creating a copy; but if you don't mind making the copy, you can use stacking as well:
npi.group_by(np.stack((a, b), axis=1))

Numpy: Can you use broadcasting to replace values by row?

I have an M x N matrix X and a 1 x N matrix Y. What I would like to do is replace any 0-entry in X with the appropriate value from Y based on its column.
So if
X = np.array([[0, 1, 2], [3, 0, 5]])
and
Y = np.array([10, 20, 30])
The desired end result would be [[10, 1, 2], [3, 20, 5]].
This can be done straightforwardly by generating an M x N matrix where every row is Y and then filtering with a boolean mask:
Y = np.ones((X.shape[0], 1)) * Y.reshape(1, -1)
X[X==0] = Y[X==0]
But could this be done using numpy's broadcasting functionality?
Sure. Instead of physically repeating Y, create a broadcasted view of Y with the shape of X, using numpy.broadcast_to:
expanded = numpy.broadcast_to(Y, X.shape)
mask = X==0
X[mask] = expanded[mask]
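An equivalent one-liner (my addition, not part of the original answers) is np.where, which also broadcasts Y across the rows but returns a new array instead of modifying X in place:
import numpy as np

X = np.array([[0, 1, 2], [3, 0, 5]])
Y = np.array([10, 20, 30])
Z = np.where(X == 0, Y, X)   # [[10, 1, 2], [3, 20, 5]]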
Expand X to make it a bit more general:
In [306]: X = np.array([[0, 1, 2], [3, 0, 5],[0,1,0]])
where identifies the 0s; the 2nd array identifies the columns
In [307]: idx = np.where(X==0)
In [308]: idx
Out[308]: (array([0, 1, 2, 2]), array([0, 1, 0, 2]))
In [309]: Z = X.copy()
In [310]: Z[idx]
Out[310]: array([0, 0, 0, 0]) # flat list of where to put the values
In [311]: Y[idx[1]]
Out[311]: array([10, 20, 10, 30]) # matching list of values by column
In [312]: Z[idx] = Y[idx[1]]
In [313]: Z
Out[313]:
array([[10,  1,  2],
       [ 3, 20,  5],
       [10,  1, 30]])
Not doing broadcasting, but reasonably clean numpy.
Times compared to the broadcast_to approach:
In [314]: %%timeit
...: idx = np.where(X==0)
...: Z[idx] = Y[idx[1]]
...:
9.28 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [315]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
19.5 µs ± 513 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Faster, though the sample size is small.
Another way to make the expanded Y, is with repeat:
In [319]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
10.8 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Its time is close to that of my where approach. It turns out that broadcast_to is relatively slow:
In [321]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...:
10.5 µs ± 52.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [322]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...:
3.76 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We'd have to do more tests to see whether that is just due to a setup cost, or if the relative times still apply with much larger arrays.

assigning pixels as 0 and 1

Remote sensing, Python:
Is there a way to create a new band or array with DN values of only 0 and 1, based on conditional statements derived from the DN values of two separate bands? For example, if values in band 4 >= 11000 and values in band 11 <= 23000, set as 0, else set as 1.
You can use int() to convert a boolean to a 0 or 1:
>>> l = [1, 2, 3, 4, 5, 6]
>>> [int(2 < i < 5) for i in l]
[0, 0, 1, 1, 0, 0]
You could just use Python's ternary operator and a list comprehension:
>>> vals = [10000, 500, 200, 10290, 10290129, 3]
>>> vals = [1 if i > 500 else 0 for i in vals]
>>> vals
[1, 0, 0, 1, 1, 0]
Or using numpy (always a good option):
>>> import numpy as np
>>> vals = np.array([10000, 500, 200, 10290, 10290129, 3])
>>> vals = (vals > 500).astype(int)
>>> vals
array([1, 0, 0, 1, 1, 0])
Some timings:
In [4]: vals = np.random.rand(10000)
In [6]: %timeit [1 if i >= 0.5 else 0 for i in vals]
1.26 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit [int(i >= 0.5) for i in vals]
5.18 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit (vals >= 0.5).astype(int)
12.9 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As usual, numpy wins, followed by ternary, and then int conversion.
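Applied to the question's original two-band condition, the numpy approach generalizes to a combined boolean mask. A sketch with placeholder band arrays (band4 and band11 are made-up data, not from the question):
import numpy as np

band4 = np.random.randint(0, 30000, size=(100, 100))    # placeholder DN values
band11 = np.random.randint(0, 30000, size=(100, 100))
# 0 where band4 >= 11000 and band11 <= 23000, else 1
new_band = np.where((band4 >= 11000) & (band11 <= 23000), 0, 1)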
