assigning pixels as 0 and 1 - python

remote sensing python:
Is there a way to create a new band or array with DN values of only 0 and 1, based on conditional statements derived from the DN values of two separate bands? For example, if values in band 4 are >= 11000 and values in band 11 are <= 23000, set the pixel to 0, else set it to 1.

You can use int() to convert a boolean to a 0 or 1:
>>> l = [1, 2, 3, 4, 5, 6]
>>> [int(2 < i < 5) for i in l]
[0, 0, 1, 1, 0, 0]

You could just use Python's ternary operator and a list comprehension:
>>> vals = [10000, 500, 200, 10290, 10290129, 3]
>>> vals = [1 if i > 500 else 0 for i in vals]
>>> vals
[1, 0, 0, 1, 1, 0]
Or using numpy (always a good option):
>>> import numpy as np
>>> vals = np.array([10000, 500, 200, 10290, 10290129, 3])
>>> vals = (vals > 500).astype(int)
>>> vals
array([1, 0, 0, 1, 1, 0])
Some timings:
In [4]: vals = np.random.rand(10000)
In [6]: %timeit [1 if i >= 0.5 else 0 for i in vals]
1.26 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit [int(i >= 0.5) for i in vals]
5.18 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit (vals >= 0.5).astype(int)
12.9 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As usual, numpy wins, followed by ternary, and then int conversion.
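Applied back to the original two-band question, a minimal sketch (assuming band4 and band11 are NumPy arrays of DN values already read from the raster, e.g. with rasterio or GDAL; the names and sample values here are made up):
import numpy as np
# hypothetical DN arrays standing in for band 4 and band 11
band4 = np.array([[12000,  9000], [15000, 30000]])
band11 = np.array([[20000, 25000], [22000, 10000]])
# 0 where band 4 >= 11000 and band 11 <= 23000, else 1
new_band = np.where((band4 >= 11000) & (band11 <= 23000), 0, 1)
print(new_band)
# [[0 1]
#  [0 1]]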

Related

How can I assign values from one array to another according to the index more efficiently?

I am trying to replace the values of one array with values from another array according to how many ones are in each row of the source array. I assign a value from a given index in the replacement array based on the row sum: if there are two ones in a row, it assigns l1[1] to the output, and if there is one, it assigns l1[0].
It will be better seen in a specific example:
import numpy as np
l1 = np.array([4, 5])
x112 = np.array([[0, 0], [0, 1], [1, 1], [0, 0], [1, 0], [1, 1]])
array([[0, 0],
       [0, 1],
       [1, 1],
       [0, 0],
       [1, 0],
       [1, 1]])
Required output:
[[0]
[4]
[5]
[0]
[4]
[5]]
I did this by counting the units in each row and assigning accordingly using np.where:
x1x2 = np.array([0, 1, 2, 0, 1, 2])  # count of 1s in each row
x1x2 = np.where(x1x2 != 1, x1x2, l1[0])
x1x2 = np.where(x1x2 != 2, x1x2, l1[1])
print(x1x2)
output
[0 4 5 0 4 5]
Could this be done more effectively?
Okay I actually gave devectorizing your code a shot. First the vectorized NumPy you have:
def op(x112, l1):
    # bit of cheating: adding instead of counting 1s
    x1x2 = x112[:, 0] + x112[:, 1]
    x1x2 = np.where(x1x2 != 1, x1x2, l1[0])
    x1x2 = np.where(x1x2 != 2, x1x2, l1[1])
    return x1x2
The most efficient alternative is to loop through x112 only once, so let's do a Numba loop.
import numba as nb
@nb.njit
def loop(x112, l1):
    d0, d1 = x112.shape
    x1x2 = np.zeros(d0, dtype=x112.dtype)
    for i in range(d0):
        # actually count the 1s
        num1s = 0
        for j in range(d1):
            if x112[i, j] == 1:
                num1s += 1
        if num1s == 1:
            x1x2[i] = l1[0]
        elif num1s == 2:
            x1x2[i] = l1[1]
    return x1x2
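One caveat worth noting (my aside, not part of the original answer): the first call to an @nb.njit function includes JIT compilation, so warm it up once before timing:
loop(x112, l1)  # first call triggers compilation; later calls run the cached machine code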
Numba loop has a ~9-10x speed improvement on my laptop.
%timeit op(x112, l1)
8.05 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit loop(x112, l1)
873 ns ± 5.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
As @Mad_Physicist requested, timings with a bigger array. I'm including his advanced-indexing method too.
x112 = np.random.randint(0, 2, size = (100000, 2))
l1_v2 = np.array([0,4,5])
%timeit op(x112, l1)
1.35 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit loop(x112, l1)
956 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit l1_v2[x112.sum(1)]
1.2 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
EDIT: Okay maybe take these timings with a grain of salt because when I went to restart the IPython kernel and reran this stuff, op(x112, l1) improved to 390 µs ± 22.1 µs per loop while the other methods retained the same performance (971 µs, 1.23 ms).
You can use direct indexing:
l1 = np.array([0, 4, 5])
x112 = np.array([[0, 0], [0, 1], [1, 1], [0, 0], [1, 0], [1, 1]])
result = l1[x112.sum(1)]
This works if you're at liberty to prepend the zero to l1 at creation time. If not:
result = np.r_[0, l1][x112.sum(1)]
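A quick check of this lookup-table approach on the sample data from the question (a minimal sketch):
import numpy as np
l1 = np.array([4, 5])
x112 = np.array([[0, 0], [0, 1], [1, 1], [0, 0], [1, 0], [1, 1]])
result = np.r_[0, l1][x112.sum(1)]
print(result)                 # [0 4 5 0 4 5]
print(result.reshape(-1, 1))  # column shape, as in the required output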

Numpy indirect indexing

I'm trying to perform aggregation functions on one or multiple arrays by using another array which contains the indices. These indices could contain duplicates which need to be dealt with depending on the aggregation function (I'm interested in a general way to do this "indirect indexing", so I hope I don't need to differentiate aggregation functions).
For instance, assume we want to obtain a sum w from the elements in v by the index in ix.
ix = [ 0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4]
v = [100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0]
=>
# 0 1 2 3 4 5 6 7
w = [1100 (100+300+700), 400, 1700 (800+900), 600, 0, 0, 0, 700 (200+500)]
sum might be an easy one, but for instance a weighted average would be trickier (multiplication of v1 and v2 before collapsing into w). Is there an array/numpy way for doing this?
A fast numpy method:
In [107]: ix = np.array([ 0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4])
...: v = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0])
In [108]:
In [108]: np.bincount(ix,v)
Out[108]: array([1100., 400., 1700., 600., 0., 0., 0., 700.])
another, not quite as fast, but potentially more flexible (using other ufunc):
In [119]: a = np.zeros(8,int)
...: np.add.at(a, ix,v)
...: a
...:
...:
Out[119]: array([1100, 400, 1700, 600, 0, 0, 0, 700])
Timings on this small example:
In [121]: timeit [np.sum(v[ix == [x]]) for x in range(ix.max() + 1)]
159 µs ± 311 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [122]: %%timeit
...: df = pd.DataFrame(zip(ix, v), columns=["idx", "v"])
...: w = df.groupby(df.idx).v.sum().to_numpy()
1.48 ms ± 884 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [123]: timeit np.bincount(ix,v)
2.15 µs ± 6.79 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [124]: %%timeit
...: a = np.zeros(8,int)
...: np.add.at(a, ix,v)
...: a
9.4 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
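For the weighted-average case raised in the question, the same bincount idea still works: collapse the weighted values and the weights separately, then divide. A hedged sketch, where v1 and v2 are hypothetical value and weight arrays aligned with ix:
import numpy as np
ix = np.array([0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4])
v1 = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0])  # values
v2 = np.ones_like(v1, dtype=float)                                     # weights (all 1 here)
num = np.bincount(ix, weights=v1 * v2)
den = np.bincount(ix, weights=v2)
wavg = np.divide(num, den, out=np.zeros_like(num), where=den > 0)  # guard empty groups
print(wavg)  # per-group weighted average of v1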
Try this:
[np.sum(v[ix == [x]]) for x in range(ix.max() + 1)]
Result:
[1100 400 1700 600 0 0 0 700]
As a self-contained snippet:
import numpy as np
ix = np.array([0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4])
v = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0])
print([np.sum(v[ix == [x]]) for x in range(ix.max() + 1)])
You're looking for a groupby operation. Pandas has a pretty extensive API for this kind of thing and wraps numpy under the hood, so you get vectorization (as fast as numpy for some operations). Here is an example:
import pandas as pd
ix = [ 0, 7, 0, 1, 7, 3, 0, 2, 2, 5, 6, 4]
v = [100, 200, 300, 400, 500, 600, 700, 800, 900, 0, 0, 0]
df = pd.DataFrame(zip(ix, v), columns=["idx", "v"])
# groupby the index, apply a sum function, convert type to numpy:
# array([1100, 400, 1700, 600, 0, 0, 0, 700])
w = df.groupby(df.idx).v.sum().to_numpy()
You can do more sophisticated calculations and use overloaded arithmetic operations for convenience:
df["weights"] = np.random.rand(len(df))
df["weights"].mul(df["v"]).groupby(df["idx"]).sum()
And it is generally performant:
n = 1000000
df = pd.DataFrame({"idx": np.random.choice(10, n), "v": np.random.rand(n)})
%timeit df.groupby("idx")["v"].sum()
# 11.7 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As a demonstration of versatility, you can apply more exotic functions, such as the harmonic mean, to each group (apply is a little slower):
from scipy.stats.mstats import hmean
%timeit df.groupby("idx")["v"].apply(hmean)
# 51.3 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
idx
0 0.083368
1 0.049457
2 0.077801
3 0.074263
4 0.065142
5 0.035001
6 0.080105
7 0.002465
8 0.076336
9 0.036461
or a custom function:
def my_func(rows):
    return np.max(rows) / np.min(rows)
%timeit df.groupby("idx")["v"].apply(my_func)
# 46.6 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
idx
0 8.265517e+04
1 8.900603e+05
2 1.874362e+05
3 1.419228e+05
4 4.722633e+05
5 1.382114e+06
6 1.000876e+05
7 3.939510e+07
8 7.747462e+04
9 8.919914e+05

Boolean indexing array through array of boolean indexes without loop

I want to index an array with a boolean mask through multiple boolean arrays without a loop.
This is what I want to achieve but without a loop and only with numpy.
import numpy as np
a = np.array([[0, 1],[2, 3]])
b = np.array([[[1, 0], [1, 0]], [[0, 0], [1, 1]]], dtype=bool)
r = []
for x in b:
    print(a[x])
    r.extend(a[x])
# => array([0, 2])
# => array([2, 3])
print(r)
# => [0, 2, 2, 3]
# what I would like to do is something like this
r = some_fancy_indexing_magic_with_b_and_a
print(r)
# => [0, 2, 2, 3]
Approach #1
Simply broadcast a to b's shape with np.broadcast_to and then mask it with b -
In [15]: np.broadcast_to(a,b.shape)[b]
Out[15]: array([0, 2, 2, 3])
Approach #2
Another would be getting all the indices and mod those by the size of a, which would also be the size of each 2D block in b and then indexing into flattened a -
a.ravel()[np.flatnonzero(b)%a.size]
Approach #3
Along the same lines as Approach #2, but keeping the 2D format and using non-zero indices along the last two axes of b -
_,r,c = np.nonzero(b)
out = a[r,c]
Timings on large arrays (given sample shapes scaled up by 100x) -
In [50]: np.random.seed(0)
...: a = np.random.rand(200,200)
...: b = np.random.rand(200,200,200)>0.5
In [51]: %timeit np.broadcast_to(a,b.shape)[b]
45.5 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [52]: %timeit a.ravel()[np.flatnonzero(b)%a.size]
94.6 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [53]: %%timeit
...: _,r,c = np.nonzero(b)
...: out = a[r,c]
128 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy group by multiple vectors, get group indices

I have several numpy arrays; I want to build a groupby method that would have group ids for these arrays. It will then allow me to index these arrays on the group id to perform operations on the groups.
For an example:
import numpy as np
import pandas as pd
a = np.array([1,1,1,2,2,3])
b = np.array([1,2,2,2,3,3])
def group_np(groupcols):
    groupby = np.array([''.join([str(b) for b in bs]) for bs in zip(*[c for c in groupcols])])
    _, groupby = np.unique(groupby, return_inverse=True)
    return groupby

def group_pd(groupcols):
    df = pd.DataFrame(groupcols[0])
    for i in range(1, len(groupcols)):
        df[i] = groupcols[i]
    for i in range(len(groupcols)):
        df[i] = df[i].fillna(-1)
    return df.groupby(list(range(len(groupcols)))).grouper.group_info[0]
Outputs:
group_np([a,b]) -> [0, 1, 1, 2, 3, 4]
group_pd([a,b]) -> [0, 1, 1, 2, 3, 4]
Is there a more efficient way of implementing it, ideally in pure numpy? The bottleneck currently seems to be building a vector that would have unique values for each group - at the moment I am doing that by concatenating the values for each vector as strings.
I want this to work for any number of input vectors, which can have millions of elements.
Edit: here is another testcase:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
Here, group elements 2,3,4,7 should all be the same.
Edit2: adding some benchmarks.
a = np.random.randint(1, 1000, 30000000)
b = np.random.randint(1, 1000, 30000000)
c = np.random.randint(1, 1000, 30000000)
def group_np2(groupcols):
    _, groupby = np.unique(np.stack(groupcols), return_inverse=True, axis=1)
    return groupby
%timeit group_np2([a,b,c])
# 25.1 s +/- 1.06 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)
%timeit group_pd([a,b,c])
# 21.7 s +/- 646 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
If you use np.stack on the arrays a and b and set the parameter return_inverse to True in np.unique, its output is what you are looking for:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
_, inv = np.unique(np.stack([a,b]), axis=1, return_inverse=True)
print (inv)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
and you can replace [a,b] in np.stack by a list of all the vectors.
Edit: a faster solution is to use np.unique on the sum of the arrays, each multiplied by the cumulative product (np.cumprod) of the max plus 1 of all the previous arrays in groupcols, such as:
def group_np_sum(groupcols):
    groupcols_max = np.cumprod([ar.max() + 1 for ar in groupcols[:-1]])
    return np.unique(sum([groupcols[0]] +
                         [ar * m for ar, m in zip(groupcols[1:], groupcols_max)]),
                     return_inverse=True)[1]
To check:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print (group_np_sum([a,b]))
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
Note: the number associated with each group may not be the same between the two methods (here I changed the first element of a to 3):
a = np.array([3,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np2([a,b]))
print (group_np_sum([a,b]))
array([3, 1, 0, 0, 0, 2, 4, 0], dtype=int64)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
but the groups themselves are the same.
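If you want to verify programmatically that two labelings describe the same grouping even when the numbers differ, one way (my sketch, not from the original answer) is to check that the label pairs form a one-to-one mapping:
def same_partition(lab_a, lab_b):
    # same partition iff each label in lab_a pairs with exactly one label in lab_b and vice versa
    pairs = np.unique(np.stack([lab_a, lab_b]), axis=1)
    return pairs.shape[1] == len(np.unique(lab_a)) == len(np.unique(lab_b))
print(same_partition(group_np2([a, b]), group_np_sum([a, b])))  # True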
Now to check for timing:
a = np.random.randint(1, 100, 30000)
b = np.random.randint(1, 100, 30000)
c = np.random.randint(1, 100, 30000)
groupcols = [a,b,c]
%timeit group_pd(groupcols)
#13.7 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit group_np2(groupcols)
#34.2 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit group_np_sum(groupcols)
#3.63 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The numpy_indexed package (disclaimer: I am its author) covers this type of use case:
import numpy_indexed as npi
npi.group_by((a, b))
Passing a tuple of index-arrays like this avoids creating a copy; but if you don't mind making the copy you can use stacking as well:
npi.group_by(np.stack((a, b), axis=-1))

Numpy: Can you use broadcasting to replace values by row?

I have an M x N matrix X and a 1 x N matrix Y. What I would like to do is replace any 0-entry in X with the appropriate value from Y based on its column.
So if
X = np.array([[0, 1, 2], [3, 0, 5]])
and
Y = np.array([10, 20, 30])
The desired end result would be [[10, 1, 2], [3, 20, 5]].
This can be done straightforwardly by generating an M x N matrix where every row is Y and then using boolean filter arrays:
Y = np.ones((X.shape[0], 1)) * Y.reshape(1, -1)
X[X==0] = Y[X==0]
But could this be done using numpy's broadcasting functionality?
Sure. Instead of physically repeating Y, create a broadcasted view of Y with the shape of X, using numpy.broadcast_to:
expanded = numpy.broadcast_to(Y, X.shape)
mask = X==0
X[mask] = expanded[mask]
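Put together as a self-contained snippet (a minimal sketch of the same approach):
import numpy as np
X = np.array([[0, 1, 2], [3, 0, 5]])
Y = np.array([10, 20, 30])
expanded = np.broadcast_to(Y, X.shape)  # read-only view, no copy of Y
mask = X == 0
X[mask] = expanded[mask]  # reading from the view is fine; we only write into X
print(X)
# [[10  1  2]
#  [ 3 20  5]]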
Expand X to make it a bit more general:
In [306]: X = np.array([[0, 1, 2], [3, 0, 5],[0,1,0]])
where identifies the 0s; the 2nd array identifies the columns
In [307]: idx = np.where(X==0)
In [308]: idx
Out[308]: (array([0, 1, 2, 2]), array([0, 1, 0, 2]))
In [309]: Z = X.copy()
In [310]: Z[idx]
Out[310]: array([0, 0, 0, 0]) # flat list of where to put the values
In [311]: Y[idx[1]]
Out[311]: array([10, 20, 10, 30]) # matching list of values by column
In [312]: Z[idx] = Y[idx[1]]
In [313]: Z
Out[313]:
array([[10,  1,  2],
       [ 3, 20,  5],
       [10,  1, 30]])
Not doing broadcasting, but reasonably clean numpy.
Times compared to broadcast_to approach
In [314]: %%timeit
...: idx = np.where(X==0)
...: Z[idx] = Y[idx[1]]
...:
9.28 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [315]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
19.5 µs ± 513 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Faster, though the sample size is small.
Another way to make the expanded Y, is with repeat:
In [319]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...: mask=X==0
...: Z[mask] = exp[mask]
...:
10.8 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Its time is close to my where approach. It turns out that broadcast_to is relatively slow:
In [321]: %%timeit
...: exp = np.broadcast_to(Y,X.shape)
...:
10.5 µs ± 52.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [322]: %%timeit
...: exp = np.repeat(Y[None,:],3,0)
...:
3.76 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We'd have to do more tests to see whether that is just due to a setup cost, or if the relative times still apply with much larger arrays.
