Finding X values in numpy array and substituting for random value - python

Consider a list of numpy arrays whose values are either -1 or 1, placed in random positions.
a = np.array([1,-1,1,1,-1,1,-1,-1,1,-1])
b = np.array([-1,-1,1,-1,1,1,-1,1,-1,-1])
I need to perform operations on these arrays, like sums and pointwise multiplication.
For example, after summing two arrays I will have a new one with the values -2, 0, and 2.
c = a + b
c = [ 0 -2 2 0 0 2 -2 0 0 -2]
Now I would like to “normalize” it back to -1s and 1s.
For the 2s and -2s it is easy:
c[c < 0] = -1
c[c > 0] = 1
The problem is the 0s. For those I would like to randomly choose either a -1 or a 1.
The desired output would be something like:
c = [ 1 -1 1 -1 -1 1 -1 1 -1 -1]
In generalized terms, my question is: how do I find all N values equal to x in an array and substitute each with a random number?
How can I do this in the most “Pythonic”, and fastest, way?
Thanks!

Just posting the final results from the answers I got so far.
If anyone in the future has a better solution, please share it!
I timed the three solutions I found and one of my own.
def Norm1(HV):
    HV[HV > 0] = 1
    HV[HV < 0] = -1
    zind = np.where(HV == 0)[0]
    HV[zind] = np.array([np.random.choice([1, -1]) for _ in zind])
    return HV
def norm2(HV):
    if HV == 0:
        return np.random.choice(np.array([-1, 1]))
    else:
        return HV / HV * np.sign(HV)

Norm2 = np.vectorize(norm2)
def Norm3(HV):
    HV[HV > 0] = 1
    HV[HV < 0] = -1
    mask = HV == 0
    HV[mask] = np.random.choice((-1, 1), HV[mask].shape)
    return HV
def generate(size):
    return np.random.binomial(1, 0.5, size=size) * 2 - 1

def Norm4(arr):
    np.floor_divide(arr, 2, out=arr)
    positions = (arr == 0)
    size = np.count_nonzero(positions)
    np.add.at(arr, positions, generate(size))
    return arr
The timings were:
%%timeit
d = Norm1(c)
203 µs ± 5.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
d = Norm2(c)
33.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
d = Norm3(c)
217 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
d = Norm4(c)
21 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So as it stands, it looks like answers 1 and 3 are the best ones. The difference between them looks minimal, but after a few more runs, number 1 always comes out slightly on top.
Thanks for the help, guys!
I will add some references to HD computing to the question, as this is a core problem in that application, so it will be easier for someone to find if needed.
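For completeness, the generalized operation from the question (replace every value equal to x with a random draw) could be sketched like this; replace_equal is just an illustrative name and I have not timed it:
import numpy as np

def replace_equal(arr, x, choices=(-1, 1)):
    # Replace every element equal to x with an independent random draw from choices.
    mask = (arr == x)
    arr[mask] = np.random.choice(choices, size=np.count_nonzero(mask))
    return arr

c = np.array([0, -2, 2, 0, 0, 2, -2, 0, 0, -2])
c = np.sign(c)           # collapse the ±2s to ±1, zeros stay zero
c = replace_equal(c, 0)  # randomize the remaining zeros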

I'm not in any way claiming this is the fastest or most efficient approach.
c = np.array([ 0, -2, 2, 0, 0, 2, -2, 0, 0, -2])
def norm(a):
    if a == 0:
        return np.random.choice(np.array([-1, 1]))
    else:
        return a / a * np.sign(a)

v_norm = np.vectorize(norm)
norm_arr = v_norm(c)
Result:
In [64]: norm_arr
Out[64]: array([ 1, -1, 1, 1, -1, 1, -1, 1, -1, -1])

You might use:
>>> c = [0, -2, 2, 0, 0, 2, -2, 0, 0, -2]
>>> c = np.array([0, -2, 2, 0, 0, 2, -2, 0, 0, -2])
>>> zind = np.where(c==0)[0]
>>> c[zind] = np.array([np.random.choice([1, -1]) for _ in zind])
>>> c
array([ 1, -2, 2, -1, -1, 2, -2, -1, 1, -2])
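Note that this only randomizes the zeros; the remaining ±2 values would still need the clamping from the question, for example:
c[c < -1] = -1
c[c > 1] = 1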

Related

Return majority weighted vote from array based on columns

I have a 3 x 3 matrix x and a vector w of length 3:
x = np.array([[1, 2, 1],
              [3, 2, 1],
              [1, 2, 2]])
w = np.array([0.3, 0.4, 0.3])
I need to generate another vector y that is a majority vote for each row of x. Each column of x is weighted by the corresponding value in w. Something like this:
for y[0], it should look at x[0] => [1, 2, 1]
columns with value 1 = first and third [0,2]
columns with value 2 = second [1]
columns with value 3 = none
Sum the weights (in w) of the columns grouped by their value in x:
sum of weights of columns with value 1: 0.3 + 0.3 = 0.6
sum of weights of columns with value 2: 0.4
sum of weights of columns with value 3: 0
Since the sum of weights of columns with value 1 is the highest, y[0] = 1. And so on.
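One way to express this row-wise logic directly is np.bincount with its weights argument (a minimal sketch, assuming the labels are small positive integers):
import numpy as np

x = np.array([[1, 2, 1],
              [3, 2, 1],
              [1, 2, 2]])
w = np.array([0.3, 0.4, 0.3])

# For each row, accumulate the column weights into bins indexed by the label value.
y = np.array([np.argmax(np.bincount(row, weights=w)) for row in x])
print(y)   # [1 2 2]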
You can do it with numpy if you understand broadcasting. The downside is that because the code is vectorized, you do more computations than you need. This would matter if the size of the w vector is very large.
Perhaps someone comes up with an easier way to write it, but this is how I would do it without thinking too much.
The answer first:
i = np.arange(3) + 1
m = (x.reshape((1,4,3)) == i.reshape((3,1,1)))
np.argmax(np.sum(m, axis=2).T*w, axis=1) + 1
Now the step-by-step explanation... Note that it is usually better to start counting from zero, but I followed your convention.
I added one row so the array is not symmetric (easier to check shapes)
In [1]: x = np.array([[1, 2, 1],
   ...:               [3, 2, 1],
   ...:               [1, 2, 2],
   ...:               [3, 1, 3]])
   ...:
   ...: w = np.array([0.3, 0.4, 0.3])
The first step is to have the array of indices i. Your convention starts at one.
In [2]: i = np.arange(3) + 1
The tricky step: create an array with shape (3,4,3), where the i-th entry of the array is a (4,3) array with all entries 0 or 1. An entry is 1 if and only if x == i. This is done by adding dimensions to x and i so they can be broadcast. The operation basically compares all combinations of x and i, because every dimension of x matches a size-1 dimension of i and vice versa:
In [3]: m = (x.reshape((1,4,3)) == i.reshape((3,1,1)))*1
In [4]: m
Out[4]:
array([[[1, 0, 1],
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]],

       [[0, 1, 0],
        [0, 1, 0],
        [0, 1, 1],
        [0, 0, 0]],

       [[0, 0, 0],
        [1, 0, 0],
        [0, 0, 0],
        [1, 0, 1]]])
now you sum along rows (which is axis=2) to get the number of times each selection appeared in each row of x (note that the result is transposed when you compare it to x):
In [5]: np.sum(m, axis=2)
Out[5]:
array([[2, 1, 1, 1],
       [1, 1, 2, 0],
       [0, 1, 0, 2]])
I hope you can already see where this is going. You can read directly: In the first row of x, 1 appears twice and 2 appears once. In the second row of x, all appear once, in the third row of x, 1 appears once, 2 appears twice, etc.
multiply this by the weights:
In [7]: np.sum(m, axis=2).T*w
Out[7]:
array([[0.6, 0.4, 0. ],
       [0.3, 0.4, 0.3],
       [0.3, 0.8, 0. ],
       [0.3, 0. , 0.6]])
Get the position of the maximum along the rows (adding one to conform to your convention):
In [8]: np.argmax(np.sum(m, axis=2).T*w, axis=1) + 1
Out[8]: array([1, 2, 2, 3])
Special Case: a Tie
The following case was brought up in the comments:
x = np.array([[2, 2, 4, 1]])
w = np.array([0.1, 0.2, 0.3, 0.4])
the sum of the weights is:
[0.1, 0.4, 0., 0.4]
so in this case there is no winner. It isn't clear from the question what one would do in this case. One could take all, take none... One can look for these cases at the end:
final_w = np.sum(m, axis=2).T*w
result = np.argmax(np.sum(m*w, axis=2), axis=0) + 1
special_cases = np.argwhere(np.sum(final_w == np.max(final_w), axis=1) > 1)
Note: I used the reshape method for readability, but I often use np.expand_dims or np.newaxis. Something like this:
i = np.arange(3) + 1
m = (x[np.newaxis] == i[:, np.newaxis, np.newaxis])
np.argmax(np.sum(m, axis=2).T*w, axis=1) + 1
an alternative: you could also use some kind of compiled code. For example, numba is pretty easy to use in this case.
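For instance, a minimal numba sketch of the same row-wise weighted vote might look like this (weighted_vote_nb is a hypothetical name; it assumes integer labels starting at 1):
import numba as nb
import numpy as np

@nb.njit
def weighted_vote_nb(x, w):
    # x: (rows, cols) integer labels starting at 1, w: (cols,) column weights
    n_labels = x.max()
    out = np.empty(x.shape[0], dtype=np.int64)
    for r in range(x.shape[0]):
        scores = np.zeros(n_labels)
        for c in range(x.shape[1]):
            scores[x[r, c] - 1] += w[c]   # add this column's weight to its label's bin
        out[r] = np.argmax(scores) + 1
    return out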
Here's a really crazy way to do it, which involves sorting and indexing rather than adding a new dimension. This is sort of like the sort-based method used by np.unique.
First find the sorted indices in each row:
rows = np.repeat(np.arange(x.shape[0]), x.shape[1]) # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
cols = np.argsort(x, axis=1).ravel() # [0, 2, 1, 2, 1, 0, 0, 1, 2, 1, 0, 2]
Now you can create an array of sorted elements per-column, both unweighted and weighted. The former will be used to get the indices for summing, the latter will actually be summed.
u = x[rows, cols] # [1, 1, 2, 1, 2, 3, 1, 2, 2, 1, 3, 3]
v = np.broadcast_to(w, x.shape)[rows, cols] # [0.3, 0.3, 0.4, 0.3, 0.4, 0.3, 0.3, 0.4, 0.3, 0.4, 0.3, 0.3]
You can find the breakpoints at which to apply np.add.reduceat:
row_breaks = np.diff(rows).astype(bool) # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
col_breaks = np.diff(u).astype(bool) # [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
break_mask = row_breaks | col_breaks # [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
breaks = np.r_[0, np.flatnonzero(break_mask) + 1] # [ 0, 2, 3, 4, 5, 6, 7, 9, 10]
Now you have the sums of the weights for identical numbers in each row:
sums = np.add.reduceat(v, breaks) # [0.6, 0.4, 0.3, 0.4, 0.3, 0.3, 0.7, 0.4, 0.6]
But you need to break them up into segments corresponding to the number of unique elements per row:
unique_counts = np.add.reduceat(break_mask, np.arange(0, x.size, x.shape[1]))
unique_counts[-1] += 1 # The last segment will be missing from the mask: # [2, 3, 2, 2]
unique_rows = np.repeat(np.arange(x.shape[0]), unique_counts) # [0, 0, 1, 1, 1, 2, 2, 3, 3]
You can now sort each segment to find the maximum value:
indices = np.lexsort(np.stack((sums, unique_rows), axis=0)) # [1, 0, 2, 4, 3, 5, 6, 7, 8]
The index at the end of each run is given by:
max_inds = np.cumsum(unique_counts) - 1 # [1, 4, 6, 8]
So the maximum sums are:
sums[indices[max_inds]] # [0.6, 0.4, 0.7, 0.6]
And you can unravel the indices-within-indices to get the correct element from each row. Notice that max_inds, and everything that depends on it, is the same size as x.shape[0], as expected:
result = u[breaks[indices[max_inds]]]
This method does not look very pretty, but it is likely more space efficient than using an extra dimension on the array. Additionally, it works regardless of the numbers in x. Notice that I never subtracted anything or adjusted x in any way. In fact, all the rows are treated independently, and the coincidence of a maximum element being identical to the minimum of the next is broken by row_breaks when constructing breaks.
TL;DR
Enjoy:
def weighted_vote(x, w):
    rows = np.repeat(np.arange(x.shape[0]), x.shape[1])
    cols = np.argsort(x, axis=1).ravel()
    u = x[rows, cols]
    v = np.broadcast_to(w, x.shape)[rows, cols]
    row_breaks = np.diff(rows).astype(bool)
    col_breaks = np.diff(u).astype(bool)
    break_mask = row_breaks | col_breaks
    breaks = np.r_[0, np.flatnonzero(break_mask) + 1]
    sums = np.add.reduceat(v, breaks)
    unique_counts = np.add.reduceat(break_mask, np.arange(0, x.size, x.shape[1]))
    unique_counts[-1] += 1
    unique_rows = np.repeat(np.arange(x.shape[0]), unique_counts)
    indices = np.lexsort(np.stack((sums, unique_rows), axis=0))
    max_inds = np.cumsum(unique_counts) - 1
    return u[breaks[indices[max_inds]]]
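Applied to the 3 x 3 example from the question, this returns the expected per-row winners:
x = np.array([[1, 2, 1],
              [3, 2, 1],
              [1, 2, 2]])
w = np.array([0.3, 0.4, 0.3])
print(weighted_vote(x, w))   # [1 2 2]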
Benchmarks
Benchmarks are run in the following format:
rows = ...
cols = ...
x = np.random.randint(cols, size=(rows, cols)) + 1
w = np.random.rand(cols)
%timeit weighted_vote_MP(x, w)
%timeit weighted_vote_JG(x, w)
assert (weighted_vote_MP(x, w) == weighted_vote_JG(x, w)).all()
I used the following generalization for weighted_vote_JG, with appropriate corrections:
def weighted_vote_JG(x, w):
    i = np.arange(w.size) + 1
    m = (x[None, ...] == i.reshape(-1, 1, 1))
    return np.argmax(np.sum(m * w, axis=2), axis=0) + 1
Rows: 100, Cols: 10
MP: 440 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
* JG: 153 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Rows: 1000, Cols: 10
MP: 2.53 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
* JG: 1.03 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Rows: 10000, Cols: 10
MP: 23.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
* JG: 16.6 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rows: 100000, Cols: 10
MP: 322 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
* JG: 188 ms ± 858 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Rows: 100, Cols: 100
* MP: 3.31 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
JG: 12.6 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rows: 1000, Cols: 100
* MP: 31 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
JG: 134 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Rows: 10000, Cols: 100
* MP: 417 ms ± 7.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
JG: 1.42 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Rows: 100000, Cols: 100
* MP: 4.94 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
JG: MemoryError: Unable to allocate 7.45 GiB for an array with shape (100, 100000, 100) and data type float64
Moral of the story: for a small number of columns and weights, the expanded solution is faster. For a larger number of columns, use my version instead.

Max value per diagonal in 2d array

I have an array and need the max of the rolling difference with a dynamic window.
a = np.array([8, 18, 5, 15, 12])
print (a)
[ 8 18 5 15 12]
So first I create the differences of the array with itself:
b = a - a[:, None]
print (b)
[[  0  10  -3   7   4]
 [-10   0 -13  -3  -6]
 [  3  13   0  10   7]
 [ -7   3 -10   0  -3]
 [ -4   6  -7   3   0]]
Then set the upper triangle to 0:
c = np.tril(b)
print (c)
[[  0   0   0   0   0]
 [-10   0   0   0   0]
 [  3  13   0   0   0]
 [ -7   3 -10   0   0]
 [ -4   6  -7   3   0]]
Last, I need the max values per diagonal, which means:
max([0,0,0,0,0]) = 0
max([-10,13,-10,3]) = 13
max([3,3,-7]) = 3
max([-7,6]) = 6
max([-4]) = -4
So expected output is:
[0, 13, 3, 6, -4]
What is a nice vectorized solution? Or is there another way to get the expected output?
Use ndarray.diagonal
v = [max(c.diagonal(-i)) for i in range(b.shape[0])]
print(v) # [0, 13, 3, 6, -4]
Not sure exactly how efficient this is considering the advanced indexing involved, but this is one way to do that:
import numpy as np
a = np.array([8, 18, 5, 15, 12])
b = a[:, None] - a
# Fill lower triangle with largest negative
b[np.tril_indices(len(a))] = np.iinfo(b.dtype).min # np.finfo for float
# Put diagonals as rows
s = b.strides[1]
diags = np.ndarray((len(a) - 1, len(a) - 1), b.dtype, b, offset=s, strides=(s, (len(a) + 1) * s))
# Get maximum from each row and add initial zero
c = np.r_[0, diags.max(1)]
print(c)
# [ 0 13 3 6 -4]
EDIT:
Another alternative, which may not be what you were looking for though, is just using Numba, for example like this:
import numpy as np
import numba as nb
def max_window_diffs_jdehesa(a):
    a = np.asarray(a)
    dtinf = np.iinfo(a.dtype) if np.issubdtype(a.dtype, np.integer) else np.finfo(a.dtype)
    out = np.full_like(a, dtinf.min)
    _pwise_diffs(a, out)
    return out

@nb.njit(parallel=True)
def _pwise_diffs(a, out):
    out[0] = 0
    for w in nb.prange(1, len(a)):
        for i in range(len(a) - w):
            out[w] = max(a[i] - a[i + w], out[w])

a = np.array([8, 18, 5, 15, 12])
print(max_window_diffs_jdehesa(a))
# [ 0 13 3 6 -4]
Comparing these methods to the original:
import numpy as np
import numba as nb
def max_window_diffs_orig(a):
    a = np.asarray(a)
    b = a - a[:, None]
    out = np.zeros(len(a), b.dtype)
    out[-1] = b[-1, 0]
    for i in range(1, len(a) - 1):
        out[i] = np.diag(b, -i).max()
    return out

def max_window_diffs_jdehesa_np(a):
    a = np.asarray(a)
    b = a[:, None] - a
    dtinf = np.iinfo(b.dtype) if np.issubdtype(b.dtype, np.integer) else np.finfo(b.dtype)
    b[np.tril_indices(len(a))] = dtinf.min
    s = b.strides[1]
    diags = np.ndarray((len(a) - 1, len(a) - 1), b.dtype, b, offset=s, strides=(s, (len(a) + 1) * s))
    return np.concatenate([[0], diags.max(1)])
def max_window_diffs_jdehesa_nb(a):
    a = np.asarray(a)
    dtinf = np.iinfo(a.dtype) if np.issubdtype(a.dtype, np.integer) else np.finfo(a.dtype)
    out = np.full_like(a, dtinf.min)
    _pwise_diffs(a, out)
    return out

@nb.njit(parallel=True)
def _pwise_diffs(a, out):
    out[0] = 0
    for w in nb.prange(1, len(a)):
        for i in range(len(a) - w):
            out[w] = max(a[i] - a[i + w], out[w])
np.random.seed(0)
a = np.random.randint(0, 100, size=100)
r = max_window_diffs_orig(a)
print((max_window_diffs_jdehesa_np(a) == r).all())
# True
print((max_window_diffs_jdehesa_nb(a) == r).all())
# True
%timeit max_window_diffs_orig(a)
# 348 µs ± 986 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit max_window_diffs_jdehesa_np(a)
# 91.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit max_window_diffs_jdehesa_nb(a)
# 19.7 µs ± 88.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.random.seed(0)
a = np.random.randint(0, 100, size=10000)
%timeit max_window_diffs_orig(a)
# 651 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit max_window_diffs_jdehesa_np(a)
# 1.61 s ± 6.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit max_window_diffs_jdehesa_nb(a)
# 22 ms ± 967 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The first one may be a bit better for smaller arrays, but doesn't work well for bigger ones. Numba on the other hand is pretty good in all cases.
You can use numpy.diagonal:
a = np.array([8, 18, 5, 15, 12])
b = a - a[:, None]
c = np.tril(b)
for i in range(b.shape[0]):
    print(max(c.diagonal(-i)))
Output:
0
13
3
6
-4
Here's a vectorized solution with strides -
from skimage.util import view_as_windows
n = len(a)
z = np.zeros(n-1,dtype=a.dtype)
p = np.concatenate((a,z))
s = view_as_windows(p,n)
mask = np.tri(n,k=-1,dtype=bool)[:,::-1]
v = s[0]-s
out = np.where(mask,v.min()-1,v).max(1)
With one-loop for memory-efficiency -
n = len(a)
out = [max(a[:-i+n]-a[i:]) for i in range(n)]
Use np.max in place of max for better use of array-memory.
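In other words, presumably something like:
out = [np.max(a[:-i+n] - a[i:]) for i in range(n)]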
You can abuse the fact that reshaping non-square arrays of shape (N+1, N) to (N, N+1) will make diagonals appear as columns
from scipy.linalg import toeplitz
a = toeplitz([1,2,3,4], [1,4,3])
# array([[1, 4, 3],
#        [2, 1, 4],
#        [3, 2, 1],
#        [4, 3, 2]])
a.reshape(3, 4)
# array([[1, 4, 3, 2],
#        [1, 4, 3, 2],
#        [1, 4, 3, 2]])
Which you can then use like this (note that I've swapped the sign and masked out the strict lower triangle with a very small value):
smallv = -10000 # replace this with np.nan if you have floats
a = np.array([8, 18, 5,15,12])
b = a[:, None] - a
b[np.tril_indices(len(b), -1)] = smallv
d = np.vstack((b, np.full(len(b), smallv)))
d.reshape(len(d) - 1, -1).max(0)[:-1]
# array([ 0, 13, 3, 6, -4])

Is there a fast way to create a vector with 1 and x * 0?

Is there a fast way to create a vector with a 1 and x 0s in Python?
I would like to have something like
a = [1,0,0,0,0,0,0,0,0,...,0]
b = [1,1,0,0,0,0,0,0,0,...,0]
I tried it with a list, but see for yourself :(
lst = [1, n*[0]]
lst = np.array(lst)
print(lst)
==> [1 list([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
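For what it's worth, the problem is that n*[0] stays nested inside the outer list, so np.array() builds an object array; flattening first gives the intended vector:
n = 5
print([1, n*[0]])    # [1, [0, 0, 0, 0, 0]]  -- nested list
print([1] + n*[0])   # [1, 0, 0, 0, 0, 0]    -- flat list, converts cleanly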
A proper NumPy solution:
import numpy as np
n = 10
arr = np.zeros(shape=n + 1, dtype=np.int64)
arr[0] = 1
Results in: [1 0 0 0 0 0 0 0 0 0 0]
Quick benchmarks
Here are the functions we're going to compare:
def func_1(n):
    return np.array([1, *n*[0]])

def func_2(n):
    arr = np.zeros(shape=n + 1, dtype=np.int64)
    arr[0] = 1
    return arr

def func_3(n):
    return np.array([1] + n * [0])

def func_4(n):
    return np.array([1] + [0 for _ in range(n)])

def func_5(n):
    # note: list.extend returns None, so this actually produces a 0-d object array
    return np.array([1].extend((0 for _ in range(n))))

def func_6(n):
    # note: same caveat as func_5 -- list.extend returns None
    return np.array([1].extend([0 for _ in range(n)]))

def func_7(n):
    arr = [0 for _ in range(n)]
    arr[0] = 1
    return np.array(arr)
Results of timeit for arr_size = 100000000:
%timeit func_1(arr_size)
1 loop, best of 3: 7.3 s per loop
%timeit func_2(arr_size)
10 loops, best of 3: 177 ms per loop
%timeit func_3(arr_size)
1 loop, best of 3: 7.26 s per loop
%timeit func_4(arr_size)
1 loop, best of 3: 11.4 s per loop
%timeit func_5(arr_size)
1 loop, best of 3: 6.3 s per loop
%timeit func_6(arr_size)
1 loop, best of 3: 4.95 s per loop
%timeit func_7(arr_size)
1 loop, best of 3: 10.6 s per loop
For optimal performance, see AMC's NumPy answer above.
Use unpacking by simply adding an asterisk to your code: [1, *n*[0]] instead of [1, n*[0]]:
>>> arr = np.array([1, *n*[0]])
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
There is another NumPy solution with only slightly worse performance than the one posted by @AMC, but with the convenience that it is a single expression and doesn't need to be wrapped in a function to be used inline:
>>> n = 10
>>> np.eye(1, n + 1, 0, dtype=int)[0]
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
It's also easy to create other one-hot vectors of the same length by changing the third argument:
>>> np.eye(1, n + 1, 4, dtype=int)[0]
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
Here's how the performance compares to @AMC's func_2 above (same arr_size = 100000000):
def func_8(n, k=0):
    return np.eye(1, n + 1, k, dtype=int)[0]
>>> %timeit func_2(arr_size)
16.4 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit func_8(arr_size)
19 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

assigning pixels as 0 and 1

Remote sensing, Python:
Is there a way to create a new band or array with DN values of only 0 and 1, based on conditional statements derived from the DN values of two separate bands? For example, if values in band 4 >= 11000 and values in band 11 <= 23000, set to 0; otherwise set to 1.
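A hedged NumPy sketch of that conditional (band4 and band11 are hypothetical placeholders for however the two bands are loaded):
import numpy as np

# Hypothetical band arrays; in practice these would come from the raster reader.
band4 = np.random.randint(0, 30000, size=(100, 100))
band11 = np.random.randint(0, 30000, size=(100, 100))

# 0 where band 4 >= 11000 AND band 11 <= 23000, otherwise 1
new_band = np.where((band4 >= 11000) & (band11 <= 23000), 0, 1)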
You can use int() to convert a boolean to a 0 or 1:
>>> l = [1, 2, 3, 4, 5, 6]
>>> [int(2 < i < 5) for i in l]
[0, 0, 1, 1, 0, 0]
You could just use Python's ternary operator and a list comprehension:
>>> vals = [10000, 500, 200, 10290, 10290129, 3]
>>> vals = [1 if i > 500 else 0 for i in vals]
>>> vals
[1, 0, 0, 1, 1, 0]
Or using numpy (always a good option):
>>> import numpy as np
>>> vals = np.array([10000, 500, 200, 10290, 10290129, 3])
>>> vals = (vals > 500).astype(int)
>>> vals
array([1, 0, 0, 1, 1, 0])
Some timings:
In [4]: vals = np.random.rand(10000)
In [6]: %timeit [1 if i >= 0.5 else 0 for i in vals]
1.26 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit [int(i >= 0.5) for i in vals]
5.18 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit (vals >= 0.5).astype(int)
12.9 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As usual, numpy wins, followed by ternary, and then int conversion.

Can motelling be vectorized in pandas?

"Motelling" is a way to smooth response to a signal.
For example: Given a time-varying signal S_t that takes integer values 1-5, and a response function F_t({S_0...S_t}) that assigns [-1, 0, +1] to each signal, a standard motelling response function would return:
-1 if S_t = 1, or if (S_t = 2) & (F_(t-1) = -1)
+1 if S_t = 5, or if (S_t = 4) & (F_(t-1) = +1)
0 otherwise
If I have a DataFrame by time of the signal {S}, is there a vectorized way to apply this motelling function?
E.g., if DataFrame df['S'].values = [1, 2, 2, 2, 3, 5, 3, 4, 1]
then is there a vectorized approach that would produce:
df['F'].values = [-1, -1, -1, -1, 0, 1, 0, 0, -1]
Or, absent a vectorized solution, is there something obviously faster than the following DataFrame.itertuples() approach I am using now?
df = pd.DataFrame(np.random.random_integers(1,5,100000), columns=['S'])
# First set response for time t
df['F'] = np.where(df['S'] == 5, 1, np.where(df['S'] == 1, -1, 0))
# Now loop to apply motelling
previousF = 0
for row in df.itertuples():
    df.at[row.Index, 'F'] = np.where((row.S >= 4) & (previousF == 1), 1,
                                     np.where((row.S <= 2) & (previousF == -1), -1, row.F))
    previousF = row.F
With a complex DataFrame the loop portion takes O(minute per million rows)!
You can try regex.
The patterns we are looking for are
(1) A 1 followed by 1s or 2s. (We select this rule because any 2 that comes after a 1 can be considered a 1 and keeps influencing the next row's result.)
(2) A 5 followed by 4s or 5s. (Similarly, any 4 that comes after a 5 can be considered a 5.)
(1) results in consecutive -1s and (2) results in consecutive 1s. Everything that does not match will be 0.
Using these rules, the rest of the work is replacement. We especially use lambda m: "x"*len(m.group(0)), which turns each match into a string of the same length. (See reference.)
import re
s = [1, 2, 2, 2, 3, 5, 3, 4, 1]
str_s = "".join(str(i) for i in s)
s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)),str_s)
s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)),s1)
l = list(s2)
l2 = [v if v in ["x", "y"] else 0 for v in l]
l3 = [1 if v == 'x' else v for v in l2]
l4 = [-1 if v == 'y' else v for v in l3]
[-1, -1, -1, -1, 0, 1, 0, 0, -1]
Bigger dataset
def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)), str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)), s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4
s = np.random.randint(1,6,100000)
%timeit tai(s)
104 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df = pd.DataFrame(np.random.randint(1,6,100000), columns=['S'])
# First set response for time t
df['F'] = np.where(df['S'] == 5, 1, np.where(df['S'] == 1, -1, 0))
# Now loop to apply motelling
%%timeit # (OP's answer)
previousF = 0
for row in df.itertuples():
    df.at[row.Index, 'F'] = np.where((row.S >= 4) & (previousF == 1), 1,
                                     np.where((row.S <= 2) & (previousF == -1), -1, row.F))
    previousF = row.F
1.11 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Reference
Replace substrings in python with the length of each substring
You may notice that since the consecutive elements of F[t] depend on one another, this doesn't vectorize well. I'm partial to using numba in these cases. Your function is simple, it works on a numpy array (a Series is just an array under the hood), and it's not easy to vectorize, so numba is ideal for this.
Imports and function:
import numpy as np
import pandas as pd
def motel(S):
    F = np.zeros_like(S)
    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
        # no else required since it's already set to zero
    return F
Here we can just jit-compile the function
import numba
jit_motel = numba.jit(nopython=True)(motel)
And ensure that the normal and jit versions return expected values
S = pd.Series([1, 2, 2, 2, 3, 5, 3, 4, 1])
print("motel(S) = ", motel(S))
print("jit_motel(S)", jit_motel(S.values))
result:
motel(S) = [-1 -1 -1 -1 0 1 0 0 -1]
jit_motel(S) [-1 -1 -1 -1 0 1 0 0 -1]
For timing, let's scale:
N = 10**4
S = pd.Series( np.random.randint(1, 5, N) )
%timeit jit_motel(S.values)
%timeit motel(S.values)
result:
82.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
7.75 ms ± 77.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For your million data points (didn't time normal function because I didn't wanna wait =) )
N = 10**6
S = pd.Series( np.random.randint(1, 5, N) )
%timeit motel(S.values)
result:
768 ms ± 7.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Boom! Less than a second for a million entries. This approach is simple, readable, and fast. Only downside is the Numba dependency, but it's included in anaconda and available in conda easily (maybe pip I'm not sure).
To aggregate the other answers, first I should note that apparently DataFrame.itertuples() does not iterate deterministically, or as expected, so the sample in the OP doesn't always produce the correct result on large samples.
Thanks to the other answers, I realized that a mechanical application of the motelling logic not only produces correct results, but does so surprisingly quickly when we use the DataFrame fill functions (ffill/fillna):
def dfmotel(df):
    # We'll copy results into column F as we build them
    df['F'] = np.nan
    # This algo is destructive, so we operate on a copy of the signal
    df['temp'] = df['S']
    # Fill forward the negative signal
    df.loc[df['temp'] == 2, 'temp'] = np.nan
    df['temp'].ffill(inplace=True)
    df.loc[df['temp'] == 1, 'F'] = -1
    # Fill forward the positive signal
    df.loc[df['temp'] == 4, 'temp'] = np.nan
    df['temp'].ffill(inplace=True)
    df.loc[df['temp'] == 5, 'F'] = 1
    # All other signals are zero
    df['F'].fillna(0, inplace=True)
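A quick sanity check on the example from the question (assuming numpy and pandas are imported as np and pd):
df_test = pd.DataFrame({'S': [1, 2, 2, 2, 3, 5, 3, 4, 1]})
dfmotel(df_test)
print(df_test['F'].astype(int).tolist())   # [-1, -1, -1, -1, 0, 1, 0, 0, -1]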
For all timing tests we will operate on the same input:
df = pd.DataFrame(np.random.randint(1,5,1000000), columns=['S'])
For the DataFrame-based function above we get:
%timeit dfmotel(df.copy())
123 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is quite acceptable performance.
tai was first to present this very clever solution using RegEx (which is what inspired my function above), but it can't match the speed of staying in number space:
import re
def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)), str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)), s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4
%timeit tai(df['S'].values)
899 ms ± 9.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But nothing beats compiled code. Thanks to evamicur for this solution using the convenient numba in-line compiler:
import numba
def motel(S):
    F = np.zeros_like(S)
    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
    return F
jit_motel = numba.jit(nopython=True)(motel)
%timeit jit_motel(df['S'].values)
9.06 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
