I have a tensor of shape (?, a, a, b).
I want to convert this to a tensor of shape (?, a, b) where:
output[i, j, k] = input[i, j, j, k].
This is simple to do in NumPy, where I can just assign elements by looping over i, j, k. However, all manipulations must stay as tensors in TensorFlow, as the result is needed to evaluate the cost function and train the model.
I have already looked at tf.diag_part(), but from my understanding it cannot be applied to specific axes and must operate on the entire tensor.
Since, as you say, tf.diag_part does not accept an axis argument, it does not seem useful here. Here is one possible solution using tf.gather_nd:
import tensorflow as tf
import numpy as np
# Input data
inp = tf.placeholder(tf.int32, [None, None, None, None])
# Read dimensions
s = tf.shape(inp)
a, b, c = s[0], s[1], s[3]
# Make indices for gathering
ii, jj, kk = tf.meshgrid(tf.range(a), tf.range(b), tf.range(c), indexing='ij')
idx = tf.stack([ii, jj, jj, kk], axis=-1)
# Gather result
out = tf.gather_nd(inp, idx)
# Test
with tf.Session() as sess:
    inp_val = np.arange(36).reshape(2, 3, 3, 2)
    print(inp_val)
    # [[[[ 0  1]
    #    [ 2  3]
    #    [ 4  5]]
    #
    #   [[ 6  7]
    #    [ 8  9]
    #    [10 11]]
    #
    #   [[12 13]
    #    [14 15]
    #    [16 17]]]
    #
    #
    #  [[[18 19]
    #    [20 21]
    #    [22 23]]
    #
    #   [[24 25]
    #    [26 27]
    #    [28 29]]
    #
    #   [[30 31]
    #    [32 33]
    #    [34 35]]]]
    print(sess.run(out, feed_dict={inp: inp_val}))
    # [[[ 0  1]
    #   [ 8  9]
    #   [16 17]]
    #
    #  [[18 19]
    #   [26 27]
    #   [34 35]]]
Here are a couple of alternative versions. One uses tensor algebra:
inp = tf.placeholder(tf.int32, [None, None, None, None])
b = tf.shape(inp)[1]
eye = tf.eye(b, dtype=inp.dtype)
inp_masked = inp * tf.expand_dims(eye, 2)
out = tf.tensordot(inp_masked, tf.ones(b, inp.dtype), [[2], [0]])
And one using boolean masking:
inp = tf.placeholder(tf.int32, [None, None, None, None])
s = tf.shape(inp)
a, b, c = s[0], s[1], s[3]
mask = tf.eye(b, dtype=tf.bool)
inp_mask = tf.boolean_mask(inp, tf.tile(tf.expand_dims(mask, 0), [a, 1, 1]))
out = tf.reshape(inp_mask, [a, b, c])
EDIT: I took some time measurements for the three methods:
import tensorflow as tf
import numpy as np
def f1(inp):
    s = tf.shape(inp)
    a, b, c = s[0], s[1], s[3]
    ii, jj, kk = tf.meshgrid(tf.range(a), tf.range(b), tf.range(c), indexing='ij')
    idx = tf.stack([ii, jj, jj, kk], axis=-1)
    return tf.gather_nd(inp, idx)

def f2(inp):
    b = tf.shape(inp)[1]
    eye = tf.eye(b, dtype=inp.dtype)
    inp_masked = inp * tf.expand_dims(eye, 2)
    return tf.tensordot(inp_masked, tf.ones(b, inp.dtype), [[2], [0]])

def f3(inp):
    s = tf.shape(inp)
    a, b, c = s[0], s[1], s[3]
    mask = tf.eye(b, dtype=tf.bool)
    inp_mask = tf.boolean_mask(inp, tf.tile(tf.expand_dims(mask, 0), [a, 1, 1]))
    return tf.reshape(inp_mask, [a, b, c])

with tf.Graph().as_default():
    inp = tf.constant(np.arange(100 * 300 * 300 * 10).reshape(100, 300, 300, 10))
    out1 = f1(inp)
    out2 = f2(inp)
    out3 = f3(inp)
    with tf.Session() as sess:
        v1, v2, v3 = sess.run((out1, out2, out3))
        print(np.all(v1 == v2) and np.all(v1 == v3))
        # True
        %timeit sess.run(out1)
        # CPU: 1 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
        # GPU: 1.04 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
        %timeit sess.run(out2)
        # CPU: 1.17 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
        # GPU: 734 ms ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
        %timeit sess.run(out3)
        # CPU: 1.11 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
        # GPU: 1.41 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
All three seem similar on CPU, but the second one is, for some reason, much slower on my GPU. I'm not sure what the results would be with float values, though. You could also try replacing tf.tensordot with tf.einsum (see the sketch below). As for the first and third ones, they both seem fine, although if you are backpropagating through these operations the cost of computing the gradient may vary.
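For reference, here is a minimal sketch of the einsum variant mentioned above (my own untested assumption, shown on a constant input since older tf.einsum versions can be picky about fully dynamic shapes); contracting the second diagonal axis against an identity matrix does the same job as the tensordot version:
import tensorflow as tf
import numpy as np

inp = tf.constant(np.arange(36).reshape(2, 3, 3, 2))
b = tf.shape(inp)[1]
eye = tf.eye(b, dtype=inp.dtype)
# out[i, j, l] = sum_k inp[i, j, k, l] * eye[j, k] = inp[i, j, j, l]
out = tf.einsum('ijkl,jk->ijl', inp, eye)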
Related
I have an array in numpy. I want to roll the first column by 1, second column by 2, etc.
Here is an example.
>>> x = np.reshape(np.arange(15), (5, 3))
>>> x
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
What I want to do:
>>> y = roll(x)
>>> y
array([[12, 10,  8],
       [ 0, 13, 11],
       [ 3,  1, 14],
       [ 6,  4,  2],
       [ 9,  7,  5]])
What is the best way to do it?
The real array will be very big. I'm using cupy, the GPU version of numpy. I will prefer solution fastest on GPU, but of course, any idea is welcomed.
You could use advanced indexing:
import numpy as np
x = np.reshape(np.arange(15), (5, 3))
h, w = x.shape
rows, cols = np.arange(h), np.arange(w)
offsets = cols + 1
shifted = np.subtract.outer(rows, offsets) % h
y = x[shifted, cols]
y:
array([[12, 10,  8],
       [ 0, 13, 11],
       [ 3,  1, 14],
       [ 6,  4,  2],
       [ 9,  7,  5]])
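Since cupy mirrors numpy's advanced-indexing API, the same approach should carry over to the GPU essentially unchanged. This is an untested sketch of mine (it assumes cupy is installed and uses broadcasting in place of np.subtract.outer):
import cupy as cp

x = cp.arange(15).reshape(5, 3)
h, w = x.shape
rows, cols = cp.arange(h), cp.arange(w)
# broadcasted equivalent of np.subtract.outer(rows, cols + 1)
shifted = (rows[:, None] - (cols + 1)) % h
y = x[shifted, cols]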
I implemented a naive solution (roll_for) and compared it to @Chrysophylaxs's solution (roll_indexing).
Conclusion: roll_indexing is faster for small arrays, but the difference shrinks when the array goes bigger, and is eventually slower than roll_for for very large arrays.
Implementations:
import numpy as np
def roll_for(x, shifts=None, axis=-1):
    if shifts is None:
        shifts = np.arange(1, x.shape[axis] + 1)  # OP requirement
    xt = x.swapaxes(axis, 0)  # https://stackoverflow.com/a/31094758/13636407
    yt = np.empty_like(xt)
    for idx, shift in enumerate(shifts):
        yt[idx] = np.roll(xt[idx], shift=shift)
    return yt.swapaxes(0, axis)

def roll_indexing(x):
    h, w = x.shape
    rows, cols = np.arange(h), np.arange(w)
    offsets = cols + 1
    shifted = np.subtract.outer(rows, offsets) % h  # fix
    return x[shifted, cols]
Tests:
M, N = 5, 3
x = np.arange(M * N).reshape(M, N)
expected = np.array([[12, 10, 8], [0, 13, 11], [3, 1, 14], [6, 4, 2], [9, 7, 5]])
assert np.array_equal(expected, roll_for(x))
assert np.array_equal(expected, roll_indexing(x))
M, N = 100, 200
# roll_indexing didn't work when M < N before the fix
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
Benchmark:
M, N = 100, 100
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 859 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit roll_indexing(x) # 81 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
M, N = 1_000, 1_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 12.7 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit roll_indexing(x) # 12.4 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
M, N = 10_000, 10_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 1.3 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit roll_indexing(x) # 1.61 s ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
How can I increase the performance of a Python script using numpy and numba?
I'm trying to convert decimal numbers to a base-21 numeral system.
Input: [15, 18, 28, 11, 7, 5, 41, 139, 6, 507]
Output: [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
My script works well on the CPU. How can I modify it to increase performance using the GPU?
import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import numba as nb
elements = [
"n|0",
"n|1",
"n|2",
"n|3",
"n|4",
"n|5",
"n|6",
"n|7",
"n|8",
"n|9",
"n|10",
"o|+",
"o|*",
"o|/",
"om|-",
"bl|(",
"br|)",
"e|**2",
"e|**3",
"e|**0.5",
"e|**(1/3)",
]
elements_len = len(elements)
def decimal_to_custom(number):
    x = number % elements_len
    ch = [x]
    if number - x != 0:
        return decimal_to_custom(number // elements_len) + ch
    else:
        return ch
decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507])  # very big array
custom_numers = []
for decimal_number in decimal_numbers:
    custom_numer = decimal_to_custom(decimal_number)
    custom_numers.append(custom_numer)
print(custom_numers)
Your code can be summarized as:
import numpy as np
def decimal_to_custom(number, k):
    x = number % k
    ch = [x]
    if number - x != 0:
        return decimal_to_custom(number // k, k) + ch
    else:
        return ch

def remainders_OP(arr, k):
    result = []
    for value in arr:
        result.append(decimal_to_custom(value, k))
    return result
decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507]) #very big array
print(remainders_OP(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
This code can already be sped up by replacing the costly recursive implementation of decimal_to_custom() with a simpler iterative version, mod_list(), which appends and then reverses instead of performing the very expensive head insert (equivalent to list.insert(0, x)) implied by the recursion in the OP:
def mod_list(x, k):
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]

def remainders(arr, k):
    result = []
    for x in arr:
        result.append(mod_list(x, k))
    return result
print(remainders(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
Now, both can be accelerated with Numba, to obtain some speed-up:
import numba as nb
@nb.njit
def mod_list_nb(x, k):
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]

@nb.njit
def remainders_nb(arr, k):
    result = []
    for x in arr:
        result.append(mod_list_nb(x, k))
    return result
print(remainders_nb(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
A number of options can be passed to the decorator, including target_backend="cuda" to have the computation run on the GPU.
As we shall see in the benchmarks, this is not going to be beneficial.
The reason is that list.append() (as well as list.insert()) is not easy to run in parallel, and hence you cannot easily exploit the massive parallelism of GPUs. A sketch of such a GPU-targeted variant follows.
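For concreteness, this is a minimal sketch of what the GPU-targeted variant could look like (my assumption of the setup, not the author's exact code; the names follow the _cunb suffix used in the benchmarks below):
import numba as nb

@nb.njit(target_backend="cuda")
def mod_list_cunb(x, k):
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]

@nb.njit(target_backend="cuda")
def remainders_cunb(arr, k):
    result = []
    for x in arr:
        result.append(mod_list_cunb(x, k))
    return result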
Anyway, the above solutions are slowed down by the choice of the underlying data container.
Using fixed-size arrays instead of dynamically growing a list at each iteration results in much faster execution:
def remainders_fixed_np(arr, k, m):
    arr = arr.copy()
    n = len(arr)
    result = np.empty((n, m), dtype=np.int_)
    for i in range(m - 1, -1, -1):
        result[:, i] = arr % k  # arr is 1D: current least-significant digit
        arr //= k
    return result
print(remainders_fixed_np(decimal_numbers, elements_len, 3).T)
# [[ 0 0 0 0 0 0 0 0 0 1]
# [ 0 0 1 0 0 0 1 6 0 3]
# [15 18 7 11 7 5 20 13 6 3]]
or, with Numba acceleration (and avoiding unnecessary computation):
@nb.njit
def remainders_fixed_nb(arr, k, m):
    n = len(arr)
    result = np.zeros((n, m), dtype=np.int_)
    for i in range(n):
        j = m - 1
        x = arr[i]
        while x >= k:
            q, r = divmod(x, k)
            result[i, j] = r
            x = q
            j -= 1
        result[i, j] = x
    return result
print(remainders_fixed_nb(decimal_numbers, elements_len, 3).T)
# [[ 0 0 0 0 0 0 0 0 0 1]
# [ 0 0 1 0 0 0 1 6 0 3]
# [15 18 7 11 7 5 20 13 6 3]]
Some Benchmarks
Now, some benchmarks run on Google Colab show indicative timings, where:
the _nb suffix indicates Numba acceleration
the _pnb suffix indicates Numba acceleration with parallel=True and the outermost range() replaced with nb.prange() (a sketch of this variant follows the list)
the _cunb suffix indicates Numba acceleration with the CUDA target, target_backend="cuda"
the _cupnb suffix indicates Numba acceleration with both parallelization and the CUDA target
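Again as a sketch (assumed, not the author's exact code), the parallel fixed-size variant can be declared like this; the CUDA variants (remainders_fixed_cunb, remainders_fixed_cupnb) would add target_backend="cuda" to the decorator in the same way:
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def remainders_fixed_pnb(arr, k, m):
    n = len(arr)
    result = np.zeros((n, m), dtype=np.int_)
    for i in nb.prange(n):  # parallel outer loop
        j = m - 1
        x = arr[i]
        while x >= k:
            q, r = divmod(x, k)
            result[i, j] = r
            x = q
            j -= 1
        result[i, j] = x
    return result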
m = 4
n = 100000
k = elements_len  # 21, as defined above
arr = np.random.randint(1, k ** m - 1, n)
funcs = remainders_OP, remainders, remainders_nb, remainders_cunb
base = funcs[0](arr, k)
for func in funcs:
    res = func(arr, k)
    is_good = base == res
    print(f"{func.__name__:>16s} {is_good!s:>5s} ", end="")
    %timeit -n 4 -r 4 func(arr, k)
# remainders_OP True 333 ms ± 4.38 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
# remainders True 268 ms ± 5.11 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
# remainders_nb True 46.9 ms ± 3.16 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
# remainders_cunb True 46.4 ms ± 1.71 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
fixed_funcs = remainders_fixed_np, remainders_fixed_nb, remainders_fixed_pnb, remainders_fixed_cunb, remainders_fixed_cupnb
base = fixed_funcs[0](arr, k, m)
for func in fixed_funcs:
    res = func(arr, k, m)
    is_good = np.all(base == res)
    print(f"{func.__name__:>24s} {is_good!s:>5s} ", end="")
    %timeit -n 8 -r 8 func(arr, k, m)
# remainders_fixed_np True 10 ms ± 2.09 ms per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_nb True 3.6 ms ± 315 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_pnb True 2.68 ms ± 550 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_cunb True 3.49 ms ± 192 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_cupnb True 2.63 ms ± 314 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
These results indicate that running on the GPU has minimal effect.
The greatest speed-up is obtained by switching the data container to a pre-allocated array.
Numba provides some additional acceleration, both for the dynamically allocated and the pre-allocated versions.
I have an array and need the max of the rolling difference with a dynamic window.
a = np.array([8, 18, 5, 15, 12])
print(a)
[ 8 18  5 15 12]
So first I compute the pairwise differences of the array with itself:
b = a - a[:, None]
print (b)
[[  0  10  -3   7   4]
 [-10   0 -13  -3  -6]
 [  3  13   0  10   7]
 [ -7   3 -10   0  -3]
 [ -4   6  -7   3   0]]
Then I zero out the upper triangle:
c = np.tril(b)
print (c)
[[  0   0   0   0   0]
 [-10   0   0   0   0]
 [  3  13   0   0   0]
 [ -7   3 -10   0   0]
 [ -4   6  -7   3   0]]
Finally, I need the max value of each lower diagonal, which means:
max([0,0,0,0,0]) = 0
max([-10,13,-10,3]) = 13
max([3,3,-7]) = 3
max([-7,6]) = 6
max([-4]) = -4
So expected output is:
[0, 13, 3, 6, -4]
Is there a nice vectorized solution? Or is there another way to get the expected output?
Use ndarray.diagonal
v = [max(c.diagonal(-i)) for i in range(b.shape[0])]
print(v) # [0, 13, 3, 6, -4]
Not sure exactly how efficient this is considering the advanced indexing involved, but this is one way to do that:
import numpy as np
a = np.array([8, 18, 5, 15, 12])
b = a[:, None] - a
# Fill lower triangle with largest negative
b[np.tril_indices(len(a))] = np.iinfo(b.dtype).min # np.finfo for float
# Put diagonals as rows
s = b.strides[1]
diags = np.ndarray((len(a) - 1, len(a) - 1), b.dtype, b, offset=s, strides=(s, (len(a) + 1) * s))
# Get maximum from each row and add initial zero
c = np.r_[0, diags.max(1)]
print(c)
# [ 0 13 3 6 -4]
EDIT:
Another alternative, which may not be what you were looking for though, is just using Numba, for example like this:
import numpy as np
import numba as nb
def max_window_diffs_jdehesa(a):
    a = np.asarray(a)
    dtinf = np.iinfo(a.dtype) if np.issubdtype(a.dtype, np.integer) else np.finfo(a.dtype)
    out = np.full_like(a, dtinf.min)
    _pwise_diffs(a, out)
    return out

@nb.njit(parallel=True)
def _pwise_diffs(a, out):
    out[0] = 0
    for w in nb.prange(1, len(a)):
        for i in range(len(a) - w):
            out[w] = max(a[i] - a[i + w], out[w])

a = np.array([8, 18, 5, 15, 12])
print(max_window_diffs_jdehesa(a))
# [ 0 13  3  6 -4]
Comparing these methods to the original:
import numpy as np
import numba as nb
def max_window_diffs_orig(a):
    a = np.asarray(a)
    b = a - a[:, None]
    out = np.zeros(len(a), b.dtype)
    out[-1] = b[-1, 0]
    for i in range(1, len(a) - 1):
        out[i] = np.diag(b, -i).max()
    return out

def max_window_diffs_jdehesa_np(a):
    a = np.asarray(a)
    b = a[:, None] - a
    dtinf = np.iinfo(b.dtype) if np.issubdtype(b.dtype, np.integer) else np.finfo(b.dtype)
    b[np.tril_indices(len(a))] = dtinf.min
    s = b.strides[1]
    diags = np.ndarray((len(a) - 1, len(a) - 1), b.dtype, b, offset=s, strides=(s, (len(a) + 1) * s))
    return np.concatenate([[0], diags.max(1)])

def max_window_diffs_jdehesa_nb(a):
    a = np.asarray(a)
    dtinf = np.iinfo(a.dtype) if np.issubdtype(a.dtype, np.integer) else np.finfo(a.dtype)
    out = np.full_like(a, dtinf.min)
    _pwise_diffs(a, out)
    return out

@nb.njit(parallel=True)
def _pwise_diffs(a, out):
    out[0] = 0
    for w in nb.prange(1, len(a)):
        for i in range(len(a) - w):
            out[w] = max(a[i] - a[i + w], out[w])
np.random.seed(0)
a = np.random.randint(0, 100, size=100)
r = max_window_diffs_orig(a)
print((max_window_diffs_jdehesa_np(a) == r).all())
# True
print((max_window_diffs_jdehesa_nb(a) == r).all())
# True
%timeit max_window_diffs_orig(a)
# 348 µs ± 986 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit max_window_diffs_jdehesa_np(a)
# 91.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit max_window_diffs_jdehesa_nb(a)
# 19.7 µs ± 88.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.random.seed(0)
a = np.random.randint(0, 100, size=10000)
%timeit max_window_diffs_orig(a)
# 651 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit max_window_diffs_jdehesa_np(a)
# 1.61 s ± 6.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit max_window_diffs_jdehesa_nb(a)
# 22 ms ± 967 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The first one may be a bit better for smaller arrays, but it doesn't work well for bigger ones. Numba, on the other hand, is pretty good in all cases.
You can use numpy.diagonal:
a = np.array([8, 18, 5,15,12])
b = a - a[:, None]
c = np.tril(b)
for i in range(b.shape[0]):
    print(max(c.diagonal(-i)))
Output:
0
13
3
6
-4
Here's a vectorized solution with strides -
from skimage.util import view_as_windows
n = len(a)
z = np.zeros(n-1,dtype=a.dtype)
p = np.concatenate((a,z))
s = view_as_windows(p,n)
mask = np.tri(n,k=-1,dtype=bool)[:,::-1]
v = s[0]-s
out = np.where(mask,v.min()-1,v).max(1)
With one loop, for memory efficiency -
n = len(a)
out = [max(a[:-i+n]-a[i:]) for i in range(n)]
Use np.max in place of max for better use of array memory, for example:
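A sketch of that replacement (same result; np.max avoids calling Python's builtin max on arrays):
n = len(a)
out = [np.max(a[:n - i] - a[i:]) for i in range(n)]  # a[:n-i] is the same slice as a[:-i+n]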
You can abuse the fact that reshaping non-square arrays of shape (N+1, N) to (N, N+1) will make diagonals appear as columns
from scipy.linalg import toeplitz
a = toeplitz([1,2,3,4], [1,4,3])
# array([[1, 4, 3],
# [2, 1, 4],
# [3, 2, 1],
# [4, 3, 2]])
a.reshape(3, 4)
# array([[1, 4, 3, 2],
# [1, 4, 3, 2],
# [1, 4, 3, 2]])
You can then use it like this (note that I've swapped the sign and set the strictly lower triangle to a very small value):
smallv = -10000 # replace this with np.nan if you have floats
a = np.array([8, 18, 5,15,12])
b = a[:, None] - a
b[np.tril_indices(len(b), -1)] = smallv
d = np.vstack((b, np.full(len(b), smallv)))
d.reshape(len(d) - 1, -1).max(0)[:-1]
# array([ 0, 13, 3, 6, -4])
In recent TensorFlow (1.13 or 2.0) is there a way to extract non-contiguous slices from a tensor in one pass? How to do it?
For instance with the following tensor:
1 2 3 4
5 6 7 8
I want to extract columns 1 and 3 in one op to get:
2 4
6 8
However it seems I cannot do it in a single op with slicing.
What's the correct/fastest/most elegant way to do this?
1. Using tf.gather(tensor, columns, axis=1) (TF1.x, TF2):
import tensorflow as tf
tensor = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.float32)
columns = [1, 3]
print(tf.gather(tensor, columns, axis=1).numpy())
# [[2. 4.]
#  [6. 8.]]
%timeit -n 10000 tf.gather(tensor, columns, axis=1)
82.6 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2. With indexing (TF1.x, TF2):
import tensorflow as tf
tensor = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.float32)
columns = [1, 3] # <--columns you want to extract
transposed = tf.transpose(tensor)
sliced = [transposed[c] for c in columns]
stacked = tf.transpose(tf.stack(sliced, axis=0))
# print(stacked.numpy()) # <-- TF2, TF1.x-eager
with tf.Session() as sess:  # <-- TF1.x
    print(sess.run(stacked))
# [[2. 4.]
#  [6. 8.]]
Wrapping it in a function and running %timeit on tf.__version__ == '2.0.0-alpha0' gives:
154 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Decorating it with @tf.function makes it more than 2 times faster:
import tensorflow as tf
tensor = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.float32)
columns = [1, 3] # <--columns you want to extract
@tf.function
def extract_columns(tensor=tensor, columns=columns):
    transposed = tf.transpose(tensor)
    sliced = [transposed[c] for c in columns]
    stacked = tf.transpose(tf.stack(sliced, axis=0))
    return stacked
%timeit -n 10000 extract_columns()
66.8 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3. One-liner for eager execution (TF2, TF1.x-eager):
import tensorflow as tf
tensor = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.float32)
columns = [1, 3] # <--columns you want to extract
res = tf.transpose(tf.stack([t for i, t in enumerate(tf.transpose(tensor))
                             if i in columns], 0))
print(res.numpy())
# [[2. 4.]
# [6. 8.]]
%timeit in tf.__version__=='2.0.0-alpha0':
242 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4. Use tf.one_hot() to specify rows/columns and then tf.boolean_mask() to extract these rows/columns (TF1.x, TF2):
import tensorflow as tf
tensor = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.float32)
columns = [1, 3] # <--columns you want to extract
mask = tf.one_hot(columns, tensor.get_shape().as_list()[-1])
mask = tf.cast(tf.reduce_sum(mask, axis=0), tf.bool)  # boolean_mask expects a boolean mask
res = tf.transpose(tf.boolean_mask(tf.transpose(tensor), mask))
# print(res.numpy()) # <-- TF2, TF1.x-eager
with tf.Session() as sess:  # TF1.x
    print(sess.run(res))
# [[2. 4.]
#  [6. 8.]]
%timeit in tf.__version__=='2.0.0-alpha0':
494 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can get all odd-numbered columns with a combination of reshapes and a slice:
import numpy as np
import tensorflow as tf

N = 4
M = 10
input = tf.constant(np.random.rand(M, N))
# Pair up adjacent columns, keep the second of each pair, then restore the shape
slice_odd = tf.reshape(tf.reshape(input, (-1, 2))[:, 1], (-1, N // 2))
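As a quick sanity check (a sketch of mine, assuming eager execution), the result should match gathering the odd column indices directly:
odd = tf.gather(input, list(range(1, N, 2)), axis=1)
# slice_odd.numpy() and odd.numpy() contain identical values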
I have a NumPy array:
arr = [[1, 2],
       [3, 4]]
I want to create a new array that contains powers of arr up to a power order:
# arr_new = [arr^0, arr^1, arr^2, arr^3,...arr^order]
arr_new = [[1, 1, 1, 2, 1, 4, 1, 8],
           [1, 1, 3, 4, 9, 16, 27, 64]]
My current approach uses for loops:
import numpy as np

arr = np.array([[1, 2], [3, 4]])
order = 3
# Pre-allocate an array for powers
rows, cols = arr.shape
arr_new = np.zeros((rows, (order + 1) * cols))
# Iterate over each exponent
for i in range(order + 1):
    arr_new[:, (i * cols):((i + 1) * cols)] = arr ** i
print(arr_new)
Is there a faster (i.e. vectorized) approach to creating powers of an array?
Benchmarking
Thanks to @hpaulj, @Divakar, and @Paul Panzer for the answers. I benchmarked the loop-based and broadcasting-based operations on the following test arrays.
arr = np.array([[1, 2],
                [3, 4]])
order = 3
arrLarge = np.random.randint(0, 10, (100, 100)) # 100 x 100 array
orderLarge = 10
The loop_based function is:
def loop_based(arr, order):
    # pre-allocate an array for powers
    rows, cols = arr.shape
    arr_new = np.zeros((rows, (order + 1) * cols))
    # iterate over each exponent
    for i in range(order + 1):
        arr_new[:, (i * cols):((i + 1) * cols)] = arr ** i
    return arr_new
The broadcast_based function using hstack is:
def broadcast_based_hstack(arr, order):
    # Create a 3D exponent array for a 2D input array to force broadcasting
    powers = np.arange(order + 1)[:, None, None]
    # Generate values (third axis contains the array at various powers)
    exponentiated = arr ** powers
    # Reshape and return the array
    return np.hstack(exponentiated)  # <== using the hstack function
The broadcast_based function using reshape is:
def broadcast_based_reshape(arr, order):
    # Create a 2D exponent array for a 2D input array to force broadcasting
    powers = np.arange(order + 1)[:, None]
    # Generate values (the middle axis contains the array at various powers)
    exponentiated = arr[:, None] ** powers
    # Reshape and return the array
    return exponentiated.reshape(arr.shape[0], -1)  # <== using the reshape function
The broadcast_cumprod_reshape function using the cumulative product (cumprod) and reshape is:
def broadcast_cumprod_reshape(arr, order):
    rows, cols = arr.shape
    # Create a 3D empty array where the middle dimension holds
    # the array at powers 0 through order
    out = np.empty((rows, order + 1, cols), dtype=arr.dtype)
    out[:, 0, :] = 1  # 0th power is always 1
    a = np.broadcast_to(arr[:, None], (rows, order, cols))
    # Cumulatively multiply so each multiplication produces the next power
    np.cumprod(a, axis=1, out=out[:, 1:, :])
    return out.reshape(rows, -1)
In a Jupyter notebook, I used the %timeit command and got these results:
Small arrays (2x2):
%timeit -n 100000 loop_based(arr, order)
7.41 µs ± 174 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_based_hstack(arr, order)
10.1 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_based_reshape(arr, order)
3.31 µs ± 61.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -n 100000 broadcast_cumprod_reshape(arr, order)
11 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Large arrays (100x100):
%timeit -n 1000 loop_based(arrLarge, orderLarge)
261 µs ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_based_hstack(arrLarge, orderLarge)
225 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_based_reshape(arrLarge, orderLarge)
223 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 broadcast_cumprod_reshape(arrLarge, orderLarge)
157 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Conclusions:
It seems that the broadcast based approach using reshape is faster for smaller arrays. However, for large arrays, the cumprod approach scales better and is faster.
Extend arrays to higher dims and let broadcasting do its magic with some help from reshaping -
In [16]: arr = np.array([[1, 2],[3,4]])
In [17]: order = 3
In [18]: (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
Out[18]:
array([[ 1,  1,  1,  2,  1,  4,  1,  8],
       [ 1,  1,  3,  4,  9, 16, 27, 64]])
Note that arr[:,None] is essentially arr[:,None,:], but we can skip the trailing ellipsis for brevity.
Timings on a bigger dataset -
In [40]: np.random.seed(0)
...: arr = np.random.randint(0,9,(100,100))
...: order = 10
# #hpaulj's soln with broadcasting and stacking
In [41]: %timeit np.hstack(arr **np.arange(order+1)[:,None,None])
1000 loops, best of 3: 734 µs per loop
In [42]: %timeit (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
1000 loops, best of 3: 401 µs per loop
That reshaping part is practically free, and that's where we gain performance here, along with the broadcasting part of course, as seen in the breakdown below -
In [52]: %timeit (arr[:,None]**np.arange(order+1)[:,None])
1000 loops, best of 3: 390 µs per loop
In [53]: %timeit (arr[:,None]**np.arange(order+1)[:,None]).reshape(arr.shape[0],-1)
1000 loops, best of 3: 401 µs per loop
Use broadcasting to generate the values, and reshape or rearrange the values as desired:
In [34]: arr **np.arange(4)[:,None,None]
Out[34]:
array([[[ 1,  1],
        [ 1,  1]],

       [[ 1,  2],
        [ 3,  4]],

       [[ 1,  4],
        [ 9, 16]],

       [[ 1,  8],
        [27, 64]]])
In [35]: np.hstack(_)
Out[35]:
array([[ 1,  1,  1,  2,  1,  4,  1,  8],
       [ 1,  1,  3,  4,  9, 16, 27, 64]])
Here is a solution using cumulative multiplication which scales better than power based approaches, especially if the input array is of float dtype:
import numpy as np
def f_mult(a, k):
    m, n = a.shape
    out = np.empty((m, k, n), dtype=a.dtype)
    out[:, 0, :] = 1
    a = np.broadcast_to(a[:, None], (m, k - 1, n))
    a.cumprod(axis=1, out=out[:, 1:])
    return out.reshape(m, -1)
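A quick correctness check against the small example from the question (my addition; the printed output follows from the definition above):
arr = np.array([[1, 2], [3, 4]])
print(f_mult(arr, 4))  # k = order + 1 = 4 powers: 0..3
# [[ 1  1  1  2  1  4  1  8]
#  [ 1  1  3  4  9 16 27 64]]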
Timings:
int up to power 9
divakar: 0.4342731796205044 ms
hpaulj: 0.794165057130158 ms
pp: 0.20520629966631532 ms
float up to power 39
divakar: 29.056487752124667 ms
hpaulj: 31.773792404681444 ms
pp: 1.0329263447783887 ms
Code for the timings, thanks @Divakar:
def f_divakar(a, k):
    return (a[:, None] ** np.arange(k)[:, None]).reshape(a.shape[0], -1)

def f_hpaulj(a, k):
    return np.hstack(a ** np.arange(k)[:, None, None])
from timeit import timeit
np.random.seed(0)
a = np.random.randint(0,9,(100,100))
k = 10
print('int up to power 9')
print('divakar:', timeit(lambda: f_divakar(a, k), number=1000), 'ms')
print('hpaulj: ', timeit(lambda: f_hpaulj(a, k), number=1000), 'ms')
print('pp: ', timeit(lambda: f_mult(a, k), number=1000), 'ms')
a = np.random.uniform(0.5,2.0,(100,100))
k = 40
print('float up to power 39')
print('divakar:', timeit(lambda: f_divakar(a, k), number=1000), 'ms')
print('hpaulj: ', timeit(lambda: f_hpaulj(a, k), number=1000), 'ms')
print('pp: ', timeit(lambda: f_mult(a, k), number=1000), 'ms')
You are creating a Vandermonde matrix with a reshape, so it is probably best to use numpy.vander to make it, and let someone else take care of the best algorithm.
This way your code is just (note increasing=True and order + 1 to get powers 0 through order, plus a transpose to group by power rather than by element):
np.vander(arr.ravel(), order + 1, increasing=True).reshape(*arr.shape, -1).transpose(0, 2, 1).reshape(arr.shape[0], -1)
That said, numpy.vander seems to use something like Paul Panzer's cumprod method under the hood, so it should scale well.
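As a quick check of the one-liner above (a sketch; the expected output is the arr_new from the question):
import numpy as np

arr = np.array([[1, 2], [3, 4]])
order = 3
v = np.vander(arr.ravel(), order + 1, increasing=True)  # per-element powers 0..order
out = v.reshape(*arr.shape, -1).transpose(0, 2, 1).reshape(arr.shape[0], -1)
print(out)
# [[ 1  1  1  2  1  4  1  8]
#  [ 1  1  3  4  9 16 27 64]]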