I have a very simple problem: in my Python toolbox I have to compute the values of polynomials (usually degree 2 or 3, seldom others, always integer degree) over a large vector (size >> 10^6). Storing the result in a buffer is not an option because I have several of these vectors, so I would quickly run out of memory, and I usually only have to compute each one once anyway. The performance of numpy.polyval is actually quite good, but it is still my bottleneck. Can I somehow make the evaluation of the polynomial faster?
Addendum
I think that the pure-numpy solution of Joe Kington is good for me, in particular because it avoids potential installation issues with other libraries or Cython. For those who asked, the numbers in the vector are large (order 10^4), so I don't think that the suggested approximations would work.
You actually can speed it up slightly by doing the operations in-place (or using numexpr or numba which will automatically do what I'm doing manually below).
numpy.polyval is a very short function. Leaving out a few type checks, etc, it amounts to:
def polyval(p, x):
    y = np.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y
The downside to this approach is that a temporary array will be created inside the loop as opposed to doing the operation in-place.
What I'm about to do is a micro-optimization and is only worthwhile for very large x inputs. Furthermore, we'll have to assume floating-point output instead of letting the upcasting rules determine the output's dtype. However, it will speed this up slightly and make it use less memory:
def faster_polyval(p, x):
    y = np.zeros(x.shape, dtype=float)
    for i, v in enumerate(p):
        y *= x
        y += v
    return y
As an example, let's say we have the following input:
# Third order polynomial
p = [4.5, 9.8, -9.2, 1.2]
# One-million element array
x = np.linspace(-10, 10, int(1e6))  # num must be an integer in current numpy
The results are identical:
In [3]: np_result = np.polyval(p, x)
In [4]: new_result = faster_polyval(p, x)
In [5]: np.allclose(np_result, new_result)
Out[5]: True
And we get a modest 2-3x speedup (which is mostly independent of array size, as it relates to memory allocation, not number of operations):
In [6]: %timeit np.polyval(p, x)
10 loops, best of 3: 20.7 ms per loop
In [7]: %timeit faster_polyval(p, x)
100 loops, best of 3: 7.46 ms per loop
For really huge inputs, the memory usage difference will matter more than the speed differences. The "bare" numpy version will use ~2x more memory at peak usage than the faster_polyval version.
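Since the answer mentions numexpr, here is a hedged sketch of what that could look like for the cubic from the example (the function name numexpr_polyval3 is mine; it assumes the optional numexpr dependency is installed, which the OP may prefer to avoid):
import numexpr as ne

def numexpr_polyval3(p, x):
    # Horner's scheme for a cubic, evaluated in a single numexpr pass;
    # numexpr compiles the expression and works in cache-sized blocks.
    c3, c2, c1, c0 = p
    return ne.evaluate("((c3 * x + c2) * x + c1) * x + c0")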
I ended up here when I wanted to know whether np.polyval or np.polynomial.polynomial.polyval is faster.
And it is interesting to see that simple implementations are faster, as @Joe Kington shows. (I had hoped for some optimisation by numpy.)
So here is my comparison with np.polynomial.polynomial.polyval, plus a slightly faster version.
def fastest_polyval(x, a):
    y = a[-1]
    for ai in a[-2::-1]:
        y *= x
        y += ai
    return y
It avoids the initial zeros array and needs one loop iteration fewer.
y_np = np.polyval(p, x)
y_faster = faster_polyval(p, x)
prev = 1 * p[::-1] # reverse coefficients
y_np2 = np.polynomial.polynomial.polyval(x, prev)
y_fastest = fastest_polyval(x, prev)
np.allclose(y_np, y_faster), np.allclose(y_np, y_np2), np.allclose(y_np, y_fastest)
# (True, True, True)
%timeit np.polyval(p, x)
%timeit faster_polyval(p, x)
%timeit np.polynomial.polynomial.polyval(x, prev)
%timeit fastest_polyval(x, prev)
# 6.51 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3.69 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6.28 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2.65 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
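For completeness, here is a hedged sketch of a Numba-compiled Horner loop (my addition, not part of the answers above); it assumes the coefficients are passed as a NumPy array in np.polyval order (highest power first), and the first call includes compilation time:
import numba as nb
import numpy as np

@nb.njit(cache=True)
def numba_polyval(p, x):
    # Horner's scheme with one scalar accumulator per element; no temporary arrays.
    y = np.empty_like(x)
    for i in range(x.shape[0]):
        acc = p[0]
        for j in range(1, p.shape[0]):
            acc = acc * x[i] + p[j]
        y[i] = acc
    return y

# usage: numba_polyval(np.asarray(p, dtype=np.float64), x)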
Given an NxN matrix W, I'm looking to calculate an NxN matrix C given by the equation in this link: https://i.stack.imgur.com/dY7rY.png, or in LaTeX
$$C_{ij} = \max_k \bigg\{ \sum_l \bigg( W_{ik}W_{kl}W_{lj} - W_{ik}W_{kj} \bigg) \bigg\}.$$
I have tried to implement this in PyTorch but I've either encountered memory problems by constructing an intermediate NxNxN 3D matrix which, for large N, causes my GPU to run out of memory, or used a for-loop over k which is then very slow. I can't work out how I can get round these. How might I implement this calculation, or an approximation of it, without a large intermediate matrix like this?
Suggestions, pseudocode in any language or an implementation in any of Python/Numpy/PyTorch would be much appreciated.
The formula can be simplified as
$$C_{ij} = \max_k \big( W_{ik} M_{kj} \big)$$
where
$$M = W W - N \, W,$$
with N the size of the matrix W and WW the usual matrix product.
Then, in the formula above, for every i, j there is an independent maximum to be computed. Without knowing further properties of W, it is in general not possible to further simplify the problem. So, after computing the matrix M, you can do a loop over i and j, and compute the maximum.
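A minimal NumPy sketch of this simplification (my illustration, not an optimized implementation): it loops over k so that only one N x N temporary exists at a time, instead of the full N x N x N tensor.
import numpy as np

def calc_numpy(W):
    # C_ij = max_k W_ik * M_kj  with  M = W @ W - N * W
    N = W.shape[0]
    M = W @ W - N * W
    C = np.full_like(W, -np.inf)
    for k in range(N):
        # outer(W[:, k], M[k, :]) is the k-th candidate matrix; keep the running maximum
        np.maximum(C, np.outer(W[:, k], M[k, :]), out=C)
    return C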
A first solution using Numba (you could do the same using Cython or plain C) is to formulate the problem using simple loops.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_1(W):
    C = np.empty_like(W)
    N = W.shape[0]
    for i in nb.prange(N):
        TMP = np.empty(N, dtype=W.dtype)
        for j in range(N):
            for k in range(N):
                acc = 0
                for l in range(N):
                    acc += W[i, k] * W[k, l] * W[l, j] - W[i, k] * W[k, j]
                TMP[k] = acc
            C[i, j] = np.max(TMP)
    return C
Francesco provided a simplification which scales far better for larger array sizes. This leads to the following, where I also optimized away a small temporary array.
@nb.njit(fastmath=True, parallel=True)
def calc_2(W):
    C = np.empty_like(W)
    N = W.shape[0]
    M = np.dot(W, W) - N * W
    for i in nb.prange(N):
        for j in range(N):
            val = W[i, 0] * M[0, j]
            for k in range(1, N):
                TMP = W[i, k] * M[k, j]
                if TMP > val:
                    val = TMP
            C[i, j] = val
    return C
This can be optimized further by partial loop unrolling and optimizing the array access. Some compilers may do this automatically.
@nb.njit(fastmath=True, parallel=True)
def calc_3(W):
    C = np.empty_like(W)
    N = W.shape[0]
    W = np.ascontiguousarray(W)
    M = np.dot(W.T, W.T) - W.shape[0] * W.T
    for i in nb.prange(N // 4):
        for j in range(N):
            val_1 = W[i * 4 + 0, 0] * M[j, 0]
            val_2 = W[i * 4 + 1, 0] * M[j, 0]
            val_3 = W[i * 4 + 2, 0] * M[j, 0]
            val_4 = W[i * 4 + 3, 0] * M[j, 0]
            for k in range(1, N):
                TMP_1 = W[i * 4 + 0, k] * M[j, k]
                TMP_2 = W[i * 4 + 1, k] * M[j, k]
                TMP_3 = W[i * 4 + 2, k] * M[j, k]
                TMP_4 = W[i * 4 + 3, k] * M[j, k]
                if TMP_1 > val_1:
                    val_1 = TMP_1
                if TMP_2 > val_2:
                    val_2 = TMP_2
                if TMP_3 > val_3:
                    val_3 = TMP_3
                if TMP_4 > val_4:
                    val_4 = TMP_4
            C[i * 4 + 0, j] = val_1
            C[i * 4 + 1, j] = val_2
            C[i * 4 + 2, j] = val_3
            C[i * 4 + 3, j] = val_4
    # Remainder
    for i in range(N // 4 * 4, N):
        for j in range(N):
            val = W[i, 0] * M[j, 0]
            for k in range(1, N):
                TMP = W[i, k] * M[j, k]
                if TMP > val:
                    val = TMP
            C[i, j] = val
    return C
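A quick sanity check (my addition) that the three variants agree on a small random matrix:
W = np.random.rand(50, 50)
assert np.allclose(calc_1(W), calc_2(W))
assert np.allclose(calc_1(W), calc_3(W))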
Timings
W=np.random.rand(100,100)
%timeit calc_1(W)
#16.8 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_2(W)
#449 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#259 µs ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
W=np.random.rand(2000,2000)
#Temporary array would be 64GB in this case
%timeit calc_2(W)
#5.37 s ± 174 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#596 ms ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have a 2D uint8 numpy array of shape (149797, 64). Each of the elements is either 0 or 1. I want to pack the binary values in each row into a uint64 value, so that I get a uint64 array of shape (149797,) as a result. I tried the following code using the numpy packbits function.
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
col_pack=np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
The packbits function takes about 10 ms to execute. A simple reshaping of this array by itself seems to take around 7 ms. I also tried iterating over the 2D numpy array using shift operations to achieve the same result, but there was no speed improvement.
Finally, I also want to compile it with Numba for the CPU.
@njit
def shifting(bitlist):
    x = np.zeros(149797, dtype=np.uint64)
    rows, cols = bitlist.shape
    for i in range(0, rows):
        out = 0
        for bit in range(0, cols):
            out = (out << 1) | bitlist[i][bit]  # if I comment out the bitlist access, time = 190 microseconds
        x[i] = np.uint64(out)  # time drops to microseconds if this line is commented out in njit
    return x
It takes about 6 ms using njit.
Here is the parallel njit version
@njit(parallel=True)
def shifting(bitlist):
    rows, cols = 149797, 64
    out = 0
    z = np.zeros(rows, dtype=np.uint64)
    for i in prange(rows):
        for bit in range(cols):
            z[i] = (z[i] * 2) + bitlist[i, bit]  # time becomes ~100 microseconds if I use the scalar 'out' instead of the z[i] array
    return z
It's slightly better, with a 3.24 ms execution time (Google Colab, dual-core 2.2 GHz).
Currently, the solution using byteswap (Paul's method) seems to be the best one, i.e. 1.74 ms.
How can we further speed up this conversion? Is there scope for using any vectorization (or parallelization), bitarrays, etc., to achieve a speedup?
Ref: numpy packbits pack to uint16 array
On a 12-core machine (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50 GHz):
Paul's method: 1595.0 microseconds (it does not use multiple cores, I suppose)
Numba code: 146.0 microseconds (the aforementioned parallel Numba version)
i.e. around a 10x speedup!
You can get a sizeable speedup by using byteswap instead of reshaping etc.:
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)
np.packbits(test).view(np.uint64).byteswap()
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)
timeit(lambda:np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64),number=100)
# 1.1054180909413844
timeit(lambda:np.packbits(test).view(np.uint64).byteswap(),number=100)
# 0.18370431219227612
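The reason byteswap is needed: packbits fills each byte most-significant-bit first, but viewing eight consecutive bytes as one uint64 on a little-endian machine treats the first byte as the least significant. A small sanity check (my addition, assumes a little-endian platform):
bits = np.random.randint(0, 2, (4, 64), dtype=np.uint8)
a = np.packbits(bits.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
b = np.packbits(bits).view(np.uint64).byteswap()
assert np.array_equal(a, b)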
A somewhat faster Numba solution (version 0.46 / Windows).
Code
import numpy as np
import numba as nb

# with memory allocation
@nb.njit(parallel=True)
def shifting(bitlist):
    assert bitlist.shape[1] == 64
    x = np.empty(bitlist.shape[0], dtype=np.uint64)
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x

# without memory allocation
@nb.njit(parallel=True)
def shifting_2(bitlist, x):
    assert bitlist.shape[1] == 64
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x
Timings
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
#If you call this function multiple times, only allocating memory
#once may be enough
x=np.empty(test.shape[0],dtype=np.uint64)
#Warmup first call takes significantly longer
res=shifting(test)
res=shifting_2(test,x)
%timeit res=shifting(test)
#976 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=shifting_2(test,x)
#764 µs ± 63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.packbits(test).view(np.uint64).byteswap()
#8.07 ms ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
#17.9 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I am trying to optimize some code that uses logs (the mathematical kind, not the timestamp record kind :)) and I found something strange that I haven't been able to find any answers for online. We have log(a/b) = log(a) - log(b), so I have written some code to compare the performance of the two methods.
import numpy as np
import numba as nb

# create some large random walk data
x = np.random.normal(0, 0.1, int(1e7))
x = abs(x.min()) + 100 + x  # make all values >= 100

@nb.njit
def subtract_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t]) - np.log(arr[t - tau])
    return None

@nb.njit
def divide_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t] / arr[t - tau])
    return None
%timeit subtract_log(x, 100)
>>> 252 ns ± 0.319 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit divide_log(x, 100)
>>> 5.57 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So we see that subtracting logs is ~20,000 times faster than taking the log of the ratio. I find this strange because I would have thought that in subtracting logs, the log series approximation would have to be calculated twice. But perhaps it's something to do with how numpy handles the operations?
The above example is trivial as we don't do anything with the result of the calculation. Below is a more realistic example where we return the result of the calculation.
@nb.njit
def subtract_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    out = np.empty(arr.shape[0] - tau)
    for t in range(tau, arr.shape[0]):
        f = t - tau
        out[f] = np.log(arr[t]) - np.log(arr[f])
    return out

@nb.njit
def divide_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    out = np.empty(arr.shape[0] - tau)
    for t in range(tau, arr.shape[0]):
        f = t - tau
        out[f] = np.log(arr[t] / arr[f])
    return out
out1 = subtract_log(x, 100)
out2 = divide_log(x, 100)
np.testing.assert_allclose(out1, out2, atol=1e-8) # True
%timeit subtract_log(x, 100)
>>> 129 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit divide_log(x, 100)
>>> 93.4 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now we see the times are the same order of magnitude, but subtracting logs is some 40% slower than dividing.
Can anyone explain these discrepancies?
Why is subtracting logs so much faster than dividing logs for the trivial case?
Why is subtracting logs 40% slower than dividing logs when we store the value in an array? I know there is significant setup cost in initializing an array with np.empty(): initializing the array in subtract_log() in the trivial case, without even storing values in it, brings the time up from 252 ns to 311 µs.
Don't measure "useless" things; a compiler may optimize them away completely.
If you turn off the division-by-zero check (error_model="numpy"), both functions take about 280 ns, not because of fast calculation, but because they are actually doing nothing.
Optimizing away useless calculations is expected, but sometimes LLVM can't detect all of them.
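For reference, a sketch of what turning off that check looks like (my addition; error_model="numpy" is the relevant njit option):
@nb.njit(error_model="numpy")
def divide_log_noop(arr, tau):
    """Same dead loop as above; with the numpy error model LLVM can remove it entirely."""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t] / arr[t - tau])
    return None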
In the second case you are comparing the runtime of two logarithms to that of one logarithm and one division (the subtractions/additions as well as the multiplications are a lot faster). There can be differences in calculation time, depending on the log implementation and the processor. But also have a look at the results: they are not exactly the same.
At least for a float64 division (FDIV), you can have a look at the instruction tables from Agner Fog.
Not sure I titled this well, but basically I have a reference coordinate, in the format of (x,y,z), and a large list/array of coordinates also in that format. I need to get the euclidean distance between each, so with numpy and scipy in theory I should be able to do an operation such as:
import numpy, scipy.spatial.distance
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
distances = scipy.spatial.distance.euclidean(b, a)
But instead of getting an array back I get an error: ValueError: Input vector should be 1-D.
Not sure how to resolve this error and get what I want without having to resort to loops and such, which sort of defeats the purpose of using Numpy.
Long term I want to use these distances to calculate truth masks for counting distance values in bins.
I'm not sure if I'm just using the function wrong or using the wrong function, I haven't been able to find anything in the documentation that would work better.
The documentation of scipy.spatial.distance.euclidean states that only 1-D vectors are allowed as inputs. Thus you must loop over your arrays, like:
distances = np.empty(b.shape[0])
for i in range(b.shape[0]):
    distances[i] = scipy.spatial.distance.euclidean(a, b[i])
If you want to have a vectorized implementation, you need to write your own function. Perhaps using np.vectorize with a correct signature will also work, but this is in fact also just a short-hand for a for-loop and will thus have the same performance as a simple for-loop.
As stated in my comment to hannes wittingham's solution, I'll post a one-liner which is focussing on performance:
distances = ((b - a)**2).sum(axis=1)**0.5
Writing out all the calculations reduces the number of separate function calls and thus assignments of the intermediate results to new arrays. Thus it is about 22% faster than the solution of hannes wittingham for an array shape of b.shape == (20, 3) and about 5% faster for an array shape of b.shape == (20000, 3):
a = np.array([1, 1, 1,])
b = np.random.rand(20, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 5.37 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit euclidean_distances(a, b)
# 6.89 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
b = np.random.rand(20000, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 588 µs ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distances(a, b)
# 616 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But you are losing the flexibility of being able to easily change the distance calculation routine. When using the scipy.spatial.distance module, you can change the calculation routine by simply calling another method.
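For instance, with the loop-based version above, switching to a Manhattan distance is just a different function name (a sketch using the same a and b as before):
import scipy.spatial.distance

distances = np.empty(b.shape[0])
for i in range(b.shape[0]):
    distances[i] = scipy.spatial.distance.cityblock(a, b[i])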
To improve the calculation performance even further, you can use a jit (just in time) compiler like numba for your functions:
import numba as nb

@nb.njit
def euc(a, b):
    return ((b - a)**2).sum(axis=1)**0.5
This reduces the time needed for the calculations by about 70% for small arrays and by about 60% for large arrays. Unfortunately, the axis keyword for np.linalg.norm is not yet supported by numba.
It's not actually too hard to write your own function to do this - here's mine, which you're welcome to use.
If you are carrying out this operation over a large number of points and speed matters, I would guess this function will beat a for-loop based solution for speed by a long way - numpy is designed to be efficient when carrying out operations on a whole matrix.
import numpy
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
def euclidean_distances(ref_point, co_ords_array):
    diffs = co_ords_array - ref_point
    sqrd_diffs = numpy.square(diffs)
    sum_sqrd_diffs = numpy.sum(sqrd_diffs, axis=1)
    euc_dists = numpy.sqrt(sum_sqrd_diffs)
    return euc_dists
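Called with the arrays defined above, it returns one distance per row of b:
distances = euclidean_distances(a, b)  # shape (20,)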
This code will get the euclidean norm which should work in many cases, and is fairly quick, and one line. Other methods are more efficient or flexible depending on the needs, and I would favour some of the other solutions posted depending on the work being done.
import numpy
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
distances = numpy.linalg.norm(a - b, axis = 1)
Note the extra set of [] in the definition of a
import numpy, scipy.spatial.distance
a = numpy.array([[1,1,1]])
b = numpy.random.rand(20,3)
distances = scipy.spatial.distance.cdist(b, a, metric='euclidean')
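Note that cdist always returns a 2-D result, so distances here has shape (20, 1); flatten it if a 1-D vector is more convenient (my addition):
distances = scipy.spatial.distance.cdist(b, a, metric='euclidean').ravel()  # shape (20,)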
This question is a follow-up to my answer in Efficient way to compute the Vandermonde matrix.
Here's the setup:
x = np.arange(5000) # an integer array
N = 4
Now, I'll compute the Vandermonde matrix in two different ways:
m1 = (x ** np.arange(N)[:, None]).T
And,
m2 = x[:, None] ** np.arange(N)
Sanity check:
np.array_equal(m1, m2)
True
These methods are identical, but their performance is not:
%timeit m1 = (x ** np.arange(N)[:, None]).T
42.7 µs ± 271 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit m2 = x[:, None] ** np.arange(N)
150 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, the first method, despite requiring a transposition at the end, is still over 3X faster than the second method.
The only difference is that in the first case, the smaller array is broadcasted, whereas with the second case, it is the larger.
So, with a fairly decent understanding of how numpy works, I can guess that the answer
would involve the cache. The first method is a lot more cache friendly
than the second. However, I'd like an official word from someone with
more experience than me.
What could be the reason for this stark contrast in timings?
I too tried to look at broadcast_arrays:
In [121]: X,Y = np.broadcast_arrays(np.arange(4)[:,None], np.arange(1000))
In [122]: timeit X+Y
10.1 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [123]: X,Y = np.broadcast_arrays(np.arange(1000)[:,None], np.arange(4))
In [124]: timeit X+Y
26.1 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [125]: X.shape, X.strides
Out[125]: ((1000, 4), (4, 0))
In [126]: Y.shape, Y.strides
Out[126]: ((1000, 4), (0, 4))
np.ascontiguousarray converts the 0 strided dimensions into full ones
In [132]: Y1 = np.ascontiguousarray(Y)
In [134]: Y1.strides
Out[134]: (16, 4)
In [135]: X1 = np.ascontiguousarray(X)
In [136]: X1.shape
Out[136]: (1000, 4)
Operating with the full arrays is faster:
In [137]: timeit X1+Y1
4.66 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So there's some sort of time penalty to using the 0-strided arrays, even if it doesn't explicitly expand the arrays first. And cost is tied to the shapes, and possibly which dimension is expanded.
I'm not convinced caching is really the single most influential factor here.
I'm also not a trained computer scientist, so I may well be wrong, but let me walk you through a couple of observations.
For simplicity I'm using @hpaulj's observation that '+' shows essentially the same effect as '**'.
My working hypothesis is that it is the overhead of the outer loops, which I believe is substantially more expensive than the contiguous, vectorizable innermost loops.
So let us first minimize the amount of data that repeats, so caching is unlikely to have much impact:
>>> from timeit import repeat
>>> import numpy as np
>>>
>>> def mock_data(k, N, M):
...     x = list(np.random.randint(0, 10000, (k, N, M)))
...     y = list(np.random.randint(0, 10000, (k, M)))
...     z = list(np.random.randint(0, 10000, (k, N, 1)))
...     return x, y, z
...
>>> k, N, M = 500, 5000, 4
>>>
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.017986663966439664, 0.018148145987652242, 0.018077059998176992]
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.026680009090341628, 0.026304758968763053, 0.02680662798229605]
Here both scenarios have contiguous data and the same number of additions, but the version with 5000 outer iterations is substantially slower. When we bring back caching, albeit only across trials, the absolute difference stays roughly the same but the ratio becomes even more pronounced:
>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.011324503924697638, 0.011121788993477821, 0.01106808998156339]
>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.020170683041214943, 0.0202067659702152, 0.020624138065613806]
Returning to the original "outer sum" scenario, we see that the non-contiguous long-dimension case does even worse. Since we have to read no more data than in the contiguous scenario, this cannot be explained by data not being cached.
>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.013918839977122843, 0.01390116906259209, 0.013737019035033882]
>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.0335254140663892, 0.03351909795310348, 0.0335453050211072]
Further, both profit from across-trial caching:
>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.012061356916092336, 0.012182610924355686, 0.012071475037373602]
>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.03265167598146945, 0.03277428599540144, 0.03247103898320347]
From a cachist's point of view this is inconclusive at best.
So let's have a look at the source.
After building a current NumPy from the tarball you'll find, somewhere in the tree, almost 15000 lines worth of computer-generated code in a file called 'loops.c'. These loops are the innermost loops of ufuncs; the most relevant bit for our situation appears to be:
#define BINARY_LOOP\
char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
npy_intp n = dimensions[0];\
npy_intp i;\
for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)
/*
* loop with contiguous specialization
* op should be the code working on `tin in1`, `tin in2` and
* storing the result in `tout * out`
* combine with NPY_GCC_OPT_3 to allow autovectorization
* should only be used where its worthwhile to avoid code bloat
*/
#define BASE_BINARY_LOOP(tin, tout, op) \
BINARY_LOOP { \
const tin in1 = *(tin *)ip1; \
const tin in2 = *(tin *)ip2; \
tout * out = (tout *)op1; \
op; \
}
etc.
The payload in our case seems lean enough, especially if I interpret the comment about contiguous specialization and autovectorization correctly. Now, if we do only 4 iterations per inner loop, the overhead-to-payload ratio starts to look a bit troubling, and it doesn't end there.
In the file ufunc_object.c we find the following snippet
/*
* If no trivial loop matched, an iterator is required to
* resolve broadcasting, etc
*/
NPY_UF_DBG_PRINT("iterator loop\n");
if (iterator_loop(ufunc, op, dtypes, order,
                  buffersize, arr_prep, arr_prep_args,
                  innerloop, innerloopdata) < 0) {
    return -1;
}
return 0;
the actual loop looks like
NPY_BEGIN_THREADS_NDITER(iter);
/* Execute the loop */
do {
    NPY_UF_DBG_PRINT1("iterator loop count %d\n", (int)*count_ptr);
    innerloop(dataptr, count_ptr, stride, innerloopdata);
} while (iternext(iter));
NPY_END_THREADS;
innerloop is the inner loop we looked at above. How much overhead comes with iternext?
For this we need to turn to file nditer_templ.c.src where we find
/*NUMPY_API
* Compute the specialized iteration function for an iterator
*
* If errmsg is non-NULL, it should point to a variable which will
* receive the error message, and no Python exception will be set.
* This is so that the function can be called from code not holding
* the GIL.
*/
NPY_NO_EXPORT NpyIter_IterNextFunc *
NpyIter_GetIterNext(NpyIter *iter, char **errmsg)
{
etc.
This function returns a function pointer to one of the things the preprocessing makes of
/* Specialized iternext (#const_itflags#,#tag_ndim#,#tag_nop#) */
static int
npyiter_iternext_itflags#tag_itflags#_dims#tag_ndim#_iters#tag_nop#(
NpyIter *iter)
{
etc.
Parsing this is beyond me, but in any case it is a function pointer that must be called at every iteration of the outer loop, and as far as I know function pointers cannot be inlined, so compared to 4 iterations of a trivial loop body this will be substantial.
I should probably profile this but my skills are insufficient.
While I'm afraid my conclusion won't be more fundamental than yours ("probably caching"), I believe I can help focus our attention with a set of more localized tests.
Consider your example problem:
M,N = 5000,4
x1 = np.arange(M)
y1 = np.arange(N)[:,None]
x2 = np.arange(M)[:,None]
y2 = np.arange(N)
x1_bc,y1_bc = np.broadcast_arrays(x1,y1)
x2_bc,y2_bc = np.broadcast_arrays(x2,y2)
x1_cont,y1_cont,x2_cont,y2_cont = map(np.ascontiguousarray,
[x1_bc,y1_bc,x2_bc,y2_bc])
As you can see, I defined a bunch of arrays to compare. x1, y1 and x2, y2, respectively, correspond to your original test cases. ??_bc correspond to explicitly broadcast versions of these arrays. These share the data with the original ones, but they have explicit 0-strides in order to get the appropriate shape. Finally, ??_cont are contiguous versions of these broadcast arrays, as if constructed with np.tile.
So both x1_bc, y1_bc, x1_cont and y1_cont have shape (4, 5000), but while the former two have zero-strides, the latter two are contiguous arrays. For all intents and purposes taking the power of any of these corresponding pairs of arrays should give us the same contiguous result (as hpaulj noted in a comment, a transposition itself is essentially for free, so I'm going to ignore that outermost transpose in the following).
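Checking the strides makes this concrete (my addition; the exact byte counts depend on the platform's default integer dtype):
print(x1_bc.strides, y1_bc.strides)      # one stride is 0 in each broadcast view
print(x1_cont.strides, y1_cont.strides)  # the contiguous copies have full strides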
Here are the timings corresponding to your original check:
In [143]: %timeit x1 ** y1
...: %timeit x2 ** y2
...:
52.2 µs ± 707 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
96 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Here are the timings for the explicitly broadcast arrays:
In [144]: %timeit x1_bc ** y1_bc
...: %timeit x2_bc ** y2_bc
...:
54.1 µs ± 906 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
99.1 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Same thing. This tells me that the discrepancy isn't somehow due to the transition from the indexed expressions to the broadcast arrays. This was mostly expected, but it never hurts to check.
Finally, the contiguous arrays:
In [146]: %timeit x1_cont ** y1_cont
...: %timeit x2_cont ** y2_cont
...:
38.9 µs ± 529 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.6 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
A huge part of the discrepancy goes away!
So why did I check this? There is a general rule of thumb that you can make good use of CPU caching if you apply vectorized operations along large trailing dimensions in Python. To be more specific, for row-major ("C-order") arrays the trailing dimensions are contiguous, while for column-major ("Fortran-order") arrays the leading dimensions are. For large enough dimensions arr.sum(axis=-1) should be faster than arr.sum(axis=0) for row-major numpy arrays, give or take some fine print.
What happens here is that there is a huge difference between the two dimensions (size 4 and 5000, respectively), but the huge performance asymmetry between the two transposed cases only happens for the broadcasting case. My admittedly handwaving impression is that broadcasting uses 0-strides to construct views of appropriate size. These 0-strides imply that in the faster case memory access looks like this for the long x array:
[mem0,mem1,mem2,...,mem4999, mem0,mem1,mem2,...,mem4999, ...] # and so on
where mem* just denotes a float64 value of x sitting somewhere in RAM. Compare this to the slower case where we're working with shape (5000,4):
[mem0,mem0,mem0,mem0, mem1,mem1,mem1,mem1, mem2,mem2,mem2,mem2, ...]
My naive notion is that working with the former allows the CPU to cache larger chunks of the individual values of x at a time, so performance is great. In the latter case the 0-strides make the CPU hop around on the same memory address of x four times each, doing this 5000 times. I find it reasonable to believe that this setup works against caching, leading to overall bad performance. This would also be in agreement with the fact that the contiguous cases don't show this performance hit: there the CPU has to work with all 5000*4 unique float64 values, and caching might not be impeded by these weird reads.