This question already has answers here:
How to repeat elements of an array along two axes?
(5 answers)
Closed 4 years ago.
Note on the duplicate message:
Similar themes, but not exactly a duplicate, especially since the loop is still the fastest method. Thanks.
Goal:
Upscale an array from [small, small] to [big, big] by a factor, quickly, without using an image library. The scaling is very simple: one small value becomes several big values, normalized over the number of big values it turns into. In other words, this is "flux conserving", to borrow the astronomical term: a value of 16 in the small array, spread across 4 values of the big array (factor of 2), becomes four 4's, so the total amount of the value is retained.
Problem:
I have some working code that does the upscaling, but it isn't very fast compared to downscaling. Upscaling is actually easier than downscaling (which, in this basic case, requires many sums): upscaling just puts already-known data into big chunks of a preallocated array.
For a working example, a [2,2] array of [16,24;8,16]:
16 , 24
8 , 16
Multiplied by a factor of 2 for a [4,4] array would have the values:
4 , 4 , 6 , 6
4 , 4 , 6 , 6
2 , 2 , 4 , 4
2 , 2 , 4 , 4
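To make the transformation concrete, here is a minimal sketch of that example using np.kron (just an illustration of the desired result, not one of the timed implementations below):
import numpy as np

small = np.array([[16., 24.],
                  [8., 16.]])
factor = 2
# replicate each value into a factor x factor block, then normalize by the block size
big = np.kron(small, np.ones((factor, factor))) / factor**2  # equals the 4x4 array above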
The fastest implementation is a for loop accelerated by numba's jit & prange. I'd like to better leverage Numpy's pre-compiled functions to get this job done. I'll also entertain Scipy stuff - but not its resizing functions.
It seems like a perfect problem for strong matrix manipulation functions, but I just haven't managed to make it happen quickly.
Additionally, the single-line numpy call is pretty funky, so don't be surprised; it's what it took to get the output to align correctly.
Code examples:
Check the more optimized calls below. Be warned: the case I have here makes a 20480x20480 float64 array, which can take up a fair bit of memory - but it also shows whether a method is too memory intensive (as matrix approaches can be).
Environment: Python 3, Windows, i5-4960K @ 4.5 GHz. Time to run the for-loop code is ~18.9 s, and the numpy code ~52.5 s, on the shown examples.
% MAIN: To run these
import timeit
timeitSetup = '''
from Regridder1 import Regridder1
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 1: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder1(inArray, factor,)", number = 10) ));
timeitSetup = '''
from Regridder2 import Regridder2
import numpy as np
factor = 10;
inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';
print("Time to run 2: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder2(inArray, factor,)", number = 10) ));
% FUN: Regridder 1 - for loop
import numpy as np
from numba import prange, jit

@jit(nogil=True)
def Regridder1(inArray, factor):
    inSize = np.shape(inArray)
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]
    outBlockSize = factor*factor  # the number of outArray pixels each inArray pixel is spread across
    outArray = np.zeros(outSize)  # preallocate
    outBlocks = inArray/outBlockSize  # precalc the resized blocks to go faster
    for i in prange(0, inSize[0]):
        for j in prange(0, inSize[1]):
            outArray[i*factor:(i*factor+factor), j*factor:(j*factor+factor)] = outBlocks[i, j]  # puts the normalized value in a block of places
    return outArray
% FUN: Regridder 2 - numpy
import numpy as np

def Regridder2(inArray, factor):
    inSize = np.shape(inArray)
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]
    outBlockSize = factor*factor  # the number of outArray pixels each inArray pixel is spread across
    outArray = inArray.repeat(factor).reshape(inSize[0], factor*inSize[1]).T.repeat(factor).reshape(inSize[0]*factor, inSize[1]*factor).T/outBlockSize
    return outArray
Would greatly appreciate insight into speeding this up. Hopefully the code is good; I formulated it in the text box.
Current best solution:
On my computer, the numba jit for-loop implementation (Regridder1), with jit applied only to the part that needs it, runs the timeit test in 18.0 s, while the numpy-only implementation (Regridder2) runs it in 18.5 s. The bonus is that, on the first call, the numpy-only implementation doesn't need to wait for jit to compile the code. jit's cache=True keeps it from recompiling on subsequent runs, and the other options (nogil, nopython, prange) don't seem to help, but they don't seem to hurt either. Maybe future numba updates will do better.
For simplicity and portability, Regridder2 is the best option: it's nearly as fast and doesn't need numba installed (which my Anaconda install required me to go get separately).
% FUN: Regridder 1 - for loop
import numpy as np

def Regridder1(inArray, factor):
    inSize = np.shape(inArray)
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]
    outBlockSize = factor*factor  # the number of outArray pixels each inArray pixel is spread across
    outArray = np.empty(outSize)  # preallocate
    outBlocks = inArray/outBlockSize  # precalc the resized blocks to go faster
    factor = np.int64(factor)  # convert to an integer to be safe (in case it's a 1.0 float)
    outArray = RegridderUpscale(inSize, factor, outArray, outBlocks)  # call a function that holds just the loop
    return outArray
# END def Regridder1

from numba import jit, prange

@jit(nogil=True, nopython=True, cache=True)  # nopython=True, nogil=True, parallel=True, cache=True
def RegridderUpscale(inSize, factor, outArray, outBlocks):
    # scales the original data up; note that in other languages you need i*factor+factor-1 because of how slicing works
    for i in prange(0, inSize[0]):
        for j in prange(0, inSize[1]):
            outArray[i*factor:(i*factor+factor), j*factor:(j*factor+factor)] = outBlocks[i, j]
        # END for j
    # END for i
    return outArray  # return success
# END def RegridderUpscale
% FUN: Regridder 2 - numpy, based on @ZisIsNotZis's answer
import numpy as np

def Regridder2(inArray, factor):
    inSize = np.shape(inArray)
    #outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]  # whoops
    outBlockSize = factor*factor  # the number of outArray pixels each inArray pixel is spread across
    outArray = np.broadcast_to(inArray[:, None, :, None]/outBlockSize, (inSize[0], factor, inSize[1], factor)).reshape(np.int64(factor*inSize[0]), np.int64(factor*inSize[1]))  # single line call that gets the job done
    return outArray
# END def Regridder2
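As a quick sanity check, this reproduces the 2x2 to 4x4 example from the top of the question:
print(Regridder2(np.array([[16., 24.], [8., 16.]]), 2))
# [[4. 4. 6. 6.]
#  [4. 4. 6. 6.]
#  [2. 2. 4. 4.]
#  [2. 2. 4. 4.]]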
I did some benchmarks about this using a 512x512 byte image (10x upscale):
a = np.empty((512, 512), 'B')
Repeat Twice
>>> %timeit a.repeat(10, 0).repeat(10, 1)
127 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Repeat Once + Reshape
>>> %timeit a.repeat(100).reshape(512, 512, 10, 10).swapaxes(1, 2).reshape(5120, 5120)
150 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The two methods above both copy the data twice, while the two methods below copy it only once.
Fancy Indexing
Since t can be pre-computed and reused, it is not timed.
>>> t = np.arange(512, dtype='B').repeat(10)
>>> %timeit a[t[:,None], t]
143 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Viewing + Reshape
>>> %timeit np.broadcast_to(a[:,None,:,None], (512, 10, 512, 10)).reshape(5120, 5120)
29.6 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
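A quick way to confirm the single-copy claim (a minimal check on top of the snippet above, not part of the timings):
v = np.broadcast_to(a[:, None, :, None], (512, 10, 512, 10))
print(v.flags['OWNDATA'])    # False: broadcast_to only builds a read-only view, no data copied
out = v.reshape(5120, 5120)  # the single copy happens here, since the broadcast view is not contiguous
print(out.flags['OWNDATA'])  # True: the reshaped result owns freshly copied data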
It seems that viewing + reshape wins (at least on my machine). The test results on a 2048x2048 byte image are the following (same four methods, same order), and view + reshape still wins:
2.04 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.4 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
424 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
while the results for a 2048x2048 float64 image (same order) are:
3.14 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.07 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.56 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
so even though the itemsize is 8 times larger, it didn't take much more time.
Some new functions which show that the order of operations is important:
import numpy as np
from numba import jit

A = np.random.rand(2048, 2048)

@jit
def reg1(A, factor):
    factor2 = factor**2
    a, b = [factor*s for s in A.shape]
    B = np.empty((a, b), A.dtype)
    Bf = B.ravel()
    k = 0
    for i in range(A.shape[0]):
        Ai = A[i]
        for _ in range(factor):
            for j in range(A.shape[1]):
                x = Ai[j]/factor2
                for _ in range(factor):
                    Bf[k] = x
                    k += 1
    return B

def reg2(A, factor):
    return np.repeat(np.repeat(A/factor**2, factor, 0), factor, 1)

def reg3(A, factor):
    return np.repeat(np.repeat(A/factor**2, factor, 1), factor, 0)

def reg4(A, factor):
    shx, shy = A.shape
    stx, sty = A.strides
    B = np.broadcast_to((A/factor**2).reshape(shx, 1, shy, 1),
                        shape=(shx, factor, shy, factor))
    return B.reshape(shx*factor, shy*factor)
And runs:
In [47]: %timeit _=Regridder1(A,5)
672 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [48]: %timeit _=reg1(A,5)
522 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [49]: %timeit _=reg2(A,5)
1.23 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [50]: %timeit _=reg3(A,5)
782 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit _=reg4(A,5)
860 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
Related
I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:
large_dimension = 1000
a = np.random.random((1,))
b = np.random.random((1, large_dimension))
c = np.matmul(a, b)
In other words, multiplying matrix b with the scalar a[0].
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested for performance of the two trivial ways to do this, and they are practically equivalent:
%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is there a more efficient way to compute this?
Note: I cannot move these computations to a GPU since the program is using multiprocessing and many such computations are done in parallel.
large_dimension = 1000
a = np.random.random((1,))
B = np.random.random((1, large_dimension))
%timeit np.matmul(a, B)
5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a[0] * B
5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Use just float
%timeit float(a[0]) * B
3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid memory allocation, use a "buffer":
buffer = np.empty_like(B)
%timeit np.multiply(float(a[0]), B, buffer)
2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid the repeated attribute lookup, use an "alias":
mul = np.multiply
%timeit mul(float(a[0]), B, buffer)
2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And I don't recommend using numpy scalars at all; if you avoid them, computation will be faster:
a_float = float(a[0])
%timeit mul(a_float, B, buffer)
1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Furthermore, if possible, initialize the buffer outside the loop once (assuming, of course, that you have something like a loop):
rng = range(1000)
%%timeit
for i in rng:
pass
24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for i in rng:
mul(a_float, B, buffer)
1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, subtracting the empty-loop overhead (24.4 µs ≈ 0.02 ms):
best_iteration_time = (1.91 - 0.02) ms / 1000 ≈ 1.89 µs
speedup = 5.43 µs / 1.89 µs ≈ 2.87
In this case it is probably faster to work with an element-wise multiplication, but the time you see is mostly the overhead of Numpy (calling C functions from the CPython interpreter, wrapping/unwrapping types, making checks, doing the operation, array allocations, etc.).
since this operation is repeated millions of times
This is the problem. Indeed, the CPython interpreter is very bad at doing things with low latency. This is especially true when you work with Numpy types, as calling C code and performing checks for a trivial operation is much slower than doing it in pure Python, which is itself much slower than compiled native C/C++ code. If you really need this, and you cannot vectorize your code using Numpy (because you have a loop iterating over timesteps), then you need to move away from CPython, or at least away from pure Python code. Instead, you can use Numba or Cython to mitigate the cost of the C calls, type wrapping, etc. If this is not enough, then you will need to write native C/C++ code (or any similar language), unless you find a dedicated Python package that does exactly that for you. Note that Numba is fast only when it works on native types or Numpy arrays (containing native types). If you work with a lot of pure Python types and you do not want to rewrite your code, you can try the PyPy JIT.
Here is a simple example in Numba, written specifically for your case, that avoids the (costly) creation/allocation of a new array as well as many Numpy internal checks and calls:
import numba as nb

@nb.njit('void(float64[::1],float64[:,::1],float64[:,::1])')
def fastMul(a, b, out):
    val = a[0]
    for i in range(b.shape[1]):
        out[0, i] = b[0, i] * val

res = np.empty(b.shape, dtype=b.dtype)
%timeit fastMul(a, b, res)
# 397 ns ± 0.587 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
At the time of writing, this solution is faster than all the others. As most of the remaining time is spent calling Numba and performing internal checks, using Numba directly for the function containing the iteration loop should result in even faster code.
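For example, a sketch of that suggestion (the signature and the repetition count are illustrative assumptions, not taken from the question):
import numba as nb

@nb.njit('void(float64[::1], float64[:,::1], float64[:,::1], int64)')
def repeatedMul(a, b, out, repetitions):
    # compile the whole repeated loop, so the per-call overhead is paid only once
    val = a[0]
    for _ in range(repetitions):
        for i in range(b.shape[1]):
            out[0, i] = b[0, i] * val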
import numpy as np
import numba

def matmult_numpy(matrix, c):
    return np.matmul(c, matrix)

@numba.jit(nopython=True)
def matmult_numba(matrix, c):
    return c*matrix

if __name__ == "__main__":
    large_dimension = 1000
    a = np.random.random((1, large_dimension))
    c = np.random.random((1,))
About a factor of 3 speedup using Numba. Numba cognoscenti may be able to do better by explicitly casting the parameter c as a scalar.
Check the timings:
%timeit matmult_numpy(a, c)
2.32 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit matmult_numba(a, c)
763 ns ± 6.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
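A sketch of that idea, passing the scalar c[0] instead of the length-1 array so Numba works on a plain float (a hypothetical variant, with numpy and numba imported as above):
@numba.jit(nopython=True)
def matmult_numba_scalar(matrix, c_scalar):
    return c_scalar * matrix

# usage, with a and c from the snippet above:
# matmult_numba_scalar(a, c[0])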
I'm doing some vectorized algebra using numpy and the wall-clock performance of my algorithm seems weird. The program does roughly as follows:
Create three matrices: Y (KxD), X (NxD), T (KxN)
For each row of Y:
subtract Y[i] from each row of X (by broadcasting),
square the differences along one axis, sum them, take a square root, then store in T.
However, depending on how I perform the broadcasting, computation speed is vastly different. Consider the code:
import numpy as np
from time import perf_counter

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

if True:  # negate to enable the second loop
    time = 0.0
    for i in range(100):
        start = perf_counter()
        for i in range(K):
            T[i] = np.sqrt(np.sum(
                np.square(
                    X - Y[i]  # this has dimensions NxD
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast in line: {:.3f} s".format(time / 100))
    exit()

if True:
    time = 0.0
    for i in range(100):
        start = perf_counter()
        for i in range(K):
            diff = X - Y[i]
            T[i] = np.sqrt(np.sum(
                np.square(
                    diff
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast out: {:.3f} s".format(time / 100))
    exit()
Times for each loop are measured individually and averaged over 100 executions. The results:
Broadcast in line: 1.504 s
Broadcast out: 0.438 s
The only difference is that broadcasting and subtraction in the first loop is done in-line, while in the second approach I do it before any vectorized operations. Why is this making such a difference?
My system configuration:
Lenovo ThinkStation P920, 2x Xeon Silver 4110, 64 GB RAM
Xubuntu 18.04.2 LTS (bionic)
Python 3.7.3 (GCC 7.3.0)
Numpy 1.16.3 linked against OpenBLAS (that's as much as np.__config__.show() tells me)
PS: Yes I am aware this could be further optimized, but right now I would like to understand what happens under the hood here.
It's not a broadcasting problem
I also added an optimized solution to see how long the actual calculation takes without the large overhead of memory allocation and deallocation.
Functions
import numpy as np
import numba as nb

def func_1(X, Y, T):
    for i in range(K):
        T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))
    return T

def func_2(X, Y, T):
    for i in range(K):
        diff = X - Y[i]
        T[i] = np.sqrt(np.sum(np.square(diff), axis=1))
    return T

@nb.njit(fastmath=True, parallel=True)
def func_3(X, Y, T):
    for i in nb.prange(Y.shape[0]):
        for j in range(X.shape[0]):
            diff_sq_sum = 0.
            for k in range(X.shape[1]):
                diff_sq_sum += (X[j, k] - Y[i, k])**2
            T[i, j] = np.sqrt(diff_sq_sum)
    return T
Timings
I did all the timings in a Jupyter Notebook and observed really weird behavior. The following code is all in one cell. I also tried calling timeit multiple times, but on the first execution of the cell this doesn't change anything.
First execution of the cell
D = 128
N = 3000
K = 500
X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))
#You can do it more often it would not change anything
%timeit func_1(X,Y,T)
%timeit func_1(X,Y,T)
#You can do it more often it would not change anything
%timeit func_2(X,Y,T)
%timeit func_2(X,Y,T)
###Avoid measuring compilation overhead###
%timeit func_3(X,Y,T)
##########################################
%timeit func_3(X,Y,T)
774 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
768 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.74 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Second execution
345 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
337 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
322 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.93 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm trying to use different weights for my model and I need those weights to add up to 1, like this:
def func(length):
    return ['a list of numbers that add up to 1, with the given length']
func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?
This can be done quite simply using numpy arrays:
import numpy as np

def func(length):
    linArr = np.arange(1, length+1)
    return linArr/linArr.sum()
First we create an array of the given length containing the integers 1 through length, then we normalize it by its sum.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
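A quick sanity check against the example from the question:
print(func(4))        # [0.1 0.2 0.3 0.4]
print(func(4).sum())  # 1.0 (up to floating-point rounding)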
For large inputs, you might find that using np.linspace is faster than the accepted answer
def f1(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum

def f2(l):
    delta = 2/(l*(l+1))
    return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta) which should be even faster, but this does present the risk of rounding errors causing the array to have lengths different from l (as will happen e.g. for l = 10000000).
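A minimal check of that risk (assuming the same delta as in f2; exactly where the mismatch appears may depend on the NumPy version and platform):
l = 10000000
delta = 2/(l*(l+1))
print(len(np.arange(delta, l*delta, delta)))  # not guaranteed to equal l, due to rounding of the endpoint
print(len(np.linspace(delta, l*delta, l)))    # always exactly l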
If speed is more important than code style, it might also be possible to squeeze out a bit more by using Numba:
from numba import jit

@jit
def f3(l):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for n in range(l):
        a[n] = (n+1)*delta
    return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to give much, but helping it out a bit and pre-splitting the array into num_parallel parts does give further improvement on a quad core system:
from numba import njit, prange

@njit(parallel=True)
def f4(l, num_parallel=4):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for j in prange(num_parallel):
        # The last iteration gets whatever's left from rounding
        offset = 0 if j != num_parallel - 1 else l % num_parallel
        for n in range(l//num_parallel + offset):
            i = j*(l//num_parallel) + n
            a[i] = (i+1)*delta
    return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am performing data analysis using a Python script and learned from profiling that more than 95% of the computation time is taken by the line which performs the operation np.sum(C[np.isin(A, b)]), where A and C are 2D NumPy arrays of equal dimension m x n, and b is a 1D array of variable length. Is there a dedicated NumPy function, or some other way, to accelerate such a computation?
Typical sizes of A (int64), C (float64): 10M x 100
Typical size of b (int64): 1000
As your labels are from a small integer range you should get a sizeable speedup from using np.bincount (pp below). Alternatively, you can speed up the lookup by creating a mask (p2). This, like your original code, allows replacing np.sum with math.fsum, which guarantees a result exact to within machine precision (p3). Alternatively, we can pythranize it for another 40% speedup (p4).
On my rig the numba solution (mx) is about as fast as pp, but maybe I'm not doing it right.
import numpy as np
import math
from subsum import pflat

MAXIND = 120_000

def OP():
    return sum(C[np.isin(A, b)])

def pp():
    return np.bincount(A.reshape(-1), C.reshape(-1), MAXIND)[np.unique(b)].sum()

def p2():
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return C[grid[A]].sum()

def p3():
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return math.fsum(C[grid[A]])

def p4():
    return pflat(A.ravel(), C.ravel(), b, MAXIND)

import numba as nb

@nb.njit(parallel=True, fastmath=True)
def nb_ss(A, C, b):
    s = set(b)
    sum = 0.
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i, j] in s:
                sum += C[i, j]
    return sum

def mx():
    return nb_ss(A, C, b)

sh = 100_000, 100
A = np.random.randint(0, MAXIND, sh)
C = np.random.random(sh)
b = np.random.randint(0, MAXIND, 1000)

print(OP(), pp(), p2(), p3(), p4(), mx())

from timeit import timeit
print("OP", timeit(OP, number=4)*250)
print("pp", timeit(pp, number=10)*100)
print("p2", timeit(p2, number=10)*100)
print("p3", timeit(p3, number=10)*100)
print("p4", timeit(p4, number=10)*100)
print("mx", timeit(mx, number=10)*100)
The code for the pythran module:
[subsum.py]
import numpy as np

#pythran export pflat(int[:], float[:], int[:], int)
def pflat(A, C, b, MAXIND):
    grid = np.zeros(MAXIND, bool)
    grid[b] = True
    return C[grid[A]].sum()
Compilation is as simple as pythran subsum.py
Sample run:
41330.15849965791 41330.15849965748 41330.15849965747 41330.158499657475 41330.15849965791 41330.158499657446
OP 1963.3807722493657
pp 53.23419079941232
p2 21.8758742994396
p3 26.829131800332107
p4 12.988955597393215
mx 52.37018179905135
I assume you have changed int64 to int8 wherever required.
You can use Numba's parallel and jit features for faster NumPy computations that make use of all the cores.
import numpy as np
import numba

@numba.jit(nopython=True, parallel=True)
def isin_sum(A, C, b):
    return np.sum(C[np.isin(A, b)])
Documentation for Numba Parallel
I don't know why np.isin is that slow, but you can implement your function quite a lot faster.
The following Numba solution uses a set for fast lookup of values and is parallelized. The memory footprint is also smaller than in the Numpy implementation.
Code
import numpy as np
import numba as nb

@nb.njit(parallel=True, fastmath=True)
def nb_pp(A, C, b):
    s = set(b)
    sum = 0.
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i, j] in s:
                sum += C[i, j]
    return sum
Timings
The pp implementation and the first data sample are from Paul Panzer's answer above.
MAXIND = 120_000
sh = 100_000, 100
A = np.random.randint(0, MAXIND, sh)
C = np.random.random(sh)
b = np.random.randint(0, MAXIND, 1000)
MAXIND = 120_000
%timeit res_1=np.sum(C[np.isin(A, b)])
1.5 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
62.5 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=nb_pp(A,C,b)
17.1 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
MAXIND = 10_000_000
%timeit res_1=np.sum(C[np.isin(A, b)])
2.06 s ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
206 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=nb_pp(A,C,b)
17.6 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MAXIND = 100
%timeit res_1=np.sum(C[np.isin(A, b)])
1.01 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pp(A,C,b)
46.8 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=nb_pp(A,C,b)
3.88 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
This question already has answers here:
Why is numpy list access slower than vanilla python?
(2 answers)
Closed 4 years ago.
I've been programming in pure Python for 2 years. Now I am learning Numpy and I am confused.
The tutorials give examples showing that Numpy is way more efficient than pure Python, but when I try, for example, simple iteration:
import numpy as np
import time

start = time.time()
list = range(1000000)
array = np.arange(1000000)

for element in list:
    pass
print('\n'+str((time.time() - start)*1000)+'\n')

start = time.time()
for element in np.nditer(array, order='F'):
    pass
print('\n'+str((time.time() - start)*1000)+'\n')
I got an output:
87.67843246459961
175.25482177734375
As can be seen above, iterating over the Numpy array is way less efficient than pure Python.
My question is: I don't understand why I should use Numpy, and more so, when I should use it.
Numpy is much faster with vector operations.
If you change your code to:
array+=1
instead of:
for element in np.nditer(array, order='F'):
    pass
you can see that numpy vastly outperforms the regular python code
The strength of numpy is that you don't need to iterate.
There's no problem with iterating, and in fact it can be useful in some cases, but the vast majority of problems can be solved using functions in numpy.
Using your examples (with the %timeit command in ipython), if you do something simple like adding a number to every element of the list, numpy is clearly much faster when it is used directly, without iterating.
import numpy as np
import time
start = time.time()
dlist = range(1000000)
darray = np.arange(1000000)
# Pure python
%timeit [e + 2 for e in dlist]
59.8 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Iterating numpy
%timeit [e + 2 for e in darray]
193 ms ± 8.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Converting numpy to list before iterating
%timeit [e+2 for e in list(darray)]
198 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Numpy
%timeit darray + 2
847 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Same thing with more complex operations, like finding the mean:
%timeit sum(dlist)/len(dlist)
16.5 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(darray)/len(darray)
66.6 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Converting to list then iterating
%timeit sum(list(darray))/len(darray)
83.1 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Using numpy's methods
%timeit darray.mean()
1.26 ms ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy is much faster once you understand how to use it. It requires a rather different way of looking at data, and gaining a familiarity with the functions it provides, but once you do it results in simpler and faster code.