Currently I am trying to improve the performance of my Python code, and I am using Numba successfully for that. To improve the structure of the code I split it into functions. To my surprise, I have noticed that if I split the code into several Numba functions, it is significantly slower than if I put the whole code into one function with a Numba decorator.
An example would be:
import timeit
import numba as nb

@nb.njit
def fct_4(a, b):
    x = a ^ b
    setBits = 0
    while x > 0:
        setBits += x & 1
        x >>= 1
    return setBits

@nb.njit
def fct_3(c, set_1, set_2):
    h = 2
    if c not in set_1 and c not in set_2:
        if fct_4(0, c) <= h:
            set_1.add(c)
        else:
            set_2.add(c)

@nb.njit
def fct_2(c, set_1, set_2):
    fct_3(c, set_1, set_2)

@nb.njit
def fct_1(set_1, set_2):
    for x1 in range(1000):
        c = 2
        fct_2(c, set_1, set_2)
is slower than
@nb.njit
def fct_1(set_1, set_2):
    for x1 in range(1000):
        c = 2
        h = 2
        if c not in set_1 and c not in set_2:
            if fct_4(0, c) <= h:
                set_1.add(c)
            else:
                set_2.add(c)
with
@nb.njit
def main_fct(set_1, set_2):
    for i in range(50):
        for x in range(1000):
            fct_1(set_1, set_2)

set_1 = {0}
set_2 = {47}

start = timeit.default_timer()
main_fct(set_1, set_2)
stop = timeit.default_timer()
(2.70 seconds vs 0.46 seconds). I thought this shouldn't make a difference. Could you enlighten me?
Since Python is a dynamically typed language, its function call overhead is quite high.
On top of that, you are calling the functions inside a loop, so the time spent on calling the function and checking the arguments is multiplied 1000 times.
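As a rough illustration of that per-call cost in plain Python (the helper names below are made up for the example), compare doing one trivial operation inline versus behind a function call:

import timeit

def bit_inline(n):
    total = 0
    for i in range(n):
        total += i & 1          # work done inline
    return total

def lowest_bit(i):
    return i & 1

def bit_calls(n):
    total = 0
    for i in range(n):
        total += lowest_bit(i)  # same work, but behind a function call
    return total

print(timeit.timeit(lambda: bit_inline(1000), number=1000))
print(timeit.timeit(lambda: bit_calls(1000), number=1000))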
I wanted to share how I solved a parallelisation problem I had with jit. Initially I had the code below and got this error:
Code:
import numpy as np
from numba import jit

@jit(nopython=True, parallel=True)
def mandelbrot_2(cArray, iterations):
    count = 0
    for c in cArray:
        z = 0
        for n in range(iterations):
            z = np.square(z) + c
            if np.abs(z) > 2:
                count += 1
                break
    return cArray.shape[0] - count
Error:
/Users/alexander/opt/anaconda3/lib/python3.8/site-packages/numba/core/typed_passes.py:326: NumbaPerformanceWarning:
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.
To find out why, try turning on parallel diagnostics, see https://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "<ipython-input-56-9fcfc0a9fe03>", line 17:
@jit(nopython=True, parallel=True)
def mandelbrot_2(cArray, iterations):
^

warnings.warn(errors.NumbaPerformanceWarning(msg,
Import prange:
from numba import jit, prange
New Code:
I converted the first loop to index into the array instead of using a new array element each time.
I converted the range in both loops to prange, which I imported from Numba.
@jit(nopython=True, parallel=True)
def mandelbrot_2(cArray, iterations):
    count = 0
    for c in prange(cArray.shape[0]):
        z = 0
        for n in prange(iterations):
            z = np.square(z) + cArray[c]
            if np.abs(z) > 2:
                count += 1
                break
    return cArray.shape[0] - count
I created a @jit compiled function, which should be faster than a normal loop.
However it isn't, since the just-in-time compilation takes minutes. What am I doing wrong?
What is the most efficient way to specify signatures for this function?
''' reproducible sample data '''
import numpy as np, time, numba as nb

LEN, Amount_Of_Elements = 10000, 4000
temp = np.random.randint(int(Amount_Of_Elements*0.7), high=Amount_Of_Elements, size=LEN)
RatiosUp = [np.random.uniform(size=rand) for rand in temp]
RatiosDown = [np.random.uniform(size=rand) for rand in temp]
UpPointsSlices = [np.random.uniform(size=rand) for rand in temp]
DownPointsSlices = [np.random.uniform(size=rand) for rand in temp]

''' function '''
@nb.jit
def filter3(a, b):
    return a > b

@nb.jit
def func(RatiosUp, RatiosDown, UpPointsSlices, DownPointsSlices, result):
    for i in range(len(RatiosUp)):
        for j in range(RatiosUp[i].size):
            if filter3(RatiosUp[i][j], RatiosDown[i][j]):
                result[i][j] = 1
            elif filter3(RatiosDown[i][j], RatiosUp[i][j]):
                result[i][j] = 0
            elif filter3(UpPointsSlices[i][j], DownPointsSlices[i][j]):
                result[i][j] = 0
            else:
                result[i][j] = 1
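For context, this is the general shape of an explicit Numba signature as I understand it: the return type called on the argument types. A minimal sketch for the comparison helper above, where the boolean/float64 types are just an assumption about my data:

import numba as nb

# Sketch only: an eager signature is the return type applied to the argument types.
# boolean(float64, float64) is assumed here; the real dtypes may differ.
@nb.njit(nb.boolean(nb.float64, nb.float64))
def filter3(a, b):
    return a > b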
I'm using Python, and apparently the slowest part of my program is doing simple additions on float variables.
It takes about 35 seconds to do around 400,000,000 additions/multiplications.
I'm trying to figure out what is the fastest way I can do this math.
This is what the structure of my code looks like.
Example (dummy) code:
def func(x, y, z):
    loop_count = 30
    a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...35 elements]
    b = [0, 11, 22, 33, 44, 55, 66, 77, 88, 99, 1010, 1111, 1212, ...35 elements]
    p = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...35 elements]
    for i in range(loop_count - 1):
        c = p[i-1]
        d = a[i] + c * a[i+1]
        e = min(2, a[i]) + c * b[i]
        f = e * x
        g = y + d * c
        .... and so on
        p[i] = d + e + f + s + g5 + f4 + h7 * t5 + y8
    return sum(p)
func() is called about 200k times, and loop_count is about 30. Per call I have ~20 multiplications, ~45 additions, and ~10 uses of min/max.
I was wondering if there is a method for me to declare all of these as ctypes.c_float and do the addition in C using stdlib or something similar?
Note that the p[i] calculated at the end of the loop is used as c in the next loop iteration. For iteration 0, it just uses p[-1] which is 0 in this case.
My constraints:
I need to use Python. While I understand plain math would be faster in C/Java/etc., I cannot use them, because of a bunch of other things I do in Python in this same program that cannot be done in C.
I tried writing this with Cython, but it caused a bunch of issues with the environment I need to run this in. So, again, not an option.
I think you should consider using numpy; it stays within Python, so it should fit your constraints.
Example case of a simple dot operation (x.y)
import datetime
import numpy as np

# plain Python lists
x = list(range(0, 10000000, 1))
y = list(range(0, 20000000, 2))
for i in range(0, len(x)):
    x[i] = x[i] * 0.00001
    y[i] = y[i] * 0.00001

now = datetime.datetime.now()
z = 0
for i in range(0, len(x)):
    z = z + x[i] * y[i]
print("handmade dot =", datetime.datetime.now() - now)
print(z)

# numpy arrays
x = np.arange(0.0, 10000000.0 * 0.00001, 0.00001)
y = np.arange(0.0, 10000000.0 * 0.00002, 0.00002)
now = datetime.datetime.now()
z = np.dot(x, y)
print("numpy dot =", datetime.datetime.now() - now)
print(z)
outputs
handmade dot= 0:00:02.559000
66666656666.7
numpy dot = 0:00:00.019000
66666656666.7
numpy is more than 100x faster.
The reason is that numpy wraps a C library that does the dot operation in compiled code, whereas in pure Python you have a list of potentially generic objects, casting, and so on.
Is there a way to make the function f() faster? I run it hundreds of millions of times, so any speed increase would be appreciated. The IDs of dictionaries a and b are the same across runs, w is a constant, and the keys are integers which are not evenly distributed in general.
Also, the function lives in a class, so f is really f(self) and the variables are self.w, self.ID, self.a, and self.b.
import random
import time

w = 10.25
ID = range(10)
a = {}
b = {}
for i in ID:
    a[i] = random.uniform(0, 1)
    b[i] = random.uniform(0, 1)

def f():
    for i in ID:
        a[i] = b[i] * w

t0 = time.time()
for i in range(1000000):
    f()
t1 = time.time()
print(t1 - t0)
You can get a nice speed-up by localizing the variables in f():
def f(ID=ID, a=a, b=b, w=w):
    for i in ID:
        a[i] = b[i] * w
Some folks don't like localizing in that way, so you can also build a closure, which likewise speeds up access to what would otherwise be global variables:
def make_f(a, b, w, ID):
    def f():
        for i in ID:
            a[i] = b[i] * w
    return f

f = make_f(a, b, w, ID)
See this analysis of which types of variable access are fastest.
Algorithmically, there is not much else that can be done. The looping over ID is fast because it just increments reference counts for existing integers. The hashing of the integers is practically instant because the hash of an int is the int itself. Dict lookups themselves are already highly optimized. Likewise, there is no short-cut for multiplying by a constant.
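As a quick check of the integer-hashing point (CPython behaviour):

# In CPython, hashing a small non-negative int returns the int itself,
# so dict key hashing is essentially free here.
print(hash(7))      # 7
print(hash(12345))  # 12345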
I am optimizing a bottleneck section of my code--iterating on a function a' = f(a), where a and a' are N by 1 vectors, until max(abs(a' - a)) is sufficiently small.
I have put a Numba wrapper on f(a), and got a nice speedup over the most optimized pure NumPy version I was able to produce (it cut the runtime by about 50%).
I tried writing a C-compatible version of numpy.max(numpy.abs(aprime - a)), but it turns out this is slower! I actually lose back ALL of the gains I got from Numba-fying the first portion of the iteration!
Is there likely to be a way for Numba or Cython to improve upon numpy.max(numpy.abs(aprime - a))? I reproduce my code below for reference, where a is P0 and a' is Pprime:
EDIT: For me, it seems that it is important to "flatten()" the inputs to "maxabs()". When I do this, the performance is no worse than NumPy. Then, when I do a "dry run" of the function outside the timing brackets as JoshAdel suggested, the loop with "maxabs" does slightly better than the loop with numpy.max(numpy.abs()).
from numba import jit
import numpy as np
### Preliminaries, to make the working example fully functional
n = 1200
Gammer = np.exp(-np.random.rand(n,n))
alpher = np.ones((n,1))
xxer = 10000*np.random.rand(n,1)
chii = 6.5
varkappa = 6.5
phi3 = 1.5
A = .5
sig = .2
mmer = np.dot(Gammer,xxer**phi3)
totalprod = A*alpher + (1-A)*mmer
Gammerchii = Gammer**chii
Gammerrats = Gammerchii[:,0].flatten()/Gammerchii[0,:].flatten()
Gammerrats[(Gammerchii[0,:].flatten() == 0) | (Gammerchii[:,0].flatten() == 0)] = 1.
P0 = (Gammerrats*(xxer[0]/totalprod[0])*(totalprod/xxer).flatten())**(1/(1+2*chii))
P0 *= n/np.sum(P0)
### End of preliminaries
### This is the function to produce a' = f(a)
@jit
def Piteration(P0, chii, sig, n, xxer, totalprod, Gammerrats, Gammerchii):
    Mac = np.zeros((n,))
    Pprime = np.zeros((n,))
    themacpow = 1 - (1/chii)*(sig/(1-sig))
    specialchiipow = 1/(1+2*chii)
    Psum = 0.
    for i in range(n):
        for j in range(n):
            Mac[j] += ((P0[i]/P0[j])**chii)*Gammerchii[i,j]*totalprod[j]
    for i in range(n):
        Pprime[i] = (Gammerrats[i]*(xxer[0]/totalprod[0])*(totalprod[i]/xxer[i])*((Mac[i]/Mac[0])**themacpow))**specialchiipow
        Psum += Pprime[i]
    Psum = n/Psum
    for i in range(n):
        Pprime[i] *= Psum
    return Pprime
### This is the function to find max(abs(aprime - a))
@jit
def maxabs(vec1, vec2, n):
    themax = 0.
    curdiff = 0.
    for i in range(n):
        curdiff = vec1[i] - vec2[i]
        if curdiff < 0:
            curdiff *= -1
        if curdiff > themax:
            themax = curdiff
    return themax
### This is the main loop
diff = 1000.
while diff > 1e-2:
    Pprime = Piteration(P0.flatten(), chii, sig, n, xxer.flatten(), totalprod.flatten(), Gammerrats.flatten(), Gammerchii)
    diff = maxabs(P0.flatten(), Pprime.flatten(), n)
    P0 = 1.*Pprime
When I time your maxabs function vs np.max(np.abs(vec1 - vec2)) for an array of shape (1200,), the numba version is ~2.6x faster using numba 0.32.0.
When you time the code, make sure you run your function once before you time it so that you don't include the time it takes to jit the code, which you only pay the first time. In general, using timeit and running multiple times takes care of this. I'm not sure how you did the timing, though, since I see almost no difference between using maxabs and the numpy call; most of the runtime seems to be in the call to Piteration.
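For example, a minimal sketch of that timing pattern, reusing the maxabs function from the code above (the array sizes here are arbitrary):

import timeit
import numpy as np

vec1 = np.random.rand(1200)
vec2 = np.random.rand(1200)

# One warm-up call so the one-off JIT compilation cost is excluded from the timing.
maxabs(vec1, vec2, vec1.size)

print(timeit.timeit(lambda: maxabs(vec1, vec2, vec1.size), number=1000))
print(timeit.timeit(lambda: np.max(np.abs(vec1 - vec2)), number=1000))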