Numba @jit fails to optimise simple function - python

I have a pretty simple function which uses Numpy arrays and for loops, but adding the Numba #jit decorator gives absolutely no speed up:
from numpy import zeros, arange
from numpy.random import rand, randn, randint
import numpy as np
from numba import jit

# @jit(float64[:](int32,float64,float64,float64,int32))
@jit
def Ising_model_1D(N=200, J=1, T=1e-2, H=0, n_iter=1e6):
    beta = 1/T
    s = randn(N,1) > 10
    s[N-1] = s[0]
    mag = zeros((n_iter,1))
    aux_idx = randint(low=0, high=N, size=(n_iter,1))
    for i1 in arange(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if(delta_E < 0):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        elif(np.exp(-1*beta*delta_E) >= rand()):
            s[rnd_idx] = np.logical_not(s[rnd_idx])
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
    return mag
MATLAB on the other hand takes less than 0.5 seconds to run this!
Why is Numba missing something so basic?

Here is a reworking of your code that runs in about 0.4 seconds on my machine:
import numpy as np
from numpy import zeros
from numpy.random import rand, randn, randint
from numba import jit

def ising_model_1d(N=200, J=1, T=1e-2, H=0, n_iter=1e6):
    n_iter = int(n_iter)
    beta = 1/T
    s = randn(N) > 10
    s[N-1] = s[0]
    mag = zeros(n_iter)
    aux_idx = randint(low=0, high=N, size=n_iter)
    pre_rand = rand(n_iter)
    _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag)
    return mag

@jit(nopython=True)
def _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag):
    for i1 in range(n_iter):
        rnd_idx = aux_idx[i1]
        s_1 = s[rnd_idx]*2 - 1
        s_2 = s[(rnd_idx+1)%(N)]*2 - 1
        s_3 = s[(rnd_idx-1)%(N)]*2 - 1
        delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
        if delta_E < 0:
            s[rnd_idx] = not s[rnd_idx]
        elif np.exp(-1*beta*delta_E) >= pre_rand[i1]:
            s[rnd_idx] = not s[rnd_idx]
        s[N-1] = s[0]
        mag[i1] = (s*2-1).sum()*1.0/N
Please make sure the results are as expected! I changed much of what you had, and can't guarantee that the calculations are correct!
Working with numba requires a little care. Python functions, as well as most numpy functions, cannot be optimized by the compiler. One thing I find helpful is to use the nopython option to @jit. This means that the compiler will complain whenever you give it some code that it can't really optimize. You can then look at the error message and find the line that will likely slow down your code.
The trick, I find, is to write a "gateway" function in Python that does as much of the work as possible using numpy and its vectorized functions. It should create the empty arrays that you'll need to store the results in. It should package all of the data you'll need during the computation. Then it should pass all of these into your jitted function in one big, long argument list.
Case in point: notice how I handle random number generation in the jitted code. In your original code, you called rand():
elif(np.exp(-1*beta*delta_E) >= rand()):
But rand() can't be optimized by numba (in older versions of numba, at least; in newer versions it can, provided that rand is called without arguments). The key observation is that you need a single random number for each of the n_iter iterations. So we simply create a random array using numpy in the wrapper function, then feed this random array to the jitted function. Getting a random number is then as simple as indexing into this array.
Lastly, for a list of the numpy functions that can be optimized by the latest version of the compiler, see here. In my reworking of your code I was aggressive in removing calls to numpy functions so that the code would work over more versions of numba.
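As a quick sanity check, here is a minimal timing sketch, assuming both functions above are defined in the same session (the first call is made separately because it includes JIT compilation):
import time

ising_model_1d(n_iter=1e6)          # warm-up call; includes compilation time

start = time.time()
mag = ising_model_1d(n_iter=1e6)
print("jitted version: %.2f seconds" % (time.time() - start))
print("final magnetisation: %.3f" % mag[-1])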

Related

Can I rewrite this code to make it work faster?

Is it actually possible to make this run faster? I need to get half of all possible grids (all elements can be either -1 or 1) of size 4*Lx (for counting energies in the Ising model).
from itertools import product
import numpy as np

def get_grid(Lx):
    a = list()
    count = 0
    t = list(product([1,-1], repeat=Lx))
    for i in range(len(t)):
        for j in range(len(t)):
            for k in range(len(t)):
                for l in range(len(t)):
                    count += 1
                    a.append([t[i], t[j], t[k], t[l]])
                    if count == 2**(Lx*4)/2:
                        return np.array(a)
Tried using numba, but that didn't work out.
First of all, Numba does not like lists. If you want efficient code, you need to operate on arrays (except when you really do not know the size at runtime and estimating it is hard or slow). Here the size of the output array is already known, so it is better to preallocate it and then fill it. Numba also does not like high-level features such as generators; prefer plain loops, which are fast as long as they run inside a JITed function. The Cartesian product can be replaced by computing each row directly from the bits of an increasing integer. The whole computation is mainly memory-bound, so it pays to use a small integer datatype like int8, which takes 4 times less space in RAM (and is thus about 4 times faster to fill). Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('int8[:,:,:](int64,)')
def get_grid_numba(Lx):
    t = np.empty((2**Lx, Lx), dtype=np.int8)
    for i in range(2**Lx):
        for j in range(Lx):
            t[i, Lx-1-j] = 1 - 2 * ((i >> j) & 1)
    outSize = 2**(Lx*4 - 1)
    out = np.empty((outSize, 4, Lx), dtype=np.int8)
    cur = 0
    for i in range(len(t)):
        for j in range(len(t)):
            for k in range(len(t)):
                for l in range(len(t)):
                    out[cur, 0, :] = t[i, :]
                    out[cur, 1, :] = t[j, :]
                    out[cur, 2, :] = t[k, :]
                    out[cur, 3, :] = t[l, :]
                    cur += 1
                    if cur == outSize:
                        return out
    return out
For Lx=4, the initial code takes 66.8 ms while this Numba code takes 0.36 ms on my i5-9600KF processor. It is thus 185 times faster.
Note that the size of the output array grows exponentially and very quickly. For Lx=7, the output shape is (134217728, 4, 7) and it takes 3.5 GiB of RAM. The Numba code takes 2.47 s to generate it, that is 1.4 GiB/s. If this is not enough for you, then you can write specialized implementations for Lx=1 to Lx=8, use loops for the out slice assignments, and even use multiple threads for Lx>=5. For small arrays, you can pre-compute them once. This should be an order of magnitude faster.
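As a quick sanity check, here is a minimal sketch comparing the two implementations, assuming both get_grid and get_grid_numba from the snippets above are defined in the same session (the first Numba call is kept out of the timing because it includes compilation):
import time
import numpy as np

Lx = 4
ref = get_grid(Lx)
fast = get_grid_numba(Lx)           # first call compiles the function
assert np.array_equal(ref, fast)    # same rows, same order

start = time.time()
get_grid_numba(Lx)
print("Numba version: %.4f s" % (time.time() - start))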

Reduce python loop to array calculation

I am trying to fill an array with calculated values from functions defined earlier in my code. I started with a code that has a similar structure to the following:
from numpy import cos, sin, arange, zeros

a = arange(1000)
b = arange(1000)

def defcos(x):
    return cos(x)

def defsin(x):
    return sin(x)

a_len = len(a)
b_len = len(b)
result = zeros((a_len, b_len))
for i in xrange(b_len):
    for j in xrange(a_len):
        a_res = defcos(a[j])
        b_res = defsin(b[i])
        result[i, j] = a_res * b_res
I tried to use array representations of the functions, which resulted in the following change to the loop:
a_res = defsin(a)
b_res = defcos(b)
for i in xrange(b_len):
    for j in xrange(a_len):
        result[i, j] = a_res[i] * b_res[j]
This is already significantly faster than the first version. But is there a way to avoid the loop entirely? I have encountered these loops a couple of times in the past but never bothered, as speed was not critical. But this time it is the core component of something which is itself looped through a couple more times. :)
Any help would be appreciated, thanks in advance!
Like so:
from numpy import newaxis
a_res = sin(a)
b_res = cos(b)
result = a_res[:, newaxis] * b_res
To understand how this works, have a look at the rules for array broadcasting. And please don't define useless functions like defsin; just use sin itself! Another minor detail: you get i from range(b_len), but you use it to index a_res. This is a bug if a_len != b_len.
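For a quick check that the broadcasted product matches the double loop, here is a minimal sketch, assuming a_len == b_len as in the question:
import numpy as np

a = np.arange(200)
b = np.arange(200)

looped = np.zeros((len(b), len(a)))
for i in range(len(b)):
    for j in range(len(a)):
        looped[i, j] = np.sin(a[i]) * np.cos(b[j])

broadcast = np.sin(a)[:, np.newaxis] * np.cos(b)
print(np.allclose(looped, broadcast))   # True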

Speeding up python code with cython

I have a function which just basically makes lots of calls to a simple defined hash function and tests to see when it finds a duplicate. I need to do lots of simulations with it so would like it to be as fast as possible. I am attempting to use cython to do this. The cython code is currently called with a normal python list of integers with values in the range 0 to m^2.
import math, random

cdef int a, b, c, d, m, pos, value, cyclelimit, nohashcalls

def h3(int a, int b, int c, int d, int m, int x):
    return (a*x**2 + b*x + c) % m

def floyd(inputx):
    dupefound, nohashcalls = (0, 0)
    m = len(inputx)
    loops = int(m*math.log(m))
    for loopno in xrange(loops):
        if (dupefound == 1):
            break
        a = random.randrange(m)
        b = random.randrange(m)
        c = random.randrange(m)
        d = random.randrange(m)
        pos = random.randrange(m)
        value = inputx[pos]
        listofpos = [0] * m
        listofpos[pos] = 1
        setofvalues = set([value])
        cyclelimit = int(math.sqrt(m))
        for j in xrange(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos] == 1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of it is here.
Second, cython -a, for "annotate", gives you a really excellent break down of the code generated by the cython compiler and a color-coded indication of how dirty (read: python api heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with Numpy explains how to get fast, C-style access to the Numpy array data. Unfortunately it's verbose and annoying. We're in luck however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True

# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np
import random

# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a, int b, int c, int d, int m, int x):
    return (a*x**2 + b*x + c) % m

# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# @cython.boundscheck(False)

# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
    # Type the variables in the scope of the function.
    cdef int a, b, c, d, value, cyclelimit
    cdef unsigned int dupefound = 0
    cdef unsigned int nohashcalls = 0
    cdef unsigned int loopno, pos, j

    # `m` has type int because inputx is already a Cython memory view and
    # `infer_types` is on.
    m = inputx.shape[0]

    cdef unsigned int loops = int(m*math.log(m))

    # Again using the memory view, but letting Numpy allocate an array of zeros.
    cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

    # Keep this random sampling out of the loop.
    cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

    for loopno in range(loops):
        if (dupefound == 1):
            break

        # From our precomputed array.
        a = randoms[loopno, 0]
        b = randoms[loopno, 1]
        c = randoms[loopno, 2]
        d = randoms[loopno, 3]
        pos = randoms[loopno, 4]

        value = inputx[pos]

        # Unfortunately, Memory View does not support "vectorized" operations
        # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
        for j in range(m):
            listofpos[j] = 0

        listofpos[pos] = 1
        setofvalues = set((value,))
        cyclelimit = int(math.sqrt(m))

        for j in range(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos] == 1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])

    return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but it helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
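For anyone following along, here is a minimal build sketch; the filename edcython.pyx is only an assumption, taken from the module name used in the question's calling snippet:
# setup.py -- minimal build sketch for the Cython code above
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("edcython.pyx"))
Build it with python setup.py build_ext --inplace, and run cython -a edcython.pyx to get the annotated report mentioned earlier.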
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]

How to rewrite this code from python loops to numpy vectors (for performance)?

I have this code:
for j in xrange(j_start, self.max_j):
    for i in xrange(0, self.max_i):
        new_i = round(i + ((j - j_start) * discriminant))
        if new_i >= self.max_i:
            continue
        self.grid[new_i, j] = standard[i]
and I want to speed it up by throwing away the slow native Python loops. There is the possibility of using numpy vector operations instead; they are really fast. How can I do that?
j_start, self.max_j, self.max_i, discriminant: int, int, int, float (constants).
self.grid: two-dimensional numpy array (self.max_i x self.max_j).
standard: one-dimensional numpy array (self.max_i).
Here is a complete solution; perhaps that will help.
jrange = np.arange(self.max_j - j_start)
joffset = np.round(jrange * discriminant).astype(int)
i = np.arange(self.max_i)
for j in jrange:
    new_i = i + joffset[j]
    in_range = new_i < self.max_i
    self.grid[new_i[in_range], j+j_start] = standard[i[in_range]]
It may be possible to vectorize both loops but that will, I think, be tricky.
I haven't tested this but I believe it computes the same result as your code.
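If you do want to go further and vectorize both loops as well, here is a sketch using broadcasting and a boolean mask; it is untested, uses the names from the question, and rounds the per-column offsets the same way as the loop version above:
import numpy as np

jrange = np.arange(self.max_j - j_start)
joffset = np.round(jrange * discriminant).astype(int)
i = np.arange(self.max_i)

new_i = i[:, np.newaxis] + joffset                          # shape (max_i, n_j)
cols = np.broadcast_to(jrange + j_start, new_i.shape)       # destination columns
vals = np.broadcast_to(standard[:, np.newaxis], new_i.shape)

mask = new_i < self.max_i                                   # drop out-of-range targets
self.grid[new_i[mask], cols[mask]] = vals[mask]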

Optimized method for calculating cosine distance in Python

I wrote a method to calculate the cosine distance between two arrays:
from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        return False
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):
        numerator += a[i]*b[i]
        denoma += abs(a[i])**2
        denomb += abs(b[i])**2
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
Running it can be very slow on a large array. Is there an optimized version of this method that would run faster?
Update: I've tried all the suggestions to date, including scipy. Here's the version to beat, incorporating suggestions from Mike and Steve:
def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError, "a and b must be same length"  # Steve
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):  # Mike's optimizations:
        ai = a[i]            # only calculate once
        bi = b[i]
        numerator += ai*bi   # faster than exponent (barely)
        denoma += ai*ai      # strip abs() since it's squaring
        denomb += bi*bi
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
If you can use SciPy, you can use cosine from spatial.distance:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
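For example, a small usage sketch (scipy.spatial.distance.cosine returns the cosine distance directly, i.e. 1 minus the cosine similarity):
from scipy.spatial.distance import cosine

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(cosine(a, b))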
If you can't use SciPy, you could try to obtain a small speedup by rewriting your Python (EDIT: but it didn't work out like I thought it would, see below).
from itertools import izip
from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError, "a and b must be same length"
    numerator = sum(tup[0] * tup[1] for tup in izip(a, b))
    denoma = sum(avalue ** 2 for avalue in a)
    denomb = sum(bvalue ** 2 for bvalue in b)
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
It is better to raise an exception when the lengths of a and b are mismatched.
By using generator expressions inside of calls to sum() you can calculate your values with most of the work being done by the C code inside of Python. This should be faster than using a for loop.
I haven't timed this so I can't guess how much faster it might be. But the SciPy code is almost certainly written in C or C++ and it should be about as fast as you can get.
If you are doing bioinformatics in Python, you really should be using SciPy anyway.
EDIT: Darius Bacon timed my code and found it slower. So I timed my code and... yes, it is slower. The lesson for all: when you are trying to speed things up, don't guess, measure.
I am baffled as to why my attempt to put more work on the C internals of Python is slower. I tried it for lists of length 1000 and it was still slower.
I can't spend any more time on trying to hack the Python cleverly. If you need more speed, I suggest you try SciPy.
EDIT: I just tested by hand, without timeit. I find that for short a and b, the old code is faster; for long a and b, the new code is faster; in both cases the difference is not large. (I'm now wondering if I can trust timeit on my Windows computer; I want to try this test again on Linux.) I wouldn't change working code to try to get it faster. And one more time I urge you to try SciPy. :-)
(I originally thought) you're not going to speed it up a lot without breaking out to C (like numpy or scipy) or changing what you compute. But here's how I'd try that, anyway:
from itertools import imap
from math import sqrt
from operator import mul

def cosine_distance(a, b):
    assert len(a) == len(b)
    return 1 - (sum(imap(mul, a, b))
                / sqrt(sum(imap(mul, a, a))
                       * sum(imap(mul, b, b))))
It's roughly twice as fast in Python 2.6 with 500k-element arrays. (After changing map to imap, following Jarret Hardie.)
Here's a tweaked version of the original poster's revised code:
from itertools import izip
from math import sqrt

def cosine_distance(a, b):
    assert len(a) == len(b)
    ab_sum, a_sum, b_sum = 0, 0, 0
    for ai, bi in izip(a, b):
        ab_sum += ai * bi
        a_sum += ai * ai
        b_sum += bi * bi
    return 1 - ab_sum / sqrt(a_sum * b_sum)
It's ugly, but it does come out faster. . .
Edit: And try Psyco! It speeds up the final version by another factor of 4. How could I forget?
No need to take abs() of a[i] and b[i] if you're squaring it.
Store a[i] and b[i] in temporary variables, to avoid doing the indexing more than once.
Maybe the compiler can optimize this, but maybe not.
Check into the **2 operator. Is it simplifying it into a multiply, or is it using a general power function (log, multiply by 2, antilog)? A quick timing check is sketched after this list.
Don't do sqrt twice (though the cost of that is small). Do sqrt(denoma * denomb).
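Here is a minimal timeit sketch for the **2 question above; the numbers will vary between interpreters and versions, so treat it as illustrative only:
import timeit

# Square a float via the power operator versus a plain multiply.
print(timeit.timeit("x**2", setup="x = 3.14159", number=10**6))
print(timeit.timeit("x*x",  setup="x = 3.14159", number=10**6))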
Similar to Darius Bacon's answer, I've been toying with operator and itertools to produce a faster answer. The following seems to be 1/3 faster on a 500-item array according to timeit:
from math import sqrt
from itertools import imap
from operator import mul

def op_cosine(a, b):
    dot_prod = sum(imap(mul, a, b))
    a_veclen = sqrt(sum(i ** 2 for i in a))
    b_veclen = sqrt(sum(i ** 2 for i in b))
    return 1 - dot_prod / (a_veclen * b_veclen)
This is faster for arrays of around 1000+ elements.
from math import sqrt
from numpy import array

def cosine_distance(a, b):
    a = array(a)
    b = array(b)
    numerator = (a*b).sum()
    denoma = (a*a).sum()
    denomb = (b*b).sum()
    result = 1 - numerator / sqrt(denoma*denomb)
    return result
Using the C code inside of SciPy wins big for long input arrays. Using simple and direct Python wins for short input arrays; Darius Bacon's izip()-based code benchmarked out best. Thus, the ultimate solution is to decide which one to use at runtime, based on the length of the input arrays:
from scipy.spatial.distance import cosine as scipy_cos_dist
from itertools import izip
from math import sqrt

def cosine_distance(a, b):
    len_a = len(a)
    assert len_a == len(b)
    if len_a > 200:  # 200 is a magic value found by benchmark
        return scipy_cos_dist(a, b)
    # function below is basically just Darius Bacon's code
    ab_sum = a_sum = b_sum = 0
    for ai, bi in izip(a, b):
        ab_sum += ai * bi
        a_sum += ai * ai
        b_sum += bi * bi
    return 1 - ab_sum / sqrt(a_sum * b_sum)
I made a test harness that tested the functions with different length inputs, and found that around length 200 the SciPy function started to win. The bigger the input arrays, the bigger it wins. For very short length arrays, say length 3, the simpler code wins. This function adds a tiny amount of overhead to decide which way to do it, then does it the best way.
In case you are interested, here is the test harness:
from darius2 import cosine_distance as fn_darius2
fn_darius2.__name__ = "fn_darius2"
from ult import cosine_distance as fn_ult
fn_ult.__name__ = "fn_ult"
from scipy.spatial.distance import cosine as fn_scipy
fn_scipy.__name__ = "fn_scipy"

import random
import time

lst_fn = [fn_darius2, fn_scipy, fn_ult]

def run_test(fn, lst0, lst1, test_len):
    start = time.time()
    for _ in xrange(test_len):
        fn(lst0, lst1)
    end = time.time()
    return end - start

for data_len in range(50, 500, 10):
    a = [random.random() for _ in xrange(data_len)]
    b = [random.random() for _ in xrange(data_len)]
    print "len(a) ==", len(a)
    test_len = 10**3
    for fn in lst_fn:
        n = fn.__name__
        r = fn(a, b)
        t = run_test(fn, a, b, test_len)
        print "%s:\t%f seconds, result %f" % (n, t, r)
from math import sqrt

def cd(a, b):
    if len(a) != len(b):
        raise ValueError, "a and b must be the same length"
    rn = range(len(a))
    adb = sum([a[k]*b[k] for k in rn])
    nma = sqrt(sum([a[k]*a[k] for k in rn]))
    nmb = sqrt(sum([b[k]*b[k] for k in rn]))
    result = 1 - adb / (nma*nmb)
    return result
Your updated solution still has two square roots. You can reduce this to one by replacing the sqrt line with:
result = 1 - numerator / sqrt(denoma*denomb)
A multiply is typically quite a bit quicker than a sqrt. It might not seem like much, since it is only called once per function call, but it sounds like you are calculating a lot of cosine distances, so the improvement will add up.
Your code looks like it should be ripe for vector optimizations. So if cross-platform support is not an issue and you want to speed it up even further, you could code the cosine distance in C and make sure your compiler aggressively vectorizes the resulting code (even the Pentium II is capable of some floating point vectorisation).
