Perform the box covering on a graph using cython - python

I wrote a python script to perform the box covering on a graph but it takes more than a minute when I run it on small graphs (100 nodes).
Today someone recommended Cython to improve its efficiency, so I followed this guide to adapt the code I had.
Running the Python code, the results were:
In [6]: %timeit test.test()
1000 loops, best of 3: 1.88 ms per loop
After following the guide the results were:
In [7]: %timeit c_test.test()
1000 loops, best of 3: 1.05 ms per loop
The performance was better, but I am sure there is a lot that can still be improved. Given that I only met Cython today, I want to ask how I can improve this code:
import random as rnd
import numpy as np

cimport cython
cimport numpy as np

DTYPE = np.int
ctypedef np.int_t DTYPE_t

def choose_color(not_valid_colors, valid_colors):
    possible_values = list(valid_colors - not_valid_colors)
    if possible_values:
        return rnd.choice(possible_values)
    else:
        return max(valid_colors.union(not_valid_colors)) + 1

@cython.boundscheck(False)
cdef np.ndarray[DTYPE_t, ndim=2] greedy_coloring(np.ndarray[DTYPE_t, ndim=2] distances, int num_nodes, int diameter):
    cdef int i, lb, j
    cdef np.ndarray[DTYPE_t, ndim=2] c = np.empty((num_nodes+1, diameter+2), dtype=DTYPE)
    c.fill(-1)
    # Matrix C will not use the 0 column and 0 row to
    # let the algorithm look very similar to the paper
    # pseudo-code
    nodes = list(range(1, num_nodes+1))
    rnd.shuffle(nodes)
    c[nodes[0], :] = 0
    # Algorithm
    for i in nodes[1:]:
        for lb in range(2, diameter+1):
            not_valid_colors = set()
            valid_colors = set()
            for j in nodes[:i]:
                if distances[i-1, j-1] >= lb:
                    not_valid_colors.add(c[j, lb])
                else:
                    valid_colors.add(c[j, lb])
            c[i, lb] = choose_color(not_valid_colors, valid_colors)
    return c

def test():
    distances = np.matrix('0 3 2 4 1 1; \
                           3 0 1 1 3 2; \
                           2 1 0 2 2 1; \
                           4 1 2 0 4 3; \
                           1 3 2 4 0 1; \
                           1 2 1 3 1 0')
    c = greedy_coloring(distances, 6, 4)

In Cython, you will get faster speed as you remove more Python calls inside your Cython functions.
For example, skimming through your code, you are making calls to choose_color() inside the nested loop in greedy_coloring(). That function should be typed as well, along with the variables defined inside it; since it is called repeatedly, it brings a lot of overhead (a rough sketch of a typed version is below).
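For illustration only, a minimal sketch of what that could look like (my own sketch, not code from the question) - the sets stay ordinary Python objects, but the cdef declaration and the typed return value let the calls from greedy_coloring() skip the Python function-call machinery:

# Sketch (assumption): a cdef-typed choose_color; the question's def version
# would be replaced by something along these lines.
cdef int choose_color(set not_valid_colors, set valid_colors):
    cdef list possible_values = list(valid_colors - not_valid_colors)
    if possible_values:
        return rnd.choice(possible_values)
    else:
        return max(valid_colors.union(not_valid_colors)) + 1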
You can use cython with the -a option (e.g., cython -a file.pyx) to generate an annotated HTML file which shows visually which parts of your code are making Python calls (yellow lines). This will help you a lot in terms of improving your Cython code.
I'm sorry for the lack of specific pointers - hope this is helpful.

Related

c array with python class inside python function

I am working in Cython. How can i declare a C array of a python class instances and then pass the array to a python function and work on it?
cdef int n=100

class particle:
    def __init__(self):
        self.x=uniform(1,99)
        self.y=uniform(1,99)
        self.pot=0

cdef particle parlist[n]

def CalPot(parlist[]):
    for i in range(N-1):
        pot=0
        for j in range(i,N):
            dx=parlist[j].x-parlist[i].x
            dy=parlist[j].y-parlist[j].y
            r2=dx**2+dy**2
            pot=pot+4*ep*r2*((sig/r2)**12 - (sig/r2)**6)
        parlist[i].pot=pot
As Ioannis and DavidW told you, you should not create a c-array of python objects and should use a python list instead.
Cythonizing the resulting pure Python would bring a speed-up of about a factor of 2, because Cython would cut out the interpreter part. However, there is much more potential if you also get rid of reference counting and dynamic dispatch - a speed-up of up to a factor of 100 is pretty common. Some time ago I answered a question illustrating this.
What should you do to get this speed-up? You need to replace Python multiplication with "bare metal" multiplications.
First step: Don't use a (python)-class for particle, use a simple c-struct - it is just a collection of data - nothing more, nothing less:
cdef struct particle:
    double x
    double y
    double pot
First benefit: It is possible to define a global c-array of these structs (another question is whether that is a very smart thing to do in a bigger project):
DEF n=2000 # known at compile time
cdef particle parlist[n]
After the initialization of the array (for more details see attached listings), we can use it in our calcpot-function (I slightly changed your definition):
def calcpot():
    cdef double pot,dX,dY
    cdef int i,j

    for i in range(n):
        pot=0.0
        for j in range(i+1, n):
            dX=parlist[i].x-parlist[j].x
            dY=parlist[i].y-parlist[j].y
            pot=pot+1.0/(dX*dX+dY*dY)
        parlist[i].pot=pot
The main difference from the original code: parlist[i].x and co. are no longer slow Python objects but simple and fast doubles. There are a lot of subtle things to be considered in order to get the maximal speed-up - one really should read/reread the Cython documentation.
Was the trouble worth it? Here are the timings (via %timeit calcpot()) on my machine:
                           Time                Speed-up
pure python + interpreter: 924 ms ± 14.1 ms    x1.0
pure python + cython:      609 ms ± 6.83 ms    x1.5
cython version:            4.1 ms ± 55.3 µs    x231.0
A speed-up of 231 through using the lowly structs!
Listing python code:
import random

class particle:
    def __init__(self):
        self.x=random.uniform(1,99)
        self.y=random.uniform(1,99)
        self.pot=0

n=2000
parlist = [particle() for _ in range(n)]

def calcpot():
    for i in range(n):
        pot=0.0
        for j in range(i+1, n):
            dX=parlist[i].x-parlist[j].x
            dY=parlist[i].y-parlist[j].y
            pot=pot+1.0/(dX*dX+dY*dY)
        parlist[i].pot=pot
Listing cython code:
# call init_parlist prior to calcpot!
cdef struct particle:
    double x
    double y
    double pot

DEF n=2000  # known at compile time
cdef particle parlist[n]

import random

def init_parlist():
    for i in range(n):
        parlist[i].x=random.uniform(1,99)
        parlist[i].y=random.uniform(1,99)
        parlist[i].pot=0.0

def calcpot():
    cdef double pot,dX,dY
    cdef int i,j

    for i in range(n):
        pot=0.0
        for j in range(i+1, n):
            dX=parlist[i].x-parlist[j].x
            dY=parlist[i].y-parlist[j].y
            pot=pot+1.0/(dX*dX+dY*dY)
        parlist[i].pot=pot
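For completeness, a hypothetical usage sketch (the module name fastpot is my own placeholder, assuming the listing above is saved as fastpot.pyx and compiled, e.g. via pyximport):

import pyximport; pyximport.install()
import fastpot

fastpot.init_parlist()   # fill the global particle array with random coordinates
fastpot.calcpot()        # compute the pair potentials at C speed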
Instances of Python classes are Python objects, and better handled in Python (they are not C types, and I don't see any reason for creating some form of C representation for them within the Cython source). Also, global variables like n and parlist are better avoided (in this example they aren't necessary).
class particle:
    def __init__(self):
        self.x = uniform(1, 99)
        self.y = uniform(1, 99)
        self.pot = 0

def CalPot(parlist):
    N = len(parlist)
    for i in range(N):
        pot = 0
        for j in range(i, N):
            dx = parlist[j].x - parlist[i].x
            dy = parlist[j].y - parlist[j].y
            r2 = dx**2 + dy**2
            pot = pot + 4 * ep * r2 * ((sig / r2)**12 - (sig / r2)**6)
        parlist[i].pot = pot
So this Cython code happens to be pure Python.

How to write this in most efficient way

I have an array (N = 10^4) and I need to find the difference between each pair of entries (calculating a potential given the coordinates of the atoms).
Here is the code I am writing in pure Python, but it's really not efficient; can anyone tell me how to speed it up (using numpy or weave)? Here x, y are arrays of coordinates of atoms (just simple 1D arrays).
def potential(r):
    U = 4.*(np.power(r,-12) - np.power(r,-6))
    return U

def total_energy(x):
    E = 0.
    # need to speed up this part
    for i in range(N-1):
        for j in range(i):
            E += potential(np.sqrt((x[i]-x[j])**2))
    return E
First, you can use array arithmetic:
def potential(r):
    return 4.*(r**(-12) - r**(-6))

def total_energy(x):
    E = 0.
    for i in range(N-1):
        E += potential(np.sqrt((x[i]-x[:i])**2)).sum()
    return E
or you can test the fully vectorized version (here np.diag(x).cumsum(1) - x builds a matrix whose upper triangle holds the pairwise differences x[i] - x[j], and triu_indices_from picks out each such pair once):
def total_energy(x):
    b = np.diag(x).cumsum(1) - x
    return potential(abs(b[np.triu_indices_from(b, 1)])).sum()
I would recommend looking into scipy.spatial.distance. Using pdist in particular computes all pairwise distances of an array.
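As a quick illustration of what pdist returns (a toy example of my own, not part of the original answer): it produces a condensed 1-D vector with one entry per unordered pair (i, j) with i < j.

import numpy as np
import scipy.spatial.distance as sd

# Three points in 2-D; pdist returns the distances for the pairs (0,1), (0,2), (1,2)
pts = np.array([[0., 0.], [3., 4.], [0., 1.]])
print sd.pdist(pts)   # [ 5.          1.          4.24264069]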
I am assuming that you have an array that is of shape (Nx3), thus we need to slightly change your code:
def potential(r):
    U = 4.*(np.power(r,-12) - np.power(r,-6))
    return U

def total_energy(x):
    E = 0.
    # need to speed up this part
    for i in range(N):  # to N here
        for j in range(i):
            E += potential(np.sqrt(np.sum((x[i]-x[j])**2)))  # add sum here
    return E
Now let's rewrite this using scipy.spatial:
import scipy.spatial.distance as sd

def scipy_LJ(arr, sigma=None):
    """
    Computes the Lennard-Jones potential for an array (M x N) of M points
    in N dimensional space. Usage of a sigma parameter is optional.
    """
    if len(arr.shape) == 1:
        arr = arr[:, None]
    r = sd.pdist(arr)
    if sigma is None:
        np.power(r, -6, out=r)
        return np.sum(r**2 - r)*4
    else:
        r *= sigma
        np.power(r, -6, out=r)
        return np.sum(r**2 - r)*4
Let's run some tests:
N = 1000
points = np.random.rand(N,3)+0.1
np.allclose(total_energy(points), scipy_LJ(points))
Out[43]: True
%timeit total_energy(points)
1 loops, best of 3: 13.6 s per loop
%timeit scipy_LJ(points)
10 loops, best of 3: 24.3 ms per loop
Now it is ~500 times faster!
N = 10000
points = np.random.rand(N,3)+0.1
%timeit scipy_LJ(points)
1 loops, best of 3: 3.05 s per loop
This used ~2 GB of RAM.
Here is the final answer with some timings.
0) The Plain version (Really slow)
In [16]: %timeit total_energy(points)
1 loops, best of 3: 14.9 s per loop
1) SciPy version
In [9]: %timeit scipy_LJ(points)
10 loops, best of 3: 44 ms per loop
1-2) Numpy version
%timeit sum( potential(np.sqrt((x[i]-x[:i])**2 + (y[i]-y[:i])**2 + (z[i] - z[:i])**2)).sum() for i in range(N-1))
10 loops, best of 3: 126 ms per loop
2) Insanely fast Fortran version (! - means comment)
subroutine EnergyForces(Pos, PEnergy, Dim, NAtom)
    implicit none
    integer, intent(in) :: Dim, NAtom
    real(8), intent(in), dimension(0:NAtom-1, 0:Dim-1) :: Pos
    ! real(8), intent(in) :: L
    real(8), intent(out) :: PEnergy
    real(8), dimension(Dim) :: rij, Posi
    real(8) :: d2, id2, id6, id12
    real(8) :: rc2, Shift
    integer :: i, j

    PEnergy = 0.
    do i = 0, NAtom - 1
        ! store Pos(i,:) in a temporary array for faster access in j loop
        Posi = Pos(i,:)
        do j = i + 1, NAtom - 1
            rij = Pos(j,:) - Posi
            ! rij = rij - L * dnint(rij / L)
            ! compute only the squared distance and compare to squared cut
            d2 = sum(rij * rij)
            id2 = 1. / d2            ! inverse squared distance
            id6 = id2 * id2 * id2    ! inverse sixth distance
            id12 = id6 * id6         ! inverse twelfth distance
            PEnergy = PEnergy + 4. * (id12 - id6)
        enddo
    enddo
end subroutine
After calling it:
In [14]: %timeit ljlib.energyforces(points.transpose(), 3, N)
10000 loops, best of 3: 61 us per loop
3) Conclusion: Fortran is 1000 times faster than SciPy, 3000 times faster than NumPy, and millions of times faster than pure Python. That is because the SciPy version creates a matrix of differences and then analyzes it, whereas the Fortran version does everything on the fly.
Thank you for your help. Here is what I have found.
The shortest version
return sum( potential(np.sqrt((x[i]-x[:i])**2)).sum() for i in range(N-1))
The scipy version is also good.
The fastest version one may consider is to use the f2py program, i.e. write the slow bottleneck part in pure Fortran (which is insanely fast), compile it, and then plug it into your Python code as a library.
For example I have:
program_lj.f90
$gfortran -c program_lj.f90
If all the types are defined explicitly in the Fortran program, we are good to go.
$f2py -c -m program_lj program_lj.f90
After compilation, the only thing left is to call the program from Python.
In the Python program:
import program_lj
result = program_lj.subroutine_in_program(parameters)
In case you need a more general reference, please refer to the wonderful f2py webpage.

Find number of zeros before non-zero in a numpy array

I have a numpy array A. I would like to return the number of zeros before a non-zero in A in an efficient way as it is in a loop.
If A = np.array([0,1,2]) then np.nonzero(A)[0][0] returns 1. However if A = np.array([0,0,0]) this doesn't work (I would like the answer 3 in this case). And also if A is very big and the first non-zero is near the beginning this seems inefficient.
By adding a nonzero number at the end of the array, you can still use np.nonzero to get your desired outcome.
A = np.array([0,1,2])
B = np.array([0,0,0])
np.min(np.nonzero(np.hstack((A, 1)))) # --> 1
np.min(np.nonzero(np.hstack((B, 1)))) # --> 3
i = np.argmax(A!=0)
if i==0 and np.all(A==0): i=len(A)
This should be the most performant solution without extensions. Also easily vectorized to act along multiple axes.
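For example, a sketch of the same idea applied row-wise to a 2-D array (my own example, not from the original answer):

import numpy as np

A2 = np.array([[0, 0, 5],
               [1, 2, 3],
               [0, 0, 0]])
first_nonzero = np.argmax(A2 != 0, axis=1)                     # argmax gives 0 for all-zero rows too
counts = np.where(A2.any(axis=1), first_nonzero, A2.shape[1])  # so patch those rows with the row length
print counts   # [2 0 3]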
Here's an iterative Cython version, which may be your best bet if this is a serious bottleneck
# saved as file count_leading_zeros.pyx
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.int
ctypedef np.int_t DTYPE_t

@cython.boundscheck(False)
def count_leading_zeros(np.ndarray[DTYPE_t, ndim=1] a):
    cdef int elements = a.size
    cdef int i = 0
    cdef int count = 0
    while i < elements:
        if a[i] == 0:
            count += 1
        else:
            return count
        i += 1
    return count
This is similar to @mtrw's answer but with indexing at native speeds. My Cython is a bit sketchy so there may be further improvements to be made.
A quick test of an extremely favourable case with IPython with a few different methods
In [1]: import numpy as np
In [2]: import pyximport; pyximport.install()
Out[2]: (None, <pyximport.pyximport.PyxImporter at 0x53e9250>)
In [3]: import count_leading_zeros
In [4]: %paste
def count_leading_zeros_python(x):
    ctr = 0
    for k in x:
        if k == 0:
            ctr += 1
        else:
            return ctr
    return ctr
## -- End pasted text --
In [5]: a = np.zeros((10000000,), dtype=np.int)
In [6]: a[5] = 1
In [7]:
In [7]: %timeit np.min(np.nonzero(np.hstack((a, 1))))
10 loops, best of 3: 91.1 ms per loop
In [8]:
In [8]: %timeit np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
10 loops, best of 3: 107 ms per loop
In [9]:
In [9]: %timeit count_leading_zeros_python(a)
100000 loops, best of 3: 3.87 µs per loop
In [10]:
In [10]: %timeit count_leading_zeros.count_leading_zeros(a)
1000000 loops, best of 3: 489 ns per loop
However I'd only use something like this if I had evidence (with a profiler) that this was a bottleneck. Many things may seem inefficient but are never worth your time to fix.
What's wrong with the naive approach:
def countLeadingZeros(x):
    """ Count number of elements up to the first non-zero element, return that count """
    ctr = 0
    for k in x:
        if k == 0:
            ctr += 1
        else:  # short circuit evaluation, we found a non-zero so return immediately
            return ctr
    return ctr  # we get here in the case that x was all zeros
This returns as soon as a non-zero element is found, so it is O(n) in the worst case. You could make it faster by porting it to C, but it would be worth testing to see if that is really necessary for the arrays you're working with.
I am surprised that nobody has used np.where yet.
np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0] will do the trick
>>> a = np.array([0,1,2])
>>> np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
1
>>> a = np.array([0,0,0])
>>> np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
3
>>> a = np.array([1,2,3])
>>> np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
0
If you don't care about the speed, I have a small trick to do the job:
a = np.array([0,0,1,1,1])
t = np.where(a==0,1,0)+np.append(np.where(a==0,0,1),0)[1:]
print t
[1 2 1 1 0]
np.where(t==2)
(array([1]),)

Speeding up python code with cython

I have a function which basically just makes lots of calls to a simple, user-defined hash function and tests to see when it finds a duplicate. I need to do lots of simulations with it, so I would like it to be as fast as possible. I am attempting to use Cython to do this. The Cython code is currently called with a normal Python list of integers with values in the range 0 to m^2.
import math, random

cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls

def h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) % m

def floyd(inputx):
    dupefound, nohashcalls = (0,0)
    m = len(inputx)
    loops = int(m*math.log(m))
    for loopno in xrange(loops):
        if (dupefound == 1):
            break
        a = random.randrange(m)
        b = random.randrange(m)
        c = random.randrange(m)
        d = random.randrange(m)
        pos = random.randrange(m)
        value = inputx[pos]
        listofpos = [0] * m
        listofpos[pos] = 1
        setofvalues = set([value])
        cyclelimit = int(math.sqrt(m))
        for j in xrange(cyclelimit):
            pos = h3(a,b, c,d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of it is here.
Second, cython -a, for "annotate", gives you a really excellent breakdown of the code generated by the cython compiler and a color-coded indication of how dirty (read: Python API heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with Numpy explains how to get fast, C-style access to the Numpy array data. Unfortunately it's verbose and annoying. We're in luck however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True

# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np
import random

# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) % m

# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# @cython.boundscheck(False)

# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
    # Type the variables in the scope of the function.
    cdef int a,b,c,d, value, cyclelimit
    cdef unsigned int dupefound = 0
    cdef unsigned int nohashcalls = 0
    cdef unsigned int loopno, pos, j

    # `m` has type int because inputx is already a Cython memory view and
    # `infer_types` is on.
    m = inputx.shape[0]

    cdef unsigned int loops = int(m*math.log(m))

    # Again using the memory view, but letting Numpy allocate an array of zeros.
    cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

    # Keep this random sampling out of the loop
    cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

    for loopno in range(loops):
        if (dupefound == 1):
            break

        # From our precomputed array
        a = randoms[loopno, 0]
        b = randoms[loopno, 1]
        c = randoms[loopno, 2]
        d = randoms[loopno, 3]
        pos = randoms[loopno, 4]

        value = inputx[pos]

        # Unfortunately, Memory View does not support "vectorized" operations
        # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
        for j in range(m):
            listofpos[j] = 0

        listofpos[pos] = 1
        setofvalues = set((value,))
        cyclelimit = int(math.sqrt(m))

        for j in range(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])

    return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but it helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
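One practical note (my own addition, not part of the original answer): since floyd() now takes an int[:] memoryview, the plain Python list from the question has to be converted to a contiguous 32-bit integer array before the call. Assuming the rewritten function is compiled into the same edcython module as in the question, and a platform where C int is 32 bits:

import numpy as np
import random
import edcython

m = 5000
# np.int32 matches the C int expected by the int[:] parameter
inputx = np.array(random.sample(xrange(m**2), m), dtype=np.int32)
(dupefound, nohashcalls) = edcython.floyd(inputx)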
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]

cython numpy accumulate function

I need to implement a function for summing the elements of an array with a variable section length.
So,
a = np.arange(10)
section_lengths = np.array([3, 2, 4])
out = accumulate(a, section_lengths)
print out
array([ 3., 7., 35.])
I attempted an implementation in cython here:
https://gist.github.com/2784725
For performance I am comparing to the pure numpy solution for the case where the section_lengths are all the same:
LEN = 10000
b = np.ones(LEN, dtype=np.int) * 2000
a = np.arange(np.sum(b), dtype=np.double)
out = np.zeros(LEN, dtype=np.double)
%timeit np.sum(a.reshape(-1,2000), axis=1)
10 loops, best of 3: 25.1 ms per loop
%timeit accumulate.accumulate(a, b, out)
10 loops, best of 3: 64.6 ms per loop
Would you have any suggestions for improving performance?
You might try some of the following:
In addition to the @cython.boundscheck(False) compiler directive, also try adding @cython.wraparound(False)
In your setup.py script, try adding in some optimization flags:
ext_modules = [Extension("accumulate", ["accumulate.pyx"], extra_compile_args=["-O3",])]
Take a look at the .html file generated by cython -a accumulate.pyx to see if there are sections that are missing static typing or relying heavily on Python C-API calls:
http://docs.cython.org/src/quickstart/cythonize.html#determining-where-to-add-types
Add a return statement at the end of the method. Currently it is doing a bunch of unnecessary error checking in your tight loop at i_el += 1.
Not sure if it will make a difference but I tend to make loop counters cdef unsigned int rather than just int
You also might compare your code to numpy when section_lengths are unequal, since it will probably require a bit more than just a simple sum; a plain NumPy baseline for that case is sketched below.
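For that comparison, one such baseline could be np.add.reduceat - this sketch is my own addition, not from the original answers, and it reproduces the out array from the question (note that reduceat's last segment runs to the end of the array, which is what the question's expected output assumes):

import numpy as np

a = np.arange(10, dtype=np.double)
section_lengths = np.array([3, 2, 4])
# reduceat needs the start index of each section
starts = np.concatenate(([0], np.cumsum(section_lengths)[:-1]))
print np.add.reduceat(a, starts)   # [  3.   7.  35.]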
Updating out[i_bas] in the nested for loop is slow; you can create a temporary variable to do the accumulation and update out[i_bas] when the nested for loop has finished. The following code will be as fast as the numpy version:
import numpy as np
cimport numpy as np

ctypedef np.int_t DTYPE_int_t
ctypedef np.double_t DTYPE_double_t

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def accumulate(
        np.ndarray[DTYPE_double_t, ndim=1] a not None,
        np.ndarray[DTYPE_int_t, ndim=1] section_lengths not None,
        np.ndarray[DTYPE_double_t, ndim=1] out not None,
        ):
    cdef int i_el, i_bas, sec_length, lenout
    cdef double tmp
    lenout = out.shape[0]
    i_el = 0
    for i_bas in range(lenout):
        tmp = 0
        for sec_length in range(section_lengths[i_bas]):
            tmp += a[i_el]
            i_el += 1
        out[i_bas] = tmp
