I am working in Cython. How can I declare a C array of Python class instances and then pass the array to a Python function and work on it?
    cdef int n=100

    class particle:
        def __init__(self):
            self.x=uniform(1,99)
            self.y=uniform(1,99)
            self.pot=0

    cdef particle parlist[n]

    def CalPot(parlist[]):
        for i in range(N-1):
            pot=0
            for j in range(i,N):
                dx=parlist[j].x-parlist[i].x
                dy=parlist[j].y-parlist[j].y
                r2=dx**2+dy**2
                pot=pot+4*ep*r2*((sig/r2)**12 - (sig/r2)**6)
            parlist[i].pot=pot
As Ioannis and DavidW told you, you should not create a C array of Python objects; use a Python list instead.

Cythonizing the resulting pure Python code would bring a speed-up of about a factor of 2, because Cython would cut out the interpreter. However, there is much more potential if you also get rid of reference counting and dynamic dispatch - speed-ups of up to a factor of 100 are pretty common. Some time ago I answered a question illustrating this.

What should you do to get this speed-up? You need to replace Python multiplications with "bare metal" C multiplications.

First step: Don't use a (Python) class for particle, use a simple C struct - it is just a collection of data, nothing more, nothing less:
    cdef struct particle:
        double x
        double y
        double pot
First benefit: It is possible to define a global C array of these structs (whether that is a smart thing to do in a bigger project is another question):
    DEF n=2000  # known at compile time
    cdef particle parlist[n]
After the initialization of the array (for more details see the attached listings), we can use it in our calcpot function (I slightly changed your definition):
    def calcpot():
        cdef double pot,dX,dY
        cdef int i,j
        for i in range(n):
            pot=0.0
            for j in range(i+1, n):
                dX=parlist[i].x-parlist[j].x
                dY=parlist[i].y-parlist[j].y
                pot=pot+1.0/(dX*dX+dY*dY)
            parlist[i].pot=pot
The main difference to the original code: parlist[i].x and co. are no longer slow Python objects but simple, fast doubles. There are a lot of subtle things to consider in order to get the maximal speed-up - one really should read (and reread) the Cython documentation.

Was the trouble worth it? Here are the timings (via %timeit calcpot()) on my machine:
                                   Time                 Speed-up
    pure python + interpreter:     924 ms ± 14.1 ms     x1.0
    pure python + cython:          609 ms ± 6.83 ms     x1.5
    cython version:                4.1 ms ± 55.3 µs     x231.0
A speed-up of 231 just from using lowly structs!

Listing of the Python code:
    import random

    class particle:
        def __init__(self):
            self.x=random.uniform(1,99)
            self.y=random.uniform(1,99)
            self.pot=0

    n=2000
    parlist = [particle() for _ in range(n)]

    def calcpot():
        for i in range(n):
            pot=0.0
            for j in range(i+1, n):
                dX=parlist[i].x-parlist[j].x
                dY=parlist[i].y-parlist[j].y
                pot=pot+1.0/(dX*dX+dY*dY)
            parlist[i].pot=pot
Listing of the Cython code:
    # call init_parlist prior to calcpot!

    cdef struct particle:
        double x
        double y
        double pot

    DEF n=2000  # known at compile time
    cdef particle parlist[n]

    import random

    def init_parlist():
        for i in range(n):
            parlist[i].x=random.uniform(1,99)
            parlist[i].y=random.uniform(1,99)
            parlist[i].pot=0.0

    def calcpot():
        cdef double pot,dX,dY
        cdef int i,j
        for i in range(n):
            pot=0.0
            for j in range(i+1, n):
                dX=parlist[i].x-parlist[j].x
                dY=parlist[i].y-parlist[j].y
                pot=pot+1.0/(dX*dX+dY*dY)
            parlist[i].pot=pot
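A minimal usage sketch, assuming the Cython listing has been compiled (e.g. with cythonize) into a module; the module name fastpot is an assumption, not part of the listing:

    import fastpot            # compiled from the Cython listing above (name assumed)

    fastpot.init_parlist()    # fill the global C array first
    fastpot.calcpot()         # then compute the potentials
    # in IPython: %timeit fastpot.calcpot()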
Instances of Python classes are Python objects, and are better handled in Python (they are not C types, and I don't see any reason for creating some form of C representation for them within the Cython source). Also, global variables like n and parlist are better avoided (in this example they aren't necessary).
    class particle:
        def __init__(self):
            self.x = uniform(1, 99)
            self.y = uniform(1, 99)
            self.pot = 0

    def CalPot(parlist):
        N = len(parlist)
        for i in range(N):
            pot = 0
            for j in range(i + 1, N):
                dx = parlist[j].x - parlist[i].x
                dy = parlist[j].y - parlist[i].y
                r2 = dx**2 + dy**2
                pot = pot + 4 * ep * r2 * ((sig / r2)**12 - (sig / r2)**6)
            parlist[i].pot = pot
So this Cython code happens to be pure Python.
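A minimal usage sketch: uniform comes from the random module, and ep, sig and the particle count are assumed values, since they are not given in the question:

    from random import uniform

    ep, sig = 1.0, 1.0                          # assumed Lennard-Jones parameters
    parlist = [particle() for _ in range(100)]  # plain Python list of instances
    CalPot(parlist)
    print(parlist[0].pot)                       # each particle now carries its potential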
Related
I am still at the beginning of understanding how Cython works.
This snippet shows one of the slow parts of my program, and I am wondering whether this for loop can be improved.
It still pretty much looks like the original NumPy version, but I added the cdefs and the int conversion.
    cdef Py_ssize_t i, j
    cdef double ii, jj

    for ii in np.arange(startx, endx+1, 0.1):
        for jj in np.arange(starty, endy+1, 0.1):
            if my_condition(ii, jj):
                i = <int>ii
                j = <int>jj
                data[i, j] += 1
Do you have any recommendations?
Study the cython example in
https://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
That uses nditer to hand out the array elements.
And
https://cython.readthedocs.io/en/stable/src/userguide/memoryviews.html
which demonstrates the use of memoryviews (and C arrays) to rapidly iterate over the values.
Either way, your goal is to let Cython access the data buffer directly rather than by way of the NumPy functions.
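For instance, a minimal sketch along those lines - assuming data is a C-contiguous 2-D array of a matching integer dtype (e.g. np.int64 for long), and that my_condition from the question is available, ideally as a cheap cdef function; count_hits is just a name chosen for this illustration:

    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def count_hits(long[:, ::1] data, double startx, double endx,
                   double starty, double endy):
        # Loop over integer step counts instead of the float values handed out
        # by np.arange; the typed memoryview gives direct buffer access.
        cdef Py_ssize_t ki, kj
        cdef Py_ssize_t ni = <Py_ssize_t>((endx + 1 - startx) / 0.1)
        cdef Py_ssize_t nj = <Py_ssize_t>((endy + 1 - starty) / 0.1)
        cdef double ii, jj
        for ki in range(ni):
            ii = startx + 0.1 * ki
            for kj in range(nj):
                jj = starty + 0.1 * kj
                if my_condition(ii, jj):   # assumed to be available as a cdef function
                    data[<int>ii, <int>jj] += 1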
I know this is an old question, but I had the same problem and found this simple solution in https://github.com/cython/cython/issues/3310#issuecomment-707252866: change a loop of the form
    for i in range(start, stop, step):
        ...
to this:
    i = start - step
    for _ in range(((stop - start) + (step - 1))//step):
        i += step
        ...
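For example, with hypothetical values start=0, stop=10, step=3, both forms visit 0, 3, 6, 9:

    i = 0 - 3
    for _ in range(((10 - 0) + (3 - 1)) // 3):  # 4 iterations
        i += 3
        # i takes the values 0, 3, 6, 9, just like range(0, 10, 3)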
I have a large numpy array k, of unspecified shape, and I want to construct an identically shaped array d which is 1.0 when the corresponding entry in k is between two constants lo and hi, 0.0 otherwise. (Because of what the larger code is doing, I do not want a Boolean-valued array.)
The obvious way to do this is
    d = np.ones_like(k)
    d[np.less(k, lo)] = 0
    d[np.greater(k, hi)] = 0
However, the np.less and np.greater calls involve the creation of large scratch Boolean arrays, and I have measured this to be a significant overhead. Is there a way to perform this operation that does not involve creating any large scratch objects, while remaining fully vectorized?
As others have said, numpy is heavy on temporary buffers and does not offer much control over them. If memory footprint is really a blocker, you can drop in your own little routine. For instance,
    import numpy as np

    def process(x, lo, hi):
        """ lo <= x < hi ? 1.0 : 0.0."""
        x_shape = x.shape
        xx = np.ascontiguousarray(x).ravel()
        out = np.empty_like(xx)
        _process(xx, lo, hi, out)
        return out.reshape(x_shape)
where _process is in cython:
    %%cython --annotate
    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def _process(double[::1] x, double lo, double hi, double[::1] out):
        """ lo <= x < hi ? 1.0 : 0.0."""
        cdef:
            Py_ssize_t j
            double xj
        for j in range(x.shape[0]):
            xj = x[j]
            if lo <= xj < hi:
                out[j] = 1.0
            else:
                out[j] = 0.0
Here I used a Jupyter notebook (hence the %%cython cell magic). In a real project you need to throw in a setup.py to compile the extension, etc. Whether the benefit is worth the hassle is up to you.
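For reference, a minimal setup.py along those lines might look like this; the file name process_ext.pyx is an assumption:

    # setup.py -- minimal sketch; build with `python setup.py build_ext --inplace`
    from setuptools import setup
    from Cython.Build import cythonize

    setup(ext_modules=cythonize("process_ext.pyx"))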
You can create a boolean array based on the comparison and then convert it to float type, all in one go, like so:

    d = ((k >= lo) & (k <= hi)).astype(float)
less and greater take an out parameter:
    out = np.ones_like(k)
    np.less(k, 80, out=out)
    # combine in place; `out &= np.greater(k, 20)` would fail on a float array
    np.logical_and(np.greater(k, 20), out, out=out)
That might end up saving one intermediate array, although my impression is that a ufunc with out still creates a temporary array and then copies it into out.

On a small (10x10) array, this is faster than @zwol's method, but slower than @Divakar's. The differences are not major, though.
I wrote a Python script to perform box covering on a graph, but it takes more than a minute when I run it on small graphs (100 nodes).
Today someone recommended Cython to improve its efficiency, so I followed this guide to adapt the code that I had.
Running the Python code, the results were:
    In [6]: %timeit test.test()
    1000 loops, best of 3: 1.88 ms per loop
After following the guide the results were:
    In [7]: %timeit c_test.test()
    1000 loops, best of 3: 1.05 ms per loop
The performance is better, but I am sure there is a lot that can still be improved. Given that I just met Cython today, I want to ask how I can improve this code:
    import random as rnd
    import numpy as np
    cimport cython
    cimport numpy as np

    DTYPE = np.int
    ctypedef np.int_t DTYPE_t

    def choose_color(not_valid_colors, valid_colors):
        possible_values = list(valid_colors - not_valid_colors)
        if possible_values:
            return rnd.choice(possible_values)
        else:
            return max(valid_colors.union(not_valid_colors)) + 1

    @cython.boundscheck(False)
    cdef np.ndarray[DTYPE_t, ndim=2] greedy_coloring(np.ndarray[DTYPE_t, ndim=2] distances, int num_nodes, int diameter):
        cdef int i, lb, j
        cdef np.ndarray[DTYPE_t, ndim=2] c = np.empty((num_nodes+1, diameter+2), dtype=DTYPE)
        c.fill(-1)

        # Matrix C will not use the 0 column and 0 row to
        # let the algorithm look very similar to the paper
        # pseudo-code
        nodes = list(range(1, num_nodes+1))
        rnd.shuffle(nodes)

        c[nodes[0], :] = 0

        # Algorithm
        for i in nodes[1:]:
            for lb in range(2, diameter+1):
                not_valid_colors = set()
                valid_colors = set()
                for j in nodes[:i]:
                    if distances[i-1, j-1] >= lb:
                        not_valid_colors.add(c[j, lb])
                    else:
                        valid_colors.add(c[j, lb])
                c[i, lb] = choose_color(not_valid_colors, valid_colors)

        return c

    def test():
        distances = np.matrix('0 3 2 4 1 1; \
                               3 0 1 1 3 2; \
                               2 1 0 2 2 1; \
                               4 1 2 0 4 3; \
                               1 3 2 4 0 1; \
                               1 2 1 3 1 0')
        c = greedy_coloring(distances, 6, 4)
In Cython, you get more speed the more Python calls you remove from inside your Cython functions.

For example, skimming through your code, you are calling choose_color() inside the nested loop in greedy_coloring(). Since it is called repeatedly, it brings a lot of overhead; it should be typed as well, along with the variables defined inside it.

You can run cython with the -a option (e.g., cython -a file.pyx) to generate an annotated HTML file that visually shows which parts of your code make Python calls (yellow lines). This helps a lot when improving Cython code.

I'm sorry for the lack of specific pointers - I hope this is helpful.
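For instance, a rough sketch of what typing choose_color might look like (it reuses rnd from your module; exception declarations and further typing are left out, so this is an illustration rather than a drop-in replacement):

    # Typed version of choose_color: `cdef` removes the Python-level call
    # overhead, and typing the arguments as the builtin `set` lets Cython
    # skip some dynamic dispatch.
    cdef int choose_color(set not_valid_colors, set valid_colors):
        possible_values = list(valid_colors - not_valid_colors)
        if possible_values:
            return rnd.choice(possible_values)
        return max(valid_colors.union(not_valid_colors)) + 1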
I have a function which basically makes lots of calls to a simple, user-defined hash function and tests to see when it finds a duplicate. I need to run lots of simulations with it, so I would like it to be as fast as possible. I am attempting to use Cython to do this. The Cython code is currently called with a normal Python list of integers with values in the range 0 to m^2.
    import math, random

    cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls

    def h3(int a,int b,int c,int d, int m,int x):
        return (a*x**2 + b*x+c) % m

    def floyd(inputx):
        dupefound, nohashcalls = (0,0)
        m = len(inputx)
        loops = int(m*math.log(m))
        for loopno in xrange(loops):
            if (dupefound == 1):
                break
            a = random.randrange(m)
            b = random.randrange(m)
            c = random.randrange(m)
            d = random.randrange(m)
            pos = random.randrange(m)
            value = inputx[pos]
            listofpos = [0] * m
            listofpos[pos] = 1
            setofvalues = set([value])
            cyclelimit = int(math.sqrt(m))
            for j in xrange(cyclelimit):
                pos = h3(a,b, c,d, m, inputx[pos])
                nohashcalls += 1
                if (inputx[pos] in setofvalues):
                    if (listofpos[pos]==1):
                        dupefound = 0
                    else:
                        dupefound = 1
                        print "Duplicate found at position", pos, " and value", inputx[pos]
                    break
                listofpos[pos] = 1
                setofvalues.add(inputx[pos])
        return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
    m = 5000
    inputx = random.sample(xrange(m**2), m)
    (dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of this is here.

Second, cython -a, for "annotate", gives you a really excellent breakdown of the code generated by the Cython compiler and a color-coded indication of how dirty (read: Python API heavy) it is. This output is really essential when trying to optimize anything.

Third, the now famous page on working with Numpy explains how to get fast, C-style access to the Numpy array data. Unfortunately it's verbose and annoying. We're in luck, however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.

After ten minutes or so I came up with this:
    # cython: infer_types=True

    # Use the C math library to avoid Python overhead.
    from libc cimport math
    # For boundscheck below.
    import cython
    # We're lazy so we'll let Numpy handle our array memory management.
    import numpy as np
    # You would normally also import the Numpy pxd to get faster access to the Numpy
    # API, but it requires some fancier compilation options so I'll leave it out for
    # this demo.
    # cimport numpy as np

    import random

    # This is a small function that doesn't need to be exposed to Python at all. Use
    # `cdef` instead of `def` and inline it.
    cdef inline int h3(int a,int b,int c,int d, int m,int x):
        return (a*x**2 + b*x+c) % m

    # If we want to live fast and dangerously, we tell cython not to check our array
    # indices for IndexErrors. This means we CAN overrun our array and crash the
    # program or screw up our stack. Use with caution. Profiling suggests that we
    # aren't gaining anything in this case so I leave it on for safety.
    # @cython.boundscheck(False)

    # `cpdef` so that calling this function from another Cython (or C) function can
    # skip the Python function call overhead, while still allowing us to use it from
    # Python.
    cpdef floyd(int[:] inputx):
        # Type the variables in the scope of the function.
        cdef int a,b,c,d, value, cyclelimit
        cdef unsigned int dupefound = 0
        cdef unsigned int nohashcalls = 0
        cdef unsigned int loopno, pos, j

        # `m` has type int because inputx is already a Cython memory view and
        # `infer-types` is on.
        m = inputx.shape[0]

        cdef unsigned int loops = int(m*math.log(m))

        # Again using the memory view, but letting Numpy allocate an array of zeros.
        cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

        # Keep this random sampling out of the loop
        cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

        for loopno in range(loops):
            if (dupefound == 1):
                break

            # From our precomputed array
            a = randoms[loopno, 0]
            b = randoms[loopno, 1]
            c = randoms[loopno, 2]
            d = randoms[loopno, 3]
            pos = randoms[loopno, 4]

            value = inputx[pos]

            # Unfortunately, Memory View does not support "vectorized" operations
            # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
            for j in range(m):
                listofpos[j] = 0

            listofpos[pos] = 1
            setofvalues = set((value,))

            cyclelimit = int(math.sqrt(m))

            for j in range(cyclelimit):
                pos = h3(a, b, c, d, m, inputx[pos])
                nohashcalls += 1

                if (inputx[pos] in setofvalues):
                    if (listofpos[pos]==1):
                        dupefound = 0
                    else:
                        dupefound = 1
                        print "Duplicate found at position", pos, " and value", inputx[pos]
                    break

                listofpos[pos] = 1
                setofvalues.add(inputx[pos])

        return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but it helps to see it all come together.

The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.

As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
    from collections import Counter

    cnt = Counter(inputx)
    dupes = [k for k, v in cnt.iteritems() if v > 1]
I use numexpr for fast math on large arrays, but if the size of the array is smaller than the CPU cache, writing my code in Cython using simple array math is way faster, especially if the function is called multiple times.

The issue is: how do you work with arrays in Cython, or, more explicitly, is there a direct interface to Python's array.array type in Cython? What I would like to do is something like this (simple example):
    cpdef array[double] running_sum(array[double] arr):
        cdef int i
        cdef int n = len(arr)
        cdef array[double] out = new_array_zeros(1.0, n)
        ...  # some error checks
        out[0] = arr[0]
        for i in xrange(1,n-1):
            out[i] = out[i-1] + arr[i]
        return(out)
I first tried using the Cython NumPy wrapper and worked with ndarrays, but it seems that creating them is very costly for small 1D arrays compared with creating a C array with malloc (though memory handling becomes a pain).
Thanks!
You can roll your own simple version with basic functions and checks. Here is a mockup to start:
    from libc.stdlib cimport malloc, free

    cdef class SimpleArray:
        cdef double * handle
        cdef public int length

        def __cinit__(SimpleArray self, int n):
            self.handle = <double*>malloc(n * sizeof(double))
            self.length = n

        def __getitem__(self, int idx):
            if idx < self.length:
                return self.handle[idx]
            raise ValueError("Invalid Idx")

        def __setitem__(self, int idx, double value):
            if idx >= self.length:
                raise ValueError("Invalid Idx")
            self.handle[idx] = value

        def __dealloc__(SimpleArray self):
            free(self.handle)

    cpdef SimpleArray running_sum(SimpleArray arr):
        cdef int i
        cdef SimpleArray out = SimpleArray(arr.length)

        out.handle[0] = arr.handle[0]
        for i in range(1, arr.length):
            out.handle[i] = out.handle[i-1] + arr.handle[i]
        return out
can be used as
    >>> import test
    >>> simple = test.SimpleArray(100)
    >>> del simple
    >>> test.running_sum(test.SimpleArray(100))
    <test.SimpleArray object at 0x1002a90b0>
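Using the __setitem__ from the listing above, filling a small array and reading back the running sum could look like this (values chosen purely for illustration):

    >>> arr = test.SimpleArray(4)
    >>> for i in range(4): arr[i] = 1.0
    ...
    >>> out = test.running_sum(arr)
    >>> [out[i] for i in range(out.length)]
    [1.0, 2.0, 3.0, 4.0]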