I'm seeing some bizarre behavior in Cython code. I'm writing code to compute a forward Kalman filter, but I have a state transition model with many 0s in it, so it would be nice to be able to calculate only certain elements of the covariance matrices.
To test this out, I wanted to fill individual array elements using Cython. To my surprise, I found that:
1. writing output to specific array locations (function fill(...)) is very slow compared to assigning the result to a throwaway scalar variable every time (function nofill(...)), and
2. setting C=0.1 or C=31 does not affect how long nofill(...) takes to run, but the latter choice makes fill(...) run about 2x slower.
This is baffling to me. Can anyone explain why I'm seeing this?
Code:
################# file way_too_slow.pyx
from libc.math cimport sin

# Setting C=0.1 or 31 doesn't affect the performance of calling nofill(...),
# but it makes fill(...) slower. I have no clue why.
cdef double C = 0.1

# This function just throws away its output.
def nofill(double[::1] x, double[::1] y, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double d
    with nogil:
        for 0 <= i < N:
            d = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)  # C appears here
# The same function, but it keeps its output.
# However: MUCH slower than nofill(...) (point 1 above).
def fill(double[::1] x, double[::1] y, double[::1] out, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double *p_o = &out[0]
    cdef double d
    with nogil:
        for 0 <= i < N:
            p_o[i] = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)  # C appears here
The above code is called by the following Python program:
#################### run_way_too_slow.py
import numpy as _N
import way_too_slow as _wts
import time as _tm

N = 80000
x = _N.random.randn(N)
y = _N.random.randn(N)
out = _N.empty(N)
t1 = _tm.time()
_wts.nofill(x, y, N)
t2 = _tm.time()
_wts.fill(x, y, out, N)
t3 = _tm.time()
print "nofill() ET: %.3e" % (t2-t1)
print "fill() ET: %.3e" % (t3-t2)
print "fill() is slower by factor %.3f" % ((t3-t2)/(t2-t1))
The Cython code was compiled using this setup.py file:
################# setup.py
from distutils.core import setup
from distutils.sysconfig import get_python_inc
from distutils.extension import Extension
from Cython.Distutils import build_ext

incdir = [get_python_inc(plat_specific=1)]
libdir = ['/usr/local/lib']
cmdclass = {'build_ext': build_ext}
ext_modules = Extension("way_too_slow",
                        ["way_too_slow.pyx"],
                        include_dirs=incdir,  # include_dirs for Mac
                        library_dirs=libdir)

setup(
    name="way_too_slow",
    cmdclass=cmdclass,
    ext_modules=[ext_modules]
)
Here is a typical output of running "run_way_too_slow.py" using C=0.1:
>>> exf("run_way_too_slow.py")
nofill() ET: 6.700e-05
fill() ET: 6.409e-04
fill() is slower by factor 9.566
A typical run with C=31:
>>> exf("run_way_too_slow.py")
nofill() ET: 6.795e-05
fill() ET: 1.566e-03
fill() is slower by factor 23.046
As we can see:
1. Assigning into a specified array location is quite slow compared to assigning to a double.
2. For some reason, the assignment speed seems to depend on what operation was done in the calculation - which makes no sense to me.
Any insight would be greatly appreciated.
Two things explain your observations:
A: In the first version nothing actually happens: the C compiler is clever enough to see that the whole loop has no effect at all outside of the function, and optimizes it away. To force the execution, you must make the result d visible outside the function, for example via:
cdef double d=0
....
d+=....
return d
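A minimal complete version of nofill along these lines (same signature as the original; the only change is accumulating into d and returning it, so the loop has an observable effect):
def nofill(double[::1] x, double[::1] y, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double d = 0.0
    with nogil:
        for 0 <= i < N:
            # accumulate instead of overwriting, so the compiler cannot drop the loop
            d += ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)
    return d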
It might still be faster than the write-to-an-array version, because of the fewer costly memory accesses - but now you will see a slowdown when changing the value of C.
B: sin is a complicated function, and how long it takes to calculate depends on its argument. For example, for very small arguments the argument itself can be returned, but for bigger arguments much longer Taylor series must be evaluated. Here is one example of the cost of tanh depending on the value of its argument; like sin, it is calculated via different approximations/Taylor series, and the most important point is that the time needed depends on the argument.
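You can observe this from plain Python (a rough sketch; the exact numbers - and even whether a given magnitude hits a slow path - depend on the platform's libm):
import timeit
# huge arguments like 1e22 force expensive argument reduction on many libm implementations
for arg in (0.1, 31.0, 1e22):
    t = timeit.timeit("sin(%r)" % arg, setup="from math import sin", number=10**6)
    print("sin(%g): %.3f s per 1e6 calls" % (arg, t))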
Related
I tried to compare the performance of the Python and ctypes versions of a sum function, and I found that the Python version is faster than the ctypes version.
Sum.c file:
int our_function(int num_numbers, int *numbers) {
    int i;
    int sum = 0;
    for (i = 0; i < num_numbers; i++) {
        sum += numbers[i];
    }
    return sum;
}

int our_function2(int num1, int num2) {
    return num1 + num2;
}
I compiled it to a shared library:
gcc -shared -o Sum.so sum.c
Then I imported both the shared library and ctypes to use the C function. Below is Sum.py:
import ctypes

_sum = ctypes.CDLL('./Sum.so')
_sum.our_function.argtypes = (ctypes.c_int, ctypes.POINTER(ctypes.c_int))

def our_function_c(numbers):
    global _sum
    num_numbers = len(numbers)
    array_type = ctypes.c_int * num_numbers
    result = _sum.our_function(ctypes.c_int(num_numbers), array_type(*numbers))
    return int(result)

def our_function_py(numbers):
    sum = 0
    for i in numbers:
        sum += i
    return sum
import time
start = time.time()
print(our_function_c([1, 2, 3]))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py([1, 2, 3]))
end1 = time.time()
print("time taken py", end1-start1)
Output:
6
time taken C 0.0010006427764892578
6
time taken py 0.0
For a larger list, like list(range(int(1e5))):
start = time.time()
print(our_function_c(list(range(int(1e5)))))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(list(range(int(1e5)))))
end1 = time.time()
print("time taken py", end1-start1)
Output:
704982704
time taken C 0.011005163192749023
4999950000
time taken py 0.00500178337097168
(Note that the two results differ: the sum of range(int(1e5)) is 4999950000, which overflows the 32-bit int used by the C function and wraps to 704982704.)
Question: I tried to use more numbers, but Python still beats ctypes in terms of performance. So my question is: is there a rule of thumb for when I should move to ctypes over Python (in terms of the order of magnitude of the data)? Also, what is the cost of converting Python values to ctypes?
Why
Well, yes - in such a case it is not really worth it, because before calling the C function you spend a lot of time converting the numbers into c_int, and each such conversion is no less expensive than an addition.
Usually we use ctypes when either the data are generated on the C side, or when we generate them from Python but then use them for more than one simple operation.
Same with pandas
This is, for example, what happens with numpy and pandas: two well-known examples of libraries written in C (or compiled, anyway) that allow huge time gains (on the order of 1000x), as long as the data doesn't go back and forth between C space and Python space.
Numpy is faster than a list for many operations, for example - as long as you don't count a data conversion for each atomic operation.
Pandas often works with data read from CSV by pandas itself, so the data stays in pandas space.
import time
import pandas as pd

lst = list(range(1000000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
df = pd.DataFrame({'x': lst})
middle2 = time.time()
s2 = df.x.sum()
end2 = time.time()

print("python", s1, "t=", end1-start1)
print("pandas", s2, "t=", end2-start2, end2-middle2)
python 499999500000 t= 0.13175106048583984
pandas 499999500000 t= 0.35060644149780273 0.0020313262939453125
As you can see, by this standard pandas is also way slower than pure Python - but way faster if you don't count the data creation.
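The same split shows up with numpy alone (a sketch in the same spirit; timings are machine-dependent):
import time
import numpy as np

lst = list(range(1000000))

start = time.time()
arr = np.array(lst)   # conversion: Python list -> ndarray
middle = time.time()
s = arr.sum()         # the actual work, done in compiled code
end = time.time()

print("numpy", s, "t=", end - start, end - middle)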
Faster without data conversion
Try to run your code this way (reusing _sum and our_function_py from above):
import time

lst = list(range(1000))*1000
c_lst = (ctypes.c_int * len(lst))(*lst)  # conversion done once, outside the timed section
c_num = ctypes.c_int(len(lst))
start = time.time()
print(int(_sum.our_function(c_num, c_lst)))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(lst))
end1 = time.time()
print("time taken py", end1-start1)
And the C code is way faster.
So, as with pandas, it isn't worth it if all you really need from it is one summation, after which you forget the data.
No such problem with c-extension
Note that with a Python C extension, which allows C functions to handle Python types directly, you don't have this problem (though it is often less efficient, because, well, Python lists are not just the int * that C loves - but at least you don't need a conversion to C driven from Python).
That is why you may sometimes see libraries for which, even counting the conversion, using the external library is faster.
import numpy as np
np.array(lst).sum()
for example, is slightly faster - but only slightly, when we are used to numpy being 1000x faster. That is because numpy.array can help itself directly to the Python list's data. But that is not pure ctypes (by ctypes, I mean "using C functions from the C world, handling C data, not caring about Python at all"). Plus, I am not even sure that this is the only reason: numpy might be "cheating" by using several threads and vectorization, which neither Python nor your C code does.
Example that needs no big data conversion
So let's add another example; add this to your C file:
int sumFirst(int n){
    int s = 0;
    for(int i = 0; i < n; i++){
        s += i;
    }
    return s;
}
And try it with:
import ctypes
_sum = ctypes.CDLL('./ctypeBench.so')
_sum.sumFirst.argtypes = (ctypes.c_int,)

def c_sumFirst(n):
    return _sum.sumFirst(ctypes.c_int(n))

import time

lst = list(range(10000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
s2 = c_sumFirst(10000)
end2 = time.time()

print(f"python {s1=}, Δt={end1-start1}")
print(f"c {s2=}, Δt={end2-start2}")
The result is:
python s1=49995000, Δt=0.0012884140014648438
c s2=49995000, Δt=4.267692565917969e-05
And note that I was fair to Python: I did not count the data generation in its time (I explicitly materialized the range into a list beforehand - which doesn't change much anyway).
So the conclusion is: you can't expect a ctypes function to gain time on a single operation per datum, such as +, when you need one conversion per datum to use it.
Either you need to use a C extension and write ad-hoc code that handles a Python list (and even there, you won't gain much if you have just one addition to do per value);
or you need to keep the data on the C side, creating it from C and leaving it there (as you do with pandas or numpy: you work on DataFrames or ndarrays, as much as possible with pandas/numpy functions and operators, without pulling everything back into Python with full indexation or .iloc);
or you need to have really more than one operation to do per datum, as the sketch below illustrates.
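To illustrate that last point: if the conversion is done once and the converted array is then reused for many C calls, the one-off conversion cost is amortized (a sketch reusing our_function and Sum.so from above; the repetition count of 100 is arbitrary):
import ctypes
import time

_sum = ctypes.CDLL('./Sum.so')
_sum.our_function.argtypes = (ctypes.c_int, ctypes.POINTER(ctypes.c_int))

lst = list(range(100000))
c_arr = (ctypes.c_int * len(lst))(*lst)  # convert once
c_n = ctypes.c_int(len(lst))

start = time.time()
for _ in range(100):  # reuse the converted data many times
    _sum.our_function(c_n, c_arr)
end = time.time()
print("100 C calls on pre-converted data:", end - start)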
Addendum: c-extension
Just to add another argument in favor of "the problem is the conversion", but also to make explicit what to do if you really need one simple operation on a list and don't want to convert every element first, you can try this:
modmy.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#define PY3K

static PyObject *mysum(PyObject *self, PyObject *args){
    PyObject *list;
    PyArg_ParseTuple(args, "O", &list);
    PyObject *it = PyObject_GetIter(list);
    long int s = 0;
    for(;;){
        PyObject *v = PyIter_Next(it);
        if(!v) break;
        long int iv = PyLong_AsLong(v);
        Py_DECREF(v);  /* release the item reference */
        s += iv;
    }
    Py_DECREF(it);
    return PyLong_FromLong(s);
}

static PyMethodDef MyMethods[] = {
    {"mysum", mysum, METH_VARARGS, "Sum list"},
    {NULL, NULL, 0, NULL} /* Sentinel */
};

static struct PyModuleDef modmy = {
    PyModuleDef_HEAD_INIT,
    "modmy",
    NULL,
    -1,
    MyMethods
};

PyMODINIT_FUNC
PyInit_modmy()
{
    return PyModule_Create(&modmy);
}
Compile with
gcc -fPIC `python3-config --cflags` -c modmy.c
gcc -shared -fPIC `python3-config --ldflags` -o modmy.so modmy.o
Then
import time
import modmy

lst = list(range(10000000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
s2 = modmy.mysum(lst)
end2 = time.time()

print("python res=%d t=%5.2f"%(s1, end1-start1))
print("c res=%d t=%5.2f"%(s2, end2-start2))
This time there is no need for a conversion (or, more accurately, there is still a conversion, but it is done by the C code, since this is not just any C code but code written ad hoc to extend Python - and after all, the Python interpreter, under the hood, also needs to unpack the elements).
Note that my code checks nothing: it assumes that you really call mysum with a single argument that is a list of integers. God knows what happens if you don't. Well, not just God - just try:
>>> import modmy
>>> modmy.mysum(12)
Segmentation fault (core dumped)
$
Python crashes - and not just the Python code: this is not a Python exception, the whole Python process dies.
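If you keep the unchecked C code, a cheap guard on the Python side avoids the crash (a sketch; safe_mysum is a hypothetical wrapper around the modmy module built above):
import modmy

def safe_mysum(lst):
    # reject anything the C loop was not written for, before crossing into C
    if not isinstance(lst, list) or not all(isinstance(v, int) for v in lst):
        raise TypeError("mysum expects a list of integers")
    return modmy.mysum(lst)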
But the result is worth it:
python res=49999995000000 t= 1.22
c res=49999995000000 t= 0.11
So you see, this time C wins, because the rules are really the same: both versions do the same work, C just does it faster.
So you need to know what you are doing. But this does what you expected: a very simple operation on a list of integers that runs faster in C than in Python.
I'm trying to build a Cython-based slice-sampling library - a generic one, where you supply a log-density and a starting value, and get a sample back. I'm working on the univariate model now. Based on the response here, I've come up with the following.
So I have a function defined in cSlice.pyx:
cdef double univariate_slice_sample(f_type_1 logd, double starter,
                                    double increment_size = 0.5):
    # some stuff
    return value
I have defined in cSlice.pxd:
ctypedef double (*f_type_1)(double)

cdef double univariate_slice_sample(f_type_1 logd, double starter,
                                    double increment_size = *)
where logd is a generic univariate log-density.
In my distribution file, let's say cDistribution.pyx, I have the following:
from cSlice cimport univariate_slice_sample, f_type_1

cdef double log_distribution(alpha_k, y_k, prior):
    # some stuff
    return value

cdef double _sample_alpha_k_slice(
        double starter,
        double[:] y_k,
        Prior prior,
        double increment_size
        ):
    cdef f_type_1 f = lambda alpha_k: log_distribution(alpha_k, y_k, prior)
    return univariate_slice_sample(f, starter, increment_size)
cpdef double sample_alpha_k_slice(
        double starter,
        double[:] y_1,
        Prior prior,
        double increment_size = 0.5
        ):
    return _sample_alpha_1_slice(starter, y_1, prior, increment_size)
This wrapper exists because apparently lambdas aren't allowed in cpdef functions.
When I try compiling the distribution file, I get the following:
cDistribution.pyx:289:22: Cannot convert Python object to 'f_type_1'
pointing at the cdef f_type_1 f = ... line.
I'm unsure of what else to do. I want this code to maintain C speed and, importantly, not to touch the GIL. Any ideas?
You can JIT-compile a C callback/wrapper for any Python function (a cast from a Python object to a pointer cannot be done implicitly), as explained for example in this SO post.
However, at its core the function would stay a slow, pure Python function. Numba gives you the possibility to create real C callbacks via @cfunc. Here is a simplified example:
from numba import cfunc

@cfunc("float64(float64)")
def id_(x):
    return x
and this is how it could be used:
%%cython
ctypedef double(*f_type)(double)

cdef void c_print_double(double x, f_type f):
    print(2.0*f(x))

import numba
expected_signature = numba.float64(numba.float64)

def print_double(double x, f):
    # check the signature of f:
    if not f._sig == expected_signature:
        raise TypeError("cfunc has not the right type")
    # it is not possible to cast a Python object to a pointer directly,
    # so we cast the address first to unsigned long long
    c_print_double(x, <f_type><unsigned long long int>(f.address))
And now:
print_double(1.0, id_)
# 2.0
We need to check the signature of the cfunc object at run time; otherwise the cast <f_type><unsigned long long int>(f.address) would "work" for functions with the wrong signature as well - only to (possibly) crash during the call or give funny, hard-to-debug errors. I'm just not sure that my checking method is the best, though - even if it works:
...
@cfunc("float32(float32)")
def id3_(x):
    return x

print_double(1.0, id3_)
# TypeError: cfunc has not the right type
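Applied back to the question, the log-density itself would be compiled the same way and its address passed into univariate_slice_sample; the sketch below is hypothetical (the name log_density and its body are illustrative, not taken from the original code):
from numba import cfunc

@cfunc("float64(float64)")
def log_density(x):
    # illustrative log-density: standard normal, up to an additive constant
    return -0.5 * x * x

# on the Cython side, after the same signature check as in print_double:
# univariate_slice_sample(<f_type_1><unsigned long long int>(log_density.address),
#                         starter, increment_size)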
I have this piece of code showcased in the latest live stream of Siraj Raval:
%laod_ext Cython
%cython
# memory management helper for Cython
from cymem.cymem import Pool
# good old python
from random import random

# The cdef statement is used to declare C variables, types, and functions
cdef struct Rectangle:
    # C variables
    float w
    float h

# the "*" is the pointer operator, it gives the value stored at a particular address
# this saves memory and runs faster, since we don't have to duplicate the data
cdef int check_rectangles_cy(Rectangle* rectangles, int n_rectangles, float threshold):
    cdef int n_out = 0
    # C arrays contain no size information => we need to state it explicitly
    for rectangle in rectangles[:n_rectangles]:
        if rectangle.w * rectangle.h > threshold:
            n_out += 1
    return n_out

# python uses garbage collection instead of manual memory management,
# which means developers can freely create objects
# and Python's memory manager will periodically look for any
# objects that are no longer referenced by their program
# this overhead makes demands on the runtime environment (slower),
# so manual memory management is better
def main_rectangles_fast():
    cdef int n_rectangles = 10000000
    cdef float threshold = 0.25
    # The Pool object will save memory addresses internally,
    # then free them when the object is garbage collected
    cdef Pool mem = Pool()
    cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
    for i in range(n_rectangles):
        rectangles[i].w = random()
        rectangles[i].h = random()
    n_out = check_rectangles_cy(rectangles, n_rectangles, threshold)
    print(n_out)
When run, this code outputs the following error:
File "ipython-input-18-25198e914011", line 9
    cdef struct Rectangle:
SyntaxError: invalid syntax
What have I written wrong?
Link to the video: https://www.youtube.com/watch?v=giF8XoPTMFg
You had a typo. Replace
%laod_ext Cython
by
%load_ext Cython
and things should work as expected.
I need to get an overview of the performance one can get from using Cython in high-performance numerical code. One of the things I am interested in is finding out whether an optimizing C compiler can vectorize code generated by Cython. So I decided to write the following small example:
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef int f(np.ndarray[int, ndim = 1] f):
    cdef int array_length = f.shape[0]
    cdef int sum = 0
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum
I know that there are NumPy functions that do the job, but I would like to have a simple piece of code in order to understand what is possible with Cython. It turns out that the code built with:
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("sum.pyx"))
and called with:
python setup.py build_ext --inplace
generates C code that looks like this for the loop:
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2 += 1) {
__pyx_v_sum = __pyx_v_sum + (*(int *)((char *)
__pyx_pybuffernd_f.rcbuffer->pybuffer.buf +
__pyx_t_2 * __pyx_pybuffernd_f.diminfo[0].strides)));
}
The main problem with this code is that the compiler does not know at compile time that __pyx_pybuffernd_f.diminfo[0].strides is such that the elements of the array are close together in memory. Without that information, the compiler cannot vectorize efficiently.
Is there a way to do such a thing from Cython?
You have two problems in your code (use the option -a to make them visible):
1. The indexing of the numpy array isn't efficient.
2. You have forgotten int in cdef sum=0.
Taking this into account, we get:
cpdef int f(np.ndarray[np.int_t] f):  ##HERE
    assert f.dtype == np.int
    cdef int array_length = f.shape[0]
    cdef int sum = 0  ##HERE
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum
For the loop, the following code is generated:
int __pyx_t_5;
int __pyx_t_6;
Py_ssize_t __pyx_t_7;
....
__pyx_t_5 = __pyx_v_array_length;
for (__pyx_t_6 = 0; __pyx_t_6 < __pyx_t_5; __pyx_t_6 += 1) {
    __pyx_v_k = __pyx_t_6;
    __pyx_t_7 = __pyx_v_k;
    __pyx_v_sum = (__pyx_v_sum + (*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_int_t *, __pyx_pybuffernd_f.rcbuffer->pybuffer.buf, __pyx_t_7, __pyx_pybuffernd_f.diminfo[0].strides)));
}
This is not that bad, but it is not as easy for the optimizer as normal code written by a human. As you have already pointed out, __pyx_pybuffernd_f.diminfo[0].strides isn't known at compile time, and this prevents vectorization.
However, you get better results when using typed memoryviews, i.e.:
cpdef int mf(int[::1] f):
    cdef int array_length = len(f)
    ...
which leads to less opaque C code - code that, at least with my compiler, can be optimized better:
__pyx_t_2 = __pyx_v_array_length;
for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3 += 1) {
    __pyx_v_k = __pyx_t_3;
    __pyx_t_4 = __pyx_v_k;
    __pyx_v_sum = (__pyx_v_sum + (*((int *) ( /* dim=0 */ ((char *) (((int *) __pyx_v_f.data) + __pyx_t_4)) ))));
}
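For completeness, a full memoryview version might look like this (a sketch that simply fills in the body elided above with the same loop as before):
cpdef int mf(int[::1] f):
    cdef int array_length = len(f)
    cdef int sum = 0
    cdef int k
    for k in range(array_length):
        sum += f[k]  # contiguous access: the compiler can vectorize this
    return sum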
The most crucial thing here is that we make clear to Cython that the memory is contiguous, i.e. int[::1] rather than int[:] as used for general numpy arrays, where a possible stride != 1 must be taken into account.
In this case, the Cython-generated C code results in the same assembler as the code I would have written by hand. As crisb has pointed out, adding -march=native would lead to vectorization, but in that case the assembler of the two functions would be slightly different again.
However, in my experience, compilers quite often have problems optimizing loops created by Cython, and/or it is easy to miss a detail that prevents the generation of really good C code. So my strategy for work-horse loops is to write them in plain C and use Cython for wrapping/accessing them - this is often somewhat faster, because one can also use dedicated compiler flags for that code snippet without affecting the whole Cython module. A minimal sketch of this strategy follows.
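Here is what such a wrapper could look like (a sketch; the file names sum_impl.h/sum_impl.c and the function c_sum are hypothetical, and the .pyx must be linked against the separately compiled C file):
# sum_wrap.pyx (hypothetical file name)
# The C side (sum_impl.c) would contain the plain loop, e.g.:
#     int c_sum(const int* data, int n){
#         int s = 0;
#         for(int i = 0; i < n; i++) s += data[i];
#         return s;
#     }
cdef extern from "sum_impl.h":
    int c_sum(const int* data, int n) nogil

cpdef int mf_c(int[::1] f):
    # hand the contiguous buffer straight to the C function
    # (assumes f is non-empty; &f[0] on an empty view would fail)
    return c_sum(&f[0], <int>f.shape[0])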
To separate the real and imaginary parts of complex numbers in Cython I have generally been using complex.real and complex.imag. This does, however, generate code that is colored slightly "python red" in the HTML output, and I guess that I should be using creal(complex) and cimag(complex) instead.
Consider the example below:
cdef double complex myfun():
    cdef double complex c1, c2, c3
    c1 = 1.0 + 1.2j
    c2 = 2.2 + 13.4j
    c3 = c2.real + c1*c2.imag
    c3 = creal(c2) + c1*c2.imag
    c3 = creal(c2) + c1*cimag(c2)
    return c2
The three assignments to c3 give:
__pyx_v_c3 = __Pyx_c_sum(__pyx_t_double_complex_from_parts(__Pyx_CREAL(__pyx_v_c2), 0), __Pyx_c_prod(__pyx_v_c1, __pyx_t_double_complex_from_parts(__Pyx_CIMAG(__pyx_v_c2), 0)));
__pyx_v_c3 = __Pyx_c_sum(__pyx_t_double_complex_from_parts(creal(__pyx_v_c2), 0), __Pyx_c_prod(__pyx_v_c1, __pyx_t_double_complex_from_parts(__Pyx_CIMAG(__pyx_v_c2), 0)));
__pyx_v_c3 = __Pyx_c_sum(__pyx_t_double_complex_from_parts(creal(__pyx_v_c2), 0), __Pyx_c_prod(__pyx_v_c1, __pyx_t_double_complex_from_parts(cimag(__pyx_v_c2), 0)));
where the first line uses the (Python-colored) constructions __Pyx_CREAL and __Pyx_CIMAG.
Why is this, and does it affect performance "significantly"?
Surely the default C library (complex.h) would work for you. However, it seems this library wouldn't give you any significant improvement compared to the c.real/c.imag approach. By putting the code within a with nogil: block, you can check that your code already makes no calls to Python APIs:
cdef double complex c1, c2, c3
with nogil:
    c1 = 1.0 + 1.2j
    c2 = 2.2 + 13.4j
    c3 = c2.real + c1*c2.imag
I use Windows 7 and Python 2.7, which doesn't have complex.h available in the built-in C library for the Visual Studio 9.0 compilers (the version compatible with Python 2.7). Because of that, I created equivalent pure C functions to check for any possible gains compared to c.real and c.imag:
cdef double mycreal(double complex dc):
    cdef double complex* dcptr = &dc
    return (<double *>dcptr)[0]

cdef double mycimag(double complex dc):
    cdef double complex* dcptr = &dc
    return (<double *>dcptr)[1]
After running the following two testing functions:
def myfun1(double complex c1, double complex c2):
    return c2.real + c1*c2.imag

def myfun2(double complex c1, double complex c2):
    return mycreal(c2) + c1*mycimag(c2)
I got the timings:
In [3]: timeit myfun1(c1, c2)
The slowest run took 17.50 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 86.3 ns per loop

In [4]: timeit myfun2(c1, c2)
The slowest run took 17.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 87.6 ns per loop
This confirms that c.real and c.imag are already fast enough.
Actually you should not expect to see any difference: __Pyx_CREAL(c) and __Pyx_CIMAG(c) can be looked up here, and there is no rocket science/black magic involved:
#if CYTHON_CCOMPLEX
  #ifdef __cplusplus
    #define __Pyx_CREAL(z) ((z).real())
    #define __Pyx_CIMAG(z) ((z).imag())
  #else
    #define __Pyx_CREAL(z) (__real__(z))
    #define __Pyx_CIMAG(z) (__imag__(z))
  #endif
#else
  #define __Pyx_CREAL(z) ((z).real)
  #define __Pyx_CIMAG(z) ((z).imag)
#endif
It basically says that these are defines which lead, without overhead, to a call of:
std::complex<>.real() and std::complex<>.imag() if it is C++;
the GNU extensions __real__ and __imag__ if it is gcc or the Intel compiler (Intel added support some time ago).
You cannot beat that - e.g. in glibc, creal itself uses __real__. What is left is the Microsoft compiler (for which CYTHON_CCOMPLEX is not defined); for that, an own/special Cython class is used:
#if CYTHON_CCOMPLEX
....
#else
static CYTHON_INLINE {{type}} {{type_name}}_from_parts({{real_type}} x, {{real_type}} y) {
    {{type}} z;
    z.real = x;
    z.imag = y;
    return z;
}
#endif
Normally one should not write one's own complex-number implementation, but there is not much you can do wrong if only access to the real and imaginary parts is considered. I would not vouch for other functionality, but I would not expect it to be much slower than the usual Windows implementation (though I would like to see the results if you have some).
As a conclusion: there is no difference for the gcc/Intel compilers, and I would not fret too long about the differences for others.