Cython: How to print without GIL - python

How should I use print in a Cython function with no gil? For example:
from libc.math cimport log, fabs
cpdef double f(double a, double b) nogil:
    cdef double c = log( fabs(a - b) )
    print c
    return c
gives this error when compiling:
Error compiling Cython file:
...
print c
^
------------------------------------------------------------
Python print statement not allowed without gil
...
I know how to use C libraries instead of their Python equivalents (the math library, for example, here), but I couldn't find a similar way for print.

Use printf from stdio:
from libc.stdio cimport printf
...
printf("%f\n", c)

This is a follow-up to a discussion in the comments which suggested that this question was based on a slight misconception: it's always worth thinking about why you need to release the GIL and whether you actually need to do it.
Fundamentally the GIL is a flag that each thread holds to indicate whether it is allowed to call the Python API. Simply holding the flag doesn't cost you any performance. Cython is generally fastest when not using the Python API, but this is because of the sort of operations it is performing rather than because it holds the flag (i.e. printf is probably slightly faster than Python print, but printf runs the same speed with or without the GIL).
The only time you really need to worry about the GIL is when using multithreaded code, where releasing it gives other Python threads the opportunity to run. (Similarly, if you're writing a library and you don't need the Python API it's probably a good idea to release the GIL so your users can run other threads if they want).
Finally, if you are in a nogil block and you want to do a quick Python operation you can simply do:
with gil:
    print c
The chances are it won't cost you much performance and it may save a lot of programming effort.
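To make the multithreading point concrete, here is a rough sketch of the case where releasing the GIL actually pays off. It assumes the f() defined above and an extension compiled with OpenMP flags; the names sum_f, xs and ys are made up for illustration, and the in-place += on total is something Cython turns into a reduction inside prange:

from cython.parallel import prange

def sum_f(double[:] xs, double[:] ys):
    # The loop body only calls nogil code, so the threads genuinely run in parallel.
    # A `with gil: print(...)` inside the loop would work, but would serialize the threads.
    cdef Py_ssize_t i
    cdef double total = 0
    for i in prange(xs.shape[0], nogil=True, num_threads=4):
        total += f(xs[i], ys[i])
    return total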

Related

Why does a large for loop with 10 billion iterations take a much longer time to run in Python than in C?

I am currently comparing two loop calculations in Python3 and C. For Python, I have:
# Python3
import time

t1 = time.process_time()
a = 100234555
b = 22333335
c = 341500
for i in range(1, 10000000001):
    a = a - (b % 2)
    b = b - (c % 2)
print("Sum is", a+b)
t2 = time.process_time()
print(t2-t1, "Seconds")
Then in C, I do the same thing:
#include <stdio.h>

int main() {
    long long a = 100234555;
    long long b = 22333335;
    long long c = 341500;
    for(long long i = 1; i <= 10000000000; i++){
        a = a - (b % 2);
        b = b - (c % 2);
    }
    printf("Sum is %lld\n", a+b);
    return 0;
}
I timed both the code in Python and in C. The timing for Python is around 3500 seconds while the timing in C (including compilation and execution) only takes around 0.3 seconds.
I am wondering how there is such a big difference in timing. The execution was done on a server with 100 GB of RAM and enough processing power.
It's partially due to the fact that Python bytecode is executed by a program instead of the CPU directly, but most of the overhead comes from the memory allocation and deallocation forced by the immutability of integers, which is due to the object model, not the interpretedness.
What's going on is that your C code can change the value of the numbers, but in Python numbers are immutable which means they do not change. This means that when you do a sum, Python has to create a new int object for each new value, and then destroy the old ints after they're no longer used. This makes it much slower than just modifying a single memory value.
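A quick way to see this from the interpreter (CPython; the actual id values will differ on your machine, and very small ints are cached, so use a large one) is to watch the object identity change as the "same" variable is updated:

x = 100234555
print(id(x))   # identity of the current int object
x = x - 1      # the subtraction builds a brand-new int object
print(id(x))   # a different identity: the old object is discarded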
There is also the possibility that your C compiler is being clever and reasons, via a chain of optimisations, that it can completely remove your for loop while keeping the result identical, as if the loop had actually run. I'd expect the code to run much faster than it did if that had been the case in your example, but it could do that.
Python has no such smart compiler. It can't do something as grand and clever as that; it's just not designed to optimise the code, because it's so hard to do so reliably in a dynamically-typed language (though the fact that Python is strongly typed does make it somewhat of a possibility).
As dmuir noticed, the code can be simplified drastically if the compiler propagates some constants correctly. For example: clang -O1 compiles the C code down to this (cf https://gcc.godbolt.org/z/1ZH8Rm ):
main:                                   # @main
        push    rax
        movabs  rsi, -9877432110
        mov     edi, offset .L.str
        xor     eax, eax
        call    printf
        xor     eax, eax
        pop     rcx
        ret
.L.str:
        .asciz  "Sum is %lld\n"
gcc -O1 produces essentially similar code.
Since this boils down to a single call to printf, the explanation seems to be:
The Python compiler is not as smart as the C compiler at optimizing this code.
Your C compiler takes a long time to compile this 12-line piece of code: 3 seconds is way too long given your hardware setup! It only takes 0.15 seconds on my dinky laptop to compile and run the code with all optimisations. Are you compiling as C++?
Testing the C version with optimisations disabled (-O0) produces this output:
$ time (clang -O0 -o loop10g loop10g.c && ./loop10g)
Sum is -9877432110
real 4m15.352s
user 3m47.232s
sys 0m3.252s
Still much faster with unoptimized C than with Python: 255 seconds vs. more than 3500.
The Python code is interpreted as bytecode on a stack machine with dynamically typed values: a slowdown by a factor of 10 to 20 is typical. Furthermore, the integer arithmetic automatically switches to bignum mode for large values, which may be the case here, although then the penalty should be even higher.
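The per-object overhead is easy to see with sys.getsizeof: a C long long is 8 bytes, whereas every CPython int is a full object with a header, and it grows once the value needs more internal digits (exact sizes vary between CPython versions):

import sys

print(sys.getsizeof(0))        # a couple of dozen bytes of object header, even for zero
print(sys.getsizeof(1))        # slightly more once a digit has to be stored
print(sys.getsizeof(10**100))  # larger again as the value needs more digits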
The answer is very simple. Python is an interpreted language: all the instructions are executed by the interpreter (a special program which executes the script). That is much slower than C code, which is compiled to native machine code.

Cython reading in files in parallel and bypassing GIL

Trying to figure out how to use Cython to bypass the GIL and load files in parallel for IO-bound tasks. For now I have the following Cython code trying to load files n0.npy, n1.npy ... n100.npy:
import numpy as np
from cython.parallel import prange

def foo_parallel():
    cdef int i
    for i in prange(100, nogil=True, num_threads=8):
        with gil:
            np.load('n' + str(i) + '.npy')
    return []

def foo_serial():
    cdef int i
    for i in range(100):
        np.load('n' + str(i) + '.npy')
    return []
I'm not noticing a significant speedup - does anyone have any experience with this?
Edit: I'm getting around 900 ms in parallel vs 1.3 seconds serially. I would expect more speedup given 8 threads.
As the comment states, you can't wrap the NumPy call in with gil and expect it to become parallel. You need C- or C++-level file operations to do this. See this post for a potential solution: http://www.code-corner.de/?p=183
I.e. adapt the file_io.pyx example there to your problem. I'd post it here but can't figure out how to on my cell. Add nogil to the end of the cdef statement there and call the function from a cpdef foo_parallel-defined function within your prange loop. Use read_file, not the slow one, and change it to cdef. Please post benchmarks after doing so, as I'm curious and have no computer on vacation.
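For reference, a minimal sketch of the C-level idea (the names here are hypothetical, not taken from that post, and this only pulls raw bytes off disk; np.load additionally parses the .npy header, which this does not do):

from libc.stdio cimport fopen, fread, fclose, FILE

cdef long read_raw(const char* path, char* buf, long bufsize) nogil:
    # Read up to bufsize bytes from path into buf without touching the GIL.
    cdef FILE* f = fopen(path, "rb")
    cdef long n
    if f == NULL:
        return -1
    n = <long>fread(buf, 1, bufsize, f)
    fclose(f)
    return n

Calling something like this from inside the prange loop, with one buffer per thread, is what lets the reads actually overlap.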

Calling C function with OpenMP from Python causes segmentation fault at the end

I have written a Python script that calls a C function which is parallelized using OpenMP (variables are passed from Python to the C function using a ctypes wrapper). The C function works correctly, producing the desired output, but I get a segmentation fault at the end of the Python code. I suspect it has something to do with threads spawned by OpenMP, since the seg-fault does not occur when OpenMP is disabled.
On the Python side of the code (which calls the external C-function) I have:
...
C_Func = ctypes.cdll.LoadLibrary ('./Cinterface.so')
C_Func.Receive_Parameters.argtypes = (...list of ctypes variable-type ...)
C_Func.Receive_Parameters.restype = ctypes.c_void_p
C_Func.Perform_Calculation.argtypes = ( )
C_Func.Perform_Calculation.restype = ctypes.c_void_p
and on the C-side, generic form of the function is:
void Receive_Parameters (....list of c variable-type ...)
{
    ---Take all data and parameters coming from python---
    return;
}

void Perform_Calculation ( )
{
    #pragma omp parallel default(shared) num_threads(8) private (....)
    {
        #pragma omp for schedule (static, 1) reduction (+:p)
        p += core_calculation (...list of variables....);
    }
    return;
}

float core_calculation (...list of variables...)
{
    ----all calculations done here-----
}
I have following questions and associated confusion:
Does Python have any control over the operation of threads spawned by OpenMP inside the C function? The reason I ask this is that the C function receives pointers to arrays allocated on the heap by Python. Can OpenMP threads perform operations on this array in parallel without caring about where it was allocated?
Do I need to do anything in the Python code before calling the C-function, say release the GIL to allow OpenMP threads to be spawned in C-function? If yes, how does one do that?
Do I have to release the GIL in the C-function (before OpenMP parallel block)?
I have SWIG (http://swig.org), a C and C++ wrapper generator for Python and other languages, organize the GIL release for me. The generated code does not look trivial and uses the newer releasing/acquiring techniques from PEP 311. However, the old technique explained in the PEP might be sufficient for you. I hope some more competent person will answer later, but this answer is better than nothing, I guess. Also, errors in an OpenMP loop are not handled gracefully: have you checked the C function with OpenMP outside of Python?

Can Go really be that much faster than Python?

I think I may have implemented this incorrectly because the results do not make sense. I have a Go program that counts to 1000000000:
package main

import (
    "fmt"
)

func main() {
    for i := 0; i < 1000000000; i++ {}
    fmt.Println("Done")
}
It finishes in less than a second. On the other hand I have a Python script:
x = 0
while x < 1000000000:
    x += 1
print 'Done'
It finishes in a few minutes.
Why is the Go version so much faster? Are they both counting up to 1000000000 or am I missing something?
One billion is not a very big number. Any reasonably modern machine should be able to do this in a few seconds at most, if it's able to do the work with native types. I verified this by writing an equivalent C program, reading the assembly to make sure that it actually was doing addition, and timing it (it completes in about 1.8 seconds on my machine).
Python, however, doesn't have a concept of natively typed variables (or meaningful type annotations at all), so it has to do hundreds of times as much work in this case. In short, the answer to your headline question is "yes". Go really can be that much faster than Python, even without any kind of compiler trickery like optimizing away a side-effect-free loop.
PyPy actually does an impressive job of speeding up this loop:
import time

def main():
    x = 0
    while x < 1000000000:
        x += 1

if __name__ == "__main__":
    s = time.time()
    main()
    print time.time() - s
$ python count.py
44.221405983
$ pypy count.py
1.03511095047
~97% speedup!
Clarification for 3 people who didn't "get it". The Python language itself isn't slow. The CPython implementation is a relatively straightforward way of running the code. PyPy is another implementation of the language that does many tricky things (especially the JIT) that can make enormous differences. Directly answering the question in the title: Go isn't "that much" faster than Python, Go is that much faster than CPython.
Having said that, the code samples aren't really doing the same thing. Python needs to instantiate 1000000000 of its int objects. Go is just incrementing one memory location.
This scenario will highly favor decent natively-compiled, statically-typed languages. Natively compiled, statically-typed languages are capable of emitting a very trivial loop of, say, 4-6 CPU opcodes that uses a simple condition check for termination. This loop has effectively zero branch prediction misses and can be effectively thought of as performing an increment every CPU cycle (this isn't entirely true, but..)
Python implementations have to do significantly more work, primarily due to the dynamic typing. Python must make several different calls (internal and external) just to add two ints together. In Python it must call __add__ (it is effectively i = i.__add__(1), but this syntax will only work in Python 3.x), which in turn has to check the type of the value passed (to make sure it is an int), then it adds the integer values (extracting them from both of the objects), and then the new integer value is wrapped up again in a new object. Finally it re-assigns the new object to the local variable. That's significantly more work than a single opcode to increment, and doesn't even address the loop itself - by comparison, the Go/native version is likely only incrementing a register by side-effect.
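You can get a feel for how much machinery is involved by disassembling a similar loop with the dis module (output is from CPython 3.x; opcode names change between versions, e.g. BINARY_ADD became BINARY_OP in 3.11):

import dis

def count(n):
    i = 0
    while i < n:
        i = i + 1
    return i

# Each i = i + 1 expands into several bytecodes (load, add, store),
# and every one of them goes through the interpreter's dispatch loop.
dis.dis(count)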
Java will fare much better in a trivial benchmark like this and will likely be fairly close to Go; the JIT and static typing of the counter variable can ensure this (it uses a special integer-add JVM instruction). Once again, Python has no such advantage. Now, there are some implementations like PyPy/RPython, which run a static-typing phase and should fare much better than CPython here.
You've got two things at work here. The first of which is that Go is compiled to machine code and run directly on the CPU while Python is compiled to bytecode run against a (particularly slow) VM.
The second, and more significant, thing impacting performance is that the semantics of the two programs are actually significantly different. The Go version makes a "box" called "x" that holds a number and increments that by 1 on each pass through the program. The Python version actually has to create a new "box" (int object) on each cycle (and, eventually, has to throw them away). We can demonstrate this by modifying your programs slightly:
package main

import (
    "fmt"
)

func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf("%d %p\n", i, &i)
    }
}
...and:
x = 0
while x < 10:
    x += 1
    print x, id(x)
This is because Go, due to its C roots, takes a variable name to refer to a place, whereas Python takes variable names to refer to things. Since an integer is considered a unique, immutable entity in Python, we must constantly make new ones. Python should be slower than Go, but you've picked a worst-case scenario - in the Benchmarks Game, we see Go being, on average, about 25x faster (100x in the worst case).
You've probably read that, if your Python programs are too slow, you can speed them up by moving things into C. Fortunately, in this case, somebody's already done this for you. If you rewrite your empty loop to use xrange() like so:
for x in xrange(1000000000):
    pass
print "Done."
...you'll see it run about twice as fast. If you find loop counters to actually be a major bottleneck in your program, it might be time to investigate a new way of solving the problem.
@troq
I'm a little late to the party but I'd say the answer is yes and no. As @gnibbler pointed out, CPython is slower in the simple implementation but PyPy is JIT-compiled for much faster code when you need it.
If you're doing numeric processing with CPython, most people will do it with numpy, resulting in fast operations on arrays and matrices. Recently I've been doing a lot with numba, which allows you to add a simple wrapper to your code. For this one I just added @njit to a function incALot() which runs your code above.
On my machine CPython takes 61 seconds, but with the numba wrapper it takes 7.2 microseconds, which will be similar to C and maybe faster than Go. That's an 8-million-times speedup.
So, in Python, if things with numbers seem a bit slow, there are tools to address it - and you still get Python's programmer productivity and the REPL.
import time
from numba import njit

def incALot(y):
    x = 0
    while x < y:
        x += 1

@njit('i8(i8)')
def nbIncALot(y):
    x = 0
    while x < y:
        x += 1
    return x

size = 1000000000
start = time.time()
incALot(size)
t1 = time.time() - start
start = time.time()
x = nbIncALot(size)
t2 = time.time() - start
print('CPython3 takes %.3fs, Numba takes %.9fs' %(t1, t2))
print('Speedup is: %.1f' % (t1/t2))
print('Just Checking:', x)
CPython3 takes 58.958s, Numba takes 0.000007153s
Speedup is: 8242982.2
Just Checking: 1000000000
The problem is that Python is interpreted and Go isn't, so there's no real way to compare raw speeds. Interpreted languages usually (though not always) have a VM component, and that's where the problem lies: any test you run is bounded by the interpreter, not by the actual machine. Go is slightly slower than C in terms of speed, mostly because it uses garbage collection instead of manual memory management. That said, Go compared to Python is fast because it's a compiled language; the only thing lacking in Go is bug testing. I stand corrected if I'm wrong.
It is possible that the compiler realized that you didn't use the "i" variable after the loop, so it optimized the final code by removing the loop.
Even if you used it afterwards, the compiler is probably smart enough to substitute the loop with
i = 1000000000;
Hope this helps =)
I'm not familiar with Go, but I'd guess that the Go version ignores the loop since the body of the loop does nothing. On the other hand, in the Python version you are incrementing x in the body of the loop, so it's probably actually executing the loop.

Method of evaluating shellcode in python?

Evaluating a sample piece of shellcode using a C program is not complicated. It would involve storing the shellcode in a character array, creating a function pointer, typecasting the pointer and making it point to the array and calling the function(pointer).
This is how it works, assuming you can execute the memory at nastycode[]:
/* left harmless. Insert your own working example at your peril */
char nastycode[] = "\x00\x00\x00...";
void (*execute_ptr) (void);
execute_ptr = (void (*)(void)) nastycode; /* point pointer at nasty code */
execute_ptr(); /* execute it */
Is there any way I could do the same using Python code? Or does the fact that Python code translates to bytecode render such an endeavour impossible?
The only way this could be done is if you rely on a C library. Buffer overflows can be introduced into Python from its library bindings. For your purposes you could write your own simple Python library in C and implement something like example3.c in Aleph One's Smashing the Stack for Fun and Profit. As Avilo pointed out, you will have to worry about NX zones; however, any region of memory can be made executable again, and this is platform specific. Also, GCC uses stack canaries by default, although this can be avoided by just overwriting the return address with an address passed to the function, which would leave the canary intact. ASLR is a very good security system that can be difficult to bypass, but if you are passing in the known address of your shellcode then ASLR shouldn't be a problem.
This is what you are looking for ;)
http://libemu.carnivore.it/
Since you were looking for Python:
https://github.com/buffer/pylibemu
It's possible in Python... you can do your own binding to C using ctypes, or simply use something like distorm:
http://code.google.com/p/distorm/wiki/Python
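If you want to stay in pure Python, ctypes plus mmap can reproduce the C pattern shown above on Linux/x86-64 (a sketch only: the single 0xC3 byte is a bare ret, i.e. a function that does nothing, so it is harmless to call; hardened systems that forbid writable+executable mappings will refuse the mmap call):

import ctypes, mmap

code = b"\xc3"  # x86-64 `ret`: returns immediately, deliberately harmless

# Allocate a page that is readable, writable and executable, then copy the bytes in.
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(code)

# Build a callable from the buffer's address and call it, like execute_ptr() in C.
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
func = ctypes.CFUNCTYPE(None)(addr)
func()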
You also might want to check out how dionaea does it. It's a honeypot, but it'll test shellcode and output the results.
http://dionaea.carnivore.it/
