Cython reading in files in parallel and bypassing GIL - python

Trying to figure out how to use Cython to bypass the GIL and load files in parallel for IO-bound tasks. For now I have the following Cython code trying to load files n0.npy, n1.npy ... n99.npy:
def foo_parallel():
    cdef int i
    for i in prange(100, nogil=True, num_threads=8):
        with gil:
            np.load('n' + str(i) + '.npy')
    return []
def foo_serial():
    cdef int i
    for i in range(100):
        np.load('n' + str(i) + '.npy')
    return []
I'm not noticing a significant speedup - does anyone have any experience with this?
Edit: I'm getting around 900 ms in parallel vs 1.3 s serially. I would expect more speedup given 8 threads.

As the comment states, you can't wrap the NumPy call in with gil and expect it to become parallel: the threads just serialize on the GIL again. You need C- or C++-level file operations to do this. See this post for a potential solution: http://www.code-corner.de/?p=183
I.e. adapt the file_io.pyx from that post to your problem; I'd post it here but can't figure out how on my phone. Add nogil to the end of the cdef statement there and call the function from a cpdef foo_parallel function within your prange loop. Use read_file, not the slow one, and change it to cdef. Please post benchmarks after doing so, as I'm curious and have no computer while on vacation.
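For what it's worth, you can also get real overlap without any C-level I/O: CPython releases the GIL inside blocking read() calls, so a plain thread pool already parallelizes disk reads. A minimal sketch (load_all and read_one are illustrative names, and a raw f.read() stands in for np.load):

```python
from concurrent.futures import ThreadPoolExecutor

def load_all(paths, max_workers=8):
    """Read every file concurrently; blocking read() drops the GIL,
    so the threads overlap on I/O even though this is pure Python."""
    def read_one(path):
        with open(path, 'rb') as f:
            return f.read()  # np.load(path) would behave similarly here
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_one, paths))  # results in input order
```

pool.map preserves input order, so the result list lines up with the file names the same way the serial loop would.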

Related

Caching jit-compiled functions in numba

I want to compile a range of functions using numba, and as I only need to run them on my machine with the same signatures, I want to cache them.
But when attempting to do so, numba tells me that the function cannot be cached because it uses large global arrays. This is the specific warning it displays:
NumbaWarning: Cannot cache compiled function "sigmoid" as it uses dynamic globals (such as ctypes pointers and large global arrays)
I am aware that global arrays are usually frozen but large ones aren't; however, as my function looks like this:
@njit(parallel=True, cache=True)
def sigmoid(x):
    return 1. / (1. + np.exp(-x))
I cannot see any global arrays, especially large ones.
Where is the problem?
I observed this behavior too (running on: Windows 10, Dell Latitude 7480, Git for Windows), even for very simple tests. It seems parallel=True doesn't allow caching. This is independent of the actual presence of prange calls. Below is a simple example.
def where_numba(arr: np.ndarray) -> np.ndarray:
    l0, l1 = np.shape(arr)[0], np.shape(arr)[1]
    for i0 in prange(l0):
        for i1 in prange(l1):
            if arr[i0, i1] > 0.5:
                arr[i0, i1] = arr[i0, i1] * 10
    return arr

where_numba_jit = jit(signature_or_function='float64[:,:](float64[:,:])',
                      nopython=True, parallel=True, cache=True,
                      fastmath=True, nogil=True)(where_numba)

arr = np.random.random((10000, 10000))
seln = where_numba_jit(arr)
I get the same warning.
For your specific code you may need to consider which option (cache or parallel) is better to keep: cache when calculation times are relatively short, parallel when the compilation time is negligible compared to the actual calculation time. Please comment if you have updates.
There is also an open Numba issue on this:
https://github.com/numba/numba/issues/2439

Cython: How to print without GIL

How should I use print in a Cython function with no gil? For example:
from libc.math cimport log, fabs

cpdef double f(double a, double b) nogil:
    cdef double c = log(fabs(a - b))
    print c
    return c
gives this error when compiling:
Error compiling Cython file:
...
print c
^
------------------------------------------------------------
Python print statement not allowed without gil
...
I know how to use C libraries instead of their Python equivalents (the math library here, for example), but I couldn't find a similar way for print.
Use printf from stdio:
from libc.stdio cimport printf
...
printf("%f\n", c)
Note that C's stdio buffers independently of Python's sys.stdout, so output from the two may interleave oddly; if you need the printf output to appear immediately, fflush(stdout) (also available from libc.stdio) will force it out.
This is a follow-up to a discussion in the comments which suggested that this question was based on a slight misconception: it's always worth thinking about why you need to release the GIL and whether you actually need to do it.
Fundamentally the GIL is a flag that each thread holds to indicate whether it is allowed to call the Python API. Simply holding the flag doesn't cost you any performance. Cython is generally fastest when not using the Python API, but this is because of the sort of operations it is performing rather than because it holds the flag (i.e. printf is probably slightly faster than Python print, but printf runs the same speed with or without the GIL).
The only time you really need to worry about the GIL is when using multithreaded code, where releasing it gives other Python threads the opportunity to run. (Similarly, if you're writing a library and you don't need the Python API it's probably a good idea to release the GIL so your users can run other threads if they want).
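That point can be seen from plain Python, no Cython required: a blocking call that releases the GIL (here time.sleep, standing in for blocking I/O) lets other threads run in the meantime.

```python
import threading
import time

def worker():
    time.sleep(0.2)  # sleep, like most blocking I/O, releases the GIL

start = time.perf_counter()
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# the four 0.2 s waits overlap: elapsed is close to 0.2 s, not 0.8 s
```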
Finally, if you are in a nogil block and you want to do a quick Python operation you can simply do:
with gil:
    print c
The chances are it won't cost you much performance and it may save a lot of programming effort.

Calling C function with OpenMP from Python causes segmentation fault at the end

I have written a Python script that calls a C-function which is parallelized using OpenMP (variables from Python to C-function were passed using ctypes-wrapper). The C-function works correctly producing the desired output. But I get a segmentation fault at the end of the Python code. I suspect it has something to do with threads spawned by OpenMP since the seg-fault does not occur when OpenMP is disabled.
On the Python side of the code (which calls the external C-function) I have:
...
C_Func = ctypes.cdll.LoadLibrary('./Cinterface.so')

C_Func.Receive_Parameters.argtypes = (...list of ctypes variable-type ...)
C_Func.Receive_Parameters.restype = ctypes.c_void_p

C_Func.Perform_Calculation.argtypes = ()
C_Func.Perform_Calculation.restype = ctypes.c_void_p
and on the C-side, generic form of the function is:
void Receive_Parameters (....list of c variable-type ...)
{
    ---Take all data and parameters coming from python---
    return;
}

void Perform_Calculation ( )
{
    #pragma omp parallel default(shared) num_threads(8) private (....)
    {
        #pragma omp for schedule (static, 1) reduction (+:p)
        p += core_calculation (...list of variables....)
    }
    return;
}

float core_calculation (...list of variables...)
{
    ----all calculations done here-----
}
I have the following questions and associated confusion:
1. Does Python have any control over the operation of the threads spawned by OpenMP inside the C function? The reason I ask is that the C function receives pointers to arrays allocated on the heap by Python. Can OpenMP threads operate on such an array in parallel without caring about where it was allocated?
2. Do I need to do anything in the Python code before calling the C function, say, release the GIL to allow OpenMP threads to be spawned in the C function? If yes, how does one do that?
3. Do I have to release the GIL in the C function (before the OpenMP parallel block)?
I use SWIG (http://swig.org), a C and C++ wrapper generator for Python and other languages, and it organizes the GIL release for me. The generated code does not look trivial and uses the newer releasing/acquiring techniques from PEP 311, though the old technique explained in the PEP might be sufficient for you. I hope some more competent person will answer later, but this answer is better than nothing, I guess. One more thing: errors in an OpenMP loop are not handled gracefully. Have you checked the C function with OpenMP outside of Python?
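For what it's worth, ctypes.CDLL releases the GIL around every foreign call by default (ctypes.PyDLL keeps it held), so nothing special is needed on the Python side before the call. And C code, OpenMP threads included, can operate freely on memory that Python allocated: a pointer is just a pointer. A minimal sketch of that second point using libc's qsort (assuming a Unix-like system where CDLL(None) exposes libc; this is not the poster's Cinterface.so):

```python
import ctypes

libc = ctypes.CDLL(None)  # assumption: Unix-like system exposing libc symbols

# comparator type: int (*)(const int *, const int *)
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    return a[0] - b[0]

# the buffer is allocated on the Python side; qsort (C code) sorts it in place
arr = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), CMPFUNC(py_cmp))
```

The callback here briefly re-acquires the GIL each time C calls back into Python, which is also why pure-C OpenMP workers that never touch Python objects need no GIL handling at all.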

Cython code runs 125x slower when compiled against python 2 vs python 3

I have a big block of Cython code that is parsing Touchstone files that I want to work with Python 2 and Python 3. I'm using very C-style parsing techniques for what I thought would be maximum efficiency, including manually malloc-ing and free-ing char* instead of using bytes so that I can avoid the GIL. When compiled using
python 3.5.2 0 anaconda
cython 0.24.1 py35_0 anaconda
I see speeds that I'm happy with: a moderate boost on small files (~20% faster) and a huge boost on large files (~2.5x faster). When compiled against
python 2.7.12 0 anaconda
cython 0.24.1 py27_0 anaconda
It runs about 125x slower (~17ms in Python 3 vs ~2.2s in Python 2). It's the exact same code compiled in different environments using a pretty simple setuputils script. I'm not currently using NumPy from Cython for any of the parsing or data storage.
import cython
cimport cython
from cython cimport array
import array
from libc.stdlib cimport strtod, malloc, free
from libc.string cimport memcpy

ctypedef long long int64_t  # Really VS2008? Couldn't include this by default?

# Bunch of definitions and utility functions omitted

@cython.boundscheck(False)
cpdef Touchstone parse_touchstone(bytes file_contents, int num_ports):
    cdef:
        char c
        char* buffer = <char*> file_contents
        int64_t length_of_buffer = len(file_contents)
        int64_t i = 0
        # These are some cpdef enums
        FreqUnits freq_units
        Domain domain
        Format fmt
        double z0
        bint option_line_found = 0
        array.array data = array.array('d')
        array.array row = array.array('d', [0 for _ in range(row_size)])

    while i < length_of_buffer:
        c = buffer[i]  # cdef char c
        if is_whitespace(c):
            i += 1
            continue
        if is_comment_char(c):
            # Returns the last index of the comment
            i = parse_comment(buffer, length_of_buffer)
            continue
        if not option_line_found and is_option_leader_char(c):
            # Returns the last index of the option line and
            # assigns values to all references passed in
            i = parse_option_line(
                buffer, length_of_buffer, i,
                &domain, &fmt, &z0, &freq_units)
            if i < 0:
                # Lots of boring code along the lines of
                #     if i == some_int:
                #         raise Exception("message")
                # I did this so that only my top-level parse has to interact
                # with the interpreter; all the lower-level functions are nogil
            option_line_found = 1
        if option_line_found:
            if is_digit(c):
                # Parse a float
                row[row_idx] = strtod(buffer + i, &end_of_value)
                # Jump the cursor to the end of that float
                i = end_of_value - p - 1
                row_idx += 1
                if row_idx == row_size:
                    # append this row onto the main data array
                    data.extend(row)
                    row_idx = 0
        i += 1
    return Touchstone(num_ports, domain, fmt, z0, freq_units, data)
I've ruled out a few things, such as type casts. I also tested code that simply loops over the entire file doing nothing; either Cython optimized that away or it's just really fast, because it causes parse_touchstone to not even show up in a cProfile/pstats report.
I determined that it's not just the comment, whitespace, and option-line parsing (not shown is the significantly more complicated keyword-value parsing) after I threw a print statement into the last if row_idx == row_size block to print out a status: it's taking about 0.5-1 second (guesstimate) to parse a row with 512 floating-point numbers on it. That really should not take so long, especially when using strtod to do the parsing. I also checked parsing just 2 rows' worth of values and then jumping out of the while loop, and that told me that parsing the comments, whitespace, and option line took up about 800 ms (1/3 of the overall time), and that was for 6 lines of text totaling less than 150 bytes.
Am I just missing something here? Is there a small trick that would cause Cython code to run 3 orders of magnitude slower in Python 2 than Python 3?
(Note: I haven't shown the full code here because I'm not sure if I'm allowed to for legal reasons and because it's about 450 lines total)
The problem is with strtod, which is not optimized in VS2008. Apparently it internally calculates the length of the input string each time it's called, and if you call it on a long string this slows your code down considerably. To circumvent this you have to write a wrapper around strtod that hands it only small buffers at a time (see the link above for one example of how to do this), or write your own strtod function.
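A sketch of that wrapper idea from the Python side, using strtod through ctypes (parse_doubles and CHUNK are illustrative names; a real Cython version would call strtod directly). The point is that strtod only ever sees a small window, so an implementation that rescans its input string cannot blow up on a long buffer. Caveat: this sketch assumes no number straddles a window boundary; a production wrapper must handle that case.

```python
import ctypes

libc = ctypes.CDLL(None)  # assumption: Unix-like system exposing libc
libc.strtod.restype = ctypes.c_double
libc.strtod.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p)]

CHUNK = 64  # plenty of room for one formatted double

def parse_doubles(data):
    """Parse whitespace-separated doubles, feeding strtod one small
    window at a time instead of the whole (possibly huge) buffer."""
    out = []
    pos = 0
    endp = ctypes.c_char_p()
    while pos < len(data):
        window = ctypes.create_string_buffer(data[pos:pos + CHUNK])
        base = ctypes.addressof(window)
        value = libc.strtod(ctypes.cast(window, ctypes.c_char_p),
                            ctypes.byref(endp))
        consumed = ctypes.cast(endp, ctypes.c_void_p).value - base
        if consumed == 0:
            break  # strtod found nothing convertible: done
        out.append(value)
        pos += consumed
    return out
```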

Iterating over a file omitting lines based on condition efficiently

Ahoi. I was tasked with improving the performance of Bit.ly's Data_Hacks' sample.py, as a practice exercise.
I have cythonized part of the code and included a PCG random generator, which has thus far improved performance by about 20 seconds (down from 72 s), as well as optimizing print output (by using a basic C function instead of Python's write()).
This has all worked well, but aside from these fix-ups, I'd like to optimize the loop itself.
The basic function, as seen in bit.ly's sample.py:
def run(sample_rate):
    input_stream = sys.stdin
    for line in input_stream:
        if random.randint(1, 100) <= sample_rate:
            sys.stdout.write(line)
My implementation:
cdef int take_sample(float sample_rate):
    cdef unsigned int floor = 1
    cdef unsigned int top = 100
    if pcg32_random() % 100 <= sample_rate:
        return 1
    else:
        return 0

def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            if take_sample(sample_rate):
                out(line)
What I would like to improve on now is specifically skipping the next line (and preferably doing so repeatedly) if my take_sample() doesn't return True.
My current implementation is this:
def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            out(line)
            while not take_sample(sample_rate):
                next(f)
Which appears to do nothing to improve performance, leading me to suspect I've merely replaced a continue call after an if condition at the top of the loop with my next(f).
So the question is this:
Is there a more efficient way to loop over a file (in Cython)?
I'd like to omit lines entirely, meaning they should only be truly accessed if I call my out() - is this already the case in python's for loop?
Is line a pointer (or comparable to such) to the line of the file? Or does the loop actually load this?
I realize that I could improve on it by writing it in C entirely, but I'd like to know how far I can push this staying with python/cython.
Update:
I've tested a C variant of my code - using the same test case - and it clocks in at under 2s (surprising no one). So, while it is true that the random generator and file I/O are two major bottlenecks generally speaking, it should be pointed out that python's file handling is in itself already darn slow.
So, is there a way to make use of C's file reading, other than implementing the loop itself in Cython? The overhead is still slowing the Python code down significantly, which makes me wonder whether I've simply hit the performance ceiling for file handling with Cython.
If the file is small, you may read it whole with .readlines() at once (possibly reducing IO traffic) and iterate over the resulting sequence of lines.
If the sample rate is small enough, you may consider sampling from geometric distribution which may be more efficient.
I do not know Cython, but I would also consider:
simplifying take_sample() by removing the unnecessary variables and returning the boolean result of the test instead of an integer,
changing the signature of take_sample() to take_sample(int) to avoid an int-to-float conversion on every test.
[EDIT]
According to the comment of @hpaulj, it may be better to use .read().split('\n') instead of the .readlines() I suggested.
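The geometric-distribution suggestion looks roughly like this in plain Python (a sketch assuming 0 < rate < 1; sample_lines is an illustrative name). Instead of one RNG call per line, you draw how many lines to skip before the next kept one, so the RNG is only invoked once per kept line:

```python
import math
import random

def sample_lines(lines, rate, rng=random.random):
    """Keep each line independently with probability `rate`, using one
    geometric draw per kept line rather than one RNG call per line."""
    i = -1
    n = len(lines)
    log_q = math.log(1.0 - rate)
    while True:
        u = 1.0 - rng()                  # in (0, 1], so log(u) is finite
        gap = int(math.log(u) / log_q)   # lines rejected before the next hit
        i += gap + 1
        if i >= n:
            return
        yield lines[i]
```

With a rate of 0.01 this makes roughly 100x fewer RNG calls than testing every line, which is exactly where the geometric trick pays off.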
