Specifying Exact CPU Instruction Set with Cythonized Python Wheels

I have a Python package with a native extension compiled by Cython. For performance reasons, the compilation is done with the -march=native and -mtune=native flags. This basically allows the compiler to use any ISA extension available on the build machine.
Additionally, we keep a non-cythonized, pure-Python version of this package. It should be used in environments that are less performance-sensitive.
Hence, in total we have two versions published:
Cythonized wheel built for a very specific platform
Pure-python wheel.
Some other packages depend on this package, and some of the machines they run on differ from the one the package was compiled on. Because we used -march=native, we get SIGILL on servers where an ISA extension used by the build is missing.
So, in essence, I'd like to somehow make pip disregard the native wheel if the host CPU is not compatible with the wheel.
The native wheel does have the cp37 tag and platform name, but I don't see a way to define more granular ISA requirements there. I can always use pip's --implementation flag, but I wonder whether there's a better way for pip to differentiate among ISAs.
Thanks,

The pip infrastructure doesn't support such granularity.
I think a better approach would be to compile two versions of the Cython extension, one with -march=native and one without, install both, and decide at run time which one should be loaded.
Here is a proof of concept.
The first hoop to jump through: how to check at run time which instructions are supported by the CPU/OS combination. For simplicity we will check for AVX (this SO post has more details), and I offer only a gcc-specific solution (see also this), called impl_picker.pyx:
cdef extern from *:
    """
    int cpu_supports_avx(void){
        return __builtin_cpu_supports("avx");
    }
    """
    int cpu_supports_avx()

def cpu_has_avx_support():
    return cpu_supports_avx() != 0
The second problem: the pyx-file and the module must have the same name. To avoid code duplication, the actual code is in a pxi-file:
# worker.pxi
cdef extern from *:
    """
    int compiled_with_avx(void){
        #ifdef __AVX__
            return 1;
        #else
            return 0;
        #endif
    }
    """
    int compiled_with_avx()

def compiled_with_avx_support():
    return compiled_with_avx() != 0
As one can see, the function compiled_with_avx_support will yield different results, depending on whether it was compiled with -march=native or not.
And now we can define two versions of the module just by including the actual code from the *.pxi-file. One module called worker_native.pyx:
# distutils: extra_compile_args=["-march=native"]
include "worker.pxi"
and worker_fallback.pyx:
include "worker.pxi"
After building everything, e.g. via cythonize -i -3 *.pyx, it can be used as follows:
from impl_picker import cpu_has_avx_support

# overhead once when imported:
if cpu_has_avx_support():
    import worker_native as worker
else:
    print("using fallback worker")
    import worker_fallback as worker

print("compiled_with_avx_support:", worker.compiled_with_avx_support())
On my machine the above would lead to compiled_with_avx_support: True, on older machines the "slower" worker_fallback will be used and the result will be compiled_with_avx_support: False.
The goal of this post is not to give a working setup.py, but just to outline how one could pick the correct version at run time. Obviously, the setup.py could be quite a bit more complicated: e.g. one would need to compile multiple C files with different compiler settings (see this SO post for how that could be achieved).
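For orientation only, here is a minimal setup.py sketch for the three modules above (untested; the package name is an assumption, and the -march=native flag could equally stay in the # distutils directive inside worker_native.pyx):
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

extensions = [
    Extension("impl_picker", ["impl_picker.pyx"]),
    # native variant: let the compiler use the full ISA of the build host
    Extension("worker_native", ["worker_native.pyx"],
              extra_compile_args=["-march=native"]),
    # fallback variant: baseline ISA only
    Extension("worker_fallback", ["worker_fallback.pyx"]),
]

setup(
    name="mypackage",  # assumed name
    ext_modules=cythonize(extensions, language_level=3),
)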

Related

compile cuda code with relocatable device code through python distutils (for python c extension)

I have some cuda code that uses cooperative groups, and thus requires the -rdc=true flag to compile with nvcc. I would like to call the cuda code from python, so am writing a python interface with python c extensions.
Because I'm including cuda code I had to adapt my setup.py, as described in: Can python distutils compile CUDA code?
This compiles and installs, but as soon as I import my code in python, it segfaults. Removing the -rdc=true flag makes everything work, but forces me to remove any cooperative group code from the cuda kernels (or get a 'cudaCGGetIntrinsicHandle unresolved' error during compilation).
Any way I can adapt my setup.py further to get this to work? Alternatively, is there an other way to compile my c extension that allows cuda code (with the rdc flag on)?
I think I sort of figured out the answer. If you generate relocatable device code with nvcc, either nvcc needs to link the object files so device code linking gets handled correctly, or you need to generate a separate object file by running nvcc on all the object files that have relocatable device code with the '--device-link' flag. This extra object file can then be included with all the other object files for an external linker.
I adapted the setup from Can python distutils compile CUDA code? by adding a dummy 'link.cu' file to the end of the sources file list. I also add the cudadevrt library and another set of compiler options for the cuda device linking step:
ext = Extension('mypythonextension',
                sources=['python_wrapper.cpp', 'file_with_cuda_code.cu', 'link.cu'],
                library_dirs=[CUDA['lib64']],
                libraries=['cudart', 'cudadevrt'],
                runtime_library_dirs=[CUDA['lib64']],
                extra_compile_args={'gcc': [],
                                    'nvcc': ['-arch=sm_70', '-rdc=true', '--compiler-options', "'-fPIC'"],
                                    'nvcclink': ['-arch=sm_70', '--device-link', '--compiler-options', "'-fPIC'"]},
                include_dirs=[numpy_include, CUDA['include'], 'src'])
This then gets picked up in the following way by the function that adapts the compiler calls:
def customize_compiler_for_nvcc(self):
    self.src_extensions.append('.cu')
    # track all the object files generated with cuda device code
    self.cuda_object_files = []
    # remember the default compiler settings so they can be restored after each file
    default_compiler_so = self.compiler_so
    super = self._compile

    def _compile(obj, src, ext, cc_args, extra_postargs, pp_opts):
        # generate a special object file that will contain linked in
        # relocatable device code
        if src == 'link.cu':
            self.set_executable('compiler_so', CUDA['nvcc'])
            postargs = extra_postargs['nvcclink']
            cc_args = self.cuda_object_files[1:]
            src = self.cuda_object_files[0]
        elif os.path.splitext(src)[1] == '.cu':
            self.set_executable('compiler_so', CUDA['nvcc'])
            postargs = extra_postargs['nvcc']
            self.cuda_object_files.append(obj)
        else:
            postargs = extra_postargs['gcc']
        super(obj, src, ext, cc_args, postargs, pp_opts)
        self.compiler_so = default_compiler_so

    self._compile = _compile
The solution feels a bit hackish because of my lack of distutils knowledge, but it seems to work. :)
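For context, here is a sketch of how customize_compiler_for_nvcc is typically wired into the build, following the pattern from Can python distutils compile CUDA code? (the command class and package name are assumptions, not part of the answer above):
from setuptools import setup
from distutils.command.build_ext import build_ext

class custom_build_ext(build_ext):
    def build_extensions(self):
        # patch the compiler instance before any source file is compiled
        customize_compiler_for_nvcc(self.compiler)
        build_ext.build_extensions(self)

setup(
    name='mypythonextension',            # assumed name
    ext_modules=[ext],                   # the Extension defined above
    cmdclass={'build_ext': custom_build_ext},
)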

Generate Protobuf Python source with Meson

Just learning how to use Meson and want to generate protobuf source/headers for multiple languages - C++, Python, Java, Javascript. C++ was simple enough using the generator function in my meson.build file:
project('MesonProtobufExample', 'cpp')
protoc = find_program('protoc', required : true)
deps = dependency('protobuf', required : true)
gen = generator(protoc,
  output : ['#BASENAME#.pb.cc', '#BASENAME#.pb.h'],
  arguments : ['--proto_path=#CURRENT_SOURCE_DIR#', '--cpp_out=#BUILD_DIR#', '#INPUT#'])
generated = gen.process('MyExample.proto')
ex = executable('my_example', 'my_example.cpp', generated, dependencies : deps)
This produces the MyExample.pb.cc and MyExample.pb.h files. I figured Python would be just as easy, but I'm a bit stumped: there's no executable() step for my Python script, since it doesn't need to be compiled. I noticed that Meson (and CMake, it turns out) doesn't actually generate the protobuf files until you call executable(), so I can't just skip this step or the MyExample_pb2.py file will not be generated. After several hours of searching I have found no example of using Meson/Python/GPB together. Shouldn't there be a simple way to 'link' the generated sources to a Python file/module the way CMake does?
protobuf_generate_python(PROTO_PY MyExample.proto)
# This command causes the protobuf python binding to be generated
add_custom_target(my_example.py ALL DEPENDS ${PROTO_PY})
You can use a trick with custom_target() and a "fake compiler" in the form of the cp or cat tools (on *nix environments, of course; if you want to support Windows, you can use a conditional find_program()). Here is an example with cp:
py_gen = generator( ... )
py_generated = py_gen.process('MyExample.proto')
py_proc = custom_target('py_proto',
  command : [ 'cp', '#INPUT#', '#OUTPUT#' ],
  input : py_generated,
  output : 'MyExample_pb2.py',
  build_by_default : true)
I added the build_by_default flag assuming that you need to generate it as part of the standard build process (of course, enabling this target can be conditional too).

Enumerating all modules for a binary using Python (pefile / win32api)

I want to use PEfile or another Python library to enumerate all modules. I thought I had it, but some obvious ones were missing, and when I checked in WinDbg I saw that quite a few were not being listed.
For filezilla.exe:
00400000 00fe7000 image00400000 image00400000
01c70000 01ecc000 combase C:\WINDOWS\SysWOW64\combase.dll
6f590000 6f5ac000 SRVCLI C:\WINDOWS\SysWOW64\SRVCLI.DLL
6f640000 6f844000 COMCTL32 C:\WINDOWS\WinSxS\x86_microsoft.windows.common-controls_6595b64144ccf1df_6.0.17134.472_none_42ecd1cc44e43e73\COMCTL32.DLL
70610000 7061b000 NETUTILS C:\WINDOWS\SysWOW64\NETUTILS.DLL
70720000 70733000 NETAPI32 C:\WINDOWS\SysWOW64\NETAPI32.dll
72910000 72933000 winmmbase C:\WINDOWS\SysWOW64\winmmbase.dll
729d0000 729d8000 WSOCK32 C:\WINDOWS\SysWOW64\WSOCK32.DLL
72b40000 72b64000 WINMM C:\WINDOWS\SysWOW64\WINMM.DLL
72b70000 72b88000 MPR C:\WINDOWS\SysWOW64\MPR.DLL
73c60000 73c6a000 CRYPTBASE C:\WINDOWS\SysWOW64\CRYPTBASE.dll
73c70000 73c90000 SspiCli C:\WINDOWS\SysWOW64\SspiCli.dll
74120000 741b6000 OLEAUT32 C:\WINDOWS\SysWOW64\OLEAUT32.dll
741c0000 7477a000 windows_storage C:\WINDOWS\SysWOW64\windows.storage.dll
74780000 7487c000 ole32 C:\WINDOWS\SysWOW64\ole32.dll
74880000 74908000 shcore C:\WINDOWS\SysWOW64\shcore.dll
74910000 7498d000 msvcp_win C:\WINDOWS\SysWOW64\msvcp_win.dll
74990000 74a4f000 msvcrt C:\WINDOWS\SysWOW64\msvcrt.dll
74a50000 74a72000 GDI32 C:\WINDOWS\SysWOW64\GDI32.dll
74bd0000 74bde000 MSASN1 C:\WINDOWS\SysWOW64\MSASN1.dll
74be0000 74c47000 WS2_32 C:\WINDOWS\SysWOW64\WS2_32.dll
74c70000 74d30000 RPCRT4 C:\WINDOWS\SysWOW64\RPCRT4.dll
74d30000 74d37000 Normaliz C:\WINDOWS\SysWOW64\Normaliz.dll
74d40000 74d79000 cfgmgr32 C:\WINDOWS\SysWOW64\cfgmgr32.dll
74fe0000 75025000 powrprof C:\WINDOWS\SysWOW64\powrprof.dll
75150000 7526e000 ucrtbase C:\WINDOWS\SysWOW64\ucrtbase.dll
75280000 75416000 CRYPT32 C:\WINDOWS\SysWOW64\CRYPT32.dll
75420000 75584000 gdi32full C:\WINDOWS\SysWOW64\gdi32full.dll
755c0000 755c8000 FLTLIB C:\WINDOWS\SysWOW64\FLTLIB.DLL
755d0000 755e8000 profapi C:\WINDOWS\SysWOW64\profapi.dll
755f0000 75635000 SHLWAPI C:\WINDOWS\SysWOW64\SHLWAPI.dll
75640000 7698a000 SHELL32 C:\WINDOWS\SysWOW64\SHELL32.dll
76990000 76b74000 KERNELBASE C:\WINDOWS\SysWOW64\KERNELBASE.dll
76cf0000 76cff000 kernel_appcore C:\WINDOWS\SysWOW64\kernel.appcore.dll
76d00000 76d17000 win32u C:\WINDOWS\SysWOW64\win32u.dll
76db0000 76e86000 COMDLG32 C:\WINDOWS\SysWOW64\COMDLG32.DLL
76e90000 7701d000 USER32 C:\WINDOWS\SysWOW64\USER32.dll
77020000 77064000 sechost C:\WINDOWS\SysWOW64\sechost.dll
77100000 771e0000 KERNEL32 C:\WINDOWS\SysWOW64\KERNEL32.DLL
771e0000 77238000 bcryptPrimitives C:\WINDOWS\SysWOW64\bcryptPrimitives.dll
77240000 772b8000 ADVAPI32 C:\WINDOWS\SysWOW64\ADVAPI32.dll
773b0000 77540000 ntdll ntdll.dll
This is the output that I obtained from pefile by using a similar script:
ADVAPI32.dll
COMCTL32.DLL
COMDLG32.DLL
CRYPT32.dll
GDI32.dll
KERNEL32.dll
MPR.DLL
msvcrt.dll
NETAPI32.dll
Normaliz.dll
ole32.dll
OLEAUT32.dll
POWRPROF.dll
SHELL32.DLL
USER32.dll
WINMM.DLL
WS2_32.dll
WSOCK32.DLL
import pefile

def findDLL():
    pe = pefile.PE(name)
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        print entry.dll
Is there something else in Pefile I should be looking at to obtain a more complete listing of modules that will be loaded?
Is there something in win32api or win32con that could get me this information? I would prefer pefile if possible, but either works. I need to be able to output a listing of all modules that would be loaded. I am working in Python and am inflexible about changing that.
Modules can be loaded into a process in many ways. Imported DLLs are just one way.
Imported DLLs may import DLLs for themselves too. foobar.exe may depend on user32.dll, but user32.dll in turn also depends on kernel32.dll, so that will be loaded into your process. If you want a complete list, you may want to check the imported DLLs of your executable's dependencies as well.
Modules can be dynamically loaded in code with LoadLibrary(). You will not see those in the import directory. You'll have to disassemble the code for that and even then the library name can be generated on the fly so it will be hard to tell.
There are some more unsupported methods of loading modules that malware can use.
As the comments mentioned, getting a list of loaded modules through debugging APIs is probably simpler. But it all depends on what you're actually trying to do with this data.
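If you want to stay with pefile and approximate the list statically, a rough sketch that walks the static import tree recursively could look like this (the search directory and executable path are just examples; dynamically loaded, delay-loaded and forwarded modules will still be missed):
import os
import pefile

SEARCH_DIRS = [r"C:\Windows\SysWOW64"]  # example: where the dependency DLLs live

def walk_imports(path, seen=None):
    # Recursively collect statically imported DLL names.
    if seen is None:
        seen = set()
    try:
        pe = pefile.PE(path, fast_load=True)
        pe.parse_data_directories(
            directories=[pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_IMPORT']])
    except (OSError, pefile.PEFormatError):
        return seen
    for entry in getattr(pe, 'DIRECTORY_ENTRY_IMPORT', []):
        name = entry.dll.decode('ascii', 'ignore')
        if name.lower() in seen:
            continue
        seen.add(name.lower())
        for directory in SEARCH_DIRS:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                walk_imports(candidate, seen)
                break
    return seen

for dll in sorted(walk_imports(r"C:\Program Files (x86)\FileZilla FTP Client\filezilla.exe")):
    print(dll)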
As a matter of fact, there are different techniques for importing modules. The type is determined by the way the referenced module is bound to the executable. As far as I know, PEfile only lists the dynamic link libraries that are statically bound to the executable through the Imports table. Other types of dynamic link libraries are: explicit (those loaded via the LoadLibrary/GetProcAddress APIs), forwarded (those loaded via a PE mechanism that allows forwarding API calls) and delayed (those loaded via a PE mechanism that allows delayed loading of API calls).
A schema representing these methods can be found in my slides, available at https://winitor.com/pdf/DynamicLinkLibraries.pdf.
I hope that helps.
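To illustrate the forwarding mechanism mentioned above, a small pefile sketch (the DLL path is just an example) can list the forwarded exports of a system DLL:
import pefile

pe = pefile.PE(r"C:\Windows\SysWOW64\kernel32.dll")
for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols:
    if exp.forwarder:
        name = exp.name.decode() if exp.name else "ordinal %d" % exp.ordinal
        print(name, "->", exp.forwarder.decode())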

Using mkl_set_num_threads with numpy

I'm trying to set the number of threads for numpy calculations with mkl_set_num_threads like this
import numpy
import ctypes
mkl_rt = ctypes.CDLL('libmkl_rt.so')
mkl_rt.mkl_set_num_threads(4)
but I keep getting a segmentation fault:
Program received signal SIGSEGV, Segmentation fault.
0x00002aaab34d7561 in mkl_set_num_threads__ () from /../libmkl_intel_lp64.so
Getting the number of threads is no problem:
print mkl_rt.mkl_get_max_threads()
How can I get my code working?
Or is there another way to set the number of threads at runtime?
Ophion led me the right way. Despite the documentation, one has to pass the parameter of mkl_set_num_threads by reference.
Now I have defined two functions for getting and setting the threads:
import numpy
import ctypes

mkl_rt = ctypes.CDLL('libmkl_rt.so')
mkl_get_max_threads = mkl_rt.mkl_get_max_threads

def mkl_set_num_threads(cores):
    mkl_rt.mkl_set_num_threads(ctypes.byref(ctypes.c_int(cores)))

mkl_set_num_threads(4)
print mkl_get_max_threads() # says 4
and they work as expected.
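If your numpy is actually linked against MKL, a quick and purely illustrative way to see the setting take effect is to time a large matrix product with different thread counts:
import time
# reuses numpy and mkl_set_num_threads from the snippet above
a = numpy.random.rand(3000, 3000)
for n in (1, 4):
    mkl_set_num_threads(n)
    t0 = time.time()
    numpy.dot(a, a)
    print("%d threads: %.3f s" % (n, time.time() - t0))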
Edit: according to Rufflewind, the names of the C functions are written in CamelCase and expect their parameters by value:
import ctypes
mkl_rt = ctypes.CDLL('libmkl_rt.so')
mkl_set_num_threads = mkl_rt.MKL_Set_Num_Threads
mkl_get_max_threads = mkl_rt.MKL_Get_Max_Threads
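Since these CamelCase entry points take their argument by value, a short sketch with explicit ctypes signatures (illustrative; it assumes libmkl_rt.so is on the loader path) could look like this:
import ctypes

mkl_rt = ctypes.CDLL('libmkl_rt.so')
# declare the signatures so ctypes converts the arguments correctly
mkl_rt.MKL_Set_Num_Threads.argtypes = [ctypes.c_int]
mkl_rt.MKL_Set_Num_Threads.restype = None
mkl_rt.MKL_Get_Max_Threads.restype = ctypes.c_int

mkl_rt.MKL_Set_Num_Threads(4)
print(mkl_rt.MKL_Get_Max_Threads())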
Long story short, use MKL_Set_Num_Threads and its CamelCased friends when calling MKL from Python. The same applies to C if you don't #include <mkl.h>.
The MKL documentation seems to suggest that the correct type signature in C is:
void mkl_set_num_threads(int nt);
Okay, let's try a minimal program then:
void mkl_set_num_threads(int);

int main(void) {
    mkl_set_num_threads(1);
    return 0;
}
Compile it with GCC and boom, Segmentation fault again. So it seems the problem isn't restricted to Python.
Running it through a debugger (GDB) reveals:
Program received signal SIGSEGV, Segmentation fault.
0x0000… in mkl_set_num_threads_ ()
from /…/mkl/lib/intel64/libmkl_intel_lp64.so
Wait a second, mkl_set_num_threads_?? That's the Fortran version of mkl_set_num_threads! How did we end up calling the Fortran version? (Keep in mind that Fortran's calling convention requires arguments to be passed as pointers rather than by value.)
It turns out the documentation was a complete façade. If you actually inspect the header files for the recent versions of MKL, you will find this cute little definition:
void MKL_Set_Num_Threads(int nth);
#define mkl_set_num_threads MKL_Set_Num_Threads
… and now everything makes sense! The correct function to call (for C code) is MKL_Set_Num_Threads, not mkl_set_num_threads. Inspecting the symbol table reveals that there are actually four different variants defined:
nm -D /…/mkl/lib/intel64/libmkl_rt.so | grep -i mkl_set_num_threads
00000000000e3060 T MKL_SET_NUM_THREADS
…
00000000000e30b0 T MKL_Set_Num_Threads
…
00000000000e3060 T mkl_set_num_threads
00000000000e3060 T mkl_set_num_threads_
…
Why did Intel put in four different variants of one function despite there being only C and Fortran variants in the documentation? I don't know for certain, but I suspect it's for compatibility with different Fortran compilers. You see, the Fortran calling convention is not standardized. Different compilers will mangle the names of functions differently:
some use upper case,
some use lower case with a trailing underscore, and
some use lower case with no decoration at all.
There may even be other ways that I'm not aware of. This trick allows the MKL library to be used with most Fortran compilers without any modification, the downside being that C functions need to be "mangled" to make room for the 3 variants of the Fortran calling convention.
For people looking for a cross platform and packaged solution, note that we have recently released threadpoolctl, a module to limit the number of threads used in C-level threadpools called by python (OpenBLAS, OpenMP and MKL). See this answer for more info.
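For reference, a minimal threadpoolctl usage sketch (the limit and user_api values are just examples; it assumes the package is installed) looks like this:
from threadpoolctl import threadpool_limits, threadpool_info
import numpy as np

print(threadpool_info())  # list the BLAS/OpenMP thread pools that were detected

with threadpool_limits(limits=2, user_api="blas"):
    # BLAS calls made by numpy inside this block use at most 2 threads
    a = np.random.rand(1000, 1000)
    np.dot(a, a)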
For people looking for the complete solution, you can use a context manager:
import ctypes

class MKLThreads(object):
    _mkl_rt = None

    @classmethod
    def _mkl(cls):
        if cls._mkl_rt is None:
            try:
                cls._mkl_rt = ctypes.CDLL('libmkl_rt.so')
            except OSError:
                cls._mkl_rt = ctypes.CDLL('mkl_rt.dll')
        return cls._mkl_rt

    @classmethod
    def get_max_threads(cls):
        return cls._mkl().mkl_get_max_threads()

    @classmethod
    def set_num_threads(cls, n):
        assert type(n) == int
        cls._mkl().mkl_set_num_threads(ctypes.byref(ctypes.c_int(n)))

    def __init__(self, num_threads):
        self._n = num_threads
        self._saved_n = self.get_max_threads()

    def __enter__(self):
        self.set_num_threads(self._n)
        return self

    def __exit__(self, type, value, traceback):
        self.set_num_threads(self._saved_n)
Then use it like:
with MKLThreads(2):
    # do some stuff on two cores
    pass
Or just manipulate the configuration by calling the following functions:
# Example
MKLThreads.set_num_threads(3)
print(MKLThreads.get_max_threads())
Code is also available in this gist.

How do I specify different compiler flags for just one Python/C extension source file?

I have a Python extension which uses CPU-specific features, if available. This is done through a run-time check. If the hardware supports the POPCNT instruction then it selects one implementation of my inner loop, if SSSE3 is available then it selects another, otherwise it falls back to generic versions of my performance critical kernel. (Some 95%+ of the time is spent in this kernel.)
Unfortunately, there's a failure mode I didn't expect. I use -mssse3 and -O3 to compile all of the C code, even though only one file needs that -mssse3 option. As a result, the other files are compiled with the expectation that SSSE3 will exist. This causes a segfault for the line:
start_target_popcount = (int)(query_popcount * threshold);
because the compiler used fisttpl, which is an SSSE3 instruction. After all, I told it to assume that SSSE3 exists.
The Debian packager for my package recently ran into this problem, because the test machine has a GCC which understands -mssse3 and generates code with that in mind, but the machine itself has an older CPU without those instructions.
I want a solution where the same binary can work on older machines and on newer ones, that the Debian maintainer can use for that distro.
Ideally, I would like to say that only one file is compiled with the -mssse3 option. Since my CPU-specific selector code isn't part of this file, no SSSE3 code will ever be executed unless the CPU supports it.
However, I can't figure out any way to tell distutils that a set of compiler options are specific to a single file. Is that even possible?
A very ugly solution would be to create two (or more) Extension instances, one to hold the SSSE3 code and the other for everything else. You could then tidy the interface up in the Python layer.
c_src = [f for f in my_files if f != 'ssse3_file.c']

c_gen = Extension('c_general', sources=c_src,
                  libraries=[], extra_compile_args=['-O3'])

c_ssse3 = Extension('c_ssse_three', sources=['ssse3_file.c'],
                    libraries=[], extra_compile_args=['-O3', '-mssse3'])
and in an __init__.py somewhere
from c_general import *
from c_ssse_three import *
Of course you don't need me to write out that code! And I know this isn't DRY, I look forward to reading a better answer!
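In case it helps, the Python-layer tidy-up could do a run-time dispatch instead of two blind star-imports; a rough sketch of such an __init__.py (the checker module and kernel name are hypothetical, and cpu_has_ssse3 would come from a small checker like the impl_picker in the first answer above):
try:
    from cpu_check import cpu_has_ssse3   # hypothetical checker extension
except ImportError:
    def cpu_has_ssse3():
        return False  # be conservative if the checker is unavailable

if cpu_has_ssse3():
    from c_ssse_three import popcount     # hypothetical kernel name
else:
    from c_general import popcount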
It's been 5 years but I figured out a solution which I like better than my "CC" wrapper.
The "build_ext" command creates a self.compiler instance. The compiler.compile() method takes the list of all source files to compile. The base class does some setup, then has a compiler._compile() hook for a concrete compiler subclass to implement the actual per-file compilation step.
I felt that this was stable enough that I could intercept the code at that point.
I derived a new command from distutils.command.build_ext.build_ext which tweaks self.compiler._compile to wrap the bound class method with a one-off function attached to the instance:
from distutils.command.build_ext import build_ext

class build_ext_subclass(build_ext):
    def build_extensions(self):
        original__compile = self.compiler._compile

        def new__compile(obj, src, ext, cc_args, extra_postargs, pp_opts):
            if src != "src/popcount_SSSE3.c":
                extra_postargs = [s for s in extra_postargs if s != "-mssse3"]
            return original__compile(obj, src, ext, cc_args, extra_postargs, pp_opts)

        self.compiler._compile = new__compile
        try:
            build_ext.build_extensions(self)
        finally:
            del self.compiler._compile
I then told setup() to use this command-class:
setup(
    ...
    cmdclass = {"build_ext": build_ext_subclass}
)
Unfortunately the OP's solution will work only for Unix compilers. Here is a cross-compiler one:
(MSVC doesn't support automatic SSSE3 code generation, so I'll use AVX as an example.)
from setuptools import setup, Extension
import distutils.ccompiler

filename = 'example_avx'
compiler_options = {
    'unix': ('-mavx',),
    'msvc': ('/arch:AVX',)
}

def spawn(self, cmd, **kwargs):
    extra_options = compiler_options.get(self.compiler_type)
    if extra_options is not None:
        # filenames are closer to the end of command line
        for argument in reversed(cmd):
            # Check if argument contains a filename. We must check for all
            # possible extensions; checking for target extension is faster.
            if not argument.endswith(self.obj_extension):
                continue
            # check for a filename only to avoid building a new string
            # with variable extension
            off_end = -len(self.obj_extension)
            off_start = -len(filename) + off_end
            if argument.endswith(filename, off_start, off_end):
                if self.compiler_type == 'bcpp':
                    # Borland accepts a source file name at the end,
                    # insert the options before it
                    cmd[-1:-1] = extra_options
                else:
                    cmd += extra_options
            # we're done, restore the original method
            self.spawn = self.__spawn
            # filename is found, no need to search any further
            break
    distutils.ccompiler.spawn(cmd, dry_run=self.dry_run, **kwargs)

distutils.ccompiler.CCompiler.__spawn = distutils.ccompiler.CCompiler.spawn
distutils.ccompiler.CCompiler.spawn = spawn

setup(
    ...
    ext_modules = [
        Extension('extension_name', ['example.c', 'example_avx.c'])
    ],
    ...
)
See my answer here for a cross-compiler way to specify compiler/linker options in general.
