I have a Python package with a native extension compiled by Cython. Due to some performance needs, the compilation is done with -march=native, -mtune=native flags. This basically enables the compiler to use any of the available ISA extensions.
Additionally, we keep a non-cythonized, pure-python version of this package. It should be used in environments which are less performance sensitive.
Hence, in total we have two versions published:
Cythonized wheel built for a very specific platform
Pure-python wheel.
Some other packages depend on this package, and some of the machines differ from the one the package was compiled on. Because we used -march=native, those machines crash with SIGILL whenever an ISA extension used by the binary is missing on the server.
So, in essence, I'd like to somehow make pip disregard the native wheel if the host CPU is not compatible with the wheel.
The native wheel does have the cp37 tag and a platform name, but I don't see a way to define more granular ISA requirements there. I can always use the --implementation flag for pip, but I wonder if there's a better way for pip to differentiate between ISAs.
Thanks,
The pip infrastructure doesn't support such granularity.
I think a better approach would be to have two versions of the Cython-extension compiled: with -march=native and without, to install both and to decide at the run time which one should be loaded.
Here is a proof of concept.
The first hoop to jump through: how to check at run time which instructions are supported by the CPU/OS combination. For simplicity we will check only for AVX (this SO-post has more details), and I offer only a gcc-specific solution (see also this), called impl_picker.pyx:
cdef extern from *:
    """
    int cpu_supports_avx(void){
        return __builtin_cpu_supports("avx");
    }
    """
    int cpu_supports_avx()

def cpu_has_avx_support():
    return cpu_supports_avx() != 0
The second problem: the pyx-file and the module must have the same name. To avoid code duplication, the actual code is in a pxi-file:
# worker.pxi
cdef extern from *:
    """
    int compiled_with_avx(void){
    #ifdef __AVX__
        return 1;
    #else
        return 0;
    #endif
    }
    """
    int compiled_with_avx()

def compiled_with_avx_support():
    return compiled_with_avx() != 0
As one can see, the function compiled_with_avx_support will yield different results, depending on whether it was compiled with -march=native or not.
And now we can define two versions of the module just by including the actual code from the *.pxi-file. One module called worker_native.pyx:
# distutils: extra_compile_args=["-march=native"]
include "worker.pxi"
and worker_fallback.pyx:
include "worker.pxi"
Building everything, e.g. via cythonize -i -3 *.pyx, it can be used as follows:
from impl_picker import cpu_has_avx_support

# overhead once when imported:
if cpu_has_avx_support():
    import worker_native as worker
else:
    print("using fallback worker")
    import worker_fallback as worker

print("compiled_with_avx_support:", worker.compiled_with_avx_support())
On my machine the above would lead to compiled_with_avx_support: True, on older machines the "slower" worker_fallback will be used and the result will be compiled_with_avx_support: False.
The goal of this post is not to give a working setup.py, but just to outline how one could pick the correct version at run time. Obviously, the setup.py could be quite a bit more complicated: e.g. one would need to compile multiple c-files with different compiler settings (see this SO-post for how this could be achieved).
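The selection step can also be wrapped in a small helper. This is only a minimal sketch and not part of the proof of concept above: select_worker is a hypothetical name, and the module names are parameters so that the logic can be tried out without the compiled extensions.

```python
import importlib


def select_worker(cpu_has_avx, native_name="worker_native",
                  fallback_name="worker_fallback"):
    """Import and return the worker module matching the CPU capability.

    cpu_has_avx is a plain bool, e.g. the result of
    impl_picker.cpu_has_avx_support() from the proof of concept.
    """
    # Only the chosen module is imported, so the AVX build is never
    # loaded on a CPU that reported no AVX support.
    name = native_name if cpu_has_avx else fallback_name
    return importlib.import_module(name)
```

With the modules from above built, worker = select_worker(cpu_has_avx_support()) would replace the if/else block.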
I run conda 4.6.3 with python 3.7.2 win32. In python, when I import numpy, I see the RAM usage increase by 80MB. Since I am using multiprocessing, I wonder if this is normal and if there is any way to avoid this RAM overhead? Please see below all the versions of the relevant packages (from conda list):
python       3.7.2    h8c8aaf0_2
mkl_fft      1.0.10   py37h14836fe_0
mkl_random   1.0.2    py37h343c172_0
numpy        1.15.4   py37h19fb1c0_0
numpy-base   1.15.4   py37hc3f5095_0
thanks!
You can't avoid this cost, but it's likely not as bad as it seems. The numpy libraries (a copy of the C-only libopenblasp, plus all the Python numpy extension modules) occupy over 60 MB on disk, and they are all memory mapped into your Python process on import. Add the Python modules and the dynamically allocated memory involved in loading and initializing all of them, and 80 MB of increased reported RAM usage is pretty normal.
That said:
The C libraries and Python extension modules are memory mapped in, but that doesn't actually mean they occupy "real" RAM; if the code paths in a given page aren't exercised, the page will either never be loaded, or will be dropped under memory pressure (not even written to the page file, since it can always reload it from the original DLL).
On UNIX-like systems, when you fork (multiprocessing does this by default everywhere but Windows) that memory is shared between parent and worker processes in copy-on-write mode. Since the code itself is generally not written, the only cost is the page tables themselves (a tiny fraction of the memory they reference), and both parent and child will share that RAM.
Sadly, on Windows, fork isn't an option (unless you're running Ubuntu bash on Windows, in which case it's only barely Windows, effectively Linux), so you'll likely pay more of the memory costs in each process. But even there, libopenblasp, the C library backing large parts of numpy, will be remapped per process, but the OS should properly share that read-only memory across processes (and large parts, if not all, of the Python extension modules as well).
Basically, until this actually causes a problem (and it's unlikely to do so), don't worry about it.
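If you want to check the on-disk figure for your own installation, something like the following could be used. It is a rough sketch: package_disk_size is a hypothetical helper, and it only counts files inside the package directory, so shared libraries that a distribution installs elsewhere are not included.

```python
import importlib.util
import os


def package_disk_size(package_name, suffixes=None):
    """Return the total size in bytes of the files in an installed package.

    If suffixes is given (e.g. (".pyd", ".dll", ".so")), only files whose
    names end with one of them are counted.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        raise ValueError("not an importable package: %r" % package_name)
    total = 0
    for root in spec.submodule_search_locations:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if suffixes is None or filename.endswith(tuple(suffixes)):
                    total += os.path.getsize(os.path.join(dirpath, filename))
    return total
```

For example, package_disk_size("numpy", suffixes=(".pyd", ".dll", ".so")) gives the size of the binary parts that get memory mapped on import.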
NumPy is the fundamental package for scientific computing with Python.
It is a big package, designed to work with large datasets and optimized (primarily) for speed. If you look in its __init__.py (which gets executed when importing it (e.g.: import numpy)), you'll notice that it imports lots of items (packages / modules):
Those items may themselves import others
Some of them are extension modules (.pyds (.dlls) or .sos) which get loaded into the current process (their dependencies as well)
I've prepared a demo.
code.py:
#!/usr/bin/env python3

import sys
import os
import psutil
#import pprint


def main():
    display_text = "This {:s} screenshot was taken. Press <Enter> to continue ... "
    pid = os.getpid()
    print("Pid: {:d}\n".format(pid))
    p = psutil.Process(pid=pid)
    mod_names0 = set(k for k in sys.modules)
    mi0 = p.memory_info()

    input(display_text.format("first"))

    import numpy

    input(display_text.format("second"))
    mi1 = p.memory_info()
    for idx, mi in enumerate([mi0, mi1], start=1):
        print("\nMemory info ({:d}): {:}".format(idx, mi))

    print("\nExtra modules imported by `{:s}` :".format(numpy.__name__))
    print(sorted(set(k for k in sys.modules) - mod_names0))
    #pprint.pprint({k: v for k, v in sys.modules.items() if k not in mod_names0})
    print("\nDone.")


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
Output:
[cfati#CFATI-5510-0:e:\Work\Dev\StackOverflow\q054675983]> "e:\Work\Dev\VEnvs\py_064_03.06.08_test0\Scripts\python.exe" code.py
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)] on win32
Pid: 27160
This first screenshot was taken. Press <Enter> to continue ...
This second screenshot was taken. Press <Enter> to continue ...
Memory info (1): pmem(rss=15491072, vms=8458240, num_page_faults=4149, peak_wset=15495168, wset=15491072, peak_paged_pool=181160, paged_pool=180984, peak_nonpaged_pool=13720, nonpaged_pool=13576, pagefile=8458240, peak_pagefile=8458240, private=8458240)
Memory info (2): pmem(rss=27156480, vms=253882368, num_page_faults=7283, peak_wset=27205632, wset=27156480, peak_paged_pool=272160, paged_pool=272160, peak_nonpaged_pool=21640, nonpaged_pool=21056, pagefile=253882368, peak_pagefile=253972480, private=253882368)
Extra modules imported by `numpy` :
['_ast', '_bisect', '_blake2', '_compat_pickle', '_ctypes', '_decimal', '_hashlib', '_pickle', '_random', '_sha3', '_string', '_struct', 'argparse', 'ast', 'atexit', 'bisect', 'copy', 'ctypes', 'ctypes._endian', 'cython_runtime', 'decimal', 'difflib', 'gc', 'gettext', 'hashlib', 'logging', 'mtrand', 'numbers', 'numpy', 'numpy.__config__', 'numpy._distributor_init', 'numpy._globals', 'numpy._import_tools', 'numpy.add_newdocs', 'numpy.compat', 'numpy.compat._inspect', 'numpy.compat.py3k', 'numpy.core', 'numpy.core._internal', 'numpy.core._methods', 'numpy.core._multiarray_tests', 'numpy.core.arrayprint', 'numpy.core.defchararray', 'numpy.core.einsumfunc', 'numpy.core.fromnumeric', 'numpy.core.function_base', 'numpy.core.getlimits', 'numpy.core.info', 'numpy.core.machar', 'numpy.core.memmap', 'numpy.core.multiarray', 'numpy.core.numeric', 'numpy.core.numerictypes', 'numpy.core.records', 'numpy.core.shape_base', 'numpy.core.umath', 'numpy.ctypeslib', 'numpy.fft', 'numpy.fft.fftpack', 'numpy.fft.fftpack_lite', 'numpy.fft.helper', 'numpy.fft.info', 'numpy.lib', 'numpy.lib._datasource', 'numpy.lib._iotools', 'numpy.lib._version', 'numpy.lib.arraypad', 'numpy.lib.arraysetops', 'numpy.lib.arrayterator', 'numpy.lib.financial', 'numpy.lib.format', 'numpy.lib.function_base', 'numpy.lib.histograms', 'numpy.lib.index_tricks', 'numpy.lib.info', 'numpy.lib.mixins', 'numpy.lib.nanfunctions', 'numpy.lib.npyio', 'numpy.lib.polynomial', 'numpy.lib.scimath', 'numpy.lib.shape_base', 'numpy.lib.stride_tricks', 'numpy.lib.twodim_base', 'numpy.lib.type_check', 'numpy.lib.ufunclike', 'numpy.lib.utils', 'numpy.linalg', 'numpy.linalg._umath_linalg', 'numpy.linalg.info', 'numpy.linalg.lapack_lite', 'numpy.linalg.linalg', 'numpy.ma', 'numpy.ma.core', 'numpy.ma.extras', 'numpy.matrixlib', 'numpy.matrixlib.defmatrix', 'numpy.polynomial', 'numpy.polynomial._polybase', 'numpy.polynomial.chebyshev', 'numpy.polynomial.hermite', 'numpy.polynomial.hermite_e', 'numpy.polynomial.laguerre', 
'numpy.polynomial.legendre', 'numpy.polynomial.polynomial', 'numpy.polynomial.polyutils', 'numpy.random', 'numpy.random.info', 'numpy.random.mtrand', 'numpy.testing', 'numpy.testing._private', 'numpy.testing._private.decorators', 'numpy.testing._private.nosetester', 'numpy.testing._private.pytesttester', 'numpy.testing._private.utils', 'numpy.version', 'pathlib', 'pickle', 'pprint', 'random', 'string', 'struct', 'tempfile', 'textwrap', 'unittest', 'unittest.case', 'unittest.loader', 'unittest.main', 'unittest.result', 'unittest.runner', 'unittest.signals', 'unittest.suite', 'unittest.util', 'urllib', 'urllib.parse']
Done.
And the (before and after import) screenshots, taken with Process Explorer:
As a personal remark, I think that ~80 MiB (or whatever the exact amount is) is more than decent for the current "era", which is characterized by ridiculously high amounts of hardware resources, especially memory. Besides, that would probably be insignificant compared to the amount required by the arrays themselves. If that's not the case, you should probably consider moving away from numpy.
There could be a way to reduce the memory footprint, by selectively importing only the modules containing the features that you need (my personal advice is against it), and thus going around __init__.py:
You'd have to be an expert in numpy's internals
Modules must be imported "manually" (by file name), using importlib, the implementation of import (or alternatives)
Their dependencies will be imported / loaded as well (and because of this, I don't know how much memory you'd actually save)
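For reference, importing a module "manually" by file name could look like the sketch below. import_from_path is a hypothetical helper built on importlib.util; whether a given numpy submodule can actually be loaded in isolation this way is not something I have verified.

```python
import importlib.util
import sys


def import_from_path(module_name, file_path):
    """Load a single source file as a module, bypassing any package __init__.py."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load %r from %r" % (module_name, file_path))
    module = importlib.util.module_from_spec(spec)
    # Register before executing so that imports of module_name resolve to it.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

Any import statements inside the loaded file still run normally, which is why the dependencies end up in memory anyway.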
import rpy2.robjects as robjects
dffunc = sc.parallelize([(0,robjects.r.rnorm),(1,robjects.r.runif)])
dffunc.collect()
Outputs
[(0, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc28618 / R:0x26abd18>), (1, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc283d8 / R:0x26aad28>)]
While the partitioned version results in an error:
dffuncpart = dffunc.partitionBy(2)
dffuncpart.collect()
RuntimeError: ('R cannot evaluate code before being initialized.', <built-in function unserialize>
It seems like this error means that R wasn't loaded on one of the partitions, which I assume implies that the first import step was not performed there. Is there any way around this?
EDIT 1 This second example causes me to think there's a bug in the timing of pyspark or rpy2.
dffunc = sc.parallelize([(0,robjects.r.rnorm), (1,robjects.r.runif)]).partitionBy(2)
def loadmodel(model):
    import rpy2.robjects as robjects
    return model[1](2)

dffunc.map(loadmodel).collect()
Produces the same error R cannot evaluate code before being initialized.
dffuncpickle = sc.parallelize([(0,pickle.dumps(robjects.r.rnorm)),(1,pickle.dumps(robjects.r.runif))]).partitionBy(2)
def loadmodelpickle(model):
    import rpy2.robjects as robjects
    import pickle
    return pickle.loads(model[1])(2)

dffuncpickle.map(loadmodelpickle).collect()
Works just as expected.
I'd like to say that "this is not a bug in rpy2, this is a feature" but I'll realistically have to settle with "this is a limitation".
What is happening is that rpy2 has 2 interface levels. One is a low-level one (closer to R's C API) and available through rpy2.rinterface and the other one is a high-level interface with more bells and whistles, more "pythonic", and with classes for R objects inheriting from rinterface level-ones (that last part is important for the part about pickling below). Importing the high-level interface results in initializing (starting) the embedded R with default parameters if necessary. Importing the low-level interface rinterface does not have this side effect and the initialization of the embedded R must be performed explicitly (function initr). rpy2 was designed this way because the initialization of the embedded R can have parameters: importing first rpy2.rinterface, setting the initialization, then importing rpy2.robjects makes this possible.
In addition to that, the serialization (pickling) of R objects wrapped by rpy2 is currently only defined at the rinterface level (see the documentation). Pickling robjects-level (high-level) rpy2 objects uses the rinterface-level code, and when unpickled they will remain at that lower level (the Python pickle contains the module the class of the object is defined in and will import that module - here rinterface, which does not imply the initialization of the embedded R). The reason for things being this way is simply that it was "good enough for now": at the time this was implemented I had to simultaneously think of a good way to bridge two somewhat different languages and learn my way through the Python C-API and pickling/unpickling Python objects. Given the ease with which one can write something like
import rpy2.robjects
or
import rpy2.rinterface
rpy2.rinterface.initr()
before unpickling, this was never revisited. The uses of rpy2's pickling I know about are using Python's multiprocessing (and adding something similar to the import statements in the code initializing a child process was a cheap and sufficient fix). Maybe this is the time to look at this again. File a bug report for rpy2 if that is the case.
edit: this is undoubtedly an issue with rpy2. pickled robjects-level objects should unpickle back to robjects-level, not rinterface-level. I have opened an issue in the rpy2 tracker (and already pushed a rudimentary patch in the default/dev branch).
2nd edit: The patch is part of released rpy2 starting with version 2.7.7 (latest release at the time of writing is 2.7.8).
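For rpy2 versions without the patch, the workaround described above (make sure the embedded R is initialized before unpickling) can be wrapped in a helper. This is a minimal sketch: unpickle_after_init is a hypothetical name, and the module to import is a parameter so the pattern can be tried without rpy2 installed.

```python
import importlib
import pickle


def unpickle_after_init(payload, init_module="rpy2.robjects"):
    """Import init_module (for its side effects), then unpickle payload."""
    # Importing rpy2.robjects starts the embedded R if it is not running yet;
    # any other module name can be passed in to try the pattern without rpy2.
    importlib.import_module(init_module)
    return pickle.loads(payload)
```

Inside a Spark map function or a multiprocessing worker, this plays the same role as the import statements in loadmodelpickle from the question.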
In order to reduce development time of my Python based web application, I am trying to use reload() for the modules I have recently modified. The reload() happens through a dedicated web page (part of the development version of the web app) which lists the modules which have been recently modified (and the modified time stamp of py file is later than the corresponding pyc file). The full list of modules is obtained from sys.modules (and I filter the list to focus on only those modules which are part of my package).
Reloading individual python files seems to work in some cases and not in other cases. I guess, all the modules which depend on a modified module should be reloaded and the reloading should happen in proper order.
I am looking for a way to get the list of modules imported by a specific module. Is there any way to do this kind of introspection in Python?
I understand that my approach might not be 100% guaranteed and the safest way would be to reload everything, but if a fast approach works for most cases, it would be good enough for development purposes.
Response to comments regarding Django autoreloader
@Glenn Maynard, thanks, I had read about Django's autoreloader. My web app is based on Zope 3, and with the number of packages and a lot of ZCML-based initializations, a full restart takes about 10 to 30 seconds, or more if the database is bigger. I am attempting to cut down on the time spent during restart. When I feel I have made a lot of changes, I usually prefer to do a full restart, but more often I am changing a couple of lines here and there, for which I do not wish to spend so much time. The development setup is completely independent of the production setup, and usually if something goes wrong in a reload, it becomes obvious, since the application pages start showing illogical information or throwing exceptions. I am very much interested in exploring whether selective reload would work or not.
So - this answers "Find a list of modules which depend on a given one" - instead of how the question was initially phrased - which I answered above.
As it turns out, this is a bit more complex: one has to find the dependency tree for all loaded modules, and invert it for each module, while preserving a loading order that would not break things.
I have also posted this to the Brazilian Python wiki at:
http://www.python.org.br/wiki/RecarregarModulos
#! /usr/bin/env python
# coding: utf-8
# Author: João S. O. Bueno
# Copyright (c) 2009 - Fundação CPqD
# License: LGPL V3.0
# Note: Python 2 code (it relies on types.ClassType and the built-in reload()).

from types import ModuleType, FunctionType, ClassType
import sys


def find_dependent_modules():
    """Maps each loaded module to the set of modules it uses (one level)"""
    tree = {}
    for module in sys.modules.values():
        if module is None:
            continue
        tree[module] = set()
        for attr_name in dir(module):
            attr = getattr(module, attr_name)
            if isinstance(attr, ModuleType):
                tree[module].add(attr)
            elif type(attr) in (FunctionType, ClassType):
                # __module__ is a name (a string): look up the module object
                # so that the tree only ever contains modules
                dep_module = sys.modules.get(attr.__module__)
                if dep_module is not None:
                    tree[module].add(dep_module)
    return tree


def get_reversed_first_level_tree(tree):
    """Creates the one-level inverted tree: module -> modules that use it"""
    new_tree = {}
    for module, dependencies in tree.items():
        for dep_module in dependencies:
            if dep_module is module:
                continue
            if dep_module not in new_tree:
                new_tree[dep_module] = set([module])
            else:
                new_tree[dep_module].add(module)
    return new_tree


def find_dependants_recurse(key, rev_tree, previous=None):
    """Given a one-level dependency tree dictionary,
    recursively builds a non-repeating list of all dependant
    modules
    """
    if previous is None:
        previous = set()
    if key not in rev_tree:
        return []
    this_level_dependants = set(rev_tree[key])
    next_level_dependants = set()
    for dependant in this_level_dependants:
        if dependant in previous:
            continue
        tmp_previous = previous.copy()
        tmp_previous.add(dependant)
        next_level_dependants.update(
            find_dependants_recurse(dependant, rev_tree,
                                    previous=tmp_previous,
                                    ))
    # ensures reloading order on the final list
    # by postponing the reload of modules in this level
    # that also appear later on the tree
    dependants = (list(this_level_dependants.difference(
                      next_level_dependants)) +
                  list(next_level_dependants))
    return dependants


def get_reversed_tree():
    """
    Returns a dictionary mapping each loaded module to
    the list of modules that depend on it, in an order
    that can be used for reloading
    """
    tree = find_dependent_modules()
    rev_tree = get_reversed_first_level_tree(tree)
    compl_tree = {}
    for module, dependant_modules in rev_tree.items():
        compl_tree[module] = find_dependants_recurse(module, rev_tree)
    return compl_tree


def reload_dependences(module):
    """
    reloads given module and all modules that
    depend on it, directly and otherwise.
    """
    tree = get_reversed_tree()
    reload(module)
    # a module nobody depends on has no entry in the tree
    for dependant in tree.get(module, []):
        reload(dependant)
This worked nicely in all the tests I made here - but I would not recommend abusing it.
But for updating a running zope2 server after editing a few lines of code, I think I would use this myself.
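The "invert the tree while keeping a safe order" step can be looked at in isolation on plain module names. The following is a small sketch in Python 3, separate from the script above: invert_dependencies and reload_order are hypothetical helpers, and the input dictionary (module name -> names it imports) is assumed to be collected by some other means.

```python
def invert_dependencies(imports):
    """Turn {module: modules it imports} into {module: modules importing it}."""
    dependants = {name: set() for name in imports}
    for name, imported in imports.items():
        for dep in imported:
            dependants.setdefault(dep, set()).add(name)
    return dependants


def reload_order(changed, imports):
    """List changed and everything depending on it, dependencies first."""
    dependants = invert_dependencies(imports)
    visited = set()
    postorder = []

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for dependant in sorted(dependants.get(name, ())):
            visit(dependant)
        postorder.append(name)

    visit(changed)
    # Reversed DFS post-order is a topological order for acyclic graphs:
    # a module always comes before the modules that import it.
    return postorder[::-1]
```

With circular imports no fully safe order exists; the function still terminates, but the result is then only a best effort.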
You might want to take a look at Ian Bicking's Paste reloader module, which does what you want already:
http://pythonpaste.org/modules/reloader?highlight=reloader
It doesn't give you specifically a list of dependent files (which is only technically possible if the packager has been diligent and properly specified dependencies), but looking at the code will give you an accurate list of modified files for restarting the process.
Some introspection to the rescue:
from types import ModuleType

def find_modules(module, all_mods = None):
    if all_mods is None:
        all_mods = set([module])
    for item_name in dir(module):
        item = getattr(module, item_name)
        if isinstance(item, ModuleType) and item not in all_mods:
            all_mods.add(item)
            find_modules(item, all_mods)
    return all_mods
This gives you a set with all loaded modules - just call the function with your first module as a sole parameter. You can then iterate over the resulting set reloading it, as simply as:
[reload (m) for m in find_modules(<module>)]
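Since the question mentions filtering down to the modules of one package, here is a variant of the same idea for Python 3. Treat it as a sketch: the prefix parameter is an illustrative addition, and vars() is used instead of dir()/getattr() so that only attributes stored on the module itself are inspected.

```python
from types import ModuleType


def find_modules(module, prefix="", seen=None):
    """Return the set of modules reachable from module via its attributes.

    Only modules whose __name__ starts with prefix are followed, which keeps
    the search inside one package (prefix is an illustrative addition).
    """
    if seen is None:
        seen = {module}
    for value in list(vars(module).values()):
        if (isinstance(value, ModuleType)
                and value not in seen
                and value.__name__.startswith(prefix)):
            seen.add(value)
            find_modules(value, prefix, seen)
    return seen
```

In Python 3 each module in the result can be passed to importlib.reload, which replaces the built-in reload used above.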