I have written a Python interface to a process-centric job distribution system we're developing/using internally at my workplace. While they are reasonably skilled programmers, the primary people using this interface are research scientists, not software developers, so ease of use and keeping the interface out of the way to the greatest degree possible are paramount.
My library unrolls a sequence of inputs into a sequence of pickle files on a shared file server, then spawns jobs that load those inputs, perform the computation, pickle the results, and exit; the client script then picks back up and produces a generator that loads and yields the results (or re-raises any exception the calculation function raised).
This is only useful since the calculation function itself is one of the serialized inputs. cPickle is quite content to pickle function references, but requires the pickled function to be reimportable in the same context. This is problematic. I've already solved the problem of finding the module to reimport it, but the vast majority of the time, it is a top-level function that is pickled and, thus, does not have a module path. The only strategy I've found to be able to unpickle such a function on the computation nodes is this nauseating little approach towards simulating the original environment in which the function was pickled before unpickling it:
...
# At this point, we've identified the source of the target function.
# A string by its name lives in "modname".
# In the real code, there is significant try/except work here.
targetModule = __import__(modname)
globalRef = globals()
for thingie in dir(targetModule):
    if thingie not in globalRef:
        globalRef[thingie] = targetModule.__dict__[thingie]
# sys.argv[2]: the path to the pickle file common to all jobs, which contains
# any data in common to all invocations of the target function, then the
# target function itself
commonFile = open(sys.argv[2], "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()
# the actual function unpack I'm having trouble with:
doIt = commonUnpickle.load()
The final line is the most important one here: it's where my module picks up the function it should actually be running. This code, as written, works as desired, but directly manipulating the symbol tables like this is unsettling.
How can I do this, or something very much like this that does not force the research scientists to separate their calculation scripts into a proper class structure (they use Python like the most excellent graphing calculator ever and I would like to continue to let them do so) the way Pickle desperately wants, without the unpleasant, unsafe, and just plain scary __dict__-and-globals() manipulation I'm using above? I fervently believe there has to be a better way, but exec "from {0} import *".format("modname") didn't do it, several attempts to inject the pickle load into the targetModule reference didn't do it, and eval("commonUnpickle.load()", targetModule.__dict__, locals()) didn't do it. All of these fail with Unpickle's AttributeError over being unable to find the function in <module>.
What is a better way?
Pickling functions can be rather annoying if trying to move them into a different context. If the function does not reference anything from the module that it is in and references (if anything) modules that are guaranteed to be imported, you might check some code from a Rudimentary Database Engine found on the Python Cookbook.
In order to support views, the academic module grabs the code from the callable when pickling the query. When it comes time to unpickle the view, a LambdaType instance is created with the code object and a reference to a namespace containing all imported modules. The solution has limitations but worked well enough for the exercise.
Example for Views
import sys
import types

class _View:

    def __init__(self, database, query, *name_changes):
        "Initializes _View instance with details of saved query."
        self.__database = database
        self.__query = query
        self.__name_changes = name_changes

    def __getstate__(self):
        "Returns everything needed to pickle _View instance."
        return self.__database, self.__query.__code__, self.__name_changes

    def __setstate__(self, state):
        "Sets the state of the _View instance when unpickled."
        database, query, name_changes = state
        self.__database = database
        self.__query = types.LambdaType(query, sys.modules)
        self.__name_changes = name_changes
Sometimes it appears necessary to make modifications to the registered modules available in the system. If, for example, you need to make reference to the first module (__main__), you may need to create a new module object with your available namespace loaded into it. The same recipe used the following technique.
Example for Modules
def test_northwind():
    "Loads and runs some test on the sample Northwind database."
    import sys, os, imp
    # Patch the module namespace to recognize this file.
    name = os.path.splitext(os.path.basename(sys.argv[0]))[0]
    module = imp.new_module(name)
    vars(module).update(globals())
    sys.modules[name] = module
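As a standalone sketch of the LambdaType reconstruction used in the recipe above (the function name here is made up and is not part of the recipe):
import sys
import types

def double(x):
    # References nothing from its home module, only its argument.
    return x * 2

# This is what __getstate__ above stores instead of the function itself.
code = double.__code__

# Rebuild a callable from the bare code object; using sys.modules as the
# globals namespace lets the rebuilt function resolve already-imported
# modules by name, as the recipe does.
rebuilt = types.LambdaType(code, sys.modules)
print(rebuilt(21))  # -> 42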
Your question was long, and I was too caffeinated to make it all the way through… However, I think you are looking to do something that there's a pretty good existing solution for already. There's a fork of the parallel python (i.e. pp) library that takes functions and objects and serializes them, sends them to different servers, and then unpickles and executes them. The fork lives inside the pathos package, but you can download it independently here:
http://danse.cacr.caltech.edu/packages/dev_danse_us
The "other context" in that case is another server… and the objects are transported by converting the objects to source code and then back to objects.
If you are looking to use pickling, much in the way you are doing already, there's an extension to mpi4py that serializes arguments and functions, and returns pickled return values… The package is called pyina, and is commonly used to ship code and objects to cluster nodes in coordination with a cluster scheduler.
Both pathos and pyina provide map abstractions (and pipe), and try to hide all of the details of parallel computing behind the abstractions, so scientists don't need to learn anything except how to program normal serial python. They just use one of the map or pipe functions, and get parallel or distributed computing.
Oh, I almost forgot. The dill serializer includes dump_session and load_session functions that allow the user to easily serialize their entire interpreter session and send it to another computer (or just save it for later use). That's pretty handy for changing contexts, in a different sense.
Get dill, pathos, and pyina here: https://github.com/uqfoundation
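As a quick illustration of what the dill serializer can do with a top-level function (a sketch, assuming dill is installed; the function name is made up):
import dill

def calculate(x, y):
    # A top-level function defined in __main__, which plain pickle can
    # reference but often cannot re-import on another node.
    return x ** 2 + y

# dill can serialize the function body itself (for functions defined in
# __main__), not just a module reference, so the bytes can be shipped to
# a worker process and loaded there.
payload = dill.dumps(calculate)
restored = dill.loads(payload)
print(restored(3, 4))  # -> 13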
For a module to be recognized as loaded, I think it must be in sys.modules, not just have its content imported into your global/local namespace. Try to exec everything, then get the result out of an artificial environment.
env = {"fn": sys.argv[2]}
code = """\
import %s  # maybe more
import cPickle
commonFile = open(fn, "rb")
commonUnpickle = cPickle.Unpickler(commonFile)
commonData = commonUnpickle.load()
doIt = commonUnpickle.load()
"""
exec code % modname in env  # substitute the module name, then run in the artificial environment
return env["doIt"]
While functions are advertised as first-class objects in Python, this is one case where it can be seen that they are really second-class objects. It is the reference to the callable, not the object itself, that is pickled. (You cannot directly pickle a lambda expression.)
There is an alternate usage of __import__ that you might prefer:
import sys
from functools import partial

def importer(modulename, symbols=None):
    u"importer('foo') returns module foo; importer('foo', ['bar']) returns {'bar': object}"
    if modulename in sys.modules: module = sys.modules[modulename]
    else: module = __import__(modulename, fromlist=['*'])
    if symbols is None: return module
    else: return dict(zip(symbols, map(partial(getattr, module), symbols)))
So these would all be basically equivalent:
from mymodule.mysubmodule import myfunction
myfunction = importer('mymodule.mysubmodule').myfunction
globals()['myfunction'] = importer('mymodule.mysubmodule', ['myfunction'])['myfunction']
Related
Is there any way to test a pickle file to see if it loads a function or class during unpickling?
This gives a good summary of how to stop loading of selected functions:
https://docs.python.org/3/library/pickle.html#restricting-globals
I assume it could be used to check if there is function loading at all, by simply blocking all function loading and getting an error message.
But is there a way to write a function that will simply say: there is only text data in this pickled object and no function loading?
I can't say I know which builtins are safe!
Basically no, there is truly no way. There is a lot written about this. You can only use pickle if you trust the source, and you get the pickle directly from the source.
Any safety measures you perform are not sufficient to protect against malicious attempts whatsoever.
https://medium.com/ochrona/python-pickle-is-notoriously-insecure-d6651f1974c9
https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html
etcetera.
I use it sometimes, but usually only after I have had a phone call with a colleague and we have shared a pickled file. More often, though, I use it for myself in my local environment to store data. Still, this is not the preferred way, but it's fast.
So, when in doubt. Do not use pickle.
Thanks for the answers!
Straight pickle is too prone to security issues.
Picklemagic claims to fix the security issues; it looks like it does, but I can't quite confirm that: http://github.com/CensoredUsername/picklemagic
But https://medium.com/ochrona/python-pickle-is-notoriously-insecure-d6651f1974c9 suggests that there is no safe wrapper (picklemagic has been around for 8 years, and the article dates from 2021; so was picklemagic not considered?)
The only surefire way to protect against pickle bombs is not to use pickle directly. Unfortunately, unlike other unsafe standard library packages, there are no safe wrappers or drop-in alternatives available for pickle, like defusedxml for xml or tarsafe for tarfile. Further, there's not a great way to inspect a pickle prior to unpickling or to block unsafe function calls invoked by REDUCE.
The 3.10 docs do offer a wrapper for blocking unauthorized execution of a function. https://docs.python.org/3.10/tutorial/controlflow.html
It does not say which builtins are safe. If os is removed, are the others safe? Still, if it is clear what is supposed to be in the pickled object, it may be easy enough to restrict execution.
import builtins
import io
import pickle
safe_builtins = {
    'range',
    'complex',
    'set',
    'frozenset',
    'slice',
}

class RestrictedUnpickler(pickle.Unpickler):

    def find_class(self, module, name):
        # Only allow safe classes from builtins.
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        # Forbid everything else.
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" %
                                     (module, name))

def restricted_loads(s):
    """Helper function analogous to pickle.loads()."""
    return RestrictedUnpickler(io.BytesIO(s)).load()
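As a rough illustration of the effect (assuming the RestrictedUnpickler above is in scope): plain built-in data loads fine, while anything that needs to resolve a disallowed global, such as a function reference, is rejected before it can run:
import pickle

# Plain containers of built-in types load fine.
print(restricted_loads(pickle.dumps({1, 2, 3})))  # -> {1, 2, 3}

# A function reference is stored as a global lookup, so find_class()
# rejects it before anything is executed.
try:
    restricted_loads(pickle.dumps(len))
except pickle.UnpicklingError as exc:
    print(exc)  # global 'builtins.len' is forbidden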
I'm trying to put together a small build system in Python that generates Ninja files for my C++ project. Its behavior should be similar to CMake; that is, a bldfile.py script defines rules and targets and optionally recurses into one or more directories by calling bld.subdir(). Each bldfile.py script has a corresponding bld.File object. When the bldfile.py script is executing, the bld global should be predefined as that file's bld.File instance, but only in that module's scope.
Additionally, I would like to take advantage of Python's bytecode caching somehow, but the .pyc file should be stored in the build output directory instead of in a __pycache__ directory alongside the bldfile.py script.
I know I should use importlib (requiring Python 3.4+ is fine), but I'm not sure how to:
Load and execute a module file with custom globals.
Re-use the bytecode caching infrastructure.
Any help would be greatly appreciated!
Injecting globals into a module before execution is an interesting idea. However, I think it conflicts with several points of the Zen of Python. In particular, it requires writing code in the module that depends on global values which are not explicitly defined, imported, or otherwise obtained - unless you know the particular procedure required to call the module.
This may be an obvious or slick solution for the specific use case but it is not very intuitive. In general, (Python) code should be explicit. Therefore, I would go for a solution where parameters are explicitly passed to the executing code. Sounds like functions? Right:
bldfile.py
def exec(bld):
    print('Working with bld:', bld)
    # ...
calling the module:
# set bld
# Option 1: static import
import bldfile
bldfile.exec(bld)
# Option 2: dynamic import if bldfile.py is located dynamically
import importlib.util
spec = importlib.util.spec_from_file_location("unique_name", "subdir/subsubdir/bldfile.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
module.exec(bld)
That way no code (apart from the function definition) is executed when importing the module. The exec function needs to be called explicitly and when looking at the code inside exec it is clear where bld comes from.
I studied importlib's source code and, since I don't intend to make a reusable Loader, it seems like a lot of unnecessary complexity. So I just settled on creating a module with types.ModuleType, adding bld to the module's __dict__, compiling and caching the bytecode with compile, and executing the module with exec. At a low level, that's basically all importlib does anyway.
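A minimal sketch of that approach (the helper name and cache layout are made up for illustration, and a real .pyc also carries a header with a magic number and timestamps, which this skips):
import marshal
import os
import types

def load_bldfile(path, bld, cache_path):
    """Execute a bldfile.py with ``bld`` predefined, caching its bytecode."""
    code = None
    # Reuse the cached code object if it is at least as new as the source.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(path)):
        with open(cache_path, "rb") as f:
            code = marshal.load(f)
    if code is None:
        with open(path, "r") as f:
            code = compile(f.read(), path, "exec")
        with open(cache_path, "wb") as f:
            marshal.dump(code, f)
    module = types.ModuleType("bldfile")
    module.__file__ = path
    module.bld = bld  # inject the per-file File object before execution
    exec(code, module.__dict__)
    return module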
It is possible to work around the missing functionality by using a dummy module which loads its globals from another module:
# service.py
import importlib

module = importlib.import_module('userset')
module.user = user  # 'user' is whatever object you want to expose
module = importlib.import_module('config')

# config.py
from userset import *
# now you can use user from service.py
A proper Python module will list all its public symbols in a list called __all__. Managing that list can be tedious, since you'll have to list each symbol twice. Surely there are better ways, probably using decorators, so one would merely annotate the exported symbols as @export.
How would you write such a decorator? I'm certain there are different ways, so I'd like to see several answers with enough information that users can compare the approaches against one another.
In Is it a good practice to add names to __all__ using a decorator?, Ed L suggests the following, to be included in some utility library:
import sys
def export(fn):
    """Use a decorator to avoid retyping function/class names.

    * Based on an idea by Duncan Booth:
      http://groups.google.com/group/comp.lang.python/msg/11cbb03e09611b8a
    * Improved via a suggestion by Dave Angel:
      http://groups.google.com/group/comp.lang.python/msg/3d400fb22d8a42e1
    """
    mod = sys.modules[fn.__module__]
    if hasattr(mod, '__all__'):
        name = fn.__name__
        all_ = mod.__all__
        if name not in all_:
            all_.append(name)
    else:
        mod.__all__ = [fn.__name__]
    return fn
We've adapted the name to match the other examples. With this in a local utility library, you'd simply write
from .utility import export
and then start using @export. Just one line of idiomatic Python; you can't get much simpler than this. On the downside, the decorator does require access to the module via the __module__ property and the sys.modules cache, both of which may be problematic in some of the more esoteric setups (like custom import machinery, or wrapping functions from another module to create functions in this module).
The python part of the atpublic package by Barry Warsaw does something similar to this. It offers some keyword-based syntax, too, but the decorator variant relies on the same patterns used above.
This great answer by Aaron Hall suggests something very similar, with two more lines of code as it doesn't use __dict__.setdefault. It might be preferable if manipulating the module __dict__ is problematic for some reason.
You could simply declare the decorator at the module level like this:
__all__ = []

def export(obj):
    __all__.append(obj.__name__)
    return obj
This is perfect if you only use this in a single module. At 4 lines of code (plus probably some empty lines for typical formatting practices) it's not overly expensive to repeat this in different modules, but it does feel like code duplication in those cases.
You could define the following in some utility library:
def exporter():
    all = []
    def decorator(obj):
        all.append(obj.__name__)
        return obj
    return decorator, all

export, __all__ = exporter()
export(exporter)

# possibly some other utilities, decorated with @export as well
Then inside your public library you'd do something like this:
from . import utility
export, __all__ = utility.exporter()
# start using @export
Using the library takes two lines of code here. It combines the definition of __all__ and the decorator. So people searching for one of them will find the other, thus helping readers to quickly understand your code. The above will also work in exotic environments, where the module may not be available from the sys.modules cache or where the __module__ property has been tampered with or some such.
https://github.com/russianidiot/public.py has yet another implementation of such a decorator. Its core file is currently 160 lines long! The crucial point appears to be that it uses the inspect module to obtain the appropriate module based on the current call stack.
This is not a decorator approach, but provides the level of efficiency I think you're after.
https://pypi.org/project/auto-all/
You can use the two functions provided with the package to "start" and "end" capturing the module objects that you want included in the __all__ variable.
from auto_all import start_all, end_all

# Imports outside the start and end functions won't be externally available.
from pathlib import Path

def a_private_function():
    print("This is a private function.")

# Start defining externally accessible objects
start_all(globals())

def a_public_function():
    print("This is a public function.")

# Stop defining externally accessible objects
end_all(globals())
The functions in the package are trivial (a few lines), so could be copied into your code if you want to avoid external dependencies.
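Roughly, such a pair of helpers can be written by diffing the module namespace before and after the public section; this is only a sketch of the idea, not the actual auto_all source:
def start_all(global_vars):
    # Remember which names exist before the public section begins.
    global_vars['_auto_all_start'] = set(global_vars)

def end_all(global_vars):
    # Everything added since start_all() is treated as public.
    added = set(global_vars) - global_vars['_auto_all_start']
    added.discard('_auto_all_start')
    global_vars['__all__'] = sorted(added)
    del global_vars['_auto_all_start']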
While other variants are technically correct to a certain extent, one might also want to make sure that:
if the target module already has __all__ declared, it is handled correctly;
target appears in __all__ only once:
# utils.py
import sys
from typing import Any
def export(target: Any) -> Any:
    """
    Mark a module-level object as exported.
    Simplifies tracking of objects available via wildcard imports.
    """
    mod = sys.modules[target.__module__]
    __all__ = getattr(mod, '__all__', None)
    if __all__ is None:
        __all__ = []
        setattr(mod, '__all__', __all__)
    elif not isinstance(__all__, list):
        __all__ = list(__all__)
        setattr(mod, '__all__', __all__)
    target_name = target.__name__
    if target_name not in __all__:
        __all__.append(target_name)
    return target
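For completeness, a hypothetical usage example (the module and function names are made up):
# mylib.py
from utils import export

@export
def public_api():
    return "reachable via wildcard import"

def _helper():
    return "left out of __all__"

# Elsewhere, `from mylib import *` now brings in public_api only,
# because __all__ ends up as ['public_api'].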
I'd like to have several of the doctests in a file share test data and/or functions. Is there a way to do this without locating them in an external file or within the code of the file being tested?
update
"""This is the docstring for the module ``fish``.
I've discovered that I can access the module under test
from within the doctest, and make changes to it, eg
>>> import fish
>>> fish.data = {1: 'red', 2: 'blue'}
"""
def jef():
    """
    Modifications made to the module will persist across subsequent tests:
    >>> import fish
    >>> fish.data[1]
    'red'
    """
    pass
def seuss():
    """
    Although the doctest documentation claims that
    "each time doctest finds a docstring to test,
    it uses a shallow copy of M's globals",
    modifications made to the module by a doctest
    are not imported into the context of subsequent docstrings:
    >>> data
    Traceback (most recent call last):
    ...
    NameError: name 'data' is not defined
    """
    pass
So I guess that doctest copies the module once, and then copies the copy for each docstring?
In any case, importing the module into each docstring seems usable, if awkward.
I'd prefer to use a separate namespace for this, to avoid accidentally trampling on actual module data that will or will not be imported into subsequent tests in a possibly undocumented manner.
It's occurred to me that it's (theoretically) possible to dynamically create a module in order to contain this namespace. However as yet I've not gotten any direction on how to do that from the question I asked about that a while back. Any information is quite welcome! (as a response to the appropriate question)
In any case I'd prefer to have the changes be propagated directly into the namespace of subsequent docstrings. So my original question still stands, with that as a qualifier.
This is the sort of thing that causes people to turn away from doctests: as your tests grow in complexity, you need real programming tools to be able to engineer your tests just as you would engineer your product code.
I don't think there's a way to include shared data or functions in doctests other than defining them in your product code and then using them in the doctests.
You are going to need to use real code to define some of your test infrastructure. If you like doctests, you can use that infrastructure from your doctests.
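For example (a sketch; the helper module name is made up), shared fixtures can live in a small ordinary module that each docstring imports explicitly:
# _doctest_fixtures.py -- ordinary code, importable by any docstring
data = {1: 'red', 2: 'blue'}

def make_fish():
    """Build a fresh copy so tests can't trample each other's data."""
    return dict(data)
A docstring would then begin with >>> from _doctest_fixtures import data, make_fish and proceed from there.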
This is possible, albeit perhaps not advertised as loudly.
To obtain literate modules with tests that all use a shared execution context (i.e. individual tests can share and re-use their results), one has to look at the relevant part of the documentation, which says:
... each time doctest finds a docstring to test, it uses a shallow copy of M's globals, so that running tests doesn't change the module's real globals, and so that one test in M can't leave behind crumbs that accidentally allow another test to work.
...
You can force use of your own dict as the execution context by passing globs=your_dict to testmod() or testfile() instead.
Given this, I managed to reverse-engineer from the doctest module that, besides using copies (i.e. the dict's copy() method), it also clears the globals dict (using clear()) after each test.
Thus, one can patch their own globals dictionary with something like:
class Context(dict):
    def clear(self):
        pass
    def copy(self):
        return self
and then use it as:
import doctest
from importlib import import_module

module = import_module('some.module')

doctest.testmod(module,
                # Make a copy of globals so tests in this
                # module don't affect the tests in another
                globs=Context(module.__dict__.copy()))
I'd like to dynamically create a module from a dictionary, and I'm wondering if adding an element to sys.modules is really the best way to do this. EG
context = {'a': 1, 'b': 2}
import types
test_context_module = types.ModuleType('TestContext', 'Module created to provide a context for tests')
test_context_module.__dict__.update(context)
import sys
sys.modules['TestContext'] = test_context_module
My immediate goal in this regard is to be able to provide a context for timing test execution:
import timeit
timeit.Timer('a + b', 'from TestContext import *')
It seems that there are other ways to do this, since the Timer constructor takes objects as well as strings. I'm still interested in learning how to do this though, since a) it has other potential applications; and b) I'm not sure exactly how to use objects with the Timer constructor; doing so may prove to be less appropriate than this approach in some circumstances.
EDITS/REVELATIONS/PHOOEYS/EUREKA:
I've realized that the example code relating to running timing tests won't actually work, because import * only works at the module level, and the context in which that statement is executed is that of a function in the testit module. In other words, the globals dictionary used when executing that code is that of __main__, since that's where I was when I wrote the code in the interactive shell. So that rationale for figuring this out is a bit botched, but it's still a valid question.
I've discovered that the code run in the first set of examples has the undesirable effect that the namespace in which the newly created module's code executes is that of the module in which it was declared, not its own module. This is like way weird, and could lead to all sorts of unexpected rattlesnakeic sketchiness. So I'm pretty sure that this is not how this sort of thing is meant to be done, if it is in fact something that the Guido doth shine upon.
The similar-but-subtly-different case of dynamically loading a module from a file that is not in python's include path is quite easily accomplished using imp.load_source('NewModuleName', 'path/to/module/module_to_load.py'). This does load the module into sys.modules. However this doesn't really answer my question, because really, what if you're running python on an embedded platform with no filesystem?
I'm battling a considerable case of information overload at the moment, so I could be mistaken, but there doesn't seem to be anything in the imp module that's capable of this.
But the question, essentially, at this point is how to set the global (ie module) context for an object. Maybe I should ask that more specifically? And at a larger scope, how to get Python to do this while shoehorning objects into a given module?
Hmm, well one thing I can tell you is that the timeit function actually executes its code using the module's global variables. So in your example, you could write
import timeit
timeit.a = 1
timeit.b = 2
timeit.Timer('a + b').timeit()
and it would work. But that doesn't address your more general problem of defining a module dynamically.
Regarding the module definition problem, it's definitely possible and I think you've stumbled on to pretty much the best way to do it. For reference, the gist of what goes on when Python imports a module is basically the following:
module = imp.new_module(name)
execfile(file, module.__dict__)
That's kind of the same thing you do, except that you load the contents of the module from an existing dictionary instead of a file. (I don't know of any difference between types.ModuleType and imp.new_module other than the docstring, so you can probably use them interchangeably) What you're doing is somewhat akin to writing your own importer, and when you do that, you can certainly expect to mess with sys.modules.
As an aside, even if your import * thing was legal within a function, you might still have problems because oddly enough, the statement you pass to the Timer doesn't seem to recognize its own local variables. I invoked a bit of Python voodoo by the name of extract_context() (it's a function I wrote) to set a and b at the local scope and ran
print timeit.Timer('print locals(); a + b', 'sys.modules["__main__"].extract_context()').timeit()
Sure enough, the printout of locals() included a and b:
{'a': 1, 'b': 2, '_timer': <built-in function time>, '_it': repeat(None, 999999), '_t0': 1277378305.3572791, '_i': None}
but it still complained NameError: global name 'a' is not defined. Weird.