I'm attempting to broadcast a module to other Python processes with MPI. Of course, a module itself isn't picklable, but its __dict__ is. Currently, I'm pickling the __dict__ and making a new module in the receiving process. This worked perfectly with some simple, custom modules. However, when I try to do this with NumPy, there's one thing that I can't pickle easily: the ufunc.
I've read a thread (linked in the EDIT below) that suggests pickling the ufunc's __name__ and __module__, but it seems to rely on having NumPy fully built and present before the ufunc is rebuilt. I need to avoid using the import statement altogether in the receiving process, so I'm curious whether the getattr(numpy, name) approach mentioned there would work with a module that doesn't have its ufuncs included yet.
Also, I don't see a __module__ attribute on the ufunc in the NumPy documentation:
http://docs.scipy.org/doc/numpy/reference/ufuncs.html
Any help or suggestions, please?
EDIT: Sorry, I forgot to include the thread mentioned above: http://mail.scipy.org/pipermail/numpy-discussion/2007-January/025778.html
Pickling a function in Python only serializes its name and the module it comes from. It does not transport code over the wire, so when unpickling you need the same libraries available as when pickling. On unpickling, Python simply imports the module in question and grabs the items via getattr. (This is not limited to NumPy; it applies to pickling in general.)
Ufuncs don't pickle cleanly, which is a wart. Your main option, then, is to pickle just the __name__ (and maybe the __class__) of the ufunc and reconstruct it later on manually. (Ufuncs are not actual Python functions and do not have a __module__ attribute.)
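A minimal sketch of that name-based approach, wired into pickle via copyreg (copy_reg on Python 2); note that newer NumPy versions register a similar reduction themselves:

import copyreg
import pickle
import numpy

def _reconstruct_ufunc(name):
    # In the receiving process this could equally be a getattr on
    # whatever module object actually holds the ufuncs.
    return getattr(numpy, name)

def _reduce_ufunc(func):
    # Serialize only the name; the ufunc is looked up again on load.
    return _reconstruct_ufunc, (func.__name__,)

copyreg.pickle(numpy.ufunc, _reduce_ufunc)

assert pickle.loads(pickle.dumps(numpy.add)) is numpy.add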
Related
I have an application that dynamically generates a lot of Python modules with class factories, to eliminate redundant boilerplate that makes the code hard to debug across similar implementations. It works well, except that the dynamic generation of the classes across the modules (hundreds of them) takes more time to load than simply importing from a file. So I would like to find a way to save the modules to a file after generation (unless reset), then load from those files to cut down on bootstrap time for the platform.
Does anyone know how I can save/export auto-generated Python modules to a file for re-import later? I already know that pickling and exporting as a JSON object won't work, because the modules make use of thread locks and other dynamic state, and because the classes must be defined before they can be pickled. I need to save the actual class definitions, not instances. The classes are defined with the type() function.
If you have ideas or knowledge about how to do this, I would really appreciate your input.
You’re basically asking how to write a compiler whose input is a module object and whose output is a .pyc file. (One plausible strategy is of course to generate a .py and then byte-compile that in the usual fashion; the following could even be adapted to do so.) It’s fairly easy to do this for simple cases: the .pyc format is very simple (but note the comments there), and the marshal module does all of the heavy lifting for it. One point of warning that might be obvious: if you’ve already evaluated, say, os.getcwd() when you generate the code, that’s not at all the same as evaluating it when loading it in a new process.
The “only” other task is constructing the code objects for the module and each class: this requires concatenating a large number of boring values from the dis module, and will fail if any object encountered is non-trivial. These might be global/static variables/constants or default argument values: if you can alter your generator to produce modules directly, you can probably wrap all of these (along with anything else you want to defer) in function calls by compiling something like
my_global = (lambda: open(os.devnull, 'w'))()
so that you actually emit the function and then a call to it. If you can’t so alter it, you’ll have to have rules to recognize values that need to be constructed in this fashion so that you can replace them with such calls.
Another detail that may be important is closures: if your generator uses local functions/classes, you’ll need to create the cell objects, perhaps via “fake” closures of your own:
def cell(x): return (lambda: x).__closure__[0]
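Putting the pieces together, here is a minimal sketch of writing a .pyc by hand for CPython 3.7+ (the 16-byte header layout is version-specific, and py_compile does the equivalent for you when you already have a .py file; generated_module.py is a placeholder name):

import importlib.util
import marshal

def write_pyc(code_obj, path, mtime=0, source_size=0):
    # CPython 3.7+ header: magic number, flags, source mtime, source
    # size, followed by the marshalled code object.
    with open(path, "wb") as f:
        f.write(importlib.util.MAGIC_NUMBER)
        f.write((0).to_bytes(4, "little"))                 # flags: timestamp-based pyc
        f.write(int(mtime).to_bytes(4, "little"))          # source mtime
        f.write(int(source_size).to_bytes(4, "little"))    # source size
        f.write(marshal.dumps(code_obj))

source = "x = 1\ndef f(): return x\n"
code = compile(source, "generated_module.py", "exec")
write_pyc(code, "generated_module.pyc")

Loading it back is the reverse: skip the 16-byte header, marshal.loads the rest, and exec the resulting code object in a fresh module's __dict__ (or drop the file into __pycache__ with the right name and let the import system find it).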
I'm trying to understand some of shared_memory's operation.
Looking at the source, it looks like the module uses shm_open() in UNIX environments, and CreateFileMapping / OpenFileMapping combined with mmap on Windows.
I understand from here that, in order to avoid a thorough serialization/deserialization by pickle, one needs to implement __setstate__() and __getstate__() explicitly for a shared datatype.
I do not see any such implementation in shared_memory.py.
How does shared_memory circumvent the pickle treatment?
Also, on a Windows machine, this alone seems to survive across interpreters:
from mmap import mmap
shared_size = 12
shared_label = "my_mem"
mmap(-1, shared_size, shared_label)
Why then are CreateFileMapping / OpenFileMapping needed here?
How does shared_memory circumvent the pickle treatment?
I think you are confusing shared ctypes and shared objects between processes.
First, you don't have to use the sharing mechanisms provided by multiprocessing in order to get shared objects; you can just wrap basic primitives such as mmap (or its Windows equivalent), or get fancier using any API that your OS/kernel provides.
Next, the second link you mention, about how copying is done and how __getstate__ defines pickling behavior, only applies when you use the sharedctypes module API. You are not forced to perform pickling to share memory between two processes.
In fact, sharedctypes is backed by anonymous shared memory which uses: https://github.com/python/cpython/blob/master/Lib/multiprocessing/heap.py#L31
Both implementations rely on an mmap-like primitive.
Anyway, if you try to copy something using sharedctypes, you will hit:
https://github.com/python/cpython/blob/master/Lib/multiprocessing/sharedctypes.py#L98
https://github.com/python/cpython/blob/master/Lib/multiprocessing/sharedctypes.py#L39
https://github.com/python/cpython/blob/master/Lib/multiprocessing/sharedctypes.py#L135
These functions use ForkingPickler, which makes use of pickle, so ultimately __getstate__ gets called somewhere.
But none of that is relevant to shared_memory, because a shared_memory block is not a ctypes-like object.
You have other ways to share objects between processes, such as the Resource Sharer / Tracker API (https://github.com/python/cpython/blob/master/Lib/multiprocessing/resource_sharer.py), which relies on pickle serialization/deserialization.
But you don't share shared memory through shared memory, right?
When you use: https://github.com/python/cpython/blob/master/Lib/multiprocessing/shared_memory.py
You create a block of memory with a unique name, and all processes must have that unique name before sharing the memory; otherwise you will not be able to attach to it.
Basically, the analogy is:
You and a group of friends have a unique secret base whose location only you know. You go on errands, away from each other, but you can all meet at this unique location.
For this to work, you must all know the location before parting ways. If you do not have it beforehand, you cannot be certain of figuring out where to meet.
It is the same with shared_memory: you only need its name to open it. You don't share or transfer the shared memory itself between processes; you read from it in multiple processes using its unique name.
So why would you pickle it? You absolutely can, but that support might not be built in, because it is straightforward to just send the unique name to all your processes through some other channel.
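A minimal sketch of that name-based workflow (Python 3.8+; "my_mem" is an arbitrary label):

from multiprocessing import shared_memory

# Process A: create a named block
shm = shared_memory.SharedMemory(name="my_mem", create=True, size=12)
shm.buf[:5] = b"hello"

# Process B: attach purely by name; nothing is pickled or transferred
other = shared_memory.SharedMemory(name="my_mem")
print(bytes(other.buf[:5]))   # b'hello'

other.close()
shm.close()
shm.unlink()   # release the block once every process is done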
There is no circumvention required here. ShareableList is just one application of the SharedMemory class, as you can see here: https://github.com/python/cpython/blob/master/Lib/multiprocessing/shared_memory.py#L314
It requires something akin to a unique name; you can also use anonymous shared memory and transmit its name later through another channel (write a temporary file, send it back to some API, whatever).
Why then are CreateFileMapping / OpenFileMapping needed here?
Because it depends on your Python interpreter. Here you are probably using CPython, which does the following:
https://github.com/python/cpython/blob/master/Modules/mmapmodule.c#L1440
mmap already uses CreateFileMapping indirectly, so calling CreateFileMapping yourself and then attaching to the mapping would just duplicate work CPython has already done.
But what about other interpreters? Do they all do what is necessary to make mmap work on non-POSIX platforms? Maybe that was the developer's rationale.
Anyway, it is not surprising that mmap works out of the box.
In the Python C API, I already know how to import a module via PyImport_ImportModule, as described in Python Documentation: Importing Modules. I also know that there are many ways to create, allocate, or initialize a module, and several functions for operating on a module, as described in Python Documentation: Module Objects.
But how can I get a function from a module (and call it), or get a type/class from a module (and instantiate it), or get an object from a module (and operate on it)? More generally, how can I get anything from a module and do whatever I want with it?
This may be a silly question, but I really cannot find any tutorial or documentation on it. The only way I can think of to achieve this is to use PyModule_GetDict to get the module's __dict__ property and fetch what I want from it, as described in the latter documentation I mentioned. But that documentation also recommends not using this function to operate on the module.
So any "official way" or best practice for getting something from a module?
According to the documentation for PyModule_GetDict:
It is recommended extensions use other PyModule_*() and PyObject_*() functions rather than directly manipulate a module’s __dict__.
The functions you need are generic object functions (PyObject_*) rather than module functions (PyModule_*), and I suspect this is where you were looking in the wrong place.
You want to use PyObject_GetAttr or PyObject_GetAttrString.
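For orientation, this is roughly what those C calls correspond to at the Python level (math and sqrt are just example names; the format string "d" passes a C double):

import importlib

mod = importlib.import_module("math")   # ~ PyImport_ImportModule("math")
func = getattr(mod, "sqrt")             # ~ PyObject_GetAttrString(mod, "sqrt")
result = func(2.0)                      # ~ PyObject_CallFunction(func, "d", 2.0)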
I want to load an object from a file using eval. The object is dumped to the file as a valid Python expression; all types are given with their fully qualified names, like this:
mod1.Class1(
attr1=mod2.Class2(a=1,b=2),
attr2=[1,2,3,4],
attr3=mod1.submod1.Class3(),
)
When I feed this into eval, not all of those modules are imported in the scope where eval is called, so I get either NameError: name 'mod1' is not defined for top-level modules or, when those are imported, AttributeError: 'module' object has no attribute 'submod1' for sub-modules.
Is there a graceful way to handle that? I can catch the NameError, run __import__, and retry eval, but I am at a loss as to how to extract what went wrong from the AttributeError.
Could I feed the expression to compile, walk the AST, and import whatever is necessary? I've never worked with the AST, though; is there an example for that?
Note that I am not concerned about security here.
Why not use pickle for this? You can even use __getstate__ and __setstate__ methods on your classes to control aspects of the serialization and instantiation. That seems seriously better than rolling your own eval() scheme.
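For instance, a minimal sketch of how those hooks work (Config, path, and handle are made-up names):

import os
import pickle

class Config:
    def __init__(self, path):
        self.path = path
        self.handle = open(path)          # unpicklable attribute

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["handle"]               # drop what pickle cannot handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.handle = open(self.path)     # rebuild it on unpickling

restored = pickle.loads(pickle.dumps(Config(os.devnull)))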
Otherwise, how controlled are the values in your serialization format? I.e. maybe you can just predict what modules are going to be needed.
If you're wedded to using full Python (rather than something more easily parseable like JSON or YAML) for your data, walking the AST sounds fairly feasible. You'd want to implement an ast.NodeVisitor and keep track of the Attribute nodes visited.
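A sketch of that approach (module names echo the example in the question; error handling is kept minimal, and the top-level module is assumed to be importable):

import ast
import importlib

def eval_with_imports(expr):
    # Walk the AST, collect dotted names such as mod1.submod1.Class3,
    # import every module prefix, and bind the top-level modules into
    # the namespace handed to eval().
    tree = ast.parse(expr, mode="eval")
    namespace = {}
    for node in ast.walk(tree):
        if not isinstance(node, ast.Attribute):
            continue
        parts = []
        while isinstance(node, ast.Attribute):
            parts.append(node.attr)
            node = node.value
        if not isinstance(node, ast.Name):
            continue
        parts.append(node.id)
        parts.reverse()
        # Import successively longer prefixes; the tail of the dotted
        # name is typically a class or function, not a module.
        for i in range(1, len(parts)):
            try:
                importlib.import_module(".".join(parts[:i]))
            except ImportError:
                break
        namespace[parts[0]] = importlib.import_module(parts[0])
    return eval(expr, namespace)

obj = eval_with_imports("mod1.Class1(attr1=mod2.Class2(a=1, b=2))")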
I am working on my program, GarlicSim, in which a user creates a simulation, manipulates it as desired, and then saves it to a file.
I recently tried implementing the saving feature. The natural approach that occurred to me was to pickle the Project object, which contains the entire simulation.
The problem is that the Project object also includes a module: the "simulation package", a package/module that contains several critical objects, mostly functions, that define the simulation. I need to save them together with the simulation, but it seems impossible to pickle a module, as I discovered when I tried to pickle the Project object and an exception was raised.
What would be a good way to work around that limitation?
(I should also note that the simulation package gets imported dynamically in the program.)
If the project somehow has a reference to a module with stuff you need, it sounds like you might want to refactor the use of that module into a class within the module. This is often better anyway, because using a module to hold state smells of a big fat global. In my experience, such an application structure only leads to trouble.
(Of course the quick way out is to save the module's dict instead of the module itself.)
If you have the original code for the simulation package modules, which I presume are dynamically generated, then I would suggest serializing that and reconstructing the modules when loaded. You would do this in the Project.__getstate__() and Project.__setstate__() methods.
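A sketch of that idea (the attribute name simulation_package, and the availability of the module's source via __file__, are assumptions about how Project stores things):

import types

class Project:
    def __getstate__(self):
        state = self.__dict__.copy()
        # Replace the unpicklable module with its source text and name.
        mod = state.pop("simulation_package")
        with open(mod.__file__) as f:
            state["_sim_source"] = f.read()
        state["_sim_name"] = mod.__name__
        return state

    def __setstate__(self, state):
        source = state.pop("_sim_source")
        mod = types.ModuleType(state.pop("_sim_name"))
        exec(source, mod.__dict__)        # rebuild the module from source
        state["simulation_package"] = mod
        self.__dict__.update(state)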