Good ways to handle dynamic zipfile creation in Python - python

I have a reporting function on my Web Application that has been subjected to Feature Creep -- instead of providing a PDF, it must now provide a .zip file that contains a variety of documents.
Generating the documents is fine. Adding documents to the Zipfile is the issue.
Until now, the various documents to be added to the archive have existed as a mix of cStringIO, StringIO or tempfile.SpooledTemporaryFile objects.
Digging into the zipfile library docs, it appears that when adding a member to an archive, the module's [write][1] function only accepts a path to a physical file on the machine, and writestr only accepts raw string data; neither works with file-like objects.
Just to be clear: zipfile can read/write a file-like object as the archive itself (zipfile.ZipFile), but when adding a member to that archive, the library only supports a pathname or a raw string.
I found an online blog posting suggesting a possible workaround, but I'm not eager to use it on my production machine. http://swl10.blogspot.com/2012/12/writing-stream-to-zipfile-in-python.html
Does anyone have other strategies for handling this? It looks like I have to either save everything to disk and take a hit on I/O, or handle everything as a string and take a hit on memory. Neither is ideal.
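For what it's worth, the trade-memory-for-I/O strategy the question already identifies is straightforward with writestr. A minimal sketch (on Python 3, where io.BytesIO replaces cStringIO; the report.txt name is made up):

```python
import io
import zipfile

# Build each document in memory, then hand the raw bytes to writestr().
doc = io.BytesIO()
doc.write(b"report body goes here")

archive_buffer = io.BytesIO()
with zipfile.ZipFile(archive_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    # writestr() accepts the full contents as a string/bytes,
    # so no temporary file ever touches the disk.
    zf.writestr("report.txt", doc.getvalue())

# The archive itself is also just bytes, ready to stream to the client.
zip_bytes = archive_buffer.getvalue()
```

This keeps every document fully in memory at least once, which is exactly the memory hit described above.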

Use the solution you are referring to (monkeypatching).
Regarding your concern that monkeypatching does not sound solid enough: let me elaborate on how a monkeypatched method can (and cannot) be used from other places.
Hooking in Python is no special magic. It means that someone assigns an alternative value/function to something that is already defined. This is done by a line of code and has a very limited scope.
In the example from the blog, the scope of the monkeypatched os functions is just the ZipHooks class.
Do not be afraid that it would leak somewhere else without your knowledge or break the whole system. Even other packages importing your module with the ZipHooks class would not have access to the patched stat and open unless they used the ZipHooks class or explicitly called stat_hook or open_hook from your package.

Related

How can I save dynamically generated modules and reimport them from file?

I have an application that dynamically generates a lot of Python modules with class factories, in order to eliminate redundant boilerplate that makes the code hard to debug across similar implementations. It works well, except that the dynamic generation of the classes across the modules (hundreds of them) takes more time than simply importing from a file. So I would like to find a way to save the modules to a file after generation (unless reset), then load from those files to cut down on bootstrap time for the platform.
Does anyone know how I can save/export auto-generated Python modules to a file for re-import later. I already know that pickling and exporting as a JSON object won't work because they make use of thread locks and other dynamic state variables and the classes must be defined before they can be pickled. I need to save the actual class definitions, not instances. The classes are defined with the type() function.
If you have ideas or knowledge on how to do this, I would really appreciate your input.
You’re basically asking how to write a compiler whose input is a module object and whose output is a .pyc file. (One plausible strategy is of course to generate a .py and then byte-compile that in the usual fashion; the following could even be adapted to do so.) It’s fairly easy to do this for simple cases: the .pyc format is very simple (but note the comments there), and the marshal module does all of the heavy lifting for it. One point of warning that might be obvious: if you’ve already evaluated, say, os.getcwd() when you generate the code, that’s not at all the same as evaluating it when loading it in a new process.
The “only” other task is constructing the code objects for the module and each class: this requires concatenating a large number of boring values from the dis module, and will fail if any object encountered is non-trivial. These might be global/static variables/constants or default argument values: if you can alter your generator to produce modules directly, you can probably wrap all of these (along with anything else you want to defer) in function calls by compiling something like
my_global = (lambda: open(os.devnull, 'w'))()
so that you actually emit the function and then a call to it. If you can’t so alter it, you’ll have to have rules to recognize values that need to be constructed in this fashion so that you can replace them with such calls.
Another detail that may be important is closures: if your generator uses local functions/classes, you’ll need to create the cell objects, perhaps via “fake” closures of your own:
def cell(x): return (lambda: x).__closure__[0]

File extension naming: .p vs .pkl vs .pickle

When reading and writing pickle files, I've noticed that some snippets use .p, others .pkl, and some the full .pickle. Is there one most Pythonic way of doing this?
My current view is that there is no one right answer, and that any of these will suffice. In fact, a filename of awesome.pkl or awesome.sauce makes no difference when running pickle.load(open(filename, "rb")). That is to say, the file extension is just a convention which doesn't actually affect the underlying data. Is that right?
Bonus: What if I saved a PNG image as myimage.jpg? What havoc would that create?
The extension makes no difference, because the pickle protocol is applied either way.
That is to say, whenever pickle.dumps or pickle.loads runs, the objects are serialized/deserialized according to the pickle protocol.
(The pickle protocol is a serialization format.)
The pickle protocol is Python-specific (and there are several versions of it). It is really only designed for a user to re-use data themselves: if you sent the pickled file to someone who happened to have a different version of pickle/Python, the file might not load correctly, and you probably can't do anything useful with a pickled file in another language like Java.
So, use what extensions you like because the pickler will ignore them.
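A quick sketch to convince yourself (awesome.sauce is, of course, an arbitrary name):

```python
import os
import pickle
import tempfile

data = {"answer": 42}

# The extension is irrelevant: pickle only looks at the bytes inside.
path = os.path.join(tempfile.mkdtemp(), "awesome.sauce")
with open(path, "wb") as f:
    pickle.dump(data, f)
with open(path, "rb") as f:
    assert pickle.load(f) == data
```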
JSON is another, more popular way of serializing data, and unlike pickle it can also be used from other languages. However, it does not cater directly to Python, so certain Python types are not understood by it.
EDIT: while you could use any name, what should you use?
1. As mentioned by @Mike Williamson, .pickle is used in the pickle docs.
2. The standard-library json module loads files named with a .json extension, so it would follow that the pickle module would load a .pickle extension.
3. Using .pickle would also minimise any chance of accidental use by other programs.
.p extensions are used by some other programs, most notably MATLAB, as the suffix for binary run-time files [sources: one, two]. Some risk of conflict.
.pkl is used by an obscure Windows "Migration Wizard Packing List" file [source]. Incredibly low risk of conflict.
.pickle is only used for Python pickling [source]. No risk of conflict.

How come Python does not include a function to load a pickle from a file name?

I often include this, or something close to it, in Python scripts and IPython notebooks.
import cPickle

def unpickle(filename):
    with open(filename) as f:
        obj = cPickle.load(f)
    return obj
This seems like a common enough use case that the standard library should provide a function that does the same thing. Is there such a function? If there isn't, how come?
Most of the serialization libraries in the stdlib and on PyPI have a similar API. I'm pretty sure it was marshal that set the standard,* and pickle, json, PyYAML, etc. have just followed in its footsteps.
So, the question is, why was marshal designed that way?
Well, you obviously need loads/dumps; you couldn't build those on top of a filename-based function, and to build them on top of a file-object-based function you'd need StringIO, which didn't come until later.
You don't necessarily need load/dump, because those could be built on top of loads/dumps—but doing so could have major performance implications: you can't save anything to the file until you've built the whole thing in memory, and vice-versa, which could be a problem for huge objects.
You definitely don't need a loadf/dumpf function based on filenames, because those can be built trivially on top of load/dump, with no performance implications, and no tricky considerations that a user is likely to get wrong.
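Such a pair really is trivial. A sketch on top of pickle (the loadf/dumpf names just follow the convention used above; they are not real stdlib functions):

```python
import pickle

def dumpf(obj, filename):
    """Pickle obj straight to a file named filename."""
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

def loadf(filename):
    """Unpickle and return the object stored in filename."""
    with open(filename, "rb") as f:
        return pickle.load(f)
```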
On the one hand, it would be convenient to have them anyway—and there are some libraries, like ElementTree, that do have analogous functions. It may only save a few seconds and a few lines per project, but multiply that by thousands of projects…
On the other hand, it would make Python larger. Not so much the extra 1K to download and install it if you added these two functions to every module (although that did mean a lot more back in the 1.x days than nowadays…), but more to document, more to learn, more to remember. And of course more code to maintain—every time you need to fix a bug in marshal.dumpf you have to remember to go check pickle.dumpf and json.dumpf to make sure they don't need the change, and sometimes you won't remember.
Balancing those two considerations is really a judgment call, one that someone made decades ago and that probably nobody has really discussed since. If you think there's a good case for changing it today, you can always post a feature request on the issue tracker or start a thread on python-ideas.
* Not in the original 1991 version of marshal.c; that just had load and dump. Guido added loads and dumps in 1993 as part of a change whose main description was "Add separate main program for the Mac: macmain.c". Presumably because something inside the Python interpreter needed to dump and load to strings.**
** marshal is used as the underpinnings for things like importing .pyc files. This also means (at least in CPython) it's not just implemented in C, but statically built into the core of the interpreter itself. I think it actually could be turned into a regular module since the 3.4 import changes, but it definitely couldn't have been back in the early days. So, that's extra motivation to keep it small and simple.

Python: Alternatives to pickling a module

I am working on my program, GarlicSim, in which a user creates a simulation, then he is able to manipulate it as he desires, and then he can save it to file.
I recently tried implementing the saving feature. The natural thing that occurred to me is to pickle the Project object, which contains the entire simulation.
Problem is, the Project object also includes a module: the "simulation package", a package/module that contains several critical objects, mostly functions, that define the simulation. I need to save them together with the simulation, but it seems that it is impossible to pickle a module, as I witnessed when I tried to pickle the Project object and an exception was raised.
What would be a good way to work around that limitation?
(I should also note that the simulation package gets imported dynamically in the program.)
If the project somehow has a reference to a module with stuff you need, it sounds like you might want to refactor the use of that module into a class within the module. This is often better anyway, because the use of a module for stuff smells of a big fat global. In my experience, such an application structure will only lead to trouble.
(Of course the quick way out is to save the module's dict instead of the module itself.)
If you have the original code for the simulation package modules, which I presume are dynamically generated, then I would suggest serializing that and reconstructing the modules when loaded. You would do this in the Project.__getstate__() and Project.__setstate__() methods.
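A minimal sketch of that approach (this Project class is a stand-in, not GarlicSim's actual code): store the module's import name, drop the module object itself in __getstate__, and re-import it in __setstate__.

```python
import importlib

class Project:
    def __init__(self, module_name):
        self.module_name = module_name
        # The simulation package is imported dynamically, as in the question.
        self.module = importlib.import_module(module_name)
        self.data = {}

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["module"]          # module objects can't be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Reconstruct the module when loading the saved project.
        self.module = importlib.import_module(self.module_name)
```

If the modules are generated rather than importable from disk, serialize their source (or generation parameters) in __getstate__ instead, and rebuild them in __setstate__.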

Is there an easy way to use a python tempfile in a shelve (and make sure it cleans itself up)?

Basically, I want an infinite-size (more accurately, hard-drive-bound rather than memory-bound) dict in a Python program I'm writing. It seems like the tempfile and shelve modules are naturally suited for this; however, I can't see how to use them together in a safe manner. I want the tempfile to be deleted when the shelve is GCed (or at least guaranteed to be deleted once the shelve is out of use, regardless of when). The only solution I can come up with involves using tempfile.TemporaryFile() to open a file handle, getting the filename from the handle, using this filename to open a shelve, keeping the reference to the file handle to prevent it from being GCed (and the file deleted), and then putting a wrapper on the shelve that stores this reference. Does anyone have a better solution than this convoluted mess?
Restrictions: Can only use the standard python library and must be fully cross platform.
I would rather inherit from shelve.Shelf and override the close method (*) to unlink the files. Notice that, depending on the specific dbm module being used, you may have more than one file containing the shelf. One solution is to create a temporary directory, rather than a temporary file, and remove everything in the directory when done. The other is to bind to a specific dbm module (say, bsddb or dumbdbm) and remove specifically the files that that library creates.
(*) Notice that the close method of a shelf is also called when the shelf is garbage collected. The only cases in which you could end up with garbage files are when the interpreter crashes or gets killed.
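Sketching the temporary-directory variant on Python 3 (the TemporaryShelf name is mine; subclassing DbfilenameShelf lets the default dbm module create however many files it needs inside the directory):

```python
import os
import shelve
import shutil
import tempfile

class TemporaryShelf(shelve.DbfilenameShelf):
    """A shelf backed by a temp directory that removes itself on close()."""

    def __init__(self):
        self._tempdir = tempfile.mkdtemp()
        # Whatever dbm backend is picked, all of its files land in _tempdir.
        super().__init__(os.path.join(self._tempdir, "shelf"))

    def close(self):
        super().close()
        # Remove every backing file, regardless of which dbm created them.
        shutil.rmtree(self._tempdir, ignore_errors=True)
```

Since Shelf.close is also invoked on garbage collection, the directory is cleaned up even if the caller forgets an explicit close().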
