File extension naming: .p vs .pkl vs .pickle - python

When reading and writing pickle files, I've noticed that some snippets use .p, others .pkl, and some the full .pickle. Is there one most Pythonic way of doing this?
My current view is that there is no one right answer, and that any of these will suffice. In fact, writing a filename of awesome.pkl or awesome.sauce won't make a difference when running pickle.load(open(filename, "rb")). This is to say, the file extension is just a convention which doesn't actually affect the underlying data. Is that right?
Bonus: What if I saved a PNG image as myimage.jpg? What havoc would that create?

The extension makes no difference because the pickle protocol runs every time.
That is to say, whenever pickle.dumps or pickle.loads is run, the objects are serialized/deserialized according to the pickle protocol.
(The pickle protocol is a serialization format.)
The pickle protocol is Python-specific (and there are several versions). It's really only designed for a user to re-use their own data: if you sent the pickled file to someone who happened to have a different version of pickle/Python, the file might not load correctly, and you probably can't do anything useful with that pickled file in another language like Java.
So, use whatever extension you like, because the pickler will ignore it.
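A minimal sketch of that point (the filename here is just an example): pickle only cares about the bytes it is given, never the name of the file they came from.

```python
import pickle

data = {"spam": [1, 2, 3], "eggs": ("a", "b")}

# Any extension (or none at all) works -- pickle never looks at the name.
with open("awesome.sauce", "wb") as f:
    pickle.dump(data, f)

with open("awesome.sauce", "rb") as f:
    restored = pickle.load(f)

assert restored == data
```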
JSON is another, more popular way of serializing data; unlike pickle, it can also be used by other languages. However, it does not cater directly to Python, so certain Python types are not understood by it :/
source if you want to read more
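For contrast, a quick illustration of that JSON limitation (a made-up snippet, not from the linked source): tuples silently come back as lists, and sets are rejected outright.

```python
import json

# A tuple survives the round trip only as a list.
print(json.loads(json.dumps((1, 2, 3))))   # [1, 2, 3]

# A set cannot be serialized at all.
try:
    json.dumps({1, 2, 3})
except TypeError as exc:
    print(exc)                             # Object of type set is not JSON serializable
```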
EDIT: while you could use any name, which one should you use?
1. As mentioned by @Mike Williamson, .pickle is used in the pickle docs.
2. The Python standard library's json module is conventionally paired with files named with a .json extension, so it would follow that the pickle module would use a .pickle extension.
3. Using .pickle would also minimise any chance of accidental use by other programs.
The .p extension is used by some other programs, most notably MATLAB, as the suffix for binary run-time files [sources: one, two]. Some risk of conflict.
.pkl is used by an obscure Windows "Migration Wizard Packing List" file [source]. Incredibly low risk of conflict.
.pickle is only used for Python pickling [source]. No risk of conflict.

Related

Parsing a .proto file without creating the descriptor

I understand that the normal way of using protobuf is to create the .proto and then compile it into the relevant class - Java, Python, etc. I have a requirement which might need to parse the .proto file in Python code. Has anyone tried creating their own parser for the .proto file? Is it recommended to always compile the class instead of directly parsing the .proto?
It probably won't help you directly, but yes, I've written my own parser (live demo, parser source). This code is C#, which is why it probably won't help, but it clearly is possible. I started that branch 9 days ago, and now it is basically feature-complete, including parser, generator, and an interactive website with syntax-error highlighting - so it isn't necessarily a huge amount of work.
However! You may find it easier just to shell out to protoc (available on Maven). If you use the -oFILE / --descriptor_set_out=FILE switch (same thing, alternative syntax), then it parses the input .proto file and writes a file that is a serialized FileDescriptorSet from descriptor.proto. This means you can use your regular tools to generate code in your chosen language for descriptor.proto, then deserialize the file as a FileDescriptorSet instance. Once you've done that, you can just walk the object model to see the files, messages, enums, fields, etc. IIRC some protobuf implementations support working entirely from a descriptor (which is what protoc emits), without the codegen step.
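A rough sketch of that protoc route in Python, assuming protoc is on the PATH and the protobuf package is installed; the file names below are placeholders, not from the original answer.

```python
import subprocess
from google.protobuf import descriptor_pb2

# Ask protoc to parse the .proto and dump a serialized FileDescriptorSet.
subprocess.check_call([
    "protoc",
    "--include_imports",               # embed imported .proto files too
    "--descriptor_set_out=schema.pb",
    "schema.proto",
])

with open("schema.pb", "rb") as f:
    fds = descriptor_pb2.FileDescriptorSet()
    fds.ParseFromString(f.read())

# Walk the object model: files, messages, fields, enums, ...
for file_proto in fds.file:
    for message in file_proto.message_type:
        print(message.name, [field.name for field in message.field])
```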

Good ways to handle dynamic zipfile creation in Python

I have a reporting function on my Web Application that has been subjected to Feature Creep -- instead of providing a PDF, it must now provide a .zip file that contains a variety of documents.
Generating the documents is fine. Adding documents to the Zipfile is the issue.
Until now, the various documents to be added to the archive have existed as a mix of cStringIO, StringIO or tempfile.SpooledTemporaryFile objects.
Digging into the zipfile library docs, it appears that the module's write function will only work with strings or paths to physical files on the machine; it does not work with file-like objects.
Just to be clear: zipfile can read/write from a file-like object as the archive itself (zipfile.ZipFile), however when adding an element to the archive, the library only supports a pathname or raw string data.
I found an online blog posting suggesting a possible workaround, but I'm not eager to use it on my production machine. http://swl10.blogspot.com/2012/12/writing-stream-to-zipfile-in-python.html
Does anyone have other strategies for handling this? It looks like I have to either save everything to disk and take a hit on I/O, or handle everything as a string and take a hit on memory. Neither is ideal.
Use the solution you are referring to (monkeypatching).
Regarding your concern that monkeypatching does not sound solid enough: let me elaborate on how a monkeypatched method can be used from other places.
Hooking in Python is no special magic. It means that someone assigns an alternative value/function to something that is already defined. This has to be done by a line of code and has a very limited scope.
In the example from the blog, the scope of the monkeypatched os functions is just the ZipHooks class.
Do not be afraid that it would leak somewhere else without your knowledge or break the complete system. Even other packages importing your module with the ZipHooks class would not have access to the patched stat and open unless they used the ZipHooks class or explicitly called stat_hook or open_hook from your package.
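For completeness, a sketch of the other route the question mentions (trading memory for simplicity rather than monkeypatching): ZipFile.writestr accepts raw string/bytes data, so the in-memory buffers can simply be drained into the archive. Function and variable names here are illustrative.

```python
import io
import zipfile

def build_zip(documents):
    """documents: mapping of archive member name -> file-like object."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, fileobj in documents.items():
            fileobj.seek(0)
            zf.writestr(name, fileobj.read())   # whole document held in memory
    archive.seek(0)
    return archive

# e.g. build_zip({"report.pdf": pdf_buffer, "data.csv": csv_buffer})
```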

What available Python modules are there to save-and-load data?

There are many scattered posts out on StackOverflow, regarding Python modules used to save and load data.
I myself am familiar with json and pickle and I have heard of pytables too. There are probably more out there. Also, each module seems to fit a certain purpose and has its own limits (e.g. loading a large list or dictionary with pickle takes ages, if it works at all). Hence it would be nice to have a proper overview of the possibilities.
Could you then help providing a comprehensive list of modules used to save and load data, describing for each module:
what the general purpose of the module is,
its limits,
why you would choose this module over others?
marshal:
Pros:
Can read and write Python values in a binary format. Therefore it's much faster than pickle (which is character based).
Cons:
Not all Python object types are supported. Some unsupported types, such as subclasses of built-ins, will appear to marshal and unmarshal correctly.
Is not intended to be secure against erroneous or maliciously constructed data.
The Python maintainers reserve the right to modify the marshal format in backward incompatible ways should the need arise
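A minimal sketch of marshal round-tripping a plain value (the file name is illustrative):

```python
import marshal

payload = {"numbers": [1, 2, 3], "name": "example"}

with open("cache.marshal", "wb") as f:   # format may change between Python versions
    marshal.dump(payload, f)

with open("cache.marshal", "rb") as f:
    restored = marshal.load(f)

assert restored == payload
```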
shelve:
Pros:
Values in a shelf can be essentially arbitrary Python objects
Cons:
Does not support concurrent read/write access to shelved objects
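And a minimal shelve sketch, assuming single-process access as noted above (the file name is illustrative):

```python
import shelve

with shelve.open("appdata") as db:       # creates one or more appdata.* files on disk
    db["settings"] = {"theme": "dark", "retries": 3}

with shelve.open("appdata") as db:
    print(db["settings"]["theme"])       # dark
```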
ZODB (suggested by @Duncan)
Pros:
transparent persistence
full transaction support
pluggable storage
scalable architecture
Cons:
not part of standard library.
cannot (easily) reload data unless the original Python object model used for persisting is available (consider version difficulties and data portability)
There is an overview of the standard lib data persistence modules.

Open Matlab file .mat with module PICKLE in Python

I want to open a Matlab .mat file with the pickle or cPickle module in the Python language.
NOT with:
from scipy.io import matlab
mat=matlab.loadmat('file.mat')
Can I use pickle.load with a .mat file?
For some years now, Matlab has been able to store data in HDF5 (the v7.3 .mat format). Python has support for HDF5 via PyTables, so there is no need to use pickle. In fact, HDF5 may surprise you with its speed relative to pickle: a friend reported 2-10X speedups in read/write for some very large datasets.
Update 1: A concise guide to loading the files, via HDF5, can be found at this page.
In addition, several good references and resources may be found at this page. There's also a PyMat project on Sourceforge.
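A rough sketch of that route, assuming the .mat file was saved in MATLAB's v7.3 (HDF5-based) format and PyTables 3.x is installed; older .mat formats still need scipy.io.loadmat instead.

```python
import tables

with tables.open_file("file.mat", mode="r") as h5:
    # MATLAB variables appear as HDF5 datasets under the root group.
    for node in h5.walk_nodes("/"):
        print(node)
```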
You can't. Pickle loads Python objects that have been serialized to binary data. The format is nothing like the Matlab file format.
If you have read all the data you need out of the matlab file and stored it in Python objects, you can then store it for later use by Pickling it if necessary.
This is not possible. From the pickle Python documentation:
The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
In your case you could load the *.mat file with scipy.io and then store the contents in some Python structure that you define. At that point you will be able to easily pickle and unpickle it (though whether this last step is worth doing depends on your use case).
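A minimal sketch of that approach, assuming a pre-v7.3 .mat file that scipy.io.loadmat can read (file names are placeholders):

```python
import pickle
from scipy.io import loadmat

mat = loadmat("file.mat")                                        # dict of variable name -> NumPy array
data = {k: v for k, v in mat.items() if not k.startswith("__")}  # drop metadata keys

with open("file_converted.pickle", "wb") as f:
    pickle.dump(data, f)

with open("file_converted.pickle", "rb") as f:
    restored = pickle.load(f)
```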

Is there an easy way to use a python tempfile in a shelve (and make sure it cleans itself up)?

Basically, I want an infinite size (more accurately, hard-drive rather than memory bound) dict in a Python program I'm writing. It seems like the tempfile and shelve modules are naturally suited for this; however, I can't see how to use them together in a safe manner. I want the tempfile to be deleted when the shelve is GCed (or at least to guarantee deletion after the shelve is out of use, regardless of when), but the only solution I can come up with for this involves using tempfile.TemporaryFile() to open a file handle, getting the filename from the handle, using this filename for opening a shelve, keeping the reference to the file handle to prevent it from getting GCed (and the file deleted), and then putting a wrapper on the shelve that stores this reference. Does anyone have a better solution than this convoluted mess?
Restrictions: Can only use the standard python library and must be fully cross platform.
I would rather inherit from shelve.Shelf, and override the close method (*) to unlink the files. Notice that, depending on the specific dbm module being used, you may have more than one file that contains the shelf. One solution could be to create a temporary directory, rather than a temporary file, and remove anything in the directory when done. The other solution would be to bind to a specific dbm module (say, bsddb, or dumbdbm), and remove specifically those files that these libraries create.
(*) Notice that the close method of a shelf is also called when the shelf is garbage collected. The only case in which you could end up with leftover files is when the interpreter crashes or gets killed.
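A hypothetical sketch of that idea using the temporary-directory variant (the class name is made up): subclass the filename-based shelf and remove the directory in close(), which also runs when the shelf is garbage collected.

```python
import os
import shelve
import shutil
import tempfile

class TemporaryShelf(shelve.DbfilenameShelf):
    """Shelf whose backing files live in a private temporary directory
    that is deleted when the shelf is closed (or garbage collected)."""

    def __init__(self):
        # Use a whole directory, because some dbm backends create several
        # files for a single shelf.
        self._tempdir = tempfile.mkdtemp()
        super().__init__(os.path.join(self._tempdir, "shelf"))

    def close(self):
        try:
            super().close()
        finally:
            shutil.rmtree(self._tempdir, ignore_errors=True)

# usage:
# d = TemporaryShelf()
# d["key"] = list(range(10000))
# d.close()            # backing files are removed here
```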
