Open a Matlab .mat file with the pickle module in Python

I want to open a Matlab .mat file with the pickle or cPickle module in Python.
NOT with:
from scipy.io import matlab
mat=matlab.loadmat('file.mat')
Can I use pickle.load with a .mat file?

For some years now, Matlab has used HDF5 to store data (the -v7.3 .mat format). Python has support for HDF5, via PyTables. No need to use Pickle. In fact, HDF5 may surprise you with its speed relative to Pickle. A friend reported 2-10X speedups in read/write for some very large datasets.
Update 1: A concise guide to loading the files, via HDF5, can be found at this page.
In addition, several good references and resources may be found at this page. There's also a PyMat project on Sourceforge.
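If your file was saved with -v7.3, here is a minimal sketch of the HDF5 route (using h5py as one HDF5 binding; the file name and the variable name 'A' are placeholders):
import h5py

with h5py.File('file.mat', 'r') as f:
    print(list(f.keys()))     # names of the MATLAB variables in the file
    A = f['A'][()]            # read one variable into a NumPy array
Note that this only applies to -v7.3 .mat files; older formats still need scipy.io.loadmat, as shown in the question.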

You can't. Pickle loads Python objects that have been serialized to binary data. The format is nothing like the Matlab file format.
If you have read all the data you need out of the matlab file and stored it in Python objects, you can then store it for later use by Pickling it if necessary.

This is not possible. From the Python pickle documentation:
The pickle module implements a fundamental, but powerful algorithm for
serializing and de-serializing a Python object structure. “Pickling”
is the process whereby a Python object hierarchy is converted into a
byte stream, and “unpickling” is the inverse operation, whereby a byte
stream is converted back into an object hierarchy. Pickling (and
unpickling) is alternatively known as “serialization”, “marshalling,”
or “flattening”, however, to avoid confusion, the terms used here
are “pickling” and “unpickling”.
In your case you could load the *.mat file with scipy.io and then serialize it into some Python structure that you define. At that point you will be able to easily pickle and unpickle it (though this last step depends on your use case, and in some cases it is not worth doing).
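A rough sketch of that workflow (the file name and keys are illustrative): load with scipy.io, keep the variables you need as plain Python/NumPy objects, then pickle them for later reuse.
import pickle
from scipy.io import loadmat

mat = loadmat('file.mat')                # dict mapping variable names to arrays
data = {k: v for k, v in mat.items()
        if not k.startswith('__')}       # drop loadmat's metadata entries

with open('file_data.pickle', 'wb') as f:
    pickle.dump(data, f)

with open('file_data.pickle', 'rb') as f:
    data_again = pickle.load(f)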

Related

File extension naming: .p vs .pkl vs .pickle

When reading and writing pickle files, I've noticed that some snippets use .p, others .pkl, and some the full .pickle. Is there one most pythonic way of doing this?
My current view is that there is no one right answer, and that any of these will suffice. In fact, writing a filename of awesome.pkl or awesome.sauce won't make a difference when running pickle.load(open(filename, "rb")). This is to say, the file extension is just a convention which doesn't actually affect the underlying data. Is that right?
Bonus: What if I saved a PNG image as myimage.jpg? What havoc would that create?
The extension makes no difference because "The Pickle Protocol" runs every time.
That is to say whenever pickle.dumps or pickle.loads is run the objects are serialized/un-serialized according to the pickle protocol.
(The pickle protocol is a serialization format)
The pickle protocol is Python-specific (and there are several versions). It's really only designed for re-using your own data: if you sent the pickled file to someone who happened to have a different version of pickle/Python, the file might not load correctly, and you probably can't do anything useful with a pickled file in another language like Java.
So, use what extensions you like because the pickler will ignore them.
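A quick demonstration of that point, reusing the file names from the question: the same object round-trips regardless of the extension.
import pickle

obj = {'answer': 42, 'values': [1, 2, 3]}

for filename in ('awesome.pkl', 'awesome.sauce'):
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)
    with open(filename, 'rb') as f:
        assert pickle.load(f) == obj     # loads fine either way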
JSON is another, more popular way of serializing data; unlike pickle, it can also be used by other languages, but it does not cater directly to Python, so certain types cannot be represented by it.
source if you want to read more
EDIT: while you could use any name, what should you use?
1. As mentioned by @Mike Williamson, .pickle is used in the pickle docs.
2. The Python standard library's json module loads files named with a .json extension, so it would follow that the pickle module would load a .pickle extension.
3. Using .pickle would also minimise any chance of accidental use by other programs.
.p extensions are used by some other programs, most notably MATLAB as the suffix for binary run-time files [sources: one, two]. Some risk of conflict.
.pkl is used by an obscure Windows "Migration Wizard Packing List" file [source]. Incredibly low risk of conflict.
.pickle is only used for Python pickling [source]. No risk of conflict.

Is there any difference between Pickling and Serialization?

I have come across these two terms quite often while reading about Python objects. However, I am confused about pickling versus serialization, since in one place I read:
The pickle module implements an algorithm for turning an arbitrary
Python object into a series of bytes. This process is also called
“serializing” the object.
If serializing and pickling are the same process, why use different terms for them?
You are misreading the article. Pickling and serialisation are not synonymous, nor does the text claim them to be.
Paraphrasing slightly, the text says this:
This module implements an algorithm for turning an object into a series of bytes. This process is also called serializing the object.
I removed the module name, pickle, deliberately. The module implements a process, an algorithm, and that process is commonly known as serialisation.
There are other implementations of that process. You could use JSON or XML to serialise data to text. There is also the marshal module. Other languages have other serialization formats; the R language has one, so does Java. Etc.
See the Wikipedia article on the subject:
In computer science, in the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.
Python picked the name pickle because it modelled the process on how this was handled in Modula-3, where it was also called pickling. See Pickles: Why are they called that?
In Python, pickle refers to a module that provides (a specific) serialization of Python objects.
Serialization itself is a more general term; Python objects can also be serialized into JSON, for example.
https://en.wikipedia.org/wiki/Serialization
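A small illustration of the distinction, with a made-up record: pickle is one concrete serialization (binary, Python-specific), and JSON is another (text-based, language-neutral). Both "serialize" the same dictionary.
import json
import pickle

record = {'name': 'sensor-1', 'readings': [0.1, 0.2, 0.3]}

as_pickle = pickle.dumps(record)   # bytes, only readable back via pickle
as_json = json.dumps(record)       # str, readable by any JSON parser
print(type(as_pickle), type(as_json))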

Why does loading a pickle object take so much longer than loading a file?

I have noticed that loading a dictionary of 5000 objects with pickle takes a long time (minutes) -- but loading a json file of 5000 entities takes a short time (seconds). I know that in general objects come with some overhead -- and that in OOP the overhead associated with keeping track of such objects is part of the cost for the ease of using them. But why does loading a pickled object take SO long? What is happening under the hood? What are the costs associated with serializing an object as opposed to merely writing its data to a file? Does pickling restore the object to the same locations in memory or something? (Maybe moving other objects out of the way.) If serialization (at least with pickle) loads more slowly, then what is the benefit?
Assuming that you are using the Python 2.7 standard pickle and json modules...
Python 2.7 uses a pure-Python implementation of the pickle module by default, although a faster C implementation is available. http://docs.python.org/2/library/pickle.html
Conversely, Python 2.7 uses an optimized C implementation of the json module by default: http://docs.python.org/dev/whatsnew/2.7.html
So you're basically comparing a pure-Python deserializer to an optimized C deserializer. Not a fair comparison, even if the serialization formats were identical.
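The usual Python 2 workaround, as a sketch, was to fall back to the C implementation when available (on Python 3 the C accelerator is picked up automatically, so this only matters on 2.x):
try:
    import cPickle as pickle   # optimized C implementation (Python 2 only)
except ImportError:
    import pickle              # Python 3's pickle already uses the C accelerator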
There are speed comparisons out there for the serialization of particular objects, comparing JSON, pickle, and cPickle. The speed for each object will be different in each format. JSON is usually comparatively faster than pickle, and you often hear not to use pickle because it's insecure. The reason for the security concerns, and some of the speed lag, is that pickle doesn't actually serialize very much data -- instead it serializes some data and a bunch of instructions, where the instructions are used to assemble the python objects. If you've ever looked at the dis module, you'll see the type of instructions that pickle uses for each object. cPickle is, like json, not pure-python, and leverages optimized C, so it's often faster.
Pickling should, in general, take up less space than storing an object itself; however, some instruction sets can be quite large. JSON tends to be smaller… and is human-readable… however, since json stores everything as human-readable strings… it can't serialize as many different kinds of objects as pickle and cPickle can. So the trade-off is json for "security" (or inflexibility, depending on your perspective) and human-readability versus pickle with a broader range of objects it can serialize.
Another good reason for choosing pickle (over json) is that you can easily extend pickle, meaning that you can register a new method to serialize an object that pickle doesn't know how to pickle. Python gives you several ways to do that… __getstate__ and __setstate__ as well as the copy_reg module. Using these methods, you'll find that people have extended pickle to serialize most Python objects, for example dill.
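As a sketch of that extension mechanism (the class and file name are invented for illustration), an object holding an unpicklable open file handle can drop it in __getstate__ and reopen it in __setstate__:
import pickle

class LogReader(object):
    def __init__(self, path):
        self.path = path
        self.handle = open(path)        # open file handles cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['handle']             # leave the unpicklable part out
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.handle = open(self.path)   # re-open the handle on unpickling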
Pickling doesn't restore the objects to the same memory location. However, it does reconstitute the object to the same state (generally) as when it was pickled. If you want to see some reasons why people pickle, take a look here:
Python serialization - Why pickle?
http://nbviewer.ipython.org/gist/minrk/5241793
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

What available Python modules are there to save-and-load data?

There are many scattered posts out on StackOverflow, regarding Python modules used to save and load data.
I myself am familiar with json and pickle and I have heard of pytables too. There are probably more out there. Also, each module seems to fit a certain purpose and has its own limits (e.g. loading a large list or dictionary with pickle takes ages if working at all). Hence it would be nice to have a proper overview of possibilities.
Could you then help providing a comprehensive list of modules used to save and load data, describing for each module:
what the general purpose of the module is,
its limits,
why you would choose this module over others?
marshal:
Pros:
Can read and write Python values in a binary format. Therefore it's much faster than pickle (which is character based).
Cons:
Not all Python object types are supported. Some unsupported types such as subclasses of builtins will appear to marshal and unmarshal correctly
Is not intended to be secure against erroneous or maliciously constructed data.
The Python maintainers reserve the right to modify the marshal format in backward incompatible ways should the need arise
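A minimal marshal round trip, as a sketch (the file name and data are illustrative):
import marshal

data = {'ids': [1, 2, 3], 'name': 'run-7'}

with open('data.marshal', 'wb') as f:
    marshal.dump(data, f)

with open('data.marshal', 'rb') as f:
    restored = marshal.load(f)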
shelve
Pros:
Values in a shelf can be essentially arbitrary Python objects
Cons:
Does not support concurrent read/write access to shelved objects
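A short shelve sketch (the store name and keys are illustrative): a persistent, dict-like store whose values can be almost any picklable object.
import shelve

db = shelve.open('analysis_store')   # file-backed, dict-like store
db['params'] = {'alpha': 0.05, 'iterations': 1000}
db['results'] = [3.2, 3.4, 2.9]
db.close()

db = shelve.open('analysis_store')
print(db['params']['alpha'])
db.close()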
ZODB (suggested by @Duncan)
Pros:
transparent persistence
full transaction support
pluggable storage
scalable architecture
Cons:
not part of standard library.
unable to (easily) reload data unless the original Python object model used for persisting is available (consider version difficulties and data portability)
There is an overview of the standard lib data persistence modules.

Similar .rdata functionality in Python?

I'm starting to learn about doing data analysis in Python.
In R, you can load data into memory, then save variables into a .rdata file.
I'm trying to create an analysis "project", so I can load the data, store the scripts, then save the output so I can recall it should I need to.
Is there an equivalent function in Python?
Thanks
What you're looking for is binary serialization. The most notable functionality for this in Python is pickle. If you have some standard scientific data structures, you could look at HDF5 instead. JSON works for a lot of objects as well, but it is not binary serialization - it is text-based.
If you expand your options, there are a lot of other serialization options, too. Such as Google's Protocol Buffers (the developer of Rprotobuf is the top-ranked answerer for the r tag on SO), Avro, Thrift, and more.
Although there are generic serialization options, such as pickle and .Rdat, careful consideration of your usage will be helpful in making I/O fast and appropriate to your needs, especially if you need random access, portability, parallel access, tool re-use, etc. For instance, I now tend to avoid .Rdat for large objects.
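As a rough analogue of R's save()/load() for a few session variables, sketched with pickle (the variable and file names are made up):
import pickle

scores = [0.91, 0.87, 0.95]
model_params = {'learning_rate': 0.01, 'epochs': 20}

with open('session.pickle', 'wb') as f:
    pickle.dump({'scores': scores, 'model_params': model_params}, f)

# later, in another session
with open('session.pickle', 'rb') as f:
    workspace = pickle.load(f)
scores = workspace['scores']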
