I'm starting to learn about doing data analysis in Python.
In R, you can load data into memory, then save variables into a .rdata file.
I'm trying to create an analysis "project", so I can load the data, store the scripts, then save the output so I can recall it should I need to.
Is there an equivalent function in Python?
Thanks
What you're looking for is binary serialization. The most notable functionality for this in Python is pickle. If you have some standard scientific data structures, you could look at HDF5 instead. JSON works for a lot of objects as well, but it is not binary serialization - it is text-based.
If you expand your options, there are a lot of other serialization formats too, such as Google's Protocol Buffers (the developer of RProtoBuf is the top-ranked answerer for the r tag on SO), Avro, Thrift, and more.
Although there are generic serialization options, such as pickle and .Rdat, careful consideration of your usage will be helpful in making I/O fast and appropriate to your needs, especially if you need random access, portability, parallel access, tool re-use, etc. For instance, I now tend to avoid .Rdat for large objects.
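For instance, a minimal sketch of that save/recall workflow using pickle (the file name and the objects stored are just illustrative):

    import pickle

    results = {"model": "ols", "coefficients": [0.3, 1.7], "r_squared": 0.82}

    # save the objects you care about, roughly like R's save()
    with open("analysis.pkl", "wb") as f:
        pickle.dump(results, f)

    # recall them in a later session, roughly like R's load()
    with open("analysis.pkl", "rb") as f:
        results = pickle.load(f)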
Alex Gaynor explains some problems with pickle in his talk "Pickles are for delis, not software", including security, reliability, and human-readability. I am generally wary of using pickle on data in my Python programs. As a general rule, I much prefer to pass my data around with JSON or other serialization formats, specified by myself, manually.
The situation I'm interested in is: I've gathered some data in my Python program and I want to run an embarrassingly parallel task on it a bunch of times in parallel.
As far as I know, the nicest parallelization library for doing this in Python right now is dask-distributed, followed by joblib-parallel, concurrent.futures, and multiprocessing.
However, all of these solutions use pickle for serialization. Given the various issues with pickle, I'm inclined to simply send a JSON array to a subprocess of GNU parallel. But of course, this feels like a hack, and loses all the fancy goodness of Dask.
Is it possible to specify a different default serialization format for my data, but continue to parallelize in Python, preferably Dask, without resorting to pickle or GNU parallel?
The page http://distributed.dask.org/en/latest/protocol.html is worth a read regarding how Dask passes information around a set of distributed workers and scheduler. As can be seen, (cloud)pickle enters the picture for things like functions, which we want to be able to pass to workers, so they can execute them, but data is generally sent via fairly efficient msgpack serialisation. There would be no way to serialise functions with JSON. In fact, there is a fairly flexible internal dispatch mechanism for deciding what gets serialised with what mechanism, but there is no need to get into that here.
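To illustrate the data side, here is a rough msgpack round trip of plain data (assuming the third-party msgpack package, version 1.0 or later; the record itself is made up). Functions would still go through (cloud)pickle:

    import msgpack

    record = {"id": 7, "values": [1.0, 2.5, 3.5]}
    packed = msgpack.packb(record)        # compact binary bytes
    restored = msgpack.unpackb(packed)
    assert restored == record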
I would also claim that pickle is a fine way to serialise some things when passing between processes, so long as you have gone to the trouble to ensure consistent environments between them, which is an assumption that Dask makes.
Edit:
You could of course include function names or escapes in JSON, but I would suggest that's just as brittle as pickle anyway.
Pickles are bad for long-term storage ("what if my class definition changes after I've persisted something to a database?") and terrible for accepting as user input:
    import os

    class Foo:
        def __reduce__(self):
            # unpickling a Foo instance runs this shell command
            return (os.system, ('rm -rf /',))
But I don't think there's any problem at all with using them in this specific case. Suppose you're passing around datetime objects. Do you really want to write your own ad-hoc JSON adapter to serialize and deserialize them? I mean, you can, but do you want to? Pickles are well specified, and the process is fast. That's kind of exactly what you want here, where you're neither persisting the intermediate serialized object nor accepting objects from third parties. You're literally passing them from yourself to yourself.
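As a rough illustration of that trade-off (the objects are just examples; datetime.fromisoformat needs Python 3.7+):

    import json
    import pickle
    from datetime import datetime, timezone

    stamp = datetime.now(timezone.utc)

    # pickle handles datetime objects out of the box
    assert pickle.loads(pickle.dumps(stamp)) == stamp

    # the JSON route needs a hand-written adapter in both directions
    encoded = json.dumps(stamp.isoformat())
    decoded = datetime.fromisoformat(json.loads(encoded))
    assert decoded == stamp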
I'd highly recommend picking the library you want to use -- you like Dask? Go for it! -- and not worrying about its innards until such time as you specifically have to care. In the mean time, concentrate on the parts of your program that are unique to your problem. Odds are good that the underlying serialization format won't be one of them.
I've recently read an article about protocol buffers:
Protocol Buffers is a method of serializing structured data. It is useful in developing programs to communicate with each other over a wire or for storing data. The method involves an interface description language that describes the structure of some data and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data.
What I want to know is: where would you use them? Are there any real-life examples rather than simple address-book examples? Are they, for example, used to pre-cache query results from databases?
Protocol buffers are a data storage and exchange format, notably used for RPC - communication between programs or computers.
Alternatives include language-specific serialization (Java serialization, Python pickles, etc.), tabular formats like CSV and TSV, structured text formats like XML and JSON, and other binary formats like Apache Thrift. Conceptually these are all just different ways of representing structured data, but in practice they have different pros and cons.
Protocol buffers are:
Space efficient, relying on a custom format to represent data compactly.
Strongly type-safe across languages (particularly valuable in strongly-typed languages like Java, but even in Python it's still quite useful).
Designed to be backwards and forwards-compatible. It's easy to make structural changes to protocol buffers (normally adding new fields or deprecating old ones) without needing to ensure all applications using the proto are updated simultaneously.
Somewhat tedious to work with manually. While there is a text format, it is mostly useful for manually inspecting, not storing, protos. JSON, for instance, is much easier for a human to write and edit. Therefore protos are usually written and read by programs.
Dependent on a .proto compiler. By separating the structure from the data protocol buffers can be lean and mean, but it means without an associated .proto file and a tool like protoc to generate code to parse it, arbitrary data in proto format is unusable. This makes protos a poor choice for sending data to other people who may not have the .proto file on hand.
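As a sketch of what working with protos looks like from Python (person.proto, its fields, and the generated module name person_pb2 are hypothetical; SerializeToString and ParseFromString are the standard generated-code calls):

    # assumes a hypothetical person.proto, e.g.
    #     message Person { string name = 1; int32 id = 2; }
    # compiled beforehand with: protoc --python_out=. person.proto
    import person_pb2

    p = person_pb2.Person(name="Ada", id=42)
    blob = p.SerializeToString()          # compact binary wire format

    decoded = person_pb2.Person()
    decoded.ParseFromString(blob)
    print(decoded.name, decoded.id)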
To make some sweeping generalizations about different formats:
CSV/TSV/etc. are useful for human-constructed data that never needs to be transmitted between people or programs. They're easy to construct and easy to parse, but a nightmare to keep in sync, and they can't easily represent complex structures.
Language-specific serialization like pickles can be useful for short-lived serialization, but it quickly runs into backwards-compatibility issues and obviously limits you to one language. Except in some very specific cases, protobufs accomplish all the same goals with more safety and better future-proofing.
JSON is ideal for sending data between different parties (e.g. public APIs). Because the structure and the content are transmitted together anyone can understand it, and it's easy to parse in all major languages. There's little reason nowadays to use other structured formats like XML.
Binary formats like Protocol Buffers are ideal for almost all other data serialization use cases: long- and short-term storage, inter-process communication, intra-process and application-wide caching, and more.
Google famously uses protocol buffers for practically everything they do. If you can imagine a reason to need to store or transmit data, Google probably does it with protocol buffers.
I used them to create a financial trading system. Here are the reasons:
There are libraries for many languages. Some things needed to be in C++, others in C#, and it was open to extending to Python or Java, etc.
It needed to be fast to serialize/deserialize and compact. This is due to the speed requirement in the financial trading system. The messages were quite a lot shorter than comparable text type messages, which meant you never had a problem fitting them in one network packet.
It didn't need to be readable off the wire. Previously the system used XML, which is nice for debugging, but you can get debugging output in other ways and turn it off in production.
It gives your message a natural structure, and an API for getting the parts you need. Writing something custom would have required thinking about all the helper functions to pull numbers out of the binary, with corner cases and all that.
There are many scattered posts out on StackOverflow, regarding Python modules used to save and load data.
I myself am familiar with json and pickle, and I have heard of PyTables too. There are probably more out there. Also, each module seems to fit a certain purpose and has its own limits (e.g. loading a large list or dictionary with pickle takes ages, if it works at all). Hence it would be nice to have a proper overview of the possibilities.
Could you then help providing a comprehensive list of modules used to save and load data, describing for each module:
what the general purpose of the module is,
its limits,
why you would choose this module over others?
marshal:
Pros:
Can read and write Python values in a binary format, so it can be faster than pickle (whose oldest protocol is character-based).
Cons:
Not all Python object types are supported. Some unsupported types, such as subclasses of built-ins, will appear to marshal and unmarshal correctly.
Is not intended to be secure against erroneous or maliciously constructed data.
The Python maintainers reserve the right to modify the marshal format in backward-incompatible ways should the need arise.
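A minimal round-trip sketch (the payload is just an example):

    import marshal

    payload = {"xs": [1, 2, 3], "label": "trial"}
    blob = marshal.dumps(payload)         # binary, tied to the CPython version
    assert marshal.loads(blob) == payload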
shelve:
Pros:
Values in a shelf can be essentially arbitrary Python objects
Cons:
Does not support concurrent read/write access to shelved objects
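A small usage sketch (the shelf name and values are illustrative):

    import shelve

    # a shelf behaves like a persistent dict keyed by strings
    with shelve.open("scratch_shelf") as db:
        db["results"] = {"model": "ols", "coefs": [1.2, 3.4]}

    with shelve.open("scratch_shelf") as db:
        print(db["results"])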
ZODB (suggested by @Duncan):
Pros:
transparent persistence
full transaction support
pluggable storage
scalable architecture
Cons:
not part of the standard library.
unable (easily) to reload data unless the original Python object model used for persisting is available (consider version difficulties and data portability).
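A hedged sketch following ZODB's documented FileStorage pattern (the file name and the stored dict are made up):

    import ZODB, ZODB.FileStorage
    import transaction

    db = ZODB.DB(ZODB.FileStorage.FileStorage("data.fs"))
    conn = db.open()
    root = conn.root()                    # a persistent mapping
    root["experiment"] = {"trial": 1, "score": 0.9}
    transaction.commit()                  # transactional, transparent persistence
    db.close()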
There is an overview of the standard library's data persistence modules in the Python docs.
This question may be seen as subjective, but I'd like to ask SO users which common structured textual data format is best supported in Python.
My initial choices are:
XML
JSON
and YAML
Which of these three is easiest to work with in Python (i.e. has the best library support / performance)? Or is there another format that I haven't mentioned that is better supported in Python?
I cannot use a Python-only format (e.g. pickling) since interop is quite important, but the majority of the code that handles these files will be written in Python, so I'm keen to use a format that has the strongest support in Python.
CSV or fixed-column text may also be viable for most use cases; however, I'd prefer the flexibility of a more scalable format.
Thank you
Note
Regarding interop: I will be generating these files initially from Ruby, using Builder; however, Ruby will not be consuming these files again.
I would go with JSON. I mean, YAML is awesome, but interop with it is not that great.
XML is just an ugly mess to look at and has too much fat.
Python has a built-in JSON module since version 2.6.
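A quick round-trip sketch with the built-in module (the record and file name are made up):

    import json

    record = {"name": "widget", "tags": ["a", "b"], "count": 3}

    with open("record.json", "w") as f:
        json.dump(record, f)

    with open("record.json") as f:
        assert json.load(f) == record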
JSON has great Python support and it is much more compact than XML (and the API is generally more convenient if you're just trying to dump and load objects). There's no out-of-the-box support for YAML that I know of, although I haven't really checked. In the abstract I would suggest using JSON due to the low overhead of the format and the wide range of language support, but it does depend a bit on your application - if you're working in a space that already has established applications, the formats they use might be preferable, even if they're technically deficient.
I think it depends a lot on what you need to do with the data. If you're going to be building a complex database and doing processing and transformations on it, I suspect you'd be better off with XML. I've found the lxml module pretty useful in this regard. It has full support for standards like XPath and XSLT, and this support is implemented in native code, so you'll get good performance.
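For instance, a tiny lxml sketch (the XML snippet and tag names are invented):

    from lxml import etree

    doc = etree.fromstring(b"<orders><order id='1'><total>9.99</total></order></orders>")
    totals = doc.xpath("//order/total/text()")   # XPath query evaluated in native code
    print(totals)                                # ['9.99']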
But if you're doing something simpler, you'd likely be better off using a simpler format like YAML or JSON. I've heard tell of "JSON transforms" but don't know how mature the technology is or how developed Python's access to it is.
It's pretty much all the same, out of those three. Use whichever is easier to inter-operate with.
Looking for advice on the best technique for saving complex Python data structures across program sessions.
Here's a list of techniques I've come up with so far:
pickle/cPickle
json
jsonpickle
xml
database (like SQLite)
Pickle is the easiest and fastest technique, but my understanding is that there is no guarantee that pickle output will work across various versions of Python 2.x/3.x or across 32- and 64-bit implementations of Python.
JSON only works for simple data structures. jsonpickle seems to correct this, and seems to be written to work across different versions of Python.
Serializing to XML or to a database is possible, but represents extra effort since we would have to do the serialization ourselves manually.
Thank you,
Malcolm
You have a misconception about pickles: they are guaranteed to work across Python versions. You simply have to choose a protocol version that is supported by all the Python versions you care about.
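For example, a minimal sketch of pinning a protocol (the data is made up; protocol 2, for instance, can be read by both Python 2 and Python 3):

    import pickle

    data = {"weights": [0.1, 0.2], "epoch": 10}
    blob = pickle.dumps(data, protocol=2)   # explicit, widely supported protocol
    assert pickle.loads(blob) == data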
The technique you left out is marshal, which is not guaranteed to work across Python versions (and btw, is how .pyc files are written).
You left out the marshal and shelve modules.
Also, this Python docs page covers persistence.
Have you looked at PySyck or PyYAML?
What are your criteria for "best"?
pickle can do most Python structures, deeply nested ones too
SQLite DBs can be easily queried (if you know SQL :); see the sketch at the end of this answer
speed / memory? trust no benchmarks that you haven't faked yourself.
(Fine print: cPickle.dump(protocol=-1) produces much more compact output, in one case a 15 MB pickle vs. a 60 MB SQLite file, but it can break. Strings that occur many times, e.g. country names, may take more memory than you expect; see the built-in intern().)
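For the SQLite point above, a small in-memory sketch (the table and column names are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (country TEXT, score REAL)")
    conn.executemany("INSERT INTO results VALUES (?, ?)",
                     [("FR", 0.8), ("DE", 0.9)])
    rows = conn.execute("SELECT country FROM results WHERE score > 0.85").fetchall()
    print(rows)                           # [('DE',)]
    conn.close()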