This question may be seen as subjective, but I'd like to ask SO users which common structured textual data format is best supported in Python.
My initial choices are:
XML
JSON
and YAML
Which of these three is easiest to work with in Python (i.e. has the best library support / performance)? Or is there another format that I haven't mentioned that is better supported in Python?
I cannot use a Python-only format (e.g. pickling) since interop is quite important, but the majority of the code that handles these files will be written in Python, so I'm keen to use a format that has the strongest support in Python.
CSV or fixed-column text may also be viable for most use cases; however, I'd prefer the flexibility of a more scalable format.
Thank you
Note
Regarding interop: I will initially be generating these files from Ruby, using Builder; however, Ruby will not be consuming these files again.
I would go with JSON. YAML is awesome, but interop with it is not that great.
XML is just an ugly mess to look at and has too much fat.
Python has a built-in JSON module since version 2.6.
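A minimal sketch of the round trip with the built-in module (the data here is just illustrative):

import json

# Serialize a nested Python structure to a JSON string ...
data = {"name": "example", "tags": ["a", "b"], "count": 3}
text = json.dumps(data)

# ... and parse it back into plain Python objects.
assert json.loads(text) == data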
JSON has great Python support, and it is much more compact than XML (and the API is generally more convenient if you're just trying to dump and load objects). There's no out-of-the-box support for YAML that I know of, although I haven't really checked. In the abstract I would suggest using JSON due to the low overhead of the format and the wide range of language support, but it does depend a bit on your application: if you're working in a space that already has established applications, the formats they use might be preferable, even if they're technically deficient.
I think it depends a lot on what you need to do with the data. If you're going to be building a complex database and doing processing and transformations on it, I suspect you'd be better off with XML. I've found the lxml module pretty useful in this regard. It has full support for standards like XPath and XSLT, and this support is implemented in native code, so you'll get good performance.
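For example, a small sketch of an XPath query with lxml (assuming lxml is installed; the document snippet is made up):

from lxml import etree

# Parse a small document and pull values out with XPath.
doc = etree.fromstring("<items><item id='1'>foo</item><item id='2'>bar</item></items>")
print(doc.xpath("//item/text()"))   # ['foo', 'bar']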
But if you're doing something simpler, then you'd likely be better off using a simpler format like YAML or JSON. I've heard tell of "JSON transforms", but I don't know how mature the technology is or how well developed Python's access to it is.
It's pretty much all the same, out of those three. Use whichever is easier to inter-operate with.
Related
Is there an easy way to pass a complete data structure with multiple data types from C++ to Python and vice versa?
I have a complex class with pointers to floats, longs, etc. I could convert this into a JSON string and parse it both ways, but this would be really slow.
However, if we had a special format that takes this data but also stores metadata about the start/end of the JSON string, it would parse much faster. Is there anything like this?
I would personally recommend serializing your data into JSON in C++ using, e.g., rapidjson or Qt, then transferring the resulting string to Python using the C API bindings for Python and deserializing it into a Python dictionary there. One-way or two-way transfer should be easy enough.
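The Python side of that hand-off can stay tiny; a hedged sketch, where the function name and payload shape are hypothetical (the C++ side would invoke it through the C API):

import json

def receive_payload(json_text):
    # Turn the JSON string produced in C++ into a plain Python dict.
    return json.loads(json_text)

# e.g. receive_payload('{"floats": [1.5, 2.5], "longs": [10, 20]}')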
A note about the C API bindings, however: I have used them in the past, and it was not a pleasant experience in any way, shape, or form. Eventually you will make it work and do what you want, but it will cost you some nerves.
Lastly, do not worry about performance. Since you are using Python (an interpreted language), you are apparently not doing anything performance-critical anyway, so the cost of the JSON (de)serialization can be ignored here.
Good luck, because you are going to need it with Python's C API bindings.
There is a Scala library I'd like to use, namely BIDMach; however, I need to be able to use it from Python rather than from Scala. I've been trying to think of different ways of communicating between the library and Python code, such as creating an HTTP server in Scala and calling it from Python, using something like JPype to use Scala libraries from Python, and various types of interprocess communication. However, none of them seem to work very well, and they would seem to require a large amount of reimplementation of what is already in the library. Does anyone know of a good way of going about this?
Edit: In terms of exactly what I'd like to do, ideally I'd be able to make almost all of the library's functionality usable from Python, though that probably isn't realistic. It would be nice if some of the Scala classes were easily usable in Python without too much repeated implementation effort. The reason I don't think what I've looked into so far will work well is that it requires a fair bit of reimplementation of what is already in the library (e.g. representing something like a matrix in JSON as a way of transporting data between Python and Scala).
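To illustrate the HTTP idea, the Python side would be roughly this thin; the route, port, and payload shape are hypothetical and would have to be defined by a small Scala wrapper around BIDMach (this assumes the third-party requests package):

import requests

def call_bidmach(operation, payload):
    # POST a JSON request to a hypothetical Scala HTTP wrapper and decode the JSON reply.
    resp = requests.post("http://localhost:8080/%s" % operation, json=payload)
    resp.raise_for_status()
    return resp.json()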
I'm currently working on a project where I need to transfer objects from Ruby to Python and back again; obviously, serialization is the way to go. I've looked at things like YAML but decided to write my own, as I didn't want to deal with the dependencies of the libraries and such when it came time to distribute. I've written up how this serialization format works here.
My question: as this format is intended to work cross-language between Ruby and Python, how should I serialize Ruby's symbols? I'm not aware of an object that works the same way in Python. Should a dump containing a symbol fail? Should I just serialize it as a string? What would be best?
Doesn't that depend on what your project needs? If symbols are important, you'll need some way to deal with them.
I'm not a Ruby programmer, but from what I've just read, I think converting them to strings is probably easiest. The standard Python interpreter will reuse memory for identical short strings, which seems to be a key reason suggested for using symbols.
EDIT: If it needs to work for other programmers, passing values back and forth shouldn't change them. So you either have to handle symbols properly, or throw an error straight away. It should be simple enough in Python:
class Symbol(str):
    pass

# In serialising code:
if isinstance(x, Symbol):
    serialise_as_symbol(x)
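And on the reading side, something like this keeps symbols distinguishable from ordinary strings after a round trip (the ":name" wire form and the helper name are only illustrative, not part of any standard):

def parse_token(token):
    # A hypothetical wire format: ':name' becomes Symbol('name'), anything else stays a str.
    if token.startswith(":"):
        return Symbol(token[1:])
    return token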
Any reason you're not using a standard data interchange format like JSON or XML? They seem to be acceptable to countless applications, services, and programmers.
If symbols are a stumbling block, then you have three choices: don't allow them, convert them to strings on the fly, or figure out a way to make them universal and/or innocuous in other languages.
For a while I've been using a package called "gnosis-utils", which provides an XML pickling service for Python. This class works reasonably well; however, it seems to have been neglected by its developer for the last four years.
At the time we originally selected Gnosis, it was the only XML serialization tool for Python. The advantage of Gnosis was that it provided a set of classes whose function was very similar to the built-in Python XML pickler. It produced XML which Python developers found easy to read, but non-Python developers found confusing.
Now that the project has grown, we have a new requirement: we need to be able to exchange XML with our colleagues who prefer Java or .NET. These non-Python developers will not be using Python; they intend to produce XML directly, hence we need to simplify the format of the XML.
So, are there any alternatives to Gnosis? Our requirements:
Must work on Python 2.4 / Windows x86 32bit
Output must be XML, as simple as possible
API must resemble Pickle as closely as possible
Performance is not hugely important
Of course we could simply adapt Gnosis; however, we'd prefer to use a component which already provides the functions we require (assuming that it exists).
So what you're looking for is a python library that spits out arbitrary XML for your objects? You don't need to control the format, so you can't be bothered to actually write something that iterates over the relevant properties of your data and generates the XML using one of the existing tools?
This seems like a bad idea. Arbitrary XML serialization doesn't sound like a good way to move forward. Any format that includes all of pickle's features is going to be ugly, verbose, and very nasty to use. It will not be simple. It will not translate well into Java.
What does your data look like?
If you tell us precisely what aspects of pickle you need (and why lxml.objectify doesn't fulfill those), we will be better able to help you.
Have you considered using JSON for your serialization? It's easy to parse, natively supports python-like data structures, and has wide-reaching support. As an added bonus, it doesn't open your code to all kinds of evil exploits the way the native pickle module does.
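For flat objects the switch can be as small as serializing the instance dictionary; a rough sketch, not a drop-in replacement for everything pickle handles:

import json

class Foo(object):
    pass

foo = Foo()
foo.bar = 'baz'

# Dump only the instance attributes; nested custom objects would need more work.
dump_str = json.dumps(foo.__dict__)   # '{"bar": "baz"}'
restored = json.loads(dump_str)       # a plain dict, not a Foo instance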
Honestly, you need to bite the bullet and define a format, and build a serializer using the standard XML tools, if you absolutely must use XML. Consider JSON.
There is xml_marshaller, which provides a simple way of dumping arbitrary Python objects to XML:
>>> from xml_marshaller import xml_marshaller
>>> class Foo(object): pass
>>> foo = Foo()
>>> foo.bar = 'baz'
>>> dump_str = xml_marshaller.dumps(foo)
Pretty printing the above with lxml (which is a dependency of xml_marshaller anyway):
>>> from lxml.etree import fromstring, tostring
>>> print tostring(fromstring(dump_str), pretty_print=True)
You get output like this:
<marshal>
  <object id="i2" module="__main__" class="Foo">
    <tuple/>
    <dictionary id="i3">
      <string>bar</string>
      <string>baz</string>
    </dictionary>
  </object>
</marshal>
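Assuming xml_marshaller also exposes the matching loads (I have not re-verified the exact API), the object can be reconstructed from the same string:

>>> restored = xml_marshaller.loads(dump_str)
>>> restored.bar
'baz'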
I did not check for Python 2.4 compatibility, since this question was asked long ago, but a solution for dumping arbitrary Python objects to XML remains relevant.
Looking for advice on the best technique for saving complex Python data structures across program sessions.
Here's a list of techniques I've come up with so far:
pickle/cpickle
json
jsonpickle
xml
database (like SQLite)
Pickle is the easiest and fastest technique, but my understanding is that there is no guarantee that pickle output will work across various versions of Python 2.x/3.x or across 32 and 64 bit implementations of Python.
JSON only works for simple data structures. jsonpickle seems to correct this, and it seems to be written to work across different versions of Python.
Serializing to XML or to a database is possible, but represents extra effort since we would have to do the serialization ourselves manually.
Thank you,
Malcolm
You have a misconception about pickles: they are guaranteed to work across Python versions. You simply have to choose a protocol version that is supported by all the Python versions you care about.
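For example, pinning the protocol keeps the output readable by every interpreter you care about (protocol 2 is understood by every Python from 2.3 onward):

import pickle

data = {"values": [1, 2, 3], "nested": {"key": (4, 5)}}

# Protocol 2 can be read by both Python 2.x and 3.x; higher protocols need newer readers.
blob = pickle.dumps(data, protocol=2)
assert pickle.loads(blob) == data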
The technique you left out is marshal, which is not guaranteed to work across Python versions (and btw, is how .pyc files are written).
You left out the marshal and shelve modules.
Also, this Python docs page covers persistence.
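A minimal shelve sketch, which gives you a persistent, dict-like store backed by pickle (the filename is arbitrary):

import shelve

# Open (or create) a file-backed dictionary; values are pickled transparently.
db = shelve.open("session_data")
db["settings"] = {"window": (800, 600), "recent": ["a.txt", "b.txt"]}
db.close()

# Later, in another session:
db = shelve.open("session_data")
print(db["settings"])
db.close()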
Have you looked at PySyck or PyYAML?
What are your criteria for "best"?
pickle can do most Python structures, deeply nested ones too
SQLite databases can be easily queried (if you know SQL :)
speed / memory? Trust no benchmarks that you haven't faked yourself.
(Fine print:
cPickle.dump(protocol=-1) compresses (in one case, a 15 MB pickle vs. a 60 MB SQLite file), but can break.
Strings that occur many times, e.g. country names, may take more memory than you expect;
see the built-in intern().
)
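A tiny illustration of both points (cPickle is the Python 2 name; on Python 3 it is plain pickle, and intern() has moved to sys.intern):

import cPickle

# Passing -1 selects the highest (binary) pickle protocol: smaller output,
# but older readers may not understand it.
blob = cPickle.dumps({"country": "France"}, -1)

# Repeated strings can be folded into a single object with the built-in intern().
names = [intern("France") for _ in range(1000)]   # all entries reference one string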