I'm learning Python; my background is Java EE. I have used JAXB before, where I can basically define a regular class, throw some annotations in there and then use JAXB to marshal objects to XML. This means I am not concerned with creating root elements, nodes, etc., but merely with writing the Java class and annotating it here and there. Is there anything like this for Python?
Here are a few:
lxml.objectify
gnosis.xml.objectify
pyxser seems pretty cool
Pickle to XML - uses Python's pickle and xml.dom.minidom
PyXML - from xml import marshal (might be buggy)
Amara might be worth looking into.
PyXB seems to be the closest thing to JAXB although I haven't used it yet. I use lxml at the moment and find it works well.
Amara was promising but seemed to stagnate.
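For readers coming from JAXB, here is a minimal sketch of the "plain class in, XML out" idea using only the standard library. It is not the API of any of the libraries listed above; it assumes Python 3.7+ (for dataclasses), and the marshal helper and the Person/Address classes are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields, is_dataclass


def marshal(obj, tag=None):
    """Turn a dataclass instance into an Element, one child per field."""
    elem = ET.Element(tag or type(obj).__name__.lower())
    for field in fields(obj):
        value = getattr(obj, field.name)
        if is_dataclass(value):
            # Nested dataclasses become nested elements.
            elem.append(marshal(value, field.name))
        else:
            child = ET.SubElement(elem, field.name)
            child.text = str(value)
    return elem


@dataclass
class Address:  # hypothetical example class
    city: str
    zip_code: str


@dataclass
class Person:  # hypothetical example class
    name: str
    age: int
    address: Address


person = Person("Ada", 36, Address("London", "N1"))
xml_text = ET.tostring(marshal(person), encoding="unicode")
print(xml_text)
```

The class definitions play the role of the annotated Java classes; the field names double as element names, so there is no per-element tree-building code to write.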
Related
I have an app which constantly reloads a large amount of XML data from a file, and then performs manipulations, and then writes back to file.
The lxml library is proven much faster for parsing and un-parsing XML, but cElementTree is much faster for certain kinds of manipulation. Both have an almost identical API.
How can I parse an XML file with lxml, and then manipulate it with cElementTree?
This is what I've tried, but the object produced by lxml's parse methods inherently uses its own manipulation methods.
import xml.etree.cElementTree as ET
from lxml import etree as lxmlET
This question is perhaps the Python equivalent of "My friend has a fast car and I just have a clunker. How can I make my car go as fast as hers?"
I'm not saying this couldn't be done, but I would call such an enterprise either ambitious or foolhardy, depending on your level of programming skill. The point is that each system has, as you have discovered, its own internal representation of the parsed XML.
While it might be possible to write code to take the parsed object produced by lxml and re-create or wrap it as ElementTree elements, it's probably going to a) take as long as parsing with ElementTree in the first place, and b) be a maintenance nightmare.
So do yourself a favor and choose one technology then stick with it (at least for each individual program).
I would also point out that XML was intended primarily as a data interchange language. The fact that you seem to be using it as a structured data repository inevitably introduces large inefficiencies in the processing, particularly as data volumes go up. Might it be better to choose some more amenable representation and then only convert it to XML for output and usage by other systems?
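A rough sketch of that last suggestion, assuming the data can live in ordinary Python structures while the program works on it. The record layout and the to_xml helper are invented for illustration; only the final step touches an XML library.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; a real program would load these once at startup.
records = [
    {"id": "1", "name": "widget", "qty": 3},
    {"id": "2", "name": "gadget", "qty": 0},
]

# Manipulate ordinary Python data; no XML library is involved here.
for record in records:
    record["qty"] += 10
in_stock = [r for r in records if r["qty"] > 0]


def to_xml(rows):
    """Serialize the rows to XML only at the output boundary."""
    root = ET.Element("items")
    for row in rows:
        item = ET.SubElement(root, "item", id=row["id"])
        ET.SubElement(item, "name").text = row["name"]
        ET.SubElement(item, "qty").text = str(row["qty"])
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(in_stock)
print(xml_text)
```

With this split, the choice of XML library only affects the load and save steps, which is where the parsing speed difference matters.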
For a while I've been using a package called "gnosis-utils" which provides an XML pickling service for Python. This package works reasonably well, but it seems to have been neglected by its developer for the last four years.
At the time we originally selected Gnosis it was the only XML serialization tool for Python. The advantage of Gnosis was that it provided a set of classes whose function was very similar to the built-in Python pickler. It produced XML which Python developers found easy to read, but non-Python developers found confusing.
Now that the project has grown we have a new requirement: we need to be able to exchange XML with our colleagues who prefer Java or .NET. These non-Python developers will not be using Python; they intend to produce XML directly, hence we need to simplify the format of the XML.
So are there any alternatives to Gnosis? Our requirements:
Must work on Python 2.4 / Windows x86 32bit
Output must be XML, as simple as possible
API must resemble Pickle as closely as possible
Performance is not hugely important
Of course we could simply adapt Gnosis, but we'd prefer to use a component which already provides the functions we require (assuming that it exists).
So what you're looking for is a python library that spits out arbitrary XML for your objects? You don't need to control the format, so you can't be bothered to actually write something that iterates over the relevant properties of your data and generates the XML using one of the existing tools?
This seems like a bad idea. Arbitrary XML serialization doesn't sound like a good way to move forward. Any format that includes all of pickle's features is going to be ugly, verbose, and very nasty to use. It will not be simple. It will not translate well into Java.
What does your data look like?
If you tell us precisely what aspects of pickle you need (and why lxml.objectify doesn't fulfill those), we will be better able to help you.
Have you considered using JSON for your serialization? It's easy to parse, natively supports python-like data structures, and has wide-reaching support. As an added bonus, it doesn't open your code to all kinds of evil exploits the way the native pickle module does.
Honestly, you need to bite the bullet and define a format, and build a serializer using the standard XML tools, if you absolutely must use XML. Consider JSON.
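To make "define a format and build a serializer" concrete, here is a small sketch. It assumes a modern Python 3 (ElementTree and json are not in the Python 2.4 standard library), and the Foo class with its bar and count attributes is purely hypothetical.

```python
import json
import xml.etree.ElementTree as ET


class Foo(object):  # hypothetical class standing in for your data
    def __init__(self, bar, count):
        self.bar = bar
        self.count = count


def foo_to_xml(foo):
    """Write exactly the fields we agreed on, in a fixed layout."""
    elem = ET.Element("foo")
    ET.SubElement(elem, "bar").text = foo.bar
    ET.SubElement(elem, "count").text = str(foo.count)
    return ET.tostring(elem, encoding="unicode")


def foo_from_xml(text):
    """Read the same layout back, converting types explicitly."""
    elem = ET.fromstring(text)
    return Foo(elem.findtext("bar"), int(elem.findtext("count")))


def foo_to_json(foo):
    """The JSON alternative: a plain dict of the same fields."""
    return json.dumps({"bar": foo.bar, "count": foo.count}, sort_keys=True)


foo = Foo("baz", 2)
xml_text = foo_to_xml(foo)
json_text = foo_to_json(foo)
print(xml_text)
print(json_text)
```

The XML here contains nothing Python-specific, so a Java or .NET developer can produce or consume it by hand, which a pickle-style format would not allow.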
There is xml_marshaller which provides a simple way of dumping arbitrary Python objects to XML:
>>> from xml_marshaller import xml_marshaller
>>> class Foo(object): pass
>>> foo = Foo()
>>> foo.bar = 'baz'
>>> dump_str = xml_marshaller.dumps(foo)
Pretty printing the above with lxml (which is a dependency of xml_marshaller anyway):
>>> from lxml.etree import fromstring, tostring
>>> print tostring(fromstring(dump_str), pretty_print=True)
You get output like this:
<marshal>
<object id="i2" module="__main__" class="Foo">
<tuple/>
<dictionary id="i3">
<string>bar</string>
<string>baz</string>
</dictionary>
</object>
</marshal>
I did not check for Python 2.4 compatibility since this question was asked long ago, but a solution for xml dumping arbitrary Python objects remains relevant.
EDIT
The use of the phrase "bad at XML" in this question has been a point of contention, so I'd like to start out by providing a very clear definition of what I mean by this term in this context: if support for standard XML APIs is poor, and forces one to use a language-specific API, in which namespaces seem to be an afterthought, then I would be inclined to characterize that language as being not as well suited to using XML as other mainstream languages that do not have these issues. "Bad at XML" is just a shorthand for these conditions, and I think it is a fair way to characterize it. As I will describe, my initial experience with Python has raised concerns about whether it fulfils these conditions; but, because in general my experience with Python has been quite positive, it seems likely that I'm missing something, thus motivating this question.
I'm trying to do some very simple XML processing with Python. I had initially hoped to be able to reuse my knowledge of the standard W3C DOM APIs, and happily found that the xml.dom and xml.dom.minidom modules did a good job of supporting these APIs. Unfortunately, however, serialization proved to be problematic, for the following reasons:
xml.dom does not come with a serializer
the PyXML library, which includes a serializer for xml.dom, is no longer maintained, AND
minidom does not support serialization of namespaces, even though namespaces are supported in the API
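One commonly used workaround for the last point is to add the namespace declaration by hand as an ordinary attribute before serializing. The sketch below assumes that approach is acceptable for your documents; the URI and the "ex" prefix are made up for illustration.

```python
from xml.dom import minidom

NS = "urn:example:ns"  # hypothetical namespace URI

doc = minidom.getDOMImplementation().createDocument(NS, "ex:root", None)
root = doc.documentElement
# Declare the prefix explicitly so the serialized output is namespace-valid.
root.setAttribute("xmlns:ex", NS)

item = doc.createElementNS(NS, "ex:item")
item.appendChild(doc.createTextNode("hello"))
root.appendChild(item)

xml_text = root.toxml()
print(xml_text)
```

This keeps the W3C-style API for building the tree, at the cost of tracking the declarations yourself.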
I looked through the list of other W3C-like libraries here:
http://wiki.python.org/moin/PythonXml#W3CDOM-likelibraries
I found that many other libraries, such as 4Suite and libxml2dom, are also not maintained.
On the other hand, itools at first glance appears to be maintained, but there does not appear to be an Ubuntu/Debian package available, and so would be difficult to deploy and maintain.
At this point, it seemed like trying to use the W3C DOM APIs in my Python application was going to be a dead end, and I began to look at the ElementTree API. But the way the etree API supports namespaces is, I think, horribly ugly, requiring one to use string concatenation every time an element in a particular namespace is created:
http://lxml.de/tutorial.html#namespaces
So, my question is, have I overlooked something, or is support for XML (in particular W3C DOM) actually quite bad in Python?
EDIT
Here follows a list of more precise questions, the answers to which would really help me:
Is there reasonable support for W3C DOM in Python?
If not xml.dom, do you use e.g. etree instead of W3C DOM?
If so, which library is best, and how do you overcome the issues regarding namespacing in the API?
If you use W3C DOM instead, are you aware of a library that implements serialization with support for namespaces?
I would say python handles XML pretty well. The number of different libraries available speaks to that - you have lots of options. And if there are features missing from libraries that you would like to use, feel free to contribute some patches!
I personally use the DOM and lxml.etree (etree is really fast). However, I feel your pain about the namespace thing. I wrote a quick helper function to deal with it:
import re

DEFAULT_NS = "http://www.domain.org/path/to/xml"

def add_xml_namespace(path, namespace=DEFAULT_NS):
    """Adds namespaces to an XPath-ish expression path for etree

    Test simple expression:
    >>> add_xml_namespace('image/namingData/fileBaseName')
    '{http://www.domain.org/path/to/xml}image/{http://www.domain.org/path/to/xml}namingData/{http://www.domain.org/path/to/xml}fileBaseName'

    More complicated expression
    >>> add_xml_namespace('.//image/*')
    './/{http://www.domain.org/path/to/xml}image/*'
    >>> add_xml_namespace('.//image/text()')
    './/{http://www.domain.org/path/to/xml}image/text()'
    """
    # Only plain element names are qualified; '.', '*', 'text()' and
    # empty steps are left untouched.
    pattern = re.compile(r'^[A-Za-z0-9-]+$')
    tags = path.split('/')
    for i in range(len(tags)):
        if pattern.match(tags[i]):
            tags[i] = "{%s}%s" % (namespace, tags[i])
    return '/'.join(tags)
I use it like so:
from lxml import etree
from utilities import add_xml_namespace as ns
tree = etree.parse('file.xml')
node = tree.getroot().find(ns('root/group/subgroup'))
# etc.
If you don't know the namespace ahead of time, you can extract it from a root node:
tree = etree.parse('file.xml')
root = tree.getroot().tag
namespace = root[1:root.index('}')]
ns = lambda path: add_xml_namespace(path, namespace)
...
Additional comment: There is a little work involved here, but work is necessary when dealing with XML. That's not a python issue, it's an XML issue.
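As an alternative to rewriting the path string, the find methods in the standard library's ElementTree accept a prefix-to-URI mapping (as far as I know lxml's find methods take a similar namespaces argument). A small sketch, where the sample document and the prefix "d" are assumptions made up for illustration:

```python
import xml.etree.ElementTree as ET

XML = """<root xmlns="http://www.domain.org/path/to/xml">
  <group><subgroup>value</subgroup></group>
</root>"""

# The prefix "d" is arbitrary; it only has to match the one used in the path.
NSMAP = {"d": "http://www.domain.org/path/to/xml"}

root = ET.fromstring(XML)
node = root.find("d:group/d:subgroup", namespaces=NSMAP)
text = node.text
print(text)
```

The path stays readable, and the namespace URI is written once in the mapping instead of being spliced into every step.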
Python is great at handling XML. I consider lxml to be the best XML library I have ever worked with; it is powerful and significantly simpler than the DOM. The namespace handling took some getting used to, but I think it is another great way lxml keeps things simple.
EDIT
After rereading the question, it is unclear whether the author meant the serialization of Python objects, or just the DOM tree. The portion of my answer below assumes the former.
XML serialization is a completely different issue. Personally, I don't think it is very important. Most XML serializers produce output that is pretty specific to the language or runtime, which defeats the purpose of having such an open format. I realize that there are some generic XML serialization schemas, but Python provides two solutions that are superior for 95% of situations: pickling and JSON.
If your application doesn't have to share objects with non-python systems, Pickling is the fastest and most powerful serialization solution you will find. JSON is significantly faster to parse and generate, and much easier to work with than XML. JSON has plenty of limitations, but it is frequently easier to work around them, than deal with the headaches of XML.
There are plenty of other serialization formats that, depending on the application, I would recommend ahead of XML (e.g. Google Protocol Buffers or YAML).
Also, don't forget about SAX. Event driven parsers are only useful for reading XML, but I have found that it is still the best solution for some problems.
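For anyone who hasn't used SAX, here is a minimal sketch with the standard library's xml.sax. The TagCounter handler and the sample document are hypothetical; the point is that the parser calls your handler as it reads, without building a tree.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts how often each element name occurs while the parser streams."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once for every opening tag, in document order.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = TagCounter()
xml.sax.parseString(b"<items><item/><item/><other/></items>", handler)
counts = handler.counts
print(counts)
```

Because nothing is kept except what the handler chooses to store, this style suits very large files where a full tree would not fit comfortably in memory.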
But the way the eTree API supports namespaces I think is horribly ugly, requiring one to use string concatenation every time an element in a particular namespace is created
Here's how you create an element with a namespace in .NET's System.Xml.Linq DOM:
XNamespace ns = "my-namespace";
XElement elm = new XElement(ns + "foo");
Here's how you create an element in a namespace in lxml:
ns = "{my-namespace}"
elm = etree.Element(ns + "foo")
I'm not seeing horrible ugliness here. In fact, the developers of the .NET API have bent over backwards, creating base classes that support operator overloading, to make it possible for their API to handle namespaces as intuitively as lxml's does.
How this is uglier than the W3C DOM requiring you to use different methods to create elements with and without namespaces is beyond me.
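If the concatenation still bothers you, it is easy to hide behind a one-line helper. The sketch below uses the standard library's ElementTree rather than lxml, and the namespace URI, the "my" prefix and the qname helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:my-namespace"  # hypothetical namespace URI

# Ask the serializer to use a readable prefix instead of a generated one.
ET.register_namespace("my", NS)


def qname(local):
    """Build the '{uri}local' form that ElementTree expects for tags."""
    return "{%s}%s" % (NS, local)


root = ET.Element(qname("foo"))
ET.SubElement(root, qname("bar")).text = "baz"
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The same qname helper works for lookups such as root.find(qname("bar")), so the URI appears in one place only.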
Currently I have two options, lxml and libxml2, that both seem to work. I have tried benchmarking both, specifically for parsing in-memory strings and files into XML, and for importing XSLT stylesheets and applying them. While pure performance tests indicate that lxml comes out on top (specifically when applying stylesheets), libxml2 seems to be used as the de facto standard in many other languages. In addition, during parsing lxml seems to have some difficulties with entity substitution.
My question primarily is: has anyone used lxml successfully in production, and what were your impressions?
I've used LXML and been very impressed. The flexibility offered by having both the etree-like and objectify interfaces is pretty handy. I also like the fact that I don't have to have any separate text nodes.
As far as entity substitutions, I had a few issues too, but for me it was a matter of giving the parser the right options when creating it.
For example, if you're trying to load entities from a remote DTD, you might try something like:
parser = etree.XMLParser(load_dtd=True, no_network=False)
The no_network flag defaults to True and is a bit counter-intuitive in my opinion, but that's really the only snag I've hit with it.
Considering that I want to write python code that would run on Google App Engine and also inside jython, C-extensions are not an option. Amara was a nice library, but due to its C-extensions, I can't use it for either of these platforms.
ElementTree is very nice. It has also been part of the standard library since Python 2.5.
There's also Beautiful Soup (which may be geared more toward HTML, but it also does XML).
xml.sax is a builtin SAX parser
I would normally recommend lxml, but since that uses a C-library (libxml) the alternative would have to be, as Aaron has already suggested, ElementTree (as far as I know there is both a pure python and a c implementation of it available).
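A common way to write code that runs both where the C implementation exists and where it does not is an import fallback. This is a sketch under the assumption that your target platforms ship xml.etree.ElementTree at all; the sample XML is invented for illustration.

```python
try:
    # C accelerator module; only importable on some interpreters and versions.
    import xml.etree.cElementTree as ET
except ImportError:
    # Standard module name with the same API.
    import xml.etree.ElementTree as ET

root = ET.fromstring("<config><option name='debug'>on</option></config>")
option = root.find("option")
result = (option.get("name"), option.text)
print(result)
```

Since both modules expose the same API, the rest of the program does not need to know which one was imported.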
Found this via google search
Good luck!