I'm currently writing a code generator that produces PLCOpen XML files. The schema uses sequences in some places. The code generator uses ElementTree because of its simple interface. However, I can't find a way to make ElementTree respect the sequence order; the children of an Element always seem to be serialized in a fixed, canonical order. Is there any way around this?
The Element.append() method I was using did not seem to keep the elements in the order I need. Using Element.insert(), which takes an explicit index, solved the issue.
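To illustrate, a minimal sketch of using insert() to control child order (the element names here are hypothetical placeholders, not actual PLCOpen tags):

```python
import xml.etree.ElementTree as ET

root = ET.Element("interface")

# insert() places a child at an explicit index, so the serialized
# order matches the schema's required sequence regardless of the
# order in which the children were created
local_vars = ET.Element("localVars")
return_type = ET.Element("returnType")
root.insert(0, local_vars)
root.insert(0, return_type)  # returnType must come first in the sequence

print(ET.tostring(root))
```

The resulting children appear as returnType, then localVars, matching the insertion indices rather than creation order.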
I am currently learning all about XML through Python. Python has an extra method in its DOM implementation (minidom) you can call on any Node: unlink().
Why is it useful and what does it actually do?
I have read the definition, but I still don't understand it, and I don't know why it is recommended in combination with removeChild().
Doesn't removeChild() alone have the same effect: removing the particular node from your DOM?
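Roughly, the difference looks like this (a sketch based on minidom's documented behavior: unlink() breaks the parent/child reference cycles inside a subtree so its memory can be reclaimed promptly, which mattered especially before Python's cyclic garbage collector):

```python
from xml.dom import minidom

doc = minidom.parseString("<root><child>text</child></root>")
child = doc.documentElement.getElementsByTagName("child")[0]

# removeChild() only detaches the node from its parent; the node
# object still holds references to its children (and they back to it)
doc.documentElement.removeChild(child)

# unlink() breaks those internal references so the whole subtree can
# be garbage-collected; the node must not be used again afterwards
child.unlink()
```

So removeChild() changes the tree, while unlink() cleans up the detached node itself.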
I'm a bit confused by some of the design decisions in the Python ElementTree API - they seem kind of arbitrary, so I'd like some clarification to see if these decisions have some logic behind them, or if they're just more or less ad hoc.
So, generally there are two ways you might want to generate an ElementTree - one is via some kind of source stream, like a file or other I/O stream. This is achieved via the parse() function, or the ElementTree.parse() method.
Another way is to load the XML directly from a string object. This can be done via the fromstring() function.
Okay, great. Now, I would think these functions would be basically identical in terms of what they return - the difference between the two of them is just the source of input (one takes a file or stream object, the other a plain string). Except that for some reason the parse() function returns an ElementTree object, while the fromstring() function returns an Element object. The difference is basically that the Element object is the root element of an XML tree, whereas the ElementTree object is sort of a "wrapper" around the root element, which provides some extra features. You can always get the root element from an ElementTree object by calling getroot().
Still, I'm confused why we have this distinction. Why does fromstring() return a root element directly, but parse() returns an ElementTree object? Is there some logic behind this distinction?
A beautiful answer comes from this old discussion:
Just for the record: Fredrik [the creator of ElementTree] doesn't actually consider it a design
"quirk". He argues that it's designed for different use cases. While
parse() parses a file, which normally contains a complete document
(represented in ET as an ElementTree object), fromstring() and
especially the 'literal wrapper' XML() are made for parsing strings,
which (most?) often only contain XML fragments. With a fragment, you
normally want to continue doing things like inserting it into another
tree, so you need the top-level element in almost all cases.
And:
Why isn't et.parse the only way to do this? Why have XML or fromstring
at all?
Well, use cases. XML() is an alias for fromstring(), because it's
convenient (and well readable) to write
section = XML('<section>A to Z</section>')
section.append(paragraphs)
for XML literals in source code. fromstring() is there because when
you want to parse a fragment from a string that you got from whatever
source, it's easy to express that with exactly that function, as in
el = fromstring(some_string)
If you want to parse a document from a file or file-like object, use
parse(). Three use cases, three functions. The fourth use case of
parsing a document from a string does not have its own function,
because it is trivial to write
tree = parse(BytesIO(some_byte_string))
I'm thinking the same as remram in the comments: parse() takes a file location or a file object and preserves that information so that it can provide additional utility, which is really helpful. If parse() did not return an ElementTree object, then you would have to keep track of the sources yourself in order to manually feed them back into the helper functions that ElementTree objects provide by default. In contrast to files, strings, by definition, do not have the same kind of information attached to them, so you can't build the same utilities for them (otherwise there might well be an ET.parsefromstring() method which would return an ElementTree object).
I suspect this is also the logic behind the method being named parse instead of ET.fromfile(): I would expect the same object type to be returned from fromfile and fromstring, but can't say I would expect the same from parse (it's been a long time since I started using ET, so there's no way to verify that, but that's my feeling).
On the subject remram raised of placing utility methods on Elements: as I understand the documentation, Elements are extremely uniform in their implementation. People talk about "root Elements," but the Element at the root of the tree is literally identical to every other Element in terms of its attributes and methods. As far as I know, Elements don't even know who their parent is, which likely supports this uniformity; otherwise there might be more code to implement the "root" Element (which doesn't have a parent) or to re-parent subelements. It seems to me that the simplicity of the Element class works greatly in its favor. So it seems better to leave Elements largely agnostic of anything above them (their parent, the file they come from), so there can't be any snags such as four Elements in the same tree each pointing at a different output file (or the like).
When it comes to implementing the module inside of code, it seems to me that the script would have to recognize the input as a file at some point, one way or another (otherwise it would be trying to pass the file to fromstring). So there shouldn't arise a situation in which the output of parse should be unexpected such that the ElementTree is assumed to be an Element and processed as such (unless, of course, parse was implemented without the programmer checking to see what parse did, which just seems like a poor habit to me).
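The difference in return types is easy to see in a few lines (a minimal sketch):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

data = b"<doc><item/></doc>"

# parse() wraps a document source and returns the "wrapper" object
tree = ET.parse(BytesIO(data))   # ElementTree
root_a = tree.getroot()          # the root Element lives inside it

# fromstring() hands you the top-level element directly
root_b = ET.fromstring(data)     # Element

print(type(tree).__name__, type(root_b).__name__)
```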
I've got an XML file I want to parse with Python. What is the best way to do this? Loading the entire document into memory would be disastrous; I need to somehow read it a single node at a time.
Existing XML solutions I know of:
element tree
minidom
but I'm afraid they aren't quite going to work because of the problem I mentioned. Also, I can't open the file in a text editor - any good tips in general for working with giant text files?
First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually read the whole file into memory.
The problem, of course, is that, whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
for event, elem in ET.iterparse(xmlfile, events=('end',)):  # note the trailing comma: events must be a sequence
...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
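A slightly fuller sketch of that pattern, with a hypothetical in-memory "file" standing in for the real one:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# hypothetical input: many <record> elements under a single root
xmlfile = BytesIO(b"<records>" + b"<record>x</record>" * 1000 + b"</records>")

count = 0
for event, elem in ET.iterparse(xmlfile, events=("end",)):
    if elem.tag == "record":
        count += 1       # "summarize": here we only keep a running count
        elem.clear()     # throw away the element's contents as we go
        # (the root still accumulates empty child references; for truly
        # huge inputs you would periodically clear the root as well)

print(count)
```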
The standard solution is SAX, a callback-based API that lets you process the document a node at a time. Unlike with iterparse, you don't need to worry about clearing out nodes, because no tree is ever built: the nodes simply don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or JavaScript, but they're not too hard to adapt. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
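A minimal Python SAX sketch of the idea (counting elements in a hypothetical document; only the current callback's data is ever in memory):

```python
import xml.sax
from io import BytesIO

class TitleCounter(xml.sax.ContentHandler):
    """Callbacks fire as the parser streams through the document."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "title":
            self.count += 1

handler = TitleCounter()
xml.sax.parse(BytesIO(b"<lib><title/><title/></lib>"), handler)
print(handler.count)
```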
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.
The best solution will depend in part on what you are trying to do, and how free your system resources are. Converting it to a postgresql or similar database might not be a bad first goal; on the other hand, if you just need to pull data out once, it's probably not needed. When I have to parse large XML files, especially when the goal is to process the data for graphs or the like, I usually convert the xml to S-expressions, and then use an S-expression interpreter (implemented in python) to analyse the tags in order and build the tabulated data. Since it can read the file in a line at a time, the length of the file doesn't matter, so long as the resulting tabulated data all fits in memory.
For a while I've been using a package called "gnosis-utils" which provides an XML pickling service for Python. This class works reasonably well; however, it seems to have been neglected by its developer for the last four years.
At the time we originally selected Gnosis, it was the only XML serialization tool for Python. The advantage of Gnosis was that it provided a set of classes whose function was very similar to the built-in Python pickler. It produced XML which Python developers found easy to read, but non-Python developers found confusing.
Now that the project has grown we have a new requirement: we need to be able to exchange XML with our colleagues who prefer Java or .NET. These non-Python developers will not be using Python - they intend to produce the XML directly, hence we need to simplify the format of the XML.
So are there any alternatives to Gnosis. Our requirements:
Must work on Python 2.4 / Windows x86 32bit
Output must be XML, as simple as possible
API must resemble Pickle as closely as possible
Performance is not hugely important
Of course we could simply adapt Gnosis, however we'd prefer to use a component which already provides the functions we require (assuming that it exists).
So what you're looking for is a python library that spits out arbitrary XML for your objects? You don't need to control the format, so you can't be bothered to actually write something that iterates over the relevant properties of your data and generates the XML using one of the existing tools?
This seems like a bad idea. Arbitrary XML serialization doesn't sound like a good way to move forward. Any format that includes all of pickle's features is going to be ugly, verbose, and very nasty to use. It will not be simple. It will not translate well into Java.
What does your data look like?
If you tell us precisely what aspects of pickle you need (and why lxml.objectify doesn't fulfill those), we will be better able to help you.
Have you considered using JSON for your serialization? It's easy to parse, natively supports python-like data structures, and has wide-reaching support. As an added bonus, it doesn't open your code to all kinds of evil exploits the way the native pickle module does.
Honestly, you need to bite the bullet and define a format, and build a serializer using the standard XML tools, if you absolutely must use XML. Consider JSON.
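For comparison, a minimal sketch of what the JSON route looks like for the simple case (flat instance attributes only; nested custom objects would need a default= hook passed to json.dumps):

```python
import json

class Foo(object):
    pass

foo = Foo()
foo.bar = "baz"

# __dict__ exposes the instance attributes as a plain dict,
# which json can serialize directly
text = json.dumps(foo.__dict__)
print(text)  # -> {"bar": "baz"}
```

The result is trivially readable from Java or .NET, which is exactly the interoperability requirement stated above.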
There is xml_marshaller which provides a simple way of dumping arbitrary Python objects to XML:
>>> from xml_marshaller import xml_marshaller
>>> class Foo(object): pass
>>> foo = Foo()
>>> foo.bar = 'baz'
>>> dump_str = xml_marshaller.dumps(foo)
Pretty printing the above with lxml (which is a dependency of xml_marshaller anyway):
>>> from lxml.etree import fromstring, tostring
>>> print tostring(fromstring(dump_str), pretty_print=True)
You get output like this:
<marshal>
  <object id="i2" module="__main__" class="Foo">
    <tuple/>
    <dictionary id="i3">
      <string>bar</string>
      <string>baz</string>
    </dictionary>
  </object>
</marshal>
I did not check for Python 2.4 compatibility since this question was asked long ago, but a solution for dumping arbitrary Python objects to XML remains relevant.
I'd like to store some relatively simple stuff in XML in a cascading manner. The idea is that a build can have a number of parameter sets and the Python scripts creates the necessary build artifacts (*.h etc) by reading these sets and if two sets have the same parameter, the latter one replaces the former.
There are (at least) two differing ways of doing the XML:
First way:
<Variants>
<Variant name="foo" Info="foobar">1</Variant>
</Variants>
Second way:
<Variants>
<Variant>
<Name>Foo</Name>
<Value>1</Value>
<Info>foobar</Info>
</Variant>
</Variants>
Which one is easier to handle in ElementTree? My limited understanding suggests it would be the first one, as I could search for the variant with find() easily and receive the entire subtree, but would it be just as easy with the second style? My colleague says the latter XML is better as it allows expanding the format more easily (and he is obviously right), but I don't see expandability as a major factor at the moment (it might very well be that we will never need it).
EDIT: I could of course use lxml as well, does it matter in this case? Speed really isn't an issue, the files are relatively small.
You're both right, but I would pick #1 where possible, except for the text content:
1 is much more succinct and human-readable, thus less error-prone.
Complete extensibility: YAGNI. YAGNI is not always true but if you're confident that you won't need extensibility, don't sacrifice other benefits for the sake of extensibility.
1 is still pretty extensible. You can always add more attributes or child elements. The only way it isn't extensible is if you later discover you need multiple values for name, or info (or the text content value)... since you can't have multiple attributes with the same name on an element (nor multiple text content nodes without something in between). However you can still extend those by various techniques, e.g. space-separated values in an attribute, or adding child elements as an alternative to an attribute.
I would make the "value" into an attribute or a child element rather than using the text content. If you ever have to add a child element, and you have that text content there, you will end up with mixed content (text as a sibling of an element), which gets messy to process.
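As a sketch of why style #1 stays easy to query (assuming the value is moved into a value attribute as suggested above; ElementTree's find() supports [@attr='...'] predicates):

```python
import xml.etree.ElementTree as ET

xml = '<Variants><Variant name="foo" value="1" Info="foobar"/></Variants>'
root = ET.fromstring(xml)

# one find() call selects the variant by name and returns the element
variant = root.find("Variant[@name='foo']")
print(variant.get("value"))  # -> 1
```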
Update: further reading
A few good articles on the XML elements-vs-attributes debate, including when to use each:
Principles of XML design: When to use elements versus attributes - well-integrated article by Uche Ogbuji
SGML/XML Elements versus Attributes - with many links to other commentary on this old debate
Elements or Attributes?
See also this SO question (but I think the above give more profitable reading).
Remember the critical limitations on XML attributes:
Attribute names must be XML names.
An element can have only one attribute with a given name.
The ordering of attributes is not significant.
In other words, attributes represent key/value pairs. If you can represent it in Python as a dictionary whose keys are XML names and whose values are strings, you can represent it in XML as a set of attributes, no matter what "it" is.
If you can't - if, for instance, ordering is significant, or you need a value to include child elements - then you shouldn't use attributes.
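The key/value analogy maps directly onto ElementTree's API; a minimal sketch (element and attribute names are illustrative):

```python
import xml.etree.ElementTree as ET

# a dict with XML-name keys and string values translates
# straight into an attribute set
params = {"name": "foo", "Info": "foobar"}
el = ET.Element("Variant", attrib=params)
el.set("value", "1")  # the value as an attribute, not text content

print(ET.tostring(el))
```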