I'm a bit confused by some of the design decisions in the Python ElementTree API - they seem kind of arbitrary, so I'd like some clarification to see if these decisions have some logic behind them, or if they're just more or less ad hoc.
So, generally there are two ways you might want to generate an ElementTree - one is via some kind of source stream, like a file or other I/O stream. This is achieved via the parse() function, or the ElementTree.parse() method.
Another way is to load the XML directly from a string object. This can be done via the fromstring() function.
Okay, great. Now, I would think these functions would be basically identical in terms of what they return - the difference between the two is just the source of input (one takes a file or stream object, the other takes a plain string). Except for some reason the parse() function returns an ElementTree object, while the fromstring() function returns an Element object. The difference is basically that the Element object is the root element of an XML tree, whereas the ElementTree object is a sort of "wrapper" around the root element which provides some extra features. You can always get the root element from an ElementTree object by calling getroot().
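To make the asymmetry concrete, here is a minimal sketch (the sample XML string is made up purely for illustration):

import io
import xml.etree.ElementTree as ET

xml_text = '<root><child>hello</child></root>'

tree = ET.parse(io.StringIO(xml_text))   # parse() gives back an ElementTree instance
root = tree.getroot()                    # ...whose getroot() returns the root Element
element = ET.fromstring(xml_text)        # fromstring() returns the root Element directly, no wrapper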
Still, I'm confused why we have this distinction. Why does fromstring() return a root element directly, but parse() returns an ElementTree object? Is there some logic behind this distinction?
A beautiful answer comes from this old discussion:
Just for the record: Fredrik [the creator of ElementTree] doesn't actually consider it a design
"quirk". He argues that it's designed for different use cases. While
parse() parses a file, which normally contains a complete document
(represented in ET as an ElementTree object), fromstring() and
especially the 'literal wrapper' XML() are made for parsing strings,
which (most?) often only contain XML fragments. With a fragment, you
normally want to continue doing things like inserting it into another
tree, so you need the top-level element in almost all cases.
And:
Why isn't et.parse the only way to do this? Why have XML or fromstring
at all?
Well, use cases. XML() is an alias for fromstring(), because it's
convenient (and well readable) to write
section = XML('<section>A to Z</section>')
section.append(paragraphs)
for XML literals in source code. fromstring() is there because when
you want to parse a fragment from a string that you got from whatever
source, it's easy to express that with exactly that function, as in
el = fromstring(some_string)
If you want to parse a document from a file or file-like object, use
parse(). Three use cases, three functions. The fourth use case of
parsing a document from a string does not have its own function,
because it is trivial to write
tree = parse(BytesIO(some_byte_string))
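For completeness, a runnable version of that fourth use case might look roughly like this (the byte string is just a placeholder document):

import io
import xml.etree.ElementTree as ET

some_byte_string = b'<doc><item>1</item></doc>'   # placeholder document
tree = ET.parse(io.BytesIO(some_byte_string))     # whole document as an ElementTree
root = tree.getroot()                             # top-level Element when you need it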
I'm thinking the same as remram in the comments: parse() takes a file location or a file object and preserves that information so that it can provide additional utility, which is really helpful. If parse() did not return an ElementTree object, then you would have to keep track of the sources yourself in order to manually feed them back into the helper functions that ElementTree objects have by default. In contrast to files, strings, by definition, do not have the same kind of information attached to them, so you can't build the same utilities for them (otherwise there very well may have been an ET.parsefromstring() method which would return an ElementTree object).
I suspect this is also the logic behind the method being named parse instead of ET.fromfile(): I would expect the same object type to be returned from fromfile and fromstring, but can't say I would expect the same from parse (it's been a long time since I started using ET, so there's no way to verify that, but that's my feeling).
On the subject Remram raised of placing utility methods on Elements, as I understand the documentation, Elements are extremely uniform when it comes to implementation. People talk about "root elements," but the Element at the root of the tree is literally identical to all other Elements in terms of its class attributes and methods. As far as I know, Elements don't even know who their parent is, which likely supports this uniformity; otherwise there might be more code to implement the "root" Element (which doesn't have a parent) or to re-parent subelements. It seems to me that the simplicity of the Element class works greatly in its favor. So it seems better to leave Elements largely agnostic of anything above them (their parent, the file they came from) so there can't be any snags concerning, say, 4 Elements with different output files in the same tree (or the like).
When it comes to implementing the module inside of code, it seems to me that the script would have to recognize the input as a file at some point, one way or another (otherwise it would be trying to pass the file to fromstring). So there shouldn't arise a situation in which the output of parse should be unexpected such that the ElementTree is assumed to be an Element and processed as such (unless, of course, parse was implemented without the programmer checking to see what parse did, which just seems like a poor habit to me).
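One concrete example of the extra utility an ElementTree carries is serialization back to a file: the tree can write itself out, while a bare Element has to be wrapped first. A rough sketch (the file names here are hypothetical):

import xml.etree.ElementTree as ET

tree = ET.parse('input.xml')                     # hypothetical input file
tree.write('output.xml')                         # an ElementTree can write itself back out

fragment = ET.fromstring('<note>hi</note>')
ET.ElementTree(fragment).write('fragment.xml')   # a bare Element has to be wrapped first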
Related
I'm currently writing a code generator that produces PLCOpen XML files. The Schema uses sequences in some places. The code generator uses ElementTree because of its simple interface. However, I can't find a way to make ElementTree respect the sequence; in fact, the children of an Element are always printed in canonical order. Is there any way around this?
The Element.append method that I used does not seem to keep the elements in the order I need. Using Element.insert, however, solved the issue.
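A rough sketch of the difference, assuming you know which position each child should occupy:

import xml.etree.ElementTree as ET

parent = ET.Element('parent')
parent.append(ET.Element('b'))       # append() always adds at the end, in call order
parent.insert(0, ET.Element('a'))    # insert() places the child at an explicit index
print(ET.tostring(parent))           # b'<parent><a /><b /></parent>'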
I'm developing a GUI application in Python that stores its documents in an XML-based format. The application is a mathematical model with several pre-defined components which can be drag-and-dropped. I'd also like the user to be able to create custom components by writing a Python function inside an editor provided within the application. My issue is with storing these functions in the XML.
A function might look something like this:
def func(node, timestamp):
    return node.weight * timestamp.day + 4
These functions are wrapped in an object which provides a standard way of calling them (compared to the pre-defined components). If I were to create one from Python directly it would look like this:
parameter = ParameterFunction(func)
The function is then called by the model like this:
parameter.value(node=node, timestamp=timestamp)
The ParameterFunction object has to_xml and from_xml methods which need to serialise/deserialise the object to/from an XML representation.
My question is: how do I store the Python functions in an XML document?
One solution I have thought of so far is to store the function definition as a string, eval() or exec() it for use, but keep the string around and store it in a CDATA block in the XML. Are there any issues with this that I'm not seeing?
An alternative would be to store all of the Python code in a separate file, and have the XML reference just the function names. This could be nice as the code could be edited easily in an external editor. In that case, what is the best way to import the code? I am envisaging fighting with the Python import path...
I'm aware there will be security concerns with running untrusted code, but I'm willing to make this tradeoff for the freedom it gives users.
The specific application I'm referring to is on github. I'm happy to provide more information if it's needed, but I've tried to keep it fairly generic here. https://github.com/snorfalorpagus/pywr/blob/120928eaacb9206701ceb9bc91a5d73740db1953/pywr/core.py#L396-L402
Nope, you have the easiest and best solution that I can think of. Just keep them as strings, as long as you're not worried about running the untrusted code.
The way I'd deal with external python scripts containing tiny snippets like yours would be to treat them as plain text files and read them in as strings. This avoids all the problems with importing them. Just read them in and call exec on them, then the functions will exist in scope.
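A small sketch of that approach; the source string here just mirrors the function from the question, but in practice it would be read from the snippet file (or pulled out of the XML) as plain text:

# Define the function in an isolated namespace instead of polluting globals.
source = "def func(node, timestamp):\n    return node.weight * timestamp.day + 4"

namespace = {}
exec(source, namespace)      # after this, 'func' exists inside 'namespace'
func = namespace['func']     # ready to hand to ParameterFunction(func), as in the question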
EDIT: I was going to add something on sandboxing Python code, but after a bit of research it seems this will not be an easy task; it would be easier to sandbox the entire program. Another longer and harder way to restrict the untrusted code would be to create your own tiny interpreter that only does safe operations (i.e. mathematical operations, calling existing functions, etc.).
I've got an XML file I want to parse with Python. What is the best way to do this? Reading the entire document into memory would be disastrous, so I need to somehow read it a single node at a time.
Existing XML solutions I know of:
element tree
minixml
but I'm afraid they aren't quite going to work because of the problem I mentioned. Also, I can't open it in a text editor - any good tips in general for working with giant text files?
First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually read the whole file into memory.
The problem, of course, is that, whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
for event, elem in ET.iterparse(xmlfile, events=('end',)):
    ...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
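A slightly fuller sketch of the clear-as-you-go pattern; the element names ('record', 'amount') are invented, so adjust them to your document:

import xml.etree.ElementTree as ET

def total_amounts(xmlfile):
    # Stream the document, handle each <record> as soon as it finishes parsing,
    # then drop its contents so memory stays bounded.
    total = 0.0
    for event, elem in ET.iterparse(xmlfile, events=('end',)):
        if elem.tag == 'record':
            total += float(elem.findtext('amount', default='0'))
            elem.clear()   # discard the children we no longer need
    return total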
The standard solution is SAX, which is a callback-based API that lets you operate on the tree a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or Javascript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
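For what it's worth, a minimal xml.sax handler in Python might look roughly like this (the tag and file names are invented):

import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    # Counts <record> elements without ever holding the tree in memory.
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == 'record':
            self.count += 1

handler = RecordCounter()
# xml.sax.parse('huge.xml', handler)   # hypothetical file name
# print(handler.count)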
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.
The best solution will depend in part on what you are trying to do, and how free your system resources are. Converting it to a postgresql or similar database might not be a bad first goal; on the other hand, if you just need to pull data out once, it's probably not needed. When I have to parse large XML files, especially when the goal is to process the data for graphs or the like, I usually convert the xml to S-expressions, and then use an S-expression interpreter (implemented in python) to analyse the tags in order and build the tabulated data. Since it can read the file in a line at a time, the length of the file doesn't matter, so long as the resulting tabulated data all fits in memory.
I'd like to store some relatively simple stuff in XML in a cascading manner. The idea is that a build can have a number of parameter sets and the Python scripts creates the necessary build artifacts (*.h etc) by reading these sets and if two sets have the same parameter, the latter one replaces the former.
There are (at least) two differing ways of doing the XML:
First way:
<Variants>
<Variant name="foo" Info="foobar">1</Variant>
</Variants>
Second way:
<Variants>
<Variant>
<Name>Foo</Name>
<Value>1</Value>
<Info>foobar</Info>
</Variant>
</Variants>
Which one is easier to handle in ElementTree? My limited understanding suggests it would be the first one, as I could search for the variant with find() easily and receive the entire subtree, but would it be just as easy with the second style? My colleague says the latter XML is better as it allows expanding the XML more easily (and he is obviously right), but I don't see expandability as a major factor at the moment (it might very well be that we will never need it).
EDIT: I could of course use lxml as well, does it matter in this case? Speed really isn't an issue, the files are relatively small.
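For reference, querying the two layouts with plain ElementTree looks roughly like this (only a sketch, reusing the snippets above):

import xml.etree.ElementTree as ET

first = ET.fromstring('<Variants><Variant name="foo" Info="foobar">1</Variant></Variants>')
variant = first.find('Variant[@name="foo"]')        # one find() with an attribute predicate
value = variant.text                                 # '1'

second = ET.fromstring('<Variants><Variant><Name>Foo</Name><Value>1</Value>'
                       '<Info>foobar</Info></Variant></Variants>')
variant = next(v for v in second.findall('Variant') if v.findtext('Name') == 'Foo')
value = variant.findtext('Value')                    # '1'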
You're both right, but I would pick #1 where possible, except for the text content:
1 is much more succinct and human-readable, thus less error-prone.
Complete extensibility: YAGNI. YAGNI is not always true but if you're confident that you won't need extensibility, don't sacrifice other benefits for the sake of extensibility.
1 is still pretty extensible. You can always add more attributes or child elements. The only way it isn't extensible is if you later discover you need multiple values for name, or info (or the text content value)... since you can't have multiple attributes with the same name on an element (nor multiple text content nodes without something in between). However you can still extend those by various techniques, e.g. space-separated values in an attribute, or adding child elements as an alternative to an attribute.
I would make the "value" into an attribute or a child element rather than using the text content. If you ever have to add a child element, and you have that text content there, you will end up with mixed content (text as a sibling of an element), which gets messy to process.
Update: further reading
A few good articles on the XML elements-vs-attributes debate, including when to use each:
Principles of XML design: When to use elements versus attributes - well-integrated article by Uche Ogbuji
SGML/XML Elements versus Attributes - with many links to other commentary on this old debate
Elements or Attributes?
See also this SO question (but I think the above give more profitable reading).
Remember the critical limitations on XML attributes:
Attribute names must be XML names.
An element can have only one attribute with a given name.
The ordering of attributes is not significant.
In other words, attributes represent key/value pairs. If you can represent it in Python as a dictionary whose keys are XML names and whose values are strings, you can represent it in XML as a set of attributes, no matter what "it" is.
If you can't - if, for instance, ordering is significant, or you need a value to include child elements - then you shouldn't use attributes.
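In ElementTree terms that key/value view is quite literal: an element's attributes are exposed as a plain dict. A small sketch, reusing the Variant example from the question:

import xml.etree.ElementTree as ET

variant = ET.Element('Variant', attrib={'name': 'foo', 'Info': 'foobar'})
variant.set('Value', '1')     # set() adds or replaces a single attribute
print(variant.attrib)         # a plain dict of name/value string pairs
print(ET.tostring(variant))   # serialized back as attributes on the element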
I'm a C programmer and I'm getting quite good with Python. But I still have some problems getting my mind around the OO awesomeness of Python.
Here is my current design problem:
The end "product" is a JSON data structure created in Python (and passed to Javascript code) containing different types of data like:
{ type: url, {urlpayloaddict} }
{ type: text, {textpayloaddict} }
...
My Javascript knows how to parse and display each type of JSON response.
I'm happy with this design. My question comes from handling this data in the Python code.
I obtain my data from a variety of sources: MySQL, a table lookup, an API call to a web service...
Basically, should I make a super class responseElement and specialise it for each type of response, then pass around a list of these objects in the Python code OR should I simply pass around a list of dictionaries that contain the response data in key value pairs. The answer seems to result in significantly different implementations.
I'm a bit unsure if I'm getting too object happy ??
In my mind, it basically goes like this: you should try to keep things the same where they are the same, and separate them where they're different.
If you're performing the exact same operations on and with the data, and it can all be represented in a common format, then there's no reason to have separate objects for it - translate it into a common format ASAP and Don't Repeat Yourself when it comes to implementing things that don't distinguish.
If each type/source of data requires specialized operations specific to it, and there isn't much in the way of overlap between such at the layer your Python code is dealing with, then keep things in separate objects so that you maintain a tight association between the specialized code and the specific data on which it is able to operate.
Do the different response sources represent fundamentally different categories or classes of objects? They don't appear to, the way you've described it.
Thus, various encode/decode functions and passing around only one type seems the best solution for you.
That type can be a dict or your own class, if you have special methods to use on the data (but those methods would then not care what input and output encodings were), or you could put the encode/decode pairs into the class. (Decode would be a classmethod, returning a new instance.)
Your receiver objects (which can perfectly well be instances of different classes, perhaps generated by a Factory pattern depending on the source of incoming data) should all have a common method that returns the appropriate dict (or other directly-JSON'able structure, such as a list that will turn into a JSON array).
Differently from what one answer states, this approach clearly doesn't require higher level code to know what exact kind of receiver it's dealing with (polymorphism will handle that for you in any OO language!) -- nor does the higher level code need to know "names of keys" (as, again, that other answer peculiarly states), as it can perfectly well treat the "JSON'able data" as a pretty opaque data token (as long as it's suitable to be the argument for a json.dumps later call!).
Building up and passing around a container of "plain old data" objects (produced and added to the container in various ways) for eventual serialization (or other such uniform treatment, but you can see JSON translation as a specific form of serialization) is a common OO pattern. No need to carry around anything richer or heavier than such POD data, after all, and in Python using dicts as the PODs is often a perfectly natural implementation choice.
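A rough sketch of that pattern; the class names (UrlResult, TextResult) and the common method name are invented stand-ins for your sources:

import json

class UrlResult:
    def __init__(self, url):
        self.url = url

    def json_data(self):
        # Each source knows how to render itself as a plain, JSON-able dict.
        return {'type': 'url', 'data': {'url': self.url}}

class TextResult:
    def __init__(self, text):
        self.text = text

    def json_data(self):
        return {'type': 'text', 'data': {'text': self.text}}

results = [UrlResult('http://example.com'), TextResult('hello')]
payload = json.dumps([r.json_data() for r in results])   # higher-level code never inspects the dicts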
I've had success with the OOP approach. Consider a base class with a "ToJson" method and have each subclass implement it appropriately. Then your higher level code doesn't need to know any detail about how the data was obtained...it just knows it has to call "ToJson" on every object in the list you mentioned.
A dictionary would work too, but it requires your calling code to know names of keys, etc and won't scale as well.
OOP I say!
Personally, I opt for the latter (passing around a list of data) wherever and whenever possible. I think OO is often misused/abused for certain things. I specifically avoid things like wrapping data in an object just for the sake of wrapping it in an object. So this, {'type':'url', 'data':{some_other_dict}} is better to me than:
class DataObject:
    def __init__(self):
        self.type = 'url'
        self.data = {some_other_dict}
But, if you need to add specific functionality to this data, like the ability for it to sort its data.keys() and return them as a set, then creating an object makes more sense.
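For example, a thin wrapper only earns its keep once behaviour like that is attached (names invented for illustration):

class UrlResponse:
    def __init__(self, data):
        self.type = 'url'
        self.data = data

    def sorted_keys(self):
        # Behaviour beyond plain storage is what justifies the class.
        return sorted(self.data.keys())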