How to pretty print xml in python without generating a DOM tree? - python

Generating a DOM tree is too expensive for very large xml data. Is there a method to accomplish the printing without generating it? I am using python-2.7.

Whatever the language, the way to parse an XML document without generating a tree is to use an event-oriented parser. With this kind of parser, you give the parser some event handlers that it calls at specific points of the processing: beginning of a node, end of a node, beginning of data, etc.
So you can use that kind of parser, start a new line each time there is a new node, increase the indentation when you enter a node, and decrease it when you exit a node.
Because of the way these parsers work, it is tricky to look ahead to see, for example, whether a node fits on one line, so the output may not be as pretty as when working with a tree (it can be done, but it would be complicated).
In Python, there are 3 event-driven parsers that come with the standard library (in no particular order):
ElementTree.iterparse()
pyexpat
sax (SAX is a well-known event-driven XML parsing API)
I suggest you have a look at them and experiment with each.
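A minimal sketch of the indentation idea described above, using xml.sax. The class name IndentingHandler, the helper pretty_print and the sample document are made up for illustration, and the code is written for Python 3. It puts every tag and every non-blank piece of text on its own line, which is the "not quite as pretty" output mentioned above.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class IndentingHandler(xml.sax.ContentHandler):
    """Hypothetical handler: writes each tag on its own line, indented by depth."""

    def __init__(self, out, indent="  "):
        xml.sax.ContentHandler.__init__(self)
        self.out = out
        self.indent = indent
        self.depth = 0

    def startElement(self, name, attrs):
        # Rebuild the start tag, re-quoting attribute values safely.
        attr_text = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
        )
        self.out.write("%s<%s%s>\n" % (self.indent * self.depth, name, attr_text))
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        self.out.write("%s</%s>\n" % (self.indent * self.depth, name))

    def characters(self, content):
        # Whitespace-only chunks are dropped; other text gets its own line.
        text = content.strip()
        if text:
            self.out.write("%s%s\n" % (self.indent * self.depth, escape(text)))


def pretty_print(source, out):
    # source may be a file name or a file-like object.
    xml.sax.parse(source, IndentingHandler(out))


sample = b'<root><item id="1">hello</item><empty/></root>'
buffer = io.StringIO()
pretty_print(io.BytesIO(sample), buffer)
print(buffer.getvalue())
```

Only the current depth is kept in memory, so the size of the document does not matter. Note that a parser is allowed to deliver one text node in several characters() calls, so long text may end up split across lines with this sketch.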

Related

Retrieve only a portion of an XML feed

I'm using Scrapy's XMLFeedSpider to parse a big XML feed (60 MB) from a website, and I was wondering if there is a way to retrieve only a portion of it instead of all 60 MB, because right now the RAM consumption is pretty high. Maybe something to put in the link, like:
"http://site/feed.xml?limit=10". I've searched for something similar to this but I haven't found anything.
Another option would be to limit the items parsed by Scrapy, but I don't know how to do that. Right now, once the XMLFeedSpider has parsed the whole document, the bot analyzes only the first ten items, but I suppose that the whole feed is still in memory.
Do you have any idea how to improve the bot's performance by reducing its RAM and CPU consumption? Thanks
When you are processing large XML documents and you don't want to load the whole thing into memory as DOM parsers do, you need to switch to a SAX parser.
SAX parsers have some benefits over DOM-style parsers. A SAX parser
only needs to report each parsing event as it happens, and normally
discards almost all of that information once reported (it does,
however, keep some things, for example a list of all elements that
have not been closed yet, in order to catch later errors such as
end-tags in the wrong order). Thus, the minimum memory required for a
SAX parser is proportional to the maximum depth of the XML file (i.e.,
of the XML tree) and the maximum data involved in a single XML event
(such as the name and attributes of a single start-tag, or the content
of a processing instruction, etc.).
For a 60 MB XML document, this is likely to be very low compared to the requirements for creating a DOM. Most DOM-based systems actually use SAX at a much lower level to build up the tree.
To make use of SAX, subclass xml.sax.saxutils.XMLGenerator and override startElement, endElement and characters. Then call xml.sax.parse with it. I am sorry I don't have a detailed example at hand to share with you, but I am sure you will find plenty online.
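A rough sketch of that approach, with two liberties: it subclasses xml.sax.ContentHandler (the base class of XMLGenerator) because nothing needs to be written back out, and it stops early by raising a custom exception. The element names item and title, the class names and the helper first_titles are assumptions for illustration, not taken from the feed in the question.

```python
import io
import xml.sax


class StopParsing(Exception):
    """Raised by the handler to abandon the rest of the document."""


class FirstItemsHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects <title> text until `limit` items are seen."""

    def __init__(self, limit):
        xml.sax.ContentHandler.__init__(self)
        self.limit = limit
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # Text can arrive in several chunks, so collect and join later.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.titles.append("".join(self._chunks))
        elif name == "item" and len(self.titles) >= self.limit:
            raise StopParsing()


def first_titles(source, limit):
    handler = FirstItemsHandler(limit)
    try:
        xml.sax.parse(source, handler)
    except StopParsing:
        pass  # we have what we need; the rest is never parsed
    return handler.titles


items = "".join("<item><title>Item %d</title></item>" % i for i in range(100))
feed = ("<feed>" + items + "</feed>").encode("utf-8")
print(first_titles(io.BytesIO(feed), 3))
```

This only limits what gets parsed. If the whole response has already been downloaded into memory by the time the handler runs, the download cost is still paid.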
You should set the iterator mode of your XMLFeedSpider to iternodes (see the Scrapy documentation):
It’s recommended to use the iternodes iterator for performance reasons
After doing so, you should be able to iterate over your feed and stop at any point.

Python ElementTree: ElementTree vs root Element

I'm a bit confused by some of the design decisions in the Python ElementTree API - they seem kind of arbitrary, so I'd like some clarification to see if these decisions have some logic behind them, or if they're just more or less ad hoc.
So, generally there are two ways you might want to generate an ElementTree. One is via some kind of source stream, like a file or other I/O stream. This is achieved via the parse() function, or the ElementTree.parse() method.
Another way is to load the XML directly from a string object. This can be done via the fromstring() function.
Okay, great. Now, I would think these functions would basically be identical in terms of what they return - the difference between the two of them is basically the source of input (one takes a file or stream object, the other takes a plain string.) Except for some reason the parse() function returns an ElementTree object, but the fromstring() function returns an Element object. The difference is basically that the Element object is the root element of an XML tree, whereas the ElementTree object is sort of a "wrapper" around the root element, which provides some extra features. You can always get the root element from an ElementTree object by calling getroot().
Still, I'm confused why we have this distinction. Why does fromstring() return a root element directly, but parse() returns an ElementTree object? Is there some logic behind this distinction?
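For reference, a small sketch of the difference being asked about. The sample document and variable names are made up.

```python
import io
import xml.etree.ElementTree as ET

xml_text = "<doc><p>one</p><p>two</p></doc>"

# fromstring() hands back the root Element directly.
root_from_string = ET.fromstring(xml_text)

# parse() takes a file name or file object and returns an ElementTree wrapper.
tree_from_stream = ET.parse(io.BytesIO(xml_text.encode("utf-8")))

print(type(root_from_string).__name__)
print(type(tree_from_stream).__name__)

# getroot() unwraps the tree to the same kind of object fromstring() returns.
print(tree_from_stream.getroot().tag == root_from_string.tag)
```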
A beautiful answer comes from this old discussion:
Just for the record: Fredrik [the creator of ElementTree] doesn't actually consider it a design
"quirk". He argues that it's designed for different use cases. While
parse() parses a file, which normally contains a complete document
(represented in ET as an ElementTree object), fromstring() and
especially the 'literal wrapper' XML() are made for parsing strings,
which (most?) often only contain XML fragments. With a fragment, you
normally want to continue doing things like inserting it into another
tree, so you need the top-level element in almost all cases.
And:
Why isn't et.parse the only way to do this? Why have XML or fromstring
at all?
Well, use cases. XML() is an alias for fromstring(), because it's
convenient (and well readable) to write
section = XML('<section><title>A to Z</title></section>')
section.append(paragraphs)
for XML literals in source code. fromstring() is there because when
you want to parse a fragment from a string that you got from whatever
source, it's easy to express that with exactly that function, as in
el = fromstring(some_string)
If you want to parse a document from a file or file-like object, use
parse(). Three use cases, three functions. The fourth use case of
parsing a document from a string does not have its own function,
because it is trivial to write
tree = parse(BytesIO(some_byte_string))
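The use cases in the quote can be put side by side in a short runnable sketch. The markup and the variable names are invented for illustration.

```python
from io import BytesIO
from xml.etree.ElementTree import XML, fromstring, parse, tostring

# An XML literal in source code: XML() returns the top-level Element.
section = XML("<section><title>A to Z</title></section>")

# A fragment from "whatever source": fromstring() also returns an Element,
# which can be inserted straight into another tree.
paragraph = fromstring("<p>First paragraph</p>")
section.append(paragraph)

# A whole document held in a byte string: wrap it and use parse(),
# which returns an ElementTree.
tree = parse(BytesIO(tostring(section)))
print(tostring(tree.getroot(), encoding="unicode"))
```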
I'm thinking the same as remram in the comments: parse takes a file location or a file object and preserves that information so that it can provide additional utility, which is really helpful. If parse did not return an ET object, you would have to keep better track of the sources yourself in order to feed them back manually into the helper functions that ET objects have by default. In contrast to files, strings, by definition, do not have the same kind of information attached to them, so you can't create the same utilities for them (otherwise there might well be an ET.parsefromstring() method that returned an ET object).
I suspect this is also the logic behind the method being named parse instead of ET.fromfile(): I would expect the same object type to be returned from fromfile and fromstring, but can't say I would expect the same from parse (it's been a long time since I started using ET, so there's no way to verify that, but that's my feeling).
On the subject remram raised of placing utility methods on Elements: as I understand the documentation, Elements are extremely uniform when it comes to implementation. People talk about "root Elements," but the Element at the root of the tree is literally identical to all other Elements in terms of its class attributes and methods. As far as I know, Elements don't even know who their parent is, which is likely there to support this uniformity. Otherwise there might be more code to implement the "root" Element (which doesn't have a parent) or to re-parent subelements. It seems to me that the simplicity of the Element class works greatly in its favor. So it seems better to me to leave Elements largely agnostic of anything above them (their parent, the file they come from), so there can't be any snags concerning, say, 4 Elements with different output files in the same tree (or the like).
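To illustrate the point about parents: in the standard library's ElementTree, an Element has no reference to its parent, so the usual workaround is to build a child-to-parent mapping yourself. A small sketch, with a made-up document and the hypothetical name parent_of:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/></b><d/></a>")

# Elements carry no parent pointer, so map each child to its parent by
# walking the tree once.
parent_of = {child: parent for parent in root.iter() for child in parent}

c = root.find("b/c")
print(parent_of[c].tag)

# The root is an ordinary Element that simply never appears as a child.
print(root in parent_of)
print(type(root) is type(c))
```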
When it comes to implementing the module inside of code, it seems to me that the script would have to recognize the input as a file at some point, one way or another (otherwise it would be trying to pass the file to fromstring). So there shouldn't arise a situation in which the output of parse should be unexpected such that the ElementTree is assumed to be an Element and processed as such (unless, of course, parse was implemented without the programmer checking to see what parse did, which just seems like a poor habit to me).

Parse XML with lxml, and then manipulate it with cElementTree

I have an app which constantly reloads a large amount of XML data from a file, and then performs manipulations, and then writes back to file.
The lxml library is proven much faster for parsing and un-parsing XML, but cElementTree is much faster for certain kinds of manipulation. Both have an almost identical API.
How can I parse an XML file with lxml, and then manipulate it with cElementTree?
This is what I've tried, but the objects produced by the lxml parse methods inherently use lxml's own manipulation methods.
import xml.etree.cElementTree as ET
from lxml import etree as lxmlET
This question is perhaps the Python equivalent of "My friend has a fast car and I just have a clunker. How can I make my car go as fast as hers?"
I'm not saying this couldn't be done, but I would call such an enterprise either ambitious or foolhardy, depending on your level of programming skill. The point is that each system has, as you have discovered, its own internal representation of the parsed XML.
While it might be possible to write code to take the parsed object produced by lxml and re-create or wrap it as ElementTree elements, it's probably going to a) take as long as parsing with ElementTree in the first place, and b) be a maintenance nightmare.
So do yourself a favor and choose one technology then stick with it (at least for each individual program).
I would also point out that XML was intended primarily as a data interchange language. The fact that you seem to be using it as a structured data repository inevitably introduces large inefficiencies in the processing, particularly as data volumes go up. Might it be better to choose some more amenable representation and then only convert it to XML for output and usage by other systems?

python xml parser minidom memory usage

I recently used minidom to parse some XML files.
The funny thing is that it takes 8 GB of memory to read a 56 MB file, which is relatively flat, i.e., most of the nodes are at the same level.
Why is this true?
You are not the only one to face this.
In my humble opinion, the main reason minidom-based programs consume a lot of memory is that many minidom functions are implemented recursively (recursion and memory usage are generally not good friends).
I advise you to opt for another Python XML parser library that is faster (notably lxml).

Parsing a large (~40GB) XML text file in python

I've got an XML file I want to parse with Python. What is the best way to do this? Loading the entire document into memory would be disastrous, so I need to somehow read it a single node at a time.
Existing XML solutions I know of:
element tree
minixml
but I'm afraid they aren't quite going to work because of the problem I mentioned. Also, I can't open it in a text editor. Any good tips in general for working with giant text files?
First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually read the whole file into memory.
The problem, of course, is that, whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
for event, elem in ET.iterparse(xmlfile, events=('end',)):
    ...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
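A minimal sketch of that trivial case, where each record can be handled and then thrown away. The function name count_records, the tag names and the in-memory sample are assumptions for illustration; with a real file you would pass the file name instead of a BytesIO.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag):
    """Stream through `source`, discarding each matching element once seen."""
    count = 0
    # events must be a sequence of event names, hence the one-element tuple.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's children, text and attributes so the tree
            # built so far stays small.
            elem.clear()
    return count


data = ("<log>" + "<record><msg>x</msg></record>" * 1000 + "</log>").encode("utf-8")
print(count_records(io.BytesIO(data), "record"))
```

The cleared elements themselves stay attached to the root as empty shells. For a file with a very large number of records you would probably also want to remove them from the root as you go.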
The standard solution is SAX, which is a callback-based API that lets you operate on the tree a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or Javascript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.
The best solution will depend in part on what you are trying to do, and how free your system resources are. Converting it to a postgresql or similar database might not be a bad first goal; on the other hand, if you just need to pull data out once, it's probably not needed. When I have to parse large XML files, especially when the goal is to process the data for graphs or the like, I usually convert the xml to S-expressions, and then use an S-expression interpreter (implemented in python) to analyse the tags in order and build the tabulated data. Since it can read the file in a line at a time, the length of the file doesn't matter, so long as the resulting tabulated data all fits in memory.
