How to pretty print an xml file in Python?

How to pretty print an xml file in Python? - python

I'd like to tidy a complicated xml file, using lxml. The problem is it has many elements which have tail. For example, there's an xml like this:
<body><part>n</part> attend </body>
I want to tidy this into this:
<body>
<part>n</part> attend
</body>
I tried to apply pretty_print with remove_blank_text parser in lxml at first. But it failed.
import lxml.etree as ET
xml_doc = '<body><part>n</part> attend </body>'
parser = ET.XMLParser(remove_blank_text=True)
root = ET.fromstring(xml_doc, parser)
print(ET.tostring(root, pretty_print=True))
>>>b'<body><part>n</part> attend </body>\n'
And then, I tried again without applying the parser to no avail.
import lxml.etree as ET
xml_doc = '<body><part>n</part> attend </body>'
root = ET.fromstring(xml_doc)
print(ET.tostring(root, pretty_print=True))
>>>b'<body><part>n</part> attend </body>\n'

If the pretty_print attribute does not help, you can probably write your own recursive method to do a pretty print. Something on the lines of
def pprint(root, indentTabs = 0):
print "<%s%s>" % (indentTabs*"\t", root.tag)
print (indentTabs+1)*"\t" + root.value
for element in root.children():
pprint (element, indentTabs+1)
print "</%s%s>" % (indentTabs*"\t", root.tag)
Though there might be some already available options. The above method would take care of just tags. You might need to add code to take care of xml attributes as well, if they are present in your xml.
EDIT: The above will print in the format
<tag>
text
</tag>
You can modify it further according to the output you need.

I ran into the same problem and using tounicode() solved it for me.
print(ET.tounicode(root, pretty_print=True))

Related

Python XML parinsg is seeing unexpected tag

I'm trying to parse an XML which is including following code example:
<?xml version="1.0" encoding="UTF-8"?>
<ops:OPS xmlns:ops="http://www.si2.org/schemas/OPS"
opsVersion="OPS_1.1">
<ops:RulesSection>
[...some stuff..]
</ops:RulesSection>
</ops:OPS>
When I'm parsing this file and checking Elements value I'm expecting to find an element called "ops:RulesSection"
from xml.etree import ElementTree as ET
from xml.etree.ElementTree import ElementTree, Elem
tree = ET.parse(OPSDRMFile)
root = tree.getroot()
for elem in root:
print("!!!",elem)
However I am getting following results
!!! <Element '{http://www.si2.org/schemas/OPS}RulesSection' at 0x2b557da50f98>
it seems that "ops:" tag has been replaced by its value from the beginning of the file... can someone explain if this is expected behavior? Is there a way to NOT expand such parameter?
(as of now if I want to perform further parsing I need to transfrom "ops:NewTagToFind" as "{http://www.si2.org/schemas/OPS}NewTagToFind" when I using "findall" command.
Thanks,
Chris

Fetching child tags of a tag with a xmlns namespace [duplicate]

I want to use the method of findall to locate some elements of the source xml file in the ElementTree module.
However, the source xml file (test.xml) has namespaces. I truncate part of xml file as sample:
<?xml version="1.0" encoding="iso-8859-1"?>
<XML_HEADER xmlns="http://www.test.com">
<TYPE>Updates</TYPE>
<DATE>9/26/2012 10:30:34 AM</DATE>
<COPYRIGHT_NOTICE>All Rights Reserved.</COPYRIGHT_NOTICE>
<LICENSE>newlicense.htm</LICENSE>
<DEAL_LEVEL>
<PAID_OFF>N</PAID_OFF>
</DEAL_LEVEL>
</XML_HEADER>
The sample python code is below:
from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Return None
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Return <Element '{http://www.test.com}DEAL_LEVEL/PAID_OFF' at 0xb78b90>
Though using "{http://www.test.com}" works, it's very inconvenient to add a namespace in front of each tag.
How can I ignore the namespace when using functions like find, findall, ...?

Instead of modifying the XML document itself, it's best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:
from io import StringIO # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET
# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
_, _, el.tag = el.tag.rpartition('}') # strip ns
root = it.root
This is based on the discussion here.

If you remove the xmlns attribute from the xml before parsing it then there won't be a namespace prepended to each tag in the tree.
import re
xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)

The answers so far explicitely put the namespace value in the script. For a more generic solution, I would rather extract the namespace from the xml:
import re
def get_namespace(element):
m = re.match('\{.*\}', element.tag)
return m.group(0) if m else ''
And use it in find method:
namespace = get_namespace(tree.getroot())
print tree.find('./{0}parent/{0}version'.format(namespace)).text

Here's an extension to #nonagon answer (which removes namespace from tags) to also remove namespace from attributes:
import io
import xml.etree.ElementTree as ET
# instead of ET.fromstring(xml)
it = ET.iterparse(io.StringIO(xml))
for _, el in it:
if '}' in el.tag:
el.tag = el.tag.split('}', 1)[1] # strip all namespaces
for at in list(el.attrib.keys()): # strip namespaces of attributes too
if '}' in at:
newat = at.split('}', 1)[1]
el.attrib[newat] = el.attrib[at]
del el.attrib[at]
root = it.root
Obviously this is a permanent defacing of the XML but if that's acceptable because there are no non-unique tag names and because you won't be writing the file needing the original namespaces then this can make accessing it a lot easier

Improving on the answer by ericspod:
Instead of changing the parse mode globally we can wrap this in an object supporting the with construct.
from xml.parsers import expat
class DisableXmlNamespaces:
def __enter__(self):
self.old_parser_create = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: self.old_parser_create(encoding, None)
def __exit__(self, type, value, traceback):
expat.ParserCreate = self.oldcreate
This can then be used as follows
import xml.etree.ElementTree as ET
with DisableXmlNamespaces():
tree = ET.parse("test.xml")
The beauty of this way is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries after using the version by ericspod which also happened to use expat.

You can use the elegant string formatting construct as well:
ns='http://www.test.com'
el2 = tree.findall("{%s}DEAL_LEVEL/{%s}PAID_OFF" %(ns,ns))
or, if you're sure that PAID_OFF only appears in one level in tree:
el2 = tree.findall(".//{%s}PAID_OFF" % ns)

In python 3.5 , you can pass the namespace as an argument in find().
For example ,
ns= {'xml_test':'http://www.test.com'}
tree = ET.parse(r"test.xml")
el1 = tree.findall("xml_test:DEAL_LEVEL/xml_test:PAID_OFF",ns)
Documentation link :- https://docs.python.org/3.5/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

I might be late for this but I dont think re.sub is a good solution.
However the rewrite xml.parsers.expat does not work for Python 3.x versions,
The main culprit is the xml/etree/ElementTree.py see bottom of the source code
# Import the C accelerators
try:
# Element is going to be shadowed by the C implementation. We need to keep
# the Python version of it accessible for some "creative" by external code
# (see tests)
_Element_Py = Element
# Element, SubElement, ParseError, TreeBuilder, XMLParser
from _elementtree import *
except ImportError:
pass
Which is kinda sad.
The solution is to get rid of it first.
import _elementtree
try:
del _elementtree.XMLParser
except AttributeError:
# in case deleted twice
pass
else:
from xml.parsers import expat # NOQA: F811
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
Tested on Python 3.6.
Try try statement is useful in case somewhere in your code you reload or import a module twice you get some strange errors like
maximum recursion depth exceeded
AttributeError: XMLParser
btw damn the etree source code looks really messy.

If you're using ElementTree and not cElementTree you can force Expat to ignore namespace processing by replacing ParserCreate():
from xml.parsers import expat
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
ElementTree tries to use Expat by calling ParserCreate() but provides no option to not provide a namespace separator string, the above code will cause it to be ignore but be warned this could break other things.

Let's combine nonagon's answer with mzjn's answer to a related question:
def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
xml_iter = ET.iterparse(xml_path, events=["start-ns"])
xml_namespaces = dict(prefix_namespace_pair for _, prefix_namespace_pair in xml_iter)
return xml_iter.root, xml_namespaces
Using this function we:
Create an iterator to get both namespaces and a parsed tree object.
Iterate over the created iterator to get the namespaces dict that we can
later pass in each find() or findall() call as sugested by
iMom0.
Return the parsed tree's root element object and namespaces.
I think this is the best approach all around as there's no manipulation either of a source XML or resulting parsed xml.etree.ElementTree output whatsoever involved.
I'd like also to credit balmy's answer with providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until that I actually traversed XML tree twice in my application (once to get namespaces, second for a root).

Just by chance dropped into the answer here: XSD conditional type assignment default type confusion?. This is not the exact answer for the topic question but may be applicable if the namespace is not critical.
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="test.xsd">
<person version="1">
<firstname>toto</firstname>
<lastname>tutu</lastname>
</person>
</persons>
Also see: https://www.w3.org/TR/xmlschema-1/#xsi_schemaLocation
Works for me. I call an XML validation procedure in my application. But also I want to quickly see the validation highliting and autocompletion in PyCharm when editing the XML. This noNamespaceSchemaLocation attribute does what I need.
RECHECKED
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml")
el1 = tree.findall("person/firstname")
print(el1[0].text)
el2 = tree.find("person/lastname")
print(el2.text)
Returnrs
>python test.py
toto
tutu

how to parse xml with multiple root element

I need to parse both var & group root elements.
Code
import xml.etree.ElementTree as ET
tree_ownCloud = ET.parse('0020-syslog_rules.xml')
root = tree_ownCloud.getroot()
Error
xml.etree.ElementTree.ParseError: junk after document element: line 17, column 0
Sample XML
<var name="BAD_WORDS">core_dumped|failure|error|attack| bad |illegal |denied|refused|unauthorized|fatal|failed|Segmentation Fault|Corrupted</var>
<group name="syslog,errors,">
<rule id="1001" level="2">
<match>^Couldn't open /etc/securetty</match>
<description>File missing. Root access unrestricted.</description>
<group>pci_dss_10.2.4,gpg13_4.1,</group>
</rule>
<rule id="1002" level="2">
<match>$BAD_WORDS</match>
<options>alert_by_email</options>
<description>Unknown problem somewhere in the system.</description>
<group>gpg13_4.3,</group>
</rule>
</group>
I tried following couple of other questions on stackoverflow here, but none helped.
I know the reason, due to which it is not getting parsed, people have usually tried hacks. IMO it's a very common usecase to have multiple root elements in XML, and something must be there in ET parsing library to get this done.

As mentioned in the comment, an XML file cannot have multiple roots. Simple as that.
If you do receive/store data in this format (and then it's not proper XML). You could consider a hack of surrounding what you have with a fake tag, e.g.
import xml.etree.ElementTree as ET
with open("0020-syslog_rules.xml", "r") as inputFile:
fileContent = inputFile.read()
root = ET.fromstring("<fake>" + fileContent +"</fake>")
print(root)

Actually, the example data is not a well-formed XML document, but it is a well-formed XML entity. Some XML parsers have an option to accept an entity rather than a document, and in XPath 3.1 you can parse this using the parse-xml-fragment() function.
Another way to parse a fragment like this is to create a wrapper document which references it as an external entity:
<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "fragment.xml">
]>
<wrapper>&e;</wrapper>
and then supply this wrapper document as the input to your XML parser.

Get string source of Python xml ElementTree

How would you get the source of an ElementTree as a string in Python?

import xml.etree.ElementTree as ET
tree = ET.parse(source)
root = tree.getroot()
ET.tostring(root)
Note that there may be formatting differences between the content of source and ET.tostring(doc).

I know it is not, but I wonder why it wouldn't be:
tree.tostring(root)
instead of the current way:
xml.etree.ElementTree.tostring(root)

simple output of xml from python

I need to output XML from python in a very minimalist way:
I can't use any external libraries beyond what's already in Python 2.6.5
I need to output XML tags, and text contents, with no attributes
At this point I'm using print statements to explicitly print out angle-bracket tags, and the only thing really stopping me is escaping text within a tag, which I do not know how to do.
Any suggestions?
update: Is there anything like Java's StAX XMLStreamWriter for Python? I may have a large XML document to produce, and I don't need (or want) to hold the entire document in memory.
update #2: I also need to escape random unicode or non-ASCII characters in the text besides <, > and &.

Well, it looks like SAX isn't that hard to use after all. Here's an example.
xmltest.py:
import xml.sax.xmlreader
import xml.sax.saxutils
def testJunk(file, e2content):
attr0 = xml.sax.xmlreader.AttributesImpl({})
x = xml.sax.saxutils.XMLGenerator(file)
x.startDocument()
x.startElement("document", attr0)
x.startElement("element1", attr0)
x.characters("bingo")
x.endElement("element1")
x.startElement("element2", attr0)
x.characters(e2content)
x.endElement("element2")
x.endElement("document")
x.endDocument()
tested:
>>> import xmltest
>>> xmltest.testJunk(open("test.xml","w"), "wham < 3!")
produces:
<?xml version="1.0" encoding="iso-8859-1"?>
<document><element1>bingo</element1><element2>wham < 3!</element2></document>

If task is as simple, minidom may suffice. Here goes short example:
from xml.dom.minidom import Document
# create xml document
document = Document()
# create root element
root = document.createElement("root")
document.appendChild(root)
# create child element
child = document.createElement("child")
child.setAttribute("tag", "test")
root.appendChild(child)
# insert some text
atext = document.createTextNode("Foo bar")
child.appendChild(atext)
# print created xml
print(document.toprettyxml(indent=" "))

ElementTree comes with Python 2.6:
from xml.etree import ElementTree as ET
root = ET.Element('root')
sub = ET.SubElement(root,'sub')
sub.text = 'Hello & Goodbye'
tree = ET.ElementTree(root)
tree.write('out.xml')
# OR
ET.dump(root)
Output
<root><sub>Hello & Goodbye</sub></root>

xml.sax.saxutils.escape(data[, entities]).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to pretty print an xml file in Python? - python

I ran into the same problem and using tounicode() solved it for me. print(ET.tounicode(root, pretty_print=True))

Related

Python XML parinsg is seeing unexpected tag

Fetching child tags of a tag with a xmlns namespace [duplicate]

how to parse xml with multiple root element

Get string source of Python xml ElementTree

simple output of xml from python

Categories

Resources