creating xml documents with whitespace with xml.etree.cElementTree - python

I'm working on a project to store various bits of text in XML files, but because people besides me are going to look at it and use it, it has to be properly indented and such. I looked at a question on how to generate XML files using cElementTree here, and the guy says something about putting in info about making things pretty if people ask, but there isn't anything there (I guess because no one asked). So basically, is there a way to properly indent and add whitespace using cElementTree, or should I just throw up my hands and go learn how to use lxml?

You can use minidom to prettify your XML string:
from xml.etree import ElementTree as ET
from xml.dom import minidom
# Return a pretty-printed XML string for the Element.
def prettify(elem):
    INDENT = " "
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=INDENT)
# name of root tag
root = ET.Element("root")
child = ET.SubElement(root, 'child')
child.text = 'This is text of child'
prettified_xmlStr = prettify(root)
# the with block closes the file even if the write fails
with open("Output.xml", "w") as output_file:
    output_file.write(prettified_xmlStr)
print("Done!")

Answering myself here:
Not with ElementTree. The best option would be to download and install the lxml module, then simply pass the option
pretty_print=True
when generating new XML files.
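For readers on Python 3.9 or later, the standard library has since gained an indent() helper in xml.etree.ElementTree, so lxml is no longer the only route. The following is a minimal sketch that reuses the root/child example from the answer above; the two-space indent is an arbitrary choice for illustration.

```python
import xml.etree.ElementTree as ET

# Same toy tree as in the minidom answer.
root = ET.Element("root")
child = ET.SubElement(root, "child")
child.text = "This is text of child"

# Requires Python 3.9+. indent() edits the tree in place by adding
# whitespace to the .text and .tail of the elements.
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Because indent() modifies the tree itself, it is probably best called just before serializing rather than on a tree that will be processed further.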

Related

Python XML ElementTree not reading node with &

I have an XML, with one of the nodes having '&' within a string:
<uid>JAMES&001</uid>
now, when I try to read the whole xml using the following code:
tree = et.parse(fileName)
root = tree.getroot()
ids = root.findall("uid")
I get the error on the line of the above-mentioned node:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 17, column 21
The code works fine on other instances where there is no '&'. I guess it's breaking the string.
Can it be fixed with encoding? How? I searched through other questions but couldn't find an answer.
TIA
You need to sanitize your XML first, since it isn't well formed.
You need to replace the offending & with its entity, something like .replace("&", "&amp;")
One way to use it:
with open(fileName, 'r') as f:
    read_data = f.read()
doc = ET.fromstring(read_data.replace("&", "&amp;"))
print(doc.find('./uid').text)
Output, given your sample, should be
JAMES&001
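One caveat with a blanket replace is that it would also touch ampersands that are already part of a valid entity, turning &amp; into &amp;amp;. A rough sketch of a more careful variant is below. The helper name escape_bare_ampersands and the sample document are made up for illustration, and the pattern only recognizes the general shape of an entity, so an undefined named entity would still be left alone and could still fail to parse.

```python
import re
import xml.etree.ElementTree as ET

# Matches "&" unless it already begins something shaped like an entity
# reference: &name; or &#123; or &#x1F;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(xml_text):
    """Escape only the ampersands that are not part of an entity reference."""
    return BARE_AMP.sub("&amp;", xml_text)

# Hypothetical input mixing a bare "&" with an already-escaped one.
raw = "<root><uid>JAMES&001</uid><note>a &amp; b</note></root>"
doc = ET.fromstring(escape_bare_ampersands(raw))
uid_text = doc.find("./uid").text
note_text = doc.find("./note").text
print(uid_text, "|", note_text)
```

If you control whatever produces the file, fixing the escaping at the source is the more robust option; this kind of repair is a workaround.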

How to add comment after XML declaration using python

import xml.etree.ElementTree as ET
def addCommentInXml():
    fileXml = 'C:\\Users\\Documents\\config.xml'
    tree = ET.parse(fileXml)
    root = tree.getroot()
    comment = ET.Comment('TEST')
    root.insert(1, comment)  # 1 is the index where comment is inserted
    tree.write(fileXml, encoding='UTF-8', xml_declaration=True)
    print("Done")
It updates the XML as shown below. Please suggest how to add the comment right after the XML declaration line:
<?xml version='1.0' encoding='UTF-8'?>
<ScopeConfig Checksum="5846AFCF4E5D02786">
<ExecutableName>STU</ExecutableName>
<!--TEST--><ZoomT2Encoder>-2230</ZoomT2Encoder>
The ElementTree XML API does not allow this. The documentation for the Comment factory function explicitly states:
An ElementTree will only contain comment nodes if they have been
inserted into to the tree using one of the Element methods.
but you would like to insert a comment outside the tree. The documentation for the TreeBuilder class is even more explicit:
When insert_comments and/or insert_pis is true, comments/pis will be
inserted into the tree if they appear within the root element (but not
outside of it)
So I would suggest writing out the XML file without the comment, using this API, and then reading the file as plain text (not parsed XML) to add your comment after the first line.
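That two-step approach might look roughly like the sketch below. The function name write_with_leading_comment and the temporary file are assumptions made for the demonstration. It relies on tree.write() putting the declaration on a line of its own when xml_declaration=True, which matches the output shown in the question, and it assumes the comment text contains no "--" (which is not allowed inside an XML comment).

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def write_with_leading_comment(tree, path, comment_text):
    """Write the tree, then splice a comment in right after the declaration."""
    # Step 1: serialize normally; the declaration lands alone on line 1.
    tree.write(path, encoding="UTF-8", xml_declaration=True)
    # Step 2: reopen as plain text and insert the comment as line 2.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    lines.insert(1, "<!--" + comment_text + "-->\n")
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)

# Demonstration on a throwaway file with a cut-down version of the config.
tree = ET.ElementTree(
    ET.fromstring("<ScopeConfig><ExecutableName>STU</ExecutableName></ScopeConfig>")
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.xml")
    write_with_leading_comment(tree, demo_path, "TEST")
    with open(demo_path, encoding="utf-8") as f:
        result_lines = f.read().splitlines()
print("\n".join(result_lines))
```

Note that a later ET.parse() followed by tree.write() on the same file would drop the comment again, since it sits outside the root element.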

Python XML parser renames namespace variables

I have been using xml.etree.ElementTree to parse a Word XML document. After making my changes I use tree.write('test.xml') to write the tree to a file. Once the XML is saved, Word was unable to read the file. Looking at the XML, it appears that the new XML has all of the namespaces renamed.
For example, w:t became ns2:t
import xml.etree.ElementTree as ET
import re
tree = ET.parse('FL0809spec2.xml')
root = tree.getroot()
l = [' ',' ']
prev = None
count = 0
for t in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
    l[0] = l[1]
    l[1] = t.text
    # join a word that was split across two adjacent w:t runs
    if (l[0] != '' and l[1] != '' and re.search(r'[a-zA-Z]', l[0][len(l[0]) - 1]) and re.search(r'[a-z]', l[1][0])):
        words = re.findall(r'(\b\w+\b)(\W+)', l[1])
        if len(words) > 0:
            prev.text = prev.text + words[0][0]
            t.text = t.text[len(words[0][0]):]
            count += 1
    prev = t
tree.write('FL0809spec2Improved.xml')
It appears that:
a) Python built-in xml.etree.ElementTree is not idempotent (transparent) - if you read an XML file and then immediately write out the xml, the output is different from the input. The namespace prefixes are changed, for example. Also the initial ?xml and ?mso tags are removed. There may be other differences. The removal of the two initial tags doesn't seem to matter, so it's something about the rest of the XML that Word doesn't like.
and b) MS Word expects the namespaces to be written with exactly the same prefixes as the xml files it generates - IMO this is very poor (if not appalling) style because in pure XML terms it is the namespace URI that defines the namespace, not the prefix used to reference it, but hey ho that's the way it seems to work.
As long as you don't mind installing lxml, solving your problem is very easy. Happily, lxml.etree.ElementTree appears to be a lot more determined than xml.etree.ElementTree about not changing anything when writing what it has read: at least it maintains the prefixes that were read in, and those first two tags are written too.
So to use lxml:
Install lxml with pip:
pip install lxml
Change the first line of your code from:
import xml.etree.ElementTree as ET
to:
from lxml import etree as ET
Then (in my testing of your code with the changey bits between reading and writing the xml removed) the output document can be opened without error in MS Word :-)
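If installing lxml is not an option, the standard library can be told which prefix to use for a given namespace URI through ET.register_namespace(). The snippet below is only a sketch on a tiny hand-written fragment, not a real Word document. A real file uses many namespaces, each of which would need registering, and this does nothing about the two initial tags that xml.etree drops.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Registration is global: it affects every later serialization in this process.
ET.register_namespace("w", W_NS)

# Hand-written stand-in for a fragment of a Word document.
src = '<w:document xmlns:w="%s"><w:t>hello</w:t></w:document>' % W_NS
root = ET.fromstring(src)
out = ET.tostring(root, encoding="unicode")
print(out)
```

Without the register_namespace() call, the same round trip would emit an auto-generated prefix such as ns0 instead of w.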

lxml parsing with python: how to with objectify

I am trying to read the XML behind an SPSS file, and I would like to move from etree to objectify.
How can I convert the function below so that it returns an objectify object? I would like to do this because an objectify XML object would be easier for me (as a newbie) to work with, as it is more Pythonic.
def get_etree(path_file):
    from io import StringIO
    from lxml import etree
    with open(path_file, 'r+') as f:
        xml_text = f.read()
    recovering_parser = etree.XMLParser(recover=True)
    xml = etree.parse(StringIO(xml_text), parser=recovering_parser)
    return xml
my failed attempt:
def get_etree(path_file):
    from lxml import etree, objectify
    with open(path_file, 'r+') as f:
        xml_text = objectify.fromstring(xml)
    return xml
but I get this error:
lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI
The first, biggest mistake is to read a file into a string and feed that string to an XML parser.
Python will read the file using whatever your default file encoding is (unless you specify the encoding when you call open()), and that step will very likely break anything other than plain ASCII files.
XML files come in many encodings, you cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.
<?xml version="1.0" encoding="Windows-1252"?>
An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
Luckily lxml makes it very easy:
from lxml import etree, objectify
def get_etree(path_file):
    return etree.parse(path_file, parser=etree.XMLParser(recover=True))
def get_objectify(path_file):
    return objectify.parse(path_file)
and
path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)
print(xml1) # -> <lxml.etree._ElementTree object at 0x02A7B918>
print(xml2) # -> <lxml.etree._ElementTree object at 0x02A7B878>
P.S.: Think hard if you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway or would you much rather reject it and display an error message?
I would do the latter. Using a recovering parser may cause nasty run-time errors later.
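The point about letting the parser honor the encoding declaration can be tried out with the standard library alone, without lxml. In this sketch the sample document and the temporary file are invented for the demonstration: the file declares Windows-1252 and contains one non-ASCII character, and the parser is simply handed the path.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Invented sample: declares Windows-1252 and contains an e-acute.
raw = '<?xml version="1.0" encoding="Windows-1252"?><name>Caf\u00e9</name>'.encode("cp1252")

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Handing the path to the parser lets it read the declaration itself.
    parsed_text = ET.parse(path).getroot().text
finally:
    os.remove(path)

# Guessing the encoding by hand goes wrong as soon as the guess is off.
try:
    raw.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(parsed_text, wrong_guess_failed)
```

The same idea carries over to lxml: passing the path (or a file opened in binary mode) keeps the decoding decision with the parser.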

Python Regex - Parsing HTML

I have this little code and it's giving me AttributeError: 'NoneType' object has no attribute 'group'.
import sys
import re
#def extract_names(filename):
f = open('name.html', 'r')
text = f.read()
match = re.search (r'<hgroup><h1>(\w+)</h1>', text)
second = re.search (r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', text)
outf = open('details.txt', 'a')
outf.write(match)
outf.close()
My intention is to read a .HTML file looking for the <h1> tag value and the number of employees and append them to a file. But for some reason I can't seem to get it right.
Your help is greatly appreciated.
You are using a regular expression, but matching HTML with such expressions gets too complicated, too fast. Don't do that.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library (it is an XML parser, so it only handles well-formed markup such as XHTML)
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
The latter two handle malformed HTML quite gracefully as well, making decent sense of many a botched website.
ElementTree example:
from xml.etree import ElementTree
tree = ElementTree.parse('filename.html')
# iter() finds <h1> at any depth, not only directly under the root
for elem in tree.iter('h1'):
    print(ElementTree.tostring(elem))
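For pages that are not well-formed XML, the standard library also ships html.parser.HTMLParser, which is event-based rather than tree-based. Below is a small sketch that collects the text of every h1 element. The H1Collector class and the sample markup are made up for illustration, with the markup modeled on the patterns in the question.

```python
from html.parser import HTMLParser

class H1Collector(HTMLParser):
    """Collects the text content of every <h1> element."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Only text seen between <h1> and </h1> is kept.
        if self._in_h1:
            self.headings[-1] += data

# Hypothetical page fragment.
sample = '<hgroup><h1>Acme</h1></hgroup><ul><li class="hover">Employees: <b>1,234</b></li></ul>'
collector = H1Collector()
collector.feed(sample)
collector.close()
headings = collector.headings
print(headings)
```

For anything beyond simple extraction, BeautifulSoup or lxml will usually be less work than hand-written handler methods.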
Just for the sake of completeness: your error message simply indicates that your regular expression failed to match, so re.search() returned None.
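To see that failure mode in isolation, here is a short sketch using the first pattern from the question against two invented strings. re.search() returns None when nothing matches, so the result has to be checked before calling .group() on it.

```python
import re

pattern = r"<hgroup><h1>(\w+)</h1>"

# Invented inputs: \w+ cannot match across the space in the second one.
ok_text = "<hgroup><h1>Acme</h1></hgroup>"
bad_text = "<hgroup><h1>Acme Corp</h1></hgroup>"

ok_match = re.search(pattern, ok_text)
bad_match = re.search(pattern, bad_text)

# Guard against None instead of calling .group() unconditionally.
ok_name = ok_match.group(1) if ok_match else None
bad_name = bad_match.group(1) if bad_match else None
print(ok_name, bad_name)
```

Note also that file.write() expects a string, so the captured group (not the match object itself) is what should be written to the output file.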
