lxml parsing with python: how to with objectify - python

I am trying to read the XML behind an SPSS file, and I would like to move from etree to objectify.
How can I convert the function below to return an objectify object? I would like to do this because an objectify XML object would be easier for me (as a newbie) to work with, as it is more Pythonic.
    def get_etree(path_file):
        from lxml import etree
        with open(path_file, 'r+') as f:
            xml_text = f.read()
        recovering_parser = etree.XMLParser(recover=True)
        xml = etree.parse(StringIO(xml_text), parser=recovering_parser)
        return xml
My failed attempt:
    def get_etree(path_file):
        from lxml import etree, objectify
        with open(path_file, 'r+') as f:
            xml_text = objectify.fromstring(xml)
        return xml
But I get this error:
    lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI

The first, biggest mistake is to read a file into a string and feed that string to an XML parser.
Python will decode the file using your platform's default encoding (unless you pass an encoding to open()), and that step will very likely break anything other than plain ASCII files.
XML files come in many encodings, you cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.
<?xml version="1.0" encoding="Windows-1252"?>
An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
Luckily lxml makes it very easy:
    from lxml import etree, objectify
    def get_etree(path_file):
        return etree.parse(path_file, parser=etree.XMLParser(recover=True))
    def get_objectify(path_file):
        return objectify.parse(path_file)
and
    path = r"/path/to/your.xml"
    xml1 = get_etree(path)
    xml2 = get_objectify(path)
    print(xml1)  # -> <lxml.etree._ElementTree object at 0x02A7B918>
    print(xml2)  # -> <lxml.etree._ElementTree object at 0x02A7B878>
P.S.: Think hard if you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway or would you much rather reject it and display an error message?
I would do the latter. Using a recovering parser may cause nasty run-time errors later.
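To illustrate the point about the XML declaration, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made only so the example runs without third-party packages), and the file name and sample content are made up for illustration. It compares letting the parser read the file itself with decoding the file by hand using the wrong codec.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample: declares Windows-1252 and contains "é" as the single byte 0xE9.
raw = b'<?xml version="1.0" encoding="Windows-1252"?><name>caf\xe9</name>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(raw)

    # Passing the path lets the parser read the declaration and decode the bytes itself.
    parsed_text = ET.parse(path).getroot().text

    # Decoding by hand with a guessed codec silently damages the text.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        wrongly_decoded = f.read()

print(parsed_text)                   # café
print("\ufffd" in wrongly_decoded)   # True: 0xE9 is not valid UTF-8 here
```

The parser gets the character right because it honours the declared encoding, while the hand-decoded string ends up with a replacement character.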

Related

How to generate XML, UTF-8 with BOM using Python Element Tree?

For generating resource XML file for ASP.NET, the third-party tool requires BOM (when migrating to a new version of the tool). At the same time, it requires the XML prolog like <?xml version='1.0' encoding='utf-8'?>.
The problem is that when using the ElementTree command...
tree.write(lang_resx_fpath, encoding='utf-8')
the resulting file does not contain BOM. When using the command...
tree.write(lang_resx_fpath, encoding='utf-8-sig')
the result does contain BOM; however, the XML prolog contains encoding='utf-8-sig'.
How should I generate the file to contain both BOM and encoding='utf-8'?
UPDATE:
I have worked around it by reading, replacing, and writing the file again, like this...
    with open(lang_resx_fpath, 'r', encoding='utf-8-sig') as f:
        content = f.read()
    content = content.replace("encoding='utf-8-sig'", "encoding='utf-8'")
    with open(lang_resx_fpath, 'w', encoding='utf-8-sig') as f:
        f.write(content)
Anyway, is there any cleaner solution?
UPDATE: I have created the https://bugs.python.org/issue46598, and I have also written the fix (https://github.com/python/cpython/pull/31043).
A peek into the source of ElementTree.write shows that the prolog is hardcoded there (https://github.com/python/cpython/blob/main/Lib/xml/etree/ElementTree.py or permalink https://github.com/python/cpython/blob/ee0ac328d38a86f7907598c94cb88a97635b32f8/Lib/xml/etree/ElementTree.py). Therefore, other than monkey-patching the module, using the internals of ET is probably the only option to write the required preamble and keep the BOM in the file:
    import xml.etree.ElementTree as ET
    qnames, namespaces = ET._namespaces(tree._root, None)
    with open(lang_resx_fpath, 'w', encoding='utf-8-sig') as f:
        f.write("<?xml version='1.0' encoding='utf-8'?>\n")
        ET._serialize_xml(f.write,
                          tree._root, qnames, namespaces,
                          short_empty_elements=False)
It is probably no more elegant than your solution (and maybe even less so). The only advantage is that it does not require writing the file twice, which is a minor benefit except for very large XML files.
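One alternative that stays within the public API may be worth trying: write the BOM bytes yourself to a file opened in binary mode, then let tree.write() emit the declaration and body with encoding='utf-8'. The sketch below is an illustration under assumptions, not a tested fix for the third-party tool. The sample tree and the helper name write_with_bom are hypothetical, and an in-memory buffer stands in for the real file.

```python
import codecs
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real resource tree.
tree = ET.ElementTree(ET.fromstring("<root><data name='greeting'>Hello</data></root>"))

def write_with_bom(tree, binary_file):
    """Write a UTF-8 BOM, then let ElementTree emit the declaration and body."""
    binary_file.write(codecs.BOM_UTF8)
    tree.write(binary_file, encoding="utf-8", xml_declaration=True)

buffer = io.BytesIO()  # for a real file, use open(lang_resx_fpath, "wb") instead
write_with_bom(tree, buffer)
output = buffer.getvalue()

print(output.startswith(codecs.BOM_UTF8))            # True
print(output[3:].decode("utf-8").splitlines()[0])    # <?xml version='1.0' encoding='utf-8'?>
```

Because the file object is binary, ElementTree encodes the text as plain UTF-8 and does not add a second BOM.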

'xml.etree.ElementTree.Element' object has no attribute 'write'

I want to read an XML string, edit it and save it as an XML file.
However I get the mentioned error in the title when I do .write()
I found out that when you read an XML string using ElementTree.fromstring(string) it will create an ElementTree.Element and not an ElementTree itself. An Element has no write method but the ElementTree does.
How can I write an Element to a XML file? Or how can I create an ElementTree and add my Element to that and then use the .write method?
I found out that when you read a xml string using ElementTree.fromstring(string) it will actually create an ElementTree.Element and not a ElementTree itself.
Yes, you get the top-level element back (also called the "document element").
An Element has no write method but the ElementTree does.
The ElementTree constructor signature goes like this:
class xml.etree.ElementTree.ElementTree(element=None, file=None)
Therefore it's completely straightforward:
import xml.etree.ElementTree as ET
doc = ET.fromstring("<test>test öäü</test>")
tree = ET.ElementTree(doc)
tree.write("test.xml", encoding="utf-8")
You always should specify the encoding when writing an XML file. Most of the time, UTF-8 is the best choice.
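The whole round trip asked about (read a string, edit it, save it) can be sketched as follows. The element names and the output path are hypothetical, and a temporary directory is used so the example cleans up after itself.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring("<config><name>old</name></config>")  # returns an Element
root.find("name").text = "new"                             # edit it in place

# If only a string is needed, an Element can be serialized directly.
as_string = ET.tostring(root, encoding="unicode")

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "config.xml")  # hypothetical output path
    # Wrap the Element in an ElementTree to get the write() method.
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    with open(out_path, "rb") as f:
        saved = f.read()

print(as_string)  # <config><name>new</name></config>
```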
In case this helps anyone who gets this unclear error message when trying to use ElementTree to write an XML file, and spends way too long on it (like I did):
    File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 788, in _get_writer
        write = file_or_filename.write
    AttributeError: 'str' object has no attribute 'write'
... in my case, it was simply because the path to the directory I was trying to write my XML file to did not exist! For example:
    tree.write("/FolderDidNotExist/test.xml", encoding="utf-8")
A simple mkdir /FolderDidNotExist did the trick. No more error. (Of course, this error message could use some "love", so I'm posting this here in case I forget what it means again [which I've done] and need to google this again.)
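To avoid the problem from within the script, the target folder can be created before writing. A minimal sketch, assuming a made-up folder layout under a temporary directory:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.fromstring("<test>ok</test>"))

with tempfile.TemporaryDirectory() as base:
    # Hypothetical nested target whose parent folder does not exist yet.
    target = os.path.join(base, "FolderDidNotExist", "test.xml")
    os.makedirs(os.path.dirname(target), exist_ok=True)  # create missing folders first
    tree.write(target, encoding="utf-8")
    written = os.path.exists(target)

print(written)  # True
```

With exist_ok=True the call is harmless when the folder is already there.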

Testing for an empty xml file in Python

I have a Python script to parse XML files into a more friendly format for another platform.
Every so often one of the data files contains no data - only the encoding info and no other tags, which is causing ElementTree to throw a ParseError when it finds them.
<?xml version="1.0" encoding="utf-8"?>
Is there a way of testing for the empty file before calling ElementTree?
Ta.
You should ask for forgiveness not permission here.
Handle the exception by wrapping the code in a try/except block.
    import xml.etree.ElementTree as ET
    ...
    try:
        tree = ET.parse(fooxml)
    except ET.ParseError:
        # log error
        pass
There are, of course, several ways. Use try/except:
    try:
        pass  # delete this and add your parse code
    except Exception:
        pass  # handle the empty-file case here
or use an if statement:
    if (some code to evaluate whether the XML is not empty):
        # your code
    elif (some code to check whether the XML is empty):
        # your code
Let me know how it went!
Of course you could catch the exception that lxml throws. If you want to avoid parsing, you could check whether the file contains at most one < symbol:
    from lxml import etree
    with open("input.xml", "rb") as f:
        contents = f.read()
    if contents.count(b"<") <= 1:
        # empty or only header: skip
        pass
    else:
        x = etree.XML(contents)
Of course, this heuristic doesn't protect against other parsing errors, so it's best to just protect the parsing with a try/except block.
But this method has the advantage of being extremely fast if you have lots of corrupt one-line "header only" files.
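The try/except approach can be wrapped in a small helper so the calling code only has to check for None. This is a sketch with a hypothetical helper name (parse_or_none) and made-up sample files, using the standard library's ElementTree:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def parse_or_none(path):
    """Return the root element, or None if the file cannot be parsed."""
    try:
        return ET.parse(path).getroot()
    except ET.ParseError as exc:
        print(f"skipping {os.path.basename(path)}: {exc}")
        return None

with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.xml")
    good_path = os.path.join(tmp, "good.xml")
    with open(empty_path, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="utf-8"?>')  # header only
    with open(good_path, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="utf-8"?><data><row/></data>')

    empty_root = parse_or_none(empty_path)
    good_root = parse_or_none(good_path)

print(empty_root is None, good_root.tag)
```

Compare the result with `is None` rather than relying on truthiness, since an element without children evaluates as false.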

Problems extracting XML code from Word with Python

I am attempting to extract the XML code from a Word document with Python. Here's the code I tried:
    def getXml(docxFilename):
        zip = zipfile.ZipFile(open(docxFilename, "rb"))
        xmlString = str(zip.read("word/document.xml"))
        return xmlString
I created a test document and ran the function getXml on it. Here's the result:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00B52719"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:t>Test</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidSect="009C4305"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'
There are some obvious issues. One is that the XML code begins with "b'" and ends with an apostrophe. Second, there is a "\r\n" right after the first set of angle brackets.
My ultimate goal is to modify the XML code to create a new Word document -- see this question -- but the anomalies with the extracted XML are preventing me from doing this.
Does anyone know why the extracted XML has these strange features and how I can remove them?
EDIT: I tried using the lxml module to parse this code but I only got different errors.
I created a new function getXmlTree:
    from lxml import etree
    def getXmlTree(xmlString):
        return etree.fromstring(xmlString)
I then ran the code etree.tostring(getXmlTree(getXml("test.docx")),pretty_print=True) and received much more sensible XML code.
The problems arise when I tried to create a new Word document. I created the following function to convert XML code into a Word document (shamelessly stolen from here):
    import zipfile
    from lxml import etree
    import os
    import tempfile
    import shutil
    def createNewDocx(originalDocx, xmlContent, newFilename):
        tmpDir = tempfile.mkdtemp()
        zip = zipfile.ZipFile(open(originalDocx, "rb"))
        zip.extractall(tmpDir)
        with open(os.path.join(tmpDir, "word/document.xml"), "w") as f:
            xmlString = etree.tostring(xmlContent, pretty_print=True)
            f.write(xmlString)
        filenames = zip.namelist()
        zipCopyFilename = newFilename
        with zipfile.ZipFile(zipCopyFilename, "w") as docx:
            for filename in filenames:
                docx.write(os.path.join(tmpDir, filename), filename)
        shutil.rmtree(tmpDir)
Before trying to create a new Word document, I wanted to see if I could create a copy of my original test document by substituting xmlContent = getXmlTree(getXml("test.docx")) as an argument in the above function. When I ran the code, however, I received an error message:
        f.write(xmlString)
    TypeError: must be str, not bytes
Using f.write(str(xmlString)) instead didn't help; it created a new word document, but Word would crash if I tried to open it.
EDIT2: tried running the above code with f.write(xmlString.decode("utf-8")) instead, but it didn't help; Word still crashed.
My guess is that the XML is not being encoded properly. First, write the document file as binary using "wb" as the mode. Second, tell etree.tostring() what the encoding is and to include the XML declaration.
    with open(os.path.join(tmpDir, "word/document.xml"), "wb") as f:
        xmlBytes = etree.tostring(xmlContent, encoding="UTF-8", xml_declaration=True, pretty_print=True)
        f.write(xmlBytes)
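As for the "b'" prefix and the literal "\r\n" in the extracted text: in Python 3, calling str() on a bytes object returns its printable representation rather than the decoded text. The sketch below shows the difference on a tiny in-memory archive that stands in for a .docx; its content is hypothetical.

```python
import io
import zipfile

# Tiny in-memory archive standing in for a .docx (hypothetical content).
xml_bytes = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<doc>Test</doc>'
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("word/document.xml", xml_bytes)

with zipfile.ZipFile(archive) as zf:
    data = zf.read("word/document.xml")  # bytes, exactly as stored in the archive

as_repr = str(data)              # the repr: starts with b' and shows \r\n as four characters
as_text = data.decode("utf-8")   # the actual XML text

print(as_repr[:2])                   # b'
print(as_text.startswith("<?xml"))   # True
```

Keeping the data as bytes (or decoding it explicitly) avoids feeding the repr to an XML parser or writing it back into the archive.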

creating xml documents with whitespace with xml.etree.cElementTree

I'm working on a project to store various bits of text in XML files, but because people besides me are going to look at it and use it, it has to be properly indented and such. I looked at a question on how to generate XML files using cElementTree here, and the guy says something about putting in info about making things pretty if people ask, but there isn't anything there (I guess because no one asked). So basically, is there a way to properly indent and add whitespace using cElementTree, or should I just throw up my hands and go learn how to use lxml?
You can use minidom to prettify your XML string:
    from xml.etree import ElementTree as ET
    from xml.dom import minidom
    # Return a pretty-printed XML string for the Element.
    def prettify(elem):
        INDENT = " "
        rough_string = ET.tostring(elem, 'utf-8')
        reparsed = minidom.parseString(rough_string)
        return reparsed.toprettyxml(indent=INDENT)
    # name of root tag
    root = ET.Element("root")
    child = ET.SubElement(root, 'child')
    child.text = 'This is text of child'
    prettified_xmlStr = prettify(root)
    # the with block closes the file even if writing fails
    with open("Output.xml", "w") as output_file:
        output_file.write(prettified_xmlStr)
    print("Done!")
Answering myself here:
Not with ElementTree. The best option would be to download and install the lxml module, then simply pass the option
    pretty_print=True
when serializing new XML files.
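Newer versions of the standard library (Python 3.9 and later) add an indent() function to xml.etree.ElementTree, so lxml is no longer strictly required for this. A minimal sketch, assuming Python 3.9+ and reusing the small root/child tree from the minidom example:

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
child = ET.SubElement(root, "child")
child.text = "This is text of child"

ET.indent(root, space="  ")  # Python 3.9+; inserts whitespace into the tree in place
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Note that indent() modifies the tree itself, so call it just before serializing if the original whitespace matters elsewhere.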
