Remove "encoding" attribute from XML in Python - python

I am using python to do some conditional changes to an XML document. The incoming document has <?xml version="1.0" ?> at the top.
I'm using xml.etree.ElementTree.
This is how I'm serializing the changed XML:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
The output has this at the top:
<?xml version='1.0' encoding='utf8'?>
The client wants the "encoding" attribute removed, but if I remove it then the output either doesn't include the declaration line at all or it contains encoding='us-ascii'.
Can this be done so the output matches: <?xml version="1.0" ?>?
(I don't know why it matters honestly but that's what I was told needed to happen)

As pointed out in this answer, there is no way to make ElementTree omit the encoding attribute. However, as @James suggested in a comment, it can be stripped from the resulting output like this:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
filter_update_body = filter_update_body.replace(b"encoding='utf8'", b"", 1)
The b prefixes are required because ET.tostring() will return a bytes object if encoding != "unicode". In turn, we need to call bytes.replace().
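With encoding="unicode" (note that this is the literal string "unicode"), it will return a regular str. In this case, the b prefixes can be omitted and good old str.replace() is used instead. Note that in this mode tostring() does not emit an XML declaration by default, so there may be nothing to strip.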
It's worth noting that the choice between bytes and str also affects how the XML will eventually be written to a file. A bytes object should be written in binary mode, a str in text mode.
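To make the bytes/str distinction concrete, here is a minimal sketch of both variants, each written to a file in the matching mode. The sample document and the file names are made up for illustration; only ET.tostring() and the replace() call come from the answer above.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document standing in for the real one.
root = ET.fromstring("<filter><name>demo</name></filter>")

# bytes variant: any encoding other than "unicode" makes tostring() return bytes,
# so the declaration is edited with bytes.replace().
xml_bytes = ET.tostring(root, encoding="utf8", method="xml")
xml_bytes = xml_bytes.replace(b"encoding='utf8'", b"", 1)

# str variant: encoding="unicode" returns str
# (no XML declaration is emitted by default in this mode).
xml_text = ET.tostring(root, encoding="unicode", method="xml")

with tempfile.TemporaryDirectory() as tmp:
    # bytes go to a file opened in binary mode
    with open(os.path.join(tmp, "from_bytes.xml"), "wb") as fh:
        fh.write(xml_bytes)
    # str goes to a file opened in text mode
    with open(os.path.join(tmp, "from_str.xml"), "w", encoding="utf-8") as fh:
        fh.write(xml_text)
```

Note that ElementTree writes the declaration with single quotes, so the stripped result is <?xml version='1.0' ?> rather than the double-quoted form in the question.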

Related

parsing XML string with encoding block at the beginning in python with etree & LXML [duplicate]

I have to parse XML files that start as such:
xml_string = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<annotationStandOffs xmlns="http://www.tei-c.org/ns/1.0">
<standOff>
...
</standOff>
</annotationStandOffs>
'''
The following code will only fly if I eliminate the first line of the above shown string:
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(resolve_entities=False,strip_cdata=False,recover=True)
XML_tree = etree.XML(xml_string,parser=parser)
Otherwise I get the error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
As the error indicates, the encoding part of the XML declaration is meant to provide the necessary information about how to convert bytes (e.g. as read from a file) into a string. It doesn't make sense when the XML already is a string.
Some XML parsers silently ignore this information when parsing from a string. Some throw an error.
So since you're pasting XML into a string literal in Python source code, it would only make sense to remove the declaration yourself while you're editing the Python file.
The other, not so smart option would be to use a byte string literal b'''...''', or to encode the string into a single-byte encoding at run-time with '''...'''.encode('windows-1252'). But this opens another can of worms: when your Python file encoding (e.g. UTF-8) clashes with the alleged XML encoding of your copy-pasted XML (e.g. UTF-16), you'll get more interesting errors.
Long story short, don't do that. Don't copypaste XML into Python source code without taking the XML declaration out. And don't try to "fix" it by run-time string encode() tomfoolery.
The opposite is also true. If you have bytes (e.g. read from a file in binary mode, or from a network socket), then give those bytes to the XML parser. Don't manually decode() them into a string first.
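As a rough illustration of the "give bytes to the parser" advice, the sketch below writes a small document to a temporary file and parses it from a handle opened in binary mode. It uses the standard library's ElementTree instead of lxml so that it has no third-party dependency; the file name and the text inside standOff are invented for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for a file on disk; these bytes play the role of its content.
raw = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    b'<annotationStandOffs xmlns="http://www.tei-c.org/ns/1.0">\n'
    b'<standOff>caf\xc3\xa9</standOff>\n'
    b'</annotationStandOffs>\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "standoff.xml")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(raw)
    # Binary mode: the parser sees the raw bytes and applies the declared encoding itself.
    with open(path, "rb") as fh:
        root = ET.parse(fh).getroot()

stand_off = root.find("{http://www.tei-c.org/ns/1.0}standOff")
```

The parser decodes the UTF-8 bytes on its own, so stand_off.text comes back as an ordinary string without any manual decode() call.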

Add xml metatag on python 2.5.4

I am writing a Python 2.5.4 script that uses xml.etree.ElementTree. I'm using the write function to write the resulting XML into a file. The thing is, I need to have an XML meta tag (the XML declaration): <?xml version="1.0" encoding="UTF-8"?>, and as noted here:
https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree.write
All that needs to be done is to pass True for the xml_declaration parameter of write. But that seems to work only on Python 2.6 and up.
Any ideas how this can be done on Python 2.5.4 (if it is possible at all)?
It seems the easiest way to force writing an XML declaration is to pass an encoding other than us-ascii or utf-8. This is not really documented, but a quick look at the source of the write() method reveals this:
def write(self, file, encoding="us-ascii"):
    assert self._root is not None
    if not hasattr(file, "write"):
        file = open(file, "wb")
    if not encoding:
        encoding = "us-ascii"
    elif encoding != "utf-8" and encoding != "us-ascii":
        file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
    self._write(file, self._root, encoding, {})
The comparison is case sensitive, so if you use encoding="UTF-8" (not encoding="utf-8"), you'll end up exactly with what you want.
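For comparison, on Python versions where write() accepts the xml_declaration parameter mentioned in the question, the declaration can be requested explicitly. This is only a sketch for Python 3 and does not apply to 2.5.4; the element names are made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample tree.
root = ET.Element("ratings")
ET.SubElement(root, "rating").text = "PG"
tree = ET.ElementTree(root)

# Write into an in-memory binary buffer instead of a real file.
buffer = io.BytesIO()
tree.write(buffer, encoding="UTF-8", xml_declaration=True)
output = buffer.getvalue()
```

Here output starts with the declaration, written with single quotes and the encoding name exactly as it was passed in.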

XML parser, recover=True?

I'm trying to parse some XML, however I get an error message.
After looking around a little I suspect it is due to some kind of special character in the source text and a (recover=True) should be placed in my parser line.
However I do not know the exact location for this.
Could someone have a look?
for name in newlist:
    tree = ET.parse(loc + name)
    root = tree.getroot()
    for post in root.findall('post'):
        text = post.text
        text = text.strip()
        posts.append(text)
The error I get is:
ParseError: not well-formed (invalid token): line 103, column 225
I'm not familiar with Python, but I've had issues like this in C#. It might be because the XML isn't well-formed. Normally the first line of an XML file contains something like
<?xml version="1.0" encoding="UTF-8" ?>
The version and encoding are important because they tell the parser how to interpret the characters in the file. UTF-8 is the default, but sometimes the file contains non-ASCII characters that make the parser fail. Changing the declared encoding to UTF-16 sometimes fixes this.
Good luck
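Note that recover=True is an option of lxml's parser; the standard library's xml.etree.ElementTree has no such switch. With the standard library, one way to at least find the offending input is to catch the ParseError and report its position. A minimal sketch, where the function name and the sample strings are hypothetical:

```python
import xml.etree.ElementTree as ET

def collect_posts(xml_text, source_name="<string>"):
    """Return the stripped <post> texts, or None if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the bad token.
        line, column = err.position
        print("%s: not well-formed at line %d, column %d" % (source_name, line, column))
        return None
    # post.text can be None for an empty element, hence the "or".
    return [(post.text or "").strip() for post in root.findall("post")]

good = collect_posts("<posts><post> hello </post></posts>")
# A bare "&" is a typical cause of "not well-formed (invalid token)".
bad = collect_posts("<posts><post>a & b</post></posts>", "broken.xml")
```

Looking at the reported line and column in the source file usually shows whether the problem is an unescaped character or a wrong encoding.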

Unicode decode error using codecs.open()

I have run into a character encoding problem as follows:
rating = 'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(
"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
<rating system="%s">%s</rating>
</ratings>""" % (values['rating_system'], rating))
The error I get is:
File "./assetshare.py", line 314, in write_file
</ratings>""" % (values['rating_system'], rating))
I know that the encoding error is related to Barntillåten, because if I replace that word with test, the function works fine.
Why is this encoding error happening and what do I need to do to fix it?
rating must be a Unicode string in order to contain Unicode codepoints.
rating = u'Barntillåten'
Otherwise, in Python 2, the non-Unicode string 'Barntillåten' contains bytes (encoded with whatever your source encoding was), not codepoints.
In Python 2, codecs.open expects to read and write unicode objects. You're passing it a str.
The fix is to ensure that the data you pass it is unicode:
new_file.write((
"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
<rating system="%s">%s</rating>
</ratings>""" % (values['rating_system'], rating)
).decode('utf-8'))
If you use unicode literals (u"...") then Python will try to ensure that all data is unicode. Here it would be sufficient to have rating = u'Barntillåten':
rating = u'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(
"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
<rating system="%s">%s</rating>
</ratings>""" % (values['rating_system'], rating))
You can write a str object into a codecs.open file, but only if the str can be decoded with the default encoding, which in practice means it is only safe if the str is plain ASCII. The default encoding is, and should be left as, ASCII; see Changing default encoding of Python?
You need to use unicode literals.
u'...'
u"..."
u'''......'''
u"""......"""
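Putting the answers together, here is a minimal sketch that keeps every piece of text as Unicode and lets the file object do the encoding. It swaps codecs.open for io.open, which exists in both Python 2.6+ and Python 3, so that is a substitution and not what the question used. The rating system value and the temporary folder are made up for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile

rating = u'Barntillåten'
values = {'rating_system': u'example-system'}  # hypothetical value

# The template is a unicode literal too, so the % formatting stays in Unicode.
document = u"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
<rating system="%s">%s</rating>
</ratings>""" % (values['rating_system'], rating)

folder = tempfile.mkdtemp()  # stand-in for the real output folder
try:
    path = os.path.join(folder, "metadata.xml")
    with io.open(path, 'w', encoding='utf-8') as new_file:
        new_file.write(document)  # text goes in, UTF-8 bytes end up on disk
    with io.open(path, 'rb') as check:
        raw = check.read()
finally:
    shutil.rmtree(folder)
```

In Python 3 every str literal is already Unicode, so the u prefixes are optional there.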

Forcing encoding on bad XML files with ElementTree

A big set of XML files have the wrong encoding defined. It should be utf-8 but the content has latin-1 characters all over the place. What's the best way to parse this content?
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
Edit: this is happening with Adobe InDesign IDML files, it seems the "Content" text has latin-1 but the rest could be utf-8. I'm favoring normal parsing with utf-8, then reencode the Unicode text chunks in Content to utf-8 and then re-parsing with latin-1. What a mess.
ಠ_ಠ
You can override the encoding specified in the XML when you parse it:
class xml.etree.ElementTree.XMLParser(html=0, target=None, encoding=None)
Element structure builder for XML source data, based on the expat parser. html are predefined HTML entities. This flag is not supported by the current implementation. target is the target object. If omitted, the builder uses an instance of the standard TreeBuilder class. encoding is optional. If given, the value overrides the encoding specified in the XML file.
docs
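A small sketch of that override on Python 3, where XMLParser takes encoding as a keyword argument. The sample bytes are invented: they declare UTF-8 but contain a Latin-1 byte. The encoding name iso-8859-1 is used because expat knows it natively.

```python
import xml.etree.ElementTree as ET

# Declared as UTF-8, but 0xE9 is "é" in Latin-1 and not valid UTF-8 at this position.
data = b'<?xml version="1.0" encoding="UTF-8"?><Content>caf\xe9</Content>'

# Parsing with the declared encoding is expected to fail on the stray byte.
try:
    ET.fromstring(data)
    default_parse_ok = True
except ET.ParseError:
    default_parse_ok = False

# Override the declared encoding for this parse only.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(data, parser=parser)
```

This treats the whole file as Latin-1, so it only helps when the file really is Latin-1 throughout. For files that mix encodings, as described in the edit to the question, genuine UTF-8 sequences would be misread.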
Don't try to deal with encoding problems during parse, but pre-process the offending file(s).
