A big set of XML files declare the wrong encoding. The declaration says utf-8, but the content has latin-1 characters all over the place. What's the best way to parse this content?
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
Edit: this is happening with Adobe InDesign IDML files. It seems the "Content" text is latin-1 while the rest could be utf-8. I'm leaning toward parsing normally as utf-8, then re-encoding the Unicode text chunks in Content to utf-8 and re-parsing them as latin-1. What a mess.
ಠ_ಠ
You can override the encoding specified in the XML when you parse it:
class xml.etree.ElementTree.XMLParser(html=0, target=None, encoding=None)
Element structure builder for XML source data, based on the expat parser. html are predefined HTML entities. This flag is not supported by the current implementation. target is the target object. If omitted, the builder uses an instance of the standard TreeBuilder class. encoding is optional. If given, the value overrides the encoding specified in the XML file.
docs
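As a minimal sketch of that override with Python 3's standard library: the sample bytes and the choice of iso-8859-1 below are assumptions for illustration. The document claims UTF-8 but contains a latin-1 byte, and the parser is told to disregard the declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: declared as UTF-8, but "é" is stored as the
# single latin-1 byte 0xE9, which is not valid UTF-8 here.
mislabeled = b'<?xml version="1.0" encoding="UTF-8"?><Content>caf\xe9</Content>'

# The encoding argument takes precedence over the declaration in the document.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(mislabeled, parser=parser)
text = root.text
```

This only helps when the whole file is in one encoding. If a file really mixes UTF-8 and latin-1, forcing latin-1 will mangle the UTF-8 parts.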
Don't try to deal with encoding problems during parse, but pre-process the offending file(s).
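One way such a pre-processing step might look, as a rough sketch: the helper name to_utf8 is hypothetical, and it assumes each file is either entirely valid UTF-8 or entirely latin-1. It does not handle files that mix both.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8, assuming it is either valid UTF-8 or latin-1."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave it untouched
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this decode cannot fail
        return raw.decode("latin-1").encode("utf-8")
```

The cleaned bytes can then be written back with a file opened in binary mode and parsed normally, since the utf-8 declaration is now accurate.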
I am using python to do some conditional changes to an XML document. The incoming document has <?xml version="1.0" ?> at the top.
I'm using xml.etree.ElementTree.
How I'm serializing the changed XML:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
The output has this at the top:
<?xml version='1.0' encoding='utf8'?>
The client wants the encoding attribute removed, but if I drop the encoding argument, the output either omits the declaration entirely or contains encoding='us-ascii'.
Can this be done so the output matches: <?xml version="1.0" ?>?
(I don't know why it matters honestly but that's what I was told needed to happen)
As pointed out in this answer, there is no way to make ElementTree omit the encoding attribute. However, as @James suggested in a comment, it can be stripped from the resulting output like this:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
filter_update_body = filter_update_body.replace(b"encoding='utf8'", b"", 1)
The b prefixes are required because ET.tostring() will return a bytes object if encoding != "unicode". In turn, we need to call bytes.replace().
With encoding = "unicode" (note that this is the literal string "unicode"), it will return a regular str. In this case, the bs can be omitted. We use good old str.replace().
It's worth noting that the choice between bytes and str also affects how the XML will eventually be written to a file. A bytes object should be written in binary mode, a str in text mode.
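To make the bytes/str distinction concrete, here is a small sketch. The element name "filter" and the output file names are made up for illustration, and the files go into a temporary directory.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element("filter")  # hypothetical element, stands in for the real tree

# bytes route: serialize, strip the encoding attribute, write in binary mode
as_bytes = ET.tostring(root, encoding="utf8", method="xml")
as_bytes = as_bytes.replace(b"encoding='utf8'", b"", 1)

# str route: the literal encoding name "unicode" makes tostring() return str
as_str = ET.tostring(root, encoding="unicode", method="xml")

out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "from_bytes.xml"), "wb") as f:
    f.write(as_bytes)  # bytes need a binary-mode file
with open(os.path.join(out_dir, "from_str.xml"), "w", encoding="utf-8") as f:
    f.write(as_str)  # str needs a text-mode file
```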
I have to parse XML files that start as such:
xml_string = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<annotationStandOffs xmlns="http://www.tei-c.org/ns/1.0">
<standOff>
...
</standOff>
</annotationStandOffs>
'''
The following code will only fly if I eliminate the first line of the above shown string:
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(resolve_entities=False,strip_cdata=False,recover=True)
XML_tree = etree.XML(xml_string,parser=parser)
Otherwise I get the error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
As the error indicates, the encoding part of the XML declaration is meant to provide the necessary information about how to convert bytes (e.g. as read from a file) into string. It doesn't make sense when the XML already is a string.
Some XML parsers silently ignore this information when parsing from a string. Some throw an error.
So since you're pasting XML into a string literal in Python source code, it would only make sense to remove the declaration yourself while you're editing the Python file.
The other, not so smart option would be to use a byte string literal b'''...''', or to encode the string into a single-byte encoding at run-time '''...'''.encode('windows-1252'). But this opens another can of worms. When your Python file encoding (e.g. UTF-8) clashes with the alleged XML encoding from your copypasted XML (e.g. UTF-16), you'll get more interesting errors.
Long story short, don't do that. Don't copypaste XML into Python source code without taking the XML declaration out. And don't try to "fix" it by run-time string encode() tomfoolery.
The opposite is also true. If you have bytes (e.g. read from a file in binary mode, or from a network socket) then give those bytes to the XML parser. Don't manually decode() them into string first.
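A small illustration of the bytes case, using the standard library's ElementTree rather than lxml so it runs without extra packages. The sample bytes are invented: they declare ISO-8859-1 and contain the byte 0xE9, which is "é" in that encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical bytes, as they might arrive from a file opened with "rb"
# or from a network socket.
raw = b'<?xml version="1.0" encoding="ISO-8859-1"?><standOff>caf\xe9</standOff>'

# Hand the bytes to the parser as-is; it reads the declaration and
# decodes the content accordingly. No manual decode() beforehand.
root = ET.fromstring(raw)
decoded = root.text
```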
So I have a 9000 line xml database, saved as a txt, which I want to load in python, so I can do some formatting and remove unnecessary tags (I only need some of the tags, but there is a lot of unnecessary information) to make it readable. However, I am getting a UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 608814: character maps to <undefined>, which I assume means that the program ran into a non-Unicode character. I am quite positive that these characters are not important to the program (the data I am looking for is all plain text, with no special symbols), so how can I remove all of these from the txt file, when I can't read the file without getting the UnicodeDecodeError?
One crude workaround is to decode the bytes from the file yourself and specify the error handling, e.g.:
for line in somefile:  # somefile must be opened in binary mode ('rb')
    uline = line.decode('ascii', errors='ignore')
That will turn the line into a Unicode object in which any non-ascii bytes have been dropped. This is not a generally recommended approach - ideally you'd want to process XML with a proper parser, or at least know your file's encoding and open it appropriately (the exact details depend on your Python version). But if you're entirely certain you only care about ascii characters this is a simple fallback.
The error suggests that you're using open() function without specifying an explicit character encoding. locale.getpreferredencoding(False) is used in this case (e.g., cp1252). The error says that it is not an appropriate encoding for the input.
An XML document may contain a declaration at the very beginning that specifies the encoding explicitly. Otherwise the encoding is defined by the BOM, or it is utf-8. If your copy-pasting and saving the file hasn't messed up the encoding and you don't see a line such as <?xml version="1.0" encoding="iso-8859-1" ?> then open the file using utf-8:
with open('input-xml-like.txt', encoding='utf-8', errors='ignore') as file:
    ...
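If you would rather check than eyeball the file, the lookup order described above (BOM, then declaration, then the utf-8 default) could be sketched roughly like this. The function name sniff_xml_encoding is hypothetical, the regex is a simplification of the real XML grammar, and UTF-32 BOMs are not handled.

```python
import codecs
import re

# Simplified pattern for the encoding attribute of an XML declaration.
_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def sniff_xml_encoding(head: bytes) -> str:
    """Guess a codec name from the first bytes of an XML-like file."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec also strips the BOM when decoding
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"  # the default when nothing else is specified
```

The result can be passed as the encoding argument of open(), after reading the first few hundred bytes of the file in binary mode.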
If the input is an actual XML then just pass it to an XML parser instead:
import xml.etree.ElementTree as etree
tree = etree.parse('input.xml')
I want to parse a file with minidom:
with codecs.open(fname, encoding="utf-8") as xml:
    dom = parse(xml)
Returns a UnicodeEncodeError. The XML file is in UTF-8 without BOM format and has
<?xml version="1.0" encoding="utf-8"?>
in the first line.
If I first read the file, .encode("utf-8") it and pass it to parseString, it works. Is there a way to parse an UTF-8 XML file directly with minidom.parse?
Leave the decoding to the XML parser; it'll detect what codec to use. Open the file in binary mode so that nothing is decoded to Unicode before the parser sees it:
with open(fname, 'rb') as xml:
    dom = parse(xml)
Note the use of the standard function open() instead of codecs.open().
This applies to any XML parser; it is the job of the parser to determine from the XML prologue what codec to use for parsing the document. If no prologue is present then UTF-8 is the default.
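A self-contained sketch of the same idea, using an in-memory byte stream in place of a real file (the sample document is invented):

```python
import io
from xml.dom.minidom import parse

# Hypothetical UTF-8 document, held as bytes just as a file opened
# in binary mode would deliver it.
raw = '<?xml version="1.0" encoding="utf-8"?><doc>caf\u00e9</doc>'.encode("utf-8")

# minidom receives undecoded bytes and picks the codec from the prologue.
with io.BytesIO(raw) as xml:
    dom = parse(xml)

text = dom.documentElement.firstChild.data
```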
I'm writing some XML with element tree.
I'm giving the code an empty template file that starts with the XML declaration: <?xml version= "1.0"?>. When ET has finished making its changes and writes the completed XML, it strips out the declaration and starts with the root tag. How can I stop this?
Write call:
ET.ElementTree(root).write(noteFile)
According to the documentation:
write(file, encoding="us-ascii", xml_declaration=None, method="xml")
Writes the element tree to a file, as XML. file is a file name, or a file object opened for writing. encoding is the output encoding (default is US-ASCII). xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None). method is either "xml", "html" or "text" (default is "xml"). Returns an encoded string.
So, write(noteFile) is explicitly telling it to write an XML declaration only if the encoding is not US-ASCII or UTF-8, and that the encoding is US-ASCII; therefore, you get no declaration.
I'm guessing if you didn't read this much, your next question is going to be "why is my Unicode broken", so let's fix both at once:
ET.ElementTree(root).write(noteFile, encoding="utf-8", xml_declaration=True)
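To see what that call produces without touching the disk, here is a quick sketch that writes into an in-memory buffer. The "note" element and its text are placeholders, and this assumes a Python 3 ElementTree that accepts xml_declaration.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("note")  # placeholder root element
root.text = "caf\u00e9"    # non-ASCII text to show the encoding at work

buf = io.BytesIO()  # stands in for noteFile
ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
output = buf.getvalue()
```

The buffer now starts with the declaration, followed by the root element with its text encoded as UTF-8.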
There are different versions of ElementTree.
Some of them accept the xml_declaration argument, some do not.
The one I happen to have does not. It emits the declaration if and only if encoding != 'utf-8'. So, to get the declaration, I call write(filename, encoding='UTF-8').