XML parser, recover=True?

I'm trying to parse some XML, however I get an error message.
After looking around a little I suspect it is due to some kind of special character in the source text and a (recover=True) should be placed in my parser line.
However I do not know the exact location for this.
Could someone have a look?
import xml.etree.ElementTree as ET
for name in newlist:
    tree = ET.parse(loc + name)
    root = tree.getroot()
    for post in root.findall('post'):
        text = post.text
        text = text.strip()
        posts.append(text)
The error I get is:
ParseError: not well-formed (invalid token): line 103, column 225

I'm not familiar with Python, but I've had issues like this using C#. It might be because the XML isn't well formed. Normally the first line of the XML file will contain something like
<?xml version="1.0" encoding="UTF-8" ?>
The version and encoding are important because they tell the parser how to decode the file. UTF-8 is the default, but if the file was actually saved in another encoding, its non-ASCII characters can make the parser fail. Changing the declared encoding so it matches how the file was saved (UTF-16, for example) sometimes fixes this.
Good luck
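To find out which character trips the parser, it can help to catch the exception and read its position. The following is only a sketch: find_parse_error and the sample string are made up for illustration, and the control character \x01 stands in for whatever the real file contains at line 103, column 225.

```python
import xml.etree.ElementTree as ET


def find_parse_error(xml_text):
    """Return the (line, column) of the first parse error, or None if the text parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        return err.position
    return None


# Hypothetical input: \x01 is not a legal character in XML 1.0.
bad = "<posts><post>ok</post><post>bad \x01 char</post></posts>"
position = find_parse_error(bad)
print(position)
```

With the position in hand you can open the offending file and inspect the exact byte before deciding whether an encoding fix or a recovering parser is the better route.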

Related

Remove "encoding" attribute from XML in Python

I am using python to do some conditional changes to an XML document. The incoming document has <?xml version="1.0" ?> at the top.
I'm using xml.etree.ElementTree.
How I'm serializing the changed XML:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
The output has this at the top:
<?xml version='1.0' encoding='utf8'?>
The client wants the "encoding" tag removed but if I remove it then it either doesn't include the line at all or it puts in encoding= 'us-ascii'
Can this be done so the output matches: <?xml version="1.0" ?>?
(I don't know why it matters honestly but that's what I was told needed to happen)

As pointed out in this answer, there is no way to make ElementTree omit the encoding attribute. However, as @James suggested in a comment, it can be stripped from the resulting output like this:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
filter_update_body = filter_update_body.replace(b"encoding='utf8'", b"", 1)
The b prefixes are required because ET.tostring() will return a bytes object if encoding != "unicode". In turn, we need to call bytes.replace().
With encoding = "unicode" (note that this is the literal string "unicode"), it will return a regular str. In this case, the bs can be omitted. We use good old str.replace().
It's worth noting that the choice between bytes and str also affects how the XML will eventually be written to a file. A bytes object should be written in binary mode, a str in text mode.
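As a small self-contained sketch of the bytes versus str distinction, the snippet below serializes a made-up element both ways. The <filter> document is an assumption for illustration; only ET.tostring() and the replace() call come from the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical document standing in for the real one.
root = ET.fromstring("<filter><name>example</name></filter>")

# encoding="utf8" yields bytes that start with an XML declaration.
body_bytes = ET.tostring(root, encoding="utf8", method="xml")
# Drop only the first occurrence of the encoding attribute.
body_bytes = body_bytes.replace(b"encoding='utf8'", b"", 1)

# encoding="unicode" yields a regular str instead of bytes.
body_str = ET.tostring(root, encoding="unicode", method="xml")

print(type(body_bytes).__name__, type(body_str).__name__)
```

The bytes result would be written with open(path, "wb"), the str result with open(path, "w").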

parsing XML string with encoding block at the beginning in python with etree & LXML [duplicate]

I have to parse XML files that start as such:
xml_string = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<annotationStandOffs xmlns="http://www.tei-c.org/ns/1.0">
<standOff>
...
</standOff>
</annotationStandOffs>
'''
The following code will only fly if I eliminate the first line of the above shown string:
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True)
XML_tree = etree.XML(xml_string,parser=parser)
Otherwise I get the error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

As the error indicates, the encoding part of the XML declaration is meant to provide the necessary information about how to convert bytes (e.g. as read from a file) into string. It doesn't make sense when the XML already is a string.
Some XML parsers silently ignore this information when parsing from a string. Some throw an error.
So since you're pasting XML into a string literal in Python source code, it would only make sense to remove the declaration yourself while you're editing the Python file.
The other, not so smart option would be to use a byte string literal b'''...''', or to encode the string into a single-byte encoding at run-time with '''...'''.encode('windows-1252'). But this opens another can of worms. When your Python file encoding (e.g. UTF-8) clashes with the alleged XML encoding from your copypasted XML (e.g. UTF-16), you'll get more interesting errors.
Long story short, don't do that. Don't copypaste XML into Python source code without taking the XML declaration out. And don't try to "fix" it by run-time string encode() tomfoolery.
The opposite is also true. If you have bytes (e.g. read from a file in binary mode, or from a network socket) then give those bytes to the XML parser. Don't manually decode() them into string first.
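A minimal sketch of the "give bytes to the parser" advice, using the standard library's ElementTree rather than lxml so it runs without extra packages. The sample document and its ISO-8859-1 encoding are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Simulate bytes as they would arrive from a file opened in binary mode.
# The declaration says ISO-8859-1 and the bytes really are ISO-8859-1.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><note>café</note>'
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes itself.
root = ET.fromstring(xml_bytes)
print(root.text)
```

Because the parser does the decoding, the declared encoding and the actual bytes stay in agreement, which is exactly what a manual decode() would break.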

Python lxml: how to deal with encoding errors parsing xml strings?

I need help with parsing xml data. Here's the scenario:
I have xml files loaded as strings to a postgresql database.
I downloaded them to a text file for further analysis. Each line corresponds to an xml file.
The strings have different encodings. Some explicitly specify utf-8, other windows-1252. There might be others as well; some don't specify the encoding in the string.
I need to parse these strings for data. The best approach I've found is the following:
encoded_string = bytes(bytearray(xml_data, encoding='utf-8'))
root = etree.fromstring(encoded_string)
When it doesn't work, I get two types of error messages:
"Extra content at the end of the document, line 1, column x (<string>, line 1)"
# x varies with string; I think it corresponds to the last character in the line
Looking at the lines raising exceptions it looks like the Extra content error is raised by files with a windows-1252 encoding.
I need to be able to parse every string, ideally without having to alter them in any way after download. I've tried the following:
Apply 'windows-1252' as the encoding instead.
Reading the string as binary and then applying the encoding
Reading the string as binary and converting it directly with etree.fromstring
The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
What can I do? I need to be able to read these strings but can't figure out how to parse them. The xml strings with the windows encoding all start with <?xml version="1.0" encoding="windows-1252" ?>

Given that the table column is text, all the XML content is presented to Python as UTF-8. As a result, attempting to parse a conflicting XML encoding attribute will cause problems.
Maybe try stripping that attribute from the string.
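One way to strip the declaration is a small regular expression applied before parsing. This is a sketch under assumptions: strip_declaration and the sample row are hypothetical, and the standard library parser is used here so the snippet runs without lxml.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0" encoding="..." ?>.
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>")


def strip_declaration(xml_text):
    """Remove a leading XML declaration from an already decoded string."""
    return XML_DECLARATION.sub("", xml_text, count=1)


# Hypothetical row: already a str, yet still claiming windows-1252.
row = '<?xml version="1.0" encoding="windows-1252" ?><doc><name>José</name></doc>'
root = ET.fromstring(strip_declaration(row))
print(root.find("name").text)
```

The cleaned string no longer carries an encoding declaration, so it should also be acceptable input for a parser that rejects declared encodings on str input.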

I solved the problem by removing encoding information, newline literals and carriage return literals. Every string was parsed successfully if I opened the files returning errors in vim and ran the following three commands:
:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g
Then lxml parsed the strings without issue.
Update:
I have a better solution. The problem was \n and \r literals in UTF-8 encoded strings I was copying to text files. I just needed to remove these characters from the strings with regexp_replace like so:
select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;
Now I can run the following and read the data with lxml without further processing:
psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
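If the export has already been written and re-running the query is not convenient, the same cleanup could be done in Python. This is only a sketch: strip_escaped_newlines and the sample row are made up, and it assumes the file contains the two-character sequences backslash-n and backslash-r rather than real line breaks.

```python
def strip_escaped_newlines(xml_text):
    """Remove literal backslash-n and backslash-r sequences left over from the export."""
    return xml_text.replace("\\n", "").replace("\\r", "")


# Hypothetical exported row containing literal \n and \r sequences.
row = "<doc>\\n<item>1</item>\\r\\n</doc>"
cleaned = strip_escaped_newlines(row)
print(cleaned)
```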

Ordinal indicator causes problems for XML parser

I have a Python script which parses XML from a website, which means I can't touch the original XML. It looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<list>
<orderDate>09/06/2017</orderDate>
<orderObject>RC CAR</orderObject>
<orderName>2ª versione</orderName>
<orderShipped>true</orderShipped>
</list>
I run into a problem when the server answers with XML like the above. When "orderName" contains a number with an ordinal indicator, in this case "ª", I get the following:
xml.parsers.expat.ExpatError: not well-formed (invalid token):
On the Python side I'm using minidom as the parser, with this code:
xmldoc = minidom.parse(order_data)
I should add that when the XML does not contain an ordinal indicator, everything works perfectly. Thanks to whoever can help me.

Remove non Unicode characters from xml database with Python

So I have a 9000 line xml database, saved as a txt, which I want to load in python, so I can do some formatting and remove unnecessary tags (I only need some of the tags, but there is a lot of unnecessary information) to make it readable. However, I am getting a UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 608814: character maps to <undefined>, which I assume means that the program ran into a non-Unicode character. I am quite positive that these characters are not important to the program (the data I am looking for is all plain text, with no special symbols), so how can I remove all of these from the txt file, when I can't read the file without getting the UnicodeDecodeError?

One crude workaround is to decode the bytes from the file yourself and specify the error handling. E.g.:
for line in somefile:  # somefile must be opened in binary mode ('rb')
    uline = line.decode('ascii', errors='ignore')
That will turn the line into a Unicode object in which any non-ascii bytes have been dropped. This is not a generally recommended approach - ideally you'd want to process XML with a proper parser, or at least know your file's encoding and open it appropriately (the exact details depend on your Python version). But if you're entirely certain you only care about ascii characters this is a simple fallback.
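A runnable sketch of that fallback, with an in-memory io.BytesIO object standing in for a file opened in binary mode. The sample bytes are made up; 0x8d is the byte from the error message in the question.

```python
import io

# Stand-in for open(path, "rb"); the payload contains the troublesome byte 0x8d.
raw = b"<item>plain text \x8d more</item>\n"
somefile = io.BytesIO(raw)

# Decode each line as ASCII and silently drop every byte outside that range.
lines = [line.decode("ascii", errors="ignore") for line in somefile]
print(lines[0])
```

Note that this discards data without warning, so it only suits cases where the dropped bytes really do not matter.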

The error suggests that you're using open() function without specifying an explicit character encoding. locale.getpreferredencoding(False) is used in this case (e.g., cp1252). The error says that it is not an appropriate encoding for the input.
An XML document may contain a declaration at the very beginning that specifies the encoding explicitly. Otherwise the encoding is defined by the BOM, or it is UTF-8. If your copy-pasting and saving the file hasn't messed up the encoding and you don't see a line such as <?xml version="1.0" encoding="iso-8859-1" ?>, then open the file using utf-8:
with open('input-xml-like.txt', encoding='utf-8', errors='ignore') as file:
    ...
If the input is an actual XML then just pass it to an XML parser instead:
import xml.etree.ElementTree as etree
tree = etree.parse('input.xml')
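The two steps can be tried end to end on a throwaway file. This is a sketch under assumptions: the temporary file and its contents are invented, with a stray 0x8d byte playing the role of the undecodable character.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical file content with one byte that is not valid UTF-8.
raw = b"<db><entry>plain text\x8d</entry></db>"
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # An explicit encoding avoids the locale default; errors='ignore' drops bad bytes.
    with open(path, encoding="utf-8", errors="ignore") as file:
        cleaned = file.read()
finally:
    os.remove(path)

root = etree.fromstring(cleaned)
print(root.find("entry").text)
```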
