I need help with parsing XML data. Here's the scenario:
I have XML files loaded as strings into a PostgreSQL database.
I downloaded them to a text file for further analysis. Each line corresponds to an XML file.
The strings have different encodings. Some explicitly specify UTF-8, others windows-1252. There may be others as well, and some don't specify an encoding in the string at all.
I need to parse these strings for data. The best approach I've found is the following:
from lxml import etree

encoded_string = bytes(bytearray(xml_data, encoding='utf-8'))
root = etree.fromstring(encoded_string)
When it doesn't work, I get error messages such as:
"Extra content at the end of the document, line 1, column x (<string>, line 1)"
# x varies with string; I think it corresponds to the last character in the line
Looking at the lines that raise exceptions, it appears the Extra content error is raised by strings with a windows-1252 encoding.
I need to be able to parse every string, ideally without having to alter the strings in any way after download. I've tried the following:
Applying 'windows-1252' as the encoding instead.
Reading the string as binary and then applying the encoding.
Reading the string as binary and converting it directly with etree.fromstring.
The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
What can I do? I need to be able to read these strings but can't figure out how to parse them. The XML strings with the Windows encoding all start with <?xml version="1.0" encoding="windows-1252" ?>
Given that the table column is text, all the XML content is being presented to Python as UTF-8 text; as a result, attempting to parse it against a conflicting XML encoding attribute will cause problems.
Maybe try stripping that attribute from the string.
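A minimal sketch of that idea, shown here as removing the whole <?xml ... ?> declaration rather than just the encoding attribute (parse_line is only an illustrative helper name):

import re
from lxml import etree

def parse_line(xml_data):
    # Drop any leading <?xml ... ?> declaration so lxml accepts str input.
    cleaned = re.sub(r'^\s*<\?xml[^>]*\?>', '', xml_data)
    return etree.fromstring(cleaned)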
I solved the problem by removing the encoding information, newline literals, and carriage return literals. Every string parsed successfully once I opened the files that returned errors in vim and ran the following three commands:
:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g
Then lxml parsed the strings without issue.
Update:
I have a better solution. The problem was \n and \r literals in the UTF-8 encoded strings I was copying to text files. I just needed to remove those characters from the strings with regexp_replace, like so:
select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;
Now I can run the following and read the data with lxml without further processing:
psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
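For reference, a rough sketch of how the exported file might then be consumed (assuming output.txt holds one XML document per line, as produced by the command above; reading it in binary mode lets lxml honour each document's own encoding declaration):

from lxml import etree

with open('output.txt', 'rb') as f:
    for line in f:
        root = etree.fromstring(line.strip())
        # ... extract the data you need from root here ...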
I have to parse XML files that start like this:
xml_string = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<annotationStandOffs xmlns="http://www.tei-c.org/ns/1.0">
<standOff>
...
</standOff>
</annotationStandOffs>
'''
The following code will only work if I eliminate the first line of the string shown above:
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True)
XML_tree = etree.XML(xml_string, parser=parser)
Otherwise I get the error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
As the error indicates, the encoding part of the XML declaration is meant to provide the necessary information about how to convert bytes (e.g. as read from a file) into string. It doesn't make sense when the XML already is a string.
Some XML parsers silently ignore this information when parsing from a string. Some throw an error.
So since you're pasting XML into a string literal in Python source code, it would only make sense to remove the declaration yourself while you're editing the Python file.
The other, not-so-smart option would be to use a byte string literal b'''...''', or to encode the string into a single-byte encoding at run-time with '''...'''.encode('windows-1252'). But this opens another can of worms. When your Python file encoding (e.g. UTF-8) clashes with the alleged XML encoding of your copypasted XML (e.g. UTF-16), you'll get more interesting errors.
Long story short, don't do that. Don't copypaste XML into Python source code without taking the XML declaration out. And don't try to "fix" it by run-time string encode() tomfoolery.
The opposite is also true. If you have bytes (e.g. read from a file in binary mode, or from a network socket) then give those bytes to the XML parser. Don't manually decode() them into string first.
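A small illustration of both directions (the file name is just a placeholder):

from lxml import etree

# Bytes in: the parser reads the declaration and does the decoding itself.
with open('input.xml', 'rb') as f:
    root = etree.fromstring(f.read())

# Str in: fine, but only when there is no encoding declaration.
root = etree.fromstring('<root><child/></root>')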
The HTML files I am dealing with are generally UTF-8 but contain some broken encodings and therefore can't be decoded to Unicode. My idea is to parse them as binary and, as a first step, replace all the proper UTF-8 sequences with HTML codes.
e.g. "\xc2\xa3" to &pound;
In a second step I would replace the broken encodings with proper ones.
I got stuck at the first step. Replacing a single character works with replace:
string.replace(b'\xc3\x84', b'&Auml;')
Taking the code mappings from a table doesn't work. When reading the table, the UTF-8 codes come in escaped (b'\\xc3\\x84') and I can't find a way to get rid of the double backslashes.
I can think of some dirty ways of solving this problem, but there should be a clean one, shouldn't there?
The best way is either to pre-filter them:
iconv -t utf8 -c SRCFILE > NEWFILE
Or in Python:
with open("somefile_with_bad_utf8.txt", "rt", encoding="utf-8", errors="ignore") as myfile:
    for line in myfile:
        process(line)  # undecodable bytes have already been dropped
I was going to say always use Python 3 for UTF-8, but I see you already are.
Hope that helps....
I am trying to parse an HTML file with a Python script using the xml.etree.ElementTree module. The charset should be UTF-8 according to the header, but there is a strange character in the file, so the parser can't parse it. I opened the file in Notepad++ to see the character. I tried opening it with several encodings but couldn't find the correct one.
As I have many files to parse, I would like to know how to remove all bytes which can't be decoded. Is there a solution?
I would like to know how to remove all bytes which can't be decoded. Is there a solution?
This is simple:
with open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...
The errors='ignore' tells Python to drop unrecognized characters. It can also be passed to bytes.decode() and most other places which take an encoding argument.
Since this decodes the bytes into unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open in 'rb' mode.
In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into unicode strings after reading them, but this is more error-prone in my opinion.
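A rough sketch of that write-back-and-reparse approach (the file names are placeholders, and this assumes the text that survives the lenient decode is still well-formed XML):

import shutil
import xml.etree.ElementTree as ET

# Decode leniently, dropping undecodable bytes, and write a clean copy to disk.
with open('broken.xml', 'r', encoding='utf8', errors='ignore') as src, \
        open('clean.xml', 'w', encoding='utf8') as dst:
    shutil.copyfileobj(src, dst)

# Re-open the cleaned file and let the parser consume it as bytes.
tree = ET.parse('clean.xml')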
But it turns out OP doesn't have invalid UTF-8. OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out, since you have to run the data through a function like the one below, meaning you can't just use copyfileobj():
import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
Cc is the Unicode category for "Other, control character," as described on the Unicode website. To include a slightly broader array of "bad characters," we could strip the entire "other" category (which mostly contains useless stuff anyway):
def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
This will filter out line breaks, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.
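As a rough usage sketch of that line-at-a-time idea (the file names are placeholders):

with open('input.xml', 'r', encoding='utf8') as src, \
        open('cleaned.xml', 'w', encoding='utf8') as dst:
    for line in src:
        # strip_control_chars() removes the newline as well (it is a control
        # character), so add it back afterwards.
        dst.write(strip_control_chars(line) + '\n')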
So I have a 9000-line XML database, saved as a txt file, which I want to load in Python so I can do some formatting and remove unnecessary tags (I only need some of the tags, but there is a lot of unnecessary information). However, I am getting UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 608814: character maps to <undefined>, which I assume means the program ran into a non-Unicode character. I am quite positive that these characters are not important to the program (the data I am looking for is all plain text, with no special symbols), so how can I remove them from the txt file when I can't read the file without getting the UnicodeDecodeError?
One crude workaround is to decode the bytes from the file yourself and specify the error handling, e.g.:
# somefile should be opened in binary mode ('rb') so that each line is bytes
for line in somefile:
    uline = line.decode('ascii', errors='ignore')
That will turn the line into a Unicode object in which any non-ascii bytes have been dropped. This is not a generally recommended approach - ideally you'd want to process XML with a proper parser, or at least know your file's encoding and open it appropriately (the exact details depend on your Python version). But if you're entirely certain you only care about ascii characters this is a simple fallback.
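If you would rather hand the messy bytes to a real parser, one possible alternative (assuming lxml is available; the file name is a placeholder) is its recovering parser:

from lxml import etree

# recover=True tells libxml2 to try to parse past errors instead of
# stopping at the first one.
parser = etree.XMLParser(recover=True)
with open('database.txt', 'rb') as f:
    root = etree.fromstring(f.read(), parser=parser)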
The error suggests that you're using the open() function without specifying an explicit character encoding. In that case locale.getpreferredencoding(False) is used (e.g., cp1252), and the error says that it is not an appropriate encoding for the input.
An XML document may contain a declaration at the very beginning that specifies the encoding explicitly. Otherwise the encoding is defined by a BOM, or it is UTF-8. If your copy-pasting and saving of the file hasn't messed up the encoding and you don't see a line such as <?xml version="1.0" encoding="iso-8859-1" ?>, then open the file using UTF-8:
with open('input-xml-like.txt', encoding='utf-8', errors='ignore') as file:
    ...
If the input is an actual XML document, then just pass it to an XML parser instead:
import xml.etree.ElementTree as etree
tree = etree.parse('input.xml')
I am trying to solve a "simple" problem in Python (2.7).
Suppose that I have two files:
key.txt - which has a key to search for.
content.txt - which has web content (an HTML file).
Both files are saved in UTF-8.
content.txt is a mixed file, which means it contains non-English characters (a web HTML file).
I am trying to check whether the key in the key.txt file is found in the content or not.
I tried comparing the files as binary (bytes), which didn't work; I also tried decoding, which didn't work.
I would also appreciate any help on how I can search with a regex that is mixed (my pattern is built from English and non-English characters).
You should let the Python interpreter know that you are using UTF-8 encoding by adding this statement at the beginning:
# encoding: utf-8
Then you can use u'yourString' to indicate that a string is a Unicode string.
Sample code:
import re

text = u'someString'
keyString = u'someKey'
f = re.findall(keyString, text)
You may need to use the encode('utf-8') method on these strings while performing some other operations on them.
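Putting it together for the two files (a rough Python 2 sketch; io.open and the file handling here are assumptions on my part, not part of the original answer):

# encoding: utf-8
import io
import re

with io.open('key.txt', encoding='utf-8') as f:
    key = f.read().strip()

with io.open('content.txt', encoding='utf-8') as f:
    content = f.read()

# Both values are unicode here, so a pattern mixing English and
# non-English characters can match without extra encoding steps.
matches = re.findall(key, content)
print(matches)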