Remove non-Unicode characters from an XML database with Python

So I have a 9000-line XML database, saved as a txt, which I want to load in Python so I can do some formatting and remove unnecessary tags (I only need some of the tags, but there is a lot of unnecessary information) to make it readable. However, I am getting a UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 608814: character maps to <undefined>, which I assume means that the program ran into a non-Unicode character. I am quite positive that these characters are not important to the program (the data I am looking for is all plain text, with no special symbols), so how can I remove all of them from the txt file, when I can't read the file without getting the UnicodeDecodeError?

One crude workaround is to open the file in binary mode and decode the bytes yourself, specifying the error handling, e.g.:
with open('database.txt', 'rb') as somefile:  # your file; 'rb' so each line is bytes
    for line in somefile:
        uline = line.decode('ascii', errors='ignore')
That will turn each line into a Unicode object in which any non-ASCII bytes have been dropped. This is not a generally recommended approach - ideally you'd want to process XML with a proper parser, or at least know your file's encoding and open it appropriately (the exact details depend on your Python version). But if you're entirely certain you only care about ASCII characters, this is a simple fallback.

The error suggests that you're calling the open() function without specifying an explicit character encoding, so locale.getpreferredencoding(False) is used instead (e.g., cp1252 on Windows). The error says that it is not an appropriate encoding for the input.
An XML document may contain a declaration at the very beginning that specifies the encoding explicitly. Otherwise the encoding is defined by a BOM, or it is utf-8. If your copy-pasting and saving the file hasn't messed up the encoding and you don't see a line such as <?xml version="1.0" encoding="iso-8859-1" ?>, then open the file using utf-8:
with open('input-xml-like.txt', encoding='utf-8', errors='ignore') as file:
    ...
If the input is an actual XML then just pass it to an XML parser instead:
import xml.etree.ElementTree as etree
tree = etree.parse('input.xml')
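If the parser route works, extracting just the tags you need is straightforward. A minimal sketch, with made-up tag names (substitute the ones you actually care about):
import xml.etree.ElementTree as etree

# The parser honors the encoding declared in the XML itself, so no
# encoding argument is needed when parsing from a file.
tree = etree.parse('input.xml')

# 'record' and 'name' are hypothetical tag names for illustration.
for record in tree.getroot().iter('record'):
    print(record.findtext('name'))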

Related

Python, writing an XML file - 'charmap' codec can't encode character. When including encoding to fix, get must be str, not bytes

I have a program that is creating an XML file and writing it to the system.
My issue occurs when I encounter characters like '\u1d52'; the system throws the error:
"'charmap' codec can't encode character."
tree = ET.ElementTree(root)
tree.write("Cool_Output.xml", 'w'), encoding='unicode')
Most of the solutions online seem to suggest that simply adding some encoding will fix the issue; however, when I try to add encoding I instead get the following error message:
"write() argument must be str, not bytes"
tree = ET.ElementTree(root)
tree.write("Cool_Output.xml", 'w', encoding="utf-8"))
I'm experienced in javascript and C#, but fairly new to python.
Is there something simple I am missing?
'\u1d52' is MODIFIER LETTER SMALL O (a superscript 'o'), probably cut-and-pasted from Windows. The iso-8859-1 encoding should still work: the character has no mapping in it, but ElementTree writes any character the output encoding can't represent as a numeric character reference.
Try:
tree = ET.ElementTree(root)
tree.write("Cool_Output.xml", encoding='iso-8859-1')
The ElementTree.write() function uses the following parameters:

*file_or_filename* -- file name or a file object opened for writing

*encoding* -- the output encoding (default: US-ASCII)

*xml_declaration* -- bool indicating if an XML declaration should be
                     added to the output. If None, an XML declaration
                     is added if encoding IS NOT either of:
                     US-ASCII, UTF-8, or Unicode

*default_namespace* -- sets the default XML namespace (for "xmlns")

*method* -- either "xml" (default), "html", "text", or "c14n"

*short_empty_elements* -- controls the formatting of elements
                          that contain no content. If True (default)
                          they are emitted as a single self-closed
                          tag, otherwise they are emitted as a pair
                          of start/end tags

python codecs can't encode to cp1252...but notepad++ can?

I have a very simple piece of code that's converting a CSV. Also note that I reference Notepad++ a few times, but my standard IDE is VS Code.
import codecs

with codecs.open(filePath, "r", encoding="UTF-8") as sourcefile:
    lines = sourcefile.read()

with codecs.open(filePath, 'w', encoding='cp1252') as targetfile:
    targetfile.write(lines)
Now the job I'm doing requires a specific file be encoded to windows-1252, and from what I understand cp1252 = windows-1252. This conversion works fine when I do it using the UI features in Notepad++, but when I try using Python codecs to encode this file it fails:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 561488: character maps to <undefined>
When I saw this failure I was confused, so I double-checked the output from when I manually convert the file using Notepad++, and the converted file is encoded in windows-1252... so what gives? Why is a UI feature in Notepad++ able to do the job when codecs seems not to be able to? Does Notepad++ just ignore errors?
Looks like your input text has the character "�" (the actual "replacement character", not some other undefined character), which cannot be mapped to cp1252 (cp1252 has no such character).
Depending on what you need, you can:
Filter it out (or replace it, or otherwise handle it) in Python before writing out lines to the output file.
Pass errors=... to the second codecs.open, choosing one of the other error-handling modes; the default is 'strict', and you can also use 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' or 'namereplace' (see the sketch after this list).
Check the input file and see why it's got the "�" character; is it corrupted?
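For the second option, a minimal sketch reusing filePath from the question; with errors="replace", unmappable characters come out as '?' instead of raising:
import codecs

with codecs.open(filePath, "r", encoding="UTF-8") as sourcefile:
    lines = sourcefile.read()

# errors="replace" substitutes '?' for anything cp1252 can't encode
with codecs.open(filePath, "w", encoding="cp1252", errors="replace") as targetfile:
    targetfile.write(lines)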
Probably Python is simply more explicit in its error handling. If Notepad++ managed to represent every character correctly in CP-1252 then there is a bug in the Python codec where it should not fail where it currently does; but I'm guessing Notepad++ is silently replacing some characters with some other characters, and falsely claiming success.
Maybe try converting the result back to UTF-8 and compare the files byte by byte if the data is not easy to inspect manually.
Unicode U+FFFD is a reserved character which serves as a placeholder for a character which cannot be represented in Unicode; often it's an indication of a conversion problem, when presumably this data was imperfectly input or converted at an earlier point in time.
(And yes, Windows-1252 is another name for Windows code page 1252.)
Why notepad++ "succeeds"
Notepad++ does not offer to convert your file to cp1252, but to reinterpret it using this encoding. What led to your confusion is that its Encoding menu actually uses the wrong term for this.
When "Encode with cp1252" is selected, Notepad decodes the file using cp1252 and shows you the result. If you save the character '\ufffd' to a file using utf8:
with open('f.txt', 'w', encoding='utf8') as f:
    f.write('\ufffd')
and use "Encode with cp1252" you'd see three characters:
That means that Notepad++ does not read the character in utf8 and then writes it in cp1252, because then you'd see exactly one character. You could achieve similar results to Notepad++ by reading the file using cp1252:
with open('f.txt', 'r', encoding='cp1252') as f:
    print(f.read())  # prints 'ï¿½'
Notepad++ lets you actually convert to only five encodings, via the separate "Convert to ..." entries in the same menu.
What should you do
This character does not exist in the cp1252 encoding, which means you can't convert this file without losing information. Common solutions are to skip such characters or replace them with other similar characters that exist in your encoding (see encoding error handlers).
You are dealing with the "utf-8-sig" encoding -- please specify this one as the encoding argument instead of "utf-8".
There is information on it in the docs (search the page for "utf-8-sig").
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. [...]
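A minimal sketch of that change, reusing filePath from the question (the built-in open works just as well as codecs.open here):
with open(filePath, "r", encoding="utf-8-sig") as sourcefile:
    # "utf-8-sig" strips a leading BOM if present and otherwise
    # behaves exactly like "utf-8".
    lines = sourcefile.read()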

Python lxml: how to deal with encoding errors parsing xml strings?

I need help with parsing xml data. Here's the scenario:
I have xml files loaded as strings to a postgresql database.
I downloaded them to a text file for further analysis. Each line corresponds to an xml file.
The strings have different encodings. Some explicitly specify utf-8, others windows-1252. There might be others as well; some don't specify the encoding in the string.
I need to parse these strings for data. The best approach I've found is the following:
encoded_string = bytes(bytearray(xml_data, encoding='utf-8'))
root = etree.fromstring(encoded_string)
When it doesn't work, I get two types of error messages:
"Extra content at the end of the document, line 1, column x (<string>, line 1)"
# x varies with string; I think it corresponds to the last character in the line
Looking at the lines raising exceptions it looks like the Extra content error is raised by files with a windows-1252 encoding.
I need to be able to parse every string, ideally without having to alter them in any way after download. I've tried the following:
Apply 'windows-1252' as the encoding instead.
Reading the string as binary and then applying the encoding
Reading the string as binary and converting it directly with etree.fromstring
The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
What can I do? I need to be able to read these strings but can't figure out how to parse them. The xml strings with the windows encoding all start with <?xml version="1.0" encoding="windows-1252" ?>
Given that the table column is text, all the XML content is being presented to Python in UTF-8; as a result, attempting to parse a conflicting XML encoding attribute will cause problems.
Maybe try stripping that attribute from the string.
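A rough sketch of that idea (the helper name and regex are my own illustration, not from the answer):
import re
from lxml import etree

def parse_xml_string(xml_data):
    # Drop a leading <?xml ... ?> declaration so lxml will accept the
    # str as-is, whatever encoding the declaration claims.
    xml_data = re.sub(r'^\s*<\?xml[^>]*\?>', '', xml_data)
    return etree.fromstring(xml_data)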
I solved the problem by removing encoding information, newline literals and carriage return literals. Every string was parsed successfully if I opened the files returning errors in vim and ran the following three commands:
:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g
Then lxml parsed the strings without issue.
Update:
I have a better solution. The problem was \n and \r literals in UTF-8 encoded strings I was copying to text files. I just needed to remove these characters from the strings with regexp_replace like so:
select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;
now I can run the following and read the data with lxml without further processing:
psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt

Remove all characters which cannot be decoded in Python

I'm trying to parse an HTML file with a Python script using the xml.etree.ElementTree module. The charset should be UTF-8 according to the header, but there is a strange character in the file, so the parser can't parse it. I opened the file in Notepad++ to look at the offending character. I tried opening the file with several encodings but I can't find the correct one.
As I have many files to parse, I would like to know how to remove all bytes which can't be decoded. Is there a solution?

I would like to know how to remove all bytes which can't be decoded. Is there a solution?
This is simple:
with open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...
The errors='ignore' tells Python to drop unrecognized characters. It can also be passed to bytes.decode() and most other places which take an encoding argument.
Since this decodes the bytes into unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open in 'rb' mode.
In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into unicode strings after reading them, but this is more error-prone in my opinion.
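A sketch of that round trip (Python 3; the file names are placeholders):
import shutil

# Decode with errors='ignore' to drop undecodable bytes, write the
# cleaned text back out, then hand the new file to the XML parser.
with open('page.html', 'r', encoding='utf8', errors='ignore') as src, \
        open('page-clean.html', 'w', encoding='utf8') as dst:
    shutil.copyfileobj(src, dst)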
But it turns out OP doesn't have invalid UTF-8. OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out since you have to run them through a function like this, meaning you can't just use copyfileobj():
import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
Cc is the Unicode category for "Other, control", as described on the Unicode website. To include a slightly broader array of "bad characters," we could strip the entire "other" category (which mostly contains useless stuff anyway):
def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
This will filter out line breaks, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
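Something like this, say (the file names are placeholders, and strip_control_chars() is the function defined above):
with open('page.html', encoding='utf8') as src, \
        open('page-clean.html', 'w', encoding='utf8') as dst:
    for line in src:
        # stripping category Cc removes the '\n' itself, so add it back
        dst.write(strip_control_chars(line) + '\n')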
In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.

Reading UTF-8 XML and writing it to a file with Python

I'm trying to parse UTF-8 XML file and save some parts of it to another file. Problem is, that this is my first Python script ever and I'm totally confused about the character encoding problems I'm finding.
My script fails immediately when it tries to write a non-ascii character to a file, but it can print it to the command prompt (at least at some level).
Here's the XML (from the parts that matter at least, it's a *.resx file which contains UI strings)
<?xml version="1.0" encoding="utf-8"?>
<root>
<resheader name="foo">
<value>bar</value>
</resheader>
<data name="lorem" xml:space="preserve">
<value>ipsum öä</value>
</data>
</root>
And here's my python script
from xml.dom.minidom import parse

names = []
values = []

def getStrings(path):
    dom = parse(path)
    data = dom.getElementsByTagName("data")
    for i in range(len(data)):
        name = data[i].getAttribute("name")
        names.append(name)
        value = data[i].getElementsByTagName("value")
        values.append(value[0].firstChild.nodeValue.encode("utf-8"))

def writeToFile():
    with open("uiStrings-fi.py", "w") as f:
        for i in range(len(names)):
            line = names[i] + '="' + values[i] + '"'  #varName='varValue'
            f.write(line)
            f.write("\n")

getStrings("ResourceFile.fi-FI.resx")
writeToFile()
And here's the traceback:
Traceback (most recent call last):
  File "GenerateLanguageFiles.py", line 24, in <module>
    writeToFile()
  File "GenerateLanguageFiles.py", line 19, in writeToFile
    line = names[i] + '="' + values[i] + '"'  #varName='varValue'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
How should I fix my script so it reads and writes UTF-8 characters properly? The files I'm trying to generate would be used in test automation with Robot Framework.
You'll need to remove the call to encode() - that is, replace nodeValue.encode("utf-8") with nodeValue - and then change the call to open() to
with open("uiStrings-fi.py", "w", "utf-8") as f:
This uses a "Unicode-aware" version of open() which you will need to import from the codecs module, so also add
from codecs import open
to the top of the file.
The issue is that when you were calling nodeValue.encode("utf-8"), you were converting a Unicode string (Python's internal representation that can store all Unicode characters) into a regular string (which can only store single-byte characters 0-255). Later on, when you construct the line to write to the output file, names[i] is still a Unicode string but values[i] is a regular string. Python tries to convert the regular string to Unicode, which is the more general type, but because you don't specify an explicit conversion, it uses the ASCII codec, which is the default, and ASCII can't handle characters with byte values greater than 127. Unfortunately, several of those do occur in the string values[i] because the UTF-8 encoding uses those upper-range bytes frequently. So Python complains that it sees a character it can't handle. The solution, as I said above, is to defer the conversion from Unicode to bytes until the last possible moment, and you do that by using the Unicode-aware version of open (which will handle the encoding for you).
Now that I think about it, instead of what I said above, an alternate solution would be to replace names[i] with names[i].encode("utf-8"). That way, you convert names[i] into a regular string as well, and Python has no reason to try to convert values[i] back to Unicode. Although, one could make the argument that it's good practice to keep your strings as Unicode objects until you write them out to the file... if nothing else, I believe unicode becomes the default in Python 3.
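For reference, a minimal sketch of the first approach end to end (Python 2, as in the question; only the changed function shown):
from codecs import open  # Unicode-aware open()

def writeToFile():
    # names[i] and values[i] are both unicode objects at this point;
    # codecs.open encodes them to UTF-8 only when they are written.
    with open("uiStrings-fi.py", "w", "utf-8") as f:
        for i in range(len(names)):
            f.write(names[i] + u'="' + values[i] + u'"' + u"\n")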
The XML parser decodes the UTF-8 encoding of the input when it reads the file and all the text nodes and attributes of the resulting DOM are then unicode objects. When you select the interesting data from the DOM, you re-encode the values as UTF-8, but you don't encode the names. The resulting values array contains encoded byte strings while the names array still contains unicode objects.
In the line where the encoding error is thrown, Python tries to concatenate such a unicode name and a byte string value. To do so, both values have to be of the same type and Python tries to convert the byte string values[i] to unicode, but it doesn't know that it's UTF-8 encoded and fails when it tries to use the ASCII codec.
The easiest way to work around this would be to keep all the strings as Unicode objects and just encode them to UTF-8 when they are written to the file:
values.append(value[0].firstChild.nodeValue)  # do not encode yet
...
f.write(line.encode('utf-8'))  # but do encode now
