I am using lxml to read through an XML file and change a few details. However, I find that even if I just use lxml to read the file and then write it out again, as below:
from lxml import etree

fil = 'iTunes Music Library.XML'
tre = etree.parse(fil)
tre.write('temp.xml')
I find Queensrÿche converted to Queensr&#255;che. Anyone know how to fix this?
Change your last line to:
tre.write('temp.xml', encoding='utf-8')
Otherwise lxml serializes the XML as ASCII, so it has to escape every non-ASCII character as a numeric character reference.
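To see why the escaping happens, here is a small standalone sketch (not taken from the question) comparing the default serialization with an explicit UTF-8 one:

from lxml import etree

root = etree.fromstring('<artist>Queensrÿche</artist>')
# Default serialization is ASCII, so ÿ becomes a character reference,
# roughly: b'<artist>Queensr&#255;che</artist>'
print(etree.tostring(root))
# With an explicit encoding the character is written literally as UTF-8 bytes
print(etree.tostring(root, encoding='utf-8'))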
I need help with parsing xml data. Here's the scenario:
I have xml files loaded as strings to a postgresql database.
I downloaded them to a text file for further analysis. Each line corresponds to an xml file.
The strings have different encodings. Some explicitly specify utf-8, others windows-1252. There might be other encodings as well; some strings don't specify an encoding at all.
I need to parse these strings for data. The best approach I've found is the following:
from lxml import etree

# xml_data is one line (one XML document) read from the text file
encoded_string = bytes(bytearray(xml_data, encoding='utf-8'))
root = etree.fromstring(encoded_string)
When it doesn't work, I get two types of error messages:
"Extra content at the end of the document, line 1, column x (<string>, line 1)"
# x varies with string; I think it corresponds to the last character in the line
Looking at the lines that raise exceptions, it looks like the "Extra content" error is raised by files with a windows-1252 encoding.
I need to be able to parse every string, ideally without having to alter them in any way after download. I've tried the following:
Apply 'windows-1252' as the encoding instead.
Reading the string as binary and then applying the encoding
Reading the string as binary and converting it directly with etree.fromstring
The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
What can I do? I need to be able to read these strings but can't figure out how to parse them. The xml strings with the windows encoding all start with <?xml version="1.0" encoding="windows-1252" ?>
Given that the table column is text, all the XML content is being presented to Python as UTF-8 strings; as a result, attempting to parse a string whose XML declaration claims a conflicting encoding will cause problems.
Maybe try stripping that declaration from the string, as in the sketch below.
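A minimal sketch of that idea (the regex and the xml_data variable are mine, not from the answer):

import re
from lxml import etree

# xml_data is one exported line, already a Python (unicode) string
stripped = re.sub(r'^\s*<\?xml[^>]*\?>', '', xml_data)
root = etree.fromstring(stripped)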
I solved the problem by removing encoding information, newline literals and carriage return literals. Every string was parsed successfully if I opened the files returning errors in vim and ran the following three commands:
:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g
Then lxml parsed the strings without issue.
Update:
I have a better solution. The problem was \n and \r literals in UTF-8 encoded strings I was copying to text files. I just needed to remove these characters from the strings with regexp_replace like so:
select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;
Now I can run the following and read the data with lxml without further processing:
psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
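For completeness, a rough sketch of the reading side (the file name and loop are assumptions, not part of the answer): each line of output.txt should now be one complete XML document, and passing bytes to lxml avoids the "encoding declaration" ValueError. Note that if the declared encoding disagrees with the actual bytes, non-ASCII characters can still come out wrong, as discussed above.

from lxml import etree

with open('output.txt', 'rb') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # bytes input lets lxml read the declaration itself
        root = etree.fromstring(line)
        # ... work with root ...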
The HTML files I am dealing with are generally utf-8 but have some broken encodings and therefore can't be decoded to Unicode. My idea is to read them as binary and, as a first step, replace all the valid utf-8 sequences with HTML codes,
e.g. "\xc2\xa3" to "&pound;".
In a second step I would replace the broken encodings with proper ones.
I got stuck at the first step. Replacing a single character works with replace:
string.replace(b'\xc3\x84', b'&Auml;')
Taking the code mappings from a table doesn't work. When reading the table, the utf-8 codes come in escaped (as the literal characters b'\\xc3\\x84') and I can't find a way to get rid of the double backslashes.
I can think of some dirty ways of solving this problem, but there should be a clean one, shouldn't there?
The best way is either to pre-filter them:
iconv -t utf8 -c SRCFILE > NEWFILE
Or in Python:
with open("somefile_with_bad_utf8.txt", "rt", encoding="utf-8", errors="ignore") as myfile:
    for line in myfile:
        process(line)  # whatever per-line processing you need
I was going to say always use python3 for utf-8 but I see you are already.
Hope that helps....
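Coming back to the first step from the question (turning valid UTF-8 sequences into HTML character references), a hedged sketch of another route: decode first and let the codec machinery do the escaping. The file name is a placeholder and errors='replace' is only one possible policy for the broken sequences.

with open("page.html", "rb") as f:
    raw = f.read()

# Broken sequences become U+FFFD; everything valid decodes normally.
text = raw.decode("utf-8", errors="replace")
# Re-encoding to ASCII turns every non-ASCII character into a numeric
# reference, e.g. Ä -> b'&#196;' and £ -> b'&#163;'.
escaped = text.encode("ascii", errors="xmlcharrefreplace")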
Apologies in advance for this post if it's not well written as I'm extremely new to Python. Pretty simple/stupid problem I'm having with Python3 and BeautifulSoup. I'm attempting to parse a CSV file in Python without knowing what encoding each line will be in, as each line contains raw data from several sources. Before I can even parse the file, I'm using BeautifulSoup in an attempt to clean it up (I'm not sure if this is a good idea):
from bs4 import BeautifulSoup

def main():
    try:
        soup = BeautifulSoup(open('files/sdk_breakout_1027.csv'))
    except Exception as e:
        print(str(e))
When I run this, however, I encounter the following error:
'ascii' codec can't decode byte 0xed in position 287: ordinal not in range(128)
My traceback points to this line in the CSV as the source of the problem:
500i(í£ : Android OS : 4.0.4
What is a better way to go about this? I just want to convert all rows in this CSV to a uniform encoding so I can parse it later.
Thanks for your help.
Guessing the encoding of random data will never be perfect, but if you know something about your data source, you may be able to make a reasonable guess.
Alternatively, you can open as UTF-8 and either ignore or replace errors:
import csv

with open("filename", encoding="utf8", errors="replace") as f:
    for row in csv.reader(f):
        print(", ".join(row))
You can't parse a CSV file with BeautifulSoup, only HTML or XML.
If you want to use the charset guessing from BeautifulSoup on its own, you can. See the Unicode, Dammit section of the docs. If you have a complete list of all of the encodings that might have been used, but just don't know which one in that list was actually used, pass that list to Dammit.
There's a different charset-guessing library known as chardet that you also might want to try. (Note that Dammit will use chardet if you have it installed, so you might not need to try it separately.)
But both of these just make educated guesses; the documentation explains all the myriad ways they can fail.
Also, if each line is encoded differently (which is an even bigger mess than usual), you will have to Dammit or chardet each line as if it were a separate file. With much less text to work with, the guessing is going to be much less accurate, but there's nothing you can do about that if each line really is potentially in a different encoding.
Putting it all together, it would look something like this:
encodings = 'utf-8', 'latin-1', 'cp1252', 'shift-jis'
def dammitize(f):
for line in f:
yield UnicodeDammit(line, encodings).unicode_markup
with open('foo.csv', 'rb') as f:
for row in csv.reader(dammitize(f)):
do_something_with(row)
For my project I need to parse an XML file. For this I use lxml. The file I need to parse has a cp1251 encoding but, of course, to parse it with lxml I need to decode it to utf-8, and I don't know how to do that. I tried to search for something about this, but all the solutions were for Python 2.7 or didn't work.
If I try to write something like
inp = open("business.xml", "r", encoding='cp1251').decode('utf-8')
or
inp.decode('utf-8')
I get
builtins.AttributeError: '_io.TextIOWrapper' object has no attribute 'decode'
I have Python 3.2.
Any help is welcome,
thank you.
open() decodes the file for you. You are already receiving Unicode data.
For lxml you need to open the file in binary mode, and let the XML parser deal with encoding. Do not do this yourself.
with open("business.xml", "rb") as inp:
tree = etree.parse(inp)
XML files include a header to indicate what encoding they use, and the parser adjusts to that. If the header is missing, the parser can safely assume UTF-8.
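If you want to check which encoding the parser actually used, a small sketch (only the file name comes from the question) can read it back from the parsed tree:

from lxml import etree

with open("business.xml", "rb") as inp:
    tree = etree.parse(inp)
# Reports whatever the file's declaration names, e.g. 'windows-1251'
print(tree.docinfo.encoding)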
I have a dictionary stored as a utf-8 file, and I read a word from the command line and search for it in the dictionary keys. But my file contains Turkish and Arabic characters.
word = 'şüyûh'
mydictionary[word]
My program gives me a KeyError for the word 'şüyûh'. How can I fix it?
Handle everything as unicode.
"Unicode in Python, Completely Demystified"
If you're reading from a file, you'll need to tell python how to interpret the bytes in the file (files can only contain bytes) into the characters as you understand them.
The most basic way of doing so is to open the file using codecs.open instead of the built-in open function. When you pull data out of the file in this way, it will already be decoded:
import codecs

with codecs.open("something.txt", encoding="utf-8") as myfile:
    # do something with the file, e.g. read its already-decoded contents
    data = myfile.read()
Note that you must tell python what encoding the file is in.
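The other half of the problem is the word read from the command line: on Python 2, sys.argv entries are byte strings, so they have to be decoded before they can match the unicode keys loaded from the file. A rough sketch, assuming Python 2, a UTF-8 terminal, and a one-word-per-line dictionary file (the file name and layout are my assumptions):

# -*- coding: utf-8 -*-
import codecs
import sys

# Build the dictionary from a UTF-8 file, one word per line (assumed layout).
mydictionary = {}
with codecs.open("words.txt", encoding="utf-8") as f:
    for line in f:
        mydictionary[line.strip()] = True

# sys.argv items are bytes on Python 2; decode them with the terminal's
# encoding (falling back to utf-8) so the lookup compares unicode to unicode.
encoding = sys.stdin.encoding or "utf-8"
word = sys.argv[1].decode(encoding)
print(mydictionary.get(word, "not found"))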