How to decode a file in Python 3.x?

For my project I need to parse an XML file. For this I use lxml. The file I need to parse is encoded in cp1251, but of course, to parse it with lxml I need to decode it into UTF-8, and I don't know how to do it. I tried to search for something about this, but all the solutions were for Python 2.7 or didn't work.
If I try to write something like
inp = open("business.xml", "r", encoding='cp1251').decode('utf-8')
or
inp.decode('utf-8')
I get
builtins.AttributeError: '_io.TextIOWrapper' object has no attribute 'decode'
I have Python 3.2.
Any help is welcome,
thank you.

open() decodes the file for you. You are already receiving Unicode data.
For lxml you need to open the file in binary mode, and let the XML parser deal with encoding. Do not do this yourself.
with open("business.xml", "rb") as inp:
tree = etree.parse(inp)
XML files include a header to indicate what encoding they use, and the parser adjusts to that. If the header is missing, the parser can safely assume UTF-8.
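A minimal sketch of the whole round trip (the element names are whatever your file contains; only business.xml comes from the question) shows that the parsed data is already Unicode:
from lxml import etree

with open("business.xml", "rb") as inp:
    tree = etree.parse(inp)

# Everything read back from the tree is already str (Unicode) in Python 3,
# regardless of the cp1251 encoding declared in the file.
root = tree.getroot()
for element in root.iter():
    if element.text:
        print(type(element.text), element.text)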

Related

BeautifulSoup Unable to Parse Unexpected Encodings

Apologies in advance if this post is not well written, as I'm extremely new to Python. I have a pretty simple/stupid problem with Python 3 and BeautifulSoup. I'm attempting to parse a CSV file in Python without knowing what encoding each line will be in, as each line contains raw data from several sources. Before I can even parse the file, I'm using BeautifulSoup in an attempt to clean it up (I'm not sure this is a good idea):
from bs4 import BeautifulSoup

def main():
    try:
        soup = BeautifulSoup(open('files/sdk_breakout_1027.csv'))
    except Exception as e:
        print(str(e))
When I run this, however, I encounter the following error:
'ascii' codec can't decode byte 0xed in position 287: ordinal not in range(128)
My traceback points to this line in the CSV as the source of the problem:
500i(í£ : Android OS : 4.0.4
What is a better way to go about this? I just want to convert all rows in this CSV to a uniform encoding so I can parse it later.
Thanks for your help.
Guessing the encoding of random data will never be perfect, but if you know something about your data source, you may be able to make a reasonable guess.
Alternatively, you can open as UTF-8 and either ignore or replace errors:
import csv

with open("filename", encoding="utf8", errors="replace") as f:
    for row in csv.reader(f):
        print(", ".join(row))
You can't parse a CSV file with BeautifulSoup, only HTML or XML.
If you want to use the charset guessing from BeautifulSoup on its own, you can. See the Unicode, Dammit section of the docs. If you have a complete list of all of the encodings that might have been used, but just don't know which one in that list was actually used, pass that list to Dammit.
There's a different charset-guessing library known as chardet that you also might want to try. (Note that Dammit will use chardet if you have it installed, so you might not need to try it separately.)
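If you want to try chardet on its own, a minimal sketch (reusing the question's filename) would be something like:
import chardet

# Read the raw bytes and let chardet guess the encoding of the whole file.
with open('files/sdk_breakout_1027.csv', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')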
But both of these just make educated guesses; the documentation explains all the myriad ways they can fail.
Also, if each line is encoded differently (which is an even bigger mess than usual), you will have to Dammit or chardet each line as if it were a separate file. With much less text to work with, the guessing is going to be much less accurate, but there's nothing you can do about that if each line really is potentially in a different encoding.
Putting it all together, it would look something like this:
import csv
from bs4 import UnicodeDammit

encodings = 'utf-8', 'latin-1', 'cp1252', 'shift-jis'

def dammitize(f):
    for line in f:
        yield UnicodeDammit(line, encodings).unicode_markup

with open('foo.csv', 'rb') as f:
    for row in csv.reader(dammitize(f)):
        do_something_with(row)

from input() reading and converting

I have a dictionary stored as a UTF-8 file. I read a word from the command line and look it up in the dictionary keys, but my file contains Turkish and Arabic characters.
word = 'şüyûh'
mydictionary[word]
My program raises a KeyError for the word 'şüyûh'. How can I fix it?
Handle everything as unicode.
See "Unicode in Python, Completely Demystified".
If you're reading from a file, you'll need to tell Python how to interpret the bytes in the file (files can only contain bytes) as the characters you expect.
The most basic way of doing so is to open the file using codecs.open instead of the built-in open function. When you pull data out of the file this way, it will already be decoded:
import codecs

with codecs.open("something.txt", encoding="utf-8") as myfile:
    pass  # do something with the file
Note that you must tell python what encoding the file is in.
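Applied to the question, a sketch might look like this (assuming Python 2, where raw_input() returns bytes, and a hypothetical dictionary file named dictionary.txt with one tab-separated key/value pair per line):
import codecs
import sys

# Build the dictionary from a UTF-8 file; every key is a unicode string.
mydictionary = {}
with codecs.open("dictionary.txt", encoding="utf-8") as f:
    for line in f:
        key, _, value = line.rstrip("\n").partition("\t")
        mydictionary[key] = value

# Decode the word typed on the command line with the terminal's encoding,
# so it compares equal to the unicode keys read from the file.
word = raw_input("word: ").decode(sys.stdin.encoding or "utf-8")
print(mydictionary.get(word, u"not found"))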

How to parse unicode strings with minidom?

I'm trying to parse a bunch of XML files with the xml.dom.minidom library to extract some data and put it in a text file. Most of the XML files parse fine, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ASCII characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?
Try to decode it:
>>> print u'abcdé'.encode('utf-8')
abcdé
>>> print u'abcdé'.encode('utf-8').decode('utf-8')
abcdé
In case your string is 'str':
xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))
This worked for me.
Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
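For example (the filename here is just a placeholder):
from xml.dom import minidom

# Option 1: let minidom read the file itself.
doc = minidom.parse("somefile.xml")

# Option 2: read the raw bytes yourself and hand them to parseString().
with open("somefile.xml", "rb") as f:
    doc = minidom.parseString(f.read())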
I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only affects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")
This will wrap the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
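An equivalent, arguably more readable spelling of that wrapper uses codecs.getwriter, which returns the same StreamWriter class as codecs.lookup("utf-8")[3]. A sketch (the sample document here is just a hypothetical stand-in):
import codecs
import tempfile
from xml.dom import minidom

dom = minidom.parseString("<root>caf\xc3\xa9</root>")  # hypothetical document with non-ASCII text

with tempfile.NamedTemporaryFile(delete=False) as fh:
    writer = codecs.getwriter("utf-8")(fh)   # same object as codecs.lookup("utf-8")[3](fh)
    dom.writexml(writer, encoding="utf-8")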
I encounter this error a few times, and my hacky way of dealing with it is just to do this:
def getCleanString(word):
    clean = ""
    for character in word:
        try:
            clean = clean + str(character)
        except UnicodeEncodeError:
            pass  # this happens if character is non-ASCII unicode
    return clean
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.

lxml Changing Unicode Characters

I am using lxml to read through an xml file and change a few details. However, when running it I find that even if I just use lxml to read the file and then write it out again, as below:
from lxml import etree

fil = 'iTunes Music Library.XML'
tre = etree.parse(fil)
tre.write('temp.xml')
I find Queensrÿche converted to Queensr&#255;che. Anyone know how to fix this?
Change your last line to:
tre.write('temp.xml', encoding='utf-8')
Otherwise lxml writes the XML in ASCII encoding, so it has to escape all non-ASCII characters.
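For example, the full round trip with an explicit encoding keeps the characters intact (xml_declaration=True is optional, added here only to keep an XML header on the output):
from lxml import etree

tree = etree.parse('iTunes Music Library.XML')
tree.write('temp.xml', encoding='utf-8', xml_declaration=True)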

Open a file in the proper encoding automatically [duplicate]

I'm dealing with some encoding problems in a few files. We receive files from another company and have to read them (the files are in CSV format).
Strangely, the files appear to be encoded in UTF-16. I am managing to read them, but I have to open them using the codecs module and specify the encoding, like this:
import codecs
import csv

ENCODING = 'utf-16'

with codecs.open(test_file, encoding=ENCODING) as csv_file:
    # Autodetect dialect
    dialect = csv.Sniffer().sniff(csv_file.read(1024))
    csv_file.seek(0)
    input_file = csv.reader(csv_file, dialect=dialect)
    for line in input_file:
        do_funny_things()
But, just as I am able to detect the dialect in an agnostic way, it would be great to have a way of automatically opening files with their proper encoding, at least for text files. Other programs, like vim, achieve that.
Does anyone know a way of doing that in Python 2.6?
PS: I hope that this will be solved in Python 3, as all the strings are Unicode...
chardet can help you.
Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source.
It won't be "fixed" in Python 3, as it's not a fixable problem. Many documents are valid in several encodings, so the only way to determine the proper encoding is to know something about the document. Fortunately, in most cases we do know something about the document: for instance, most characters will come clustered into distinct Unicode blocks. A document in English will mostly contain characters within the first 128 codepoints; a document in Russian will contain mostly Cyrillic codepoints. Most documents will contain spaces and newlines. These clues can be used to help you make educated guesses about which encodings are being used. Better yet, use a library written by someone who has already done the work (like chardet, mentioned in another answer by Desintegr).
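A rough sketch of that approach for this question (guess the encoding from the raw bytes first, then reopen with the guess; test_file and do_funny_things are reused from the question's own code):
import codecs
import chardet

# Guess the encoding from a sample of the raw bytes, then open the file with it.
with open(test_file, 'rb') as f:
    raw = f.read(4096)

guess = chardet.detect(raw)   # e.g. {'encoding': 'UTF-16', 'confidence': 1.0}
with codecs.open(test_file, encoding=guess['encoding']) as csv_file:
    for line in csv_file:
        do_funny_things()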
csv.reader cannot handle Unicode strings in 2.x. See the bottom of the csv documentation and this question for ways to handle it.
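The workaround from the bottom of the Python 2 csv documentation is essentially to feed the reader UTF-8 encoded byte strings and decode each cell back afterwards, roughly:
import csv

def unicode_csv_reader(unicode_lines, **kwargs):
    # csv.reader in Python 2 only accepts byte strings, so encode to UTF-8 first...
    utf8_lines = (line.encode('utf-8') for line in unicode_lines)
    for row in csv.reader(utf8_lines, **kwargs):
        # ...and decode each cell back to unicode afterwards.
        yield [cell.decode('utf-8') for cell in row]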
If it will be fixed in Python 3, it should also be fixed by using
from __future__ import unicode_literals
