Which encoding should Python's open function use?

I'm getting an exception when reading a file that contains a RIGHT DOUBLE QUOTATION MARK Unicode symbol. It is encoded in UTF-8 (0xE2 0x80 0x9D). The minimal example:
import sys
print(sys.getdefaultencoding())
f = open("input.txt", "r")
f.readline()
This script fails reading the first line even if the right quotation mark is not on the first line. The exception looks like this:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 102: character maps to <undefined>
The input file is UTF-8 encoded; I've tried both with and without a BOM. The default encoding returned by sys.getdefaultencoding() is utf-8.
This script fails on the machine with Python 3.6.5 but works well on another with Python 3.6.0. Both machines are Windows.
My questions are mostly theoretical, as this exception is thrown from external software that I cannot change, and it reads a file that I don't wish to change. What could be different between these machines other than the Python patch version? And why does vanilla open use cp1252 when sys.getdefaultencoding() reports utf-8?

As clearly stated in Python's open documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.
Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.
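For example, a minimal sketch (assuming the file is UTF-8, as stated in the question):
import locale

# What open() falls back to on this machine when no encoding is given
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on Western Windows

# Passing the file's encoding explicitly makes the locale default irrelevant
with open("input.txt", "r", encoding="utf-8") as f:
    print(f.readline())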

Related

Why did I get UnicodeDecodeError when I read a file which contains Chinese characters?

>>> path = 'name.txt'
>>> content = None
>>> with open(path, 'r') as file:
...     content = file.readlines()
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/mnt/lustre/share/miniconda3/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 163: ordinal not in range(128)
When I run this code to read a file which contains Chinese characters, I get an error. The file is saved using UTF-8. My Python version is 3.6.5, but it runs fine in Python 2.7.
open is using the ASCII codec to try to read the file. The easiest way to fix this is to specify the encoding:
with open(path, 'r', encoding='utf-8') as file:
Your locale should probably specify the preferred encoding as UTF-8, but I think it depends on OS and language settings.
Python 2.7 reads files into byte strings by default.
Python 3.x reads files into Unicode strings by default, so the bytes in the file must be decoded.
The default encoding used varies by operating system, but can be determined by calling locale.getpreferredencoding(False). This is often utf8 on Linux systems, but Windows systems return a localized ANSI encoding, e.g. cp1252 for US/Western European Windows versions.
In Python 3, specify the encoding you expect for files so as not to rely on a locale-specific default. For example:
with open(path,'r',encoding='utf8') as f:
...
You can do this in Python 2 as well, but use io.open(), which is compatible with Python 3's open() and will read Unicode strings instead of byte strings. io.open() is available in Python 3 as well for portability.
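A short sketch of the Python 2 form, reusing the path variable from the question:
# Python 2: io.open() matches Python 3's open() and returns unicode strings
import io

with io.open(path, 'r', encoding='utf8') as f:
    content = f.readlines()  # a list of unicode objects, already decoded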

python cleaning high or non-ascii out of a file [duplicate]

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open(), or use something in the codecs module, to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
Byte 0x81 is unassigned in Windows-1252 (aka cp1252). It maps to the U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
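A quick sketch of what the two handlers do with an undecodable byte (hypothetical data):
# A stray 0x81 inside otherwise plain ASCII text
data = b'abc\x81def'
print(data.decode('utf-8', errors='replace'))  # 'abc\ufffddef' (U+FFFD REPLACEMENT CHARACTER)
print(data.decode('utf-8', errors='ignore'))   # 'abcdef'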
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it were encoded in latin1, the "\x81" that it is complaining about would be a C1 control character that doesn't even have a name in Unicode. Latin1 is therefore extremely unlikely to be the correct encoding.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.

Python 3 UTF-8 encoding doesn't really work

I have read a lot now on the topic of UTF-8 encoding in Python 3 but it still doesn't work, and I can't find my mistake.
My code looks like this:
def main():
    with open("test.txt", "rU", encoding='utf-8') as test_file:
        text = test_file.read()
        print(str(len(text)))

if __name__ == "__main__":
    main()
My test.txt file looks like this:
ö
And I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
Your file is not UTF-8 encoded. I'm not sure what encoding your file actually uses, but byte 0xF6 is the encoding for ö in both Latin-1 and CP-1252:
>>> b'\xf6'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
>>> b'\xf6'.decode('latin1')
'ö'
You'll need to save that file as UTF-8 instead, with whatever tool you used to create that file.
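If re-saving is inconvenient, here is a minimal one-off conversion sketch (assuming the file really is Latin-1):
# Read the bytes with the encoding they actually use, write them back as UTF-8
with open('test.txt', encoding='latin1') as src:
    text = src.read()
with open('test.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)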
If open('text').read() works, then you were able to decode the file using the default system encoding. See the open() function documentation:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.
That is not to say that you were reading the file with the correct encoding; it just means that the default encoding didn't break (didn't encounter bytes for which it has no character mapping). It could still be mapping those bytes to the wrong characters.
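A quick sketch of that silent failure mode, decoding one hypothetical byte two ways:
>>> b'\xf6'.decode('latin1')   # Western European reading
'ö'
>>> b'\xf6'.decode('cp1251')   # the same byte read as Cyrillic Windows
'ц'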
I urge you to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

printing hebrew in python works in eclipse but not shell

I have some code that converts a text file containing Unicode escape sequences of Hebrew text into Hebrew for display
for example:
import sys

f = open(sys.argv[1])
for line in f:
    print eval('u"' + line + '"')
This works fine when I run it in PyDev (Eclipse), but when I run it from the command line, I get
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-10: ordinal not in range(256)
An example line from the input file is:
\u05d9\u05d5\u05dd
What is the problem? How can I solve this?
Do not use eval(); instead use the unicode_escape codec to interpret that data:
for line in f:
    line = line.decode('unicode_escape')
The unicode_escape encoding interprets \uabcd character sequences the same way Python would when parsing a unicode literal in the source code:
>>> '\u05d9\u05d5\u05dd'.decode('unicode_escape')
u'\u05d9\u05d5\u05dd'
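On Python 3, lines read from a text file are already str and have no .decode() method; there, the same codec can be applied by open() itself. A sketch, with a hypothetical filename:
# 'unicode_escape' turns the literal \uXXXX sequences into characters while reading
with open('hebrew.txt', encoding='unicode_escape') as f:
    for line in f:
        print(line, end='')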
The exception you see is not caused by the eval() statement though; I suspect it is being caused by an attempt to print the result instead. Python will try to encode unicode values automatically and will detect what encoding the current terminal uses.
Your Eclipse output window uses a different encoding from your terminal; if the latter is configured to support Latin-1 then you'll see that exact exception, as Python tries to encode Hebrew codepoints to an encoding that doesn't support those:
>>> u'\u05d9\u05d5\u05dd'.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
The solution is to reconfigure your terminal (UTF-8 would be a good choice), or to not print unicode values with codepoints that cannot be encoded to Latin-1.
If you are redirecting output from Python to a file, then Python cannot determine the output encoding automatically. In that case you can use the PYTHONIOENCODING environment variable to tell Python what encoding to use for standard I/O:
PYTHONIOENCODING=utf-8 python yourscript.py > outputfile.txt
Thank you, this solved my problem.
line.decode('unicode_escape')
did the trick.
Follow-up: this now works, but if I try to send the output to a file:
python myScript.py > textfile.txt
The file itself has the error:
'ascii' codec can't encode characters in position 42-44: ordinal not in range(128)

A UnicodeDecodeError that occurs with json in python on Windows, but not Mac

On Windows, I have the following problem:
>>> string = "Don´t Forget To Breathe"
>>> import json,os,codecs
>>> f = codecs.open("C:\\temp.txt","w","UTF-8")
>>> json.dump(string,f)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\json\__init__.py", line 180, in dump
for chunk in iterable:
File "C:\Python26\lib\json\encoder.py", line 294, in _iterencode
yield encoder(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
(Notice the non-ascii apostrophe in the string.)
However, my friend, on his Mac (also using Python 2.6), can run through this like a breeze:
>>> string = "Don´t Forget To Breathe"
>>> import json,os,codecs
>>> f = codecs.open("/tmp/temp.txt","w","UTF-8")
>>> json.dump(string,f)
>>> f.close(); open('/tmp/temp.txt').read()
'"Don\\u00b4t Forget To Breathe"'
Why is this? I've also tried using UTF-16 and UTF-32 with json and codecs, but to no avail.
What does repr(string) show on each machine? On my Mac the apostrophe shows as \xc2\xb4 (utf8 coding, 2 bytes) so of course the utf8 codec can deal with it; on your Windows it clearly isn't doing that since it talks about three bytes being a problem - so on Windows you must have some other, non-utf8 encoding set for your console.
Your general problem is that, in Python pre-3, you should not enter a byte string ("...." as you used, rather than u"....") with non-ASCII content, unless it is written purely as escape sequences. Depending on how the session is configured, this may fail outright, or it may produce bytes, decoded via whatever codec happens to be the default, that are not the exact bytes you expect (because you're not aware of the exact default codec in use). Use explicit Unicode literals
string = u"Don´t Forget To Breathe"
and you should be OK (or, if you have any problem, it will emerge right at the time of this assignment, at which point we can go into the issue of "how do I set a default encoding for my interactive sessions" if that's what you require).
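A sketch of the corrected session under that assumption (Python 2, same file as in the question):
# With an explicit unicode literal there are no bytes left to guess-decode
import json, codecs

string = u"Don´t Forget To Breathe"
f = codecs.open("C:\\temp.txt", "w", "UTF-8")
json.dump(string, f)  # writes '"Don\\u00b4t Forget To Breathe"'
f.close()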
