File open error when using the UTF-8 codec in Python - python

I executed the following code on Windows XP with Python 2.6.4, but it raises an IOError.
How can I open a file whose name is encoded as UTF-8?
>>> open( unicode('한글.txt', 'euc-kr').encode('utf-8') )
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
open( unicode('한글.txt', 'euc-kr').encode('utf-8') )
IOError: [Errno 22] invalid mode ('r') or filename: '\xed\x95\x9c\xea\xb8\x80.txt'
But the following code to the normal operation.
>>> open( unicode('한글.txt', 'euc-kr') )
<open file u'\ud55c\uae00.txt', mode 'r' at 0x01DD63E0>

The C runtime interface that Windows exposes to Python uses the system code page to encode filenames. Unlike on OS X and modern Linux versions, on Windows the system code page is never UTF-8. So the UTF-8 byte string won't be any good.
You could encode the filename to the current code page using .encode('mbcs'), which in your case is probably equivalent to .encode('cp949'). To make it compatible with other platforms where filenames are UTF-8, you could look up sys.getfilesystemencoding(), which will give you 'utf-8' there, or 'mbcs' on Windows.
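For example, a minimal sketch of that portable approach on Python 2, assuming the file exists and the system code page can represent the characters (the filename is the one from the question):
import sys

# Encode a unicode filename with whatever the platform reports:
# 'mbcs' (the ANSI code page) on Windows, 'utf-8' on modern
# Linux and OS X.
name = u'\ud55c\uae00.txt'  # the Korean name from the question
f = open(name.encode(sys.getfilesystemencoding()))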
However, whilst cp949 would work for Korean characters, it would break on anything outside the repertoire of that code page (an extended version of EUC-KR).
So another approach is to keep your filenames as Unicode. On Windows this will use the Unicode-native interfaces to pass filenames to Windows in the UTF-16LE encoding it uses internally. (See PEP 277 for more on this feature.)
This does generally still work on other platforms too: Linux and OS X should silently encode the Unicode filenames to UTF-8 for you. This may fail more in older Python versions, but it's the default way to handle filenames in Python 3 (where the default string type has changed to Unicode).
The traps to watch out for with using Unicode filenames on Python 2 are:
if os.path.supports_unicode_filenames is False, as it will be outside Windows, the functions that return filenames, such as os.listdir, will always give you byte strings. You'd have to detect that and decode them using sys.getfilesystemencoding (see the sketch after this list).
if you have a file on Linux/OS X with a name that's not a valid UTF-8 string, you won't be able to get a Unicode filename for it (UnicodeDecodeError if you try). Bit of a corner case, but it can lead to annoying inaccessible files.
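A minimal sketch of that decoding step on Python 2 (using the current directory just for illustration):
import os
import sys

names = os.listdir('.')  # byte strings when given a byte-string path
encoding = sys.getfilesystemencoding()
# decode any byte-string entries so everything downstream is unicode
unicode_names = [n.decode(encoding) if isinstance(n, str) else n for n in names]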
Incidentally,
open(unicode('한글.txt', 'euc-kr'))
Probably you would want to say 'cp949' there (the Windows Korean code page has minor differences from EUC-KR). Or, more generally, 'mbcs', which gives you the system code page, presumably the same one your console is using. Anyway, I don't know about PyShell, but normally if the above works you should just be able to type the literal directly:
open(u'한글')

Related

What is the mechanism that the builtin function `open` uses to encode and decode filenames?

I have a little confusion about open. I'm running Windows 10; when I call sys.getfilesystemencoding I get mbcs. So if I pass the filename to open, for example:
open('Meow!.txt')
Supposedly, the encoding of the source file is UTF-8. Does open encode the filename 'Meow!.txt' with the mbcs encoding, which is set to the default Windows ANSI code page, and then pass the request on to the OS?
Generally speaking, what happens when you pass a filename to open as unicode in 2.X or as str in 3.X?
Is it true that passing the filename as a bytes object in 3.X, or as str in 2.X, overrides the default automatic encoding of the filename?
Here's what happens internally when using the builtin open in 2.7 to be precise:
Python sets a constant that names the default encoding for filenames; this constant is called Py_FileSystemDefaultEncoding and varies per platform. When its value is NULL, Python will try to get the default encoding of the platform, if there is one:
/*bltinmodule.c*/
/* The default encoding used by the platform file system APIs
Can remain NULL for all platforms that don't have such a concept
*/
#if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
const char *Py_FileSystemDefaultEncoding = "mbcs";
#elif defined(__APPLE__)
const char *Py_FileSystemDefaultEncoding = "utf-8";
#else
const char *Py_FileSystemDefaultEncoding = NULL; /* use default */
#endif
On Windows, Py_FileSystemDefaultEncoding is "mbcs" (multi-byte character set); you can check its value from Python with sys.getfilesystemencoding():
Python 2.7 Documentation: sys.getfilesystemencoding()
On Windows NT+, file names are Unicode natively, so no conversion is performed. getfilesystemencoding() still returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names.
For example, suppose the filename contains Chinese characters; for simplicity I'll use U+5F08 (弈, the CJK character for Chinese chess) as the name of the file I'm going to write:
>>> f = open(u'\u5F08.txt', 'w')
>>> f
<open file u'\u5f08', mode 'w' at 0x000000000336B1E0>
Generally speaking, what happens when you pass a filename to open as unicode in 2.X or as str in 3.X?
The answer is platform-dependent. On Windows, for instance, there's no need to convert Unicode strings to any encoding, not even the default filesystem encoding "mbcs". To prove that:
>>> f = open(u'\u5F08.txt'.encode('mbcs'), 'w')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('w') or filename: '?.txt'
By the way, even if you use the 'utf-8' encoding, you won't get the correct filename:
>>> f = open(u'\u5F08.txt'.encode('utf8'), 'w')
This will give you a file named å¼ˆ.txt (the UTF-8 bytes reinterpreted in the Western ANSI code page) instead of 弈.txt. In conclusion, there is apparently no conversion for Unicode filenames. I think this rule applies to str too: since str in 2.X is a raw byte string, Python won't pick an encoding magically. I cannot verify this, however, and it may be that Python decodes str with the "mbcs" encoding; you could test it with characters outside the character set of the "mbcs" code page, but that in turn depends on your Windows locale settings. So much is encapsulated at the lower levels of the Windows implementation. If memory serves, "mbcs" is now considered legacy for the Windows APIs; Python 3.6 uses UTF-8 instead, unless legacy mode is enabled.
Really, though, the issue seems to lie deep in the Windows APIs and their implementation rather than in Python itself.
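The mojibake above can be reproduced directly; a quick demonstration, assuming a Western Windows install whose ANSI code page is cp1252:
>>> u'\u5f08.txt'.encode('utf-8').decode('cp1252')
u'\xe5\xbc\u02c6.txt'
>>> print u'\u5f08.txt'.encode('utf-8').decode('cp1252')
å¼ˆ.txt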

Non ascii file name issue with os.walk

I am using os.walk to traverse a folder that contains some files with non-ASCII names. For these files, os.walk gives me something like ???.txt, and I cannot call open with such names; it complains [Errno 22] invalid mode ('rb') or filename. How should I work around this?
I am using Windows 7 and Python 2.7.11. My system locale is en-US.
Listing directories using a bytestring path on Windows produces directory entries encoded to your system locale. This encoding (done by Windows) can fail if the system locale cannot actually represent those characters, resulting in placeholder characters instead. The underlying filesystem, however, can handle the full Unicode range.
The work-around is to use a unicode path as the input; so instead of os.walk(r'C:\Foo\bar\blah') use os.walk(ur'C:\Foo\bar\blah'). You'll then get unicode values for all parts instead, and Python uses a different API to talk to the Windows filesystem, avoiding the encoding step that can break filenames.
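A minimal sketch of that fix on Python 2 (the directory path is just the placeholder from above):
import os

# A unicode input path makes os.walk return unicode names, so
# Windows is queried through its native wide-character API.
for dirpath, dirnames, filenames in os.walk(ur'C:\Foo\bar\blah'):
    for name in filenames:
        with open(os.path.join(dirpath, name), 'rb') as f:
            data = f.read()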

Python UnicodeDecodeError on Mac, but not on PC?

I've got a script that basically aggregates students' code files into one file for plagiarism detection. It walks through a tree of files, copying all file contents into one file.
I've run the script on the exact same files on my Mac and my PC. On my PC, it works fine. On my Mac, it encounters 27 UnicodeDecodeErrors (probably 0.1% of all files I'm testing).
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
If relevant, the code is:
originalFile = open(originalFilename, "r")
newFile = open(newFilename, "a")
newFile.write(originalFile.read())
Figure out what encoding was used when saving that file. A safe bet is to try loading the file as 'utf-8' first: if that succeeds, it is likely the correct encoding.
# try utf-8. If this fails, all bets are off.
open(originalFilename, "r", encoding="utf-8")
Now, if students are sending you these files, it's likely they just use the default encoding on their system. It is not possible to reliably guess the encoding. If they were using an 8-bit codec, like one of the ISO-8859 character sets, it will be almost impossible to guess which one was used. What to do then depends on what kind of files you're processing.
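If you do have to cope with unknown 8-bit encodings, one pragmatic sketch (the helper name read_text is hypothetical) is to try UTF-8 and fall back to latin-1, which never raises, though it may silently produce the wrong characters:
def read_text(path):
    # Try UTF-8 first; it fails loudly on most 8-bit-encoded files.
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot raise,
        # but the result may be wrong for files in other code pages.
        with open(path, "r", encoding="latin-1") as f:
            return f.read()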
It is incorrect to read Python source files using open(originalFilename, "r") on Python 3. open() uses locale.getpreferredencoding(False) by default, while a Python source file may use a different character encoding; in the best case, this causes a UnicodeDecodeError -- usually, you just get mojibake silently.
To read Python source taking the encoding declaration (# -*- coding: ... -*-) into account, use tokenize.open(filename). If it fails, the input is not valid Python 3 source code.
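For instance (the filename here is a placeholder):
import tokenize

# tokenize.open() reads the PEP 263 coding declaration
# (e.g. "# -*- coding: cp1252 -*-") and decodes accordingly.
with tokenize.open('student_submission.py') as f:
    source = f.read()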
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
locale.getpreferredencoding(False) is likely to be utf-8 on a Mac, and utf-8 doesn't accept arbitrary byte sequences as encoded text. The PC is likely using an 8-bit character encoding that silently corrupts the input and produces mojibake instead of raising an error on the mismatched character encoding.
To read a text file, you should know its character encoding. If you don't know it, either read the file as a sequence of bytes ('rb' mode) or try to guess the encoding using the chardet Python module (it is only a guess, but it might be good enough depending on your task).
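A sketch of the chardet route (chardet is a third-party package; the filename is a placeholder):
import chardet  # pip install chardet

path = 'submission.txt'  # hypothetical file
with open(path, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess['encoding'] or 'latin-1')  # fall back if detection fails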
I had the exact same problem: some characters in the file raised a UnicodeDecodeError during readlines().
This only happened on my MacBook, not on a PC.
I solved the problem by simply skipping those characters:
with open(file_to_extract, errors='ignore') as f:
    reader = f.readlines()

Wrong encoding in filenames created on Windows XP by Python script

My Python script creates an XML file under Windows XP, but the file doesn't get the right encoding for Spanish characters such as 'ñ' or accented letters.
First of all, the filename is read from an Excel sheet using the xlrd library:
filename = excelsheet.cell_value(rowx=first_row, colx=5)
Then I've tried several encodings, without success, to generate the file with a correctly encoded name:
filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")
Using "windows-1252" I get a bad encoding with letter 'ñ', 'í' and 'é'. For example, I got BAJO ARAGÓN_Alcañiz.xml instead of BAJO ARAGÓN_Alcañiz.xml
Thanks in advance for your help
You should use unicode strings for your filenames. In general, operating systems support filenames containing arbitrary Unicode characters, so if you do:
fn = u'ma\u00d1o' # maÑo
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()
it should work just fine. A different matter is what you see in your terminal when you list the contents of the directory the file lives in. If the terminal's encoding is UTF-8 you will see the filename maÑo, but if the encoding is, for instance, iso-8859-1 you will see maÃo. But even if you see these strange characters, you should be able to open the file from Python as described above.
In summary, do not encode the output of
filename = excelsheet.cell_value(rowx=first_row, colx=5)
instead, make sure it is a unicode string.
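A sketch of that, reusing the question's excelsheet object (xlrd already returns text cells as unicode in Python 2):
filename = excelsheet.cell_value(rowx=first_row, colx=5)
# a text cell comes back as unicode; use it as-is, without .encode()
f = open(filename[:-1], "w")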
Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.
Trying your answers, I found a fast solution: porting my script from Python 2.7 to Python 3.3, since Python 3 works in Unicode by default.
I had to make some small changes to my code, starting with the import of the xlrd library (I first had to install xlrd3):
import xlrd3 as xlrd
Also, I had to convert the content from bytes to string using str() instead of encode():
filename = str(filename[:-1])
Now my script works perfectly and generates the files on Windows XP without strange characters.
First, if you haven't already, please read http://www.joelonsoftware.com/articles/Unicode.html
Now, "latin-1" should work for Spanish encoding under Windows. There are two hypotheses here: one is that the strings you are trying to encode are not Unicode strings but are already in some encoding. That, however, would more likely give you a UnicodeDecodeError than strange characters, though it might happen in some corner case.
The more likely case is that you are checking your files from the Windows prompt, a.k.a. "CMD".
For some reason, Microsoft Windows uses two different encodings on the same system: one for "native" Windows programs, which should be compatible with latin1, and another for legacy DOS programs, a category that includes the command prompt. For Portuguese, this second encoding is "cp852" (looking around, cp852 does not define "ñ", but cp850 does).
So, this happens:
>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>>
So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "cp850"; if you want them to look right from Windows programs, encode them using "cp1252" (or "latin1" or "iso-8859-15"; they are almost the same, give or take the "€" symbol).
Of course, instead of trying to guess and picking one that looks good (and will fail if someone runs your program in Norway, Russia, or on a POSIX system), you should just do:
import sys
encoding = sys.getfilesystemencoding()
(This should return one of the encodings above for you; again, the filename will look right if seen from a Windows program, not from a DOS shell.)
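A short usage sketch of that value (the filename is the one from the question):
import sys

encoding = sys.getfilesystemencoding()
# encode the unicode name with whatever the platform reports,
# instead of hard-coding a particular code page
open(u'BAJO ARAG\xd3N_Alca\xf1iz.xml'.encode(encoding), 'w').close()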
In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.
# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
f.write('test')
Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.

Python open() unicode filename behaviour different across OSes

With a filename looking like:
filename = u"/direc/tories/español.jpg"
And using open() as:
fp = open(filename, "rb")
This will correctly open the file on OSX (10.7), but on Ubuntu 11.04 the open() function will try to open u"espa\xf1ol.jpg", and this will fail with an IOError.
Through the process of trying to fix this I've checked sys.getfilesystemencoding() on both systems; both are set to utf-8 (although Ubuntu reports it uppercase, i.e. UTF-8; not sure if that is relevant). I've also set # -*- coding: utf-8 -*- in the Python file, but I'm sure this only affects the encoding within the file itself, not external functions or how Python deals with system resources. The file exists on both systems with the eñe correctly displayed.
The end question is: How do I open the español.jpg file on the Ubuntu system?
Edit:
The español.jpg string is actually coming out of a database via Django's ORM (ImageFileField), but by the time I'm dealing with it and seeing the difference in behaviour I have a single unicode string which is an absolute path to the file.
This one below should work in both cases:
import sys

fp = open(filename.encode(sys.getfilesystemencoding()), "rb")
It's not enough to simply set the file encoding at the top of your file. Make sure your editor is actually using the same encoding and saving the text in it. If necessary, re-type any non-ASCII characters to ensure that your editor is doing the right thing.
If your value is coming from e.g. a database, you will still need to ensure that it is not converted to a non-unicode (byte) string anywhere along the way.
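A defensive sketch combining both points, for Python 2 (the helper name open_image and the UTF-8 assumption for stray byte strings are mine, not from the question):
import sys

def open_image(filename):
    # If an upstream layer (ORM, config file, ...) handed us a byte
    # string, decode it first; this assumes those bytes are UTF-8.
    if isinstance(filename, str):  # Python 2: str is the byte type
        filename = filename.decode('utf-8')
    # Encode with the platform's filesystem encoding for open().
    return open(filename.encode(sys.getfilesystemencoding()), 'rb')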
