Non ascii file name issue with os.walk


I am using os.walk to traverse a folder. There are some non-ascii named files in there. For these files, os.walk gives me something like ???.txt. I cannot call open with such file names. It complains [Errno 22] invalid mode ('rb') or filename. How should I work this out?
I am using Windows 7, python 2.7.11. My system locale is en-us.

Listing directories using a bytestring path on Windows produces directory entries encoded to your system locale. This encoding (done by Windows) can fail if the system locale cannot represent those characters, resulting in placeholder characters instead. The underlying filesystem, however, can handle the full Unicode range.
The work-around is to use a unicode path as the input; so instead of os.walk(r'C:\Foo\bar\blah') use os.walk(ur'C:\Foo\bar\blah'). You'll then get unicode values for all parts instead, and Python uses a different API to talk to the Windows filesystem, avoiding the encoding step that can break filenames.
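As a minimal sketch of that work-around (the helper name iter_files is hypothetical, and the code is written so that it also runs on Python 3, where every str is already Unicode), the root can be coerced to Unicode before it reaches os.walk:

```python
import os
import sys


def iter_files(root):
    """Yield the full path of every file below root as a Unicode string."""
    # A Python 2 str (or Python 3 bytes) path makes os.walk return byte
    # strings, so decode it first. Assumption: the root itself can be
    # represented in the filesystem encoding.
    if isinstance(root, bytes):
        root = root.decode(sys.getfilesystemencoding())
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Each yielded path can then be passed to open(path, 'rb') as-is, with no further encoding or decoding.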

Related

Python 2.7: Read file with Chinese characters

I am trying to analyze data within CSV files with Chinese characters in their names (E.g. "粗1 25g").
I am using Tkinter to choose the files like so:
selectedFiles = askopenfilenames(filetypes=[("xlsx","*"),("xls","*")]) # Utilize Tkinter dialog window to choose files
selectedFiles = master.tk.splitlist(selectedFiles) # Create list from files chosen
I have attempted to convert the filename to unicode in this way:
selectedFiles = [x.decode("utf-8") for x in selectedFiles]
Only to yield the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 0: ordinal not in range(128)
I have also tried converting the filenames as the files are created with the following:
titles = [x.encode('utf-8') for x in titles]
Only to receive the error:
IOError: [Errno 22] invalid mode ('wb') or filename: 'C:\...\\data_division_files\\\xe7\xb2\x971 25g.csv'
I have also tried combinations of the above methods to no avail.
What can I do to allow these files to be read in Python?
(This question, while related, has not been able to solve my problem: Obtain File size with os.path.getsize() in Python 2.7.5)

When you call decode on a unicode object, it first encodes it with sys.getdefaultencoding() so it can decode it for you. Which is why you get an error about ASCII even though you didn't ask for ASCII anywhere.
So, where are you getting a unicode object from? From askopenfilename. From a quick test, it looks like it always returns unicode values on Windows (presumably by getting the UTF-16 and decoding it), while on POSIX it returns some unicode and some str (I'd guess by leaving alone anything that fits into 7-bit ASCII, decoding anything else with your filesystem encoding). If you'd tried printing out the repr or type or anything of selectedFiles, the problem would have been obvious.
Meanwhile, the encode('utf-8') shouldn't cause any UnicodeErrors… but it's likely that your filesystem encoding isn't UTF-8 on Windows, so it will probably cause a lot of IOErrors with errno 2 (trying to open files that don't exist, or to create files in directories that don't exist), 22 (trying to open files with illegal file or directory names on Windows), etc. And it looks like that's exactly what you're seeing. And there's really no reason to do it; just pass the pathnames as-is to open and they'll be fine.
So, basically, if you removed all of your encode and decode calls, your code would probably just work.
However, there's an even easier solution: Just use askopenfile or asksaveasfile instead of askopenfilename or asksaveasfilename. Let Tk figure out how to use its pathnames and just hand you the file objects, instead of messing with the pathnames yourself.
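If the pathnames do have to be normalized by hand, the decode pitfall described above can be avoided by touching a value only when it really is a byte string. A minimal sketch, assuming a hypothetical helper named ensure_text (it is not part of Tkinter) that falls back to the filesystem encoding:

```python
import sys


def ensure_text(path, encoding=None):
    """Return path as Unicode, decoding only if it is a byte string."""
    if isinstance(path, bytes):
        return path.decode(encoding or sys.getfilesystemencoding())
    # Already Unicode: calling .decode() here is what triggers the implicit
    # ASCII encode on Python 2, so the value is returned untouched.
    return path
```

With this, selectedFiles = [ensure_text(x) for x in selectedFiles] should be safe whether Tk hands back unicode or str values.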

Get properties of a file whose name contains special (non-ASCII) characters

I'm using python and having some trouble reading the properties of a file, when the filename includes non-ASCII characters.
One of the files for example is named:
0-Channel-https∺∯∯services.apps.microsoft.com∯browse∯6.2.9200-1∯615∯Channel.dat
When I run this:
list2 = os.listdir('C:\\Users\\James\\AppData\\Local\\Microsoft\\Windows Store\\Cache Medium IL\\0\\')
for data in list2:
    print os.path.getmtime(data) + '\n'
I get the error:
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: '0-Channel-https???services.apps.microsoft.com?browse?6.2.9200-1?615?Channel.dat'
I assume it's caused by the special chars because the code works fine with other file names containing only ASCII chars.
Does anyone know of a way to query the filesystem properties of a file named like this?

If this is Python 2.x, it's an encoding issue. If you pass a unicode string to os.listdir, such as u'C:\\my\\pathname', it will return unicode strings, and they should have the non-ASCII chars encoded correctly. See Unicode Filenames in the docs.
Quoting the doc:
os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames. For example, assuming the default filesystem encoding is UTF-8, running the following program:
This code should work:
directory_name = u'C:\\Users\\James\\AppData\\Local\\Microsoft\\Windows Store\\Cache Medium IL\\0\\'
list2 = os.listdir(directory_name)
for data in list2:
    print data, os.path.getmtime(os.path.join(directory_name, data))

As you are on Windows, you could try the ntpath module instead of os.path:
from ntpath import getmtime
I don't have Windows, so I can't test it. Every OS has a different path convention, so Python provides a specific module for each of the most common operating systems.

Wrong encoding in filenames created on Windows XP by Python script

My Python script creates an XML file under Windows XP, but that file's name doesn't get the right encoding for Spanish characters such as 'ñ' or some accented letters.
First of all, the filename is read from an Excel sheet with the following code (I use the xlrd library to read the Excel file):
filename = excelsheet.cell_value(rowx=first_row, colx=5)
Then I tried some encodings, without success, to generate the file with the right encoding:
filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")
Using "windows-1252" I get a bad encoding for the letters 'ñ', 'í' and 'é'. For example, I got BAJO ARAGÓN_AlcaÃ±iz.xml instead of BAJO ARAGÓN_Alcañiz.xml.
Thanks in advance for your help

You should use unicode strings for your filenames. In general, operating systems support filenames that contain arbitrary Unicode characters. So if you do:
fn = u'ma\u00f1o' # maño
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()
it should work just fine. What you see in your terminal when you list the contents of the directory where that file lives is a different matter. If the encoding of the terminal is UTF-8 you will see the filename maño, but if the encoding is, for instance, iso-8859-1 you will see maÃ±o. But even if you see these strange characters, you should be able to open the file from Python as described above.
In summary, do not encode the output of
filename = excelsheet.cell_value(rowx=first_row, colx=5)
instead make sure it is a unicode string.
Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.

Trying your answers, I found a fast solution: port my script from Python 2.7 to Python 3.3. The reason is that Python 3 works in Unicode by default.
I had to make some small changes in my code, such as the import of the xlrd library (I first had to install xlrd3):
import xlrd3 as xlrd
Also, I had to convert the content from 'bytes' to 'string' using str instead of encode()
filename = str(filename[:-1])
Now my script works perfectly and generates the files on Windows XP without strange characters.

First, if you have not already, please read http://www.joelonsoftware.com/articles/Unicode.html.
Now, "latin-1" should work for Spanish encoding under Windows. There are two hypotheses here. One is that the strings you are trying to "encode" to either encoding are not Unicode strings, but are already in some encoding. That, however, would more likely give you a UnicodeDecodeError than strange characters, but it might work in some corner case.
The more likely case is that you are checking your files using the Windows command prompt, a.k.a. "CMD".
For some reason, Microsoft Windows uses two different encodings for the system: one inside "native" Windows programs, which should be compatible with latin1, and another one for legacy DOS programs, in which category it puts the command prompt. For Portuguese, this second encoding is "cp852" (looking around, cp852 does not define "ñ", but cp850 does).
So, this happens:
>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>>
So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "cp850"; if you want them to look right from Windows programs, encode them using "cp1252" (or "latin1" or "iso-8859-15"; they are almost the same, give or take the "€" symbol).
Of course, instead of trying to guess and picking one that looks good, which will fail if someone runs your program in Norway, Russia, or on a POSIX system, you should just do
import sys
encoding = sys.getfilesystemencoding()
(This should return one of the above for you. Again, the filename will look right if seen from a Windows program, not from a DOS shell.)
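The garbled names in this thread can be reproduced by encoding with one codec and decoding with another. In this small sketch (the function name misread is only for illustration), the question's AlcaÃ±iz turns out to be what UTF-8 bytes look like when read as cp1252, and A±a is latin1 read as cp850:

```python
# -*- coding: utf-8 -*-


def misread(text, written_as, read_as):
    """Show how text looks when bytes written in one codec are read in another."""
    return text.encode(written_as).decode(read_as)


# UTF-8 bytes interpreted as cp1252, as in the question's filename.
utf8_as_cp1252 = misread(u"Alca\u00f1iz", "utf-8", "cp1252")  # AlcaÃ±iz

# latin1 bytes interpreted as cp850, as in a DOS prompt.
latin1_as_cp850 = misread(u"A\u00f1a", "latin1", "cp850")  # A±a
```

That the first result matches the question's output suggests the name was written as UTF-8 bytes and displayed by something expecting cp1252.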

In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.
# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
    f.write('test')
Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.

Python open() unicode filename behaviour different across OSes

With a filename looking like:
filename = u"/direc/tories/español.jpg"
And using open() as:
fp = open(filename, "rb")
This will correctly open the file on OSX (10.7), but on Ubuntu 11.04 the open() function will try to open u"espa\xf1ol.jpg", and this will fail with an IOError.
Through the process of trying to fix this I've checked sys.getfilesystemencoding() on both systems, both are set to utf-8 (although Ubuntu reports uppercase, i.e. UTF-8, not sure if that is relevant). I've also set # -*- coding: utf-8 -*- in the python file, but I'm sure this only affects encoding within the file itself, not any external functions or how python deals with system resources. The file exists on both systems with the eñe correctly displayed.
The end question is: How do I open the español.jpg file on the Ubuntu system?
Edit:
The español.jpg string is actually coming out of a database via Django's ORM (ImageFileField), but by the time I'm dealing with it and seeing the difference in behaviour I have a single unicode string which is an absolute path to the file.

This one below should work in both cases:
fp = open(filename.encode(sys.getfilesystemencoding()), "rb")
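A slightly more defensive sketch of the same idea, assuming a hypothetical wrapper named open_encoded, encodes only when the path is still Unicode, so that a byte string that was already encoded is not encoded a second time:

```python
import sys


def open_encoded(path, mode="rb"):
    """Open path, encoding a Unicode path with the filesystem encoding first."""
    if not isinstance(path, bytes):
        path = path.encode(sys.getfilesystemencoding())
    return open(path, mode)
```

Usage would then be fp = open_encoded(filename), with filename being the absolute path that comes out of the database.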

It's not enough to simply set the file encoding at the top of your file. Make sure that your editor is using the same encoding, and saving the text in that encoding. If necessary, re-type any non-ascii characters to ensure that your editor is doing the right thing.
If your value is coming from e.g. a database, you will still need to ensure that nowhere along the line is it being encoded as non-unicode.

File open error by using codec utf-8 in python

I executed the following code on Windows XP with Python 2.6.4, but it raises an IOError.
How can I open a file whose name is encoded as UTF-8?
>>> open( unicode('한글.txt', 'euc-kr').encode('utf-8') )
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    open( unicode('한글.txt', 'euc-kr').encode('utf-8') )
IOError: [Errno 22] invalid mode ('r') or filename: '\xed\x95\x9c\xea\xb8\x80.txt'
But the following code works normally.
>>> open( unicode('한글.txt', 'euc-kr') )
<open file u'\ud55c\uae00.txt', mode 'r' at 0x01DD63E0>

The C runtime interface that Windows exposes to Python uses the system code page to encode filenames. Unlike on OS X and modern Linux versions, on Windows the system code page is never UTF-8. So the UTF-8 byte string won't be any good.
You could encode the filename to the current code page using .encode('mbcs'), which in your case is probably equivalent to .encode('cp949'). To make it compatible with other platforms where filenames are UTF-8, you could look up sys.getfilesystemencoding, which will give you utf-8 there or mbcs on Windows.
However whilst cp949 would work for Korean characters, it would break on anything outside the repertoire of that code page (an extended version of EUC-KR).
So another approach is to keep your filenames as Unicode. On Windows this will use the Unicode-native interfaces to pass filenames to Windows in the UTF-16LE encoding it uses internally. (See PEP277 for more on this feature.)
This does generally still work on other platforms too: Linux and OS X should silently encode the Unicode filenames to UTF-8 for you. This may fail more in older Python versions, but it's the default way to handle filenames in Python 3 (where the default string type has changed to Unicode).
The traps to watch out for with using Unicode filenames on Python 2 are:
if os.path.supports_unicode_filenames is False, as it will be outside Windows, the functions that return filenames, such as os.listdir, will always give you byte strings. You'd have to detect that and decode them using sys.getfilesystemencoding.
if you have a file on Linux/OS X with a name that's not a valid UTF-8 string, you won't be able to get a Unicode filename for it (UnicodeDecodeError if you try). Bit of a corner case, but it can lead to annoying inaccessible files.
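Both traps can be handled in one place. The following is only a sketch, with a hypothetical helper name listdir_text: it decodes any byte-string entries with sys.getfilesystemencoding() and sets aside names that cannot be decoded instead of raising.

```python
import os
import sys


def listdir_text(directory):
    """Return (decoded, undecodable) names found in directory.

    decoded holds Unicode names; undecodable holds the raw byte strings
    that are not valid in the filesystem encoding.
    """
    encoding = sys.getfilesystemencoding()
    decoded, undecodable = [], []
    for name in os.listdir(directory):
        if isinstance(name, bytes):
            try:
                name = name.decode(encoding)
            except UnicodeDecodeError:
                # Keep the raw bytes so the file can still be opened by
                # joining them to a byte-string directory path.
                undecodable.append(name)
                continue
        decoded.append(name)
    return decoded, undecodable
```

Files in the second list remain reachable as long as their byte-string names are used unchanged.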
Incidentally,
open(unicode('한글.txt', 'euc-kr'))
Probably you would want to say 'cp949' there (as the Windows Korean code page has minor differences to EUC-KR). Or, more generally, 'mbcs', which gives you the system code page which is presumably going to be the same one your console is typing. Anyway, I don't know about PyShell, but normally if the above works then you should just be able to type it directly:
open(u'한글')
