I am getting the error:
'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)
when trying to do os.walk. The error occurs because some of the filenames in a directory contain the byte 0x8b, which is not valid ascii/utf-8. The files come from a Windows system (hence the UTF-16 filenames), but I have copied the files over to a Linux system and am using Python 2.7 (running on Linux) to traverse the directories.
I have tried passing a unicode start path to os.walk, and all the files and dirs it generates are unicode names, until it comes to a non-utf8 name; then, for some reason, it doesn't convert those names to unicode, and the code chokes. Is there any way to solve the problem short of manually finding and changing all the offending names?
If there is no solution in Python 2.7, can a script be written in Python 3 to traverse the file tree and fix the bad filenames by converting them to utf-8 (removing the non-utf8 chars)? N.B. there are many non-utf8 chars in the names besides 0x8b, so it would need to work in a general fashion.
UPDATE: The fact that 0x8b is still only a byte char (just not valid ascii) makes it even more puzzling. I have verified that there is a problem converting such a string to unicode, but that a unicode version can be created directly. To wit:
>>> test = 'a string \x8b with non-ascii'
>>> test
'a string \x8b with non-ascii'
>>> unicode(test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 9: ordinal not in range(128)
>>>
>>> test2 = u'a string \x8b with non-ascii'
>>> test2
u'a string \x8b with non-ascii'
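The same distinction is explicit in Python 3, where the str/bytes split makes the failing step visible: decoding the raw bytes with the ascii codec fails, while a text literal simply carries the code point. A small Python 3 sketch:

```python
# Python 3 equivalent of the session above: bytes must be decoded
# explicitly, and the ascii codec rejects 0x8b just as Python 2's did.
raw = b'a string \x8b with non-ascii'
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0x8b in position 9: ...

# A text literal carrying the same code point needs no decode step at all.
text = 'a string \x8b with non-ascii'
print(ascii(text))
```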
Here's a traceback of the error I am getting:
80. for root, dirs, files in os.walk(unicode(startpath)):
File "/usr/lib/python2.7/os.py" in walk
294. for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
294. for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
284. if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py" in join
71. path += '/' + b
Exception Type: UnicodeDecodeError at /admin/casebuilder/company/883/
Exception Value: 'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)
The root of the problem occurs in the list of files returned from listdir (on line 276 of os.walk):
names = listdir(top)
The names with chars > 128 are returned as non-unicode strings.
Right, I just spent some time sorting through this error, and the wordier answers here aren't getting at the underlying issue:
The problem is that if you pass a unicode string into os.walk(), then os.walk gets unicode back from os.listdir() for the names it can decode, but plain byte strings for the ones it can't. When os.path.join later combines a unicode path with such a byte string, Python implicitly tries to decode the bytes as ascii and throws the exception.
The solution is to force the starting path you pass to os.walk to be a regular byte string, i.e. os.walk(str(somepath)). This means os.listdir returns regular byte strings throughout and everything works the way it should.
You can reproduce this problem (and show that the solution works) trivially:
Go into bash in some directory and run touch $(echo -e "\x8b\x8bThis is a bad filename"), which will create some test files (the unquoted command substitution splits on spaces, so several names are created).
Now run the following Python code (iPython Qt is handy for this) in the same directory:
import os

l = []
for root, dirs, filenames in os.walk(unicode('.')):
    l.extend([os.path.join(root, f) for f in filenames])
print l
And you'll get a UnicodeDecodeError.
Now try running:
l = []
for root, dirs, filenames in os.walk('.'):
    l.extend([os.path.join(root, f) for f in filenames])
print l
No error and you get a print out!
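For comparison, a Python 3 sketch of the same reproduction (assuming a Linux filesystem, which permits arbitrary non-NUL bytes in names): the walk no longer fails, because undecodable bytes are smuggled into str values as surrogate escapes.

```python
import os
import shutil
import tempfile

# Create a file whose name contains bytes that are not valid UTF-8
# (works on Linux, where filenames are arbitrary non-NUL bytes).
d = tempfile.mkdtemp()
with open(os.path.join(os.fsencode(d), b'\x8b\x8bbad name'), 'w'):
    pass

names = []
for root, dirs, filenames in os.walk(d):
    names.extend(os.path.join(root, f) for f in filenames)

# The walk succeeds; the bad bytes come back as surrogate escapes.
print(ascii(names[0]))
shutil.rmtree(d)
```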
Thus the safe way in Python 2.x is to make sure you only pass byte strings to os.walk(). You absolutely should not pass unicode, or things which might be unicode, to it, because os.walk will then choke when an internal ascii conversion fails.
This problem stems from two fundamental issues. The first is the fact that Python 2.x's default encoding is 'ascii', while the default Linux encoding is 'utf8'. You can verify these encodings via:
sys.getdefaultencoding() #python
sys.getfilesystemencoding() #OS
When the os module functions that return directory contents, namely os.walk and os.listdir, are given a unicode path and the directory contains a mix of decodable and undecodable filenames, the decodable filenames are converted automatically to unicode; the others are not. The result is therefore a list containing a mix of unicode and str objects, and it is the str objects that cause problems down the line: since they are not ascii, Python has no way of knowing what encoding to use, so they cannot be decoded automatically into unicode.
Therefore, when performing common operations such as os.path.join(dir, file), where dir is unicode and file is an encoded str, the call will fail if file is not ascii-encoded (the default). The solution is to check each filename as soon as it is retrieved and decode the str (encoded) objects to unicode using the appropriate encoding.
That's the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy way to check is to call:
filename.decode('windows-1252')
If a valid unicode version results, you probably have the correct encoding. You can further verify by calling print on the unicode version and seeing the correct filename rendered.
One last wrinkle: on a Linux system with files of Windows origin, it is possible, even probable, to have a mix of windows-1252 and utf8 filenames. There are two ways of dealing with this mixture. The first, and preferable, is to run:
$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest
where DIRECTORY is the one containing the files needing conversion. This command converts any windows-1252 encoded filenames to utf8. It does a smart conversion: if a filename is already utf8 (or ascii), it does nothing.
The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:
def decodeName(name):
    if type(name) == str:  # leave unicode ones alone
        try:
            name = name.decode('utf8')
        except UnicodeDecodeError:
            name = name.decode('windows-1252')
    return name
The function tries a utf8 decoding first. If that fails, it falls back to windows-1252. Use it on the file lists as soon as an os call returns them:
for root, dirs, files in os.walk(path):
    files = [decodeName(f) for f in files]
    # do something with the unicode filenames now
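A Python 3 analogue of the same fallback, operating on the bytes names returned by calls such as os.listdir(b'.') (a sketch, assuming the same utf8-first, windows-1252-second policy; decode_name is a hypothetical helper name):

```python
def decode_name(name):
    """Decode a bytes filename, preferring utf8 and falling back to windows-1252."""
    if isinstance(name, bytes):
        try:
            return name.decode('utf8')
        except UnicodeDecodeError:
            return name.decode('windows-1252')
    return name  # already str

# Valid utf8 decodes directly; a windows-1252 byte fails utf8 and
# falls through to the fallback codec.
print(decode_name('café'.encode('utf8')))          # café
print(decode_name('café'.encode('windows-1252')))  # café (via the fallback)
```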
I personally found the entire subject of unicode and encodings very confusing until I read this wonderful and simple tutorial:
http://farmdev.com/talks/unicode/
I highly recommend it for anyone struggling with unicode issues.
I can reproduce the os.listdir() behavior: os.listdir(unicode_name) returns undecodable entries as bytes on Python 2.7:
>>> import os
>>> os.listdir(u'.')
[u'abc', '<--\x8b-->']
Notice: the second name is a bytestring despite listdir()'s argument being a Unicode string.
A big question remains however - how can this be solved without resorting to this hack?
Python 3 solves the problem of bytes in filenames that are undecodable using the filesystem's character encoding via the surrogateescape error handler (os.fsencode/os.fsdecode). See PEP 383: Non-decodable Bytes in System Character Interfaces:
>>> os.listdir(u'.')
['abc', '<--\udc8b-->']
Notice: both strings are Unicode (Python 3), and the surrogateescape error handler was used for the second name. To get the original bytes back:
>>> os.fsencode('<--\udc8b-->')
b'<--\x8b-->'
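The surrogateescape handler works with any codec, so the round trip can be demonstrated without touching the filesystem; a small Python 3 sketch:

```python
# Undecodable bytes become lone surrogates in U+DC80..U+DCFF; encoding
# with the same handler restores the original bytes exactly.
name_bytes = b'<--\x8b-->'
name_str = name_bytes.decode('utf-8', 'surrogateescape')
print(ascii(name_str))  # '<--\udc8b-->'
restored = name_str.encode('utf-8', 'surrogateescape')
print(restored == name_bytes)  # True
```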
In Python 2, use Unicode strings for filenames on Windows (Unicode API), OS X (utf-8 is enforced) and use bytestrings on Linux and other systems.
The byte \x8b is not valid utf-8 in that position. os.path expects filenames to be in utf-8. If you want to access invalid filenames, you have to pass os.walk a non-unicode startpath; this way the os module will not do the utf8 decoding. You would then have to do it yourself and decide what to do with filenames that contain incorrect characters.
I.e.:
for root, dirs, files in os.walk(startpath.encode('utf8')):
After examining the source of the error: something happens within the C-coded listdir routine, which returns non-unicode filenames when they are not standard ascii. The only fix, therefore, is to do a forced decode of the directory list within os.walk, which requires replacing os.walk. This replacement function works:
def asciisafewalk(top, topdown=True, onerror=None, followlinks=False):
    """
    Duplicate of os.walk, except we do a forced decode after listdir.
    """
    islink, join, isdir = os.path.islink, os.path.join, os.path.isdir
    try:
        names = os.listdir(top)
        # force non-ascii text out
        names = [name.decode('utf8', 'ignore') for name in names]
    except os.error, err:
        if onerror is not None:
            onerror(err)
        return
    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)
    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            for x in asciisafewalk(new_path, topdown, onerror, followlinks):
                yield x
    if not topdown:
        yield top, dirs, nondirs
By adding the line:
names = [name.decode('utf8','ignore') for name in names]
all the names come back as proper unicode, and everything works correctly.
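One caveat with decoding via 'ignore': the undecodable bytes are silently dropped, so the cleaned-up name no longer matches the on-disk entry and cannot be used to reopen the file. A quick illustration (Python 3 syntax, but the codec behaviour is the same):

```python
# Decoding with 'ignore' drops the offending bytes entirely, so the
# result can no longer be encoded back to the real on-disk name.
raw = b'bad\x8bname'
cleaned = raw.decode('utf8', 'ignore')
print(cleaned)  # badname
print(cleaned.encode('utf8') == raw)  # False
```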
A big question remains however - how can this be solved without resorting to this hack?
I got this problem when using os.walk on some directories with Chinese (unicode) names. I implemented the walk function myself as follows, and it worked fine with unicode dir/file names.
import os

ft = []  # accumulates (path, filename, size) tuples

def walk(dir, cur):
    fl = os.listdir(dir)
    for f in fl:
        full_path = os.path.join(dir, f)
        if os.path.isdir(full_path):
            walk(full_path, cur)
        else:
            path, filename = full_path.rsplit('/', 1)
            ft.append((path, filename, os.path.getsize(full_path)))
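For what it is worth, Python 3's os.walk handles such names directly; a sketch using a temporary directory with a Chinese subdirectory name (assuming the filesystem accepts UTF-8 names):

```python
import os
import tempfile

base = tempfile.mkdtemp()
sub = os.path.join(base, '中文目录')   # "Chinese directory"
os.mkdir(sub)
with open(os.path.join(sub, '文件.txt'), 'w') as fh:  # "file.txt"
    fh.write('x')

found = []
for root, dirs, files in os.walk(base):
    for f in files:
        found.append((root, f, os.path.getsize(os.path.join(root, f))))

print(len(found))  # 1
```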
I am using os.walk to traverse a folder. There are some non-ascii-named files in there. For these files, os.walk gives me something like ???.txt, and I cannot call open with such file names; it complains [Errno 22] invalid mode ('rb') or filename. How should I work this out?
I am using Windows 7, python 2.7.11. My system locale is en-us.
Listing directories using a bytestring path on Windows produces directory entries encoded to your system locale. This encoding (done by Windows) can fail if the system locale cannot actually represent those characters, resulting in placeholder characters instead. The underlying filesystem, however, can handle the full Unicode range.
The work-around is to use a unicode path as the input; so instead of os.walk(r'C:\Foo\bar\blah') use os.walk(ur'C:\Foo\bar\blah'). You'll then get unicode values for all parts instead, and Python uses a different API to talk to the Windows filesystem, avoiding the encoding step that can break filenames.
I have a Python script that takes the info of a webpage and then saves it to a text file. But the name of this text file changes from time to time: it can change to Cyrillic letters sometimes, and sometimes Korean letters.
The problem is that if I try to save the file with a name such as "бореиская", the name appears very weird when I view it in Windows.
I'm guessing I need to change some encoding at some places. But the name is being sent to the open() function:
server = "бореиская"
file = open("eu_" + server + ".lua", "w")
I am, earlier on, taking the server variable from an array that already has all the names in it.
But as previously mentioned, in Windows, the names appear with some very weird characters.
tl;dr
Always use Unicode strings for file names and paths. E.g.:
io.open(u"myfile€.txt")
os.listdir(u"mycrazydirß")
In your case:
server = u"бореиская"
file = open(u"eu_" + server + ".lua", "w")
I assume server will come from another location, so you will need to ensure that it's decoded to a Unicode string correctly. See io.open().
Explanation
Windows
Windows stores filenames using UTF-16. The Windows I/O API and Python hide this detail, but they require Unicode strings; otherwise a string has to use the correct 8-bit codepage.
Linux
Filenames can be made from any byte string, in any encoding, as long as the name is not "." or ".." and contains no "/" or NUL bytes. As each system user can have their own locale, you really can't guarantee the encoding one user used is the same as another's. The locale is used to configure each user's environment. The user's terminal encoding also needs to match the filename encoding for consistency.
The best that can be hoped is that a user hasn't changed their locale and all applications are using the same locale. For example, the default locale may be: en_GB.UTF-8, meaning the encoding of files and filenames should be UTF-8.
When Python encounters a Unicode filename, it will use the user's locale to decode/encode filenames. An encoded string will be passed directly to the kernel, meaning you may get lucky with using "UTF-8" filenames.
OS X
OS X's filenames are always UTF-8 encoded, regardless of the user's locale. Therefore, a filename should be a Unicode string; a filename encoded in the user's locale is also accepted and will be translated. As most users' locales are *.UTF-8, this means you can actually pass a UTF-8 encoded string or a Unicode string.
Roundup
For best cross-platform compatibility, always use Unicode strings as in most cases they will be translated to the correct encoding. It's really just Linux that has the most ambiguity, as some applications may choose to ignore the default locale or a user may have changed their locale to a non-UTF-8 version.
I'm viewing it in windows. ...Using python 2.7
Use Unicode filenames on Windows; Python can use the Unicode API there.
Do not use non-ascii characters in bytestring literals (it is explicitly forbidden on Python 3).
use Unicode literals u'' or add from __future__ import unicode_literals at the top of the module
make sure the encoding declaration (# -*- coding: utf-8 -*-) is correct i.e., your IDE/editor uses the specified encoding to save your Python source
#!/usr/bin/env python
# -*- coding: utf-8 -*-
server = u"бореиская"
with open(u"eu_{server}.lua".format(**vars()), "w") as file:
...
On Windows you have to encode the filename, probably to one of the cp125x encodings; I don't know which one, probably cp1251.
filename = "eu_" + server + ".lua"
filename = filename.encode('cp1251')
file = open(filename, 'w')
On Linux you should use utf-8.
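The difference between the two encodings is easy to see on the name itself (Python 3 syntax):

```python
# The same Cyrillic name under the two encodings discussed above.
server = 'бореиская'
cp1251_bytes = server.encode('cp1251')
utf8_bytes = server.encode('utf-8')
print(len(cp1251_bytes))  # 9  -- one byte per Cyrillic letter
print(len(utf8_bytes))    # 18 -- two bytes per Cyrillic letter
```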
In my small project I had to identify the types of the files in a directory, so I went with the python-magic module and did the following:
import os
import magic

from Tkinter import Tk
from tkFileDialog import askdirectory

def getDirInput():
    root = Tk()
    root.withdraw()
    return askdirectory()

di = getDirInput()
print('Selected Directory: ' + di)
for f in os.listdir(di):
    m = magic.Magic(magic_file='magic')
    print 'Type of ' + f + ' --> ' + m.from_file(f)
But it seems that python-magic can't take unicode filenames as-is when I pass them to the from_file() function. Here's a sample output:
Selected Directory: C:/Users/pruthvi/Desktop/vidrec/temp
Type of log.txt --> ASCII English text, with very long lines, with CRLF, CR line terminators
Type of TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4 --> cannot open `TAEYEON \355\234\227_ I (feat. Verbal Jint)_Music Video.mp4' (No such file or directory)
Type of test.py --> a python script text executable
You can observe that python-magic failed to identify the type of the second file, TAEYEON..., because it had unicode characters in it. It shows the 태연 characters as \355\234\227 instead, although I passed the same name in both cases. How can I overcome this problem and find the type of files with Unicode characters as well? Thank you.
But It seems that python-magic can't take unicode filenames
Correct. In fact most cross-platform software on Windows can't handle non-ASCII characters in filenames.
This is because the C standard library uses byte strings for all filenames but Windows uses Unicode strings (technically, UTF-16 code unit strings, but the difference isn't important here). When software using the C standard library opens a file by byte-based string, the MS C runtime converts that to a Unicode string automatically, using an encoding (the confusingly-named ‘ANSI’ code page) that depends on the locale of the Windows installation. Your ANSI code page is probably 1252, which can't encode Korean characters, so it's impossible to use that filename. The ANSI code page is unfortunately never anything sensible like UTF-8, so it can never include all possible Unicode characters.
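This is easy to verify: encoding a Korean string with the Western code page fails outright (Python 3 syntax; cp1252 is the codec name for that ANSI code page):

```python
# cp1252 (the Western 'ANSI' code page) has no mapping for Hangul,
# so the encode step that the C runtime would perform must fail.
try:
    '태연'.encode('cp1252')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc)
```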
Python is special in that it has extra support for Windows Unicode filenames which bypasses the C standard library and calls the underlying Win32 APIs directly for Unicode filenames. So you can pass a unicode string using eg open() and it will work for all filenames.
However python-magic's from_file call doesn't open the file from Python. Instead it passes the filename to the libmagic library which is written in pure C. libmagic doesn't have the special Windows-filename code path for Unicode so this fails.
I suggest opening the file yourself from Python and using magic.from_buffer instead.
The response from the magic module seems to show that your characters were incorrectly translated somewhere: only half the string is shown, and the byte order of 태 is wrong; it should be \355\227\234 at least.
As this is on Windows, this raises UTF-16 byte-order alarm bells.
It might be possible to work around this by encoding to UTF-16. As suggested by other commenters, you need to prefix the directory.
import locale
import os
import magic

# di is the directory selected in the question's code
input_encoding = locale.getpreferredencoding()
u_di = di.decode(input_encoding)

m = magic.Magic(magic_file='magic')  # only needs to be initialised once
for f in os.listdir(u_di):
    fq_f = os.path.join(u_di, f)
    utf16_fq_f = fq_f.encode("UTF-16LE")
    print m.from_file(utf16_fq_f)
I have a collection of files from an older Mac OS file store. I know that there are filename/path name issues with the collection. The issue stems from the inclusion of a codepoint in the path that I think was rendered as a dash in the original OS; Windows struggles with the codepoint and either includes a diacritic on the previous character or replaces it with a ?.
I'm trying to figure out a way to establish a "truth" of the file structure, so I can be sure I'm accounting for every file.
I have explored the files with a few tools, and nothing has matching tallies. I believe the following demonstrates the problem.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

folder = "emails"
b = os.listdir(folder)
for f in b:
    print repr(f)
    print os.path.isfile(os.path.join(folder, f))
(I have to redact the actual filenames a little)
Results in:-
'file(1)'
True
'file(2)'
True
'file(3)?'
False
'file(4)'
True
The file name of interest is file(3)?, where the odd codepoint has been decoded as a ?, and which evaluates as not being a file (or even existing, via os.path.exists).
Note that print repr(string) shows that it is handling a properly encoded UTF-8 ?.
I can copy-paste the filename from the folder and it appears as: file(3) note the full stop.
I can paste the string into my editor (subl) and see that I now have an undisplayable glyph for the final codepoint:
a = "file(3)"
print a
print repr(a)
Gives me:
file(3)
'file(3)\xef\x80\xa9'
From this I can see that the odd code point is \xef\x80\xa9. Elsewhere in the set I also find the codepoint \xef\x80\xa8.
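Decoding those byte sequences shows which code points they are (Python 3 syntax): both land in the Unicode Private Use Area, which is why no standard glyph is rendered for them.

```python
# Both sequences decode to code points in the Private Use Area
# (U+E000..U+F8FF), for which no standard glyph exists.
print(ascii(b'\xef\x80\xa9'.decode('utf-8')))  # '\uf029'
print(ascii(b'\xef\x80\xa8'.decode('utf-8')))  # '\uf028'
```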
I must assume that os.listdir is not returning raw codepoint values but a (UTF-8?) encoded string with a codepoint substitution, which means that when it tests for exists or isfile it is testing for the existence of the wrong filename, as the file with a substituted ? does not exist.
How do I work with these files safely? I have around 40 in a collection of around 700 files.
Try passing a unicode to os.listdir:
folder = u"emails"
b = os.listdir(folder)
Doing so will cause os.listdir to return a list of unicodes instead of strs.
Unfortunately, the more I think about this the less I understand why it worked. Every filesystem ultimately stores its filenames in bytes using some encoding. HFS+, for instance, stores filenames in UTF-16. So it would make sense if os.listdir could return those raw bytes most easily, without adulteration. But instead, in this case, it looks like os.listdir can return unadulterated unicode, but not unadulterated bytes.
If someone could explain that mystery I would be most appreciative.
Did the files come from the Mac Roman encoding (presumably what classic MacOS used), or the NFKD normal form of UTF-8 that Mac OS X uses?
The concept of Unicode normal forms is one that every programmer ought to be familiar with... precious few are, though. I can't tell you what you need to know about this with regard to Python, though.
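As a concrete illustration of normal forms (Python 3 syntax): the precomposed é and the sequence of e plus a combining acute accent compare unequal until normalized, which is exactly the kind of mismatch an NFD-normalizing filesystem can introduce.

```python
import unicodedata

composed = '\u00e9'      # é as a single code point (NFC form)
decomposed = 'e\u0301'   # e followed by combining acute accent (NFD form)

print(composed == decomposed)                                  # False
print(unicodedata.normalize('NFD', composed) == decomposed)    # True
print(unicodedata.normalize('NFC', decomposed) == composed)    # True
```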