Reading Japanese filenames in windows, using Python and glob not working

I just set up PortablePython on my system so I can run Python scripts from PHP, and I have some very basic code (below) that lists all the files in a directory. It works fine with English filenames, but it prints errors (below) as soon as the directory contains a file with Japanese characters in its name.
import os, glob
path = 'G:\path'
for infile in glob.glob( os.path.join(path, '*') ):
    print("current file is: ", infile)
It works fine using 'PyScripter-Portable.exe', however when I try to run 'PortablePython\App\python.exe "test.py"' in the command prompt or from PHP it spits out the following errors:
current file is: Traceback (most recent call last):
  File "test.py", line 5, in <module>
    print("current file is: ", infile)
  File "PortablePython\App\lib\io.py", line 1494, in write
    b = encoder.encode(s)
  File "PortablePython\App\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 37-40: character maps to <undefined>
I'm very new to Python and am just using this to get around a PHP issue with not being able to read unicode filenames in Windows... So I really need this to work - any help you can give me would be great.

The problem is probably that whatever output destination you're printing to doesn't use the same encoding as the file system. The general rule is that you should get text into Unicode as soon as possible, and then convert to whatever byte encoding you need upon output (e.g. utf-8).
Since you're dealing with filenames, they should be in the system encoding.
import glob, os, sys
fse = sys.getfilesystemencoding()
# Python 2: glob returns byte strings here, so decode them with the filesystem encoding
filenames = [unicode(x, fse) for x in glob.glob( os.path.join(path, '*') )]
Now all your filenames are Unicode, and you need to work out the correct encoding for output to the command prompt or wherever the text is going (starting the prompt with the u flag, "cmd /u", makes the output of its internal commands Unicode when it is piped or redirected).
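On Python 3, glob already returns str (Unicode) filenames, so only the output step needs care. The following is a minimal sketch under that assumption; the helper names safe_for_console and list_files and the choice of the backslashreplace error handler are illustrative and not part of the original answer:

```python
import glob
import os
import sys


def safe_for_console(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    # Assumption: fall back to UTF-8 when stdout reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


def list_files(path):
    """Yield every entry directly inside path as a Unicode string."""
    # glob.escape keeps characters like [ or * in the directory name literal.
    for name in glob.glob(os.path.join(glob.escape(path), "*")):
        yield name


if __name__ == "__main__":
    for infile in list_files("."):
        print("current file is:", safe_for_console(infile))
```

With a cp437 console this prints escapes such as \u65e5 for characters the code page cannot show, rather than raising UnicodeEncodeError.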

Assuming you're using python 2.x, try changing your strings to unicode, like this:
path = u'G:\path'
for infile in glob.glob( os.path.join(path, u'*') ):
    print( u"current file is: ", infile)
That should let python's filesystem-related functions know that you want to work with unicode file names.

Example: loading files with Unicode symbols in the path:
from glob import glob
import librosa
# The files have non-ASCII characters in their paths
# Find all the .wav files
replays_files = glob('<you-path>/**/*.wav', recursive=True)
s = replays_files[1478]
# s will be something like this:
# '<you-path>\udde6\uabae\udc9a\udce4_audio.wav'
# If you try to load it:
librosa.core.load(s,sr=16000,mono=True)
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 222-242: ordinal not in range(128)
# Replace the surrogate escapes (\udde6 and so on)
s = s.encode('ascii','surrogateescape').decode()
# Still doesn't work
librosa.core.load(s,sr=16000,mono=True)
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 222-228: ordinal not in range(128)
s = s.encode('utf-8')
# b'<you-path>\xe6\xbe\x7a\xe4\xb8_audio.wav'
# Works
librosa.core.load(s,sr=16000,mono=True)

Related

How can I convince Python 3 to treat my text file as UTF-8?

I need a search and replace for a particular character in a few .php files contained in a local directory, under Windows OS.
I tried one of the examples given as an answer to the question "Do a search-and-replace across all files in a folder through python?", which in my case means this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import re
_replace_re = re.compile("ş")
for dirpath, dirnames, filenames in os.walk("./something/"):
    for file in filenames:
        if file.endswith(".php"):
            file = os.path.join(dirpath, file)
            tempfile = file + ".temp"
            with open(tempfile, "w") as target:
                with open(file) as source:
                    for line in source:
                        line = _replace_re.sub("ș", line)
                        target.write(line)
            os.remove(file)
            os.rename(tempfile, file)
While running it, I get this error:
Traceback (most recent call last):
File "[unimportant_path]\replace.py", line 19, in <module>
for line in source:
File "C:\python31.064\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x83 in position 1393: character maps to <undefined>
Indeed, the 8 bit MS codepage CP1250 is "Undefined" at 0x83 and the absolute position 0x0571 (i.e. 1393 decimal) of the file where this error occurs contains the byte 0x83, which in fact in this case is part of the UTF-8 encoding for character ă (for which the complete UTF-8 bytes are 0xC4 0x83).
Questions:
● Why does Python 3 try to read a text file in some 8-bit codepage instead of reading it directly as Unicode?
● What can I do to force it to read the file as true UTF-8?

Add the encoding option to the open function.
with open(tempfile, "w", encoding="utf-8") as target:
    with open(file, encoding="utf-8") as source:
Further details about the open builtin: https://docs.python.org/3/library/functions.html?highlight=open#open
Currently Python uses the local system encoding, unless UTF-8 mode is enabled. There is a PEP proposed (not accepted as of now) to change the default to UTF-8 but even if accepted it's a few Python versions away, so best to be explicit in the code.
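Putting the explicit encoding into the whole search-and-replace loop could look like the sketch below. It is only an illustration: the function name replace_in_tree and its parameters are hypothetical, it reads each file in one go rather than line by line, and it assumes every matching file really is UTF-8.

```python
import os


def replace_in_tree(root, old, new, suffix=".php", encoding="utf-8"):
    """Replace old with new in every file under root whose name ends with suffix.

    Returns the list of paths that were actually modified.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # newline="" leaves the file's original line endings untouched.
            with open(path, encoding=encoding, newline="") as source:
                text = source.read()
            updated = text.replace(old, new)
            if updated != text:
                tmp = path + ".temp"
                with open(tmp, "w", encoding=encoding, newline="") as target:
                    target.write(updated)
                os.replace(tmp, path)  # swap the new file in over the original
                changed.append(path)
    return changed
```

Calling replace_in_tree("./something/", "ş", "ș") would then mirror the script from the question without depending on the system code page.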

open takes an optional keyword argument specifying the encoding.
open(tempfile, "w", encoding="utf-8")

Open binary file as ASCII in Python

I want to open a binary file (yes, another soft synth soundbank) in ASCII format and check if it contains a string or not. There are multiple files in the folder, but I have written the appropriate code for it, I just want it to search the file for a substring.
I've tried opening the same format using the ASCII encoding function before, but it does not display the data I want (it displays some garbled data, totally different from what it does in a hex editor, in which the file is opened in ASCII). Can someone point me in the right direction?
EDIT: As asked below, here is the new code I'm using:
# sbf_check.py (sample code I've written to test the sbf file before implementing it in the main.py file)
path = "C:\\Users\\User\\AppData\\Roaming\\RevealSound\\Banks\\Aura Qualic Trance.sbf"
file = open(path, "rb")
for x in file:
    line = file.readline()
    new = line.decode("ASCII")
    print(new)
main.py file:
import glob, os
path = "C:\\Users\\User\\AppData\\Roaming\\RevealSound\\Banks"
for filename in glob.glob(os.path.join(path, "*.sbf")):
    with open(os.path.join(os.getcwd(), filename), "r") as f:
        # code to decode sbf file to ASCII, then search for the substring in the main string
Hex editor:
(Note: the data circled with red does not matter to me as it's parameter data, I just want to search for the preset name. It's not like my previous question, where I needed to skip the parameter data.)
Code output (VS Code):
Traceback (most recent call last):
  File "c:\Users\User\Desktop\Programming\sbf_check.py", line 6, in <module>
    new = line.decode("ASCII")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)

Does the following do what you want? With errors='ignore', bytes that cannot be decoded in the default text encoding are silently dropped instead of raising a decoding error (use errors='replace' if you would rather see a replacement character in their place).
path = "C:\\Users\\User\\AppData\\Roaming\\RevealSound\\Banks\\" \
       "Aura Qualic Trance.sbf"
with open(path, errors='ignore') as f:
    print(f.read())
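If the goal is only to check whether a preset name occurs in the file, decoding can be skipped altogether by searching the raw bytes. This is a sketch under assumptions: the helper name file_contains is made up, the preset names are taken to be stored as plain ASCII, and the banks are assumed small enough to read into memory at once.

```python
def file_contains(path, needle, encoding="ascii"):
    """Return True if the raw bytes of the file at path contain needle."""
    if isinstance(needle, str):
        # Convert the search text to bytes so it can be matched against binary data.
        needle = needle.encode(encoding)
    with open(path, "rb") as f:
        return needle in f.read()
```

Because nothing is decoded, bytes such as 0xd7 in the parameter data can no longer raise UnicodeDecodeError.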

Python zipfile module can't extract filenames with Chinese characters

I'm trying to use a python script to download files from a Chinese service provider (I'm not from China myself). The provider is giving me a .zip file which contains a file which seems to have Chinese characters in its name. This seems to be causing the zipfile module to barf.
Code:
import zipfile
f = "/path/to/zip_file.zip"
if zipfile.is_zipfile(f):
    fz = zipfile.ZipFile(f, 'r')
The zipfile itself doesn't contain any non-ASCII characters in its name, but the file inside it does. When I run the above script I get the following exception:
Traceback (most recent call last):
  File "./temp.py", line 9, in <module>
    fz = zipfile.ZipFile(f, 'r')
  File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 859, in _RealGetContents
    x.filename = x._decodeFilename()
  File "/usr/lib/python2.7/zipfile.py", line 379, in _decodeFilename
    return self.filename.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbd in position 30: invalid start byte
I've tried looking through the answers to many similar questions:
Read file with Chinese Characters
Extract zip files with non-unicode filenames
Extract files with invalid characters
Please correct me if I'm wrong, but it looks like an open issue with the zipfile module.
How do I get around this? Is there any alternative module for dealing with zipfiles that I should use? Or any other solution?
TIA.
Edit:
I can access/unzip the same file perfectly with the linux command-line utility "unzip".

Python 2.x (2.7) and Python 3.x handle non-UTF-8 filenames in the zipfile module slightly differently.
First, both check ZipInfo.flag_bits of the entry: if ZipInfo.flag_bits & 0x800 is set, the filename is decoded as UTF-8.
If that check fails, Python 2.x returns the raw byte string of the name, while Python 3.x decodes the filename as cp437 and returns the decoded result. Either way, the module cannot know the true encoding of the filename.
So, suppose you have a filename from a ZipInfo object or from the ZipFile.namelist method, and you already know that it is encoded with XXX encoding. This is how you get the correct unicode filename:
# in python 2.x
filename = filename.decode('XXX')
# in python 3.x
filename = filename.encode('cp437').decode('XXX')
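For Python 3, the flag check and the cp437 round trip described above can be combined into one small helper. This is a sketch, not part of the zipfile API: the names real_name and UTF8_FLAG are hypothetical, and legacy_encoding is whatever encoding you already know the archive uses.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the filename is stored as UTF-8


def real_name(info, legacy_encoding):
    """Return the intended filename for a ZipInfo that Python 3 has read."""
    if info.flag_bits & UTF8_FLAG:
        # The entry declared UTF-8, so zipfile already decoded it correctly.
        return info.filename
    # zipfile decoded the raw bytes as cp437; undo that, then decode properly.
    return info.filename.encode("cp437").decode(legacy_encoding)
```

A typical call would be real_name(member, "gbk") for each member of ZipFile.infolist().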

I recently ran into the same problem. Here is my solution; I hope it is useful for you.
import shutil
import zipfile
f = zipfile.ZipFile('/path/to/zip_file.zip', 'r')
for fileinfo in f.infolist():
    filename = fileinfo.filename.encode('cp437').decode('gbk')
    outputfile = open(filename, "wb")
    shutil.copyfileobj(f.open(fileinfo.filename), outputfile)
    outputfile.close()
f.close()
UPDATE: You can use the following simpler solution with pathlib:
from pathlib import Path
import zipfile
with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    for fn in f.namelist():
        extracted_path = Path(f.extract(fn))
        extracted_path.rename(fn.encode('cp437').decode('gbk'))

The ZIP file is invalid. It has a flag that signals that filenames inside it are encoded as UTF-8, but they're actually not; they contain byte sequences that aren't valid as UTF-8. Maybe they're GBK? Maybe something else? Maybe some unholy inconsistent mixture? ZIP tools in the wild are unfortunately very very poor at handling non-ASCII filenames consistently.
A quick workaround might be to replace the library function that decodes the filenames. This is a monkey-patch as there isn't a simple way to inject your own ZipInfo class into ZipFile, but:
zipfile.ZipInfo._decodeFilename = lambda self: self.filename
would disable the attempt to decode the filename, and always return a ZipInfo with a byte string filename property that you can proceed to decode/handle manually in whatever way is appropriate.

This is almost 6 years late, but this was finally fixed in Python 3.11 with the addition of the metadata_encoding parameter. I posted this answer here anyway to help other people with similar issues.
import zipfile
f = "your/zip/file.zip"
t = "the/dir/where/you/want/to/extract/it/all"
with zipfile.ZipFile(f, "r", metadata_encoding = "utf-8") as zf:
    zf.extractall(t)

What about this code?
import zipfile
with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    zipInfo = f.infolist()
    for member in zipInfo:
        member.filename = member.filename.encode('cp437').decode('gbk')
        f.extract(member)

Mr.Ham's solution solved my problem perfectly. I'm using the Chinese version of Windows 10, where the default file system encoding is GBK.
I think users of other languages can simply change the decode step from GBK to their system's default encoding, which Python can look up automatically.
So the patched code looks like this:
import zipfile
import locale
default_encoding = locale.getpreferredencoding()
with zipfile.ZipFile("/path/to/zip_file.zip") as f:
    zipinfo = f.infolist()
    for member in zipinfo:
        member.filename = member.filename.encode('cp437').decode(default_encoding)
        # The second argument extracts the files into the given directory (here, one next to the zip file); leave it out to extract into the working directory.
        f.extract(member, "/path/to/zip_file")

zipfile.write() file with turkish chars in filename

On my system there are many Word documents and I want to zip them using the Python module zipfile.
I have found this solution to my problem, but on my system there are files which contain German umlauts and Turkish characters in their filename.
I have adapted the method from the solution like this, so it can process German umlauts in the filenames:
def zipdir(path, ziph):
    for root, dirs, files in os.walk(path):
        for file in files:
            current_file = os.path.join(root, file)
            print "Adding to archive -> file: "+str(current_file)
            try:
                #ziph.write(current_file.decode("cp1250")) #German umlauts ok, Turkish chars not ok
                ziph.write(current_file.encode("utf-8")) #both not ok
                #ziph.write(current_file.decode("utf-8")) #both not ok
            except Exception,ex:
                print "exception ---> "+str(ex)
                print repr(current_file)
                raise
Unfortunately my attempts to include logic for Turkish characters remained unsuccessful, leaving the problem that every time a filename contains a Turkish character the code prints an exception, for example like this:
exception ---> [Error 123] Die Syntax f³r den Dateinamen, Verzeichnisnamen oder
die Datentrõgerbezeichnung ist falsch: u'X:\\my\\path\\SomeTurk?shChar?shere.doc'
I have tried several encode/decode combinations on the string, but none of them was successful.
Can someone help me out here?
I edited the above code to include the changes mentioned in the comment.
The following errors are now shown:
...
Adding to archive -> file: X:\\my\path\blabla I blabla.doc
Adding to archive -> file: X:\my\path\bla bla³bla³bla³bla.doc
exception ---> 'ascii' codec can't decode byte 0xfc in position 24: ordinal not
in range(128)
'X:\\my\\path\\bla B\xfcbla\xfcbla\xfcbla.doc'
Traceback (most recent call last):
File "Backup.py", line 48, in <module>
zipdir('X:\\my\\path', zipf)
File "Backup.py", line 12, in zipdir
ziph.write(current_file.encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 24: ordinal
not in range(128)
The ³ is actually a German ü.

If you do not need to inspect the ZIP file with any archiver later, you can always encode the filenames to base64 and restore them when extracting with Python.
To any archiver these filenames will look like gibberish, but the encoding will be preserved.
Anyway, to get a byte string (a str in Python 2, a bytes object in Python 3), you have to encode(), not decode().
encode() serializes the unicode() string to bytes:
>>> u"\u0161blah".encode("utf-8")
'\xc5\xa1blah'
decode() turns those bytes back into unicode():
>>> "\xc5\xa1blah".decode("utf-8")
u'\u0161blah'
The same goes for any other codepage.
Sorry for emphasizing that, but people sometimes get confused about encoding and decoding.
If you need the files but aren't too concerned about preserving umlauts and other symbols, you can use:
u"üsdlakui".encode("ascii", "replace")
or:
u"üsdlakui".encode("ascii", "ignore")
This will replace characters that cannot be encoded with "?" or silently drop them. (With "utf-8" as the target these handlers would have nothing to do, because UTF-8 can encode every character.)
That will fix things if the raised error is something like UnicodeEncodeError: ... can't encode character ...
But there will still be a problem with filenames consisting only of non-Latin characters.
Now something that might actually work:
Well,
'Sömethüng'.encode("utf-8")
is bound to raise an ASCII decoding error (UnicodeDecodeError): the literal is a byte string, not a unicode string, so Python 2 first tries to decode it implicitly as ASCII, and the non-ASCII bytes make that fail.
while:
# -*- coding: UTF-8 -*-
u'Sömethüng'.encode("utf-8")
or
# -*- coding: UTF-8 -*-
unicode('Sömethüng', "utf-8").encode("utf-8")
with the encoding declared at the top of the file, and the file saved as UTF-8, should work.
Yes, your strings (the filenames) come from the OS, but that was the problem from the beginning of the story.
Even if the encoding is handled correctly, there is still the ZIP side to solve.
By specification, ZIP should store filenames using CP437, but this is rarely so.
Most archivers use the default OS encoding (MBCS in Python).
And most archivers don't support UTF-8. So what I propose here should work, but not with all archivers.
To tell a ZIP archiver that the archive uses UTF-8 filenames, bit 11 (0x800) of flag_bits should be set. As I said, some archivers do not check that bit. It is a fairly recent addition to the ZIP spec (well, a few years old by now).
I won't write the whole code here, just the part needed to understand the idea.
# -*- coding: utf-8 -*-
# Cannot hurt to have default encoding set to UTF-8 all the time. :D
import os, time, zipfile
zip = zipfile.ZipFile(...)
# Careful here, origname is the full path to the file you will store into ZIP
# filename is the filename under which the file will be stored in the ZIP
# It'll probably be better if filename is not a full path, but relative, not to introduce problems when extracting. You decide.
filename = origname = os.path.join(root, filename)
# Filenames from OS can be already UTF-8, but they can be a local codepage.
# I will use MBCS here to decode from it, so that we can encode to UTF-8 later.
# I recommend getting codepage from OS (from kernel32.dll on Windows) manually instead of using MBCS, but for now:
if isinstance(filename, str): filename = filename.decode("mbcs")
# Else, assume it is already a decoded unicode string.
# Prepare the filename for archive:
filename = os.path.normpath(os.path.splitdrive(filename)[1])
while filename[0] in (os.sep, os.altsep):
    filename = filename[1:]
filename = filename.replace(os.sep, "/")
filename = filename.encode("utf-8") # Get what we need
# The modification time comes from os.path.getmtime (there is no os.getmtime)
zinfo = zipfile.ZipInfo(filename, time.localtime(os.path.getmtime(origname))[0:6])
# Here you should set zinfo.external_attr to store Unix permission bits and set the zinfo.compression_type
# Both are optional and not a subject to your problem. But just as notice.
zinfo.flag_bits |= 0x800 # Set bit 11 to 1, announce the UTF-8 filenames.
f = open(origname, "rb")
zip.writestr(zinfo, f.read())
f.close()
I didn't test it, I just wrote the code, but this is the idea, even if a bug has crept in somewhere.
If this doesn't work, I don't know what will.
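On Python 3 most of this bookkeeping is done by the zipfile module itself: when an archive member's name contains non-ASCII characters, the name is stored as UTF-8 and bit 11 (0x800) is set automatically. The sketch below illustrates that under the assumption that Python 3 is available; the helper name zip_bytes and the in-memory archive are only for demonstration.

```python
import io
import zipfile


def zip_bytes(members):
    """Build a ZIP archive in memory from a {archive_name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in members.items():
            # Non-ASCII names are written as UTF-8 and flagged with bit 11.
            archive.writestr(name, data)
    return buffer.getvalue()


def utf8_flagged(raw):
    """Map each member name in the archive bytes to whether bit 11 is set."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return {i.filename: bool(i.flag_bits & 0x800) for i in archive.infolist()}
```

Archivers that ignore bit 11 will still show such names incorrectly, as the answer above warns.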

UnicodeEncodeError when using the compile function

Using python 3.2 in Windows 7 I am getting the following in IDLE:
>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
Can anybody explain why the compile statement tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.
For example:
f = open(r'c:\temp\工具\module1.py')
works.
For a more complete test save the following in a utf8 encoded file and run it using the standard python.exe version 3.2
# -*- coding: utf8 -*-
fname = r'c:\temp\工具\module1.py'
# I do have a file named fname, but you can comment out the following two lines
f = open(fname)
print('ok')
cmp = compile('pass', fname, 'exec')
print(cmp)
Output:
ok
Traceback (most recent call last):
File "module8.py", line 6, in <module>
cmp = compile('pass', fname, 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

From Python issue 10114, it seems that the logic is that all filenames used by Python should be valid for the platform where they are used. It is encoded using the filesystem encoding to be used in the C internals of Python.
I agree that it probably shouldn't throw an error on Windows, because any Unicode filename is valid. You may wish to file a bug report with Python for this. But be aware that the necessary changes might not be trivial, because any C code using the filename has to have something to do if it can't be encoded.
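To see in advance whether a given filename would survive that encoding step, you can try the encoding yourself. This is only a sketch: the helper name encodable is hypothetical, and it does not reproduce what compile does internally. Note also that since Python 3.6 (PEP 529) sys.getfilesystemencoding() reports 'utf-8' on Windows, so the check mainly matters for older versions or for an explicitly chosen code page.

```python
import sys


def encodable(name, encoding=None):
    """Return True if name can be represented in encoding.

    When encoding is omitted, the interpreter's filesystem encoding is used.
    """
    encoding = encoding or sys.getfilesystemencoding()
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

For example, encodable(r'c:\temp\工具\module1.py', 'cp1252') is False, because a Western European code page has no mapping for the two Chinese characters.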

Here is a solution that worked for me, from Issue 427: UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range (128):
If you look at the PyScripter help file in the topic "Encoded Python Source Files" (last paragraph), it tells you how to configure Python to support other encodings by modifying the site.py file. This file is in the lib subdirectory of the Python installation directory. Find the function setencoding and make sure that the support for locale aware default string encodings is on (see below).
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:  # <<<--- set this to 1 ---------------------------------
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale ()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding (encoding) # Needs Python Unicode build !

I think you could try changing the "\" in the file path to "/", that is, replacing
compile('pass', r'c:\temp\工具\module1.py', 'exec')
with
compile('pass', r'c:/temp/工具/module1.py', 'exec')
I ran into a problem just like yours and solved it this way. I hope it works for you too.
