Python zipfile module can't extract filenames with Chinese characters

I'm trying to use a Python script to download files from a Chinese service provider (I'm not from China myself). The provider gives me a .zip file containing a file that seems to have Chinese characters in its name. This seems to be causing the zipfile module to barf.
Code:
import zipfile
f = "/path/to/zip_file.zip"
if zipfile.is_zipfile(f):
    fz = zipfile.ZipFile(f, 'r')
The zip file's own name doesn't contain any non-ASCII characters, but the name of the file inside it does. When I run the above script, I get the following exception:
Traceback (most recent call last):
  File "./temp.py", line 9, in <module>
    fz = zipfile.ZipFile(f, 'r')
  File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 859, in _RealGetContents
    x.filename = x._decodeFilename()
  File "/usr/lib/python2.7/zipfile.py", line 379, in _decodeFilename
    return self.filename.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbd in position 30: invalid start byte
I've tried looking through the answers to many similar questions:
Read file with Chinese Characters
Extract zip files with non-unicode filenames
Extract files with invalid characters
Please correct me if I'm wrong, but it looks like an open issue with the zipfile module.
How do I get around this? Is there any alternative module for dealing with zipfiles that I should use? Or any other solution?
TIA.
Edit:
I can access/unzip the same file perfectly with the linux command-line utility "unzip".

Python 2.x (2.7) and Python 3.x handle non-UTF-8 filenames in the zipfile module a bit differently.
Both first check the file's ZipInfo.flag_bits: if ZipInfo.flag_bits & 0x800 is set, the filename is decoded as UTF-8.
If that flag is not set, Python 2.x returns the filename's raw byte string, while Python 3.x decodes the name with the cp437 encoding and returns the result. In neither version does the module know the true encoding of the filename.
So, suppose you have got a filename from a ZipInfo object or the zipfile.namelist method, and you already know the filename is encoded with some encoding XXX. These are the ways to get the correct unicode filename:
# in python 2.x
filename = filename.decode('XXX')
# in python 3.x
filename = filename.encode('cp437').decode('XXX')
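For example, here is a minimal Python 3 sketch combining both steps; the flag check and the cp437 round-trip follow the zipfile behavior described above, while 'gbk' is an assumption you should replace with your archive's real encoding:

import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip') as zf:
    for info in zf.infolist():
        if info.flag_bits & 0x800:
            # UTF-8 flag set: the module already decoded the name correctly
            name = info.filename
        else:
            # No flag: undo the module's cp437 decoding, then decode properly
            name = info.filename.encode('cp437').decode('gbk')  # 'gbk' is assumed
        print(name)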

I recently ran into the same problem. Here is my solution; I hope it is useful for you.
import shutil
import zipfile

f = zipfile.ZipFile('/path/to/zip_file.zip', 'r')
for fileinfo in f.infolist():
    filename = fileinfo.filename.encode('cp437').decode('gbk')
    outputfile = open(filename, "wb")
    shutil.copyfileobj(f.open(fileinfo.filename), outputfile)
    outputfile.close()
f.close()
UPDATE: You can use the following simpler solution with pathlib:
from pathlib import Path
import zipfile
with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    for fn in f.namelist():
        extracted_path = Path(f.extract(fn))
        extracted_path.rename(fn.encode('cp437').decode('gbk'))

The ZIP file is invalid. It has a flag that signals that filenames inside it are encoded as UTF-8, but they're actually not; they contain byte sequences that aren't valid as UTF-8. Maybe they're GBK? Maybe something else? Maybe some unholy inconsistent mixture? ZIP tools in the wild are unfortunately very very poor at handling non-ASCII filenames consistently.
A quick workaround might be to replace the library function that decodes the filenames. This is a monkey-patch as there isn't a simple way to inject your own ZipInfo class into ZipFile, but:
zipfile.ZipInfo._decodeFilename = lambda self: self.filename
would disable the attempt to decode the filename, and always return a ZipInfo with a byte string filename property that you can proceed to decode/handle manually in whatever way is appropriate.
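A minimal usage sketch under Python 2.7 (where _decodeFilename is a private method, so this may break in other versions), assuming the names turn out to be GBK:

import zipfile

# Monkey-patch: return the raw byte string instead of attempting UTF-8
zipfile.ZipInfo._decodeFilename = lambda self: self.filename

zf = zipfile.ZipFile('/path/to/zip_file.zip', 'r')
for info in zf.infolist():
    # info.filename is now a byte string; decode it however is appropriate
    print(info.filename.decode('gbk'))  # 'gbk' is an assumption
zf.close()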

This is almost 6 years late, but this was finally fixed in Python 3.11 with the addition of the metadata_encoding parameter. I posted this answer here anyway to help other people with similar issues.
import zipfile

f = "your/zip/file.zip"
t = "the/dir/where/you/want/to/extract/it/all"
with zipfile.ZipFile(f, "r", metadata_encoding="utf-8") as zf:
    zf.extractall(t)
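If the names turn out to be GBK rather than UTF-8, the same parameter accepts that codec (note it only applies to entries that don't set the UTF-8 flag):

with zipfile.ZipFile(f, "r", metadata_encoding="gbk") as zf:
    zf.extractall(t)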

What about this code?
import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    zipInfo = f.infolist()
    for member in zipInfo:
        member.filename = member.filename.encode('cp437').decode('gbk')
        f.extract(member)

@Mr.Ham's solution perfectly solved my problem. I'm using the Chinese version of Win10, where the default file-system encoding is GBK.
I think for users of other languages, simply changing the decode from GBK to their system's default encoding will also work, and the default system encoding can be obtained automatically in Python.
So the patched code looks like this:
import zipfile
import locale

default_encoding = locale.getpreferredencoding()
with zipfile.ZipFile("/path/to/zip_file.zip") as f:
    zipinfo = f.infolist()
    for member in zipinfo:
        member.filename = member.filename.encode('cp437').decode(default_encoding)
        # The second argument extracts the files next to the zip file;
        # leave it out to extract into your working directory.
        f.extract(member, "/path/to/zip_file")


How can I convince Python 3 to treat my text file as UTF-8?

I need to search and replace a particular character in a few .php files in a local directory, under Windows.
I tried one of the examples given as an answer to the question Do a search-and-replace across all files in a folder through python?, which in my case means this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import re

_replace_re = re.compile("ş")
for dirpath, dirnames, filenames in os.walk("./something/"):
    for file in filenames:
        if file.endswith(".php"):
            file = os.path.join(dirpath, file)
            tempfile = file + ".temp"
            with open(tempfile, "w") as target:
                with open(file) as source:
                    for line in source:
                        line = _replace_re.sub("ș", line)
                        target.write(line)
            os.remove(file)
            os.rename(tempfile, file)
While running it, I get this error:
Traceback (most recent call last):
  File "[unimportant_path]\replace.py", line 19, in <module>
    for line in source:
  File "C:\python31.064\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x83 in position 1393: character maps to <undefined>
Indeed, the 8-bit MS codepage CP1250 is undefined at 0x83. The absolute position 0x0571 (i.e. 1393 decimal) of the file where this error occurs contains the byte 0x83, which in this case is part of the UTF-8 encoding of the character ă (whose complete UTF-8 bytes are 0xC4 0x83).
Questions:
● Why does Python 3 try to read a text file in some 8-bit codepage instead of reading it directly as Unicode?
● What can I do to force reading the file as true UTF-8?
Add the encoding option to the open function.
with open(tempfile, "w", encoding="utf-8") as target:
    with open(file, encoding="utf-8") as source:
Further details about the open builtin: https://docs.python.org/3/library/functions.html?highlight=open#open
Currently Python uses the local system encoding unless UTF-8 mode is enabled. There is a PEP proposal (not accepted as of now) to change the default to UTF-8, but even if accepted it's a few Python versions away, so it's best to be explicit in the code.
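To check which default your interpreter would pick: open() consults locale.getpreferredencoding(False) when no encoding is given, so you can print that directly:

import locale

# e.g. 'cp1252' on many Windows setups, 'UTF-8' when UTF-8 mode is on
print(locale.getpreferredencoding(False))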
open takes an optional keyword argument specifying the encoding.
open(tempfile, "w", encoding="utf-8")

Can't read a .csv with python: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position..." [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
  File "SCRIPT LOCATION", line NUMBER, in <module>
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to @LennartRegebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set:
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it accordingly. If the file contains characters with values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file can contain mixed encodings.
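You can reproduce the error in isolation; 0x90 is one of the few byte values cp1252 leaves undefined:

b'\x90'.decode('cp1252')  # raises UnicodeDecodeError: character maps to <undefined>
b'\x90'.decode('cp437')   # 'É', defined, so no error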
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The characters are then left intact, but other errors will be masked too.
A very good solution is to specify the encoding, yet not just any encoding (like cp1252), but one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
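A small sketch of that round-trip property, using a hypothetical file name:

raw = open('some_file.bin', 'rb').read()  # 'some_file.bin' is a placeholder
text = raw.decode('cp437')                # never raises: all 256 bytes map
assert text.encode('cp437') == raw        # and the mapping is reversible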
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case, the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090), and then consider removing it from the file.
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
Or, for writing:
def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit configurations (in the Configuration tab, change the value in the field Interpreter options to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
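You can verify the option took effect from inside the interpreter:

import sys

# 1 when running under -Xutf8 or PYTHONUTF8=1, 0 otherwise (Python 3.7+)
print(sys.flags.utf8_mode)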
For me, changing the MySQL character encoding to match my code helped to sort out the solution:
photo = open('pic3.png', encoding='latin1')

Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I don't really see any good examples on usage. Would this be the best way to handle this?
source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I've seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?
Edit 1: proposed solution from below (thanks!):
fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)
This gives me the following error:
IOError: [Errno 9] Bad file descriptor
Newsflash: I'm being told in the comments that the mistake is that I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()
It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter file to a new file, like in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
As for guessing the encoding, you can just loop through encodings from most to least specific:
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we fall back to Latin-1, which will always work since all 256 bytes are legal values in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
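A hypothetical usage, reading the file in binary mode so decode() receives raw bytes ('unknown.txt' is a placeholder):

with open('unknown.txt', 'rb') as fp:
    text = decode(fp.read())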
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:
s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
import codecs
import shutil
import sys

s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)
shutil.copyfileobj(sys.stdin, sys.stdout)
I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.
For those who are looking for a solution to remove the header so that ConfigParser can open the config file instead of reporting the error File contains no section headers, please open the file like the following:
configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")
This could save you tons of effort by making removal of the file's BOM header unnecessary.
(I know this sounds unrelated, but hopefully this helps people struggling like me.)
This is my implementation for converting any kind of encoding to UTF-8 without BOM and replacing Windows line endings with the universal format:
import codecs
import os

import chardet  # third-party; install with: pip install chardet


def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endlines by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
        by default convert endlines to universal format.
    '''
    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))
    # Read from file as raw bytes
    with open(file_path, 'rb') as file_open:
        raw = file_open.read()
    # Decode using the detected encoding
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove Windows end lines
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, b'', 1)
    # Write back to the same file as bytes
    with open(file_path, 'wb') as file_open:
        file_open.write(raw)
    return 0
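Hypothetical usage (note the third-party chardet dependency):

utf8_converter('~/documents/legacy_notes.txt')  # placeholder path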
You can use codecs.
import codecs

with open("test.txt", 'rb') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print content.decode("utf-8")
In Python 3 you should add encoding='utf-8-sig':
with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)
that's it.

Reading Japanese filenames in windows, using Python and glob not working

I just set up PortablePython on my system so I can run Python scripts from PHP, and I got some very basic code (below) to list all the files in a directory; however, it doesn't work with Japanese filenames. It works fine with English filenames, but it spits out errors (below) when I put any file containing Japanese characters in the directory.
import os, glob

path = 'G:\path'
for infile in glob.glob(os.path.join(path, '*')):
    print("current file is: ", infile)
It works fine using 'PyScripter-Portable.exe', however when I try to run 'PortablePython\App\python.exe "test.py"' in the command prompt or from PHP it spits out the following errors:
current file is: Traceback (most recent call last):
  File "test.py", line 5, in <module>
    print("current file is: ", infile)
  File "PortablePython\App\lib\io.py", line 1494, in write
    b = encoder.encode(s)
  File "PortablePython\App\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 37-40: character maps to <undefined>
I'm very new to Python and am just using this to get around a PHP issue with not being able to read unicode filenames in Windows... So I really need this to work - any help you can give me would be great.
The problem is probably that whatever output destination you're printing to doesn't use the same encoding as the file system. The general rule is that you should get text into Unicode as soon as possible, and then convert to whatever byte encoding you need upon output (e.g. utf-8).
Since you're dealing with filenames, they should be in the system encoding.
import glob
import os
import sys

fse = sys.getfilesystemencoding()
filenames = [unicode(x, fse) for x in glob.glob(os.path.join(path, '*'))]
Now all your filenames are Unicode, and you need to figure out the correct encoding to output from the command prompt or whatever (you can launch a Unicode version of the command prompt with the u flag: "cmd /u")
Assuming you're using Python 2.x, try changing your strings to unicode, like this:
path = u'G:\path'
for infile in glob.glob(os.path.join(path, u'*')):
    print(u"current file is: ", infile)
That should let python's filesystem-related functions know that you want to work with unicode file names.
Example: loading files with Unicode symbols in the path:
from glob import glob

import librosa

# Files have Chinese characters in the path.
# Find all wav files:
replays_files = glob('<you-path>/**/*.wav', recursive=True)
s = replays_files[1478]
# Will be something like this:
# '<you-path>\udde6\uabae\udc9a\udce4_audio.wav'

# If you try to load it:
librosa.core.load(s, sr=16000, mono=True)
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 222-242: ordinal not in range(128)

# Re-encode the surrogates:
s = s.encode('ascii', 'surrogateescape').decode()
# Still doesn't work:
librosa.core.load(s, sr=16000, mono=True)
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 222-228: ordinal not in range(128)

# Pass raw UTF-8 bytes instead:
s = s.encode('utf-8')
# b'<you-path>\xe6\xbe\x7a\xe4\xb8_audio.wav'
# Works:
librosa.core.load(s, sr=16000, mono=True)
