I'm using Python 2.7.3 on Ubuntu 12 x64.
I have about 200,000 files in a folder on my filesystem. The file names of some of the files contain html encoded and escaped characters because the files were originally downloaded from a website. Here are examples:
Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg
I wrote a simple Python script that goes through the folder and renames all of the files with encoded characters in the filename. The new filename is achieved by simply decoding the string that makes up the filename.
The script works for most of the files, but, for some of the files Python chokes and spits out the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Traceback (most recent call last):
  File "./download.py", line 53, in downloadGalleries
    numDownloaded = downloadGallery(opener, galleryLink)
  File "./download.py", line 75, in downloadGallery
    filePathPrefix = getFilePath(content)
  File "./download.py", line 90, in getFilePath
    return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip())
  File "/home/abc/XYZ/common.py", line 22, in cleanupString
    return HTMLParser.HTMLParser().unescape(string)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
Here is the contents of my cleanupString function:
def cleanupString(string):
    string = urllib2.unquote(string)
    return HTMLParser.HTMLParser().unescape(string)
And here's the snippet of code that calls the cleanupString function (this code is not the same code in the traceback above but it produces the same error):
rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []
for root, dirs, files in os.walk(rootFolder):
    for filename in files:
        foundFile = os.path.join(root, filename)
        if reobj.match(foundFile):
            imgs.append(foundFile)
for img in imgs:
    print 'Checking file: ' + img
    newImg = cleanupString(img)  # Code blows up here for some files
Can anyone provide me with a way to get around this error? I've already tried adding
# -*- coding: utf-8 -*-
to the top of the script but that has no effect.
Thanks.
Your filenames are byte strings that contain UTF-8 bytes representing Unicode characters. The HTML parser normally works with Unicode data instead of byte strings, particularly when it encounters an ampersand escape, so Python is automatically trying to decode the value for you, but by default it uses ASCII for that decoding. This fails for UTF-8 data, as it contains bytes that fall outside the ASCII range.
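To see the failure mode in isolation, a minimal Python 2 sketch (the byte values here are illustrative, not taken from your filenames):
s = 'caf\xc3\xa9'            # byte string: the UTF-8 encoding of u'café'
print s.decode('utf8')       # works: u'caf\xe9'
try:
    s.decode('ascii')        # the implicit decode the HTML parser triggers
except UnicodeDecodeError as e:
    print e                  # 'ascii' codec can't decode byte 0xc3 ...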
You need to explicitly decode your string to a unicode object:
def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')
    return HTMLParser.HTMLParser().unescape(string)
Your next problem will be that you now have unicode filenames, but your filesystem will need some kind of encoding to work with these filenames. You can check what that encoding is with sys.getfilesystemencoding(); use this to re-encode your filenames:
import sys
import urllib2
import HTMLParser

def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')
    return HTMLParser.HTMLParser().unescape(string).encode(sys.getfilesystemencoding())
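To actually rename the files on disk, a minimal sketch of how the cleaned-up name could be applied (the os.path.exists guard is my addition, to avoid clobbering a file that already has the decoded name):
import os

for img in imgs:
    newImg = cleanupString(img)
    if newImg != img and not os.path.exists(newImg):
        os.rename(img, newImg)  # apply the decoded, re-encoded name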
You can read up on how Python deals with Unicode in the Unicode HOWTO.
Looks like you're bumping into this issue. I would try reversing the order in which you call unescape and unquote, since unquote could be adding non-ASCII characters to your filenames, although that may not fix the problem.
What is the actual filename it is choking on?
First of all, I saw similar questions, but nothing worked or was applicable to my problem.
I'm writing a program that takes in a text file with a lot of search queries to be searched on YouTube. The program iterates through the text file line by line, but some lines contain special UTF-8 characters that cannot be decoded, so at a certain point the program stops with a
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
As I cannot check every line of my entries, I want it to catch the error, print the line it was working on, and continue from that point.
As the error is not happening inside my for loop's body, but in the for statement itself, I don't know how to wrap it in a try...except statement.
This is the code:
import urllib.request
import re
from unidecode import unidecode

with open('out.txt', 'r') as infh,\
     open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encountered with line: " + line)
Here is the full error message, showing that the for statement itself is causing the problem:
Traceback (most recent call last):
  File "ytbysearchtolinks.py", line 6, in <module>
    for line in infh:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
If you need an example of input I'm working with: https://pastebin.com/LEkwdU06
The try-except-block looks correct and should allow you to catch all occurring exceptions.
The usage of unidecode probably won't help you, because non-ASCII characters must be encoded in a specific way in URLs; see, e.g., here.
One solution is to use urllib's quote() function. As per the documentation:
Replace special characters in string using the %xx escape.
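For example, a quick illustration (quote() percent-encodes the UTF-8 bytes of any non-ASCII character, and spaces as well):
from urllib.parse import quote

print(quote('café'))        # caf%C3%A9
print(quote('thai trip'))   # thai%20trip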
This is what works for me with the input you've provided:
import urllib.request
from urllib.parse import quote
import re

with open('out.txt', 'r', encoding='utf-8') as infh,\
     open("links.txt", "w") as outfh:
    for line in infh:
        search_keyword = quote(line)
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])
EDIT:
After thinking about it, I believe you are running into the following problem:
You are running the code on Windows, and apparently Python tries to open files with the cp1252 encoding there, while the file that you shared is in UTF-8 encoding:
$ file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators
This would explain the exception you are getting and why it's not being caught by your try-except block: it's raised while reading from the file in the for statement, which sits outside the try block.
Make sure that you are using encoding='utf-8' when opening the file.
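If some lines might still contain bytes that are not valid UTF-8, one option (my suggestion, beyond the original fix) is to pass an error handler so decoding never raises:
# errors='replace' turns undecodable bytes into U+FFFD instead of raising
with open('out.txt', 'r', encoding='utf-8', errors='replace') as infh:
    for line in infh:
        pass  # process the line as before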
I ran your code, but I didn't have any problems. Did you create a virtual environment with virtualenv and install all the packages you use?
I'm very new to coding and Python, so I'm really confused by this error. Here's my code from an exercise where I need to find the most used word in a directory with multiple files:
import pathlib

directory = pathlib.Path('/Users/k/files/Code/exo')
stats = {}
for path in directory.iterdir():
    file = open(str(path))
    text = file.read().lower()
    punctuation = (";", ".")
    for mark in punctuation:
        text = text.replace(mark, "")
    for word in text.split():
        if word in stats:
            stats[word] = stats[word] + 1
        else:
            stats[word] = 1
most_used_word = None
score_max = 0
for word, score in stats.items():
    if score > score_max:
        score_max = score
        most_used_word = word
print("The most used word is:", most_used_word, score_max)
Here's what I get:
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    text = file.read().lower()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 568: invalid start byte
What could cause this error?
Probably your file contains bytes that are not valid UTF-8, so you have to handle the decoding yourself to make the UnicodeDecodeError disappear. You can try reading in binary ('rb') mode, like this:
file = open(str(path), 'rb')
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
(From the docs)
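Keep in mind that in binary mode read() returns bytes, so the rest of the exercise code needs an explicit decode before the string operations; a minimal sketch, with errors='replace' as one possible policy:
file = open(str(path), 'rb')
raw = file.read()
# decode manually; undecodable bytes become U+FFFD instead of raising
text = raw.decode('utf-8', errors='replace').lower()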
I was trying to read a file in Python 2.7, and it was read perfectly. The problem is that when I execute the same program in Python 3.4, this error appears:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
Also, when I run the program on Windows (with Python 3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename, colkey, colvalue):
    f = open(filename, 'r')
    D = dict()
    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey], []) + [line.split(';')[colvalue]]
    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt', 1, 2)
In Python 2,
    f = open(filename, 'r')
    for line in f:
reads lines from the file as bytes.
In Python 3, the same code reads lines from the file as strings. Python 3 strings are what Python 2 calls unicode objects: bytes decoded according to some encoding. The default encoding when Python 3 opens a text file comes from your locale, which in your case is utf-8.
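To illustrate the difference (a minimal sketch; data.txt is a stand-in name):
# Python 2
line = open('data.txt').readline()
type(line)    # <type 'str'> -- raw bytes, nothing was decoded

# Python 3
line = open('data.txt').readline()
type(line)    # <class 'str'> -- text, the bytes were decoded on read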
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
    for line in f:
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
OK, I did what @unutbu told me. The result was a lot of encodings; one of these was cp1250. For that reason I changed:
f = open(filename,'r')
to
f = open(filename,'r', encoding='cp1250')
as @triplee suggested. And now I can read my files.
In my case I can't change the encoding, because my file is really UTF-8 encoded. But some rows are corrupted and cause the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My solution was to open the file in binary mode:
open(filename, 'rb')
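Each line read this way is a bytes object and has to be decoded by hand; a minimal sketch (errors='ignore' simply drops the corrupted bytes, which fits the "some rows are corrupted" case):
with open(filename, 'rb') as f:
    for raw in f:
        line = raw.decode('utf-8', errors='ignore')  # skip corrupted bytes
        # ... process line as text ...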
I'm developing a multiplatform application in PyQt4 and numpy, and currently it doesn't work on my Linux system (Xubuntu 12.04), even though it seems to work great on Windows 7. The problem comes from my file-import method (it is in a PyQt4 class):
def import_folder(self, abs_path, folder_list):
    for folder_i in folder_list:
        filenames = glob.glob("%s/%s/*.txt" % (abs_path, folder_i))
        aa = list(filenames)[1]
        print filenames
        data_fichier = np.genfromtxt("%s" % (aa), delimiter=';', skip_header=35, usecols=[8])
        data_fichier2 = np.zeros((data_fichier.shape[0],))
And this is the error I get:
    data_fichier = np.genfromtxt("%s" %aa,delimiter=';',skip_header=35,usecols=[8])
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 1241, in genfromtxt
    fhd = iter(np.lib._datasource.open(fname, 'rbU'))
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 145, in open
    return ds.open(path, mode)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 472, in open
    found = self._findfile(path)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 315, in _findfile
    filelist += self._possible_names(self.abspath(path))
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 364, in abspath
    splitpath = path.split(self._destpath, 2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
I have printed my "filenames" variable:
[u'/home/*****/Documents/Plate_test/Track-Arousal_experiment_24_06_13_track_1-Trial 2-7-Subject 1.txt', u'/home/*****/Documents/Plate_test/Track-Arousal_experiment_24_06_13_track_1-Trial 2-5-Subject 1.txt']
Therefore, the problem comes from the unicode strings (the "u" at the beginning of the list elements). I don't know at all why I get unicode strings on my Linux system. Do you have any idea how I can remove it and go back to "regular" byte strings (sorry for the terminology, I'm not an expert)? (Or other ideas about my problem?)
(Just so you know, when I launched the method as a simple function, outside the PyQt class, it worked great, so I suspect the class.)
Thanks,
Lem
It seems that the genfromtxt method expects a str and not unicode, since it tries to decode it. glob in your case is returning unicode strings, so try encoding your filenames:
filenames = map(str, filenames)
Or
filenames = map(lambda f: f.encode('utf-8'), filenames)
Or even:
def safe_str(s):
    if isinstance(s, unicode):
        return s.encode('utf-8')
    return s

filenames = map(safe_str, filenames)
This is the code I am using to replace special characters in text files and concatenate them into a single file.
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)

with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = dirpath + "\\" + fname
        with codecs.open(currentfile, encoding='utf8') as infile:
            #print currentfile
            outfile.write(fname)
            outfile.write('\n')
            outfile.write('\n')
            for line in infile:
                line = line.replace(u"´ı", "i")
                line = line.replace(u"ï¬", "fi")
                line = line.replace(u"ﬂ", "fl")
                outfile.write(line)
The first line.replace works fine, while the others do not (which makes sense), and since no errors were generated, I thought there might be a problem of "visibility" (if that's the term). And so I made this:
import codecs

currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ï¬" not in line: print "not found!"
which always prints "not found!", proving that those characters aren't read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, I get this error:
Traceback (most recent call last):
  File "C:\path\to\concat.py", line 30, in <module>
    outfile.write(line)
  File "C:\Python27\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
Since I am not really experienced in Python, I can't figure it out from the different sources already available: the Python documentation (1, 2) and relevant questions on Stack Overflow (1, 2).
I am stuck here. Any suggestions? All answers are welcome!
There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)

with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ï¬", u"fi")
                line = line.replace(u"ﬂ", u"fl")
                outfile.write(line)
This specifies to the interpreter that you used the UTF-8 codec to save your source files, ensuring that the u"´ı" code points are correctly decoded to Unicode values. Using an encoding when opening files with codecs.open() makes sure that the lines you read are decoded to Unicode values and that your Unicode values are written out to the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
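A quick illustration of that rule (Python 2; the directory name is a placeholder):
import os

os.listdir('C:\\docs')    # byte-string argument -> byte-string filenames
os.listdir(u'C:\\docs')   # unicode argument -> unicode filenames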
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.