Simple code with PIL library in Python is not working [duplicate] - python

This question already has answers here:
"Unicode Error "unicodeescape" codec can't decode bytes... Cannot open text files in Python 3 [duplicate]
(10 answers)
Closed 14 days ago.
This is my first time posting on Stack Overflow.
When I debug and run this code:
from PIL import Image
import os
downloadsFolder = "\Users\fersa\Downloads"
picturesFolder = "\Users\fersa\OneDrive\Imágenes\Imagenes Descargadas"
musicFolder = "\Users\fersa\Music\Musica Descargada"
if __name__ == "__main__":
    for filename in os.listdir(downloadsFolder):
        name, extension = os.path.splitext(downloadsFolder + filename)
        if extension in [".jpg", ".jpeg", ".png"]:
            picture = Image.open(downloadsFolder + filename)
            picture.save(picturesFolder + "compressed_"+filename, optimize=True, quality=60)
            os.remove(downloadsFolder + filename)
            print(name + ": " + extension)
        if extension in [".mp3"]:
            os.rename(downloadsFolder + filename, musicFolder + filename)
I get this message in the terminal:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape
PS C:\Users\fersa\OneDrive\Documentos\Automatizacion Python>
but I don't know what it means.
I tried changing the folder paths many times, but it still doesn't work.

Add an r before each string to mark it as a raw string, in which backslashes are not treated as escape characters. As written, Python treats every \ as the start of an escape sequence, and \U in particular as the start of an 8-digit Unicode escape:
from PIL import Image
import os
downloadsFolder = r"\Users\fersa\Downloads"
picturesFolder = r"\Users\fersa\OneDrive\Imágenes\Imagenes Descargadas"
musicFolder = r"\Users\fersa\Music\Musica Descargada"
if __name__ == "__main__":
    for filename in os.listdir(downloadsFolder):
        name, extension = os.path.splitext(downloadsFolder + filename)
        if extension in [".jpg", ".jpeg", ".png"]:
            picture = Image.open(downloadsFolder + filename)
            picture.save(picturesFolder + "compressed_"+filename, optimize=True, quality=60)
            os.remove(downloadsFolder + filename)
            print(name + ": " + extension)
        if extension in [".mp3"]:
            os.rename(downloadsFolder + filename, musicFolder + filename)
You can read more about string prefixes in the Python docs.

The cause of the SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape is the code line:
downloadsFolder = "\Users\fersa\Downloads"
in which "\U" tells Python to interpret the next 8 characters as the hexadecimal value of a Unicode code point. Because the characters that follow ("sers\fer") are not all hexadecimal digits (0-9, A-F, a-f), Python raises the error. No error would occur if the string happened to start with a valid escape, for example "\U000000e9\somefilename"; the path would then be silently wrong, making it harder to find out why no files can be found.
There are two ways to fix the problem:
Use forward slashes instead of backslashes: "/Users/fersa/Downloads".
Define the string as a raw string, r"\Users\fersa\Downloads", so that \U is not interpreted as an escape sequence.
See section 2.4.1, "String and Bytes literals", of the Python documentation for more about escape sequences in string literals.
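As a minimal sketch of the two fixes (Python 3), the snippet below compares the different spellings of the path. The file name photo.jpg is a made-up example, and ntpath is used only so that Windows path rules apply whatever system the sketch runs on:

```python
import ntpath  # Windows path rules, importable on any operating system

# Three spellings of the same folder.
raw = r"\Users\fersa\Downloads"         # raw string: backslashes kept as-is
escaped = "\\Users\\fersa\\Downloads"   # every backslash doubled
forward = "/Users/fersa/Downloads"      # forward slashes, also accepted by Windows

# The unprefixed literal from the question does not even compile.
try:
    compile('folder = "\\Users\\fersa\\Downloads"', "<demo>", "exec")
    compiles = True
except SyntaxError:
    compiles = False

# A path function inserts the separator that plain "+" leaves out.
# "photo.jpg" is a hypothetical file name.
joined = ntpath.join(raw, "photo.jpg")
concatenated = raw + "photo.jpg"

print(raw == escaped)                   # same string, two spellings
print(ntpath.normpath(forward) == raw)  # forward slashes normalise to the same path
print(compiles)
print(joined)
print(concatenated)
```

The script in the question builds paths with downloadsFolder + filename, which has the same missing-separator problem as the last line here, so os.path.join(downloadsFolder, filename) is probably what was intended.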

Related

Get unicode error when attempting to save file

I am trying to save a python-docx document in Ubuntu, but I get this error: 'ascii' codec can't encode character '\xed' in position 65: ordinal not in range(128). I tried to apply this solution, but I get this other error: AttributeError: 'bytes' object has no attribute 'write'.
This is the code that raised the first error:
current_directory = settings.MEDIA_DIR
file_name = "Rooming {} {}-{}.docx".format(hotel, start_date, end_date)
document.save(current_directory + file_name)
This is the code that raised the latest error:
current_directory = settings.MEDIA_DIR
file_name = "Rooming {} {}-{}.docx".format(hotel, start_date, end_date)
document.save((current_directory + file_name).encode('utf-8'))
I know the file name will end up containing non-ASCII characters, but I would like to be able to save the files using all those characters.
The problem arose because Spanish uses accented characters that are not ASCII (áéíóúüñ), and I was building the file name from data that includes such characters. I guess there must be a way to configure the server so this wouldn't be an issue, but I took the short path and replaced the special characters with their unaccented base characters:
current_directory = settings.MEDIA_DIR
file_name = "Rooming {} {}-{}.docx".format(unicodedata.normalize('NFKD', hotel).encode('ascii', 'ignore').decode('ascii'), start_date, end_date)
document.save(current_directory + file_name)
This method replaces characters like this: áéíóúüñÁÉÍÓÚÜÑ -> aeiouunAEIOUUN.
The error disappeared.
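Here is a small self-contained sketch of that replacement, in case it helps. The helper name strip_accents, the hotel name and the dates are made up for illustration:

```python
import unicodedata


def strip_accents(text):
    """Remove accents and drop any characters that are still not ASCII."""
    # NFKD splits "á" into "a" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" then discards the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical sample values standing in for hotel, start_date and end_date.
file_name = "Rooming {} {}-{}.docx".format(
    strip_accents("Hotel Águila"), "2020-01-01", "2020-01-05")

print(strip_accents("áéíóúüñÁÉÍÓÚÜÑ"))
print(file_name)
```

Keep in mind that characters with no decomposition (for example ß or ø) are dropped entirely rather than replaced, so this is lossy.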

Program (twitter bot) works on Windows machine, but not on Linux machine [duplicate]

I was reading a file in Python 2.7 and it was read perfectly. The problem appears when I run the same program in Python 3.4, where I get this error:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
Also, when I run the program in Windows (with python3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()
    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]
    f.close()
    return D
Traduccio = lectdict('Noms_departaments_centres.txt',1,2)
In Python2,
f = open(filename,'r')
for line in f:
reads lines from the file as bytes.
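In Python3, the same code reads lines from the file as strings. Python3
strings are what Python2 calls unicode objects. These are bytes decoded
according to some encoding. When no encoding is given, open() uses the
platform's preferred encoding (locale.getpreferredencoding(False)), which is
utf-8 on most Linux systems but usually a code page such as cp1252 on Windows.
That is why the same program can fail on Linux yet run on Windows.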
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
    for line in f:
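To illustrate, here is a minimal sketch that reproduces the failure and the fix on a temporary file. It assumes, purely for the demonstration, that the data is Latin-1 encoded, so that "ò" is stored as the single byte 0xf2; the rows are made up:

```python
import os
import tempfile

# Assumption for this sketch: Latin-1 data, where "ò" is the single byte 0xf2.
data = "Codi;Codi_lloc_anonim;Nom\n1;2;Pòl\n".encode("latin-1")

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

# Decoding as utf-8 fails, because 0xf2 must be followed by continuation bytes.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Naming the encoding the file was actually written in works.
with open(path, encoding="latin-1") as f:
    rows = [line.rstrip("\n").split(";") for line in f]

os.remove(path)
print(utf8_failed)
print(rows)
```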
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings
def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)
filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
OK, I did what @unutbu told me. The result listed many encodings, one of which was cp1250, so I changed:
f = open(filename,'r')
to
f = open(filename,'r', encoding='cp1250')
as @triplee suggested. Now I can read my files.
In my case I can't change the encoding because my file really is UTF-8 encoded, but some rows are corrupted and cause the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My solution was to open the file in binary mode:
open(filename, 'rb')
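A rough sketch of what that can look like in practice, using a made-up three-line file with one corrupted row. The second option, errors="replace", is an alternative I am adding here rather than something from the answer above:

```python
import os
import tempfile

# Two valid UTF-8 lines around one line with a corrupted byte (0xd0 then a space).
data = ("first;línea\n".encode("utf-8")
        + b"bad;\xd0 row\n"
        + "last;ok\n".encode("utf-8"))

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

# Option 1: read bytes and decode line by line, recording the lines that fail.
good, bad = [], []
with open(path, "rb") as f:
    for number, raw_line in enumerate(f, start=1):
        try:
            good.append(raw_line.decode("utf-8").rstrip("\n"))
        except UnicodeDecodeError:
            bad.append(number)

# Option 2: stay in text mode and let Python substitute U+FFFD for bad bytes.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read().splitlines()

os.remove(path)
print(good, bad)
print(replaced)
```

Option 1 tells you exactly which rows are damaged; option 2 keeps every row but marks the damaged spots with the replacement character.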

Python how to "ignore" ascii text?

I'm trying to scrape some text off a page using selenium, but some of the text contains non-ASCII characters, so I get this:
f.write(database_text.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 1462: ordinal not in range(128)
I was wondering, is there any way to simply ignore the non-ASCII characters?
Thanks!
print("â")
I'm not looking to write it to my text file, but to ignore it.
Note: it's not just "â"; the text has other characters like that as well.
window_before = driver.window_handles[0]
nmber_one = 1
f = open(str(unique_filename) + ".txt", 'w')
for i in range(5, 37):
    time.sleep(3)
    driver.find_element_by_xpath("""/html/body/center/table[2]/tbody/tr[2]/td/table/tbody/tr""" + "[" + str(i) + "]" + """/td[2]/a""").click()
    time.sleep(3)
    driver.switch_to.window(driver.window_handles[nmber_one])
    nmber_one = nmber_one + 1
    database_text = driver.find_element_by_xpath("/html/body/pre")
    f = open(str(unique_filename) + ".txt", 'w',)
    f.write(database_text.text)
    driver.switch_to.window(window_before)
import uuid
import io
unique_filename = uuid.uuid4()
which generates a new filename, well it should anyway, it worked before.
The problem is that some of the text is not ASCII. database_text.text is likely unicode text (you can do print type(database_text.text) to verify) and contains non-English text. If you are on Windows it may be "codepage" text, which depends on how your user account is configured.
Often, one wants to store text like this as utf-8, so open your output file accordingly:
import io
text = u"â"
with io.open('somefile.txt', 'w', encoding='utf-8') as f:
    f.write(text)
If you really do want to just drop the non-ASCII characters from the file completely, you can set up an error policy:
text = u"ignore funky â character"
with io.open('somefile.txt', 'w', encoding='ascii', errors='ignore') as f:
    f.write(text)
In the end, you need to choose what representation you want to use for non-ascii (roughly speaking, non-English) text.
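To make the choice concrete, here is a short Python 3 sketch of what a few error policies produce for the same text. It works on in-memory strings only, so nothing here depends on selenium:

```python
text = "ignore funky â character"

# Drop characters that ASCII cannot represent.
ignored = text.encode("ascii", errors="ignore")

# Put a "?" where each unrepresentable character was.
replaced = text.encode("ascii", errors="replace")

# Keep the information as a visible escape sequence.
escaped = text.encode("ascii", errors="backslashreplace")

# Or keep the character itself by choosing an encoding that can hold it.
utf8 = text.encode("utf-8")

print(ignored)
print(replaced)
print(escaped)
print(utf8)
```

The same errors= values can be passed to open() or io.open(), as in the example above.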
A try/except block would work, although it skips the whole write whenever any character fails to encode:
try:
    f.write(database_text.text)
except UnicodeEncodeError:
    pass

How can I get my Python to parse the following text?

I have a sample of the text:
"PROTECTING-ħarsien",
I'm trying to parse it with the following:
import csv, json
with open('./dict.txt') as maltese:
    entries = maltese.readlines()
    for entry in entries:
        tokens = entry.replace('"', '').replace(",", "").replace("\r\n", "").split("-")
        if len(tokens) == 1:
            pass
        else:
            print tokens[0] + "," + unicode(tokens[1])
But I'm getting an error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
What am I doing wrong?
It appears that dict.txt is UTF-8 encoded (ħ is 0xc4 0xa7 in UTF-8).
You should open the file as UTF-8, then:
import codecs
with codecs.open('./dict.txt', encoding="utf-8") as maltese:
    # etc.
You will then have Unicode strings instead of bytestrings to work with; you therefore don't need to call unicode() on them, but you may have to re-encode them to the encoding of the terminal you're outputting to.
You have to change your last line to (this has been tested to work on your data):
print tokens[0] + "," + unicode(tokens[1], 'utf8')
Without that 'utf8' argument, Python assumes the bytes are ASCII-encoded, hence the error.
See http://docs.python.org/2/howto/unicode.html#the-unicode-type
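For anyone on Python 3, a rough equivalent might look like the sketch below. The helper name parse_entry and the second sample entry are made up, and an in-memory byte buffer stands in for dict.txt so the example is self-contained. Unlike the original code, it only strips quotes and commas from the ends of each line:

```python
import io


def parse_entry(line):
    """Split one '"ENGLISH-maltese",' line into a pair, or return None."""
    cleaned = line.strip().strip(",").strip('"')
    english, separator, maltese = cleaned.partition("-")
    return (english, maltese) if separator else None


# Stand-in for dict.txt: UTF-8 bytes, decoded explicitly as utf-8.
raw = '"PROTECTING-ħarsien",\r\n"NOHYPHEN",\r\n'.encode("utf-8")
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as maltese_file:
    pairs = [pair for pair in map(parse_entry, maltese_file) if pair]

for english, maltese in pairs:
    print(english + "," + maltese)
```

With a real file, open('./dict.txt', encoding='utf-8') would take the place of the TextIOWrapper line.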

UnicodeDecodeError while processing filenames

I'm using Python 2.7.3 on Ubuntu 12 x64.
I have about 200,000 files in a folder on my filesystem. The file names of some of the files contain html encoded and escaped characters because the files were originally downloaded from a website. Here are examples:
Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg
I wrote a simple Python script that goes through the folder and renames all of the files with encoded characters in the filename. The new filename is achieved by simply decoding the string that makes up the filename.
The script works for most of the files, but, for some of the files Python chokes and spits out the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Traceback (most recent call last):
  File "./download.py", line 53, in downloadGalleries
    numDownloaded = downloadGallery(opener, galleryLink)
  File "./download.py", line 75, in downloadGallery
    filePathPrefix = getFilePath(content)
  File "./download.py", line 90, in getFilePath
    return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip())
  File "/home/abc/XYZ/common.py", line 22, in cleanupString
    return HTMLParser.HTMLParser().unescape(string)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
Here is the contents of my cleanupString function:
def cleanupString(string):
    string = urllib2.unquote(string)
    return HTMLParser.HTMLParser().unescape(string)
And here's the snippet of code that calls the cleanupString function (this code is not the same code in the traceback above but it produces the same error):
rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []
for root, dirs, files in os.walk(rootFolder):
    for filename in files:
        foundFile = os.path.join(root, filename)
        if reobj.match(foundFile):
            imgs.append(foundFile)
for img in imgs :
    print 'Checking file: ' + img
    newImg = cleanupString(img) #Code blows up here for some files
Can anyone provide me with a way to get around this error? I've already tried adding
# -*- coding: utf-8 -*-
to the top of the script but that has no effect.
Thanks.
Your filenames are byte strings that contain UTF-8 bytes representing unicode characters. The HTML parser normally works with unicode data instead of byte strings, particularly when it encounters an ampersand escape, so Python automatically tries to decode the value for you, but by default it uses ASCII for that decoding. This fails for UTF-8 data, as it contains bytes that fall outside the ASCII range.
You need to explicitly decode your string to a unicode object:
def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')
    return HTMLParser.HTMLParser().unescape(string)
Your next problem will be that you now have unicode filenames, but your filesystem will need some kind of encoding to work with these filenames. You can check what that encoding is with sys.getfilesystemencoding(); use this to re-encode your filenames:
def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')
    return HTMLParser.HTMLParser().unescape(string).encode(sys.getfilesystemencoding())
You can read up on how Python deals with Unicode in the Unicode HOWTO.
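For comparison, here is a sketch of the same cleanup in Python 3, where str is already unicode and both the explicit decode and the re-encode step go away. The function name cleanup_string and the "Tom &amp;amp; Jerry" sample are made up; the other two inputs are the file names from the question:

```python
import html
from urllib.parse import unquote


def cleanup_string(name):
    """Undo percent-encoding (interpreted as UTF-8), then HTML entities."""
    return html.unescape(unquote(name))


print(cleanup_string("Jamaica%2008%20114.jpg"))
print(cleanup_string(
    "thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg"))
print(cleanup_string("Tom &amp; Jerry.jpg"))
```

urllib.parse.unquote decodes the percent-escaped bytes as UTF-8 by default, so no separate decode('utf8') call is needed.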
Looks like you're bumping into this issue. I would try reversing the order you call unescape and unquote, since unquote would be adding non-ASCII characters into your filenames, although that may not fix the problem.
What is the actual filename it is choking on?
