problems with declaring encoding in a source file - python

I'm trying to learn how to use encoding declarations in source files by reading PEP 263 and experimenting on my own, but I've run into trouble.
Here's my file cod.py:
# -*- coding: utf-16 -*-
print('ciao')
and I saved it using UTF-16 encoding; now:
antox#antox-pc ~/Scrivania $ python3 cod.py
File "cod.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xff' in file cod.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
So I don't understand where I'm going wrong.
P.S. I'm using gedit 2.30.4

UTF-16 is not accepted as an encoding for Python source code. From PEP 263 (section Concepts, item 1):
Any encoding which allows processing the first two lines in the
way indicated above is allowed as source code encoding, this
includes ASCII compatible encodings as well as certain
multi-byte encodings such as Shift_JIS. It does not include
encodings which use two or more bytes for all characters like
e.g. UTF-16. The reason for this is to keep the encoding
detection algorithm in the tokenizer simple.
So the error you're getting is expected. You can use an encoding other than the default UTF-8, but only one in which Python can still read the declaration comment on the first two lines.
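A minimal sketch of that detection step, assuming Python 3: the standard library's tokenize.detect_encoding applies the PEP 263 rules to the first two lines of a byte stream. The helper name declared_encoding and the two sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the source encoding Python detects from the first two lines."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# An ASCII-compatible encoding: the declaration itself is readable as ASCII.
latin1_source = "# -*- coding: latin-1 -*-\nprint('ciao')\n".encode("latin-1")
print(declared_encoding(latin1_source))

# UTF-16: the file starts with a BOM (0xFF 0xFE or 0xFE 0xFF) and every
# character takes at least two bytes, so the declaration cannot be read.
utf16_source = "# -*- coding: utf-16 -*-\nprint('ciao')\n".encode("utf-16")
try:
    declared_encoding(utf16_source)
except SyntaxError as exc:
    print("rejected:", exc)
```

Saving cod.py as UTF-8 (or any other ASCII-compatible encoding) and declaring that encoding instead should avoid the error.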

Related

Python SyntaxError: Non-UTF-8

I converted my Python script to a Mac.app (via py2app). I try to run it and get the following error:
SyntaxError: Non-UTF-8 code starting with '\xcf' in file
py2app/dist/myapp.app/Contents/MacOS/myapp on line 1, but no encoding declared; see
http://python.org/dev/peps/pep-0263/ for details
I visited the PEP website and added the following to the first two lines of my script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
I have also put my code into various online tools (such as this one) to check whether there are any non-UTF-8 characters but I'm not getting any issues.
I did copy some text from an Excel file however there were no special symbols that I was aware of.
The script is approx 800 lines so is there a way of identifying the problem that doesn't involve manually scanning the script line-by-line?
EDIT
Not exactly a fix, but converting my script into an executable instead of a .app has fixed the issue and it now runs correctly.
Python 3 uses UTF-8 as the default source encoding, which simplifies working with code you get from the Internet (and from other packages). In UTF-8 the byte \xcf is valid only as the first byte of a two-byte sequence, so it must be followed by a continuation byte (0x80-0xBF), which is not the case here. That is what "Non-UTF-8 code starting with" means: the bytes at that position are not a valid UTF-8 encoding of a code point.
As noted in the comments, you can convert the file to UTF-8. Often the original encoding hardly matters, because such errors tend to come from comments (e.g. an author's name). You can do the conversion with the encoding option of "Save As" in your original editor.
Alternatively, you can declare the encoding on the first or second line of your code; see PEP 263 for how to do it. Note: Python matches the declaration against a fixed byte pattern [because at that point it does not yet know the encoding], so copy the line exactly as it appears in that document. I think a line such as # -*- coding: latin-1 -*- should work, but it could misinterpret some characters, so test your program. If you do not know the original encoding, the easier way is to convert the original source (because you should in any case check all strings in the source code to confirm that you guessed the correct encoding).
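To locate the offending byte without scanning the script by hand, one option is to let the UTF-8 decoder report where it fails. This is only a sketch: the function name find_non_utf8 and the sample bytes are hypothetical, and it reports just the first invalid position.

```python
def find_non_utf8(data: bytes):
    """Return (line, column, byte) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset at which decoding failed.
        line_no = data.count(b"\n", 0, exc.start) + 1
        line_start = data.rfind(b"\n", 0, exc.start) + 1
        return line_no, exc.start - line_start, data[exc.start:exc.start + 1]
    return None


# Illustrative input: 0xCF followed by a byte that is not a continuation byte.
sample = b"x = 1\nname = '\xcf'\n"
print(find_non_utf8(sample))
```

For a real file, the bytes would come from opening it in 'rb' mode and calling read().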

python utf-8 encoding declaration

The character é can be encoded in UTF-8, as shown in:
https://www.utf8-chartable.de/unicode-utf8-table.pl
As the official documentation (https://www.python.org/dev/peps/pep-0263/)
says:
'In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding "unicode-escape"....'
I use Python 2.7.13
so in my code (as described in https://www.python.org/dev/peps/pep-0263/), I have tried each of the following in turn (after #!/usr/bin/python):
# coding=utf-8
# -*- coding: utf-8 -*-
The last one also appears in the answer to the post "Correct way to define Python source code encoding",
but it still does not work:
SyntaxError: Non-ASCII character '\xc3' in file ./<file_name>.py on line 160, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Any ideas, folks? Thanks.
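One thing worth checking (an assumption, since the top of the file is not shown in the question) is whether the declaration really sits on the first or second line, because PEP 263 ignores it anywhere else. The sketch below uses Python 3's tokenize.detect_encoding, which follows the same two-line rule; the helper name detected and both sample sources are made up for illustration.

```python
import io
import tokenize


def detected(source: bytes) -> str:
    """Return the encoding found by the PEP 263 rules (default: utf-8)."""
    return tokenize.detect_encoding(io.BytesIO(source).readline)[0]


# Declaration on line 2, directly after the shebang: it is honoured.
on_line_2 = b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\nx = 1\n"

# Declaration pushed down to line 3: it is silently ignored.
on_line_3 = b"#!/usr/bin/python\n# some comment\n# -*- coding: latin-1 -*-\nx = 1\n"

print(detected(on_line_2))
print(detected(on_line_3))
```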

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
    return '£'
I want to print some text with a pound sign in front of it, but when I try to run the program this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading the PEP that the error points you to. The problem is that Python 2 assumes the source file is ASCII unless told otherwise, and the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file (and make sure your editor actually saves the file as UTF-8). To get more advanced, you can also define encodings on a string-by-string basis in your code. However, if you are trying to put the pound sign literal into your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
    return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- coding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
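A small illustration of that caveat: the single byte 0xA3 decodes without error under several legacy 8-bit codecs, but to a different character each time, while UTF-8 rejects it outright. The three codecs are just examples picked from Python's standard codec set.

```python
raw = b"\xa3"

# Each 8-bit codec accepts the byte, but they disagree about what it means.
for codec in ("latin-1", "cp437", "koi8_r"):
    print(codec, repr(raw.decode(codec)))

# UTF-8 refuses a lone 0xA3 because it is not a valid start byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8:", exc.reason)
```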
Adding the following two lines in the script solved the issue for me.
#!/usr/bin/python
# coding=utf-8
Hope it helps!
You're probably trying to run a Python 3 file with a Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But if you are indeed working on a Python 2 script, a solution not yet mentioned on this page is to resave the file as UTF-8 with a BOM. That adds three special bytes to the start of the file, which explicitly tell the Python interpreter (and your text editor) that the file is encoded in UTF-8.
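For reference, a sketch of what those three bytes look like, written for Python 3 (the answer above concerns Python 2, where PEP 263 describes the same BOM handling). The codec name utf-8-sig is Python's name for "UTF-8 with BOM"; the sample source line is made up.

```python
import codecs
import io
import tokenize

source = "print('\u00a3')\n"

# Encoding with utf-8-sig prepends the BOM bytes EF BB BF.
with_bom = source.encode("utf-8-sig")
print(with_bom[:3])

# The tokenizer recognises the BOM even though there is no coding comment.
encoding, _ = tokenize.detect_encoding(io.BytesIO(with_bom).readline)
print(encoding)
```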

Python UnicodeDecodeError on Mac, but not on PC?

I've got a script that basically aggregates students' code files into one file for plagiarism detection. It walks through a tree of files, copying all file contents into one file.
I've run the script on the exact same files on my Mac and my PC. On my PC, it works fine. On my Mac, it encounters 27 UnicodeDecodeErrors (probably 0.1% of all files I'm testing).
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
If relevant, the code is:
originalFile = open(originalFilename, "r")
newFile = open(newFilename, "a")
newFile.write(originalFile.read())
Figure out what encoding was used when saving that file. A safe bet is loading the file as 'utf-8'. If that succeeds then it's likely to be the correct encoding.
# try utf-8. If this fails, all bets are off.
open(originalFilename, "r", encoding="utf-8")
Now, if students are sending you these files, it's likely they just use the default encoding on their system. It is not possible to reliably guess the encoding. If they were using an 8-bit codec, like one of the ISO-8859 character sets, it will be almost impossible to guess which one was used. What to do then depends on what kind of files you're processing.
It is incorrect to read Python source files using open(originalFilename, "r") on Python 3. open() uses locale.getpreferredencoding(False) by default. A Python source file may use a different character encoding; in the best case this raises UnicodeDecodeError, but more often you silently get mojibake.
To read Python source taking into account the encoding declaration (# -*- coding: ...), use tokenize.open(filename). If it fails, the input is not valid Python 3 source code.
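A minimal sketch of reading a source file with tokenize.open. The helper name read_python_source and the temporary Latin-1 file are only there to make the example self-contained.

```python
import os
import tempfile
import tokenize


def read_python_source(path):
    """Read a .py file, honouring its BOM or coding declaration."""
    with tokenize.open(path) as f:
        return f.read()


# Demo: a file saved as Latin-1 that declares its encoding on line 1.
raw = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(raw)
try:
    text = read_python_source(tmp.name)
finally:
    os.remove(tmp.name)

print(text)
```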
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
locale.getpreferredencoding(False) is likely to be utf-8 on a Mac. UTF-8 does not accept an arbitrary sequence of bytes as valid text. A PC is likely to use an 8-bit character encoding, which accepts the mismatched bytes and silently produces mojibake instead of raising an error.
To read a text file, you should know its character encoding. If you don't know the character encoding, then either read the file as a sequence of bytes ('rb' mode) or try to guess the encoding using the chardet Python module (it would be only a guess, but it might be good enough depending on your task).
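For the aggregation script in the question, the bytes-only route could look like the sketch below. The function name append_file_bytes is hypothetical, and it assumes the combined file only needs the original bytes rather than decoded text.

```python
import shutil


def append_file_bytes(src_path, dst_path):
    """Append the raw bytes of src_path to dst_path without decoding them."""
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        shutil.copyfileobj(src, dst)
```

Because nothing is decoded, no UnicodeDecodeError can occur on either platform. The combined file may then mix encodings, so whatever reads it afterwards still has to cope with that.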
I got the exact same problem. There seemed to be some characters in the file that gave a UnicodeDecodeError during readlines()
This only happened on my macbook, but not on a PC.
I solved the problem by simply skipping these characters:
with open(file_to_extract, errors='ignore') as f:
    reader = f.readlines()
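Note that errors='ignore' drops the undecodable bytes silently. A short sketch of the difference between 'ignore' and 'replace', using made-up Latin-1 bytes that are not valid UTF-8:

```python
raw = b"caf\xe9 cr\xe8me"  # Latin-1 bytes, not valid UTF-8

# 'ignore' removes the bad bytes, so the loss is invisible in the result.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD, so the damaged spots stay visible.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

The same errors argument is accepted by open(), so errors='replace' can be used there when it matters to see where data was lost.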

Python: Ascii characters from file display wrong

Here's my code:
import sys, os
print("█████") #<-- Those are solid blocks.
f= open('file.txt')
for line in f:
    print(line)
In file.txt is this:
hay hay, guys
████████████
But the output is this:
██████
hay hay, guys <----- ***Looks like it outputted this correctly!***
Traceback (most recent call last):
File "echofile.py", line 6, in <module>
print(line)
File "C:\python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: cha
racter maps to <undefined> <------ ***But not from the file!***
Anybody have any suggestions as to why it is doing this? I wrote the code in IDLE, tried editing the file.txt in both Programmer's Notepad and IDLE. The file is ASCII / ANSI.
I'm using Python 3, by the way. 3.3 alpha win-64 if it matters.
This is clearly an issue with character encodings.
In Python 3.x, all strings are Unicode. But when reading or writing a file, it will be necessary to translate the Unicode to some specific encoding.
By default, a Python source file is handled as UTF-8. I don't know exactly what characters you pasted into your source file for the blocks, but whatever it is, Python reads it as UTF-8 and it seems to work. Maybe your text editor converted to valid UTF-8 when you inserted those?
The backtrace suggests that Python is treating the input file as "Code Page 437" or the original IBM PC 8-bit character set. Is that correct?
This link shows how to set a specific decoder to handle a particular file encoding on input:
http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide/
EDIT: I found a better resource:
http://docs.python.org/release/3.0.1/howto/unicode.html
And based on that, here's some sample code:
with open('mytextfile.txt', encoding='utf-8') as f:
    for line in f:
        print(line, end='')
Originally I had the above set to "cp437" but in a comment you said "utf-8" was correct, so I made that change to this example. I'm specifying end='' here because the input lines from the file already have a newline on the end, so we don't need print() to supply another newline.
EDIT: I found a short discussion of default encodings here:
http://docs.python.org/release/3.0.1/whatsnew/3.0.html
The important bit: "There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default."
So, I had thought that Python defaulted to UTF-8, but not always, it seems. Actually, from your stack backtrace, I think on your system with your LANG environment setting you are getting "cp437" as your default.
So, I learned something too by answering your question!
P.S. I changed the code example above to specify utf-8 since that is what you needed.
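The traceback in the question is a UnicodeEncodeError raised inside cp437.py's encode, that is, while print() was encoding text for the console. Under that reading, a sketch like the following shows the two defaults involved and one way to avoid the crash. The helper name printable is hypothetical, and replacing characters with '?' is just one possible policy.

```python
import locale
import sys


def printable(text, encoding):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# The default used by open() when no encoding is passed.
print("default file encoding:", locale.getpreferredencoding(False))
# The encoding print() uses when writing to the console.
print("console encoding:", sys.stdout.encoding)

# cp437 can represent the solid block but not, for example, an en dash.
print(printable("\u2588 a\u2013b", "cp437"))
```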
Try making that string unicode:
print(u"█████")
      ^ Add this
