Python SyntaxError: Non-UTF-8


I converted my Python script to a Mac.app (via py2app). I try to run it and get the following error:
SyntaxError: Non-UTF-8 code starting with '\xcf' in file
py2app/dist/myapp.app/Contents/MacOS/myapp on line 1, but no encoding declared; see
http://python.org/dev/peps/pep-0263/ for details
I visited the PEP website and added the following to the first two lines of my script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
I have also put my code into various online tools (such as this one) to check whether there are any non-UTF-8 characters but I'm not getting any issues.
I did copy some text from an Excel file however there were no special symbols that I was aware of.
The script is approx 800 lines so is there a way of identifying the problem that doesn't involve manually scanning the script line-by-line?
EDIT
Not exactly a fix, but converting my script into an executable instead of a .app has fixed the issue and it now runs correctly.

Python 3 uses UTF-8 as the default source encoding, which simplifies reusing code from the Internet and from other packages. In UTF-8, \xcf is valid only as the lead byte of a two-byte sequence, so it must be followed by a continuation byte (0x80-0xBF), which is not the case here. That is what "Non-UTF-8 code starting with" means: the byte does not begin a valid UTF-8 encoding of a code point.
As mentioned in the comments, you can convert the file to UTF-8. Often you can ignore the original encoding, because such errors frequently come from comments (e.g. an author's name). You can convert it with, for example, the encoding option in your editor's Save As dialog.
Alternatively, you can declare the encoding on the first or second line of your code; see PEP 263 for how to do it. Note: Python looks for a hardcoded byte pattern (because at that point it has no idea of the encoding), so copy the declaration exactly as it appears in that document. I think a line such as # -*- coding: latin-1 -*- should work, but it could misinterpret some characters, so test your program. If you do not know the original encoding, the easier way is to convert the original source (because in any case you should check all strings in the source code and confirm that you guessed the correct encoding).
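The question asks for a way to locate the offending bytes without reading the whole script by eye. Here is a minimal sketch for Python 3. The helper name `find_invalid_utf8` and the sample bytes are made up for illustration. It decodes the file one line at a time and reports the first byte on each line that is not valid UTF-8.

```python
def find_invalid_utf8(data: bytes):
    """Yield (line_number, column, byte) for each line that is not valid UTF-8."""
    for lineno, line in enumerate(data.splitlines(keepends=True), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte in this line
            yield lineno, exc.start, line[exc.start:exc.start + 1]


# Hypothetical sample: line 2 contains a Latin-1 encoded e-acute (0xE9).
sample = b"# ok\nname = 'caf\xe9'\nprint(name)\n"
problems = list(find_invalid_utf8(sample))
for lineno, col, byte in problems:
    print(f"line {lineno}, column {col}: {byte!r}")
```

To check a real file, pass in `open(path, "rb").read()`. Only the first bad byte per line is reported, which is usually enough to find the pasted text.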

Related

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
    return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.

I'd recommend reading the PEP that the error message points to. The problem is that your code is being read as ASCII, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string-by-string basis in your code. However, if you are trying to put the pound sign literal into your code, you'll need an encoding that supports it for the entire file.

Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-

First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
    return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- coding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
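The point about guessing can be made concrete with a short Python 3 sketch. It shows that the pound sign is the single byte 0xA3 in Latin-1 but two bytes in UTF-8, which is consistent with the guess above that the file was saved as Latin-1. It also shows that the same byte reads as a different character under another legacy encoding. The choice of cp437 for comparison is only an illustration.

```python
pound = "\u00a3"  # POUND SIGN, written as an escape so this snippet is pure ASCII

# The same character becomes different bytes under different encodings.
utf8_bytes = pound.encode("utf-8")      # two bytes, starting with 0xC2
latin1_bytes = pound.encode("latin-1")  # the single byte 0xA3

# The lone byte 0xA3 reads differently depending on the assumed encoding.
as_latin1 = b"\xa3".decode("latin-1")   # pound sign
as_cp437 = b"\xa3".decode("cp437")      # u with acute accent, not a pound sign

# A lone 0xA3 is not valid UTF-8 at all (it is a continuation byte).
try:
    b"\xa3".decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(utf8_bytes, latin1_bytes, ascii(as_latin1), ascii(as_cp437), valid_utf8)
```

Because Latin-1 maps every byte to some character, decoding never fails, so a wrong latin-1 declaration goes unnoticed until the output looks wrong.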

Adding the following two lines in the script solved the issue for me.
#!/usr/bin/python
# coding=utf-8
Hope it helps!

You're probably trying to run a Python 3 file with a Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But if you are indeed working on a Python 2 script, a solution not yet mentioned on this page is to resave the file as UTF-8 with a BOM. That adds three special bytes to the start of the file, which explicitly inform the Python interpreter (and your text editor) about the file encoding.
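The three bytes mentioned above can be inspected from Python 3. This is a minimal sketch using only the standard library. The sample source line is made up, and `tokenize.detect_encoding` is used here as a convenient way to see how a source file's encoding is detected, not as something the answer itself relied on.

```python
import codecs
import io
import tokenize

source = "print('hello')\n"

# Encoding with 'utf-8-sig' prepends the three-byte UTF-8 BOM.
with_bom = source.encode("utf-8-sig")
bom_prefix = with_bom[:3]

# detect_encoding reads at most two lines and reports the encoding it finds.
encoding, _ = tokenize.detect_encoding(io.BytesIO(with_bom).readline)

print(bom_prefix == codecs.BOM_UTF8, encoding)
```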

Inconsistent output of unicode box-drawing characters in python IDLE

I have the following code:
# -*- coding: utf-8 -*-
print "╔╤╤╦╤╤╦╤╤╗"
print "╠╪╪╬╪╪╬╪╪╣"
print "╟┼┼╫┼┼╫┼┼╢"
print "╚╧╧╩╧╧╩╧╧╝"
print "║"
print "│"
and for some reason, only the fourth line (╚╧╧╩╧╧╩╧╧╝) actually outputs properly; the rest is an odd combination of symbols. I assume this is due to some encoding issue. The full output in IDLE is as follows:
â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—
â• â•ªâ•ªâ•¬â•ªâ•ªâ•¬â•ªâ•ªâ•£
â•Ÿâ”¼â”¼â•«â”¼â”¼â•«â”¼â”¼â•¢
╚╧╧╩╧╧╩╧╧╝
â•‘
â”‚
What is causing this and how can I fix this? I'm using a tablet (Surface Pro 3 with Win10) with only a touch keyboard, so any solution with the least amount of typing (especially typing out weird characters) would be ideal, but obviously all help is appreciated.

Mojibake indicates that the text encoded in one encoding is shown in another incompatible encoding:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u"╔╤╤╦╤╤╦╤╤╗".encode('utf-8').decode('cp1252')) #XXX: DON'T DO IT
# -> â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—
There are several places where the wrong encoding could be used.
# coding: utf-8 encoding declaration says how non-ascii characters in your source code (e.g., inside string literals) should be interpreted. If print u"╔╤╤╦╤╤╦╤╤╗" works in your case then it means that the source code itself is decoded to Unicode correctly. For debugging, you could write the string using only ascii characters: u'\u2554\u2557' == u'╔╗'.
print "╔╤╤╦╤╤╦╤╤╗" (DON'T DO IT) prints bytes (text encoded using utf-8 in this case) as is. IDLE itself works with Unicode (BMP). The bytes must be decoded into Unicode text before they can be shown in IDLE. It seems IDLE uses ANSI code page such as cp1252 (locale.getpreferredencoding(False)) to decode the output bytes on Windows. Don't print text as bytes. It will fail in any environment that uses a character encoding different from your source code e.g., you would get ΓòöΓòù... mojibake if you run the code from the question in Windows console that uses cp437 OEM code page.
You should use Unicode for all text in your program. Python 3 even forbids non-ascii characters inside a bytes literal. You would get SyntaxError there.
print(u'\u2554\u2557') might fail with UnicodeEncodeError if you run the code in a Windows console and the OEM code page (such as cp437) is not able to represent the characters. To print arbitrary Unicode characters in the Windows console, use the win-unicode-console package. You don't need it if you use IDLE.
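The mojibake described above is a reversible transformation as long as no bytes were lost. The following Python 3 sketch reproduces the garbling and then undoes it. It assumes, as the answer does, that the wrong decoder was cp1252.

```python
# The first row from the question, written with escapes: ╔╤╤╦╤╤╦╤╤╗
box = "\u2554\u2564\u2564\u2566\u2564\u2564\u2566\u2564\u2564\u2557"

# UTF-8 bytes misread as cp1252 produce the garbled text.
garbled = box.encode("utf-8").decode("cp1252")

# Reversing the two steps recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(ascii(garbled))
print(repaired == box)
```

This kind of repair is only a diagnostic aid. The real fix is to keep text as Unicode so the bytes are never decoded with the wrong code page.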

Putting a u before the strings fixed the issue, as per @FredLarson's suggestion:
print u"╔╤╤╦╤╤╦╤╤╗"
print u"╠╪╪╬╪╪╬╪╪╣"
print u"╟┼┼╫┼┼╫┼┼╢"
print u"╚╧╧╩╧╧╩╧╧╝"
print u"║"
print u"│"
The exact cause still isn't known, since it seemed to work on other systems and it's odd that the fourth line worked fine.
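One possible explanation, which nothing in this thread confirms, is that the row that printed correctly (╚╧╧╩╧╧╩╧╧╝) is the only one whose UTF-8 bytes cannot be decoded as cp1252 at all. The final character ╝ encodes to bytes ending in 0x9D, which Python's cp1252 codec treats as undefined. If IDLE falls back to another decoding when cp1252 fails (an assumption), that row would survive. The sketch below only checks the byte-level fact.

```python
rows = {
    1: "\u2554\u2557",  # corner characters of the first row
    4: "\u255a\u255d",  # corner characters of the fourth row
}

decodable = {}
for number, text in rows.items():
    raw = text.encode("utf-8")
    try:
        raw.decode("cp1252")
        decodable[number] = True
    except UnicodeDecodeError:
        # cp1252 has no character assigned to byte 0x9D
        decodable[number] = False

print(decodable)
```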

problems with declaring encoding in a source file

I'm trying to learn how to use encoding declarations in source files by reading PEP 263 and experimenting on my own, but I ran into some trouble.
Here's my file cod.py:
# -*- coding: utf-16 -*-
print('ciao')
and I saved it using UTF-16 encoding; now:
antox@antox-pc ~/Scrivania $ python3 cod.py
File "cod.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xff' in file cod.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
So I don't understand where I'm going wrong.
P.S. I'm using gedit 2.30.4

UTF-16 is not accepted as encoding for Python source code. From PEP 263 (section Concepts, item 1):
Any encoding which allows processing the first two lines in the
way indicated above is allowed as source code encoding, this
includes ASCII compatible encodings as well as certain
multi-byte encodings such as Shift_JIS. It does not include
encodings which use two or more bytes for all characters like
e.g. UTF-16. The reason for this is to keep the encoding
detection algorithm in the tokenizer simple.
So the error you're getting is expected: you can use a different encoding (other than the default UTF-8) as long as it can be detected by Python.
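To see why the declaration cannot be detected, it helps to look at the raw bytes. This is a minimal Python 3 sketch. It assumes the editor wrote little-endian UTF-16 with a BOM, which is consistent with the '\xff' in the error message. The pattern is the one given in the Python language reference for encoding declarations, applied here to bytes.

```python
import codecs
import re

# Pattern from the Python language reference for encoding declarations.
COOKIE = re.compile(rb"coding[=:]\s*([-\w.]+)")

line = "# -*- coding: utf-16 -*-\n"
as_utf8 = line.encode("utf-8")
# Assumption: the file was saved as little-endian UTF-16 with a BOM.
as_utf16 = codecs.BOM_UTF16_LE + line.encode("utf-16-le")

found_in_utf8 = COOKIE.search(as_utf8) is not None
found_in_utf16 = COOKIE.search(as_utf16) is not None
first_byte = as_utf16[:1]

print(found_in_utf8, found_in_utf16, first_byte)
```

In UTF-16 every ASCII character is followed by a zero byte, so the word "coding" never appears as a contiguous byte sequence and the declaration is invisible to the tokenizer.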

Python "SyntaxError: Non-ASCII character '\xe2' in file" [duplicate]

I am writing some Python code and I am receiving the error message in the title. From searching, this has to do with the character set.
Here is the line that causes the error
hc = HealthCheck("instance_health", interval=15, target808="HTTP:8080/index.html")
I cannot figure out which character is not in the ASCII set. Furthermore, searching for "\xe2" does not give any more information about what character it appears as. Which character in that line is causing the issue?
I have also seen a few fixes for this issue but I am not sure which to use. Could someone clarify what the issue is (python doesn't interpret unicode unless told to do so?), and how I would clear it up properly?
EDIT:
Here are all the lines near the one that errors
def createLoadBalancer():
    conn = ELBConnection(creds.awsAccessKey, creds.awsSecretKey)
    hc = HealthCheck("instance_health", interval=15, target808="HTTP:8080/index.html")
    lb = conn.create_load_balancer('my_lb', ['us-east-1a', 'us-east-1b'],[(80, 8080, 'http'), (443, 8443, 'tcp')])
    lb.configure_health_check(hc)
    return lb

If you are just trying to use UTF-8 characters or don't care if they are in your code, add this line to the top of your .py file
# -*- coding: utf-8 -*-

You've got a stray byte floating around. You can find it by running
with open("x.py") as fp:
    for i, line in enumerate(fp):
        if "\xe2" in line:
            print i, repr(line)
where you should replace "x.py" by the name of your program. You'll see the line number and the offending line(s). For example, after inserting that byte arbitrarily, I got:
4 "\xe2 lb = conn.create_load_balancer('my_lb', ['us-east-1a', 'us-east-1b'],[(80, 8080, 'http'), (443, 8443, 'tcp')])\n"
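The snippet above is Python 2. A comparable sketch for Python 3 is shown below. The helper name `non_ascii_report` and the sample text are made up for illustration. Instead of searching for one specific byte, it lists every non-ASCII character together with its Unicode name, which makes look-alike quotes and dashes easy to identify.

```python
import unicodedata


def non_ascii_report(text):
    """Return (line, column, char, name) for every non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical sample with curly quotes on line 2 and a minus sign on line 3.
sample = "x = 1\nprint(\u2018hi\u2019)\ny = 10 \u2212 4\n"
report = non_ascii_report(sample)
for lineno, col, ch, name in report:
    print(f"{lineno}:{col} U+{ord(ch):04X} {name}")
```

For a real file, read it with `open(path, encoding="utf-8").read()` first. If that read itself fails, the file is not valid UTF-8 and needs to be inspected as bytes instead.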

Or you could just simply use:
# coding: utf-8
at top of .py file

\xe2 is the first byte of the UTF-8 encoding of characters such as typographic dashes. Text that is copied and pasted sometimes contains a dash that looks like '-' but is a different character, and that causes the encoding error.
Replace the pasted '-' with the correct '-' typed on your keyboard.

Declare the file's character encoding:
always put the line below at the top of your code
# -*- coding: utf-8 -*-

I had the same error while copying and pasting a comment from the web
For me it was a single quote (') in the word
I just erased it and re-typed it.

Adding a # coding=utf-8 line as the first line of your .py file will fix the problem.
You can read more about the problem and its fix at the link below, which describes both clearly: https://www.python.org/dev/peps/pep-0263/

I got this error for characters in my comments (from copying/pasting content from the web into my editor for note-taking purposes).
To resolve in Text Wrangler:
Highlight the text
Go to the Text menu
Select "Convert to ASCII"

Based on PEP 0263 -- Defining Python Source Code Encodings
Python will default to ASCII as standard encoding if no other
encoding hints are given.
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :

I had the same issue and just added this to the top of my file (in Python 3 I didn't have the problem, but in Python 2 I do):
#!/usr/local/bin/python
# coding: latin-1

If it helps anybody, for me that happened because I was trying to run a Django implementation in python 3.4 with my python 2.7 command

In my case \xe2 was a ’ which should be replaced by '.
In general I recommend converting UTF-8 to ASCII using e.g. https://onlineasciitools.com/convert-utf8-to-ascii
However, if you want to keep UTF-8 you can use
#-*- mode: python -*-
# -*- coding: utf-8 -*-

After about half an hour of looking through Stack Overflow, it dawned on me that using a single quote " ' " in a comment will throw the error:
SyntaxError: Non-ASCII character '\xe2' in file
After looking at the traceback I was able to locate the single quote used in my comment.

I had this exact issue running the simple .py code below:
import sys
print 'version is:', sys.version
DSM's code above provided the following:
1 'print \xe2\x80\x98version is\xe2\x80\x99, sys.version'
So the issue was that my text editor used SMART QUOTES, as John Y suggested. After changing the text editor settings and re-opening/saving the file, it works just fine.
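When many such characters have crept in, retyping each one is tedious. A minimal Python 3 sketch of a bulk replacement is shown below. The table `SMART_TO_ASCII` and the helper `asciify_punctuation` are illustrative names, and the set of characters is an assumption based on the ones mentioned in this thread.

```python
# Map common look-alike characters to their plain ASCII equivalents.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2212": "-",   # minus sign
    "\u200b": None,  # zero width space: delete it
})


def asciify_punctuation(text):
    """Replace typographic punctuation with ASCII look-alikes."""
    return text.translate(SMART_TO_ASCII)


broken = "print \u2018version is:\u2019, sys.version"
fixed = asciify_punctuation(broken)
print(fixed)
```

Applying this to a whole file also changes string literals and comments that contain these characters on purpose, so review the result before saving.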

I am trying to parse that weird Windows apostrophe, and after trying several things, here is the code snippet that works.
def convert_freaking_apostrophe(self,string):
    try:
        issuer_rename = string.decode('windows-1252')
    except:
        issuer_rename = string.decode('latin-1')
    issuer_rename = issuer_rename.replace(u'’', u"'")
    issuer_rename = issuer_rename.encode('ascii','ignore')
    try:
        os.rename(directory+"/"+issuer,directory+"/"+issuer_rename)
        print "Successfully renamed "+issuer+" to "+issuer_rename
        return issuer_rename
    except:
        pass

#HANDLING FOR FUNKY APOSTROPHE
if re.search(r"([\x90-\xff])", issuer):
    issuer = self.convert_freaking_apostrophe(issuer)

I fixed this using PyCharm. At the bottom of the PyCharm window you can see the file encoding. I noticed that it was UTF-8, so I changed it to US-ASCII.

I had the same issue, and it was because I had copied and pasted the string as it was.
Later, when I typed the same string manually, the error vanished.
The error was due to the - sign. When I replaced it with a manually typed -, the error was solved.
Copied string 10 + 3 * 5/(16 − 4)
Manually typed string 10 + 3 * 5/(16 - 4)
You can clearly see there is a slight difference between the two hyphens.
I think it's because of the different formatting used by different operating systems, or maybe just by different software.

For me the problem was caused by the "’" symbol in the quotes. I had copied the code from a PDF file, which caused the error. I just replaced "’" with "'".

If you want to spot which character caused this, just assign the problematic line to a string and print it in an IPython console.
In my case
In [1]: array = [[24.9, 50.5], [11.2, 51.0]] # Raises an error
In [2]: string = "[[24.9, 50.5], [11.2, 51.0]]" # Manually paste the above array here
In [3]: string
Out [3]: '[[24.9, 50.5]\xe2\x80\x8b, [11.2, 51.0]]' # Here they are!

For me, the problem was caused by typing my code into Mac Notes and then copying it from there and pasting it into my vim session to create my file. This turned my single quotes into the curved type. To fix it, I opened my file in vim and replaced all the curved single quotes with the straight kind, just by removing and retyping the same character. It was Mac Notes that made the same keystroke produce the curved single quote.

I was unable to find the issue for a long time, but later I realised that I had copied a line "UTC-12:00" from the web and the hyphen/dash in it was causing the problem. I just typed the "-" again and the problem was resolved.
So sometimes copied and pasted lines also give errors. In such cases, just retype the pasted code and it works. After retyping, it looks as if nothing has changed, but the error will be gone.

Plenty of good solutions here.
One challenge not really addressed in any of them is how to visually identify certain hard-to-spot non-ASCII characters that resemble other plain ASCII ones. For example, en dashes can appear almost exactly like hyphens and curly quotes look a lot like straight quotes, depending on your text editor's font.
This one-liner, which should work on Mac or Linux, will strip characters not in the ASCII printable range and show you the differences side-by-side:
# assumes Bash shell; for Bourne shell (sh), rearrange as a pipe and
# give '-' as second argument to 'sdiff' instead
sdiff --suppress-common-lines script.py <(tr -cd '\11\12\15\40-\176' <script.py)
The characters \11, \12, and \15 are tab, newline, and carriage return, respectively, in octal; the remaining range is the visible ASCII characters. (hat tip)
Another tip gleaned from this SO thread uses an inverse character class consisting of anything not in the ASCII visible range, and highlights it:
grep --color '[^ -~]' script.py
This should also work fine with the macOS / BSD version of grep.

When I have a similar issue while reading text files, I use...
f = open('file','rt', errors='ignore')
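It is worth knowing what errors='ignore' actually does before relying on it. The Python 3 sketch below uses a made-up byte string and compares the 'ignore' and 'replace' error handlers with decoding under the correct encoding, which is assumed here to be Latin-1.

```python
raw = b"caf\xe9 au lait"  # 0xE9 is Latin-1 e-acute; it is invalid as UTF-8 here

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the data was really written in loses nothing.
correct = raw.decode("latin-1")

print(ascii(ignored), ascii(replaced), ascii(correct))
```

The same handlers apply to the errors argument of open(). Since 'ignore' discards data without warning, it suits quick inspection better than processing text you need to keep intact.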

How do the "magic line(s)" in Python work when specifying the encoding in a Python file?

At the start of a python file (first line) sometimes I read
# -*- coding: utf-8 -*-
and sometimes I read
# encoding: utf-8
Both lines seem to do the same thing: specifying utf8 as encoding for all the text put in the file.
I have two questions:
Why does this even work? I thought the interpreter ignores everything after a # because it starts a comment.
What is the difference between the two lines above? Does the interpreter just ignore the -*-?

The two forms are equivalent. The -*- version is a special kind of comment that Emacs understands. See PEP 263 for more information.
If a comment like in either of these forms is one of the first two lines of a file, the interpreter will use the specified encoding to read the file.

It works because the implementation looks for it; there is nothing magical about it. There is no difference between the two: all possible variants are defined by PEP 263 (the only difference is that the first one is Emacs-compatible).
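The lookup can be observed from Python 3 with the standard library's `tokenize.detect_encoding`, which reads at most two lines and returns the encoding it finds. This is a minimal sketch. The helper name `declared_encoding` and the sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding detected from the first two lines."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


emacs_style = declared_encoding(b"# -*- coding: utf-8 -*-\nprint('x')\n")
plain_style = declared_encoding(b"# encoding: utf-8\nprint('x')\n")
second_line = declared_encoding(b"#!/usr/bin/python\n# coding=latin-1\nprint('x')\n")
# A declaration on the third line is too late and is ignored.
too_late = declared_encoding(b"#!/usr/bin/python\n# comment\n# coding=latin-1\n")

print(emacs_style, plain_style, second_line, too_late)
```

Both comment styles give the same result, the declaration is honoured on the second line, and on the third line it is ignored, so the default of UTF-8 applies.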
