Correct way to define Python source code encoding - python

PEP 263 defines how to declare Python source code encoding.
Normally, the first 2 lines of a Python file should start with:
#!/usr/bin/python
# -*- coding: <encoding name> -*-
But I have seen a lot of files starting with:
#!/usr/bin/python
# -*- encoding: <encoding name> -*-
=> encoding instead of coding.
So what is the correct way of declaring the file encoding?
Is encoding permitted because the regex used is lazy? Or is it just another form of declaring the file encoding?
I'm asking this question because the PEP does not talk about encoding, it just talks about coding.

Check the docs here:
"If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration"
"The recommended forms of this expression are
# -*- coding: <encoding-name> -*-
which is recognized also by GNU Emacs, and
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar’s VIM."
So, you can put pretty much anything before the "coding" part, but stick to "coding" (with no prefix) if you want to be 100% python-docs-recommendation-compatible.
More specifically, you need to use whatever is recognized by Python and the specific editing software you use (if it needs/accepts anything at all). E.g. the coding form is recognized (out of the box) by GNU Emacs but not Vim (yes, without a universal agreement, it's essentially a turf war).

Just copy paste below statement on the top of your program.It will solve character encoding problems
#!/usr/bin/env python
# -*- coding: utf-8 -*-

PEP 263:
the first or second line must match
the regular
expression "coding[:=]\s*([-\w.]+)"
So, "encoding: UTF-8" matches.
PEP provides some examples:
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
# This Python file uses the following encoding: utf-8
import os, sys

As of today — June 2018
PEP 263 itself mentions the regex it follows:
To define a source code encoding, a magic comment must be placed into
the source files either as first or second line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors):
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or:
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, the first or second line must match the following regular expression:
^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
So, as already summed up by other answers, it'll match coding with any prefix, but if you'd like to be as PEP-compliant as it gets (even though, as far as I can tell, using encoding instead of coding does not violate PEP 263 in any way) — stick with 'plain' coding, with no prefixes.

If I'm not mistaken, the original proposal for source file encodings was to use a regular expression for the first couple of lines, which would allow both.
I think the regex was something along the lines of coding: followed by something.
I found this: http://www.python.org/dev/peps/pep-0263/
Which is the original proposal, but I can't seem to find the final spec stating exactly what they did.
I've certainly used encoding: to great effect, so obviously that works.
Try changing to something completely different, like duhcoding: ... to see if that works just as well.

I suspect it is similar to Ruby - either method is okay.
This is largely because different text editors use different methods (ie, these two) of marking encoding.
With Ruby, as long as the first, or second if there is a shebang line contains a string that matches:
coding: encoding-name
and ignoring any whitespace and other fluff on those lines. (It can often be a = instead of :, too).

Related

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
Adding the following two lines in the script solved the issue for me.
# !/usr/bin/python
# coding=utf-8
Hope it helps !
You're probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.

Python "SyntaxError: Non-ASCII character '\xe2' in file" [duplicate]

This question already has answers here:
SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'
(6 answers)
Closed 2 years ago.
I am writing some python code and I am receiving the error message as in the title, from searching this has to do with the character set.
Here is the line that causes the error
hc = HealthCheck("instance_health", interval=15, target808="HTTP:8080/index.html")
I cannot figure out what character is not in the ANSI ASCII set? Furthermore searching "\xe2" does not give anymore information as to what character that appears as. Which character in that line is causing the issue?
I have also seen a few fixes for this issue but I am not sure which to use. Could someone clarify what the issue is (python doesn't interpret unicode unless told to do so?), and how I would clear it up properly?
EDIT:
Here are all the lines near the one that errors
def createLoadBalancer():
conn = ELBConnection(creds.awsAccessKey, creds.awsSecretKey)
hc = HealthCheck("instance_health", interval=15, target808="HTTP:8080/index.html")
lb = conn.create_load_balancer('my_lb', ['us-east-1a', 'us-east-1b'],[(80, 8080, 'http'), (443, 8443, 'tcp')])
lb.configure_health_check(hc)
return lb
If you are just trying to use UTF-8 characters or don't care if they are in your code, add this line to the top of your .py file
# -*- coding: utf-8 -*-
You've got a stray byte floating around. You can find it by running
with open("x.py") as fp:
for i, line in enumerate(fp):
if "\xe2" in line:
print i, repr(line)
where you should replace "x.py" by the name of your program. You'll see the line number and the offending line(s). For example, after inserting that byte arbitrarily, I got:
4 "\xe2 lb = conn.create_load_balancer('my_lb', ['us-east-1a', 'us-east-1b'],[(80, 8080, 'http'), (443, 8443, 'tcp')])\n"
Or you could just simply use:
# coding: utf-8
at top of .py file
\xe2 is the '-' character, it appears in some copy and paste it uses a different equal looking '-' that causes encoding errors.
Replace the '-'(from copy paste) with the correct '-' (from you keyboard button).
Change the file character encoding,
put below line to top of your code always
# -*- coding: utf-8 -*-
I had the same error while copying and pasting a comment from the web
For me it was a single quote (') in the word
I just erased it and re-typed it.
Adding # coding=utf-8 line in first line of your .py file will fix the problem.
Please read more about the problem and its fix on below link, in this article problem and its solution is beautifully described : https://www.python.org/dev/peps/pep-0263/
I got this error for characters in my comments (from copying/pasting content from the web into my editor for note-taking purposes).
To resolve in Text Wrangler:
Highlight the text
Go the the Text menu
Select "Convert to ASCII"
Based on PEP 0263 -- Defining Python Source Code Encodings
Python will default to ASCII as standard encoding if no other
encoding hints are given.
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
I had the same issue and just added this to the top of my file (in Python 3 I didn't have the problem but do in Python 2
#!/usr/local/bin/python
# coding: latin-1
If it helps anybody, for me that happened because I was trying to run a Django implementation in python 3.4 with my python 2.7 command
I my case \xe2 was a ’ which should be replaced by '.
In general I recommend to convert UTF-8 to ASCII using e.g. https://onlineasciitools.com/convert-utf8-to-ascii
However if you want to keep UTF-8 you can use
#-*- mode: python -*-
# -*- coding: utf-8 -*-
After about a half hour of looking through stack overflow, It dawned on me that if the use of a single quote " ' " in a comment will through the error:
SyntaxError: Non-ASCII character '\xe2' in file
After looking at the traceback i was able to locate the single quote used in my comment.
I had this exact issue running the simple .py code below:
import sys
print 'version is:', sys.version
DSM's code above provided the following:
1 'print \xe2\x80\x98version is\xe2\x80\x99, sys.version'
So the issue was that my text editor used SMART QUOTES, as John Y suggested. After changing the text editor settings and re-opening/saving the file, it works just fine.
I am trying to parse that weird windows apostraphe and after trying several things here is the code snippet that works.
def convert_freaking_apostrophe(self,string):
try:
issuer_rename = string.decode('windows-1252')
except:
issuer_rename = string.decode('latin-1')
issuer_rename = issuer_rename.replace(u'’', u"'")
issuer_rename = issuer_rename.encode('ascii','ignore')
try:
os.rename(directory+"/"+issuer,directory+"/"+issuer_rename)
print "Successfully renamed "+issuer+" to "+issuer_rename
return issuer_rename
except:
pass
#HANDLING FOR FUNKY APOSTRAPHE
if re.search(r"([\x90-\xff])", issuer):
issuer = self.convert_freaking_apostrophe(issuer)
I fixed this using pycharm. At the bottom of pycharm you can see file encoding. I noticed that it is UT-8. I changed it to US-ASCII
I had the same issue but it was because I copied and pasted the string as it is.
Later when I manually typed the string as it is the error vanished.
I had the error due to the - sign. When I replaced it with manually inputting a - the error was solved.
Copied string 10 + 3 * 5/(16 − 4)
Manually typed string 10 + 3 * 5/(16 - 4)
you can clearly see there is a bit of difference between both the hyphens.
I think it's because of the different formatting used by different OS or maybe just different software.
For me the problem had caused due to "’" that symbol in the quotes. As i had copied the code from a pdf file it caused that error. I just replaced "’" by this "'".
If you want to spot what character caused this just assign the problematic variable to a string and print it in a iPython console.
In my case
In [1]: array = [[24.9, 50.5]​, [11.2, 51.0]] # Raises an error
In [2]: string = "[[24.9, 50.5]​, [11.2, 51.0]]" # Manually paste the above array here
In [3]: string
Out [3]: '[[24.9, 50.5]\xe2\x80\x8b, [11.2, 51.0]]' # Here they are!
for me, the problem was caused by typing my code into Mac Notes and then copied it from Mac Notes and pasted into my vim session to create my file. This made my single quotes the curved type. to fix it I opened my file in vim and replaced all my curved single quotes with the straight kind, just by removing and retyping the same character. It was Mac Notes that made the same key stroke produce the curved single quote.
I was unable to find what's the issue for long but later I realised that I had copied a line "UTC-12:00" from web and the hyphen/dash in this was causing the problem. I just wrote this "-" again and the problem got resolved.
So, sometimes the copy pasted lines also give errors. In such cases, just re-write the copy pasted code and it works. On re-writing, it would look like nothing got changed but the error will be gone.
Plenty of good solutions here.
One challenge not really addressed in any of them is how to visually identify certain hard-to-spot non-ASCII characters that resemble other plain ASCII ones. For example, en dashes can appear almost exactly like hyphens and curly quotes look a lot like straight quotes, depending on your text editor's font.
This one-liner, which should work on Mac or Linux, will strip characters not in the ASCII printable range and show you the differences side-by-side:
# assumes Bash shell; for Bourne shell (sh), rearrange as a pipe and
# give '-' as second argument to 'sdiff' instead
sdiff --suppress-common-lines script.py <(tr -cd '\11\12\15\40-\176' <script.py)
The characters \11, \12, and \15 are tab, newline, and carriage return, respectively, in octal; the remaining range is the visible ASCII characters. (hat tip)
Another tip gleaned from this SO thread uses an inverse character class consisting of anything not in the ASCII visible range, and highlights it:
grep --color '[^ -~]' script.py
This should also work fine with the macOS / BSD version of grep.
When I have a similar issue when reading text files i use...
f = open('file','rt', errors='ignore')

Documenting UTF-8 Python code with Doxygen

The following code is not parsed correctly by doxygen, the "Module Docstring" is not shown in the resulting documentation:
# -*- coding: utf-8 -*-
"""
Module Docstring
"""
If I delete the first line, it's parsed correctly. But I NEED to set up the encoding, as I use non-ASCII characters on my code. Did anyone have the same problem?
I tried using doxypy, but it fails as well. Also tried many different changes in the config file.
So far, the best shot was to use the INPUT_FILTER parameter to some sort of script that strips that first line, maybe using "tail -n +3" as a filter. The problem resides that not every file needs that "coding: utf-8", so placing it in every file will be a pain. Any better ideas? Am I overlooking something?
You can specify the input encoding configuration variable:
http://www.doxygen.nl/manual/config.html#cfg_input_encoding
The variable should be set to UTF-8 (all caps, hyphen required, no spaces) as specified at http://www.gnu.org/software/libiconv/
Hope this helps. Happy documenting :-)
Looking at this link, it seems you have to put #package <packagename> in the module docstring for doxygen to do something with it.
And below on the same page you can see that doxygen actually prefers you use comments instead of docstring, because doxygen special commands are not supported in docstrings.
Edit:
To avoid confusing doxygen, put the #package comment on a separate line from the coding comment.
To get doxygen to put a package "in its proper location", you should have a look at grouping, especially modules.
I faintly remember the solution being the swap in position of coding and docstring:
#!/usr/bin/env python3
"""
Module Docstring
"""
# -*- coding: utf-8 -*-
I cannot test this right now

How does the "magic lines(s)" in python work, when specifying encoding in python file?

At the start of a python file (first line) sometimes I read
# -*- coding: utf-8 -*-
and sometimes I read
# encoding: utf-8
Both lines seem to do the same thing: specifying utf8 as encoding for all the text put in the file.
I have to questions:
Why does this even work? I thought the interpreter ignores everything after a # because it invokes a comment.
What is the difference between the two lines above? Does the interpreter just ignore the -*-?
The two forms are equivalent. The -*- version is a special kind of comment that Emacs understands. See PEP 263 for more information.
If a comment like in either of these forms is one of the first two lines of a file, the interpreter will use the specified encoding to read the file.
It works because the implementation looks for it, there is nothing magical about it. There is no difference, all possible variants are defined by PEP 263 (the only difference is that the first one is Emacs-compatible).

Where does this come from: -*- coding: utf-8 -*-

Python recognizes the following as instruction which defines file's encoding:
# -*- coding: utf-8 -*-
I definitely saw this kind of instructions before (-*- var: value -*-). Where does it come from? What is the full specification, e.g. can the value include spaces, special symbols, newlines, even -*- itself?
My program will be writing plain text files and I'd like to include some metadata in them using this format.
This way of specifying the encoding of a Python file comes from PEP 0263 - Defining Python Source Code Encodings.
It is also recognized by GNU Emacs (see Python Language Reference, 2.1.4 Encoding declarations), though I don't know if it was the first program to use that syntax.
# -*- coding: utf-8 -*- is a Python 2 thing.
In Python 3.0+ the default encoding of source files is already UTF-8 so you can safely delete that line, because unless it says something other than some variation of "utf-8", it has no effect. See Should I use encoding declaration in Python 3?
pyupgrade is a tool you can run on your code to remove those comments and other useless leftovers from Python 2, like having all your classes inherit from object.
This is so called file local variables, that are understood by Emacs and set correspondingly. See corresponding section in Emacs manual - you can define them either in header or in footer of file
In PyCharm, I'd leave it out. It turns off the UTF-8 indicator at the bottom with a warning that the encoding is hard-coded. Don't think you need the PyCharm comment mentioned above.

Categories

Resources