Python unicode string literals in module declared as utf-8

Python unicode string literals in module declared as utf-8 - python

I have a dummie Python module with the utf-8 header that looks like this:
# -*- coding: utf-8 -*-
a = "á"
print type(a), a
Which prints:
<type 'str'> á
But I thought that all string literals inside a Python module declared as utf-8 whould automatically be of type unicode, intead of str. Am I missing something or is this the correct behaviour?
In order to get a as an unicode string I use:
a = u"á"
But this doesn't seem very "polite", nor practical. Is there a better option?

# -*- coding: utf-8 -*-
doesn't make the string literals Unicode. Take this example, I have a file with an Arabic comment and string, file is utf-8:
# هذا تعليق عربي
print type('نص عربي')
if I run it it will throw a SyntaxError exception:
SyntaxError: Non-ASCII character '\xd9' in file file.py
on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
so to allow this I have to add that line to tell the interpreter that the file is UTF-8 encoded:
# -*-coding: utf-8 -*-
# هذا تعليق عربي
print type('نص عربي')
now it runs fine but it still prints <type 'str'> unless I make the string Unicode:
# -*-coding: utf-8 -*-
# هذا تعليق عربي
print type(u'نص عربي')

No, the codec at the top only informs Python how to interpret the source code, and uses that codec to interpret Unicode literals. It does not turn literal bytestrings into unicode values. As PEP 263 states:
This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The encoding information is then used by the
Python parser to interpret the file using the given encoding. Most
notably this enhances the interpretation of Unicode literals in
the source code and makes it possible to write Unicode literals
using e.g. UTF-8 directly in an Unicode aware editor.
Emphasis mine.
Without the codec declaration, Python has no idea how to interpret non-ASCII characters:
$ cat /tmp/test.py
example = '☃'
$ python2.7 /tmp/test.py
File "/tmp/test.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file /tmp/test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
If Python behaved the way you expect it to, you would not be able to literal bytestring values that contain non-ASCII byte values either.
If your terminal is configured to display UTF-8 values, then printing a UTF-8 encoded byte string will look 'correct', but only by virtue of luck that the encodings match.
The correct way to get unicode values, is by using unicode literals or by otherwise producing unicode (decoding from byte strings, converting integer codepoints to unicode characters, etc.):
unicode_snowman = '\xe2\x98\x83'.decode('utf8')
unicode_snowman = unichr(0x2603)
In Python 3, the codec also applies to how variable names are interpreted, as you can use letters and digits outside of the ASCII range in names. The default codec in Python 3 is UTF-8, as opposed to ASCII in Python 2.

No this is just source code encoding. Please see http://www.python.org/dev/peps/pep-0263/
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
This doesn't make all literals unicode just point how unicode literals should be decoded.
One should use unicode function or u prefix to set literal as unicode.
N.B. in python3 all strings are unicode.

Related

Encode Strings in Python

I want to run my code on terminal but it shows me this error :
SyntaxError: Non-ASCII character '\xd8' in file streaming.py on line
72, but no encoding declared; see http://python.org/dev/peps/pep-0263/
for detail
I tried to encode the Arabic string using this :
# -*- coding: utf-8 -*-
st = 'المملكة العربية السعودية'.encode('utf-8')
It's very important for me to run it on the terminal so I can't use IDLE.

The problem is since you are directly pasting your characters in to a python file, the interpreter (Python 2) attempts to read them as ASCII (even before you encode, it needs to define the literal), which is illegal. What you want is a unicode literal if pasting non-ASCII bytes:
x=u'المملكة العربية السعودية' #Or whatever the corresponding bytes are
print x.encode('utf-8')
You can also try to set the entire source file to be read as utf-8:
#/usr/bin/python
# -*- coding: utf-8 -*-
and don't forget to make it run-able, and lastly, you can import the future from Python 3:
from __future__ import unicode_literal
at the top of the file, so string literals by default are utf-8. Note that \xd8 appears as phi in my terminal, so make sure the encoding is correct.

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.

I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.

Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-

First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.

Adding the following two lines in the script solved the issue for me.
# !/usr/bin/python
# coding=utf-8
Hope it helps !

You're probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.

python unicode woes - convert cp1252 string to unicode

I think I'm just fundamentally confused about char sets that are not ascii.
I have a python file that I have declared at the top to be # -*- coding: cp1252 -*-.
In the file I have question = "what is your borther’s name", for example.
type(question)
>> str
question
>> 'what is your borther\xe2\x80\x99s name'
And I cannot convert to unicode at this point, presumably because you can't go from ASCII to Unicode.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 20: ordinal not in range(128)
if I declare as unicode to begin with:
question = "what is your borther’s name"
>> u'what is your borther\u2019s name'
How do I get "what is your borther’s name" back? Or is just a how python interpreter displays unicode strings and it in fact will encode correctly when I pass it to an unicode-aware application (in this case, Office)?
I need to preserve the special characters but I still need to do a string comparison using Levenshtein library (pip install python-Levenshtein).
Levenshtein.ratio takes str or unicode for both of its arguments, but not mixed.

I have a plain text file that I have declared at the top to be # -*- coding: cp1252 -*-.
That does nothing.
with codecs.open(..., encoding='cp1252') as fp:
...

Python convert hanzi character

How do I convert between a hanzi character and it's unicode value as depicted below?
与 to U+4E0E
今 to U+4ECA
令 to U+4EE4
免 to U+514D
Appears unsupported by default:
>>> a = '安'
Unsupported characters in input

The small 'u' in front of the quote indicates that a Unicode string is supposed to be created.
>>> a = u'与'
>>> a
u'\u4e0e'
See the the string documentation for more information: http://docs.python.org/tutorial/introduction.html#unicode-strings
Update:
Set the source file encoding according to the actual encoding of the file, so that the interpreter knows how to parse it.
For example, to use UTF-8 just add this string to the header of the file:
# -*- coding: utf8 -*-

How to encode 'Importação de petróleo' string in python?

I want to use "Importação de petróleo" in my program.
How can I do that because all encodings give me errors as cannot encode.

I think you're confusing the string __repr__ with its __str__:
>>> s = u"Importação de petróleo"
>>> s
u'Importa\xe7\xe3o de petr\xf3leo'
>>> print s
Importação de petróleo
There's no problem with \xe7 and friends; they are just the encoding representation for those special characters. You can't avoid them and you shouldn't need to :)
A must-to-read link about unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Do this
# -*- coding: utf-8 -*-
print 'Importação de petróleo'
place
# -*- coding: utf-8 -*-
on very top of the program (first line).
Also save your code as utf-8 (default if you are using linux)

If you are using characters in a source (.py) file which are outside of the ASCII range, then you will need to specify the encoding at the top of the file, so that the Python lexer knows how to read and interpret the characters in the file.
If this is the case, then, as the very first line of your file, use this:
# coding: utf-8
(If your file is actually in a different encoding, such as ISO-8859-1, then you will need to use that instead. Python can handle several different character encodings; you just have to tell it what to expect)

Adding a 'u' in front of the string makes it unicode. The documentation here gives details regarding Unicode handling in Python 2.x:-
Python 2.x Unicode support

As specialscope mentioned, first thing, you have add this as the first line of your program:
# -*- coding: utf-8 -*-
If you don’t, you’ll get an error which looks something like this:
SyntaxError: Non-ASCII character '\xc3' in file /tmp/blah.py on line 10,
but no encoding declared; see http://www.python.org/peps/pep-0263.html
for details
So far, so good. Now, you have to make sure that every string that contains anything besides plain ASCII is prefixed with u:
print u'Importação de petróleo'
But there’s one more step. This is a separate topic, but chances are that you’re going to have to end up re-encoding that string before you send it to stdout or a file.
Here are the rules of thumb for Unicode in Python:
If at all possible make sure that any data you’re working with is in UTF-8.
When you read external UTF-8 encoded data into your program, immediately decode it into Unicode.
When you send data out of your program (to a file or stdout), make sure that you re-encode it as UTF-8.
This all changes in Python 3, by the way.

Help on class unicode in module builtin:
class unicode(basestring)
| unicode(string [, encoding[, errors]]) -> object
|
| Create a new Unicode object from the given encoded string.
| encoding defaults to the current default string encoding.
| errors can be 'strict', 'replace' or 'ignore' and defaults to 'strict'.
|
try using "utf8" as the encoding for unicode()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.