UTF-8 Character Printing Error Python 2.7

UTF-8 Character Printing Error Python 2.7 - python

I write codes that contain non-ASCII characters like this;
print "Öüç"
I know that Python's default encoding is ASCII. So I add this to my code.
#-*- coding:utf-8 -*-
When I launch my code, "Öüç" string appear like this;
├û├╝├ğ
What should I do?

That is only loosely related to Python. Even #-*- coding:utf-8 -*- is useless here: it is only meant to allow to use encoded unicode litterals in Python source.
It just allowed me to guess that your source was UTF-8 encoded, so "Öüç" is in fact the following string: '\xc3\x96\xc3\xbc\xc3\xa7'. And what you see is those characters in the code page 437.
I assume that you use a Windows system, and that the chcp command in a CMD windows would confirm you that the code page used is indeed 437.
What can be done? First you must select in the console a code page able to display the 3 characters, I would advise the 850 code page: chcp 850 before starting Python
Then in Python, you decode the UTF-8 string into unicode and encode it in cp850:
print "Öüç".decode("utf8").encode('cp850')
Alternatively, you can use the windows 1252 code page which is close to Latin1: chcp 1252 before starting Python and then:
print "Öüç".decode("utf8").encode('latin1')

Related

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.

I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.

Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-

First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.

Adding the following two lines in the script solved the issue for me.
# !/usr/bin/python
# coding=utf-8
Hope it helps !

You're probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.

Inconsistent output of unicode box-drawing characters in python IDLE

I have the following code:
# -*- coding: utf-8 -*-
print "╔╤╤╦╤╤╦╤╤╗"
print "╠╪╪╬╪╪╬╪╪╣"
print "╟┼┼╫┼┼╫┼┼╢"
print "╚╧╧╩╧╧╩╧╧╝"
print "║"
print "│"
and for some reason, only the third line (╚╧╧╩╧╧╩╧╧╝) actually outputs properly, the rest is an odd combination of symbols. I assume this is due to some encoding issues. The full output in IDLE is as follows:
â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—
â• â•ªâ•ªâ•¬â•ªâ•ªâ•¬â•ªâ•ªâ•£
â•Ÿâ”¼â”¼â•«â”¼â”¼â•«â”¼â”¼â•¢
╚╧╧╩╧╧╩╧╧╝
â•‘
â”‚
What is causing this and how can I fix this? I'm using a tablet (Surface Pro 3 with Win10) with only a touch keyboard, so any solution with the least amount of typing (especially typing out weird characters) would be ideal, but obviously all help is appreciated.

Mojibake indicates that the text encoded in one encoding is shown in another incompatible encoding:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u"╔╤╤╦╤╤╦╤╤╗".encode('utf-8').decode('cp1252')) #XXX: DON'T DO IT
# -> â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—
There are several places where the wrong encoding could be used.
# coding: utf-8 encoding declaration says how non-ascii characters in your source code (e.g., inside string literals) should be interpreted. If print u"╔╤╤╦╤╤╦╤╤╗" works in your case then it means that the source code itself is decoded to Unicode correctly. For debugging, you could write the string using only ascii characters: u'\u2554\u2557' == u'╔╗'.
print "╔╤╤╦╤╤╦╤╤╗" (DON'T DO IT) prints bytes (text encoded using utf-8 in this case) as is. IDLE itself works with Unicode (BMP). The bytes must be decoded into Unicode text before they can be shown in IDLE. It seems IDLE uses ANSI code page such as cp1252 (locale.getpreferredencoding(False)) to decode the output bytes on Windows. Don't print text as bytes. It will fail in any environment that uses a character encoding different from your source code e.g., you would get ΓòöΓòù... mojibake if you run the code from the question in Windows console that uses cp437 OEM code page.
You should use Unicode for all text in your program. Python 3 even forbids non-ascii characters inside a bytes literal. You would get SyntaxError there.
print(u'\u2554\u2557') might fail with UnicodeEncodeError if you would run the code in Windows console and OEM code page such as cp437 weren't be able to represent the characters. To print arbitrary Unicode characters in Windows console, use win-unicode-console package. You don't need it if you use IDLE.

Putting a u before the strings fixed the issue, as per #FredLarson's suggestion:
print u"╔╤╤╦╤╤╦╤╤╗"
print u"╠╪╪╬╪╪╬╪╪╣"
print u"╟┼┼╫┼┼╫┼┼╢"
print u"╚╧╧╩╧╧╩╧╧╝"
print u"║"
print u"│"
The exact cause still isn't known, since it seemed to work on other systems and it's odd that the third line worked fine.

Why does Emacs get my literal Unicode strings wrong?

As far as I know, these should be equivalent in a system that uses UTF-8 as the default encoding:
pattern1 = 'Wörterbuch Wortformen'.decode('utf8')
pattern2 = u'Wörterbuch Wortformen'
However, when I send these lines from an Emacs buffer to the Python process (M-x python-shell-send-region) something strange happens.
>>> pattern1
u'W\xf6rterbuch Wortformen'
>>> pattern2
u'W\xc3\xb6rterbuch Wortformen'
In a Python shell run in a terminal, both lines result in u'W\xf6rterbuch Wortformen'.
What is going on here?
My locale is configured to use UTF-8.

Here's what I did (might appear helpful later):
Created a single-bit encoded file, say /tmp/test.dat Opened it in Emacs using hexl-mode.
Using hexl-insert-hex-char command inserted bytes C3 and B6.
Opened this file as text (using text-mode). Emacs recognized it as file with multibyte encoding and displayed ö in place of the previous bytes.
Conclusion: you need the encoding system in your buffer which contains the source code to be utf-8 to send two bytes for ö. However, if it is a single-byte encoding, and given that you select the locale that maps the byte F6 to ö, you will get that byte.
PS. Make sure you have -*- coding: utf-8 -*- comment.

It turns out that it was a bug in python.el.

Making Swedish characters show properly in Windows Command Prompt using Python in Notepad++

The title explains it well. I have set up Notepad++ to open the Python script in the command prompt when I press F8 but all Swedish characters looks messed up when opening in CMD but perfectly fine in e.g IDLE.
This simple example code:
#!/usr/bin/env python
#-*- coding: UTF-8 -*-
print "åäö"
Looks like this.
As you can see the output of the batch file I use to open Python in cmd below shows the characters correctly but not the Python script above it. How do I fix this? I just want to show the characthers correctly I dont necessarily have too use UTF-8.
I open the file in cmd using this method.
Update: Solved. Added a "chcp 1252" line at the top of the batch file and then a cls line below it to remove the message about what character encoding it uses. Then I used "# -- coding: cp1252 --" in the python script and changed the font in cmd to Lucida Console. This is done by clicking the cmd icon at the top right of the cmd window and go into properties.

You're printing out UTF-8 bytes, but your console is not set to UTF-8. Either write Unicode as UTF-16, or set your console codepage to UTF-8.
print u"åäö"

I had the same issue and I used cp1252
C:>chcp 1252
This made the console to use 1252 encoding and then I ran my program which displayed the swedish characters with mercy.

Python will normally convert Unicode strings to the Windows console's encoding. Note that to use Unicode properly, you need Unicode strings (e.g., u'string') and need to declare the encoding the file is saved in with a coding: line.
For example, this (saved in UTF-8 as x.py on my system):
# coding: utf8
print u"åäö"
Produces this:
C:\>chcp
Active code page: 437
C:\>x
åäö
You'll only be able to successfully print characters that are supported by the active code page.

Set coding to: # -*- coding: ISO-8859-1 -*-
This worked for me and I tried a lot of different solutions to get it to work with Visual Studio IDE for Python.
# -*- coding: ISO-8859-1 -*-
print ("åäö")

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
name
when I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)

# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.

If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.

Just writing the variable name sends repr(name) to the standard output and repr() encodes all unicode values.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy,
limited or misconfigured console. If
you're just trying to play with
unicode at interactive prompt move to
a modern unicode-aware console. Most
modern Python distributions come with
IDLE where you'll be able to print all
unicode characters.

Convert it to a unicode string through:
print unicode(name)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.