I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    name
When I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)
# -*- coding: UTF-8 -*-
only affects the way Python decodes string literals (and comments) in your source file; it has no effect on how standard output is configured. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.
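As a concrete sketch of that suggestion (assuming your Terminal really is set to UTF-8), the loop from the question only needs its last line changed:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *

ab = ABAddressBook.sharedAddressBook()
for person in ab.people():
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    # Keep name as unicode inside the program; encode only at the point
    # where it is written to the UTF-8 configured terminal.
    print name.encode("utf-8")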
If you run the code in your question in the interactive console, the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.
Just writing the variable name sends repr(name) to standard output, and repr() escapes non-ASCII characters so the result is plain ASCII.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy, limited or misconfigured console. If you're just trying to play with unicode at interactive prompt move to a modern unicode-aware console. Most modern Python distributions come with IDLE where you'll be able to print all unicode characters.
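To illustrate both points from the answer above (the repr echo, and writing to a file instead of the terminal), a small Python 2 sketch; the file name is just an example, and it assumes the terminal expects UTF-8:

# -*- coding: utf-8 -*-
import codecs

name = u'Jacob \xc5berg'

print repr(name)            # u'Jacob \xc5berg' -- the escaped, ASCII-safe form
print name.encode("utf-8")  # Jacob Åberg, if the terminal is set to UTF-8

# Writing to a file sidesteps the terminal entirely; codecs.open does the encoding.
with codecs.open("names.txt", "w", encoding="utf-8") as f:
    f.write(name + u"\n")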
Convert it to a unicode string through:
print unicode(name)
Related
I'm going crazy here. The internet and this SO question tell me that in python 3.x, the default encoding is UTF-8. In addition to that, my system's default encoding is UTF-8. In addition to that, I have # -*- coding: utf-8 -*- at the top of my python 3.5 file.
Still, python is using ascii:
# -*- coding: utf-8 -*-
mystring = "Ⓐ"
print(mystring)
Greets me with:
SyntaxError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)
I've also tried this: print(mystring.encode("utf-8")) and .decode("utf-8") - Same thing.
What am I missing here? How do I force python to stop using ascii encoding?
Edit: I know that it seems weird to complain about position 7 with a one-character string, but this is my actual MCVE and the exact output I'm getting. The above is from the Python shell; the example below is in a script. Both use Python 3.5.2.
Edit: Since I figured it might be relevant: The string I'm getting comes from an external application and is not hardcoded, so I need a way to get that utf-8 string and save it into a file. The above is just a minimalized and generalized example. Here is my real-life code:
# the variables being a string that might contain unicode characters
mystring = "username: " + fromuser + " | printname: " + fromname
with open("myfile.txt", "a") as myfile:
    myfile.write(mystring + "\n")
In Python 3 all strings are unicode, so the problem you're having is likely due to your locale settings not being correct. The Python 3 interpreter looks at the locale environment variables, and if it cannot find them it falls back to a basic ASCII emulation.
From locale.py:
except ImportError:
# Locale emulation
CHAR_MAX = 127
LC_ALL = 6
LC_COLLATE = 3
LC_CTYPE = 0
LC_MESSAGES = 5
LC_MONETARY = 4
LC_NUMERIC = 1
LC_TIME = 2
Error = ValueError
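It can also help to see what the interpreter has actually picked up before trying anything else; a small diagnostic sketch (Python 3):

import locale
import sys

# If these report 'ascii' or 'ANSI_X3.4-1968', the locale environment
# is what needs fixing, not your source file.
print(sys.stdout.encoding)                  # encoding used by print()
print(locale.getpreferredencoding(False))   # encoding used by open() without encoding=
print(locale.getdefaultlocale())            # what LANG / LC_* resolve to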
Double-check the locale in the shell from which you are executing. Here are a few workarounds you can try, to see if they get you working before you go through the task of getting your environment set up correctly.
1) Validate that UTF-8 locale or language files are installed (see link above)
2) Try adding this to the top of your script
#!/usr/bin/env LC_ALL=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')
or
#!/usr/bin/env LANG=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')
Or export shell variables before executing the Python interpreter
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
python3
>>> print('カタカナ')
Sorry I cannot be more specific, as these settings are platform and OS specific. You can forcefully attempt to set the locale in Python directly using the locale module, but I don't recommend that, and it won't help if they are not installed.
Hope that helps.
What's new in Python 3.0 says:
All text is Unicode; however encoded Unicode is represented as binary data
If you want to try outputting utf-8, here's an example:
b'\x41'.decode("utf-8", "strict")
If you'd like to use unicode in a string literal, use the unicode escape and its coded representation. For your example:
print("\u24B6")
I write code that contains non-ASCII characters, like this:
print "Öüç"
I know that Python's default encoding is ASCII. So I add this to my code.
#-*- coding:utf-8 -*-
When I run my code, the string "Öüç" appears like this:
├û├╝├ğ
What should I do?
That is only loosely related to Python. Even #-*- coding:utf-8 -*- is useless here: it only allows you to use encoded unicode literals in Python source.
It just allowed me to guess that your source is UTF-8 encoded, so "Öüç" is in fact the following byte string: '\xc3\x96\xc3\xbc\xc3\xa7'. And what you see is those bytes displayed in code page 437.
I assume that you are using Windows, and that the chcp command in a CMD window would confirm that the code page in use is indeed 437.
What can be done? First you must select, in the console, a code page able to display the 3 characters; I would advise code page 850: run chcp 850 before starting Python.
Then in Python, you decode the UTF-8 string into unicode and encode it in cp850:
print "Öüç".decode("utf8").encode('cp850')
Alternatively, you can use the windows 1252 code page which is close to Latin1: chcp 1252 before starting Python and then:
print "Öüç".decode("utf8").encode('latin1')
Say I have a function:
def NewFunction():
    return '£'
I want to print some stuff with a pound sign in front of it. When I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
    return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
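A quick sketch of the difference that import makes:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

def NewFunction():
    return '£'   # now a unicode object, exactly as if it were written u'£'

print type(NewFunction())             # <type 'unicode'> instead of <type 'str'>
print NewFunction().encode('utf-8')   # encode explicitly when sending to a UTF-8 terminal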
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- coding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
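If you are not sure what your editor actually saved, a small sketch along those lines that reads the raw bytes and tries a few candidate encodings ('blah.py' is a placeholder for your real file name):

# Python 2 sketch: inspect the raw bytes of the source file.
with open('blah.py', 'rb') as f:
    raw = f.read()

for enc in ('utf-8', 'latin-1', 'cp1252'):
    try:
        print enc, repr(raw.decode(enc))
    except UnicodeDecodeError as e:
        print enc, 'failed:', e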
Adding the following two lines in the script solved the issue for me.
#!/usr/bin/python
# coding=utf-8
Hope it helps!
You're probably trying to run a Python 3 file with the Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a solution not yet mentioned on this page is to re-save the file in UTF-8+BOM encoding. That adds three special bytes to the start of the file which explicitly inform the Python interpreter (and your text editor) about the file encoding.
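If your editor cannot save with a BOM, a hedged sketch of adding one from Python itself; 'script.py' is a placeholder and the file must already be valid UTF-8:

import codecs

with open('script.py', 'rb') as f:
    data = f.read()

# codecs.BOM_UTF8 is the three bytes EF BB BF; only add them once.
if not data.startswith(codecs.BOM_UTF8):
    with open('script.py', 'wb') as f:
        f.write(codecs.BOM_UTF8 + data)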
I have the following code:
# -*- coding: utf-8 -*-
print "╔╤╤╦╤╤╦╤╤╗"
print "╠╪╪╬╪╪╬╪╪╣"
print "╟┼┼╫┼┼╫┼┼╢"
print "╚╧╧╩╧╧╩╧╧╝"
print "║"
print "│"
and for some reason, only the third line (╚╧╧╩╧╧╩╧╧╝) actually outputs properly, the rest is an odd combination of symbols. I assume this is due to some encoding issues. The full output in IDLE is as follows:
╔╤╤╦╤╤╦╤╤╗
╠╪╪╬╪╪╬╪╪╣
╟┼┼╫┼┼╫┼┼╢
╚╧╧╩╧╧╩╧╧╝
â•‘
│
What is causing this and how can I fix this? I'm using a tablet (Surface Pro 3 with Win10) with only a touch keyboard, so any solution with the least amount of typing (especially typing out weird characters) would be ideal, but obviously all help is appreciated.
Mojibake indicates that text encoded in one encoding is being shown in another, incompatible encoding:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u"╔╤╤╦╤╤╦╤╤╗".encode('utf-8').decode('cp1252')) #XXX: DON'T DO IT
# -> â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—
There are several places where the wrong encoding could be used.
# coding: utf-8 encoding declaration says how non-ascii characters in your source code (e.g., inside string literals) should be interpreted. If print u"╔╤╤╦╤╤╦╤╤╗" works in your case then it means that the source code itself is decoded to Unicode correctly. For debugging, you could write the string using only ascii characters: u'\u2554\u2557' == u'╔╗'.
print "╔╤╤╦╤╤╦╤╤╗" (DON'T DO IT) prints bytes (text encoded using utf-8 in this case) as is. IDLE itself works with Unicode (BMP). The bytes must be decoded into Unicode text before they can be shown in IDLE. It seems IDLE uses ANSI code page such as cp1252 (locale.getpreferredencoding(False)) to decode the output bytes on Windows. Don't print text as bytes. It will fail in any environment that uses a character encoding different from your source code e.g., you would get ΓòöΓòù... mojibake if you run the code from the question in Windows console that uses cp437 OEM code page.
You should use Unicode for all text in your program. Python 3 even forbids non-ascii characters inside a bytes literal. You would get SyntaxError there.
print(u'\u2554\u2557') might fail with UnicodeEncodeError if you run the code in the Windows console and the OEM code page (such as cp437) isn't able to represent the characters. To print arbitrary Unicode characters in the Windows console, use the win-unicode-console package. You don't need it if you use IDLE.
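As a sketch of the ASCII-only debugging trick mentioned above, the top border of the box can be printed from escapes alone; if this works while the byte-string version does not, the problem is in how the bytes are decoded, not in the font or console:

# ASCII-only source: \u2554 is the top-left corner, \u2564 and \u2566 the tees, \u2557 the top-right corner.
print u'\u2554' + 2 * u'\u2564' + u'\u2566' + 2 * u'\u2564' + u'\u2566' + 2 * u'\u2564' + u'\u2557'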
Putting a u before the strings fixed the issue, as per @FredLarson's suggestion:
print u"╔╤╤╦╤╤╦╤╤╗"
print u"╠╪╪╬╪╪╬╪╪╣"
print u"╟┼┼╫┼┼╫┼┼╢"
print u"╚╧╧╩╧╧╩╧╧╝"
print u"║"
print u"│"
The exact cause still isn't known, since it seemed to work on other systems and it's odd that the third line worked fine.
I'm working with Python 2.7 and I need to know how to print UTF-8 characters. Can anyone help me?
-> I already tried putting # coding: iso-8859-1 -*- at the top,
-> using encode like print "nome do seu chápa".encode('iso-8859-1') also doesn't work, and even
-> using print u"Nâo" doesn't work
A more complete response.
Strings have two types in Python 2, str and unicode.
When using str, you are using bytes so you can write them directly to files like stdout.
When using unicode, it has to be serialized or encoded to bytes before writing to files.
So, what happens here? print "nome do seu chápa".encode('iso-8859-1')
You have bytes, but you try to encode them, so Python 2 first decodes them behind your back (implicitly, using ASCII) and then encodes using the requested codec. This may work if you're lucky, or it may produce gibberish.
Now, when doing the following: print u"Nâo".encode('utf-8')
You are telling Python 2 that you are starting with Unicode, so it will encode it without the problematic implicit decode.
Python 3 solved this nastiness.
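A small sketch of that difference, assuming Python 2 and a source file saved as UTF-8:

# -*- coding: utf-8 -*-
byte_string = "Nâo"      # str: the raw UTF-8 bytes as saved in the file
text = u"Nâo"            # unicode: real text

print repr(byte_string)  # 'N\xc3\xa2o' -- two bytes for the â
print repr(text)         # u'N\xe2o'    -- one code point

try:
    byte_string.encode('utf-8')   # triggers a hidden ASCII decode first
except UnicodeDecodeError as e:
    print 'implicit decode failed:', e

print text.encode('utf-8')        # unicode -> bytes, no implicit decode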
Are you sure you put this line properly at the top of the document?
# -*- coding: utf-8 -*-
Also, I don't know whether there is any difference between iso and utf in this particular problem, but I experienced some trouble with latin1 in my mother tongue and some technical content, so I recommend the second one.
I got the answer: my console needed to be restarted. I use Spyder (from Python(x,y)) for development and this error occurred, so beware.
UPDATE: Spyder console seems to suck, because to get it to work, I had to use string.encode('latin1') and (now here's the catch) OPEN A NEW CONSOLE! If I try to reuse my already open console, special characters just won't work.
You have two options:
1) Put this as the first line of your code:
# -*- coding: utf-8 -*-
2) Put u at the beginning of every string with UTF-8 characters:
u'não vou esqueçer mais de usar o u no começo'
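Put together, a minimal sketch (assuming the terminal expects UTF-8):

# -*- coding: utf-8 -*-
texto = u'não vou esqueçer mais de usar o u no começo'
print texto.encode('utf-8')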