How to pass Unicode title to matplotlib? - python

Can't get the titles right in matplotlib:
'technologieën in °C' gives: technologieÃn in ÃC
Possible solutions already tried:
u'technologieën in °C' doesn't work
neither does: # -*- coding: utf-8 -*- at the beginning of the code-file.
Any solutions?

You need to pass in unicode text:
u'technologieën in °C'
Do make sure you use the # -*- coding: utf-8 -*- comment at the top, and make sure your text editor is actually using that codec. If your editor saves the file as Latin-1 encoded text, use that codec in the header, etc. The comment communicates to Python how to interpret your source file, especially when it comes to parsing string literals.
Alternatively, use escape codes for anything non-ASCII in your Unicode literals:
u'technologie\u00ebn in \u00b0C'
and avoid the issue of what codec to use in the first place.
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
before you continue.
Most fonts will support the °, but if you see a box displayed instead, then you have a font issue and need to switch to a font that supports the characters you are trying to display. For example, if Ariel supports your required characters, then use:
matplotlib.rc('font', family='Arial')
before plotting.

In Python3, there is no need to worry about all that troublesome UTF-8 problems.
One note that you will need to set a Unicode font before plotting.
matplotlib.rc('font', family='Arial')

Related

Python 2.x Strings: Unicode vs. Bytes

I deal with languages that are non-us and also sometimes still have to write in Python 2.x. Reading this article: http://www.snarky.ca/why-python-3-exists by Brett Cannon makes me wonder if that implies that if I use strings that are only characters and not bytes, should I prepend all my strings with u, to avoid a potential mix up betwen byte-strings and unicode-strings? And: Does this also apply for Jython?
And one last question: -*- coding: utf-8 -*- is completely independed of the above, providing only the encoding of the file itself - correct?
Yes, you want to keep text in unicode objects (the str type in Python 3), and maintain a Unicode sandwich (decode incoming data as soon as possible, postpone encoding until the data needs to exit your application). See Ned Batchelder's excellent Unicode presentation.
This also applies to Jython, which is just another implementation of the Python language.
The PEP 263 source code encoding declaration tells the interpreter what codec to use when decoding bytes in your source code. It helps when defining Unicode literals with non-ASCII bytes, but doesn't dictate how other data other than the source code is encoded or decoded.

Python UTF-8 Latin-1 displays wrong character

I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python).
I tried a method like this:
def latin1_to_unicode(character):
uni = character.decode('latin-1').encode("utf-8")
retutn uni
It works fine for characters that are not specific to the latin-1 set, but if I try the following example:
print latin1_to_Unicode('å')
It returns Ã¥ instead of å. Same goes for other letters like æ and ø.
Can anyone please explain why this is happening?
Thanks
I have the # -*- coding: utf8 -*- declaration in my script, if it matters any to the problem
Your source code is encoded to UTF-8, but you are decoding the data as Latin-1. Don't do that, you are creating a Mojibake.
Decode from UTF-8 instead, and don't encode again. print will write to sys.stdout which will have been configured with your terminal or console codec (detected when Python starts).
My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:
>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥
You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.
Decoding those two bytes as Latin-1 produces two Unicode codepoints corresponding to the Latin-1 codec.
You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

Why is sys.getdefaultencoding() different from sys.stdout.encoding and how does this break Unicode strings?

I spent a few angry hours looking for the problem with Unicode strings that was broken down to something that Python (2.7) hides from me and I still don't understand. First, I tried to use u".." strings consistently in my code, but that resulted in the infamous UnicodeEncodeError. I tried using .encode('utf8'), but that didn't help either. Finally, it turned out I shouldn't use either and it all works out automagically. However, I (here I need to give credit to a friend who helped me) did notice something weird while banging my head against the wall. sys.getdefaultencoding() returns ascii, while sys.stdout.encoding returns UTF-8. 1. in the code below works fine without any modifications to sys and 2. raises a UnicodeEncodeError. If I change the default system encoding with reload(sys).setdefaultencoding("utf8"), then 2. works fine. My question is why the two encoding variables are different in the first place and how do I manage to use the wrong encoding in this simple piece of code? Please, don't send me to the Unicode HOWTO, I've read that obviously in the tens of questions about UnicodeEncodeError.
# -*- coding: utf-8 -*-
import sys
class Token:
def __init__(self, string, final=False):
self.value = string
self.final = final
def __str__(self):
return self.value
def __repr__(self):
return self.value
print(sys.getdefaultencoding())
print(sys.stdout.encoding)
# 1.
myString = "I need 20 000€."
tok = Token(myString)
print(tok)
reload(sys).setdefaultencoding("utf8")
# 2.
myString = u"I need 20 000€."
tok = Token(myString)
print(tok)
My question is why the two encoding variables are different in the first place
They serve different purposes.
sys.stdout.encoding should be the encoding that your terminal uses to interpret text otherwise you may get mojibake in the output. It may be utf-8 in one environment, cp437 in another, etc.
sys.getdefaultencoding() is used on Python 2 for implicit conversions (when the encoding is not set explicitly) i.e., Python 2 may mix ascii-only bytestrings and Unicode strings together e.g., xml.etree.ElementTree stores text in ascii range as bytestrings or json.dumps() returns an ascii-only bytestring instead of Unicode in Python 2 — perhaps due to performance — bytes were cheaper than Unicode for representing ascii characters. Implicit conversions are forbidden in Python 3.
sys.getdefaultencoding() is always 'ascii' on all systems in Python 2 unless you override it that you should not do otherwise it may hide bugs and your data may be easily corrupted due to the implicit conversions using a possibly wrong encoding for the data.
btw, there is another common encoding sys.getfilesystemencoding() that may be different from the two. sys.getfilesystemencoding() should be the encoding that is used to encode OS data (filenames, command-line arguments, environment variables).
The source code encoding declared using # -*- coding: utf-8 -*- may be different from all of the already-mentioned encodings.
Naturally, if you read data from a file, network; it may use character encodings different from the above e.g., if a file created in notepad is saved using Windows ANSI encoding such as cp1252 then on another system all the standard encodings can be different from it.
The point being: there could be multiple encodings for reasons unrelated to Python and to avoid the headache, use Unicode to represent text: convert as soon as possible encoded text to Unicode on input, and encode it to bytes (possibly using a different encoding) as late as possible on output — this is so called the concept of Unicode sandwich.
how do I manage to use the wrong encoding in this simple piece of code?
Your first code example is not fine. You use non-ascii literal characters in a byte string on Python 2 that you should not do. Use bytestrings' literals only for binary data (or so called native strings if necessary). The code may produce mojibake such as I need 20 000Γé¼. (notice the character noise) if you run it using Python 2 in any environment that does not use utf-8-compatible encoding such as Windows console
The second code example is ok assuming reload(sys) is not part of it. If you don't want to prefix all string literals with u''; you could use from __future__ import unicode_literals
Your actual issue is UnicodeEncodeError error and reload(sys) is not the right solution!
The correct solution is to configure your locale properly on POSIX (LANG, LC_CTYPE) or set PYTHONIOENCODING envvar if the output is redirected to a pipe/file or install win-unicode-console to print Unicode to Windows console.
I have noticed the same behaviour of some standard code (mailman library).
Thanks for your analysis, it helped me save some time. :-)
The problem is exactly the same. My system uses sys.getdefaultencoding() and gets ascii, which is inappropriate to handle a list of 1000 UTF-8 encoded names.
There is a mismatch between stdin/stdout and even filesystem encoding (utf-8) on one hand and "defaultencoding" on the other (ascii). This thread: How to print UTF-8 encoded text to the console in Python < 3? seems to indicate that it is well known and Changing default encoding of Python? contains some indication that a more homogeneous (like "utf-8 everywhere") would break other things like the hash implementation.
For that reason it is also not straightforward to change the defaultencoding. (See http://blog.ianbicking.org/illusive-setdefaultencoding.html for various ways to do so.) It is removed from the sys instance in the site.py file.

strings in hebrew in python for s60

I'm using python for S60.
I want to use string in hebrew, to represent them on the GUI and to send them in SMS message.
It seems that the PythonScriptShell don't accept such expressions, for example:
u"אבגדה"
what can I do?
thanks
development of situation:
I added the line:
# -*- coding: utf-8 -*-
as the first line in the source file and in notepad++ I selected: Encoding>>Convert to utf8.
now, the GUI appears in Hebrew but when I selected an option the selection value cannot be compared to a string in Hebrew in the code (probably) and there is no response.
On PythonScriptShell appears the warning:
Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
Help me, please.
I just tested this in both bluetooth and on-phone consoles with PyS60 2.0, and non-ASCII unicode was handled w/out exceptions.
If you have that string in the file rather than passing it in the console, error is caused by lack of encoding specification in the file.
Add # -*- coding: utf-8 -*- as first line there.
convert your words to unicode characters using
unichr
eg unichr(1507) for char ף
refer to the decimal values in this table: http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0590
Add up
ru = lambda txt: str(txt).decode('utf-8','ignore')
And add the function before each text use
ru("אבגדה")

short Unicode \N{} names for Latin-1 characters in Python?

Are there short Unicode u"\N{...}" names for Latin1 characters in Python ?
\N{A umlaut} etc. would be nice,
\N{LATIN SMALL LETTER A WITH DIAERESIS} etc. is just too long to type every time.
(Added:) I use an English keyboard, but occasionally need German letters, as in "Löwenbräu Weißbier".
Yes one can cut-paste them singly, L cutpaste ö wenbr cutpaste ä ...
but that breaks the flow; I was hoping for a keyboard-only way.
Sorry, no, there's no such thing. In string literals, anyway... you could perhaps piggyback on another encoding scheme, such as HTML:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape(u'a ä b c')
u'a \xe4 b'
But I don't think this'd be worth it.
Hardly anyone even uses the \N notation in any case... for the occasional character the \xnn notation is acceptable; for more involved usage you're better off just typing ä directly and making sure a # coding= is defined in the script as per PEP263. (If you don't have a keyboard layout that can type those diacriticals directly, get one. eg. eurokb on Windows, or using the Compose key on Linux.)
If you want to do the right thing please use UTF-8 in your python source code. This will keep the code much more readable.
Python is able to real UTF-8 source files, all you have to do is to add an additional line after the first one:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
By the way, starting with Python 3.0 UTF-8 is the default encoding so you will not need this line anymore. See PEP3120
You can put an actual "ä" character in your string. For this you have to declare the encoding of the source code at the top
#!/usr/bin/env python
# encoding: utf-8
x = u"ä"
Have you thought about writing your own converter? It wouldn't be hard to write something that would go through a file and replace \N{A umlaut} with \N{LATIN SMALL LETTER A WITH DIAERESIS} and all the rest.
You can use the Unicode notation \uXXXX do describe that character:
u"\u00E4"
On Windows, you can use the charmap.exe utility to look up the keyboard shortcut for common letters you're using such as:
ALT-0223 = ß
ALT-0228 = ä
ALT-0246 = ö
Then use Unicode and save in UTF-8:
# -*- coding: UTF-8 -*-
phrase = u'Löwenbräu Weißbier'
or use a converter as someone else mentioned and make up your own shortcuts:
# -*- coding: UTF-8 -*-
def german(s):
s = s.replace(u'SS',u'ß')
s = s.replace(u'a:',u'ä')
s = s.replace(u'o:',u'ö')
return s
phrase = german(u'Lo:wenbra:u WeiSSbier')
print phrase

Categories

Resources