I'm using Python for S60.
I want to use Hebrew strings, both to display them in the GUI and to send them in SMS messages.
It seems that the PythonScriptShell doesn't accept such expressions, for example:
u"אבגדה"
What can I do?
Thanks.
Update:
I added the line:
# -*- coding: utf-8 -*-
as the first line of the source file, and in Notepad++ I selected Encoding >> Convert to UTF-8.
Now the GUI appears in Hebrew, but when I select an option, the selected value apparently cannot be compared to a Hebrew string in the code, and nothing happens.
The PythonScriptShell shows this warning:
Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
Help me, please.
I just tested this in both bluetooth and on-phone consoles with PyS60 2.0, and non-ASCII unicode was handled w/out exceptions.
If you have that string in a file rather than typing it into the console, the error is caused by the missing encoding declaration in the file.
Add # -*- coding: utf-8 -*- as the first line there.
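The warning quoted in the update is what Python 2 prints when a non-ASCII byte string is compared with a unicode string. Here is a minimal sketch of that situation and one way around it, assuming the selected value arrives as UTF-8 bytes. The names selection, expected and to_text are made up for illustration, and bytes needs Python 2.6 or later, so an older PyS60 build would test against str instead.

```python
# -*- coding: utf-8 -*-
# Sketch: comparing UTF-8 bytes with unicode text, and decoding first as a fix.

expected = u"אבגדה"                    # unicode literal in the source file
selection = expected.encode("utf-8")   # stand-in for a value that arrives as bytes


def to_text(value, encoding="utf-8"):
    """Return value as unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes never compare equal to text here (Python 2 also prints the warning).
matches_raw = (selection == expected)

# Once both sides are unicode, the comparison behaves as intended.
matches_decoded = (to_text(selection) == expected)

print(matches_raw, matches_decoded)
```

The idea is to decode once, at the point where the value enters the program, and compare unicode with unicode everywhere else.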
Convert your words to Unicode characters using
unichr
e.g. unichr(1507) for the character ף
Refer to the decimal values in this table: http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0590
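A short sketch of building a whole word this way, assuming the decimal code points 1488 to 1492 for the first five Hebrew letters. The fallback to chr is only there so the same snippet also runs on Python 3, where unichr no longer exists.

```python
# -*- coding: utf-8 -*-
# Sketch: build a Hebrew string from decimal code points.

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns unicode text

# Decimal code points for alef, bet, gimel, dalet, he (U+05D0 .. U+05D4).
code_points = [1488, 1489, 1490, 1491, 1492]
word = u"".join(unichr(cp) for cp in code_points)

final_pe = unichr(1507)   # the character used as the example above (U+05E3)
```

This keeps the source file pure ASCII, at the cost of readability.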
Define a helper:
ru = lambda txt: str(txt).decode('utf-8','ignore')
and wrap each text literal in it:
ru("אבגדה")
Related
Can't get the titles right in matplotlib:
'technologieën in °C' gives: technologieÃn in ÃC
Possible solutions already tried:
u'technologieën in °C' doesn't work
neither does: # -*- coding: utf-8 -*- at the beginning of the code-file.
Any solutions?
You need to pass in unicode text:
u'technologieën in °C'
Do make sure you use the # -*- coding: utf-8 -*- comment at the top, and make sure your text editor is actually using that codec. If your editor saves the file as Latin-1 encoded text, use that codec in the header, etc. The comment communicates to Python how to interpret your source file, especially when it comes to parsing string literals.
Alternatively, use escape codes for anything non-ASCII in your Unicode literals:
u'technologie\u00ebn in \u00b0C'
and avoid the issue of what codec to use in the first place.
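To illustrate both points, here is a small sketch. It checks that the escaped literal is the same string as the typed one, and it reproduces the kind of garbling seen in the question by decoding UTF-8 bytes with the wrong codec (Latin-1 is an assumption about what happened there, not something the question confirms).

```python
# -*- coding: utf-8 -*-
# Sketch: escaped vs. typed unicode literals, and what a codec mismatch does.

escaped = u'technologie\u00ebn in \u00b0C'
typed = u'technologieën in °C'      # relies on the coding comment matching the file

same = (escaped == typed)

# Reading UTF-8 bytes as Latin-1 turns each non-ASCII character into two.
garbled = escaped.encode('utf-8').decode('latin-1')

print(same)
print(len(escaped), len(garbled))
```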
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
before you continue.
Most fonts will support the °, but if you see a box displayed instead, then you have a font issue and need to switch to a font that supports the characters you are trying to display. For example, if Arial supports your required characters, then use:
matplotlib.rc('font', family='Arial')
before plotting.
In Python 3, there is no need to worry about all those troublesome UTF-8 problems.
Note, though, that you will still need to set a font that supports your characters before plotting:
matplotlib.rc('font', family='Arial')
How can I print the check mark sign "✓" in Python?
It's the sign for approval, not a square root.
You can print any Unicode character using an escape sequence. Make sure to make a Unicode string.
print u'\u2713'
Since Python 2.1 you can use the \N{name} escape sequence to insert Unicode characters by their names. Using this feature you can get the check mark symbol like so:
$ python -c "print(u'\N{check mark}')"
✓
Note: For this feature to work you must use a unicode string literal, which is why the u prefix is used. In Python 3 the prefix is not mandatory, since string literals are unicode by default.
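If you are unsure of a character's official name, the standard unicodedata module can go both ways. A brief sketch:

```python
# -*- coding: utf-8 -*-
# Sketch: moving between a character and its Unicode name.
import unicodedata

check = u'\N{CHECK MARK}'                       # same character as u'\u2713'
name = unicodedata.name(check)                  # look up the name of a character
by_name = unicodedata.lookup('CHECK MARK')      # look up a character by its name

print(name)
```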
Solution defining python source file encoding:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
print '✓'
http://ideone.com/dTW5D8
Like this:
print u'\u2713'.encode('utf8')
The encoding should match the one of your terminal (or wherever you are sending output to).
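One way to avoid hard-coding 'utf8' is to ask the output stream what it expects. This is only a sketch: encode_for is a made-up helper, and the UTF-8 fallback is an assumption for streams that do not report an encoding (for example when output is piped).

```python
# -*- coding: utf-8 -*-
# Sketch: pick the encoding from the output stream instead of hard-coding it.
import sys


def encode_for(stream, text, fallback='utf-8'):
    """Encode text for stream, replacing characters its encoding cannot express."""
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding, 'replace')


check = u'\u2713'
encoded = encode_for(sys.stdout, check)   # bytes suitable for this terminal
```

With the 'replace' error handler, a terminal that cannot show the check mark gets a question mark instead of raising UnicodeEncodeError.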
I'm trying to start learning Python, but I became confused from the first step.
I'm getting started with Hello, World, but when I try to run the script, I get:
Syntax Error: Non-UTF-8 code starting with '\xe9' in file C:\Documents and Settings\Home\workspace\Yassine frist stared\src\firstModule.py on line 5 but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details.
Add this as the first line:
# -*- coding: utf-8 -*-
Put as the first line of your program this:
# coding: utf-8
See also Correct way to define Python source code encoding
First off, you should know what an encoding is. Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Now, the problem you are having is that most people write code in ASCII. Roughly speaking, that means that they use Latin letters, numerals and basic punctuation only in the code files themselves. You appear to have used a non-ASCII character code inside your program, which is confusing Python.
There are two ways to fix this. The first is to tell Python which encoding it should use to read the text file. You can do that by adding a # coding declaration at the top of the file. The second, and probably better, is to restrict yourself to ASCII code. Remember that you can always have whatever characters you like inside strings, by writing them as escape sequences, e.g. \x00 or whatever.
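To see what the error message is complaining about, here is a small sketch. The byte \xe9 from the message is not valid UTF-8 on its own, but it is é in Latin-1, which suggests (as an assumption about the asker's editor) that the file was saved in a Latin-1-style encoding. The last line shows the ASCII-only alternative.

```python
# Sketch: why a lone '\xe9' byte trips the parser, and the ASCII-only alternative.

raw = b'\xe9'                        # the byte named in the error message

try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False               # a lone 0xE9 is an incomplete UTF-8 sequence

as_latin1 = raw.decode('latin-1')    # the same byte read as Latin-1

# Keeping the source pure ASCII by escaping the character inside the string.
ascii_only = u'caf\u00e9'
```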
When you run Python through the interpreter, you must run it in this format: python filename.py (command line args) or you will also get this error. I made the comment because you mentioned you were a beginner.
I'm writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here's an example of a line of text I'm parsing:
temp1: +31.0°C (crit = +107.0°C)
And here's the regex I'm using to match that (in Python):
temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+'
r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')
This code works as expected and matches the example text I've given above. The only bits I'm really interested in are the numbers, so this bit:
(\+|-)(\d+\.\d+)\W\WC
which starts by matching the + or - sign and ends by matching the °C.
My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?
Possible portable solution:
Convert the input data to unicode, and use the re.UNICODE flag in regular expressions.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
data = u'temp1: +31.0°C (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+'
ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)
print temp_re.findall(data)
Output
[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
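As for why two \W were needed: in UTF-8 the degree sign is encoded as two bytes, and when matching against a byte string each of those bytes counts as a non-alphanumeric character. A small sketch of the difference, using the sample line from the question (the variable names are illustrative):

```python
# -*- coding: utf-8 -*-
# Sketch: the degree sign is one character of text but two bytes in UTF-8.
import re

line = u'temp1: +31.0\u00b0C (crit = +107.0\u00b0C)'
raw = line.encode('utf-8')                       # what a UTF-8 terminal tool emits

degree_bytes = u'\u00b0'.encode('utf-8')         # two bytes: 0xC2 0xB0

byte_match = re.search(br'\d\W\WC', raw)         # bytes: needs two \W
text_match_two = re.search(r'\d\W\WC', line)     # text: two \W is one too many
text_match_one = re.search(r'\d\WC', line)       # text: a single \W is enough

print(len(degree_bytes), byte_match is not None,
      text_match_two is None, text_match_one is not None)
```

So the original pattern depends on the input being UTF-8 bytes. Decoding first and matching the literal ° makes the pattern independent of the byte representation.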
EDIT
@netvope already pointed this out in the comments on the question.
Update
Notes from J.F. Sebastian's comments about input encoding:
check_output() returns binary data that sometimes can be text (that should have a known character encoding in this case and you can convert it to Unicode). Anyway ord(u'°') == 176, so it cannot be encoded using the ASCII encoding.
So, to decode the input data to unicode, basically* you should use the encoding from the system locale, via locale.getpreferredencoding(), e.g.:
data = subprocess.check_output(...).decode(locale.getpreferredencoding())
With the data decoded correctly, you'll get the same output without re.UNICODE in this case.
Why basically? Because on Russian Win7 with cp1251 as the preferred encoding, if we have, for example, a script.py that encodes its output as UTF-8:
#!/usr/bin/env python
# -*- coding: utf8 -*-
print u'temp1: +31.0°C (crit = +107.0°C)'.encode('utf-8')
and we need to parse its output:
subprocess.check_output(['python',
'script.py']).decode(locale.getpreferredencoding())
will produce wrong results: 'В°' instead of '°'.
So you need to know encoding of input data, in some cases.
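The mismatch can be reproduced without a Russian Windows machine. In this sketch, decode_output is a hypothetical helper that falls back to the locale encoding only when the caller does not know the real one.

```python
# -*- coding: utf-8 -*-
# Sketch: decoding the same UTF-8 bytes with the right and the wrong codec.
import locale


def decode_output(raw, encoding=None):
    """Decode raw bytes, using the locale's preferred encoding as a fallback."""
    return raw.decode(encoding or locale.getpreferredencoding())


utf8_bytes = u'\u00b0'.encode('utf-8')       # what script.py writes for the degree sign

wrong = decode_output(utf8_bytes, 'cp1251')  # two characters: Cyrillic Ve plus degree sign
right = decode_output(utf8_bytes, 'utf-8')   # the single degree sign

print(len(wrong), len(right))
```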
Are there short Unicode u"\N{...}" names for Latin-1 characters in Python?
\N{A umlaut} etc. would be nice,
\N{LATIN SMALL LETTER A WITH DIAERESIS} etc. is just too long to type every time.
(Added:) I use an English keyboard, but occasionally need German letters, as in "Löwenbräu Weißbier".
Yes one can cut-paste them singly, L cutpaste ö wenbr cutpaste ä ...
but that breaks the flow; I was hoping for a keyboard-only way.
Sorry, no, there's no such thing. In string literals, anyway... you could perhaps piggyback on another encoding scheme, such as HTML:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape(u'a &auml; b')
u'a \xe4 b'
But I don't think this'd be worth it.
Hardly anyone even uses the \N notation in any case... for the occasional character the \xnn notation is acceptable; for more involved usage you're better off just typing ä directly and making sure a # coding= is defined in the script as per PEP 263. (If you don't have a keyboard layout that can type those diacriticals directly, get one, e.g. eurokb on Windows, or use the Compose key on Linux.)
If you want to do the right thing, please use UTF-8 in your Python source code. This will keep the code much more readable.
Python is able to read UTF-8 source files; all you have to do is add an encoding line after the first one:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
By the way, starting with Python 3.0, UTF-8 is the default source encoding, so you will not need this line anymore. See PEP 3120.
You can put an actual "ä" character in your string. For this you have to declare the encoding of the source code at the top:
#!/usr/bin/env python
# encoding: utf-8
x = u"ä"
Have you thought about writing your own converter? It wouldn't be hard to write something that would go through a file and replace \N{A umlaut} with \N{LATIN SMALL LETTER A WITH DIAERESIS} and all the rest.
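A rough sketch of such a converter, working on source text. The short names in SHORT_NAMES are invented for this example (Python itself knows nothing about them), and expand_short_names is a hypothetical helper. Names it does not recognise are left untouched.

```python
# Sketch: expand made-up short \N{...} names into the official Unicode names.
import re

# Hypothetical abbreviations; extend as needed.
SHORT_NAMES = {
    'a umlaut': 'LATIN SMALL LETTER A WITH DIAERESIS',
    'o umlaut': 'LATIN SMALL LETTER O WITH DIAERESIS',
    'u umlaut': 'LATIN SMALL LETTER U WITH DIAERESIS',
    'A umlaut': 'LATIN CAPITAL LETTER A WITH DIAERESIS',
    'sharp s': 'LATIN SMALL LETTER SHARP S',
}

# Matches a literal backslash, N, and a name in braces.
NAME_ESCAPE = re.compile(r'\\N\{([^}]+)\}')


def expand_short_names(source_text):
    """Rewrite \\N{short name} escapes in source_text to their full names."""
    def replace(match):
        full_name = SHORT_NAMES.get(match.group(1))
        if full_name is None:
            return match.group(0)            # unknown or already official: keep
        return '\\N{' + full_name + '}'
    return NAME_ESCAPE.sub(replace, source_text)


before = r'phrase = u"L\N{o umlaut}wenbr\N{a umlaut}u"'
after = expand_short_names(before)
print(after)
```

Run over a file before it is handed to Python, this would let you type the short forms and still end up with valid \N escapes.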
You can use the Unicode notation \uXXXX to describe that character:
u"\u00E4"
On Windows, you can use the charmap.exe utility to look up the keyboard shortcut for common letters you're using such as:
ALT-0223 = ß
ALT-0228 = ä
ALT-0246 = ö
Then use Unicode and save in UTF-8:
# -*- coding: UTF-8 -*-
phrase = u'Löwenbräu Weißbier'
or use a converter as someone else mentioned and make up your own shortcuts:
# -*- coding: UTF-8 -*-
def german(s):
    s = s.replace(u'SS',u'ß')
    s = s.replace(u'a:',u'ä')
    s = s.replace(u'o:',u'ö')
    return s
phrase = german(u'Lo:wenbra:u WeiSSbier')
print phrase