How to portably parse the (Unicode) degree symbol with regular expressions? - python

I'm writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here's an example of a line of text I'm parsing:
temp1: +31.0°C (crit = +107.0°C)
And here's the regex I'm using to match that (in Python):
temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+'
r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')
This code works as expected and matches the example text I've given above. The only bits I'm really interested in are the numbers, so this bit:
(\+|-)(\d+\.\d+)\W\WC
which starts by matching the + or - sign and ends by matching the °C.
My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?

Possible portable solution:
Convert input data to unicode, and use re.UNICODE flag in regular expressions.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
data = u'temp1: +31.0°C (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+'
ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)
print temp_re.findall(data)
Output
[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
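On the "why two \W" part of the question: a likely explanation, assuming the sensors output reached the regex as a UTF-8 byte string (a Python 2 str), is that ° is encoded as two bytes, and each byte matches \W on its own. A minimal Python 3 sketch of that effect (the variable names are illustrative only):

```python
import re

degree = "\u00b0"                      # the degree sign as text (one code point)
degree_bytes = degree.encode("utf-8")  # the same character as UTF-8 bytes

line_text = "temp1: +31.0\u00b0C"
line_bytes = line_text.encode("utf-8")

# On text, a single \W matches the degree sign.
text_match = re.search(r"\d\WC", line_text)

# On UTF-8 bytes, the degree sign is two non-word bytes, so one \W is not enough.
bytes_one = re.search(rb"\d\WC", line_bytes)
bytes_two = re.search(rb"\d\W\WC", line_bytes)

print(len(degree_bytes), text_match is not None, bytes_one is None, bytes_two is not None)
```

This is also why the byte-oriented pattern is not portable: with a different input encoding the degree sign may occupy a different number of bytes.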
EDIT
@netvope already pointed this out in the comments on the question.
Update
Notes from J.F. Sebastian comments about input encoding:
check_output() returns binary data that sometimes can be text (that should have a known character encoding in this case and you can convert it to Unicode). Anyway ord(u'°') == 176 so it can not be encoded using ASCII encoding.
So, to decode the input data to Unicode, you should basically* use the encoding from the system locale, obtained with locale.getpreferredencoding(), e.g.:
data = subprocess.check_output(...).decode(locale.getpreferredencoding())
With the data decoded correctly, you'll get the same output even without re.UNICODE in this case.
Why basically? Because on a Russian Win7 with cp1251 as the preferred encoding, suppose we have, for example, a script.py which encodes its output as UTF-8:
#!/usr/bin/env python
# -*- coding: utf8 -*-
print u'temp1: +31.0°C (crit = +107.0°C)'.encode('utf-8')
And we need to parse its output:
subprocess.check_output(['python', 'script.py']).decode(locale.getpreferredencoding())
This will produce wrong results: 'В°' instead of °.
So in some cases you need to know the encoding of the input data.
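The mismatch described above can be reproduced without a subprocess. The following Python 3 sketch simply decodes the same UTF-8 bytes twice, once with cp1251 and once with UTF-8; the variable names are made up for illustration:

```python
# Bytes as a UTF-8 producer (such as script.py above) would emit them.
raw = "temp1: +31.0\u00b0C".encode("utf-8")

wrong = raw.decode("cp1251")  # decoding with a mismatched locale encoding
right = raw.decode("utf-8")   # decoding with the encoding the producer used

print(wrong)
print(right)
```

Under cp1251 the two bytes of the degree sign are read as two separate characters, which is where the extra Cyrillic letter comes from.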

Related

Python, replace long dash with short dash?

I want to replace a long dash (–) with a short dash (-). My code:
if " – " in string:
    string = string.replace(" – ", " - ")
results in the following error:
SyntaxError: Non-ASCII character '\xe2' in file ./script.py on line 76, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How can I fix this?
The long dash is not an ASCII character. Declare the encoding of your script on its first or second line, like this:
#-*- coding: utf-8 -*-
There are other encodings besides UTF-8, but UTF-8 is a safe choice whenever you work with non-ASCII characters, because it can encode every Unicode character.
See PEP 0263 for more info.
I would like to link another answer: https://stackoverflow.com/a/42856932/3751268. However that only worked for Python 2.
Here is a solution for Python 3:
my_str = '—asasas—'
my_str = my_str.replace(b'\xe2\x80\x94'.decode('utf-8'), '--')
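In Python 3 the dashes can also be written with \u escapes, which keeps the source file pure ASCII. A minimal sketch, assuming both the en dash (U+2013) and the em dash (U+2014) should become a plain hyphen; normalize_dashes is a hypothetical helper name:

```python
EN_DASH = "\u2013"  # the "long dash" from the question
EM_DASH = "\u2014"  # the even longer dash

def normalize_dashes(text):
    """Replace en and em dashes with an ASCII hyphen-minus."""
    return text.replace(EN_DASH, "-").replace(EM_DASH, "-")

print(normalize_dashes("pages 10 \u2013 20"))
```

str.replace returns a new string, so the result has to be assigned or returned; the original string is left unchanged.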

strings in hebrew in python for s60

I'm using Python for S60.
I want to use strings in Hebrew, to show them in the GUI and to send them in SMS messages.
It seems that the PythonScriptShell doesn't accept such expressions, for example:
u"אבגדה"
What can I do?
Thanks.
Update:
I added the line:
# -*- coding: utf-8 -*-
as the first line of the source file, and in Notepad++ I selected Encoding > Convert to UTF-8.
Now the GUI appears in Hebrew, but when I select an option, the selected value apparently cannot be compared to a Hebrew string in the code, and nothing happens.
The following warning appears in PythonScriptShell:
Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
Please help.
I just tested this in both bluetooth and on-phone consoles with PyS60 2.0, and non-ASCII unicode was handled w/out exceptions.
If you have that string in a file rather than typing it in the console, the error is caused by the missing encoding declaration in the file.
Add # -*- coding: utf-8 -*- as the first line there.
Convert your words to Unicode characters using unichr,
e.g. unichr(1507) for the character ף.
refer to the decimal values in this table: http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0590
Define a helper:
ru = lambda txt: str(txt).decode('utf-8','ignore')
And wrap each text literal in the function:
ru("אבגדה")
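The "Unicode equal comparison failed" warning from the question typically appears in Python 2 when a byte string containing non-ASCII bytes is compared with a unicode string. One way around it is to decode the bytes before comparing. The sketch below assumes the selected value arrives as UTF-8 bytes, which has not been verified on PyS60; same_text and selection_bytes are hypothetical names:

```python
expected = u"\u05d0\u05d1\u05d2\u05d3\u05d4"  # the Hebrew text from the question, written with escapes
selection_bytes = expected.encode("utf-8")    # hypothetical: what a widget might hand back

def same_text(value, text, encoding="utf-8"):
    """Compare a value that may be bytes with a unicode text."""
    if isinstance(value, bytes):
        value = value.decode(encoding)  # turn bytes into text first
    return value == text

print(same_text(selection_bytes, expected))
```

The point is to compare text with text; which encoding to pass depends on where the bytes come from.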

Remove non-ASCII characters from a string using python / django

I have a string of HTML stored in a database. Unfortunately it contains characters such as ®
I want to replace these characters by their HTML equivalent, either in the DB itself or using a Find Replace in my Python / Django code.
Any suggestions on how I can do this?
You can use the fact that the ASCII characters are the first 128 code points: get the number of each character with ord and strip the character if it is out of range.
# -*- coding: utf-8 -*-
def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)
test = u'éáé123456tgreáé#€'
print test
print strip_non_ascii(test)
Result
éáé123456tgreáé#€
123456tgre#
Please note that # is included because, after all, it's an ASCII character. If you want to keep only a particular subset (such as digits and uppercase and lowercase letters), you can narrow the range by looking at an ASCII table.
EDITED: After reading your question again, maybe you need to escape your HTML code, so that all those characters appear correctly once rendered. You can use the escape filter in your templates.
There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481
To remove non-ASCII characters from a string, s, use:
s = s.encode('ascii',errors='ignore')
Then convert it from bytes back to a string using:
s = s.decode()
This was all done using Python 3.6.
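The two steps above can be wrapped into one small Python 3 function. This is only a sketch; the function name is borrowed from the earlier answer for comparison:

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # encode() silently skips non-ASCII characters; decode() turns the bytes back into str.
    return text.encode("ascii", errors="ignore").decode("ascii")

print(strip_non_ascii("éáé123456tgreáé#€"))
```

Note that this removes the characters instead of replacing them with HTML entities, so it only fits if losing them is acceptable.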
I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.
def unicode_escape(unistr):
    """
    Tidies up unicode characters into HTML-friendly numeric entities.
    Takes a unicode string as an argument.
    Returns a unicode string.
    """
    import htmlentitydefs
    escaped = ""
    for char in unistr:
        if ord(char) in htmlentitydefs.codepoint2name:
            # Emit a numeric character reference, terminated by ';'.
            escaped += "&#" + str(ord(char)) + ";"
        else:
            escaped += char
    return escaped
Use it like this:
>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as &#174; I want'
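On Python 3 the htmlentitydefs module lives at html.entities. A rough equivalent that emits named entities rather than numeric ones might look like this; escape_to_entities is a hypothetical name, and the choice of named entities is an assumption of this sketch:

```python
from html.entities import codepoint2name

def escape_to_entities(text):
    """Replace characters that have a named HTML entity with that entity."""
    parts = []
    for char in text:
        name = codepoint2name.get(ord(char))
        # Characters without a named entity are passed through unchanged.
        parts.append("&%s;" % name if name else char)
    return "".join(parts)

print(escape_to_entities("such as \u00ae I want"))
```

Because &, < and > also have named entities, this escapes existing markup as well, so it should be applied to text content and not to a whole HTML document.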
This code snippet may help you.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
def removeNonAscii(string):
    nonascii = bytearray(range(0x80, 0x100))
    return string.translate(None, nonascii)
nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)
The encoding definition is very important here which is done in the second line.
To get rid of the special xml, html characters '<', '>', '&' you can use cgi.escape:
import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)
Will return:
'1 &lt; 4 &amp; 4 &gt; 1'
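cgi.escape is a Python 2 era function and has since been removed from the standard library; on Python 3 the equivalent is html.escape. A minimal sketch:

```python
import html

test = "1 < 4 & 4 > 1"
escaped = html.escape(test)  # also escapes quotes by default (quote=True)

print(escaped)
```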
This is probably the bare minimum you need to avoid problems.
For more, you have to know the encoding of your string.
If it matches the encoding of your HTML document, you don't have to do anything more.
If not, you have to convert it to the correct encoding:
test = test.decode("cp1252").encode("utf8")
This assumes that your string was cp1252 and that your HTML document is UTF-8.
You shouldn't have to do anything, as Django will automatically escape characters.
See: http://docs.djangoproject.com/en/dev/topics/templates/#id2

Python's string.maketrans works at home but fails on Google App Engine

I have this code in Google AppEngine (Python SDK):
from string import maketrans
intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ".encode('latin1')
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn".encode('latin1')
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)
When I run the code in the interactive console I have no problem, but when I try it in GAE I get the following error:
raise ValueError, "maketrans arguments must have same length"
ValueError: maketrans arguments must have same length
INFO 2009-12-03 20:04:02,904 dev_appserver.py:3038] "POST /backendsavenew HTTP/1.1" 500 -
INFO 2009-12-03 20:08:37,649 admin.py:112] 106
INFO 2009-12-03 20:08:37,651 admin.py:113] 53
ERROR 2009-12-03 20:08:37,653 init.py:388] maketrans arguments must have same length
I can't figure out why intab is doubled in size.
The python file with the code is saved as UTF-8.
Thanks in advance for any help.
string.maketrans and string.translate do not work for Unicode strings. Your call to string.maketrans will implicitly convert the Unicode you gave it to an encoding like utf-8. In utf-8, å takes up more space than ASCII a. string.maketrans sees len(str(argument)), which is different for your two strings.
There is a Unicode translate, but for your use case (convert Unicode to ASCII because some part of your system cannot deal with Unicode) you should use http://pypi.python.org/pypi/Unidecode. Unidecode is very smart about transliterating Unicode characters to sensible ASCII, covering many more characters than in your example.
You should save your Python code as utf-8, but make sure you add the magic so Python doesn't have to assume you used the system's default encoding. This line should be the first or second line of your Python files:
# -*- coding: utf-8 -*-
There are many advantages to processing text as Unicode instead of binary strings. This is the Unicode way to do what you are trying to do:
intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
translated = intab.translate(trantab)
translated == outtab # True
See also Where is Python's "best ASCII for this Unicode" database?
See also How do I get str.translate to work with Unicode strings?
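On Python 3, where every str is Unicode, the same idea can be expressed with the built-in str.maketrans. The sketch below uses a shortened, made-up mapping instead of the full table from the question; to_ascii is a hypothetical helper name:

```python
# A shortened, hypothetical mapping; both strings must stay the same length.
intab = "ÀÁÂàáâÈÉèéÇçÑñ"
outtab = "AAAaaaEEeeCcNn"

# str.maketrans builds a dict of code point -> replacement, like the dict() above.
trantab = str.maketrans(intab, outtab)

def to_ascii(text):
    """Translate the mapped accented letters; everything else is left as-is."""
    return text.translate(trantab)

print(to_ascii("Çà et là, Ñ"))
```

Characters missing from the table pass through untouched, which is the main limitation compared with a library such as Unidecode.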
Maybe you could use iso-8859-1 encoding for your file instead of utf-8
# -*- coding: iso-8859-1 -*-
from string import maketrans
import logging
intab = "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = "aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)
Remember to select iso-8859-1 in your text editor while saving this python source file.

short Unicode \N{} names for Latin-1 characters in Python?

Are there short Unicode u"\N{...}" names for Latin1 characters in Python ?
\N{A umlaut} etc. would be nice,
\N{LATIN SMALL LETTER A WITH DIAERESIS} etc. is just too long to type every time.
(Added:) I use an English keyboard, but occasionally need German letters, as in "Löwenbräu Weißbier".
Yes one can cut-paste them singly, L cutpaste ö wenbr cutpaste ä ...
but that breaks the flow; I was hoping for a keyboard-only way.
Sorry, no, there's no such thing. In string literals, anyway... you could perhaps piggyback on another encoding scheme, such as HTML:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape(u'a &auml; b c')
u'a \xe4 b c'
But I don't think this'd be worth it.
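For reference, on Python 3 the same trick is available as html.unescape, so the HTML entity names can serve as short, keyboard-friendly aliases. A small sketch using the phrase from the question:

```python
import html

# HTML entity names act as the short aliases here.
phrase = html.unescape("L&ouml;wenbr&auml;u Wei&szlig;bier")

print(phrase)
```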
Hardly anyone even uses the \N notation in any case... for the occasional character the \xnn notation is acceptable; for more involved usage you're better off just typing ä directly and making sure a # coding= is defined in the script as per PEP 263. (If you don't have a keyboard layout that can type those diacritics directly, get one, e.g. eurokb on Windows, or use the Compose key on Linux.)
If you want to do the right thing please use UTF-8 in your python source code. This will keep the code much more readable.
Python is able to read UTF-8 source files; all you have to do is add an additional line after the first one:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
By the way, starting with Python 3.0, UTF-8 is the default source encoding, so you will not need this line anymore. See PEP 3120.
You can put an actual "ä" character in your string. For this you have to declare the encoding of the source code at the top
#!/usr/bin/env python
# encoding: utf-8
x = u"ä"
Have you thought about writing your own converter? It wouldn't be hard to write something that would go through a file and replace \N{A umlaut} with \N{LATIN SMALL LETTER A WITH DIAERESIS} and all the rest.
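Such a converter could look roughly like the following sketch. The {alias} placeholder syntax, the SHORT_NAMES table and the expand_short_names function are all made up for illustration; only unicodedata.lookup and re.sub are standard library calls:

```python
import re
import unicodedata

# Hypothetical short aliases mapped to the official Unicode character names.
SHORT_NAMES = {
    "a umlaut": "LATIN SMALL LETTER A WITH DIAERESIS",
    "o umlaut": "LATIN SMALL LETTER O WITH DIAERESIS",
    "u umlaut": "LATIN SMALL LETTER U WITH DIAERESIS",
    "sharp s": "LATIN SMALL LETTER SHARP S",
}

def expand_short_names(text):
    """Replace {alias} placeholders with the character the alias stands for."""
    def substitute(match):
        full_name = SHORT_NAMES.get(match.group(1))
        if full_name is None:
            return match.group(0)  # leave unknown placeholders untouched
        return unicodedata.lookup(full_name)
    return re.sub(r"\{([^{}]+)\}", substitute, text)

phrase = expand_short_names("L{o umlaut}wenbr{a umlaut}u Wei{sharp s}bier")
print(phrase)
```

This expands the aliases at run time; a source-to-source converter that rewrites \N{...} escapes in a file would follow the same lookup idea.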
You can use the Unicode notation \uXXXX to describe that character:
u"\u00E4"
On Windows, you can use the charmap.exe utility to look up the keyboard shortcut for common letters you're using such as:
ALT-0223 = ß
ALT-0228 = ä
ALT-0246 = ö
Then use Unicode and save in UTF-8:
# -*- coding: UTF-8 -*-
phrase = u'Löwenbräu Weißbier'
or use a converter as someone else mentioned and make up your own shortcuts:
# -*- coding: UTF-8 -*-
def german(s):
    s = s.replace(u'SS',u'ß')
    s = s.replace(u'a:',u'ä')
    s = s.replace(u'o:',u'ö')
    return s
phrase = german(u'Lo:wenbra:u WeiSSbier')
print phrase
