Python, replace long dash with short dash? - python

I want to replace a long dash (–) with a short dash (-). My code:
if " – " in string:
string = string.replace(" – ", " - ")
results in the following error:
SyntaxError: Non-ASCII character '\xe2' in file ./script.py on line 76, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How can I fix this?

Long dash is not an ASCII character. Declare encoding of your script, like this (somewhere on top):
#-*- coding: utf-8 -*-
There are also other encodings beside utf-8 but it is always safe to use utf-8 if not working with ASCII characters which covers virtually all (unicode) characters.
See PEP 0263 for more info.

I would like to link another answer: https://stackoverflow.com/a/42856932/3751268. However that only worked for Python 2.
Here is a solution for python 3:
my_str = '—asasas—'
my_str.replace(b'\xe2\x80\x94'.decode('utf-8'), '--')

Related

Encoding/converting a string with spanish or polish characters

1) How do I convert a variable with a string like "wdzi\xc4\x99czno\xc5\x9bci" into "wdzięczności"?
2) Also how do I convert string variable with characters like "±", "ę", "Ć" into correct letters?
I emphasise "variable" because all I've got from googling was examples with " u'some string' " and the like and I can't get anything like that to work.
I use "# -*- coding: utf-8 -*-" in second line of my script and I still crash into these problems.
Also I was said that simple print should output correctly - but it does not.
In Python 2.7 IDLE, I get this output:
>>> print "wdzi\xc4\x99czno\xc5\x9bci".decode('utf-8')
wdzięczności
Your first string appears to be a UTF-8 byte string, so all that's necessary is to decode it into a Unicode string. When Python prints that string, it will encode it back to the proper encoding based on your environment.
If you're using Python 3 then you have a string that has been decoded improperly and will need a little more work to fix the damage.
>>> print("wdzi\xc4\x99czno\xc5\x9bci".encode('iso-8859-1').decode('utf-8'))
wdzięczności

How to portably parse the (Unicode) degree symbol with regular expressions?

I'm writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here's an example of a line of text I'm parsing:
temp1: +31.0°C (crit = +107.0°C)
And here's the regex I'm using to match that (in Python):
temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+'
r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')
This code works as expected and matches the example text I've given above. The only bits I'm really interested in are the numbers, so this bit:
(\+|-)(\d+\.\d+)\W\WC
which starts by matching the + or - sign and ends by matching the °C.
My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?
Possible portable solution:
Convert input data to unicode, and use re.UNICODE flag in regular expressions.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
data = u'temp1: +31.0°C (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+'
ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)
print temp_re.findall(data)
Output
[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
EDIT
#netvope allready pointed this out in comments for question.
Update
Notes from J.F. Sebastian comments about input encoding:
check_output() returns binary data that sometimes can be text (that should have a known character encoding in this case and you can convert it to Unicode). Anyway ord(u'°') == 176 so it can not be encoded using ASCII encoding.
So, to decode input data to unicode, basically* you should use encoding from system locale using locale.getpreferredencoding() e.g.:
data = subprocess.check_output(...).decode(locale.getpreferredencoding())
With data encoded correctly:
you'll get the same output without re.UNICODE in this case.
Why basically? Because on Russian Win7 with cp1251 as preferredencoding if we have for example script.py which decodes it's output to utf-8:
#!/usr/bin/env python
# -*- coding: utf8 -*-
print u'temp1: +31.0°C (crit = +107.0°C)'.encode('utf-8')
And wee need to parse it's output:
subprocess.check_output(['python',
'script.py']).decode(locale.getpreferredencoding())
will produce wrong results: 'В°' instead °.
So you need to know encoding of input data, in some cases.

strings in hebrew in python for s60

I'm using python for S60.
I want to use string in hebrew, to represent them on the GUI and to send them in SMS message.
It seems that the PythonScriptShell don't accept such expressions, for example:
u"אבגדה"
what can I do?
thanks
development of situation:
I added the line:
# -*- coding: utf-8 -*-
as the first line in the source file and in notepad++ I selected: Encoding>>Convert to utf8.
now, the GUI appears in Hebrew but when I selected an option the selection value cannot be compared to a string in Hebrew in the code (probably) and there is no response.
On PythonScriptShell appears the warning:
Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
Help me, please.
I just tested this in both bluetooth and on-phone consoles with PyS60 2.0, and non-ASCII unicode was handled w/out exceptions.
If you have that string in the file rather than passing it in the console, error is caused by lack of encoding specification in the file.
Add # -*- coding: utf-8 -*- as first line there.
convert your words to unicode characters using
unichr
eg unichr(1507) for char ף
refer to the decimal values in this table: http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0590
Add up
ru = lambda txt: str(txt).decode('utf-8','ignore')
And add the function before each text use
ru("אבגדה")

Python's string.maketrans works at home but fails on Google App Engine

I have this code in Google AppEngine (Python SDK):
from string import maketrans
intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ".encode('latin1')
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn".encode('latin1')
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)
When I run the code in the interactive console I have no problem, but when I try it in GAE I get the following error:
raise ValueError, "maketrans arguments must have same length"
ValueError: maketrans arguments must have same length
INFO 2009-12-03 20:04:02,904 dev_appserver.py:3038] "POST /backendsavenew HTTP/1.1" 500 -
INFO 2009-12-03 20:08:37,649 admin.py:112] 106
INFO 2009-12-03 20:08:37,651 admin.py:113] 53
ERROR 2009-12-03 20:08:37,653 init.py:388] maketrans arguments must have same length
I can't figure out why the intab it's doubled in size.
The python file with the code is saved as UTF-8.
Thanks in advance for any help.
string.maketrans and string.translate do not work for Unicode strings. Your call to string.maketrans will implictly convert the Unicode you gave it to an encoding like utf-8. In utf-8 å takes up more space than ASCII a. string.maketrans sees len(str(argument)) which is different for your two strings.
There is a Unicode translate, but for your use case (convert Unicode to ASCII because some part of your system cannot deal with Unicode) you should use http://pypi.python.org/pypi/Unidecode. Unidecode is very smart about transliterating Unicode characters to sensible ASCII, covering many more characters than in your example.
You should save your Python code as utf-8, but make sure you add the magic so Python doesn't have to assume you used the system's default encoding. This line should be the first or second line of your Python files:
# -*- coding: utf-8 -*-
There are many advantages to processing text as Unicode instead of binary strings. This is the Unicode way to do what you are trying to do:
intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
translated = intab.translate(trantab)
translated == outtab # True
See also Where is Python's "best ASCII for this Unicode" database?
See also How do I get str.translate to work with Unicode strings?
Maybe you could use iso-8859-1 encoding for your file instead of utf-8
# -*- coding: iso-8859-1 -*-
from string import maketrans
import logging
intab = "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = "aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)
Remember to select iso-8859-1 in your text editor while saving this python source file.

short Unicode \N{} names for Latin-1 characters in Python?

Are there short Unicode u"\N{...}" names for Latin1 characters in Python ?
\N{A umlaut} etc. would be nice,
\N{LATIN SMALL LETTER A WITH DIAERESIS} etc. is just too long to type every time.
(Added:) I use an English keyboard, but occasionally need German letters, as in "Löwenbräu Weißbier".
Yes one can cut-paste them singly, L cutpaste ö wenbr cutpaste ä ...
but that breaks the flow; I was hoping for a keyboard-only way.
Sorry, no, there's no such thing. In string literals, anyway... you could perhaps piggyback on another encoding scheme, such as HTML:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape(u'a ä b c')
u'a \xe4 b'
But I don't think this'd be worth it.
Hardly anyone even uses the \N notation in any case... for the occasional character the \xnn notation is acceptable; for more involved usage you're better off just typing ä directly and making sure a # coding= is defined in the script as per PEP263. (If you don't have a keyboard layout that can type those diacriticals directly, get one. eg. eurokb on Windows, or using the Compose key on Linux.)
If you want to do the right thing please use UTF-8 in your python source code. This will keep the code much more readable.
Python is able to real UTF-8 source files, all you have to do is to add an additional line after the first one:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
By the way, starting with Python 3.0 UTF-8 is the default encoding so you will not need this line anymore. See PEP3120
You can put an actual "ä" character in your string. For this you have to declare the encoding of the source code at the top
#!/usr/bin/env python
# encoding: utf-8
x = u"ä"
Have you thought about writing your own converter? It wouldn't be hard to write something that would go through a file and replace \N{A umlaut} with \N{LATIN SMALL LETTER A WITH DIAERESIS} and all the rest.
You can use the Unicode notation \uXXXX do describe that character:
u"\u00E4"
On Windows, you can use the charmap.exe utility to look up the keyboard shortcut for common letters you're using such as:
ALT-0223 = ß
ALT-0228 = ä
ALT-0246 = ö
Then use Unicode and save in UTF-8:
# -*- coding: UTF-8 -*-
phrase = u'Löwenbräu Weißbier'
or use a converter as someone else mentioned and make up your own shortcuts:
# -*- coding: UTF-8 -*-
def german(s):
s = s.replace(u'SS',u'ß')
s = s.replace(u'a:',u'ä')
s = s.replace(u'o:',u'ö')
return s
phrase = german(u'Lo:wenbra:u WeiSSbier')
print phrase

Categories

Resources