How to search and replace utf-8 special characters in Python?

How to search and replace utf-8 special characters in Python? - python

I'm a Python beginner, and I have a utf-8 problem.
I have a utf-8 string and I would like to replace all german umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue').
u-umlaut has unicode code point 252, so I tried this:
>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'
I expected the last string to be u'ueber'.
What I ultimately want to do is replace all u-umlauts in a file with 'ue':
import sys
import codecs
f = codecs.open(sys.argv[1],encoding='utf-8')
for line in f:
print repr(line).replace(unichr(252), 'ue')
Thanks for your help! (I'm using Python 2.3.)

I would define a dictionary of special characters (that I want to map) then I use translate method.
line = 'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'
special_char_map = {ord('ä'):'ae', ord('ü'):'ue', ord('ö'):'oe', ord('ß'):'ss'}
print(line.translate(special_char_map))
you will get the following result:
Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.

I think it's easiest and clearer to do it on a more straightforward way, using directly the unicode representation os 'ü' better than unichr(252).
>>> s = u'über'
>>> s.replace(u'ü', 'ue')
u'ueber'
There's no need to use repr, as this will print the 'Python representation' of the string, you just need to present the readable string.
You will need also to include the following line at the beggining of the .py file, in case it's not already present, to tell the encoding of the file
#-*- coding: UTF-8 -*-
Added: Of course, the coding declared must be the same as the encoding of the file. Please check that as can be some problems (I had problems with Eclipse on Windows, for example, as it writes by default the files as cp1252. Also it should be the same encoding of the system, which could be utf-8, or latin-1 or others.
Also, don't use str as the definition of a variable, as it is part of the Python library. You could have problems later.
(I am trying on Python 2.6, I think in Python 2.3 the result is the same)

repr(str) returns a quoted version of str, that when printed out, will be something you could type back in as Python to get the string back. So, it's a string that literally contains \xfcber, instead of a string that contains über.
You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.
If you need to get a quoted version of the result of that, though I don't believe you should need it, you can wrap the entire expression in repr:
repr(str.replace(unichr(252), 'ue'))

You can avoid all that sourcefile encoding stuff and its problems. Use the Unicode names, then its screamingly obvious what you are doing and the code can be read and modified anywhere.
I don't know of any language where the only accented Latin letter is lower-case-u-with-umlaut-aka-diaeresis, so I've added code to loop over a table of translations under the assumption that you'll need it.
# coding: ascii
translations = (
(u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
(u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
# et cetera
)
test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'
out = test
for from_str, to_str in translations:
out = out.replace(from_str, to_str)
print out
output:
Moeller von Muenchen

Related

Using unicode / umlauts in Python: Dictionary v manual input

I am using a dictionary to store some character pairs in Python (I am replacing umlaut characters). Here is what it looks like:
umlautdict={
'ae': 'ä',
'ue': 'ü',
'oe': 'ö'
}
Then I run my inputwords through it like so:
for item in umlautdict.keys():
outputword=inputword.replace(item,umlautdict[item])
But this does not do anything (no replacement happens). When I printed out my umlautdict, I saw that it looks like this:
{'ue': '\xfc', 'oe': '\xf6', 'ae': '\xc3\xa4'}
Of course that is not what I want; however, trying things like unicode() (--> Error) or pre-fixing u did not improve things.
If I type the 'ä' or 'ö' into the replace() command by hand, everything works just fine. I also changed the settings in my script (working in TextWrangler) to # -*- coding: utf-8 -*- as it would net even let me execute the script containing umlauts without it.
So I don't get...
Why does this happen? Why and when do the umlauts change from "good
to evil" when I store them in the dictionary?
How do I fix it?
Also, if anyone knows: what is a good resource to learn about
encoding in Python? I have issues all the time and so many things
don't make sense to me / I can't wrap my head around.
I'm working on a Mac in Python 2.7.10. Thanks for your help!

Converting to Unicode is done by decoding your string (assuming you're getting bytes):
data = "haer ueber loess"
word = data.decode('utf-8') # actual encoding depends on your data
Define your dict with unicode strings as well:
umlautdict={
u'ae': u'ä',
u'ue': u'ü',
u'oe': u'ö'
}
and finally print umlautdict will print out some representation of that dict, usually involving escapes. That's normal, you don't have to worry about that.

Declare your coding.
Use raw format for the special characters.
Iterate properly on your string: keep the changes from each loop iteration as you head to the next.
Here's code to get the job done:
\# -*- coding: utf-8 -*-
umlautdict = {
'ae': r'ä',
'ue': r'ü',
'oe': r'ö'
}
print umlautdict
inputword = "haer ueber loess"
for item in umlautdict.keys():
inputword = inputword.replace(item, umlautdict[item])
print inputword
Output:
{'ue': '\xc3\xbc', 'oe': '\xc3\xb6', 'ae': '\xc3\xa4'}
här über löss

Python UTF-8 REGEX

I have a problem while trying to find text specified in regex.
Everything work perfectly fine but when i added "\£" to my regex it started causing problems. I get SyntaxError. "NON ASCII CHACTER "\xc2" in file (...) but no encoding declared...
I've tried to solve this problem with using
import sys
reload(sys) # to enable `setdefaultencoding` again
sys.setdefaultencoding("UTF-8")
but it doesnt help. I just want to build regular expression and use pound sign there. flag re.Unicode flag doesnt help, saving string as unicode (pat) doesnt help. Is there any solution to fix this regex? I just want to build regular expression and use pound sign there.Thanks for help.
k = text.encode('utf-8')
pat = u'salar.{1,6}?([0-9\-,\. \tkFFRroOMmTtAanNuUMm\$\&\;\£]{2,})'
pattern = re.compile(pat, flags = re.DOTALL|re.I|re.UNICODE)
salary = pattern.search(k).group(1)
print (salary)
Error is still there even if I comment(put "#" and skip all of those lines. Maybe its not connected with re. library but my settings?

The error message means Python cannot guess which character set you are using. It also tells you that you can fix it by telling it the encoding of your script.
# coding: utf-8
string = "£"
or equivalently
string = u"\u00a3"
Without an encoding declaration, Python sees a bunch of bytes which mean different things in different encodings. Rather than guess, it forces you to tell you what they mean. This is codified in PEP-263.
(ASCII is unambiguous [except if your system is EBCDIC I guess] so it knows what you mean if you use a pure-ASCII representation for everything.)
The encoding settings you were fiddling with affect how files and streams are read, and program I/O generally, but not how the program source is interpreted.

Python :Non-UTF-8 code starting with '\xe8' in file [duplicate]

I am trying to write a binary search program for a class, and I am pretty sure that my logic is right, but I keep getting a non-UTF-8 error. I have never seen this error and any help/clarification would be great! Thanks a bunch.
Here's the code.
def main():
str names = [‘Ava Fischer’, ‘Bob White’, ‘Chris Rich’, ‘Danielle Porter’, ‘Gordon Pike’, ‘Hannah Beauregard’, ‘Matt Hoyle’, ‘Ross Harrison’, ‘Sasha Ricci’, ‘Xavier Adams’]
binarySearch(names, input(str("Please Enter a Name.")))
print("That name is at position "+position)
def binarySearch(array, searchedValue):
begin = 0
end = len(array) - 1
position = -1
found = False
while !=found & begin<=end:
middle=(begin+end)/2
if array[middle]== searchedValue:
found=True
position = middle
elif array[middle] >value:
end=middle-1
else:
first =middle+1
return position

Add this line at the top of you code. It may work.
# coding=utf8

Your editor replaced ' (ASCII 39) with U+2018 LEFT SINGLE QUOTATION MARK characters, usually a sign you used Word or a similar wordprocessor instead of a plain text editor; a word processor tries to make your text 'prettier' and auto-replaces things like simple quotes with fancy ones. This was then saved in the Windows 1252 codepage encoding, where the fancy quotes were saved as hex 91 characters.
Python is having none of it. It wants source code saved in UTF-8 and using ' or " for quotation marks. Use notepad, or better still, IDLE to edit your Python code instead.
You have numerous other errors in your code; you cannot use spaces in your variable names, for example, and Python uses and, not & as the boolean AND operator. != is an operator requiring 2 operands (it means 'not equal', the opposite of ==), the boolean NOT operator is called not.

If you're using Notepad++, click Encoding at the top and choose Encode in UTF-8.

The character you are beginning your constant strings with is not the right string delimiter. You are using
‘Ava Fischer’ # ‘ and ’ as string delimiters
when it should have been either
'Ava Fischer' # Ascii 39 as string delimiter
or maybe
"Ava Fischer" # Ascii 34 as string delimiter

Add this line to the top of your code, it might help
# -*- coding:utf-8 -*-

SyntaxError: Non-UTF-8 code starting with '\x91'

I am trying to write a binary search program for a class, and I am pretty sure that my logic is right, but I keep getting a non-UTF-8 error. I have never seen this error and any help/clarification would be great! Thanks a bunch.
Here's the code.
def main():
str names = [‘Ava Fischer’, ‘Bob White’, ‘Chris Rich’, ‘Danielle Porter’, ‘Gordon Pike’, ‘Hannah Beauregard’, ‘Matt Hoyle’, ‘Ross Harrison’, ‘Sasha Ricci’, ‘Xavier Adams’]
binarySearch(names, input(str("Please Enter a Name.")))
print("That name is at position "+position)
def binarySearch(array, searchedValue):
begin = 0
end = len(array) - 1
position = -1
found = False
while !=found & begin<=end:
middle=(begin+end)/2
if array[middle]== searchedValue:
found=True
position = middle
elif array[middle] >value:
end=middle-1
else:
first =middle+1
return position

Add this line at the top of you code. It may work.
# coding=utf8

Your editor replaced ' (ASCII 39) with U+2018 LEFT SINGLE QUOTATION MARK characters, usually a sign you used Word or a similar wordprocessor instead of a plain text editor; a word processor tries to make your text 'prettier' and auto-replaces things like simple quotes with fancy ones. This was then saved in the Windows 1252 codepage encoding, where the fancy quotes were saved as hex 91 characters.
Python is having none of it. It wants source code saved in UTF-8 and using ' or " for quotation marks. Use notepad, or better still, IDLE to edit your Python code instead.
You have numerous other errors in your code; you cannot use spaces in your variable names, for example, and Python uses and, not & as the boolean AND operator. != is an operator requiring 2 operands (it means 'not equal', the opposite of ==), the boolean NOT operator is called not.

If you're using Notepad++, click Encoding at the top and choose Encode in UTF-8.

The character you are beginning your constant strings with is not the right string delimiter. You are using
‘Ava Fischer’ # ‘ and ’ as string delimiters
when it should have been either
'Ava Fischer' # Ascii 39 as string delimiter
or maybe
"Ava Fischer" # Ascii 34 as string delimiter

Add this line to the top of your code, it might help
# -*- coding:utf-8 -*-

How can I determine a Unicode character from its name in Python, even if that character is a control character?

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:
whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
That's a little bit obscure; names would be better. The unicodedata.lookup method passed through ord helps some:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?

Kerrek SB's comment is a good one: just put the names in a comment.
BTW, Python also supports a named unicode literal:
>>> u"\N{NO-BREAK SPACE}"
u'\xa0'
But it uses the same unicode name database, and the control characters are not in it.

You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).

I don't think it can be done in standard Python. The unicodedata module uses the UnicodeData.txt v5.2.0 Unicode database. Notice that the control characters are all assigned the name <control> (the second field, semicolon-delimited).
The script Tools/unicode/makeunicodedata.py in the Python source distribution is used to generate the table used by the Python runtime. The makeunicodename function looks like this:
def makeunicodename(unicode, trace):
FILE = "Modules/unicodename_db.h"
print "--- Preparing", FILE, "..."
# collect names
names = [None] * len(unicode.chars)
for char in unicode.chars:
record = unicode.table[char]
if record:
name = record[1].strip()
if name and name[0] != "<":
names[char] = name + chr(0)
...
Notice that it skips over entries whose name begins with "<". Hence, there is no name that can be passed to unicodedata.lookup that will give you back one of those control characters.
Just hardcode the code points for horizontal tab, line feed, and carriage return, and leave a descriptive comment. As the Zen of Python goes, "practicality beats purity".

A few points:
(1) "BOM" is not a character. BOM is a byte sequence that appears at the start of a file to indicate the byte order of a file that is encoded in UTF-nn. BOM is u'\uFEFF'.encode('UTF-nn'). Reading a file with the appropriate codec will slurp up the BOM; you don't see it as a Unicode character. A BOM is not data. If you do see u'\uFEFF' in your data, treat it as a (deprecated) ZERO-WIDTH NO-BREAK SPACE.
(2) "minus the Unicode-white-space code points, which I address separately"?? Isn't NO-BREAK SPACE a "Unicode-white-space" code point?
(3) Your Python appears to be broken; mine does this:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
(4) You could use escape sequences for the first three.
>>> map(hex, map(ord, "\t\v\f"))
['0x9', '0xb', '0xc']
(5) You could use " " for the fourth one.
(6) Even if you could use names, the readers of your code would still be applying blind faith that e.g. "FORM FEED" is a whitespace character.
(7) What happened to to \r and \n?

Assuming you're working with Unicode strings, the first five items in your list, plus all other Unicode space characters, will be matched by the \s option when using a regular expression. Using Python 3.1.2:
>>> import re
>>> s = '\u0009,\u000b,\u000c,\u0020,\u00a0,\ufeff'
>>> s
'\t,\x0b,\x0c, ,\xa0,\ufeff'
>>> re.findall(r'\s', s)
['\t', '\x0b', '\x0c', ' ', '\xa0']
And as for the byte-order mark, the one given can be referred to as codecs.BOM_BE or codecs.BOM_UTF16_BE (though in Python 3+, it's returned as a bytes object rather than str).

The official Unicode recommendation for newlines may or may not be at odds with the way the Python codecs module handles newlines. Since u'\n' is often said to mean "new line", one might expect based on this recommendation for the Python string u'\n' to represent character U+2028 LINE SEPARATOR and to be encoded as such, rather than as the semantic-less control character U+000A. But I can only imagine the confusion that would result if the codecs module actually implemented that policy, and there are valid counter-arguments besides. Ditto for horizontal/vertical tab and form feed, which are probably not really characters but controls anyway. (I would certainly consider backspace to be a control, not a character.)
Your question seems to assume that treating U+000A as a control character (instead of a line separator) is wrong; but that is not at all certain. Perhaps it is more wrong for text processing applications everywhere to assume that a legacy printer-platen-scrolling control signal is really a true "line separator".

You can extend the lookup function to handle the characters that aren't included.
def unicode_lookup(x):
try:
ch = unicodedata.lookup(x)
except KeyError:
control_chars = {'LINE FEED':unichr(0x0a),'CARRIAGE RETURN':unichr(0x0d)}
if x in control_chars:
ch = control_chars[x]
else:
raise
return ch
>>> unicode_lookup('SPACE')
u' '
>>> unicode_lookup('LINE FEED')
u'\n'
>>> unicode_lookup('FORM FEED')
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
unicode_lookup('FORM FEED')
File "<pyshell#13>", line 3, in unicode_lookup
ch = unicodedata.lookup(x)
KeyError: "undefined character name 'FORM FEED'"

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to search and replace utf-8 special characters in Python? - python

Related

Using unicode / umlauts in Python: Dictionary v manual input

Python UTF-8 REGEX

Python :Non-UTF-8 code starting with '\xe8' in file [duplicate]

SyntaxError: Non-UTF-8 code starting with '\x91'

How can I determine a Unicode character from its name in Python, even if that character is a control character?

Categories

Resources