Replacing non-ASCII characters in a string [duplicate] - python

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the web an elegant way to do this (in Java):
convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
remove all the characters whose Unicode type is "diacritic".
Do I need to install a library such as PyICU, or is this possible with just the Python standard library? And what about Python 3?
Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.
Example:
>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'

How about this:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
This works on greek letters, too:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

I just found this answer on the Web:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii
It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
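For instance, a quick check of that failure mode with the function above (in Python 3 the result is a bytes object):
>>> remove_accents(u'Μάιος')   # Greek for "May"
b''
All the Greek letters are simply dropped.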
Edit: this does the trick:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
unicodedata.combining(c) returns a nonzero value (the character's canonical combining class) when c combines with the preceding character, which is mainly the case when it is a diacritic.
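For instance, a quick interpreter check (230 is the combining class of the acute accent):
>>> import unicodedata
>>> unicodedata.combining('\u0301')   # COMBINING ACUTE ACCENT
230
>>> unicodedata.combining('e')
0
>>> unicodedata.category('\u0301')
'Mn'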
Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.
Thanks to you, I have created this function that works wonders.
import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError):  # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text
result:
text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

This handles not only accents, but also "strokes" (as in ø etc.):
import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char
This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I'm not sure it is truly elegant after all.
In fact, it's more of a hack, as pointed out in comments, since Unicode names are really just names; they give no guarantee to be consistent or anything.
There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping to achieve dictionary sort order.
EDIT NOTE:
Incorporated suggestions from the comments (handling lookup errors, Python-3 code).
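A minimal usage sketch, assuming the rmdiacritics function above is in scope; note that it works on single characters, so it has to be mapped over a string:
>>> ''.join(rmdiacritics(c) for c in 'éøł')
'eol'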

In my view, the proposed solutions should NOT be accepted answers. The original question asks for the removal of accents, so the correct answer should only do that, not that plus other, unspecified changes.
Simply observe the result of the code in the accepted answer, where I have changed "Málaga" to "Málagueña":
accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'
There is an additional change (ñ -> n), which is not requested in the original question.
A simple function that does the requested task, in lowercase form:
import re

def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new
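A quick check, assuming the function above:
>>> f_remove_accents('Málagueña')
'malagueña'
The accents are stripped but the ñ is left untouched, unlike with unidecode.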

gensim.utils.deaccent(text), from Gensim (topic modelling for humans), also removes accent marks, e.g. it produces:
'Sef chomutovskych komunistu dostal postou bily prasek'
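A minimal sketch of how it is called, assuming Gensim is installed (the accented Czech input below is my reconstruction of the sentence behind that output):
from gensim.utils import deaccent

print(deaccent('Šéf chomutovských komunistů dostal poštou bílý prášek'))
# Sef chomutovskych komunistu dostal postou bily prasek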
Another solution is unidecode.
Note that the suggested solution with unicodedata typically removes accents only in some characters (e.g. it turns 'ł' into '' rather than into 'l').

In response to @MiniQuark's answer:
I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats.
As a test, I created a test.txt file that looked like this:
Montréal, über, 12.89, Mère, Françoise, noël, 889
I had to include lines 2 and 3 to get it to work (which I found in a Python ticket), as well as incorporate @Jabba's comment:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)
The result:
Montreal
uber
12.89
Mere
Francoise
noel
889
(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

# Benchmark of the three approaches above with perfplot
import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode

def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))

def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])

perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)

Some languages use combining diacritics both as part of their letters and as accent marks, so I think it is safer to specify explicitly which diacritics you want to strip:
import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
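A quick check, assuming the function above: the acute accent is stripped, while diacritics outside the default set (here the ring and the diaeresis) are kept:
>>> strip_accents('Ångström café')
'Ångström cafe'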

Here is a short function which strips the diacritics, but keeps the non-latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.
Code
from unicodedata import combining, normalize
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))
NB. The default argument outliers is evaluated once and not meant to be provided by the caller.
Intended usage
As a key to sort a list of strings in a more “natural” order:
sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)
Output:
['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
If your strings mix texts and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
Tests
To make sure the behavior meets your needs, take a look at the pangrams below:
examples = [
("hello, world", "hello, world"),
("42", "42"),
("你好,世界", "你好,世界"),
(
"Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
"des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
),
(
"Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
"falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
),
(
"Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
"љубазни фењерџија чађавог лица хоће да ми покаже штос.",
),
(
"Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
"ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
),
(
"Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
"quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
),
(
"Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
"kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
),
(
"Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
"glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
)
]
for (given, expected) in examples:
    assert remove_diacritics(given) == expected
Case-preserving variant
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü Ä Æ Ǽ Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö Œ SS Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))

There are already many answers here, but this was not previously considered: using sklearn
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode
accented_string = u'Málagueña®'
print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena
This is particularly useful if you are already using sklearn to process text. Those are the functions internally called by classes like CountVectorizer to normalize strings: when using strip_accents='ascii' then strip_accents_ascii is called and when strip_accents='unicode' is used, then strip_accents_unicode is called.
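For instance, a minimal sketch of that CountVectorizer behaviour (the sample strings are just an illustration):
from sklearn.feature_extraction.text import CountVectorizer

# strip_accents='unicode' makes the vectorizer call strip_accents_unicode internally
vectorizer = CountVectorizer(strip_accents='unicode')
X = vectorizer.fit_transform(['Málagueña', 'Malaguena'])
print(vectorizer.vocabulary_)  # {'malaguena': 0} -- both spellings map to the same term
print(X.toarray())             # [[1], [1]]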
More details
Finally, consider those details from its docstring:
Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing
Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.
and
Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart
Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.

If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...
A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.
Here's an example from the page mentioned above:
from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'
EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.

Related

Obtain a list of the 143,859 characters in the Unicode Standard, Version 13.0.0

Is it possible to obtain a list of all the 143,859 characters included in the 13.0.0 version of Unicode?
I'm trying to print these 143,859 characters in python but was unable to find a comprehensive list of all the characters.
To obtain a list of 143,859 characters, you must exclude the same categories the Unicode Consortium excludes in order to arrive at that count.
import sys
from unicodedata import category, unidata_version

chars = []
for i in range(sys.maxunicode + 1):
    c = chr(i)
    cat = category(c)
    if cat == "Cc":  # control characters
        continue
    if cat == "Co":  # private use
        continue
    if cat == "Cs":  # surrogates
        continue
    if cat == "Cn":  # noncharacter or reserved
        continue
    chars.append(c)

print(f"Number of characters in Unicode v{unidata_version}: {len(chars):,}")
Output on my machine:
Number of characters in Unicode v13.0.0: 143,859
I think your best bet is probably to read the UnicodeData.txt file, as recommended by @wim in a comment below, then expand all the ranges that are marked off by <..., First> and <..., Last> in the second column, e.g., expand
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
to
3400
3401
3402
...
4DBD
4DBE
4DBF
I haven't checked, but I'm guessing this would give you a pretty complete list.
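An untested, minimal sketch of that approach (it assumes UnicodeData.txt has been downloaded from the UCD, and skips the Cc/Co/Cs rows as in the script above, so it should land on the same count):
codepoints = []
first = None
with open("UnicodeData.txt", encoding="utf-8") as f:
    for line in f:
        code_hex, name, cat = line.split(";")[:3]
        code = int(code_hex, 16)
        if name.endswith(", First>"):
            first = code  # remember the start of a range
            continue
        start, end = (first, code) if name.endswith(", Last>") else (code, code)
        if cat not in ("Cc", "Co", "Cs"):
            codepoints.extend(range(start, end + 1))

print(f"{len(codepoints):,} characters")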
Below are some other suggestions I made earlier, some of which could be useful.
Other Ideas
You can make a start with the Unicode Character Name Index, which is linked from the list of Unicode 13.0 Character Code Charts. However, that table has significant gaps and repetitions, e.g., all Latin capital letters are lumped under 0041 (A) and the group is identified in a few different ways. Actually, the table is pretty incomplete -- it only has 2,759 unique codes.
Keying off of @wim's comment on the original post, another option might be to take a look at the source code for Python's unicodedata module. unicodename_db.h has some lists of codes that are read by _getucname in unicodedata.c. It looks like phrasebook may have a nearly complete list of codes (188,803 items), but possibly munged in some way (I don't have time to figure out the lookup/offset mechanism right now). In addition to those, Hangul syllables and unified ideographs are processed as ranges, not looked up from the phrasebook.
The ONLY file you need is DerivedName.txt, available in the Unicode Character Database (UCD). The file for version 14.0 can be found here.
# PyPI package unicode_charnames
# Supports version 14.0 of the Unicode Standard
from unicode_charnames import charname, UNICODE_VERSION
assigned = []
control = []
private_use = []
surrogate = []
noncharacter = []
reserved = []
for x in range(0x110000):
    name = charname(chr(x))
    if not name.startswith("<"):
        assigned.append(x)
    elif name.startswith("<control"):
        control.append(x)
    elif name.startswith("<private"):
        private_use.append(x)
    elif name.startswith("<surrogate"):
        surrogate.append(x)
    elif name.startswith("<noncharacter"):
        noncharacter.append(x)
    else:
        reserved.append(x)
print(f"# Unicode {UNICODE_VERSION}")
print(f"# Code points assigned to an abstract character: {len(assigned):,}")
print(f"# Unassigned code points: {len(reserved):,}")
print( "# Code points with a normative function")
print(f"# Control characters : {len(control):>7,}")
print(f"# Private-use characters : {len(private_use):>7,}")
print(f"# Surrogate characters : {len(surrogate):>7,}")
print(f"# Noncharacters : {len(noncharacter):>7,}")
# Unicode 14.0.0
# Code points assigned to an abstract character: 144,697
# Unassigned code points: 829,768
# Code points with a normative function
# Control characters : 65
# Private-use characters : 137,468
# Surrogate characters : 2,048
# Noncharacters : 66

Fix special characters with closest equivalent without map

I'm getting output from a sensor (GPS) in Python and for some reason, the output is not entirely clean. I'm already using pynmea2 and its checksum to filter out the bad rows but I want to improve the acceptance rate.
If you look at some sample data from the sensor, you see that many characters are replaced with something that could be corrected by a human, such as ® = . or ³ = 3. Some, on the other hand, are less clear, such as ¶ or ±, or Ç = G rather than C.
I've tried to research how I could fix this but short of creating a hardcoded map or search and replace, I can't come up with anything. Is there a library or a way to "clean up" my input to solve at least the obvious one and thus boost my acceptance rate?
nmea = [
"$GNRMC,175230.00,A,5231.08575,N,01324.94302,E,0.099,,300321,,,A,V*17",
"$GNRMC,175211.00,A,5231.08495,N,01324.94370,E,2.771,348.30,300321,,,A,V*0F",
"$GNGGA,175140.00,5231.06514,N,01325.03302,E,1,11,1.22,33.9,M,42.1,Í,,*7F",
"$GNRMC,175141.00,A,5231.06307,N,01³25.02563,E,16.734,234.24,300³21,,,A,V*3A",
"$GNGÇA,175141.00,5231.06307,N,01325.02563,E,1,11,1®22,33.6,M,42.1,M,,*75",
"$GNRMC,175142.00,A,5231.°6059,N,01325.01869,E,17.220,235.29,300321,,,A,V*38",
"$GNGGA,175142.00,5231.06059,N,01325.01869,E,1,11,±.22,33.5,M,42.1,M,,*79",
"$GNRMC,175143.00,A,5231.05861,N,01325.01218,E,16.¶45,238.31,300321,,,A,V*37",
"$GNGGA,175143.00,5231.05861,N,013²5.01218,E,1,11,1.22,34.7,M,42.1,M,,*71",
"$GNRMC,175144.00,A,5231.05689,N,01325.00574,E,16.090,241.36,300321,,,A,V*33",
"$GNGGA,175144.00,5231.05689,N,01325.00574,E,1,11,1.28,36.0,M,42.1,M,,*7D",
"$GNRMC,175145.00,A,5231.05478,N,01324.99957,E,16.045,240.96,300321,,,A,V*31",
"$GNGGA,175145.00,5231.05478,N,01324.99957,E,1,11,1.22,36.8,M,42.1,M,,*7E",
"$GNRMC,175146®00,A,5231.05277,N,01324.99327,E,15.832,241.30,300321,,,A,V*30",
"$GNGGA,175146.00,5231.05277,N,01324.99327,E,1,11,1.22,37.3,M,42.1,M,,*73",
"$GNGGA,175230.00,5231.08575,N,01324.94302,E,1,12,0.96,56.7,M,42.1,M,,*7D"]
There is a one-bit error between ®/., ³/3, Í/M (same bit):
>>> s = '®.³3ÍM'  # the confused/correct pairs from the log above
>>> for c in s:
...     print(f'{c} {ord(c):08b}')
...
® 10101110
. 00101110
³ 10110011
3 00110011
Í 11001101
M 01001101
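To confirm it is the same bit each time, XOR each pair (a quick check in the same session):
>>> for a, b in [('®', '.'), ('³', '3'), ('Í', 'M')]:
...     print(a, b, bin(ord(a) ^ ord(b)))
...
® . 0b10000000
³ 3 0b10000000
Í M 0b10000000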
You likely have a problem with your hardware. Since the data is ASCII the high bit (bit 7) is supposed to always be a 0, so you could just filter the output if you can't fix the hardware issue:
import re
from pprint import pprint
nmea = [
"$GNRMC,175230.00,A,5231.08575,N,01324.94302,E,0.099,,300321,,,A,V*17",
"$GNRMC,175211.00,A,5231.08495,N,01324.94370,E,2.771,348.30,300321,,,A,V*0F",
"$GNGGA,175140.00,5231.06514,N,01325.03302,E,1,11,1.22,33.9,M,42.1,Í,,*7F",
"$GNRMC,175141.00,A,5231.06307,N,01³25.02563,E,16.734,234.24,300³21,,,A,V*3A",
"$GNGÇA,175141.00,5231.06307,N,01325.02563,E,1,11,1®22,33.6,M,42.1,M,,*75",
"$GNRMC,175142.00,A,5231.°6059,N,01325.01869,E,17.220,235.29,300321,,,A,V*38",
"$GNGGA,175142.00,5231.06059,N,01325.01869,E,1,11,±.22,33.5,M,42.1,M,,*79",
"$GNRMC,175143.00,A,5231.05861,N,01325.01218,E,16.¶45,238.31,300321,,,A,V*37",
"$GNGGA,175143.00,5231.05861,N,013²5.01218,E,1,11,1.22,34.7,M,42.1,M,,*71",
"$GNRMC,175144.00,A,5231.05689,N,01325.00574,E,16.090,241.36,300321,,,A,V*33",
"$GNGGA,175144.00,5231.05689,N,01325.00574,E,1,11,1.28,36.0,M,42.1,M,,*7D",
"$GNRMC,175145.00,A,5231.05478,N,01324.99957,E,16.045,240.96,300321,,,A,V*31",
"$GNGGA,175145.00,5231.05478,N,01324.99957,E,1,11,1.22,36.8,M,42.1,M,,*7E",
"$GNRMC,175146®00,A,5231.05277,N,01324.99327,E,15.832,241.30,300321,,,A,V*30",
"$GNGGA,175146.00,5231.05277,N,01324.99327,E,1,11,1.22,37.3,M,42.1,M,,*73",
"$GNGGA,175230.00,5231.08575,N,01324.94302,E,1,12,0.96,56.7,M,42.1,M,,*7D"]
def fix(m):
    a = m.group(0)
    b = chr(ord(m.group(0)) & 0x7F)  # clear bit 7
    print(f'Replaced {a} with {b}')
    return b

for i, v in enumerate(nmea):
    nmea[i] = re.sub(r'[\x80-\xff]', fix, v)

pprint(nmea)
Output:
Replaced Í with M
Replaced ³ with 3
Replaced ³ with 3
Replaced Ç with G
Replaced ® with .
Replaced ° with 0
Replaced ± with 1
Replaced ¶ with 6
Replaced ² with 2
Replaced ® with .
['$GNRMC,175230.00,A,5231.08575,N,01324.94302,E,0.099,,300321,,,A,V*17',
'$GNRMC,175211.00,A,5231.08495,N,01324.94370,E,2.771,348.30,300321,,,A,V*0F',
'$GNGGA,175140.00,5231.06514,N,01325.03302,E,1,11,1.22,33.9,M,42.1,M,,*7F',
'$GNRMC,175141.00,A,5231.06307,N,01325.02563,E,16.734,234.24,300321,,,A,V*3A',
'$GNGGA,175141.00,5231.06307,N,01325.02563,E,1,11,1.22,33.6,M,42.1,M,,*75',
'$GNRMC,175142.00,A,5231.06059,N,01325.01869,E,17.220,235.29,300321,,,A,V*38',
'$GNGGA,175142.00,5231.06059,N,01325.01869,E,1,11,1.22,33.5,M,42.1,M,,*79',
'$GNRMC,175143.00,A,5231.05861,N,01325.01218,E,16.645,238.31,300321,,,A,V*37',
'$GNGGA,175143.00,5231.05861,N,01325.01218,E,1,11,1.22,34.7,M,42.1,M,,*71',
'$GNRMC,175144.00,A,5231.05689,N,01325.00574,E,16.090,241.36,300321,,,A,V*33',
'$GNGGA,175144.00,5231.05689,N,01325.00574,E,1,11,1.28,36.0,M,42.1,M,,*7D',
'$GNRMC,175145.00,A,5231.05478,N,01324.99957,E,16.045,240.96,300321,,,A,V*31',
'$GNGGA,175145.00,5231.05478,N,01324.99957,E,1,11,1.22,36.8,M,42.1,M,,*7E',
'$GNRMC,175146.00,A,5231.05277,N,01324.99327,E,15.832,241.30,300321,,,A,V*30',
'$GNGGA,175146.00,5231.05277,N,01324.99327,E,1,11,1.22,37.3,M,42.1,M,,*73',
'$GNGGA,175230.00,5231.08575,N,01324.94302,E,1,12,0.96,56.7,M,42.1,M,,*7D']

How can I check if a Python unicode string contains non-Western letters?

I have a Python Unicode string. I want to make sure it only contains letters from the Roman alphabet (A through Z), as well as letters commonly found in European alphabets, such as ß, ü, ø, é, à, and î. It should not contain characters from other alphabets (Chinese, Japanese, Korean, Arabic, Cyrillic, Hebrew, etc.). What's the best way to go about doing this?
Currently I am using this bit of code, but I don't know if it's the best way:
def only_roman_chars(s):
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:  # encoding a unicode string raises UnicodeEncodeError
        return False
(I am using Python 2.5. I am also doing this in Django, so if the Django framework happens to have a way to handle such strings, I can use that functionality -- I haven't come across anything like that, however.)
import unicodedata as ud

latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin
>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False
The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).
pip install alphabet-detector
and then use it directly:
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True
Also, a few convenience methods for major languages:
ad.is_cyrillic(u"Поиск") #True
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
The standard library's string module exposes string.printable, which contains all the ASCII letters, digits, punctuation and whitespace. You can remove these values from the text, and if anything is left, the text contains characters outside that set. I did that:
In [1]: from string import printable
In [2]: def is_latin(text):
   ...:     return not bool(set(text) - set(printable))
   ...:
In [3]: is_latin('Hradec Králové District,,Czech Republic,')
Out[3]: False
In [4]: is_latin('Hradec Krlov District,,Czech Republic,')
Out[4]: True
I have no way to check all non-Latin characters and if anyone can do that, please let me know. Thanks.
For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might also allow cp1250 as well -- this would pick up eastern European countries like Poland, Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc where the alphabet is Latin-based. Other cp125x would include Turkish and Maltese ...
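A minimal sketch of that codec-based check (the function name and the exact codec list are my own illustration):
def only_western_chars(s, codecs=("cp1252", "cp1250")):
    # accept the string if it can be encoded by any of the given
    # single-byte Western / Central European code pages
    for codec in codecs:
        try:
            s.encode(codec)
            return True
        except UnicodeEncodeError:
            pass
    return False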
You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).
I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?
Checking for ISO-8859-1 would miss reasonable Western characters like 'œ' and '€'. The solution depends on how you define "Western", and how you want to handle non-letters. Here's one approach:
import unicodedata

def is_permitted_char(char):
    cat = unicodedata.category(char)[0]
    if cat == 'L':  # Letter
        return 'LATIN' in unicodedata.name(char, '').split()
    elif cat == 'N':  # Number
        # Only DIGIT ZERO - DIGIT NINE are allowed
        return '0' <= char <= '9'
    elif cat in ('S', 'P', 'Z'):  # Symbol, Punctuation, or Space
        return True
    else:
        return False

def is_valid(text):
    return all(is_permitted_char(c) for c in text)
Check the code in django.template.defaultfilters.slugify:
import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
is what you are looking for; you can then compare the resulting string with the original.
Maybe this will do if you're a django user?
from django.template.defaultfilters import slugify
def justroman(s):
    return len(slugify(s)) == len(s)
To simplify tzot's answer using the built-in unicodedata library, this seems to work for me:
import unicodedata as ud

def is_latin(word):
    return all(['LATIN' in ud.name(c) for c in word])

Python String Cleanup + Manipulation (Accented Characters)

I have a database full of names like:
John Smith
Scott J. Holmes
Dr. Kaplan
Ray's Dog
Levi's
Adrian O'Brien
Perry Sean Smyre
Carie Burchfield-Thompson
Björn Árnason
There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.
I'd like to convert the full names (after stripping characters like " ' " , "-") to user logins like:
john.smith
scott.j.holmes
dr.kaplan
rays.dog
levis
adrian.obrien
perry.sean.smyre
carie.burchfieldthompson
bjorn.arnason
So far I have:
Fullname.strip() # get rid of leading/trailing white space
Fullname.lower() # make everything lower case
... # after bad chars converted/removed
Fullname.replace(' ', '.') # replace spaces with periods
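Putting those steps together, a minimal Python 3 sketch along the lines of the unicodedata answers below (the function name is my own; remember that str methods return new strings, so each result has to be assigned back):
import re
import unicodedata

def make_login(fullname):
    # strip accents via NFKD, drop remaining punctuation, dots for spaces
    decomposed = unicodedata.normalize("NFKD", fullname)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]", "", ascii_only.strip().lower())
    return re.sub(r"\s+", ".", cleaned)

print(make_login("Björn Árnason"))   # bjorn.arnason
print(make_login("Adrian O'Brien"))  # adrian.obrien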
Take a look at this link [redacted]
Here is the code from the page
def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
    }
    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r
# This gives an example of how to use latin1_to_ascii().
# This creates a string with all the characters in the latin-1 character set,
# then it converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('', 'latin-1')
    for c in range(32, 256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
If you are not afraid to install third-party modules, then have a look at the python port of the Perl module Text::Unidecode (it's also on pypi).
The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and any Python version (crossing fingers). It's also easy to bundle with your application.
With this module you don't have to create your lookup table manually ( = reduced risk it being incomplete).
The advantage of this module compared to the Unicode normalization technique is this: Unicode normalization does not replace all characters. A good example is a character like "æ". The Unicode database classifies it as "Letter, lowercase" (Ll) and normalization does not decompose it, so the normalize method gives you neither a replacement character nor a useful hint. Unfortunately, that character is not representable in ASCII, so you'll get errors.
The mentioned module does a better job at this: it actually replaces "æ" with "ae", which is useful and makes sense.
The most impressive thing I've seen is that it goes much further. It even replaces Japanese kana characters mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not fool-proof, though: the current version replaces "ち" with "ti" instead of "chi", so you'll have to handle it with care for the more exotic characters.
Usage of the module is straightforward (Python 2 shown here):
from unidecode import unidecode
var_utf8 = "æは".decode("utf8")
unidecode(var_utf8).encode("ascii")
>>> "aeha"
Note that I have nothing to do with this module directly. It just happens that I find it very useful.
Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away; I may have missed some.
The following function is generic:
import unicodedata

def not_combining(char):
    return unicodedata.category(char) != 'Mn'

def strip_accents(text, encoding):
    unicode_text = unicodedata.normalize('NFD', text.decode(encoding))
    return filter(not_combining, unicode_text).encode(encoding)
# in a cp1252 environment
>>> print strip_accents("déjà", "cp1252")
deja
# in a cp1253 environment
>>> print strip_accents("καλημέρα", "cp1253")
καλημερα
Obviously, you should know the encoding of your strings.
I would do something like this
# coding=utf-8
import re

def alnum_dot(name, replace={}):
    for k, v in replace.items():
        name = name.replace(k, v)
    return re.sub("[^a-z.]", "", name.strip().lower())

print alnum_dot(u"Frédrik Holmström", {
    u"ö": "o",
    " ": "."
})
The second argument is a dict of the characters you want replaced; any characters other than a-z and "." that are not replaced will be stripped.
The translate method allows you to delete characters. You can use that to delete arbitrary characters.
Fullname.translate(None, "'-\"")  # Python 2 str.translate(table, deletechars)
If you want to delete whole classes of characters, you might want to use the re module.
re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())

