Python String Cleanup + Manipulation (Accented Characters)

I have a database full of names like:
John Smith
Scott J. Holmes
Dr. Kaplan
Ray's Dog
Levi's
Adrian O'Brien
Perry Sean Smyre
Carie Burchfield-Thompson
Björn Árnason
There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.
I'd like to convert the full names (after stripping characters like " ' " , "-") to user logins like:
john.smith
scott.j.holmes
dr.kaplan
rays.dog
levis
adrian.obrien
perry.sean.smyre
carie.burchfieldthompson
bjorn.arnason
So far I have:
Fullname = Fullname.strip() # get rid of leading/trailing white space
Fullname = Fullname.lower() # make everything lower case
... # after bad chars converted/removed
Fullname = Fullname.replace(' ', '.') # replace spaces with periods

Take a look at this link [redacted]
Here is the code from the page
def latin1_to_ascii (unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {
0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
0xc6:'Ae', 0xc7:'C',
0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
0xd0:'Th', 0xd1:'N',
0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
0xdd:'Y', 0xde:'th', 0xdf:'ss',
0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
0xe6:'ae', 0xe7:'c',
0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
0xf0:'th', 0xf1:'n',
0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
0xfd:'y', 0xfe:'th', 0xff:'y',
0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
0xd7:'*', 0xf7:'/'
}
    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass  # no ASCII equivalent in the table: drop the character
        else:
            r += i
    return r
# This gives an example of how to use latin1_to_ascii().
# This creates a string with all the characters in the latin-1 character set
# then it converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('','latin-1')
    for c in range(32,256):
        if c != 0x7f:
            s = s + unicode(chr(c),'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)

If you are not afraid to install third-party modules, then have a look at the python port of the Perl module Text::Unidecode (it's also on pypi).
The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and any Python version (fingers crossed). It's also easy to bundle with your application.
With this module you don't have to create your lookup table manually, which reduces the risk of it being incomplete.
The advantage of this module over the Unicode normalization technique is that normalization does not replace all characters. A good example is "æ". Unicode categorizes it as "Letter, lowercase" (Ll) and gives it no decomposition, so the normalize method yields neither a replacement character nor a useful hint. The character still cannot be represented in ASCII, so encoding the result raises an error.
The mentioned module does a better job here: it replaces "æ" with "ae", which is actually useful and makes sense.
The most impressive thing I've seen is that it goes much further. It even transliterates Japanese kana mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not foolproof though, as the current version replaces "ち" with "ti" instead of "chi". So you'll have to handle it with care for the more exotic characters.
Usage of the module is straightforward:
from unidecode import unidecode
var_utf8 = "æは".decode("utf8")
unidecode( var_utf8 ).encode("ascii")
>>> "aeha"
Note that I have nothing to do with this module directly. It just happens that I find it very useful.
Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away. I may have missed some.
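The limitation of normalization described in this answer can be checked with the standard library alone. The following is a small Python 3 sketch; the variable names are only illustrative.

```python
import unicodedata

ch = "\u00e6"  # LATIN SMALL LETTER AE

# NFKD leaves the character unchanged because it has no decomposition.
decomposed = unicodedata.normalize("NFKD", ch)
print(decomposed == ch)           # True
print(unicodedata.category(ch))   # Ll

# A strict ASCII encode therefore fails...
try:
    decomposed.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)                  # False

# ...and errors="ignore" silently drops the letter instead of mapping it to "ae".
ignored = decomposed.encode("ascii", "ignore")
print(ignored)                    # b''
```

This is why a transliteration table (hand-written or from a package such as unidecode) is still needed for letters like "æ".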

The following function is generic:
import unicodedata
def not_combining(char):
    return unicodedata.category(char) != 'Mn'
def strip_accents(text, encoding):
    unicode_text= unicodedata.normalize('NFD', text.decode(encoding))
    return filter(not_combining, unicode_text).encode(encoding)
# in a cp1252 environment
>>> print strip_accents("déjà", "cp1252")
deja
# in a cp1253 environment
>>> print strip_accents("καλημέρα", "cp1253")
καλημερα
Obviously, you should know the encoding of your strings.

I would do something like this
# coding=utf-8
def alnum_dot(name, replace={}):
    import re
    for k, v in replace.items():
        name = name.replace(k, v)
    return re.sub("[^a-z.]", "", name.strip().lower())
print alnum_dot(u"Frédrik Holmström", {
    u"ö":"o",
    " ":"."
})
The second argument is a dict of the characters you want replaced. Any remaining character other than a-z and "." is stripped.

The translate method allows you to delete characters. You can use that to delete arbitrary characters (this two-argument form works on Python 2 byte strings).
Fullname.translate(None,"'-\"")
If you want to delete whole classes of characters, you might want to use the re module.
re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())
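Putting the suggestions in this thread together, a minimal Python 3 sketch of the whole name-to-login conversion could look like the following. The function name make_login is hypothetical. It assumes the names are already decoded to str and that only a-z and 0-9 should survive. Letters without a canonical decomposition (such as "ø" or "ł") are simply dropped here, so a lookup table or unidecode would still be needed for those.

```python
import re
import unicodedata


def make_login(fullname):
    """Turn a display name into a dotted, ASCII-only login (illustrative helper)."""
    # Decompose accented letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", fullname.strip().lower())
    # Drop the combining marks, keeping the base letters.
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove everything except a-z, 0-9 and whitespace (quotes, hyphens, periods).
    cleaned = re.sub(r"[^a-z0-9\s]", "", unaccented)
    # Collapse each run of whitespace into a single period.
    return ".".join(cleaned.split())


for name in ["Scott J. Holmes", "Ray's Dog", "Carie Burchfield-Thompson", "Björn Árnason"]:
    print(make_login(name))
```

For the four names in the loop this prints scott.j.holmes, rays.dog, carie.burchfieldthompson and bjorn.arnason, matching the logins requested in the question.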

Related

Obtain a list of the 143,859 Unicode Standard, Version 13.0.0 characters

Is it possible to obtain a list of all the 143,859 characters included in the 13.0.0 version of Unicode?
I'm trying to print these 143,859 characters in python but was unable to find a comprehensive list of all the characters.
To obtain a list of 143,859 characters you must exclude the same categories the unicode consortium has excluded in order to come up with that count.
import sys
from unicodedata import category, unidata_version
chars = []
for i in range(sys.maxunicode + 1):
    c = chr(i)
    cat = category(c)
    if cat == "Cc": # control characters
        continue
    if cat == "Co": # private use
        continue
    if cat == "Cs": # surrogates
        continue
    if cat == "Cn": # noncharacter or reserved
        continue
    chars.append(c)
print(f"Number of characters in Unicode v{unidata_version}: {len(chars):,}")
Output on my machine:
Number of characters in Unicode v13.0.0: 143,859
I think your best bet is probably to read the UnicodeData.txt file as recommended by @wim in a comment below, then expand all the ranges that are marked off by <..., First> and <..., Last> in the second column, e.g., expand
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
to
3400
3401
3402
...
4DBD
4DBE
4DBF
I haven't checked, but I'm guessing this would give you a pretty complete list.
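The range expansion described above can be sketched in a few lines of Python 3. The function name expand_unicode_data is hypothetical, and the sketch assumes each line is semicolon-separated with the code point in the first field and the name in the second, as in the two sample lines quoted above.

```python
def expand_unicode_data(lines):
    """Yield every code point described by UnicodeData.txt-style lines."""
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            # Remember where the range begins; the next entry should close it.
            range_start = code_point
        elif name.endswith(", Last>"):
            if range_start is None:
                raise ValueError("'Last' entry without a preceding 'First': " + line)
            yield from range(range_start, code_point + 1)
            range_start = None
        else:
            yield code_point


sample = [
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
]
code_points = list(expand_unicode_data(sample))
print(len(code_points), hex(code_points[0]), hex(code_points[-1]))
```

Note that the full file also lists control characters, surrogates and private-use ranges, so those categories would still have to be filtered out (using the third field) to arrive at the 143,859 figure.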
Below are some other suggestions I made earlier, some of which could be useful.
Other Ideas
You can make a start with the Unicode Character Name Index, which is linked from the list of Unicode 13.0 Character Code Charts. However, that table has significant gaps and repetitions, e.g., all Latin capital letters are lumped under 0041 (A) and the group is identified a few different ways. Actually, the table is pretty incomplete -- it only has 2,759 unique codes.
Keying off of @wim's comment on the original post, another option might be to take a look at the source code for Python's unicodedata module. unicodename_db.h has some lists of codes that are read by _getucname in unicodedata.c. It looks like phrasebook may have a nearly complete list of codes (188,803 items), but possibly munged in some way (I don't have time to figure out the lookup/offset mechanism right now). In addition to those, Hangul syllables and unified ideographs are processed as ranges, not looked up from the phrasebook.
The ONLY file you need is DerivedName.txt, available in the Unicode Character Database (UCD). The file for version 14.0 can be found here.
# PyPI package unicode_charnames
# Supports version 14.0 of the Unicode Standard
from unicode_charnames import charname, UNICODE_VERSION
assigned = []
control = []
private_use = []
surrogate = []
noncharacter = []
reserved = []
for x in range(0x110000):
    name = charname(chr(x))
    if not name.startswith("<"):
        assigned.append(x)
    elif name.startswith("<control"):
        control.append(x)
    elif name.startswith("<private"):
        private_use.append(x)
    elif name.startswith("<surrogate"):
        surrogate.append(x)
    elif name.startswith("<noncharacter"):
        noncharacter.append(x)
    else:
        reserved.append(x)
print(f"# Unicode {UNICODE_VERSION}")
print(f"# Code points assigned to an abstract character: {len(assigned):,}")
print(f"# Unassigned code points: {len(reserved):,}")
print( "# Code points with a normative function")
print(f"# Control characters : {len(control):>7,}")
print(f"# Private-use characters : {len(private_use):>7,}")
print(f"# Surrogate characters : {len(surrogate):>7,}")
print(f"# Noncharacters : {len(noncharacter):>7,}")
# Unicode 14.0.0
# Code points assigned to an abstract character: 144,697
# Unassigned code points: 829,768
# Code points with a normative function
# Control characters : 65
# Private-use characters : 137,468
# Surrogate characters : 2,048
# Noncharacters : 66

Replacing non-ASCII characters in a string [duplicate]

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the web an elegant way to do this (in Java):
convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
remove all the characters whose Unicode type is "diacritic".
Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?
Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.
Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.
Example:
>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'
How about this:
import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
This works on greek letters, too:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
I just found this answer on the Web:
import unicodedata
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii
It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
Edit: this does the trick:
import unicodedata
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
unicodedata.combining(c) returns the canonical combining class of c, which is nonzero (and therefore truthy) when c can be combined with the preceding character, that is, mainly when it is a diacritic.
Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)
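To see why filtering on unicodedata.combining is preferable to dropping everything that is not ASCII, the two variants can be compared on Greek text. This is a Python 3 sketch; the two function names are hypothetical and only serve to tell the variants apart.

```python
import unicodedata


def remove_accents_ascii(text):
    """First variant: decompose, then drop everything that is not ASCII."""
    nfkd_form = unicodedata.normalize("NFKD", text)
    return nfkd_form.encode("ascii", "ignore").decode("ascii")


def remove_accents_combining(text):
    """Second variant: decompose, then drop only the combining marks."""
    nfkd_form = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))


french = "déjà"
greek = "καλημέρα"

# Both variants agree on French...
print(remove_accents_ascii(french), remove_accents_combining(french))
# ...but the ASCII variant erases the Greek word entirely.
print(repr(remove_accents_ascii(greek)), remove_accents_combining(greek))
```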
Actually, I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.
Thanks to you, I have created this function that works wonders.
import re
import unicodedata
def strip_accents(text):
    """
    Strip accents from input String.
    :param text: The input string.
    :type text: String.
    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)
def text_to_id(text):
    """
    Convert input text to id.
    :param text: The input string.
    :type text: String.
    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text
result:
text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'
This handles not only accents, but also "strokes" (as in ø etc.):
import unicodedata as ud
def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass # removing "WITH ..." produced an invalid name
    return char
This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant.
In fact, it's more of a hack, as pointed out in the comments, since Unicode names are really just names; they come with no guarantee of being consistent.
There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode names do not contain 'WITH'. It depends on what you want to do anyway. I have sometimes needed accent stripping to achieve dictionary sort order.
EDIT NOTE:
Incorporated suggestions from the comments (handling lookup errors, Python-3 code).
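As a usage illustration, the name-based trick can be applied character by character to a whole string. This is a Python 3 sketch with hypothetical helper names (base_char, strip_strokes). Unlike the function above, it passes a default to unicodedata.name so that unnamed characters (such as control characters) do not raise ValueError.

```python
import unicodedata as ud


def base_char(char):
    """Return the letter named before ' WITH ' in char's Unicode name, if any."""
    head, sep, _ = ud.name(char, "").partition(" WITH ")
    if not sep:
        return char  # no " WITH " in the name (or no name at all)
    try:
        return ud.lookup(head)
    except KeyError:
        return char  # the shortened name is not a valid character name


def strip_strokes(text):
    return "".join(base_char(c) for c in text)


# "ø" has no canonical decomposition, so NFD alone cannot separate the stroke.
print(ud.normalize("NFD", "\u00f8") == "\u00f8")
print(strip_strokes("Łódź fløde"))
```

The second line of output is "Lodz flode": both the strokes (Ł, ø) and the ordinary accents (ó, ź) are resolved through the character names.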
In my view, the proposed solutions should NOT be accepted answers. The original question is asking for the removal of accents, so the correct answer should only do that, not that plus other, unspecified, changes.
Simply observe the result of this code, which is the accepted answer, where I have replaced "Málaga" with "Málagueña":
accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'
There is an additional change (ñ -> n), which is not requested in the OQ.
A simple function that does the requested task, in lower form:
import re
def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new
gensim.utils.deaccent(text) from Gensim - topic modelling for humans:
'Sef chomutovskych komunistu dostal postou bily prasek'
Another solution is unidecode.
Note that the suggested solution with unicodedata typically removes accents only in some characters (e.g. it turns 'ł' into '', rather than into 'l').
In response to @MiniQuark's answer:
I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats.
As a test, I created a test.txt file that looked like this:
Montréal, über, 12.89, Mère, Françoise, noël, 889
I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba's comment:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata
def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)
The result:
Montreal
uber
12.89
Mere
Francoise
noel
889
(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)
import unicodedata
from random import choice
import perfplot
import regex
import text_unidecode
def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))
def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])
perfplot.show(
setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
kernels=[
remove_accent_chars_regex,
remove_accent_chars_join,
text_unidecode.unidecode,
],
labels=['regex', 'join', 'unidecode'],
n_range=[2 ** k for k in range(22)],
equality_check=None, relative_to=0, xlabel='str len'
)
Some languages use combining diacritics both as integral parts of letters and as accent marks that indicate stress.
I think it is safer to specify explicitly which diacritics you want to strip:
import unicodedata
def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
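A short Python 3 sketch of how the explicit list changes the outcome: removing only the acute accent leaves "ñ" intact, which addresses the "Málagueña" objection raised earlier in this thread. The helper name strip_selected is hypothetical; it takes the list of combining mark names as a required argument instead of a default.

```python
import unicodedata


def strip_selected(text, accents):
    """Remove only the named combining marks, then recompose the rest."""
    marks = {unicodedata.lookup(name) for name in accents}
    kept = [c for c in unicodedata.normalize("NFD", text) if c not in marks]
    return unicodedata.normalize("NFC", "".join(kept))


word = "Málagueña"
acute_only = strip_selected(word, ["COMBINING ACUTE ACCENT"])
with_tilde = strip_selected(word, ["COMBINING ACUTE ACCENT", "COMBINING TILDE"])
print(acute_only)   # the tilde on the n is preserved
print(with_tilde)   # both marks are removed
```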
Here is a short function which strips the diacritics, but keeps the non-latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.
Code
from unicodedata import combining, normalize
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))
NB. The default argument outliers is evaluated once and not meant to be provided by the caller.
Intended usage
As a key to sort a list of strings in a more “natural” order:
sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)
Output:
['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
If your strings mix texts and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
Tests
To make sure the behavior meets your needs, take a look at the pangrams below:
examples = [
("hello, world", "hello, world"),
("42", "42"),
("你好,世界", "你好,世界"),
(
"Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
"des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
),
(
"Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
"falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
),
(
"Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
"љубазни фењерџија чађавог лица хоће да ми покаже штос.",
),
(
"Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
"ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
),
(
"Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
"quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
),
(
"Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
"kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
),
(
"Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
"glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
)
]
for (given, expected) in examples:
    assert remove_diacritics(given) == expected
Case-preserving variant
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü Ä Æ Ǽ Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö Œ ẞ Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))
There are already many answers here, but this was not previously considered: using sklearn
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode
accented_string = u'Málagueña®'
print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena
This is particularly useful if you are already using sklearn to process text. Those are the functions internally called by classes like CountVectorizer to normalize strings: when using strip_accents='ascii' then strip_accents_ascii is called and when strip_accents='unicode' is used, then strip_accents_unicode is called.
More details
Finally, consider those details from its docstring:
Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing
Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.
and
Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart
Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.
If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...
A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.
Here's an example from the page mentioned above:
from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'
EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however, unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.

How can I check if a Python unicode string contains non-Western letters?

I have a Python Unicode string. I want to make sure it only contains letters from the Roman alphabet (A through Z), as well as letters commonly found in European alphabets, such as ß, ü, ø, é, à, and î. It should not contain characters from other alphabets (Chinese, Japanese, Korean, Arabic, Cyrillic, Hebrew, etc.). What's the best way to go about doing this?
Currently I am using this bit of code, but I don't know if it's the best way:
def only_roman_chars(s):
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError: # encoding a unicode string raises an *encode* error
        return False
(I am using Python 2.5. I am also doing this in Django, so if the Django framework happens to have a way to handle such strings, I can use that functionality -- I haven't come across anything like that, however.)
import unicodedata as ud
latin_letters= {}
def is_latin(uchr):
    try: return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))
def only_roman_chars(unistr):
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha()) # isalpha suggested by John Machin
>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False
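For Python 3, the hand-rolled dictionary cache can be replaced with functools.lru_cache. The following is only a sketch of that substitution, not the answer's original code. It also passes a default to unicodedata.name so that an alphabetic character without a name is treated as non-Latin instead of raising ValueError.

```python
import unicodedata as ud
from functools import lru_cache


@lru_cache(maxsize=None)
def is_latin(ch):
    """True if the character's Unicode name mentions LATIN (cached per character)."""
    return "LATIN" in ud.name(ch, "")


def only_roman_chars(text):
    # Only alphabetic characters are checked; digits, spaces and punctuation pass.
    return all(is_latin(ch) for ch in text if ch.isalpha())


for sample in ["ελληνικά means greek", "frappé", "123 ångstrom ð áß", "russian: гага"]:
    print(sample, only_roman_chars(sample))
```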
The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).
pip install alphabet-detector
and then use it directly:
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True
Also, a few convenience methods for major languages:
ad.is_cyrillic(u"Поиск") #True
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
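The printable constant in the standard string module contains the ASCII letters, digits, punctuation and whitespace. You can remove these characters from the text, and if anything is left, the text contains non-ASCII characters (accented Latin letters included, as the first example shows). I did that: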
In [1]: from string import printable
In [2]: def is_latin(text):
...: return not bool(set(text) - set(printable))
...:
In [3]: is_latin('Hradec Králové District,,Czech Republic,')
Out[3]: False
In [4]: is_latin('Hradec Krlov District,,Czech Republic,')
Out[4]: True
I have no way to check all non-Latin characters and if anyone can do that, please let me know. Thanks.
For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might also allow cp1250 as well -- this would pick up eastern European countries like Poland, Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc where the alphabet is Latin-based. Other cp125x would include Turkish and Maltese ...
You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).
I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?
Checking for ISO-8859-1 would miss reasonable Western characters like 'œ' and '€'. The solution depends on how you define "Western", and how you want to handle non-letters. Here's one approach:
import unicodedata
def is_permitted_char(char):
    cat = unicodedata.category(char)[0]
    if cat == 'L': # Letter
        return 'LATIN' in unicodedata.name(char, '').split()
    elif cat == 'N': # Number
        # Only DIGIT ZERO - DIGIT NINE are allowed
        return '0' <= char <= '9'
    elif cat in ('S', 'P', 'Z'): # Symbol, Punctuation, or Space
        return True
    else:
        return False
def is_valid(text):
    return all(is_permitted_char(c) for c in text)
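The difference between the encoding check and the category-based check can be seen side by side. This Python 3 sketch restates the approach above in condensed form next to a hypothetical fits_latin1 helper; the sample strings are only illustrative.

```python
import unicodedata


def fits_latin1(text):
    """The question's approach: does the text encode as ISO-8859-1?"""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def is_permitted_char(char):
    major = unicodedata.category(char)[0]
    if major == "L":   # letters must be Latin
        return "LATIN" in unicodedata.name(char, "").split()
    if major == "N":   # only ASCII digits
        return "0" <= char <= "9"
    return major in ("S", "P", "Z")  # symbols, punctuation, separators


def is_valid(text):
    return all(is_permitted_char(c) for c in text)


for sample in ["café", "cœur", "5 €", "гага"]:
    print(sample, fits_latin1(sample), is_valid(sample))
```

Only "café" passes both checks; "cœur" and "5 €" fail the ISO-8859-1 test but are accepted by the category-based one, and the Cyrillic sample is rejected by both.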
Check the code in django.template.defaultfilters.slugify:
import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
This is what you are looking for; you can then compare the resulting string with the original.
Maybe this will do if you're a django user?
from django.template.defaultfilters import slugify
def justroman(s):
    return len(slugify(s)) == len(s)
To simplify tzot's answer using the built-in unicodedata library, this seems to work for me:
import unicodedata as ud
def is_latin(word):
    return all(['LATIN' in ud.name(c) for c in word])

Python: Split unicode string on word boundaries

I need to take a string, and shorten it to 140 characters.
Currently I am doing:
if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet) #normalize space
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140
So this works great for English and English-like strings, but fails for a Chinese string because tweet.split() just returns a list with a single element:
>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']
How should I do this so it handles I18N? Does this make sense in all languages?
I'm on python 2.5.4 if that matters.
Chinese doesn't usually have whitespace between words, and the symbols can have different meanings depending on context. You will have to understand the text in order to split it at a word boundary. In other words, what you are trying to do is not easy in general.
For word segmentation in Chinese, and other advanced natural language processing tasks, consider NLTK as a good starting point if not a complete solution. It's a rich Python-based toolkit, particularly good for learning about NLP techniques (and often good enough to offer a viable solution to some of these problems).
The re.U flag makes \s match according to the Unicode character properties database.
The given string, however, apparently doesn't contain any whitespace characters according to Python's Unicode database:
>>> x = u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> re.compile(r'\s+', re.U).split(x)
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']
I tried out the solution with PyAPNS for push notifications and just wanted to share what worked for me. The issue I had is that truncating at 256 bytes in UTF-8 would result in the notification getting dropped. I had to make sure the notification was encoded as "unicode_escape" to get it to work. I'm assuming this is because the result is sent as JSON and not raw UTF-8. Anyways here is the function that worked for me:
def unicode_truncate(s, length, encoding='unicode_escape'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')
After speaking with some native Cantonese, Mandarin, and Japanese speakers it seems that the correct thing to do is hard, but my current algorithm still makes sense to them in the context of internet posts.
Meaning, they are used to the "split on space and add … at the end" treatment.
So I'm going to be lazy and stick with it, until I get complaints from people that don't understand it.
The only change to my original implementation would be to not force a space after the last word, since it is unneeded in any language, and to use the single Unicode character … (U+2026) instead of three dots (...) to save 2 characters.
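A Python 3 sketch of the revised algorithm described above follows. The function name and signature are hypothetical, and it assumes the footer is much shorter than the limit. The hard cut for text without any usable space is an added assumption so that unsegmented Chinese text is not reduced to the ellipsis alone.

```python
import re

ELLIPSIS = "\u2026"  # single-character horizontal ellipsis


def shorten(text, footer="", limit=140):
    """Cut text at a space so that text + ellipsis + footer fits in limit."""
    text = re.sub(r"\s+", " ", text.strip())  # normalize whitespace
    if len(text) <= limit:
        return text
    tail = ELLIPSIS + (" " + footer if footer else "")
    avail = limit - len(tail)
    kept = []
    used = 0
    for word in text.split(" "):
        # A separating space is only needed between words, not after the last one.
        cost = len(word) + (1 if kept else 0)
        if used + cost > avail:
            break
        kept.append(word)
        used += cost
    if not kept:
        # No space to break at (e.g. unsegmented Chinese): fall back to a hard cut.
        return text[:avail] + tail
    return " ".join(kept) + tail


english = " ".join(["word"] * 50)
chinese = "简讯" * 100
print(len(shorten(english)), len(shorten(chinese, footer="abc")))
```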
Basically, in CJK (Except Korean with spaces), you need dictionary look-ups to segment words properly. Depending on your exact definition of "word", Japanese can be more difficult than that, since not all inflected variants of a word (i.e. "行こう" vs. "行った") will appear in the dictionary. Whether it's worth the effort depends upon your application.
This punts the word-breaking decision to the re module, but it may work well enough for you.
import re
def shorten(tweet, footer="", limit=140):
    """Break tweet into two pieces at roughly the last word break
    before limit.
    """
    lower_break_limit = limit / 2
    # limit under which to assume breaking didn't work as expected
    limit -= len(footer)
    tweet = re.sub(r"\s+", " ", tweet.strip())
    m = re.match(r"^(.{,%d})\b(?:\W|$)" % limit, tweet, re.UNICODE)
    if not m or m.end(1) < lower_break_limit:
        # no suitable word break found
        # cutting at an arbitrary location,
        # or if len(tweet) < lower_break_limit, this will be true and
        # returning this still gives the desired result
        return tweet[:limit] + footer
    return m.group(1) + footer
What you're looking for is Chinese word segmentation tools. Word segmentation is not an easy task and is currently not perfectly solved. There are several tools:
CkipTagger
Developed by Academia Sinica, Taiwan.
jieba
Developed by Sun Junyi, a Baidu engineer.
pkuseg
Developed by Language Computing and Machine Learning Group, Peking University
If what you want is character segmentation, it can be done, although it is not very useful.
>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> chars = list(s)
>>> chars
[u'\u7b80', u'\u8baf', u'\uff1a', u'\u65b0', u'\u83ef', u'\u793e', u'\u5831', u'\u9053', u'\uff0c', u'\u7f8e', u'\u570b', u'\u7e3d', u'\u7d71', u'\u5967', u'\u5df4', u'\u99ac', u'\u4e58', u'\u5750', u'\u7684', u'\u300c', u'\u7a7a', u'\u8ecd', u'\u4e00', u'\u865f', u'\u300d', u'\u5c08', u'\u6a5f', u'\u665a', u'\u4e0a', u'1', u'0', u'\u6642', u'4', u'2', u'\u5206', u'\u9032', u'\u5165', u'\u4e0a', u'\u6d77', u'\u7a7a', u'\u57df', u'\uff0c', u'\u9810', u'\u8a08', u'\u7d04', u'3', u'0', u'\u5206', u'\u9418', u'\u5f8c', u'\u62b5', u'\u9054', u'\u6d66', u'\u6771', u'\u570b', u'\u969b', u'\u6a5f', u'\u5834', u'\uff0c', u'\u958b', u'\u5c55', u'\u4ed6', u'\u4e0a', u'\u4efb', u'\u5f8c', u'\u9996', u'\u6b21', u'\u8a2a', u'\u83ef', u'\u4e4b', u'\u65c5', u'\u3002']
>>> print('/'.join(chars))
简/讯/:/新/華/社/報/道/,/美/國/總/統/奧/巴/馬/乘/坐/的/「/空/軍/一/號/」/專/機/晚/上/1/0/時/4/2/分/進/入/上/海/空/域/,/預/計/約/3/0/分/鐘/後/抵/達/浦/東/國/際/機/場/,/開/展/他/上/任/後/首/次/訪/華/之/旅/。
Save two characters and use an ellipsis (…, U+2026) instead of three dots!

What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the web an elegant way to do this (in Java):
convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
remove all the characters whose Unicode type is "diacritic".
Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?
Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.
Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.
Example:
>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'
How about this:
import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
This works on greek letters, too:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
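To make the "Mn" category and unicodedata.combining comparison concrete, here is a small sketch that applies both tests to the same input. The two helper names are hypothetical. The tests agree on these inputs, but they are not guaranteed to agree for every character.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
print(unicodedata.category(ACUTE))   # general category of the mark
print(unicodedata.combining(ACUTE))  # canonical combining class (0 means "not combining")

def strip_by_category(s):
    """Drop characters whose general category is Mn (Nonspacing_Mark)."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def strip_by_combining(s):
    """Drop characters with a non-zero canonical combining class."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

sample = "Fran\u00e7ois \u038e"  # "François" plus GREEK CAPITAL UPSILON WITH TONOS
print(strip_by_category(sample))
print(strip_by_combining(sample))
```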
I just found this answer on the Web:
import unicodedata
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii
It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
Edit: this does the trick:
import unicodedata
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it's a diacritic.
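As a quick comparison of the two versions above, the sketch below runs both on a French word and on a Greek letter. The names remove_accents_ascii and remove_accents_combining are made up here to keep the two variants apart, and the ASCII variant decodes its result so that both return text on Python 3.

```python
import unicodedata

def remove_accents_ascii(input_str):
    """First version: normalize, then drop everything that is not ASCII."""
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return nfkd_form.encode("ascii", "ignore").decode("ascii")

def remove_accents_combining(input_str):
    """Second version: normalize, then drop only combining characters."""
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

# "café" and GREEK CAPITAL LETTER UPSILON WITH TONOS
for word in ["caf\u00e9", "\u038e"]:
    print(repr(remove_accents_ascii(word)), repr(remove_accents_combining(word)))
```

Both variants give the same result for the French word, but the ASCII variant loses the Greek letter entirely while the combining variant keeps its base letter.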
Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)
I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user input.
Thanks to you, I have created this function that works wonders.
import re
import unicodedata
def strip_accents(text):
    """
    Strip accents from input String.
    :param text: The input string.
    :type text: String.
    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)
def text_to_id(text):
    """
    Convert input text to id.
    :param text: The input string.
    :type text: String.
    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text
result:
text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'
This handles not only accents, but also "strokes" (as in ø etc.):
import unicodedata as ud
def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass # removing "WITH ..." produced an invalid name
    return char
This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant indeed.
In fact, it's more of a hack, as pointed out in the comments, since Unicode names are really just names; they give no guarantee of being consistent.
There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.
EDIT NOTE:
Incorporated suggestions from the comments (handling lookup errors, Python-3 code).
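Since rmdiacritics works on one character at a time, applying it to a whole string might look like the sketch below. The wrapper rmdiacritics_str is hypothetical, and this variant passes a default to ud.name so that characters without a Unicode name (such as a newline) are returned unchanged instead of raising ValueError.

```python
import unicodedata as ud

def rmdiacritics(char):
    """Return the base character by cutting the Unicode name at ' WITH '."""
    desc = ud.name(char, "")  # "" for unnamed characters such as "\n"
    cutoff = desc.find(" WITH ")
    if cutoff != -1:
        try:
            return ud.lookup(desc[:cutoff])
        except KeyError:
            pass  # the shortened name is not a valid character name
    return char

def rmdiacritics_str(text):
    """Hypothetical helper: apply rmdiacritics to every character."""
    return "".join(rmdiacritics(c) for c in text)

print(rmdiacritics_str("S\u00f8ren \u0141ukasz"))
```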
In my view, the proposed solutions should NOT be accepted answers. The original question is asking for the removal of accents, so the correct answer should only do that, not that plus other, unspecified, changes.
Simply observe the result of this code from the accepted answer, where I have changed "Málaga" to "Málagueña":
accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'
There is an additional change (ñ -> n), which is not requested in the OQ.
A simple function that does the requested task, in lower form:
import re
def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new
gensim.utils.deaccent(text) from Gensim - topic modelling for humans:
'Sef chomutovskych komunistu dostal postou bily prasek'
Another solution is unidecode.
Note that the suggested solution with unicodedata typically removes accents only in some characters (e.g. it turns 'ł' into '', rather than into 'l').
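What happens to 'ł' depends on which unicodedata variant is used, as the following sketch suggests. The sample word is arbitrary. 'ł' has no decomposition, so normalization leaves it as a single character.

```python
import unicodedata

text = "\u0142\u00f3d\u017a"  # "łódź"
nfkd = unicodedata.normalize("NFKD", text)

# Variant 1: drop combining characters only; 'ł' is kept as it is.
kept = "".join(c for c in nfkd if not unicodedata.combining(c))

# Variant 2: drop everything that is not ASCII; 'ł' disappears.
ascii_only = nfkd.encode("ascii", "ignore").decode("ascii")

print(kept)
print(ascii_only)
```

Neither variant turns 'ł' into 'l', which is where a transliteration library such as unidecode differs.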
In response to @MiniQuark's answer:
I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats.
As a test, I created a test.txt file that looked like this:
Montréal, über, 12.89, Mère, Françoise, noël, 889
I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba's comment:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata
def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)
The result:
Montreal
uber
12.89
Mere
Francoise
noel
889
(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)
import unicodedata
from random import choice
import perfplot
import regex
import text_unidecode
def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))
def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])
perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)
Some languages use combining diacritics as part of their letters, and separate accent diacritics to mark stress.
I think it is safer to specify explicitly which diacritics you want to strip:
import unicodedata
def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
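A short usage sketch shows the effect of the explicit list: marks that are listed get stripped, while marks that are not listed (the breve in Cyrillic "й", the diaeresis in "ü") survive the round trip. The function is repeated so the snippet runs on its own; the sample inputs are arbitrary.

```python
import unicodedata

def strip_accents(string, accents=("COMBINING ACUTE ACCENT",
                                   "COMBINING GRAVE ACCENT",
                                   "COMBINING TILDE")):
    marks = set(map(unicodedata.lookup, accents))
    # Decompose, drop only the listed marks, then recompose the rest.
    chars = [c for c in unicodedata.normalize("NFD", string) if c not in marks]
    return unicodedata.normalize("NFC", "".join(chars))

print(strip_accents("caf\u00e9"))  # acute accent is listed
print(strip_accents("\u0439"))     # Cyrillic short i: breve is not listed
print(strip_accents("\u00fc"))     # u with diaeresis: not listed
```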
Here is a short function which strips the diacritics, but keeps the non-latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.
Code
from unicodedata import combining, normalize
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))
NB. The default argument outliers is evaluated once and not meant to be provided by the caller.
Intended usage
As a key to sort a list of strings in a more “natural” order:
sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)
Output:
['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
If your strings mix texts and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
Tests
To make sure the behavior meets your needs, take a look at the pangrams below:
examples = [
("hello, world", "hello, world"),
("42", "42"),
("你好,世界", "你好,世界"),
(
"Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
"des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
),
(
"Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
"falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
),
(
"Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
"љубазни фењерџија чађавог лица хоће да ми покаже штос.",
),
(
"Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
"ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
),
(
"Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
"quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
),
(
"Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
"kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
),
(
"Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
"glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
)
]
for (given, expected) in examples:
    assert remove_diacritics(given) == expected
Case-preserving variant
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü Ä Æ Ǽ Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö Œ ẞ Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))
There are already many answers here, but this was not previously considered: using sklearn
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode
accented_string = u'Málagueña®'
print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena
This is particularly useful if you are already using sklearn to process text. Those are the functions internally called by classes like CountVectorizer to normalize strings: when using strip_accents='ascii' then strip_accents_ascii is called and when strip_accents='unicode' is used, then strip_accents_unicode is called.
More details
Finally, consider those details from its docstring:
Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing
Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.
and
Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart
Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.
If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...
A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.
Here's an example from the page mentioned above:
from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'
EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.
