Understanding unistr of unicodedata.normalize() - python

Wikipedia basically says the following about the four values of the form argument (the first parameter of normalize(); unistr is simply the string being normalized).
- NFC (Normalization Form Canonical Composition)
  - Characters are decomposed,
  - then recomposed by canonical equivalence.
- NFKC (Normalization Form Compatibility Composition)
  - Characters are decomposed by compatibility,
  - then recomposed by canonical equivalence.
- NFD (Normalization Form Canonical Decomposition)
  - Characters are decomposed by canonical equivalence,
  - and multiple combining characters are arranged in a specific order.
- NFKD (Normalization Form Compatibility Decomposition)
  - Characters are decomposed by compatibility,
  - and multiple combining characters are arranged in a specific order.
So each of the choices is a two-step transform? But normalize() only shows the final result. Is there a way to see the intermediate results?
Wikipedia also says:
For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
But I cannot reproduce it with normalize(). Could anybody provide more examples of how these four options work? What are their differences?
>>> from unicodedata import normalize
>>> print(normalize('NFC', 'A°'))
A°
>>> print(normalize('NFKC', 'A°'))
A°
>>> print(normalize('NFD', 'Å'))
Å
>>> print(normalize('NFKD', 'Å'))
Å
>>> len(normalize('NFC', 'A°'))
2
>>> len(normalize('NFKC', 'A°'))
2
>>> len(normalize('NFD', 'Å'))
2
>>> len(normalize('NFKD', 'Å'))
2

I am asking for examples to show the difference among the four cases.
Try the following script (Python 3):
# -*- coding: utf-8 -*-
import sys
from unicodedata import normalize

def encodeuni(s):
    '''
    Returns input string encoded to escape sequences as in a string literal.
    Output is similar to
        str(s.encode('unicode_escape')).strip("b'").replace('\\\\','\\')
    but even every ASCII character is encoded as a \\xNN escape sequence
    (except a space character). For instance:
        s = 'A á ř 🌈'
        encodeuni(s)  # '\\x41 \\xe1 \\u0159 \\U0001f308' while
        str(s.encode('unicode_escape')).strip("b'").replace('\\\\','\\')
                      # 'A \\xe1 \\u0159 \\U0001f308'
    '''
    def encodechar(ch):
        ordch = ord(ch)
        return ( ch if ordch == 0x20 else
                 f"\\x{ordch:02x}" if ordch <= 0xFF else
                 f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                 f"\\U{ordch:08x}" )
    return ''.join([encodechar(ch) for ch in s])

if len(sys.argv) >= 2:
    letters = ' '.join([sys.argv[i] for i in range(1, len(sys.argv))])
    # .\SO\59979037.py ÅÅÅ🌈
else:
    letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
    # \u212B     Å  Angstrom Sign
    # \u00C5     Å  Latin Capital Letter A With Ring Above
    # \u0041     A  Latin Capital Letter A
    # \u030A     ̊  Combining Ring Above
    # \U0001f308 🌈 Rainbow

print('\t'.join(['raw', letters.ljust(10), str(len(letters)), encodeuni(letters), '\n']))

for form in ['NFC', 'NFKC', 'NFD', 'NFKD']:
    letnorm = normalize(form, letters)
    print('\t'.join([form, letnorm.ljust(10), str(len(letnorm)), encodeuni(letnorm)]))
Output shows a feature/bug in WebKit browsers (Chrome, Safari): they normalize form data to NFC. (The output below is from a Windows 10 cmd prompt, font DejaVu Sans Mono.)
.\SO\59979037.py ÅÅÅ
raw ÅÅÅ 4 \u212b\xc5\x41\u030a
NFC ÅÅÅ 3 \xc5\xc5\xc5
NFKC ÅÅÅ 3 \xc5\xc5\xc5
NFD ÅÅÅ 6 \x41\u030a\x41\u030a\x41\u030a
NFKD ÅÅÅ 6 \x41\u030a\x41\u030a\x41\u030a
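Incidentally, the attempts in the question most likely fail to compose because a spacing character (such as U+00B0 DEGREE SIGN) was typed instead of U+030A COMBINING RING ABOVE; with the real combining mark the Wikipedia round trip works as described (a quick check):
>>> from unicodedata import normalize
>>> print(ascii(normalize('NFD', '\u212b')))        # ANGSTROM SIGN decomposes
'A\u030a'
>>> print(ascii(normalize('NFC', '\u0041\u030a')))  # A + COMBINING RING ABOVE recomposes
'\xc5'
>>> print(ascii(normalize('NFC', 'A\u00b0')))       # DEGREE SIGN is not a combining mark
'A\xb0'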

Related

Printing characters that are not in ASCII 0-127 (Python 3)

So, I've got an algorithm whereby I take a character, take its character code, increase that code by a variable, and then print that new character. However, I'd also like it to work for characters not in the default ASCII table. Currently it's not printing 'special' characters like € (for example). How can I make it print certain special characters?
#!/usr/bin/python3
# -*- coding: utf-8 -*-
def generateKey(name):
    i = 0
    result = ""
    for char in name:
        newOrd = ord(char) + i
        newChar = chr(newOrd)
        print(newChar)
        result += newChar
        i += 1
    print("Serial key for name: ", result)

generateKey(input("Enter name: "))
Whenever I give an input that forces special characters (like |||||), it works fine for the first four characters (including DEL, which shows up as a transparent rectangle), but the fifth character (meant to be €) comes out as an error character, which is not what I want. How can I fix this?
Here's the output from |||||:
Enter name: |||||
|
}
~
Serial key for name: |}~
But the last char should be €, not a blank. (BTW the fourth char, DEL, becomes a transparent rectangle when I copy it into Windows)
In Unicode (which is what chr() produces, regardless of the default utf-8 encoding), chr(128) is U+0080, a control character, not the euro symbol. So indeed it should be blank, not €.
You can verify the default encoding with sys.getdefaultencoding().
If you want to reinterpret chr(128) as the euro symbol, you should use the windows-1252 encoding. There, it is indeed the euro symbol. (Different encodings disagree on how to represent values beyond ASCII's 0–127.)
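For instance, a minimal sketch (the helper name is hypothetical) of how the key generator could route shifted values through windows-1252:
def shift_char(char, offset):
    new_ord = ord(char) + offset
    if new_ord < 128:
        return chr(new_ord)
    # Interpret the value as a windows-1252 byte instead of a Unicode code
    # point (where 0x80-0x9F are control characters). This only covers values
    # up to 255, and a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined
    # in windows-1252 and would raise UnicodeDecodeError.
    return bytes([new_ord]).decode('windows-1252')

print(shift_char('|', 4))  # €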

How to convert formatted text to the plain text [duplicate]

Is there a standard way, in Python, to normalize a unicode string so that it comprises only the simplest unicode entities that can be used to represent it?
I mean, something which would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE'] ?
Here is the problem:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "á"
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the chars and do manual replacements, etc., but it is not efficient, and I'm pretty sure I would miss half of the special cases, and do mistakes.
The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A plus U+0301 COMBINING ACUTE ACCENT combination and the U+00E1 LATIN SMALL LETTER A WITH ACUTE code point you used:
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'
(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear).
NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you the decomposed form, using combining characters.
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form.
Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:
>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'
Note that there is no guarantee that decomposition and composition round-trip; decomposing a character with NFD and recomposing with NFC does not always give back the original character. The Unicode standard maintains a list of exceptions; characters on this list are decomposable, but not recomposable back to their combined form, for various reasons. Also see the documentation on the Composition Exclusion Table.
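For instance, U+2126 OHM SIGN has a singleton decomposition to U+03A9 GREEK CAPITAL LETTER OMEGA and sits on that exclusion list, so no normalization form ever produces OHM SIGN again (a small illustration):
>>> print(ascii(unicodedata.normalize('NFD', '\u2126')))  # OHM SIGN decomposes...
'\u03a9'
>>> print(ascii(unicodedata.normalize('NFC', '\u03a9')))  # ...but never recomposes
'\u03a9'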
Yes, there is.
unicodedata.normalize(form, unistr)
You need to select one of the four normalization forms.
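For example, a minimal call using the combining sequence from the question:
>>> import unicodedata
>>> unicodedata.normalize('NFC', 'a\u0301')
'á'
>>> unicodedata.name(unicodedata.normalize('NFC', 'a\u0301'))
'LATIN SMALL LETTER A WITH ACUTE'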

Remove Emoji's from multilingual Unicode text

I'm trying to remove just the emoji from Unicode text. I tried the various methods described in another Stack Overflow post, but none of them removes all of the emojis / smileys completely. For example:
Solution 1:
def remove_emoji(self, string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
Leaves in 🤝 in the following example:
Input: తెలంగాణ రియల్ ఎస్టేట్ 🤝👍
Output: తెలంగాణ రియల్ ఎస్టేట్ 🤝
Another attempt, solution 2:
def deEmojify(self, inputString):
    returnString = ""
    for character in inputString:
        try:
            character.encode("ascii")
            returnString += character
        except UnicodeEncodeError:
            returnString += ''
    return returnString
Results in removing any non-English character:
Input: 🏣Testరియల్ ఎస్టేట్ A.P&T.S. 🤝🏩🏣👍
Output: Test A.P&T.S.
It removes not only all of the emoji but also the non-English characters, because of character.encode("ascii"); my non-English inputs cannot be encoded into ASCII.
Is there any way to properly remove Emoji from international Unicode text?
The regex is outdated. It appears to cover Emoji defined up to Unicode 8.0 (since U+1F91D HANDSHAKE was added in Unicode 9.0). The other approach is just a very inefficient method of force-encoding to ASCII, which is rarely what you want when just removing Emoji (and can be much more easily and efficiently achieved with text.encode('ascii', 'ignore').decode('ascii')).
If you need a more up-to-date regex, take one from a package that is actively trying to keep up-to-date on Emoji; it specifically supports generating such a regex:
import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)
The package is currently up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade along when there is a new release.
Demo using your sample inputs:
>>> print(remove_emoji(u'తెలంగాణ రియల్ ఎస్టేట్ 🤝👍'))
తెలంగాణ రియల్ ఎస్టేట్
>>> print(remove_emoji(u'🏣Testరియల్ ఎస్టేట్ A.P&T.S. 🤝🏩🏣👍'))
Testరియల్ ఎస్టేట్ A.P&T.S.
Note that the regex works on Unicode text; for Python 2, make sure you have decoded from str to unicode, and for Python 3, from bytes to str first.
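For example, in Python 3 (a minimal sketch using the remove_emoji() defined above; the byte string is just hypothetical UTF-8 input):
raw = b'\xf0\x9f\xa4\x9d hello'   # UTF-8 bytes containing U+1F91D HANDSHAKE
text = raw.decode('utf-8')        # bytes -> str before matching
print(remove_emoji(text))         # the emoji is stripped, the text remains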
Emoji are complex beasts these days. The above will remove complete, valid Emoji. If you have 'incomplete' Emoji components such as skin-tone codepoints (meant to be combined with specific Emoji only) then you'll have much more trouble removing those. The skin-tone codepoints are easy (just remove those 5 codepoints afterwards), but there is a whole host of combinations that are made up of innocent characters such as ♀ U+2640 FEMALE SIGN or ♂ U+2642 MALE SIGN together with variant selectors and the U+200D ZERO-WIDTH JOINER that have specific meaning in other contexts too, and you can't just regex those out, not unless you don't mind breaking text using Devanagari, or Kannada or CJK ideographs, to name just a few examples.
That said, the following Unicode 11.0 codepoints are probably safe to remove (based on filtering the Emoji_Component Emoji-data designation):
20E3 ; (⃣) combining enclosing keycap
FE0F ; () VARIATION SELECTOR-16
1F1E6..1F1FF ; (🇦..🇿) regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF ; (🏻..🏿) light skin tone..dark skin tone
1F9B0..1F9B3 ; (🦰..🦳) red-haired..white-haired
E0020..E007F ; (󠀠..󠁿) tag space..cancel tag
which can be removed by creating a new regex to match those:
import re

try:
    uchr = unichr  # Python 2
    import sys
    if sys.maxunicode == 0xffff:
        # narrow build, define alternative unichr encoding to surrogate pairs
        # as unichr(sys.maxunicode + 1) fails.
        def uchr(codepoint):
            return (
                unichr(codepoint) if codepoint <= sys.maxunicode else
                unichr(codepoint - 0x010000 >> 10 | 0xD800) +
                unichr(codepoint & 0x3FF | 0xDC00)
            )
except NameError:
    uchr = chr  # Python 3

# Unicode 11.0 Emoji Component map (deemed safe to remove)
_removable_emoji_components = (
    (0x20E3, 0xFE0F),             # combining enclosing keycap, VARIATION SELECTOR-16
    range(0x1F1E6, 0x1F1FF + 1),  # regional indicator symbol letter a..regional indicator symbol letter z
    range(0x1F3FB, 0x1F3FF + 1),  # light skin tone..dark skin tone
    range(0x1F9B0, 0x1F9B3 + 1),  # red-haired..white-haired
    range(0xE0020, 0xE007F + 1),  # tag space..cancel tag
)
emoji_components = re.compile(u'({})'.format(u'|'.join([
    re.escape(uchr(c)) for r in _removable_emoji_components for c in r])),
    flags=re.UNICODE)
then update the above remove_emoji() function to use it:
def remove_emoji(text, remove_components=False):
    cleaned = emoji.get_emoji_regexp().sub(u'', text)
    if remove_components:
        cleaned = emoji_components.sub(u'', cleaned)
    return cleaned
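Usage might look like this (a small sketch; it assumes the emoji import and the remove_emoji() / emoji_components defined above, and exact results depend on the emoji package version):
# the component regex on its own removes e.g. a stray skin-tone modifier
print(emoji_components.sub(u'', u'ok \U0001F3FB'))   # -> 'ok ' (modifier stripped)
# the combined function removes full emoji first, then leftover components
print(remove_emoji(u'\U0001F91D\U0001F3FB ok', remove_components=True))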
emoji.get_emoji_regexp() is outdated; it has since been removed from newer versions of the emoji package.
If you want to remove emoji from strings, you can use emoji.replace_emoji() as shown in the examples below.
import emoji

def remove_emoji(string):
    return emoji.replace_emoji(string, '')
Visit https://carpedm20.github.io/emoji/docs/api.html#emoji.replace_emoji
If you use the regex library instead of the re library, you get access to Unicode properties and character-class set operations, and you can change your function to:
import regex

def remove_emoji(self, string):
    # \p{L} letters, \p{N} numbers, \p{Z} separators, \p{M} marks (accents);
    # the && set intersection requires the regex module's V1 behaviour
    emoji_pattern = regex.compile(r"[\P{L}&&\P{N}&&\P{Z}&&\P{M}]", flags=regex.V1)
    return emoji_pattern.sub(r'', string)
This keeps all letters, numbers, separators and marks (accents) and removes everything else.

Regex - match a character and all its diacritic variations (aka accent-insensitive)

I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do of course is:
re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é")
but that is not a general solution. If I use unicode categories like \pL I can't reduce the match to a specific character, in this case e.
A workaround to achieve the desired goal would be to use unidecode to get rid of all diacritics first, and then just match against the regular e:
re.match(r"^e$", unidecode("é"))
Or in this simplified case
unidecode("é") == "e"
Another solution, which doesn't depend on the unidecode library, preserves unicode, and gives more control, is to manually remove the diacritics as follows:
Use unicodedata.normalize() to turn your input string into normal form D (for decomposed), making sure composite characters like é get turned into the decomposed form e\u0301 (e + COMBINING ACUTE ACCENT):
>>> input = "Héllô"
>>> input
'Héllô'
>>> normalized = unicodedata.normalize("NFKD", input)
>>> normalized
'He\u0301llo\u0302'
Then, remove all codepoints which fall into the category Mark, Nonspacing (short Mn). Those are all characters that have no width themselves and just decorate the previous character.
Use unicodedata.category() to determine the category.
>>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
>>> stripped
'Hello'
The result can be used as a source for regex-matching, just as in the unidecode-example above.
Here's the whole thing as a function:
import unicodedata

def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in future, utf8 might support it as well.
But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.
My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?
I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.
In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.
I DON'T want to escape the character before storing it in MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and infeasible.
See also:
"Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
‘𠂉’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
[EDIT] Added tests about the proposed solutions
So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick testing to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
    unichr(random.randrange(32,
        0x10ffff if random.randrange(100) > normal_chars else 0x0fff
    )) for i in xrange(string_size))

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '=' * 10 + ' filter_using_re() ' + '=' * 10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '=' * 10 + ' filter_using_python() ' + '=' * 10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)
I did no test using itertools because... well... that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogates. I do not know Python well, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile(u"[\uD800-\uDFFF].", re.UNICODE)  # matches a surrogate code unit plus the following character
pattern = re.compile(u"[^\u0000-\uFFFF]", re.UNICODE)  # matches any code point above U+FFFF
Edit adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
You may skip the decoding and encoding steps and directly detect the value of the first byte (8-bit string) of each character. According to UTF-8:
#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you only need to check the value of only the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert the immutable string to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time and for smaller strings that mostly have 1-byte characters this could even be faster than the regular expression filtering.
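For instance, a usage sketch (Python 2, like the snippet above; note that the function operates on an already UTF-8-encoded byte string, not on a unicode object):
encoded = u'foo \U0001F308 bar'.encode('utf-8')    # the rainbow is a 4-byte sequence
print filter_4byte_chars(encoded).decode('utf-8')  # u'foo  bar' (rainbow removed)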
And just for the fun of it, an itertools monstrosity :)
import itertools as it, functools as ft, operator as op

def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs = it.izip(unicode_string, it.repeat(u'\ufffd'))
    # is 65535 less than or equal to the character's ordinal (i.e. should it be replaced)?
    selector = ft.partial(op.le, 65535)
    # using the character ordinals, return 0 or 1 based on `selector`
    indexer = it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that the Unicode standard 5.2 chapter 3 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93 """Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.""" However this proscription is as far as I know largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u'\ufffd':
u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
)
I'm guessing it's not the fastest, but quite straightforward (“pythonic” :) :
def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
NB: this code does not take into account the fact that Unicode has surrogate characters in the ranges U+D800-U+DFFF.
This does more than filter out just the 3+ byte UTF-8 characters. It removes non-ASCII characters altogether, but tries to do so gently, replacing each one with a relevant ASCII character where possible. It can be a blessing later on if your text ends up containing only regular ASCII apostrophes and quotation marks instead of a dozen different unicode apostrophe and quotation-mark variants (usually coming from Apple handhelds).
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust, I use it with some more guards:
import unicodedata

def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible

    Args:
        value (string): input string, can contain unicode characters

    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts (for example en-dash and em-dash with regular dash,
        apostrophe and quotation variations with the standard ones) or taken
        out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value

    if isinstance(value, str):
        return value

    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2 BTW.
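A quick illustration of the core idea (Python 2, as noted above; accented letters and compatibility characters such as the fi ligature are reduced to ASCII, while anything without a decomposition is simply dropped):
>>> import unicodedata
>>> unicodedata.normalize("NFKD", u'ﬁancé résumé').encode("ascii", "ignore")
'fiance resume'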
