Regex - match a character and all its diacritic variations (aka accent-insensitive) - python

I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do of course is:
re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é")
but that is not a general solution. If I use unicode categories like \pL I can't reduce the match to a specific character, in this case e.

A workaround to achieve the desired goal would be to use unidecode to get rid of all diacritics first, and then just match agains the regular e
re.match(r"^e$", unidecode("é"))
Or in this simplified case
unidecode("é") == "e"
Another solution which doesn't depend on the unidecode-library, preserves unicode and gives more control is manually removing the diacritics as follows:
Use unicodedata.normalize() to turn your input string into normal form D (for decomposed), making sure composite characters like é get turned into the decomposite form e\u301 (e + COMBINING ACUTE ACCENT)
>>> input = "Héllô"
>>> input
'Héllô'
>>> normalized = unicodedata.normalize("NFKD", input)
>>> normalized
'He\u0301llo\u0302'
Then, remove all codepoints which fall into the category Mark, Nonspacing (short Mn). Those are all characters that have no width themselves and just decorate the previous character.
Use unicodedata.category() to determine the category.
>>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
>>> stripped
'Hello'
The result can be used as a source for regex-matching, just as in the unidecode-example above.
Here's the whole thing as a function:
def remove_diacritics(text):
"""
Returns a string with all diacritics (aka non-spacing marks) removed.
For example "Héllô" will become "Hello".
Useful for comparing strings in an accent-insensitive fashion.
"""
normalized = unicodedata.normalize("NFKD", text)
return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

Related

How to convert formatted text to the plain text [duplicate]

Is there a standard way, in Python, to normalize a unicode string, so that it only comprehends the simplest unicode entities that can be used to represent it ?
I mean, something which would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE'] ?
See where is the problem:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "á"
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the chars and do manual replacements, etc., but it is not efficient, and I'm pretty sure I would miss half of the special cases, and do mistakes.
The unicodedata module offers a .normalize() function, you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER - U+0301 A COMBINING ACUTE ACCENT combination and U+00E1 LATIN SMALL LETTER A WITH ACUTE code points you used:
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'
(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear).
NFC, or 'Normal Form Composed' returns composed characters, NFD, 'Normal Form Decomposed' gives you decomposed, combined characters.
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form.
Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:
>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'
Note that there is no guarantee that composed and decomposed forms are commutative; normalizing a combined character to NFC form, then converting the result back to NFD form does not always result in the same character sequence. The Unicode standard maintains a list of exceptions; characters on this list are composable, but not decomposable back to their combined form, for various reasons. Also see the documentation on the Composition Exclusion Table.
Yes, there is.
unicodedata.normalize(form, unistr)
You need to select one of the four normalization forms.

string matching with interchangeable characters

I am trying to do a simple string matching between two strings, a small string to a bigger string. The only catch is that I want to equate two characters in the small string to be the same. In particular if there is a character 'I' or a character 'L' in the smaller string, then I want it to be considered interchangeably.
For example let's say my small string is
s = 'AKIIMP'
and then the bigger string is:
b = 'MPKGEXAKILMP'
I want to write a function that will take the two strings and checks if the smaller one is in the big one. In this particular example even though the smaller string s is not a substring in b because there is no exact match, however in my case it should match with it because like I mentioned characters 'I' and 'L' would be used interchangeably and therefore the result should find a match.
Any idea of how I could proceed with this?
s.replace('I', 'L') in b.replace('I', 'L')
will evaluate to True in your example.
You could do it with regular expressions:
import re
s = 'AKIIMP'
b = 'MPKGEXAKILMP'
p = re.sub('[IL]', '[IL]', s)
if re.search(p, b):
print(f'{s!r} is in {b!r}')
else:
print('Not found')
This is not as elegant as #Deepstop's answer, but it provides a bit more flexibility in terms of what characters you equate.

How to clean the tweets having a specific but varying length pattern?

I pulled out some tweets for analysis. When I separate the words in tweets I can see a lot of following expressions in my output:
\xe3\x81\x86\xe3\x81\xa1
I want to use regular expressions to replace these patterns with nothing. I am not very good with regex. I tried using solution in some similar questions but nothing worked for me. They are replacing characters like "xt" from "extra".
I am looking for something that will replace \x?? with nothing, considering ?? can be either a-f or 0-9 but word must be 4 letter and starting with \x.
Also i would like to add replacement for anything other than alphabets. Like:
"Hi!! my number is (7097868709809)."
after replacement should yield
"Hi my number is."
Input:
\xe3\x81\x86\xe3Extra
Output required:
Extra
What you are seeing is Unicode characters that can't directly be printed, expressed as pairs of hexadecimal digits. So for a more printable example:
>>> ord('a')
97
>>> hex(97)
'0x61'
>>> "\x61"
'a'
Note that what appears to be a sequence of four characters '\x61' evaluates to a single character, 'a'. Therefore:
?? can't "be anything" - they can be '0'-'9' or 'a'-'f'; and
Although e.g. r'\\x[0-9a-f]{2}' would match the sequence you see, that's not what the regex would parse - each "word" is really a single character.
You can remove the characters "other than alphabets" using e.g. string.printable:
>>> s = "foo\xe3\x81"
>>> s
'foo\xe3\x81'
>>> import string
>>> valid_chars = set(string.printable)
>>> "".join([c for c in s if c in valid_chars])
'foo'
Note that e.g. '\xe3' can be directly printed in Python 3 (it's 'ã'), but isn't included in string.printable. For more on Unicode in Python, see the docs.

string.translate() with unicode data in python

I have 3 API's that return json data to 3 dictionary variables. I am taking some of the values from the dictionary to process them. I read the specific values that I want to the list valuelist. One of the steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this process but because the dictionary data is unicode I get the error:
wordlist = [s.translate(None, string.punctuation)for s in valuelist]
TypeError: translate() takes exactly one argument (2 given)
Is there a way around this? Either by encoding the unicode or a replacement for string.translate?
The translate method work differently on Unicode objects than on byte-string objects:
>>> help(unicode.translate)
S.translate(table) -> unicode
Return a copy of the string S, where all characters have been mapped
through the given translation table, which must be a mapping of
Unicode ordinals to Unicode ordinals, Unicode strings or None.
Unmapped characters are left untouched. Characters mapped to None
are deleted.
So your example would become:
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
word_list = [s.translate(remove_punctuation_map) for s in value_list]
Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
I noticed that string.translate is deprecated. Since you are removing punctuation, not actually translating characters, you can use the re.sub function.
>>> import re
>>> s1="this.is a.string, with; (punctuation)."
>>> s1
'this.is a.string, with; (punctuation).'
>>> re.sub("[\.\t\,\:;\(\)\.]", "", s1, 0, 0)
'thisis astring with punctuation'
>>>
In this version you can relatively make one's letters to other
def trans(to_translate):
tabin = u'привет'
tabout = u'тевирп'
tabin = [ord(char) for char in tabin]
translate_table = dict(zip(tabin, tabout))
return to_translate.translate(translate_table)
Python re module allows to use a function as a replacement argument, which should take a Match object and return a suitable replacement. We may use this function to build a custom character translation function:
import re
def mk_replacer(oldchars, newchars):
"""A function to build a replacement function"""
mapping = dict(zip(oldchars, newchars))
def replacer(match):
"""A replacement function to pass to re.sub()"""
return mapping.get(match.group(0), "")
return replacer
An example. Match all lower-case letters ([a-z]), translate 'h' and 'i' to 'H' and 'I' respectively, delete other matches:
>>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
'HI'
As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.
A Unicode example:
>>> re.sub("[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
u'PRIVET'
As I stumbled upon the same problem and Simon's answer was the one that helped me to solve my case, I thought of showing an easier example just for clarification:
from collections import defaultdict
And then for the translation, say you'd like to remove '#' and '\r' characters:
remove_chars_map = defaultdict()
remove_chars_map['#'] = None
remove_chars_map['\r'] = None
new_string = old_string.translate(remove_chars_map)
And an example:
old_string = "word1#\r word2#\r word3#\r"
new_string = "word1 word2 word3"
'#' and '\r' removed

Stripping non printable characters from a string in python

I use to run
$s =~ s/[^[:print:]]//g;
on Perl to get rid of non printable characters.
In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.
What would you do?
EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output.
curses.ascii.isprint will return false for any unicode character.
Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
import unicodedata, re, itertools, sys
all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
return control_char_re.sub('', s)
For Python2
import unicodedata, re, sys
all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
return control_char_re.sub('', s)
For some use-cases, additional categories (e.g. all from the control group might be preferable, although this might slow down the processing time and increase memory usage significantly. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
Edit Adding suggestions from the comments.
As far as I know, the most pythonic/efficient method would be:
import string
filtered_string = filter(lambda x: x in string.printable, myStr)
You could try setting up a filter using the unicodedata.category() function:
import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
return ''.join(c for c in str if unicodedata.category(c) in printable)
See Table 4-9 on page 175 in the Unicode database character properties for the available categories
The following will work with Unicode input and is rather fast...
import sys
# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}
def make_printable(s):
"""Replace non-printable characters in a string."""
# the translate method on str removes characters
# that map to None from the string
return s.translate(NOPRINT_TRANS_TABLE)
assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''
My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
In Python 3,
def filter_nonprintable(text):
import itertools
# Use characters of control category
nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
# Use translate to remove all non-printable characters
return text.translate({character:None for character in nonprintable})
See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()
The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by #Ants Aasma.
This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):
from curses.ascii import isprint
def printable(input):
return ''.join(char for char in input if isprint(char))
Yet another option in python 3:
re.sub(f'[^{re.escape(string.printable)}]', '', my_string)
Based on #Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:
import unicodedata
def filter_non_printable(s):
return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))
The best I've come up with now is (thanks to the python-izers above)
def filter_non_printable(str):
return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])
This is the only way I've found out that works with Unicode characters/strings
Any better options?
In Python there's no POSIX regex classes
There are when using the regex library: https://pypi.org/project/regex/
It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.
From the documentation:
[[:alpha:]]; [[:^alpha:]]
POSIX character classes are supported. These
are normally treated as an alternative form of \p{...}.
(I'm not affiliated, just a user.)
An elegant pythonic solution to stripping 'non printable' characters from a string in python is to use the isprintable() string method together with a generator expression or list comprehension depending on the use case ie. size of the string:
''.join(c for c in my_string if c.isprintable())
str.isprintable()
Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
The one below performs faster than the others above. Take a look
''.join([x if x in string.printable else '' for x in Str])
Adapted from answers by Ants Aasma and shawnrad:
nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
return text.translate(ord_dict)
#use
str = "this is my string"
str = filter_nonprintable(str)
print(str)
tested on Python 3.7.7
To remove 'whitespace',
import re
t = """
\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))
Error description
Run the copied and pasted python code report:
Python invalid non-printable character U+00A0
The cause of the error
The space in the copied code is not the same as the format in Python;
Solution
Delete the space and re-enter the space. For example, the red part in the above picture is an abnormal space. Delete and re-enter the space to run;
Source : Python invalid non-printable character U+00A0
I used this:
import sys
import unicodedata
# the test string has embedded characters, \u2069 \u2068
test_string = """"ABC⁩.⁨ 6", "}"""
nonprintable = list((ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if
unicodedata.category(c) in ['Cc','Cf']))
translate_dict = {character: None for character in nonprintable}
print("Before translate, using repr()", repr(test_string))
print("After translate, using repr()", repr(test_string.translate(translate_dict)))

Categories

Resources