Strip unicode character modifiers

Strip unicode character modifiers - python

What is the simplest way to strip the character modifiers from a unicode string in Python?
For example:
A͋͠r͍̞̫̜͌ͦ̈́͐ͅt̼̭͞h́u̡̙̞̘̙̬͖͓rͬͣ̐ͮͥͨ̀͏̣ should become Arthur
I tried the docs but I couldn't find anything that does this.

Try this
import unicodedata
a = u"STRING GOES HERE" # using an actual string would break stackoverflow's code formatting.
u"".join( x for x in a if not unicodedata.category(x).startswith("M") )
This will remove all characters classified as marks, which is what I think you want. In general, you can get the category of a character with unicodedata.category.

You could also use r'\p{M}' that is supported by regex module:
import regex
def remove_marks(text):
return regex.sub(ur"\p{M}+", "", text)
Example:
>>> print s
A͋͠r͍̞̫̜t̼̭͞h́u̡̙̞̘rͬͣ̐ͮ
>>> def remove_marks(text):
... return regex.sub(ur"\p{M}+", "", text)
...
...
>>> print remove_marks(s)
Arthur
Depending on your use-case a whitelist approach might be better e.g., to limit the input only to ascii characters:
>>> s.encode('ascii', 'ignore').decode('ascii')
u'Arthur'
The result might depend on Unicode normalization used in the text.

Related

Leave only alphanumeric symbols in string in Python?

I am using Python 2.7. On SO I found the following regexp for removing non-word characters:
pat = re.compile('[\W]+', re.UNICODE)
I wrote the next function:
def leave_only_alphanumeric(string):
pat = re.compile('[\W]+', re.UNICODE)
return re.sub(pat,' ',string)
Though on the following string:
kr\xc3\xa9m
it produces the wrong result:
kr\xc3 m
\xa9 was deleted from the string, but should not have been.

You are confusing unicode codepoints and the utf-8 encoding.
The letter you are trying to handle is é, code point u00e9.
It is encoded in utf-8 as two bytes, 0xc3 and 0xa9.
Try:
>>> "kr\xc3\xa9m".decode('utf-8')
u'kr\xe9m'
>>> print("kr\xc3\xa9m")
krém
>>> print(u"kr\xe9m")
krém
With u"" you must use the actual code points. While with raw "", python just sees a chain of bytes.
Note that the second line only works because my terminal's encoding is utf-8, otherwise I'd see garbled output.
As a result, your string is not what you think:
>>> print(u"kr\xc3\xa9m")
krÃ©m
You actually entered two characters, with codepoint u00c3 and u00a9. The former is Ã, which is an alpha character and second is ©, which is not and is why your code removes it.
Now playing with your code:
>>> def leave_only_alphanumeric(string):
... pat = re.compile('[\W]+', re.UNICODE)
... return re.sub(pat,' ',string)
...
>>> leave_only_alphanumeric(u"kr\xe9m")
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m") # this is not unicode
'kr\xc3 m' # -> thus the wrong result
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8'))
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8')).encode('utf-8')
'kr\xc3\xa9m'
>>>

I believe regex might be a bit of an overkill here.
def leave_only_alphanumeric(string):
return ''.join(ch if ch.isalnum() else ' ' for ch in string)
EDIT: Your title says "alphanumeric" but your code removes digits as well. So there is a bit of unclarity.

python: how to remove '$'?

All I want to do is remove the dollar sign '$'. This seems simple, but I really don't know why my code isn't working.
import re
input = '$5'
if '$' in input:
input = re.sub(re.compile('$'), '', input)
print input
Input still is '$5' instead of just '5'! Can anyone help?

Try using replace instead:
input = input.replace('$', '')
As Madbreaks has stated, $ means match the end of the line in a regular expression.
Here is a handy link to regular expressions: http://docs.python.org/2/library/re.html

In this case, I'd use str.translate
>>> '$$foo$$'.translate(None,'$')
'foo'
And for benchmarking purposes:
>>> def repl(s):
... return s.replace('$','')
...
>>> def trans(s):
... return s.translate(None,'$')
...
>>> import timeit
>>> s = '$$foo bar baz $ qux'
>>> print timeit.timeit('repl(s)','from __main__ import repl,s')
0.969965934753
>>> print timeit.timeit('trans(s)','from __main__ import trans,s')
0.796354055405
There are a number of differences between str.replace and str.translate. The most notable is that str.translate is useful for switching 1 character with another whereas str.replace replaces 1 substring with another. So, for problems like, I want to delete all characters a,b,c, or I want to change a to d, I suggest str.translate. Conversely, problems like "I want to replace the substring abc with def" are well suited for str.replace.
Note that your example doesn't work because $ has special meaning in regex (it matches at the end of a string). To get it to work with regex you need to escape the $:
>>> re.sub('\$','',s)
'foo bar baz qux'
works OK.

$ is a special character in regular expressions that translates to 'end of the string'
you need to escape it if you want to use it literally
try this:
import re
input = "$5"
if "$" in input:
input = re.sub(re.compile('\$'), '', input)
print input

You need to escape the dollar sign - otherwise python thinks it is an anchor http://docs.python.org/2/library/re.html
import re
fred = "$hdkhsd%$"
print re.sub ("\$","!", fred)
>> !hdkhsd%!

Aside from the other answers, you can also use strip():
input = input.strip('$')

String.maketrans for English and Persian numbers

I have a function like this:
persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'
english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)
text.translate(english_trans)
text.translate(arabic_trans)
I want it to translate all Arabic and English numbers to Persian. But Python says:
english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length
I tried to encode strings with Unicode utf-8 but I always got some errors! Sometimes the problem is Arabic string instead! Do you know a better solution for this job?
EDIT:
It seems the problem is Unicode characters length in ASCII. An Arabic number like '۱' is two character -- that I find out with ord(). And the length problem starts from here :-(

See unidecode library which converts all strings into UTF8. It is very useful in case of number input in different languages.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'

Unicode objects can interpret these digits (arabic and persian) as actual digits -
no need to translate them by using character substitution.
EDIT -
I came out with a way to make your replacement using Python2 regular expressions:
# coding: utf-8
import re
# Attention: while the characters for the strings bellow are
# dislplayed indentically, inside they are represented
# by distinct unicode codepoints
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'
persian_regexp = u"(%s)" % u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)
def _sub(match_object, digits):
return english_numbers[digits.find(match_object.group(0))]
def _sub_arabic(match_object):
return _sub(match_object, arabic_numbers)
def _sub_persian(match_object):
return _sub(match_object, persian_numbers)
def replace_arabic(text):
return re.sub(arabic_regexp, _sub_arabic, text)
def replace_persian(text):
return re.sub(arabic_regexp, _sub_persian, text)
Attempt that the "text" parameter must be unicode itself.
(also this code could be shortened
by using lambdas and combining some expressions in a single line, but there is no point in doing so, but for loosing readability)
It should work to you up to here, but please read on the original answer I had posted
-- original answer
So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:
>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>>
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>
By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).
It is very important to understand the basics about unicode - for everyone, even people writing English only programs who think they will never deal with any char out of the 26 latin letters. When writing code that will deal with different chars it is vital - the program can't possibly work without you knowing what you are doing except by chance.
A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now.
You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'
Thus, the characters loose their sense as "single entities" and as digits when encoding - the encoded object (str type in Python 2.x) is justa strrng of bytes - which nonetheless is needed when sending these characters to any output from the program - be it console, GUI Window, database, html code, etc...

You can use persiantools package:
Examples:
>>> from persiantools import digits
>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'
>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١") # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'

unidecode converts all characters from Persian to English, If you want to change only numbers follow bellow:
In python3 you can use this code to convert any Persian|Arabic number to English number while keeping other characters unchanged:
intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)

Use Unicode Strings:
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
And make sure the encoding of your Python file is correct.

With this you can easily do that:
def p2e(persiannumber):
number={
'0':'۰',
'1':'۱',
'2':'۲',
'3':'۳',
'4':'۴',
'5':'۵',
'6':'۶',
'7':'۷',
'8':'۸',
'9':'۹',
}
for i,j in number.items():
persiannumber=persiannumber.replace(j,i)
return persiannumber
here is usage:
print(p2e('۳۱۹۶'))
#returns 3196

In Python 3 easiest way is:
str(int('۱۲۳'))
#123
but if number starts with 0 it have an issue.
so we can use zip() function:
for i, j in zip('1234567890', '۱۲۳۴۵۶۷۸۹۰'):
number.replace(i, j)

def persian_number(persiannumber):
number={
'0':'۰',
'1':'۱',
'2':'۲',
'3':'۳',
'4':'۴',
'5':'۵',
'6':'۶',
'7':'۷',
'8':'۸',
'9':'۹',
}
for i,j in number.items():
persiannumber=time2str.replace(i,j)
return time2str
persiannumber must be a string

Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?

I have strings that are multi-lingual consist of both languages that use whitespace as word separator (English, French, etc) and languages that don't (Chinese, Japanese, Korean).
Given such a string, I want to separate the English/French/etc part into words using whitespace as separator, and to separate the Chinese/Japanese/Korean part into individual characters.
And I want to put of all those separated components into a list.
Some examples would probably make this clear:
Case 1: English-only string. This case is easy:
>>> "I love Python".split()
['I', 'love', 'Python']
Case 2: Chinese-only string:
>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
In this case I can turn the string into a list of Chinese characters. But within the list I'm getting unicode representations:
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
How do I get it to display the actual characters instead of the unicode? Something like:
['我', '爱', '蟒', '蛇']
??
Case 3: A mix of English & Chinese:
I want to turn an input string such as
"我爱Python"
and turns it into a list like this:
['我', '爱', 'Python']
Is it possible to do something like that?

I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddnesses I've seen makes me worried that a regular expression might not be flexible enough for all of them--but you may well not need any of that. (In other words--overdesign.)
# -*- coding: utf-8 -*-
import re
def group_words(s):
regex = []
# Match a whole word:
regex += [ur'\w+']
# Match a single CJK character:
regex += [ur'[\u4e00-\ufaff]']
# Match one of anything else, except for spaces:
regex += [ur'[^\s]']
regex = "|".join(regex)
r = re.compile(regex)
return r.findall(s)
if __name__ == "__main__":
print group_words(u"Testing English text")
print group_words(u"我爱蟒蛇")
print group_words(u"Testing English text我爱蟒蛇")
In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.

In Python 3, it also splits the number if you needed.
def spliteKeyWord(str):
regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
matches = re.findall(regex, str, re.UNICODE)
return matches
print(spliteKeyWord("Testing English text我爱Python123"))
=> ['Testing', 'English', 'text', '我', '爱', 'Python', '123']

Formatting a list shows the repr of its components. If you want to view the strings naturally rather than escaped, you'll need to format it yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\\u6211'. Apparently this does happen in Python 3; only 2.x is stuck with the English-centric escaping for Unicode strings.)
A basic algorithm you can use is assigning a character class to each character, then grouping letters by class. Starter code is below.
I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.
Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.
# -*- coding: utf-8 -*-
import itertools, unicodedata
def group_words(s):
# This is a closure for key(), encapsulated in an array to work around
# 2.x's lack of the nonlocal keyword.
sequence = [0x10000000]
def key(part):
val = ord(part)
if part.isspace():
return 0
# This is incorrect, but serves this example; finding a more
# accurate categorization of characters is up to the user.
asian = unicodedata.category(part) == "Lo"
if asian:
# Never group asian characters, by returning a unique value for each one.
sequence[0] += 1
return sequence[0]
return 2
result = []
for key, group in itertools.groupby(s, key):
# Discard groups of whitespace.
if key == 0:
continue
str = "".join(group)
result.append(str)
return result
if __name__ == "__main__":
print group_words(u"Testing English text")
print group_words(u"我爱蟒蛇")
print group_words(u"Testing English text我爱蟒蛇")

Modified Glenn's solution to drop symbols and work for Russian, French, etc alphabets:
def rec_group_words():
regex = []
# Match a whole word:
regex += [r'[A-za-z0-9\xc0-\xff]+']
# Match a single CJK character:
regex += [r'[\u4e00-\ufaff]']
regex = "|".join(regex)
return re.compile(regex)

The following works for python3.7:
import re
def group_words(s):
return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)
if __name__ == "__main__":
print(group_words(u"Testing English text"))
print(group_words(u"我爱蟒蛇"))
print(group_words(u"Testing English text我爱蟒蛇"))
['Testing', 'English', 'text']
['我', '爱', '蟒', '蛇']
['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
For some reason, I cannot adapt Glenn Maynard's answer to python3.

Stripping non printable characters from a string in python

I use to run
$s =~ s/[^[:print:]]//g;
on Perl to get rid of non printable characters.
In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.
What would you do?
EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output.
curses.ascii.isprint will return false for any unicode character.

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
import unicodedata, re, itertools, sys
all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
return control_char_re.sub('', s)
For Python2
import unicodedata, re, sys
all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
return control_char_re.sub('', s)
For some use-cases, additional categories (e.g. all from the control group might be preferable, although this might slow down the processing time and increase memory usage significantly. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
Edit Adding suggestions from the comments.

As far as I know, the most pythonic/efficient method would be:
import string
filtered_string = filter(lambda x: x in string.printable, myStr)

You could try setting up a filter using the unicodedata.category() function:
import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
return ''.join(c for c in str if unicodedata.category(c) in printable)
See Table 4-9 on page 175 in the Unicode database character properties for the available categories

The following will work with Unicode input and is rather fast...
import sys
# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}
def make_printable(s):
"""Replace non-printable characters in a string."""
# the translate method on str removes characters
# that map to None from the string
return s.translate(NOPRINT_TRANS_TABLE)
assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''
My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.

In Python 3,
def filter_nonprintable(text):
import itertools
# Use characters of control category
nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
# Use translate to remove all non-printable characters
return text.translate({character:None for character in nonprintable})
See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()
The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by #Ants Aasma.

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):
from curses.ascii import isprint
def printable(input):
return ''.join(char for char in input if isprint(char))

Yet another option in python 3:
re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

Based on #Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:
import unicodedata
def filter_non_printable(s):
return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

The best I've come up with now is (thanks to the python-izers above)
def filter_non_printable(str):
return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])
This is the only way I've found out that works with Unicode characters/strings
Any better options?

In Python there's no POSIX regex classes
There are when using the regex library: https://pypi.org/project/regex/
It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.
From the documentation:
[[:alpha:]]; [[:^alpha:]]
POSIX character classes are supported. These
are normally treated as an alternative form of \p{...}.
(I'm not affiliated, just a user.)

An elegant pythonic solution to stripping 'non printable' characters from a string in python is to use the isprintable() string method together with a generator expression or list comprehension depending on the use case ie. size of the string:
''.join(c for c in my_string if c.isprintable())
str.isprintable()
Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)

The one below performs faster than the others above. Take a look
''.join([x if x in string.printable else '' for x in Str])

Adapted from answers by Ants Aasma and shawnrad:
nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
return text.translate(ord_dict)
#use
str = "this is my string"
str = filter_nonprintable(str)
print(str)
tested on Python 3.7.7

To remove 'whitespace',
import re
t = """
\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))

Error description
Run the copied and pasted python code report:
Python invalid non-printable character U+00A0
The cause of the error
The space in the copied code is not the same as the format in Python;
Solution
Delete the space and re-enter the space. For example, the red part in the above picture is an abnormal space. Delete and re-enter the space to run;
Source : Python invalid non-printable character U+00A0

I used this:
import sys
import unicodedata
# the test string has embedded characters, \u2069 \u2068
test_string = """"ABC⁩.⁨ 6", "}"""
nonprintable = list((ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if
unicodedata.category(c) in ['Cc','Cf']))
translate_dict = {character: None for character in nonprintable}
print("Before translate, using repr()", repr(test_string))
print("After translate, using repr()", repr(test_string.translate(translate_dict)))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Strip unicode character modifiers - python

What is the simplest way to strip the character modifiers from a unicode string in Python? For example: A͋͠r͍̞̫̜͌ͦ̈́͐ͅt̼̭͞h́u̡̙̞̘̙̬͖͓rͬͣ̐ͮͥͨ̀͏̣ should become Arthur I tried the docs but I couldn't find anything that does this.

Related

Leave only alphanumeric symbols in string in Python?

python: how to remove '$'?

String.maketrans for English and Persian numbers

Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?

Stripping non printable characters from a string in python

Categories

Resources