I am using the following functions to strip out non-ASCII characters:
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x) < 128, s))

def removeNonAscii1(s):
    return "".join(i for i in s if ord(i) < 128)
I would now like to remove the entire word if it contains any non-ascii characters. I thought of measuring the length pre and post function application but I am confident that there is a more efficient way. Any ideas?
If you define the word based on spaces, something like this might work:
def containsNonAscii(s):
return any(ord(i)>127 for i in s)
words = sentence.split()
cleaned_words = [word for word in words if not containsNonAscii(word)]
cleaned_sentence = ' '.join(cleaned_words)
Note that this will collapse repeated whitespace into just one space.
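If preserving the original spacing matters, a possible alternative (only a sketch; the helper name keep_ascii_words is chosen here for illustration) is to split on runs of whitespace with a capturing group so the separators survive:

import re

def keep_ascii_words(sentence):
    # re.split with a capturing group keeps the whitespace runs in the result,
    # so the words that are kept retain their original separators.
    parts = re.split(r'(\s+)', sentence)
    return ''.join(p for p in parts if p.isspace() or all(ord(c) < 128 for c in p))

Words that are dropped still leave their neighbouring separators behind, so this trades whitespace collapsing for possible double spaces around removed words.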
The cleanest (but not necessarily most efficient) way is to encode the word to bytes and attempt to decode it as ASCII. If the attempt fails, the word contains non-ASCII characters:
def is_ascii(w):
    try:
        w.encode().decode("us-ascii")
        return True
    except UnicodeDecodeError:
        return False
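A quick usage sketch, filtering a whitespace-split sentence with is_ascii (the sample string is made up); note that on Python 3.7+ the built-in str.isascii() does the same check directly:

sentence = "pyt\u00f6hn is fun and python is fun"
kept = ' '.join(w for w in sentence.split() if is_ascii(w))
print(kept)   # -> "is fun and python is fun"

# Python 3.7+ equivalent:
# kept = ' '.join(w for w in sentence.split() if w.isascii())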
I came up with the following function. It removes all words that contain any character in the printable ASCII range (\x20-\x7E), but the range can be adjusted as desired.
import re

def removeWordsWithASCII(s):
    return " ".join(filter(lambda x: not re.search(r'[\x20-\x7E]', x), s.split(' ')))
Related
I'm creating a palindrome checker and it works, however I need a way to replace/remove punctuation from the given input. I'm trying to do something like chr(i) for i in range(32, 47) and then substitute those characters with ''. The characters I need excluded are 32 - 47. I've tried using the string module but I can only get it to exclude either spaces or punctuation; it can't be both for whatever reason.
I've already tried the string module but can't get that to remove spaces and punctuation at the same time.
def is_palindrome_stack(string):
    s = ArrayStack()
    for character in string:
        s.push(character)
    reversed_string = ''
    while not s.is_empty():
        reversed_string = reversed_string + s.pop()
    if string == reversed_string:
        return True
    else:
        return False

def remove_punctuation(text):
    return text.replace(" ", '')
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)
That is because you are returning from your method in the very first line, in return text.replace(" ",''). Change it to text = text.replace(" ", "") and it should work fine.
Also, the indentation is probably messed up in your post, maybe during copy pasting.
Full method snippet:
import string

def remove_punctuation(text):
    text = text.replace(" ", '')
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)
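A quick sanity check of the fixed method on a punctuated palindrome (input chosen for illustration):

print(remove_punctuation("A man, a plan, a canal: Panama!"))   # -> "AmanaplanacanalPanama"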
You might use str methods to get rid of unwanted characters in the following way:
import string

tr = ''.maketrans('', '', ' ' + string.punctuation)

def remove_punctuation(text):
    return text.translate(tr)
txt = 'Point.Space Question?'
output = remove_punctuation(txt)
print(output)
Output:
PointSpaceQuestion
maketrans creates a translation table. It accepts three strings: the first and second must be of equal length, and the n-th character of the first is replaced with the n-th character of the second; the third string lists characters to remove. Since you only need to remove (not replace) characters, the first two arguments are empty strings.
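A minimal sketch showing all three arguments at once (the sample string is made up):

table = ''.maketrans('ab', 'AB', '!')   # replace a->A and b->B, delete '!'
print('a!b!c'.translate(table))         # -> "ABc"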
>>> x = 'abc_cde_fgh'
>>> x.strip('abc_cde')
'fgh'
I expected '_fgh'. How should I understand this result?
strip() treats its argument as a set of characters and removes any of them from both ends of the string; it does not remove a leading or trailing word.
This example demonstrates it nicely:
>>> x.strip('ab_ch')
'de_fg'
Since the characters "a", "b", "c", "h", and "_" are all in the set of characters to remove, the leading "abc_c" and the trailing "h" are stripped. The other characters are left untouched.
If you would like to remove a leading or trailing word, I would recommend using re or startswith/endswith.
def rstrip_word(str, word):
    if str.endswith(word):
        return str[:-len(word)]
    return str

def lstrip_word(str, word):
    if str.startswith(word):
        return str[len(word):]
    return str

def strip_word(str, word):
    return rstrip_word(lstrip_word(str, word), word)
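For example, with the string from the question:

>>> x = 'abc_cde_fgh'
>>> strip_word(x, 'abc_')
'cde_fgh'
>>> strip_word(x, '_fgh')
'abc_cde'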
Removing Multiple Words
A very simple (greedy) implementation for removing multiple words from a string can be done as follows:
def rstrip_word(str, *words):
    for word in words:
        if str.endswith(word):
            return str[:-len(word)]
    return str

def lstrip_word(str, *words):
    for word in words:
        if str.startswith(word):
            return str[len(word):]
    return str

def strip_word(str, *words):
    return rstrip_word(lstrip_word(str, *words), *words)
Please note this algorithm is greedy: it returns as soon as it finds the first matching word, so it may not behave as you expect. Finding the maximum-length match (although not too tricky) is a bit more involved; a sketch follows the example below.
>>> strip_word(x, "abc", "adc_")
'_cde_fgh'
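A possible longest-match variant, offered only as a sketch (it assumes the given words are non-empty and simply prefers the longest matching prefix or suffix):

def lstrip_word_longest(s, *words):
    # Strip the longest word that is a prefix of s, if any.
    matches = [w for w in words if s.startswith(w)]
    return s[len(max(matches, key=len)):] if matches else s

def rstrip_word_longest(s, *words):
    # Strip the longest word that is a suffix of s, if any.
    matches = [w for w in words if s.endswith(w)]
    return s[:-len(max(matches, key=len))] if matches else s

For example, lstrip_word_longest('abc_cde_fgh', 'abc', 'abc_') returns 'cde_fgh' rather than '_cde_fgh'.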
strip() removes characters, not a substring. For example:
>>> x.strip('abcde_')
'fgh'
The documentation of the strip method says: "The chars argument is a string specifying the set of characters to be removed." That is why every character except "fgh" is removed (including the two underscores).
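If what you actually want is to cut a literal leading or trailing substring, note that Python 3.9+ provides str.removeprefix and str.removesuffix for exactly that:

>>> 'abc_cde_fgh'.removeprefix('abc_')
'cde_fgh'
>>> 'abc_cde_fgh'.removesuffix('_fgh')
'abc_cde'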
I can only use a string in my program if it contains no special characters except underscore _. How can I check this?
I tried using the unicodedata library, but the special characters just got replaced by standard characters.
You can use string.punctuation and the any function like this:

import string

invalidChars = set(string.punctuation.replace("_", ""))
if any(char in invalidChars for char in word):
    print("Invalid")
else:
    print("Valid")
With this line

invalidChars = set(string.punctuation.replace("_", ""))

we prepare the set of punctuation characters that are not allowed. Since you want _ to be allowed, we remove _ from string.punctuation and build the new set invalidChars. A set is used because membership lookups are faster in sets.
The any function will return True if at least one of the characters is in invalidChars.
Edit: As asked in the comments, this is the regular expression solution. Regular expression taken from https://stackoverflow.com/a/336220/1903116
word = "Welcome"
import re
print "Valid" if re.match("^[a-zA-Z0-9_]*$", word) else "Invalid"
You will need to define "special characters", but it's likely that for some string s you mean:
import re

if re.match(r'^\w+$', s):
    ...  # s is good-to-go
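One caveat: in Python 3, \w is Unicode-aware and also matches accented letters and non-ASCII digits; pass re.ASCII if you really mean only [a-zA-Z0-9_]:

import re

print(bool(re.match(r'^\w+$', 'héllo_1')))            # True: \w matches Unicode letters
print(bool(re.match(r'^\w+$', 'héllo_1', re.ASCII)))  # False: restricted to [a-zA-Z0-9_]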
The other answers' methods don't account for whitespace, and obviously nobody really considers a whitespace character special. Use this method to detect special characters while ignoring whitespace:
import re

def detect_special_characer(pass_string):
    regex = re.compile(r'[#_!#$%^&*()<>?/\|}{~:]')
    if regex.search(pass_string) is None:
        res = False
    else:
        res = True
    return res
To catch the extra characters that the method from Cybernetic misses, modify the 2nd line of the function from

regex = re.compile(r'[#_!#$%^&*()<>?/\|}{~:]')

to

regex = re.compile(r'[#_!#$%^&*()<>?/\\|}{~:\[\]]')

where the \, [ and ] characters are escaped with a backslash.
So in full:

import re

def detect_special_characer(pass_string):
    regex = re.compile(r'[#_!#$%^&*()<>?/\\|}{~:\[\]]')
    if regex.search(pass_string) is None:
        res = False
    else:
        res = True
    return res
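A quick check of the extended pattern (sample strings made up):

print(detect_special_characer("hello world"))    # False: letters and whitespace only
print(detect_special_characer("hello[world]"))   # True: brackets are now detected
print(detect_special_characer("back\\slash"))    # True: a literal backslash is now detected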
If a character is not numeric, not whitespace, not a letter, and not an underscore, then it is special:

for character in my_string:
    if not (character.isnumeric() or character.isspace() or character.isalpha() or character == "_"):
        print(f"{character} is special")
I have 3 APIs that return JSON data into 3 dictionary variables. I am taking some of the values from the dictionaries to process them. I read the specific values that I want into the list valuelist. One of the steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this process, but because the dictionary data is unicode I get the error:
wordlist = [s.translate(None, string.punctuation) for s in valuelist]
TypeError: translate() takes exactly one argument (2 given)
Is there a way around this? Either by encoding the unicode or a replacement for string.translate?
The translate method works differently on Unicode objects than on byte-string objects:
>>> help(unicode.translate)
S.translate(table) -> unicode
Return a copy of the string S, where all characters have been mapped
through the given translation table, which must be a mapping of
Unicode ordinals to Unicode ordinals, Unicode strings or None.
Unmapped characters are left untouched. Characters mapped to None
are deleted.
So your example would become:
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
word_list = [s.translate(remove_punctuation_map) for s in value_list]
Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
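If you do need to cover punctuation beyond ASCII, one option (a sketch, Python 3) is to build the removal map from unicodedata categories, since every category starting with 'P' is punctuation:

import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P' (punctuation) to None.
remove_punct_map = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
}
print('Hello, \u00abworld\u00bb!'.translate(remove_punct_map))   # -> 'Hello world'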
I noticed that string.translate is deprecated. Since you are removing punctuation, not actually translating characters, you can use the re.sub function.
>>> import re
>>> s1="this.is a.string, with; (punctuation)."
>>> s1
'this.is a.string, with; (punctuation).'
>>> re.sub("[\.\t\,\:;\(\)\.]", "", s1, 0, 0)
'thisis astring with punctuation'
>>>
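A slightly more general variant, if the goal is to drop everything in string.punctuation rather than a hand-picked set (a sketch):

import re
import string

punct_class = "[" + re.escape(string.punctuation) + "]"   # character class of all ASCII punctuation
print(re.sub(punct_class, "", "this.is a.string, with; (punctuation)."))
# -> 'thisis astring with punctuation'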
In this version you can also map one set of letters to another:
def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
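For example (Python 3):

>>> trans(u'привет')
'тевирп'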
The Python re module allows you to use a function as the replacement argument; it should take a Match object and return a suitable replacement. We can use this to build a custom character translation function:
import re

def mk_replacer(oldchars, newchars):
    """A function to build a replacement function."""
    mapping = dict(zip(oldchars, newchars))
    def replacer(match):
        """A replacement function to pass to re.sub()."""
        return mapping.get(match.group(0), "")
    return replacer
An example. Match all lower-case letters ([a-z]), translate 'h' and 'i' to 'H' and 'I' respectively, delete other matches:
>>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
'HI'
As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.
A Unicode example:
>>> re.sub("[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
u'PRIVET'
As I stumbled upon the same problem and Simon's answer was the one that helped me solve my case, I thought I'd show an easier example just for clarification:
from collections import defaultdict
And then for the translation, say you'd like to remove '#' and '\r' characters:
remove_chars_map = defaultdict()
remove_chars_map[ord('#')] = None
remove_chars_map[ord('\r')] = None
new_string = old_string.translate(remove_chars_map)
(The keys must be Unicode ordinals, as the translate documentation quoted above says, hence the ord() calls.)
And an example:
old_string = "word1#\r word2#\r word3#\r"
new_string = "word1 word2 word3"
'#' and '\r' removed
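Putting it together as a runnable sketch (a plain dict works just as well as a defaultdict here):

old_string = "word1#\r word2#\r word3#\r"
remove_chars_map = {ord('#'): None, ord('\r'): None}   # keys are code points
new_string = old_string.translate(remove_chars_map)
print(new_string)   # -> "word1 word2 word3"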