Replace non-ASCII characters with a single space - python

I need to replace all non-ASCII characters (those outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:
def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)
And this one replaces non-ASCII characters with as many spaces as there are bytes in the character's UTF-8 encoding (e.g. the – character is replaced with 3 spaces):
import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)
How can I replace all non-ASCII characters with a single space?
Of the myriad of similar SO questions, none address replacing characters (as opposed to stripping them), and none address all non-ASCII characters rather than one specific character.

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
Note the + there.
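A quick illustration of the difference, on a hypothetical string containing a run of non-ASCII characters:
>>> import re
>>> s = 'ABC 马克 def'
>>> re.sub(r'[^\x00-\x7F]', ' ', s)   # one space per non-ASCII character
'ABC    def'
>>> re.sub(r'[^\x00-\x7F]+', ' ', s)  # one space per run of non-ASCII characters
'ABC   def'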

To get the closest ASCII representation of your original string, I recommend the unidecode module:
# python 2.x:
from unidecode import unidecode

def remove_non_ascii(text):
    return unidecode(unicode(text, encoding="utf-8"))
Then you can use it on a string:
remove_non_ascii("Ceñía")
Cenia
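On Python 3, where str is already Unicode, a minimal sketch of the same idea (assuming unidecode is installed) would be:
# python 3.x:
from unidecode import unidecode

def remove_non_ascii(text):
    return unidecode(text)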

For character processing, use Unicode strings:
PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # Each char is a Unicode codepoint.
'ABC def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC def'
But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):
>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():
"""Test the performance of different non-ASCII replacement methods."""
import re
from timeit import timeit
# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000
print(timeit(
"""
result = ''.join([c if ord(c) < 128 else '?' for c in text])
""",
number=1000,
globals=globals(),
))
print(timeit(
"""
result = text.encode('ascii', 'replace').decode()
""",
number=1000,
globals=globals(),
))
Results:
0.7208260721400134
0.009975979187503592
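For reference, errors='replace' substitutes '?' for each character that cannot be encoded; a quick check on a hypothetical string:
>>> 'Æ mañana Æ'.encode('ascii', 'replace').decode()
'? ma?ana ?'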

What about this one?
def replace_trash(unicode_string):
    result = ""
    for ch in unicode_string:
        try:
            ch.encode("ascii")
        except UnicodeEncodeError:
            # means it's non-ASCII: replace it with a single space
            ch = " "
        result += ch
    return result

As a native and efficient approach, you don't need to use ord() or any loop over the characters. Just encode with ascii and ignore the errors.
The following will just remove the non-ASCII characters:
new_string = old_string.encode('ascii', errors='ignore')
Now if you want to replace the deleted characters, just do the following:
final_string = new_string + b' ' * (len(old_string) - len(new_string))
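A quick check of this approach on a sample string (note that the result is a bytes object and that the padding spaces end up at the end rather than at the positions of the removed characters):
>>> old_string = 'Ceñía'
>>> new_string = old_string.encode('ascii', errors='ignore')
>>> new_string + b' ' * (len(old_string) - len(new_string))
b'Cea  '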

ascii() escapes non-ASCII characters and leaves ASCII characters unchanged. So my main idea is: since ASCII characters are not changed, I iterate through the string and check whether each character has changed. If it has, I replace it with the replacer that you supply,
for example ' ' (a single space) or '?' (a question mark).
def remove(x, replacer):
    for i in x:
        if f"'{i}'" == ascii(i):
            pass
        else:
            x = x.replace(i, replacer)
    return x
remove('hái',' ')
Result: "h i" (with single space between).
Syntax : remove(str,non_ascii_replacer)
str = Here you will give the string you want to work with.
non_ascii_replacer = Here you will give the replacer which you want to replace all the non ASCII characters with.
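The comparison works because ascii() leaves plain ASCII characters as they are but escapes everything else; a quick illustration:
>>> ascii('h')
"'h'"
>>> ascii('á')
"'\\xe1'"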

Pre-processing using Raku (formerly known as Perl_6)
~$ raku -pe 's:g/ <:!ASCII>+ / /;' file
Sample Input:
Peace be upon you
السلام عليكم
שלום עליכם
Paz sobre vosotros
Sample Output:
Peace be upon you
Paz sobre vosotros
Note, you can get extensive information on the matches using the following code:
~$ raku -ne 'say s:g/ <:!ASCII>+ / /.raku;' file
$( )
$(Match.new(:orig("السلام عليكم"), :from(0), :pos(6)), Match.new(:orig("السلام عليكم"), :from(7), :pos(12)))
$(Match.new(:orig("שלום עליכם"), :from(0), :pos(4)), Match.new(:orig("שלום עליכם"), :from(5), :pos(10)))
$( )
$( )
Or more simply, you can just visualize the replacement blank spaces:
~$ raku -ne 'say S:g/ <:!ASCII>+ / /.raku;' file
"Peace be upon you"
" "
" "
"Paz sobre vosotros"
""
https://docs.raku.org/language/regexes#Unicode_properties
https://www.codesections.com/blog/raku-unicode/
https://raku.org

def filterSpecialChars(strInput):
    result = []
    for character in strInput:
        ordVal = ord(character)
        if ordVal < 0 or ordVal > 127:
            result.append(' ')
        else:
            result.append(character)
    return ''.join(result)
And call it like this:
result = filterSpecialChars('Ceñía mañana')
print(result)

My problem was that my string contained things like Belgià for België and &#x20AC for the € sign, and I didn't want to replace them with spaces, but with the right symbol itself.
My solution was string.encode('Latin1').decode('utf-8')
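A minimal sketch of that mojibake repair (the sequence Ã« is what UTF-8-encoded ë looks like when mis-decoded as Latin-1):
>>> 'BelgiÃ«'.encode('latin1').decode('utf-8')
'België'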

Potentially an answer to a slightly different question, but I'm providing my version of @Alvero's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. strip whitespace from the beginning and end of the string, and then replace only other whitespace characters with a "regular" space, i.e. to turn
"Ceñíaㅤmañanaㅤㅤㅤㅤ"
to
"Ceñía mañana"
,
from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)
We first replace all non-unicode spaces with a regular space (and join it back again),
''.join((c if unidecode(c) else ' ') for c in s)
And then we split that again, with python's normal split, and strip each "bit",
(bit.strip() for bit in s.split())
And lastly join those back again, but only if the string passes an if test,
' '.join(stripped for stripped in s if stripped)
And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.

To replace all non-ASCII characters (those outside \x00-\x7F) with a space:
''.join(map(lambda x: x if ord(x) in range(0, 128) else ' ', text))
To keep only visible ASCII characters (replacing everything else, including whitespace, with a space), try this:
import string
''.join(map(lambda x: x if x in string.printable and x not in string.whitespace else ' ', text))
This will give the same result:
''.join(map(lambda x: x if ord(x) in range(32, 128) else ' ', text))

Related

How to convert Unicode to ASCII leaving out the non-convertible characters? [duplicate]

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f = open(file_path, 'r')
    data = f.read()
    f.close()
    filtered_data = filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:
''.join(filter(lambda x: x in printable, s))
An easy way to change to a different codec, is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>> s = u'Good bye in Swedish is Hej d\xe5'
>>> s = s.encode('ascii', errors='ignore')
>>> print s
Good bye in Swedish is Hej d
Edit:
Python3: str -> bytes -> str
>>>"Hej då".encode("ascii", errors="ignore").decode()
'hej d'
Python2: unicode -> str -> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
According to @artfulrobot, this should be faster than filter and lambda:
import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)
See more examples here Replace non-ASCII characters with a single space
You may use the following code to remove non-English letters:
import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)
This will return
123456790 ABC#%? .()
Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2' ... is that what you really want?
A better solution would include:
a better name for the filter function than onlyascii
recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126

# and later:
filtered_data = filter(filter_func, data).lower()
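On Python 3, filter() returns an iterator rather than a string, so a hedged sketch of the same idea (with hypothetical sample data) would be:
def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126

data = 'this is line 1\nthis is liné 2\n'   # hypothetical sample data
filtered_data = ''.join(filter(filter_func, data)).lower()
# filtered_data == 'this is line 1\nthis is lin 2\n'  (the é is dropped, the newlines survive)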
Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
onlyascii = ''.join([s for s in data if ord(s) < 127])
onlymatch = ''.join([s for s in data if s in
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
If you want printable ASCII characters you probably should correct your code to:
if ord(char) < 32 or ord(char) > 126: return ''
This is equivalent to string.printable (from @jterrace's answer), except for the absence of returns and tabs ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn't correspond to the range in your question.
This is the best way to get ASCII characters with clean code; it checks for all possible errors:
from string import printable

def getOnlyCharacters(texts):
    _type = None
    result = ''
    if type(texts).__name__ == 'bytes':
        _type = 'bytes'
        texts = texts.decode('utf-8', 'ignore')
    else:
        _type = 'str'
        texts = bytes(texts, 'utf-8').decode('utf-8', 'ignore')
        texts = str(texts)
    for text in texts:
        if text in printable:
            result += text
    if _type == 'bytes':
        result = result.encode('utf-8')
    return result
text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)
print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri

Replacing unknown characters in a string Python 2.7

How can I define a set of allowed characters (in a list or a string), and have any other characters replaced with, let's say, a '?'
Example:
strinput = "abcdefg#~"
legal = '.,/?~abcdefg' #legal characters
while i not in legal:
#Turn i into '?'
print output
Put the legal characters in a set then use in to test each character of the string. Construct the new string using the str.join() method and a conditional expression.
>>> s = "test.,/?~abcdefgh"
>>> legal = set('.,/?~abcdefg')
>>> s = ''.join(char if char in legal else '?' for char in s)
>>> s
'?e??.,/?~abcdefg?'
>>>
If this is a large file, read in chunks, and apply re.sub(..) as below. ^ within a class (square brackets) stands for negation (similar to saying "anything other than")
>>> import re
>>> char = '.,/?~abcdefg'
>>> re.sub(r'[^' + char +']', '?', "test.,/?~abcdefgh")
'?e??.,/?~abcdefg?'
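If the set of legal characters might itself contain regex metacharacters such as ']' or '\', it is safer to escape it when building the class; a small sketch:
>>> import re
>>> legal = '.,/?~abcdefg]'   # hypothetical legal set containing ']'
>>> re.sub('[^' + re.escape(legal) + ']', '?', "test.,/?~abcdefgh]")
'?e??.,/?~abcdefg?]'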

Leave only alphanumeric symbols in string in Python?

I am using Python 2.7. On SO I found the following regexp for removing non-word characters:
pat = re.compile('[\W]+', re.UNICODE)
I wrote the next function:
def leave_only_alphanumeric(string):
    pat = re.compile('[\W]+', re.UNICODE)
    return re.sub(pat, ' ', string)
Though on the following string:
kr\xc3\xa9m
it produces the wrong result:
kr\xc3 m
\xa9 was deleted from the string, but should not have been.
You are confusing unicode codepoints and the utf-8 encoding.
The letter you are trying to handle is é, code point u00e9.
It is encoded in utf-8 as two bytes, 0xc3 and 0xa9.
Try:
>>> "kr\xc3\xa9m".decode('utf-8')
u'kr\xe9m'
>>> print("kr\xc3\xa9m")
krém
>>> print(u"kr\xe9m")
krém
With u"" you must use the actual code points. While with raw "", python just sees a chain of bytes.
Note that the second line only works because my terminal's encoding is utf-8, otherwise I'd see garbled output.
As a result, your string is not what you think:
>>> print(u"kr\xc3\xa9m")
krém
You actually entered two characters, with codepoint u00c3 and u00a9. The former is Ã, which is an alpha character and second is ©, which is not and is why your code removes it.
Now playing with your code:
>>> def leave_only_alphanumeric(string):
...     pat = re.compile('[\W]+', re.UNICODE)
...     return re.sub(pat,' ',string)
...
>>> leave_only_alphanumeric(u"kr\xe9m")
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m") # this is not unicode
'kr\xc3 m' # -> thus the wrong result
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8'))
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8')).encode('utf-8')
'kr\xc3\xa9m'
>>>
I believe regex might be a bit of an overkill here.
def leave_only_alphanumeric(string):
    return ''.join(ch if ch.isalnum() else ' ' for ch in string)
EDIT: Your title says "alphanumeric", but your code removes digits as well, so there is some ambiguity.
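As a quick check (assuming Python 3, where str.isalnum() is Unicode-aware):
>>> leave_only_alphanumeric('krém, krém!')
'krém  krém '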

How do I replace punctuation in a string in Python?

I would like to replace (and not remove) all punctuation characters by " " in a string in Python.
Is there something efficient of the following flavour?
text = text.translate(string.maketrans("",""), string.punctuation)
This answer is for Python 2 and will only work for ASCII strings:
The string module contains two things that will help you: a list of punctuation characters and the "maketrans" function. Here is how you can use them:
import string
replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
text = text.translate(replace_punctuation)
Modified solution from Best way to strip punctuation from a string in Python
import string
import re
regex = re.compile('[%s]' % re.escape(string.punctuation))
out = regex.sub(' ', "This is, fortunately. A Test! string")
# out = 'This is fortunately A Test string'
This workaround works in python 3:
import string
ex_str = 'SFDF-OIU .df !hello.dfasf sad - - d-f - sd'
#because len(string.punctuation) = 32
table = str.maketrans(string.punctuation,' '*32)
res = ex_str.translate(table)
# res = 'SFDF OIU df hello dfasf sad d f sd'
There is a more robust solution which relies on a regex exclusion rather than inclusion through an extensive list of punctuation characters.
import re
print(re.sub('[^\w\s]', '', 'This is, fortunately. A Test! string'))
#Output - 'This is fortunately A Test string'
The regex catches anything which is not an alpha-numeric or whitespace character
Replace with ''? What's the difference between translating all ';' into '' and removing all ';'?
Here is how to remove all ';':
s = 'dsda;;dsd;sad'
table = string.maketrans('','')
string.translate(s, table, ';')
And you can do your replacement with translate.
In my specific way, I removed "+" and "&" from the punctuation list:
all_punctuations = string.punctuation
selected_punctuations = re.sub(r'(\&|\+)', "", all_punctuations)
print selected_punctuations
str = "he+llo* ithis& place% if you * here ##"
punctuation_regex = re.compile('[%s]' % re.escape(selected_punctuations))
punc_free = punctuation_regex.sub("", str)
print punc_free
Result: he+llo ithis& place if you here

Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.
Here is a regex to match a string of characters that are not a letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)
Shorter way :
import re
cleanString = re.sub('\W+','', string )
If you want spaces between words and numbers substitute '' with ' '
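One caveat worth noting: \W treats the underscore as a word character, so underscores survive this substitution; a quick check:
>>> import re
>>> re.sub(r'\W+', '', 'Special $#! characters_and_underscores 888323')
'Specialcharacters_and_underscores888323'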
TLDR
I timed the provided answers.
import re
re.sub('\W+','', string)
is typically 3x faster than the next fastest provided top answer.
Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'
Example 1
''.join(e for e in string if e.isalnum())
string1 - Result: 10.7061979771
string2 - Result: 7.78372597694
Example 2
import re
re.sub('[^A-Za-z0-9]+', '', string)
string1 - Result: 7.10785102844
string2 - Result: 4.12814903259
Example 3
import re
re.sub('\W+','', string)
string1 - Result: 3.11899876595
string2 - Result: 2.78014397621
The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)
Example 3 can be 3x faster than Example 1.
Python 2.*
I think just filter(str.isalnum, string) works
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
Python 3.*
In Python 3, the filter() function returns an iterable object (instead of a string, unlike above). One has to join it back to get a string from the iterable:
''.join(filter(str.isalnum, string))
or pass a list to join (not sure, but it may be a bit faster):
''.join([*filter(str.isalnum, string)])
note: unpacking in [*args] valid from Python >= 3.5
#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr
You can add more special characters; they will be replaced by '' (i.e. nothing), which means they will be removed.
Differently than everyone else did using regex, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you insert the special character ^ at the first place of your regex, you will get the negation.
Extra tip: if you also need to lowercase the result, you can make the regex even faster and simpler, as long as you don't need any uppercase letters in the output.
import re
s = re.sub(r"[^a-z0-9]","",s.lower())
string.punctuation contains following characters:
'!"#$%&\'()*+,-./:;<=>?#[\]^_`{|}~'
You can use translate and maketrans functions to map punctuations to empty values (replace)
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'
s = re.sub(r"[-()\"#/#;:<>{}`+=~|.!?,]", "", s)
Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>
The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:
import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))
def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')
Look at the URL given above for all related categories. You can also, of course, filter by the punctuation categories.
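A quick usage sketch (punctuation such as ',' and '!' falls outside the whitelisted categories, so it is turned into a space):
>>> filter_non_printable(u'This, is. A test!')
u'This  is  A test '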
For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)
This will remove all special characters, punctuation, and spaces from a string and only have numbers and letters.
import re
sample_str = "Hel&&lo %% Wo$#rl#d"
# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))
# using regex
op2 = re.sub("[^A-Za-z]", "", sample_str)
print(f"op2 = ", op2)
special_char_list = ["$", "#", "#", "&", "%"]
# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print(f"op1 = ", op1)
# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print(f"op3 = ", op3)
Use translate:
import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')
Caveat: Only works on ascii strings.
This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
Special characters spaces 888323
import re

my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""

# if we need to count the word "python" whether it ends with '.' or not
text = my_string.lower().split()
count = 0
for i in text:
    if i.endswith("."):
        # strip the trailing punctuation from the word
        text[count] = re.sub(r"^([a-z]+)(.)?$", r"\1", i)
    count += 1
print("The count of Python : ", text.count("python"))
After 10 years, below is what I think is the best solution.
You can remove/clean all special characters, punctuation, ASCII characters and spaces from the string.
from clean_text import clean
string = 'Special $#! characters spaces 888323'
new = clean(string, lower=False, no_currency_symbols=True, no_punct=True, replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
You can replace the spaces if you want:
update = new.replace(' ','')
print(update)
Output ==> 'Specialcharactersspaces888323'
function regexFuntion(st) {
  const regx = /[^\w\s]/gi; // allow: [a-zA-Z0-9, space]
  st = st.replace(regx, ''); // remove all data without [a-zA-Z0-9, space]
  st = st.replace(/\s\s+/g, ' '); // remove multiple spaces
  return st;
}

console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67
import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)
and you shall see your result as
askhnlaskdjalsdk
