efficiently replace bad characters

efficiently replace bad characters - python

I often work with utf-8 text containing characters like:
\xc2\x99
\xc2\x95
\xc2\x85
etc
These characters confuse other libraries I work with so need to be replaced.
What is an efficient way to do this, rather than:
text.replace('\xc2\x99', ' ').replace('\xc2\x85, '...')

There is always regular expressions; just list all of the offending characters inside square brackets like so:
import re
print re.sub(r'[\xc2\x99]'," ","Hello\xc2There\x99")
This prints: 'Hello There ', with the unwanted characters replaced by spaces.
Alternately, if you have a different replacement character for each:
# remove annoying characters
chars = {
'\xc2\x82' : ',', # High code comma
'\xc2\x84' : ',,', # High code double comma
'\xc2\x85' : '...', # Tripple dot
'\xc2\x88' : '^', # High carat
'\xc2\x91' : '\x27', # Forward single quote
'\xc2\x92' : '\x27', # Reverse single quote
'\xc2\x93' : '\x22', # Forward double quote
'\xc2\x94' : '\x22', # Reverse double quote
'\xc2\x95' : ' ',
'\xc2\x96' : '-', # High hyphen
'\xc2\x97' : '--', # Double hyphen
'\xc2\x99' : ' ',
'\xc2\xa0' : ' ',
'\xc2\xa6' : '|', # Split vertical bar
'\xc2\xab' : '<<', # Double less than
'\xc2\xbb' : '>>', # Double greater than
'\xc2\xbc' : '1/4', # one quarter
'\xc2\xbd' : '1/2', # one half
'\xc2\xbe' : '3/4', # three quarters
'\xca\xbf' : '\x27', # c-single quote
'\xcc\xa8' : '', # modifier - under curve
'\xcc\xb1' : '' # modifier - under line
}
def replace_chars(match):
char = match.group(0)
return chars[char]
return re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)

I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.
\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?
Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)
>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'
If you want to do this correction for all the characters in the range 0x80–0x99 that are valid Windows-1252 characters, you can use this approach:
def restore_windows_1252_characters(s):
"""Replace C1 control characters in the Unicode string s by the
characters at the corresponding code points in Windows-1252,
where possible.
"""
import re
def to_windows_1252(match):
try:
return bytes([ord(match.group(0))]).decode('windows-1252')
except UnicodeDecodeError:
# No character at the corresponding code point: remove it.
return ''
return re.sub(r'[\u0080-\u0099]', to_windows_1252, s)
For example:
>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'

If you want to remove all non-ASCII characters from a string, you can use
text.encode("ascii", "ignore")

import unicodedata
# Convert to unicode
text_to_uncicode = unicode(text, "utf-8")
# Convert back to ascii
text_fixed = unicodedata.normalize('NFKD',text_to_unicode).encode('ascii','ignore')

This is not "Unicode characters" - it feels more like this an UTF-8 encoded string. (Although your prefix should be \xC3, not \xC2 for most chars). You should not just throw them away in 95% of the cases, unless you are comunicating with a COBOL backend. The World is not limited to 26 characters, you know.
There is a concise reading to explain the differences between Unicode strings (what is used as an Unicode object in python 2 and as strings in Python 3 here: http://www.joelonsoftware.com/articles/Unicode.html - please, for your sake do read that. Even if you are never planning to have anything that is not English in all of your applications, you still will stumble on symbols like € or º that won't fit in 7 bit ASCII. That article will help you.
That said, maybe the libraries you are using do accept Unicode python objects, and you can transform your UTF-8 Python 2 strings into unidoce by doing:
var_unicode = var.decode("utf-8")
If you really need 100% pure ASCII, replacing all non ASCII chars, after decoding the string to unicode, re-encode it to ASCII, telling it to ignore characters that don't fit in the charset with:
var_ascii = var_unicode.encode("ascii", "replace")

These characters are not in ASCII Library and that is the reason why you are getting the errors.
To avoid these errors, you can do the following while reading the file.
import codecs
f = codecs.open('file.txt', 'r',encoding='utf-8')
To know more about these kind of errors, go through this link.

Related

How to deal with variations in apostrophe ’ and ' [duplicate]

On my website people can post news and quite a few editors use MS word and similar tools to write the text and then copy&paste into my site's editor (simple textarea, no WYSIWYG etc.).
Those texts usually contain "nice" quotes instead of the plain ascii ones ("). They also sometimes contain those longer dashes like – instead of -.
Now I want to replace all those characters with their ascii counterparts. However, I do not want to remove umlauts and other non-ascii character. I'd also highly prefer to use a proper solution that does not involve creating a mapping dict for all those characters.
All my strings are unicode objects.

What about this?
It creates translation table first, but honestly I don't think you can do this without it.
transl_table = dict( [ (ord(x), ord(y)) for x,y in zip( u"‘’´“”–-", u"'''\"\"--") ] )
with open( "a.txt", "w", encoding = "utf-8" ) as f_out :
a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes” "
print( " a_str = " + a_str, file = f_out )
fixed_str = a_str.translate( transl_table )
print( " fixed_str = " + fixed_str, file = f_out )
I wasn't able to run this printing to a console (on Windows) so I had to write to txt file.
The output in the a.txt file looks as follows:
a_str = ´funny single quotes´ long–-and–-short dashes ‘nice single
quotes’ “nice double quotes” fixed_str = 'funny single quotes'
long--and--short dashes 'nice single quotes' "nice double quotes"
By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the difference in handling Unicode strings in both versions of the language

There is no such "proper" solution, because for any given Unicode character there is no "ASCII counterpart" defined.
For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, lets generate all the Unicode characters with their official names. Second, lets find all the quotation marks, hyphens and dashes according to the names:
#!/usr/bin/env python3
import unicodedata
def unicode_character_name(char):
try:
return unicodedata.name(char)
except ValueError:
return None
# Generate all Unicode characters with their names
all_unicode_characters = []
for n in range(0, 0x10ffff): # Unicode planes 0-16
char = chr(n) # Python 3
#char = unichr(n) # Python 2
name = unicode_character_name(char)
if name:
all_unicode_characters.append((char, name))
# Find all Unicode quotation marks
print (' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
# " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 ＂ 🙶 🙷 🙸
# Find all Unicode hyphens
print (' '.join([char for char, name in all_unicode_characters if 'HYPHEN' in name]))
# -  ֊ ᐀ ᠆ ‐ ‑ ‧ ⁃ ⸗ ⸚ ⹀ ゠ ﹣ － 󠀭
# Find all Unicode dashes
print (' '.join([char for char, name in all_unicode_characters if 'DASH' in name and 'DASHED' not in name]))
# ‒ – — ⁓ ⊝ ⑈ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ╌ ╍ ╎ ╏ ⤌ ⤍ ⤎ ⤏ ⤐ ⥪ ⥫ ⥬ ⥭ ⩜ ⩝ ⫘ ⫦ ⬷ ⸺ ⸻ ⹃ 〜 〰 ︱ ︲ ﹘ 💨
As you can see, as easy as this example is, there are many problems. There are many quotation marks in Unicode that don't look anything like the quotation marks in US-ASCII and there are many hyphens in Unicode that don't look anything like the hyphen-minus sign in US-ASCII.
And there are many questions. For example:
should the "SWUNG DASH" (⁓) symbol be replaced with an ASCII hyphen (-) or a tilde (~)?
should the "CANADIAN SYLLABICS HYPHEN" (᐀) be replaced with an ASCII hyphen (-) or an equals sign (=)?
should the "SINGLE LEFT-POINTING ANGLE QUOTATION MARK" (‹) be replaces with an ASCII quotation mark ("), an apostrophe (') or a less-than sign (<)?
To establish a "correct" ASCII counterpart, somebody needs to answer these questions based on the use context. That's why all the solutions to your problem are based on a mapping dictionary in one way or another. And all these solutions will provide different results.

You can build on top of the unidecode package.
This is pretty slow, since we are normalizing all the unicode first to the combined form, then trying to see what unidecode turns it into. If we match a latin letter, then we actually use the original NFC character. If not, then we yield whatever degarbling unidecode has suggested. This leaves accentuated letters alone, but will convert everything else.
import unidecode
import unicodedata
import re
def char_filter(string):
latin = re.compile('[a-zA-Z]+')
for char in unicodedata.normalize('NFC', string):
decoded = unidecode.unidecode(char)
if latin.match(decoded):
yield char
else:
yield decoded
def clean_string(string):
return "".join(char_filter(string))
print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé

You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). However, read the doc related to Unicode -- the translation table has another form: unicode ordinal number --> unicode string (usually char) or None.
Well, but it requires the dict. You have to capture the replacements anyway. How do you want to do that without any table or arrays? You could use str.replace() for the single characters, but this would be inefficient.

This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html
-S, --smart Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes,
and ... to ellipses. Nonbreaking spaces are inserted after certain
abbreviations, such as “Mr.” (Note: This option is significant only
when the input format is markdown or textile. It is selected
automatically when the input format is textile or the output format is
latex or context.)
It's haskell, so you'd have to figure out the interface.

Replace multi-byte character in python

Im trying to replace no-break space copied from Word with normal space, however nothing seem to work for me.
I have tried reading this space as unicode and hexadecimal, then replace it with normal. According to https://unicode-table.com/en/202F/ it is Narrow No-Break Space, but it look like this space is more than one character.
input.html looks like this (2x Narrow No-Break Space in front):
  n
My script:
with open('input.html', 'r+') as f:
copy = f.read()
for line in copy:
for char in line:
print(char, hex(ord(char)), end = ' ')
print(repr(char), ord(char))
Gives output:
â 0xe2 'â' 226
€ 0x20ac '€' 8364
Ż 0x17b 'Ż' 379
â 0xe2 'â' 226
€ 0x20ac '€' 8364
Ż 0x17b 'Ż' 379
n 0x6e 'n' 110
Tried to replace spaces with:
copy.replace(u"\u202f", ".")
copy.replace("\0xe2\0x20ac\0x17b", ".")
copy.replace(' ', '.')
and many more configurations, but nothing seems to actually work.
I'd like to have all no-break spaces as normal spaces in html file but I have no idea how to do it.
Edit:
Replaced spaces with:
copyb = bytes(copy, 'utf8')
copyb = copyb.replace(b'\xc3\xa2\xe2\x82\xac\xc5\xbb', b'.')
but since (if I'm right) copyb is an object, I don't understand why replace() doesn't work in my case simply this way (Python 3.7):
copyb = bytes(copy, 'utf8')
copyb.replace(b'\xc3\xa2\xe2\x82\xac\xc5\xbb', b'.')

this space is more than one character.
This space is more than one byte. UTF8 characters can be up to 4 bytes.
Bytes vs Strings
There also seems to be some confusion about the difference between strings and bytes objects. Eli Bendersky has a good article on the difference. To refer to a non-printable character in a bytes object, preface the two hex numbers by \x like '\x12', not '\0x12'.
For 0xe2, you might be thinking of a hex number, which is an int representation:
>>> 0x10
16
Replacing narrow no-break space
Your question is about replacing this character, so let's do that.
In a String
>>> mystr = 'a\u202fb'
>>> print(mystr)
a b
>>> mystr.replace('\u202f', '.')
'a.b'
In a Bytes Object
>>> mybytes = bytes('a\u202fb', 'utf8')
>>> print(mybytes)
b'a\xe2\x80\xafb'
>>> mybytes.replace(b'\xe2\x80\xaf', b'.')
b'a.b'

Remove Emoji's from multilingual Unicode text

I'm trying to remove just Emoji from Unicode text. I tried the various methods described in another Stack Overflow post but none of those are removing all emojis / smileys completely. For example:
Solution 1:
def remove_emoji(self, string):
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
"]+", flags=re.UNICODE)
return emoji_pattern.sub(r'', string)
Leaves in 🤝 in the following example:
Input: తెలంగాణ రియల్ ఎస్టేట్ 🤝👍
Output: తెలంగాణ రియల్ ఎస్టేట్ 🤝
Another attempt, solution 2:
def deEmojify(self, inputString):
returnString = ""
for character in inputString:
try:
character.encode("ascii")
returnString += character
except UnicodeEncodeError:
returnString += ''
return returnString
Results in removing any non-English character:
Input: 🏣Testరియల్ ఎస్టేట్ A.P&T.S. 🤝🏩🏣👍
Output: Test A.P&T.S.
It removes not only all of the emoji but it also removed the non-English characters because of the character.encode("ascii"); my non-English inputs can not be encoded into ASCII.
Is there any way to properly remove Emoji from international Unicode text?

The regex is outdated. It appears to cover Emoji's defined up to Unicode 8.0 (since U+1F91D HANDSHAKE was added in Unicode 9.0). The other approach is just a very inefficient method of force-encoding to ASCII, which is rarely what you want when just removing Emoji (and can be much more easily and efficiently achieved with text.encode('ascii', 'ignore').decode('ascii')).
If you need a more up-to-date regex, take one from a package that is actively trying to keep up-to-date on Emoji; it specifically supports generating such a regex:
import emoji
def remove_emoji(text):
return emoji.get_emoji_regexp().sub(u'', text)
The package is currently up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade along when there is a new release.
Demo using your sample inputs:
>>> print(remove_emoji(u'తెలంగాణ రియల్ ఎస్టేట్ 🤝👍'))
తెలంగాణ రియల్ ఎస్టేట్
>>> print(remove_emoji(u'🏣Testరియల్ ఎస్టేట్ A.P&T.S. 🤝🏩🏣👍'))
Testరియల్ ఎస్టేట్ A.P&T.S.
Note that the regex works on Unicode text, for Python 2 make sure you have decoded from str to unicode, for Python 3, from bytes to str first.
Emoji are complex beasts these days. The above will remove complete, valid Emoji. If you have 'incomplete' Emoji components such as skin-tone codepoints (meant to be combined with specific Emoji only) then you'll have much more trouble removing those. The skin-tone codepoints are easy (just remove those 5 codepoints afterwards), but there is a whole host of combinations that are made up of innocent characters such as ♀ U+2640 FEMALE SIGN or ♂ U+2642 MALE SIGN together with variant selectors and the U+200D ZERO-WIDTH JOINER that have specific meaning in other contexts too, and you can't just regex those out, not unless you don't mind breaking text using Devanagari, or Kannada or CJK ideographs, to name just a few examples.
That said, the following Unicode 11.0 codepoints are probably safe to remove (based on filtering the Emoji_Component Emoji-data designation):
20E3 ; (⃣) combining enclosing keycap
FE0F ; () VARIATION SELECTOR-16
1F1E6..1F1FF ; (🇦..🇿) regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF ; (🏻..🏿) light skin tone..dark skin tone
1F9B0..1F9B3 ; (🦰..🦳) red-haired..white-haired
E0020..E007F ; (󠀠..󠁿) tag space..cancel tag
which can be removed by creating a new regex to match those:
import re
try:
uchr = unichr # Python 2
import sys
if sys.maxunicode == 0xffff:
# narrow build, define alternative unichr encoding to surrogate pairs
# as unichr(sys.maxunicode + 1) fails.
def uchr(codepoint):
return (
unichr(codepoint) if codepoint <= sys.maxunicode else
unichr(codepoint - 0x010000 >> 10 | 0xD800) +
unichr(codepoint & 0x3FF | 0xDC00)
)
except NameError:
uchr = chr # Python 3
# Unicode 11.0 Emoji Component map (deemed safe to remove)
_removable_emoji_components = (
(0x20E3, 0xFE0F), # combining enclosing keycap, VARIATION SELECTOR-16
range(0x1F1E6, 0x1F1FF + 1), # regional indicator symbol letter a..regional indicator symbol letter z
range(0x1F3FB, 0x1F3FF + 1), # light skin tone..dark skin tone
range(0x1F9B0, 0x1F9B3 + 1), # red-haired..white-haired
range(0xE0020, 0xE007F + 1), # tag space..cancel tag
)
emoji_components = re.compile(u'({})'.format(u'|'.join([
re.escape(uchr(c)) for r in _removable_emoji_components for c in r])),
flags=re.UNICODE)
then update the above remove_emoji() function to use it:
def remove_emoji(text, remove_components=False):
cleaned = emoji.get_emoji_regexp().sub(u'', text)
if remove_components:
cleaned = emoji_components.sub(u'', cleaned)
return cleaned

The emoji.get_emoji_regexp() is outdated.
If you want to remove emoji from strings, you can use emoji.replace_emoji() as shown in the examples below.
import emoji
def remove_emoji(string):
return emoji.replace_emoji(string, '')
Visit https://carpedm20.github.io/emoji/docs/api.html#emoji.replace_emoji

If you use the regex library instead of the re library you get access to Unicode properties then you can change your function to
def remove_emoji(self, string):
emoji_pattern = re.compile("[\P{L}&&\P{D}&&\P{Z}&&\P{M}]", flags=re.UNICODE)
return emoji_pattern.sub(r'', string)
Which will keep all letters, digits, separators and marks (accents)

How to get python to accept unicode character 0x2000 (and others)

I am trying to remove certain characters from a string in Python. I have a list of characters or range of characters that I need removed, represented in hexidecimal like so:
- "0x00:0x20"
- "0x7F:0xA0"
- "0x1680"
- "0x180E"
- "0x2000:0x200A"
I am turning this list into a regular expression that looks like this:
re.sub(u'[\x00-\x20 \x7F-\xA0 \x1680 \x180E \x2000-\x200A]', ' ', my_str)
However, I am getting an error when I have \x2000-\x200A in there.
I have found that Python does not actually interpret u'\x2000' as a character:
>>> '\x2000'
' 00'
It is treating it like 'x20' (a space) and whatever else is after it:
>>> '\x20blah'
' blah'
x2000 is a valid unicode character:
http://www.unicodemap.org/details/0x2000/index.html
I would like Python to treat it that way so I can use re to remove it from strings.
As an alternative, I would like to know of another way to remove these characters from strings.
I appreciate any help. Thanks!

In a unicode string, you need to specify unicode characters(\uNNNN not \xNNNN). The following works:
>>> import re
>>> my_str=u'\u2000abc'
>>> re.sub(u'[\x00-\x20 \x7F-\xA0 \u1680 \u180E \u2000-\u200A]', ' ', my_str)
' abc'

From the docs (https://docs.python.org/2/howto/unicode.html):
Unicode literals can also use the same escape sequences as 8-bit
strings, including \x, but \x only takes two hex digits so it can’t
express an arbitrary code point. Octal escapes can go up to U+01ff,
which is octal 777.
>>> s = u"a\xac\u1234\u20ac\U00008000"
... # ^^^^ two-digit hex escape
... # ^^^^^^ four-digit Unicode escape
... # ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768

Encode binary data so that \n is escaped

I'm trying to work out a way to encode/decode binary data in such a way that the new line character is not part of the encoded string.
It seems to be a recursive problem, but I can't seem to work out a solution.
e.g. A naive implementation:
>>> original = 'binary\ndata'
>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = original.replace('=n', '\n')
'binary\ndata'
What happens if there is already a =n in the original string?
>>> original = 'binary\ndata=n'
>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = original.replace('=n', '\n')
'binary\ndata\n' # wrong
Try to escape existing =n's, but then what happens if there is already an escaped =n?
>>> original = '++nbinary\ndata=n'
>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'
How can I get around this recursive problem?

Solution
original = 'binary\ndata \\n'
# encoded = original.encode('string_escape') # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n') # escape \n and \\
decoded = encoded.decode('string_escape')
verified
>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n
The solution is from How do I un-escape a backslash-escaped string in python?
Edit: I wrote it also with your ad-hoc economic encoding. The original "string_escape" codec escapes backslash, apostrophe and everything below chr(32) and above chr(126). Decoding is the same for both.

The way to encode strings that might contain the "escape" character is to escape the escape character as well. In python, the escape character is a backslash, but you could use anything you want. Your cost is one character for every occurrence of newline or the escape.
To avoid confusing you, I'll use forward slash:
# original
>>> print "slashes / and /newline/\nhere"
slashes / and /newline/
here
# encoding
>>> print "slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n")
slashes // and //newline///nhere
This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():
# decoding
>>> def decode(c):
# Expand this into a real mapping if you have more substitutions
return '\n' if c == '/n' else c[0]
>>> print "".join( decode(c) for c in re.findall(r"(/.|.)",
"slashes // and //newline///nhere"))
slashes / and /newline/
here
Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.

If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?

I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?
In [40]: original = 'binary\ndata\nmorestuff'
In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']
In [42]: encoded = original.replace('\n', '')
In [43]: encoded
Out[43]: 'binarydatamorestuff'
In [44]: decoded = list(encoded)
In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]
In [46]: decoded = ''.join(decoded)
In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'
Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.

If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.
The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.
Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.
Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).

How about:
In [8]: import urllib
In [9]: original = 'binary\ndata'
In [10]: encoded = urllib.quote(original)
In [11]: encoded
Out[11]: 'binary%0Adata'
In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'

The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

efficiently replace bad characters - python

I often work with utf-8 text containing characters like: \xc2\x99 \xc2\x95 \xc2\x85 etc These characters confuse other libraries I work with so need to be replaced. What is an efficient way to do this, rather than: text.replace('\xc2\x99', ' ').replace('\xc2\x85, '...')

If you want to remove all non-ASCII characters from a string, you can use text.encode("ascii", "ignore")

import unicodedata # Convert to unicode text_to_uncicode = unicode(text, "utf-8") # Convert back to ascii text_fixed = unicodedata.normalize('NFKD',text_to_unicode).encode('ascii','ignore')

These characters are not in ASCII Library and that is the reason why you are getting the errors. To avoid these errors, you can do the following while reading the file. import codecs f = codecs.open('file.txt', 'r',encoding='utf-8') To know more about these kind of errors, go through this link.

Related

How to deal with variations in apostrophe ’ and ' [duplicate]

Replace multi-byte character in python

Remove Emoji's from multilingual Unicode text

How to get python to accept unicode character 0x2000 (and others)

Encode binary data so that \n is escaped

Categories

Resources