How to replace unicode characters in string with something else python?

I have a string that I got from reading an HTML webpage with bullets that have a symbol like "•" because of the bulleted list. Note that the text is HTML source from a webpage, read with Python 2.7's urllib2.urlopen(webaddress).read().
I know the unicode character for the bullet character as U+2022, but how do I actually replace that unicode character with something else?
I tried doing
str.replace("•", "something")
but it does not appear to work... how do I do this?

1. Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
2. Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "*")
3. Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")
(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, mind you that calling a string str shadows the built-in type str.)
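For comparison, here is a minimal Python 3 sketch of the same decode, replace, encode sequence. The variable `raw` is a made-up stand-in for the bytes an HTTP read would return, and UTF-8 is an assumption about the page's encoding.

```python
# Stand-in for the bytes a Python 3 HTTP read would return (assumed UTF-8).
raw = "• first item\n• second item".encode("utf-8")

text = raw.decode("utf-8")          # bytes -> str
text = text.replace("\u2022", "*")  # str literals are already Unicode in Python 3

print(text)

# Encode again only at the I/O boundary, if bytes are required.
out = text.encode("utf-8")
```

In Python 3 the replace call needs no u prefix, because every str is Unicode.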

Make the string a unicode string:
>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'

import re
# the pattern must contain the actual character, not the text "u'2022'"
regex = re.compile(u"\u2022", re.UNICODE)
newstring = regex.sub(something, yourstring)

Try this one. It turns literal escape sequences such as \u2022 into the actual characters and gives you a normal string:
str.encode().decode('unicode-escape')
After that, you can perform any replacement:
str.replace('•','something')

str1 = "This is Python\u500cPool"
Encode the string to ASCII, replacing every non-ASCII character with '?'.
str1 = str1.encode("ascii", "replace")
Decode the bytes back to a string.
str1 = str1.decode(encoding="utf-8", errors="ignore")
Replace the question mark with the desired character.
str1 = str1.replace("?"," ")
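One caveat with the encode-then-replace route above: it also turns any genuine question mark in the text into the replacement character. A minimal sketch of one way around that, assuming Python 3 and swapping in a regular expression instead of the ASCII round trip (the sample sentence is invented for illustration):

```python
import re

str1 = "Is this Python\u500cPool?"

# Replace each non-ASCII character with a space in one step,
# leaving the real question mark at the end untouched.
cleaned = re.sub(r"[^\x00-\x7f]", " ", str1)
print(cleaned)  # Is this Python Pool?
```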

Funny the answer is hidden in among the answers.
str.replace("•", "something")
would work if you use the right semantics.
str.replace(u"\u2022","something")
works wonders ;) , thnx to RParadox for the hint.

If you want to remove all \u characters, the code below is for you:
def replace_unicode_character(self, content: str):
content = content.encode('utf-8')
if "\\x80" in str(content):
count_unicode = 0
i = 0
while i < len(content):
if "\\x" in str(content[i:i + 1]):
if count_unicode % 3 == 0:
content = content[:i] + b'\x80\x80\x80' + content[i + 3:]
i += 2
count_unicode += 1
i += 1
content = content.replace(b'\x80\x80\x80', b'')
return content.decode('utf-8')

Related

How to convert Unicode to ASCII leaving out the non-convertible characters? [duplicate]

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char
def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.

You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter returns an iterator rather than a string. The correct way to obtain a string back would be:
''.join(filter(lambda x: x in printable, s))

An easy way to change to a different codec, is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>> s = u'Good bye in Swedish is Hej d\xe5'
>>> s = s.encode('ascii', errors='ignore')
>>> print s
Good bye in Swedish is Hej d
Edit:
Python3: str -> bytes -> str
>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'
Python2: unicode -> str -> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
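The errors argument decides what happens to characters the target codec cannot represent. A small Python 3 sketch comparing three of the standard handlers on the same sample string (the variable names are illustrative only):

```python
s = "Hej då"

ignored = s.encode("ascii", errors="ignore").decode("ascii")
replaced = s.encode("ascii", errors="replace").decode("ascii")
escaped = s.encode("ascii", errors="backslashreplace").decode("ascii")

print(ignored)   # Hej d
print(replaced)  # Hej d?
print(escaped)   # Hej d\xe5
```

"ignore" drops the character silently, so "replace" or "backslashreplace" may be preferable when you need to see where data was lost.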

According to @artfulrobot, this should be faster than filter and lambda:
import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)
See more examples in "Replace non-ASCII characters with a single space".

You may use the following code to remove non-ASCII characters:
import re
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)
This will print
123456790 ABC#%? .()

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2' ... is that what you really want?
A better solution would include:
a better name for the filter function than onlyascii
recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
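The snippet above relies on Python 2, where filter applied to a string returns a string. A sketch of the same idea for Python 3, where the result has to be joined back together; `keep_char` and the sample `data` are illustrative names and values, not part of the original question:

```python
def keep_char(char):
    # Keep newlines plus the printable ASCII range (space through tilde).
    return char == "\n" or 32 <= ord(char) <= 126

data = "Line 1: caf\u00e9.\nLine 2: na\u00efve\ttext."
filtered_data = "".join(filter(keep_char, data)).lower()
print(filtered_data)
```

Spaces, periods and newlines survive; the accented letters and the tab are dropped.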

Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
onlyascii = ''.join([s for s in data if ord(s) < 128])
onlymatch = ''.join([s for s in data if s in
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])

If you want printable ASCII characters, you should probably correct your code to:
if ord(char) < 32 or ord(char) > 126: return ''
This is equivalent to string.printable (see the answer from @jterrace), except that it excludes tabs and line breaks ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn't correspond to the range in your question.

This keeps only the printable ASCII characters and accepts both bytes and str input:
from string import printable
def getOnlyCharacters(texts):
    _type = None
    result = ''
    if type(texts).__name__ == 'bytes':
        _type = 'bytes'
        texts = texts.decode('utf-8','ignore')
    else:
        _type = 'str'
        texts = bytes(texts, 'utf-8').decode('utf-8', 'ignore')
    texts = str(texts)
    for text in texts:
        if text in printable:
            result += text
    if _type == 'bytes':
        result = result.encode('utf-8')
    return result
text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)
print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri

Add a non escaped escape character to python bytearray

I have an API that is demanding that the quotation marks in my XML attributes are escaped, so <cmd_id="1"> will not work, it requires <cmd_id=\"1\">.
I have tried iterating through my string, for example:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'
Each time that I encounter a " (ascii 34) I will replace it with an escape character (ascii 92) and another quote. Infuriatingly this results in:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'
where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.
temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'
i = 0
j = 0
payload = bytearray(len(temp) + 4)
for char in temp:
    if char == 34:
        payload[i] = 92
        i += 1
        payload[i] = 34
        i += 1
        j += 1
    else:
        payload[i] = temp[j]
        i += 1
        j += 1
print(bytes(payload))
I would assume that character 92 would appear once but something is escaping the escape!

Your problem is the result of a very common misunderstanding for programmers new to Python.
When printing a string (or bytes) to the console, Python escapes the escape character (\) to show a string that, when used in Python as a literal, would give you the exact same value.
So:
s = 'abc\\abc'
print(s)
Prints abc\abc, but on the interpreter you get:
>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'
Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.
Compare:
>>> repr(s)
"'abc\\\\abc'"
repr here prints the representation of the representation of s.
For bytes, things are further complicated because the representation is printed when using print, since print prints a string and a bytes needs to be decoded first, i.e.:
>>> print(some_bytes.decode('utf-8')) # or whatever the encoding is
In short: your code was doing what you wanted it to, it does not duplicate escape characters, you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values, you can simply:
>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
I won't pretend that b"\\\"" is intuitive, perhaps b'\\"' is better - but both require that you understand the difference between the representation of a string, or its printed value.
So, finally:
>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
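To check that the doubled backslash is only a display artifact, you can inspect the bytes themselves instead of their representation. A short sketch, using a shortened stand-in for the XML payload from the question:

```python
payload = b'<cmd id="1">'.replace(b'"', b'\\"')

# repr() doubles the backslash for display; the data holds a single byte 92.
print(repr(payload))            # b'<cmd id=\\"1\\">'
print(payload.decode("utf-8"))  # <cmd id=\"1\">

# Two quotes were escaped, so exactly two backslash bytes were added.
print(len(payload))             # 14
print(list(payload[8:10]))      # [92, 34]
```

The length grows from 12 to 14, one extra byte per quote, which would not be the case if the backslashes had really been doubled.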

Remove \u from string?

I have a few words in a list that are of the type '\uword'. I want to replace the '\u' with an empty string. I looked around on SO but nothing has worked for me so far. I tried converting to a raw string using "%r"%word but that didn't work. I also tried using word.encode('unicode-escape') but haven't gotten anywhere. Any ideas?
EDIT
Adding code
word = '\u2019'
word.encode('unicode-escape')
print(word) # error
word = '\u2019'
word = "%r"%word
print(word) # error

I was making an error in assuming that the .encode method of strings modifies the string inplace similar to the .sort() method of a list. But according to the documentation
The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.
def remove_u(word):
    word_u = (word.encode('unicode-escape')).decode("utf-8", "strict")
    if r'\u' in word_u:
        # print(True)
        return word_u.split('\\u')[1]
    return word
vocabulary_ = [remove_u(each_word) for each_word in vocabulary_]

Given that you are dealing with strings only.
We can simply convert it to string using the string function.
>>> string = u"your string"
>>> string
u'your string'
>>> str(string)
'your string'
Guess this will do!

If I have correctly understood, you don't have to use regular expressions. Just try:
>>> # string = '\u2019'
>>> char = string.decode('unicode-escape')
>>> print format(ord(char), 'x')
2019
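The session above is Python 2. In Python 3, str has no decode method, so a rough equivalent, assuming the word is the single character written as '\u2019' in source code, could look like this:

```python
word = "\u2019"  # one character: RIGHT SINGLE QUOTATION MARK

# Hex digits of the code point, zero-padded to four places.
code_point = format(ord(word), "04x")
print(code_point)  # 2019

# The same digits via the escape codec, minus the leading backslash-u.
escaped = word.encode("unicode_escape").decode("ascii")
print(escaped[2:])  # 2019
```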

Because you are facing problems with encodings and unicode it would be helpful to know the version of python you are using.
I don't know if I get you right but this should do the trick:
string = r'\uword'
string = string.replace(r'\u', '')  # replace returns a new string

How to un-escape a backslash-escaped string? [duplicate]

Suppose I have a string which is a backslash-escaped version of another string. Is there an easy way, in Python, to unescape the string? I could, for example, do:
>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!
>>>
However that involves passing a (possibly untrusted) string to eval() which is a security risk. Is there a function in the standard lib which takes a string and produces a string with no security implications?

>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"

You can use ast.literal_eval which is safe:
Safely evaluate an expression node or a string containing a Python
expression. The string or node provided may only consist of the
following Python literal structures: strings, numbers, tuples, lists,
dicts, booleans, and None.
Like this:
>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
Hello,
world!

All given answers will break on general Unicode strings. The following works for Python3 in all cases, as far as I can tell:
from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')
print(result)
In recent Python versions, this also works without the import:
sample = u'mon€y\\nröcks'
result = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')
As suggested by obataku, you can also use the literal_eval method from the ast module like so:
import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(F'"{sample}"'))
Or like this when your string really contains a string literal (including the quotes):
import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))
However, if you are uncertain whether the input string uses double or single quotes as delimiters, or when you cannot assume it to be properly escaped at all, then literal_eval may raise a SyntaxError while the encode/decode method will still work.
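To illustrate why the latin-1 plus backslashreplace step matters, here is a small Python 3 sketch that contrasts it with a naive UTF-8 round trip on the same sample string. The names `naive` and `robust` are only for illustration.

```python
sample = "mon€y\\nröcks"  # contains a literal backslash followed by 'n'

# Naive route: the unicode_escape decoder reads the UTF-8 bytes as Latin-1,
# so the euro sign and the umlaut come out garbled.
naive = sample.encode("utf-8").decode("unicode_escape")

# Route from the answer above: characters outside Latin-1 become \uXXXX
# escapes first, and the decoder then restores them.
robust = sample.encode("latin-1", "backslashreplace").decode("unicode_escape")

print(repr(naive))
print(repr(robust))  # 'mon€y\nröcks'
```

Both variants turn the escaped newline into a real one, but only the second preserves the non-ASCII characters.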

In python 3, str objects don't have a decode method and you have to use a bytes object. ChristopheD's answer covers python 2.
# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str, e.g. 'latin1')
my_bytes = my_str.encode("utf-8")
# or directly
my_bytes = b"Hello,\\nworld"
print(my_bytes.decode("unicode_escape"))
# Hello,
# world

For Python3, consider:
my_string.encode('raw_unicode_escape').decode('unicode_escape')
The 'raw_unicode_escape' codec encodes to latin1, but first replaces all other Unicode code points with an escaped '\uXXXX' or '\UXXXXXXXX' form. Importantly, it differs from the normal 'unicode_escape' codec in that it does not touch existing backslashes.
So when the normal 'unicode_escape' decoder is applied, both the newly-escaped code points and the originally-escaped elements are treated equally, and the result is an unescaped native Unicode string.
(The 'raw_unicode_escape' decoder appears to pay attention only to the '\uXXXX' and '\UXXXXXXXX' forms, ignoring all other escapes.)
Documentation:
https://docs.python.org/3/library/codecs.html?highlight=codecs#text-encodings

custom string parser to decode only some backslash-escapes, in this case \" and \'
def backslash_decode(src):
    "decode backslash-escapes"
    slashes = 0 # count backslashes
    dst = ""
    for loc in range(0, len(src)):
        char = src[loc]
        if char == "\\":
            slashes += 1
            if slashes == 2:
                dst += char # decode backslash
                slashes = 0
        elif slashes == 0:
            dst += char # normal char
        else: # slashes == 1
            if char == '"':
                dst += char # decode double-quote
            elif char == "'":
                dst += char # decode single-quote
            else:
                dst += "\\" + char # keep backslash-escapes like \n or \t
            slashes = 0
    return dst
src = "a" + "\\\\" + r"\'" + r'\"' + r"\n" + r"\t" + r"\x" + "z" # input
exp = "a" + "\\" + "'" + '"' + r"\n" + r"\t" + r"\x" + "z" # expected output
res = backslash_decode(src)
print(res)
assert res == exp

Get str repr with double quotes Python

I'm using a small Python script to generate some binary data that will be used in a C header.
This data should be declared as a char[], and it will be nice if it could be encoded as a string (with the pertinent escape sequences when they are not in the range of ASCII printable chars) to keep the header more compact than with a decimal or hexadecimal array encoding.
The problem is that when I print the repr of a Python string, it is delimited by single quotes, and C doesn't like that. The naive solution is to do:
'"%s"'%repr(data)[1:-1]
but that doesn't work when one of the bytes in the data happens to be a double quote, so I'd need them to be escaped too.
I think a simple replace('"', '\\"') could do the job, but maybe there's a better, more pythonic solution out there.
Extra point:
It would also be convenient to split the data into lines of approximately 80 characters, but again the simple approach of splitting the source string into chunks of size 80 won't work, as each non-printable character takes 2 or 3 characters in the escape sequence. Splitting the result into chunks of 80 after getting the repr won't help either, as that could split an escape sequence.
Any suggestions?

You could try json.dumps:
>>> import json
>>> print(json.dumps("hello world"))
"hello world"
>>> print(json.dumps('hëllo "world"!'))
"h\u00ebllo \"world\"!"
I don't know for sure whether json strings are compatible with C but at least they have a pretty large common subset and are guaranteed to be compatible with javascript;).

Better not hack the repr() but use the right encoding from the beginning. You can get the repr's encoding directly with the encoding string_escape
>>> "naïveté".encode("string_escape")
'na\\xc3\\xafvet\\xc3\\xa9'
>>> print _
na\xc3\xafvet\xc3\xa9
For escaping the "-quotes I think using a simple replace after escape-encoding the string is a completely unambiguous process:
>>> '"%s"' % 'data:\x00\x01 "like this"'.encode("string_escape").replace('"', r'\"')
'"data:\\x00\\x01 \\"like this\\""'
>>> print _
"data:\x00\x01 \"like this\""

If you're asking a python str for its repr, I don't think the type of quote is really configurable. From the PyString_Repr function in the python 2.6.4 source tree:
/* figure out which quote to use; single is preferred */
quote = '\'';
if (smartquotes &&
    memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
    !memchr(op->ob_sval, '"', Py_SIZE(op)))
    quote = '"';
So it uses double quotes if there is a single quote in the string, but not if there is also a double quote in the string.
I would try something like writing my own class to contain the string data instead of using the built in string to do it. One option would be deriving a class from str and writing your own repr:
class MyString(str):
    __slots__ = []
    def __repr__(self):
        return '"%s"' % self.replace('"', r'\"')
print repr(MyString(r'foo"bar'))
Or, don't use repr at all:
def ready_string(string):
    return '"%s"' % string.replace('"', r'\"')
print ready_string(r'foo"bar')
This simplistic quoting might not do the "right" thing if there's already an escaped quote in the string.

repr() isn't what you want. There's a fundamental problem: repr() can use any representation of the string that can be evaluated as Python to produce the string. That means, in theory, that it might decide to use any number of other constructs which wouldn't be valid in C, such as """long strings""".
This code is probably the right direction. I've used a default of wrapping at 140, which is a sensible value for 2009, but if you really want to wrap your code to 80 columns, just change it.
If unicode=True, it outputs a L"wide" string, which can store Unicode escapes meaningfully. Alternatively, you might want to convert Unicode characters to UTF-8 and output them escaped, depending on the program you're using them in.
def string_to_c(s, max_length = 140, unicode=False):
    ret = []
    # Try to split on whitespace, not in the middle of a word.
    split_at_space_pos = max_length - 10
    if split_at_space_pos < 10:
        split_at_space_pos = None
    position = 0
    if unicode:
        position += 1
        ret.append('L')
    ret.append('"')
    position += 1
    for c in s:
        newline = False
        if c == "\n":
            to_add = "\\\n"
            newline = True
        elif ord(c) < 32 or 0x80 <= ord(c) <= 0xff:
            to_add = "\\x%02x" % ord(c)
        elif ord(c) > 0xff:
            if not unicode:
                raise ValueError, "string contains unicode character but unicode=False"
            to_add = "\\u%04x" % ord(c)
        elif "\\\"".find(c) != -1:
            to_add = "\\%c" % c
        else:
            to_add = c
        ret.append(to_add)
        position += len(to_add)
        if newline:
            position = 0
        if split_at_space_pos is not None and position >= split_at_space_pos and " \t".find(c) != -1:
            ret.append("\\\n")
            position = 0
        elif position >= max_length:
            ret.append("\\\n")
            position = 0
    ret.append('"')
    return "".join(ret)
print string_to_c("testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing", max_length = 20)
print string_to_c("Escapes: \"quote\" \\backslash\\ \x00 \x1f testing \x80 \xff")
print string_to_c(u"Unicode: \u1234", unicode=True)
print string_to_c("""New
lines""")
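The code above is Python 2. As a rough Python 3 counterpart, here is a sketch that renders a bytes object as one or more adjacent C string literals, which a C compiler concatenates. The helper name `bytes_to_c_literal` and its `width` parameter are hypothetical, and the choice of three-digit octal escapes is an assumption made here: an octal escape takes at most three digits, so it cannot absorb a following digit the way a \x escape can.

```python
def bytes_to_c_literal(data: bytes, width: int = 80) -> str:
    """Hypothetical helper: render bytes as adjacent C string literals."""
    pieces = []
    for byte in data:
        char = chr(byte)
        if char in ('"', "\\"):
            pieces.append("\\" + char)      # escape quote and backslash
        elif 32 <= byte <= 126:
            pieces.append(char)             # printable ASCII passes through
        else:
            pieces.append("\\%03o" % byte)  # fixed-width octal escape
    lines, current = [], ""
    for piece in pieces:
        # Start a new literal before a piece would overflow the line,
        # so an escape sequence is never split across two literals.
        if current and len(current) + len(piece) > width - 2:
            lines.append('"' + current + '"')
            current = ""
        current += piece
    lines.append('"' + current + '"')
    return "\n".join(lines)

print(bytes_to_c_literal(b'data:\x00\x01 "like this"'))
```

Because the wrapping works on whole pieces rather than raw characters, each output line stays within `width` characters, including its two quotes.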
