Get str repr with double quotes Python - python

I'm using a small Python script to generate some binary data that will be used in a C header.
This data should be declared as a char[], and it would be nice if it could be encoded as a string (with the pertinent escape sequences for bytes outside the range of printable ASCII characters) to keep the header more compact than a decimal or hexadecimal array encoding would.
The problem is that when I print the repr of a Python string, it is delimited by single quotes, and C doesn't like that. The naive solution is to do:
'"%s"'%repr(data)[1:-1]
but that doesn't work when one of the bytes in the data happens to be a double quote, so I'd need them to be escaped too.
I think a simple replace('"', '\\"') could do the job, but maybe there's a better, more pythonic solution out there.
Extra point:
It would also be convenient to split the data into lines of approximately 80 characters, but again the simple approach of splitting the source string into chunks of size 80 won't work, as each non-printable character expands to between 2 and 4 characters in its escape sequence. Splitting the result into chunks of 80 after getting the repr won't help either, as it could split an escape sequence.
Any suggestions?

You could try json.dumps:
>>> import json
>>> print(json.dumps("hello world"))
"hello world"
>>> print(json.dumps('hëllo "world"!'))
"h\u00ebllo \"world\"!"
I don't know for sure whether json strings are compatible with C but at least they have a pretty large common subset and are guaranteed to be compatible with javascript;).

Better not hack the repr() but use the right encoding from the beginning. You can get the repr's encoding directly with the encoding string_escape
>>> "naïveté".encode("string_escape")
'na\\xc3\\xafvet\\xc3\\xa9'
>>> print _
na\xc3\xafvet\xc3\xa9
For escaping the "-quotes I think using a simple replace after escape-encoding the string is a completely unambiguous process:
>>> '"%s"' % 'data:\x00\x01 "like this"'.encode("string_escape").replace('"', r'\"')
'"data:\\x00\\x01 \\"like this\\""'
>>> print _
"data:\x00\x01 \"like this\""

If you're asking a python str for its repr, I don't think the type of quote is really configurable. From the PyString_Repr function in the python 2.6.4 source tree:
/* figure out which quote to use; single is preferred */
quote = '\'';
if (smartquotes &&
    memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
    !memchr(op->ob_sval, '"', Py_SIZE(op)))
    quote = '"';
So repr uses double quotes only if the string contains a single quote and no double quote.
I would try something like writing my own class to contain the string data instead of using the built in string to do it. One option would be deriving a class from str and writing your own repr:
class MyString(str):
    __slots__ = []
    def __repr__(self):
        return '"%s"' % self.replace('"', r'\"')
print repr(MyString(r'foo"bar'))
Or, don't use repr at all:
def ready_string(string):
    return '"%s"' % string.replace('"', r'\"')
print ready_string(r'foo"bar')
This simplistic quoting might not do the "right" thing if there's already an escaped quote in the string.

repr() isn't what you want. There's a fundamental problem: repr() can use any representation of the string that can be evaluated as Python to produce the string. That means, in theory, that it might decide to use any number of other constructs which wouldn't be valid in C, such as """long strings""".
This code is probably the right direction. I've used a default of wrapping at 140, which is a sensible value for 2009, but if you really want to wrap your code to 80 columns, just change it.
If unicode=True, it outputs a L"wide" string, which can store Unicode escapes meaningfully. Alternatively, you might want to convert Unicode characters to UTF-8 and output them escaped, depending on the program you're using them in.
def string_to_c(s, max_length = 140, unicode=False):
    ret = []
    # Try to split on whitespace, not in the middle of a word.
    split_at_space_pos = max_length - 10
    if split_at_space_pos < 10:
        split_at_space_pos = None
    position = 0
    if unicode:
        position += 1
        ret.append('L')
    ret.append('"')
    position += 1
    for c in s:
        newline = False
        if c == "\n":
            # Emit an escaped \n followed by a line continuation; a bare
            # backslash-newline would be spliced away by the C preprocessor.
            to_add = "\\n\\\n"
            newline = True
        elif ord(c) < 32 or 0x80 <= ord(c) <= 0xff:
            to_add = "\\x%02x" % ord(c)
        elif ord(c) > 0xff:
            if not unicode:
                raise ValueError, "string contains unicode character but unicode=False"
            to_add = "\\u%04x" % ord(c)
        elif "\\\"".find(c) != -1:
            to_add = "\\%c" % c
        else:
            to_add = c
        ret.append(to_add)
        position += len(to_add)
        if newline:
            position = 0
        if split_at_space_pos is not None and position >= split_at_space_pos and " \t".find(c) != -1:
            ret.append("\\\n")
            position = 0
        elif position >= max_length:
            ret.append("\\\n")
            position = 0
    ret.append('"')
    return "".join(ret)
print string_to_c("testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing", max_length = 20)
print string_to_c("Escapes: \"quote\" \\backslash\\ \x00 \x1f testing \x80 \xff")
print string_to_c(u"Unicode: \u1234", unicode=True)
print string_to_c("""New
lines""")
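For Python 3, where the data would normally be a bytes object, the quoting and line-splitting problems from the question can also be handled by building the literal one escape at a time. The following is only a sketch under stated assumptions: bytes_to_c_literal and its width parameter are hypothetical names, it uses three-digit octal escapes (so a following digit can never be absorbed into the escape, which can happen with \x), and it wraps by emitting adjacent string literals, which a C compiler concatenates.

```python
def bytes_to_c_literal(data: bytes, width: int = 80) -> str:
    """Render bytes as adjacent C string literals, one per output line.

    Hypothetical helper for illustration; `width` is the maximum line
    length including the surrounding double quotes.
    """
    lines = []
    current = ''
    for byte in data:
        ch = chr(byte)
        if ch in '"\\?':
            # Escape quote and backslash; escaping '?' avoids trigraphs.
            piece = '\\' + ch
        elif 32 <= byte < 127:
            piece = ch
        else:
            # Fixed-width octal escape for everything non-printable.
            piece = '\\%03o' % byte
        # Start a new literal before an escape would overflow the line,
        # so an escape sequence is never split across lines.
        if current and len(current) + len(piece) + 2 > width:
            lines.append('"%s"' % current)
            current = ''
        current += piece
    lines.append('"%s"' % current)
    return '\n'.join(lines)


print(bytes_to_c_literal(b'a"b\x00c\\'))
# prints: "a\"b\000c\\"
```

Because each output line is a complete literal, the result can be pasted after `static const char data[] =` and terminated with a semicolon. Note that the array then also contains the implicit trailing NUL byte.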

Related

How to convert Unicode to ASCII leaving out the non-convertible characters? [duplicate]

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char
def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:
''.join(filter(lambda x: x in printable, s))
An easy way to change to a different codec, is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>> s = u'Good bye in Swedish is Hej d\xe5'
>>> s = s.encode('ascii',errors='ignore')
>>> print s
Good bye in Swedish is Hej d
Edit:
Python3: str -> bytes -> str
>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'
Python2: unicode -> str -> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
According to @artfulrobot, this should be faster than filter and lambda:
import re
re.sub(r'[^\x00-\x7f]',r'', your_non_ascii_string)
See more examples here Replace non-ASCII characters with a single space
You may use the following code to remove non-ASCII characters:
import re
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', text)
print(result)
This will print
123456790 ABC#%? .()
Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2' ... is that what you really want?
A better solution would include:
a better name for the filter function than onlyascii
recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
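On Python 3, filter() returns an iterator rather than a string, so the same predicate is often written as a generator expression passed to str.join. A minimal sketch, assuming the goal is to keep printable ASCII plus newlines (keep_printable_ascii is a hypothetical name, not from the answers here):

```python
def keep_printable_ascii(text: str) -> str:
    """Keep newlines and the printable ASCII range 32..126; drop the rest."""
    return ''.join(
        ch for ch in text
        if ch == '\n' or 32 <= ord(ch) <= 126
    )


print(keep_printable_ascii("Hej d\u00e5.\nOK\x00"))
```

Spaces (32) and periods (46) fall inside the retained range, which is what the question asked for.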
Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
onlyascii = ''.join([s for s in data if ord(s) < 128])
onlymatch = ''.join([s for s in data if s in
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
If you want printable ASCII characters, you should probably correct your code to:
if ord(char) < 32 or ord(char) > 126: return ''
This is equivalent to string.printable (see the answer from @jterrace), except that it excludes tabs and line breaks ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn't correspond to the range in your question.
Another approach accepts either str or bytes and keeps only the printable ASCII characters:
from string import printable
def getOnlyCharacters(texts):
    _type = None
    result = ''
    if type(texts).__name__ == 'bytes':
        _type = 'bytes'
        texts = texts.decode('utf-8','ignore')
    else:
        _type = 'str'
        texts = bytes(texts, 'utf-8').decode('utf-8', 'ignore')
    texts = str(texts)
    for text in texts:
        if text in printable:
            result += text
    if _type == 'bytes':
        result = result.encode('utf-8')
    return result
text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)
print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri

Add a non escaped escape character to python bytearray

I have an API that demands that the quotation marks in my XML attributes be escaped, so <cmd id="1"> will not work; it requires <cmd id=\"1\">.
I have tried iterating through my string, for example:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'
Each time that I encounter a " (ascii 34) I will replace it with an escape character (ascii 92) and another quote. Infuriatingly this results in:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'
where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.
temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'
i = 0
j = 0
payload = bytearray(len(temp) + 4)
for char in temp:
    if char == 34:
        payload[i] = 92
        i += 1
        payload[i] = 34
        i += 1
        j += 1
    else:
        payload[i] = temp[j]
        i += 1
        j += 1
print(bytes(payload))
I would assume that character 92 would appear once but something is escaping the escape!
Your problem is the result of a very common misunderstanding for programmers new to Python.
When displaying the representation of a string (or bytes), as the interactive console does when you evaluate an expression, Python escapes the escape character (\) so that the output, when used in Python as a literal, would give you the exact same value.
So:
s = 'abc\\abc'
print(s)
Prints abc\abc, but on the interpreter you get:
>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'
Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.
Compare:
>>> repr(s)
"'abc\\\\abc'"
repr here prints the representation of the representation of s.
For bytes, things are further complicated because the representation is printed when using print, since print prints a string and a bytes needs to be decoded first, i.e.:
>>> print(some_bytes.decode('utf-8')) # or whatever the encoding is
In short: your code was doing what you wanted it to, it does not duplicate escape characters, you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values, you can simply:
>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
I won't pretend that b"\\\"" is intuitive, perhaps b'\\"' is better - but both require that you understand the difference between the representation of a string, or its printed value.
So, finally:
>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
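One way to convince yourself that only a single backslash byte was added per quote is to count bytes instead of reading the printed representation. A small sketch (the sample element is shortened from the question; the variable names are illustrative):

```python
original = b'<cmd id="1">'
escaped = original.replace(b'"', b'\\"')

# The real content: each of the two quotes gained exactly one byte (92).
print(len(escaped) - len(original))   # 2
print(escaped.count(b'\\'))           # 2

# The repr doubles each backslash, but only for display.
print(repr(escaped).count('\\'))      # 4

# Decoding and printing shows the actual characters.
print(escaped.decode('ascii'))        # <cmd id=\"1\">
```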

How to replace unicode characters in string with something else python?

I have a string that I got from reading a HTML webpage with bullets that have a symbol like "•" because of the bulleted list. Note that the text is an HTML source from a webpage using Python 2.7's urllib2.read(webaddress).
I know the unicode character for the bullet character as U+2022, but how do I actually replace that unicode character with something else?
I tried doing
str.replace("•", "something")
but it does not appear to work... how do I do this?
1. Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
2. Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "*")
3. Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")
(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, mind you that calling a string str shadows the built-in type str.)
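For comparison, here is a minimal Python 3 sketch of the same decode, replace, encode sequence. It assumes the page bytes are UTF-8 encoded; the sample text and variable names are made up for illustration.

```python
# Bytes as they might arrive from the network (UTF-8 is an assumption here).
raw = "First \u2022 second".encode("utf-8")

# bytes -> str, done once on input.
text = raw.decode("utf-8")

# str.replace works on code points, so the bullet can be named directly.
cleaned = text.replace("\u2022", "*")
print(cleaned)  # First * second

# Encode again only when writing the result back out.
output_bytes = cleaned.encode("utf-8")
```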
Encode string as unicode.
>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'
import re
regex = re.compile(u"\u2022", re.UNICODE)
newstring = regex.sub(something, yourstring)
Try this one. You will get the output as a normal string:
str.encode().decode('unicode-escape')
After that, you can perform any replacement:
str.replace('•','something')
str1 = "This is Python\u500cPool"
Encode the string to ASCII and replace all the non-ASCII characters with '?'.
str1 = str1.encode("ascii", "replace")
Decode the byte stream to string.
str1 = str1.decode(encoding="utf-8", errors="ignore")
Replace the question mark with the desired character.
str1 = str1.replace("?"," ")
Funnily enough, the answer is hidden among the other answers.
str.replace("•", "something")
would work if you use the right semantics.
str.replace(u"\u2022","something")
works wonders ;) Thanks to RParadox for the hint.
If you want to remove all \u characters, the code below may help:
def replace_unicode_character(self, content: str):
content = content.encode('utf-8')
if "\\x80" in str(content):
count_unicode = 0
i = 0
while i < len(content):
if "\\x" in str(content[i:i + 1]):
if count_unicode % 3 == 0:
content = content[:i] + b'\x80\x80\x80' + content[i + 3:]
i += 2
count_unicode += 1
i += 1
content = content.replace(b'\x80\x80\x80', b'')
return content.decode('utf-8')

How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in future, utf8 might support it as well.
But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.
My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?
I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.
In other words, I want a behavior quite similar to Python's own str.encode() method (when passing 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have an unicode string after filtering.
I DON'T want to escape the character before storing at the MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and unfeasible.
See also:
"Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
‘𠂉’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
[EDIT] Added tests about the proposed solutions
So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick testing to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et
import cProfile
import random
import re
# How many times to repeat each filtering
repeat_count = 256
# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90
# Total number of characters in this string
string_size = 8 * 1024
# Generating a random testing string
test_string = u''.join(
    unichr(random.randrange(32,
        0x10ffff if random.randrange(100) > normal_chars else 0x0fff
    )) for i in xrange(string_size) )
# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)
def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )
def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)
print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')
#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)
I did no test using itertools because... well... that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes or fewer in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogates. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)
Edit adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
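On Python 3, where every str is a sequence of full code points, the same idea can be written as a character class covering everything above U+FFFF. This is a sketch under that assumption, not the code that was benchmarked in the question; NON_BMP and replace_non_bmp are hypothetical names.

```python
import re

# Any code point above U+FFFF needs 4 bytes in UTF-8.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def replace_non_bmp(text: str, replacement: str = '\ufffd') -> str:
    """Replace every supplementary-plane character with `replacement`."""
    return NON_BMP.sub(replacement, text)


sample = 'a\U0001D41Fb'              # MATHEMATICAL BOLD SMALL F in the middle
filtered = replace_non_bmp(sample)
print(len(filtered.encode('utf-8')))  # 1 + 3 + 1 = 5 bytes
```

The result is still a str, which matches the requirement of filtering without encoding.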
You may skip the decoding and encoding steps and directly detect the value of the first byte (8-bit string) of each character. According to UTF-8:
#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you only need to check the value of only the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time and for smaller strings that mostly have 1-byte characters this could even be faster than the regular expression filtering.
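The lead-byte rule described above can also be sketched for Python 3 bytes objects, where indexing already yields integers. The helper names below are hypothetical, and the code assumes its input is valid UTF-8; it makes no attempt to recover from malformed sequences.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:            # 0xxxxxxx
        return 1
    if 0xC0 <= lead_byte < 0xE0:    # 110xxxxx
        return 2
    if 0xE0 <= lead_byte < 0xF0:    # 1110xxxx
        return 3
    if 0xF0 <= lead_byte < 0xF8:    # 11110xxx
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02x" % lead_byte)


def drop_4byte_sequences(data: bytes) -> bytes:
    """Copy data, skipping every 4-byte UTF-8 sequence."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n < 4:
            out += data[i:i + n]
        i += n
    return bytes(out)


encoded = 'a\u20ac\U0001F600b'.encode('utf-8')
print(drop_4byte_sequences(encoded).decode('utf-8'))  # a€b
```

Building a new bytearray avoids deleting from the middle of a list, which keeps the loop linear in the input size.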
And just for the fun of it, an itertools monstrosity :)
import functools as ft, itertools as it, operator as op
def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs= it.izip(unicode_string, it.repeat(u'\ufffd'))
    # is 65535 less than the argument, i.e. is the character outside the BMP?
    selector= ft.partial(op.lt, 65535)
    # using the character ordinals, return False (0) or True (1) based on `selector`
    indexer= it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that the Unicode standard 5.2 chapter 3 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93 """Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.""" However this proscription is as far as I know largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u'\ufffd':
u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
)
I'm guessing it's not the fastest, but quite straightforward (“pythonic” :) :
def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
NB: this code does not take into account the fact that Unicode has surrogate characters in the ranges U+D800-U+DFFF.
This does more than filter out 4-byte UTF-8 characters. It removes non-ASCII characters altogether, but tries to do so gently: NFKD normalization first decomposes characters into ASCII-compatible pieces where Unicode defines such a decomposition (accented letters become a base letter plus a combining mark, ligatures are expanded), and only what is left over is dropped. Note that characters with no decomposition, including the typographic apostrophes, quotation marks, and dashes that often arrive from Apple handhelds, are removed rather than replaced with their ASCII look-alikes.
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust, I use it with some more guards:
import unicodedata
def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible
    Args:
        value (string): input string, can contain unicode characters
    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts where NFKD defines a decomposition (for example
        accented letters lose their accents and ligatures are expanded) or
        taken out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value
    if isinstance(value, str):
        return value
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2 BTW.
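A Python 3 sketch of the same normalize-then-encode idea is shown below, mainly to illustrate which characters survive. The function name to_ascii is hypothetical; the behaviour relies only on unicodedata.normalize and the "ignore" error handler.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose with NFKD, then drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("caf\u00e9"))    # cafe  (the combining accent is dropped)
print(to_ascii("\ufb01le"))     # file  (the fi ligature is expanded)
print(to_ascii("it\u2019s"))    # its   (the curly apostrophe has no decomposition)
```

If typographic quotes and dashes should become their ASCII look-alikes instead of disappearing, they would need an explicit replacement step before the encode call.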

How to un-escape a backslash-escaped string? [duplicate]

Suppose I have a string which is a backslash-escaped version of another string. Is there an easy way, in Python, to unescape the string? I could, for example, do:
>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!
>>>
However that involves passing a (possibly untrusted) string to eval() which is a security risk. Is there a function in the standard lib which takes a string and produces a string with no security implications?
>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"
You can use ast.literal_eval which is safe:
Safely evaluate an expression node or a string containing a Python
expression. The string or node provided may only consist of the
following Python literal structures: strings, numbers, tuples, lists,
dicts, booleans, and None.
Like this:
>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
Hello,
world!
All given answers will break on general Unicode strings. The following works for Python3 in all cases, as far as I can tell:
from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')
print(result)
In recent Python versions, this also works without the import:
sample = u'mon€y\\nröcks'
result = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')
As suggested by obataku, you can also use the literal_eval method from the ast module like so:
import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(F'"{sample}"'))
Or like this when your string really contains a string literal (including the quotes):
import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))
However, if you are uncertain whether the input string uses double or single quotes as delimiters, or when you cannot assume it to be properly escaped at all, then literal_eval may raise a SyntaxError while the encode/decode method will still work.
In python 3, str objects don't have a decode method and you have to use a bytes object. ChristopheD's answer covers python 2.
# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str, e.g. 'latin1')
my_bytes = my_str.encode("utf-8")
# or directly
my_bytes = b"Hello,\\nworld"
print(my_bytes.decode("unicode_escape"))
# Hello,
# world
For Python3, consider:
my_string.encode('raw_unicode_escape').decode('unicode_escape')
The 'raw_unicode_escape' codec encodes to latin1, but first replaces all other Unicode code points with an escaped '\uXXXX' or '\UXXXXXXXX' form. Importantly, it differs from the normal 'unicode_escape' codec in that it does not touch existing backslashes.
So when the normal 'unicode_escape' decoder is applied, both the newly-escaped code points and the originally-escaped elements are treated equally, and the result is an unescaped native Unicode string.
(The 'raw_unicode_escape' decoder appears to pay attention only to the '\uXXXX' and '\UXXXXXXXX' forms, ignoring all other escapes.)
Documentation:
https://docs.python.org/3/library/codecs.html?highlight=codecs#text-encodings
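The round trip described above can be tried on a string that mixes real non-ASCII characters with a backslash escape. This is a minimal sketch; the wrapper name unescape is hypothetical, and the sample text is borrowed from an earlier answer.

```python
def unescape(text: str) -> str:
    """Interpret backslash escapes in text while keeping non-ASCII characters."""
    # raw_unicode_escape leaves existing backslashes alone and turns code
    # points above U+00FF into \uXXXX; unicode_escape then decodes all of it.
    return text.encode('raw_unicode_escape').decode('unicode_escape')


sample = 'mon\u20acy\\nr\u00f6cks'   # contains a literal backslash followed by n
result = unescape(sample)
print(result)
```

Both the euro sign and the o-umlaut come through unchanged, and the two-character sequence backslash-n becomes a real newline.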
A custom string parser to decode only some backslash-escapes, in this case \" and \':
def backslash_decode(src):
    "decode backslash-escapes"
    slashes = 0 # count backslashes
    dst = ""
    for loc in range(0, len(src)):
        char = src[loc]
        if char == "\\":
            slashes += 1
            if slashes == 2:
                dst += char # decode backslash
                slashes = 0
        elif slashes == 0:
            dst += char # normal char
        else: # slashes == 1
            if char == '"':
                dst += char # decode double-quote
            elif char == "'":
                dst += char # decode single-quote
            else:
                dst += "\\" + char # keep backslash-escapes like \n or \t
            slashes = 0
    return dst
src = "a" + "\\\\" + r"\'" + r'\"' + r"\n" + r"\t" + r"\x" + "z" # input
exp = "a" + "\\" + "'" + '"' + r"\n" + r"\t" + r"\x" + "z" # expected output
res = backslash_decode(src)
print(res)
assert res == exp
