Add a non escaped escape character to python bytearray - python

I have an API that is demanding that the quotation marks in my XML attributes are escaped, so <cmd_id="1"> will not work, it requires <cmd_id=\"1\">.
I have tried iterating through my string, for example:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'
Each time that I encounter a " (ascii 34) I will replace it with an escape character (ascii 92) and another quote. Infuriatingly this results in:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'
where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.
temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'
i = 0
j = 0
payload = bytearray(len(temp) + 4)
for char in temp:
if char == 34:
payload[i] = 92
i += 1
payload[i] = 34
i += 1
j += 1
else:
payload[i] = temp[j]
i += 1
j += 1
print(bytes(payload))
I would assume that character 92 would appear once but something is escaping the escape!

Your problem is the result of a very common misunderstanding for programmers new to Python.
When printing a string (or bytes) to the console, Python escapes the escape character (\) to show a string that, when used in Python as a literal, would give you the exact same value.
So:
s = 'abc\\abc'
print(s)
Prints abc\abc, but on the interpreter you get:
>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'
Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.
Compare:
>>> repr(s)
"'abc\\\\abc'"
repr here prints the representation of the representation of s.
For bytes, things are further complicated because the representation is printed when using print, since print prints a string and a bytes needs to be decoded first, i.e.:
>>> print(some_bytes.decode('utf-8')) # or whatever the encoding is
In short: your code was doing what you wanted it to, it does not duplicate escape characters, you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values, you can simply:
>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
I won't pretend that b"\\\"" is intuitive, perhaps b'\\"' is better - but both require that you understand the difference between the representation of a string, or its printed value.
So, finally:
>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>

Related

How to print unicode from a generator expression in python?

Create a list from generator expression:
V = [('\\u26' + str(x)) for x in range(63,70)]
First issue: if you try to use just "\u" + str(...) it gives a decoder error right away. Seems like it tries to decode immediately upon seeing the \u instead of when a full chunk is ready. I am trying to work around that with double backslash.
Second, that creates something promising but still cannot actually print them as unicode to console:
>>> print([v[0:] for v in V])
['\\u2663', '\\u2664', '\\u2665', .....]
>>> print(V[0])
\u2663
What I would expect to see is a list of symbols that look identical to when using commands like '\u0123' such as:
>>> print('\u2663')
♣
Any way to do that from a generated list? Or is there a better way to print them instead of the '\u0123' format?
This is NOT what I want. I want to see the actual symbols drawn, not the Unicode values:
>>> print(['{}'.format(v[0:]) for v in V])
['\\u2663', '\\u2664', '\\u2665', '\\u2666', '\\u2667', '\\u2668', '\\u2669']
Unicode is a character to bytes encoding, not escape sequences. Python 3 strings are Unicode. To return the character that corresponds to a Unicode code point use chr :
chr(i)
Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().
The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.
To generate the characters between 2663 and 2670:
>>> [chr(x) for x in range(2663,2670)]
['੧', '੨', '੩', '੪', '੫', '੬', '੭']
Escape sequences use hexadecimal notation though. 0x2663 is 9827 in decimal, and 0x2670 becomes 9840.
>>> [chr(x) for x in range(9827,9840)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
You can use also use hex numeric literals:
>>> [chr(x) for x in range(0x2663,0x2670)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
or, to use exactly the same logic as the question
>>> [chr(0x2600 + x) for x in range(0x63,0x70)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
The reason the original code doesn't work is that escape sequences are used to represent a single character in a string when we can't or don't want to type the character itself. The interpreter or compiler replaces them with the corresponding character immediatelly. The string \\u26 is an escaped \ followed by u, 2 and 6:
>>> len('\\u26')
4

Converting a hex values to ASCII

What is the most 'Pythonic' way of translating
'\xff\xab\x12'
into
'ffab12'
I looked for functions that can do it, but they all want to translate to ASCII (so '\x40' to 'a'). I want to have the hexadecimal digits in ASCII.
There's a module called binascii that contains functions for just this:
>>> import binascii
>>> binascii.hexlify('\xff\xab\x12')
'ffab12'
>>> binascii.unhexlify('ffab12')
'\xff\xab\x12'
original = '\xff\xab\x12'
result = original.replace('\\x', '')
print result
It's \x because it's escaped. a.replace(b,c) just replaces all occurances of b with c in a.
What you want is not ascii, because ascii translates 0x41 to 'A'. You just want it in hexadecimal base without the \x (or 0x, in some cases)
Edit!!
Sorry, I thought the \x is escaped. So, \x followed by 2 hex digits represents a single char, not 4..
print "\x41"
Will print
A
So what we have to do is to convert each char to hex, then print it like that:
res = ""
for i in original:
res += hex(ord(i))[2:].zfill(2)
print res
Now let's go over this line:
hex(ord(i))[2:]
ord(c) - returns the numerical value of the char c
hex(i) - returns the hex string value of the int i (e.g if i=65 it will return 0x41.
[2:] - cutting the "0x" prefix out of the hex string.
.zfill(2) - padding with zeroes
So, making that with a list comprehension will be much shorter:
result = "".join([hex(ord(c))[2:].zfill(2) for c in original])
print result

as I hold the string of hexadecimal format without encoding the string?

The general problem is that I need the hexadecimal string stays in that format to assign it to a variable and not to save the coding?
no good:
>>> '\x61\x74'
'at'
>>> a = '\x61\x74'
>>> a
'at'
works well, but is not as:
>>> '\x61\x74'
'\x61\x74' ????????
>>> a = '\x61\x74'
>>> a
'\x61\x74' ????????
Use r prefix (explained on SO)
a = r'\x61\x74'
b = '\x61\x74'
print (a) #prints \x61\x74
print (b) # prints at
It is the same data. Python lets you specify a literal string using different methods, one of which is to use escape codes to represent bytes.
As such, '\x61' is the same character value as 'a'. Python just chooses to show printable ASCII characters as printable ASCII characters instead of the escape code, just because that makes working with bytestrings that much easier.
If you need the literal slash, x character and the two digit 6 and 1 characters (so a string of length 4), you need to double the slash or use raw strings.
To illustrate:
>>> '\x61' == 'a' # two notations for the same value
True
>>> len('\x61') # it's just 1 character
1
>>> '\\x61' # escape the escape
'\\x61'
>>> r'\x61' # or use a raw literal instead
'\\x61'
>>> len('\\x61') # which produces 4 characters
4

Encode binary data so that \n is escaped

I'm trying to work out a way to encode/decode binary data in such a way that the new line character is not part of the encoded string.
It seems to be a recursive problem, but I can't seem to work out a solution.
e.g. A naive implementation:
>>> original = 'binary\ndata'
>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = original.replace('=n', '\n')
'binary\ndata'
What happens if there is already a =n in the original string?
>>> original = 'binary\ndata=n'
>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = original.replace('=n', '\n')
'binary\ndata\n' # wrong
Try to escape existing =n's, but then what happens if there is already an escaped =n?
>>> original = '++nbinary\ndata=n'
>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'
How can I get around this recursive problem?
Solution
original = 'binary\ndata \\n'
# encoded = original.encode('string_escape') # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n') # escape \n and \\
decoded = encoded.decode('string_escape')
verified
>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n
The solution is from How do I un-escape a backslash-escaped string in python?
Edit: I wrote it also with your ad-hoc economic encoding. The original "string_escape" codec escapes backslash, apostrophe and everything below chr(32) and above chr(126). Decoding is the same for both.
The way to encode strings that might contain the "escape" character is to escape the escape character as well. In python, the escape character is a backslash, but you could use anything you want. Your cost is one character for every occurrence of newline or the escape.
To avoid confusing you, I'll use forward slash:
# original
>>> print "slashes / and /newline/\nhere"
slashes / and /newline/
here
# encoding
>>> print "slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n")
slashes // and //newline///nhere
This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():
# decoding
>>> def decode(c):
# Expand this into a real mapping if you have more substitutions
return '\n' if c == '/n' else c[0]
>>> print "".join( decode(c) for c in re.findall(r"(/.|.)",
"slashes // and //newline///nhere"))
slashes / and /newline/
here
Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.
If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?
I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?
In [40]: original = 'binary\ndata\nmorestuff'
In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']
In [42]: encoded = original.replace('\n', '')
In [43]: encoded
Out[43]: 'binarydatamorestuff'
In [44]: decoded = list(encoded)
In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]
In [46]: decoded = ''.join(decoded)
In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'
Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.
If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.
The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.
Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.
Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).
How about:
In [8]: import urllib
In [9]: original = 'binary\ndata'
In [10]: encoded = urllib.quote(original)
In [11]: encoded
Out[11]: 'binary%0Adata'
In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'
The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.

Get str repr with double quotes Python

I'm using a small Python script to generate some binary data that will be used in a C header.
This data should be declared as a char[], and it will be nice if it could be encoded as a string (with the pertinent escape sequences when they are not in the range of ASCII printable chars) to keep the header more compact than with a decimal or hexadecimal array encoding.
The problem is that when I print the repr of a Python string, it is delimited by single quotes, and C doesn't like that. The naive solution is to do:
'"%s"'%repr(data)[1:-1]
but that doesn't work when one of the bytes in the data happens to be a double quote, so I'd need them to be escaped too.
I think a simple replace('"', '\\"') could do the job, but maybe there's a better, more pythonic solution out there.
Extra point:
It would be convenient too to split the data in lines of approximately 80 characters, but again the simple approach of splitting the source string in chunks of size 80 won't work, as each non printable character takes 2 or 3 characters in the escape sequence. Splitting the list in chunks of 80 after getting the repr won't help either, as it could divide escape sequence.
Any suggestions?
You could try json.dumps:
>>> import json
>>> print(json.dumps("hello world"))
"hello world"
>>> print(json.dumps('hëllo "world"!'))
"h\u00ebllo \"world\"!"
I don't know for sure whether json strings are compatible with C but at least they have a pretty large common subset and are guaranteed to be compatible with javascript;).
Better not hack the repr() but use the right encoding from the beginning. You can get the repr's encoding directly with the encoding string_escape
>>> "naïveté".encode("string_escape")
'na\\xc3\\xafvet\\xc3\\xa9'
>>> print _
na\xc3\xafvet\xc3\xa9
For escaping the "-quotes I think using a simple replace after escape-encoding the string is a completely unambiguous process:
>>> '"%s"' % 'data:\x00\x01 "like this"'.encode("string_escape").replace('"', r'\"')
'"data:\\x00\\x01 \\"like this\\""'
>>> print _
"data:\x00\x01 \"like this\""
If you're asking a python str for its repr, I don't think the type of quote is really configurable. From the PyString_Repr function in the python 2.6.4 source tree:
/* figure out which quote to use; single is preferred */
quote = '\'';
if (smartquotes &&
memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
!memchr(op->ob_sval, '"', Py_SIZE(op)))
quote = '"';
So, I guess use double quotes if there is a single quote in the string, but don't even then if there is a double quote in the string.
I would try something like writing my own class to contain the string data instead of using the built in string to do it. One option would be deriving a class from str and writing your own repr:
class MyString(str):
__slots__ = []
def __repr__(self):
return '"%s"' % self.replace('"', r'\"')
print repr(MyString(r'foo"bar'))
Or, don't use repr at all:
def ready_string(string):
return '"%s"' % string.replace('"', r'\"')
print ready_string(r'foo"bar')
This simplistic quoting might not do the "right" thing if there's already an escaped quote in the string.
repr() isn't what you want. There's a fundamental problem: repr() can use any representation of the string that can be evaluated as Python to produce the string. That means, in theory, that it might decide to use any number of other constructs which wouldn't be valid in C, such as """long strings""".
This code is probably the right direction. I've used a default of wrapping at 140, which is a sensible value for 2009, but if you really want to wrap your code to 80 columns, just change it.
If unicode=True, it outputs a L"wide" string, which can store Unicode escapes meaningfully. Alternatively, you might want to convert Unicode characters to UTF-8 and output them escaped, depending on the program you're using them in.
def string_to_c(s, max_length = 140, unicode=False):
ret = []
# Try to split on whitespace, not in the middle of a word.
split_at_space_pos = max_length - 10
if split_at_space_pos < 10:
split_at_space_pos = None
position = 0
if unicode:
position += 1
ret.append('L')
ret.append('"')
position += 1
for c in s:
newline = False
if c == "\n":
to_add = "\\\n"
newline = True
elif ord(c) < 32 or 0x80 <= ord(c) <= 0xff:
to_add = "\\x%02x" % ord(c)
elif ord(c) > 0xff:
if not unicode:
raise ValueError, "string contains unicode character but unicode=False"
to_add = "\\u%04x" % ord(c)
elif "\\\"".find(c) != -1:
to_add = "\\%c" % c
else:
to_add = c
ret.append(to_add)
position += len(to_add)
if newline:
position = 0
if split_at_space_pos is not None and position >= split_at_space_pos and " \t".find(c) != -1:
ret.append("\\\n")
position = 0
elif position >= max_length:
ret.append("\\\n")
position = 0
ret.append('"')
return "".join(ret)
print string_to_c("testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing", max_length = 20)
print string_to_c("Escapes: \"quote\" \\backslash\\ \x00 \x1f testing \x80 \xff")
print string_to_c(u"Unicode: \u1234", unicode=True)
print string_to_c("""New
lines""")

Categories

Resources