I'm trying to convert binary data I have in Python (a gzipped protocol buffer object) to a hexadecimal string in string-escape fashion (e.g. \xFA\x1C ..).
I have tried both
repr(<mygzipfileobj>.getvalue())
as well as
<mygzipfileobj>.getvalue().encode('string-escape')
In both cases I end up with a string which is not made of HEX chars only.
\x86\xe3$T]\x0fPE\x1c\xaa\x1c8d\xb7\x9e\x127\xcd\x1a.\x88v ...
How can I achieve a consistent hexadecimal conversion where every single byte is actually translated to a \xHH format (where H represents a valid hex character, 0-9A-F)?
The \xhh format you often see is a debugging aid: it is the output of repr() applied to a string with non-ASCII codepoints. Any printable ASCII codepoints are left in place to preserve what readable information is there.
If you must have a string with all characters replaced by \xhh escapes, you need to do so manually:
''.join(r'\x{0:02x}'.format(ord(c)) for c in value)
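On Python 3, where iterating over a bytes object yields integers directly, the same idea might be sketched as follows. The helper name `to_hex_escapes` and the sample data are made up for illustration; the uppercase `X` format code is used here only because the question asks for the digits 0-9A-F.

```python
def to_hex_escapes(data: bytes) -> str:
    """Return every byte of data as a literal backslash-x escape (uppercase hex)."""
    # In Python 3, iterating over bytes yields ints, so no ord() call is needed.
    return ''.join('\\x{:02X}'.format(byte) for byte in data)


sample = b'\x86\xe3$T]\x0fPE'
escaped = to_hex_escapes(sample)
print(escaped)  # \x86\xE3\x24\x54\x5D\x0F\x50\x45
```

Printable bytes such as `$` and `T` are escaped too, which is exactly what repr() does not do.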
If you need quotes around that, you'd need to add those manually too:
"'{0}'".format(''.join(r'\x{:02x}'.format(ord(c)) for c in value))
Related
In a python source code I stumbled upon I've seen a small b before a string like in:
b"abcdef"
I know about the u prefix signifying a unicode string, and the r prefix for a raw string literal.
What does the b stand for and in which kind of source code is it useful as it seems to be exactly like a plain string without any prefix?
The b prefix signifies a bytes string literal.
If you see it used in Python 3 source code, the expression creates a bytes object, not a regular Unicode str object. If you see it echoed in your Python shell or as part of a list, dict or other container contents, then you see a bytes object represented using this notation.
bytes objects basically contain a sequence of integers in the range 0-255, but when represented, Python displays these bytes as ASCII codepoints to make it easier to read their contents. Any bytes outside the printable range of ASCII characters are shown as escape sequences (e.g. \n, \x82, etc.). Conversely, you can use both ASCII characters and escape sequences to define byte values; for ASCII characters their numeric value is used (e.g. b'A' == b'\x41').
Because a bytes object consists of a sequence of integers, you can construct a bytes object from any other sequence of integers with values in the 0-255 range, like a list:
bytes([72, 101, 108, 108, 111])
and indexing gives you back the integers (but slicing produces a new bytes value; for the above example, value[0] gives you 72, but value[:1] is b'H' as 72 is the ASCII code point for the capital letter H).
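The difference between indexing and slicing can be checked with a short Python 3 snippet; the variable name `value` follows the text above.

```python
value = bytes([72, 101, 108, 108, 111])

print(value)        # b'Hello'
print(value[0])     # 72, an int
print(value[:1])    # b'H', a bytes object of length 1
print(list(value))  # [72, 101, 108, 108, 111]
```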
bytes model binary data, including encoded text. If your bytes value does contain text, you need to first decode it, using the correct codec. If the data is encoded as UTF-8, for example, you can obtain a Unicode str value with:
strvalue = bytesvalue.decode('utf-8')
Conversely, to go from text in a str object to bytes you need to encode. You need to decide on an encoding to use; the default is to use UTF-8, but what you will need is highly dependent on your use case:
bytesvalue = strvalue.encode('utf-8')
You can also use the constructor, bytes(strvalue, encoding) to do the same.
Both the decoding and encoding methods take an extra argument to specify how errors should be handled.
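As a rough illustration of that extra argument, the Python 3 sketch below encodes and decodes a sample string with two different `errors` settings. The sample text and variable names are chosen here for demonstration only.

```python
text = 'caf\u00e9'  # 'cafe' with an acute accent on the e

utf8_bytes = text.encode('utf-8')
print(utf8_bytes)  # b'caf\xc3\xa9'

# The constructor form gives the same result as str.encode()
print(bytes(text, 'utf-8') == utf8_bytes)  # True

# ASCII cannot represent the accented character; errors='replace'
# substitutes '?' when encoding
ascii_bytes = text.encode('ascii', errors='replace')
print(ascii_bytes)  # b'caf?'

# Decoding the UTF-8 bytes as ASCII fails under the default errors='strict'
try:
    utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print('strict decoding failed:', exc.reason)

# errors='replace' substitutes U+FFFD for each undecodable byte
lenient = utf8_bytes.decode('ascii', errors='replace')
print(ascii(lenient))  # 'caf\ufffd\ufffd'
```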
Python 2 versions 2.6 and 2.7 also support the b'..' string literal syntax, to make it easier to write code that works on both Python 2 and 3.
bytes objects are immutable, just like str strings are. Use a bytearray() object if you need to have a mutable bytes value.
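A small Python 3 sketch of the difference; the names `frozen` and `mutable` are illustrative only.

```python
frozen = b'Hello'
try:
    frozen[0] = 74  # bytes do not support item assignment
except TypeError as exc:
    print('bytes are immutable:', exc)

mutable = bytearray(frozen)  # copies the data into a mutable buffer
mutable[0] = 74              # 74 is the ASCII code for 'J'
mutable.extend(b'!')
print(mutable)               # bytearray(b'Jello!')
print(bytes(mutable))        # b'Jello!'
```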
This is a Python 3 bytes literal. The prefix does not exist in Python 2.5 and older. In Python 2.6+ it is accepted and is equivalent to a plain string, for compatibility with 3.x (a bytes literal in 3.x corresponds to a plain string in 2.x, while a plain string in 3.x corresponds to a literal with the u prefix in 2.x).
I was doing a few experiments with escape backslashes in the Python 3.4 shell and noticed something quite strange.
>>> string = "\test\test\1\2\3"
>>> string
'\test\test\x01\x02\x03'
>>> string = "5"
>>> string
'5'
>>> string = "5\6\7"
>>> string
'5\x06\x07'
As you can see in the above code, I defined a variable string as "\test\test\1\2\3". However, when I entered string in the console, instead of printing "\test\test\1\2\3", it printed "\test\test\x01\x02\x03". Why does this occur, and what is it used for?
In Python string literals, the \ character starts escape sequences. \n translates to a newline character, \t to a tab, etc. \xhh hex sequences let you produce codepoints with hex values instead, \uhhhh produce codepoints with 4-digit hex values, and \Uhhhhhhhh produce codepoints with 8-digit hex values.
See the String and Bytes Literals documentation, which contains a table of all the possible escape sequences.
When Python echoes a string object in the interpreter (or you use the repr() function on a string object), then Python creates a representation of the string value. That representation happens to use the exact same Python string literal syntax, to make it easier to debug your values, as you can use the representation to recreate the exact same value.
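The round trip can be sketched in Python 3 like this. `ast.literal_eval` is used here as one safe way to turn the representation back into a value; the sample string is arbitrary.

```python
import ast

original = 'tab\there\x01'
representation = repr(original)
print(representation)  # 'tab\there\x01'

# literal_eval parses the representation as a Python literal
recreated = ast.literal_eval(representation)
print(recreated == original)  # True
```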
To keep non-printable characters from either causing havoc or not being shown at all, Python uses the same escape sequence syntax to represent those characters. Thus characters that are not printable are represented using suitable \xhh sequences, or if possible, one of the \c single letter escapes (so newlines are shown as \n).
In your example, you created non-printable characters using the \ooo octal value escape sequence syntax. The digits are interpreted as an octal number to create a corresponding codepoint. When echoing that string value back, the default \xhh syntax is used to represent the exact same value in hexadecimal:
>>> '\20' # Octal for 16
'\x10'
while your \t became a tab character:
>>> print('\test')
        est
Note how there is no letter t there; instead, the remaining est is indented by whitespace, a horizontal tab.
If you need to include literal \ backslash characters you need to double the character:
>>> '\\test\\1\\2\\3'
'\\test\\1\\2\\3'
>>> print('\\test\\1\\2\\3')
\test\1\2\3
>>> len('\\test\\1\\2\\3')
11
Note that the representation used doubled backslashes! If it didn't, you'd not be able to copy the string and paste it back into Python to recreate the value. Using print() to write the value to the terminal as actual characters (and not as a string representation) shows that there are single backslashes there, and taking the length shows we have just 11 characters in the string, not 15.
You can also use a raw string literal. That's just a different syntax; the string objects created from it are the exact same type, with the same value. It is just a different way of spelling out string values. In a raw string literal, backslashes are just backslashes, as long as they are not the last character in the string; most escape sequences do not work in a raw string literal:
>>> r'\test\1\2\3'
'\\test\\1\\2\\3'
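That the two spellings produce the same object type and value can be verified directly; a minimal sketch with illustrative variable names:

```python
raw = r'\test\1\2\3'
doubled = '\\test\\1\\2\\3'

print(raw == doubled)              # True
print(type(raw) is type(doubled))  # True
print(len(raw))                    # 11
```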
Last but not least, if you are creating strings that represent filenames on your Windows system, you could also use forward slashes; most APIs in Windows don't mind and accept both types of slash as separators in the filename:
>>> 'C:/This/is/a/valid/path'
'C:/This/is/a/valid/path'
When you write
string = "\test\test\1\2\3"
Python interprets this as a string that starts with the tab character ("\t"), then the character "e", then "s", and so on. It also includes the non-printable characters with code points 1, 2, and 3, which the octal escapes "\1", "\2" and "\3" produce.
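Listing the code point of each character makes this visible. The snippet below is only a quick check, using the same variable name as the question.

```python
string = "\test\test\1\2\3"

# ord() gives the code point of each character in the string
print([ord(ch) for ch in string])
# [9, 101, 115, 116, 9, 101, 115, 116, 1, 2, 3]
print(len(string))  # 11
```

Code point 9 is the horizontal tab, and the final three entries are the characters produced by the octal escapes.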