Say I have someString = "00". Basically, I want to convert someString to \x00.
I tried multiple ways to achieve this, but couldn't find one that works.
I tried:
HexString = '\x'+someString
This method throws this error:
ValueError: invalid \x escape
Unless I do HexString = r'\x'+someString, but then the value of HexString is \\x00, which is not what I want.
I also tried the hex() function, which had a few issues. The biggest one was that it returns 0x0, and it expects an int rather than a string.
Can anyone help me with converting a string("11") to \x11?
If I am understanding you correctly, the actual goal is to take a string that contains a bunch of pairs of hex digits, and translate each pair of hex digits into the corresponding byte and have a result of type bytes.
In 3.x, this is built directly into the bytes type itself:
>>> bytes.fromhex('11abcdef')
b'\x11\xab\xcd\xef'
Alternatively, you can use the binascii module from the standard library:
>>> import binascii
>>> binascii.unhexlify('11abcdef')
b'\x11\xab\xcd\xef'
You will not necessarily see a \x escape sequence for every byte value. This is normal and expected; it has to do with how the bytes object is represented as text for display purposes.
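As a small sketch of that display behaviour (the helper name hex_to_bytes is made up for this illustration and is not part of any library):

```python
def hex_to_bytes(hex_string):
    """Convert a string of hex digit pairs into a bytes object."""
    return bytes.fromhex(hex_string)

data = hex_to_bytes("410011")
# 0x41 is printable ASCII, so the repr shows it as 'A' rather than '\x41'.
print(data)        # b'A\x00\x11'
# The underlying values are unaffected by how the object is displayed.
print(list(data))  # [65, 0, 17]
```

The three bytes are the same either way; only the text shown for the first one differs.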
'\x'+someString
No approach of this general form can work, because it fundamentally misunderstands the problem. The output that you want is not a string, and a string literal like '\x00' does not have a backslash in it, nor a lowercase x - again, what you are seeing is how the string is represented as text, because not every character is printable.
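A quick way to check this for yourself (a minimal sketch; the variable names are only for illustration):

```python
nul = '\x00'        # one character: the NUL character
raw = r'\x00'       # four characters: backslash, x, 0, 0

print(len(nul))     # 1
print('\\' in nul)  # False: there is no backslash in the value
print(len(raw))     # 4
print(nul == raw)   # False
```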
int lets you set the base. For base 16
>>> someString = "00"
>>> int(someString, 16)
0
Of course, 0 is kinda boring because it works for all bases.
If you wanted a byte in a bytes object, you could
>>> import struct
>>> struct.pack("B", int(someString, 16))  # "B" is an unsigned byte, so values 0x80-0xff also work
b'\x00'
If you want a string (and I'm switching to 0x41 here) you could
>>> chr(int("41", 16))
'A'
You can get the character's code point by using int, then convert it to a character with chr. Then you can encode it to a bytes object without any import.
>>> chr(int("11", 16)) # a character
'\x11'
>>> chr(int("11", 16)).encode() # bytes object
b'\x11'
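One caveat with the chr(...).encode() route: .encode() defaults to UTF-8, so values of 0x80 and above come out as two bytes. A minimal sketch of an alternative that builds the byte directly (hex_pair_to_byte is a hypothetical helper name):

```python
def hex_pair_to_byte(pair):
    """Turn two hex digits such as "11" into a single-byte bytes object."""
    value = int(pair, 16)
    # bytes() of a list of ints creates one byte per int (each must be 0-255).
    return bytes([value])

print(hex_pair_to_byte("11"))       # b'\x11'
print(hex_pair_to_byte("ff"))       # b'\xff'
# For comparison: UTF-8 needs two bytes for code point 0xff.
print(chr(int("ff", 16)).encode())  # b'\xc3\xbf'
```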
I need to do some work with bytes in Python and I've come across a byte string I don't really understand:
b"H\x00\x84\xffQ\x00\xa6\xff+\x00\x96\xff\xc2\xffI\xff\xa5\xff'\xff\x8a\xff\x19\xff\x19\xff\xf6\xfe\xb0\xfe\xc7\xfeJ\xfel\xfe\xf8\xfd+\xfe\xef\xfd:\xfe\xc3\xfd*\xfe_\xfd\xdf\xfd\n\xfd\xa3\xfd\xc6\xfcq\xfd\xbd\xfc?\xfd"
So according to what I know, bytes should be represented as \xhh, where hh are hexadecimal values (from 0 to f). However, in the third segment, there is \xffQ, and farther on there are other characters which shouldn't appear: I, ', *, :, ? etc.
I used the hex() method to see what the outcome would be, and I got this:
480084ff5100a6ff2b0096ffc2ff49ffa5ff27ff8aff19ff19fff6feb0fec7fe4afe6cfef8fd2bfeeffd3afec3fd2afe5ffddffd0afda3fdc6fc71fdbdfc3ffd
As you can see, some parts of the hex are the same, but e.g. \xffQ was changed into ff51. I need to append some data to this byte string, so I'd like to know what's going on there (or how to get the same result).
Both repr and str when processing a bytes object will print ASCII characters where possible. Otherwise their hexadecimal values will be shown in the form \xNN.
It might help you to visualise the content if you print it as all hexadecimal as follows:
b = b"H\x00\x84\xffQ\x00\xa6\xff+\x00\x96\xff\xc2\xffI\xff\xa5\xff'\xff\x8a\xff\x19\xff\x19\xff\xf6\xfe\xb0\xfe\xc7\xfeJ\xfel\xfe\xf8\xfd+\xfe\xef\xfd:\xfe\xc3\xfd*\xfe_\xfd\xdf\xfd\n\xfd\xa3\xfd\xc6\xfcq\xfd\xbd\xfc?\xfd"
print(''.join(hex(b_) for b_ in b))
Output:
0x480x00x840xff0x510x00xa60xff0x2b0x00x960xff0xc20xff0x490xff0xa50xff0x270xff0x8a0xff0x190xff0x190xff0xf60xfe0xb00xfe0xc70xfe0x4a0xfe0x6c0xfe0xf80xfd0x2b0xfe0xef0xfd0x3a0xfe0xc30xfd0x2a0xfe0x5f0xfd0xdf0xfd0xa0xfd0xa30xfd0xc60xfc0x710xfd0xbd0xfc0x3f0xfd
Or you can use binascii module if you want to visualize the content:
import binascii
print(binascii.hexlify(b"hello world"))
Output:
b'68656c6c6f20776f726c64'
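Since the question also asks about appending data, here is a minimal sketch using only the first few bytes of the value from the question. It assumes Python 3.8+ for the separator argument of bytes.hex():

```python
b = b"H\x00\x84\xffQ"

# 'Q' and '\x51' are the same byte; only the display differs.
print(b.hex())                   # 480084ff51
print(b[4] == ord("Q") == 0x51)  # True

# Appending works with any bytes object, however it was written down.
extended = b + bytes([0x00, 0xa6]) + b"\xff+"
print(extended.hex(" "))         # 48 00 84 ff 51 00 a6 ff 2b
```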
My problem is as follows:
I'm reading a .csv generated by some software, and I'm using Pandas to read it. Pandas reads the .csv properly, but one of the columns stores byte sequences representing vectors, and Pandas stores them as strings.
So I have data (a string) and I want to use np.frombuffer() to get the proper vector. The problem is that data is a string, so it's already decoded text, and when I use .encode() to turn it into bytes, the sequence is not the original one.
Example: The .csv contains \x00\x00 representing the vector [0,0] with dtype=np.uint8. Pandas stores it as a string and when I try to process it something like this happens:
data = df.data[x] # With x any row.
type(data)
<class 'str'>
print(data)
\x00\x00
e_data = data.encode("latin1")
print(e_data)
b'\\x00\\x00'
v = np.frombuffer(e_data, np.uint8)
print(v)
array([ 92 120 48 48 92 120 48 48], dtype=uint8)
I just want to get b'\x00\x00' from data instead of b'\\x00\\x00' which I understand is a little encoding mess I have not been able to fix yet.
Any way to do this?
Thanks!
Issue: you (apparently) have a string that contains literal backslash escape sequences, such as:
>>> x = r'\x00' # note the use of a raw string literal
>>> x # Python's representation of the string escapes the backslash
'\\x00'
>>> print(x) # but it looks right when printing
\x00
From this, you wish to create a corresponding bytes object, wherein the backslash-escape sequences are translated into the corresponding byte.
Handling these kinds of escape sequences is done using the unicode-escape string encoding. As you may be aware, string encodings convert between bytes and str objects, specifying the rules for which byte sequences correspond to what Unicode code points.
However, the unicode-escape codec assumes that the escape sequences are on the bytes side of the equation and that the str side will have the corresponding Unicode characters:
>>> rb'\x00'.decode('unicode-escape') # create a string with a NUL char
'\x00'
Applying .encode to the string will reverse that process; so if you start with the backslash-escape sequence, it will re-escape the backslash:
>>> r'\x00'.encode('unicode-escape') # the result contains two backslashes, represented as four
b'\\\\x00'
>>> list(r'\x00'.encode('unicode-escape')) # let's look at the numeric values of the bytes
[92, 92, 120, 48, 48]
As you can see, that is clearly not what we want.
We want to convert from bytes to str to do the backslash-escaping. But we have a str to start, so we need to change that to bytes; and we want bytes at the end, so we need to change the str that we get from the backslash-escaping. In both cases, we need to make each Unicode code point from 0-255 inclusive, correspond to a single byte with the same value.
The encoding we need for that task is called latin-1, also known as iso-8859-1.
For example:
>>> r'\x00'.encode('latin-1')
b'\\x00'
Thus, we can reason out the overall conversion:
>>> r'\x00'.encode('latin-1').decode('unicode-escape').encode('latin-1')
b'\x00'
As desired: our str with a literal backslash, lowercase x and two zeros, is converted to a bytes object containing a single zero byte.
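Wrapped up as a function, the round trip might look like the sketch below. The name unescape_to_bytes is made up for this example, and the approach assumes the text contains only characters up to U+00FF and only escapes that produce values up to 0xFF; anything else raises UnicodeEncodeError.

```python
def unescape_to_bytes(text):
    """Turn text with literal backslash escapes (e.g. r'\x00') into bytes."""
    as_bytes = text.encode("latin-1")              # str -> bytes, one byte per character
    unescaped = as_bytes.decode("unicode-escape")  # process the \x.. sequences
    return unescaped.encode("latin-1")             # back to bytes, one byte per character

data = r"\x00\x00"  # what the asker's print(data) showed
raw = unescape_to_bytes(data)
print(raw)          # b'\x00\x00'
print(list(raw))    # [0, 0]
```

The resulting bytes object is the kind of buffer that np.frombuffer(raw, np.uint8) from the question expects.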
Alternately: we can request that backslash-escapes are processed while decoding, by using escape_decode from the codecs standard library module. However, this isn't documented and isn't really meant to be used that way - it's internal stuff used to implement the unicode-escape codec and possibly some other things.
If you want to expose yourself to the risk of that breaking in the future, it looks like:
>>> import codecs
>>> codecs.escape_decode(r'\x00\x00')
(b'\x00\x00', 8)
We get a 2-tuple, with the desired bytes and what I assume is the number of Unicode code points that were decoded (i.e. the length of the string). From my testing, it appears that it can only use UTF-8 encoding for the non-backslash sequences (but this could be specific to how Python is configured), and you can't change this; there is no actual parameter to specify the encoding, for a decode method. Like I said - not meant for general use.
Yes, all of that is as awkward as it seems. The reason you don't get easy support for this kind of thing is that it isn't really how you're intended to design your system. Fundamentally, all data is bytes; text is an abstraction that is encoded by that byte data. Using a single byte (with value 0) to represent four characters of text (the symbols \, x, 0 and 0) is not a normal encoding, and not a reversible one (how do I know whether to decode the byte as those four characters, or as a single NUL character?). Instead, you should strongly consider using some other friendly string representation of your data (perhaps a plain hex dump) and a non-text-encoding-related way to parse it. For example:
>>> data = '41 42' # a string in a simple hex dump format
>>> bytes.fromhex(data) # support is built-in, and works simply
b'AB'
>>> list(bytes.fromhex(data))
[65, 66]
Apparently, the following is the valid syntax:
b'The string'
I would like to know:
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP, and it states that the b is used to indicate that the string is binary, as opposed to Unicode, which was needed for code written for PHP < 6 to stay compatible when migrating to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you're familiar with:
Java or C#, think of str as String and bytes as byte[];
SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
Windows registry, think of str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
The b denotes a byte string.
Bytes are the actual data. Strings are an abstraction.
If you had a multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on the encoding.
If you took 1 byte from a byte string, you'd get a single 8-bit value from 0-255, and it might not represent a complete character if the encoding uses more than 1 byte for that character.
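A minimal sketch of that difference, using the euro sign as an arbitrary example of a character that takes more than one byte in UTF-8:

```python
text = "€"
encoded = text.encode("utf-8")

print(text[0])       # € (indexing a str gives a one-character str)
print(encoded)       # b'\xe2\x82\xac'
print(encoded[0])    # 226 (indexing bytes gives an int from 0-255)
print(encoded[0:1])  # b'\xe2' (slicing gives bytes, here an incomplete character)

try:
    encoded[0:1].decode("utf-8")
except UnicodeDecodeError as exc:
    # A single byte of a multi-byte character cannot be decoded on its own.
    print("cannot decode:", exc.reason)
```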
TBH I'd use strings unless I had some specific low level reason to use bytes.
From server side, if we send any response, it will be sent in the form of byte type, so it will appear in the client as b'Response from server'
To get rid of the b'....', simply use the code below:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
then it will print Response from server
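To try this without writing a separate server and client, here is a self-contained sketch that uses socket.socketpair() to stand in for the two ends of the connection. The variable names are only for illustration and are not the c and s from the snippets above.

```python
import socket

# A connected pair of sockets in one process, standing in for server and client.
server_side, client_side = socket.socketpair()
try:
    message = "Response from server"
    server_side.sendall(message.encode())  # str -> bytes before sending
    received = client_side.recv(1024)      # what arrives is bytes
    print(received)                        # b'Response from server'
    print(received.decode())               # Response from server
finally:
    server_side.close()
    client_side.close()
```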
In short, the b prefix gives you the same kind of object you would get from:
data.encode()
and to decode it (removing the b, because sometimes you don't need it), use:
data.decode()
Here's an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
Adding a b prefix would fix the problem.
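A minimal runnable sketch of the working version, writing to a temporary directory so that nothing is left behind (the file name "new" is taken from the example above):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "new")
    # A file opened in binary mode ("wb") only accepts bytes-like objects.
    with open(path, "wb") as f:
        written = f.write(b"Hello Python!")
    with open(path, "rb") as f:
        content = f.read()

print(written)  # 13
print(content)  # b'Hello Python!'
```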
It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.
The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).
In addition to what others have said, note that a single Unicode character can be encoded as multiple bytes.
The way the UTF-8 encoding of Unicode works is that it kept the old ASCII format (7-bit codes that look like 0xxx xxxx) as single bytes and added multi-byte sequences in which every byte starts with 1 (1xxx xxxx) to represent characters beyond ASCII, so that UTF-8 is backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
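The bit patterns can be inspected directly. A small sketch using the same word:

```python
encoded = "Öl".encode("utf-8")

# Show each byte as 8 bits: the two bytes of 'Ö' start with 1,
# while the ASCII letter 'l' starts with 0.
bits = [format(byte, "08b") for byte in encoded]
print(bits)  # ['11000011', '10010110', '01101100']
```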
You can use the json module to convert it to a dictionary:
import json
data = b'{"key":"value"}'
print(json.loads(data))
{'key': 'value'}
FLASK:
This is an example from Flask. Run this from a Python prompt in a terminal:
import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})
In flask/routes.py
import json
from flask import request  # app is assumed to be the existing Flask application object
@app.route('/', methods=['POST'])
def api_script_add():
    print(request.data) # --> b'{"key": "value"}'
    print(json.loads(request.data))
    return json.loads(request.data)
{'key': 'value'}
b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers, which, if you mapped them to a character table, would look like h e l l o. However the value itself is not a string, Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers itself. This saves you some typing, and also often byte sequences are meant to be interpreted as characters. However, this is not always the case - for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.
.encode() and .decode() convert between strings and bytes.
bytes(somestring.encode()) is the solution that worked for me in Python 3.
def compare_types():
    output = b'sometext'
    print(output)
    print(type(output))
    somestring = 'sometext'
    encoded_string = somestring.encode()  # this is already a bytes object
    output = bytes(encoded_string)  # so the bytes() call is redundant
    print(output)
    print(type(output))
compare_types()
How do I convert a string which contains the literal representation of a byte string, to a byte string?
This might seem strange, but a library I'm using raises a certain type of exception, and I need one of that exception's attributes. It gives me the value I need, but as a byte string literal inside a string.
It is "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'", I can get the value by splitting on the equals and then using eval, such as
>>> eval("value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'".split("=")[1])
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
This works, but as we all know eval can be very, very bad. So, is there an alternative to using eval?
There is a unicode-escape codec that will convert bytes containing literal sequences like \x.. or \u.... into their equivalent characters in the string. The remainder of the string is converted using the latin1 encoding, which just translates all the bytes.
So you convert the string to raw bytes using latin1, then convert back to a string using unicode-escape, and finally back to bytes using latin1 again:
>>> s = '\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
Getting rid of the clutter around the string is pretty easy using regex or the more manual parsing you showed. For example:
>>> import re
>>> x = "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'"
>>> s = re.fullmatch('[^\'"]+b([\'"])(.*)\\1[^\'"]*', x).group(2)
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
OR
>>> s = x.split('=')[1].lstrip('b').strip("'")
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
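Another option, not part of the answer above, is ast.literal_eval from the standard library. It only evaluates Python literals (strings, bytes, numbers, tuples and so on), so it avoids the arbitrary code execution risk of eval. A minimal sketch, assuming the text after the first = sign is always a valid bytes literal:

```python
import ast

x = "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'"

# Keep everything after the first '=' (the bytes literal, including the b prefix).
literal_text = x.split("=", 1)[1]

# literal_eval parses the literal without running arbitrary code.
value = ast.literal_eval(literal_text)
print(value)        # b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
print(type(value))  # <class 'bytes'>
```

If the text is not a valid literal, literal_eval raises ValueError or SyntaxError.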
I received the following string. How can it be converted to hex?
value='(\xd2M\x00\x18\x00\x18\x80\x00\x80\x00\x00\x00\x00\x00\x00\xe0\xd2\xe0\xd2.\xd2\x00\x00\x00\x00\x00\x00\n\x00\x18\x00&\x00\x00\x00\x00\x00\x00\x00\x0f0\xfe/\x010\xff/\x000\xff/\x000\xff/\xff/\xff/\xff/\xff/\x000\xff/\xff/\xff/\x000\x000\xff/\x000\x000\x000\xff/\xff/\x000\x000\xff/\x000\xad\xff\x0c\x00\xdd\xff\xc2\xff\xd3\xff\xde\xff\xe9\xff\xca\xff\xd8\xff\xe6\xff\xb5\xff\xb2\xff\xe6\xff\x92\xff\xd0\xff\xa0\xff\xbd\xff\xb4\xff\x82\xff\x90\xfff\xff\xe1\xff\x9f\xff\x94\xff\xd4\xff\xa4\xff\xbb\xff\xe8\xff\x00\x00\x02\x00\xff\x7f\xff\x7f\x97\xff\xd0\xff\xb7\xff~\xffG\xff\xa1\xff\xa1\xff\xcd\xab\x00\x00A\n\x00\x00'
That's not a hex string. You are confusing the Python repr() output for a bytestring, which aims to make debugging easier, with the contents.
Each \xhh is a standard Python string literal escape sequence, and displaying the string like this makes it trivial to copy and paste into another Python session to reproduce the exact same value.
You don't need to hex decode this at all.
An actual hex string consists only of the digits 0 through to 9, and the letters a through to f (upper or lowercase). Your value, converted to hex, looks like this:
>>> value='(\xd2M\x00\x18\x00\x18\x80\x00\x80\x00\x00\x00\x00\x00\x00\xe0\xd2\xe0\xd2.\xd2\x00\x00\x00\x00\x00\x00\n\x00\x18\x00&\x00\x00\x00\x00\x00\x00\x00\x0f0\xfe/\x010\xff/\x000\xff/\x000\xff/\xff/\xff/\xff/\xff/\x000\xff/\xff/\xff/\x000\x000\xff/\x000\x000\x000\xff/\xff/\x000\x000\xff/\x000\xad\xff\x0c\x00\xdd\xff\xc2\xff\xd3\xff\xde\xff\xe9\xff\xca\xff\xd8\xff\xe6\xff\xb5\xff\xb2\xff\xe6\xff\x92\xff\xd0\xff\xa0\xff\xbd\xff\xb4\xff\x82\xff\x90\xfff\xff\xe1\xff\x9f\xff\x94\xff\xd4\xff\xa4\xff\xbb\xff\xe8\xff\x00\x00\x02\x00\xff\x7f\xff\x7f\x97\xff\xd0\xff\xb7\xff~\xffG\xff\xa1\xff\xa1\xff\xcd\xab\x00\x00A\n\x00\x00'
>>> import binascii
>>> binascii.hexlify(value)
'28d24d00180018800080000000000000e0d2e0d22ed20000000000000a00180026000000000000000f30fe2f0130ff2f0030ff2f0030ff2fff2fff2fff2fff2f0030ff2fff2fff2f00300030ff2f003000300030ff2fff2f00300030ff2f0030adff0c00ddffc2ffd3ffdeffe9ffcaffd8ffe6ffb5ffb2ffe6ff92ffd0ffa0ffbdffb4ff82ff90ff66ffe1ff9fff94ffd4ffa4ffbbffe8ff00000200ff7fff7f97ffd0ffb7ff7eff47ffa1ffa1ffcdab0000410a0000'
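The session above appears to be Python 2, where a plain '...' literal is a byte string. In Python 3 the value would need to be a bytes object (note the b prefix). A short sketch using only the first five bytes of the value:

```python
import binascii

value = b'(\xd2M\x00\x18'  # first five bytes of the value above, as a bytes literal

print(binascii.hexlify(value))  # b'28d24d0018'
print(value.hex())              # 28d24d0018 (a str, via the built-in method)

# The hex form round-trips back to the same bytes.
print(bytes.fromhex(value.hex()) == value)  # True
```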