Converting integer to a pair of bytes produces unexpected format? - python

I am using python 3.8.5, and trying to convert from an integer in the range (0,65535) to a pair of bytes. I am currently using the following code:
import struct

input_integer = 2111
bytes_val = input_integer.to_bytes(2, 'little')               # two bytes, little-endian
output_data = struct.pack('bb', bytes_val[1], bytes_val[0])   # repack with MSB first
print(output_data)
This produces the following output:
b'\x08?'
This \x08 is 8 in hex, the most significant byte, and ? is ASCII 63. So together the numbers add up to 2111 (8*256+63=2111). What I can't figure out is why the least significant byte is coming out as an ASCII character instead of hex? It's very strange to me that it's in a different format than the MSB right next to it. I want it in hex for the output data, and am trying to figure out how to achieve that.
I have also tried modifying the format string in the last line to the following:
output_data = struct.pack('cc',bytes_val[1],bytes_val[0])
which produces the following error:
struct.error: char format requires a bytes object of length 1
I checked the types at each step, and it looks like bytes_val is a bytes object of length 2, but when I take one of the individual elements, say bytes_val[1], it is an integer rather than a bytes object.
Any ideas?

All your observations can be verified from the docs for the bytes class:
While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers
In the representation of a Python string, printable ASCII characters (letters, digits, punctuation) are shown as themselves, while control codes (0-31, 127) are shown as escape sequences or their hexadecimal value. You can see this by printing ''.join(map(chr, range(128))). Bytes literals follow the same convention, except that the individual elements are integers, e.g. output_data[0] is 8.
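You can see both behaviours with the two byte values from your output (a minimal sketch, using nothing beyond the values 8 and 63 already shown):
>>> bytes([8, 63])        # the same two byte values as output_data
b'\x08?'
>>> bytes([8, 63])[1]     # indexing gives back the plain integer
63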
If you want to represent everything as hex:
>>> output_data.hex()
'083f'
>>> bytes.fromhex('083f') # to recover
b'\x08?'
As of version 3.8 bytes.hex() now supports optional sep and bytes_per_sep parameters to insert separators between bytes in the hex output.
>>> b'abcdef'.hex(' ', 2)
'6162 6364 6566'

Related

What are \t and \r in byte representation?

import sys
for i in range(30):
    # a = int(str(i), base=16).to_bytes(4, sys.byteorder)
    a = i.to_bytes(4, sys.byteorder)
    print(a)
Here sys.byteorder seems to be 'little'. The output of the above code is:
b'\x00\x00\x00\x00'
b'\x01\x00\x00\x00'
b'\x02\x00\x00\x00'
b'\x03\x00\x00\x00'
b'\x04\x00\x00\x00'
b'\x05\x00\x00\x00'
b'\x06\x00\x00\x00'
b'\x07\x00\x00\x00'
b'\x08\x00\x00\x00'
b'\t\x00\x00\x00'
b'\n\x00\x00\x00'
b'\x0b\x00\x00\x00'
b'\x0c\x00\x00\x00'
b'\r\x00\x00\x00'
b'\x0e\x00\x00\x00'
b'\x0f\x00\x00\x00'
b'\x10\x00\x00\x00'
b'\x11\x00\x00\x00'
b'\x12\x00\x00\x00'
b'\x13\x00\x00\x00'
b'\x14\x00\x00\x00'
b'\x15\x00\x00\x00'
b'\x16\x00\x00\x00'
b'\x17\x00\x00\x00'
b'\x18\x00\x00\x00'
b'\x19\x00\x00\x00'
b'\x1a\x00\x00\x00'
b'\x1b\x00\x00\x00'
b'\x1c\x00\x00\x00'
b'\x1d\x00\x00\x00'
Observe that integer 9 here is written, rather obnoxiously, as b'\t\x00\x00\x00', along with similar oddities at 0xa and 0xd.
Is this an aberration, or am I lacking knowledge of this notation?
My Python version is 3.8.2.
These are escape sequences.
\t represents an ASCII Horizontal Tab (TAB) and \r represents an ASCII Carriage Return (CR).
See Python's documentation of String and Bytes literals.
I think part of the problem is that you are using bytes in two senses. It can mean a datatype and it can mean a representation. And you are expecting that a variable of datatype byte will have a particular byte representation.
Let's begin by looking at these equivalences:
>>> b"\x09\x0a\x0b\x0c\x0d\x0e" == b"\t\n\x0b\x0c\r\x0e" == bytes([9,10,11,12,13,14])
True
As you can see, even though the representations of these 6 bytes in Python code differ, the data is the same. The middle one is Python's default representation if you just call print() on a bunch of bytes.
If you only care about seeing the integer values 0 to 29 displayed as 2 hex digits, then all you need to do is format the integers as 2 hex digits, like this:
for i in range(30):
    print(f"{i:02x}")
00
01
02
03
...
1b
1c
1d
If you want a leading 0x then put it in the f-string before the opening brace.
You can't actually convert your integer value to a datatype byte (which is what I think you may have been trying to do with the call to to_bytes()) because Python doesn't have a byte datatype. to_bytes() returns a bytes object, which behaves at the Python level like an immutable sequence of integers in the range 0-255, and whose default on-screen representation is a bytestring.
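A quick illustration of that behaviour, using the value 29 from the range above (just a sketch, not part of the original question):
>>> b = (29).to_bytes(4, 'little')
>>> b                # default representation is a bytestring
b'\x1d\x00\x00\x00'
>>> b[0]             # indexing yields a plain integer
29
>>> list(b)
[29, 0, 0, 0]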
Formatting only affects how the values appear on the screen. If you want the hex representation back in a variable (because you are writing a hex editor, say, and need to manipulate the appearance in your own code), then, as @Harmon758 says, use the hex() function:
for i in range(30):
    h = hex(i)
    print(h)
This produces much the same output as print(f"0x{i:02x}") (though hex() does not zero-pad, so hex(9) is '0x9' rather than '0x09'), but it is not doing the same thing, because h is not an integer, it is a string. Only the screen representation is similar. If you want the string to look a bit different (a capital X, for example, or 4 leading zeroes) you can use an f-string instead of calling hex():
>>> i = 29
>>> h = f"0X{i:04x}"
>>> h
'0X001d'
>>> h = f"0X{i:04X}"
>>> h
'0X001D'

Decode method throws an error in Python 3

Similar to this other question on decoding a hex string, I have some code in a Python 2.7 script which has worked for years. I'm now trying to convert that script to Python 3.
OK, I apologize for not posting a complete question initially. I hope this clarifies the situation.
The issue is that I'm trying to convert an older Python 2.7 script to Python 3.8. For the most part the conversion has gone ok, but I am having issues converting the following code:
# get Register Strings
RegString = ""
for i in range(length):
    if regs[start+i] != 0:
        RegString = RegString + str(format(regs[start+i], 'x').decode('hex'))
Here is some supporting data:
regs[start+0] = 20341
regs[start+1] = 29762
I think that my Python 2.7 code converts these to hex as "4f75" and "7442", and then to the characters "Ou" and "tB", respectively.
In Python 3 I get this error:
'str' object has no attribute 'decode'
My goal is to modify my Python 3 code so that the script will generate the same results.
str(format(regs[start+i],'x').decode('hex')) is a very verbose and round-about way of turning the non-zero integer values in regs[start:start + length] into individual characters of a bytestring (str in Python 2 should really be seen as a sequence of bytes). It first converts an integer value into a hexadecimal representation (a string), decodes that hexadecimal string to a (series) of string characters, then calls str() on the result (redundantly, the value is already a string). Assuming that the values in regs are integers in the range 0-255 (or even 0-127), in Python 2 this should really have been using the chr() function.
If you want to preserve the loop use chr() (to get a str string value) or if you need a binary value, use bytes([...]). So:
RegString = ""
for codepoint in regs[start:start + length]:
    RegString += chr(codepoint)
or
RegString = b""
for codepoint in regs[start:start + length]:
    RegString += bytes([codepoint])
Since this is actually converting a sequence of integers, you can just pass the whole lot to bytes() and filter out the zeros as you go:
# only take non-zero values
RegString = bytes(b for b in regs[start:start + length] if b)
or remove the nulls afterwards:
RegString = bytes(regs[start:start + length]).replace(b"\x00", b"")
If that's still supposed to be a string and not a bytes value, you can then decode it with whatever encoding is appropriate: ASCII if the integers are in the range 0-127, or a more specific codec otherwise. In Python 2 this code produced a bytestring, so look for other hints in the code as to what encoding it might have been using.
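If the registers really hold 16-bit values like the two shown above (20341 and 29762), a minimal Python 3 sketch of the original loop is to go through bytes.fromhex(), the Python 3 counterpart of .decode('hex'). This assumes every non-zero register yields an even-length hex string, as in the sample data:
# Python 3 sketch: 20341 -> '4f75' -> b'Ou' -> 'Ou'
RegString = ""
for i in range(length):
    if regs[start+i] != 0:
        RegString += bytes.fromhex(format(regs[start+i], 'x')).decode('ascii')
With the two sample values this yields "OutB", the same result the Python 2.7 script produced.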

How to turn a binary string into a byte?

If I take the letter 'à' and encode it in UTF-8 I obtain the following result:
'à'.encode('utf-8')
>> b'\xc3\xa0'
Now from a bytearray I would like to convert 'à' into a binary string and turn it back into 'à'. To do so I execute the following code:
byte = bytearray('à','utf-8')
for x in byte:
    print(bin(x))
I get 0b11000011 and 0b10100000, which are 195 and 160. Then I fuse them together and take the 0b part out. Now I execute this code:
s = '1100001110100000'
value1 = s[0:8].encode('utf-8')
value2 = s[9:16].encode('utf-8')
value = value1 + value2
print(chr(int(value, 2)))
>> 憠
No matter how I develop the latter part, I get symbols and never seem to be able to get back my 'à'. I would like to know why that is, and how I can get an 'à' back.
>>> bytes(int(s[i:i+8], 2) for i in range(0, len(s), 8)).decode('utf-8')
'à'
There are multiple parts to this. The bytes constructor creates a byte string from a sequence of integers. The integers are formed from strings using int with a base of 2. The range combined with the slicing peels off 8 characters at a time. Finally decode converts those bytes back into Unicode characters.
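Broken out into steps, the same idea looks like this (a sketch using the bit string from the question):
s = '1100001110100000'
chunks = [s[i:i+8] for i in range(0, len(s), 8)]   # ['11000011', '10100000']
values = [int(chunk, 2) for chunk in chunks]       # [195, 160]
raw = bytes(values)                                # b'\xc3\xa0'
print(raw.decode('utf-8'))                         # à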
You need your second slice to be s[8:16] (or just s[8:]), otherwise you get only 7 bits: 0100000.
You also need to convert your "bit string" back to an integer before treating it as a byte, e.g. with int("0010101", 2).
s = '1100001110100000'
value1 = bytearray([int(s[:8], 2),    # bits 0..7 (8 total)
                    int(s[8:], 2)])   # bits 8..15 (8 total)
print(value1.decode("utf8"))
Convert the base-2 value back to an integer with int(s,2), convert that integer to a number of bytes (int.to_bytes) based on the original length divided by 8 and big-endian conversion to keep the bytes in the right order, then .decode() it (default in Python 3 is utf8):
>>> s = '1100001110100000'
>>> int(s,2)
50080
>>> int(s,2).to_bytes(len(s)//8,'big')
b'\xc3\xa0'
>>> int(s,2).to_bytes(len(s)//8,'big').decode()
'à'

Python define bitwise

I have a function that accepts 'data' as a parameter. Being new to Python, I wasn't really sure whether that was even a type.
I noticed when printing something of that type it would be
b'h'
if I encoded the letter h, which doesn't make a ton of sense to me. Is there a way to define bits in Python, such as 1 or 0? I guess b'h' must be in hex? Is there a way for me to simply define an eight-bit string
bits1 = 10100000
You're conflating a number of unrelated things.
First of all (in Python 3), quoted literals prefixed with b are of type bytes -- that is, a string of raw byte values. Example:
x = b'abc'
print(type(x)) # will output `<class 'bytes'>`
This is in contrast to the str type, which is a (Unicode) string.
Integer literals can be expressed in binary using an 0b prefix, e.g.
y = 0b10100000
print(y) # Will output 160
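If what you want is the single byte with that bit pattern, a small sketch building on the literal above (the variable names are just illustrative):
y = 0b10100000            # integer literal written in binary, value 160
single_byte = bytes([y])  # a one-byte bytes object
print(single_byte)        # b'\xa0'
print(f"{y:08b}")         # 10100000 -- back to an 8-bit string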
As far as I know, 'data' is not a type. Your function (probably) accepts anything you pass to it, regardless of its type.
Now, b'h' is a one-byte bytes value whose single element is the number (int) 104, the ASCII code for the char 'h'. It is not hexadecimal; printable bytes are simply displayed as their ASCII characters.
The ASCII code for 'h' is 104 (decimal); written in binary that is 01101000, and in hex 0x68, so the same byte can be written as b'\x68'.
So, here is the answer I think you are looking for: there is no binary escape inside a bytes literal, but you can build an 8-bit value from its binary representation with a 0b integer literal, e.g. bytes([0b01101000]) (for 104). I would recommend using hex instead, to make it more compact and readable. In hex, every four bits make a symbol from 0 to f, so the bit sequence 01101000 splits into 0110 (6) and 1000 (8) and is written b'\x68'. The b before the quote marks tells Python the literal is a bytes object, and \x inside it introduces a two-digit hexadecimal escape.

Python array.tostring - Explanation for the byte representation

I know that array.tostring gives the array of machine values. But I am trying to figure out how they are represented.
e.g
>>> a = array('l', [2])
>>> a.tostring()
'\x02\x00\x00\x00'
Here, I know that 'l' means each element will be at least 4 bytes, which is why there are 4 bytes in the tostring representation. But why is the most significant byte populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
>>> a = array('l', [50,3])
>>> a.tostring()
'2\x00\x00\x00\x03\x00\x00\x00'
Here I am guessing the 2 at the beginning appears because 50 is the ASCII value of '2'; then why don't we have the corresponding char for ASCII value 3, which is Ctrl-C?
But why is the most significant byte populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
The \x02 in '\x02\x00\x00\x00' is not the most significant byte. I guess you are confused by trying to read it as a hexadecimal number, where the most significant digit is on the left. That is not how the string representation returned by array.tostring() works: the bytes of the value are laid out left to right from least significant to most significant on a little-endian machine such as yours. Just consider the array as a list of bytes, and the first (or, rather, 0th) byte is on the left, as is usual in regular Python lists.
why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C?
Do you have any example where Python represents the character behind Ctrl-C as Ctrl-C or similar? ASCII code 3 corresponds to an unprintable character with no corresponding escape sequence, so it is represented through its hex code, \x03.
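If you want the byte order to be explicit rather than platform-dependent, a small Python 3 sketch with struct.pack shows the same layout (this is an illustration, not the original array call):
>>> import struct
>>> struct.pack('<l', 2)         # '<' forces little-endian, 'l' is a 4-byte long
b'\x02\x00\x00\x00'
>>> struct.pack('>l', 2)         # '>' forces big-endian
b'\x00\x00\x00\x02'
>>> struct.pack('<2l', 50, 3)    # two longs, same layout as the array example
b'2\x00\x00\x00\x03\x00\x00\x00'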
