Python array.tostring - Explanation for the byte representation

I know that array.tostring gives the array of machine values. But I am trying to figure out how they are represented.
For example:
>>> a = array('l', [2])
>>> a.tostring()
'\x02\x00\x00\x00'
Here, I know that 'l' means each element takes at least 4 bytes, which is why there are 4 bytes in the tostring representation. But why is the most significant bit populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
>>> a = array('l', [50,3])
>>> a.tostring()
'2\x00\x00\x00\x03\x00\x00\x00'
Here I am guessing that the 2 at the beginning appears because 50 is the ASCII code of '2'. Then why don't we get the corresponding character for ASCII code 3, which is Ctrl-C?

But why is the most significant bit populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
The \x02 in '\x02\x00\x00\x00' is not the most significant byte. I guess you are confused by trying to read it as a hexadecimal number where the most significant digit is on the left. This is not how the string returned by array.tostring() works. On a little-endian machine (which yours evidently is), the bytes of each value are laid out left to right from least significant to most significant. Just consider the result as a list of bytes, where the first (or, rather, 0th) byte is on the left, as is usual in regular Python lists.
why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C?
Do you have any example where Python represents the character behind Ctrl-C as Ctrl-C or similar? ASCII code 3 corresponds to an unprintable character with no dedicated escape sequence, so it is represented by its hex code, \x03.
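To make the byte order concrete, here is a minimal sketch. It assumes Python 3, where array.tostring() is spelled array.tobytes(), and it uses the 'i' typecode because 'l' can be 4 or 8 bytes depending on the platform. The explicit "little" and "big" arguments make the first part independent of the machine:

```python
import sys
from array import array

# Explicit byte orders for the value 2 stored in 4 bytes
little = (2).to_bytes(4, "little")   # least significant byte first
big = (2).to_bytes(4, "big")         # most significant byte first
print(little)  # b'\x02\x00\x00\x00'
print(big)     # b'\x00\x00\x00\x02'

# array uses the machine's native order, reported by sys.byteorder
a = array("i", [50, 3])
raw = a.tobytes()
first = int.from_bytes(raw[:a.itemsize], sys.byteorder)
second = int.from_bytes(raw[a.itemsize:], sys.byteorder)
print(sys.byteorder, first, second)

# Byte 50 is printable ('2'); byte 3 is not, so repr() falls back to \x03
print(bytes([50, 3]))  # b'2\x03'
```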

Related

Converting integer to a pair of bytes produces unexpected format?

I am using Python 3.8.5 and trying to convert an integer in the range 0 to 65535 to a pair of bytes. I am currently using the following code:
import struct
input_integer = 2111
bytes_val = input_integer.to_bytes(2, 'little')
output_data = struct.pack('bb', bytes_val[1], bytes_val[0])
print(output_data)
This produces the following output:
b'\x08?'
Here \x08 is 8 in hex, the most significant byte, and ? is 63 in ASCII. So together, the numbers add up to 2111 (8*256+63=2111). What I can't figure out is why the least significant byte is coming out as ASCII instead of hex. It's very strange to me that it's in a different format than the MSB right next to it. I want it in hex for the output data, and am trying to figure out how to achieve that.
I have also tried modifying the format string in the last line to the following:
output_data = struct.pack('cc',bytes_val[1],bytes_val[0])
which produces the following error:
struct.error: char format requires a bytes object of length 1
I checked the types at each step, and it looks like bytes_val is a bytes object of length 2, but when I take one of the individual elements, say bytes_val[1], it is an integer rather than a bytes object.
Any ideas?
All your observations can be verified from the docs for the bytes class:
While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers
In the repr of Python strings, printable ASCII letters, digits and punctuation are shown as themselves, while control codes (0-31, 127) are shown as escape sequences: \t, \n and \r for tab, newline and carriage return, and hexadecimal \xNN for the rest. You can see this by evaluating ''.join(map(chr, range(128))) in the interactive interpreter. Bytes literals follow the same convention, except that individual elements are integers, e.g. output_data[0] is 8.
If you want to represent everything as hex:
>>> output_data.hex()
'083f'
>>> bytes.fromhex('083f') # to recover
b'\x08?'
As of version 3.8 bytes.hex() now supports optional sep and bytes_per_sep parameters to insert separators between bytes in the hex output.
>>> b'abcdef'.hex(' ', 2)
'6162 6364 6566'
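As a side note, the byte swap in the question can be avoided by asking for big-endian order directly. A small sketch using the question's value 2111 (the variable names are only illustrative):

```python
import struct

value = 2111

packed = struct.pack(">H", value)      # unsigned 16-bit, big-endian
same = value.to_bytes(2, "big")        # equivalent without struct

print(packed)          # b'\x08?'
print(list(packed))    # [8, 63]
print(packed.hex())    # 083f
print(packed == same)  # True
```

The bytes are identical either way; only the way they are displayed changes.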

What are \t and \r in byte representation?

import sys
for i in range(30):
    # a = int(str(i),base = 16).to_bytes(4,sys.byteorder)
    a = i.to_bytes(4, sys.byteorder)
    print(a)
Here sys.byteorder seems to be 'little'. The output of the above code is:
b'\x00\x00\x00\x00'
b'\x01\x00\x00\x00'
b'\x02\x00\x00\x00'
b'\x03\x00\x00\x00'
b'\x04\x00\x00\x00'
b'\x05\x00\x00\x00'
b'\x06\x00\x00\x00'
b'\x07\x00\x00\x00'
b'\x08\x00\x00\x00'
b'\t\x00\x00\x00'
b'\n\x00\x00\x00'
b'\x0b\x00\x00\x00'
b'\x0c\x00\x00\x00'
b'\r\x00\x00\x00'
b'\x0e\x00\x00\x00'
b'\x0f\x00\x00\x00'
b'\x10\x00\x00\x00'
b'\x11\x00\x00\x00'
b'\x12\x00\x00\x00'
b'\x13\x00\x00\x00'
b'\x14\x00\x00\x00'
b'\x15\x00\x00\x00'
b'\x16\x00\x00\x00'
b'\x17\x00\x00\x00'
b'\x18\x00\x00\x00'
b'\x19\x00\x00\x00'
b'\x1a\x00\x00\x00'
b'\x1b\x00\x00\x00'
b'\x1c\x00\x00\x00'
b'\x1d\x00\x00\x00'
Observe that the integer 9 is written, obnoxiously, as b'\t\x00\x00\x00', with similar oddities for 0xa and 0xd.
Is this an aberration, or am I lacking knowledge of this notation?
My Python version is 3.8.2.
These are escape sequences.
\t represents an ASCII Horizontal Tab (TAB) and \r represents an ASCII Carriage Return (CR).
See Python's documentation of String and Bytes literals.
I think part of the problem is that you are using bytes in two senses. It can mean a datatype and it can mean a representation. And you are expecting that a variable of datatype byte will have a particular byte representation.
Let's begin by looking at these equivalences:
>>> b"\x09\x0a\x0b\x0c\x0d\x0e" == b"\t\n\x0b\x0c\r\x0e" == bytes([9,10,11,12,13,14])
True
As you can see, even though the representations of these 6 bytes in Python code differ, the data is the same. The middle one is Python's default representation if you just call print() on a bunch of bytes.
If you only care about seeing the integer values 0 to 29 displayed as 2 hex digits, then all you need to do is format the integers as 2 hex digits, like this:
for i in range(30):
    print(f"{i:02x}")
00
01
02
03
...
1b
1c
1d
If you want a leading 0x then put it in the f-string before the opening brace.
You can't actually convert your integer value to datatype byte (which is what I think you may have been trying to do with the call to to_bytes()) because Python doesn't have a byte datatype. to_bytes() returns a bytes, which behaves at the Python level like a list of integers in the range 0–255, and its default on-screen representation is a bytestring.
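A short sketch of that list-like behaviour, assuming little-endian order as in the question's output:

```python
b = (9).to_bytes(4, "little")

print(b)            # b'\t\x00\x00\x00'
print(list(b))      # [9, 0, 0, 0]
print(b[0])         # 9, indexing yields an int
print(b[0:1])       # b'\t', slicing yields bytes
print(b.hex())      # 09000000
print(b == b"\x09\x00\x00\x00")  # True
```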
Formatting only affects how the values appear on the screen. If you want the hex representation back in a variable (because you are writing a hex editor, say, and need to manipulate the appearance in your own code), then, as @Harmon758 says, use the hex() function:
for i in range(30):
    h = hex(i)
    print(h)
For values of 16 and above this gives the same output as print(f"0x{i:02x}") (hex() does not zero-pad, so 9 comes out as 0x9 rather than 0x09), but it is not doing the same thing, because h is not an integer, it is a string. Only the screen representation is the same. If you want the string to look a bit different (a capital X, for example, or 4 leading zeroes) you can use an f-string instead of calling hex():
>>> i = 29
>>> h = f"0X{i:04x}"
>>> h
'0X001d'
>>> h = f"0X{i:04X}"
>>> h
'0X001D'

Base56 conversion etc

It seems base58 and base56 conversion treat input data as a single Big Endian number; an unsigned bigint number.
If I'm encoding some integers into shorter strings using base58 or base56, it seems that in some implementations the integer is taken as a native (little endian in my case) representation of bytes and then converted to a string, while in other implementations the number is converted to a big endian representation first. It seems the loose specifications of these encodings don't clarify which approach is right. Is there an explicit specification of which to do, or a more widely popular option of the two that I'm not aware of?
I was trying to compare some methods of making a short URL. The source is actually a 10 digit number that's less than 4 billion. In this case I was thinking to make it an unsigned 4 byte integer, possibly Little Endian, and then encode it with a few options (with alphabets):
base64 A…Za…z0…9+/
base64 url-safe A…Za…z0…9-_
Z85 0…9a…zA…Z.-:+=^!/*?&<>()[]{}@%$#
base58 1…9A…HJ…NP…Za…km…z (excluding 0IOl+/ from base64 & reordered)
base56 2…9A…HJ…NP…Za…kmnp…z (excluding 1o from base58)
So like, base16, base32 and base64 make pretty good sense in that they're taking 4, 5 or 6 bits of input data at a time and looking them up in an alphabet index. The latter uses 4 symbols per 3 bytes. Straightforward, and this works for any data.
The other 3 have me finding various implementations that disagree with each other as to the right output. The problem appears to be that no number of bytes maps to a fixed number of lookups in these. E.g. taking 2^1 to 2^100 and getting the remainders for 56, 58 and 85 results in no remainders of 0.
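That observation is easy to check, and it holds for every power of two, because 56, 58 and 85 each have an odd prime factor (7, 29, and 5 or 17 respectively) that a power of two can never contain. A quick sketch:

```python
bases = (56, 58, 85)

# Remainders of 2**k for k = 1..100 in each base; none of them is 0
for base in bases:
    remainders = {pow(2, k, base) for k in range(1, 101)}
    print(base, 0 in remainders)   # False for every base

# By contrast, 64 divides 2**6 exactly, so 6 bits map to one symbol
print(pow(2, 6, 64))  # 0
```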
Z85 (ascii85, base85 et al.) approaches this by grabbing 4 bytes at a time, encoding them to 5 symbols and accepting some waste. But there's byte alignment to some degree here (base64 has alignment per 16 symbols, Z85 gets there with 5). But the alphabet is … not great for URLs, the command line, or SGML/XML use.
base58 and base56 seem intent on treating the input bytes like a Big Endian ordered bigint and repeating: % base; lookup; -= % base; /= base on the input bigint. Which… I mean, I think that ends up modifying most of the input for every iteration.
For my input that's not a huge performance concern though.
We shouldn't treat the input as string data, or we get output longer than the 10 digit decimal input, and what's the point in that? So does anyone know of any indication of which kind of processing results in something canonical for base56 or base58?
1. Take the Little Endian 4 byte word of the 10 digit number (<4*10^9) as a sequence of bytes, which represents a different number when read as Big Endian, and convert that by repeating the steps.
2. Represent the 10 digit number (<4*10^9) in 4 bytes Big Endian before converting that by repeating the steps.
I'm leaning towards going the route of the 2nd way.
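For comparison, the repeated divide-and-look-up procedure described above can be sketched as follows. This is only an illustration, assuming the Bitcoin-style Base58 alphabet; the function names are made up here, and the special handling of leading zero bytes that Bitcoin-style Base58 adds is left out. Working on the integer directly is the same as encoding its big-endian bytes:

```python
# Assumed alphabet: the Bitcoin-style Base58 ordering
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def encode_int(n):
    """Hypothetical helper: encode a non-negative int in Base58."""
    if n == 0:
        return B58_ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 58)            # peel off the least significant digit
        digits.append(B58_ALPHABET[rem])
    return "".join(reversed(digits))      # most significant digit first

def encode_bytes(data):
    """Hypothetical helper: treat data as one big-endian unsigned integer."""
    return encode_int(int.from_bytes(data, "big"))

n = 3003295320
print(encode_int(n))                          # 5aPg4o
print(encode_bytes(n.to_bytes(4, "big")))     # 5aPg4o
print(encode_bytes(n.to_bytes(4, "little")))  # 3GRfwp
```

Under this reading, the second option needs no byte shuffling at all: the integer itself is the input.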
For example given the number: 3003295320
The little endian representation is 58 a6 02 b3
The big endian representation is b3 02 a6 58, Meaning
base64 gives:
>>> base64.b64encode(int.to_bytes(3003295320,4,'little'))
b'WKYCsw=='
>>> base64.b64encode(int.to_bytes(3003295320,4,'big'))
b'swKmWA=='
>>> base64.b64encode('3003295320'.encode('ascii'))
b'MzAwMzI5NTMyMA==' # Definitely not using this
Z85 gives:
>>> encode(int.to_bytes(3003295320,4,'little'))
b'sF=ea'
>>> encode(int.to_bytes(3003295320,4,'big'))
b'VJv1a'
>>> encode('003003295320'.encode('ascii')) # padding to 4 byte boundary
b'fFCppfF+EAh8v0w' # Definitely not using this
base58 gives:
>>> base58.b58encode(int.to_bytes(3003295320,4,'little'))
b'3GRfwp'
>>> base58.b58encode(int.to_bytes(3003295320,4,'big'))
b'5aPg4o'
>>> base58.b58encode('3003295320')
b'3soMTaEYSLkS4w' # Still not using this
base56 gives:
>>> b56encode(int.to_bytes(3003295320,4,'little'))
b'4HSgyr'
>>> b56encode(int.to_bytes(3003295320,4,'big'))
b'6bQh5q'
>>> b56encode('3003295320')
b'4uqNUbFZTMmT5y' # Longer than 10 digits so...

Python define bitwise

I have a function that accepts 'data' as a parameter. Being new to Python, I wasn't really sure that was even a type.
I noticed when printing something of that type it would be
b'h'
if I encoded the letter h. Which doesn't make a ton of sense to me. Is there a way to define bits in Python, such as 1 or 0? I guess b'h' must be in hex? Is there a way for me to simply define an eight bit string:
bits1 = 10100000
You're conflating a number of unrelated things.
First of all, (in Python 3), quoted literals prefixed with b are of type bytes -- that means a string of raw byte values. Example:
x = b'abc'
print(type(x)) # will output `<class 'bytes'>`
This is in contrast to the str type, which is a (Unicode) string.
Integer literals can be expressed in binary using an 0b prefix, e.g.
y = 0b10100000
print(y) # Will output 160
As far as I know, 'data' is not a type. Your function (probably) accepts anything you pass to it, regardless of its type.
Now, b'h' is a bytes object holding a single byte whose value is the ASCII code of the character 'h'. It is not hexadecimal; it is an 8-bit value (each element of a bytes object is one byte, an integer from 0 to 255) that Python happens to display as the matching ASCII character.
The ASCII code for 'h' is 104 (decimal), which is 0b01101000 in binary or 0x68 in hexadecimal, so the same object can be written as b'\x68'.
So, here is the answer I think you are looking for: bytes literals have no binary escape (\b is the backspace character, and there is no \d for decimal), so to build an 8-bit value from its binary representation, write the integer literal 0b01101000 and wrap it as bytes([0b01101000]). I would recommend using hex instead, to make it more compact and readable. In hex, every four bits make one symbol from 0 to f, so the bit sequence 0110 1000 becomes the digits 6 and 8, and the byte is written b'\x68'; a \x escape always takes exactly two hex digits. The b before the quote marks tells Python that the literal is a sequence of bytes rather than a Unicode string.
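A minimal sketch of the conversions involved, using only built-ins:

```python
value = 0b01101000             # integer literal written in binary
print(value)                   # 104
print(hex(value))              # 0x68
print(chr(value))              # h

data = bytes([value])          # one-byte bytes object
print(data)                    # b'h'
print(data == b"\x68")         # True
print(format(data[0], "08b"))  # 01101000

bits1 = int("10100000", 2)     # parse a string of bits
print(bits1)                   # 160
```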

STL binary file reader with Python

I'm trying to write my "personal" Python version of an STL binary file reader. According to Wikipedia, a binary STL file contains:
an 80-character (byte) header, which is generally ignored.
a 4-byte unsigned integer indicating the number of triangular facets in the file.
Each triangle is described by twelve 32-bit floating-point numbers: three for the normal and then three for the X/Y/Z coordinate of each vertex – just as with the ASCII version of STL. After these follows a 2-byte ("short") unsigned integer that is the "attribute byte count" – in the standard format, this should be zero because most software does not understand anything else. --Floating-point numbers are represented as IEEE floating-point numbers and are assumed to be little-endian--
Here is my code :
#! /usr/bin/env python3
with open("stlbinaryfile.stl","rb") as fichier :
    head=fichier.read(80)
    nbtriangles=fichier.read(4)
    print(nbtriangles)
The output is :
b'\x90\x08\x00\x00'
It represents an unsigned integer that I need to convert without using any package (struct, stl...). Are there any (basic) rules to do it? I don't know what \x means. How does \x90 represent one byte?
Most of the answers on Google mention "C structs", but I don't know anything about C.
Thank you for your time.
Since you're using Python 3, you can use int.from_bytes. I'm guessing the value is stored little-endian, so you'd just do:
nbtriangles = int.from_bytes(fichier.read(4), 'little')
Change the second argument to 'big' if it's supposed to be big-endian.
Mind you, the normal way to parse a fixed width type is the struct module, but apparently you've ruled that out.
For the confusion over the repr, bytes objects will display ASCII printable characters (e.g. a) or standard ASCII escapes (e.g. \t) if the byte value corresponds to one of them. If it doesn't, it uses \x##, where ## is the hexadecimal representation of the byte value, so \x90 represents the byte with value 0x90, or 144. You need to combine the byte values at offsets to reconstruct the int, but int.from_bytes does this for you faster than any hand-rolled solution could.
Update: Since apparently int.from_bytes isn't "basic" enough, here are a couple of more complex solutions that use only top-level built-ins (not alternate constructors). For little-endian, you can do this:
def int_from_bytes(inbytes):
    res = 0
    for i, b in enumerate(inbytes):
        res |= b << (i * 8)  # Adjust each byte individually by 8 times position
    return res
You can use the same solution for big-endian by adding reversed to the loop, making it enumerate(reversed(inbytes)), or you can use this alternative solution that handles the offset adjustment a different way:
def int_from_bytes(inbytes):
    res = 0
    for b in inbytes:
        res <<= 8  # Adjust bytes seen so far to make room for new byte
        res |= b   # Mask in new byte
    return res
Again, this big-endian solution can trivially work for little-endian by looping over reversed(inbytes) instead of inbytes. In both cases inbytes[::-1] is an alternative to reversed(inbytes) (the former makes a new bytes in reversed order and iterates that, the latter iterates the existing bytes object in reverse, but unless it's a huge bytes object, enough to strain RAM if you copy it, the difference is pretty minimal).
The typical way to interpret an integer is to use struct.unpack, like so:
import struct
with open("stlbinaryfile.stl","rb") as fichier :
    head=fichier.read(80)
    nbtriangles=fichier.read(4)
    print(nbtriangles)
    # struct.unpack always returns a tuple, so take its single element
    nbtriangles=struct.unpack("<I", nbtriangles)[0]
    print(nbtriangles)
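The same struct approach extends to the 50-byte triangle records described in the question (twelve little-endian 32-bit floats followed by a 2-byte attribute count). The following is only a sketch under that layout; read_stl is a hypothetical helper name, and the sample data is built in memory rather than read from a real file:

```python
import io
import struct

# 12 little-endian float32 values + one uint16 attribute count = 50 bytes
TRIANGLE = struct.Struct("<12fH")

def read_stl(stream):
    """Hypothetical helper: return (header, triangles) from a binary STL stream."""
    header = stream.read(80)
    (count,) = struct.unpack("<I", stream.read(4))
    triangles = []
    for _ in range(count):
        *floats, attr = TRIANGLE.unpack(stream.read(TRIANGLE.size))
        normal = tuple(floats[0:3])
        vertices = [tuple(floats[i:i + 3]) for i in (3, 6, 9)]
        triangles.append((normal, vertices, attr))
    return header, triangles

# Build a one-triangle file in memory to exercise the reader
sample = (
    bytes(80)
    + struct.pack("<I", 1)
    + TRIANGLE.pack(0.0, 0.0, 1.0,  0.0, 0.0, 0.0,  1.0, 0.0, 0.0,  0.0, 1.0, 0.0,  0)
)
header, triangles = read_stl(io.BytesIO(sample))
print(len(triangles), triangles[0])
```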
If you are allergic to import struct, then you can also compute it by hand:
def unsigned_int(s):
    result = 0
    for ch in s[::-1]:
        result *= 256
        result += ch
    return result
...
nbtriangles = unsigned_int(nbtriangles)
As to what you are seeing when you print b'\x90\x08\x00\x00': you are printing a bytes object, which is an array of integers in the range 0-255. The first integer has the value 144 (decimal) or 90 (hexadecimal). When printing a bytes object, that value is represented by the string \x90. The second has the value eight, represented by \x08. The third and fourth integers are both zero. They are represented by \x00.
If you would like to see a more familiar representation of the integers, try:
print(list(nbtriangles))
[144, 8, 0, 0]
To compute the 32-bit integers represented by these four 8-bit integers, you can use this formula:
total = byte0 + (byte1*256) + (byte2*256*256) + (byte3*256*256*256)
Or, in hex:
total = byte0 + (byte1*0x100) + (byte2*0x10000) + (byte3*0x1000000)
Which results in:
0x00000890
Perhaps you can see the similarities to decimal, where the string "1234" represents the number:
4 + 3*10 + 2*100 + 1*1000
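Applied to the four bytes from the question, the formula works out like this (a small sketch; the variable names follow the formula above):

```python
raw = b"\x90\x08\x00\x00"
byte0, byte1, byte2, byte3 = raw   # unpacking a bytes object yields ints

total = byte0 + (byte1 * 0x100) + (byte2 * 0x10000) + (byte3 * 0x1000000)
print(total)        # 2192
print(hex(total))   # 0x890
print(total == int.from_bytes(raw, "little"))  # True
```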
