So I'm a total Python beginner and I've got this bytes object:
byte_obj = b'\x45\x10\x00\x4c\xcc\xde\x40\x00\x40\x06\x6c\x80\xc0\xa8\xd9\x17\x8d\x54\xda\x28'
But I have no idea how to turn this into a binary number; I only know it's supposed to have 32 bits.
You could try int.from_bytes(...), e.g.:
>>> byte_obj = b'\x45\x10\x00\x4c\xcc\xde\x40\x00\x40\x06\x6c\x80\xc0\xa8\xd9\x17\x8d\x54\xda\x28'
>>> int.from_bytes(byte_obj, byteorder='big')
394277201243797802270421732363840487422965373480
Where byteorder is used to specify whether the input is big- or little-endian (i.e. most or least significant byte first).
(Looks a bit bigger than 32 bits though!)
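As a quick sketch of what byteorder does, and how you might grab just the first 32 bits (assuming you want the leading 4 bytes):

```python
byte_obj = b'\x45\x10\x00\x4c\xcc\xde\x40\x00\x40\x06\x6c\x80\xc0\xa8\xd9\x17\x8d\x54\xda\x28'

# All 20 bytes together make a 160-bit number; for just the first
# 32 bits, slice off the first 4 bytes before converting.
first_word = int.from_bytes(byte_obj[:4], byteorder='big')
print(hex(first_word))  # 0x4510004c

# 'little' reads the same 4 bytes least-significant-byte first,
# so it produces a different value.
print(hex(int.from_bytes(byte_obj[:4], byteorder='little')))  # 0x4c001045
```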
[Edit: In summary, this question was the result of me making (clearly incorrect) assumptions about what endian means (I assumed it was 00000001 vs 10000000, i.e. reversing the bits rather than the bytes). Many thanks @tripleee for clearing up my confusion.]
As far as I can tell, the byte order of frames returned by the Python 3 wave module [1] (which I'll now refer to as pywave) isn't documented. I've had a look at the source code [2] [3], but haven't quite figured it out.
Firstly, it looks like pywave only supports 'RIFF' wave files [2]. 'RIFF' files are little endian; samples are unsigned at bit depths of 8 or lower, otherwise signed (two's complement).
However, it looks like pywave converts the bytes it reads from the file to sys.byteorder [2]:
data = self._data_chunk.read(nframes * self._framesize)
if self._sampwidth != 1 and sys.byteorder == 'big':
    data = audioop.byteswap(data, self._sampwidth)
Except when sampwidth == 1, which corresponds to an 8 bit file. So 8 bit files aren't converted to sys.byteorder? Why would that be? (Maybe because they are unsigned?)
Currently my logic looks like:
if sampwidth == 1:
    signed = False
    byteorder = 'little'
else:
    signed = True
    byteorder = sys.byteorder
Is this correct?
8 bit wav files are incredibly rare nowadays, so this isn't really a problem. But I would still like to find answers...
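For what it's worth, here's a minimal sketch of how that logic would decode one sample with int.from_bytes (the function name and sample bytes are just for illustration, not part of the wave module):

```python
import sys

def decode_sample(raw, sampwidth):
    # 8-bit RIFF samples: unsigned, raw bytes straight from the file.
    if sampwidth == 1:
        return int.from_bytes(raw, byteorder='little', signed=False)
    # Wider samples: signed, and (per the wave source quoted above)
    # already byte-swapped to sys.byteorder.
    return int.from_bytes(raw, byteorder=sys.byteorder, signed=True)

print(decode_sample(b'\x80', 1))      # unsigned 8-bit: 128
print(decode_sample(b'\xff\x7f', 2))  # signed 16-bit, host byte order
```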
[1] https://docs.python.org/3/library/wave.html
[2] https://github.com/python/cpython/blob/3.9/Lib/wave.py
[3] https://github.com/python/cpython/blob/3.9/Lib/chunk.py
A byte is a byte; little or big endian only makes sense for data which spans more than one byte.
0xf0 is a single, 8-bit byte. The bits are 0b11110000 on any modern architecture. Without a sign bit, the range is 0 through 255 (8 bits of storage gives 2^8 possible values).
0xf0eb is a 16-bit number which takes two 8-bit bytes to represent. This can be represented as
0xf0 0xeb big-endian (0b11110000 0b11101011), or
0xeb 0xf0 little-endian (0b11101011 0b11110000).
The range of possible values without a sign bit is 0 through 65,535 (2^16 values).
You can also have different byte orders for 32-bit numbers etc, but I'll defer to Wikipedia etc for the full exposition.
It seems base58 and base56 conversion treat input data as a single big-endian number: an unsigned bigint.
I'm encoding some integers into shorter strings using base58 or base56. In some implementations the integer's native (little-endian, in my case) byte representation is converted to a string directly, while in other implementations the number is converted to its big-endian representation first. The loose specifications of these encodings don't seem to clarify which approach is right. Is there an explicit specification of which to do, or a more widely popular option of the two I'm not aware of?
I was trying to compare some methods of making a short URL. The source is actually a 10 digit number that's less than 4 billion. In this case I was thinking to make it an unsigned 4 byte integer, possibly Little Endian, and then encode it with a few options (with alphabets):
base64 A…Za…z0…9+/
base64 url-safe A…Za…z0…9-_
Z85 0…9a…zA…Z.-:+=^!/*?&<>()[]{}@%$#
base58 1…9A…HJ…NP…Za…km…z (excluding 0IOl+/ from base64 & reordered)
base56 2…9A…HJ…NP…Za…kmnp…z (excluding 1o from base58)
So like, base16, base32 and base64 make pretty good sense in that they're taking 4, 5 or 6 bits of input data at a time and looking them up in an alphabet index. The latter uses 4 symbols per 3 bytes. Straightforward, and this works for any data.
The other 3 have me finding various implementations that disagree with each other about the right output. The problem appears to be that no whole number of bytes corresponds to a fixed number of lookups in these bases; e.g. taking 2^1 through 2^100 and computing the remainders mod 56, 58 and 85 never yields a remainder of 0.
Z85 (ascii85, base85, et al.) approaches this by grabbing 4 bytes at a time, encoding them to 5 symbols, and accepting some waste, so there's byte alignment to a degree (base64 aligns every 4 symbols to 3 bytes; Z85 every 5 symbols to 4 bytes). But the alphabet is … not great for URLs, the command line, or SGML/XML use.
base58 and base56 seem intent on treating the input bytes as a big-endian ordered bigint and repeating: % base; lookup; -= % base; /= base on the input bigint. Which, I think, ends up modifying most of the input on every iteration.
For my input that's not a huge performance concern though.
Since we shouldn't treat the input as string data (or we get output longer than the 10 digit decimal input, and what's the point in that), does anyone know of any indication of which kind of processing yields a canonical result for base56 or base58?
1. Take the little endian 4 byte word of the 10 digit number (< 4×10^9), treat that byte sequence as a (different) big endian number, and convert it by repeating the steps.
2. Represent the 10 digit number (< 4×10^9) in 4 bytes big endian before converting it by repeating the steps.
I'm leaning towards going the route of the 2nd way.
For example, given the number 3003295320:
The little endian representation is 58 a6 02 b3.
The big endian representation is b3 02 a6 58. Meaning:
base64 gives:
>>> base64.b64encode(int.to_bytes(3003295320,4,'little'))
b'WKYCsw=='
>>> base64.b64encode(int.to_bytes(3003295320,4,'big'))
b'swKmWA=='
>>> base64.b64encode('3003295320'.encode('ascii'))
b'MzAwMzI5NTMyMA==' # Definitely not using this
Z85 gives:
>>> encode(int.to_bytes(3003295320,4,'little'))
b'sF=ea'
>>> encode(int.to_bytes(3003295320,4,'big'))
b'VJv1a'
>>> encode('003003295320'.encode('ascii')) # padding to 4 byte boundary
b'fFCppfF+EAh8v0w' # Definitely not using this
base58 gives:
>>> base58.b58encode(int.to_bytes(3003295320,4,'little'))
b'3GRfwp'
>>> base58.b58encode(int.to_bytes(3003295320,4,'big'))
b'5aPg4o'
>>> base58.b58encode('3003295320')
b'3soMTaEYSLkS4w' # Still not using this
base56 gives:
>>> b56encode(int.to_bytes(3003295320,4,'little'))
b'4HSgyr'
>>> b56encode(int.to_bytes(3003295320,4,'big'))
b'6bQh5q'
>>> b56encode('3003295320')
b'4uqNUbFZTMmT5y' # Longer than 10 digits so...
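For reference, a minimal sketch of the divmod loop described above, treating the input as one big-endian unsigned number (the alphabet is the common Bitcoin-style base58 alphabet; the helper name is mine, and it ignores the leading-zero-byte convention real base58 implementations add):

```python
# Common Bitcoin-style base58 alphabet (0, O, I, l excluded).
B58_ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

def b58encode_bigint(data):
    # Treat the whole byte string as one big-endian unsigned number,
    # then peel off base-58 digits least-significant-first.
    n = int.from_bytes(data, 'big')
    out = ''
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    return out or B58_ALPHABET[0]

print(b58encode_bigint(int.to_bytes(3003295320, 4, 'big')))  # 5aPg4o
```

This reproduces the big-endian b58encode result shown above, which suggests the common libraries take the "bytes as big-endian bigint" view.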
I'm working with Python 3, trying to get an integer out of a digest. I'm only interested in the first n bits of the digest though.
What I have right now is this:
n = 3
int(hashlib.sha1(b'test').digest()[0:n])
This however results in a ValueError: invalid literal for int() with base 10: b'\xa9J' error.
Thanks.
The Py3 solution is to use int.from_bytes to convert bytes to int, then shift off the part you don't care about:
def bitsof(bt, nbits):
    # Directly convert enough bytes to an int to ensure you have at least as many bits
    # as needed, but no more
    neededbytes = (nbits+7)//8
    if neededbytes > len(bt):
        raise ValueError("Require {} bytes, received {}".format(neededbytes, len(bt)))
    i = int.from_bytes(bt[:neededbytes], 'big')
    # If there were a non-byte aligned number of bits requested,
    # shift off the excess from the right (which came from the last byte processed)
    if nbits % 8:
        i >>= 8 - nbits % 8
    return i
Example use:
>>> bitsof(hashlib.sha1(b'test').digest(), 3)
5 # the leftmost three bits of the first byte of the hash (0xa9 == 0b10101001)
On Python 2, the function can be used almost as is, aside from adding a binascii import and changing the bytes-to-int conversion to the slightly less efficient two-step version (from str to hex representation, then parsing it with int in base 16):
i = int(binascii.hexlify(bt[:neededbytes]), 16)
Everything else works as is (even the // operator works as expected; Python 2's / operator is different from Py 3's, but // works the same on both).
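The two conversion routes give the same integer, which is easy to check on Python 3 (the byte string here is arbitrary):

```python
import binascii

bt = b'\xa9\x4a\x8f\xe5'

# Python 3 route: direct conversion.
direct = int.from_bytes(bt, 'big')
# Python 2-compatible route: hex representation first, then base-16 parse.
two_step = int(binascii.hexlify(bt), 16)

assert direct == two_step
print(hex(direct))  # 0xa94a8fe5
```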
If I have two hex-strings and want to convert one to an 32-bit unsigned integer and the other to a 64-bit unsigned integer, what bases would I provide the int() function?
Well, Python usually decides how much memory to allocate itself. See the following example:
>>> type(int('0x7fffffff', 16))
<type 'int'>
>>> type(int('0x80000000', 16))
<type 'long'>
Based on the size of the number, Python allocates the right amount of memory.
BUT if you use long() instead of int(), you always get a long (Python 2's arbitrary-precision type), no matter how small the number is:
>>> type(long('0x7fffffff', 16))
<type 'long'>
>>> type(long('0x80000000', 16))
<type 'long'>
*Tested on Python 2.7 (not tested on 3.x, where int and long are unified into a single arbitrary-precision int)
So I goofed up: int() does not determine the size or sign of your hex string.
By definition, hex is base 16, so you would pass your string in with base 16:
int('A1B31231', 16)
The difference between 32 bit and 64 bit is simply the length of the string passed in as an argument.
By virtue of their size, 2 hex characters = 1 byte.
So a 64 bit int is 8 bytes, or a 16 character hex string.
A 32 bit int is 4 bytes, or an 8 character hex string.
Based off Duncan's answer: in order to make your result unsigned, you need to & it with the proper mask.
If you're looking to go from hex to a uint32, you would do the aforementioned int() conversion and then
result & 0xffffffff
If you wanted to go from hex to a uint64, you would do the aforementioned int() conversion and then
result & 0xffffffffffffffff
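As a minimal sketch of the masking (note that int() on a plain hex string already returns a non-negative value; the mask matters mainly when truncating a wider value down to 32 or 64 bits):

```python
value = int('ffffffffffffffff', 16)  # always parses as a non-negative int

# Truncate to unsigned 32 and 64 bits with the masks above.
u32 = value & 0xffffffff
u64 = value & 0xffffffffffffffff

print(u32)  # 4294967295
print(u64)  # 18446744073709551615
```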
I was wondering how I could extract the last 2 bits of a byte. I receive the bytes when reading in from a file.
byte = b'\xfe'
bits = bin(byte)
output: 0b00110001
I want to know how I can get the 7th and 8th bits from that.
Any help would be appreciated.
There is always the old-fashioned trick of masking:
>>> bits = bin(byte[0] & 0x03)
>>> bits
'0b10'
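If you want the two bits individually, you can mask and shift (counting the 7th and 8th bits from the most significant end of the byte):

```python
byte = b'\xfe'
value = byte[0]  # indexing a bytes object gives an int (254 here)

low2 = value & 0b11      # both of the last two bits together: 0b10
bit7 = (value >> 1) & 1  # 7th bit (second from the right): 1
bit8 = value & 1         # 8th bit (rightmost): 0

print(low2, bit7, bit8)
```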