It seems base58 and base56 conversion treat input data as a single Big Endian number; an unsigned bigint number.
If I'm encoding integers into shorter strings with base58 or base56, some implementations take the integer's native byte representation (little endian in my case) and convert that to a string, while others convert the number to big endian representation first. The loose specifications of these encodings don't seem to clarify which approach is right. Is there an explicit specification of which to do, or is one of the two options much more widely used than I'm aware of?
I was trying to compare some methods of making a short URL. The source is actually a 10 digit number that's less than 4 billion. In this case I was thinking to make it an unsigned 4 byte integer, possibly Little Endian, and then encode it with a few options (with alphabets):
base64 A…Za…z0…9+/
base64 url-safe A…Za…z0…9-_
Z85 0…9a…zA…Z.-:+=^!/*?&<>()[]{}@%$#
base58 1…9A…HJ…NP…Za…km…z (excluding 0IOl+/ from base64 & reordered)
base56 2…9A…HJ…NP…Za…kmnp…z (excluding 1o from base58)
So like, base16, base32 and base64 make pretty good sense in that they're taking 4, 5 or 6 bits of input data at a time and looking them up in an alphabet index. The latter uses 4 symbols per 3 bytes. Straightforward, and this works for any data.
The other 3 have me finding various implementations that disagree with each other as to the right output. The problem appears to be that no whole number of bytes maps onto a whole number of symbols in these bases. E.g. taking 2^1 to 2^100 modulo 56, 58 and 85 never gives a remainder of 0.
Z85 (ascii85, base85 et al.) approach this by grabbing 4 bytes at a time, encoding them to 5 symbols and accepting some waste. But there's byte alignment to some degree here (base64 has alignment per 16 symbols, Z85 gets there with 5). But the alphabet is … not great for URLs, the command line, or SGML/XML use.
base58 and base56 seem intent on treating the input bytes like a Big Endian ordered bigint and repeating: % base; lookup; -= % base; /= base on the input bigint. Which… I mean, I think that ends up modifying most of the input for every iteration.
For my input that's not a huge performance concern though.
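For reference, the repeated divide-and-look-up loop can be sketched as below. This is only a minimal illustration, assuming the Bitcoin-style base58 alphabet and the convention that the input bytes are read as one big-endian unsigned integer, with one leading '1' written per leading zero byte. The name `b58encode_sketch` is made up for this example and is not taken from any library.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode_sketch(data: bytes) -> str:
    """Encode bytes by treating them as one big-endian unsigned integer."""
    n = int.from_bytes(data, "big")
    digits = []
    while n > 0:
        n, rem = divmod(n, 58)         # one "% base" and "/= base" step
        digits.append(ALPHABET[rem])   # look the remainder up in the alphabet
    # Each leading zero byte is conventionally written as the first symbol.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return ALPHABET[0] * pad + "".join(reversed(digits))

number = 3003295320
as_big = b58encode_sketch(number.to_bytes(4, "big"))
as_little = b58encode_sketch(number.to_bytes(4, "little"))
print(as_big, as_little)
```

Under these assumptions the encoder itself has no notion of the integer's original byte order. The only choice is which byte sequence gets handed to it, and the two choices give different strings.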
Treating the input as string data is out, since the output would be longer than the 10 digit decimal input and there would be no point. So does anyone know of any indication of which of these two kinds of processing gives something canonical for base56 or base58?
1. Take the Little Endian 4 byte word of the 10 digit number (<4*10^9), read that byte sequence as a Big Endian number (which is a different number), and convert that by repeating the steps.
2. Represent the 10 digit number (<4*10^9) in 4 bytes Big Endian, then convert that by repeating the steps.
I'm leaning towards going the route of the 2nd way.
For example given the number: 3003295320
The little endian representation is 58 a6 02 b3
The big endian representation is b3 02 a6 58.
base64 gives:
>>> base64.b64encode(int.to_bytes(3003295320,4,'little'))
b'WKYCsw=='
>>> base64.b64encode(int.to_bytes(3003295320,4,'big'))
b'swKmWA=='
>>> base64.b64encode('3003295320'.encode('ascii'))
b'MzAwMzI5NTMyMA==' # Definitely not using this
Z85 gives:
>>> encode(int.to_bytes(3003295320,4,'little'))
b'sF=ea'
>>> encode(int.to_bytes(3003295320,4,'big'))
b'VJv1a'
>>> encode('003003295320'.encode('ascii')) # padding to 4 byte boundary
b'fFCppfF+EAh8v0w' # Definitely not using this
base58 gives:
>>> base58.b58encode(int.to_bytes(3003295320,4,'little'))
b'3GRfwp'
>>> base58.b58encode(int.to_bytes(3003295320,4,'big'))
b'5aPg4o'
>>> base58.b58encode('3003295320')
b'3soMTaEYSLkS4w' # Still not using this
base56 gives:
>>> b56encode(int.to_bytes(3003295320,4,'little'))
b'4HSgyr'
>>> b56encode(int.to_bytes(3003295320,4,'big'))
b'6bQh5q'
>>> b56encode('3003295320')
b'4uqNUbFZTMmT5y' # Longer than 10 digits so...
Related
[Edit: In summary, this question was the result of me making (clearly incorrect) assumptions about what endian means (I assumed it was 00000001 vs 10000000, i.e. reversing the bits, rather than the bytes). Many thanks @tripleee for clearing up my confusion.]
As far as I can tell, the byte order of frames returned by the Python 3 wave module [1] (which I'll now refer to as pywave) isn't documented. I've had a look at the source code [2] [3], but haven't quite figured it out.
Firstly, it looks like pywave only supports 'RIFF' wave files [2]. 'RIFF' files are little endian; samples are unsigned at 8 bits or lower bit depth, otherwise signed (two's complement).
However, it looks like pywave converts the bytes it reads from the file to sys.byteorder [2]:
data = self._data_chunk.read(nframes * self._framesize)
if self._sampwidth != 1 and sys.byteorder == 'big':
    data = audioop.byteswap(data, self._sampwidth)
Except in the case of sampwidth==1, which corresponds to an 8 bit file. So 8 bit files aren't converted to sys.byteorder? Why would this be? (Maybe because they are unsigned?)
Currently my logic looks like:
if sampwidth == 1:
    signed = False
    byteorder = 'little'
else:
    signed = True
    byteorder = sys.byteorder
Is this correct?
8 bit wav files are incredibly rare nowadays, so this isn't really a problem. But I would still like to find answers...
[1] https://docs.python.org/3/library/wave.html
[2] https://github.com/python/cpython/blob/3.9/Lib/wave.py
[3] https://github.com/python/cpython/blob/3.9/Lib/chunk.py
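One way to check this logic is a small in-memory round trip. The sketch below assumes the behaviour of the `readframes` code quoted above (frames handed over in the machine's native byte order for sample widths above 1) and that `writeframes` expects the same; the sample values and the 8000 Hz rate are arbitrary choices for the example.

```python
import io
import sys
import wave

samples = [0, 1000, -1000]  # illustrative 16-bit sample values

# Write a tiny mono 16-bit file into memory; frames are passed in native order.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"".join(s.to_bytes(2, sys.byteorder, signed=True) for s in samples))

buf.seek(0)
with wave.open(buf, "rb") as r:
    sampwidth = r.getsampwidth()
    raw = r.readframes(r.getnframes())

signed = sampwidth != 1  # 8-bit WAV samples are unsigned
decoded = [
    int.from_bytes(raw[i:i + sampwidth], sys.byteorder, signed=signed)
    for i in range(0, len(raw), sampwidth)
]
print(decoded)
```

For a single-byte sample the `byteorder` argument makes no difference to `int.from_bytes`, so only the `signed` flag matters in the 8-bit case.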
A byte is a byte, little or big endian only makes sense for data which is more than one byte.
0xf0 is a single, 8-bit byte. The bits are 0b11110000 on any modern architecture. Without a sign bit, the range is 0 through 255 (8 bits of storage gets 2^8 = 256 possible values).
0xf0eb is a 16-bit number which takes two 8-bit bytes to represent. This can be represented as
0xf0 0xeb big-endian (0b11110000 0b11101011), or
0xeb 0xf0 little-endian (0b11101011 0b11110000)
The range of possible values without a sign bit is 0 through 65,535 (2^16 = 65,536 values).
You can also have different byte orders for 32-bit numbers etc, but I'll defer to Wikipedia etc for the full exposition.
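The same 16-bit example can be tried in Python with `int.to_bytes` and `int.from_bytes`. This is just an illustrative sketch; the variable names are arbitrary.

```python
value = 0xF0EB

big = value.to_bytes(2, "big")
little = value.to_bytes(2, "little")
print(big.hex(), little.hex())

# Reading the same two bytes with the wrong order gives a different number.
print(int.from_bytes(big, "big"), int.from_bytes(big, "little"))

# Endianness reorders whole bytes; the bits inside each byte stay as they are.
print(format(big[0], "08b"), format(little[1], "08b"))
```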
I'm having some trouble converting a hexadecimal value to decimal. In theory it is easy with Python:
value = '000100000001f8c65400fefe3195000001230000000000000000000000000000000000000000642b00090700000000001e'
print(int(value,16))
And this is the result:
153914086775326342143664486282692693880080977757509806775956245674142536051238079779640236240803190331364310253598
Up to here everything is fine.
This string represents a payload of different bytes and I know two things:
1 byte = 8 bits.
Hexadecimal values are usually represented with two hexadecimal values, like FF 2E 32 etc.
The problem comes when I want to work with specific bytes: in theory I know that bytes 18, 19, 20 and 21 hold a decimal number that starts with 39 (I don't know the digits that follow). But when I try to decode it, I can't find it.
# First try
a = value[36:43] # 18*2 to 21*2
print(a)
print(int(a,16))
# Second try
a = value[18:22] # 18 to 21
print(a)
print(int(a,16))
With the naked eye, I can see that the third and fourth digits of the first result are this 39:
153914086775326342143664486282692693880080977757509806775956245674142536051238079779640236240803190331364310253598
But then again, if I do
# third try
a = value[2:4]
print(a)
print(int(a,16))
I can't find this 39, and the values change from the first result.
How should I do this? I'm sure it is easy, but I don't know how. I want to learn to access the individual bytes, but I can't understand the logic.
EDIT trying to explain it better
I have this hexadecimal payload:
153914086775326342143664486282692693880080977757509806775956245674142536051238079779640236240803190331364310253598
And this represents the set of different values collected in bytes.
Therefore, what I try is to be able to access a byte (or a set) to know what its value would be in decimal. For example, I know that byte 18 to 21 is the latitude and byte 39 the battery. How can I decode it with python?
(In my city the latitude always starts in 39, that's what I said before this)
Thank you very much
You seem to be very confused about number bases. Please look up proper terminology, as well. For instance:
Hexadecimal values are usually represented with two hexadecimal values, like FF 2E 32 etc
I'm not sure what you're trying to say here. You seem to have noticed that hex values are often separated into bytes (each hex-digit is 4 bits), but the way you express it makes me wonder what you mean. The byte-level separation is not a "representation"; rather, it's a reading convenience.
Your central question seems to be how to find a particular decimal subsequence coded into only a subsequence of the hexadecimal version. Have you confused character coding (such as ASCII, Unicode, or EBCDIC) with simple base representation? Character coding allows you to make this sub-string conversion; changing number bases is not at all the same operation. For instance:
base 16    base 10
1          1
21         33   = 2 * 16 + 1
B21        2849 = 11 * 16*16 + 2 * 16 + 1
B2         178  = 11 * 16 + 2
B          11
There is no subsequence in one base that is coded with a subsequence in the other base. For instance, the "84" in base 10 is a notational feature of the entire hexadecimal number, not any subsequence.
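What does work is slicing the payload into bytes first and converting each slice on its own. The sketch below is an assumption-laden illustration: `field` is a hypothetical helper, offsets are taken to be zero-based, and the byte order of multi-byte fields is a guess, since none of that is stated for the payload in the question.

```python
def field(payload_hex: str, start: int, end: int, byteorder: str = "big") -> int:
    """Return bytes start..end-1 of a hex payload as an unsigned integer."""
    raw = bytes.fromhex(payload_hex)   # two hex digits become one byte
    return int.from_bytes(raw[start:end], byteorder)

# A substring of the hex digits is not a substring of the decimal digits.
print(int("B21", 16), int("B2", 16))

demo = "0001f8c6"                      # made-up 4-byte payload
print(field(demo, 0, 2), field(demo, 2, 4))
```

For the real payload, the device's documentation would have to say whether a field such as the latitude is an integer, a scaled value or a float, and in which byte order it is stored.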
I know that array.tostring gives the array of machine values. But I am trying to figure out how they are represented.
e.g
>>> a = array('l', [2])
>>> a.tostring()
'\x02\x00\x00\x00'
Here, I know that 'l' means each index will be min of 4 bytes and that's why we have 4 bytes in the tostring representation. But why is the Most significant bit populated with \x02. Shouldn't it be '\x00\x00\x00\x02'?
>>> a = array('l', [50,3])
>>> a.tostring()
'2\x00\x00\x00\x03\x00\x00\x00'
Here I am guessing the 2 in the beginning is because 50 is the ASCII value of 2, then why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C
But why is the Most significant bit populated with \x02. Shouldn't it be '\x00\x00\x00\x02'?
The \x02 in '\x02\x00\x00\x00' is not the most significant byte. I guess you are confused by trying to read it as a hexadecimal number where the most significant digit is on the left. This is not how the string representation of an array returned by array.tostring() works. Bytes of the represented value are put together in a string left-to-right in the order from least significant to most significant. Just consider the array as a list of bytes, and the first (or, rather, 0th) byte is on the left, as is usual in regular python lists.
why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C?
Do you have any example where Python represents the character behind Ctrl-C as Ctrl-C or similar? ASCII code 3 corresponds to an unprintable character that has no escape sequence of its own, so it is represented through its hex code.
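The same byte layout can be reproduced on Python 3 with `struct`, which lets the byte order be stated explicitly instead of depending on the machine. This is a sketch rather than a replacement for the `array` code above: it assumes 4-byte integers, and note that `array.tostring()` was removed in Python 3.9 in favour of `tobytes()`.

```python
import struct

# "<i" forces a little-endian 4-byte signed int regardless of the machine.
one = struct.pack("<i", 2)
two = struct.pack("<ii", 50, 3)
print(one)
print(two)

# The same value in big-endian order, for comparison.
print(struct.pack(">i", 2))

# Byte 50 is shown as the printable character "2"; byte 3 has no printable form.
print(list(two))
```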
I'm trying to write my "personal" python version of STL binary file reader, according to WIKIPEDIA : A binary STL file contains :
an 80-character (byte) header, which is generally ignored.
a 4-byte unsigned integer indicating the number of triangular facets in the file.
Each triangle is described by twelve 32-bit floating-point numbers: three for the normal and then three for the X/Y/Z coordinate of each vertex – just as with the ASCII version of STL. After these follows a 2-byte ("short") unsigned integer that is the "attribute byte count" – in the standard format, this should be zero because most software does not understand anything else. --Floating-point numbers are represented as IEEE floating-point numbers and are assumed to be little-endian--
Here is my code :
#! /usr/bin/env python3
with open("stlbinaryfile.stl","rb") as fichier :
    head=fichier.read(80)
    nbtriangles=fichier.read(4)
    print(nbtriangles)
The output is :
b'\x90\x08\x00\x00'
It represents an unsigned integer, and I need to convert it without using any package (struct, stl...). Are there any (basic) rules for doing this? I don't know what \x means. How does \x90 represent one byte?
Most of the answers on Google mention "C structs", but I don't know anything about C.
Thank you for your time.
Since you're using Python 3, you can use int.from_bytes. I'm guessing the value is stored little-endian, so you'd just do:
nbtriangles = int.from_bytes(fichier.read(4), 'little')
Change the second argument to 'big' if it's supposed to be big-endian.
Mind you, the normal way to parse a fixed width type is the struct module, but apparently you've ruled that out.
For the confusion over the repr, bytes objects will display ASCII printable characters (e.g. a) or standard ASCII escapes (e.g. \t) if the byte value corresponds to one of them. If it doesn't, it uses \x##, where ## is the hexadecimal representation of the byte value, so \x90 represents the byte with value 0x90, or 144. You need to combine the byte values at offsets to reconstruct the int, but int.from_bytes does this for you faster than any hand-rolled solution could.
Update: Since apparently int.from_bytes isn't "basic" enough, here are a couple of more complex solutions that use only top-level built-ins (not alternate constructors). For little-endian, you can do this:
def int_from_bytes(inbytes):
    res = 0
    for i, b in enumerate(inbytes):
        res |= b << (i * 8) # Adjust each byte individually by 8 times position
    return res
You can use the same solution for big-endian by adding reversed to the loop, making it enumerate(reversed(inbytes)), or you can use this alternative solution that handles the offset adjustment a different way:
def int_from_bytes(inbytes):
    res = 0
    for b in inbytes:
        res <<= 8 # Adjust bytes seen so far to make room for new byte
        res |= b # Mask in new byte
    return res
Again, this big-endian solution can trivially work for little-endian by looping over reversed(inbytes) instead of inbytes. In both cases inbytes[::-1] is an alternative to reversed(inbytes) (the former makes a new bytes in reversed order and iterates that, the latter iterates the existing bytes object in reverse, but unless it's a huge bytes object, enough to strain RAM if you copy it, the difference is pretty minimal).
The typical way to interpret an integer is to use struct.unpack, like so:
import struct
with open("stlbinaryfile.stl","rb") as fichier :
    head=fichier.read(80)
    nbtriangles=fichier.read(4)
    print(nbtriangles)
    # "<I" is a little-endian unsigned 32-bit int; unpack returns a 1-tuple
    nbtriangles=struct.unpack("<I", nbtriangles)[0]
    print(nbtriangles)
If you are allergic to import struct, then you can also compute it by hand:
def unsigned_int(s):
    result = 0
    for ch in s[::-1]:
        result *= 256
        result += ch
    return result
...
nbtriangles = unsigned_int(nbtriangles)
As to what you are seeing when you print b'\x90\x08\x00\x00': you are printing a bytes object, which is an array of integers in the range [0-255]. The first integer has the value 144 (decimal) or 90 (hexadecimal). When printing a bytes object, that value is represented by the string \x90. The 2nd has the value eight, represented by \x08. The 3rd and 4th (final) integers are both zero. They are represented by \x00.
If you would like to see a more familiar representation of the integers, try:
print(list(nbtriangles))
[144, 8, 0, 0]
To compute the 32-bit integers represented by these four 8-bit integers, you can use this formula:
total = byte0 + (byte1*256) + (byte2*256*256) + (byte3*256*256*256)
Or, in hex:
total = byte0 + (byte1*0x100) + (byte2*0x10000) + (byte3*0x1000000)
Which results in:
0x00000890
Perhaps you can see the similarities to decimal, where the string "1234" represents the number:
4 + 3*10 + 2*100 + 1*1000
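As a quick check, the formula can be applied to the four bytes from the question and compared with the built-in conversion. This is only a sketch of the arithmetic above; the variable names follow the formula.

```python
raw = b"\x90\x08\x00\x00"
byte0, byte1, byte2, byte3 = raw       # iterating bytes yields integers

total = byte0 + (byte1 * 0x100) + (byte2 * 0x10000) + (byte3 * 0x1000000)
print(total, hex(total))

# The built-in conversion should agree with the manual sum.
print(int.from_bytes(raw, "little"))
```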
This is less of a programming question, and more of a question to understand what is what? I am not a CS major, and I am trying to understand the basic difference between these 3 formats :
1) EBCDIC 2) Unsigned binary number 3) Binary coded decimal
If this is not a real question, I apologize, but google was not very useful in explaining this to me
Say I have a string of numbers like "12890". What would their representation be in EBCDIC, unsigned binary number and BCD format?
Is there a python 2.6 library I can use to simply convert any string of numbers to either of these formats?
For example, for string to ebcdic, I am doing
def encodeEbcdic(text):
    return text.decode('latin1').encode('cp037')
print encodeEbcdic('AGNS')
But, I get this ┴╟╒Γ
EBCDIC is an IBM character encoding. It's meant for encoding text. Of course numerals can occur in text, as in "1600 Pennsylvania Avenue" so there are codes for numerals, too. To translate 1600 to EBCDIC, you need to find an EBCDIC table. Then you look up the code for 1, the code for 6, and the code for 0 (twice.) According to the table at http://www.astrodigital.org/digital/ebcdic.html
the EBCDIC codes for 0 through 9 are F0 through F9, respectively. This looks familiar, but I can't say I really remember.
An unsigned binary number is just that. It's the number written in base two. (See below.)
Binary-coded decimal (BCD) is an old format for storing the decimal representation of numbers on a digital computer. Each decimal digit is represented by its binary equivalent. Let's take 64 as an example. Since 64 is just 2 to the sixth power, in binary it's represented as a 1 followed by 6 0's: 1000000. In binary-coded decimal, we write the six in binary -- 0110 and the four in binary -- 0100 so that the BCD representation is 01100100. We need four bits for each digit, because the largest decimal digit, 9 works out to be 1001. BCD was used extensively in COBOL. If it's used anywhere else these days, I'm not familiar with the application.
Edit: I should have explained that F0, F1, etc. in EBCDIC are hex codes, so the F is 1111 and the digits are the same as in BCD. So, EBCDIC for numbers turns out to be the same as BCD, but with an extra 1111 before each digit.
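To make the three formats concrete for the "12890" from the question, here is a small Python 3 sketch. It assumes the cp037 code page for EBCDIC and shows BCD as one 4-bit group per digit; both choices are for illustration only.

```python
digits = "12890"

# EBCDIC: one byte per character, using the cp037 code page as an example.
ebcdic = digits.encode("cp037")
print(ebcdic.hex())

# Unsigned binary: the number itself written in base two.
binary = format(int(digits), "b")
print(binary)

# BCD: four bits per decimal digit.
bcd = " ".join(format(int(d), "04b") for d in digits)
print(bcd)
```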
saulspatz, thanks for your explanation. I was able to find the methods needed to convert any string of numbers into the different encodings. I had to refer to Effective Python Chapter 1, Item 3: Know the Differences Between bytes, str, and unicode
And from there on, I read more about data types and such.
Anyway, to answer my questions :
1) String to EBCDIC:
def encode_ebcdic(text):
    return text.decode('latin1').encode('cp037')
The encoding here is cp037 for USA. You can use cp500 for International. Here is a list of them :
https://en.wikipedia.org/wiki/List_of_EBCDIC_code_pages_with_Latin-1_character_set
2) Hexadecimal String to unsigned binary number :
def str_to_binary(text):
    return int(text, 16)
This is pretty basic, just convert the Hexadecimal string to a number.
3) Hexadecimal string to Binary coded decimal:
def str_to_bcd(text):
    # Python 2 only: the 'hex' codec turns each pair of hex digits into one byte
    return bytes(text).decode('hex')
Yes, you need to convert it to a byte array, so that BCD conversion can take place. Please read saulspatz's answer for what BCD encoding is.
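The snippets above are Python 2. A rough Python 3 equivalent might look like the sketch below; the function names are made up for this example, and `hex_to_packed_bcd` only yields valid BCD when the input has an even number of characters and contains decimal digits only. The cp437 line is a guess at why the question's output looked like ┴╟╒Γ: those would be the expected cp037 bytes displayed by a console using code page 437, which is an assumption about the asker's terminal.

```python
def encode_ebcdic(text: str) -> bytes:
    return text.encode("cp037")         # str -> EBCDIC bytes

def hex_to_int(text: str) -> int:
    return int(text, 16)                # hex string -> integer

def hex_to_packed_bcd(text: str) -> bytes:
    return bytes.fromhex(text)          # each pair of digits -> one byte

encoded = encode_ebcdic("AGNS")
print(encoded.hex())

# The same bytes read through a DOS code page become box-drawing characters.
as_cp437 = encoded.decode("cp437")

print(hex_to_int("ff"), hex_to_packed_bcd("012890"))
```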