Python bytes to binary string - how 4 bytes can be 29 bits? - python

I need to read time data from sensor. Here are the instructions from manual:
I have written code in Python, but I feel like there should be some better way:
# 1
data_bytes = b'\x43\x32\x21\x10'
print('data_bytes: ', data_bytes, len(data_bytes)) # why it changes b'\x43\x32\x21\x10' to b'C2!\x10' ???
# 2
data_binary = bin(int.from_bytes(data_bytes, 'little')) # remove '0b' from string
print('data_binary: ', data_binary, len(data_binary))
data_binary = data_binary[2:]
print('data_binary: ', data_binary, len(data_binary)) # should be: 0001 0000 0010 0001 0011 0010 0100 0011, 32
# 3
sec = data_binary[0:-20]
print(sec, len(sec)) # should be: 0001 0000 0010, 12
sec = int(sec, 2)
print(sec)
usec = data_binary[-20:]
print(usec, len(usec)) # 0001 0011 0010 0100 0011, 20
usec = int(usec, 2)
print(usec)
# 4
print('time: ', sec + usec/1000000) # should be: 258.078403
Results:
data_bytes: b'C2!\x10' 4
data_binary: 0b10000001000010011001001000011 31
data_binary: 10000001000010011001001000011 29
100000010 9
258
00010011001001000011 20
78403
time: 258.078403
I have questions:
Why Python changes b'\x43\x32\x21\x10' to b'C2!\x10'?
Why is the length of the message 29 bits and not 32?
Is it possible to do this in better/cleaner/faster way?
Thanks!

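Both are the same data. Per the documentation, the repr of a bytes object shows printable ASCII bytes as characters and every other byte as a \x hexadecimal escape. If you check an ASCII table, the hexadecimal values 0x43, 0x32 and 0x21 are the characters C, 2 and !, while 0x10 is a non-printable control character, so it stays as \x10.
As noted in the comments, bin() does not emit leading zeros. The most significant byte is 0x10 (0001 0000), so its three leading zero bits are dropped and you get 29 digits instead of 32.
You can avoid most of the conversions to and from strings by using binary operations: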
data_bytes = b'\x43\x32\x21\x10'
# convert to int
data = int.from_bytes(data_bytes,'little')
# zero out the leading bits, leaving only the 20 bits corresponding to the microseconds
microseconds = data & 0x0fffff
# shift right by 20, thus keeping only the seconds (upper bits)
seconds = data >> 20
print(seconds) # 258
print(microseconds) #78403
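To make the point about leading zeros concrete, here is a minimal sketch that reuses the bytes from the question. It only relies on the built-ins format() and bytes.hex(); the variable names are illustrative.

```python
data_bytes = b'\x43\x32\x21\x10'
value = int.from_bytes(data_bytes, 'little')

# bin() drops leading zeros, so the string is shorter than 32 digits.
print(bin(value), len(bin(value)) - 2)

# A fixed-width format spec pads with zeros up to the requested width.
bits = format(value, '032b')
print(bits, len(bits))

# hex() on the bytes object shows the raw data without the ASCII rendering.
print(data_bytes.hex())
```

The padded string is only needed for display; the shift and mask above work on the integer directly.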

One example would be to convert the bytestring into an int, and extract relevant information using bit operations.
code00.py:
#!/usr/bin/env python
import sys
def decode(byte_str):
    i = int.from_bytes(byte_str, byteorder="little") # Convert to int reversing bytes (little endian)
    #i = int.from_bytes(byte_str[::-1], byteorder="big") # Equivalent to the line above: reverse explicitly and convert without reversing bytes
    print(hex(i)) # #TODO - cfati: Comment this line
    secs = i >> 20 # Discard last 20 bits (that belong to usecs)
    usecs = i & 0x000FFFFF # Only retain last 20 bits (5 hex digits)
    return secs + usecs / 1000000
def main(*argv):
    rec = b"\x43\x32\x21\x10"
    dec = decode(rec)
    print("Result: {:.6f}".format(dec))
if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.\n")
    sys.exit(rc)
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q075472971]> "e:\Work\Dev\VEnvs\py_pc064_03.10_test0\Scripts\python.exe" ./code00.py
Python 3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)] 064bit on win32
0x10213243
Result: 258.078403
Done.
Might also be worth reading:
[Python.Docs]: Built-in Types - Numeric Types — int, float, complex
[SO]: Python struct.pack() behavior (@CristiFati's answer) for an explanation regarding your "weird" outputs and hex representations
[SO]: Output of crc32b in PHP is not equal to Python (@CristiFati's answer)

Have you tried using divmod ?
Dividing the integer value by 2^20 or 1048576 will directly split it in the two parts you are looking for.
sec,usec = divmod(int.from_bytes(data_bytes, 'little'),1048576)
print(sec,usec)
# 258 78403

Is it possible to do this in better/cleaner/faster way?
I suggest taking a look at the struct module of the standard library. Consider the following simple example: say you have a message that is big-endian and contains one char and one unsigned int, then you could do
import struct
data_bytes = b'\x56\xFF\xFF\xFF\xFF'
character, value = struct.unpack('>cI',data_bytes)
print(character) # b'V'
print(value) # 4294967295
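Applied to the sensor message from the question, a sketch could look like the following. struct has no format code for a 12-bit or 20-bit field, so the four bytes are read as one little-endian unsigned int and then split with a shift and a mask. The function name decode_time is hypothetical.

```python
import struct

def decode_time(data_bytes):
    """Split a 4-byte little-endian word into (seconds, microseconds)."""
    # '<I' = little-endian unsigned 32-bit integer
    (raw,) = struct.unpack('<I', data_bytes)
    seconds = raw >> 20           # upper 12 bits
    microseconds = raw & 0xFFFFF  # lower 20 bits
    return seconds, microseconds

seconds, microseconds = decode_time(b'\x43\x32\x21\x10')
print(seconds, microseconds)
```

Compared with int.from_bytes, struct.unpack also checks that the buffer is exactly 4 bytes long and raises struct.error otherwise.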

Related

combine two bytes to form signed value (16 bit)

I want to combine two bytes (8 bit) to form a signed value (one bit for sign and 15 for the value) according to the two complement's method.
I receive the MSByte (note that the leftmost bit of the MSByte is the sign) and the LSByte. So I wrote a function that shifts the MSByte left by 8 bits and adds the LSByte to form a 16-bit binary sequence. Then I calculate the ones' complement, and finally add 1 to the result. However, it does not work.
def twos_comp_two_bytes(msb, lsb):
    a = (msb<<8) + lsb
    r = ~(a)+1
    return r
For example 0b1111110111001001 is -567, however with the above function I get -64969.
EDIT : call of the function
twos_comp_two_bytes(0b11111101,0b11001001) => -64969
Python integers can have any length, they are not restricted to 16 bits, so to get -567 you would rather need
r = a - (256*256)
but other values need more code
def twos_comp_two_bytes(msb, lsb):
    a = (msb<<8) + lsb
    if a >= (256*256)//2:
        a = a - (256*256)
    return a
print(twos_comp_two_bytes(0b11111101, 0b11001001))
print(twos_comp_two_bytes(0b0, 0b0))
print(twos_comp_two_bytes(0b0, 0b1))
print(twos_comp_two_bytes(0b10000000, 0b0))
print(twos_comp_two_bytes(0b10000000, 0b1))
Results:
-567
0
1
-32768
-32767
It would be better to use the struct module for this
import struct
def twos_comp_two_bytes(msb, lsb):
    return struct.unpack('>h', bytes([msb, lsb]))[0]
    #return struct.unpack('<h', bytes([lsb, msb]))[0] # different order `[lsb, msb]`
    #return struct.unpack( 'h', bytes([lsb, msb]))[0] # different order `[lsb, msb]`
print(twos_comp_two_bytes(0b11111101, 0b11001001))
print(twos_comp_two_bytes(0b0, 0b0))
print(twos_comp_two_bytes(0b0, 0b1))
print(twos_comp_two_bytes(0b10000000, 0b0))
print(twos_comp_two_bytes(0b10000000, 0b1))
Results:
-567
0
1
-32768
-32767
The letter h means short integer (signed int with 2 bytes).
The characters > and < select the byte order (big-endian and little-endian respectively).
See more in Format Characters
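As a further option, int.from_bytes accepts a signed=True argument that performs the same two's complement interpretation. This is a minimal sketch; the function name is made up for the example and the byte order is assumed to be big-endian (MSB first), as in the question.

```python
def twos_comp_from_bytes(msb, lsb):
    """Interpret two bytes (MSB first) as a signed 16-bit integer."""
    return int.from_bytes(bytes([msb, lsb]), byteorder='big', signed=True)

for msb, lsb in [(0b11111101, 0b11001001), (0, 0), (0, 1),
                 (0b10000000, 0), (0b10000000, 1)]:
    print(twos_comp_from_bytes(msb, lsb))
```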

Unpack IEEE 754 Floating Point Number

I am reading two 16 bit registers from a tcp client using the pymodbus module. The two registers make up a 32 bit IEEE 754 encoded floating point number. Currently I have the 32 bit binary value of the registers shown in the code below.
start_address = 0x1112
reg_count = 2
client = ModbusTcpClient(<IP_ADDRESS>)
response = client.read_input_registers(start_address,reg_count)
reg_1 = response.getRegister(0)<<(16 - (response.getRegister(0).bit_length())) #Get in 16 bit format
reg_2 = response.getRegister(1)<<(16 - (response.getRegister(1).bit_length())) #Get in 16 bit format
volts = (reg_1 << 16) | reg_2 #Get the 32 bit format
The above works fine to get the encoded value the problem is decoding it. I was going to code something like in this video but I came across the 'f' format in the struct module for IEEE 754 encoding. I tried decode the 32 bit float stored in volts in the code above using the unpack method in the struct module but ran into the following errors.
val = struct.unpack('f',volts)
>>> TypeError: a bytes-like object is required, not 'int'
Ok tried convert it to a 32 bit binary string.
temp = bin(volts)
val = struct.unpack('f',temp)
>>> TypeError: a bytes-like object is required, not 'str'
Tried to covert it to a bytes like object as in this post and format in different ways.
val = struct.unpack('f',bytes(volts))
>>> TypeError: string argument without an encoding
temp = "{0:b}".format(volts)
val = struct.unpack('f',temp)
>>> ValueError: Unknown format code 'b' for object of type 'str'
val = struct.unpack('f',volts.encode())
>>> struct.error: unpack requires a buffer of 4 bytes
Where do I add this buffer and where in the documentation does it say I need this buffer with the unpack method? It does say in the documentation
The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt)).
The calcsize(fmt) function returns a value in bytes but the len(string) returns a value of the length of the string, no?
Any suggestions are welcome.
EDIT
A solution to the decoding problem is given further down. In addition, here is a better way to obtain the 32-bit value from the two 16-bit register values than the one in the original question.
start_address = 0x1112
reg_count = 2
client = ModbusTcpClient(<IP_ADDRESS>)
response = client.read_input_registers(start_address,reg_count)
reg_1 = response.getRegister(0)
reg_2 = response.getRegister(1)
# Shift reg 1 by 16 bits
reg_1s = reg_1 << 16
# OR with the reg_2
total = reg_1s | reg_2
I found a solution to the problem using BinaryPayloadDecoder.fromRegisters() from the pymodbus module instead of the struct module. Note that this solution is specific to the Modbus smart meter device I am using, as the byte order and word order of the registers could differ in other devices. It may still work for decoding registers of other devices, but I would advise reading the documentation of the device first to be sure. I left the comments in the code below; when I refer to page 24, that is just for my device.
from pymodbus.client.sync import ModbusTcpClient
from pymodbus.constants import Endian
from pymodbus.payload import BinaryPayloadDecoder
start_address = 0x1112
reg_count = 2
client = ModbusTcpClient(<IP_ADDRESS>)
response = client.read_input_registers(start_address,reg_count)
# The response will contain two registers making a 32 bit floating point number
# Use the BinaryPayloadDecoder.fromRegisters() function to decode
# The coding scheme for a 32 bit float is IEEE 754 https://en.wikipedia.org/wiki/IEEE_754
# The MS Bytes are stored in the first address and the LS bytes are stored in the second address,
# this corresponds to a big endian byte order (Second parameter in function)
# The documentation for the Modbus registers for the smart meter on page 24 says that
# the low word is the first priority, this correspond to a little endian word order (Third parameter in function)
decoder = BinaryPayloadDecoder.fromRegisters(response.registers, Endian.Big, wordorder=Endian.Little)
final_val = (decoder.decode_32bit_float())
client.close()
EDIT
Credit to juanpa-arrivillaga and chepner: the problem can also be solved with the struct module, using byteorder='little'. Use whichever of the two functions in the code below matches the byte order of your device.
import struct
from pymodbus.client.sync import ModbusTcpClient
def big_endian(response):
    reg_1 = response.getRegister(0)
    reg_2 = response.getRegister(1)
    # Shift reg 1 by 16 bits
    reg_1s = reg_1 << 16
    # OR with the reg_2
    total = reg_1s | reg_2
    return total
def little_endian(response):
    reg_1 = response.getRegister(0)
    reg_2 = response.getRegister(1)
    # Shift reg 2 by 16 bits
    reg_2s = reg_2 << 16
    # OR with the reg_1
    total = reg_2s | reg_1
    return total
start_address = 0x1112
reg_count = 2
client = ModbusTcpClient(<IP_ADDRESS>)
response = client.read_input_registers(start_address,reg_count)
# Little ('<f' states the byte order explicitly; a bare 'f' uses the machine's native order)
little = little_endian(response)
lit_byte = little.to_bytes(4,byteorder='little')
print(struct.unpack('<f',lit_byte))
# Big ('>f' must match the big-endian bytes produced by to_bytes)
big = big_endian(response)
big_byte = big.to_bytes(4,byteorder='big')
print(struct.unpack('>f',big_byte))
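The register-to-float step can also be tried without a Modbus connection. The sketch below packs two 16-bit register values into four bytes and reinterprets them as an IEEE 754 single-precision float. The function name and the sample register values are illustrative and do not come from the meter.

```python
import struct

def registers_to_float(high_word, low_word):
    """Combine two 16-bit registers into a 32-bit IEEE 754 float."""
    # '>HH' packs two unsigned 16-bit values, most significant word first.
    raw = struct.pack('>HH', high_word, low_word)
    # '>f' reads the same four bytes back as a big-endian float.
    return struct.unpack('>f', raw)[0]

# 0x3F800000 is the bit pattern of 1.0, 0xC0000000 of -2.0.
print(registers_to_float(0x3F80, 0x0000))
print(registers_to_float(0xC000, 0x0000))
```

For a device that sends the low word first, swap the arguments: registers_to_float(second_register, first_register).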

Math operation with 16b hex in python

I have a starting hex string: "00000000FFFFFFFF000000000AF50AF5", on which I want to perform some operations.
The user enters an int value (20 for example).
The program computes input*100 (=2000).
Convert it to little-endian hex (=D0070000).
Replace the first 4 bytes (00000000) with these new 4 bytes: (=D0070000FFFFFFFF000000000AF50AF5)
Up to here everything works! The problems begin now.
Write the same value (=D0070000) into the third group of 4 bytes (00000000): (=D0070000FFFFFFFFD00700000AF50AF5)
And finally subtract the same value (=D0070000) from the second group of 4 bytes (FFFFFFFF): (=2FF8FFFF)
Final hex : "D00700002FF8FFFFD00700000AF50AF5"
I don't understand how to tell my program which group of 4 bytes (1, 2, 3 or 4) to replace.
user_int_value=int(input("enter num: "))*100 #user input*100
start_hex=bytes.fromhex("00000000FFFFFFFF000000000AF50AF5") #Starting hex
num_tot=hex(int.from_bytes(user_int_value.to_bytes(16, 'little'), 'big')) #convert user input to hex in little endian
sum = hex(int('0xFFFFFFFF', 16) - int(num_tot, 16)) #substract same hex to "0xFFFFFFFF"
EDIT
More simply i want to combine 4bytes :
data = ["0xD0070000", "0x2FF8FFFF", "0xD0070000", "0x0AF50AF5"]
final result I want "0xD00700002FF8FFFFD00700000AF50AF5"
Try this:
data = ["0xD0070000", "0x2FF8FFFF", "0xD0070000", "0x0AF50AF5"]
output = hex(int(data[0], 16) << 96 | int(data[1], 16) << 64 | int(data[2], 16) << 32 | int(data[3], 16) << 0)
output should become 0xd00700002ff8ffffd00700000af50af5
In some cases you won't get the output you expect because leading zeros will be chopped off, in that case you can fill the zeros manually doing:
print(f"0x{output[2:].zfill(32)}") # Uses f-string (requires newer python versions)
or
print("0x{}".format(output[2:].zfill(32))) # uses the old python format's string method
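For the original task (before the edit), one possible sketch treats the 16 bytes as four little-endian 32-bit groups and rebuilds the message with struct. The function name build_message is hypothetical, and the subtraction is assumed to be done on the little-endian value of the second group, which is what the expected result in the question implies.

```python
import struct

def build_message(user_value, template_hex="00000000FFFFFFFF000000000AF50AF5"):
    """Return the 16-byte message as an upper-case hex string."""
    template = bytes.fromhex(template_hex)
    scaled = user_value * 100
    # Group 2: read the existing bytes as a little-endian int and subtract.
    second = int.from_bytes(template[4:8], "little") - scaled
    # Groups 1 and 3 take the scaled value; group 4 is copied unchanged.
    # struct.pack raises struct.error if a value does not fit in 32 bits.
    message = struct.pack("<III", scaled, second, scaled) + template[12:16]
    return message.hex().upper()

print(build_message(20))
```

Slicing the bytes object (template[4:8], template[12:16]) is how a group is addressed by position: group k covers bytes 4*(k-1) up to 4*k.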

Packing an integer number to 3 bytes in Python

With background knowledge of C I want to serialize an integer number to 3 bytes. I searched a lot and found out I should use struct packing. I want something like this:
number = 1195855
buffer = struct.pack("format_string", number)
Now I expect buffer to be something like ['\x12' '\x3F' '\x4F']. Is it also possible to set endianness?
It is possible, using either > or < in your format string:
import struct
number = 1195855
def print_buffer(buffer):
    print(buffer.hex()) # Python 3
    #print(''.join(["%02x" % ord(b) for b in buffer])) # Python 2
# Little Endian
buffer = struct.pack("<L", number)
print_buffer(buffer) # 4f3f1200
# Big Endian
buffer = struct.pack(">L", number)
print_buffer(buffer) # 00123f4f
2.x docs
3.x docs
Note, however, that you're going to have to figure out how you want to get rid of the empty byte in the buffer, since L will give you 4 bytes and you only want 3.
Something like:
buffer = struct.pack("<L", number)
print_buffer(buffer[:3]) # 4f3f12
# Big Endian
buffer = struct.pack(">L", number)
print_buffer(buffer[-3:]) # 123f4f
would be one way.
Another way is to manually pack the bytes:
>>> import struct
>>> number = 1195855
>>> data = struct.pack('BBB',
... (number >> 16) & 0xff,
... (number >> 8) & 0xff,
... number & 0xff,
... )
>>> data
b'\x12?O'
>>> list(data)
[18, 63, 79]
As just the 3-bytes, it's a bit redundant since the last 3 parameters of struct.pack equals the data. But this worked well in my case because I had header and footer bytes surrounding the unsigned 24-bit integer.
Whether this method, or slicing is more elegant is up to your application. I found this was cleaner for my project.
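On Python 3 the same result is available without struct through int.to_bytes, which takes the byte count and the byte order directly. A minimal sketch, with a hypothetical helper name:

```python
def pack_uint24(number, byteorder="big"):
    """Serialize an unsigned integer into exactly 3 bytes."""
    # Raises OverflowError if the number does not fit in 24 bits.
    return number.to_bytes(3, byteorder)

number = 1195855  # 0x123F4F
print(pack_uint24(number).hex())
print(pack_uint24(number, "little").hex())
print(int.from_bytes(pack_uint24(number), "big"))
```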

How to convert an integer to the shortest url-safe string in Python?

I want the shortest possible way of representing an integer in a URL. For example, 11234 can be shortened to '2be2' using hexadecimal. Since base64 is a 64-character encoding, it should be possible to represent an integer in base64 using even fewer characters than hexadecimal. The problem is I can't figure out the cleanest way to convert an integer to base64 (and back again) using Python.
The base64 module has methods for dealing with bytestrings - so maybe one solution would be to convert an integer to its binary representation as a Python string... but I'm not sure how to do that either.
This answer is similar in spirit to Douglas Leeder's, with the following changes:
It doesn't use actual Base64, so there's no padding characters
Instead of converting the number first to a byte-string (base 256), it converts it directly to base 64, which has the advantage of letting you represent negative numbers using a sign character.
import string
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + \
string.digits + '-_'
ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(ALPHABET))
BASE = len(ALPHABET)
SIGN_CHARACTER = '$'
def num_encode(n):
    if n < 0:
        return SIGN_CHARACTER + num_encode(-n)
    s = []
    while True:
        n, r = divmod(n, BASE)
        s.append(ALPHABET[r])
        if n == 0: break
    return ''.join(reversed(s))
def num_decode(s):
    if s[0] == SIGN_CHARACTER:
        return -num_decode(s[1:])
    n = 0
    for c in s:
        n = n * BASE + ALPHABET_REVERSE[c]
    return n
>>> num_encode(0)
'A'
>>> num_encode(64)
'BA'
>>> num_encode(-(64**5-1))
'$_____'
A few side notes:
You could (marginally) increase the human-readibility of the base-64 numbers by putting string.digits first in the alphabet (and making the sign character '-'); I chose the order that I did based on Python's urlsafe_b64encode.
If you're encoding a lot of negative numbers, you could increase the efficiency by using a sign bit or one's/two's complement instead of a sign character.
You should be able to easily adapt this code to different bases by changing the alphabet, either to restrict it to only alphanumeric characters or to add additional "URL-safe" characters.
I would recommend against using a representation other than base 10 in URIs in most cases—it adds complexity and makes debugging harder without significant savings compared to the overhead of HTTP—unless you're going for something TinyURL-esque.
All the answers given regarding Base64 are very reasonable solutions. But they're technically incorrect. To convert an integer to the shortest URL safe string possible, what you want is base 66 (there are 66 URL safe characters).
That code looks something like this:
from io import StringIO
import urllib
BASE66_ALPHABET = u"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"
BASE = len(BASE66_ALPHABET)
def hexahexacontadecimal_encode_int(n):
    if n == 0:
        return BASE66_ALPHABET[0].encode('ascii')
    r = StringIO()
    while n:
        n, t = divmod(n, BASE)
        r.write(BASE66_ALPHABET[t])
    return r.getvalue().encode('ascii')[::-1]
Here's a complete implementation of a scheme like this, ready to go as a pip installable package:
https://github.com/aljungberg/hhc
You probably do not want real base64 encoding for this - it will add padding etc, potentially even resulting in larger strings than hex would for small numbers. If there's no need to interoperate with anything else, just use your own encoding. E.g. here's a function that will encode to any base (note the digits are actually stored least-significant first to avoid extra reverse() calls):
def make_encoder(baseString):
    size = len(baseString)
    d = dict((ch, i) for (i, ch) in enumerate(baseString)) # Map from char -> value
    if len(d) != size:
        raise Exception("Duplicate characters in encoding string")
    def encode(x):
        if x==0: return baseString[0] # Only needed if don't want '' for 0
        l=[]
        while x>0:
            l.append(baseString[x % size])
            x //= size
        return ''.join(l)
    def decode(s):
        return sum(d[ch] * size**i for (i,ch) in enumerate(s))
    return encode, decode
# Base 64 version:
encode,decode = make_encoder("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
assert decode(encode(435346456456)) == 435346456456
This has the advantage that you can use whatever base you want, just by adding appropriate
characters to the encoder's base string.
Note that the gains for larger bases are not going to be that big however. base 64 will only reduce the size to 2/3rds of base 16 (6 bits/char instead of 4). Each doubling only adds one more bit per character. Unless you've a real need to compact things, just using hex will probably be the simplest and fastest option.
To encode n (Python 2, where str is a byte string):
import base64
data = ''
while n > 0:
    data = chr(n & 255) + data
    n = n >> 8
encoded = base64.urlsafe_b64encode(data).rstrip('=')
To decode s:
data = base64.urlsafe_b64decode(s + '===')
decoded = 0
while len(data) > 0:
    decoded = (decoded << 8) | ord(data[0])
    data = data[1:]
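On Python 3 the same idea can be sketched with int.to_bytes and int.from_bytes, which replace the manual byte loops. The function names are hypothetical, and the sketch assumes non-negative integers only.

```python
import base64

def encode_int(n):
    """Encode a non-negative int as unpadded URL-safe base64 text."""
    # At least one byte, so that 0 still produces output.
    length = max(1, (n.bit_length() + 7) // 8)
    raw = n.to_bytes(length, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def decode_int(s):
    """Inverse of encode_int."""
    # Restore the padding that encode_int stripped.
    padded = s + "=" * (-len(s) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

print(encode_int(11234))
print(decode_int(encode_int(11234)))
```

Because the number passes through whole bytes first, this is slightly less compact than a direct base-64 digit conversion: up to 7 bits of the first byte may be unused.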
In the same spirit as the other answers aiming for some “optimal” encoding, you can use 73 characters according to RFC 1738 (actually 74 if you count “+” as usable):
alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_`\"!$'()*,-."
encoded = ''
while n > 0:
    n, r = divmod(n, len(alphabet))
    encoded = alphabet[r] + encoded
and the decoding:
decoded = 0
while len(s) > 0:
    decoded = decoded * len(alphabet) + alphabet.find(s[0])
    s = s[1:]
The easy bit is converting the byte string to web-safe base64:
import base64
output = base64.urlsafe_b64encode(s)
The tricky bit is the first step - convert the integer to a byte string.
If your integers are small you're better off hex encoding them - see saua
Otherwise (hacky recursive version):
def convertIntToByteString(i):
    if i == 0:
        return ""
    else:
        return convertIntToByteString(i >> 8) + chr(i & 255)
You don't want base64 encoding, you want to represent a base 10 numeral in numeral base X.
If you want your base 10 numeral represented in the 26 letters available you could use: http://en.wikipedia.org/wiki/Hexavigesimal.
(You can extend that example for a much larger base by using all the legal url characters)
You should at least be able to get base 38 (26 letters, 10 numbers, +, _)
Base64 takes 4 bytes/characters to encode 3 bytes and can only encode multiples of 3 bytes (and adds padding otherwise).
So representing 4 bytes (your average int) in Base64 would take 8 bytes. Encoding the same 4 bytes in hex would also take 8 bytes. So you wouldn't gain anything for a single int.
a little hacky, but it works:
def b64num(num_to_encode):
    h = hex(num_to_encode)[2:] # hex(n) returns 0xhh, strip off the 0x
    h = len(h) & 1 and '0'+h or h # if odd number of digits, prepend '0' which hex codec requires
    return h.decode('hex').encode('base64')
you could replace the call to .encode('base64') with something in the base64 module, such as urlsafe_b64encode()
If you are looking for a way to shorten the integer representation using base64, I think you need to look elsewhere. When you encode something with base64 it doesn't get shorter, in fact it gets longer.
E.g. 11234 encoded with base64 would yield MTEyMzQ=
When using base64 you have overlooked the fact that you are not converting just the digits (0-9) to a 64 character encoding. You are converting 3 bytes into 4 bytes so you are guaranteed your base64 encoded string would be 33.33% longer.
I maintain a little library named zbase62: http://pypi.python.org/pypi/zbase62
With it you can convert from a Python 2 str object to a base-62 encoded string and vice versa:
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> d = os.urandom(32)
>>> d
'C$\x8f\xf9\x92NV\x97\x13H\xc7F\x0c\x0f\x8d9}\xf5.u\xeeOr\xc2V\x92f\x1b=:\xc3\xbc'
>>> from zbase62 import zbase62
>>> encoded = zbase62.b2a(d)
>>> encoded
'Fv8kTvGhIrJvqQ2oTojUGlaVIxFE1b6BCLpH8JfYNRs'
>>> zbase62.a2b(encoded)
'C$\x8f\xf9\x92NV\x97\x13H\xc7F\x0c\x0f\x8d9}\xf5.u\xeeOr\xc2V\x92f\x1b=:\xc3\xbc'
However, you still need to convert from integer to str. This comes built-in to Python 3:
Python 3.2 (r32:88445, Mar 25 2011, 19:56:22)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> d = os.urandom(32)
>>> d
b'\xe4\x0b\x94|\xb6o\x08\xe9oR\x1f\xaa\xa8\xe8qS3\x86\x82\t\x15\xf2"\x1dL%?\xda\xcc3\xe3\xba'
>>> int.from_bytes(d, 'big')
103147789615402524662804907510279354159900773934860106838120923694590497907642
>>> x= _
>>> x.to_bytes(32, 'big')
b'\xe4\x0b\x94|\xb6o\x08\xe9oR\x1f\xaa\xa8\xe8qS3\x86\x82\t\x15\xf2"\x1dL%?\xda\xcc3\xe3\xba'
To convert from int to bytes and vice versa in Python 2, there is not a convenient, standard way as far as I know. I guess maybe I should copy some implementation, such as this one: https://github.com/warner/foolscap/blob/46e3a041167950fa93e48f65dcf106a576ed110e/foolscap/banana.py#L41 into zbase62 for your convenience.
I needed a signed integer, so I ended up going with:
import struct, base64
def b64encode_integer(i):
    return base64.urlsafe_b64encode(struct.pack('i', i)).rstrip('=\n')
Example:
>>> b64encode_integer(1)
'AQAAAA'
>>> b64encode_integer(-1)
'_____w'
>>> b64encode_integer(256)
'AAEAAA'
I'm working on making a pip package for this.
I recommend you use my bases.py https://github.com/kamijoutouma/bases.py which was inspired by bases.js
from bases import Bases
bases = Bases()
bases.toBase16(200) # => 'c8'
bases.toBase(200, 16) # => 'c8'
bases.toBase62(99999) # => 'q0T'
bases.toBase(99999, 62) # => 'q0T'
bases.toAlphabet(300, 'aAbBcC') # => 'Abba'
bases.fromBase16('c8') # => 200
bases.fromBase('c8', 16) # => 200
bases.fromBase62('q0T') # => 99999
bases.fromBase('q0T', 62) # => 99999
bases.fromAlphabet('Abba', 'aAbBcC') # => 300
refer to https://github.com/kamijoutouma/bases.py#known-basesalphabets
for what bases are usable
For your case
I recommend you use either base 32, 58 or 64
Base-64 warning: besides there being several different standards, padding isn't currently added and line lengths aren't tracked. Not recommended for use with APIs that expect formal base-64 strings!
Same goes for base 66 which is currently not supported by both bases.js and bases.py but it might in the future
I'd go the 'encode integer as binary string, then base64 encode that' method you suggest, and I'd do it using struct:
>>> import struct, base64
>>> base64.b64encode(struct.pack('l', 47))
'LwAAAA=='
>>> struct.unpack('l', base64.b64decode(_))
(47,)
Edit again:
To strip out the extra 0s on numbers that are too small to need full 32-bit precision, try this:
def pad(str, l=4):
    while len(str) < l:
        str = '\x00' + str
    return str
>>> base64.b64encode(struct.pack('!l', 47).replace('\x00', ''))
'Lw=='
>>> struct.unpack('!l', pad(base64.b64decode('Lw==')))
(47,)
Pure Python, no dependencies, no encoding of byte strings etc., just turning a base 10 int into a base 64 int with the correct RFC 4648 characters:
def tetrasexagesimal(number):
    out = ""
    while number >= 0:
        if number == 0:
            out = 'A' + out
            break
        digit = number % 64
        out = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[digit] + out
        number //= 64 # integer division; plain /= only works on Python 2 (thanks spanishgum!)
        if number == 0:
            break
    return out
tetrasexagesimal(1)
As it was mentioned here in comments you can encode a data using 73 characters that are not escaped in URL.
I found two places were this Base73 URL encoding is used:
https://git.nolog.cz/NoLog.cz/f.bain/src/branch/master/static/script.js JS based URL shortener
https://gist.github.com/LoneFry/3792021 in PHP
But in fact you may use more characters, such as /, [, ], :, ; and some others. Those characters are escaped only by encodeURIComponent, i.e. when you need to pass the data via a GET parameter.
So in fact you can use up to 82 characters. The full alphabet is !$&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz~. I sorted all the symbols by their character code, so Base82URL numbers of equal length keep the same order when sorted as plain strings.
I tested this in Chrome and Firefox and it works fine, though such ids may be confusing for regular users. I used them for internal API calls where nobody sees them.
An unsigned 32-bit integer has a maximum value of 2^32 - 1 = 4294967295.
Encoded in Base82 it takes 6 characters: $0~]mw.
I don't have a code in Python but here is a JS code that generates a random id (int32 unsigned) and encodes it into the Base82URL:
/**
 * Convert uint32 number to Base82 url safe
 * @param {int} number
 * @returns {string}
 */
function toBase82Url(number) {
    // all chars that are not escaped in url
    let keys = "!$&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz~"
    let radix = keys.length
    let encoded = []
    do {
        let index = number % radix
        encoded.unshift(keys.charAt(index))
        number = Math.trunc(number / radix)
    } while (number !== 0)
    return encoded.join("")
}
function generateToken() {
    let buf = new Uint32Array(1);
    window.crypto.getRandomValues(buf)
    var randomInt = buf[0]
    return toBase82Url(randomInt)
}
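Since the question asks for Python, here is a rough port of the JS above. It is a sketch under two assumptions: the character between '=' and 'A' in the alphabet is '@' (the position that keeps the alphabet sorted by character code), and secrets.randbits stands in for window.crypto.getRandomValues. The Python names are made up for this example.

```python
import secrets

# Assumption: '@' sits between '=' and 'A', keeping the alphabet in code order.
BASE82_ALPHABET = ("!$&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   "[]_abcdefghijklmnopqrstuvwxyz~")

def to_base82_url(number):
    """Encode a non-negative int with the 82-character alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    radix = len(BASE82_ALPHABET)
    encoded = []
    while True:
        number, index = divmod(number, radix)
        encoded.append(BASE82_ALPHABET[index])
        if number == 0:
            break
    # Digits were collected least significant first.
    return "".join(reversed(encoded))

def generate_token():
    """Encode a random unsigned 32-bit integer."""
    return to_base82_url(secrets.randbits(32))

print(to_base82_url(2**32 - 1))
print(generate_token())
```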
