hashlib: Unicode-objects must be encoded before hashing - python

I am running this hashlib code and it runs almost all the way:
def generate_hashes(peaks, fan_value=DEFAULT_FAN_VALUE):
    if PEAK_SORT:
        sorted(peaks,key=itemgetter(1))
    # bruteforce all peaks
    peaks=list(peaks)
    len_peaks=len(peaks)
    for i in range(len_peaks):
        for j in range(1, fan_value):
            if (i + j) < len(peaks):
                # take current & next peak frequency value
                freq1 = peaks[i][IDX_FREQ_I]
                freq2 = peaks[i + j][IDX_FREQ_I]
                # take current & next -peak time offset
                t1 = peaks[i][IDX_TIME_J]
                t2 = peaks[i + j][IDX_TIME_J]
                # get diff of time offsets
                t_delta = t2 - t1
                # check if delta is between min & max
                if t_delta >= MIN_HASH_TIME_DELTA and t_delta <= MAX_HASH_TIME_DELTA:
                    h = hashlib.sha1(("%s|%s|%s") % (str(freq1), str(freq2), str(t_delta)))
                    yield (h.hexdigest()[0:FINGERPRINT_REDUCTION], t1)
However, it returns this error:
h = hashlib.sha1(("%s|%s|%s") % (str(freq1), str(freq2), str(t_delta)))
TypeError: Unicode-objects must be encoded before hashing
I am honestly completely lost and don't know how to fix it. If you guys have any follow up questions regarding details about the code I will try my best to answer. Any feedback would be appreciated.

The answer is in the error message: use encode on your text string before hashing.
h = hashlib.sha1(("%s|%s|%s" % (str(freq1), str(freq2), str(t_delta))).encode('utf-8'))
The reason this is necessary is because hashlib.sha1() requires a bytes object due to the way it works internally. Normal Python strings (since version 3.0) are made of Unicode codepoints, which don't fit into a byte. They need an encoding which defines how the translation between codepoints and bytes occurs. UTF-8 is the most popular encoding, because it can handle every Unicode codepoint yet remain backwards compatible with older encodings like ASCII.
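As a minimal, self-contained sketch of the same idea (the helper name `hash_triplet` and the sample values are made up for illustration and are not part of the original code), the text is built first, encoded to bytes, and only then hashed:

```python
import hashlib

def hash_triplet(freq1, freq2, t_delta):
    """Join the three values with '|' and return the SHA-1 hex digest."""
    text = "%s|%s|%s" % (freq1, freq2, t_delta)   # a str (Unicode) in Python 3
    data = text.encode("utf-8")                    # bytes, which hashlib accepts
    return hashlib.sha1(data).hexdigest()

digest = hash_triplet(10, 20, 5)
print(digest)
```

Passing `text` directly to `hashlib.sha1()` raises the TypeError from the question on Python 3; passing `data` does not.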

Related

Converting an Integer value to base64, and then decoding it to get a plaintext

I am given this number 427021005928, which I am supposed to change into a base64 encoded string and then decode the base64 string to get a plain text.
This decimal value 427021005928, when converted to binary, gives 110001101101100011011110111010001101000, which corresponds to 'Y2xvdGg=', which is what I want. I got the conversion from (https://cryptii.com/pipes/binary-to-base64)
And then finally I decode 'Y2xvdGg=' to get the text cloth.
My problem is that I have no idea how to use Python to get from either the decimal or the binary value to 'Y2xvdGg='
Some help would be appreciated!
NOTE: I only have this value 427021005928 at the start. I need to get the base64 and plaintext answers.
One elegant way would be using [Python 3]: struct - Interpret bytes as packed binary data, but given the fact that Python numbers are not fixed size, some additional computation would be required (for example, the number is 5 bytes long).
Apparently, the online converter applied the base64 encoding to the number's memory representation, which can be obtained via [Python 3]: int.to_bytes(length, byteorder, *, signed=False) (endianness is important, and in this case it's big):
For the backwards process, reversed steps are required. There are 2 alternatives:
Things being done manually (this could also be applied to the "forward" process)
Using int.from_bytes
>>> import base64
>>>
>>> number = 427021005928
>>>
>>> number_bytes = number.to_bytes((number.bit_length() + 7) // 8, byteorder="big") # Here's where the magic happens
>>> number_bytes, number_bytes.decode()
(b'cloth', 'cloth')
>>>
>>> encoded = base64.b64encode(number_bytes)
>>> encoded, encoded.decode() # Don't let yourself be tricked by the resemblance between the variable and method names
(b'Y2xvdGg=', 'Y2xvdGg=')
>>>
>>> # Now, getting the number back
...
>>> decoded = base64.b64decode(encoded)
>>> decoded
b'cloth'
>>>
>>> final_number0 = sum((item * 256 ** idx for idx, item in enumerate(reversed(decoded))))
>>> final_number0
427021005928
>>> number == final_number0
True
>>>
>>> # OR using from_bytes
...
>>> final_number1 = int.from_bytes(decoded, byteorder="big")
>>> final_number1
427021005928
>>> final_number1 == number
True
For more details on bitwise operations, check [SO]: Output of crc32b in PHP is not equal to Python (@CristiFati's answer).
Try this (https://docs.python.org/3/library/stdtypes.html#int.to_bytes)
>>> import base64
>>> x=427021005928
>>> y=x.to_bytes(5,byteorder='big').decode('utf-8')
>>> base64.b64encode(y.encode()).decode()
'Y2xvdGg='
>>> y
'cloth'
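try
import base64
number = 427021005928
# bytes(number) would allocate `number` zero bytes, so convert with to_bytes instead
number_bytes = number.to_bytes((number.bit_length() + 7) // 8, byteorder="big")
encoded = base64.b64encode(number_bytes)
decoded = base64.b64decode(encoded)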
The function below converts an unsigned 64 bit integer into base64 representation, and back again. This is particularly helpful for encoding database keys.
We first encode the integer into a byte array using little endian, and automatically remove any extra leading zeros. Then convert to base64, removing the unnecessary = sign. Note the flag url_safe which makes the solution non-base64 compliant, but works better with URLs.
import base64
import struct

import six

def int_to_chars(number, url_safe = True):
    '''
    Convert an integer to base64. Used to turn IDs into short URL slugs.
    :param number:
    :param url_safe: base64 may contain "/" and "+", which do not play well
                     with URLs. Set to True to convert "/" to "-" and "+" to
                     ".". This no longer conforms to base64, but looks better
                     in URLs.
    :return:
    '''
    if number < 0:
        raise Exception("Cannot convert negative IDs.")
    # Encode the unsigned long long as little endian.
    packed = struct.pack("<Q", number)
    # Remove the high-order zero bytes (they come last in little endian).
    # Slicing keeps this a bytes-to-bytes comparison on Python 2 and 3.
    while len(packed) > 1 and packed[-1:] == b'\x00':
        packed = packed[:-1]
    encoded = base64.b64encode(packed).split(b"=")[0]
    if url_safe:
        encoded = encoded.replace(b"/", b"-").replace(b"+", b".")
    return encoded
def chars_to_int(chars):
    '''Reverse of the above function. Will work regardless of whether
    url_safe was set to True or False.'''
    # Make sure the data is in binary type.
    if isinstance(chars, six.string_types):
        chars = chars.encode('utf8')
    # Do the reverse of the url_safe conversion above.
    chars = chars.replace(b"-", b"/").replace(b".", b"+")
    # First decode the base64, adding only the "=" padding that is required.
    b64_pad_len = -len(chars) % 4
    decoded = base64.b64decode(chars + b"="*b64_pad_len)
    # Now decode little endian, restoring the stripped high-order zero bytes.
    int64_pad_len = 8 - len(decoded)
    return struct.unpack("<Q", decoded + b'\x00' * int64_pad_len)[0]
You can do the following conversions in Python.
First of all, import base64:
>>> import base64
To convert text to base64 and back, do the following:
encoding
>>> base64.b64encode("cloth".encode()).decode()
'Y2xvdGg='
decoding
>>> base64.b64decode("Y2xvdGg=".encode()).decode()
'cloth'

Python u-Law (MULAW) wave decompression to raw wave signal

I googled this issue for the last 2 weeks and wasn't able to find an algorithm or solution. I have a short .wav file, but it has MULAW compression and Python doesn't seem to have a function inside wave.py that can successfully decompress it. So I've taken it upon myself to build a decoder in Python.
I've found some info about MULAW in basic elements:
Wikipedia
A-law u-Law comparison
Some c-esc codec library
So I need some guidance, since I don't know how to approach getting from signed short integer to a full wave signal. This is my initial thought from what I've gathered so far:
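So from the wiki I've got an equation for u-law compression and decompression:
compression:
F(x) = \operatorname{sgn}(x) \, \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}, \quad -1 \le x \le 1
decompression:
F^{-1}(y) = \operatorname{sgn}(y) \, \frac{(1 + \mu)^{|y|} - 1}{\mu}, \quad -1 \le y \le 1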
So judging by compression equation, it looks like the output is limited to a float range of -1 to +1 , and with signed short integer from –32,768 to 32,767 so it looks like I would need to convert it from short int to float in specific range.
Now, to be honest, I've heard of quantisation before, but I am not sure if I should first try and dequantize and then decompress or in the other way, or even if in this case it is the same thing... the tutorials/documentation can be a bit of tricky with terminology.
The wave file I am working with is supposed to contain 'A' sound like for speech synthesis, I could probably verify success by comparing 2 waveforms in some audio software and custom wave analyzer but I would really like to diminish trial and error section of this process.
So what I've had in mind:
from struct import unpack
u = 0xff
data_chunk = b'\xe7\xe7' # -6169
data_to_r1 = unpack('h',data_chunk)[0]/0xffff # I suspect this is wrong,
# # but I don't know what else
# take the sign from the unpacked number; the raw bytes cannot be compared with 0
u_law = ( -1 if data_to_r1<0 else 1 )*( pow( 1+u, abs(data_to_r1)) -1 )/u
So is there some sort of algorithm or crucial steps I would need to take in form of first: decompression, second: quantisation : third ?
Since everything I find on google is how to read a .wav PCM-modulated file type, not how to manage it if wild compression arises.
So, after scouring Google, the solution was found on GitHub (go figure). I searched through many algorithms and found one whose error is within the bounds for lossy compression, which for u-law run from 30 -> 1 for positive values and from -32 -> -1 for negative values.
To be honest, I think this solution is adequate, though it does not follow the equation per se, and it is the best solution for now. This code is transcribed to Python directly from the gcc9108 audio codec.
def uLaw_d(i8bit):
    bias = 33
    sign = pos = 0
    decoded = 0
    i8bit = ~i8bit
    if i8bit&0x80:
        i8bit &= ~(1<<7)
        sign = -1
    pos = ( (i8bit&0xf0) >> 4 ) + 5
    decoded = ((1 << pos) | ((i8bit & 0x0F) << (pos - 4)) | (1 << (pos - 5))) - bias
    return decoded if sign else ~decoded
def uLaw_e(i16bit):
    MAX = 0x1fff
    BIAS = 33
    mask = 0x1000
    sign = lsb = 0
    pos = 12
    if i16bit < 0:
        i16bit = -i16bit
        sign = 0x80
    i16bit += BIAS
    if ( i16bit>MAX ): i16bit = MAX
    for x in reversed(range(pos)):
        if i16bit&mask != mask and pos>=5:
            pos = x
            break
    lsb = ( i16bit>>(pos-4) )&0xf
    return ( ~( sign | ( pos<<4 ) | lsb ) )
With test:
print( 'normal :\t{0}\t|\t{0:2X}\t:\t{0:016b}'.format(0xff) )
print( 'encoded:\t{0}\t|\t{0:2X}\t:\t{0:016b}'.format(uLaw_e(0xff)) )
print( 'decoded:\t{0}\t|\t{0:2X}\t:\t{0:016b}'.format(uLaw_d(uLaw_e(0xff))) )
and output:
normal : 255 | FF : 0000000011111111
encoded: -179 | -B3 : -000000010110011
decoded: 263 | 107 : 0000000100000111
And as you can see, 263-255 = 8, which is within bounds. When I tried to implement the seeemmmm method described in G.711, which user Oliver Charlesworth kindly suggested I look into, the decoded value for the maximum in the data was -8036, which is close to the maximum of the uLaw spec, but I couldn't reverse engineer the decoding function to get a binary equivalent of the function from Wikipedia.
Lastly, I must say that I am currently disappointed that the Python library doesn't support all kinds of compression algorithms, since it is not just a tool that people use; it is also a resource Python users learn from, because most of the material for a deeper dive into the code isn't readily available or understandable.
EDIT
After decoding the data and writing a wav file via wave.py I've succeeded in writing a new raw linear PCM file. This works... even though I was sceptical at first.
EDIT 2: you can find the real solution on compressions.py
I find this helpful for converting to/from ulaw with numpy arrays.
import audioop
import numpy as np
def numpy_audioop_helper(x, xdtype, func, width, ydtype):
    '''helper function for using audioop buffer conversion in numpy'''
    xi = np.asanyarray(x).astype(xdtype)
    if np.any(x != xi):
        xinfo = np.iinfo(xdtype)
        raise ValueError("input must be %s [%d..%d]" % (xdtype, xinfo.min, xinfo.max))
    y = np.frombuffer(func(xi.tobytes(), width), dtype=ydtype)
    return y.reshape(xi.shape)
def audioop_ulaw_compress(x):
    return numpy_audioop_helper(x, np.int16, audioop.lin2ulaw, 2, np.uint8)
def audioop_ulaw_expand(x):
    return numpy_audioop_helper(x, np.uint8, audioop.ulaw2lin, 2, np.int16)
Python actually supports decoding u-Law out of the box (note that the audioop module was deprecated in Python 3.11 and removed in Python 3.13):
audioop.ulaw2lin(fragment, width)
Convert sound fragments in u-LAW encoding to linearly encoded sound fragments. u-LAW encoding always uses 8 bits samples, so width
refers only to the sample width of the output fragment here.
https://docs.python.org/3/library/audioop.html#audioop.ulaw2lin
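For reference, the continuous μ-law formulas that the question quotes from Wikipedia can be sketched in a few lines. This is only an illustration of the equations on floats in the range -1 to 1, with μ = 255 assumed; the function names are made up, and this is not the 8-bit G.711 byte format stored in .wav files, which the table-based decoders above handle.

```python
import math

MU = 255  # assumed value of mu, matching u = 0xff in the question

def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to y in [-1, 1]: sgn(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse mapping: sgn(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

sample = -0.25
restored = mu_expand(mu_compress(sample))
print(sample, restored)
```

The round trip is exact up to floating-point error because no quantisation step is involved here; the loss in real μ-law audio comes from rounding the compressed value to 8 bits.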

Securely encrypt integers (up to 2^48) into the shortest possible URL-safe string

In my Django application I have hierarchical URL structure:
webpage.com/property/PK/sub-property/PK/ etc...
I do not want to expose primary keys and create a vulnerability. Therefore I am
encrypting all PKs into strings in all templates and URLs. This is done by the wonderful library django-encrypted-id written by this SO user.
However, the library supports up to 2^64 long integers and produces 24 characters output (22 + 2 padding). This results in huge URLs in my nested structure.
Therefore, I would like to patch the encrypting and decrypting functions and try to shorten the output. Here is the original code (+ padding handling which I added):
import base64
import binascii
import struct
from Crypto.Cipher import AES
from django.conf import settings
# Remove the padding after encode and add it on decode
PADDING = '=='
def encode(the_id):
    assert 0 <= the_id < 2 ** 64
    crc = binascii.crc32(bytes(the_id)) & 0xffffff
    message = struct.pack(b"<IQxxxx", crc, the_id)
    assert len(message) == 16
    cypher = AES.new(
        settings.SECRET_KEY[:24], AES.MODE_CBC,
        settings.SECRET_KEY[-16:]
    )
    return base64.urlsafe_b64encode(cypher.encrypt(message)).rstrip(PADDING)
def decode(e):
    if isinstance(e, basestring):
        e = bytes(e.encode("ascii"))
    try:
        e += str(PADDING)
        e = base64.urlsafe_b64decode(e)
    except (TypeError, AttributeError):
        raise ValueError("Failed to decrypt, invalid input.")
    for skey in getattr(settings, "SECRET_KEYS", [settings.SECRET_KEY]):
        cypher = AES.new(skey[:24], AES.MODE_CBC, skey[-16:])
        msg = cypher.decrypt(e)
        crc, the_id = struct.unpack("<IQxxxx", msg)
        if crc != binascii.crc32(bytes(the_id)) & 0xffffff:
            continue
        return the_id
    raise ValueError("Failed to decrypt, CRC never matched.")
# Lets test with big numbers
for x in range(100000000, 100000003):
    ekey = encode(x)
    pk = decode(ekey)
    print "Pk: %s Ekey: %s" % (pk, ekey)
Output (I changed the strings a bit, so don't try to hack me :P):
Pk: 100000000 Ekey: GNtOHji8rA42qfq3p5gNMI
Pk: 100000001 Ekey: tK6RcAZ2MrWmR3nB5qkQDe
Pk: 100000002 Ekey: a7VXIf8pEB6R7XvqwGQo6W
I have tried to modify everything in the encode() function but without any success. The produced string has always the length of 22.
Here is what I want:
Keep the encryption strength near to the original level or at least do not decrease it dramatically
Support integers up to 2^48 (~281 trillions), or 2^40, because as it is now with 2^64 is too much, I do not think that we will ever have such huge PKs in the database.
I will be happy with string length between 14-20. If its 20.. then yeah, its still 2 chars less..
Currently you are using CBC mode with a static IV, so the code you have isn't secure anyway and, like you say, produces rather large ciphertexts.
I would recommend swapping from CBC mode to CTR mode, which lets you have a variable length IV. The normal recommended length for the IV (or nonce) in CTR mode, I think, is 12, but you can adjust this up or down as needed. CTR mode also behaves like a stream cipher, which means what you put in is what you get out in terms of size. With AES, CBC mode will always return ciphertexts in blocks of 16 bytes, so even if you are encrypting 6 bytes you get 16 bytes out, which isn't ideal for you.
If you make your IV, say, 48 bits long and aim to encrypt no more than 48 bits, you'll be able to produce a raw output of 6 + 6 = 12 bytes, or with base64, (4*(12/3)) = 16 characters. You can get a shorter output than this by further reducing your IV and/or input size (2^40?). You can lower the range of possible input values as much as you want without damaging the security.
Keep in mind that CTR does have pitfalls. Producing two ciphertexts that share the same IV and key means that they can be trivially broken, so always randomly generate your IV (and don't reduce it in size too much).
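The size arithmetic above can be checked with a short sketch. It performs no encryption: random bytes stand in for the nonce plus ciphertext, and the helper name `token_length` is hypothetical. It only shows how many URL-safe base64 characters a given number of raw bytes produces once the padding is stripped.

```python
import base64
import os

def token_length(nonce_bytes, payload_bytes):
    """Length of the unpadded URL-safe base64 text for nonce + ciphertext."""
    raw = os.urandom(nonce_bytes + payload_bytes)  # stand-in for real output
    return len(base64.urlsafe_b64encode(raw).rstrip(b"="))

cbc_block = token_length(0, 16)    # one 16-byte AES block, as in the question
ctr_48_48 = token_length(6, 6)     # 48-bit nonce + 48-bit value
ctr_32_40 = token_length(4, 5)     # 32-bit nonce + 40-bit value
print(cbc_block, ctr_48_48, ctr_32_40)
```

A single 16-byte block gives the 22 characters seen in the question, while 12 raw bytes give 16 characters. Whether a nonce shorter than 48 bits is acceptable depends on how many IDs are ever encrypted under one key.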

Python 2.7.6 Optimizing code for packing big endian bytes into a string

import struct
varA = {name: {} for name in 'ZYXWV'}
varA['Z']['value'] = 8700
varA['Y']['value'] = 8800
varA['X']['value'] = 8900
varA['W']['value'] = 8800
varA['V']['value'] = 8700
varB = ""
varC = ""
for name in 'Z Y X W V'.split(' '):
    varB = varA[name]['value']
    varC += str(struct.pack('>h',varB))
print varC[:-1] + '\n'
What i need is a string of bytes,
where each number is a signed int16 big-endian byte(s).
this code here works for what im trying to do, but i know
theres a far more elegant solution.
I wouldn't spend any time on optimizing the varA as its
only there to set up the code and won't be used in my project.
Also the print is also there to set up the problem, im actually
sending the bytes as a socket.
Initially I had this in an array first few times, but when I converted the
array to a bytearray, I kept running into having 0x00 mixed in.
Same with struct, as you can see in my solution removing the
0x00 at the end.
Here's a simpler way to do it. It's unclear from the question whether this is exactly the result you desire.
values = [varA[name]['value'] for name in 'ZYXWV']
varC = struct.pack('>'+str(len(values))+'h', *values)
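A sketch of the same packing that also runs on Python 3 is below; the plain list of values replaces varA purely for illustration. It also shows that a 0x00 byte in the output can be legitimate data rather than something to strip: 8704 is 0x2200, so its low byte is zero.

```python
import struct

values = [8700, 8800, 8900, 8800, 8700]

# One format string for the whole list: '>' = big endian, 'h' = signed int16.
packed = struct.pack('>%dh' % len(values), *values)
print(packed)

# Unpacking with the same format gives the original numbers back.
unpacked = list(struct.unpack('>%dh' % len(values), packed))

# A zero byte can be part of a valid value.
with_zero = struct.pack('>h', 8704)
print(with_zero)
```

Each int16 always occupies exactly two bytes, so the receiver can split the stream by position and no byte needs to be removed.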

How to convert an integer to the shortest url-safe string in Python?

I want the shortest possible way of representing an integer in a URL. For example, 11234 can be shortened to '2be2' using hexadecimal. Since base64 is a 64-character encoding, it should be possible to represent an integer in base64 using even fewer characters than hexadecimal. The problem is I can't figure out the cleanest way to convert an integer to base64 (and back again) using Python.
The base64 module has methods for dealing with bytestrings - so maybe one solution would be to convert an integer to its binary representation as a Python string... but I'm not sure how to do that either.
This answer is similar in spirit to Douglas Leeder's, with the following changes:
It doesn't use actual Base64, so there's no padding characters
Instead of converting the number first to a byte-string (base 256), it converts it directly to base 64, which has the advantage of letting you represent negative numbers using a sign character.
import string
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + \
           string.digits + '-_'
ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(ALPHABET))
BASE = len(ALPHABET)
SIGN_CHARACTER = '$'
def num_encode(n):
    if n < 0:
        return SIGN_CHARACTER + num_encode(-n)
    s = []
    while True:
        n, r = divmod(n, BASE)
        s.append(ALPHABET[r])
        if n == 0: break
    return ''.join(reversed(s))
def num_decode(s):
    if s[0] == SIGN_CHARACTER:
        return -num_decode(s[1:])
    n = 0
    for c in s:
        n = n * BASE + ALPHABET_REVERSE[c]
    return n
>>> num_encode(0)
'A'
>>> num_encode(64)
'BA'
>>> num_encode(-(64**5-1))
'$_____'
A few side notes:
You could (marginally) increase the human-readability of the base-64 numbers by putting string.digits first in the alphabet (and making the sign character '-'); I chose the order that I did based on Python's urlsafe_b64encode.
If you're encoding a lot of negative numbers, you could increase the efficiency by using a sign bit or one's/two's complement instead of a sign character.
You should be able to easily adapt this code to different bases by changing the alphabet, either to restrict it to only alphanumeric characters or to add additional "URL-safe" characters.
I would recommend against using a representation other than base 10 in URIs in most cases—it adds complexity and makes debugging harder without significant savings compared to the overhead of HTTP—unless you're going for something TinyURL-esque.
All the answers given regarding Base64 are very reasonable solutions. But they're technically incorrect. To convert an integer to the shortest URL safe string possible, what you want is base 66 (there are 66 URL safe characters).
That code looks something like this:
from io import StringIO
import urllib
BASE66_ALPHABET = u"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"
BASE = len(BASE66_ALPHABET)
def hexahexacontadecimal_encode_int(n):
    if n == 0:
        return BASE66_ALPHABET[0].encode('ascii')
    r = StringIO()
    while n:
        n, t = divmod(n, BASE)
        r.write(BASE66_ALPHABET[t])
    return r.getvalue().encode('ascii')[::-1]
Here's a complete implementation of a scheme like this, ready to go as a pip installable package:
https://github.com/aljungberg/hhc
You probably do not want real base64 encoding for this - it will add padding etc, potentially even resulting in larger strings than hex would for small numbers. If there's no need to interoperate with anything else, just use your own encoding. E.g. here's a function that will encode to any base (note the digits are actually stored least-significant first to avoid extra reverse() calls):
def make_encoder(baseString):
    size = len(baseString)
    d = dict((ch, i) for (i, ch) in enumerate(baseString)) # Map from char -> value
    if len(d) != size:
        raise Exception("Duplicate characters in encoding string")
    def encode(x):
        if x==0: return baseString[0] # Only needed if don't want '' for 0
        l=[]
        while x>0:
            l.append(baseString[x % size])
            x //= size
        return ''.join(l)
    def decode(s):
        return sum(d[ch] * size**i for (i,ch) in enumerate(s))
    return encode, decode
# Base 64 version:
encode,decode = make_encoder("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
assert decode(encode(435346456456)) == 435346456456
This has the advantage that you can use whatever base you want, just by adding appropriate
characters to the encoder's base string.
Note that the gains for larger bases are not going to be that big however. base 64 will only reduce the size to 2/3rds of base 16 (6 bits/char instead of 4). Each doubling only adds one more bit per character. Unless you've a real need to compact things, just using hex will probably be the simplest and fastest option.
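To make the "one more bit per doubling" point concrete, here is a small sketch (the helper name `digits_needed` is made up) that counts how many characters a non-negative integer needs in a given base:

```python
def digits_needed(n, base):
    """Number of digits required to write the non-negative integer n in `base`."""
    count = 1
    while n >= base:
        n //= base
        count += 1
    return count

biggest_u64 = 2 ** 64 - 1
lengths = {base: digits_needed(biggest_u64, base) for base in (10, 16, 64)}
print(lengths)
```

For the largest 64-bit value this gives 20 decimal digits, 16 hex digits and 11 base-64 digits, so moving from hex to base 64 saves 5 characters at most.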
To encode n:
data = ''
while n > 0:
    data = chr(n & 255) + data
    n = n >> 8
encoded = base64.urlsafe_b64encode(data).rstrip('=')
To decode s:
data = base64.urlsafe_b64decode(s + '===')
decoded = 0
while len(data) > 0:
    decoded = (decoded << 8) | ord(data[0])
    data = data[1:]
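The snippet above relies on Python 2 byte strings. A rough Python 3 equivalent, using int.to_bytes and int.from_bytes as in other answers on this page, might look like the sketch below; the function names are made up and only non-negative integers are handled.

```python
import base64

def encode_int(n):
    """Non-negative int -> unpadded URL-safe base64 text (big-endian bytes)."""
    length = (n.bit_length() + 7) // 8 or 1       # at least one byte, so 0 works
    raw = n.to_bytes(length, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def decode_int(s):
    """Inverse of encode_int: restore the stripped padding, then read the bytes."""
    padded = s + "=" * (-len(s) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

token = encode_int(427021005928)
print(token, decode_int(token))
```

Here 427021005928 is the number from the "cloth" question earlier on this page, so the token is that question's base64 string without its trailing "=".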
In the same spirit as the other answers aiming for some “optimal” encoding, you can use 73 characters according to RFC 1738 (actually 74 if you count “+” as usable):
alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_`\"!$'()*,-."
encoded = ''
while n > 0:
    n, r = divmod(n, len(alphabet))
    encoded = alphabet[r] + encoded
and the decoding:
decoded = 0
while len(s) > 0:
    decoded = decoded * len(alphabet) + alphabet.find(s[0])
    s = s[1:]
The easy bit is converting the byte string to web-safe base64:
import base64
output = base64.urlsafe_b64encode(s)
The tricky bit is the first step - convert the integer to a byte string.
If your integers are small you're better off hex encoding them - see saua
Otherwise (hacky recursive version):
def convertIntToByteString(i):
    if i == 0:
        return ""
    else:
        return convertIntToByteString(i >> 8) + chr(i & 255)
You don't want base64 encoding, you want to represent a base 10 numeral in numeral base X.
If you want your base 10 numeral represented in the 26 letters available you could use: http://en.wikipedia.org/wiki/Hexavigesimal.
(You can extend that example for a much larger base by using all the legal url characters)
You should at least be able to get base 38 (26 letters, 10 numbers, +, _)
Base64 takes 4 bytes/characters to encode 3 bytes and can only encode multiples of 3 bytes (and adds padding otherwise).
So representing 4 bytes (your average int) in Base64 would take 8 bytes. Encoding the same 4 bytes in hex would also take 8 bytes. So you wouldn't gain anything for a single int.
a little hacky, but it works:
def b64num(num_to_encode):
    h = hex(num_to_encode)[2:] # hex(n) returns 0xhh, strip off the 0x
    h = len(h) & 1 and '0'+h or h # if odd number of digits, prepend '0' which hex codec requires
    return h.decode('hex').encode('base64')
you could replace the call to .encode('base64') with something in the base64 module, such as urlsafe_b64encode()
If you are looking for a way to shorten the integer representation using base64, I think you need to look elsewhere. When you encode something with base64 it doesn't get shorter, in fact it gets longer.
E.g. 11234 encoded with base64 would yield MTEyMzQ=
When using base64 you have overlooked the fact that you are not converting just the digits (0-9) to a 64 character encoding. You are converting 3 bytes into 4 bytes so you are guaranteed your base64 encoded string would be 33.33% longer.
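The difference between base64-encoding the decimal text of a number and base64-encoding the number's own bytes can be seen in a short sketch; the variable names are chosen for illustration only.

```python
import base64

n = 11234

# Encoding the five ASCII digits "11234": 5 bytes in, 8 characters out.
from_text = base64.b64encode(str(n).encode("ascii"))

# Encoding the integer's two raw bytes (0x2b, 0xe2): 2 bytes in, 4 characters out.
from_bytes = base64.b64encode(n.to_bytes(2, "big"))

print(from_text, from_bytes)
```

The first form is the longer result quoted above. The second is as long as the hex string '2be2' because of the padding, and only becomes shorter than hex for larger numbers or once the "=" is stripped.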
I maintain a little library named zbase62: http://pypi.python.org/pypi/zbase62
With it you can convert from a Python 2 str object to a base-62 encoded string and vice versa:
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> d = os.urandom(32)
>>> d
'C$\x8f\xf9\x92NV\x97\x13H\xc7F\x0c\x0f\x8d9}\xf5.u\xeeOr\xc2V\x92f\x1b=:\xc3\xbc'
>>> from zbase62 import zbase62
>>> encoded = zbase62.b2a(d)
>>> encoded
'Fv8kTvGhIrJvqQ2oTojUGlaVIxFE1b6BCLpH8JfYNRs'
>>> zbase62.a2b(encoded)
'C$\x8f\xf9\x92NV\x97\x13H\xc7F\x0c\x0f\x8d9}\xf5.u\xeeOr\xc2V\x92f\x1b=:\xc3\xbc'
However, you still need to convert from integer to str. This comes built-in to Python 3:
Python 3.2 (r32:88445, Mar 25 2011, 19:56:22)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> d = os.urandom(32)
>>> d
b'\xe4\x0b\x94|\xb6o\x08\xe9oR\x1f\xaa\xa8\xe8qS3\x86\x82\t\x15\xf2"\x1dL%?\xda\xcc3\xe3\xba'
>>> int.from_bytes(d, 'big')
103147789615402524662804907510279354159900773934860106838120923694590497907642
>>> x= _
>>> x.to_bytes(32, 'big')
b'\xe4\x0b\x94|\xb6o\x08\xe9oR\x1f\xaa\xa8\xe8qS3\x86\x82\t\x15\xf2"\x1dL%?\xda\xcc3\xe3\xba'
To convert from int to bytes and vice versa in Python 2, there is not a convenient, standard way as far as I know. I guess maybe I should copy some implementation, such as this one: https://github.com/warner/foolscap/blob/46e3a041167950fa93e48f65dcf106a576ed110e/foolscap/banana.py#L41 into zbase62 for your convenience.
I needed a signed integer, so I ended up going with:
import struct, base64
def b64encode_integer(i):
    return base64.urlsafe_b64encode(struct.pack('i', i)).rstrip('=\n')
Example:
>>> b64encode_integer(1)
'AQAAAA'
>>> b64encode_integer(-1)
'_____w'
>>> b64encode_integer(256)
'AAEAAA'
I'm working on making a pip package for this.
I recommend you use my bases.py https://github.com/kamijoutouma/bases.py which was inspired by bases.js
from bases import Bases
bases = Bases()
bases.toBase16(200) # => 'c8'
bases.toBase(200, 16) # => 'c8'
bases.toBase62(99999) # => 'q0T'
bases.toBase(99999, 62) # => 'q0T'
bases.toAlphabet(300, 'aAbBcC') # => 'Abba'
bases.fromBase16('c8') # => 200
bases.fromBase('c8', 16) # => 200
bases.fromBase62('q0T') # => 99999
bases.fromBase('q0T', 62) # => 99999
bases.fromAlphabet('Abba', 'aAbBcC') # => 300
refer to https://github.com/kamijoutouma/bases.py#known-basesalphabets
for what bases are usable
For your case
I recommend you use either base 32, 58 or 64
Base-64 warning: besides there being several different standards, padding isn't currently added and line lengths aren't tracked. Not recommended for use with APIs that expect formal base-64 strings!
The same goes for base 66, which is currently supported by neither bases.js nor bases.py, but might be in the future
I'd go the 'encode integer as binary string, then base64 encode that' method you suggest, and I'd do it using struct:
>>> import struct, base64
>>> base64.b64encode(struct.pack('l', 47))
'LwAAAA=='
>>> struct.unpack('l', base64.b64decode(_))
(47,)
Edit again:
To strip out the extra 0s on numbers that are too small to need full 32-bit precision, try this:
def pad(str, l=4):
    while len(str) < l:
        str = '\x00' + str
    return str
>>> base64.b64encode(struct.pack('!l', 47).replace('\x00', ''))
'Lw=='
>>> struct.unpack('!l', pad(base64.b64decode('Lw==')))
(47,)
Pure Python, no dependencies, no encoding of byte strings etc., just turning a base 10 int into a base 64 int with the correct RFC 4648 characters:
def tetrasexagesimal(number):
    out=""
    while number>=0:
        if number == 0:
            out = 'A' + out
            break
        digit = number % 64
        out = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[digit] + out
        number //= 64 # floor division works on both py2 and py3 (thank spanishgum!)
        if number == 0:
            break
    return out
tetrasexagesimal(1)
As was mentioned here in the comments, you can encode data using 73 characters that are not escaped in a URL.
I found two places where this Base73 URL encoding is used:
https://git.nolog.cz/NoLog.cz/f.bain/src/branch/master/static/script.js JS based URL shortener
https://gist.github.com/LoneFry/3792021 in PHP
But in fact you may use more characters, like /, [, ], :, ; and some others. Those characters are escaped only when you call encodeURIComponent, i.e. when you need to pass data via a GET parameter.
So in fact you can use up to 82 characters. The full alphabet is !$&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz~. I sorted all the symbols by their code, so when Base82URL numbers of the same length are sorted as plain strings they keep the same order.
I tested in Chrome and Firefox and they work fine, but such IDs may be confusing for regular users. I used them for internal API calls where nobody sees them.
An unsigned 32-bit integer has a maximum value of 2^32 - 1 = 4294967295
After encoding to Base82 it will take 6 chars: $0~]mx.
I don't have code in Python, but here is JS code that generates a random id (int32 unsigned) and encodes it into Base82URL:
/**
 * Convert uint32 number to Base82 url safe
 * @param {int} number
 * @returns {string}
 */
function toBase82Url(number) {
  // all chars that are not escaped in url
  let keys = "!$&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz~"
  let radix = keys.length
  let encoded = []
  do {
    let index = number % radix
    encoded.unshift(keys.charAt(index))
    number = Math.trunc(number / radix)
  } while (number !== 0)
  return encoded.join("")
}
function generateToken() {
  let buf = new Uint32Array(1);
  window.crypto.getRandomValues(buf)
  var randomInt = buf[0]
  return toBase82Url(randomInt)
}
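Since the answer has no Python version, here is a rough port as a sketch. It is not the author's code: the function names are invented, `secrets.randbits` stands in for `crypto.getRandomValues`, a decoder is added for symmetry, and the alphabet uses `@` in the position between `=` and `A`, which is an assumption based on the stated ordering by character code.

```python
import secrets

# 82 characters, in ascending code-point order (assumed alphabet, see note above).
BASE82_ALPHABET = (
    "!$&'()*+,-./0123456789:;=@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_"
    "abcdefghijklmnopqrstuvwxyz~"
)

def to_base82_url(number):
    """Encode a non-negative integer using the 82-character alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    radix = len(BASE82_ALPHABET)
    encoded = []
    while True:
        number, index = divmod(number, radix)
        encoded.append(BASE82_ALPHABET[index])
        if number == 0:
            break
    return "".join(reversed(encoded))     # most significant digit first

def from_base82_url(text):
    """Decode a string produced by to_base82_url back to an integer."""
    radix = len(BASE82_ALPHABET)
    number = 0
    for ch in text:
        number = number * radix + BASE82_ALPHABET.index(ch)
    return number

def generate_token():
    """Random unsigned 32-bit id, encoded with the alphabet above."""
    return to_base82_url(secrets.randbits(32))

print(generate_token())
```

Any unsigned 32-bit value needs at most 6 characters in this scheme, because 82^5 is smaller than 2^32 and 82^6 is larger.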
