Representing multiple values with one character in Python - python

I have 2 values that are in the range 0-31. I want to be able to represent both of these values in 1 character (for example in base 64 to explain what I mean by 1 character) but still be able to know what both of the values are and which came first.

Find a nice Unicode block that has 1024 contiguous codepoints, for example CJK Unified Ideographs, and map your 32*32 values onto them. In Python 3:
def char_encode(a, b):
    return chr(0x4E00 + a * 32 + b)

def char_decode(c):
    return divmod(ord(c) - 0x4E00, 32)

print(char_encode(17, 3))
# => 倣
print(char_decode('倣'))
# => (17, 3)
As you mention Base64... this is impossible. Each character in a Base64 encoding only allows for 6 bits of data, and you need 10 to represent your two numbers.
And also note that while this is only one character, it takes up two or three bytes, depending on the encoding you use. As noted by others, there is no way to stuff 10 bits of data into an 8-bit byte.
Explanation: a * 32 + b simply maps two numbers in range [0, 32) into a single number in range [0, 1024). For example, 0 * 32 + 0 = 0; 31 * 32 + 31 = 1023. chr finds the Unicode character with that codepoint, but characters with low codepoints like 0 are not printable, and would be a poor choice, so the result is shifted to the beginning of a nice big Unicode block: 0x4E00 is a hexadecimal representation of 19968, and is the codepoint of the first character in the CJK Unified Ideographs block. Using the example values, 17 * 32 + 3 = 547 and 19968 + 547 = 20515, or 0x5023 in hexadecimal, which is the codepoint of the character 倣. Thus, chr(20515) = "倣".
The char_decode function just does all of these operations in reverse: if a * p + b = x, then a, b = divmod(x, p) (see divmod). If c = chr(x), then x = ord(c) (see ord). And I am sure you know that if w + r = y, then r = y - w. So in the example, ord("倣") = 20515; 20515 - 0x4E00 = 547; and divmod(547, 32) is (17, 3).

Values [0, 31] can be stored in 5 bits, since 2**5 == 32. You can therefore unambiguously store two such values in 10 bits. Conversely, you will not be able to unambiguously retrieve two 5-bit values from fewer than 10 bits unless some other conditions hold true.
If you are using an encoding that allows 1024 or more distinct characters, you can map your pairs to characters. Otherwise you simply can't. So ASCII is not going to work here, and neither is Latin1. But pretty much any of the "normal" Unicode encodings are fine.
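Keep in mind that the encoded character will take up more than 10 bits. In UTF-8, any code point from U+0800 upward takes at least three bytes. If that's a concern, consider UTF-16, which uses two bytes for every character in the Basic Multilingual Plane.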
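The 10-bit argument can be checked directly. The following is a minimal sketch, not taken from either answer: it packs the two values with shifts instead of multiplication and reports how many bytes the resulting character needs in a few common encodings. The names BASE, pack_pair and unpack_pair are made up for illustration; BASE reuses the 0x4E00 block start from the first answer.

```python
BASE = 0x4E00  # start of the CJK Unified Ideographs block, as in the first answer

def pack_pair(a, b):
    """Pack two values in range(32) into a single 10-bit integer."""
    if not (0 <= a < 32 and 0 <= b < 32):
        raise ValueError("both values must be in range(32)")
    return (a << 5) | b  # a occupies the high 5 bits, b the low 5 bits

def unpack_pair(n):
    """Recover (a, b) from an integer produced by pack_pair."""
    return n >> 5, n & 0b11111

ch = chr(BASE + pack_pair(17, 3))
print(ch, unpack_pair(ord(ch) - BASE))  # 倣 (17, 3)

# One character, but never one byte:
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(ch.encode(enc)), "bytes")  # 3, 2 and 4 bytes respectively
```

Shifting by 5 is the same operation as multiplying by 32, so this produces exactly the same characters as char_encode.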

Related

Python hex digest to integer

Many online gambling games use a function that converts a hash into a decimal from 0-(usually 2^52).
Here's some code I grabbed that works fine, but I don't understand why it works:
import hashlib
import hmac

# E is not defined in the snippet as posted; it is presumably a constant such as 2**52.
def get_result(hash):
    hm = hmac.new(str.encode(hash), b'', hashlib.sha256)  # hashing object
    h = hm.hexdigest()  # hex digest, 32 bytes 256 bit
    print(h)  # Something like 848ab848c6486d4f64
    c = int(h, 16)
    print(c)  # numbers only, 77 numbers long...?
    if (c % 33 == 0):
        return 1
    h = int(h[:13], 16)
    return (((100 * E - h) / (E - h)) // 1) / 100.0
The part of the code that I don't understand is the conversion from h to c. h is a hex digest, so it is base-16. The python documentation says that the int(a,b) function converts the string a into a base-b integer. Here's my question:
How can an integer number be base-16? Isn't the definition of decimal base-10 (0-9)? Where do the extra 6 come from?
As far as I'm aware, a single hex digit can be stored by 4 bits, or 1/2 a byte. So a hex string of 64 length will occupy 32 bytes. Does this mean that any base of this data will also be 32 bytes? (converting the hex string to base-n, n being anything)
What does the fact that the c variable is always 77 digits long mean?
How can an integer number be base-16? Where do the extra 6 come from?
This is known as the hexadecimal system. The six extra digits are the letters a to f, which stand for the values 10 to 15.
Isn't the definition of decimal base-10 (0-9)?
Integer and decimal are not synonyms. You can have an integer in base 2 instead of base 10.
As far as I'm aware, a single hex digit can be stored by 4 bits, or 1/2 a byte. So a hex string of 64 length will occupy 32 bytes.
There are two different concepts: a hex string and a hex integer.
When you type in Python, for example, "8ff", you're creating a hex string of length 3. A string is an array of characters. A character is (under the hood) a 1-byte integer. Therefore, you're storing 3 bytes¹ (about your second statement, a hex string of length 64 will actually occupy 64 bytes).
Now, when you type in Python 0x8ff, you're creating a hex integer of 3 digits. If you print it, it'll show 2303, because of the conversion from base-16 (8ff, hex) to base-10 (2303, dec). A single integer stores 4 bytes², so you're storing 4 bytes.
Does this mean that any base of this data will also be 32 bytes? (converting the hex string to base-n, n being anything)
It depends, what type of data?
A string of length 3 will always occupy 3 bytes (let's ignore Unicode); it doesn't matter if it's "8ff" or "123".
A string of length 10 will always occupy 10 bytes; it doesn't matter if it's "85d8afff01" or "ef08c0e38e".
An integer will always occupy 4 bytes³; it doesn't matter if it's 10 or 1000000.
What does the fact that the c variable is always 77 digits long mean?
As @flakes noted, that's because 2^256 ~= 1.16e+77 in decimal, so a 256-bit value has at most 78 decimal digits and will usually have 77 or 78.
¹ Actually a string of length 3 stores 4 bytes: three for its characters and one for the null terminator.
² Let's ignore that integers in Python are unbounded.
³ If it's no greater than 2,147,483,647 (signed) or 4,294,967,295 (unsigned).
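The distinction between a hex string, the raw bytes, and the integer can be seen in a few lines. This is only an illustrative sketch: the input b"example input" is arbitrary, and plain hashlib.sha256 is used in place of the HMAC from the question to keep it short.

```python
import hashlib

digest = hashlib.sha256(b"example input").hexdigest()  # arbitrary input
raw = bytes.fromhex(digest)   # the same data as raw bytes
n = int(digest, 16)           # the same data as one integer

print(len(digest))  # 64 hex characters, one character per 4 bits
print(len(raw))     # 32 bytes, two hex digits per byte
print(n == int.from_bytes(raw, "big"))  # True: only the notation differs

# The base only matters when the number is written out as text.
print(hex(2303), int("8ff", 16))  # 0x8ff 2303

# Decimal length of the largest 256-bit value.
print(len(str(2**256 - 1)))  # 78
```

The integer itself has no base. int(h, 16) reads base-16 text, and print(c) writes the same number back out as base-10 text.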

How can I densely store large numbers in a file?

I need to store and handle huge amounts of very long numbers, which are in range from 0 to f 64 times (ffffffffff.....ffff).
If I store these numbers in a file, I need 1 byte for each character (digit) + 2 bytes for the \n symbol = up to 66 bytes. However, to represent all possible numbers we need no more than 34 bytes (4 bits represent digits from 0 to f, therefore 4 [bits] * 64 [number of hex digits] / 8 [bits in a byte] = 32 bytes + \n, of course).
Is there any way to store the number without consuming excess memory?
So far I have created a converter from hex (16 possible values per symbol) to base 76 (hex digits + all letters and some other symbols), which reduces the size of a number to 41 + 2 bytes.
You are trying to store numbers that are 32 bytes long. Why not just store them as binary numbers? That way you need to store only 32 bytes per number instead of 41 or whatever. You can add on all sorts of quasi-compression schemes to take advantage of things like most of your numbers being shorter than 32 bytes.
If your number is a string, convert it to an int first. Python3 ints are basically infinite precision, so you will not lose any information:
>>> num = '113AB87C877AAE3790'
>>> num = int(num, 16)
>>> num
317825918024297625488
Now you can convert the result to a byte array and write it to a file opened for binary writing:
with open('output.bin', 'wb') as file:
    file.write(num.to_bytes(32, byteorder='big'))
The int method to_bytes converts your number to a string of bytes that can be placed in a file. You need to specify the string length and the order. 'big' makes it easier to read a hex dump of the file.
To read the file back, decode it using int.from_bytes in a similar manner:
with open('output.bin', 'rb') as file:
    data = file.read(32)
    num = int.from_bytes(data, byteorder='big')
Remember to always include the b in the file mode, or you may run into unexpected problems if you try to read or write data with codes for \n in it.
Both the read and write operation can be looped as a matter of course.
If you anticipate storing an even distribution of numbers, then see Mad Physicist's answer. However, if you anticipate storing mostly small numbers but need to be able to store a few large numbers, then these schemes may also be useful.
If you only need to account for integers that are 255 or fewer bytes (2040 or fewer bits) in length, then simply convert the int to a bytes object and store the length in an additional byte, like this:
# This was only tested with non-negative integers!
def encode(num):
    assert isinstance(num, int)
    # Convert the number to a byte array and strip away leading null bytes.
    # You can also use byteorder="little" and rstrip.
    # If the integer does not fit into 255 bytes, an OverflowError will be raised.
    encoded = num.to_bytes(255, byteorder="big").lstrip(b'\0')
    # Return the length of the integer in the first byte, followed by the encoded integer.
    return bytes([len(encoded)]) + encoded

def encode_many(nums):
    return b''.join(encode(num) for num in nums)

def decode_many(byte_array):
    assert isinstance(byte_array, bytes)
    result = []
    start = 0
    while start < len(byte_array):
        # The first byte contains the length of the integer.
        int_length = byte_array[start]
        # Read int_length bytes and decode them as int.
        new_int = int.from_bytes(byte_array[(start+1):(start+int_length+1)], byteorder="big")
        # Add the new integer to the result list.
        result.append(new_int)
        start += int_length + 1
    return result
To store integers of (practically) infinite length, you can use this scheme, based on variable-length quantities in the MIDI file format. First, the rules:
A byte has eight bits (for those who don't know).
In each byte except the last, the left-most bit (the highest-order bit) will be 1.
The lower seven bits (i.e. all bits except the left-most bit) in each byte, when concatenated together, form an integer with a variable number of bits.
Here are a few examples:
0 in binary is 00000000. It can be represented in one byte without modification as 00000000.
127 in binary is 01111111. It can be represented in one byte without modification as 01111111.
128 in binary is 10000000. It must be converted to a two-byte representation: 10000001 00000000. Let's break that down:
The left-most bit in the first byte is 1, which means that it is not the last byte.
The left-most bit in the second byte is 0, which means that it is the last byte.
The lower seven bits in the first byte are 0000001, and the lower seven bits in the second byte are 0000000. Concatenate those together, and you get 00000010000000, which is 128.
173249806138790 in binary is 100111011001000111011101001001101111110110100110.
To store it:
First, split the binary number into groups of seven bits: 0100111 0110010 0011101 1101001 0011011 1111011 0100110 (a leading 0 was added)
Then, add a 1 in front of each byte except the last, which gets a 0: 10100111 10110010 10011101 11101001 10011011 11111011 00100110
To retrieve it:
First, drop the first bit of each byte: 0100111 0110010 0011101 1101001 0011011 1111011 0100110
You are left with an array of seven-bit segments. Join them together: 100111011001000111011101001001101111110110100110
When that is converted to decimal, you get 173,249,806,138,790.
Why, you ask, do we make the left-most bit in the last byte of each number a 0? Well, doing that allows you to concatenate multiple numbers together without using line breaks. When writing the numbers to a file, just write them one after another. When reading the numbers from a file, use a loop that builds an array of integers, ending each integer whenever it detects a byte where the left-most bit is 0.
Here are two functions, encode and decode, which convert between int and bytes in Python 3.
# Important! These methods only work with non-negative integers!
def encode(num):
    assert isinstance(num, int)
    # If the number is 0, then just return a single null byte.
    if num <= 0:
        return b'\0'
    # Otherwise...
    result_bytes_reversed = []
    while num > 0:
        # Find the right-most seven bits in the integer.
        current_seven_bit_segment = num & 0b1111111
        # Change the left-most bit to a 1.
        current_seven_bit_segment |= 0b10000000
        # Add that to the result array.
        result_bytes_reversed.append(current_seven_bit_segment)
        # Chop off the right-most seven bits.
        num = num >> 7
    # Change the left-most bit in the lowest-order byte (which is first in the list) back to a 0.
    result_bytes_reversed[0] &= 0b1111111
    # Un-reverse the order of the bytes and convert the list into a byte string.
    return bytes(reversed(result_bytes_reversed))

def decode(byte_array):
    assert isinstance(byte_array, bytes)
    result = 0
    for part in byte_array:
        # Shift the result over by seven bits.
        result = result << 7
        # Add in the right-most seven bits from this part.
        result |= (part & 0b1111111)
    return result
Here are two functions for working with lists of ints:
def encode_many(nums):
    # Join the encoded integers into one byte string so that decode_many can split it again.
    return b''.join(encode(num) for num in nums)

def decode_many(byte_array):
    parts = []
    # Split the byte array after each byte where the left-most bit is 0.
    start = 0
    for i, b in enumerate(byte_array):
        # Check whether the left-most bit in this byte is 0.
        if not (b & 0b10000000):
            # Copy everything up to here into a new part.
            parts.append(byte_array[start:(i+1)])
            start = i + 1
    return [decode(part) for part in parts]
The densest possible way without knowing more about the numbers would be 256 bits per number (32 bytes).
You can store them right after one another.
A function to write to a file might look like this:
def write_numbers(numbers, file):
    for n in numbers:
        file.write(n.to_bytes(32, 'big'))

with open('file_name', 'wb') as f:
    write_numbers(get_numbers(), f)
And to read the numbers, you can make a function like this:
def read_numbers(file):
    while True:
        read = file.read(32)
        if not read:
            break
        yield int.from_bytes(read, 'big')

with open('file_name', 'rb') as f:
    for n in read_numbers(f):
        do_stuff(n)
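To see the fixed-width scheme work end to end without touching the disk, here is a small round-trip sketch. It follows the same idea as the functions above, but the constant RECORD_SIZE, the truncation check, and the use of io.BytesIO as a stand-in for a real file are additions made for illustration.

```python
import io

RECORD_SIZE = 32  # bytes per number; enough for 64 hex digits

def write_numbers(numbers, file):
    """Write each number as a fixed-width big-endian record."""
    for n in numbers:
        file.write(n.to_bytes(RECORD_SIZE, "big"))

def read_numbers(file):
    """Yield numbers back from fixed-width records until the data runs out."""
    while True:
        chunk = file.read(RECORD_SIZE)
        if not chunk:
            break
        if len(chunk) != RECORD_SIZE:
            raise ValueError("truncated record")
        yield int.from_bytes(chunk, "big")

numbers = [0, 255, int("f" * 64, 16)]  # smallest, a small one, and the largest value
buffer = io.BytesIO()                  # in-memory stand-in for a file opened with 'wb'
write_numbers(numbers, buffer)
size = buffer.tell()

buffer.seek(0)
restored = list(read_numbers(buffer))
print(size, restored == numbers)  # 96 True
```

Because every record has the same width, no separator is needed and the n-th number starts at byte offset n * 32.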

STL binary file reader with Python

I'm trying to write my "personal" python version of STL binary file reader, according to WIKIPEDIA : A binary STL file contains :
an 80-character (byte) header, which is generally ignored.
a 4-byte unsigned integer indicating the number of triangular facets in the file.
Each triangle is described by twelve 32-bit floating-point numbers: three for the normal and then three for the X/Y/Z coordinate of each vertex – just as with the ASCII version of STL. After these follows a 2-byte ("short") unsigned integer that is the "attribute byte count" – in the standard format, this should be zero because most software does not understand anything else. --Floating-point numbers are represented as IEEE floating-point numbers and are assumed to be little-endian--
Here is my code :
#! /usr/bin/env python3
with open("stlbinaryfile.stl","rb") as fichier :
    head=fichier.read(80)
    nbtriangles=fichier.read(4)
    print(nbtriangles)
The output is :
b'\x90\x08\x00\x00'
It represents an unsigned integer, which I need to convert without using any package (struct, stl...). Are there any (basic) rules for doing this? I don't know what \x means. How does \x90 represent one byte?
Most of the answers on Google mention "C structs", but I don't know anything about C.
Thank you for your time.
Since you're using Python 3, you can use int.from_bytes. I'm guessing the value is stored little-endian, so you'd just do:
nbtriangles = int.from_bytes(fichier.read(4), 'little')
Change the second argument to 'big' if it's supposed to be big-endian.
Mind you, the normal way to parse a fixed width type is the struct module, but apparently you've ruled that out.
For the confusion over the repr, bytes objects will display ASCII printable characters (e.g. a) or standard ASCII escapes (e.g. \t) if the byte value corresponds to one of them. If it doesn't, it uses \x##, where ## is the hexadecimal representation of the byte value, so \x90 represents the byte with value 0x90, or 144. You need to combine the byte values at offsets to reconstruct the int, but int.from_bytes does this for you faster than any hand-rolled solution could.
Update: Since apparently int.from_bytes isn't "basic" enough, here are a couple of more complex solutions that use only top-level built-ins (not alternate constructors). For little-endian, you can do this:
def int_from_bytes(inbytes):
    res = 0
    for i, b in enumerate(inbytes):
        res |= b << (i * 8)  # Adjust each byte individually by 8 times position
    return res
You can use the same solution for big-endian by adding reversed to the loop, making it enumerate(reversed(inbytes)), or you can use this alternative solution that handles the offset adjustment a different way:
def int_from_bytes(inbytes):
    res = 0
    for b in inbytes:
        res <<= 8  # Adjust bytes seen so far to make room for new byte
        res |= b  # Mask in new byte
    return res
Again, this big-endian solution can trivially work for little-endian by looping over reversed(inbytes) instead of inbytes. In both cases inbytes[::-1] is an alternative to reversed(inbytes) (the former makes a new bytes in reversed order and iterates that, the latter iterates the existing bytes object in reverse, but unless it's a huge bytes object, enough to strain RAM if you copy it, the difference is pretty minimal).
The typical way to interpret an integer is to use struct.unpack, like so:
import struct

with open("stlbinaryfile.stl", "rb") as fichier:
    head = fichier.read(80)
    nbtriangles = fichier.read(4)
    print(nbtriangles)
    # struct.unpack always returns a tuple, so take its only element.
    nbtriangles = struct.unpack("<I", nbtriangles)[0]
    print(nbtriangles)
If you are allergic to import struct, then you can also compute it by hand:
def unsigned_int(s):
    result = 0
    for ch in s[::-1]:
        result *= 256
        result += ch
    return result
...
nbtriangles = unsigned_int(nbtriangles)
As to what you are seeing when you print b'\x90\x08\x00\x00': you are printing a bytes object, which is an array of integers in the range [0-255]. The first integer has the value 144 (decimal) or 90 (hexadecimal). When printing a bytes object, that value is represented by the string \x90. The second has the value eight, represented by \x08. The third and fourth integers are both zero. They are represented by \x00.
If you would like to see a more familiar representation of the integers, try:
print(list(nbtriangles))
[144, 8, 0, 0]
To compute the 32-bit integer represented by these four 8-bit integers, you can use this formula:
total = byte0 + (byte1*256) + (byte2*256*256) + (byte3*256*256*256)
Or, in hex:
total = byte0 + (byte1*0x100) + (byte2*0x10000) + (byte3*0x1000000)
Which results in:
0x00000890
Perhaps you can see the similarities to decimal, where the string "1234" represents the number:
4 + 3*10 + 2*100 + 1*1000
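As a quick cross-check, the positional formula can be compared against the built-in approaches on the four bytes from the question. This is just a sketch. The variable names are illustrative, and struct appears only for comparison, since the asker wanted to avoid it.

```python
import struct

raw = b"\x90\x08\x00\x00"  # the four bytes printed in the question

# Positional formula: each byte is a base-256 digit, least significant first.
by_formula = raw[0] + raw[1] * 0x100 + raw[2] * 0x10000 + raw[3] * 0x1000000
by_builtin = int.from_bytes(raw, "little")
(by_struct,) = struct.unpack("<I", raw)  # unpack returns a one-element tuple

print(by_formula, by_builtin, by_struct)  # 2192 2192 2192
print(hex(by_formula))                    # 0x890

# Reading the same bytes as big-endian gives a very different number.
print(int.from_bytes(raw, "big"))         # 2416443392
```

So the file in the question declares 2192 triangles, provided the count is little-endian as the STL description states.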

How to convert a hexadecimal color value to RGB in Python

How can I convert a hexadecimal color value (like #ff3ab4) to a three-tuple RGB value like (128, 255, 0)?
The reason why the leading zeros are being omitted is that the string gets converted to a number, and when you represent the number as a string again it won't show any leading zeros: the number has no fixed length, so it can't know that it originated from a string that did have one.
In your case I assume you would like to return the separate A, R, G and B sections as a nicely formatted string. If you want to do this manually, I would do the following:
def colorComponents(hexAsString):
    def uniformLength(string):
        return string[:2] + "".zfill(10-len(string)) + string[2:]
    re = []
    re.append(str(hex(int(hexAsString, 16) & int("0xFF000000", 16))))
    re.append(str(hex(int(hexAsString, 16) & int("0x00FF0000", 16))))
    re.append(str(hex(int(hexAsString, 16) & int("0x0000FF00", 16))))
    re.append(str(hex(int(hexAsString, 16) & int("0x000000FF", 16))))
    for i in range(len(re)):
        re[i] = uniformLength(re[i])
    return re
Note that this is not optimized performance-wise in any way. :)
You should use integer literals straight away (like using 0xFF000000 instead of int("0xFF000000", 16)). This way, though, you can clearly see the bitmasks I'm using to filter out the separate components as hexadecimal numbers.
I am fairly certain there are things like string formatters as well that would do the job of representing numbers as pretty strings, but I have never used them, so I can't tell you what a solution with those would look like :)
Edit:
In case it is unclear to you how the bit-masks work that I used, here's a (hopefully) simple example:
assume you have 2 bytes of data represented in their bit-form:
bytes0 = 0011 0101
bytes1 = 0000 1111
the & operator is a bit-wise AND operator.
bytesAND = bytes0 & bytes1
bytesAND now looks like this:
0000 0101
>>> bit32_to_bytes = lambda x: [255 & (x >> 8 * i) for i in (0,1,2,3)]
>>> bit32_to_bytes(0xFFEE2200)
[0, 34, 238, 255]
Some explanation:
>>> bin(0xFF00)
'0b1111111100000000' # this is string, not a number
>>> bin(0xFF00 >> 8) # shift right by 8 bits
'0b11111111' # leading zeros stripped
>>> bin(0xA0FF00)
'0b101000001111111100000000' # 24-bit number
>>> bin(0xA0FF00 >> 8) # shift by 1 byte (drop the lowest byte)
'0b000000001010000011111111' # leading zeros added by myself!
>>> bin(255 & (0xA0FF00 >> 8)) # shift right (by 1 byte) and mask (get last byte)
'0b000000000000000011111111' # leading zeros added by myself!
After masking y = 255 & x, y receives the value of the low byte in the number x.
y = 255 & (x >> 8 * n) -> y = n-th byte from x
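Applied to the original question, the shift-and-mask rule gives a short converter from a colour string to an RGB tuple. This is a sketch under the assumption that the input has exactly six hex digits with an optional leading #. The function name hex_to_rgb is made up for illustration.

```python
def hex_to_rgb(color):
    """Convert a string like '#ff3ab4' to an (r, g, b) tuple of ints."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected six hex digits, got %r" % color)
    value = int(digits, 16)
    # 255 & (value >> 8 * n) extracts the n-th byte, counting from the low end,
    # so red is byte 2, green is byte 1 and blue is byte 0.
    return tuple(255 & (value >> 8 * n) for n in (2, 1, 0))

print(hex_to_rgb("#ff3ab4"))  # (255, 58, 180)

# Going back to text: %02x pads each component to two digits,
# which keeps the leading zeros that plain hex() would drop.
print("#%02x%02x%02x" % hex_to_rgb("#00ff0a"))  # #00ff0a
```

The second print shows one way to keep leading zeros when formatting: the width is given explicitly, because the number itself does not remember it.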

Get the x Least Significant Bits from a String in Python

How can I get the x LSBs from a string (str) in Python?
Specifically, I have a 256-bit string consisting of 32 chars, each occupying 1 byte, from which I have to get a "char" string with the 50 least significant bits.
So here are the ingredients for an approach that works using strings (simple but not the most efficient variant):
ord(x) yields the number (i.e. essentially the bits) for a char (e.g. ord('A')=65). Note that this approach assumes each character fits in a single byte (no special signs such as € or similar).
bin(x)[2:] creates a string representing the number x in binary.
Thus, we can do (mystr holds your string):
l = [bin(ord(x))[2:].zfill(8) for x in mystr]  # every character as an 8-digit binary string
bits = ""
for x in l:  # concatenate all bits
    bits = bits + x
bits[-50:]  # retrieve the last 50 bits
Note that this approach is surely not the most efficient one due to the heavy string operations that could be replaced by plain integer operations (using bit-shifts and such). However, it is the simplest variant that came to my mind.
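The integer-based alternative mentioned above might look like the following sketch. It assumes the data is available as a bytes object. A str of one-byte characters could first be converted with mystr.encode("latin-1"). The name low_bits and the sample data are illustrative only.

```python
def low_bits(data, nbits):
    """Return the nbits least significant bits of a byte string, as an int."""
    value = int.from_bytes(data, "big")  # treat the whole string as one big number
    return value & ((1 << nbits) - 1)    # mask of nbits ones keeps only the low bits

data = bytes(range(32))  # stand-in for the 32 one-byte characters
lsb50 = low_bits(data, 50)

print(format(lsb50, "050b"))      # the 50 bits as text, zero-padded
packed = lsb50.to_bytes(7, "big")  # 50 bits need 7 bytes (56 bits)
print(packed)
```

The 7-byte result holds the last six bytes of the input unchanged, preceded by a byte that carries only the two lowest bits of the seventh-from-last byte.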
I think that a possible answer could be in this function:
Here mystr holds my string:
def get_lsbs_str(mystr):
    chrlist = list(mystr)
    result1 = [chr(ord(chrlist[-7]) & 3)]
    result2 = chrlist[-6:]
    return "".join(result1 + result2)
This function takes the 2 LSBs of the 7th-from-last char of mystr (these are the 2 MSBs of the 50 LSBs),
then takes the last 6 characters of mystr (these are the 48 LSBs of the 50 LSBs).
Please let me know if I am wrong.
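If it's only for display, wouldn't slicing help you? For a string of 64 characters:
yourString[14:64]
You can also use
yourString[-50:]
Note that slicing selects characters, not bits, so this only yields 50 bits if each character of the string is a '0' or a '1'.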
