How can I compress four floats into a string? - python

I would like to represent four floats, e.g. 123.545, 56.234, -4534.234, 544.64, using the set of characters [a..z, A..Z, 0..9] in the shortest way possible, so I can encode the four floats and store them in a filename. What is the most efficient way to do this?
I've looked at base64 encoding which doesn't actually compress the result. I also looked at a polyline encoding algorithm which uses characters like ) and { and I can't have that.

You could use the struct module to store them as binary 32-bit floats, and encode the result into base64. In Python 2:
>>> import struct, base64
>>> base64.urlsafe_b64encode(struct.pack("ffff", 123.545,56.234,-4534.234,544.64))
'Chf3Qp7vYELfsY3F9igIRA=='
The == padding can be removed for storage and re-added before decoding, so that the length of the base64 string is again a multiple of 4. You will also want to use URL-safe base64 (as above) to avoid the / character.
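Put together, a round trip that strips and restores the padding could look like this (the helper names are my own; note that 32-bit floats are lossy, so 123.545 round-trips as roughly 123.54499):

```python
import struct
import base64

def encode_floats(values):
    """Pack as 32-bit floats, base64-encode (URL-safe), and drop '=' padding."""
    packed = struct.pack("%df" % len(values), *values)
    return base64.urlsafe_b64encode(packed).rstrip(b"=").decode("ascii")

def decode_floats(s):
    """Restore padding to a multiple of 4 and unpack the 32-bit floats."""
    raw = base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))
    return struct.unpack("%df" % (len(raw) // 4), raw)

print(encode_floats([123.545, 56.234, -4534.234, 544.64]))
```

Four floats pack into 16 bytes, which gives a 22-character, filename-safe string.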

Related

How to convert this hex to its decoded output with Python?

I'm trying to convert the following (which seems to be hex) with Python to its decoded output:
I want to convert this:
To this:
How to do this?
This is the string:
0x00000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000000000000080000000000000000000000000000000000000000000000000000000006331b7e000000000000000000000000000000000000000000000000000000000000000c0000000000000000000000000000000000000000000000000000000000000000474657374000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007566963746f727900000000000000000000000000000000000000000000000000
First you need to convert the hex into a bytearray. Note that bytearray.fromhex() takes a string without the 0x prefix (and the name hex would shadow a built-in, so use another name):
hex_str = (
    "0000000000000000000000000000000000000000000000000000000000000004"
    "0000000000000000000000000000000000000000000000000000000000000080"
    "000000000000000000000000000000000000000000000000000000006331b7e0"
    "00000000000000000000000000000000000000000000000000000000000000c0"
    "0000000000000000000000000000000000000000000000000000000000000004"
    "7465737400000000000000000000000000000000000000000000000000000000"
    "0000000000000000000000000000000000000000000000000000000000000007"
    "566963746f727900000000000000000000000000000000000000000000000000"
)
b = bytearray.fromhex(hex_str)
Then you will need to determine the layout of the bytes. For example, a uint256 is probably 32 bytes, which is 64 hex digits:
a = b[:32]
print(int.from_bytes(a, "big"))
Here I assume the bytes are in big-endian. If they are instead little-endian, you can use "little" instead of "big". You will need to learn about so-called "endianness" to understand this better.
You can get the other uint256 in a similar way.
As for the strings, I don't know what their length is. You will have to research the format for Ethereum blockchain data. Once you determine the length, you can use a similar technique to get the bytes for each string and then decode it into characters.
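The blob does in fact look like Ethereum contract ABI encoding (an assumption on my part), in which a dynamic string is stored as a 32-byte offset in the head, pointing at a 32-byte big-endian length followed by the string bytes. A sketch under that assumption, with a helper name of my own:

```python
data = bytes.fromhex(
    "0000000000000000000000000000000000000000000000000000000000000004"
    "0000000000000000000000000000000000000000000000000000000000000080"
    "000000000000000000000000000000000000000000000000000000006331b7e0"
    "00000000000000000000000000000000000000000000000000000000000000c0"
    "0000000000000000000000000000000000000000000000000000000000000004"
    "7465737400000000000000000000000000000000000000000000000000000000"
    "0000000000000000000000000000000000000000000000000000000000000007"
    "566963746f727900000000000000000000000000000000000000000000000000"
)

def read_string(data, slot):
    # The head slot holds the byte offset of the string's tail section:
    # a 32-byte big-endian length followed by the string data itself.
    offset = int.from_bytes(data[32 * slot:32 * (slot + 1)], "big")
    length = int.from_bytes(data[offset:offset + 32], "big")
    return data[offset + 32:offset + 32 + length].decode("utf-8")

print(read_string(data, 1))  # slot 1 points at offset 0x80
print(read_string(data, 3))  # slot 3 points at offset 0xc0
```

Under this reading the two strings come out as "test" and "Victory", and the third word (0x6331b7e0) would be a plain uint256, plausibly a Unix timestamp.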
In Python 2, just use the built-in decode function on the string, after stripping the 0x prefix (also, naming the variable str would shadow a built-in):
s = (
    "0000000000000000000000000000000000000000000000000000000000000004"
    "0000000000000000000000000000000000000000000000000000000000000080"
    "000000000000000000000000000000000000000000000000000000006331b7e0"
    "00000000000000000000000000000000000000000000000000000000000000c0"
    "0000000000000000000000000000000000000000000000000000000000000004"
    "7465737400000000000000000000000000000000000000000000000000000000"
    "0000000000000000000000000000000000000000000000000000000000000007"
    "566963746f727900000000000000000000000000000000000000000000000000"
)  # your hex, without the 0x prefix
s.decode("hex")
This does not work in Python 3, where str has no decode method; use instead:
bytearray.fromhex(s).decode()

Small difference in Python hashlib.sha3_512 and Node.js jsSHA("SHA3-512", "HEX")

I'm trying to hash a string to send to another app, and my Python and Node.js implementations' outputs do not match. Can someone tell me the difference between these two methods, or what I'm doing wrong? I have the Node.js version working and compatible, and I'd like the Python version to match its output.
NodeJS:
var jsSHA = require("jssha")
var hash = new jsSHA("SHA3-512","HEX");
hash.update("6b0d");
console.log(hash.getHash("B64"))
// $ 9G1hk/ztGnZyk1HGPQMYAtrkg6dFoPW+s5TZou101Yl4QJyaSe+l1uZIEpi/rosNCfpsKOI7kh5usLrn06uYtQ==
Python:
import hashlib
import base64
hash = hashlib.sha3_512("6b0d".encode()).digest()
print(base64.b64encode(hash).decode())
# $ mBg3+maf_9gyfkDIIsREJM8VjCxKEo3J5MrCiK8Bk6FFJZ81IcAc8PjTRB+/3jd0MGnynqjkSZEg++c40JRwhQ==
Can anyone tell the difference, or something I'm missing in Python?
"6b0d".encode() is not converting from str hex representation to two raw bytes, it's converting to four raw bytes representing each of the characters (b'6b0d'), and hashing that (str.encode() is for encoding based on a character encoding, and it defaults to UTF-8; there's no convenient way to use it to convert to/from hex representation on Python 3). By contrast,
var hash = new jsSHA("SHA3-512","HEX");
is telling the library to interpret the input as a hex representation, so it decodes it from four characters to two bytes on your behalf and hashes that.
To make Python do the same thing, change:
"6b0d".encode()
to:
bytes.fromhex("6b0d")
which will get the same two bytes that jsSHA is producing for hashing on your behalf (equivalent to b'\x6b\x0d', or as Python would represent it when echoing, b'k\r', since both bytes have shorter printable representations in ASCII).
Note: On older versions of Python, you'd import binascii and replace bytes.fromhex with binascii.unhexlify, but all supported versions provide it as an alternate constructor on the built-in bytes type, so that's the easiest approach.

What encoding is used by json.dumps?

When I use json.dumps in Python 3.8 for special characters they are being "escaped", like:
>>> import json
>>> json.dumps({'Crêpes': 5})
'{"Cr\\u00eapes": 5}'
What kind of encoding is this? Is this an "escape encoding"? And why is this kind of encoding not part of the encodings module? (Also see codecs, I think I tried all of them.)
To put it another way, how can I convert the string 'Crêpes' to the string 'Cr\\u00eapes' using Python encodings, escaping, etc.?
You are probably confused by the fact that this is a JSON string, not directly a Python string.
Python would encode this string as "Cr\u00eapes", where \u00ea represents a single Unicode character using its hexadecimal code point. In other words, in Python, len("\u00ea") == 1
JSON requires the same sort of encoding, but embedding the JSON-encoded value in a Python string requires you to double the backslash; so in Python's representation, this becomes "Cr\\u00eapes" where you have a literal backslash (which has to be escaped by another backslash), two literal zeros, a literal e character, and a literal a character. Thus, len("\\u00ea") == 6
If you have JSON in a file, the simplest way to load it into Python is json.load() (or json.loads() on a string already in memory), which reads and decodes it into a native Python data structure.
If you need to decode the hexadecimal sequence separately, the unicode-escape function does that on a byte value:
>>> b"Cr\\u00eapes".decode('unicode-escape')
'Crêpes'
This is sort of coincidental, and works simply because the JSON representation happens to be identical to the Python unicode-escape representation. You still need a b'...' aka bytes input for that. ("Crêpes".encode('unicode-escape') produces a slightly different representation. "Cr\\u00eapes".encode('us-ascii') produces a bytes string with the Unicode representation b"Cr\\u00eapes".)
It is not a Python encoding. It is the way JSON encodes Unicode non-ASCII characters. It is independent of Python and is used exactly the same for example in Java or with a C or C++ library.
The rule is that a non-ASCII character in the Basic Multilingual Plane (i.e. with a maximum 16 bits code) is encoded as \uxxxx where xxxx is the unicode code value.
Which explains why the ê is written as \u00ea, because its unicode code point is U+00EA
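To see the relationship concretely: the default escaped output, the unescaped output via json.dumps's ensure_ascii parameter, and decoding back to a Python string:

```python
import json

print(json.dumps({'Crêpes': 5}))                      # {"Cr\u00eapes": 5}
print(json.dumps({'Crêpes': 5}, ensure_ascii=False))  # {"Crêpes": 5}
print(json.loads('{"Cr\\u00eapes": 5}'))              # {'Crêpes': 5}
```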

Is there any way through which we could define the characters that a hash value is composed of?

For example, I want the hash value produced by Python's blake2b function to use only the characters acdefghjklmnpqrstuvwxyz2345679.
A hash is a bit string. You can encode this bit string using a specific set of printable characters if you want. Hexadecimal (using 0123456789abcdef) is the most common way, but if you want a different set of characters, you can choose those instead.
To encode the hash value in hexadecimal, assuming that you have it as a bytes value like the one returned by the digest method in the standard hashlib module, use hash.hex() in Python 3 or hash.encode('hex') in Python 2. The hashlib hash objects also have a hexdigest method which returns this encoding directly.
If you want to encode the value using single-case letters and digits without a risk of confusion on 0/O and 1/I, there's a standard for that called Base32. Base32 is available in Python in the base64 module. The standard encoding uses only uppercase, but you can translate to lowercase if you want. Base32 pads with =, but you can remove them for storage.
import base64, hashlib
hash = hashlib.new('SHA256', b'input').digest()
b32_hash = base64.b32encode(hash).lower().rstrip(b'=')
If you really want that specific 30-character set, you can convert the hexadecimal representation to an integer using int(….hexdigest(), 16) then convert that integer to a string using the digits of your choice.
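A sketch of that conversion, using the 30-character set from the question (the helper name is my own; digest_size=32 is an arbitrary choice):

```python
import hashlib

ALPHABET = "acdefghjklmnpqrstuvwxyz2345679"  # 30 characters

def encode_digest(digest, alphabet=ALPHABET):
    """Convert a hash digest (bytes) to a string over a custom alphabet."""
    n = int.from_bytes(digest, "big")
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    out = []
    while n:
        n, r = divmod(n, base)
        out.append(alphabet[r])
    return "".join(reversed(out))

digest = hashlib.blake2b(b"input", digest_size=32).digest()
print(encode_digest(digest))
```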

Python base64 data decode and byte order convert

I am using the Python base64 module to decode a base64-encoded XML file. What I did was find each piece of data (there are thousands of them; for example in "ABC....", "ABC..." is the base64-encoded data), append it to a string, say s, and then use base64.b64decode(s) to get the result. I am not sure about the result of the decoding: is it a string, or bytes? In addition, how should I convert such decoded data from the so-called "network byte order" to "host byte order"? Thanks!
Each base64 encoded string should be decoded separately - you can't concatenate encoded strings (and get a correct decoding).
The result of the decode is a byte string; in Python 2, str and a byte buffer are the same thing (in Python 3, b64decode returns bytes).
Regarding the network/host order: a sequence of bytes has no such order (endianness) by itself. It only matters when interpreting those bytes as words or ints wider than 8 bits.
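For example, assuming the payload contains 32-bit integers, the struct module can interpret the same four bytes in either byte order ("!" means network/big-endian, "<" little-endian):

```python
import struct

raw = b"\x00\x00\x01\x00"               # four bytes, no inherent order
(net,) = struct.unpack("!I", raw)       # network (big-endian): 256
(little,) = struct.unpack("<I", raw)    # little-endian: 65536
print(net, little)
```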
Base64 data, encoded or not, is handled as (byte) strings. Byte order is only an issue if you're dealing with multi-byte non-character data (C's int, short, long, float, etc.), and I'm not sure how it would relate to this issue. Also, concatenating base64-encoded strings is not valid, as these Python 2 examples show:
>>> from base64 import *
>>> b64encode( "abcdefg" )
'YWJjZGVmZw=='
>>> b64decode( "YWJjZGVmZw==" )
'abcdefg'
>>> b64encode( "hijklmn" )
'aGlqa2xtbg=='
>>> b64decode( "aGlqa2xtbg==" )
'hijklmn'
>>> b64decode( "YWJjZGVmZw==aGlqa2xtbg==" )
'abcdefg'
>>> b64decode( "YWJjZGVmZwaGlqa2xtbg==" )
'abcdefg\x06\x86\x96\xa6\xb6\xc6\xd6\xe0'
This thread has a good Python-based b64decode parser: http://groups.google.com/group/spctools-discuss/browse_thread/thread/a8afd04e1a04cde4 ("Extracting peak-lists from mzXML in Python").
