I googled and could not find much information on Base94 encoding. Does anyone have more details about this encoding?
Essentially: Base94 encoding takes 9 input bytes of 8 bits each, uses them to construct a 72-bit integer, converts that integer to an 11-digit base-94 number, and encodes each digit using the ASCII characters ! (33) through ~ (126):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Here is some example code that someone published on GitHub:
https://gist.github.com/iso2022jp/4054241
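For illustration, here is a minimal sketch in Python as well, written against the description above rather than the gist, so the function names and block handling are my own assumptions:

# Minimal sketch of one Base94 block: pack 9 bytes into a 72-bit integer, then
# emit 11 base-94 digits drawn from '!' (33) through '~' (126).
def base94_encode_block(block: bytes) -> str:
    assert len(block) == 9
    n = int.from_bytes(block, 'big')            # 72-bit integer
    digits = []
    for _ in range(11):                         # 94**11 > 2**72, so 11 digits suffice
        n, r = divmod(n, 94)
        digits.append(chr(33 + r))
    return ''.join(reversed(digits))

def base94_decode_block(text: str) -> bytes:
    assert len(text) == 11
    n = 0
    for ch in text:
        n = n * 94 + (ord(ch) - 33)
    return n.to_bytes(9, 'big')

print(base94_decode_block(base94_encode_block(b'base94tst')))   # b'base94tst'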
Related
I'm currently playing around with decoded ASN.1 data and can't wrap my head around correctly decoding the data into strings (if the data is numerical, it works absolutely fine).
Example:
Hex String -> 0ddc2f93c6c7bb10
Expected Result -> MegaFon
According to the spec, the first two octets are meta info, and starting with octet 3 there should be two 7-bit chars in each octet.
I tried to use the solutions mentioned in "decode 7-bit GSM", but I just get garbage back; I would highly appreciate any ideas.
I managed to solve the riddle in the meantime (@BoarGules, you are right, the spec is misleading from my perspective). First of all, for chars (the hex starts with d0 in this case), the nibbles must not be rotated as is done for numerical output. Then just cut off the leading meta info (d0 in our case) and run the rest through the gsm7bitdecode function mentioned in the other Stack Overflow thread (linked in the question). To stick with the example: 'CD' => 11001101; cutting the first bit (or setting it to 0) gives us 01001101, or 4D in hex, which is M in ASCII!
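For anyone hitting the same wall, here is a rough sketch of that unpacking step in Python. It reflects my reading of the fix above, so treat the helper name, the plain ASCII mapping and the nibble handling as assumptions; the gsm7bitdecode from the linked thread may differ in detail.

# Rough sketch of GSM 03.38 septet unpacking, mapped to plain ASCII for simplicity.
def gsm7_unpack(data: bytes) -> str:
    bits = 0
    bit_count = 0
    chars = []
    for byte in data:
        bits |= byte << bit_count           # accumulate bits least-significant first
        bit_count += 8
        while bit_count >= 7:
            chars.append(chr(bits & 0x7F))  # peel off one 7-bit character
            bits >>= 7
            bit_count -= 7
    return ''.join(chars).rstrip('\x00')    # drop a zero-padding septet, if any

# The question's sample with the nibbles un-rotated and the leading d0 stripped
# (my interpretation of the fix described above):
payload = bytes.fromhex('cdf2396c7cbb01')
print(gsm7_unpack(payload))                 # MegaFon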
This question already has answers here:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?
I am reading the Unicode HOWTO in the Python documentation.
It is written that
a Unicode string is a sequence of code points, which are numbers from
0 through 0x10FFFF
which makes it look like the maximum number of bits needed to represent a code point is 24 (because there are 6 hexadecimal digits, and 6 * 4 = 24).
But then the documentation states:
The first encoding you might think of is using 32-bit integers as the
code unit
Why is that? The first encoding I could think of is with 24-bit integers, not 32-bit.
Actually, you only need 21 bits. Many CPUs use 32-bit registers natively, and most languages have a 32-bit integer type.
If you study the UTF-16 and UTF-8 encodings, you’ll find that their algorithms encode a maximum of a 21-bit code point using two 16-bit code units and four 8-bit code units, respectively.
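A quick illustrative check of that limit in Python:

# U+10FFFF, the highest code point, needs 21 bits; UTF-8 spends four 8-bit code
# units on it and UTF-16 spends two 16-bit code units (a surrogate pair).
ch = '\U0010FFFF'
print((0x10FFFF).bit_length())              # 21
print(len(ch.encode('utf-8')))              # 4
print(len(ch.encode('utf-16-le')) // 2)     # 2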
Because it is the standard way. Python uses different "internal encodings" depending on the content of the string: ASCII/ISO, UTF-16, UTF-32. UTF-32 is a commonly used representation (usually just internal to programs) of Unicode code points. So instead of reinventing yet another encoding (e.g. a hypothetical UTF-22), Python just uses the UTF-32 representation. It is also easier for the different interfaces. It is not so efficient on space, but much more efficient for string operations.
Note: in rare cases Python also uses the surrogate range to encode "wrong" bytes (the surrogateescape error handler), so the full range of code points, including the surrogates, needs to be representable.
Note: colour encoding went through something similar: 8 bits * 3 channels = 24 bits, but it is often stored in 32-bit integers (partly for other reasons: a single aligned write instead of extra reads and writes on the bus). 32 bits is much easier and faster to handle.
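To see the "internal encoding depends on the string's content" point in action, here is a small illustration; it assumes CPython with its PEP 393 flexible string representation, and the exact byte counts vary by version:

import sys

# The widest character in a string determines how many bytes CPython spends per
# character (roughly 1, 2 or 4), which shows up in the object's overall size.
for s in ['a' * 10, 'é' * 10, '€' * 10, '😀' * 10]:
    print(repr(s[0]), sys.getsizeof(s))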
A Mongo change stream is returned in a binary format.
To be able to script a Mongo change stream, I want to encode the byte array into a format that would be safe as a command-line parameter.
pprint.pprint(change['_id']['_data'])
(b'\x82[8\x92G\x00\x00\x00\x01Fd_id\x00d[8\x91\xf2.\xc2\xd4\x00\x0b\xabO\x98'
b'\x00Z\x10\x04\x16,\x92\xf8\xbf\x92G\x87\x8d1\xff(\x1a\x1b{\xc8\x04')
What would be a safe format to convert the binary array to, so that the resulting text would be accepted as a parameter?
An example of converting from binary to the given format, and from that format (as a str) back into binary, would be helpful.
Attempt 1
base64.b85encode(change['_id']['_data']).decode('ascii')
'f?GI}M*si-0Y+qBX=DIoTR4&OF2d9R3#(6<09p_P7A%tZzmi9XjWPcy8XJ4a1O'
Going from binary to base85 works, but I can't seem to figure out the way back.
EDIT: Reopening rationale
I think this question should not be marked as a duplicate, as it targets the conversion of arbitrary byte arrays that do not represent human-readable characters in any particular encoding. The previous question focuses on converting a string into a byte array and back, which is a special case of binary-to-string representation, while my use case calls for a generic solution.
Oh cool, I think I've figured it out
base64.b85decode will take a string as well as bytes as input.
Example:
b = b'\x82[8\x929\x00\x00\x00\x04Fd'
b == base64.b85decode(base64.b85encode(b).decode('ascii'))
True
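Putting it together for the scripting use case, a minimal sketch of the decode side (the file name and argument handling here are assumptions, nothing Mongo-specific):

# decode_token.py -- decode a base85-encoded resume token passed on the command line.
import base64
import sys

resume_token = base64.b85decode(sys.argv[1])   # b85decode accepts str or bytes
print(repr(resume_token))

One caveat: the base85 alphabet contains shell-special characters such as !, $ and |, so quote the argument when passing it on the command line.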
What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte bytes object -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data, which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
Charles' answer is absolutely correct. However, I'm assuming that you don't want the shortest binary encoding of the SHA-256 hash - the 32-octet string - but rather something printable and somewhat human-readable.
Note: however, this does not exactly apply to barcodes. QR codes, at least, encode binary data, so just use the digest() method of your hash - that would be the most efficient encoding you can use there. Your QR code generation library most likely supports generating codes from "raw" binary strings - check your library's docs and find the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters; they work with binary data. SHA-256 produces 256 bits of data, commonly represented as 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256(b"...").digest() returns bytes and not str.
There is a convenience method, hexdigest(), that produces a hexadecimal (base16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide on how many distinct characters you can use (the encoding base) and use that. There are many libraries on PyPI that could help; python-baseconv may come in handy.
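For a concrete feel of the trade-off, here is an illustrative comparison of the same digest in a few standard-library encodings (character counts include padding):

import base64
import hashlib

digest = hashlib.sha256(b'hello').digest()   # 32 raw bytes -- the densest form
print(len(digest))                           # 32
print(len(digest.hex()))                     # 64 hex characters
print(len(base64.b32encode(digest)))         # 56 base32 characters
print(len(base64.b64encode(digest)))         # 44 base64 characters
print(len(base64.b85encode(digest)))         # 40 base85 characters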
Could you explain in detail what the difference is between a byte string and a Unicode string in Python? I have read this:
Byte code is simply the converted source code into arrays of bytes
Does it mean that Python has its own coding/encoding format? Or does it use the operating system settings?
I don't understand. Could you please explain?
Thank you!
No, Python does not use its own encoding - it will use any encoding that it has access to and that you specify.
A character in a str represents one Unicode character. However, to represent more than 256 characters, Unicode encodings may use more than one byte per character.
bytes objects give you access to the underlying bytes. str objects have the encode method, which takes a string naming an encoding and returns the bytes object that represents the string in that encoding. bytes objects have the decode method, which takes a string naming an encoding and returns the str that results from interpreting the bytes as text in the given encoding.
For example:
>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'
We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac, to represent two characters.
Related reading:
Python Unicode Howto (from the official documentation)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Here's an attempt at a simple explanation that only applies to Python 3. I hope that, coming from a layperson, it helps to clear up some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point them out.
Suppose you create a string using Python 3 in the usual way:
stringobject = 'ant'
stringobject would be a unicode string.
A unicode string is made up of unicode characters. In stringobject above, the unicode characters are the individual letters, e.g. a, n, t
Each unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, ranging from 0-9 and A-F). For instance, the letter 'a' is equivalent to '\u0061', and 'ant' is equivalent to '\u0061\u006E\u0074'.
So you will find that if you type in,
stringobject = '\u0061\u006E\u0074'
stringobject
You will also get the output 'ant'.
Now, unicode is converted to bytes, in a process known as encoding. The reverse process of converting bytes to unicode is known as decoding.
How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a unicode character has a code point consisting of four hex digits, it would need a 16-bit binary sequence to encode it.
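For example (illustrative):

# The code point of 'a' is two hex digits and fits in 8 bits; the code point of
# '€' (U+20AC) is four hex digits and needs up to 16 bits.
print(hex(ord('a')))             # 0x61
print(format(ord('a'), 'b'))     # 1100001
print(hex(ord('€')))             # 0x20ac
print(format(ord('€'), 'b'))     # 10000010101100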
Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.
For instance, the ASCII encoding system uses only 7 bits (stored in 1 byte) per character. Thus it can only encode the first 128 unicode characters, i.e. those with code points up to hex 7F. The UTF-8 encoding system uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode unicode characters with code points up to six hex digits long, i.e. everything.
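A quick illustrative check of how many bytes UTF-8 actually spends per character:

for ch in ('a', 'é', '€', '😀'):
    print(ch, len(ch.encode('utf-8')), 'byte(s) in UTF-8')   # 1, 2, 3 and 4 bytes respectively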
Running the following code:
byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)
converts the unicode string into a byte string using the utf-8 encoding system, and returns (b'ant', <class 'bytes'>).
Note that if you used 'ascii' as the encoding system, you wouldn't run into any problems, since all the code points in 'ant' fall within the ASCII range. But if you had a unicode string containing characters with code points outside that range, you would get a UnicodeEncodeError.
Similarly,
stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)
gives you ('ant', <class 'str'>).
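As a small addendum to the note about UnicodeEncodeError above, an illustrative demonstration:

print('ant'.encode('ascii'))     # b'ant' -- every code point is in the ASCII range
try:
    'αντ'.encode('ascii')        # Greek letters lie outside the ASCII range
except UnicodeEncodeError as err:
    print(err)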