A Mongo change stream is returned in a binary format.
To be able to script against a Mongo change stream, I want to encode the byte array into a format that is safe to pass as a command-line parameter.
pprint.pprint(change['_id']['_data'])
(b'\x82[8\x92G\x00\x00\x00\x01Fd_id\x00d[8\x91\xf2.\xc2\xd4\x00\x0b\xabO\x98'
b'\x00Z\x10\x04\x16,\x92\xf8\xbf\x92G\x87\x8d1\xff(\x1a\x1b{\xc8\x04')
What would be a safe format to convert the binary array to, so that it would be accepted as a parameter?
An example of the conversion from binary to that format, and from that format (given as a str) back into binary, would be helpful.
Attempt 1
base64.b85encode(change['_id']['_data']).decode('ascii')
'f?GI}M*si-0Y+qBX=DIoTR4&OF2d9R3#(6<09p_P7A%tZzmi9XjWPcy8XJ4a1O'
Going from binary to base85 works, but I can't seem to figure out the way back.
EDIT: Reopening rationale
I think this question should not be marked as a duplicate, as it targets the conversion of arbitrary byte arrays that do not represent human-readable characters in any particular encoding. The previous question, by contrast, focuses on converting a string into a byte array and back, which is a special case of binary-to-string representation, while my use case calls for a generic solution.
Oh cool, I think I've figured it out
base64.b85decode will accept an ASCII str as well as bytes as input.
Example:
b = b'\x82[8\x929\x00\x00\x00\x04Fd'
b == base64.b85decode(base64.b85encode(b).decode('ascii'))
True
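For completeness, a minimal sketch of the full round trip; the token bytes are the short example from above, and reading the value from sys.argv is my assumption about how the receiving script would be invoked:

import base64

token = b'\x82[8\x929\x00\x00\x00\x04Fd'

# Producer side: encode the raw bytes into an ASCII string that is safe to put on a command line.
arg = base64.b85encode(token).decode('ascii')

# Consumer side (e.g. arg = sys.argv[1]): decode straight back to the original bytes.
restored = base64.b85decode(arg)
assert restored == token

Note that base85 output can contain characters such as ! or { that are special to some shells, so quote the argument when invoking from a shell; if you want an alphabet with no shell-special characters at all, base64.urlsafe_b64encode is an alternative.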
Related
Currently, I have a system that converts a list of integers to their binary representations. I calculate the number of bytes each number requires and then use the to_bytes() function to convert them to bytes, like so:
o = open(outFileName, "wb")
for n in result:
    numBytes = math.ceil(n.bit_length()/8)
    o.write(n.to_bytes(numBytes, 'little'))
o.close()
However, since the encoded integers occupy varying numbers of bytes, how can an unpacking program or function know how many bytes each number uses? I have heard of the struct module and specifically its pack function, but with a focus on efficiency and keeping the file as small as possible, what would be the best approach that still allows an unpacking program to retrieve the exact list of originally encoded integers?
You can't. Your encoding maps different lists of integers to the same sequence of bytes. It is then impossible to know which one was the original input.
You need a different encoding.
Take a look at using the high bit of each byte as a continuation flag. There are other ways that might be better, depending on the distribution of your integers, such as Golomb coding.
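A minimal sketch of that idea -- a standard varint/LEB128-style scheme where the high bit of each byte means "another byte follows"; the function names are mine, not from the question:

def encode_varint(n):
    """Encode a non-negative int using 7 payload bits per byte; the high bit flags continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, position after it)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos

Because the decoder always knows where one integer ends, a whole list can be written back-to-back and unpacked exactly.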
I've been trying to find a way to encode limited-character strings in order to compress data, as well as to produce unique 'IDs' for each string.
I have several million strings, each around 280-300 characters long but limited to only four letters (A, T, C and G). I've wondered if there isn't an easier way to encode these using less memory, considering that they should be easily encodable in base four, but I don't know the easiest way to do that. I've considered using for loops in Python, where I'd iterate over each string, look up the numeric value for each letter in a dictionary and multiply it by the corresponding power of four. Example:
base_dict = {
    'A' : 0,
    'T' : 1,
    'C' : 2,
    'G' : 3
}  # These are the four bases of DNA, each assigned a different numeric value

strings_list = [
    'ATCG',
    'TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATATGAAAGGCGGGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGGCCTGGAGTACCGGAGG',
    'GGGGGGGGGG'
]  # A few sample DNA sequences
for string in strings_list:
    encoded_number = 0
    for i in range(len(string)):
        letter = string[i]
        encoded_number += (4**i) * base_dict[letter]
    print('String {} = {}'.format(string, encoded_number))
It seemed to work well, encoding each string into an integer. The problem is that I could not get encoded_number turned into actual binary data. The best I could do was this:
binary = '{0:b}'.format(encoded_number)
But though that returned the binary value, it did so as a string. Trying to convert the integer to bytes always yields an error because of its huge size when using the actual 280+ character strings; the long string above, for instance, results in the huge integer 230124923583823837719192000765784020788478094239354720336304458517780079994251890530919145486338353514167796587078005476564902583371606379793061574009099280577109729494013:
bytes(encoded_number) # trying to turn the encoded number into bytes
OverflowError: cannot fit 'int' into an index-sized integer
I'd like to know if this is the most efficient way to encode limited-character strings like this, or if there's a better way, and also whether there are other ways I could compress this data even more while still being able to reverse the final number/binary back into my string. Also, is there any way I can actually convert it to binary format, instead of an integer or a string? Does doing so help conserve data?
Also, what would be the most concise way of reducing the integer/binary to a human-readable value (a new, shorter string)? Using integers or binary seems to conserve data, and I'd be able to store these strings using less memory (and transfer the data faster), but if I want to create concise user-readable strings, what would be the best option? Is there any way I could encode back into a string, but making use of the whole ASCII table so as to use far fewer characters?
It would be very useful to be able to reduce my 300-character strings to shorter, 86-character strings (considering the ASCII table has 128 characters available, and 4^300 ~= 128^86).
I'm trying to do this in Python, as it's the language I'm most familiar with, and also what my code is in already.
TL;DR, summarizing the several questions I'm having trouble with:
1. What is the most efficient way to encode limited-character strings? (There's an example in the code above; is that the best way?)
2. Are there any other ways to compress strings that could be used alongside the encoding of limited characters, to further compress the data?
3. Can large integers (4^300) be converted into binary without resulting in an overflow? How?
4. What's the most efficient way to convert binary values, numbers or limited-character strings (it's basically the same in this situation, as I'm trying to convert one into the other) into small, concise strings (user-readable, so the smaller, the better)?
The conversion you're making is the obvious one: since 4 is a power of 2, the conversion to binary is as compact as you can get for uniformly-distributed sequences. You need only to represent each letter with its 2-bit sequence, and you're done with the conversion.
Your problem seems to be in storing the result. The shortest change is likely to upgrade your code to handle bytes properly.
Another version of this is to break the string into 16-letter chunks, turning each into a 32-bit integer, and then write out the sequence of integers (in binary).
Another is to forget the conversion entirely: feed the string to your system's compression algorithm, which will take advantage of frequently repeated subsequences.
N.B. your conversion will lose leading zeros: a string such as "AAAAGCTGA" would be reconstituted as "GCTGA". You'll need to include the expected string length.
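A minimal sketch pulling those pieces together -- 2 bits per base, packed into bytes with int.to_bytes(), plus an explicit length so leading 'A's survive the round trip (helper names are illustrative, not from the question):

base_bits = {'A': 0b00, 'T': 0b01, 'C': 0b10, 'G': 0b11}
bit_bases = {v: k for k, v in base_bits.items()}

def pack(seq):
    """Pack a DNA string into (length, bytes): 2 bits per base."""
    n = 0
    for base in seq:
        n = (n << 2) | base_bits[base]
    num_bytes = (2 * len(seq) + 7) // 8
    return len(seq), n.to_bytes(num_bytes, 'big')

def unpack(length, data):
    """Reverse pack(); the stored length restores any leading 'A's."""
    n = int.from_bytes(data, 'big')
    out = []
    for _ in range(length):
        out.append(bit_bases[n & 0b11])
        n >>= 2
    return ''.join(reversed(out))

length, packed = pack('AATCG')
assert unpack(length, packed) == 'AATCG'   # 5 bases stored in 2 bytes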
For doing the simple chunk-convert method, refer to the link I provided.
For compression methods, research compression (which we presume you've done before posting here, per the posting guidelines). On Linux, use the file compression provided with the OS (likely gzip).
Another possibility: if at least two of the 64 possible three-letter triples never appear in your data, you can map each remaining triple to a single character and use base62 (do a browser search for documentation) -- this uses the full range of alphanumeric characters to encode in a text-readable form.
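A sketch of that last idea, under the purely hypothetical assumption that the triples 'GGG' and 'CCC' never occur in the data (and that each sequence length is a multiple of three):

import string
from itertools import product

alphanum = string.digits + string.ascii_letters                # 62 characters
triples = [''.join(p) for p in product('ATCG', repeat=3)]      # 64 possible triples
unused = {'GGG', 'CCC'}                                        # hypothetical: never seen in the data
usable = [t for t in triples if t not in unused]               # 62 remaining triples

to_char = dict(zip(usable, alphanum))
to_triple = {c: t for t, c in to_char.items()}

def encode(seq):
    return ''.join(to_char[seq[i:i + 3]] for i in range(0, len(seq), 3))

def decode(text):
    return ''.join(to_triple[c] for c in text)

assert decode(encode('ATCGGAATT')) == 'ATCGGAATT'   # 9 bases become 3 characters

This gives a user-readable string a third the length of the original, at the cost of only working when two triples really are absent from the data.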
Assuming I have some ASCII characters in a string, let's say s = 'ABC', how can I retrieve the binary representation as a string?
In this case,
A = '01000001'
B = '01000010'
C = '01000011'
so I want something like make_binary('ABC') to return '010000010100001001000011'
I know I can get the hex values for a string. I know I can get the binary representation of an integer. I don't know if there's any way to tie all these pieces together.
Use the ord() function to get the integer encoding of each character.
def make_binary(s):
    return "".join([format(ord(c), '08b') for c in s])
print(make_binary("ABC"))
08b formatting returns the number formatted as 8 bits with leading zeroes.
I think the other answer is wrong. Maybe I am interpreting the question wrongly.
In any case, I think you are asking for the 'bit' representation. 'Binary' is often used for the byte representation (.bin files, etc.).
The byte representation is given by an encoding, so you should encode the string, and you will get a byte array. This is your binary (byte-level) representation.
But it seems you are asking for the bit representation. That is different (and the other answer, IMHO, is wrong). You may convert the byte array into a bit representation, like in the other answer. Note: you are converting bytes. The other answer will fail on any character above 127, because it shows you only the binary representation of the code point as a single byte, not of the encoded bytes.
So:
def make_binary(s):
    return "".join(format(c, '08b') for c in s.encode('utf-8'))
and the tests (the second of which gives a different, wrong result with @Barmar's answer):
>>> print(make_binary("ABC"))
010000010100001001000011
>>> print(make_binary("Á"))
1100001110000001
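To make the difference concrete, here is a small comparison between the code-point-based and the UTF-8-byte-based versions for a non-ASCII character (expected output shown in comments):

c = "Á"
print(format(ord(c), '08b'))                                   # 11000001 -- code point 193 rendered as one byte
print(''.join(format(b, '08b') for b in c.encode('utf-8')))    # 1100001110000001 -- the two UTF-8 bytes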
What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte string -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
Charles' answer is absolutely correct. However, I'm assuming that you don't want the shortest binary encoding of the SHA-256 hash - the 32-octet string - but rather something printable and somewhat human-readable.
Note, however, that this does not exactly apply to barcodes. QR codes, at least, encode binary data, so just use the digest() method of your hash - that would be the most efficient encoding you can use there. Your QR code generation library should most likely support generating codes from "raw" binary strings - check your library docs to find the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters, they work with binary data. SHA-256 produces 256 bits of data, commonly represented with 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256("...").digest() returns bytes and not str.
There is a convenience method, hexdigest(), that produces a hexadecimal (base16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide on how many distinct characters you can use (the encoding base) and use that. There are many libraries on PyPI that could help. python-baseconv may come in handy.
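As a standard-library-only illustration of the trade-off (the hashed input here is arbitrary): the raw digest is 32 bytes, hex doubles that to 64 printable characters, and URL-safe base64 brings it down to 43 characters once the padding is stripped:

import base64
import hashlib

digest = hashlib.sha256(b"hello world").digest()     # 32 raw bytes -- the densest form
hex_form = digest.hex()                              # 64 printable characters
b64_form = base64.urlsafe_b64encode(digest).decode('ascii').rstrip('=')   # 43 characters
print(len(digest), len(hex_form), len(b64_form))     # 32 64 43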
I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into a plain string using a Python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
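For illustration only (Python 3 syntax, and the sample bytes and encodings below are mine, not from the question), decoding a byte string once you know its encoding looks like this:

raw = b'caf\xc3\xa9'             # bytes that happen to hold UTF-8 encoded text
print(raw.decode('utf-8'))       # café   -- the intended interpretation
print(raw.decode('latin-1'))     # cafÃ©  -- same bytes, wrong encoding, mojibake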
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]