I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data.
Obviously base64 would work, but I can't have that much inflation.
How can I easily achieve this in python 2.7?
You'll have to express your data using just ASCII characters. Base64 is the most space-efficient method available in the Python standard library for fitting binary data into printable text that is also UTF-8 safe. Sure, it requires 33% more space to express the same data, but other methods take even more additional space.
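A quick check of that overhead (a minimal sketch; the 300-byte input is just an arbitrary example):

```python
import base64
import os

data = os.urandom(300)           # 300 bytes of arbitrary binary data
encoded = base64.b64encode(data)
print(len(encoded))              # 400 characters: exactly 33% larger
```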
You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the data is going to be smaller.
import zlib
import base64

def pack_utf8_safe(data):
    is_compressed = False
    compressed = zlib.compress(data)
    if len(compressed) < (len(data) - 1):
        data = compressed
        is_compressed = True
    base64_encoded = base64.b64encode(data)
    if is_compressed:
        base64_encoded = '.' + base64_encoded
    return base64_encoded

def unpack_utf8_safe(base64_encoded):
    decompress = False
    if base64_encoded.startswith('.'):
        base64_encoded = base64_encoded[1:]
        decompress = True
    data = base64.b64decode(base64_encoded)
    if decompress:
        data = zlib.decompress(data)
    return data
The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
You could further shave off the 1 or 2 = padding characters from the end of the Base64 encoded data; these can then be re-added when decoding (append '=' * (-len(encoded) % 4) to the end), but I'm not sure that's worth the bother.
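If you do want to strip the padding, a sketch of both directions (the function names here are my own, for illustration):

```python
import base64

def encode_stripped(data):
    # drop the trailing '=' padding characters
    return base64.b64encode(data).rstrip(b'=')

def decode_stripped(encoded):
    # re-add padding: a padded Base64 string's length is always a multiple of 4
    return base64.b64decode(encoded + b'=' * (-len(encoded) % 4))
```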
You can achieve further savings by switching to the Base85 encoding, a 4-to-5 ratio ASCII-safe encoding for binary data, so a 25% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64 module). You can use the python-mom project on 2.7:
from mom.codec import base85
and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls instead.
If you are 100% certain nothing along the path is going to treat your data as text (possibly altering line separators, or interpret and alter other control codes), you could also use the Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for you; there is a GitHub hosted module but I have not tested it.
You can decode your bytes as ISO 8859-1 data, which will always produce a valid Unicode string. Then you can encode that string to UTF-8:
utf8_data = my_bytes.decode('iso8859-1').encode('utf8')
On average, half your data will be in the 0-127 range, which is one byte in UTF8, and half your data will be in the 128-255 range, which is two bytes in UTF8, so your result will be 50% larger than your input data.
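A round-trip sketch of this approach (the 1000-byte input is just an arbitrary example):

```python
import os

data = os.urandom(1000)                      # arbitrary binary data
utf8_data = data.decode('iso8859-1').encode('utf8')
assert 1000 <= len(utf8_data) <= 2000        # between 0% and 100% larger, ~50% on average
recovered = utf8_data.decode('utf8').encode('iso8859-1')
assert recovered == data                     # lossless round trip
```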
If there is any structure to your data at all, then zlib-compressing it, as Martijn suggests, might reduce the size.
If your application really requires you to be able to represent 256 different byte values in a graphically distinguishable form, all you actually need is 256 Unicode code points. Problem solved.
ASCII codes 33-127 are a no-brainer, Unicode code points 160-255 are also good candidates for representing themselves but you might want to exclude a few which are hard to distinguish (if you want OCR or humans to handle them reliably, áåä etc might be too similar). Pick the rest from the set of code points which can be represented in two bytes -- quite a large set, but again, many of them are graphically indistinguishable from other glyphs in most renderings.
This scheme does not attempt any form of compression. I imagine you'd get better results by compressing your data prior to encoding it if that's an issue.
Related
I'm currently working on a password storage program in Python, though C would likely be faster. I've been trying for the past hour or so to find a way to store a bytes object in a CSV file. I'm hashing the passwords with their own salt, and then storing that, and grabbing it again to check the password. It works perfectly well when it's stored in memory.
salt = os.urandom(64)
hash = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode('utf-8'),
    salt,
    1000000
)
storage = salt + hash

salt_from_store = storage[:64]
hash_from_store = storage[64:]
However, when I try storing it in a CSV file, so it doesn't have to be constantly running, I get an error,
TypeError: write() argument must be str, not bytes
So, I converted it to a string using,
str(storage)
and that wrote just fine. But then, when I get it from the file, it's still a string, and the length goes from 128 (bytes) to 300+ (chars). It's also never consistent. I don't know the encoding, so I can't change it like that, when I print the bytes, it's a bunch of characters with backslashes and X's
b'\xfd\x3a'
and occasionally some random special characters. I'm not sure if there's a way to convert that to an int, and let it be converted back. Another issue is that I've found a way to do it, by changing
b"\xf1\x96"
to
"b\xf1\x96"
which prints the encoded text, rather than the bytes it's made up of. However, I don't know if that's a good way of changing it, and if it is, if there's a way to do it without something like
bytes[0] = '"'
bytes[1] = 'b'
If you want to save bytes as a string, you should encode them in a format made for this, like base64. That is more space-efficient than writing the bytes out as hex.
Trying to convert arbitrary bytes to an encoding like utf-8 directly will likely result in UnicodeDecodeError errors.
In your case, you could do something like:
import os, hashlib, base64

password = "top_secret"
salt = os.urandom(64)
hash = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode('utf-8'),
    salt,
    1000000
)
storage = salt + hash

# convert to a base64 string:
s = base64.b64encode(storage).decode('utf-8')
print(s)  # <-- a string you can save to a file

# after reading it back from a file, convert back to bytes
the_bytes = base64.b64decode(s)
the_bytes == storage
# True
To write bytes, either write to something that expects to contain bytes, or write text that represents the bytes in some way. CSV is fundamentally a text-based format. If you're going to use a CSV file, then you're going to open it in text mode, and write text to it.
Fundamentally, every file on the hard drive consists of bytes. This implies that, when you open the CSV file, you will be choosing (or using a default) text encoding scheme. So your bytes object will have to be converted twice (to text, and then into the underlying bytes in the file - which you could verify for example with a hex editor) on writing, and twice again on reading. That's just the reality of dealing with mixed data. Thankfully, half that work is taken care of for you automatically (by the open call, or wrappers for that like csv.Reader).
So, I converted it to a string using str(storage)
This is not actually a conversion in the sense that you're most likely interested in. It asks for a printable, human-readable representation of the object. (There is also repr, which asks for a more technically oriented representation; for str and bytes objects, that's where the enclosing quotation marks come from, among other adjustments. When you print something, its str is used. When you evaluate something at the REPL, you see the repr of the result - except that when the result is None, nothing is shown at all.) Specifically for dealing with bytes and str objects, Python has a concept of encoding and decoding, which uses explicit .encode (str -> bytes) and .decode (bytes -> str) methods. These are topics you can easily look up in the documentation (or previous Stack Overflow questions, or on the Internet in general).
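To illustrate the difference (Python 3; the byte values are arbitrary):

```python
b = b'caf\xe9'
print(str(b))               # b'caf\xe9'  - a representation of the object, quotes included
print(b.decode('latin-1'))  # café       - an actual conversion to text
```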
when I print the bytes, it's a bunch of characters with backslashes and X's
Yes, this is the form that Python uses to tell you what data exists inside the bytes object. What you're saying here is basically the same as "when I print the list, it's a bunch of list elements with commas surrounded by square brackets", or "when I print the integer, it's a bunch of digit symbols".
But then, when I get it from the file, it's still a string, and the length goes from 128 (bytes) to 300+ (chars).
So decode it again. Of course you do need to encode properly. Everything that you get from the file will be a string, because you are opening the file in text mode, because CSV is a text format. (Incidentally, you are using the csv standard library module for this, right?)
It's also never consistent. I don't know the encoding
So tell it which encoding to use; and if you need to use a consistent amount of text, choose an encoding that consistently maps one byte to one Unicode code point (such as latin-1, also named iso-8859-1). But I suspect you don't actually care how long the text is (if anything, you'd care about the amount of bytes used in the file).
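A quick sketch of why latin-1 gives a consistent length:

```python
import os

data = os.urandom(64)
text = data.decode('latin-1')          # one code point per byte; always succeeds
assert len(text) == 64
assert text.encode('latin-1') == data  # lossless round trip
```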
I've found a way to do it, by changing
You can only do this with literal data. Do not think in these terms. The b is part of the language syntax. It is not data.
You could use hex. Let's get some data:
>>> import os
>>> b = os.urandom(10)
>>> b
b'\xc5\xe2{\xdf\xd2\x13\xa7\x0b\xef\x07'
As a hex string that you can write to CSV:
>>> b.hex()
'c5e27bdfd213a70bef07'
Back to bytes:
>>> bytes.fromhex(b.hex())
b'\xc5\xe2{\xdf\xd2\x13\xa7\x0b\xef\x07'
What is the most dense way (fewest characters) that I can store a complete SHA-256 hash?
Calling .digest() on a hashlib.sha256 object will return a 32-byte string -- the shortest possible way (with 8-bit bytes as the relevant unit) to store 256 bits of data which is effectively random for compressibility purposes.
Since 8 * 32 == 256, this provably has no wastage -- every bit is used.
Charles' answer is absolutely correct. However, I'm assuming that you don't want the shortest binary encoding of the SHA-256 hash - the 32-octet string - but something printable and somewhat human-readable.
Note: this does not exactly apply to barcodes, however. QR codes, at least, encode binary data, so just use the digest() method of your hash - that is the most efficient encoding you can use there. Your QR code generation library should most likely support generating codes for "raw" binary strings - check your library's docs for the correct method/invocation.
SHA hashes (and other hashes) don't produce or operate on characters, they work with binary data. SHA-256 produces 256 bits of data, commonly represented with 32 bytes. In particular, in Python 3 you should notice that hashlib.sha256("...").digest() returns bytes and not str.
There is a convenience method hexdigest() that produces a hexadecimal (base16) string representing those bytes. You can use base32, base58, base64, baseEmoji or any other encoding that fits your requirements.
Basically, your problem is actually "I have a number and want a short encoding of it". Decide on how many distinct characters you can use (the encoding base) and use that. There are many libraries on PyPI that could help; python-baseconv may come in handy.
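The relative sizes of the common encodings are easy to check (a quick sketch; the input string is arbitrary):

```python
import base64
import hashlib

digest = hashlib.sha256(b'some data').digest()
print(len(digest))                    # 32 raw bytes
print(len(digest.hex()))              # 64 characters as base16
print(len(base64.b32encode(digest)))  # 56 characters as base32 (with padding)
print(len(base64.b64encode(digest)))  # 44 characters as base64 (with padding)
```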
Is it always safe to remove trailing zero or null bytes from the end of a file? I'm worried this might corrupt a file that uses say UTF-16 encoding, or for some other reason.
And further, is it always safe to add trailing zero bytes to the end of a file?
As an example using Python I'd do this by stripping single bytes from the file's end until all zero bytes are removed:
with open('in.ext', 'rb') as file_in:
    with open('out.ext', 'wb') as file_out:
        data = file_in.read()
        while data.endswith(b'\x00'):
            data = data[:-1]
        file_out.write(data)
This is for the purpose of storing and retrieving arbitrary files on a portable storage medium. I was hoping to get away with padding half written byte blocks (a block contains 16 bytes) with zero bytes and then simply stripping the bytes off when reading the data back.
It really depends on what will use this file afterwards. Some software may require those null bytes at the end, for example if they are used for padding.
In the case of UTF-16, the number of bytes should be even, so you should manage them by pairs (looking for \x00\x00 instead of just \x00).
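For example, a perfectly valid UTF-16 file can legitimately end in a null byte (a minimal sketch):

```python
text = 'Hi'
data = text.encode('utf-16-le')
print(data)                    # b'H\x00i\x00'
assert data.endswith(b'\x00')  # stripping this byte would corrupt the last character
```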
It depends on the application writing and reading the files. Unless you know the exact format of the file, no modification is safe by default. If you know the format, it will be obvious if trailing 0s are needed or not.
After the question edit: "storing and retrieving arbitrary files" is by default incompatible with just randomly stripping bytes. It doesn't matter what the bytes are; you need to preserve the files as they were. If you need a proper padding scheme, have a look at the ones used in encryption algorithms - for example PKCS7.
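A minimal sketch of PKCS7-style padding (not a vetted implementation; for real use, prefer a library such as the cryptography package):

```python
def pkcs7_pad(data, block_size=16):
    # always pads, by 1..block_size bytes, each byte holding the pad length
    n = block_size - len(data) % block_size
    return data + bytes([n]) * n

def pkcs7_unpad(padded):
    # the last byte says how many padding bytes to remove
    return padded[:-padded[-1]]
```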
I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols, but does not say how to represent those numbers as bytes. (That is the purpose of an encoding.)
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
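For instance (Python 3; the Japanese text is just an illustration):

```python
text = 'こんにちは'
data = text.encode('shift-jis')          # the same text as Shift-JIS bytes
assert data != text.encode('utf-8')      # a different byte representation...
assert data.decode('shift-jis') == text  # ...but decoding recovers the same code points
```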
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
I am using the md5 function to hash a string into a 32 digit string.
str_to_encode = 'this is a test string which I want to encode'
encoded = hashlib.md5(str_to_encode).hexdigest()
I want to be able to decode this string (i.e. encoded in the example above) back to its original value. I don't think this is possible using md5 (but if it is, please let me know), but is there a compression function I can use which will give me a 32 digit string at the end but which can be reverted?
EDIT:
The string being encoded is a url so will only be a couple of hundred characters max although in most cases it will be a lot less.
Thanks
It seems to me that you aren't looking for a hash or encryption, you are looking for compression. Try zlib and base64 encoding:
import zlib

s = 'Hello, world'
encoded = zlib.compress(s).encode('base64')
The length of the encoded data will grow as the input grows, but it may work for you.
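Note that the 'base64' string codec shown above is Python 2 only; in Python 3, the equivalent would be (a sketch):

```python
import base64
import zlib

s = b'Hello, world'
encoded = base64.b64encode(zlib.compress(s))
decoded = zlib.decompress(base64.b64decode(encoded))
assert decoded == s  # lossless round trip
```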
Even restricting yourself to URLs, there's no way to reversibly map them to 32-character strings, there are just too many possible URLs.
You seem to want two things that can't coexist:
Any string of any length is converted to exactly 32-bytes, even if it started as 4gb
The encoded string is decodable without loss of information
There are only so many bits in an MD5 hash, so by the pigeonhole principle it's impossible to reverse it. If it were reversible, you could use a hash to compress information infinitely. Furthermore, irreversibility is the main point of a hash; hashes are intended to be one-way functions. Encryption algorithms are reversible, but require at least as many bytes to store the ciphertext, since decodability means they must be collision-free (two plaintexts can't encode to the same ciphertext, or the decode function wouldn't know which plaintext to output given that ciphertext).