Python base64 data decode and byte order convert - python

I am using Python's base64 module to decode base64-encoded data in an XML file. I find each piece of data (there are thousands of them; for example, in "ABC....", the "ABC..." is the base64-encoded data) and append it to a string, say s, then call base64.b64decode(s) to get the result. I am not sure what the result of the decoding is: a string, or bytes? In addition, how should I convert the decoded data from the so-called "network byte order" to "host byte order"? Thanks!

Each base64 encoded string should be decoded separately - you can't concatenate encoded strings (and get a correct decoding).
The result of the decode is a string, or byte buffer; in Python 2 they're equivalent (in Python 3, b64decode returns a bytes object).
Regarding the network/host order: sequences of bytes have no such 'order' (or endianness). It only matters when interpreting these bytes as words / ints of larger width (i.e. more than 8 bits).
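To make the byte-order point concrete, here is a minimal sketch. It assumes (the question does not say so) that each decoded payload is a run of 32-bit floats in network (big-endian) byte order; the helper name decode_network_floats is made up for illustration. The '!' prefix in the struct format asks Python to read the bytes as big-endian and return native values, so no separate swap step is needed.

```python
import base64
import struct

def decode_network_floats(encoded):
    """Decode one base64 value and read it as big-endian 32-bit floats.

    Assumption: the payload holds only 32-bit floats; struct raises
    struct.error if the byte count does not match the format.
    """
    raw = base64.b64decode(encoded)   # bytes in Python 3
    count = len(raw) // 4             # 4 bytes per 32-bit float
    return struct.unpack('!%df' % count, raw)

# Build a sample payload the same way a sender would.
sample = base64.b64encode(struct.pack('!2f', 1.5, 2.25))
print(decode_network_floats(sample))
```

If the values are integers or doubles instead, only the format character changes (for example 'i' or 'd'); the '!' prefix stays the same.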

Base64 stuff, encoded or not, is stored in strings. Byte order is only an issue if you're dealing with non-characters (C's int, short, long, float, etc.), and then I'm not sure how it would relate to this issue. Also, I don't think concatenating base64-encoded strings is valid.
>>> from base64 import *
>>> b64encode( "abcdefg" )
'YWJjZGVmZw=='
>>> b64decode( "YWJjZGVmZw==" )
'abcdefg'
>>> b64encode( "hijklmn" )
'aGlqa2xtbg=='
>>> b64decode( "aGlqa2xtbg==" )
'hijklmn'
>>> b64decode( "YWJjZGVmZw==aGlqa2xtbg==" )
'abcdefg'
>>> b64decode( "YWJjZGVmZwaGlqa2xtbg==" )
'abcdefg\x06\x86\x96\xa6\xb6\xc6\xd6\xe0'
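As a small Python 3 sketch of the advice to decode each value separately (the list chunks and the helper name decode_each are illustrative, not from the question), decode every encoded value on its own and join the decoded bytes afterwards:

```python
import base64

def decode_each(chunks):
    """Decode every base64 value on its own, then join the raw bytes."""
    return b''.join(base64.b64decode(chunk) for chunk in chunks)

chunks = ['YWJjZGVmZw==', 'aGlqa2xtbg==']   # the two values used above
print(decode_each(chunks))
```

Joining after decoding avoids the problem shown above, where padding in the middle of a concatenated string cuts off or corrupts the result.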

This guy has a good Python-based b64decode parser (Extracting peak-lists from mzXML in "Python"):
http://groups.google.com/group/spctools-discuss/browse_thread/thread/a8afd04e1a04cde4

Related

How to convert incorrectly encoded string to bytes?

I've got utf-8 string in the form of 'РїРѕРј'... - in Python 3 string. How can I decode it (to get correct string)?
As I see from error messages I can only convert string from bytes array, but how to get it then? I tried
bytes(str, 'ascii', errors='ignore')
so it should not change existing byte values, but it removed all "incorrect" characters (I suppose because they have codes >= 128).
The example string contains Russian 'пом'...
It looks like you have a string that has been encoded as UTF-8, then decoded as cp1251.
>>> s = 'пом'
>>> s.encode('utf-8').decode('cp1251')
'РїРѕРј'
You can get the original string by reversing the operation.
>>> e = 'РїРѕРј'
>>> e.encode('cp1251').decode('utf-8')
'пом'
If you want to encode the mojibake string as bytes, without losing information, use the backslashreplace error handler.
>>> e.encode('ascii', errors='backslashreplace')
b'\\u0420\\u0457\\u0420\\u0455\\u0420\\u0458'
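The reversal can be wrapped in a small helper. This is only a sketch: the function name fix_mojibake and its parameters are hypothetical, and it assumes you have guessed the two codecs correctly. A wrong guess usually raises UnicodeEncodeError or UnicodeDecodeError rather than returning text.

```python
def fix_mojibake(text, wrong_codec='cp1251', right_codec='utf-8'):
    """Undo text that was decoded with wrong_codec instead of right_codec."""
    return text.encode(wrong_codec).decode(right_codec)

original = 'пом'
# Reproduce the damage: UTF-8 bytes read as cp1251.
garbled = original.encode('utf-8').decode('cp1251')
print(fix_mojibake(garbled) == original)
```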

decode python binary string but not ensure ascii symbols

I have a binary object:
b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
and I want it to be printed in Unicode and not strictly using ASCII symbols.
There is a hacky way to do it:
decoded = string.decode()
parsed_to_dict = json.loads(decoded)
dumped = json.dumps(parsed_to_dict, ensure_ascii=False)
print(dumped)
>>> {"node": "Обновление"}
however the text will not always be parseable as JSON, so I need a simpler way.
Is there a way to print out my binary object (or a decoded Unicode string) as a non-ascii string without going trough parsing/dumping JSON?
For example, how to print this b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435' as Обновление?
A bytes string like
b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
has been encoded using Unicode escape sequences. To convert it back into a proper Unicode string you simply need to specify the 'unicode-escape' codec:
data = b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.decode('unicode-escape')
print(out)
output
Обновление
However, if data is already a Unicode string, then you first need to encode it to bytes. You can do that using the ascii codec, presuming data only contains ASCII characters. If it contains characters outside ASCII but within the range of \x80 to \xff you may be able to use the 'latin1' codec.
data = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.encode('ascii').decode('unicode-escape')
This should work so long as all the escapes are valid (no single \).
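Both cases can be folded into one helper. This is a sketch under the same assumptions as above: the input holds only ASCII characters plus valid \uXXXX escapes, and the name unescape_unicode is made up for illustration.

```python
def unescape_unicode(data):
    """Turn \\uXXXX escape sequences into the characters they stand for.

    Accepts bytes or str. A str is first encoded as ASCII, which raises
    UnicodeEncodeError if it contains non-ASCII characters.
    """
    if isinstance(data, str):
        data = data.encode('ascii')
    return data.decode('unicode-escape')

print(unescape_unicode(b'\\u041e\\u0431'))
print(unescape_unicode('\\u041e\\u0431'))
```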
import ast
bytes_object = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
unicode_string = ast.literal_eval("'{}'".format(bytes_object.decode()))
output:
'{"node": "Обновление"}}'

pycrypto not encrypting in ascii or unicode [duplicate]

Following this python example, I encode a string as Base64 with:
>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'
But, if I leave out the leading b:
>>> encoded = base64.b64encode('data to be encoded')
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\lib\base64.py", line 56, in b64encode
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
Why is this?
base64 encoding takes 8-bit binary byte data and encodes it using only the characters A-Z, a-z, 0-9, +, /* so it can be transmitted over channels that do not preserve all 8 bits of data, such as email.
Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.
If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data, it's not 8-bit. It's not really any bits, in fact. :-)
In your second example:
>>> encoded = base64.b64encode('data to be encoded')
All the characters fit neatly into the ASCII character set, and base64 encoding is therefore actually a bit pointless. You can convert it to ascii instead, with
>>> encoded = 'data to be encoded'.encode('ascii')
Or simpler:
>>> encoded = b'data to be encoded'
Which would be the same thing in this case.
* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the Variants summary table at Wikipedia for an overview.
Short Answer
You need to push a bytes-like object (bytes, bytearray, etc) to the base64.b64encode() method. Here are two ways:
>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Or with a variable:
>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Why?
In Python 3, str objects are not C-style character arrays (so they are not byte arrays); rather, they are sequences of Unicode characters with no inherent byte encoding. You can encode such a string in a variety of ways. The most common (and the default for .encode() in Python 3) is utf-8, which is backwards compatible with ASCII (as are many widely used encodings). That is what happens when you take a string and call the .encode() method on it: Python encodes the string as utf-8 (the default encoding) and gives you the bytes that it corresponds to.
Base-64 Encoding in Python 3
Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.
base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based off of the mathematical construct of radix-64 or base-64 number system, but they are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.
In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th symbol, '=', is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or four bits in the last chunk.
Example:
>>> data = b'test'
>>> for byte in data:
... print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>
If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):
base-2: 01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10: 1952805748
base-64: B 0 Z X N 0
base64 encoding, however, will re-group this data thusly:
base-2: 011101 000110 010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10: 29 6 21 51 29 0
base-64: d G V z d A
So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.
Let's test this to see if I am being dishonest:
>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
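The left-to-right regrouping described above can also be written out by hand. The sketch below is for illustration only (the name manual_b64encode is made up, and real code should just call base64.b64encode): it builds the bit string, pads it with zero bits to a multiple of six, looks each 6-bit chunk up in the standard alphabet, and then adds '=' until the length is a multiple of four.

```python
import base64
import string

# Standard base64 alphabet: index 0-63 maps to one character.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'

def manual_b64encode(data):
    """Encode bytes as base64 by explicit 6-bit regrouping (illustrative)."""
    bits = ''.join(format(byte, '08b') for byte in data)
    bits += '0' * (-len(bits) % 6)            # pad the last chunk with zeros
    chars = [ALPHABET[int(bits[i:i + 6], 2)]  # one character per 6-bit chunk
             for i in range(0, len(bits), 6)]
    encoded = ''.join(chars)
    return encoded + '=' * (-len(encoded) % 4)  # '=' padding to a multiple of 4

print(manual_b64encode(b'test'))
print(base64.b64encode(b'test').decode('ascii'))
```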
Why use base64 encoding?
Let's say I have to send some data to someone via email, like this data:
>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
>>> print(data)
b'\x04msg\x08\x08\x08   '
>>>
There are two problems I planted:
If I tried to send that email in Unix, the email would send as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII, you can see that the 'msg' is not there. That is because I used three BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the EOF character there the end user wouldn't be able to translate from the text on screen to the real, raw data.
This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.
If the data to be encoded contains "exotic" characters, I think you have to encode in "UTF-8"
encoded = base64.b64encode (bytes('data to be encoded', "utf-8"))
If the string is Unicode the easiest way is:
import base64
a = base64.b64encode(bytes(u'complex string: ñáéíóúÑ', "utf-8"))
# a: b'Y29tcGxleCBzdHJpbmc6IMOxw6HDqcOtw7PDusOR'
b = base64.b64decode(a).decode("utf-8", "ignore")
print(b)
# b :complex string: ñáéíóúÑ
There is all you need:
expected bytes, not str
The leading b makes your string binary.
What version of Python do you use? 2.x or 3.x?
Edit: See http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit for the gory details of strings in Python 3.x

How can I compress four floats into a string?

I would like to represent four floats e.g, 123.545, 56.234, -4534.234, 544.64 using the set of characters [a..z, A..Z, 0..9] in the shortest way possible so I can encode the four floats and store them in a filename. What is the most efficient to do this?
I've looked at base64 encoding which doesn't actually compress the result. I also looked at a polyline encoding algorithm which uses characters like ) and { and I can't have that.
You could use the struct module to store them as binary 32-bit floats, and encode the result into base64. In Python 2:
>>> import struct, base64
>>> base64.urlsafe_b64encode(struct.pack("ffff", 123.545,56.234,-4534.234,544.64))
'Chf3Qp7vYELfsY3F9igIRA=='
The == padding can be removed and re-added for decoding such that the length of the base64 string is a multiple of 4. You will also want to use URL-safe base64 to avoid the / character.
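Here is a Python 3 sketch of the same idea with the padding stripped and restored. The function names are hypothetical, and the '<' prefix (little-endian) is an added assumption so that a filename written on one machine decodes the same way on another. Note that 32-bit floats keep only about seven significant digits, and that URL-safe base64 can still produce '-' and '_'.

```python
import base64
import struct

def floats_to_token(values):
    """Pack floats as little-endian 32-bit values and base64 them, no padding."""
    packed = struct.pack('<%df' % len(values), *values)
    return base64.urlsafe_b64encode(packed).decode('ascii').rstrip('=')

def token_to_floats(token):
    """Restore the '=' padding, decode, and unpack the 32-bit floats."""
    padded = token + '=' * (-len(token) % 4)
    packed = base64.urlsafe_b64decode(padded)
    return struct.unpack('<%df' % (len(packed) // 4), packed)

token = floats_to_token([123.545, 56.234, -4534.234, 544.64])
print(token, len(token))
print(token_to_floats(token))
```

Four 32-bit floats are 16 bytes, so the token is 22 characters once the two '=' characters are removed.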

How do I convert a string to a buffer in Python 3.1?

I am attempting to pipe something to a subprocess using the following line:
p.communicate("insert into egg values ('egg');");
TypeError: must be bytes or buffer, not str
How can I convert the string to a buffer?
The correct answer is:
p.communicate(b"insert into egg values ('egg');");
Note the leading b, telling you that it's a string of bytes, not a string of unicode characters. Also, if you are reading this from a file:
value = open('thefile', 'rt').read()
p.communicate(value);
Then change that to:
value = open('thefile', 'rb').read()
p.communicate(value);
Again, note the 'b'.
Now if your value is a string you get from an API that only returns strings no matter what, then you need to encode it.
p.communicate(value.encode('latin-1'))
Latin-1, because unlike ASCII it supports all 256 bytes. But that said, having binary data in unicode is asking for trouble. It's better if you can make it binary from the start.
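A self-contained sketch of the whole round trip is below. The child process here is just a Python one-liner that copies stdin to stdout, standing in for whatever program p really is; the helper name pipe_text and the utf-8 default are assumptions for illustration.

```python
import subprocess
import sys

# Child command: copy raw bytes from stdin to stdout (stand-in for the real program).
ECHO_CHILD = [sys.executable, '-c',
              'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']

def pipe_text(text, encoding='utf-8'):
    """Encode a str to bytes and send it to a subprocess via communicate()."""
    child = subprocess.Popen(ECHO_CHILD,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
    out, _ = child.communicate(text.encode(encoding))
    return out   # bytes, because the pipes were opened in binary mode

print(pipe_text("insert into egg values ('egg');"))
```

Pick the encoding the receiving program expects; the pipe itself only ever sees bytes.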
You can convert it to bytes with encode method:
>>> "insert into egg values ('egg');".encode('ascii') # ascii is just an example
b"insert into egg values ('egg');"
