pycrypto not encrypting in ascii or unicode [duplicate] - python

Following this python example, I encode a string as Base64 with:
>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'
But, if I leave out the leading b:
>>> encoded = base64.b64encode('data to be encoded')
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\lib\base64.py", line 56, in b64encode
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
Why is this?

base64 encoding takes 8-bit binary byte data and encodes it using only the characters A-Z, a-z, 0-9, +, /* so it can be transmitted over channels that do not preserve all 8 bits of data, such as email.
Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.
If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data, it's not 8-bit. It's not really any bits, in fact. :-)
In your second example:
>>> encoded = base64.b64encode('data to be encoded')
All the characters fit neatly into the ASCII character set, and base64 encoding is therefore actually a bit pointless. You can convert it to ascii instead, with
>>> encoded = 'data to be encoded'.encode('ascii')
Or simpler:
>>> encoded = b'data to be encoded'
Which would be the same thing in this case.
* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the Variants summary table at Wikipedia for an overview.

Short Answer
You need to push a bytes-like object (bytes, bytearray, etc) to the base64.b64encode() method. Here are two ways:
>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Or with a variable:
>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Why?
In Python 3, str objects are not C-style character arrays (so they are not byte arrays); rather, they are data structures that do not have any inherent encoding. You can encode that string (or interpret it) in a variety of ways. The most common (and the default in Python 3) is utf-8, especially since it is backwards compatible with ASCII (as are most widely used encodings). That is what happens when you take a string and call the .encode() method on it: Python encodes the string in utf-8 (the default encoding) and gives you the array of bytes that it corresponds to.
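A minimal sketch of that behaviour, assuming the default UTF-8 encoding; the variable names are only illustrative:

```python
text = 'na\u00efve'               # 'naïve', written with an escape for clarity

# With no argument, str.encode() uses UTF-8.
utf8_bytes = text.encode()
print(utf8_bytes)                 # b'na\xc3\xafve'

# The same text under a different encoding gives different bytes.
latin1_bytes = text.encode('latin-1')
print(latin1_bytes)               # b'na\xefve'

# Decoding with the matching codec recovers the original text.
print(utf8_bytes.decode('utf-8') == text)   # True
```

The str itself never changes; only the bytes produced from it depend on the codec you pick.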
Base-64 Encoding in Python 3
Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.
base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based off of the mathematical construct of radix-64 or base-64 number system, but they are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.
In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th symbol, '=', is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or four bits in the last chunk.
Example:
>>> data = b'test'
>>> for byte in data:
...     print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>
If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):
base-2: 01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10: 1952805748
base-64: B 0 Z X N 0
base64 encoding, however, will re-group this data thusly:
base-2: 011101 000110 010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10: 29 6 21 51 29 0
base-64: d G V z d A
So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.
Let's test this to see if I am being dishonest:
>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
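The regrouping described above can be sketched by hand. This is an illustrative re-implementation (the names ALPHABET and b64encode_sketch are made up for this example), not how the base64 module is actually written:

```python
import base64

ALPHABET = (
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'abcdefghijklmnopqrstuvwxyz'
    '0123456789+/'
)

def b64encode_sketch(data: bytes) -> bytes:
    # Write every byte as 8 bits, left to right.
    bits = ''.join(format(byte, '08b') for byte in data)
    # Pad with zero bits so the length is a multiple of 6.
    bits += '0' * (-len(bits) % 6)
    # Map each 6-bit chunk to one character of the alphabet.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad with '=' so the output length is a multiple of 4.
    chars.append('=' * (-len(chars) % 4))
    return ''.join(chars).encode('ascii')

print(b64encode_sketch(b'test'))                              # b'dGVzdA=='
print(b64encode_sketch(b'test') == base64.b64encode(b'test')) # True
```

Building a string of '0' and '1' characters is slow but keeps the 6-bit grouping visible; in real code use base64.b64encode.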
Why use base64 encoding?
Let's say I have to send some data to someone via email, like this data:
>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
>>> print(data)
b'\x04msg\x08\x08\x08   '
>>>
There are two problems I planted:
If I tried to send that email in Unix, the email would send as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII, you can see that the 'msg' is not there. That is because I used three BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the EOF character there the end user wouldn't be able to translate from the text on screen to the real, raw data.
This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.
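As a small sketch of that last point, the same awkward bytes survive a base64 round trip unchanged, and the encoded form contains only printable ASCII:

```python
import base64

data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'

encoded = base64.b64encode(data)

# Every byte of the encoded form is a printable ASCII character.
print(encoded.decode('ascii').isprintable())   # True

# Decoding gives back exactly the original bytes, control characters included.
print(base64.b64decode(encoded) == data)       # True
```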

If the data to be encoded contains "exotic" characters, I think you have to encode in "UTF-8"
encoded = base64.b64encode(bytes('data to be encoded', "utf-8"))

If the string is Unicode the easiest way is:
import base64
a = base64.b64encode(bytes(u'complex string: ñáéíóúÑ', "utf-8"))
# a: b'Y29tcGxleCBzdHJpbmc6IMOxw6HDqcOtw7PDusOR'
b = base64.b64decode(a).decode("utf-8", "ignore")
print(b)
# b :complex string: ñáéíóúÑ

There is all you need:
expected bytes, not str
The leading b makes your string binary.
What version of Python do you use? 2.x or 3.x?
Edit: See http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit for the gory details of strings in Python 3.x

Related

Use string as bytes [duplicate]

My problem is as follows:
I'm reading a .csv generated by some software, and I'm using Pandas to read it. Pandas reads the .csv properly, but one of the columns stores byte sequences representing vectors, and Pandas stores them as strings.
So I have data (a string) and I want to use np.frombuffer() to get the proper vector. The problem is that data is a string, so it's already encoded, and when I use .encode() to turn it into bytes, the sequence is not the original one.
Example: The .csv contains \x00\x00 representing the vector [0,0] with dtype=np.uint8. Pandas stores it as a string and when I try to process it something like this happens:
data = df.data[x] # With x any row.
type(data)
<class 'str'>
print(data)
\x00\x00
e_data = data.encode("latin1")
print(e_data)
b'\\x00\\x00'
v = np.frombuffer(e_data, np.uint8)
print(v)
array([ 92 120 48 48 92 120 48 48], dtype=uint8)
I just want to get b'\x00\x00' from data instead of b'\\x00\\x00' which I understand is a little encoding mess I have not been able to fix yet.
Any way to do this?
Thanks!
Issue: you (apparently) have a string that contains literal backslash escape sequences, such as:
>>> x = r'\x00' # note the use of a raw string literal
>>> x # Python's representation of the string escapes the backslash
'\\x00'
>>> print(x) # but it looks right when printing
\x00
From this, you wish to create a corresponding bytes object, wherein the backslash-escape sequences are translated into the corresponding byte.
Handling these kinds of escape sequences is done using the unicode-escape string encoding. As you may be aware, string encodings convert between bytes and str objects, specifying the rules for which byte sequences correspond to what Unicode code points.
However, the unicode-escape codec assumes that the escape sequences are on the bytes side of the equation and that the str side will have the corresponding Unicode characters:
>>> rb'\x00'.decode('unicode-escape') # create a string with a NUL char
'\x00'
Applying .encode to the string will reverse that process; so if you start with the backslash-escape sequence, it will re-escape the backslash:
>>> r'\x00'.encode('unicode-escape') # the result contains two backslashes, represented as four
b'\\\\x00'
>>> list(r'\x00'.encode('unicode-escape')) # let's look at the numeric values of the bytes
[92, 92, 120, 48, 48]
As you can see, that is clearly not what we want.
We want to convert from bytes to str to do the backslash-escaping. But we have a str to start, so we need to change that to bytes; and we want bytes at the end, so we need to change the str that we get from the backslash-escaping. In both cases, we need to make each Unicode code point from 0-255 inclusive, correspond to a single byte with the same value.
The encoding we need for that task is called latin-1, also known as iso-8859-1.
For example:
>>> r'\x00'.encode('latin-1')
b'\\x00'
Thus, we can reason out the overall conversion:
>>> r'\x00'.encode('latin-1').decode('unicode-escape').encode('latin-1')
b'\x00'
As desired: our str with a literal backslash, lowercase x and two zeros, is converted to a bytes object containing a single zero byte.
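Wrapped up as a helper, the chain above might look like the following sketch. The function name unescape_to_bytes is hypothetical, and it assumes the input only contains characters in the 0-255 range and escapes that resolve to that range:

```python
def unescape_to_bytes(text: str) -> bytes:
    """Turn literal escapes such as the four characters \\x00 into real bytes."""
    # latin-1 maps code points 0-255 to the same byte values, so it acts as a
    # lossless bridge between str and bytes on both sides of unicode-escape.
    return text.encode('latin-1').decode('unicode-escape').encode('latin-1')

raw = r'\x00\x00'                 # eight characters, as read from the CSV
print(unescape_to_bytes(raw))                 # b'\x00\x00'
print(list(unescape_to_bytes(r'\x01\xff')))   # [1, 255]
```

The resulting bytes object can then be handed to np.frombuffer() as in the question. Input containing characters above U+00FF, or escapes like \u0100, will raise UnicodeEncodeError.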
Alternately: we can request that backslash-escapes are processed while decoding, by using escape_decode from the codecs standard library module. However, this isn't documented and isn't really meant to be used that way - it's internal stuff used to implement the unicode-escape codec and possibly some other things.
If you want to expose yourself to the risk of that breaking in the future, it looks like:
>>> import codecs
>>> codecs.escape_decode(r'\x00\x00')
(b'\x00\x00', 8)
We get a 2-tuple, with the desired bytes and what I assume is the number of Unicode code points that were decoded (i.e. the length of the string). From my testing, it appears that it can only use UTF-8 encoding for the non-backslash sequences (but this could be specific to how Python is configured), and you can't change this; there is no actual parameter to specify the encoding, for a decode method. Like I said - not meant for general use.
Yes, all of that is as awkward as it seems. The reason you don't get easy support for this kind of thing is that it isn't really how you're intended to design your system. Fundamentally, all data is bytes; text is an abstraction that is encoded by that byte data. Using a single byte (with value 0) to represent four characters of text (the symbols \, x, 0 and 0) is not a normal encoding, and not a reversible one (how do I know whether to decode the byte as those four characters, or as a single NUL character?). Instead, you should strongly consider using some other friendly string representation of your data (perhaps a plain hex dump) and a non-text-encoding-related way to parse it. For example:
>>> data = '41 42' # a string in a simple hex dump format
>>> bytes.fromhex(data) # support is built-in, and works simply
b'AB'
>>> list(bytes.fromhex(data))
[65, 66]
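If you control the program that writes the file, a hex dump round-trips cleanly in both directions. A short sketch, with made-up sample values:

```python
vector = bytes([0, 0, 255])

# Serialise as a hex dump before writing it to a text file such as a CSV.
as_text = vector.hex()
print(as_text)                                # 0000ff

# Parse it back; fromhex ignores whitespace between byte pairs.
print(bytes.fromhex(as_text) == vector)       # True
print(list(bytes.fromhex('00 00 ff')))        # [0, 0, 255]
```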

subprocess.check_output is returning an enclosing b' ' string [duplicate]

Apparently, the following is the valid syntax:
b'The string'
I would like to know:
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP though, and it states the b is used to indicate the string is binary, as opposed to Unicode, which was needed for code to be compatible from version of PHP < 6, when migrating to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you're familiar with:
Java or C#, think of str as String and bytes as byte[];
SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
Windows registry, think of str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there
more symbols than the b and u that do
other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in
Python 2; it indicates that the
literal should become a bytes literal
in Python 3 (e.g. when code is
automatically converted with 2to3). A
'u' or 'b' prefix may be followed by
an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
The b denotes a byte string.
Bytes are the actual data. Strings are an abstraction.
If you had a multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on the encoding.
If you took 1 byte from a byte string, you'd get a single 8-bit value from 0-255, and it might not represent a complete character if the encoding uses more than 1 byte for that character.
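A short sketch of that difference, using the euro sign (three bytes in UTF-8) as an arbitrary example:

```python
text = '\u20ac'                   # the euro sign
encoded = text.encode('utf-8')    # b'\xe2\x82\xac'

# Indexing a str gives a one-character str.
print(text[0] == '\u20ac')        # True

# Indexing bytes gives an integer between 0 and 255.
print(encoded[0])                 # 226

# Cutting the bytes in the middle of a character breaks decoding.
try:
    encoded[:1].decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                 # False
```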
TBH I'd use strings unless I had some specific low level reason to use bytes.
On the server side, any response we send is sent as bytes, so it will appear on the client as b'Response from server'.
To get rid of the b'....', simply use the code below:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
Then it will print Response from server.
The answer to the question is that the b prefix does the same as:
data.encode()
and to decode it (removing the b, because sometimes you don't need it), use:
data.decode()
Here's an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
Adding a b prefix would fix the problem.
It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.
The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).
In addition to what others have said, note that a single character in unicode can consist of multiple bytes.
The way UTF-8 (the most common Unicode encoding) works is that it took the old ASCII format (a 7-bit code whose bytes look like 0xxx xxxx) and added multi-byte sequences in which every byte starts with 1 (1xxx xxxx) to represent characters beyond ASCII, so that it would be backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
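To see those bit patterns directly, here is a small sketch; utf8_bits is a hypothetical helper written for this illustration:

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of char as space-separated 8-bit groups."""
    return ' '.join(format(byte, '08b') for byte in char.encode('utf-8'))

# '\u00d6' is the letter Ö from the example above.
print(utf8_bits('\u00d6'))        # 11000011 10010110
print(utf8_bits('l'))             # 01101100
```

The ASCII letter starts with a 0 bit, while both bytes of the non-ASCII letter start with a 1 bit.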
You can use JSON to convert it to dictionary
import json
data = b'{"key":"value"}'
print(json.loads(data))
{'key': 'value'}
FLASK:
This is an example from Flask. Run this from a Python prompt:
import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})
In flask/routes.py:
import json
from flask import request
@app.route('/', methods=['POST'])
def api_script_add():
    print(request.data)  # --> b'{"key": "value"}'
    print(json.loads(request.data))
    return json.loads(request.data)
{'key': 'value'}
b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers, which, if you mapped them to a character table, would look like h e l l o. However the value itself is not a string, Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers itself. This saves you some typing, and also often byte sequences are meant to be interpreted as characters. However, this is not always the case - for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.
.encode() and .decode() convert between strings and bytes.
bytes(somestring.encode()) is the solution that worked for me in python 3.
def compare_types():
    output = b'sometext'
    print(output)
    print(type(output))
    somestring = 'sometext'
    encoded_string = somestring.encode()
    output = bytes(encoded_string)
    print(output)
    print(type(output))
compare_types()

Python-3.x - Converting a string representation of a bytearray back to a string

The back-story here is a little verbose, but basically I want to take a string like b'\x04\x0e\x1d' and cast it back into a bytearray.
I am working on a basic implementation of a one-time pad, where I take a plaintext A and shared key B to generate a ciphertext C according to the equation A⊕B=C. Then I reverse the process with the equation C⊕B=A.
I've already found plenty of python3 functions to encode strings as bytes and then xor the bytes, such as the following:
def xor_strings(xs, ys):
    return "".join(chr(ord(x) ^ ord(y)) for x, y in zip(xs, ys)).encode()
A call to xor_strings() then returns a bytearray:
print( xor_strings("foo", "bar"))
But when I print it to the screen, what I'm shown is actually a string. So I'm assuming that python is just calling some str() function on the bytearray, and I get something that looks like the following:
b'\x04\x0e\x1d'
Herein lies the problem. I want to create a new bytearray from that string. Normally I would just call decode() on the bytearray. But if I enter b'\x04\x0e\x1d' as input, python sees it as a string, not a bytearray!
How can I take a string like b'\x04\x0e\x1d' as user input and cast it back into a bytearray?
As discussed in the comments, use base64 to send binary data in text form.
import base64
def xor_strings(xs, ys):
    return "".join(chr(ord(x) ^ ord(y)) for x, y in zip(xs, ys)).encode()
# ciphertext is bytes
ciphertext = xor_strings("foo", "bar")
# >>> b'\x04\x0e\x1d'
# ciphertext_b64 is *still* bytes, but only "safe" ones (in the printable ASCII range)
ciphertext_b64 = base64.encodebytes(ciphertext)
# >>> b'BA4d\n'
Now we can transfer the bytes:
# ...we could interpret them as ASCII and print them somewhere
safe_string = ciphertext_b64.decode('ascii')
# >>> BA4d
# ...or write them to a file (or a network socket)
with open('/tmp/output', 'wb') as f:
    f.write(ciphertext_b64)
And the recipient can retrieve the original message by:
# ...reading bytes from a file (or a network socket)
with open('/tmp/output', 'rb') as f:
    ciphertext_b64_2 = f.read()
# ...or by reading bytes from a string
ciphertext_b64_2 = safe_string.encode('ascii')
# >>> b'BA4d\n'
# and finally decoding them into the original message
# (base64.decodestring was removed in Python 3.9; decodebytes is its replacement)
ciphertext_2 = base64.decodebytes(ciphertext_b64_2)
# >>> b'\x04\x0e\x1d'
Of course, when it comes to writing bytes to a file or to the network, encoding them as base64 first is superfluous. You can write/read the ciphertext directly if it's the only file content. Only if the ciphertext is part of a higher-level structure (JSON, XML, a config file...) does encoding it as base64 become necessary again.
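As a sketch of the "higher structure" case, here is one way to carry the ciphertext inside JSON. The key name 'ciphertext' is an arbitrary choice for this example:

```python
import base64
import json

ciphertext = b'\x04\x0e\x1d'

# JSON has no bytes type, so store the base64 text instead.
message = json.dumps({'ciphertext': base64.b64encode(ciphertext).decode('ascii')})
print(message)                    # {"ciphertext": "BA4d"}

# The recipient reverses both steps.
received = base64.b64decode(json.loads(message)['ciphertext'])
print(received == ciphertext)     # True
```

This uses b64encode rather than encodebytes so that no trailing newline ends up inside the JSON string.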
A note on the use of the words "decode" and "encode".
To encode a string means to turn it from its abstract meaning ("a list of characters") into a storable representation ("a list of bytes"). The exact result of this operation depends on the byte encoding that is being used. For example:
ASCII encoding maps one character to one byte (as a trade-off it can't map all characters that can exist in a Python string).
UTF-8 encoding maps one character to 1-4 bytes, depending on the character.
To decode a byte array means turning it from "a list of bytes" back into "a list of characters" again. This of course requires prior knowledge of what the byte encoding originally was.
ciphertext_b64 above is a list of bytes and is represented as b'BA4d\n' on the Python console.
Its string equivalent, safe_string, looks very similar 'BA4d\n' when printed to the console due to the fact that base64 is a sub-set of ASCII.
The data types however are still fundamentally different. Don't let the console output deceive you.
Responding to that final question only.
>>> type(b'\x04\x0e\x1d')
<class 'bytes'>
>>> bytearray(b'\x04\x0e\x1d')
bytearray(b'\x04\x0e\x1d')
>>> type(bytearray(b'\x04\x0e\x1d'))
<class 'bytearray'>
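If the input really arrives as text that looks like a bytes literal, one option is ast.literal_eval, which parses Python literals without running arbitrary code. A minimal sketch, assuming the user types a well-formed literal:

```python
import ast

user_input = r"b'\x04\x0e\x1d'"   # the characters a user would type

# literal_eval only accepts literals, so check that we got bytes back.
value = ast.literal_eval(user_input)
if not isinstance(value, bytes):
    raise ValueError('expected a bytes literal')

ciphertext = bytearray(value)
print(ciphertext)                 # bytearray(b'\x04\x0e\x1d')
```

Malformed input raises ValueError or SyntaxError, so real code would want to catch both. Exchanging base64 text, as suggested above, avoids the parsing step altogether.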

What is the difference between a string and a byte string?

I am working with a library which returns a "byte string" (bytes) and I need to convert this to a string.
Is there actually a difference between those two things? How are they related, and how can I do the conversion?
The only thing that a computer can store is bytes.
To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:
If you want to store music, you must first encode it using MP3, WAV, etc.
If you want to store a picture, you must first encode it using PNG, JPEG, etc.
If you want to store text, you must first encode it using ASCII, UTF-8, etc.
MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.
In Python, a byte string is just that: a sequence of bytes. It isn't human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.
On the other hand, a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer, it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.
'I am a string'.encode('ASCII')
The above Python code will encode the string 'I am a string' using the encoding ASCII. The result of the above code will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren't human-readable, it's just that Python decodes them from ASCII when you print them. In Python, a byte string is represented by a b, followed by the byte string's ASCII representation.
A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.
b'I am a string'.decode('ASCII')
The above code will return the original string 'I am a string'.
Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.
Assuming Python 3 (in Python 2, this difference is a little less well-defined) - a string is a sequence of characters, ie unicode codepoints; these are an abstract concept, and can't be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes - things that can be stored on disk. The mapping between them is an encoding - there are quite a lot of these (and infinitely many are possible) - and you need to know which applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'
Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:
>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
Note: I will focus this answer on Python 3, since the end of life of Python 2 is very close.
In Python 3
bytes consists of sequences of 8-bit unsigned values, while str consists of sequences of Unicode code points that represent textual characters from human languages.
>>> # bytes
>>> b = b'h\x65llo'
>>> type(b)
<class 'bytes'>
>>> list(b)
[104, 101, 108, 108, 111]
>>> print(b)
b'hello'
>>>
>>> # str
>>> s = 'nai\u0308ve'
>>> type(s)
<class 'str'>
>>> list(s)
['n', 'a', 'i', '̈', 'v', 'e']
>>> print(s)
naïve
Even though bytes and str seem to work the same way, their instances are not compatible with each other, i.e, bytes and str instances can't be used together with operators like > and +. In addition, keep in mind that comparing bytes and str instances for equality, i.e. using ==, will always evaluate to False even when they contain exactly the same characters.
>>> # concatenation
>>> b'hi' + b'bye' # this is possible
b'hibye'
>>> 'hi' + 'bye' # this is also possible
'hibye'
>>> b'hi' + 'bye' # this will fail
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> 'hi' + b'bye' # this will also fail
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>>
>>> # comparison
>>> b'red' > b'blue' # this is possible
True
>>> 'red'> 'blue' # this is also possible
True
>>> b'red' > 'blue' # you can't compare bytes with str
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'bytes' and 'str'
>>> 'red' > b'blue' # you can't compare str with bytes
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'str' and 'bytes'
>>> b'blue' == 'red' # equality between str and bytes always evaluates to False
False
>>> b'blue' == 'blue' # equality between str and bytes always evaluates to False
False
Another issue when dealing with bytes and str arises when working with files that are returned by the open built-in function. On one hand, if you want to read or write binary data to/from a file, always open the file using a binary mode like 'rb' or 'wb'. On the other hand, if you want to read or write Unicode data to/from a file, be aware of the default encoding of your computer, and if necessary pass the encoding parameter to avoid surprises.
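A small sketch of both modes, writing to a throwaway temporary file; the file name demo.txt is arbitrary:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
text = 'na\u00efve'               # 'naïve', written with an escape for clarity

# Text mode: pass the encoding explicitly instead of relying on the default.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Binary mode returns the raw bytes that were stored on disk.
with open(path, 'rb') as f:
    raw = f.read()
print(raw)                        # b'na\xc3\xafve'

# Text mode with the same encoding returns the original str.
with open(path, 'r', encoding='utf-8') as f:
    restored = f.read()
print(restored == text)           # True
```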
In Python 2
str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if str only consists of 7-bit ASCII characters.
It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.
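For Python 3, such helpers could look like the following sketch. The names to_str and to_bytes and the UTF-8 default are assumptions made for this example:

```python
def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes or bytearray."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

def to_bytes(value, encoding='utf-8'):
    """Return str input encoded to bytes; other values pass through unchanged."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value

print(to_str(b'hello'))           # hello
print(to_bytes('hello'))          # b'hello'
```

Calling these at the boundaries of a program (file, socket, subprocess) lets the rest of the code work with one type only.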
Let's have a simple one-character string 'š' and encode it into a sequence of bytes:
>>> 'š'.encode('utf-8')
b'\xc5\xa1'
For the purpose of this example, let's display the sequence of bytes in its binary form:
>>> bin(int(b'\xc5\xa1'.hex(), 16))
'0b1100010110100001'
Now it is generally not possible to decode the information back without knowing how it was encoded. Only if you know that the UTF-8 text encoding was used, you can follow the algorithm for decoding UTF-8 and acquire the original string:
11000101 10100001
   ^^^^^   ^^^^^^
   00101   100001
You can display the binary number 101100001 back as a string:
>>> chr(int('101100001', 2))
'š'
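The same two-byte decoding step can be sketched with bit masks. This only handles the two-byte case shown above, not UTF-8 in general:

```python
first, second = b'\xc5\xa1'       # unpacking bytes yields two integers

# A two-byte sequence has the form 110xxxxx 10yyyyyy.
assert first >> 5 == 0b110 and second >> 6 == 0b10

# Keep the 5 payload bits of the first byte and the 6 of the second.
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
print(code_point, hex(code_point))            # 353 0x161
print(chr(code_point) == b'\xc5\xa1'.decode('utf-8'))   # True
```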
From What is Unicode?:
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.
......
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
So when a computer represents a string, it looks up each character by its unique Unicode number, and those numbers are what is held in memory. But you can't directly write the string to disk or transmit it over a network as Unicode numbers, because those numbers are just plain integers. You have to encode the string into a byte string first, for example with UTF-8. UTF-8 is a character encoding capable of encoding all possible characters, and it stores characters as bytes. The encoded string can be used almost everywhere because UTF-8 is supported nearly everywhere. When you open a UTF-8 encoded text file from another system, your computer decodes it and displays its characters through their unique Unicode numbers.
When a browser receives string data encoded in UTF-8 from the network, it decodes the data to a string (assuming the browser is using UTF-8) and displays the string.
In Python 3, you can transform string and byte string to each other:
>>> print('中文'.encode('utf-8'))
b'\xe4\xb8\xad\xe6\x96\x87'
>>> print(b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))
中文
In a word, string is for displaying to humans to read on a computer and byte string is for storing to disk and data transmission.
Unicode is an agreed-upon format for the binary representation of characters and various kinds of formatting (e.g., lower case/upper case, new line, and carriage return), and other "things" (e.g., emojis). A computer is no less capable of storing a Unicode representation (a series of bits), whether in memory or in a file, than it is of storing an ASCII representation (a different series of bits), or any other representation (series of bits).
For communication to take place, the parties to the communication must agree on what representation will be used.
Because Unicode seeks to represent all the possible characters (and other "things") used in inter-human and inter-computer communication, it requires a greater number of bits for the representation of many characters (or things) than other systems of representation that seek to represent a more limited set of characters/things. To "simplify," and perhaps to accommodate historical usage, Unicode representation is almost exclusively converted to some other system of representation (e.g., UTF-8 or ASCII) for the purpose of storing characters in files.
It is not the case that Unicode cannot be used for storing characters in files, or transmitting them through any communications channel. It is simply that it is not.
The term "string" is not precisely defined. In its common usage, "string" refers to a sequence of characters/things. In a computer, those characters may be stored in any one of many different bit-by-bit representations. A "byte string" is a sequence of eight-bit units (eight bits being referred to as a byte); a single character may occupy one byte or several, depending on the encoding. Since, these days, programs typically hold characters in memory as Unicode and write them to files as byte strings, a conversion (an encoding) must be applied before characters represented in memory are moved into storage in files.
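To make the "different bit-by-bit representations" concrete, here is a small sketch that encodes four sample characters (picked arbitrarily for illustration) with three standard encodings and counts the resulting bytes. Each character is a single item in the str, but its byte length depends on the encoding.

```python
# Four arbitrary sample characters: a, e-acute, a CJK character, an emoji.
samples = ['a', '\u00e9', '\u4e2d', '\U0001F600']
encodings = ('utf-8', 'utf-16-le', 'utf-32-le')

sizes = {}
for ch in samples:
    # len(ch) is always 1 here; the encoded length varies.
    sizes[ch] = {enc: len(ch.encode(enc)) for enc in encodings}
    print('U+%04X' % ord(ch), sizes[ch])

# U+0061 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+4E2D {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```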
Put simply, think of our natural languages: English, Bengali, Chinese, etc. When spoken, all of these languages are just sound. But do we understand all of them just because we can hear them?
The answer is generally no. So, if I say I understand English, it means that I know how those sounds are encoded into meaningful English words, and I decode the sounds in the same way to understand them. The same goes for any other language. If you know it, you have the encoder-decoder pack for that language in your mind; if you don't know it, you simply don't have that pack.
The same goes for digital systems. Just as we can only hear sounds with our ears and make sounds with our mouths, computers can only store bytes and read bytes. So a given application knows how to read bytes and interpret them (for example, how many bytes to take together to form one piece of information), and it writes bytes in the same way so that its fellow applications can understand them too. But without that understanding (the encoder-decoder), all data written to a disk is just strings of bytes.
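A minimal sketch of that idea: the same six bytes are handed to three different decoders. The choice of UTF-8, Latin-1, and ASCII here is only for illustration.

```python
data = b'\xe4\xb8\xad\xe6\x96\x87'

# The decoder the writer intended: two Chinese characters.
as_utf8 = data.decode('utf-8')

# A decoder for a different "language": every byte becomes one character,
# so we get six unrelated characters and no error.
as_latin1 = data.decode('latin-1')

print(len(as_utf8), len(as_latin1))  # 2 6
print(as_utf8 == as_latin1)          # False

# A decoder that does not understand these bytes at all.
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    print('ascii cannot interpret these bytes:', exc.reason)
```

The bytes never change; only the interpretation applied to them does.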
A string is a bunch of items strung together; the word is synonymous with "sequence". A byte string is a sequence of bytes, like b'\xce\xb1\xce\xac', which represents "αά" in UTF-8. A character string is a sequence of characters, like "αά".
A byte string can be stored directly on disk, while a string (character string) cannot. The mapping between the two is an encoding.
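The following sketch shows the difference in practice. An in-memory io.BytesIO object stands in for a file opened in binary mode (an assumption made to keep the example self-contained), and UTF-8 is the assumed encoding.

```python
import io

text = 'αά'
buffer = io.BytesIO()  # stands in for a file opened in binary mode

try:
    buffer.write(text)  # a str has no byte form until it is encoded
except TypeError as exc:
    print('cannot write str:', exc)

encoded = text.encode('utf-8')  # apply the mapping: characters -> bytes
buffer.write(encoded)

print(buffer.getvalue())                          # b'\xce\xb1\xce\xac'
print(buffer.getvalue().decode('utf-8') == text)  # True
```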
The Python language includes str and bytes as standard built-in types. In other words, they are both classes. I don't think it's worthwhile trying to rationalize why Python has been implemented this way.
Having said that, str and bytes are very similar to one another. Both share most of the same methods. The following methods are unique to the str class:
casefold
encode
format
format_map
isdecimal
isidentifier
isnumeric
isprintable
The following methods are unique to the bytes class:
decode
fromhex
hex
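Lists like these can be regenerated rather than memorized. The sketch below compares the public attribute names of the two classes; public_methods is a helper name invented for this example, and the exact output depends on your Python version, so none is shown.

```python
def public_methods(cls):
    """Return the names on cls that do not start with an underscore."""
    return {name for name in dir(cls) if not name.startswith('_')}

# Names present on one class but not the other.
str_only = sorted(public_methods(str) - public_methods(bytes))
bytes_only = sorted(public_methods(bytes) - public_methods(str))

print('only on str:  ', str_only)
print('only on bytes:', bytes_only)
```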

Python base64 data decode and byte order convert

I am using the Python base64 module to decode a base64-encoded XML file. What I did was find each piece of data (there are thousands of them; for example, in "ABC....", the "ABC..." is the base64-encoded data) and append it to a string, say s. Then I call base64.b64decode(s) to get the result. I am not sure what the result of the decoding is: a string, or bytes? In addition, how should I convert the decoded data from the so-called "network byte order" to "host byte order"? Thanks!
Each base64 encoded string should be decoded separately - you can't concatenate encoded strings (and get a correct decoding).
The result of the decode is a string, or byte buffer; in Python 2 they're equivalent (in Python 3, b64decode returns a bytes object).
Regarding network/host order: sequences of bytes have no such "order" (or endianness). It only matters when interpreting these bytes as words/ints of larger width (i.e. more than 8 bits).
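If the decoded bytes do hold wider values, the struct module can interpret them in a stated byte order. The sketch below is written for Python 3 and assumes a made-up payload of two 32-bit unsigned integers in network (big-endian) order; the real layout of your data has to come from the file format's specification.

```python
import base64
import struct

# Hypothetical payload: two 32-bit unsigned ints packed in network order.
payload = struct.pack('!2I', 1, 258)
encoded = base64.b64encode(payload)

raw = base64.b64decode(encoded)  # a bytes object in Python 3
print(type(raw).__name__, raw)   # bytes b'\x00\x00\x00\x01\x00\x00\x01\x02'

# '!' means network (big-endian) order; struct converts to native ints,
# so no separate "network to host" swap is needed afterwards.
values = struct.unpack('!2I', raw)
print(values)  # (1, 258)

# Reading the same bytes as little-endian gives different numbers.
wrong = struct.unpack('<2I', raw)
print(wrong)   # (16777216, 33619968)
```

The bytes themselves are identical in both unpack calls; only the interpretation as 32-bit words differs.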
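Base64 data, encoded or not, is stored in strings (that is the case in Python 2, which the session below uses; in Python 3 the data is bytes, and b64encode requires a bytes argument). Byte order is only an issue if you're dealing with multi-byte non-character values (C's int, short, long, float, etc.), and then I'm not sure how it would relate to this issue. Also, I don't think concatenating base64-encoded strings is valid.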
>>> from base64 import *
>>> b64encode( "abcdefg" )
'YWJjZGVmZw=='
>>> b64decode( "YWJjZGVmZw==" )
'abcdefg'
>>> b64encode( "hijklmn" )
'aGlqa2xtbg=='
>>> b64decode( "aGlqa2xtbg==" )
'hijklmn'
>>> b64decode( "YWJjZGVmZw==aGlqa2xtbg==" )
'abcdefg'
>>> b64decode( "YWJjZGVmZwaGlqa2xtbg==" )
'abcdefg\x06\x86\x96\xa6\xb6\xc6\xd6\xe0'
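For Python 3, a minimal sketch of the safer approach is to decode each encoded value on its own and join the decoded bytes afterwards. The two chunks are the same sample values used in the session above.

```python
import base64

# Each element is one independently encoded value.
chunks = [b'YWJjZGVmZw==', b'aGlqa2xtbg==']

# Decode each chunk separately, then concatenate the decoded bytes.
decoded = b''.join(base64.b64decode(chunk) for chunk in chunks)
print(decoded)  # b'abcdefghijklmn'
```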
This guy has a good Python-based b64decode parser: http://groups.google.com/group/spctools-discuss/browse_thread/thread/a8afd04e1a04cde4 (Extracting peak-lists from mzXML in "Python")
