Some things that were trivial in Python 2 get a bit more tedious in Python 3. I am sending a string followed by some hex values:
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
This gives an error when sending, and I have read in other posts that the solution is to use sendall and encode:
s.sendall(buffer.encode("UTF-8"))
However, what is sent over the network for the hex values is their UTF-8 encoding:
c3 8a c3 be c3 8a c3 be
instead of the exact bytes I defined. How should I do this without using external libraries and possibly without having to "convert" the data into another structure?
I know this question has been widely asked, but I can't find a satisfying solution.
You may think Python 3 is making things more difficult, but the opposite is intended: you are running into charset enforcement. In Python 2 there were several ways to get confused between UTF-8, Unicode and raw bytes; that is now fixed.
First of all, if you need to send binary data, use the type made for it, which is bytes. In Python 3 it is sufficient to prefix your string literal with b. This should fix your problem:
buffer = b"ABCD"
buffer += b"\xCA\xFE\xCA\xFE"
s.sendall(buffer)
Of course, a bytes object has no encode method, since it already is binary data; it has the converse method, decode.
When you create a str object using quotes with no prefix, Python 3 gives you Unicode text (what the unicode type or the u prefix gave you in Python 2). That means you need to call the encode method to obtain binary data from it.
Instead, use bytes directly to store binary data: no encoding step occurs and the data stays exactly as you typed it.
The error message, can only concatenate str (not "bytes") to str, speaks for itself: Python refuses to concatenate str with bytes because the str side would first need an explicit encoding step for the + operation to be meaningful.
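As a quick illustration of the difference (a minimal sketch; the variable name is just for this example):
text = "\xCA\xFE"              # a str: two Unicode code points, U+00CA and U+00FE
print(text.encode("utf-8"))    # b'\xc3\x8a\xc3\xbe' - each code point becomes two bytes
print(b"\xCA\xFE")             # b'\xca\xfe' - a bytes literal keeps the exact byte values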
Based on the information in your question, you might be able to get away with encoding your data as latin-1, because this will not change any byte values:
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
payload = buffer.encode("latin-1")
print(payload)
b'ABCD\xca\xfe\xca\xfe'
On the other side, you could just decode from latin-1:
buffer = payload.decode('latin-1')
buffer
'ABCDÊþÊþ'
But you might prefer to keep the text and binary parts of your message as their respective types:
encoded_text = payload[:4]
encoded_text
b'ABCD'
text = encoded_text.decode('latin-1')
print(text)
ABCD
binary_data = payload[4:]
binary_data
b'\xca\xfe\xca\xfe'
If your text contains codepoints which cannot be encoded as latin-1 - '你好,世界' for example - you could follow the same approach, but you would need to encode the text as UTF-8 while encoding the binary data as 'latin-1'; the resulting bytes will need to be split into their text and binary sections and decoded separately.
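A minimal sketch of that mixed approach (the names are illustrative, and it assumes the receiver knows where the text part ends):
text = '你好,世界'
encoded_text = text.encode('utf-8')            # text part, UTF-8 encoded
payload = encoded_text + b'\xca\xfe\xca\xfe'   # binary part appended as exact byte values
# receiving side: split at the known boundary, decode only the text part
recovered_text = payload[:len(encoded_text)].decode('utf-8')
binary_data = payload[len(encoded_text):]      # b'\xca\xfe\xca\xfe'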
Finally: encoding string literals like '\xca\xfe\xca\xfe' is poor style in Python 3 - better to declare them as bytes literals like b'\xca\xfe\xca\xfe'.
I encoded some Chinese characters to UTF-8 and then converted the result into a string object for some operations. Now I have a problem when I try to convert that string object back into a bytes object.
I tried using bytes():
a = '一'
bytes_value = a.encode('utf-8')
string_value = str(bytes_value)
bytes_value_again = bytes(string_value)
I want to convert it back to a bytes object, so I can use decode('utf-8') to convert it back to the original Chinese characters.
You should not convert bytes objects to strings with str(bytes_value); that only creates a printable representation of the object, not the original text.
The proper way to convert from bytes to str is to decode the bytes to Unicode. If you have UTF-8 bytes, decode them with that codec using the bytes.decode() method:
string_value = bytes_value.decode('utf8')
You can also specify the encoding if you want to use the str() function, see the str(bytes_value, encoding) form in the documentation:
string_value = str(bytes_value, 'utf8')
If you accidentally used str(bytes_value) and can't now get the original value by fixing that error and re-running your code, you can recover the original value by using ast.literal_eval():
import ast

bytes_representation = str(bytes_value)  # "b'....'"
recovered_bytes_value = ast.literal_eval(bytes_representation)
This should only be used to recover data, not as a production-level serialisation mechanism. ast.literal_eval() is quite slow, and not safe from denial-of-service attacks when used on user-supplied input (it is possible to crash Python or significantly slow it down with bad input).
If you are using an API that should really work on bytes but for some reason only accepts strings (usually a warning sign of an incorrectly designed and implemented API), then either use a binary-to-ASCII encoding (such as base64 / base16 / base32 / base85) or decode the binary data as Latin-1.
This is even more important if you are doing this to encrypt data. The printable representation of a bytes() object only ever uses ASCII characters, always starts with b' or b", and always ends in ' or ". Any non-printable bytes (more than half of all 256 possible byte values) are represented with a limited range of \xhh hex escapes and single-letter escapes such as \n and \t. All this makes it much easier for a determined attacker to break your encryption. A best-practices encryption library will let you encrypt bytes directly; in fact, encrypting bytes is usually preferred.
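For example, a minimal sketch of both workarounds for an API that only accepts str (the variable names are just for illustration):
import base64

data = b'\xca\xfe\xca\xfe'
as_base64 = base64.b64encode(data).decode('ascii')  # 'yv7K/g==' - ASCII-safe text
as_latin1 = data.decode('latin-1')                  # every byte value maps to one code point

# the receiving side reverses whichever step was used
assert base64.b64decode(as_base64) == data
assert as_latin1.encode('latin-1') == data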
For example, given an arbitrary string - it could be ordinary characters or just random byte values:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course just typing b in front of the string literal would do, but I want to do this at runtime, from a variable containing the string of bytes.
If the given string were AAAA or some other known ASCII characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding
The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) does not help either: in Python 3 the default encoding for str.encode() is UTF-8, so you get exactly the same multi-byte result.
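To make the contrast concrete:
string = '\xf0\x9f\xa4\xb1'
print(string.encode('latin-1'))  # b'\xf0\x9f\xa4\xb1' - one byte per code point
print(string.encode('utf-8'))    # b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1' - multi-byte sequences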
I found a working solution
import struct

def convert_string_to_bytes(string):
    # assumes every character is a code point below 256
    result = b''
    for ch in string:
        result += struct.pack("B", ord(ch))
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'
The purpose of base64.b64encode() is to convert binary data into ASCII-safe "text". However, the method returns an object of type bytes:
>>> import base64
>>> base64.b64encode(b'abc')
b'YWJj'
It's easy to simply take that output and decode() it, but my question is: what is the significance of base64.b64encode() returning bytes rather than str?
The purpose of the base64.b64encode() function is to convert binary data into ASCII-safe "text"
Python disagrees with that - base64 has been intentionally classified as a binary transform.
It was a design decision in Python 3 to force the separation of bytes and text and prohibit implicit transformations. Python is now so strict about this that bytes.encode doesn't even exist, and so b'abc'.encode('base64') would raise an AttributeError.
The opinion the language takes is that a bytestring object is already encoded. A codec which encodes bytes into text does not fit into this paradigm, because when you want to go from the bytes domain to the text domain it's a decode. Note that rot13 encoding was also banished from the list of standard encodings for the same reason - it didn't fit properly into the Python 3 paradigm.
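You can still reach base64 through the codecs machinery, where it is exposed as a bytes-to-bytes transform (note the trailing newline this codec appends):
import codecs
codecs.encode(b'abc', 'base64')     # b'YWJj\n' - bytes in, bytes out
codecs.decode(b'YWJj\n', 'base64')  # b'abc'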
There also can be a performance argument to make: suppose Python automatically handled decoding of the base64 output, which is an ASCII-encoded binary representation produced by C code from the binascii module, into a Python object in the text domain. If you actually wanted the bytes, you would just have to undo the decoding by encoding into ASCII again. It would be a wasteful round-trip, an unnecessary double-negation. Better to 'opt-in' for the decode-to-text step.
It's impossible for b64encode() to know what you want to do with its output.
While in many cases you may want to treat the encoded value as text, in many others – for example, sending it over a network – you may instead want to treat it as bytes.
Since b64encode() can't know, it refuses to guess. And since the input is bytes, the output remains the same type, rather than being implicitly coerced to str.
As you point out, decoding the output to str is straightforward:
base64.b64encode(b'abc').decode('ascii')
... as well as being explicit about the result.
As an aside, it's worth noting that although base64.b64decode() (note: decode, not encode) has accepted str since version 3.3, the change was somewhat controversial.
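For example, on Python 3.3 and later both of these calls return the same bytes:
import base64
base64.b64decode(b'YWJj')  # b'abc'
base64.b64decode('YWJj')   # b'abc' - str input is accepted as well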
I have a UDP socket which receives datagrams of different lengths.
The first byte of the datagram specifies what type of data follows - say 64 means bool false, 65 means bool true, 66 means sint, 67 means int, and so on. Most data types have a known length, but for string and wstring the layout is: a type byte, then 2 bytes giving the length, followed by the actual characters. For wstring the type byte is 85, the next 2 bytes give the wstring length, and the actual wstring follows.
To parse this kind of wstring format, b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001', I used the following code:
data = str(rawdata[3:]).split("\\x00")
data = "".join(data[1:])
data = "".join(data[:-1])
Is this correct or any other simple way?
Besides receiving datagrams, I also need to send them, but I do not know how to build the datagrams, since socket.sendto requires bytes. If I convert the string to UTF-16, will that give me a wstring? If so, how would I add the rest of the information as bytes?
In the datagram above, U is 85, which means wstring; \x00\x07 is the length (7) of the wstring data; and \x00C\x00o\x00u\x00p\x00o\x00n\x001 is the actual string Coupon1.
A complete answer depends on exactly what you intend to do with the resulting data. Splitting the string with '\x00' (assuming that's what you meant to do? not sure I understand why there are two backslashes there) doesn't really make sense. The reason for using a wstring type in the first place is to be able to represent characters that aren't plain old 8-bit (really 7-bit) ascii. If you have any characters that aren't standard Roman characters, they may well have something other than a zero byte separating the characters in which case your split result will make no sense.
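For instance, a single character outside the ASCII range already breaks the assumption that characters are separated by zero bytes:
'€'.encode('utf-16-be')  # b' \xac' - code point U+20AC, no zero byte in sight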
Caveat: since you mentioned sendto requiring bytes, I assume you're using Python 3. Details will be slightly different under Python 2.
Anyway if I understand what it is you're meaning to do, the "utf-16-be" codec may be what you're looking for. (The "utf-16" codec puts a "byte order marker" at the beginning of the encoded string which you probably don't want; "utf-16-be" just puts the big-endian 16-bit chars into the byte string.) Decoding could be performed something like this:
rawdata = b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'
dtype = rawdata[0]
if dtype == 85:  # wstring
    dlen = ord(rawdata[1:3].decode('utf-16-be'))
    data = rawdata[3: (dlen * 2) + 3]
    dstring = data.decode('utf-16-be')
This will leave dstring as a Python unicode string; in Python 3, all strings are unicode, so you're done.
Encoding it could be done something like this:
tosend = 'Coupon1'
snd_data = bytearray([85]) # wstring indicator
snd_data += bytearray([(len(tosend) >> 8), (len(tosend) & 0xff)])
snd_data += tosend.encode('utf-16-be')
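Putting the two snippets together as a rough sanity check (using the example datagram from above), the round trip reproduces the original bytes:
rawdata = b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'
tosend = 'Coupon1'
encoded = bytes([85, len(tosend) >> 8, len(tosend) & 0xff]) + tosend.encode('utf-16-be')
print(encoded == rawdata)  # True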
Ok, I've found a lot of threads about how to convert a string from something like "\xe3" to "ã", but how the hell am I supposed to do it the other way around?
My concrete problem: I am using an API, and everything works great except that I provide some strings which then result in a JSON object. The result is sorted by the names (strings) I provided, however they are returned in their unicode representation, and JSON APIs always work in plain strings. So all I need is a way to get from "ã" to "\xe3", but I can't for the love of god get it to work.
Every type of encoding or decoding I try either defaults back to a normal string, a string without that character, a string with a plain A, or a unicode error that ascii can't decode it. (<- this was due to a horrible shell setup. Yay for old me.)
All I want is the plain encoded string!
(yea no not at all past me. All you want is the unicode representation of a character as string)
PS: All in python if that wasn't obvious from the title already.
Edit: Even though this is quite old I wanted to update this to not completely embarrass myself in the future.
The issue was an API which returned unicode representations of characters as strings in its response. All I wanted to do was check whether they were the same, but I had major issues getting Python to interpret the string as unicode, especially since those characters appeared inside a longer text that also contained backslashes.
This did help but I just stumbled across this horribly written question and just couldn't leave it like that.
"\xe3" in python is a string literal that represents a single byte with value 227:
>>> print len("\xe3")
1
>>> print ord("\xe3")
227
This single byte represents the 'ã' character in the latin-1 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
"ã" in python is a string literal consisting of two bytes: 0xC3, 0xA3 (195, 163):
>>> print len("ã")
2
>>> print ord("ã"[0])
195
>>> print ord("ã"[1])
163
This byte sequence is the UTF-8 encoding of the character "ã".
So, to go from "ã" in python to "\xe3", you first need to decode the utf-8 byte sequence into a python unicode string:
>>> "ã".decode("utf-8")
u'\xe3'
Now, you can take that unicode string and encode it however you like (e.g. into latin-1):
>>> "ã".decode("utf-8").encode("latin-1")
'\xe3'
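For completeness: the session above is Python 2. In Python 3, "ã" is already a one-character unicode string, so the equivalent is a single encode call:
'ã'.encode('latin-1')  # b'\xe3'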
Please read http://www.joelonsoftware.com/articles/Unicode.html. You should realize there is no such thing as "a plain encoded string"; there is only "a string encoded in a given text encoding". So you really need to understand the concepts of Unicode better.
Among other things, this is plain wrong: "The result is sorted after the names (strings) I provided however they are returned in encoded form." JSON uses Unicode, so you get the string in a decoded form.
Since I assume you are, perhaps unknowingly, working with UTF-8, you should be aware that \xe3 is the Unicode code point for the character ã. Not to be mistaken for the actual bytes that UTF-8 uses to reference that code point:
http://hexutf8.com/?q=U+e3
I.e. UTF-8 maps the byte sequence c3 a3 to the code point U+e3 which represents the character ã.
UTF-16 maps a different byte sequence, 00 e3 to that exact same code point. (Note how much simpler, but less space efficient the UTF-16 encoding is...)
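A quick Python 3 illustration of those two byte sequences:
'ã'.encode('utf-8')      # b'\xc3\xa3'
'ã'.encode('utf-16-be')  # b'\x00\xe3'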