Python convert strings of bytes to byte array


For example, given an arbitrary string, which could be characters or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course, just typing the b prefix before the string literal will do. But I want to do this at runtime, or from a variable containing the string of bytes.
If the given string were AAAA or some other known characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding

The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) does not avoid an encoding either: in Python 3 it simply defaults to UTF-8, which gives the unwanted result described above.
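To make the difference concrete, here is a minimal Python 3 sketch comparing the encodings on the string from the question. The variable names are only illustrative; bytes(map(ord, s)) is shown as one possible way to get the same result without naming a codec, not as a recommendation over Latin-1.

```python
s = '\xf0\x9f\xa4\xb1'

# Latin-1 maps each code point 0x00-0xff to the byte with the same value.
via_latin1 = s.encode('latin-1')

# Equivalent without naming a codec: build bytes from the code points.
# This raises ValueError if any code point is above 255.
via_bytes = bytes(map(ord, s))

# UTF-8 turns every code point above 0x7f into a multi-byte sequence.
via_utf8 = s.encode('utf-8')

print(via_latin1)
print(via_bytes)
print(via_utf8)
```

Both of the first two approaches fail on code points above 0xff, which is appropriate here because such a string cannot be a "string of bytes" in the first place.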

I found a working solution
import struct

def convert_string_to_bytes(string):
    # Pack each character's code point (must be 0-255) as one unsigned byte.
    result = b''
    for ch in string:
        result += struct.pack("B", ord(ch))
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'

Related

Encode a string to fixed-width unicode UCS-2 in Python

I need a fixed-width string encoding. From what I understood, UCS-2 and UCS-4 (also, ASCII) are such fixed-width encodings.
From what I understood, Python only supports a variable-width UTF-16 via s.encode('utf_16_le'). Is it true? Is there an easy way to encode into a unicode fixed-width encoding?
Context: I'm storing a string array in raw bytes and need a way to index into it to recover original strings. Index calculation is easier when all characters are fixed-width.
strings = ['asd', 'def']
# ascii
bytelens = list(map(len, strings))
bytes = ''.join(strings).encode('ascii')
# utf8
bytelens = []
bytes = bytearray()
for s in strings:
    e = s.encode('utf-8')
    bytelens.append(len(e))
    bytes.extend(e)
# i need bytelens to later recover original strings from the array bytes
As you can see, ASCII variant is very simple, and UTF-8 is more convoluted and 20% slower (probably because of many allocations and function calls). A true fixed-width UCS-2 would be a solution!
A follow-up question: how can I know if my string has characters from UCS-1 / UCS-2 / UCS-4? For UCS-1 there is str.isascii. Any ideas about UCS-2?

You are mixing several concepts.
In Python, you can simply index a string (or an array); the encoded width of each character does not matter. Be aware, though, that one code point is not necessarily one user-perceived character: some characters are built from several code points (a base letter plus combining characters such as accents), so if you need whole characters you have to keep those code points together.
UTF-16 is variable-width, but it is identical to UCS-2 for every character that UCS-2 can represent. Characters outside that range are encoded as a pair of high and low surrogates, as in many other programming languages that historically supported only UCS-2. This is often not a problem, because you should not split a string at arbitrary positions anyway, only at the end of a complete character.
UCS-4 and UTF-32 are practically the same encoding: each Unicode code point is stored as a 32-bit number. (The differences are only formal: UCS is based on an ISO standard that originally allowed higher code points, which were never allocated.)
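As a rough illustration of the fixed-width idea from the question, the sketch below (Python 3) stores the strings as UTF-32, which always uses 4 bytes per code point, and recovers them from code-point counts. The helper names recover and fits_ucs2 are hypothetical, and fits_ucs2 is only one assumed way to check whether a string stays inside the UCS-2 range.

```python
strings = ['asd', 'def', '\u0451\u0436']

WIDTH = 4  # 'utf-32-le' uses exactly 4 bytes per code point and writes no BOM
blob = ''.join(strings).encode('utf-32-le')
lengths = [len(s) for s in strings]  # lengths in code points, not bytes

def recover(blob, lengths):
    """Slice the fixed-width blob back into the original strings."""
    out, pos = [], 0
    for n in lengths:
        out.append(blob[pos * WIDTH:(pos + n) * WIDTH].decode('utf-32-le'))
        pos += n
    return out

def fits_ucs2(s):
    """True if every code point is at most U+FFFF (2 bytes each in UTF-16)."""
    return all(ord(c) <= 0xFFFF for c in s)

print(recover(blob, lengths))
print(fits_ucs2('\u0451'), fits_ucs2('\U0001f931'))
```

When fits_ucs2 returns True for all strings, encoding with 'utf-16-le' gives exactly 2 bytes per code point, so the same slicing works with WIDTH = 2.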

string to wstring in python

I have a UDP socket which receives datagrams of different lengths.
The first byte of the datagram specifies what type of data follows, for example 64 means bool false, 65 means bool true, 66 means sint, 67 means int, and so on. Most of the datatypes have a known length. For string and wstring, however, the first byte gives the type (85 means string), the next 2 bytes give the string length, and the actual string follows. For wstring the layout is the same: 85, then 2 bytes for the wstring length, then the actual wstring.
To parse the above kind of wstring format b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001' I used the following code
data = str(rawdata[3:]).split("\\x00")
data = "".join(data[1:])
data = "".join(data[:-1])
Is this correct, or is there a simpler way?
Just as I receive datagrams, I also need to send them. But I do not know how to create the datagrams, as socket.sendto requires bytes. If I convert a string to UTF-16, will that give me a wstring? If so, how would I add the rest of the information to the bytes?
From the above datagram information U-85 which is wstring, \x00\x07 - 7 length of the wstring data, \x00C\x00o\x00u\x00p\x00o\x00n\x001 - is the actual string Coupon1

A complete answer depends on exactly what you intend to do with the resulting data. Splitting the string with '\x00' (assuming that's what you meant to do? not sure I understand why there are two backslashes there) doesn't really make sense. The reason for using a wstring type in the first place is to be able to represent characters that aren't plain old 8-bit (really 7-bit) ascii. If you have any characters that aren't standard Roman characters, they may well have something other than a zero byte separating the characters in which case your split result will make no sense.
Caveat: Since you mentioned sendto requiring bytes, I assume you're using python3. Details will be slightly different under python2.
Anyway if I understand what it is you're meaning to do, the "utf-16-be" codec may be what you're looking for. (The "utf-16" codec puts a "byte order marker" at the beginning of the encoded string which you probably don't want; "utf-16-be" just puts the big-endian 16-bit chars into the byte string.) Decoding could be performed something like this:
rawdata = b'U\x00\x07\x00C\x00o\x00u\x00p\x00o\x00n\x001'
dtype = rawdata[0]
if dtype == 85: # wstring
    # The length is a 2-byte big-endian integer, not a character.
    dlen = int.from_bytes(rawdata[1:3], 'big')
    data = rawdata[3: (dlen * 2) + 3]
    dstring = data.decode('utf-16-be')
This will leave dstring as a python unicode string. In python3, all strings are unicode. So you're done.
Encoding it could be done something like this:
tosend = 'Coupon1'
snd_data = bytearray([85]) # wstring indicator
snd_data += bytearray([(len(tosend) >> 8), (len(tosend) & 0xff)])
snd_data += tosend.encode('utf-16-be')
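The header handling can also be expressed with the struct module. This is a sketch in Python 3 under two assumptions taken from the example datagram in the question: the type code for wstring is 85, and the 2-byte big-endian length counts 16-bit units rather than bytes. The function names build_wstring and parse_wstring are made up for illustration.

```python
import struct

WSTRING = 85  # type code from the question's example datagram

def build_wstring(text):
    """Return type byte + 2-byte big-endian length + UTF-16-BE payload."""
    payload = text.encode('utf-16-be')
    # Assumption: the length field counts 16-bit units (7 for 'Coupon1').
    return struct.pack('!BH', WSTRING, len(payload) // 2) + payload

def parse_wstring(datagram):
    """Inverse of build_wstring; raises ValueError on another type code."""
    dtype, count = struct.unpack_from('!BH', datagram, 0)
    if dtype != WSTRING:
        raise ValueError('not a wstring datagram: type %d' % dtype)
    return datagram[3:3 + count * 2].decode('utf-16-be')

packet = build_wstring('Coupon1')
print(packet)
print(parse_wstring(packet))
```

The resulting bytes object can be passed directly to socket.sendto.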

What is a bytearray? Why was it used?

I'm going over other people's code in CoderByte exercises. I was just reviewing the first exercise to review a string.
Here is the code:
def FirstReverse(s):
    ar = bytearray(s)
    ar.reverse()
    return str(ar)
print FirstReverse("Argument goes here")
I printed ar after the first line and just got the string back so I'm unclear how the bytearray helped. I also still didn't understand it after reading the documentation here: https://docs.python.org/2/library/functions.html#bytearray
So what is a bytearray? Did it make sense to use it in this example?

As the doc says,
Return a new array of bytes. ... is a mutable sequence of integers in the range 0 <= x < 256
For example,
>>> s = 'hello world'
>>> print bytearray(s)
hello world
>>> bytearray(s)[0]
104
and 104 is the ASCII code of h.
The bytearray class has a reverse method, but str does not. To reverse the string, this code first converts it to a bytearray, then reverses that in place, and finally converts the reversed bytes back to a string with str.
In addition, you can use [::-1] to reverse a string.
>>> 'Argument goes here'[::-1]
'ereh seog tnemugrA'

The difference between a str and a bytearray is that a str is a sequence of Unicode code points, whereas a bytearray is a sequence of bytes. A single Unicode String may be represented by multiple different bytearrays, depending on the encoding format (e.g. there would be different bytearrays for the UTF-8 representation and the UTF-16 representation of the same str). In addition, str is intended to represent text; by contrast, bytearray may be used to represent arbitrary byte sequences that do not correspond to text at all (e.g. sequences of bytes that are not valid Unicode in any standard encoding format and that will, in fact, be interpreted as something completely different from text altogether such as integer sequences, serialized objects, extended precision integers, or anything else you would want to represent as a sequence of bytes).
In addition to this distinction, str is immutable whereas bytearray is mutable. This means that transformations on str necessarily perform copying operations; by contrast, the contents of a bytearray may be updated / modified in place.
In this particular example, there really is no reason to use a bytearray (and in fact, doing that is more dangerous than using a reversed slice of str, because bytearray.reverse() reverses the underlying bytes... for characters that are encoded by more than one byte, this may result in totally invalid Unicode sequences when interpreting back into Unicode code points). However, if you want to examine or manipulate the encoded form of a string or perform something that is totally unrelated to raw text (like populate the bytes of a datagram packet), that would be a use case for bytearray.
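The danger mentioned above can be reproduced with a short Python 3 sketch. The two-letter Cyrillic sample string is only an illustrative choice; each of its letters takes two bytes in UTF-8.

```python
text = '\u0451\u0436'  # two Cyrillic letters, two UTF-8 bytes each

# Reversing the encoded bytes also reverses the bytes inside each character.
encoded = bytearray(text.encode('utf-8'))
encoded.reverse()
try:
    bytes(encoded).decode('utf-8')
    byte_reversal_ok = True
except UnicodeDecodeError:
    byte_reversal_ok = False

# Reversing the str works on code points and stays valid.
reversed_text = text[::-1]

print(byte_reversal_ok)
print(reversed_text)
```

Note that even the slice only reverses code points, so text containing combining characters can still come out visually wrong.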

I don't see how it helped personally. You can do this type of reversal natively with a string by just slicing it with a step size of -1:
def FirstReverse(s):
    return s[::-1]
print FirstReverse("Argument goes here")
I timed the bytearray version and this version using Python 2.7.10 and didn't see one being faster than the other.
So I guess it is a different approach, but I don't see it as a better approach.
The only advantage I could see is if the string were unicode and you are using Python 2.x instead of 3.x (because Python 2.x strings were not natively unicode). However, to pull a unicode string into a bytearray, you need to specify the encoding, which wasn't done here. So it must not have been for that purpose.

Unicode (Cyrillic) character indexing, re-writing in python

I am working with Russian words written in the Cyrillic orthography. Everything is working fine except for how many (but not all) of the Cyrillic characters are encoded as two characters when in an str. For instance:
>>> print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>> print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using python 2.7

There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to pass 'ignore' as the errors argument to decode(), as in mystring.decode('utf-8', 'ignore'). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.
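A minimal sketch of that approach, written for Python 3 (where the byte string is a bytes object); in Python 2.7 the same decode call works on a str. The sample bytes are invented for illustration.

```python
# 'ё' in UTF-8, then one byte that is never valid in UTF-8, then ASCII text
raw = b'\xd1\x91\xffabc'

# errors='ignore' silently drops the invalid byte.
text = raw.decode('utf-8', errors='ignore')

# After decoding, indexing and replacement work per character.
first = text[0]
replaced = text.replace('\u0451', 'e')

print(len(raw), len(text))
print(replaced)
```

Because the invalid bytes are dropped without any notice, 'replace' may be a safer choice than 'ignore' when you need to see where data was lost.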

These are actually different encodings:
>>> print ["ё"]
['\xd1\x91']
>>> print [u"ё"]
[u'\u0451']
What you're seeing is the __repr__'s for the elements in the lists. Not the __str__ versions of the unicode objects.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to decode the UTF-8 byte strings into unicode objects:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.

To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything e.g., if the data comes from a web page; see A good way to get the charset/encoding of an HTTP response in Python
Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё
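For the multi-code-point case, one common option is Unicode normalization with the standard unicodedata module. The following Python 3 sketch shows the decomposed form from the example above being composed into a single code point; whether normalization is appropriate depends on your data.

```python
import unicodedata

decomposed = '\u0435\u0308'  # Cyrillic 'е' followed by a combining diaeresis
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), len(composed))
print(composed == '\u0451')
```

Not every user-perceived character has a precomposed form, so NFC reduces the problem without eliminating it.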

Convert an int value to unicode

I am using pyserial and need to send some values less than 255. If I send the int itself, the ASCII representation of the int gets sent. So now I am converting the int into a unicode value and sending it through the serial port.
unichr(numlessthan255);
However it throws this error:
'ascii' codec can't encode character u'\x9a' in position 24: ordinal not in range(128)
What's the best way to convert an int to unicode?

In Python 2 - Turn it into a string first, then into unicode.
str(integer).decode("utf-8")
Best way I think. Works with any integer, plus still works if you put a string in as the input.
Updated edit due to a comment: For Python 2 and 3 - This works on both but a bit messy:
str(integer).encode("utf-8").decode("utf-8")

Just use chr(somenumber) to get a 1 byte value of an int as long as it is less than 256. pySerial will then send it fine.
If you are sending things over pySerial, it is a very good idea to look at the struct module in the standard library. It handles endianness and packing issues, as well as encoding, for just about every data type of 1 byte or more that you are likely to need.

I think that the best solution is to be explicit and say that you want to represent a number as a byte (and not as a character):
>>> import struct
>>> struct.pack('B', 128)
'\x80'
This makes your code work in both Python 2 and Python 3 (in Python 3, the result is, as it should, a bytes object). An alternative, in Python 3, would be to use the new bytes([128]) to create a single byte of value 128.
I am not a big fan of the chr() solutions: in Python 3, they produce a character string (not bytes) that needs to be encoded before sending it anywhere (file, socket, terminal, …). chr() in Python 3 is equivalent to the problematic Python 2 unichr() of the question. The struct solution has the advantage of correctly producing a byte whatever the version of Python. If you want to send data over the serial port with chr(), you need to have control over the encoding that must take place subsequently. With UTF-8, the default for str.encode() in Python 3, only code points below 128 are encoded as a single byte with the same value; code points from 128 to 255 become two bytes, so the wrong data would be sent. This adds an unnecessary layer of subtlety and complexity that I do not recommend (it makes the code harder to understand and, if necessary, debug).
So, I strongly suggest that you use the approach above (which was also hinted at by Steve Barnes and Martijn Pieters): it makes it clear that you want to produce a byte (and not characters). It will not give you any surprise even if you run your code with Python 3, and it makes your intent clearer and more obvious.
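The Python 3 sketch below puts the options side by side for one value. The value 154 (0x9a) is taken from the error message in the question; the variable names are illustrative only.

```python
import struct

value = 154  # 0x9a, the value from the error message in the question

# Three ways to get exactly one byte with that value:
as_struct = struct.pack('B', value)
as_bytes = bytes([value])
as_to_bytes = value.to_bytes(1, 'big')

# chr() gives a character, so the result depends on the encoding used later:
via_chr_utf8 = chr(value).encode('utf-8')      # two bytes
via_chr_latin1 = chr(value).encode('latin-1')  # one byte, same value

print(as_struct, as_bytes, as_to_bytes)
print(via_chr_utf8, via_chr_latin1)
```

Any of the first three results can be handed to a write call that expects bytes, with no encoding step in between.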

Use the chr() function instead; you are sending a value of less than 256 but more than 128, but are creating a Unicode character.
The unicode character has to then be encoded first to get a byte character, and that encoding fails because you are using a value outside the ASCII range (0-127):
>>> str(unichr(169))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
This is normal Python 2 behaviour; when trying to convert a unicode string to a byte string, an implicit encoding has to take place and the default encoding is ASCII.
If you were to use chr() instead, you create a byte string of one character and that implicit encoding does not have to take place:
>>> str(chr(169))
'\xa9'
Another method you may want to look into is the struct module, especially if you need to send integer values greater than 255:
>>> import struct
>>> struct.pack('!H', 1000)
'\x03\xe8'
The above example packs an integer into an unsigned short in network byte order.
