How to convert an arbitrary string into bytes without a UnicodeEncodeError? - python

I should not expect any error here. I just want to take the string literally and translate it into its bytes. I don't want to encode or decode anything.
I am taking here a stupid example:
>>> astring
u'\xb0'
Stupid enough to give me headache...
>>> bytes(astring)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position...
One horrible trick is to do this:
>>> bytes(repr(astring)[2:-1])
'\xb0'
One other bad solution is:
>>> bytes(astring.encode("utf-8"))
'\xc2\xb0'
It is a bad solution because my string is not composed of two chars. This is wrong.
Another awful solution would be:
>>> bytes(''.join(map(bytes, [chr(ord(c)) for c in astring])))
'\xb0'
I am using Python 2.7
Background
I would like to compare two columns in a database where the encoding is unknown and sometimes conflicting. I don't care about wrong characters in my dump. I just want to be able to look at it.

If your Unicode strings are guaranteed to only contain codepoints < 256 then you can convert them to bytes using the Latin1 encoding. Here's some Python 2 code that performs this conversion on all codepoints in range(256).
r = range(256)
s = u''.join([unichr(i) for i in r])
print repr(s)
b = s.encode('latin1')
print repr(b)
a = [ord(c) for c in b]
print a == r
output
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
FWIW, here's the equivalent Python 3 code.
r = range(256)
s = u''.join([chr(i) for i in r])
print(repr(s))
b = s.encode('latin1')
print(repr(b))
print(list(b) == list(r))
output
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
Note that the Python 3 Unicode repr output is a little more human-friendly.
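If some codepoints might be 256 or above, a plain Latin1 encode raises again. Here is a small Python 3 sketch of how the standard errors argument of str.encode changes that; the helper name to_bytes_lossy is made up for illustration and is not part of any library.

```python
def to_bytes_lossy(text, errors="replace"):
    """Encode text as Latin-1, one byte per code point below 256.

    Code points >= 256 have no Latin-1 byte, so the chosen error handler
    decides what is written in their place (hypothetical helper).
    """
    return text.encode("latin-1", errors=errors)


degree = to_bytes_lossy("\xb0")                              # fits in one byte
euro_replaced = to_bytes_lossy("\u20ac")                     # no Latin-1 byte, replaced
euro_escaped = to_bytes_lossy("\u20ac", "backslashreplace")  # kept as an ASCII escape

print(degree, euro_replaced, euro_escaped)
```

For a quick look at a dump, "replace" keeps one byte per character, while "backslashreplace" keeps the information at the cost of extra bytes.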

You cannot just 'take the string literally' because the actual internal byte representation of your string is not fixed. It is an implementation detail of your Python interpreter that you should not rely on (see PEP 393: on the same system, different strings can use different internal encodings).
That also means that to get a byte representation of your string, you really need to encode it, and thus specify the encoding.
By the way, astring.encode("utf-8") is not wrong (and already returns a bytes, you don't need the extra bytes(...) in your code), as in utf-8 a single character can be represented as several bytes.
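To make that concrete, here is a minimal Python 3 sketch that encodes the same one-character string with three standard codecs. The choice of codecs is only for illustration; none of the results is "the" bytes of the string.

```python
s = "\xb0"  # a single code point: U+00B0 DEGREE SIGN

# Same text, three different byte sequences depending on the encoding chosen.
encoded = {name: s.encode(name) for name in ("latin-1", "utf-8", "utf-16-le")}

for name, data in encoded.items():
    print(name, data, len(data))
```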

You should be able to just add b before the quotes of the string.
>>> astring = b'\xb0'
>>> astring
b'\xb0'
>>> bytes(astring)
b'\xb0'
>>>
Putting b before the string makes it a bytes object. No more UnicodeEncodeError.

Related

Convert unicode string into byte string [duplicate]

I have a string like:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
I need to be able to get the corresponding byte literal of that unicode (for pickle.loads):
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Here the solution of using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I got an incorrect result: b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04' that has two backslashes (actually representing only one) for each one that it should have.
Also here and here a similar solution is proposed, but it does not work for me either, I end up getting the double backslashes again. Why does this occur? How do I get the bytes result I want?
You do not have byte escape codes as shown below (length 9), or you wouldn't get the double-backslash result:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You have literal escape codes (length 36), and note the r for raw string that prevents interpreting the escape codes as bytes:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
Note the difference. \\ is an escape code indicating a literal, single backslash:
>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36
The following gets the desired byte string by converting each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. Then it decodes the literal escape codes, resulting in a Unicode string again so once more encode using latin1 to convert 1:1 back to bytes:
s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)
Output:
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
If you did have s_str as posted, a simple .encode('latin1') would convert it:
>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
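The three-step chain is easier to follow when each intermediate value is kept. This is just a sketch of the same latin1 / unicode_escape / latin1 round trip in Python 3; the variable names are arbitrary.

```python
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"  # 36 characters: literal backslashes

# Step 1: every character is ASCII, so latin-1 maps each one to a single byte.
raw = s_str.encode("latin-1")

# Step 2: unicode_escape interprets the \xhh sequences, giving 9 code points.
unescaped = raw.decode("unicode_escape")

# Step 3: all 9 code points are below 256, so latin-1 maps them 1:1 to bytes.
s_bytes = unescaped.encode("latin-1")

print(len(raw), len(unescaped), s_bytes)
```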
I was about to post the question when I encountered a valid solution almost by chance. The combination that works for me is:
s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")
As I said I have no idea why this works so feel free to explain it if you know why.
You might simply use .encode("utf-8") to get desired result i.e.:
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)
output
b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'

Bytes operations in Python

I'm working on a project in which I have to perform some byte operations using Python, and I'd like to understand some basic principles before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or str.encode, I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?
Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, the actual raw bytes this corresponds to depend on the encoding. Using utf-8 it is a two-byte encoding:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is a single byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise/easy way then is to use a bytes-literal.
In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'
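Applied to the string from the question, a short sketch (Python 3, names chosen only for illustration) shows both encodings side by side, plus how the UTF-8 length grows with the code point:

```python
t2 = "\xACBLETCHINGLEY"

as_latin1 = t2.encode("latin-1")  # one byte per code point, no extra \xc2
as_utf8 = t2.encode("utf-8")      # U+00AC becomes the two bytes \xc2\xac


def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# ASCII letter, U+00AC, U+20AC (euro sign), U+1F600 (an emoji)
lengths = [utf8_length(ch) for ch in ("A", "\xac", "\u20ac", "\U0001F600")]

print(as_latin1, as_utf8, lengths)
```

So the extra \xc2 is not noise; it is the first byte of the UTF-8 encoding of U+00AC. Encoding with latin-1 only works as long as every code point is below 256.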

Python How to convert 8-bit ASCII string to 16-Bit Unicode

Although Python 3.x fixed upper- and lowercase conversion for some locales (for example tr_TR.utf8), the Python 2.x branch lacks this. There are several workarounds for this issue, such as https://github.com/emre/unicode_tr/, but I did not like that kind of solution.
So I am implementing new upper/lower/capitalize/title methods to monkey-patch the unicode class, using the string.maketrans method.
The problem with maketrans is that the two strings must have the same length.
The nearest solution that came to my mind is "How can I convert a 1-byte char to 2 bytes?"
Note: the translate method only works with ASCII; when I pass u'İ' (\u0130, length 1) as an argument, translate gives an ASCII encoding error.
from string import maketrans
import unicodedata
c1 = unicodedata.normalize('NFKD',u'i').encode('utf-8')
c2 = unicodedata.normalize('NFKD',u'İ').encode('utf-8')
c1,len(c1)
('\xc4\xb1', 2)
# c2,len(c2)
# ('I', 1)
'istanbul'.translate( maketrans(c1,c2))
ValueError: maketrans arguments must have same length
Unicode objects allow multicharacter translation via a dictionary instead of two byte strings mapped through maketrans.
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
print u'istanbul'.translate(D)
Output:
İstanbul
If you start with an ASCII byte string and want the result in UTF-8, simply decode/encode around the translation:
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
s = 'istanbul'.decode('ascii')
t = s.translate(D)
s = t.encode('utf8')
print repr(s)
Output:
'\xc4\xb0stanbul'
The following technique can do the job of maketrans. Note that the dictionary keys must be Unicode ordinals, but the value can be Unicode ordinals, Unicode strings or None. If None, the character is deleted when translated.
#!python2
#coding:utf8
def maketrans(a,b):
    return dict(zip(map(ord,a),b))
D = maketrans(u'àáâãäå',u'ÀÁÂÃÄÅ')
print u'àbácâdãeäfåg'.translate(D)
Output:
ÀbÁcÂdÃeÄfÅg
Reference: str.translate
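For comparison, in Python 3 the built-in str.maketrans already accepts a dictionary (or two equal-length strings), so no helper is needed. The sketch below assumes Python 3; TR_UPPER and tr_upper are made-up names for illustration, and the mapping covers only the two Turkish letters discussed here.

```python
# Map the two letters whose uppercase differs in Turkish before calling upper().
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def tr_upper(text):
    """Uppercase text using Turkish rules for dotted and dotless i."""
    return text.translate(TR_UPPER).upper()


# The two-string form works like the maketrans helper above.
ACCENTS = str.maketrans("àáâãäå", "ÀÁÂÃÄÅ")

print(tr_upper("istanbul"))
print("àbácâdãeäfåg".translate(ACCENTS))
```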

Easiest way to remove unicode representations from a string in python 3?

I have a string in python 3 that has several unicode representations in it, for example:
t = 'R\\u00f3is\\u00edn'
and I want to convert t so that it has the proper representation when I print it, ie:
>>> print(t)
Róisín
However I just get the original string back. I've tried re.sub and some others, but I can't seem to find a way that will change these characters without having to iterate over each one.
What would be the easiest way to do so?
You want to use the built-in codec unicode_escape.
If t is already a bytes (an 8-bit string), it's as simple as this:
>>> print(t.decode('unicode_escape'))
Róisín
If t has already been decoded to Unicode, you can encode it back to a bytes and then decode it this way. If you're sure that all of your Unicode characters have been escaped, it actually doesn't matter what codec you use to do the encode. Otherwise, you could try to get your original byte string back, but it's simpler, and probably safer, to just force any non-escaped characters to get escaped with the backslashreplace error handler, and then they'll get decoded along with the already-escaped ones. (Encoding with unicode_escape itself would not work here, because it also doubles the backslashes that are already in t.)
>>> print(t.encode('ascii', 'backslashreplace').decode('unicode_escape'))
Róisín
In case you want to know how to do this kind of thing with regular expressions in the future, note that sub lets you pass a function instead of a string for the repl argument. And you can convert any hex string into an integer by calling int(hexstring, 16), and any integer into the corresponding Unicode character with chr (note that this is the one bit that's different in Python 2: you need unichr instead). So:
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', lambda matchobj: chr(int(matchobj.group(0)[2:], 16)), t)
Róisín
Or, making it a bit more clear:
>>> def unescapematch(matchobj):
...     escapesequence = matchobj.group(0)
...     digits = escapesequence[2:]
...     ordinal = int(digits, 16)
...     char = chr(ordinal)
...     return char
...
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, t)
Róisín
The unicode_escape codec actually handles \U, \x, \X, octal (\066), and special-character (\n) sequences as well as just \u, and it implements the proper rules for reading only the appropriate max number of digits (4 for \u, 8 for \U, etc., so r'\\u22222' decodes to '∢2' rather than '𢈢'), and probably more things I haven't thought of. But this should give you the idea.
First of all, it is rather unclear what you want to convert to.
Just imagine that you may want to convert to 'o' and 'i'. In this case you can just make a map:
mp = {u'\u00f3':'o', u'\u00ed':'i'}
Then you may apply the replacement like this (strings are immutable, so build a new one instead of assigning to t[i]):
t = u'R\u00f3is\u00edn'
t = u''.join(mp.get(c, c) for c in t)
print t
I apologize for posting as a second answer, I don't have the reputation to comment on abarnert's solution.
After using his function to process approximately 50K android strings I noticed that there is yet another small improvement possible for certain use-cases.
I changed the + to {1,4} to deal with the case where valid hex characters follow a 4-digit escape.
I also changed int(escapesequence) to read int(digits)
>>> def unescapematch(matchobj):
...     escapesequence = matchobj.group(0)
...     digits = escapesequence[2:]
...     ordinal = int(digits, 16)
...     char = unichr(ordinal)
...     return char
...
>>> print re.sub(r'(\\u[0-9A-Fa-f]{1,4})', unescapematch, "Wi\u2011Fi")
Wi‑Fi
>>> print re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, "Wi\u2011Fi")
Traceback (most recent call last):
File "<pyshell#102>", line 1, in <module>
print re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, "Wi\u2011Fi")
File "C:\Python27\lib\re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "<pyshell#99>", line 5, in unescapematch
char = unichr(ordinal)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
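For Python 3 readers, roughly the same idea can be sketched as below. It assumes only 4-digit \u escapes need handling (so exactly four hex digits are matched rather than 1 to 4); unescape_u is a made-up name, and the unicode_escape codec remains the more complete option.

```python
import re

# Exactly four hex digits after \u, so trailing hex-looking text is left alone.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_u(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_u("R\\u00f3is\\u00edn"))
print(unescape_u("Wi\\u2011Fi"))
```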

Python get character code in different encoding?

Given a character code as integer number in one encoding, how can you get the character code in, say, utf-8 and again as integer?
UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point.
>>> ord(chr(145).decode('koi8-r'))
9618
You can only map an "integer number" from one encoding to another if they are both single-byte encodings.
Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('cp1252')
'\x80'
>>> ord(s.encode('cp1252'))
128
>>> ord(s.encode('iso-8859-15'))
164
Note that ord is here being used to get the ordinal number of the encoded byte. Using ord on the original unicode string would give its unicode code point:
>>> ord(s)
8364
The reverse operation to ord can be done using either chr (in Python 2, for byte values in the range 0 to 255) or unichr (for code points in the range 0 to sys.maxunicode):
>>> print chr(65)
A
>>> print unichr(8364)
€
For multi-byte encodings, a simple "integer number" mapping is usually not possible.
Here's the same example as above, but using "iso-8859-15" and "utf-8":
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('utf-8')
'\xe2\x82\xac'
>>> [ord(c) for c in s.encode('iso-8859-15')]
[164]
>>> [ord(c) for c in s.encode('utf-8')]
[226, 130, 172]
The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
Here's an example of how the encode/decode dance works:
>>> s = b'd\x06' # perhaps start with bytes encoded in utf-16
>>> map(ord, s) # show those bytes as integers
[100, 6]
>>> u = s.decode('utf-16') # turn the bytes into unicode
>>> print u # show what the character looks like
٤
>>> print ord(u) # show the unicode code point as an integer
1636
>>> t = u.encode('utf-8') # turn the unicode into bytes with a different encoding
>>> map(ord, t) # show that encoding as integers
[217, 164]
Hope this helps :-)
If you need to construct the unicode directly from an integer, use unichr:
>>> u = unichr(1636)
>>> print u
٤
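The examples above are Python 2. As a rough Python 3 sketch of the same decode/encode dance, assuming the source is a single-byte encoding: the function name recode is hypothetical, and the result is a list because the target encoding may need several bytes.

```python
def recode(code, src, dst):
    """Translate one byte value in encoding src to its byte values in dst.

    Assumes src is a single-byte encoding; raises UnicodeError if the
    character cannot be represented in dst.
    """
    char = bytes([code]).decode(src)  # byte value -> Unicode character
    return list(char.encode(dst))     # character -> byte values in dst


euro_cp1252 = recode(0xA4, "iso-8859-15", "cp1252")
euro_utf8 = recode(0xA4, "iso-8859-15", "utf-8")
code_point = ord(bytes([145]).decode("koi8-r"))

print(euro_cp1252, euro_utf8, code_point)
```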
