Python get character code in different encoding? - python

Given a character code as integer number in one encoding, how can you get the character code in, say, utf-8 and again as integer?

UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point.
>>> ord(chr(145).decode('koi8-r'))
9618
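In Python 3 the same lookup goes through a bytes object, because chr() there returns a text character rather than a byte. A minimal sketch, assuming the code fits in a single byte (0 to 255) of the source encoding; the function name is only illustrative:

```python
def code_point_from_byte(code, encoding):
    """Return the Unicode code point for a single-byte character code."""
    # bytes([code]) builds a one-byte bytes object; decode() turns it into text.
    return ord(bytes([code]).decode(encoding))


print(code_point_from_byte(145, 'koi8-r'))  # 9618, same as the Python 2 session
```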

You can only map an "integer number" from one encoding to another if they are both single-byte encodings.
Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('cp1252')
'\x80'
>>> ord(s.encode('cp1252'))
128
>>> ord(s.encode('iso-8859-15'))
164
Note that ord is here being used to get the ordinal number of the encoded byte. Using ord on the original unicode string would give its unicode code point:
>>> ord(s)
8364
The reverse operation to ord can be done using either chr (for codes in the range 0 to 255) or unichr (for codes in the range 0 to sys.maxunicode):
>>> print chr(65)
A
>>> print unichr(8364)
€
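The single-byte mapping can be wrapped in a small helper. This is a Python 3 sketch rather than the Python 2 used in the sessions above, and recode_byte is a made-up name. It rejects characters that need more than one byte in the target encoding, and encode() itself raises UnicodeEncodeError when the target has no such character at all.

```python
def recode_byte(code, source, target):
    """Map a single-byte character code from one encoding to another."""
    char = bytes([code]).decode(source)   # byte -> Unicode character
    encoded = char.encode(target)         # character -> bytes in the target encoding
    if len(encoded) != 1:
        raise ValueError(f"{char!r} needs {len(encoded)} bytes in {target}")
    return encoded[0]


print(recode_byte(164, 'iso-8859-15', 'cp1252'))  # 128 (the euro sign in both)
```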
For multi-byte encodings, a simple "integer number" mapping is usually not possible.
Here's the same example as above, but using "iso-8859-15" and "utf-8":
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('utf-8')
'\xe2\x82\xac'
>>> [ord(c) for c in s.encode('iso-8859-15')]
[164]
>>> [ord(c) for c in s.encode('utf-8')]
[226, 130, 172]
The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
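For comparison, a Python 3 version of the same check. Iterating over a bytes object already yields integers there, so ord() is not needed:

```python
s = '€'
latin9_codes = list(s.encode('iso-8859-15'))
utf8_codes = list(s.encode('utf-8'))
print(latin9_codes)  # [164]
print(utf8_codes)    # [226, 130, 172]

# In the ASCII range both encodings produce the same single byte.
ascii_latin9 = list('A'.encode('iso-8859-15'))
ascii_utf8 = list('A'.encode('utf-8'))
print(ascii_latin9, ascii_utf8)  # [65] [65]
```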

Here's an example of how the encode/decode dance works:
>>> s = b'd\x06' # perhaps start with bytes encoded in utf-16
>>> map(ord, s) # show those bytes as integers
[100, 6]
>>> u = s.decode('utf-16') # turn the bytes into unicode
>>> print u # show what the character looks like
٤
>>> print ord(u) # show the unicode code point as an integer
1636
>>> t = u.encode('utf-8') # turn the unicode into bytes with a different encoding
>>> map(ord, t) # show that encoding as integers
[217, 164]
Hope this helps :-)
If you need to construct the unicode directly from an integer, use unichr:
>>> u = unichr(1636)
>>> print u
٤
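The same dance in Python 3, as a sketch. Plain 'utf-16' without a byte order mark falls back to the machine's native byte order, so this version names 'utf-16-le' explicitly; that the data is little-endian is an assumption about where the bytes came from.

```python
raw = b'd\x06'                     # two bytes: 0x64 0x06
print(list(raw))                   # [100, 6]

text = raw.decode('utf-16-le')     # bytes -> str, byte order stated explicitly
code_point = ord(text)
print(code_point)                  # 1636

utf8_codes = list(text.encode('utf-8'))   # str -> bytes in a different encoding
print(utf8_codes)                  # [217, 164]

print(chr(1636) == text)           # True: chr() replaces unichr() in Python 3
```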

Related

Python format hex number

I need to send a string via tcp. One of the first sections of the string is the length of the command variable
Example:
command = STATUS?UPDATE
I need to send the following string below
sendCommand = '\x00\x00\x00'+STRINGLENGTH+'\x02'+command+'\x0D\x0A'
My string length is 11 so I need STRINGLENGTH to be the hex equivalent of 11, which is 0xB, except that I need it to output as \x0B
Padding it with the leading 0 is easy, but I cannot get it to output as \x instead of 0x, and if I do a string replace it is treated as text and not as hex, so it doesn't work.
My final hex string should be:
\x00\x00\x00\x0B\x02\x53\x54\x41\x54\x55\x53\x3f\x55\x53\x45\x52\x0D\x0A
I am instead getting:
\x00\x00\x000x0B\x02\x53\x54\x41\x54\x55\x53\x3f\x55\x53\x45\x52\x0D\x0A
Any ideas on how to format it correctly?
So, this is a bit of a round-about fashion, but use a bytes object:
>>> STRINGLENGTH = bytes([11]).decode()
>>> endCommand = '\x00\x00\x00'+STRINGLENGTH+'\x02'
>>> endCommand
'\x00\x00\x00\x0b\x02'
Almost certainly, you are going to want to change your str object back to a bytes object, but the above should get you going.
I suspect what you were doing was using the hex function:
>>> STRINGLENGTH = hex(11)
>>> endCommand = '\x00\x00\x00'+STRINGLENGTH+'\x02'
>>> endCommand
'\x00\x00\x000xb\x02'
The fundamental thing you need to understand is that you aren't working with "hex", you are working with bytes. Hex is just how bytes are traditionally represented. The hex helper function returns a string containing the hexadecimal representation of an integer. But that isn't what you want. You want the byte corresponding to the value 11.
Note, for the ascii-range, chr(i) works as well, so
>>> STRINGLENGTH = chr(11)
>>> endCommand = '\x00\x00\x00'+STRINGLENGTH+'\x02'
>>> endCommand
'\x00\x00\x00\x0b\x02'
But be careful, say you wanted the number 129, you have to care about the encoding...
>>> chr(129)
'\x81'
But in bytes, in UTF-8, that's actually represented by two different bytes
>>> chr(129).encode()
b'\xc2\x81'
>>> list(chr(129).encode())
[194, 129]
Which of course, depends on the encoding:
>>> chr(129).encode('latin')
b'\x81'
>>> list(chr(129).encode('latin'))
[129]
>>>
For that reason, I think it is safer to stick with the slightly wordier:
>>> bytes([129])
b'\x81'
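Putting it together, the whole message can be built as bytes from the start, which avoids the str/bytes round trip. This is only a sketch: build_frame is a made-up name, the command 'STATUS?USER' is read off the hex dump in the question, and treating the three zero bytes plus the length byte as one 4-byte big-endian length field is an assumption the question does not confirm.

```python
import struct


def build_frame(command):
    """Build length header + 0x02 marker + ASCII command + CRLF as bytes."""
    payload = command.encode('ascii')
    # '>I' packs the length as a 4-byte big-endian unsigned integer,
    # e.g. 11 -> b'\x00\x00\x00\x0b' (assumed layout, see above).
    header = struct.pack('>I', len(payload))
    return header + b'\x02' + payload + b'\r\n'


frame = build_frame('STATUS?USER')
print(frame)  # b'\x00\x00\x00\x0b\x02STATUS?USER\r\n'
```

If the length really is a single byte after three fixed zero bytes, b'\x00\x00\x00' + bytes([len(payload)]) gives the same result for lengths up to 255.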

Why is the result of '1' == u'1' True in Python 2?

I use Python 2.7 on Linux. From https://docs.python.org/2/howto/unicode.html I understand that Python uses one byte for each character in a str, while it uses 4 bytes per character in a unicode string. So why do I get True when I enter '1' == u'1'?
A similar truth in python2:
In [1]: a = {}
In [2]: a['1'] = 1
In [3]: a[u'1']
Out[3]: 1
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
You can see an example of this:
>>> a = u'1'
>>> a.encode('utf-8')
'1'
>>> b = u'ツ'
>>> b.encode('utf-8')
'\xe3\x83\x84'
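The comparison itself has little to do with how many bytes each type uses internally. As far as I know, Python 2 decodes the str operand with the default ASCII codec before comparing it to a unicode object, and equal ASCII-only strings hash the same, which is why the dictionary lookup also succeeds. Python 3 dropped that implicit conversion, so the step has to be written out. A small sketch of the difference:

```python
text = '1'    # str: a sequence of Unicode code points
data = b'1'   # bytes: a sequence of integers 0..255

implicit = (text == data)                 # no implicit decoding in Python 3
explicit = (text == data.decode('ascii'))
encoded = (text.encode('utf-8') == data)

print(implicit)  # False
print(explicit)  # True
print(encoded)   # True
```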

Bytes operations in Python

I'm working on a project in which I have to perform some byte operations using Python, and I'd like to understand some basic principles before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or encode, I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?
Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, the actual raw bytes this would correspond to depend on the encoding. Using utf-8 it is a two-byte encoding:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is a single byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise/easy way then is to use a bytes-literal.
In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'
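If the str already exists and every character in it is in the range U+0000 to U+00FF, encoding with latin-1 recovers the same bytes as the bytes literal, because latin-1 maps those code points one-to-one onto byte values. A short sketch using the strings from the question:

```python
t1 = b"\xACBLETCHINGLEY"   # bytes literal: first byte is 0xAC
t2 = "\xACBLETCHINGLEY"    # str literal: first character is U+00AC

as_latin1 = t2.encode('latin-1')   # one byte per code point below 256
as_utf8 = t2.encode('utf-8')       # U+00AC becomes the two bytes C2 AC

print(as_latin1 == t1)  # True
print(as_utf8)          # b'\xc2\xacBLETCHINGLEY'
```

Note that encode('latin-1') raises UnicodeEncodeError as soon as the string contains a code point above U+00FF.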

How to convert an arbitrary string into bytes without a UnicodeEncodeError?

I should not expect any error here. I just want to take the string literally and translate it into its bytes. I don't want to encode or decode anything.
I am taking here a stupid example:
>>> astring
u'\xb0'
Stupid enough to give me headache...
>>> bytes(astring)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position...
One horrible trick is to do this:
>>> bytes(repr(astring)[2:-1])
'\xb0'
One other bad solution is:
>>> bytes(astring.encode("utf-8"))
'\xc2\xb0'
It is a bad solution because my string is not composed of two chars. This is wrong.
Another awful solution would be:
>>> bytes(''.join(map(bytes, [chr(ord(c)) for c in astring])))
'\xb0'
I am using Python 2.7
Background
I would like to compare two columns in a database where the encoding is unknown and sometimes conflicting. I don't care about wrong chars in my dump. I just want to get the data out so I can have a look at it.
If your Unicode strings are guaranteed to only contain codepoints < 256 then you can convert them to bytes using the Latin1 encoding. Here's some Python 2 code that performs this conversion on all codepoints in range(256).
r = range(256)
s = u''.join([unichr(i) for i in r])
print repr(s)
b = s.encode('latin1')
print repr(b)
a = [ord(c) for c in b]
print a == r
output
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
FWIW, here's the equivalent Python 3 code.
r = range(256)
s = u''.join([chr(i) for i in r])
print(repr(s))
b = s.encode('latin1')
print(repr(b))
print(list(b) == list(r))
output
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
Note that the Python 3 Unicode repr output is a little more human-friendly.
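When the "codepoints < 256" guarantee does not hold, the latin1 encode raises UnicodeEncodeError. For a quick look at messy data, one option in Python 3 is the 'backslashreplace' error handler, which writes out-of-range characters as ASCII escape sequences instead of failing. A sketch; dump_bytes is a made-up name, and the result is for inspection only since the escapes cannot be told apart from literal backslashes afterwards:

```python
def dump_bytes(text):
    """Encode as Latin-1, escaping anything above U+00FF instead of raising."""
    return text.encode('latin-1', errors='backslashreplace')


print(dump_bytes('\xb0'))        # b'\xb0'
print(dump_bytes('\xb0\u20ac'))  # b'\xb0\\u20ac'
```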
You cannot just 'take the string literally', because the internal byte representation of your string is not fixed: it is an implementation detail of your Python interpreter that you should not rely on (see PEP 393; on the same system, different strings can use different internal encodings).
That also means that to get a byte representation of your string, you really need to encode it, and thus specify the encoding.
By the way, astring.encode("utf-8") is not wrong (and already returns a bytes, you don't need the extra bytes(...) in your code), as in utf-8 a single character can be represented as several bytes.
You should be able to just add b before the quotes of the string.
>>> astring = b'\xb0'
>>> astring
b'\xb0'
>>> bytes(astring)
b'\xb0'
>>>
Putting b before the string makes it a bytes object. No more UnicodeEncodeError.

How to convert string to bytes in Python 2

I know this may sound like a duplicate question, but that's because I don't know how to describe this question properly.
For some reason I got a bunch of unicode strings like this:
a = u'\xcb\xea'
As you can see, it's actually the bytes representation of a Chinese character, encoded in GBK
>>> print(b'\xcb\xea'.decode('gbk'))
岁
u'岁' is what I need, but I don't know how to convert u'\xcb\xea' to b'\xcb\xea'.
Any suggestions?
It's not really a bytes representation, it's still unicode codepoints. They are the wrong codepoints, because it was decoded from bytes as if it was encoded to Latin-1.
Encode to Latin 1 (whose codepoints map one-on-one to bytes), then decode as GBK:
a.encode('latin1').decode('gbk')
Demo:
>>> a = u'\xcb\xea'
>>> a.encode('latin1').decode('gbk')
u'\u5c81'
>>> print a.encode('latin1').decode('gbk')
岁
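The same repair works unchanged in Python 3, where the mis-decoded text is a plain str. A sketch with the two encodings as parameters; redecode is a made-up name, and the defaults simply mirror this question:

```python
def redecode(text, wrong='latin-1', right='gbk'):
    """Undo a decode done with the wrong codec, then decode with the right one."""
    # encode(wrong) restores the original bytes, provided `wrong` is the codec
    # that was actually used for the faulty decode.
    return text.encode(wrong).decode(right)


fixed = redecode('\xcb\xea')
print(fixed)  # 岁
```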
The simplest way for python2 is to use repr():
>>> key_unicode = u'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> key_ascii = 'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> print(key_ascii)
uuuu��_��9ԣ�
>>> print(key_unicode)
uuuuö_¡ë9Ô£Ñ
>>>
>>> # here is the safe method for both string types:
>>> print(repr(key_ascii).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> print(repr(key_unicode).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> # ____________WARNING!______________
>>> # if you use just `str.strip('u\'\"')`, you will lose
>>> # the "uuuu" (and quotes, if such are present) on sides of the string:
>>> print(repr(key_unicode).strip('u\'\"'))
\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
For python3 use str.encode() to get the bytes type.
>>> key = 'l\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1q\xf5L\xa9\xdd0\x90\x8b\xf5ht\x86za\x0e\x1b\xed\xb6(\xaa+'
>>> key
'lö\x9f_¡\x05ë9Ô£ÑqõL©Ý0\x90\x8bõht\x86za\x0e\x1bí¶(ª+'
>>> print(key)
lö_¡ë9Ô£ÑqõL©Ý0õhtzaí¶(ª+
>>> print(repr(key.encode()).lstrip('b')[1:-1])
l\xc3\xb6\xc2\x9f_\xc2\xa1\x05\xc3\xab9\xc3\x94\xc2\xa3\xc3\x91
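If the goal in Python 3 is the escaped view of the code points themselves (the \xf6 form rather than the UTF-8 bytes), the built-in 'unicode_escape' codec gives that without slicing the repr. A sketch, with escaped_view as a made-up name:

```python
def escaped_view(text):
    """Show non-ASCII and control characters as backslash escapes."""
    # unicode_escape yields ASCII-only bytes, so decoding as ASCII is safe.
    return text.encode('unicode_escape').decode('ascii')


key = 'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
print(escaped_view(key))  # uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
```

Literal backslashes in the input come out doubled, and no quote stripping is involved, so the leading "uuuu" is left alone.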
