Decode and encode Unicode characters as '\u####' in Python

I'm attempting to write a Python implementation of java.util.Properties, which requires that Unicode characters be written to the output file in the format \u####.
(Documentation is here if you are curious, though it isn't important to the question: http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html)
I basically need something that passes the following test case
def my_encode(s):
    # Magic

def my_decode(s):
    # Magic

# Easy ones that are solved by .encode/.decode 'unicode_escape'
assert my_decode('\u2603') == u'☃'
assert my_encode(u'☃') == '\\u2603'

# This one also works with .decode('unicode_escape')
assert my_decode('\\u0081') == u'\x81'

# But this one does not quite produce what I want
assert my_encode(u'\u0081') == '\\u0081'  # Instead produces '\\x81'
Note that I've tried unicode_escape; it comes close, but doesn't quite satisfy what I want.
I've noticed that simplejson does this conversion correctly:
>>> simplejson.dumps(u'\u0081')
'"\\u0081"'
But I'd rather avoid:
reinventing the wheel
doing some gross substringing of simplejson's output

According to the documentation you linked to:
Characters less than \u0020 and characters greater than \u007E in property keys or values are written as \uxxxx for the appropriate hexadecimal value xxxx.
So, that converts into Python readily as:
def my_encode(s):
    return ''.join(
        c if 0x20 <= ord(c) <= 0x7E else r'\u%04x' % ord(c)
        for c in s
    )
For each character in the string, if the code point is between 0x20 and 0x7E, the character remains unchanged; otherwise, \u followed by the code point as a 4-digit hex number is used. The c if ... else ... for c in s construct is a generator expression, so we join the results back into a single string with str.join on the empty string.
For decoding, you can just use the unicode_escape codec as you mentioned:
def my_decode(s):
    return s.decode('unicode_escape')
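As a quick check (Python 2), the pair satisfies the asserts from the question:

>>> my_encode(u'\u2603')
'\\u2603'
>>> my_decode('\\u0081')
u'\x81'
>>> my_encode(u'\u0081')
'\\u0081'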

Related

Bytes operations in Python

I'm working on a project in which I have to perform some byte operations using Python, and I'd like to understand some basic principles before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or str.encode, I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?
Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, the actual raw bytes this corresponds to depend on the encoding. Using utf-8, it is a two-byte sequence:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is a single byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise and easiest way is to use a bytes literal, as you did with t1.
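If the bytes literal isn't an option, one way to get the single 0xAC byte back without the extra \xc2 is to encode with latin-1, which maps code points 0-255 one-to-one onto bytes. A minimal sketch, assuming every character in the string is below U+0100 (otherwise latin-1 raises UnicodeEncodeError):

>>> t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
>>> t2.encode('latin-1')
b'\xacBLETCHINGLEY'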
In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'

How to extract numerical values after expression '\xb' in byte string

I am attempting to extract numerical values from a byte string transmitted from an RS-232 port. Here is an example:
b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
If I attempt to decode the byte string as 'utf-8', I receive the following output:
x = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
x.decode('utf-8', errors='ignore')
>>> 'SS3.6\n'
What I ideally want is 23.67, which is observed after every \xb pattern. How could I extract 23.67 from this byte string?
As mentioned in https://stackoverflow.com/a/59416410/3319460, your input actually doesn't really represent the output you seek. But just to fulfil your requirements, we might impose the following semantics on the input:
digits and the '.' sign are kept; other ASCII bytes are skipped
if a byte is non-ASCII, check whether its high nibble is 0xB; if so, keep the ASCII part of the byte (b & 0b01111111)
That is quite easily done in Python.
def _filter(char):
    # Keep bytes whose high nibble is 0xB, plus ASCII digits (48-57) and '.'
    return (char & 0xF0) == 0xB0 or chr(char) == "." or 48 <= char <= 57

def filter_xbchars(value: bytes) -> str:
    # Strip the high bit of each kept byte to recover its ASCII part
    return "".join(chr(ch & 0b01111111) for ch in value if _filter(ch))

import pytest

@pytest.mark.parametrize(
    "value, expected",
    [(b"S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n", "23.67")],
)
def test_simple(value, expected):
    assert filter_xbchars(value) == expected
Please be aware that even though the code above satisfies the requirements, it solves a poorly described task, so the result is quite nonsensical. The code does what you asked for, but we should first reconsider whether the approach even makes sense. I advise you to check the data you will test against and the meaning of the data (the protocol).
Good luck :)
If you just want to get 23.67 from that byte string try this:
a = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
b = repr(a)[2:-1]
c = b.split("\\")
d = ''
e = []
for i in c:
    if "xb" in i:
        e.append(i[2:])
d = "".join(e)
print(d)
Please notice that \xHH is an escape code representing the hexadecimal value HH, and as such your string '\xb23.6\xb7' does not contain "23.67" but rather "(0xB2)3.6(0xB7)"; that value cannot be extracted using a regular expression because it's not present in the string in the first place.
'\xb23.6\xb7' is not a valid UTF-8 sequence, and in Latin-1 extended ASCII it would represent "²3.6·". The presence of many 0xA0 values would suggest a Latin-1 encoding, as 0xA0 represents a non-breaking space in that encoding (a fairly common character), while in UTF-8 it does not encode a meaningful sequence.
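For illustration, decoding the sample with latin-1 makes those characters visible. This is just a sketch to inspect the bytes, not a claim about the device's actual encoding:

>>> x = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
>>> x.decode('latin-1')
'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0²3.6·\xa0ç\x8d\n'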

How to convert an arbitrary string into bytes without a UnicodeEncodeError?

I should not expect any error here. I just want to take the string literally and translate it into its bytes. I don't want to encode or decode anything.
I am taking here a stupid example:
>>> astring
u'\xb0'
Stupid enough to give me headache...
>>> bytes(astring)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position...
One horrible trick is to do this:
>>> bytes(repr(astring)[2:-1])
'\xb0'
One other bad solution is:
>>> bytes(astring.encode("utf-8"))
'\xc2\xb0'
It is a bad solution because my string is a single character, not two. This is wrong.
Another awful solution would be:
>>> bytes(''.join(map(bytes, [chr(ord(c)) for c in astring])))
'\xb0'
I am using Python 2.7
Background
I would like to compare two columns on a database where the encoding is unknown and sometimes conflicting. I don't care about wrong chars in my dump. I just want to get the data out to have a look at it.
If your Unicode strings are guaranteed to only contain codepoints < 256 then you can convert them to bytes using the Latin1 encoding. Here's some Python 2 code that performs this conversion on all codepoints in range(256).
r = range(256)
s = u''.join([unichr(i) for i in r])
print repr(s)
b = s.encode('latin1')
print repr(b)
a = [ord(c) for c in b]
print a == r
output
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
FWIW, here's the equivalent Python 3 code.
r = range(256)
s = u''.join([chr(i) for i in r])
print(repr(s))
b = s.encode('latin1')
print(repr(b))
print(list(b) == list(r))
output
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
Note that the Python 3 Unicode repr output is a little more human-friendly.
You cannot just 'take the string literally' because the actual internal bytes representation of your string is not fixed; it is an implementation detail of your Python interpreter that you should not rely on (see PEP 393: on the same system, different strings can use different internal representations).
That also means that to get a byte representation of your string, you really need to encode it, and thus specify the encoding.
By the way, astring.encode("utf-8") is not wrong (and it already returns a bytes, so you don't need the extra bytes(...) in your code), as in utf-8 a single character can be represented as several bytes.
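For instance (Python 2), comparing the two encodings of the example string shows why utf-8 yields two bytes while latin-1 gives the single byte the question seems to expect, as in the answer above:

>>> u'\xb0'.encode('utf-8')
'\xc2\xb0'
>>> u'\xb0'.encode('latin-1')
'\xb0'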
You should be able to just add b before the quotes of the string.
>>> astring = b'\xb0'
>>> astring
b'\xb0'
>>> bytes(astring)
b'\xb0'
>>>
Putting b before the string makes it a bytes object. No more UnicodeEncodeError.

Unicode representation to formatted Unicode?

I am having some difficulty understanding the translation of Unicode expressions into their respective characters. I have been looking at the Unicode specification, and I have come across various strings formatted as follows: U+1F600. As far as I have seen, there doesn't appear to be a built-in function that knows how to translate these strings into the correct formatting for Python, such as \U0001F600.
In my program I have made a small regular expression that will find these U\+.{5} patterns and substitute the U+ with \U000. However, I have found that this syntax isn't the same for all Unicode characters; for example, the zero-width joiner is actually supposed to be translated from U+200D to \u200D.
Because I don't know every variation of the correct unicode escape sequence, what is the best method to handle this case? Is it that there are only a finite amount of these special characters that I can just check for or am I going about this the wrong way entirely?
Python version is 2.7.
I think your most reliable method will be to parse the number to an integer and then use unichr to look up that code point:
unichr(0x1f600) # or: unichr(int('1f600', 16))
Note: on Python 3, it's just chr.
U+NNNN is just common notation used to talk about Unicode. Python's syntax for a single Unicode character is one of:
u'\xNN' for Unicode characters through U+00FF
u'\uNNNN' for Unicode characters through U+FFFF
u'\U00NNNNNN' for Unicode characters through U+10FFFF (max)
Note: N is a hexadecimal digit.
Use the correct notation when entering a character. You can use the longer notations even for low characters:
u'A' == u'\x41' == u'\u0041' == u'\U00000041'
Programmatically, you can also generate the correct character with unichr(n) (Python 2) or chr(n) (Python 3).
Note that before Python 3.3, there were narrow and wide Unicode builds of Python. unichr/chr can only generate code points up to sys.maxunicode, which is 65535 (0xFFFF) on narrow builds and 1114111 (0x10FFFF) on wide builds. Python 3.3 unified the builds and solved many issues with Unicode.
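On a narrow build, unichr(0x1F600) raises ValueError. As a hedged sketch (wide_unichr is a hypothetical helper, not a standard function), you can compose such a character from a UTF-16 surrogate pair yourself:

import sys

def wide_unichr(cp):
    # Direct lookup works when the build covers the code point.
    if cp <= sys.maxunicode:
        return unichr(cp)
    # Narrow build: represent the code point as a UTF-16 surrogate pair.
    cp -= 0x10000
    return unichr(0xD800 + (cp >> 10)) + unichr(0xDC00 + (cp & 0x3FF))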
If you are dealing with text string in the format U+NNNN, here's a regular expression (Python 3). It looks for U+ and 4-6 hexadecimal digits, and replaces them with the chr() version. Note that ASCII characters (Python 2) or printable characters (Python 3) will display the actual character and not the escaped version.
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+1F600')
'testing \U0001f600'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+5000')
'testing \u5000'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+0041')
'testing A'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+0081')
'testing \x81'
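The question is tagged Python 2.7, so here is a minimal Python 2 sketch of the same substitution (expand_u_notation is just an illustrative name), assuming unichr covers the code point on your build, as discussed above:

import re

def expand_u_notation(text):
    # Replace each U+NNNN (4-6 hex digits) with the actual character.
    return re.sub(r'U\+([0-9A-Fa-f]{4,6})',
                  lambda m: unichr(int(m.group(1), 16)),
                  text)

print repr(expand_u_notation(u'testing U+200D'))  # u'testing \u200d'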
You can look into the json module's implementation. It seems that it is not that simple:
# Unicode escape sequence
uni = _decode_uXXXX(s, end)
end += 5
# Check for surrogate pair on UCS-4 systems
if sys.maxunicode > 65535 and \
        0xd800 <= uni <= 0xdbff and s[end:end + 2] == '\\u':
    uni2 = _decode_uXXXX(s, end + 1)
    if 0xdc00 <= uni2 <= 0xdfff:
        uni = 0x10000 + (((uni - 0xd800) << 10) | (uni2 - 0xdc00))
        end += 6
char = unichr(uni)
(from cpython-2.7.9/Lib/json/decoder.py lines 129-138)
I think that it would be easier to use json.loads directly:
>>> print json.loads('"\\u0123"')
ģ

Python: How to convert an 8-bit ASCII string to 16-bit Unicode

Although Python 3.x solved the problem of uppercasing and lowercasing for some locales (for example tr_TR.utf8), the Python 2.x branch lacks this. There are several workarounds for this issue, like https://github.com/emre/unicode_tr/, but I did not like this kind of solution.
So I am implementing new upper/lower/capitalize/title methods for monkey-patching the unicode class with the string.maketrans method.
The problem with maketrans is that the two strings must have the same length.
The nearest solution that came to my mind is: "How can I convert a 1-byte char to 2 bytes?"
Note: the translate method works only with ASCII encoding; when I pass u'İ' (code point \u0130) as an argument, translate gives an ASCII encoding error.
from string import maketrans
import unicodedata
c1 = unicodedata.normalize('NFKD',u'i').encode('utf-8')
c2 = unicodedata.normalize('NFKD',u'İ').encode('utf-8')
c1,len(c1)
('\xc4\xb1', 2)
# c2,len(c2)
# ('I', 1)
'istanbul'.translate( maketrans(c1,c2))
ValueError: maketrans arguments must have same length
Unicode objects allow multicharacter translation via a dictionary instead of two byte strings mapped through maketrans.
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
print u'istanbul'.translate(D)
Output:
İstanbul
If you start with an ASCII byte string and want the result in UTF-8, simply decode/encode around the translation:
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
s = 'istanbul'.decode('ascii')
t = s.translate(D)
s = t.encode('utf8')
print repr(s)
Output:
'\xc4\xb0stanbul'
The following technique can do the job of maketrans. Note that the dictionary keys must be Unicode ordinals, but the values can be Unicode ordinals, Unicode strings or None. If None, the character is deleted when translated.
#!python2
#coding:utf8
def maketrans(a, b):
    return dict(zip(map(ord, a), b))
D = maketrans(u'àáâãäå',u'ÀÁÂÃÄÅ')
print u'àbácâdãeäfåg'.translate(D)
Output:
ÀbÁcÂdÃeÄfÅg
Reference: str.translate
