Correctly extract emojis from a Unicode string - Python

I am working in Python 2 and I have a string containing emojis as well as other Unicode characters. I need to convert it to a list where each entry is a single character/emoji.
x = u'😘😘xyz😊😊'
char_list = [c for c in x]
The desired output is:
['😘', '😘', 'x', 'y', 'z', '😊', '😊']
The actual output is:
[u'\ud83d', u'\ude18', u'\ud83d', u'\ude18', u'x', u'y', u'z', u'\ud83d', u'\ude0a', u'\ud83d', u'\ude0a']
How can I achieve the desired output?

First of all, in Python 2 you need to use Unicode string literals (u'<...>') for Unicode characters to be treated as Unicode characters, and you need the correct source encoding declaration if you want to put the characters themselves in source code rather than their \UXXXXXXXX escapes.
Now, as per Python: getting correct string length when it contains surrogate pairs and Python returns length of 2 for single Unicode character string, in Python 2 "narrow" builds (those with sys.maxunicode == 65535), characters above U+FFFF are represented as surrogate pairs, and this is not transparent to string functions. This was only fixed in Python 3.3 (PEP 393).
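You can check which kind of build you are running from sys.maxunicode:
import sys
print sys.maxunicode  # 65535 (0xFFFF) on a "narrow" build; 1114111 (0x10FFFF) on a "wide" build or Python 3.3+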
The simplest resolution (short of migrating to 3.3+) is to compile a Python "wide" build from source, as outlined in the third link. In a wide build all Unicode characters are 4 bytes (so a potential memory hog), but if you routinely need to handle wide Unicode characters, this is probably an acceptable price.
The solution for a "narrow" build is to write a custom set of string functions (len, slicing; perhaps as a subclass of unicode) that detect surrogate pairs and handle them as a single character. I couldn't readily find an existing one (which is strange), but it's not too hard to write:
as per UTF-16#U+10000 to U+10FFFF - Wikipedia,
the 1st code unit (high surrogate) is in the range 0xD800..0xDBFF
the 2nd code unit (low surrogate) is in the range 0xDC00..0xDFFF
these ranges are reserved and thus cannot occur as regular characters
So here's the code to detect a surrogate pair:
def is_surrogate(s, i):
    if 0xD800 <= ord(s[i]) <= 0xDBFF:
        try:
            l = s[i+1]
        except IndexError:
            return False
        if 0xDC00 <= ord(l) <= 0xDFFF:
            return True
        else:
            raise ValueError("Illegal UTF-16 sequence: %r" % s[i:i+2])
    else:
        return False
And a function that returns a simple slice:
def slice(s, start, end):
    l = len(s)
    i = 0
    while i < start and i < l:
        if is_surrogate(s, i):
            start += 1
            end += 1
            i += 1
        i += 1
    while i < end and i < l:
        if is_surrogate(s, i):
            end += 1
            i += 1
        i += 1
    return s[start:end]
Here, the price you pay is performance, as these functions are much slower than built-ins:
>>> ux=u"a"*5000+u"\U00100000"*30000+u"b"*50000
>>> timeit.timeit('slice(ux,10000,100000)','from __main__ import slice,ux',number=1000)
46.44128203392029 #msec
>>> timeit.timeit('ux[10000:100000]','from __main__ import slice,ux',number=1000000)
8.814016103744507 #usec
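For the specific task in the question (splitting into a per-character list), a minimal sketch along the same lines, reusing is_surrogate() above (to_char_list is my name for it, not a standard function). On a narrow build each emoji comes back as a two-code-unit unicode string holding its surrogate pair:
def to_char_list(s):
    # Walk the string, keeping each surrogate pair together as one entry.
    chars = []
    i = 0
    while i < len(s):
        if is_surrogate(s, i):
            chars.append(s[i:i+2])
            i += 2
        else:
            chars.append(s[i])
            i += 1
    return chars

to_char_list(u'😘😘xyz😊😊')  # 7 entries instead of 11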

I would use the uniseg library (pip install uniseg):
# -*- coding: utf-8 -*-
from uniseg import graphemecluster as gc
print list(gc.grapheme_clusters(u'😘😘xyz😊😊'))
outputs [u'\U0001f618', u'\U0001f618', u'x', u'y', u'z', u'\U0001f60a', u'\U0001f60a'], and
[x.encode('utf-8') for x in gc.grapheme_clusters(u'😘😘xyz😊😊')]
will provide the list of characters as UTF-8 encoded strings.
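Alternatively, the third-party regex module understands the \X grapheme-cluster pattern; a brief sketch to the same effect (my illustration, assuming pip install regex):
# -*- coding: utf-8 -*-
import regex  # third-party module: pip install regex
print regex.findall(ur'\X', u'😘😘xyz😊😊')  # one list entry per grapheme cluster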


Unicode representation to formatted Unicode?

I am having some difficulty understanding the translation of Unicode expressions into their respective characters. I have been looking at the Unicode specification, and I have come across various strings formatted like U+1F600. As far as I can tell, there is no built-in function that translates these strings into the corresponding Python escape, such as \U0001F600.
In my program I have written a small regular expression that finds these U\+.{5} patterns and substitutes the U+ with \U000. However, I have found that this syntax isn't the same for all Unicode characters; for example, the zero-width joiner should be translated from U+200D to \u200D.
Because I don't know every variation of the correct Unicode escape sequence, what is the best way to handle this? Are there only a finite number of these special cases I can check for, or am I going about this the wrong way entirely?
Python version is 2.7.
I think your most reliable method will be to parse the number to an integer and then use unichr to lookup that codepoint:
unichr(0x1f600) # or: unichr(int('1f600', 16))
Note: on Python 3, it's just chr.
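Note that on a narrow Python 2 build, unichr() is limited to sys.maxunicode (0xFFFF), so unichr(0x1f600) raises ValueError there. A possible fallback (an illustrative sketch, not part of the answer above) is a unicode_escape round trip, which yields the surrogate-pair form on narrow builds:
import sys

def wide_unichr(code):
    # chr()/unichr() works directly when the code point fits this build's limit...
    if code <= sys.maxunicode:
        return unichr(code)
    # ...otherwise build the character from its \UXXXXXXXX escape;
    # on a narrow build this produces a surrogate pair.
    return (r'\U%08x' % code).decode('unicode_escape')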
U+NNNN is just common notation used to talk about Unicode. Python's syntax for a single Unicode character is one of:
u'\xNN' for Unicode characters through U+00FF
u'\uNNNN' for Unicode characters through U+FFFF
u'\U00NNNNNN' for Unicode characters through U+10FFFF (max)
Note: N is a hexadecimal digit.
Use the correct notation when entering a character. You can use the longer notations even for low characters:
u'A' == u'\x41' == u'\u0041' == u'\U00000041'
Programmatically, you can also generate the correct character with unichr(n) (Python 2) or chr(n) (Python 3).
Note that before Python 3.3, there were narrow and wide Unicode builds of Python. unichr/chr can only support sys.maxunicode, which is 65535 (0xFFFF) in narrow builds and 1114111 (0x10FFFF) on wide builds. Python 3.3 unified the builds and solved many issues with Unicode.
If you are dealing with a text string in the format U+NNNN, here's a regular expression (Python 3). It looks for U+ followed by 4-6 hexadecimal digits and replaces it with the chr() version. Note that ASCII characters (Python 2) or printable characters (Python 3) will display as the actual character and not the escaped version.
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+1F600')
'testing \U0001f600'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+5000')
'testing \u5000'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+0041')
'testing A'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+0081')
'testing \x81'
You can look into the json module implementation. It seems that it is not that simple:
# Unicode escape sequence
uni = _decode_uXXXX(s, end)
end += 5
# Check for surrogate pair on UCS-4 systems
if sys.maxunicode > 65535 and \
   0xd800 <= uni <= 0xdbff and s[end:end + 2] == '\\u':
    uni2 = _decode_uXXXX(s, end + 1)
    if 0xdc00 <= uni2 <= 0xdfff:
        uni = 0x10000 + (((uni - 0xd800) << 10) | (uni2 - 0xdc00))
        end += 6
char = unichr(uni)
(from cpython-2.7.9/Lib/json/decoder.py lines 129-138)
I think that it would be easier to use json.loads directly:
>>> print json.loads('"\\u0123"')
ģ

Decode and encode unicode characters as '\u####'

I'm attempting to write a Python implementation of java.util.Properties, which requires that Unicode characters be written to the output file in the \u#### format.
(Documentation is here if you are curious, though it isn't important to the question: http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html)
I basically need something that passes the following test case
def my_encode(s):
    # Magic

def my_decode(s):
    # Magic

# Easy ones that are solved by .encode/.decode 'unicode_escape'
assert my_decode('\u2603') == u'☃'
assert my_encode(u'☃') == '\\u2603'

# This one also works with .decode('unicode_escape')
assert my_decode('\\u0081') == u'\x81'

# But this one does not quite produce what I want
assert my_encode(u'\u0081') == '\\u0081'  # Instead produces '\\x81'
Note that I've tried unicode_escape and it comes close but doesn't quite satisfy what I want
I've noticed that simplejson does this conversion correctly:
>>> simplejson.dumps(u'\u0081')
'"\\u0081"'
But I'd rather avoid:
reinventing the wheel
doing some gross substringing of simplejson's output
According to the documentation you linked to:
Characters less than \u0020 and characters greater than \u007E in property keys or values are written as \uxxxx for the appropriate hexadecimal value xxxx.
So, that converts into Python readily as:
def my_encode(s):
    return ''.join(
        c if 0x20 <= ord(c) <= 0x7E else r'\u%04x' % ord(c)
        for c in s
    )
For each character in the string, if the code point is between 0x20 and 0x7E, then that character remains unchanged; otherwise, \u followed by the code point encoded as a 4-digit hex number is used. The expression c for c in s is a generator expression, so we convert that back into a string using str.join on the empty string.
For decoding, you can just use the unicode_escape codec as you mentioned:
def my_decode(s):
    return s.decode('unicode_escape')
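A quick check against the test case from the question (a sketch, assuming Python 2):
assert my_encode(u'\u2603') == r'\u2603'
assert my_encode(u'\u0081') == r'\u0081'       # not '\\x81', as unicode_escape would give
assert my_encode(u'Hello') == 'Hello'          # printable ASCII passes through unchanged
assert my_decode(my_encode(u'\u0081')) == u'\x81'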

How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, its utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4, and, someday in the future, utf8 might support them as well.
But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.
My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?
I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.
In other words, I want behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want behavior similar to encode(), but I don't want to actually encode the string. I still want a unicode string after filtering.
I DON'T want to escape the character before storing it in MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and unfeasible.
See also:
"Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
‘𠂉’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
[EDIT] Added tests about the proposed solutions
So far I've got good answers. Thanks, people! Now, in order to choose one of them, I did a quick test to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
    unichr(random.randrange(32,
        0x10ffff if random.randrange(100) > normal_chars else 0x0fff
    )) for i in xrange(string_size))

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)
I did no test using itertools because... well... that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF will have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogates. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile(u'[\uD800-\uDFFF].', re.UNICODE)
pattern = re.compile(u'[^\u0000-\uFFFF]', re.UNICODE)
Edit adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
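One caveat worth noting (my observation, not part of the original answer): on a narrow Python 2 build an astral character is stored as two surrogates, so it gets replaced by two U+FFFD characters rather than one:
sample = u'a\U0001F600b'
print repr(re_pattern.sub(u'\uFFFD', sample))
# wide build / Python 3.3+: u'a\ufffdb'
# narrow build:             u'a\ufffd\ufffdb' (each surrogate replaced separately)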
You may skip the decoding and encoding steps and directly inspect the value of the first byte of each character in the UTF-8 encoded 8-bit string. According to UTF-8:
#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you only need to check the value of only the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time and for smaller strings that mostly have 1-byte characters this could even be faster than the regular expression filtering.
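A rough usage sketch (my example), applying the function to UTF-8 encoded bytes and decoding the result afterwards:
raw = u'a\U0001F600b'.encode('utf-8')          # bytes: 'a' + 4-byte emoji + 'b'
print filter_4byte_chars(raw).decode('utf-8')  # -> u'ab'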
And just for the fun of it, an itertools monstrosity :)
import itertools as it, operator as op, functools as ft

def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs = it.izip(unicode_string, it.repeat(u'\ufffd'))
    # does 65535 <= ord(char) hold, i.e. should this char be replaced?
    selector = ft.partial(op.le, 65535)
    # using the character ordinals, return 0 or 1 based on `selector`
    indexer = it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
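Illustratively (my example):
max3bytes(u'a\U0001F600b')  # wide build: u'a\ufffdb'; narrow build: the surrogate pair (each unit < 0x10000) passes through unchanged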
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that the Unicode standard 5.2 chapter 3 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93 """Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.""" However this proscription is as far as I know largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u\ufffd:
u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
)
I'm guessing it's not the fastest, but it is quite straightforward (“pythonic” :)):
def max3bytes(unicode_string):
return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
NB: this code does not take into account the fact that Unicode has surrogate characters in the ranges U+D800-U+DFFF.
This does more than just filter out Unicode characters with 3+ byte UTF-8 encodings. It removes non-ASCII characters, but tries to do so gently, replacing them with relevant ASCII counterparts where possible. It can be a blessing later on if your text contains, for example, not a dozen different Unicode apostrophes and quotation marks (usually coming from Apple handhelds) but only the regular ASCII apostrophe and quotation mark.
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust, I use it with some more guards:
import unicodedata

def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible

    Args:
        value (string): input string, can contain unicode characters

    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts (for example en-dash and em-dash with regular dash,
        apostrophe and quotation variations with the standard ones) or taken
        out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value
    if isinstance(value, str):
        return value
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2 BTW.
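A quick sanity check (illustrative values, Python 2):
print neutralize_unicode(u'caf\xe9')   # -> 'cafe' (accent decomposed by NFKD, then dropped)
print neutralize_unicode(u'\u2116 5')  # -> 'No 5' (U+2116 NUMERO SIGN has a compatibility mapping)
print neutralize_unicode('bytes in')   # -> 'bytes in' (byte strings are returned unchanged)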

How to unquote a urlencoded unicode string in python?

I have a unicode string like "Tanım" which is encoded as "Tan%u0131m" somehow. How can I convert this encoded string back to the original unicode?
Apparently urllib.unquote does not support unicode.
%uXXXX is a non-standard encoding scheme that has been rejected by the w3c, despite the fact that an implementation continues to live on in JavaScript land.
The more common technique seems to be to UTF-8 encode the string and then % escape the resulting bytes using %XX. This scheme is supported by urllib.unquote:
>>> urllib2.unquote("%0a")
'\n'
Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is likely to be far more preferable to simply UTF-8 encode your unicode and then % escape the resulting bytes.
A more complete example:
>>> u"Tanฤฑm"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanฤฑm".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'
import re

def unquote(text):
    def unicode_unquoter(match):
        return unichr(int(match.group(1), 16))
    return re.sub(r'%u([0-9a-fA-F]{4})', unicode_unquoter, text)
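For example (a sketch, assuming Python 2 and a terminal encoding that can display the result):
print unquote('Tan%u0131m')  # -> Tanım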
This will do it if you absolutely have to have this (I really do agree with the cries of "non-standard"):
from urllib import unquote

def unquote_u(source):
    result = unquote(source)
    if '%u' in result:
        result = result.replace('%u', '\\u').decode('unicode_escape')
    return result

print unquote_u('Tan%u0131m')
> Tanım
There is a bug in the above version: it sometimes freaks out when there are both ASCII-encoded and Unicode-encoded characters in the string. I think it's specifically when there are characters from the upper 128-byte range, like '\xab', in addition to Unicode.
e.g. "%5B%AB%u03E1%BB%5D" causes this error.
I found that if you just did the Unicode ones first, the problem went away:
def unquote_u(source):
    result = source
    if '%u' in result:
        result = result.replace('%u', '\\u').decode('unicode_escape')
    result = unquote(result)
    return result
You have a URL using a non-standard encoding scheme, rejected by standards bodies but still being produced by some encoders. The Python urllib.parse.unquote() function can't handle these.
Creating your own decoder is not that hard, luckily. %uhhhh entries are meant to be UTF-16 codepoints here, so we need to take surrogate pairs into account. I've also seen %hh codepoints mixed in, for added confusion.
With that in mind, here is a decoder which works in both Python 2 and Python 3, provided you pass in a str object in Python 3 (Python 2 cares less):
try:
    # Python 3
    from urllib.parse import unquote
    unichr = chr
except ImportError:
    # Python 2
    from urllib import unquote

def unquote_unicode(string, _cache={}):
    string = unquote(string)  # handle two-digit %hh components first
    parts = string.split(u'%u')
    if len(parts) == 1:
        return string
    r = [parts[0]]
    append = r.append
    for part in parts[1:]:
        try:
            digits = part[:4].lower()
            if len(digits) < 4:
                raise ValueError
            ch = _cache.get(digits)
            if ch is None:
                ch = _cache[digits] = unichr(int(digits, 16))
            if (
                not r[-1] and
                u'\uDC00' <= ch <= u'\uDFFF' and
                u'\uD800' <= r[-2] <= u'\uDBFF'
            ):
                # UTF-16 surrogate pair, replace with single non-BMP codepoint
                r[-2] = (r[-2] + ch).encode(
                    'utf-16', 'surrogatepass').decode('utf-16')
            else:
                append(ch)
            append(part[4:])
        except ValueError:
            append(u'%u')
            append(part)
    return u''.join(r)
The function is heavily inspired by the current standard-library implementation.
Demo:
>>> print(unquote_unicode('Tan%u0131m'))
Tanım
>>> print(unquote_unicode('%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'))
איך ממירים את הטקסט הזה
>>> print(unquote_unicode('%ud83c%udfd6')) # surrogate pair
🏖
>>> print(unquote_unicode('%ufoobar%u666')) # incomplete
%ufoobar%u666
The function works on Python 2 (tested on 2.4 - 2.7) and Python 3 (tested on 3.3 - 3.8).
