How to unquote a urlencoded unicode string in Python?

I have a unicode string like "Tanım" which is encoded as "Tan%u0131m" somehow. How can I convert this encoded string back to the original unicode?
Apparently urllib.unquote does not support unicode.

%uXXXX is a non-standard encoding scheme that has been rejected by the W3C, despite the fact that an implementation continues to live on in JavaScript land.
The more common technique seems to be to UTF-8 encode the string and then % escape the resulting bytes using %XX. This scheme is supported by urllib.unquote:
>>> urllib.unquote("%0a")
'\n'
Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is likely far preferable to simply UTF-8 encode your unicode and then percent-escape the resulting bytes.
A more complete example:
>>> u"Tanım"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanım".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'
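For reference (not part of the original answer), the same round trip in Python 3, where urllib.parse.quote() and urllib.parse.unquote() handle the UTF-8 encoding and decoding for you:
>>> from urllib.parse import quote, unquote
>>> url = quote("Tanım")
>>> url
'Tan%C4%B1m'
>>> unquote(url)
'Tanım'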

import re

def unquote(text):
    def unicode_unquoter(match):
        return unichr(int(match.group(1), 16))
    return re.sub(r'%u([0-9a-fA-F]{4})', unicode_unquoter, text)
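A quick check of the regex-based unquoter above (my example, assuming Python 2 since it uses unichr):
>>> unquote('Tan%u0131m')
u'Tan\u0131m'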

This will do it if you absolutely have to have this (I really do agree with the cries of "non-standard"):
from urllib import unquote

def unquote_u(source):
    result = unquote(source)
    if '%u' in result:
        result = result.replace('%u', '\\u').decode('unicode_escape')
    return result

print unquote_u('Tan%u0131m')
> Tanım

There is a bug in the above version: it sometimes fails when the string contains both %XX-encoded and %uXXXX-encoded characters. I think it happens specifically when there are bytes from the upper 128 range, like '\xab', in addition to the Unicode escapes.
e.g. "%5B%AB%u03E1%BB%5D" causes this error.
I found that if you just handle the %u escapes first, the problem goes away:
def unquote_u(source):
    result = source
    if '%u' in result:
        result = result.replace('%u', '\\u').decode('unicode_escape')
    result = unquote(result)
    return result

You have a URL using a non-standard encoding scheme, rejected by standards bodies but still being produced by some encoders. The Python urllib.parse.unquote() function can't handle these.
Creating your own decoder is not that hard, luckily. %uhhhh entries are meant to be UTF-16 codepoints here, so we need to take surrogate pairs into account. I've also seen %hh codepoints mixed in, for added confusion.
With that in mind, here is a decoder which works in both Python 2 and Python 3, provided you pass in a str object in Python 3 (Python 2 cares less):
try:
    # Python 3
    from urllib.parse import unquote
    unichr = chr
except ImportError:
    # Python 2
    from urllib import unquote

def unquote_unicode(string, _cache={}):
    string = unquote(string)  # handle two-digit %hh components first
    parts = string.split(u'%u')
    if len(parts) == 1:
        return parts[0]
    r = [parts[0]]
    append = r.append
    for part in parts[1:]:
        try:
            digits = part[:4].lower()
            if len(digits) < 4:
                raise ValueError
            ch = _cache.get(digits)
            if ch is None:
                ch = _cache[digits] = unichr(int(digits, 16))
            if (
                not r[-1] and
                u'\uDC00' <= ch <= u'\uDFFF' and
                u'\uD800' <= r[-2] <= u'\uDBFF'
            ):
                # UTF-16 surrogate pair, replace with single non-BMP codepoint
                r[-2] = (r[-2] + ch).encode(
                    'utf-16', 'surrogatepass').decode('utf-16')
            else:
                append(ch)
            append(part[4:])
        except ValueError:
            append(u'%u')
            append(part)
    return u''.join(r)
The function is heavily inspired by the current standard-library implementation.
Demo:
>>> print(unquote_unicode('Tan%u0131m'))
Tanım
>>> print(unquote_unicode('%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'))
איך ממירים את הטקסט הזה
>>> print(unquote_unicode('%ud83c%udfd6')) # surrogate pair
🏖
>>> print(unquote_unicode('%ufoobar%u666')) # incomplete
%ufoobar%u666
The function works on Python 2 (tested on 2.4 - 2.7) and Python 3 (tested on 3.3 - 3.8).

Related

How to recover the text in url and convert text back? [duplicate]


How to convert \\xhh into \xhh python

I have encountered a case where I need to convert a string of escape sequences into the actual characters in Python.
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
print s #gives: \x80\x78\x07\x00\x75\xb3
What I want is that, given the string s, I can get the real characters stored in s, which in this case are the actual bytes \x80, \x78, \x07, \x00, \x75 and \xb3 (mostly unprintable), not the escape sequences.
You can use string-escape encoding (Python 2.x):
>>> s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
>>> s.decode('string-escape')
'\x80x\x07\x00u\xb3'
Use unicode-escape encoding (in Python 3.x, need to convert to bytes first):
>>> s.encode().decode('unicode-escape')
'\x80x\x07\x00u³'
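One caveat worth adding (my note, not from the original answer): in Python 3 the unicode-escape round trip gives you a str of code points rather than bytes, which is why byte 0xb3 shows up as '³'. If you need the raw bytes back, you can re-encode the result as Latin-1, since code points 0-255 map one-to-one onto byte values 0-255:
>>> s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
>>> s.encode('ascii').decode('unicode_escape').encode('latin-1')
b'\x80x\x07\x00u\xb3'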
You can simply write a function that takes the string and returns the converted form.
Something like this:
def str_to_chr(s):
    res = ""
    s = s.split("\\")[1:]  # "\\x33\\x45" -> ["x33", "x45"]
    for i in s:
        res += chr(int('0' + i, 16))  # convert to decimal, then take the chr
    return res
Remember to print the return value of the function.
To find out what each line does, run that line; if you still have questions, leave a comment... I'll answer.
Or you can build a string from the byte values, but not all of them might be "printable" depending on your encoding. Example:
# -*- coding: utf-8 -*-
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
r = ''
for byte in s.split('\\x'):
    if byte:  # to get rid of empties
        r += chr(int(byte, 16))  # convert to int from hex string first
print (r)  # given the example, not all bytes are printable chars in utf-8
HTH, Edwin

How to encode UTF-8 strings with only "A-Z","a-z","0-9", and "_" in Python

I need to build a python encoder so that I can reformat strings like this:
import codecs
codecs.encode("Random 🐍 UTF-8 String ☑⚠⚡", 'name_of_my_encoder')
The reason this is even something I'm asking Stack Overflow is that the encoded strings need to pass the validation function below. This is a hard constraint; there is no flexibility on this, because of how the strings have to be stored.
from string import ascii_letters
from string import digits

valid_characters = set(ascii_letters + digits + '_')

def validation_function(characters):
    for char in characters:
        if char not in valid_characters:
            raise Exception
Making an encoder seemed easy enough, but I'm not sure if this encoder is making it harder to build a decoder. Here's the encoder I've written.
from codecs import encode
from string import ascii_letters
from string import digits

ALPHANUMERIC_SET = set(ascii_letters + digits)

def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))
        else:
            chars_out.append(char)
    return ''.join(chars_out)
This is the encoder I've written. I've only included it for example purposes; there's probably a better way to do this.
Edit 1 - Someone has wisely pointed out just using base32 on the entire string, which I can definitely use. However, it would be preferable to have something that is 'somewhat readable', so an escaping system like https://en.wikipedia.org/wiki/Quoted-printable or https://en.wikipedia.org/wiki/Percent-encoding would be preferred.
Edit 2 - Proposed solutions must work on Python 3.4 or newer, working in Python 2.7 as well is nice, but not required. I've added the python-3.x tag to help clarify this a little.
This seems to do the trick. Basically, alphanumeric letters are left alone. Any non-alphanumeric character in the ASCII set is encoded as a \xXX escape code. All other unicode characters are encoded using the \uXXXX escape code. However, you've said you can't use \, but you can use _, thus all escape sequences are translated to start with a _. This makes decoding extremely simple. Just replace the _ with \ and then use the unicode-escape codec. Encoding is slightly more difficult as the unicode-escape codec leaves ASCII characters alone. So first you have to escape the relevant ASCII characters, then run the string through the unicode-escape codec, before finally translating all \ to _.
Code:
from string import ascii_letters, digits

# non-translating characters
ALPHANUMERIC_SET = set(ascii_letters + digits)

# mapping all bytes to themselves, except '_' maps to '\'
ESCAPE_CHAR_DECODE_TABLE = bytes(bytearray(range(256)).replace(b"_", b"\\"))

# reverse mapping -- maps '\' back to '_'
ESCAPE_CHAR_ENCODE_TABLE = bytes(bytearray(range(256)).replace(b"\\", b"_"))

# encoding table for ASCII characters not in ALPHANUMERIC_SET
ASCII_ENCODE_TABLE = {i: u"_x{:x}".format(i) for i in set(range(128)) ^ set(map(ord, ALPHANUMERIC_SET))}

def encode(s):
    s = s.translate(ASCII_ENCODE_TABLE)  # translate ascii chars not in your set
    bytes_ = s.encode("unicode-escape")
    bytes_ = bytes_.translate(ESCAPE_CHAR_ENCODE_TABLE)
    return bytes_

def decode(s):
    s = s.translate(ESCAPE_CHAR_DECODE_TABLE)
    return s.decode("unicode-escape")
s = u"Random UTF-8 String ☑⚠⚡"
#s = '北亰'
print(s)
b = encode(s)
print(b)
new_s = decode(b)
print(new_s)
Which outputs:
Random UTF-8 String ☑⚠⚡
b'Random_x20UTF_x2d8_x20String_x20_u2611_u26a0_u26a1'
Random UTF-8 String ☑⚠⚡
This works on both Python 3.4 and Python 2.7, which is why the ESCAPE_CHAR_{DE,EN}CODE_TABLE construction is a bit messy: bytes on Python 2.7 is an alias for str, which works differently from bytes on Python 3.4. That is why the tables are constructed using a bytearray. For Python 2.7, the encode method expects a unicode object, not str.
Use base32! It uses only the 26 uppercase letters plus the digits 2-7, with = padding that only appears when the input length isn't a multiple of 5 bytes. You can't use base64 because it also uses +, / and the = character, which won't pass your validator.
>>> import base64
>>>
>>> print base64.b32encode('Random 🐍 UTF-8 String ☑⚠⚡"')
KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC
>>>
>>> print base64.b32decode('KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC')
Random 🐍 UTF-8 String ☑⚠⚡"
>>>
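For reference, here is a sketch of the same idea in Python 3 (my addition, not from the original answer): b32encode() works on bytes there, and the = padding is stripped so the result passes the [A-Za-z0-9_] validator, then re-added before decoding:
import base64

text = "Random 🐍 UTF-8 String ☑⚠⚡"
encoded = base64.b32encode(text.encode('utf-8')).decode('ascii').rstrip('=')
# re-add the '=' padding before decoding (base32 input length must be a multiple of 8)
padded = encoded + '=' * (-len(encoded) % 8)
decoded = base64.b32decode(padded).decode('utf-8')
assert decoded == text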
Despite several good answers, I ended up with a solution that seemed cleaner and more understandable, so I'm posting the code of my eventual solution to answer my own question.
from string import ascii_letters
from string import digits
from base64 import b16decode
from base64 import b16encode

ALPHANUMERIC_SET = set(ascii_letters + digits)

def utf8_string_to_hex_string(s):
    return ''.join(chr(i) for i in b16encode(s.encode('utf-8')))

def hex_string_to_utf8_string(s):
    return b16decode(bytes(list(ord(i) for i in s))).decode('utf-8')

def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(utf8_string_to_hex_string(char)))
        else:
            chars_out.append(char)
    return ''.join(chars_out)

def underscore_decode(chars_in):
    chars_out = list()
    decoding = False
    for char in chars_in:
        if char == '_':
            if not decoding:
                hex_chars = list()
                decoding = True
            elif decoding:
                decoding = False
                chars_out.append(hex_string_to_utf8_string(hex_chars))
        else:
            if not decoding:
                chars_out.append(char)
            elif decoding:
                hex_chars.append(char)
    return ''.join(chars_out)
You could abuse URL quoting to get a format that is both readable and easy to decode in other languages, and that passes your validation function:
#!/usr/bin/env python3
import urllib.parse

def alnum_encode(text):
    return urllib.parse.quote(text, safe='')\
        .replace('-', '%2d').replace('.', '%2e').replace('_', '%5f')\
        .replace('%', '_')

def alnum_decode(underscore_encoded):
    return urllib.parse.unquote(underscore_encoded.replace('_', '%'), errors='strict')

s = alnum_encode("Random 🐍 UTF-8 String ☑⚠⚡")
print(s)
print(alnum_decode(s))
Output
Random_20_F0_9F_90_8D_20UTF_2d8_20String_20_E2_98_91_E2_9A_A0_E2_9A_A1
Random 🐍 UTF-8 String ☑⚠⚡
Here's an implementation using a bytearray() (to move it to C later if necessary):
#!/usr/bin/env python3.5
from string import ascii_letters, digits

def alnum_encode(text, alnum=bytearray(ascii_letters + digits, 'ascii')):
    result = bytearray()
    for byte in bytearray(text, 'utf-8'):
        if byte in alnum:
            result.append(byte)
        else:
            result += b'_%02x' % byte
    return result.decode('ascii')
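The answer only shows the encoder; a minimal matching decoder could look like this (my sketch, assuming every _hh group is exactly one hex-escaped UTF-8 byte, as produced by alnum_encode above):
def alnum_decode(encoded):
    # inverse of alnum_encode: each '_hh' group is one hex-escaped UTF-8 byte
    result = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == '_':
            result.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            result.append(ord(encoded[i]))
            i += 1
    return result.decode('utf-8')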
If you want a transliteration of Unicode to ASCII (e.g. ç --> c), then check out the Unidecode package. Here are their examples:
>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode(u"\u5317\u4EB0")
'Bei Jing '
Here's my example:
# -*- coding: utf-8 -*-
from unidecode import unidecode
print unidecode(u'快樂星期天')
Gives as an output*
Kuai Le Xing Qi Tian
*may be nonsense, but at least it's ASCII
To remove punctuation, see this answer.

String.maketrans for English and Persian numbers

I have a function like this:
persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'
english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)
text.translate(english_trans)
text.translate(arabic_trans)
I want it to translate all Arabic and English numbers to Persian. But Python says:
english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length
I tried encoding the strings as UTF-8, but I always got errors! Sometimes the problem is with the Arabic string instead! Do you know a better solution for this job?
EDIT:
It seems the problem is the length of the Unicode characters once encoded. An Arabic digit like '۱' takes two bytes -- which I found out by inspecting it with ord(). And the length problem starts from here :-(
See the unidecode library, which converts strings to plain ASCII transliterations. It is very useful for number input in different languages.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
Unicode objects can interpret these digits (arabic and persian) as actual digits -
no need to translate them by using character substitution.
EDIT -
I came up with a way to do your replacement using Python 2 regular expressions:
# coding: utf-8
import re

# Attention: while the characters in the strings below are
# displayed identically, internally they are represented
# by distinct unicode codepoints
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'

persian_regexp = u"(%s)" % u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)

def _sub(match_object, digits):
    return english_numbers[digits.find(match_object.group(0))]

def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)

def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)

def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)

def replace_persian(text):
    return re.sub(persian_regexp, _sub_persian, text)
Note that the "text" parameter must be unicode itself.
(This code could also be shortened by using lambdas and combining some expressions into single lines, but there is no point in doing so other than losing readability.)
That should work for you up to here, but please read on for the original answer I had posted.
-- original answer
So, if you instantiate your variables as unicode (prepending a u to the quote char), they are correctly understood in Python:
>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>>
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>
By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).
It is very important to understand the basics of unicode - for everyone, even people writing English-only programs who think they will never deal with any character outside the 26 Latin letters. When writing code that will deal with different characters it is vital - the program can't possibly work, except by chance, without you knowing what you are doing.
A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now.
You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'
Thus, the characters lose their meaning as "single entities" and as digits when encoding - the encoded object (str type in Python 2.x) is just a string of bytes - which is nonetheless needed when sending these characters to any output from the program - be it console, GUI window, database, HTML code, etc.
You can use persiantools package:
Examples:
>>> from persiantools import digits
>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'
>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١") # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'
unidecode converts all characters from Persian to English. If you want to change only the numbers, follow the example below:
In Python 3 you can use this code to convert any Persian or Arabic digit to its English equivalent while keeping other characters unchanged:
intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)
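For example, with a hypothetical input string containing both Persian and Arabic digits:
input_text = 'سال ۱۴۰۲ / ٢٠٢٣'  # hypothetical input, just to show usage
print(input_text.translate(translation_table))  # -> سال 1402 / 2023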
Use Unicode Strings:
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
And make sure the encoding of your Python file is correct.
With this you can easily do that:
def p2e(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    for i, j in number.items():
        persiannumber = persiannumber.replace(j, i)
    return persiannumber
here is usage:
print(p2e('۳۱۹۶'))
#returns 3196
In Python 3 the easiest way is:
str(int('۱۲۳'))
#123
but if the number starts with 0 it has an issue,
so we can use the zip() function instead:
for i, j in zip('1234567890', '۱۲۳۴۵۶۷۸۹۰'):
    number = number.replace(j, i)
def persian_number(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    for i, j in number.items():
        persiannumber = persiannumber.replace(i, j)
    return persiannumber
persiannumber must be a string

How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in future, utf8 might support it as well.
But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.
My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?
I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.
In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.
I DON'T want to escape the character before storing at the MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and unfeasible.
See also:
"Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
‘𠂉’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
[EDIT] Added tests about the proposed solutions
So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick testing to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
    unichr(random.randrange(32,
        0x10ffff if random.randrange(100) > normal_chars else 0x0fff
    )) for i in xrange(string_size))

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)
print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')
#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)
I did no test using itertools because... well... that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF will have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for the surrogates used by multibyte UTF-16. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)
Edit adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
You may skip the decoding and encoding steps and directly detect the value of the first byte (8-bit string) of each character. According to UTF-8:
#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you only need to check the value of only the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time and for smaller strings that mostly have 1-byte characters this could even be faster than the regular expression filtering.
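A quick usage check (my example, assuming Python 2, where the argument is an already UTF-8-encoded byte string):
>>> filter_4byte_chars(u'a\U0001F600b'.encode('utf-8'))  # the emoji is a 4-byte sequence
'ab'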
And just for the fun of it, an itertools monstrosity :)
import itertools as it, functools as ft, operator as op

def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs = it.izip(unicode_string, it.repeat(u'\ufffd'))
    # is 65535 less than or equal to the ordinal (i.e. is the char outside the BMP)?
    selector = ft.partial(op.le, 65535)
    # using the character ordinals, return False or True based on `selector`
    indexer = it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item for all pairs (True picks the replacement)
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that the Unicode standard 5.2 chapter 3 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93 """Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.""" However this proscription is as far as I know largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u\ufffd:
u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
)
I'm guessing it's not the fastest, but quite straightforward (“pythonic” :) :
def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
NB: this code does not take into account the fact that Unicode has surrogate characters in the ranges U+D800-U+DFFF.
This does more than just filtering out 3+ byte UTF-8 Unicode characters. It removes Unicode characters, but tries to do so gently, replacing them with relevant ASCII characters where possible. It can be a blessing later on if your text doesn't contain, say, a dozen different Unicode apostrophes and quotation marks (usually coming from Apple handhelds) but only the regular ASCII apostrophes and quotes.
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust, I use it with some more guards:
import unicodedata

def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible

    Args:
        value (string): input string, can contain unicode characters

    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts (for example en-dash and em-dash with regular dash,
        apostrophe and quotation variations with the standard ones) or taken
        out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value

    if isinstance(value, str):
        return value

    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2 BTW.
