Printing characters that are not in ASCII 0-127 (Python 3)

So, I've got an algorithm whereby I take a character, take its character code, increase that code by a variable, and then print that new character. However, I'd also like it to work for characters not in the default ASCII table. Currently it's not printing 'special' characters like € (for example). How can I make it print certain special characters?
#!/usr/bin/python3
# -*- coding: utf-8 -*-

def generateKey(name):
    i = 0
    result = ""
    for char in name:
        newOrd = ord(char) + i
        newChar = chr(newOrd)
        print(newChar)
        result += newChar
        i += 1
    print("Serial key for name: ", result)

generateKey(input("Enter name: "))
Whenever I give an input that forces special characters (like |||||), it works fine for the first four characters (including DEL where it gives the transparent rectangle icon), but the fifth character (meant to be €) is also an error char, which is not what I want. How can I fix this?
Here's the output from |||||:
Enter name: |||||
|
}
~
Serial key for name: |}~
But the last char should be €, not a blank. (BTW the fourth char, DEL, becomes a transparent rectangle when I copy it into Windows)

chr(128) is not the euro sign. It is U+0080, a C1 control character in Unicode, so a blank (or a box glyph) is exactly what should print, not €.
You can verify the default encoding with sys.getdefaultencoding().
If you want to reinterpret the byte value 128 as the euro sign, you need the windows-1252 encoding, where 0x80 is indeed €. (Different encodings disagree on how to represent values beyond ASCII's 0–127.)
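You can see the difference in the interpreter (Python 3):
>>> chr(128)
'\x80'
>>> bytes([128]).decode('windows-1252')
'€'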

Related

How to deal with variations in apostrophe ’ and ' [duplicate]

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy & paste it into my site's editor (a simple textarea, no WYSIWYG etc.).
Those texts usually contain "nice" quotes instead of the plain ASCII ones ("). They also sometimes contain longer dashes like – instead of -.
Now I want to replace all those characters with their ASCII counterparts. However, I do not want to remove umlauts and other non-ASCII characters. I'd also highly prefer a proper solution that does not involve creating a mapping dict for all those characters.
All my strings are unicode objects.
What about this?
It creates a translation table first, but honestly I don't think you can do this without one.
transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--")])

with open("a.txt", "w", encoding="utf-8") as f_out:
    a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes” "
    print(" a_str = " + a_str, file=f_out)
    fixed_str = a_str.translate(transl_table)
    print(" fixed_str = " + fixed_str, file=f_out)
I wasn't able to run this printing to a console (on Windows), so I had to write to a txt file.
The output in the a.txt file looks as follows:
a_str = ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
fixed_str = 'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"
By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the differences in Unicode string handling between the two versions of the language.
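A rough Python 2 equivalent might look like this (an untested sketch: io.open stands in for the encoding-aware open, the u prefixes are required there, and the coding declaration is needed for the non-ASCII literals):
# -*- coding: utf-8 -*-
import io

transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--")])

with io.open("a.txt", "w", encoding="utf-8") as f_out:
    a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes” "
    f_out.write(u" a_str = " + a_str + u"\n")
    f_out.write(u" fixed_str = " + a_str.translate(transl_table) + u"\n")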
There is no such "proper" solution, because for any given Unicode character there is no "ASCII counterpart" defined.
For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to those names:
#!/usr/bin/env python3
import unicodedata

def unicode_character_name(char):
    try:
        return unicodedata.name(char)
    except ValueError:
        return None

# Generate all Unicode characters with their names
all_unicode_characters = []
for n in range(0, 0x10ffff):  # Unicode planes 0-16
    char = chr(n)             # Python 3
    # char = unichr(n)        # Python 2
    name = unicode_character_name(char)
    if name:
        all_unicode_characters.append((char, name))

# Find all Unicode quotation marks
print(' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
# " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 " 🙶 🙷 🙸

# Find all Unicode hyphens
print(' '.join([char for char, name in all_unicode_characters if 'HYPHEN' in name]))
# - ­ ֊ ᐀ ᠆ ‐ ‑ ‧ ⁃ ⸗ ⸚ ⹀ ゠ ﹣ - 󠀭

# Find all Unicode dashes
print(' '.join([char for char, name in all_unicode_characters if 'DASH' in name and 'DASHED' not in name]))
# ‒ – — ⁓ ⊝ ⑈ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ╌ ╍ ╎ ╏ ⤌ ⤍ ⤎ ⤏ ⤐ ⥪ ⥫ ⥬ ⥭ ⩜ ⩝ ⫘ ⫦ ⬷ ⸺ ⸻ ⹃ 〜 〰 ︱ ︲ ﹘ 💨
As you can see, as easy as this example is, there are many problems. There are many quotation marks in Unicode that don't look anything like the quotation marks in US-ASCII and there are many hyphens in Unicode that don't look anything like the hyphen-minus sign in US-ASCII.
And there are many questions. For example:
should the "SWUNG DASH" (⁓) symbol be replaced with an ASCII hyphen (-) or a tilde (~)?
should the "CANADIAN SYLLABICS HYPHEN" (᐀) be replaced with an ASCII hyphen (-) or an equals sign (=)?
should the "SINGLE LEFT-POINTING ANGLE QUOTATION MARK" (‹) be replaces with an ASCII quotation mark ("), an apostrophe (') or a less-than sign (<)?
To establish a "correct" ASCII counterpart, somebody needs to answer these questions based on the use context. That's why all the solutions to your problem are based on a mapping dictionary in one way or another. And all these solutions will provide different results.
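For example, a small context-specific table built with str.maketrans in Python 3 (the mappings chosen here are one set of assumptions for one use case, not canonical answers):
table = str.maketrans({
    '\u2018': "'",   # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",   # RIGHT SINGLE QUOTATION MARK
    '\u201c': '"',   # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',   # RIGHT DOUBLE QUOTATION MARK
    '\u2013': '-',   # EN DASH
    '\u2014': '--',  # EM DASH
})
print('“Nice” – isn’t it?'.translate(table))
# "Nice" - isn't it?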
You can build on top of the unidecode package.
This is pretty slow, since we normalize all the unicode first to the combined (NFC) form, then check what unidecode turns each character into. If the result matches a Latin letter, we keep the original NFC character; if not, we yield whatever degarbling unidecode has suggested. This leaves accented letters alone but converts everything else.
import unidecode
import unicodedata
import re

def char_filter(string):
    latin = re.compile('[a-zA-Z]+')
    for char in unicodedata.normalize('NFC', string):
        decoded = unidecode.unidecode(char)
        if latin.match(decoded):
            yield char
        else:
            yield decoded

def clean_string(string):
    return "".join(char_filter(string))

print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). However, read the documentation for the Unicode case: there the translation table takes another form, mapping a Unicode ordinal number to a Unicode string (usually a single character) or to None (which deletes the character).
It still requires the dict, though. You have to capture the replacements somewhere; how would you do that without any table or arrays? You could use str.replace() for the single characters, but that would be inefficient.
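For instance, a table in that form can both substitute and delete:
table = {ord(u'\u2019'): u"'",   # RIGHT SINGLE QUOTATION MARK -> apostrophe
         ord(u'\u200b'): None}   # ZERO WIDTH SPACE -> delete
print(u'it\u2019s na\u200bive'.translate(table))
# it's naive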
This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html
-S, --smart
Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.” (Note: This option is significant only when the input format is markdown or textile. It is selected automatically when the input format is textile or the output format is latex or context.)
It's Haskell, so you'd have to figure out the interface.
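If you'd rather drive it from Python, something like this should work (a sketch: the filenames are hypothetical, pandoc must be on your PATH, and the -S flag belongs to pandoc 1.x):
import subprocess

# Round-trip a markdown file through pandoc -S to normalize punctuation.
subprocess.check_call(['pandoc', '-S', '-f', 'markdown', '-t', 'markdown',
                       '-o', 'normalized.md', 'input.md'])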

How can I handle these weird special characters messing up my print formatting?

I am printing a formatted table, but sometimes user-generated characters take up more than one character width and mess up the formatting, as the screenshot showed.
The width of the "title" column is formatted to be 68 characters. But these "special characters" take up more than one character width while being counted as only one character. This pushes the column past its bounds.
print('{0:16s}{3:<18s}{1:68s}{2:>8n}'.format(
    (' ' + streamer['user_name'][:12] + '..') if len(streamer['user_name']) > 12 else ' ' + streamer['user_name'],
    (streamer['title'].strip()[:62] + '..') if len(streamer['title']) > 62 else streamer['title'].strip(),
    streamer['viewer_count'],
    (gamesDic[streamer['game_id']][:15] + '..') if len(gamesDic[streamer['game_id']]) > 15 else gamesDic[streamer['game_id']]))
Any advice on how to deal with these special characters?
edit: I printed the offending string to a file:
🔴 𝐀𝐒𝐌𝐑 (𝙪𝙥 𝙘𝙡𝙤𝙨𝙚) ✨ LIVE 🔔 SUBS GET SNAPCHAT
edit2: Why do these not align on a character boundary?
edit3: Today the first two characters are producing weird output, but the columns are aligned in each case below.
First character in isolation: title[0]
Second character in isolation: title[1]
First and second characters together: title[0] + title[1]
I made this comment on the question:
The characters in "LIVE" are fullwidth characters. A hacky way to deal with them might be to test their width with unicodedata.east_asian_width(char) (it will return "F" for fullwidth characters) and substitute with the final character of unicodedata.name(char) (or just count them as length 2).
This "answer" is essentially another comment, but too long for the comment field.
This hack - as implemented in Alderven's answer - almost works for the OP, but the example string is rendered with an extra half a character width (note the example string does not contain any East Asian halfwidth characters).
I am unable to reproduce this exact behaviour using the test statement below, where s is the example string from the question and ud is the unicodedata module, varying which characters are removed:
print((s + (68 - (len(s) + sum(1 for x in s if ud.east_asian_width(x) in ('F', 'N', 'W')))) * 'x')+ '\n'+ ('x' * 68))
In a Python 3.6 interpreter in a Gnome terminal on Debian, using the default monospace regular font, removing full-width characters causes the example string to apparently render three characters longer than the equivalent string of "x"s.
Removing full-width and wide (East Asian Width "W") characters produced a string that appeared to render the same length as the equivalent number of "x"s.
In a Python 3.7 KDE Konsole terminal on OpenSuse, using Ubuntu Monospace regular font, I could not produce a string that rendered the same length regardless of the combination of full-width, wide or neutral ("N") characters that I removed.
I did notice that the sparkles character (✨) seemed to take up an extra half a width when rendered alone in Konsole, but I could not see any half-width differences when testing the full string.
I suspect that the problem is low-level rendering outside the control of Python, as this note from the Unicode standard suggests:
Note: The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time.
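For reference, the width test suggested in the comment might look like this (a sketch; whether a terminal really renders 'F' and 'W' characters as two cells is exactly the caveat above):
import unicodedata

def display_width(s):
    # Count fullwidth ('F') and wide ('W') characters as two columns.
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W') else 1
               for c in s)

print(display_width('LIVE'))    # 4
print(display_width('ＬＩＶＥ'))  # fullwidth letters count as 8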
I've written a custom string formatter based on snakecharmerb's comment, but the "half character width" problem still persists:
import unicodedata

def fstring(string, max_length, align='l'):
    string = str(string)
    extra_length = 0
    for char in string:
        if unicodedata.east_asian_width(char) == 'F':
            extra_length += 1
    diff = max_length - len(string) - extra_length
    if diff > 0:
        return string + diff * ' ' if align == 'l' else diff * ' ' + string
    elif diff < 0:
        return string[:max_length-3] + '.. '
    return string

data = [{'user_name': 'shroud', 'game_id': 'Apex Legends', 'title': 'pathfinder twitch prime loot YAYA #shroud on socials for update', 'viewer_count': 66200},
        {'user_name': 'Amouranth', 'game_id': 'ASMR', 'title': '🔴 𝐀𝐒𝐌𝐑 (𝙪𝙥 𝙘𝙡𝙤𝙨𝙚) ✨ LIVE 🔔 SUBS GET SNAPCHAT', 'viewer_count': 2261}]

for d in data:
    name = fstring(d['user_name'], 20)
    game_id = fstring(d['game_id'], 15)
    title = fstring(d['title'], 62)
    count = fstring(d['viewer_count'], 10, align='r')
    print('{}{}{}{}'.format(name, game_id, title, count))
It produces output:
(can't post it as a text since formatting will be lost)

Displaying Unprintable Characters in Pygame

I'm trying to create a game with ASCII art in Pygame on Python 2.7. If I go to the Idle console and simply type:
for i in range(255):
    print str(i) + ' - ' + str(chr(i))
I get nearly 255 distinct characters. However, if I try a similar stunt in Pygame:
import pygame, os, string, sys
from pygame.locals import *

pygame.init()

class Prog:
    def __init__(self):
        self.title = 'Text'
        self.screen_size = (800, 600)
        self.screen = pygame.display.set_mode(self.screen_size)
        self.bg_color = (255, 255, 255)
        self.text_color = (0, 0, 0)
        self.text_font = 'Times New Roman'
        self.text_size = 20
        self.font = pygame.font.SysFont(self.text_font, self.text_size)

    def draw_text(self, text, x, y):
        textobj = self.font.render(text, 1, self.text_color)
        textrect = textobj.get_rect()
        textrect.center = (x, y)
        self.screen.blit(textobj, textrect)

    def main(self):
        rec = self.screen.get_rect()
        done = False
        while not done:
            for event in pygame.event.get():
                if event.type == QUIT:
                    done = True
            self.screen.fill(self.bg_color)
            x = 20
            y = 100
            n = 1
            for i in range(1, 255):
                self.draw_text(chr(i), x, y)
                x += 20
                n += 1
                if n > 25:
                    n = 1
                    x = 20
                    y += 30
            pygame.display.flip()
Most of it is just empty boxes. Why the discrepancy? I've tried changing fonts, even using the one that Idle uses; I've tried parsing it as unicode; nothing seems to work. It wouldn't bother me so much if not for the fact that, as I said, it prints in Idle just fine, and some of the characters I can't get are present in other ASCII games I play, so it must somehow be renderable.
Can anyone advise? I'm a self/google-taught amateur, and would honestly prefer not to have to download extra modules if at all avoidable. If nothing else, I'll settle for an explanation of my computer's apparent double-standard on this issue.
Thanks so much.
I think the main issue is that Pygame and Idle are using different default encodings to interpret your strings. Strings in Python 2 are tough to understand, so bear with me a little. Also, correct me if I'm mistaken, but I'm assuming you're using Python 2, which handles strings much differently than Python 3.
Strings in Python 2 are just sequences of bytes. Byte literals can be written by prefixing a string literal with b, or by calling bytes():
>>> 'a' == b'a' == bytes('a')
True
chr(i) "Returns a string of one character whose ASCII code is the integer i." (Docs) If a string is a series of bytes, chr returns bytes. Let's look at what chr returns if we call it in the Python 2 interpreter:
>>> chr(97)
'a'
>>> chr(120)
'x'
The reason those letters show up in the interpreter output is that Python 2's default encoding is ASCII, so when it sees a byte value that corresponds to a symbol on the ASCII chart, it replaces the byte with that symbol. ASCII is limited to symbols corresponding to values 0 through 127. This chart shows what symbol each value corresponds to. 97 is 'a' and 120 is 'x'.
You'll notice that 0-31 and 127 are "control characters" that don't represent symbols, but communicate things to the machine reading the text. For example, in the C language, the end of a string is denoted by the null character, which is codepoint 0 in ASCII.
chr accepts an argument from 0 through 255. What happens if we pass an integer to chr that represents a control character in ASCII? What if we pass it an integer greater than 127?
>>> chr(4)
'\x04'
>>> chr(163)
'\xa3'
\x in a string in Python tells the interpreter that the next two characters represent a hexadecimal value. When the byte value doesn't have a corresponding symbol in ASCII, the interpreter just shows us the byte value as a hexadecimal number. Behind the scenes, 'a' is just '\x61', but the interpreter uses ASCII to show us what '\x61' represents.
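You can confirm that equivalence in the interpreter:
>>> '\x61' == 'a'
True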
There are lots of encodings besides ASCII. UTF-8 is a very common one you've probably heard of. Like ASCII, it relates numerical values to symbols. This chart shows the first 256 code points; they correspond exactly to what is being output by your Pygame code.
So you're passing bytes to the draw_text function, and it's using UTF-8 to interpret those bytes as symbols. When it gets a value that corresponds to a control character, it simply outputs the square instead.
The reason Idle shows you different symbols is because it uses a different default encoding than Pygame. Without knowing what symbols it outputs, I can't say which encoding it is. ISO-8859-1 and Windows-1252 are two other very common encodings. You can check which encoding Idle is using by going to Preferences > General and looking at the "Default Source Encoding" option.
So what do you do about it? If you truly want to limit yourself to ASCII symbols, you can use the string module, which contains useful lists of characters. You can get all non-whitespace ASCII characters with the following code:
import string
ascii_characters = string.digits + string.letters + string.punctuation
print ascii_characters
If you want to use more than just those characters, Pygame will never be able to render the control characters. Here are two options I see.
Pass a unicode string containing the character you want. This option requires you to declare a different default encoding on the first line of your file using a magic comment:
# -*- coding: utf-8 -*-
...
self.draw_text(u"€", x, y)
Pass a unicode string containing the unicode designation for the symbol you want. This does not require the magic comment from option 1.
self.draw_text(u'\u20ac', x, y)

efficiently replace bad characters

I often work with utf-8 text containing characters like:
\xc2\x99
\xc2\x95
\xc2\x85
etc
These characters confuse other libraries I work with so need to be replaced.
What is an efficient way to do this, rather than:
text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')
There are always regular expressions; just list all of the offending characters inside square brackets, like so:
import re
print re.sub(r'[\xc2\x99]'," ","Hello\xc2There\x99")
This prints: 'Hello There ', with the unwanted characters replaced by spaces.
Alternately, if you have a different replacement character for each:
# remove annoying characters
chars = {
    '\xc2\x82' : ',',       # High code comma
    '\xc2\x84' : ',,',      # High code double comma
    '\xc2\x85' : '...',     # Triple dot
    '\xc2\x88' : '^',       # High carat
    '\xc2\x91' : '\x27',    # Forward single quote
    '\xc2\x92' : '\x27',    # Reverse single quote
    '\xc2\x93' : '\x22',    # Forward double quote
    '\xc2\x94' : '\x22',    # Reverse double quote
    '\xc2\x95' : ' ',
    '\xc2\x96' : '-',       # High hyphen
    '\xc2\x97' : '--',      # Double hyphen
    '\xc2\x99' : ' ',
    '\xc2\xa0' : ' ',
    '\xc2\xa6' : '|',       # Split vertical bar
    '\xc2\xab' : '<<',      # Double less than
    '\xc2\xbb' : '>>',      # Double greater than
    '\xc2\xbc' : '1/4',     # one quarter
    '\xc2\xbd' : '1/2',     # one half
    '\xc2\xbe' : '3/4',     # three quarters
    '\xca\xbf' : '\x27',    # c-single quote
    '\xcc\xa8' : '',        # modifier - under curve
    '\xcc\xb1' : ''         # modifier - under line
}

def replace_chars(match):
    char = match.group(0)
    return chars[char]

text = re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.
\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?
Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)
>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'
If you want to do this correction for all the characters in the range 0x80–0x99 that are valid Windows-1252 characters, you can use this approach:
def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.
    """
    import re
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, s)
For example:
>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'
If you want to remove all non-ASCII characters from a string, you can use
text.encode("ascii", "ignore")
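For example (Python 3 shown; the result is bytes, so decode back if you need str):
>>> 'café \u2022 menu'.encode('ascii', 'ignore')
b'caf  menu'
>>> _.decode('ascii')
'caf  menu'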
import unicodedata
# Convert to unicode
text_to_unicode = unicode(text, "utf-8")
# Convert back to ascii
text_fixed = unicodedata.normalize('NFKD', text_to_unicode).encode('ascii', 'ignore')
These are not "Unicode characters" - this looks more like a UTF-8 encoded string. (Although your prefix would be \xc3, not \xc2, for most characters.) You should not just throw them away in 95% of the cases, unless you are communicating with a COBOL backend. The World is not limited to 26 characters, you know.
There is a concise read explaining the difference between Unicode strings (what is used as a unicode object in Python 2 and as str in Python 3) and encoded byte strings here: http://www.joelonsoftware.com/articles/Unicode.html - please, for your own sake, do read that. Even if you are never planning to have anything that is not English in all of your applications, you will still stumble on symbols like € or º that won't fit in 7-bit ASCII. That article will help you.
That said, maybe the libraries you are using do accept Unicode objects, and you can transform your UTF-8 Python 2 strings into unicode by doing:
var_unicode = var.decode("utf-8")
If you really need 100% pure ASCII, with all non-ASCII chars replaced: after decoding the string to unicode, re-encode it to ASCII, telling it to replace characters that don't fit in the charset with "?":
var_ascii = var_unicode.encode("ascii", "replace")
These characters are not in the ASCII table, and that is the reason why you are getting the errors.
To avoid these errors, you can do the following while reading the file.
import codecs
f = codecs.open('file.txt', 'r',encoding='utf-8')
To learn more about these kinds of errors, go through this link.

How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in future, utf8 might support it as well.
But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.
My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?
I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.
In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.
I DON'T want to escape the character before storing at the MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and unfeasible.
See also:
"Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
‘𠂉’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
[EDIT] Added tests about the proposed solutions
So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick testing to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
    unichr(random.randrange(32,
        0x10ffff if random.randrange(100) > normal_chars else 0x0fff
    )) for i in xrange(string_size))

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)
I did not test the itertools solution because... well... that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
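For reference, the winning approach in Python 3 syntax (a sketch; the test script above is Python 2):
import re

re_pattern = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def filter_using_re(unicode_string):
    return re_pattern.sub('\uFFFD', unicode_string)

print(filter_using_re('BMP ok \U0001F600 replaced'))
# BMP ok � replaced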
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF will have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogates. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)
Edit: adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
You may skip the decoding and encoding steps and directly test the value of the first byte of each character in the 8-bit string. According to UTF-8:
1-byte characters have the format 0xxxxxxx
2-byte characters have the format 110xxxxx 10xxxxxx
3-byte characters have the format 1110xxxx 10xxxxxx 10xxxxxx
4-byte characters have the format 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you only need to check the value of the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time and for smaller strings that mostly have 1-byte characters this could even be faster than the regular expression filtering.
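For example (Python 2 byte strings; the 4-byte emoji sequence is dropped while the 2-byte é survives):
>>> filter_4byte_chars('caf\xc3\xa9 \xf0\x9f\x98\x80!')
'caf\xc3\xa9 !'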
And just for the fun of it, an itertools monstrosity :)
import itertools as it, operator as op, functools as ft

def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs = it.izip(unicode_string, it.repeat(u'\ufffd'))
    # is 65535 less than or equal to the ordinal? (True selects the replacement)
    selector = ft.partial(op.le, 65535)
    # using the character ordinals, produce False or True based on `selector`
    indexer = it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item from each pair
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that the Unicode standard 5.2 chapter 3 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93 """Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.""" However this proscription is as far as I know largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u'\ufffd':
u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
)
I'm guessing it's not the fastest, but it's quite straightforward (“pythonic” :)):
def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
NB: this code does not take into account the fact that Unicode has surrogate characters in the ranges U+D800-U+DFFF.
This does more than just filtering out 3+ byte UTF-8 characters. It removes unicode but tries to do so gently, replacing characters with relevant ASCII counterparts where possible. It can be a blessing in the future if you don't have, for example, a dozen different unicode apostrophes and unicode quotation marks in your text (usually coming from Apple handhelds) but only the regular ASCII apostrophes and quotation marks.
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust, I use it with some more guards:
import unicodedata

def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible

    Args:
        value (string): input string, can contain unicode characters

    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts (for example en-dash and em-dash with regular dash,
        apostrophe and quotation variations with the standard ones) or taken
        out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value
    if isinstance(value, str):
        return value
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2 BTW.
