I've looked around for a ready-made solution, but I couldn't find one for the use case I'm facing.
Use Case
I'm building a website QA test where the script will go through a batch of HTML documents and identify any rogue characters. I cannot use a pure non-ASCII check, since the HTML documents legitimately contain characters such as ">" and other minor ones. Therefore, I am building up a "unicode rainbow" dictionary that identifies the common non-ASCII characters that my team and I frequently see. The following is my Python code.
# -*- coding: utf-8 -*-
import re
unicode_rainbow_dictionary = {
    u'\u00A0': ' ',
    u'\uFB01': 'fi',
}

strings = ["This contains the annoying non-breaking space", "This is fine!", "This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print "Epic fail! There is a rogue character in '" + string + "'"
        else:
            print string
The issue here is that the last string in the strings list contains a non-ASCII ligature character (the combined ﬁ). When I run this script, it doesn't capture the ligature character, but it does capture the non-breaking space character in the first string.
What is leading to the false positive?
Use Unicode strings for all text, as @jgfoot points out. The easiest way to do this is to use from __future__ import unicode_literals, which makes string literals Unicode by default. Additionally, using print as a function makes the code Python 2/3 compatible:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals,print_function
import re
unicode_rainbow_dictionary = {
    '\u00A0': ' ',
    '\uFB01': 'fi',
}

strings = ["This contains the\xa0annoying non-breaking space", "This is fine!", "This is not \ufb01ne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print("Epic fail! There is a rogue character in '" + string + "'")
        else:
            print(string)
If you have the option, switch to Python 3 as soon as possible! Python 2's Unicode handling is awkward, whereas Python 3 handles Unicode natively. Also, since you are only looking for fixed characters, you don't need re at all; a simple membership test does the job:
for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")
I couldn't get the non-breaking space to occur in my test. I got around that by using "This contains the annoying" + chr(160) + "non-breaking space", after which it matched.
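For reference, here's a minimal sketch of that workaround (Python 3; chr(160) is U+00A0, so the rogue character can't be lost in copy-paste):

# chr(160) is U+00A0, the non-breaking space; building the string this
# way keeps the rogue character from being normalized away in the source
test = "This contains the annoying" + chr(160) + "non-breaking space"
assert "\u00a0" in test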
Your code doesn't work as expected because, in your strings variable, you have Unicode characters in non-Unicode (byte) strings. You forgot to put the u in front of them to signal that they should be treated as Unicode strings, so searching for a Unicode pattern inside a byte string doesn't behave as expected.
If you change this to:
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not fine!"]
It works as expected.
Solving unicode headaches like this is a major benefit of Python 3.
Here's an alternative approach to your problem: how about just trying to encode the string as ASCII and catching errors if it doesn't work?
def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not fine!"]
for s in strings:
print(repr(is_this_ascii(s)))
##False
##True
##False
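As a side note: if Python 3.7 or later is an option, str.isascii() does the same check without the try/except. A small sketch, not part of the original answer:

strings = ["This contains the\xa0annoying non-breaking space", "This is fine!", "This is not \ufb01ne!"]

for s in strings:
    print(s.isascii())
##False
##True
##False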
Related
Is there an elegant way to convert "test\207\128" into "testπ" in python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy-to-parse format. The problem is that it outputs non-alphanumeric characters as escaped sequences, so a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping the leading zero, and using chr(35) to recover the #. This solution breaks on multi-byte UTF-8 characters such as "π", which gets output as "\207\128".
The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:
import re
value = r"test\207\128"
# First off turn this into a byte array, since it's not a unicode string
value = value.encode("utf-8")
# Now replace any "\###" with a byte character based off
# the decimal number captured
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# And now that we have a normal UTF-8 string, decode it back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ
So I have a database with a lot of names. The names have bad characters. For example, a name in a record is JosÃ© Flor\xe9s
I wanted to clean this to get José Florés
I tried the following
name = " José Florés "
print(name.encode('iso-8859-1',errors='ignore').decode('utf8',errors='backslashreplace')
The output messes the last name to ' José Flor\\xe9s '
What is the best way to solve this? The names can have any kind of unicode or hex escape sequences.
ftfy is a Python library that fixes Unicode text broken in various ways, via a function named fix_text.
from ftfy import fix_text

def convert_iso_name_to_string(name):
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)

name = "JosÃ© FlorÃ©s"
assert convert_iso_name_to_string(name) == "José Florés"
Using the fix_text method the names can be standardized, which is an alternate way to solve the problem.
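Note that fix_text can also be applied to the whole string in one call; the word-by-word loop is only needed if you want to handle tokens individually. A minimal sketch, assuming the same mojibake input:

from ftfy import fix_text

# fix_text repairs the whole string at once
assert fix_text("JosÃ© FlorÃ©s") == "José Florés"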
We'll start with an example string containing a non-ASCII character (here, “é”):
s = 'Florés'
Now if we reference and print the string, it gives us essentially the same result:
>>> s
'Florés'
>>> print(s)
Florés
In contrast to the same string s in Python 2.x, s here is already a Unicode string; all strings in Python 3.x are automatically Unicode. The visible difference is that s wasn't changed after we instantiated it.
You can find the same explained here: Encoding and Decoding Strings
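To make the encode/decode relationship concrete, here is a short round-trip sketch in Python 3:

s = 'Florés'
b = s.encode('utf-8')     # explicit bytes: b'Flor\xc3\xa9s'
print(b)
print(b.decode('utf-8'))  # back to the same str: Florés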
Python Version: Python 3.6.
I am trying to replace the Unicode character u"\u0092" (aka curly apostrophe) with a regular apostrophe.
I have tried all of the below:
mystring = <some string with problem character>
# option 1
mystring = mystring.replace(u"\u0092", u"\u0027")
# option 2
mystring = mystring.replace(u"\u0092", "'")
# option 3
mystring = re.sub('\u0092',u"\u0027", mystring)
# option 4
mystring = re.sub('\u0092',u"'", mystring)
None of the above updates the character in mystring. Other sub and replace operations are working - which makes me think it is either an issue with how I am using the Unicode characters, or an issue with this particular character.
Update: I have also tried the suggestions below, neither of which works:
mystring.decode("utf-8").replace(u"\u0092", u"\u0027").encode("utf-8")
mystring.decode("utf-8").replace(u"\u2019", u"\u0027").encode("utf-8")
But it gives me the error: AttributeError: 'str' object has no attribute 'decode'
Just to clarify: the IDE is not the core issue here. My question is why, when I run replace or sub with a Unicode character and print the result, the change doesn't register: the character is still present in the string.
Your code is wrong: the curly apostrophe (’) is \u2019, not \u0092. From Wikipedia:
U+0092 146 Private Use 2 PU2
That's why Eclipse is not happy.
With the right code point:
#_*_ coding: utf8 _*_
import re
string = u"dkfljglkdfjg’fgkljlf"
string = string.replace(u"’", u"'")
string = string.replace(u"\u2019", u"\u0027")
string = re.sub(u'\u2019',u"\u0027", string)
string = re.sub(u'’',u"'", string)
All of these variants work.
And don't call your variables str; that shadows the built-in type.
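If you're ever unsure which code point a pasted character actually is, the unicodedata module can tell you (a quick sketch):

#_*_ coding: utf8 _*_
import unicodedata

# names the character, so you know which code point to target
print(unicodedata.name(u"’"))  # RIGHT SINGLE QUOTATION MARK
print(repr(u"’"))              # u'\u2019' on Python 2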
I have the following Chinese strings that are saved in the following form, as "str" type:
\u72ec\u5230
\u7528\u8272
I am on Python 2.7. When I print those strings, they are printed as actual Chinese characters:
chinese_list = ["\u72ec\u5230", "\u7528\u8272", "\u72ec"]
print(chinese_list[0], chinese_list[1], chinese_list[2])
>>> 独到 用色 独
I can't really figure out how they were saved in that form; to me it looks like Unicode. The goal is to take other Chinese characters that I have and save them in the same kind of encoding. Say I have "国道": I would need it saved in the same way as in the original chinese_list.
I've tried to encode it as utf-8 and also other encodings but I never get the same output as in the original:
new_string = u"国道"
print(new_string.encode("utf-8"))
# >>> b'\xe5\x9b\xbd\xe9\x81\x93'
print(new_string.encode("utf-16"))
# >>> b'\xff\xfe\xfdVS\x90'
Any help appreciated!
EDIT: it doesn't have to be 2 Chinese characters.
EDIT2: Apparently, the encoding was unicode-escape. Thanks @deceze.
print(u"国".encode('unicode-escape'))
>>> \u56fd
The \u.... is Unicode escape syntax. It works similarly to how \n is a newline, not the two characters \ and n.
The elements of your list never actually contain a byte string with the literal characters \, u, 7 and so on. They contain a Unicode string with the actual Unicode characters, i.e. 独 and so on.
Note that this only works in Unicode strings! In Python 2, you need to write u"\u....". Python 3 always uses Unicode strings.
The Unicode code point of a character can be obtained with the ord builtin. For example, ord(u"国") gives 22269, the same value as 0x56fd.
To get the hexadecimal escape value, convert the result to hex.
>>> def escape_literal(character):
... return r'\u' + hex(ord(character))[2:]
...
>>> print(escape_literal('国'))
\u56fd
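Putting it together for the original goal, the unicode-escape codec converts in both directions (a sketch in the Python 2 style of the question):

# encode: real characters -> literal \uXXXX text
print(u"国道".encode('unicode-escape'))           # \u56fd\u9053
# decode: literal \uXXXX text -> real characters
print("\\u56fd\\u9053".decode('unicode-escape'))  # 国道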
I have a string of HTML stored in a database. Unfortunately it contains characters such as ®
I want to replace these characters by their HTML equivalent, either in the DB itself or using a Find Replace in my Python / Django code.
Any suggestions on how I can do this?
You can use the fact that the ASCII characters are the first 128 code points: get each character's number with ord and strip it if it's out of range.
# -*- coding: utf-8 -*-
def strip_non_ascii(string):
    ''' Returns the string without non-ASCII characters '''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)
test = u'éáé123456tgreáé#€'
print test
print strip_non_ascii(test)
Result
éáé123456tgreáé#€
123456tgre#
Please note that # is included because, after all, it's an ASCII character. If you want to keep only a particular subset (like just digits and uppercase and lowercase letters), you can narrow the range by looking at an ASCII table; a variant is sketched below.
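A possible variant restricted to digits and ASCII letters (a sketch in the same Python 2 style as the answer above; strip_to_alnum is a made-up name):

def strip_to_alnum(string):
    ''' Returns the string with only ASCII digits and letters '''
    kept = (c for c in string if c.isalnum() and ord(c) < 127)
    return ''.join(kept)

print strip_to_alnum(u'éáé123456tgreáé#€')  # prints: 123456tgre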
EDITED: After reading your question again, maybe you need to escape your HTML code so that all those characters appear correctly once rendered. You can use the escape filter in your templates.
There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481
To remove non-ASCII characters from a string, s, use:
s = s.encode('ascii',errors='ignore')
Then convert it from bytes back to a string using:
s = s.decode()
This is all using Python 3.6.
I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.
def unicode_escape(unistr):
    """
    Tidies up unicode entities into HTML friendly entities

    Takes a unicode string as an argument
    Returns a unicode string
    """
    import htmlentitydefs
    escaped = ""
    for char in unistr:
        if ord(char) in htmlentitydefs.codepoint2name:
            # named entities are emitted as numeric references, e.g. &#174;
            name = htmlentitydefs.codepoint2name.get(ord(char))
            entity = htmlentitydefs.name2codepoint.get(name)
            escaped += "&#" + str(entity) + ";"
        else:
            escaped += char
    return escaped
Use it like this
>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as &#174; I want'
This code snippet may help you.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
def removeNonAscii(string):
    nonascii = bytearray(range(0x80, 0x100))
    return string.translate(None, nonascii)
nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)
The encoding declaration, done in the second line, is very important here.
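A quick usage sketch (Python 2, since the translate(None, deletechars) form is the Python 2 byte-string API):

# the UTF-8 bytes of the accented characters fall in 0x80-0xFF,
# so they are deleted along with any other non-ASCII bytes
print removeNonAscii("déjà vu")  # prints: dj vu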
To get rid of the special XML/HTML characters '<', '>', '&', you can use cgi.escape:
import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)
Will return:
'1 < 4 & 4 > 1'
This is probably the bare minimum you need to avoid problems.
For more, you have to know the encoding of your string.
If it matches the encoding of your HTML document, you don't have to do anything more.
If not, you have to convert to the correct encoding:
test = test.decode("cp1252").encode("utf8")
supposing that your string was cp1252 and your HTML document is utf8.
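Note that cgi.escape was deprecated and removed in Python 3.8; on Python 3, html.escape is the replacement (it also escapes quotes by default):

import html

test = "1 < 4 & 4 > 1"
print(html.escape(test))  # prints: 1 &lt; 4 &amp; 4 &gt; 1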
You shouldn't have to do anything, as Django will automatically escape characters:
see : http://docs.djangoproject.com/en/dev/topics/templates/#id2