I have a string of HTML stored in a database. Unfortunately, it contains characters such as ®.
I want to replace these characters by their HTML equivalent, either in the DB itself or using a Find Replace in my Python / Django code.
Any suggestions on how I can do this?
You can use the fact that the ASCII characters are the first 128 code points, so get the number of each character with ord and strip it if it's out of range:
# -*- coding: utf-8 -*-

def strip_non_ascii(string):
    ''' Returns the string without non-ASCII characters '''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

test = u'éáé123456tgreáé#€'
print test
print strip_non_ascii(test)
Result
éáé123456tgreáé#€
123456tgre#
Please note that # is included because, well, after all it's an ASCII character. If you want to strip a particular subset (like just numbers and uppercase and lowercase letters), you can limit the range by looking at an ASCII table, as in the sketch below.
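For example, a minimal sketch that keeps only digits and ASCII letters (ranges taken straight from the ASCII table):

def keep_alphanumeric(string):
    ''' Keeps only ASCII digits and letters '''
    ranges = ((0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A))  # 0-9, A-Z, a-z
    return ''.join(c for c in string
                   if any(lo <= ord(c) <= hi for lo, hi in ranges))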
EDITED: After reading your question again, maybe you need to escape your HTML code, so all those characters appear correctly once rendered. You can use the escape filter in your templates.
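Outside of templates, the same escaping is available as a plain function (a minimal sketch using django.utils.html.escape, which backs the template filter; the sample value is made up):

from django.utils.html import escape

html_from_db = u'5 > 3 & "quotes"'  # hypothetical value loaded from the database
print(escape(html_from_db))         # 5 &gt; 3 &amp; &quot;quotes&quot;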
There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481
To remove non-ASCII characters from a string, s, use:
s = s.encode('ascii',errors='ignore')
Then convert it from bytes back to a string using:
s = s.decode()
This all uses Python 3.6.
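A quick end-to-end illustration (the sample string here is made up):

s = 'café ® 123'
s = s.encode('ascii', errors='ignore')  # b'caf  123' -- non-ASCII characters dropped
s = s.decode()                          # 'caf  123' -- back to str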
I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.
def unicode_escape(unistr):
    """
    Tidies up unicode entities into HTML friendly entities

    Takes a unicode string as an argument
    Returns a unicode string
    """
    import htmlentitydefs
    escaped = ""
    for char in unistr:
        if ord(char) in htmlentitydefs.codepoint2name:
            name = htmlentitydefs.codepoint2name.get(ord(char))
            entity = htmlentitydefs.name2codepoint.get(name)
            escaped += "&#" + str(entity) + ";"
        else:
            escaped += char
    return escaped
Use it like this:
>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as &#174; I want'
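On Python 3, htmlentitydefs was renamed to html.entities. A rough port of the same idea (an untested sketch; the name lookup in the original just round-trips back to ord(char), so the code point can be used directly):

from html.entities import codepoint2name

def unicode_escape(unistr):
    """Replaces characters that have a named HTML entity with a &#NNN; reference."""
    return ''.join('&#%d;' % ord(c) if ord(c) in codepoint2name else c
                   for c in unistr)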
This code snippet may help you.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

def removeNonAscii(string):
    nonascii = bytearray(range(0x80, 0x100))
    return string.translate(None, nonascii)

nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)
The encoding declaration in the second line is very important here.
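Note that string.translate(None, deletechars) is Python 2 behaviour; in Python 3, str.translate no longer takes a delete argument, but bytes.translate does. A rough Python 3 sketch of the same idea:

def remove_non_ascii(text):
    # Deleting every byte 0x80-0xFF from the UTF-8 encoding leaves pure ASCII
    nonascii = bytes(range(0x80, 0x100))
    return text.encode('utf-8').translate(None, nonascii).decode('ascii')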
To get rid of the special XML/HTML characters '<', '>', '&' you can use cgi.escape:
import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)
Will return:
'1 &lt; 4 &amp; 4 &gt; 1'
This is probably the bare minimum you need to avoid problems.
For more you have to know the encoding of your string.
If it fits the encoding of your HTML document, you don't have to do anything more.
If not, you have to convert to the correct encoding:
test = test.decode("cp1252").encode("utf8")
This supposes that your string was cp1252 and that your HTML document is UTF-8.
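As a side note, cgi.escape was deprecated and later removed (in Python 3.8); the Python 3 equivalent is html.escape, which also escapes quotes by default:

import html

test = "1 < 4 & 4 > 1"
print(html.escape(test))        # 1 &lt; 4 &amp; 4 &gt; 1
print(html.escape('"quoted"'))  # &quot;quoted&quot;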
You shouldn't have to do anything, as Django will automatically escape characters.
See: http://docs.djangoproject.com/en/dev/topics/templates/#id2
Related
Is there an elegant way to convert "test\207\128" into "testπ" in Python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy-to-parse format. The problem is that it outputs non-alphanumeric characters as escaped sequences, so a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping the leading zero and using chr(35) to recover the #. That solution breaks on multi-byte UTF-8 characters such as "π", which gets output as "\207\128".
The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:
import re
value = r"test\207\128"
# First off turn this into a byte array, since it's not a unicode string
value = value.encode("utf-8")
# Now replace any "\###" with a byte character based off
# the decimal number captured
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# And now that we have a normal UTF-8 string, decode it back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ
Python Version: Python 3.6.
I am trying to replace the Unicode character u"\u0092" (aka curly apostrophe) with a regular apostrophe.
I have tried all of the below:
mystring = <some string with problem character>
# option 1
mystring = mystring.replace(u"\u0092", u\"0027")
# option 2
mystring = mystring.replace(u"\u0092", "'")
# option 3
mystring = re.sub('\u0092',u"\u0027", mystring)
# option 4
mystring = re.sub('\u0092',u"'", mystring)
None of the above updates the character in mystring. Other sub and replace operations are working, which makes me think it is either an issue with how I am using the Unicode characters, or an issue with this particular character.
Update: I have also tried the suggestions below, neither of which works:
mystring.decode("utf-8").replace(u"\u0092", u"\u0027").encode("utf-8")
mystring.decode("utf-8").replace(u"\u2019", u"\u0027").encode("utf-8")
But it gives me the error: AttributeError: 'str' object has no attribute 'decode'
Just to clarify: the IDE is not the core issue here. My question is why, when I run replace or sub with a Unicode character and print the result, the change does not register and the character is still present in the string.
Your code is wrong: the curly apostrophe (’) is \u2019, not \u0092. From Wikipedia:
U+0092  146  Private Use 2  PU2
That's why Eclipse is not happy.
With the right code:
# -*- coding: utf-8 -*-
import re

string = u"dkfljglkdfjg’fgkljlf"
string = string.replace(u"’", u"'")
string = string.replace(u"\u2019", u"\u0027")
string = re.sub(u'\u2019', u"\u0027", string)
string = re.sub(u'’', u"'", string)
All of these variants work.
And don't call your variables str; it shadows the built-in.
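If you're unsure which code point is actually in your data, a quick diagnostic sketch (Python 3; the sample string is made up):

import unicodedata

mystring = u"some\u2019text"  # hypothetical sample with a curly apostrophe
for ch in mystring:
    if ord(ch) > 127:
        print(hex(ord(ch)), unicodedata.name(ch, '<unnamed>'))
# 0x2019 RIGHT SINGLE QUOTATION MARK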
I've looked around for a ready-made solution, but I couldn't find one for the use case I'm facing.
Use Case
I'm building a 'website' QA test where the script will go through a bulk of HTML documents and identify any rogue characters. I cannot use a pure non-ASCII method, since the HTML documents contain characters such as ">" and other minor characters. Therefore, I am building up a unicode rainbow dictionary that identifies some of the common non-ASCII characters that my team and I frequently see. The following is my Python code.
# -*- coding: utf-8 -*-
import re

unicode_rainbow_dictionary = {
    u'\u00A0': ' ',
    u'\uFB01': 'fi',
}

strings = ["This contains the annoying non-breaking space", "This is fine!", "This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print "Epic fail! There is a rogue character in '" + string + "'"
        else:
            print string
The issue here is that the last string in the strings array contains a non-ASCII ligature character (the combined ﬁ). When I run this script, it doesn't capture the ligature character, but it does capture the non-breaking space character in the first case.
What is leading to the false positive?
Use Unicode strings for all text, as @jgfoot points out. The easiest way to do this is to use from __future__ to default to Unicode literals for strings. Additionally, using print as a function makes the code Python 2/3 compatible:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function
import re

unicode_rainbow_dictionary = {
    '\u00A0': ' ',
    '\uFB01': 'fi',
}

strings = ["This contains the\xa0annoying non-breaking space", "This is fine!", "This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print("Epic fail! There is a rogue character in '" + string + "'")
        else:
            print(string)
If you have the possibility, switch to Python 3 as soon as possible! Python 2 is not good at handling Unicode, whereas Python 3 handles it natively.
for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")
I couldn't get the non-breaking space to occur in my test. I got around that by using "This contains the annoying" + chr(160) + "non-breaking space" after which it matched.
Your code doesn't work as expected because, in your strings variable, you have Unicode characters in non-Unicode strings. You forgot to put the "u" in front of them to signal that they should be treated as Unicode strings, so when you search for a Unicode string inside a non-Unicode string, it doesn't work as expected.
If you change this to:
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not fine!"]
It works as expected.
Solving unicode headaches like this is a major benefit of Python 3.
Here's an alternative approach to your problem: how about just trying to encode the string as ASCII, and catching errors if it doesn't work?
def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False

strings = [u"This contains the\xa0annoying non-breaking space", u"This is fine!", u"This is not ﬁne!"]

for s in strings:
    print(repr(is_this_ascii(s)))
##False
##True
##False
I'm writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here's an example of a line of text I'm parsing:
temp1: +31.0°C (crit = +107.0°C)
And here's the regex I'm using to match that (in Python):
temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+'
r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')
This code works as expected and matches the example text I've given above. The only bits I'm really interested in are the numbers, so this bit:
(\+|-)(\d+\.\d+)\W\WC
which starts by matching the + or - sign and ends by matching the °C.
My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?
Possible portable solution:
Convert input data to unicode, and use re.UNICODE flag in regular expressions.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
data = u'temp1: +31.0°C (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+'
ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)
print temp_re.findall(data)
Output
[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
EDIT
@netvope already pointed this out in the comments on the question.
Update
Notes from J.F. Sebastian's comments about input encoding:
check_output() returns binary data that sometimes can be text (which should have a known character encoding in this case, so you can convert it to Unicode). Anyway, ord(u'°') == 176, so it cannot be encoded using the ASCII encoding.
So, to decode input data to unicode, basically* you should use the encoding from the system locale, via locale.getpreferredencoding(), e.g.:
data = subprocess.check_output(...).decode(locale.getpreferredencoding())
With data decoded correctly, you'll get the same output without re.UNICODE in this case.
*Why basically? Because on a Russian Win7 with cp1251 as the preferred encoding, if we have, for example, a script.py which encodes its output as utf-8:
#!/usr/bin/env python
# -*- coding: utf8 -*-
print u'temp1: +31.0°C (crit = +107.0°C)'.encode('utf-8')
And we need to parse its output:
subprocess.check_output(['python',
                         'script.py']).decode(locale.getpreferredencoding())
it will produce wrong results: 'В°' instead of '°'.
So, in some cases, you need to know the encoding of the input data.
So, I've read a lot about Python encoding (maybe not enough; I've been working on this for two days and still nothing), but I'm still running into trouble. I'll try to be as clear as I can. The main thing is that I'm trying to remove all accents and characters such as #, !, %, &...
The thing is, I do a query search on Twitter Search API with this call:
query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)
Then, I call a method (avaliar_pesquisa()) to evaluate the results I've got, based on the tags (or terms) of the input:
dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))
On avaliar_pesquisa(), the following happens:
def avaliar_pesquisa(dados, tags):
    resultados = []
    # Iterate over the results
    for i in dados['results']:
        resultados.append({'texto': i['text'],
                           'imagem': i['profile_image_url'],
                           'classificacao': avaliar_texto(i['text'], tags),
                           'timestamp': i['created_at'],
                           })
Note the avaliar_texto(), which evaluates the tweet text. And the problem lies exactly in the following lines:
def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize

    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))

    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()

    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux
The split doesn't really matter here.
The thing is, if I print the type of the variable texto in this last method, I may get str or unicode as the answer. If there is any kind of accent in the text, it comes as unicode.
So I get this error when running the application, which receives at most 100 tweets as a response:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)
For the following text:
Text: Agora o problema é com o speedy.
<type 'unicode'>
Any ideas?
See this page.
The decode() method is to be applied to a str object, not a unicode object. Given a unicode string as input, it first tries to encode it to a str using the ascii codec, then decode as utf-8, which fails.
Try return normalize('NFKD', unicode(txt)).
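A defensive variant (a Python 2 sketch, assuming the input may arrive as either str or unicode):

from unicodedata import normalize

def strip_accents(txt):
    if isinstance(txt, str):         # byte string: decode it first
        txt = txt.decode('utf-8')
    return normalize('NFKD', txt)    # already unicode: normalize directly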
This is what I used in my code to discard accents, etc.
text = unicodedata.normalize('NFD', text).encode('ascii','ignore')
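For example, on the text from the question (Python 2; on Python 3, add .decode('ascii') at the end to get a str back):

import unicodedata

text = u'Agora o problema é com o speedy.'
print unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
# Agora o problema e com o speedy.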
Try placing:
# -*- coding: utf-8 -*-
at the beginning of the Python script containing the code.