Python Version: Python 3.6.
I am trying to replace the Unicode character u"\u0092" (aka curly apostrophe) with a regular apostrophe.
I have tried all of the below:
mystring = <some string with problem character>
# option 1
mystring = mystring.replace(u"\u0092", u"\u0027")
# option 2
mystring = mystring.replace(u"\u0092", "'")
# option 3
mystring = re.sub('\u0092',u"\u0027", mystring)
# option 4
mystring = re.sub('\u0092',u"'", mystring)
None of the above updates the character in mystring. Other sub and replace operations are working - which makes me think it is either an issue with how I am using the Unicode characters, or an issue with this particular character.
Update: I have also tried the suggestions below, neither of which works:
mystring.decode("utf-8").replace(u"\u0092", u"\u0027").encode("utf-8")
mystring.decode("utf-8").replace(u"\u2019", u"\u0027").encode("utf-8")
Both give me the error: AttributeError: 'str' object has no attribute 'decode'
Just to clarify: the IDE is not the core issue here. My question is why, when I run replace or sub with a Unicode character and print the result, the change does not register: the character is still present in the string.
Your code is wrong: the curly apostrophe (’) is \u2019, not \u0092. From Wikipedia:
U+0092 146 Private Use 2 PU2
That's why Eclipse is not happy.
With the right code:
#_*_ coding: utf8 _*_
import re
string = u"dkfljglkdfjg’fgkljlf"
string = string.replace(u"’", u"'"))
string = string.replace(u"\u2019", u"\u0027")
string = re.sub(u'\u2019',u"\u0027", string)
string = re.sub(u'’',u"'", string)
All four solutions work.
And don't call your variables str.
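If you're not sure which codepoint is actually in your data, print each character's codepoint instead of guessing; a small sketch (the sample string here is an assumption):

```python
import unicodedata

mystring = u"it\u2019s"  # sample text containing a curly apostrophe
for ch in mystring:
    if ord(ch) > 127:
        # show the codepoint and its official Unicode name
        print(hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
# 0x2019 RIGHT SINGLE QUOTATION MARK
print(mystring.replace(u"\u2019", u"'"))  # it's
```

Once you know the real codepoint, the replace is straightforward.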
Related
So I have a database with a lot of names. The names have bad characters. For example, a name in a record is José Florés
I wanted to clean this to get José Florés
I tried the following
name = " José Florés "
print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))
The output mangles the last name into ' José Flor\\xe9s '
What is the best way to solve this? The names can have any kind of unicode or hex escape sequences.
ftfy is a Python library which fixes Unicode text broken in various ways, via a function named fix_text.
from ftfy import fix_text
def convert_iso_name_to_string(name):
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)
name = "José Florés"
assert convert_iso_name_to_string(name) == "José Florés"
Using the fix_text method, the names can be standardized, which is an alternative way to solve the problem.
We'll start with an example string containing a non-ASCII character (here “é”, an e with an acute accent):
s = 'Florés'
Now if we reference and print the string, it gives us essentially the same result:
>>> s
'Florés'
>>> print(s)
Florés
In contrast to the same string s in Python 2.x, in this case s is already a Unicode string; all strings in Python 3.x are automatically Unicode. The visible difference is that s wasn't changed after we instantiated it.
You can find the same here Encoding and Decoding Strings
I've looked around for a custom-made solution, but I couldn't find a solution for a use case that I am facing.
Use Case
I'm building a 'website' QA test where the script goes through a bulk of HTML documents and identifies any rogue characters. I cannot use a pure non-ASCII check, since the HTML documents contain characters such as ">" and other minor characters. Therefore, I am building up a Unicode "rainbow dictionary" that identifies some of the common non-ASCII characters that my team and I frequently see. The following is my Python code.
# -*- coding: utf-8 -*-
import re
unicode_rainbow_dictionary = {
    u'\u00A0': ' ',
    u'\uFB01': 'fi',
}
strings = ["This contains the annoying non-breaking space", "This is fine!", "This is not fine!"]
for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print "Epic fail! There is a rogue character in '" + string + "'"
        else:
            print string
The issue here is that the last string in the strings array contains a non-ASCII ligature character (the combined fi). When I run this script, it doesn't catch the ligature character, but it does catch the non-breaking space character in the first string.
What is leading to the false positive?
Use Unicode strings for all text, as @jgfoot points out. The easiest way to do this is to use a from __future__ import to make string literals Unicode by default. Additionally, using print as a function makes the code Python 2/3 compatible:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals,print_function
import re
unicode_rainbow_dictionary = {
    '\u00A0': ' ',
    '\uFB01': 'fi',
}
strings = ["This contains the\xa0annoying non-breaking space", "This is fine!", "This is not fine!"]
for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print("Epic fail! There is a rogue character in '" + string + "'")
        else:
            print(string)
If you have the possibility, switch to Python 3 as soon as possible! Python 2 is not good at handling Unicode, whereas Python 3 does it natively.
for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")
I couldn't get the non-breaking space to occur in my test. I got around that by using "This contains the annoying" + chr(160) + "non-breaking space", after which it matched.
Your code doesn't work as expected because your strings variable contains Unicode characters in non-Unicode (byte) strings. You forgot to put the u prefix in front of them to signal that they should be treated as Unicode strings. So when you search for a Unicode string inside a byte string, it doesn't work as expected.
If you change this to:
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not fine!"]
It works as expected.
Solving unicode headaches like this is a major benefit of Python 3.
Here's an alternative approach to your problem. How about just trying to encode the string as ASCII, and catching errors if it doesn't work?:
def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not fine!"]
for s in strings:
print(repr(is_this_ascii(s)))
##False
##True
##False
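On Python 3.7+ there is also a built-in for exactly this check, str.isascii(), which avoids the try/except entirely:

```python
# The first and last strings contain a non-breaking space and an "fi"
# ligature respectively, written as explicit escapes for clarity.
strings = ["contains a\xa0non-breaking space", "This is fine!", "has a \ufb01 ligature"]
print([s.isascii() for s in strings])  # [False, True, False]
```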
I am building a string query (cypher query) to execute it against a database (Neo4J).
I need to concatenate some strings but I am having trouble with encoding.
I am trying to build a unicode string.
# -*- coding: utf-8 -*-
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = "+ repr(value)
print line.encode("utf-8")
I expected to have:
Name = "D'Santana Carlos Lãnez"
But I'm getting:
Name = u"D'Santana Carlos L\xe3nez"
I imagine that repr is returning a Unicode representation, or perhaps I am not using the right function.
Python literal (repr) syntax is not a valid substitute for Cypher string literal syntax. The leading u is only one of the differences between them; notably, Cypher string literals don't have \x escapes, which Python will use for characters between U+0080–U+00FF.
If you need to create Cypher string literals from Python strings you would need to write your own string escaping function that writes output matching that syntax. But you should generally avoid creating queries from variable input. As with SQL databases, the better answer is query parameterisation.
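With the official neo4j Python driver, parameterisation looks roughly like the sketch below (the label, property name, and the session variable are assumptions for illustration):

```python
value = u"D'Santana Carlos Lãnez"

# The value travels as a parameter, not as part of the query text,
# so no quoting or escaping of the apostrophe or the ã is needed.
query = "MATCH (p:Person) WHERE p.Name = $name RETURN p"
params = {"name": value}

# With a driver session you would then run something like:
#   session.run(query, params)
print(params["name"])  # D'Santana Carlos Lãnez
```

The database receives the value verbatim, which also protects against Cypher injection.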
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = "+ value
print(line)
value is already Unicode because of the u prefix in u"...", so you don't need repr() (or unicode() or decode()).
Besides, repr() doesn't convert to Unicode. It returns a string that is very useful for debugging: it shows hex codes of non-ASCII chars, among other things.
Update for clarification
I have to replicate the functionality of a server. One of the responses of this old server is the one seen at http://test.muchticket.com/api/?token=carlos&method=ventas&ESP=11, except that the double backslashes should be single ones.
End of update
Update No.2 for clarification
This variable then goes to a dictionary wich is dumped to an HttpResponse with this
return HttpResponse(json.dumps(response_data,sort_keys=True), content_type="application/json")
I hate my job.
End of update
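Note: json.dumps escapes every backslash per the JSON spec, which appears to be where the doubled backslashes in the dumped output come from; a quick check:

```python
import json

url = r'http:\/\/shop.muchticket.com\/'  # raw string: backslash + slash pairs
print(url)               # http:\/\/shop.muchticket.com\/
print(json.dumps(url))   # "http:\\/\\/shop.muchticket.com\\/"
```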
I need to store 'http:\/\/shop.muchticket.com\/' in a variable and then save it in a dictionary. I have tried several different methods, but none of them seems to work. Here are some examples of what I've tried:
url = 'http:\/\/shop.muchticket.com\/'
print url
>> http:\\/\\/shop.muchticket.com\\/
With raw
url = r'http:\/\/shop.muchticket.com\/'
print url
>> http:\\/\\/shop.muchticket.com\\/
With the escape character
url = 'http:\\/\\/shop.muchticket.com\\/'
print url
>> http:\\/\\/shop.muchticket.com\\/
Raw and escape character
url = r'http:\\/\\/shop.muchticket.com\\/'
print url
>> http:\\\\/\\\\/shop.muchticket.com\\\\/
Escape character and decode
url = 'http:\\/\\/shop.muchticket.com\\/'
print url.decode('string_escape')
>> http:\\/\\/shop.muchticket.com\\/
Decode only
url = 'http:\/\/shop.muchticket.com\/'
print url.decode('string_escape')
>> http:\\/\\/shop.muchticket.com\\/
The best way is to not use any escape sequences at all:
>>> s = 'http://shop.muchticket.com/'
>>> s
'http://shop.muchticket.com/'
>>> print(s)
http://shop.muchticket.com/
Unlike "other" languages, you do not need to escape the forward slash (/) in Python!
If you need the forward slash then
>>> s = 'http:\/\/shop.muchticket.com\/'
>>> print(s)
http:\/\/shop.muchticket.com\/
Note: When you just type s in the interpreter, it gives you the repr output, and thus you see the backslashes escaped:
>>> s
'http:\\/\\/shop.muchticket.com\\/' # repr output; the backslashes are displayed escaped
>>> print(repr(s))
'http:\\/\\/shop.muchticket.com\\/'
Therefore, having a single \ is enough to store it in a variable.
As J F S says,
To avoid ambiguity, either use raw string literals or escape the
backslashes if you want a literal backslash in the string.
Thus your string would be
s = 'http:\\/\\/shop.muchticket.com\\/' # Escape the \ literal
s = r'http:\/\/shop.muchticket.com\/' # Make it a raw string
If you need two characters in the string: the backslash (REVERSE SOLIDUS) and the forward slash (SOLIDUS) then all three Python string literals produce the same string object:
>>> '\/' == r'\/' == '\\/' == '\x5c\x2f'
True
>>> len(r'\/') == 2
True
The preferred way to write it is r'\/' or '\\/'.
The reason is that the backslash is a special character in a string literal (something that you write in Python source code (usually by hand)) if it is followed by certain characters e.g., '\n' is a single character (newline) and '\\' is also a single character (the backslash). But '\/' is not an escape sequence and therefore it is two characters. To avoid ambiguity, use raw string literals r'\/' where the backslash has no special meaning.
The REPL calls repr on a string to print it:
>>> r'\/'
'\\/'
>>> print r'\/'
\/
>>> print repr(r'\/')
'\\/'
repr() shows you the Python string literal (how you would write it in Python source code). '\\/' is a two-character string, not three. Don't confuse a string literal that is used to create a string with the string object itself.
And to test the understanding:
>>> repr(r'\/')
"'\\\\/'"
It shows the representation of the representation of the string.
For Python 2.7.9, ran:
url = "http:\/\/shop.muchticket.com\/"
print url
With the result of:
>> http:\/\/shop.muchticket.com\/
Which version of Python are you using? From Bhargav Rao's answer, it seems that it should work in Python 3.x as well, so maybe it's a case of some weird imports?
I have a string of HTML stored in a database. Unfortunately it contains characters such as ®
I want to replace these characters by their HTML equivalent, either in the DB itself or using a Find Replace in my Python / Django code.
Any suggestions on how I can do this?
You can use the fact that the ASCII characters are the first 128, so get the number of each character with ord and strip it if it's out of range:
# -*- coding: utf-8 -*-
def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)
test = u'éáé123456tgreáé#€'
print test
print strip_non_ascii(test)
Result
éáé123456tgreáé#€
123456tgre#
Please note that # is included because, after all, it's an ASCII character. If you want to keep only a particular subset (like just numbers and uppercase and lowercase letters), you can limit the range by looking at an ASCII table.
EDITED: After reading your question again, maybe you need to escape your HTML code so all those characters appear correctly once rendered. You can use the escape filter in your templates.
There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481
To remove non-ASCII characters from a string, s, use:
s = s.encode('ascii',errors='ignore')
Then convert it from bytes back to a string using:
s = s.decode()
This is all using Python 3.6.
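Concretely, on Python 3 (note that the non-ASCII characters are dropped, not transliterated):

```python
s = 'café ® naïve'
# errors='ignore' silently discards anything outside ASCII
s = s.encode('ascii', errors='ignore').decode()
print(s)  # caf  nave
```

If you'd rather approximate accented letters with their ASCII equivalents, normalizing with unicodedata.normalize('NFKD', s) before encoding is a common approach.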
I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.
def unicode_escape(unistr):
    """
    Tidies up unicode entities into HTML friendly entities

    Takes a unicode string as an argument
    Returns a unicode string
    """
    import htmlentitydefs
    escaped = ""
    for char in unistr:
        if ord(char) in htmlentitydefs.codepoint2name:
            # emit a numeric character reference, e.g. &#174;
            escaped += "&#" + str(ord(char)) + ";"
        else:
            escaped += char
    return escaped
Use it like this
>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as &#174; I want'
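On Python 3, htmlentitydefs was renamed to html.entities; a rough equivalent using named entities instead of numeric references (a sketch) would be:

```python
from html.entities import codepoint2name

def unicode_escape(unistr):
    # Replace each character that has a named HTML entity with &name;
    return ''.join(
        '&{0};'.format(codepoint2name[ord(ch)]) if ord(ch) in codepoint2name else ch
        for ch in unistr
    )

print(unicode_escape('such as ® I want'))  # such as &reg; I want
```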
This code snippet may help you.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
def removeNonAscii(string):
    nonascii = bytearray(range(0x80, 0x100))
    return string.translate(None, nonascii)
nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)
The encoding declaration in the second line is very important here. Note that str.translate with a deletechars argument works on Python 2 byte strings; in Python 3, str.translate takes a mapping instead.
To get rid of the special XML/HTML characters '<', '>', and '&', you can use cgi.escape:
import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)
Will return:
'1 &lt; 4 &amp; 4 &gt; 1'
This is probably the bare minimum you need to avoid problems.
For more, you have to know the encoding of your string.
If it fits the encoding of your HTML document, you don't have to do anything more.
If not, you have to convert to the correct encoding:
test = test.decode("cp1252").encode("utf8")
Supposing that your string is cp1252 and that your HTML document is utf8.
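On Python 3, cgi.escape has been removed; html.escape is the replacement (it additionally escapes quotes by default):

```python
from html import escape

test = "1 < 4 & 4 > 1"
print(escape(test))  # 1 &lt; 4 &amp; 4 &gt; 1
```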
You shouldn't have to do anything, as Django will automatically escape characters:
see : http://docs.djangoproject.com/en/dev/topics/templates/#id2