In my utf-8 encoded file, there are curly quotes (“”).
How do I replace them all with normal quotes (")?
cell_info.replace('“','"')
cell_info.replace('”','"')
did not work. No error message.
Thank you. :)
str.replace() doesn't replace the original string, it just returns a new one.
Do:
cell_info = cell_info.replace('“','"').replace('”','"')
other way that work with my code is this:
cell_info = cell_info.replace(u'\u201c', '"').replace(u'\u201d', '"')
this because i already have this # -*- coding: utf-8 -*- at the top of my script
cell_info = cell_info.replace('“','"').replace('”','"')
The replace method return a new string with the replacement done. It doesn't act directly on the string.
Related
don't know wether this is trivial or not, but I'd need to convert an unicode string to ascii string, and I wouldn't like to have all those escape chars around. I mean, is it possible to have an "approximate" conversion to some quite similar ascii character?
For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be just converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all chars?
Thank you very much!
Marco
Use the Unidecode package to transliterate the string.
>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"
import unicodedata
unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')
Output:
Gavin O'Connor
Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
b = str(a.encode('utf-8').decode('ascii', 'ignore'))
should work fine.
There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm
Try simple character replacement
str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))
PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get error
How can I print the check mark sign "✓" in Python?
It's the sign for approval, not a square root.
You can print any Unicode character using an escape sequence. Make sure to make a Unicode string.
print u'\u2713'
Since Python 2.1 you can use \N{name} escape sequence to insert Unicode characters by their names. Using this feature you can get check mark symbol like so:
$ python -c "print(u'\N{check mark}')"
✓
Note: For this feature to work you must use unicode string literal. u prefix is used for this reason. In Python 3 the prefix is not mandatory since string literals are unicode by default.
Solution defining python source file encoding:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
print '✓'
http://ideone.com/dTW5D8
Like this:
print u'\u2713'.encode('utf8')
The encoding should match the one of your terminal (or wherever you are sending output to).
I'm using python for S60.
I want to use string in hebrew, to represent them on the GUI and to send them in SMS message.
It seems that the PythonScriptShell don't accept such expressions, for example:
u"אבגדה"
what can I do?
thanks
development of situation:
I added the line:
# -*- coding: utf-8 -*-
as the first line in the source file and in notepad++ I selected: Encoding>>Convert to utf8.
now, the GUI appears in Hebrew but when I selected an option the selection value cannot be compared to a string in Hebrew in the code (probably) and there is no response.
On PythonScriptShell appears the warning:
Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
Help me, please.
I just tested this in both bluetooth and on-phone consoles with PyS60 2.0, and non-ASCII unicode was handled w/out exceptions.
If you have that string in the file rather than passing it in the console, error is caused by lack of encoding specification in the file.
Add # -*- coding: utf-8 -*- as first line there.
convert your words to unicode characters using
unichr
eg unichr(1507) for char ף
refer to the decimal values in this table: http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0590
Add up
ru = lambda txt: str(txt).decode('utf-8','ignore')
And add the function before each text use
ru("אבגדה")
I have a string of HTML stored in a database. Unfortunately it contains characters such as ®
I want to replace these characters by their HTML equivalent, either in the DB itself or using a Find Replace in my Python / Django code.
Any suggestions on how I can do this?
You can use that the ASCII characters are the first 128 ones, so get the number of each character with ord and strip it if it's out of range
# -*- coding: utf-8 -*-
def strip_non_ascii(string):
''' Returns the string without non ASCII characters'''
stripped = (c for c in string if 0 < ord(c) < 127)
return ''.join(stripped)
test = u'éáé123456tgreáé#€'
print test
print strip_non_ascii(test)
Result
éáé123456tgreáé#€
123456tgre#
Please note that # is included because, well, after all it's an ASCII character. If you want to strip a particular subset (like just numbers and uppercase and lowercase letters), you can limit the range looking at a ASCII table
EDITED: After reading your question again, maybe you need to escape your HTML code, so all those characters appears correctly once rendered. You can use the escape filter on your templates.
There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481
To remove non-ASCII characters from a string, s, use:
s = s.encode('ascii',errors='ignore')
Then convert it from bytes back to a string using:
s = s.decode()
This all using Python 3.6
I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.
def unicode_escape(unistr):
"""
Tidys up unicode entities into HTML friendly entities
Takes a unicode string as an argument
Returns a unicode string
"""
import htmlentitydefs
escaped = ""
for char in unistr:
if ord(char) in htmlentitydefs.codepoint2name:
name = htmlentitydefs.codepoint2name.get(ord(char))
entity = htmlentitydefs.name2codepoint.get(name)
escaped +="&#" + str(entity)
else:
escaped += char
return escaped
Use it like this
>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as ® I want'
This code snippet may help you.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
def removeNonAscii(string):
nonascii = bytearray(range(0x80, 0x100))
return string.translate(None, nonascii)
nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)
The encoding definition is very important here which is done in the second line.
To get rid of the special xml, html characters '<', '>', '&' you can use cgi.escape:
import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)
Will return:
'1 < 4 & 4 > 1'
This is probably the bare minimum you need to avoid problem.
For more you have to know the encoding of your string.
If it fit the encoding of your html document you don't have to do something more.
If not you have to convert to the correct encoding.
test = test.decode("cp1252").encode("utf8")
Supposing that your string was cp1252 and that your html document is utf8
You shouldn't have anything to do, as Django will automatically escape characters :
see : http://docs.djangoproject.com/en/dev/topics/templates/#id2
Are there short Unicode u"\N{...}" names for Latin1 characters in Python ?
\N{A umlaut} etc. would be nice,
\N{LATIN SMALL LETTER A WITH DIAERESIS} etc. is just too long to type every time.
(Added:) I use an English keyboard, but occasionally need German letters, as in "Löwenbräu Weißbier".
Yes one can cut-paste them singly, L cutpaste ö wenbr cutpaste ä ...
but that breaks the flow; I was hoping for a keyboard-only way.
Sorry, no, there's no such thing. In string literals, anyway... you could perhaps piggyback on another encoding scheme, such as HTML:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape(u'a ä b c')
u'a \xe4 b'
But I don't think this'd be worth it.
Hardly anyone even uses the \N notation in any case... for the occasional character the \xnn notation is acceptable; for more involved usage you're better off just typing ä directly and making sure a # coding= is defined in the script as per PEP263. (If you don't have a keyboard layout that can type those diacriticals directly, get one. eg. eurokb on Windows, or using the Compose key on Linux.)
If you want to do the right thing please use UTF-8 in your python source code. This will keep the code much more readable.
Python is able to real UTF-8 source files, all you have to do is to add an additional line after the first one:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
By the way, starting with Python 3.0 UTF-8 is the default encoding so you will not need this line anymore. See PEP3120
You can put an actual "ä" character in your string. For this you have to declare the encoding of the source code at the top
#!/usr/bin/env python
# encoding: utf-8
x = u"ä"
Have you thought about writing your own converter? It wouldn't be hard to write something that would go through a file and replace \N{A umlaut} with \N{LATIN SMALL LETTER A WITH DIAERESIS} and all the rest.
You can use the Unicode notation \uXXXX do describe that character:
u"\u00E4"
On Windows, you can use the charmap.exe utility to look up the keyboard shortcut for common letters you're using such as:
ALT-0223 = ß
ALT-0228 = ä
ALT-0246 = ö
Then use Unicode and save in UTF-8:
# -*- coding: UTF-8 -*-
phrase = u'Löwenbräu Weißbier'
or use a converter as someone else mentioned and make up your own shortcuts:
# -*- coding: UTF-8 -*-
def german(s):
s = s.replace(u'SS',u'ß')
s = s.replace(u'a:',u'ä')
s = s.replace(u'o:',u'ö')
return s
phrase = german(u'Lo:wenbra:u WeiSSbier')
print phrase