Is it possible to get ASCII equivalent of UTF-8 characters? [duplicate] - python

I don't know whether this is trivial or not, but I need to convert a unicode string to an ASCII string, and I don't want all those escape characters around. I mean, is it possible to have an "approximate" conversion to some quite similar ASCII character?
For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be just converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all chars?
Thank you very much!
Marco

Use the Unidecode package to transliterate the string.
>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"

import unicodedata
unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')
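Output:
Gavin OConnor
Note that the curly apostrophe (U+2019) has no NFKD decomposition, so 'ignore' drops it instead of converting it to '. Accented letters such as é, by contrast, are reduced to their base letter.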
Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
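A minimal Python 3 sketch of combining normalization with direct replacement, assuming a small hand-written punctuation table (the PUNCTUATION mapping and the to_ascii name are only illustrative, they don't come from any library). Punctuation that has no decomposition is replaced first, then NFKD splits accented letters so the combining marks can be dropped:

```python
import unicodedata

# Hypothetical table for punctuation that NFKD cannot decompose.
PUNCTUATION = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}

def to_ascii(text):
    """Return an approximate ASCII version of text."""
    # Replace punctuation explicitly before normalizing.
    text = text.translate(str.maketrans(PUNCTUATION))
    # NFKD splits 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Whatever is still non-ASCII (combining marks, etc.) is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Gavin O\u2019Connor"))
print(to_ascii("Löwenbräu café"))
```

Characters that are neither in the table nor decomposable are silently removed, so the table would need extending for other input.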

b = str(a.encode('utf-8').decode('ascii', 'ignore'))
should work fine, although it drops non-ASCII characters rather than replacing them with an approximation.

There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm

Try simple character replacement
str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))
PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get an error.

Related

Python's .format() minilanguage and Unicode

I'm trying to use some of the simple unicode characters in a command line program I'm writing, but drawing these things into a table becomes difficult because Python appears to be treating single-character symbols as multi-character strings.
For example, if I try to print(u"\u2714".encode("utf-8")) I see the unicode checkmark. However, if I try to add some padding to that character (as one might in tabular structure), Python seems to be interpreting this single-character string as a 3-character one. All three of these lines print the same thing:
print("|{:1}|".format(u"\u2714".encode("utf-8")))
print("|{:2}|".format(u"\u2714".encode("utf-8")))
print("|{:3}|".format(u"\u2714".encode("utf-8")))
Now I think I understand why this is happening: it's a multibyte string. My question is, how do I get Python to pad this string appropriately?
Make your format strings unicode:
from __future__ import print_function
print(u"|{:1}|".format(u"\u2714"))
print(u"|{:2}|".format(u"\u2714"))
print(u"|{:3}|".format(u"\u2714"))
outputs:
|✔|
|✔ |
|✔  |
Don't encode('utf-8') at that point; do it later:
>>> u"\u2714".encode("utf-8")
'\xe2\x9c\x94'
The UTF-8 encoding is three bytes long. Look at how format works with Unicode strings:
>>> u"|{:1}|".format(u"\u2714")
u'|\u2714|'
>>> u"|{:2}|".format(u"\u2714")
u'|\u2714 |'
>>> u"|{:3}|".format(u"\u2714")
u'|\u2714  |'
Tested on Python 2.7.3.
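For comparison, a short Python 3 sketch of the same idea (variable names are only illustrative). In Python 3 a str is already a sequence of code points, so padding counts characters; the three-byte length only shows up after encoding, which is why the padding should happen first:

```python
check = "\u2714"                  # one code point
encoded = check.encode("utf-8")   # three bytes: e2 9c 94

print(len(check), len(encoded))

# Pad the text first, then encode the finished row if bytes are needed.
row = "|{:3}|".format(check)
row_bytes = row.encode("utf-8")

# ascii() is used so the output is readable on any terminal encoding.
print(ascii(row))
```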

Print the "approval" sign/check mark (✓) U+2713 in Python

How can I print the check mark sign "✓" in Python?
It's the sign for approval, not a square root.
You can print any Unicode character using an escape sequence. Make sure to make a Unicode string.
print u'\u2713'
Since Python 2.1 you can use \N{name} escape sequence to insert Unicode characters by their names. Using this feature you can get check mark symbol like so:
$ python -c "print(u'\N{check mark}')"
✓
Note: For this feature to work you must use a unicode string literal, which is why the u prefix is used. In Python 3 the prefix is not mandatory, since string literals are unicode by default.
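A small Python 3 sketch showing that the name escape, a lookup by name, and the numeric escape all give the same character (the variable names are only illustrative; unicodedata.lookup and unicodedata.name are standard library functions):

```python
import unicodedata

by_escape = "\N{CHECK MARK}"                  # name escape in a literal
by_lookup = unicodedata.lookup("CHECK MARK")  # same lookup at runtime
by_code = "\u2713"                            # numeric escape

# Print the official name rather than the symbol itself,
# so this runs even on a terminal that cannot display it.
print(unicodedata.name(by_code))
```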
Solution defining python source file encoding:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
print '✓'
http://ideone.com/dTW5D8
Like this:
print u'\u2713'.encode('utf8')
The encoding should match the one of your terminal (or wherever you are sending output to).

How to remove curly quotes?

In my utf-8 encoded file, there are curly quotes (“”).
How do I replace them all with normal quotes (")?
cell_info.replace('“','"')
cell_info.replace('”','"')
did not work. No error message.
Thank you. :)
str.replace() doesn't replace the original string, it just returns a new one.
Do:
cell_info = cell_info.replace('“','"').replace('”','"')
Another way that works with my code is this:
cell_info = cell_info.replace(u'\u201c', '"').replace(u'\u201d', '"')
This is because I already have # -*- coding: utf-8 -*- at the top of my script.
cell_info = cell_info.replace('“','"').replace('”','"')
The replace method returns a new string with the replacement done. It doesn't modify the original string.
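To make the point about return values concrete, here is a tiny Python 3 sketch (the sample text and variable names are made up for illustration):

```python
cell_info = "\u201cquoted\u201d"

# The result is thrown away, so cell_info is left untouched.
cell_info.replace("\u201c", '"')
still_curly = cell_info

# Chaining the calls and keeping the result does the job.
fixed = cell_info.replace("\u201c", '"').replace("\u201d", '"')
print(fixed)
```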

Remove non-ASCII characters from a string using python / django

I have a string of HTML stored in a database. Unfortunately it contains characters such as ®
I want to replace these characters by their HTML equivalent, either in the DB itself or using a Find Replace in my Python / Django code.
Any suggestions on how I can do this?
You can use the fact that the ASCII characters are the first 128 code points: get the code point of each character with ord and strip it if it's out of range.
# -*- coding: utf-8 -*-
def strip_non_ascii(string):
    '''Returns the string without non-ASCII characters.'''
    # Keeps code points 1..126 only (this also drops NUL and DEL).
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)
test = u'éáé123456tgreáé#€'
print test
print strip_non_ascii(test)
Result
éáé123456tgreáé#€
123456tgre#
Please note that # is included because, well, after all it's an ASCII character. If you want to keep only a particular subset (like just digits and uppercase and lowercase letters), you can narrow the range by looking at an ASCII table.
EDITED: After reading your question again, maybe you need to escape your HTML code, so that all those characters appear correctly once rendered. You can use the escape filter in your templates.
There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481
To remove non-ASCII characters from a string, s, use:
s = s.encode('ascii',errors='ignore')
Then convert it from bytes back to a string using:
s = s.decode()
This was all done using Python 3.6.
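Since the question asks for HTML equivalents rather than plain removal, it may be worth noting that the standard 'xmlcharrefreplace' error handler can be used in the same place as 'ignore'. A minimal Python 3 sketch (the sample string and variable names are only illustrative):

```python
html = "such as \u00ae I want"

# 'ignore' simply drops anything outside ASCII.
stripped = html.encode("ascii", errors="ignore").decode("ascii")

# 'xmlcharrefreplace' substitutes a numeric character reference instead.
as_entities = html.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(stripped)
print(as_entities)
```

The second form keeps the information, so a browser still renders the original character.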
I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.
def unicode_escape(unistr):
    """
    Tidies up unicode entities into HTML friendly entities
    Takes a unicode string as an argument
    Returns a unicode string
    """
    import htmlentitydefs  # Python 2 module; it is html.entities in Python 3
    escaped = ""
    for char in unistr:
        if ord(char) in htmlentitydefs.codepoint2name:
            name = htmlentitydefs.codepoint2name.get(ord(char))
            entity = htmlentitydefs.name2codepoint.get(name)
            # A numeric character reference must end with a semicolon.
            escaped += "&#" + str(entity) + ";"
        else:
            escaped += char
    return escaped
Use it like this
>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as &#174; I want'
This code snippet may help you.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
def removeNonAscii(string):
    nonascii = bytearray(range(0x80, 0x100))
    return string.translate(None, nonascii)
nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)
The encoding definition is very important here which is done in the second line.
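The snippet above relies on Python 2 byte strings. A rough Python 3 equivalent, assuming the input is a bytes object (the function and constant names here are made up for illustration), could look like this:

```python
# All byte values outside the ASCII range.
NON_ASCII_BYTES = bytes(range(0x80, 0x100))

def remove_non_ascii_bytes(data):
    """Delete every byte >= 0x80 from a bytes object."""
    # bytes.translate(None, delete) removes the listed bytes without remapping.
    return data.translate(None, NON_ASCII_BYTES)

raw = "caf\u00e9 \u00ae".encode("utf-8")
print(remove_non_ascii_bytes(raw))
```

Because UTF-8 encodes every non-ASCII character using only bytes at or above 0x80, deleting those bytes removes the whole character.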
To get rid of the special xml, html characters '<', '>', '&' you can use cgi.escape:
import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)
Will return:
'1 &lt; 4 &amp; 4 &gt; 1'
This is probably the bare minimum you need to avoid problems.
For more, you have to know the encoding of your string.
If it matches the encoding of your HTML document, you don't have to do anything more.
If not, you have to convert it to the correct encoding.
test = test.decode("cp1252").encode("utf8")
Supposing that your string was cp1252 and that your html document is utf8
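Note for Python 3 users: cgi.escape was removed in Python 3.8, and html.escape from the standard library is the replacement. A minimal sketch with the same test string:

```python
import html

test = "1 < 4 & 4 > 1"
escaped = html.escape(test)
print(escaped)
```

Unlike cgi.escape, html.escape also escapes quote characters by default; pass quote=False if that is not wanted.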
You shouldn't have to do anything, as Django will automatically escape characters.
See: http://docs.djangoproject.com/en/dev/topics/templates/#id2

short Unicode \N{} names for Latin-1 characters in Python?

Are there short Unicode u"\N{...}" names for Latin1 characters in Python ?
\N{A umlaut} etc. would be nice,
\N{LATIN SMALL LETTER A WITH DIAERESIS} etc. is just too long to type every time.
(Added:) I use an English keyboard, but occasionally need German letters, as in "Löwenbräu Weißbier".
Yes one can cut-paste them singly, L cutpaste ö wenbr cutpaste ä ...
but that breaks the flow; I was hoping for a keyboard-only way.
Sorry, no, there's no such thing. In string literals, anyway... you could perhaps piggyback on another encoding scheme, such as HTML:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape(u'a &auml; b')
u'a \xe4 b'
But I don't think this'd be worth it.
Hardly anyone even uses the \N notation in any case... for the occasional character the \xnn notation is acceptable; for more involved usage you're better off just typing ä directly and making sure a # coding= is defined in the script as per PEP263. (If you don't have a keyboard layout that can type those diacriticals directly, get one. eg. eurokb on Windows, or using the Compose key on Linux.)
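For what it's worth, the HTML piggyback idea is a one-liner in Python 3, where html.unescape is a standard library function. A small sketch using the phrase from the question (the variable name is only illustrative):

```python
import html

# Entity names are short enough to type on an English keyboard.
text = html.unescape("L&ouml;wenbr&auml;u Wei&szlig;bier")

# ascii() shows the resulting code points on any terminal.
print(ascii(text))
```

Whether this is more readable than typing the characters directly is debatable, as the answer above already says.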
If you want to do the right thing please use UTF-8 in your python source code. This will keep the code much more readable.
Python is able to read UTF-8 source files; all you have to do is add an additional line after the first one:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
By the way, starting with Python 3.0 UTF-8 is the default source encoding, so you will not need this line anymore. See PEP 3120.
You can put an actual "ä" character in your string. For this you have to declare the encoding of the source code at the top
#!/usr/bin/env python
# encoding: utf-8
x = u"ä"
Have you thought about writing your own converter? It wouldn't be hard to write something that would go through a file and replace \N{A umlaut} with \N{LATIN SMALL LETTER A WITH DIAERESIS} and all the rest.
You can use the Unicode notation \uXXXX to describe that character:
u"\u00E4"
On Windows, you can use the charmap.exe utility to look up the keyboard shortcut for common letters you're using such as:
ALT-0223 = ß
ALT-0228 = ä
ALT-0246 = ö
Then use Unicode and save in UTF-8:
# -*- coding: UTF-8 -*-
phrase = u'Löwenbräu Weißbier'
or use a converter as someone else mentioned and make up your own shortcuts:
# -*- coding: UTF-8 -*-
def german(s):
    s = s.replace(u'SS',u'ß')
    s = s.replace(u'a:',u'ä')
    s = s.replace(u'o:',u'ö')
    return s
phrase = german(u'Lo:wenbra:u WeiSSbier')
print phrase
