Python: Convert non-ASCII to ASCII based on a look-up dictionary

In Python, I need to convert special characters into ASCII letters. I have a series of translations in a dictionary:
dict_trans = {"U+1E9A":"a", "U+1EA0":"a"} # + more
my_char = "ẚ"
How do I convert my_char into (in this case) a?
I can change the format of the characters in dict_trans (but to what)?

From the unidecode module, you could use the unidecode function.
>>> from unidecode import unidecode
>>> unidecode('ẚ')
'a'

Use the names from unicodedata:
import unicodedata
unicodedata.name("a")
# 'LATIN SMALL LETTER A'
unicodedata.name("ẚ")
# 'LATIN SMALL LETTER A WITH RIGHT HALF RING'
unicodedata.lookup('LATIN SMALL LETTER A WITH RIGHT HALF RING')
# 'ẚ'
d = {'LATIN SMALL LETTER A WITH RIGHT HALF RING':'a'}
d['LATIN SMALL LETTER A WITH RIGHT HALF RING']
# 'a'
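If you prefer to keep the original "U+XXXX" keys in dict_trans, a minimal sketch (assuming every key follows that "U+hex" pattern) is to parse them into integer code points and let str.translate do the mapping:
dict_trans = {"U+1E9A": "a", "U+1EA0": "a"}  # + more

# parse the hex digits after "U+" into integer code points for str.translate
trans_table = {int(key[2:], 16): value for key, value in dict_trans.items()}

my_char = "ẚ"
print(my_char.translate(trans_table))  # a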

Related

Python returns length of 2 for single non-ascii character string

I am trying to get the span of selected words in a string. When working with the İ character, I noticed the following behavior of Python:
len("İ")
Out[39]: 1
len("İ".lower())
Out[40]: 2
# when `upper()` is applied, the length stays the same
len("İ".lower().upper())
Out[41]: 2
Why does the length of the upper and lowercase value of the same character differ (that seems very confusing/undesired to me)?
Does anyone know if there are other characters for which that will happen?
Thank you!
EDIT:
On the other hand for e.g. Î the length stays the same:
len('Î')
Out[42]: 1
len('Î'.lower())
Out[43]: 1
That's because 'İ' in lowercase is 'i̇', which has 2 characters
>>> import unicodedata
>>> unicodedata.name('İ')
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> unicodedata.name('İ'.lower()[0])
'LATIN SMALL LETTER I'
>>> unicodedata.name('İ'.lower()[1])
'COMBINING DOT ABOVE'
One character is a combining dot that your browser might render overlapped with the last quote, so you may not be able to see it. But if you copy-paste it into your python console, you should be able to see it.
If you try:
print('i̇'.upper())
you should get
İ
I think the issue is that there is no single lowercase code point for that character.
The .lower() method follows the Unicode case-mapping tables rather than a fixed ASCII offset; for 'İ' the standard defines a special mapping to 'i' followed by a combining dot above, which is why the length changes.
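If you want to find other characters with this behaviour, a small scan over all assigned code points (a sketch; it takes a moment to run) will list every character whose case mapping expands:
import sys
import unicodedata

# list every character whose lower() or upper() mapping expands to more than one code point
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    if len(ch.lower()) > 1 or len(ch.upper()) > 1:
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # skip unassigned code points
        print(f"U+{cp:04X} {name}: lower={ch.lower()!r}, upper={ch.upper()!r}")
Besides 'İ', this turns up characters such as 'ß' (whose uppercase is 'SS') and various ligatures.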

How to convert formatted text to plain text [duplicate]

Is there a standard way, in Python, to normalize a Unicode string, so that it only comprises the simplest Unicode entities that can be used to represent it?
I mean, something which would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE'] ?
See where is the problem:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "á"
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the chars and do manual replacements, etc., but it is not efficient, and I'm pretty sure I would miss half of the special cases, and do mistakes.
The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A - U+0301 COMBINING ACUTE ACCENT combination and the U+00E1 LATIN SMALL LETTER A WITH ACUTE code point you used:
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'
(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear).
NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you decomposed characters (base characters followed by combining marks).
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form.
Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:
>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'
Note that there is no guarantee that composed and decomposed forms are commutative; normalizing a combined character to NFC form, then converting the result back to NFD form does not always result in the same character sequence. The Unicode standard maintains a list of exceptions; characters on this list are decomposable, but are not re-composed back into their combined form by NFC, for various reasons. Also see the documentation on the Composition Exclusion Table.
Yes, there is.
unicodedata.normalize(form, unistr)
You need to select one of the four normalization forms.
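A minimal check against the exact sequence from the question (the decomposed 'a' plus COMBINING ACUTE ACCENT) shows NFC producing the single composed code point:
import unicodedata

s = "a\u0301"  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", s)
print([unicodedata.name(c) for c in composed])
# ['LATIN SMALL LETTER A WITH ACUTE']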

Python: replace French letters with English

I would like to replace all the French letters within words with their ASCII equivalents.
letters = [['é', 'à'], ['è', 'ù'], ['â', 'ê'], ['î', 'ô'], ['û', 'ç']]
for x in letters:
    for a in x:
        a = a.replace('é', 'e')
        a = a.replace('à', 'a')
        a = a.replace('è', 'e')
        a = a.replace('ù', 'u')
        a = a.replace('â', 'a')
        a = a.replace('ê', 'e')
        a = a.replace('î', 'i')
        a = a.replace('ô', 'o')
        a = a.replace('û', 'u')
        a = a.replace('ç', 'c')
print(letters[0][0])
This code prints é however. How can I make this work?
May I suggest you consider using translation tables.
translationTable = str.maketrans("éàèùâêîôûç", "eaeuaeiouc")
test = "Héllô Càèùverâêt Jîôûç"
test = test.translate(translationTable)
print(test)
will print Hello Caeuveraet Jiouc. Pardon my French.
You can also use unidecode. Install it: pip install unidecode.
Then, do:
from unidecode import unidecode
s = "Héllô Càèùverâêt Jîôûç ïîäüë"
s = unidecode(s)
print(s) # Hello Caeuveraet Jiouc iiaue
The result will be the same string, but the french characters will be converted to their ASCII equivalent: Hello Caeuveraet Jiouc iiaue
The str.replace method returns a new string with the character replaced; it does not modify the string in place.
In your code you assign that result to the loop variable a, which only rebinds the name a and never changes the list letters, so letters[0][0] is still 'é' at the end.
You need to write the replaced value back into the list (or build a new string) and print that instead, as in the sketch below.
This post explains how variables within loops are accessed.
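A minimal sketch of writing the result back into the nested list (the replacements mapping below just mirrors the pairs from the question):
letters = [['é', 'à'], ['è', 'ù'], ['â', 'ê'], ['î', 'ô'], ['û', 'ç']]
replacements = {'é': 'e', 'à': 'a', 'è': 'e', 'ù': 'u', 'â': 'a',
                'ê': 'e', 'î': 'i', 'ô': 'o', 'û': 'u', 'ç': 'c'}

# assign into the list by index; rebinding the loop variable would not change it
for x in letters:
    for i, a in enumerate(x):
        x[i] = replacements.get(a, a)

print(letters[0][0])  # e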
Although I am new to Python, I would approach it this way:
letterXchange = {'à':'a', 'â':'a', 'ä':'a', 'é':'e', 'è':'e', 'ê':'e', 'ë':'e',
'î':'i', 'ï':'i', 'ô':'o', 'ö':'o', 'ù':'u', 'û':'u', 'ü':'u', 'ç':'c'}
text = input() # Replace it with the string in your code.
for item in list(text):
    if item in letterXchange:
        text = text.replace(item, letterXchange.get(str(item)))
    else:
        pass
print(text)
Here is another solution, using the low level unicode package called unicodedata.
In Unicode, a character like 'ù' is actually a composite character, made of the character 'u' and another character called COMBINING GRAVE ACCENT, which is basically the '̀'. Using the decomposition method in unicodedata, one can obtain the code points (in hex) of these two parts.
>>> import unicodedata as ud
>>> ud.decomposition('ù')
'0075 0300'
>>> chr(0x0075)
'u'
>>> chr(0x0300)
'̀'
Therefore, to retrieve 'u' from 'ù', we can first split the decomposition string, then use the built-in int function for the conversion (see this thread for converting a hex string to an integer), and then get the character using the chr function.
import unicodedata as ud

def get_ascii_char(c):
    s = ud.decomposition(c)
    if s == '':  # for an indecomposable character, it returns ''
        return c
    code = int('0x' + s.split()[0], 0)
    return chr(code)
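A quick usage check of the function above (note it only keeps the base character of a decomposable code point and leaves everything else untouched):
>>> get_ascii_char('ù')
'u'
>>> get_ascii_char('ç')
'c'
>>> get_ascii_char('x')
'x'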

How to replace characters in string by the next one?

I would like to replace every character of a string with the next one and the last should become first. Here is an example:
abcdefghijklmnopqrstuvwxyz
should become:
bcdefghijklmnopqrstuvwxyza
Is it possible to do it without using the replace function 26 times?
You can use the str.translate() method to have Python replace characters by other characters in one step.
Use the string.maketrans() function to map ASCII characters to their targets; using string.ascii_lowercase can help here as it saves you typing all the letters yourself:
from string import ascii_lowercase

try:
    # Python 2
    from string import maketrans
except ImportError:
    # Python 3 made maketrans a static method
    maketrans = str.maketrans

cipher_map = maketrans(ascii_lowercase, ascii_lowercase[1:] + ascii_lowercase[:1])
encrypted = text.translate(cipher_map)
Demo:
>>> from string import maketrans
>>> from string import ascii_lowercase
>>> cipher_map = maketrans(ascii_lowercase, ascii_lowercase[1:] + ascii_lowercase[:1])
>>> text = 'the quick brown fox jumped over the lazy dog'
>>> text.translate(cipher_map)
'uif rvjdl cspxo gpy kvnqfe pwfs uif mbaz eph'
Sure, just use string slicing:
>>> s = "abcdefghijklmnopqrstuvwxyz"
>>> s[1:] + s[:1]
'bcdefghijklmnopqrstuvwxyza'
Basically, the operation you want to do is analogous to rotating the position of the characters by one place to the left. So, we can simply take the part of string after the first character, and add the first character to it.
EDIT: I assumed that the OP is asking to rotate a string (which is plausible from the given input: the input string has 26 characters, and he might have been doing a manual replace for each character). In case the post is about creating a cipher, please check Martijn's answer above.
Because strings in Python are immutable, you need to convert the string to a list, replace, and then convert back to a string. Here I use modulo.
def convert(text):
    lst = list(text)
    # take the character at the next index, wrapping back to the start with modulo
    new_list = [text[(i + 1) % len(text)] for i in range(len(lst))]
    return "".join(new_list)
Don't use slicing, because it is not efficient: Python will create a new full copy of the string for every single changed character, because strings are immutable.

How could we get unicode from glyph id in python?

If I have glyph IDs like the ones below, how can I get the Unicode code point from them in Python? Also, as I understand it the second value is the glyph ID, but what do we call the first value and the third value?
(582, 'uni0246', 'LATIN CAPITAL LETTER E WITH STROKE'), (583, 'uni0247', 'LATIN SMALL LETTER E WITH STROKE'), (584, 'uni0248', 'LATIN CAPITAL LETTER J WITH STROKE'), (585, 'uni0249', 'LATIN SMALL LETTER J WITH STROKE')
Kindly reply.
Actually I am trying to get the Unicode code points from a given TTF file in Python. Here is the code:
import sys
from fontTools.ttLib import TTFont
from fontTools.unicode import Unicode
from ttfquery import ttfgroups
from fontTools.ttLib.tables import _c_m_a_p
from itertools import chain

ttfgroups.buildTable()
ttf = TTFont(sys.argv[1], 0, verbose=0, allowVID=0,
             ignoreDecompileErrors=True,
             fontNumber=-1)
chars = chain.from_iterable([y + (Unicode[y[0]],) for y in x.cmap.items()]
                            for x in ttf["cmap"].tables)
print(list(chars))
I got this code from Stack Overflow, but it gives the output above rather than what I need. So could anybody please tell me how to fetch the Unicode code points from the TTF file? Or is it fine to convert the glyph ID to Unicode, and will that yield the actual code point?
You can use the first field: unichr(x[0]). Or, equivalently, use the second field: remove the "uni" part ([3:]), convert the remaining hexadecimal digits to an integer, then to a character. Of course, the first method is faster and simpler.
unichr(int(x[1][3:], 16))  # for the first item you showed, returns 'Ɇ'; for the second, 'ɇ'
If you use Python 3, use chr instead of unichr.
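A small Python 3 sketch showing both routes for one of the tuples from the question:
entry = (582, 'uni0246', 'LATIN CAPITAL LETTER E WITH STROKE')

# route 1: the first field is already the decimal code point (582 == 0x0246)
print(chr(entry[0]))               # Ɇ

# route 2: parse the hex digits out of the 'uniXXXX' glyph name
print(chr(int(entry[1][3:], 16)))  # Ɇ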
Here is a simple way to find all Unicode characters in a TTF file.
from fontTools.ttLib import TTFont

chars = []
with TTFont('/path/to/ttf', 0, ignoreDecompileErrors=True) as ttf:
    for x in ttf["cmap"].tables:
        for (code, _) in x.cmap.items():
            chars.append(chr(code))

# now chars is a list of \uxxxx characters
print(chars)
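If you also want readable names next to each character, one option (reusing the chars list built above) is:
import unicodedata

# print the code point and official Unicode name for the first few characters found
for ch in chars[:10]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))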
