If I have glyph ids like the ones below, how can I get the Unicode characters from them? I am working in Python. Also, as I understand it the second value is the glyph id, but what do we call the first value and the third value?
(582, 'uni0246', 'LATIN CAPITAL LETTER E WITH STROKE'), (583, 'uni0247', 'LATIN SMALL LETTER E WITH STROKE'), (584, 'uni0248', 'LATIN CAPITAL LETTER J WITH STROKE'), (585, 'uni0249', 'LATIN SMALL LETTER J WITH STROKE')
Kindly reply.
Actually, I am trying to get the Unicode characters from a given TTF file in Python. Here is the code:
import sys
from fontTools.ttLib import TTFont
from fontTools.unicode import Unicode
from ttfquery import ttfgroups
from fontTools.ttLib.tables import _c_m_a_p
from itertools import chain
ttfgroups.buildTable()
ttf = TTFont(sys.argv[1], 0, verbose=0, allowVID=0,
             ignoreDecompileErrors=True,
             fontNumber=-1)
# each cmap entry is (code point, glyph name); the Unicode character name is appended as a third field
chars = chain.from_iterable([y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables)
print(list(chars))
I got this code from Stack Overflow, but it gives the output above and not what I need. Could anybody please tell me how to fetch the Unicode characters from the TTF file? Or is it fine to convert the glyph id to Unicode, and will that yield the actual character?
You can use the first field directly: unichr(x[0]). Alternatively, take the second field, strip the "uni" prefix ([3:]), parse the rest as a hexadecimal value, and convert that to a character. The first method is faster and simpler.
unichr(int(x[1][3:], 16))  # for the first item you showed this returns 'Ɇ', for the second 'ɇ'
If you use Python 3, use chr instead of unichr.
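The two approaches above can be wrapped in a small helper. This is only a sketch for Python 3: the function name glyph_name_to_char is made up for illustration, and it assumes the glyph name follows the uniXXXX pattern with four hex digits, which not every font uses.

```python
def glyph_name_to_char(glyph_name):
    """Return the character for a 'uniXXXX' glyph name, or None for any other name."""
    if glyph_name.startswith("uni") and len(glyph_name) == 7:
        try:
            return chr(int(glyph_name[3:], 16))
        except ValueError:
            # the four characters after 'uni' were not hexadecimal
            return None
    return None


# one tuple in the shape shown in the question: (code point, glyph name, character name)
entry = (582, "uni0246", "LATIN CAPITAL LETTER E WITH STROKE")
from_code_point = chr(entry[0])                  # first field
from_glyph_name = glyph_name_to_char(entry[1])   # second field
print(from_code_point == from_glyph_name)
```

Since glyph names are chosen by the font designer, using the first field (the code point) is the safer route when it is available.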
Here is a simple way to find all Unicode characters in a TTF file.
from fontTools.ttLib import TTFont
chars = []
with TTFont('/path/to/ttf', 0, ignoreDecompileErrors=True) as ttf:
    for x in ttf["cmap"].tables:
        for (code, _) in x.cmap.items():
            chars.append(chr(code))
# now chars is a list of \uxxxx characters
print(chars)
I have a string.
m = 'I have two element. <U+2F3E> and <U+2F8F>'
I want to replace m with:
m = 'I have two element. \u2F3E and \u2F8F' # utf-8
my code:
import re
p1 = re.compile('<U+\+') # start "<"
p2 = re.compile('>+') # end ">"
m = 'I have two element. <U+2F3E> and <U+2F8F>'
out = pattern.sub('\\u', m) # like: 'I have two element. \u2F3E> and \u2F8F>'
but I get this error message:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
How can I fix it? Thanks.
import re
m = 'I have two element. <U+2F3E> and <U+2F8F>'
print(re.sub(r'<U\+(\w+)>', r"\\u\1", m))
# I have two element. \u2F3E and \u2F8F
You can use a single regular expression to find the strings and pull out the part you want to use in the replacement.
The reason you get an error is that '\\u' passes the literal string \u to the regex engine, which tries to parse it as a Unicode character, and fails; the \u needs to be followed by exactly four hex digits to form a valid Unicode code point. But you are still approaching this as if you wanted to replace with a literal string, which as per your clarifying comment is wrong.
import re
m = re.sub(r'<U\+([0-9a-fA-F]{4})>', lambda x: chr(int(x.group(1), 16)), m)
The lambda receives the match object as its argument; x.group(1) pulls out the first parenthesized group, and chr(int(that, 16)) produces the corresponding literal character.
If you actually want to produce the UTF-8 encoding of that, that's easy, too:
>>> re.sub(r'<U\+([0-9a-fA-F]{4})>', lambda x: chr(int(x.group(1), 16)), 'I have two element. <U+2F3E> and <U+2F8F>')
'I have two element. ⼾ and ⾏'
>>> re.sub(r'<U\+([0-9a-fA-F]{4})>', lambda x: chr(int(x.group(1), 16)), 'I have two element. <U+2F3E> and <U+2F8F>').encode('utf-8')
b'I have two element. \xe2\xbc\xbe and \xe2\xbe\x8f'
As you can see, the UTF-8 encoding is a sequence of bytes which do not correspond to printable characters at all. (Well, they could be printed in some other encodings; but then that's just mojibake.)
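The pattern above only accepts exactly four hex digits. As a hedged variation, the sketch below also accepts five or six digits so that placeholders outside the Basic Multilingual Plane are handled. The names expand_placeholders and _PLACEHOLDER are invented for this example, and leaving out-of-range values untouched is just one possible choice.

```python
import re

# <U+XXXX> with 4 to 6 hex digits
_PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def expand_placeholders(text):
    """Replace each <U+XXXX> placeholder with the character it names."""
    def _to_char(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # not a valid Unicode code point: leave the placeholder as it is
            return match.group(0)
        return chr(code_point)

    return _PLACEHOLDER.sub(_to_char, text)


result = expand_placeholders("I have two element. <U+2F3E> and <U+2F8F>")
print(result)
```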
m.replace('<U+', '\u')
m.replace('>',' ')
Please use the below code, this might help
m = m.replace('<U+', '\\u').replace('>', '')  # replace '>' with an empty string so no extra spaces are added
I am trying to get the span of selected words in a string. When working with the İ character, I noticed the following behavior of Python:
len("İ")
Out[39]: 1
len("İ".lower())
Out[40]: 2
# when `upper()` is applied, the length stays the same
len("İ".lower().upper())
Out[41]: 2
Why does the length of the upper and lowercase value of the same character differ (that seems very confusing/undesired to me)?
Does anyone know if there are other characters for which that will happen?
Thank you!
EDIT:
On the other hand for e.g. Î the length stays the same:
len('Î')
Out[42]: 1
len('Î'.lower())
Out[43]: 1
That's because 'İ' in lowercase is 'i̇', which consists of 2 code points:
>>> import unicodedata
>>> unicodedata.name('İ')
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> unicodedata.name('İ'.lower()[0])
'LATIN SMALL LETTER I'
>>> unicodedata.name('İ'.lower()[1])
'COMBINING DOT ABOVE'
One character is a combining dot that your browser might render overlapped with the last quote, so you may not be able to see it. But if you copy-paste it into your python console, you should be able to see it.
If you try:
print('i̇'.upper())
you should get
İ
I think the issue is that Unicode's full lowercase mapping for that symbol is not a single code point.
The .lower() function does not apply a fixed offset to the character's number (that only works for the English alphabet in ASCII); it follows the Unicode case-mapping tables, which map U+0130 to 'i' followed by U+0307 COMBINING DOT ABOVE.
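As for whether other characters behave this way, one way to find out is to scan every code point and compare lengths before and after the case conversion. This is a rough sketch; the helper name length_changing is made up here, and the result depends on the Unicode version bundled with your Python.

```python
import sys


def length_changing(case_method):
    """Return the characters whose converted form has a different length."""
    changed = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if len(case_method(ch)) != len(ch):
            changed.append(ch)
    return changed


lower_changes = length_changing(str.lower)
upper_changes = length_changing(str.upper)

# print code points rather than the characters, since combining marks are hard to see
print(["U+%04X" % ord(c) for c in lower_changes])
print(len(upper_changes))
```

The uppercase direction has more such cases, for example 'ß', whose uppercase form is 'SS'.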
I am trying to read a file that has Arabic characters like 'ع' and map each one to an English string such as "AYN". I want to create such a mapping from all 28 letters of the Arabic alphabet to English strings in Python 3.4. I am still a beginner in Python and do not have much of a clue how to start. The file that contains the Arabic characters is encoded in UTF-8.
Use unicodedata;
(note: This is Python 3. In Python 2 use u'ع' instead)
In [1]: import unicodedata
In [2]: unicodedata.name('a')
Out[2]: 'LATIN SMALL LETTER A'
In [6]: unicodedata.name('ع')
Out[6]: 'ARABIC LETTER AIN'
In [7]: unicodedata.name('ع').split()[-1]
Out[7]: 'AIN'
The last line works fine with simple letters, but not with all Arabic symbols. E.g. ڥ is ARABIC LETTER FEH WITH THREE DOTS BELOW.
So you could use;
In [26]: unicodedata.name('ڥ').lower().split()[2]
Out[26]: 'feh'
or
In [28]: unicodedata.name('ڥ').lower()[14:]
Out[28]: 'feh with three dots below'
For identifying characters use something like this (Python 3):
c = 'ع'
id = unicodedata.name(c).lower()
if 'arabic letter' in id:
    print("{}: {}".format(c, id[14:].lower()))
This would produce;
ع: ain
I'm filtering for the string 'arabic letter' because the arabic unicode block has a lot of other symbols as well.
A complete dictionary can be made with:
arabicdict = {}
for n in range(0x600, 0x700):
    c = chr(n)
    try:
        id = unicodedata.name(c).lower()
        if 'arabic letter' in id:
            arabicdict[c] = id[14:]
    except ValueError:
        # unassigned code points have no name
        pass
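Applied to a piece of text, the same idea might look like the sketch below. The function names arabic_letter_name and transliterate are made up for illustration, and the names come straight from the Unicode database, so 'ع' yields "AIN" rather than the "AYN" spelling asked for in the question.

```python
import unicodedata

PREFIX = "ARABIC LETTER "


def arabic_letter_name(ch):
    """Return the Unicode letter name without its prefix, or None for other characters."""
    name = unicodedata.name(ch, "")  # "" is returned for unnamed code points
    if name.startswith(PREFIX):
        return name[len(PREFIX):]
    return None


def transliterate(text):
    """Return (character, name) pairs for every Arabic letter found in text."""
    pairs = []
    for ch in text:
        name = arabic_letter_name(ch)
        if name is not None:
            pairs.append((ch, name))
    return pairs


print(transliterate("\u0639 \u0628"))  # the letters AIN and BEH separated by a space
```

For a UTF-8 file, the text could be obtained with open(path, encoding="utf-8").read() before calling transliterate.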
Refer to the Unicode numbers for each character and then construct a dictionary as follows here:
arabic = {'alif': u'\u0627', 'baa': u'\u0628', ...} # use unicode mappings like so
Use a simple dictionary in python to do this properly. Make sure your file is set in the following way:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Here is Python 2 code that should work for you (I also added examples of how to get the values out of your dictionary, since you are a beginner):
exampledict = {unicode(('ا').decode('utf-8')):'ALIF',unicode(('ع').decode('utf-8')):'AYN'}
keys = exampledict.keys()
values = exampledict.values()
print(keys)
print(values)
exit()
Output:
[u'\u0639', u'\u0627']
['AYN', 'ALIF']
Hope this helps you on your journey learning python, it is fun!
In Python, I need to convert special characters into ascii letters. I have a series of translations in a dictionary
dict_trans = {"U+1E9A":"a", "U+1EA0":"a"} # + more
my_char = "ẚ"
How do I convert my_char into (in this case) a?
I can change the format of the characters in dict_trans (but to what)?
From the unidecode module, you could use the unidecode function.
>>> from unidecode import unidecode
>>> unidecode('ẚ')
'a'
Use the names from unicodedata:
import unicodedata
unicodedata.name("a")
# 'LATIN SMALL LETTER A'
unicodedata.name("ẚ")
# 'LATIN SMALL LETTER A WITH RIGHT HALF RING'
unicodedata.lookup('LATIN SMALL LETTER A WITH RIGHT HALF RING')
# 'ẚ'
d = {'LATIN SMALL LETTER A WITH RIGHT HALF RING':'a'}
d['LATIN SMALL LETTER A WITH RIGHT HALF RING']
# 'a'
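If you would rather keep the "U+XXXX" keys from the question, one option is to turn them into a table for str.translate, which expects integer code points as keys. This is a small sketch assuming Python 3; translation_table is a name chosen for this example.

```python
dict_trans = {"U+1E9A": "a", "U+1EA0": "a"}  # + more, as in the question

# str.translate looks characters up by code point, so parse the hex part of each key
translation_table = {int(key[2:], 16): value for key, value in dict_trans.items()}

my_char = "\u1E9A"  # LATIN SMALL LETTER A WITH RIGHT HALF RING
print(my_char.translate(translation_table))
```

Characters that are not in the table pass through unchanged, so the same call works on whole strings.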
I have a function like this:
persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'
english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)
text.translate(english_trans)
text.translate(arabic_trans)
I want it to translate all Arabic and English numbers to Persian. But Python says:
english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length
I tried to encode strings with Unicode utf-8 but I always got some errors! Sometimes the problem is Arabic string instead! Do you know a better solution for this job?
EDIT:
It seems the problem is the length of Unicode characters when they are stored in a byte string. An Arabic digit like '۱' is two characters long, which I found out with ord(). That is where the length problem starts :-(
See the unidecode library, which transliterates Unicode text into ASCII. It is very useful for number input in different languages.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
Unicode objects can interpret these digits (arabic and persian) as actual digits -
no need to translate them by using character substitution.
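The same digit properties can be used one character at a time, which keeps leading zeros and any non-digit text. This is a sketch for Python 3 and only one way to do it; to_ascii_digits is a name made up for this example.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII counterpart."""
    out = []
    for ch in text:
        # decimal() returns the digit value, or the default (None) for non-digits
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


print(to_ascii_digits("\u06f0\u06f1\u06f2 and \u0660\u0661"))  # Persian 0 1 2, Arabic 0 1
```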
EDIT -
I came up with a way to make your replacement using Python 2 regular expressions:
# coding: utf-8
import re
# Attention: while the characters in the strings below are
# displayed identically, internally they are represented
# by distinct unicode codepoints
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'
persian_regexp = u"(%s)" % u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)
def _sub(match_object, digits):
    # look up the position of the matched digit and return the English digit at that position
    return english_numbers[digits.find(match_object.group(0))]
def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)
def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)
def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)
def replace_persian(text):
    # use the Persian pattern here, not the Arabic one
    return re.sub(persian_regexp, _sub_persian, text)
Note that the "text" parameter must itself be unicode.
(This code could also be shortened by using lambdas and combining some expressions into a single line, but there is no point in doing so other than losing readability.)
It should work for you up to here, but please read on through the original answer I had posted.
-- original answer
So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:
>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>>
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>
By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).
It is very important to understand the basics about unicode - for everyone, even people writing English only programs who think they will never deal with any char out of the 26 latin letters. When writing code that will deal with different chars it is vital - the program can't possibly work without you knowing what you are doing except by chance.
A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now.
You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'
Thus, the characters lose their sense as "single entities" and as digits when encoded. The encoded object (str type in Python 2.x) is just a string of bytes, which is nonetheless needed when sending these characters to any output from the program, be it console, GUI window, database, HTML code, etc.
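For comparison, here is roughly how the same experiment looks in Python 3, where text (str) and encoded bytes are separate types. This is only an illustrative sketch using three digits instead of ten.

```python
arabic_numbers = "\u0661\u0662\u0663"  # the Arabic-Indic digits 1, 2, 3

# as text: one item per character
print(len(arabic_numbers))

# as UTF-8: each of these digits takes two bytes
encoded = arabic_numbers.encode("utf-8")
print(len(encoded))

# decoding gives back text that int() understands again
decoded = encoded.decode("utf-8")
print(int(decoded))
```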
You can use the persiantools package.
Examples:
>>> from persiantools import digits
>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'
>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١") # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'
unidecode converts all characters from Persian to English. If you want to change only the numbers, follow the approach below.
In Python 3 you can use this code to convert any Persian or Arabic digit to an English digit while keeping other characters unchanged:
intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)
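The original question asked for the opposite direction, turning English and Arabic digits into Persian ones. The same str.maketrans approach can be reversed for that; this is a sketch assuming Python 3, and the names to_persian and digits_to_persian are invented for the example.

```python
english_digits = "0123456789"
arabic_digits = "".join(chr(0x0660 + i) for i in range(10))   # ARABIC-INDIC DIGIT ZERO..NINE
persian_digits = "".join(chr(0x06F0 + i) for i in range(10))  # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE

# both source alphabets map onto the same ten Persian digits
to_persian = str.maketrans(english_digits + arabic_digits, persian_digits * 2)


def digits_to_persian(text):
    """Convert English and Arabic digits to Persian digits, leaving other text alone."""
    return text.translate(to_persian)


print(digits_to_persian("12 \u0663"))
```

Building the digit strings from code points avoids mixing up the Arabic and Persian digits, which look almost identical in many fonts.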
Use Unicode Strings:
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
And make sure the encoding of your Python file is correct.
With this you can easily do that:
def p2e(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    # replace each Persian digit (j) with its English counterpart (i)
    for i, j in number.items():
        persiannumber = persiannumber.replace(j, i)
    return persiannumber
here is usage:
print(p2e('۳۱۹۶'))
#returns 3196
In Python 3 easiest way is:
str(int('۱۲۳'))
#123
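but if the number starts with 0, the leading zeros are lost.
So we can use the zip() function instead:
# number is the string to convert; str.replace returns a new string, so reassign it
for i, j in zip('1234567890', '۱۲۳۴۵۶۷۸۹۰'):
    number = number.replace(i, j)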
def persian_number(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    # replace each English digit (i) with its Persian counterpart (j)
    for i, j in number.items():
        persiannumber = persiannumber.replace(i, j)
    return persiannumber
persiannumber must be a string