Unicode table information about a character in Python

Is there a way in Python to get the technical information for a given character, as it's displayed in the Unicode table? (cf. https://unicode-table.com/en/)
Example:
for the letter "Ȅ"
Name > Latin Capital Letter E with Double Grave
Unicode number > U+0204
HTML-code > &#516;
Block > Latin Extended-B
Lowercase > ȅ
What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").
Roughly:
input = a Unicode number
output = corresponding information
The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.
Thank you.

The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
Fortunately UnicodeData.txt, the data file this comes from, is not hard to parse. Each line consists of exactly 15 elements, separated by ;, which makes it ideal for parsing. Using the description of the elements on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each of the elements is explained on that same page.
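For orientation, here is roughly what a raw UnicodeData.txt line looks like for U+0204 and how the fields split; the sample line is quoted from memory, so verify it against your downloaded copy:

# hedged illustration: the sample below should match the UnicodeData.txt entry for U+0204
sample = "0204;LATIN CAPITAL LETTER E WITH DOUBLE GRAVE;Lu;0;L;0045 030F;;;;N;;;;0205;"
fields = sample.split(';')
print(len(fields))   # 15 fields, many of them empty
print(fields[1])     # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print(fields[13])    # 0205 -> the lowercase mapping, as a hex string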
Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
Code (tested with Python 2.7 and 3.6):
# -*- coding: utf-8 -*-

class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None

    def __getitem__(self, item):
        return getattr(self, item)

    def __repr__(self):
        return '{' + self.name + '}'

class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'

    def __repr__(self):
        return '{' + self.name + '}'

class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt', 'r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0], 16)
                    block.last = int(rawdata[1], 16)
                    self.blocklist.append(block)
        # make 100% sure it's sorted, for quicker look-up later
        # (it is usually sorted in the file, but better make sure)
        self.blocklist.sort(key=lambda x: x.first)

    def lookup(self, code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None

class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org.
    These files must appear in the same directory as this program.

    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.

    As UnicodeList loads its data from an external file, it does not depend
    on the local build from Python (in which the Unicode data gets frozen
    to the then 'current' version).

    Initialize with

        uclist = UnicodeList()
    """
    def __init__(self):
        # we need this first
        blocklist = BlockList()
        bpos = 0
        self.codelist = []
        with open('UnicodeData.txt', 'r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0], 16)
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    #   ONE QUARTER ... 1/4
                    # let's make it Python 2.7 compatible :)
                    if '/' in rawdata[8]:
                        rawdata[8] = rawdata[8].replace('/', './')
                        parsed.asNumeric = eval(rawdata[8])
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12], 16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13], 16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14], 16) if rawdata[14] else None
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)

    def find_code(self, codepoint):
        """Find the Unicode information for a codepoint (as int).

        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None

    def find_char(self, str):
        """Find the Unicode information for a codepoint (as character).

        Returns:
            for a single character: a UnicodeCharacter class object or None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        if len(str) > 1:
            result = [self.find_code(ord(x)) for x in str]
            return result
        else:
            return self.find_code(ord(str))
When loaded, you can now look up a character code with
>>> ul = UnicodeList() # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:
>>> print ('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print (ul.find_code(0x204).block)
Latin Extended-B
and (as long as you don't get a None) even chain them:
>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}
It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be sure you have the most recent information:
>>> import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}
(As tested with Python 3.5.3.)
There are currently two lookup functions defined:
find_code(int) looks up character information by codepoint as an integer.
find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
After import unicodelist (assuming you saved this as unicodelist.py), you can use
>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'
to look up the hex code for any character, and a list comprehension such as
>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']
for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:
l = [hex(ord(x)) for x in 'Hello']
The purpose of this module is to give easy access to other Unicode properties. A longer example:
str = 'Héllo...'
dest = ''
for i in str:
    dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)
HÉLLO...
and showing a list of properties for a character per your example:
letter = u'Ȅ'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04x' % ul.find_char(letter).code)
print ('Block > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
(I left out the HTML code; those entities are not defined in the Unicode data files.)

The unicodedata documentation shows how to do most of this.
The Unicode block name is apparently not available but another Stack Overflow question has a solution of sorts and another has some additional approaches using regex.
The uppercase/lowercase mapping and character number information is not particularly Unicode-specific; just use the regular Python string functions.
So, in summary:
>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'Ë'
>>> 'Ë'.lower()
'ë'
The U+%04X formatting is sort-of correct: it pads to at least four hex digits, and for code points above 65,535 it simply prints the whole hex number without truncating it. Note that some other formats require %08X padding in that case (notably the \U00010000 escape format in Python).
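As a quick illustration of that point (plain Python string formatting, nothing else assumed):

cp = ord('\U0001F600')       # a code point above U+FFFF
print('U+%04X' % cp)         # U+1F600  -- %04X stops padding once four digits are exceeded
print('\\U%08X' % cp)        # \U0001F600 -- the Python escape form wants all eight digits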

You can do this in a few ways:
1- create the lookup yourself (I can't find an existing API that does this)
2- create a table in a database or an Excel file
3- load and parse a website that already lists the data
I think the 3rd way is the easiest. Take a look at a page such as https://unicode-table.com/en/; you can find the information for each code point there.
Get your Unicode number and then find it in the web page using parsing tools like lxml, Scrapy, Selenium, etc.
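If you go the scraping route, a very rough sketch might look like the following. Note that the URL pattern and the assumption that the page title carries the character name are guesses about unicode-table.com, not documented behaviour, so treat this as a starting point only:

import requests
from lxml import html

def lookup_name(codepoint):
    # hypothetical URL pattern; adjust to whatever the site actually uses
    url = 'https://unicode-table.com/en/%04X/' % codepoint
    page = html.fromstring(requests.get(url, timeout=10).content)
    # assumption: the character name appears in the <title> element
    title = page.findtext('.//title') or ''
    return title.strip()

print(lookup_name(0x0204))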

Related

Python - Possibly Regex - How to replace part of a filepath with another filepath based on a match?

I'm new to Python and relatively new to programming. I'm trying to replace part of a file path with a different file path. If possible, I'd like to avoid regex as I don't know it. If not, I understand.
I want an item in the Python list [] before the word PROGRAM to be replaced with the 'replaceWith' variable.
How would you go about doing this?
Current Python List []
item1ToReplace1 = \\server\drive\BusinessFolder\PROGRAM\New\new.vb
item1ToReplace2 = \\server\drive\BusinessFolder\PROGRAM\old\old.vb
Variable to replace part of the Python list path
replaceWith = 'C:\ProgramFiles\Microsoft\PROGRAM'
Desired results for Python List []:
item1ToReplace1 = C:\ProgramFiles\Microsoft\PROGRAM\New\new.vb
item1ToReplace2 = C:\ProgramFiles\Microsoft\PROGRAM\old\old.vb
Thank you for your help.
The following code does what you ask; note that I changed your single backslashes to doubled backslashes ('\\'), since the backslash is an escape character in Python and you need to account for it in your code.
import os

item1ToReplace1 = '\\server\\drive\\BusinessFolder\\PROGRAM\\New\\new.vb'
item1ToReplace2 = '\\server\\drive\\BusinessFolder\\PROGRAM\\old\\old.vb'
replaceWith = 'C:\ProgramFiles\Microsoft\PROGRAM'
keyword = "PROGRAM\\"

def replacer(rp, s, kw):
    ss = s.split(kw, 1)
    if len(ss) > 1:
        tail = ss[1]
        return os.path.join(rp, tail)
    else:
        return ""

print(replacer(replaceWith, item1ToReplace1, keyword))
print(replacer(replaceWith, item1ToReplace2, keyword))
The code splits on your keyword and puts that on the back of the string you want.
If your keyword is not in the string, your result will be an empty string.
Result:
C:\ProgramFiles\Microsoft\PROGRAM\New\new.vb
C:\ProgramFiles\Microsoft\PROGRAM\old\old.vb
One way would be:
item_ls = item1ToReplace1.split("\\")
idx = item_ls.index("PROGRAM")
result = ["C:", "ProgramFiles", "Micosoft"] + item_ls[idx:]
result = "\\".join(result)
Resulting in:
>>> item1ToReplace1 = r"\\server\drive\BusinessFolder\PROGRAM\New\new.vb"
... # the above
>>> result
'C:\\ProgramFiles\\Microsoft\\PROGRAM\\New\\new.vb'
Note the use of r"..." to avoid having to 'escape the escape characters' in your input (i.e. the \). Also note that the join/split calls need the backslash itself escaped as a double backslash ("\\").
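In case the escaping rules are unfamiliar, a tiny illustration (plain Python, nothing specific to this problem):

print(len('\new'))                    # 3 -- '\n' is a single newline character
print(len(r'\new'))                   # 4 -- the raw string keeps the backslash literally
print(r"\\server\drive".split("\\"))  # ['', '', 'server', 'drive']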

How to determine whether a Python string contains an emoji?

I've seen previous responses to this question, but none of them are recent and none of them are working for me in Python 3. I have a list of strings, and I simply want to identify which ones contain emoji. What's the fastest way to do this?
To be more specific, I have a lengthy list of email subject lines from AB tests, and I'm trying to determine which subject lines contained emoji.
this link and this link both count © and other common characters as emoji. Also the former has minor mistakes and the latter still doesn't appear to work.
Here's an implementation that errs on the conservative side using this newer data and this documentation. It only considers code points that are marked with the unicode property Emoji_Presentation (which means it's definitely an emoji), or code points marked only with the property Emoji (which means it defaults to text but it could be an emoji), that are followed by a special variation selector code point fe0f that says to default to an emoji instead. The reason I say this is conservative is because certain systems aren't as picky about the fe0f and will treat characters as emoji wherever they can (read more about this here).
import re
from collections import defaultdict

def parse_line(line):
    """Return a pair (property, codepoints) where property is a string and
    codepoints is a set of int unicode code points"""
    pat = r'([0-9A-Z]+)(\.\.[0-9A-Z]+)? + ; +(\w+) + #.*'
    match = re.match(pat, line)
    assert match
    codepoints = set()
    start = int(match.group(1), 16)
    if match.group(2):
        trimmed = match.group(2)[2:]
        end = int(trimmed, 16) + 1
    else:
        end = start + 1
    for cp in range(start, end):
        codepoints.add(cp)
    return (match.group(3), codepoints)

def parse_emoji_data():
    """Return a dictionary mapping properties to code points"""
    result = defaultdict(set)
    with open('emoji-data.txt', mode='r', encoding='utf-8') as f:
        for line in f:
            if '#' != line[0] and len(line.strip()) > 0:
                property, cp = parse_line(line)
                result[property] |= cp
    return result

def test_parse_emoji_data():
    sets = parse_emoji_data()
    sizes = {
        'Emoji': 1123,
        'Emoji_Presentation': 910,
        'Emoji_Modifier': 5,
        'Emoji_Modifier_Base': 83,
    }
    for k, v in sizes.items():
        assert len(sets[k]) == v

def contains_emoji(text):
    """
    Return true if the string contains either a code point with the
    `Emoji_Presentation` property, or a code point with the `Emoji`
    property that is followed by \uFE0F
    """
    sets = parse_emoji_data()
    for i, ch in enumerate(text):
        if ord(ch) in sets['Emoji_Presentation']:
            return True
        elif ord(ch) in sets['Emoji']:
            if len(text) > i+1 and text[i+1] == '\ufe0f':
                return True
    return False

test_parse_emoji_data()
assert not contains_emoji('hello')
assert not contains_emoji('hello :) :D 125% #%&*(##%&!#(^*(')
assert contains_emoji('here is a smiley \U0001F601 !!!')
To run this you need ftp://ftp.unicode.org/Public/emoji/3.0/emoji-data.txt in the working directory.
Once the regex module supports emoji properties it will be easier to use that instead.
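One practical note on the code above: contains_emoji re-reads and re-parses emoji-data.txt on every call. A small sketch of caching the parsed sets with the standard functools module (the helper names here are mine, not from the original answer):

import functools

@functools.lru_cache(maxsize=1)
def cached_emoji_data():
    return parse_emoji_data()    # parse the data file only once

def contains_emoji_fast(text):
    sets = cached_emoji_data()
    return any(ord(ch) in sets['Emoji_Presentation'] or
               (ord(ch) in sets['Emoji'] and text[i+1:i+2] == '\ufe0f')
               for i, ch in enumerate(text))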

Python templates for generating Python code with proper multiline indentation

I'm using Python to generate another Python file. To this end, I use Template from the string module, into which I insert a constructed function body, e.g.:
from string import Template
s = Template('''
def main():
    ${body}
    return
''')

# body constructed bit by bit
body = ['a']
body.append('b')
body.append('c')

out = s.substitute(body='\n'.join(body))
print(out)
The output of the above is
def main():
    a
b
c
    return
which already highlights the problem: ${body} lines other than the first aren't correctly indented. I could of course manually add the spaces when inserting 'b' and 'c' into the body list, but that already assumes knowledge of the template into which the body will be inserted.
(Perhaps string.Template is not the appropriate template engine to begin with.)
Assuming you only need to fix indentation for multi-line replacements when the ${} placeholder is at the beginning of the line (apart from the indentation itself), you could use a regex to find all the tokens in the template and, if they are preceded only by blanks, repeat that prefix on all following lines of the replacement list.
You could use code like this:
import string, re

def substitute(s, reps):
    t = string.Template(s)
    i = 0; cr = {}              # prepare to iterate through the pattern string
    while True:
        # search for the next replaceable token and its prefix
        m = re.search(r'^(.*?)\$\{(.*?)\}', s[i:], re.MULTILINE)
        if m is None:
            break               # no more: finished
        # the list is joined using the prefix if it contains only blanks
        sep = ('\n' + m.group(1)) if m.group(1).strip() == '' else '\n'
        value = reps[m.group(2)]
        # join list values line by line; anything else is inserted as-is
        cr[m.group(2)] = sep.join(value) if isinstance(value, list) else str(value)
        i += m.end()            # continue past the last processed token
    return t.substitute(cr)     # we can now substitute
With your example (slightly modified), it would give:
s = '''
def main():
    ${body}
    return ${retval}
'''

# body constructed bit by bit
body = ['a']
body.append('b')
body.append('c')

out = substitute(s, {'body': body, 'retval': 0})
print(out)
it gives as expected:
def main():
    a
    b
    c
    return 0
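If hard-coding the template's indentation is acceptable in your case, the standard textwrap module offers a simpler (if less general) alternative; a minimal sketch, assuming a fixed four-space indent:

import textwrap
from string import Template

s = Template('def main():\n${body}\n    return\n')
body = ['a', 'b', 'c']
out = s.substitute(body=textwrap.indent('\n'.join(body), '    '))
print(out)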

Parsing key values in string

I have a string that I am getting from a command line application. It has the following structure:
-- section1 --
item11|value11
item12|value12
item13
-- section2 --
item21|value21
item22
what I would like is to parse this to a dict so that I can easily access the values with:
d['section1']['item11']
I already solved it for the case where there are no sections and every key has a value, but I get errors otherwise. I have tried a couple of things, but it is getting complicated and nothing seems to work. This is what I have now:
s="""
item11|value11
item12|value12
item21|value21
"""
d = {}
for l in s.split('\n'):
print(l, l.split('|'))
if l != '':
d[l.split('|')[0]] = l.split('|')[1]
Can somebody help me extend this for the section case and when no values are present?
Seems like a perfect fit for the ConfigParser module in the standard library:
import re
from configparser import ConfigParser

d = ConfigParser(delimiters='|', allow_no_value=True)
d.SECTCRE = re.compile(r"-- *(?P<header>[^]]+?) *--")  # sections regex
d.read_string(s)
Now you have an object that you can access like a dictionary:
>>> d['section1']['item11']
'value11'
>>> d['section2']['item22'] # no value case
None
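If you need a plain nested dict rather than the parser object itself, converting it is a one-liner (standard ConfigParser API):

plain = {section: dict(d.items(section)) for section in d.sections()}
print(plain['section1']['item11'])   # 'value11'
print(plain['section2']['item22'])   # None (no value)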
Regexes are a good fit for this:
import re

def parse(data):
    lines = data.split("\n")  # split input into lines
    result = {}
    current_header = ""
    for line in lines:
        if line:  # if the line isn't empty
            # tries to match anything between double dashes:
            match = re.match(r"^-- (.*) --$", line)
            if match:  # true when the above pattern matches
                # grabs the part inside the parentheses (the regex group):
                current_header = match.group(1)
            else:
                # key = 1st element, value = 2nd element:
                key, value = line.split("|")
                # tries to get the section, defaults to empty section:
                section = result.get(current_header, {})
                section[key] = value  # adds data to section
                result[current_header] = section  # updates section into result
    return result  # done.

print(parse("""
-- section1 --
item1|value1
item2|value2
-- section2 --
item1|valueA
item2|valueB"""))
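For reference, on a recent Python 3 (where dicts keep insertion order) the call above should print something close to:

{'section1': {'item1': 'value1', 'item2': 'value2'}, 'section2': {'item1': 'valueA', 'item2': 'valueB'}}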

Odd character appending to the front of python list

I'm having an issue with running Python on Linux. I'm trying to learn Python and wanted to parse a small XML file and put the tags and data into a list. But every time I run the code I get a 'u' prefixed to each element in the list.
[u'world']
defaultdict(<type 'list'>, {u'world': [u'data']})
My code is as follows:
import xml.sax
from collections import defaultdict

class TransformXML(xml.sax.ContentHandler):
    def __init__(self):
        self.start_tag_name = -1
        self.tag_data = -1
        self.myDict = defaultdict(list)
        self.tags = []

    def startElement(self, name, attrs):
        self.start_tag_name = name
        print name
        print self.start_tag_name

    def characters(self, content):
        if content.strip(' \r\n\t') != "":
            self.tag_data = content.strip(' \r\n\t')
            print self.start_tag_name
            self.tags.append(self.start_tag_name)
            self.myDict[self.start_tag_name].append(content.strip(' \r\n\t'))

    def endElement(self, name):
        pass

    def __del__(self):
        if self.myDict:
            del self.myDict
            print "deleteing myDict"
Does anyone know what the issue might be?
That 'weird' symbol basically means the string is a unicode string (Python 2's unicode type) rather than a plain byte string.
E.g. if I have the string 'Test':
>>> unicode('Test')
u'Test'
>>> s = unicode('Test')
>>> type(s)
<type 'unicode'>
Documentation here
To sum up, according to the python docs,
...a Unicode string is a sequence of code points, which are numbers from
0 to 0x10ffff. This sequence needs to be represented as a set of bytes
(meaning, values from 0-255) in memory. The rules for translating a
Unicode string into a sequence of bytes are called an encoding.
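A quick Python 2 session illustrating that last point; the u prefix only tells you the object is a unicode string, and encoding is the unicode-to-bytes step the docs describe:

>>> s = u'world'
>>> s.encode('utf-8')          # unicode -> bytes
'world'
>>> 'world'.decode('utf-8')    # bytes -> unicode
u'world'
>>> u'world' == 'world'        # ASCII-only content compares equal either way
True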
