How to find multi-line comments wrapped in quotes? - python

I am parsing Python code, and I need to remove all possible comments/docstrings. I have successfully been able to remove "comments" of the form:
#comment
"""comment"""
'''comment'''
However, I have found some samples where people write comments of the form:
"'''comment'''"
"\"\"\"\n comment \"\"\""
I am struggling to successfully remove these comments (three single quotes surrounded by a double quote, and double quotes with line breaks). The expression I tried was:
p = re.compile("([\'\"])\1\1(.*?)\1{3}", re.DOTALL)
code = p.sub('', code)
But this did not work for either of the second two cases. Does anyone have any suggestions?
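One thing that may be worth ruling out first: the pattern above is written as an ordinary string literal, so Python converts \1 into the control character \x01 before the re module sees it, and the backreferences are lost. A minimal sketch with a raw string follows; the sample text is made up for illustration.

```python
import re

# In a normal string literal "\1" is an octal escape, not a backreference.
print("\1" == "\x01")  # True

# Raw string: \1 reaches the regex engine as a reference to the opening quote.
pattern = re.compile(r"(['\"])\1\1(.*?)\1{3}", re.DOTALL)

# Hypothetical sample: two triple-quoted blocks and one wrapped in double quotes.
sample = 'x = 1\n"""doc"""\n\'\'\'note\'\'\'\ny = "\'\'\'weird\'\'\'"\n'
cleaned = pattern.sub('', sample)
print(cleaned)
```

Even with the raw string, the wrapped form only loses its inner triple-quoted part and leaves an empty "" behind, and a regex cannot tell a docstring from a triple-quoted string that is assigned to a variable.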

You could try using strip().
It removes any of the characters you pass as its argument from both ends of the string. With no argument it removes whitespace, but here you want to remove the three single quotes surrounded by a double quote, and the double quotes with line breaks. For example:
txt = ",,,,,rrttgg.....banana....rrr"
x = txt.strip(",.grt")
print(x)
The output is "banana", because strip() removed every leading and trailing character that appears in the argument ",.grt".
For more info check out this page; I recommend the info at the bottom for further help:
https://www.w3schools.com/python/python_strings.asp
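Applied to the quoting in the question, a small sketch (the sample strings are assumptions for illustration) shows what strip() does and does not do: it treats its argument as a set of characters and only trims the two ends of the string.

```python
# A comment wrapped as "'''comment'''" (double quote around triple single quotes).
wrapped = "\"'''comment'''\""
stripped = wrapped.strip("\"'")
print(stripped)  # comment

# Quotes in the middle of a string are left alone, because strip() only
# looks at the ends.
inner = "say \"hi\" now"
unchanged = inner.strip("\"")
print(unchanged == inner)  # True
```

Note that this only removes the quote characters and keeps the comment text, so on its own it does not delete the comment from the source.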

posting as an answer because my comment was hard to read
This is what I came up with, it's ugly and hacky but it does work.
import re
txt = "if x = 4: continue \"'''hi'''\" print(x) "
print(txt)
#find everything wrapped in double quotes
double_quotes = re.findall(r"\"(.+?)\"", txt)
for string in double_quotes:
    triple_single = re.findall(r"\'''(.+?)\'''", string)[0]
    full_comment = '"'+"'''" +triple_single+"'''"+'"'
    txt = txt.replace(full_comment, '')
print(txt)
Prints:
if x = 4: continue "'''hi'''" print(x)
if x = 4: continue print(x)
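The loop above raises an IndexError as soon as a double-quoted string contains no triple single quotes, because findall(...)[0] has nothing to return. A possible single-pass variant, offered as a sketch rather than a robust parser, matches the whole wrapped form in one pattern:

```python
import re

txt = "if x = 4: continue \"'''hi'''\" print(x) "

# A double quote, three single quotes, any text (non-greedy, may span lines),
# three single quotes, and the closing double quote.
wrapped_comment = re.compile(r"\"'''.*?'''\"", re.DOTALL)

cleaned = wrapped_comment.sub('', txt)
print(cleaned)

# An ordinary double-quoted string is left untouched.
plain = 'name = "plain string"'
print(wrapped_comment.sub('', plain) == plain)  # True
```

Like the original, this cannot distinguish a stray string used as a comment from the same string assigned to a variable.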

Unassigned string literals can be treated as nodes in the source code's Abstract Syntax Tree (AST) representation. The problem is then reduced to identifying these nodes and rewriting the AST without them, using the tools in the ast module.
Comments (# ...) are not parsed into the AST, so there is no need to code for them.
Unassigned string literals are ast.Expr nodes whose value is an ast.Constant holding a str. They appear in the body attribute of nodes that have bodies, such as modules, function definitions and class definitions. We can identify these nodes, remove them from their parents' bodies and then rewrite the AST.
import ast
import io
from unparse import Unparser
with open('comments.py') as f:
    src = f.read()
root = ast.parse(src)
# print(ast.dump(root)) to see the ast structure.
def filter_constants(node):
    if isinstance(node, ast.Expr):
        if isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, str):
                return None
    return node
class CommentRemover(ast.NodeTransformer):
    def visit(self, node):
        if hasattr(node, 'body'):
            node.body = [n for n in node.body if filter_constants(n)]
        return super().visit(node)
remover = CommentRemover()
new = remover.visit(root)
ast.fix_missing_locations(new)
buf = io.StringIO()
Unparser(new, buf)
buf.seek(0)
print(buf.read())
Calling the script on this code (comments.py):
"""Module docstring."""
# A real comment
"""triple-double-quote comment"""
'''triple-single-quote comment'''
"'''weird comment'''"
"\"\"\"\n comment \"\"\""
NOT_A_COMMENT = 'spam'
42
def foo():
    """Function docstring."""
    # Function comment
    bar = 'baz'
    return bar
class Quux:
    """class docstring."""
    # class comment
    def m(self):
        """method comment"""
        return
Gives this output:
NOT_A_COMMENT = 'spam'
42
def foo():
    bar = 'baz'
    return bar
class Quux():
    def m(self):
        return
Notes:
The unparse script can be found in your Python distribution's Tools/parser folder (in v3.8; in previous versions it was in the Tools or Demo folder). It may also be downloaded from GitHub; be sure to download the version that matches your version of Python.
As of Python 3.8, the ast.Constant class is used for all constant nodes; for earlier versions you may need to use ast.Num, ast.Str, ast.Bytes, ast.NameConstant and ast.Ellipsis as appropriate. So filter_constants might look like this:
def filter_constants(node):
    if isinstance(node, ast.Expr):
        if isinstance(node.value, ast.Str):
            return None
    return node
As of Python 3.9, the ast module provides an unparse function that may be used instead of the unparse script:
src = ast.unparse(new)
print(src)

Related

Using python ast parser to process multi line strings

When using the python AST parser module in combination with scripts containing multi line strings, these multi line strings are always reduced to single line quoted strings. Example:
import ast
script = "text='''Line1\nLine2'''"
code = ast.parse (script, mode='exec')
print (ast.unparse (code))
node = code.body[0].value
print (node.lineno, node.end_lineno)
The output is:
> text = 'Line1\nLine2'
> 1 2
So in spite of being a multi line string before parsing, the text is reduced to a single line quoted string when unparsed. This makes script transformation difficult, because the multi lines are getting lost when unparsing a transformed AST graph.
Is there a way to parse/unparse scripts with multi line strings correctly with AST ?
Thank you in advance.
An examination of ast.unparse's underlying source reveals that _write_constant, the writer used by the visit_Constant method, will produce the string repr unless the backslashing process is specifically avoided:
class _Unparser:
    ...
    def _write_constant(self, value):
        if isinstance(value, (float, complex)):
            ...
        elif self._avoid_backslashes and isinstance(value, str):
            self._write_str_avoiding_backslashes(value)
        else:
            self.write(repr(value))
By default, _avoid_backslashes is set to False. However, multiline string formatting can be performed properly by overriding visit_Constant and calling _write_str_avoiding_backslashes explicitly when the string node spans more than one line:
import ast
class Unparser(ast._Unparser):
    def visit_Constant(self, node):
        if isinstance(node.value, str) and node.lineno < node.end_lineno:
            super()._write_str_avoiding_backslashes(node.value)
            return
        return super().visit_Constant(node)
def _unparse(ast_node):
    u = Unparser()
    return u.visit(ast_node)
script = "text='''Line1\nLine2'''"
print(_unparse(ast.parse(script)))
Output:
text = """Line1
Line2"""

unicode table information about a character in python

Is there a way in python to get the technical information for a given character like it's displayed in the Unicode table? (cf.https://unicode-table.com/en/)
Example:
for the letter "Ȅ"
Name > Latin Capital Letter E with Double Grave
Unicode number > U+0204
HTML-code > &#516;
Bloc > Latin Extended-B
Lowercase > ȅ
What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").
Roughly:
input = a Unicode number
output = corresponding information
The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.
Thank you.
The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
Fortunately UnicodeData.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 fields separated by semicolons, which makes it ideal for parsing. Using the description of the fields on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each of the elements is explained on that same page.
Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
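To make the 15-field layout concrete, here is a minimal sketch that parses a single line. The dictionary keys are labels chosen for this example (not official names), and using fractions.Fraction for the numeric field is a swapped-in alternative to evaluating the text as an expression.

```python
from fractions import Fraction

# Labels for the 15 semicolon-separated fields (names are this sketch's own).
FIELD_NAMES = [
    'code', 'name', 'category', 'combining', 'bidirectional',
    'decomposition', 'decimal', 'digit', 'numeric', 'mirrored',
    'unicode1_name', 'comment', 'uppercase', 'lowercase', 'titlecase',
]

def parse_line(line):
    fields = line.rstrip('\r\n').split(';')
    if len(fields) != 15:
        raise ValueError('expected 15 fields, got %d' % len(fields))
    record = dict(zip(FIELD_NAMES, fields))
    record['code'] = int(record['code'], 16)
    # Case mappings are hexadecimal code points, or empty when there is none.
    for key in ('uppercase', 'lowercase', 'titlecase'):
        record[key] = int(record[key], 16) if record[key] else None
    # Fraction accepts both '5' and '1/4', so no eval is needed.
    record['numeric'] = Fraction(record['numeric']) if record['numeric'] else None
    record['mirrored'] = record['mirrored'] == 'Y'
    return record

sample = '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'
info = parse_line(sample)
print(info['name'], hex(info['lowercase']))
```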
Code (tested with Python 2.7 and 3.6):
# -*- coding: utf-8 -*-
class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None
    def __getitem__(self, item):
        return getattr(self, item)
    def __repr__(self):
        return '{'+self.name+'}'
class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'
    def __repr__(self):
        return '{'+self.name+'}'
class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0],16)
                    block.last = int(rawdata[1],16)
                    self.blocklist.append(block)
        # make 100% sure it's sorted, for quicker look-up later
        # (it is usually sorted in the file, but better make sure)
        self.blocklist.sort (key=lambda x: x.first)
    def lookup(self,code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None
class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org
    These files must appear in the same directory as this program.
    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.
    As UnicodeList loads its data from an external file, it does not depend
    on the local build from Python (in which the Unicode data gets frozen
    to the then 'current' version).
    Initialize with
        uclist = UnicodeList()
    """
    def __init__(self):
        # we need this first
        blocklist = BlockList()
        bpos = 0
        self.codelist = []
        with open('UnicodeData.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0],16)
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    # ONE QUARTER ... 1/4
                    # let's make it Python 2.7 compatible :)
                    if '/' in rawdata[8]:
                        rawdata[8] = rawdata[8].replace('/','./')
                        parsed.asNumeric = eval(rawdata[8])
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)
    def find_code(self,codepoint):
        """Find the Unicode information for a codepoint (as int).
        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None
    def find_char(self,str):
        """Find the Unicode information for a codepoint (as character).
        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        if len(str) > 1:
            result = [self.find_code(ord(x)) for x in str]
            return result
        else:
            return self.find_code(ord(str))
When loaded, you can now look up a character code with
>>> ul = UnicodeList() # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:
>>> print ('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print (ul.find_code(0x204).block)
Latin Extended-B
and (as long as you don't get a None) even chain them:
>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}
It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:
>>> import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}
(As tested with Python 3.5.3.)
There are currently two lookup functions defined:
find_code(int) looks up character information by codepoint as an integer.
find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
After import unicodelist (assuming you saved this as unicodelist.py), you can use
>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'
to look up the hex code for any character, and a list comprehension such as
>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']
for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:
l = [hex(ord(x)) for x in 'Hello']
The purpose of this module is to give easy access to other Unicode properties. A longer example:
text = 'Héllo...'
dest = ''
for i in text:
    dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)
HÉLLO...
and showing a list of properties for a character per your example:
letter = u'Ȅ'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04X' % ul.find_char(letter).code)
print ('Bloc > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
(I left out HTML; these names are not defined in the Unicode standard.)
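The block list is sorted "for quicker look-up", yet the look-up itself is a linear scan. If speed matters, a binary search over the block start values is one option. The sketch below uses a small hand-copied excerpt of block ranges instead of reading Blocks.txt, and the names BLOCKS, STARTS and block_name are assumptions for this example.

```python
import bisect

# (first, last, name): a tiny excerpt, not the full contents of Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0080, 0x00FF, 'Latin-1 Supplement'),
    (0x0100, 0x017F, 'Latin Extended-A'),
    (0x0180, 0x024F, 'Latin Extended-B'),
]
STARTS = [first for first, _, _ in BLOCKS]  # must be sorted ascending

def block_name(code):
    # Index of the last block whose first code point is <= code.
    index = bisect.bisect_right(STARTS, code) - 1
    if index < 0:
        return None
    first, last, name = BLOCKS[index]
    # Code points in a gap between blocks belong to no block.
    return name if first <= code <= last else None

print(block_name(0x204))  # Latin Extended-B
```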
The unicodedata documentation shows how to do most of this.
The Unicode block name is apparently not available but another Stack Overflow question has a solution of sorts and another has some additional approaches using regex.
The uppercase/lowercase mapping and character number information is not particularly Unicode-specific; just use the regular Python string functions.
So in summary
>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'&#203;'
>>> 'Ë'.lower()
'ë'
The U+%04X formatting is sort-of correct, in that it simply avoids padding and prints the whole hex number for code points with a value higher than 65,535. Note that some other formats require the use of %08X padding in this scenario (notably \U00010000 format in Python).
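Put together, the pieces above could be wrapped in one helper that takes a code point as an integer. This is a sketch using only the standard library; describe and its dictionary keys are names invented for the example, and the block name is left out because unicodedata does not provide it.

```python
import unicodedata

def describe(codepoint):
    char = chr(codepoint)
    return {
        # name() raises ValueError for unnamed code points unless a default is given
        'name': unicodedata.name(char, None),
        'number': 'U+%04X' % codepoint,
        'html': '&#%d;' % codepoint,
        'lowercase': char.lower(),
    }

info = describe(0x0204)
for key, value in info.items():
    print(key, '>', value)
```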
You can do this in several ways:
1- create an API yourself (I couldn't find anything that does this)
2- create a table in a database or an Excel file
3- load and parse a website that has this information
I think the 3rd way is the easiest. Take a look at this page; you can find some information about Unicode characters there.
Get your Unicode number, then find it in the web page using parsing tools such as lxml, Scrapy, Selenium, etc.

Python templates for generating Python code with proper multiline indentation

I'm using Python to compile another Python file. To this end, I use Template from the string module, into which I insert, for example, a constructed function body:
from string import Template
s = Template('''
def main():
    ${body}
    return
''')
# body constructed bit by bit
body = ['a']
body.append('b')
body.append('c')
out = s.substitute(body='\n'.join(body))
print(out)
The output of the above is
def main():
    a
b
c
    return
which already highlights the problem: ${body} lines other than the first aren't correctly indented. I could of course manually add the spaces when inserting 'b' and 'c' into the body list, but that already assumes knowledge of the template into which the body will be inserted.
(Perhaps string.Template is not the appropriate template engine to begin with.)
Assuming you need to fix indentation for multi line replacements only when the ${} model is at the beginning of the line (except for the indentation), you could use a regex to find all the tokens in the mask and if they are preceded with only blanks repeat them on all following lines from the replacement list.
You could use code like this:
import string, re
def substitute(s, reps):
    t = string.Template(s)
    i = 0; cr = {} # prepare to iterate through the pattern string
    while True:
        # search for next replaceable token and its prefix
        m = re.search(r'^(.*?)\$\{(.*?)\}', s[i:], re.MULTILINE)
        if m is None: break # no more : finished
        # the list is joined using the prefix if it contains only blanks
        sep = ('\n' + m.group(1)) if m.group(1).strip() == '' else '\n'
        value = reps[m.group(2)]
        # lists are joined line by line; any other value is used as a single string
        cr[m.group(2)] = sep.join(value) if isinstance(value, list) else str(value)
        i += m.end() # continue past last processed replaceable token
    return t.substitute(cr) # we can now substitute
With your example (slightly modified), it would give:
s = '''
def main():
    ${body}
    return ${retval}
'''
# body constructed bit by bit
body = ['a']
body.append('b')
body.append('c')
out = substitute(s, { 'body': body, 'retval': 0 })
print (out)
it gives as expected:
def main():
    a
    b
    c
    return 0
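A more compact variant of the same idea, offered as a sketch under a narrower assumption: it only handles placeholders that stand alone on their line, and it does the replacement itself with re.sub instead of going through string.Template. The function name fill is made up for this example.

```python
import re

def fill(template, values):
    """Replace ${name} placeholders that stand alone on a line,
    repeating the placeholder's indentation on every inserted line."""
    def repl(match):
        indent, name = match.group(1), match.group(2)
        lines = values[name]
        if isinstance(lines, str):
            lines = lines.splitlines()
        return '\n'.join(indent + line for line in lines)
    # Leading blanks, the placeholder, optional trailing blanks, end of line.
    return re.sub(r'^([ \t]*)\$\{(\w+)\}[ \t]*$', repl, template,
                  flags=re.MULTILINE)

template = 'def main():\n    ${body}\n    return\n'
out = fill(template, {'body': ['a', 'b', 'c']})
print(out)
```

Placeholders that share a line with other text, such as return ${retval}, are left untouched by this version and would still need string.Template or a second pass.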

parse statement string for arguments using regex in Python

I have user input statements which I would like to parse for arguments. If possible using regex.
I have read a lot about functools.partial on Stack Overflow, but could not find anything on argument parsing. Searching for regex on Stack Overflow, I also could not find how to check for a match while excluding the tokens used to find it. The Python tokenizer seems too heavy for my purpose.
import functools
import re
import psutil
def getarguments(statement):
    prog = re.compile("([(].*[)])")
    result = prog.search(statement)
    m = result.group()
    # m = '(interval=1, percpu=True)'
    # or m = "('/')"
    # strip the parentheses, ugly but it works
    return statement[result.start()+1:result.end()-1]
stm = 'psutil.cpu_percent(interval=1, percpu=True)'
arg_list = getarguments(stm)
print(arg_list) # returns : interval=1, percpu=True
# But combining single and double quotes like
stm = "psutil.disk_usage('/').percent"
arg_list = getarguments(stm) # in debug value is "'/'"
print(arg_list) # when printed value is : '/'
callfunction = psutil.disk_usage
args = []
args.append(arg_list)
# args.append('/')
funct1 = functools.partial(callfunction, *args)
perc = funct1().percent
print(perc)
This results in an error:
builtins.FileNotFoundError: [Errno 2] No such file or directory: "'/'"
But
callfunction = psutil.disk_usage
args = []
#args.append(arg_list)
args.append('/')
funct1 = functools.partial(callfunction, *args)
perc = funct1().percent
print(perc)
does return (for me) 20.3, which is correct.
So there is somewhere a difference.
The weird thing is, if I view the content in my IDE (WingIDE) the result is "'/'" and then, if I want to view the details then the result is '/'
I use Python 3.4.0 What is happening here, and how to solve?
Your help is really appreciated.
getarguments("psutil.disk_usage('/').percent") returns the 3-character string '/' (the quotes are part of the value). You can check this by printing len(arg_list), for example.
Your IDE shows it wrapped in double quotes because strings are displayed in single quotes by default; since this string itself contains ', the IDE uses double quotes to enclose it instead.
Note that '/' is not equal to "'/'". The former is a string of 1 character, the latter is a string of 3 characters. So in order to get things right you need to strip the quotes (both double and single ones) in getarguments. You can do it with the following snippet:
if ((s.startswith('\'') and s.endswith('\'')) or
        (s.startswith('\"') and s.endswith('\"'))):
    s = s[1:-1]
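Folded into the function from the question, the quote stripping might look like the sketch below. strip_matching_quotes is a hypothetical helper name, and the regex is simplified to capture the text between the outermost parentheses directly.

```python
import re

def strip_matching_quotes(s):
    # Remove one pair of surrounding quotes, but only if both ends match.
    if len(s) >= 2 and s[0] == s[-1] and s[0] in ('"', "'"):
        return s[1:-1]
    return s

def getarguments(statement):
    # Capture everything between the first '(' and the last ')'.
    match = re.search(r"\((.*)\)", statement)
    if match is None:
        return None
    return strip_matching_quotes(match.group(1))

path_arg = getarguments("psutil.disk_usage('/').percent")
print(path_arg, len(path_arg))  # / 1

kw_args = getarguments('psutil.cpu_percent(interval=1, percpu=True)')
print(kw_args)  # interval=1, percpu=True
```

This still returns a single string; splitting it into separate positional and keyword arguments is a further step that this sketch does not attempt.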

Empty XML element handling in Python

I'm puzzled by the minidom parser's handling of empty elements, as shown in the following code section.
import xml.dom.minidom
doc = xml.dom.minidom.parseString('<value></value>')
print doc.firstChild.nodeValue.__repr__()
# Out: None
print doc.firstChild.toxml()
# Out: <value/>
doc = xml.dom.minidom.Document()
v = doc.appendChild(doc.createElement('value'))
v.appendChild(doc.createTextNode(''))
print v.firstChild.nodeValue.__repr__()
# Out: ''
print doc.firstChild.toxml()
# Out: <value></value>
How can I get consistent behavior? I'd like to receive empty string as value of empty element (which IS what I put in XML structure in the first place).
Cracking open xml.dom.minidom and searching for "/>", we find this:
# Method of the Element(Node) class.
def writexml(self, writer, indent="", addindent="", newl=""):
    # [snip]
    if self.childNodes:
        writer.write(">%s"%(newl))
        for node in self.childNodes:
            node.writexml(writer,indent+addindent,addindent,newl)
        writer.write("%s</%s>%s" % (indent,self.tagName,newl))
    else:
        writer.write("/>%s"%(newl))
We can deduce from this that the short-end-tag form only occurs when childNodes is an empty list. Indeed, this seems to be true:
>>> from xml.dom.minidom import Document
>>> doc = Document()
>>> v = doc.appendChild(doc.createElement('v'))
>>> v.toxml()
'<v/>'
>>> v.childNodes
[]
>>> v.appendChild(doc.createTextNode(''))
<DOM Text node "''">
>>> v.childNodes
[<DOM Text node "''">]
>>> v.toxml()
'<v></v>'
As pointed out by Lloyd, the XML spec makes no distinction between the two. If your code does make the distinction, that means you need to rethink how you want to serialize your data.
xml.dom.minidom simply displays something differently because it's easier to code. You can, however, get consistent output. Simply inherit the Element class and override the toxml method such that it will print out the short-end-tag form when there are no child nodes with non-empty text content. Then monkeypatch the module to use your new Element class.
value = thing.firstChild.nodeValue or ''
The XML spec does not distinguish between these two cases.
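If the goal is simply to read an empty string from an empty element regardless of how the document was built, one option is a small helper that joins the element's text children. This is a sketch in Python 3 syntax, and element_text is a name made up for the example.

```python
from xml.dom import Node, minidom

def element_text(element):
    # Join all text and CDATA children; no children yields ''.
    return ''.join(
        child.data for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )

# Case 1: parsed from a string, the element has no child nodes at all.
parsed = minidom.parseString('<value></value>').documentElement

# Case 2: built by hand with an explicit empty text node.
doc = minidom.Document()
built = doc.appendChild(doc.createElement('value'))
built.appendChild(doc.createTextNode(''))

print(repr(element_text(parsed)), repr(element_text(built)))  # '' ''
```

Both cases give the same value, so the calling code no longer depends on which serialized form the element happens to have.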
