Using python ast parser to process multi line strings - python

When using Python's ast parser module on scripts containing multi-line strings, these multi-line strings are always reduced to single-line quoted strings. Example:
import ast
script = "text='''Line1\nLine2'''"
code = ast.parse(script, mode='exec')
print(ast.unparse(code))
node = code.body[0].value
print (node.lineno, node.end_lineno)
The output is:
> text = 'Line1\nLine2'
> 1 2
So despite being a multi-line string before parsing, the text is reduced to a single-line quoted string when unparsed. This makes script transformation difficult, because the line breaks are lost when unparsing a transformed AST.
Is there a way to parse/unparse scripts with multi-line strings correctly with ast?
Thank you in advance.

An examination of ast.unparse's underlying source reveals that the writer for the visit_Constant method, _write_constant, will produce the string's repr unless the backslash-avoiding path is specifically taken:
class _Unparser:
    ...
    def _write_constant(self, value):
        if isinstance(value, (float, complex)):
            ...
        elif self._avoid_backslashes and isinstance(value, str):
            self._write_str_avoiding_backslashes(value)
        else:
            self.write(repr(value))
By default, _avoid_backslashes is set to False; however, multi-line strings can be formatted properly by overriding visit_Constant and calling _write_str_avoiding_backslashes specifically when the string node spans multiple lines:
import ast

class Unparser(ast._Unparser):
    def visit_Constant(self, node):
        if isinstance(node.value, str) and node.lineno < node.end_lineno:
            super()._write_str_avoiding_backslashes(node.value)
            return
        return super().visit_Constant(node)

def _unparse(ast_node):
    u = Unparser()
    return u.visit(ast_node)

script = "text='''Line1\nLine2'''"
print(_unparse(ast.parse(script)))
Output:
text = """Line1
Line2"""
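To check that the multi-line form actually survives a transformation pass, here is a self-contained sketch. It relies on the private ast._Unparser API (Python 3.9+), so it may break between versions; the Renamer transform and the variable names are just illustrative:

```python
import ast

class Unparser(ast._Unparser):
    # Emit multi-line string constants in triple-quoted form
    def visit_Constant(self, node):
        if isinstance(node.value, str) and node.lineno < node.end_lineno:
            self._write_str_avoiding_backslashes(node.value)
            return
        return super().visit_Constant(node)

class Renamer(ast.NodeTransformer):
    # Toy transformation: rename the variable 'text' to 'renamed'
    def visit_Name(self, node):
        if node.id == 'text':
            node.id = 'renamed'
        return node

script = "text = '''Line1\nLine2'''"
tree = Renamer().visit(ast.parse(script))
ast.fix_missing_locations(tree)
print(Unparser().visit(tree))
```

The string constant keeps its original lineno/end_lineno through the transform, so the multi-line check still fires and the unparsed output contains a real line break instead of \n.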

Related

remove the 'u' and brackets from text file output - pyspark

I need to output the values from my spark program into a text file in the following format:
'ADDRESS', VALUE
However, my current output is:
(u'ADDRESS', VALUE)
Is there a way for me to reformat the output so that, when it is written to the text file, it matches the first format above?
Here is my code below:
import pyspark
import re
from operator import *
sc = pyspark.SparkContext()
sc.setLogLevel("ERROR")
def good_line(line):
    try:
        fields = line.split(',')
        if len(fields) != 7:
            return False
        if int(fields[3]) == 0:
            return False
        str(fields[2])
        int(fields[3])
        return True
    except:
        return False
lines = sc.textFile("/user/ae306/transactions.csv")
clean_lines = lines.filter(good_line)
transactions = clean_lines.map(lambda transaction: (transaction.split(',')[2] ,int(transaction.split(',')[3])))
result = transactions.reduceByKey(add)
print(result)
result.saveAsTextFile("CompEvalSparkPartBJob1TestFile")
Thank you for your time.
Each record in result is a tuple, and writing it as text produces its string representation, which in turn prints the representation of the elements inside it.
But the representation of the elements is not the same as when you print the elements separately, and you have no control over this representation.
Good old format would allow this control:
print("'{}', {}".format(*result))
To put this in a text file instead, a handle must be obtained somewhere during initialization (the with syntax is deliberately not used here; look it up if needed):
f = open("textfile.txt","w")
Then, instead of print, just use f.write, with a linefeed, as many times as needed (if there are several results):
f.write("'{}', {}\n".format(*result))
Finally, close the file:
f.close()
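Put together, with hypothetical sample records standing in for the collected RDD contents, the whole thing looks like this sketch:

```python
# Hypothetical sample records standing in for the RDD contents
results = [("ADDRESS1", 120), ("ADDRESS2", 45)]

f = open("textfile.txt", "w")
for record in results:
    f.write("'{}', {}\n".format(*record))
f.close()
```

In the Spark job itself, the equivalent would be to map the formatting over the RDD before saving, e.g. result.map(lambda r: "'{}', {}".format(*r)).saveAsTextFile(...), so each partition file contains the formatted lines instead of tuple reprs.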

How to find multi-line comments wrapped in quotes?

I am parsing Python code, and I need to remove all possible comments/docstrings. I have successfully been able to remove "comments" of the form:
#comment
"""comment"""
'''comment'''
However, I have found some samples where people write comments of the form:
"'''comment'''"
"\"\"\"\n comment \"\"\""
I am struggling to successfully remove these comments (three single quotes surrounded by a double quote, and double quotes with line breaks). The expression I tried was:
p = re.compile("([\'\"])\1\1(.*?)\1{3}", re.DOTALL)
code = p.sub('', code)
But this did not work for either of the second two cases. Does anyone have any suggestions?
You could try using strip().
It works by removing, from both ends of the string, any of the characters you place between the brackets (if nothing is in the brackets it removes whitespace). You want to remove the three single quotes surrounded by a double quote, and double quotes with line breaks. An example:
txt = ",,,,,rrttgg.....banana....rrr"
x = txt.strip(",.grt")
print(x)
The output you would get is banana, as strip has removed every leading and trailing character found in ",.grt" (the argument to txt.strip(",.grt")).
For more info check out this page; I recommend the info at the bottom for further help:
https://www.w3schools.com/python/python_strings.asp
posting as an answer because my comment was hard to read
This is what I came up with, it's ugly and hacky but it does work.
import re

txt = "if x = 4: continue \"'''hi'''\" print(x) "
print(txt)
# find everything wrapped in double quotes
double_quotes = re.findall(r"\"(.+?)\"", txt)
for string in double_quotes:
    triple_single = re.findall(r"\'''(.+?)\'''", string)[0]
    full_comment = '"' + "'''" + triple_single + "'''" + '"'
    txt = txt.replace(full_comment, '')
print(txt)
Prints:
if x = 4: continue "'''hi'''" print(x)
if x = 4: continue print(x)
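A slightly more defensive variant of the same idea (my sketch, not the original author's) only removes a double-quoted string when a triple-quoted comment is actually inside it, so ordinary double-quoted strings survive:

```python
import re

txt = 'if x == 4: continue "\'\'\'hi\'\'\'" print(x) "keep me"'
double_quotes = re.findall(r'"(.+?)"', txt)
for string in double_quotes:
    inner = re.findall(r"'''(.+?)'''", string)
    if inner:  # only drop the string when it wraps a triple-quoted comment
        txt = txt.replace('"' + string + '"', '')
print(txt)
```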
Unassigned string literals can be considered nodes in the source code's Abstract Syntax Tree (AST) representation. The problem is then reduced to identifying these nodes and rewriting the AST without them, using the tools in the ast module.
Comments (# ...) are not parsed into the AST, so there is no need to handle them.
Unassigned string literals are nodes of type ast.Constant, and form part of the body attribute of nodes that have bodies, such as module definitions, function definitions and class definitions. We can identify these nodes, remove them from their parents' bodies and then rewrite the AST.
import ast
import io
from unparse import Unparser

with open('comments.py') as f:
    src = f.read()

root = ast.parse(src)
# print(ast.dump(root)) to see the ast structure.

def filter_constants(node):
    if isinstance(node, ast.Expr):
        if isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, str):
                return None
    return node

class CommentRemover(ast.NodeTransformer):
    def visit(self, node):
        if hasattr(node, 'body'):
            node.body = [n for n in node.body if filter_constants(n)]
        return super().visit(node)

remover = CommentRemover()
new = remover.visit(root)
ast.fix_missing_locations(new)

buf = io.StringIO()
Unparser(new, buf)
buf.seek(0)
print(buf.read())
Calling the script on this code (comments.py):
"""Module docstring."""
# A real comment
"""triple-double-quote comment"""
'''triple-single-quote comment'''
"'''weird comment'''"
"\"\"\"\n comment \"\"\""

NOT_A_COMMENT = 'spam'
42

def foo():
    """Function docstring."""
    # Function comment
    bar = 'baz'
    return bar

class Quux:
    """class docstring."""
    # class comment
    def m(self):
        """method comment"""
        return
Gives this output:
NOT_A_COMMENT = 'spam'
42

def foo():
    bar = 'baz'
    return bar

class Quux():

    def m(self):
        return
Notes:
the unparse script can be found in your Python distribution's Tools/parser folder (in v3.8 - in previous versions it has been in Tools or in the Demo folder). It may also be downloaded from github - be sure that you download the version for your version of Python
As of Python 3.8, the ast.Constant class is used for all constant nodes; for earlier versions you may need to use ast.Num, ast.Str, ast.Bytes, ast.NameConstant and ast.Ellipsis as appropriate. So filter_constants might look like this:
def filter_constants(node):
    if isinstance(node, ast.Expr):
        if isinstance(node.value, ast.Str):
            return None
    return node
As of Python 3.9, the ast module provides an unparse function that may be used instead of the unparse script:
src = ast.unparse(new)
print(src)
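On Python 3.9+ the whole pipeline therefore fits in one short self-contained script. One caveat the original code does not cover (my addition): if a body consists only of string constants, removing them leaves an empty body that ast.unparse rejects, so an ast.Pass() is substituted:

```python
import ast

def remove_string_comments(src):
    def is_str_expr(node):
        # An unassigned string literal: an Expr whose value is a str Constant
        return (isinstance(node, ast.Expr)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str))

    class CommentRemover(ast.NodeTransformer):
        def visit(self, node):
            if hasattr(node, 'body') and isinstance(node.body, list):
                node.body = [n for n in node.body if not is_str_expr(n)]
                if not node.body:  # a body must not be empty
                    node.body = [ast.Pass()]
            return self.generic_visit(node)

    tree = CommentRemover().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

src = '"""Module docstring."""\nx = 1\ndef f():\n    """doc"""\n    return x\n'
print(remove_string_comments(src))
```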

unicode table information about a character in python

Is there a way in Python to get the technical information for a given character, as displayed in the Unicode table? (cf. https://unicode-table.com/en/)
Example:
for the letter "Ȅ"
Name > Latin Capital Letter E with Double Grave
Unicode number > U+0204
HTML-code > &#516;
Bloc > Latin Extended-B
Lowercase > ȅ
What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").
Roughly:
input = a Unicode number
output = corresponding information
The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.
Thank you.
The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
Fortunately UnicodeData.txt, the data file this comes from, is not hard to parse. Each line consists of exactly 15 ;-separated elements, which makes it ideal for parsing. Using the description of the elements on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each of the elements is explained on that same page.
Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
Code (tested with Python 2.7 and 3.6):
# -*- coding: utf-8 -*-

class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None

    def __getitem__(self, item):
        return getattr(self, item)

    def __repr__(self):
        return '{' + self.name + '}'
class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'

    def __repr__(self):
        return '{' + self.name + '}'
class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt', 'r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0], 16)
                    block.last = int(rawdata[1], 16)
                    self.blocklist.append(block)
        # make 100% sure it's sorted, for quicker look-up later
        # (it is usually sorted in the file, but better make sure)
        self.blocklist.sort(key=lambda x: x.first)

    def lookup(self, code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None
class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org.
    These files must appear in the same directory as this program.

    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.

    As UnicodeList loads its data from an external file, it does not depend
    on the local build of Python (in which the Unicode data gets frozen
    at the then-current version).

    Initialize with

        uclist = UnicodeList()
    """
    def __init__(self):
        # we need this first
        blocklist = BlockList()
        bpos = 0
        self.codelist = []
        with open('UnicodeData.txt', 'r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0], 16)
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    #   ONE QUARTER ... 1/4
                    # let's make it Python 2.7 compatible :)
                    if '/' in rawdata[8]:
                        rawdata[8] = rawdata[8].replace('/', './')
                        parsed.asNumeric = eval(rawdata[8])
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12], 16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13], 16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14], 16) if rawdata[14] else None
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)
    def find_code(self, codepoint):
        """Find the Unicode information for a codepoint (as int).

        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None

    def find_char(self, s):
        """Find the Unicode information for a codepoint (as character).

        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        if len(s) > 1:
            result = [self.find_code(ord(x)) for x in s]
            return result
        else:
            return self.find_code(ord(s))
When loaded, you can now look up a character code with
>>> ul = UnicodeList() # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:
>>> print('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print(ul.find_code(0x204).block)
Latin Extended-B
and (as long as you don't get a None) even chain them:
>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}
It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:
>>> import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print(ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}
(As tested with Python 3.5.3.)
There are currently two lookup functions defined:
find_code(int) looks up character information by codepoint as an integer.
find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
After import unicodelist (assuming you saved this as unicodelist.py), you can use
>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'
to look up the hex code for any character, and a list comprehension such as
>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']
for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:
l = [hex(ord(x)) for x in 'Hello']
The purpose of this module is to give easy access to other Unicode properties. A longer example:
s = 'Héllo...'
dest = ''
for i in s:
    dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print(dest)
HÉLLO...
and showing a list of properties for a character per your example:
letter = u'Ȅ'
print('Name > ' + ul.find_char(letter).name)
print('Unicode number > U+%04X' % ul.find_char(letter).code)
print('Bloc > ' + ul.find_char(letter).block)
print('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
(I left out HTML; these names are not defined in the Unicode standard.)
The unicodedata documentation shows how to do most of this.
The Unicode block name is apparently not available but another Stack Overflow question has a solution of sorts and another has some additional approaches using regex.
The uppercase/lowercase mapping and character number information is not particularly Unicode-specific; just use the regular Python string functions.
So in summary
>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'&#203;'
>>> 'Ë'.lower()
'ë'
The U+%04X formatting is sort-of correct: it pads to four digits and simply prints the whole hex number for code points with a value higher than 65,535. Note that some other formats require %08X padding in this scenario (notably the \U00010000 escape format in Python).
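A quick illustration of that padding difference, using an arbitrary astral-plane code point:

```python
cp = 0x1F600  # an arbitrary code point above 0xFFFF

print('U+%04X' % cp)    # -> U+1F600, the U+ convention just grows past four digits
print('\\U%08X' % cp)   # -> \U0001F600, Python's \U escape always needs eight digits
```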
You can do this in a few ways:
1- create an API yourself (I can't find anything existing that does this)
2- create a table in a database or an Excel file
3- load and parse a website that has the data
I think the 3rd way is very easy. Take a look at This Page; you can find some information about Unicode characters there.
Get your Unicode number and then find it in the web page using parsing tools like lxml, Scrapy, Selenium, etc.

parse statement string for arguments using regex in Python

I have user input statements which I would like to parse for arguments. If possible using regex.
I have read a lot about functools.partial on Stack Overflow, but could not find anything about argument parsing. Among the regex questions on Stack Overflow I also could not find how to check for a match while excluding the used tokens. The Python tokenizer seems too heavy for my purpose.
import re

def getarguments(statement):
    prog = re.compile("([(].*[)])")
    result = prog.search(statement)
    m = result.group()
    # m = '(interval=1, percpu=True)'
    # or m = "('/')"
    # strip the parentheses, ugly but it works
    return statement[result.start()+1:result.end()-1]
stm = 'psutil.cpu_percent(interval=1, percpu=True)'
arg_list = getarguments(stm)
print(arg_list) # returns : interval=1, percpu=True
# But combining single and double quotes like
stm = "psutil.disk_usage('/').percent"
arg_list = getarguments(stm) # in debug value is "'/'"
print(arg_list) # when printed value is : '/'
callfunction = psutil.disk_usage
args = []
args.append(arg_list)
# args.append('/')
funct1 = functools.partial(callfunction, *args)
perc = funct1().percent
print(perc)
This results an error :
builtins.FileNotFoundError: [Errno 2] No such file or directory: "'/'"
But
callfunction = psutil.disk_usage
args = []
#args.append(arg_list)
args.append('/')
funct1 = functools.partial(callfunction, *args)
perc = funct1().percent
print(perc)
does return 20.3 for me, which is correct.
So there is a difference somewhere.
The weird thing is that if I view the content in my IDE (WingIDE), the result is "'/'", but when I view the details, the result is '/'.
I use Python 3.4.0. What is happening here, and how do I solve it?
Your help is really appreciated.
getarguments("psutil.disk_usage('/').percent") returns the three-character string "'/'", quotes included. You can check this by printing len(arg_list), for example.
Your IDE adds the double quotes " because, by default, it displays strings enclosed in single quotes '; this string actually contains ', so the IDE uses double quotes to enclose it.
Note that '/' is not equal to "'/'". The former is a string of 1 character, the latter a string of 3 characters. So in order to get things right you need to strip the quotes (both double and single ones) in getarguments. You can do it with the following snippet:
if (s.startswith("'") and s.endswith("'")) or \
   (s.startswith('"') and s.endswith('"')):
    s = s[1:-1]
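An alternative (my suggestion, not part of the original answer) is to let Python itself evaluate the quoted literal with ast.literal_eval, which also copes with escape sequences and either quote style:

```python
import ast

arg = "'/'"                    # what getarguments returned: quotes included
value = ast.literal_eval(arg)  # the one-character string '/'
print(value, len(value))
```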

Odd character appending to the front of python list

I'm having an issue running Python on Linux. I'm trying to learn Python and wanted to parse a small XML file and put the tags and data into a list. But every time I run the code, a 'u' is prepended to each element in the list.
[u'world']
defaultdict(<type 'list'>, {u'world': [u'data']})
My code is as follows:
import xml.sax
from collections import defaultdict

class TransformXML(xml.sax.ContentHandler):
    def __init__(self):
        self.start_tag_name = -1
        self.tag_data = -1
        self.myDict = defaultdict(list)
        self.tags = []

    def startElement(self, name, attrs):
        self.start_tag_name = name
        print name
        print self.start_tag_name

    def characters(self, content):
        if content.strip(' \r\n\t') != "":
            self.tag_data = content.strip(' \r\n\t')
            print self.start_tag_name
            self.tags.append(self.start_tag_name)
            self.myDict[self.start_tag_name].append(content.strip(' \r\n\t'))

    def endElement(self, name):
        pass

    def __del__(self):
        if self.myDict:
            del self.myDict
            print "deleteing myDict"
Does anyone know what the issue might be?
That 'weird' prefix basically means that the string is a unicode string.
E.g. if I have a string 'Test':
>>> unicode('Test')
u'Test'
>>> s = unicode('Test')
>>> type(s)
<type 'unicode'>
Documentation here
To sum up, according to the python docs,
...a Unicode string is a sequence of code points, which are numbers from
0 to 0x10ffff. This sequence needs to be represented as a set of bytes
(meaning, values from 0-255) in memory. The rules for translating a
Unicode string into a sequence of bytes are called an encoding.
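For what it's worth, in Python 3 this distinction disappears: str is always unicode, so the u prefix is accepted but redundant and no longer shows up in reprs:

```python
s = u'world'     # the u prefix is legal but a no-op in Python 3
print(type(s))   # -> <class 'str'>
print([s])       # -> ['world'], no u prefix
```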
