Lexical analysis

Lexical analysis - python

I am learning lexers in Python. I am using Ply library for lexical analysis on some strings. I have implemented the following lexical analyzer for some of C++ language syntax.
However, I am facing a strange behavior. When I define the COMMENT states function definitions at the end of other function definitions, the code works fine. If I define COMMENT state functions before other definitions, I get errors as soon as // sectoin starts in the input string starts.
WHAT IS THE REASON BEHIND THAT?
import ply.lex as lex
tokens = (
'DLANGLE', # <<
'DRANGLE', # >>
'EQUAL', # =
'STRING', # "144"
'WORD', # 'Welcome' in "Welcome."
'SEMICOLON', # ;
)
t_ignore = ' \t\v\r' # shortcut for whitespace
states = (
('cppcomment', 'exclusive'), # <!--
)
def t_cppcomment(t): # definition here causes errors
r'//'
print 'MyCOm:',t.value
t.lexer.begin('cppcomment');
def t_cppcomment_end(t):
r'\n'
t.lexer.begin('INITIAL');
def t_cppcomment_error(t):
print "Error FOUND"
t.lexer.skip(1)
def t_DLANGLE(t):
r'<<'
print 'MyLAN:',t.value
return t
def t_DRANGLE(t):
r'>>'
return t
def t_SEMICOLON(t):
r';'
print 'MySemi:',t.value
return t;
def t_EQUAL(t):
r'='
return t
def t_STRING(t):
r'"[^"]*"'
t.value = t.value[1:-1] # drop "surrounding quotes"
print 'MyString:',t.value
return t
def t_WORD(t):
r'[^ <>\n]+'
print 'MyWord:',t.value
return t
webpage = "cout<<\"Hello World\"; // this comment"
htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
tok = htmllexer.token()
if not tok: break
print tok
Regards

Just figured it out. As I have defined comment state as exclusive, it won't use the inclusive state modules (if comment modules are defined at the top, otherwise it uses it for some reason). So you will have redefine all the modules for comment state again. Therefore ply provides error() modules for skipping characters for which specific modules are not defined.

its because you have no rules that accept this or comment
and really you dont care about whats in the comment you can easilly do something like
t_cppcomment_ANYTHING = '[^\r\n]'
just below your t_ignore rule

Related

Sly: How can I implement if and while statements and commands

I'm new in sly and I'm trying to write simple language with sly python. I want to implement if-else, while loop and print command. I tried to search a lot, but there aren't many tutorials about sly. I'm totally confused how can I make it.
I want something like that:
If-else statement:
a = 0
if (a == 0) then {
print "variable a is equal to zero"
...
} else {
print "variable a is not equal to zero"
...
}
While loop:
a = 0
while (a == 0) then {
a = a + 1
}
Print command:
print "hello world"
I found that code on https://sly.readthedocs.io/en/latest/sly.html with statements, but this is just lexer.
# calclex.py
from sly import Lexer
class CalcLexer(Lexer):
# Set of token names. This is always required
tokens = { NUMBER, ID, WHILE, IF, ELSE, PRINT,
PLUS, MINUS, TIMES, DIVIDE, ASSIGN,
EQ, LT, LE, GT, GE, NE }
literals = { '(', ')', '{', '}', ';' }
# String containing ignored characters
ignore = ' \t'
# Regular expression rules for tokens
PLUS = r'\+'
MINUS = r'-'
TIMES = r'\*'
DIVIDE = r'/'
EQ = r'=='
ASSIGN = r'='
LE = r'<='
LT = r'<'
GE = r'>='
GT = r'>'
NE = r'!='
#_(r'\d+')
def NUMBER(self, t):
t.value = int(t.value)
return t
# Identifiers and keywords
ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
ID['if'] = IF
ID['else'] = ELSE
ID['while'] = WHILE
ID['print'] = PRINT
ignore_comment = r'\#.*'
# Line number tracking
#_(r'\n+')
def ignore_newline(self, t):
self.lineno += t.value.count('\n')
def error(self, t):
print('Line %d: Bad character %r' % (self.lineno, t.value[0]))
self.index += 1
if __name__ == '__main__':
data = '''
# Counting
x = 0;
while (x < 10) {
print x:
x = x + 1;
}
'''
lexer = CalcLexer()
for tok in lexer.tokenize(data):
print(tok)
Hope you can help me.

You can learn a lot of how to construct these basic things by following the examples for Ply, Sly's predecessor. I am going to assume that you're attempting to build an Abstract Syntax Tree (AST) from which you will generate (or interpret) the actual runtime code. I'll start with the print statement (this assumes you also have a STRING token, of course):
#_('PRINT STRING')
def print_statement(self, p):
return ('PRINT', p.STRING)
This creates an AST tuple that you can interpret for printing. Next is a fairly simple one, the While statement:
#_('WHILE ( expression ) { statement_set }')
def while_statement(self, p):
return ('WHILE', p.expression, p.statement_set)
Much like before, this creates a simple tuple that contains the command you wish to run ,'WHILE', the expression that will be tested, and the statement_set that will run. Note that, due to how Sly works, the expression and statement_set are their own non-terminals, and might include complex code including nested statements.
The If-else statement is slightly more complex because we have two situations, either an If statement without an else, and one with. For the sake of simplicity, we'll assume we always generate the same tuple of ('IF', expression, primary statementes, else statements).
#_('IF ( expression ) { statement_set }')
def if_statement(self, p):
return ('IF', p.expression, p.statement_set, [])
#_('IF ( expression ) { statement_set } ELSE { statement_set }')
def if_statement(self, p):
return ('IF', p.expression, p.statement_set0, p.statement_set1)
Of course, you will need to define the expression and statement_set non-terminals, but I'll assume you've already done some work there.

Is it possible to get the source code of a (possibly decorated) Python function body, including inline comments? [duplicate]

This question already has answers here:
How can I get the source code of a Python function?
(13 answers)
Closed 1 year ago.
I am trying to figure out how to only get the source code of the body of the function.
Let's say I have:
def simple_function(b = 5):
a = 5
print("here")
return a + b
I would want to get (up to indentation):
"""
a = 5
print("here")
return a + b
"""
While it's easy in the case above, I want it to be agnostic of decorators/function headers, etc. However, still include inline comments. So for example:
#decorator1
#decorator2
def simple_function(b: int = 5):
""" Very sophisticated docs
"""
a = 5
# Comment on top
print("here") # And in line
return a + b
Would result in:
"""
a = 5
# Comment on top
print("here") # And in line
return a + b
"""
I was not able to find any utility and have been trying to play with inspect.getsourcelines for few hours now, but with no luck.
Any help appreciated!
Why is it different from How can I get the source code of a Python function?
This question asks for a whole function source code, which includes both decorators, docs, def, and body itself. I'm interested in only the body of the function.

I wrote a simple regex that does the trick. I tried this script with classes and without. It seemed to work fine either way. It just opens whatever file you designate in the Main call, at the bottom, rewrites the entire document with all function/method bodies doc-stringed and then save it as whatever you designated as the second argument in the Main call.
It's not beautiful, and it could probably have more efficient regex statements. It works though. The regex finds everything from a decorator (if one) to the end of a function/method, grouping tabs and the function/method body. It then uses those groups in finditer to construct a docstring and place it before the entire chunk it found.
import re
FUNC_BODY = re.compile(r'^((([ \t]+)?#.+\n)+)?(?P<tabs>[\t ]+)?def([^\n]+)\n(?P<body>(^([\t ]+)?([^\n]+)\n)+)', re.M)
BLANK_LINES = re.compile(r'^[ \t]+$', re.M)
class Main(object):
def __init__(self, file_in:str, file_out:str) -> None:
#prime in/out strings
in_txt = ''
out_txt = ''
#open resuested file
with open(file_in, 'r') as f:
in_txt = f.read()
#remove all lines that just have space characters on them
#this stops FUNC_BODY from finding the entire file in one shot
in_txt = BLANK_LINES.sub('', in_txt)
last = 0 #to keep track of where we are in the file
#process all matches
for m in FUNC_BODY.finditer(in_txt):
s, e = m.span()
#make sure we catch anything that was between our last match and this one
out_txt = f"{out_txt}{in_txt[last:s]}"
last = e
tabs = m.group('tabs') if not m.group('tabs') is None else ''
#construct the docstring and inject it before the found function/method
out_txt = f"{out_txt}{tabs}'''\n{m.group('body')}{tabs}'''\n{m.group()}"
#save as requested file name
with open(file_out, 'w') as f:
f.write(out_txt)
if __name__ == '__main__':
Main('test.py', 'test_docd.py')
EDIT:
Apparently, I "missed the entire point" so I wrote it again a different way. Now you can get the body while the code is running and decorators don't matter, at all. I left my other answer here because it is also a solution, just not a "real time" one.
import re, inspect
FUNC_BODY = re.compile('^(?P<tabs>[\t ]+)?def (?P<name>[a-zA-Z0-9_]+)([^\n]+)\n(?P<body>(^([\t ]+)?([^\n]+)\n)+)', re.M)
class Source(object):
#staticmethod
def investigate(focus:object, strfocus:str) -> str:
with open(inspect.getsourcefile(focus), 'r') as f:
for m in FUNC_BODY.finditer(f.read()):
if m.group('name') == strfocus:
tabs = m.group('tabs') if not m.group('tabs') is None else ''
return f"{tabs}'''\n{m.group('body')}{tabs}'''"
def decorator(func):
def inner():
print("I'm decorated")
func()
return inner
#decorator
def test():
a = 5
b = 6
return a+b
print(Source.investigate(test, 'test'))

Collect python source code comments along execution path

E.g. I've got the following python function:
def func(x):
"""Function docstring."""
result = x + 1
if result > 0:
# comment 2
return result
else:
# comment 3
return -1 * result
And I want to have some function that would print all function docstrings and comments that are met along the execution path, e.g.
> trace(func(2))
Function docstring.
Comment 2
3
In fact what I try to achieve is to provide some comments how the result has been calculated.
What could be used? AST as far as I understand does not keep comment in the tree.

I thought this was an interesting challenge, so I decided to give it a try. Here is what I came up with:
import ast
import inspect
import re
import sys
import __future__
if sys.version_info >= (3,5):
ast_Call = ast.Call
else:
def ast_Call(func, args, keywords):
"""Compatibility wrapper for ast.Call on Python 3.4 and below.
Used to have two additional fields (starargs, kwargs)."""
return ast.Call(func, args, keywords, None, None)
COMMENT_RE = re.compile(r'^(\s*)#\s?(.*)$')
def convert_comment_to_print(line):
"""If `line` contains a comment, it is changed into a print
statement, otherwise nothing happens. Only acts on full-line comments,
not on trailing comments. Returns the (possibly modified) line."""
match = COMMENT_RE.match(line)
if match:
return '{}print({!r})\n'.format(*match.groups())
else:
return line
def convert_docstrings_to_prints(syntax_tree):
"""Walks an AST and changes every docstring (i.e. every expression
statement consisting only of a string) to a print statement.
The AST is modified in-place."""
ast_print = ast.Name('print', ast.Load())
nodes = list(ast.walk(syntax_tree))
for node in nodes:
for bodylike_field in ('body', 'orelse', 'finalbody'):
if hasattr(node, bodylike_field):
for statement in getattr(node, bodylike_field):
if (isinstance(statement, ast.Expr) and
isinstance(statement.value, ast.Str)):
arg = statement.value
statement.value = ast_Call(ast_print, [arg], [])
def get_future_flags(module_or_func):
"""Get the compile flags corresponding to the features imported from
__future__ by the specified module, or by the module containing the
specific function. Returns a single integer containing the bitwise OR
of all the flags that were found."""
result = 0
for feature_name in __future__.all_feature_names:
feature = getattr(__future__, feature_name)
if (hasattr(module_or_func, feature_name) and
getattr(module_or_func, feature_name) is feature and
hasattr(feature, 'compiler_flag')):
result |= feature.compiler_flag
return result
def eval_function(syntax_tree, func_globals, filename, lineno, compile_flags,
*args, **kwargs):
"""Helper function for `trace`. Execute the function defined by
the given syntax tree, and return its return value."""
func = syntax_tree.body[0]
func.decorator_list.insert(0, ast.Name('_trace_exec_decorator', ast.Load()))
ast.increment_lineno(syntax_tree, lineno-1)
ast.fix_missing_locations(syntax_tree)
code = compile(syntax_tree, filename, 'exec', compile_flags, True)
result = [None]
def _trace_exec_decorator(compiled_func):
result[0] = compiled_func(*args, **kwargs)
func_locals = {'_trace_exec_decorator': _trace_exec_decorator}
exec(code, func_globals, func_locals)
return result[0]
def trace(func, *args, **kwargs):
"""Run the given function with the given arguments and keyword arguments,
and whenever a docstring or (whole-line) comment is encountered,
print it to stdout."""
filename = inspect.getsourcefile(func)
lines, lineno = inspect.getsourcelines(func)
lines = map(convert_comment_to_print, lines)
modified_source = ''.join(lines)
compile_flags = get_future_flags(func)
syntax_tree = compile(modified_source, filename, 'exec',
ast.PyCF_ONLY_AST | compile_flags, True)
convert_docstrings_to_prints(syntax_tree)
return eval_function(syntax_tree, func.__globals__,
filename, lineno, compile_flags, *args, **kwargs)
It is a bit long because I tried to cover most important cases, and the code might not be the most readable, but I hope it is nice enough to follow.
How it works:
First, read the function's source code using inspect.getsourcelines. (Warning: inspect does not work for functions that were defined interactively. If you need that, maybe you can use dill instead, see this answer.)
Search for lines that look like comments, and replace them with print statements. (Right now only whole-line comments are replaced, but it shouldn't be difficult to extend that to trailing comments if desired.)
Parse the source code into an AST.
Walk the AST and replace all docstrings with print statements.
Compile the AST.
Execute the AST. This and the previous step contain some trickery to try to reconstruct the context that the function was originally defined in (e.g. globals, __future__ imports, line numbers for exception tracebacks). Also, since just executing the source would only re-define the function and not call it, we fix that with a simple decorator.
It works in Python 2 and 3 (at least with the tests below, which I ran in 2.7 and 3.6).
To use it, simply do:
result = trace(func, 2) # result = func(2)
Here is a slightly more elaborate test that I used while writing the code:
#!/usr/bin/env python
from trace_comments import trace
from dateutil.easter import easter, EASTER_ORTHODOX
def func(x):
"""Function docstring."""
result = x + 1
if result > 0:
# comment 2
return result
else:
# comment 3
return -1 * result
if __name__ == '__main__':
result1 = trace(func, 2)
print("result1 = {}".format(result1))
result2 = trace(func, -10)
print("result2 = {}".format(result2))
# Test that trace() does not permanently replace the function
result3 = func(42)
print("result3 = {}".format(result3))
print("-----")
print(trace(easter, 2018))
print("-----")
print(trace(easter, 2018, EASTER_ORTHODOX))

Controlling Python PLY lexer states from parser

I am working on a simple SQL select like query parser and I need to be able to capture subqueries that can occur at certain places literally. I found lexer states are the best solution and was able to do a POC using curly braces to mark the start and end. However, the subqueries will be delimited by parenthesis, not curlys, and the parenthesis can occur at other places as well, so I can't being the state with every open-paren. This information is readily available with the parser, so I was hoping to call begin and end at appropriate locations in the parser rules. This however didn't work because lexer seem to tokenize the stream all at once, and so the tokens get generated in the INITIAL state. Is there a workaround for this problem? Here is an outline of what I tried to do:
def p_value_subquery(p):
"""
value : start_sub end_sub
"""
p[0] = "( " + p[1] + " )"
def p_start_sub(p):
"""
start_sub : OPAR
"""
start_subquery(p.lexer)
p[0] = p[1]
def p_end_sub(p):
"""
end_sub : CPAR
"""
subquery = end_subquery(p.lexer)
p[0] = subquery
The start_subquery() and end_subquery() are defined like this:
def start_subquery(lexer):
lexer.code_start = lexer.lexpos # Record the starting position
lexer.level = 1
lexer.begin('subquery')
def end_subquery(lexer):
value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
lexer.lineno += value.count('\n')
lexer.begin('INITIAL')
return value
The lexer tokens are simply there to detect the close-paren:
#lex.TOKEN(r"\(")
def t_subquery_SUBQST(t):
lexer.level += 1
#lex.TOKEN(r"\)")
def t_subquery_SUBQEN(t):
lexer.level -= 1
#lex.TOKEN(r".")
def t_subquery_anychar(t):
pass
I would appreciate any help.

This answer may only be partially helpful, but I would also suggest looking at section "6.11 Embedded Actions" of the PLY documentation (http://www.dabeaz.com/ply/ply.html). In a nutshell, it is possible to write grammar rules in which actions occur mid-rule. It would look something similar to this:
def p_somerule(p):
'''somerule : A B possible_sub_query LBRACE sub_query RBRACE'''
def p_possible_sub_query(p):
'''possible_sub_query :'''
...
# Check if the last token read was LBRACE. If so, flip lexer state
# Sadly, it doesn't seem that the token is easily accessible. Would have to hack it
if last_token == 'LBRACE':
p.lexer.begin('SUBQUERY')
Regarding the behavior of the lexer, there is only one token of lookahead being used. So, in any particular grammar rule, at most only one extra token has been read already. If you're going to flip lexer states, you need to make sure that it happens before the token gets consumed by the parser, but before the parser asks to read the next incoming token.
Also, if possible, I would try to stay out of the yacc() error handling stack as far as a solution. There is way too much black-magic going on in error handling--the more you can avoid it, the better.
I'm a bit pressed for time at the moment, but this seems to be something that could be investigated for the next version of PLY. Will put it on my to-do list.

Based on PLY author's response, I came up with this better solution. I am yet to figure out how to return the subquery as a token, but the rest looks much better and need not be considered a hack anymore.
def start_subquery(lexer):
lexer.code_start = lexer.lexpos # Record the starting position
lexer.level = 1
lexer.begin("subquery")
def end_subquery(lexer):
lexer.begin("INITIAL")
def get_subquery(lexer):
value = lexer.lexdata[lexer.code_start:lexer.code_end-1]
lexer.lineno += value.count('\n')
return value
#lex.TOKEN(r"\(")
def t_subquery_OPAR(t):
lexer.level += 1
#lex.TOKEN(r"\)")
def t_subquery_CPAR(t):
lexer.level -= 1
if lexer.level == 0:
lexer.code_end = lexer.lexpos # Record the ending position
return t
#lex.TOKEN(r".")
def t_subquery_anychar(t):
pass
def p_value_subquery(p):
"""
value : check_subquery_start OPAR check_subquery_end CPAR
"""
p[0] = "( " + get_subquery(p.lexer) + " )"
def p_check_subquery_start(p):
"""
check_subquery_start :
"""
# Here last_token would be yacc's lookahead.
if last_token.type == "OPAR":
start_subquery(p.lexer)
def p_check_subquery_end(p):
"""
check_subquery_end :
"""
# Here last_token would be yacc's lookahead.
if last_token.type == "CPAR":
end_subquery(p.lexer)
last_token = None
def p_error(p):
global subquery_retry_pos
if p is None:
print >> sys.stderr, "ERROR: unexpected end of query"
else:
print >> sys.stderr, "ERROR: Skipping unrecognized token", p.type, "("+ \
p.value+") at line:", p.lineno, "and column:", find_column(p.lexer.lexdata, p)
# Just discard the token and tell the parser it's okay.
yacc.errok()
def get_token():
global last_token
last_token = lexer.token()
return last_token
def parse_query(input, debug=0):
lexer.input(input)
return parser.parse(input, tokenfunc=get_token, debug=0)

Since nobody has an answer, it bugged me to find a workaround, and here is an ugly hack using the error recovery and restart().
def start_subquery(lexer, pos):
lexer.code_start = lexer.lexpos # Record the starting position
lexer.level = 1
lexer.begin("subquery")
lexer.lexpos = pos
def end_subquery(lexer):
value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
lexer.lineno += value.count('\n')
lexer.begin('INITIAL')
return value
#lex.TOKEN(r"\(")
def t_subquery_SUBQST(t):
lexer.level += 1
#lex.TOKEN(r"\)")
def t_subquery_SUBQEN(t):
lexer.level -= 1
if lexer.level == 0:
t.type = "SUBQUERY"
t.value = end_subquery(lexer)
return t
#lex.TOKEN(r".")
def t_subquery_anychar(t):
pass
# NOTE: Due to the nature of the ugly workaround, the CPAR gets dropped, which
# makes it look like there is an imbalance.
def p_value_subquery(p):
"""
value : OPAR SUBQUERY
"""
p[0] = "( " + p[2] + " )"
subquery_retry_pos = None
def p_error(p):
global subquery_retry_pos
if p is None:
print >> sys.stderr, "ERROR: unexpected end of query"
elif p.type == 'SELECT' and parser.symstack[-1].type == 'OPAR':
lexer.input(lexer.lexdata)
subquery_retry_pos = parser.symstack[-1].lexpos
yacc.restart()
else:
print >> sys.stderr, "ERROR: Skipping unrecognized token", p.type, "("+ \
p.value+") at line:", p.lineno, "and column:", find_column(p.lexer.lexdata, p)
# Just discard the token and tell the parser it's okay.
yacc.errok()
def get_token():
global subquery_retry_pos
token = lexer.token()
if token and token.lexpos == subquery_retry_pos:
start_subquery(lexer, lexer.lexpos)
subquery_retry_pos = None
return token
def parse_query(input, debug=0):
lexer.input(inp)
result = parser.parse(inp, tokenfunc=get_token, debug=0)

Efficiently match multiple regexes in Python

Lexical analyzers are quite easy to write when you have regexes. Today I wanted to write a simple general analyzer in Python, and came up with:
import re
import sys
class Token(object):
""" A simple Token structure.
Contains the token type, value and position.
"""
def __init__(self, type, val, pos):
self.type = type
self.val = val
self.pos = pos
def __str__(self):
return '%s(%s) at %s' % (self.type, self.val, self.pos)
class LexerError(Exception):
""" Lexer error exception.
pos:
Position in the input line where the error occurred.
"""
def __init__(self, pos):
self.pos = pos
class Lexer(object):
""" A simple regex-based lexer/tokenizer.
See below for an example of usage.
"""
def __init__(self, rules, skip_whitespace=True):
""" Create a lexer.
rules:
A list of rules. Each rule is a `regex, type`
pair, where `regex` is the regular expression used
to recognize the token and `type` is the type
of the token to return when it's recognized.
skip_whitespace:
If True, whitespace (\s+) will be skipped and not
reported by the lexer. Otherwise, you have to
specify your rules for whitespace, or it will be
flagged as an error.
"""
self.rules = []
for regex, type in rules:
self.rules.append((re.compile(regex), type))
self.skip_whitespace = skip_whitespace
self.re_ws_skip = re.compile('\S')
def input(self, buf):
""" Initialize the lexer with a buffer as input.
"""
self.buf = buf
self.pos = 0
def token(self):
""" Return the next token (a Token object) found in the
input buffer. None is returned if the end of the
buffer was reached.
In case of a lexing error (the current chunk of the
buffer matches no rule), a LexerError is raised with
the position of the error.
"""
if self.pos >= len(self.buf):
return None
else:
if self.skip_whitespace:
m = self.re_ws_skip.search(self.buf[self.pos:])
if m:
self.pos += m.start()
else:
return None
for token_regex, token_type in self.rules:
m = token_regex.match(self.buf[self.pos:])
if m:
value = self.buf[self.pos + m.start():self.pos + m.end()]
tok = Token(token_type, value, self.pos)
self.pos += m.end()
return tok
# if we're here, no rule matched
raise LexerError(self.pos)
def tokens(self):
""" Returns an iterator to the tokens found in the buffer.
"""
while 1:
tok = self.token()
if tok is None: break
yield tok
if __name__ == '__main__':
rules = [
('\d+', 'NUMBER'),
('[a-zA-Z_]\w+', 'IDENTIFIER'),
('\+', 'PLUS'),
('\-', 'MINUS'),
('\*', 'MULTIPLY'),
('\/', 'DIVIDE'),
('\(', 'LP'),
('\)', 'RP'),
('=', 'EQUALS'),
]
lx = Lexer(rules, skip_whitespace=True)
lx.input('erw = _abc + 12*(R4-623902) ')
try:
for tok in lx.tokens():
print tok
except LexerError, err:
print 'LexerError at position', err.pos
It works just fine, but I'm a bit worried that it's too inefficient. Are there any regex tricks that will allow me to write it in a more efficient / elegant way ?
Specifically, is there a way to avoid looping over all the regex rules linearly to find one that fits?

I suggest using the re.Scanner class, it's not documented in the standard library, but it's well worth using. Here's an example:
import re
scanner = re.Scanner([
(r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
(r"-?[0-9]+", lambda scanner, token: int(token)),
(r" +", lambda scanner, token: None),
])
>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]

You can merge all your regexes into one using the "|" operator and let the regex library do the work of discerning between tokens. Some care should be taken to ensure the preference of tokens (for example to avoid matching a keyword as an identifier).

I found this in python document. It's just simple and elegant.
import collections
import re
Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])
def tokenize(s):
keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
token_specification = [
('NUMBER', r'\d+(\.\d*)?'), # Integer or decimal number
('ASSIGN', r':='), # Assignment operator
('END', r';'), # Statement terminator
('ID', r'[A-Za-z]+'), # Identifiers
('OP', r'[+*\/\-]'), # Arithmetic operators
('NEWLINE', r'\n'), # Line endings
('SKIP', r'[ \t]'), # Skip over spaces and tabs
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
get_token = re.compile(tok_regex).match
line = 1
pos = line_start = 0
mo = get_token(s)
while mo is not None:
typ = mo.lastgroup
if typ == 'NEWLINE':
line_start = pos
line += 1
elif typ != 'SKIP':
val = mo.group(typ)
if typ == 'ID' and val in keywords:
typ = val
yield Token(typ, val, line, mo.start()-line_start)
pos = mo.end()
mo = get_token(s, pos)
if pos != len(s):
raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))
statements = '''
IF quantity THEN
total := total + price * quantity;
tax := price * 0.05;
ENDIF;
'''
for token in tokenize(statements):
print(token)
The trick here is the line:
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
Here (?P<ID>PATTERN) will mark the matched result with a name specified by ID.

re.match is anchored. You can give it a position argument:
pos = 0
end = len(text)
while pos < end:
match = regexp.match(text, pos)
# do something with your match
pos = match.end()
Have a look for pygments which ships a shitload of lexers for syntax highlighting purposes with different implementations, most based on regular expressions.

It's possible that combining the token regexes will work, but you'd have to benchmark it. Something like:
x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
a = x.match('9999').groupdict() # => {'VAR': None, 'NUMBER': '9999'}
if a:
token = [a for a in a.items() if a[1] != None][0]
The filter is where you'll have to do some benchmarking...
Update: I tested this, and it seems as though if you combine all the tokens as stated and write a function like:
def find_token(lst):
for tok in lst:
if tok[1] != None: return tok
raise Exception
You'll get roughly the same speed (maybe a teensy faster) for this. I believe the speedup must be in the number of calls to match, but the loop for token discrimination is still there, which of course kills it.

This isn't exactly a direct answer to your question, but you might want to look at ANTLR. According to this document the python code generation target should be up to date.
As to your regexes, there are really two ways to go about speeding it up if you're sticking to regexes. The first would be to order your regexes in the order of the probability of finding them in a default text. You could figure adding a simple profiler to the code that collected token counts for each token type and running the lexer on a body of work. The other solution would be to bucket sort your regexes (since your key space, being a character, is relatively small) and then use a array or dictionary to perform the needed regexes after performing a single discrimination on the first character.
However, I think that if you're going to go this route, you should really try something like ANTLR which will be easier to maintain, faster, and less likely to have bugs.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Lexical analysis - python

its because you have no rules that accept this or comment and really you dont care about whats in the comment you can easilly do something like t_cppcomment_ANYTHING = '[^\r\n]' just below your t_ignore rule

Related

Sly: How can I implement if and while statements and commands

Is it possible to get the source code of a (possibly decorated) Python function body, including inline comments? [duplicate]

Collect python source code comments along execution path

Controlling Python PLY lexer states from parser

Efficiently match multiple regexes in Python

Categories

Resources