Python equivalent of Fortran list-directed input

I'd like to be able to read data from an input file in Python, similar to the way that Fortran handles a list-directed read (i.e. read (file, *) char_var, float_var, int_var).
The tricky part is that the way Fortran handles a read statement like this is very "forgiving" as far as the input format is concerned. For example, using the previous statement, this:
"some string" 10.0, 5
would be read the same as:
"some string", 10.0
5
and this:
"other string", 15.0 /
is read the same as:
"other string"
15
/
with the value of int_var retaining the same value as before the read statement. And trickier still this:
"nother string", , 7
will assign the values to char_var and int_var but float_var retains the same value as before the read statement.
Is there an elegant way to implement this?

That is indeed tricky - I found it easier to write a pure-Python state-based tokenizer than to come up with a regular expression to parse each line (though it is possible).
I've used the link provided by Vladimir as the spec - the tokenizer has some doctests that pass.
def tokenize(line, separator=',', whitespace="\t\n\x20", quote='"'):
    """
    >>> tokenize('"some string" 10.0, 5')
    ['some string', '10.0', '5']
    >>> tokenize(' "other string", 15.0 /')
    ['other string', '15.0', '/']
    >>> tokenize('"nother string", , 7')
    ['nother string', '', '7']
    """
    inside_str = False
    token_started = False
    token = ""
    tokens = []
    separated = False
    just_added = False
    for char in line:
        if char in quote:
            if not inside_str:
                inside_str = True
            else:
                inside_str = False
                tokens.append(token)
                token = ""
                just_added = True
            continue
        if char in (whitespace + separator) and not inside_str:
            if token:
                tokens.append(token)
                token = ""
                just_added = True
            elif char in separator:
                if not just_added:
                    tokens.append("")
                just_added = False
            continue
        token += char
    if token:
        tokens.append(token)
    return tokens
class Character(object):
    def __init__(self, length=None):
        self.length = length

    def __call__(self, text):
        if self.length is None:
            return text
        if len(text) > self.length:
            return text[:self.length]
        return "{{:{}}}".format(self.length).format(text)
def make_types(types, default_value):
    return types, [default_value] * len(types)
def fortran_reader(file, types, default_char="/", default_value=None, **kw):
    types, results = make_types(types, default_value)
    tokens = []
    while True:
        tokens = []
        while len(tokens) < len(results):
            try:
                line = next(file)
            except StopIteration:
                # returning ends the generator cleanly (PEP 479)
                return
            tokens += tokenize(line, **kw)
        for i, (type_, token) in enumerate(zip(types, tokens)):
            if not token or token in default_char:
                continue
            results[i] = type_(token)
        changed_types = yield results
        if changed_types:
            types, results = make_types(changed_types, default_value)
I have not tested this thoroughly - but as for the tokenizer: it is designed to work in a Python for statement if the same fields are repeated over and over again, or it can be used with the generator's send method to change the types to be read on each iteration.
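For example, a minimal usage sketch (the file name and the field types below are made up for illustration):
with open("data.txt") as f:          # hypothetical input file
    reader = fortran_reader(f, [Character(), float, int])
    for char_var, float_var, int_var in reader:
        print(char_var, float_var, int_var)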
Please test, and e-mail me (address in my profile) some test files. If there is indeed nothing similar out there, maybe this deserves some polishing and publishing on PyPI.

Since I was not able to find a solution to this problem, I decided to write my own solution.
The main drivers are a reader class, and a tokenizer. The reader gets one line at a time from the file, passes it to the tokenizer, and assigns to the variables it is given, getting the next line as necessary.
class FortranAsciiReader(file):

    def read(self, *args):
        """
        Read from file into the given objects
        """
        num_args = len(args)
        num_read = 0
        encountered_slash = False
        # If line contained '/' or read into all variables, we're done
        while num_read < num_args and not encountered_slash:
            line = self.readline()
            if not line:
                raise Exception()
            values = tokenize(line)
            # Assign elements one-by-one into args, skipping empty fields and stopping at a '/'
            for val in values:
                if val == '/':
                    encountered_slash = True
                    break
                elif val == '':
                    num_read += 1
                else:
                    args[num_read].assign(val)
                    num_read += 1
                if num_read == num_args:
                    break
The tokenizer splits the line into tokens in accordance with the way Fortran performs list-directed reads, where ',' and whitespace are separators, tokens may be "repeated" via 4*token, and a / terminates input.
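For illustration, expanding such repeat counts could be done as a separate pass over the output of a tokenizer like the one above (a sketch only, not the PyLiDiRe code):
import re

def expand_repeats(tokens):
    # Expand Fortran-style repeat counts such as '4*1.5' into four '1.5' tokens.
    expanded = []
    for tok in tokens:
        m = re.match(r'^(\d+)\*(.*)$', tok)
        if m:
            expanded.extend([m.group(2)] * int(m.group(1)))
        else:
            expanded.append(tok)
    return expanded

print(expand_repeats(['4*1.5', 'abc']))  # ['1.5', '1.5', '1.5', '1.5', 'abc']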
My implementation of the tokenizer is a bit long to reproduce here, and I also included classes to transparently provide the functionality of the basic Fortran intrinsic types (i.e. Real, Character, Integer, etc.). The whole project can be found on my GitHub account, currently at https://github.com/bprichar/PyLiDiRe. Thanks to jsbueno for the inspiration for the tokenizer.
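Usage looks roughly like this (a hypothetical sketch; the exact constructors of the Real, Character and Integer classes live in PyLiDiRe and may differ):
# hypothetical sketch - 'input.txt' and the type constructors are illustrative
char_var, float_var, int_var = Character(), Real(), Integer()
reader = FortranAsciiReader('input.txt')
reader.read(char_var, float_var, int_var)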


Handle nested fields with conversion types in string with string.Formatter

Update 2
Alright, my answer to this question is not a complete solution to what I originally wanted, but it's OK for simpler things like filename templating (what I originally intended to use this for). I have yet to come up with a solution for recursive templating. It might not matter to me, though, as I have re-evaluated what I really need. It's possible I'll need bigger guns in the future, but then I'll probably just choose a more advanced templating engine instead of reinventing the wheel.
Update
OK, I realize now that string.Template is probably the better way to do this. I'll answer my own question when I have a working example.
I want to format strings by grouping keys and arbitrary text together in a nested manner, like so:
# conversions (!):
# u = upper case
# l = lower case
# c = capital case
# t = title case
fmt = RecursiveNamespaceFormatter(globals())
greeting = 'hello'
person = 'foreName surName'
world = 'WORLD'
sample = 'WELL {greeting!u} {super {person!t}, {tHiS iS tHe {world!t}!l}!c}!'
print(fmt.format(sample))
# output: WELL HELLO Super Forename Surname, this is the World!
I've subclassed string.Formatter to populate the nested fields, which I retrieve with a regex, and it works fine except for fields with a conversion type, which don't get converted.
import re
from string import Formatter

class RecursiveNamespaceFormatter(Formatter):
    def __init__(self, namespace={}):
        Formatter.__init__(self)
        self.namespace = namespace

    def vformat(self, format_string, *args, **kwargs):
        def func(i):
            i = i.group().strip('{}')
            return self.get_value(i, (), {})
        format_string = re.sub(r'\{(?:[^}{]*)\}', func, format_string)
        try:
            return super().vformat(format_string, args, kwargs)
        except ValueError:
            return self.vformat(format_string)

    def get_value(self, key, args, kwds):
        if isinstance(key, str):
            try:
                # Check explicitly passed arguments first
                return kwds[key]
            except KeyError:
                return self.namespace.get(key, key)  # return key if not found (e.g. key == "this is the World")
        else:
            super().get_value(key, args, kwds)

    def convert_field(self, value, conversion):
        if conversion == "u":
            return str(value).upper()
        elif conversion == "l":
            return str(value).lower()
        elif conversion == "c":
            return str(value).capitalize()
        elif conversion == "t":
            return str(value).title()
        # Do the default conversion or raise error if no matching conversion found
        return super().convert_field(value, conversion)

# output: WELL hello!u super foreName surName!t, tHiS iS tHe WORLD!t!l!c!
What am I missing? Is there a better way to do this?
Recursion is a complicated thing here, especially with the limitations of Python's re module. Before I settled on string.Template, I experimented with looping through the string and stacking all relevant indexes, to order each nested field in a hierarchy. Maybe a combination of the two could work; I'm not sure.
Here's however a working, non-recursive example:
from collections import ChainMap as _ChainMap  # needed when both a mapping and kwargs are passed
from string import Template, _sentinel_dict

class MyTemplate(Template):
    delimiter = '$'
    pattern = r'\$(?:(?P<escaped>\$)|\{(?P<braced>[\w]+)(?:\.(?P<braced_func>\w+)\(\))*\}|(?P<named>(?:[\w]+))(?:\.(?P<named_func>\w+)\(\))*|(?P<invalid>))'

    def substitute(self, mapping=_sentinel_dict, **kws):
        if mapping is _sentinel_dict:
            mapping = kws
        elif kws:
            mapping = _ChainMap(kws, mapping)

        def convert(mo):
            named = mapping.get(mo.group('named'), mapping.get(mo.group('braced')))
            func = mo.group('named_func') or mo.group('braced_func')  # i.e. $var.func() or ${var.func()}
            if named is not None:
                if func is not None:
                    # if named doesn't contain func, convert it to str and try again.
                    callable_named = getattr(named, func, getattr(str(named), func, None))
                    if callable_named:
                        return str(callable_named())
                return str(named)
            if mo.group('escaped') is not None:
                return self.delimiter
            if mo.group('invalid') is not None:
                self._invalid(mo)
            if named is not None:
                raise ValueError('Unrecognized named group in pattern',
                                 self.pattern)
        return self.pattern.sub(convert, self.template)
sample1 = 'WELL $greeting.upper() super$person.title(), tHiS iS tHe $world.title().lower().capitalize()!'
S = MyTemplate(sample1)
print(S.substitute(**{'greeting': 'hello', 'person': 'foreName surName', 'world': 'world'}))
# output: WELL HELLO super Forename Surname, tHiS iS tHe World!
sample2 = 'testing${äää.capitalize()}.upper()ing $NOT_DECLARED.upper() $greeting '
sample2 += '$NOT_DECLARED_EITHER ASDF$world.upper().lower()ASDF'
S = MyTemplate(sample2)
print(S.substitute(**{
    'some_var': 'some_value',
    'äää': 'TEST',
    'greeting': 'talofa',
    'person': 'foreName surName',
    'world': 'världen'
}))
# output: testingTest.upper()ing talofa ASDFvärldenASDF
sample3 = 'a=$a.upper() b=$b.bit_length() c=$c.bit_length() d=$d.upper()'
S = MyTemplate(sample3)
print(S.substitute(**{'a':1, 'b':'two', 'c': 3, 'd': 'four'}))
# output: a=1 b=two c=2 d=FOUR
As you can see, $var and ${var} work as expected, but the fields can also handle type methods. If the method is not found, the value is converted to str and checked again.
The methods can't take any arguments, though. It also only catches the last method, so chaining doesn't work either, which I believe is because re does not allow multiple groups to use the same name (the third-party regex module does, however).
With some tweaking of the regex pattern and some extra logic in convert, both of these things should be easy to fix (a rough sketch of the idea follows at the end of this answer).
MyTemplate.substitute works like MyTemplate.safe_substitute in that it does not throw exceptions on missing keys or fields.
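For what it's worth, here is one possible shape of that tweak: capture the whole method chain as a single group and apply the calls one by one in convert. This is only a rough, lightly tested sketch (the ChainedTemplate class and its simplified pattern are mine, not part of the answer above), and it silently drops missing names:
import re
from string import Template

class ChainedTemplate(Template):
    # Simplified: only the $name.func().func() form, no ${...} variant.
    pattern = r'\$(?:(?P<escaped>\$)|(?P<named>\w+)(?P<named_funcs>(?:\.\w+\(\))*)|(?P<invalid>))'

    def substitute(self, **mapping):
        def convert(mo):
            if mo.group('escaped') is not None:
                return self.delimiter
            value = mapping.get(mo.group('named'))
            if value is None:
                return ''  # missing names are simply dropped in this sketch
            for func in re.findall(r'\.(\w+)\(\)', mo.group('named_funcs') or ''):
                method = getattr(value, func, getattr(str(value), func, None))
                if method:
                    value = method()
            return str(value)
        return self.pattern.sub(convert, self.template)

print(ChainedTemplate('$world.title().lower()').substitute(world='WORLD'))  # world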

Regex: Match string (including nested characters) inside 2 characters while not capturing characters, but non-greedy

I'm done making a markup language, and I'm now working on optimized strings for it.
I want nested characters to be allowed inside the strings. With my RPLY lexer there are 3 characters I need to allow inside of strings (curly braces and backticks), while also not capturing the delimiting characters. Here is the regex I've tried:
(?:`)+([\w\W]+)(?:`)+
But this is greedy, and will only match between the first backtick it sees and the last, as well as creating non-capturing groups that are unsupported by RPLY.
Is there an alternative to this that's non-greedy but will allow nested characters? (And no non-capturing groups, please - I'm using the RPython distribution of PLY (RPLY), whose lexer and parser don't support regex groups.)
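To make the greediness described above concrete, here is a quick illustration (the sample string is made up):
import re

sample = '`hello` {b `world`}'
greedy = r'(?:`)+([\w\W]+)(?:`)+'
lazy = r'(?:`)+([\w\W]+?)(?:`)+'
print(re.findall(greedy, sample))  # ['hello` {b `world'] - one match from first to last backtick
print(re.findall(lazy, sample))    # ['hello', 'world']   - non-greedy, but still uses groups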
If anybody needs more code, I have 2 Python classes, both for the lexer and parser.
LEXER
from rply import LexerGenerator

class BMLLexer():
    def __init__(self):
        self.__lexer = LexerGenerator()

    def __add_tokens(self):
        # Statement definitions
        # Note that I need both the OPEN_STATEMENT and CLOSE_STATEMENT to be allowed inside the string.
        self.__lexer.add('OPEN_STATEMENT', r'\{')
        self.__lexer.add('CLOSE_STATEMENT', r'\}')
        # Basic things
        # Note that RPLY's parser doesn't allow multiple groups (capturing or non-capturing), so the only option would be to use functions to remove the first and last backtick from the string, or find something in regex that allows me to automatically get rid of the first and last backtick in the string.
        self.__lexer.add('STRING', r'(?:`)+([\w\W]+)(?:`)+')
        # Ignore spaces
        self.__lexer.ignore(r'\s+')

    def build(self):
        self.__add_tokens()
        return self.__lexer.build()
PARSER
import re

from bmls.language.parser.definitions import BMLDefinitionCapped, BMLDefinitionSingle
from rply import ParserGenerator, Token

class BMLParser():
    """The direct BML parser.

    Raises:
        SyntaxError: If given invalid syntax, the parser will throw a SyntaxError.
    """

    # Init parser
    def __init__(self):
        """Initializes the parser."""
        self.pg = ParserGenerator(
            # A list of all token names accepted by the parser.
            [
                'OPEN_STATEMENT',
                'CLOSE_STATEMENT',
                'STRING',
            ],
            precedence=[
                ('left', ['OPEN_STATEMENT']),
                ('left', ['STRING']),
                ('right', ['CLOSE_STATEMENT'])
            ]
        )
    # Parsing
    def parse(self):
        """Parses BML content

        Raises:
            SyntaxError: If given invalid syntax, the parser will throw a SyntaxError.

        Returns:
            list: An HTML/XML formatted list of items.
        """
        # Multi-expression handling
        @self.pg.production('main : expr')
        @self.pg.production('main : main expr')
        def main(p):
            if len(p) == 1:
                return p
            else:
                for x in p[1:]:
                    p[0].append(x)
                return p[0]

        # Expression handling
        @self.pg.production('expr : STRING OPEN_STATEMENT main CLOSE_STATEMENT')
        def definition_capped(p):
            name = self.__toSTRING(p[0])
            definition1 = self.__toSTRING(p[2])
            comp = BMLDefinitionCapped(name, definition1)
            return self.__toHTML(comp)

        @self.pg.production('expr : OPEN_STATEMENT STRING CLOSE_STATEMENT')
        def definition_uncapped(p):
            name = self.__toSTRING(p[1])
            comp = BMLDefinitionSingle(name)
            return self.__toHTML(comp)

        # Expression types
        # This is where the string is parsed, currently using the function __removeFirstLast. I wish to replace this with a supported expression.
        @self.pg.production('expr : STRING')
        def string_expr(p):
            if p[0].gettokentype() == 'STRING':
                return self.__removeFirstLast(self.__toSTRING(p[0]), '`', '`')

        # Error handling
        @self.pg.error
        def error_handle(token):
            raise SyntaxError('Error on Token (\'' + token.gettokentype() + '\' , \'' + token.getstr() + '\')')
    # Public utilities
    def build(self):
        return self.pg.build()

    # Private utilities
    def __removeFirstLast(self, tok, char, endchar):
        if isinstance(tok, str):
            if tok.startswith(char) and tok.endswith(endchar):
                return re.sub(r'^' + char + r'|' + endchar + r'$', '', tok)
            else:
                return tok
        else:
            return tok

    def __toHTML(self, tok):
        output = ''
        if isinstance(tok, BMLDefinitionCapped):
            right = ''
            try:
                for k1 in tok.right:
                    right += k1
            except:
                right += tok.right
            output += '<' + tok.left + '>' + right + '</' + tok.left.split(' ')[0] + '>'
        elif isinstance(tok, BMLDefinitionSingle):
            output += '<' + tok.left + '>'
        elif isinstance(tok, Token):
            output += tok.getstr()
        else:
            output += tok
        return output

    def __toSTRING(self, tok):
        if isinstance(tok, Token):
            return tok.getstr()
        else:
            return tok

    def __toINT(self, tok):
        if isinstance(tok, Token):
            return int(tok.getstr())
        else:
            return int(tok)

Replacing multiple words in a string from different data sets in Python

Essentially, I have a Python script that loads in a number of files; each file contains a list, and these are used to generate strings. For example: "Just been to see $film% in $location%, I'd highly recommend it!" I need to replace the $film% and $location% placeholders with a random element of the array of their respective imported lists.
I'm very new to Python but have picked up most of it quite easily. Obviously, though, strings in Python are immutable, so handling this sort of task is different compared to other languages I've used.
Here is the code as it stands; I've tried adding a while loop, but it would still only replace the first instance of a replaceable word and leave the rest.
#!/usr/bin/python
import random

def replaceWord(string):
    #Find Variable Type
    if "url" in string:
        varType = "url"
    elif "film" in string:
        varType = "film"
    elif "food" in string:
        varType = "food"
    elif "location" in string:
        varType = "location"
    elif "tvshow" in string:
        varType = "tvshow"

    #LoadVariableFile
    fileToOpen = "/prototype/default_" + varType + "s.txt"
    var_file = open(fileToOpen, "r")
    var_array = var_file.read().split('\n')

    #Get number of possible variables
    numberOfVariables = len(var_array)

    #ChooseRandomElement
    randomElement = random.randrange(0, numberOfVariables)

    #ReplaceWord
    oldValue = "$" + varType + "%"
    newString = string.replace(oldValue, var_array[randomElement], 1)
    return newString
testString = "Just been to see $film% in $location%, I'd highly recommend it!"
Test = replaceWord(testString)
This would give the following output: Just been to see Harry Potter in $location%, I'd highly recommend it!
I have tried using while loops, counting the number of words to replace in the string etc. however it still only changes the first word. It also needs to be able to replace multiple instances of the same "variable" type in the same string, so if there are two occurrences of $film% in a string it should replace both with a random element from the loaded file.
The following program may be somewhat closer to what you are trying to accomplish. Please note that documentation has been included to help explain what is going on. The templates are a little different from yours but provide customization options.
#! /usr/bin/env python3
import random

PATH_TEMPLATE = './prototype/default_{}s.txt'

def main():
    """Demonstrate the StringReplacer class with a test string."""
    replacer = StringReplacer(PATH_TEMPLATE)
    text = "Just been to see {film} in {location}, I'd highly recommend it!"
    result = replacer.process(text)
    print(result)

class StringReplacer:
    """StringReplacer(path_template) -> StringReplacer instance"""

    def __init__(self, path_template):
        """Initialize the instance attribute of the class."""
        self.path_template = path_template
        self.cache = {}

    def process(self, text):
        """Automatically discover text keys and replace them at random."""
        keys = self.load_keys(text)
        result = self.replace_keys(text, keys)
        return result

    def load_keys(self, text):
        """Discover what replacements can be made in a string."""
        keys = {}
        while True:
            try:
                text.format(**keys)
            except KeyError as error:
                key = error.args[0]
                self.load_to_cache(key)
                keys[key] = ''
            else:
                return keys

    def load_to_cache(self, key):
        """Warm up the cache as needed in preparation for replacements."""
        if key not in self.cache:
            with open(self.path_template.format(key)) as file:
                unique = set(filter(None, map(str.strip, file)))
            self.cache[key] = tuple(unique)

    def replace_keys(self, text, keys):
        """Build a dictionary of random replacements and run formatting."""
        for key in keys:
            keys[key] = random.choice(self.cache[key])
        new_string = text.format(**keys)
        return new_string

if __name__ == '__main__':
    main()
The varType you are assigning will be set by only one branch of your if-elif sequence, and then the interpreter moves on. You would have to go over all the types and perform the replacement for each. One way would be to set flags for which parts of the sentence you want to change. It would go like this:
url_to_change = False
film_to_change = False

if "url" in string:
    url_to_change = True
elif "film" in string:
    film_to_change = True

if url_to_change:
    change_url()
if film_to_change:
    change_film()
If you want to change all occurrences you could use a for-each loop. Just do something like this in the part where you are swapping a word:
for word in sentence:
    if word == 'url':
        change_word()
Having said this, I'd recommend introducing two improvements. Push the replacing into separate functions; it will be easier to manage your code.
For example, a function for loading the items to choose from could be:
def load_variable_file(file_name):
    fileToOpen = "/prototype/default_" + file_name + "s.txt"
    var_file = open(fileToOpen, "r")
    var_array = var_file.read().split('\n')
    var_file.close()
    return var_array
Instead of
if "url" in string:
    varType = "url"
you could do:
def change_url(sentence):
    var_array = load_variable_file("url")
    numberOfVariables = len(var_array)
    randomElement = random.randrange(0, numberOfVariables)
    oldValue = "$url%"
    return sentence.replace(oldValue, var_array[randomElement], 1)

if "url" in sentence:
    sentence = change_url(sentence)
And so on. You could push parts of what I've put into change_url() into a separate function, since they would be used by all such functions (just like loading data from a file). I deliberately do not change everything; I hope you get my point. As you can see, with clearly named functions you can write less code and split it into logical, reusable parts, with no need to comment the code.
A few points about your code:
You can replace the randrange with random.choice, as you just want to select an item from an array.
You can iterate over your types and do the replacement without specifying a limit (the third parameter), then assign the result back to the same object, so you keep all your replacements.
readlines() does what you want after open: it reads from the file and stores the lines as an array.
Return the new string after going through all the possible replacements.
Something like this:
#!/usr/bin/python
import random

def replaceWord(string):
    #Find Variable Type
    types = ("url", "film", "food", "location", "tvshow")
    for t in types:
        if "$" + t + "%" in string:
            var_array = []
            #LoadVariableFile
            fileToOpen = "/prototype/default_" + t + "s.txt"
            with open(fileToOpen) as f:
                var_array = f.readlines()
            tag = "$" + t + "%"
            while tag in string:
                choice = random.choice(var_array)
                string = string.replace(tag, choice, 1)
                var_array.remove(choice)
    return string
testString = "Just been to see $film% in $location%, I'd highly recommend it!"
new = replaceWord(testString)
print(new)

Using inherited method causes unexpected result

Trying to find if a keyword is in a title by using a subclass, titleMatch, which calls a method, isWordIn, from parent class match.
What I intend to be the text parameter seems to end up as the keyword (see the output below). Most of the design of this code is constrained by the exercise, but I think I've tried every way (but the right one :)
Obviously missing some very basic (no doubt major) concept of inheritance.
import string

class abstractBaseClass(object):
    def evaluate(self, story):
        """
        Returns True if an alert should be generated
        for the given news item, or False otherwise.
        """
        raise NotImplementedError

class match(abstractBaseClass):
    '''
    get word to match
    isWordIn breaks text into lower case list of words, tries word against list
    '''
    def __init__(self, word):
        self.word = word

    def isWordIn(self, text):
        self.text = text
        newText = ''
        for char in range(len(self.text)):
            if self.text[char] in string.punctuation:
                char = ' '
                newText += char
            else:
                char = self.text[char]
                newText += char
        lowerText = newText.lower()
        chopText = lowerText.split(' ')
        print 'self.word: ' + self.word.lower()
        return self.word.lower() in chopText

class titleMatch(match):
    '''
    find word in title
    '''
    def evaluate(self, titl):
        self.titl = titl
        return self.isWordIn(self.titl)

title = 'Red Badge of Courage'
abstract = 'foo'
text = 'some long string'

test = match('RED')
print test.isWordIn(title)

test2 = titleMatch(title)
print test2.evaluate(title)
results:
%run "c:\docume~1\winuser\locals~1\temp\tmpvrnfcj.py"
self.word: red
True
self.word: red badge of courage
False
This seems like a really complicated way of doing:
>>> word = 'red'
>>> string = 'red badge of courage'
>>> word in string
True
Python isn't Java; you don't need to put everything in a class, and in fact you probably don't need a class at all most of the time in Python. A good rule of thumb is: if a class has only two methods and one of them is __init__(), you don't need a class.
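If you do want whole-word matching (which is what isWordIn does) rather than substring matching, a plain function is still enough. A rough sketch:
import string

def is_word_in(word, text):
    # Replace punctuation with spaces, then compare lower-cased words.
    cleaned = ''.join(' ' if c in string.punctuation else c for c in text)
    return word.lower() in cleaned.lower().split()

print(is_word_in('RED', 'Red Badge of Courage'))   # True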
Maybe you meant to initialize test2 this way:
test2 = titleMatch('RED')
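With that change the keyword is 'RED' rather than the whole title, and (given the classes above) the second test's output becomes:
self.word: red
True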

Efficiently match multiple regexes in Python

Lexical analyzers are quite easy to write when you have regexes. Today I wanted to write a simple general analyzer in Python, and came up with:
import re
import sys

class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position.
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)

class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos

class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        self.rules = []
        for regex, type in rules:
            self.rules.append((re.compile(regex), type))
        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile('\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the
            input buffer. None is returned if the end of the
            buffer was reached.
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf[self.pos:])
                if m:
                    self.pos += m.start()
                else:
                    return None

            for token_regex, token_type in self.rules:
                m = token_regex.match(self.buf[self.pos:])
                if m:
                    value = self.buf[self.pos + m.start():self.pos + m.end()]
                    tok = Token(token_type, value, self.pos)
                    self.pos += m.end()
                    return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while 1:
            tok = self.token()
            if tok is None: break
            yield tok

if __name__ == '__main__':
    rules = [
        ('\d+',          'NUMBER'),
        ('[a-zA-Z_]\w+', 'IDENTIFIER'),
        ('\+',           'PLUS'),
        ('\-',           'MINUS'),
        ('\*',           'MULTIPLY'),
        ('\/',           'DIVIDE'),
        ('\(',           'LP'),
        ('\)',           'RP'),
        ('=',            'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902) ')
    try:
        for tok in lx.tokens():
            print tok
    except LexerError, err:
        print 'LexerError at position', err.pos
It works just fine, but I'm a bit worried that it's too inefficient. Are there any regex tricks that will allow me to write it in a more efficient / elegant way ?
Specifically, is there a way to avoid looping over all the regex rules linearly to find one that fits?
I suggest using the re.Scanner class. It's not documented in the standard library, but it's well worth using. Here's an example:
import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
    (r"-?[0-9]+", lambda scanner, token: int(token)),
    (r" +", lambda scanner, token: None),
])

>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]
You can merge all your regexes into one using the "|" operator and let the regex library do the work of discerning between tokens. Some care should be taken to ensure the preference of tokens (for example to avoid matching a keyword as an identifier).
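A tiny illustration of that ordering caveat (the token names here are made up):
import re

# With '|', the leftmost alternative that matches wins, so keyword patterns
# must come before the more general identifier pattern.
bad = re.compile(r'(?P<ID>[A-Za-z]+)|(?P<IF>if)')
good = re.compile(r'(?P<IF>if\b)|(?P<ID>[A-Za-z]+)')
print(bad.match('if x').lastgroup)   # ID - the keyword is swallowed by the identifier rule
print(good.match('if x').lastgroup)  # IF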
I found this in the Python documentation. It's simple and elegant.
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',  r':='),           # Assignment operator
        ('END',     r';'),            # Statement terminator
        ('ID',      r'[A-Za-z]+'),    # Identifiers
        ('OP',      r'[+*\/\-]'),     # Arithmetic operators
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]'),        # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start() - line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' % (s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)
The trick here is the line:
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
Here (?P<ID>PATTERN) will mark the matched result with a name specified by ID.
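A minimal illustration of how the named group shows up on the match object:
import re

mo = re.match(r'(?P<NUMBER>\d+)|(?P<ID>[A-Za-z]+)', 'total')
print(mo.lastgroup, mo.group(mo.lastgroup))   # ID total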
re.match is anchored. You can give it a position argument:
pos = 0
end = len(text)
while pos < end:
    match = regexp.match(text, pos)
    # do something with your match
    pos = match.end()
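Applied to the Lexer above, this lets token() avoid slicing self.buf on every call. A standalone sketch of the same idea (the rules here are just for illustration):
import re

rules = [(re.compile(r'\d+'), 'NUMBER'), (re.compile(r'[a-zA-Z_]\w*'), 'ID')]
ws = re.compile(r'\S')

def tokens(buf):
    pos = 0
    while True:
        m = ws.search(buf, pos)          # skip whitespace without slicing buf
        if not m:
            return
        pos = m.start()
        for regex, ttype in rules:
            m = regex.match(buf, pos)    # anchored match starting at pos, no copy
            if m:
                yield ttype, m.group()
                pos = m.end()
                break
        else:
            raise ValueError('no rule matched at position %d' % pos)

print(list(tokens('abc 12 x9')))   # [('ID', 'abc'), ('NUMBER', '12'), ('ID', 'x9')]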
Have a look at pygments, which ships a shitload of lexers for syntax highlighting purposes with different implementations, most based on regular expressions.
It's possible that combining the token regexes will work, but you'd have to benchmark it. Something like:
x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
a = x.match('9999').groupdict()  # => {'VAR': None, 'NUMBER': '9999'}
if a:
    token = [a for a in a.items() if a[1] != None][0]
The filter is where you'll have to do some benchmarking...
Update: I tested this, and it seems as though if you combine all the tokens as stated and write a function like:
def find_token(lst):
    for tok in lst:
        if tok[1] != None:
            return tok
    raise Exception
You'll get roughly the same speed (maybe a teensy faster) for this. I believe the speedup must be in the number of calls to match, but the loop for token discrimination is still there, which of course kills it.
This isn't exactly a direct answer to your question, but you might want to look at ANTLR. According to this document the python code generation target should be up to date.
As for your regexes, there are really two ways to speed things up if you're sticking with regexes. The first would be to order your regexes by the probability of finding them in a default text. You could figure this out by adding a simple profiler to the code that collects token counts for each token type and running the lexer on a body of work. The other solution would be to bucket-sort your regexes (since your key space, being a character, is relatively small) and then use an array or dictionary to perform the needed regexes after a single discrimination on the first character.
However, I think that if you're going to go this route, you should really try something like ANTLR which will be easier to maintain, faster, and less likely to have bugs.
