pyparsing question - python

This code works:
from pyparsing import *
zipRE = "\d{5}(?:[-\s]\d{4})?"
fooRE = "^\!\s+.*"
zipcode = Regex( zipRE )
foo = Regex( fooRE )
query = ( zipcode | foo )
tests = [ "80517", "C6H5OH", "90001-3234", "! sfs" ]
for t in tests:
try:
results = query.parseString( t )
print t,"->", results
except ParseException, pe:
print pe
I'm stuck on two issues:
1 - How to use a custom function to parse a token. For instance, if I wanted to use some custom logic instead of a regex to determine if a number is a zipcode.
Instead of:
zipcode = Regex( zipRE )
perhaps:
zipcode = MyFunc()
2 - How do I determine what a string parses TO. "80001" parses to "zipcode" but how do I determine this using pyparsing? I'm not parsing a string for its contents but simply to determine what kind of query it is.

You could use zipcode and foo separately, so that you know which one the string matches.
zipresults = zipcode.parseString( t )
fooresults = foo.parseString( t )

Your second question is easy, so I'll answer that first. Change query to assign results names to the different expressions:
query = ( zipcode("zip") | foo("foo") )
Now you can call getName() on the returned result:
print t,"->", results, results.getName()
Giving:
80517 -> ['80517'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['90001-3234'] zip
! sfs -> ['! sfs'] foo
If you are going to use the result's fooness or zipness to call another function, then you could do this at parse time by attaching a parse action to your foo and zipcode expressions:
# enclose zipcodes in '*'s, foos in '#'s
zipcode.setParseAction(lambda t: '*' + t[0] + '*')
foo.setParseAction(lambda t: '#' + t[0] + '#')
query = ( zipcode("zip") | foo("foo") )
Now gives:
80517 -> ['*80517*'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['*90001-3234*'] zip
! sfs -> ['#! sfs#'] foo
For your first question, I don't exactly know what kind of function you mean. Pyparsing provides many more parsing classes than just Regex (such as Word, Keyword, Literal, CaselessLiteral), and you compose your parser by combining them with '+', '|', '^', '~', '#' and '*' operators. For instance, if you wanted to parse for a US social security number, but not use a Regex, you could use:
ssn = Combine(Word(nums,exact=3) + '-' +
Word(nums,exact=2) + '-' + Word(nums,exact=4))
Word matches for contiguous "words" made up of the given characters in its constructor, Combine concatenates the matched tokens into a single token.
If you wanted to parse for a potential list of such numbers, delimited by '/'s, use:
delimitedList(ssn, '/')
or if there were between 1 and 3 such numbers, with no delimters, use:
ssn * (1,3)
And any expression can have results names or parse actions attached to them, to further enrich the parsed results, or the functionality during parsing. You can even build recursive parsers, such as nested lists of parentheses, arithmetic expressions, etc. using the Forward class.
My intent when I wrote pyparsing was that this composition of parsers from basic building blocks would be the primary form for creating a parser. It was only in a later release that I added Regex as (what I though was) the ultimate escape valve - if people couldn't build up their parser, they could fall back on regex's format, which has definitely proven its power over time.
Or, as one other poster suggests, you can open up the pyparsing source, and subclass one of the existing classes, or write your own, following their structure. Here is a class that would match for paired characters:
class PairOf(Token):
"""Token for matching words composed of a pair
of characters in a given set.
"""
def __init__( self, chars ):
super(PairOf,self).__init__()
self.pair_chars = set(chars)
def parseImpl( self, instring, loc, doActions=True ):
if (loc < len(instring)-1 and
instring[loc] in self.pair_chars and
instring[loc+1] == instring[loc]):
return loc+2, instring[loc:loc+2]
else:
raise ParseException(instring, loc, "Not at a pair of characters")
So that:
punc = r"~!##$%^&*_-+=|\?/"
parser = OneOrMore(Word(alphas) | PairOf(punc))
print parser.parseString("Does ** this match #### %% the parser?")
Gives:
['Does', '**', 'this', 'match', '##', '##', '%%', 'the', 'parser']
(Note the omission of the trailing single '?')

I do not have the pyparsing module, but Regex must be a class, not a function.
What you can do is subclass from it and override methods as required to customize behaviour, then use your subclasses instead.

Related

Extract all variables from a string of Python code (regex or AST)

I want to find and extract all the variables in a string that contains Python code. I only want to extract the variables (and variables with subscripts) but not function calls.
For example, from the following string:
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Please note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract baz[1:10:var1[2+1]] and var1[2+1].
The first two ideas that come to mind is to use either a regex or an AST. I have tried both but with no success.
When using a regex, in order to make things simpler, I thought it would be a good idea to first extract the "top level" variables, and then recursively the nested ones. Unfortunately, I can't even do that.
This is what I have so far:
regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
print(match)
Here is a demo: https://regex101.com/r/INPRdN/2
The other solution is to use an AST, extend ast.NodeVisitor, and implement the visit_Name and visit_Subscript methods. However, this doesn't work either because visit_Name is also called for functions.
I would appreciate if someone could provide me with a solution (regex or AST) to this problem.
Thank you.
I find your question an interesting challenge, so here is a code that do what you want, doing this using Regex alone it's impossible because there is nested expression, this is a solution using a combination of Regex and string manipulations to handle nested expressions:
# -*- coding: utf-8 -*-
import re
RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile('##\d+##')
def extract_expression(string):
""" extract all identifier and getitem expression in the given order."""
def remove_brackets(text):
# 1. handle `[...]` expression replace them with #{#...#}#
# so we don't confuse them with word[...]
pattern = '(?<!\w)(\s*)(\[)([^\[]+?)(\])'
# keep extracting expression until there is no expression
while re.search(pattern, text):
text = re.sub(pattern, r'\1#{#\3#}#', string)
return text
def get_ordered_subexp(exp):
""" get index of nested expression."""
index = int(exp.replace('#', ''))
subexp = RE_INDEX.findall(expressions[index])
if not subexp:
return exp
return exp + ''.join(get_ordered_subexp(i) for i in subexp)
def replace_expression(match):
""" save the expression in the list, replace it with special key and it's index in the list."""
match_exp = match.group(0)
current_index = len(expressions)
expressions.append(None) # just to make sure the expression is inserted before it's inner identifier
# if the expression contains identifier extract too.
if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
expressions[current_index] = match_exp
return '##{}##'.format(current_index)
def fix_expression(match):
""" replace the match by the corresponding expression using the index"""
return expressions[int(match.group(2))]
# result that will contains
expressions = []
string = remove_brackets(string)
# 2. extract all expression and keep track of there place in the original code
pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
# keep extracting expression until there is no expression
while re.search(pattern, string):
# every exression that is extracted is replaced by a special key
string = re.sub(pattern, replace_expression, string)
# some times inside brackets can contains getitem expression
# so when we extract that expression we handle the brackets
string = remove_brackets(string)
# 3. build the correct result with extracted expressions
result = [None] * len(expressions)
for index, exp in enumerate(expressions):
# keep replacing special keys with the correct expression
while RE_INDEX_ONLY.search(exp):
exp = RE_INDEX_ONLY.sub(fix_expression, exp)
# finally we don't forget about the brackets
result[index] = exp.replace('#{#', '[').replace('#}#', ']')
# 4. Order the index that where extracted
ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
# convert it to integer
ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]
# 5. fix the order of expressions using the ordered indexes
final_result = []
for exp_index in ordered_index:
final_result.append(result[exp_index])
# for debug:
# print('final string:', string)
# print('expression :', expressions)
# print('order_of_expresion: ', ordered_index)
return final_result
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))
OUTPU:
['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']
I tested this code for very complicated examples and it worked perfectly. and notice that the order if extraction is the same as you wanted, Hope that this is what you needed.
This answer might be too later. But it is possible to do it using python regex package.
import regex
code= '''foo + bar[1] + baz[1:10:var1[2+1]] +
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2
(var3[0])'''
p=r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
result=regex.findall(p,code,overlapped=True) #overlapped=True is needed to capture something inside a group like 'var1[2+1]'
[x[0] for x in result] #result variable is list of tuple of two,as each pattern capture two groups ,see below.
output:
['foo','bar[1]','baz[1:10:var1[2+1]]','var1[2+1]','qux[[1,2,int(var2)]]','var2','bob[len("foobar")]','var3[0]']
pattern explaination:
(     # 1st capturing group start
\b[a-z]\w*\b     #variable name ,eg 'bar'
(?!\s*[\(\"])     #negative lookahead. so to ignore something like foobar"
(\[(?:[^\[\]]|(?2))*\])     #2nd capture group,capture nested groups in '[ ]'
                                        #eg '[1:10:var1[2+1]]'.
                                        #'?2' refer to 2nd capturing group recursively
?     #2nd capturing group is optional so to capture something like 'foo'
)     #end of 1st group.
Regex is not a powerful enough tool to do this. If there is a finite depth of your nesting there is some hacky work around that would allow you to make complicate regex to do what you are looking for but I would not recommend it.
This is question is asked a lot an the linked response is famous for demonstrating the difficulty of what you are trying to do
If you really must parse a string for code an AST would technically work but I am not aware of a library to help with this. You would be best off trying to build a recursive function to do the parsing.

How to use stringed regex as proper regex with raw literalization

I have a list of regexes in string form (created after parsing natural language text which were search queries). I want to use them for searching text now. Here is how I am doing it right now-
# given that regex_list=["r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'", "r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'"....]
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
mo=re.search(string_regex %gs1,sent,re.I)
if mo:
print(mo.group())
What I need is to be able to use these string regexes, but also have Python's raw literal notation on them, as we all should for regex queries. Now about these expressions - I have natural text search commands like -
LINE_CONTAINS foo(+)
Which I use pyparsing to convert to regex like r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))' based on a grammar. I send a list of these human rules to the pyparsing code and it gives me back a list of ~100 of these regexes. These regexes are constructed in string format.
This is the MCVE version of the code that generates these strings that are supposed to act as regexes -
from pyparsing import *
import re
def parse_hrr(received_sentences):
UPTO, AND, OR, WORDS, CHARACTERS = map(Literal, "UPTO AND OR WORDS CHARACTERS".split())
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
"""LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH
class Node(object):
def __init__(self, tokens):
self.tokens = tokens
def generate(self):
pass
class LiteralNode(Node):
def generate(self):
return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
def __repr__(self):
return repr(self.tokens[0])
class ConsecutivePhrases(Node):
def generate(self):
join_these=[]
tokens = self.tokens[0]
for t in tokens:
tg = t.generate()
join_these.append(tg)
seq = []
for word in join_these[:-1]:
if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
seq.append(word + "")
else:
seq.append(word + "\s+")
seq.append(join_these[-1])
result = "".join(seq)
return result
class AndNode(Node):
def generate(self):
tokens = self.tokens[0]
join_these=[]
for t in tokens[::2]:
tg = t.generate()
tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
join_these.append(tg_mod)
joined = ''.join(ele for ele in join_these)
full = '('+ joined+')'
return full
class OrNode(Node):
def generate(self):
tokens = self.tokens[0]
joined = '|'.join(t.generate() for t in tokens[::2])
full = '('+ joined+')'
return full
class LineTermNode(Node):
def generate(self):
tokens = self.tokens[0]
ret = ''
dir_phr_map = {
'LINE_CONTAINS': lambda a: r"((?:(?<=[\W_])" + a + r"(?=[\W_]|$))456", #%gs1, sent, re.I)",
'PARA_STARTSWITH':
lambda a: ("r'(^" + a + "(?=[\W_]|$))' 457") if 'gene' in repr(a) #%gs1, s, re.I)"
else ("r'(^" + a + "(?=[\W_]|$))' 458")} #,s, re.I
for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
ret = dir_phr_map[line_dir](phr_term.generate())
return ret
## THE GRAMMAR
word = ~keyword + Word(alphas, alphanums+'-_+/()')
some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
phrase_item = some_words
phrase_expr = infixNotation(phrase_item,
[
(None, 2, opAssoc.LEFT, ConsecutivePhrases),
(AND, 2, opAssoc.LEFT, AndNode),
(OR, 2, opAssoc.LEFT, OrNode),
],
lpar=Suppress('{'), rpar=Suppress('}')
) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
(phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = line_term.setParseAction(LineTermNode)
###########################################################################################
mrrlist=[]
for t in received_sentences:
t = t.strip()
try:
parsed = line_contents_expr.parseString(t)
temp_regex = parsed[0].generate()
mrrlist.append(temp_regex)
return(mrrlist)
So basically, the code is stringing together the regex. Then I add the necessary parameters like re.search, %gs1 etc .to have the complete regex search query. I want to be able to use these string regexes for searching, hence I had earlier thought eval() would convert the string to its corresponding Python expression here, which is why I used it - I was wrong.
TL;DR - I basically have a list of strings that have been created in the source code, and I want to be able to use them as regexes, using Python's raw literal notation.
Your issue seems to stem from a misunderstanding of what raw string literals do and what they're for. There's no magic raw string type. A raw string literal is just another way of creating a normal string. A raw literal just gets parsed a little bit differently.
For instance, the raw string r"\(foo\)" can also be written "\\(foo\\)". The doubled backslashes tell Python's regular string parsing algorithm that you want an actual backslash character in the string, rather than the backslash in the literal being part of an escape sequence that gets replaced by a special character. The raw string algorithm doesn't the extra backslashes since it never replaces escape sequences.
However, in this particular case the special treatment is not actually necessary, since the \( and \) are not meaningful escape sequences in a Python string. When Python sees an invalid escape sequence, it just includes it literally (backslash and all). So you could also use "\(foo\)" (without the r prefix) and it will work just fine too.
But it's not generally a good idea to rely upon backslashes being ignored however, since if you edit the string later you might inadvertently add an escape sequence that Python does understand (when you really wanted the raw, un-transformed version). Since regex syntax has a number of its own escape sequences that are also escape sequences in Python (but with different meanings, such as \b and \1), it's a best practice to always write regex patterns with raw strings to avoid introducing issues when editing them.
Now to bring this around to the example code you've shown. I have no idea why you're using eval at all. As far as I can tell, you've mistakenly wrapped extra quotes around your regex patterns for no good reason. You're using exec to undo that wrapping. But because only the inner strings are using raw string syntax, by the time you eval them you're too late to avoid Python's string parsing messing up your literals if you have any of the troublesome escape sequences (the outer string will have already parsed \b for instance and turned it into the ASCII backspace character \x08).
You should tear the exec code out and fix your literals to avoid the extra quotes. This should work:
regex_list=[r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))', # use raw literals, with no extra quotes!
r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'] # unnecessary backslashes?
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
mo=re.search(string_regex %gs1,sent,re.I) # no eval here!
if mo:
print(mo.group())
This example works for me (it prints foo(+)). Note that you've got some extra unnecessary backslashes in your second pattern (before the spaces). Those are harmless, but might be adding even more confusion to a complicate subject (regex are notoriously hard to understand).

Pyparsing: dblQuotedString parsing differently in nestedExpr

I'm working on a grammar to parse search queries (not evaluate them, just break them into component pieces). Right now I'm working with nestedExpr, just to grab the different 'levels' of each term, but I seem to have an issue if the first part of a term is in double quotes.
A simple version of the grammar:
QUOTED = QuotedString(quoteChar = '“', endQuoteChar = '”', unquoteResults = False).setParseAction(remove_curlies)
WWORD = Word(alphas8bit + printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | QUOTED | WWORD), joinString = ' ', adjacent = False)
TERM = OneOrMore(WORDS)
NESTED = OneOrMore(nestedExpr(content = TERM))
query = '(dog* OR boy girl w/3 ("girls n dolls" OR friends OR "best friend" OR (friends w/10 enemies)))'
Calling NESTED.parseString(query) returns:
[['dog* OR boy girl w/3', ['"girls n dolls"', 'OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
The first dblQuotedString instance is separate from the rest of the term at the same nesting, which doesn't occur to the second dblQuotedString instance, and also doesn't occur if the quoted bit is a QUOTED instance (with curly quotes) rather than dblQuotedString with straight ones.
Is there something special about dblQuotedString that I'm missing?
NOTE: I know that operatorPrecedence can break up search terms like this, but I have some limits on what can be broken apart, so I'm testing if I can use nestedExpr to work within those limits.
nestedExpr takes an optional keyword argument ignoreExpr, to take an expression that nestedExpr should use to ignore characters that would otherwise be interpreted as nesting openers or closers, and the default is pyparsing's quotedString, which is defined as sglQuotedString | dblQuotedString. This is to handle strings like:
(this has a tricky string "string with )" )
Since the default ignoreExpr is quotedString, the ')' in quotes is not misinterpreted as the closing parenthesis.
However, your content argument also matches on dblQuotedString. The leading quoted string is matched internally by nestedExpr by way of skipping over quoted strings that may contain "()"s, then your content is matched, which also matches quoted strings. You can suppress nestedExpr's ignore expression using a NoMatch:
NESTED = OneOrMore(nestedExpr(content = TERM, ignoreExpr=NoMatch()))
which should now give you:
[['dog* OR boy girl w/3',
['"girls n dolls" OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
You'll find more details and examples at https://pythonhosted.org/pyparsing/pyparsing-module.html#nestedExpr

Regular Expression (find matching characters in order)

Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this expression would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the 0, and so forth.
Is there a simple way to do this using regex?
Your answer is rather trivial. The .* character combination matches 0 or more characters. For your purpose, you would put it between all characters in there. As in S.*x.*2.*0.*1.*6. If this pattern is matched, then the string obeys your condition.
For a general string you would insert the .* pattern between characters, also taking care of escaping special characters like literal dots, stars etc. that may otherwise be interpreted by regex.
This function might fit your need
import re
def check_string(text, pattern):
return re.match('.*'.join(pattern), text)
'.*'.join(pattern) create a pattern with all you characters separated by '.*'. For instance
>> ".*".join("Sx2016")
'S.*x.*2.*0.*1.*6'
Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression, checking whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
... result = [welcome.index(c) for c in string_to_find]
... except ValueError:
... result = None
...
>>> print(result and result == sorted(result))
True
Actually having a sequence of chars like Sx2016 the pattern that best serve your purpose is a more specific:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can obtain this kind of check defining a function like this:
import re
def contains_sequence(text, seq):
pattern = seq[0] + ''.join(map(lambda c: '[^' + c + ']*' + c, list(seq[1:])))
return re.search(pattern, text)
This approach add a layer of complexity but brings a couple of advantages as well:
It's the fastest one because the regex engine walk down the string only once while the dot-star approach go till the end of the sequence and back each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 step
It works on multiline strings in input as well.
Example code
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015','StackExchange2016','Stack\nExchange\n2015','Stach\nExchange\n2016']
>>> map(lambda x: x + ': yes' if contains_sequence(x,sequence) else x + ': no', inputs)
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']

python regex in pyparsing

How do you make the below regex be used in pyparsing? It should return a list of tokens given the regex.
Any help would be greatly appreciated! Thank you!
python regex example in the shell:
>>> re.split("(\w+)(lab)(\d+)", "abclab1", 3)
>>> ['', 'abc', 'lab', '1', '']
I tried this in pyparsing, but I can't seem to figure out how to get it right because the first match is being greedy, i.e the first token will be 'abclab' instead of two tokens 'abc' and 'lab'.
pyparsing example (high level, i.e non working code):
name = 'abclab1'
location = Word(alphas).setResultsName('location')
lab = CaselessLiteral('lab').setResultsName('environment')
identifier = Word(nums).setResultsName('identifier')
expr = location + lab + identifier
match, start, end = expr.scanString(name).next()
print match.asDict()
Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance, OneOrMore(Word(alphas)) + Literal('end') will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated: OneOrMore(~Literal('end') + Word(alphas)) + Literal('end') - that is, before reading another word composed of alphas, first make sure it is not the word 'end'.
This breaks down when the repetition is within a pyparsing class, like Word. Word(alphas) will continue to read alpha characters as long as there is no whitespace to stop the word. You would have to break into this repetition using something very expensive, like Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1))) - I say expensive because composition of simple tokens using complex Combine expressions will make for a slow parser.
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print labword.parseString("abclab1").asDict()
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.
If you strip the subgroup sign(the parenthesis), you'll get the right answer:)
>>> re.split("\w+lab\d+", "abclab1")
['', '']

Categories

Resources