pyparsing Regex as Word - python

I'm building a syntax parser to perform simple actions on objects identified using dotted notation, something like this:
DISABLE ALL;
ENABLE A.1 B.1.1 C
but in DISABLE ALL the keyword ALL is instead matched as three occurrences of the Regex(r'[a-zA-Z]') I use to match arguments, i.e. 'A', 'L', 'L'.
How can I make a Word using a regex? AFAIK I can't match A.1.1 using Word.
Please see the example below:
import pyparsing as pp
def toggle_item_action(s, loc, tokens):
    'enable / disable a sequence of items'
    action = True if tokens[0].lower() == "enable" else False
    for token in tokens[1:]:
        print "it[%s].active = %s" % (token, action)

def toggle_all_items_action(s, loc, tokens):
    'enable / disable ALL items'
    action = True if tokens[0].lower() == "enable" else False
    print "it.enable_all(%s)" % action
expr_separator = pp.Suppress(';')
#match A
area = pp.Regex(r'[a-zA-Z]')
#match A.1
category = pp.Regex(r'[a-zA-Z]\.\d{1,2}')
#match A.1.1
criteria = pp.Regex(r'[a-zA-Z]\.\d{1,2}\.\d{1,2}')
#match any of the above
item = area ^ category ^ criteria
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
#actions
enable = pp.CaselessKeyword('enable')
disable = pp.CaselessKeyword('disable')
toggle = enable | disable
#toggle item expression
toggle_item = (toggle + item + pp.ZeroOrMore(item)
               ).setParseAction(toggle_item_action)
#toggle ALL items expression
toggle_all_items = (toggle + all_).setParseAction(toggle_all_items_action)
#swapping order to `toggle_all_items ^ toggle_item` works
#but that seems too weak and error-prone for future maintenance to me
expr = toggle_item ^ toggle_all_items
#expr = toggle_all_items ^ toggle_item
more = expr + pp.ZeroOrMore(expr_separator + expr)
more.parseString("""
ENABLE A.1 B.1.1;
DISABLE ALL
""", parseAll=True)

Is this the problem?
#match any of the above
item = area ^ category ^ criteria
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
Should be:
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
#match any of the above
item = area ^ category ^ criteria ^ all_
EDIT - if you're interested...
Your regexes are so similar, I thought I'd see what it would look like to combine them into one. Here is a snippet to parse out your three dotted notations using a single Regex, and then using a parse action to figure out which type you got:
import pyparsing as pp
dotted_notation = pp.Regex(r'[a-zA-Z](\.\d{1,2}(\.\d{1,2})?)?')
def name_notation_type(tokens):
    name = {
        0 : "area",
        1 : "category",
        2 : "criteria"}[tokens[0].count('.')]
    # assign results name to results -
    tokens[name] = tokens[0]
dotted_notation.setParseAction(name_notation_type)
# test each individually
tests = "A A.1 A.2.2".split()
for t in tests:
    print t
    val = dotted_notation.parseString(t)
    print val.dump()
    print val[0], 'is a', val.getName()
    print
# test all at once
tests = "A A.1 A.2.2"
val = pp.OneOrMore(dotted_notation).parseString(tests)
print val.dump()
Prints:
A
['A']
- area: A
A is a area
A.1
['A.1']
- category: A.1
A.1 is a category
A.2.2
['A.2.2']
- criteria: A.2.2
A.2.2 is a criteria
['A', 'A.1', 'A.2.2']
- area: A
- category: A.1
- criteria: A.2.2
EDIT2 - I see the original problem...
What is messing you up is pyparsing's implicit whitespace skipping. Pyparsing will skip over whitespace between defined tokens, but the converse is not true - pyparsing does not require whitespace between separate parser expressions. So in your all_-less version, "ALL" looks like 3 areas, "A", "L", and "L". This is true not just of Regex, but just about any pyparsing class. See if the pyparsing WordEnd class might be useful in enforcing this.
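For instance, a minimal sketch of that idea (my own, not from the original post): appending WordEnd() keeps the single-letter area expression from matching the 'A' inside "ALL".

```python
import pyparsing as pp

# Sketch: require a word boundary after the single-letter area match,
# so "ALL" can no longer be consumed letter by letter.
area = pp.Regex(r'[a-zA-Z]') + pp.WordEnd()

print(area.parseString("A"))  # ['A']
try:
    area.parseString("ALL")
except pp.ParseException:
    print("'ALL' is no longer parsed as a bare area")
```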
EDIT3 - Then maybe something like this...
toggle_item = (toggle + pp.OneOrMore(item)).setParseAction(toggle_item_action)
toggle_all = (toggle + all_).setParseAction(toggle_all_action)
toggle_directive = toggle_all | toggle_item
The way your commands are formatted, you have to make the parser first see if ALL is being toggled before looking for individual areas, etc. If you need to support something that might read "ENABLE A.1 ALL", then use a negative lookahead for item: item = ~all_ + (area ^ etc...).
(Note also that I replaced item + pp.ZeroOrMore(item) with just pp.OneOrMore(item).)
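To illustrate the negative-lookahead variant, here is a small self-contained sketch (simplified to single-letter areas, with the toggle parse actions omitted):

```python
import pyparsing as pp

# Simplified sketch of the EDIT3 idea: try ALL first, and guard individual
# items with a negative lookahead so ALL is never eaten letter by letter.
all_ = pp.CaselessKeyword("all")
area = pp.Regex(r'[a-zA-Z]\b')   # single-letter area, must end at a word boundary
item = ~all_ + area              # an item is anything that is not the ALL keyword
toggle = pp.CaselessKeyword("enable") | pp.CaselessKeyword("disable")
cmd = toggle + (all_ | pp.OneOrMore(item))

print(cmd.parseString("ENABLE A B"))   # ['enable', 'A', 'B']
print(cmd.parseString("DISABLE ALL"))  # ['disable', 'all']
```

Note that CaselessKeyword returns the keyword as it was defined (lowercase here), regardless of the input's case.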

Related

Exhaustively parse file for all matches

I have a grammar for parsing some log files using pyparsing but am running into an issue where only the first match is being returned. Is there a way to ensure that I get exhaustive matches? Here's some code:
from pyparsing import Literal, Optional, oneOf, OneOrMore, ParserElement, Regex, restOfLine, Suppress, ZeroOrMore
ParserElement.setDefaultWhitespaceChars(' ')
dt = Regex(r'''\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) 20\d\d \d\d:\d\d:\d\d\,\d{3}''')
# TODO maybe add a parse action to make a datetime object out of the dt capture group
log_level = Suppress('[') + oneOf("INFO DEBUG ERROR WARN TRACE") + Suppress(']')
package_name = Regex(r'''(com|org|net)\.(\w+\.)+\w+''')
junk_data = Optional(Regex(r'\(.*?\)'))
guid = Regex('[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}')
first_log_line = dt.setResultsName('datetime') + \
                 log_level('log_level') + \
                 guid('guid') + \
                 junk_data('junk') + \
                 package_name('package_name') + \
                 Suppress(':') + \
                 restOfLine('message') + \
                 Suppress('\n')
additional_log_lines = Suppress('\t') + package_name + restOfLine
log_entry = (first_log_line + Optional(ZeroOrMore(additional_log_lines)))
log_batch = OneOrMore(log_entry)
In my mind, the last two lines are sort of equivalent to
log_entry := first_log_line | first_log_line additional_log_lines
additional_log_lines := additional_log_line | additional_log_line additional_log_lines
log_batch := log_entry | log_entry log_batch
Or something of the sort. Am I thinking about this wrong? I only see a single match with all of the expected tokens when I do print(log_batch.parseString(data).dump()).
Your scanString behavior is a strong clue. Suppose I wrote an expression to match one or more items, and erroneously defined my expression such that the second item in my list did not match. Then OneOrMore(expr) would fail, while expr.scanString would "succeed", in that it would give me more matches, but would still overlook the match I might have wanted, but just mis-parsed.
import pyparsing as pp
data = "AAA _AB BBB CCC"
expr = pp.Word(pp.alphas)
print(pp.OneOrMore(expr).parseString(data))
Gives:
['AAA']
At first glance, this looks like the OneOrMore is failing, whereas scanString shows more matches:
['AAA']
['AB'] <- really wanted '_AB' here
['BBB']
['CCC']
Here is a loop using scanString which prints not the matches, but the gaps between the matches, and where they start:
# loop to find non-matching parts in data
last_end = 0
for t,s,e in expr.scanString(data):
    gap = data[last_end:s]
    print(s, ':', repr(gap))
    last_end = e
Giving:
0 : ''
5 : ' _' <-- AHA!!
8 : ' '
12 : ' '
Here's another way to visualize this.
# print markers where each match begins in input string
markers = [' ']*len(data)
for t,s,e in expr.scanString(data):
    markers[s] = '^'
print(data)
print(''.join(markers))
Prints:
AAA _AB BBB CCC
^    ^  ^   ^
Your code would be a little more complex since your data spans many lines, but using pyparsing's line, lineno and col methods, you could do something similar.
So, there's a workaround that seems to do the trick. For whatever reason, scanString does iterate through them all appropriately, so I can very simply get my matches in a generator with:
matches = (m for m, _, _ in log_batch.scanString(data))
Still not sure why parseString isn't working exhaustively, though, and still a bit worried that I've misunderstood something about pyparsing, so more pointers are welcome here.
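One way to see exactly where parseString gives up, independent of scanString: pass parseAll=True and let the resulting ParseException report the failure point. A small sketch with toy data (mine, not the original log grammar):

```python
import pyparsing as pp

data = "AAA _AB BBB CCC"
expr = pp.Word(pp.alphas)

# Without parseAll, OneOrMore quietly stops at the first text it cannot match:
print(pp.OneOrMore(expr).parseString(data))  # ['AAA']

# With parseAll=True, the leftover text raises a ParseException whose
# location points near the offending character:
try:
    pp.OneOrMore(expr).parseString(data, parseAll=True)
except pp.ParseException as pe:
    print("parse stopped at column", pe.col)
```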

Regex match specific word in string but exclude indexed versions

I'm sure that if a solution exists for this then it's out there somewhere, but I can't find it. I've followed Python regex to match a specific word and had success in the first aspect, but now am struggling with the second aspect.
I've inherited a horrible file format where each test result is on its own line. They are limited to 12 chars per record, so some results are split across groups of lines, e.g. SITE, SITE1 and SITE2. I'm trying to parse the file into a dictionary so I can do more analysis with it and ultimately produce a formatted report.
The link above / code below allows me to match each SITE and concatenate them together, but it's giving me problems matching INS, INS 1 and INS 2 correctly. Yes, the space is intentional - it's what I have to deal with. INS is the test result and INS 1 is the limit of the test for a pass.
Is there a regular expression that would match
SITE in "SITE" (True) but not in "SITE1" (False)
and
INS in "INS" (True) but not in "INS 1" (False)?
Here is the python code.
import re
lines = ['SITE start', 'SITE1 more', 'SITE2 end','INS value1', 'INS 1 value2']
headings = ['SITE','SITE1',"SITE2", "INS", "INS 1"]
for line in lines:
    for heading in headings:
        headregex = r"\b" + heading + r"\b"
        match = re.search(headregex, line)
        if match:
            print "Found " + heading + " " + line
        else:
            print "Not Found " + heading + " " + line
And here is some dummy data:
TEST MODE 131 AUTO
SITE startaddy
SITE1 middle addy
SITE2 end addy
USER DB
VISUAL CHECK P
BOND RANGE 25A
EARTH 0.09 OHM P
LIMIT 0.10 OHM
INS 500 V
INS 1 >299 MEG P
...
TEST MODE 231 AUTO
SITE startaddy
SITE1 middle addy
SITE2 end addy
USER DB
VISUAL CHECK P
INS 500 V
INS 2 >299 MEG P
...
Sorry for the horrid formatting - its copied and pasted from what I am dealing with!
The problem is that the INS pattern finds a partial match inside INS 1, INS 2, etc.
When extracting alternatives, it is customary to order the alternation starting with the longest value (like INS \d+|INS), but in this case you are looking to obtain a list of all regex matches, only excluding some overlapping heading matches.
To achieve that, there is a way to exclude that match by treating all headings items as regular expressions, and define the INS pattern as INS(?! \d) to make sure INS is not matched if it is followed with a space and a digit.
See the Python demo:
import re
lines = ['SITE start', 'SITE1 more', 'SITE2 end','INS value1', 'INS 1 value2']
headings = ['SITE','SITE1',"SITE2", r"INS(?! \d)", "INS 1"]
headings=sorted(headings, key=lambda x: len(x), reverse=True)
for line in lines:
    print("----")
    for heading in headings:
        headregex = r"\b{}\b".format(heading)
        match = re.search(headregex, line)
        if match:
            print("Found " + heading + " " + line)
        else:
            print("Not Found " + heading + " " + line)
Just to give an answer that might solve the problem while avoiding some of the tediousness, is this what you are trying to achieve?
import re
lines = ['SITE start', 'SITE1 more', 'SITE2 end','INS value1', 'INS 1 value2']
headings = ['SITE','SITE1',"SITE2", "INS", "INS 1"]
headings_re = re.compile(r"(SITE\d? )?(INS( \d)? )?(.*)")
# build by hand, only works if SITE and INS are the literal identifiers
site = []
ins = []
for line in lines:
    match = headings_re.match(line)
    if match:
        if match.group(1):
            site.append(match.group(4))
        elif match.group(2):
            ins.append(match.group(4))
        else:
            print("something weird happened")
            print(match.group(0))
    else:
        print("something weird happened")
        print(line)
print("SITE: {}".format(" ".join(site)))
>> SITE: start more end
print("INS: {}".format(" ".join(ins)))
>> INS: value1 value2

Parse and group multiple items together using Pyparse

This builds on Build a simple parser that is able to parse different date formats using PyParse
I have a parser that should group one or more users together into a list
So a.parser('show abc, xyz commits from "Jan 10,2015" to "27/1/2015"') should group the two usernames into a list [abc,xyz]
For users I have:
keywords = ["select", "show", "team", "from", "to", "commits", "and", "or"]
[select, show, team, _from, _to, commits, _and, _or] = [ CaselessKeyword(word) for word in keywords ]
user = Word(alphas+"."+alphas)
user2 = Combine(user + "'s")
users = OneOrMore((user|user2))
And the grammar is
bnf = (show|select) + Group(users).setResultsName("users") + Optional(team) + (commits).setResultsName("stats") \
      + Optional(_from + quotedString.setParseAction(removeQuotes)('from') + \
                 _to + quotedString.setParseAction(removeQuotes)('to'))
This is erroneous. Can anyone guide me in the right direction.
Also, is there a way in pyparse to selectively decide which group the word should fall under. What I mean is that 'xyz' standalone should go under my user list. But 'xyz team' should go under a team list. If the optional keyword team is provided then pyparse should group it differently.
I haven't been able to find what I am looking for online. Or maybe I haven't been framing my question correctly on Google?
You are on the right track, see the embedded comments in this update to your parser:
from pyparsing import *
keywords = ["select", "show", "team", "from", "to", "commits", "and", "or"]
[select, show, team, _from, _to, commits, _and, _or] = [ CaselessKeyword(word) for word in keywords ]
# define an expression to prevent matching keywords as user names - used below in users expression
keyword = MatchFirst(map(CaselessKeyword, keywords))
user = Word(alphas+"."+alphas) # ??? what are you trying to define here?
user2 = Combine(user + "'s")
# must not confuse keywords like commit with usernames - and use ungroup to
# unpack single-element token lists
users = ungroup(~keyword + (user|user2))
#~ bnf = (show|select)+Group(users).setResultsName("users")+Optional(team)+(commits).setResultsName("stats") \
#~ + Optional(_from + quotedString.setParseAction(removeQuotes)('from') +
#~ _to + quotedString.setParseAction(removeQuotes)('to'))
def convertToDatetime(tokens):
    # change this code to do your additional parsing/conversion to a Python datetime
    return tokens[0]
timestamp = quotedString.setParseAction(removeQuotes, convertToDatetime)
# similar to your expression
# - use delimitedList instead of OneOrMore to handle comma-separated list of items
# - add distinction of "xxx team" vs "xxx"
# - dropped expr.setResultsName("name") in favor of short notation expr("name")
# - results names with trailing '*' will accumulate like elements into a single
# named result (short notation for setResultsName(name, listAllValues=True) )
# - dropped setResultsName("stats") on keyword "commits", no point to this, commits must always be present
#
bnf = ((show|select)("command") + delimitedList(users("team*") + team | users("user*")) + commits +
       Optional(_from + timestamp('from') + _to + timestamp('to')))
test = 'show abc, def team, xyz commits from "Jan 10,2015" to "27/1/2015"'
print bnf.parseString(test).dump()
Prints:
['show', 'abc', 'def', 'team', 'xyz', 'commits', 'from', 'Jan 10,2015', 'to', '27/1/2015']
- command: show
- from: Jan 10,2015
- team: ['def']
- to: 27/1/2015
- user: ['abc', 'xyz']
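If delimitedList is new to you, it is essentially shorthand for expr + ZeroOrMore(Suppress(',') + expr); a minimal illustration:

```python
import pyparsing as pp

# delimitedList matches comma-separated items and suppresses the commas
names = pp.delimitedList(pp.Word(pp.alphas))
print(names.parseString("abc, def, xyz"))  # ['abc', 'def', 'xyz']
```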

Capture the text inside square brackets using a regex

I saw question here:
Regex to capture {}
which is similar to what I want, but I cannot get it to work.
My data is:
[Honda] Japanese manufacturer [VTEC] Name of electronic lift control
And I want the output to be
[Honda], [VTEC]
My expression is:
m = re.match(r'(\[[^\[\]]*\])', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
I would expect:
m.group(0) to output [Honda]
m.group(1) to output [VTEC]
However both output [Honda]. How can I access the second match?
You only have one group in your expression, so you can only ever get that one group. Group 1 is the capturing group, group 0 is the whole matched text; in your expression they are one and the same. Had you omitted the (...) parentheses, you'd only have a group 0.
If you wanted to get all matches, use re.findall(). This returns a list of matching groups (or group 0, if there are no capturing groups in your expression):
>>> import re
>>> re.findall(r'\[[^\[\]]*\]', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
['[Honda]', '[VTEC]']
You can use re.findall to get all the matches, though you'll get them in a list, and you don't need capture groups:
m = re.findall(r'\[[^\[\]]*\]', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
Gives ['[Honda]', '[VTEC]'] so you can get each with:
print(m[0])
# => [Honda]
print(m[1])
# => [VTEC]
If you are considering other than re:
s = "[Honda] Japanese manufacturer [VTEC] Name of electronic lift control"
result = []
tempStr = ""
flag = False
for i in s:
    if i == '[':
        flag = True
    elif i == ']':
        flag = False
    elif flag:
        tempStr = tempStr + i
    elif tempStr != "":
        result.append(tempStr)
        tempStr = ""
print result
Output:
['Honda', 'VTEC']
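For completeness, the regex approach can also drop the brackets directly by putting the capture group inside them (same character class as above, just with the parentheses moved):

```python
import re

s = "[Honda] Japanese manufacturer [VTEC] Name of electronic lift control"
# capture only the text between the brackets
print(re.findall(r'\[([^\[\]]*)\]', s))  # ['Honda', 'VTEC']
```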

Python: Match strings for certain terms

I have a list of tweets, from which I have to choose tweets that have terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$" amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code. It's rather lousy, but please excuse that
import pymongo
import re
import datetime

client = pymongo.MongoClient()
db = client.PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]

def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x,y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print mylist
    print newlist

for i in id_list:
    func(i)
This code doesn't give me any correct results, and being a noob to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice - learn regular expressions, it gives you an unlimited power of text processing.
But, to give you some working solution (and start point to further exploration) try this:
import re
re_offers = re.compile(r'''
    \b          # Word boundary
    (?:         # Non-capturing parenthesis
        deals?      # Deal or deals
    |           # or ...
        offers?     # Offer or offers
    |
        discount
    |
        promotion
    |
        sale
    |
        rs\.?       # rs or rs.
    |
        inr\d+      # INR then digits
    |
        \d+inr      # Digits then INR
    )           # End group
    \b          # Word boundary
    |           # or ...
    \b\d+%      # Digits (1 or more) then percent
    |
    \$\d+\b     # Dollar then digits (didn't care about thousands separators yet)
    ''',
    re.I|re.X)  # Ignore case, verbose format - for you :)
abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
You don't need to use a regular expression for this; you can use any():
if any(term in tweet for term in search_terms):
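For example (with made-up sample tweets):

```python
search_terms = ["sale", "discount", "offer"]
tweets = ["Big sale today!", "Just had lunch", "20% discount on shoes"]

# keep tweets containing any of the search terms (simple substring test)
matching = [t for t in tweets if any(term in t.lower() for term in search_terms)]
print(matching)  # ['Big sale today!', '20% discount on shoes']
```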
In your array of things to search for, you don't have a comma between " offers " and "discount", which causes them to be joined together by implicit string concatenation.
Also, when you use split you are getting rid of the whitespace in your input text. "I have a deal" becomes ["I", "have", "a", "deal"], but your search terms almost all contain whitespace. So remove the spaces from your search terms in array ar1.
However you might want to avoid using regular expressions and just use in instead (you will still need the changes I suggest above, though):
if x in y:
You might want to consider starting with find instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split; instead just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        if data.find(x) != -1:
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "
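A quick demonstration of why the padding spaces matter (hypothetical text):

```python
text = "new offers on cars"
print("rs" in text)    # True  - "rs" occurs inside "offers" and "cars"
print(" rs " in text)  # False - no standalone "rs" word
```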
