This is a follow-up to Build a simple parser that is able to parse different date formats using PyParse.
I have a parser that should group one or more users together into a list.
So a.parser('show abc, xyz commits from "Jan 10,2015" to "27/1/2015"') should group the two usernames into a list: [abc, xyz].
For users I have:
keywords = ["select", "show", "team", "from", "to", "commits", "and", "or"]
[select, show, team, _from, _to, commits, _and, _or] = [ CaselessKeyword(word) for word in keywords ]
user = Word(alphas+"."+alphas)
user2 = Combine(user + "'s")
users = OneOrMore((user|user2))
And the grammar is
bnf = (show|select) + Group(users).setResultsName("users") + Optional(team) + (commits).setResultsName("stats") \
      + Optional(_from + quotedString.setParseAction(removeQuotes)('from') + \
                 _to + quotedString.setParseAction(removeQuotes)('to'))
This is erroneous. Can anyone guide me in the right direction?
Also, is there a way in pyparsing to selectively decide which group a word should fall under? What I mean is that 'xyz' standalone should go under my user list, but 'xyz team' should go under a team list. If the optional keyword team is provided, then pyparsing should group it differently.
I haven't been able to find what I am looking for online. Or maybe I haven't been framing my question correctly on Google?
You are on the right track, see the embedded comments in this update to your parser:
from pyparsing import *
keywords = ["select", "show", "team", "from", "to", "commits", "and", "or"]
[select, show, team, _from, _to, commits, _and, _or] = [ CaselessKeyword(word) for word in keywords ]
# define an expression to prevent matching keywords as user names - used below in users expression
keyword = MatchFirst(map(CaselessKeyword, keywords))
user = Word(alphas+"."+alphas) # ??? what are you trying to define here?
user2 = Combine(user + "'s")
# must not confuse keywords like commit with usernames - and use ungroup to
# unpack single-element token lists
users = ungroup(~keyword + (user|user2))
#~ bnf = (show|select)+Group(users).setResultsName("users")+Optional(team)+(commits).setResultsName("stats") \
#~ + Optional(_from + quotedString.setParseAction(removeQuotes)('from') +
#~ _to + quotedString.setParseAction(removeQuotes)('to'))
def convertToDatetime(tokens):
    # change this code to do your additional parsing/conversion to a Python datetime
    return tokens[0]
timestamp = quotedString.setParseAction(removeQuotes, convertToDatetime)
# similar to your expression
# - use delimitedList instead of OneOrMore to handle comma-separated list of items
# - add distinction of "xxx team" vs "xxx"
# - dropped expr.setResultsName("name") in favor of short notation expr("name")
# - results names with trailing '*' will accumulate like elements into a single
# named result (short notation for setResultsName(name, listAllValues=True) )
# - dropped setResultsName("stats") on keyword "commits", no point to this, commits must always be present
#
bnf = ((show|select)("command") + delimitedList(users("team*") + team | users("user*")) + commits +
       Optional(_from + timestamp('from') + _to + timestamp('to')))
test = 'show abc, def team, xyz commits from "Jan 10,2015" to "27/1/2015"'
print bnf.parseString(test).dump()
Prints:
['show', 'abc', 'def', 'team', 'xyz', 'commits', 'from', 'Jan 10,2015', 'to', '27/1/2015']
- command: show
- from: Jan 10,2015
- team: ['def']
- to: 27/1/2015
- user: ['abc', 'xyz']
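If you want real datetime objects in the results, here is a minimal sketch of convertToDatetime (using the third-party dateutil package is my suggestion here, and dayfirst=True is an assumption so that "27/1/2015" reads as 27 January):

from dateutil import parser as dateparser  # third-party: python-dateutil

def convertToDatetime(tokens):
    # dateutil copes with both "Jan 10,2015" and "27/1/2015" style strings;
    # dayfirst=True is an assumption about the intended day/month order
    return dateparser.parse(tokens[0], dayfirst=True)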
Related
I have been creating a few regex patterns to search a file. I basically need to search each line of a text file as a string of values. The issue I am having is that the regexes I have created work when used against a list of values; however, I cannot get the same regex to find a match when I search a string. I'm not sure what I am missing. My test code is below. The regex works against list_primary, but when I run it against string2, it does not find the date value I'm looking for.
import re
list_primary = ["Wi-Fi", "goat", "Access Point", "(683A1E320680)", "detected", "Access Point detected", "2/5/2021", "10:44:45 PM", "Local", "41.289227", "-72.958748"]
string1 = "Wi-Fi Access Point (683A1E320680) detected puppy Access Point detected 2/5/2021 10:44:45 PM Local 41.289227 -72.958748"
#Lattitude = re.findall("[0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
#Longitude = re.findall("[-][0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
string2 = string1.split('"')
# print(string2)
list1 = []
for item in string2:
    data_dict = {}
    date_field = re.search(r"(\d{1})[/.-](\d{1})[/.-](\d{4})$", item)
    print(date_field)
    if date_field is not None:
        date = date_field.group()
    else:
        date = None
For your current expression to work on the string, you need to delete the dollar sign from the end. Also, in order to find double-digit dates (meaning 11/20/2018), you need to change your repetitions, since with your regex you can only find single-digit dates like 2/5/2011:
import re
list_primary = ["Wi-Fi", "goat", "Access Point", "(683A1E320680)", "detected", "Access Point detected", "2/5/2021", "10:44:45 PM", "Local", "41.289227", "-72.958748"]
string1 = "Wi-Fi Access Point (683A1E320680) detected puppy Access Point detected 2/5/2021 10:44:45 PM Local 41.289227 -72.958748"
#Lattitude = re.findall("[0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
#Longitude = re.findall("[-][0-9][0-9][.][0-9][0-9][0-9][0-9][0-9][0-9]")
string2 = string1.split('"')
# print(string2)
list1 = []
for item in string2:
    data_dict = {}
    date_field = re.search(r"(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})", item)
    print(date_field)
    if date_field is not None:
        date = date_field.group()
    else:
        date = None
Output:
<re.Match object; span=(71, 79), match='2/5/2021'>
If you want to extract the date from your string (rather than just search if it exists), include a capturing group around your whole expression in order to see your date as one string and not as 3 different numbers:
date_field = re.findall(r"(\d{1,2}[/.-]\d{1,2}[/.-]\d{4})",string1)
print(date_field)
Output:
['2/5/2021']
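If you then want a real date object rather than the matched string, here is a sketch that continues the snippet above (the month/day/year ordering is an assumption based on the sample data):

from datetime import datetime

matches = re.findall(r"(\d{1,2}[/.-]\d{1,2}[/.-]\d{4})", string1)
if matches:
    # normalise the separators, then let strptime validate the month/day ranges
    parsed = datetime.strptime(matches[0].replace("-", "/").replace(".", "/"), "%m/%d/%Y")
    print(parsed)  # 2021-02-05 00:00:00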
I am building a simple parser that takes a query like the following:
'show fizi commits from 1/1/2010 to 11/2/2006'
So far I have:
class QueryParser(object):
    def parser(self, stmnt):
        keywords = ["select", "from", "to", "show", "commits", "where", "group by", "order by", "and", "or"]
        [select, _from, _to, show, commits, where, groupby, orderby, _and, _or] = [ CaselessKeyword(word) for word in keywords ]
        user = Word(alphas+"."+alphas)
        user2 = Combine(user + "'s")
        startdate = self.getdate()
        enddate = self.getdate()
        bnf = (show|select) + (user|user2).setResultsName("user") + (commits).setResultsName("stats") \
              + Optional(_from + startdate.setResultsName("start") + _to + enddate.setResultsName("end"))
        a = bnf.parseString(stmnt)
        return a

    def getdate(self):
        integer = Word(nums).setParseAction(lambda t: int(t[0]))
        date = Combine(integer('year') + '/' + integer('month') + '/' + integer('day'))
        #date.setParseAction(self.convertToDatetime)
        return date
I would like the dates to be more generic, meaning the user can provide 20 Jan, 2010 or some other date format. I found a good date parser online that does exactly that: it takes a date as a string and then parses it. So what I am left with is to feed that function the date string I get from my parser. How do I go about tokenizing and capturing the two date strings? For now it only captures the 'y/m/d' format. Is there a way to just get the entire string regardless of how it's formatted? Something like: capture the word right after the keywords from and to. Any help is greatly appreciated.
A simple approach is to require the date be quoted. A rough example is something like this, but you'll need to adjust to fit in with your current grammar if needs be:
from pyparsing import CaselessKeyword, quotedString, removeQuotes
from dateutil.parser import parse as parse_date
dp = (
    CaselessKeyword('from') + quotedString.setParseAction(removeQuotes)('from') +
    CaselessKeyword('to') + quotedString.setParseAction(removeQuotes)('to')
)
res = dp.parseString('from "jan 20" to "apr 5"')
from_date = parse_date(res['from'])
to_date = parse_date(res['to'])
# from_date, to_date == (datetime.datetime(2015, 1, 20, 0, 0), datetime.datetime(2015, 4, 5, 0, 0))
I suggest using something like sqlparse that already handles all the weird edge cases for you. It might be a better option in the long term, if you have to deal with more advanced cases.
EDIT: Why not just parse the date blocks as strings? Like so:
from pyparsing import CaselessKeyword, Word, Combine, Optional, alphas, nums
class QueryParser(object):
    def parser(self, stmnt):
        keywords = ["select", "from", "to", "show", "commits", "where",
                    "groupby", "order by", "and", "or"]
        [select, _from, _to, show, commits, where, groupby, orderby, _and, _or]\
            = [CaselessKeyword(word) for word in keywords]
        user = Word(alphas + "." + alphas)
        user2 = Combine(user + "'s")
        startdate = Word(alphas + nums + "/")
        enddate = Word(alphas + nums + "/")
        bnf = (
            (show | select) + (user | user2).setResultsName("user") +
            (commits).setResultsName("stats") +
            Optional(
                _from + startdate.setResultsName("start") +
                _to + enddate.setResultsName("end"))
        )
        a = bnf.parseString(stmnt)
        return a
This gives me something like:
In [3]: q.parser("show fizi commits from 1/1/2010 to 11/2/2006")
Out[3]: (['show', 'fizi', 'commits', 'from', '1/1/2010', 'to', '11/2/2006'], {'start': [('1/1/2010', 4)], 'end': [('11/2/2006', 6)], 'stats': [('commits', 2)], 'user': [('fizi', 1)]})
Then you can use libraries like delorean or arrow that try to deal intelligently with the date part - or just use regular old dateutil.
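For example, a minimal sketch (assuming the third-party python-dateutil package, and q being the QueryParser instance from above):

from dateutil.parser import parse as parse_date

res = q.parser("show fizi commits from 1/1/2010 to 11/2/2006")
# dateutil guesses the format; pass dayfirst=True if 11/2/2006 should mean 11 February
start = parse_date(res["start"])
end = parse_date(res["end"])
print start, end  # 2010-01-01 00:00:00 2006-11-02 00:00:00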
You can make the pyparsing parser very lenient in what it matches, and then have a parse action do the more rigorous value checking. This is especially easy if your date strings are all non-whitespace characters.
For example, say we wanted to parse for a month name, but for some reason did not want our parser expression to just be oneOf('January February March ...etc.'). We could put in a placeholder that will parse a Word of eligible characters, reading up to the next non-eligible character (whitespace, or punctuation).
monthName = Word(alphas.upper(), alphas.lower())
So here our month starts with a capitalized letter, followed by 0 or more lowercase letters. Obviously this will match many non-month names, so we will add a parse action to do additional validation:
def validate_month(tokens):
    import calendar
    monthname = tokens[0]
    print "check if %s is a valid month name" % monthname
    if monthname not in calendar.month_name:
        raise ParseException(monthname + " is not a valid month abbreviation")

monthName.setParseAction(validate_month)
If we do these two statements:
print monthName.parseString("January")
print monthName.parseString("Foo")
we get
check if January is a valid month name
['January']
check if Foo is a valid month name
Traceback (most recent call last):
File "dd.py", line 15, in <module>
print monthName.parseString("Foo")
File "c:\python27\lib\site-packages\pyparsing.py", line 1125, in parseString
raise exc
pyparsing.ParseException: Foo is not a valid month abbreviation (at char 0), (line:1, col:1)
(Once you are done testing, you can remove the print statement from the middle of the parse action - I just included it to show that it was being called during the parsing process.)
If you can get away with a space-delimited date format, then you could write your parser as:
date = Word(nums,nums+'/-')
and then you could accept 1/1/2001, 29-10-1929 and so forth. Again, you will also match strings like 32237--/234//234/7, obviously not a valid date, so you could write a validating parse action to check the string's validity. In the parse action, you could implement your own validating logic, or call out to an external library. (You will have to be wary of dates like '4/3/2013' if you are being tolerant of different locales, since there is variety in month-first vs. date-first options, and this string could easily mean April 3rd or March 4th.) You can also have the parse action do the actual conversion for you, so that when you process the parsed tokens, the string will be an actual Python datetime.
I have a list of tweets, from which I have to choose tweets that contain terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$", amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code; it's rather lousy, but please excuse that:
import pymongo
import re
import datetime
client = pymongo.MongoClient()
db = client.PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]
def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x,y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print mylist
    print newlist

for i in id_list:
    func(i)
This code doesn't give me any correct results, and being a noob to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice - learn regular expressions; they give you unlimited power for text processing.
But, to give you a working solution (and a starting point for further exploration), try this:
import re

re_offers = re.compile(r'''
    \b              # Word boundary
    (?:             # Non-capturing parenthesis
        deals?          # deal or deals
    |               # or ...
        offers?         # offer or offers
    |
        discount
    |
        promotion
    |
        sale
    |
        rs\.?           # rs or rs.
    |
        inr\d+          # INR then digits
    |
        \d+inr          # digits then INR
    )               # End group
    \b              # Word boundary
|                   # or ...
    \b\d+%          # digits (1 or more) then percent
|
    \$\d+\b         # dollar then digits (didn't care of thousand separator yet)
''',
    re.I | re.X)    # Ignore case, verbose format - for you :)
abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
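For reference, that final print should list the matched snippets in the order they are found in the test string, something like:
['$1', 'inr123', 'discount', '1INR', '1%', 'deal']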
You don't need to use a regular expression for this; you can use the built-in any():
if any(term in tweet for term in search_terms):
In your array of things to search for, you don't have a comma between " offers " and "discount", which causes Python to join them into a single string via implicit literal concatenation.
Also, when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I", "have", "a", "deal"], but your search terms almost all contain whitespace. So remove the spaces from your search terms in array ar1.
However, you might want to avoid using regular expressions and just use in instead (you will still need the changes I suggest above though):
if x in y:
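Putting those fixes together, a sketch (matching_terms is just an illustrative helper; note the restored comma and the spaces stripped from the search terms):

ar1 = ["deal", "deals", "offer", "offers", "discount",
       "promotion", "sale", "inr", "rs", "%", "rs."]

def matching_terms(text):
    # plain substring membership tests - no regex needed
    return [term for term in ar1 if term in text.lower()]

print matching_terms("Huge SALE this week, 20% off everything")  # ['sale', '%']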
You might want to consider starting with find instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split; instead just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        if data.find(x) != -1:
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "
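A sketch of that tip; padding the tweet text as well means a term at the very start or end of the tweet still matches:

text = " " + data.lower() + " "   # data is the raw tweet text from the loop above
if " rs " in text:                # matches 'rs' only as a whole word
    abc.append("rs")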
I'm trying to parse an XML-like file (with no associated DTD) with pyparsing. Part of each record has the following contents:
Something within <L> and </L> tags,
One or more things within <pc> and </pc> tags,
Optionally, something within <MW> and </MW> tags,
Optionally, a literal <mul/>, and optionally a literal <mat/>.
The ordering of these elements varies.
So I wrote the following (I'm new to pyparsing; please point out if I'm doing something stupid):
#!/usr/bin/env python
from pyparsing import *
def DumbTagParser(tag):
    tag_close = '</%s>' % tag
    return Group(
        Literal('<') + Literal(tag).setResultsName('tag') + Literal('>')
        + SkipTo(tag_close).setResultsName('contents')
        + Literal(tag_close)
    ).setResultsName(tag)

record1 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') & \
          DumbTagParser('L') & \
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mat/>'))

record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') & \
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mul/>')) & \
          DumbTagParser('L')

def attempt(s):
    print 'Attempting:', s
    match = record1.parseString(s, parseAll = True)
    print 'Match: ', match
    print
attempt('<L>1.1</L>')
attempt('<pc>Page1,1</pc> <pc>Page1,2</pc> <MW>000001</MW> <L>1.1</L>')
attempt('<mul/><MW>000003</MW><pc>1,1</pc><L>3.1</L>')
attempt('<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> ') # Note end space
Both parsers record1 and record2 fail, with different exceptions. With record1, it fails on the last string (which differs from the penultimate string only in spaces):
pyparsing.ParseException: (at char 47), (line:1, col:48)
and with record2, it fails on the penultimate string itself:
pyparsing.ParseException: Missing one or more required elements (Group:({"<" "L" ">" SkipTo:("</L>") "</L>"})) (at char 0), (line:1, col:1)
Now what is weird is that if I interchange lines 2 and 3 in the definition of record2, then it parses fine!
record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') & \
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          DumbTagParser('L')  # parses my example strings fine
(Yes I realise that record2 doesn't contain any rule for <mat/>. I'm trying to get a minimal example that reflects this sensitivity to reordering.)
I'm not sure if this is a bug in pyparsing or in my code, but my real question is how I should parse the kind of strings I want.
I don't know if you still want an answer, but here is my bash at it...
The problems I can see in your code are as follows:
You've assigned resultsName multiple times to multiple items; as a Dict could eventually be returned, you must either add '*' to each occurrence of resultsName or drop it from a number of elements. I'll assume you are after the content and not the tags, and drop their names. FYI, the shortcut for parser.setResultsName(name) is parser(name).
Setting the resultsName to 'contents' for everything is also a bad idea, as we would lose information already available to us. Rather, name each CONTENTS by its corresponding TAG.
You are also making multiple items Optional within the ZeroOrMore; they are already 'optional' through the ZeroOrMore, so let's allow them to be variations using the '^' operator, as there is no predefined sequence, i.e. pc tags could precede mul tags or vice versa. It seems reasonable to allow any combination and collect these as we go by.
As we also have to deal with multiples of a given tag we append '*' to the CONTENTS' resultsName so that we can collect the results into lists.
First we create a function to create a set of opening and closing tags; your DumbTagParser is now called tagset:
from pyparsing import *
def tagset(str, keywords = False):
    if keywords:
        return [Group(Literal('<') + Keyword(str) + Literal('>')).suppress(),
                Group(Literal('</') + Keyword(str) + Literal('>')).suppress()]
    else:
        return [Group(Literal('<') + Literal(str) + Literal('>')).suppress(),
                Group(Literal('</') + Literal(str) + Literal('>')).suppress()]
Next we create the parser, which will parse <tag>CONTENT</tag>, where CONTENT is the content we have an interest in, and return a dict so that we have {'pc' : CONTENT, 'MW' : CONTENT, ...}:
tagDict = {name : tagset(name) for name in ['pc','MW','L','mul','mat']}

parser = None
for name, tags in tagDict.iteritems():
    if parser:
        parser = parser ^ (tags[0] + SkipTo(tags[1])(name + '*') + tags[1])
    else:
        parser = (tags[0] + SkipTo(tags[1])(name + '*') + tags[1])

# If the <mul/> literal is deliberate...
parser = Optional(Literal('<mul/>')) + ZeroOrMore(parser)
# ...or, if the <mul/> literal was an accident, simply:
# parser = ZeroOrMore(parser)
and finally we test :
test = ['<L>1.1</L>',
        '<pc>Page1,1</pc> <pc>Page1,2</pc> <MW>000001</MW> <L>1.1</L>',
        '<mul/><MW>000003</MW><pc>1,1</pc><L>3.1</L>',
        '<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> ']

for item in test:
    print {key : val.asList() for key, val in parser.parseString(item).asDict().iteritems()}
which should produce, assuming you want a dict of lists:
{'L': ['1.1']}
{'pc': ['Page1,1', 'Page1,2'], 'MW': ['000001'], 'L': ['1.1']}
{'pc': ['1,1'], 'MW': ['000003'], 'L': ['3.1']}
{'pc': ['1,1'], 'MW': ['000003'], 'L': ['3.1']}
I'm building a syntax parser to perform simple actions on objects identified using dotted notation, something like this:
DISABLE ALL;
ENABLE A.1 B.1.1 C
but in DISABLE ALL the keyword ALL is instead matched as three instances of the Regex(r'[a-zA-Z]') I use to match arguments => 'A', 'L', 'L'.
How can I make a Word using a regex? AFAIK I can't match something like A.1.1 using Word.
Please see the example below:
import pyparsing as pp
def toggle_item_action(s, loc, tokens):
    'enable / disable a sequence of items'
    action = True if tokens[0].lower() == "enable" else False
    for token in tokens[1:]:
        print "it[%s].active = %s" % (token, action)

def toggle_all_items_action(s, loc, tokens):
    'enable / disable ALL items'
    action = True if tokens[0].lower() == "enable" else False
    print "it.enable_all(%s)" % action

expr_separator = pp.Suppress(';')

#match A
area = pp.Regex(r'[a-zA-Z]')
#match A.1
category = pp.Regex(r'[a-zA-Z]\.\d{1,2}')
#match A.1.1
criteria = pp.Regex(r'[a-zA-Z]\.\d{1,2}\.\d{1,2}')
#match any of the above
item = area ^ category ^ criteria

#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")

#actions
enable = pp.CaselessKeyword('enable')
disable = pp.CaselessKeyword('disable')
toggle = enable | disable

#toggle item expression
toggle_item = (toggle + item + pp.ZeroOrMore(item)
               ).setParseAction(toggle_item_action)

#toggle ALL items expression
toggle_all_items = (toggle + all_).setParseAction(toggle_all_items_action)

#swapping order to `toggle_all_items ^ toggle_item` works
#but seems too weak to me and error prone for future maintenance
expr = toggle_item ^ toggle_all_items
#expr = toggle_all_items ^ toggle_item

more = expr + pp.ZeroOrMore(expr_separator + expr)

more.parseString("""
    ENABLE A.1 B.1.1;
    DISABLE ALL
""", parseAll=True)
Is this the problem?
#match any of the above
item = area ^ category ^ criteria
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
Should be:
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
#match any of the above
item = area ^ category ^ criteria ^ all_
EDIT - if you're interested...
Your regexes are so similar, I thought I'd see what it would look like to combine them into one. Here is a snippet to parse out your three dotted notations using a single Regex, and then using a parse action to figure out which type you got:
import pyparsing as pp
dotted_notation = pp.Regex(r'[a-zA-Z](\.\d{1,2}(\.\d{1,2})?)?')
def name_notation_type(tokens):
    name = {0 : "area",
            1 : "category",
            2 : "criteria"}[tokens[0].count('.')]
    # assign results name to results -
    tokens[name] = tokens[0]
dotted_notation.setParseAction(name_notation_type)
# test each individually
tests = "A A.1 A.2.2".split()
for t in tests:
    print t
    val = dotted_notation.parseString(t)
    print val.dump()
    print val[0], 'is a', val.getName()
    print
# test all at once
tests = "A A.1 A.2.2"
val = pp.OneOrMore(dotted_notation).parseString(tests)
print val.dump()
Prints:
A
['A']
- area: A
A is a area
A.1
['A.1']
- category: A.1
A.1 is a category
A.2.2
['A.2.2']
- criteria: A.2.2
A.2.2 is a criteria
['A', 'A.1', 'A.2.2']
- area: A
- category: A.1
- criteria: A.2.2
EDIT2 - I see the original problem...
What is messing you up is pyparsing's implicit whitespace skipping. Pyparsing will skip over whitespace between defined tokens, but the converse is not true - pyparsing does not require whitespace between separate parser expressions. So in your all_-less version, "ALL" looks like 3 areas, "A", "L", and "L". This is true not just of Regex, but just about any pyparsing class. See if the pyparsing WordEnd class might be useful in enforcing this.
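For example (a sketch, not part of the original answer), anchoring the one-letter area with WordEnd stops it matching inside a longer word:

import pyparsing as pp

# fail the match if the letter is immediately followed by another word character,
# so 'ALL' can no longer be consumed as three one-letter areas
area = pp.Regex(r'[a-zA-Z]') + pp.WordEnd(pp.alphanums)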
EDIT3 - Then maybe something like this...
toggle_item = (toggle + pp.OneOrMore(item)).setParseAction(toggle_item_action)
toggle_all = (toggle + all_).setParseAction(toggle_all_action)
toggle_directive = toggle_all | toggle_item
The way your commands are formatted, you have to make the parser first see if ALL is being toggled before looking for individual areas, etc. If you need to support something that might read "ENABLE A.1 ALL", then use a negative lookahead for item: item = ~all_ + (area ^ etc...).
(Note also that I replaced item + pp.ZeroOrMore(item) with just pp.OneOrMore(item).)
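A sketch of that lookahead, reusing the expressions defined in the question:

# ~all_ is a negative lookahead: item refuses to match at a position where 'all' appears,
# so a command like 'ENABLE A.1 ALL' cannot swallow ALL as just more arguments
item = ~all_ + (area ^ category ^ criteria)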