I have a grammar for parsing some log files using pyparsing but am running into an issue where only the first match is being returned. Is there a way to ensure that I get exhaustive matches? Here's some code:
from pyparsing import Literal, Optional, oneOf, OneOrMore, ParserElement, Regex, restOfLine, Suppress, ZeroOrMore
ParserElement.setDefaultWhitespaceChars(' ')
dt = Regex(r'''\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) 20\d\d \d\d:\d\d:\d\d\,\d{3}''')
# TODO maybe add a parse action to make a datetime object out of the dt capture group
log_level = Suppress('[') + oneOf("INFO DEBUG ERROR WARN TRACE") + Suppress(']')
package_name = Regex(r'''(com|org|net)\.(\w+\.)+\w+''')
junk_data = Optional(Regex(r'\(.*?\)'))
guid = Regex('[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}')
first_log_line = dt.setResultsName('datetime') + \
                 log_level('log_level') + \
                 guid('guid') + \
                 junk_data('junk') + \
                 package_name('package_name') + \
                 Suppress(':') + \
                 restOfLine('message') + \
                 Suppress('\n')
additional_log_lines = Suppress('\t') + package_name + restOfLine
log_entry = (first_log_line + Optional(ZeroOrMore(additional_log_lines)))
log_batch = OneOrMore(log_entry)
In my mind, the last two lines are sort of equivalent to
log_entry := first_log_line | first_log_line additional_log_lines
additional_log_lines := additional_log_line | additional_log_line additional_log_lines
log_batch := log_entry | log_entry log_batch
Or something of the sort. Am I thinking about this wrong? I only see a single match with all of the expected tokens when I do print(log_batch.parseString(data).dump()).
Your scanString behavior is a strong clue. Suppose I wrote an expression to match one or more items, and erroneously defined my expression such that the second item in my list did not match. Then OneOrMore(expr) would fail, while expr.scanString would "succeed", in that it would give me more matches, but would still overlook the match I might have wanted, but just mis-parsed.
import pyparsing as pp
data = "AAA _AB BBB CCC"
expr = pp.Word(pp.alphas)
print(pp.OneOrMore(expr).parseString(data))
Gives:
['AAA']
At first glance, this looks like the OneOrMore is failing, whereas scanString shows more matches:
['AAA']
['AB'] <- really wanted '_AB' here
['BBB']
['CCC']
Here is a loop using scanString which prints not the matches, but the gaps between the matches, and where they start:
# loop to find non-matching parts in data
last_end = 0
for t, s, e in expr.scanString(data):
    gap = data[last_end:s]
    print(s, ':', repr(gap))
    last_end = e
Giving:
0 : ''
5 : ' _' <-- AHA!!
8 : ' '
12 : ' '
Here's another way to visualize this.
# print markers where each match begins in input string
markers = [' ']*len(data)
for t, s, e in expr.scanString(data):
    markers[s] = '^'
print(data)
print(''.join(markers))
Prints:
AAA _AB BBB CCC
^    ^  ^   ^
Your code would be a little more complex since your data spans many lines, but using pyparsing's line, lineno and col methods, you could do something similar.
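For example, a sketch along those lines (using the log_batch and data from your question) might look like:

import pyparsing as pp

# report each gap with its line and column in the original multi-line data
last_end = 0
for t, s, e in log_batch.scanString(data):
    gap = data[last_end:s]
    if gap.strip():   # only report gaps that contain non-whitespace
        print('line %d, col %d: %r' % (pp.lineno(last_end, data), pp.col(last_end, data), gap))
    last_end = e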
So, there's a workaround that seems to do the trick. For whatever reason, scanString does iterate through them all appropriately, so I can very simply get my matches in a generator with:
matches = (m for m, _, _ in log_batch.scanString(data))
Still not sure why parseString isn't working exhaustively, though, and still a bit worried that I've misunderstood something about pyparsing, so more pointers are welcome here.
Related
I'm trying to scrape dates from URL's of blogs and the like.
Since there's no universal way to get a date, I am for now, relying
on the date to be in the URL of the resource.
The dates come for the most part, in these formats:
url1 = "foo/bar/baz/2014/01/01/more/text"
url2 = "foo/bar/baz/2014/01/more/text"
url3 = "foo/bar/baz/20140101/more/text"
url4 = "foo/bar/baz/2014-01-01/more/text"
url5 = "foo/bar/baz/2014-01more/text"
url6 = "foo/bar/baz/2014_01_01/more/text"
url7 = "foo/bar/baz/2014_01/more/text"
# forgot one
url8 = "foo/bar/baz20140101more/text"
I've written a brute force code to get what I want.
It's explicit, but not elegant and probably not very robust.
I'd tried to cover the cases where I match "/" or "-" or "_", with no luck.
So I'm curious as to how one does that.
Although my main question is:
What's the best, robust way to capture dates in a URL, with the intention of converting them to datetime objects?
I don't think it's common for time-of-day elements to be in these formats.
Cheers !
UPDATE
I believe I have the solution from Casimir. I'd like to add one more
url-date format that I missed before, which might add a little trouble:
# this one may not have a regex solution. Maybe machine learning.
# and it's not that big a deal if I get the wrong day for this application.
# I think it's safe to assume that a legit date in Y/m/d form will have
# a trailing "/", i.e. /Y/m/d/
http://www.nakedcapitalism.com/2014/03/17-million-reasons-rent-control-efficient.html
2014/03/17 # group captured
2014-03-17 00:00:00 # date time object
http://www.nakedcapitalism.com/2014/11/200pm-water-cooler-11514.html
2014/11/20
2014-11-20 00:00:00
# I put more restrictions on the number matching, but perhaps there's a better way...?
pat = r'(20[0-1][0-5]([-_/]?)[0-1][0-9]\2[0-3][0-9])'
Existing ugly solution:
NOTE: I've restricted the year info, because I was capturing strings of numbers that do not represent a date. Plus I figured it was more robust that way.
def get_date_from_url(self, url):
    #pat = "(20[0-14]{2}\w+[0-9]{2}(?!\w+[0-9]{2}))"
    pat = "(20[0-1][0-5]/[0-9]{2}/[0-9]{2})"
    ob1 = re.compile(pat)
    pat = "(20[0-1][0-5]-[0-9]{2}-[0-9]{2})"
    ob2 = re.compile(pat)
    pat = "(20[0-1][0-5]_[0-9]{2}_[0-9]{2})"
    ob3 = re.compile(pat)
    pat = "(20[0-1][0-5]/[0-9]{2})"
    ob4 = re.compile(pat)
    pat = "(20[0-1][0-5]-[0-9]{2})"
    ob5 = re.compile(pat)
    pat = "(20[0-1][0-5]_[0-9]{2})"
    ob6 = re.compile(pat)
    if ob1.search(url):
        grp = ob1.search(url).group()
    elif ob2.search(url):
        grp = ob2.search(url).group()
    elif ob3.search(url):
        grp = ob3.search(url).group()
    elif ob4.search(url):
        grp = ob4.search(url).group()
    elif ob5.search(url):
        grp = ob5.search(url).group()
    elif ob6.search(url):
        grp = ob6.search(url).group()
    else:
        return None
    print url
    print grp
    grp = re.sub('_', '/', grp)  # if there is no '_' to replace, re.sub returns the original string
    date = to_datetime(grp)  # to_datetime: assumed helper, e.g. pandas.to_datetime
    if isinstance(date, datetime.datetime):
        print date
        return date
    else:
        return None
You can use this:
pat = r'(20[0-1][0-5]([-_/]?)[0-9]{2}(?:\2[0-9]{2})?)'
The delimiter is captured in group 2, so I use a backreference \2 for the second delimiter. The delimiter can be - _ or /, but it is optional too (with the ? quantifier).
This makes the day optional too, by putting it in an optional non-capturing group: (?:\2[0-9]{2})?
Note that you can add slashes at the beginning and at the end to ensure that the date is enclosed between path separators.
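For example, a quick sanity check of the pattern against a few of the sample URLs from the question (just a sketch):

import re

pat = re.compile(r'(20[0-1][0-5]([-_/]?)[0-9]{2}(?:\2[0-9]{2})?)')

urls = ["foo/bar/baz/2014/01/01/more/text",
        "foo/bar/baz/20140101/more/text",
        "foo/bar/baz/2014-01-01/more/text",
        "foo/bar/baz/2014_01/more/text"]

for url in urls:
    m = pat.search(url)
    if m:
        print(m.group(1))   # 2014/01/01, 20140101, 2014-01-01, 2014_01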
I have a list of tweets, from which I have to choose tweets that have terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$" amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code. It's rather lousy, but please excuse that
import pymongo
import re
import datetime
client = pymongo.MongoClient()
db = client.PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]
def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x, y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print mylist
    print newlist

for i in id_list:
    func(i)
This code doesn't give me any correct results, and being a noob to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice - learn regular expressions, it gives you an unlimited power of text processing.
But, to give you some working solution (and a starting point for further exploration), try this:
import re
re_offers = re.compile(r'''
\b # Word boundary
(?: # Non capturing parenthesis
deals? # Deal or deals
| # or ...
offers? # Offer or offers
|
discount
|
promotion
|
sale
|
rs\.? # rs or rs.
|
inr\d+ # INR then digits
|
\d+inr # Digits then INR
) # End of group
\b # Word boundary
| # or ...
\b\d+% # Digits (1 or more) then percent
|
\$\d+\b # Dollar then digits (not handling thousands separators yet)
''',
re.I|re.X) # Ignore case, verbose format - for you :)
abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
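Since every group in that pattern is non-capturing, findall returns the full matches, so this should print something like:
['$1', 'inr123', 'discount', '1INR', '1%', 'deal']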
You don't need to use a regular expression for this; you can use any():
if any(term in tweet for term in search_terms):
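For instance (search_terms standing in for a cleaned-up version of your ar1):

search_terms = ["deal", "offer", "discount", "sale"]
tweet = "Flash sale: 20% discount on everything!"
if any(term in tweet.lower() for term in search_terms):
    print("promotional tweet")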
In your array of things to search for, you don't have a comma between " offers " and "discount", which causes the two strings to be joined together.
Also, when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I","have","a","deal"], but your search terms almost all contain whitespace. So remove the spaces from your search terms in array ar1.
However you might want to avoid using regular expressions and just use in instead (you will still need the changes I suggest above, though):
if x in y:
You might want to consider starting with find instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split; instead just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        if data.find(x) != -1:
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "
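A minimal sketch of that idea, padding the text so space-delimited tokens can also match at the very start and end:

data = "cheap cars for sale rs 500"
tokens = [" rs ", " sale "]
padded = " " + data.lower() + " "
hits = [t.strip() for t in tokens if t in padded]
print(hits)   # ['rs', 'sale'] - note ' rs ' does not match inside 'cars'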
What I am trying to do is take user-input text which contains regex wildcards (so I need to keep them intact) and then search for the specified input. In the example I have working below, I use the pipe |.
I figured out how to make this work:
dual = 'a bunch of stuff and a bunch more stuff!'
reobj = re.compile(r'b(.*?)f|\s[a](.*?)u', re.IGNORECASE)
result = reobj.findall(dual)
for link in result:
    print link[0] + ' ' + link[1]
which returns:
unch o
nd a b
Similarly:
dual2 = 'a bunch of stuff and a bunch more stuff!'
#So I want to now send in the regex codes of my own.
userin1 = 'b(.*?)f'
userin2 = r'\s[a](.*?)u'
reobj = re.compile(userin1, re.IGNORECASE)
result = reobj.findall(dual2)
for link in result:
    print link[0] + ' ' + link[1]
Which returns:
u n
u n
I don't understand what it is doing; if I get rid of everything except link[0] in the print, I get:
u
u
I can, however, pass in a single user-input regex string:
dual = 'a bunch of stuff and a bunch more stuff!'
userinput = 'b(.*?)f'
reobj = re.compile(userinput, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
but when I try to update this to two user strings with the pipe:
dual = 'a bunch of stuff and a bunch more stuff!'
userin1 = 'b(.*?)f'
userin2 = r'\s[a](.*?)u'
reobj = re.compile(userin1|userin2, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
I get the error:
reobj = re.compile(userin1|userin2, re.IGNORECASE)
TypeError: unsupported operand type(s) for |: 'str' and 'str'
I get this error a lot, such as when I put brackets () or [] around userin1|userin2.
I have found the following:
Python regular expressions OR
but cannot get it to work.
What I would like is to understand how to pass in these regex variables combined with OR, returning all the matches of both, as well as something such as AND; in the end this will be useful as it will operate on files and let me know which files contain particular words with the various logical relations (OR, AND, etc.).
Thanks much for your thoughts,
Brian
Although I couldn't get the answer from A. Rodas to work, he gave me the idea for the .join. The example I worked out, although slightly different, returns (in link[0] and link[1]) the desired results.
userin1 = '(T.*?n)'
userin2 = '(G.*?p)'
list_patterns = [userin1,userin2]
swaplogic = '|'
string = 'What is a Torsion Abelian Group (TAB)?'
theresult = re.findall(swaplogic.join(list_patterns), string)
print theresult
for link in theresult:
    print link[0] + ' ' + link[1]
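For the AND relation mentioned above, one approach (a sketch, not the only way) is to require every pattern to match somewhere in the string:

import re

userin1 = '(T.*?n)'
userin2 = '(G.*?p)'
string = 'What is a Torsion Abelian Group (TAB)?'

# AND semantics: every user pattern must match somewhere in the text
if all(re.search(p, string) for p in [userin1, userin2]):
    print 'both patterns matched'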
I'm trying to parse an XML-like file (with no associated DTD) with pyparsing. Part of each record has the following contents:
Something within <L> and </L> tags,
One or more things within <pc> and </pc> tags,
Optionally, something within <MW> and </MW> tags,
Optionally, a literal <mul/>, and optionally a literal <mat/>
The ordering of these elements varies.
So I wrote the following (I'm new to pyparsing; please point out if I'm doing something stupid):
#!/usr/bin/env python
from pyparsing import *
def DumbTagParser(tag):
    tag_close = '</%s>' % tag
    return Group(
        Literal('<') + Literal(tag).setResultsName('tag') + Literal('>')
        + SkipTo(tag_close).setResultsName('contents')
        + Literal(tag_close)
    ).setResultsName(tag)
record1 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') & \
          DumbTagParser('L') & \
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mat/>'))

record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') & \
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mul/>')) & \
          DumbTagParser('L')
def attempt(s):
    print 'Attempting:', s
    match = record1.parseString(s, parseAll = True)
    print 'Match: ', match
    print
attempt('<L>1.1</L>')
attempt('<pc>Page1,1</pc> <pc>Page1,2</pc> <MW>000001</MW> <L>1.1</L>')
attempt('<mul/><MW>000003</MW><pc>1,1</pc><L>3.1</L>')
attempt('<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> ') # Note end space
Both parsers record1 and record2 fail, with different exceptions. With record1, it fails on the last string (which differs from the penultimate string only in spaces):
pyparsing.ParseException: (at char 47), (line:1, col:48)
and with record2, it fails on the penultimate string itself:
pyparsing.ParseException: Missing one or more required elements (Group:({"<" "L" ">" SkipTo:("</L>") "</L>"})) (at char 0), (line:1, col:1)
Now what is weird is that if I interchange lines 2 and 3 in the definition of record2, then it parses fine!
record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') & \
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          DumbTagParser('L')  # parses my example strings fine
(Yes I realise that record2 doesn't contain any rule for <mat/>. I'm trying to get a minimal example that reflects this sensitivity to reordering.)
I'm not sure if this is a bug in pyparsing or in my code, but my real question is how I should parse the kind of strings I want.
I don't know if you still want an answer, but here is my bash at it...
The problems I can see in your code are as follows:
You've assigned resultsName multiple times to multiple items. As a Dict could eventually be returned, you must either add '*' to each occurrence of resultsName or drop it from a number of elements. I'll assume you are after the content and not the tags, and drop their names. FYI, the shortcut for parser.setResultsName(name) is parser(name).
Setting the resultsName to 'contents' for everything is also a bad idea, as we would lose information that is already available to us. Rather, name each CONTENTS by its corresponding TAG.
You are also making multiple items Optional within the ZeroOrMore; they are already 'optional' through the ZeroOrMore, so let's allow them to be variations using the '^' operator, as there is no predefined sequence, i.e. pc tags could precede mul tags or vice versa. It seems reasonable to allow any combination and collect these as we go (see the short sketch of '|' vs '^' after this list).
As we also have to deal with multiples of a given tag, we append '*' to the CONTENTS' resultsName so that we can collect the results into lists.
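Here is a minimal illustration of the '|' vs '^' difference on two toy alternatives (a sketch, not your grammar):

import pyparsing as pp

word = pp.Word(pp.alphas)
word_and_num = pp.Word(pp.alphas) + pp.Word(pp.nums)

first = (word | word_and_num).parseString("abc 123")
print first     # ['abc'] - '|' (MatchFirst) commits to the first alternative that matches
longest = (word ^ word_and_num).parseString("abc 123")
print longest   # ['abc', '123'] - '^' (Or) tries every alternative and keeps the longest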
First we create a function to build a set of opening and closing tags; your DumbTagParser is now called tagset:
from pyparsing import *

def tagset(str, keywords = False):
    if keywords :
        return [Group(Literal('<') + Keyword(str) + Literal('>')).suppress(),
                Group(Literal('</') + Keyword(str) + Literal('>')).suppress()]
    else :
        return [Group(Literal('<') + Literal(str) + Literal('>')).suppress(),
                Group(Literal('</') + Literal(str) + Literal('>')).suppress()]
Next we create the parser which will parse <tag>CONTENT</tag>, where CONTENT is the content we have an interest in, returning a dict so that we have {'pc' : CONTENT, 'MW' : CONTENT, ...}:
tagDict = {name : tagset(name) for name in ['pc','MW','L','mul','mat']}

parser = None
for name, tags in tagDict.iteritems() :
    if parser :
        parser = parser ^ (tags[0] + SkipTo(tags[1])(name + '*') + tags[1])
    else :
        parser = tags[0] + SkipTo(tags[1])(name + '*') + tags[1]

# if the <mul/> tag is deliberate...
parser = Optional(Literal('<mul/>')) + ZeroOrMore(parser)
# ...or, if the <mul/> tag crept in by accident, use this instead:
# parser = ZeroOrMore(parser)
and finally we test:

test = ['<L>1.1</L>',
        '<pc>Page1,1</pc> <pc>Page1,2</pc> <MW>000001</MW> <L>1.1</L>',
        '<mul/><MW>000003</MW><pc>1,1</pc><L>3.1</L>',
        '<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> ']

for item in test :
    print {key : val.asList() for key, val in parser.parseString(item).asDict().iteritems()}
which should produce, assuming you want a dict of lists:
{'L': ['1.1']}
{'pc': ['Page1,1', 'Page1,2'], 'MW': ['000001'], 'L': ['1.1']}
{'pc': ['1,1'], 'MW': ['000003'], 'L': ['3.1']}
{'pc': ['1,1'], 'MW': ['000003'], 'L': ['3.1']}
I'm building a syntax parser to perform simple actions on objects identified using dotted notation, something like this:
DISABLE ALL;
ENABLE A.1 B.1.1 C
but in DISABLE ALL the keyword ALL is instead matched as three hits of the Regex(r'[a-zA-Z]') => 'A', 'L', 'L' that I use to match arguments.
How can I make a Word using a regex? AFAIK I can't match A.1.1 using Word.
Please see the example below:
import pyparsing as pp
def toggle_item_action(s, loc, tokens):
    'enable / disable a sequence of items'
    action = True if tokens[0].lower() == "enable" else False
    for token in tokens[1:]:
        print "it[%s].active = %s" % (token, action)

def toggle_all_items_action(s, loc, tokens):
    'enable / disable ALL items'
    action = True if tokens[0].lower() == "enable" else False
    print "it.enable_all(%s)" % action
expr_separator = pp.Suppress(';')
#match A
area = pp.Regex(r'[a-zA-Z]')
#match A.1
category = pp.Regex(r'[a-zA-Z]\.\d{1,2}')
#match A.1.1
criteria = pp.Regex(r'[a-zA-Z]\.\d{1,2}\.\d{1,2}')
#match any of the above
item = area ^ category ^ criteria
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
#actions
enable = pp.CaselessKeyword('enable')
disable = pp.CaselessKeyword('disable')
toggle = enable | disable
#toggle item expression
toggle_item = (toggle + item + pp.ZeroOrMore(item)
).setParseAction(toggle_item_action)
#toggle ALL items expression
toggle_all_items = (toggle + all_).setParseAction(toggle_all_items_action)
#swapping the order to `toggle_all_items ^ toggle_item` works,
#but that seems too weak to me and error-prone for future maintenance
expr = toggle_item ^ toggle_all_items
#expr = toggle_all_items ^ toggle_item
more = expr + pp.ZeroOrMore(expr_separator + expr)
more.parseString("""
ENABLE A.1 B.1.1;
DISABLE ALL
""", parseAll=True)
Is this the problem?
#match any of the above
item = area ^ category ^ criteria
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
Should be:
#keyword to perform action on ALL items
all_ = pp.CaselessLiteral("all")
#match any of the above
item = area ^ category ^ criteria ^ all_
EDIT - if you're interested...
Your regexes are so similar, I thought I'd see what it would look like to combine them into one. Here is a snippet to parse out your three dotted notations using a single Regex, and then using a parse action to figure out which type you got:
import pyparsing as pp
dotted_notation = pp.Regex(r'[a-zA-Z](\.\d{1,2}(\.\d{1,2})?)?')
def name_notation_type(tokens):
    name = {
        0 : "area",
        1 : "category",
        2 : "criteria"}[tokens[0].count('.')]
    # assign results name to results -
    tokens[name] = tokens[0]
dotted_notation.setParseAction(name_notation_type)
# test each individually
tests = "A A.1 A.2.2".split()
for t in tests:
    print t
    val = dotted_notation.parseString(t)
    print val.dump()
    print val[0], 'is a', val.getName()
    print
# test all at once
tests = "A A.1 A.2.2"
val = pp.OneOrMore(dotted_notation).parseString(tests)
print val.dump()
Prints:
A
['A']
- area: A
A is a area
A.1
['A.1']
- category: A.1
A.1 is a category
A.2.2
['A.2.2']
- criteria: A.2.2
A.2.2 is a criteria
['A', 'A.1', 'A.2.2']
- area: A
- category: A.1
- criteria: A.2.2
EDIT2 - I see the original problem...
What is messing you up is pyparsing's implicit whitespace skipping. Pyparsing will skip over whitespace between defined tokens, but the converse is not true - pyparsing does not require whitespace between separate parser expressions. So in your all_-less version, "ALL" looks like 3 areas, "A", "L", and "L". This is true not just of Regex, but just about any pyparsing class. See if the pyparsing WordEnd class might be useful in enforcing this.
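For example, here is a minimal sketch of the WordEnd idea (not your full grammar):

import pyparsing as pp

# a single-letter area that must end at a word boundary
area = pp.Regex(r'[a-zA-Z]') + pp.WordEnd()

print pp.OneOrMore(area).parseString("A L L")   # ['A', 'L', 'L'] - separated letters still parse
# pp.OneOrMore(area).parseString("ALL")         # now raises ParseException instead of matching 'A', 'L', 'L'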
EDIT3 - Then maybe something like this...
toggle_item = (toggle + pp.OneOrMore(item)).setParseAction(toggle_item_action)
toggle_all = (toggle + all_).setParseAction(toggle_all_action)
toggle_directive = toggle_all | toggle_item
The way your commands are formatted, you have to make the parser first see if ALL is being toggled before looking for individual areas, etc. If you need to support something that might read "ENABLE A.1 ALL", then use a negative lookahead for item: item = ~all_ + (area ^ etc...).
(Note also that I replaced item + pp.ZeroOrMore(item) with just pp.OneOrMore(item).)
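A quick sketch of that negative-lookahead idea, reusing the combined regex from the first edit:

import pyparsing as pp

all_ = pp.CaselessKeyword("all")
dotted_notation = pp.Regex(r'[a-zA-Z](\.\d{1,2}(\.\d{1,2})?)?')
item = ~all_ + dotted_notation   # an item is any dotted name that is NOT the keyword ALL

print pp.OneOrMore(item).parseString("A.1 B.1.1 C")   # ['A.1', 'B.1.1', 'C']
# pp.OneOrMore(item).parseString("ALL")               # raises ParseException - blocked by ~all_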