Python Pyparsing Optional field


I am currently using the pyparsing module to create my own interpreter. My current code is:
import pyparsing as pp
# To parse a packet with srcip,dstip,srcmac,dstmac
identifier = pp.Word(pp.alphas,pp.alphanums+'_')
ipfield = pp.Word(pp.nums,max=3)
ipAddr = pp.Combine(ipfield+"."+ipfield+"."+ipfield+"."+ipfield)
hexint = pp.Word(pp.hexnums,exact=2)
macAddr = pp.Combine(hexint+(":"+hexint)*5)
ip = pp.Combine(identifier+"="+ipAddr)
mac = pp.Combine(identifier+"="+macAddr)
pkt = ip & ip & mac & mac
arg = "createpacket<" + pp.Optional(pkt) + ">"
arg.parseString("createpacket<srcip=192.168.1.3dstip=192.168.1.4srcmac=00:FF:FF:FF:FF:00>")
When I run the last line of the code to parse an example string, I get the following error:
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pyparsing.py", line 1041, in parseString
    raise exc
pyparsing.ParseException: Expected ">" (at char 13), (line:1, col:14)
Can anyone explain the reason for this error?

Related

PyParsing and multi-line syslog messages

I have copy-pasted a PyParsing syslog parser from here and there.
It's all nice and fluffy, but I have some Syslog messages that look non-compliant to the "standard":
Apr 2 09:23:09 dawn Java App[537]: [main] ERROR ch.java.core.Verifier - Unknown validation error
java.lang.NullPointerException
at org.databin.cms.CMSSignedData.getSignedData(Unknown Source)
at org.databin.cms.CMSSignedData.<init>(Unknown Source)
at org.databin.cms.CMSSignedData.<init>(Unknown Source)
And so on. Now with my PyParsing grammar I go through syslog.log line by line.
def main():
    with open("system.log", "r") as myfile:
        data = myfile.readlines()
    pattern = Parser()._pattern
    pattern.runTests(data)
if __name__ == '__main__':
    main()
I somehow need to handle multi-line syslog messages. Either I need to attach the lines of these Java exceptions to the syslog message that has already been parsed, or I need to make the left side of the pattern optional.
I don't know. Right now my implementation fails, because it assumes a new line is logged by a new app. Which would be... usual... unless Java...
Traceback (most recent call last):
  File "/Users/wishi/PycharmProjects/Sparky_1/syslog_to_spark.py", line 39, in <module>
    main()
  File "/Users/wishi/PycharmProjects/Sparky_1/syslog_to_spark.py", line 34, in main
    pattern.runTests(data)
  File "/Users/wishi/anaconda2/envs/sparky/lib/python2.7/site-packages/pyparsing.py", line 2305, in runTests
    if comment is not None and comment.matches(t, False) or comments and not t:
  File "/Users/wishi/anaconda2/envs/sparky/lib/python2.7/site-packages/pyparsing.py", line 2205, in matches
    self.parseString(_ustr(testString), parseAll=parseAll)
  File "/Users/wishi/anaconda2/envs/sparky/lib/python2.7/site-packages/pyparsing.py", line 1622, in parseString
    loc, tokens = self._parse( instring, 0 )
  File "/Users/wishi/anaconda2/envs/sparky/lib/python2.7/site-packages/pyparsing.py", line 1383, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/Users/wishi/anaconda2/envs/sparky/lib/python2.7/site-packages/pyparsing.py", line 2410, in parseImpl
    if (instring[loc] == self.firstMatchChar and
IndexError: string index out of range
Does anyone know a simple way to avoid failure here?
from pyparsing import Word, alphas, Suppress, Combine, nums, string, Regex, Optional, ParserElement, LineEnd, OneOrMore, \
    unicodeString, White
import sys
from datetime import datetime
class Parser(object):
    # log lines don't include the year, but if we don't provide one, datetime.strptime will assume 1900
    ASSUMED_YEAR = str(datetime.now().year)
    def __init__(self):
        ints = Word(nums)
        ParserElement.setDefaultWhitespaceChars(" \t")
        NL = Suppress(LineEnd())
        unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                                     if not unichr(c).isspace())
        # priority
        # priority = Suppress("<") + ints + Suppress(">")
        # timestamp
        month = Word(string.ascii_uppercase, string.ascii_lowercase, exact=3)
        day = ints
        hour = Combine(ints + ":" + ints + ":" + ints)
        timestamp = month + day + hour
        # a parse action will convert this timestamp to a datetime
        timestamp.setParseAction(
            lambda t: datetime.strptime(Parser.ASSUMED_YEAR + ' ' + ' '.join(t), '%Y %b %d %H:%M:%S'))
        # hostname
        # usually hostnames follow some convention
        hostname = Word(alphas + nums + "_-.")
        # appname
        # if you call your app "my big fat app with a very long name" go away
        appname = (Word(alphas + nums + "/-_.()") + Optional(Word(" ")) + Optional(Word(alphas + nums + "/-_.()")))(
            "appname") + (Suppress("[") + ints("pid") + Suppress("]")) | (Word(alphas + "/-_.")("appname"))
        appname.setName("appname")
        # message
        # supports messages with printed unicode
        message = Combine(OneOrMore(Word(unicodePrintables) | OneOrMore("\t") | OneOrMore(" "))) + Suppress(OneOrMore(NL))
        messages = OneOrMore(message) # does not work
        # pattern build
        # (add results names to make it easier to access parsed fields)
        self._pattern = timestamp("timestamp") + hostname("hostname") + Optional(appname) + Optional(Suppress(':')) + messages("message")
    def parse(self, line):
        if line.strip():
            parsed = self._pattern.parseString(line)
            return parsed.asDict()
The partly parsed result is:
[datetime.datetime(2018, 4, 2, 9, 23, 9), 'dawn', 'Java', 'App', '537', '[main] ERROR ch.databin.core.Verifier - Unknown validation error']
- appname: ['Java', 'App']
- hostname: 'dawn'
- message: '[main] ERROR ch.databin.core.Verifier - Unknown validation error'
- pid: '537'
- timestamp: datetime.datetime(2018, 4, 2, 9, 23, 9)
It only contains the first line.
So for syslog messages without linebreaks this works.

The simplest solution is to go back to parsing one line at a time and keep the valid log lines in a list. If you get a valid log line, append it to the list; if you don't, append the line to the 'message' item of the last entry in the list.
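from pyparsing import ParseException
def main():
    valid_log_lines = []
    with open("system.log", "r") as myfile:
        data = myfile.read()
    # parse() is defined on the Parser class, not on the pyparsing pattern
    parser = Parser()
    for line in data.splitlines():
        try:
            log_dict = parser.parse(line)
            if log_dict is None:
                # blank line: parse() returned nothing
                continue
        except ParseException:
            # not a valid log line: treat it as a continuation of the previous message
            if valid_log_lines:
                valid_log_lines[-1]['message'] += '\n' + line
        else:
            valid_log_lines.append(log_dict)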
To speed up detection of invalid lines, try adding timestamp.leaveWhitespace(), so that any line that does not start with a timestamp in column 1 will immediately fail.
Alternatively, you can modify your parser to handle multi-line log messages, but that is a longer topic.
I like that you were using runTests, but that is more of a development tool; in your actual code, probably use parseString or one of its ilk.
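The line-at-a-time approach described in this answer can be sketched in a self-contained way. The sketch below is written for Python 3 and swaps the pyparsing grammar for a plain regular expression so that it runs on its own; the `HEADER` pattern, the `group_log_lines` helper, and the sample lines are assumptions made for illustration, not part of the original parser.

```python
import re

# Assumed, simplified header shape: "Apr 2 09:23:09 host rest-of-line".
HEADER = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<hostname>\S+) (?P<message>.*)$"
)


def group_log_lines(lines):
    """Attach lines that do not start with a timestamp to the previous entry."""
    entries = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        match = HEADER.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            # continuation line, e.g. a frame of a Java stack trace
            entries[-1]["message"] += "\n" + line
    return entries


# Made-up sample input for illustration.
sample = [
    "Apr 2 09:23:09 dawn Java App[537]: [main] ERROR Unknown validation error",
    "java.lang.NullPointerException",
    "at org.databin.cms.CMSSignedData.getSignedData(Unknown Source)",
    "Apr 2 09:23:10 dawn kernel: next message",
]
entries = group_log_lines(sample)
print(len(entries))
```

Continuation lines that appear before any valid header are dropped here, which mirrors the `if valid_log_lines:` guard in the answer.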

Python pyparsing issue

I am very new to Python and am using pyparsing, but I am getting an exception with the following code:
while site_contents.find('---', line_end) != line_end + 2:
    cut_start = site_contents.find(" ", site_contents.find("\r\n", start))
    cut_end = site_contents.find(" ", cut_start+1)
    line_end = site_contents.find("\r\n", cut_end)
    name = site_contents[cut_start:cut_end].strip()
    float_num = Word(nums + '.').setParseAction(lambda t:float(t[0]))
    nonempty_line = Literal(name) + Word(nums+',') + float_num + Suppress(Literal('-')) + float_num * 2
    empty_line = Literal(name) + Literal('-')
    line = nonempty_line | empty_line
    parsed = line.parseString(site_contents[cut_start:line_end])
    start = line_end
Exception
Traceback (most recent call last):
  File "D:\Ecllipse_Python\HellloWorld\src\HelloPython.py", line 108, in <module>
    parsed = line.parseString(site_contents[cut_start:line_end]) # parse line of data following cut name
  File "C:\Users\arbatra\AppData\Local\Continuum\Anaconda\lib\site-packages\pyparsing.py", line 1041, in parseString
    raise exc
pyparsing.ParseException: Expected W:(0123...) (at char 38), (line:1, col:39)
How can I resolve this issue?

You'll get a somewhat better exception message if you give names to your expressions, using setName. From the "Expected W:(0123...)" part of the exception message, it looks like the parser is not finding a numeric value where one is expected. But the default name does not tell us which type of numeric field is expected. Modify your parser to add setName as shown below, and also change the definition of nonempty_line:
float_num = Word(nums + '.').setParseAction(lambda t:float(t[0])).setName("float_num")
integer_with_commas = Word(nums + ',').setName("int_with_commas")
nonempty_line = Literal(name) + integer_with_commas + float_num + Suppress(Literal('-')) + float_num * 2
I would also preface the call to parseString with:
print site_contents[cut_start:line_end]
at least while you are debugging. Then you can compare the string being parsed with the error message, including the column number where the parse error is occurring, as given in your posted example as "(at char 38), (line:1, col:39)". "char xx" starts with the first character as "char 0"; "col:xx" starts with the first column as "col:1".
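The relationship between the two numbering schemes can be sketched without pyparsing. The helper below is hypothetical (it is not a pyparsing function) and assumes a single-line input; it only mirrors the convention described above, and the sample text is made up for illustration.

```python
def describe_position(text, loc):
    """Return (col, marker_line) for a 0-based character offset in one-line text."""
    col = loc + 1             # columns are 1-based, character offsets are 0-based
    marker = " " * loc + "^"  # the caret ends up directly under text[loc]
    return col, marker


text = "CUT_A 1,234 x.5 - 1.0 2.0"  # made-up input for illustration
loc = 12
col, marker = describe_position(text, loc)
print(text)
print(marker)
print("char %d, col %d" % (loc, col))
```

Printed in a monospaced font, the caret lines up under the character at which the parse would have stopped.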
These code changes might help you pinpoint your problem:
print "12345678901234567890123456789012345678901234567890"
print site_contents[cut_start:line_end]
try:
    parsed = line.parseString(site_contents[cut_start:line_end])
except ParseException as pe:
    print pe.loc*' ' + '^'
    print pe
Be sure to run this in a window that uses a monospaced font (so that all the character columns line up, and all characters are the same width as each other).
Once you've done this, you may have enough information to fix the problem yourself, or you'll have some better output to edit into your original question so we can help you better.

Packet capture with python

I am trying to use dpkt and pcapy to listen for HTTP responses on an interface
import dpkt
import pcapy
cap = pcapy.open_live('eth0',10000,1,0)
(header,payload) = cap.next()
while header:
    packet = dpkt.ethernet.Ethernet(str(payload))
    if str(packet.data.data.data).startswith('HTTP'):
        h = dpkt.http.Response(str(packet.data.data.data))
    (header,payload) = cap.next()
When I run this, it reads the first packet fine. But for the second packet, it ends up reading a wrong value of content-length. The exception is:
cona@vm-02:~$ sudo python cache-pcapy.py
Value of N 160 value of body 160
Value of N 5965717 value of body 1193
Traceback (most recent call last):
  File "cache-pcapy.py", line 12, in <module>
    h = dpkt.http.Response(str(packet.data.data.data))
  File "/usr/local/lib/python2.7/dist-packages/dpkt/http.py", line 76, in __init__
    self.unpack(args[0])
  File "/usr/local/lib/python2.7/dist-packages/dpkt/http.py", line 159, in unpack
    Message.unpack(self, f.read())
  File "/usr/local/lib/python2.7/dist-packages/dpkt/http.py", line 90, in unpack
    self.body = parse_body(f, self.headers)
  File "/usr/local/lib/python2.7/dist-packages/dpkt/http.py", line 59, in parse_body
    raise dpkt.NeedData('short body (missing %d bytes)' % (n - len(body)))
dpkt.dpkt.NeedData: short body (missing 5964524 bytes)
The prints for values of N and length of body are from http.py in dpkt where I added this:
elif 'content-length' in headers:
    n = int(headers['content-length'])
    body = f.read(n)
    print 'Value of N {} value of body {}'.format(n,len(body))
    if len(body) != n:
        raise dpkt.NeedData('short body (missing %d bytes)' % (n - len(body)))
It seems that the wrong bytes are being read as content-length. Why does this happen?

IP REGEX validation

I've been trying to validate an input string (sys.argv[1] in this case). I need to create a script that goes through a log file and matches the source and destination IP entries against an argument passed to the script. The valid kinds of input are either
an IP or partial IP, or
"any" (a string that means all IP addresses in a given column).
So far I have the following code. Whenever I run the script in bash with an argument (e.g. a random number or word), I get errors. Please let me know how I can fix them. I would really appreciate a way to validate the input against the IP address regex and the word any.
#!/usr/bin/python
import sys,re
def ipcheck(ip):
    #raw patterns for "any" and "IP":
    ippattern = '([1-2]?[0-9]?[0-9]\.){1,3}([1-2]?[0-9]?[0-9])?'
    anypattern = any
    #Compiled patterns
    cippattern = re.compile(ippattern)
    canypattern = re.compile(any)
    #creating global variables for call outside function
    global matchip
    global matchany
    #matching the compiled pattern
    matchip = cippattern.match(ip)
    matchany = canypattern.match(ip)
new = sys.argv[1]
snew = str(new)
print type(snew)
ipcheck(new)
I also tried to do it the following way, but it kept giving me errors. Is it possible to combine two conditions in an if statement via the "OR |" operator? How would I do it this way?
#if (matchip | matchany) :
#print "the ip address is valid"
#else:
#print "Invalid Destination IP"
Error
========================
user@bt:/home# ./ipregex.py a
<type 'str'>
Traceback (most recent call last):
  File "./ipregex.py", line 21, in <module>
    ipcheck(new)
  File "./ipregex.py", line 15, in ipcheck
    matchany = re.match(anypattern,ip)
  File "/usr/lib/python2.5/re.py", line 137, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python2.5/re.py", line 237, in _compile
    raise TypeError, "first argument must be string or compiled pattern"
TypeError: first argument must be string or compiled pattern
==========================================================
EDIT
I was trying to match the IP without compiling the regex. So I modified the script to do so. This resulted in the error:
Error
user@bt:/home# ./ipregex.py a
<type 'str'>
Traceback (most recent call last):
  File "./ipregex.py", line 21, in <module>
    ipcheck(new)
  File "./ipregex.py", line 15, in ipcheck
    matchany = anypattern.match(ip)
AttributeError: 'builtin_function_or_method' object has no attribute 'match'
==========================================================
EDIT#2
I was able to reproduce my error in a simpler version of the code. What am I doing wrong?
#!/usr/bin/python
import sys
import re
def ipcheck(ip):
    anypattern = any
    cpattern = re.compile(anypattern)
    global matchany
    matchany = cpattern.match(ip)
    if matchany:
        print "ip match: %s" % matchany.group()
new = sys.argv[1]
ipcheck(new)
ERROR
user@bt:/home# ./test.py any
Traceback (most recent call last):
  File "./test.py", line 14, in <module>
    ipcheck(new)
  File "./test.py", line 8, in ipcheck
    cpattern = re.compile(anypattern)
  File "/usr/lib/python2.5/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.5/re.py", line 237, in _compile
    raise TypeError, "first argument must be string or compiled pattern"
TypeError: first argument must be string or compiled pattern

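The TypeError comes from anypattern = any: without quotes, any is Python's built-in function rather than the string 'any', so re.compile cannot use it as a pattern. Also, once you have compiled a pattern with re.compile, you call the match function on the compiled object: ippattern.match(ip). To get the matched IP from a MatchObject, use MatchObject.group(). I fixed up your example a bit, and it should now do what you need: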
#!/usr/bin/python
import sys
import re
def ipcheck(ip):
    ippattern_str = '(([1-2]?[\d]{0,2}\.){1,3}([1-2]?[\d]{0,2})|any)'
    ippattern = re.compile(ippattern_str)
    # ippattern is now used to call match, passing only the ip string
    matchip = ippattern.match(ip)
    if matchip:
        print "ip match: %s" % matchip.group()
if len(sys.argv) > 1:
    ipcheck(sys.argv[1])
Some results:
[ 19:46 jon@hozbox ~/SO/python ]$ ./new.py 100.
ip match: 100.
[ 19:46 jon@hozbox ~/SO/python ]$ ./new.py 100.1.
ip match: 100.1.
[ 19:46 jon@hozbox ~/SO/python ]$ ./new.py 100.1.55.
ip match: 100.1.55.
[ 19:46 jon@hozbox ~/SO/python ]$ ./new.py 100.1.55.255
ip match: 100.1.55.255
[ 19:47 jon@hozbox ~/SO/python ]$ ./new.py any
ip match: any
[ 19:47 jon@hozbox ~/SO/python ]$ ./new.py foo
[ 19:47 jon@hozbox ~/SO/python ]$
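For Python 3, the same idea might look like the sketch below. It uses a simplified, deliberately permissive pattern (any one to three digits per group, so out-of-range values such as 999 are accepted) and re.fullmatch, so that trailing characters after a match are rejected. The names `IP_OR_ANY` and `is_valid_filter` are assumptions introduced for illustration.

```python
import re

# Permissive on purpose: one to three "digits." groups, an optional final
# group of digits, or the literal word "any".
IP_OR_ANY = re.compile(r"(?:\d{1,3}\.){1,3}(?:\d{1,3})?|any")


def is_valid_filter(value):
    """Return True for a full or partial dotted IP, or the word 'any'."""
    return IP_OR_ANY.fullmatch(value) is not None


for candidate in ["100.", "100.1.55.255", "any", "foo", "1.2.3.4.5"]:
    print(candidate, is_valid_filter(candidate))
```

Because the compiled pattern is a string rather than the built-in any, the TypeError from the question does not arise here.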

This regular expression might be better:
((([1-2]?[0-9]?[0-9]\.){1,3}([1-2]?[0-9]?[0-9])?)|any)
It will match anything like:
127.0.0.1
127.0.0
127.0
127.
192.168.1.1
any
Your regular expression would have trouble with the above because it doesn't match 0.
Edit:
I had missed the part about matching any.
This regular expression will match a few invalid addresses; however, if you are just searching through log files, that should be fine. You may wish to check out this link if you really need to be exact.
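If exact validation of complete addresses is needed, one option in Python 3 (not necessarily what the linked resource describes) is the standard-library ipaddress module. The sketch below is a minimal illustration; the function name `is_exact_ipv4` is hypothetical. Note that it rejects partial addresses such as 127.0. and the word any, so it complements the regular expression rather than replacing it.

```python
import ipaddress


def is_exact_ipv4(value):
    """Return True only for a complete, in-range dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        # AddressValueError is a subclass of ValueError
        return False
    return True


for candidate in ["192.168.1.1", "256.1.1.1", "127.0.", "any"]:
    print(candidate, is_exact_ipv4(candidate))
```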

How to disallow spaces in between literals in pyparsing?

grammar = Literal("from") + Literal(":") + Word(alphas)
The grammar needs to reject "from : mary" and accept only "from:mary", i.e. without any intervening spaces. How can I enforce this in pyparsing? Thanks.

Can you use Combine?
grammar = Combine(Literal("from") + Literal(":") + Word(alphas))
So then:
EDIT in response to your comment.
Really?
>>> grammar = pyparsing.Combine(Literal("from") + Literal(":") + Word(pyparsing.alphas))
>>> grammar.parseString('from : mary')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 1076, in parseString
    raise exc
pyparsing.ParseException: Expected ":" (at char 4), (line:1, col:5)
>>> grammar.parseString('from:mary')
(['from:mary'], {})
