Parsing Snort Logs with PyParsing - python

I'm having a problem parsing Snort logs using the pyparsing module.
The problem is with separating the Snort log (which has multiline entries, separated by a blank line) and getting pyparsing to parse each entry as a whole chunk, rather than reading it in line by line and expecting the grammar to work with each line (obviously, it does not).
I have tried converting each chunk to a temporary string, stripping out the newlines inside each chunk, but it refuses to process correctly. I may be wholly on the wrong track, but I don't think so (a similar form works perfectly for syslog-type logs, but those are one-line entries and so lend themselves to your basic file iterator / line processing).
Here's a sample of the log and the code I have so far:
[**] [1:486:4] ICMP Destination Unreachable Communication with Destination Host is Administratively Prohibited [**]
[Classification: Misc activity] [Priority: 3]
08/03-07:30:02.233350 172.143.241.86 -> 63.44.2.33
ICMP TTL:61 TOS:0xC0 ID:49461 IpLen:20 DgmLen:88
Type:3 Code:10 DESTINATION UNREACHABLE: ADMINISTRATIVELY PROHIBITED HOST FILTERED
** ORIGINAL DATAGRAM DUMP:
63.44.2.33:41235 -> 172.143.241.86:4949
TCP TTL:61 TOS:0x0 ID:36212 IpLen:20 DgmLen:60 DF
Seq: 0xF74E606
(32 more bytes of original packet)
** END OF DUMP
[**] ...more like this [**]
And the updated code:
def snort_parse(logfile):
    header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + Suppress("]") + Regex(".*") + Suppress("[**]")
    cls = Optional(Suppress("[Classification:") + Regex(".*") + Suppress("]"))
    pri = Suppress("[Priority:") + integer + Suppress("]")
    date = integer + "/" + integer + "-" + integer + ":" + integer + "." + Suppress(integer)
    src_ip = ip_addr + Suppress("->")
    dest_ip = ip_addr
    extra = Regex(".*")
    bnf = header + cls + pri + date + src_ip + dest_ip + extra
    def logreader(logfile):
        chunk = []
        with open(logfile) as snort_logfile:
            for line in snort_logfile:
                if line !='\n':
                    line = line[:-1]
                    chunk.append(line)
                    continue
                else:
                    print chunk
                    yield " ".join(chunk)
                    chunk = []
    string_to_parse = "".join(logreader(logfile).next())
    fields = bnf.parseString(string_to_parse)
    print fields
Any help, pointers, RTFMs, You're Doing It Wrongs, etc., greatly appreciated.

import pyparsing as pyp
import itertools
integer = pyp.Word(pyp.nums)
ip_addr = pyp.Combine(integer+'.'+integer+'.'+integer+'.'+integer)
def snort_parse(logfile):
    header = (pyp.Suppress("[**] [")
              + pyp.Combine(integer + ":" + integer + ":" + integer)
              + pyp.Suppress(pyp.SkipTo("[**]", include = True)))
    cls = (
        pyp.Suppress(pyp.Optional(pyp.Literal("[Classification:")))
        + pyp.Regex("[^]]*") + pyp.Suppress(']'))
    pri = pyp.Suppress("[Priority:") + integer + pyp.Suppress("]")
    date = pyp.Combine(
        integer+"/"+integer+'-'+integer+':'+integer+':'+integer+'.'+integer)
    src_ip = ip_addr + pyp.Suppress("->")
    dest_ip = ip_addr
    bnf = header+cls+pri+date+src_ip+dest_ip
    with open(logfile) as snort_logfile:
        for has_content, grp in itertools.groupby(
                snort_logfile, key = lambda x: bool(x.strip())):
            if has_content:
                tmpStr = ''.join(grp)
                fields = bnf.searchString(tmpStr)
                print(fields)
snort_parse('snort_file')
yields
[['1:486:4', 'Misc activity', '3', '08/03-07:30:02.233350', '172.143.241.86', '63.44.2.33']]

You have some regex unlearning to do, but hopefully this won't be too painful. The biggest culprit in your thinking is the use of this construct:
some_stuff + Regex(".*") +
Suppress(string_representing_where_you_want_the_regex_to_stop)
Each subparser within a pyparsing parser is pretty much standalone, and works sequentially through the incoming text. So the Regex term has no way to look ahead to the next expression to see where the '*' repetition should stop. In other words, the expression Regex(".*") is going to just read until the end of the line, since that is where ".*" stops unless '.' is allowed to match newlines (the re.DOTALL flag).
In pyparsing, this concept is implemented using SkipTo. Here is how your header line is written:
header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) +
          Suppress("]") + Regex(".*") + Suppress("[**]"))
Your ".*" problem gets resolved by changing it to:
header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) +
          Suppress("]") + SkipTo("[**]") + Suppress("[**]"))
Same thing for cls.
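To see why a standalone ".*" cannot stop at the delimiter, here is a small analogy that uses only the standard library rather than pyparsing itself. It is a sketch of the idea, not of pyparsing's internals: the first half mimics a matcher that runs on its own with no knowledge of what follows, and the second half mimics what SkipTo does conceptually. The shortened sample line is an assumption for illustration.

```python
import re

line = "[**] [1:486:4] ICMP Destination Unreachable [**]"
rest = line[len("[**] [1:486:4]"):]   # text left once the id has been consumed

# A standalone ".*" matcher grabs everything up to the end of the line,
# including the closing "[**]", so a following matcher finds nothing left.
greedy = re.match(r".*", rest).group()
leftover = rest[len(greedy):]

# The SkipTo idea: scan forward until the target is found, return the
# skipped text, and leave the target in place for the next matcher.
stop = rest.find("[**]")
message = rest[:stop].strip()
remaining = rest[stop:]

print(repr(greedy))
print(repr(leftover))
print(repr(message))
print(repr(remaining))
```

A full regular expression such as r"(.*)\[\*\*\]" would backtrack to find the delimiter, but that only works because the whole pattern is one regex; a chain of separate pyparsing expressions gets no such backtracking.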
One last bug: your definition of date is short by one ':' + integer:
date = (integer + "/" + integer + "-" + integer + ":" + integer + "." +
        Suppress(integer))
should be:
date = (integer + "/" + integer + "-" + integer + ":" + integer + ":" +
        integer + "." + Suppress(integer))
I think those changes will be sufficient to start parsing your log data.
Here are some other style suggestions:
You have a lot of repeated Suppress("]") expressions. I've started defining all my suppressable punctuation in a very compact and easy to maintain statement like this:
LBRACK,RBRACK,LBRACE,RBRACE = map(Suppress,"[]{}")
(expand to add whatever other punctuation characters you like). Now I can use these characters by their symbolic names, and I find the resulting code a little easier to read.
You start off header with header = Suppress("[**] [") + .... I never like seeing spaces embedded in literals this way, as it bypasses some of the parsing robustness pyparsing gives you with its automatic whitespace skipping. If for some reason the space between "[**]" and "[" was changed to use 2 or 3 spaces, or a tab, then your suppressed literal would fail. Combine this with the previous suggestion, and header would begin with
header = Suppress("[**]") + LBRACK + ...
I know this is generated text, so variation in this format is unlikely, but it plays better to pyparsing's strengths.
Once you have your fields parsed out, start assigning results names to different elements within your parser. This will make it a lot easier to get the data out afterward. For instance, change cls to:
cls = Optional(Suppress("[Classification:") +
               SkipTo(RBRACK)("classification") + RBRACK)
This will allow you to access the classification data using fields.classification.

Well, I don't know Snort or pyparsing, so apologies in advance if I say something stupid. I'm unclear as to whether the problem is with pyparsing being unable to handle the entries, or with you being unable to send them to pyparsing in the right format. If the latter, why not do something like this?
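def logreader( path_to_file ):
    chunk = [ ]
    with open( path_to_file ) as theFile:
        for line in theFile:
            if line.strip():
                # non-blank line: part of the current entry
                chunk.append( line )
            elif chunk:
                # blank line: the entry is complete
                yield "".join( chunk )
                chunk = [ ]
    if chunk:
        # last entry, if the file does not end with a blank line
        yield "".join( chunk )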
Of course, if you need to modify each chunk before sending it to pyparsing, you can do so before yielding it.
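As an illustration of modifying each chunk before it is yielded, here is a minimal sketch that collapses every multi-line entry into a single space-separated string. The function name iter_entries is hypothetical, it takes any iterable of lines (an open file works too), and the sample lines are abbreviated from the log in the question, with the second entry made up.

```python
def iter_entries(lines):
    """Yield each blank-line-separated entry as a single one-line string."""
    chunk = []
    for line in lines:
        if line.strip():
            # Drop the trailing newline and surrounding whitespace.
            chunk.append(line.strip())
        elif chunk:
            # Blank line: emit the finished entry and start a new one.
            yield " ".join(chunk)
            chunk = []
    if chunk:
        # Emit the last entry when the input does not end with a blank line.
        yield " ".join(chunk)


sample_lines = [
    "[**] [1:486:4] ICMP Destination Unreachable [**]\n",
    "[Classification: Misc activity] [Priority: 3]\n",
    "\n",
    "[**] [1:486:4] another entry [**]\n",
    "[Classification: Misc activity] [Priority: 3]\n",
]
entries = list(iter_entries(sample_lines))
for entry in entries:
    print(entry)
```

Each yielded string can then be handed to the parser on its own.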

Related

Add the actual characters '\n' to a string in python?

I'm writing a short program to go through a directory and write create table and load from csv statements for a bunch of csvs and get them all into mySQL. I'm sure there's an easier way to do this, but I thought it would be fun to make it myself.
This is one of the lines I have in python to build the load csv statement, where l_d is a variable I'm storing it in, f is the file path, and n is the table name:
l_d = "LOAD DATA INFILE " + "'" + f + "'" + "\nINTO TABLE " + n + "\nFIELDS TERMINATED BY ','\nENCLOSED BY '" + '"' +"'" + "\nLINES TERMINATED BY" +"\'\n\'" + "\nIGNORE 1 ROWS;"
The statement I want in SQL is:
LOAD DATA INFILE 'file.csv'
INTO TABLE table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY'\n'
IGNORE 1 ROWS;
but what I get is always
LOAD DATA INFILE 'file.csv'
INTO TABLE table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY'
'
IGNORE 1 ROWS;
because it thinks my \n is supposed to be a line break and not the actual characters.
How can I get the actual characters to show up here?
Also, I know my whole string concatenation in the original statement is kinda gross (I'm pretty new to this), so any general tips on how to improve that would also be much appreciated :)
To escape the backslash, add another one before it:
\\n
gets \n
so your code will be:
l_d = "LOAD DATA INFILE " + "'" + f + "'" + "\nINTO TABLE " + n + "\nFIELDS TERMINATED BY ','\nENCLOSED BY '" + '"' +"'" + "\nLINES TERMINATED BY" +"\'\\n\'" + "\nIGNORE 1 ROWS;"
print("hello\\n")
# this prints a literal "\n"
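On the request for general tips about the concatenation: one option is to build the statement from adjacent string literals with an f-string for the variable parts. This is only a sketch; the function name build_load_statement and the table name my_table are made up, and it assumes Python 3.6 or later for f-strings.

```python
def build_load_statement(csv_path, table_name):
    """Build a LOAD DATA statement; '\\n' below is a literal backslash + n."""
    return (
        f"LOAD DATA INFILE '{csv_path}'\n"
        f"INTO TABLE {table_name}\n"
        "FIELDS TERMINATED BY ','\n"
        "ENCLOSED BY '\"'\n"
        "LINES TERMINATED BY '\\n'\n"
        "IGNORE 1 ROWS;"
    )


statement = build_load_statement("file.csv", "my_table")
print(statement)
```

Only the doubled backslash survives as the two characters \n in the output; every single \n in the literals is still a real line break between clauses.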

Python: CSV delimiter failing randomly

I have created a script that generates a number of random passwords (see below):
import string
import secrets
import datetime
now = datetime.datetime.now()
T = now.strftime('%Y_%m_d')
entities = ['AA','BB','CC','DD','EE','FF','GG','HH']
masterpass = ('MasterPass' + '_' + T + '.csv')
f= open(masterpass,"w+")
def random_secure_string(stringLength):
    secureStrMain = ''.join((secrets.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits + ('!'+'?'+'"'+'('+')'+'$'+'%'+'#'+'#'+'/'+':'+';'+'['+']'+'#')) for i in range(stringLength)))
    return secureStrMain
def random_secure_string_lower(stringLength):
    secureStrLower = ''.join((secrets.choice(string.ascii_lowercase)) for i in range(stringLength))
    return secureStrLower
def random_secure_string_upper(stringLength):
    secureStrUpper = ''.join((secrets.choice(string.ascii_uppercase)) for i in range(stringLength))
    return secureStrUpper
def random_secure_string_digit(stringLength):
    secureStrDigit = ''.join((secrets.choice(string.digits)) for i in range(stringLength))
    return secureStrDigit
def random_secure_string_char(stringLength):
    secureStrChar = ''.join((secrets.choice('!'+'?'+'"'+'('+')'+'$'+'%'+'#'+'#'+'/'+':'+';'+'['+']'+'#')) for i in range(stringLength))
    return secureStrChar
for x in entities:
    f.write(x + ',' + random_secure_string(6) + random_secure_string_lower(1) + random_secure_string_upper(1) + random_secure_string_digit(1) + random_secure_string_char(1) + ',' + T + "\n")
f.close()
I use pandas to get the code to import a list, so normally it is for 200-250 entities, not just the 8 in the example.
The issue comes every so often where it looks like the comma delimiter fails to be read (see row 6 of attached photo)
In all the cases I have had of this (multiple run throughs), it looks like the 10th character is a comma, the 4 before (characters 6-9) are as stated in the script, but then instead of generating 6 initial characters (from random_secure_string(6)), it is generating 5. Could this be causing the issue? If so, how do I fix this?
Thank you in advance
Wild guess, because the content of the csv file as text is required to make sure.
A csv is a Comma Separated Values text file. That means that it is a plain text file where fields are delimited with a separator, normally the comma (,). In order to allow text fields to contain commas or even new lines, they can be enclosed in quotes (normally ") or special characters can be escaped, normally with \.
That means that if a line contains abcdefg\,2020_05 the comma will not be interpreted as a separator.
How to fix:
CSV is a simple format, but with many corner cases. The rule is to avoid reading or writing it by hand. Just use the standard library csv module here:
...
import csv
...
with open(masterpass,"w+", newline='') as f:
    wr = csv.writer(f)
    for x in entities:
        wr.writerow([x, random_secure_string(6) + random_secure_string_lower(1) + random_secure_string_upper(1) + random_secure_string_digit(1) + random_secure_string_char(1), T])
The writer will take care of special characters and ensure that appropriate quoting or escaping is used.
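To illustrate what the writer does with an awkward value, here is a minimal sketch that writes to an in-memory buffer instead of a file. The password values are made up. The first one starts with a double quote, which the character set in the question can produce; whether that is what actually happened in the asker's file is an assumption, since the file content was not posted.

```python
import csv
import io

rows = [
    ["AA", '"k3Gd2aB7#', "2020_05_01"],   # password starting with a double quote
    ["BB", "plain9aB7#", "2020_05_01"],
]

# Write with csv.writer: fields containing quotes are quoted and the
# embedded quote character is doubled.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
text = buf.getvalue()
print(text)

# Reading the text back with csv.reader recovers the original fields.
round_trip = list(csv.reader(io.StringIO(text, newline="")))
print(round_trip == rows)

# For comparison: the same first row joined by hand, as in the question.
naive_line = ",".join(rows[0])
print(list(csv.reader([naive_line])))
```

The last print shows how a CSV reader treats the hand-joined line, which is worth comparing with the row that looked wrong in the spreadsheet.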

Python - How to handle space as a value of a variable without quotes?

I have a string "bitrate:8000"
I need to convert it to "-bps 8000". Note that the parameter name is changed and so is the delimiter from ':' to space.
Also the delimiters are not fixed always, sometimes I would need to change from ':' to '-' using the same program.
The change rules are supplied as a config file which I am reading through the ConfigParser module. Something like:
[params]
modify_param_name = bitrate/bps
modify_delimiter = :/' '
value = 8000
In my program:
orig_param = modify_param_name.split('/')[0]
new_param = modify_param_name.split('/')[1]
orig_delimiter = modify_delimiter.split('/')[0]
new_delimiter = modify_delimiter.split('/')[1]
new_param_string = new_param + new_delimiter + value
However, this results in the string as below:
-bps' '8000
The question is how can I handle spaces without the ' ' quotes?
The reason why you're getting the ' ' string is probably related to the way you parse your modify_delimiter value.
You're reading that as a string, so that modify_delimiter == ":/' '".
When you're doing:
new_delimiter = modify_delimiter.split('/')[1]
Essentially modify_delimiter.split('/') gives you a list of [':', "' '"].
So when you're doing new_param_string = new_param + new_delimiter + value, you are concatenating together 'bps' + "' '" + '8000'.
If your modify_delimiter contained the string ':/ ', this would work just fine:
>>> new_param_string = new_param + new_delimiter + value
>>> new_param_string
'bps 8000'
It has been pointed out that you're using ConfigParser. Unfortunately, I don't see an option for ConfigParser (either in Python 2 or 3) to preserve trailing whitespace; it looks like it is always stripped.
What I can suggest in that case is that you wrap your string in quotes entirely in your config file:
[params]
modify_param_name = bitrate/bps
modify_delimiter = ":/ "
And in your code, when you initialize modify_delimiter, strip the " on your own:
modify_delimiter = config.get('params', 'modify_delimiter').strip('"')
That way the trailing space will get preserved and you should get your desired output.
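As a quick check of the quoting approach, here is a minimal sketch. It assumes Python 3's configparser and reads the config from an inline string instead of a file; the variable names mirror the ones in the question.

```python
import configparser

CONFIG_TEXT = '''
[params]
modify_param_name = bitrate/bps
modify_delimiter = ":/ "
value = 8000
'''

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

modify_param_name = config.get('params', 'modify_param_name')
# The surrounding double quotes protect the trailing space from being
# stripped by the parser; remove them once the value has been read.
modify_delimiter = config.get('params', 'modify_delimiter').strip('"')
value = config.get('params', 'value')

orig_param, new_param = modify_param_name.split('/')
orig_delimiter, new_delimiter = modify_delimiter.split('/')

new_param_string = new_param + new_delimiter + value
print(repr(new_param_string))
```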

int and string errors in the class file in Python

I am writing a Python program where I have 3 files. One is the main file, one is the class file and one is the data file. The data file reads from 2 text files and splits and arranges the data for use by the class and main files. Anyway, I am pretty much done with the data and main files, but I am having problems with the class file. It's a general string formatting issue, but I am failing to understand what I can possibly do to fix it. I am getting the error
  File "/Users/admin/Desktop/Program 6/FINAL/classFile.py", line 83, in __repr__
    if len(self._birthDay[0])<2:
TypeError: object of type 'int' has no len()
Use string formatting, not string concatenation, it's much cleaner:
return "{} {} (# {} ) GPA {:0.2f}".format(
    self._first, self._last, self._techID, self.currentGPA()
)
Plus, if you use this format, it will auto-convert the type for you.
It seems to me like birthDay is a list of ints, not a list of strings.
If you want to make sure they're all strings, you can try:
self._birthDay = list(map(str, birthDay))
Alternatively, if you know that they are all ints, you can use string formatting in the first place to avoid these len checks:
self._birthDay = ['{:02d}'.format(x) for x in birthDay]
Even better, though, would be to represent birthDay as a datetime.datetime object. Assuming it always comes in as 3 ints, Month, Day, Year, you'd do:
bmon, bday, byear = birthDay
self._birthDay = datetime.datetime(byear, bmon, bday)
Then your __repr__ can make use of the datetime.strftime method.
Edit
In response to your update, I think you should add from datetime import datetime to the top of getData, then instead of parsing out the month/day/year, use:
birthDay = datetime.strptime(x[3], '%m/%d/%Y')
This will get you a full-fledged datetime object to represent the birthdate (alternatively, you can use a datetime.date object, since you don't need the time).
Then you can replace your __repr__ method with:
def __repr__(self):
    fmtstr = '{first} {last} (#{techid})\nAge: {age} ({bday})\nGPA: {gpa:0.2f} ({credits})'
    bday = self._birthDay.strftime('%m/%d/%Y')
    return fmtstr.format(first=self._first,
                         last=self._last,
                         age=self.currentAge(),
                         bday=bday,
                         gpa=self.currentGPA(),
                         credits=self._totalCredits)
Oh, and since _birthDay is now a datetime.datetime, you need to update currentAge() to return int((datetime.datetime.now() - self._birthDay) / datetime.timedelta(days=365)). That will be reasonably accurate without being too complicated.
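Dividing by 365 days drifts slightly around leap years and birthdays. If an exact answer matters, one alternative (not the method described above, just a sketch with a hypothetical helper name age_on) is to compare the month and day directly:

```python
from datetime import date


def age_on(birth, today):
    """Return the age in whole years on the given day."""
    years = today.year - birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


day_before = age_on(date(2000, 6, 15), date(2020, 6, 14))
on_birthday = age_on(date(2000, 6, 15), date(2020, 6, 15))
print(day_before, on_birthday)
```

The same comparison works with datetime.datetime objects, since they also expose year, month and day.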
As the error message states, len doesn't make sense with an int. If you want the number of characters in it, convert it to a str first.
def __repr__(self):
    if len(str(self._birthDay[0]))<2:
        self._birthDay[0] = "0" + str(self._birthDay[0])
    if len(str(self._birthDay[1]))<2:
        self._birthDay[1] = "0" + str(self._birthDay[1])
    return self._first + " " + self._last + " (#" + self._techID + ")\nAge: " + str(self.currentAge()) + \
        " (" + str(self._birthDay[0]) + "/" + str(self._birthDay[1]) + "/" + str(self._birthDay[2]) + ")" + \
        "\nGPA: %0.2f" % (self.currentGPA()) + " (" + str(self._totalCredits) + " Credits" + ")\n"

Python - splitting lines in txt file by semicolon in order to extract a text title...except sometimes the title has semicolons in it

So, I have an extremely inefficient way to do this that works, which I'll show, as it will help illustrate the problem more clearly. I'm an absolute beginner in python and this is definitely not "the python way" nor "remotely sane."
I have a .txt file where each line contains information about one of a large number of .csv files, in the following format:
File; Title; Units; Frequency; Seasonal Adjustment; Last Updated
(first entry:)
0\00XALCATM086NEST.csv;Harmonized Index of Consumer Prices: Overall Index Excluding Alcohol and Tobacco for Austria©; Index 2005=100; M; NSA; 2015-08-24
and so on, repeats like this for a while. For anyone interested, this is the St.Louis Fed (FRED) data.
I want to rename each file (currently named with the alphanumeric code at the start, 00XA etc.) to the text name. So, just split by semicolon, right? Except, sometimes, the text title has semicolons within it (and I want all of the text).
So I did:
data_file_data_directory = r'C:\*****\Downloads\FRED2_csv_3\FRED2_csv_2'
rename_data_file_name = 'README_SERIES_ID_SORT.txt'
rename_data_file = open(data_file_data_directory + '\\' + rename_data_file_name)
for line in rename_data_file.readlines():
    data = line.split(';')
    if len(data) > 2 and data[0].rstrip().lstrip() != 'File':
        original_file_name = data[0]
These last 2 lines deal with the fact that there is some introductory text that we want to skip, and we don't want to rename based on the legend at the top (!= 'File'). It saves the 00XAL__.csv as the old name. It may be possible to make this more elegant (I would appreciate the tips), but it's the next part (the new, text name) that gets really ugly.
        if len(data) ==6:
            new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1][:-2].replace(':',' -').replace('"','').replace('/',' or ')
        else:
            if len(data) ==7:
                new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2][:-2].replace(':',' -').replace('"','').replace('/',' or ')
            else:
                if len(data) ==8:
                    new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3][:-2].replace(':',' -').replace('"','').replace('/',' or ')
                else:
                    if len(data) ==9:
                        new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[4][:-2].replace(':',' -').replace('"','').replace('/',' or ')
                    else:
                        if len(data) ==10:
                            new_file_name = data[0][:-4].split("\\")[-1] + '-' + data[1].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[2].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[3].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[4].replace(':',' -').replace('"','').replace('/',' or ') + '-' + data[5][:-2].replace(':',' -').replace('"','').replace('/',' or ')
                        else:
                            (etc)
The reason for all this is that there is no way to know in advance, for each line in the .txt file, how many items are in the list created by splitting it by semicolons. Ideally, the list would be length 6, following the key at the top of my example of the data. However, for every semicolon in the text name, the length increases by 1... and we want everything before the last four items in the list (counting backwards from the right: date, seasonal adjustment, frequency, units/index) but after the .csv code (this is just another way of saying that I want the text "title": everything on each line after .csv but before units/index).
Really what I want is just a way to save the entirety of the text name as "new_name" for each line, even after I split each line by semicolon, when I have no idea how many semicolons are in each text name or the line as a whole. The above code achieves this, but OMG, this can't be the right way to do this.
Please let me know if it's unclear or if I can provide more info.
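One way to avoid the growing chain of len(data) checks is to split once from the left for the file name and four times from the right for the trailing fields, so that whatever is left in the middle is the title. This is a sketch under two assumptions: only the title can contain extra semicolons, and every data line ends with the same four fields (units, frequency, seasonal adjustment, last updated). The function name split_series_line and the sample title with extra semicolons are made up for illustration.

```python
def split_series_line(line):
    """Return (file, title, units, frequency, seasonal_adj, last_updated)."""
    # Split off the file name at the first semicolon only.
    file_part, rest = line.rstrip('\n').split(';', 1)
    # Split the four trailing fields off from the right; the title keeps
    # any semicolons it contains.
    title, units, frequency, seasonal_adj, last_updated = rest.rsplit(';', 4)
    return tuple(part.strip() for part in
                 (file_part, title, units, frequency, seasonal_adj, last_updated))


sample = ('0\\00XALCATM086NEST.csv;Some Index: Part A; Part B; and More; '
          'Index 2005=100; M; NSA; 2015-08-24\n')
fields = split_series_line(sample)
print(fields[1])
```

Lines with fewer than five semicolons raise ValueError here, so the introductory text still needs a check like the existing len(data) > 2 test, or a try/except around the call.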
