I have multiple log files, each Gzipped and containing 10,000+ lines of info. I need a way to quickly parse each log file for relevant information and then display stats based on the information contained in all the log files. I currently walk the directory tree, open each .gz file with gzip.open(), and run the contents through a primitive parser.
import gzip
import os

def parse(logfile):
    for line in logfile:
        if "REPORT" in line:
            info = line.split()
            username = info[2]
            area = info[4]
            # Put info into dicts/lists etc.
        elif "ERROR" in line:
            info = line.split()
            ...

def main(args):
    argdir = args[1]
    for currdir, subdirs, files in os.walk(argdir):
        for filename in files:
            with gzip.open(os.path.join(currdir, filename), "rt") as log:
                parse(log)
    # Create a report at the end: createreport()
Is there any way to optimize this process for each file? It currently takes ~28 seconds per .gz file on my computer, and every little optimization counts. I've tried using PyPy, and for some reason it takes about twice as long to process a file.
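Since each .gz file is independent, one possible direction is to parse the files in parallel. A rough, unbenchmarked sketch (parse_file is a hypothetical reworking of parse() that returns its results instead of filling shared dicts, so the worker processes need no shared state):

import gzip
import os
import sys
from multiprocessing import Pool

def parse_file(path):
    # hypothetical reworking of parse(): returns its findings instead of
    # filling shared dicts, so worker processes need no shared state
    reports = []
    with gzip.open(path, "rt") as log:
        for line in log:
            if "REPORT" in line:
                info = line.split()
                reports.append((info[2], info[4]))  # username, area
    return reports

def main(argdir):
    paths = [os.path.join(currdir, name)
             for currdir, _subdirs, files in os.walk(argdir)
             for name in files if name.endswith(".gz")]
    with Pool() as pool:  # one worker per CPU core by default
        per_file = pool.map(parse_file, paths)
    # merge per_file results here, then createreport()
    return per_file

if __name__ == "__main__":  # guard required for multiprocessing on Windows/macOS
    main(sys.argv[1])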
After the user provides the source directory, the following script reads in a list of CSV files. It then takes one CSV and copies its contents row by row to a new CSV until it reaches 100,000 rows, at which point a new CSV is created to continue the process until the original CSV has been copied completely. The process is then repeated for the next CSV file in the directory.
I will sometimes encounter the above PermissionError and am not sure how to go about fixing it, while other times I run the script and encounter no issues. I've verified that both the input and output files are NOT open on my machine. I've also tried changing the properties of my directory folder to not be read-only, though this always reverts. When the error does occur, it is always within a few seconds of first starting to process a CSV. Once you are about 5 seconds in, it won't give the error for that CSV, but it could later once it gets to a new input CSV.
"""
Script processes all csv's in a provided directory and returns
csv's with a maximum of 100,000 rows
"""
import csv
import pathlib
import argparse
import os
import glob
def _get_csv_list(
    *, description: str = "Process csv file directory.",
):
    """
    Uses argument parser to set up working directory, then
    extracts list of csv file names from directory

    Args: Directory string
    Returns list of csv file name strings
    """
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "SRC", type=pathlib.Path, help="source (input) directory"
    )
    parsed_arg = parser.parse_args()
    os.chdir(parsed_arg.SRC)
    return glob.glob("*.csv")
def _process_csv(file_name):
    """
    Iterates through csv file and copies each row to output
    file. Once 100,000 rows is reached, a new file is started

    Args: file name string
    """
    file_index = 0
    max_records_per_file = 100_000
    with open(file_name) as _file:
        reader = csv.reader(_file)
        first_line = _file.readline()
        first_line_list = first_line.split(",")
        for index, row in enumerate(reader):
            if index % max_records_per_file == 0:
                file_index += 1
                with open(
                    f"output_{file_name.strip('.csv')}_{file_index}.csv",
                    mode="xt",
                    encoding="utf-8",
                    newline="\n",
                ) as buffer:
                    writer = csv.writer(buffer)
                    writer.writerow(first_line_list)
            else:
                try:
                    with open(
                        f"output_{file_name.strip('.csv')}_{file_index}.csv",
                        mode="at",
                        encoding="utf-8",
                        newline="\n",
                    ) as buffer:
                        writer = csv.writer(buffer)
                        writer.writerow(row)
                except FileNotFoundError as error:
                    print(error)
                    with open(
                        f"output_{file_name.strip('.csv')}_{file_index}.csv",
                        mode="xt",
                        encoding="utf-8",
                        newline="\n",
                    ) as buffer:
                        writer = csv.writer(buffer)
                        writer.writerow(first_line_list)
                        writer.writerow(row)
def main():
    """
    Primary function for limiting csv file size

    Cmd Line: python csv_row_limiter.py . (Replace '.' with other path
    if csv_row_limiter.py directory and csv directory are different)
    """
    csv_list = _get_csv_list()
    for file_name in csv_list:
        _process_csv(file_name)

if __name__ == "__main__":
    main()
Also, please note that the only requirement for the contents of the input csv's is that they have a large number of rows (100,000+) with some amount of data.
Any ideas on how I might resolve this issue?
Try running this Python script with root/su privileges, i.e. log in as root and then run the script. Hope this helps.
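If the error is intermittent (for example, an indexer, antivirus, or sync client briefly locking the freshly created output file), a less drastic workaround is to retry the open after a short delay. A sketch, with arbitrary retry settings:

import time

def open_with_retry(path, mode, retries=5, delay=0.5, **kwargs):
    # Retry an open() that intermittently raises PermissionError; the retry
    # count and delay here are guesses, not tested values.
    for attempt in range(retries):
        try:
            return open(path, mode, **kwargs)
        except PermissionError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # give whatever holds the file time to release it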
This program scans through a log file and finds faults and the timestamps for those faults. The problem I am having is finding a way to modify my program so that it can iterate over multiple files given on the command line via a wildcard. As the code stands now, it can accept a single file and build the dictionary with my desired info successfully. I have been struggling to find a way to do this with multiple files simultaneously. The goal is to be able to enter the filename with a wildcard on the command line, for example -f filename.*txt after the executable, and have all the matching files parsed. However, I cannot find a way to run multiple files through my fault finder. I have been successful in collecting multiple files and proved it by printing out the list of files found. But when it comes to using multiple files and building the dictionary, I am stumped. I would like my program to produce the same result when parsing multiple files as it does when parsing a single file.
import sys
import argparse

_TIME_STAMP_LENGTH = 16
_FAULT_STRING_HEADER_LENGTH = 15

class FaultList():
    fault_dict = {}
    fault_dict_counter = {}

    def __init__(self, file):
        self.file = file
        self.find_faults()
        print self.fault_dict

    def find_faults(self):
        with open(self.file) as f:
            for line in f.readlines():
                fault_index = line.find("Fault Cache id")
                if(fault_index != -1):
                    time_stamp = line[:_TIME_STAMP_LENGTH]
                    fault_data = line[fault_index+_FAULT_STRING_HEADER_LENGTH:-11][:-1]  # need the [:-1] to remove new line from string
                    self.handle_new_fault_found(fault_data, time_stamp)

    def handle_new_fault_found(self, fault, time_stamp):
        try:
            self.fault_dict[fault] = [fault]
            self.fault_dict[fault].append(int(time_stamp))
            self.fault_dict_counter[0] += 1
        except KeyError:
            self.fault_dict_counter[fault] = [1, [time_stamp]]

def main(file_names):
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", dest="file_names",
                        help="The binary file to be written to flash")
    args = parser.parse_args()
    fault_finder = FaultList(args.file_names)

if __name__ == '__main__':
    main(sys.argv[1:])
Here is the output of the dictionary when parsing a single file:
{'fault_01_17_00 Type:Warning': ['fault_01_17_00 Type:Warning', 37993146319], 'fault_0E_00_00 Type:Warning': ['fault_0E_00_00 Type:Warning', 38304267561], 'fault_05_01_00 Typ': ['fault_05_01_00 Typ', 38500887160]}
You can use the os module for listing files.
import os

# find all files in a directory (join with the directory so isfile() checks the right path)
directory = 'path of the files'
files = [file for file in os.listdir(directory)
         if os.path.isfile(os.path.join(directory, file))]

# filter them, keeping files that end with 'txt'
txt_files = [file for file in files if file.endswith('txt')]
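To feed several files through the existing class, here is a sketch (not the original code) of the argparse side: nargs="+" lets -f accept every path the shell expands from the wildcard, and because fault_dict is a class attribute in the question's FaultList, each new instance adds to the same dictionaries:

import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", dest="file_names", nargs="+", default=[],
                        help="one or more log files (wildcard expanded by the shell)")
    args = parser.parse_args()
    for file_name in args.file_names:
        FaultList(file_name)  # FaultList is assumed to be the class from the question

if __name__ == "__main__":
    main()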
I have this code which, along with the rest of the script, runs on a single file in a folder.
I want to try to run this code on the 11 files in the folder. I have to pass parameters to the whole script via an .sh script which I've written.
I've searched on here and found various solutions, which have not worked.
import os

def get_m3u_name():
    m3u_name = ""
    dirs = os.listdir("/tmp/")
    for m3u_file in dirs:
        if m3u_file.endswith(".m3u") or m3u_file.endswith(".xml"):
            m3u_name = m3u_file
    return m3u_name

def remove_line(filename, what):
    if os.path.isfile(filename):
        file_read = open(filename).readlines()
        file_write = open(filename, 'w')
        for line in file_read:
            if what not in line:
                file_write.write(line)
        file_write.close()

m3ufile = get_m3u_name()
I have tried a different method: deleting the file just processed and then looping the script to run again on the next file, since I can do this manually. But when I use
os.remove(m3ufile)
I get a file-not-found error. Either method of improving my code would be of great help to me. I'm just a newbie at this, but pointing me in the right direction would be a great help.
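For reference, the shape of the loop I am trying to reach (a rough sketch; get_playlist_files and unwanted_text are made-up placeholders, remove_line is the function above):

import os

def get_playlist_files(directory="/tmp/"):
    # return full paths, so a later os.remove() doesn't fail on a bare file name
    return [os.path.join(directory, name)
            for name in os.listdir(directory)
            if name.endswith((".m3u", ".xml"))]

unwanted_text = "text to strip"  # placeholder for whatever the real script removes

for m3ufile in get_playlist_files():
    remove_line(m3ufile, unwanted_text)  # rest of the per-file processing goes here
    os.remove(m3ufile)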
I'm trying to parse a number of log files from a log directory, searching for any number of strings from a list along with a server name. I feel like I've tried a million different options, and I have it working fine with just one log file, but when I try to go through all the log files in the directory I can't seem to get anywhere.
if args.f:
    logs = args.f
else:
    try:
        logs = glob("/var/opt/cray/log/p0-current/*")
    except IndexError:
        print "Something is wrong. p0-current is not available."
        sys.exit(1)

valid_errors = ["error", "nmi", "CATERR"]

logList = []
for log in logs:
    logList.append(log)

#theLog = open("logList")
#logFile = log.readlines()
#logFile.close()
#printList = []
#for line in logFile:
#    if (valid_errors in line):
#        printList.append(line)
#
#for item in printList:
#    print item

# with open("log", "r") as tmp_log:
#     open_log = tmp_log.readlines()
#     for line in open_log:
#         for down_nodes in open_log:
#             if valid_errors in open_log:
#                 print valid_errors
down_nodes is a list, filled further up the script, of servers which are marked as down.
Commented out are some of the various attempts I've been working through.
logList = []
for log in logs:
logList.append(log)
I thought this might be the way forward: put each individual log file in a list, then loop through this list and use open() followed by readlines(). But I'm missing some kind of logic here... maybe I'm not thinking about it correctly.
I could really do with some pointers here please.
Thanks.
So your last for loop is redundant because logs is already a list of strings. With that information, we can iterate through logs and do something for each log.
for log in logs:
    with open(log) as f:
        for line in f.readlines():
            if any(error in line for error in valid_errors):
                pass  # do stuff
The line if any(error in line for error in valid_errors): checks the line to see if any of the errors in valid_errors are in the line. The argument to any() is a generator expression that yields error in line for each error in valid_errors.
To answer your question involving down_nodes, I don't believe you should include this in the same any(). You should try something like
if any(error in line for error in valid_errors) and \
any(node in line for node in down_nodes):
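Putting the two checks together inside the file loop would look roughly like this (logs, valid_errors, and down_nodes are assumed to be the lists already defined in your script):

matches = []
for log in logs:
    with open(log) as f:
        for line in f:
            # keep only lines that mention both a known error and a down server
            if any(error in line for error in valid_errors) and \
               any(node in line for node in down_nodes):
                matches.append((log, line.rstrip()))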
Firstly you need to find all logs:
import os
import fnmatch
def find_files(pattern, top_level_dir):
    for path, dirlist, filelist in os.walk(top_level_dir):
        for name in fnmatch.filter(filelist, pattern):
            yield os.path.join(path, name)
For example, to find all *.txt files in current dir:
txtfiles = find_files('*.txt', '.')
Then get file objects from the names:
def open_files(filenames):
    for name in filenames:
        yield open(name, 'r', encoding='utf-8')
Finally individual lines from files:
def lines_from_files(files):
    for f in files:
        for line in f:
            yield line
Since you want to find some errors the check could look like this:
import re
def find_errors(lines):
    pattern = re.compile('(error|nmi|CATERR)')
    for line in lines:
        if pattern.search(line):
            print(line)
You can now process a stream of lines generated from a given directory:
txt_file_names = find_files('*.txt', '.')
txt_files = open_files(txt_file_names)
txt_lines = lines_from_files(txt_files)
find_errors(txt_lines)
The idea of processing logs as a stream of data originates from a talk by David Beazley.
Ok, so I have a zip file that contains gz files (unix gzip).
Here's what I do --
def parseSTS(file):
    import zipfile, re, io, gzip
    with zipfile.ZipFile(file, 'r') as zfile:
        for name in zfile.namelist():
            if re.search(r'\.gz$', name) != None:
                zfiledata = zfile.open(name)
                print("start for file ", name)
                with gzip.open(zfiledata,'r') as gzfile:
                    print("done opening")
                    filecontent = gzfile.read()
                    print("done reading")
                    print(filecontent)
This gives the following result --
>>>
start for file XXXXXX.gz
done opening
done reading
Then it stays like that forever until it crashes...
What can I do with filecontent?
Edit: this is not a duplicate, since my gzipped files are inside a zip file and I'm trying to avoid extracting that zip file to disk. It works with zip files within a zip file as per How to read from a zip file within zip file in Python?.
I created a zip file containing a gzip'ed PDF file I grabbed from the web.
I ran this code (with two small changes):
1) Fixed indenting of everything under the def statement (which I also corrected in your Question because I'm sure that it's right on your end or it wouldn't get to the problem you have).
2) I changed:
zfiledata = zfile.open(name)
print("start for file ", name)
with gzip.open(zfiledata,'r') as gzfile:
    print("done opening")
    filecontent = gzfile.read()
    print("done reading")
    print(filecontent)
to:
print("start for file ", name)
with gzip.open(name,'rb') as gzfile:
print("done opening")
filecontent = gzfile.read()
print("done reading")
print(filecontent)
Because you were passing a file object to gzip.open instead of a string. I have no idea how your code is executing without that change, but it was crashing for me until I fixed it.
EDIT: Adding link to GZIP docs from James R's answer --
Also, see here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage
END EDIT
Now, since my gzip'ed file is small, the behavior I observe is that it pauses for about 3 seconds after printing done reading, then outputs what is in filecontent.
I would suggest adding the following debugging line after your print "done reading" -- print len(filecontent). If this number is very, very large, consider not printing the entire file contents in one shot.
I would also suggest reading this for more insight into what I expect is your problem: Why is printing to stdout so slow? Can it be sped up?
EDIT 2 - an alternative if your system does not handle file io on zip files, causing no such file errors in the above:
def parseSTS(afile):
    import zipfile
    import zlib
    import gzip
    import io
    with zipfile.ZipFile(afile, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.gz'):
                bfn = archive.read(name)
                bfi = io.BytesIO(bfn)
                g = gzip.GzipFile(fileobj=bfi,mode='rb')
                qqq = g.read()
                print qqq

parseSTS('t.zip')
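If you are on Python 3, here is a further sketch in the same spirit: keep the compressed member in memory (it is much smaller than the decompressed text) and iterate the decompressed lines one at a time instead of read()-ing and printing everything in one shot (iter_nested_gz_lines is a made-up name):

import gzip
import io
import zipfile

def iter_nested_gz_lines(zip_path):
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".gz"):
                # compressed bytes stay in memory; lines are decompressed lazily
                compressed = io.BytesIO(archive.read(name))
                with gzip.open(compressed, "rt", errors="replace") as text:
                    for line in text:
                        yield name, line

# usage: process lines as they arrive instead of building one huge string
# for member, line in iter_nested_gz_lines("t.zip"):
#     ...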
Most likely your problem lies here:
if name.endswith(".gz"): #as goncalopp said in the comments, use endswith
#zfiledata = zfile.open(name) #don't do this
#print("start for file ", name)
with gzip.open(name,'rb') as gzfile: #gz compressed files should be read in binary and gzip opens the files directly
#print("done opening") #trust in your program, luke
filecontent = gzfile.read()
#print("done reading")
print(filecontent)
See here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage