Lower execution time for Apache log parser in Python

I have a school assignment where I was tasked with writing an Apache log parser in Python. The parser extracts all the IP addresses and all the HTTP methods using regex and stores them in a nested dictionary. The code can be seen below:
from re import search

def aggregatelog(filename):
    keyvaluepairscounter = {"IP": {}, "HTTP": {}}
    with open(filename, "r") as file:
        for line in file:
            # Combines the regexes: IP (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) and HTTP method ("(\b[A-Z]+\b))
            result = search(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)', line).groups()
            if result[0] in set(keyvaluepairscounter["IP"].keys()):  # Using set will lower look up time complexity from O(n) to O(1)
                keyvaluepairscounter["IP"][result[0]] += 1
            else:
                keyvaluepairscounter["IP"][result[0]] = 1
            if result[1] in set(keyvaluepairscounter["HTTP"].keys()):
                keyvaluepairscounter["HTTP"][result[1]] += 1
            else:
                keyvaluepairscounter["HTTP"][result[1]] = 1
    return keyvaluepairscounter
This code works (it gives me the expected data for the log files we were given). However, when extracting data from large log files (in my case, ~500 MB) the program is VERY slow: it takes ~30 minutes for the script to finish. According to my teacher, a good script should be able to process the large file in under 3 minutes (wth?). My question is: is there anything I can do to speed up my script? I have already done some things, like replacing lists with sets, which have better lookup times.

At minimum, pre-compile your regex before the loop, i.e.

pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)')

then later in your loop:

for line in file:
    result = pattern.search(line).groups()
You should also consider optimizing your pattern, especially the .* part, as it causes a lot of backtracking and is expensive.
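For example, one less backtrack-prone variant (my own suggestion, not part of the original answer) is to replace \s*.*" with [^"]*", which skips straight to the first quote without backtracking. This assumes the HTTP method sits in the first quoted field of the line, as in the common Apache log formats:

import re

# Hypothetical tightened pattern: [^"]* cannot scan past a quote,
# unlike the original \s*.*" which runs to the end of the line and backtracks.
pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})[^"]*"([A-Z]+)')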

I found my answer. Use re.findall() and iterate over its result directly instead of storing the returned matches in a list first, as such:

for data in re.findall(pattern, text):
    do things

instead of

array = re.findall(pattern, text)
for data in array:
    do things
I also read the entire file in one go:

with open("file", "r") as file:
    text = file.read()
This implementation processed the file in under 1 minute!
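Putting the pieces together, a minimal sketch of the whole approach might look like the following. It keeps the same pattern as the question and uses collections.Counter for the counting, which is a standard-library convenience rather than something from the answers above:

import re
from collections import Counter

pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)')

def aggregatelog(filename):
    counters = {"IP": Counter(), "HTTP": Counter()}
    with open(filename, "r") as file:
        text = file.read()                    # read the entire file in one go
    for ip, method in pattern.findall(text):  # iterate directly over findall
        counters["IP"][ip] += 1
        counters["HTTP"][method] += 1
    return counters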

Related

How to store read-in data element by element in a memory-efficient way?

The program I'm working on needs to read in data files which can be quite large (up to 5 GB) of ASCII text. The format can vary, which is why I came up with the following approach: use readline(), split every line to get just the pure entries, append them all to one big list of strings, divide that list into smaller string lists depending on the occurrence of certain marker words, and then pass the data to a program-internal data structure for further unified processing.
This method works well enough, except that it needs way too much memory, and I wonder why.
So I wrote this little test case to illustrate my problem:
The input data here is the text of Shakespeare's Romeo and Juliet (in practice I expect mixed alphabetic and numeric input); note that you are meant to copy the data in yourself to keep the example clear. The script generates a .txt file which is then read back in using the methods below. The original memory size in this case is 153 KB.
Reading this file with...
f.read() gives you a single string with a size of 153 KB, too.
f.readlines() gives you a list with a single string for every line, with an overall size of 420 KB.
Splitting the line strings of f.readlines() at every whitespace and saving all those single entries in a new list results in 1619 KB of memory use.
As these numbers don't seem to be a problem in this case, a factor of >10 increase in RAM requirement definitely is one for input data in the GB range.
I don't have any idea why this happens or how to avoid it. From my understanding, a list is just a structure of pointers to all the values stored in the list (this is also the reason why sys.getsizeof() on a list gives you a 'wrong' result).
For the values themselves it shouldn't make a difference in memory whether I have "LONG STRING" or "LONG" + "STRING", as both use the same characters, which should result in the same amount of bits/bytes.
Maybe the answer is really simple, but I am really stuck with this problem, so I am thankful for every idea.
# step1: http://shakespeare.mit.edu/romeo_juliet/full.html
# step2: Ctrl+A and then Ctrl+C
# step3: Ctrl+V after benchmarkText
benchmarkText = """ >>INSERT ASCII DATA HERE<< """

#=== import modules =======================================
from pympler import asizeof
import sys

#=== open files and save data to a structure ==============
#--- original memory size
print("\n\nAll memory sizes are in KB:\n")
print("Original string size:")
print(asizeof.asizeof(benchmarkText)/1e3)
print(sys.getsizeof(benchmarkText)/1e3)

#--- write benchmark file
with open('benchMarkText.txt', 'w') as f:
    f.write(benchmarkText)

#--- read the whole file (should always be equal to original size)
with open('benchMarkText.txt', 'r') as f:
    # read the whole file as one string
    wholeFileString = f.read()
    # check size:
    print("\nSize using f.read():")
    print(asizeof.asizeof(wholeFileString)/1e3)

#--- read the file into a list
listOfWordOrNumberStrings = []
with open('benchMarkText.txt', 'r') as f:
    # save every line of the file
    listOfLineStrings = f.readlines()
    print("\nSize using f.readlines():")
    print(asizeof.asizeof(listOfLineStrings)/1e3)
    # split every line into words or punctuation marks
    for stringLine in listOfLineStrings:
        line = stringLine[:-1]  # get rid of the '\n'
        # line = re.sub('"', '', line)  # The final implementation will need this, but for the test case it doesn't matter.
        elemsInLine = line.split()
        for elem in elemsInLine:
            listOfWordOrNumberStrings.append(elem)
    # check size
    print("\nSize after splitting:")
    print(asizeof.asizeof(listOfWordOrNumberStrings)/1e3)
(I am aware that I use readlines() instead of readline() here - I changed it for this test case because I think it makes things easier to understand.)
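A quick way to see where the extra memory goes is to measure the per-object overhead directly: every Python str is a full object with its own header, so many short strings cost considerably more than one long string holding the same characters, and the list adds its own pointer array on top. The exact numbers below vary by Python version and platform; this is just a sketch of the measurement:

import sys

whole = "LONG STRING"
parts = ["LONG", "STRING"]

print(sys.getsizeof(whole))                  # one string object
print(sum(sys.getsizeof(p) for p in parts))  # two object headers for the same characters
print(sys.getsizeof(parts))                  # plus the list's pointer array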

Optimization of the removal of lines using Python 3

I am currently trying to remove the majority of lines from a large text file and rewrite the chosen information into another file. I have to read the original file line by line, as the order in which the lines appear is relevant. So far, the best approach I could come up with pulls only the relevant lines and rewrites them using something like this:
with open('input.txt', 'r') as input_file:
    with open('output.txt', 'w') as output_file:
        # We only have to loop through the large file once
        for line in input_file:
            # Looping through my data many times is OK as it only contains ~100 elements
            for stuff in data:
                # Search the line
                line_data = re.search(r"(match group a)|(match group b)", line)
                # Verify there is indeed a match to avoid raising an exception.
                # I found using try/except was negligibly slower here
                if line_data:
                    if line_data.group(1):
                        output_file.write('\n')
                    elif line_data.group(2) == stuff:
                        output_file.write('stuff')
output_file.close()
input_file.close()
However, this program still takes ~8 hours to run on a ~1 GB file with ~120,000 matched lines. I believe the bottleneck involves either the regex or the output, since the time taken to complete this script scales linearly with the number of line matches.
I have tried storing the output data in memory first before writing it to the new text file, but a quick test showed that it stored data at roughly the same speed as it wrote it before.
If it helps, I have a Ryzen 5 1500 and 8 GB of 2133 MHz RAM. However, my RAM usage never seems to cap out.
You could move your inner loop so it only runs when needed. Right now you're looping over data for every line in the large file, but only using the stuff variable when you get a match, so move the for stuff in data: loop inside the if block that actually uses it.
for line in input_file:
    # Search the line
    line_data = re.search(r"(match group a)|(match group b)", line)
    # Verify there is indeed a match to avoid raising an exception.
    # I found using try/except was negligibly slower here
    if line_data:
        for stuff in data:
            if line_data.group(1):
                output_file.write('\n')
            elif line_data.group(2) == stuff:
                output_file.write('stuff')
You're also recompiling the regex for every line, which consumes a lot of CPU; compile the regex once before the loop instead, which will save some cycles.
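A minimal sketch combining both suggestions, assuming data is the ~100-element list from the question:

import re

pattern = re.compile(r"(match group a)|(match group b)")  # compiled once, outside the loop

with open('input.txt', 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        line_data = pattern.search(line)
        if line_data:                     # only loop over data when there is a match
            for stuff in data:
                if line_data.group(1):
                    output_file.write('\n')
                elif line_data.group(2) == stuff:
                    output_file.write('stuff')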

Replace Multiple Strings in a Large Text File in Python

Problem:
Replacing multiple string patterns in a large text file is taking a lot of time. (Python)
Scenario:
I have a large text file with no particular structure to it. However, it contains several patterns, for example email addresses and phone numbers.
The text file has over 100 different such patterns and the file is about 10 MB in size (and could grow). The text file may or may not contain all 100 patterns.
At present, I am replacing the matches using re.sub(), and my approach looks as shown below:
readfile = gzip.open(path, 'r')  # read the zipped file
lines = readfile.readlines()     # load the lines

for line in lines:
    if len(line.strip()) != 0:   # strip the empty lines
        linestr += line

for pattern in patterns:         # patterns contains all regexes and their respective replacements
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex)
    linestr = re.sub(compiled_regex, replace, linestr)
This approach is taking a lot of time for large files. Is there a better way to optimize it?
I am thinking of replacing += with .join() but not sure how much that would help.
You could use line_profiler to find which lines in your code take the most time:
pip install line_profiler
kernprof -l run.py
Another thing: I think you're building too large a string in memory; maybe you can make use of generators instead.
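A minimal sketch of how line_profiler is typically used, assuming the replacement logic is wrapped in a function (process_file is a hypothetical name, and patterns is the question's list of regex/replacement pairs). kernprof injects the @profile decorator at runtime when the script is run with kernprof -l -v run.py; the fallback below keeps the script runnable on its own:

import gzip
import re

try:
    profile                          # defined by kernprof when run with -l
except NameError:
    def profile(func):               # no-op fallback for normal runs
        return func

@profile
def process_file(path, patterns):
    with gzip.open(path, 'rt') as fp:
        # generator expression: skip empty lines without building an intermediate list
        text = ''.join(line for line in fp if line.strip())
    for regex, replace in patterns:
        text = re.compile(regex).sub(replace, text)
    return text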
You may obtain slightly better results by doing:

large_list = []
with gzip.open(path, 'r') as fp:
    for line in fp.readlines():
        if line.strip():
            large_list.append(line)
merged_lines = ''.join(large_list)
for regex, replace in patterns:
    compiled_regex = compile_regex(regex)
    merged_lines = re.sub(compiled_regex, replace, merged_lines)
However, further optimization can be achieved if you know what kind of processing you apply. In fact, the last line is the one that takes up all the CPU power (and memory allocation). If the regexes can be applied on a per-line basis, you can achieve great results using the multiprocessing package. Threading won't give you anything because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock).
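Here is a minimal sketch of that per-line multiprocessing idea, under the assumption that every replacement really can be applied to a single line at a time. The example patterns and file names are placeholders, not taken from the question:

import gzip
import re
from multiprocessing import Pool

# Hypothetical (pattern, replacement) pairs standing in for the question's patterns list
PATTERNS = [(r'[\w.+-]+@[\w.-]+', '<EMAIL>'),
            (r'\d{3}-\d{3}-\d{4}', '<PHONE>')]
COMPILED = [(re.compile(p), r) for p, r in PATTERNS]

def scrub_line(line):
    # apply every replacement to one line; runs in a worker process
    for regex, replacement in COMPILED:
        line = regex.sub(replacement, line)
    return line

if __name__ == '__main__':
    with gzip.open('input.txt.gz', 'rt') as fp, open('output.txt', 'w') as out:
        with Pool() as pool:
            # chunksize keeps the per-line inter-process overhead reasonable
            for cleaned in pool.imap(scrub_line, fp, chunksize=1000):
                out.write(cleaned)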

Python large files, how to find specific lines with a particular string

I am using Python to process data from very large text files (~52 GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to find specific lines. Luckily the string is always in the first column.
The whole thing works, and memory is not a problem (I'm not loading the file, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed: the script takes days to run!
The data looks something like this:
scaffold126 1 C 0:0:20:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold126 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
and I am searching for all the lines that start with a particular string in the first column. I want to process the data and send a summary to an output file. Then I search all the lines for another string, and so on...
I am using something like this:
for thisScaff in AllScaffs:
    InFile = open(sys.argv[2], 'r')
    for line in InFile:
        LineList = line.split()
        currentScaff = LineList[0]
        if thisScaff == currentScaff:
            # Then do this stuff...
The main problem seems to be that all 800 million lines have to be looked through to find those that match the current string. Then, once I move on to another string, all 800 million have to be looked through again. I have been exploring grep options, but is there another way?
Many thanks in advance!
Clearly you only want to read the file once; it's very expensive to read it over and over again. To speed up the searching, make a set of the strings you're looking for, like so:

looking_for = set(AllScaffs)
with open(sys.argv[2]) as f:
    for line in f:
        if line.split(None, 1)[0] in looking_for:
            pass  # bingo! found one
line.split(None, 1) splits on whitespace, but at most 1 split is done. For example,
>>> "abc def ghi".split(None, 1)
['abc', 'def ghi']
This is significantly faster than splitting 29 times (which will happen if each line has 30 whitespace-separated columns).
An alternative:
if line[:line.find(' ')] in looking_for:
That's probably faster still, since no list at all is created. It searches for the leftmost blank, and takes the initial slice of line up to (but not including) that blank.
Create an index. It will require a lot of disk space; use it only if you have to perform these scaffold lookups many times.
This is a one-time job that will take a good amount of time, but it will definitely serve you in the long run.
Your index will be of the form:
scaffold126:3|34|234443|4564564|3453454
scaffold666:1|2
scaffold5112:4|23|5456456|345345|234234
where 3, 34, etc. are line numbers. Make sure the final file is sorted alphabetically (to allow for binary search). Let's call this index Index_Primary.
Now you will create a secondary index to make the search faster; let's call it Index_Second. Say Index_Primary contains a hundred thousand lines, each representing one scaffold. Index_Second will give us jump points. It can look like:
scaffold1:1
scaffold1000:850
scaffold2000:1450
This says that information about scaffold2000 is present in line number 1450 of Index_Primary.
So now, say you want to find lines with scaffold1234: you go to Index_Second, which tells you that scaffold1234 is present somewhere between lines 850 and 1450 of Index_Primary. Load that block and start from its middle, i.e. line 1150. Find the required scaffold using binary search and voila! You get the line numbers of the lines containing that scaffold, possibly within milliseconds!
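A minimal sketch of building the primary index described above, assuming the scaffold name is always the first whitespace-separated column (the file names are placeholders):

from collections import defaultdict

def build_primary_index(data_path, index_path):
    index = defaultdict(list)
    with open(data_path) as f:
        for lineno, line in enumerate(f, 1):
            scaffold = line.split(None, 1)[0]
            index[scaffold].append(str(lineno))
    with open(index_path, 'w') as out:
        for scaffold in sorted(index):   # sorted alphabetically for the later binary search
            out.write(scaffold + ':' + '|'.join(index[scaffold]) + '\n')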
My first instinct would be to load your data into a database, making sure to create an index on column 0, and then query as needed.
For a Python approach, try this:
wanted_scaffs = set(['scaffold126', 'scaffold5112'])
files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}
for line in big_file:
    curr_scaff = line.split(' ', 1)[0]  # minimal splitting
    if curr_scaff in wanted_scaffs:
        files[curr_scaff].write(line)
for f in files.values():
    f.close()
Then do your summary reports:
for scaff in wanted_scaffs:
    with open(scaff + '.txt', 'r') as f:
        ...  # summarize your data

In python, is there a way for re.finditer to take a file as input instead of a string?

Let's say I have a really large file foo.txt and I want to iterate through it doing something upon finding a regular expression. Currently I do this:
f = open('foo.txt')
s = f.read()
f.close()
for m in re.finditer(regex, s):
    doSomething()
Is there a way to do this without having to store the entire file in memory?
NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.
UPDATE: I would also like this to work with stdin if possible.
UPDATE: I am considering somehow emulating a string object with a custom file wrapper but I am not sure if the regex functions would accept a custom string-like object.
If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.
from collections import deque

def textwindow(filename, numlines):
    with open(filename) as f:
        window = deque((f.readline() for i in range(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            nextline = f.readline()
            window.append(nextline)

for text in textwindow("bigfile.txt", 10):
    ...  # test to see whether your regex matches and do something
Either you will have to read the file chunk-wise, with overlaps to allow for the maximum possible length of the expression, or use an mmapped file, which will work almost as well as using a stream: https://docs.python.org/library/mmap.html
UPDATE to your UPDATE:
Consider that stdin isn't a file; it just behaves a lot like one in that it has a file descriptor and so on. It is a POSIX stream. If you are unclear on the difference, do some googling around. The OS cannot mmap it, therefore Python cannot either.
Also consider that what you're doing may be ill-suited to a regex. Regexes are great for capturing small stuff, like parsing a connection string, a log entry, CSV data and so on. They are not a good tool for parsing through huge chunks of data; this is by design. You may be better off writing a custom parser.
Some words of wisdom from the past:
http://regex.info/blog/2006-09-15/247
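For reference, a minimal sketch of the mmap approach mentioned above: when a bytes pattern is used, the re functions accept any object exposing the buffer protocol, so the pattern can be matched against the mapped file without reading it all into a Python string. The pattern and file name are placeholders:

import mmap
import re

pattern = re.compile(rb'foo.*?bar', re.DOTALL)   # bytes pattern that may span lines

with open('foo.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in pattern.finditer(mm):
            print(m.start(), m.group())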
Perhaps you could write a function that reads and yields one line of the file at a time and call re.finditer on that until it yields an EOF signal.
Here is another solution, using an internal text buffer to progressively yield found matches without loading the entire file into memory.
This buffer acts like a "sliding window" over the file text, moving forward while yielding found matches.
Because the file content is loaded in chunks, this solution works with multiline regexes too.
def find_chunked(fileobj, regex, *, chunk_size=4096):
    buffer = ""
    while 1:
        text = fileobj.read(chunk_size)
        buffer += text
        matches = list(regex.finditer(buffer))
        # End of file, search through the remaining final buffer and exit
        if not text:
            yield from matches
            break
        # Yield found matches except the last one, which may be
        # incomplete because of the chunk cut (think about '.*')
        if len(matches) > 1:
            end = matches[-2].end()
            buffer = buffer[end:]
            yield from matches[:-1]
However, note that it may end up loading the whole file into memory if no matches are found at all, so you should only use this function if you are confident that your file contains the regex pattern many times.
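A possible usage of the function above, assuming a compiled pattern that may span lines (the pattern and file name are placeholders); matches are consumed lazily as the file is read chunk by chunk:

import re

pattern = re.compile(r'foo.*?bar', re.DOTALL)   # hypothetical multiline pattern

with open('foo.txt') as f:
    for match in find_chunked(f, pattern):
        print(match.group())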
