Replace Multiple Strings in a Large Text File in Python

Problem:
Replacing multiple string patterns in a large text file is taking a lot of time. (Python)
Scenario:
I have a large text file with no particular structure to it, but it contains several patterns, for example email addresses and phone numbers.
The text file has over 100 different such patterns and is about 10 MB in size (it could grow). The file may or may not contain all 100 patterns.
At present I am replacing the matches using re.sub(), and my approach looks as shown below.
import gzip
import re

readfile = gzip.open(path, 'r')   # read the zipped file
lines = readfile.readlines()      # load the lines
linestr = ''
for line in lines:
    if len(line.strip()) != 0:    # skip the empty lines
        linestr += line
for pattern in patterns:          # patterns contains all regexes and their respective replacements
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex)   # compile_regex: helper that returns a compiled pattern
    linestr = re.sub(compiled_regex, replace, linestr)
This approach is taking a lot of time for large files. Is there a better way to optimize it?
I am thinking of replacing += with .join() but not sure how much that would help.

You could use line_profiler to find which lines in your code take the most time:
pip install line_profiler
kernprof -l run.py
Another thing: I think you're building too large a string in memory; maybe you can make use of generators, as sketched below.
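For example, something along these lines would stream the file instead of building one big string. This is only a rough sketch: it assumes your patterns can safely be applied line by line, and the output file name is made up; path and patterns are the names from your question.

import gzip
import re

def substituted_lines(path, compiled_patterns):
    # Yield non-empty lines with every pattern already applied,
    # so only one line is held in memory at a time.
    with gzip.open(path, 'rt') as f:
        for line in f:
            if line.strip():
                for regex, replacement in compiled_patterns:
                    line = regex.sub(replacement, line)
                yield line

compiled_patterns = [(re.compile(regex), replace) for regex, replace in patterns]
with open('output.txt', 'w') as out:   # output file name is an assumption
    out.writelines(substituted_lines(path, compiled_patterns))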

You may obtain slightly better results by doing:
import gzip
import re

large_list = []
with gzip.open(path, 'r') as fp:
    for line in fp.readlines():
        if line.strip():
            large_list.append(line)
merged_lines = ''.join(large_list)

for regex, replace in patterns:
    compiled_regex = compile_regex(regex)   # your helper that returns a compiled pattern
    merged_lines = re.sub(compiled_regex, replace, merged_lines)
However, further optimization can be achieved if you know what kind of processing you apply. In fact, the last line will be the one that takes up all the CPU power (and memory allocation). If the regexes can be applied on a per-line basis, you can achieve great results using the multiprocessing package (see the sketch below). Threading won't give you anything because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock).
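A minimal sketch of that per-line multiprocessing idea, assuming every regex/replacement pair is safe to apply to a single line in isolation; path and patterns are the names from the question.

import gzip
import multiprocessing
import re

def init_worker(raw_patterns):
    # Compile the patterns once per worker process.
    global COMPILED
    COMPILED = [(re.compile(regex), replace) for regex, replace in raw_patterns]

def apply_patterns(line):
    for regex, replace in COMPILED:
        line = regex.sub(replace, line)
    return line

if __name__ == '__main__':
    # path and patterns come from the question
    with gzip.open(path, 'rt') as f:
        lines = [line for line in f if line.strip()]
    with multiprocessing.Pool(initializer=init_worker, initargs=(patterns,)) as pool:
        merged = ''.join(pool.map(apply_patterns, lines, chunksize=1000))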

Related

Lower execution time for apache log parser in Python

I have a school assignment where I was tasked with writing an Apache log parser in Python. The parser extracts all the IP addresses and all the HTTP methods using a regex and stores them in a nested dictionary. The code can be seen below:
from re import search

def aggregatelog(filename):
    keyvaluepairscounter = {"IP": {}, "HTTP": {}}
    with open(filename, "r") as file:
        for line in file:
            # Combines the regexes: IP (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) and HTTP method ("(\b[A-Z]+\b))
            result = search(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)', line).groups()
            if result[0] in set(keyvaluepairscounter["IP"].keys()):  # Using set will lower look up time complexity from O(n) to O(1)
                keyvaluepairscounter["IP"][result[0]] += 1
            else:
                keyvaluepairscounter["IP"][result[0]] = 1
            if result[1] in set(keyvaluepairscounter["HTTP"].keys()):
                keyvaluepairscounter["HTTP"][result[1]] += 1
            else:
                keyvaluepairscounter["HTTP"][result[1]] = 1
    return keyvaluepairscounter
This code works (it gives me the expected data for the log files we were given). However, when extracting data from large log files (in my case, ~500 MB) the program is VERY slow (it takes ~30 min for the script to finish). According to my teacher, a good script should be able to process the large file in under 3 minutes (wth?). My question is: Is there anything I can do to speed up my script? I have done some things, like replacing the lists with sets which have better lookup times.
At minimum, pre-compile your regex before the loop, i.e.
pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)')
then later in your loop:
for line in file:
    result = pattern.search(line).groups()
You should also consider optimizing your pattern, especially the greedy .*, as it is an expensive operation; one possible tightening is sketched below.
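For example, something like this might help (an assumption on my part, not tested against your logs): anchor the IP at the start of the line and make the middle part non-greedy so it stops at the first quote.

import re

# Lazy .*? stops at the first " instead of scanning to the end of the line and backtracking.
pattern = re.compile(r'^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*?"([A-Z]+)\b')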
I found my answer. Iterate directly over what re.findall() returns instead of storing the returned data in a list first:
for data in re.findall(pattern, text):
    # do things
instead of
array = re.findall(pattern, text)
for data in array:
    # do things
I also read the entire file in one go:
file = open("file", "r")
text = file.read()
This implementation processed the file in under 1 minute!
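For reference, here is a hedged rewrite that combines the suggestions in this thread (pre-compiled pattern, one pass over the file, collections.Counter instead of manual dictionary bookkeeping); the pattern is the one from the question, everything else is just one possible way to arrange it.

import re
from collections import Counter

pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)')

def aggregatelog(filename):
    counters = {"IP": Counter(), "HTTP": Counter()}
    with open(filename, "r") as file:
        for line in file:
            match = pattern.search(line)
            if match:                      # skip lines that do not match at all
                ip, method = match.groups()
                counters["IP"][ip] += 1
                counters["HTTP"][method] += 1
    return counters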

Optimization of the removal of lines using Python 3

I am currently trying to remove the majority of lines from a large text file and rewrite the chosen information into another. I have to read the original file line-by-line as the order in which the lines appear is relevant. So far, the best approach I could think of pulled only the relevant lines and rewrote them using something like:
with open('input.txt', 'r') as input_file:
    with open('output.txt', 'w') as output_file:
        # We only have to loop through the large file once
        for line in input_file:
            # Looping through my data many times is OK as it only contains ~100 elements
            for stuff in data:
                # Search the line
                line_data = re.search(r"(match group a)|(match group b)", line)
                # Verify there is indeed a match to avoid raising an exception.
                # I found using try/except was negligibly slower here
                if line_data:
                    if line_data.group(1):
                        output_file.write('\n')
                    elif line_data.group(2) == stuff:
                        output_file.write('stuff')
output_file.close()
input_file.close()
However, this program still takes ~8 hours to run on a ~1 GB file with ~120,000 matched lines. I believe the bottleneck involves either the regex or the output, since the time taken to complete this script scales linearly with the number of line matches.
I have tried storing the output data in memory first before writing it to the new text file, but a quick test showed that it stored the data at roughly the same speed as it previously wrote it.
If it helps, I have a Ryzen 5 1500 and 8 GB of 2133 MHz RAM. However, my RAM usage never seems to cap out.
You could move your inner loop to only run when needed. Right now, you're looping over data for every line in the large file, but only using the stuff variable when you match. So just move the for stuff in data: loop to inside the if block that actually uses it.
for line in input_file:
    # Search the line
    line_data = re.search(r"(match group a)|(match group b)", line)
    # Verify there is indeed a match to avoid raising an exception.
    # I found using try/except was negligibly slower here
    if line_data:
        for stuff in data:
            if line_data.group(1):
                output_file.write('\n')
            elif line_data.group(2) == stuff:
                output_file.write('stuff')
You're also generating the regex for each line, which consumes a lot of CPU; you should compile the regex once before the loop instead, which would save some cycles (see the sketch below).
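Putting both suggestions together, a sketch might look like this (the pattern, data and the output strings are placeholders carried over from the question):

import re

pattern = re.compile(r"(match group a)|(match group b)")   # compiled once, outside the loop
with open('input.txt', 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        line_data = pattern.search(line)
        if line_data:
            # Only loop over data when the line actually matched
            for stuff in data:
                if line_data.group(1):
                    output_file.write('\n')
                elif line_data.group(2) == stuff:
                    output_file.write('stuff')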

Most optimized way to replace string in a set of files from a list

I have a list that contains some strings.
I have a set of files that may or may not contain these strings.
I need to replace these strings with a modified version of the string in every instance in the files (e.g. string1_abc -> string1_xyz, string2_abc -> string2_xyz). In essence, the substring that needs to be replaced and/or modified is common among all the items in the list.
Is there any optimized or easy way of doing that? The most naive algorithm I can think of looks at each line in each file, and for each line iterates over each item in the list and replaces it using line.replace(). I know this gives me O(mnq) complexity, where m = number of files, n = number of lines per file and q = number of items in the list.
Note:
- The file sizes aren't very large, so I'm not sure whether reading line by line or doing a file.read() into a buffer would be better.
- q isn't very large either; the list is about 40-50 items.
- m is quite large.
- n can go up to 5000 lines.
Also, I've only played around with Python on the side and am not very used to it, and I'm limited to using Python 2.6.
Pseudo Python:
import glob

LoT = [("string1_abc", "string1_xyz"), ("string2_abc", "string2_xyz")]

for fn in glob.glob(glob_describes_your_files):
    with open(fn) as f_in:
        buf = f_in.read()   # You said n is about 5000 lines, so I would just read it in
    for t in LoT:
        buf = buf.replace(*t)
    # write buf back out to a new file or the existing one
    with open(fn, "w") as f_out:
        f_out.write(buf)
Something like that...
If the files are BIG, investigate using an mmap on the files; everything else is more or less the same (a rough sketch follows).
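A very rough Python 2 sketch of the mmap variant, reusing fn and LoT from above; how you write the result back (new file or in place) is up to you, and plain text encoding is assumed.

import mmap

with open(fn, 'rb') as f_in:
    mm = mmap.mmap(f_in.fileno(), 0, access=mmap.ACCESS_READ)
    buf = mm[:]                     # pull the mapped contents into a string
    mm.close()
for old, new in LoT:
    buf = buf.replace(old, new)
with open(fn, 'wb') as f_out:
    f_out.write(buf)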

Python large files, how to find specific lines with a particular string

I am using Python to process data from very large text files (~52 GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to find specific lines. Luckily the string is always in the first column.
The whole thing works, memory is not a problem (I'm not loading the file, just opening and closing it as needed) and I run it on a cluster anyway. It's more about speed: the script takes days to run!
The data looks something like this:
scaffold126 1 C 0:0:20:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold126 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
and I am searching for all the lines that start with a particular string from the first column. I want to process the data and send a summary to a output file. Then I search for all the lines for another string and so on...
I am using something like this:
for thisScaff in AllScaffs:
    InFile = open(sys.argv[2], 'r')
    for line in InFile:
        LineList = line.split()
        currentScaff = LineList[0]
        if thisScaff == currentScaff:
            # Then do this stuff...
The main problem seems to be that all 800 million lines have to be looked through to find those that match the current string. Then once I move to another string, all 800 million have to be looked through again. I have been exploring grep options, but is there another way?
Many thanks in advance!
Clearly you only want to read the file once. It's very expensive to read it over and over again. To speed searching, make a set of the strings you're looking for. Like so:
looking_for = set(AllScaffs)
with open(sys.argv[2]) as f:
    for line in f:
        if line.split(None, 1)[0] in looking_for:
            # bingo! found one
line.split(None, 1) splits on whitespace, but at most 1 split is done. For example,
>>> "abc def ghi".split(None, 1)
['abc', 'def ghi']
This is significantly faster than splitting 29 times (which will happen if each line has 30 whitespace-separated columns).
An alternative:
if line[:line.find(' ')] in looking_for:
That's probably faster still, since no list at all is created. It searches for the leftmost blank, and takes the initial slice of line up to (but not including) that blank.
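If you want to verify that claim on your own data, a quick check with timeit might look like this (the sample line is copied from the data shown in the question):

from timeit import timeit

line = "scaffold126 1 C 0:0:20:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0"
print(timeit("line.split(None, 1)[0]", globals=globals()))    # split once, take first field
print(timeit("line[:line.find(' ')]", globals=globals()))     # slice up to the first blank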
Create an index. It'll require a lot of disk space, so use it only if you have to perform these scaffold lookups many times.
This will be a one-time job; it will take a good amount of time, but it will definitely serve you in the long run.
Your Index will be of the form:
scaffold126:3|34|234443|4564564|3453454
scaffold666:1|2
scaffold5112:4|23|5456456|345345|234234
where 3, 34, etc. are the line numbers on which that scaffold appears. Make sure the final file is sorted alphabetically (to allow binary search). Let's call this index Index_Primary.
Now you will create a secondary index to make the search faster; let's call it Index_Second. Say Index_Primary contains a hundred thousand lines, each line representing one scaffold. Index_Second will give us jump points. It can look like:
scaffold1:1
scaffold1000:850
scaffold2000:1450
This says that information about scaffold2000 is present on line 1450 of Index_Primary.
So now, say you want to find lines with scaffold1234: you go to Index_Second, which tells you that scaffold1234 is present somewhere between lines 850 and 1450 of Index_Primary. Load that block and start from its middle, i.e. line 1150. Find the required scaffold using binary search and voila! You get the line numbers of the lines containing that scaffold, possibly within milliseconds! A rough one-pass sketch of building Index_Primary is shown below.
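A rough one-pass sketch of building Index_Primary; the index file name and exact format are assumptions, and the data file is taken from sys.argv[2] as in the question.

import sys
from collections import defaultdict

index = defaultdict(list)
with open(sys.argv[2]) as f:
    for lineno, line in enumerate(f, 1):
        if not line.strip():
            continue                              # skip blank lines, if any
        scaffold = line.split(None, 1)[0]
        index[scaffold].append(str(lineno))

with open('index_primary.txt', 'w') as out:
    for scaffold in sorted(index):                # sorted so binary search works later
        out.write('%s:%s\n' % (scaffold, '|'.join(index[scaffold])))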
My first instinct would be to load your data into a database, making sure to create an index from column 0, and then query as needed.
For a Python approach, try this:
wanted_scaffs = set(['scaffold126', 'scaffold5112'])
files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}
for line in big_file:
    curr_scaff = line.split(' ', 1)[0]  # minimal splitting
    if curr_scaff in wanted_scaffs:
        files[curr_scaff].write(line)
for f in files.values():
    f.close()
Then do your summary reports:
for scaff in wanted_scaffs:
    with open(scaff + '.txt', 'r') as f:
        ...  # summarize your data

In python, is there a way for re.finditer to take a file as input instead of a string?

Let's say I have a really large file foo.txt and I want to iterate through it doing something upon finding a regular expression. Currently I do this:
f = open('foo.txt')
s = f.read()
f.close()

for m in re.finditer(regex, s):
    doSomething()
Is there a way to do this without having to store the entire file in memory?
NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.
UPDATE: I would also like this to work with stdin if possible.
UPDATE: I am considering somehow emulating a string object with a custom file wrapper but I am not sure if the regex functions would accept a custom string-like object.
If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.
from collections import deque

def textwindow(filename, numlines):
    with open(filename) as f:
        # xrange is Python 2; use range on Python 3
        window = deque((f.readline() for i in xrange(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            nextline = f.readline()
            window.append(nextline)

for text in textwindow("bigfile.txt", 10):
    # test to see whether your regex matches and do something
Either you will have to read the file chunk-wise, with overlaps to allow for the maximum possible length of the expression, or use an mmapped file, which will work almost as well as using a stream: https://docs.python.org/library/mmap.html (a short sketch follows at the end of this answer).
UPDATE to your UPDATE:
Consider that stdin isn't a file; it just behaves a lot like one in that it has a file descriptor and so on. It is a POSIX stream. If you are unclear on the difference, do some googling around. The OS cannot mmap it, therefore Python cannot either.
Also consider that what you're doing may be an ill-suited job for a regex. Regexes are great for capturing small stuff, like parsing a connection string, a log entry, CSV data and so on. They are not a good tool for parsing huge chunks of data; this is by design. You may be better off writing a custom parser.
Some words of wisdom from the past:
http://regex.info/blog/2006-09-15/247
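For the mmap route mentioned above: re can search a memory-mapped file directly as long as the pattern is a bytes pattern, so something like this stays out of memory trouble. The pattern here is just an illustration, and doSomething() is the placeholder from the question.

import mmap
import re

pattern = re.compile(br'\d{3}-\d{4}')            # example bytes pattern (an assumption)
with open('foo.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for m in re.finditer(pattern, mm):           # re accepts the mmap object as a buffer
        doSomething()
    mm.close()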
Perhaps you could write a function that yields one line (reads one line) at a time of the file and call re.finditer on that until it yields an EOF signal.
Here is another solution, using an internal text buffer to progressively yield found matches without loading the entire file in memory.
This buffer acts like a "sliding window" over the file text, moving forward while yielding found matches.
Because the file content is loaded in chunks, this solution works with multi-line regexes too.
def find_chunked(fileobj, regex, *, chunk_size=4096):
    buffer = ""
    while 1:
        text = fileobj.read(chunk_size)
        buffer += text
        matches = list(regex.finditer(buffer))
        # End of file, search through the remaining final buffer and exit
        if not text:
            yield from matches
            break
        # Yield found matches except the last one, which may be
        # incomplete because of the chunk cut (think about '.*')
        if len(matches) > 1:
            end = matches[-2].end()
            buffer = buffer[end:]
            yield from matches[:-1]
However, note that it may end up loading the whole file in memory if no matches are found at all, so you should only use this function if you are confident that your file contains the regex pattern many times.
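A possible way to use it (the pattern and file name are made up for illustration):

import re

pattern = re.compile(r'\d+\.\d+')                 # example pattern
with open('foo.txt') as f:
    for match in find_chunked(f, pattern, chunk_size=8192):
        print(match.group())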
