Part of a Python script that I'm writing requires me to find a particular string in a large text or log file: if it exists then do something; otherwise, do something else.
The files being fed in are extremely large (10 GB+). It feels slow and inefficient to use:
with open('file.txt') as f:
    for line in f:
        if some_string in line:
            return True
return False
If the string doesn't exist in the file, then iterating through would take a long time.
Is there a time efficient way to achieve this?
You can try with mmap:
>>> import mmap
>>> import re
>>> f = open("data.log", "r")
>>> mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
>>> re.search(b"test", mm)
<re.Match object; span=(12, 16), match=b'test'>
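If you only need a yes/no answer rather than the match object, mmap also exposes find, so something along these lines should work too:
>>> mm.find(b"test") != -1
True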
If you're on Linux or BSD (including macOS), I would just create a subprocess with grep or awk and let it do the search; those tools have had decades of optimisation for finding strings in big files. Make sure to include a command-line flag that tells it to stop searching after the first match (e.g. grep -q), if you only care that the string exists and don't need all instances or a count.
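A minimal sketch of that approach, treating the search string as a fixed string (grep -F) and quitting on the first match (grep -q); the function and variable names are just illustrative:
import subprocess

def contains(filename, some_string):
    # grep -q exits with status 0 as soon as a match is found, 1 if there is none;
    # -F treats some_string as a literal string rather than a regex
    result = subprocess.run(["grep", "-qF", some_string, filename])
    return result.returncode == 0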
Try handling larger chunks instead of individual lines. For example:
def contains(filename, some_string):
    n = len(some_string)
    prev_chunk = ''
    with open(filename) as f:
        # read 1 MiB at a time
        while chunk := f.read(2 ** 20):
            # prepend the tail of the previous chunk so a match straddling
            # a chunk boundary is not missed
            if some_string in prev_chunk[-(n-1):] + chunk:
                return True
            prev_chunk = chunk
    return False
I tried that with a made-up 1 GB file and it took about 1 second to check for a string that's not in there.
I'm looking to implement a few lines of Python, using re, to first manipulate a string and then use that string as a regex search. I have strings with *'s in the middle of them, e.g. ab***cd, where the run of *'s can be any length. The aim is to run the regex search over a document and extract any lines that match the starting and finishing characters, with any number of characters in between: ab12345cd, abbbcd and ab_fghfghfghcd would all be positive matches, while 1abcd, agcd and bb111cd would not.
I have come up with the regex [\s\S]*? to substitute for the *'s, so I want to get from an example string of ab***cd to ^ab[\s\S]*?cd, which I will then use for a regex search of the document.
I then wanted to open the file with mmap, search through it using the regex, then save the matches to a file.
import re
import mmap

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search="^"+raw_str #add regex ^ newline operator
    search_rgx=re.sub(r'\*+',r'[\\s\\S]*?',search) #replace * with regex function
    #search file
    with open(list_txt, 'r+') as f:
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx,encoding="utf-8"),data, re.MULTILINE)
    #save results
    f1 = open('results.txt', 'w+b')
    results_bin = b'\n'.join(results)
    f1.write(results_bin)
    f1.close()
    print("Found "+str(file_len("results.txt"))+" results")

searchFile("largelist.txt","ab**cd")
Now this works fine with a small file. However, when the file gets larger (1 GB of text) I get this error:
Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError
Firstly, can anyone help optimize the code slightly? Am I doing something seriously wrong? I used mmap because I knew I wanted to look at large files and I wanted to read the file line by line rather than all at once (hence someone suggested mmap).
I've also been told to have a look at the pandas library for more data manipulation. Would pandas replace mmap?
Thanks for any help. I'm pretty new to Python, as you can tell, so I appreciate any pointers.
You are doing line-by-line processing, so you want to avoid accumulating data in memory. Regular file reads and writes should work well here. mmap is backed by virtual memory, but that still has to turn into real memory as you read it. Accumulating results in findall is also a memory hog. Try this as an alternative:
import re

# buffer to 1 Meg, but any effect would be modest
MEG = 2**20

def searchFile(filename, raw_str):
    # extract start and end from "ab***cd"
    startswith, endswith = re.match(r"([^\*]+)\*+?([^\*]+)", raw_str).groups()
    with open(filename, buffering=MEG) as in_f, open("results.txt", "w", buffering=MEG) as out_f:
        for line in in_f:
            stripped = line.strip()
            if stripped.startswith(startswith) and stripped.endswith(endswith):
                out_f.write(line)

# write test file
test_txt = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""

want = """ab12345cd
abbbcd
ab_fghfghfghcd
"""

open("test.txt", "w").write(test_txt)
searchFile("test.txt", "ab**cd")
result = open("results.txt").read()
print(result == want)
I am not sure what advantage you expect from opening the input file with mmap, but since each string that must be matched is delimited by a newline (as per your comment), I would use the approach below (note that it is Python, but deliberately kept as pseudocode):
with open(input_file_path, "r") as input_file:
    with open(output_file_path, "x") as output_file:
        for line in input_file:
            if is_match(line):
                print(line, file=output_file)
possibly tuning the end parameter of the print function to your needs (for instance end="" if each line already carries its own newline).
This way results are written as they are generated, and you avoid holding a large result set in memory before writing it out.
Furthermore, you don't need to worry about newlines, only about whether each line matches.
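For completeness, a minimal sketch of what is_match could look like here, using the ^ab[\s\S]*?cd pattern from the question (is_match itself is just the placeholder name used above):
import re

pattern = re.compile(r"^ab[\s\S]*?cd")

def is_match(line):
    # True if the line starts with "ab" and contains "cd" somewhere after it
    return pattern.match(line) is not None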
How about this? In this situation, what you want is a list of all of your lines represented as strings. The following emulates that, resulting in a list of strings:
import io
longstring = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""
list_of_strings = io.StringIO(longstring).read().splitlines()
list_of_strings
Outputs
['ab12345cd', 'abbbcd', 'ab_fghfghfghcd', '1abcd', 'agcd', 'bb111cd']
This is the part that matters
import pandas as pd

s = pd.Series(list_of_strings)
s[s.str.match(r'^ab[\s\S]*?cd')]
Outputs
0 ab12345cd
1 abbbcd
2 ab_fghfghfghcd
dtype: object
Edit 2: Try this (I don't see a reason for you to want it as a function, but I've done it that way since that's what you did in the comments):
import pandas as pd

def newsearch(filename):
    with open(filename, 'r', encoding="utf-8") as f:
        list_of_strings = f.read().splitlines()
    s = pd.Series(list_of_strings)
    s = s[s.str.match(r'^ab[\s\S]*?cd')]
    s.to_csv('output.txt', header=False, index=False)

newsearch('list.txt')
A chunk-based approach
import os
import pandas as pd

def newsearch(filename):
    outpath = 'output.txt'
    if os.path.exists(outpath):
        os.remove(outpath)
    for chunk in pd.read_csv(filename, sep='|', header=None, chunksize=10**6):
        chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
        chunk[0].to_csv(outpath, index=False, header=False, mode='a')

newsearch('list.txt')
A dask approach
import dask.dataframe as dd

def newsearch(filename):
    chunk = dd.read_csv(filename, header=None, blocksize=25e6)
    chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
    chunk[0].to_csv('output.txt', index=False, header=False, single_file=True)

newsearch('list.txt')
I need to read a binary file and write its content in the form of a text file that will initialize a memory model. The problem is, I need to switch endianness in the process. Let's look at an example.
The binary file content, when I read it with:
with open(source_name, mode='rb') as file:
fileContent = file.read().hex()
filecontent: "aa000000bb000000...".
I need, to transform that into "000000aa000000bb...".
Of course, I can split this string into list of 8 chars substrings, than manualy reorganize it like newsubstr = substr[6:8]+substr[4:6]+substr[2:4]+substr[0:2]
, and then merge them into result string, but that seems clumsily, I suppose there is more natural way to do this in python.
Thanks to k1m190r, I found out about struct module which looks like what I need, but I still lost. I just designed another clumsy solution:
import struct

with open(source_name, mode='rb') as file:
    fileContent = file.read()

while len(fileContent) % 4 != 0:
    fileContent += b"\x00"

res = ""
for i in range(0, len(fileContent), 4):
    substr = fileContent[i:i+4]
    substr_val = struct.unpack("<L", substr)[0]
    res += struct.pack(">L", substr_val).hex()
Is there a more elegant way? This solution is just slightly better than the original.
Actually in your specific case you don't even need struct. Below should be sufficient.
from binascii import b2a_hex

# open files in binary
with open("infile", "rb") as infile, open("outfile", "wb") as outfile:
    # read 4 bytes at a time till read() spits out the empty byte string b""
    for x in iter(lambda: infile.read(4), b""):
        if len(x) != 4:
            # skip the last bit if it is not 4 bytes long
            break
        outfile.write(b2a_hex(x[::-1]))
As for a more elegant way: alternatively, you can craft a "smarter" struct format string. Format specifiers take a number prefix which is the number of repetitions, e.g. 10L is the same as LLLLLLLLLL, so you can inject the size of your data divided by 4 before the letter and convert the entire thing in one go (or a few steps; I don't know how big the count can be).
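A rough sketch of that idea, assuming fileContent has already been padded to a multiple of 4 bytes as in the code above:
import struct

n = len(fileContent) // 4
words = struct.unpack("<%dL" % n, fileContent)  # n little-endian unsigned 32-bit integers
res = struct.pack(">%dL" % n, *words).hex()     # repack big-endian, then hex-encode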
array.array might also work, as it has a byteswap method, but you can't specify the input endianness (I think), so it's iffier.
To answer the original question:
import re
changed = re.sub(b'(....)', lambda x:x.group()[::-1], bindata)
Note: original had r'(....)' when the r should have been b.
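A quick illustrative check with the example bytes from the question (bindata here is a stand-in for the real file contents):
import re

bindata = bytes.fromhex("aa000000bb000000")
changed = re.sub(b'(....)', lambda x: x.group()[::-1], bindata)
print(changed.hex())  # prints 000000aa000000bb
# note: if the data can contain 0x0a bytes, add flags=re.DOTALL so '.' matches them too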
I have a dictionary file that contains a word on each line.
titles-sorted.txt
a&a
a&b
a&c_bus
a&e
a&f
a&m
....
For each word, its line number is the word's id.
Then I have another file that contains a set of words separated by tabs on each line.
a.txt
a_15 a_15_highway_(sri_lanka) a_15_motorway a_15_motorway_(germany) a_15_road_(sri_lanka)
I'd like to replace each word with its id if it exists in the dictionary, so that the output looks like:
3454 2345 123 5436 322 ....
So I wrote such python code to do this:
f = open("titles-sorted.txt")
lines = f.readlines()
titlemap = {}
nr = 1
for l in lines:
l = l.replace("\n", "")
titlemap[l.lower()] = nr
nr+=1
fw = open("a.index", "w")
f = open("a.txt")
lines = f.readlines()
for l in lines:
tokens = l.split("\t")
if tokens[0] in titlemap.keys():
fw.write(str(titlemap[tokens[0]]) + "\t")
for t in tokens[1:]:
if t in titlemap.keys():
fw.write(str(titlemap[t]) + "\t")
fw.write("\n")
fw.close()
f.close()
But this code is ridiculously slow, which makes me suspect that I haven't done everything right.
Is this an efficient way to do this?
The write loop contains a lot of calls to write, which are usually inefficient. You can probably speed things up by writing only once per line (or once per file, if the file is small enough):
tokens = l.split("\t")
fw.write('\t'.join(str(titlemap[t]) for t in tokens if t in titlemap))
fw.write("\n")
or even:
lines = []
for l in f:
    lines.append('\t'.join(str(titlemap[t]) for t in l.split('\t') if t in titlemap))
fw.write('\n'.join(lines))
Also, if your tokens are used more than once, you can save time by converting them to strings when you read them:
titlemap = {l.strip().lower(): str(index) for index, l in enumerate(f, start=1)}
So, I suspect this differs based on the operating system you're running on and the specific Python implementation (someone wiser than I may be able to provide some clarity here), but I have a suspicion about what is going on:
Every time you call write, some amount of your desired write request gets written to a buffer, and once the buffer is full, that information is written to the file. The file needs to be fetched from your hard disk (as it doesn't exist in main memory), so your computer pauses while it waits the several milliseconds it takes to fetch the block from the hard disk and write to it. On the other hand, the parsing of the string and the lookup in your hashmap take only a couple of nanoseconds, so you spend most of your time waiting for the write requests to finish!
Instead of writing immediately, what if you kept a list of the lines that you wanted to write and wrote them only at the end, all in a row; or, if you're handling a huge file that will exceed the capacity of your main memory, wrote them out once you have parsed a certain number of lines?
This allows the writing to disk to be optimized, as you can write multiple blocks at a time (again, this depends on how Python and the operating system handle the write call).
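A minimal sketch of that batching idea (input_file, output_file and process are placeholders here, not names from your code):
BATCH_SIZE = 100000  # flush after this many lines; tune to your memory budget
buffered = []
for line in input_file:
    buffered.append(process(line))  # process() stands in for your split/lookup logic
    if len(buffered) >= BATCH_SIZE:
        output_file.write("\n".join(buffered) + "\n")
        buffered.clear()
if buffered:  # write whatever is left over
    output_file.write("\n".join(buffered) + "\n")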
If we apply the suggestions so far and clean up your code some more (e.g. remove unnecessary .keys() calls), is the following still too slow for your needs?
title_map = {}
token_file = open("titles-sorted.txt")
for number, line in enumerate(token_file):
    title_map[line.rstrip().lower()] = str(number + 1)
token_file.close()

input_file = open("a.txt")
output_file = open("a.index", "w")
for line in input_file:
    tokens = line.split("\t")
    if tokens[0] in title_map:
        output_list = [title_map[tokens[0]]]
        output_list.extend(title_map[token] for token in tokens[1:] if token in title_map)
        output_file.write("\t".join(output_list) + "\n")
output_file.close()
input_file.close()
If it's still too slow, give us slightly more data to work with including an estimate of the number of lines in each of your two input files.
I have an input file containing a list of strings.
I am iterating through every fourth line starting on line two.
From each of these lines I make a new string from the first and last 6 characters and put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep sequencing files, and it has been running for a day without making much progress. So I'm looking for any suggestions to make this faster, if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            #Make string from first and last 6 characters of a line
            lineChars = line[0:6]+line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n') #If string is unique, write to output file
            for skip in range(3): #Used to step through four lines at a time
                try:
                    check = line #Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
You can use https://docs.python.org/2/library/itertools.html#itertools.islice:
import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            s = line[:6]+line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using a set as Oscar suggested, you can also use islice to skip lines rather than stepping over them manually with a for loop.
As stated in this post, islice preprocesses the iterator in C, so it should be much faster than using a plain vanilla python for loop.
Try replacing
lineChars = line[0:6]+line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.
In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how to read a binary file and 'split' (by generator) its content by some given marker, not the newline '\n'?
I want something like that:
content = open('somefile', 'r').read()
result = content.split('some_marker')
but, of course, memory-efficient (the file is around 70 GB). And of course we can't read the file byte by byte (it would be too slow because of the nature of HDDs).
The length of the 'chunks' (the data between those markers) may vary, theoretically from 1 byte to megabytes.
So, to give an example to sum up, the data looks like that (digits mean bytes here, the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there any simple way to do that (without implementing reading in chunks, splitting the chunks, remembering tails, etc.)?
There is no magic in Python that will do it for you, but it's not hard to write. For example:
def split_file(fp, marker):
    BLOCKSIZE = 4096
    result = []
    current = ''
    for block in iter(lambda: fp.read(BLOCKSIZE), ''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    result.append(current)
    return result
Memory usage of this function can be further reduced by turning it into a generator, i.e. converting result.append(...) to yield .... This is left as an exercise to the reader.
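For reference, a possible generator version of the same function (an untested sketch; identical logic with yield instead of appending to a list):
def split_file_gen(fp, marker):
    BLOCKSIZE = 4096
    current = ''
    for block in iter(lambda: fp.read(BLOCKSIZE), ''):
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
    yield current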
The general idea: using mmap, you can then run re.finditer over it:
import mmap
import re

with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    markers = re.finditer('(.*?)MARKER', mf)
    for marker in markers:
        print marker.group(1)
I haven't tested, but you may want a (.*?)(MARKER|$) or similar in there as well.
Then, it's down to the OS to provide the necessaries for access to the file.
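Note that the snippet above is Python 2; on Python 3 the pattern has to be bytes when searching an mmap, so a minimal adaptation might look like this (marker spelling taken from the question's example data):
import mmap
import re

with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for m in re.finditer(b'(.*?)-MARKER-', mf):
        # as noted above, you may want (.*?)(-MARKER-|$) and/or re.DOTALL,
        # depending on what the data can contain
        print(m.group(1))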
I don't think there's any built-in function for that, but you can "read in chunks" nicely with an iterator to prevent memory inefficiency, similarly to user4815162342's suggestion:
def split_by_marker(f, marker="-MARKER-", block_size=4096):
    current = ''
    while True:
        block = f.read(block_size)
        if not block:  # end-of-file
            yield current
            return
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos < 0:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
This way you won't hold all the results in memory at once, and you can still iterate over it like:
for line in split_by_marker(open(filename, 'rb')): ...
Just make sure that each "line" does not take too much memory...
Readline itself reads in chunks, splits the chunks, remembers tails, etc. So, no.