How to make log parsing faster for large text files - python

I have large (500,000-line) log files that I parse, looking for specified sections. When a section is found, its lines are printed to a Text widget. Even if I cut the readlines down to the last 50,000 lines, it takes upwards of a minute or longer to finish.
with open(i, "r") as f:
    r = f.readlines()
    r = r[-50000:]
    start = 0
    for line in r:
        if 'Start section' in line:
            if start == 1:
                cpfotxt.insert('end', line + "\n", 'hidden')
            start = 1
        if 'End section' in line:
            start = 0
            cpfotxt.insert('end', line + "\n")
        if start == 1:
            cpfotxt.insert('end', line + "\n")
Any way to do this faster?

You should try to read it in chunks.
with open(...) as f:
    for line in f:
        ...  # do something with line
A clearer approach that can be applied in your case:
def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2 kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

with open('bigFile') as f:
    for chunk in readInChunks(f):
        do_something(chunk)
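Since the goal here is specifically the last 50,000 lines, another option (a sketch, not part of the answer above) is collections.deque with maxlen, which streams the file once and keeps only the tail in memory; i and cpfotxt below are the question's own names for the file path and Text widget:
from collections import deque

# Stream the file once, keeping only the last 50,000 lines.
# `i` is the question's variable holding the file path.
with open(i, "r") as f:
    tail = deque(f, maxlen=50000)

for line in tail:
    ...  # same Start/End section logic as in the question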

Another possibility is to use seek() to skip over most of the file. However, this requires some idea of how many bytes the last 50K lines occupy. Instead of reading through all the early lines, jump close to the end:
import os

with open(i, 'rb') as f:  # binary mode is required to seek relative to the end
    f.seek(-50000 * 80, os.SEEK_END)  # assumes roughly 80 bytes per line
    # insert your processing here
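One caveat: if the file is shorter than the estimate, the negative seek raises OSError. A sketch that guards against that (the 80-bytes-per-line figure is an assumption to tune for your logs):
import os

LINES_WANTED = 50000
AVG_LINE_LEN = 80  # assumption: adjust to your log format

with open(i, 'rb') as f:  # binary mode so seeking from the end works
    try:
        f.seek(-LINES_WANTED * AVG_LINE_LEN, os.SEEK_END)
        f.readline()  # discard the (almost certainly partial) first line
    except OSError:
        f.seek(0)  # the file is smaller than the estimate: read it all
    for raw in f:
        line = raw.decode('utf-8', errors='replace')
        # apply the Start/End section logic from the question here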

Related

How can I handle multiple lines at once while reading from a file?

The standard Python approach to working with files is to use open() to create a file object f, which lets you either load the entire file into memory at once with f.read() or read lines one by one in a for loop:
with open('filename') as f:
    # 1) Read the whole file at once into memory:
    all_data = f.read()

    # 2) Or read lines one-by-one:
    for line in f:
        ...  # work with each line
I'm searching through several large files looking for a pattern that might span multiple lines. The most intuitive way to do this is to read line-by-line looking for the beginning of the pattern, and then to load in the next few lines to see where it ends:
with open('large_file') as f:
    # Read lines one-by-one:
    for line in f:
        if line.startswith("beginning"):
            # Load in the next line, i.e.
            nextline = f.getline(line+1) # ??? #
            # or something
The line I've marked with # ??? # is my own pseudocode for what I imagine this should look like.
My question is, does this exist in Python? Is there any method for me to access other lines as needed while keeping the cursor at line and without loading the entire file into memory?
Edit: Inferring from the responses here and other reading, the answer is "No."
Like this:
gather = []
for line in f:
    if gather:
        gather.append(line)
        if "ending" in line:
            process(''.join(gather))
            gather = []
    elif line.startswith("beginning"):
        gather = [line]
Although in many cases it's easier just to load the whole file into a string and search it.
You may want to rstrip the newline before appending the line.
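For the load-everything route, a regular expression can find a pattern that spans lines in one pass; a minimal sketch, reusing the question's "beginning"/"ending" markers and the process() placeholder from the answer above:
import re

with open('large_file') as f:
    text = f.read()  # fine whenever the file fits in memory

# re.MULTILINE lets ^ match at each line start; re.DOTALL lets . cross
# newlines, and the non-greedy .*? stops at the first "ending"
for match in re.finditer(r'^beginning.*?ending', text, re.MULTILINE | re.DOTALL):
    process(match.group())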
Just store the interesting lines into a list while going line-wise through the file:
with open("file.txt", "w") as f:
    f.write("""
a
b
------
c
d
e
####
g
f""")

interesting_data = []
inside = False
with open("file.txt") as f:
    for line in f:
        line = line.strip()
        # start of interesting stuff
        if line.startswith("---"):
            inside = True
        # end of interesting stuff
        elif line.startswith("###"):
            inside = False
        # adding interesting bits
        elif inside:
            interesting_data.append(line)
print(interesting_data)
to get
['c', 'd', 'e']
I think you're looking for .readline(), which does exactly that. Here is a sketch to proceed to the line where a pattern starts.
with open('large_file') as f:
    line = f.readline()
    while not line.startswith("beginning"):
        line = f.readline()
        # end of file
        if not line:
            print("EOF")
            break
    # do_something with line, get additional lines by
    # calling .readline() again, etc.
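If you need the next few lines after the match rather than single readline() calls, itertools.islice pulls them from the open file on demand; a sketch (the count of 5 is arbitrary, and process() is a placeholder as above):
from itertools import islice

with open('large_file') as f:
    for line in f:
        if line.startswith("beginning"):
            # the file iterator simply continues from the current line,
            # so this grabs the following 5 lines without re-reading
            next_lines = list(islice(f, 5))
            process(line, next_lines)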

Reading CSV file with python

import os
import time

filename = 'NTS.csv'
mycsv = open(filename, 'r')
mycsv.seek(0, os.SEEK_END)
while 1:
    time.sleep(1)
    where = mycsv.tell()
    line = mycsv.readline()
    if not line:
        mycsv.seek(where)
    else:
        arr_line = line.split(',')
        var3 = arr_line[3]
        print(var3)
I have this Python code, which reads values from a CSV file every time an external program appends a new line to it. My problem is that the CSV file is periodically rewritten from scratch, and then Python stops picking up new lines. My guess is that Python is stuck at some line number, while the rewritten file may have 50 lines more or less: for example, Python is waiting for a new line at line 70 while the new line arrives at line 95. I think the solution is to re-run mycsv.seek(0, os.SEEK_END) at the right moment, but I am not sure how to do that.
What you want to do is difficult to accomplish without rewinding the file every time to make sure that you are truly on the last line. If you know approximately how many characters each line has, there is a shortcut: mycsv.seek(-end_buf, os.SEEK_END), as outlined in this answer. Your code could then work something like this:
import os
import time

avg_len = 50                # use an appropriate number here
end_buf = 3 * avg_len // 2  # integer offset for seek()

filename = 'NTS.csv'
mycsv = open(filename, 'rb')  # binary mode: seeking from the end requires it
mycsv.seek(-end_buf, os.SEEK_END)
last = mycsv.readlines()[-1]
while 1:
    time.sleep(1)
    mycsv.seek(-end_buf, os.SEEK_END)
    line = mycsv.readlines()[-1]
    if not line == last:
        last = line  # remember the new last line
        arr_line = line.split(b',')
        var3 = arr_line[3]
        print(var3.decode())  # decode bytes for printing
Here, in each iteration of the while loop, you seek to a position close to the end of the file, just far back enough that you know for sure the last line will be contained in what remains. Then you read in all the remaining lines (this will probably include a partial second- or third-to-last line) and check whether the last of them differs from what you had before.
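To handle the asker's other problem, the file being rewritten from scratch, one approach (a sketch, not guaranteed against a writer caught mid-line) is to watch the file size and forget the remembered line whenever the file shrinks:
import os
import time

filename = 'NTS.csv'
last_size = 0
last_line = b''

while True:
    time.sleep(1)
    size = os.path.getsize(filename)
    if size < last_size:
        last_line = b''  # file was rewritten: forget the old tail
    last_size = size
    with open(filename, 'rb') as f:
        try:
            f.seek(-4096, os.SEEK_END)  # 4 kB tail buffer: an assumption
        except OSError:
            f.seek(0)  # file shorter than the buffer
        lines = f.readlines()
    if lines and lines[-1] != last_line:
        last_line = lines[-1]
        print(last_line.split(b',')[3])  # assumes at least 4 comma-separated fields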
There is a simpler way of reading lines in your program: instead of trying to use seek to get what you need, use readlines on the file object mycsv.
You can do the following:
mycsv = open('NTS.csv', 'r')
csv_lines = mycsv.readlines()
for line in csv_lines:
    arr_line = line.split(',')
    var3 = arr_line[3]
    print(var3)
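Since the data is CSV, it is worth noting that the standard csv module copes with quoted fields and embedded commas that a bare split(',') would break on; a sketch of the same one-shot read:
import csv

with open('NTS.csv', newline='') as f:
    for row in csv.reader(f):
        print(row[3])  # the same field as var3 above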

Limiting amount read using readline

I'm trying to read the first 100 lines of large text files. Simple code for doing this is shown below. The challenge, though, is that I have to guard against the case of corrupt or otherwise screwy files that don't have any line breaks (yes, people somehow figure out ways to generate these). In those cases I'd still like to read in data (because I need to see what's going on in there) but limit it to, say, n bytes.
The only way I can think of to do this is to read the file char by char. Other than being slow (probably not an issue for only 100 lines), I am worried that I'll run into trouble when I encounter a file using a non-ASCII encoding.
Is it possible to limit the bytes read using readline()? Or is there a more elegant way to handle this?
line_count = 0
with open(filepath, 'r') as f:
    for line in f:
        line_count += 1
        print('{0}: {1}'.format(line_count, line))
        if line_count == 100:
            break
EDIT:
As @Fredrik correctly pointed out, readline() accepts an arg that limits the number of chars read (I'd thought it was a buffer-size param). So, for my purposes, the following works quite well:
max_bytes = 1024 * 1024
bytes_read = 0
fo = open(filepath, "r")
line = fo.readline(max_bytes)
bytes_read += len(line)
line_count = 0
while line != '':
    line_count += 1
    print('{0}: {1}'.format(line_count, line))
    if (line_count == 100) or (bytes_read >= max_bytes):
        break
    else:
        line = fo.readline(max_bytes - bytes_read)
        bytes_read += len(line)
If you have a file:
f = open("a.txt", "r")
f.readline(size)
The size parameter gives the maximum number of characters (bytes, in binary mode) that readline() will return, even if no newline has been found yet.
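A quick demonstration of that limit on a pathological file with no line breaks at all (the file is created here just for the demo):
# build a file containing 10,000 characters and no newline
with open('no_newlines.txt', 'w') as f:
    f.write('x' * 10000)

with open('no_newlines.txt') as f:
    chunk = f.readline(100)  # returns after 100 characters, newline or not
    print(len(chunk))        # 100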
This checks for data with no line breaks:
f = open('abc.txt', 'r')
dodgy = False
if '\n' not in f.read(1024):
    print("Dodgy file - no linefeeds in the first KB")
    dodgy = True
f.seek(0)
if not dodgy:  # read the first 100 lines
    for x in range(1, 101):
        try:
            line = next(f)
        except StopIteration:
            break
        print('{0}: {1}'.format(x, line))
else:  # read the first n bytes
    line = f.read(1024)
    print('bytes: ' + line)
f.close()

parse blocks of text from text file using Python

I am trying to parse some text files and need to extract blocks of text: specifically, the line that starts with "1:" and the 19 lines after it. The "1:" does not start on the same row in each file, and there is only one instance of "1:" per file. I would prefer to save the block of text and export it to a separate file. In addition, I need to preserve the formatting of the text in the original file.
Needless to say, I am new to Python. I generally work with R, but these files are not really compatible with R and I have about 100 of them to process. Any information would be appreciated.
The code that I have so far is:
tmp = open(files[0], "r")
lines = tmp.readlines()
tmp.close()

num = 0
a = 0
for line in lines:
    num += 1
    if "1:" in line:
        a = num
        break
a = num is the line number for the block of text I want. I then want to save the next 19 lines to another file, but can't figure out how to do this. Any help would be appreciated.
Here is one option: read all lines from your file, iterate until you find your line, and return the next 19. You would need to handle the situation where the file doesn't contain 19 additional lines; the slice below simply returns whatever is there.
def find_block(fname):
    with open(fname, 'r') as fh:
        all_lines = fh.readlines()
    for count, line in enumerate(all_lines):
        if "1:" in line:
            return all_lines[count + 1:count + 20]
Could be done in a one-liner...
open(files[0]).read().split('1:', 1)[1].split('\n')[:19]
or more readable
txt = open(files[0]).read() # read the file into a big string
before, after = txt.split('1:', 1) # split the file on the first "1:"
after_lines = after.split('\n') # create lines from the after text
lines_to_save = after_lines[:19] # grab the first 19 lines after "1:"
then join the lines with a newline (and add a newline to the end) before writing it to a new file:
out_text = "1:" # add back "1:"
out_text += "\n".join(lines_to_save) # add all 19 lines with newlines between them
out_text += "\n" # add a newline at the end
open("outputfile.txt", "w").write(out_text)
To comply with best practice for reading and writing files, you should also use the with statement to ensure that the file handles are closed as soon as possible. You can create convenience functions for it:
def read_file(fname):
    "Returns contents of file with name `fname`."
    with open(fname) as fp:
        return fp.read()

def write_file(fname, txt):
    "Writes `txt` to a file named `fname`."
    with open(fname, 'w') as fp:
        fp.write(txt)
then you can replace the first line above with:
txt = read_file(files[0])
and the last line with:
write_file("outputfile.txt", out_text)
I always prefer to read the file into memory first, but sometimes that's not possible. If you want to use iteration then this will work:
def process_file(fname):
    with open(fname) as fp:
        for line in fp:
            if line.startswith('1:'):
                break
        else:
            return  # no '1:' in file
        yield line  # yield line containing '1:'
        for i, line in enumerate(fp):
            if i >= 19:
                break
            yield line

if __name__ == "__main__":
    with open('output.txt', 'w') as fp:
        for line in process_file('intxt.txt'):
            fp.write(line)
It uses the else: clause of a for loop, which you don't see very often anymore but was created for just this purpose: the else clause is executed if the for loop finishes without hitting break.
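Since there are about 100 files to process, the generator drops straight into a loop over all of them; a sketch, assuming the inputs match *.txt and each output takes a derived name (both assumptions to adjust):
import glob
import os

for in_name in glob.glob('*.txt'):  # assumption: adjust the pattern to your files
    out_name = os.path.splitext(in_name)[0] + '_block.txt'  # hypothetical naming
    with open(out_name, 'w') as out:
        for line in process_file(in_name):
            out.write(line)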

Using Python, how to read a file starting at the seventh line?

I have a text file structured as:
date
downland
user
date data1 date2
201102 foo bar 200 50
201101 foo bar 300 35
So the first six lines of the file are not needed. Filename: dwn.txt
f = open('dwn.txt', 'rb')
How do I "split" this file starting at line 7 to EOF?
with open('dwn.txt') as f:
    for i in xrange(6):
        f.next()
    for line in f:
        process(line)
Update: use next(f) for Python 3.x.
Itertools answer!
from itertools import islice

with open('foo') as f:
    for line in islice(f, 6, None):
        print line
Python 3:
with open("file.txt", "r") as f:
    for i in range(6):
        f.readline()
    for line in f:
        process(line)  # process lines 7 to end
with open('test.txt', 'r') as fo:
    for i in xrange(6):
        fo.next()
    for line in fo:
        print "%s" % line.strip()
In fact, to answer precisely the question as it was written:
How do I "split" this file starting at line 7 to EOF?
you can do the following:
In case the file is not big:
with open('dwn.txt', 'rb+') as f:
    for i in xrange(6):
        print f.readline()
    content = f.read()
    f.seek(0, 0)
    f.write(content)
    f.truncate()
In case the file is very big:
with open('dwn.txt', 'rb+') as ahead, open('dwn.txt', 'rb+') as back:
    for i in xrange(6):
        print ahead.readline()
    x = 100000
    chunk = ahead.read(x)
    while chunk:
        print repr(chunk)
        back.write(chunk)
        chunk = ahead.read(x)
    back.truncate()
The truncate() call is essential to set the EOF you asked for: without it, the tail of the file, corresponding to the offset of the six skipped lines, would remain.
The file must be opened in binary mode to prevent problems. When Python reads '\r\n' in text mode, it transforms it into '\n' (that's Universal Newline Support, enabled by default), so the chunk strings contain only '\n' even where the file had '\r\n'.
If the file is of Macintosh origin, it contains only CR = '\r' newlines before the treatment, but they would be changed to '\n' or '\r\n' (according to the platform) during the rewriting on a non-Macintosh machine. If the file is of Linux origin, it contains only LF = '\n' newlines, which on Windows would be changed to '\r\n' (I don't know about a Linux file processed on a Macintosh). The reason is that Windows writes '\r\n' whatever it is told to write: '\n', '\r', or '\r\n'. Consequently, more characters would be rewritten than had been read, the offset between the file pointers ahead and back would shrink, and the rewriting would become a mess.
In HTML sources, too, there are various kinds of newlines. That's why it's always preferable to open files in binary mode when they are processed this way.
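A minimal demonstration of the difference (Python 3; the file is written here just for the demo):
# write Windows-style line endings, then read them back both ways
with open('demo.txt', 'wb') as f:
    f.write(b'line1\r\nline2\r\n')

with open('demo.txt', 'r') as f:   # text mode: universal newlines
    print(repr(f.read()))          # 'line1\nline2\n'

with open('demo.txt', 'rb') as f:  # binary mode: bytes come back untouched
    print(repr(f.read()))          # b'line1\r\nline2\r\n'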
Alternative version
You can directly use read() if you know the character position pos of the linebreak separating the header from the part of interest:
with open('input.txt', 'r') as txt_in:
    txt_in.seek(pos)
    second_half = txt_in.read()
If you are interested in both halves, you could also investigate the following method:
with open('input.txt', 'r') as txt_in:
    all_contents = txt_in.read()
    first_half = all_contents[:pos]
    second_half = all_contents[pos:]
You can read the entire file into an array/list and then just start at the index appropriate to the line you wish to start reading at.
f = open('dwn.txt', 'rb')
fileAsList = f.readlines()
fileAsList[0]  # first line
fileAsList[1]  # second line
#!/usr/bin/python
with open('dnw.txt', 'r') as f:
    lines_7_through_end = f.readlines()[6:]

print "Lines 7+:"
i = 7
for line in lines_7_through_end:
    print " Line %s: %s" % (i, line)
    i += 1
Prints:
Lines 7+:
Line 7: 201102 foo bar 200 50
Line 8: 201101 foo bar 300 35
Edit:
To rebuild dnw.txt without the first six lines, do this after the above code:
with open('dnw.txt', 'w') as f:
    for line in lines_7_through_end:
        f.write(line)
I created a script to cut an Apache access.log file several times a day. It's not the original topic of the question, but I think it can be useful if you need to store the file cursor position after reading the first six lines. I needed to set the cursor to the last line parsed during the previous execution; to that end I used the file.seek() and file.tell() methods, which allow the cursor position to be stored between runs.
My code:
import os

ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))

# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")
# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")

# Starting position for this run
from_position = 0
try:
    with open(cursor_position, "r", encoding=ENCODING) as f:
        from_position = int(f.read())
except Exception:
    pass

# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
    with open(cut_file, "w", encoding=ENCODING) as fw:
        # We set the cursor to the last position used (during the last run)
        f.seek(from_position)
        for line in f:
            fw.write(line)
    # We save the last cursor position for the next run
    with open(cursor_position, "w", encoding=ENCODING) as fw:
        fw.write(str(f.tell()))
Just do f.readline() six times. Ignore the returned value.
Solutions with readlines() are not satisfactory in my opinion, because readlines() reads the entire file. The user then has to iterate over the lines again (in the file or in the produced list) to process what he wants, while it could have been done without reading the uninteresting lines at all. Moreover, if the file is big, readlines() weighs down memory with the whole file's content, whereas a for line in file loop is much lighter.
Repeating readline() can be done like this:
nb = 6
exec(nb * 'f.readline()\n')
It's a short piece of code, and nb is programmatically adjustable.
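A loop-free alternative that avoids exec is the itertools "consume" recipe, which advances the file iterator n steps; nb stays programmatically adjustable (process() is a placeholder, as in the answers above):
from itertools import islice

nb = 6
with open('dwn.txt') as f:
    next(islice(f, nb, nb), None)  # silently discard the first nb lines
    for line in f:
        process(line)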
