I came across several different methods for reading files in Python and was wondering which is the fastest. For example, to read the last line of a file, one can do:
input_file = open('mytext.txt', 'r')
lastLine = ""
for line in input_file:
    lastLine = line
print lastLine # This is the last line
Or
fileHandle = open('mytext.txt', 'r')
lineList = fileHandle.readlines()
print lineList[-1] #This is the last line
I'm assuming that for this particular case, discussing efficiency may not really be relevant...
Questions:
1. Which method is faster for picking a random line?
2. Can we use concepts like "seek" in Python (and if so, is it faster)?
If you don't need a uniform distribution (i.e. it's okay that the chance of a line being picked is not equal for all lines), and/or if your lines are all about the same length, then the problem of picking a random line can be simplified to:
Determine the size of the file in bytes
Seek to a random position
Search for the last newline character if any (there may be none if there's no preceding line)
Pick all text up to the next newline character or the end of file, whichever comes first.
For the backwards search (step 3), you make an educated guess for how far you have to search to find the previous newline. If you can tell that a line is n bytes on average, then you could read the previous n bytes in a single step.
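The steps above can be sketched as follows. This is only a rough sketch, assuming the non-uniform distribution is acceptable; it uses the common forward-scanning variant (discard the partial line you landed in and take the next one, wrapping to the first line at end of file) instead of searching backwards:

```python
import os
import random

def random_line(filename):
    """Pick a line by seeking to a random byte offset.

    Longer lines are proportionally more likely to be chosen,
    so this is NOT a uniform distribution over lines.
    """
    size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        f.seek(random.randrange(size))
        f.readline()         # discard the (probably partial) current line
        line = f.readline()  # take the next full line
        if not line:         # landed inside the last line: wrap around
            f.seek(0)
            line = f.readline()
    return line.decode()
```

Seeking plus two short reads costs O(line length) rather than O(file size), which is the whole point compared to readlines() on a large file.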
I had this problem a few days ago and I use this solution. My solution is similar to Frerich Raabe's, but with no randomness, just logic :)
def get_last_line(f):
    """f is a file object opened in binary read mode ('rb');
    I just extracted the algorithm from a bigger function."""
    f.seek(0, 2)      # Jump to the end of the file...
    fsize = f.tell()  # ...to learn its size.
    tries = 0
    offs = -512
    while tries < 5:
        # Put the cursor 512 * 2**tries characters before the end.
        # If that would pass the start of the file, put the cursor at
        # the beginning (seeking to -fsize from the end is position 0).
        f.seek(max(fsize * -1, offs), 2)
        lines = f.readlines()
        if len(lines) > 1:    # If there's more than 1 line found, we have the last complete line
            return lines[-1]  # Return the last complete line
        offs *= 2
        tries += 1
    raise ValueError("No end line found after 5 tries (your file may have only 1 line, or the last line is longer than %s characters)" % -offs)
The tries counter avoids blocking forever if the file has only one line (a very, very long last line). The algorithm tries to get the last line from the last 512 characters, then 1024, 2048, ... and stops if there's still no complete line at the 5th iteration.
Related
I'm trying to write a sorting function that orders words by length, so the first step is finding the longest word.
To do this I've produced the following code. It works as intended, but there's a bug that caught my attention and I'm trying to figure out what causes it.
After replacing a word, the buffer always reads the next line as if it were empty, but it eventually reaches the next word, while reporting that the read happened on the wrong line.
def sort():
    f = open("source.txt","r")
    w = open("result.txt","a+")
    word = ''
    og_lines = sum(1 for line in open('source.txt'))
    print("Sorting",og_lines,"words...")
    new_lines = 1
    lenght = 0
    word = f.readline(new_lines)
    new_lines+=1
    buffer=f.readline(new_lines)
    while (buffer != ''):
        if (len(buffer)>len(word)):
            word=buffer
            print("change")
        new_lines+=1
        print(new_lines, "lines read")
        buffer=f.readline(new_lines)
        buffer.rstrip()
    lenght = len(word.rstrip())
    print("Longest word is",word.rstrip(),lenght)
Expected to read 25 lines, but since it found a longer word 4 times along the way, it ended up reading the nonexistent line 29 and yet returning the word from the real line 25.
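For what it's worth, the jump in line numbers is consistent with how readline's optional argument works: it is a size limit in characters, not a line number, so f.readline(new_lines) returns at most new_lines characters of the current line, leaving the remainder for the next call. A minimal sketch with io.StringIO (hypothetical data) illustrating the behavior:

```python
import io

# readline(n) reads at most n characters of the *current* line;
# it does not jump to line n.
f = io.StringIO("elephant\ncat\n")
print(repr(f.readline(3)))  # 'ele'     -- a partial line
print(repr(f.readline()))   # 'phant\n' -- the rest of the same line
print(repr(f.readline()))   # 'cat\n'   -- only now the next line
```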
What I'm trying to do is match a phrase in a text file, then print that line (this works fine). I then need to move the cursor up 4 lines so I can do another match on that line, but I can't get the seek() method to move up 4 lines from the matched line so that I can do another regex search. All I can seem to do with seek() is search from the very end of the file, or the beginning. It doesn't seem to let me just do seek(105, 1) from the line that was matched.
### This is the example test.txt
This is 1st line
This is 2nd line # Needs to seek() to this line from the 6th line. This needs to be dynamic as it wont always be 4 lines.
This is 3rd line
This is 4th line
This is 5th line
This is 6th line # Matches this line, now need to move it up 4 lines to the "2nd line"
This is 7 line
This is 8 line
This is 9 line
This is 10 line
#
def Findmatch():
    file = open("test.txt", "r")
    print file.tell() # shows 0, which is the beginning of the file
    string = file.readlines()
    for line in string:
        if "This is 6th line" in line:
            print line
            print file.tell() # shows 171, the end of the file. I need it to be on the matched line (around 108), but seek() only lets me search from the end or beginning of the file, not from the matched line.

Findmatch()
Since you've read the whole file into memory at once with file.readlines(), the tell() method does indeed correctly point to the end, and you already have all your lines in a list. If you still wanted to, you'd have to read the file in line by line and record the position of each line start within the file, so that you could go back four lines.
For your described problem, you can first find the index of the first matching line and then run the second operation on the list slice starting four items before it.
Here's a very rough example of that (the return None isn't really needed; it's just for the sake of verbosity, clearly stating the intent/expected behavior. Raising an exception might be just as desirable, depending on the overall plan):
def relevant(value, lines):
    found = False
    for (idx, line) in enumerate(lines):
        if value in line:
            found = True
            break  # Stop iterating, last idx is a match.
    if found is True:
        idx = idx - 4
        if idx < 0:
            idx = 0  # Just return all lines up to now? Or was that broken input and fail?
        return lines[idx:]
    else:
        return None

with open("test.txt") as in_file:
    lines = in_file.readlines()
print(''.join(relevant("This is 6th line", lines)))
Please also note: it's a bit confusing to name a list of lines string (one would probably expect a str there); go with lines or something else. It's also not advisable (especially since you indicate you're using 2.7) to reuse names of built-ins, like file, for your variables. Use in_file, for instance.
EDIT: As requested in a comment, here's a printing example. I'm adding it alongside the original, as the former seems potentially more useful for further extension. :) ...
def print_relevant(value, lines):
    found = False
    for (idx, line) in enumerate(lines):
        if value in line:
            found = True
            print(line.rstrip('\n'))
            break  # Stop iterating, last idx is a match.
    if found is True:
        idx = idx - 4
        if idx < 0:
            idx = 0  # Just return all lines up to now? Or was that broken input and fail?
        print(lines[idx].rstrip('\n'))

with open("test.txt") as in_file:
    lines = in_file.readlines()
print_relevant("This is 6th line", lines)
Note: since lines are read in with trailing newlines, and print would add one of its own, I've rstrip'ed the line before printing. Just be aware of it.
I have created a method which reads a file line by line and checks that all lines contain the same number of delimiters (see the code below). The trouble with this solution is that it works on a line-per-line basis. Given that some of the files I'm dealing with are gigabytes in size, this will take a while to process. Is there a better solution which will 1) validate whether all lines contain the same number of delimiters and 2) not cause any out-of-memory issues? Thanks in advance.
def isValid(fileName):
    with open(fileName,'rb') as infile:
        for lineNumber,line in enumerate(infile,1):
            count = line.count(',')
            if lineNumber > 1 and prevCount != count:
                # this line does not contain the same number of delimiters
                return False
            prevCount = count
        return True
You can use all instead with a generator expression:
with open(file_name) as your_file:
    start = your_file.readline().count(',') # initial count
    print all(i.count(',') == start for i in your_file)
I propose a different approach (without code):
1. read the file as binary, and in chunks of, say, 64 KB
2. count the number of end-of-line tokens in the chunk
3. count the number of delimiters in the chunk but only up to the position of the last EOL token
4. if the two numbers do not divide evenly, stop and return False
5. At EOF, return True
As you'd have to handle the 'overlap' between the last EOL token and the end of the chunk, the logic is a bit more complicated than the 'brute-force' approach. But when dealing with GBs it might pay off.
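A rough sketch of this chunked idea (the function name and chunk size are placeholders): it counts delimiters per complete line within each chunk and carries the partial tail over to the next chunk, which keeps the check exact rather than relying on the divide-evenly shortcut:

```python
def is_valid_chunked(file_name, chunk_size=64 * 1024):
    """Validate that every line has the same comma count,
    reading the file in fixed-size binary chunks."""
    expected = None  # delimiters per line, learned from the first line
    carry = b""      # bytes after the last newline of the previous chunk
    with open(file_name, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            # Only complete lines: everything up to the last newline.
            cut = data.rfind(b"\n")
            if cut == -1:
                carry = data  # no newline yet, keep accumulating
                continue
            complete, carry = data[:cut], data[cut + 1:]
            for line in complete.split(b"\n"):
                count = line.count(b",")
                if expected is None:
                    expected = count
                elif count != expected:
                    return False
    # Final partial line without a trailing newline, if any.
    if carry and expected is not None and carry.count(b",") != expected:
        return False
    return True
```

Memory use is bounded by the chunk size plus the longest line, so gigabyte files are fine.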
I just noticed that - if you would want to stick with simple logic - the original code can be deflated a bit:
def isValid(fileName):
    with open(fileName,'r') as infile:
        count = infile.readline().count(',')
        for line in infile:
            if line.count(',') != count:
                return False
    return True
There is no need to keep the previous line's count, as a single difference decides the result; keep only the delimiter count of the first line.
Also, the file should be opened as a text file ('r'), not as binary.
Lastly, by prefetching the very first line just before the loop, we can discard the call to enumerate.
So I have to write this program which basically has to read a string a few lines long. Here's an example of the string I need to check:
Let's say this is first line and let's check line 4
In this second line there are number 10 and 8
Third line doesn't have any numbers
This is fourth line and we'll go on in line 12
This is fifth line and we go on in line 8
In sixth line there is number 6
This seventh line contains number 5, which means 5th line
This is eighth and the last line, it ends here.
These three lines are boring.
They really are
In eleventh line we can see that line 12 is really interesting
This is twelfth line but we go back to line 7.
I need a function that will read the first line. It'll find the number 4 in it, which means the next line to check is line 4. In line 4 there is the number 12, so it goes to line 12. There's the number 7, so it goes to line 7. There's a 5, so line 5; there's an 8, so line 8. In line 8 there are no more numbers.
So as a result I have to get the number of the line where there are no more numbers to go on.
Also, if there are 2 numbers in one line it should acknowledge only the first one, and this should be done by another function that I wrote:
def find_number(s):
    s = s.split()
    m = None
    for word in s:
        if word.isdigit():
            m = int(word)
    return word
So basically I need to use this function to solve the string with multiple lines. My question is: how can I "jump" from one line to another by utilizing the written function?
If I understand your problem correctly (which I think I do, you stated it quite clearly), you want to find the first number in each line of a string, and then go to that line.
The first thing you need to do is split the string into lines with str.splitlines:
s = """Let's say this is first line and let's check line 4
In this second line there are number 10 and 8
Third line doesn't have any numbers
This is fourth line and we'll go on in line 12
This is fifth line and we go on in line 8
In sixth line there is number 6
This seventh line contains number 5, which means 5th line
This is eighth and the last line, it ends here.
These three lines are boring.
They really are
In eleventh line we can see that line 12 is really interesting
This is twelfth line but we go back to line 7."""
lines = s.splitlines()
Then you need to get the first integer in the first line. This is what your function does.
current_line = lines[0]
number = find_number(current_line)
Then you need to do the same thing, but with a different current_line. To get the next line, you might do this:
if number is None: # No numbers were found
    end_line = current_line
else:
    current_line = lines[number - 1] # line numbers count from 1, list indices from 0
    number = find_number(current_line)
You want to do this over and over again, an indefinite number of times, so you need either a while loop or recursion. This sounds like homework, so I won't give you the code for it (correct me if I'm wrong); you'll have to work it out yourself. It shouldn't be too hard.
For future reference - a recursive solution:
def get_line(line):
    number = find_number(line)
    if number is None: # No numbers were found
        return line
    else:
        return get_line(lines[number - 1])
I need a function that will read first line.
If you're using a list of lines, rather than a file, you don't need linecache at all; just list_of_lines[0] gets the first line.
If you're using a single long string, the splitlines method will turn it into a list of lines.
If you're using a file, you could read the whole file in: with open(filename) as f: list_of_lines = list(f). Or, the stdlib has a function, linecache.getline, that lets you get lines from a file in random-access order, without reading the whole thing into memory.
Whichever one you use, just remember the indexing: Python lists use 0-based indices, so list_of_lines[0] is the first line, while linecache.getline numbers lines from 1, matching the problem statement.
I'll use linecache just to show that even the most complicated version of the problem still isn't very complicated. You should be able to adapt it yourself, if you don't need that.
It'll find number 4 in it. This means the next line to check will be line 4.
Let's translate that logic into Python. You've already got the find_number function, and getline is the only other tricky part. So:
line = linecache.getline(filename, linenumber)
linenumber = find_number(line)
if linenumber is None:
    # We're done
else:
    # We got a number.
In line 4 there is number 12. So it goes to line 12. There's number 7 so it goes to line 7. There's number 5, so 5th line, there's number 8 and so 8th line. In 8th line, there's no more numbers.
So you just need to loop until linenumber is None. You do that with a while statement:
linenumber = 1
while linenumber is not None:
    line = linecache.getline(filename, linenumber)
    linenumber = find_number(line)
The only problem is that when linenumber is None, you want to be able to return the last linenumber, the one that pointed to None. That's easy:
linenumber = 1
while linenumber is not None:
    line = linecache.getline(filename, linenumber)
    new_linenumber = find_number(line)
    if new_linenumber is None:
        return linenumber
    else:
        linenumber = new_linenumber
Of course once you've done that, you don't need to re-check the linenumber at the top of the loop, so you can just change it to while True:.
Now you just need to wrap this up in a function so it can get the starting values as parameters, and you're done.
However, it's worth noting that find_number doesn't quite work. While you do compute a number, m, you don't actually return m, you return something else. You'll need to fix that to get this all working.
Here is my approach:
def check_line(line=None):
    assert(line != None)
    for word in line.split():
        if word.isdigit():
            return int(word)
    return -1

# text holds the lines of the string, e.g. text = s.splitlines()
next = 0
while next >= 0:
    last = next
    next = check_line(text[next]) - 1
    if next >= 0:
        print "next line:", next + 1
    else:
        print "The last line with number is:", last + 1
It's not the most efficient in the world, but...
I'm pretty new to Python and have been writing a script to pick out certain lines of a basic log file.
Basically, the function searches the lines of the file, and when it finds one I want to output to a separate file, it adds it to a list, then also adds the next five lines following it. This then gets output to a separate file at the end by a different function.
What I've been trying to do is make the loop jump ahead to continue from the last of those five lines, rather than going over them again. I thought the last line in the code would solve the problem, but unfortunately not.
Are there any recommended variations of a for loop I could use for this purpose?
def readSingleDayLogs(aDir):
    print 'Processing files in ' + str(aDir) + '\n'
    lineNumber = 0
    try:
        open_aDirFile = open(aDir) #open the log file
        for aLine in open_aDirFile: #total the num. lines in file
            lineNumber = lineNumber + 1
        lowerBound = 0
        for lineIDX in range(lowerBound, lineNumber):
            currentLine = linecache.getline(aDir, lineIDX)
            if (bunch of logic conditions):
                issueList.append(currentLine)
                for extraLineIDX in range(1, 6): #loop over the next five lines of the error and append to issue list
                    extraLine = linecache.getline(aDir, lineIDX + extraLineIDX) #get the x extra line after problem line
                    issueList.append(extraLine)
                issueList.append('\n\n')
                lowerBound = lineIDX
You should use a while loop:
line = lowerBound
while line < lineNumber:
    ...
    if conditions:
        ...
        for lineIDX in range(line, line + 6):
            ...
        line = line + 6
    else:
        line = line + 1
A for loop pulls the loop variable from an iterator over the range, so you can't change it from inside the loop; each iteration just reassigns it.
Consider using a while loop instead. That way, you can update the line index directly.
I would look at something like:
from itertools import islice

with open('somefile') as fin:
    line_count = 0
    my_lines = []
    for line in fin:
        line_count += 1
        if some_logic(line):
            my_lines.append(line)
            next_5 = list(islice(fin, 5))
            line_count += len(next_5)
            my_lines.extend(next_5)
This way, by using islice on the input, you're able to move the iterator ahead and resume after the 5 lines (perhaps fewer if near the end of the file) are exhausted.
This is based on my understanding that you read forward through the file, identify a line, only want a fixed number of lines after that point, and then resume looping as normal. (You may not even need the line counting if that's all you're after, since it only appears to be used for getline and not for any other purpose.)
If you do want to take the next 5 lines and still consider the line right after the match, you can use itertools.tee to branch the iterator at the matched line, islice that branch, and let the fin iterator resume on the next line.
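The tee variant could look roughly like this (the function name, predicate, and n_after are hypothetical stand-ins for the real matching logic). Unlike the islice version above, the main scan still revisits the five context lines, because tee buffers them for the main branch:

```python
from itertools import islice, tee

def matches_with_context(lines, predicate, n_after=5):
    """Yield (matched_line, next_n_lines) for each match, while the
    main iteration resumes from the line right after the match."""
    it = iter(lines)
    while True:
        try:
            line = next(it)
        except StopIteration:
            return
        if predicate(line):
            # Branch the iterator: `ahead` consumes the context lines,
            # while the replaced `it` will replay them afterwards.
            it, ahead = tee(it)
            yield line, list(islice(ahead, n_after))
```

Note that a matched line inside another match's context window is reported again, which is exactly the difference from the islice approach.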