Improving efficiency of search loop python - python

I have written code that reads a file, checks whether a line contains the word table_begin, and then counts the number of lines until the line containing the word table_end.
Here is my code:
for line in read_file:
    if "table_begin" in line:
        k=read_file.index(line)
    if 'table_end' in line:
        k1=read_file.index(line)
        break
count=k1-k
if count<10:
    q.write(file)
I have to run it on ~15K files, and since it's a bit slow (~1 file/sec), I was wondering if I am doing something inefficient. I was not able to find the problem myself, so any help would be great!

When you call read_file.index(line), you are scanning the list of lines from the start until a matching line is found, just to get the index of the line you're already on. This is likely what's slowing you down. Instead, use enumerate() to keep track of the line number as you go:
for i, line in enumerate(read_file):
    if "table_begin" in line:
        k = i
    if "table_end" in line:
        k1 = i
        break
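To see why index() can be wrong as well as slow, here is a minimal sketch on made-up data (the sample lines are an assumption, not taken from the question). list.index() searches from the start of the list and returns the first element equal to its argument, so a repeated line reports the position of its first occurrence, while enumerate() reports the real position.

```python
# Hypothetical sample data: the same text appears at positions 0 and 2.
lines = ["header\n", "table_begin\n", "header\n", "table_end\n"]

# list.index() rescans from the start and returns the FIRST equal element.
positions_via_index = [lines.index(line) for line in lines]

# enumerate() yields the true position without any extra scan.
positions_via_enumerate = [i for i, line in enumerate(lines)]

print(positions_via_index)      # [0, 1, 0, 3]
print(positions_via_enumerate)  # [0, 1, 2, 3]
```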

You are always checking for both strings in every line. In addition, index is expensive because it searches the whole list of lines from the beginning, not just the current line. Using "in" or "find" on the line is quicker, as is checking only for table_begin until you've found it, and only for table_end after you've seen table_begin. If you aren't positive each file has table_begin and table_end in that order (and only one of each), you may need some tweaking/checks here (maybe pairing your begin/end into tuples?)
EDIT: Incorporated enumerate and switched from a while to a for loop, allowing some complexity to be removed.
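def find_lines(filename):
    bookends = ["table_begin", "table_end"]
    # Read the file once; the with block closes it automatically.
    with open(filename) as f:
        lines = f.readlines()
    for bookend in bookends:
        # Yield the index of the first line containing each marker.
        for ind, line in enumerate(lines):
            if bookend in line:
                yield ind
                break
for index in find_lines(r"myfile.txt"):
    print(index)
print("done")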

Clearly, you obtain read_file with f.readlines(), which is a bad idea because it reads the whole file into memory.
You can save a lot of time by:
reading the file line by line,
searching for one keyword at a time,
stopping after 10 lines.
with open('test.txt') as read_file:
    counter=0
    for line in read_file:
        if "table_begin" in line : break
    for line in read_file:
        counter+=1
        if "table_end" in line or counter>=10 : break # if "begin" => "end" ...
if counter < 10 : q.write(file)
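The three points above can be combined into one small function. This is only a sketch under stated assumptions: the name table_is_short and the max_lines parameter are hypothetical, the two markers are assumed to sit on separate lines, and a file with no table_begin or no table_end is treated as "not short" rather than written out.

```python
def table_is_short(path, max_lines=10):
    """Return True if table_end appears fewer than max_lines lines after table_begin."""
    with open(path) as handle:
        # Phase 1: advance to the table_begin line.
        for line in handle:
            if "table_begin" in line:
                break
        else:
            return False  # no table_begin in this file

        # Phase 2: continue from there, giving up once max_lines is reached.
        for count, line in enumerate(handle, start=1):
            if count >= max_lines:
                return False
            if "table_end" in line:
                return True
    return False  # table_begin was found, but table_end never appeared
```

Because the second loop resumes on the same file iterator, each file is read at most once and reading stops as soon as the answer is known.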

Related

Python 3.4.3: Iterating over each line and each character in each line in a text file

I have to write a program that iterates over each line in a text file and then over each character in each line in order to count the number of entries in each line.
Here is a segment of the text file:
N00000031,B,,D,D,C,B,D,A,A,C,D,C,A,B,A,C,B,C,A,C,C,A,B,D,D,D,B,A,B,A,C,B,,,C,A,A,B,D,D
N00000032,B,A,D,D,C,B,D,A,C,C,D,,A,A,A,C,B,D,A,C,,A,B,D,D
N00000033,B,A,D,D,C,,D,A,C,B,D,B,A,B,C,C,C,D,A,C,A,,B,D,D
N00000034,B,,D,,C,B,A,A,C,C,D,B,A,,A,C,B,A,B,C,A,,B,D,D
The first and last lines are "unusable lines" because they contain the wrong number of entries (more or fewer than 25). I would like to count the number of unusable lines in the file.
Here is my code:
for line in file:
    answers=line.split(",")
    i=0
    for i in answers:
        i+=1
unusable_line=0
for line in file:
    if i!=26:
        unusable_line+=1
print("Unusable lines in the file:", unusable_line)
I tried using this method as well:
alldata=file.read()
for line in file:
    student=alldata.split("\n")
    answer=student.split(",")
My problem is each variable I create doesn't exist when I try to run the program. I get a "students" is not defined error.
I know my coding is awful but I'm a beginner. Sorry!!! Thank you and any help at all is appreciated!!!
A simplified version of your approach, using a list, a count, and an if condition.
Code:
unusable_line = 0
for line in file:
    answers = line.strip().split(",")
    if len(answers) != 26:
        unusable_line += 1
print("Unusable lines in the file:", unusable_line)
Notes:
First I create a variable, unusable_line, to store the count of unusable lines.
Then I iterate over the lines of the file object.
Then I split each line at , to create a list.
Then I check whether the length of the list differs from 26 (the ID plus 25 answers). If so, I increment the unusable_line variable.
Finally I print it.
You could use something like this and wrap it into a function. You don't need to re-iterate over the items in the line: str.split() returns a list that has your elements in it, and you can count its elements with len().
my_file = open('temp.txt', 'r')
lines_count = usable = unusable = 0
for line in my_file:
    lines_count+=1
    if len(line.split(',')) == 26:
        usable+=1
    else:
        unusable+=1
my_file.close()
print("Processed %d lines, %d usable and %d unusable" % (lines_count, usable, unusable))
You can do it much shorter:
with open('my_file.txt') as fobj:
    unusable = sum(1 for line in fobj if len(line.split(',')) != 26)
The line with open('my_file.txt') as fobj: opens the file for reading and closes it automatically after leaving the indented block. I use a generator expression to go through all lines and count those whose number of fields differs from 26.
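For trying the counting logic without a file, here is a small sketch that works on any iterable of lines. The function name count_unusable, the expected_fields parameter, and the sample records are all made up for illustration. One assumption differs from the answers above: blank lines are skipped instead of being counted as unusable.

```python
def count_unusable(lines, expected_fields=26):
    """Count non-blank lines whose comma-separated field count is not expected_fields."""
    unusable = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored, not counted
        if len(line.split(",")) != expected_fields:
            unusable += 1
    return unusable


# Made-up records: an ID followed by 25, 40 and 10 answers respectively.
sample = [
    "N1," + ",".join(["A"] * 25) + "\n",
    "N2," + ",".join(["B"] * 40) + "\n",
    "N3," + ",".join(["C"] * 10) + "\n",
    "\n",
]
print(count_unusable(sample))  # 2
```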

Python 3.5 - Startswith() in if statement not working as intended

I'm working through some easy examples and I get to this one and can not figure out why I am not getting the desired result for loop2. Loop 1 is what I am using to see line by line what is happening. The curious thing is at line 1875, the startswith returns a true (see in loop 1) yet it does not print in loop 2.
Clearly I am missing something crucial. Please help me see it.
Text file can be found at: http://www.py4inf.com/code/mbox-short.txt
xfile = open("SampleTextData.txt", 'r')
cntr = 0
print("Loop 1 with STEPWISE PRINT STATEMENTS")
for line in xfile:
    cntr = cntr + 1
    if cntr >1873 and cntr < 1876:
        print(line)
        print(line.startswith('From: '))
        line = line.rstrip()
        print(line)
        print(cntr)
        print()
print("LOOP 2")
for line in xfile:
    line = line.rstrip()
    if line.startswith('From: '):
        print(line)
A file object such as xfile is a one-pass iterator. To iterate through the file twice, you must either close and reopen the file, or use seek to return to the beginning of the file:
xfile.seek(0)
Only then will the second loop iterate through the lines of the file.
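A minimal sketch of this behaviour, using a throwaway temporary file with made-up contents instead of the mbox file from the question:

```python
import os
import tempfile

# Create a small throwaway file (the contents are made up for the demo).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("From: alice@example.com\nSubject: hi\nFrom: bob@example.com\n")
    path = tmp.name

with open(path) as xfile:
    first_pass = [line.rstrip() for line in xfile]
    # The iterator is now exhausted, so a second loop sees nothing.
    second_pass_without_seek = [line.rstrip() for line in xfile]
    xfile.seek(0)  # rewind to the beginning of the file
    second_pass_with_seek = [
        line.rstrip() for line in xfile if line.startswith("From: ")
    ]

os.remove(path)

print(len(first_pass))           # 3
print(second_pass_without_seek)  # []
print(second_pass_with_seek)     # ['From: alice@example.com', 'From: bob@example.com']
```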
Your open file is an iterator that is exhausted by the first loop.
Once you loop through it once, it is done. The second loop will not execute, unless you close and re-open it.
Alternatively, you could read it into a string or a list of strings.
You are not closing and re-opening the file before starting loop 2. A file is read from start to end. After loop 1 completes, the read cursor is already at the end of the file, so there is nothing left for loop 2 to iterate over.

Update iteration value in Python for loop

Pretty new to Python and have been writing up a script to pick out certain lines of a basic log file.
Basically, the function searches the lines of the file, and when it finds one I want to output to a separate file, it adds it to a list, then also adds the five lines following it. The list then gets output to a separate file at the end in a different function.
What I've been trying to do after that is jump the loop ahead so it continues from the last of those five lines, rather than going over them again. I thought the last line in the code would solve the problem, but unfortunately not.
Are there any recommended variations of a for loop I could use for this purpose?
def readSingleDayLogs(aDir):
    print 'Processing files in ' + str(aDir) + '\n'
    lineNumber = 0
    try:
        open_aDirFile = open(aDir) #open the log file
        for aLine in open_aDirFile: #total the num. lines in file
            lineNumber = lineNumber + 1
        lowerBound = 0
        for lineIDX in range(lowerBound, lineNumber):
            currentLine = linecache.getline(aDir, lineIDX)
            if (bunch of logic conditions):
                issueList.append(currentLine)
                for extraLineIDX in range(1, 6): #loop over the next five lines of the error and append to issue list
                    extraLine = linecache.getline(aDir, lineIDX+ extraLineIDX) #get the x extra line after problem line
                    issueList.append(extraLine)
                issueList.append('\n\n')
                lowerBound = lineIDX
You should use a while loop:
line = lowerBound
while line < lineNumber:
    ...
    if conditions:
        ...
        for lineIDX in range(line, line+6):
            ...
        line = line + 6
    else:
        line = line + 1
A for-loop takes its values from an iterator over the range, so reassigning the loop variable inside the body has no effect on the next iteration.
Consider using a while-loop instead. That way, you can update the line index directly.
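Here is a small runnable sketch of both points. It assumes the log has already been read into a list; the names collect_issues and is_issue and the sample log are hypothetical stand-ins for the asker's logic conditions.

```python
# Reassigning the loop variable does not change what range() yields next.
seen = []
for i in range(3):
    seen.append(i)
    i += 10  # discarded at the start of the next iteration
print(seen)  # [0, 1, 2]


def collect_issues(lines, is_issue, extra=5):
    """Collect each matching line plus the next `extra` lines, then skip past them."""
    collected = []
    index = 0
    while index < len(lines):
        if is_issue(lines[index]):
            collected.extend(lines[index:index + extra + 1])
            index += extra + 1  # jump past the lines just copied
        else:
            index += 1
    return collected


log = ["ok 0", "ERROR a", "ok 2", "ERROR b", "ok 4", "ok 5", "ok 6", "ok 7", "ERROR c"]
result = collect_issues(log, lambda line: line.startswith("ERROR"))
print(result)
# ['ERROR a', 'ok 2', 'ERROR b', 'ok 4', 'ok 5', 'ok 6', 'ERROR c']
```

Note that "ERROR b" is copied as context for "ERROR a" and is not matched again, which is the skipping behaviour the question asks for.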
I would look at something like:
from itertools import islice
with open('somefile') as fin:
    line_count = 0
    my_lines = []
    for line in fin:
        line_count += 1
        if some_logic(line):
            my_lines.append(line)
            next_5 = list(islice(fin, 5))
            line_count += len(next_5)
            my_lines.extend(next_5)
This way, by using islice on the input, you're able to move the iterator ahead and resume after the 5 lines (perhaps fewer if near the end of the file) are exhausted.
This is based on if I'm understanding correctly that you can read forward through the file, identify a line, and only want a fixed number of lines after that point, then resume looping as per normal. (You may not even require the line counting if that's all you're after as it only appears to be for the getline and not any other purpose).
If you do indeed want to take the next 5 and still consider the following line, you can use itertools.tee to branch at the point of the faulty line, islice that branch, and let the fin iterator resume on the next line.

getting data out of a txt file

I'm only just beginning my journey into Python. I want to build a little program that will calculate shim sizes for when I do the valve clearances on my motorbike. I will have a file that will have the target clearances, and I will query the user to enter the current shim sizes, and the current clearances. The program will then spit out the target shim size. Looks simple enough, I have built a spread-sheet that does it, but I want to learn python, and this seems like a simple enough project...
Anyway, so far I have this:
def print_target_exhaust(f):
    print f.read()
#current_file = open("clearances.txt")
print print_target_exhaust(open("clearances.txt"))
Now, I've got it reading the whole file, but how do I make it ONLY get the value on, for example, line 4. I've tried print f.readline(4) in the function, but that seems to just spit out the first four characters... What am I doing wrong?
I'm brand new, please be easy on me!
-d
To read all the lines:
lines = f.readlines()
Then, to print line 4:
print lines[4]
Note that indices in Python start at 0, so lines[4] is actually the fifth line in the file; use lines[3] to get the fourth line.
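The difference between readline(4) and indexing the result of readlines() can be seen in this short sketch. It is written for Python 3 and uses an in-memory io.StringIO object as a stand-in for clearances.txt; the values in it are made up.

```python
import io

# In-memory stand-in for clearances.txt; the values are made up.
f = io.StringIO("0.20\n0.22\n0.25\n0.27\n0.30\n")

# The argument to readline() is a size limit in characters, not a line number.
first_four = f.readline(4)
print(repr(first_four))  # '0.20'

f.seek(0)                # go back to the start before reading everything
lines = f.readlines()
fourth_line = lines[3]   # index 3 is the fourth line
print(repr(fourth_line))  # '0.27\n'
```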
with open('myfile') as myfile: # Use a with statement so you don't have to remember to close the file
    for line_number, data in enumerate(myfile): # Use enumerate to get line numbers starting with 0
        if line_number == 3:
            print(data)
            break # stop looping when you've found the line you want
More information:
with statement
enumerate
Not very efficient, but it should show you how it works. Basically it keeps a running counter of the lines it reads. When the counter reaches 4, it prints that line.
## Open the file in read-only mode
f = open("clearances.txt", "r")
counter = 0
## Read the first line
line = f.readline()
## If the file is not empty keep reading one line at a time
## until the end of the file
while line:
    counter = counter + 1
    if counter == 4:
        print(line)
    line = f.readline()
f.close()

Use Python to remove lines in a files that start with an octothorpe?

This seems like a straight-forward question but I can't seem to pinpoint my problem. I am trying to delete all lines in a file that start with an octothorpe (#) except the first line. Here is the loop I am working with:
for i, line in enumerate(input_file):
    if i > 1:
        if not line.startswith('#'):
            output.write(line)
The above code doesn't seem to work. Does anyone know what my problem is? Thanks!
You aren't writing out the first line:
for i, line in enumerate(input_file):
    if i == 0:
        output.write(line)
    else:
        if not line.startswith('#'):
            output.write(line)
Keep in mind also that enumerate (like most things) starts at zero.
A little more concisely (and not repeating the output line):
for i, line in enumerate(input_file):
    if i == 0 or not line.startswith('#'):
        output.write(line)
I wouldn't bother with enumerate here. You only need it to decide which line is the first line and which isn't. This is easy enough to deal with by simply writing the first line out and then using a for loop to conditionally write the remaining lines that do not start with a '#'.
def removeComments(inputFileName, outputFileName):
    input = open(inputFileName, "r")
    output = open(outputFileName, "w")
    output.write(input.readline())
    for line in input:
        if not line.lstrip().startswith("#"):
            output.write(line)
    input.close()
    output.close()
Thanks to twopoint718 for pointing out the advantage of using lstrip.
Maybe you want to omit lines from the output where the first non-whitespace character is an octothorpe:
for i, line in enumerate(input_file):
    if i == 0 or not line.lstrip().startswith('#'):
        output.write(line)
(note the call to lstrip)
