I process log files with Python. Let's say I have a log file that contains a START line and an END line, like below:
    START
    one line
    two line
    ...
    n line
    END
What I want is to store the content between the START and END lines for further processing.
I do the following in Python:
    data = []                        # collected lines go here
    found_start = found_end = False  # the flags must exist before the loop
    with open(file) as name_of_file:
        for line in name_of_file:
            if 'START' in line:  # We found the start delimiter
                print(line)
                found_start = True
                for line in name_of_file:  # We now read until the end delimiter
                    if 'END' in line:  # We exit here as we have the info
                        found_end = True
                        break
                    else:
                        if not line.isspace():  # We do not want to add empty strings to the data
                            # We store the information in a list called data; we do not want ',' or spaces
                            data.append(line.replace(',', '').strip().split())
    if found_start and found_end:
        relevant_data = data
And then I process the relevant_data.
This looks far too complicated, hence my question: is there a more Pythonic way of doing this?
Thanks!
You are right that there is something not OK with having a nested loop over the same iterator. File objects are already iterators, and you can use that to your advantage. For example, to find the first line with a START in it:
    line = next(l for l in name_of_file if 'START' in l)
This will raise a StopIteration if there is no such line. It also sets the file pointer to the beginning of the first line you care about.
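If you would rather avoid the exception, next also accepts a default value to return when the iterator is exhausted; a minimal sketch:

    # Returns None instead of raising StopIteration when there is no START line
    line = next((l for l in name_of_file if 'START' in l), None)
    if line is None:
        ...  # handle the missing delimiter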
Getting the last line without anything that comes after it is a bit more complicated because it's difficult to set external state in a generator expression. Instead, you can make a simple generator:
    def interesting_lines(file):
        # Skip ahead to the START line; yield nothing if there is none
        if not next((line for line in file if 'START' in line), None):
            return
        for line in file:
            if 'END' in line:
                break
            line = line.strip()
            if not line:
                continue
            yield line.replace(',', '').split()
The generator will yield nothing if you don't have a START, but it will yield all the lines until the end if there is no END, so it differs a little from your implementation. You would use the generator to replace your loop entirely:
    with open(name_of_file) as file:
        data = list(interesting_lines(file))
    if data:
        ...  # process data
Wrapping the generator in list immediately processes it, so the lines persist even after you close the file. The iterator can be used repeatedly because at the end of your call, the file pointer will be just past the END line:
    with open(name_of_file) as file:
        for data in iter(lambda: list(interesting_lines(file)), []):
            ...  # process another data set
The relatively lesser known form of iter converts any callable object that accepts no arguments into an iterator. The end is reached when the callable returns the sentinel value, in this case an empty list.
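For reference, the two-argument form works with any zero-argument callable; a classic example reads a file in fixed-size blocks until read() returns an empty string:

    with open(name_of_file) as f:
        for block in iter(lambda: f.read(1024), ''):
            ...  # each block is a string of up to 1024 characters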
To perform that, you can use iter(callable, sentinel), discussed in this post, which will read until a sentinel value is reached, in your case 'END' (after applying .strip()).
    with open(filename) as file:
        start_token = next(l for l in file if l.strip() == 'START')  # Used to read until the start token
        result = [line.replace(',', '').split()
                  for line in iter(lambda x=file: next(x).strip(), 'END')
                  if line]
This is a mission for regular expressions (re), for example:
    import re

    lines = """ not this line
     START
     this line
     this line too
     END
     not this one
    """
    search_obj = re.search(r'START(.*)END', lines, re.S)
    search_obj.groups()
    # ('\n this line\n this line too\n ',)
The re.S flag is necessary so that . matches newlines, letting the pattern span multiple lines.
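To get the same list-of-words structure the question builds, a sketch that post-processes the captured group (the splitlines/filtering choices here are mine, not part of the original answer):

    data = [l.replace(',', '').split()
            for l in search_obj.group(1).splitlines()
            if l.strip()]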
I want my Python code to read a file that contains numbers on only one line. But that line will not necessarily be the first one. I want my program to ignore all empty lines until it gets to the first line with numbers.
The file will look something like this:
In this example I would want my Python code to ignore the first two lines, which are empty, and just grab the first non-empty one.
I know that when doing the following I can read the first line:
    import sys

    line = sys.stdin.readline()
And I tried a for loop like the following to get it done:
    for line in sys.stdin.readlines():
        values = line.split()
        # rest of code ....
However, I cannot get the code to work properly when the file starts with empty lines. I did try a while loop, but then it became an infinite loop. Any suggestions on how to properly skip empty lines and perform specific actions only on the first line that is not empty?
Here is an example of a function to get the next line containing some non-whitespace character, from a given input stream.
You might want to modify the exact behaviour in the event that no line is found (e.g. return None or do something else instead of raising an exception).
    import sys
    import re

    def get_non_empty_line(fh):
        for line in fh:
            if re.search(r'\S', line):
                return line
        raise EOFError

    line = get_non_empty_line(sys.stdin)
    print(line)
Note: you can happily call the function more than once; the iteration (for line in f:) will carry on from wherever it got to the last time.
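For example, calling it twice in a row returns consecutive non-empty lines (numbers.txt is a hypothetical input file here):

    with open('numbers.txt') as fh:
        first = get_non_empty_line(fh)   # first non-empty line
        second = get_non_empty_line(fh)  # resumes where the previous call stopped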
You probably want to use the continue keyword with a check if the line is empty, like this:
    for line in sys.stdin.readlines():
        if not line.strip():
            continue
        values = line.split()
        # rest of code ....
I have eliminated some of the nested loops for simplicity of the example.
I am iterating over a file line-by-line using fileinput. If the line meets a certain condition I want it to replace all future lines with '' until it meets the condition again.
    import re
    import fileinput

    with fileinput.FileInput("survey.qsf", inplace=True, backup='.bak') as file:
        for line in file:
            if re.match(r'l' + datamap[i][2] + '.*$', line) != None:
                line = re.sub(r'.*$', '', line)
                while re.match(r'lQID\d*$', line) == None:
                    line = re.sub(r'.*$', '', line)
                    next(line)
I used "next(line)" as a placeholder as I can't figure out how to iterate to the next line without breaking out of the inner loop.
I want to be able to iterate through the lines so that this input:
    lQID44
    xyz
    xyz
    lQID45
is output as:

    [blank line]
    [blank line]
    [blank line]
    lQID45
Thanks.
next takes the iterator as its argument.
    while re.match(r'lQID\d*$', line) == None:
        line = re.sub(r'.*$', '', line)
        try:
            line = next(file)  # Not next(line)
        except StopIteration:
            break
As an aside, there's no need to use re.sub to replace the entire line with an empty string; line = '' would suffice.
(Also, assigning to line doesn't change the actual file; inplace=True means that standard output is redirected into the file, so you have to explicitly write out every line you want to keep, using print or sys.stdout.write.)
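Putting both points together, a minimal sketch of the whole loop; the trigger pattern lQID44 is a hypothetical stand-in for the datamap-derived pattern from the question:

    import re
    import fileinput

    trigger = re.compile(r'lQID44')   # hypothetical stand-in for the datamap pattern
    stopper = re.compile(r'lQID\d+$')

    blanking = False
    with fileinput.FileInput("survey.qsf", inplace=True, backup='.bak') as file:
        for line in file:
            if blanking and stopper.match(line.rstrip()):
                blanking = False      # reached the next lQID line: stop blanking
            if not blanking and trigger.match(line.rstrip()):
                blanking = True       # the trigger line itself gets blanked too
            if blanking:
                print()               # replace the original line with a blank one
            else:
                print(line, end='')   # keep the line unchanged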
I've created a while loop to iterate through and read the data.txt file, calculate the perimeter/area/vertices, and print out the answers.
I'm stuck here:
    from Polygon import Polygon
    from Rectangle import Rectangle

    in_file = open('data.txt', 'r')
    line = in_file.readlines()
    while line:
        type_, num_of_sides = line.split()  ### ERROR: 'list' object has no attribute 'split'
        sides = [map(float, in_file.readline().split()) for _ in range(int(num_of_sides))]
        print(line)
        if type_ == 'P':
            poly_object = Polygon(sides)
        elif type_ == 'R':
            poly_object = Rectangle(sides)
        print('The vertices are {}'.format(poly_object.vertices))
        print('The perimeter is {}'.format(poly_object.perimeter()))
        print('The area is {}'.format(poly_object.area()))
        print()
        line = in_file.readline()
    in_file.close()
Should I create a for loop that iterates through, since readlines gives a list of strings and I want split to apply to each line? Or is the error just down to the way I'm structuring things?
The immediate issue is the use of readlines, not readline, in line = in_file.readlines(). This populates line with a list of all lines from the file, rather than a single line, which leads to the wrong type of data propagating through the loop:

1. Calling while line after reading a non-empty file causes the loop to be entered, because a non-empty list evaluates to true.
2. You call split on line. On the first iteration of the loop, line contains a list of strings (where each string was a line in the file). Lists do not have a split method; only the individual strings do. This call fails.
3. If the call to split had not failed and the loop had been allowed to run through once, the subsequent call to line = in_file.readline() would not return the next line in the file: all lines were already consumed by the earlier call to readlines, and the file cursor was never reset from EOF in the meantime. The loop would terminate.

Without the final call in step 3, the loop would instead run forever, as the value of line would never be updated and would remain a non-empty list.
A minimal change is to make the initial assignment to line use readline() rather than readlines(), so that only a single line is read from the file. The logic in the code should then work.
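That minimal change would look like this sketch (rest of the loop body unchanged):

    in_file = open('data.txt', 'r')
    line = in_file.readline()   # readline, not readlines: one line at a time
    while line:
        type_, num_of_sides = line.split()
        # ... rest of the loop body as before ...
        line = in_file.readline()
    in_file.close()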
You may find the following implementation logic easier to write and reason about:

- collect the machinery for reading from in_file into a single block before the loop, using a context manager to automatically close the file on leaving the block, and
- produce a list of lines, which a simple for loop is sufficient to iterate over:
    # This context manager will automatically ensure `in_file` is closed
    # on leaving the `with` block.
    with open('data.txt', 'r') as in_file:
        lines = in_file.readlines()

    for line in lines:
        # Do something with each line; in particular, you can call
        # split() on line without issue here. You do not require any
        # further calls to in_file.readline() in this block.
        ...
I've encountered two versions of code that both can accomplish the same task with a little difference in the code itself:
with open("file") as f:
for line in f:
print line
and
with open("file") as f:
data = f.readlines()
for line in data:
print line
My question is, is the file object f a list by default just like data? If not, why does the first chunk of code work? Which version is the better practice?
A file object is not a list; it's an object that conforms to the iterator interface (docs). That is, it implements an __iter__ method that returns an iterator object, and that iterator object implements both __iter__ and next methods, allowing iteration over the collection.
It happens that the file object is its own iterator (docs), meaning file.__iter__() returns self.
Both for line in file and lines = file.readlines() are equivalent in that they yield the same result when used to iterate over all lines in the file. But file.next() buffers the contents from the file (it reads ahead) to speed up reading, effectively moving the file descriptor to a position at or beyond where the last line ended. This means that if you have used for line in file, read some lines, then stopped the iteration (before reaching the end of the file) and now call file.readlines(), the first line returned might not be the full line following the last line iterated over in the for loop.
When you use for x in my_it, the interpreter calls my_it.__iter__(). The next() method is then called on the returned object, and for each call its return value is assigned to x. When next() raises StopIteration, the loop ends.
Note: a valid iterator implementation should ensure that once StopIteration is raised, it continues to be raised for all subsequent calls to next().
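To make the protocol concrete, here is a minimal, hypothetical iterator class that follows those rules, including the StopIteration contract (a toy example, not from the original answer):

    class CountDown(object):
        # Yields n, n-1, ..., 1, then stays exhausted forever.
        def __init__(self, n):
            self.n = n
        def __iter__(self):
            return self              # it is its own iterator, like a file object
        def next(self):              # __next__ in Python 3
            if self.n <= 0:
                raise StopIteration  # keeps being raised on every later call
            self.n -= 1
            return self.n + 1

    for x in CountDown(3):
        print x                      # prints 3, 2, 1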
In both cases, you are getting a file line-by-line. The method is different.
With your first version:
with open("file") as f:
for line in f:
print line
While you are iterating over the file line by line, the file contents are not fully resident in memory (unless it is a one-line file).
The open built-in function returns a file object, not a list. That object supports iteration, in this case returning individual strings, each being a group of characters in the file terminated by either a newline or the end of the file.
You can write a loop that is similar to what for line in f: print line is doing under the hood:
    with open('file') as f:
        while True:
            try:
                line = f.next()
            except StopIteration:
                break
            else:
                print line
With the second version:
with open("file") as f:
data = f.readlines() # equivelent to data=list(f)
for line in data:
print line
You are using a method of a file object (file.readlines()) that reads the entire file contents into memory as a list of the individual lines. The code is then iterating over that list.
You can write a similar version of that as well that highlights the iterators under the hood:
    with open('file') as f:
        data = list(f)
        it = iter(data)
        while True:
            try:
                line = it.next()
            except StopIteration:
                break
            else:
                print line
In both of your examples, you are using a for loop to loop over items in a sequence. The items are the same in each case (individual lines of the file) but the underlying sequence is different. In the first version, the sequence is a file object; in the second version it is a list. Use the first version if you just want to deal with each line. Use the second if you want a list of lines.
Read Ned Batchelder's excellent overview on looping and iteration for more.
f is a filehandle, not a list. It is iterable.
A file is an iterable. Lots of objects, including lists, are iterable, which just means that they can be used in a for loop that sequentially yields objects for the loop variable to be bound to.
Both versions of your code accomplish iteration line by line. The second version reads the whole file into memory and constructs a list; the first does not need to read the whole file up front. The reason you might prefer the second is that you want to close the file before something else modifies it; the first might be preferred if the file is very large.
Is there a shorter (perhaps more pythonic) way of opening a text file and reading past the lines that start with a comment character?
In other words, a neater way of doing this
fin = open("data.txt")
line = fin.readline()
while line.startswith("#"):
line = fin.readline()
At this stage in my arc of learning Python, I find this most Pythonic:
    def iscomment(s):
        return s.startswith('#')

    from itertools import dropwhile
    with open(filename, 'r') as f:
        for line in dropwhile(iscomment, f):
            ...  # do something with line
to skip all of the lines at the top of the file starting with #. To skip all lines starting with #:
    from itertools import ifilterfalse
    with open(filename, 'r') as f:
        for line in ifilterfalse(iscomment, f):
            ...  # do something with line
That's almost all about readability for me; functionally there's almost no difference between:
    for line in ifilterfalse(iscomment, f)
and
    for line in (x for x in f if not x.startswith('#'))
Breaking out the test into its own function makes the intent of the code a little clearer; it also means that if your definition of a comment changes you have one place to change it.
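For example, if your definition later grows to include indented comments (a pitfall mentioned in another answer below), only iscomment changes; a minimal sketch:

    def iscomment(s):
        # treat lines whose first non-whitespace character is '#' as comments
        return s.lstrip().startswith('#')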
    for line in open('data.txt'):
        if line.startswith('#'):
            continue
        # work with line
Of course, if your commented lines are only at the beginning of the file, you might use some optimisations.
    from itertools import dropwhile
    for line in dropwhile(lambda line: line.startswith('#'), file('data.txt')):
        pass
If you want to filter out all comment lines (not just those at the start of the file):
for line in file("data.txt"):
if not line.startswith("#"):
# process line
If you only want to skip those at the start, then see ephemient's answer using itertools.dropwhile.
You could use a generator function
    def readlines(filename):
        fin = open(filename)
        for line in fin:
            if not line.startswith("#"):
                yield line
and use it like
    for line in readlines("data.txt"):
        # do things
        pass
Depending on exactly where the files come from, you may also want to strip() the lines before the startswith() check. I once had to debug a script like that months after it was written because someone put a couple of space characters before the '#'.
As a practical matter, if I knew I was dealing with reasonably sized text files (anything which will comfortably fit in memory) then I'd probably go with something like:
f = open("data.txt")
lines = [ x for x in f.readlines() if x[0] != "#" ]
... to snarf in the whole file and filter out all lines that begin with the octothorpe.
As others have pointed out, one might want to ignore leading whitespace occurring before the octothorpe, like so:

    lines = [x for x in f.readlines() if not x.lstrip().startswith("#")]
I like this for its brevity.
This assumes that we want to strip out all of the comment lines.
We can also "chop" the last characters (almost always newlines) off the end of each using:
lines = [ x[:-1] for x in ... ]
... assuming that we're not worried about the infamously obscure issue of a missing final newline on the last line of the file. (The only time a line from the .readlines() or related file-like object methods might NOT end in a newline is at EOF).
In reasonably recent versions of Python one can "chomp" (only newlines) off the ends of the lines using a conditional expression like so:
    lines = [x[:-1] if x[-1] == '\n' else x for x in ...]
... which is about as complicated as I'll go with a list comprehension for legibility's sake.
If we were worried about the possibility of an overly large file (or low memory constraints) impacting our performance or stability, and we're using a version of Python that's recent enough to support generator expressions (which are more recent additions to the language than the list comprehensions I've been using here), then we could use:
    for line in (x[:-1] if x[-1] == '\n' else x
                 for x in f if not x.lstrip().startswith('#')):
        ...  # do stuff with each line
... is at the limits of what I'd expect anyone else to parse in one line a year after the code's been checked in.
If the intent is only to skip "header" lines then I think the best approach would be:
    f = open('data.txt')
    for line in f:
        if line.lstrip().startswith('#'):
            continue
... and be done with it.
You could make a generator that loops over the file that skips those lines:
fin = open("data.txt")
fileiter = (l for l in fin if not l.startswith('#'))
for line in fileiter:
...
You could do something like
    def drop(n, seq):
        for i, x in enumerate(seq):
            if i >= n:
                yield x
And then say
    for line in drop(1, file(filename)):
        ...  # whatever
I like @iWerner's generator function idea. One small change to his code and it does what the question asked for.
    def readlines(filename):
        f = open(filename)
        # discard first lines that start with '#'
        for line in f:
            if not line.lstrip().startswith("#"):
                break
        yield line
        for line in f:
            yield line
and use it like
    for line in readlines("data.txt"):
        # do things
        pass
But here is a different approach, and it is quite simple. The idea is that we open the file and get a file object, which we can use as an iterator. Then we pull the lines we don't want out of the iterator, and just return the iterator. This would be ideal if we always knew how many lines to skip. The problem here is that we don't know how many lines we need to skip; we just have to pull lines and look at them. And there is no way to put a line back into the iterator once we have pulled it.
So: open the file, pull lines and count how many have a leading '#' character; then use the .seek() method to rewind the file, read past the counted lines again, and return the file object.
One thing I like about this: you get the actual file object back, with all its methods; you can just use this instead of open() and it will work in all cases. I renamed the function to open_my_text() to reflect this.
    def open_my_text(filename):
        f = open(filename, "rt")
        # count number of lines that start with '#'
        count = 0
        for line in f:
            if not line.lstrip().startswith("#"):
                break
            count += 1
        # rewind file, and discard lines counted above
        f.seek(0)
        for _ in range(count):
            f.readline()
        # return file object with comment lines pre-skipped
        return f
Instead of f.readline() I could have used f.next() (for Python 2.x) or next(f) (for Python 3.x) but I wanted to write it so it was portable to any Python.
EDIT: Okay, I know nobody cares and I'm not getting any upvotes for this, but I have re-written my answer one last time to make it more elegant.
You can't put a line back into an iterator. But, you can open a file twice, and get two iterators; given the way file caching works, the second iterator is almost free. If we imagine a file with a megabyte of '#' lines at the top, this version would greatly outperform the previous version that calls f.seek(0).
    def open_my_text(filename):
        # open the same file twice to get two file objects
        # (We are opening the file read-only so this is safe.)
        ftemp = open(filename, "rt")
        f = open(filename, "rt")
        # use ftemp to look at lines, then discard from f
        for line in ftemp:
            if not line.lstrip().startswith("#"):
                break
            f.readline()
        ftemp.close()
        # return file object with comment lines pre-skipped
        return f
This version is much better than the previous version, and it still returns a full file object with all its methods.
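Usage is then the same as with plain open(); a hypothetical example:

    f = open_my_text("data.txt")
    for line in f:
        ...  # no leading comment lines appear here
    f.close()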