More pythonic way of skipping header lines - python

Is there a shorter (perhaps more pythonic) way of opening a text file and reading past the lines that start with a comment character?
In other words, a neater way of doing this
fin = open("data.txt")
line = fin.readline()
while line.startswith("#"):
    line = fin.readline()

At this stage in my arc of learning Python, I find this most Pythonic:
def iscomment(s):
    return s.startswith('#')

from itertools import dropwhile
with open(filename, 'r') as f:
    for line in dropwhile(iscomment, f):
        # do something with line
This skips all of the lines at the top of the file that start with #. To skip every line starting with #, wherever it appears (ifilterfalse is the Python 2 name; in Python 3 it is itertools.filterfalse):
from itertools import ifilterfalse
with open(filename, 'r') as f:
    for line in ifilterfalse(iscomment, f):
        # do something with line
That's almost all about readability for me; functionally there's almost no difference between:
for line in ifilterfalse(iscomment, f):
and
for line in (x for x in f if not x.startswith('#'))
Breaking out the test into its own function makes the intent of the code a little clearer; it also means that if your definition of a comment changes you have one place to change it.
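For readers on Python 3, where ifilterfalse was renamed itertools.filterfalse, a minimal sketch of the two behaviours might look like this. The sample text and the io.StringIO stand-in for a real file are assumptions made only to keep the example self-contained.

```python
import io
from itertools import dropwhile, filterfalse

def iscomment(s):
    return s.startswith('#')

# Hypothetical sample data standing in for an open file.
sample = "# header 1\n# header 2\ndata 1\n# inline comment\ndata 2\n"

# dropwhile skips only the leading run of comment lines.
after_header = list(dropwhile(iscomment, io.StringIO(sample)))

# filterfalse skips every comment line, wherever it appears.
no_comments = list(filterfalse(iscomment, io.StringIO(sample)))
```

Here after_header still contains the "# inline comment" line, while no_comments does not.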

for line in open('data.txt'):
    if line.startswith('#'):
        continue
    # work with line
Of course, if your comment lines appear only at the beginning of the file, you might use some optimisations.

from itertools import dropwhile
for line in dropwhile(lambda line: line.startswith('#'), open('data.txt')):
    pass

If you want to filter out all comment lines (not just those at the start of the file):
for line in open("data.txt"):
    if not line.startswith("#"):
        # process line
If you only want to skip those at the start then see ephemient's answer using itertools.dropwhile

You could use a generator function
def readlines(filename):
    fin = open(filename)
    for line in fin:
        if not line.startswith("#"):
            yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
Depending on exactly where the files come from, you may also want to strip() the lines before the startswith() check. I once had to debug a script like that months after it was written because someone put in a couple of space characters before the '#'
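A sketch of that whitespace-tolerant check, assuming the generator takes any iterable of lines rather than a filename (noncomment_lines is a hypothetical name, and the sample text is made up):

```python
import io

def noncomment_lines(lines):
    """Yield lines whose first non-whitespace character is not '#'."""
    for line in lines:
        # lstrip() is applied only for the test; the line itself is
        # yielded unchanged, leading whitespace included.
        if not line.lstrip().startswith("#"):
            yield line

sample = io.StringIO("  # indented comment\nvalue 1\n#comment\nvalue 2\n")
kept = list(noncomment_lines(sample))
```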

As a practical matter, if I knew I was dealing with reasonably sized text files (anything which will comfortably fit in memory) then I'd probably go with something like:
f = open("data.txt")
lines = [ x for x in f.readlines() if x[0] != "#" ]
... to snarf in the whole file and filter out all lines that begin with the octothorpe.
As others have pointed out, one might want to ignore leading whitespace occurring before the octothorpe, like so:
lines = [ x for x in f.readlines() if not x.lstrip().startswith("#") ]
I like this for its brevity.
This assumes that we want to strip out all of the comment lines.
We can also "chop" the last characters (almost always newlines) off the end of each using:
lines = [ x[:-1] for x in ... ]
... assuming that we're not worried about the infamously obscure issue of a missing final newline on the last line of the file. (The only time a line from the .readlines() or related file-like object methods might NOT end in a newline is at EOF).
In reasonably recent versions of Python one can "chomp" (only newlines) off the ends of the lines using a conditional expression like so:
lines = [ x[:-1] if x[-1]=='\n' else x for x in ... ]
... which is about as complicated as I'll go with a list comprehension for legibility's sake.
If we were worried about the possibility of an overly large file (or low memory constraints) impacting our performance or stability, and we're using a version of Python that's recent enough to support generator expressions (which are more recent additions to the language than the list comprehensions I've been using here), then we could use:
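for line in (x[:-1] if x[-1]=='\n' else x for x in
             f if not x.lstrip().startswith('#')):
    # do stuff with each line
... which is at the limits of what I'd expect anyone else to parse in one line a year after the code's been checked in. (Note that the generator iterates over f directly; calling f.readlines() here would read the whole file into memory and defeat the purpose.)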
If the intent is only to skip "header" lines then I think the best approach would be:
f = open('data.txt')
for line in f:
    if line.lstrip().startswith('#'):
        continue
... and be done with it. (Note that this skips every line starting with '#', not only those at the top of the file.)

You could make a generator that loops over the file that skips those lines:
fin = open("data.txt")
fileiter = (l for l in fin if not l.startswith('#'))
for line in fileiter:
    ...

You could do something like
def drop(n, seq):
    for i, x in enumerate(seq):
        if i >= n:
            yield x
And then say
for line in drop(1, open(filename)):
    # whatever

I like @iWerner's generator function idea. One small change to his code and it does what the question asked for.
def readlines(filename):
    f = open(filename)
    # discard first lines that start with '#'
    for line in f:
        if not line.lstrip().startswith("#"):
            break
    else:
        return  # the file was empty or contained only comment lines
    yield line
    for line in f:
        yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
But here is a different approach, and it is almost as simple. The idea is that we open the file and get a file object, which we can use as an iterator. Then we pull the lines we don't want out of the iterator, and just return the iterator. This would be ideal if we always knew how many lines to skip. The problem here is that we don't know how many lines we need to skip; we have to pull lines and look at them. And there is no way to put a line back into the iterator once we have pulled it.
So: open the iterator, pull lines and count how many have the leading '#' character; then use the .seek() method to rewind the file, pull the correct number again, and return the iterator.
One thing I like about this: you get the actual file object back, with all its methods; you can just use this instead of open() and it will work in all cases. I renamed the function to open_my_text() to reflect this.
def open_my_text(filename):
    f = open(filename, "rt")
    # count number of lines that start with '#'
    count = 0
    for line in f:
        if not line.lstrip().startswith("#"):
            break
        count += 1
    # rewind file, and discard lines counted above
    f.seek(0)
    for _ in range(count):
        f.readline()
    # return file object with comment lines pre-skipped
    return f
Instead of f.readline() I could have used f.next() (for Python 2.x) or next(f) (for Python 3.x) but I wanted to write it so it was portable to any Python.
EDIT: Okay, I know nobody cares and I'm not getting any upvotes for this, but I have re-written my answer one last time to make it more elegant.
You can't put a line back into an iterator. But, you can open a file twice, and get two iterators; given the way file caching works, the second iterator is almost free. If we imagine a file with a megabyte of '#' lines at the top, this version would greatly outperform the previous version that calls f.seek(0).
def open_my_text(filename):
    # open the same file twice to get two file objects
    # (We are opening the file read-only so this is safe.)
    ftemp = open(filename, "rt")
    f = open(filename, "rt")
    # use ftemp to look at lines, then discard from f
    for line in ftemp:
        if not line.lstrip().startswith("#"):
            break
        f.readline()
    # return file object with comment lines pre-skipped
    return f
This version is much better than the previous version, and it still returns a full file object with all its methods.
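If a plain iterator is enough (rather than a full file object), one way around the "cannot put a line back" limitation is to chain the first non-comment line back in front of the rest. This is a sketch of a different technique from the two versions above, not a replacement for them; skip_header is a hypothetical name and the sample text is made up.

```python
import io
from itertools import chain

def skip_header(lines):
    """Return an iterator over lines with the leading '#' lines removed."""
    it = iter(lines)
    for line in it:
        if not line.lstrip().startswith("#"):
            # "Put back" the line we had to read in order to inspect it.
            return chain([line], it)
    return iter(())  # every line was a comment, or the input was empty

body = list(skip_header(io.StringIO("# a\n# b\nx\n# not a header\ny\n")))
only_comments = list(skip_header(io.StringIO("# only\n")))
```

The trade-off is that the caller gets an iterator, so file methods such as seek() or tell() are no longer available on the returned object.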

Related

Is there pythonic oneliner to iterate over lines of a file?

90% of the time when I read file, it ends up like this:
with open('file.txt') as f:
    for line in f:
        my_function(line)
This seems to be a very common scenario, so I thought of a shorter way, but is this safe? I mean will the file be closed correctly or do you see any other problems with this approach? :
for line in open('file.txt'):
    my_function(line)
Edit: Thanks Eric, this seems to be best solution. Hopefully I don't turn this into discussion with this, but what do you think of this approach for the case when we want to use line in several operations (not just as argument for my_function):
def line_generator(filename):
    with open(filename) as f:
        for line in f:
            yield line
and then using:
for line in line_generator('groceries.txt'):
    print line
    grocery_list += [line]
Does this function have disadvantages over iterate_over_file?
If you need this often, you could always define :
def iterate_over_file(filename, func):
    with open(filename) as f:
        for line in f:
            func(line)

def my_function(line):
    print line,
Your pythonic one-liner is now :
iterate_over_file('file.txt', my_function)
Using a context manager is the best way, and that pretty much bars the way to your one-liner solution. If you naively try to create a one-liner you get:
with open('file.txt') as f: for line in f: my_function(line) # wrong code!!
which is invalid syntax.
So if you badly want a one-liner you could do
with open('file.txt') as f: [my_function(line) for line in f]
but that's bad practice since you're creating a list comprehension only for the side effect (you don't care about the return of my_function).
Another approach would be
with open('file.txt') as f: collections.deque((my_function(line) for line in f), maxlen=0)
so no list is created, and you force consumption of the iterator using an itertools recipe (a zero-size deque, so no memory is allocated for the results either). Note that this needs import collections first.
Conclusion:
to reach the "pythonic/one-liner" goal, we sacrificed readability.
Sometimes the best approach doesn't hold in one line, period.
Building upon the approach by Eric, you could also make it a bit more generic by just writing a function that uses with to open the file and then just returns the file. This, however:
def with_open(filename):
    with open(filename) as f:
        return f # won't work!
does not work, as the file f will already be closed by with when returned by the function. Instead, you can make it a generator function, and yield the individual lines:
def with_open(filename):
    with open(filename) as f:
        for line in f:
            yield line
or shorter, with newer versions of Python:
def with_open(filename):
    with open(filename) as f:
        yield from f
And use it like this:
for line in with_open("test.txt"):
    print(line)
or this:
nums = [int(n) for n in with_open("test.txt")]
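To check that the generator version really closes the file, here is a small self-contained sketch. The temporary file and the opened list (used only to keep a reference to the file object for inspection) are assumptions added for the demonstration.

```python
import os
import tempfile

opened = []  # hypothetical hook so the file object can be inspected later

def with_open(filename):
    with open(filename) as f:
        opened.append(f)
        yield from f

# Create a small throwaway file for the demonstration.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as tmp:
    tmp.write("1\n2\n3\n")

nums = [int(n) for n in with_open(path)]

# The generator ran to completion, so the with block has exited.
closed_after_loop = opened[0].closed
os.remove(path)
```

One caveat: if the consuming loop stops early (for example with break), the with block only exits once the generator is closed or garbage-collected, so the file may stay open a little longer than in the plain with-statement version.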

When should I ever use file.read() or file.readlines()?

I noticed that if I iterate over a file that I opened, it is much faster to iterate over it without "read"-ing it.
i.e.
l = open('file','r')
for line in l:
    pass  # or code
is much faster than
l = open('file','r')
for line in l.readlines():  # or l.read()
    pass  # or code
The 2nd loop will take around 1.5x as much time (I used timeit over the exact same file, and the results were 0.442 vs. 0.660), and would give the same result.
So - when should I ever use the .read() or .readlines()?
Since I always need to iterate over the file I'm reading, and after learning the hard way how painfully slow the .read() can be on large data - I can't seem to imagine ever using it again.
The short answer to your question is that each of these three methods of reading bits of a file have different use cases. As noted above, f.read() reads the file as an individual string, and so allows relatively easy file-wide manipulations, such as a file-wide regex search or substitution.
f.readline() reads a single line of the file, allowing the user to parse a single line without necessarily reading the entire file. Using f.readline() also allows easier application of logic in reading the file than a complete line by line iteration, such as when a file changes format partway through.
Using the syntax for line in f: allows the user to iterate over the file line by line as noted in the question.
(As noted in the other answer, this documentation is a very good read):
https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Note:
It was previously claimed that f.readline() could be used to skip a line during a for loop iteration. However, this doesn't work in Python 2.7, and is perhaps a questionable practice, so this claim has been removed.
Hope this helps!
https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory
Sorry for all the edits!
For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:
for line in f:
    print line,
This is the first line of the file.
Second line of the file
Note that readline() is not directly comparable to reading all lines in a for loop: it reads one line per call, and each call carries overhead, as others have already pointed out.
I ran timeit on two otherwise identical snippets, one using a for loop and the other using readlines(). You can see my snippets below:
def test_read_file_1():
    f = open('ml/README.md', 'r')
    for line in f.readlines():
        print(line)

def test_read_file_2():
    f = open('ml/README.md', 'r')
    for line in f:
        print(line)

def test_time_read_file():
    from timeit import timeit
    duration_1 = timeit(lambda: test_read_file_1(), number=1000000)
    duration_2 = timeit(lambda: test_read_file_2(), number=1000000)
    print('duration using readlines():', duration_1)
    print('duration using for-loop:', duration_2)
And the results:
duration using readlines(): 78.826229238
duration using for-loop: 69.487692794
The bottom line, I would say: the for loop is faster, but where either would work, I'd rather use readlines().
readlines() is more convenient than for line in file when you know that the data you are interested in starts from, for example, the 2nd line. You can simply write readlines()[1:].
Such use cases are when you have a tab/comma separated value file and the first line is a header (and you don't want to use additional module for tsv or csv files).
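When the file may be large, the same "skip the first line" effect can be had without building the full list, for example with next() or itertools.islice. A sketch comparing the three, using made-up tab-separated sample data and io.StringIO in place of a real file:

```python
import io
from itertools import islice

sample = "name\tqty\napples\t3\npears\t5\n"

# readlines()[1:] builds the whole list, then slices off the header.
rows_a = io.StringIO(sample).readlines()[1:]

# next() consumes the header line; iteration then continues lazily.
f = io.StringIO(sample)
next(f, None)  # the None default avoids StopIteration on an empty file
rows_b = list(f)

# islice(f, 1, None) skips the first item of any iterator.
rows_c = list(islice(io.StringIO(sample), 1, None))
```

All three produce the same rows; only the first holds the header line in memory alongside the data.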
#The difference between file.read(), file.readline(), file.readlines()
file = open('samplefile', 'r')
single_string = file.read() # Reads the whole file into a single string
# (\n characters are included)
file.seek(0) # Rewind: read() leaves the cursor at the end of the file
line = file.readline() # Reads the line at the current cursor position
# as a string and moves to the next line
list_strings = file.readlines() # Makes a list of strings from the remaining lines

print a previous line based on a condition in python

I am trying to print -4th line based on a condition. I have a text file SFU.txt with some content. My objective is: if there is a word configuration in a line, I want to print -4th line. For example if the content of my file is like below:
This is a random text document
We are talking about planets here
This is planet Mars
in solarsystem
sun is the star
this is 4th planet
configuration lifeform exists
bla bla bla
bla bla bla
So, once the program hits the line configuration lifeform exists and sees configuration, I want to print the line This is planet Mars
My code below:
file = open("SFU.txt","r")
for line in file:
    if "configuration" in line:
        #want to print the -4th line-HOW?
Use tee to run a pair of iterators across inf. This only stores five lines in memory at any given time:
from itertools import tee
with open("SFU.txt") as inf:
    # set up iterators
    cfg,res = tee(inf)
    # advance cfg by four lines
    for i in range(4):
        next(cfg)
    for c,r in zip(cfg, res):
        if "configuration" in c:
            print(r)
and, as expected, results in
This is planet Mars
Edit: if you want to edit the -4th line, I suggest
def edited(r):
    # make your changes to r
    return new_r

with open("SFU.txt") as inf, open("edited.txt", "w") as outf:
    # set up iterators
    cfg, res = tee(inf)
    for i in range(4):
        next(cfg)
    # iterate through in tandem
    for c, r in zip(cfg, res):
        if "configuration" in c:
            r = edited(r)
        outf.write(r)
    # reached end - write out remaining queued values
    for r in res:
        outf.write(r)
A limited-size deque is a good way to keep a "ring buffer" of the last few lines:
import collections
lastfewlines = collections.deque((), 4)
with open('SFU.txt') as f:
    for line in f:
        if 'configuration' in line and len(lastfewlines) == 4:
            print(lastfewlines[0])
        lastfewlines.append(line.rstrip())
However, while this solves the problem posed in the question, it doesn't work for the "real problem" the OP mentioned only in a comment -- "editing" that line, meaning, presumably, alter the input file "in place".
Alas, modern file-systems do not allow "in-place editing" of files except byte-for-byte overwriting -- unless the "edited" line is exactly the same number of bytes as the original one, you can't just overwrite said original line and imagine that all the following lines in the file will shift back or forth as desired!-)
Rather, one has to read the file, alter it, and rewrite it (the soundest approach is usually to write a new file then rename it to the old one's name "as atomically as your operating system and file-system will let you", to avoid losing data should there be a crash).
The deque approach can be adapted to this -- instead of just conditionally printing lastfewlines[0], write to the output file either the original or modified version of it (and at the end write what's left in the deque to the output file). Then, at least on Unix systems and local file-systems, a simple os.rename will do the atomic trick (as long as the output file is on the same mounted disk as the input one).
For all but really huge files, however, reading all lines in memory (with f.readlines()), performing alterations if any on the list of lines, then writing the lot out again, is much simpler. And since the user mentions 16,000 lines (length not specified but let's assume less than 100 bytes per average line), this tiny file of less than 2 megabytes should be dealt with in the most simple way -- it's orders of magnitude smaller than any file that would cause any "too big to fit in memory" worries!-)
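As a rough sketch of the streaming variant described above (deque buffer, write to a new file, then rename), something like the following could work. The function name edit_fourth_previous, its parameters, and the demo data are all hypothetical. os.replace (Python 3.3+) is used instead of os.rename because it also overwrites an existing target on Windows.

```python
import collections
import os
import tempfile

def edit_fourth_previous(path, keyword, edit):
    """Rewrite path, applying edit() to the line 4 above each keyword line."""
    window = collections.deque()  # holds at most 4 pending lines
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, text=True)
    with os.fdopen(fd, "w") as out, open(path) as src:
        for line in src:
            if keyword in line and len(window) == 4:
                window[0] = edit(window[0])
            window.append(line)
            if len(window) > 4:
                out.write(window.popleft())
        out.writelines(window)  # write what is left in the buffer
    os.replace(tmp_path, path)

# Demo on a throwaway file: upper-case the line 4 above "configuration".
fd, demo = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("a\nb\nc\nd\nconfiguration x\ne\n")
edit_fourth_previous(demo, "configuration", str.upper)
with open(demo) as f:
    result = f.read()
os.remove(demo)
```

Note that the replacement file is created by mkstemp, so it does not inherit the original file's permissions; a production version would probably want to copy them over.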
If you have a few lines you can use readlines() to save your lines as a list then just use indexing :
my_file = open("SFU.txt","r").readlines()
for i,line in enumerate(my_file):
    if "configuration" in line:
        print(my_file[i-4])
But note that if i < 4, the negative index wraps around and picks a line from the end of the list!
If you have a longer file and don't want to read the whole thing into memory, you can use an efficient queue implementation such as collections.deque like:
import collections
myfile = open("SFU.txt","r")
# This is a fixed length queue, and will hold 4 items at most
lines = collections.deque(['']*4,4)
for line in myfile:
    if 'configuration' in line:
        print(lines[0])
    # push the new line, dropping the 4th previous one
    # (done for every line, so the offsets stay correct after a match)
    lines.append(line)
Maybe try something like this.
As the whole thing is copied to a list, all the text is editable. You can write it back to a file when you are done.
f = open("SFU.txt","r")
lines = [line.strip() for line in f]
for i, line in enumerate(lines):
    if "configuration" in line:
        if i >= 4:
            print(lines[i - 4])
            # edit here
        else:
            print('There is no -4th line')
f.close()
Alternatively, you may open the file twice and yield one file to read from 4th line, then compare the next line first, and print the current line, something like this:
with open('SFU.txt', 'r') as f:
    with open('SFU.txt', 'r') as next_f:
        [next(next_f) for _ in range(4)] # advance next_f past the first 4 lines
        for line in next_f:
            if 'configuration' in line: # if keyword in the line 4 ahead
                print(next(f)) # this is the current line from f
                break
            next(f) # if not found, advance f to the next line
Result:
This is planet Mars
As a side note: please try not to use file as a variable name, as it shadows a Python 2 builtin.

Writelines writes lines without newline, Just fills the file

I have a program that writes a list to a file.
The list is a list of pipe delimited lines and the lines should be written to the file like this:
123|GSV|Weather_Mean|hello|joe|43.45
122|GEV|temp_Mean|hello|joe|23.45
124|GSI|Weather_Mean|hello|Mike|47.45
BUT it wrote them like this ahhhh:
123|GSV|Weather_Mean|hello|joe|43.45122|GEV|temp_Mean|hello|joe|23.45124|GSI|Weather_Mean|hello|Mike|47.45
This program wrote all the lines into like one line without any line breaks.. This hurts me a lot and I gotta figure-out how to reverse this but anyway, where is my program wrong here? I thought write lines should write lines down the file rather than just write everything to one line..
fr = open(sys.argv[1], 'r') # source file
fw = open(sys.argv[2]+"/masked_"+sys.argv[1], 'w') # Target Directory Location
for line in fr:
    line = line.strip()
    if line == "":
        continue
    columns = line.strip().split('|')
    if columns[0].find("#") > 1:
        looking_for = columns[0] # this is what we need to search
    else:
        looking_for = "Dummy#dummy.com"
    if looking_for in d:
        # by default, iterating over a dictionary will return keys
        new_line = d[looking_for]+'|'+'|'.join(columns[1:])
        line_list.append(new_line)
    else:
        new_idx = str(len(d)+1)
        d[looking_for] = new_idx
        kv = open(sys.argv[3], 'a')
        kv.write(looking_for+" "+new_idx+'\n')
        kv.close()
        new_line = d[looking_for]+'|'+'|'.join(columns[1:])
        line_list.append(new_line)
fw.writelines(line_list)
This is actually a pretty common problem for newcomers to Python—especially since, across the standard library and popular third-party libraries, some reading functions strip out newlines, but almost no writing functions (except the log-related stuff) add them.
So, there's a lot of Python code out there that does things like:
fw.write('\n'.join(line_list) + '\n')
(writing a single string) or
fw.writelines(line + '\n' for line in line_list)
Either one is correct, and of course you could even write your own writelinesWithNewlines function that wraps it up…
But you should only do this if you can't avoid it.
It's better if you can create/keep the newlines in the first place—as in Greg Hewgill's suggestions:
line_list.append(new_line + "\n")
And it's even better if you can work at a higher level than raw lines of text, e.g., by using the csv module in the standard library, as esuaro suggests.
For example, right after defining fw, you might do this:
cw = csv.writer(fw, delimiter='|')
Then, instead of this:
new_line = d[looking_for]+'|'+'|'.join(columns[1:])
line_list.append(new_line)
You do this:
row_list.append([d[looking_for]] + columns[1:])
And at the end, instead of this:
fw.writelines(line_list)
You do this:
cw.writerows(row_list)
Finally, your design is "open a file, then build up a list of lines to add to the file, then write them all at once". If you're going to open the file up top, why not just write the lines one by one? Whether you're using simple writes or a csv.writer, it'll make your life simpler, and your code easier to read. (Sometimes there can be simplicity, efficiency, or correctness reasons to write a file all at once—but once you've moved the open all the way to the opposite end of the program from the write, you've pretty much lost any benefits of all-at-once.)
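A minimal sketch of that "write each row as you go" design with csv.writer. The dictionary d, the records list, and the io.StringIO target are hypothetical stand-ins for the question's data and output file, used only so the example runs on its own.

```python
import csv
import io

# Hypothetical stand-ins for the question's lookup table and input rows.
d = {"joe@example.com": "1"}
records = [
    ["joe@example.com", "GSV", "43.45"],
    ["joe@example.com", "GEV", "23.45"],
]

# A real program would use open(path, "w", newline="") here instead.
fw = io.StringIO(newline="")
cw = csv.writer(fw, delimiter="|", lineterminator="\n")
for columns in records:
    # Each row is a list of fields; the writer adds the delimiters
    # and the line terminator, so no manual '\n' is needed.
    cw.writerow([d[columns[0]]] + columns[1:])

output = fw.getvalue()
```

With this layout there is no intermediate list of lines at all, so the missing-newline problem cannot occur.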
The documentation for writelines() states:
writelines() does not add line separators
So you'll need to add them yourself. For example:
line_list.append(new_line + "\n")
whenever you append a new item to line_list.
As others have noted, writelines is a misnomer (it ridiculously does not add newlines to the end of each line).
To do that, explicitly add it to each line:
with open(dst_filename, 'w') as f:
    f.writelines(s + '\n' for s in lines)
writelines() does not add line separators. You can alter the list of strings by using map() to add a new \n (line break) at the end of each string.
items = ['abc', '123', '!##']
items = map(lambda x: x + '\n', items)
w.writelines(items)
As others have mentioned, and counter to what the method name would imply, writelines does not add line separators. This is a textbook case for a generator. Here is a contrived example:
def item_generator(things):
    for item in things:
        yield item
        yield '\n'

def write_things_to_file(things):
    # text mode ('w'), since the generator yields str values
    with open('path_to_file.txt', 'w') as f:
        f.writelines(item_generator(things))
Benefits: adds newlines explicitly without modifying the input or output values or doing any messy string concatenation. And, critically, does not create any new data structures in memory. IO (writing to a file) is when that kind of thing tends to actually matter. Hope this helps someone!
Credits to Brent Faust.
Python >= 3.6 with format string:
with open(dst_filename, 'w') as f:
    f.writelines(f'{s}\n' for s in lines)
lines can be a set.
If you are oldschool (like me) you may add f.write('\n') below the second line.
As we have well established here, writelines does not append the newlines for you. But what everyone seems to be missing is that it doesn't have to when it is used as a direct "counterpart" to readlines() and the initial read preserved the newlines!
When you open a file for reading in binary mode (via 'rb'), then use readlines() to fetch the file contents into memory, split by line, the newlines remain attached to the end of your lines! So, if you then subsequently write them back, you don't likely want writelines to append anything!
So if, you do something like:
with open('test.txt','rb') as f: lines=f.readlines()
with open('test.txt','wb') as f: f.writelines(lines)
You should end up with the same file content you started with.
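A quick way to convince yourself of this round trip, using a throwaway temporary file (the file contents are made up for the demonstration):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\ntwo\nthree")  # note: no newline after the last line

with open(path, "rb") as f:
    lines = f.readlines()   # newlines stay attached to each line
with open(path, "wb") as f:
    f.writelines(lines)     # so nothing needs to be added back

with open(path, "rb") as f:
    round_trip = f.read()
os.remove(path)
```

The last element of lines has no trailing newline because the original file did not end with one, and writelines() reproduces that faithfully.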
Since we only want to separate the lines, and writelines() does not add a separator between them, the simple code below handles it:
sep = "\n" # defining the separator
new_lines = sep.join(lines) # lines is an iterable of line strings
and finally, because new_lines is now a single string, write it with write():
with open("file_name", 'w') as file:
    file.write(new_lines)
and you are done.

How to read from a file using only readline()

The code I have right now is this
f = open(SINGLE_FILENAME)
lines = [i for i in f.readlines()]
but my professor demands that
You may use readline(). You may not use read(), readlines() or iterate over the open file using for.
any suggestions?
thanks
You could use a two-argument iter() version:
lines = iter(f.readline, "")
If you need a list of lines:
lines = list(lines)
First draft:
lines = []
with open(SINGLE_FILENAME) as f:
    while True:
        line = f.readline()
        if line:
            lines.append(line)
        else:
            break
I feel fairly certain there is a better way to do it, but that does avoid iterating with for, using read, or using readlines.
You could write a generator function to keep calling readline() until the file was empty, but that doesn't really seem like a large improvement here.
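For completeness, such a generator might look like the sketch below. The name read_lines is hypothetical, and io.StringIO stands in for the opened file.

```python
import io

def read_lines(f):
    """Yield lines from f using only readline()."""
    while True:
        line = f.readline()
        if not line:  # readline() returns '' only at end of file
            return
        yield line

lines = list(read_lines(io.StringIO("a\n\nb")))
```

A blank line in the file comes back as '\n', which is truthy, so it does not end the loop early.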
