I noticed that if I iterate over a file that I opened, it is much faster to iterate over it without "read"-ing it.
i.e.
l = open('file', 'r')
for line in l:
    pass  # (or code)
is much faster than
l = open('file', 'r')
for line in l.read() / l.readlines():
    pass  # (or code)
The 2nd loop will take around 1.5x as much time (I used timeit over the exact same file, and the results were 0.442 vs. 0.660), and would give the same result.
So - when should I ever use the .read() or .readlines()?
Since I always need to iterate over the file I'm reading, and after learning the hard way how painfully slow .read() can be on large data, I can't imagine ever using it again.
The short answer to your question is that each of these three methods of reading bits of a file has different use cases. As noted above, f.read() reads the entire file into a single string, and so allows relatively easy file-wide manipulations, such as a file-wide regex search or substitution.
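For instance, a minimal sketch of a file-wide substitution on the string returned by f.read() (the filenames and the pattern here are just placeholders):

import re

with open('data.txt', 'r') as f:    # placeholder filename
    text = f.read()                 # the whole file as one string

# file-wide substitution: collapse runs of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', text)

with open('data_clean.txt', 'w') as f:
    f.write(cleaned)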
f.readline() reads a single line of the file, allowing the user to parse one line without necessarily reading the entire file. It also makes it easier to apply logic while reading than a plain line-by-line iteration does, for example when a file changes format partway through.
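As a sketch of that kind of use, assuming a hypothetical file whose first line is a header and whose remaining lines are data:

with open('data.txt', 'r') as f:        # placeholder filename
    header = f.readline()               # read just the first line
    columns = header.rstrip('\n').split(',')
    for line in f:                      # iteration continues from the second line
        values = line.rstrip('\n').split(',')
        # process values against columns here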
Using the syntax for line in f: allows the user to iterate over the file line by line as noted in the question.
(As noted in the other answer, this documentation is a very good read):
https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Note:
It was previously claimed that f.readline() could be used to skip a line during a for loop iteration. However, this doesn't work in Python 2.7, and is perhaps a questionable practice, so this claim has been removed.
Hope this helps!
https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory
For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:
for line in f:
    print line,
This is the first line of the file.
Second line of the file
Note that readline() is not comparable to the case of reading all lines in a for-loop, since it reads one line at a time and has an overhead that others have already pointed out.
I ran timeit on two identical snippets, one with a for-loop and the other with readlines(). You can see my snippet below:
def test_read_file_1():
    f = open('ml/README.md', 'r')
    for line in f.readlines():
        print(line)

def test_read_file_2():
    f = open('ml/README.md', 'r')
    for line in f:
        print(line)

def test_time_read_file():
    from timeit import timeit
    duration_1 = timeit(lambda: test_read_file_1(), number=1000000)
    duration_2 = timeit(lambda: test_read_file_2(), number=1000000)
    print('duration using readlines():', duration_1)
    print('duration using for-loop:', duration_2)
And the results:
duration using readlines(): 78.826229238
duration using for-loop: 69.487692794
The bottom line, I would say, is that the for-loop is faster, but when both are possible I'd still rather use readlines().
readlines() is better than for line in file when you know that the data you are interested in starts from, for example, the 2nd line: you can simply write readlines()[1:].
A typical case is a tab- or comma-separated value file whose first line is a header (and you don't want to use an additional module for tsv or csv files).
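A minimal sketch of that pattern (the filename and separator are placeholders):

with open('values.tsv', 'r') as f:          # placeholder filename
    for row in f.readlines()[1:]:           # [1:] drops the header line
        fields = row.rstrip('\n').split('\t')
        # work with fields here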
# The difference between file.read(), file.readline() and file.readlines()

file = open('samplefile', 'r')

single_string = file.read()      # reads the whole file into a single string
                                 # ('\n' characters are included)
line = file.readline()           # reads the line at the current cursor position
                                 # as a string and moves the cursor to the next line
list_strings = file.readlines()  # reads the remaining lines into a list of strings
Related
I am using lineslist = file.readlines() on a 2GB file.
So I guess it will create a lineslist list of 2GB or more in size. So, basically, is it the same as readfile = file.read(), which also creates a readfile (instance/variable?) of about 2GB?
Why should I prefer readlines in this case?
Adding to that, I have one more question. It is also mentioned here: https://docs.python.org/2/tutorial/inputoutput.html:
readline(): a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous;
I don't understand the last point. So does readlines() also have an unambiguous value in the last element of its list if there is no \n at the end of the file?
We are dealing with combining files (which were split on the basis of blocksize), so I am thinking of choosing readlines or read. Since the individual files may not end with a \n after splitting, if readlines returns unambiguous values it would be a problem, I think.
PS: I haven't learnt python. So, forgive me if there is no such thing as instances in python or if I am speaking rubbish. I am just assuming.
EDIT:
OK, I just found out. It's not returning any unambiguous output.
len(lineslist)
6923798
lineslist[6923797]
"\xf4\xe5\xcf1)\xff\x16\x93\xf2\xa3-\....\xab\xbb\xcd"
So, it doesn't end with '\n'. But it's not unambiguous output either.
Also, there is no unambiguous output with readline either for the last line.
If I understood your issue correctly, you just want to combine (i.e. concatenate) files.
If memory is an issue, normally for line in f is the way to go.
I tried benchmarking using a 1.9GB csv file. One possible alternative is to read in large chunks of the data which fit in memory.
Code:
# read in large chunks - fastest in my test
chunksize = 2**16
with open(fn, 'r') as f:
    chunk = f.read(chunksize)
    while chunk:
        chunk = f.read(chunksize)
# 1 loop, best of 3: 4.48 s per loop

# read whole file in one go - slowest in my test
with open(fn, 'r') as f:
    chunk = f.read()
# 1 loop, best of 3: 11.7 s per loop

# read file using iterator over each line - most practical for most cases
with open(fn, 'r') as f:
    for line in f:
        s = line
# 1 loop, best of 3: 6.74 s per loop
Knowing this you could write something like:
with open(outputfile, 'w') as fo:
    for inputfile in inputfiles:  # assuming inputfiles is a list of filepaths
        with open(inputfile, 'r') as fi:
            for chunk in iter(lambda: fi.read(chunksize), ''):
                fo.write(chunk)   # write the chunk we just read
            fo.write('\n')        # newline between each file (might not be necessary)
file.read() will read the entire stream of data as 1 long string, whereas file.readlines() will create a list of lines from the stream.
Generally performance will suffer, especially in the case of large files, if you read in the entire thing all at once. The general approach is to iterate over the file object line by line, which it supports.
for line in file_object:
    # Process the line
    pass
This way of processing only consumes memory for one line at a time (loosely speaking), not for the entire contents of the file.
Yes, readlines() reads the whole file into a variable.
It would be much better to read the file line by line:

f = open("file_path", "r")
for line in f:
    print line

This loads only one line into RAM at a time, so you're saving about 1.99 GB of memory :)
As I understood it, you want to concatenate two files:
target = open("target_file", "w")
f1 = open("f1", "r")
f2 = open("f2", "r")

for line in f1:
    print >> target, line

for line in f2:
    print >> target, line

target.close()
Or consider using other technology like bash:
cat file1 > target
cat file2 >> target
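Staying in Python, a similar effect can be had with shutil.copyfileobj, which copies in buffered chunks; a minimal sketch reusing the placeholder filenames from above:

import shutil

with open("target_file", "wb") as target:
    for name in ("f1", "f2"):                    # the two input files from above
        with open(name, "rb") as source:
            shutil.copyfileobj(source, target)   # buffered copy; never loads a whole file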
I have a big file to read and process.
Which is the faster method to read through the file and process it?
with open('file') as file:
    for line in file:
        print line
OR
file = open('file')
lines = file.readlines()
file.close()

for line in lines:
    print line
The former can use buffered reading; the latter requires reading the entire file into memory first before it can start looping.
In general, it's a better idea to use the former; it's not going to be any slower than the latter and it's better on memory usage.
If you have a large line-based file, I strongly suggest using the following lines to achieve your goal:

file = open('file')
for line in file.readlines():
    print line
file.close()
There are 2 points:
Reading all the content into memory is never a good idea; the right way is to read it by chunk (line).
Don't call lines = f.readlines(); this will also read all the content into memory.
PS: The "with" statement in the former is just shorthand for a try/finally that closes the file; readlines is implemented using an iterator, so it won't eat all your memory.
I'm new to Python from the R world, and I'm working on big text files structured in data columns (this is LiDAR data, so generally 60 million+ records).
Is it possible to change the field separator (e.g. from tab-delimited to comma-delimited) of such a big file without having to read the file and do a for loop on the lines?
No.
Read the file in
Change separators for each line
Write each line back
This is easily doable with just a few lines of Python (not tested but the general approach works):
# Python - it's so readable, the code basically just writes itself ;-)
#
with open('infile') as infile:
    with open('outfile', 'w') as outfile:
        for line in infile:
            fields = line.split('\t')
            outfile.write(','.join(fields))
I'm not familiar with R, but if it has a library function for this it's probably doing exactly the same thing.
Note that this code only reads one line at a time from the file, so the file can be larger than the physical RAM - it's never wholly loaded in.
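If the fields can themselves contain the delimiter or quotes, the standard csv module takes care of the quoting; a Python 3 sketch under that assumption (filenames are placeholders):

import csv

with open('infile', newline='') as fin, open('outfile', 'w', newline='') as fout:
    reader = csv.reader(fin, delimiter='\t')
    writer = csv.writer(fout, delimiter=',')
    for row in reader:               # still one row at a time, so memory use stays low
        writer.writerow(row)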
You can use the Linux tr command to replace any character with any other character.
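For example, assuming a plain tab-separated file with no embedded tabs inside the fields, something like:

tr '\t' ',' < infile > outfile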
Actually, let's say yes, you can do it without an explicit loop, e.g.:

with open('in') as infile:
    with open('out', 'w') as outfile:
        map(lambda line: outfile.write(','.join(line.split('\t'))), infile)
You can't, but I strongly advise you to check out generators.
The point is that you can make a faster and better-structured program without needing to write and store the data in memory in order to process it.
For instance
file = open("bigfile", "r")
j = (i.split("\t") for i in file)
s = (",".join(i) for i in j)
# and now magic happens
for i in s:
    some_other_file.write(i)
This code only needs enough memory to hold a single line at a time.
In Python, suppose I have a file data.txt which has 6 lines of data. I want to count the number of lines, which I am planning to do by going through each character and counting the number of '\n' characters in the file. How do I read one character at a time from the file? readline takes the whole line.
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of each of the lines in the file. To find out how many lines, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into an array in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
    # Do whatever you want.
    pass
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
lines = 0
for line in file:
    lines += 1
    # do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
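For reference, that explicit readline() construction would look roughly like this (a sketch reusing the file object from above):

lines = 0
line = file.readline()
while line:                     # readline() returns '' at end of file
    lines += 1
    # do whatever else you need to do for each line
    line = file.readline()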
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
    lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
The linecache module can also be useful for this.
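For example, a minimal sketch with linecache (the filename and line number are placeholders):

import linecache

# getline() caches the file's contents, so repeated lookups are cheap;
# line numbers are 1-based and a missing line returns ''
fourth_line = linecache.getline('myfile', 4)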
Is there a shorter (perhaps more pythonic) way of opening a text file and reading past the lines that start with a comment character?
In other words, a neater way of doing this
fin = open("data.txt")
line = fin.readline()
while line.startswith("#"):
    line = fin.readline()
At this stage in my arc of learning Python, I find this most Pythonic:
def iscomment(s):
    return s.startswith('#')

from itertools import dropwhile
with open(filename, 'r') as f:
    for line in dropwhile(iscomment, f):
        # do something with line
        pass
to skip all of the lines at the top of the file starting with #. To skip all lines starting with #:
from itertools import ifilterfalse
with open(filename, 'r') as f:
    for line in ifilterfalse(iscomment, f):
        # do something with line
        pass
That's almost all about readability for me; functionally there's almost no difference between:
for line in ifilterfalse(iscomment, f)
and
for line in (x for x in f if not x.startswith('#'))
Breaking out the test into its own function makes the intent of the code a little clearer; it also means that if your definition of a comment changes you have one place to change it.
for line in open('data.txt'):
    if line.startswith('#'):
        continue
    # work with line
Of course, if your commented lines are only at the beginning of the file, you might use some optimisations:
from itertools import dropwhile
for line in dropwhile(lambda line: line.startswith('#'), file('data.txt')):
    pass
If you want to filter out all comment lines (not just those at the start of the file):
for line in file("data.txt"):
    if not line.startswith("#"):
        # process line
        pass
If you only want to skip those at the start then see ephemient's answer using itertools.dropwhile
You could use a generator function
def readlines(filename):
    fin = open(filename)
    for line in fin:
        if not line.startswith("#"):
            yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
Depending on exactly where the files come from, you may also want to strip() the lines before the startswith() check. I once had to debug a script like that months after it was written because someone had put a couple of space characters before the '#'.
As a practical matter, if I knew I was dealing with reasonably sized text files (anything which will comfortably fit in memory) then I'd probably go with something like:
f = open("data.txt")
lines = [ x for x in f.readlines() if x[0] != "#" ]
... to snarf in the whole file and filter out all lines that begin with the octothorpe.
As others have pointed out, one might want to ignore leading whitespace occurring before the octothorpe, like so:
lines = [ x for x in f.readlines() if not x.lstrip().startswith("#") ]
I like this for its brevity.
This assumes that we want to strip out all of the comment lines.
We can also "chop" the last character (almost always a newline) off the end of each line using:
lines = [ x[:-1] for x in ... ]
... assuming that we're not worried about the infamously obscure issue of a missing final newline on the last line of the file. (The only time a line from the .readlines() or related file-like object methods might NOT end in a newline is at EOF).
In reasonably recent versions of Python one can "chomp" (only newlines) off the ends of the lines using a conditional expression like so:
lines = [ x[:-1] if x[-1]=='\n' else x for x in ... ]
... which is about as complicated as I'll go with a list comprehension for legibility's sake.
If we were worried about the possibility of an overly large file (or low memory constraints) impacting our performance or stability, and we're using a version of Python that's recent enough to support generator expressions (which are more recent additions to the language than the list comprehensions I've been using here), then we could use:
for line in (x[:-1] if x[-1] == '\n' else x for x in f
             if not x.lstrip().startswith('#')):
    # do stuff with each line
    pass
... is at the limits of what I'd expect anyone else to parse in one line a year after the code's been checked in.
If the intent is only to skip "header" lines then I think the best approach would be:
f = open('data.txt')
for line in f:
    if line.lstrip().startswith('#'):
        continue
... and be done with it.
You could make a generator that loops over the file that skips those lines:
fin = open("data.txt")
fileiter = (l for l in fin if not l.startswith('#'))

for line in fileiter:
    ...
You could do something like
def drop(n, seq):
    for i, x in enumerate(seq):
        if i >= n:
            yield x
And then say
for line in drop(1, file(filename)):
    # whatever
    pass
I like @iWerner's generator function idea. One small change to his code and it does what the question asked for.
def readlines(filename):
    f = open(filename)
    # discard first lines that start with '#'
    for line in f:
        if not line.lstrip().startswith("#"):
            break
    # yield the first non-comment line, then the rest of the file
    yield line
    for line in f:
        yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
But here is a different approach, which is almost embarrassingly simple. The idea is that we open the file and get a file object, which we can use as an iterator. Then we pull the lines we don't want out of the iterator and just return the iterator. This would be ideal if we always knew how many lines to skip. The problem here is that we don't know how many lines we need to skip; we just need to pull lines and look at them. And there is no way to put a line back into the iterator once we have pulled it.
So: open the iterator, pull lines and count how many have the leading '#' character; then use the .seek() method to rewind the file, pull the correct number again, and return the iterator.
One thing I like about this: you get the actual file object back, with all its methods; you can just use this instead of open() and it will work in all cases. I renamed the function to open_my_text() to reflect this.
def open_my_text(filename):
    f = open(filename, "rt")
    # count number of lines that start with '#'
    count = 0
    for line in f:
        if not line.lstrip().startswith("#"):
            break
        count += 1
    # rewind file, and discard lines counted above
    f.seek(0)
    for _ in range(count):
        f.readline()
    # return file object with comment lines pre-skipped
    return f
Instead of f.readline() I could have used f.next() (for Python 2.x) or next(f) (for Python 3.x) but I wanted to write it so it was portable to any Python.
EDIT: Okay, I know nobody cares and I'm not getting any upvotes for this, but I have re-written my answer one last time to make it more elegant.
You can't put a line back into an iterator. But, you can open a file twice, and get two iterators; given the way file caching works, the second iterator is almost free. If we imagine a file with a megabyte of '#' lines at the top, this version would greatly outperform the previous version that calls f.seek(0).
def open_my_text(filename):
    # open the same file twice to get two file objects
    # (We are opening the file read-only so this is safe.)
    ftemp = open(filename, "rt")
    f = open(filename, "rt")
    # use ftemp to look at lines, then discard the same lines from f
    for line in ftemp:
        if not line.lstrip().startswith("#"):
            break
        f.readline()
    # return file object with comment lines pre-skipped
    return f
This version is much better than the previous version, and it still returns a full file object with all its methods.