Memory issues with splitting lines in huge files in Python

I'm trying to read from disk a huge file (~2GB) and split each line into multiple strings:
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip().split() for line in f]
    return split_lines
Problem is, it tries to allocate tens of GB of memory. I found out that it doesn't happen if I change my code in the following way:
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip() for line in f]  # no splitting
    return split_lines
I.e., if I do not split the lines, memory usage drastically goes down.
Is there any way to handle this problem, maybe some smart way to store split lines without filling up the main memory?
Thank you for your time.

After the split, you have multiple objects: a list plus some number of string objects. Each object has its own overhead in addition to the actual characters that make up the original string.
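To see that overhead concretely, you can compare the size of one line against the combined size of its split pieces with sys.getsizeof (a minimal illustration; the sample line is made up and exact numbers vary across Python versions and platforms):

import sys

line = "alpha beta gamma delta\n"
parts = line.rstrip().split()

print(sys.getsizeof(line))                                          # one string object
print(sys.getsizeof(parts) + sum(sys.getsizeof(p) for p in parts))  # a list plus four strings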
Rather than reading the entire file into memory, use a generator.
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

for t in get_split_lines(file_path):
    ...  # Do something with the list t
This does not preclude you from writing something like
lines = list(get_split_lines(file_path))
if you really need to read the entire file into memory.

In the end, I settled on storing a list of stripped lines:
with open(file_path, 'r') as f:
    split_lines = [line.rstrip() for line in f]
And, in each iteration of my algorithm, I simply recomputed on-the-fly the split line:
for line in split_lines:
    split_line = line.split()
    # do something with the split line
If you can afford to keep all the lines in memory like I did, and you have to go through the whole file more than once, this approach is faster than the one proposed by @chepner, since you read the file's lines only once.

Related

Is there a way to have a sentences variable pull sentences from a .txt file? [duplicate]

How do I read every line of a file in Python and store each line as an element in a list?
I want to read the file line by line and append each line to the end of the list.
This code will read the entire file into memory and remove all whitespace characters (newlines and spaces) from the end of each line:
with open(filename) as file:
    lines = [line.rstrip() for line in file]
If you're working with a large file, then you should instead read and process it line-by-line:
with open(filename) as file:
    for line in file:
        print(line.rstrip())
In Python 3.8 and up you can use a while loop with the walrus operator like so:
with open(filename) as file:
    while (line := file.readline().rstrip()):
        print(line)
Depending on what you plan to do with your file and how it was encoded, you may also want to manually set the access mode and character encoding:
with open(filename, 'r', encoding='UTF-8') as file:
    while (line := file.readline().rstrip()):
        print(line)
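One caveat with the walrus pattern above: because an empty string is falsy, the loop stops at the first blank line rather than at the end of the file. If blank lines can occur in your file, strip inside the loop instead (a small variation of the same example):

with open(filename) as file:
    while (line := file.readline()):  # readline() returns '' only at end of file
        print(line.rstrip())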
See Input and Output:
with open('filename') as f:
    lines = f.readlines()
or with stripping the newline character:
with open('filename') as f:
    lines = [line.rstrip('\n') for line in f]
This is more explicit than necessary, but does what you want.
with open("file.txt") as file_in:
lines = []
for line in file_in:
lines.append(line)
This will yield an "array" of lines from the file.
lines = tuple(open(filename, 'r'))
open returns a file which can be iterated over. When you iterate over a file, you get the lines from that file. tuple can take an iterator and instantiate a tuple instance for you from the iterator that you give it. lines is a tuple created from the lines of the file.
According to Python's Methods of File Objects, the simplest way to convert a text file into a list is:
with open('file.txt') as f:
    my_list = list(f)
    # my_list = [x.rstrip() for x in f]  # remove line breaks
If you just need to iterate over the text file lines, you can use:
with open('file.txt') as f:
    for line in f:
        ...
Old answer:
Using with and readlines() :
with open('file.txt') as f:
    lines = f.readlines()
If you don't care about closing the file, this one-liner will work:
lines = open('file.txt').readlines()
The traditional way:
f = open('file.txt') # Open file on read mode
lines = f.read().splitlines() # List with stripped line-breaks
f.close() # Close file
If you want the \n included:
with open(fname) as f:
    content = f.readlines()
If you do not want \n included:
with open(fname) as f:
    content = f.read().splitlines()
You could simply do the following, as has been suggested:
with open('/your/path/file') as f:
    my_lines = f.readlines()
Note that this approach has 2 downsides:
1) You store all the lines in memory. In the general case, this is a very bad idea. The file could be very large, and you could run out of memory. Even if it's not large, it is simply a waste of memory.
2) This does not allow processing of each line as you read them. So if you process your lines after this, it is not efficient (requires two passes rather than one).
A better approach for the general case would be the following:
with open('/your/path/file') as f:
    for line in f:
        process(line)
Where you define your process function any way you want. For example:
def process(line):
    if 'save the world' in line.lower():
        superman.save_the_world()
(The implementation of the Superman class is left as an exercise for you).
This will work nicely for any file size and you go through your file in just 1 pass. This is typically how generic parsers will work.
Having a text file with this content:
line 1
line 2
line 3
We can use this Python script in the same directory as the txt above:
>>> with open("myfile.txt", encoding="utf-8") as file:
...     x = [l.rstrip("\n") for l in file]
...
>>> x
['line 1', 'line 2', 'line 3']
Using append:
x = []
with open("myfile.txt") as file:
    for l in file:
        x.append(l.strip())
Or:
>>> x = open("myfile.txt").read().splitlines()
>>> x
['line 1', 'line 2', 'line 3']
Or:
>>> x = open("myfile.txt").readlines()
>>> x
['line 1\n', 'line 2\n', 'line 3\n']
Or:
def print_output(lines_in_textfile):
    print("lines_in_textfile =", lines_in_textfile)

y = [x.rstrip() for x in open("001.txt")]
print_output(y)

with open('001.txt', 'r', encoding='utf-8') as file:
    file = file.read().splitlines()
print_output(file)

with open('001.txt', 'r', encoding='utf-8') as file:
    file = [x.rstrip("\n") for x in file]
print_output(file)
output:
lines_in_textfile = ['line 1', 'line 2', 'line 3']
lines_in_textfile = ['line 1', 'line 2', 'line 3']
lines_in_textfile = ['line 1', 'line 2', 'line 3']
Introduced in Python 3.4, pathlib has a really convenient method for reading in text from files, as follows:
from pathlib import Path
p = Path('my_text_file')
lines = p.read_text().splitlines()
(The splitlines call is what turns it from a string containing the whole contents of the file to a list of lines in the file.)
pathlib has a lot of handy conveniences in it. read_text is nice and concise, and you don't have to worry about opening and closing the file. If all you need to do with the file is read it all in in one go, it's a good choice.
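If the file is too large for read_text to pull in at once, the same Path object also hands you a lazy, line-by-line file handle via its open method (a minimal sketch reusing the p defined above):

from pathlib import Path

p = Path('my_text_file')
with p.open() as f:
    for line in f:
        print(line.rstrip())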
To read a file into a list you need to do three things:
Open the file
Read the file
Store the contents as list
Fortunately Python makes it very easy to do these things so the shortest way to read a file into a list is:
lst = list(open(filename))
However I'll add some more explanation.
Opening the file
I assume that you want to open a specific file and you don't deal directly with a file-handle (or a file-like handle). The most commonly used function to open a file in Python is open; it takes one mandatory argument and two optional ones in Python 2.7:
Filename
Mode
Buffering (I'll ignore this argument in this answer)
The filename should be a string that represents the path to the file. For example:
open('afile') # opens the file named afile in the current working directory
open('adir/afile') # relative path (relative to the current working directory)
open('C:/users/aname/afile') # absolute path (windows)
open('/usr/local/afile') # absolute path (linux)
Note that the file extension needs to be specified. This is especially important for Windows users because file extensions like .txt or .doc, etc. are hidden by default when viewed in the explorer.
The second argument is the mode; it's r by default, which means "read-only". That's exactly what you need in your case.
But in case you actually want to create a file and/or write to a file you'll need a different argument here. There is an excellent answer if you want an overview.
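For completeness, the most common writing modes look like this (the file name is a placeholder):

open('afile', 'w')   # write: creates the file, or truncates it if it already exists
open('afile', 'a')   # append: new writes go to the end of the file
open('afile', 'r+')  # read and write, without truncating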
For reading a file you can omit the mode or pass it in explicitly:
open(filename)
open(filename, 'r')
Both will open the file in read-only mode. In case you want to read in a binary file on Windows you need to use the mode rb:
open(filename, 'rb')
On other platforms the 'b' (binary mode) is simply ignored.
Now that I've shown how to open the file, let's talk about the fact that you always need to close it again. Otherwise it will keep an open file-handle to the file until the process exits (or Python garbage-collects the file-handle).
While you could use:
f = open(filename)
# ... do stuff with f
f.close()
That will fail to close the file when something between open and close throws an exception. You could avoid that by using a try and finally:
f = open(filename)
# nothing in between!
try:
    ...  # do stuff with f
finally:
    f.close()
However Python provides context managers that have a prettier syntax (but for open it's almost identical to the try and finally above):
with open(filename) as f:
    ...  # do stuff with f
# The file is always closed after the with-scope ends.
The last approach is the recommended approach to open a file in Python!
Reading the file
Okay, you've opened the file, now how to read it?
The open function returns a file object, and it supports Python's iteration protocol. Each iteration will give you a line:
with open(filename) as f:
    for line in f:
        print(line)
This will print each line of the file. Note however that each line will contain a newline character \n at the end (you might want to check if your Python is built with universal newlines support - otherwise you could also have \r\n on Windows or \r on Mac as newlines). If you don't want that, you could simply remove the last character (or the last two characters on Windows):
with open(filename) as f:
    for line in f:
        print(line[:-1])
But the last line doesn't necessarily have a trailing newline, so one shouldn't use that. One could instead check if it ends with a trailing newline and, if so, remove it:
with open(filename) as f:
    for line in f:
        if line.endswith('\n'):
            line = line[:-1]
        print(line)
But you could simply remove all whitespace (including the \n character) from the end of the string; this will also remove all other trailing whitespace, so you have to be careful if that whitespace is important:
with open(filename) as f:
    for line in f:
        print(line.rstrip())
However, if the lines end with \r\n (Windows "newlines"), that .rstrip() will also take care of the \r!
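If those trailing spaces or tabs do matter, you can restrict what gets stripped by naming the characters explicitly (a small illustration with a made-up line):

line = 'some data  \r\n'
print(repr(line.rstrip()))        # 'some data' - spaces, \r and \n all removed
print(repr(line.rstrip('\r\n')))  # 'some data  ' - only \r and \n removed, spaces kept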
Store the contents as list
Now that you know how to open the file and read it, it's time to store the contents in a list. The simplest option would be to use the list function:
with open(filename) as f:
    lst = list(f)
In case you want to strip the trailing newlines you could use a list comprehension instead:
with open(filename) as f:
    lst = [line.rstrip() for line in f]
Or even simpler: The .readlines() method of the file object by default returns a list of the lines:
with open(filename) as f:
    lst = f.readlines()
This will also include the trailing newline characters. If you don't want them, I would recommend the [line.rstrip() for line in f] approach, because it avoids keeping two lists containing all the lines in memory.
There's an additional option to get the desired output, however it's rather "suboptimal": read the complete file into a string and then split on newlines:
with open(filename) as f:
    lst = f.read().split('\n')
or:
with open(filename) as f:
    lst = f.read().splitlines()
These take care of the trailing newlines automatically because the split character isn't included. However, they are not ideal, because you keep the file both as a string and as a list of lines in memory!
Summary
Use with open(...) as f when opening files because you don't need to take care of closing the file yourself and it closes the file even if some exception happens.
file objects support the iteration protocol so reading a file line-by-line is as simple as for line in the_file_object:.
Always browse the documentation for the available functions/classes. Most of the time there's a perfect match for the task or at least one or two good ones. The obvious choice in this case would be readlines() but if you want to process the lines before storing them in the list I would recommend a simple list-comprehension.
Clean and Pythonic Way of Reading the Lines of a File Into a List
First and foremost, you should focus on opening your file and reading its contents in an efficient and pythonic way. Here is an example of the way I personally DO NOT prefer:
infile = open('my_file.txt', 'r') # Open the file for reading.
data = infile.read() # Read the contents of the file.
infile.close() # Close the file since we're done using it.
Instead, I prefer the below method of opening files for both reading and writing, as it is very clean and does not require an extra step of closing the file once you are done using it. In the statement below, we're opening the file for reading, and assigning it to the variable 'infile.' Once the code within this statement has finished running, the file will be automatically closed.
# Open the file for reading.
with open('my_file.txt', 'r') as infile:
    data = infile.read()  # Read the contents of the file into memory.
Now we need to focus on bringing this data into a Python List because they are iterable, efficient, and flexible. In your case, the desired goal is to bring each line of the text file into a separate element. To accomplish this, we will use the splitlines() method as follows:
# Return a list of the lines, breaking at line boundaries.
my_list = data.splitlines()
The Final Product:
# Open the file for reading.
with open('my_file.txt', 'r') as infile:
    data = infile.read()  # Read the contents of the file into memory.

# Return a list of the lines, breaking at line boundaries.
my_list = data.splitlines()
Testing Our Code:
Contents of the text file:
A fost odatã ca-n povesti,
A fost ca niciodatã,
Din rude mãri împãrãtesti,
O prea frumoasã fatã.
Print statements for testing purposes:
print my_list  # Print the list.

# Print each line in the list.
for line in my_list:
    print line

# Print the fourth element in this list.
print my_list[3]
Output (different-looking because of unicode characters):
['A fost odat\xc3\xa3 ca-n povesti,', 'A fost ca niciodat\xc3\xa3,',
'Din rude m\xc3\xa3ri \xc3\xaemp\xc3\xa3r\xc3\xa3testi,', 'O prea
frumoas\xc3\xa3 fat\xc3\xa3.']
A fost odatã ca-n povesti, A fost ca niciodatã, Din rude mãri
împãrãtesti, O prea frumoasã fatã.
O prea frumoasã fatã.
Here's one more option, using list comprehensions on files:
lines = [line.rstrip() for line in open('file.txt')]
This should be a more efficient way, as most of the work is done inside the Python interpreter.
f = open("your_file.txt",'r')
out = f.readlines() # will append in the list out
Now variable out is a list (array) of what you want. You could either do:
for line in out:
    print(line)

Or, after rewinding the file (readlines() left the read position at the end of the file):

f.seek(0)
for line in f:
    print(line)

You'll get the same results.
Another option is numpy.genfromtxt, for example:
import numpy as np
data = np.genfromtxt("yourfile.dat", delimiter="\n")
This will make data a NumPy array with as many rows as are in your file.
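Note that genfromtxt parses fields as floats by default, so arbitrary text lines would come back as nan. If you want the lines themselves, ask for strings explicitly (a sketch with the same placeholder file name):

import numpy as np
data = np.genfromtxt("yourfile.dat", delimiter="\n", dtype=str)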
Read and write text files with Python 2 and Python 3; it works with Unicode
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Define data
lines = [' A first string ',
         'A Unicode sample: €',
         'German: äöüß']

# Write text file
with open('file.txt', 'w') as fp:
    fp.write('\n'.join(lines))

# Read text file
with open('file.txt', 'r') as fp:
    read_lines = fp.readlines()
    read_lines = [line.rstrip('\n') for line in read_lines]

print(lines == read_lines)
Things to notice:
with is a so-called context manager. It makes sure that the opened file is closed again.
All solutions here that simply use .strip() or .rstrip() will fail to reproduce the lines, as they also strip the surrounding whitespace.
Common file endings
.txt
More advanced file writing/reading
CSV: Super simple format (read & write)
JSON: Nice for writing human-readable data; VERY commonly used (read & write)
YAML: YAML is a superset of JSON, but easier to read (read & write, comparison of JSON and YAML)
pickle: A Python serialization format (read & write)
MessagePack (Python package): More compact representation (read & write)
HDF5 (Python package): Nice for matrices (read & write)
XML: exists too *sigh* (read & write)
For your application, the following might be important:
Support by other programming languages
Reading/writing performance
Compactness (file size)
See also: Comparison of data serialization formats
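To make the read & write pattern behind those formats concrete, here is a minimal sketch of the JSON case using only the standard library (the file name and data are placeholders):

import json

data = {'lines': ['A first string', 'A Unicode sample: €']}

# Write JSON file
with open('file.json', 'w') as fp:
    json.dump(data, fp)

# Read JSON file
with open('file.json') as fp:
    restored = json.load(fp)

print(data == restored)  # True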
In case you are rather looking for a way to make configuration files, you might want to read my short article Configuration files in Python.
If you'd like to read a file from the command line or from stdin, you can also use the fileinput module:
# reader.py
import fileinput

content = []
for line in fileinput.input():
    content.append(line.strip())

fileinput.close()
Pass files to it like so:
$ python reader.py textfile.txt
Read more here: http://docs.python.org/2/library/fileinput.html
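Because fileinput.input() falls back to reading standard input when no file names are passed, the same script can also sit in a shell pipeline (a usage sketch):

$ cat textfile.txt | python reader.py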
The simplest way to do it
A simple way is to:
Read the whole file as a string
Split the string line by line
In one line, that would give:
lines = open('C:/path/file.txt').read().splitlines()
However, this is a quite inefficient way, as it stores two versions of the content in memory (probably not a big issue for small files, but still). [Thanks Mark Amery].
There are 2 easier ways:
Using the file as an iterator
lines = list(open('C:/path/file.txt'))
# ... or if you want to have a list without EOL characters
lines = [l.rstrip() for l in open('C:/path/file.txt')]
If you are using Python 3.4 or above, better use pathlib to create a path for your file that you could use for other operations in your program:
from pathlib import Path
file_path = Path("C:/path/file.txt")
lines = file_path.read_text().splitlines()
# ... or ...
lines = [l.rstrip() for l in file_path.open()]
Just use the splitlines() function. Here is an example.
inp = "file.txt"
data = open(inp)
dat = data.read()
lst = dat.splitlines()
print lst
# print(lst) # for python 3
In the output you will have the list of lines.
If you are faced with a very large / huge file and want to read faster (imagine you are in a TopCoder or HackerRank coding competition), you might read a considerably bigger chunk of lines into a memory buffer at one time, rather than just iterate line by line at file level.
buffersize = 2**16
with open(path) as f:
    while True:
        lines_buffer = f.readlines(buffersize)
        if not lines_buffer:
            break
        for line in lines_buffer:
            process(line)
The easiest ways to do that with some additional benefits are:
lines = list(open('filename'))
or
lines = tuple(open('filename'))
or
lines = set(open('filename'))
In the case of set, remember that line order is not preserved and duplicated lines are removed.
Below I added an important supplement from @MarkAmery:
Since you're not calling .close on the file object nor using a with statement, in some Python implementations the file may not get closed after reading and your process will leak an open file handle.
In CPython (the normal Python implementation that most people use), this isn't a problem since the file object will get immediately garbage-collected and this will close the file, but it's nonetheless generally considered best practice to do something like:
with open('filename') as f: lines = list(f)
to ensure that the file gets closed regardless of what Python implementation you're using.
Use this:
import pandas as pd
data = pd.read_csv(filename) # You can also add parameters such as header, sep, etc.
array = data.values
data is a DataFrame, and values gets the underlying ndarray. You can also get a list by using array.tolist().
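One caveat if you try this on a plain text file: read_csv treats the first row as a header by default, so that line would vanish from your data. A hedged sketch that keeps it, assuming the lines contain no separator character:

import pandas as pd

data = pd.read_csv(filename, header=None)  # header=None keeps the first line as data
lines = data[0].tolist()                   # column 0 holds one line per row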
In case there are also empty lines in the document, I like to read in the content and pass it through filter to exclude empty string elements:
with open(myFile, "r") as f:
    excludeFileContent = list(filter(None, f.read().splitlines()))
Outline and Summary
With a filename, handling the file from a Path(filename) object, or directly with open(filename) as f, do one of the following:
list(fileinput.input(filename))
using with path.open() as f, call f.readlines()
list(f)
path.read_text().splitlines()
path.read_text().splitlines(keepends=True)
iterate over fileinput.input or f and list.append each line one at a time
pass f to a bound list.extend method
use f in a list comprehension
I explain the use-case for each below.
In Python, how do I read a file line-by-line?
This is an excellent question. First, let's create some sample data:
from pathlib import Path
Path('filename').write_text('foo\nbar\nbaz')
File objects are lazy iterators, so just iterate over the file.
filename = 'filename'
with open(filename) as f:
    for line in f:
        line  # do something with the line
Alternatively, if you have multiple files, use fileinput.input, another lazy iterator. With just one file:
import fileinput
for line in fileinput.input(filename):
    line  # process the line
or for multiple files, pass it a list of filenames:
for line in fileinput.input([filename]*2):
    line  # process the line
Again, f and fileinput.input above both are/return lazy iterators.
You can only use an iterator one time, so to provide functional code while avoiding verbosity I'll use the slightly more terse fileinput.input(filename) where apropos from here.
In Python, how do I read a file line-by-line into a list?
Ah but you want it in a list for some reason? I'd avoid that if possible. But if you insist... just pass the result of fileinput.input(filename) to list:
list(fileinput.input(filename))
Another direct answer is to call f.readlines, which returns the contents of the file (up to an optional hint number of characters, so you could break this up into multiple lists that way).
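To illustrate that hint argument: f.readlines(hint) stops collecting lines once roughly hint characters have been read, so you can process a big file in bounded batches (a sketch; the 64 KB batch size is arbitrary):

with open(filename) as f:
    while True:
        batch = f.readlines(64 * 1024)  # lines totalling roughly 64 KB
        if not batch:
            break
        for line in batch:
            ...  # process the batch's lines here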
You can get to this file object two ways. One way is to pass the filename to the open builtin:
filename = 'filename'
with open(filename) as f:
    f.readlines()
or using the new Path object from the pathlib module (which I have become quite fond of, and will use from here on):
from pathlib import Path
path = Path(filename)
with path.open() as f:
    f.readlines()
list will also consume the file iterator and return a list - a quite direct method as well:
with path.open() as f:
    list(f)
If you don't mind reading the entire text into memory as a single string before splitting it, you can do this as a one-liner with the Path object and the splitlines() string method. By default, splitlines removes the newlines:
path.read_text().splitlines()
If you want to keep the newlines, pass keepends=True:
path.read_text().splitlines(keepends=True)
I want to read the file line by line and append each line to the end of the list.
Now this is a bit silly to ask for, given that we've demonstrated the end result easily with several methods. But you might need to filter or operate on the lines as you make your list, so let's humor this request.
Using list.append would allow you to filter or operate on each line before you append it:
line_list = []
for line in fileinput.input(filename):
    line_list.append(line)
line_list
Using list.extend would be a bit more direct, and perhaps useful if you have a preexisting list:
line_list = []
line_list.extend(fileinput.input(filename))
line_list
Or more idiomatically, we could instead use a list comprehension, and map and filter inside it if desirable:
[line for line in fileinput.input(filename)]
Or even more directly, to close the circle, just pass it to list to create a new list directly without operating on the lines:
list(fileinput.input(filename))
Conclusion
You've seen many ways to get lines from a file into a list, but I'd recommend you avoid materializing large quantities of data into a list and instead use Python's lazy iteration to process the data if possible.
That is, prefer fileinput.input or with path.open() as f.
I would try one of the below mentioned methods. The example file that I use has the name dummy.txt. You can find the file here. I presume that the file is in the same directory as the code (you can change fpath to include the proper file name and folder path).
In both the below mentioned examples, the list that you want is given by lst.
1. First method
fpath = 'dummy.txt'
with open(fpath, "r") as f: lst = [line.rstrip('\n \t') for line in f]
print lst
>>>['THIS IS LINE1.', 'THIS IS LINE2.', 'THIS IS LINE3.', 'THIS IS LINE4.']
2. In the second method, one can use the csv.reader module from the Python Standard Library:
import csv
fpath = 'dummy.txt'
with open(fpath) as csv_file:
    # a delimiter that never occurs in the file keeps each row as a single field
    csv_reader = csv.reader(csv_file, delimiter='\t')
    lst = [row[0] for row in csv_reader]
print lst
>>>['THIS IS LINE1.', 'THIS IS LINE2.', 'THIS IS LINE3.', 'THIS IS LINE4.']
You can use either of the two methods. The time taken for the creation of lst is almost equal for the two methods.
I like to use the following. Reading the lines immediately.
contents = []
for line in open(filepath, 'r').readlines():
    contents.append(line.strip())
Or using list comprehension:
contents = [line.strip() for line in open(filepath, 'r').readlines()]
You could also use the loadtxt command in NumPy. This checks for fewer conditions than genfromtxt, so it may be faster.
import numpy
data = numpy.loadtxt(filename, delimiter="\n")
Here is a Python(3) helper library class that I use to simplify file I/O:
import os

# handle files using a callback method, prevents repetition
def _FileIO__file_handler(file_path, mode, callback=lambda f: None):
    f = open(file_path, mode)
    try:
        return callback(f)
    except Exception:
        raise IOError("Failed to %s file" % ["write to", "read from"][mode.lower() in "r rb r+".split(" ")])
    finally:
        f.close()
class FileIO:
    # return the contents of a file
    def read(file_path, mode="r"):
        return __file_handler(file_path, mode, lambda rf: rf.read())

    # get the lines of a file
    def lines(file_path, mode="r", filter_fn=lambda line: len(line) > 0):
        return [line for line in FileIO.read(file_path, mode).strip().split("\n") if filter_fn(line)]

    # create or update a file (NOTE: can also be used to replace a file's original content)
    def write(file_path, new_content, mode="w"):
        return __file_handler(file_path, mode, lambda wf: wf.write(new_content))

    # delete a file (if it exists)
    def delete(file_path):
        return os.remove(file_path) if os.path.isfile(file_path) else None
You would then use the FileIO.lines function, like this:
file_ext_lines = FileIO.lines("./path/to/file.ext")
for i, line in enumerate(file_ext_lines):
    print("Line {}: {}".format(i + 1, line))
Remember that the mode ("r" by default) and filter_fn (checks for empty lines by default) parameters are optional.
You could even remove the read, write and delete methods and just leave the FileIO.lines, or even turn it into a separate method called read_lines.
Command line version
#!/bin/python3
import os
import sys

abspath = os.path.abspath(__file__)
dname = os.path.dirname(abspath)
filename = os.path.join(dname, sys.argv[1])
arr = open(filename).read().split("\n")
print(arr)
Run with:
python3 somefile.py input_file_name.txt

how to read a large file faster? [duplicate]


Does it take RAM to save a readlines array?

I am using the command lineslist = file.readlines() on a 2GB file.
So, I guess it will create a lineslist array of 2GB or more. So, basically, is it the same as readfile = file.read(), which also creates a readfile (instance/variable?) of exactly 2GB?
Why should I prefer readlines in this case?
Adding to that, I have one more question. It is also mentioned here, https://docs.python.org/2/tutorial/inputoutput.html:
readline(): a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous;
I don't understand the last point. So, does readlines() also have an unambiguous value in the last element of its array if there is no \n at the end of the file?
We are dealing with combining files (which were split on the basis of block size), so I am thinking of choosing between readlines and read. As the individual files may not end with a \n after splitting, if readlines returned ambiguous values it would be a problem, I think.
PS: I haven't learnt python. So, forgive me if there is no such thing as instances in python or if I am speaking rubbish. I am just assuming.
EDIT:
OK, I just found out. It's not returning any unambiguous output.
>>> len(lineslist)
6923798
>>> lineslist[6923797]
"\xf4\xe5\xcf1)\xff\x16\x93\xf2\xa3-\....\xab\xbb\xcd"
So, it doesn't end with '\n'. But it's not unambiguous output either.
Also, there is no unambiguous output with readline either for the last line.
If I understood your issue correctly, you just want to combine (i.e. concatenate) files.
If memory is an issue, normally for line in f is the way to go.
I tried benchmarking using a 1.9GB csv file. One possible alternative is to read in large chunks of the data which fit in memory.
Code:

#read in large chunks - fastest in my test
chunksize = 2**16
with open(fn, 'r') as f:
    chunk = f.read(chunksize)
    while chunk:
        chunk = f.read(chunksize)
#1 loop, best of 3: 4.48 s per loop
#read whole file in one go - slowest in my test
with open(fn, 'r') as f:
    chunk = f.read()
#1 loop, best of 3: 11.7 s per loop
#read file using iterator over each line - most practical for most cases
with open(fn, 'r') as f:
    for line in f:
        s = line
#1 loop, best of 3: 6.74 s per loop
Knowing this you could write something like:
with open(outputfile, 'w') as fo:
    for inputfile in inputfiles:  # assuming inputfiles is a list of filepaths
        with open(inputfile, 'r') as fi:
            for chunk in iter(lambda: fi.read(chunksize), ''):
                fo.write(chunk)
        fo.write('\n')  # newline between each file (might not be necessary)
file.read() will read the entire stream of data as 1 long string, whereas file.readlines() will create a list of lines from the stream.
Generally performance will suffer, especially in the case of large files, if you read in the entire thing all at once. The general approach is to iterate over the file object line by line, which it supports.
for line in file_object:
    ...  # Process the line

This way of processing will only consume memory for a line (loosely speaking) and not the entire contents of the file.
Yes, readlines() reads the whole file into a variable.
It would be much better to read the file line by line:
f = open("file_path", "r")
for line in f:
print f
It will cause loading only one line to RAM, so you're saving about 1.99 GB of memory :)
As I understood, you want to concatenate two files.
target = open("target_file", "w")
f1 = open("f1", "r")
f2 = open("f2", "r")
for line in f1:
    target.write(line)  # each line already ends with '\n'
for line in f2:
    target.write(line)
target.close()
Or consider using another tool, such as bash:
cat file1 > target
cat file2 >> target

Looping over lines with Python

So I have a file that contains this:
SequenceName 4.6e-38 810..924
SequenceName_FGS_810..924 VAWNCRQNVFWAPLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
SequenceName 1.6e-38 887..992
SequenceName_GYQ_887..992 PLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
I want my program to read only the lines that contain these protein sequences. Up until now I got this, which skips the first line and reads the second one:
handle = open(filename, "r")
handle.readline()
linearr = handle.readline().split()
handle.close()
fnamealpha = filename + ".txt"
handle = open(fnamealpha, "w")
handle.write(">%s\n%s\n" % (linearr[0], linearr[1]))
handle.close()
But it only processes the first sequence, and I need it to process every line that contains a sequence, so I need a loop. How can I do it?
The part that saves to a .txt file is really important too, so I need to find a way to combine these two objectives.
My output with the above code is:
>SequenceName_810..924
VAWNCRQNVFWAPLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
Okay, I think I understand your question--you want to iterate over the lines in the file, right? But only the second line in the sequence--the one with the protein sequence--matters, correct? Here's my suggestion:
# context manager `with` takes care of file closing, error handling
with open(filename, 'r') as handle:
    for line in handle:
        if line.startswith('SequenceName_'):
            print line.split()
            # Write to file, etc.
My reasoning is that you're only interested in lines that start with SequenceName_.
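Since the part that saves to a .txt file matters too, here is a minimal sketch putting the reading and the writing together (fnamealpha is assumed to be the output filename from the question):

with open(filename, 'r') as handle, open(fnamealpha, 'w') as out:
    for line in handle:
        if line.startswith('SequenceName_'):
            fields = line.split()
            # write a FASTA-style header line followed by the sequence
            out.write(">%s\n%s\n" % (fields[0], fields[1]))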
Use readlines and throw it all into a for loop.
with open(filename, 'r') as fh:
    for line in fh.readlines():
        # do processing here
In the # do processing here section, you can just prepare another list of lines to write to the other file. (Using with handles all the proper closing for you.)

More pythonic way of skipping header lines

Is there a shorter (perhaps more pythonic) way of opening a text file and reading past the lines that start with a comment character?
In other words, a neater way of doing this
fin = open("data.txt")
line = fin.readline()
while line.startswith("#"):
    line = fin.readline()
At this stage in my arc of learning Python, I find this most Pythonic:
def iscomment(s):
    return s.startswith('#')

from itertools import dropwhile
with open(filename, 'r') as f:
    for line in dropwhile(iscomment, f):
        # do something with line
to skip all of the lines at the top of the file starting with #. To skip all lines starting with #, wherever they appear in the file (in Python 3, ifilterfalse is itertools.filterfalse):
from itertools import ifilterfalse
with open(filename, 'r') as f:
    for line in ifilterfalse(iscomment, f):
        # do something with line
That's almost all about readability for me; functionally there's almost no difference between:
for line in ifilterfalse(iscomment, f)
and
for line in (x for x in f if not x.startswith('#'))
Breaking out the test into its own function makes the intent of the code a little clearer; it also means that if your definition of a comment changes you have one place to change it.
for line in open('data.txt'):
    if line.startswith('#'):
        continue
    # work with line
Of course, if your comment lines are only at the beginning of the file, you might use some optimisations:
from itertools import dropwhile
for line in dropwhile(lambda line: line.startswith('#'), file('data.txt')):
    pass
If you want to filter out all comment lines (not just those at the start of the file):
for line in file("data.txt"):
    if not line.startswith("#"):
        # process line
If you only want to skip those at the start then see ephemient's answer using itertools.dropwhile
You could use a generator function
def readlines(filename):
    fin = open(filename)
    for line in fin:
        if not line.startswith("#"):
            yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
Depending on exactly where the files come from, you may also want to strip() the lines before the startswith() check. I once had to debug a script like that months after it was written because someone put in a couple of space characters before the '#'.
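A quick illustration of why the lstrip() matters (the strings here are just hypothetical examples):

print('  # indented comment'.startswith('#'))           # False -- slips through a plain startswith check
print('  # indented comment'.lstrip().startswith('#'))  # True  -- caught after lstrip()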
As a practical matter, if I knew I was dealing with reasonably sized text files (anything which will comfortably fit in memory) then I'd probably go with something like:
f = open("data.txt")
lines = [ x for x in f.readlines() if x[0] != "#" ]
... to snarf in the whole file and filter out all lines that begin with the octothorpe.
As others have pointed out, one might want to ignore leading whitespace occurring before the octothorpe, like so:
lines = [ x for x in f.readlines() if not x.lstrip().startswith("#") ]
I like this for its brevity.
This assumes that we want to strip out all of the comment lines.
We can also "chop" the last character (almost always a newline) off the end of each line using:
lines = [ x[:-1] for x in ... ]
... assuming that we're not worried about the infamously obscure issue of a missing final newline on the last line of the file. (The only time a line from the .readlines() or related file-like object methods might NOT end in a newline is at EOF).
In reasonably recent versions of Python one can "chomp" (only newlines) off the ends of the lines using a conditional expression like so:
lines = [ x[:-1] if x[-1]=='\n' else x for x in ... ]
... which is about as complicated as I'll go with a list comprehension for legibility's sake.
If we were worried about the possibility of an overly large file (or low memory constraints) impacting our performance or stability, and we're using a version of Python that's recent enough to support generator expressions (which are more recent additions to the language than the list comprehensions I've been using here), then we could use:
for line in (x[:-1] if x[-1] == '\n' else x
             for x in f if not x.lstrip().startswith('#')):
    # do stuff with each line
... is at the limits of what I'd expect anyone else to parse in one line a year after the code's been checked in.
If the intent is only to skip "header" lines then I think the best approach would be:
f = open('data.txt')
for line in f:
    if line.lstrip().startswith('#'):
        continue
... and be done with it.
You could make a generator that loops over the file that skips those lines:
fin = open("data.txt")
fileiter = (l for l in fin if not l.startswith('#'))
for line in fileiter:
    ...
You could do something like
def drop(n, seq):
    for i, x in enumerate(seq):
        if i >= n:
            yield x
And then say
for line in drop(1, file(filename)):
    # whatever
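The standard library already provides this skipping, so a hand-rolled drop() isn't strictly necessary: itertools.islice does the same thing. A minimal sketch:

from itertools import islice

with open(filename) as f:
    for line in islice(f, 1, None):  # skip the first line, yield the rest
        pass  # whatever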
I like #iWerner's generator function idea. One small change to his code and it does what the question asked for.
def readlines(filename):
    f = open(filename)
    # discard leading lines that start with '#'
    for line in f:
        if not line.lstrip().startswith("#"):
            # first non-comment line; yield it, then fall through to the rest
            yield line
            break
    for line in f:
        yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
But here is a different approach, and it is very simple. The idea is that we open the file and get a file object, which we can use as an iterator. Then we pull the lines we don't want out of the iterator and just return the iterator. This would be ideal if we always knew how many lines to skip, but the problem is that we don't; we just have to pull lines and look at them. And there is no way to put a line back into the iterator once we have pulled it.
So: open the iterator, pull lines and count how many have the leading '#' character; then use the .seek() method to rewind the file, pull the correct number again, and return the iterator.
One thing I like about this: you get the actual file object back, with all its methods; you can just use this instead of open() and it will work in all cases. I renamed the function to open_my_text() to reflect this.
def open_my_text(filename):
    f = open(filename, "rt")
    # count number of lines that start with '#'
    count = 0
    for line in f:
        if not line.lstrip().startswith("#"):
            break
        count += 1
    # rewind file, and discard lines counted above
    f.seek(0)
    for _ in range(count):
        f.readline()
    # return file object with comment lines pre-skipped
    return f
Instead of f.readline() I could have used f.next() (for Python 2.x) or next(f) (for Python 3.x) but I wanted to write it so it was portable to any Python.
EDIT: Okay, I know nobody cares and I'm not getting any upvotes for this, but I have re-written my answer one last time to make it more elegant.
You can't put a line back into an iterator. But, you can open a file twice, and get two iterators; given the way file caching works, the second iterator is almost free. If we imagine a file with a megabyte of '#' lines at the top, this version would greatly outperform the previous version that calls f.seek(0).
def open_my_text(filename):
    # open the same file twice to get two file objects
    # (We are opening the file read-only so this is safe.)
    ftemp = open(filename, "rt")
    f = open(filename, "rt")
    # use ftemp to look at lines, then discard the same number from f
    for line in ftemp:
        if not line.lstrip().startswith("#"):
            break
        f.readline()
    ftemp.close()
    # return file object with comment lines pre-skipped
    return f
This version is much better than the previous version, and it still returns a full file object with all its methods.
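Either version can then be used anywhere a regular file object is expected, for example:

f = open_my_text('data.txt')
for line in f:
    pass  # process each remaining (non-comment-header) line
f.close()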
