How to read lots of lines from a file at once - Python

I want to generate a bunch of files based on a template. The template has thousands of lines. For each of the new files, only the top 5 lines are different. What is the best way of reading all the lines but the first 5 at once, instead of reading the whole file in line by line?

One approach would be to create a list of the first 5 lines, and read the rest into one big buffer:
with open("input.txt") as f:
    first_lines = [f.readline() for _ in range(5)]
    rest_of_lines = f.read()
or, more symmetrically for the first part, create one small buffer with the 5 lines:
    first_lines = "".join(f.readline() for _ in range(5))
As an alternative, from a purely I/O point of view, the quickest would be
with open("input.txt") as f:
    lines = f.read()
and then use a line-split generator to read the first 5 lines (splitlines() would be disastrous in terms of memory copies; find an implementation here)
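For reference, here is a minimal sketch of such a line-split generator (a hypothetical helper, not the linked implementation): it walks the big string with str.find, so only one line is materialized at a time instead of copying the whole buffer.

```python
def iter_lines(text):
    """Yield lines from `text` one at a time, keeping the trailing
    newline, without splitting the whole buffer up front."""
    start = 0
    while start < len(text):
        end = text.find("\n", start)
        if end == -1:              # last line has no trailing newline
            yield text[start:]
            return
        yield text[start:end + 1]
        start = end + 1
```

You can then pull just the first 5 lines out of the big string with, e.g., `[line for _, line in zip(range(5), iter_lines(lines))]`.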

File objects in Python are conveniently their own iterators, so that when you write for line in f: ... you get the file line by line. The file object keeps what's generally referred to as a cursor that tracks where you're reading from. Each pass through the generic for loop advances this cursor to the next newline and returns what it has read. If you interrupt this loop before the end of the file, you can pick back up where you left off with another loop, or with a call to f.read() to read the rest of the file.
with open(inputfile, 'r') as f:
    lineN = 0
    header = ""
    for line in f:
        header = header + line
        lineN += 1
        if lineN >= 5:  # stop once the first 5 lines are read
            break
    body = f.read()  # read the rest of the file
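The same split can also be sketched with itertools.islice; the sample file is created first so the sketch can run on its own (the file name and contents here are placeholders):

```python
from itertools import islice

# create a small sample file so the sketch is self-contained
with open("input.txt", "w") as f:
    f.write("".join(f"line {i}\n" for i in range(1, 8)))

with open("input.txt") as f:
    first_lines = list(islice(f, 5))  # the first 5 lines, as a list
    rest_of_lines = f.read()          # everything after them, as one string
```

islice consumes exactly 5 lines from the file iterator, so f.read() picks up right after them.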

Related

How can I handle multiple lines at once while reading from a file?

The standard Python approach to working with files, using the open() function to create a 'file object' f, allows you either to load the entire file into memory at once using f.read(), or to read lines one by one using a for loop:
with open('filename') as f:
    # 1) Read all lines at once into memory:
    all_data = f.read()
    # 2) Read lines one-by-one:
    for line in f:
        ...  # Work with each line
I'm searching through several large files looking for a pattern that might span multiple lines. The most intuitive way to do this is to read line-by-line looking for the beginning of the pattern, and then to load in the next few lines to see where it ends:
with open('large_file') as f:
    # Read lines one-by-one:
    for line in f:
        if line.startswith("beginning"):
            # Load in the next line, i.e.
            nextline = f.getline(line+1) # ??? #
            # or something
The line I've marked with # ??? # is my own pseudocode for what I imagine this should look like.
My question is, does this exist in Python? Is there any method for me to access other lines as needed while keeping the cursor at line and without loading the entire file into memory?
Edit: Inferring from the responses here and other reading, the answer is "No."
Like this:
gather = []
for line in f:
    if gather:
        gather.append(line)
        if "ending" in line:
            process(''.join(gather))
            gather = []
    elif line.startswith("beginning"):
        gather = [line]
Although in many cases it's easier just to load the whole file into a string and search it.
You may want to rstrip the newline before appending the line.
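If the pattern's extent is bounded, a fixed-size sliding window over the file is another sketch worth considering (windows is a made-up helper, not a stdlib function); it keeps only size lines in memory at a time:

```python
from collections import deque

def windows(lines, size):
    """Yield each run of `size` consecutive lines as a list (a sliding window)."""
    win = deque(maxlen=size)
    for line in lines:
        win.append(line)
        if len(win) == size:
            yield list(win)

# e.g., flag any 3-line window whose first line starts the pattern:
# for w in windows(f, 3):
#     if w[0].startswith("beginning"):
#         ...
```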
Just store the interesting lines into a list while going line-wise through the file:
with open("file.txt","w") as f:
f.write("""
a
b
------
c
d
e
####
g
f""")
interesting_data = []
inside = False
with open ("file.txt") as f:
for line in f:
line = line.strip()
# start of interesting stuff
if line.startswith("---"):
inside = True
# end of interesting stuff
elif line.startswith("###"):
inside = False
# adding interesting bits
elif inside:
interesting_data.append(line)
print(interesting_data)
to get
['c', 'd', 'e']
I think you're looking for .readline(), which does exactly that. Here is a sketch that advances to the line where a pattern starts.
with open('large_file') as f:
    line = f.readline()
    while not line.startswith("beginning"):
        line = f.readline()
        # end of file
        if not line:
            print("EOF")
            break
    # do_something with line; get additional lines by
    # calling .readline() again, etc.
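If you use .readline() rather than the for loop, you can also peek ahead and restore the cursor with tell() and seek(). A sketch (peek_lines is a made-up helper):

```python
def peek_lines(f, n):
    """Return the next n lines of open file `f` without
    permanently moving the read cursor."""
    pos = f.tell()
    lines = [f.readline() for _ in range(n)]
    f.seek(pos)  # rewind to where we were before peeking
    return lines
```

Note that mixing tell() with for-line iteration over a text file typically raises OSError ("telling position disabled by next() call"), so this fits a readline()-style loop like the one above, not a for loop.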

Saving all lines in a file after a certain string to a separate file

I have a file containing a block of introductory text that runs for an unknown number of lines, then the rest of the file contains data. Before the data block begins there are column titles, and I want to skip those also. So the file looks something like this:
this is an introduction..
blah blah blah...
...
UniqueString
Time Position Count
0 35 12
1 48 6
2 96 8
...
1000 82 37
I want to record the Time, Position and Count data to a separate file. The Time, Position and Count data appear only after UniqueString.
Is this what you're looking for?
from functools import reduce  # reduce is a builtin on Python 2, in functools on Python 3
reduce(lambda x, line: (x and (outfile.write(line) or x)) or line == 'UniqueString\n', infile, False)
How it works:
files can be iterated, so we can read infile line by line simply by passing it to reduce
in the operation part, we use the fact that write() will not be called if the first operand of and is False
in the or part, we set the trigger once the desired line is found, so write() fires for the next and all subsequent lines
the initial value passed to reduce must be falsy, so False is given explicitly (without an initializer, reduce would seed the accumulator with the file's first line, which is a truthy string)
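The same logic written as a plain loop may be easier to follow. An equivalent sketch, here made self-contained with io.StringIO standing in for the real infile/outfile handles:

```python
import io

# stand-ins for the real file handles
infile = io.StringIO("intro\nUniqueString\ndata1\ndata2\n")
outfile = io.StringIO()

found = False
for line in infile:
    if found:
        outfile.write(line)          # everything after the marker is copied
    else:
        found = (line == 'UniqueString\n')  # the marker line itself is not copied
```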
You could extract and write the data to another file like this:
with open("data.txt", "r") as infile:
    x = infile.readlines()
x = [i.strip() for i in x[x.index('UniqueString\n') + 1:] if i != '\n']
with open("output.txt", "w") as outfile:
    for i in x[1:]:
        outfile.write(i + "\n")
It is pretty straightforward, I think: the file is opened and all lines are read; a list comprehension slices the list beginning after the marker string; and the desired remaining lines (skipping the column-title line) are written to file again.
You could create a generator function (and more info here) that filtered the file for you.
It operates incrementally so doesn't require reading the entire file into memory at one time.
def extract_lines_following(file, marker=None):
    """Generator yielding all lines in file after the line that follows the marker.
    """
    marker_seen = False
    for line in file:
        if marker_seen:
            yield line
        elif line.strip() == marker:
            marker_seen = True
            next(file, None)  # skip the line following the marker, too
# sample usage
with open('test_data.txt', 'r') as infile, open('cleaned_data.txt', 'w') as outfile:
    outfile.writelines(extract_lines_following(infile, 'UniqueString'))
On Python 2 you would call file.next() instead of next(file), but the basic idea is the same.
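The same filtering can also be sketched with itertools, which does the state-keeping for you (lines_after_marker is a made-up name):

```python
from itertools import dropwhile, islice

def lines_after_marker(f, marker):
    """Yield the lines after the marker line, skipping the one
    header line that immediately follows it."""
    remaining = dropwhile(lambda line: line.strip() != marker, f)
    return islice(remaining, 2, None)  # drop the marker line and the header line
```

If the marker never appears, dropwhile consumes the whole file and the result is simply empty.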

Reading CSV file with python

import os
import time

filename = 'NTS.csv'
mycsv = open(filename, 'r')
mycsv.seek(0, os.SEEK_END)
while 1:
    time.sleep(1)
    where = mycsv.tell()
    line = mycsv.readline()
    if not line:
        mycsv.seek(where)
    else:
        arr_line = line.split(',')
        var3 = arr_line[3]
        print(var3)
I have this Python code, which reads values from a CSV file every time an external program prints a new line to it. My problem is that the CSV file is periodically rewritten completely, and then Python stops reading new lines. My guess is that Python is stuck on some line number, while the rewrite can leave the file maybe 50 lines longer or shorter. So, for example, Python is waiting for a new line at line 70, while the new line has arrived at line 95. I think the solution is to re-run mycsv.seek(0, os.SEEK_END) when this happens, but I'm not sure how to detect it.
What you want to do is difficult to accomplish without rewinding the file every time to make sure that you are truly on the last line. If you know approximately how many characters there are on each line, then there is a shortcut you could take using mycsv.seek(-end_buf, os.SEEK_END), as outlined in this answer. So your code could work somehow like this:
avg_len = 50  # use an appropriate number here
end_buf = 3 * avg_len // 2  # seek offsets must be ints
filename = 'NTS.csv'
mycsv = open(filename, 'rb')  # binary mode: Python 3 only allows seeking from the end in binary mode
mycsv.seek(-end_buf, os.SEEK_END)
last = mycsv.readlines()[-1]
while 1:
    time.sleep(1)
    mycsv.seek(-end_buf, os.SEEK_END)
    line = mycsv.readlines()[-1]
    if not line == last:
        last = line  # remember the newest line, or it gets reported repeatedly
        arr_line = line.decode().split(',')
        var3 = arr_line[3]
        print(var3)
Here, in each iteration of the while loop, you seek to a position close to the end of the file, just far back enough that you know for sure the last line will be contained in what remains. Then you read in all the remaining lines (this will probably include partial second- or third-to-last lines) and check whether the last of these is different from what you had before.
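To handle the rewrite problem from the question, one option is a small helper (detect_truncation is a made-up name) that compares the file's current size on disk with your read position, and signals when the file has shrunk:

```python
import os

def detect_truncation(f):
    """Return True if the file on disk is now shorter than our current
    read position, i.e. it was rewritten and we should seek back to 0."""
    return os.fstat(f.fileno()).st_size < f.tell()
```

Inside the polling loop, call this before readline() and do mycsv.seek(0) (or reopen the file) when it returns True.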
You can read lines in your program in a simpler way. Instead of trying to use seek to get what you need, try using readlines on the file object mycsv.
You can do the following:
mycsv = open('NTS.csv', 'r')
csv_lines = mycsv.readlines()
for line in csv_lines:
    arr_line = line.split(',')
    var3 = arr_line[3]
    print(var3)

"Move" some parts of the file to another file

Let's say I have a file with 48,222 lines. I then give an index value, let's say 21,000.
Is there any way in Python to "move" the contents of the file starting from index 21,000, so that I end up with two files: the original one and a new one, where the original now has 21,000 lines and the new one 27,222 lines?
I read this post, which uses partition and pretty much describes what I want:
with open("inputfile") as f:
    contents1, sentinel, contents2 = f.read().partition("Sentinel text\n")
with open("outputfile1", "w") as f:
    f.write(contents1)
with open("outputfile2", "w") as f:
    f.write(contents2)
Except that (1) it uses "Sentinel text" as the separator, and (2) it creates two new files and requires me to delete the old file. As of now, the way I do it is like this:
for r in result.keys():  # the filenames are in my dictionary, don't bother with that
    f = open(r)
    lines = f.readlines()
    f.close()
    with open("outputfile1.txt", "w") as fn:
        for line in lines[0:21000]:
            fn.write(line)
    with open("outputfile2.txt", "w") as fn:
        for line in lines[21000:]:
            fn.write(line)
Which is quite manual work. Is there a built-in or more efficient way?
You can also use writelines() and dump the sliced list of lines from 0 to 20999 into one file and another sliced list from 21000 to the end into another file.
with open("inputfile") as f:
    content = f.readlines()
content1 = content[:21000]
content2 = content[21000:]
with open("outputfile1.txt", "w") as fn1:
    fn1.writelines(content1)
with open('outputfile2.txt', 'w') as fn2:
    fn2.writelines(content2)
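If the files are large, the same split can be sketched without holding all lines in memory, using itertools.islice to stream the first 21,000 lines straight into the first output file (split_file is a made-up name):

```python
from itertools import islice

def split_file(path, n, out1, out2):
    """Write the first n lines of `path` to out1 and the rest to out2,
    reading the input one line at a time."""
    with open(path) as f, open(out1, "w") as a, open(out2, "w") as b:
        a.writelines(islice(f, n))  # consumes exactly n lines from f
        b.writelines(f)             # the file iterator resumes where islice stopped
```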

parse blocks of text from text file using Python

I am trying to parse some text files and need to extract blocks of text. Specifically, the line that starts with "1:" and the 19 lines after it. The "1:" does not start on the same row in each file, and there is only one instance of "1:". I would prefer to save the block of text and export it to a separate file. In addition, I need to preserve the formatting of the text in the original file.
Needless to say I am new to Python. I generally work with R but these files are not really compatible with R and I have about 100 to process. Any information would be appreciated.
The code that I have so far is:
tmp = open(files[0], "r")
lines = tmp.readlines()
tmp.close()
num = 0
a = 0
for line in lines:
    num += 1
    if "1:" in line:
        a = num
        break
a = num is the line number for the block of text I want. I then want to save the next 19 lines to another file, but can't figure out how to do this. Any help would be appreciated.
Here is one option. Read all lines from your file, iterate until you find your line, and return the next 19 lines. You would need to handle situations where your file doesn't contain 19 additional lines.
def get_block(filename):
    fh = open(filename, 'r')
    all_lines = fh.readlines()
    fh.close()
    for count, line in enumerate(all_lines):
        if "1:" in line:
            return all_lines[count + 1:count + 20]
Could be done in a one-liner...
open(files[0]).read().split('1:', 1)[1].split('\n')[:19]
or more readable
txt = open(files[0]).read() # read the file into a big string
before, after = txt.split('1:', 1) # split the file on the first "1:"
after_lines = after.split('\n') # create lines from the after text
lines_to_save = after_lines[:19] # grab the first 19 lines after "1:"
then join the lines with a newline (and add a newline to the end) before writing it to a new file:
out_text = "1:" # add back "1:"
out_text += "\n".join(lines_to_save) # add all 19 lines with newlines between them
out_text += "\n" # add a newline at the end
open("outputfile.txt", "w").write(out_text)
to comply with best practice for reading and writing files, you should also use the with statement to ensure that the file handles are closed as soon as possible. You can create convenience functions for it:
def read_file(fname):
    "Returns contents of file with name `fname`."
    with open(fname) as fp:
        return fp.read()

def write_file(fname, txt):
    "Writes `txt` to a file named `fname`."
    with open(fname, 'w') as fp:
        fp.write(txt)
then you can replace the first line above with:
txt = read_file(files[0])
and the last line with:
write_file("outputfile.txt", out_text)
I always prefer to read the file into memory first, but sometimes that's not possible. If you want to use iteration then this will work:
def process_file(fname):
    with open(fname) as fp:
        for line in fp:
            if line.startswith('1:'):
                break
        else:
            return  # no '1:' in file
        yield line  # yield the line containing '1:'
        for i, line in enumerate(fp):
            if i >= 19:
                break
            yield line

if __name__ == "__main__":
    with open('output.txt', 'w') as fp:
        for line in process_file('intxt.txt'):
            fp.write(line)
It's using the else: clause on a for loop, which you don't see very often anymore, but which was created for just this purpose (the else clause is executed if the for loop doesn't break).
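A minimal illustration of that for/else behavior, in the same spirit (find_first is a made-up helper):

```python
def find_first(items, predicate):
    """Return the first item for which predicate(item) is true, else None."""
    for item in items:
        if predicate(item):
            break
    else:                # runs only if the loop finished without a break
        return None
    return item
```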
