print a previous line based on a condition in python

I am trying to print the -4th line based on a condition. I have a text file SFU.txt with some content. My objective is: if the word configuration appears in a line, I want to print the line four lines above it (the -4th line). For example, if the content of my file is as below:
This is a random text document
We are talking about planets here
This is planet Mars
in solarsystem
sun is the star
this is 4th planet
configuration lifeform exists
bla bla bla
bla bla bla
So, once the program hits the line configuration lifeform exists and sees configuration, I want to print the line This is planet Mars
My code below:
file = open("SFU.txt", "r")
for line in file:
    if "configuration" in line:
        # want to print the -4th line - HOW?

Use tee to run a pair of iterators across inf. This only stores five lines in memory at any given time:
from itertools import tee

with open("SFU.txt") as inf:
    # set up two iterators over the same file
    cfg, res = tee(inf)
    # advance cfg by four lines
    for i in range(4):
        next(cfg)
    # iterate through in tandem
    for c, r in zip(cfg, res):
        if "configuration" in c:
            print(r)
and, as expected, results in
This is planet Mars
Edit: if you want to edit the -4th line, I suggest
def edited(r):
    # make your changes to r here
    new_r = r  # placeholder for the edited line
    return new_r

with open("SFU.txt") as inf, open("edited.txt", "w") as outf:
    # set up iterators
    cfg, res = tee(inf)
    for i in range(4):
        next(cfg)
    # iterate through in tandem
    for c, r in zip(cfg, res):
        if "configuration" in c:
            r = edited(r)
        outf.write(r)
    # reached end - write out remaining queued values
    for r in res:
        outf.write(r)

A limited-size deque is a good way to keep a "ring buffer" of the last few lines:
import collections

lastfewlines = collections.deque((), 4)  # ring buffer with maxlen=4

with open('SFU.txt') as f:
    for line in f:
        if 'configuration' in line and len(lastfewlines) == 4:
            print(lastfewlines[0])
        lastfewlines.append(line.rstrip())
However, while this solves the problem posed in the question, it doesn't work for the "real problem" the OP mentioned only in a comment -- "editing" that line, meaning, presumably, altering the input file "in place".
Alas, modern file-systems do not allow "in-place editing" of files except byte-for-byte overwriting -- unless the "edited" line is exactly the same number of bytes as the original one, you can't just overwrite said original line and imagine that all the following lines in the file will shift back or forth as desired!-)
Rather, one has to read the file, alter it, and rewrite it (the soundest approach is usually to write a new file then rename it to the old one's name "as atomically as your operating system and file-system will let you", to avoid losing data should there be a crash).
The deque approach can be adapted to this -- instead of just conditionally printing lastfewlines[0], write to the output file either the original or modified version of it (and at the end write what's left in the deque to the output file). Then, at least on Unix systems and local file-systems, a simple os.rename will do the atomic trick (as long as the output file is on the same mounted disk as the input one).
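A sketch of that adaptation (with edit_line as a hypothetical callback for whatever change you want to make, and a .tmp suffix for the output file):
import collections
import os

def rewrite(path, edit_line):
    lastfewlines = collections.deque((), 4)  # ring buffer of the last 4 lines
    with open(path) as inf, open(path + ".tmp", "w") as outf:
        for line in inf:
            if len(lastfewlines) == 4:
                out = lastfewlines.popleft()
                if "configuration" in line:
                    out = edit_line(out)  # modify the -4th line
                outf.write(out)
            lastfewlines.append(line)
        outf.writelines(lastfewlines)  # write out what's left in the deque
    os.rename(path + ".tmp", path)  # atomic on POSIX when on the same filesystem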
For all but really huge files, however, reading all lines in memory (with f.readlines()), performing alterations if any on the list of lines, then writing the lot out again, is much simpler. And since the user mentions 16,000 lines (length not specified but let's assume less than 100 bytes per average line), this tiny file of less than 2 megabytes should be dealt with in the most simple way -- it's orders of magnitude smaller than any file that would cause any "too big to fit in memory" worries!-)
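For instance, a minimal sketch of that whole-file approach (again with a hypothetical edit_line):
with open("SFU.txt") as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    if "configuration" in line and i >= 4:
        lines[i - 4] = edit_line(lines[i - 4])
with open("SFU.txt", "w") as f:
    f.writelines(lines)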

If the file only has a few lines, you can use readlines() to save them as a list, then just use indexing:
my_file = open("SFU.txt", "r").readlines()
for i, line in enumerate(my_file):
    if "configuration" in line:
        print(my_file[i - 4])
But note that if i < 4, the negative index will pick a line from the end of the file!

If you have a longer file and don't want to read the whole thing into memory, you can use an efficient queue implementation such as collections.deque like:
import collections

myfile = open("SFU.txt", "r")
# a fixed-length queue holding at most the last 4 lines
lines = collections.deque([''] * 4, 4)
for i, line in enumerate(myfile):
    if 'configuration' in line:
        print(lines[0])
    else:
        # push the new line, dropping the 4th-previous one
        lines.append(line)

Maybe try something like this.
As the whole thing is copied to a list, all the text is editable. You can write it back to a file when you are done.
f = open("SFU.txt","r")
lines = [line.strip() for line in f]
for i, line in enumerate(lines):
if "configuration" in line:
if i > 4:
print lines[i - 4]
# edit here
else:
print 'There is no -4th line'
f.close()

Alternatively, you may open the file twice and advance one file object to the 4th line, then check the look-ahead line first and print the current line, something like this:
with open('SFU.txt', 'r') as f:
    with open('SFU.txt', 'r') as next_f:
        [next(next_f) for _ in range(4)]  # skip the first 4 lines
        for line in next_f:
            if 'configuration' in line:  # if keyword in look-ahead line
                print(next(f))  # this is the current line from f
                break
            next(f)  # if not found, advance f to the next line
The result:
This is planet Mars
As a side note: please try not to use file as a variable name, as it shadows a Python builtin.

Related

In Python (SageMath 9.0) - text file on 1B lines - optimal way to read from a specific line

I'm running SageMath 9.0, on Windows 10 OS
I've read several similar questions (and answers) on this site, mainly this one on reading from the 7th line and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very distant) line, and whether I should read line by line, or whether reading by block could be more optimal in my case.
I have a 12 GB text file, made of around 1 billion short lines, all made of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:
J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...
For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded in graph6 format. The file was computed and made available by Brendan McKay on his webpage here.
I need to check every graph for some properties. I could use the generator for G in graphs(11), but this can be very slow (a few days at least on my laptop). I want to use the complete database in the file, so that I'm able to stop and start again from a certain point.
My current code reads the file line by line from the start, and does some computation after reading each line:
with open(filename, 'r') as file:
    while True:
        # get next line from file
        line = file.readline()
        # if line is empty, end of file is reached
        if not line:
            print("End of Database Reached")
            break
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
In order to be able to stop the code, or save the progress in case of a crash, I was thinking of:
Every million lines read (or so), saving the progress in a specific file
When restarting the code, reading the last saved value and, instead of using line = file.readline(), using the itertools option for line in islice(file, start_line, None)
so that my new code is
from itertools import islice

start_line = load('foo')
count = start_line
save_every_n_lines = 1000000

with open(filename, 'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
        count += 1
        if (count % save_every_n_lines) == 0:
            save(count, 'foo')
The code does work, but I would like to understand if I can optimise it. I'm not a big fan of my if statement in my for loop.
Is itertools.islice() the right option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". As "start" could be quite large, and given that I'm working on simple text files, could there be a faster option to "jump" directly to the start line?
Knowing that the text file is fixed, could it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
I also have the option to read blocks of lines in one go instead of line by line, and then work on a list of graphs. Could that be a good option?
Each line has a constant number of characters. So "jumping" might be feasible.
Assuming each line is the same size, you can use a memory-mapped file and read it by index without mucking about with seek and tell. The memory-mapped file emulates a bytearray, and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index and you can start up again with that index later.
This example is on Linux - mmap open on Windows is a bit different - but after it's set up, access should be the same.
import mmap

# LINE_SZ is the record plus the newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1

# generate a test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ + RECORD_SZ]
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ + RECORD_SZ]

print("get record 11", get_record(data, 11))

# to enumerate a range of records
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos + RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data, 6, 9)])

del data
f.close()
If the length of the line is constant (in this case it's 12: 11 characters plus the newline), you might do:
def get_line(k, line_len):
    # assumes fixed-width ASCII lines, so the k-th line starts at byte k * line_len
    with open('file') as f:
        f.seek(k * line_len)
        return next(f)
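For instance, with the 11-character lines plus a newline from the question (12 bytes per line, pure ASCII so byte and character offsets agree), fetching a far-away line might look like:
print(get_line(500000000, 12))  # the 500,000,000th line, counting from zero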

how to show the number of rows when opening a data file in python?

I have this code for opening a big file:
fr = open('X1','r')
text = fr.read()
print(text)
fr.close()
When I open it with gedit, each row is displayed with its row number, but in the terminal the rows are printed without numbers, so it is difficult to distinguish them.
How can I print the row numbers in my Python script?
If you just want to show the number of each line, wrap the line iterator in enumerate; it returns an iterator of tuples with the index (zero-based) and the line.
Like this:
with open('X1', 'r') as fr:
    for index, line in enumerate(fr):
        print(f'{index}: {line}', end='')  # line already ends with a newline
Also, when working with files, using the with context manager is better. It ensures the file is properly closed and the data buffers flushed even if an exception is raised.
EDIT:
As a bonus, the example I gave uses only iterators: the file object is an iterator of lines, and enumerate also returns an iterator that builds the tuples.
This means the script holds only one line at a time in memory (plus the buffers defined by your platform), not the whole file.
fr = open('X1', 'r')
rowcount = 0
for i in fr:
    print(rowcount, end="")
    print(":", end="")
    print(i, end="")  # i already ends with a newline
    rowcount += 1
print(rowcount)
fr.close()
We just read the file line by line, printing the counter each time.

how to save changes after modifying content in file using Python

I want to insert a line into the file "original.txt" (the file contains about 200 lines). The line needs to be inserted two lines after a string is found in one of the existing lines. This is my code. I am using a couple of print statements that show me that the line is being added to the list, in the spot I need, but the file "original.txt" is not being edited:
with open("original.txt", "r+") as file:
lines = file.readlines() # makes file into a list of lines
print(lines) #test
for number, item in enumerate(lines):
if testStr in item:
i = number +2
print(i) #test
lines.insert(i, newLine)
print(lines) #test
break
file.close()
I am turning the lines of the text into a list, then enumerating the lines as I look for the string, assigning the line number to i and adding 2 so that the new line is inserted two lines after. The print() calls show the line was added to the list in the correct spot, but the file "original.txt" is not modified.
You seem to misunderstand what your code is doing. Let's go through it line by line:
with open("original.txt", "r+") as file: # open a file for reading
lines = file.readlines() # read the contents into a list of lines
print(lines) # print the whole file
for number, item in enumerate(lines): # iterate over lines
if testStr in item:
i = number +2
print(i) #test
lines.insert(i, newLine) # insert lines into the list
print(lines) #test
break # get out of the look
file.close() # not needed, with statement takes care of closing
You are not modifying the file. You read the file into a list of strings and modify the list. To modify the actual file you need to open it for writing and write the list back into it. Something like this at the end of the code might work
with open("modified.txt", "w") as f:
for line in lines: f.write(line)
You never modified the original text. Your code reads the lines into local memory. When you identify your trigger, you count two lines ahead and then insert the undefined value newLine into your local copy. At no point did your code modify the original file.
One way is to close the file and then rewrite it from your final value of lines. Do not modify the file while you're reading it -- it's good that you read it all in and then start processing.
Another way is to write to a new file as you go, then use a system command to replace the original file with your new version.
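A minimal sketch of that second approach, reusing the testStr and newLine names from the question and swapping the new file in with os.replace:
import os

insert_countdown = -1  # -1 means no insert pending yet
with open("original.txt") as src, open("original.txt.new", "w") as dst:
    for line in src:
        dst.write(line)
        if insert_countdown > 0:
            insert_countdown -= 1
            if insert_countdown == 0:
                dst.write(newLine)  # lands two lines after the match
        elif insert_countdown == -1 and testStr in line:
            insert_countdown = 1  # copy one more line, then insert
os.replace("original.txt.new", "original.txt")  # atomic swap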

editing a single .txt line in python 3.1

I have some data stored in a .txt file in this format:
----------|||||||||||||||||||||||||-----------|||||||||||
1029450386abcdefghijklmnopqrstuvwxy0293847719184756301943
1020414646canBeFollowedBySpaces 3292532113435532419963
don't ask...
I have many lines of this, and I need a way to add more digits to the end of a particular line.
I've written code to find the line I want, but I'm stumped as to how to add 11 characters to the end of it. I've looked around; this site has been helpful with some other issues I've run into, but I can't seem to find what I need for this.
It is important that the line retains its position in the file, and its contents in their current order.
Using Python 3.1, how would you turn this:
1020414646canBeFollowedBySpaces 3292532113435532419963
into
1020414646canBeFollowedBySpaces 329253211343553241996301846372998
As a general principle, there's no shortcut to "inserting" new data in the middle of a text file. You will need to make a copy of the entire original file in a new file, modifying your desired line(s) of text on the way.
For example:
with open("input.txt") as infile:
with open("output.txt", "w") as outfile:
for s in infile:
s = s.rstrip() # remove trailing newline
if "target" in s:
s += "0123456789"
print(s, file=outfile)
os.rename("input.txt", "input.txt.original")
os.rename("output.txt", "input.txt")
Check out the fileinput module; it can do a sort of "in-place" edit with files, though I believe temporary files are still involved in the internal process.
import fileinput

for line in fileinput.input('input.txt', inplace=1, backup='.orig'):
    if line.startswith('1020414646canBeFollowedBySpaces'):
        line = line.rstrip() + '01846372998' + '\n'
    print(line, end='')
The print now prints to the file instead of the console.
You might want to back up your original file before editing.
target_chain = b'1020414646canBeFollowedBySpaces 3292532113435532419963'
to_add = b'01846372998'

with open('zaza.txt', 'rb+') as f:
    ch = f.read()
    x = ch.find(target_chain)
    f.seek(x + len(target_chain), 0)
    f.write(to_add)
    f.write(ch[x + len(target_chain):])
In this method it's absolutely obligatory to open the file in binary mode 'b' because of the way Python treats line endings in text mode (universal newlines, enabled by default); note that this also means target_chain and to_add must be bytes.
The '+' in the mode 'rb+' allows writing as well as reading.
In this method, what is before the target_chain in the file remains untouched, and what is after it is shifted ahead. As Greg Hewgill said, there is no way to move bits apart on a hard disk to insert new bits in the middle.
Evidently, if the file is very big, reading all of its content into ch could consume too much memory, and the algorithm should then be changed: read line after line until the line containing the target_chain, read the next line before inserting, then keep doing "read the next line - re-write the current line" until the end of the file, in order to progressively shift the content after the addition; see the sketch below.
You see what I mean...
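A rough sketch of that streaming shift, assuming (as holds for the sample data) that every line is at least as long as the inserted text, so a write never overruns bytes that have not been read yet; target and to_add are bytes:
def insert_inplace(path, target, to_add):
    with open(path, 'rb+') as f:
        # find the line containing the target
        while True:
            write_pos = f.tell()
            line = f.readline()
            if not line:
                return  # target not found
            if target in line:
                break
        carry = line.rstrip(b'\r\n') + to_add + b'\n'
        read_pos = f.tell()
        while carry:
            f.seek(read_pos)
            nxt = f.readline()  # read the next line before overwriting it
            read_pos = f.tell()
            f.seek(write_pos)
            f.write(carry)  # re-write the current line, shifted
            write_pos = f.tell()
            carry = nxt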
Copy the file, line by line, to another file. When you get to the line that needs extra chars then add them before writing.

More pythonic way of skipping header lines

Is there a shorter (perhaps more pythonic) way of opening a text file and reading past the lines that start with a comment character?
In other words, a neater way of doing this
fin = open("data.txt")
line = fin.readline()
while line.startswith("#"):
line = fin.readline()
At this stage in my arc of learning Python, I find this most Pythonic:
def iscomment(s):
    return s.startswith('#')

from itertools import dropwhile
with open(filename, 'r') as f:
    for line in dropwhile(iscomment, f):
        # do something with line
        ...
to skip all of the lines at the top of the file starting with #. To skip all lines starting with #, wherever they appear:
from itertools import ifilterfalse  # filterfalse in Python 3
with open(filename, 'r') as f:
    for line in ifilterfalse(iscomment, f):
        # do something with line
        ...
That's almost all about readability for me; functionally there's almost no difference between:
for line in ifilterfalse(iscomment, f):
and
for line in (x for x in f if not x.startswith('#')):
Breaking out the test into its own function makes the intent of the code a little clearer; it also means that if your definition of a comment changes you have one place to change it.
for line in open('data.txt'):
    if line.startswith('#'):
        continue
    # work with line
of course, if your commented lines are only at the beginning of the file, you might use some optimisations.
from itertools import dropwhile
for line in dropwhile(lambda line: line.startswith('#'), open('data.txt')):
    pass
If you want to filter out all comment lines (not just those at the start of the file):
for line in open("data.txt"):
    if not line.startswith("#"):
        # process line
        ...
If you only want to skip those at the start then see ephemient's answer using itertools.dropwhile
You could use a generator function
def readlines(filename):
    fin = open(filename)
    for line in fin:
        if not line.startswith("#"):
            yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
Depending on exactly where the files come from, you may also want to strip() the lines before the startswith() check. I once had to debug a script like that months after it was written, because someone had put a couple of space characters before the '#'.
As a practical matter, if I knew I was dealing with reasonably sized text files (anything that will comfortably fit in memory) then I'd probably go with something like:
f = open("data.txt")
lines = [ x for x in f.readlines() if x[0] != "#" ]
... to snarf in the whole file and filter out all lines that begin with the octothorpe.
As others have pointed out, one might want to ignore leading whitespace occurring before the octothorpe, like so:
lines = [ x for x in f.readlines() if not x.lstrip().startswith("#") ]
I like this for its brevity.
This assumes that we want to strip out all of the comment lines.
We can also "chop" the last character (almost always a newline) off the end of each line using:
lines = [ x[:-1] for x in ... ]
... assuming that we're not worried about the infamously obscure issue of a missing final newline on the last line of the file. (The only time a line from the .readlines() or related file-like object methods might NOT end in a newline is at EOF).
In reasonably recent versions of Python one can "chomp" (only newlines) off the ends of the lines using a conditional expression like so:
lines = [ x[:-1] if x[-1]=='\n' else x for x in ... ]
... which is about as complicated as I'll go with a list comprehension for legibility's sake.
If we were worried about the possibility of an overly large file (or low memory constraints) impacting our performance or stability, and we're using a version of Python that's recent enough to support generator expressions (which are more recent additions to the language than the list comprehensions I've been using here), then we could use:
for line in (x[:-1] if x[-1] == '\n' else x for x in
             f if not x.lstrip().startswith('#')):
    # do stuff with each line
    ...
... is at the limits of what I'd expect anyone else to parse in one line a year after the code's been checked in.
If the intent is only to skip "header" lines then I think the best approach would be:
f = open('data.txt')
for line in f:
    if line.lstrip().startswith('#'):
        continue
... and be done with it.
You could make a generator that loops over the file that skips those lines:
fin = open("data.txt")
fileiter = (l for l in fin if not l.startswith('#'))
for line in fileiter:
...
You could do something like
def drop(n, seq):
    for i, x in enumerate(seq):
        if i >= n:
            yield x
And then say
for line in drop(1, open(filename)):
    # whatever
    ...
I like #iWerner's generator function idea. One small change to his code and it does what the question asked for.
def readlines(filename):
    f = open(filename)
    # discard first lines that start with '#'
    for line in f:
        if not line.lstrip().startswith("#"):
            break
    yield line
    for line in f:
        yield line
and use it like
for line in readlines("data.txt"):
    # do things
    pass
But here is a different approach, and it is quite simple. The idea is that we open the file and get a file object, which we can use as an iterator. Then we pull the lines we don't want out of the iterator, and just return the iterator. This would be ideal if we always knew how many lines to skip, but here we don't know in advance; we just need to pull lines and look at them. And there is no way to put a line back into the iterator once we have pulled it.
So: open the iterator, pull lines and count how many have the leading '#' character; then use the .seek() method to rewind the file, pull the correct number again, and return the iterator.
One thing I like about this: you get the actual file object back, with all its methods; you can just use this instead of open() and it will work in all cases. I renamed the function to open_my_text() to reflect this.
def open_my_text(filename):
    f = open(filename, "rt")
    # count number of leading lines that start with '#'
    count = 0
    for line in f:
        if not line.lstrip().startswith("#"):
            break
        count += 1
    # rewind file, and discard the lines counted above
    f.seek(0)
    for _ in range(count):
        f.readline()
    # return file object with comment lines pre-skipped
    return f
Instead of f.readline() I could have used f.next() (for Python 2.x) or next(f) (for Python 3.x) but I wanted to write it so it was portable to any Python.
EDIT: Okay, I know nobody cares and I'm not getting any upvotes for this, but I have re-written my answer one last time to make it more elegant.
You can't put a line back into an iterator. But, you can open a file twice, and get two iterators; given the way file caching works, the second iterator is almost free. If we imagine a file with a megabyte of '#' lines at the top, this version would greatly outperform the previous version that calls f.seek(0).
def open_my_text(filename):
    # open the same file twice to get two file objects
    # (We are opening the file read-only so this is safe.)
    ftemp = open(filename, "rt")
    f = open(filename, "rt")
    # use ftemp to look at lines, then discard them from f
    for line in ftemp:
        if not line.lstrip().startswith("#"):
            break
        f.readline()
    # return file object with comment lines pre-skipped
    return f
This version is much better than the previous version, and it still returns a full file object with all its methods.
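Usage is then the same as with a plain open() (process being a hypothetical placeholder for the per-line work):
f = open_my_text("data.txt")
for line in f:
    process(line)
f.close()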
