Nested loop fails in Python

Sorry for the daft newbie question, but my nested loops won't work: they return only the first iteration. What have I missed?
I'm trying to grep for multiple strings in my main file. I think I messed up the indentation, but all the variations I try return errors.
f = open('GRCh37_genes_all_mod.txt', 'rU')  # main search file
f1 = open('genes_regions_out.txt', 'a')  # out file
f2 = open('gene_list.txt', 'r')  # search list
for gene in f2:
    for line in f:
        if gene in line:
            print line
            f1.write(line)

You can only iterate through a file once. After the first time through f, the next time you try and run for line in f, you won't get any content.
If you want to iterate through a file's content multiple times, you can put that content into a list.
with open('GRCh37_genes_all_mod.txt', 'rU') as f:
    contents = list(f)
with open('gene_list.txt', 'r') as f:
    genes = list(f)
for gene in genes:
    for line in contents:
        ...

After the first iteration, the file pointer is at the end of the file and the iterator is exhausted (calls to next(f) will raise StopIteration).
The simplest solution for this case is to reset the file pointer using f.seek(0):
for gene in f2:
    f.seek(0)
    for line in f:
        # ...
For other iterables (that might not be 'resettable'), if you know how many 'copies' you need, you can use itertools.tee(), or, if you know the iterable is finite (some iterables are infinite) and all its content will fit in memory, you can just make a list of it as explained by Khelwood.
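For illustration, here is a minimal sketch of the tee() approach applied to the original file names (the structure is an assumption; for this particular case, f.seek(0) or a list is simpler and cheaper):
from itertools import tee

with open('gene_list.txt') as f2:
    genes = [line.strip() for line in f2]

with open('GRCh37_genes_all_mod.txt') as f:
    # One independent iterator per gene. Note that tee() buffers whatever
    # the slowest copy has not consumed yet, so after the first full pass
    # it effectively holds the whole file in memory anyway.
    for gene, lines in zip(genes, tee(f, len(genes))):
        for line in lines:
            if gene in line:
                print(line)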

Related

Comparing contents of two txt.files for deleted lines or changes in python

I'm trying to compare two .txt files for changes or deleted lines. If a line is deleted, I want to output what the deleted line was, and if it was changed, I want to output the new line. I originally tried comparing line to line, but when something was deleted it wouldn't work for my purpose:
for line1 in f1:
    for line2 in f2:
        if line1 == line2:
            print("SAME", file=x)
        else:
            print(f"Original:{line1} / New:{line2}", file=x)
Then I tried not comparing line to line so I could figure out if something was deleted but I'm not getting any output:
def check_diff(f1, f2):
    check = {}
    for file in [f1, f2]:
        with open(file, 'r') as f:
            check[file] = []
            for line in f:
                check[file].append(line)
    diff = set(check[f1]) - set(check[f2])
    for line in diff:
        print(line.rstrip(), file=x)  # x is an open output file
I tried combining a lot of other questions previously asked similar to my problem to get this far, but I'm new to python so I need a little extra help. Thanks! Please let me know if I need to add any additional information.
The concept is simple. Let's say file1.txt is the original file, and file2.txt is the one we want to check for changes:
with open('file1.txt', 'r') as f:
    f1 = f.readlines()
with open('file2.txt', 'r') as f:
    f2 = f.readlines()
deleted = []
added = []
for l in f1:
    if l not in f2:
        deleted.append(l)
for l in f2:
    if l not in f1:
        added.append(l)
print('Deleted lines:')
print(''.join(deleted))
print('Added lines:')
print(''.join(added))
For every line in the original file, if that line isn't in the other file, that means the line has been deleted. If it's the other way around, that means the line has been added.
I am not sure how you would quantify a changed line (since you could count it as one deleted plus one added line), but perhaps something like the below would be of some aid. Note that if your files are large, it might be faster to store the data in a set instead of a list, since the former typically has a membership-test time complexity of O(1), while the latter has O(n):
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    file1 = set(f1.read().splitlines())
    file2 = set(f2.read().splitlines())
changed_lines = [line for line in file1 if line not in file2]  # changed or deleted in the new file
added_lines = [line for line in file2 if line not in file1]    # added or changed in the new file
print('Changed/deleted lines:\n' + '\n'.join(changed_lines))
print('Added/changed lines:\n' + '\n'.join(added_lines))
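If you need a real line-by-line diff, including changed lines, the standard library's difflib module can do the bookkeeping for you. A minimal sketch, assuming the same two file names:
import difflib

with open('file1.txt') as f1, open('file2.txt') as f2:
    old = f1.readlines()
    new = f2.readlines()

# unified_diff marks deleted lines with '-' and added lines with '+';
# a changed line shows up as a '-' line followed by a '+' line.
for line in difflib.unified_diff(old, new, fromfile='file1.txt', tofile='file2.txt'):
    print(line, end='')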

How to read lots of lines from a file at once

I want to generate a bunch of files based on a template. The template has thousands of lines. For each of the new files, only the top 5 lines are different. What is the best way of reading all the lines but the first 5 at once, instead of reading the whole file in line by line?
One approach would be to create a list of the first 5 lines, and read the rest into a big buffer:
with open("input.txt") as f:
first_lines = [f.readline() for _ in range(5)]
rest_of_lines = f.read()
or, more symmetrically for the first part, create one small buffer with the 5 lines:
first_lines = "".join([f.readline() for _ in range(5)])
As an alternative, from a purely I/O point of view, the quickest would be
with open("input.txt") as f:
lines = f.read()
and use a line-split generator to read the first 5 lines (splitlines() would be disastrous in terms of memory copying; find an implementation here).
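In case the link goes stale, one possible implementation of such a generator (a sketch, not necessarily the linked one) could look like this:
def iter_lines(text):
    # Lazily yield the lines of a big string, newline included,
    # without allocating the full list that splitlines() would build.
    start = 0
    while start < len(text):
        end = text.find('\n', start)
        if end == -1:
            yield text[start:]
            return
        yield text[start:end + 1]
        start = end + 1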
File objects in Python are conveniently their own iterator objects, so when you write for line in f: ... you get the file line by line. The file object has what's generally referred to as a cursor that keeps track of where you're reading from. When you use the generic for loop, this cursor advances to the next newline each time and returns what it has read. If you interrupt this loop before the end of the file, you can pick back up where you left off with another loop, or just call f.read() to read the rest of the file:
with open(inputfile, 'r') as f:
    lineN = 0
    header = ""
    for line in f:
        header = header + line
        lineN += 1
        if lineN >= 5:  # stop after reading the first 5 lines
            break
    body = f.read()  # read the rest of the file
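Another option worth mentioning is itertools.islice, which expresses "the first 5 lines, then the rest" without a manual counter; a short sketch using the same inputfile:
from itertools import islice

with open(inputfile, 'r') as f:
    header = "".join(islice(f, 5))  # consumes exactly the first 5 lines
    body = f.read()                 # the cursor now sits at line 6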

Python: Issue with Writing over Lines?

So, this is the code I'm using in Python to remove lines, hence the name "cleanse." I have a list of a few thousand words and their parts-of-speech:
NN by
PP at
PP at
... This is the issue. For whatever reason (one I can't figure out, and have been trying to for a few hours), the program I'm using to go through the word inputs isn't clearing duplicates, so the next best thing I can do is handle it myself: y'know, cycle through the file and delete the duplicates on each run. However, whenever I do, this code instead takes the last line of the list and duplicates it hundreds of thousands of times.
Thoughts, please? :(
EDIT: The idea is that cleanseArchive() goes through a file called words.txt, takes any duplicate lines and deletes them. Since Python isn't able to delete individual lines from a file, though, and I haven't had luck with any other methods, I've turned to essentially saving the non-duplicate data in a list (saveList) and then writing each object from that list into a new file (deleting the old one). However, at the moment, as I said, it just repeats the final object of the original list thousands upon thousands of times.
EDIT2: This is what I have so far, taking suggestions from the replies:
def cleanseArchive():
    f = open("words.txt", "r+")
    given_line = f.readlines()
    f.seek(0)
    saveList = set(given_line)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    f.write(saveList)
but ATM it's giving me this error:
Traceback (most recent call last):
  File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 154, in <module>
    initialize()
  File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 100, in initialize
    cleanseArchive()
  File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 29, in cleanseArchive
    f.write(saveList)
TypeError: must be str, not set
for i in saveList:
    f.write(n+"\n")
You basically write the value of n over and over.
Try this:
for i in saveList:
    f.write(i+"\n")
If you just want to delete "duplicated lines", I've modified your reading code:
saveList = []
duplicates = []  # keeps track of the lines seen so far
with open("words.txt", "r") as ins:
    for line in ins:
        if line not in duplicates:
            duplicates.append(line)
            saveList.append(line)
Additionally, apply the correction above!
def cleanseArchive():
    f = open("words.txt", "r+")
    f.seek(0)
    given_line = f.readlines()
    saveList = set()
    for x, y in enumerate(given_line):
        t = (y)
        saveList.add(t)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    for i in saveList:
        f.write(i)
Finished product! I ended up digging into enumerate and essentially just using that to get the strings. Man, Python has some bumpy roads when you get into sets/lists, holy shit. So much stuff not working for very ambiguous reasons! Whatever the case, fixed it up.
Let's clean up this code you gave us in your update:
def cleanseArchive():
    f = open("words.txt", "r+")
    given_line = f.readlines()
    f.seek(0)
    saveList = set(given_line)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    f.write(saveList)
We have bad names that don't respect the Style Guide for Python Code, we have superfluous code parts, we don't use the full power of Python and part of it is not working.
Let us start with dropping unneeded code while at the same time using meaningful names.
def cleanse_archive():
    infile = open("words.txt", "r")
    given_lines = infile.readlines()
    words = set(given_lines)
    infile.close()
    outfile = open("words.txt", "w")
    outfile.write(words)
The seek was not needed, the mode for opening a file to read is now just r, the mode for writing is now w, and we dropped the removal of the file because it will be overwritten anyway. Having a look at this now clearer code, we see that we forgot to close the file after writing. If we open the file with the with statement, Python will take care of that for us.
def cleanse_archive():
    with open("words.txt", "r") as infile:
        words = set(infile.readlines())
    with open("words.txt", "w") as outfile:
        outfile.write(words)
Now that we have clear code we'll deal with the error message that occurs when outfile.write is called: TypeError: must be str, not set. This message is clear: You can't write a set directly to the file. Obviously you'll have to loop over the content of the set.
def cleanse_archive():
    with open("words.txt", "r") as infile:
        words = set(infile.readlines())
    with open("words.txt", "w") as outfile:
        for word in words:
            outfile.write(word)
That's it.
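One caveat worth adding: a set has no defined order, so the deduplicated file may come out shuffled. If the original line order matters, a small sketch with a 'seen' set preserves it (the function name is made up):
def cleanse_archive_keep_order():
    with open("words.txt", "r") as infile:
        lines = infile.readlines()
    seen = set()
    with open("words.txt", "w") as outfile:
        for line in lines:
            if line not in seen:  # only the first occurrence is written
                seen.add(line)
                outfile.write(line)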

"Move" some parts of the file to another file

Let's say I have a file with 48,222 lines. I then give an index value, say 21,000.
Is there any way in Python to "move" the contents of the file starting from index 21,000, so that I end up with two files: the original one with 21,000 lines and a new one with the remaining 27,222 lines?
I read this post, which uses partition and comes close to describing what I want:
with open("inputfile") as f:
contents1, sentinel, contents2 = f.read().partition("Sentinel text\n")
with open("outputfile1", "w") as f:
f.write(contents1)
with open("outputfile2", "w") as f:
f.write(contents2)
Except that (1) it uses "Sentinel text" as the separator, and (2) it creates two new files and requires me to delete the old one. As of now, the way I do it is like this:
for r in result.keys():  # the filenames are in my dictionary, don't bother with that
    f = open(r)
    lines = f.readlines()
    f.close()
    with open("outputfile1.txt", "w") as fn:
        for line in lines[0:21000]:
            fn.write(line)  # write each line
    with open("outputfile2.txt", "w") as fn:
        for line in lines[21000:]:
            fn.write(line)  # write each line
Which is quite a lot of manual work. Is there a built-in or more efficient way?
You can also use writelines() and dump the sliced list of lines from 0 to 20999 into one file and another sliced list from 21000 to the end into another file.
with open("inputfile") as f:
content = f.readlines()
content1 = content[:21000]
content2 = content[21000:]
with open("outputfile1.txt", "w") as fn1:
fn1.writelines(content1)
with open('outputfile2.txt','w') as fn2:
fn2.writelines(content2)
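If memory is a concern, you can avoid holding all 48,222 lines at once by streaming with itertools.islice; a sketch using the same file names:
from itertools import islice

with open("inputfile") as f:
    with open("outputfile1.txt", "w") as fn1:
        fn1.writelines(islice(f, 21000))  # writes the first 21,000 lines
    with open("outputfile2.txt", "w") as fn2:
        fn2.writelines(f)  # writes whatever is left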

How to only read lines in a text file after a certain string?

I'd like to read into a dictionary all of the lines in a text file that come after a particular string. I'd like to do this over thousands of text files.
I'm able to identify and print out the particular string ('Abstract') using the following code (gotten from this answer):
for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                print line
But how do I tell Python to start reading the lines that only come after the string?
Just start another loop when you reach the line you want to start from:
for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                for line in f:  # now you are at the lines you want
                    pass  # do work
A file object is its own iterator, so when we reach the line with 'Abstract' in it we continue our iteration from that line until we have consumed the iterator.
A simple example:
gen = (n for n in range(8))
for x in gen:
    if x == 3:
        print('Starting second loop')
        for x in gen:
            print('In second loop', x)
    else:
        print('In first loop', x)
Produces:
In first loop 0
In first loop 1
In first loop 2
Starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7
You can also use itertools.dropwhile to consume the lines up to the point you want:
from itertools import dropwhile
for files in filepath:
    with open(files, 'r') as f:
        dropped = dropwhile(lambda _line: 'Abstract' not in _line, f)
        next(dropped, '')  # skip the line that contains 'Abstract' itself
        for line in dropped:
            print(line)
Use a boolean to ignore lines up to that point:
for files in filepath:
    found_abstract = False  # reset the flag for each file
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                found_abstract = True
            if found_abstract:
                pass  # do whatever you want
You can use itertools.dropwhile and itertools.islice here, a pseudo-example:
from itertools import dropwhile, islice
for fname in filepaths:
    with open(fname) as fin:
        start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
        for line in islice(start_at, 1, None):  # skip the line that still has 'Abstract' in it
            print line
To me, the following code is easier to understand.
with open(file_name, 'r') as f:
    while 'Abstract' not in next(f):
        pass
    for line in f:
        pass  # line will now be the next line after the one that contains 'Abstract'
Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can just set a boolean flag to indicate whether or not lines should be ignored, and check it at each line.
pay_attention = False
for line in f:
    if pay_attention:
        print line
    else:  # We haven't found our trigger yet; see if it's in this line
        if 'Abstract' in line:
            pay_attention = True
If you don't mind a little more rearranging of your code, you can also use two partial loops instead: one loop that terminates once you've found your trigger phrase ('Abstract'), and one that reads all following lines. This approach is a little cleaner (and a very tiny bit faster).
for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
    if 'Abstract' in skippable_line:
        break
for line in f:  # The file's iterator starts up again right where we left it.
    print line
The reason this works is that the file object returned by open behaves like a generator, rather than, say, a list: it only produces values as they are requested. So when the first loop stops, the file is left with its internal position set at the beginning of the first "unread" line. This means that when you enter the second loop, the first line you see is the first line after the one that triggered the break.
Making a guess as to how the dictionary is involved, I'd write it this way:
lines = dict()
for filename in filepath:
    with open(filename, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                break
        lines[filename] = tuple(f)
So for each file, your dictionary contains a tuple of lines.
This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f.
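A small usage sketch, assuming the dictionary above has been filled (abstract_text is a made-up name):
for filename, abstract_lines in lines.items():
    # each value is a tuple of the lines after the 'Abstract' marker
    abstract_text = ''.join(abstract_lines)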
