Print lines between two patterns in python

I have a file with the following structure:
#scaffold456
ATGTCGTGTCAGTG
GTACGTGTGTGG
+
!!!!!#!!!!!!!!
!!!!!!!!!!!!
#scaffold342
ATGGTGTCGTGGTG
ACGTGGC
+
!>!>!!!!+!!!!!
!!!!!!!
I would want an output like this:
>scaffold456
ATGTCGTGTCAGTG
GTACGTGTGTGG
>scaffold342
ATGGTGTCGTGGTG
ACGTGGC
I want to achieve this in Python. I started with the following:
fastq_filename = "test_file"
fastq = open(fastq_filename) # fastq is the file object
for line in fastq:
    if line.startswith("#"):
        print line.replace("#", ">")
but I can't go on anymore as I don't know:
1. How to print lines after a certain pattern match?
2. How I should specify that I want to skip lines between + till the next # sign?
This is a more complex topic in Python which I don't know, any help and explanation would be great, thanks!

fastq_filename = "test_file"
fastq = open(fastq_filename) # fastq is the file object
canPrintLines = False # Boolean state variable to keep track of whether we want to be printing lines or not
for line in fastq:
    if line.startswith("#"):
        canPrintLines = True # We have found a # so we can start printing lines
        line = line.replace("#", ">")
    elif line.startswith("+"):
        canPrintLines = False # We have found a + so we don't want to print anymore
    if canPrintLines:
        print(line.rstrip("\n")) # print adds its own newline, so drop the one read from the file
fastq.close()

I don't know how complex your lines with the ! can get. I understand your question to mean that you want to ignore any + and # signs inside these lines.
In that case I would introduce a state variable that stores whether we are currently working on an interesting line:
interesting_line = True
for line in fastq:
    if line.strip() == '+': # Here we check for the + sign. You might need to adapt the test.
        interesting_line = False # We don't care from now on
    if line.startswith('#'):
        interesting_line = True
    if interesting_line:
        pass # Do what you want with your line.
As I said, you might need to check whether there are situations where my simple tests don't match, but this should give you a starting point.
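The state-variable idea from the two answers above can be wrapped in a small generator. This is only a sketch for Python 3: the names `fastq_to_fasta` and `SAMPLE` are made up for illustration, and it assumes the layout from the question, where a header starts with `#` and a line holding only `+` starts the quality block.

```python
import io

def fastq_to_fasta(lines):
    """Yield FASTA-style lines from the '#'/'+' layout shown in the question."""
    keep = False  # True from a '#' header until the next lone '+'
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            keep = True
            line = ">" + line[1:]  # swap only the leading '#'
        elif line.strip() == "+":
            keep = False
        if keep:
            yield line

# Sample data copied from the question.
SAMPLE = (
    "#scaffold456\nATGTCGTGTCAGTG\nGTACGTGTGTGG\n+\n"
    "!!!!!#!!!!!!!!\n!!!!!!!!!!!!\n"
    "#scaffold342\nATGGTGTCGTGGTG\nACGTGGC\n+\n"
    "!>!>!!!!+!!!!!\n!!!!!!!\n"
)

for out_line in fastq_to_fasta(io.StringIO(SAMPLE)):
    print(out_line)
```

A quality line that happens to begin with # would be mistaken for a header here, so real data may need a stricter test, for example comparing the quality length with the sequence length.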

This is an easy way to do it:
for line in fastq:
    if line and (line[0].isalpha() or line[0] == '#'):
        line = line.rstrip()
        print line.replace("#", ">")
Output:
>scaffold456
ATGTCGTGTCAGTG
GTACGTGTGTGG
>scaffold342
ATGGTGTCGTGGTG
ACGTGGC

for line in fastq:
    line = line.strip() # drop the trailing newline, otherwise isalpha() is always False
    if line.startswith("#") or line.isalpha():
        print(line.replace("#", ">"))
Find each line that starts with #, replace the # with > and print it.
Then print every line that contains only letters as well.

The code below will:
1. ignore lines that start with + or !
2. replace # with > if a line starts with #
3. write all other lines
Code:
def format_file(path):
    new_lines = ""
    for line in open(path):
        if line.startswith("#"):
            new_lines += line.replace("#", ">")
        elif line.startswith("+"):
            pass
        elif line.startswith("!"):
            pass
        else:
            new_lines += line
    print new_lines
format_file("test_file")

If I'm interpreting your question correctly, then I think this is what you are looking for:
import re
for line in fastq:
    line = line.replace('\n','')
    n = len(line)
    mat = re.match(r'([ATGC]){%d}' % n,line)
    if n and mat: # skip empty lines, which would otherwise match {0}
        print line
    if line.startswith('#'):
        print line.replace('#','>')
This uses regular expressions, which are incredibly useful. The pattern matches only if the line consists entirely of A, T, G, or C, and in that case the line is printed; the other if statement is the same as what you have. {%d} matches n occurrences of the preceding expression, [ATGC], where n is the length of the line. If the sequences can contain letters other than A, T, G, or C, just add them between the square brackets.
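As a variation on the regular-expression answer, the length bookkeeping can be avoided by asking for a match of the whole line. The following is a minimal Python 3 sketch; `extract_fasta` and `SEQUENCE` are illustrative names, and the nucleotide alphabet is assumed to be exactly A, C, G and T.

```python
import re

# One or more nucleotide letters; fullmatch() requires the whole line to match.
SEQUENCE = re.compile(r"[ACGT]+")

def extract_fasta(lines):
    """Return header lines (with '#' turned into '>') and pure sequence lines."""
    kept = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            kept.append(">" + line[1:])
        elif SEQUENCE.fullmatch(line):
            kept.append(line)
    return kept

records = extract_fasta([
    "#scaffold456\n", "ATGTCGTGTCAGTG\n", "+\n", "!!!!!#!!!!!!!!\n",
])
print(records)
```

Note that a quality line made up only of the letters A, C, G and T would also be kept, so this test relies on the quality strings looking like the ones in the question.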

Related

Script continues reading from file although file is finished

I am writing a script that reads from the rockyou.txt file. The problem is that once it has gone through all 1.5M lines, it keeps reading empty lines from the file, and I need it to stop.
I can't do a simple if statement to check if the line is empty because in the file there are multiple places where there is a single empty line.
Do you have any ideas on how to implement this?
Code:
while line != static:
    line = f.readline()
    line = line.strip()
    counter = counter + 1
    print("Trying " + line + " Number " + str(counter))
    if line == static:
        print("Success")
        flag = 1
        break
if flag == 0:
    print("Unsuccessful")

Your code attempts to read lines until a hit is found, but it doesn’t test whether the end of the file is reached.
Rewrite your code as follows to stop at the end of the file:
found = False
for line in f:
    if line.strip() == static:
        found = True
        break
This code omits the counter, but it could be added back in trivially:
for counter, line in enumerate(f, 1):
    line = line.strip()
    print(f'Trying {line} Number {counter}')
    if line == static:
        found = True
        break
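The loop above can also be packaged as a function that reports where the match was found. This is a sketch only: `find_line` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the real word list.

```python
import io

def find_line(lines, target):
    """Return the 1-based number of the first line equal to target, else None."""
    for counter, line in enumerate(lines, 1):
        if line.strip() == target:
            return counter
    return None  # the for loop ended because the file ended

# Blank lines inside the data do not stop the iteration.
wordlist = io.StringIO("123456\n\npassword\n\nletmein\n")
position = find_line(wordlist, "letmein")
print(position)
```

Because the for loop ends by itself when the file is exhausted, no separate end-of-file check is needed.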

If the file contains a blank line, readline() actually returns "\n" for it rather than the empty string "". The empty string is returned only at the end of the file, so it is safe to do this:
line = f.readline()
if not line:
    break
Since bool('\n') is True, no blank lines will be skipped.
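A short, self-contained check of that behaviour, written for Python 3 with an `io.StringIO` object standing in for the file (the helper name `read_all` is made up for this sketch):

```python
import io

def read_all(f):
    """Collect lines with readline() until it returns the empty string."""
    lines = []
    while True:
        line = f.readline()
        if not line:  # "" is returned only at end of file
            break
        lines.append(line)
    return lines

result = read_all(io.StringIO("first\n\nlast\n"))
print(result)
```

The blank line in the middle is read as "\n", which is truthy, so the loop only stops at the real end of the data.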

Instead of stopping at the first empty line, stop only after several consecutive empty lines. You can do this with another counter, like this:
emptyLineCounter = 0
while True:
    line = f.readline().strip()
    if line == '': # Because it has been stripped, there will be no extra whitespace
        emptyLineCounter += 1
        if emptyLineCounter == 2: # Or any number of lines you want it to be
            break
    else:
        emptyLineCounter = 0 # Reset to zero if there is text in the line

How can I reverse compliment a multiple sequence fasta file with python?

I am new to Python and I am trying to figure out how to read a FASTA file with multiple sequences and then create a new FASTA file containing the reverse complement of the sequences. The file will look something like:
>homo_sapiens
ACGTCAGTACGTACGTCATGACGTACGTACTGACTGACTGACTGACGTACTGACTGACTGACGTACGTACGTACGTACGTACGTACTG
>Canis_lupus
CAGTCATGCATGCATGCAGTCATGACGTCAGTCAGTACTGCATGCATGCATGCATGCATGACTGCAGTACTGACGTACTGACGTCATGCATGCAGTCATG
>Pan_troglodytus
CATGCATACTGCATGCATGCATCATGCATGCATGCATGCATGCATGCATCATGACTGCAGTCATGCAGTCAGTCATGCATGCATCAT
I am trying to learn how to use for and while loops so if the solution can incorporate one of them it would be preferred.
So far I have managed to do it in a very inelegant manner, as follows:
file1 = open('/path/to/file', 'r')
for line in file1:
    if line[0] == '>':
        print line.strip() #to capture the title line
    else:
        import re
        seq = line.strip()
        line = re.sub(r'T', r'P', seq)
        seq = line
        line = re.sub(r'A',r'T', seq)
        seq = line
        line = re.sub(r'G', r'R', seq)
        seq = line
        line = re.sub(r'C', r'G', seq)
        seq = line
        line = re.sub(r'P', r'A', seq)
        seq = line
        line = re.sub(r'R', r'C', seq)
        print line[::-1]
file1.close()
This worked but I know there is a better way to iterate through that end part. Any better solutions?

I know you consider this an exercise for yourself, but in case you are interested in using existing facilities, have a look at the Biopython package. Especially if you are going to do more sequence work.
That would allow you to instantiate a sequence with e.g. seq = Seq('GATTACA'). Then, seq.reverse_complement() will give you the reverse complement.
Note that the reverse complement is more than just string reversal, the nucleotide bases need to be replaced with their complementary letter as well.
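If installing Biopython is not an option, the same two steps (complement each base, then reverse) can be sketched with the standard library alone. This is a Python 3 stand-in, not Biopython's implementation; `COMPLEMENT` and `reverse_complement` are names chosen for this example, and only the four unambiguous bases are handled.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base in a single pass, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("GATTACA"))
```

Because `translate` maps all characters in one pass, no temporary placeholder letters are needed.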

Assuming I got you right, would the code below work for you? You could just add the exchanges you want to the dictionary.
d = {'A':'T','C':'G','T':'A','G':'C'}
with open("seqs.fasta", 'r') as in_file:
    for line in in_file:
        if line != '\n': # skip empty lines
            line = line.strip() # Remove new line character (I'm working on windows)
            if line.startswith('>'):
                head = line
            else:
                print head
                print ''.join(d[nuc] for nuc in line[::-1])
Output:
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGT
ACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATG
ACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGT
ATGCATG

Here is a simple example of a string reversal.
Python Code
string = raw_input("Enter a string:")
reverse_string = ""
print "our string is %s" % string
print "our range will be %s\n" % range(0,len(string))
for num in range(0,len(string)):
    offset = len(string) - 1
    reverse_string += string[offset - num]
    print "the num is currently: %d" % num
    print "the offset is currently: %d" % offset
    print "the index is currently: %d" % int(offset - num)
    print "the new string is currently: %s" % reverse_string
    print "-------------------------------"
print "\nOur reverse string is: %s" % reverse_string
I added print statements to show you what is happening in the script.
Run it in Python and see what happens.

Usually, to iterate over lines in a text file you use a for loop, because "open" returns a file object which is iterable
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
There is more about this here
You can also use the context manager "with" to open a file. This statement closes the file object for you, so you never forget to do it.
I decided not to include a "for line in f:" statement because you have to read several lines to process one sequence (title, sequence and blank line). If you try to mix a for loop over the file with "readline()" you will end up with a ValueError (try it :)
So I would use string.translate. This script opens a file named "test" with your example in it:
import string
if __name__ == "__main__":
    file_name = "test"
    # Map each base to its complement in one pass (Python 2; in Python 3 use str.maketrans)
    translator = string.maketrans("ACGT", "TGCA")
    with open(file_name, "r") as f:
        while True:
            title = f.readline().strip()
            if not title: # end of file
                break
            rev_seq = f.readline().strip().translate(translator)[::-1]
            f.readline() # blank line
            print(title)
            print(rev_seq)
Output (with your example):
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGTACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATGACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGTATGCATG
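FASTA files often wrap one sequence over several lines, which the fixed title/sequence/blank-line reading above does not cover. The sketch below, for Python 3, groups lines by record first. `read_fasta` and `reverse_complement_fasta` are hypothetical helpers, and blank lines are assumed to carry no meaning.

```python
import io

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def read_fasta(lines):
    """Yield (title, sequence) pairs; a sequence may span several lines."""
    title, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank separator lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line, []
        else:
            chunks.append(line)
    if title is not None:  # emit the last record
        yield title, "".join(chunks)

def reverse_complement_fasta(lines):
    """Return (title, reverse-complemented sequence) for every record."""
    return [(t, s.translate(COMPLEMENT)[::-1]) for t, s in read_fasta(lines)]

demo = io.StringIO(">a\nACGT\nAA\n\n>b\nGG\n")
for title, seq in reverse_complement_fasta(demo):
    print(title)
    print(seq)
```

Joining the chunks before reversing matters: reversing each wrapped line separately would give the wrong sequence order.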

Python: Copying lines that meet requirements

So, basically, I need a program that opens a .dat file, checks each line to see if it meets certain prerequisites, and if it does, copies it into a new CSV file.
The prerequisites are that the line must 1) contain "$W" or "$S" and 2) end with a value that is one of a long list of acceptable terms. (I can simply make up the terms and hardcode them into a list.)
For example, if the file were a list of purchase information and the last item was what was purchased, I would only want to include fruit. In my case, the last item is an ID tag, and I only want to accept a handful of them: there is a list of about 5 acceptable tags. The tags vary a lot in length, but they are always the last item on the line (and always the 4th item on the line).
Let me give a better example, again with the fruit.
My original .DAT might be:
DGH$G$H $2.53 London_Port Gyro
DGH.$WFFT$Q5632 $33.54 55n39 Barkdust
UYKJ$S.52UE $23.57 22#3 Apple
WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana
Only the line "UYKJ$S.52UE $23.57 22#3 Apple" would be copied, because only it has both 1) $W or $S (in this case a $S) and 2) a fruit as the last item. Once the .csv file is made, I will need to go back through it and replace all the spaces with commas, but that's not nearly as problematic for me as figuring out how to scan each line for the requirements and copy only the wanted ones.
I am making a few programs all very similar to this one: they open .dat files, check each line to see if it meets the requirements, and then decide whether to copy it to the new file or not. But sadly, I have no idea what I am doing. They are all similar enough that once I figure out how to make one, the rest will be easy, though.
EDIT: The .DAT files are a few thousand lines long, if that matters at all.
EDIT2: Some of my current code snippets
Right now, my current version is this:
def main():
    #NewFile_Loc = C:\Users\J18509\Documents
    OldFile_Loc=raw_input("Input File for MCLG:")
    OldFile = open(OldFile_Loc,"r")
    OldText = OldFile.read()
    # for i in range(0, len(OldText)):
    #     if (OldText[i] != " "):
    #         print OldText[i]
    i = split_line(OldText)
    if u'$S' in i:
        # $S is in the line
        print i
main()
But it's very choppy still. I'm just learning python.
Brief update: the server I am working on is down, and might be for the next few hours, but I have my new code, which has syntax errors in it, but here it is anyways. I'll update again once I get it working. Thanks a bunch everyone!
import os
NewFilePath = "A:\test.txt"
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,','Name Header,','Header 3,','Header 4)
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        if (LineParts[0].find($W)) or (LineParts[0].find($S)):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1], ' is accepted')
                #This Line is acceptable!
                NewFile.write(LineParts[1],',',LineParts[0],',',LineParts[2],',',LineParts[3])
    OldFile.close()
    NewFile.close()
main()

There are two parts you need to implement. First, read the file line by line and write out the lines that meet specific criteria. This is done by
with open('file.dat') as f:
    for line in f:
        stripped = line.strip() # remove '\n' from the end of the line
        if test_line(stripped):
            print stripped # Write to stdout
The criteria you want to check for are implemented in the function test_line. To check for the occurrence of "$W" or "$S", you can simply use the in operator like
if not '$W' in line and not '$S' in line:
    return False
else:
    return True
To check whether the last item in the line is contained in a fixed list, first split the line using split(), then take the last item using the index notation [-1] (negative indices count from the end of a sequence), and then use the in operator again against your fixed list. This looks like
items = line.split() # items is a list of strings
last_item = items[-1] # take the last element of the list
if last_item in ['Apple', 'Banana']:
    return True
else:
    return False
Now, you combine these two parts into the test_line function like
def test_line(line):
    if not '$W' in line and not '$S' in line:
        return False
    items = line.split() # items is a list of strings
    last_item = items[-1] # take the last element of the list
    if last_item in ['Apple', 'Banana']:
        return True
    else:
        return False
Note that the program writes the result to stdout, which you can easily redirect. If you want to write the output to a file, have a look at Correct way to write line to file in Python
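Both checks and the comma-separated output can be combined in one pass with the csv module. The following is a Python 3 sketch under stated assumptions: `line_is_wanted` plays the role of test_line above, `copy_matching` is a hypothetical helper, the tag list is the fruit stand-in from the question, and fields are assumed to be separated by whitespace.

```python
import csv
import io

ACCEPTED_TAGS = {"Apple", "Banana"}  # stand-in for the real list of ID tags
MARKERS = ("$W", "$S")

def line_is_wanted(line):
    """True if the line holds $W or $S and its last item is an accepted tag."""
    items = line.split()
    if not items:
        return False  # blank line
    return any(m in line for m in MARKERS) and items[-1] in ACCEPTED_TAGS

def copy_matching(src, dst):
    """Write every wanted line of src to dst as one CSV row; return the count."""
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for line in src:
        if line_is_wanted(line):
            writer.writerow(line.split())
            count += 1
    return count

dat = io.StringIO(
    "DGH$G$H $2.53 London_Port Gyro\n"
    "DGH.$WFFT$Q5632 $33.54 55n39 Barkdust\n"
    "UYKJ$S.52UE $23.57 22#3 Apple\n"
    "WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana\n"
)
out = io.StringIO()
copied = copy_matching(dat, out)
print(copied, out.getvalue(), end="")
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` calls, with the output file opened using `newline=""` as the csv documentation recommends.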

inlineRequirements = ['$W','$S']
endlineRequirements = ['Apple','Banana']
inputFile = open(input_filename,'rb')
outputFile = open(output_filename,'wb')
for line in inputFile.readlines():
    line = line.strip()
    #trailing and leading whitespace has been removed
    if any(req in line for req in inlineRequirements):
        #passed inline requirement
        lastWord = line.split(' ')[-1]
        if lastWord in endlineRequirements:
            #passed endline requirement
            outputFile.write(line.replace(' ',',') + '\n')
            #replaced spaces with commas and wrote to file, one record per line
inputFile.close()
outputFile.close()

tags = ['apple', 'banana']
match = ['$W', '$S']
OldFile_Loc=raw_input("Input File for MCLG:")
OldFile = open(OldFile_Loc,"r")
for line in OldFile.readlines(): # Loop through the file
    line = line.strip() # Remove the newline and whitespace
    if line and not line.isspace(): # If the line isn't empty
        lparts = line.split() # Split the line
        if any(tag.lower() == lparts[-1].lower() for tag in tags) and any(c in line for c in match):
            # $S or $W is in the line AND the last section is in tags(case insensitive)
            print line

import re
list_of_fruits = ["Apple","Banana",...]
with open('some.dat') as f:
    for line in f:
        if re.search(r"\$[SW]",line) and line.split()[-1] in list_of_fruits:
            print "Found:%s" % line

import os
NewFilePath = r"A:\test.txt" # raw string so that \t is not read as a tab
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open(NewFilePath, 'w')
    NewFile.write('Header 1,Name Header,Header 3,Header 4\n')
    OldFile_Loc = raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc, "r")
    for line in OldFile:
        LineParts = line.split()
        if len(LineParts) < 4:
            continue # skip blank or incomplete lines
        if ('$W' in LineParts[0]) or ('$S' in LineParts[0]):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1] + ' is accepted')
                #This Line is acceptable!
                NewFile.write(','.join([LineParts[1], LineParts[0], LineParts[2], LineParts[3]]) + '\n')
    OldFile.close()
    NewFile.close()
main()
This worked great and has all the capabilities I needed. The other answers are good, but none of them does 100% of what I needed the way this one does.

Counting Lines and numbering them

Another question.
This program counts and numbers every line in the code unless the line contains a hash sign or is empty. I got it to number every line besides the ones with hash signs. How can I stop it from counting empty lines?
def main():
    file_Name = input('Enter file you would like to open: ')
    infile = open(file_Name, 'r')
    contents = infile.readlines()
    line_Number = 0
    for line in contents:
        if '#' in line:
            print(line)
            if line == '' or line == '\n':
                print(line)
        else:
            line_Number += 1
            print(line_Number, line)
    infile.close()
main()

You check if line == '' or line == '\n' inside the if clause for '#' in line, where it has no chance of being True.
Basically, you need the if line == '' or line == '\n': line shifted to the left :)
Also, you can combine the two cases, since you perform the same action in both:
if '#' in line or not line or line == '\n':
    print(line)
But actually, why would you need to print empty strings or '\n'?
Edit:
If other cases such as line == '\t' should be treated the same way, it's best to use Tim's advice and do: if '#' in line or not line.strip().

You can skip empty lines by adding the following to the beginning of your for loop:
if not line.strip():
    continue
In Python, the empty string evaluates to the boolean value False. Lines read from a file keep their trailing '\n', so strip() is applied first; the if condition is then True only for lines that are empty or contain nothing but whitespace, and those lines are skipped.
The continue statement makes the loop move on to its next pass. The code after it is not executed, which means your code that counts the lines is skipped.
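Putting the pieces of this thread together, the numbering logic might look like the sketch below (Python 3). `number_lines` is a made-up name, and the `'#' in line` test is kept from the question, so any line containing a hash sign is left unnumbered.

```python
def number_lines(lines):
    """Number code lines; blank lines and lines with '#' stay unnumbered."""
    numbered = []
    line_number = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or "#" in line:
            numbered.append(line)  # keep the line, but do not count it
        else:
            line_number += 1
            numbered.append(f"{line_number} {line}")
    return numbered

listing = number_lines(["# comment\n", "\n", "x = 1\n", "   \n", "y = 2\n"])
for entry in listing:
    print(entry)
```

Returning a list instead of printing inside the loop keeps the counting logic easy to check on its own.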

str.startswith() not working as I intended

I can't see why this won't work. I am performing lstrip() on the string being passed to the function, and trying to see if it starts with """. For some reason, it gets caught in an infinite loop
def find_comment(infile, line):
    line_t = line.lstrip()
    if not line_t.startswith('"""') and not line_t.startswith('#'):
        print (line, end = '')
        return line
    elif line.lstrip().startswith('"""'):
        while True:
            if line.rstrip().endswith('"""'):
                line = infile.readline()
                find_comment(infile, line)
            else:
                line = infile.readline()
    else:
        line = infile.readline()
        find_comment(infile, line)
And my output:
Enter the file name: test.txt
import re
def count_loc(infile):
Here is the top of the file I am reading in, for reference:
import re
def count_loc(infile):
    """ Receives a file and then returns the amount
        of actual lines of code by not counting commented
        or blank lines """
    loc = 0
    func_records = {}
    for line in infile:
        (...)

You haven't provided an exit path from the recursive loop. A return statement should do the trick.
(...)
while True:
    if line.rstrip().endswith('"""'):
        line = infile.readline()
        return find_comment(infile, line)
    else:
        line = infile.readline()

while True is an infinite loop. You need to break once you're done.

not line_t.startswith('"""') or not line_t.startswith('#')
This expression evaluates to True no matter what string line_t denotes. Do you want 'and' instead of 'or'? Your question isn't clear to me.

if not line_t.startswith('"""') or not line_t.startswith('#'):
This if will always be satisfied -- either the line doesn't start with """, or it doesn't start with # (or both). You probably meant to use and where you used or.
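The difference between the two conditions is easy to verify on a few sample strings. In this Python 3 sketch, `is_code_or` and `is_code_and` are illustrative names for the two variants of the test.

```python
def is_code_or(line_t):
    # Always True: no string starts with both '"""' and '#'.
    return not line_t.startswith('"""') or not line_t.startswith('#')

def is_code_and(line_t):
    # True only when the line starts with neither marker.
    return not line_t.startswith('"""') and not line_t.startswith('#')

samples = ['"""docstring', '# comment', 'x = 1', '']
or_results = [is_code_or(s) for s in samples]
and_results = [is_code_and(s) for s in samples]
print(or_results)
print(and_results)
```

By De Morgan's law, the `and` version is the same as `not (starts with '"""' or starts with '#')`, which is usually what is meant.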

As long as lines start or end with a comment, the code below should work.
However, keep in mind that the docstrings can start or end in the middle of a line of code.
Also, you'll need to code for triple single-quotes as well as docstrings assigned to variables which aren't really comments.
Does this get you closer to an answer?
def count_loc(infile):
    skipping_comments = False
    loc = 0
    for line in infile:
        # Skip one-liners
        if line.strip().startswith("#"): continue
        # Toggle multi-line comment finder: on and off
        if line.strip().startswith('"""'):
            skipping_comments = not skipping_comments
        if line.strip().endswith('"""'):
            skipping_comments = not skipping_comments
            continue
        if skipping_comments: continue
        print line,
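A variant that returns a count instead of printing is sketched below for Python 3. It is not the answerer's code: it reuses the name `count_loc` from the question, takes any iterable of lines, and shares the limitations listed above (only triple double-quotes, and docstrings must start at the beginning of a line).

```python
def count_loc(lines):
    """Count lines that are not blank, not '#' comments and not in a docstring."""
    loc = 0
    in_docstring = False
    for raw in lines:
        line = raw.strip()
        if in_docstring:
            if line.endswith('"""'):
                in_docstring = False  # closing quotes found
            continue
        if not line or line.startswith("#"):
            continue
        if line.startswith('"""'):
            # A one-line docstring opens and closes on the same line.
            closes_here = len(line) >= 6 and line.endswith('"""')
            in_docstring = not closes_here
            continue
        loc += 1
    return loc

source = [
    "import re\n",
    "def count_loc(infile):\n",
    '    """ Receives a file and then returns the amount\n',
    "    of actual lines of code by not counting commented\n",
    '    or blank lines """\n',
    "    loc = 0\n",
    "    # a comment\n",
    "\n",
    "    func_records = {}\n",
]
print(count_loc(source))
```

Keeping a single flag for "inside a docstring" avoids the recursion from the question, so there is no loop that can fail to terminate.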
