I have a long FASTA file and I need to reformat its lines. I have tried many things, but since I'm not very familiar with Python I couldn't quite solve it.
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I want them to look like:
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I've tried this:
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)
But the result does not show the ">" lines, and it also merges all the other lines together. I hope you can help me with this. Thank you.
A common arrangement is to remove the newline, and then add it back when you see the next record.
# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()
A common beginner bug is forgetting to add the final newline after falling off the end of the loop.
This simply prints one line at a time and doesn't return anything. A better design would probably be to collect and yield one FASTA record (header + sequence) at a time, probably as an object, and have the caller decide what to do with it; but then you probably want to use an existing library which does that. BioPython seems to be the go-to solution for bioinformatics.
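A minimal sketch of the record-at-a-time design mentioned above, assuming the input is any iterable of lines. The name read_fasta is hypothetical and is not part of BioPython or any other library.

```python
import io


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines.

    Hypothetical helper for illustration; not a BioPython API.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None:
            # Sequence line: collect it without its line break.
            chunks.append(line)
    # Emit the last record, which no following header terminates.
    if header is not None:
        yield header, "".join(chunks)


sample = io.StringIO(">seq1\nAAAA\nCCCC\n>seq2\nGGGG\nTT\n")
for header, sequence in read_fasta(sample):
    print(header)
    print(sequence)
```

Because the function yields records instead of printing them, the caller decides whether to print, filter, or write them elsewhere.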
Since you’re working with FASTA data, another solution would be to use a dedicated library, in which case what you want is a one-liner:
import sys
from Bio import SeqIO
SeqIO.write(SeqIO.parse('file.fasta', 'fasta'), sys.stdout, 'fasta-2line')
Using the 'fasta-2line' format description tells SeqIO.write to omit line breaks inside sequences.
First the usual disclaimer: operate on files using a with block when at all possible. Otherwise they won't be closed on error.
Observe that you want to remove newlines on every line not starting with >, except the last one of every block. You can achieve the same effect by stripping the newline from every line that doesn't start with >, and prepending a newline to each line starting with > except the first.
import sys
out = sys.stdout
with open(..., 'r') as file:
    first = True
    hasline = False
    for line in file:
        if line.startswith('>'):
            if not first:
                out.write('\n')
            out.write(line)
            first = False
        else:
            out.write(line.rstrip())
            hasline = True
    if hasline:
        out.write('\n')
Printing as you go is much simpler than accumulating the strings in this case. Printing to a file using the write method is simpler than using print when you're just transcribing lines.
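As a rough illustration of the write-as-you-go idea, here is a sketch that works on any pair of file-like objects, so it can be tried on in-memory text. The function name unwrap_fasta and the sample data are assumptions made for the example.

```python
import io


def unwrap_fasta(src, out):
    """Copy FASTA text from src to out, joining each sequence onto one line.

    Hypothetical helper for illustration only.
    """
    pending = False  # True while a sequence line is still missing its newline
    for line in src:
        if line.startswith(">"):
            if pending:
                out.write("\n")
                pending = False
            out.write(line.rstrip("\n") + "\n")
        else:
            text = line.rstrip("\n")
            if text:
                out.write(text)
                pending = True
    # Terminate the last sequence, which no header follows.
    if pending:
        out.write("\n")


source = io.StringIO(">seq1\nAAAA\nCCCC\n>seq2\nGG\n")
target = io.StringIO()
unwrap_fasta(source, target)
print(target.getvalue(), end="")
```

Passing an open file and sys.stdout instead of the StringIO objects should give the same behaviour on a real file.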
I have edited some mistakes in your code.
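a_file = open("file.fasta", "r")
string_without_line_breaks = ""
needed_lines = []
for line in a_file:
    if line.strip().startswith(">") or line.strip() == "":
        # If any sequence lines were accumulated before, commit them first.
        if string_without_line_breaks != "":
            needed_lines.append(string_without_line_breaks)
            string_without_line_breaks = ""
        # Strip the header so that join() does not leave a blank line after it.
        needed_lines.append(line.strip())
        continue
    else:
        stripped_line = line.strip()
        string_without_line_breaks += stripped_line
a_file.close()
# Commit the last sequence, which is not followed by another header.
if string_without_line_breaks != "":
    needed_lines.append(string_without_line_breaks)
print("\n".join(needed_lines))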
Please make sure to add the lines starting with the greater-than sign (>) to your string.
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        # Start each header on its own line; the header keeps its trailing newline.
        string_without_line_breaks += "\n" + line
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
# Drop the newline that was prepended to the very first header.
print(string_without_line_breaks.lstrip("\n"))
By the way, you can also do this with a single regular-expression substitution:
import re
with open("file.fasta", 'r') as f:
    data = f.read()
result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)
print(result)
The regex uses two negative lookaheads: the first prevents lines starting with > from being trimmed, and the second keeps the line break on any line that comes right before a >.
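To see what the two lookaheads do, the same substitution can be run on a small in-memory string. The sample data below is made up for the illustration.

```python
import re

# Hypothetical sample standing in for the contents of file.fasta.
data = ">seq1\nAAAA\nCCCC\nGG\n>seq2\nTTTT\nAA\n"

# ^(?!>)   only consider lines that do not start with ">"
# (.*)$\n  capture the line, then match its line break
# (?!>)    but only when the next line is not a header
result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)

print(result)
```

Note that the line break after the very last sequence line is removed as well, because nothing (and therefore no >) follows it; print() adds one back.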
In Python, calling e.g. temp = open(filename,'r').readlines() results in a list in which each element is a line from the file. However, these strings have a newline character at the end, which I don't want.
How can I get the data without the newlines?
You can read the whole file and split lines using str.splitlines:
temp = file.read().splitlines()
Or you can strip the newline by hand:
temp = [line[:-1] for line in file]
Note: this last solution only works if the file ends with a newline, otherwise the last line will lose a character.
This assumption is true in most cases (especially for files created by text editors, which often do add an ending newline anyway).
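If you want to avoid this, you can add a newline at the end of the file. In Python 3 the file has to be opened in binary mode for this, because text-mode files do not support seeking relative to the end:
with open(the_file, 'rb+') as f:
    f.seek(-1, 2)  # go to the last byte of the file
    if f.read(1) != b'\n':
        # add missing newline if not already present
        f.write(b'\n')
        f.flush()
    f.seek(0)
    # decode() assumes UTF-8 text
    lines = [line[:-1].decode() for line in f]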
Or a simpler alternative is to strip the newline instead:
[line.rstrip('\n') for line in file]
Or even, although pretty unreadable:
[line[:-(line[-1] == '\n') or len(line)+1] for line in file]
This exploits the fact that or does not return a boolean but one of its operands: the first one if it is truthy, otherwise the second.
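A small sketch comparing the variants above on text whose last line has no trailing newline. The sample text is an assumption, and io.StringIO stands in for a real file.

```python
import io

# Hypothetical file contents whose last line has no trailing newline.
text = "first\nsecond"

sliced = [line[:-1] for line in io.StringIO(text)]
stripped = [line.rstrip("\n") for line in io.StringIO(text)]
split = text.splitlines()
# -(True) is -1, so the newline is cut off; -(False) is 0, which is falsy,
# so `or` falls through to len(line) + 1 and the slice keeps everything.
tricky = [line[:-(line[-1] == "\n") or len(line) + 1] for line in io.StringIO(text)]

print(sliced)    # the last item has lost its final character
print(stripped)
print(split)
print(tricky)
```

Only the plain line[:-1] slice damages the last line; the other three agree with each other.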
The readlines method is actually equivalent to:
def readlines(self):
    lines = []
    for line in iter(self.readline, ''):
        lines.append(line)
    return lines
# or equivalently
def readlines(self):
    lines = []
    while True:
        line = self.readline()
        if not line:
            break
        lines.append(line)
    return lines
Since readline() keeps the newline, readlines() keeps it as well.
Note: for symmetry with readlines(), the writelines() method does not add trailing newlines, so f2.writelines(f.readlines()) produces an exact copy of f in f2.
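The symmetry between readlines() and writelines() can be checked with in-memory files. This is only a sketch; io.StringIO is used here in place of real files f and f2.

```python
import io

source = io.StringIO("alpha\nbeta\ngamma")
copy = io.StringIO()

# Each item keeps its "\n", except a final line that had none.
lines = source.readlines()
# writelines() writes the items back to back and adds nothing between them.
copy.writelines(lines)

print(lines)
print(copy.getvalue() == source.getvalue())
```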
temp = open(filename,'r').read().split('\n')
Read the file one row at a time, removing unwanted characters from the end of each string with str.rstrip(chars).
with open(filename, 'r') as fileobj:
    for row in fileobj:
        print(row.rstrip('\n'))
See also str.strip([chars]) and str.lstrip([chars]).
I think this is the best option.
temp = [line.strip() for line in file.readlines()]
temp = open(filename,'r').read().splitlines()
My preferred one-liner -- if you don't count from pathlib import Path :)
lines = Path(filename).read_text().splitlines()
This auto-closes the file, so there is no need for with open()...
Added in Python 3.5.
https://docs.python.org/3/library/pathlib.html#pathlib.Path.read_text
Try this:
u=open("url.txt","r")
url=u.read().replace('\n','')
print(url)
To get rid of trailing end-of-line (\n) characters and of empty list values (''), try:
f = open(path_sample, "r")
lines = [line.rstrip('\n') for line in f.readlines() if line.strip() != '']
You can read the file as a list easily using a list comprehension
with open("foo.txt", 'r') as f:
    lst = [row.rstrip('\n') for row in f]
my_file = open("first_file.txt", "r")
for line in my_file.readlines():
    if line[-1:] == "\n":
        print(line[:-1])
    else:
        print(line)
my_file.close()
This script takes the lines from file and writes each one to file2 without its newline and with ,0 appended.
file = open("temp.txt", "+r")
file2 = open("res.txt", "+w")
for line in file:
    file2.writelines(f"{line.splitlines()[0]},0\n")
file.close()
file2.close()
If you look at line, its value is data\n, so we call splitlines() to turn it into a list and [0] to pick its only element, data.
import csv
with open(filename) as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        print(line[0])
I want to copy strings from one file and place them into a new one. It is in fasta format so there are lines that contain headers with sequence identifiers. The succeeding lines contain the sequence until the next header. For downstream applications I need the sequence to be on one line. I am having difficulty concatenating the sequence strings into one line while still preserving the fasta format.
I wrote a script that should identify ">","A","G","C" or "T" as the start of a line and perform specific actions with it. If it starts with ">" it should paste the contents of the line. If it starts with any of the letters it should take that line and append it to the previous line. I have made it work so that it can perform everything correctly for the first sequence, however for any other sequences in the fasta file, lines starting with ">" are appended to the previous line.
import os
import glob
location = input("location: ")
os.chdir(location)
for file in glob.glob(os.path.join(location, '*.fasta')):
    outputfile = open("test.fasta", "w")
    sequencefile = open(file, "r")
    lines = sequencefile.readlines()
    for line in lines:
        if line.startswith(">"):
            outputfile.write("\n"+line+"\n")
        elif line.startswith("G"):
            line = line.replace('\n','')
            outputfile.write(line)
        elif line.startswith("A"):
            line = line.replace('\n','')
            outputfile.write(line)
        elif line.startswith("C"):
            line = line.replace('\n','')
            outputfile.write(line)
        elif line.startswith("T"):
            line = line.replace('\n','')
            outputfile.write(line)
    sequencefile.close()
    outputfile.close()
Expected:
ORIGINAL FILE:
>somebacterianame1
AGCT
GCAT
CGAT
AGAT
>somebacterianame2
AGCA
CGAT
AGAT
CGAT
>somebacterianame3
AGAT
GTTA
CCTA
AGAT
NEW FILE:
>somebacterianame1
AGCTGCATCGATAGAT
>somebacterianame2
AGCACGATAGATCGAT
>somebacterianame3
AGATGTTACCTAAGAT
ACTUAL OUTCOME:
NEW FILE:
>somebacterianame1
AGCTGCATCGATAGAT>somebacterianame2
AGCACGATAGATCGAT>somebacterianame3
AGATGTTACCTAAGAT
As I mentioned in the comments, the original code works as-is, with just an extra \n around. However, I think the code can be simplified a little by taking advantage of some Python constructs and shortcuts. Here is what I would do:
import os
import glob
location = input("location: ")
os.chdir(location)
for file in glob.glob(os.path.join(location, '*.fasta')):
    outputfile = open("test.fasta", "w")
    sequencefile = open(file, "r")
    lines = sequencefile.readlines()
    is_first_line = True
    for line in lines:
        # since you only check the first character, we can make a more condensed check
        if line[0] == ">":
            if is_first_line:
                outputfile.write(line)  # here keep the trailing newline and do not prepend a newline
                is_first_line = False  # flip the flag: this is no longer the first line
            else:
                outputfile.write("\n" + line)  # here keep the trailing newline and prepend a newline
        elif line[0] in "GACT":
            outputfile.write(line.strip())  # here strip the trailing newline
    sequencefile.close()
    outputfile.close()
#!/usr/bin/env python
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
lines = set()
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            print(line)
            lines.add(beginOfSequence)
This is the code I have right now, but it is not working. I have a file that has lines of DNA that sometimes start with the same sequence (or pattern of letters). I need to write code that finds all lines of DNA that start with the same letters (perhaps the same 10 characters) and deletes one of them.
Example (issue):
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT
What I need as result after one is taken out of file:
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
(no third line)
I think your set logic is correct. You are just missing the portion that will save the lines you want to write back into the file. I am guessing you tried this with a separate list that you forgot to add here since you are using append somewhere.
FILE_NAME = "sample_file.txt"
NR_MATCHING_CHARS = 5
lines = set()
output_lines = []  # keep track of lines you want to keep
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            output_lines.append(line + '\n')  # add line to list, newline needed since we will write to file
            lines.add(beginOfSequence)
print(output_lines)
with open(FILE_NAME, 'w') as f:
    f.writelines(output_lines)  # write it out to the file
Your approach has a few problems. First, I would avoid naming file variables inF as this can be confused with inf. Descriptive names are better: testFile for instance. Also testing for empty strings using equality misses a few important edge cases (what if line is None for instance?); use the not keyword instead. As for your actual problem, you're not actually doing anything based on that set membership:
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
prefixCache = set()
data = []
with open(FILE_NAME, "r") as testFile:
    for line in testFile:
        line = line.strip()
        if not line:
            continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if (beginOfSequence in prefixCache):
            continue
        else:
            print(line)
            data.append(line)
            prefixCache.add(beginOfSequence)
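The same set-based idea can be wrapped in a function and tried on the three example lines from the question. This is a sketch: the name dedupe_by_prefix is hypothetical, and the sample lines are the question's example with the *** emphasis markers removed, which is an assumption about the real data.

```python
def dedupe_by_prefix(lines, prefix_length=5):
    """Return the non-empty lines, keeping only the first line for each prefix.

    Hypothetical helper; prefix_length plays the role of NR_MATCHING_CHARS.
    """
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        prefix = line[:prefix_length]
        if prefix in seen:
            continue  # an earlier line already started with these characters
        seen.add(prefix)
        kept.append(line)
    return kept


sample = [
    "CCTGGATGGCTTATATAAGATGTTAT\n",
    "GTTATATAATATACCACCGGGCTGCTT\n",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT\n",
]
for line in dedupe_by_prefix(sample):
    print(line)
```

The third line is dropped because its first five characters, GTTAT, were already seen on the second line.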
I want to read a textfile using python and print out specific lines. The problem is that I want to print a line which starts with the word "nominal" (and I know how to do it) and the line following this which is not recognizable by some specific string. Could you point me to some lines of code that are able to do that?
In good faith and under the assumption that this will help you start coding and showing some effort, here you go:
file_to_read = r'myfile.txt'
with open(file_to_read, 'r') as f_in:
    flag = False
    for line in f_in:
        if line.startswith('nominal'):
            print(line)
            flag = True
        elif flag:
            print(line)
            flag = False
It might work out of the box, but please try to spend some time going through it so that you understand the logic behind it. Note that text comparison in Python is case-sensitive.
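The flag logic can also be written as a small generator and tried on a list of lines, without needing a file. This is a sketch; the name match_and_next and the sample lines are made up for the example.

```python
def match_and_next(lines, prefix="nominal"):
    """Yield each line starting with prefix, plus the line right after it.

    Hypothetical helper for illustration; works on any iterable of lines.
    """
    take_next = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            yield line
            take_next = True
        elif take_next:
            yield line
            take_next = False


sample = ["header\n", "nominal 1.0\n", "0.5 0.7\n", "other\n", "nominal 2.0\n", "0.1 0.2\n"]
for line in match_and_next(sample):
    print(line)
```

Stripping the newline before yielding avoids the blank lines that print(line) produces when the line still ends in \n.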
If the file isn't too large, you can put it all in a list:
def printLines(fname):
    with open(fname) as f:
        lines = f.read().split('\n')
    if len(lines) == 0: return None
    if lines[0].startswith('Nominal'): print(lines[0])
    for i, line in enumerate(lines[1:]):
        # enumerate starts at 0 over lines[1:], so lines[i] is the line just before `line`
        if lines[i].startswith('Nominal') or line.startswith('Nominal'):
            print(line)
Then e.g. printLines('test.txt') will do what you want.
I have a file that looks like this(have to put in code box so it resembles file):
text
(starts with parentheses)
tabbed info
text
(starts with parentheses)
tabbed info
...repeat
I want to grab only "text" lines from the file(or every fourth line) and copy them to another file. This is the code I have, but it copies everything to the new file:
import sys
def process_file(filename):
    output_file = open("data.txt", 'w')
    input_file = open(filename, "r")
    for line in input_file:
        line = line.strip()
        if not line.startswith("(") or line.startswith(""):
            output_file.write(line)
    output_file.close()
if __name__ == "__main__":
    process_file(sys.argv[1])
The reason why your script is copying every line is because line.startswith("") is True, no matter what line equals.
You might try using isspace to test whether line begins with whitespace:
def process_file(filename):
    with open("data.txt", 'w') as output_file:
        with open(filename, "r") as input_file:
            for line in input_file:
                line = line.rstrip()
                # Negate the whole `or`, so that both kinds of line are skipped.
                if not (line.startswith("(") or line[:1].isspace()):
                    output_file.write(line + "\n")
with open('data.txt','w') as of:
    of.write(''.join(textline
                     for textline in open(filename)
                     if textline[0] not in ' \t('))
To write every fourth line, slice the resulting list with [::4]:
with open('data.txt','w') as of:
    of.write(''.join([textline
                      for textline in open(filename)
                      if textline[0] not in ' \t('][::4]))
There is no need to rstrip the newlines, since write needs them.
In addition to line.startswith("") always being true, line.strip() will remove the leading tab, forcing the tabbed data to be written as well. Change it to line.rstrip() and use \t to test for a tab. That part of your code should look like:
line = line.rstrip()
if not line.startswith(('(', '\t')):
    #....
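A self-contained sketch of this filter, run on in-memory lines rather than a file. The name text_lines and the sample layout (a real tab before "tabbed info") are assumptions about the data.

```python
def text_lines(lines):
    """Return the lines that start with neither "(" nor whitespace.

    Hypothetical helper for illustration only.
    """
    kept = []
    for line in lines:
        # Do not strip leading whitespace first, or the tab test can never succeed.
        if line.strip() and not line.startswith(("(", "\t", " ")):
            kept.append(line)
    return kept


sample = [
    "text one\n", "(starts with parentheses)\n", "\ttabbed info\n",
    "text two\n", "(starts with parentheses)\n", "\ttabbed info\n",
]
print("".join(text_lines(sample)), end="")
```

str.startswith accepts a tuple of prefixes, which avoids chaining several conditions with and/or.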
In response to your question in the comments:
#edited in response to comments in post
for i, line in enumerate(input_file):
    if i % 4 == 0:
        output_file.write(line)
Try:
if not line.startswith("(") and not line.startswith("\t"):
without calling line.strip() first (that would strip the tabs).
So the issue is that (1) you are misusing boolean logic, and (2) every possible line starts with "".
First, the boolean logic:
The way the or operator works is that it returns True if either of its operands is True. The operands are "not line.startswith('(')" and "line.startswith('')". Note that the not applies to only one of the operands. If you want to apply it to the total result of the or expression, you have to put the whole thing in parentheses.
The second issue is your use of the startswith() method with a zero-length string as an argument. This essentially says "match any string where the first zero characters are nothing". It matches any string you could give it.
See other answers for what you should be doing here.
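Both points can be checked directly in the interpreter. The sample line below is made up for the illustration.

```python
line = "(starts with parentheses)"

# Every string starts with the empty string.
always = line.startswith("")

# `not` binds tighter than `or`, so only the first operand is negated.
unparenthesized = not line.startswith("(") or line.startswith("")

# Parentheses negate the result of the whole `or` expression instead.
parenthesized = not (line.startswith("(") or line.startswith("\t"))

print(always)           # True
print(unparenthesized)  # True, so the line would be written
print(parenthesized)    # False, so the line would be skipped
```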