I have a FASTA file that contains two gene sequences, and what I want to do is remove the FASTA headers (lines starting with ">"), concatenate the rest of the lines, and output that sequence at 50 characters per line. I made some progress but got stuck at the end.
Here is my FASTA file:
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
And the output that I want is something like this:
>conc
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGA
TCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGG
CATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCC
TTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATA
AGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAAACCATATGGC
ATTTTGCATCCATTTGTGCATTTCATTTAGTTTACTTGCATTCATTCAGG
My script so far is:
final = list()
with open("test.fa", 'r') as fh_in:
    for line in fh_in:
        line = line.strip()
        if not line.startswith(">"):
            final.append(line)

final2 = "".join(final)

with open("testconcat.fa", 'w') as fh_out:
    fh_out.write(">con")
    fh_out.write("\n")
    fh_out.write(final2)
How can I make sure that I only write 50 characters to each line?
You can use the built-in textwrap library:
import textwrap

final2 = "".join(final)
print('\n'.join(textwrap.wrap(final2, 50)))
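Putting it together with your script, one possible end-to-end version could look like this (just a sketch; the ">conc" header and the file names are taken from your question):

import textwrap

final = []
with open("test.fa", "r") as fh_in:
    for line in fh_in:
        line = line.strip()
        if not line.startswith(">"):  # skip FASTA header lines
            final.append(line)

final2 = "".join(final)

with open("testconcat.fa", "w") as fh_out:
    fh_out.write(">conc\n")
    fh_out.write("\n".join(textwrap.wrap(final2, 50)))
    fh_out.write("\n")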
When dealing with large files, doing the joining and slicing in memory can cause problems: you end up consuming considerably more memory by appending every line and then splitting the result again into equally sized chunks before actually writing to the file.
I think the best way to avoid such issues is to operate on the file rather than in memory; in other words, write as you read.
>>> with open('test.fa', 'r') as r, open('testconcat.fa', 'w') as w:
...     for line in r:
...         if not line.startswith(">"):
...             w.write(line.strip())
>>> with open('testconcat.fa', 'r+') as w:
...     chunk = 50
...     i = 0
...     while next(w, None):
...         w.seek(((i + 1) * chunk) + i)
...         w.write('\n')
...         i = i + 1
$ cat testconcat.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGA
CAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGC
TGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTT
GTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGT
AAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAG
ACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAAC
ATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGA
TTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
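If you would rather avoid the second pass over the output file, a single-pass variant is also possible: buffer the sequence as you read and write a line every time 50 characters are available. This is only a sketch along the same "write as you read" idea; the ">conc" header and file names are taken from the question.

chunk = 50
buffer = ""
with open("test.fa", "r") as r, open("testconcat.fa", "w") as w:
    w.write(">conc\n")
    for line in r:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        buffer += line.strip()
        while len(buffer) >= chunk:  # flush complete 50-character lines
            w.write(buffer[:chunk] + "\n")
            buffer = buffer[chunk:]
    if buffer:  # write whatever is left at the end
        w.write(buffer + "\n")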
Hope this helps.
Related
The text file is "ics2o.txt" and I don't know how to print numbers next to the lines
import random

print ("----------------------------------------------------------")
print ("Student Name Student Mark")
print ("----------------------------------------------------------")

f = open("ics2o.txt")
for line in f:
    x = len(f.readlines())
    for i in range (x):
        contents = f.read()
        print(str(contents) + str(random.randint(75,100)))
The problem is that you are reading the file in at least 3 different ways which causes none of them to work the way you want. In particular, f.readlines() consumes the entire file buffer, so when you next do f.read() there is nothing left to read. Don't mix and match these. Instead, you should use line since you are iterating over the file already:
for line in f:
    print(line.strip() + str(random.randint(75, 100)))
The lesson here is don't make things any more complicated than they need to be.
Firstly, doing print("----...") is bad practice; at least use string multiplication: print("-" * 10)
Secondly, always open files using the 'with' keyword (you can look up why).
Thirdly, the code:
with open("ics2o.txt") as f:
for i,j in enumerate(f):
print(i,j)
I'd like to count specific things from a file, i.e. how many times "--undefined--" appears. Here is a piece of the file's content:
"jo:ns 76.434
pRE 75.417
zi: 75.178
dEnt --undefined--
ba --undefined--
I tried to use something like this. But it won't work:
with open("v3.txt", 'r') as infile:
data = infile.readlines().decode("UTF-8")
count = 0
for i in data:
if i.endswith("--undefined--"):
count += 1
print count
Do I have to implement, say, a dictionary of tuples to tackle this, or is there an easier solution?
EDIT:
The word in question appears only once in a line.
You can read all the data into one string, split that string into a list, and count occurrences of the substring in that list:
with open('afile.txt', 'r') as myfile:
    data = myfile.read().replace('\n', ' ')

data.split(' ').count("--undefined--")
or directly from the string:
data.count("--undefined--")
readlines() returns the list of lines, but they are not stripped (ie. they contain the newline character).
Either strip them first:
data = [line.strip() for line in data]
or check for --undefined--\n:
if line.endswith("--undefined--\n"):
Alternatively, consider string's .count() method:
file_contents.count("--undefined--")
Or don't limit yourself to .endswith(), use the in operator.
data = ''
count = 0

with open('v3.txt', 'r') as infile:
    data = infile.readlines()

print(data)

for line in data:
    if '--undefined--' in line:
        count += 1

print(count)
When reading a file line by line, each line ends with the newline character:
>>> with open("blookcore/models.py") as f:
... lines = f.readlines()
...
>>> lines[0]
'# -*- coding: utf-8 -*-\n'
>>>
so your endswith() test just can't work - you have to strip the line first:
if i.strip().endswith("--undefined--"):
    count += 1
Now, reading a whole file into memory is more often than not a bad idea - even if the file fits in memory, it still eats resources for no good reason. Python's file objects are iterable, so you can just loop over your file. And finally, you can specify which encoding should be used when opening the file (instead of decoding manually), using the codecs module (Python 2) or directly (Python 3):
# py3
with open("your/file.text", encoding="utf-8") as f:
# py2:
import codecs
with codecs.open("your/file.text", encoding="utf-8") as f:
then just use the builtin sum and a generator expression:
result = sum(line.strip().endswith("whatever") for line in f)
this relies on the fact that booleans are integers with values 0 (False) and 1 (True).
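A quick illustration of that, using a made-up list:

>>> sum(s.endswith("x") for s in ["ax", "b", "cx"])
2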
Quoting Raymond Hettinger, "There must be a better way":
from collections import Counter

counter = Counter()
words = ('--undefined--', 'otherword', 'onemore')

with open("v3.txt", 'r') as f:
    lines = f.readlines()

for line in lines:
    for word in words:
        if word in line:
            counter.update((word,))  # note the single element tuple

print counter
I have a file containing a block of introductory text for an unknown number of lines, then the rest of the file contains data. Before the data block begins, there are column titles and I want to skip those also. So the file looks something like this:
this is an introduction..
blah blah blah...
...
UniqueString
Time Position Count
0 35 12
1 48 6
2 96 8
...
1000 82 37
I want to record the Time, Position, and Count data to a separate file. The Time, Position, and Count data appear only after UniqueString.
Is this what you're looking for?
reduce(lambda x, line: (x and (outfile.write(line) or x)) or line == 'UniqueString\n', infile, None)
How it works:
files are iterable, so reduce can consume infile line by line
in the and part, we use the fact that outfile.write(line) will not be evaluated if the first operand of and is falsy
in the or part, we set the trigger once the marker line is found, so the write fires for every following line
the initial value passed to reduce is None, which evaluates to False (on Python 3 you also need from functools import reduce)
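For readability, roughly the same logic written as an explicit loop (just a sketch, not the one-liner above; the file names are placeholders):

found = False
with open("data.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        if found:
            outfile.write(line)
        elif line == "UniqueString\n":
            found = True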
You could extract and write the data to another file like this:
with open("data.txt", "r") as infile:
x = infile.readlines()
x = [i.strip() for i in x[x.index('UniqueString\n') + 1:] if i != '\n' ]
with open("output.txt", "w") as outfile:
for i in x[1:]:
outfile.write(i+"\n")
It is pretty straightforward, I think: the file is opened and all lines are read, a list comprehension slices the list beginning after the marker string, and the desired remaining lines are written to a file again.
You could create a generator function (and more info here) that filtered the file for you.
It operates incrementally, so it doesn't require reading the entire file into memory at one time.
def extract_lines_following(file, marker=None):
    """Generator yielding all lines in file following the line following the marker.
    """
    marker_seen = False
    while True:
        line = file.next()
        if marker_seen:
            yield line
        elif line.strip() == marker:
            marker_seen = True
            file.next()  # skip following line, too
# sample usage
with open('test_data.txt', 'r') as infile, open('cleaned_data.txt', 'w') as outfile:
    outfile.writelines(extract_lines_following(infile, 'UniqueString'))
This could be optimized a little if you're using Python 3, but the basic idea would be the same.
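For instance, a possible Python 3 version (just a sketch) iterates the file object directly instead of calling file.next():

def extract_lines_following(file, marker=None):
    """Yield all lines in file after the line that follows the marker line."""
    marker_seen = False
    for line in file:
        if marker_seen:
            yield line
        elif line.strip() == marker:
            marker_seen = True
            next(file, None)  # also skip the line right after the marker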
I have a plain text file with a sequence of numbers, one on each line. I need to import those values into a list. I'm currently learning Python and I'm not sure which is a fast or even "standard" way of doing this (also, I come from R, so I'm used to the scan or readLines functions that make this task a breeze).
The file looks like this (note: this isn't a csv file, commas are decimal points):
204,00
10,00
10,00
10,00
10,00
11,00
70,00
276,00
58,00
...
Since it uses commas instead of '.' for decimal points, I guess the task's a little harder, but it should be more or less the same, right?
This is my current solution, which I find quite cumbersome:
f = open("some_file", "r")
data = f.read().replace('\n', '|')
data = data[0:(len(data) - 2)].replace(',', '.')
data = data.split('|')
x = range(len(data))
for i in range(len(data)):
x[i] = float(data[i])
Thanks in advance.
UPDATE
I didn't realize the comma was the decimal separator. If the locale is set right, something like this should work
lines = [locale.atof(line.strip()) for line in open(filename)]
if not, you could do
lines = [float(line.strip().replace(',','.')) for line in open(filename)]
lines = [line.strip() for line in open(filename)]
if you want the data as numbers ...
lines = [map(float,line.strip().split(',')) for line in open(filename)]
edited as per first two comments below
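Note that locale.atof only works once the numeric locale has been set; a minimal sketch (the 'de_DE.UTF-8' locale name is an assumption - it has to be installed on your system, and any locale that uses a comma as the decimal separator will do):

import locale

locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")  # assumed locale with comma decimal separator

with open("some_file") as f:
    lines = [locale.atof(line.strip()) for line in f]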
bsoist's answer is good if locale is set correctly. If not, you can simply read the entire file in and split on the line breaks (\n), then use a list comprehension for replacements.
with open('some_file.txt', 'r') as datafile:
    data = datafile.read()

# the filter skips the empty string left by a trailing newline
x = [float(value.replace(",", ".")) for value in data.split('\n') if value.strip()]
For a simpler way you could just do:
Read = []
with open('File.txt', 'r') as File:
    Read = File.readlines()

for A in Read:
    print(A)
The "with open()" will open the file and quit when it's finished reading. This is good practice IIRC.
Then the For loop will just loop over Read and print out the lines.
I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
    doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious whether, for a much larger file that is all on one line, it would be possible to achieve a similar iteration effect as in the first code snippet, but splitting by comma (or any other character) instead of by newline.
The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
    prev = ''
    while True:
        s = f.read(bufsize)
        if not s:
            break
        split = s.split(delim)
        if len(split) > 1:
            yield prev + split[0]
            prev = split[-1]
            for x in split[1:-1]:
                yield x
        else:
            prev += s
    if prev:
        yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
    doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.
Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.
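A minimal sketch of that block-reading approach (the block size and the carry-over handling are assumptions, and it behaves much like the generator shown above):

def iter_fields(f, delim=',', blocksize=4096):
    carry = ''
    while True:
        block = f.read(blocksize)
        if not block:  # end-of-file
            break
        parts = (carry + block).split(delim)
        carry = parts.pop()  # the last piece may be cut off; carry it into the next block
        for part in parts:
            yield part
    if carry:
        yield carry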