I have an input file containing a list of strings.
I am iterating through every fourth line starting on line two.
From each of these lines I make a new string from the first and last 6 characters and put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep sequencing files, and it has been running for a day without making much progress. So I'm looking for any suggestions to make this much faster if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            #Make string from first and last 6 characters of a line
            lineChars = line[0:6]+line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n') #If string is unique, write to output file
            for skip in range(3): #Used to step through four lines at a time
                try:
                    check = line #Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator: membership tests on a set are O(1) on average, versus O(n) on a list. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
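Applied to your function, that might look roughly like this (a sketch that keeps your original slicing and line-skipping logic, and assumes input_file and output_file are defined as in your code):

import itertools

def method():
    seen = set()
    output_lines = []
    with open(input_file, 'r') as f:
        for line in f:
            # First and last 6 characters of each processed line
            lineChars = line[0:6] + line[145:151]
            if lineChars not in seen:   # set membership is O(1) on average
                seen.add(lineChars)
                output_lines.append(lineChars)
            for skip in range(3):       # skip the next three lines
                try:
                    next(f)
                except StopIteration:
                    break
    # one write at the end instead of many small writes
    with open(output_file, 'w') as target:
        target.writelines(s + '\n' for s in output_lines)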
You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):
import itertools
def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            s = line[:6]+line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using a set as Oscar suggested, you can also use islice to skip lines rather than using an explicit for loop.
As stated in this post, islice does the skipping in C, so it should be much faster than a plain vanilla Python for loop.
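One detail worth double-checking: islice(inf, None, None, 4) starts on the very first line, just like the original code. If you really do want to start on line two, as the question describes, you can pass an explicit start index of 1 (a sketch of the same function with only that argument changed):

import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        # start=1 begins the iteration on the file's second line
        for line in itertools.islice(inf, 1, None, 4):
            s = line[:6] + line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))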
Try replacing
lineChars = line[0:6]+line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.
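If you want to check whether this actually helps for your data, a quick timeit comparison is easy to run (the 151-character sample line below is just a hypothetical stand-in for one of your real lines):

import timeit

line = 'A' * 151  # stand-in for one fixed-width line from the input file

t_plus = timeit.timeit(lambda: line[0:6] + line[145:151], number=1000000)
t_join = timeit.timeit(lambda: ''.join([line[0:6], line[145:151]]), number=1000000)
print('+    :', t_plus)
print('join :', t_join)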
I am trying to make a program where I have to check if certain numbers are in use in a text file. The problem is that my for loop only loops through the first line instead of every line. How can I solve this? I've already tried readlines(), but that has not worked for me. This is the code, and I've got a text file with 1;, 2; and 3;, each on a separate line. Hope someone can help!
if int(keuze) == 2:
    def new_safe():
        with open('fa_kluizen.txt', 'r') as f:
            for number in f:
                return number
    print(new_safe())
My text file:
# TextFile
1;
2;
3;
You are returning too early (on the first iteration).
You can read all lines in a list while cleaning the data and then return that list.
def new_safe():
    with open('fa_kluizen.txt', 'r') as f:
        data = [line.strip() for line in f]
    return data
Also, most of the time it's bad practice to create a function inside an if-statement.
Maybe you can add a little bit more information about what you want to achieve.
You are returning the first line you encounter, and by doing so the program exits the current function and, of course, the loop.
One way to do it is:
def new_safe():
    with open('fa_kluizen.txt', 'r') as f:
        return f.read().splitlines()
This returns the lines of the file as a list of strings.
Output:
['1;', '2;', '3;']
That's because of "return number". Try:
if int(keuze) == 2:
    def new_safe():
        my_list = []
        with open('fa_kluizen.txt', 'r') as f:
            for number in f:
                my_list.append(number)
        return my_list
Hi, I tried several solutions found on SO, but I am missing some info.
I want to read 4 lines at once until I hit EOF. I know how to do it in other languages, but what is the best approach in Python 3?
This is what I have; lines is always just the first 4 lines and the code stops afterwards (I know why: the slice only gives me the first 4 elements of all_lines). I could use some kind of counter and break and so on, but that seems rather cheap to me.
if os.path.isfile(myfile):
    with open(myfile, 'r') as fo:
        all_lines = fo.readlines()
        for lines in all_lines[:4]:
            print(lines)
I want to handle 4 lines at once until I hit EOF. The file I am working with is rather short, maybe about 100 lines max.
If you want to iterate the lines in chunks of 4, you can do something like this:
if os.path.isfile(myfile):
    with open(myfile, 'r') as fo:
        all_lines = fo.readlines()
        for i in range(0, len(all_lines), 4):
            print(all_lines[i:i+4])
Instead of reading in the whole file and then looping over the lines four at a time, you can simply read them in four at a time. Consider:
def fun(myfile):
    if not os.path.isfile(myfile):
        return
    with open(myfile, 'r') as fo:
        while True:
            for line in (fo.readline() for _ in range(4)):
                if not line:
                    return
                print(line)
Here, a generator expression is used to read four lines at a time; it is embedded in an "infinite" loop that stops when line is falsy (the empty str ''), which only happens once we have reached EOF.
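An equivalent alternative, if you prefer to handle each group of four lines as a chunk, is itertools.islice (a sketch using the same myfile argument as above):

import os
from itertools import islice

def fun(myfile):
    if not os.path.isfile(myfile):
        return
    with open(myfile, 'r') as fo:
        while True:
            chunk = list(islice(fo, 4))  # up to four lines; shorter (or empty) at EOF
            if not chunk:
                break
            for line in chunk:
                print(line, end='')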
Content of file scores.txt that lists the performance of players at a certain game:
80,55,16,26,37,62,49,13,28,56
43,45,47,63,43,65,10,52,30,18
63,71,69,24,54,29,79,83,38,56
46,42,39,14,47,40,72,43,57,47
61,49,65,31,79,62,9,90,65,44
10,28,16,6,61,72,78,55,54,48
The following program reads the file and stores the scores in a list:
f = open('scores.txt','r')
L = []
for line in f:
    L = L + map(float,str.split(line[:-1],','))
print(L)
But it leads to error messages. I was given this code in class, so I'm quite confused, as I'm very new to Python.
How can I fix the code?
It appears you've adapted Python 2.x code for use in Python 3.x. Note that map does not return a list in Python 3.x; it returns a lazy map object that you have to convert to a list explicitly.
Furthermore, I'd recommend using list.extend instead of adding the two together. Why? The addition creates a new list object every time you perform it, which is wasteful in terms of time and space.
numbers = []
for line in f:
    numbers.extend(list(map(float, line.rstrip().split(','))))
print(numbers)
An alternative way of doing this would be:
for line in f:
    numbers.extend([float(x) for x in line.rstrip().split(',')])
This happens to be slightly more readable. You could also get rid of the outer for loop entirely by using a nested list comprehension:
numbers = [float(x) for line in f for x in line.rstrip().split(',')]
Also, I forgot to mention this (thanks to chris in the comments), but you really should be using a context manager to handle file I/O:
with open('scores.txt', 'r') as f:
    ...
It's cleaner, because it closes your files automatically when you're done with them.
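Putting those pieces together, a minimal sketch of the whole read wrapped in a context manager might look like this:

numbers = []
with open('scores.txt', 'r') as f:
    for line in f:
        # split each comma-separated line and convert the pieces to floats
        numbers.extend(float(x) for x in line.rstrip().split(','))
print(numbers)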
After seeing your ValueError message, it's clear there are issues with your data (invalid characters, etc.). Let's try something a little more aggressive.
numbers = []
with open('scores.txt', 'r') as f:
    for line in f:
        for x in line.strip().split(','):
            try:
                numbers.append(float(x.strip()))
            except ValueError:
                pass
If even that doesn't work, perhaps something even more aggressive with a regex might do it:
import re

numbers = []
with open('scores.txt', 'r') as f:
    for line in f:
        line = re.sub(r'[^\d\s,.+-]', '', line)
        ... # the rest remains the same
I want to return a particular line from a txt file for further manipulation. The file that must be opened by this function is quite big (~500 lines), so creating a list and then printing a particular line seemed pretty absurd. Can you please suggest an alternative? The code is as follows:
def returnline(filename, n):
    ofile = open(filename, 'r')
    filelist = ofile.readlines()
    return filelist[n - 1].strip('\n')
If your files are not many thousands of lines, I wouldn't worry about optimizing that bit. However, what you can do is simply keep reading the file until you've reached the line you want, and stop reading from there on; that way, when the file is, say, 5000 lines and you want the 10th line, you'll only have to read 10 lines. Also, you need to close the file after opening and reading from it.
So all in all, something like this:
def line_of_file(fname, linenum):
    # this will ensure the file gets closed
    # once the with block exits
    with open(fname, 'r') as f:
        # skip n - 1 lines
        for _ in xrange(linenum - 1):
            f.readline()
        return f.readline().strip('\n')
Alternatively, generators (lazy lists, kind of) might provide better performance:
from itertools import islice

def line_of_file(fname, linenum):
    with open(fname, 'r') as f:
        # (lazily) read all lines
        lines = f.xreadlines()
        # skip until the line we want
        lines = islice(lines, linenum - 1, linenum)
        # read the next line (the one we want)
        return next(lines)
...which can be shortened to:
from itertools import islice

def line_of_file(fname, linenum):
    with open(fname, 'r') as f:
        return next(islice(f.xreadlines(),
                           linenum - 1,
                           linenum))
(In Python 2.x, islice(xs, n, m) is like xs[n:m] except islice works on generators; see https://docs.python.org/2/library/itertools.html#itertools.islice)
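Note that xreadlines and xrange are Python 2 only; in Python 3 the file object is itself a lazy iterator, so the same idea can be written as follows (a sketch):

from itertools import islice

def line_of_file(fname, linenum):
    with open(fname, 'r') as f:
        # a Python 3 file object is already a lazy iterator,
        # so islice can consume it directly
        return next(islice(f, linenum - 1, linenum)).strip('\n')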
I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
The data file (refT.txt) is a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
It contains a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple (contig, sequence):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].strip() in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(p[0].strip())
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try reading all the contigs you want into a list or another structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for:
from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip()) #rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name) #optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any match. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
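For instance, writing them out one per line (reusing the output.txt name from the question's script) could look like this:

valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
sequences = get_sequences('dataT.txt', valid_contigs)

# write one sequence per line to output.txt
with open('output.txt', 'w') as outfile:
    for seq in sequences:
        outfile.write(seq + '\n')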