Referring to a list of names using Python - python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt', 'r')
refT.txt is the datafile, a reference genome with a bunch (about 2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The second file is opened:
cont_list = open('dataT.txt', 'r')
dataT.txt is a list of contigs that I want to extract from the dataset above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-

I would first write a generator that gives you the file two lines at a time, as a tuple (contig, sequence):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('refT.txt') as fin:
    pairs = pair(fin)
    while wanted:
        name, seq = next(pairs)
        if name.strip() in wanted:      # strip the newline before comparing
            # write to output file, store in a list, or dict, ...
            wanted.remove(name.strip())
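Putting it together: a minimal sketch (my wiring, not the original answer's) that uses pair() to write the matched records straight to output.txt, and avoids the StopIteration you would get from next() if the file ends before every wanted contig is found:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('refT.txt') as fin, open('output.txt', 'w') as data_out:
    for name, seq in pair(fin):
        if not wanted:                    # stop early once everything is found
            break
        if name.strip() in wanted:
            data_out.write(name)          # header line (still has its newline)
            data_out.write(seq)           # sequence line
            wanted.remove(name.strip())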

I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...) / f.close(). The with statement handles the closing for you, and it encourages you to keep all of your file IO in one place.
Read all the contigs you want into a list or other structure up front. It is a pain to have many files open at once, so read all the lines in one pass and store them.
Here's some example code that might do what you're looking for:
from itertools import izip_longest  # Python 3: from itertools import zip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to step through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional: remove if you only want the seq
            outfile.write(seq)

Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()  # cont_list.next() in Python 2
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check the line against every prefix in it. If the line matches a desired contig, the code grabs the next line and appends it to sequences after stripping the trailing whitespace.
From there, writing the grabbed sequences to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
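For that writing step, a minimal sketch (out.txt is just an example name):
with open('out.txt', 'w') as outfile:
    for sequence in sequences:
        outfile.write(sequence + '\n')  # one sequence per line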

Related

Sort file by key

I am learning Python 3 and I'm having trouble completing this task. I'm given a file with a string on each line. I have to sort its contents by the substring located between the first hyphen and the second hyphen, and write the sorted content to a different file. This is what I have tried so far, but nothing gets sorted:
def sort_keys(path, input, output):
    list = []
    with open(path + '\\' + input, 'r') as f:
        for line in f:
            if line.count('-') >= 1:
                list.append(line)
        sorted(list, key=lambda s: s.split("-")[1])
    with open(path + "\\" + output, 'w') as o:
        for line in list:
            o.write(line)

sort_keys("C:\\Users\\Daniel\\Desktop", "sample.txt", "results.txt")
This is the input file: https://pastebin.com/j8r8fZP6
Question 1: What am I doing wrong with the sorting? I've used the same idea to sort the words of a sentence by their last letter and it worked fine, but here I don't know what I am doing wrong.
Question 2: I feel that reading the content of the input file into a list, sorting the list, and writing the content afterwards is not very efficient. What is the "pythonic" way of doing it?
Question 3: Do you know any good exercises to learn working with files + folders in Python 3?
Kind regards
Your sort key is fine. The problem is that sorted() returns a new sorted list rather than altering the one provided, and you never use its return value. It's also much easier to use list comprehensions to read the file:
def sort_keys(path, infile, outfile):
    with open(path + '\\' + infile, 'r') as f:
        inputlines = [line.strip() for line in f.readlines() if "-" in line]
    outputlines = sorted(inputlines, key=lambda s: s.split("-")[1])
    with open(path + "\\" + outfile, 'w') as o:
        for line in outputlines:
            o.write(line + "\n")

sort_keys("C:\\Users\\Daniel\\Desktop", "sample.txt", "results.txt")
I also changed a few variable names, for legibility's sake.
EDIT: I understand that there are easier ways of doing the sorting (inputlines.sort(key=...) sorts the list in place), but this way seems more readable to me.
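The distinction is easy to see in the REPL (a quick illustration of my own):
nums = [3, 1, 2]
sorted(nums)   # returns a NEW list [1, 2, 3]; nums itself is unchanged
nums.sort()    # sorts in place and returns None; nums is now [1, 2, 3]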
First, your data has a couple lines without hyphens. Is that a typo? Or do you need to deal with those lines? If it is NOT a typo and those lines are supposed to be part of the data, how should they be handled?
I'm going to assume those lines are typos and ignore them for now.
Second, do you need to keep the whole line, with the lines sorted by the 2nd group of characters between the hyphens? If that's the case...
first, read in the file:
f = open('./text.txt', 'r')
There are a couple ways to go from here, but let's clean up the file contents a little and make a list object:
l = [i.replace("\n","") for i in f]
This will create a list l with all the newline characters removed. This particular way of creating the list is called a list comprehension. You can do the exact same thing with the following code:
l = []
for i in f:
    l.append(i.replace("\n", ""))
Now let's create a dictionary with the 2nd group as the key and the whole line as the value. Again, there are some lines with no hyphens, so we will just skip those for now with a simple try/except block:
d = {}
for i in l:
    try:
        d[i.split("-")[1]] = i
    except IndexError:
        pass
Now, here things can get slightly tricky, depending on how you want to approach the problem. Dictionaries are unordered in older versions of Python (they only preserve insertion order from 3.7 on), so there is not a really good way to simply sort the dictionary itself. ONE way (not necessarily the BEST way) is to create a sorted list of the dictionary keys:
s = sorted([k for k, v in d.items()])
Again, I used a list comprehension here, but you can rewrite that line to do the exact same thing here:
s = []
for k, v in d.items():
    s.append(k)
s = sorted(s)
Now we can write the dictionary back to a file by iterating through it with the sorted key list. To see what I mean, let's print out the dictionary one value at a time, using the sorted list as the keys:
for i in s:
    print(d[i])
But instead of printing, we will now append the line to a file:
o = open('./out.txt', 'w')
for i in s:
    o.write(d[i] + "\n")
o.close()
Depending on your system and formatting, you may or may not need the + "\n" part. Note that 'w' is fine here because the file is opened once, before the loop; the mode only matters at open time, so the writes inside the loop do not overwrite each other. Use 'a' instead only if you want to append to a file that already has content.
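For reference, the whole walkthrough condenses to a few lines (my own sketch, under the same assumption that hyphen-less lines are skipped):
d = {}
with open('./text.txt', 'r') as f:
    for line in f:
        line = line.replace("\n", "")
        parts = line.split("-")
        if len(parts) > 1:        # skip lines without hyphens
            d[parts[1]] = line

with open('./out.txt', 'w') as o:
    for key in sorted(d):         # sorted(d) yields the keys in sorted order
        o.write(d[key] + "\n")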

python merge files by rules

I need to write a script in Python that accepts two files and merges them into a new file according to the following rules:
1) take 1 word from the 1st file followed by 2 words from the second file.
2) when we reach the end of one file, copy the rest of the other file to the merged file without change.
I wrote a script, but I only managed to read 1 word from each file.
A complete script would be nice, but I really want to understand, in words, how I can do this on my own.
This is what I wrote:
def exercise3(file1, file2):
    lstFile1 = readFile(file1)
    lstFile2 = readFile(file2)
    with open("mergedFile", 'w') as outfile:
        merged = [j for i in zip(lstFile1, lstFile2) for j in i]
        for word in merged:
            outfile.write(word)

def readFile(filename):
    lines = []
    with open(filename) as file:
        for line in file:
            line = line.strip()
            for word in line.split():
                lines.append(word)
    return lines
Your immediate problem is that zip alternates items from the iterables you give it: in short, it's a 1:1 mapping, where you need 1:2. Try this:
lstFile2a = lstFile2[0::2]   # words at even indices
lstFile2b = lstFile2[1::2]   # words at odd indices
... zip(lstFile1, lstFile2a, lstFile2b)
This is a bit inefficient, but gets the job done.
Another way is to zip up pairs (2-tuples) in lstFile2 before zipping it with lstFile1. A third way is to forget zipping altogether, and run your own indexing:
for i in range(min(len(lstFile1), len(lstFile2) // 2)):
    outfile.write(lstFile1[i])
    outfile.write(lstFile2[2*i])
    outfile.write(lstFile2[2*i + 1])
However, this leaves you with the leftovers of the longer file to handle.
These aren't particularly elegant, but they should get you moving.
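To sketch the leftover handling too: a minimal example of my own, built on the readFile() helper from the question (note it writes words with a space separator, where the original wrote them with none):
def merge(file1, file2, outname='mergedFile'):
    words1 = readFile(file1)
    words2 = readFile(file2)
    n = min(len(words1), len(words2) // 2)   # number of complete 1+2 groups
    merged = []
    for i in range(n):
        merged.append(words1[i])
        merged.append(words2[2*i])
        merged.append(words2[2*i + 1])
    # one of these slices is empty; the other is the tail of the longer file
    merged.extend(words1[n:])
    merged.extend(words2[2*n:])
    with open(outname, 'w') as outfile:
        outfile.write(' '.join(merged))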

reading data from multiple lines as a single item

I have a set of data from a file like this:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]
How can I read/reference the text of each "johnnyboy"=splice(23) entry as a single line, as such:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
I am currently matching the regex based on splice(23): with a search as follows:
re_johnny = re.compile('splice')
with open("file.txt", 'r') as file:
    read = file.readlines()
for line in read:
    if re_johnny.match(line):
        print(line)
I think I need to remove the backslashes and the spaces to merge the lines, but I'm not sure how to do that without also picking up the blank lines or the final line that doesn't match my pattern. When I tried the first solution attempt, my last row was pulled in inappropriately. Any assistance would be great.
Input file fin: the sample data shown in the question above.
Adding to tigerhawk's suggestion, you can try something like this:
Code:
with open('fin', 'r') as f:
    # split records on blank lines, then rejoin each record's tokens without the trailing backslashes
    for l in [''.join([b.strip('\\') for b in a.split()]) for a in f.read().split('\n\n')]:
        if 'splice' in l:
            print(l)
Output:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
With regex you have multiplied your problems. Instead, keep it simple:
If a line starts with ", it begins a record.
Else, append it to the previous record.
You can implement parsing for such a scheme in just a few lines of Python, and you don't need regex; see the sketch below.
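A minimal sketch of that scheme (my own illustration; it assumes the input file is named fin as above, and explicitly skips section headers like [mattplayhouse\...], which the two rules alone don't cover):
records = []
with open('fin', 'r') as f:
    for line in f:
        line = line.strip()
        if line.startswith('"'):        # a line starting with " begins a new record
            records.append(line.rstrip('\\'))
        elif line.startswith('['):      # section header: not part of any record (assumption)
            continue
        elif line and records:          # continuation line: glue it onto the previous record
            records[-1] += line.rstrip('\\')

for record in records:
    if 'splice' in record:
        print(record)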

Improving the speed of a python script

I have an input file containing a list of strings.
I am iterating through every fourth line starting on line two.
From each of these lines I make a new string from the first and last 6 characters, and put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep sequencing files and it has been running for a day without making much progress. So I'm looking for any suggestions to make this much faster if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            # Make string from first and last 6 characters of a line
            lineChars = line[0:6] + line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n')  # If string is unique, write to output file
            for skip in range(3):  # Used to step through four lines at a time
                try:
                    check = line  # Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
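A sketch of that batching idea (my own illustration of the suggestion; the four-line stepping is omitted for brevity):
def method():
    seen = set()
    out_lines = []
    with open(input_file, 'r') as f:
        for line in f:
            lineChars = line[0:6] + line[145:151]
            if lineChars not in seen:   # set membership is O(1) on average
                seen.add(lineChars)
                out_lines.append(lineChars)
    with open(output_file, 'w') as target:
        target.write('\n'.join(out_lines) + '\n')  # single write at the end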
You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):
import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        # use islice(inf, 1, None, 4) if you want to start on line two instead
        for line in itertools.islice(inf, None, None, 4):
            s = line[:6] + line.rstrip('\n')[-6:]  # strip the newline first so we really get the last 6 characters
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using a set as Oscar suggested, you can also use islice to skip lines rather than a for loop.
As stated in this post, islice steps the iterator in C, so it should be much faster than skipping lines with a plain vanilla Python for loop.
Try replacing
lineChars = line[0:6]+line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.
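Whether it helps depends on your Python version and build, so measure on your own machine; a quick timeit sketch (my own, using an arbitrary 151-character dummy line):
import timeit

line = "A" * 151  # dummy line the same length as the real data

t_concat = timeit.timeit(lambda: line[0:6] + line[145:151], number=1000000)
t_join = timeit.timeit(lambda: ''.join([line[0:6], line[145:151]]), number=1000000)
print(t_concat, t_join)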

Adding each item in list to end of specific lines in FASTA file

I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = "['things1', '', 'things3', 'things4', '' 'things6', 'things7']"
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down a bit. Also, an important note: you don't close your file handles. This can cause errors, especially when writing to a file, and either way it's bad practice. Code:
#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])  # tab as separator
        return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)  # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]  # split on whitespace, keep the first element (the id)
                line_split.append(annos.pop(0))  # append the annotation of interest to the current id line
                output.write(' '.join(line_split) + '\n')  # join and write to file with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
This is not perfect, but it cleans things up a bit. I'd veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
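One way to avoid that, sketched under the hypothetical assumption that column 0 of the annotation file holds the sequence id and column 5 the annotation, is to key the annotations by id instead of relying on file order:
# hypothetical variant: annos keyed by sequence id rather than popped in order
def get_annos_by_id(infile):
    annos = {}
    with open(infile, 'r') as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')
            annos[cols[0]] = cols[5]    # assumes id in column 0 (assumption)
    return annos

# then, inside add_annos, look up by id instead of popping:
#     seq_id = line.split()[0].lstrip('>')
#     output.write('>{} {}\n'.format(seq_id, annos.get(seq_id, '')))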
There is a great Python library for FASTA and other DNA file parsing: Biopython. It is extremely helpful in bioinformatics, and you can manipulate the data however you need.
Here is a simple example taken from the library's website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
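Applied to this question, a sketch of the header rewrite using SeqIO (my own illustration; file names follow the asker's example, and it assumes one annotation per record, in file order):
from Bio import SeqIO

annos = ['things1', '', 'things3']  # hypothetical: one annotation per record, in order

records = []
for anno, rec in zip(annos, SeqIO.parse("testseq.fasta", "fasta")):
    # rebuild the header as "id annotation"; strip() drops the trailing space when anno is ''
    rec.description = "{} {}".format(rec.id, anno).strip()
    records.append(rec)

SeqIO.write(records, "out.fasta", "fasta")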
EDIT:
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time / lots of memory.
#!/usr/bin/python
# Script takes unedited FASTA file, removes seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)  # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')  # note: uses the global 'out', not the 'outfile' parameter
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)
