How to split a string into equal-sized parts in Python?

I have a string that contains a sequence of nucleotides. The string is 1191 nucleotides long.
How do I print the sequence in a format in which each line has only 100 nucleotides? Right now I have it hard-coded, but I would like it to work for any string of nucleotides. Here is the code I have now:
def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    # how do I make sure to only have 100 nucleotides per line?
    print(Sequence[0:100])
    print(Sequence[100:200])
    print(Sequence[200:300])
    print(Sequence[300:400])
    print(Sequence[400:500])
    print(Sequence[500:600])
    print(Sequence[600:700])
    print(Sequence[700:800])
    print(Sequence[800:900])
    print(Sequence[900:1000])
    print(Sequence[1000:1100])
    print(Sequence[1100:1191])

Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
printinfasta(SeqName, Sequence, SeqDescription)

You can use textwrap.wrap to split a long string into a list of strings:
import textwrap
seq = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
print('\n'.join(textwrap.wrap(seq, width=100)))
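For a plain nucleotide string with no whitespace, simple slicing with a stride gives the same result without any import (textwrap treats its input as prose and folds on whitespace, which is harmless here since the sequence contains none). A minimal sketch, assuming seq as defined above:
chunks = [seq[i:i+100] for i in range(0, len(seq), 100)]
print('\n'.join(chunks))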

You can use itertools.zip_longest and some iter magic to get this in one line:
from itertools import zip_longest
sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
output = [''.join(filter(None, s)) for s in zip_longest(*([iter(sequence)]*100))]
Or:
for s in zip_longest(*([iter(sequence)]*100)):
    print(''.join(filter(None, s)))
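The "iter magic" works because [iter(sequence)]*100 is a list of 100 references to the same iterator, so each tuple produced by zip_longest consumes 100 consecutive characters. A small demonstration with a chunk size of 3:
from itertools import zip_longest

it = [iter("ABCDEFG")] * 3  # three references to one shared iterator
print([''.join(filter(None, t)) for t in zip_longest(*it)])
# ['ABC', 'DEF', 'G']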

A possible solution is to use the re module:
import re

def splitstring(strg, leng):
    chunks = re.findall('.{1,%d}' % leng, strg)
    for i in chunks:
        print(i)

splitstring(strg=seq, leng=100)
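For reference, re.findall returns the chunks as a list, with the final chunk shorter when the string length is not a multiple of the chunk size; a quick check with a chunk size of 3:
import re

print(re.findall('.{1,3}', 'ABCDEFGH'))
# ['ABC', 'DEF', 'GH']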

You can use a helper function based on itertools.zip_longest. The helper has been designed to handle the case where the sequence length isn't an exact multiple of the chunk size (the last group will have fewer elements than those before it).
from itertools import zip_longest

def grouper(n, iterable):
    """ s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... """
    FILLER = object()  # Sentinel value that can't occur in the data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield ''.join(v for v in result if v is not FILLER)

def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    for group in grouper(100, Sequence):
        print(group)

Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
printinfasta('Name', Sequence, 'Description')
Sample output:
Name Description
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTA
AATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCC
TAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTT
TGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACAT
TTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT

I assume that your sequence is in FASTA format. If so, you can use any of a number of bioinformatics packages that provide FASTA sequence wrapping utilities. For example, FASTX-Toolkit includes the fasta_formatter command-line utility, which can wrap FASTA sequences to, say, a maximum of 100 nucleotides per line:
fasta_formatter -i INFILE -o OUTFILE -w 100
You can install FASTX-Toolkit package using conda, for example:
conda install fastx_toolkit
or
conda create -n fastx_toolkit fastx_toolkit
Note that if you end up writing the (simple) code to wrap FASTA sequences from scratch, remember that the header lines (the lines starting with >) should not be wrapped. Wrap only the sequence lines.
SEE ALSO:
Convert single line fasta to multi line fasta

Package cytoolz (installable using pip install cytoolz) provides a function partition_all that can be used here:
#!/usr/bin/env python3
from cytoolz import partition_all

def printinfasta(name, seq, descr):
    header = f">{name} {descr}"
    print(header)
    print(*map("".join, partition_all(100, seq)), sep="\n")

printinfasta("test", 468 * "ACGTGA", "this is a test")
partition_all(100, seq) generates tuples of 100 letters each taken from seq, plus a final shorter one if the number of letters is not a multiple of 100.
The map("".join, ...) is used to group letters within each such tuple into a single string.
The * in front of the map makes its results considered as separate arguments to print.
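If you would rather avoid the extra dependency: since Python 3.12 the standard library offers an equivalent in itertools.batched. A sketch of the same printing logic, assuming Python 3.12+:
from itertools import batched  # Python 3.12+

seq = 468 * "ACGTGA"
print(*map("".join, batched(seq, 100)), sep="\n")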


How to extract floating-point numbers from a string and add them using simple operations in Python

I have a file named ping.txt which holds the times taken to ping an IP address n times.
My ping.txt contains:
time=35.9
time=32.4
I have written Python code that extracts these floating-point numbers and adds them using a regular expression. But I feel that the code below is an indirect way of completing my task: the findall regex I am using outputs a list, which is then converted, joined and added.
import re
add, tmp = 0, 0
with open("ping.txt", "r+") as pingfile:
    for i in pingfile.readlines():
        tmp = re.findall(r'\d+\.\d+', i)
        add = add + float("".join(tmp))
print("The sum of the times is :", add)
My question is: how can I solve this problem without using a regex, or otherwise reduce the number of lines in my code to make it more efficient?
In other words, can I use a different regex or some other method to do this operation?
You can use the following:
with open('ping.txt', 'r') as f:
    s = sum(float(line.split('=')[1]) for line in f)
Output:
>>> with open('ping.txt', 'r') as f:
...     s = sum(float(line.split('=')[1]) for line in f)
...
>>> s
68.3
Note: I assume each line of your file contains time=some_float_number
You could do it like this (note the non-capturing group: with a capturing group, findall would return only the group's text, which breaks the float conversion):
import re
total = sum(float(s) for s in re.findall(r'\d+(?:\.\d+)?', open("ping.txt", "r+").read()))
If you have the string:
>>> s='time=35.9'
Then to get the value, you just need:
>>> float(s.split('=')[1])
35.9
You don't need regular expressions for something with a simple delimiter.
You can use str.split to split each line at the '=' and append the values to a list. At the end, simply call sum to print the sum of the elements in the list:
temp = []
with open("test.txt", "r+") as pingfile:
    for i in pingfile.readlines():
        temp.append(float(str.split(i, '=')[1]))
print("The sum of the times is :", sum(temp))
You can also use this regex (note the escaped dot; an unescaped . would match any character):
tmp = re.findall(r"[0-9]+\.[0-9]+", i)
After that, run a loop:
total = 0
for each in tmp:
    total = total + float(each)
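The findall-and-loop approach can also be collapsed into a single pass by summing over a generator; a minimal sketch, assuming the same time=... line format:
import re

with open("ping.txt") as f:
    total = sum(float(m) for m in re.findall(r'\d+\.\d+', f.read()))
print("The sum of the times is :", total)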

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop over it as I read my file's lines in. I've got the basic logic down, but I'm currently just tallying by unique line instead of by "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1
for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression
\w+\.?\w*
which basically means "one or more alphanumeric characters (including _), optionally followed by a . and then some more characters".
Note that this would also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards (see the sketch at the end of this answer).
Then you can use collections.Counter to do the actual counting for you:
import collections
import re

pattern = re.compile(r"\w+\.?\w*")
# Here I'm using the source file for `collections` as the test example.
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))
for token, count in tokens.most_common(5):  # show only the top 5
    print(token, count)
Running Python version 3.6.0a1, the output is this:
self 226
def 173
return 170
self.data 129
if 102
which makes sense for the collections module, since it is full of classes that use self and define methods; it also shows that the pattern does capture self.data, which fits the construct you are interested in.
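As for the number literals mentioned above, one simple way to drop them afterwards is to discard any token that parses as a float; a minimal sketch (the sample string here is only illustrative):
import re

def is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False

pattern = re.compile(r"\w+\.?\w*")
tokens = [t for t in pattern.findall("for x in (42, 7.6): obj.attr")
          if not is_number(t)]
print(tokens)  # ['for', 'x', 'in', 'obj.attr']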

Rosalind Profile and Consensus: Writing long strings to one line in Python (Formatting)

I'm trying to tackle a problem on Rosalind where, given a FASTA file of at most 10 sequences at 1 kb each, I need to give the consensus sequence and profile (how many of each base all the sequences have at each position). My code works for small sequences (verified).
However, I have issues formatting my response when it comes to large sequences.
What I expect to return, regardless of length, is:
"consensus sequence"
"A: one line string of numbers without commas"
"C: one line string (same format)"
"G: one line string (same format)"
"T: one line string (same format)"
all aligned with each other and each on its own line, or at least some formatting that lets me carry this layout forward as a unit to maintain the alignment.
But when I run my code for a large sequence, each string below the consensus sequence gets broken up by newlines, presumably because the string itself is too long. I've been struggling to think of ways to circumvent the issue, but my searches have been fruitless. I'm thinking about some iterative writing algorithm that can write the entirety of the above expectation, but in chunks. Any help would be greatly appreciated. I have attached the entirety of my code below for the sake of completeness, with block comments as needed.
def cons(file):
    #returns consensus sequence and profile of a FASTA file
    import os
    path = os.path.abspath(os.path.expanduser(file))
    with open(path,"r") as D:
        F=D.readlines()
    #initialize list of sequences, list of all strings, and a temporary storage
    #list, respectively
    SEQS=[]
    mystrings=[]
    temp_seq=[]
    #get a list of strings from the file, stripping the newline character
    for x in F:
        mystrings.append(x.strip("\n"))
    #if the string in question is a nucleotide sequence (without ">")
    #i'll store that string into a temporary variable until I run into a string
    #with a ">", in which case I'll join all the strings in my temporary
    #sequence list and append to my list of sequences SEQS
    for i in range(1,len(mystrings)):
        if ">" not in mystrings[i]:
            temp_seq.append(mystrings[i])
        else:
            SEQS.append(("").join(temp_seq))
            temp_seq=[]
    SEQS.append(("").join(temp_seq))
    #set up list of nucleotide counts for A,C,G and T, in that order
    ACGT= [[0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))]]
    #assumed to be equal length sequences. Counting amount of shared nucleotides
    #in each column
    for i in range(0,len(SEQS[0])-1):
        for j in range(0, len(SEQS)):
            if SEQS[j][i]=="A":
                ACGT[0][i]+=1
            elif SEQS[j][i]=="C":
                ACGT[1][i]+=1
            elif SEQS[j][i]=="G":
                ACGT[2][i]+=1
            elif SEQS[j][i]=="T":
                ACGT[3][i]+=1
    ancstr=""
    TR_ACGT=list(zip(*ACGT))
    acgt=["A: ","C: ","G: ","T: "]
    for i in range(0,len(TR_ACGT)-1):
        comp=TR_ACGT[i]
        if comp.index(max(comp))==0:
            ancstr+=("A")
        elif comp.index(max(comp))==1:
            ancstr+=("C")
        elif comp.index(max(comp))==2:
            ancstr+=("G")
        elif comp.index(max(comp))==3:
            ancstr+=("T")
    '''
    writing to file... trying to get it to write as
    consensus sequence
    A: blah(1line)
    C: blah(1line)
    G: blah(1line)
    T: blah(line)
    which works for small sequences. but for larger sequences
    python keeps adding newlines if the string in question is very long...
    '''
    myfile="myconsensus.txt"
    writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
    with open(myfile,'w') as D:
        D.writelines(ancstr)
        D.writelines("\n")
        for i in range(0,len(writing_strings)):
            D.writelines(writing_strings[i])
            D.writelines("\n")

cons("rosalind_cons.txt")
Your code is totally fine except for this line:
writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
You accidentally replicate your data. Try replacing it with:
writing_strings = [acgt[i] + str(ACGT[i])[1:-1] for i in range(0, len(ACGT))]
and then write each entry to your output file as before:
D.write(writing_strings[i])
That's a lazy way to get rid of the brackets from your list.
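A slightly more robust alternative is to build each line with an explicit join rather than slicing the brackets off the list's repr; a sketch, assuming the acgt labels, the ACGT count lists and the consensus string ancstr from the question's cons():
# Sketch only; acgt, ACGT and ancstr are assumed from the question.
lines = [label + ' '.join(str(n) for n in counts)
         for label, counts in zip(acgt, ACGT)]
with open("myconsensus.txt", 'w') as out:
    out.write(ancstr + "\n")
    out.write("\n".join(lines) + "\n")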

error comparing sequences - string interpreted as number

I'm trying to do something similar to my previous question.
My purpose is to join all sequences that are equal, but this time instead of letters I have numbers.
The alignment file can be found here - phylip file
the problem is when I try to do this:
records = list(SeqIO.parse(file(filename),'phylip'))
I get this error:
ValueError: Sequence 1 length 49, expected length 1001000000100000100000001000000000000000
I don't understand why, because this is the second file I'm creating and the first one worked perfectly.
Below is the code used to build the alignment file:
fl.write('\t')
fl.write(str(161))
fl.write('\t')
fl.write(str(size))
fl.write('\n')
for i in info_plex:
    if 'ref' in i[0]:
        i[0] = 'H37Rv'
    fl.write(str(i[0]))
    num = 10 - len(i[0])
    fl.write(' ' * num)
    for x in i[1:]:
        fl.write(str(x))
    fl.write('\n')
So it shouldn't interpret 1001000000100000100000001000000000000000 as a number, since it's a string.
Any ideas?
Thank you!
Your PHYLIP file is broken: the header says 161 sequences but there are 166. After fixing that, the current version of Biopython seems to load your file fine. Maybe use len(info_plex) when creating the header line.
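For example, a minimal sketch of that header fix, assuming the fl, info_plex and size variables from the question's code:
# Hypothetical fix: derive the count from the data instead of the literal 161.
fl.write('\t%d\t%d\n' % (len(info_plex), size))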
P.S. It would have been a good idea to include the version of Biopython in your question.
The code by Kevin Jacobs in your former question employs Biopython, whose sequences are of type Seq, which
« are essentially strings of letters like AGTACACTGGT, which seems very
natural since this is the most common way that sequences are seen in
biological file formats. »
« There are two important differences between Seq objects and standard
Python strings. (...)
First of all, they have different methods. (...)
Secondly, the Seq object has an important
attribute, alphabet, which is an object describing what the individual
characters making up the sequence string “mean”, and how they should
be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a
protein sequence that happens to be rich in Alanines, Glycines,
Cysteines and Threonines?
The alphabet object is perhaps the important thing that makes the Seq
object more than just a string. The currently available alphabets for
Biopython are defined in the Bio.Alphabet module. »
http://biopython.org/DIST/docs/tutorial/Tutorial.html
The reason for your problem is simply that SeqIO.parse() can't create Seq objects from a file containing characters for which there is no alphabet attribute able to manage them.
So you must use another method, rather than trying to force an ill-suited method onto a different problem.
Here's my way:
from itertools import groupby
from operator import itemgetter
import re

regx = re.compile(r'^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))

print 'len(records) == %s\n' % len(records)
n = 0
for seq,equal in groupby(records, itemgetter(1)):
    ids = tuple(x[0] for x in equal)
    if len(ids)>1:
        print '>%s :\n%s' % (','.join(ids), seq)
    else:
        n+=1
print '\nNumber of unique occurrences : %s' % n
result
len(records) == 165
>154995,168481 :
0000000000001000000010000100000001000000000000000
>123031,74772 :
0000000000001111000101100011100000100010000000000
>176816,178586,80016 :
0100000000000010010010000010110011100000000000000
>129575,45329 :
0100000000101101100000101110001000000100000000000
Number of unique occurrences : 156
Edit
I've understood MY problem: I had left 'fasta' instead of 'phylip' in my code. 'phylip' is a valid value for the format argument of SeqIO.parse(), and with it the code works fine:
records = list(SeqIO.parse(file('pastie-2486250.rb'),'phylip'))

def seq_getter(s): return str(s.seq)

records.sort(key=seq_getter)
ecr = []
for seq,equal in groupby(records, seq_getter):
    ids = tuple(s.id for s in equal)
    if len(ids)>1:
        ecr.append( '>%s\n%s' % (','.join(ids),seq) )
print '\n'.join(ecr)
produces
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>154995,168481
0000000000001000000010000100000001000000000000000
>123031,74772
0000000000001111000101100011100000100010000000000
>176816,178586,80016
0100000000000010010010000010110011100000000000000
>129575,45329
0100000000101101100000101110001000000100000000000
There is an incredible number of ',' characters before the interesting data; I wonder what they are.
But my code isn't useless. See:
from time import clock
from itertools import groupby
from operator import itemgetter
import re
from Bio import SeqIO

def seq_getter(s): return str(s.seq)

t0 = clock()
with open('pastie-2486250.rb') as f:
    records = list(SeqIO.parse(f,'phylip'))
records.sort(key=seq_getter)
print clock()-t0,'seconds'

t0 = clock()
regx = re.compile(r'^(\d+)[ \t]+([01]+)',re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))
print clock()-t0,'seconds'
result
12.4826178327 seconds
0.228640588399 seconds
ratio = 55!

Best way to compare sequences of letters inside a file?

I have a file that has lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them, all against all.
I'm doing something like this, but it isn't exactly what I wanted:
for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            pass
        else:
            for el in line:
                if elem == el:
                    print elem, el
example of the file:
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
So what I want is to know whether any sequence is totally equal to sequence 1, or to 2, and so on.
If the goal is simply to group like sequences together, then sorting the data will do the trick. Here is a solution that uses Biopython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
from itertools import groupby
from Bio import SeqIO

records = list(SeqIO.parse(file('spoo.fa'),'fasta'))

def seq_getter(s): return str(s.seq)

records.sort(key=seq_getter)
for seq,equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print '>%s' % ids
    print seq
Output:
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hash of the sequences. Python offers two built-in data types that use hashing: set and dict. It's best to use dict here, as we can store the ids of all the matches.
I've assumed the file has identifiers and sequences on alternate lines, so if we split the file text on newlines we can take one line as the id and the next as the sequence to match.
We then use a dict with the sequence as the key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.
So when we've finished working through the file the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.
from collections import defaultdict

lines = filetext.split("\n")
sequences = defaultdict(list)
while (lines):
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)
results = [match for match in sequences.values() if len(match) > 1]
print results
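One design note on the pairing loop: list.pop(0) shifts the whole list on every call, so for large files it may be preferable to pair the lines with zip over two slices; a sketch under the same alternating id/sequence assumption:
from collections import defaultdict

lines = filetext.split("\n")  # filetext as in the snippet above
sequences = defaultdict(list)
for ident, data in zip(lines[::2], lines[1::2]):
    sequences[data].append(ident)
results = [match for match in sequences.values() if len(match) > 1]
print(results)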
The following script counts the sequences. It returns a dictionary with the individual, distinct sequences as keys and the line numbers where these sequences occur as values.
#!/usr/bin/python
import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)
Sample output:
etc#etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']
Update
Changed the code to use defaultdict and a for loop. Thanks @KennyTM.
Update 2
Changed the code to use append rather than +. Thanks @Dave Webb.
