How to split a string into equal sized parts? - python
I have a string that contains a sequence of nucleotides. The string is 1191 nucleotides long.
How do I print the sequence in a format in which each line has only 100 nucleotides? Right now I have it hard-coded, but I would like it to work for any string of nucleotides. Here is the code I have now:
def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    # how do I make sure to only have 100 nucleotides per line?
    print(Sequence[0:100])
    print(Sequence[100:200])
    print(Sequence[200:300])
    print(Sequence[300:400])
    print(Sequence[400:500])
    print(Sequence[500:600])
    print(Sequence[600:700])
    print(Sequence[700:800])
    print(Sequence[800:900])
    print(Sequence[900:1000])
    print(Sequence[1000:1100])
    print(Sequence[1100:1191])

printinfasta(SeqName, Sequence, SeqDescription)
Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
You can use textwrap.wrap to split a long string into a list of equal-sized strings:
import textwrap
seq = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
print('\n'.join(textwrap.wrap(seq, width=100)))
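If you'd rather not import textwrap, plain slicing in a list comprehension does the same job; this is a dependency-free sketch (the `chunk` name is my own, not from the answer):

```python
def chunk(seq, width=100):
    # Slice the string every `width` characters; the final chunk is
    # simply shorter when len(seq) is not a multiple of width.
    return [seq[i:i + width] for i in range(0, len(seq), width)]

print("\n".join(chunk("ACGT" * 30, width=25)))
```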
You can use itertools.zip_longest and some iter magic to get this in one line:
from itertools import zip_longest
sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
output = [''.join(filter(None, s)) for s in zip_longest(*([iter(sequence)]*100))]
Or:
for s in zip_longest(*([iter(sequence)]*100)):
    print(''.join(filter(None, s)))
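The `[iter(sequence)]*100` trick works because the list holds 100 references to the *same* iterator, so each tuple produced by zip_longest drains 100 consecutive characters from it. A smaller illustration with width 3 (my own sketch, not part of the answer):

```python
from itertools import zip_longest

s = "ABCDEFGH"
it = iter(s)                                    # one iterator, shared by every slot
groups = list(zip_longest(*[it] * 3, fillvalue=""))
# Each tuple takes 3 consecutive characters from the shared iterator;
# the final tuple is padded with the fillvalue.
print(groups)  # [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'H', '')]
```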
A possible solution is to use the re module:

import re

def splitstring(strg, leng):
    # '.{1,leng}' greedily grabs up to `leng` characters at a time,
    # so the last chunk may be shorter.
    chunks = re.findall('.{1,%d}' % leng, strg)
    for i in chunks:
        print(i)

splitstring(strg=seq, leng=100)
You can use a helper function based on itertools.zip_longest. The helper function has been designed to (also) handle cases where the sequence isn't an exact multiple of the size of the equal parts (the last group will have fewer elements than those before it).
from itertools import zip_longest
def grouper(n, iterable):
    """ s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... """
    FILLER = object()  # Value that couldn't be in the data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield ''.join(v for v in result if v is not FILLER)

def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    for group in grouper(100, Sequence):
        print(group)
Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
printinfasta('Name', Sequence, 'Description')
Sample output:
Name Description
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTA
AATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCC
TAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTT
TGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACAT
TTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT
I assume that your sequence is in FASTA format. If this is the case, you can use any of a number of bioinformatics packages that provide FASTA sequence-wrapping utilities. For example, you can use FASTX-Toolkit: its fasta_formatter command-line utility wraps FASTA sequences, e.g. to a maximum of 100 nucleotides per line:
fasta_formatter -i INFILE -o OUTFILE -w 100
You can install FASTX-Toolkit package using conda, for example:
conda install fastx_toolkit
or
conda create -n fastx_toolkit fastx_toolkit
Note that if you end up writing the (simple) code to wrap FASTA sequences from scratch, remember that the header lines (the lines starting with >) should not be wrapped. Wrap only the sequence lines.
SEE ALSO:
Convert single line fasta to multi line fasta
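A minimal sketch of that header-aware wrapping written from scratch (the function name and the 100-column default are my own assumptions, not from the answer):

```python
import textwrap

def wrap_fasta(lines, width=100):
    """Yield FASTA lines with sequence data wrapped to `width` columns.

    Header lines (starting with '>') pass through unwrapped."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line                          # never wrap headers
        else:
            # textwrap breaks the unbroken nucleotide run into
            # width-sized pieces (break_long_words is True by default)
            for piece in textwrap.wrap(line, width=width):
                yield piece

fasta = [">seq1 demo", "ACGT" * 40]             # one 160-nt sequence line
print("\n".join(wrap_fasta(fasta, width=100)))
```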
Package cytoolz (installable using pip install cytoolz) provides a function partition_all that can be used here:
#!/usr/bin/env python3
from cytoolz import partition_all

def printinfasta(name, seq, descr):
    header = f">{name} {descr}"
    print(header)
    print(*map("".join, partition_all(100, seq)), sep="\n")

printinfasta("test", 468 * "ACGTGA", "this is a test")
partition_all(100, seq) generates tuples of 100 letters each taken from seq, plus a last, shorter one if the number of letters is not a multiple of 100.
The map("".join, ...) is used to group letters within each such tuple into a single string.
The * in front of the map makes its results considered as separate arguments to print.
Related
How to extract a floating number from a string and add it using simple operation on python
I have a file named ping.txt which holds the times taken to ping an IP n times. My ping.txt contains:

time=35.9
time=32.4

I have written Python code that extracts these floating-point numbers and adds them up using a regular expression, but I feel this is an indirect way of completing the task: the findall regex outputs a list which is then converted, joined and finally added.

import re

add, tmp = 0, 0
with open("ping.txt", "r+") as pingfile:
    for i in pingfile.readlines():
        tmp = re.findall(r'\d+\.\d+', i)
        add = add + float("".join(tmp))
print("The sum of the times is :", add)

My question is how to solve this problem without using a regex, or in some other way that reduces the number of lines in my code and makes it more efficient. In other words, can I use a different regex or some other method to do this operation?
You can use the following:

with open('ping.txt', 'r') as f:
    s = sum(float(line.split('=')[1]) for line in f)

Output:

>>> with open('ping.txt', 'r') as f:
...     s = sum(float(line.split('=')[1]) for line in f)
...
>>> s
68.3

Note: I assume each line of your file contains time=some_float_number
You could do it like this:

import re

with open("ping.txt") as f:
    total = sum(float(s) for s in re.findall(r'\d+(?:\.\d+)?', f.read()))

Note the group must be non-capturing, (?:...); with a capturing group, findall returns only the captured fractional part (e.g. '.9') instead of the whole match.
If you have the string:

>>> s = 'time=35.9'

then to get the value, you just need:

>>> float(s.split('=')[1])
35.9

You don't need regular expressions for something with a simple delimiter.
You can use the string split method to split each line at '=' and append the values to a list. At the end, simply call the sum function to print the sum of the elements in the list:

temp = []
with open("test.txt", "r+") as pingfile:
    for i in pingfile.readlines():
        temp.append(float(str.split(i, '=')[1]))
print("The sum of the times is :", sum(temp))
Use this regex (note the escaped dot; a bare . would match any character):

tmp = re.findall(r"[0-9]+\.[0-9]+", i)

After that, run a loop (using total rather than sum, which would shadow the built-in):

total = 0
for each in tmp:
    total = total + float(each)
How can I effectively pull out human readable strings/terms from code automatically?
I'm trying to determine the most common words, or "terms" (I think), as I iterate over many different files. Example - for this line of code found in a file:

for w in sorted(strings, key=strings.get, reverse=True):

I'd want these unique strings/terms returned to my dictionary as keys:

for w in sorted strings key strings get reverse True

However, I want this code to be tunable so that I can also return strings with periods or other characters between them, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:

strings.get

How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down, but I'm currently just doing the tallying by unique line instead of "term":

strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1

for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])

(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in fit the regular expression:

\w+\.?\w*

which basically means "one or more alphanumeric characters (including _), optionally followed by a . and then some more characters". Note that this would also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards. Then you can use collections.Counter to do the actual counting for you:

import collections
import re

pattern = re.compile(r"\w+\.?\w*")

# here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))

for token, count in tokens.most_common(5):  # show only the top 5
    print(token, count)

Running Python version 3.6.0a1, the output is this:

self 226
def 173
return 170
self.data 129
if 102

which makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that it does capture self.data, which fits the construct you are interested in.
Rosalind Profile and Consensus: Writing long strings to one line in Python (Formatting)
I'm trying to tackle a problem on Rosalind where, given a FASTA file of at most 10 sequences at 1kb each, I need to give the consensus sequence and profile (how many of each base all the sequences have in common at each nucleotide position). My code works for small sequences (verified), but I have issues formatting my response for large sequences. What I expect to return, regardless of length, is:

consensus sequence
A: one-line string of numbers without commas
C: one-line string
G: one-line string
T: one-line string

all aligned with each other and on their own respective lines, or at least some formatting that lets me carry this forward as a unit to maintain the alignment. But when I run my code for a large sequence, each string below the consensus sequence is broken up by a newline, presumably because the string itself is too long. I've been struggling to think of ways to circumvent the issue, but my searches have been fruitless. I'm thinking about some iterative writing algorithm that can write the entirety of the above expectation in chunks. Any help would be greatly appreciated. I have attached the entirety of my code below for the sake of completeness, with block comments as needed.
def cons(file):
    # returns consensus sequence and profile of a FASTA file
    import os
    path = os.path.abspath(os.path.expanduser(file))
    with open(path, "r") as D:
        F = D.readlines()
    # initialize list of sequences, list of all strings, and a temporary
    # storage list, respectively
    SEQS = []
    mystrings = []
    temp_seq = []
    # get a list of strings from the file, stripping the newline character
    for x in F:
        mystrings.append(x.strip("\n"))
    # if the string in question is a nucleotide sequence (without ">"),
    # store it in a temporary list until a ">" is hit, then join the
    # temporary list and append the result to the list of sequences SEQS
    for i in range(1, len(mystrings)):
        if ">" not in mystrings[i]:
            temp_seq.append(mystrings[i])
        else:
            SEQS.append(("").join(temp_seq))
            temp_seq = []
    SEQS.append(("").join(temp_seq))
    # set up list of nucleotide counts for A, C, G and T, in that order
    ACGT = [[0 for i in range(0, len(SEQS[0]))],
            [0 for i in range(0, len(SEQS[0]))],
            [0 for i in range(0, len(SEQS[0]))],
            [0 for i in range(0, len(SEQS[0]))]]
    # assumed to be equal-length sequences; counting the amount of shared
    # nucleotides in each column
    for i in range(0, len(SEQS[0]) - 1):
        for j in range(0, len(SEQS)):
            if SEQS[j][i] == "A":
                ACGT[0][i] += 1
            elif SEQS[j][i] == "C":
                ACGT[1][i] += 1
            elif SEQS[j][i] == "G":
                ACGT[2][i] += 1
            elif SEQS[j][i] == "T":
                ACGT[3][i] += 1
    ancstr = ""
    TR_ACGT = list(zip(*ACGT))
    acgt = ["A: ", "C: ", "G: ", "T: "]
    for i in range(0, len(TR_ACGT) - 1):
        comp = TR_ACGT[i]
        if comp.index(max(comp)) == 0:
            ancstr += "A"
        elif comp.index(max(comp)) == 1:
            ancstr += "C"
        elif comp.index(max(comp)) == 2:
            ancstr += "G"
        elif comp.index(max(comp)) == 3:
            ancstr += "T"
    '''
    writing to file... trying to get it to write as
        consensus sequence
        A: blah (1 line)
        C: blah (1 line)
        G: blah (1 line)
        T: blah (1 line)
    which works for small sequences, but for larger sequences python keeps
    adding newlines if the string in question is very long...
    '''
    myfile = "myconsensus.txt"
    writing_strings = [acgt[i] + ' '.join(str(n) for n in ACGT[i] for i in range(0, len(ACGT))) for i in range(0, len(acgt))]
    with open(myfile, 'w') as D:
        D.writelines(ancstr)
        D.writelines("\n")
        for i in range(0, len(writing_strings)):
            D.writelines(writing_strings[i])
            D.writelines("\n")

cons("rosalind_cons.txt")
Your code is totally fine except for this line:

writing_strings = [acgt[i] + ' '.join(str(n) for n in ACGT[i] for i in range(0, len(ACGT))) for i in range(0, len(acgt))]

You accidentally replicate your data: the inner generator loops over i a second time. Try replacing it with:

writing_strings = [acgt[i] + str(ACGT[i])[1:-1] for i in range(0, len(acgt))]

and then write each entry to your output file as follows:

D.write(writing_strings[i])

Slicing str(ACGT[i]) with [1:-1] is a lazy way to get rid of the brackets from your list.
error comparing sequences - string interpreted as number
I'm trying to do something similar to my previous question. My purpose is to join all sequences that are equal, but this time instead of letters I have numbers. The alignment file can be found here - phylip file. The problem is that when I try to do this:

records = list(SeqIO.parse(file(filename), 'phylip'))

I get this error:

ValueError: Sequence 1 length 49, expected length 1001000000100000100000001000000000000000

I don't understand why, because this is the second file I'm creating and the first one worked perfectly. Below is the code used to build the alignment file:

fl.write('\t')
fl.write(str(161))
fl.write('\t')
fl.write(str(size))
fl.write('\n')
for i in info_plex:
    if 'ref' in i[0]:
        i[0] = 'H37Rv'
    fl.write(str(i[0]))
    num = 10 - len(i[0])
    fl.write(' ' * num)
    for x in i[1:]:
        fl.write(str(x))
    fl.write('\n')

So it shouldn't interpret 1001000000100000100000001000000000000000 as a number, since it's a string. Any ideas? Thank you!
Your PHYLIP file is broken: the header says 161 sequences, but there are 166. After fixing that, the current version of Biopython loads your file fine. Maybe use len(info_plex) when creating the header line. P.S. It would have been a good idea to include the version of Biopython in your question.
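A quick, Biopython-free sanity check of the declared count against the actual number of records can look like this; a sketch for non-interleaved (relaxed) PHYLIP only, with the helper name and sample data being my own:

```python
def check_phylip_header(text):
    """Return (declared, actual) sequence counts for a sequential PHYLIP file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0].split()[0])   # first field of the header line
    actual = len(lines) - 1               # every other non-blank line is a record
    return declared, actual

# Hypothetical file whose header claims 2 sequences but contains 3:
sample = " 2 10\nH37Rv     1001000000\nseq2      0100000001\nseq3      1110000000\n"
declared, actual = check_phylip_header(sample)
print(declared, actual)  # 2 3
```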
The code of Kevin Jacobs in your former question employs Biopython, which uses sequences of type Seq that «are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file formats.»

«There are two important differences between Seq objects and standard Python strings. (...) First of all, they have different methods. (...) Secondly, the Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string "mean", and how they should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines, Cysteines and Threonines? The alphabet object is perhaps the important thing that makes the Seq object more than just a string.» (http://biopython.org/DIST/docs/tutorial/Tutorial.html)

The reason for your problem is simply that SeqIO.parse() can't create Seq objects from a file containing characters for which there is no suitable alphabet attribute. So you must use another method rather than forcing an unsuited one onto a different problem.
Here's my way:

from itertools import groupby
from operator import itemgetter
import re

regx = re.compile('^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))

print 'len(records) == %s\n' % len(records)
n = 0
for seq, equal in groupby(records, itemgetter(1)):
    ids = tuple(x[0] for x in equal)
    if len(ids) > 1:
        print '>%s :\n%s' % (','.join(ids), seq)
    else:
        n += 1
print '\nNumber of unique occurrences : %s' % n

result:

len(records) == 165

>154995,168481 :
0000000000001000000010000100000001000000000000000
>123031,74772 :
0000000000001111000101100011100000100010000000000
>176816,178586,80016 :
0100000000000010010010000010110011100000000000000
>129575,45329 :
0100000000101101100000101110001000000100000000000

Number of unique occurrences : 156

Edit

I've understood MY problem: I had left 'fasta' instead of 'phylip' in my code. 'phylip' is a valid value for the alphabet attribute; with it, this works fine:

records = list(SeqIO.parse(file('pastie-2486250.rb'), 'phylip'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)
ecr = []
for seq, equal in groupby(records, seq_getter):
    ids = tuple(s.id for s in equal)
    if len(ids) > 1:
        ecr.append('>%s\n%s' % (','.join(ids), seq))
print '\n'.join(ecr)

produces

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>154995,168481
0000000000001000000010000100000001000000000000000
>123031,74772
0000000000001111000101100011100000100010000000000
>176816,178586,80016
0100000000000010010010000010110011100000000000000
>129575,45329
0100000000101101100000101110001000000100000000000

There is an incredible amount of characters ,,,,,,,,,,,,,,,, before the interesting data; I wonder what they are. But my code isn't useless. See:

from time import clock
from itertools import groupby
from operator import itemgetter
import re
from Bio import SeqIO

def seq_getter(s):
    return str(s.seq)

t0 = clock()
with open('pastie-2486250.rb') as f:
    records = list(SeqIO.parse(f, 'phylip'))
records.sort(key=seq_getter)
print clock() - t0, 'seconds'

t0 = clock()
regx = re.compile('^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))
print clock() - t0, 'seconds'

result:

12.4826178327 seconds
0.228640588399 seconds

ratio = 55!
best way to compare sequence of letters inside file?
I have a file that has lots of sequences of letters. Some of these sequences might be equal, so I would like to compare them, all to all. I'm doing something like this, but it isn't exactly what I wanted:

for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            pass
        else:
            for el in line:
                if elem == el:
                    print elem, el

Example of the file:

>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA

So what I want is to know whether any sequence is totally equal to 1, or to 2, and so on.
If the goal is to simply group like sequences together, then simply sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:

from itertools import groupby
from Bio import SeqIO

records = list(SeqIO.parse(file('spoo.fa'), 'fasta'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)
for seq, equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print '>%s' % ids
    print seq

Output:

>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general, for this type of work you may want to investigate Biopython, which has lots of functionality for parsing and otherwise dealing with sequences. However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hash of the sequences. Python offers two built-in data types that use hashes: set and dict. It's best to use dict here, as we can store the line numbers of all the matches.

I've assumed the file has identifiers and labels on alternate lines, so if we split the file text on newlines we can take one line as the id and the next as the sequence to match. We then use a dict with the sequence as a key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict: if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.

So when we've finished working through the file, the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.

from collections import defaultdict

lines = filetext.split("\n")
sequences = defaultdict(list)

while lines:
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

results = [match for match in sequences.values() if len(match) > 1]
print results
The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys, and the numbers (the first part of each line) where these sequences occur as values.

#!/usr/bin/python
import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)

Sample output:

etc#etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']

Update: Changed code to use defaultdict and a for loop. Thanks #KennyTM.
Update 2: Changed code to use append rather than +. Thanks #Dave Webb.