best way to compare sequences of letters inside a file? - python
I have a file that has lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them, all against all.
I'm doing something like this, but it isn't exactly what I wanted:
for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            pass
        else:
            for el in line:
                if elem == el:
                    print elem, el
example of the file:
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
So what I want is to know if any sequence is totally equal to 1, or to 2, and so on.
If the goal is simply to group like sequences together, then sorting the data will do the trick. Here is a solution that uses Biopython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
from itertools import groupby
from Bio import SeqIO

records = list(SeqIO.parse(file('spoo.fa'), 'fasta'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)

for seq, equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print '>%s' % ids
    print seq
Output:
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general, for this type of work you may want to investigate Biopython, which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hashes of the sequences. Python offers two built-in data types that use hashing: set and dict. It's best to use dict here, as we can store the line numbers of all the matches.
I've assumed the file has identifiers and sequences on alternate lines, so if we split the file text on newlines we can take one line as the id and the next as the sequence to match.
We then use a dict with the sequence as a key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.
So when we've finished working through the file the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.
from collections import defaultdict

lines = filetext.split("\n")
sequences = defaultdict(list)

while lines:
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

results = [match for match in sequences.values() if len(match) > 1]
print results
The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys and, as values, the numbers of the lines where these sequences occur.
#!/usr/bin/python
import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)
Sample output:
etc#etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']
Update

Changed code to use defaultdict and a for loop. Thanks @KennyTM.

Update 2

Changed code to use append rather than +. Thanks @Dave Webb.
Related
python script not joining strings as expected
I have a list of lists of sequences, and a corresponding list of lists of names.

testSequences = [
    ['aaaa', 'cccc'],
    ['tt', 'gg'],
    ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
    ['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
    ['>xx_redFish |zxx', '>xx_blueFish |zxx'],
    ['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]

I also have a list of all the identifying parts of the names:

taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']

I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name. If a name and sequence does not appear in one of the lists in the lists (i.e. no redFish or blueFish in the first list of testNames) I want to add in a string of hyphens the same length as the sequences in that list. This would give me this output:

['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']

I have this piece of code to do this:

complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete

"complete" should give my final output as explained above, but it comes out looking like this:

['', '', '', '']

How can I fix my code to give me the correct answer?
The main issue with your code, which makes it very hard to understand, is that you're not really leveraging the language elements that make Python so strong. Here's a solution to your problem that works:

test_sequences = [
    ['aaaa', 'cccc'],
    ['tt', 'gg'],
    ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
    ['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
    ['>xx_redFish |zxx', '>xx_blueFish |zxx'],
    ['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']

def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])

result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)

The generator get_seqs pairs up lists from test_sequences and test_names and, for each pair, tries to find the sequence (seq) whose name (name) matches, and yields it, or yields a string of the right number of hyphens for that list of sequences. The generator (a function that yields multiple values) has code that quite literally follows the explanation above.

The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.

You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)

Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't, to avoid further deviating from your example.
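Incidentally, the reason the original code printed ['', '', '', ''] is that str.join does not concatenate in place: it builds a new string, using the original string as the separator between the items of its argument, and that return value was being thrown away. A quick demonstration:

```python
s = ''
# str.join returns a NEW string: the characters of 'abc' joined with s between them
result = s.join('abc')
print(result)  # 'abc'
print(s)       # '' - s itself is unchanged; strings are immutable
s = s + 'abc'  # concatenation must be assigned back
print(s)       # 'abc'
```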
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures). The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.

differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):  # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))
for array in data:
    final.append(''.join(array))
print(final)

Prints:

['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']
How to form a list of lists?
My code below extracts some portion from a file and displays the results in separate lists. I want to form a list of all these lists which were filtered out. I tried to form it in my code, but when I try to print it out, I get an empty list.

import re
hand = open('mbox.txt')
for line in hand:
    my_list = list()
    line = line.rstrip()
    # Extracting out the data from file
    x = re.findall('^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])', line)
    # checking the length and checking if the data is not present in the list
    if len(x) != 0 and x not in my_list:
        my_list.append(x[0])
        print my_list

The filtered list is:

['15:46:24']
['15:03:18']
['14:50:18']
['11:37:30']
['11:35:08']
['11:12:37']

and so on.
A couple of things to note. If you are repeatedly doing regex matching, I suggest you compile the pattern first and then do the matching. Also, you don't need to check the length of a container manually to get its bool value - just do if container:. Use the builtin filter to remove empty items, or you can use a set, which avoids duplicates automatically. I am also not sure why you are stripping the whitespace characters before doing the regex match. Is that necessary?

import re

pattern = re.compile(r"^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])")
data = []
with open("mbox.txt") as f:
    for line in f.readlines():
        matches = filter(None, pattern.findall(line))
        data.append(list(matches))
print(data)

This is all you need to get that list of lists. Compiling the pattern once and using filter keeps the code compact.
Just move my_list = list() out of the for loop; otherwise you recreate an empty list on every iteration.
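A minimal sketch of that fix, with a few made-up mbox-style lines standing in for the contents of mbox.txt:

```python
import re

# Hypothetical sample lines standing in for the contents of mbox.txt.
sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
    "X-DSPAM-Confidence: 0.8475",
    "From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008",
]

my_list = []  # created once, before the loop, so it accumulates across lines
pattern = re.compile(r'^From .* ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])')
for line in sample_lines:
    x = pattern.findall(line.rstrip())
    if x and x[0] not in my_list:
        my_list.append(x[0])

print(my_list)  # ['09:14:16', '18:10:48']
```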
Python: create dict from list and auto-gen/increment the keys (list is the actual key values)?
I've searched pretty hard and can't find a question that exactly pertains to what I want to do. I have a file called "words" that has about 1000 lines of random A-Z sorted words...

10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
abacus
abalone
abandon
abase
abash
abate
abater
abbas
abbe
abbey
abbot
Abbott
abbreviate
abc
abdicate
abdomen
abdominal
abduct
Abe
abed
Abel
Abelian

I am trying to load this file into a dictionary, where the words are the values and the keys are auto-generated/auto-incremented for each word, e.g. {0: '10th', 1: '1st', 2: '2nd'} ...etc., etc... Below is the code I've hobbled together so far. It seems to sort of work, but it only shows me the last entry in the file as the only dict pair:

f3data = open('words')
mydict = {}
for line in f3data:
    print line.strip()
    cmyline = line.split()
    key = +1
    mydict[key] = cmyline
print mydict
key = +1

+1 is the same thing as 1. I assume you meant key += 1. I also can't see a reason why you'd split each line when there's only one item per line. However, there's really no reason to do the looping yourself:

with open('words') as f3data:
    mydict = dict(enumerate(line.strip() for line in f3data))
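To spell out the difference (key = +1 is the unary plus operator applied to 1, so it just rebinds key to 1 on every pass):

```python
key = 0
key = +1   # unary plus: rebinds key to 1 every time through a loop
key = +1
print(key)  # 1

key = 0
key += 1   # augmented assignment: actually increments
key += 1
print(key)  # 2
```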
dict(enumerate(x.rstrip() for x in f3data))

But your error is key = +1; it should be key += 1.
f3data = open('words')
print f3data.readlines()
The use of zero-based numeric keys in a dict is very suspicious. Consider whether a simple list would suffice. Here is an example using a list comprehension:

>>> mylist = [word.strip() for word in open('/usr/share/dict/words')]
>>> mylist[1]
'A'
>>> mylist[10]
"Aaron's"
>>> mylist[100]
"Addie's"
>>> mylist[1000]
"Armand's"
>>> mylist[10000]
"Loyd's"

I use str.strip() to remove whitespace and newlines, which are present in /usr/share/dict/words. This may not be necessary with your data. However, if you really need a dictionary, Python's enumerate() built-in function is your friend here, and you can pass the output directly into the dict() function to create it:

>>> mydict = dict(enumerate(word.strip() for word in open('/usr/share/dict/words')))
>>> mydict[1]
'A'
>>> mydict[10]
"Aaron's"
>>> mydict[100]
"Addie's"
>>> mydict[1000]
"Armand's"
>>> mydict[10000]
"Loyd's"
With keys that dense, you don't want a dict, you want a list.

with open('words') as fp:
    data = map(str.strip, fp.readlines())

But if you really can't live without a dict:

with open('words') as fp:
    data = dict(enumerate(X.strip() for X in fp))
{index: x.strip() for index, x in enumerate(open('filename.txt'))}

This code uses a dictionary comprehension and the enumerate built-in, which takes an input sequence (in this case the file object, which yields each line when iterated through) and returns an index along with the item. Then a dictionary is built up with the index and text.

One question: why not just use a list if all of your keys are integers?

Finally, your original code should be:

f3data = open('words')
mydict = {}
for index, line in enumerate(f3data):
    cmyline = line.strip()
    mydict[index] = cmyline
print mydict
Putting the words in a dict makes no sense. If you're using numbers as keys, you should be using a list.

from __future__ import with_statement

with open('words.txt', 'r') as f:
    lines = f.readlines()

words = {}
for n, line in enumerate(lines):
    words[n] = line.strip()
print words
error comparing sequences - string interpreted as number
I'm trying to do something similar to my previous question. My purpose is to join all sequences that are equal, but this time instead of letters I have numbers. The alignment file can be found here - phylip file.

The problem is when I try to do this:

records = list(SeqIO.parse(file(filename), 'phylip'))

I get this error:

ValueError: Sequence 1 length 49, expected length 1001000000100000100000001000000000000000

I don't understand why, because this is the second file I'm creating and the first one worked perfectly. Below is the code used to build the alignment file:

fl.write('\t')
fl.write(str(161))
fl.write('\t')
fl.write(str(size))
fl.write('\n')
for i in info_plex:
    if 'ref' in i[0]:
        i[0] = 'H37Rv'
    fl.write(str(i[0]))
    num = 10 - len(i[0])
    fl.write(' ' * num)
    for x in i[1:]:
        fl.write(str(x))
    fl.write('\n')

So it shouldn't interpret 1001000000100000100000001000000000000000 as a number, since it's a string. Any ideas? Thank you!
Your PHYLIP file is broken. The header says 161 sequences, but there are 166. After fixing that, the current version of Biopython seems to load your file fine. Maybe use len(info_plex) when creating the header line. P.S. It would have been a good idea to include the version of Biopython in your question.
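For instance, the header write could be sketched like this. This is a rough sketch with hypothetical info_plex rows, writing to an io.StringIO instead of the real file:

```python
import io

# Hypothetical stand-in for info_plex: an id followed by the 0/1 characters.
info_plex = [['45329', '0', '1', '0'],
             ['74772', '1', '1', '0']]
size = 3  # sequence length

fl = io.StringIO()
# Derive the record count from the data instead of hard-coding 161,
# so the header always matches the number of sequences actually written.
fl.write('\t%d\t%d\n' % (len(info_plex), size))
for i in info_plex:
    fl.write(str(i[0]))
    fl.write(' ' * (10 - len(i[0])))
    for x in i[1:]:
        fl.write(str(x))
    fl.write('\n')

print(fl.getvalue())
```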
The code of Kevin Jacobs in your former question employs Biopython, which uses sequences of type Seq that « are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file formats. »

« There are two important differences between Seq objects and standard Python strings. (...) First of all, they have different methods. (...) Secondly, the Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string "mean", and how they should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines, Cysteines and Threonines? The alphabet object is perhaps the important thing that makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio.Alphabet module. »

http://biopython.org/DIST/docs/tutorial/Tutorial.html

The reason for your problem is simply that SeqIO.parse() can't create Seq objects from a file containing characters for which there is no alphabet attribute able to manage them. So you must use another method, rather than trying to force an unsuited method onto a different problem.
Here's my way:

from itertools import groupby
from operator import itemgetter
import re

regx = re.compile(r'^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))

print 'len(records) == %s\n' % len(records)
n = 0
for seq, equal in groupby(records, itemgetter(1)):
    ids = tuple(x[0] for x in equal)
    if len(ids) > 1:
        print '>%s :\n%s' % (','.join(ids), seq)
    else:
        n += 1
print '\nNumber of unique occurrences : %s' % n

result

len(records) == 165

>154995,168481 :
0000000000001000000010000100000001000000000000000
>123031,74772 :
0000000000001111000101100011100000100010000000000
>176816,178586,80016 :
0100000000000010010010000010110011100000000000000
>129575,45329 :
0100000000101101100000101110001000000100000000000

Number of unique occurrences : 156

Edit

I've understood MY problem: I left 'fasta' instead of 'phylip' in my code. 'phylip' is a valid value for the attribute alphabet; with it, it works fine:

records = list(SeqIO.parse(file('pastie-2486250.rb'), 'phylip'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)
ecr = []
for seq, equal in groupby(records, seq_getter):
    ids = tuple(s.id for s in equal)
    if len(ids) > 1:
        ecr.append('>%s\n%s' % (','.join(ids), seq))
print '\n'.join(ecr)

produces

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>154995,168481
0000000000001000000010000100000001000000000000000
>123031,74772
0000000000001111000101100011100000100010000000000
>176816,178586,80016
0100000000000010010010000010110011100000000000000
>129575,45329
0100000000101101100000101110001000000100000000000

There is an incredible number of , characters before the interesting data; I wonder what they are.

But my code isn't useless. See:

from time import clock
from itertools import groupby
from operator import itemgetter
import re
from Bio import SeqIO

def seq_getter(s):
    return str(s.seq)

t0 = clock()
with open('pastie-2486250.rb') as f:
    records = list(SeqIO.parse(f, 'phylip'))
records.sort(key=seq_getter)
print clock() - t0, 'seconds'

t0 = clock()
regx = re.compile(r'^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))
print clock() - t0, 'seconds'

result

12.4826178327 seconds
0.228640588399 seconds

ratio = 55 !
Importing data from a text file using python
I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters. The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8, and the last one is 5 - but again, some are spaces). What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.
Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:

import struct

def parsefile(filename):
    with open(filename) as myfile:
        for line in myfile:
            line = line.rstrip('\n')
            fields = struct.unpack('11s11s8s8s5s', line)
            if 'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))

Usage:

if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print field

Test data:

1234567890a1234567890a123456781234567812345
something maybe OW d 111111118888888855555
aaaaa bbbbb 1234 1212121233333
other thinganother OW 121212 6666666644444

Output:

(88888888, 55555)
(66666666, 44444)
In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However, you can also create slice objects that you can use later to do the indexing. So you can do something like this:

columns = [slice(11, 22), slice(30, 38), slice(38, 44)]

myfile = open('some/file/path')
for line in myfile:
    fields = [line[column].strip() for column in columns]
    if "OW" in fields[0]:
        value1 = int(fields[1])
        value2 = int(fields[2])
        ....

Separating out the slices into a list makes it easy to change the code if the data format changes, or if you need to do stuff with the other fields.
Here's a function which might help you:

def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size:  # EOF
                return
            row[key] = value
        yield row

An example of how it's used:

from StringIO import StringIO

sample = StringIO("""aaabbbccc
d e f g h i
""")
for row in rows(sample, [('first', 3), ('second', 3), ('third', 4)]):
    print repr(row)

Note that unlike the other answers, this example is not line-delimited (it uses the file purely as a provider of bytes, not an iterator of lines); since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either. The newline is taken into account specifically.

You can test if one string is a substring of another with the 'in' operator. For example:

>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True

So in this case, you might do:

if 'OW' in row['third']:
    stuff()

but you can obviously test any field for any value as you see fit.
entries = ((float(line[30:38]), float(line[38:43]))
           for line in myfile
           if "OW" in line[11:22])

for num1, num2 in entries:
    # whatever
entries = []
with open('my_file.txt', 'r') as f:
    for line in f.read().splitlines():
        line = line.split()
        if line[1].find('OW') >= 0:
            entries.append((int(line[-2]), int(line[-1])))

entries is a list containing tuples of the last two entries.

edit: oops