I am a molecular biologist using Biopython to analyze mutations in genes and my problem is this:
I have a file containing many different sequences (millions), most of which are duplicates. I need to find the duplicates, discard them, and keep one copy of each unique sequence. I was planning to use the editdist module to calculate the edit distance between them all and determine which ones are duplicates, but editdist only works with two strings, not files.
Anyone know how I can use that module with files instead of strings?
Assuming your file consists solely of sequences arranged one sequence per line, I would suggest the following:
seq_file = open('your_file')  # path to your sequence file
sequences = [seq for seq in seq_file]
uniques = list(set(sequences))
Assuming you have the memory for it. How many millions?
ETA:
I was reading the comments above (but don't have comment privileges). Assuming the sequence IDs are the same for any duplicates, this will work. If duplicate sequences can have different sequence IDs, then you would need to know which ID comes first and what sits between them in the file.
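If you do need to handle that case, a minimal sketch that keeps the ID of the first occurrence of each sequence might look like this (the file name 'sequences.txt' and the one-"id sequence"-pair-per-line layout are assumptions):
# Keep the first ID seen for each unique sequence.
first_seen = {}
with open('sequences.txt') as f:  # hypothetical file, one "id sequence" pair per line
    for line in f:
        seq_id, seq = line.split()
        if seq not in first_seen:
            first_seen[seq] = seq_id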
If you want to filter out exact duplicates, you can use Python's built-in set type. As an example:
a = ["tccggatcc", "actcctgct", "tccggatcc"] # You have a list of sequences
s = set(a) # Put that into a set
s then contains 'tccggatcc' and 'actcctgct', with the duplicate removed.
Does it have to be Python?
If the sequences are simply text strings one per line then a shell script will be very efficient:
sort input-file-name | uniq > output-file-name
This will do the job on files up to 2GB on 32 bit Linux.
If you are on Windows then install the GNU utils http://gnuwin32.sourceforge.net/summary.html.
Four things come to mind:
1. You can use a set(), as described by F.X., assuming the unique strings will all fit in memory.
2. You can use one file per sequence, and feed the files to a program like equivs3e: http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html#python-3e
3. You could perhaps use a gdbm as a set, instead of in its usual key-value store role. This is good if you need something that's 100% accurate, but you have too much data to fit all the uniques in virtual memory (a minimal sketch of this option follows the list).
4. You could perhaps use a bloom filter to cut the data down to a more manageable size, if you have a truly huge number of strings to check and lots of duplicates. Basically a bloom filter can say "this is definitely not in the set" or "this is almost definitely in the set". In this way, you can eliminate most of the obvious duplicates before using a more common means to operate on the remaining elements. http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
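To illustrate option 3, here is a minimal sketch of using an on-disk dbm database as a set with Python's standard dbm module; the file names 'sequences.txt', 'uniques.db' and 'unique_sequences.txt' are placeholders:
import dbm

db = dbm.open('uniques.db', 'c')  # on-disk "set": keys are the sequences, values are dummies
with open('sequences.txt') as src, open('unique_sequences.txt', 'w') as dst:
    for line in src:
        seq = line.strip()
        if not seq:
            continue
        key = seq.encode()
        if db.get(key) is None:  # not seen before
            db[key] = b'1'
            dst.write(seq + '\n')
db.close()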
Don't be afraid of files! ;-)
I'm posting an example assuming the following:
it's a text file
one sequence per line
filename = 'sequence.txt'
with open(filename, 'r') as sqfile:
    sequences = sqfile.readlines()  # now we have a list of strings

# discarding the duplicates:
uniques = list(set(sequences))
That's it: by using Python's set type we eliminate all duplicates automagically.
If you have the ID and the sequence on the same line, like:
423401 ttacguactg
you may want to strip out the IDs, like this:
sequences = [s.strip().split()[-1] for s in sequences]
With strip() we remove leading and trailing whitespace from the string, and with split() we split the line into two components: the ID and the sequence.
With [-1] we select the last component (the sequence string) and repack it into our sequence list.
I'm working on a Python function which parses a file containing a list of strings.
Basically, a walked folder structure is parsed to a txt file so I don't have to work on the real RAID while in production. That is also a requirement: working from a txt file containing a list of paths.
lpaths =[
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1025.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1042.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1016.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v1.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v02.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/workspace.cfg',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/SC11_1_Shot004_v01.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/Shot004_camera_v01.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1112.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1034.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1116.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1126.exr'
]
This is a partial list of the cleaned-up version I've already worked out, and it works fine.
The real problem: I need to parse all the frames from a folder into a list so that it holds a proper frame sequence.
There could be 1 frame or 1000, and there are multiple sequences in the same folder, as seen in the list.
My goal is to have a list for each sequence in a folder, so I can push them ahead to do more work down the road.
Code:
import itertools
import pprint as pp

groups = [list(group) for key, group in itertools.groupby(sorted(lpaths), len)]
pp.pprint(groups)
Since you seem to have differing naming conventions, you need to write a function that takes a single string and, possibly using regular expressions, returns an unambiguous key for you to sort on. Let's say your names are critically identified by the shot number, which can be matched by r".*[Ss]hot_?(\d+).*\.ext"; you could return the match as a base-10 integer, discarding any leading zeros.
Since you may also have a version number, you could do a similar operation to get an unambiguous version number (and possibly only process the latest version of a given shot).
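For example, a rough sketch of such a key function; the regular expressions below are assumptions about your naming convention, so adjust them to your actual file names:
import itertools
import re

def shot_key(path):
    # Guess at the naming convention: pull out the shot number and, if present,
    # a version number; -1 means "not found".
    shot = re.search(r'[Ss]hot_?0*(\d+)', path)
    version = re.search(r'[._]v0*(\d+)', path)
    return (int(shot.group(1)) if shot else -1,
            int(version.group(1)) if version else -1)

# Group by (shot, version) instead of by len():
groups = [list(group)
          for key, group in itertools.groupby(sorted(lpaths, key=shot_key), key=shot_key)]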
I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary if they exist. If an element doesn't exist, you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
I'd like to compare two large text files (200 MB) to find the lines they have in common.
How to do that in Python?
Since they are just 200 MB, allocate enough memory, read them in, sort the lines of each file in ascending order, then iterate through both collections of lines in parallel, as in a merge operation, and delete those that occur in only one collection.
Preserve the line numbers in the collections and sort by line number afterwards if you want to output the lines in their original order.
Merge operation: keep one index for each collection; if the lines at both indexes match, increment both indexes, otherwise delete the smaller line and increment just that index. If either index is past the last line, delete all remaining lines in the other collection.
Optimization: use a hash to speed up comparisons a little; compute the hash during the initial read.
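A minimal sketch of that merge-style walk, assuming plain text files named 'one.txt' and 'two.txt':
# Read both files, sort their lines, then walk them in parallel keeping only common lines.
with open('one.txt') as f1, open('two.txt') as f2:
    a = sorted(f1)
    b = sorted(f2)

common = []
i = j = 0
while i < len(a) and j < len(b):
    if a[i] == b[j]:
        common.append(a[i])
        i += 1
        j += 1
    elif a[i] < b[j]:
        i += 1
    else:
        j += 1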
Disclaimer: I really have no idea how efficient this will be for 200 MB, but it's worth a try, I guess.
I tried the following on two ~80 MB files and the result was around 2.7 seconds on an Intel i3 machine with 3 GB of RAM.
f1 = open("one")
f2 = open("two")
print set(f1).intersection(f2)
You may be able to use the standard difflib module. The module offers several ways of creating difference deltas from various kinds of input.
Here's an example from the docs:
>>> from difflib import context_diff
>>> fromfile = open('before.py')
>>> tofile = open('tofile.py')
>>> for line in context_diff(fromfile, tofile, fromfile='before.py', tofile='after.py'):
print line,
I am working on some large (several million line) bioinformatics data sets with the general format:
chromosomeNumber locusStart locusStop sequence moreData
I have other files in this format:
chromosomeNumber locusStart locusStop moreData
What I need to be able to do is read one of each type of file into memory and if the locusStart of a line of the upper file is between the start and stop of any of the lines in the lower file, print the line to output file 1. If the locusStart of that line is not between the start and stop of any lines in the bottom file, then print it to output file 2.
I am currently reading the files in, converting them into dictionaries keyed on chromosome with the corresponding lines as values. I then split each value line into a string, and then do comparisons with the strings. This takes an incredibly long time, and I would like to know if there is a more efficient way to do it.
Thanks.
It seems that for the lower file (which I assume has the second format), the only field you are concerned about is locusStart. Since, from your description, you do not necessarily care about the other data, you could make a set of all of the locusStart values:
locusStart_list = set()
with open(lower_file, 'r') as f:
    for line in f:
        tmp_list = line.strip().split()
        locusStart_list.add(tmp_list[1])
This removes all of the unnecessary line manipulation you do for the bottom file. Then, you can easily compare the locusStart of a field to the set built from the lower file. The set would also remove duplicates, making it a bit faster than using a list.
It sounds like you are going to be doing lots of greater-than/less-than comparisons, so I don't think loading your data into dictionaries is going to improve the speed of the code at all; based on what you've explained, it sounds like you're still looping through every element in one file or the other.
What you need is a different data structure to load your data into and run comparison operations against. Check out the Python bisect module; I think it may provide the data structure that you need to run your comparison operations much more efficiently.
If you can more precisely describe what exactly you're trying to accomplish, we'll be able to help you get started writing your code.
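As a rough illustration of the bisect idea, here is a minimal sketch that assumes the lower file's intervals for a given chromosome do not overlap; lower_intervals is a hypothetical list of (locusStart, locusStop) integer pairs:
import bisect

intervals = sorted(lower_intervals)            # hypothetical [(locusStart, locusStop), ...] for one chromosome
starts = [start for start, stop in intervals]  # parallel list of starts to bisect on

def covered(locus_start):
    # Rightmost interval whose start is <= locus_start; then check its stop.
    i = bisect.bisect_right(starts, locus_start) - 1
    return i >= 0 and locus_start <= intervals[i][1]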
Using a dictionary of the chromosome number is a good idea, as long as you can fit both files into memory.
You then want to sort both lists by locusStart (split the string, convert locusStart to a number--see instructions on sorting if you're unsure how to sort on locusStart alone).
Now you can just walk through your lists: if the lower locusStart is less than the first upper locusStart, put the line in file 2 and go on to the next one. If the lower locusStart is greater than the first upper locusStart, then:
While it is also greater than locusEnd, throw away the beginning of the upper list.
If you find a case where it's greater than locusStart and less than locusEnd, put it in file 1.
Otherwise, put it in file 2.
This should replace what is now probably an O(n^2) algorithm with an O(n log n) one.
I have a file of names and addresses as follows (example line)
OSCAR ,CANNONS ,8 ,STIEGLITZ CIRCUIT
And I want to read it into a dictionary of name and value. Here self.field_list is a list of the name, length and start point of the fixed fields in the file. What ways are there to speed up this method? (python 2.6)
def line_to_dictionary(self, file_line, rec_num):
    file_line = file_line.lower()  # Make it all lowercase
    return_rec = {}  # Return record as a dictionary
    for (field_start, field_length, field_name) in self.field_list:
        field_data = file_line[field_start:field_start+field_length]
        if self.strip_fields == True:  # Strip off white spaces first
            field_data = field_data.strip()
        if field_data != '':  # Only add non-empty fields to dictionary
            return_rec[field_name] = field_data
    # Set hidden fields
    return_rec['_rec_num_'] = rec_num
    return_rec['_dataset_name_'] = self.name
    return return_rec
struct.unpack(), combined with 's' specifiers that carry the field lengths, will tear the string apart faster than slicing.
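For example, a rough sketch with made-up field widths (build the real format string from your self.field_list):
import struct

# Hypothetical fixed-width layout: 10-byte first name, 10-byte surname,
# 4-byte house number, 20-byte street ('x' pad bytes can be used to skip gaps).
record_format = '10s10s4s20s'
size = struct.calcsize(record_format)  # 44 bytes for this layout

line = b'OSCAR     CANNONS   8   STIEGLITZ CIRCUIT   '
fields = struct.unpack(record_format, line[:size].ljust(size))
first_name, surname, number, street = (f.strip() for f in fields)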
Edit: Just saw your remark below about commas. The approach below is fast when it comes to file reading, but it is delimiter-based, and would fail in your case. It's useful in other cases, though.
If you want to read the file really fast, you can use a dedicated module, such as the almost standard Numpy:
data = numpy.loadtxt('file_name.txt', dtype=('S10', 'S8'), delimiter=',') # dtype must be adapted to your column sizes
loadtxt() also allows you to process fields on the fly (with the converters argument). Numpy also allows you to give names to columns (see the doc), so that you can do:
data['name'][42] # Name # 42
The structure obtained is like an Excel array; it is quite memory efficient, compared to a dictionary.
If you really need to use a dictionary, you can use a dedicated loop over the data array read quickly by Numpy, in a way similar to what you have done.
If you want to get some speed up, you can also store field_start+field_length directly in self.field_list, instead of storing field_length.
Also, if field_data != '' can more simply be written as if field_data (if this gives any speed up, it is marginal, though).
I would say that your method is quite fast, compared to what standard Python can do (i.e., without using non-standard, dedicated modules).
If your lines include commas like the example, you can use line.split(',') instead of several slices. This may prove to be faster.
You'll want to use the csv module.
It handles not only CSV, but any CSV-like format, which yours seems to be.
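A minimal sketch of the csv-based route; the field names and the file name below are assumptions, not taken from your code:
import csv

field_names = ['first_name', 'surname', 'number', 'street']  # assumed field order

with open('addresses.txt', 'r') as f:  # hypothetical input file
    reader = csv.reader(f)
    for rec_num, row in enumerate(reader):
        # Build the same kind of record dict, skipping empty fields.
        record = dict((name, value.strip().lower())
                      for name, value in zip(field_names, row)
                      if value.strip())
        record['_rec_num_'] = rec_num
        print(record)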