Fastest way to count instances of substrings in string Python3.6 - python

I have been working on a program which requires the counting of sub-strings (up to 4000 sub-strings of 2-6 characters located in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string, however, this solution does not work for me. Since my sub-strings are DNA sequences, many of my sub-strings are repetitive instances of a single character (e.g. 'AA'); therefore, 'AAA' will be interpreted as a single instance of 'AA' rather than two instances if I split the string by 'AA'. My current solution is using nested loops, but I'm hoping there is a faster way as this code takes 5+ minutes for a single main string. Thanks in advance!
from itertools import product  # needed for building every possible k-mer

def getKmers(self, kmer):
    self.kmer_dict = {}
    # enumerate every possible k-mer of the requested length
    kmer_tuples = list(product(['A', 'C', 'G', 'T'], repeat=kmer))
    kmer_list = []
    for x in range(len(kmer_tuples)):
        new_kmer = ''
        for y in range(kmer):
            new_kmer += kmer_tuples[x][y]
        kmer_list.append(new_kmer)
    for x in range(len(kmer_list)):
        self.kmer_dict[kmer_list[x]] = 0
    # slide over the sequence and test every window against every k-mer
    for x in range(len(self.sequence) - kmer + 1):  # + 1 so the final window is included
        for substr in kmer_list:
            if self.sequence[x:x+kmer] == substr:
                self.kmer_dict[substr] += 1
                break
    return self.kmer_dict

For counting overlapping substrings of DNA, you can use Biopython:
>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2
Disclaimer: I wrote this method, see commit 97709cc.
However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).
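Applied to the original problem, it might look roughly like this; the sequence and k-mer list below are short stand-ins for the real ~400,000-character string and the list of 2-6 character substrings:
from Bio.Seq import Seq

sequence = Seq("ACGTAAACGTAA")   # stand-in for the ~400,000-character main string
kmers = ["AA", "AC", "GT"]       # stand-in for the 2-6 character substrings in the list

# count_overlap counts overlapping occurrences, so 'AAA' yields two hits for 'AA'
counts = {k: sequence.count_overlap(k) for k in kmers}
print(counts)                    # {'AA': 3, 'AC': 2, 'GT': 2}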

Of course Python is fully able to perform these string searches. But instead of re-inventing all the wheels you will need, one screw at a time, you would be better off using a more specialized tool inside Python to deal with your problem - it looks like the BioPython project is the most actively maintained and complete one for this sort of problem.
Short post with an example resembling your problem:
https://dodona.ugent.be/nl/exercises/1377336647/
Link to the BioPython project documentation: https://biopython.org/wiki/Documentation
(if the problem were simply overlapping strings, then the 3rd party "regex" module would be the way to go - https://pypi.org/project/regex/ - as the built-in engine in Python's re module can't find overlapping matches either)
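For example, a rough sketch with the third-party regex module, whose findall accepts an overlapped flag:
import regex   # pip install regex

# findall with overlapped=True sees both 'AA' instances inside 'AAA'
print(len(regex.findall("AA", "AAA", overlapped=True)))   # 2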

Related

Python/Biopython: How to search for a reference sequence (string) in a sequence with gaps?

I am facing the following problem and have not found a solution yet:
I am working on a tool for sequence analysis which uses a file with reference sequences and tries to find one of these reference sequences in a test sequence.
The problem is that the test sequence might contain gaps (for example: ATG---TCA).
I want my tool to find a specific reference sequence as substring of the test sequence even if the reference sequence is interrupted by gaps (-) in the test sequence.
For example:
one of my reference sequences:
a = TGTAACGAACGG
my test sequence:
b = ACCTTGT--CGAA-GGAGT
(the part corresponding to the reference sequence is TGT--CGAA-GG)
I thought about regular expressions and tried to work myself into it, but if I am not wrong, regular expressions only work the other way round. So I would need to include the gap positions as regular expressions in the reference sequence and then map it against the test sequence.
However, I do not know the positions, the length and the number of gaps in the test sequence.
My idea was to exchange gap positions (so all -) in the test sequence string into some kind of regular expression, or into a special character which stands for any other character in the reference sequence. Then I would compare the unmodified reference sequences against my modified test sequence...
Unfortunately I have not found a function in Python for string search, or a type of regular expression, which could do this.
Thank you very much!
There's good news and there's bad news...
Bad news first: What you are trying to do is not easy, and regex is really not the way to do it. In a simple case regex could be made to work (maybe), but it will be inefficient and would not scale.
However, the good news is that this is a well understood problem in bioinformatics (e.g. see https://en.wikipedia.org/wiki/Sequence_alignment). Even better news is that there are tools in Biopython that can help you. E.g. http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
EDIT
From the discussion below it seems you are saying that 'b' is likely to be very long, but assuming 'a' is still short (12 bases in your example above) I think you can tackle this by iterating over every 12-mer in 'b'. I.e. divide 'b' into sequences that are 12 bases long (obviously you'll end up with a lot!). You can then easily compare the two sequences. If you really want to use regex (and I still advise you not to) then you can replace the '-' with a '.' and do a simple match. E.g.
import re

# a is the reference
a = 'TGTAACGAACGG'
# b is a 12-mer taken from the sequence of interest; in reality you'll be doing
# this test for every possible 12-mer in the sequence
b = 'TGT--CGAA-GG'
b = b.replace('-', '.')
r = re.compile(b)
m = r.match(a)
print(m)
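A rough sketch of the window-by-window idea described above, assuming (as in the example) that each gap character stands in for exactly one reference base:
import re

a = 'TGTAACGAACGG'           # the reference
b = 'ACCTTGT--CGAA-GGAGT'    # the test sequence containing gaps
k = len(a)

# slide a window of length k over b and treat '-' as a wildcard
for i in range(len(b) - k + 1):
    window = b[i:i + k]
    if re.fullmatch(window.replace('-', '.'), a):
        print('match at offset', i, ':', window)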
You could do this:
import re
a = 'TGTAACGAACGG'
b = 'ACCTTGT--CGAA-GGAGT'
temp_b = re.sub(r'[\W_]+', '', b) #removes everything that isn't a number or letter
if a in temp_b:
    pass  # do something

Find which lines in a file contain certain characters

Is there a way to find out if a string contains any one of the characters in a set with python?
It's straightforward to do it with a single character, but I need to check and see if a string contains any one of a set of bad characters.
Specifically, suppose I have a string:
s = 'amanaplanacanalpanama~012345'
and I want to see if the string contains any vowels:
bad_chars = 'aeiou'
and do this in a for loop for each line in a file:
if [any one or more of the bad_chars] in s:
    do something
I am scanning a large file so if there is a faster method to this, that would be ideal. Also, not every bad character has to be checked---so long as one is encountered that is enough to end the search.
I'm not sure if there is a builtin function or easy way to implement this, but I haven't come across anything yet. Any pointers would be much appreciated!
any((c in badChars) for c in yourString)
or
any((c in yourString) for c in badChars) # extensionally equivalent, slower
or
set(yourString) & set(badChars) # extensionally equivalent, slower
"so long as one is encountered that is enough to end the search." - This will be true if you use the first method.
You say you are concerned with performance: performance should not be an issue unless you are dealing with a huge amount of data. If you encounter issues, you can try:
Regexes
edit Previously I had written a section here on using regexes, via the re module, programmatically generating a regex that consisted of a single character class [...] and using .finditer, with the caveat that putting a simple backslash before everything might not work correctly. Indeed, after testing it, that is the case, and I would definitely not recommend this method. Using this approach would require reverse-engineering the entire (slightly complex) sub-grammar of regex character classes (e.g. you might have characters like \ followed by w, like ] or [, or like -, and merely escaping some, like \w, may give them a new meaning).
Sets
Since checking c in badChars scans the badChars string each time (O(len(badChars)) per check), it may be justifiable to first convert badChars into a set so each membership test is O(1) on average, if you have many badChars:
badCharSet = set(badChars)
any((c in badCharSet) for c in yourString)
(it may be possible to make that a one-liner any((c in set(yourString)) for c in badChars), depending on how smart the python compiler is)
Do you really need to do this line-by-line?
It may be faster to do this once for the entire file O(#badchars), than once for every line in the file O(#lines*#badchars), though the asymptotic constants may be such that it won't matter.
Use python's any function.
if any((bad_char in my_string) for bad_char in bad_chars):
    # do something
This should be very efficient and clear. It uses sets:
#!/usr/bin/python
bad_chars = set('aeiou')
with open('/etc/passwd', 'r') as file_:
    file_string = file_.read()
file_chars = set(file_string)
if file_chars & bad_chars:
    print('found something bad')
This regular expression is twice as fast as any with my minimal testing. You should try it with your own data.
import re

r = re.compile('[aeiou]')
if r.search(s):
    pass  # do something
The following Python code should print out any character in bad_chars if it exists in s:
for i in bad_chars:
    if i in s:
        print(i)  # do something
You could also use Python's built-in any, for example:
>>> any(e for e in bad_chars if e in s)
True

How to ignore numbers when doing string.startswith() in python?

I have a directory with a large number of files. The file names are similar to the following: the(number)one(number), where (number) can be any number. There are also files with the name: the(number), where (number) can be any number. I was wondering how I can count the number of files with the additional "one(number)" at the end of their file name.
Let's say I have the list of file names; I was thinking of doing
for n in list:
    if n.startswith(the(number)one):
        add one to a counter
Is there any way for it to accept any number in the (number) space when doing a startswith?
Example:
the34one5
the37one2
the444one3
the87one8
the34
the32
This should return 4.
Use a regex matching 'one\d+' using the re module.
import re

counter = 0
for n in list:
    if re.search(r"one\d+", n):
        counter += 1  # add one to a counter
If you want to make it very accurate, you can even do:
counter = 0
for n in list:
    if re.search(r"^the\d+one\d+$", n):
        counter += 1  # add one to a counter
This will also guard against any non-digit characters between "the" and "one", and won't allow anything before 'the' or after the last digit.
You should start learning regexp now:
they let you do complex text analysis in a blink that would be hard to code manually
they work almost the same from one language to another, making you more flexible
if you encounter code that uses them, you will be puzzled if you haven't learned them, because they're not something you can guess
the sooner you know them, the sooner you'll learn when NOT (hint) to use them, which is ultimately as important as knowing them.
The easiest way to do this probably is glob.glob():
import glob

number = len(glob.glob("/path/to/files/the*one*"))
Note that * here will match any string, not just numbers.
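If the digits-only constraint matters, a sketch that lets glob list the candidate names and a regex (as in the other answers) do the exact check might look like this; the path is a placeholder as above:
import glob
import os
import re

candidates = glob.glob("/path/to/files/the*one*")
pattern = re.compile(r"the\d+one\d+")
number = sum(1 for path in candidates if pattern.fullmatch(os.path.basename(path)))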
The same as a one-liner, which also answers the question more precisely since it matches 'the' as well:
import re

count = len([name for name in list if re.match(r'the\d+one', name)])

algorithm for testing multiple substrings in multiple strings

I have several million strings, X, each with less than 20 or so words. I also have a list of several thousand candidate substrings, C. For each x in X, I want to see if there are any strings in C that are contained in x. Right now I am using a naive double for loop, but it's been a while and it hasn't finished yet... Any suggestions? I'm using Python, if anyone knows of a nice implementation, but links for any language or general algorithms would be nice too.
Encode one of your sets of strings as a trie (I recommend the bigger set). Lookup time should be faster than an imperfect hash and you will save some memory too.
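A minimal sketch of that idea in plain Python, using nested dicts as trie nodes ('$' is just an end-of-pattern marker chosen here):
def build_trie(patterns):
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node['$'] = True          # marks the end of a complete pattern
    return root

def contains_any(x, root):
    # try to walk the trie starting at every position of x
    for i in range(len(x)):
        node = root
        for ch in x[i:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:
                return True       # some candidate ends here, so x contains it
    return False

trie = build_trie(['bla', 'fubar'])
print(contains_any('a fubar example', trie))   # True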
It's gonna be a long while. You have to check every one of those several million strings against every one of those several thousand candidate substrings, meaning that you will be doing (several million * several thousand) string comparisons. Yeah, that will take a while.
If this is something that you're only going to do once or infrequently, I would suggest using fgrep. If this is something that you're going to do often, then you want to look into implementing something like the Aho-Corasick string matching algorithm.
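For instance, a sketch using the third-party pyahocorasick package (pip install pyahocorasick); the calls below follow that package's documented API, but verify the details against its docs:
import ahocorasick

C = ['foo', 'bar baz']           # stand-in for the several thousand candidate substrings
X = ['a foo here', 'nothing']    # stand-in for the several million strings

automaton = ahocorasick.Automaton()
for idx, c in enumerate(C):
    automaton.add_word(c, (idx, c))
automaton.make_automaton()

for x in X:
    if any(True for _ in automaton.iter(x)):
        pass  # at least one candidate substring occurs in x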
If your x in X only contains words, and you only want to match whole words, you could do the following:
Insert your keywords into a set, which makes each lookup fast (average O(1)), and then check for every word in x whether it is contained in that set,
like:
keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split(' '):
        if w in keywords:
            pass  # do what you need to do
A good alternative would be to use Google's re2 library, which uses super nice automata theory to produce efficient matchers. (http://code.google.com/p/re2/)
EDIT: Be sure you use proper buffering and something in a compiled language; that makes it a lot faster. If it's less than a couple of gigabytes, it should work with Python too.
You could try to use a regex:
import re

subs = re.compile('|'.join(re.escape(c) for c in C))  # escape in case candidates contain regex metacharacters
for x in X:
    if subs.search(x):
        print('found')
Have a look at http://en.wikipedia.org/wiki/Aho-Corasick. You can build a pattern-matcher for a set of fixed strings in time linear in the total size of the strings, then search in text, or multiple sections of text, in time linear in the length of the text + the number of matches found.
Another fast exact pattern matcher is http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm

Speed of many regular expressions in python

I'm writing a python program that deals with a fair amount of strings/files. My problem is that I'm going to be presented with a fairly short piece of text, and I'm going to need to search it for instances of a fairly broad range of words/phrases.
I'm thinking I'll need to compile regular expressions as a way of matching these words/phrases in the text. My concern, however, is that this will take a lot of time.
My question is how fast is the process of repeatedly compiling regular expressions, and then searching through a small body of text to find matches? Would I be better off using some string method?
Edit: So, I guess an example of my question would be: how expensive would it be to compile and search with one regular expression versus, say, iterating 'if "word" in string' 5 times?
You should try to compile all your regexps into a single one using the | operator. That way, the regexp engine will do most of the optimizations for you. Use the grouping operator () to determine which regexp matched.
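A minimal sketch of that suggestion (the patterns here are made up, and this only works cleanly if the individual patterns contain no groups of their own):
import re

patterns = [r'foo\d+', r'bar', r'baz']                  # made-up individual regexes
combined = re.compile('|'.join('(%s)' % p for p in patterns))

m = combined.search('xx bar yy')
if m:
    # lastindex is the number of the group that matched, i.e. which pattern hit
    print(patterns[m.lastindex - 1], 'matched:', m.group(0))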
If speed is of the essence, you are better off running some tests before you decide how to code your production application.
First of all, you said that you are searching for words which suggests that you may be able to do this using split() to break up the string on whitespace. And then use simple string comparisons to do your search.
Definitely do compile your regular expressions and do a timing test comparing that with the plain string functions. Check the documentation for the string class for a full list.
Your requirement appears to be searching a text for the first occurrence of any one of a collection of strings. Presumably you then wish to restart the search to find the next occurrence, and so on until the searched string is exhausted. Only plain old string comparison is involved.
The classic algorithm for this task is Aho-Corasick for which there is a Python extension (written in C). This should beat the socks off any alternative that's using the re module.
If you want to know how fast regex pattern compilation is, you need to benchmark it.
Here is how I do that. It compiles each pattern 1 million times.
import time
import re

def taken(f):
    def wrap(*arg):
        t1, r, t2 = time.time(), f(*arg), time.time()
        print(t2 - t1, "s taken")
        return r
    return wrap

@taken
def regex_compile_test(x):
    for i in range(1000000):
        re.compile(x)
    print("for", x, end=" ")

# sample tests
regex_compile_test("a")
regex_compile_test("[a-z]")
regex_compile_test(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}")
It took around 5 min for each pattern on my computer.
for a 4.88999986649 s taken
for [a-z] 4.70300006866 s taken
for [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} 4.78200006485 s taken
The real bottleneck is not in compiling patterns, it's in the actual text operations like re.findall and re.sub. If you use those against several MB of text, it's quite slow.
If your text is fixed, use normal str.find; it's faster than regex.
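For example, a rough sketch of counting occurrences of a fixed word with str.find alone (the text and word are made up):
text = "spam and eggs and spam"
word = "spam"

count, start = 0, 0
while True:
    i = text.find(word, start)
    if i == -1:
        break
    count += 1
    start = i + 1       # use i + len(word) instead if overlapping hits should not count
print(count)            # 2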
Actually, if you give us samples of your text and your regex patterns, we could give you a better idea; there are many great regex and Python people out there.
Hope this helps; sorry if my answer couldn't help you.
When you compile the regexp, it is converted into a state machine representation. Provided the regexp is efficiently expressed, it should still be very fast to match. Compiling the regexp can be expensive though, so you will want to do that up front, and as infrequently as possible. Ultimately though, only you can answer if it is fast enough for your requirements.
There are other string searching approaches, such as the Boyer-Moore algorithm. But I'd wager the complexity of searching for multiple separate strings is much higher than that of a single regexp that can switch on each successive character.
This is a question that can readily be answered by just trying it.
>>> import re
>>> import timeit
>>> find = ['foo', 'bar', 'baz']
>>> pattern = re.compile("|".join(find))
>>> with open('c:\\temp\\words.txt', 'r') as f:
...     words = f.readlines()
>>> len(words)
235882
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if w.find(s) >= 0), words)', 'from __main__ import find, words', number=30)
18.404569854548527
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if s in w), words)', 'from __main__ import find, words', number=30)
10.953313759150944
>>> timeit.timeit('r = filter(lambda w: pattern.search(w), words)', 'from __main__ import pattern, words', number=30)
6.8793022576891758
It looks like you can reasonably expect regular expressions to be faster than using find or in. Though if I were you I'd repeat this test with a case that was more like your real data.
If you're just searching for a particular substring, use str.find() instead.
Depending on what you're doing it might be better to use a tokenizer and loop through the tokens to find matches.
However, when it comes to short pieces of text, regexes have incredibly good performance. Personally I remember only running into problems when text sizes became ridiculous, like 100k words or something like that.
Furthermore, if you are worried about the speed of actual regex compilation rather than matching, you might benefit from creating a daemon that compiles all the regexes then goes through all the pieces of text in a big loop or runs as a service. This way you will only have to compile the regexes once.
In the general case, you can use the "in" keyword:
for line in open("file"):
    if "word" in line:
        print(line.rstrip())
regex is usually not needed when you use Python :)
