Let's say I have a string and a list of strings:
a = 'ABCDEFG'
b = ['ABC', 'QRS', 'AHQ']
How can I pull out the string in list b that matches up perfectly with a section of the string a? So the return would be something like ['ABC'].
The most important issue is that I have tens of millions of strings, so time efficiency is essential.
If you only want the first match in b:
next((s for s in b if s in a), None)
This has the advantage of short-circuiting as soon as it finds a match whereas the other list solutions will keep going. If no match is found, it will return None.
Keep in mind that Python's substring search x in a is already optimized pretty well for the general case (and coded in C, for CPython), so you're unlikely to beat it in general, especially with pure Python code.
However, if you have a more specialized case, you can do much better.
For example, if you have an arbitrary list of millions of strings b that all need to be searched for within one giant static string a that never changes, preprocessing a can make a huge difference. (Note that this is the opposite of the usual case, where preprocessing the patterns is the key.)
On the other hand, if you expect matches to be unlikely, and you've got the whole b list in advance, you can probably get some large gains by organizing b in some way. For example, there's no point searching for "ABCD" if "ABC" already failed; if you need to search both "ABC" and "ABD" you can search for "AB" first and then check whether it's followed by "C" or "D" so you don't have to repeat yourself; etc. (It might even be possible to merge all of b into a single regular expression that's close enough to optimal… although with millions of elements, that probably isn't the answer.)
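As a rough sketch of the "merge all of b into a single regular expression" idea (illustration only; the semantics differ slightly in that the regex returns the leftmost match in a rather than the first element of b that occurs, and with millions of alternatives this may hit practical limits):

import re

def first_match(a, b):
    # Combine every candidate into one alternation; re.escape guards against metacharacters.
    pattern = re.compile('|'.join(re.escape(s) for s in b))
    m = pattern.search(a)
    return m.group(0) if m else None

print(first_match('ABCDEFG', ['ABC', 'QRS', 'AHQ']))  # -> 'ABC'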
But it's hard to guess in advance, with no more information than you've given us, exactly what algorithm you want.
Wikipedia has a pretty good high-level overview of string searching algorithms. There's also a website devoted to pattern matching in general, which seems to be a bit out of date, but then I doubt you're going to turn out to need an algorithm invented in the past 3 years anyway.
Answer:
(x for x in b if x in a)
That will return a generator yielding the elements that match. Take the first one or loop over it.
In [3]: [s for s in b if s in a]
Out[3]: ['ABC']
On my machine this takes about 3 seconds when b contains 20,000,000 elements (tested with a and b containing strings similar to those in the question).
You might want to have a look at the following algorithm:
Boyer–Moore string search algorithm
And Wikipedia.
But without knowing more, this might be overkill!
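For a feel of how Boyer-Moore-style skipping works, here is a minimal Boyer-Moore-Horspool sketch (a simplified variant, for illustration only; CPython's built-in in / str.find will usually beat pure-Python code like this):

def horspool_find(text, pattern):
    # Return the index of the first occurrence of pattern in text, or -1.
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # How far we may shift when the character aligned with the pattern's last position mismatches.
    shift = {ch: m - i - 1 for i, ch in enumerate(pattern[:-1])}
    i = m - 1
    while i < n:
        j, k = m - 1, i
        while j >= 0 and text[k] == pattern[j]:
            j -= 1
            k -= 1
        if j < 0:
            return k + 1
        i += shift.get(text[i], m)
    return -1

print(horspool_find('ABCDEFG', 'ABC'))  # -> 0
print(horspool_find('ABCDEFG', 'QRS'))  # -> -1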
Related
I am using a regex like "a.{1000000}b.{1000000}c" to pattern match on a string. However, this is WAY too slow. Is there a better way to do this? I am not interested in the stuff between a, b and c; as long as their gap is of my specified size, I don't care about the content within. One can think of it as skipping n characters. Checking the index doesn't serve me well either; I need to be using some built-in method written in C. Any suggestions?
Thanks in advance
If you just need to verify that a string is in a given pattern and do not care to extract the a, b, or c, then this would work:
(?=^a.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}b.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}c$)
The limit for regex quantifiers is 65535 so if you need one million then you would have to repeat .{50000} 20 times like I did above.
Now you just need to make Python code that says "if regex match then proceed"
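A minimal sketch of building that pattern programmatically (assuming the gaps really are exactly 1,000,000 characters, split into twenty .{50000} chunks as above):

import re

gap = '.{50000}' * 20  # 20 * 50,000 = 1,000,000 characters
pattern = re.compile('(?=^a' + gap + 'b' + gap + 'c$)', re.DOTALL)  # DOTALL so '.' also matches newlines

s = 'a' + 'x' * 1000000 + 'b' + 'x' * 1000000 + 'c'  # toy input for illustration
if pattern.match(s):
    print('matched')  # proceed with whatever comes next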
Regex101 takes 68ms so I would consider that to be "fast".
https://regex101.com/r/q6RgNJ/1
I am facing the following problem and have not found a solution yet:
I am working on a tool for sequence analysis which uses a file with reference sequences and tries to find one of these reference sequences in a test sequence.
The problem is that the test sequence might contain gaps (for example: ATG---TCA).
I want my tool to find a specific reference sequence as substring of the test sequence even if the reference sequence is interrupted by gaps (-) in the test sequence.
For example:
one of my reference sequences:
a = TGTAACGAACGG
my test sequence:
b = ACCT**TGT--CGAA-GG**AGT
(the corresponding part from the reference sequence is given in bold)
I thought about regular expressions and tried to work my way into them, but if I am not wrong, regular expressions only work the other way round. So I would need to include the gap positions as regular expressions in the reference sequence and then map it against the test sequence.
However, I do not know the positions, the length and the number of gaps in the test sequence.
My idea was to exchange the gap positions (so all -) in the test sequence string for some kind of regular expression or a special character which stands for any character in the reference sequence. Then I would compare the unmodified reference sequences against my modified test sequence...
Unfortunately I have not found a function in Python for string search or a type of regular expression which could do this.
Thank you very much!
There's good news and there's bad news...
Bad news first: what you are trying to do is not easy, and regex is really not the way to do it. In a simple case regex could be made to work (maybe), but it will be inefficient and will not scale.
However, the good news is that this is a well understood problem in bioinformatics (e.g. see https://en.wikipedia.org/wiki/Sequence_alignment). Even better news is that there are tools in Biopython that can help you. E.g. http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
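For instance, a minimal sketch with Bio.pairwise2 (this assumes Biopython is installed; a local alignment is the natural fit for finding a reference inside a longer test sequence):

from Bio import pairwise2

ref = 'TGTAACGAACGG'           # reference sequence from the question
test = 'ACCTTGT--CGAA-GGAGT'   # test sequence; strip its gap characters before aligning

# localxx: local alignment, matches score 1, mismatches and gaps are not penalized (crude defaults)
alignments = pairwise2.align.localxx(ref, test.replace('-', ''))
print(pairwise2.format_alignment(*alignments[0]))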
EDIT
From the discussion below it seems you are saying that 'b' is likely to be very long, but assuming 'a' is still short (12 bases in your example above) I think you can tackle this by iterating over every 12-mer in 'b'. I.e. divide 'b' into sequences that are 12 bases long (obviously you'll end up with a lot!). You can then easily compare the two sequences. If you really want to use regex (and I still advise you not to) then you can replace the '-' with a '.' and do a simple match. E.g.
import re
# a is the reference
a = 'TGTAACGAACGG'
# b is a 12-mer taken from the sequence of interest; in reality you'll be doing
# this test for every possible 12-mer in the sequence
b = 'TGT--CGAA-GG'
b = b.replace('-', '.')
r = re.compile(b)
m = r.match(a)
print(m)
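And a minimal sketch of the surrounding loop described above, sliding a window of len(a) over the full test sequence (this assumes each gap character stands in for exactly one base, as in the example):

import re

a = 'TGTAACGAACGG'            # reference
seq = 'ACCTTGT--CGAA-GGAGT'   # full test sequence
k = len(a)

for i in range(len(seq) - k + 1):
    window = seq[i:i + k]
    if re.fullmatch(window.replace('-', '.'), a):
        print(i, window)      # position and the gapped window that matched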
You could do this:
import re
a = 'TGTAACGAACGG'
b = 'ACCTTGT--CGAA-GGAGT'
temp_b = re.sub(r'[\W_]+', '', b)  # removes everything that isn't a letter or digit
if a in temp_b:
    pass  # do something
Consider this Python code:
import timeit
import re
def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')

def two():
    r.search(mystring)

mystring = "hello" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])

mystring = "goodbye" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
Basically, I'm benchmarking two ways to check existence of one of several substrings in a large string.
What I get here (Python 3.2.3) is this output:
[0.36678314208984375, 0.03450202941894531]
[0.6672089099884033, 3.7519450187683105]
In the first case, the regular expression easily defeats the any expression: the regex finds the substring immediately, while any has to scan the whole string a couple of times before it gets to the correct substring.
But what's going on in the second example? In the case where the substring isn't present, the regular expression is surprisingly slow! This surprises me, since theoretically the regex only has to go over the string once, while the any expression has to go over the string three times. What's wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?
Note to future readers
I think the correct answer is actually that Python's string handling algorithms are really optimized for this case, and the re module is actually a bit slower. What I've written below is true, but is probably not relevant to the simple regexps I have in the question.
Original Answer
Apparently this is not a random fluke - Python's re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.
It uses the backtracking approach even when there are no back references in the regular expression!
What this means is that in the worst case, Python regexes take exponential, not linear, time!
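A small, contrived illustration of that worst case: a nested quantifier against a string that can never match, so the backtracking engine explores every possible split and the time roughly doubles with each extra character.

import re
import time

pattern = re.compile(r'(a+)+$')   # nested quantifiers: classic catastrophic backtracking
for n in (18, 20, 22, 24):
    s = 'a' * n + 'b'             # the trailing 'b' guarantees the match must fail
    t0 = time.perf_counter()
    pattern.match(s)
    print(n, round(time.perf_counter() - t0, 4))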
This is a very detailed paper describing the issue:
http://swtch.com/~rsc/regexp/regexp1.html
I think the graph near the end of that page summarizes it succinctly.
My coworker found the re2 library (https://code.google.com/p/re2/). There is a Python wrapper. It can be a bit of work to get installed on some systems.
I was having the same issue with some complex regexes and long strings; re2 sped the processing up significantly, from seconds to milliseconds.
The reason the regex is so slow is that it not only has to go through the whole string, it also has to do several calculations at every character.
The first one simply does this:
Does f match h? No.
Does b match h? No.
Does h match h? Yes.
Does e match e? Yes.
Does l match l? Yes.
Does l match l? Yes.
Does o match o? Yes.
Done. Match found.
The second one does this:
Does f match g? No.
Does b match g? No.
Does h match g? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match d? No.
Does b match d? No.
Does h match d? No.
Does f match b? No.
Does b match b? Yes.
Does a match y? No.
Does h match b? No.
Does f match y? No.
Does b match y? No.
Does h match y? No.
Does f match e? No.
Does b match e? No.
Does h match e? No.
... 999 more times ...
Done. No match found.
I can only speculate about the difference between any and the regex, but I'm guessing the regex is slower mostly because it runs in a highly complex engine, with state machine machinery and everything, so it just isn't as efficient as a specific implementation (the in operator).
In the first string, the regex will find a match almost instantaneously, while any has to loop through the string twice before finding anything.
In the second string, however, any performs essentially the same steps as the regex, but in a different order. This seems to indicate that the any solution is faster, probably because it is simpler.
Specific code is more efficient than generic code. Any knowledge about the problem can be put to use in optimizing the solution. Simple code is preferred over complex code. Essentially, the regex is faster when the pattern will be near the start of the string, but in is faster when the pattern is near the end of the string, or not found at all.
Disclaimer: I don't know Python. I know algorithms.
You have a regexp that is made up of three regexps. Exactly how do you think that works, if the regexp doesn't check all three? :-) There's no magic in computing; you still have to do three checks.
But the regexp will do all three tests character by character, while the one() method will check the whole string for one match before going on to the next one.
That the regexp is much faster in the first case is because you check for the string that will match last. That means one() needs to first look through the whole string for "foo", then for "bar", and then for "hello", where it matches. Move "hello" first, and one() and two() are almost the same speed, as the first check done in both cases succeeds.
Regexps are much more complex tests than in, so I'd expect them to be slower. I suspect that this complexity increases a lot when you use "|", but I haven't read the source of the regexp library, so what do I know. :-)
I have several million strings, X, each with fewer than 20 or so words. I also have a list of several thousand candidate substrings C. For each x in X, I want to see if there are any strings in C that are contained in x. Right now I am using a naive double for loop, but it's been a while and it hasn't finished yet... Any suggestions? I'm using Python if anyone knows of a nice implementation, but links for any language or general algorithms would be nice too.
Encode one of your sets of strings as a trie (I recommend the bigger set). Lookup time should be faster than an imperfect hash and you will save some memory too.
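A minimal sketch of that idea, here building the trie over the candidate substrings for brevity (it assumes no candidate contains the '$' sentinel); the scan still starts from every position of x, so the main win is that failed lookups die quickly:

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w                 # mark the end of a candidate
    return root

def contains_any(x, trie):
    # Return the first candidate found as a substring of x, or None.
    for i in range(len(x)):
        node = trie
        for ch in x[i:]:
            if '$' in node:
                return node['$']
            if ch not in node:
                break
            node = node[ch]
        else:
            if '$' in node:
                return node['$']
    return None

trie = build_trie(['ABC', 'QRS', 'AHQ'])
print(contains_any('xxABCyy', trie))  # -> 'ABC'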
It's gonna be a long while. You have to check every one of those several million strings against every one of those several thousand candidate substrings, meaning that you will be doing (several million * several thousand) string comparisons. Yeah, that will take a while.
If this is something that you're only going to do once or infrequently, I would suggest using fgrep. If this is something that you're going to do often, then you want to look into implementing something like the Aho-Corasick string matching algorithm.
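In case it helps, here is a rough from-scratch Aho-Corasick sketch (dict-based and unoptimized; for real workloads use a library or a compiled implementation):

from collections import deque

def build_automaton(patterns):
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure links
    out = [[]]    # patterns ending at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())   # BFS to fill in the failure links
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt].extend(out[fail[nxt]])
    return goto, fail, out

def find_all(text, goto, fail, out):
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i, pat              # (end index, matched pattern)

goto, fail, out = build_automaton(['he', 'she', 'his', 'hers'])
print(list(find_all('ushers', goto, fail, out)))  # [(3, 'she'), (3, 'he'), (5, 'hers')]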
If your x in X only contains words, and you only want to match whole words, you could do the following:
Insert your keywords into a set (membership tests are roughly constant time on average), and then check for every word in x whether it is contained in that set,
like:
keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split(' '):
        if w in keywords:
            pass  # do what you need to do
A good alternative would be to use Google's re2 library, which uses super nice automata theory to produce efficient matchers. (http://code.google.com/p/re2/)
EDIT: Be sure you use proper buffering and something in a compiled language; that makes it a lot faster. If it's less than a couple of gigabytes, it should work with Python too.
You could try to use a regex:
import re

subs = re.compile('|'.join(re.escape(c) for c in C))  # escape in case candidates contain regex metacharacters
for x in X:
    if subs.search(x):
        print('found')
Have a look at http://en.wikipedia.org/wiki/Aho-Corasick. You can build a pattern-matcher for a set of fixed strings in time linear in the total size of the strings, then search in text, or multiple sections of text, in time linear in the length of the text + the number of matches found.
Another fast exact pattern matcher is http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
I have tokenized names (strings), with the tokens separated by underscores, which will always contain a "side" token with the value M, L, or R.
The presence of that value is guaranteed to be unique (no repetitions or dangers that other tokens might get similar values).
In example:
foo_M_bar_type
foo_R_bar_type
foo_L_bar_type
I'd like, in a single regex, to swap L for R and vice versa whenever found, and leave M untouched.
I.e. the above would become:
foo_M_bar_type
foo_L_bar_type
foo_R_bar_type
when pushed through this ideal expression.
This was what I thought would be a 10-minute exercise while writing some simple stuff, but I couldn't quite crack it as concisely as I wanted to.
The problem itself was of course trivial to solve with one condition that changes the pattern, but I'd love some help doing it within a single re.sub()
Of course any food for thought is always welcome, but this being an intellectual exercise that a couple of colleagues and I failed at, I'd love to see it cracked that way.
And yes, I'm fully aware it might not be considered very Pythonic, nor ideal, to solve the problem with a regex, but humour me please :)
Thanks in advance
This answer [ab]uses the replacement function:
>>> s = "foo_M_bar_type foo_R_bar_type foo_L_bar_type"
>>> import re
>>> re.sub("_[LR]_", lambda m: {'_L_':'_R_','_R_':'_L_'}[m.group()], s)
'foo_M_bar_type foo_L_bar_type foo_R_bar_type'
>>>