Finding a substring's position in a larger string - python

I have a large string and a large number of smaller substrings and I am trying to check if each substring exists in the larger string and get the position of each of these substrings.
string="some large text here"
sub_strings=["some", "text"]
for each_sub_string in sub_strings:
if each_sub_string in string:
print each_sub_string, string.index(each_sub_string)
The problem is, since I have a large number of substrings (around a million), it takes about an hour of processing time. Is there any way to reduce this time, maybe by using regular expressions or some other way?

The best way to solve this is with a tree implementation. As Rishav mentioned, you're repeating a lot of work here. Ideally, this should be implemented as a tree-based FSM. Imagine the following example:
Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']
Then imagine a tree where each level is an additional letter.
small_lookup = {
    'c': ['a', {'a': ['t']}],  # 'c' -> 'a' completes 'ca'; 'c' -> 'a' -> 't' completes 'cat'
    's': ['at']                # 's' -> 'at' completes 'sat'
}
Apologies for the rough structure, but I think it's helpful to map it back to a Python data structure directly. You can build a tree where the top-level entries are the starting letters, and they map to the list of potential final substrings that could be completed. If you hit something that is a list element with nothing more nested beneath it, you've hit a leaf, and you know that you've found the first instance of that substring.
Holding that tree in memory is a little hefty, but if you've only got a million strings this should be the most efficient implementation. You should also make sure that you trim the tree as you find the first instance of each word.
For those of you with CS chops, or if you want to learn more about this approach, it's a simplified version of the Aho-Corasick string matching algorithm.
If you're interested in learning more about these approaches there are three main algorithms used in practice:
Aho-Corasick (Basis of fgrep) [Worst case: O(m+n)]
Commentz-Walter (Basis of vanilla GNU grep) [Worst case: O(mn)]
Rabin-Karp (Used for plagiarism detection) [Worst case: O(mn)]
There are domains in which each of these algorithms will outperform the others, but given that you've got a very high number of sub-strings to search for, and likely a lot of overlap between them, I would bet that Aho-Corasick is going to give you significantly better performance than the other two methods, as it avoids the O(mn) worst-case scenario.
There is also a great Python library that implements the Aho-Corasick algorithm, found here, that should allow you to avoid writing the gross implementation details yourself.
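For reference, here is a minimal sketch using the pyahocorasick package (pip install pyahocorasick); whether that is the exact library the answer links to is an assumption, but it implements Aho-Corasick and reports every substring with its position in a single pass over the large string:
import ahocorasick

string = "some large text here"
sub_strings = ["some", "text"]

automaton = ahocorasick.Automaton()
for s in sub_strings:
    automaton.add_word(s, s)           # store the substring itself as the payload
automaton.make_automaton()

# iter() yields (end_index, payload) for every match in one pass over string
for end_index, s in automaton.iter(string):
    print(s, end_index - len(s) + 1)   # convert end index to start index
Note that unlike the original loop, this reports every occurrence of each substring rather than only the first.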

Depending on the distribution of the lengths of your substrings, you might be able to shave off a lot of time using preprocessing.
Say the lengths of your substrings form the set {23, 33, 45} (meaning that you might have millions of substrings, but each one has one of these three lengths).
Then, for each of these lengths, find the Rabin Window over your large string, and place the results into a dictionary for that length. That is, let's take 23. Go over the large string, and find the 23-window hashes. Say the hash for position 0 is 13. So you insert into the dictionary rabin23 that 13 is mapped to [0]. Then you see that for position 1, the hash is 13 as well. Then in rabin23, update that 13 is mapped to [0, 1]. Then in position 2, the hash is 4. So in rabin23, 4 is mapped to [2].
Now, given a substring, you can calculate its Rabin hash and immediately check the relevant dictionary for the indices of its occurrence (which you then need to compare).
BTW, in many cases, the lengths of your substrings will exhibit a Pareto behavior, where say 90% of the strings are in 10% of the lengths. If so, you can do this for these lengths only.
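A rough sketch of that preprocessing, assuming a simple polynomial rolling hash (the BASE/MOD constants and the rabin_windows helper are illustrative choices, not part of the answer):
from collections import defaultdict

BASE, MOD = 256, (1 << 61) - 1

def rabin_windows(text, length):
    """Map the rolling hash of every window of the given length to its start positions."""
    table = defaultdict(list)
    n = len(text)
    if length > n:
        return table
    power = pow(BASE, length - 1, MOD)
    h = 0
    for ch in text[:length]:
        h = (h * BASE + ord(ch)) % MOD
    table[h].append(0)
    for i in range(1, n - length + 1):
        # drop text[i-1] from the front, append text[i+length-1] at the back
        h = ((h - ord(text[i - 1]) * power) * BASE + ord(text[i + length - 1])) % MOD
        table[h].append(i)
    return table

string = "some large text here"
sub_strings = ["some", "text"]

tables = {}                                    # one dictionary per substring length
for s in sub_strings:
    L = len(s)
    if L not in tables:
        tables[L] = rabin_windows(string, L)
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    for pos in tables[L].get(h, []):
        if string[pos:pos + L] == s:           # verify: hashes can collide
            print(s, pos)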

This approach is sub-optimal compared to the other answers, but might be good enough regardless, and it is simple to implement. The idea is to turn the algorithm around so that instead of testing each sub-string in turn against the larger string, you iterate over the large string and test against the possible matching sub-strings at each position, using a dictionary to narrow down the number of sub-strings you need to test.
The output will differ from the original code in that it will be sorted in ascending order of index as opposed to by sub-string, but you can post-process the output to sort by sub-string if you want to.
Create a dictionary mapping each possible 1-3 character prefix to the list of sub-strings that begin with it. Then iterate over the string, and at each position read the next 1-3 characters and check for a match at that position against each sub-string in the dictionary that begins with those characters:
string="some large text here"
sub_strings=["some", "text"]
# add each of the substrings to a dictionary based the first 1-3 characters
dict = {}
for s in sub_strings:
if s[0:3] in dict:
dict[s[0:3]].append(s)
else:
dict[s[0:3]] = [s];
# iterate over the chars in string, testing words that match on first 1-3 chars
for i in range(0, len(string)):
for j in range(1,4):
char = string[i:i+j]
if char in dict:
for word in dict[char]:
if string[i:i+len(word)] == word:
print word, i
If you don't need to match any sub-strings 1 or 2 characters long then you can get rid of the for j loop and just assign char with char = string[i:i+3]
Using this second approach I timed the algorithm by reading in Tolstoy's War and Peace and splitting it into unique words, like this:
with open ("warandpeace.txt", "r") as textfile:
string=textfile.read().replace('\n', '')
sub_strings=list(set(string.split()))
Doing a complete search for every unique word in the text and outputting every instance of each took 124 seconds.

Related

Is there a more efficient O(n) algorithm to filter substring in a list? [duplicate]

I have 300K strings stored in a list, and the length of each string is between 10 and 400. I want to remove the ones that are substrings of other strings (strings with shorter length have a higher probability of being substrings of others).
Currently, I first sort these 300K strings based on length, then use below method.
sorted_string = sorted(string_list, key=len, reverse=True)
for item in sorted_string:
    for next_item in sorted_string[sorted_string.index(item)+1:]:
        if next_item in item:
            del sorted_string[sorted_string.index(next_item)]
The running time of this method is O(n^2). Since I have 300K strings, I am not satisfied with this method.
I have tried to divide these sorted strings into different chunks and use multiprocessing to compute each chunk. My first thought was to put the first 10K in the first chunk, the next 10K in the second chunk, etc. But that way, strings in each chunk have similar lengths, and they may not be substrings of others in the same chunk. So this is not a good divide strategy.
Any good ideas?
Edit: these strings represent DNA sequences, and only contain 'g', 'c', 't' and 'a'
Update:
I have tried to build the suffix tree using the code from https://github.com/kvh/Python-Suffix-Tree. This program builds the suffix tree based on Ukkonen's algorithm.
The total length of the concatenated string is about 90,000,000 characters. That is a large number. The program has been running for half an hour and only ~3,000,000 characters (1/30) have been processed. I am not satisfied with this program.
Is there any other suffix tree building algorithm that can process this large string?
You could use a suffix tree. It will get you to O(mn) where m is the length of the strings. It's still quadratic, but since m << n in your case, it would provide a noticeable improvement.
These lecture notes provide a pretty good visual explanation of how you can use the suffix tree to find substrings.
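This is not the suffix-tree construction itself, but as a point of comparison, here is a sketch that drops contained strings using the Aho-Corasick library mentioned in the first answer above (assuming pyahocorasick is available); it builds the automaton once and then scans each string against it:
import ahocorasick

def remove_contained(strings):
    """Return the strings that are not substrings of any other string in the list."""
    automaton = ahocorasick.Automaton()
    for s in set(strings):
        automaton.add_word(s, s)
    automaton.make_automaton()

    contained = set()
    for s in strings:
        for _end, found in automaton.iter(s):
            if found != s:                 # a hit other than s itself is contained in s
                contained.add(found)
    return [s for s in strings if s not in contained]

print(remove_contained(["gcta", "ct", "aaag", "gctaaag"]))   # ['gctaaag']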
This is a very cool and very interesting problem. I've studied subset seed algorithms and there are quite a few already out there.
Have you heard of the BLAST algorithm? http://blastalgorithm.com/
A GUI: http://blast.ncbi.nlm.nih.gov/

Finding the shortest unique substring

I have a name and a list of names. I can guarantee that the selected name is contained by the list of other names.
I'd like to generate the shortest substring of the selected name that is contained only by that name, and not by any of the other names in the data.
>>> names = ['smith','jones','williams','brown','wilson','taylor','johnson','white','martin','anderson']
>>> find_substring('smith', names)
"sm"
>>> find_substring('williams', names)
"ll"
>>> find_substring('taylor', names)
"y"
I can probably brute-force this fairly easily, by taking the first letter of the selected name and seeing if it matches any of the names, then iterating through the rest of the letters followed by pairs of letters, etc.
My problem is that my list contains more than ten thousand names and they're fairly long - more similar to book titles. Brute force would take forever.
Is there some simple way to efficiently achieve this?
I believe your best bet would be brute force; however, keep a dictionary of checked letter combinations and whether or not they matched any other names:
{"s": True, "m": True, "sm": False}
Consulting this dictionary first helps reduce the cost of checking against the other strings and speeds up the method as it runs.
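A rough sketch of that memoised brute force, trying substrings from shortest to longest so the first one not found in any other name is the answer (the find_substring name just mirrors the question's example):
def find_substring(name, names):
    others = [n for n in names if n != name]
    seen = {}                    # substring -> True if it occurs in some other name
    for length in range(1, len(name) + 1):
        for start in range(len(name) - length + 1):
            candidate = name[start:start + length]
            if candidate not in seen:
                seen[candidate] = any(candidate in other for other in others)
            if not seen[candidate]:
                return candidate
    return name                  # the whole name is the only unique substring

names = ['smith', 'jones', 'williams', 'brown', 'wilson', 'taylor',
         'johnson', 'white', 'martin', 'anderson']
print(find_substring('smith', names))    # sm
print(find_substring('taylor', names))   # y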
A variation of a common suffix tree might be enough to achieve this in less than O(n^2) time (it is used in bioinformatics for large genome sequencing), but as @HeapOverflow mentioned in the comments, I do not believe brute forcing this problem would be much of an issue unless you are considering running the algorithm with literally hundreds of millions of strings.
Using the Wikipedia article above for reference: you can build the tree in O(n) time (for all strings, not each individual string), and use it to find all z occurrences of a string P of length m in O(m + z) time. Implemented right, you'll likely be looking at a time of O(n) + O(am + az) = O(am + az) for a list of a words (anyone is welcome to double check my math on this).

Fast way to find a substring with some mismatches allowed

I am looking for help in making an efficient way to process some high throughput DNA sequencing data.
The data are in 5 files with a few hundred thousand sequences each, within which each sequence is formatted as follows:
#M01102:307:000000000-BCYH3:1:1102:19202:1786 1:N:0:TAGAGGCA+CTCTCTCT
TAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT
+
AAABBFAABBBFGGGGFGGGGGAG5GHHHCH54BEEEEA5GGHDHHHH5BAE5DF5GGCEB33AF3313GHHHE255D55D55D53#5#B5DBD5#E/#//>/1??/?/E#///FDF0B?CC??CAAA;--./;/BBE?;AFFA./;/;.;AEA//BFFFF/BB/////;/..:.9999.;
What I am doing at the moment is iterating over the lines, checking if the first and last letter is an allowed character for a DNA sequence (A/C/G/T or N), then doing a fuzzy search for the two primer sequences that flank the coding sequence fragment I am interested in. This last step is the part where things are going wrong...
When I search for exact matches, I get useable data in a reasonable time frame. However, I know I am missing out on a lot of data that is being skipped because of a single mis-match in the primer sequences. This happens because read quality degrades with length, and so more unreadable bases ('N') crop up. These aren't a problem in my analysis otherwise, but are a problem with a simple direct string search approach -- N should be allowed to match with anything from a DNA perspective, but is not from a string search perspective (I am less concerned about insertion or deletions). For this reason I am trying to implement some sort of fuzzy or more biologically informed search approach, but have yet to find an efficient way of doing it.
What I have now does work on test datasets, but is much too slow to be useful on a full size real dataset. The relevant fragment of the code is:
from Bio import pairwise2

Sequence = 'NNNNNTAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT'
fwdprimer = 'TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG'
revprimer = 'TAGGACGGGGGGCGGAAA'

if Sequence.endswith(('N', 'A', 'C', 'G', 'T')) and Sequence.startswith(('N', 'A', 'C', 'G', 'T')):
    fwdalign = pairwise2.align.localxs(Sequence, fwdprimer, -1, -1, one_alignment_only=1)
    revalign = pairwise2.align.localxs(Sequence, revprimer, -1, -1, one_alignment_only=1)
    if fwdalign[0][2] > 45 and revalign[0][2] > 15:
        startIndex = fwdalign[0][3] + 45
        endIndex = revalign[0][3] + 3
        Sequence = Sequence[startIndex:endIndex]
        print(Sequence)
(obviously the first conditional is not needed in this example, but helps to filter out the other 3/4 of the lines that don't have DNA sequence and so don't need to be searched)
This approach uses the pairwise alignment method from biopython, which is designed for finding alignments of DNA sequences with mismatches allowed. That part it does well, but because it needs to do a sequence alignment for each sequence with both primers it takes way too long to be practical. All I need it to do is find the matching sequence, allowing for one or two mismatches. Is there another way of doing this that would serve my goals but be computationally more feasible? For comparison, the following code from a previous version works plenty fast with my full data sets:
if ('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG' in Line) and ('TAGGACGGGGGGCGGAAA' in Line):
    startIndex = Line.find('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG') + 45
    endIndex = Line.find('TAGGACGGGGGGCGGAAA') + 3
    Line = Line[startIndex:endIndex]
    print(Line)
This is not something I run frequently, so I don't mind if it is a little inefficient, but I don't want to have to leave it running for a whole day. I would like to get a result in seconds or minutes, not hours.
The tre library provides fast approximate matching functions. You can specify the maximum number of mismatched characters with maxerr as in the example below:
https://github.com/laurikari/tre/blob/master/python/example.py
There is also the regex module, which supports fuzzy searching options: https://pypi.org/project/regex/#additional-features
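A minimal sketch of fuzzy searching with the third-party regex module (pip install regex), run against the forward primer and a prefix of the example sequence from the question; the budget of two substitutions is an illustrative choice:
import regex

fwdprimer = 'TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG'
line = 'NNNNNTAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGG'

# {s<=2} allows up to two substitutions within the bracketed group
m = regex.search('(%s){s<=2}' % fwdprimer, line)
if m:
    print(m.start(), m.end(), m.fuzzy_counts)   # fuzzy_counts = (substitutions, insertions, deletions)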
In addition, you can also use a simple regular expression to allow alternate characters as in:
import re
# Allow any character to be N
pattern = re.compile('[TN][AN][AN][TN]')
if pattern.match('TANN'):
    print('found')
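The same idea scales to a whole primer by building one character class per base, so an unreadable 'N' in the read still matches; the read below is made up for illustration:
import re

fwdprimer = 'TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG'
# one character class per base: each position matches the expected base or 'N'
pattern = re.compile(''.join('[%sN]' % base for base in fwdprimer))

# illustrative read: the primer with one base obscured by 'N', embedded in a longer line
read = 'NNNNN' + fwdprimer[:20] + 'N' + fwdprimer[21:] + 'AGTCTTTTGGG'
m = pattern.search(read)
if m:
    print(m.start(), m.group())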


Efficient replacement of occurrences of a list of words

I need to censor all occurrences of a list of words with *'s. I have about 400 words in the list and it's going to get hit with a lot of traffic, so I want to make it very efficient. What's an efficient algorithm/data structure to do this in? Preferably something already in Python.
Examples:
"piss off" => "**** off"
"hello" => "hello"
"go to hell" => "go to ****"
A case-insensitive trie-backed set implementation might fit the bill. For each word, you'll only process a minimum of characters. For example, you would only need to process the first letter of the word 'zoo' to know the word is not present in your list (assuming you have no 'z' expletives).
This is something that is not packaged with python, however. You may observe better performance from a simple dictionary solution since it's implemented in C.
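A rough sketch of what such a trie-backed set could look like with plain dicts (the node layout and the '$' end marker are illustrative choices, and this only handles space-separated whole words):
_END = '$'

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[_END] = True
    return root

def censor(text, trie):
    out = []
    for token in text.split(' '):
        node = trie
        for ch in token.lower():
            if ch not in node:
                break              # token walks off the trie: not a censored word
            node = node[ch]
        else:
            if _END in node:       # the whole token is a censored word
                out.append('*' * len(token))
                continue
        out.append(token)
    return ' '.join(out)

trie = build_trie(['hell', 'piss'])
print(censor('piss off', trie))      # **** off
print(censor('go to hell', trie))    # go to ****
print(censor('hello', trie))         # hello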
(1) Let P be the set of phrases to censor.
(2) Precompute H = {h(w) | p in P, w is a word in p}, where h is a sensible hash function.
(3) For each word v that is input, test whether h(v) in H.
(4) If h(v) not in H, emit v.
(5) If h(v) in H, back off to any naive method that will check whether v and the words following form a phrase in P.
Step (5) is not a problem since we assume that P is (very) small compared to the quantity of input. Step (3) is an O(1) operation.
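A minimal sketch of those steps, with Python's built-in hash() standing in for h, and starring every word of a matched phrase as an illustrative replacement policy:
phrases = ['piss off', 'go to hell']                           # (1) P, the phrases to censor
word_hashes = {hash(w) for p in phrases for w in p.split()}    # (2) H, hashes of every word in P

def censor_stream(words):
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        if hash(w) not in word_hashes:     # (3)/(4): cheap common case, emit as-is
            out.append(w)
            i += 1
            continue
        # (5): possible hit, fall back to a naive check for a phrase starting here
        for p in phrases:
            tokens = p.split()
            if words[i:i + len(tokens)] == tokens:
                out.extend('*' * len(t) for t in tokens)
                i += len(tokens)
                break
        else:                              # hash matched but no phrase did
            out.append(w)
            i += 1
    return ' '.join(out)

print(censor_stream('please go to hell now'.split()))          # please ** ** **** now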
As cheeken has mentioned, a trie may be the thing you need; actually, you should use the Aho–Corasick string matching algorithm, which is something more than a trie.
For every string S you need to process, the time complexity is approximately O(len(S)). In other words, linear.
You also need to build the automaton up front; its time complexity is O(sigma(len(words))), and its space complexity is about (usually less than) O(52*sigma(len(words))), where 52 is the size of the alphabet (I take it as ['a'..'z', 'A'..'Z']). And you only need to do this once (or every time the system launches).
You might want to time a regexp based solution against others. I have used similar regexp based substitution of one to three thousand words on a text to change phrases into links before, but I am not serving those pages to many people.
I take the set of words (it could be phrases), and form a regular expression out of them that will match their occurrence as a complete word in the text because of the '\b'.
If you have a dictionary mapping words to their sanitized version then you could use that. I just swap every odd letter with '*' for convenience here.
The sanitizer function just returns the sanitized version of any matched swear word and is used in the regular expression substitution call on the text to return a sanitized version.
import re

swearwords = set("Holy Cow".split())
swear = re.compile(r'\b(%s)\b' % '|'.join(sorted(swearwords, key=lambda w: (-len(w), w))))
sanitized = {sw: ''.join(ch if not i % 2 else '*' for i, ch in enumerate(sw)) for sw in swearwords}

def sanitizer(matchobj):
    return sanitized.get(matchobj.group(1), '????')

txt = 'twat prick Holy Cow ... hell hello shitter bonk'
swear.sub(sanitizer, txt)
# Out[1]: 'twat prick H*l* C*w ... hell hello shitter bonk'
You might want to use re.subn and the count argument to limit the number of substitutions done and just reject the whole text if it has too many profanities:
maxswear = 2
newtxt, scount = swear.subn(sanitizer, txt, count=maxswear)
if scount >= maxswear: newtxt = 'Ouch my ears hurt. Please tone it down'
print(newtxt)
# 'Ouch my ears hurt. Please tone it down'
If performance is what you want, I would suggest:
Get a sample of the input
Calculate the average number of censored words per line
Define a maximum number of words to filter per line (3, for example)
Calculate which censored words get the most hits in the sample
Write a function that, given the censored words, generates a Python file with if statements to check each word, putting the 'most hits' words first; since you just want to match whole words it will be fairly simple
Once you hit the max number per line, exit the function
I know this is not nice, and I'm only suggesting this approach because of the high-traffic scenario; looping over every word in your list will have a huge negative impact on performance.
Hope that helps, or at least gives you an out-of-the-box idea on how to tackle the problem.
