Checking if word segmentation is possible - python

This is a follow-up question to this response and the pseudocode algorithm that the user posted. I didn't comment on that question because of its age. I am only interested in validating whether or not a string can be split into words. The algorithm doesn't need to actually split the string. This is the response from the linked question:
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if
the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for
i=2 to length(w) calculate
S[i] = (isWord(w[1..i]) or for any j in {2..i}: S[j-1] and isWord(w[j..i])).
I'm translating this algorithm into simple python code, but I'm not sure if I'm understanding it properly. Code:
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, str_len):
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S
I have two related questions. 1) Is this code a proper translation of the linked algorithm into Python, and if it is, 2) now that I have S, how do I use it to tell if the string is comprised only of words? In this case, is_word is a function that simply looks a given word up in a list. I haven't implemented it as a trie yet.
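For reference, a minimal is_word consistent with that description might look like this (my sketch, assuming dictionary is a plain list of words; a set or trie would make the lookup faster):

def is_word(candidate, dictionary):
    # linear scan of the word list
    return candidate in dictionary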
UPDATE: After updating the code to include the suggested change, it doesn't work. This is the updated code:
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, i): #THIS LINE WAS UPDATED
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S
a_string = "carrotforever"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints FALSE
a_string = "hello"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints TRUE
It should return True for both of these.

Here is a modified version of your code that should return good results.
Notice that your mistake was simply in the translation from pseudocode array indexing (starting at 1) to Python array indexing (starting at 0): S[0] and S[1] were populated with the same value, while S[str_len - 1] was never actually computed. You can easily trace this mistake by printing the whole S array. You will find that S[3] is set to True in the first example, where it should be S[2] for the word "car".
Also, you could speed up the process by storing the indexes of the valid split points found so far instead of testing each position; see the sketch after the code below.
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    # I replaced the is_word function by a simple list lookup;
    # feel free to replace it with whatever function you use.
    # Tries or suffix trees are best for this.
    S[0] = (a_string[0] in dictionary)
    for i in range(1, str_len):
        check = a_string[0:i+1] in dictionary # i+1 instead of i
        if (check):
            S[i] = check
        else:
            for j in range(0, i+1): # i+1 instead of i
                if (S[j-1] and (a_string[j:i+1] in dictionary)): # i+1 instead of i
                    S[i] = True
                    break
    return S
a_string = "carrotforever"
S = is_all_words(a_string, ["a","car","carrot","for","eve","forever"])
print(S[len(a_string)-1]) #prints TRUE
a_string = "helloworld"
S = is_all_words(a_string, ["hello","world"])
print(S[len(a_string)-1]) #prints TRUE
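As a sketch of the split-point speed-up suggested above (the names is_all_words_fast, can_split and split_points are mine, not from the original), this variant records each valid split point as it is found instead of probing every j, and uses a sentinel entry for the empty prefix so S[0] needs no special case:

def is_all_words_fast(a_string, dictionary):
    n = len(a_string)
    # can_split[k] is True when a_string[:k] can be segmented;
    # index 0 stands for the empty prefix, trivially segmentable.
    can_split = [False] * (n + 1)
    can_split[0] = True
    split_points = [0]  # indexes k with can_split[k] == True, in order
    for i in range(1, n + 1):
        for j in split_points:
            if a_string[j:i] in dictionary:
                can_split[i] = True
                split_points.append(i)
                break
    return can_split[n]

print(is_all_words_fast("carrotforever", ["a","car","carrot","for","eve","forever"])) #prints True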

For a real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables, but it illustrates the recursive approach. By modifying the score function you can prioritize longer matches.
Installation is easy with pip:
$ pip install wordsegment
And segment returns a list of words:
>>> import wordsegment
>>> wordsegment.segment('carrotforever')
['carrot', 'forever']
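(Depending on the installed version, you may need to load the bundled frequency tables explicitly before segmenting; newer releases of the module expose a load() function for this:)

>>> import wordsegment
>>> wordsegment.load()
>>> wordsegment.segment('carrotforever')
['carrot', 'forever']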

1) At first glance it looks good. One thing: for j in range(1, str_len): should be for j in range(1, i):, I think.
2) If S[str_len - 1] == True, then the whole string consists of dictionary words only.
After all, S[i] is true iff
the whole string from 0 to i is a single dictionary word,
OR there exists a j < i with S[j-1] == True such that string[j:i] is a single dictionary word.
So if S[str_len - 1] is true, then the whole string is composed of dictionary words.
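In code, that check is just the last entry of the table (a one-line wrapper, assuming an is_all_words that returns the full S list as above):

def can_segment(a_string, dictionary):
    # the last entry of S answers the question for the whole string
    return is_all_words(a_string, dictionary)[-1]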

Related

Find the minimum number of words (distance) between repeated occurrences of a search string in the input string

Here are test cases for the code:
string - 'Tim had been saying that he had been there'
search - 'had'
expected output - 4
string - 'he got what he got and what he wanted'
search - 'he'
expected output - 2
def return_distance(input, search):
    words = input.split()
    distance = None
    indx = []
    if not input or not search:
        return None
    else:
        if words.count(search) > 1:
            indx = [index for index, word in enumerate(words) if word == search]
            distance = indx[1] - indx[0]
            for i in range(len(indx)-1):
                distance = min(distance, indx[i+1] - indx[i])-1
    return distance
I am thinking about how to optimize the code. I admit it is poorly written.
How about
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])])
This splits the input sentence, makes a list of every index that matches the target word, then iterates over that list computing the difference between each pair of consecutive indexes, and returns the minimum difference.
Since the behavior is unspecified when the sentence doesn't contain the word, it raises an error, but you can add a check for this and return the value of your choice using min's default parameter:
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
As an aside, naming a variable input shadows a builtin, and return_distance is a rather ambiguous name for a function.
Adding a precondition check for None parameters, as done with if not input or not search:, is not typically done in Python (we assume the caller will always pass in a string and adhere to the function's contract).
If you want to generalize this further, move the split() duty to the domain of the caller which enables the function to operate on arbitrary iterables:
def min_distance_between_occurrences(it, target):
    idxes = [i for i, e in enumerate(it) if e == target]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
Call with:
min_distance_between_occurrences("a b c a".split(), "a")
min_distance_between_occurrences([(1, 2), (1, 3), (1, 2)], (1, 2))
Refactoring aside, as pointed out in the comments, the original code isn't correct. Issues include:
search_str does not exist. You probably meant search.
distance and min_dist don't really work together. Pick one or the other and use it for all minimum calculations.
min(min_dist, indx[i+1] - indx[i])-1 subtracts 1 in the wrong place, throwing off the count.
Here's a potential fix for these issues:
def return_distance(input, search):
    words = input.split()
    distance = None
    if words.count(search) > 1:
        indx = [index for index, word in enumerate(words) if word == search]
        distance = indx[1] - indx[0] - 1
        #                            ^^^^
        for i in range(len(indx) - 1):
            distance = min(distance, indx[i+1] - indx[i] - 1)
            #                                            ^^^^
    return distance
One way is to use min with a list comprehension on indx:
min_dist = min([(indx[i+1] - indx[i] - 1) for i in range(len(indx)-1)])

Word Ladder without replacement in python

I have a question where I need to implement the word ladder problem with different logic.
In each step, the player must either add one letter to the word
from the previous step, or take away one letter, and then rearrange the letters to make a new word.
croissant(-C) -> arsonist(-S) -> aroints(+E)->notaries(+B)->baritones(-S)->baritone
Each new word must be a valid word from wordList.txt, which is a dictionary of words.
My code looks like this.
First I calculate the characters to remove ("remove_list") and to add ("add_list") and store those values in lists.
Then I read the file and store the words in a dictionary keyed by their sorted letters.
Then I start removing letters from (and adding letters to) the start word and matching the results against the dictionary.
But now the challenge is that some words, after a deletion or addition, don't match the dictionary, and the search misses the goal.
In that case it should backtrack to the previous step and add a letter instead of subtracting one.
I am looking for some sort of recursive function which could help with this, or a completely new approach that achieves the output.
Sample of my code.
start = 'croissant'
goal = 'baritone'
list_start = map(list, start)
list_goal = map(list, goal)
remove_list = [x for x in list_start if x not in list_goal]
add_list = [x for x in list_goal if x not in list_start]
file = open('wordList.txt', 'r')
dict_words = {}
for word in file:
    strip_word = word.rstrip()
    dict_words[''.join(sorted(strip_word))] = strip_word
file.close()
final_list = []
flag_remove = 0
for i in remove_list:
    sorted_removed_list = sorted(start.replace(''.join(map(str, i)), "", 1))
    sorted_removed_string = ''.join(map(str, sorted_removed_list))
    if sorted_removed_string in dict_words.keys():
        print dict_words[sorted_removed_string]
        final_list.append(sorted_removed_string)
        flag_remove = 1
        start = sorted_removed_string
print final_list
flag_add = 0
for i in add_list:
    first_character = ''.join(map(str, i))
    sorted_joined_list = sorted(''.join([first_character, final_list[-1]]))
    sorted_joined_string = ''.join(map(str, sorted_joined_list))
    if sorted_joined_string in dict_words.keys():
        print dict_words[sorted_joined_string]
        final_list.append(sorted_joined_string)
        flag_add = 1
        sorted_removed_string = sorted_joined_string
Recursion-based backtracking isn't a good idea for a search problem of this sort. It blindly goes downward in the search tree without exploiting the fact that words are almost never 10-12 steps away from each other, causing a stack overflow ("maximum recursion depth exceeded" in Python).
The solution here uses breadth-first search. It uses mate(s) as a helper which, given a word s, finds all possible words we can travel to next. mate in turn uses a global dictionary wdict, pre-processed at the beginning of the program, which for a given word finds all its anagrams (i.e. rearrangements of its letters).
from queue import Queue

# read the word list, stripping the trailing newline from each line
words = set(''.join(s[:-1]) for s in open("wordsEn.txt"))

# map each sorted letter string to all of its anagrams
wdict = {}
for w in words:
    s = ''.join(sorted(w))
    if s in wdict: wdict[s].append(w)
    else: wdict[s] = [w]

def mate(s):
    global wdict
    # candidates: remove one letter, or append one letter a-z
    ans = [''.join(s[:c]+s[c+1:]) for c in range(len(s))]
    for c in range(97,123): ans.append(s + chr(c))
    # keep only candidates whose letters rearrange into real words
    for m in ans: yield from wdict.get(''.join(sorted(m)),[])

def bfs(start,goal,depth=0):
    already = set([start])
    prev = {}
    q = Queue()
    q.put(start)
    while not q.empty():
        cur = q.get()
        if cur==goal:
            ans = []
            while cur: ans.append(cur); cur = prev.get(cur)
            return ans[::-1] #reverse the array
        for m in mate(cur):
            if m not in already:
                already.add(m)
                q.put(m)
                prev[m] = cur

print(bfs('croissant','baritone'))
which outputs: ['croissant', 'arsonist', 'rations', 'senorita', 'baritones', 'baritone']
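As an aside (my addition, not part of the original answer): since the search is single-threaded, collections.deque is a lighter drop-in for queue.Queue, which pays for thread-safety locks that aren't needed here. A variant of bfs reusing mate from above:

from collections import deque

def bfs(start, goal):
    already = set([start])
    prev = {}
    q = deque([start])
    while q:
        cur = q.popleft()  # FIFO pop, same visiting order as Queue.get()
        if cur == goal:
            ans = []
            while cur: ans.append(cur); cur = prev.get(cur)
            return ans[::-1]
        for m in mate(cur):
            if m not in already:
                already.add(m)
                q.append(m)
                prev[m] = cur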

Somewhere inside my loop it's not appending results to a list. Why?

So I have two files/dictionaries I want to compare, using a binary search implementation (yes, this is very obviously homework).
One file is
american-english
Amazon
Americana
Americanization
Civilization
And the other file is
british-english
Amazon
Americana
Americanisation
Civilisation
The code below should be pretty straightforward: import the files, compare them, return the differences. However, somewhere near the bottom, where it says entry == found_difference:, I feel as if the debugger skips right over it, even though I can see the two variables in memory being different, and I only get the final element returned in the end. Where am I going wrong?
# File importer
def wordfile_to_list(filename):
    """Converts a list of words to a Python list"""
    wordlist = []
    with open(filename) as f:
        for line in f:
            wordlist.append(line.rstrip("\n"))
    return wordlist

# Binary search algorithm
def binary_search(sorted_list, element):
    """Search for element in list using binary search. Assumes sorted list"""
    matches = []
    index_start = 0
    index_end = len(sorted_list)
    while (index_end - index_start) > 0:
        index_current = (index_end - index_start) // 2 + index_start
        if element == sorted_list[index_current]:
            return True
        elif element < sorted_list[index_current]:
            index_end = index_current
        elif element > sorted_list[index_current]:
            index_start = index_current + 1
        return element

# Check file differences using the binary search algorithm
def wordfile_differences_binarysearch(file_1, file_2):
    """Finds the differences between two plaintext lists,
    using binary search algorithm, and returns them in a new list"""
    wordlist_1 = wordfile_to_list(file_1)
    wordlist_2 = wordfile_to_list(file_2)
    matches = []
    for entry in wordlist_1:
        found_difference = binary_search(sorted_list=wordlist_2, element=entry)
        if entry == found_difference:
            pass
    else:
        matches.append(found_difference)
    return matches

# Check if it works
differences = wordfile_differences_binarysearch(file_1="british-english", file_2="american-english")
print(differences)
You don't have an else suite for your if statement. Your if statement does nothing (it uses pass when the test is true, skipped otherwise).
You do have an else suite for the for loop:
for entry in wordlist_1:
    # ...
else:
    matches.append(found_difference)
A for loop can have an else suite as well; it is executed when a loop completes without a break statement. So when your for loop completes, the current value for found_difference is appended; so whatever was assigned last to that name.
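A quick illustration of for ... else semantics (my own example, not from the original post):

# the else block runs only when the loop finishes without break
for n in [1, 3, 5]:
    if n % 2 == 0:
        print("found an even number")
        break
else:
    print("no even numbers")  # this is what prints here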
Fix your indentation if the else suite was meant to be part of the if test:
for entry in wordlist_1:
    found_difference = binary_search(sorted_list=wordlist_2, element=entry)
    if entry == found_difference:
        pass
    else:
        matches.append(found_difference)
However, you shouldn't use a pass statement there, just invert the test:
matches = []
for entry in wordlist_1:
    found_difference = binary_search(sorted_list=wordlist_2, element=entry)
    if entry != found_difference:
        matches.append(found_difference)
Note that the variable name matches feels off here; you are appending words that are missing in the other list, not words that match. Perhaps missing is a better variable name here.
Note that your binary_search() function always returns element, the word you searched on. That'll always be equal to the element you passed in, so you can't use that to detect if a word differed! You need to unindent that last return line and return False instead:
def binary_search(sorted_list, element):
    """Search for element in list using binary search. Assumes sorted list"""
    index_start = 0
    index_end = len(sorted_list)
    while (index_end - index_start) > 0:
        index_current = (index_end - index_start) // 2 + index_start
        if element == sorted_list[index_current]:
            return True
        elif element < sorted_list[index_current]:
            index_end = index_current
        elif element > sorted_list[index_current]:
            index_start = index_current + 1
    return False
Now you can use a list comprehension in your wordfile_differences_binarysearch() loop:
[entry for entry in wordlist_1 if not binary_search(wordlist_2, entry)]
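Folded back into the comparison function, that becomes (a sketch reusing wordfile_to_list and the fixed binary_search above):

def wordfile_differences_binarysearch(file_1, file_2):
    # words from file_1 that binary search cannot find in file_2
    wordlist_1 = wordfile_to_list(file_1)
    wordlist_2 = wordfile_to_list(file_2)
    return [entry for entry in wordlist_1 if not binary_search(wordlist_2, entry)]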
Last but not least, you don't have to re-invent the binary search wheel, just use the bisect module:
from bisect import bisect_left

def binary_search(sorted_list, element):
    index = bisect_left(sorted_list, element)
    return index < len(sorted_list) and sorted_list[index] == element
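For example, with a small sorted word list (sample data of my own):

words = ["Amazon", "Americana", "Americanization", "Civilization"]
print(binary_search(words, "Americana"))        # True
print(binary_search(words, "Americanisation"))  # False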
With sets
Binary search is used to improve the efficiency of an algorithm, decreasing the complexity of each lookup from O(n) to O(log n).
Since the naive approach would be to check every word in wordlist_1 against every word in wordlist_2, the complexity would be O(n**2).
Using binary search brings this down to O(n * log n), which is already much better.
Using sets, you could get O(n):
american = """Amazon
Americana
Americanization
Civilization"""
british = """Amazon
Americana
Americanisation
Civilisation"""
american = {line.strip() for line in american.split("\n")}
british = {line.strip() for line in british.split("\n")}
You could get the american words not present in the british dictionary:
print(american - british)
# {'Civilization', 'Americanization'}
You could get the british words not present in the american dictionary:
print(british - american)
# {'Civilisation', 'Americanisation'}
You could get the union of the two last sets. I.e. words that are present in exactly one dictionary:
print(american ^ british)
# {'Americanisation', 'Civilisation', 'Americanization', 'Civilization'}
This approach is faster and more concise than any binary search implementation. But if you really want to use one, as usual, you cannot go wrong with @MartijnPieters' answer.
With two iterators
Since you know the two lists are sorted, you could simply iterate in parallel over the two sorted lists and look for any difference:
american = """Amazon
Americana
Americanism
Americanization
Civilization"""
british = """Amazon
Americana
Americanisation
Americanism
Civilisation"""
american = [line.strip() for line in american.split("\n")]
british = [line.strip() for line in british.split("\n")]
n1, n2 = len(american), len(british)
i, j = 0, 0
while True:
    try:
        w1 = american[i]
        w2 = british[j]
        if w1 == w2:
            i += 1
            j += 1
        elif w1 < w2:
            print('%s is in american dict only' % w1)
            i += 1
        else:
            print('%s is in british dict only' % w2)
            j += 1
    except IndexError:
        break

for w1 in american[i:]:
    print('%s is in american dict only' % w1)
for w2 in british[j:]:
    print('%s is in british dict only' % w2)
It outputs:
Americanisation is in british dict only
Americanization is in american dict only
Civilisation is in british dict only
Civilization is in american dict only
It's O(n) as well.

Search and replace multiple specific sequences of elements in Python list/array

I currently have 6 separate for loops which iterate over a list of numbers, looking to match specific sequences of numbers within larger sequences and replace them like this:
[...0,1,0...] => [...0,0,0...]
[...0,1,1,0...] => [...0,0,0,0...]
[...0,1,1,1,0...] => [...0,0,0,0,0...]
And their inverse:
[...1,0,1...] => [...1,1,1...]
[...1,0,0,1...] => [...1,1,1,1...]
[...1,0,0,0,1...] => [...1,1,1,1,1...]
My existing code is like this:
for i in range(len(output_array)-2):
    if output_array[i] == 0 and output_array[i+1] == 1 and output_array[i+2] == 0:
        output_array[i+1] = 0

for i in range(len(output_array)-3):
    if output_array[i] == 0 and output_array[i+1] == 1 and output_array[i+2] == 1 and output_array[i+3] == 0:
        output_array[i+1], output_array[i+2] = 0, 0
In total I'm iterating over the same output_array 6 times, using brute force checking. Is there a faster method?
# I would create a map between the string searched and the new one.
patterns = {}
patterns['010'] = '000'
patterns['0110'] = '0000'
patterns['01110'] = '00000'

# I would loop over the lists
lists = [[0,1,0,0,1,1,0,0,1,1,1,0]]
for lista in lists:
    # I would join the list elements as a string
    string_list = ''.join(map(str, lista))
    # we loop over the patterns
    for pattern, value in patterns.items():
        # if a pattern is detected, we replace it
        string_list = string_list.replace(pattern, value)
    lista = list(string_list)
    print lista
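Note that list(string_list) yields single-character strings ('0', '1'), not ints; if the list should hold ints again afterwards, a final conversion (my addition, not in the original answer) restores them:

lista = [int(ch) for ch in string_list]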
While this question is related to the questions here and here, OP's question concerns fast searching of multiple sequences at once. While the accepted answer works well, we may not want to loop through all the search sequences for every sub-iteration of the base sequence.
Below is an algorithm which checks for a sequence of i ints only if the sequence of (i-1) ints is present in the base sequence.
# This is the driver function which takes in a) the search sequences and
# replacements as a dictionary and b) the full sequence list in which to search
def findSeqswithinSeq(searchSequences, baseSequence):
    seqkeys = [[int(i) for i in elem.split(",")] for elem in searchSequences]
    maxlen = max([len(elem) for elem in seqkeys])
    decisiontree = getdecisiontree(seqkeys)
    i = 0
    while i < len(baseSequence):
        (increment, replacement) = get_increment_replacement(decisiontree, baseSequence[i:i+maxlen])
        if replacement != -1:
            baseSequence[i:i+len(replacement)] = searchSequences[",".join(map(str, replacement))]
        i += increment
    return baseSequence

# the following function gives the dictionary of intermediate sequences allowed
def getdecisiontree(searchsequences):
    dtree = {}
    for elem in searchsequences:
        for i in range(len(elem)):
            if i+1 == len(elem):
                dtree[",".join(map(str, elem[:i+1]))] = True
            else:
                dtree[",".join(map(str, elem[:i+1]))] = False
    return dtree

# the following function does most of the work, giving us a) how many
# positions we can skip in the search and b) whether the search seq was found
def get_increment_replacement(decisiontree, sequence):
    if str(sequence[0]) not in decisiontree:
        return (1, -1)
    for i in range(1, len(sequence)):
        key = ",".join(map(str, sequence[:i+1]))
        if key not in decisiontree:
            return (1, -1)
        elif decisiontree[key] == True:
            key = [int(i) for i in key.split(",")]
            return (len(key), key)
    return 1, -1
You can test the above code with this snippet:
if __name__ == "__main__":
    inputlist = [5,4,0,1,1,1,0,2,0,1,0,99,15,1,0,1]
    patternsandrepls = {'0,1,0': [0,0,0],
                        '0,1,1,0': [0,0,0,0],
                        '0,1,1,1,0': [0,0,0,0,0],
                        '1,0,1': [1,1,1],
                        '1,0,0,1': [1,1,1,1],
                        '1,0,0,0,1': [1,1,1,1,1]}
    print(findSeqswithinSeq(patternsandrepls, inputlist))
The proposed solution represents the sequences to be searched as a decision tree.
Because it skips many of the search positions, this method should do better than O(m*n) (where m is the number of search sequences and n is the length of the base sequence).
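To see the decision-tree encoding concretely, here is what getdecisiontree produces for two of the patterns (my own trace of the code above):

>>> getdecisiontree([[0,1,0], [0,1,1,0]])
{'0': False, '0,1': False, '0,1,0': True, '0,1,1': False, '0,1,1,0': True}

Each key is an allowed prefix of some search sequence; True marks a complete sequence, so the scanner can abandon a match as soon as a prefix falls out of the dictionary.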
EDIT: Changed the answer based on the clarifications in the edited question.

Find all Occurrences of Every Substring in String

I am trying to find all occurrences of sub-strings (of all lengths) in a main string. My function takes one string and then returns a dictionary of every sub-string (which occurs more than once, of course) and how many times it occurs (format of the dictionary: {substring: # of occurrences, ...}). I am using collections.Counter(s) to help me with it.
Here is my function:
from collections import Counter

def patternFind(s):
    patterns = {}
    for index in range(1, len(s)+1)[::-1]:
        d = nChunks(s, step=index)
        parts = dict(Counter(d))
        patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
    return patterns

def nChunks(iterable, start=0, step=1):
    return [iterable[i:i+step] for i in range(start, len(iterable), step)]
I have a string, data, with about 2500 random letters (in a random order). However, there are 2 copies of a string inserted into it (at random points). Say this string is 'TEST'. data.count('TEST') returns 2, yet patternFind(data)['TEST'] gives me a KeyError. Therefore, my program does not detect the two strings in it.
What have I done wrong? Thanks!
Edit: My method of creating test instances:
from random import randint, choice
from string import uppercase

def createNewTest():
    n = randint(500, 2500)
    x, y = randint(500, n), randint(500, n)
    s = ''
    for i in range(n):
        s += choice(uppercase)
        if i == x or i == y: s += "TEST"
    return s
Using Regular Expressions
Apart from the count() method you described, regex is an obvious alternative
import re
needle = r'TEST'
haystack = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklagh'
pattern = re.compile(needle)
print len(re.findall(pattern, haystack))
Short Cut
If you need to build a dictionary of substrings, possibly you can do this with only a subset of those strings. Assuming you know the needle you are looking for in the data, then you only need the dictionary of substrings of data that are the same length as the needle. This is very fast.
from collections import Counter

needle = "TEST"

def gen_sub(s, len_chunk):
    for start in range(0, len(s)-len_chunk+1):
        yield s[start:start+len_chunk]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
parts = Counter([sub for sub in gen_sub(data, len(needle))])
print parts[needle]
Brute Force: building dictionary of all substrings
If you need to have a count of all possible substrings, this works but it is very slow:
from collections import Counter

def gen_sub(s):
    for start in range(0, len(s)):
        for end in range(start+1, len(s)+1):
            yield s[start:end]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhz'
parts = Counter([sub for sub in gen_sub(data)])
print parts['TEST']
Substring generator adapted from this: https://stackoverflow.com/a/8305463/1290420
While jurgenreza has explained why your program didn't work, the solution is still quite slow. If you only examine substrings s for which you know that s[:-1] repeats, you get a much faster solution (typically a hundred times faster and more):
from collections import defaultdict

def pfind(prefix, sequences):
    collector = defaultdict(list)
    for sequence in sequences:
        collector[sequence[0]].append(sequence)
    for item, matching_sequences in collector.items():
        if len(matching_sequences) >= 2:
            new_prefix = prefix + item
            yield (new_prefix, len(matching_sequences))
            for r in pfind(new_prefix, [sequence[1:] for sequence in matching_sequences]):
                yield r

def find_repeated_substrings(s):
    s0 = s + " "
    return pfind("", [s0[i:] for i in range(len(s))])
If you want a dict, you call it like this:
result = dict(find_repeated_substrings(s))
On my machine, for a run with 2247 elements, it took 0.02 sec, while the original (corrected) solution took 12.72 sec.
(Note that this is a rather naive implementation; using indexes instead of substrings should be even faster.)
Edit: The following variant works with other sequence types (not only strings). Also, it doesn't need a sentinel element.
from collections import defaultdict

def pfind(s, length, ends):
    collector = defaultdict(list)
    if ends[-1] >= len(s):
        del ends[-1]
    for end in ends:
        if end < len(s):
            collector[s[end]].append(end)
    for key, matching_ends in collector.items():
        if len(matching_ends) >= 2:
            end = matching_ends[0]
            yield (s[end - length: end + 1], len(matching_ends))
            for r in pfind(s, length + 1, [end + 1 for end in matching_ends if end < len(s)]):
                yield r

def find_repeated_substrings(s):
    return pfind(s, 0, list(range(len(s))))
This still has the problem that very long substrings will exceed recursion depth. You might want to catch the exception.
The problem is in your nChunks function. It does not give you all the chunks that are necessary.
Let's consider a test string:
s='1test2345test'
For the chunks of size 4 your nChunks function gives this output:
>>> nChunks(s, step=4)
['1tes', 't234', '5tes', 't']
But what you really want is:
>>> def nChunks(iterable, start=0, step=1):
...     return [iterable[i:i+step] for i in range(len(iterable)-step+1)]
>>> nChunks(s, step=4)
['1tes', 'test', 'est2', 'st23', 't234', '2345', '345t', '45te', '5tes', 'test']
You can see that this way there are two 'test' chunks and your patternFind(s) will work like a charm:
>>> patternFind(s)
{'tes': 2, 'st': 2, 'te': 2, 'e': 2, 't': 4, 'es': 2, 'est': 2, 'test': 2, 's': 2}
Here you can find a solution that uses a recursive wrapper around string.find() to search all the occurrences of a substring in a main string.
The collectallchuncks() function returns a defaultdict with all the substrings as keys, and for each substring a list of all the indexes where that substring is found in the main string.
import collections

# Minimum substring size, may be 1
MINSIZE = 3

# Recursive wrapper
def recfind(p, data, pos, acc):
    res = data.find(p, pos)
    if res == -1:
        return acc
    else:
        acc.append(res)
        return recfind(p, data, res+1, acc)

def collectallchuncks(data):
    res = collections.defaultdict(str)
    size = len(data)
    for base in xrange(size):
        for seg in xrange(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            if data.count(chunk) > 1:
                res[chunk] = recfind(chunk, data, 0, [])
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks(data)
    print 'TEST', allchuncks['TEST']
    print 'hklag', allchuncks['hklag']
print 'hklag', allchuncks['hklag']
EDIT: If you just need the number of occurrences of each substring in the main string, you can easily obtain it by getting rid of the recursive function:
import collections

MINSIZE = 3

def collectallchuncks2(data):
    res = collections.defaultdict(str)
    size = len(data)
    for base in xrange(size):
        for seg in xrange(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            cnt = data.count(chunk)
            if cnt > 1:
                res[chunk] = cnt
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks2(data)
    print 'TEST', allchuncks['TEST']
    print 'hklag', allchuncks['hklag']
