Approaches for finding matches in a large dataset - python

I have a project where, given a list of ~10,000 unique strings, I want to find where those strings occur in a file with 10,000,000+ string entries. I also want to include partial matches if possible. My list of ~10,000 strings is dynamic data and updates every 30 minutes, and currently I'm not able to finish the searching fast enough to keep up with the updated data. My searches take about 3 hours now (compared to the 30 minutes I have to do the search within), so I feel my approach to this problem isn't quite right.
My current approach is to first create a list from the 10,000,000+ string entries. Then each item from the dynamic list is searched for in the larger list using an in check.
results_boolean = [keyword in n for n in string_data]
Is there a way I can greatly speed this up with a more appropriate approach?

Using a generator with a set is probably your best bet ... I think this solution will work and will presumably be faster.
def find_matches(target_words, filename_to_search):
    targets = set(target_words)
    with open(filename_to_search) as f:
        for line_no, line in enumerate(f):
            matching_intersection = targets.intersection(line.split())
            if matching_intersection:
                yield (line_no, line, matching_intersection)  # there was a match

for match in find_matches(["unique", "list", "of", "strings"], "search_me.txt"):
    print("Match: %s" % (match,))
    input("Hit Enter for next match:")  # py3 ... just to see your matches
Of course it gets harder if your matches are not single words, especially if there is no reliable grouping delimiter.
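In that case, one (admittedly brute-force) fallback is to check each target as a substring of each line; a rough sketch along those lines (names are illustrative, not from the question):
def find_substring_matches(target_strings, filename_to_search):
    # O(number of targets) substring tests per line - no indexing, just streaming.
    targets = list(target_strings)
    with open(filename_to_search) as f:
        for line_no, line in enumerate(f):
            hits = [t for t in targets if t in line]
            if hits:
                yield (line_no, line, hits)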

In general, you would want to preprocess the large, unchanging data in some way to speed repeated searches. But you said too little to suggest something clearly practical. Like: how long are these strings? What's the alphabet (e.g., 7-bit ASCII or full-blown Unicode)? How many characters total are there? Are characters in the alphabet equally likely to appear in each string position, or is the distribution highly skewed? If so, how? And so on.
Here's about the simplest kind of indexing, building a dict with a number of entries equal to the number of unique characters across all of string_data. It maps each character to the set of string_data indices of strings containing that character. Then a search for a keyword can be restricted to only the string_data entries known in advance to contain the keyword's first character.
Now, depending on details that can't be guessed from what you said, it's possible even this modest indexing will consume more RAM than you have - or it's possible that it's already more than good enough to get you the 6x speedup you seem to need:
# Preprocessing - do this just once, when string_data changes.
def build_map(string_data):
    from collections import defaultdict
    ch2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for ch in s:
            ch2ixs[ch].add(i)
    return ch2ixs
def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        ch = keyword[0]
        if ch in ch2ixs:
            result = []
            for i in ch2ixs[ch]:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)
Then, e.g.,
string_data = ['banana', 'bandana', 'bandito']
ch2ixs = build_map(string_data)
find_partial_matches(['ban', 'i', 'dana', 'xyz', 'na'],
                     string_data,
                     ch2ixs)
displays:
'ban' found in strings [0, 1, 2]
'i' found in strings [2]
'dana' found in strings [1]
'na' found in strings [0, 1]
If, e.g., you still have plenty of RAM, but need more speed, and are willing to give up on (probably silly - but can't guess from here) 1-character matches, you could index bigrams (adjacent letter pairs) instead.
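For example, a bigram variant of the same indexing idea might look roughly like this (my own sketch, untested against real data; names are illustrative):
from collections import defaultdict

def build_bigram_map(string_data):
    # Map each adjacent letter pair to the set of indices of strings containing it.
    bg2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for j in range(len(s) - 1):
            bg2ixs[s[j:j + 2]].add(i)
    return bg2ixs

def find_partial_matches_bigrams(keywords, string_data, bg2ixs):
    for keyword in keywords:
        if len(keyword) < 2:
            continue  # giving up on 1-character matches, as noted above
        candidates = bg2ixs.get(keyword[:2], ())
        result = [i for i in candidates if keyword in string_data[i]]
        if result:
            print(repr(keyword), "found in strings", result)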
In the limit, you could build a trie out of string_data, which would require lots of RAM, but could reduce the time to search for an embedded keyword to a number of operations proportional to the number of characters in the keyword, independent of how many strings are in string_data.
Note that you should really find a way to get rid of this:
results_boolean = [keyword in n for n in string_data]
Building a list with over 10 million entries for every keyword search makes every search expensive, no matter how cleverly you index the data.
Note: a probably practical refinement of the above is to restrict the search to strings that contain all of the keyword's characters:
def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        keyset = set(keyword)
        if all(ch in ch2ixs for ch in keyset):
            ixs = set.intersection(*(ch2ixs[ch] for ch in keyset))
            result = []
            for i in ixs:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)

Related

Find all the possible N-length anagrams - fast alternatives

I am given a sequence of letters and have to produce all the N-length anagrams of the sequence given, where N is the length of the sequence.
I am following a kind of naive approach in Python, where I am taking all the permutations in order to achieve that. I have found some similar threads like this one, but I would prefer a math-oriented approach in Python. So what would be a more performant alternative to permutations? Is there anything particularly wrong in my attempt below?
from itertools import permutations

def find_all_anagrams(word):
    pp = permutations(word)
    perm_set = set()
    for i in pp:
        perm_set.add(i)
    ll = [list(i) for i in perm_set]
    ll.sort()
    print(ll)
If there are lots of repeated letters, the key will be to produce each anagram only once instead of producing all possible permutations and eliminating duplicates.
Here's one possible algorithm which only produces each anagram once:
from collections import Counter

def perm(unplaced, prefix):
    if unplaced:
        for element in unplaced:
            yield from perm(unplaced - Counter(element), prefix + element)
    else:
        yield prefix

def permutations(iterable):
    yield from perm(Counter(iterable), "")
That's actually not much different from the classic recursion to produce all permutations; the only difference is that it uses a collections.Counter (a multiset) to hold the as-yet-unplaced elements instead of just using a list.
The number of Counter objects produced in the course of the iteration is certainly excessive, and there is almost certainly a faster way of writing that; I chose this version for its simplicity and (hopefully) its clarity.
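For instance, a quick sanity check with the generator above (note that this permutations shadows the itertools one):
anagrams = list(permutations("mississippi"))
print(len(anagrams))        # 34650 - each distinct anagram exactly once
print(len(set(anagrams)))   # also 34650 - no duplicates to remove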
This is very slow for long words with many repeated characters (slow compared to the theoretical maximum performance, that is). For example, permutations("mississippi") will produce a much longer list than necessary. It will have a length of 39916800, but the set has a size of 34650.
>>> len(list(permutations("mississippi")))
39916800
>>> len(set(permutations("mississippi")))
34650
So the big flaw with your method is that you generate ALL anagrams and then remove the duplicates. Use a method that only generates the unique anagrams.
EDIT:
Here is some working, but extremely ugly and possibly buggy code. I'm making it nicer as you're reading this. It does give 34650 for mississippi, so I assume there aren't any major bugs. Warning again. UGLY!
# Returns a dictionary with letter counts, e.g.
# get_letter_list("mississippi") returns
# {'i': 4, 'm': 1, 'p': 2, 's': 4}
def get_letter_list(word):
    w = sorted(word)
    dd = {}
    dd[w[0]] = 1
    for l in range(1, len(w)):
        if w[l] == w[l-1]:
            dd[w[l]] = dd[w[l]] + 1
        else:
            dd[w[l]] = 1
    return dd
def sum_dict(d):
    s = 0
    for x in d:
        s = s + d[x]
    return s

# Recursively create the anagrams. It takes a letter list
# from the above function as an argument.
def create_anagrams(dd):
    if sum_dict(dd) == 1:  # If there's only one letter left
        for l in dd:
            return l  # Ugly hack, because I'm not used to dicts
    a = []
    for l in dd:
        if dd[l] != 0:
            newdd = dict(dd)
            newdd[l] = newdd[l] - 1
            if newdd[l] == 0:
                newdd.pop(l)
            newl = create_anagrams(newdd)
            for x in newl:
                a.append(str(l) + str(x))
    return a
>>> print (len(create_anagrams(get_letter_list("mississippi"))))
34650
It works like this: for every unique letter l, create all unique permutations with one less occurrence of the letter l, and then append l to all these permutations.
For "mississippi", this is way faster than set(permutations(word)) and it's far from optimally written. For instance, dictionaries are quite slow and there's probably lots of things to improve in this code, but it shows that the algorithm itself is much faster than your approach.
Maybe I am missing something, but why don't you just do this:
from itertools import permutations

def find_all_anagrams(word):
    return sorted(set(permutations(word)))
You could simplify to:
from itertools import permutations

def find_all_anagrams(word):
    word = set(''.join(sorted(word)))
    return list(permutations(word))
In the doc for permutations the code is detailed and it seems already optimized.
I don't know Python, but I want to try to help you: there are probably a lot of other more performant algorithms, but I've thought about this one: it's completely recursive and it should cover all the cases of a permutation. I want to start with a basic example:
permutation of ABC
Now, this algorithm works in this way: for Length times you shift right the letters, but the last letter will become the first one (you could easily do this with a queue).
Back to the example, we will have:
ABC
BCA
CAB
Now you repeat the first (and only) step with the substring built from the second letter to the last one.
Unfortunately, with this algorithm you cannot consider permutation with repetition.
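Since the algorithm is only described in words, here is a rough Python sketch of that rotate-and-recurse idea (my own illustration, not code from the answer; as noted, it does not deduplicate anagrams when letters repeat):
def rotations(s):
    # All cyclic shifts of s, e.g. "ABC" -> ["ABC", "BCA", "CAB"]
    return [s[i:] + s[:i] for i in range(len(s))]

def permute(s):
    if len(s) <= 1:
        return [s]
    result = []
    for r in rotations(s):
        # Fix the first letter of this rotation, recurse on the rest.
        result.extend(r[0] + tail for tail in permute(r[1:]))
    return result

print(permute("ABC"))  # ['ABC', 'ACB', 'BCA', 'BAC', 'CAB', 'CBA']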

Any way to split python string without generate new strings?

The input is a string containing a huge number of characters, and I hope to split this string into a list of strings with a special delimiter.
But I guess that simply using split would generate new strings rather than splitting the original input string itself, and in that case it consumes a large amount of memory (it's guaranteed that the original string will not be used any longer).
So is there a convenient way to do this destructive split?
Here is the case:
input_string = 'data1 data2 <...> dataN'
output_list = ['data1', 'data2', <...> 'dataN']
What I hope is that the data1 (and all the others) in output_list and in input_string share the same memory area.
BTW, for each input string the size is 10MB-20MB; but as there are lots of such strings (about 100), I guess memory consumption should be taken into consideration here?
In Python, strings are immutable. This means that any operation that changes the string will create a new string. If you are worried about memory (although this shouldn't be much of an issue unless you are dealing with gigantic strings), you can always overwrite the old string with the new, modified string, replacing it.
The situation you are describing is a little different though, because the input to split is a string and the output is a list of strings. They are different types. In this case, I would just create a new variable containing the output of split and then set the old string (that was the input to the split function) to None, since you guarantee it will not be used again.
Code:
split_str = input_string.split(delim)
input_string = None
The only alternative would be to access the substrings using slicing instead of split. You can use str.find to find the position of each delimiter. However this would be slow, and fiddly. If you can use split and get the original string to drop out of scope then it would be worth the effort.
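For what it's worth, a lazy find-based splitter might look like the sketch below (my own illustration; note it still materializes each piece as a new string when you consume it, because Python strings cannot be split in place):
def lazy_split(s, delim):
    # Yield the pieces of s one at a time instead of building the whole list.
    start = 0
    while True:
        end = s.find(delim, start)
        if end == -1:
            yield s[start:]
            return
        yield s[start:end]
        start = end + len(delim)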
You say that this string is input, so you might like to consider reading a smaller number of characters so you are dealing with more manageable chunks. Do you really need all the data in memory at the same time?
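If the data arrives from a file, one way to do that is to read fixed-size chunks and carry over any token cut off at a chunk boundary (a sketch under those assumptions; the chunk size is arbitrary):
def tokens_from_file(f, delim=' ', chunk_size=1 << 20):
    # Read the file a chunk at a time and yield delimiter-separated tokens.
    leftover = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            if leftover:
                yield leftover
            return
        pieces = (leftover + chunk).split(delim)
        leftover = pieces.pop()  # possibly a partial token; finish it next round
        for piece in pieces:
            if piece:
                yield piece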
Perhaps the Pythonic way would be to use iterators? That way, the new substrings will be in memory only one at a time. Based on
Splitting a string into an iterator:
import re

string_long = "my_string " * 100000000  # takes some memory
# strings_split = string_long.split()   # takes too much memory
strings_reiter = re.finditer(r"(\S*)\s*", string_long)  # takes almost no memory
for match in strings_reiter:
    print(match.group())
This works fine without leading to memory problems.
If you're talking about strings that are SO huge that you can't stand to put them in memory, then maybe running through the string once (O(n), probably improvable using str.find but I'm not sure) then storing a generator that holds slice objects would be more memory-efficient?
long_string = "abc,def,ghi,jkl,mno,pqr"  # ad nauseam
splitters = [',']  # add whatever you want to split by
marks = [i for i, ch in enumerate(long_string) if ch in splitters]
slices = []
start = 0
for end in marks:
    slices.append(slice(start, end))
    start = end + 1
else:
    slices.append(slice(start, None))
split_string = (long_string[slice_] for slice_ in slices)

Python searching a large list speed

I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. This approach below works really well but is very very slow and I was wondering if there is a better way to approach this problem. This is the code:
from difflib import SequenceMatcher

headwordList = []  # This is a list of 650,000 words

sentenceList = []
openFile = open("sentences.txt", "r")
for line in openFile:
    sentenceList.append(line)

percentage = 0
count = 0
for y in sentenceList:
    if y not in headwordList:
        for x in headwordList:
            m = SequenceMatcher(None, y.lower(), x)
            if m.ratio() > percentage:
                percentage = m.ratio()
                word = x
        if percentage > 0.86:
            sentenceList[count] = word
    count = count + 1
Thanks for the help, software engineering is not even close to my strong suit. Much appreciated.
Two things that might provide some small help:
1) Use the approach in this SO answer to read through your large file the most efficiently.
2) Change your code from
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
to
yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, x)
You're converting each sentence to lower 650,000 times. No need for that.
You should change headwordList into a set.
The test word in headwordList will be very slow. It must do a string comparison on each word in headwordList, one word at a time. It will take time proportional to the length of the list; if you double the length of the list, you will double the amount of time it takes to do the test (on average).
With a set, it always takes the same amount of time to do the in test; it doesn't depend on the number of elements in the set. So that will be a huge speedup.
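A tiny illustration of the difference (dummy data, just to show the shape of the change):
headwordList = ["happy", "new", "year"]   # imagine 650,000 entries
headwords = set(headwordList)             # build the set once

print("year" in headwordList)  # list: compares element by element, O(n)
print("year" in headwords)     # set: a single hash lookup, O(1) on average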
Now, this whole loop can be simplified:
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
    if m.ratio() > percentage:
        percentage = m.ratio()
        word = x

if percentage > 0.86:
    sentenceList[count] = word
All this does is find the word from headwordList that has the highest ratio, and keep it (but only keep it if the ratio is over 0.86). Here's a faster way to do this. I'm going to change the name headwordList to just headwords as I want you to make it be a set and not a list.
def check_ratio(m):
    return m.ratio()

y = y.lower()  # do the .lower() call one time

m, word = max((SequenceMatcher(None, y, word), word) for word in headwords, key=check_ratio)

percentage = max(percentage, m.ratio())  # remember best ratio

if m.ratio() > 0.86:
    sentence_list.append(word)
This might seem a bit tricky but it is the fastest way to do this in Python. We will call the built-in max() function to find the SequenceMatcher result that has the highest ratio. First, we build a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each. But when we are done, we also want to know what the word was. So the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it that; we do this by making a function that tests what we care about, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which we then unpack into the variables m and word.
In Python, it is best practice to use variable names like sentence_list rather than sentenceList. Please see these guidelines: http://www.python.org/dev/peps/pep-0008/
It is not good practice to use an incrementing index variable and assign into indexed positions in a list. Rather, start with an empty list and use the .append() method function to append values.
Also, you might do better to build a dictionary of words and their ratios.
Note that your original code seems to have a bug: as soon as any word has a percentage over 0.86, all words are saved in sentenceList no matter what their ratio is. The code I wrote, above, only saves words where the word's own ratio was high enough.
EDIT: This is to answer a question about generator expressions needing to be parenthesized.
Whenever I get that error message, I usually split out the generator expression by itself and assign it to a variable. Like this:
def check_ratio(m):
    return m.ratio()

y = y.lower()  # do the .lower() call one time

genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
m, word = max(genexp, key=check_ratio)

percentage = max(percentage, m.ratio())  # remember best ratio

if m.ratio() > 0.86:
    sentence_list.append(word)
That's what I suggest. But if you don't mind a complicated line looking even busier, you can simply add an extra pair of parentheses as the error message suggests, so the generator expression is fully parenthesized. Like so:
m, word = max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)
Python lets you omit the explicit parentheses around a generator expression when you pass the expression to a function, but only if it is the only argument to that function. As we are also passing a key= argument, we need a fully parenthesized generator expression.
But I think it's easier to read if you split out the genexp on its own line.
EDIT: #Peter Wood pointed out that the documentation suggests reusing a SequenceMatcher for speed. I don't have time to test this, but I think this is the right way to do it.
Happily, the code got simpler! Always a good sign.
EDIT: I just tested the code. This code works for me; see if it works for you.
from difflib import SequenceMatcher

headwords = [
    # This is a list of 650,000 words
    # Dummy list:
    "happy",
    "new",
    "year",
]

def words_from_file(filename):
    with open(filename, "rt") as f:
        for line in f:
            for word in line.split():
                yield word

def _match(matcher, s):
    matcher.set_seq2(s)
    return (matcher.ratio(), s)

ratios = {}
best_ratio = 0

matcher = SequenceMatcher()

for word in words_from_file("sentences.txt"):
    matcher.set_seq1(word.lower())
    if word not in headwords:
        ratio, word = max(_match(matcher, word.lower()) for word in headwords)
        best_ratio = max(best_ratio, ratio)  # remember best ratio
        if ratio > 0.86:
            ratios[word] = ratio

print(best_ratio)
print(ratios)
1) I would store headwordList as a set, not a list, allowing for faster access as it is a hashed data structure.
2) You have sentenceList defined as a list then attempt to use it as a dictionary with sentenceList[x] = y. I would define a different structure specifically for counts.
3) You construct sentenceList which doesn't need to be done.
for line in file:
    if line not in headwordList...
4) You never tokenize line, which means you store the entire line (up to the newline character) in sentenceList and check whether the whole line is in a wordlist. A sketch combining these points follows below.
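Putting those four points together, a rough sketch might look like this (illustrative only; headword_list stands in for the 650,000-word list and the 0.86 threshold comes from the question):
from difflib import SequenceMatcher

headword_list = ["happy", "new", "year"]  # stand-in for the 650,000-word list
headwords = set(headword_list)            # point 1: a set for fast membership tests
corrections = {}                          # point 2: a separate structure for results

with open("sentences.txt") as f:          # point 3: no intermediate sentenceList
    for line in f:
        for token in line.split():        # point 4: tokenize each line
            if token in headwords:
                continue
            low = token.lower()
            ratio, best = max((SequenceMatcher(None, low, w).ratio(), w)
                              for w in headwords)
            if ratio > 0.86:
                corrections[token] = best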
This is a data structures question. What you want to do is turn your list into something with faster element lookup speed; for example, a binary search tree would work great here: time complexity is only O(log n) as opposed to O(n) in a list (which is, in comparison, incredibly fast).
There's a fairly simple explanation here:
http://interactivepython.org/runestone/static/pythonds/Trees/balanced.html
But if you are not familiar with tree concepts, you might want to start few chapters earlier:
http://interactivepython.org/runestone/static/pythonds/Trees/trees.html
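If a full tree implementation feels like overkill, the standard-library bisect module gives the same O(log n) lookup on a sorted list (my own sketch, not from the linked pages):
import bisect

headwords = sorted(["happy", "new", "year"])  # sort the 650,000 words once

def in_sorted(sorted_words, word):
    # Binary search: O(log n) comparisons instead of scanning the whole list.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print(in_sorted(headwords, "new"))   # True
print(in_sorted(headwords, "old"))   # False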

Sorting a concordance?

For my homework, I need to isolate the most frequent 50 words in a text. I have tried a whole lot of things, and in my most recent attempt, I have done a concordance using this:
concordance = {}
lineno = 0
for line in vocab:
    lineno = lineno + 1
    words = re.findall(r'[A-Za-z][A-Za-z\'\-]*', line)
    for word in words:
        word = word.title()
        if word in concordance:
            concordance[word].append(lineno)
        else:
            concordance[word] = [lineno]

listing = []
for key in sorted(concordance.keys()):
    listing.append([key, concordance[key]])
What I would like to know is whether I can sort the subsequent concordance in order of most frequently used word to least frequently used word, and then isolate and print the top 50? I am not permitted to import any modules other than re and sys, and I'm struggling to come up with a solution.
sorted is a builtin which does not require import. Try something like:
sorted(concordance.items(), key=lambda kv: len(kv[1]), reverse=True)[:50]
Not tested, but you get the idea.
sorted returns a list, so you can slice off the first 50 entries directly; the key function ranks each word by how many line numbers it has recorded (i.e. its frequency), and reverse=True puts the most frequent first.
There are probably slightly more efficient ways to take the first 50, but I doubt it matters here.
A few hints (a short sketch that puts them together follows below):
Use enumerate(list) in your for loop to get the line number and the line at once.
Try using \w for word characters in your regular expression instead of listing [A-Za-z...].
Read about the dict.items() method. It will return a list of (key, value) pairs.
Manipulate that list with list.sort(key=function_to_compare_two_items).
You can define that function with a lambda, but it is not necessary.
Use the len(list) function to get the length of the list. You can use it to get the number of matches of a word (which are stored in a list).
UPDATE: Oh yeah, and use slices to get a part of the resulting list. list[:50] to get the first 50 items (equivalent to list[0:50]), and list[5:10] to get the items from index 5 inclusive to index 10 exclusive.
To print them, loop through the resulting list, then print every word. Alternatively, you can use something similar to print '[separator]'.join(list) to print a string with all the items separated by '[separator]'.
Good luck.
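Here is a minimal sketch that strings those hints together (my own illustration; only re is imported, as the assignment allows, and the file name is hypothetical):
import re

vocab = open("text.txt")  # any iterable of lines, as in the question

concordance = {}
for lineno, line in enumerate(vocab, start=1):    # enumerate gives line numbers
    for word in re.findall(r"\w[\w'-]*", line):   # \w for word characters
        concordance.setdefault(word.title(), []).append(lineno)

pairs = list(concordance.items())                        # (word, [line numbers]) pairs
pairs.sort(key=lambda item: len(item[1]), reverse=True)  # len() gives the frequency
for word, lines in pairs[:50]:                           # slice off the top 50
    print(word, len(lines))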

Finding words from random input letters in python. What algorithm to use/code already there?

I am trying to code a word descrambler like this one here and was wondering what algorithms I should use to implement this. Also, if anyone can find existing code for this that would be great as well. Basically the functionality is going to be like a boggle solver but without being a matrix, just searching for all word possibilities from a string of characters. I do already have adequate dictionaries.
I was planning to do this in either python or ruby.
Thanks in advance for your help guys!
I'd use a Trie. Here's an implementation in Python: http://jtauber.com/2005/02/trie.py (credit to James Tauber)
I may be missing an understanding of the game but, barring some complications in the rules, such as the introduction of "joker" (wildcard) letters, missing or additional letters, multiple words etc., I think the following ideas would help turn the problem into a somewhat relatively uninteresting thing. :-(
Main idea: index words by the ordered sequence of their letters.
For example, "computer" gets keyed as "cemoprtu". Whatever the random drawing provides is sorted in kind and used as a key to find possible matches.
Using trie structures, as suggested by perimosocordiae, as the underlying storage for these sorted keys and associated word(s)/wordIds in the "leaf" nodes, word lookup can be done in O(n) time, where n is the number of letters (or better, on average, due to non-existing words).
To further help with indexing we can have several tables/dictionaries, one per number of letters. Also depending on statistics the vowels and consonants could be handled separately. Another trick would be to have a custom sort order, placing the most selective letters first.
Additional twists to the game (such as finding words made from a subset of the letters) are mostly a matter of iterating the power set of these letters and checking the dictionary for each combination.
A few heuristics can be introduced to help prune some of the combinations (for example, combinations without vowels [and of a given length] are not possible solutions, etc.). One should manage these heuristics carefully, for the lookup cost is relatively small.
For your dictionary index, build a map (Map[Bag[Char], List[String]]). It should be a hash map so you can get O(1) word lookup. A Bag[Char] is an identifier for a word that is unique up to character order. It is basically a hash map from Char to Int. The Char is a given character in the word and the Int is the number of times that character appears in the word.
Example:
{'a'=>3, 'n'=>1, 'g'=>1, 'r'=>1, 'm'=>1} => ["anagram"]
{'s'=>3, 't'=>1, 'r'=>1, 'e'=>2, 'd'=>1} => ["stressed", "desserts"]
To find words, take every combination of characters from the input string and look it up in this map. The complexity of this algorithm is O(2^n) in the length of the input string. Notably, the complexity does not depend on the length of the dictionary.
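One way to realize a Bag[Char] key in Python is a frozenset of (character, count) pairs built with collections.Counter; a rough sketch of the whole scheme (my own illustration, not code from the answer):
from collections import Counter, defaultdict
from itertools import combinations

def bag(word):
    # A hashable "Bag[Char]": character -> count, frozen so it can be a dict key.
    return frozenset(Counter(word).items())

def build_index(dictionary_words):
    index = defaultdict(list)
    for w in dictionary_words:
        index[bag(w)].append(w)
    return index

def find_words(letters, index):
    # Try every combination of the input letters: O(2^n) in the number of letters.
    found = set()
    for size in range(1, len(letters) + 1):
        for combo in combinations(letters, size):
            found.update(index.get(bag(combo), ()))
    return found

index = build_index(["anagram", "nag", "ram", "a"])
print(find_words("anagrams", index))  # the four dictionary words, in some order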
This sounds like Rabin-Karp string search would be a good choice. If you use a rolling hash-function then at each position you need one hash value update and one dictionary lookup. You also need to create a good way to cope with different word lengths, like truncating all words to the shortest word in the set and rechecking possible matches. Splitting the word set into separate length ranges will reduce the amount of false positives at the expense of increasing the hashing work.
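A bare-bones sketch of that idea for a single fixed word length (my own illustration; a real version would group the dictionary by length and handle collisions and case more carefully):
def rabin_karp_find(text, words, base=256, mod=1_000_003):
    # Assumes all words share the same length k (truncate or group by length otherwise).
    k = len(next(iter(words)))

    def h(s):
        v = 0
        for ch in s:
            v = (v * base + ord(ch)) % mod
        return v

    targets = {}
    for w in words:
        targets.setdefault(h(w), set()).add(w)

    hits = []
    if len(text) < k:
        return hits
    high = pow(base, k - 1, mod)      # weight of the window's leading character
    window = h(text[:k])
    for i in range(len(text) - k + 1):
        # One dictionary lookup per position; re-check the slice to rule out collisions.
        if window in targets and text[i:i + k] in targets[window]:
            hits.append((i, text[i:i + k]))
        if i + k < len(text):
            # Rolling update: drop text[i], shift, add text[i + k].
            window = ((window - ord(text[i]) * high) * base + ord(text[i + k])) % mod
    return hits

print(rabin_karp_find("mississippi", {"iss", "sip"}))  # [(1, 'iss'), (4, 'iss'), (6, 'sip')]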
There are two ways to do this. One is to check every candidate permutation of letters in the word to see if the candidate is in your dictionary of words. That's an O(N!) operation, depending on the length of the word.
The other way is to check every candidate word in your dictionary to see if it's contained within the word. This can be sped up by aggregating the dictionary; instead of every candidate word, you check all words that are anagrams of each other at once, since if any one of them is contained in your word, all of them are.
So start by building a dictionary whose key is a sorted string of letters and whose value is a list of the words that are anagrams of the key:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> with open(r"c:\temp\words.txt", "r") as f:
...     for line in f.readlines():
...         if line[0].isupper(): continue
...         word = line.strip()
...         key = "".join(sorted(word.lower()))
...         d[key].append(word)
...
Now we need a function to see if a word contains a candidate. This function assumes that the word and candidate are both sorted, so that it can go through them both letter by letter and give up quickly when it finds that they don't match.
>>> def contains(sorted_word, sorted_candidate):
...     wchars = (c for c in sorted_word)
...     for cc in sorted_candidate:
...         while True:
...             try:
...                 wc = next(wchars)
...             except StopIteration:
...                 return False
...             if wc < cc: continue
...             if wc == cc: break
...             return False
...     return True
...
Now find all the candidate keys in the dictionary that are contained by the word, and aggregate all of their values into a single list:
>>> w = sorted("mythopoetic")
>>> result = []
>>> for k in d.keys():
...     if contains(w, k): result.extend(d[k])
...
>>> len(result)
429
>>> sorted(result)[:20]
['c', 'ce', 'cep', 'ceti', 'che', 'chetty', 'chi', 'chime', 'chip', 'chit', 'chitty', 'cho', 'chomp', 'choop', 'chop', 'chott', 'chyme', 'cipo', 'cit', 'cite']
That last step takes about a quarter second on my laptop; there are 195K keys in my dictionary (I'm using the BSD Unix words file).
