Given a list of poker flops and a str as a target:
target = '5c6d2d'
flops = ['5s4d3s', '6s4d2d', '6s5d3s', '6s4s2d']
I am trying to find the closest match to the target. Currently using fuzzywuzzy.process.extract, but sometimes this doesn't return the desired match. And also (more importantly) it doesn't account for ranks properly because the rank of the face cards are represented by letters so 9c9d9s is more similar too 2c2d2h than it is to TcTdTh. Is there a clever way of parsing the target flop to do this with a simple algorithm? Or would it be best to try and train a machine learning model for this?
Side note in case it's relevant: my fuzzywuzzy uses the pure-python SequenceMatcher as I don't have the privileges to install Visual Studio 14 for python-Levenshtein.
EDIT: To clarify, by closest match I mean the closest in terms of flop texture, i.e. the flop that is strategically the most similar (I understand this is somewhat of a difficult qualification). The initial list of examples I gave is actually not too clear, so here is another example with fuzzywuzzy:
>>> target = '8c8h5s'
>>> flops = ['2d2s6c', '7c5s5d', '8s8d7h', '4h3s3d']
>>> matches = process.extract(target, flops)
>>> print(matches)
[('7c5s5d', 50), ('8s8d7h', 50), ('4h3s3d', 33), ('2d2s6c', 17)]
For my purposes '8s8d7h' should score better than '7c5s5d'. So the matching of ranks should be given priority over the matching of suits.
What do you mean by closest match? Do you mean just least number of different characters?
If so, here's how you could do it:
def find_closest(main_flop, flops):
best_match = ''
matching_characters = 0
for flop in flops:
for i, character in enumerate(flop):
if main_flop[i] == character:
matching_characters += 1
if matching characters > max_matching:
best_match = flop
return best_match
On the other hand, if you want to find the "closest" flop in terms of how similar of equities it gives to a full range of hands, then you will have to build something quite complex, or use a library like https://github.com/worldveil/deuces or https://poker.readthedocs.io/en/latest/
Related
I'm trying to find out how similar are 2 sentences.
For doing it i'm using gensim word mover distance and since what i'm trying to find it's a similarity i do like it follow:
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
What i give as an input are 2 strings:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
The model i'm using is the one that you can find on the web: word2vec-google-news-300
I load it with this code:
wv = api.load("word2vec-google-news-300")
It give me reasonable results.
Here it's where the problem starts.
For what i can read from the documentation here it seems the wmd take as input a list of string and not a string like i do!
def preprocess(sentence):
return [w for w in sentence.lower().split() if w not in stop_words]
sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)
When i follow the documentation i get results really different:
wmd using string as input: 0.5562025871542842
wmd using list of string as input: -0.0174646259300113
I'm really confused. Why is it working with string as input and it works better than when i give what the documentation is asking for?
The function needs a list-of-string-tokens to give proper results: if your results pasing full strings look good to you, it's pure luck and/or poor evaluation.
So: why do you consider 0.556 to be a better value than -0.017?
Since passing the texts as plain strings means they are interpreted as lists-of-single-characters, the value there is going to be a function of how different the letters in the two texts are - and the fact that all English sentences of about the same length have very similar letter-distributions, means most texts will rate as very-similar under that error.
Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not two different results from different processes (where one of them is essentially random). You shouldn't consider absolute values that are exceeding some set threshold, or close to 1.0, as definitively good. You should instead consider relative differences, between two similarity/distance values, to mean one pair is more similary/distant than another pair.
Finally: converting a distance (which goes from 0.0 for closest to infinity for furthest) to a similarity (which typically goes from 1.0 for most-similar to -1.0 or 0.0 for least-similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could have arbitrarily negative similarities with that approach, and be fooled to think -0.017 (etc) is bad, because it's negative, even if it's quite good across all the possible return values.
Some more typical distance-to-similarity conversions are given in another SO question:
How do I convert between a measure of similarity and a measure of difference (distance)?
I am trying to wrap my head around how the fuzzywuzzy library calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.
The Levenshtein Distance algorithm counts looks for the minimum number of edits between the two strings. That can be achieved using the addition, deletion, and substitution of a character in the string. All these operations are counted as a single operation when calculating the score.
Here are a couple of examples:
Example 1
s1 = 'hello'
s2 = 'hell'
Levenshtein Score = 1 (it requires 1 edit, addition of 'o')
Example 2
s1 = 'hello'
s2 = 'hella'
Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')
Plugging these scores into the Fuzzywuzzy formula (len(s1)+len(s2) - LevenshteinScore)/((len(s1)+len(s2)):
Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
Now the fuzzywuzzy does return the same score for Example 1, but not for example 2. The score for example 2 is 80%. On investigating how it is calculating the distances under the hood, I found out that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library but I just want to know why is it called Levenshtein Distance, when it actually is not?
I am just trying to figure out why is there a distinction here? What does it mean or explain? Basically the reason for using 2 operations for substitution rather than one as defined in Levenshtein Distance and still calling it Levenshtein Distance. Is it got something to do with the gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
I would love if somebody could give me some insight. Also is there a better way to convert LD to a similarity score? Or in general measure the similarity between two strings? I am trying to measure the similarity between some audio file transcriptions done by a human transcription service and by an Automatic Speech Recognition system.
Thank you!
I work on text mining problem and need to extract all mentioned of certain keywords. For example, given the list:
list_of_keywords = ['citalopram', 'trazodone', 'aspirin']
I need to find all occurrences of the keywords in a text. That could be easily done with Pandas (assuming my text is read in from a csv file):
import pandas as pd
df_text = pd.read_csv('text.csv')
df_text['matches'] = df_text.str.findall('|'.join(list_of_keywords))
However, there are spelling mistakes in the text and some times my keywords will be written as:
'citalopram' as 'cetalopram'
or
'trazodone' as 'trazadon'
Searching on the web, I found some suggestions how to implement the spell checker, but I am not sure where to insert the spell checker and I reckon that it may slow down the search in the case a very large text.
As another option, it has been suggested to use a wild card with regex and insert in the potential locations of confusion (conceptually written)
.findall('c*t*l*pr*m')
However I am not convinced that I can capture all possible problematic cases. I tried some out-of-the-box spell checkers, but my texts are some-what specific and I need a spell checker that 'knows' my domain (medical domain).
QUESTION
Is there any efficient way to extract keywords from a text including spelling mistakes?
You are right, you cannot capture all possible misspellings with regular expressions.
You do however have options.
You could
Use a trie. A lot of auto complete and spell checking solutions use tries. However most of them operate on a word for word basis. Not on text as a whole
That being said, what you really want is a fuzzy text extractor as you just want to match alternate/slightly wrong spellings and not correct those spellings in the text. So you have more options here as well
Computational genomics has this challenge where they want to search for patterns of base pairs in long sequences. They allow for a certain amount of mismatch in the matched text. So an approximate matching solution like the ones outlined here will help. Those slides have excellent use of the pigeon hole principle to do what you need and the code is open source too!
If you want something a lot less complex, just run an edit distance filter on all the words of the doc and admit only words with an edit distance of k or less.
To expand on what I mean by Edit Distance
(Images/Code borrowed from slides linked above, slides are free to use for anyone i.e no license)
Let us examine a simpler concept of Hamming Distance
def hammingDistance(x, y):
assert len(x) == len(y)
nmm = 0
for i in xrange(0, len(x)):
if x[i] != y[i]:
nmm += 1
return nmm
Hamming distance returns the number of characters that must swapped between 2 equal length strings to make them equal.
But what happens when the strings are not equal length?
Use editDistance Which is the num of characters that must be swapped/inserted/deleted on 2 strings to make them equal
Hamming distance becomes the base case for a recursive algorithm now
def edDistRecursive(x, y):
if len(x) == 0: return len(y)
if len(y) == 0: return len(x)
delt = 1 if x[-1] != y[-1] else 0
diag = edDistRecursive(x[:-1], y[:-1]) + delt
vert = edDistRecursive(x[:-1], y) + 1
horz = edDistRecursive(x, y[:-1]) + 1
return min(diag, vert, horz)
Just call the func above on what you think the word would/should match to (maybe by first looking up a trie). You can even memoize the soln to make it faster as there is a high probability for overlap
I am working with DNA sequence alignment, and I have a performance issue.
I need to create a dict that maps a word (a sequence of a set length) to a list of all words that are similar as decided by a separate function.
Right now, I am doing the following:
all_words_rdd = sc.parallelize([''.join(word) for word in itertools.product(all_letters, repeat=WORD_SIZE)], PARALLELISM)
all_similar_word_pairs_map = (all_words_rdd.cartesian(all_words_rdd)
.filter(lambda (word1, word2), scoring_matrix=scoring_matrix, threshold_value=threshold_value: areWordsSimilar((word1, word2), scoring_matrix, threshold_value))
.groupByKey()
.mapValues(set)
.collectAsMap())
Where areWordsSimilar obviously calculates whether the words reach a set similarity threshold.
However, this is horribly slow. It works fine with words of length 3, but once I go any higher it slows down exponentially (as you might expect). It also starts complaining about the task size being too big (again, not surprising)
I know the cartesian join is a really inefficient way to do this, but I'm not sure how to approach it otherwise.
I was thinking of starting with something like this:
all_words_rdd = (sc.parallelize(xrange(0, len(all_letters) ** WORD_SIZE))
.repartition(PARALLELISM)
...
)
This would let me split the calculation across multiple nodes. However, how do I calculate this? I was thinking about doing something with bases and inferring the letter using the modulo operator (i.e. in base of len(all_letters), num % 2 = all_letters[0], num % 3 = all_letters[1], etc).
However, this sounds horribly complicated, so I was wondering if anybody had a better way.
Thanks in advance.
EDIT
I understand that I cannot reduce the exponential complexity of the problem, that is not my goal. My goal is to break up the complexity across multiple nodes of execution by having each node perform part of the calculation. However, to do this I need to be able to derive a DNA word from a number using some process.
Generally speaking even without driver side code it looks like a hopeless task. Size of the sequence set is growing exponentially and you simply cannot win with that. Depending on how you plan to use this data there is most likely a better approach out there.
If you still want to go with this you can start with spiting kmers generation between a driver and workers:
from itertools import product
def extend_kmer(n, kmer="", alphabet="ATGC"):
"""
>>> list(extend_kmer(2))[:4]
['AA', 'AT', 'AG', 'AC']
"""
tails = product(alphabet, repeat=n)
for tail in tails:
yield kmer + "".join(tail)
def generate_kmers(k, seed_size, alphabet="ATGC"):
"""
>>> kmers = generate_kmers(6, 3, "ATGC").collect()
>>> len(kmers)
4096
>>> sorted(kmers)[0]
'AAAAAA'
"""
seed = sc.parallelize([x for x in extend_kmer(seed_size, "", alphabet)])
return seed.flatMap(lambda kmer: extend_kmer(k - seed_size, kmer, alphabet))
k = ... # Integer
seed_size = ... # Integer <= k
kmers = generate_kmers(k, seed_size) # RDD kmers
The simplest optimization you can do when it comes to searching is to drop cartesian and use a local generation:
from difflib import SequenceMatcher
def is_similar(x, y):
"""Dummy similarity check
>>> is_similar("AAAAA", "AAAAT")
True
>>> is_similar("AAAAA", "TTTTTT")
False
"""
return SequenceMatcher(None, x, y).ratio() > 0.75
def find_similar(kmer, f=is_similar, alphabet="ATGC"):
"""
>>> kmer, similar = find_similar("AAAAAA")
>>> sorted(similar)[:5]
['AAAAAA', 'AAAAAC', 'AAAAAG', 'AAAAAT', 'AAAACA']
"""
candidates = product(alphabet, repeat=len(kmer))
return (kmer, {"".join(x) for x in candidates if is_similar(kmer, x)})
similar_map = kmers.flatmap(find_similar)
It is still an extremely naive approach but it doesn't require expensive data shuffling.
Next thing you can try is to improve search strategy. It can be done either locally like above or globally using joins.
In both cases you need a smarter approach than checking all possible kmers. First thing that comes to mind is to use seed kmers taken from a given word. In locally mode these can be used as a starting point for candidate generation, in a global mode a join key (optionally combined with hashing).
I have two lists: one, the interests of the user; and second, the keywords about a book. I want to recommend the book to the user based on his given interests list. I am using the SequenceMatcher class of Python library difflib to match similar words like "game", "games", "gaming", "gamer", etc. The ratio function gives me a number between [0,1] stating how similar the 2 strings are. But I got stuck at one example where I calculated the similarity between "looping" and "shooting". It comes out to be 0.6667.
for interest in self.interests:
for keyword in keywords:
s = SequenceMatcher(None,interest,keyword)
match_freq = s.ratio()
if match_freq >= self.limit:
#print interest, keyword, match_freq
final_score += 1
break
Is there any other way to perform this kind of matching in Python?
Firstly a word can have many senses and when you try to find similar words you might need some word sense disambiguation http://en.wikipedia.org/wiki/Word-sense_disambiguation.
Given a pair of words, if we take the most similar pair of senses as the gauge of whether two words are similar, we can try this:
from nltk.corpus import wordnet as wn
from itertools import product
wordx, wordy = "cat","dog"
sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
maxscore = 0
for i,j in list(product(*[sem1,sem2])):
score = i.wup_similarity(j) # Wu-Palmer Similarity
maxscore = score if maxscore < score else maxscore
There are other similarity functions that you can use. http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. The only problem is when you encounter words not in wordnet. Then i suggest you fallback on difflib.
At first, I thought to regular expressions to perform additional tests to discriminate the matchings with low ratio. It can be a solution to treat specific problem like the one happening with words ending with ing. But that's only a limited case and thre can be numerous other cases that would need to add specific treatment for each one.
Then I thought that we could try to find additional criterium to eliminate not semantically matching words having a letters simlarity ratio enough to be detected as matcging together though the ratio is low,
WHILE in the same time catching real semantically matching terms having low ratio because they are short.
Here's a possibility
from difflib import SequenceMatcher
interests = ('shooting','gaming','looping')
keywords = ('loop','looping','game')
s = SequenceMatcher(None)
limit = 0.50
for interest in interests:
s.set_seq2(interest)
for keyword in keywords:
s.set_seq1(keyword)
b = s.ratio()>=limit and len(s.get_matching_blocks())==2
print '%10s %-10s %f %s' % (interest, keyword,
s.ratio(),
'** MATCH **' if b else '')
print
gives
shooting loop 0.333333
shooting looping 0.666667
shooting game 0.166667
gaming loop 0.000000
gaming looping 0.461538
gaming game 0.600000 ** MATCH **
looping loop 0.727273 ** MATCH **
looping looping 1.000000 ** MATCH **
looping game 0.181818
Note this from the doc:
SequenceMatcher computes and caches detailed information about the
second sequence, so if you want to compare one sequence against many
sequences, use set_seq2() to set the commonly used sequence once and
call set_seq1() repeatedly, once for each of the other sequences.
Thats because SequenceMatcher is based on edit distance or something alike. semantic similarity is more suitable for your case or hybrid of the two.
take a look NLTK pack (code example) as you are using python and maybe this paper
for people using c++ can check this open source project for reference