making difflib's SequenceMatcher ignore "junk" characters - python

I have a lot of strings that I want to match for similarity (each string is 30 characters on average). I found difflib's SequenceMatcher great for this task, as it is simple and gives good results. But if I compare hellboy and hell-boy like this:
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335
I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the function above are only excluded when finding the longest contiguous matching subsequence, not from the comparison itself. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?

If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().
E.g:
to_compare = to_compare.translate(None, "-")
This is significantly (about 3x) faster than a regex, and I feel it is nicer to read.
Note that under Python 3.x, or if you are using Unicode under Python 2.x, this will not work, as the deletechars parameter is not accepted. In this case, you simply need to make a mapping to None. E.g.:
translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)
You could also have a small function to save some typing if you have a lot of characters you want to remove; just make a set and pass it through:
def to_translation_map(iterable):
    return {key: None for key in iterable}
    # return dict((key, None) for key in iterable)  # for old versions of Python without dict comprehensions

If you want a function that removes all the junk characters beforehand, you could use re:
string = re.sub(r'-|_|\*', '', string)
For the regular expression r'-|_|\*', just put a | between all the junk characters, and if one of them is a special re character, put a \ before it (like * and +).
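Putting the two answers together, a minimal sketch assuming Python 3 (the similarity helper and the junk list are illustrative, not part of the original answers): strip the junk first, then let SequenceMatcher compare the cleaned strings.
from difflib import SequenceMatcher

def similarity(a, b, junk="-_*"):
    # Hypothetical helper: remove the junk characters (Python 3 str.translate),
    # then compare the cleaned strings with no isjunk function.
    table = str.maketrans({c: None for c in junk})
    return SequenceMatcher(None, a.translate(table), b.translate(table)).ratio()

print(similarity('hellboy', 'hell-boy'))  # 1.0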

Related

Python: Identifying whether all characters of a word are present in a single string

I am a Python user. I want to check whether a particular word contains characters from a string.
For example, I have the word "mizaner". All its characters are present in the string "randomnizer".
I can identify whether a particular substring is part of a string, e.g. I can verify that "random" is part of "randomnizer" by using the statement if 'random' in 'randomnizer':, but I cannot do the same for "mizaner": even though all its characters are found in "randomnizer", the characters are jumbled up and the word cannot be used as a substring. Any suggestions?
The Boolean expression
set('mizaner') <= set('randomnizer')
will return True since all the letters of the first string are in the second string. A similar expression will return False if there are any letters in the first string that are not in the second string.
This works because converting a string to a set removes duplicate characters and makes the order of the characters not matter, which is just what you want. The less-than-or-equal-to comparison for sets tests for the subset relation.
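For example:
>>> set('mizaner') <= set('randomnizer')
True
>>> set('mixaner') <= set('randomnizer')  # 'x' never occurs in the second string
False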
To handle cases where character counts matter, not just presence of characters, you'd want:
from collections import Counter
word = "randomnizer"
searchlets = "mizaner"
if not (Counter(searchlets) - Counter(word)):
    # All the letters in searchlets appeared in word
If count doesn't matter, as others have noted:
if set(searchlets).issubset(word):
will do the trick. Using issubset instead of set(searchlets) <= set(word) is slightly better theoretically, since the implementation could choose to avoid converting word to a set at all and simply stream it by with short-circuiting if the subset condition becomes impossible midway through. In practice, it looks like CPython internally converts non-set arguments to set anyway, so unless the implementation changes, both approaches are roughly equivalent.
You can use:
set(word).issubset(string)
For example:
word = "mizaner"
string = "randomnizer"
if set(word).issubset(string):
    print("All characters in word are in string")

efficient way to get words before and after substring in text (python)

I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).
I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?
This is the solution I currently have:
# re-find the string and get the surrounding ±4 words
sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
    result_found.append(" ".join(map(str, found_text)))
I'm not sure if this is what you're looking for:
>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'
Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.
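If you also need to limit the context to a fixed number of words, the pieces can be split and sliced (variable names as in the transcript above; the slicing is a possible follow-up, not part of the original answer):
>>> words[0].split()[-4:]   # up to 4 words before
['Hello,', 'world.']
>>> words[2].split()[:4]    # up to 4 words after
['are', 'not', 'always', 'the']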
The main problem with your pattern is that it begins with optional parts, which causes a lot of attempts at each position in the string until a match is found. The number of attempts increases with the text size and with the value of n (the number of words before and after). This is why a few lines of text suffice to make your code hang.
One approach is to begin the pattern with the target word and to use lookarounds to capture the text (or the words) before and after it:
keyword (?= words after ) (?<= words before - keyword)
Starting the pattern with the searched word (a literal string) makes it very fast, and the surrounding words are then quickly found from that position in the string. Unfortunately the re module has some limitations and doesn't allow variable-length lookbehinds (unlike many other regex flavors).
The third-party regex module supports variable-length lookbehinds and other useful features, like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).
import regex
text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''
word = 'harpooneers'
n = 4
pattern = r'''
\m (?<target> %s ) \M        # target word
(?<=                         # content before
    (?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
    %s
)
(?=                          # content after
    (?<after> (?: \W+ (?<wda>\w+) ){0,%d} )
)
''' % (word, n, word, n)
rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)
class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')
results = [Result(m) for m in rgx.finditer(text)]
print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)
Making a regex (well, anything, for that matter) with "as many repetitions as you will ever possibly need" is an extremely bad idea. That's because you:
do an excessive amount of needless work every time
cannot really know for sure how much you will ever possibly need, thus introducing an arbitrary limitation
The bottom line for the solutions below: the 1st is the most effective one for large data; the 2nd is the closest to your current approach, but scales much worse.
Strip your entities down to exactly what you are interested in at each moment:
find the substring (e.g. str.index; for whole words only, re.search with e.g. r'\b%s\b' % re.escape(word) is more suitable)
go N words back (a sketch of this approach follows below)
Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
E.g. one idea is to run re.finditer over a reverse iterator of the substring in place (see "slices to immutable strings by reference and not copy" and "Best way to loop over a python string backwards"); this would only beat slicing when the latter is expensive in terms of CPU and/or memory - test on some realistic examples to find out. It doesn't work, though: re works directly with the memory buffer, so it's impossible to reverse a string for it without copying the data.
There's no function to find a character from a class in Python, nor an "xsplit". So the fastest way appears to be (i for i, c in enumerate(reversed(buffer(text, 0, substring_index))) if c.isspace()) (timeit gives ~100ms on a P3 933MHz for a full pass through a 100k string).
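A minimal sketch of this first approach (the function name and signature are illustrative). Note that the slices below still copy text, which the paragraph above warns about for very large inputs; treat this as a readable baseline rather than the zero-copy variant:
import re

def words_around(text, word, n=4):
    # Find whole-word occurrences, then walk n words back and forward
    # with plain split/slice instead of one big regex.
    results = []
    for m in re.finditer(r'\b%s\b' % re.escape(word), text):
        before = text[:m.start()].split()[-n:]
        after = text[m.end():].split()[:n]
        results.append((before, m.group(0), after))
    return results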
Alternatively:
Fix your regex so it is not subject to catastrophic backtracking, and eliminate code duplication (the DRY principle).
The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
As for the 1st issue, if you really only need "up to known, same N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.
The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
Matching up to N words before the string efficiently is harder:
with (\w+\W+){,10} or equivalent, the matcher will be finding every 10 words before discovering that your string doesn't follow them, then trying 9, 8, etc. To ease things up on the matcher somewhat, a \b before the pattern will make it perform all this work only at the beginning of each word
a lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it were allowed, a lookbehind is tried before every character, i.e. it's even more of a CPU hog
As you can see, regexes aren't quite cut out for matching things backwards
To eliminate code duplication, either
use the aforementioned {,10} (a sketch follows after this list). This will not capture the individual words but should be noticeably faster for large text (see above on how the matching works here). We can always parse the retrieved chunk of text in more detail (with the regex in the next item) once we have it. Or
autogenerate the repetitive part
note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must look like this (writing w = (\w+\W+) for brevity): (w(w...(ww?)?...)?)? (and all the groups need to be non-capturing).
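A minimal sketch of the {,N} variant (assuming whole-word matching; re-parse each retrieved chunk afterwards if you need the individual words):
import re

def context_chunks(text, word, n=4):
    # \b anchors each attempt to a word start; (?:\w+\W+){0,n} and
    # (?:\W+\w+){0,n} grab up to n unambiguous word/separator pairs.
    pattern = r'\b(?:\w+\W+){0,%d}%s\b(?:\W+\w+){0,%d}' % (n, re.escape(word), n)
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]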
I personally think that using text.partition() is the best option, as it eliminates messy regular expressions and automatically leaves the output in an easy-to-access tuple.

Matching characters in two Python strings

I am trying to print the characters shared between 2 strings in Python. I am doing this in the hope of learning how to do it using nothing but Python regular expressions (I don't know regex, so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa", I want the return value to be "pa",
since the characters shared by both variables are p and a. So far I am following the documentation on how to use the re module, but I can't seem to grasp the basic concepts of this.
Any ideas as to how I would solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Converting a string to a set gives you an iterable of its unique letters. Then you can retrieve the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
Using regular expressions
This problem is tailor-made for sets. But you ask for "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that was not already in "peepa". (As you see it leaves duplicated letters which sets would not do.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.

In python: How to perform regular expression search on "circular" string

Assume:
string="aacctcaaaca"
find="aaa"
and I want to find all occurrences of find.
Usually, I would do
re.findall(find, string)
The catch is that the string is circular, i.e. the start/end of the string is irrelevant. So the "aaa" formed by the last a together with the first two a's should also be counted.
In addition, I would like to find the start position of each match (6 and 10 in the above example).
I was thinking of appending string[0:len(find)-1] to the string and performing the re on that new string, i.e.:
re.findall(find, string + string[0:len(find)-1])
Does that sound right? Any other ideas/suggestions?
Your current approach seems perfectly reasonable. Another option is to concatenate the entire string with itself and ignore any matches that start after the wrap point.
For example:
string="aacctcaaaca"
find="aaa"
[m.group(0) for m in re.finditer(find, string+string) if m.start() < len(string)]
This is a bit more extensible because you can use an arbitrary regex such as a{3,} where you might not be able to rely on len(find).
As suggested by mgilson in comments you can make this more efficient by using itertools so that you aren't finding repeat matches unnecessarily.
It would look something like this:
from itertools import takewhile
takewhile(lambda m: m.start() < len(string), re.finditer(find, string+string))
Note that this will return an iterable of match objects instead of a list of the matched substrings.
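For example, collecting the start positions from the question (the takewhile cut-off guarantees they already lie within the original string):
import re
from itertools import takewhile

string = "aacctcaaaca"
find = "aaa"
matches = takewhile(lambda m: m.start() < len(string),
                    re.finditer(find, string + string))
print([m.start() for m in matches])  # [6, 10]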

Find which lines in a file contain certain characters

Is there a way to find out if a string contains any one of the characters in a set with python?
It's straightforward to do it with a single character, but I need to check and see if a string contains any one of a set of bad characters.
Specifically, suppose I have a string:
s = 'amanaplanacanalpanama~012345'
and I want to see if the string contains any vowels:
bad_chars = 'aeiou'
and do this in a for loop for each line in a file:
if [any one or more of the bad_chars] in s:
    # do something
I am scanning a large file, so if there is a faster method, that would be ideal. Also, not every bad character has to be checked; as soon as one is encountered, that is enough to end the search.
I'm not sure if there is a builtin function or easy way to implement this, but I haven't come across anything yet. Any pointers would be much appreciated!
any((c in badChars) for c in yourString)
or
any((c in yourString) for c in badChars) # extensionally equivalent, slower
or
set(yourString) & set(badChars) # extensionally equivalent, slower
"so long as one is encountered that is enough to end the search." - This will be true if you use the first method.
You say you are concerned with performance: performance should not be an issue unless you are dealing with a huge amount of data. If you encounter issues, you can try:
Regexes
edit: Previously I had written a section here on using regexes via the re module, programmatically generating a regex consisting of a single character class [...] and using .finditer, with the caveat that simply putting a backslash before everything might not work correctly. Indeed, after testing it, that is the case, and I would definitely not recommend this method. Using it would require reverse-engineering the entire (slightly complex) sub-grammar of regex character classes (e.g. you might have characters like \ followed by w, like ] or [, or like -, and merely escaping some, like \w, may give them a new meaning).
Sets
Depending on whether the str.__contains__ operation is O(1) or O(N), it may be justifiable to first convert your bad characters into a set so that each in test is O(1), if you have many badChars:
badCharSet = set(badChars)
any((c in badCharSet) for c in yourString)
(it may be possible to make that a one-liner any((c in set(yourString)) for c in badChars), depending on how smart the python compiler is)
Do you really need to do this line-by-line?
It may be faster to do this once for the entire file, O(#badchars), than once for every line in the file, O(#lines * #badchars), though the asymptotic constants may be such that it won't matter.
Use Python's any function.
if any((bad_char in my_string) for bad_char in bad_chars):
    # do something
This should be very efficient and clear. It uses sets:
#!/usr/bin/python
bad_chars = set('aeiou')
with open('/etc/passwd', 'r') as file_:
    file_string = file_.read()
file_chars = set(file_string)
if file_chars & bad_chars:
    print('found something bad')
This regular expression is about twice as fast as any() in my minimal testing. You should try it with your own data.
import re

r = re.compile('[aeiou]')
if r.search(s):
    # do something
The following Python code prints each character from bad_chars that occurs in s:
for i in bad_chars:
    if i in s:
        print(i)
You could also use Python's built-in any, with an example like this:
>>> any(e for e in bad_chars if e in s)
True
