How can I find the best fuzzy string match? - python

Python's new regex module supports fuzzy string matching. Sing praises aloud (now).
Per the docs:
The ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit
of the next match that it finds.
The BESTMATCH flag makes fuzzy matching search for the best match
instead of the next match
The ENHANCEMATCH flag is set using (?e) as in
regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog"
but there's nothing on actually setting the BESTMATCH flag. How's it done?

Documentation on the BESTMATCH flag functionality is partial (but improving). Poke-n-hope shows that BESTMATCH is set using (?b).
>>> import regex
>>> regex.search(r"(?e)(?:hello){e<=4}", "What did you say, oh - hello")[0]
'hat d'
>>> regex.search(r"(?b)(?:hello){e<=4}", "What did you say, oh - hello")[0]
'hello'

Related

Prevent last duplicate character from string [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
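To compare the greedy, lazy, and negated-class forms side by side, here is a minimal Python sketch using the standard re module. The tag string is the one from the question; the variable names are only illustrative.

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'

# Greedy: .* runs on to the LAST double quote in the string.
greedy = re.search(r'location="(.*)"', tag).group(1)

# Lazy: .*? stops at the FIRST double quote that lets the match succeed.
lazy = re.search(r'location="(.*?)"', tag).group(1)

# Negated class: [^"]* cannot cross a double quote at all.
negated = re.search(r'location="([^"]*)"', tag).group(1)

print(greedy)   # file path/level1/level2" xxxx some="xxx
print(lazy)     # file path/level1/level2
print(negated)  # file path/level1/level2
```

The negated class also works in engines that have no lazy quantifiers.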
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Using the lazy quantifier ? without the global flag is the answer.
If you had the global flag /g, it would have matched every shortest match in the input instead of only the first one.
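Python has no /g flag, but the same distinction can be sketched with re.search (first match only) and re.findall (every match). The input string below is a made-up example with two location attributes, not taken from the question.

```python
import re

# Hypothetical input containing two location attributes.
text = '<a location="first/path" x="1"> <b location="second/path" y="2">'

lazy = r'location="(.*?)"'

# Roughly what a match without /g gives you: only the first hit.
first_only = re.search(lazy, text).group(1)

# Roughly what a match with /g gives you: every shortest hit.
all_matches = re.findall(lazy, text)

print(first_only)   # first/path
print(all_matches)  # ['first/path', 'second/path']
```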
Here's another way.
Here's the one you want. This is lazy: [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy: [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
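As a rough illustration of that single-character difference, the two replacements can be tried in Python with re.sub, where the replacement $1 is written \1. The input string is a hypothetical example with two location attributes.

```python
import re

# Hypothetical input containing two location attributes.
text = '<a location="first/path"> <b location="second/path">'

# Lazy prefix: stops at the first location attribute.
first = re.sub(r'[\s\S]*?(?:location="([^"]*)")[\s\S]*', r'\1', text)

# Greedy prefix: swallows as much as possible, leaving the last attribute.
last = re.sub(r'[\s\S]*(?:location="([^"]*)")[\s\S]*', r'\1', text)

print(first)  # first/path
print(last)   # second/path
```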
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
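For the single-character case described above, here is a small Python sketch comparing the lazy form with the character-class form. Python does support lazy quantifiers, so this is only to show that the two usually agree, and where they differ. The sample strings are invented.

```python
import re

text = "xxaHELLObyyaWORLDbzz"

# Lazy form (needs Perl-style extensions).
lazy = re.findall(r"a(.*?)b", text)

# Character-class form (works in traditional engines too).
negated = re.findall(r"a([^ab]*)b", text)

print(lazy)     # ['HELLO', 'WORLD']
print(negated)  # ['HELLO', 'WORLD']

# The two are not identical: excluding the start delimiter as well means
# the match begins at the LAST "a" before the "b".
print(re.findall(r"a(.*?)b", "aab"))      # ['a']
print(re.findall(r"a([^ab]*)b", "aab"))   # ['']
```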
Because you are using a quantified subpattern, and as described in the Perl documentation,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to make your quantified pattern match as little as possible, follow the quantifier with ?, like this:
/location="(.*?)"/
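The quoted rule (same meaning, different greediness) can be checked with a tiny Python sketch. The string "aaa" is just an illustrative input.

```python
import re

s = "aaa"

greedy_star = re.match(r"a*", s).group()   # takes as many as possible
lazy_star = re.match(r"a*?", s).group()    # takes as few as possible (zero)
greedy_plus = re.match(r"a+", s).group()   # takes as many as possible
lazy_plus = re.match(r"a+?", s).group()    # takes as few as possible (one)

# A lazy quantifier still expands when the rest of the pattern demands it.
lazy_anchored = re.match(r"a+?$", s).group()

print(repr(greedy_star), repr(lazy_star))  # 'aaa' ''
print(repr(greedy_plus), repr(lazy_plus))  # 'aaa' 'a'
print(repr(lazy_anchored))                 # 'aaa'
```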
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, text):
    print(match.group(1))
Output:
Mary

Case-insensitivity exclusively in lookbehind / lookahead groups for Python regex [duplicate]

This question already has answers here:
How to set ignorecase flag for part of regular expression in Python?
(3 answers)
Closed 3 years ago.
I understand how to make matching case in-sensitive in Python, and I understand how to use lookahead / lookbehinds, but how do I combine the two?
For instance, my text is
mytext = 'I LOVE EATING popsicles at home.'
I want to extract popsicles from this text (my target food item). This regex works great:
import re
regex = r'(?<=I\sLOVE\sEATING\s)[a-z0-9]*(?=\sat\shome)'
re.search(regex, mytext)
However, I'd like to account for the scenario where someone writes
i LOVE eating apples at HOME.
That should match. But "I LOVE eating Apples at home" should NOT match, since Apples is uppercase.
Thus, I'd like to have local case insensitivity in my lookahead (?=\sat\shome) and lookbehind (?<=I\sLOVE\sEATING\s) groups. I know I can use the re.IGNORECASE flag for global case insensitivity, but I just want the lookahead/lookbehind groups to be case insensitive, not my actual target expression.
Traditionally, I'd use (?i:I LOVE EATING) to create a case-insensitive non-capturing group that is capable of matching both I LOVE EATING and I love eating. However, if I try to combine the two together:
(?i:<=I\sLOVE\sEATING\s)
I get no matches, since it now interprets the <= as a literal expression to match. Is there a way to combine lookaheads/lookbehinds with case insensitivity?
Edit: I don't think this is a duplicate of the marked question. That question specifically asks about a part of a group; I'm asking about a specific subset: lookaheads and lookbehinds. The syntax is different here. The answers in that other post do not directly apply. As the answers on this post suggest, you need some workarounds to achieve this functionality, and they don't apply to the supposed duplicate SO post.
You can set the regex to case-insensitive globally with (?i) and switch a group to case-sensitive with (?-i:groupcontent):
regex = r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'
Instead of (?i), you can also use re.I in the search. The following is equivalent to the regex above:
regex = r'(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'
re.search(regex, mytext, re.I)
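A quick check of this pattern against the three sentences from the question might look like the sketch below. It assumes Python 3.6 or later, where the scoped (?-i:...) syntax is available in re; extract_food is a hypothetical helper name.

```python
import re

# Global (?i), with the target group switched back to case-sensitive.
pattern = re.compile(r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)')

def extract_food(text):
    """Return the lowercase food item, or None if the sentence does not match."""
    m = pattern.search(text)
    return m.group() if m else None

print(extract_food('I LOVE EATING popsicles at home.'))  # popsicles
print(extract_food('i LOVE eating apples at HOME.'))     # apples
print(extract_food('I LOVE eating Apples at home'))      # None
```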
Unfortunately, before Python 3.6 the re module doesn't allow inline use of mode modifiers in the middle of a regex.
As a workaround, you may use this regex:
import re
reg = re.compile(r'(?<=[Ii]\s[Ll][Oo][Vv][Ee]\s[Ee][Aa][Tt][Ii][Nn][Gg]\s)[a-z0-9]*(?=\s[Aa][Tt]\s[Hh][Oo][Mm][Ee])')
print("Case 1:", reg.findall('I LOVE Eating popsicles at HOME.'))
print("Case 2:", reg.findall('I LOVE EATING popsicles at home.'))
print("Case 3:", reg.findall('I LOVE Eating Popsicles at HOME.'))
Output:
Case 1: ['popsicles']
Case 2: ['popsicles']
Case 3: []
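Typing those character classes by hand is error-prone, so here is a sketch of a helper that generates them. The name any_case is hypothetical, and the helper assumes plain ASCII letters in the literal.

```python
import re

def any_case(literal):
    """Expand 'at' into '[Aa][Tt]'; whitespace becomes \\s (ASCII letters assumed)."""
    parts = []
    for ch in literal:
        if ch.isalpha():
            parts.append("[%s%s]" % (ch.upper(), ch.lower()))
        elif ch.isspace():
            parts.append(r"\s")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

# Same shape as the hand-written workaround: only the lookarounds ignore case.
reg = re.compile("(?<=%s)[a-z0-9]*(?=%s)"
                 % (any_case("I LOVE EATING "), any_case(" at home")))

print(any_case("at"))                                 # [Aa][Tt]
print(reg.findall('I LOVE Eating popsicles at HOME.'))  # ['popsicles']
print(reg.findall('I LOVE Eating Popsicles at HOME.'))  # []
```

Each generated class has a fixed width of one character, so the result is still usable inside a lookbehind.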
Using (?i:...) you can set a regex flag (in this case i)
locally (inline) for some part of the regex (available since Python 3.6).
Such a local flag setting is also allowed within a lookbehind or
lookahead, while the rest of the regex stays without any option.
I modified your code so it compiles the regex once and then
calls it twice, for different strings:
mytext1 = 'i LOVE eating Apples at HOME.'
mytext2 = 'i LOVE eating apples at HOME.'
pat = re.compile(r'(?<=(?i:I\sLOVE\sEATING\s))[a-z0-9]+(?=(?i:\sAT\sHOME))')
m = pat.search(mytext1)
print('1:', m.group() if m else '** Not found **')
m = pat.search(mytext2)
print('2:', m.group() if m else '** Not found **')
It prints:
1: ** Not found **
2: apples
so the match is only for the second source string.

How do I check if a string matches a set pattern in Python?

I want to match a string to a specific pattern or set of words, like below:
the apple is red is the query and
the apple|orange|grape is red|orange|violet is the pattern to match.
The pipes would represent words that would substitute each other. The pattern could also be grouped like [launch app]|[start program]. I would like the module to return True or False whether the query matches the pattern, naturally.
What is the best way to accomplish this if there is not a library that does this already? If this can be done with simple regex, great; however I know next to nothing about regex. I am using Python 2.7.11
import re
string = 'the apple is red'
re.search(r'^the (apple|orange|grape) is (red|orange|violet)', string)
Here's an example of it running:
In [20]: re.search(r'^the (apple|orange|grape) is (red|orange|violet)', string).groups()
Out[20]: ('apple', 'red')
If there are no matches, re.search() returns None.
You may know "next to nothing about regex" but you nearly wrote the pattern.
The sections within the parentheses can also have their own regex patterns, too. So you could match "apple" and "apples" with
r'the (apple[s]*|orange|grape)'
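Since the question asks for a True/False answer from a pipe-separated pattern, here is one possible sketch that translates that syntax into a regex. The function names build_regex and matches are hypothetical, the bracket grouping mentioned in the question is not handled, and re.fullmatch needs Python 3.4 or later (the question mentions 2.7, where you would anchor the pattern with ^ and $ instead).

```python
import re

def build_regex(simple_pattern):
    """Turn 'the apple|orange is red|green' into a compiled regex."""
    parts = []
    for token in simple_pattern.split():
        # Escape each word so it is matched literally, then join with |.
        alternatives = [re.escape(word) for word in token.split("|")]
        parts.append("(?:" + "|".join(alternatives) + ")")
    # Words must be separated by at least one whitespace character.
    return re.compile(r"\s+".join(parts))

def matches(simple_pattern, query):
    """Return True if the whole query fits the pattern, else False."""
    return build_regex(simple_pattern).fullmatch(query) is not None

pattern = "the apple|orange|grape is red|orange|violet"
print(matches(pattern, "the apple is red"))      # True
print(matches(pattern, "the grape is violet"))   # True
print(matches(pattern, "the banana is red"))     # False
```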
The re based solutions for this kind of problem work great. But it would sure be nice if there were an easy way to pull data out of strings in Python without having to learn regex (or to learn it AGAIN, which is what I always end up having to do since my brain is broken).
Thankfully, someone took the time to write parse.
parse
parse is a nice package for this kind of thing. It uses regular expressions under the hood, but the API is based on the string format specification mini-language, which most Python users will already be familiar with.
For a format spec you will use over and over again, you'd use parse.compile. Here is an example:
>>> import parse
>>> theaisb_parser = parse.compile('the {} is {}')
>>> fruit, color = theaisb_parser.parse('the apple is red')
>>> print(fruit, color)
apple red

detect emoticon in a sentence using regex python [duplicate]

This question already has answers here:
Capturing emoticons using regular expression in python
(4 answers)
Closed 9 years ago.
Here is the list of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
I want to form a regex which checks if any of these emoticons exist in the sentence. For example, "hey there I am good :)" or "I am angry and sad :(" but there are a lot of emoticons in the list on wikipedia so wondering how I can achieve this task.
I am new to regex and Python.
>>> s = "hey there I am good :)"
>>> import re
>>> q = re.findall(":",s)
>>> q
[':']
I see two approaches to your problem:
Either, you can create a regular expression for a "generic smiley" and try to match as many as possible without making it overly complicated and insane. For example, you could say that each smiley has some sort of eyes, a nose (optional), and a mouth.
Or, if you want to match each and every smiley from that list (and none else) you can just take those smileys, escape any regular-expression specific special characters, and build a huge disjunction from those.
Here is some code that should get you started for both approaches:
import re
# approach 1: pattern for "generic smiley"
eyes, noses, mouths = r":;8BX=", r"-~'^", r")(/\|DP"
pattern1 = "[%s][%s]?[%s]" % tuple(map(re.escape, [eyes, noses, mouths]))
# approach 2: disjunction of a list of smileys
smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
:D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D""".split()
pattern2 = "|".join(map(re.escape, smileys))
text = "bla bla bla :-/ more text 8^P and another smiley =-D even more text"
print(re.findall(pattern1, text))
Both approaches have pros, cons, and some general limitations. You will always have false positives, like in a mathematical term like 18^P. It might help to put spaces around the expression, but then you can't match smileys followed by punctuation. The first approach is more powerful and catches smileys the second approach won't match, but only as long as they follow a certain schema. You could use the same approach for "eastern" smileys, but it won't work for strictly symmetric ones, like =^_^=, as this is not a regular language. The second approach, on the other hand, is easier to extend with new smileys, as you just have to add them to the list.
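To make the trade-off concrete, here is a sketch that runs both patterns on the same text. It reuses the lists from the code above. Sorting the smileys longest-first is an added precaution, not part of the original answer, so that a short smiley can never shadow a longer one starting the same way.

```python
import re

# approach 1: pattern for a "generic smiley" (eyes, optional nose, mouth)
eyes, noses, mouths = r":;8BX=", r"-~'^", r")(/\|DP"
pattern1 = "[%s][%s]?[%s]" % tuple(map(re.escape, [eyes, noses, mouths]))

# approach 2: disjunction of a fixed list of smileys
smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
             :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D""".split()
pattern2 = "|".join(map(re.escape, sorted(smileys, key=len, reverse=True)))

text = "bla bla bla :-/ more text 8^P and another smiley =-D even more text"

generic_hits = re.findall(pattern1, text)   # catches smileys not in the list
listed_hits = re.findall(pattern2, text)    # only smileys from the list
false_positive = re.findall(pattern1, "the term 18^P")

print(generic_hits)    # [':-/', '8^P', '=-D']
print(listed_hits)     # ['=-D']
print(false_positive)  # ['8^P']
```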

multiple negative lookahead assertions

I can't figure out how to do multiple lookaround for the life of me. Say I want to match a variable number of numbers following a hash but not if preceded by something or followed by something else. For example I want to match #123 or #12345 in the following. The lookbehinds seem to be fine but the lookaheads do not. I'm out of ideas.
import re
matches = ["#123", "This is #12345",
           # But not
           "bad #123", "No match #12345", "This is #123-ubuntu",
           "This is #123 0x08"]
pat = '(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
for i in matches:
    print(i, re.search(pat, i))
You should have a look at the captures as well. I bet for the last two strings you will get:
#12
This is what happens:
The engine checks the two lookbehinds - they don't match, so it continues with the capturing group #[0-9]+ and matches #123. Now it checks the lookaheads. They fail as desired. But now there's backtracking! There is one variable in the pattern and that is the +. So the engine discards the last matched character (3) and tries again. Now the lookaheads are no problem any more and you get a match. The simplest way to solve this is to add another lookahead that makes sure that you go to the last digit:
pat = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'
Note the use of a raw string (the leading r) - it doesn't matter in this pattern, but it's generally a good practice, because things get ugly once you start escaping characters.
EDIT: If you are using or willing to use the regex package instead of re, you get possessive quantifiers which suppress backtracking:
pat = r'(?<!bad )(?<!No match )(#[0-9]++)(?! 0x0)(?!-ubuntu)'
It's up to you which you find more readable or maintainable. The latter will be marginally more efficient, though. (Credits go to nhahtdh for pointing me to the regex package.)
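The backtracking effect and the (?![0-9]) fix can be reproduced with the standard re module. This sketch compares the capture from the original pattern with the capture from the fixed one; first_capture is a hypothetical helper name.

```python
import re

original = r'(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
fixed = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'

def first_capture(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

samples = ["#123", "This is #12345", "bad #123",
           "This is #123-ubuntu", "This is #123 0x08"]

for text in samples:
    # The original pattern backtracks to '#12' on the last two samples;
    # the fixed pattern rejects them.
    print(text, first_capture(original, text), first_capture(fixed, text))
```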
