Python regex not matching at word boundary as required


I want to match a set of patterns at "word boundary", but the patterns may have a prefix [##] which should get matched if present.
I'm using following regex pattern in python.
r"\b[##]?(abc|ef|ghij)\b"
Sample text is : #abc is a pattern which should match. also abc should match. And finally #ef
In this text only abc, abc and ef are matched, without the prefix, and not #abc and #ef as I want.

You need to put the word boundary after the optional [##], not before it. In #abc, the position before # is a non-word boundary (\B), not a word boundary (\b): the start of the line is on one side and #, which is not a word character, is on the other. Note that \b matches between a word character and a non-word character, in either order, while \B matches between two word characters or two non-word characters.
r"[##]?\b(abc|ef|ghij)\b"
If you put \b before [##], the pattern would match inside strings like foo#abc or bar#abc, because there a word boundary does exist before the #.
Example:
>>> import re
>>> s = "#abc is a pattern which should match. also abc should match. And finally #ef"
>>> re.findall(r'[##]?\b(?:abc|ef|ghij)\b', s)
['#abc', 'abc', '#ef']
In #abc:
\B holds before # (start of the string followed by a non-word character)
\b holds between # and a
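The effect of moving \b can be checked directly. This is a minimal sketch that runs both patterns over the sample text from the question; the variable names are only illustrative.

```python
import re

s = "#abc is a pattern which should match. also abc should match. And finally #ef"

# \b first: a word boundary is required before the optional prefix, and there
# is none between the start of the string (or a space) and '#'.
boundary_first = re.findall(r"\b[##]?(?:abc|ef|ghij)\b", s)

# Prefix first: '#' is consumed if present, then \b sits between '#' and the word.
prefix_first = re.findall(r"[##]?\b(?:abc|ef|ghij)\b", s)

print(boundary_first)  # ['abc', 'abc', 'ef']
print(prefix_first)    # ['#abc', 'abc', '#ef']
```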

The group (##)? says that the word may begin with the two-character sequence "##". What you are looking for is the character class [##]?, which says that the first character may be any one character from the class, but it is not required. If you need the prefix to be captured in a group, you could use an alternation such as (#|#)?.

I will also throw in my version of the fixed regex without capturing group (since you do not seem to be using them):
r'[##]?\b(?:abc|ef|ghij)\b'
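EXPLANATION: [##] matches non-word characters and is optional because of the ?. \b is not optional. When \b comes first in the pattern, it can only succeed between the # and the following letter, so the # is already behind the start of the match and is not included. \b itself adds nothing to the match, since it is always zero-width. With [##]? first, the prefix is consumed and \b is then checked right after it.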
Here are more details on \b from Regular-Expressions.info:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
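To make the three positions concrete, here is a small sketch that lists every index where \b and \B hold in a short string. The sample string is an assumption chosen for illustration; it relies on re.finditer reporting zero-width matches, as it does in current Python 3 releases.

```python
import re

text = "#abc def"

# Both assertions are zero-width, so only the match positions are of interest.
word_boundaries = [m.start() for m in re.finditer(r"\b", text)]
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

print(word_boundaries)  # [1, 4, 5, 8]
print(non_boundaries)   # [0, 2, 3, 6, 7]
```

Position 0 is not a word boundary because the first character, #, is not a word character; position 8 is one because the string ends in a word character.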

Related

Question about matching RE in a complicated form

How can I match a word using RE in the following format:
Letter number Alphanumeric dot(.) Alphanumeric{0-4}
Examples:
A24.L
A2F.L9
A2F.LG4
This is what I've come up with so far:
answer=re.findall(r'[A-Za-z]\d\w\.\w{0-4}', text)

As you are using re.findall, I assume you are looking for partial matches inside longer text. Bearing that in mind, you need to fix the following:
\w matches not only alphanumeric, but also a _ char
{0-4} is not a valid limiting ("range" or "interval") quantifier; the syntax is {min,max}. Do not omit the min value: some regex engines default it to 0, but others either do not support that form or handle it incorrectly.
In Python 3, \d matches any Unicode digit (like ٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙０１２３４５６７８９), so you probably want to use (?a) inline modifier (to only match ASCII digits) or an explicit [0-9].
So, you can use
answer=re.findall(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{1,4}\b', text)
if the alphanumeric after . is obligatory, and the following if the match can end in a dot:
answer=re.findall(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{0,4}(?<!\w\B)', text)
Details:
\b - word boundary
[A-Za-z] - a letter
[0-9] - an ASCII digit
[A-Za-z0-9] - an ASCII alphanumeric
\. - a . char
[A-Za-z0-9]{1,4}\b - one to four alphanumeric chars at the word boundary.
The second regex does not contain a word boundary at the end since the match is supposed to be able to end in a . (that is not a word char). The (?<!\w\B) is a right-hand dynamic word boundary that only requires a non-word char or end position if the preceding char is a word char.
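Both patterns can be tried on a small sample. This is a sketch under the assumption that the input looks like the examples in the question; the sample text and variable names are made up for illustration. The last line shows what Python's re does with the original {0-4}: it is read as literal text, not as a quantifier.

```python
import re

text = "A24.L A2F.L9 A2F.LG4 and A2F. end"

# Alphanumeric part after the dot is required (1 to 4 chars).
required = re.findall(r"\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{1,4}\b", text)

# The match may end in the dot (0 to 4 chars), using the right-hand
# adaptive word boundary instead of \b.
optional = re.findall(
    r"\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{0,4}(?<!\w\B)", text
)

# {0-4} is not a quantifier, so it only matches the literal characters "{0-4}".
literal_braces = re.findall(r"x{0-4}", "x{0-4} xxxx")

print(required)        # ['A24.L', 'A2F.L9', 'A2F.LG4']
print(optional)        # ['A24.L', 'A2F.L9', 'A2F.LG4', 'A2F.']
print(literal_braces)  # ['x{0-4}']
```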

pattern search in Python

I am trying to search a string 'Test^' in another string 'test1 Test2 Test^ test'. I find that
re.search(r'\bTest\^\B', 'test1 Test2 Test^ test')
would work but
re.search(r'\bTest\^\b', 'test1 Test2 Test^ test')
would not work. I am a bit confused, as I think I should use \b for the word boundary of 'Test^' (both sides have an empty space). Is it because Python treats the end of the string as '^', so it is a non-word boundary?
Thank you.

\b means transition from "word character" to "non-word character" or vice versa. Word characters are alphanumeric characters, plus underscore, _. ^ is not a word character, nor is (space), so the transition from one to another is not a word boundary; as observed, it matches \B, not \b. If you want a space specific check, you'd need to explicitly use look-ahead (?=) or look-behind (?<=) assertions (possibly negated, depending on use case) with \s/\S.
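The three variants can be compared side by side. This is a minimal sketch using the string from the question; the whitespace-based pattern is one possible way to write the lookaround check mentioned above, not the only one.

```python
import re

s = "test1 Test2 Test^ test"

# '^' and the following space are both non-word characters: \B holds, \b does not.
m_non_boundary = re.search(r"\bTest\^\B", s)
m_boundary = re.search(r"\bTest\^\b", s)

# Whitespace-based check: no non-space character directly before or after.
m_whitespace = re.search(r"(?<!\S)Test\^(?!\S)", s)

print(m_non_boundary.span())  # (12, 17)
print(m_boundary)             # None
print(m_whitespace.group())   # Test^
```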

The caret is a non-word character, and so is the space after it, so there is no word boundary between them. The capital \B is not ignored: it asserts that the position is not a word boundary, which is why the first pattern matches.
So \b should not be placed next to ^ in a pattern like this; \b only helps at the edges of words made of word characters.
https://www.regular-expressions.info/wordboundaries.html

Python regex to match word boundary when part of the word contains special character [duplicate]

I've spent some time on this, but still have no solution. I need a regular expression that is able to match words with signs in them (like c++) in a string.
I've used /\bword\b/ for "usual" words, and it works OK. But as soon as I try /\bC\+\+\b/ it just does not work. It somehow fails when the word has plus signs in it.
I need a regex to detect if input string contains c++ word in it. Input like,
"c++ developer"
"using c++ language"
etc.
ps. Using C#, .Net Regex.Match function.
Thanks for help!

+ is a special character so you need to escape it
\bC\+\+(?!\w)
Note that we can't use \b after the last + because + is not a word character.

The problem isn't with the plus character, that you've escaped correctly, but the \b sequence. It indicates a word boundary, which is a point between a word character (alphanumeric) and something else. Plus isn't a word character, so for \b to match, there would need to be a word character directly after the last plus sign.
\bC\+\+\b matches "Test C++Test" but not "Test C++ Test" for example. Try something like \bC\+\+\s if you expect there to be a whitespace after the last plus sign.
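The behaviour described above can be reproduced in Python. This is only a sketch: the question is about .NET, and the assumption here is that Python's re module treats \b the same way for these ASCII inputs.

```python
import re

trailing_boundary = re.compile(r"\bC\+\+\b")   # needs a word char after "++"
trailing_lookahead = re.compile(r"\bC\+\+(?!\w)")  # forbids a word char after "++"

# \b after '+' only holds when a word character follows.
glued_b = trailing_boundary.search("Test C++Test")
spaced_b = trailing_boundary.search("Test C++ Test")

# The negative lookahead behaves the other way round.
glued_la = trailing_lookahead.search("Test C++Test")
spaced_la = trailing_lookahead.search("Test C++ Test")

print(bool(glued_b), bool(spaced_b))    # True False
print(bool(glued_la), bool(spaced_la))  # False True
```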

The plus sign has a special meaning, so you have to escape it with \. The same rule applies to these characters: \, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space
UPDATE: the problem was with \b sequence

If you want to match a c++ between non-word chars (chars other than letters, digits and underscores) you may use
\bc\+\+\B
Here \b is a word boundary and \B matches all positions that are not word boundary positions.
C# syntax:
var pattern = @"\bc\+\+\B";
You must remember that \b / \B are context dependent: \b matches between the start/end of string and the adjoining word char or between a word and a non-word chars, while \B matches between the start/end of string and the adjoining non-word char or between two word or two non-word chars.
If you build the pattern dynamically, it is hard to rely on word boundary \b pattern.
Use adaptive dynamic word boundaries instead, the (?!\B\w) and (?<!\w\B) lookarounds. They require that the match is not immediately preceded/followed by a word char only if the word itself starts/ends with a word char:
var pattern = $@"(?!\B\w){Regex.Escape(word)}(?<!\w\B)";
If the word boundaries you want to match are whitespace boundaries (i.e. the match is expected only between whitespaces), use
var pattern = $@"(?<!\S){Regex.Escape(word)}(?!\S)";
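The same two constructions can be written in Python, with re.escape taking the place of Regex.Escape. This is a rough sketch, not a port of any library code; the helper names adaptive_pattern and whitespace_pattern are hypothetical and exist only for this example.

```python
import re

def adaptive_pattern(word):
    """Hypothetical helper: adaptive word boundaries around an escaped word."""
    return re.compile(r"(?!\B\w)" + re.escape(word) + r"(?<!\w\B)")

def whitespace_pattern(word):
    """Hypothetical helper: match only between whitespace or string edges."""
    return re.compile(r"(?<!\S)" + re.escape(word) + r"(?!\S)")

# "c++" ends in a non-word char, so only the left side acts like \b.
found_cpp = adaptive_pattern("c++").search("using c++ language")
inside_word = adaptive_pattern("c++").search("abc++")

# "cat" starts and ends with word chars, so both sides act like \b.
cat_in_words = adaptive_pattern("cat").search("concat cats")
cat_alone = adaptive_pattern("cat").search("a cat.")

# Whitespace boundaries reject punctuation next to the word.
in_parens = whitespace_pattern("c++").search("(c++)")
at_start = whitespace_pattern("c++").search("c++ developer")
```

Here found_cpp, cat_alone and at_start are match objects, while inside_word, cat_in_words and in_parens are None.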

As the others said, your problem isn't the + sign, which you've escaped correctly, but the \b, a zero-width assertion that matches a word boundary, i.e. a position between a word character (\w) and a non-word character (\W).
There is also another mistake in your regex: it has an uppercase C, but you want to match the lowercase c++. Either change the regex to /\bc\+\+/ or use the i modifier to match case-insensitively: /\bc\+\+/i

Regular expressions with \b and non-word characters (like '.')

Why does this regular expression:
r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
does not match J. F. Kennedy?
I have to remove \b in groups first_init and mid_init to match the words.
I am using Python, and for testing I am using https://regex101.com/
Thanks

You are over-applying the \b word breaks.
\b will only match if on one side there is a valid "word" character and on the other side not. Now you use this construction twice:
\b\w\.\b\s
.. and, rightly so, it does not match because on the left side you have a not-word character (a single full stop) and on the other side you also have a not-word character (a space).
Removing the \b between the full stop and \s is enough to make it work.

\b matches the empty string only at the beginning or end of a word. A word is a sequence of alphanumeric or underscore characters. The dot (.) cannot comprise part of the word.
>>> import re
# does not match when \. is within word boundary
>>> re.match(r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
# matches when \b is moved to left of \.
>>> re.match(r'^(?P<first_init>\b\w\b\.)\s(?P<mid_init>\b\w\b\.)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy') # matches
The . is not part of the word in this sense. See the docs here.

It does not match because of the \. (dot) character. A word boundary does not include the dot (it is not the same definition of word you perhaps would like). You can easily rewrite it without the need of \b. Read the documentation carefully.

Just remove the second boundary:
^(?P<first_init>\b\w\.)\s
(?P<mid_init>\b\w\.)\s
(?P<last_name>\b\w+\b)$
The background is that the second \b sits between a dot and a space, so it fails (remember that one of the sides needs to be a word character, i.e. one of a-zA-Z0-9_).
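As a quick check, here is a sketch that runs the original pattern and the one with the trailing \b removed from the two initial groups against the name from the question. The variable names are illustrative.

```python
import re

original = re.compile(
    r"^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$"
)
# Same pattern without the \b between each dot and the following space.
fixed = re.compile(
    r"^(?P<first_init>\b\w\.)\s(?P<mid_init>\b\w\.)\s(?P<last_name>\b\w+\b)$"
)

name = "J. F. Kennedy"
no_match = original.match(name)
parts = fixed.match(name).groupdict()

print(no_match)  # None
print(parts)     # {'first_init': 'J.', 'mid_init': 'F.', 'last_name': 'Kennedy'}
```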

\b means the border of a word.
A word here is a run of letters, digits and underscores.
A word ends at the first character outside that set, not only at a space.
In "J. F. Kennedy" the words are "J", "F" and "Kennedy"; the dots are not part of them.
Your example places \b after each dot, between the dot and the space, where neither side is a word character, so it cannot match.

Escaping [ in Python Regular Expressions

This reg exp search correctly checks to see if a string contains the text harry:
re.search(r'\bharry\b', '[harry] blah', re.IGNORECASE)
However, I need to ensure that the string contains [harry]. I have tried escaping with various numbers of back-slashes:
re.search(r'\b\[harry\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\[harry\\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\\[harry\\\]\b', '[harry] blah', re.IGNORECASE)
None of these attempts finds the match. What do I need to do?

The first one is correct:
r'\b\[harry\]\b'
But this won’t match [harry] blah as [ is not a word character and so there is no word boundary. It would only match if there were a word character in front of [ like in foobar[harry] blah.

>>> re.search(r'\bharry\b','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df648>
>>> re.search(r'\b\[harry\]\b','[harry] blah',re.IGNORECASE)
>>> re.search(r'\[harry\]','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df6b0>
>>> re.search(r'\[harry\]','harry blah',re.IGNORECASE)
The problem is the \b, not the brackets. A single backslash is correct for escaping.

You escape it the way you escape most regex metacharacters: by preceding it with a backslash.
Thus, r"\[harry\]" will match a literal string [harry].
The problem is with the \b in your pattern. This is the word boundary anchor.
The \b matches:
At the beginning of the string, if it starts with a word character
At the end of the string, if it ends with a word character
Between a word character \w and a non-word character \W (note the case difference)
The brackets [ and ] are NOT word characters, thus if a string starts with [, there is no \b to its left. Anywhere there is no \b, there is \B instead (note the case difference).
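If some boundary check around the brackets is still wanted, one option is to replace \b with negative lookarounds on \w. This is only a sketch of that idea, assuming the goal is to reject [harry] when it is glued to a word character; whether that is desired is not stated in the question.

```python
import re

# No word character directly before '[' or directly after ']'.
pattern = re.compile(r"(?<!\w)\[harry\](?!\w)", re.IGNORECASE)

standalone = pattern.search("[harry] blah")
mixed_case = pattern.search("hello [Harry]!")
glued = pattern.search("foobar[harry] blah")
no_brackets = pattern.search("harry blah")

print(standalone.group())  # [harry]
print(mixed_case.group())  # [Harry]
print(glued, no_brackets)  # None None
```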
References
regular-expressions.info/Word Boundaries
http://docs.python.org/library/re.html
\b : Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
