Python re module - Unexpected behaviour with \b and '-' - python

I was testing (using pythex) a regex match between
re.compile('''
[ ]?
(?P<element> [a-zA-Z])
[ ]+
(?P<x_axis> \b-?[0-9]+[.][0-9]+\b) # the first '\b' seems to be
# causing this issue
''', re.VERBOSE)
and string ' C -1.97046278'. This lead to no matches being found.
Then, when I tried removing \b re found the match for the aforementioned string.
Is the syntax incorrect? Cause I've been reading the documentation for re and haven't found any mention of this.
Care to explain this behaviour for me?

There are two main issues here:
Word boundaries \b are ambiguous as their meaning is context dependent. In this case, if there is a - before the digit, \b-?[0-9] will only match if there is a word char before the -. You need to place \b after -?. If you remove this \b, your regex will start matching the digits in any context, and I suspect you still want to match whole words only.
You need to declare the regex with a raw string literal so as \b was treated as a word boundary and not as a backspace char.
Use
import re
r=re.compile(r'''
[ ]?
(?P<element> [a-zA-Z])
[ ]+
(?P<x_axis> -?\b[0-9]+[.][0-9]+\b) # the first '\b' seems to be
# causing this issue
''', re.VERBOSE)
s = ' C -1.97046278'
print(r.findall(s))
See an online Python demo

\b matches between a word character and a non-word character. Both space and - are non-word characters, so \b will not match between them.
Word characters are letters, digits, and underscore. Non-word characters are everything else.
Also, you need to use a raw string delimited by r''' ... ''', so that escape sequences like \b will be passed to the re module, not processed as string escapes.

You will certainly need:
re.compile(r'''
[ ]?
(?P<element> [a-zA-Z])
[ ]+
(?P<x_axis> \b-?[0-9]+[.][0-9]+\b) # the first '\b' seems to be
# causing this issue
''', re.VERBOSE)
Note the r in the compile() call.

Related

Python regular expression Understanding Boundary "\b"

input :
- "example (.com)"
output :
- "example"
What I tried
import re
pattern=re.compile(r'\b\([\W\w]+\)\b')
#pattern=re.compile(r'\([\W\w]+\)')
print(pattern.sub("","example (.com)"))
This doesn't work but if I remove \b it works fine - why?
Make yourself clear what the \b stands for:
(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))
That is, either a word character ahead and not behind or the other way round. A word character (\w) stands for [A-Za-z0-9_] and does not include ( or ), so there's no word boundary between a space and a parenthesis.
( is not a word boundary, so \b won't match it. Instead you could use \B to match the blank

Python regex to match word boundary when part do the word contains special character [duplicate]

I've spent some time, but still have to solution. I need regular expression that is able to match a words with signs in it (like c++) in string.
I've used /\bword\b/, for "usual" words, it works OK. But as soon as I try /\bC\+\+\b/ it just does not work. It some how works wrong with a plus signs in it.
I need a regex to detect if input string contains c++ word in it. Input like,
"c++ developer"
"using c++ language"
etc.
ps. Using C#, .Net Regex.Match function.
Thanks for help!
+ is a special character so you need to escape it
\bC\+\+(?!\w)
Note that we can't use \b because + is not a word-character.
The problem isn't with the plus character, that you've escaped correctly, but the \b sequence. It indicates a word boundary, which is a point between a word character (alphanumeric) and something else. Plus isn't a word character, so for \b to match, there would need to be a word character directly after the last plus sign.
\bC\+\+\b matches "Test C++Test" but not "Test C++ Test" for example. Try something like \bC\+\+\s if you expect there to be a whitespace after the last plus sign.
Plus sign have special meaning so you will have to escape it with \. The same rule applies to these characters: \, *, +, ?, |, {, [, (,), ^, $,., #, and white space
UPDATE: the problem was with \b sequence
If you want to match a c++ between non-word chars (chars other than letters, digits and underscores) you may use
\bc\+\+\B
See the regex demo where \b is a word boundary and \B matches all positions that are not word boundary positions.
C# syntax:
var pattern = #"\bc\+\+\B";
You must remember that \b / \B are context dependent: \b matches between the start/end of string and the adjoining word char or between a word and a non-word chars, while \B matches between the start/end of string and the adjoining non-word char or between two word or two non-word chars.
If you build the pattern dynamically, it is hard to rely on word boundary \b pattern.
Use adaptive dynamic wod boundaries, (?!\B\w) and (?<!\w\B) lookarounds instead, they will always match a word not immediately preceded/followed with a word char if the word starts/ends with a word char:
var pattern = $#"(?!\B\w){Regex.Escape(word)}(?<!\w\B)";
If the word boundaries you want to match are whitespace boundaries (i.e. the match is expected only between whitespaces), use
var pattern = $#"(?<!\S){Regex.Escape(word)}(?!\S)";
As the others said, your problem isn't the + sign you've escaped correctly but the \b that is a zero-lenght char that match word boundary that takes place between word \w and non-word \W char.
There is also another mistake in your regex, you want to match char C (uppercase) with c++ (lowercase).To do so you have to change your regex to /\bc\+\+/ or use the i modifier to match case insensitive : /\bc\+\+/i

Regular expressions with \b and non-word characters (like '.')

Why does this regular expression:
r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
does not match J. F. Kennedy?
I have to remove \b in groups first_init and mid_init to match the words.
I am using Python. And for testing i am using https://regex101.com/
Thanks
You are over-applying the \b word breaks.
\b will only match if on one side there is a valid "word" character and on the other side not. Now you use this construction twice:
\b\w\.\b\s
.. and, rightly so, it does not match because on the left side you have a not-word character (a single full stop) and on the other side you also have a not-word character (a space).
Removing the \b between the full stop and \s is enough to make it work.
\b matches the empty string only at the beginning or end of a word. A word is a sequence of alphanumeric or underscore characters. The dot (.) cannot comprise part of the word.
>>> import re
# does not match when \. is within word boundary
>>> re.match(r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
# matches when \b is moved to left of \.
>>> re.match(r'^(?P<first_init>\b\w\b\.)\s(?P<mid_init>\b\w\b\.)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy') # matches
The . is not part of the word in this sense. See the docs here.
It does not match because of the \. (dot) character. A word boundary does not include the dot (it is not the same definition of word you perhaps would like). You can easily rewrite it without the need of \b. Read the documentation carefully.
Just remove the second boundary:
^(?P<first_init>\b\w\.)\s
(?P<mid_init>\b\w\.)\s
(?P<last_name>\b\w+\b)$
And see a demo on regex101.com.
Background is that the second \b is between a dot and a space, so it fails (remember that one of the sides needs to be a word character, ie one of a-zA-Z0-9_)
\b means border of a word.
Word here is defined like so:
A word ends, when there is a space character following it.
"J.", "F." and "Kennedy" are the words here.
You're example is trying to search for a space between the letter and the dot and it is searching for J . F . Kennedy.

Python regex not matching at word boundary as required

I want to match a set of patterns at "word boundary", but the patterns may have a prefix [##] which should get matched if present.
I'm using following regex pattern in python.
r"\b[##]?(abc|ef|ghij)\b"
Sample text is : #abc is a pattern which should match. also abc should match. And finally #ef
In this text only abc, abc and ef are matched without and not #abc and #ef as I want.
You need to put the word boundary next to [##] which you made as optional. Because in this #abc part there is a non-word boundary \B exists before # (not a word character) and after the start of the line (not a word character) not a word boundary \b. Note that \b matches between a word character and a non-word character, vice-versa. \B matches between two word characters or two non-word characters.
r"[##]?\b(abc|ef|ghij)\b"
If you put \b before [##], it would match strings like foo#abc or bar#abc because here there is actually a word boundary exists before # and #.
DEMO
Example:
>>> s = "#abc is a pattern which should match. also abc should match. And finally #ef"
>>> re.findall(r'[##]?\b(?:abc|ef|ghij)\b', s)
['#abc', 'abc', '#ef']
#abc
^ ^
\B \b
The group (##)? is saying that the word may begin with "##". What you are looking for is [##]? which is saying the first character is # or #, but it is not required. If you need the match to be part of a group you could use (#|#)?.
I will also throw in my version of the fixed regex without capturing group (since you do not seem to be using them):
r'[##]?\b(?:abc|ef|ghij)\b'
See my demo.
EXPLANATION: [##] are non-word characters and are optional due to ?. \b is not optional, and regex engine consumes it first, i.e. it consumes right # or #, but they are not part of the match since \b is always zero-width.
Here are more details on \b from Regular-Expressions.info:
The metacharacter \b is an anchor like the caret and the dollar sign.
It matches at a position that is called a "word boundary". This match
is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a
word character.
After the last character in the string, if the last
character is a word character.
Between two characters in the string,
where one is a word character and the other is not a word character.

Escaping [ in Python Regular Expressions

This reg exp search correctly checks to see if a string contains the text harry:
re.search(r'\bharry\b', '[harry] blah', re.IGNORECASE)
However, I need to ensure that the string contains [harry]. I have tried escaping with various numbers of back-slashes:
re.search(r'\b\[harry\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\[harry\\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\\[harry\\\]\b', '[harry] blah', re.IGNORECASE)
None of these solutions work find the match. What do I need to do?
The first one is correct:
r'\b\[harry\]\b'
But this won’t match [harry] blah as [ is not a word character and so there is no word boundary. It would only match if there were a word character in front of [ like in foobar[harry] blah.
>>> re.search(r'\bharry\b','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df648>
>>> re.search(r'\b\[harry\]\b','[harry] blah',re.IGNORECASE)
>>> re.search(r'\[harry\]','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df6b0>
>>> re.search(r'\[harry\]','harry blah',re.IGNORECASE)
The problem is the \b, not the brackets. A single backslash is correct for escaping.
You escape it the way you escape most regex metacharacter: preceding with a backslash.
Thus, r"\[harry\]" will match a literal string [harry].
The problem is with the \b in your pattern. This is the word boundary anchor.
The \b matches:
At the beginning of the string, if it starts with a word character
At the end of the string, if it ends with a word character
Between a word character \w and a non-word character \W (note the case difference)
The brackets [ and ] are NOT word characters, thus if a string starts with [, there is no \b to its left. Any where there is no \b, there is \B instead (note the case difference).
References
regular-expressions.info/Word Boundaries
http://docs.python.org/library/re.html
\b : Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

Categories

Resources