Regex that recognizes ^^ and ^ as different

I am using the Python re module.
I can use the regex r'\bA\b' (a raw string) to differentiate between 'A' and 'AA': it will find a match in the string 'A' and no matches in the string 'AA'.
I would like to achieve the same thing with a caret ^ instead of the A: I want a regex which differentiates between '^' and '^^'.
The problem I have is that the regex r'\b\^\b' does not find a match in '^'.
Any ideas?

You need to use lookaround for this:
(?<!\^)\^(?!\^)
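\b is a word boundary, a place between a word character and a non-word character, so your pattern is quite non-specific: it doesn't say anything about A specifically, and A_ would also fail to match, given that _ is a word character. It also explains why r'\b\^\b' finds nothing in '^': the caret is a non-word character, so there is no word boundary on either side of it.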
Here, we assert that there needs to be a place where the preceding character is not a caret, then a caret, then a place where the following character is not a caret (which boils down to "the caret must not be in caret company").
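A quick check of the lookaround pattern, as a minimal sketch (the name LONE_CARET and the sample strings are made up for illustration):

```python
import re

# A caret that is neither preceded nor followed by another caret.
LONE_CARET = re.compile(r'(?<!\^)\^(?!\^)')

print(bool(LONE_CARET.search('^')))      # True: a single caret
print(bool(LONE_CARET.search('^^')))     # False: both carets have a caret neighbour
print(LONE_CARET.findall('a^b ^^ c^'))   # ['^', '^']: only the two lone carets
```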

Related

Pattern search in Python

I am trying to search a string 'Test^' in another string 'test1 Test2 Test^ test'. I find that
re.search(r'\bTest\^\B', 'test1 Test2 Test^ test')
would work but
re.search(r'\bTest\^\b', 'test1 Test2 Test^ test')
would not work. I am a bit confused, as I think I should use \b for the word boundary of 'Test^' (both sides have an empty space). Is it because Python treats the end of the string as '^', so it is a non-word boundary?
Thank you.

\b means transition from "word character" to "non-word character" or vice versa. Word characters are alphanumeric characters, plus underscore, _. ^ is not a word character, nor is (space), so the transition from one to another is not a word boundary; as observed, it matches \B, not \b. If you want a space specific check, you'd need to explicitly use look-ahead (?=) or look-behind (?<=) assertions (possibly negated, depending on use case) with \s/\S.
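The difference can be checked directly. This is only a sketch: the variable names are illustrative, and the whitespace lookarounds (?<!\S) / (?!\S) are one possible way to write the "space specific check" mentioned above.

```python
import re

text = 'test1 Test2 Test^ test'

# '^' and ' ' are both non-word characters, so the position between them
# is a non-boundary (\B), not a word boundary (\b).
with_B = re.search(r'\bTest\^\B', text)
with_b = re.search(r'\bTest\^\b', text)

# Whitespace-based check: no non-whitespace character on either side.
with_ws = re.search(r'(?<!\S)Test\^(?!\S)', text)

print(with_B.group(0) if with_B else None)   # Test^
print(with_b)                                # None
print(with_ws.span() if with_ws else None)   # (12, 17)
```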

The caret is a non-word character, and so is the space after it, so there is no word boundary between them. The capital \B is not ignored: it asserts that the position is not a word boundary, which is why the first pattern matches.
So the caret should not be part of a regex pattern that relies on \b to delimit words only.
https://www.regular-expressions.info/wordboundaries.html

Python regex to match word boundary when part of the word contains special character [duplicate]

I've spent some time, but still have no solution. I need a regular expression that is able to match words with signs in them (like c++) in a string.
I've used /\bword\b/ for "usual" words, and it works OK. But as soon as I try /\bC\+\+\b/ it just does not work. It somehow goes wrong with the plus signs in it.
I need a regex to detect if input string contains c++ word in it. Input like,
"c++ developer"
"using c++ language"
etc.
ps. Using C#, .Net Regex.Match function.
Thanks for help!

+ is a special character so you need to escape it
\bC\+\+(?!\w)
Note that we can't use \b at the end because + is not a word character.

The problem isn't with the plus character, which you've escaped correctly, but with the \b sequence. It indicates a word boundary, which is a point between a word character (alphanumeric or underscore) and something else. Plus isn't a word character, so for \b to match, there would need to be a word character directly after the last plus sign.
\bC\+\+\b matches "Test C++Test" but not "Test C++ Test" for example. Try something like \bC\+\+\s if you expect there to be a whitespace after the last plus sign.
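The same behaviour can be reproduced in Python's re module. This is a sketch rather than the .NET code from the question, assuming both engines treat \b the same way for these ASCII inputs; the names strict and relaxed are illustrative.

```python
import re

strict = re.compile(r'\bC\+\+\b')        # needs a word character right after "++"
relaxed = re.compile(r'\bC\+\+(?!\w)')   # only forbids a word character after "++"

for s in ('Test C++Test', 'Test C++ Test', 'C++'):
    print(repr(s), bool(strict.search(s)), bool(relaxed.search(s)))

# 'Test C++Test' True False
# 'Test C++ Test' False True
# 'C++' False True
```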

The plus sign has a special meaning, so you will have to escape it with \. The same rule applies to these characters: \, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space
UPDATE: the problem was with \b sequence

If you want to match a c++ between non-word chars (chars other than letters, digits and underscores) you may use
\bc\+\+\B
See the regex demo where \b is a word boundary and \B matches all positions that are not word boundary positions.
C# syntax:
var pattern = @"\bc\+\+\B";
You must remember that \b / \B are context dependent: \b matches between the start/end of string and the adjoining word char or between a word and a non-word chars, while \B matches between the start/end of string and the adjoining non-word char or between two word or two non-word chars.
If you build the pattern dynamically, it is hard to rely on word boundary \b pattern.
Use adaptive dynamic word boundaries, the (?!\B\w) and (?<!\w\B) lookarounds, instead; they require that the word is not immediately preceded/followed by a word char only if the word starts/ends with a word char:
var pattern = $@"(?!\B\w){Regex.Escape(word)}(?<!\w\B)";
If the word boundaries you want to match are whitespace boundaries (i.e. the match is expected only between whitespaces), use
var pattern = $@"(?<!\S){Regex.Escape(word)}(?!\S)";
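For readers working in Python, a rough equivalent of the two dynamic patterns might look like the sketch below. The helper names adaptive_pattern and whitespace_pattern are hypothetical, and re.escape stands in for Regex.Escape.

```python
import re

def adaptive_pattern(word):
    # Require a word boundary only on the sides where the word itself
    # starts or ends with a word character.
    return re.compile(r'(?!\B\w)' + re.escape(word) + r'(?<!\w\B)')

def whitespace_pattern(word):
    # Require whitespace (or the edge of the string) on both sides.
    return re.compile(r'(?<!\S)' + re.escape(word) + r'(?!\S)')

cpp = adaptive_pattern('c++')
print(bool(cpp.search('using c++ language')))                     # True
print(bool(cpp.search('abc++')))                                  # False: "c" is glued to "ab"
print(bool(adaptive_pattern('cat').search('concatenate')))        # False
print(bool(whitespace_pattern('c++').search('c++, of course')))   # False: a comma follows
```

The adaptive version is the more forgiving of the two: it accepts punctuation next to the word, while the whitespace version accepts only spaces or the string edges.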

As the others said, your problem isn't the + sign, which you've escaped correctly, but the \b, a zero-width assertion that matches a word boundary, which occurs between a word (\w) and a non-word (\W) char.
There is also another mistake in your regex: you are trying to match an uppercase C against a lowercase c++. To fix that, change your regex to /\bc\+\+/ or use the i modifier to match case-insensitively: /\bc\+\+/i

How to write word boundary inside character class in python without losing its meaning? I wish to add underscore(_) in definition of word boundary(\b)

I am aware that the definition of word boundary is (?<!\w)(?=\w)|(?<=\w)(?!\w)
and I wish to (optionally) add underscore to the definition of word boundary as well.
One way of doing it is to simply modify the definition,
so the new one would be (_)?((?<!\w)(?=\w)|(?<=\w)(?!\w)), but I don't wish to use such a long expression.
An easier approach might be this:
if I could write a word boundary inside a character class, then adding underscore would be very easy, just like [\b_], but the problem is that \b inside a character class, i.e. [\b], means the backspace character, not a word boundary.
Please tell me the solution, i.e. how to put \b inside a character class without losing its original meaning.

You may use lookarounds:
(?:\b|(?<=_))word(?=\b|_)
^^^^^^^^^^^^^    ^^^^^^^^
See the regex demo where (?:\b|(?<=_)) is a non-capturing group matching either a word boundary or a location preceded with _, and (?=\b|_) is a positive lookahead matching either a word boundary or a _ symbol.
Unfortunately, Python re won't allow using (?<=\b|_) as the lookbehind pattern should be of fixed width (else, you will get look-behind requires fixed-width pattern error).
A Python demo:
import re
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there"
print(re.findall(rx,s))
An alternative solution is to use custom word boundaries like (?<![^\W_]) / (?![^\W_]) (see online demo):
rx = r"(?<![^\W_])word(?![^\W_])"
The (?<![^\W_]) negative lookbehind fails the match if the character immediately before is a letter or digit (a word char other than _), so it requires the start of the string, a non-word char, or _ before the search word. The (?![^\W_]) negative lookahead likewise fails the match if a letter or digit follows, so it requires the end of the string, a non-word char, or _ after the search word.
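To compare the two approaches side by side, here is a small sketch. The sample string and the names lookaround and custom are made up for illustration.

```python
import re

s = "some_word_here and a word there, but not swordfish or word1"

# Approach 1: word boundary or underscore on each side.
lookaround = re.compile(r"(?:\b|(?<=_))word(?=\b|_)")
# Approach 2: no letter or digit directly before or after.
custom = re.compile(r"(?<![^\W_])word(?![^\W_])")

print(lookaround.findall(s))   # ['word', 'word']
print(custom.findall(s))       # ['word', 'word']
```

Both skip "swordfish" (a letter precedes) and "word1" (a digit follows), while still accepting the underscore-delimited occurrence.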

Regular expressions with \b and non-word characters (like '.')

Why does this regular expression:
r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
not match J. F. Kennedy?
I have to remove \b in the groups first_init and mid_init to make it match.
I am using Python, and for testing I am using https://regex101.com/
Thanks

You are over-applying the \b word breaks.
\b will only match if on one side there is a valid "word" character and on the other side not. Now you use this construction twice:
\b\w\.\b\s
.. and, rightly so, it does not match because on the left side you have a not-word character (a single full stop) and on the other side you also have a not-word character (a space).
Removing the \b between the full stop and \s is enough to make it work.

\b matches the empty string only at the beginning or end of a word. A word is a sequence of alphanumeric or underscore characters. The dot (.) cannot comprise part of the word.
>>> import re
# does not match when \. is within word boundary
>>> re.match(r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
# matches when \b is moved to left of \.
>>> re.match(r'^(?P<first_init>\b\w\b\.)\s(?P<mid_init>\b\w\b\.)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy') # matches
The . is not part of the word in this sense. See the docs here.

It does not match because of the \. (dot) character. A word boundary does not include the dot (it is not the same definition of word you perhaps would like). You can easily rewrite it without the need of \b. Read the documentation carefully.

Just remove the second boundary:
^(?P<first_init>\b\w\.)\s
(?P<mid_init>\b\w\.)\s
(?P<last_name>\b\w+\b)$
And see a demo on regex101.com.
Background is that the second \b is between a dot and a space, so it fails (remember that one of the sides needs to be a word character, ie one of a-zA-Z0-9_)
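A short check of the original and the corrected pattern, as a sketch (the variable names original, fixed and name are illustrative):

```python
import re

original = r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
# Same pattern with the \b after each dot removed.
fixed = r'^(?P<first_init>\b\w\.)\s(?P<mid_init>\b\w\.)\s(?P<last_name>\b\w+\b)$'

name = 'J. F. Kennedy'

print(re.match(original, name))        # None: no boundary between "." and " "
m = re.match(fixed, name)
print(m.groupdict() if m else None)
# {'first_init': 'J.', 'mid_init': 'F.', 'last_name': 'Kennedy'}
```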

\b means the border of a word.
A word here is defined like so:
a word is a run of letters, digits and underscores; it ends as soon as any other character follows, not only a space.
"J", "F" and "Kennedy" are the words here; the dots are not part of them.
Your example asks for a word boundary between each dot and the space after it, and there is none, because neither character is a word character.

Could you explain why this regex is not working?

>>> import re
>>> d = "Batman,Superman"
>>> m = re.search(r"(?<!Bat)\w+", d)
>>> m.group(0)
'Batman'
Why isn't group(0) matching Superman? This lookaround tutorial says:
(?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind

Batman isn't directly preceded by Bat, so that matches first. In fact, neither is Superman; there's a comma in-between in your string which will do just fine to allow that RE to match, but that's not matched anyway because it's possible to match earlier in the string.
Maybe this will explain better: if the string were Batman and you started trying to match from the m, the RE would not match there, because that is the only position in the string that is preceded by Bat. It would match from the next character instead, giving a match of an.

At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if true, the engine continues trying to make a match. As soon as the engine can match your pattern, it'll return a match.
At position 0 of your string (ie. prior to the B in Batman), the assertion succeeded, as Bat is not present before the current position - thus, \w+ can match the entire word Batman (remember, regexes are inherently greedy - ie. will match as much as possible).
See this page for more information on engine internals.
To achieve what you wanted, you could instead use something like:
\b(?!Bat)\w+
In this pattern, the engine will match a word boundary (\b) [1], followed by one or more word characters, with the assertion that the word characters do not start with Bat. A lookahead is used rather than a lookbehind because a lookbehind here would have the same problem as your original pattern: it would look before the position directly following the word boundary, and since it has already been determined that the position before the cursor is a word boundary, the negative lookbehind would always succeed.
[1] Note that word boundaries match a boundary between \w and \W (i.e. between [A-Za-z0-9_] and any other character); they also match at the start or end of the string when the adjacent character is a word character. If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.
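A minimal comparison of the original lookbehind pattern and the lookahead version, using the string from the question:

```python
import re

d = "Batman,Superman"

# Negative lookbehind succeeds at position 0, so the first word is matched.
print(re.search(r"(?<!Bat)\w+", d).group(0))   # Batman

# Boundary plus negative lookahead skips any word that starts with "Bat".
print(re.findall(r"\b(?!Bat)\w+", d))          # ['Superman']
```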

From the manual:
Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
http://docs.python.org/library/re.html#regular-expression-syntax

You're looking for the first set of one or more alphanumeric characters (\w+) that is not preceded by 'Bat'. Batman is the first such match. (Note that negative lookbehind assertions can match the start of a string.)

To do what you want, you have to constrain the regex to match 'man' specifically; otherwise, as others have pointed out, \w+ greedily matches anything, including 'Batman'. As in:
>>> re.search(r"\w+(?<!Bat)man", "Batman,Superman").group(0)
'Superman'
