Escaping [ in Python Regular Expressions - python

This regular expression search correctly checks whether a string contains the text harry:
re.search(r'\bharry\b', '[harry] blah', re.IGNORECASE)
However, I need to ensure that the string contains [harry]. I have tried escaping with various numbers of back-slashes:
re.search(r'\b\[harry\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\[harry\\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\\[harry\\\]\b', '[harry] blah', re.IGNORECASE)
None of these attempts finds the match. What do I need to do?

The first one is correct:
r'\b\[harry\]\b'
But this won't match [harry] blah, as [ and ] are not word characters and so there is no word boundary before [ or after ]. It would only match if there were a word character directly before [ and directly after ], as in foobar[harry]blah.

>>> re.search(r'\bharry\b','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df648>
>>> re.search(r'\b\[harry\]\b','[harry] blah',re.IGNORECASE)
>>> re.search(r'\[harry\]','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df6b0>
>>> re.search(r'\[harry\]','harry blah',re.IGNORECASE)
The problem is the \b, not the brackets. A single backslash is correct for escaping.

You escape it the way you escape most regex metacharacters: by preceding it with a backslash.
Thus, r"\[harry\]" will match a literal string [harry].
The problem is with the \b in your pattern. This is the word boundary anchor.
The \b matches:
At the beginning of the string, if it starts with a word character
At the end of the string, if it ends with a word character
Between a word character \w and a non-word character \W (note the case difference)
The brackets [ and ] are NOT word characters, thus if a string starts with [, there is no \b to its left. Anywhere there is no \b, there is \B instead (note the case difference).
References
regular-expressions.info/Word Boundaries
http://docs.python.org/library/re.html
\b : Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
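The behaviour described in the answers above can be checked with a short sketch. The (?<!\w) and (?!\w) lookarounds in the last pattern are not taken from the answers; they are an assumed alternative for a "whole token" check when the token itself starts and ends with non-word characters.

```python
import re

text = '[harry] blah'

# A single backslash escapes each bracket; without \b the literal text is found.
plain = re.search(r'\[harry\]', text, re.IGNORECASE)

# With \b on both sides there is no match: "[" and "]" are non-word characters,
# so there is no word boundary at the string start or between "]" and " ".
bounded = re.search(r'\b\[harry\]\b', text, re.IGNORECASE)

# \b only succeeds here when word characters touch both brackets.
glued = re.search(r'\b\[harry\]\b', 'foobar[harry]blah', re.IGNORECASE)

# Assumed alternative: require that no word character touches the brackets.
isolated = re.compile(r'(?<!\w)\[harry\](?!\w)', re.IGNORECASE)

print(plain.group(0))                              # [harry]
print(bounded)                                     # None
print(glued.group(0))                              # [harry]
print(bool(isolated.search('[Harry] blah')))       # True
print(bool(isolated.search('foobar[harry]blah')))  # False
```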

Related

Does PyCharm code inspection incorrectly warn on regular expression equivalents? [duplicate]

I am new to Python regex and am trying to match non-white space ASCII characters in Python.
The following is my code:
import re
p = re.compile(r"[\S]{2,3}", re.ASCII)
p.search('1234') # returns a match
p.search('你好吗') # also returns a match, but why?
I have specified ASCII mode in re.compile, but p.search('你好吗') still returns a match. What am I doing wrong here?
The re.A flag only affects what shorthand character classes match.
In Python 3.x, shorthand character classes are Unicode-aware: the behavior of the Python 2.x re.UNICODE/re.U flag is on by default. That means:
\d: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
\D: Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category).
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in a My name is Виктор string)
\W - Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
\s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.)
\S - Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
\b - word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
\B - non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
If you want to disable this behavior, you use re.A or re.ASCII:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
That means that:
\d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
\D = [^0-9] - and matches any character other than ASCII digits, so it now also matches non-ASCII digits such as Hindi or Bengali ones
\w = [A-Za-z0-9_] - and it only matches ASCII words now: Wiktor is matched with \w+, but Виктор is not
\W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
\S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation and Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{ } ', as you see, the \S now matches hard spaces.
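A small sketch of what the flag does and does not change. The explicit [!-~] range at the end is an assumption about what the asker wanted (printable, non-space ASCII characters); it is not part of the answer above.

```python
import re

text = '你好吗'

# \S gives the same result with and without re.ASCII here, because CJK
# characters are not whitespace under either definition.
unicode_S = re.search(r'\S{2,3}', text)
ascii_S = re.search(r'\S{2,3}', text, re.ASCII)

# \w is where the flag makes a visible difference.
unicode_w = re.findall(r'\w+', 'My name is Виктор')
ascii_w = re.findall(r'\w+', 'My name is Виктор', re.ASCII)

# Assumed fix for the question: an explicit range of printable,
# non-space ASCII characters ("!" through "~").
ascii_only = re.compile(r'[!-~]{2,3}')

print(unicode_S.group(0), ascii_S.group(0))  # 你好吗 你好吗
print(unicode_w)                             # ['My', 'name', 'is', 'Виктор']
print(ascii_w)                               # ['My', 'name', 'is']
print(ascii_only.search('1234').group(0))    # 123
print(ascii_only.search(text))               # None
```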

Python regex to match word boundary when part of the word contains special character [duplicate]

I've spent some time on this, but still have no solution. I need a regular expression that is able to match words with signs in them (like c++) in a string.
I've used /\bword\b/ for "usual" words, and it works OK. But as soon as I try /\bC\+\+\b/ it just does not work. It somehow goes wrong when there are plus signs in the word.
I need a regex to detect if input string contains c++ word in it. Input like,
"c++ developer"
"using c++ language"
etc.
ps. Using C#, .Net Regex.Match function.
Thanks for help!
+ is a special character so you need to escape it
\bC\+\+(?!\w)
Note that we can't use \b because + is not a word-character.
The problem isn't with the plus character, that you've escaped correctly, but the \b sequence. It indicates a word boundary, which is a point between a word character (alphanumeric) and something else. Plus isn't a word character, so for \b to match, there would need to be a word character directly after the last plus sign.
\bC\+\+\b matches "Test C++Test" but not "Test C++ Test" for example. Try something like \bC\+\+\s if you expect there to be a whitespace after the last plus sign.
The plus sign has a special meaning, so you will have to escape it with \. The same rule applies to these characters: \, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space
UPDATE: the problem was with \b sequence
If you want to match a c++ between non-word chars (chars other than letters, digits and underscores) you may use
\bc\+\+\B
Here \b is a word boundary and \B matches all positions that are not word boundary positions.
C# syntax:
var pattern = @"\bc\+\+\B";
You must remember that \b / \B are context dependent: \b matches between the start/end of string and an adjoining word char, or between a word char and a non-word char, while \B matches between the start/end of string and an adjoining non-word char, or between two word chars or two non-word chars.
If you build the pattern dynamically, it is hard to rely on the word boundary \b pattern.
Use adaptive dynamic word boundaries, the (?!\B\w) and (?<!\w\B) lookarounds, instead. They require that the word is not immediately preceded/followed by a word char only if the word itself starts/ends with a word char:
var pattern = $@"(?!\B\w){Regex.Escape(word)}(?<!\w\B)";
If the word boundaries you want to match are whitespace boundaries (i.e. the match is expected only between whitespaces), use
var pattern = $@"(?<!\S){Regex.Escape(word)}(?!\S)";
As the others said, your problem isn't the + sign, which you've escaped correctly, but the \b, a zero-length assertion that matches a word boundary, i.e. a position between a word char \w and a non-word char \W.
There is also another mistake in your regex: you are trying to match c++ (lowercase) with the char C (uppercase). To fix that, change your regex to /\bc\+\+/ or use the i modifier to match case-insensitively: /\bc\+\+/i
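The answers above use C#. The same lookaround ideas can be sketched in Python, the language used in the rest of this page. The helper names adaptive and whitespace_bounded are hypothetical, and re.escape stands in for Regex.Escape.

```python
import re

def adaptive(word):
    # Require a boundary only on a side where the word has a word character.
    return re.compile(r'(?!\B\w)' + re.escape(word) + r'(?<!\w\B)', re.IGNORECASE)

def whitespace_bounded(word):
    # Require whitespace or the string edge on both sides.
    return re.compile(r'(?<!\S)' + re.escape(word) + r'(?!\S)', re.IGNORECASE)

cpp = adaptive('c++')
print(bool(cpp.search('using C++ language')))    # True
print(bool(cpp.search('abc++ rules')))           # False: "c" sits inside a word
print(bool(adaptive('cat').search('catfish')))   # False: "t" is followed by a word char

ws = whitespace_bounded('c++')
print(bool(ws.search('c++ developer')))          # True
print(bool(ws.search('c++17')))                  # False: "1" follows directly
```

The adaptive form places no requirement after c++ because the word ends in a non-word character; the whitespace form is stricter and also rejects input such as c++17.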

Python re module - Unexpected behaviour with \b and '-'

I was testing (using pythex) a regex match between
re.compile('''
[ ]?
(?P<element> [a-zA-Z])
[ ]+
(?P<x_axis> \b-?[0-9]+[.][0-9]+\b) # the first '\b' seems to be
# causing this issue
''', re.VERBOSE)
and string ' C -1.97046278'. This led to no matches being found.
Then, when I tried removing \b, re found the match for the aforementioned string.
Is the syntax incorrect? I ask because I've been reading the documentation for re and haven't found any mention of this.
Care to explain this behaviour for me?
There are two main issues here:
Word boundaries \b are ambiguous as their meaning is context dependent. In this case, if there is a - before the digit, \b-?[0-9] will only match if there is a word char before the -. You need to place \b after -?. If you remove this \b, your regex will start matching the digits in any context, and I suspect you still want to match whole words only.
You need to declare the regex with a raw string literal so that \b is treated as a word boundary and not as a backspace char.
Use
import re
r=re.compile(r'''
[ ]?
(?P<element> [a-zA-Z])
[ ]+
(?P<x_axis> -?\b[0-9]+[.][0-9]+\b) # \b now comes after the optional '-'
''', re.VERBOSE)
s = ' C -1.97046278'
print(r.findall(s))
\b matches between a word character and a non-word character. Both space and - are non-word characters, so \b will not match between them.
Word characters are letters, digits, and underscore. Non-word characters are everything else.
Also, you need to use a raw string delimited by r''' ... ''', so that escape sequences like \b will be passed to the re module, not processed as string escapes.
You will certainly need:
re.compile(r'''
[ ]?
(?P<element> [a-zA-Z])
[ ]+
(?P<x_axis> \b-?[0-9]+[.][0-9]+\b) # the first '\b' seems to be
# causing this issue
''', re.VERBOSE)
Note the r in the compile() call.
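Both points from the answers (the raw string and the position of \b) can be seen side by side in a short sketch. The names before_sign and after_sign are hypothetical labels for the two variants.

```python
import re

# In a normal string literal "\b" is one character, the backspace (0x08).
# In a raw string it stays as backslash + "b" for the re module to interpret.
print('\b' == '\x08', len(r'\b'))  # True 2

before_sign = re.compile(r'''
    [ ]?
    (?P<element> [a-zA-Z])
    [ ]+
    (?P<x_axis> \b-?[0-9]+[.][0-9]+\b)   # \b before the optional "-"
''', re.VERBOSE)

after_sign = re.compile(r'''
    [ ]?
    (?P<element> [a-zA-Z])
    [ ]+
    (?P<x_axis> -?\b[0-9]+[.][0-9]+\b)   # \b after the optional "-"
''', re.VERBOSE)

s = ' C -1.97046278'

# No boundary exists between the space and "-", so the first variant fails.
print(before_sign.search(s))                  # None

m = after_sign.search(s)
print(m.group('element'), m.group('x_axis'))  # C -1.97046278
```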

Python regex not matching at word boundary as required

I want to match a set of patterns at a "word boundary", but the patterns may have a prefix [@#] which should be matched if present.
I'm using the following regex pattern in Python.
r"\b[@#]?(abc|ef|ghij)\b"
Sample text is : #abc is a pattern which should match. also abc should match. And finally #ef
In this text only abc, abc and ef are matched, without the prefix, and not #abc and #ef as I want.
You need to put the word boundary after the optional [@#]. In #abc, neither # nor the start of the line before it is a word character, so the position before # is a non-word boundary \B, not a word boundary \b. Note that \b matches between a word character and a non-word character, or vice versa. \B matches between two word characters or two non-word characters.
r"[@#]?\b(abc|ef|ghij)\b"
If you put \b before [@#], it would match strings like foo@abc or bar#abc, because there a word boundary does exist before @ and #.
Example:
>>> s = "#abc is a pattern which should match. also abc should match. And finally #ef"
>>> re.findall(r'[@#]?\b(?:abc|ef|ghij)\b', s)
['#abc', 'abc', '#ef']
#abc
^ ^
\B \b
The group (@#)? says that the word may begin with the two-character sequence "@#". What you are looking for is [@#]?, which says that the first character is @ or #, but is not required. If you need the prefix to be captured in a group, you could use (@|#)?.
I will also throw in my version of the fixed regex without a capturing group (since you do not seem to be using them):
r'[@#]?\b(?:abc|ef|ghij)\b'
EXPLANATION: [@#] are non-word characters and are optional due to ?. \b is not optional, so in the original pattern the regex engine has to satisfy it first. It can do that only between the @ or # and the following letter, which leaves the prefix out of the match since \b is always zero-width.
Here are more details on \b from Regular-Expressions.info:
The metacharacter \b is an anchor like the caret and the dollar sign.
It matches at a position that is called a "word boundary". This match
is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a
word character.
After the last character in the string, if the last
character is a word character.
Between two characters in the string,
where one is a word character and the other is not a word character.
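A minimal sketch comparing the two placements of \b, assuming the optional prefix class is meant to be [@#] (the @ is an assumption about the intended prefix). The second input string is a hypothetical example, not from the question.

```python
import re

s = '#abc is a pattern which should match. also abc should match. And finally #ef'

# \b first: the boundary can only sit between the prefix and the first letter,
# so the optional prefix class matches nothing and the prefix is dropped.
boundary_first = re.findall(r'\b[@#]?(?:abc|ef|ghij)\b', s)

# Prefix first, then \b: the prefix is consumed whenever it is present.
prefix_first = re.findall(r'[@#]?\b(?:abc|ef|ghij)\b', s)

# Same pattern on a hypothetical input that uses both prefixes.
mixed = re.findall(r'[@#]?\b(?:abc|ef|ghij)\b', '@abc and #ghij')

print(boundary_first)  # ['abc', 'abc', 'ef']
print(prefix_first)    # ['#abc', 'abc', '#ef']
print(mixed)           # ['@abc', '#ghij']
```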

python regex to find "cat" but not "catfish" or "caterpillar", etc

I'm not very familiar with regex and I'm having trouble creating one that would find "cat" anywhere in a string, followed (or not) by any punctuation, but not "caterpillar", "catfish", etc.
In Python regular expressions, \b is a word boundary so you can search for cat\b (though that will also pick up things like bobcat or tomcat so you may need to use \bcat\b if you don't want those).
From the Python 3.4 docs (though 2.7 is very similar):
\b - Matches the empty string, but only at the beginning or end of a word.
A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character.
Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
By default Unicode alphanumerics are the ones used, but this can be changed by using the ASCII flag. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
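The difference between cat\b and \bcat\b can be checked on a made-up sentence (the sample text is an assumption chosen to cover each case mentioned above):

```python
import re

text = 'A cat, a catfish, a caterpillar and a bobcat. Cat!'

# Boundary only on the right: also picks up the "cat" at the end of "bobcat".
right_only = re.findall(r'cat\b', text, re.IGNORECASE)

# Boundary on both sides: only the standalone word, punctuation allowed after it.
both_sides = re.findall(r'\bcat\b', text, re.IGNORECASE)

print(right_only)  # ['cat', 'cat', 'Cat']
print(both_sides)  # ['cat', 'Cat']
```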
