What's the reason for writing r"\1 \2"? [duplicate] - python

I don't understand how the escape character \ in Python regexes works together with the r'' prefix of raw strings.
Some help is appreciated.
code:
import re
text=' esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)
The theory says:
backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.
And as far as the link provided at the end of this question explains, r'' denotes a raw string, i.e. symbols have no special meaning and the string stays exactly as written.
So in the regex calls above I would expect text2 and text3 to be different, since the substitution text in text2 is '\.', i.e. a period, whereas (in principle) the substitution text in text3 is r'\.', which is a raw string, so the string should appear as it is, backslash and period. But they give the same result:
The result is:
text0= esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation
text1= esto.es 10. er- 12.23 with [ and.Other ] here is more; puntuation
text2= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
text3= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'
It looks to me as if neither the r'' prefix nor the backslash works the same way in the substitution part. On the other hand, my intuition tells me I am missing something here.
EDIT 1:
Following @Wiktor Stribiżew's comment.
He pointed out that (following his link):
import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same and the result were equal
# here indeed r' changes the results
which gives:
ab
a6b
that puzzles me even more.
Note:
I read this stack overflow question about raw strings which is super complete. Nevertheless it does not speak about substitutions

First and foremost,
replacement patterns ≠ regular expression patterns
We use a regex pattern to search for matches, we use replacement patterns to replace matches found with regex.
NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.
Replacement pattern syntax in Python
The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).
I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and the term string escape sequence to denote a sequence of \ and a char or some sequence that together form a valid string escape sequence. String escape sequences are only recognized in regular string literals. In raw string literals, you can only escape " (which is why a raw string literal cannot end in a single backslash: the final \" would be read as an escaped quote), and even then the backslash remains part of the string.
So, in a replacement pattern, you may use backreferences:
re.sub(r'\D(\d)\D', r'\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b') # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1
You may see that r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it will be parsed as a string escape sequence, a character with octal value 001. If you forget the r prefix with the unambiguous backreference \g<1>, there is no problem, because \g is not a valid string escape sequence, so the \ escape character remains in the string. Read on in the docs I linked to:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.
So, when you pass '\.' as a replacement string, you actually send \. two-char combination as the replacement string, and that is why you get \. in the result.
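The difference between these spellings can be checked directly. Here is a minimal sketch for Python 3; the variable names are only illustrative and are not part of the original answer.

```python
import re

# r'\1' and '\\1' spell the same two-character string: a backslash and '1'.
backref_raw = r'\1'
backref_doubled = '\\1'
print(backref_raw == backref_doubled, len(backref_raw))  # True 2

# Without the r prefix, '\1' is an octal string escape: one character, code point 1.
octal_char = '\1'
print(len(octal_char), ord(octal_char))  # 1 1

# The replacement is treated as a backreference only if the backslash survives.
with_backref = re.sub(r'\D(\d)\D', backref_raw, 'a1b')
with_octal = re.sub(r'\D(\d)\D', octal_char, 'a1b')
print(repr(with_backref), repr(with_octal))  # '1' '\x01'
```

The same reasoning applies to 'a\6b' in the question: the string parser turns \6 into an invisible control character before re.sub ever sees it.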
\ is a special character in Python replacement pattern
If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in text2 and text3 cases, see this demo.
That happens because \\, two literal backslashes, denote a single backslash in the replacement pattern. If you have no Group 2 in your regex pattern, but pass r'\2' in the replacement to actually replace with \ and 2 char combination, you would get an error.
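Both claims are easy to try out. The sketch below (Python 3, using a shortened sample text that is my own assumption) compares the doubled and single backslash in the replacement and then triggers the error for a missing group.

```python
import re

text = 'esto .es'

# In a replacement template, \\ (two backslash characters) stands for one backslash.
doubled = re.sub(r'\s+\.', r'\\.', text)
# \. is an unknown non-letter escape, which the template parser leaves unchanged.
single = re.sub(r'\s+\.', r'\.', text)
print(doubled, single, doubled == single)  # esto\.es esto\.es True

# The pattern has no group 2, so \2 in the replacement is rejected.
try:
    re.sub(r'\s+\.', r'\2', text)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message is not None)  # True
```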
Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:
re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
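As a rough illustration of that one-liner, here is a small sketch for Python 3. The helper name sub_literal and the sample input are hypothetical, chosen only for this example.

```python
import re

def sub_literal(pattern, replacement, string):
    """Substitute `replacement` verbatim by doubling its backslashes first."""
    return re.sub(pattern, replacement.replace('\\', '\\\\'), string)

# Hypothetical user input. Passed unescaped, \n would become a newline
# and \1 would be read as a reference to a group that does not exist.
user_replacement = r'C:\new\1'

literal_result = sub_literal(r'PATH', user_replacement, 'dir=PATH')
print(literal_result)  # dir=C:\new\1
```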

A simple way to work around all these string escaping issues is to use a function/lambda as the repl argument, instead of a string. For example:
output = re.sub(
pattern=find_pattern,
repl=lambda _: replacement,
string=input,
)
The replacement string won't be parsed at all, just substituted in place of the match.
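A runnable version of the same idea might look like this (Python 3; the pattern and sample strings are placeholders I made up for the demonstration):

```python
import re

# This text would be rejected or altered if passed directly as a replacement string.
replacement = r'\1 and \n'

# A function's return value is inserted as-is, with no escape processing.
output = re.sub(r'X', lambda _match: replacement, 'aXb')
print(output)  # a\1 and \nb
```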

From the doc (my emphasis):
re.sub(pattern, repl, string, count=0, flags=0)
Return the string
obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
The repl argument is not just plain text. It can also be the name of a function or refer to a position in a group (e.g. \g<quote>, \g<1>, \1).
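To make those three forms of repl concrete, here is a short sketch for Python 3. The group names first and second and the sample text are assumptions made for the example.

```python
import re

text = 'hello world'

# Numbered backreferences in the replacement string.
by_number = re.sub(r'(\w+) (\w+)', r'\2 \1', text)
# Named backreferences with the \g<name> syntax.
by_name = re.sub(r'(?P<first>\w+) (?P<second>\w+)', r'\g<second> \g<first>', text)
# A function that receives each match object and returns the replacement.
by_function = re.sub(r'\w+', lambda m: m.group(0).capitalize(), text)

print(by_number, '|', by_name, '|', by_function)  # world hello | world hello | Hello World
```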
Also, from here:
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the result.
Since . is not a special escape character, '\.' is the same as r'\.'.

Related

Regex matching with \b [duplicate]

From the python documentation on regex, regarding the '\' character:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
What is this raw string notation? If you use a raw string format, does that mean "*" is taken as a literal character rather than a zero-or-more indicator? That obviously can't be right, or else regex would completely lose its power. But then if it's a raw string, how does it recognize newline characters if "\n" is literally a backslash and an "n"?
I don't follow.
Edit for bounty:
I'm trying to understand how a raw string regex matches newlines, tabs, and character sets, e.g. \w for words or \d for digits or all whatnot, if raw string patterns don't recognize backslashes as anything more than ordinary characters. I could really use some good examples.
Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.
You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representations of a particular string in Python source code, and what regular expression that string represents.
In fact, it's helpful to think of these as two different programming languages, each with their own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.
First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.
Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:
s = chr(92)+chr(110)
print len(s), s
2 \n
The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs 2.4.1. String literals say, "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."
s = "\n"
print len(s), s
1
 
(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)
To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:
s = "\\n"
print len(s), s
2 \n
What if you want to represent strings that have many backslash characters in them? Python docs 2.4.1. String literals continue, "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:
s = r"\n"
print len(s), s
2 \n
So we have three different string representations, all giving the same string, or sequence of characters:
print chr(92)+chr(110) == "\\n" == r"\n"
True
Now, let's turn to regular expressions. The Python docs, 7.2. re — Regular expression operations says, "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals..."
If you want a Python regular expression object which matches a newline character, then you need a 2-character string, consisting of the backslash character followed by the n character. The following lines of code all set prog to a regular expression object which recognises a newline character:
prog = re.compile(chr(92)+chr(110))
prog = re.compile("\\n")
prog = re.compile(r"\n")
So why is it that "Usually patterns will be expressed in Python code using this raw string notation."? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And from the different string literal notations available, raw strings are a convenient choice, when the regular expression includes a backslash character.
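The equivalence of the three spellings can be confirmed in a few lines. The answer targets Python 2.7; the sketch below is written for Python 3 and uses names of my own choosing.

```python
import re

# Three ways to write the same two-character string: backslash followed by n.
spellings = [chr(92) + chr(110), "\\n", r"\n"]
same_string = all(s == spellings[0] for s in spellings)

# Compiled as a regular expression, each one matches a single newline character.
all_match_newline = all(re.compile(s).fullmatch("\n") is not None for s in spellings)

print(same_string, all_match_newline)  # True True
```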
Questions
Q: what about the expression re.compile(r"\s\tWord")? A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.
s = r"\s\tWord"
prog = re.compile(s)
The string s contains eight characters: a backslash, an s, a backslash, a t, and then four characters Word.
Q: What happens to the tab and space characters? A: At the Python language level, string s doesn't have tab and space characters. It starts with four characters: backslash, s, backslash, t. The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word".
Q: How do you match those if that's being treated as backslash-s and backslash-t? A: Maybe the question is clearer if the words 'you' and 'that' are made more specific: how does the regular expression system match the expressions backslash-s and backslash-t? As 'any whitespace character' and as 'tab character'.
Q: Or what if you have the 3-character string backslash-n-newline? A: In the Python language, the 3-character string backslash-n-newline can be represented as conventional string "\\n\n", or raw plus conventional string r"\n" "\n", or in other ways. The regular expression system matches the 3-character string backslash-n-newline when it finds any two consecutive newline characters.
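The answers above can be checked with a short script. This is a sketch for Python 3 (the original examples are Python 2.7), and the test strings are my own.

```python
import re

s = r"\s\tWord"
prog = re.compile(s)
print(len(s))                                  # 8
print(prog.fullmatch(" \tWord") is not None)   # True: space, tab, Word
print(prog.fullmatch("\n\tWord") is not None)  # True: \s accepts any whitespace
print(prog.fullmatch("s\tWord") is not None)   # False: a literal 's' is not whitespace

three_chars = "\\n\n"                          # backslash, n, newline
print(three_chars == r"\n" "\n", len(three_chars))    # True 3
print(re.fullmatch(three_chars, "\n\n") is not None)  # True: two newlines
```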
N.B. All examples and document references are to Python 2.7.
Update: Incorporated clarifications from answers of @Vladislav Zorov and @m.buettner, and from follow-up question of @Aerovistae.
Most of these questions have a lot of words in them and maybe it's hard to find the answer to your specific question.
If you use a regular string and you pass in a pattern like "\t" to the RegEx parser, Python will translate that literal into a buffer with the tab byte in it (0x09).
If you use a raw string and you pass in a pattern like r"\t" to the RegEx parser, Python does not do any interpretation, and it creates a buffer with two bytes in it: '\', and 't'. (0x5c, 0x74).
The RegEx parser knows what to do with the sequence '\t' -- it matches that against a tab. It also knows what to do with the 0x09 character -- that also matches a tab. For the most part, the results will be indistinguishable.
So the key to understanding what's happening is recognizing that there are two parsers being employed here. The first one is the Python parser, and it translates your string literal (or raw string literal) into a sequence of bytes. The second one is Python's regular expression parser, and it converts a sequence of bytes into a compiled regular expression.
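A quick experiment shows the two parsers at work. This is only a sketch for Python 3; the \b case is my own addition, picked as one example where the two spellings stop being interchangeable.

```python
import re

cooked, raw = "\t", r"\t"
print(len(cooked), len(raw))  # 1 2

# Both the real tab character and the two-character regex escape match a tab.
tab_cooked = re.fullmatch(cooked, "\t") is not None
tab_raw = re.fullmatch(raw, "\t") is not None
print(tab_cooked, tab_raw)  # True True

# One case where they differ: "\b" is a backspace character to the Python parser,
# while r"\b" reaches the regex parser intact and means a word boundary.
boundary_cooked = re.search("\bcat", "a cat") is not None
boundary_raw = re.search(r"\bcat", "a cat") is not None
print(boundary_cooked, boundary_raw)  # False True
```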
The issue with using a normal string to write regexes that contain a \ is that you end up having to write \\ for every \. So the string literals "stuff\\things" and r"stuff\things" produce the same string. This gets especially useful if you want to write a regular expression that matches against backslashes.
Using normal strings, a regexp that matches the string \ would be "\\\\"!
Why? Because we have to escape \ twice: once for the regular expression syntax, and once for the string syntax.
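The double escaping can be verified like this (a small Python 3 sketch with illustrative variable names):

```python
import re

cooked = "\\\\"  # regular literal: four backslashes typed, two in the string
raw = r"\\"      # raw literal: two backslashes typed, two in the string
print(cooked == raw, len(cooked))  # True 2

single_backslash = "\\"  # the one-character subject string
matched = re.fullmatch(cooked, single_backslash) is not None
print(matched)  # True
```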
You can use triple quotes to include newlines, like this:
r'''stuff\
things'''
Note that usually, Python would treat \-newline as a line continuation, but this is not the case in raw strings. Also note that backslashes still escape quotes in raw strings, but the backslashes themselves stay in the string. So the raw string literal r"\"" produces the string \". This means you can't end a raw string literal with a single backslash.
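These corner cases of raw string literals can be tried out as follows. It is a sketch for Python 3; compile() is used only so that the invalid literal can be tested without breaking the script itself.

```python
# In a raw literal the backslash escapes the quote but also stays in the string.
quoted = r"\""
print(len(quoted), quoted == '\\"')  # 2 True

# Backslash-newline inside a raw triple-quoted literal keeps both characters.
multi = r'''stuff\
things'''
print(multi == 'stuff\\\nthings')  # True

# The source text r"\" (built here as a normal string) is not a complete literal.
try:
    compile('r"\\"', '<demo>', 'eval')
    ends_ok = True
except SyntaxError:
    ends_ok = False
print(ends_ok)  # False
```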
See the lexical analysis section of the Python documentation for more information.
You seem to be struggling with the idea that a RegEx isn't part of Python, but instead a different programming language with its own parser and compiler. Raw strings help you get the "source code" of a RegEx safely to the RegEx parser, which will then assign meaning to character sequences like \d, \w, \n, etc...
The issue exists because Python and RegExps use \ as escape character, which is, by the way, a coincidence - there are languages with other escape characters (like "`n" for a newline, but even there you have to use "\n" in RegExps). The advantage is that you don't need to differentiate between raw and non-raw strings in these languages, they won't both try to convert the text and butcher it, because they react to different escape sequences.
A raw string does not affect regex special sequences such as \w or \d. It only affects string escape sequences such as \n. So most of the time it doesn't matter whether we write r in front or not.
I think that is the answer most beginners are looking for.
The relevant Python manual section ("String and Bytes literals") has a clear explanation of raw string literals:
Both string and bytes literals may optionally be prefixed with a
letter 'r' or 'R'; such strings are called raw strings and treat
backslashes as literal characters. As a result, in string literals,
'\U' and '\u' escapes in raw strings are not treated specially. Given
that Python 2.x’s raw unicode literals behave differently than Python
3.x’s the 'ur' syntax is not supported.
New in version 3.3: The 'rb' prefix of raw bytes literals has been
added as a synonym of 'br'.
New in version 3.3: Support for the unicode legacy literal (u'value')
was reintroduced to simplify the maintenance of dual Python 2.x and
3.x codebases. See PEP 414 for more information.
In triple-quoted strings, unescaped newlines and quotes are allowed
(and are retained), except that three unescaped quotes in a row
terminate the string. (A “quote” is the character used to open the
string, i.e. either ' or ".)
Unless an 'r' or 'R' prefix is present, escape sequences in strings
are interpreted according to rules similar to those used by Standard
C. The recognized escape sequences are:
Escape Sequence   Meaning                                        Notes
\newline          Backslash and newline ignored
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo                 (1,3)
\xhh              Character with hex value hh                    (2,3)
Escape sequences only recognized in string literals are:
Escape Sequence   Meaning                                        Notes
\N{name}          Character named name in the Unicode database   (4)
\uxxxx            Character with 16-bit hex value xxxx           (5)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx       (6)
Notes:
(1) As in Standard C, up to three octal digits are accepted.
(2) Unlike in Standard C, exactly two hex digits are required.
(3) In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a
Unicode character with the given value.
(4) Changed in version 3.3: Support for name aliases [1] has been added.
(5) Individual code units which form parts of a surrogate pair can be encoded using this escape sequence. Exactly four hex digits are
required.
(6) Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a
surrogate pair if Python is compiled to use 16-bit code units (the
default). Exactly eight hex digits are required.
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the string. (This
behavior is useful when debugging: if an escape sequence is mistyped,
the resulting output is more easily recognized as broken.) It is also
important to note that the escape sequences only recognized in string
literals fall into the category of unrecognized escapes for bytes
literals.
Even in a raw string, string quotes can be escaped with a backslash,
but the backslash remains in the string; for example, r"\"" is a valid
string literal consisting of two characters: a backslash and a double
quote; r"\" is not a valid string literal (even a raw string cannot
end in an odd number of backslashes). Specifically, a raw string
cannot end in a single backslash (since the backslash would escape the
following quote character). Note also that a single backslash followed
by a newline is interpreted as those two characters as part of the
string, not as a line continuation.
\n is an Escape Sequence in Python
\w is a Special Sequence in (Python) Regex
They look like they are in the same family but they are not. Raw string notation will affect Escape Sequences but not Regex Special Sequences.
For more about Escape Sequences
search for "\newline"
https://docs.python.org/3/reference/lexical_analysis.html
For more about Special Sequences:
search for "\number"
https://docs.python.org/3/library/re.html

Searching for a raw string (r'\n') within a raw string full of ('\n') resulting in an empty result set [duplicate]

This question already has answers here:
Can't escape the backslash with regex?
(7 answers)
Closed 4 years ago.
I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
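The difference between printing a string and printing a container that holds it looks like this in practice (Python 3 sketch; the comments describe what each call displays):

```python
s = 'a\\b\tc'   # five characters: a, backslash, b, tab, c
print(len(s))   # 5
print(s)        # shows a\b, then a real tab, then c
print([s])      # ['a\\b\tc'] (a list displays the repr of its items)
print(repr(s))  # 'a\\b\tc'
```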
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
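Applied to the question's example, the two levels can be separated by writing everything with raw strings. This is a sketch for Python 3 with variable names of my own; in the question, '\\d' is the same string as r'\d' below, and '\\\d' ends up as the same string as r'\\d'.

```python
import re

subject = r'\d'  # two characters: backslash, d

digit_pattern = r'\d'     # regex: any decimal digit
literal_pattern = r'\\d'  # regex: a literal backslash followed by d

no_match = re.search(digit_pattern, subject)
match = re.search(literal_pattern, subject)
print(no_match)      # None
print(match.span())  # (0, 2)

# The fully escaped regular literal spells the same pattern string.
print('\\\\d' == literal_pattern)  # True
```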
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print("[" + s + "]")  # an actual tab character after preprocessing
[       ]
>>> s = '\d'
>>> print("[" + s + "]")  # '\d' is not a string escape, so it is unchanged
[\d]
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with another '\'. Now when Python sees '\\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print("[" + s + "]")  # backslash and t after preprocessing
[\t]
After the Python interpreter takes a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t', 'a\\t')  # match is None
Here, match is None. Why? Let's look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for an 'a' followed by a tab character in the string 'a\t' (a, backslash, t), and there is no match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t', 'a\\t')  # match contains 'a\t'
Again, let's look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t', 'a\\t')  # match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
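The three calls from this walk-through can be put side by side. This is a sketch for Python 3; the variable names are mine, and the last line is an extra check showing what the first pattern does match.

```python
import re

subject = 'a\\t'            # a, backslash, t
tab_pattern = 'a\\t'        # regex a\t: 'a' followed by a tab
escaped_pattern = 'a\\\\t'  # regex a\\t: 'a', literal backslash, 't'
raw_pattern = r'a\\t'       # the same regex written as a raw literal

print(re.search(tab_pattern, subject))               # None
print(re.search(escaped_pattern, subject).group(0))  # a\t
print(escaped_pattern == raw_pattern)                # True
print(re.search(tab_pattern, 'a\t') is not None)     # True: a real tab matches
```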
Python's own string parsing (partially) gets in your way.
If you want to see what re sees, type
print('\d')
print('\\d')
print('\\\d')
on the Python command prompt. You will see that '\d' and '\\d' both print as \d, the latter because the Python string parser collapses the doubled backslash, while '\\\d' prints as \\d.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.

regular expression problems in re.findall python [duplicate]

This question already has answers here:
Can't escape the backslash with regex?
(7 answers)
Closed 4 years ago.
I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print ("[" + s + "]")
>>> [ ] // an actual tab character after preprocessing
>>> s = '\d'
>>> print ("[" + s + "]")
>>> [\d] // '\d' after preprocessing
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with a '\'. Now when Python sees '\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print ("[" + s + "]")
>>> [\t] // '\t' after preprocessing
After the Python interpreter take a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t','a\\t') //Match is None
Here, match is None. Why? Lets look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
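So the search() method is looking for 'a' followed by a tab character (the regular-expression escape \t) in the string 'a\t', which contains 'a', a literal backslash and 't'. There is no tab in it, so there is no match.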
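The failed match can be reproduced directly. The sketch below assumes nothing beyond the standard re module; the variable names are illustrative only.

```python
import re

# The literal 'a\\t' becomes the three characters a, \, t before re sees it.
pattern = 'a\\t'
subject = 'a\\t'

# As a regex, \t means "tab", so the pattern looks for 'a' followed by a tab.
no_match = re.search(pattern, subject)
print(no_match)                       # None

# The same pattern does match a string that really contains a tab.
tab_match = re.search(pattern, 'a\tb')
print(tab_match.group() == 'a\t')     # True
```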
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t','a\\t')  # match contains 'a\t'
Again, let's look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t','a\\t')  # match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
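The two working spellings of the pattern can be compared side by side. This is a small sketch, not the only way to write it; it shows that the four-backslash literal and the raw literal are the same string by the time re receives them.

```python
import re

subject = 'a\\t'            # the characters a, \, t

doubled = 'a\\\\t'          # normal literal: Python halves the backslashes
raw = r'a\\t'               # raw literal: passed through unchanged

print(doubled == raw)       # True: both are the four characters a, \, \, t

m = re.search(raw, subject)
print(m.group() == subject) # True
print(len(m.group()))       # 3
```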
Python's own string parsing (partially) gets in your way.
If you want to see what re sees, type
print('\d')
print('\\d')
print('\\\d')
at the Python command prompt. You will see that '\d' and '\\d' both print \d, the latter being handled by the Python string parser, while '\\\d' prints \\d.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.

Confused about backslashes in regular expressions [duplicate]

I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
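The three calls from the question can be checked with the backslashes written out explicitly. This is a sketch under one assumption: each pattern below is spelled in fully doubled form, so it is the string that re actually receives rather than the literal as typed in the question.

```python
import re

subject = '\\d'              # two characters: backslash and d

# '\d' and '\\d' both reach re as \d, the "any digit" class.
digit_class = '\\d'
print(re.search(digit_class, subject))      # None: no digit in the subject

# '\\\d' reaches re as \\d: an escaped backslash followed by a plain d.
literal = '\\\\d'
m = re.search(literal, subject)
print(m.group() == subject)                 # True

# re.escape builds the same literal-matching pattern automatically.
print(re.escape(subject) == literal)        # True
```

When the goal is simply to match some text literally, re.escape avoids counting backslashes by hand.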
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
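The difference between printing a string and printing its quoted form can be seen with the same example string. This is a short sketch using only built-ins.

```python
s = 'a\\b\tc'      # five characters: a, backslash, b, tab, c

print(len(s))      # 5
print(s)           # shows the backslash once and a real tab
print(repr(s))     # 'a\\b\tc'
print([s])         # ['a\\b\tc']
```

The last two lines show the canonical escaped form, which doubles the backslash and spells the tab as \t. The string itself still holds one backslash and one tab.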
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".

What does 'r' mean before a Regex pattern?

I found the following regex substitution example from the documentation for Regex. I'm a little bit confused as to what the prefix r does before the string?
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
... r'static PyObject*\npy_\1(void)\n{',
... 'def myfunc():')
Placing r or R before a string literal creates what is known as a raw-string literal. Raw strings do not process escape sequences (\n, \b, etc.) and are thus commonly used for Regex patterns, which often contain a lot of \ characters.
Below is a demonstration:
>>> print('\n') # Prints a newline character
>>> print(r'\n') # Escape sequence is not processed
\n
>>> print('\b') # Prints a backspace character
>>> print(r'\b') # Escape sequence is not processed
\b
>>>
The only other option would be to double every backslash:
>>> re.sub('def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',
... 'static PyObject*\\npy_\\1(void)\\n{',
... 'def myfunc():')
which is just tedious.
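To confirm that the two spellings really are interchangeable, both calls can be run and their results compared. This is a minimal sketch; the names raw_result and doubled_result are only for illustration.

```python
import re

# Raw-string version, as in the documentation example.
raw_result = re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
                    r'static PyObject*\npy_\1(void)\n{',
                    'def myfunc():')

# Same call with every backslash doubled in ordinary string literals.
doubled_result = re.sub('def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',
                        'static PyObject*\\npy_\\1(void)\\n{',
                        'def myfunc():')

print(raw_result == doubled_result)   # True
print(repr(raw_result))               # 'static PyObject*\npy_myfunc(void)\n{'
```

Note that the \n in the replacement string is turned into a newline by re.sub itself, not by the Python string parser, since the raw literal passes it through untouched.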
The r means that the string is to be treated as a raw string, which means all escape codes will be ignored.
The Python documentation says precisely this:
String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences.
The current re module docs explain raw-string usage:
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python’s usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal. Also, please note that any invalid escape sequences in
Python’s usage of the backslash in string literals now generate a
DeprecationWarning and in the future this will become a SyntaxError.
This behaviour will happen even if it is a valid escape sequence for a
regular expression.
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
