Don't raw strings treat backslashes as a literal character? [duplicate] - python

I have a question about backslashes when using the re module in Python. Consider the code:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
match = num_reg.search(message)
print(match.group())
In the code above, a raw string is passed to re.compile, but the backslash is still not treated as a literal character: \d remains a placeholder for a digit. Why use the raw string then?

The documentation for re and raw strings answers this question well.
A raw string only changes how Python parses the string literal; it does not change how re interprets the resulting string. So in your example the argument passed to re.compile() still contains the original \, and re then reads \d as "a digit". This is desirable when working with re because it has its own escape sequences, which may or may not conflict with Python's escape sequences. It is usually much more convenient to write r'foo' for a regex so you don't have to double-escape the regex special sequences.
Without the raw string, for the escape character to make it to re for processing you would need to use:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile('\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d')
match = num_reg.search(message)
print(match.group())
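As a quick check of the point above, this minimal sketch compares the raw literal with the doubled-backslash literal. The variable names raw and escaped are only illustrative.

```python
import re

# Both literals produce the same string: each \d is two characters,
# a backslash followed by the letter d.
raw = r'\d\d\d-\d\d\d-\d\d\d\d'
escaped = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
print(raw == escaped)  # True
print(len(r'\d'))      # 2

# re receives the backslashes intact and interprets \d as "a digit".
message = 'My phone number is 345-298-2372'
found = re.search(raw, message).group()
print(found)  # 345-298-2372
```

The raw prefix only affects how the literal is written in source code; the compiled pattern is identical either way.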
You may also want to look at regex quantifier (repetition) syntax, as it generally makes patterns more readable:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile(r'\d{3}-\d{3}-\d{4}')
match = num_reg.search(message)
print(match.group())

Related

How can I use a variable as regex in python? [duplicate]

I use re to find a word in a file and store it as lattice_type.
Now I want to use the word stored in lattice_type to build another regex.
I tried using the name of the variable this way:
pnt_grp=re.match(r'+ lattice_type + (.*?) .*',line, re.M|re.I)
Here I look for the regex lattice_type= and store the group(1) in lattice_type
latt=open(cell_file,"r")
for types in latt:
    line = types
    latt_type = re.match(r'lattice_type = (.*)', line, re.M|re.I)
    if latt_type:
        lattice_type=latt_type.group(1)
Here is where I want to use the variable containing the word to find it in another file, but I ran into problems:
pg=open(parameters,"r")
for lines in pg:
    line=lines
    pnt_grp=re.match(r'+ lattice_type + (.*?) .*',line, re.M|re.I)
    if pnt_grp:
        print(pnt_grp(1))
The r prefix is only needed when defining a string with a lot of backslashes, because both regex and Python string syntax attach meaning to backslashes. r'..' is just an alternative syntax that makes it easier to work with regex patterns. You don't have to use r'..' raw string literals. See The backslash plague in the Python regex howto for more information.
All that means that you certainly don't need the r prefix when you already have a string value. A regex pattern is just a string value, and you can use normal string formatting or concatenation techniques:
pnt_grp = re.match(lattice_type + '(.*?) .*', line, re.M|re.I)
I didn't use r in the string literal above, because there are no \ backslashes in the expression there to cause issues.
You may need to use the re.escape() function on your lattice_type value, if there is a possibility of that value containing regular expression meta-characters such as . or ? or [, etc. re.escape() escapes such metacharacters so that only literal text is matched:
pnt_grp = re.match(re.escape(lattice_type) + '(.*?) .*', line, re.M|re.I)
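To make the difference concrete, here is a minimal sketch with made-up data. The values of lattice_type and line are hypothetical stand-ins for what the two files would contain, and the single space between the name and the captured group is an assumption about the file layout.

```python
import re

# Hypothetical values; the real ones would come from the two files.
lattice_type = "fcc(111)"
line = "fcc(111) Oh extra columns"

# Escaped: the parentheses in lattice_type are matched literally.
pattern = re.escape(lattice_type) + r' (.*?) .*'
pnt_grp = re.match(pattern, line, re.I)
if pnt_grp:
    print(pnt_grp.group(1))  # Oh

# Unescaped: "(111)" becomes a capturing group, so the pattern
# expects the text "fcc111" and no longer matches this line.
unescaped = re.match(lattice_type + r' (.*?) .*', line, re.I)
print(unescaped)  # None
```

Note that the match object is queried with .group(1); calling the match object itself, as in pnt_grp(1), raises a TypeError.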

Do raw strings in python disable meta characters such as \w or \d just as they do with \n? [duplicate]

I am new to Python. Can someone tell me what is the difference between these two regex statements (re.findall(r"\d+","i am aged 35")) and (re.findall("\d+","i am aged 35")).
My understanding was that the raw string in the first statement would make "\d+" inactive, because that is the primary role of a raw string: to make escape characters inactive. In other words, "\d+" would not be a metacharacter for finding/searching/matching digits if a raw string is used. However, I now see that both statements return the same result.
Both the Python parser and the regular expression parser handle escape sequences. This means that any escape sequence that both parsers support must either be written with doubled backslashes, or you use a raw string literal so the Python parser doesn't try to interpret escape sequences.
In this case, \d has no meaning to Python, so the backslash is left in place for the re module to handle. So here specifically, the two snippets behave the same. (Recent Python versions do emit a warning for unrecognized escapes such as "\d" in a non-raw string, which is another reason to prefer the raw form.)
However, if you needed to match a literal backslash before other text like section in your regular expression, without raw strings you'd have to use '\\\\section' to define the pattern! That's because the Python parser would turn '\\section' into the string \section, with a single backslash, and the regular expression parser would then read \s as the start of an escape sequence (the whitespace class) rather than as a literal backslash.
See the section on backslashes and raw string literals in the Python regular expression HOWTO.
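The two layers of escaping can be checked directly. This is a small sketch; the sample texts and variable names are made up for illustration.

```python
import re

text = r"see \section for details"  # contains one literal backslash

# Four backslashes in a normal literal equal two in a raw literal;
# the regex parser then reads the pair as one literal backslash.
non_raw = '\\\\section'
raw = r'\\section'
print(non_raw == raw)  # True
literal = re.search(raw, text).group()
print(literal)  # \section

# With only two backslashes, Python hands re the string \section,
# which the regex parser reads as \s (whitespace) followed by "ection".
whitespace = re.search('\\section', "a ection").group()
print(repr(whitespace))  # ' ection'
```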

Searching for an exact match that contains brackets [duplicate]

I am reading in lines from a file, each of which is formatted like this:
array_name[0]
array_name[1]
How can I do an exact match on this string in python? I've tried this:
if re.match(line, "array_name[0]")
but it seems to match all the time without taking the parts in bracket ([0], [1], etc.) into account
re.escape from the re module is a useful tool for automatically escaping characters that the regex engine considers special. From the docs:
re.escape(pattern)
Escape all the characters in pattern except ASCII letters and numbers. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
In [1]: re.escape("array_name[0]")
Out[1]: 'array_name\\[0\\]'
Also, you've reversed the order of your arguments. You'll need your pattern to come first, followed by the text you want to match:
re.match(re.escape("array_name[0]"), line)
Example:
In [2]: re.match(re.escape("array_name[0]"), 'array_name[0] in a line')
Out[2]: <_sre.SRE_Match object; span=(0, 13), match='array_name[0]'>
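Since re.match only anchors at the start of the string, an exact match on the whole line can be sketched with re.fullmatch (available since Python 3.4). The sample lines below are made up, and using fullmatch here is a suggestion rather than part of the original answer.

```python
import re

# Hypothetical input lines; the last one has no brackets at all.
lines = ["array_name[0]", "array_name[1]", "array_name0"]
target = "array_name[0]"

# Escaped: the brackets are literal, so only the identical line matches.
pattern = re.compile(re.escape(target))
exact = [ln for ln in lines if pattern.fullmatch(ln)]
print(exact)  # ['array_name[0]']

# Unescaped: [0] is a character class matching the single character "0".
loose = [ln for ln in lines if re.fullmatch(target, ln)]
print(loose)  # ['array_name0']
```

If the lines are read from a file, they would typically need their trailing newline stripped before a full match can succeed.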

Regex - replace word having plus or brackets [duplicate]

In Python, I am trying to do
text = re.sub(r'\b%s\b' % word, "replace_text", text)
to replace a word with some text. I'm using re rather than text.replace so that, with \b, the replacement happens only if the whole word matches. The problem comes when word contains characters like +, ( or [, for example +91xxxxxxxx.
Regex treats this + as the one-or-more quantifier and fails with the error sre_constants.error: nothing to repeat. The same happens with (.
I couldn't find a fix for this after searching around a bit. Is there a way?
Just use re.escape(string):
word = re.escape(word)
text = re.sub(r'\b{}\b'.format(word), "replace_text", text)
It replaces all critical characters with a special meaning in regex patterns with their escape forms (e.g. \+ instead of +).
Just a sidenote: formatting with the percent (%) character is not deprecated, but the .format() method of strings (or an f-string) is generally preferred in newer code.
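A minimal sketch of the effect of re.escape in this substitution, using a made-up word that contains a metacharacter:

```python
import re

text = "pi is 3.14 but 3514 is not"
word = "3.14"  # hypothetical word; "." is a regex metacharacter

# Unescaped: "." matches any character, so "3514" is replaced as well.
loose = re.sub(r'\b{}\b'.format(word), "PI", text)
print(loose)  # pi is PI but PI is not

# Escaped: only the literal text "3.14" is replaced.
strict = re.sub(r'\b{}\b'.format(re.escape(word)), "PI", text)
print(strict)  # pi is PI but 3514 is not
```

One caveat: \b only matches between a word character and a non-word character, so for a word that starts with a symbol, such as +91xxxxxxxx preceded by a space, the leading \b will not match even after escaping. That case would need a different boundary check.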

Why is 'r' used in regular expression in Python? [duplicate]

p = re.compile(r'(\b\w+)\s+\1')
p.search('Paris in the the spring').group()
What is the meaning of r in the 1st line?
From the re documentation:
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
r designates a raw string literal in Python, in which backslashes are kept as-is instead of starting an escape sequence, so you don't have to double the backslashes that the regex needs.
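For this particular pattern the prefix matters, because \b and \1 are also valid Python escapes. The sketch below shows what happens without it; the name not_raw is only illustrative.

```python
import re

# A raw literal keeps the backslash; a normal literal converts \n.
print(len(r"\n"), len("\n"))  # 2 1

# Raw: re sees \b (word boundary) and \1 (backreference to group 1).
p = re.compile(r'(\b\w+)\s+\1')
doubled = p.search('Paris in the the spring').group()
print(doubled)  # the the

# Not raw: Python turns \b into a backspace character and \1 into
# chr(1) before re ever sees them, so the pattern no longer matches.
not_raw = '(\b\\w+)\\s+\1'
missed = re.search(not_raw, 'Paris in the the spring')
print(missed)  # None
```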
