This question already has answers here:
Raw string and regular expression in Python
(4 answers)
Closed 2 years ago.
I have a question about the backslashes when using the re module in python. Consider the code:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
match = num_reg.search(message)
print(match.group())
In the code above, a raw string is passed into the re.compile method, but the backslash is still not treated as a literal character, as /d remain a placeholder for a digit. Why the raw string then?
The documentation for re and raw strings answers this question well.
So in your example the parameter passed to re.compile() ends up containing the original \. This is desirable when working with re because it has its own escape sequences that may or may not conflict with python's escape sequences. Typically it's much more convenient to use r'foo' when working with regex so you don't have to double escape your regex special characters.
Without the raw string, for the escape character to make it to re for processing you would need to use:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile('\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d')
match = num_reg.search(message)
print(match.group())
You may consider looking at regex quantifier/repetition syntax as it generally makes re more readable:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile(r'\d{3}-\d{3}-\d{4}')
match = num_reg.search(message)
print(match.group())
This question already has answers here:
Escaping regex string
(4 answers)
Closed 6 years ago.
In Python, I am trying to do
text = re.sub(r'\b%s\b' % word, "replace_text", text)
to replace a word with some text. Using re rather than just doing text.replace to replace only if the whole word matches using \b. Problem comes when there are characters like +, (, [ etc in word. For example +91xxxxxxxx.
Regex treats this + as wildcard for one or more and breaks with error. sre_constants.error: nothing to repeat. Same is in the case of ( too.
Could find a fix for this after searching around a bit. Is there a way?
Just use re.escape(string):
word = re.escape(word)
text = re.sub(r'\b{}\b'.format(word), "replace_text", text)
It replaces all critical characters with a special meaning in regex patterns with their escape forms (e.g. \+ instead of +).
Just a sidenote: formatting with the percent (%) character is deprecated and was replaced by the .format() method of strings.
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 1 year ago.
Why does one need to add the DOTALL flag for the python regular expression to match characters including the new line character in a raw string. I ask because a raw string is supposed to ignore the escape of special characters such as the new line character. From the docs:
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
This is my situation:
string = '\nSubject sentence is: Appropriate support for families of children diagnosed with hearing impairment\nCausal Verb is : may have\npredicate sentence is: a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss'
re.search(r"Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string ,re.DOTALL)
results in a match , However , when I remove the DOTALL flag, I get no match.
In regex . means any character except \n
So if you have newlines in your string, then .* will not pass that newline(\n).
But in Python, if you use the re.DOTALL flag(also known as re.S) then it includes the \n(newline) with that dot .
Your source string is not raw, only your pattern string.
maybe try
string = r'\n...\n'
re.search("Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string)
This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 5 years ago.
I'm trying to parse lines of input that look like
8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A
where you have different fields of data separated by ^A. I'm trying to get at the individual data fields (like 8=FIX.4.2, 9=0126, 35=0, etc). The problem is that python sometimes interprets ^A as a single character (in vim this is ctrl-v, ctrl-a) and sometimes as the string '^A' with two characters. So I have tried doing
entries = re.split('^A|^A', str(line))
but later when i do
for entry in entries:
print entries
I just end up with the original string, with nothing split. Is this a problem with re.split?
Depends on what that line contains.
If you want to split on the 2-character string '^A', escape the special-to-regexps character ^, in this case probably meaning '\^A'.
It's more likely that this is instead the caret notation way of printing the single character with byte value 0x01, in which case you probably want to split on '\x01' instead.
(You might as well use string's own split() function, I'm guessing it's faster than using regexps for something this simple)
^ has a special meaning in regular expressions, so you should escape it first.
>>> strs = "8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A"
>>> re.split('\^A',strs)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
From docs:
'^' : (Caret.) Matches the start of the string, and in MULTILINE mode also
matches immediately after each newline.
^ is a metacharacter, it matches only at the start of a string. Escape it:
>>> re.split('\^A', line)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
There is no need to use a | in your expression, especially not when both 'alternate' strings are the same.
It appears however that you have the \x07 or \a control character, not the two-character ^A string. Just use .split() to split on that value, no need for a regular expression:
>>> line = line.replace('^A', '\a')
>>> line
'8=FIX.4.2\x079=0126\x0735=0\x0734=000742599\x0749=L3Q206N\x0750=2J6L\x0752=20130620-11:16:27.344\x07369=000733325\x0756=CME\x0757=G\x07142=US,IL\x071603=OMS2\x071604=0.1\x07'
>>> line.split('\a')
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
I'm searching for strings within strings using Regex. The pattern is a string literal that ends in (, e.g.
# pattern
" before the bracket ("
# string
this text is before the bracket (and this text is inside) and this text is after the bracket
I know the pattern will work if I escape the character with a backslash, i.e.:
# pattern
" before the bracket \\("
But the pattern strings are coming from another search and I can not control what characters will be or where. Is there a way of escaping an entire string literal so that anything between markers is treated as a string? For example:
# pattern
\" before the ("
The only other option I have is to do a substitute adding escapes for every protected character.
re.escape is exactly what I need. I'm using regexp in Access VBA which doens't have that method. I only have replace, execute or test methods.
Is there a way to escape everything within a string in VBA?
Thanks
You didn't specify the language, but it looks like Python, so if you have a string in Python whose special regex characters you need to escape, use re.escape():
>>> import re
>>> re.escape("Wow. This (really) is *cool*")
'Wow\\.\\ This\\ \\(really\\)\\ is\\ \\*cool\\*'
Note that spaces are escaped, too (probably to ensure that they still work in a re.VERBOSE regex).
Maybe write your own VBA escape function:
Function EscapeRegEx(text As String) As String
Dim regEx As RegExp
Set regEx = New RegExp
regEx.Global = True
regEx.Pattern = "(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})"
EscapeRegEx = regEx.Replace(text, "\$1")
End Function
I'm pretty sure that with the limitations of the RegExp abilities in VBA/VBScript, you are going to have to replace the special characters in your pattern before using it. There doesn't seem to be anything built into it like there is in Python.
The following regex will capture everything from the beginning of the string to the first (. The first captured group $1 will contain the portion before (.
^([^(]+)\(
Depending on your language, you might have to escape it as:
"^([^(]+)\\("