basic regex operations in Python - python

I'm following a tutorial about regular expressions. I'm getting an error when doing the following:
regex = r'(+|-)?\d*\.?\d*'
Apparently, Python doesn't like (+|-). What could be the problem?
Also, what could be the issue with not adding r ahead of the regex?

You need to escape + in regular expressions to get a literal +, because it usually means "one or more instances of something":
regex = r'(\+|-)?\d*\.?\d*'
And r makes it a "raw" string. Without the r, the regular expression escape sequences will be interpreted as string escape sequences, and they'll cause all sorts of problems. (\b being a backspace instead of a word boundary, and that kind of thing.)
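Both points are easy to check in the interpreter. This is only a small sketch, assuming Python 3; the variable names are made up for illustration.

```python
import re

# An unescaped '+' right after '(' has nothing to repeat, so compiling fails.
try:
    re.compile(r'(+|-)?\d*\.?\d*')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Escaping the '+' turns it into a literal plus sign.
number = re.compile(r'(\+|-)?\d*\.?\d*')
match = number.fullmatch('+3.14')

# Without the r prefix, '\b' is a single backspace character;
# with it, the regex engine receives a backslash followed by 'b'.
backspace = '\b'
word_boundary = r'\b'
```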

+ is a special character. You can use brackets to define a character class, which is better than using an "or" with the pipe character in this case:
regex = r'([+-])?\d*\.?\d*'
Otherwise, you just need to escape it in your original version:
regex = r'(\+|-)?\d*\.?\d*'
Using the r prefix is the preferred way of writing a regex string in Python because it marks a raw string, in which backslashes are not interpreted as string escapes. That reduces the amount of backslash escaping you have to do. It is a Python regex idiom you will see everywhere.
r'(\+|-)?\d*\.?\d*'
#'(\\+|-)?\\d*\\.?\\d*'
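To make the equivalence concrete, here is a short sketch (assuming Python 3; the names are illustrative) showing that the raw string and the doubled-backslash string are the same object value, and that the character class version captures the sign just as the alternation does.

```python
import re

raw = r'(\+|-)?\d*\.?\d*'
doubled = '(\\+|-)?\\d*\\.?\\d*'
same_string = (raw == doubled)

# Inside a character class, '+' needs no escaping.
sign_class = re.compile(r'([+-])?\d*\.?\d*')
signs = [sign_class.fullmatch(s).group(1) for s in ('+1.5', '-2', '7')]
```

For an input with no sign, the optional group simply does not participate and group(1) is None.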

Related

How does a regular expression in Python interpret the pattern '\\\\mac\\\\'?

I cannot figure out how a regular expression interprets the pattern \\\\mac\\\\. In Python it comes out as \\mac\\.
However, I wonder why the re module does not go on to interpret the pattern as \mac\, since \\mac\\ still has a double backslash both before and after the word mac.
Does this mean that re performs the escaping only once and does not escape a string that has already been escaped? Can someone help me?
Use raw string literals (prefixed with r) for denoting such monsters:
r'\\\\mac\\\\'
Then all your characters stay the way they are given.
>>> print r'\\\\mac\\\\'
\\\\mac\\\\
If you want to get a regexp matching such a monster, you will need to escape each special character:
>>> import re
>>> re.match(r'\\\\\\\\mac\\\\\\\\', r'\\\\mac\\\\')
<_sre.SRE_Match object at 0x7febff89d850>
Quoting and escaping often lead to hard-to-understand situations when more than one interpretation step takes place. In this case the regexp function match interprets the string it is given (\\\\\\\\mac\\\\\\\\). Since a backslash has a special meaning as the escape character in the language of regexps, a verbatim backslash must be escaped (again with a backslash). So each backslash is doubled. That's why you need eight literal backslashes to represent four verbatim backslashes. If you do not use the r prefix on the string literal, you have to double each backslash once more, because the string parser already interprets backslashes in string literals without the r prefix, i.e.:
r'\\\\\\\\mac\\\\\\\\' == '\\\\\\\\\\\\\\\\mac\\\\\\\\\\\\\\\\'
And that's why I call those "monsters".
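The two layers of interpretation can be checked by counting characters. A minimal sketch, assuming Python 3, with illustrative variable names:

```python
import re

subject = r'\\\\mac\\\\'            # 4 backslashes, "mac", 4 backslashes
pattern = r'\\\\\\\\mac\\\\\\\\'    # every verbatim backslash doubled for the regex engine

matched = re.match(pattern, subject) is not None

# re.escape performs the same doubling automatically.
escaped = re.escape(subject)
```

Here len(subject) is 11 and len(pattern) is 19, and re.escape(subject) produces exactly the hand-written pattern.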

Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)
For some reason, adding the Verbose flag, X, to compile breaks the pattern.
Also, I wanted to use unicode for whitespace_pattern recognition (supposedly, we'll get patterns that use non-unicode spaces and we need to explicitly check for that one unicode character as a valid space), but the pattern keeps breaking.
VERBOSE gives you the ability to write comments in your regex to document it.
In order to do so, it ignores spaces, since you need to use line breaks to write comments.
Replace all spaces in your regex by \s to specify they are spaces you want to match in your pattern, and not just some spaces to format your comments.
What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret special notations such as \n and lets you use backslashes without escaping them.
Always define regexes with the r prefix to indicate they are raw strings.
r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
When creating a regex to match unicode characters you do not want to use a Python unicode string. In your example the regular expression needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020".
As other answers have mentioned, you should also use the r prefix for time_sig_pattern; after those two changes your code should work fine.
For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ], as described in the documentation for re.VERBOSE.
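Putting the advice from these answers together, the pattern could look roughly like this. It is a sketch that assumes Python 3 and spells the flags out as re.UNICODE, re.MULTILINE and re.VERBOSE rather than the bare U, M and X used in the question; the sample input string is invented.

```python
import re

# In VERBOSE mode an unescaped space in the pattern is ignored.
broken = re.compile(r"time signature", re.VERBOSE)
broken_matches_spaced = broken.match("time signature") is not None
broken_matches_joined = broken.match("timesignature") is not None

whitespace_pattern = r"\s"
time_sig_pattern = r"""
    ^%(ws)s*
    time\ signature:      # the literal space is escaped so VERBOSE keeps it
    %(ws)s*(?P<top>\d+)
    %(ws)s*/%(ws)s*
    (?P<bottom>\d+)%(ws)s*$
""" % {"ws": whitespace_pattern}

time_sig = re.compile(time_sig_pattern, re.UNICODE | re.MULTILINE | re.VERBOSE)
match = time_sig.match("  time signature: 3 / 4 ")
top, bottom = match.group("top"), match.group("bottom")
```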

Need regular expression expert: round bracket within string literal

I'm searching for strings within strings using Regex. The pattern is a string literal that ends in (, e.g.
# pattern
" before the bracket ("
# string
this text is before the bracket (and this text is inside) and this text is after the bracket
I know the pattern will work if I escape the character with a backslash, i.e.:
# pattern
" before the bracket \\("
But the pattern strings are coming from another search and I cannot control which characters they will contain or where. Is there a way of escaping an entire string literal so that anything between markers is treated as a string? For example:
# pattern
\" before the ("
The only other option I have is to do a substitute adding escapes for every protected character.
re.escape is exactly what I need. I'm using regexp in Access VBA, which doesn't have that method. I only have replace, execute or test methods.
Is there a way to escape everything within a string in VBA?
Thanks
You didn't specify the language, but it looks like Python, so if you have a string in Python whose special regex characters you need to escape, use re.escape():
>>> import re
>>> re.escape("Wow. This (really) is *cool*")
'Wow\\.\\ This\\ \\(really\\)\\ is\\ \\*cool\\*'
Note that spaces are escaped, too (probably to ensure that they still work in a re.VERBOSE regex).
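Applied to the pattern from the question, it might look like the following sketch (assuming Python 3; the variable names are illustrative). The unescaped pattern does not even compile because of the dangling "(", while the escaped one is found as plain text.

```python
import re

pattern_text = " before the bracket ("
text = ("this text is before the bracket (and this text is inside) "
        "and this text is after the bracket")

# The raw pattern ends in an unclosed group, so compiling it fails.
try:
    re.compile(pattern_text)
    compiles_unescaped = True
except re.error:
    compiles_unescaped = False

# After re.escape every character is matched literally.
found = re.search(re.escape(pattern_text), text)
```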
Maybe write your own VBA escape function:
Function EscapeRegEx(text As String) As String
    Dim regEx As RegExp
    Set regEx = New RegExp
    regEx.Global = True
    regEx.Pattern = "(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})"
    EscapeRegEx = regEx.Replace(text, "\$1")
End Function
I'm pretty sure that with the limitations of the RegExp abilities in VBA/VBScript, you are going to have to replace the special characters in your pattern before using it. There doesn't seem to be anything built into it like there is in Python.
The following regex will capture everything from the beginning of the string to the first (. The first captured group $1 will contain the portion before (.
^([^(]+)\(
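In Python, for instance, that regex could be tried like this (a small sketch using the sample string from the question; a raw string avoids any extra escaping):

```python
import re

text = ("this text is before the bracket (and this text is inside) "
        "and this text is after the bracket")

# Group 1 captures everything up to, but not including, the first "(".
match = re.match(r'^([^(]+)\(', text)
before = match.group(1)
```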
Depending on your language, you might have to escape it as:
"^([^(]+)\\("

Use string as input to re.compile

I want to use a variable in a regex, like this:
variables = ['variableA','variableB']
for i in range(len(variables)):
    regex = r"'('+variables[i]+')[:|=|\(](-?\d+(?:\.\d+)?)(?:\))?'"
    pattern_variable = re.compile(regex)
    match = re.search(pattern_variable, line)
The problem is that Python adds an extra backslash character for each backslash character in my regex string (ipython), and makes my regex invalid:
In [76]: regex
Out[76]: "'('+variables[i]+')[:|=|\\(](-?\\d+(?:\\.\\d+)?)(?:\\))?'"
Any tips on how I can avoid this?
No, it only displays extra backslashes so that the string could be read in again and have the correct number of backslashes. Try
print regex
and you will see the difference.
There is no problem there. What you're seeing is the output of the repr() of the string. Since the repr is supposed to be more-or-less reversible back into the original object, it doubles up all backslashes, as well as escaping the type of quote used at the ends of the repr.
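The difference between the repr and the actual string can be checked by length. The sketch below also shows one possible way to build the pattern from a variable; it assumes Python 3, and the sample line, the use of re.escape, and the simplified character class [:=(] are my own assumptions rather than anything from the question.

```python
import re

# repr() doubles each backslash for display; the string itself is unchanged.
sample = r'\d+'
shown = repr(sample)

variables = ['variableA', 'variableB']
line = "variableA=3.5 variableB(-2)"   # hypothetical input

values = {}
for name in variables:
    # Concatenate outside the string literal so the variable is really inserted.
    regex = '(' + re.escape(name) + r')[:=(](-?\d+(?:\.\d+)?)\)?'
    match = re.search(regex, line)
    if match:
        values[match.group(1)] = match.group(2)
```

Note that in the question the concatenation sits inside the quotes of the raw string, so the text '+variables[i]+' ends up in the pattern literally.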

Regex From .NET to Python

I have a regular expression which works perfectly well (although I am sure it is weak) in .NET/C#:
((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))
I am trying to move it over to Python, but I seem to be running into a formatting issue (invalid expression exception).
It is a lame question/request, but I have been staring at this for a while, but nothing obvious is jumping out at me.
Note: I am simply trying
r = re.compile('((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))')
Thanks,
Scott
There are some syntax incompatibilities between .NET regexps and PCRE/Python regexps:
(?<name>...) is (?P<name>...)
(?...) does not exist, and as I don't know what it is used for in .NET I can't guess any equivalent. A Google codesearch does not give me any pointer to what it could be used for.
Besides, you should use Python raw strings (r"I am a raw string") instead of normal strings when expressing regexps: raw strings do not interpret escape sequences (like \n). But it is not the problem in your example as you're not using any known escape sequence which could be replaced (\s does not mean anything as an escape sequence, so it is not replaced).
Is "(?" there to prevent creation of a separate group? In Python's re, a non-capturing group is written "(?:" and a named group is written "(?P<name>". Try this:
r = re.compile(r'((^|\s))(?P<tag>\#(?P<tagname>(\w|\+)+))(?:($|\s|\.))')
Also, note the use of a raw string literal (the "r" character just before the quotes). Raw literals suppress '\' escaping, so that your '\' characters pass straight through to re (otherwise, you'd need '\\' for every '\').
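For what it's worth, here is a sketch of how a Python version with (?P<name>...) groups behaves. It assumes Python 3, and it assumes the trailing (?(...)) from the .NET pattern can be treated as a plain non-capturing group, which may not be what the original expression meant; the sample text is invented.

```python
import re

# Python's re rejects the .NET-style named group syntax.
try:
    re.compile(r'(?<tag>\#\w+)')
    dotnet_syntax_ok = True
except re.error:
    dotnet_syntax_ok = False

# Named groups use (?P<name>...); (?:...) groups without capturing.
tag_re = re.compile(r'((^|\s))(?P<tag>\#(?P<tagname>(\w|\+)+))(?:($|\s|\.))')
match = tag_re.search("see #c++ today")
tag, tagname = match.group('tag'), match.group('tagname')
```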
