Yesterday, I needed to add a file path to a regular expression creating a pattern like this:
"some_pattern/%s/more_pattern" % some_path
In the beginning the regular expression did not match, because some_path contained
several regex specific symbols like ? or .. As a quick fix I replaced them with [?]{1} and . with \..
However, I asked myself if there isn't a more reliable or better way to clean a string from regex specific symbols.
Is such functionality supported in the Python Standard library?
If not, do you know a regular expression to identify all regex symbols and clean them through a substitute?
Just apply re.escape and you'll be fine.
re.escape(string)
Return string with all non-alphanumerics
backslashed; this is useful if you want to match an arbitrary literal
string that may have regular expression metacharacters in it.
Related
I trying to match the following using regular expression but I struggling in matching the round bracket.
[??(Z)Z-axis Down Position Stroke]
Can anyone kindly advise ?
My Current expression as shown.
[[][[a-zA-Z0-9_?. - ]{1,30}[]]
You can add a backslash \ before any character to escape it. Try this regex:
\[\?\?\(Z\)Z-axis Down Position Stroke\]
When writing regex, I find regex101.com to be really helpful. It's a free website that evaluates your regex and lets you specify test cases etc, then breaks those down and tells you about the various matching conditions in them. Worth a look if you're learning regular expressions.
Edit: Also, it's necessary to escape the brackets, parentheses, and question marks because those all have special meaning in regex.
I have simple, but tricky question about regex (using in python), which i have did not find answer for anywhere here on google. Is there any "trick" how to make two capture groups in optional order? Let's say we have following:
.*abc.*
What i want is to match also this:
.*acb.*
I know i could use
.*abc|acb.*
but the problem is, that if we have something more complicated then abc, code is very long. Is not there any workaround to say e.g. "match last two capturing groups (or symbols, etc.) in any order?
I don't really get what is this in-any-order thing that would make the regex shorter. On the other hand, I can show you how to make this readable, even if you have tons of options.
import re
pattern = """
.* # match from starting the line
(?: # A non-capturing group starts so we can list lots of alternatives
abc| # alternative 1
acb # alternative 2
) # end of alternatives
.* # then match everything up to the end of the line
"""
re.search(pattern, 'qqabcqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqacbqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqaSDqq', re.VERBOSE) # does not return a match
So what did we just see here?
The """ ... """ construct is a convenient way to define multiline strings in python.
Then the re.VERBOSE skips the whitespaces and comments. As the manual says:
Whitespace within the pattern is ignored, except when in a character
class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
This two things let you add structure and comments to your regex. Here is another great example.
With standard regular expressions you can define patterns without order. Example:
[cdgjow]
Of course this example refers to characters.
Alternative sequences must be specified using "|". Example:
abc|cba
There is no way to express what you would like to express in classic regular expression syntax. Regular expression syntax has no syntactic elements to express what you would like to express. It's lacking this feature. You have to rely on "manually" specifying your alternatives. It's not a limit of the automaton constructed from regular expressions but of the regular expression syntax itself.
That means: You will have to construct the regular expression you require by yourself with all variants possible. There are two ways how to do this:
Do it manually. Take your time, be careful, built the correct regex manually.
Do it programmatically. Write some code that generates the regex you require.
If you do it manually consider #TamasRev answer. (Thanks #TamasRev! Nice answer!) But if I were you I'd build the regex programmatically. (For things like that programming has been invented for anyway :-) )
I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
Since the \s* before is optional, you can remove it, like capture groups.
If you want to ensure that the match isn't preceded by a digit, you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)
I'm searching for strings within strings using Regex. The pattern is a string literal that ends in (, e.g.
# pattern
" before the bracket ("
# string
this text is before the bracket (and this text is inside) and this text is after the bracket
I know the pattern will work if I escape the character with a backslash, i.e.:
# pattern
" before the bracket \\("
But the pattern strings are coming from another search and I can not control what characters will be or where. Is there a way of escaping an entire string literal so that anything between markers is treated as a string? For example:
# pattern
\" before the ("
The only other option I have is to do a substitute adding escapes for every protected character.
re.escape is exactly what I need. I'm using regexp in Access VBA which doens't have that method. I only have replace, execute or test methods.
Is there a way to escape everything within a string in VBA?
Thanks
You didn't specify the language, but it looks like Python, so if you have a string in Python whose special regex characters you need to escape, use re.escape():
>>> import re
>>> re.escape("Wow. This (really) is *cool*")
'Wow\\.\\ This\\ \\(really\\)\\ is\\ \\*cool\\*'
Note that spaces are escaped, too (probably to ensure that they still work in a re.VERBOSE regex).
Maybe write your own VBA escape function:
Function EscapeRegEx(text As String) As String
Dim regEx As RegExp
Set regEx = New RegExp
regEx.Global = True
regEx.Pattern = "(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})"
EscapeRegEx = regEx.Replace(text, "\$1")
End Function
I'm pretty sure that with the limitations of the RegExp abilities in VBA/VBScript, you are going to have to replace the special characters in your pattern before using it. There doesn't seem to be anything built into it like there is in Python.
The following regex will capture everything from the beginning of the string to the first (. The first captured group $1 will contain the portion before (.
^([^(]+)\(
Depending on your language, you might have to escape it as:
"^([^(]+)\\("
I am battling regular expressions now as I type.
I would like to determine a pattern for the following example file: b410cv11_test.ext. I want to be able to do a search for files that match the pattern of the example file aforementioned. Where do I start (so lost and confused) and what is the best way of arriving at a solution that best matches the file pattern? Thanks in advance.
Further clarification of question:
I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by .'ext'
Now that you have a human readable description of your file name, it's quite straight forward to translate it into a regular expression (at least in this case ;)
must start with
The caret (^) anchors a regular expression to the beginning of what you want to match, so your re has to start with this symbol.
'b',
Any non-special character in your re will match literally, so you just use "b" for this part: ^b.
followed by [...] digits,
This depends a bit on which flavor of re you use:
The most general way of expressing this is to use brackets ([]). Those mean "match any one of the characters listed within. [ASDF] for example would match either A or S or D or F, [0-9] would match anything between 0 and 9.
Your re library probably has a shortcut for "any digit". In sed and awk you could use [[:digit:]] [sic!], in python and many other languages you can use \d.
So now your re reads ^b\d.
followed by three [...]
The most simple way to express this would be to just repeat the atom three times like this: \d\d\d.
Again your language might provide a shortcut: braces ({}). Sometimes you would have to escape them with a backslash (if you are using sed or awk, read about "extended regular expressions"). They also give you a way to say "at least x, but no more than y occurances of the previous atom": {x,y}.
Now you have: ^b\d{3}
followed by 'cv',
Literal matching again, now we have ^b\d{3}cv
followed by two digits,
We already covered this: ^b\d{3}cv\d{2}.
then an underscore, followed by 'release', followed by .'ext'
Again, this should all match literally, but the dot (.) is a special character. This means you have to escape it with a backslash: ^\d{3}cv\d{2}_release\.ext
Leaving out the backslash would mean that a filename like "b410cv11_test_ext" would also match, which may or may not be a problem for you.
Finally, if you want to guarantee that there is nothing else following ".ext", anchor the re to the end of the thing to match, use the dollar sign ($).
Thus the complete regular expression for your specific problem would be:
^b\d{3}cv\d{2}_release\.ext$
Easy.
Whatever language or library you use, there has to be a reference somewhere in the documentation that will show you what the exact syntax in your case should be. Once you have learned to break down the problem into a suitable description, understanding the more advanced constructs will come to you step by step.
To avoid confusion, read the following, in order.
First, you have the glob module, which handles file name regular expressions just like the Windows and unix shells.
Second, you have the fnmatch module, which just does pattern matching using the unix shell rules.
Third, you have the re module, which is the complete set of regular expressions.
Then ask another, more specific question.
I would like the pattern to be as
follows: must start with 'b', followed
by three digits, followed by 'cv',
followed by two digits, then an
underscore, followed by 'release',
followed by .'ext'
^b\d{3}cv\d{2}_release\.ext$
Your question is a bit unclear. You say you want a regular expression, but could it be that you want a glob-style pattern you can use with commands like ls? glob expressions and regular expressions are similar in concept but different in practice (regular expressions are considerably more powerful, glob style patterns are easier for the most common cases when looking for files.
Also, what do you consider to be the pattern? Certainly, * (glob) or .* (regex) will match the pattern. Also, _test.ext (glob) or ._test.ext (regexp) pattern would match, as would many other variations.
Can you be more specific about the pattern? For example, you might describe it as "b, followed by digits, followed by cv, followed by digits ..."
Once you can precisely explain the pattern in your native language (and that must be your first step), it's usually a fairly straight-forward task to translate that into a glob or regular expression pattern.
if the letters are unimportant, you could try \w\d\d\d\w\w\d\d_test.ext which would match the letter/number pattern, or b\d\d\dcv\d\d_test.ext or some mix of the two.
When working with regexes I find the Mochikit regex example to be a great help.
/^b\d\d\dcv\d\d_test\.ext$/
Then use the python re (regex) module to do the match. This is of course assuming regex is really what you need and not glob as the others mentioned.