I am reading the Shinken source code in shinken/misc/perfdata.py and came across a regex that I cannot understand:
metric_pattern = re.compile('^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
What confuses me is: what does \/ mean in ([\w\/%]*)?
You're rightfully confused, because that regex must have been written by someone who doesn't know Python regexes well.
In some languages (e.g. JavaScript), regexes are delimited by slashes. That means that if you need an actual slash in your regex, you have to escape it. Since Python doesn't use slashes, there's no need to escape the slash (but it doesn't cause an error, either).
Much more worrisome is that the author failed to use a raw string. In many cases, that won't matter (because Python will treat "\d" as "\\d", which then correctly translates to the regex \d), but in other cases, it will cause problems. One example is "\b", which means "a backspace character" and not "a word boundary anchor" as the regex \b would.
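A small sketch of that difference, using only the standard re module (the sample text is made up for illustration):

```python
import re

# In a normal string literal, "\b" is one character: ASCII backspace (0x08).
assert "\b" == "\x08"
# In a raw string literal, r"\b" is two characters: a backslash and the letter b.
assert len(r"\b") == 2

text = "a cat sat"  # illustrative input, not from the Shinken code

# Non-raw: the regex engine receives backspace characters, so nothing matches.
non_raw_match = re.search("\bcat\b", text)
# Raw: the regex engine receives \b and treats it as a word boundary.
raw_match = re.search(r"\bcat\b", text)

print(non_raw_match)       # None
print(raw_match.group(0))  # cat
```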
Also, the author has escaped a lot of characters that didn't need escaping at all. The entire regex could be rewritten as
metric_pattern = re.compile(r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')
and even then, I'm surprised that it works at all. It looks very chaotic to me and is definitely not foolproof. For example, there appears to be significant potential for catastrophic backtracking, meaning that users could freeze your server with malicious input.
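As a quick sanity check, the two patterns can be compared on a sample metric. This is only a sketch: the sample string is made up, and the original pattern is written as a raw string here so that its backslashes reach the regex engine unchanged.

```python
import re

# Original pattern from the question (as a raw string).
original = re.compile(r'^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
# Simplified pattern without the unnecessary escapes, including \/.
simplified = re.compile(r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')

sample = "rta=0.5ms;100;500;0"  # hypothetical metric: name=value+unit;warn;crit;min

groups_original = original.match(sample).groups()
groups_simplified = simplified.match(sample).groups()

print(groups_simplified)
# ('rta', '0.5', 'ms', '100', '500', '0', None)
```

Both patterns yield the same groups for this input, which is consistent with \/ and / meaning the same thing to Python's re module. One sample does not prove the patterns are equivalent for every input.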
Python 2.4.4 (yeah, long story)
I want to parse this fragment (with re)
"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",
i.e., it (the comment) can contain characters (upper or lower), numbers, hash, parentheses, square brackets, single quotes, and commas, and it (this fragment) specifically ends with a dquote and a comma
I've gotten this far with the expression:
r'\"comment\":\"(?P<COMMENT>[a-zA-Z0-9\s]+)\",'
but, of course, it only matches when none of the metacharacters are in the comment. The final \", works as the termination criterion. I've tried all kinds of escapes, double escapes ...
Could a kind 're geek' please enlighten me?
I want to access the "entire" comment as match.group("COMMENT").
corrected the pattern to what I was actually using when asked. my bad cut-n-paste.
until marked with all the "DUPLICATES", I couldn't spell JSON. But, I DID specify I had to do this with re.
even with all the JSON responses and code frags, it wasn't introduced until 2.6, and I did specify I'm still using 2.4.4.
Thanks to those responding with the regex-based solutions. Now working for me :)
Use a non-greedy .*? to match anything before ", (a double quote followed by a comma), assuming that sequence marks the end of the comment:
import re
s = '''"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",'''
match = re.search(r'"comment":"(?P<comment>.*?)",', s)
print(match.group('comment'))
# #2 Surely, (this) can't be any [more] complicated a reg-ex?
You can name your matched string using (?P<group_name>…).
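To see why the non-greedy quantifier matters, here is a sketch with a hypothetical fragment that contains a second field after the comment (the field names and values are invented for illustration):

```python
import re

s = '"comment":"first","author":"x",'  # hypothetical two-field fragment

# Greedy .* runs to the LAST ", in the string.
greedy = re.search(r'"comment":"(?P<comment>.*)",', s).group('comment')
# Non-greedy .*? stops at the FIRST ", it can find.
lazy = re.search(r'"comment":"(?P<comment>.*?)",', s).group('comment')

print(greedy)  # first","author":"x
print(lazy)    # first
```

Note that the non-greedy version would still cut the comment short if the comment text itself contained a double quote followed by a comma.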
I am just starting to learn the regular expression module in Python, and I am being asked to explain an interesting regular expression:
/^[a-z0-9_-]{3,16}$/
I can explain the part between the two forward slashes: it matches a username that consists of lowercase letters, digits, hyphens, and underscores and is between 3 and 16 characters long.
Now my question is: what do the two forward slashes mean? I searched the web, and it seems that most tutorials explain the backslash but not the forward slash. Please advise. Thanks.
The forward slashes are used as separators. They're only used in some flavors (Perl and JavaScript, for example), and can usually be changed to the delimiter of your choice. Changing the delimiter will change what (if anything) needs to be escaped.
See this sed statement with a regex I wrote earlier today for a different question:
sed -E 's/OldUtility.getList.([^)]*)\)([\)]*)/\1\2.getList()/g'
In this case:
s for substitute
/ the first slash
The regex. If the regex needed a / it would need to be escaped. If you have enough /s that need to be escaped, it's good to switch to a different delimiter, if possible.
/ the second slash
Then, there's the substitution: \1\2.getList()
/ the third and last slash
Lastly, there's the modifier: g for global.
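For comparison, a rough Python counterpart of that sed command could look like the sketch below. Python passes the pattern and the replacement as separate arguments, so there are no delimiters at all. The input line is made up for illustration.

```python
import re

source = "items = OldUtility.getList(foo);"  # hypothetical input line

# Pattern and replacement are the two halves of the sed expression.
# re.sub replaces every occurrence by default, like sed's g modifier.
result = re.sub(r'OldUtility.getList.([^)]*)\)([\)]*)', r'\1\2.getList()', source)

print(result)  # items = foo.getList();
```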
The slashes represent the start and end of your regex. This is typical of how Perl expresses its regexes:
/<my_regex_here>/
In Perl, the same delimiters also appear in other regex operators, such as substitution:
s/<my_regex>/<replaceWith>/
Regexes are built into Perl's syntax, so it's common to see people talk about regexes using Perl-like syntax.
Forward slash is just a divider which delimits the start and end of a regular expression. The reason that it's forward slash and not some other character is mainly convention.
For example, you can define a regular expression in vim like this, with a question mark instead of the conventional slash:
:s?[a-z0-9_-]??g
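In Python, the username pattern from the question is written without the slashes, because the pattern is passed as an ordinary string. A minimal sketch (the helper name is_valid_username is made up for illustration):

```python
import re

# Same pattern as /^[a-z0-9_-]{3,16}$/, minus the delimiters.
username_re = re.compile(r'^[a-z0-9_-]{3,16}$')

def is_valid_username(name):
    """Hypothetical helper: True if name matches the username pattern."""
    return username_re.match(name) is not None

print(is_valid_username("user_name-1"))  # True
print(is_valid_username("ab"))           # False (too short)
print(is_valid_username("Has Space"))    # False (uppercase letters and a space)
```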
I'm working on a file parser that needs to cut out comments from JavaScript code. It has to be smart enough not to treat a '//' sequence inside a string as the beginning of a comment. I have the following idea for doing it:
Iterate through lines.
Find the '//' sequence first, then find all strings surrounded by quotes (' or ") in the line, and then iterate through all string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside all of them, it must be the beginning of a proper comment.
When testing the code on the following line (part of a bigger JS file, of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I encountered a problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
    print(s.group(0))
In Python 3.2.3 (and 3.1.4), this returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong, because \" should not exit the string. I've been debugging my regex for quite a long time, and it SHOULDN'T exit here. So I used RegexBuddy (with Python compatibility) and the Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is that they both return the same, correct results, which differ from my code's output:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is: what is the cause of those differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...
P.S. I know that checking whether the '//' sequence is inside string quotes can be accomplished with one bigger regex. I've already tried it and hit the same problem.
P.P.S. I would like to know what I'm doing wrong and why my code behaves differently from the regex test applications, not to find other ideas for how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
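A self-contained sketch of the comparison, assuming the JavaScript line from the question is available as a Python string (a raw string is used for it here so that its backslashes survive). The pattern layout is condensed from the question, but the alternatives are the same.

```python
import re

# The JavaScript line from the question; the raw string keeps its backslashes.
line = r'''document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";'''

# Non-raw literal: Python collapses each \\ to \ before re sees the pattern,
# so the regex becomes \. (a literal dot) and [^\"] (anything but a quote).
broken = re.compile('''
    " (?: \\. | [^\\"] )* "
  | ' (?: [^\\'] | \\. )* '
''', re.VERBOSE)

# Raw literal: the regex keeps \\. (a backslash plus any character) and
# [^\\"] (anything but a backslash or a quote).
fixed = re.compile(r'''
    " (?: \\. | [^\\"] )* "
  | ' (?: [^\\'] | \\. )* '
''', re.VERBOSE)

broken_strings = [m.group(0) for m in broken.finditer(line)]
fixed_strings = [m.group(0) for m in fixed.finditer(line)]

print(len(broken_strings))  # 7 fragments, as in the question's first listing
for s in fixed_strings:     # the 4 strings RegexBuddy reported
    print(s)
```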
You cannot deal with matching quotes with a regex ... in fact, you cannot guarantee any matching pairs of anything (and nested pairs especially) ... you need a more sophisticated state machine for that (LLVM, etc.).
Source: lots of CS classes...
Also see: Matching pair tag with regex, for a more detailed explanation.
I know it's not what you wanted to hear, but it's basically just the way it is ... and yes, different implementations of regex can return different results for things that regex can't really do.
I have a regular expression which works perfectly well (although I am sure it is weak) in .NET/C#:
((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))
I am trying to move it over to Python, but I seem to be running into a formatting issue (invalid expression exception).
It is a lame question/request, but I have been staring at this for a while and nothing obvious is jumping out at me.
Note: I am simply trying
r = re.compile('((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))')
Thanks,
Scott
There are some syntax incompatibilities between .NET regexps and PCRE/Python regexps:
(?<name>...) is written (?P<name>...)
(?...) does not exist, and as I don't know what it is used for in .NET, I can't guess an equivalent. A Google code search does not give me any pointer to what it could be used for.
Besides, you should use Python raw strings (r"I am a raw string") instead of normal strings when writing regexps: raw strings do not interpret escape sequences (like \n). But that is not the problem in your example, as you're not using any known escape sequence that would be replaced (\s does not mean anything as a string escape sequence, so it is not replaced).
Is "(?" there to prevent creation of a separate group? In Python's re, a non-capturing group is written "(?:", and a named group is written "(?P<name>". Try this:
r = re.compile(r'((^|\s))(?P<tag>\#(?P<tagname>(\w|\+)+))(?:($|\s|\.))')
Also, note the use of a raw string literal (the "r" character just before the quotes). Raw literals suppress '\' escaping, so that your '\' characters pass straight through to re (otherwise, you'd need '\\' for every '\').
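Putting the points above together, a Python version might look like the sketch below. It rests on a loud assumption: the trailing .NET group is treated here as a lookahead meaning "followed by end of string, whitespace, or a dot", which may not be exactly what the original .NET expression does. The sample text is made up.

```python
import re

# (?<name>...) from .NET becomes (?P<name>...) in Python's re module.
# ASSUMPTION: the final .NET group is replaced by a lookahead, (?=...),
# so the character after the tag is checked but not consumed.
tag_re = re.compile(r'(^|\s)(?P<tag>#(?P<tagname>(\w|\+)+))(?=$|\s|\.)')

text = "hello #python and #c++."  # hypothetical sample text

tagnames = [m.group('tagname') for m in tag_re.finditer(text)]
print(tagnames)  # ['python', 'c++']
```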
I have some unicode text that I want to clean up using regular expressions. For example, I have cases where a cell contains u'(2'. This happens because, for formatting reasons, the closing paren ends up in an adjacent HTML cell. My initial solution to this problem was to look ahead at the contents of the next cell and use a string function to determine whether it held the closing paren. I knew this was not a great solution, but it worked. Now I want to fix it, but I can't seem to make the regular expression work.
missingParen=re.compile(r"^\(\d[^\)]$")
My understanding of what I think I am doing:
^ at the beginning of the string I want to find
\( an open paren; the paren has to be backslashed because it is a special character
\d I also want to find a single digit
[ I am creating a special character class
^ I don't want to find what follows
) which is a close paren
$ at the end of the string
And of course the plot thickens. I made a silly assumption that because I used \d I would not match (33, but I was wrong. So I added {1} to my regular expression, and that did not help: it matched (3333. So my problem is more complicated than I thought. I want the string to be only an open paren and a single digit. Is this the more clever approach?
missingParen=re.compile(r"^\(\d$")
And note to S. Lott: I already tagged it beginner, so you can't pick up any cheap points. Not that I don't appreciate your insights. I keep meaning to read your book; it probably has the answer.
Okay, sorry for using this as a stream-of-consciousness thinking stimulator, but it appears that writing out my original question got me on the right path. It seems to me that this is a solution for what I am trying to do:
missingParen=re.compile(r"^\(\d$")
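A quick check of both patterns against a few made-up cell contents shows the difference. The key point is that [^\)] does not mean "no close paren follows"; it consumes exactly one character that is not a close paren, so the first pattern needs a third character.

```python
import re

first_try = re.compile(r"^\(\d[^\)]$")  # paren, digit, one char that is not ')'
second_try = re.compile(r"^\(\d$")      # paren, digit, end of string

candidates = ["(2", "(33", "(2)"]  # hypothetical cell contents

results = {}
for text in candidates:
    results[text] = (bool(first_try.match(text)), bool(second_try.match(text)))
    print(text, results[text])

# (2  (False, True)
# (33 (True, False)
# (2) (False, False)
```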