python regex escaping meta characters among delimiters [duplicate] - python

This question already has answers here:
Why can't Python parse this JSON data? [closed]
(3 answers)
Closed 4 years ago.
Python 2.4.4 (yeah, long story)
I want to parse this fragment (with re)
"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",
i.e., it (the comment) can contain characters (upper or lower), numbers, hash, parentheses, square brackets, single quotes, and commas, and it (this fragment) specifically ends with a dquote and a comma
i've gotten this far with the expression,
r'\"comment\":\"(?P<COMMENT>[a-zA-Z0-9\s]+)\",'
but, of course, it only matches when none of the meta characters are in the comment. the final \", works as the the termination criterion. I've tried all kinds of escape, double escape ...
could a kind 're geek' please enlighten ?
i want to access the "entire" comment as match.group["COMMENT"]
corrected the pattern to what I was actually using when asked. my bad cut-n-paste.
until marked with all the "DUPLICATES", I couldn't spell JSON. But, I DID specify I had to do this with re.
even with all the JSON responses and code frags, it wasn't introduced until 2.6, and I did specify I'm still using 2.4.4.
Thanks to those responding with the regex-based solutions. Now working for me :)

Use a non-greedy .*? to match anything before ",, assuming this as the end of comment:
import re
s = '''"comment":"#2 Surely, (this) can't be any [more] complicated a reg-ex?",'''
match = re.search(r'"comment":"(?P<comment>.*?)",', s)
print(match.group('comment'))
# #2 Surely, (this) can't be any [more] complicated a reg-ex?
You can name your matched string using (?P<group_name>…).

Related

Does Python regular expression library re have a maximum match [duplicate]

This question already has answers here:
Python extract pattern matches
(10 answers)
How do I return a string from a regex match in python? [duplicate]
(4 answers)
Closed 2 years ago.
I have scoured the web (and perhaps I am searching the wrong thing), but I have a very long regex pattern that I would like to match:
Ex:
import re
re_pattern_str = r"I want to match this \(this is an example\) regular expression to a giant string"
sample_paragraph = "I want to match this (this is an example) regular expression to a giant string. This is a huge paragraph with a bunch of stuff in it"
print(re.match(re_pattern_str, sample_paragraph))
The output of the above program is as follows:
run
<re.Match object; span=(0, 78), match='I want to match this (this is an example) regular>
As you can see, it gets cut off and doesn't capture the whole string.
Also, I noticed that using verbose mode with a lot of comments ((?x) in Python) captures less. Does this mean there is a limit to how much can be captured? I also noticed using different Python versions and different machines caused different amounts of a long regex string to be captured. I still can't pinpoint if this is an issue in the re library in Python, a Python 3 specific thing (I haven't compared this to Python 2), a machine issue, memory issue, or something else.
I have used Python 3.8.1 for the above example, and have used Python 3.7.2 for another example using verbose regexes and other examples (I can't share these examples since those are proprietary).
Any help on the mechanics of Python regex engine and why this happens (and if there is a maximum length that can be captured via regex, why?), this would be very helpful.
You think the repr of the match is the matched text. It isn't. The repr tries not to dump pages of text for large matches. If you want to see the complete matched text, index in to get it as a string:
print(re.match(re_pattern_str, sample_paragraph)[0])
#^^^ gets the matched text itself
You can see from the repr it's a much longer match (it spans index 0 to 78).

what does this python regex mean "([\w\/%]*)" [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I am reading the Shinken source code in shinken/misc/perfdata.py and i finally find a regex that i can not understand. like this:
metric_pattern = re.compile('^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
what confused me is that what does \/ mean in ([\w\/%]*)?
You're rightfully confused, because that regex must have been written by someone who doesn't know Python regexes well.
In some languages (e.g. JavaScript), regexes are delimited by slashes. That means that if you need an actual slash in your regex, you have to escape it. Since Python doesn't use slashes, there's no need to escape the slash (but it doesn't cause an error, either).
Much more worrisome is that the author failed to use a raw string. In many cases, that won't matter (because Python will treat "\d" as "\\d" which then correctly translates to the regex \d, but in other cases, it will cause problems. One example is "\b" which means "a backspace character" and not "a word boundary anchor" like the regex \b would.
Also, the author has escaped a lot of characters that didn't need escaping at all. The entire regex could be rewritten as
metric_pattern = re.compile(r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')
and even then, I'm surprised that it works at all. Looks very chaotic to me and is definitely not foolproof. For example, there appears to be a big potential for catastrophic backtracking meaning that users could freeze your server with malicious input.

Why does python allow escaping quotes in rawdata string? [duplicate]

This question already has answers here:
Why can't Python's raw string literals end with a single backslash?
(14 answers)
Closed 8 years ago.
I'm trying to understand why python has this unheard of behavior.
If I'm writing rawdata string, it is much more likely that I won't want escaping quotes.
This behavior forces us to write this weird code:
s = r'something' + '\\' instead of just 's = r'something\'
any ideas why python developers found this more sensible?
EDIT:
I'm not asking why it is so. I'm asking what makes this design decision, or if anyone finds any thing good in it.
The r prefix for a string literal doesn't disable escaping, it changes escaping so that the sequence \x (where x is any character) is "converted" to itself. So then, \' emits \' and your string is unterminated because there's no ' at the end of it.
The decision to disallow an unpaired ending backslash in a raw string is explained in this faq:
Raw strings were designed to ease creating input for processors (chiefly regular expression engines) that want to do their own backslash escape processing. Such processors consider an unmatched trailing backslash to be an error anyway, so raw strings disallow that. In return, they allow you to pass on the string quote character by escaping it with a backslash. These rules work well when r-strings are used for their intended purpose.

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is it has to be smart so it won't take '//' sequence inside string as the beggining of the comment. I have following idea to do it:
Iterate through lines.
Find '//' sequence first, then find all strings surrounded with quotes ( ' or ") in line and then iterate through all string matches to check if the '//' sequence is inside or outside one of those strings. If it is outside of them it's obvious that it'll be a proper comment begining.
When testing code on following line (part of bigger js file of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I've encountered problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
print(s.group(0))
In python 3.2.3 (and 3.1.4) returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So i used RegexBuddy (with Python compatibility) and Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is they both return same, correct results other than my code, that is:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is what is the cause of those differences? What have I overlooked? I'm rather a beginer in both Python and regular expressions so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S I would like to know what I'm doing wrong, why there are differences in behaviour of my code and regex test applications, not find other ideas how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
you cannot deal with matching quotes with regex ... in fact you cannot guarantee any matching pairs of anything(and nested pairs especially) ... you need a more sophisticated statemachine for that(LLVM, etc...)
source: lots of CS classes...
and also see : Matching pair tag with regex for a more detailed explanation
I know its not what you wanted to hear but its basically just the way it is ... and yes different implementations of regex can return different results for stuff that regex cant really do

Python raw literal string [duplicate]

This question already has answers here:
Why can't Python's raw string literals end with a single backslash?
(14 answers)
Closed 7 months ago.
str = r'c:\path\to\folder\' # my comment
IDE: Eclipse
Python2.6
When the last character in the string is a backslash, it seems like it will escape the last single quote and treat my comment as part of the string. But the raw string is supposed to ignore all escape characters, right? What could be wrong? Thanks.
Raw string literals don't treat backslashes as initiating escape sequences except when the immediately-following character is the quote-character that is delimiting the literal, in which case the backslash does escape it.
The design motivation is that raw string literals really exist only for the convenience of entering regular expression patterns – that is all, no other design objective exists for such literals. And RE patterns never need to end with a backslash, but they might need to include all kinds of quote characters, whence the rule.
Many people do try to use raw string literals to enable them to enter Windows paths the way they're used to (with backslashes) – but as you've noticed this use breaks down when you do need a path to end with a backslash. Usually, the simplest solution is to use forward slashes, which Microsoft's C runtime and all version of Python support as totally equivalent in paths:
s = 'c:/path/to/folder/'
(side note: don't shadow builtin names, like str, with your own identifiers – it's a horrible practice, without any upside, and unless you get into the habit of avoiding that horrible practice one day you'll find yourseld with a nasty-to-debug problem, when some part of your code tramples over a builtin name and another part needs to use the builtin name in its real meaning).
It's IMHO an inconsistency in Python, but it's described in the documentation. Go to the second last paragraph:
http://docs.python.org/reference/lexical_analysis.html#string-literals
r"\" is not a valid string literal
(even a raw string cannot end in an
odd number of backslashes)

Categories

Resources