Python Regex working different depending on the implementation?

Python Regex working different depending on the implementation? - python

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is it has to be smart so it won't take '//' sequence inside string as the beggining of the comment. I have following idea to do it:
Iterate through lines.
Find '//' sequence first, then find all strings surrounded with quotes ( ' or ") in line and then iterate through all string matches to check if the '//' sequence is inside or outside one of those strings. If it is outside of them it's obvious that it'll be a proper comment begining.
When testing code on following line (part of bigger js file of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I've encountered problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
print(s.group(0))
In python 3.2.3 (and 3.1.4) returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So i used RegexBuddy (with Python compatibility) and Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is they both return same, correct results other than my code, that is:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is what is the cause of those differences? What have I overlooked? I'm rather a beginer in both Python and regular expressions so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S I would like to know what I'm doing wrong, why there are differences in behaviour of my code and regex test applications, not find other ideas how to parse JavaScript code.

You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)

you cannot deal with matching quotes with regex ... in fact you cannot guarantee any matching pairs of anything(and nested pairs especially) ... you need a more sophisticated statemachine for that(LLVM, etc...)
source: lots of CS classes...
and also see : Matching pair tag with regex for a more detailed explanation
I know its not what you wanted to hear but its basically just the way it is ... and yes different implementations of regex can return different results for stuff that regex cant really do

Related

Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in Python, so I tried to find a nice solution. My first idea was to simply do a .replace(), but replace doesn't allow you to match only if the first is not a \. My second attempt was a regex, but I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches ever single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing/adding slashes in the second part of the regex.
Can anyone help with this regex?
EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!

The [^\] is a character class for anything not a \, that is why it is matching everything. You want a negative lookbehind assertion:
((?<!\)[#\$%\^&_\{\}~\\])
(?<!...) will match whatever follows it as long as ... is not in front of it. You can check this out at the python docs

The regex ([^\][#\$%\^&_\{\}~\\]) is matching anything that isn't found between the first [ and the last ], so it should be matching everything except for what you want it to.
Moving around the parenthesis should fix your original regex ([^\\])[#\$%\^&_\{\}~\\].
I would try using regex lookbehinds, which won't match the character preceding what you want to escape. I'm not a regex expert so perhaps there is a better pattern, but this should work (?<!\\)[#\$%\^&_\{\}~\\].

If you're looking to find special characters that aren't escaped, without eliminating special chars preceded by escaped backslashes (e.g. you do want to match the last backslash in abc\\\def), try this:
(?<!\\)(\\\\)*[#\$%\^&_\{\}~\\]
This will match any of your special characters preceded by an even number (this includes 0) of backslashes. It says the character can be preceded by any number of pairs of backslashes, with a negative lookbehind to say those backslashes can't be preceded by another backslash.
The match will include the backslashes, but if you stick another in front of all of them, it'll achieve the same effect of escaping the special char, anyway.

Correctly parsing string literals with python's re module

I'm trying to add some light markdown support for a javascript preprocessor which I'm writing in Python.
For the most part it's working, but sometimes the regex I'm using is acting a little odd, and I think it's got something to do with raw-strings and escape sequences.
The regex is: (?<!\\)\"[^\"]+\"
Yes, I am aware that it only matches strings beginning with a " character. However, this project is born out of curiosity more than anything, so I can live with it for now.
To break it down:
(?<\\)\" # The group should begin with a quotation mark that is not escaped
[^\"]+ # and match any number of at least one character that is not a quotation mark (this is the biggest problem, I know)
\" # and end at the first quotation mark it finds
That being said, I (obviously) start hitting problems with things like this:
"This is a string with an \"escaped quote\" inside it"
I'm not really sure how to say "Everything but a quotation mark, unless that mark is escaped". I tried:
([^\"]|\\\")+ # a group of anything but a quote or an escaped quote
, but that lead to very strange results.
I'm fully prepared to hear that I'm going about this all wrong. For the sake of simplicity, let's say that this regex will always start and end with double quotes (") to avoid adding another element in the mix. I really want to understand what I have so far.
Thanks for any assistance.
EDIT
As a test for the regex, I'm trying to find all string literals in the minified jQuery script with the following code (using the unutbu's pattern below):
STRLIT = r'''(?x) # verbose mode
(?<!\\) # not preceded by a backslash
" # a literal double-quote
.*? # non-greedy 1-or-more characters
(?<!\\) # not preceded by a backslash
" # a literal double-quote
'''
f = open("jquery.min.js","r")
jq = f.read()
f.close()
literals = re.findall(STRLIT,jq)
The answer below fixes almost all issues. The ones that do arise are within jquery's own regular expressions, which is a very edge case. The solution no longer misidentifies valid javascript as markdown links, which was really the goal.

I think I first saw this idea in... Jinja2's source code? Later transplanted it to Mako.
r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''
Which does the following:
(\"\"\"|\'\'\'|\"|\') matches a Python opening quote, because this happens to be taken from code for parsing Python. You probably don't need all those quote types.
((?<!\\)\\\1|.) matches: EITHER a matching quote that was escaped ONLY ONCE, OR any other character. So \\" will still be recognized as the end of the string.
*? non-greedily matches as many of those as possible.
And \1 is just the closing quote.
Alas, \\\" will still incorrectly be detected as the end of the string. (The template engines only use this to check if there is a string, not to extract it.) This is a problem very poorly suited for regular expressions; short of doing insane things in Perl, where you can embed real code inside a regex, I'm not sure it's possible even with PCRE. Though I'd love to be proven wrong. :) The killer is that (?<!...) has to be constant-length, but you want to check that there's any even number of backslashes before the closing quote.
If you want to get this correct, and not just mostly-correct, you might have to use a real parser. Have a look at parsley, pyparsing, or any of these tools.
edit: By the way, there's no need to check that the opening quote doesn't have a backslash before it. That's not valid syntax outside a string in JS (or Python).

Perhaps use two negative look behinds:
import re
text = r'''"This is a string with an \"escaped quote\" inside it". While ""===r?+r:wt.test(r)?st.parseJSON(r) :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data" '''
for match in (re.findall(r'''(?x) # verbose mode
(?<!\\) # not preceded by a backslash
" # a literal double-quote
.*? # 1-or-more characters
(?<!\\) # not preceded by a backslash
" # a literal double-quote
''', text)):
print(match)
yields
"This is a string with an \"escaped quote\" inside it"
""
"data"
The question mark in .+? makes the pattern non-greedy. The non-greediness causes the pattern to match when it encounters the first unescaped double quotation mark.

Using python, the correct regex matching double quoted string is:
pattern = r'"(\.|[^"])*"'
It describes strings starts and ends with ". For each character inside the two double quotes, it's either an escaped character OR any character expect ".
unutbu's ansever is wrong because for valid string "\\\\", cannot matched by that pattern.

heavy regex - really time consuming

I have the following regex to detect start and end script tags in the html file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>
it works but needs really long time to detect <script>,
even minutes or hours for long strings
The lite version works perfectly even for long string:
<script[^<]*>[^<]*</script>
however, the extended pattern I use as well for other tags like <a> where < and > are possible to appears also as values of attributes.
python test:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:^s]))*)</script>', re.I + re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
how can I fix it?
The inner part of regex (after <script>) should be changed and simplified.
PS :) Anticipate your answers about the wrong approach like using regex in html parsing,
I know very well many html/xml parsers, and what I can expect in often broken html code, and regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and approach is to use parsers and regex together
BeautifulSoup is only 2k lines, which not handling every html and just extends regex from sgmllib.
and the main reason is that I must know exact the position where every tag starts and stop. and every broken html must be handled.
BS is not perfect, sometimes happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
#Cylian:
atomic grouping as you know is not available in python's re.
so non-geedy everything .*? until <\s/\stag\s*>** is a winner at this time.
I know that is not perfect in that case:
re.search('<\sscript.?<\s*/\sscript\s>','< script </script> shit </script>').group()
but I can handle refused tail in the next parsing.
It's pretty obvious that html parsing with regex is not one battle figthing.

Use an HTML parser like beautifulsoup.
See the great answers for "Can I remove script tags with beautifulsoup?".
If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.
I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to implement because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate... An specialized parser is easier to use, performs better and is safer.
If you are still thinking "why can't I use regex", read this answer pointed by mayhewr's comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.

I don't know python, but I know regular expressions:
if you use the greedy/non-greedy operators you get a much simpler regex:
<script.*?>.*?</script>
This is assuming there are no nested scripts.

The problem in pattern is that it is backtracking. Using atomic groups this issue could be solved. Change your pattern to this**
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
^^^^^ ^^^^^
Explanation
<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
Match any character that is NOT a “<” «[^<]+?»
Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
Match any character that is NOT a “<” «[^<]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->

What does the "s!" operator in Perl do?

I have this Perl snippet from a script that I am translating into Python. I have no idea what the "s!" operator is doing; some sort of regex substitution. Unfortunately searching Google or Stackoverflow for operators like that doesn't yield many helpful results.
$var =~ s!<foo>.+?</foo>!!;
$var =~ s!;!/!g;
What is each line doing? I'd like to know in case I run into this operator again.
And, what would equivalent statements in Python be?

s!foo!bar! is the same as the more common s/foo/bar/, except that foo and bar can contain unescaped slashes without causing problems. What it does is, it replaces the first occurence of the regex foo with bar. The version with g replaces all occurences.

It's doing exactly the same as $var =~ s///. i.e. performing a search and replace within the $var variable.
In Perl you can define the delimiting character following the s. Why ? So, for example, if you're matching '/', you can specify another delimiting character ('!' in this case) and not have to escape or backtick the character you're matching. Otherwise you'd end up with (say)
s/;/\//g;
which is a little more confusing.
Perlre has more info on this.

Perl lets you choose the delimiter for many of its constructs. This makes it easier to see what is going on in expressions like
$str =~ s{/foo/bar/baz/}{/quux/};
As you can see though, not all delimiters have the same effects. Bracketing characters (<>, [], {}, and ()) use different characters for the beginning and ending. And ?, when used as a delimiter to a regex, causes the regexes to match only once between calls to the reset() operator.
You may find it helpful to read perldoc perlop (in particular the sections on m/PATTERN/msixpogc, ?PATTERN?, and s/PATTERN/REPLACEMENT/msixpogce).

s! is syntactic sugar for the 'proper' s/// operator. Basically, you can substitute whatever delimiter you want instead of the '/'s.
As to what each line is doing, the first line is matching occurances of the regex <foo>.+?</foo> and replacing the whole lot with nothing. The second is matching the regex ; and replacing it with /.
s/// is the substitute operator. It takes a regular expression and a substitution string.
s/regex/replace string/;
It supports most (all?) of the normal regular expression switches, which are used in the normal way (by appending them to the end of the operator).

s is the substitution operator. Usually it is in the form of s/foo/bar/, but you can replace // separator characters some other characters like !. Using other separator charaters may make working with things like paths a lot easier since you don't need to escape path separators.
See manual page for further info.
You can find similar functionality for python in re-module.

s is the substitution operator. Normally this uses '/' for the delimiter:
s/foo/bar/
, but this is not required: a number of other characters can be used as delimiters instead. In this case, '!' has been used as the delimiter, presumably to avoid the need to escape the '/' characters in the actual text to be substituted.
In your specific case, the first line removes text matching '.+?'; i.e. it removes 'foo' tags with or without content.
The second line replaces all ';' characters with '/' characters, globally (all occurences).
The python equivalent code uses the re module:
f=re.sub(searchregx,replacement_str,line)

And the python equivalent is to use the re module.

Regular Expressions in unicode strings

I have some unicode text that I want to clean up using regular expressions. For example I have cases where u'(2'. This exists because for formatting reasons the closing paren ends up in an adjacent html cell. My initial solution to this problem was to look ahead at the contents of the next cell and using a string function determine if it held the closing paren. I knew this was not a great solution but it worked. Now I want to fix it but I can't seem to make the regular expression work.
missingParen=re.compile(r"^\(\d[^\)]$")
My understanding of what I think I am doing:
^ at the beginning of the string I want to find
( an open paren, the paren has to be backslashed because it is a special character
\d I also want to find a single digit
[ I am creating a special character class
^ I don't want to find what follows
) which is a close paren
$ at the end of the string
And of course the plot thickens I made a silly assumption that because I placed a \d I would not find (33 but I am wrong so I added a {1} to my regular expression and that did not help, it matched (3333, so my problem is more complicated than I thought. I want the string to be only an open paren and a single digit. Is this the more clever approach
missingParen=re.compile(r"^\(\d$")
And note S Lott _I already tagged it beginner so you can't pick up any cheap points Not that I don't appreciate your insights I keep meaning to read your book, it probably has the answer

Okay sorry for using this a a stream of consciousness thinking stimulator but it appears that writing out my original question got me on the path. It seems to me that this is a solution for what I am trying to do:
missingParen=re.compile(r"^\(\d$")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Regex working different depending on the implementation? - python

Related

Match LaTeX reserved characters with regex

Correctly parsing string literals with python's re module

heavy regex - really time consuming

What does the "s!" operator in Perl do?

Regular Expressions in unicode strings

Categories

Resources