Regex Match Whole Multiline Comment Containing Special Word - python

I've been trying to design this regex but for the life of me I could not get it to not match if */ was hit before the special word.
I'm trying to match a whole multi line comment only if it contains a special word. I tried negative lookaheads/behinds but I could not figure out how to do it properly.
This is what I have so far:
(?s)(/\*.+?special.+?\*/)
Am I close or horribly off base? I tried including (?!\*/) unsuccessfully.
https://regex101.com/r/mD1nJ2/3
Edit: I had some redundant parts to the regex I removed.

You were not totally off base:
/\* # match /*
(?:(?!\*/)[\s\S])+? # match anything lazily, do not overrun */
special # match special
[\s\S]+? # match anything lazily afterwards
\*/ # match the closing */
The technique is called a tempered greedy token; see a demo on regex101.com (mind the modifiers, e.g. x for verbose mode!).
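For what it's worth, here is a minimal Python sketch of that tempered greedy token in verbose mode. Only the pattern comes from the answer above; the sample text and the names COMMENT_WITH_SPECIAL and source are made up for illustration.

```python
import re

# The pattern from the answer, compiled with re.VERBOSE so the comments are ignored.
COMMENT_WITH_SPECIAL = re.compile(r"""
    /\*                      # opening /*
    (?:(?!\*/)[\s\S])+?      # any character, as long as it does not start a closing */
    special                  # the marker word
    [\s\S]+?                 # rest of the comment, lazily
    \*/                      # closing */
""", re.VERBOSE)

# Hypothetical input: the word also appears in code between two comments.
source = (
    "/* plain comment */\n"
    "int special = 1;\n"
    "/* this one is\n"
    "   special indeed */\n"
)

matches = COMMENT_WITH_SPECIAL.findall(source)
print(matches)
```

The original (?s)(/\*.+?special.+?\*/) would start at the first /* here and run across the first */ to reach the word; the tempered version refuses to cross it, so only the second comment is returned.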
You might want to try another approach though: analyze your document, grep the comments (using e.g. BeautifulSoup) and run string functions over them (if "special" in comment...).

Related

Do character classes count as groups in regular expressions?

A small project I got assigned is supposed to extract website URLs from given text. Here's what the most relevant portion of it looks like:
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does its job properly, but I noticed that it also includes the ',' and '.' characters in the URL strings it prints. So my first question is: how do I make it exclude any punctuation symbols at the end of the string it detects?
My second question refers to the title itself (finally), but doesn't really seem to affect this particular program I'm working on: do character classes (in this case [a-zA-Z0-9.%+-\/_]+) count as groups (group[3] in this case)?
Thanks in advance.
To exclude some symbols at the end of string you can use negative lookbehind. For example, to disallow . ,:
.*(?<![.,])$
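A small sketch of how that lookbehind behaves in Python, assuming the candidate URL is the whole string being tested. With the trailing $ the pattern rejects a string ending in . or , outright; without the $, the greedy .* backs off and the match simply stops before the trailing punctuation. The variable names are illustrative only.

```python
import re

text = "http://www.google.com/."

anchored = re.compile(r".*(?<![.,])$")    # pattern as given above
unanchored = re.compile(r".*(?<![.,])")   # same idea without the end anchor

rejected = anchored.match(text)             # None: the string ends in "."
trimmed = unanchored.match(text).group()    # greedy .* backtracks past the "."

print(rejected, trimmed)
```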
Answering in reverse:
No, character classes are just shorthand for bracketed text. They don't provide groups in the same way that surrounding with parentheses would. They only allow the regular expression engine to select the specified characters -- nothing more, nothing less.
With regards to finding comma and dot: actually, I see the problem here, though the below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+ and inside a character class the - character has a special meaning: it denotes everything between the two characters around it, by ASCII code. So [A-a] is a valid range. It includes A-Z, but also a bunch of other characters that aren't A-Z. If you want to include a literal - in the class, then it needs to be the last character: [a-zA-Z0-9.%+\\/_-]+ should work.
For comma, it isn't listed explicitly in your regex, but it falls inside the accidental +-\ range described above (',' sits between '+' and '\' in ASCII), which is why it gets matched. It shouldn't be allowed anywhere in the URL. In general though, you'll just want to add more groups/more conditions.
First, break apart the URL into the specific groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses alphanumeric characters and ends, specifically, with .com/ (note the \.; otherwise it would match any single character followed by com/).
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the URL to end with, say, a dot, then you could end the pattern with something like [A-Za-z0-9]. Note the lack of a dot here, plus its length: only a single character. This will change the rest of your regex, so you need to think about that.
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
web_url_regex = re.compile(
    r'(http://|https://)'       # Capture the scheme name
    r'([a-zA-Z0-9.%+\\/_-]+)'   # Everything else, apparently ("-" last, so it is literal)
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)
To answer the second question first: no, a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see footnote below re why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regexp, which is the same as the previous character class except that it does not include . (other punctuation is now already not included). So the matched pattern cannot end in .. The + (one or more) on the previous character class is now reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''',re.VERBOSE)
text = "... at http://www.google.com/. It says"  # renamed from str to avoid shadowing the builtin
m = re.search(webURLregex, text)
if m:
    print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the second - does not appear to be intended to define a character range is based on the fact that, if it were, such a range would run from + to \ (053-134 octal), which would also include the digits and the uppercase letters, making 0-9 and A-Z redundant.

Python Regular Expression: Multiline pattern match with more than two substrings

I want to use a regex to find merge conflicts in a file.
I've found previous posts that show how to find a pattern that matches this structure
FIRST SUBSTRING
/* several
new
lines
*/
SECOND SUBSTRING
which works with the following regex: (^FIRST SUBSTRING)(.+)((?:\n.+)+)(SECOND SUBSTRING)
However, I need to match this pattern:
FIRST SUBSTRING
/* several
new
lines
*/
SECOND SUBSTRING
/* several
new
lines
*/
THIRD SUBSTRING
Where first, second and third substrings are <<<<<<<, =======, >>>>>>> respectively.
I gave (^<<<<<<<)(.+)((?:\n.+)+)(=======)(.+)((?:\n.+)+)(>>>>>>) a shot but it does not work, which you can see on this demo ((^<<<<<<<)(.+)((?:\n.+)+)(=======) does work but it is not exactly what I am looking for)
Your expression does work with a couple of slight changes. The marker lengths do not exactly match (the closing marker in your pattern has six > characters instead of seven). And you are asking for at least one character after the SECOND SUBSTRING with (.+), when there are none in the text.
(<<<<<<<)(.+)((?:\n.+)+)(=======)(.*)((?:\n.+)+)(>>>>>>>)
From then onwards it makes groups as you expect (which the answer in the comments does not). You probably want to distinguish between your code and their code.
Plus, if you have to choose among working expressions, I would choose yours over the proposed options, for readability. Regexes are not friendly things to read, and using repetition counts (among other sophistications) makes them harder to read. This also goes for the ?: syntax: just query the specific groups you need; there is no need to avoid group creation there.
Setting the s flag (single line: dot matches newline) is needed to match text that spans several lines. You can then use .*? to select multi-line text, newlines included, up to the next pattern (the ? makes the quantifier lazy).
With this setting, the regex below matches what you need.
(<{7})(.*)(={7})(.*?)(>{7})(.*?\n)
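A rough Python sketch of the dot-matches-newline approach. One assumption on my side: every .* is made lazy here (the pattern above uses a greedy .* before the ======= marker), so that a file with several conflicts yields one match per conflict. The sample text and names are invented for the example.

```python
import re

# re.DOTALL is the "s" flag: "." also matches newlines.
CONFLICT = re.compile(r"(<{7})(.*?)(={7})(.*?)(>{7})(.*?\n)", re.DOTALL)

text = (
    "keep\n"
    "<<<<<<< HEAD\n"
    "ours\n"
    "=======\n"
    "theirs\n"
    ">>>>>>> feature\n"
    "tail\n"
)

m = CONFLICT.search(text)
ours = m.group(2)      # text between <<<<<<< and =======
theirs = m.group(4)    # text between ======= and >>>>>>>
label = m.group(6)     # rest of the >>>>>>> line

print(repr(ours), repr(theirs), repr(label))
print(len(CONFLICT.findall(text * 2)))  # two conflicts in a doubled file
```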

Vim searching: avoid matches within comments

Using Vim's searching capabilities, is it possible to avoid matching inside a comment block or line?
As an example, I would like to match on 'image' in this python code, but only in the code, not in the comment:
# Export the image
if export_images and i + j % 1000 == 0:
export_image(img_mask, "images/image{}.png".format(image_id))
image_id += 1
Using regular expressions I would do something like this: /^[^#].*(image).*/gm
But it does not seem to work in vim.
You can use
/^[^#].*\zsimage\ze
The \zs and \ze mark the start and end of the match, respectively.
Note that this will not match several occurrences of "image" on a line, just the last one.
Also, perhaps, a "negative lookahead" would be better than a negated character class at the beginning:
/^#\@!.*\zsimage\ze
  ^^^^
The #\@! is equivalent to (?!#) in Python.
And since look-behinds are non-fixed-width in Vim (like (?<=pattern) in Perl, but Vim allows non-fixed-width patterns), you can match all occurrences of the character sequence image with
/\(^#\@!.*\)\@<=image
And finally, to skip matching image on an indented comment line, you just need to match optional (zero or more) whitespace symbol(s) at the beginning of the line:
\(^\(\s*#\)\@!.*\)\@<=image
   ^^^^^^^^^^^
This \(\s*#\)\@! is equivalent to Python's (?!\s*#) (match if not followed by zero or more whitespace characters followed by a #).
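Since the answer maps the Vim construct onto Python's (?!\s*#), here is a small Python sketch of that lookahead on the code from the question. Python's re has no variable-width lookbehind, so this version matches from the start of each non-comment line up to the last image on it (much like the \zs variant above) rather than every occurrence. The extra indented comment line and the variable names are made up for the example.

```python
import re

code = '''# Export the image
if export_images and i + j % 1000 == 0:
    export_image(img_mask, "images/image{}.png".format(image_id))
    # indented image comment
    image_id += 1
'''

# ^ anchors at each line start (re.MULTILINE); the lookahead rejects comment lines.
NOT_IN_COMMENT = re.compile(r"^(?!\s*#).*image", re.MULTILINE)

hits = [m.group() for m in NOT_IN_COMMENT.finditer(code)]
for hit in hits:
    print(repr(hit))
```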
This mailing list post suggests using folds:
To search only in open folds (unfolded text):
:set fdo-=search
To fold # comments, adapting this Vi and Vim post (where an autocmd for Python files is given):
set foldmethod=expr foldexpr=getline(v:lnum)=~'^\s*#'
However, folding by default works only on multiple lines. You need to enable folding of a single line, for single-line comments to be excluded:
set fml=0
After folding everything (zM, since I did not have anything else to be folded), a search for /image does not match anything in the comments.
A more generic way to ignore matches inside end-of-line comments (which does not account for the more complicated case of avoiding literal string delimiters, although that could be achieved for a simple-ish case if you want) is:
/\v%(^.{-}\/\/.{-})@<!<%(this|that|self|other)>
Where:
/ is the ex command to execute the search (remove if using the regex as part of another command, or a vimscript expression).
\v forces the "very magic" for this regex.
\/\/ is the end-of-line comment token (escaped, so the / characters are not interpreted as the end of the regex by vim). The example above works for C/C++, JavaScript, Node.JS, etc.
(^.{-}\/\/.{-})@<! is the zero-width expression that means "there should not be any comments preceding the start of the expression following this" (see vim's documentation for \@<!). In our case, this expression is just trying to find the end-of-line comment token at any point in the line (hence the ^.{-}) and making sure that the zero-width match can end up in the first character of the positive expression that follows this one (achieved by the .{-} at the end of the parenthesised expression).
<%(this|that|self|other)> can be replaced by any regex. In this case, this shows that you can match an arbitrary expression at this point, without worrying about the comments (which are handled by the zero-width expression preceding this one).
As an example, consider the following piece of code:
some.this = 'hello'
some.this = 'hello' + this.someother;
some.this = 'hello' + this.someother; // this is a comment, and this is another
The above expression will match all the this words, except the ones inside the comment (or any other //-prefixed comments, for that matter).
(Note: links all pointing to the vim-7.0 reference documentation, which should (and does in my own testing) also work in the latest vim and nvim releases as of the time of writing)

Python Regex reading in c style comments

I'm trying to find C-style comments in a C file, but I'm having trouble if there happens to be // inside of quotations. This is the file:
/*My function
is great.*/
int j = 0//hello world
void foo(){
//tricky example
cout << "This // is // not a comment\n";
}
It will match with that cout. This is what I have so far (I can match the /**/ comments already):
fp = open(s)
p = re.compile(r'//(.+)')
txt = p.findall(fp.read())
print (txt)
The first step is to identify cases where // or /* must not be interpreted as the beginning of a comment, for example when they are inside a string (between quotes). To skip content between quotes (or other things), the trick is to put it in a capture group and to insert a backreference in the replacement pattern:
pattern:
(
"(?:[^"\\]|\\[\s\S])*"
|
'(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/
replacement:
\1
online demo
Since quoted parts are searched first, each time you find // or /*...*/, you can be sure that you are not inside a string.
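In Python, the pattern/replacement pair above might look like the sketch below. It relies on re.sub replacing an unmatched group with an empty string (the behaviour since Python 3.5); the sample code string and the name STRIP_COMMENTS are invented for the example.

```python
import re

STRIP_COMMENTS = re.compile(r'''
    (                                # group 1: quoted text we want to keep
        "(?:[^"\\]|\\[\s\S])*"
      |
        '(?:[^'\\]|\\[\s\S])*'
    )
  |
    //.*                             # single-line comment
  |
    /\*(?:[^*]|\*(?!/))*\*/          # multi-line comment
''', re.VERBOSE)

code = (
    'int j = 0;//hello world\n'
    'cout << "This // is // not a comment\\n";\n'
    '/* block */int k;\n'
)

# Strings are put back via \1; comments leave group 1 unmatched and vanish.
stripped = STRIP_COMMENTS.sub(r"\1", code)
print(stripped)
```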
Note that the pattern is deliberately inefficient (due to the (A|B)* subpatterns) to make it easier to understand. To make it more efficient you can rewrite it like this:
("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/
(?=(something+))\1 is only a way to emulate an atomic group (?>something+)
online demo
So, if you only want to find comments (and not remove them), the handiest way is to put the comment part of the pattern in a capture group and to test whether it is empty. The following pattern has been adapted (after Jonathan Leffler's comment) to handle the trigraph ??/, which is interpreted as a backslash character by the preprocessor (I assume that the code isn't written for the -trigraphs option), and to handle a backslash followed by a newline character, which allows a single line to be split over several lines:
fp = open(s)
p = re.compile(r'''(?x)
(?=["'/]) # trick to make it faster, a kind of anchor
(?:
"(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1" # double quotes string
|
'(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2' # single quotes string
|
(
/(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?/|\\)\n)*.* # single line comment
|
/(?:(?:\?\?/|\\)\n)*\* # multiline comment
(?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
\*(?:(?:\?\?/|\\)\n)*/
)
)
''')
for m in p.findall(fp.read()):
    if m[2]:  # third capture group: the comment, empty for quoted strings
        print(m[2])
These changes do not affect the pattern's efficiency, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the lookahead (?=["'/]) at the beginning of the pattern, which allows internal optimizations to quickly find the first character.
Another optimization is the use of emulated atomic groups, which reduces backtracking to the minimum and allows greedy quantifiers to be used inside repeated groups.
NB: luckily there is no heredoc syntax in C!
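Python's re.findall method basically works the same way as most lexers do: it successively returns the next match, starting where the previous match finished. (Unlike a typical lexer, it takes the first alternative that matches rather than the longest one, so the order of the alternatives matters.) All that is required is to produce a disjunction of all the lexical patterns: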
(<pattern 1>)|(<pattern 2>)|...|(<pattern n>)
Unlike most lexers, it doesn't require the matches to be contiguous, but that's not a significant difference since you can always just add (.) as the last pattern, in order to match all otherwise unmatched characters individually.
An important feature of re.findall is that if the regex has any groups, then only the groups will be returned. Consequently, you can exclude alternatives by simply leaving out the parentheses, or changing them to non-capturing parentheses:
(<pattern 1>)|(?:<unimportant pattern 2>)|(<pattern 3>)
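A toy illustration of that findall behaviour, with made-up one-letter patterns standing in for the real lexical patterns:

```python
import re

# "a" and "c" are captured; "b" is matched but sits in a non-capturing group.
toy = re.compile(r"(a)|(?:b)|(c)")

rows = toy.findall("abc")
print(rows)  # one tuple per match, one slot per capturing group
```

Each match contributes one tuple; the slot for a group that did not take part in the match is an empty string, which is why the results are usually filtered afterwards.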
With that in mind, let's take a look at how to tokenize C just enough to recognize comments. We need to deal with:
Single-line comments: // Comment
Multi-line comments: /* Comment */
Double-quoted string: "Might include escapes like \n"
Single-quoted character: '\t'
(See below for a few more irritating cases)
With that in mind, let's create regexen for each of the above.
1. Two slashes followed by anything other than a newline: //[^\n]*
2. This regex is tedious to explain: /[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/
   Note that it uses (?:...) to avoid capturing the repeated group.
3. A quote, any repetition of a character other than quote and backslash, or a backslash followed by any character whatsoever. That's not a precise definition of an escape sequence, but it's good enough to detect when a " terminates the string, which is all we care about: "(?:[^"\\]|\\.)*"
4. The same as (3) but with single quotes: '(?:[^'\\]|\\.)*'
Finally, the goal was to find the text of C-style comments. So we just need to avoid captures in any of the other groups. Hence:
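import re
p = re.compile('|'.join((r"(//[^\n]*)"                         # captured: single-line comment
                        ,r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/"  # multi-line comment
                        ,'"'+r"""(?:[^"\\]|\\.)*"""+'"'        # double-quoted string
                        ,r"'(?:[^'\\]|\\.)*'")))               # single-quoted character
def line_comments(text):
    # findall returns group 1 only; it is empty for every other token
    return [c[2:] for c in p.findall(text) if c]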
Above, I left out some obscure cases which are unlikely to arise:
In an #include <...> directive, the <...> is essentially a string. In theory, it could contain quotes or sequences which look like comments, but in practice you will never see:
#include </*This looks like a comment but it is a filename*/>
A line which ends with \ is continued on the next line; the \ and following newline character are simply removed from the input. This happens before any lexical scanning is performed, so the following is a perfectly legal comment (actually two comments):
/\
**************** Surprise! **************\
//////////////////////////////////////////
To make the above worse, the trigraph ??/ is the same as a \, and that replacement happens before the continuation handling.
/************************************//??/
**************** Surprise! ************??/
//////////////////////////////////////////
Outside of obfuscation contests, no-one actually uses trigraphs. But they're still in the standard. The easiest way to deal with both of these issues would be to prescan the string:
return [c[2:]
        for c in p.findall(text.replace('??/','\\').replace('\\\n',''))
        if c]
The only way to deal with the #include <...> issue, if you really cared about it, would be to add one more pattern, something like #include\s*<[^>\n]*>.
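Putting the pieces together, a self-contained sketch of this tokenizing approach with the prescan applied might look as follows. The names TOKENS, line_comments and sample are made up, the sample is loosely based on the file from the question plus one trigraph-continued comment, and the #include case is left out.

```python
import re

TOKENS = re.compile("|".join((
    r"(//[^\n]*)",                          # group 1: single-line comment
    r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/",   # multi-line comment, not captured
    r'"(?:[^"\\]|\\.)*"',                   # double-quoted string
    r"'(?:[^'\\]|\\.)*'",                   # single-quoted character
)))

def line_comments(text):
    # Prescan: turn the ??/ trigraph into a backslash, then splice continued lines.
    prescanned = text.replace("??/", "\\").replace("\\\n", "")
    # findall yields group 1, which is empty for every token except // comments.
    return [c[2:] for c in TOKENS.findall(prescanned) if c]

sample = (
    "/*My function\n"
    "is great.*/\n"
    "int j = 0//hello world\n"
    'cout << "This // is // not a comment\\n";\n'
    "/??/\n"
    "/ spliced comment\n"
)

found = line_comments(sample)
print(found)
```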

heavy regex - really time consuming

I have the following regex to detect start and end script tags in the html file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
Meaning, in short, it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>
It works, but it needs a really long time to detect <script>,
even minutes or hours for long strings.
The lite version works perfectly even for long strings:
<script[^<]*>[^<]*</script>
However, I also use the extended pattern for other tags like <a>, where < and > can also appear in attribute values.
python test:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I + re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
How can I fix it?
The inner part of the regex (after <script>) should be changed and simplified.
PS :) I anticipate answers about this being the wrong approach, i.e. using regex for HTML parsing.
I know many HTML/XML parsers very well, and what I can expect in often-broken HTML code, and regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and the approach is to use parsers and regex together.
BeautifulSoup is only 2k lines; it does not handle every HTML document and just extends the regexes from sgmllib.
The main reason is that I must know the exact position where every tag starts and stops, and every broken HTML document must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
@Cylian:
atomic grouping, as you know, is not available in python's re.
So non-greedy everything .*? until <\s*/\s*tag\s*> is a winner at this time.
I know that is not perfect in this case:
re.search('<\s*script.*?<\s*/\s*script\s*>','< script </script> shit </script>').group()
but I can handle the rejected tail in the next parsing pass.
It's pretty obvious that HTML parsing with regex is not a one-battle fight.
Use an HTML parser like beautifulsoup.
See the great answers for "Can I remove script tags with beautifulsoup?".
If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.
I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to get right because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate... A specialized parser is easier to use, performs better and is safer.
If you are still thinking "why can't I use regex", read this answer pointed by mayhewr's comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.
I don't know python, but I know regular expressions:
if you use the greedy/non-greedy operators you get a much simpler regex:
<script.*?>.*?</script>
This is assuming there are no nested scripts.
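In Python that lazy pattern could be tried like this, a sketch using the two test strings from the question. The DOTALL and IGNORECASE flags are carried over from the question's own code; SCRIPT and the other names are illustrative.

```python
import re

SCRIPT = re.compile(r"<script.*?>.*?</script>", re.I | re.DOTALL)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'

m = SCRIPT.search(easy)
print(m.group(), m.span())             # span() gives the exact start/end positions
print(SCRIPT.search(hard).group() == hard)
```

There are no nested quantifiers here, so the long input no longer triggers the runaway backtracking; the trade-off, as noted, is that a > inside an attribute value or a nested script would end the match early.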
The problem with the pattern is that it backtracks excessively. Using atomic groups, this issue can be solved. Change your pattern to this:
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
       ^^^                             ^^^
Explanation
<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
Match any character that is NOT a “<” «[^<]+?»
Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
Match any character that is NOT a “<” «[^<]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
