Python regex: reading in C-style comments

I'm trying to find C-style comments in a C file, but I'm having trouble when // happens to appear inside quotation marks. This is the file:
/*My function
is great.*/
int j = 0//hello world
void foo(){
//tricky example
cout << "This // is // not a comment\n";
}
It will match inside that cout line. This is what I have so far (I can already match the /* */ comments):
import re

fp = open(s)
p = re.compile(r'//(.+)')
txt = p.findall(fp.read())
print(txt)

The first step is to identify the cases where // or /* must not be interpreted as the beginning of a comment, for example when they appear inside a string (between quotes). To skip over content between quotes (or anything else that must be preserved), the trick is to put it in a capture group and to insert a backreference in the replacement pattern:
pattern:
(
"(?:[^"\\]|\\[\s\S])*"
|
'(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/
replacement:
\1
Since the quoted parts are searched first, each time you find // or /*...*/ you can be sure that you are not inside a string.
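A quick Python sketch of this idea (the variable names are mine; it requires Python 3.5+, where an unmatched group in the replacement is treated as an empty string):
import re

# "capture the strings, put them back with \1" trick described above
comment_re = re.compile(r'''
    ( "(?:[^"\\]|\\[\s\S])*"        # group 1: double-quoted string
    | '(?:[^'\\]|\\[\s\S])*'        #          or single-quoted string
    )
  | //.*                            # single-line comment
  | /\*(?:[^*]|\*(?!/))*\*/         # multi-line comment
''', re.VERBOSE)

line = 'cout << "This // is // not a comment\\n"; // real comment'
print(comment_re.sub(r'\1', line))
# -> cout << "This // is // not a comment\n";   (the real comment is gone)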
Note that the pattern is deliberately inefficient (due to the (A|B)* subpatterns) to make it easier to understand. To make it more efficient you can rewrite it like this:
("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/
(?=(something+))\1 is only a way to emulate an atomic group (?>something+)
So, if you only want to find comments (and not remove them), the handiest approach is to put the comment part of the pattern in a capture group and test whether it is empty. The following pattern has been updated (after a comment from Jonathan Leffler) to handle the trigraph ??/, which is interpreted as a backslash character by the preprocessor (I assume the code isn't compiled with the -trigraphs option), and to handle a backslash followed by a newline character, which allows a single logical line to be split over several physical lines:
import re

fp = open(s)
p = re.compile(r'''(?x)
    (?=["'/])            # trick to make it faster, a kind of anchor
    (?:
        "(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1"    # double-quoted string
        |
        '(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2'    # single-quoted string
        |
        (
            /(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?/|\\)\n)*.*        # single-line comment
            |
            /(?:(?:\?\?/|\\)\n)*\*                              # multiline comment
            (?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
            \*(?:(?:\?\?/|\\)\n)*/
        )
    )
    ''')
for m in p.findall(fp.read()):
    if m[2]:
        print(m[2])
These changes do not affect the pattern's efficiency, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the lookahead at the beginning of the pattern, (?=["'/]), which allows internal optimizations to quickly find the first character.
Another optimization is the use of emulated atomic groups, which reduces backtracking to the minimum and allows greedy quantifiers to be used inside repeated groups.
NB: luckily there is no heredoc syntax in C!

Python's re.findall method basically works the same way as most lexers do: it successively returns the longest match starting where the previous match finished. All that is required is to produce a disjunction of all the lexical patterns:
(<pattern 1>)|(<pattern 2>)|...|(<pattern n>)
Unlike most lexers, it doesn't require the matches to be contiguous, but that's not a significant difference since you can always just add (.) as the last pattern, in order to match all otherwise unmatched characters individually.
An important feature of re.findall is that if the regex has any groups, then only the groups will be returned. Consequently, you can exclude alternatives by simply leaving out the parentheses, or changing them to non-capturing parentheses:
(<pattern 1>)|(?:<unimportant pattern 2>)|(<pattern 3>)
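A tiny illustration of that behaviour (the pattern and input are made up for the example):
import re

# the (b+) alternative is captured, the (?:c+) alternative is not,
# so matches of the non-captured alternative show up as empty strings
print(re.findall(r'(b+)|(?:c+)', 'abcbbcc'))
# -> ['b', '', 'bb', '']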
With that in mind, let's take a look at how to tokenize C just enough to recognize comments. We need to deal with:
Single-line comments: // Comment
Multi-line comments: /* Comment */
Double-quoted string: "Might include escapes like \n"
Single-quoted character: '\t'
(See below for a few more irritating cases)
With that in mind, let's create regexen for each of the above.
Two slashes followed by anything other than a newline: //[^\n]*
This regex is tedious to explain: /[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/
Note that it uses (?:...) to avoid capturing the repeated group.
A quote, any repetition of a character other than quote and backslash, or a backslash followed by any character whatsoever. That's not a precise definition of an escape sequence, but it's good enough to detect when a " terminates the string, which is all we care about: "(?:[^"\\]|\\.)*"
The same as the double-quoted case, but with single quotes: '(?:[^'\\]|\\.)*'
Finally, the goal was to find the text of C-style comments. So we just need to avoid captures in any of the other groups. Hence:
import re

def find_comments(text):   # wrapped in a function here so the fragment runs as written
    p = re.compile('|'.join((r"(//[^\n]*)"
                             ,r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/"
                             ,'"'+r"""(?:[^"\\]|\\.)*"""+'"'
                             ,r"'(?:[^'\\]|\\.)*'")))
    return [c[2:] for c in p.findall(text) if c]
Above, I left out some obscure cases which are unlikely to arise:
In an #include <...> directive, the <...> is essentially a string. In theory, it could contain quotes or sequences which look like comments, but in practice you will never see:
#include </*This looks like a comment but it is a filename*/>
A line which ends with \ is continued on the next line; the \ and following newline character are simply removed from the input. This happens before any lexical scanning is performed, so the following is a perfectly legal comment (actually two comments):
/\
**************** Surprise! **************\
//////////////////////////////////////////
To make the above worse, the trigraph ??/ is the same as a \, and that replacement happens before the continuation handling.
/************************************//??/
**************** Surprise! ************??/
//////////////////////////////////////////
Outside of obfuscation contests, no-one actually uses trigraphs. But they're still in the standard. The easiest way to deal with both of these issues would be to prescan the string:
return [c[2:]
        for c in p.findall(text.replace('??/', '\\').replace('\\\n', ''))
        if c]
The only way to deal with the #include <...> issue, if you really cared about it, would be to add one more pattern, something like #include\s*<[^>\n]*>.
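As a quick check, feeding the question's sample file to the find_comments helper above (the function name is just the wrapper used in this answer) returns only the real comments:
sample = r'''
/*My function
is great.*/
int j = 0//hello world
void foo(){
//tricky example
cout << "This // is // not a comment\n";
}
'''
print(find_comments(sample))
# -> ['hello world', 'tricky example']
#    (the // inside the string is ignored, and the /*...*/ comment is matched
#     but dropped by the `if c` filter because it is not in the capture group)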

Related

re.match output not deterministic

I am doing a rather complex match with Python using re.match, which takes the form of (some_pattern_1)?(some_pattern_2)?...(.*)
On the other side of it I have a unit test with about one hundred examples I am checking, which are all sending requests asynchronously to my (local, development) server. The server is in Django.
I am sometimes seeing the match apparently be non-greedy (i.e. too many things end up in the last catch-all block) and the unit test fail, but I can't really reproduce it in isolation, and I don't really have an idea of what's going on.
More concretely, the relevant part of the regex is (in Python):
import re
input = "1 small shoe"
sizes = ["small", "large", "big", "huge"]
colors = ["blue", "red", "green", "yellow", "grey"]
anySize = u' |'.join(sizes)
anyColor = u' |'.join(colors)
matched_expression = re.match(
    r'\s*(?P<amount>(((\d{1,2}\.)?\d{1,3})?)\s*'
    r'(?P<size>(\b'+anySize+'\b)?)\s*'
    r'(?P<color>(\b'+anyColor+'\b)?)\s*'
    r'(?P<name>.*, input, re.UNICODE|re.IGNORECASE)
if matched_expression:
    print(matched_expression.groupdict()["amount"])
    print(matched_expression.groupdict()["size"])
    print(matched_expression.groupdict()["color"])
    print(matched_expression.groupdict()["name"])
And I am sometimes seeing this printed:
1
''
''
'small shoe'
Are there known conditions where this could happen (and am I correct to assume that the regex match is guaranteed to be fully deterministic in principle)?
Most of the string literals you're using to build your pattern are raw literals (introduced with the r prefix), which is great: the string interpreter does not give the backslashes any special meaning, but instead leaves them intact for the regex parser. However, you have unfortunately not used raw literals in every case:
r'(?P<size>(\b'+anySize+'\b)?)\s*'
# ^^^^^^^^^^ this is not a raw string literal
r'(?P<color>(\b'+anyColor+'\b)?)\s*'
# ^^^^^^^^^^ and nor is this
Consequently, the backslashes in those literals have the effect described under String and Bytes literals before the interpreted string is given to the regex compiler. Accordingly, your \b boundary anchors are replaced with ASCII backspace characters!
Either use raw string literals by prefixing them with r or else be sure to escape the backslashes they contain.
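You can see the difference directly in the interpreter (standard Python, nothing question-specific):
print(len('\b'), repr('\b'))     # 1 '\x08'  -> an ASCII backspace character
print(len(r'\b'), repr(r'\b'))   # 2 '\\b'   -> backslash + 'b', a regex word boundary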
There are however also a number of other issues with your code worth noting:
As currently written, your regex won't compile due to some syntax errors. In particular, the capture groups named amount and name are not terminated due to unbalanced brackets:
r'\s*(?P<amount>(((\d{1,2}\.)?\d{1,3})?)\s*'
# + +++ - - -
There are four opening brackets, but only three closing brackets. You probably intended to write:
r'\s*(?P<amount>(((\d{1,2}\.)?\d{1,3})?))\s*'
# ^
Similarly, r'(?P<name>.*, ... should probably be r'(?P<name>.*)', ... (note also the pattern string needs to be terminated before the argument separator).
\b boundary anchors bind more tightly than | alternation, so when placed at the same level as your joined arrays they are only bound to the first and last elements of the alternatives respectively. For example, the capture group named size is currently specified by the following pattern:
(\bsmall |large |big |huge\b)?
Which is equivalent, in terms of precedence, to:
((\bsmall )|(large )|(big )|(huge\b))?
Better instead to place the boundary anchors outside of the brackets:
r'(?P<size>\b('+anySize+r')?\b)\s*'
r'(?P<color>\b('+anyColor+r')?\b)\s*'
As shown above, the whitespace in your join expressions is likely to lead to unintended consequences: anySize and anyColor require that all but the final terms in their underlying arrays are, if present, followed by a space character (in addition to whatever matches the \s* patterns). Better to join the arrays with '|' alone, rather than ' |':
anySize = u'|'.join(sizes)
anyColor = u'|'.join(colors)
Depending on the source of the underlying arrays, and how confident you are that they do not contain any special regex patterns, you may wish to first escape the array elements.
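Putting those fixes together, a corrected sketch (using the sample input from the question; the exact group layout is only one reasonable choice) could look like this:
import re

sizes = ["small", "large", "big", "huge"]
colors = ["blue", "red", "green", "yellow", "grey"]
anySize = '|'.join(map(re.escape, sizes))     # join with '|' only, escape each term
anyColor = '|'.join(map(re.escape, colors))

matched_expression = re.match(
    r'\s*(?P<amount>((\d{1,2}\.)?\d{1,3})?)\s*'
    r'(?P<size>\b(' + anySize + r')?\b)\s*'
    r'(?P<color>\b(' + anyColor + r')?\b)\s*'
    r'(?P<name>.*)',
    "1 small shoe",
    re.UNICODE | re.IGNORECASE)

if matched_expression:
    print(matched_expression.groupdict())
    # -> {'amount': '1', 'size': 'small', 'color': '', 'name': 'shoe'}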

Regex Match Whole Multiline Comment Containing Special Word

I've been trying to design this regex but for the life of me I could not get it to not match if */ was hit before the special word.
I'm trying to match a whole multi line comment only if it contains a special word. I tried negative lookaheads/behinds but I could not figure out how to do it properly.
This is what I have so far:
(?s)(/\*.+?special.+?\*/)
Am I close or horribly off base? I tried including (?!\*/) unsuccessfully.
https://regex101.com/r/mD1nJ2/3
Edit: I had some redundant parts to the regex I removed.
You were not totally off base:
/\* # match /*
(?:(?!\*/)[\s\S])+? # match anything lazily, do not overrun */
special # match special
[\s\S]+? # match anything lazily afterwards
\*/ # match the closing */
The technique is called a tempered greedy token; see a demo on regex101.com (mind the modifiers, e.g. x for verbose mode!).
You might want to try another approach though: analyze your document, grep the comments (using e.g. BeautifulSoup) and run string functions over them (if "special" in comment ...).
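For reference, the same tempered-greedy-token pattern dropped into Python (the sample text is made up):
import re

special_comment = re.compile(r'''
    /\*                     # match /*
    (?:(?!\*/)[\s\S])+?     # match anything lazily, do not overrun */
    special                 # match special
    [\s\S]+?                # match anything lazily afterwards
    \*/                     # match the closing */
''', re.VERBOSE)

text = "/* nothing here */ int x; /* this one is special indeed */"
print(special_comment.findall(text))
# -> ['/* this one is special indeed */']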

Vim searching: avoid matches within comments

Using Vim's search capabilities, is it possible to avoid matching on a comment block/line?
As an example, I would like to match on 'image' in this python code, but only in the code, not in the comment:
# Export the image
if export_images and i + j % 1000 == 0:
    export_image(img_mask, "images/image{}.png".format(image_id))
    image_id += 1
Using regular expressions I would do something like this: /^[^#].*(image).*/gm
But it does not seem to work in vim.
You can use
/^[^#].*\zsimage\ze
The \zs and \ze signalize the start and end of a match respectively.
setting the start and end of the match: \zs \ze
Note that this will not match several "image"s on a line, just the last one.
Also, perhaps, a "negative lookahead" would be better than a negated character class at the beginning:
/^#\@!.*\zsimage\ze
  ^^^^
The #\@! is equal to (?!#) in Python.
And since look-behinds are non-fixed-width in Vim (like (?<=pattern) in Perl, but Vim allows non-fixed-width patterns), you can match all occurrences of the character sequence image with
/\(^#\@!.*\)\@<=image
And to finally skip matching image on an indented comment line, you just need to match optional (zero or more) whitespace symbol(s) at the beginning of the line:
\(^\(\s*#\)\@!.*\)\@<=image
   ^^^^^^^^^^^
This \(\s*#\)\@! is equivalent to Python (?!\s*#) (match if not followed by zero or more whitespace characters followed by a #).
This mailing list post suggests using folds:
To search only in open folds (unfolded text):
:set fdo-=search
To fold # comments, adapt this Vi and Vim post (where an autocmd for Python files is given):
set foldmethod=expr foldexpr=getline(v:lnum)=~'^\s*#'
However, folding by default works only on multiple lines. You need to enable folding of a single line, for single-line comments to be excluded:
set fml=0
After folding everything (zM, since I did not have anything else to be folded), a search for /image does not match anything in the comments.
A more generic way to ignore matches inside end-of-line comments (which does not account for the more complicated case of avoiding literal string delimiters, though that could be handled for a simple-ish case if you wanted) is:
/\v%(^.{-}\/\/.{-})@<!<%(this|that|self|other)>
Where:
/ is the ex command to execute the search (remove if using the regex as part of another command, or a vimscript expression).
\v forces the "very magic" for this regex.
\/\/ is the end-of-line comment token (escaped, so the / characters are not interpreted as the end of the regex by vim). The example above works for C/C++, JavaScript, Node.JS, etc.
(^.{-}\/\/.{-})@<! is the zero-width expression that means "there should not be any comments preceding the start of the expression following this" (see vim's documentation for \@<!). In our case, this expression is just trying to find the end-of-line comment token at any point in the line (hence the ^.{-}) and making sure that the zero-width match can end up in the first character of the positive expression that follows this one (achieved by the .{-} at the end of the parenthesised expression).
<%(this|that|self|other)> can be replaced by any regex. In this case, this shows that you can match an arbitrary expression at this point, without worrying about the comments (which are handled by the zero-width expression preceding this one).
As an example, consider the following piece of code:
some.this = 'hello'
some.this = 'hello' + this.someother;
some.this = 'hello' + this.someother; // this is a comment, and this is another
The above expression will match all the this words, except the ones inside the comment (or any other //-prefixed comments, for that matter).
(Note: links all pointing to the vim-7.0 reference documentation, which should (and does in my own testing) also work in the latest vim and nvim releases as of the time of writing)

Regex matching open or closed Python string

I'd like to find the contents of all Python strings in source code as that code is being typed. I assume the string is contained within a single line, but it might not be closed yet.
Right now I have
for m in re.finditer('''(?P<open>(?:""")|"|(?:''\')|')(?:((?P<closed>.*?)(?P=open))|(?P<unclosed>.*))''', 'as"df'):
    i = 3 if m.group(3) else 4
    print(m.group(i))
But I'd love to have a predictable match group to search on. Something like
re.finditer('''(?P<open>(?:""")|"|(?:''\')|')(.*?)(?P=open)''', line)
is nicer because the contents of the string literal will always be in match group (but this one doesn't match strings that aren't yet closed).
Edit: I'm fine with multiline matches; I just meant to make the problem simpler by excluding them from the input.
You can try this:
(?s)('''|"""|'|")((?:(?=([^"'\\]+|\\.|(?!\1)["']))\3)*)\1?
The quote is captured in group 1, a backreference is used at the end to close the string \1.
[^"'\\]+ | \\. | (?!\1)["'] describes allowed content:
[^"'\\]+ # all that is not a quote or a backslash
\\. # an escaped character
(?!\1)["'] # a quote that is not the captured quote
Then, to repeat these elements without risking catastrophic backtracking, I emulate an atomic group with this trick: (?>subpattern)* => (?:(?=(subpattern))\1)*
Note: If you want to forbid multiline matches, you only need to change the allowed content to
[^"'\r\n\\]+ | \\. | (?!\1)["'] and to remove the (?s) modifier.
[EDIT]
If you want to match a backslash at the end of the string (example: text = r'''abc def ghi\), you need to change the pattern to:
multiline mode:
(?m)('''|"""|'|")((?:(?=([^"'\r\n\\]+|\\(?:.|$)|(?!\1)["']))\3)*)\1?
singleline mode:
(?s)('''|"""|'|")((?:(?=([^"'\\]+|\\(?:.|$)|(?!\1)["']))\3)*)\1?
How about this:
^("""|'''|"|')((?!\\").*?)(?:(?<!\\)\1$|$)
I'm not exactly sure as to the behaviour you would like when it comes down to strings that are syntactically wrong, when you have two double quotes before you get to three (at the start and end). But from what I understand this should do the job.
Use the second match group in your code.

What does the "s!" operator in Perl do?

I have this Perl snippet from a script that I am translating into Python. I have no idea what the "s!" operator is doing; some sort of regex substitution. Unfortunately searching Google or Stackoverflow for operators like that doesn't yield many helpful results.
$var =~ s!<foo>.+?</foo>!!;
$var =~ s!;!/!g;
What is each line doing? I'd like to know in case I run into this operator again.
And, what would equivalent statements in Python be?
s!foo!bar! is the same as the more common s/foo/bar/, except that foo and bar can contain unescaped slashes without causing problems. What it does is replace the first occurrence of the regex foo with bar. The version with g replaces all occurrences.
It's doing exactly the same as $var =~ s///, i.e. performing a search and replace within the $var variable.
In Perl you can define the delimiting character following the s. Why? So, for example, if you're matching '/', you can specify another delimiting character ('!' in this case) and not have to backslash-escape the character you're matching. Otherwise you'd end up with (say)
s/;/\//g;
which is a little more confusing.
Perlre has more info on this.
Perl lets you choose the delimiter for many of its constructs. This makes it easier to see what is going on in expressions like
$str =~ s{/foo/bar/baz/}{/quux/};
As you can see though, not all delimiters have the same effects. Bracketing characters (<>, [], {}, and ()) use different characters for the beginning and ending. And ?, when used as a delimiter to a regex, causes the regexes to match only once between calls to the reset() operator.
You may find it helpful to read perldoc perlop (in particular the sections on m/PATTERN/msixpogc, ?PATTERN?, and s/PATTERN/REPLACEMENT/msixpogce).
s! is syntactic sugar for the 'proper' s/// operator. Basically, you can substitute whatever delimiter you want instead of the '/'s.
As to what each line is doing, the first line is matching occurrences of the regex <foo>.+?</foo> and replacing the whole lot with nothing. The second is matching the regex ; and replacing it with /.
s/// is the substitute operator. It takes a regular expression and a substitution string.
s/regex/replace string/;
It supports most (all?) of the normal regular expression switches, which are used in the normal way (by appending them to the end of the operator).
s is the substitution operator. Usually it is in the form of s/foo/bar/, but you can replace the // separator characters with some other character, like !. Using other separator characters may make working with things like paths a lot easier, since you don't need to escape the path separators.
See manual page for further info.
You can find similar functionality for Python in the re module.
s is the substitution operator. Normally this uses '/' for the delimiter:
s/foo/bar/
, but this is not required: a number of other characters can be used as delimiters instead. In this case, '!' has been used as the delimiter, presumably to avoid the need to escape the '/' characters in the actual text to be substituted.
In your specific case, the first line removes text matching '<foo>.+?</foo>'; i.e. it removes the first 'foo' element together with its contents.
The second line replaces all ';' characters with '/' characters, globally (all occurrences).
The Python equivalent code uses the re module:
f = re.sub(searchregx, replacement_str, line)
And the Python equivalent is to use the re module.
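For completeness, a rough Python translation of the two Perl lines (the sample value of var is made up):
import re

var = "<foo>bar</foo>;some;text"

# $var =~ s!<foo>.+?</foo>!!;   -- remove the first <foo>...</foo>
var = re.sub(r'<foo>.+?</foo>', '', var, count=1)

# $var =~ s!;!/!g;              -- replace every ';' with '/'
var = re.sub(r';', '/', var)    # var.replace(';', '/') would do as well

print(var)
# -> /some/text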
