I'm trying to do a non-greedy negative match, and I need to capture it as well. I'm using these flags in Python, re.DOTALL | re.LOCALE | re.MULTILINE, to do multi-line cleanup of some text-file 'databases' in which each field begins on a new line with a backslash. Each record begins with an \lx field.
\lx foo
\ps n
\nt note 1
\ps v
\nt note
\ge happy
\nt note 2
\ge lonely
\nt note 3
\ge lonely
\dt 19/Dec/2011
\lx bar
...
I'm trying to ensure that each \ge field has a \ps field somewhere above it within its record, one for one. Currently, one \ps is often followed by several \ge and thus needs to be copied down, as with the two lonely \ge above.
Here's most of the needed logic: after any \ps field, but before encountering another \ps or \lx, find a \ge, then find another \ge. Capture everything so that the \ps field can be copied down to just before the second \ge.
And here's my non-functional attempt. Replace this:
^(\\ps\b.*?\n)((?!^\\(ps|lx)*?)^(\\ge.*?\n)((?!^\\ps)*?)^(\\ge.*?\n)
with this:
\1\2\3\4\1\5
I'm getting a memory error even on a tiny file (34 lines long). Of course, even if this worked, I would have to run it multiple times, since it's only trying to handle a second \ge, and not a third or fourth one. So any ideas in that regard would interest me as well.
UPDATE: Alan Moore's solution worked great, although there were cases that required a little tweaking. Sadly, I had to turn off DOTALL, since otherwise I couldn't prevent the first .* from including subsequent \ps fields--even with the non-greedy .*? form. But I was delighted to learn about the (?s) modifier just now at regular-expressions.info. This allowed me to turn off DOTALL in general but still use it in the other regexes that it is essential for.
Here is the suggested regex, condensed down to the one-line format I need:
^(?P<PS_BLOCK>(?P<PS_LINE>\\ps.*\n)(?:(?!\\(?:ps|lx|ge)).*\n)*\\ge.*\n)(?P<GE_BLOCK>(?:(?!\\(?:ps|lx|ge)).*\n)*\\ge.*\n)
That worked, but when I modified the example above, it inserted the \ps above "note 2". It was also treating \lxs and \ge2 the same as \lx and \ge (it needed a few \b). So, I went with a slightly tweaked version:
^(?P<PS_BLOCK>(?P<PS_LINE>\\ps\b.*\n)(?:(?!\\(?:ps|lx|ge)\b).*\n)*\\ge\b.*\n)(?P<AFTER_GE1>(?:(?!\\(?:ps|lx|ge)\b).*\n)*)(?P<GE2_LINE>\\ge\b.*\n)
and this replacement string:
\g<PS_BLOCK>\g<AFTER_GE1>\g<PS_LINE>\g<GE2_LINE>
Thanks again!
You're getting the memory error because you used the DOTALL flag. If your data is really formatted the way you showed it, you don't need that flag anyway; the default behavior is exactly what you want. You don't need the non-greedy modifier (?), either.
Try this regex:
prog = re.compile(r"""
^
(?P<PS_BLOCK>
(?P<PS_LINE>\\ps.*\n)
(?: # Zero or more lines that
(?!\\(?:ps|lx|ge)) # don't start with
.*\n # '\ps', '\lx', or '\ge'...
)*
\\ge.*\n # ...followed by a '\ge' line.
)
(?P<GE_BLOCK>
(?: # Again, zero or more lines
(?!\\(?:ps|lx|ge)) # that don't start with
.*\n # '\ps', '\lx', or '\ge'...
)*
\\ge.*\n # ...followed by a '\ge' line.
)
""", re.MULTILINE | re.VERBOSE)
The replacement string would be:
r'\g<PS_BLOCK>\g<PS_LINE>\g<GE_BLOCK>'
You'll still have to do multiple passes. If Python supported \G, that wouldn't be necessary, but you can use subn and check the number_of_subs_made return value.
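For instance, a small driver loop using subn (a sketch; prog is the compiled regex above and repl is the replacement string):

import re

repl = r'\g<PS_BLOCK>\g<PS_LINE>\g<GE_BLOCK>'

def copy_ps_down(text):
    # Re-apply the substitution until subn reports zero substitutions made.
    while True:
        text, n = prog.subn(repl, text)
        if n == 0:
            return text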
Any time you're having problems with regexen and tell yourself "I would have to run it multiple times...", it's a clear sign that you need to write a parser :-)
The language seems to be pretty regular, so a parser should be easy to write, perhaps starting with something as easy as:
import datetime

def parse_line(line):
    kind, value = line.split(' ', 1)  # split on the first space
    kind = kind[1:]                   # remove the \
    parsed_value = globals().get('parse_' + kind, lambda x: x)(value)
    return (kind, parsed_value)

def parse_dt(value):
    # create a datetime.date() from e.g. "19/Dec/2011"
    return datetime.datetime.strptime(value.strip(), '%d/%b/%Y').date()
It's maybe a little too cute to use globals() to write a state machine, but it saves a ton of boilerplate code... :-)
Convert your input into a list of tuples:
records = [parse_line(line) for line in open("myfile.dta")]
Figuring out if there is always a ("ps", ..) tuple before a ("ge", ..) tuple should then be pretty easy -- e.g. by first noting where all the lx tuples are...
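For instance, a minimal sketch of that check (the exact one-for-one bookkeeping is left out; the field names come from the sample above):

def check_records(records):
    # An 'lx' tuple starts a new record; require a ('ps', ...) tuple
    # somewhere before each ('ge', ...) tuple within the record.
    ps_seen = False
    for kind, value in records:
        if kind == 'lx':
            ps_seen = False   # new record: reset
        elif kind == 'ps':
            ps_seen = True
        elif kind == 'ge' and not ps_seen:
            print('ge without a preceding ps:', value)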
A small project I got assigned is supposed to extract website URLs from given text. Here's what the most relevant portion of it looks like:
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does do its job properly, but I noticed that it also includes the ','s and '.'s in the URL strings it prints. So my first question is: how do I make it exclude any punctuation symbols at the end of the string it detects?
My second question refers to the title itself (finally), but doesn't really seem to affect this particular program I'm working on: do character classes (in this case [a-zA-Z0-9.%+-\/_]+) count as groups (group[3] in this case)?
Thanks in advance.
To exclude some symbols at the end of a string you can use a negative lookbehind. For example, to disallow '.' and ',':
.*(?<![.,])$
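For instance, combined with a crude URL pattern (the pattern itself is just illustrative):

import re

m = re.search(r'https?://\S*(?<![.,])', 'See https://example.com/page, okay?')
print(m.group())   # -> https://example.com/page (the trailing comma is dropped)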
Answering in reverse:
No, character classes are just shorthand for bracketed text. They don't provide groups in the same way that surrounding with parentheses would. They only allow the regular expression engine to select the specified characters -- nothing more, nothing less.
With regards to finding the comma and dot: actually, I see the problem here, though the below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+. The - character has special meaning: it matches everything between the two surrounding characters -- by ASCII code. So [A-a] is a valid range. It includes A-Z, but also a bunch of other characters that aren't A-Z. If you want to include - in the class, it needs to be the last character: [a-zA-Z0-9.%+\\/_-]+ should work.
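A quick demonstration of the difference (the test strings are arbitrary):

import re

# '-' between '+' and '\\' defines a range by character code (053-134 octal),
# which happens to include 'A'-'Z', the digits, and ',':
print(re.findall(r'[+-\\]', 'A-Z'))   # -> ['A', '-', 'Z']

# With '-' placed last it is a literal dash:
print(re.findall(r'[+\\-]', 'a-b'))   # -> ['-']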
For the comma: it isn't listed explicitly, but note that the +-\\ range just described (octal 053-134) does include ',' (octal 054), which is how commas sneak into your matches. In general though, you'll just want to add more groups/more conditions.
First, break apart the URL into the specific groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses alphanumeric characters and ends -- specifically -- with .com (note the \., otherwise it'll capture any single character followed by com/).
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the URL to end with, say, a dot, then you could do something like [A-Za-z0-9] -- note the lack of a dot here, plus its length: it's only a single character. This will change the rest of your regex, so you need to think about that.
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
web_url_regex = re.compile(
    r'(http://|https://)'        # Capture the scheme name
    r'([a-zA-Z0-9.%+-\\/_]+)'    # Everything else, apparently
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)
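Putting those pieces together, here's a rough sketch of the scheme/domain/endpoint idea (the group names and character sets are illustrative, not a complete URL grammar):

import re

url_re = re.compile(
    r'(?P<scheme>https?)://'            # scheme
    r'(?P<domain>[a-zA-Z0-9.-]+)'       # domain
    r'(?P<endpoint>/\S*[A-Za-z0-9/])?'  # optional path, not ending in punctuation
)
m = url_re.search('Visit https://example.com/a/b, then leave.')
print(m.group('scheme'), m.group('domain'), m.group('endpoint'))
# -> https example.com /a/b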
To answer the second question first, no a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see footnote below re why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regex, which is the same as the previous character class except that it does not include . (other punctuation is now already excluded). So the matched pattern cannot end in '.'. The + (one or more) on the previous character class is now reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''',re.VERBOSE)
str = "... at http://www.google.com/. It says"
m = re.search(webURLregex, str)
if m:
print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the second - does not appear to be intended to define a character range is based on the fact that, if it was, such a range would run from 053 to 134 (octal), which would also include the digits and the uppercase letters, making the A-Z and 0-9 in the class redundant.
I have a config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // these 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract the value of a certain param, for example description's value. How do I do this in Python 3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by parenthesis. Inside the parenthesis we say match anything (the dot) as many times as you can find (the asterisk) in a non greedy manner (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead, starting with (?=, which requires that the text ahead match the pattern inside it, without consuming it.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar, says // (in parentheses to make it an atomic unit for the vertical-bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
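For illustration, here's what configparser usage looks like on an INI-style version of the data (the PART { ... } format above would need converting to INI first; this snippet assumes that's been done):

import configparser

cp = configparser.ConfigParser()
cp.read_string("""
[PART]
title = Some Title
description = Some description here.
""")
print(cp['PART']['description'])   # -> Some description here.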
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
    for line in lines:
        line = line.lstrip()
        if line.startswith(key + " ="):
            return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
This is a pretty simple regex: you just need a positive lookbehind and, optionally, something to remove the comments (do this by appending ?(//)? to the regex).
r"(?<=description = ).*"
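For instance (txt holds the file contents, as in the other answer):

import re

m = re.search(r"(?<=description = ).*", txt)
if m:
    print(m.group())   # everything after 'description = ', comment included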
Using Vim's searching capabilities, is it possible to avoid matching in a comment block/line?
As an example, I would like to match 'image' in this Python code, but only in the code, not in the comment:
# Export the image
if export_images and i + j % 1000 == 0:
    export_image(img_mask, "images/image{}.png".format(image_id))
    image_id += 1
Using regular expressions I would do something like this: /^[^#].*(image).*/gm
But it does not seem to work in vim.
You can use
/^[^#].*\zsimage\ze
The \zs and \ze signal the start and end of the match, respectively.
setting the start and end of the match: \zs \ze
Note that this will not match several "image"s on a line, just the last one.
Also, perhaps, a "negative lookahead" would be better than a negated character class at the beginning:
/^#\@!.*\zsimage\ze
  ^^^^
The #\@! is equal to (?!#) in Python.
And since look-behinds can be non-fixed-width in Vim (like (?<=pattern) in Perl, except that Vim allows non-fixed-width patterns), you can match all occurrences of the character sequence image with
/\(^#\@!.*\)\@<=image
And to finally skip matching image on an indented comment line, you just need to match optional (zero or more) whitespace symbol(s) at the beginning of the line:
\(^\(\s*#\)\@!.*\)\@<=image
   ^^^^^^^^^^^
This \(\s*#\)\@! is equivalent to Python (?!\s*#) (match only if not followed by zero or more whitespace characters and a #).
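As a cross-check, the same logic in Python (the code sample is the one from the question):

import re

code = '''# Export the image
export_image(img_mask, "images/image{}.png".format(image_id))'''

# (?m)^(?!\s*#) restricts the match to lines that do not start with
# optional whitespace plus '#', mirroring the Vim pattern above.
print(re.findall(r'(?m)^(?!\s*#).*?(image)', code))   # -> ['image']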
This mailing list post suggests using folds:
To search only in open folds (unfolded text):
:set fdo-=search
To fold # comments, adapting this Vi and Vim post (where an autocmd for Python files is given):
set foldmethod=expr foldexpr=getline(v:lnum)=~'^\s*#'
However, folding by default works only on multiple lines. You need to enable folding of a single line, for single-line comments to be excluded:
set fml=0
After folding everything (zM, since I did not have anything else to be folded), a search for /image does not match anything in the comments.
A more generic way to ignore matches inside end-of-line comment markers (which does not account for the more complicated case of avoiding literal string delimiters, though that could be handled in a simple-ish case if you wanted) is:
/\v%(^.{-}\/\/.{-})@<!<%(this|that|self|other)>
Where:
/ is the ex command to execute the search (remove if using the regex as part of another command, or a vimscript expression).
\v forces the "very magic" for this regex.
\/\/ is the end-of-line comment token (escaped, so the / characters are not interpreted as the end of the regex by vim). The example above works for C/C++, JavaScript, Node.JS, etc.
(^.{-}\/\/.{-})@<! is the zero-width expression that means "there should not be any comments preceding the start of the expression following this" (see vim's documentation for \@<!). In our case, this expression is just trying to find the end-of-line comment token at any point in the line (hence the ^.{-}) and making sure that the zero-width match can end up in the first character of the positive expression that follows this one (achieved by the .{-} at the end of the parenthesised expression).
<%(this|that|self|other)> can be replaced by any regex. In this case, this shows that you can match an arbitrary expression at this point, without worrying about the comments (which are handled by the zero-width expression preceding this one).
As an example, consider the following piece of code:
some.this = 'hello'
some.this = 'hello' + this.someother;
some.this = 'hello' + this.someother; // this is a comment, and this is another
The above expression will match all the this words, except the ones inside the comment (or any other //-prefixed comments, for that matter).
(Note: links all pointing to the vim-7.0 reference documentation, which should (and does in my own testing) also work in the latest vim and nvim releases as of the time of writing)
I'm trying to find C-style comments in a C file, but I'm having trouble when there happens to be // inside of quotations. This is the file:
/*My function
is great.*/
int j = 0//hello world
void foo(){
//tricky example
cout << "This // is // not a comment\n";
}
It will match that cout. This is what I have so far (I can match the /* */ comments already):
fp = open(s)
p = re.compile(r'//(.+)')
txt = p.findall(fp.read())
print (txt)
The first step is to identify cases where // or /* must not be interpreted as the beginning of a comment substring, for example when they are inside a string (between quotes). To avoid content between quotes (or other things), the trick is to put them in a capture group and to insert a backreference in the replacement pattern:
pattern:
(
"(?:[^"\\]|\\[\s\S])*"
|
'(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/
replacement:
\1
Since quoted parts are searched first, each time you find // or /*...*/ you can be sure that you are not inside a string.
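In Python (3.5+, where an unmatched group in the replacement template becomes an empty string), the trick looks like this (the sample input is an assumption):

import re

pattern = r'''("(?:[^"\\]|\\[\s\S])*"|'(?:[^'\\]|\\[\s\S])*')|//.*|/\*(?:[^*]|\*(?!/))*\*/'''
code = 'cout << "This // is // not a comment\\n"; // but this is'

# Strings are captured and put back via \1; comments match the other
# alternatives, where group 1 is unmatched, so they are replaced by ''.
print(re.sub(pattern, r'\1', code))
# -> cout << "This // is // not a comment\n";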
Note that the pattern is deliberately inefficient (due to the (A|B)* subpatterns) to make it easier to understand. To make it more efficient you can rewrite it like this:
("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/
(?=(something+))\1 is only a way to emulate an atomic group (?>something+)
So, if you only want to find comments (and not remove them), the most handy approach is to put the comments part of the pattern in a capture group and to test whether it isn't empty. The following pattern has been updated (after Jonathan Leffler's comment) to handle the trigraph ??/ that is interpreted as a backslash character by the preprocessor (I assume that the code isn't written for the -trigraphs option) and to handle a backslash followed by a newline character, which allows a single logical line to be formatted across several physical lines:
fp = open(s)
p = re.compile(r'''(?x)
    (?=["'/])    # trick to make it faster, a kind of anchor
    (?:
        "(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1"    # double quotes string
        |
        '(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2'    # single quotes string
        |
        (
            /(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?/|\\)\n)*.*        # single line comment
            |
            /(?:(?:\?\?/|\\)\n)*\*                              # multiline comment
            (?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
            \*(?:(?:\?\?/|\\)\n)*/
        )
    )
    ''')

for m in p.findall(fp.read()):
    if m[2]:
        print(m[2])
These changes do not affect the pattern's efficiency, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the presence of a lookahead at the beginning of the pattern, (?=["'/]), which allows internal optimizations to quickly find the first character.
Another optimization is the use of emulated atomic groups, which reduces backtracking to a minimum and allows the use of greedy quantifiers inside repeated groups.
NB: luckily, there is no heredoc syntax in C!
Python's re.findall method basically works the same way most lexers do: it successively returns the next match starting where the previous match finished. All that is required is to produce a disjunction of all the lexical patterns:
(<pattern 1>)|(<pattern 2>)|...|(<pattern n>)
Unlike most lexers, it doesn't require the matches to be contiguous, but that's not a significant difference since you can always just add (.) as the last pattern, in order to match all otherwise unmatched characters individually.
An important feature of re.findall is that if the regex has any groups, then only the groups will be returned. Consequently, you can exclude alternatives by simply leaving out the parentheses, or changing them to non-capturing parentheses:
(<pattern 1>)|(?:<unimportant pattern 2>)|(<pattern 3>)
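A minimal illustration of that behavior:

import re

# One capturing group: findall returns only that group, and matches that
# came from the non-capturing alternative show up as empty strings.
hits = re.findall(r'(//[^\n]*)|(?:"[^"]*")', 'x = "a//b"; // real comment')
print(hits)                    # -> ['', '// real comment']
print([c for c in hits if c])  # -> ['// real comment']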
With that in mind, let's take a look at how to tokenize C just enough to recognize comments. We need to deal with:
Single-line comments: // Comment
Multi-line comments: /* Comment */
Double-quoted string: "Might include escapes like \n"
Single-quoted character: '\t'
(See below for a few more irritating cases)
With that in mind, let's create regexen for each of the above.
1. Two slashes followed by anything other than a newline: //[^\n]*
2. This regex is tedious to explain: /[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/
Note that it uses (?:...) to avoid capturing the repeated group.
3. A quote, then any repetition of either a character other than quote and backslash, or a backslash followed by any character whatsoever. That's not a precise definition of an escape sequence, but it's good enough to detect when a " terminates the string, which is all we care about: "(?:[^"\\]|\\.)*"
4. The same as (3) but with single quotes: '(?:[^'\\]|\\.)*'
Finally, the goal was to find the text of C-style comments. So we just need to avoid captures in any of the other groups. Hence:
p = re.compile('|'.join((r"(//[^\n])*"
,r"/*[^*]*[*]+(?:[^/*][^*]*[*]+)*/"
,'"'+r"""(?:[^"\\]|\\.)*"""+'"'
,r"'(?:[^'\\]|\\.)*'")))
return [c[2:] for c in p.findall(text) if c]
Above, I left out some obscure cases which are unlikely to arise:
In an #include <...> directive, the <...> is essentially a string. In theory, it could contain quotes or sequences which look like comments, but in practice you will never see:
#include </*This looks like a comment but it is a filename*/>
A line which ends with \ is continued on the next line; the \ and following newline character are simply removed from the input. This happens before any lexical scanning is performed, so the following is a perfectly legal comment (actually two comments):
/\
**************** Surprise! **************\
//////////////////////////////////////////
To make the above worse, the trigraph ??/ is the same as a \, and that replacement happens before the continuation handling.
/************************************//??/
**************** Surprise! ************??/
//////////////////////////////////////////
Outside of obfuscation contests, no-one actually uses trigraphs. But they're still in the standard. The easiest way to deal with both of these issues would be to prescan the string:
return [c[2:]
        for c in p.findall(text.replace('??/','\\').replace('\\\n',''))
        if c]
The only way to deal with the #include <...> issue, if you really cared about it, would be to add one more pattern, something like #include\s*<[^>\n]*>.
I'm working with a (Python-flavored) regular expression to recognize common and idiosyncratic forms and abbreviations of scripture references. Given the following verbose snippet:
>>> cp = re.compile(ur"""
(?:(
# Numbered books
(?:(?:Third|Thir|Thi|III|3rd|Th|3)\ ?
(?:John|Joh|Jhn|Jo|Jn|Jn|J))
# Other books
|Thessalonians|John|Th|Jn)\ ?
# Lookahead for numbers or punctuation
(?=[\d:., ]))
|
# Do the same check, this time at the end of the string.
(
(?:(?:Third|Thir|Thi|III|3rd|Th|3)\ ?
(?:John|Joh|Jhn|Jo|Jn|Jn|J))
|Thessalonians|John|Th|Jn)\.?$
""", re.IGNORECASE | re.VERBOSE)
>>> cp.match("Third John").group()
'Third John'
>>> cp.match("Th Jn").group()
'Th'
>>> cp.match("Th Jn ").group()
'Th Jn'
The intention of this snippet is to match various forms of "Third John", as well as forms of "Thessalonians" and "John" by themselves. In most cases this works fine, but it does not match "Th Jn" (or "Th John"), rather matching "Th" by itself.
I've ordered the appearance of each abbreviation in the expression from longest to shortest expressly to avoid a situation like this, relying on a regular expression's typically greedy behavior. But the positive lookahead assertion seems to be short-circuiting this order, picking the shortest match instead of the greediest match.
Of course, removing the lookahead assertion makes this case work, but breaks a bunch of other tests. How might I go about fixing this?
I've given up after a brief attempt to follow what _sre.so is doing in this case (too complicated!), but a "blind fix" I tried seemed to work -- switch to a negative lookahead assertion for the complementary character set...:
cp = re.compile(ur"""
(?:(
# Numbered books
(?:(?:Third|Thir|Thi|III|3rd|Th|3)\ ?
(?:John|Joh|Jhn|Jo|Jn|Jn|J))
# Other books
|Thessalonians|John|Th|Jn)\ ?
# Lookahead for numbers or punctuation
(?![^\d:., ]))
|
etc. I.e. I changed the original (?=[\d:., ])) positive lookahead into a "double negation" form (negative lookahead for complement) (?![^\d:., ])) and this seems to remove the perturbation. Does this work correctly for you?
I think it's an implementation anomaly in this corner case of _sre.so -- it might be interesting to see what other RE engines do in these two cases, just as a sanity check.
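For what it's worth, a reduced reproduction (the pattern is simplified from the original) shows the difference at end-of-string:

import re

pos = re.compile(r'(?:Th Jn|Th|Jn)(?=[\d:., ])|(?:Th Jn|Th|Jn)$')
neg = re.compile(r'(?:Th Jn|Th|Jn)(?![^\d:., ])|(?:Th Jn|Th|Jn)$')

# The positive lookahead needs an actual following character, so at the end
# of the string the engine backtracks to the shorter 'Th' alternative:
print(pos.match('Th Jn').group())   # -> Th
# The negative lookahead also succeeds when there is no character at all:
print(neg.match('Th Jn').group())   # -> Th Jn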
The lookahead isn't really short circuiting anything. The regex is only greedy up to a point. It'll prefer a match in your first big block because it doesn't want to cross that "|" boundary to the second part of the regex and have to check that as well.
Since the whole string doesn't match the first big block (because the lookahead says it needs to be followed by a particular character rather than the end of the line), it just matches the "Th" from the "Thessalonians" group, and the lookahead sees a space following "Th" in "Th Jn", so it considers this a valid match.
What you'll probably want to do is move the "|Thessalonians|John|Th|Jn)\ ?" group out to another large "|" block. Check your two-word books at the beginning of text OR at the end of text OR check for one-word books in a third group.
Hope this explanation made sense.
Another alternate solution I discovered while asking the question: switch the order of the blocks, putting the end-of-line check first, then the lookahead assertion last. However, I prefer Alex's double negative solution, and have implemented that.